10 October 2015

Week 3: Initial Predictions

Dimensionality Reduction Process

t-SNE Review

Approximation algorithm only relies on distance vectors, has linearithmic time and linear space requirements. Only downside is that it’s a black box, so we can’t update our dimensionality reduction based on the results of our predictions.

To give a brief glance into the paper (see last post for the link), here’s the definition of the “similarity distribution” in high-dimensional space.

from, pij

As you can expect, the KL divergence, which we’re trying to optimize, will have a quadratic number of terms since there’s \(O(N^2)\) such \(p_{ij}\).

Thus, for faster computation, we need a smaller approximate representation, which relies on neighborhoods constructed from nearest neighbors \(\mathcal{N}_i\).

from, pij approx

Note for completeness the low-dimensional similarity is given by:

to, qij exact

This is the exact form of the low-dimensional Student-t distribution. We never really compute it explicitly, since we don’t actually care what the low-dimensionality similarity distribution \(Q\) is, but rather only are concerned with the representation of the KL divergence - or, more precisely, its gradient, which we are optimizing with the projected values \(\textbf{y}_i\).

The quadratic term in the gradient is interestingly similar enough to an \(n\)-body simulation problem that we can use the Barnes-Hut approximation problem to approximate it.

Note the above step implies the use of a \(2^d\)-tree, where \(d\) is the dimension we’re reducing to, so there’s a serious cap on \(d\).

t-SNE Results


Double-encoded MNIST (from paper, had PCA run initially to 50 dims for speed):

t-sne mnist

Byte-encoded MNIST (t-SNE only in my fork), 10000 imgs x (28x28) pixels trains in 164s:

t-sne out


News data reduction. 2 days’ worth of data - 326604 x 14 data matrix, with categories. Took <2 hrs. Just two instantiations for parameters, perplexity \(u = 30\) for the high-dimensional distribution and cutoff \(\theta = 0.5\) were used. Note the higher default perplexity causes the typical set to be larger, which in turn results in high standard deviations in the high-dimensional distribution. When encoded to the low-dimensional distribution, this means that all the points look like they’re all just as similar to one another (because the high-dimensional kernels all look the same). In turn we have a tendency towards equidistance in the low dimensions, forcing the data to aggregate in a circular blob.

t-sne out

Nonetheless, there are some definitive clusters already. The extracted columns used were (with additional preprocessing):

{"SQLDATE", "EventCode", "QuadClass", "GoldsteinScale", "Actor1Name",
"Actor2Name", "IsRootEvent", "NumMentions", "NumSources",
"NumArticles", "AvgTone" , "Actor1Geo_FullName", 
"Actor2Geo_FullName", "URL"}

HMM Results

Clustering data points into: up trend / down trend Feeding only the actual prices to the HMM is not sufficient, because the trends are not very clear in commodities (by opposition to stocks for example) => erratic predictions. We remedy to this problem by adding the moving average as input so that the model is more stable.

XAU hmm out XAG hmm out

Train data in different clusters independently using a linear model.