# Week 3: Initial Predictions

## Dimensionality Reduction Process

#### t-SNE Review

The approximation algorithm relies only on pairwise distances, and has linearithmic time and linear space requirements. Its only downside is that it's a black box, so we can't update our dimensionality reduction based on the results of our predictions.

To give a brief glance into the paper (see last post for the link), here’s the definition of the “similarity distribution” in high-dimensional space.
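With \(N\) points \(\textbf{x}_i\), and bandwidths \(\sigma_i\) set by the perplexity, the conditional similarities are symmetrized into a joint distribution:

\[ p_{j|i} = \frac{\exp(-\|\textbf{x}_i - \textbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\textbf{x}_i - \textbf{x}_k\|^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \]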

As you might expect, the KL divergence, which we're trying to optimize, will have a quadratic number of terms, since there are \(O(N^2)\) such \(p_{ij}\).

Thus, for faster computation, we need a smaller approximate representation, which relies on neighborhoods constructed from nearest neighbors \(\mathcal{N}_i\).
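In the paper, \(\mathcal{N}_i\) is taken to be the \(\lfloor 3u \rfloor\) nearest neighbors of \(\textbf{x}_i\) (with \(u\) the perplexity), and the conditionals are simply set to zero outside the neighborhood:

\[ p_{j|i} = \begin{cases} \dfrac{\exp(-\|\textbf{x}_i - \textbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \in \mathcal{N}_i} \exp(-\|\textbf{x}_i - \textbf{x}_k\|^2 / 2\sigma_i^2)} & \text{if } j \in \mathcal{N}_i \\ 0 & \text{otherwise} \end{cases} \]

This makes \(P\) sparse, with only \(O(uN)\) nonzero entries.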

Note for completeness the low-dimensional similarity is given by:
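\[ q_{ij} = \frac{(1 + \|\textbf{y}_i - \textbf{y}_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|\textbf{y}_k - \textbf{y}_l\|^2)^{-1}} \]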

This is the exact form of the low-dimensional Student-t similarity. We never really compute it explicitly, since we don't actually care what the low-dimensional similarity distribution \(Q\) is; we only care about the representation of the KL divergence - or, more precisely, its gradient, which we optimize over the projected values \(\textbf{y}_i\).
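Writing \(Z = \sum_{k \neq l} (1 + \|\textbf{y}_k - \textbf{y}_l\|^2)^{-1}\) for the Student-t normalizer, the gradient splits into an attractive and a repulsive part:

\[ \frac{\partial \mathrm{KL}(P \,\|\, Q)}{\partial \textbf{y}_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})\, q_{ij} Z\, (\textbf{y}_i - \textbf{y}_j) = 4 \Big( \underbrace{\sum_{j \neq i} p_{ij}\, q_{ij} Z\, (\textbf{y}_i - \textbf{y}_j)}_{\text{attractive, sparse}} - \underbrace{\sum_{j \neq i} q_{ij}^2 Z\, (\textbf{y}_i - \textbf{y}_j)}_{\text{repulsive, } O(N^2)} \Big) \]

The attractive term only runs over the \(O(uN)\) nonzero \(p_{ij}\); the repulsive term is the expensive one.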

The quadratic term in the gradient is, interestingly, similar enough to an \(n\)-body simulation problem that we can use the Barnes-Hut approximation to compute it in \(O(N \log N)\).

Note that the above step implies the use of a \(2^d\)-tree (a quadtree for \(d = 2\), an octree for \(d = 3\)), where \(d\) is the dimension we're reducing to, so there's a serious cap on \(d\).
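As a concrete sketch (assuming scikit-learn, whose `TSNE` implements this same Barnes-Hut scheme): `angle` is the cutoff \(\theta\), and `method='barnes_hut'` only supports `n_components <= 3`, reflecting exactly the \(2^d\)-tree cap above.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))  # e.g. MNIST after PCA down to 50 dims

# perplexity sets the sigma_i of the high-dimensional kernels;
# angle is the Barnes-Hut cutoff theta.
tsne = TSNE(n_components=2, perplexity=30, method="barnes_hut",
            angle=0.5, init="pca", random_state=0)
Y = tsne.fit_transform(X)  # low-dimensional embedding, one row per point
```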

#### t-SNE Results

**MNIST**

Double-encoded MNIST (from the paper: PCA was run first, down to 50 dims, for speed):

Byte-encoded MNIST (t-SNE only, in my fork), `10000 imgs x (28x28) pixels`, trains in 164s:

**GDELT**

News data reduction on 2 days' worth of data: a `326604 x 14` data matrix, with categories. Took <2 hrs. Just two parameters were instantiated: perplexity \(u = 30\) for the high-dimensional distribution, and cutoff \(\theta = 0.5\). Note that the higher default perplexity makes each point's typical set larger, which in turn results in high standard deviations \(\sigma_i\) in the high-dimensional distribution. When encoded into the low-dimensional distribution, this means all the points look just about as similar to one another (because the high-dimensional kernels all look the same). In turn, we get a tendency towards equidistance in the low dimensions, forcing the data to aggregate in a circular blob.
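The perplexity-to-\(\sigma_i\) mapping mentioned above is just a per-point binary search: pick the precision \(\beta_i = 1/2\sigma_i^2\) whose conditional distribution has Shannon entropy \(\log u\). A minimal numpy sketch (the helper name is mine):

```python
import numpy as np

def sigma_for_perplexity(sq_dists, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search beta = 1/(2 sigma^2) over one point's squared
    distances so that the entropy of p_{j|i} matches log(perplexity)."""
    target_entropy = np.log(target_perplexity)
    beta_lo, beta_hi = 0.0, np.inf
    beta = 1.0
    for _ in range(max_iter):
        p = np.exp(-beta * sq_dists)
        p /= p.sum()
        entropy = -np.sum(p * np.log(np.maximum(p, 1e-12)))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:
            # Kernel too wide (too many effective neighbors): raise beta.
            beta_lo = beta
            beta = beta * 2 if beta_hi == np.inf else (beta_lo + beta_hi) / 2
        else:
            beta_hi = beta
            beta = (beta_lo + beta_hi) / 2
    return np.sqrt(1.0 / (2.0 * beta)), p
```

A larger target perplexity forces a larger \(\sigma_i\), which is exactly the "all kernels look the same" effect described above.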

Nonetheless, there are some definitive clusters already. The extracted columns used were (with additional preprocessing):

#### HMM Results

We cluster data points into two regimes: up trend / down trend. Feeding only the actual prices to the HMM is not sufficient, because trends are not very clear in commodities (as opposed to stocks, for example), which leads to erratic predictions. We remedy this by adding the moving average as an input, so that the model is more stable.
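A minimal sketch of the feature construction (the helper name is mine; the resulting matrix would then be fed to a two-state Gaussian HMM, e.g. `hmmlearn`'s `GaussianHMM`, to decode the up/down regimes):

```python
import numpy as np

def hmm_features(prices, window=10):
    """Stack raw prices with their trailing moving average.

    The moving average smooths the noisy commodity series, so the
    two-state (up/down) HMM sees clearer regime structure than it
    would from prices alone.
    """
    prices = np.asarray(prices, dtype=float)
    csum = np.cumsum(prices)
    ma = np.empty_like(prices)
    # The first window-1 points use a partial window, so the feature
    # matrix has no NaNs to drop.
    ma[:window] = csum[:window] / np.arange(1, window + 1)
    ma[window:] = (csum[window:] - csum[:-window]) / window
    return np.column_stack([prices, ma])
```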

We then train the data in each cluster independently, using a linear model per cluster.
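The per-cluster fit can be sketched as plain least squares over each regime's rows (names are mine, assuming numpy's `lstsq`):

```python
import numpy as np

def fit_per_cluster(X, y, labels):
    """Fit an independent least-squares linear model per cluster label.

    Returns a dict mapping each label to a (d+1,) weight vector,
    whose last entry is the cluster's intercept.
    """
    models = {}
    for c in np.unique(labels):
        mask = labels == c
        # Append a bias column so each cluster gets its own intercept.
        A = np.column_stack([X[mask], np.ones(mask.sum())])
        w, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        models[c] = w
    return models
```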