{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Frequently, we run into situations where need to deal with arrays of varying sizes in `numpy`. These result in much slower code that deals with different sizes individually. Luckily, by extracting commutative and associative operations, we can vectorize even in such scenarios, resulting in significant speed improvements. This is especially pronounced when doing the same thing with deep learning packages like `torch`.\n",
"\n",
"For instance, take a typical k-means implementation, which has an inner loop for a naive algorithm like the following."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.spatial.distance import cdist\n",
"\n",
"def centroids(X_nd, label_n):\n",
" \"\"\"\n",
" Given X_nd, a 2-dimensional array of n d-dimensional points,\n",
" and n cluster assignments label_n (a 1-d array of n labels,\n",
" ints in range [0, k)), return (c_kd, dist_n) the k centroids c_kd and the\n",
" squared Euclidean distances dist_n from each point to its centroid.\n",
" \n",
" Intentionally zero out any empty clusters.\n",
" \"\"\"\n",
" n, d = X_nd.shape\n",
" k = label_n.max() + 1\n",
" c_kd = np.zeros((k, d))\n",
" dist_n = np.zeros(n)\n",
" for i in range(k):\n",
" ilabel_n = label_n == i\n",
" if not ilabel_n.sum():\n",
" continue\n",
" X_id = X_nd[ilabel_n]\n",
" c_kd[i] = X_id.mean(axis=0)\n",
" dist_n[ilabel_n] = cdist(c_kd[i:i+1, :], X_id, 'sqeuclidean').ravel()\n",
" return c_kd, dist_n "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to do the same thing (mean and compute pairwise square distances) to each of these mixed-size `X_id` arrays, but the `for i in range(k)` loop is difficult to vectorize.\n",
"\n",
"Luckily, notice that our main reduction (`np.mean`) over the ragged arrays is a composition of two operations: `sum / count`. Extracting the reduction operation (the sum) into its own step will let us use our numpy gem, `np.cumsum` + `np.diff`, to aggregate across ragged arrays.\n",
"\n",
"Then we can take adjacent differences to recover per-cluster means. This \"accumulate ragged\" trick will work for any respectable [ufunc](https://numpy.org/doc/stable/reference/ufuncs.html) with a negation. The key to making it work is to sort such that each cluster is contiguous."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def inverse_permutation(p):\n",
" ip = np.empty_like(p)\n",
" ip[p] = np.arange(len(p))\n",
" return ip\n",
"\n",
"def vcentroids(X, label):\n",
" \"\"\"\n",
" Vectorized version of centroids.\n",
" \"\"\" \n",
" # order points by cluster label\n",
" ix = np.argsort(label)\n",
" label = label[ix]\n",
" Xz = X[ix]\n",
" \n",
" # compute pos where pos[i]:pos[i+1] is span of cluster i\n",
" d = np.diff(label, prepend=0) # binary mask where labels change\n",
" pos = np.flatnonzero(d) # indices where labels change\n",
" pos = np.repeat(pos, d[pos]) # repeat for 0-length clusters\n",
" pos = np.append(np.insert(pos, 0, 0), len(X))\n",
" \n",
" # accumulate dimension sums\n",
" Xz = np.concatenate((np.zeros_like(Xz[0:1]), Xz), axis=0)\n",
" Xsums = np.cumsum(Xz, axis=0)\n",
" \n",
" # reduce by taking differences of accumulations exactly at the\n",
" # endpoints for cluster indices, using pos array\n",
" Xsums = np.diff(Xsums[pos], axis=0)\n",
" counts = np.diff(pos)\n",
" c = Xsums / np.maximum(counts, 1)[:, np.newaxis]\n",
" \n",
" # re-broadcast centroids for final distance calculation\n",
" repeated_centroids = np.repeat(c, counts, axis=0)\n",
" aligned_centroids = repeated_centroids[inverse_permutation(ix)]\n",
" dist = np.sum((X - aligned_centroids) ** 2, axis=1)\n",
" \n",
" return c, dist"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(True, True)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.random.seed(1234)\n",
"\n",
"n = 10000\n",
"d = 10\n",
"k = 10000\n",
"x = np.random.randn(n, d)\n",
"label = np.random.randint(k, size=n)\n",
"c0, dists0 = centroids(x, label)\n",
"c1, dists1 = vcentroids(x, label)\n",
"np.allclose(c0, c1), np.allclose(dists0, dists1)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"398 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n",
"3.16 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%timeit centroids(x, label)\n",
"%timeit vcentroids(x, label)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Thanks to my friend [Ben Eisner](https://scholar.google.com/citations?user=RWe-v0UAAAAJ&hl=en) for inspiring this post with his [SO](https://stackoverflow.com/questions/65623906/pytorch-how-to-vectorize-indexing-and-computation-when-indexed-tensors-are-diff) question."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}