Vlad Feinberg

Distillation Walkthrough

Sun, 04 Feb 2024 00:00:00 +0000

Distillation is a critical technique for improving a network’s quality while keeping its serving latency constant.

This is becoming crucial as people focus on serving larger and larger models. Image found on LinkedIn.

Distillation is a powerful technique, but a couple things about it are quite mystical. The purpose of this post is to:

Provide a very high level explainer (but mostly refer to source papers) of distillation.
Show that you can create a simple, linear example where distillation works.

How is distillation implemented?

At an algorithmic level, distillation, introduced by Hinton, Vinyals, and Dean 2015, applied to a classification problem with labels $y=1,\cdots,d $ proceeds as follows.

Train a large, expensive teacher network $T(x;\mu) $ to predict the distribution $y|x $ (a categorical distribution on $[d] $, $\Delta_d $) from a large dataset of $(x, y) $ pairs; this is done by minimizing negative log-likelihood (NLL) of the teacher’s prediction $\min_\mu -\log T(x; \mu)_y $, yielding $\mu_* $.
Train your smaller, cheaper network $S(x;\theta) $ on a combined loss from the data NLL and matching the teacher distribution by minimizing cross-entropy $H(T(x;\mu_*),S(x;\theta))=-\sum_{i=1}^d T(x; \mu_*)_i\log S(x; \theta)_i$, i.e., $\min_\theta -\log S(x; \theta)_y + \alpha H(T(x;\mu_*),S(x;\theta)) $, where $\alpha $ is a hyperparameter weighting the teacher term. Since $\mu_* $ is fixed, optimizing the latter term is equivalent to minimizing $D_{\mathrm{KL}}(T||S) $.
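The two steps above can be sketched for a single example in a few lines of NumPy. This is a minimal illustration rather than the implementation from the paper: the names `softmax` and `distillation_loss` are made up here, and the teacher probabilities are assumed to be precomputed from a frozen teacher.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_probs, y, alpha):
    """NLL on the true label plus alpha times cross-entropy to the teacher."""
    log_s = np.log(softmax(student_logits))
    nll = -log_s[y]
    cross_entropy = -np.sum(teacher_probs * log_s)
    return nll + alpha * cross_entropy

# Hypothetical numbers for one 3-class example.
student_logits = np.array([2.0, 0.5, -1.0])
teacher_probs = np.array([0.7, 0.2, 0.1])  # T(x; mu_*), held fixed
loss = distillation_loss(student_logits, teacher_probs, y=0, alpha=0.5)
print(loss)
```

Setting `alpha=0` recovers plain NLL training, and a very large `alpha` approaches training on the teacher's distribution alone.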

This has been scaled up to language models too, see Anil et al 2018.

Why wouldn’t it work?

At first, I found the fact that distillation helps mind-bending. To understand why I was confused, let’s look at the distillation loss for $\theta $:

\[ -\log S(x; \theta)_y + \alpha H(T(x;\mu_*), S(x;\theta)) = c_\alpha H\left(\frac{\delta_y +\alpha T(x;\mu_*)}{1+\alpha},S(x;\theta)\right) \]

That is, distillation loss is, up to a constant factor $c_\alpha=1+\alpha $ (which follows from cross-entropy being linear in its first argument), equivalent to softening our data label $y $ by the teacher’s predictions $T(x;\mu_*) $. Above, $\delta_y$ stands for the atomic distribution which concentrates all the mass on $y$; the softened label is its weighted average with the teacher distribution $T(x;\mu_*)$.
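The identity is easy to check numerically. The sketch below draws an arbitrary teacher and student distribution (random stand-ins, not real model outputs) and compares both sides, taking the constant to be $1+\alpha$ because cross-entropy is linear in its first argument.

```python
import numpy as np

rng = np.random.default_rng(0)
d, y, alpha = 5, 2, 0.7

def random_simplex_point(rng, d):
    # Strictly positive probabilities so that every log is finite.
    p = rng.random(d) + 0.1
    return p / p.sum()

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

teacher = random_simplex_point(rng, d)  # stands in for T(x; mu_*)
student = random_simplex_point(rng, d)  # stands in for S(x; theta)
delta_y = np.eye(d)[y]                  # one-hot label

lhs = -np.log(student[y]) + alpha * cross_entropy(teacher, student)
soft_label = (delta_y + alpha * teacher) / (1 + alpha)
rhs = (1 + alpha) * cross_entropy(soft_label, student)
print(lhs, rhs)
```

Both printed values agree to floating point precision, and `soft_label` is itself a valid distribution.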

We started with a dataset and ran it through some matmuls to get $T(x;\mu_*) $. So if we view the dataset as a huge random variable (rv) $D $, then $\mu_* $, and indeed $T(x;\mu_*) $, are just other rvs which are very complicated functions of $D $ (since they resulted from SGD or some training process applied to $D $).

So we just took our clean labels $y $ and added noise to them! And it helped! Indeed, we can formalize this concern in the case $\alpha=\infty $ (so that the distillation label $y'=T(x;\mu_*) $). In practice, and even in the linear setting I discuss next, this still works, in that train and test loss improve over using $y $.

If we train with just $y' $ then the training process forms a Markov chain conditioned on all $x $: $y \rightarrow \mu_*\rightarrow \theta_* $ (given $\mu_* $, and therefore $y' $, the training of $\theta_* $ is conditionally independent of $y $).

Then by the data processing inequality, the mutual information $I(y, \mu_*)\ge I_{\alpha=\infty}(y, \theta_*) $.

This implies that, in the case of self-distillation, where you can globally optimize the learner (find the true minimum $\mu_*,\theta_*$), we shouldn’t expect any improvement over the teacher—this would violate the DPI!

There’s no paradox with the example below, since our teacher is of a different class than the student, but, still, it feels wrong to distance ourselves from the true label. We don’t know the relationship between $ I_{\alpha=\infty}(y, \theta_*) $ and $ I_{\alpha=0}(y, \theta_*) $ but this seems like a step in the wrong direction.

This concern could be framed as an invocation of Vapnik’s principle: “When solving a problem of interest, do not solve a more general problem as an intermediate step.”

Why does it work?

In short, it can be the case that $ I_{\alpha=\infty}(y, \theta_*) > I_{\alpha=0}(y, \theta_*) $ because of variance reduction for a finite dataset (showing equality in the data limit would be quite interesting, reach out to me if you have that result!). The teacher makes the labels less noisy.

The underlying classification problem has $(X, Y)\sim\mathcal{D} $ where the label $Y|X $ is itself stochastic (noisy). One can imagine that for a logistic loss, the variance of the training loss from $n $ examples derived from the dataset $(X, \mathbb{E}[Y|X]) $ is smaller than that of the training loss from $n $ examples of $(X, Y) $. For a probabilistic binary model $p_\theta $ parameterized by $\theta $:

\[ \mathrm{var}\ \frac{1}{n}\sum_i\log(1+\exp(-Y_ip_\theta(X_i)))\ge \mathrm{var}\ \frac{1}{n}\sum_i\log(1+\exp(-\mathbb{E}[Y_i|X_i]p_\theta(X_i)))\,\,, \]

where this holds by Jensen and the fact that log loss is convex (note this still holds for neural nets $p_\theta $).
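A quick simulation makes the variance gap concrete. Note the assumptions: this sketch uses $\{0,1\}$ labels and soft-target cross-entropy instead of the $\pm1$ form written above, and the "model" is a fixed, made-up logit function `2 * sin(x)`. In this form the soft-target loss is exactly the conditional expectation of the hard-label loss given $X$, so the law of total variance guarantees the inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 2000

def mean_losses(rng, n):
    x = rng.standard_normal(n) * 10
    ey = 0.4 * np.sin(x) + 0.5         # E[Y | X] with Y in {0, 1}
    y = rng.binomial(1, ey)            # sampled (noisy) labels
    z = 2.0 * np.sin(x)                # logits of a fixed, hypothetical model
    softplus = np.logaddexp(0.0, z)    # log(1 + exp(z))
    hard = np.mean(softplus - y * z)   # cross-entropy against sampled labels
    soft = np.mean(softplus - ey * z)  # cross-entropy against E[Y | X]
    return hard, soft

losses = np.array([mean_losses(rng, n) for _ in range(trials)])
var_hard, var_soft = losses.var(axis=0)
print(var_hard, var_soft)
```

Both objectives have the same expectation, but the one built from $\mathbb{E}[Y|X]$ fluctuates far less from dataset to dataset.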

So if instead of $\mathbb{E}[Y|X] $ we use a larger neural network teacher $\hat f(X)\approx \mathbb{E}[Y|X] $, so long as the teacher itself does not add more variance to our training objective, then knowledge distillation can indeed be useful.

The below shows a non-deep simple example in the low-data setting where this matters, but again, we rely on the fact that the teacher has a strong inductive bias so that the last caveat is met.

import numpy as np
from sklearn.base import clone
from sklearn.preprocessing import PolynomialFeatures, FunctionTransformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from scipy.special import expit
from sklearn.metrics import log_loss

def base_distribution(rng, n):
  x = rng.standard_normal(n) * 10
  ey = 0.4 * np.sin(x) + 0.5
  y = rng.binomial(1, ey)
  return x[:, np.newaxis], y

def run_kd(num_training_ex, x, y):
  test_prop = 1 - num_training_ex / len(x)
  xtr, xte, ytr, yte = train_test_split(x, y, test_size=test_prop, random_state=42)

  # Student is from a model class that won't be able to reconstruct the ground
  # truth
  degree = 3
  student = make_pipeline(
      PolynomialFeatures(degree),
      LogisticRegression(max_iter=1000))

  # Teacher comes from a better model class
  teacher = make_pipeline(
      FunctionTransformer(np.sin),
      LogisticRegression(max_iter=1000))

  student.fit(xtr, ytr)
  teacher.fit(xtr, ytr)

  teacher_labels = teacher.decision_function(xtr)

  # Hack to train on soft targets
  # https://stackoverflow.com/a/60969923

  distilled = make_pipeline(
      PolynomialFeatures(degree),
      LinearRegression())

  distilled.fit(xtr, teacher_labels)
  distilled.decision_function = distilled.predict

  models = {'student': student, 'teacher': teacher, 'distilled': distilled}
  datasets = {'train': (xtr, ytr), 'test': (xte, yte)}

  results = {}

  for model_name, model in models.items():
    for split_name, (x, y) in datasets.items():
      pred = model.decision_function(x)
      pred = expit(pred)
      results[model_name + '.' + split_name] = log_loss(y, pred)

  return results

results = None
ntrials = 15
# jax is only used for tree_map, which averages the per-trial result dicts.
import jax
nexs = np.arange(10, 50, 5, dtype=int)
for nex in nexs:
  rng = np.random.default_rng(1234)
  trials = [run_kd(nex, *base_distribution(rng, 10000)) for _ in range(ntrials)]
  # List of dicts to dict of singleton list of average
  # (jax.tree_map is gone from recent JAX releases; jax.tree_util.tree_map works).
  run = jax.tree_util.tree_map(lambda *xs: [np.mean(xs)], *trials)
  results = run if not results else {k: results[k] + run[k] for k in results}

from matplotlib import pyplot as plt

for split in ['train', 'test']:
  for k, v in results.items():
    if split in k:
      plt.plot(nexs, v, label=k.replace('.' + split, ''))
  plt.legend()
  plt.title(split)
  plt.xlabel('num training examples')
  plt.ylabel('logloss')
  plt.show()

Is that all there is to it?

Not quite. With neural nets, we do see some really interesting training phenomena not fully accounted for by the story above. For instance, self-distillation (where the “teacher” is the same size as the student) somehow works!

There are interactions with imperfect non-convex optimization in the deep neural network setting which are still a research topic. For more, check out a Zeyuan Allen-Zhu and Yuanzhi Li blog post from MSR.

Try the notebook out yourself.

Sketching Algorithms for Matrix Preconditioning in Neural Network Optimization

Wed, 18 Oct 2023 00:00:00 +0000

Introduction

In this blog post, I’ll provide data stream background, applied context, motivation, and an overview for my recent work, Sketchy, co-authored with Xinyi Chen, Y. Jennifer Sun, Rohan Anil, and Elad Hazan, which will be featured in NeurIPS 2023 proceedings.

I really enjoyed working on this project because it simultaneously scratches different methodological itches of mine, from data stream sketching to online convex optimization. However, the algorithm introduced here was born out of necessity; I didn’t go around with those two hammers a priori (though, presented with a nail, it made sense to reach for what’s readily available in my tool belt)!

This work is a big step in making matrix preconditioning training methods more accessible for deep learning by removing a critical memory bottleneck. For a deeper analysis, skip to the Motivation section below. Matrix preconditioning is probably one of the most under-appreciated techniques in modern neural network training, but for good reason. Shipping matrix preconditioning is quite challenging:

It takes engineering work to eliminate overheads and numerical linear algebra bugs from the required matrix operations.
Like any drastic optimizer change, it requires hyperparameter tuning.
Like any drastic change, it requires socialization and “emotional work” to gain adoption. Everyone’s naturally risk-averse, especially towards optimizer changes: optimizer papers are easy to write, so the literature has a lot of noise.
Organizational decision making is complex. An increase in quality demanding increased training compute will not sell, even if the exchange is favorable when converted to dollars. By trading off other factors, such as model or dataset size, a quality win can be converted to something Pareto dominant on all target teams’ desiderata.

I was quite lucky in that I didn’t need to start on these 4 challenges on my own! My mentor and manager, Rohan Anil, did a lot of the initial labor here in shipping a version of the Shampoo optimizer. Sketchy breaks down a subsequent barrier to adoption for Shampoo: memory usage.

Context

Here, I’ll discuss the research and production background that sets up for Sketchy. As the ML behind Google’s predicted click-thru-rate (pCTR) evolved, so too did its optimizer, often translating to quality wins. Better pCTR means more relevant ads selection means more money. Some public papers from Google provide great openly-available background.

The original 2013 View from the Trenches paper (which I’ve analyzed before on this blog) introduces a variant of AdaGrad called FTRL-Prox.

Rather than updating network weights with SGD given a gradient $\textbf{g}_t$ using a fixed learning rate $\textbf{x}_{t+1} = \textbf{x}_{t}-\eta\textbf{g}_t$, using the per-coordinate learning rates defined as above with updates $x_{t+1, i} = x_{t, i}-\eta_{t,i}g_{t, i}$ (glossing over the lasso component here) dramatically improves optimization quality, due to a preconditioning effect. You can imagine how rescaling the $x$ and $y$ axes in the below level-plot of a 2D loss surface can result in faster optimization.

The magic of AdaGrad’s analysis is in identifying that the running root-mean-square of the coordinate-wise gradients is the best such rescaling choice we can make in hindsight, up to a constant factor. Notice though, in the plot above, that the ellipses representing the parabolic loss surface on the left are rotated off-kilter: no rescaling of individual axes can transform the image on the left into the one on the right!
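To make the per-coordinate rescaling concrete, here is a toy comparison on an axis-aligned but badly conditioned quadratic. Everything in it is illustrative: the curvatures, learning rates, and function names are hand-picked for this sketch, and it is plain diagonal AdaGrad without the FTRL-Prox or lasso details.

```python
import numpy as np

curvature = np.array([100.0, 1.0])  # steep in x[0], shallow in x[1]

def loss(x):
    return 0.5 * np.sum(curvature * x ** 2)

def grad(x):
    return curvature * x

def run_sgd(x0, lr, steps):
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def run_adagrad(x0, lr, steps, eps=1e-12):
    x = x0.copy()
    sum_sq = np.zeros_like(x)  # running sum of squared gradients per coordinate
    for _ in range(steps):
        g = grad(x)
        sum_sq += g ** 2
        x = x - lr * g / (np.sqrt(sum_sq) + eps)
    return x

x0 = np.array([1.0, 1.0])
# SGD's single learning rate is capped by the steep direction (lr < 2 / 100).
sgd_loss = loss(run_sgd(x0, lr=0.01, steps=100))
adagrad_loss = loss(run_adagrad(x0, lr=0.5, steps=100))
print(sgd_loss, adagrad_loss)
```

SGD is stuck crawling along the shallow axis, while the per-coordinate rates make both axes converge at the same pace. If the quadratic were rotated, this diagonal rescaling would no longer be enough.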

The problem here is that our gradients would be correlated across dimensions, and decorrelating them requires a full whitening matrix $H_t^{-1}$ as pictured above. Unfortunately, this is a showstopper for all but the smallest problems. Full matrix AdaGrad analysis states that the optimal preconditioner is the inverse matrix square root of the gradient covariance, $C_t=\sum_{s=1}^t g_sg_s^\top$, the sum of gradient outer products. This would require petabytes to represent in modern neural networks!

Enter Shampoo. Shampoo tells us that for convex functions with matrix-shaped inputs, we can use a structured approximation to the full covariance $C_t$ instead (DNNs are non-convex functions of multiple matrix-shaped inputs, but the convex-inspired approach seems to work!). In particular, given a weight matrix of shape $a\times b$, rather than using the full flat gradient $\textbf{g}_t\in\mathbb{R}^{ab}$, whose outer product is a matrix of size $ab\times ab$, we can use a Kronecker product of two much smaller matrices built from the matrix-shaped gradient. Specifically, we set

\[ G_t=\mathrm{reshape}\left(\textbf{g}_t, \left(a, b\right)\right)\,\,,\]

and then define the accumulated tensor products for both the left and right sides of the matrix gradient,

\[ L_t=\sum_{s=1}^tG_sG_s^\top\,\,\,\,;\,\,\,\,R_t=\sum_{s=1}^tG_s^\top G_s\,\,,\]

where $L_t,R_t$ are of shape $a\times a,b\times b$, respectively. Then the Kronecker product $L_t\otimes R_t\approx C_t^2$, in the sense that we can recover regret guarantees similar to full matrix AdaGrad when using the former. The Kronecker product thus can be viewed as an approximation of the $C_t$ matrix of $(ab)^2$ entries using $a^2+b^2$ entries instead. I’ve gone into detail on the Kronecker product from a computation perspective in a walkthrough notebook, but as a little demo, see below.
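A tiny NumPy sketch of the bookkeeping, using a single random gradient and made-up sizes: it builds the two factors, forms their Kronecker product explicitly (only feasible because the sizes are tiny), and checks that applying it to the flat gradient equals multiplying the matrix gradient on both sides. This assumes NumPy's default row-major reshape.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 3, 4
G = rng.standard_normal((a, b))  # matrix-shaped gradient
g = G.reshape(-1)                # flat gradient in R^{ab}, row-major

L = G @ G.T                      # a x a left statistic (one term of the sum)
R = G.T @ G                      # b x b right statistic (one term of the sum)
kron = np.kron(L, R)             # (ab) x (ab), never materialized in practice

# Row-major identity: (L kron R) vec(G) = vec(L G R^T).
via_kron = kron @ g
via_factors = (L @ G @ R.T).reshape(-1)

full_entries = (a * b) ** 2
factored_entries = a ** 2 + b ** 2
print(kron.shape, full_entries, factored_entries)
```

Even at this toy size the factored form stores 25 numbers instead of 144, and the gap grows quickly with $a$ and $b$.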

The critical part of the Shampoo paper which relates the approximation to $C_t$ is in Lemma 8, which I excerpt below but won’t go into detail on.

So, are we done? Not quite. For the ever-ubiquitous Transformers, highly rectangular weights are common, e.g., take the big transformer, which has $a=1024$ and $b=4a$ (these are models from 2017; nowadays we have bigger ones but $b/a$ tends to be at least 4). Shampoo’s memory overhead here would be around $2(1 + 16)a^2$, since you need to store the statistics $L_t,R_t$ and their matrix inverse roots. Remember, the full parameter has size $4a^2$!
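The arithmetic behind that overhead is short enough to write out. The sketch below counts entries for the $a=1024$, $b=4a$ example, and the 4 bytes per entry (float32) used for the full-matrix figure is my assumption.

```python
a = 1024
b = 4 * a

param_entries = a * b                   # 4 a^2 weights in the layer
stat_entries = a * a + b * b            # L_t and R_t: (1 + 16) a^2
shampoo_entries = 2 * stat_entries      # statistics plus their inverse roots
overhead_in_param_copies = shampoo_entries / param_entries

# For contrast: the full (ab) x (ab) covariance C_t for this single layer.
full_matrix_entries = (a * b) ** 2
full_matrix_terabytes = full_matrix_entries * 4 / 1e12  # assuming float32

print(overhead_in_param_copies, full_matrix_terabytes)
```

So Shampoo costs about 8.5 extra parameter-sized copies for this layer, while the full matrix would need roughly 70 TB on its own.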

Nowadays, when you’re getting ready to train a transformer, or whatever, you don’t set aside enough room in your GPU memory for 8 more copies of the model; you train a bigger model.

This is where Sketchy enters the picture: can we retain the quality wins of Shampoo but drop memory to AdaGrad/Adam levels? Previous work like AdaFactor shows we can give up quality to reduce memory. But can we leverage second moment information to use less memory than Adam and still beat it?

Motivation

Besides wanting to have Shampoo quality without its memory prices, there are deep trends in computing which suggest memory reduction is worth investing in.

Batch size. We’ve all seen the clock speed charts and parallelism is the future arguments already; hardware acceleration via parallelism has won. Here’s an indicative plot from CPUDB:

OK, so say you buy that future training will rely on more parallel devices, each contributing FLOPS, rather than on more FLOPS per device. What does that tell us about optimizers? That we can afford to do fancier operations in our optimizers, which usually (and with ZeRO especially) are not a meaningful amount of compute compared to the rest of DNN training!

If you have many devices, and therefore large batch sizes, training your neural network, then any optimizer which only processes the mini-batch gradient (e.g., not K-FAC) effectively has its step time amortized per example in the batch. Note that this falls out of the fact that your gradient shape is independent of the batch size. If you can spend the time to make a higher-quality step from each gradient, then do so!

Memory Bandwidth. Device logic is speeding up faster than memory access speed. There are fundamental reasons we should expect algorithms with high compute density (compute per memory access) to be faster. But more practically speaking, there’s growing headroom in physical devices where we should be looking to find better ways of optimizing our network by increasing compute per memory access (e.g., from loading of an example or weights).

Device Family    Compute Increase    Memory Bandwidth Increase
TPUv2 to TPUv3   2.67×               1.29×
V100 to A100     5×                  2.2×

This means that if we have an algorithm that improves quality while using nearly the same amount of memory bandwidth (e.g., it is a low-memory optimizer which is applied to the same-size neural net and requires scanning through the same number of examples), but perhaps more compute, then we expect this algorithm to speed up over time as compute increases faster than memory bandwidth.

Overview

On the face of it, we should be suspicious of the claim that we can use asymptotically equivalent memory to Adam and still achieve near-Shampoo levels of quality (there are other buffers which require memory linear in parameter size, for learning rate grafting, momentum, etc.).

So we should clarify how we can take advantage of problem-specific structure to reduce memory use.

The key is that $L_t,R_t$ (or moving average analogues of the two, in the non-convex NN case) exhibit fast spectral decay. In other words, a low rank plus diagonal approximation to each of those two matrices suffices to preserve the matrix action of each one individually.

The plot above shows that taking the top 25% of the eigenvalues for a weight with an axis of size 1024 is enough to capture over 80% of the spectral mass in those statistics, across architectures, throughout training.

This is a highly nontrivial property—note that in fact we have a rotating top-subspace over the course of training for the EMA version $L_t=\sum_{s\le t}\beta^{t-s}G_sG_s^\top$. If we just had a $256\times 1024$ weight matrix with normal gradients, we’d see a near isometry when EMA’ing:

import numpy as np
import jax
# jax.config is an object on the top-level package in current JAX releases.
jax.config.update("jax_enable_x64", True)
from jax import numpy as jnp
rng = np.random.default_rng(1234)
d, n, beta = 256, 10000, 0.999
# Note: x holds n * 1024 * d float64 values (roughly 21 GB).
x = rng.normal(size=(n, 1024, d))
# reverse sort for numerical stability, note square root b/c we square x.
x *= np.power(beta, np.arange(n) / 2)[::-1, np.newaxis, np.newaxis]
cov = 0
# Accumulate the EMA-weighted 1024 x 1024 covariance in chunks of 1000 steps.
for i in range(0, n, 1000):
  s = x[i:i + 1000]
  cov += jnp.einsum('nxd,nyd->xy', s, s).block_until_ready()
eigvals = np.linalg.eigvalsh(cov)
eigvals[-256:].sum() / eigvals.sum()  # 26.5%

But simply knowing a low rank approximation would do isn’t enough. You’d need the SVD (or iterative top-$k$ variants of it) applied to the full statistics to compute this low rank approximation. But if we’re tracking full statistics, we’re already paying for $a^2+b^2$ space!

Can we approximate the low rank approximation of a matrix, here $C_t=\sum_{s\le t}G_sG_s^\top$, using only the space of the low rank approximation itself, incrementally updating our approximation as each $G_s$ is revealed to us?

This seems impossible at first sight (indeed, the exact problem is). How can you both track the current top eigenspace and also update it as you go along? It’d be like trying to find the top-$k$ most frequent unique items in a data stream using only $O(k)$ memory, except the eigenvalue version of that.

This happens to be exactly the observation Edo Liberty made when introducing Frequent Directions (FD). Indeed, as my co-author Elad Hazan pointed out, if you trace the FD algorithm with incoming vectors equal to basis vectors, you recover Misra-Gries (the algorithm solving the frequent item stream challenge in the previous paragraph).
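For reference, here is a minimal sketch of Misra-Gries with at most `k` counters. The function name and the toy stream are mine. Each surviving count underestimates the true frequency, by at most $n/(k+1)$ for a stream of length $n$.

```python
def misra_gries(stream, k):
    """Approximate frequent-item counts using at most k counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No room: decrement every counter and evict the ones that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 6 + ["b"] * 4 + ["c", "d", "e"]
summary = misra_gries(stream, k=2)
print(summary)
```

The "decrement everything" step is the part that Frequent Directions mirrors by shrinking all singular values at once.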

In the above, the sketch $B_i$ has the property that $B_i^\top B_i$ approximates the covariance $A_i^\top A_i$ in operator norm, with error that scales in the lowest $d-\ell$ singular values of $A_i$.
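Here is a minimal NumPy sketch of a buffered Frequent Directions variant, written from the published description of the algorithm rather than taken from the Sketchy codebase. The function name, the $2\ell$-row buffer, and the test matrix are my own choices. For this variant the operator-norm error is at most $\|A\|_F^2/\ell$, which the last lines check.

```python
import numpy as np

def frequent_directions(rows, ell):
    """Sketch rows (n x d) into a (2 * ell) x d buffer B with B^T B ~ rows^T rows."""
    n, d = rows.shape
    assert d >= 2 * ell, "this sketch assumes d >= 2 * ell"
    buf = np.zeros((2 * ell, d))
    next_free = 0
    for row in rows:
        if next_free == 2 * ell:
            # Buffer is full: rotate onto singular directions and shrink.
            _, s, vt = np.linalg.svd(buf, full_matrices=False)
            delta = s[ell - 1] ** 2
            shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            buf = np.zeros((2 * ell, d))
            # Directions ell - 1 and beyond were shrunk to zero, so drop them.
            buf[:ell - 1] = shrunk[:ell - 1, None] * vt[:ell - 1]
            next_free = ell - 1
        buf[next_free] = row
        next_free += 1
    return buf

rng = np.random.default_rng(0)
d, ell = 64, 16
# Rows with a geometrically decaying spectrum across coordinates.
A = rng.standard_normal((500, d)) * (0.9 ** np.arange(d))
B = frequent_directions(A, ell)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
bound = np.linalg.norm(A, "fro") ** 2 / ell
print(err, bound)
```

The sketch only ever holds $2\ell$ rows, no matter how many rows stream past.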

Replace $A_i$ with $C_t$ and we have the opportunity to apply sketching for second moment capture! The real work in Sketchy came not from the idea of applying sketches to data streams, but from proving that the approximation from FD can be made good enough to use for optimization, and that your error at the end of optimizing also only depends on lower-order eigenvalues of your gradient covariance.

We were heartened by the test performance of the optimizer against Adam and Shampoo, landing in between the linear and super-linear optimizers in terms of quality.

This overview post barely touched on the algorithms involved in Sketchy, and only alluded to the theoretical details, all of which can be found in the full paper. There’s also a completely separate half to this paper which we didn’t have space to get into: the computational aspect of it! We used full SVD for the theoretical paper results, but it’s possible to exploit iterative top-singular-value routines instead.

We’re eager to work on these solutions with you! Reference JAX code is available on GitHub, and please reach out if you’re interested in a PyTorch or iterative top-k eigenvalue implementation!

Crinkle Crankle Optimization

Sun, 08 Oct 2023 00:00:00 +0000

A serpentine wall, or crinkle crankle wall, may seem like a surprising structure to use for fences, but may end up being more efficient in terms of building material than a straight fence which must withstand the same horizontal forces.

In a post which is the raison d’etre for this one, John D. Cook derives a formula for computing the arc length of a sinusoidal curve. By assuming that such a sinusoidally-shaped wall could withstand lateral forces as much as a straight wall would, had the straight wall been twice as thick, he shows that you could save bricks in fencing off the same perimeter with a serpentine wall.

However, I found this implicit physical assumption to be quite unsatisfying. Why would a serpentine wall be as strong as a straight one twice as thick? Why should it be sinusoidal?

In this blog post, I inspect these questions with idealized but explicit physical assumptions and use them to explore what the optimal shape of a wall would be, given a fixed amount of building material and a finite length to cover.

With the right design (it’s not sinusoidal!), we can build a wall $8\times$ stronger than a straight wall with the same number of bricks, or use $15\times$ fewer materials than an equivalently-resilient straight wall!

A Simplified Physical Model

If we were trying to realistically model this scenario, we’d need to consider building materials, stress, whether the fence would be dug into the ground, and most likely use numerical finite element method software for modeling.

To not take up so much time, we’ll try to solve this with statics alone. For developing our model, let’s initially consider just two types of walls: straight and zig-zag (we’ll revisit shape later).

Each wall is assumed to consist of a uniform material, and be of constant, nonzero width (to ensure this for the zig-zag, we can give it rounded corners).
We won’t dig the walls into the ground.
We’re primarily interested in preventing the fence from toppling over due to lateral wind pressure, coming from the side, orthogonal to the length of the straight fence, with some fixed force.
We aren’t concerned about the fence shifting laterally, sliding along the floor. As a result, we assume infinite friction between the fence and the floor.
We’ll only consider fence designs which can be repeated to cover arbitrary lengths (which is our $x$-axis), don’t loop back around, and look the same whether we’re building east-to-west or west-to-east (after all, the wind could come from either side).
All fences must be of fixed height $h$.

The penultimate condition, mathematically put, means that if we view our fence from above, the curve along the center of the width of the fence should form a mathematical function $y(x)$ for $x$ running from $-0.5$ to $0.5$, such that $y$ is a continuous, odd function vanishing at the endpoints of its domain.

Toppling a Straight Fence

At this point, it’s worthwhile to inspect what would even make a fence topple over. We’ve assumed infinite friction, so it won’t slide around. For our straight fence of, say, unit width, why would sufficient wind flip it?

Let’s first consider the head-on/axial view of the fence, if we’re standing east of the fence and looking west into the length of its body.

For the fence to topple over, we’d need the fence to rotate counter-clockwise (CCW) about its North edge (with infinite friction and non-deformable material, where else would it rotate about?).

This makes it clear how the fence resists the CCW torque from the wind and friction: via CW torque from gravity. Notice that when $\theta = 0$, in the untilted case, friction is orthogonal to the instantaneous CCW rotation, so it doesn’t add torque. In the schematic below, I’ll draw it untilted to illustrate the wind and gravity torques appropriately.

Now we have our final set of torques, for the wind force (which, on average, strikes the center of the south side of the wall), gravity, and friction. The limit point for maximal wind force $F_W$ our unit-width straight wall can tolerate occurs at equality:

\[ F_g \cos \varphi_1 = F_W\sin\varphi_2 \,\,. \]

Since we know $\varphi_1=\arctan h$ and $\varphi_2=\arctan \frac{h}{2}$, we can simplify further to

\[ F_g \frac{1}{\sqrt{h^2+1}} = F_W \frac{h}{\sqrt{h^2+4}}\,\, \]

Rearranging, we have

\[ F_W = F_g\frac{1}{h}\sqrt{1 + \frac{3}{h^2+1}} \]

We see that there’s roughly an inversely proportional relationship between the wind force required to topple the fence and its height (from inside the square root; note that mass scales with height so the outer $h^{-1}$ cancels with terms inside $F_g$). Makes sense, the taller you are, the easier it is to topple you. On the other hand, the denser you are, the harder it gets (due to growth of $F_g$).

Finally, if we derived the above keeping a variable width $w$ in mind we’d actually end up with

\[ F_W = F_g \frac{w}{h}\sqrt{1 + \frac{3w^2}{h^2+w^2}}\,\,. \]
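As a sanity check on the algebra, the sketch below compares the angle form of the balance equation with the closed form for a few widths and heights. The generalization of the angles to width $w$, $\varphi_1=\arctan(h/w)$ and $\varphi_2=\arctan(h/2w)$, is my assumption, chosen so that $w=1$ recovers the unit-width case. Both functions return $F_W/F_g$.

```python
import math

def straight_wall_ratio(w, h):
    """F_W / F_g from the balance F_g cos(phi_1) = F_W sin(phi_2)."""
    phi_1 = math.atan2(h, w)      # assumed generalization of arctan(h)
    phi_2 = math.atan2(h, 2 * w)  # assumed generalization of arctan(h / 2)
    return math.cos(phi_1) / math.sin(phi_2)

def straight_wall_ratio_closed_form(w, h):
    """F_W / F_g from the simplified expression with variable width."""
    return (w / h) * math.sqrt(1 + 3 * w ** 2 / (h ** 2 + w ** 2))

for w, h in [(1.0, 1.0), (1.0, 2.0), (0.1, 1.0), (0.5, 3.0)]:
    print(w, h, straight_wall_ratio(w, h), straight_wall_ratio_closed_form(w, h))
```

The two columns agree, and for thin walls ($w \ll h$) the ratio is close to $w/h$.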

Toppling a Zig-Zag

How do things change with the zig-zag fence? Now, the fence would rotate not along the entire North edge, like the straight fence would, but only along two points, highlighted in red below.

We can calculate the torques from the axial view again, but note that this picture may be deceiving! The fence is angled here, so the $F_W$ incident to it will on average act at the “center” of the $y$-axis.

One might worry that because the wind is hitting the fence at an angle, it will be applying less pressure compared to the straight fence completely orthogonal to the wind. And it is! However, there’s correspondingly more fence to hit. Across a fixed length $\ell$, the total flux of wind hitting the fence is going to be the same, since the orthogonal component to the wind is all that matters. This is not the same as the drag that different fence shapes would have, if they were moving. Note that we’re assuming the fence is long enough here that we don’t have to model the fluid dynamics of the wind wrapping around the ends of the fence.

We also note that there’s a lot more going on here. For a given “V” of the zig-zag, the center of mass, by symmetry, will be at $(\ell/2, 0, h/2)$, with the average wind force and gravity both acting there (outside of the material itself, counterintuitively).

Nonetheless, the torque calculation is more or less the same as the flat wall case, besides the fact that the wind is now acting at the center of the axial view, so the torque is acting on a lever of length $\sqrt{(a/2)^2+(h/2)^2}$, where $a$ is twice the amplitude of the zig-zag, or its span crest-to-trough. The width of the fence itself (which is now no longer parallel to the $y$ axis) doesn’t play a role except in helping us calculate the total mass (assuming it’s negligible relative to the amplitude). If anything, the angles cancel out more nicely here.

\[ F_W = F_g \frac{a}{h}\]

Notice the width $w$ doesn’t play a role here at all (except insofar as it affects the mass of the fence and therefore $F_g$).

But if we use the same amount of building material for a length $\ell$ of fence, so as to equalize the comparison between the zig-zag and the straight fence, then $F_g$ is the same for both, and the maximum withstood wind force improves in direct proportion to the amplitude $a$. On the flip side, for a fixed height and mass, the width of the straight fence is also determined, so although $w$ and $F_W$ also share a similar nearly proportional relationship, straight fences of a given total mass and density can’t increase their width.

Optimizing the Wall

Given all the above, it’s clear that we could always improve lateral wind resistance by making our zig-zag thinner and amplitude larger.

Furthermore, the shape of the zig-zag didn’t end up mattering much: by symmetry the torque acting upon the center-of-mass is what kept our fence from toppling.

Thus, we’ll assume we must have some minimal thickness that the fence must have, a property of our building material (e.g., brick width). This, in turn, translates to essentially an arc length constraint on our fence shape for a fixed amount of material.

We can thus simplify our optimal wall shape question: given some “period length” to fence off, say from $x=-1$ to $x=1$, can we identify an odd, continuous function whose average on $x>0$ is as large as possible? Then, if the pattern was repeated, this would amount to having the longest lever possible counterbalancing the fence.

Given the setup above, we don’t even need to think too hard about the shape or bring out variational calculus: a zig-zag is already the most efficient way to reach a prescribed amplitude! Given that, regardless of the shape of the odd function, the center of mass in our statics equations always happens to be across the $x$-axis, we should try to reach the largest amplitude possible!

What’s the Upshot?

So now that we know the optimal shape, the question stands: given a straight wall two bricks thick (width $2w$), if we instead use the same amount of building material for a zig-zag wall one brick thick ($w$), how much more wind resistance have we earned?

To facilitate numerical computation, let’s imagine our unit of length to cover here is 1 meter, with the requisite height being 1 meter as well. We can then suppose $w=0.05$ is a reasonable brick width. For the straight wall, the wind force equation is unsurprisingly dominated by $h$ so it behaves roughly linearly in width (recall this is $2w$).

\[F_{W, \text{straight}}\approx 0.1015 F_g\,\,.\]

Given that we’re covering 1 meter of length with a 2-brick-thick straight wall, we can consider our (optimal) alternative of a zig-zag with a total arc length of 2 meters. With a little bit of trigonometry, we can back out that the zig-zag is composed of two opposing isosceles triangles joined at one end, each of which must have sides of length $0.5$. In this case their base is also $0.5$ meters long, so it’s actually equilateral (if we were covering a longer stretch of fence, it’d be a pair of obtuse isosceles triangles comprising the zig-zag).

This puts the amplitude at $\frac{\sqrt{3}}{4}$. Then twice the amplitude gets us the wind force since $h=1$ per our formula derived in the prior section.

\[F_{W, \text{ zig-zag, equilateral}}\approx 0.8660 F_g\,\,,\]

or, put another way, we’ve engineered a fence over $8\times$ as strong!
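These two numbers are quick to reproduce. The sketch below plugs $h=1$ and a $0.05$ m brick into the straight-wall expression from earlier and works out the zig-zag peak from its side length. Variable names are mine, and both results are in units of $F_g$.

```python
import math

def straight_wall_ratio(w, h):
    """F_W / F_g for a straight wall of width w and height h."""
    return (w / h) * math.sqrt(1 + 3 * w ** 2 / (h ** 2 + w ** 2))

h = 1.0
brick = 0.05
straight = straight_wall_ratio(2 * brick, h)  # wall two bricks thick

side = 0.5        # arc length 2 split over the four sides of the zig-zag
half_base = 0.25  # each triangle spans half of the 1 meter period
peak = math.sqrt(side ** 2 - half_base ** 2)  # sqrt(3) / 4
zigzag = 2 * peak / h                         # F_W / F_g = a / h with a = 2 * peak

improvement = zigzag / straight
print(round(straight, 4), round(zigzag, 4), round(improvement, 2))
```

The ratio comes out at roughly 8.5, matching the "over $8\times$" figure.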

We can ask an analogous question. How much thicker would our straight fence need to be to withstand the same force? Here, we’re willing to give it additional mass. Since that scales linearly with width, we’d actually have a nearly quadratic relationship:

\[ F_{W,\text{ straight,wider}}=F_{g,\text{ per unit brick width}}\frac{w^2}{h}\sqrt{1 + \frac{3w^2}{h^2+w^2}} \]

Setting this equal to our zig-zag fence, we find that a thickness of approximately 77 cm is required, or 15 bricks!

A Final Note

One hidden unit factor present in our zig-zag calculations has been the length of fence one curve closes off. In the calculations above, it was a unit meter.

Of course, we could consider covering not just 1 meter with one “zig zag” but several. A longer length that we cover between zigs and zags will, for a fixed amount of building material per unit length, result in more gentle zig and zag slopes (since we only do one zig and one zag per “period”). Many sawtooths in a row are inefficient at reaching the same amplitude.

This leads to other constraints on practicality: can you really maintain a zig-zag shape across such long lengths? Moreover, underlying our torque computations is the fact that the fence is assumed to be a rigid lever. The longer out this math is applied, the less realistic this becomes. That said, the simplistic model is quite helpful. For a particular length of fence, zig-zag, don’t sinusoid!

Special thanks to Tom Hartke who gave this post a physicist’s review!

Babi Yar

Sun, 31 Jul 2022 00:00:00 +0000

This post is not like my usual posts.

Over the last two weeks, I went on a Birthright trip to Israel. It was an uncharacteristically spiritual journey for me.

I was agnostic as a child. But this didn’t get in the way of my Jewish self-definition. Early on, I heard stories from my grandpa about his grandma’s execution by the Einsatzgruppen. In my ancestors’ deaths was a conviction for my own identity.

My group visited Yad Vashem a few days into the trip. I didn’t expect to learn anything new. I kept quiet. I braced for a heavy day.

Of all the exhibits there, one spoke to me most loudly. Maybe because it was in Russian. It was just a simple poem excerpt, describing a swamp called Babi Yar.

Над Бабьим Яром шелест диких трав.

Деревья смотрят грозно,

по-судейски.

Всё молча здесь кричит.

Translated,

Over Babi Yar wild grasses rustle.

Trees look sternly, as if in judgment.

Everything here screams silently.

A picture of the valley echoes the poem, by Yevgeni Yevtushenko.

Here the entire Jewish population of Kiev was massacred; some whose names might never be known. Only their screams are captured silently in the grass which remains. With the current war it’s deeply saddening that even the greenery has no rest.

Retelling my experience to my grandpa, he informed me that I had Ukrainian branches of family who were there, too. My grandmother told me their story. This was news to me. Yad Vashem curators make a point to separate the number 6 million piece by painful piece. With light editing, I present her words below.

My Great-Grandfather

By Alla Voloshina.

My maternal great-grandfather, Shenfeld David Solomonovich, was born in 1855 in a small Jewish village in Ukraine. These types of small Jewish villages in 19th-20th centuries were called mestechkos.

His parents were very poor and had a lot of children. He and his brother took turns going to school in the winter because they had only one pair of boots for the two of them. Jewish people in Ukraine spoke a Hebrew-German dialect, Yiddish. When he was 12, David earned a little bit of money by helping his next-door neighbor, which he used to buy a Russian language textbook. He learned how to read and write from it. At 14, he left his family and his mestechko to look for a better life.

His goal was to find a place where he could live and learn some craft, paying for food and shelter with his work. He was a tall, strong, and handsome young man, looking older than his age. His fluency extended to Russian, Hebrew, and Yiddish. All of this helped him eventually be hired as an apprentice at a little vinegar production shop. Young David was a fast learner. He could follow instructions for recipes and preparation methods to a tee. Gradually, he became an expert in vinegar production and a partner of the shop owner.

When the kinless owner retired, he left his shop to David. He was about 19 years old by then and already known in Ukraine as a vinegar production specialist. I don’t know anything more about how he expanded production or added some new products. I only know that he became a First Guild Merchant, which allowed him to live and to have property in Kiev (as there were restrictions on where Jewish people were allowed to live in Russia). He married when he was 30, had two daughters, Fania and Clara, and a son, Peter, my future grandfather. When the Revolution of 1905 in Russia was brutally suppressed, his daughters were university students. They participated in student protests against police brutality and repression. The girls were arrested on multiple occasions. Each time their father would go to the police, pay the Policmeister 100 rubles in silver, and take his daughters home.

His son Peter started to work at his father’s factory when he was still in Gymnasia (a selective private school in Russia before the revolution of 1917). It was expected that he would go to university to study organic chemistry, but he met my future grandmother, and decided that he wanted to get married. He was 18, and my grandmother was 16.

When I was a girl, I used to ask my grandmother why she agreed to get married so young. I still remember the smile on her face when she answered me: “Why wouldn’t I agree? He was very handsome, very rich, and very much in love with me. I was very happy to marry him.” They married in 1913. By that time both of my great-grandfather’s daughters were married, had one daughter each, and Fania had separated from her husband. Her husband was an opera singer, and a gambler. He lost his wife’s dowry to gambling and disappeared. In 1914, Russia entered World War One, and Fania became a front line military doctor. She lived a long life, was an accomplished doctor, but never married again.

My great-grandfather owned a two-story house on Obolonskaya Street in Kiev. His family, including my grandparents, lived on the second floor of that house. The family of the only hired worker of his factory resided on the first floor of the same house. The vinegar production factory was located next door.

Vinegar production is somewhat similar to wine production. The main production tools are some huge wooden barrels called chun, where the chemistry of converting input materials (grains or fruits) into vinegar takes place. My great-grandfather, his son, and the hired worker took care of all production needs. The factory became very successful; it earned David Solomonovich respect as a merchant as product popularity grew. He earned enough money to provide for his family, paid his worker well, and invested extra money into real estate. He owned several apartment houses. Middle-class folks, artists, writers, and actors rented apartments in his houses. The world-famous Yiddish writer Sholom Aleichem lived in one of his apartments for several years. The script of The Fiddler on the Roof is based on the stories written by Sholom Aleichem. David Solomonovich used to tease his daughters on Friday evening after he had a glass of wine, saying that he was in the mood to go to Sholom Aleichem to talk about literature and art. Like all children, the girls were mortified with embarrassment that their father would do such an awkward thing.

David Solomonovich considered these apartment houses to be a reliable and profitable investment. But the Great October Revolution of 1917 proved him to be wrong. Everything that he owned was nationalized. All personal valuables that he and his family had: gold and silver coins, watches, women’s jewelry and so on, were confiscated. The new authorities moved into his house several families of revolutionaries, leaving for my great-grandfather’s family two rooms in his own house. All capitalists, landlords, religious figures, and members of their families were denied the right to vote for 20 years.

Soon all aspects of life in the cities and villages of Russia fell to ruin. The bloody civil war lasted for more than five years. Famines in 1921-1923 and in 1932-1933, direct results of Soviet rule, led to the deaths of 12-13 million Ukrainians.

Despite the cruelties of Soviet authorities, my great-grandfather did not lose his optimism. His faith, love of his family, talents, energy, wisdom and abilities helped him adjust to the new realities, and to continue his life as a decent, respected person, who was admired by his family, and by everyone who knew him.

Quickly enough, the new administration of his former vinegar factory realized that knowledge of Marxist theory did not help with vinegar production. My great-grandfather was invited to work at his former factory as a consultant, and my grandfather continued to do the manual work.

My mom was born in 1922. She cherished her grandfather’s memory, and I loved to listen to her stories about him. She loved everything about him, but she was especially proud of his vast knowledge in theology, history, chemistry, and other sciences because everything he knew he learned by himself.

From 1930 to 1941, he traveled all over the European part of the Soviet Union to cure barrels of different vinegar production factories. His last business trip was in the spring of 1941, when he was 86 years old, just before Germany invaded the Soviet Union.

Germany attacked the Soviet Union on June 22nd, 1941. Very soon my great-grandfather realized that Kiev would be occupied by the Germans. He insisted that his daughters and son with their families leave Kiev. However, he himself was not able to leave because his wife was gravely ill from heart disease and could not travel.

Soviet authorities were organizing the evacuation of important establishments, educational and scientific institutions, hospitals, and factories. My mom, as a student, was supposed to be evacuated along with Kiev State University, and only at the very last minute was she allowed to take her parents with her. My mom’s father, Peter Davidovich, was devastated, because he had to leave his parents behind. He had a weak heart, and he just died of grief four months after leaving Kiev, when he learned about the fate of his parents.

Within three months after the initial German attack on the Soviet Union, on September 19, 1941, German forces entered Kiev, the capital of Soviet Ukraine. On September 29-30, they murdered most of the Jewish population of Kiev, over 33 thousand men, women and children, at Babi Yar, a ravine northwest of the city.

This was how the life of my great-grandparents ended.

I am very proud to be a descendant of the bright, talented, very kind and wise person, Shenfeld David Solomonovich, and I want my descendants to know about him.

Finding the Most Popular SciHub Articles with Approximate Heavy Hitters

Fri, 01 Oct 2021 00:00:00 +0000

I’ve added heavy hitters functionality to the dsrs crate (in addition to a variant of Count-Min). It’s another streaming algorithm which helps us find the most popular repeated lines in a stream. In this blog post, we’ll see how this approximate algorithm saves memory over an exact approach.

For instance, maybe we have access logs which contain IP addresses like so:

1.1.1.1
1.2.3.4
1.1.1.1
2.1.2.1
1.1.1.1

where there could be millions of unique IP addresses accessing our server, but we’d only be interested in monitoring the ones like 1.1.1.1 that access it most often to check for possible malicious behavior such as a DoS attack. In principle, we could track every single unique IP address and how often it appears in the log, but this’d require as much memory as there are unique IPs. If we’re only interested in the top-$k $ IPs by frequency, could we do better?

Indeed, if we’re willing to give approximate answers! Sketching approaches have nuanced guarantees, but generally work well in practice. The dsrs library provides an API for the heavy hitters sketch, which accepts a textual stream and returns the approximate top-$k $ most popular items in that stream.
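To make "approximate top-$k $ in bounded memory" concrete, here is a minimal sketch of the classic Misra-Gries summary. It is an illustration only: it may well differ from what dsrs implements (dsrs reports upper bounds on counts, whereas Misra-Gries counters are lower bounds), and the function name `misra_gries` and the toy addresses are made up for this example.

```python
def misra_gries(stream, k):
    """Keep at most k counters; any item occurring more than n/(k+1)
    times in a stream of length n is guaranteed to survive."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free slot: decrement everything and drop exhausted counters.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Toy access log (illustrative addresses, not real data).
log = ["1.1.1.1", "1.2.3.4", "1.1.1.1", "2.1.2.1", "1.1.1.1"]
summary = misra_gries(log, k=2)
print(summary)
```

Memory is bounded by `k` counters no matter how many unique lines appear, which is the whole appeal. The surviving counts underestimate the true frequencies, which is one reason sketch counts should not be read as exact.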

Tim Bray has a tuned Go package which I’ve installed as tf below which answers the exact top-$k $ query. Over several blog posts, Tim’s package has evolved to be mostly I/O bound. So it’ll be tough competition for the approximate approach.

In the experiment below, we seek to answer: what are the 10 most popular of the 28 million SciHub article downloads from September 2015 to February 2016? The data weighs in at 2.6 GB, so we’ll see which approach best answers this question on my laptop!

%%bash

cd /tmp
test -f scihub.zip || curl -s -o scihub.zip -L https://datadryad.org/stash/downloads/file_stream/1483
du -hs scihub.zip
unzip -qn scihub.zip
test -d topfew && test -f topfew/bin/tf || ( \
  git clone git@github.com:timbray/topfew.git 2>/dev/null && \
  cd topfew && make 2>&1 >/dev/null)

echo 'will cite' | parallel --citation 1> /dev/null 2> /dev/null 

du -hsc scihub_data/*.tab | tail -1

parallel --pipepart wc -l :::: scihub_data/*.tab \
  | awk '{s+=$1}END{print s " downloads"}'

653M	scihub.zip
2.6G	total
27819965 downloads

# the true exact top-10 most downloaded articles via tbray's topfew
! cat /tmp/scihub_data/*.tab | /usr/bin/time -f "%e sec %M KB" /tmp/topfew/bin/tf -f 3 -n 10

10.1007/978-1-4419-9716-6_11
10.1056/NEJMoa1402121
10.1116/1.4904970
10.1103/PhysRevB.63.224204
10.1182/asheducation-2015.1.8
10.4028/www.scientific.net/AMM.7-8.159
10.1111/j.1365-277X.2004.00520.x
10.1002/pmic.200600525
10.1161/CIRCRESAHA.117.306290
10.1002/smll.201002009
83 sec 2128580 KB

# approximate top-10 (along with very weak upper bounds of counts)
! cat /tmp/scihub_data/*.tab | cut -d$'\t' -f2 | /usr/bin/time -f "%e sec %M KB" dsrs --hh 10

1112828 10.1002/ppsc.201300314
1112828 10.1016/j.physio.2015.03.3636
1112828 10.1177/014920638701300408
1112828 10.1053/j.gastro.2015.08.004
1112828 10.1002/jbm.a.31063
1112828 10.1645/0022-3395(2000)086[1137:EAISMS]2.0.CO;2
1112828 10.1016/j.biortech.2014.11.112
1112828 10.1016/j.reval.2014.02.154
1112828 10.1016/j.tet.2015.07.005
1112828 10.2174/1568026023394443
11.49 sec 4716 KB

%%bash
# hoping that a sketch with only ~10 slots of space can recover the exact top 10 is wishful thinking
# but it really doesn't take that much to get to the top-10. Asking for an *approximate* top-4100
# gets us to the *exact* top-10
cd /tmp
M=4100
cat scihub_data/*.tab | cut -d$'\t' -f2 | /usr/bin/time -f "%e sec %M KB" dsrs --hh $M > hh-lots
cat scihub_data/*.tab | topfew/bin/tf -f 3 -n 10 > exact

# right outer join minus inner join should be empty if the second argument is a subset
join -v2 <(cut -d" " -f2 hh-lots | sort) <(cut -d" " -f2 exact | sort)

11.04 sec 5868 KB

In the logs above, we observe the total runtime and memory use in KB for a tuned Go implementation based on a hashmap versus two approximate competitors: approximate top-$k $ and top-$M $, where $M $ was found via binary search as roughly the smallest constant for which all of the true top-$k $ articles appear.

We notice a couple of things:

The estimates from the sketch can’t be trusted (nor do they ever purport to be that trustworthy). However, a low-memory second pass could be used to recover exact counts for just the heavy hitters selected by the sketch.
The approximate approach significantly improves on both runtime and memory usage. Even with the larger $M=4100 $ necessary to recover the true top-$k $ at $k=10 $, the approximation was about $2\times $ faster and used $362\times $ less memory!
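The low-memory second pass mentioned above can be sketched as follows. This is a hypothetical helper, not part of dsrs: `exact_counts_for` and the toy stream are illustrative names and data, and the candidate set stands in for whatever the sketch returned.

```python
def exact_counts_for(stream, candidates):
    """Second pass: count only the candidates the sketch selected.

    Memory is proportional to len(candidates), not to the number of
    unique items in the stream.
    """
    counts = dict.fromkeys(candidates, 0)
    for item in stream:
        if item in counts:
            counts[item] += 1
    return counts

# Toy stream; in practice this would iterate over the log lines again.
stream = ["a", "b", "a", "c", "a", "b", "d"]
candidates = {"a", "b"}  # pretend these came out of the sketch
counts = exact_counts_for(stream, candidates)
top = sorted(counts.items(), key=lambda kv: -kv[1])
print(top)
```

With $M=4100 $ candidates, the dictionary stays tiny compared to a hashmap over every unique article, at the cost of reading the data twice.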

I hope this motivates you to try out dsrs next time you have a lot of logfiles to churn through but don’t want to reach for a heavyweight distributed computing solution.

Try the notebook out yourself.

Just in case you were curious for the actual names:

10.1007/978-1-4419-9716-6_11 Full-scale modal wind turbine tests: comparing shaker excitation with wind excitation. Conference Proceedings of the Society for Experimental Mechanics Series, 113–124
10.1056/NEJMoa1402121 
10.1116/1.4904970 Photosensitive field emission study of SnS2 nanosheets. Journal of Vacuum Science & Technology B, Nanotechnology and Microelectronics: Materials, Processing, Measurement, and Phenomena, 33(3), 03C106
10.1103/PhysRevB.63.224204 Griffiths effects and quantum critical points in dirty superconductors without spin-rotation invariance: One-dimensional examples. Physical Review B, 63(22)
10.1182/asheducation-2015.1.8 Iron deficiency: new insights into diagnosis and treatment. Hematology, 2015(1), 8–13
10.4028/www.scientific.net/AMM.7-8.159 Monitoring the Evolution of Fatigue in Corrugated Paperboard under Random Loads. Applied Mechanics and Materials, 7-8, 159–164
10.1111/j.1365-277X.2004.00520.x Intentional mis-reporting of food consumption and its relationship with body mass index and psychological scores in women. Journal of Human Nutrition and Dietetics, 17(3), 209–218
10.1002/pmic.200600525 Conifer defense against insects: Proteome analysis of Sitka spruce (Picea sitchensis) bark induced by mechanical wounding or feeding by white pine weevils (Pissodes strobi). PROTEOMICS, 7(2), 248–270
10.1161/CIRCRESAHA.117.306290 Efficient Gene Disruption in Cultured Primary Human Endothelial Cells by CRISPR/Cas9Novelty and Significance. Circulation Research, 117(2), 121–128
10.1002/smll.201002009 Graphene-Based Materials: Synthesis, Characterization, Properties, and Applications. Small, 7(14), 1876–1902

Scatter Reduction (Numpy Gems, Part 5)

Wed, 15 Sep 2021 00:00:00 +0000

The VLAD (vector of locally aggregated descriptors) (no relation!) algorithm was proposed as a mechanism for compacting image descriptors (related follow-on work). This is useful for creating similarity search indices.

A reader of my blog referred me to this algorithm, noting that the supposedly vectorized version turns out slower than non-vectorized code. We review indexing and broadcasting rules to diagnose the slowdown and prescribe a fix with a lesser-known numpy gem for what’s known as a scatter-reduce operation. If you’ve ever found yourself in a numpy setting wanting to “collect” the sum of some data into buckets determined at runtime, this gem is for you.

Along the way, we’ll learn some pretty surprising facts about vectorization! What I thought was an idiomatic numpy solution using ufunc.at turns out to be pretty non-performant! Luckily, a solid vectorized solution exists out there, gaining us a 4x improvement over the original looping code on a TPU (via Colab).

from IPython.display import Image
Image(filename='2021-09-15-vlad.png') 

For each of our $T $ images, we have $L $ “local descriptions”, i.e., some $F $-dimensional feature vectors that semantically characterize certain attributes of each image. In other words, every image $t\in[T] $ has $L $ vectors $\textbf{v}_\ell^{(t)}\in\mathbb{R}^F $. Since $L $ may be large, and we’d like to compact the $L\times F $ data that we have per image, across all $L\times T $ feature vectors, we may consider some $K $ centroids $\textbf{c}_k $ from a $K $-means computation (again, looking indiscriminately across all local descriptions).

Then VLAD defines, for each image $t $, a set of residual vectors, each of which is the sum of the errors that you get by approximating each local description with its closest centroid, i.e., for $k\in[K] $, $\textbf{r}_{k}^{(t)}=\sum_{\ell\in M_k^{(t)}}\left(\textbf{v}_{\ell}^{(t)}-\textbf{c}_k\right) $, where

\[M_k^{(t)}=\{\ell\in[L]\,\big|\,k=\mathrm{NN}(\textbf{v}_\ell^{(t)})\}\,\,,\]

and the nearest neighbor function $\mathrm{NN}$ is given by

\[\mathrm{NN}(\textbf{x})=\mathrm{argmin}_{k'}\|\textbf{c}_{k'}-\textbf{x}\|_2\,\,. \]

Finally, the actual VLAD encoding for the $t $-th image is the $D $-sized vector $n(\textbf{u}) $, where $\textbf{u}=\mathrm{stack}\left(\{\textbf{r}_k^{(t)}\}_{k\in[K]}\right) $, $D=K\times F $, and $n $ is some normalization (varies between implementations). Speaking very intuitively, we can imagine the components of VLAD as being something like the score statistic for a GMM, assuming the latent assignment (from local description to the cluster it belongs to) is known.

VLAD is quite compute-intensive, and since its hyperparameters (such as $F $ and $K $) require tuning, optimizing its transformation can be a big quality-of-life improvement!

A direct implementation is below, along with an initial vectorization attempt I found on Stack Overflow (but re-written to make broadcasting clear). Note that for simplicity we have $L $ be a constant, but the approach described in this blog post can be extended to allow for an image-dependent number of local descriptors $L^{(t)} $.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def looping(kmeans: MiniBatchKMeans, local_tlf):
    k, (t, l, f) = kmeans.n_clusters, local_tlf.shape
    centers_kf = kmeans.cluster_centers_
    vlad_tkf = np.zeros((t, k, f))
    for vlad_kf, local_lf in zip(vlad_tkf, local_tlf):
        label_l = kmeans.predict(local_lf)
        for i in range(k):
            vlad_kf[i] = np.sum(local_lf[label_l == i] - centers_kf[i], axis=0)
        vlad_D = vlad_kf.ravel()
        vlad_D = np.sign(vlad_D) * np.sqrt(np.abs(vlad_D))
        vlad_D /= np.linalg.norm(vlad_D)
        vlad_kf[:,:] = vlad_D.reshape(k, f)
    return vlad_tkf.reshape(t, -1)

def naivec(kmeans: MiniBatchKMeans, local_tlf):
    k, (t, l, f) = kmeans.n_clusters, local_tlf.shape
    centers_kf = kmeans.cluster_centers_
    labels_tl = kmeans.predict(local_tlf.reshape(-1,f)).reshape(t, l)
    mask_tlk = labels_tl[..., np.newaxis] == np.arange(k)
    local_tl1f = local_tlf[...,np.newaxis,:]
    delta_tlkf = local_tl1f - centers_kf # <-- easy to run out of memory
    vlad_tD = (delta_tlkf * mask_tlk[..., np.newaxis]).sum(axis=1).reshape(t, -1)
    vlad_tD = np.sign(vlad_tD) * np.sqrt(np.abs(vlad_tD))
    vlad_tD /= np.linalg.norm(vlad_tD, axis=1, keepdims=True)
    return vlad_tD

On a simple benchmark, we notice the vectorization performs poorly. The reason boils down to doing a lot more work than necessary on the delta_tlkf = local_tl1f - centers_kf and (delta_tlkf * mask_tlk[..., np.newaxis]).sum(axis=1) steps. We’re using a dense 4D tensor but only ever looking at 3 dimensions of it!
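To put a rough number on the wasted work, here is a back-of-the-envelope sketch. It assumes float64 data and the shapes of the benchmark in this post, with 16 clusters to match the `n_clusters=16` passed to `MiniBatchKMeans`; the variable names are only for illustration.

```python
import numpy as np

t, l, f, k = 256, 128, 64, 16   # k taken from n_clusters=16 in the benchmark
bytes_per_float = 8             # float64

dense_elems = t * l * k * f     # size of delta_tlkf
useful_elems = t * l * f        # one selected cluster per local description
dense_mib = dense_elems * bytes_per_float / 2**20
print(dense_mib, dense_elems // useful_elems)

# Tiny check that such a mask is one-hot along k: exactly one True per (t, l).
rng = np.random.default_rng(0)
labels_tl = rng.integers(0, k, size=(4, 5))
mask_tlk = labels_tl[..., np.newaxis] == np.arange(k)
print(mask_tlk.sum(axis=-1).min(), mask_tlk.sum(axis=-1).max())
```

Under these assumptions the dense intermediate takes 256 MiB, and only one in every `k` of its entries survives the mask.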

np.random.seed(1234)
# usually there are a lot more images than this
t, l, f, k = 256, 128, 64, 16
X = np.random.randn(t, l, f)
km = MiniBatchKMeans(n_clusters=k, n_init=10, random_state=0)
km.fit(X.reshape(-1, f))

result_looping = looping(km, X)
result_naivec = naivec(km, X)

%timeit looping(km, X)
%timeit naivec(km, X)

assert np.allclose(result_looping, result_naivec)

197 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
313 ms ± 33.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

While masks like this are often an easy way of vectorizing, they can result in wasted work. This brings us to our numpy gem: if you find yourself using a mask which is only nonzero in exactly one (or a few) locations for a dimension, then you can replace it with one (or a few) calls to ufunc.at.

The core of the VLAD computation that’s difficult to vectorize is the residual sum: $\textbf{r}_{k}^{(t)}=\sum_{\ell\in M_k^{(t)}}\left(\textbf{v}_{\ell}^{(t)}-\textbf{c}_k\right) $. But we could represent $M_k^{(t)} $ across all $k,t $ with a label array and in a sense such a label array would tell us “where” to add the corresponding $\textbf{v}_{\ell}^{(t)} $. In scientific computing, this is called a scatter operation.

In numpy, you may have attempted to use in-place advanced indexing addition or reduction but found that, due to buffering, duplicate values are not collected together. For instance:

x = np.zeros(5)
x[[1, 1, 2, 2]] += np.ones(4)
x # want [0, 2, 2, 0, 0] but instead get...

array([0., 1., 1., 0., 0.])

The above behavior isn’t obvious from the advanced indexing docs, since it requires an understanding of how python de-sugars the in-place addition. As mentioned before, ufunc.at saves the day.

x = np.zeros(5)
np.add.at(x, [1, 1, 2, 2], np.ones(4))
x

array([0., 2., 2., 0., 0.])
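As an aside that goes beyond the walkthrough itself: for a 1-D target with non-negative integer indices, `np.bincount` with `weights` computes the same scatter-add. A minimal sketch comparing the two:

```python
import numpy as np

idx = np.array([1, 1, 2, 2])
vals = np.ones(4)

# Scatter-add with ufunc.at, as above.
via_at = np.zeros(5)
np.add.at(via_at, idx, vals)

# Same result from bincount: weights are summed per index value.
via_bincount = np.bincount(idx, weights=vals, minlength=5)

print(via_at, via_bincount)
```

`np.bincount` only accepts 1-D inputs, so it does not directly replace the `(M, f)`-shaped scatter needed for VLAD; it is just a handy alternative for the flat case.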

We can now re-visit VLAD vectorization.

def truvec(kmeans: MiniBatchKMeans, local_tlf):
    k, (t, l, f) = kmeans.n_clusters, local_tlf.shape
    centers_kf = kmeans.cluster_centers_
    labels_tl = kmeans.predict(local_tlf.reshape(-1,f)).reshape(t, l)
    
    vlad_tkf = np.zeros((t, k, f))
    M = t * k
    labels_tl += np.arange(t)[:, np.newaxis] * k
    vlad_Mf = vlad_tkf.reshape(-1, f)
    np.add.at(vlad_Mf, labels_tl.ravel(), local_tlf.reshape(-1, f))
    counts_M = np.bincount(labels_tl.ravel(), minlength=M)
    vlad_tkf -= counts_M.reshape(t, k, 1) * centers_kf
    
    vlad_tD = vlad_tkf.reshape(t, -1)
    vlad_tD = np.sign(vlad_tD) * np.sqrt(np.abs(vlad_tD))
    vlad_tD /= np.linalg.norm(vlad_tD, axis=1, keepdims=True)
    return vlad_tD

result_truvec = truvec(km, X)
assert np.allclose(result_looping, result_truvec)
%timeit truvec(km, X)

208 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Wow! I admit I was pretty surprised by this initially. While this is certainly an improvement over naivec, why is a python loop performing about as well as a supposedly-vectorized approach?

It turns out the so-called “supposedly-vectorized approach” is actually hurting vectorization. The code np.sum(local_lf[label_l == i], axis=0) can leverage actual strided vectorized addition instructions, since it’s summing a contiguous array (generated on-the-fly) whereas np.add.at is forced to use non-vectorized adds as it’s adding in arbitrarily-located slices.

So can we do better with an approach that avoids python for-loops? Indeed, but we’ll need to re-use a previous numpy gem. In effect, we should perform scatter reduction by re-interpreting the array as a ragged array. In doing so, and applying the techniques from the Vectorizing Ragged Arrays blog post, we can perform contiguous reductions which the looping approach requires to be strided!
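Before the full implementation, a toy 1-D version of the idea may help: sort values by label so each bucket becomes a contiguous run, then reduce each run. Note that this sketch swaps in `np.add.reduceat` for the cumulative-sum-and-difference trick used in the post, and it assumes every label occurs at least once; the array names are illustrative.

```python
import numpy as np

labels = np.array([2, 0, 1, 0, 2, 2])
values = np.array([1., 2., 3., 4., 5., 6.])

# Sort so that all values sharing a label sit next to each other.
order = np.argsort(labels, kind="stable")
sorted_labels = labels[order]
sorted_values = values[order]

# Start offset of each run of equal labels (assumes no label is missing).
starts = np.flatnonzero(np.diff(sorted_labels, prepend=-1))

# Contiguous segment sums, one per label.
sums = np.add.reduceat(sorted_values, starts)

# Reference answer via the scatter-add from earlier.
expected = np.zeros(3)
np.add.at(expected, labels, values)
print(sums, expected)
```

The sort costs $O(N\log N) $, but afterwards every reduction runs over contiguous memory, which is the property the ragged-array approach exploits.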

def optvec(kmeans: MiniBatchKMeans, local_tlf):
    k, (t, l, f) = kmeans.n_clusters, local_tlf.shape
    centers_kf = kmeans.cluster_centers_
    label_tl = kmeans.predict(local_tlf.reshape(-1,f)).reshape(t, l)
    
    vlad_tkf = np.zeros((t, k, f))
    M = t * k
    label_tl += np.arange(t)[:, np.newaxis] * k
    N = t * l
    label_N = label_tl.reshape(N)
    local_fN = local_tlf.reshape(N, f).T
    ix_N = np.argsort(label_N)
    local_fN = local_fN[:, ix_N]
    label_N = label_N[ix_N]
        
    # ragged array vectorization, see linked post
    label_switch_N = np.diff(label_N, prepend=0)
    pos = np.flatnonzero(label_switch_N)
    pos = np.repeat(pos, label_switch_N[pos])
    pos = np.append(pos, N)
    pos -= 1
    assert len(pos) == M, (len(pos), M)
    pos_M = pos
    
    np.cumsum(local_fN, axis=1, out=local_fN)
    clustered_fM = np.diff(local_fN[:, pos_M], axis=1, prepend=np.zeros_like(local_fN[:,0:1]))
    vlad_tkf = clustered_fM.T.reshape(t, k, f)
    
    counts_M = np.diff(pos_M, prepend=-1)
    vlad_tkf -= counts_M.reshape(t, k, 1) * centers_kf
    
    vlad_tD = vlad_tkf.reshape(t, -1)
    vlad_tD = np.sign(vlad_tD) * np.sqrt(np.abs(vlad_tD))
    vlad_tD /= np.linalg.norm(vlad_tD, axis=1, keepdims=True)
    return vlad_tD

result_optvec = optvec(km, X)
assert np.allclose(result_looping, result_optvec), sum(result_looping != result_optvec)

%timeit optvec(km, X)

127 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

OK, so with all this work, we’ve gotten a modest improvement of around 40%, but I’m sure in some settings, such as when $F $ is large or $L\gg K $, the original python loop approach works fine. This is because there’s enough work being done in the inner loop of the looping() function that additional vectorization is not that necessary.

However, one benefit to the final implementation is it could, with a little bit of effort, be translated to jax. This would let us run the same computation entirely on a hardware accelerator. Experimenting with this (Colab), I was able to get a 4x improvement over the original looping code by leveraging a TPU. Now that’s more like it!

I’d like to extend a special thank you to Ashwin Nair for emailing me this interesting challenge. I always find these numpy questions fun and encourage my other readers to reach out! If they require a new technique I haven’t covered in my previous “Numpy Gems” posts, I’d be happy to take a look.

Try the notebook out yourself.

Amazon Reviewers With Sketches

Tue, 29 Jun 2021 00:00:00 +0000

To show off a recent command line tool for sketching, dsrs, let’s plot the rolling 28-day average daily count of active reviewers on Amazon.

The raw data here is item,user,rating,timestamp so this would map to a sophisticated GROUP BY with a COUNT DISTINCT over 28-day windows in SQL. But since the data’s only available as CSV, how can we get to the same answer? If we’re just interested in an approximate solution, can we do this without using a bunch of memory or custom (shuffle-inducing…) sliding window implementation?

All timings below done on a 16-physical CPU machine (AWS r4.8xlarge).

# https://nijianmo.github.io/amazon/index.html
# 6.7gb
# May 1996 - Oct 2018, e.g.:
# 0449819906,A3U4E9PIZ8OWH1,5.0,1383696000
# timestamp is then unix time in seconds.
prefix = 'http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/'
review_data = {
  'Amazon Fashion': 'AMAZON_FASHION.csv',
  'All Beauty': 'All_Beauty.csv',
  'Appliances': 'Appliances.csv',
  'Arts, Crafts and Sewing': 'Arts_Crafts_and_Sewing.csv',
  'Automotive': 'Automotive.csv',
  'Books': 'Books.csv',
  'CDs and Vinyl': 'CDs_and_Vinyl.csv',
  'Cell Phones and Accessories': 'Cell_Phones_and_Accessories.csv',
  'Clothing, Shoes and Jewelry': 'Clothing_Shoes_and_Jewelry.csv',
  'Digital Music': 'Digital_Music.csv',
  'Electronics': 'Electronics.csv',
  'Gift Cards': 'Gift_Cards.csv',
  'Grocery and Gourmet Food': 'Grocery_and_Gourmet_Food.csv',
  'Home and Kitchen': 'Home_and_Kitchen.csv',
  'Industrial and Scientific': 'Industrial_and_Scientific.csv',
  'Kindle Store': 'Kindle_Store.csv',
  'Luxury Beauty': 'Luxury_Beauty.csv',
  'Magazine Subscriptions': 'Magazine_Subscriptions.csv',
  'Movies and TV': 'Movies_and_TV.csv',
  'Musical Instruments': 'Musical_Instruments.csv',
  'Office Products': 'Office_Products.csv',
  'Patio, Lawn and Garden': 'Patio_Lawn_and_Garden.csv',
  'Pet Supplies': 'Pet_Supplies.csv',
  'Prime Pantry': 'Prime_Pantry.csv',
  'Software': 'Software.csv',
  'Sports and Outdoors': 'Sports_and_Outdoors.csv',
  'Tools and Home Improvement': 'Tools_and_Home_Improvement.csv',
  'Toys and Games': 'Toys_and_Games.csv',
  'Video Games': 'Video_Games.csv'
}
review_data = {k: prefix + v for k, v in review_data.items()}

Even with a 28d sliding window, if we’re sliding by a day, it’s still quite a few data points.

import pandas as pd
(pd.Timestamp('Oct 2018') - pd.Timestamp('May 1996')) / pd.Timedelta('1d')

8188.0

Store all urls in a variable

from shlex import quote
urls = ' '.join(list(map(quote, review_data.values())))

%%bash -s {urls}

echo 'will cite' | parallel --citation 1> /dev/null 2> /dev/null 

parallel curl -o "/tmp/amazon{#}.csv" -s {} ::: "$@"

%%bash

# Total data size
du -hsc /tmp/amazon*.csv | tail -1

# How many reviews?
parallel --pipepart wc -l :::: /tmp/amazon*.csv \
  | awk '{s+=$1}END{print s}'

9.0G	total
230139802

%%bash

# How many users?
parallel --pipepart 'cut -d, -f2 | dsrs --raw' :::: /tmp/amazon*.csv \
  | dsrs --merge

43404924

%%writefile /tmp/date-user-extract.awk
#!/usr/bin/awk

BEGIN {
    FS = "," 
}

1 {
    user = $2;
    epoch_sec = $4;
    # round down to nearest day
    rounded_epoch_sec = strftime("%Y %m %d 00 00 00", epoch_sec);
    rounded_epoch_sec = mktime(rounded_epoch_sec)
    for (i = 0; i < 28; i += 1) {
        dt = strftime("%F", rounded_epoch_sec);
        print dt " " user
        # a day can be more than this many seconds due to leaps but
        # since we only decrement 28 times the undershoot doesn't matter
        rounded_epoch_sec -= 86400
    }
}

Overwriting /tmp/date-user-extract.awk

%%bash

# test date mapper
echo 0449819906,A3U4E9PIZ8OWH1,5.0,1383696000 | awk -f /tmp/date-user-extract.awk | head -3

2013-11-06 A3U4E9PIZ8OWH1
2013-11-05 A3U4E9PIZ8OWH1
2013-11-04 A3U4E9PIZ8OWH1
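
For readers less comfortable with awk, here is a rough Python equivalent of the date mapper above. It is only a sketch: the function name `expand_review` is made up for illustration, and it assumes timestamps are interpreted in UTC (the awk script uses the machine's local timezone).

```python
from datetime import datetime, timedelta, timezone

WINDOW_DAYS = 28  # same window length as the awk loop


def expand_review(user, epoch_sec, window=WINDOW_DAYS):
    """Yield (date, user) for the review day and the window-1 days before it."""
    # Round down to the day; UTC is an assumption of this sketch.
    day = datetime.fromtimestamp(epoch_sec, tz=timezone.utc).date()
    for i in range(window):
        yield (day - timedelta(days=i)).isoformat(), user


line = "0449819906,A3U4E9PIZ8OWH1,5.0,1383696000"
_, user, _, ts = line.split(",")
rows = list(expand_review(user, int(ts)))
for date, u in rows[:3]:
    print(date, u)
```

Each review becomes 28 (date, user) pairs, so counting distinct users per date gives the 28-day window count.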

%%bash
  
# How many 28d users?
parallel --pipepart 'awk -f /tmp/date-user-extract.awk' :::: /tmp/amazon*.csv \
  | dsrs --key >/tmp/ts

t = pd.read_csv('/tmp/ts', delimiter=' ', names=["date", "cnt"])
t.set_index("date", inplace=True, verify_integrity=True)
t.sort_index(inplace=True)
t.head()

	cnt
date
1996-04-23	1
1996-04-24	1
1996-04-25	1
1996-04-26	1
1996-04-27	1

from matplotlib import pyplot as plt
%matplotlib inline
(t/28).plot(rot=45, legend=False)
plt.xlabel("date")
plt.ylabel("users")
plt.title("28-day rolling average amazon reviewers")
plt.show()

%%bash

start=`date +%s`
parallel --pipepart 'cut -d, -f2' :::: /tmp/amazon*.csv \
  | awk '{a[$1]=1}END{print length(a)}'
end=`date +%s`
echo "How many users? awk time" $((end-start)) "sec"
echo

start=`date +%s`
parallel --pipepart 'cut -d, -f2' :::: /tmp/amazon*.csv \
  | dsrs
end=`date +%s`
echo "How many users? serial sketching time" $((end-start)) "sec"
echo

start=`date +%s`
parallel --pipepart 'cut -d, -f2 | dsrs --raw' :::: /tmp/amazon*.csv \
  | dsrs --merge
end=`date +%s`
echo "How many users? parallel sketching time" $((end-start)) "sec"

43249276
How many users? awk time 190 sec

43206238
How many users? serial sketching time 11 sec

43404924
How many users? parallel sketching time 4 sec
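
The three counts differ slightly because the sketches are approximate. To give a feel for why the `--raw` and `--merge` split parallelizes so well, here is a minimal mergeable distinct-count sketch in Python. It uses the k-minimum-values technique, which I picked for brevity; it is not necessarily what dsrs does internally, and the names `sketch`, `merge`, `estimate` and the size `K` are hypothetical.

```python
import hashlib

K = 256  # sketch size (hypothetical choice); larger K gives lower error


def h(item):
    """Hash a string to a float in [0, 1)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def sketch(items, k=K):
    """Keep only the k smallest distinct hash values seen."""
    smallest = set()
    for x in items:
        smallest.add(h(x))
        if len(smallest) > k:
            smallest.remove(max(smallest))
    return sorted(smallest)


def merge(a, b, k=K):
    """Combine two sketches; the result equals the sketch of the union."""
    return sorted(set(a) | set(b))[:k]


def estimate(s, k=K):
    """Exact below k distinct items, otherwise (k-1) / (k-th smallest hash)."""
    return len(s) if len(s) < k else (k - 1) / s[-1]


users = [f"user{i}" for i in range(10_000)]
left, right = sketch(users[:6_000]), sketch(users[4_000:])
print(round(estimate(merge(left, right))))
```

Each worker only ever holds `K` hash values, and merging is cheap, which is the property the parallel pipeline relies on.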

I tried comparing the sketch-based rolling average computation to an awk one:

parallel --pipepart 'awk -f /tmp/date-user-extract.awk' :::: /tmp/amazon*.csv \
  | awk '{a[$1][$2]=1}END{for(i in a)print i " " length(a[i])}' >/tmp/ts-awk

But this got OOM killed after 2700 seconds on a 240GB RAM machine. Perhaps the easiest non-sketch approach here would require ingesting the CSVs into postgres and just using a window function, but at this point we’re well over a few-line solution.

Try the notebook out yourself.

Map Reduce to Flatmap Fold

Sun, 25 Apr 2021 00:00:00 +0000

Step aside, map reduce. In this post, I’ll introduce a single-machine utility for parallel processing that significantly improves upon the typical map-reduce approach. When dealing with GB-to-TB size datasets, using a large multiprocessing machine should be enough for fast computation, but performance falls short of expectations due to naive reduce implementations.

Let’s take the canonical map reduce example, word count, where our goal is to take a corpus of text and construct a map from words to the number of times they appear in the corpus. We’ll be working with a 16-core machine throughout this post.

Let’s grab 1GB of English wikipedia for a running example and do some lightweight cleaning.

curl -o enwik9.bz2 https://cs.fit.edu/~mmahoney/compression/enwik9.bz2
bunzip2 enwik9.bz2
tr '[:upper:]' '[:lower:]' enwik9 | tr -c '[:alnum:]- \n' ' ' > enwik9.clean
rm enwik9
tail -1 enwik9.clean  | tr -s " "; echo
# breathing high-pressure oxygen for long periods can causes oxygen toxicity one of the side effects 

Let’s do a typical map-reduce with spark to get the top words.

pyspark
...
>>> from operator import add
>>> from collections import Counter
>>> ctr = spark.read.text('enwik9.clean')\
...   .rdd.mapPartitions(lambda rs: [
...     Counter(w for r in rs
...       for w in r[0].split(' ') if w)])\
...   .reduce(add) 
>>> ctr.most_common(5)
[('the', 7797642), ('of', 4855049), ('and', 3059322), ('in', 2621192), ('a', 2332364)]

This takes about 64 sec. If we track cpu and memory utilization during the above run, we’ll notice something strange:

Towards the end of the computation, we have long stretches of almost serial code. Why is that? The problem is visible from the computation graph of this map reduce job.

As we go down the computation, we distill words into word-count pairs. This does O(corpus size) work and can be infinitely parallelized. But in the reduce steps we need to combine many small hashmaps together to retrieve the final hashmap. At each level, we combine pairs of hashmaps. The total work done across all combination steps is O(num unique words), yet the number of hashmap pairs to combine drops to 1. So we’re stuck waiting at the end for a single worker combining the final map.
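
To make the shrinking parallelism concrete, here is a toy tree reduce over `Counter` objects in plain Python. This is an illustration under my own assumptions, not Spark's implementation, and `tree_reduce` is a made-up helper name.

```python
from collections import Counter


def tree_reduce(maps):
    """Merge Counters pairwise, level by level.

    Returns the final Counter and the number of merges available at each
    level, i.e. how many workers could be busy at that level.
    """
    merges_per_level = []
    while len(maps) > 1:
        pairs = [maps[i:i + 2] for i in range(0, len(maps), 2)]
        merges_per_level.append(sum(len(p) == 2 for p in pairs))
        maps = [sum(p, Counter()) for p in pairs]
    return maps[0], merges_per_level


partitions = ["the cat sat", "on the mat", "the dog sat", "on the log",
              "a cat and", "a dog met", "the end of", "the tale"]
total, merges = tree_reduce([Counter(p.split()) for p in partitions])
print(merges)        # merges available per level
print(total["the"])
```

The last level is a single merge that touches every unique key, which is the serial stretch seen at the end of the run.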

What can we do to fix the situation? The problem stems from the fact that our final output—a map from words to counts—is much larger than tolerable to process serially. So, no matter what, an interface such as map reduce which returns a “serial object” such as a regular, single-threaded hashmap is doomed to require the large O(num unique words) processing step on one thread at some point.

Sure, we could look into lockfree or multi-threaded hashmaps, assuming a shared-memory system, but per Joe Hellerstein’s “swim-lane” intuition, it’d be preferable to instead have a framework which keeps data local to a single CPU’s cache as much as possible (this preference is also amenable to distributed computation later, unlike shared-memory approaches).

Map reduce relies on the following abstractions, which naturally lead to a single output “serial object” $U$ from a collection of inputs of type $X$:

\[\begin{align} \mathrm{map}&:X \rightarrow U\\ \mathrm{reduce}&: U \rightarrow U \rightarrow U \end{align}\]

where $\mathrm{reduce}$ is a commutative, associative operation. Instead, an API like

\[\begin{align} \mathrm{flatmap}&: X \rightarrow [Y]\\ \mathrm{fold}&:U \rightarrow Y \rightarrow U \end{align}\]

ends up being very natural in a Unix setting, and aside from purity of the $\mathrm{flatmap},\mathrm{fold}$ functions has no requirements. Schematically, it looks like this:

Note, this requires actually changing what our result is. We’re no longer offering a serial hashmap of our keys, but rather a disjunction over disjoint hashmaps. This requires no additional merges, thanks to disjointness.

The advantage of this approach over map reduce for keyed inputs is that (1) there are no serial reduce steps, (2) all the computational steps can be made online and (3) memory usage is bounded, unlike the tree reduce approach, where in principle all keys could be replicated across multiple transient unreduced maps.

I implemented a version of the above for a unix-like text stream interface. slb (for “sharded load balance”) essentially works like parallel --pipe --roundrobin would, splitting its input based on hash, and maintaining parallel independent mapper processes and folding processes which are to emit lines at the end of computation. The disjunction step here is just line concatenation (where the output lines for wordcount are key-value pairs).
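
The flatmap-fold scheme can be sketched in a single Python process as follows. This is only meant to show the data flow (flatmap, route by key hash, fold per shard); it is not how slb is implemented, and the function names and the use of `zlib.crc32` for sharding are my own assumptions.

```python
import zlib
from collections import Counter


def flatmap(line):
    """X -> [Y]: one line in, many words out."""
    return [w for w in line.split(" ") if w]


def fold(state, word):
    """U -> Y -> U: update a shard's running count with one word."""
    state[word] += 1
    return state


def shard_of(key, n_shards):
    # Deterministic hash so a given key always lands on the same shard.
    return zlib.crc32(key.encode()) % n_shards


def flatmap_fold(lines, n_shards=4):
    shards = [Counter() for _ in range(n_shards)]
    for line in lines:
        for word in flatmap(line):
            s = shard_of(word, n_shards)
            shards[s] = fold(shards[s], word)
    return shards


lines = ["the cat sat on the mat", "the dog sat on the log"]
shards = flatmap_fold(lines)
for shard in shards:
    for word, count in sorted(shard.items()):
        print(word, count)
```

Because every key lives on exactly one shard, the per-shard outputs can simply be concatenated; no merge step is needed.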

Let’s revisit our wordcount benchmark with our new approach.

/usr/bin/time -f "%e sec" target/release/slb \
  --mapper 'tr " " "\n" | rg -v "^$"' \
  --folder "awk '{a[\$0]++}END{for(k in a)print k,a[k]}'" \
  --infile enwik9.clean \
  --outprefix wikslb.
# 6.20 sec
cat wikslb.* | sort --parallel=$(nproc) -k2nr -k1 | head -5
# the 7797642
# of 4855049
# and 3059322
# in 2621192
# a 2332364

Much better! The flatmap operation tr " " "\n" | rg -v "^$", which puts every word on its own line, is a natural Unix line streaming operation. The folder, awk '{a[$0]++}END{for(k in a)print k,a[k]}', statefully tracks a simple keyed counter. This makes slb a fitting primitive for parallelizing keyed aggregations in the Unix way, which is convenient for ML use cases such as:

feature frequency counting
distinct feature value aggregation and counting

There are all sorts of interesting extensions to be made for slb; check out the repo for details and examples.

How could we support multiple input files?
Do buffer and queue sizes affect performance? Can they be autotuned?
Are stragglers causing problems?

Illustrations provided by Olivia Wynkoop.

P.S. Nowadays, Spark has a more advanced API which can look at the full AST of our parallel computation:

pyspark
...
>>> from pyspark.sql.functions import split, col, explode
>>> ctr = spark.read.text('enwik9.clean')\
...    .select(explode(split(col('value'), r'\s+')).alias('word')) \
...    .where(col('word') != '') \
...    .groupby('word') \
...    .count() \
...    .collect()
>>> ctr.sort(key=lambda r: r["count"])
>>> ctr[-5:]
[Row(word='a', count=2332364), Row(word='in', count=2621192), Row(word='and', count=3059322), Row(word='of', count=4855049), Row(word='the', count=7797642)]

The groupby portion now runs in 34 sec. As we can see, we have much higher utilization too:

However, this is still slower than using the Unix utilities with slb.

Parallel Glauber Inference

Sat, 20 Mar 2021 00:00:00 +0000

Markov Chain Monte Carlo methods (MCMC) are, functionally, very cool: they enable us to convert a specification of a probability distribution from a likelihood $\ell$ into samples from that likelihood. The main downside is that they’re very slow. That’s why lots of effort has been invested in data-parallel MCMC (e.g., EPMCMC). This blog post takes a look at a specialized MCMC sampler which is transition-parallel, for a simple distribution:

Given a fixed, simple, large graph $G=(V,E)$ on $n$ vertices and $m$ edges, return a uniform proper random $k$-coloring, where $k > 2\Delta$ and $\Delta=\Delta(G)$ is the maximum degree of $G$.

Such a sampler could be used to generate colorings with generalization properties for machine learning or simulating Glauber dynamics in physical systems.

Note that while it’s easy to state this distribution as drawing a random element from the combinatorial structure \[ \left\{v\in[k]^n\, \big|\, v_i\neq v_j\,\,\forall (i,j)\in E\right\}\,, \] it’s far from trivial to actually do so (of course, we make colors synonymous with numbers here). Sure, every marginal distribution of every vertex is just uniform, but complex dependencies across vertices arise: highly central nodes can significantly affect permissible colors for other nodes.

For the given problem, Jerrum provides a classic Gibbs sampler:

Perform any proper greedy coloring of $G$.
Repeatedly, sample a vertex and resample its color uniformly from the set of colors among $[k]$ absent from its neighbors.

Note, the image above uses curly braces to denote the uniform random outcome of the above sampling step—for a given chain, only one color will be chosen.

This works fine and is considered fast-mixing, but is single-threaded. You have to loop on the “repeatedly” step for the Markov chain (MC) to burn in. Choosing the smallest legal $k$, the mixing time is $\tilde{O}(\Delta m)$.
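
The serial sampler described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the graph is given as adjacency lists of a simple graph and that $k > \Delta + 1$ so a free color always exists; the helper names are hypothetical and this is not the code from the repo.

```python
import random


def greedy_coloring(adj):
    """Give each vertex the smallest color unused by its colored neighbors."""
    colors = [None] * len(adj)
    for v in range(len(adj)):
        used = {colors[u] for u in adj[v]}
        colors[v] = next(c for c in range(len(adj[v]) + 1) if c not in used)
    return colors


def glauber_step(adj, colors, k, rng):
    """Pick a uniform vertex and resample its color among the legal ones."""
    v = rng.randrange(len(adj))
    used = {colors[u] for u in adj[v]}
    colors[v] = rng.choice([c for c in range(k) if c not in used])


def is_proper(adj, colors):
    return all(colors[v] != colors[u] for v in range(len(adj)) for u in adj[v])


# Toy graph: a 6-cycle, so the maximum degree is 2 and k = 5 > 2 * 2.
n, k = 6, 5
adj = [[(v - 1) % n, (v + 1) % n] for v in range(n)]
colors = greedy_coloring(adj)
rng = random.Random(0)
for _ in range(1_000):
    glauber_step(adj, colors, k, rng)
print(is_proper(adj, colors))
```

Every step keeps the coloring proper, so the chain never leaves the set of proper colorings.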

We can make a simple observation here: our graph $G$ describes the MRF for the interdependent random variables defining the color of each vertex in $V$. That’s just a fancy way of saying: conditioned on your neighbors’ colors, your color is a uniform independent random variable over the remaining colors, ignorant of the color of any other vertex. In turn, suppose over the course of Jerrum’s algorithm we were to sample two vertices $v, w$ at least distance 2 apart. As far as the MC is concerned, we might as well transition $v$ and $w$ in parallel!

This leads us to a naively parallel sampler, attaching RW locks to each vertex.

Perform any proper greedy coloring of $G$
On many threads, repeatedly:
1. Uniformly sample a vertex $v$
2. Try to write-lock $v$ (upon failure, restart at (2-1))
3. Try to read-lock all neighbors of $v$ (upon failure, restart at (2-1))
4. Resample a color for $v$ based on the read-locked snapshots

Pictorially, after sampling a vertex on each thread (2-1), threads continue to acquire write locks (2-2):

Then they grab neighbor read locks (2-3):

And finally, they perform resampling and unlock (2-4):

Now we have a natively transition-parallel sampler. The only catch is what happens when we fail to acquire a lock. If a thread just unlocks everything and tries to grab another vertex, then we’re no longer perfectly replicating Jerrum’s sampler: we’re going to implicitly favor updating the color of lower-degree vertices since they’ll have fewer conflicts.

There might be ways to counter this, e.g., non-uniform sampling in part (2-1), restarting at step (2-2) instead of step (2-1), etc., but an interesting question is “who cares?” because the increased sampling rate could mean we can get to some (possibly asymptotically biased) samples faster.
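
The locking protocol above, including the "give up and resample a vertex" behavior on lock failure, can be sketched with Python threads. Two loud caveats: Python's standard library has no RW lock, so this sketch takes an exclusive `threading.Lock` on the vertex and on each neighbor (more conservative than read-locking neighbors), and because of the GIL it demonstrates the protocol only, not a speedup. All names here are hypothetical; the real implementation lives in the Rust repo.

```python
import random
import threading


def is_proper(adj, colors):
    return all(colors[v] != colors[u] for v in range(len(adj)) for u in adj[v])


def parallel_glauber(adj, colors, k, steps_per_thread, n_threads, seed=0):
    """Run lock-based Glauber updates in place; return failed lock attempts."""
    locks = [threading.Lock() for _ in adj]
    failures = [0] * n_threads

    def worker(tid):
        rng = random.Random(seed + tid)
        done = 0
        while done < steps_per_thread:
            v = rng.randrange(len(adj))           # step (2-1)
            held, ok = [], True
            for u in [v, *adj[v]]:                # steps (2-2) and (2-3)
                if locks[u].acquire(blocking=False):
                    held.append(u)
                else:
                    ok = False                    # never block: no deadlock
                    break
            if ok:                                # step (2-4)
                used = {colors[u] for u in adj[v]}
                colors[v] = rng.choice([c for c in range(k) if c not in used])
                done += 1
            else:
                failures[tid] += 1
            for u in held:
                locks[u].release()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(failures)


# Toy graph: a 20-cycle with an alternating 2-coloring as the starting point.
n, k = 20, 5
adj = [[(v - 1) % n, (v + 1) % n] for v in range(n)]
colors = [v % 2 for v in range(n)]
failed = parallel_glauber(adj, colors, k, steps_per_thread=200, n_threads=4)
print(is_proper(adj, colors))
```

Since a thread holds the locks on a vertex and all of its neighbors while recoloring, two adjacent vertices can never be updated at the same time, so the coloring stays proper.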

Here’s a plot of how long it takes to reach a given transition step count across various levels of parallelism using the sampler above (on a fixed, randomly sampled connected graph of 1M vertices and average degree 100). Time here is in seconds.

As we increase the number of threads, the percentage of unsuccessful lock attempts increases too, as expected. However, for this sparse graph, even with 32 threads (this is hyperthreading 2x for a 16-core machine), the percentage of lock attempts that are unsuccessful remains less than about half a percent.

Despite using biased transition dynamics (but only slightly so, as the above demonstrates), natural parallelism within our MRF makes the “just use more cores” approach appealing for sparse graphs. This was all a ton of fun to code up, and had lots of little interesting systems problems:

How would one actually build this many fast RW locks?
How might one avoid deadlock?
How can we leverage Rust’s type system to enforce lightweight atomics-based lock guards on our graph?

Check the repo out here.

There’s a lot further one could take this: I’m sure you’d want to partition $G$ and only use locks at partition boundaries, for instance, as well as debias the parallel sampling process by changing how to sample $v$. But the upshot to me is that MRFs provide a natural mechanism for interior parallelism when performing MCMC sampling.

Illustrations provided by Olivia Wynkoop.

Fast SVMlight to CSR in Python with Rust

Sun, 17 Jan 2021 00:00:00 +0000

Lots of sparse datasets are kept around in a convenient text format called SVMlight. It’s easy to manipulate with unix tools and very easily compressed so it’s perfect to distribute.

However, the main way that’s available to access this format in Python is dreadfully slow due to a natural lack of parallelism. svm2csr is a quick Python package I wrote with a Rust extension that parallelizes SVMlight parsing for faster loads in Python. Check it out!

P.S., here’s what this format looks like:

-1 2:1 4:0.165975 5:0.103448 6:0.176471 11:0.285714
-1 17:0.760482 18:0.820882
1 4:0.0580913 5:0.0896552 6:0.176471 11:0.142857 21:1

Corresponding to labels -1, -1, 1 and a 3-by-23 sparse matrix with 12 nonzero entries.
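
To show how these lines map onto CSR arrays, here is a minimal serial parser in pure Python. It is a sketch under simplifying assumptions (no `qid:` tokens, indices kept exactly as written in the file, so the column count is left to the caller) and is not svm2csr's implementation; `parse_svmlight` is a made-up name.

```python
def parse_svmlight(lines):
    """Parse SVMlight text into labels plus CSR arrays (data, indices, indptr)."""
    labels, data, indices, indptr = [], [], [], [0]
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        label, *features = line.split()
        labels.append(float(label))
        for feat in features:
            idx, val = feat.split(":")
            indices.append(int(idx))
            data.append(float(val))
        indptr.append(len(data))  # row i spans data[indptr[i]:indptr[i + 1]]
    return labels, data, indices, indptr


text = """\
-1 2:1 4:0.165975 5:0.103448 6:0.176471 11:0.285714
-1 17:0.760482 18:0.820882
1 4:0.0580913 5:0.0896552 6:0.176471 11:0.142857 21:1
"""
labels, data, indices, indptr = parse_svmlight(text.splitlines())
print(labels)
print(indptr)
```

Each row is parsed independently, which is why chunks of the file can be handled on different threads and their CSR pieces stitched together afterwards.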