Vlad Feinberg
Vlad's Blog
https://vlad17.github.io/
Wed, 04 Dec 2019 19:14:36 +0000 (Jekyll v3.8.5)

<h1 id="metaphysics-of-causality">Metaphysics of Causality</h1>
<p>If you read Judea Pearl’s <em>The Book of Why</em>, it makes it seem like practicing observational statistics makes you an ignoramus. Look at the stupid robot:</p>
<p><img src="/assets/2019-12-01-metaphysics-of-causality/ladder.png" alt="ladder" class="center-image" /></p>
<p>Pearl and Jonas Peters (in <em>Elements of Causal Inference</em>) both make a strong distinction, it seems at the physical level, between causal and statistical learning. Correlation is not causation, as it goes.</p>
<p>From a deeply (<a href="https://en.wikipedia.org/wiki/Subjective_idealism">Berkeley-like</a>) skeptical lens, where all that we can be sure of is what we observe, it seems that we nonetheless can recover nice properties of causal modeling even as associative machines through something we can call the <em>Epistemic Backstep</em>.</p>
<p>This is less a declaration that I know better, and more an attempt to put into words a different take I had as I was reading the aforementioned works.</p>
<h2 id="our-shining-example">Our Shining Example</h2>
<p>Intuitively, the difference between cause and effect seems to be a fundamental property of nature. Let \(B\) be barometric pressure and \(G\) be a pressure gauge’s reading. We can build a structural causal model (SCM), a set of equations tied to the edges and vertices of a directed acyclic graph (DAG):
\[
B\rightarrow G
\]
where \(B\) is the cause and \(G\) is the effect. It’s clear to us that the former is a <em>cause</em> because of what interventions do.</p>
<p>If we intervene on \(B\), say, by increasing our elevation, then the gauge starts reading a lower number. There’s clearly a functional dependence there (or a statistical one, say, in the presence of measurement noise).</p>
<p>If we intervene on \(G\), by breaking the glass and turning the measurement needle, our eardrums don’t pop no matter how low we turn the needle.</p>
<p>We point at this asymmetry and say, this is causality in the real world.</p>
<h2 id="the-epistemic-backstep">The Epistemic Backstep</h2>
<p>But now I ask us to take a step back. Why does this example even make sense to us, evoking vivid imagery about how ridiculous a ruptured eardrum would be due to manually changing a barometer’s needle?</p>
<p>Well, it turns out that we have, through media or real-life experiences, learned about and observed barometers. In science class, we may have read about, seen, or heard how they turn as pressure changes.</p>
<p>We may never have broken a barometer and changed its needle position, but we’ve certainly seen enough glass being broken in the past and needles moving that we can put two and two together and imagine what that would look like. In those situations, the thing that the needle measures rarely changes.</p>
<p>Stepping back a bit, it turns out that we actually have a lot of observations of some kind of environmental characteristic \(C\) (which might be temperature or pressure), its corresponding entailment \(E\) (a thermometer or barometer reading), and whether an interaction with the measurement took place, \(I\), an indicator for “did we increase the reading of our measurement manually.”</p>
<p>So, we actually have a lot of observational evidence of the more generalized system \((C, E, I)\).</p>
<ol>
<li>We’ve seen how barometers read high numbers \(E=1\) at high pressure \(C=1\) by being at low altitude and observing a functioning barometer. We did not mess with the barometer. \((C, E, I)=(1, 1, 0)\).</li>
<li>We’ve seen how barometers behave at high altitude \((0, 0, 0)\).</li>
<li>We’ve seen how ovens increase the temperature in the attached thermometer \((1, 1, 0)\).</li>
<li>We’ve seen how we don’t have a fever when our mom measures our temperature and we’re healthy \((0, 0, 0)\).</li>
<li>After we do jumping jacks to raise our temperature, to get out of school, we see that it works \((0, 1, 1)\).</li>
</ol>
<p>Given a bunch of situations like this, and taking some liberty in our ability to generalize, it’s totally reasonable that we can come up with the rule \(E=\max(C, I)\) from observational data alone. We might go even further and model a joint probability on \((C, E, I)\), where the conditional distribution of \(C\) given \(E=1,I=1\) ends up just being the marginal distribution of \(C\):
\[
p(C) = p(C|E=1,I=1)\,,
\]
as opposed to what happens for \(E\)
\[
\forall i\,,\,\,\,p(E|C=1, I = i) = 1_{E=1}\,.
\]
These <em>observations</em> make for natural matches for causal inference, from which we can infer that there won’t be much effect on pressure by changing the barometer, but we <em>could have known this</em> (at least in theory) just by building up an associative model for what happens when you manually override what the measurement tool says.</p>
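<p>These two displays can be checked with a small simulated sketch (the coin-flip distributions for \(C\) and \(I\), and the sample size, are my assumptions for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observational data over (C, E, I): the environmental
# characteristic C and the manual intervention I are independent coin
# flips (my assumption), and the reading E follows the rule E = max(C, I),
# which is what we hope to recover from observation alone.
n = 100_000
C = rng.integers(0, 2, n)
I = rng.integers(0, 2, n)
E = np.maximum(C, I)

# p(C=1) matches p(C=1 | E=1, I=1): intervening on the reading
# tells us nothing about the environment ...
p_c = C.mean()
p_c_given_ei = C[(E == 1) & (I == 1)].mean()
print(p_c, p_c_given_ei)  # approximately equal

# ... while C=1 pins the reading at E=1 regardless of I.
for i in (0, 1):
    print(E[(C == 1) & (I == i)].mean())  # 1.0 both times
```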
<p>By considering the associative model over a wider universe, a universe that includes the interventions <em>themselves</em> as observed variables, and having a strong ability to generalize between related interventions and settings, we can view our causal inference as solely an associative one.</p>
<h2 id="in-short">In Short</h2>
<p>The epistemic backstep proceeds by adding a new variable, \(F\) (the \(I\) from our example), for “did you fuck with the system you’re modeling,” capturing all the essential ways in which you can fuck with it. In this manner we can seemingly reduce causal questions to ones that, in theory, an associative machine could answer.</p>
<p>Maybe this is a cheap move, just moving intervention into the arena of observations. I still think it’s a fairly powerful reduction: we can tell things about gauges and barometers having never experimented with them before, as long as we can solve the transfer learning problem between those and settings where we <em>have</em> messed with measuring natural phenomena, maybe thinking back to weight scales in a farmer’s market or trying to get out of school by raising our temperature.</p>
<p>So, we can view randomized controlled trials in this context not as different experimental settings, but rather as a way to collect data in a region of the \((C, E, I)\) space that might be sparsely populated otherwise (so we’d have a tough time fitting the data there).</p>
<p>It’s important to note that <em>it doesn’t matter</em> that it’s more convenient for us to model causal inference with DAGs.</p>
<ul>
<li>You may say something like “well, how did you know that jumping jacks would help raise your temperature?”</li>
<li>This suggests that humans do really think causally.</li>
<li>However, the above is a psychological claim about humans, rather than a metaphysical claim about causality.</li>
<li>For all we know, an associative machine may have an exploration policy where, on some days, it sets \(I=1\) just to see what can happen. After gathering some data, it builds something equivalent to a causal model, but without ever explicitly constructing any DAGs.</li>
</ul>
<h2 id="full-circle">Full Circle</h2>
<p>For what it’s worth, maybe the best way to model our new joint density over \((C, E, I)\) is by first identifying the causal DAG structure, constructing a full SCM by fitting conditional functions, and then using that SCM for our predictions.</p>
<p>But that seems presumptuous. Surely, viewing this as a more abstract statistical learning problem, there might be more generic ways of finding representations that help us efficiently learn the “full joint” which includes interventions.</p>
<p>Another interesting point is asking questions about counterfactuals. Personally, I don’t find counterfactuals that useful (unless blame is an end in itself), but that’s a discussion for another time. I didn’t want to muddy the waters above, but there’s an example of an associative counterfactual analysis below with the epistemic backstep.</p>
<p>Note that the transfer notions introduced here aren’t related to the <a href="https://arxiv.org/abs/1301.2312">Tian and Pearl</a> transportability between different environments, where the nodes of your SCM stay the same (<a href="https://ftp.cs.ucla.edu/pub/stat_ser/r402.pdf">see here for further developments</a>). What I’m talking about is definitely more of a transfer learning problem, where you’re trying to perform a natural matching based on your past experiences, and it’s learning this matching function that’s interesting to study.</p>
<p>So in sum we have an interesting take on <a href="https://plato.stanford.edu/entries/causation-probabilistic/">Regularity Theory</a>, which doesn’t have the usual drawbacks. Maybe all of this is a grand exercise in identifying a motivation for Robins’ G-estimation. In any case it was fun to think about so here we are.</p>
<h2 id="another-worked-example">Another worked example</h2>
<p>Let’s first look at <a href="https://ftp.cs.ucla.edu/pub/stat_ser/r301-final.pdf">Pearl’s firing squad</a>.</p>
<p><img src="/assets/2019-12-01-metaphysics-of-causality/firing-squad.png" alt="firing squad" class="center-image" /></p>
<p>Say the captain gave the order to fire, both \(R_1,R_2\) did so, and the prisoner died. Now, what would have happened had \(R_1\) not fired?</p>
<p>Pearl says the association machine breaks down here: it’s a contradiction, since the rifleman always fires when the captain gives the order. So why aren’t we confused when we think about it?</p>
<p>Step back: consider a universe where riflemen can refuse to follow orders. The first rifleman is now wont to be mutinous, \(M\) (add an arrow \(M\rightarrow R_1\)).</p>
<p>In situations where the first rifleman is mutinous but the second isn’t, it’s pretty clear what’ll happen: the second rifleman still fires, and the prisoner is still shot dead.</p>
<p>To me, it’s only because I’ve seen a lot of movies, read books, heard poems where there’s a duty to disobey that I could reason through this. If all of my experience up to this point has confirmed that riflemen <em>always</em> fire when their commanding officer tells them to, I would’ve been as confused as our associative machine at the counterfactual question.</p>
<p>To close up, we have one big happy joint model
\[
p(C, M, R_1, R_2, D)\,,
\]
and to ask the counterfactual is just to ask what the value of
\[
p(D=1|C=1, M=1, R_1=0, R_2=1)
\]
is, which is something we can answer given our wider set of observations and the ability to generalize.</p>
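<p>A minimal simulation sketch of this big happy joint model (the coin-flip order and the 10% mutiny rate are my assumptions, not Pearl’s):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generative story for the widened universe (rates are my
# assumptions): the captain's order C is a coin flip; rifleman 1 is
# mutinous (M) 10% of the time and fires iff ordered and not mutinous;
# rifleman 2 always follows orders; the prisoner dies iff anyone fires.
n = 200_000
C = rng.integers(0, 2, n)
M = rng.random(n) < 0.1
R1 = C & ~M
R2 = C
D = R1 | R2

# The counterfactual is now an ordinary conditional in the joint:
# p(D=1 | C=1, M=1, R1=0, R2=1)
mask = (C == 1) & M & (R1 == 0) & (R2 == 1)
print(D[mask].mean())  # 1.0: the prisoner still dies
```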
Sun, 01 Dec 2019 00:00:00 +0000
https://vlad17.github.io/2019/12/01/metaphysics-of-causality.html
Tags: philosophy

<h1 id="the-triple-staple">The Triple Staple</h1>
<p>When reading, I prefer paper to electronic media. Unfortunately, a lot of my reading involves manuscripts from 8 to 100 pages in length, with the original document being an electronic PDF.</p>
<p>Double-sided printing resolves this issue partway. It lets me convert PDFs into paper documents I can focus on, and it works great up to about 15 pages: I print the pages out and staple them. I’ve tried not stapling the printed pages, but then the individual sheets frequently get out of order or end up all over the place.</p>
<p><strong>However</strong>, for larger manuscripts I frequently found myself in a pickle:</p>
<ul>
<li>I don’t want to manage loose leaf pages individually.</li>
<li>Staplers that can handle stapling over 15 pages don’t occur naturally, at least near the printers I’m around.</li>
</ul>
<p>Attempting to use a stapler beyond its capacity does not end successfully.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/the-problem.png" alt="weak staplers" class="center-image" /></p>
<p>For a good deal of my life I’ve resigned myself to dealing with a reality of mediocre staplers and even more mediocre workarounds, e.g., a packet on a single topic now needs to be represented by 3 independent, separately-stapled documents, which is 2 too many.</p>
<p>I’m confident many others also have this problem. To that end, I’d like to introduce a life hack for all situations where you have documents of up to \(2X\) pages and staplers rated to penetrate \(X\) pages.</p>
<h2 id="the-problem">The Problem</h2>
<p>I want to staple this thick paper stack.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/initial-conditions.png" alt="initial conditions" class="center-image" /></p>
<p><em>Optimality criteria</em>.</p>
<p>(A) Grip strength of resulting staple.</p>
<p>(B) Non-obstruction of reading material.</p>
<h2 id="solution">Solution</h2>
<ol>
<li>Staple pages \(1\) to \(X\).</li>
<li>Staple pages \(X+1\) to \(2X\).</li>
<li>Peel back the corner of pages \(1\) to \(\lfloor X/2\rfloor\) over the staple. Repeat for pages \(\lfloor 3X/2\rfloor\) to \(2X\).</li>
<li>Insert the exposed corner of pages \(\lfloor X/2\rfloor +1\) to \(\lfloor 3X/2\rfloor - 1\) into the stapler, making sure the folded-away corners of the outer pages are out of the stapler’s line of fire.</li>
<li>Apply the stapler to the middle pages, then fold the outer pages’ corners back up.</li>
</ol>
<h2 id="results">Results</h2>
<p>Step 1 and 2.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-one.png" alt="step 1 and 2" class="center-image" /></p>
<p>Step 3.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-three.png" alt="step 3" class="center-image" /></p>
<p>Step 4.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-four.png" alt="step 4" class="center-image" /></p>
<p>Step 5.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-five.png" alt="step 5" class="center-image" /></p>
<p>Additional results (skew angle, front, and back views).</p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-five1.png" alt="step 5 1" class="center-image" /></p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-five3.png" alt="step 5 2" class="center-image" /></p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-five2.png" alt="step 5 3" class="center-image" /></p>
<h2 id="discussion-and-related-work">Discussion and Related Work</h2>
<p>(A) is met due to each staple holding together at least \(X\) pages. Contrast this with related work which only staples two pages \(X,X+1\) with an intermediate staple, resulting in a single point of failure at page \(X\).</p>
<p>(B) UX is equivalent to a single-stapled page, as opposed to binder-clip methodology which frequently requires clipping past the margin.</p>
<h2 id="future-work">Future Work</h2>
<p>There exists a straightforward alternating iteration of our method that can be shown, by induction, to apply to documents of up to \(n X\) pages for any \(n\in\mathbb{N}\). We leave evaluation to future work.</p>
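<p>As a sketch of what that alternating iteration computes, here is a hypothetical helper (the name and interface are mine, not part of the method above): given a document length and a stapler capacity \(X\), it emits the primary staple windows from steps 1–2 and the bridging staples from steps 3–5, each bridge covering pages \(\lfloor X/2\rfloor+1\) through \(\lfloor 3X/2\rfloor-1\) around a window boundary.</p>

```python
# Hypothetical sketch: compute a staple plan for a document of n_pages
# using a stapler rated for `capacity` pages.
def staple_plan(n_pages, capacity):
    # primary staples: consecutive full-capacity windows (steps 1-2)
    groups = [(s, min(s + capacity - 1, n_pages))
              for s in range(1, n_pages + 1, capacity)]
    half = capacity // 2
    # one bridging staple per adjacent pair of windows (steps 3-5),
    # spanning half of each window around their shared boundary
    bridges = [(a + half, b + half - 1)
               for (a, b), _ in zip(groups, groups[1:])]
    return groups, bridges

print(staple_plan(30, 15))  # ([(1, 15), (16, 30)], [(8, 21)])
```

<p>Every staple still grips roughly \(X\) pages, so optimality criterion (A) is preserved for any \(n\).</p>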
Sat, 30 Nov 2019 00:00:00 +0000
https://vlad17.github.io/2019/11/30/the-triple-staple.html
Tags: tools, joke-post

<h1 id="prngs">PRNGs</h1>
<p>Trying out something new here with a Jupyter notebook blog post. We’ll keep this short. Let’s see how it goes!</p>
<p>In this episode, we’ll be exploring random number generators.</p>
<p>Usually, you use pseudo-random number generators (PRNGs) to simulate randomness. In general, randomness is a great way of avoiding integrals: it’s cheaper to average a few samples than to integrate over the whole space, and averages tend to be accurate after just a few samples. This is the <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo Method</a>.</p>
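<p>As a toy illustration of the Monte Carlo idea (the specific integrand here is my choice, not from this post): estimate \(\int_0^1 e^x\,dx = e - 1\) by averaging instead of integrating.</p>

```python
import numpy as np

# Monte Carlo estimate of the integral of e^x over [0, 1]:
# average e^U over uniform samples U instead of integrating.
rng = np.random.default_rng(0)
samples = np.exp(rng.random(100_000))
print(samples.mean(), np.e - 1)  # both about 1.718
```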
<p>That said, since the priority here is speed, and more samples are better, we want to take as many samples as possible, so parallelism seems viable.</p>
<p>This occurs in lots of scenarios:</p>
<ul>
<li>Stochastic simulations of physical systems for risk assessment</li>
<li>Machine learning experiments (e.g., to show a new training method is consistently effective)</li>
<li>Numerical estimation of integrals for scientific equations</li>
<li>Bootstrap estimation in statistics</li>
</ul>
<p>For all of these situations, we also usually want replicable studies.</p>
<p>Seeding is great for making the random PRNG sequence deterministic for one thread, but how do you do this for multiple threads?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">multiprocessing</span> <span class="kn">import</span> <span class="n">Pool</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">ttest_1samp</span>
<span class="k">def</span> <span class="nf">something_random</span><span class="p">(</span><span class="n">_</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">()</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">2056</span>
<span class="k">print</span><span class="p">(</span><span class="s">"stddev {:.5f}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="mi">1</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span>
<span class="k">with</span> <span class="n">Pool</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span>
    <span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">something_random</span><span class="p">,</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span>
<span class="n">mu</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>stddev 0.02205
-0.03392958488974697
</code></pre></div></div>
<p>OK, so not seeding (using the system default of time-based seeding) gives us dependent trials, which can really mess up the experiment, and it prevents the very determinism we need!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">seeds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="mi">32</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">something_random</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
    <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seeds</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">()</span>
<span class="k">with</span> <span class="n">Pool</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span>
    <span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">something_random</span><span class="p">,</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="n">mu</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-0.6038931772504026
</code></pre></div></div>
<p>The common solution I see for this is what we see above, or using <code class="language-plaintext highlighter-rouge">i</code> directly as the seed. It kind of works in this case, but for numpy’s default PRNG, the Mersenne Twister, it’s not a good strategy.</p>
<p><a href="https://docs.scipy.org/doc/numpy/reference/random/parallel.html#seedsequence-spawning">Here’s the full discussion</a> in the numpy docs.</p>
<p>To short circuit to the “gem” ahead of time, the solution is to use the new API.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">numpy.random</span> <span class="kn">import</span> <span class="n">SeedSequence</span><span class="p">,</span> <span class="n">default_rng</span>
<span class="n">ss</span> <span class="o">=</span> <span class="n">SeedSequence</span><span class="p">(</span><span class="mi">12345</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">16</span>
<span class="n">child_seeds</span> <span class="o">=</span> <span class="n">ss</span><span class="o">.</span><span class="n">spawn</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">something_random</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
    <span class="n">rng</span> <span class="o">=</span> <span class="n">default_rng</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">rng</span><span class="o">.</span><span class="n">normal</span><span class="p">()</span>
<span class="k">with</span> <span class="n">Pool</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span>
    <span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">something_random</span><span class="p">,</span> <span class="n">child_seeds</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">mu</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-0.11130135587093562
</code></pre></div></div>
<p>That said, I think the fun part is in trying to break the old PRNG seeding method to make this gem more magical.</p>
<p>That is, the rest of this blog post will try to find the non-randomness that occurs when you seed in an invalid way.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># aperitif numpy trick -- get bits, fast!
</span><span class="k">def</span> <span class="nf">fastbits</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="n">nbytes</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span> <span class="o">+</span> <span class="mi">7</span><span class="p">)</span> <span class="o">//</span> <span class="mi">8</span> <span class="c1"># == ceil(n / 8) but without using floats (gross!)
</span>    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">unpackbits</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="nb">bytes</span><span class="p">(</span><span class="n">nbytes</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">))[:</span><span class="n">n</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">timeit</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span> <span class="o">*</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>39.5 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">timeit</span>
<span class="n">fastbits</span><span class="p">(</span><span class="mi">10</span> <span class="o">*</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2.29 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Attempt 1: will lining up random
# streams break a chi-square test?
</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">10</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x1</span> <span class="o">=</span> <span class="n">fastbits</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="n">x2</span> <span class="o">=</span> <span class="n">fastbits</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">fastbits</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">chisquare</span>
<span class="k">def</span> <span class="nf">simple_pairwise</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
    <span class="c1"># do a simple pairwise check on equilength arrays dof = 4 - 1
</span>    <span class="c1"># build a contingency table for cases 00 10 01 11
</span>    <span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">bincount</span><span class="p">(</span><span class="n">a</span> <span class="o">+</span> <span class="n">b</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">chisquare</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'random'</span><span class="p">,</span> <span class="n">simple_pairwise</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="n">x2</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'seeds 1-2'</span><span class="p">,</span> <span class="n">simple_pairwise</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>random Power_divergenceResult(statistic=6.848932, pvalue=0.07687191550956339)
seeds 1-2 Power_divergenceResult(statistic=10000003.551559199, pvalue=0.0)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># And now let's try another approach!
</span>
<span class="kn">import</span> <span class="nn">tempfile</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="k">def</span> <span class="nf">size</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="s">'/tmp/x.bz2'</span><span class="p">):</span>
        <span class="n">os</span><span class="o">.</span><span class="n">remove</span><span class="p">(</span><span class="s">'/tmp/x.bz2'</span><span class="p">)</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/tmp/x'</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">tobytes</span><span class="p">())</span>
    <span class="err">!</span> <span class="n">bzip2</span> <span class="o">-</span><span class="n">z</span> <span class="o">/</span><span class="n">tmp</span><span class="o">/</span><span class="n">x</span>
    <span class="k">return</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">getsize</span><span class="p">(</span><span class="s">'/tmp/x.bz2'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">rbytes</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="nb">bytes</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span>
<span class="n">trials</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">trials</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span>
<span class="k">print</span><span class="p">(</span><span class="s">'random'</span><span class="p">,</span> <span class="n">size</span><span class="p">(</span><span class="n">rbytes</span><span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="n">trials</span><span class="p">)))</span>
<span class="n">re_seeded</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">trials</span><span class="p">):</span>
    <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
    <span class="n">re_seeded</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">rbytes</span><span class="p">(</span><span class="n">n</span><span class="p">))</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">re_seeded</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'seeds 0-255'</span><span class="p">,</span> <span class="n">size</span><span class="p">(</span><span class="n">a</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>random 257131407
seeds 0-255 257135234
</code></pre></div></div>
<p>OK, so bzip2 isn’t easily able to untangle any correlation between the streams (if it could, the concatenation of streams from sequential seeds would compress better than the truly random bytes).</p>
<p>We’ll need another approach.</p>
<p>There’s a lot of investment in PRNG quality tests.</p>
<p>However, we’re not interested in evaluating whether <em>individual</em> streams are random-looking, which they very well might be. Instead, we want to find out if there’s any dependence between streams. Above we just tried two tests for independence, but they didn’t work well (there’s a lot of ways to be dependent, including ways that don’t fail the chi squared test or bz2-file-size test).</p>
<p>That said, we can use a simple trick, which is to interleave streams from the differently-seeded PRNGs. If the streams are dependent, the resulting interleaved stream is not going to be a realistic random stream. This is from the <a href="https://www.iro.umontreal.ca/~lecuyer/myftp/papers/testu01.pdf">TestU01</a> docs. Unfortunately, my laptop couldn’t really handle running the full suite of tests… Hopefully someone else can break MT for me!</p>
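<p>For concreteness, here is a sketch of that interleaving (the helper name <code>interleaved_stream</code> is mine, not TestU01’s; it assumes the same numpy setup as above):</p>

```python
import numpy as np

def interleaved_stream(seeds, n):
    # One n-byte stream per seed, like rbytes() above, but with a separate
    # RandomState per seed so the streams don't share generator state.
    streams = [np.frombuffer(np.random.RandomState(s).bytes(n), np.uint8)
               for s in seeds]
    # Round-robin interleave: s0[0], s1[0], ..., s0[1], s1[1], ...
    return np.stack(streams, axis=1).reshape(-1)

mixed = interleaved_stream(range(4), 1000)
```

<p>If the seeds leaked into each other, <code>mixed</code> is the stream you’d feed to a test battery to find out.</p>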
Sun, 20 Oct 2019 00:00:00 +0000
https://vlad17.github.io/2019/10/20/prngs.html
https://vlad17.github.io/2019/10/20/prngs.htmltoolsCompressed Sensing and Subgaussians<h1 id="compressed-sensing-and-subgaussians">Compressed Sensing and Subgaussians</h1>
<p>Candes and Tao came up with a broad characterization of compressed sensing solutions <a href="https://statweb.stanford.edu/~candes/papers/RIP.pdf">a while ago</a>. Partially inspired by a past homework problem, I’d like to explore an area of this setting.</p>
<p>This post will dive into the compressed sensing context and then focus on a proof that squared subgaussian random variables are subexponential (the relation between the two will be explained).</p>
<h2 id="compressed-sensing">Compressed Sensing</h2>
<p>For context, we’re interested in the setting where we observe an \(n\)-dimensional vector \(\vy\) that is a random linear transformation \(X\) of a hidden \(p\)-dimensional vector \(\vx_*\):</p>
<p>\[
\vy = X\vx_*
\]</p>
<p>In general, in this setting, we could have \(p>n\). If we wanted to recover \(\vx_*\), the system may be underdetermined. So a least-squares solution \((X^\top X)^{-1}X^\top\vy\) may not exist or may be unstable due to very small \(\lambda_\min(X^{\top} X)\).</p>
<p>In cases where we know \(\vx_*\) is sparse, however, with \(\norm{\vx_*}_0=k<p,n\), we can actually recover it.</p>
<p>In particular, the \(\ell_0\) estimator, which finds
\(
\vx_0=\argmin_{\vx:\norm{\vx}_0\le k}\norm{\vy-X\vx}_2
\), will converge, in the sense that the risk \(\frac{1}{n}\E\norm{X\vx_*-X\vx_0}_2^2\) is bounded above by \(O\pa{\frac{k\log p}{n}}\). This can be used to show that under some straightforward assumptions on \(k,X\) we actually converge to the true answer \(\vx_*\). Moreover, while this method seems to depend on \(k\), we can imagine doing hyperparameter search on \(k\).</p>
<p>This all looks great, in that we can recover the original entries of sparse \(\vx_*\), but the problem is solving the minimization problem under the constraint \(\norm{\vx}_0\le k\) is computationally difficult. This is a non-convex set of points with at most \(k\) non-zero entries. We’d need to check every subset to find the optimum (<em>question to self:</em> do we really? You’d think that in a non-adversarial stochastic-\(X\) situation you might want to use \(2k\) instead of \(k\) and then use a greedy algorithm like backward selection and it’d be good enough).</p>
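<p>As a toy illustration of why the \(\ell_0\) search is expensive (all names here are mine, not from the literature), brute force over every size-\(k\) support already costs \(\binom{p}{k}\) least-squares solves:</p>

```python
import itertools

import numpy as np

def l0_estimate(X, y, k):
    # Exhaustive l0-constrained least squares: try every support of size k
    # and keep the one with the smallest residual.
    n, p = X.shape
    best, best_err = None, np.inf
    for support in itertools.combinations(range(p), k):
        idx = list(support)
        coef, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        err = np.linalg.norm(y - X[:, idx] @ coef)
        if err < best_err:
            xhat = np.zeros(p)
            xhat[idx] = coef
            best, best_err = xhat, err
    return best

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
x_true = np.zeros(10)
x_true[[2, 7]] = [1.5, -2.0]
xhat = l0_estimate(X, X @ x_true, 2)
```

<p>Even at \(p=10,k=2\) that is 45 solves; at \(p=1000,k=10\) the subset count is astronomical.</p>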
<p>This is why Tao and Candes’ work is so cool. They take the efficiently-computable LASSO estimator,
\[
\vx_\lambda = \argmin_{\vx}\norm{\vy-X\vx}_2
^2+\lambda\norm{\vx}_1\,,
\]
and show that under a certain condition on \(X\), the <em>Restricted Isometry Property</em> (RIP), \(\vx_\lambda = \vx_0\). In essence, the RIP property requires that \(X\) has nearly unit eigenvalues with high probability, so it’s almost an isometry. Technically, there’s a relaxed condition called the restricted eigenvalue condition implied by RIP where we get a weaker result that implies LASSO has the same risk as \(\ell_0\).</p>
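<p>To see what “efficiently computable” buys, here is a sketch of one classic LASSO solver, proximal gradient descent (ISTA), on the objective \(\frac{1}{2}\norm{\vy-X\vx}_2^2+\lambda\norm{\vx}_1\); this is standard machinery, not from the Candes-Tao paper:</p>

```python
import numpy as np

def ista(X, y, lam, iters=2000):
    # Proximal gradient descent on 0.5 * ||y - X w||^2 + lam * ||w||_1.
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L, L = Lipschitz const of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        z = w - step * X.T @ (X @ w - y)  # gradient step on the smooth part
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return w
```

<p>Each iteration is a matrix-vector product plus a soft-threshold: polynomial time, versus the exponential subset search for \(\ell_0\).</p>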
<p>All this is motivation for understanding the question: <strong>what practical conditions on \(X\) ensure the RIP?</strong></p>
<p>It turns out we can characterize a broad class of distributions for the entries of \(X\) that enable this.</p>
<h2 id="subgaussian-random-variables">Subgaussian Random Variables</h2>
<p>Subgaussian random variables have light tails, decaying at least as fast as a Gaussian’s. In particular, \(Y\in\sg(\sigma^2)\) when
\[
\E\exp(\lambda Y)\le\exp\pa{\frac{1}{2}\lambda^2\sigma^2}
\]</p>
<p>By the Taylor expansion of \(\exp\), Markov’s inequality, and elementary properties of expectation, we can use the above to show all sorts of properties.</p>
<ul>
<li>Subgaussian variance. \(\var Y\le \sigma^2\)</li>
<li>Zero mean. \(\E Y = 0\)</li>
<li>2-homogeneity. \(\alpha Y\in\sg(\sigma^2\alpha^2)\)</li>
<li>Light tails. \(\P\ca{\abs{Y}>t}\le 2\exp\pa{\frac{-t^2}{2\sigma^2}}\)</li>
<li>Additive closure. \(Z\in\sg(\eta^2 )\independent Y\) implies \(Y+Z\in\sg(\sigma^2+\eta^2)\)</li>
<li>Higher moments. \(\E Y^{4k}\le 8k(2\sigma)^{4k}(2k-1)!\)</li>
</ul>
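<p>A quick numeric sanity check of the light-tails property (my own check, comparing against the exact Gaussian tail with \(\sigma^2=1\)):</p>

```python
import math

def exact_tail(t):
    # P(|N(0,1)| > t), exactly, via the complementary error function.
    return math.erfc(t / math.sqrt(2.0))

def sg_bound(t, sigma2=1.0):
    # The subgaussian tail bound: 2 * exp(-t^2 / (2 sigma^2)).
    return 2.0 * math.exp(-t * t / (2.0 * sigma2))
```

<p>Loose, especially for large \(t\), but it holds everywhere.</p>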
<h2 id="subexponential-random-variables">Subexponential Random Variables</h2>
<p>Subexponential random variables are like subgaussians, but their tails can be heavier, since the MGF bound need only hold near zero. In particular, \(Y\in\se(\sigma^2,s)\) satisfies the equation for \(\sg(\sigma^2)\) for \(\abs{\lambda}<s\).</p>
<p>We don’t really need to know much else about these, but it’s clear we can show similar additive closure and homogeneity properties as in the subgaussian case as long as we do bookkeeping on the second parameter \(s\).</p>
<p>It turns out that RIP holds for \(X\) with high probability if \(\vu^\top X^\top X\vu\in\se(nc, c’)\) for some constants \(c,c’\) and any unit vector \(\vu\).</p>
<p>When entries of \(X\) are independent and identically distributed, \(\vu\) can essentially be taken to be a standard unit vector without loss of generality. This requires some justification but it’s intuitive so I’ll skip it for brevity. This lets us simplify the problem to asking if \(\norm{X_1}^2\in\se(nc, c’)\), where \(X_1\) is the first column of \(X\).</p>
<p>So let’s take the entries of \(X\) to be iid, which, due to additive closure, means that the previous condition can just be \({X}_{11}^2\in\se(c,c’)\).</p>
<h2 id="squared-subgaussians">Squared Subgaussians</h2>
<p>Turns out, if the entries of \(X\) are subgaussian and iid, all of the above conditions hold. In particular, we need to show that the first entry \(X_{11}\), when squared, is subexponential.</p>
<p>We focus on a loose but good-enough bound for this use case.</p>
<p>Suppose \(Z\in\sg(\sigma^2)\). Then \(Z^2-\E Z^2\in \se(c\sigma^4,\sigma^{-2}/8)\), again, being very loose with the bound here.</p>
<p>First, consider an arbitrary rv \(Y\). By the conditional Jensen’s Inequality, for any \(\lambda\) and \(Y’\sim Y\) iid,
\[
\E\exp\pa{\lambda (Y-\E Y)}=\E\exp\pa{\CE{\lambda (Y-Y’)}{Y}}\le \E\CE{\exp\pa{\lambda (Y-Y’)}}{Y}=\E\exp\pa{\lambda (Y-Y’)}\,.
\]
Then let \(\epsilon\) be an independent Rademacher random variable, and notice we can replace \(Y-Y’\disteq \epsilon(Y-Y’)\) above. Now choose \(Y=X^2\). Then by Taylor expansion and dominated convergence,
\[
\E\exp\pa{\lambda \pa{X^2-\E X^2}}\le \E \exp\pa{\lambda \epsilon \pa{X^2-(X’)^2}}=\sum_{k=0}^\infty\frac{\lambda^k\E\ha{\epsilon^k(X^2-(X’)^2)^k}}{k!}\,.
\]
Next, notice for odd \(k\), \(\epsilon^k=\epsilon\) so by symmetry the odd terms vanish, leaving the MGF bound
\[
\E\exp\pa{\lambda \pa{X^2-\E X^2}}\le\sum_{k=0}^\infty\frac{\lambda^{2k}\E\ha{\pa{X^2-(X’)^2}^{k}}}{(2k)!}\le 2\sum_{k=0}^\infty\frac{\lambda^{2k}\E\ha{X^{4k}}}{(2k)!}\,,
\]
where above we use the fact that \(x\mapsto x^k\) is monotonic and \(\abs{X^2-(X’)^2}\le X^2\) when \(\abs{X}>\abs{X’}\), which occurs half the time by symmetry. The other half of the time, we get an equivalent expression. By subgaussian higher moments,
\[
\E \exp\pa{\lambda (X^2-\E X^2)}\le 1+c\sum_{k=1}^\infty \frac{k\pa{4\sigma^2\lambda}^{2k}(2k-1)!}{(2k)!}=1+c\sum_{k=1}^\infty\pa{4\sigma^2\lambda}^{2k}
\]
Next we assume, crudely, that \(4\sigma^2\lambda\le 2^{-1/2}\), so the head of the series above is at least as large as the tail (since the ratio decreases by at least \(1/2\)). Then,
\[
\E \exp\pa{\lambda (X^2-\E X^2)}\le 1+c(2\sigma^2\lambda)^2\le \exp(c\sigma^4\lambda^2)\,.
\]</p>
Wed, 11 Sep 2019 00:00:00 +0000
https://vlad17.github.io/2019/09/11/compressed-sensing-subgaussians.html
https://vlad17.github.io/2019/09/11/compressed-sensing-subgaussians.htmlmachine-learningMaking Lavender<h1 id="making-lavender">Making Lavender</h1>
<p>I’ve tried using Personal Capital and Mint to monitor my spending, but I wasn’t happy with what those tools offered.</p>
<p>In short, I was looking for a tool that:</p>
<ul>
<li>requires no effort on my part to get value out of (I don’t want to set budgets, I don’t even want the overhead of logging in to get updates)</li>
<li>would tell me how much I’m spending</li>
<li>would tell me why I’m spending this much</li>
<li>would tell me if anything’s changed</li>
</ul>
<p>All the tools out there are in some other weird market of “account management” where they take all your accounts (investment, saving, credit card, checking), not just the spending ones. They’re your one stop shop for managing all your net worth in one place.</p>
<p>However, I just wanted to be responsible about my spending. And I didn’t want to spend any more time dealing with personal finance apps than I had to. Kind of like <a href="https://albert.com/">Albert</a>. But when I tried it, it was way too annoying and didn’t support my credit card account.</p>
<p>At this point, I figured that I know what I want and I could do a better job at getting it myself, so I just hacked some stuff together. The end result is a weekly digest that gives exactly the analysis I want.</p>
<h2 id="pandas">Pandas</h2>
<p><em>Time investment</em>: 30 minutes</p>
<p>Download Chase statement csv. It looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Transaction Date,Post Date,Description,Category,Type,Amount
07/03/2019,07/04/2019,SQ *UDON UNDERGROUND,Food & Drink,Sale,-19.20
07/03/2019,07/04/2019,Amazon web services,Personal,Sale,-27.31
07/01/2019,07/03/2019,SWEETGREEN SOMA,Food & Drink,Sale,-17.56
</code></pre></div></div>
<p>Then just give me the heavy hitters. <a href="https://github.com/vlad17/misc/blob/master/groupby.py">Pandas hack script</a>. Among the biggest two give me a breakdown.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python ~/dev/misc/groupby.py ~/Downloads/Chase.CSV
most recent payment period <from date> <to date>
usd frac
Category
Food & Drink -841.05 51%
Travel -301.65 18%
Shopping -148.69 9%
Health & Wellness -140.09 9%
Groceries -134.64 8%
Personal -58.00 4%
total -1640.04
Food & Drink
Transaction Date Description Amount
*** 2019-**-** CIBOS ITALIAN RESTAURANT -120.00
*** 2019-**-** SALT WOOD RESTAURANT -70.00
*** 2019-**-** SAPPORO -69.98
*** 2019-**-** PACHINO PIZZERIA -60.00
*** 2019-**-** DOORDASH*BURMA LOVE -53.53
Travel
Transaction Date Description Amount
*** 2019-**-** UBER *TRIP -58.97
*** 2019-**-** CLIPPER #**** -50.00
*** 2019-**-** *********** HOTEL -32.00
*** 2019-**-** UBER *TRIP -17.02
*** 2019-**-** UBER *TRIP -16.12
</code></pre></div></div>
<p>Neato! Already more value than those stupid pie charts. But I have to log into Chase now, which is worse than logging into Mint.</p>
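<p>The heavy-hitter breakdown itself is a small groupby; a sketch over the sample rows above (not the actual script):</p>

```python
import io

import pandas as pd

CSV = """Transaction Date,Post Date,Description,Category,Type,Amount
07/03/2019,07/04/2019,SQ *UDON UNDERGROUND,Food & Drink,Sale,-19.20
07/03/2019,07/04/2019,Amazon web services,Personal,Sale,-27.31
07/01/2019,07/03/2019,SWEETGREEN SOMA,Food & Drink,Sale,-17.56
"""

df = pd.read_csv(io.StringIO(CSV))
usd = df.groupby("Category")["Amount"].sum().sort_values()  # per-category spend
frac = usd / usd.sum()  # each category's share of the total
```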
<h2 id="timely-hn-methodology">Timely HN Methodology</h2>
<p><em>Time investment</em>: 2 straight days of coding.</p>
<p>A <a href="https://news.ycombinator.com/item?id=19833881">HN</a> post came out with a guy basically doing the same thing but for privacy reasons. So I copied his approach, where you just tell Chase to send you email alerts for transactions.</p>
<p>Emails from Chase look like this.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>This is an Alert to help you manage your credit card account ending in ****.
As you requested, we are notifying you of any charges over the amount of ($USD) 0.00, as specified in your Alert settings.
A charge of ($USD) 12.74 at SQ *BLUE BOTTLE C... has been authorized on **/**/2019 7:**:** PM EDT.
Do not reply to this Alert.
If you have questions, please call the number on the back of your credit card, or send a secure message from your Inbox on www.chase.com.
To see all of the Alerts available to you, or to manage your Alert settings, please log on to www.chase.com.
</code></pre></div></div>
<p>Unlike blog post guy, I didn’t want to fuck with Zapier or Google Sheets since I want my code to do more special things. Somehow I hyped up my friend <a href="https://github.com/JoshBollar">Josh</a> to help (I think he wanted to mess with AWS). Here was our design doc:</p>
<p><img src="/assets/2019-08-18-making-lavender/ddoc.png" alt="design doc" class="center-image" /></p>
<p>So yeah, the flamegraph of your finances never happened. But hey, we did the important parts, namely:</p>
<ul>
<li>Get a domain through Route 53 to send mail to/from.</li>
<li>Set up an SNS topic to receive emails. Received emails are either forwarding confirmations (which need to be confirmed) or actual transaction notifications from Chase, set up to be forwarded via the user’s email account.</li>
<li>AWS lambda to regex parse the transaction emails, dump transaction in makeshift NoSQL store which is really just flat json documents on S3.</li>
<li>AWS lambda to spin up weekly and send out summary digests via SES to all users (who we know by ls-ing the S3 bucket)</li>
<li>Matplotlib rendering of a barchart</li>
</ul>
<p>Yeah, yeah, so much yikes architecturally. The code’s just as smelly, but whatever we wanted a scalability of 2.</p>
<p><img src="/assets/2019-08-18-making-lavender/v0email.png" alt="first version" class="center-image" /></p>
<h2 id="switch-to-an-api">Switch to an API</h2>
<p><em>Time investment</em>: 6 non-contiguous days, about 17 hours total.</p>
<p>The above was hacky, but an essentially free service that gave me what I wanted. The main downside was that the emails from Chase didn’t have a lot of info on the transactions themselves.</p>
<ul>
<li>Switch to <a href="https://plaid.com/">Plaid</a>, a real API for transactions. This meant I could get rid of the lambda for handling new transactions. And I got nicer categories for the payments.</li>
<li>Keep a postgres RDS running on a <code class="language-plaintext highlighter-rouge">t3.micro</code> with all the transaction info. The lambda would spin up, use environment variable secrets to connect, update with new transactions from Plaid, and send the digest. Migrating from flat json S3 storage to a real database took the most time.</li>
</ul>
<p>The biggest improvement, I think, was “versus” analysis, which identifies what categories you’re spending more or less in than usual. I just made up a differencing algorithm here, I don’t think anything out there solves this problem super well on its own (it’s a harder problem than you’d think, since transactions belong to multiple categories).</p>
<p><img src="/assets/2019-08-18-making-lavender/time-spend.png" alt="spend" class="center-image" /></p>
<p>The biggest pain point here was that AWS Lambda didn’t support deployment packages over 250MB uncompressed. With scipy alone at 70MB, this was an annoying limit to squeeze under. I had to manually go into the seaborn package, which I use for viz now, and gut out scipy. Probably a better way is to just download dependencies on init.</p>
<h2 id="whats-next">What’s next?</h2>
<p>I’m pretty happy with the app as it is now for personal use.</p>
<p>I may make this available to others (<a href="/about">email me</a> if you want this to happen). The app would send you weekly digests, at 8am Pacific Time on Saturdays.</p>
<p>Before it’s generally publicly available, the email needs a bit of polish, and a static website would be nice, as well as some EULA or something.</p>
Sun, 18 Aug 2019 00:00:00 +0000
https://vlad17.github.io/2019/08/18/making-lavender.html
https://vlad17.github.io/2019/08/18/making-lavender.htmltoolsFacebook AI Similarity Search (FAISS), Part 1<h1 id="faiss-part-1">FAISS, Part 1</h1>
<p>FAISS is a powerful GPU-accelerated library for similarity search. It’s available under the MIT license <a href="https://github.com/facebookresearch/faiss">on GitHub</a>. Even though <a href="https://arxiv.org/abs/1702.08734">the paper</a> came out in 2017, and, under some interpretations, the library lost its SOTA title, when it comes to practical concerns:</p>
<ul>
<li>the library is actively maintained and cleanly written.</li>
<li>it’s still extremely competitive by any metric, enough so that the bottleneck for your application won’t likely be in FAISS anyway.</li>
<li>if you bug me enough, I may fix my one-line EC2 spin-up script that sets up FAISS deps <a href="https://github.com/vlad17/aws-magic">here</a>.</li>
</ul>
<p>This post will review context and motivation for the paper. Again, the approximate similarity search space may have progressed to different kinds of techniques, but FAISS’s techniques are powerful, simple, and inspirational in their own right.</p>
<h2 id="motivation">Motivation</h2>
<p>At a high level, <strong>similarity search helps us find similar high dimensional real vectors from a fixed “database” of vectors to a given query vector, without resorting to checking each one. In database terms, we’re making an index of high-dimensional real vectors.</strong></p>
<h3 id="who-cares">Who Cares</h3>
<h5 id="spam-detection">Spam Detection</h5>
<p><img src="/assets/2019-07-18-faiss/tinder.jpg" alt="tinder logo" class="center-image" /></p>
<blockquote>
<p>Tinder bot 1 bio: “Hey, I’m just down for whatever you know? Let’s have some fun.”</p>
<p>Tinder bot 2 bio: “Heyyy, I’m just down for whatevvver you know? Let’s have some fun.”</p>
<p>Tinder bot 3 bio: “Heyyy, I’m just down for whatevvver you know!!? I just wanna find someone who wants to have some fun.”</p>
</blockquote>
<p>You’re Tinder and you know spammers make different accounts, and they randomly tweak the bios of their bots, so you have to check similarity across all your bios. How?</p>
<h5 id="recommendations">Recommendations</h5>
<p>You’re <img src="/assets/2019-07-18-faiss/fb.png" alt="facebook" style="display:inline" /> or <img src="/assets/2019-07-18-faiss/goog.png" alt="google" style="display:inline" /> and users clicking on ads keep the juices flowing.</p>
<p>Or you’re <img src="/assets/2019-07-18-faiss/amazon.png" alt="amazon" style="display:inline" /> and part of trapping people with convenience is telling them what they want before they want it. Or you’re <img src="/assets/2019-07-18-faiss/netflix.png" alt="netflix" style="display:inline" /> and you’re trying to keep people inside on a Friday night with another Office binge.</p>
<p>Luckily for those companies, their greatest minds have turned those problems into summarizing me as faux-hipster half-effort yuppie as encoded in a dense 512-dimensional vector, which must be matched via inner product with another 512-dimensional vector for Outdoor Voices’ new marketing “workout chic” campaign.</p>
<h3 id="problem-setup">Problem Setup</h3>
<p>You have a set of database vectors \(\{\textbf{y}_i\}_{i=0}^\ell\), each in \(\mathbb{R}^d\). You can do some prep work to create an index. Then at runtime I ask for the \(k\) closest vectors, which might be measured in \(L^2\) distance, or the vectors with the largest inner product.</p>
<p>Formally, we want the set \(L=\text{$k$-argmin}_i\norm{\textbf{x}-\textbf{y}_i}\) given \(\textbf{x}\).</p>
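<p>For reference, the exact version of this query is a one-liner (naming mine); everything that follows is about avoiding this full scan:</p>

```python
import numpy as np

def exact_knn(Y, x, k):
    # Brute-force k-argmin over squared L2 distances: O(l * d) work per query.
    d2 = np.sum((Y - x) ** 2, axis=1)
    return np.argsort(d2)[:k]
```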
<p>Overlooking the fact that this is probably an image of \(k\)-nearest neighbors, this summarizes the situation, in two dimensions:</p>
<p><img src="/assets/2019-07-18-faiss/nearest-neighbors.png" alt="nearest neighbors" class="center-image" /></p>
<h5 id="why-is-this-hard">Why is this hard?</h5>
<p>Suppose we have 1M embeddings at a dimensionality of about 1K. This is a very conservative estimate, but it already amounts to scanning over 1GB of data per query if done naively.</p>
<p>Let’s continue to be extremely conservative and say our service is replicated so much that we have one machine per live query per second, which is still a lot of machines. Scanning 1GB of data serially on a node with 10Gb/s of memory bandwidth isn’t something you can do at interactive speeds: this extremely crude simplification alone clocks in at around 1 second of response time.</p>
<p>Exact methods for answering the above problem (Branch-and-Bound, LEMP, FEXIPRO) limit the search space. The most recent <a href="https://github.com/stanford-futuredata/optimus-maximus">SOTA for exact search</a> is still 1-2 orders of magnitude below approximate methods. For the previous use cases, we don’t care about exactness (though there certainly are cases where it matters).</p>
<h2 id="related-work">Related Work</h2>
<h5 id="before-faiss">Before FAISS</h5>
<p>FAISS itself is built on product quantization work from its authors, but for context there were a couple of interesting approximate nearest-neighbor search problems around.</p>
<p>Tangentially related is the lineage of hashing-based approaches <a href="https://www.microsoft.com/en-us/research/publication/speeding-up-the-xbox-recommender-system-using-a-euclidean-transformation-for-inner-product-spaces/">Bachrach et al 2014</a> (Xbox), <a href="https://arxiv.org/abs/1405.5869">Shrivastava and Li 2014</a> (L2ALSH), <a href="https://arxiv.org/abs/1410.5518">Neyshabur and Srebro 2015</a> (Simple-ALSH) for solving inner product similarity search. The last paper in particular has a unifying perspective between inner product similarity search and \(L^2\) nearest neighbors (namely a reduction from the former to the latter).</p>
<p>However, for the most part, it wasn’t locally-sensitive hashing, but rather clustering and hierarchical index construction that was the main approach to this problem before. One of the nice things about the FAISS paper in my view is that it is a disciplined epitome of these approaches that’s effectively implemented.</p>
<h5 id="after-faiss">After FAISS</h5>
<p>Recently hot new graph-based approaches have been killing it in the <a href="http://ann-benchmarks.com/">benchmarks</a>. It makes you think FAISS is out, <a href="https://github.com/nmslib/hnswlib">HNSW</a> and <a href="https://github.com/yahoojapan/NGT">NGT</a> are in.</p>
<p><img src="/assets/2019-07-18-faiss/benchmarks.png" alt="benchmarks" class="center-image" /></p>
<p>Just kidding. Like the second place winners for ILSVRC 2012 will tell you, simple and fast beats smart and slow. As <a href="https://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/">this guy</a> proved, a CPU implementation from 2 years in the future still won’t compete with a simpler GPU implementation from the past.</p>
<p><img src="/assets/2019-07-18-faiss/gpucpu.png" alt="gpu vs cpu" class="center-image" /></p>
<p>You might say this is an unfair comparison, but life (resource allocation) doesn’t need to be fair either.</p>
<h2 id="evaluation">Evaluation</h2>
<p>FAISS provides an engine which approximately answers the query \(L=\text{$k$-argmin}_i\norm{\textbf{x}-\textbf{y}_i}\) with the response \(S\).</p>
<p>The metrics for evaluation here are:</p>
<ul>
<li>Index build time, in seconds. For a set of \(\ell\) database vectors, how long does it take to construct the index?</li>
<li>Search time, in seconds, which is the average time it takes to respond to a query.</li>
<li><em>R@k</em>, or recall-at-\(k\). Here the response \(S\) may be slightly larger than \(k\), but we look at the closest \(k\) items in \(S\) with an exact search, yielding \(S_k\). This value is then \(\card{S_k\cap L}/k\), where \(k=\card{L}\).</li>
</ul>
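<p>As a sketch, R@k over a response already ranked by exact distance (names mine):</p>

```python
def recall_at_k(ranked_response, true_ids, k):
    # S_k = the k closest returned items (the response is pre-ranked here by
    # an exact search); R@k = |S_k intersect L| / k, with L the true neighbors.
    s_k = set(ranked_response[:k])
    return len(s_k & set(true_ids)) / k
```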
<h2 id="faiss-details">FAISS details</h2>
<p>In <a href="/2019/07/18/faiss-pt-2.html">the next post</a>, I’ll take a look at how FAISS addresses this problem.</p>
Thu, 18 Jul 2019 00:00:00 +0000
https://vlad17.github.io/2019/07/18/faiss.html
https://vlad17.github.io/2019/07/18/faiss.htmlparallelhardware-accelerationFacebook AI Similarity Search (FAISS), Part 2<h1 id="faiss-part-2">FAISS, Part 2</h1>
<p>I’ve <a href="/2019/07/18/faiss.html">previously</a> motivated why nearest-neighbor search is important. Now we’ll look at how <a href="https://arxiv.org/abs/1702.08734">FAISS</a> solves this problem.</p>
<p>Recall that you have a set of database vectors \(\{\textbf{y}_i\}_{i=0}^\ell\), each in \(\mathbb{R}^d\). You can do some prep work to create an index. Then at runtime I ask for the \(k\) closest vectors in \(L^2\) distance.</p>
<p>Formally, we want the set \(L=\text{$k$-argmin}_i\norm{\textbf{x}-\textbf{y}_i}\) given \(\textbf{x}\).</p>
<p>The main paper contributions in this regard were a new algorithm for computing the top-\(k\) scalars of a vector on the GPU and an efficient k-means implementation.</p>
<h2 id="big-lessons-from-faiss">Big Lessons from FAISS</h2>
<p>Parsimony is important. Not only does it indicate you’re using the right representation for your problem, but it’s better for bandwidth and better for cache. E.g., see this <a href="https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index">wiki link</a>, HNSW on 1B vectors at 32 levels results in TB-level index size!</p>
<p>Prioritize parallel-first computing. The underlying algorithmic novelty behind FAISS takes a serially slow algorithm, an \(O(n \log^2 n)\) sort, and parallelizes it to something that takes \(O(\log^2 n)\) serial time. Unlike in serial computing, in parallel settings we can afford to take on more total work so long as the computation DAG stays wide and its span (critical path) stays short. Here, speed comes from proper hardware-efficient vectorization.</p>
<h2 id="the-gpu">The GPU</h2>
<p>The paper, refreshingly, reviews the GPU architecture.</p>
<p><img src="/assets/2019-07-18-faiss-pt2/gpu.png" alt="gpu" class="center-image" /></p>
<p>Logical compute hierarchy is <code class="language-plaintext highlighter-rouge">grid -> block -> warp -> lane (thread)</code></p>
<p>Memory hierarchy is <code class="language-plaintext highlighter-rouge">main mem (vram) -> global l2 -> stream multiprocessor (SM) l1 + shared mem</code>, going from multi-GB to multi-MB to about <code class="language-plaintext highlighter-rouge">16+48 KB</code>.</p>
<p>There might be one or more blocks scheduled to a single streaming multiprocessor, which is itself a set of cores. Cores have their own floating point processing units and integer units, but other supporting units like the MMU-equivalent are shared.</p>
<p>My takeaways from this section were the usual “maximize the amount of work each core is doing independently, keeping compute density high and memory accesses low, especially shared memory”, but with two important twists:</p>
<ul>
<li>GPU warps (gangs of threads) exhibit worse performance when the threads aren’t performing the same instructions on possibly different data (<em>warp divergence</em>).</li>
<li>Each thread is best kept dealing with the memory in its own lane (which typically is a slice of a 32-strided array that the block is processing with multiple warps in a higher granularity of parallelism), but there can be synchronization points through the register file which exchange memory between the threads.</li>
</ul>
<p>Note there are 32 threads to a warp, we’ll see that come up.</p>
<h2 id="faiss--ivf--adc">FAISS = IVF + ADC</h2>
<p>FAISS answers the question of “what are the closest database points to the query point” by constructing a 2-level tree. Database vectors are further compressed to make the tree smaller.</p>
<p>Given \(n\) database vectors, we cluster with k-means for the top level, getting about \(\sqrt{n}\) centroids. Then, at search time, we use exact search to find the closest centroids, and then scan the members of those centroids’ clusters to find the closest database vectors overall.</p>
<p>For a 2-level tree, a constant factor of \(\sqrt{n}\) is the optimal cluster size since then the exact search that we do is as small as possible at both levels of the tree.</p>
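<p>That balance can be made explicit (a quick sketch, not the paper’s argument): with \(c\) top-level centroids and \(\tau=1\), a query exactly scans the \(c\) centroids plus the roughly \(n/c\) members of one cluster, and
\[
\frac{d}{dc}\pa{c + \frac{n}{c}} = 1 - \frac{n}{c^2} = 0 \iff c = \sqrt{n}\,.
\]</p>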
<p>Since it’s possible the point might be near multiple centroids, FAISS looks at the \(\tau\) closest centroids in the top level of the tree, and then searches all cluster members among the \(\tau\) clusters.</p>
<p>So the larger search occurs when looking at the second level.</p>
<p>Compression reduces I/O pressure as the second-level’s database vectors are loaded. Furthermore, the specific compression algorithm chosen for FAISS, <a href="https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf">Product Quantization</a> (PQ) enables distance computation on the codes themselves! The code is computed on the <em>residual</em> of the database vector \(\textbf{y}\) from its centroid \(q_1(\textbf{y})\).</p>
<p><img src="/assets/2019-07-18-faiss-pt2/residual.png" alt="residual" class="center-image" /></p>
<p>The two-level tree format is the inverted file (IVF), which is essentially a list of records for the database vectors associated with each cluster.</p>
<p>ADC, or asymmetric distance computation, refers to the fact that we’re using the code of the database vector and calculating its distance from the exact query vector. This can be made symmetric by using a code for the query vector as well. We might do this because the coded distance computation can actually be faster than a usual Euclidean distance computation.</p>
<p><img src="/assets/2019-07-18-faiss-pt2/adc.png" alt="ADC" class="center-image" /></p>
<h2 id="faiss-the-easy-part">FAISS, the easy part</h2>
<p>The above overview yields a simple algorithm.</p>
<ol>
<li>Compute exact distances to top-level centroids</li>
<li>Compute ADC over the inverted lists of the probed centroids, generating essentially a list of pairs (index of probed database vector, approximate distance to query point)</li>
<li>Extract the smallest \(\ell\) pairs by the second item (the approximate distance), for some \(\ell\) not much larger than \(k\); then return the top \(k\) among those.</li>
</ol>
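<p>A minimal sketch of these three steps might look like the following, using exact distances in place of ADC for brevity; the names and data layout here are illustrative, not FAISS’s actual API.</p>

```python
import numpy as np

# Sketch of IVF search: probe the tau nearest top-level centroids, then
# scan their inverted lists for the k nearest database vectors.
# `centroids` is a (c, d) array; `invlists[j]` holds (index, vector) pairs.

def ivf_search(q, centroids, invlists, tau, k):
    # Step 1: exact distances to top-level centroids; probe the tau nearest.
    d2 = ((centroids - q) ** 2).sum(axis=1)
    probes = np.argsort(d2)[:tau]
    # Step 2: scan the probed inverted lists, collecting (index, distance).
    cands = [(i, float(((v - q) ** 2).sum()))
             for j in probes for i, v in invlists[j]]
    # Step 3: keep the k smallest by distance.
    cands.sort(key=lambda p: p[1])
    return cands[:k]
```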
<p>The meat of the paper is doing these steps quickly.</p>
<h2 id="fast-adc-via-pq">Fast ADC via PQ</h2>
<p>Product Quantization (PQ) boils down to compressing subvectors independently. E.g., we might have a four-dimensional vector \(\textbf{y}=[1, 2, 3, 4]\). We quantize it with \(b=2\) factors as \([(1, 2), (3, 4)]\). Doing this for all our vectors yields \(b\) sets of smaller vectors. The FAISS paper denotes these subvectors as \(\textbf{y}^1=(1, 2), \textbf{y}^2=(3, 4)\).</p>
<p>We then cluster the \(b\) sets independently with 256 centroids. The centroids that these subvectors get assigned to might be \(q^1(\textbf{y}^1)=(1, 1), q^2(\textbf{y}^2)=(4, 4.5)\), which is where the lossy part of the compression comes in. On the plus side, we just encoded 4 floats with 2 bytes!</p>
<p>This compression technique is applied to the <em>residual</em> of the database vectors for their centroids, meaning we have PQ dictionaries for each centroid.</p>
<p>The key insight here is that we can also break up our query vector \(\textbf{x}=[\textbf{x}^1, \cdots, \textbf{x}^b]\), and create distance lookup tables on the sub-vectors individually, so the distance to a database vector is just a sum of \(b\) looked-up values!</p>
<p><img src="/assets/2019-07-18-faiss-pt2/pq-lookup.png" alt="PQ Lookup" class="center-image" /></p>
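<p>As a rough sketch of the mechanics (tiny codebooks here for illustration; the paper uses \(b\) sub-quantizers with 256 centroids each), the encode step and the query-side lookup tables might look like:</p>

```python
import numpy as np

# Hedged sketch of PQ encoding plus ADC lookup tables; shapes and names are
# illustrative, not FAISS's API. Each codebook is a (centroids, sub_dim)
# array for one sub-block.

def pq_encode(y, codebooks):
    """Encode y as one centroid index per sub-block."""
    subs = np.split(y, len(codebooks))
    return [int(np.argmin(((cb - s) ** 2).sum(axis=1)))
            for s, cb in zip(subs, codebooks)]

def adc_tables(x, codebooks):
    """Squared distances from each query sub-vector to every centroid."""
    subs = np.split(x, len(codebooks))
    return [((cb - s) ** 2).sum(axis=1) for s, cb in zip(subs, codebooks)]

def adc_distance(code, tables):
    """ADC distance to one database vector: a sum of b table lookups."""
    return sum(t[c] for t, c in zip(tables, code))
```

Note how the per-vector work at query time is just \(b\) lookups and additions, regardless of the dimensionality of the sub-vectors.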
<h2 id="top-k">Top-k</h2>
<p>OK, so now comes the hard part. We just did steps 1 and 2 really fast, and it’s clear those are super parallelizable algorithms, but how do we get the top (smallest) \(k\) items from the list?</p>
<p>Well, on a CPU, we’d implement this in a straightforward way. Use a max-heap of size \(k\), scan through our list of size \(n\), and then if the next element is smaller than the max of the heap or the heap has size less than \(k\), pop-and-insert or just insert, respectively, into the heap, yielding an \(O(n\log k)\) algorithm.</p>
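<p>A minimal sketch of this serial heap-based select (Python’s <code class="language-plaintext highlighter-rouge">heapq</code> is a min-heap, so values are negated to simulate a max-heap):</p>

```python
import heapq

# Serial O(n log k) top-k (smallest) select, as described above: keep a
# max-heap of size k and pop-and-insert whenever a smaller value arrives.

def topk_smallest(values, k):
    heap = []  # stores negated values; -heap[0] is the current k-th smallest
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:                # smaller than the heap's max
            heapq.heapreplace(heap, -v)   # pop-and-insert in one step
    return sorted(-x for x in heap)
```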
<p>We could parallelize this \(p\) ways by chopping into \(n/p\)-sized chunks, building \(p\) \(k\)-max-heaps, and merging all the heaps, but the intrinsic algorithm does not parallelize well beyond that. This works well enough when you have lots of CPUs, but it is not nearly compute-dense enough for tightly-packed GPU threads, 32 to a warp, where you need to do a lot more computation per byte (having each of those threads maintain its own heap results in a lot of data-dependent instruction divergence).</p>
<p>The alternative approach proposed by FAISS is:</p>
<ul>
<li>Create an extremely parallel mergesort</li>
<li>“Chunkify” the CPU algorithm, taking a big bite of the array at a given time, keeping a “messy max-heap” of a lot more than \(k\) (namely, \(k+32t\)) that includes everything the \(k\)-max-heap would.</li>
<li>Every once in a while, do a full sort on the messy max-heap.</li>
</ul>
<p>Squinting from a distance, this looks similar to the original algorithm, but the magic is in the “chunkification” which enables full use of the GPU.</p>
<h3 id="highly-parallel-mergesort">Highly Parallel Mergesort</h3>
<p>As mentioned, this innovation is essentially an in-place mergesort that does \(O(n\log^2 n)\) work but has a low serial depth (span), so it admits massive parallelism.</p>
<p>The money is in the merge operation, which is based on Batcher’s bitonic sorting network. The invariant is that we maintain a list of sorted sequences (lexicographically).</p>
<ol>
<li>First, we have one sequence of length at most \(n\) [trivially holds]</li>
<li>Then, we have 2 sequences of length at most \(n/2\)</li>
<li>4 sequences length \(n/4\)</li>
<li>Etc.</li>
</ol>
<p><img src="/assets/2019-07-18-faiss-pt2/odd-size.png" alt="odd size" class="center-image" /></p>
<p>Each merge has \(\log n\) steps, where at each step we might have up to \(n\) swaps, but they are disjoint and can happen in parallel. The key is to see that these \(n\) independent swaps ensure lexicographic ordering among the sequences.</p>
<p>This is the <code class="language-plaintext highlighter-rouge">odd-merge</code> (Algorithm 1) in the paper. There’s additional logic for irregularly-sized lists to be merged. We’ll come back to this.</p>
<p>Once we have a parallel merge that requires logarithmic serial time, the usual merge sort (Algorithm 2), which itself has a recursion tree of logarithmic depth, results in a \(O(\log^2 n)\) serial time (or depth) algorithm, assuming infinite processors.</p>
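<p>To make the structure concrete, here is a sketch of the classic iterative form of Batcher’s sorting network, assuming a power-of-two input length (the paper’s <code class="language-plaintext highlighter-rouge">odd-merge</code> handles the irregular sizes). The point to notice is that within each inner pass the compare-and-swap pairs are disjoint, so all of them could run in parallel, and there are \(O(\log^2 n)\) passes:</p>

```python
# Iterative bitonic sorting network (requires power-of-two length).
# The inner `for i` loop performs disjoint compare-and-swaps on pairs
# (i, i ^ j), so each (k, j) pass could execute entirely in parallel.

def bitonic_sort(a):
    n = len(a)
    k = 2
    while k <= n:                 # outer merge stages
        j = k // 2
        while j > 0:              # log(k) passes per stage
            for i in range(n):    # all swaps in this pass are independent
                l = i ^ j
                if l > i:
                    ascending = (i & k) == 0
                    if (ascending and a[i] > a[l]) or \
                       (not ascending and a[i] < a[l]):
                        a[i], a[l] = a[l], a[i]
            j //= 2
        k *= 2
    return a
```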
<p><img src="/assets/2019-07-18-faiss-pt2/merge-sort.png" alt="merge sort" class="center-image" /></p>
<h3 id="chunkification">Chunkification</h3>
<p>This leads to WarpSelect, which is the chunkification mentioned earlier. In essence, our messy max-heap is a combination (and thus superset) of:</p>
<ul>
<li>The strict size \(k\) max-heap with the \(k\) lowest values seen so far. In fact, this is sorted when viewed as a 32-stride array.</li>
<li>32 thread queues, each maintained in sorted order.</li>
</ul>
<p>So \(T_0^j\le T_i^j\) for \(i>0\), and \(T_0^j\ge W_{k-1}\). Thus if an input is greater than any thread queue head, it can be safely ignored (a weak bound).</p>
<p><img src="/assets/2019-07-18-faiss-pt2/warp-select.png" alt="warp select" class="center-image" /></p>
<p>On the fast path, the next 32 values are read in, and we do a SIMT (single instruction, multiple-thread) compare on each value assigned to each thread. A primitive instruction checks if any of the warp’s threads had a value below the cutoff of the max heap (if none did, we know for sure none of those 32 values are in the top \(k\) and can move on).</p>
<p>If there was a violation, after the per-lane insertion sort the thread heads might be smaller than they were before. Then we do a full sort of the messy heap, restoring the fact that the strict max-heap has the lowest \(k\) values so far.</p>
<ul>
<li>At this point, it’s clear why we needed a merge sort: the strict max-heap (“warp queue” in the image) is already sorted, so we can avoid re-sorting it by using a merge-based sorting algorithm.
<ul>
<li>Finally, it’s worth pointing out that recasting the fully sorted messy heap into the thread queues maintains the sorted order within each lane.</li>
</ul>
</li>
<li>Further, it’s clear why the FAISS authors created a homebrew merge algorithm compatible with irregular merge sizes, as opposed to existing power-of-2 parallel merge algorithms: the thread queues are irregularly sized compared to \(k\), and it’d be a lot of overhead to round the array sizes up to powers of two.</li>
</ul>
<p>This leads to the question: why have thread queues at all? Why not make their size exactly 1?</p>
<p>This points to a convenient piece of slack, the thread queue length \(t\), which lets us trade off the cost of the full merge sort against the per-thread insertion sort done every time the new values are read in. The optimal choice depends on \(k\).</p>
<h2 id="results">Results</h2>
<p>Remember, it’s not apples to apples, because FAISS gets a GPU and modern methods use CPUs, but who cares.</p>
<p>Recall from the <a href="/2019/07/18/faiss.html">previous post</a> that the <code class="language-plaintext highlighter-rouge">R@1</code> metric is the average frequency with which the method actually returns the nearest neighbor (it may have the query \(k\) set higher). The different parameters used here don’t matter so much, but I’ll highlight what each row means individually.</p>
<p><a href="https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors">SIFT1M</a></p>
<p><img src="/assets/2019-07-18-faiss-pt2/sift.png" alt="sift" class="center-image" /></p>
<p>HNSW is a modern CPU-based competitor using an algorithm published two years after the FAISS paper. Flat is naive exact search. In this benchmark, the PQ optimization was not used (database vector distances were computed exactly).</p>
<p><a href="https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors">Deep1B</a></p>
<p><img src="/assets/2019-07-18-faiss-pt2/deep1b.png" alt="deep1b" class="center-image" /></p>
<p>Here, for the very large dataset, the authors do use compression (OPQ indicates a preparatory transformation for the compression).</p>
<p>On the whole, FAISS is still the winner since it can take advantage of hardware. On the CPUs, it’s still a contender when it comes to a memory-speed-accuracy tradeoff.</p>
<h2 id="extensions-and-future-work">Extensions and Future Work</h2>
<p>The authors of the original FAISS work have themselves looked into extensions that combine the FAISS approach with then newer graph-based neighborhood algorithms (<a href="https://arxiv.org/abs/1804.09996">Link and Code</a>).</p>
<p>Other future work that the authors have since performed has been in improving the organization of the two-level tree structure. The centroid-based approach of the IVF implicitly partitions the space with a Voronoi diagram. As the <a href="https://cache-ash04.cdn.yandex.net/download.yandex.ru/company/cvpr2012.pdf">Inverted Multi-Index</a> (IMI) paper explores, this results in a lot of unnecessary neighbors being probed that are far away from the query point but happen to belong to the same Voronoi cell. One extension that now exists in the code base is to use IMI instead of IVF.</p>
<p>It’s also fun to consider how these systems will be evolving over time. As memory bandwidth increases, single node approaches (like FAISS) grow increasingly viable since they can keep compute dense. However, as network speeds improve, distributed approaches with many, many CPUs look attractive. The latter types of algorithms rely more on hierarchy and less on vectorization and compute density.</p>
Thu, 18 Jul 2019 00:00:00 +0000
https://vlad17.github.io/2019/07/18/faiss-pt-2.html
https://vlad17.github.io/2019/07/18/faiss-pt-2.html
parallel, hardware-acceleration
BERT, Part 3: BERT<h1 id="bert">BERT</h1>
<p>In the last two posts, we reviewed <a href="/2019/03/09/dl-intro.html">Deep Learning</a> and <a href="/2019/06/22/bert-pt-2-transformer.html">The Transformer</a>. Now we can discuss an interesting advance in NLP, BERT, Bidirectional Encoder Representations from Transformers (<a href="https://arxiv.org/abs/1810.04805">arxiv link</a>).</p>
<p>BERT is a self-supervised method, which uses just a large set of unlabeled textual data to learn representations broadly applicable for different language tasks.</p>
<p>At a high level, BERT’s pre-training objective, which is what’s used to get its parameters, is a language modeling (LM) problem. LM is an instance of parametric modeling applied to language.</p>
<blockquote>
<p>Typical LM task: what’s the probability that the next word is “cat” given the sentence is “The dog chased the ????”</p>
</blockquote>
<p>Let’s consider a natural language sentence \(x\). In some way, we’d like to construct a loss function \(L\) for a language modeling task. We’ll keep it abstract for now, but, if we set up the model \(M\) right, and have something that generally optimizes \(L(M(\theta), x)\), then we can interpret one of BERT’s theses as the claim that this representation transfers to new domains.</p>
<p>That is, for some very small auxiliary model \(N\) and a set of parameters \(\theta'\) close enough to \(\theta\), we can optimize a different task’s loss (say, \(L'\), the task that tries to classify sentiment \(y\)) by minimizing \(L'(N(\omega)\circ M(\theta'),(x, y))\).</p>
<p>One of the reasons we might imagine this to work is by viewing networks like \(M(\theta')\) as featurizers that create a representation ready for the final layer to do a simple linear classification on.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/last-layer-feat.png" alt="featurization on the last layer" class="center-image" /></p>
<p>Indeed, the last layer of a neural network performing a classification task is just a logistic regression on the features generated by the layers before it. It makes sense that those features could be useful elsewhere.</p>
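<p>A toy sketch of this view, with a frozen random projection standing in for the pretrained network (purely illustrative, not BERT): fine-tuning then amounts to fitting a logistic-regression head on frozen features.</p>

```python
import numpy as np

# Toy "frozen featurizer + logistic-regression head" sketch. W_pre stands
# in for pretrained weights and stays fixed; only the head w is trained.
# All names here are illustrative assumptions, not a real pretrained model.

rng = np.random.default_rng(0)
W_pre = rng.normal(size=(10, 16))        # "pretrained" weights (frozen)

def featurize(X):
    return np.maximum(0, X @ W_pre)      # the network-as-featurizer view

def fit_head(X, y, steps=500, lr=0.1):
    """Gradient descent on log-loss for a linear head over frozen features."""
    F = featurize(X)
    w = np.zeros(F.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-np.clip(F @ w, -30, 30)))  # clipped sigmoid
        w -= lr * F.T @ (p - y) / len(y)
    return w
```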
<p><img src="/assets/2019-06-23-bert-pt3-bert/fig1.png" alt="bert figure 1" class="center-image" /></p>
<h2 id="contribution">Contribution</h2>
<p>The motivation for this kind of approach (LM pre-training and then a final fine-tuning step) versus task-specific NLP is twofold:</p>
<ul>
<li>Data volume is much larger for the LM pre-training task</li>
<li>The approach can solve multiple problems at once.</li>
</ul>
<p>Thus, the contributions of the paper are:</p>
<ul>
<li>An extremely robust, generic approach to pretraining. 11 SOTAs in one paper.</li>
<li>Simple algorithm.</li>
<li>Effectiveness is profound because (1) the general principle of self-supervision can likely be applied elsewhere and (2) ablation studies in the paper show that representation is the bottleneck.</li>
</ul>
<h2 id="technical-insights">Technical Insights</h2>
<p>The new training procedure and architecture that BERT provides is conceptually simple.</p>
<p>BERT provides deep, bidirectional, context-sensitive encodings.</p>
<p>Why do we need all three of these things? Let’s consider a training task, next sentence prediction (NSP), to demonstrate.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/deep-bi-cxt-ex.png" alt="example deep bidirectional" class="center-image" /></p>
<p>We can’t claim that this is exactly what’s going on in BERT, but as humans we certainly require bidirectional context to answer. In particular, for some kind of logical relation between the entities in a sentence, we first need (bidirectional) context. I.e., to answer whether “buying milk” is something we do in a store, we need to look at the verb, object, and location.</p>
<p>What’s more, to answer complicated queries about the coherence of two sentences, we need to layer additional reasoning beyond the logical relations we can infer at the first level. We might be able to detect inconsistencies at L0, but for more complicated interactions we need to look at a relationship between logical relationships (L1 as pictured above).</p>
<p>So, it may make sense that to answer logical queries of a certain nesting depth, we’d need to recursively apply our bidirectional, contextualization representation up to a corresponding depth (namely, stacking Transformers). In the example, we might imagine this query to look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>was-it-the-same-person(
who-did-this("man", "went"),
who-did-this("he", "bought")) &&
is-appropriate-for-location(
"store", "bought", "milk")
</code></pre></div></div>
<h2 id="related-work">Related work</h2>
<p>It’s important to describe existing related work that made strides in this direction. Various previous deep learning architectures have independently proposed using LM for transfer learning to other tasks and deep, bidirectional context (but not all at once).</p>
<p>In particular, relevant works are <a href="https://nlp.stanford.edu/pubs/glove.pdf">GloVe</a>, <a href="https://arxiv.org/abs/1802.05365">ELMo</a>, and <a href="https://openai.com/blog/language-unsupervised/">GPT</a>.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/related-work.png" alt="related work overview" class="center-image" /></p>
<h2 id="training">Training</h2>
<p>As input, BERT uses the BooksCorpus (800M words) and English Wikipedia (2,500M words), totaling 3.3B words, split into a vocabulary of 30K word pieces. There were a few standard NLP featurization techniques applied to this as well (lower casing, for instance), though I think the architecture could’ve handled richer English input.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/fig2.png" alt="bert figure 2" class="center-image" /></p>
<p>But what’s the output? Given just the inputs, how can we create a loss that learns a good context-sensitive representation of each word? This needs to be richer than the context-free representation of each word (i.e., the embedding that each word piece starts as in the first layer of the input to the BERT network).</p>
<p>We might try to recover the original input embedding, but then the network would just learn the identity function. This is the correct answer if we’re just learning on the joint distribution of \((x, x)\) between a sentence and itself.</p>
<p>Instead, BERT trains on sequence <em>recovery</em>. That is, our input is a sentence \(x_{-i}\) missing its \(i\)-th word, and our output is the \(i\)-th word itself, \(x_i\). This is implemented efficiently with masking in practice. That is, the input-output pair is \((\text{“We went [MASK] at the mall.”}, \text{“shopping”})\). In the paper, <code class="language-plaintext highlighter-rouge">[MASK]</code> is the placeholder for a missing word.</p>
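<p>A toy sketch of constructing such a masked input-output pair (this single-token version is just for illustration; the paper masks 15% of token positions, with an 80/10/10 mask/random/unchanged replacement rule):</p>

```python
import random

# Build one masked-LM training pair: hide one token, predict it.
# Single-token masking is a simplification of BERT's actual scheme.

def make_mlm_pair(tokens, rng=random):
    i = rng.randrange(len(tokens))                     # pick a position
    masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]  # hide that token
    return masked, tokens[i]                           # (input, target)
```

For instance, <code class="language-plaintext highlighter-rouge">make_mlm_pair("We went shopping at the mall".split())</code> might yield the masked sentence with target <code class="language-plaintext highlighter-rouge">"shopping"</code>, depending on the position drawn.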
<p>In addition, BERT adds an auxiliary task, NSP, where a special <code class="language-plaintext highlighter-rouge">[CLS]</code> classification token is used at the beginning of a sentence that serves as a marker for “this token should represent the whole context of the input sentence(s),” which is then used as a single fixed-width input for classification. This improves performance slightly (see Table 15 in the original work).</p>
<p>That’s essentially it.</p>
<blockquote>
<p>BERT = Transformer Encoder + MLM + NSP</p>
</blockquote>
<p>There’s an important caveat due to training/test distribution mismatch. See the last section, <a href="#open-questions">Open Questions</a>, below.</p>
<h2 id="fine-tuning">Fine-tuning</h2>
<p>For fine tuning, we just add one more layer on top of the final encoded sequence that BERT generates.</p>
<p>In the case of class prediction, we apply a classifier to the fixed width embedding of the <code class="language-plaintext highlighter-rouge">[CLS]</code> marker.</p>
<p>In the case of subsequence identification, like in SQuAD, we want to select a start and end by using a start classifier and end classifier applied to each token in the final output sequence.</p>
<p>For instance, a network is handed a paragraph like the following:</p>
<blockquote>
<p>One of the most famous people born in Warsaw was Maria Skłodowska-Curie, who achieved international recognition for her research on radioactivity and was the first female recipient of the Nobel Prize. Famous musicians include Władysław Szpilman and Frédéric Chopin. Though Chopin was born in the village of Żelazowa Wola, about 60 km (37 mi) from Warsaw, he moved to the city with his family when he was seven months old. Casimir Pulaski, a Polish general and hero of the American Revolutionary War, was born here in 1745.</p>
</blockquote>
<p>And then asked a reading comprehension question like “How old was Chopin when he moved to Warsaw with his family?” to which the answer is the subsequence “seven months old.” Hard stuff! And BERT performs at or above <a href="https://rajpurkar.github.io/SQuAD-explorer/">human level</a>.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/tbl1.png" alt="bert table 1" class="center-image" /></p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/tbl2.png" alt="bert table 2" class="center-image" /></p>
<h2 id="conclusions">Conclusions</h2>
<p>The BERT model is extremely simple, to the point where there’s a mismatch with intuition.</p>
<p>There are some seemingly spurious decisions that don’t have a big effect on training.</p>
<p>First, the segment embeddings indicate different sentences in inputs, but positional embeddings provide positional information anyway. This is seemingly redundant information the network needs to learn to combine.</p>
<p>Second, the start and end indicators for the span predicted for SQuAD are computed independently, where it might make sense to compute the end conditional on the start position. Indeed, it’s possible to get an end before the start (in which case the span is considered empty).</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/separate-span.png" alt="independent span" class="center-image" /></p>
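<p>A toy sketch of this independent start/end selection, with hypothetical classifier weights \(w_{\text{start}}, w_{\text{end}}\) applied to the final token encodings:</p>

```python
import numpy as np

# Independent span selection: two linear classifiers score every token
# position, and the span is (argmax of start scores, argmax of end scores),
# computed without conditioning on each other. Weight names are illustrative.

def predict_span(H, w_start, w_end):
    """H: (s, d) final token encodings; w_*: (d,) classifier weights."""
    start = int(np.argmax(H @ w_start))
    end = int(np.argmax(H @ w_end))
    return (start, end) if start <= end else None  # end < start => empty span
```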
<p>There are probably many such smaller modeling improvements we could make. But the point is that <em>it’s a waste of time</em>. If anything is the most powerful table to take away from this paper, it’s Table 6.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/tbl6.png" alt="bert table 6" class="center-image" /></p>
<p>Above any kind of task-specific tuning or model improvements, the longest pole in the tent is representation. Investing effort in finding the “right” representation (here, bidirectional, deep, contextual word piece embeddings) is what maximizes broad applicability and the potential for transfer learning.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/all-representation.png" alt="independent span" class="center-image" /></p>
<h2 id="open-questions">Open Questions</h2>
<h4 id="transfer-learning-distribution-mismatch">Transfer Learning Distribution Mismatch</h4>
<p>At the end of Section 3.1, we notice something weird. In the masked language modeling task, our job is to derive what the <code class="language-plaintext highlighter-rouge">[MASK]</code> token was.</p>
<p>But in the evaluation tasks, <code class="language-plaintext highlighter-rouge">[MASK]</code> never appears. To combat this “mismatch” between the distribution of evaluation-task tokens and that of the MLM task, full sequences are occasionally shown without the <code class="language-plaintext highlighter-rouge">[MASK]</code> tokens, in which case the network is expected to reproduce the input unchanged (the identity function).</p>
<p>Appendix C.2 digs into the robustness of BERT with respect to messing around with the distribution. This is definitely something that deserves some attention.</p>
<p>During pre-training, we’re minimizing a loss with respect to a distribution that doesn’t match the test distribution (where we randomly remove the mask). How is this a well-posed learning problem?</p>
<p>How much should we smooth the distribution with the mask removals? It’s unclear how to properly set up the “mismatch amount”.</p>
<h4 id="richer-inputs">Richer Inputs</h4>
<p>Based on the ability of BERT to perform well even with redundant encodings (segment encoding and positional encoding), and given its large representational capacity, why operate BERT on word pieces? Why not include punctuation or even HTML markup from Wikipedia?</p>
<p>This kind of input could surely offer more signal for fine tuning.</p>
Sun, 23 Jun 2019 00:00:00 +0000
https://vlad17.github.io/2019/06/23/bert-pt3-bert.html
https://vlad17.github.io/2019/06/23/bert-pt3-bert.html
deep-learning
BERT, Part 2: The Transformer<h1 id="bert-prerequisite-2-the-transformer">BERT Prerequisite 2: The Transformer</h1>
<p>In the last post, we took a look at deep learning from a very high level (<a href="/2019/03/09/dl-intro.html">Part 1</a>). Here, we’ll cover the second and final prerequisite for setting the stage for discussion about BERT, the Transformer.</p>
<p>The Transformer is a novel sequence-to-sequence architecture proposed in Google’s <a href="https://arxiv.org/abs/1706.03762">Attention is All You Need</a> paper. BERT builds on this significantly, so we’ll discuss here why this architecture was important.</p>
<h2 id="the-challenge">The Challenge</h2>
<p>Recall the language of the previous post applied to supervised learning. We’re interested in a broad class of settings where the input \(\textbf{x}\) has some shared structure with the output \(\textbf{y}\), which we don’t know ahead of time. For instance, \(\textbf{x}\) might be an English sentence and \(\textbf{y}\) might be a German sentence with the same context.</p>
<p>For a parameterized model \(M(\theta)\) which might just be a function over \(\textbf{x}\), we recall the \(L\)-layer MLP from last time, where \(\theta=\mat{\theta_1& \theta_2&\cdots&\theta_L}\),
\[
M(\theta)= x\mapsto f_{\theta_L}^{(L)}\circ f_{\theta_{L-1}}^{(L-1)}\circ\cdots\circ f_{\theta_1}^{(1)}(x)\,,
\]
and we define each layer as
\[
f_{\theta_i}^{(i)}(x)=\max(0, W_ix+b_i)\,,\,\,\, \mat{W_i & b_i} = \theta_i\,.
\]</p>
<p>Most feed-forward neural nets (FFNNs) are just variants on this architecture, with some loss typically like \(\norm{M(\theta)(\textbf{x}) - \textbf{y}}^2\).</p>
<p>One issue with this, and typical FFNNs, is that they’re mappings from some fixed size vector space \(\mathbb{R}^m\) to another \(\mathbb{R}^k\). When your inputs are variable-length sequences like sentences, this doesn’t make sense for two reasons:</p>
<ol>
<li>Sentences can be longer than the width of your input space (not a fundamental issue, you could just make \(m\) really large).</li>
<li>The inputs don’t respect the semantics of the input dimensions.</li>
</ol>
<p>For typical learning tasks, the \(i\)-th input dimension corresponds to a meaningful position in the input space. E.g., for images, this is the \(i\)-th pixel in the space of fixed size \(64\times 64\) images. It’s next to the \((i-1)\)-th and \((i+1)\)-th pixels, and every \(64\times 64\) image \(\textbf{x}\) will also have its \(i\)-th pixel in the \(i\)-th place.</p>
<p>Not so for sentences. In sentences, the subject may be the first or second or third word. It might be preceded by an article, or it might not. If you look at a fixed offset across many different sentences, you’d be hard-pressed to find a robust semantics for the word or letter that you see there. So it’s unreasonable to assume a model could extract relevant structure with such a representation.</p>
<h2 id="recursive-neural-networks-rnns">Recurrent Neural Networks (RNNs)</h2>
<p>The typical resolution to this problem in deep learning is to use RNNs. For an overview, see <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">Karpathy’s blog post</a>.</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/rnn.jpeg" alt="RNN" class="center-image" /></p>
<p>To resolve this issue, we can view our input as a variable-length list of fixed-length vectors \(\{\textbf{x}_i\}_{i}\). Next, we modify our FFNN to accept two fixed-length vectors at each time step \(i\): a hidden state \(\textbf{h}_i\) and an input \(\textbf{x}_i\). It’s the green box in the diagram above.</p>
<p>This retains essential properties of FFNNs that allow it to optimize well (backprop still works). But, from a perspective of input semantics, we’ve resolved our problem by assuming the hidden state at timestep \(\textbf{h}_i\) tells the FFNN how to interpret the \(i\)-th sequence element (which could be a word or word part or character in the sentence). The FFNN is then also responsible for updating how the \((i+1)\)-th sequence element is to be interpreted, by returning \(\textbf{h}_{i+1}\) on the evaluation in timestep \(i\).</p>
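<p>A minimal sketch of this recurrence, with a single tanh layer standing in for the FFNN cell (illustrative, not any particular RNN variant):</p>

```python
import numpy as np

# One FFNN cell maps (h_i, x_i) to the next hidden state h_{i+1}.
# W_h, W_x, b are the shared parameters reused at every time step.

def rnn(xs, h0, W_h, W_x, b):
    h = h0
    for x in xs:                        # one step per sequence element
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h                            # final hidden state (e.g. for sentiment)
```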
<p>We might want to wait until the network reads the entire input if the entire variable-length output may change depending on all parts of the input (the second to last diagram above). This is the case in translation, where words at the end of the source language may end up at the beginning in the target language.</p>
<p>Alternatively, we might do something like try to classify off of the hidden state after reading the sentence, like identifying the sentiment of a text-based review.</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/yelp1.png" alt="get final hidden state" class="center-image" /></p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/yelp2.png" alt="transform final" class="center-image" /></p>
<h2 id="rnn-challenges">RNN challenges</h2>
<p>Consider the task of translating English to Spanish. Let’s suppose our inputs are sequences of words, like</p>
<blockquote>
<p>I arrived at the bank after crossing the {river,road}.</p>
</blockquote>
<p>The proper translation might be either:</p>
<blockquote>
<p>Llegué a la orilla después de cruzar el río.</p>
</blockquote>
<p>or:</p>
<blockquote>
<p>Llegué al banco después de cruzar la calle.</p>
</blockquote>
<p>Notice how we need to look at the <em>whole</em> sentence to translate it correctly. The choice of “river” or “road” affects the translation of “bank”.</p>
<p>This means that the RNN needs to store information about the entire sentence when translating. For longer sentences, we’d definitely need to use a larger hidden state, but also we’re assuming the network would even be able to train to a parameter setting that properly recalls whole-sentence information.</p>
<h2 id="the-transformer">The Transformer</h2>
<p>The problem we faced above is one of <em>context</em>: to translate “bank” properly we need the full context of the sentence. This is what the Transformer architecture addresses. It inspects each word in the context of others.</p>
<p>Again, let’s view each word in our input sequence as some embedded vector \(\textbf{e}_i\) (for context on word embeddings, check out <a href="https://en.wikipedia.org/wiki/Word2vec">the Wikipedia page</a>).</p>
<p>Our goal is to come up with a new embedding for each word, \(\textbf{a}_i\), which contains context from all other words. This is done through a mechanism called attention. For a code-level explanation, see <a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html">The Annotated Transformer</a>, though I find that focusing on a particular word (the one at position \(i\)) helped me understand better.</p>
<p>The following defines (one head of) a Transformer block. A transformer block just contextualizes embeddings. They can be stacked on top of each other and then handed off to the transformer decoder, which is a more complicated kind of transformer that includes attention over both the inputs and outputs. Luckily, we don’t need that for BERT.</p>
<p>Remember, at the end of the day, we’re trying to take one sequence \(\{\textbf{e}_i\}_i\) and convert it into another sequence \(\{\textbf{a}_i\}_i\) which is then used as input for another stage that does the actual transformation. The point is that the representation \(\{\textbf{a}_i\}_i\) is broadly useful for many different decoding tasks.</p>
<ol>
<li>Apply an FFNN pointwise to each of the inputs \(\{\textbf{e}_i\}_i\) to get \(\{\textbf{x}_i\}_i\).</li>
</ol>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/pointwise-ffn.png" alt="pointwise ffnn" class="center-image" /></p>
<ol start="2">
<li>Now consider a fixed index \(i\). How do we contextualize the word at \(\textbf{x}_i\) in the presence of other words \(\textbf{x}_1,\cdots,\textbf{x}_{i-1},\textbf{x}_{i+1},\cdots,\textbf{x}_s\)?</li>
</ol>
<p>We attend to the sequence itself. Attention tells us how much to pay attention to each element when coming up with a fixed-width context for the \(i\)-th element. This is done with the inner product.</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/self-attn.png" alt="self attention" class="center-image" /></p>
<p>After computing how important each element \(\textbf{x}_j\) is to the element in question \(\textbf{x}_i\) as \(\alpha_j\), we take the weighted sum of the \(\textbf{x}_j\) themselves.</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/value-sum.png" alt="self attention" class="center-image" /></p>
<ol start="3">
<li>After doing this for every index \(i\in[s]\), we get a new sequence \(\{\textbf{a}_i\}_i\). That’s it!</li>
</ol>
<p>This glosses over a couple of details (normalization, multiple heads, and computational tricks), but it’s the gist of self-attention and the Transformer block.</p>
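<p>A minimal single-head self-attention sketch of the steps above (including the scaling and softmax that normalize the inner-product scores, but still omitting multiple heads and layer normalization; the projection matrices are the assumed learned parameters):</p>

```python
import numpy as np

# Single-head self-attention: project to queries/keys/values, score with
# inner products, normalize with softmax, and take the weighted sum.

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (s, d) sequence of s embeddings; returns the contextualized sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])  # alpha_j before normalization
    alpha = softmax(scores)                 # each row sums to 1
    return alpha @ V                        # weighted sum of values
```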
<p>One thing worth mentioning is the positional encoding, which makes sure that information about a word being present in the \(i\)-th position is present before the first Transformer block is applied.</p>
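<p>For reference, the sinusoidal positional encoding from the original paper (not spelled out above) assigns position \(i\), even dimension \(2k\) the value \(\sin(i/10000^{2k/d})\), with the adjacent odd dimension getting the matching cosine:</p>

```python
import numpy as np

# Sinusoidal positional encoding from "Attention is All You Need",
# for even embedding dimension d.

def positional_encoding(seq_len, d):
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) positions
    k = np.arange(0, d, 2)[None, :]            # even dimension indices
    angles = pos / (10000 ** (k / d))
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe
```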
<p>After possibly many Transformer blocks, we get our \(L\)-th sequence of embeddings, \(\{\textbf{a}^{(L)}_i\}_i\). We plug this in as input to another model, the Transformer decoder, which uses a similar process to eventually get a loss based on some input-output pair of sentences (e.g., in translation, the decoder converts the previous sequence into \(\{\textbf{b}_j\}_j\), which is compared with the actual translation \(\{\textbf{y}_j\}_j\)).</p>
<h2 id="so-what">So What?</h2>
<p>On the face of it, this all sounds like a bunch of hand-wavy deep learning nonsense. “Attention”, “embedding”, etc. all look like fancy words to apply to math that is operating on meaningless vectors of floating-point numbers. Layer on top of this (lol) the other crap I didn’t cover, like multiple heads, normalization, and various knobs pulled during training, and the whole thing looks suspect.</p>
<p>It’s not clear which parts are essential, but something is doing its job:</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/res.png" alt="Transformer Results" class="center-image" /></p>
<p>And self-attention looks like it’s doing something like what we think it should.</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/attn-viz.png" alt="Transformer Attention" class="center-image" /></p>
<p>Regardless of how much of a deep learning believer you are, this architecture solves problems that require contextualizing our representation of words, and it picks the right things to attend to in examples.</p>
<h2 id="next-time">Next time</h2>
<p>We’ll see how BERT uses the context-aware Transformer to come up with a representation without any supervision.</p>
Sat, 22 Jun 2019 00:00:00 +0000
https://vlad17.github.io/2019/06/22/bert-pt-2-transformer.html
deep-learning

BERT, Part 1: Deep Learning Intro
<h1 id="a-modeling-introduction-to-deep-learning">A Modeling Introduction to Deep Learning</h1>
<p>In this post, I’d like to introduce you to some basic concepts of deep learning (DL) from a modeling perspective. I’ve tended to stay away from “intro” style blog posts because:</p>
<ul>
<li>There are so, so many of them.</li>
<li>They’re hard to keep in focus.</li>
</ul>
<p>That said, I was presenting on <a href="https://arxiv.org/abs/1810.04805">BERT</a> for a discussion group at work. This was our first DL paper, so I needed to warm-start a technical audience with a no-frills intro to modeling with deep nets. So here we are, trying to focus what this post will be:</p>
<ul>
<li>It will presume a technically sophisticated reader.</li>
<li>No machine learning (ML) background is assumed.</li>
<li>The main goal is to set the stage for future discussion about BERT.</li>
</ul>
<p>Basically, this is me typing up those notes. Note that the above leaves questions about optimization and generalization squarely out of scope.</p>
<h2 id="the-parametric-model">The Parametric Model</h2>
<p>Deep learning is a tool for the generic task of parametric modeling. Parametric modeling (PM) is a term I am generously applying from statistical estimation theory that encapsulates a broad variety of ML buzzwords, including supervised, unsupervised, reinforcement, and transfer learning.</p>
<p>In the most general sense, a parametric model \(M\) accepts some vector of parameters \(\theta\) and describes some structure in a random process. Goodness, what does that mean?</p>
<ul>
<li>Structure in a random process is everything that differentiates it from noise. But what’s “noise”?</li>
<li>When we fix the model \(M\), we’re basically saying there’s only some classes of structure we’re going to represent, and everything else is what we consider noise.</li>
<li>The goal is to pick a “good” model and find parameters for it.</li>
</ul>
<h3 id="a-simple-example">A Simple Example</h3>
<p>For instance, let’s take a simple random process: iid draws from the normal distribution \(z\sim \mathcal{D}= N(\mu, \sigma^2)\) with unknown mean \(\mu\) and variance \(\sigma^2\). We’re going to try to capture the richest possible structure over \(z\), its actual distribution. One model might be the unit-variance normal, \(M(\theta)=N(\theta, 1)\). Then our setup, and potential sources of error, look like this:</p>
<p><img src="/assets/2019-03-09-dl-intro/model-err.png" alt="sources of error" class="center-image" /></p>
<p>What I call parametric and model mismatch are also known as estimation and approximation error (<a href="https://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning">Bottou and Bousquet 2007</a>).</p>
<p>Here, we have one of the most straightforward instances of PM, parameter estimation (we’re trying to estimate \(\mu\)).</p>
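<p>A quick numerical sketch of this setup (my own, not from the post’s figure): we draw from \(N(\mu,\sigma^2)\) but fit the misspecified model \(M(\theta)=N(\theta,1)\), whose maximum-likelihood estimate of \(\theta\) is just the sample mean.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0                       # unknown to the modeler
z = rng.normal(mu, sigma, size=100_000)    # iid draws from D = N(mu, sigma^2)

# Under M(theta) = N(theta, 1), the MLE for theta is the sample mean:
# parametric error shrinks with more data, but model mismatch (the wrong
# variance) stays no matter how much data we see.
theta_hat = z.mean()
assert abs(theta_hat - mu) < 0.05
```

The estimate \(\hat\theta\) recovers \(\mu\) well here, even though the model’s fixed unit variance is wrong, which is exactly the parametric-versus-model-mismatch split in the figure.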
<h3 id="revisiting-our-definitions">Revisiting our definitions</h3>
<p>What constitutes a “good” model? Above, we probably want to call models with \(\theta\) near \(\mu\) good ones. But in other cases, it’s not so obvious what makes a good model.</p>
<p>One of the challenges in modeling in general is articulating what we want. This is done through a loss function \(\ell\), where we want models with small losses. In other words, we’d like to find a model \(M\) and related parameters \(\theta\) where
\[
\E_{z\sim \mathcal{D}}\ha{\ell(z, M(\theta))}
\]
is as small as possible (here, for our iid process). Note that the loss we optimize when searching for \(\theta\) doesn’t have to be the same as the loss we evaluate with, but that’s another discussion (there are several reasons for the two to differ).</p>
<h3 id="another-example">Another Example</h3>
<p>Now let’s jump into another modeling task, supervised learning. Here:</p>
<ul>
<li>Our iid random process \(\mathcal{D}\) will be generating pairs \(\pa{\text{some image}, \text{“cat” or “dog”}}\).</li>
<li>The structure we want to capture is that all images of dogs happen to be paired with the label \(\text{“dog”}\) and analogously so for cats.</li>
<li>We’ll gloss over what our model is for now.</li>
</ul>
<p>A loss that captures what we want for our desired structure is the <em>zero-one loss</em>, which is \(1\) when we’re wrong and \(0\) when we’re right. Let’s fix some model and parameters, so that \(M(\theta)\) is itself a <em>function</em> taking an image and labeling it as a cat or dog, and see how it does on our loss function.</p>
<p><img src="/assets/2019-03-09-dl-intro/losses.png" alt="zero-one losses" class="center-image" /></p>
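<p>As a toy sketch (the labels and predictions below are made up for illustration), the zero-one loss and its empirical average look like this:</p>

```python
# Zero-one loss for a cat/dog classifier: 1 when we're wrong, 0 when we're right.
def zero_one_loss(y_true, y_pred):
    return 0 if y_true == y_pred else 1

# A toy "model": M(theta) is itself a function from images to labels; here we
# just list its predictions next to the true labels.
labels = ["cat", "dog", "dog", "cat"]
preds  = ["cat", "dog", "cat", "cat"]
losses = [zero_one_loss(y, p) for y, p in zip(labels, preds)]
print(sum(losses) / len(losses))  # prints 0.25: the empirical average loss
```

The average over examples estimates the expected loss \(\E_{z\sim\mathcal{D}}\ha{\ell(z, M(\theta))}\) from the previous section.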
<h2 id="ok-so-why-deep-learning">OK, so why Deep Learning?</h2>
<p>This post was intentionally structured in a way that takes the attention away from DL. DL is a means to achieve the above PM goals, a means to an end, and being able to reason about higher-level modeling concerns is crucial to understanding the tool.</p>
<p>So, DL is an approach to building models \(M\) and it studies how to find good parameters \(\theta\) for those models.</p>
<h3 id="deep-learning-models">Deep Learning Models</h3>
<p>A DL model is anything that vaguely resembles the following recipe: many parameterized functions composed together to create one function.</p>
<p>A function is usually good enough to capture most structure we’re interested in within a random process, given sufficiently sophisticated inputs and outputs. The inputs and outputs of this function can be (not exhaustively):</p>
<ul>
<li>fixed-width multidimensional arrays (casually known as tensors, sort of)</li>
<li>embeddings (numerical translations) of categories (like all the words in the English dictionary)</li>
<li>variable width tensors</li>
</ul>
<p>The parameters this function takes (which differ from its inputs and affect what the function looks like) are fixed-width tensors. I haven’t seen variable-width parameters in DL models, except in some Bayesian interpretations (<a href="https://www.cs.toronto.edu/~hinton/absps/colt93.pdf">Hinton 1993</a>).</p>
<h3 id="the-multi-layer-perceptron">The Multi-Layer Perceptron</h3>
<p>Our prototypical example of a neural network is the Multi-Layer Perceptron, or MLP, which takes a numerical vector input to a numerical vector output. For a parameter vector \(\theta=\mat{\theta_1& \theta_2&\cdots&\theta_L}\), which contains parameters for our \(L\) layers, an MLP looks like:
\[
M(\theta)= x\mapsto f_{\theta_L}^{(L)}\circ f_{\theta_{L-1}}^{(L-1)}\circ\cdots\circ f_{\theta_1}^{(1)}(x)\,,
\]
and we define each layer as
\[
f_{\theta_i}^{(i)}(x)=\max(0, W_ix+b_i)\,.
\]
The parameters \(W_i, b_i\) are set by the contents of \(\theta_i\).</p>
<p>This is the functional form of linear transforms followed by nonlinearities. It describes what’s going on in this image:</p>
<p><img src="/assets/2019-03-09-dl-intro/mlpi.png" alt="multi-layer perceptron" class="center-image" /></p>
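<p>The MLP above is only a few lines in numpy; this sketch applies the \(\max(0,\cdot)\) (ReLU) nonlinearity at every layer, exactly as in the formula:</p>

```python
import numpy as np

def mlp(x, params):
    # params is [(W_1, b_1), ..., (W_L, b_L)]: the flat theta split per layer.
    # Each layer is a linear map followed by the ReLU nonlinearity.
    for W, b in params:
        x = np.maximum(0.0, W @ x + b)
    return x

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]    # input width 4, two hidden layers of width 8, output width 2
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes, sizes[1:])]
y = mlp(rng.standard_normal(4), params)
assert y.shape == (2,)
```

In practice the last layer often skips the nonlinearity (e.g., to produce logits), but this follows the layer definition above literally.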
<h3 id="why-dl">Why DL?</h3>
<p>While it might be believable that functions in general make for great models that could capture structure in a lot of phenomena, why have these particular parameterizations of functions taken off recently?</p>
<p>This is basically the only part of this post that has to do with DL, and most of it’s out of scope.</p>
<p>In my opinion, it boils down to three things.</p>
<p>Deep learning is simultaneously:</p>
<ul>
<li>Flexible, in terms of how many functions it can represent for a fixed parameter size.</li>
<li>Amenable to finding so-called low-loss estimates of \(\theta\) fairly quickly.</li>
<li>Equipped with regularization strategies that work.</li>
</ul>
<h4 id="flexibility">Flexibility</h4>
<p>The MLP format above might seem strange, but this linearity-followed-by-non-linearity happens to be particularly expressive, in terms of the number of different functions we can represent with a small set of parameters.</p>
<p>The fact that a sufficiently wide neural network can well-approximate smooth functions is well known (<a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">Universal Approximation Theorem</a>), but what’s of particular interest is how linear increases in depth to a network exponentially increase its expressiveness (<a href="https://arxiv.org/abs/1402.1869">Montúfar, et al 2014</a>).</p>
<p><img src="/assets/2019-03-09-dl-intro/montufar2014.png" alt="expressiveness" class="center-image" /></p>
<p>An image from the cited work demonstrates how composition with non-linearities increases expressiveness. Here, with an absolute-value nonlinearity, composition reflects the input space onto itself, so adding a layer doubles the number of linear regions in the neural net.</p>
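<p>A toy illustration of that doubling (my own, not from the cited paper): composing the absolute-value map \(x\mapsto|2x-1|\) on \([0,1]\) doubles the number of linear pieces with each layer, so depth buys exponentially many regions for linearly many parameters.</p>

```python
import numpy as np

def folded(x, layers):
    # Each layer f(x) = |2x - 1| folds [0, 1] onto itself, doubling the
    # number of linear pieces, so `layers` compositions give 2^layers pieces.
    for _ in range(layers):
        x = np.abs(2 * x - 1)
    return x

x = np.linspace(0, 1, 10001)
y = folded(x, 3)
# Count linear pieces by counting sign changes in the slope.
slopes = np.sign(np.diff(y))
pieces = 1 + np.count_nonzero(slopes[1:] != slopes[:-1])
print(pieces)  # prints 8 = 2^3 linear pieces
```

ReLU networks achieve the same kind of folding with pairs of units, which is the mechanism behind the Montúfar et al. bound.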
<h4 id="efficiency">Efficiency</h4>
<p>One of the papers that kicked off the DL craze was AlexNet (<a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks">Krizhevsky 2012</a>), and one reason for its impact was that we could efficiently compute the value of a neural network \(M(\theta)\) on a particular image \(x\) using specialized hardware.</p>
<p>Not only does the simple composition of simple functions enable fast <em>forward</em> computation of the model value \(M(\theta)(x)\), but because the operations can be expressed as a directed acyclic graph of almost differentiable functions, one can quickly compute <em>reverse</em> automatic derivatives \(\partial_\theta M(\theta)(x)\) in just about the same amount of time.</p>
<p>This is a very happy coincidence. We can compute the functional value of a neural net and its derivative in time linear in the parameter size, and we have a lot of parameters. Here, efficiency matters a lot for the inner loop of the optimization (which uses derivatives with SGD) to find “good” parameters \(\theta\). This efficiency, in turn, enabled a lot of successful research.</p>
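<p>To make the forward/reverse symmetry concrete, here is a hand-rolled sketch (my own example, not an autodiff library) of reverse-mode differentiation through a single ReLU layer, checked against a finite difference. The backward pass reuses the intermediates cached by the forward pass, which is why the gradient costs about as much as the forward evaluation.</p>

```python
import numpy as np

# One layer y = relu(W x + b) with scalar loss l = sum(y).
W = np.array([[1.0, -2.0, 0.5, 0.0],
              [0.3, 0.1, -1.0, 2.0],
              [-0.5, 1.0, 1.0, -1.0]])
b = np.array([0.1, -0.2, 0.3])
x = np.array([1.0, 0.5, -1.0, 2.0])

# Forward pass, caching the pre-activation z.
z = W @ x + b
y = np.maximum(0.0, z)
loss = y.sum()

# Backward pass (chain rule, reusing the cached z and x).
dy = np.ones_like(y)       # d loss / d y
dz = dy * (z > 0)          # gradient gated by the ReLU
dW = np.outer(dz, x)       # d loss / d W
db = dz                    # d loss / d b

# Sanity-check one entry of dW against a finite difference.
eps = 1e-6
W2 = W.copy(); W2[1, 0] += eps
fd = (np.maximum(0.0, W2 @ x + b).sum() - loss) / eps
assert abs(fd - dW[1, 0]) < 1e-4
```

Frameworks generalize exactly this pattern to an arbitrary DAG of almost-differentiable operations.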
<h4 id="generalization">Generalization</h4>
<p>Finally, neural networks generalize well. This means that given a training set of examples, they are somehow able to have low loss on unseen examples coming from the same random process, just by training on a (possibly altered, or regularized) loss from given examples.</p>
<p>This is particularly counterintuitive for nets due to their expressivity, which traditional ML analyses hold to be at odds with generalization.</p>
<p><a href="https://arxiv.org/abs/1611.03530">Many</a> <a href="https://arxiv.org/abs/1710.05468">theories</a> <a href="https://arxiv.org/abs/1705.05502">for</a> <a href="https://arxiv.org/abs/1503.02406">why</a> <a href="https://arxiv.org/abs/1711.01530">this</a> <a href="https://arxiv.org/abs/1710.09553">occurs</a> have been proposed, but none of them are completely satisfying yet.</p>
<h2 id="next-time">Next time</h2>
<ol>
<li>We’ll review the Transformer, and what it does.</li>
<li>That’ll set us up for some BERT discussion.</li>
</ol>
Sat, 09 Mar 2019 00:00:00 +0000
https://vlad17.github.io/2019/03/09/dl-intro.html
deep-learning