Vlad Feinberg
Vlad's Blog
https://vlad17.github.io/
Sat, 12 Oct 2019 21:16:23 +0000
<h1 id="compressed-sensing-and-subgaussians">Compressed Sensing and Subgaussians</h1>
<p>Candes and Tao came up with a broad characterization of compressed sensing solutions <a href="https://statweb.stanford.edu/~candes/papers/RIP.pdf">a while ago</a>. Partially inspired by a past homework problem, I’d like to explore an area of this setting.</p>
<p>This post will dive into the compressed sensing context and then focus on a proof that squared subgaussian random variables are subexponential (the relation between the two will be explained).</p>
<h2 id="compressed-sensing">Compressed Sensing</h2>
<p>For context, we’re interested in the setting where we observe an \(n\)-dimensional vector \(\vy\) that is a random linear transformation \(X\) of a hidden \(p\)-dimensional vector \(\vx_*\):</p>
<p>\[
\vy = X\vx_*
\]</p>
<p>In general, in this setting, we could have \(p>n\). If we wanted to recover \(\vx_*\), the system may be underdetermined. So a least-squares solution \((X^\top X)^{-1}X^\top\vy\) may not exist or may be unstable due to very small \(\lambda_\min(X^{\top} X)\).</p>
<p>In cases where we know the signal is sparse, however, with \(\norm{\vx_*}_0=k<p,n\), we can actually recover it.</p>
<p>In particular, the \(\ell_0\) estimator, which finds
\(
\vx_0=\argmin_{\vx:\norm{\vx}_0\le k}\norm{\vy-X\vx}_2
\), will converge, in the sense that the risk \(\frac{1}{n}\E\norm{\vy-X\vx_0}_2^2\) is bounded above by \(O\pa{\frac{k\log p}{n}}\). This can be used to show that under some straightforward assumptions on \(k,X\) we actually converge to the true answer \(\vx_*\). Moreover, while this method seems to depend on \(k\), we can imagine doing hyperparameter search on \(k\).</p>
<p>This all looks great, in that we can recover the original entries of sparse \(\vx_*\), but the problem is that solving the minimization problem under the constraint \(\norm{\vx}_0\le k\) is computationally difficult. The constraint set, points with at most \(k\) non-zero entries, is non-convex. We’d need to check every subset to find the optimum (<em>question to self:</em> do we really? You’d think that in a non-adversarial stochastic-\(X\) situation you might want to use \(2k\) instead of \(k\) and then use a greedy algorithm like backward selection, and it’d be good enough).</p>
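<p>To make the combinatorial difficulty concrete, here’s a brute-force \(\ell_0\) sketch at toy scale; the <code>l0_estimate</code> name and all the dimensions are mine, chosen small enough to enumerate:</p>

```python
import itertools

import numpy as np

def l0_estimate(X, y, k):
    """Brute-force ell_0 estimator: least squares over every size-k support."""
    n, p = X.shape
    best_resid, best_x = np.inf, None
    for support in itertools.combinations(range(p), k):
        cols = X[:, support]
        coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
        resid = np.linalg.norm(y - cols @ coef)
        if resid < best_resid:
            best_x = np.zeros(p)
            best_x[list(support)] = coef
            best_resid = resid
    return best_x

# in the noiseless case, the true sparse signal is recovered exactly
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
x_star = np.zeros(8)
x_star[[2, 5]] = [1.5, -2.0]
x_hat = l0_estimate(X, X @ x_star, k=2)
```

Even at \(p=8,k=2\) this checks \(\binom{8}{2}=28\) supports; the count blows up as \(\binom{p}{k}\).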
<p>This is why Tao and Candes’ work is so cool. They take the efficiently-computable LASSO estimator,
\[
\vx_\lambda = \argmin_{\vx}\norm{\vy-X\vx}_2
^2+\lambda\norm{\vx}_1\,,
\]
and show that under a certain condition on \(X\), the <em>Restricted Isometry Property</em> (RIP), \(\vx_\lambda = \vx_0\). In essence, RIP requires that every submatrix of up to \(k\) columns of \(X\) has nearly unit singular values, so \(X\) acts almost as an isometry on sparse vectors. Technically, there’s a relaxed condition called the restricted eigenvalue condition, implied by RIP, under which we get the weaker result that LASSO has the same risk as \(\ell_0\).</p>
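<p>For a concrete feel for the LASSO objective, here’s a minimal sketch solving it with iterative soft-thresholding (ISTA), one standard first-order method; this is my illustration, not anything from the paper:</p>

```python
import numpy as np

def lasso_ista(X, y, lam, iters=500):
    """Solve min_x 0.5 * ||y - Xx||^2 + lam * ||x||_1 by iterative
    soft-thresholding (the 1/2 just rescales lam)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(X.shape[1])
    for _ in range(iters):
        z = x - step * (X.T @ (X @ x - y))  # gradient step on the smooth part
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return x

# the sparse signal's support pops out of p = 20 candidate coordinates
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
x_star = np.zeros(20)
x_star[[3, 11]] = [2.0, -3.0]
x_hat = lasso_ista(X, X @ x_star, lam=1.0)
```

The soft-threshold is exactly the proximal operator of \(\lambda\norm{\cdot}_1\), which is what makes the \(\ell_1\) penalty computationally friendly where \(\ell_0\) isn’t.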
<p>All this is motivation for understanding the question: <strong>what practical conditions on \(X\) ensure the RIP?</strong></p>
<p>It turns out we can characterize a broad class of distributions for the entries of \(X\) that enable this.</p>
<h2 id="subgaussian-random-variables">Subgaussian Random Variables</h2>
<p>Subgaussian random variables have light tails, decaying at least as fast as a Gaussian’s. In particular, \(Y\in\sg(\sigma^2)\) when, for all \(\lambda\),
\[
\E\exp(\lambda Y)\le\exp\pa{\frac{1}{2}\lambda^2\sigma^2}\,.
\]</p>
<p>By the Taylor expansion of \(\exp\), Markov’s inequality, and elementary properties of expectation, we can use the above to show all sorts of properties.</p>
<ul>
<li>Subgaussian variance. \(\var Y\le \sigma^2\)</li>
<li>Zero mean. \(\E Y = 0\)</li>
<li>2-homogeneity. \(\alpha Y\in\sg(\sigma^2\alpha^2)\)</li>
<li>Light tails. \(\P\ca{\abs{Y}>t}\le 2\exp\pa{\frac{-t^2}{2\sigma^2}}\)</li>
<li>Additive closure. \(Z\in\sg(\eta^2 )\independent Y\) implies \(Y+Z\in\sg(\sigma^2+\eta^2)\)</li>
<li>Higher moments. \(\E Y^{4k}\le 8k(2\sigma)^{4k}(2k-1)!\)</li>
</ul>
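<p>These properties are easy to sanity-check numerically. A quick Monte Carlo sketch of the light-tails bound for a standard Gaussian, which is \(\sg(1)\):</p>

```python
import math
import random

random.seed(0)

# draw from Y ~ N(0, 1), which satisfies the subgaussian MGF bound with sigma^2 = 1
n = 200_000
ys = [random.gauss(0.0, 1.0) for _ in range(n)]

def tail_vs_bound(t):
    """Empirical P(|Y| > t) alongside the light-tails bound 2 exp(-t^2 / 2)."""
    emp = sum(abs(y) > t for y in ys) / n
    return emp, 2.0 * math.exp(-t * t / 2.0)
```

At \(t=2\), the empirical tail is about \(0.05\) against a bound of about \(0.27\): loose, as concentration bounds tend to be, but valid.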
<h2 id="subexponential-random-variables">Subexponential Random Variables</h2>
<p>Subexponential random variables are like subgaussians, but their tails can be heavier. In particular, \(Y\in\se(\sigma^2,s)\) satisfies the MGF bound defining \(\sg(\sigma^2)\), but only for \(\abs{\lambda}<s\).</p>
<p>We don’t really need to know much else about these, but it’s clear we can show similar additive closure and homogeneity properties as in the subgaussian case as long as we do bookkeeping on the second parameter \(s\).</p>
<p>It turns out that RIP holds for \(X\) with high probability if \(\vu^\top X^\top X\vu\in\se(nc, c')\) for some constants \(c,c'\) and any unit vector \(\vu\).</p>
<p>When entries of \(X\) are independent and identically distributed, \(\vu\) can essentially be taken to be a standard unit vector without loss of generality. This requires some justification but it’s intuitive so I’ll skip it for brevity. This lets us simplify the problem to asking whether \(\norm{X_1}^2\in\se(nc, c')\), where \(X_1\) is the first column of \(X\).</p>
<p>So let’s take the entries of \(X\) to be iid, which, due to additive closure, means that the previous condition can just be \({X}_{11}^2\in\se(c,c')\).</p>
<h2 id="squared-subgaussians">Squared Subgaussians</h2>
<p>Turns out, if the entries of \(X\) are subgaussian and iid, all of the above conditions hold. In particular, we need to show that the first entry \(X_{11}\), when squared, is subexponential.</p>
<p>We focus on a loose but good-enough bound for this use case.</p>
<p>Suppose \(Z\in\sg(\sigma^2)\). Then \(Z^2-\E Z^2\in \se(c\sigma^4,\sigma^{-2}/8)\), again, being very loose with the bound here.</p>
<p>First, consider an arbitrary rv \(Y\). By the conditional Jensen’s Inequality, for any \(\lambda\) and \(Y'\sim Y\) iid,
\[
\E\exp\pa{\lambda (Y-\E Y)}=\E\exp\pa{\CE{\lambda (Y-Y')}{Y}}\le \E\CE{\exp\pa{\lambda (Y-Y')}}{Y}=\E\exp\pa{\lambda (Y-Y')}\,.
\]
Then let \(\epsilon\) be an independent Rademacher random variable, and notice we can replace \(Y-Y'\disteq \epsilon(Y-Y')\) above. Now choose \(Y=X^2\). Then by Taylor expansion and dominated convergence,
\[
\E\exp\pa{\lambda \pa{X^2-\E X^2}}\le \E \exp\pa{\lambda \epsilon \pa{X^2-(X')^2}}=\sum_{k=0}^\infty\frac{\lambda^k\E\ha{\epsilon^k(X^2-(X')^2)^k}}{k!}\,.
\]
Next, notice that for odd \(k\), \(\epsilon^k=\epsilon\), so by symmetry the odd terms vanish, leaving the MGF bound
\[
\E\exp\pa{\lambda \pa{X^2-\E X^2}}\le\sum_{k=0}^\infty\frac{\lambda^{2k}\E\ha{\pa{X^2-(X')^2}^{2k}}}{(2k)!}\le 2\sum_{k=0}^\infty\frac{\lambda^{2k}\E\ha{X^{4k}}}{(2k)!}\,,
\]
where above we use the fact that \(x\mapsto x^p\) is monotone and \(\abs{X^2-(X')^2}\le X^2\) when \(\abs{X}>\abs{X'}\), which occurs half the time by symmetry. The other half of the time, we get an equivalent expression. By subgaussian higher moments,
\[
\E \exp\pa{\lambda (X^2-\E X^2)}\le 1+c\sum_{k=1}^\infty \frac{k\pa{4\sigma^2\lambda}^{2k}(2k-1)!}{(2k)!}\le 1+c\sum_{k=1}^\infty\pa{4\sigma^2\lambda}^{2k}\,.
\]
Next we assume, crudely, that \(4\sigma^2\lambda\le 2^{-1/2}\), so the series above is dominated by a geometric series with ratio at most \(1/2\), making it at most twice its first term. Then,
\[
\E \exp\pa{\lambda (X^2-\E X^2)}\le 1+c(2\sigma^2\lambda)^2\le \exp(c\sigma^4\lambda^2)\,.
\]</p>
Sun, 18 Aug 2019 00:00:00 +0000
https://vlad17.github.io/2019/08/18/compressed-sensing-subgaussians.html
<h1 id="making-lavender">Making Lavender</h1>
<p>I’ve tried using Personal Capital and Mint to monitor my spending, but I wasn’t happy with what those tools offered.</p>
<p>In short, I was looking for a tool that:</p>
<ul>
<li>requires no effort on my part to get value out of (I don’t want to set budgets, I don’t even want the overhead of logging in to get updates)</li>
<li>would tell me how much I’m spending</li>
<li>would tell me why I’m spending this much</li>
<li>would tell me if anything’s changed</li>
</ul>
<p>All the tools out there are in some other weird market of “account management” where they take all your accounts (investment, saving, credit card, checking), not just the spending ones. They’re your one stop shop for managing all your net worth in one place.</p>
<p>However, I just wanted to be responsible about my spending. And I didn’t want to spend any more time dealing with personal finance apps than I had to. Kind of like <a href="https://albert.com/">Albert</a>. But when I tried it, it was way too annoying and didn’t support my credit card account.</p>
<p>At this point, I figured that I know what I want and I could do a better job at getting it myself, so I just hacked some stuff together. The end result is a weekly digest that gives exactly the analysis I want.</p>
<h2 id="pandas">Pandas</h2>
<p><em>Time investment</em>: 30 minutes</p>
<p>Download Chase statement csv. It looks like this:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Transaction Date,Post Date,Description,Category,Type,Amount
07/03/2019,07/04/2019,SQ *UDON UNDERGROUND,Food & Drink,Sale,-19.20
07/03/2019,07/04/2019,Amazon web services,Personal,Sale,-27.31
07/01/2019,07/03/2019,SWEETGREEN SOMA,Food & Drink,Sale,-17.56
</code></pre></div></div>
<p>Then just give me the heavy hitters. <a href="https://github.com/vlad17/misc/blob/master/groupby.py">Pandas hack script</a>. Among the biggest two give me a breakdown.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python ~/dev/misc/groupby.py ~/Downloads/Chase.CSV
most recent payment period <from date> <to date>
usd frac
Category
Food & Drink -841.05 51%
Travel -301.65 18%
Shopping -148.69 9%
Health & Wellness -140.09 9%
Groceries -134.64 8%
Personal -58.00 4%
total -1640.04
Food & Drink
Transaction Date Description Amount
*** 2019-**-** CIBOS ITALIAN RESTAURANT -120.00
*** 2019-**-** SALT WOOD RESTAURANT -70.00
*** 2019-**-** SAPPORO -69.98
*** 2019-**-** PACHINO PIZZERIA -60.00
*** 2019-**-** DOORDASH*BURMA LOVE -53.53
Travel
Transaction Date Description Amount
*** 2019-**-** UBER *TRIP -58.97
*** 2019-**-** CLIPPER #**** -50.00
*** 2019-**-** *********** HOTEL -32.00
*** 2019-**-** UBER *TRIP -17.02
*** 2019-**-** UBER *TRIP -16.12
</code></pre></div></div>
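<p>The aggregation behind that output is tiny. A stdlib-only sketch of the same idea (the linked script uses pandas and also prints the per-category transaction breakdowns):</p>

```python
import csv
import io
from collections import defaultdict

def heavy_hitters(csv_text):
    """Total Chase-statement amounts by category, biggest spend first."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Category"]] += float(row["Amount"])
    grand = sum(totals.values())
    # amounts are negative, so ascending order puts the biggest spend first
    return sorted(((cat, usd, usd / grand) for cat, usd in totals.items()),
                  key=lambda item: item[1])

statement = """\
Transaction Date,Post Date,Description,Category,Type,Amount
07/03/2019,07/04/2019,SQ *UDON UNDERGROUND,Food & Drink,Sale,-19.20
07/03/2019,07/04/2019,Amazon web services,Personal,Sale,-27.31
07/01/2019,07/03/2019,SWEETGREEN SOMA,Food & Drink,Sale,-17.56
"""
for cat, usd, frac in heavy_hitters(statement):
    print(f"{cat:<15} {usd:>8.2f} {frac:>4.0%}")
```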
<p>Neato! Already more value than those stupid pie charts. But I have to log into Chase now, which is worse than logging into Mint.</p>
<h2 id="timely-hn-methodology">Timely HN Methodology</h2>
<p><em>Time investment</em>: 2 straight days of coding.</p>
<p>An <a href="https://news.ycombinator.com/item?id=19833881">HN</a> post came out from someone doing basically the same thing, though motivated by privacy. So I copied his approach, where you just tell Chase to send you email alerts for transactions.</p>
<p>Emails from Chase look like this.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>This is an Alert to help you manage your credit card account ending in ****.
As you requested, we are notifying you of any charges over the amount of ($USD) 0.00, as specified in your Alert settings.
A charge of ($USD) 12.74 at SQ *BLUE BOTTLE C... has been authorized on **/**/2019 7:**:** PM EDT.
Do not reply to this Alert.
If you have questions, please call the number on the back of your credit card, or send a secure message from your Inbox on www.chase.com.
To see all of the Alerts available to you, or to manage your Alert settings, please log on to www.chase.com.
</code></pre></div></div>
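<p>Extracting the transaction from a body like that takes one regex. A sketch of the parsing step; the pattern is my reconstruction from the format above, not the actual lambda code:</p>

```python
import re

# matches: "A charge of ($USD) 12.74 at SQ *BLUE BOTTLE C... has been authorized"
CHARGE_RE = re.compile(
    r"A charge of \(\$USD\) (?P<amount>[\d,.]+) at (?P<merchant>.+?) "
    r"has been authorized")

def parse_alert(body):
    """Return (merchant, amount) from a Chase alert, or None for other emails."""
    m = CHARGE_RE.search(body)
    if m is None:
        return None  # e.g. a forwarding-confirmation email
    return m.group("merchant"), float(m.group("amount").replace(",", ""))

body = ("A charge of ($USD) 12.74 at SQ *BLUE BOTTLE C... has been "
        "authorized on **/**/2019 7:**:** PM EDT.")
```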
<p>Unlike blog post guy, I didn’t want to fuck with Zapier or Google Sheets since I want my code to do more special things. Somehow I hyped up my friend <a href="https://github.com/JoshBollar">Josh</a> to help (I think he wanted to mess with AWS). Here was our design doc:</p>
<p><img src="/assets/2019-08-18-making-lavender/ddoc.png" alt="design doc" class="center-image" /></p>
<p>So yeah, the flamegraph of your finances never happened. But hey, we did the important parts, namely:</p>
<ul>
<li>Get a domain through Route 53 to send mail to/from.</li>
<li>Set up an SNS topic to receive emails. Received emails are either forwarding confirmations (which need to be confirmed) or actual transaction notifications from Chase, set up to be forwarded via the user’s email account.</li>
<li>AWS lambda to regex parse the transaction emails, dump transaction in makeshift NoSQL store which is really just flat json documents on S3.</li>
<li>AWS lambda to spin up weekly and send out summary digests via SES to all users (who we know by ls-ing the S3 bucket)</li>
<li>Matplotlib rendering of a barchart</li>
</ul>
<p>Yeah, yeah, so much yikes architecturally. The code’s just as smelly, but whatever, we wanted a scalability of 2.</p>
<p><img src="/assets/2019-08-18-making-lavender/v0email.png" alt="first version" class="center-image" /></p>
<h2 id="switch-to-an-api">Switch to an API</h2>
<p><em>Time investment</em>: 6 non-contiguous days, about 17 hours total.</p>
<p>The above was hacky, but an essentially free service that gave me what I wanted. The main downside was that the emails from Chase didn’t have a lot of info on the transactions themselves.</p>
<ul>
<li>Switch to <a href="https://plaid.com/">Plaid</a>, a real API for transactions. This meant I could get rid of the lambda for handling new transactions. And I got nicer categories for the payments.</li>
<li>Keep a postgres RDS running on a <code class="highlighter-rouge">t3.micro</code> with all the transaction info. The lambda would spin up, use environment variable secrets to connect, update with new transactions from Plaid, and send the digest. Migrating from flat json S3 storage to a real database took the most time.</li>
</ul>
<p>The biggest improvement, I think, was “versus” analysis, which identifies what categories you’re spending more or less in than usual. I just made up a differencing algorithm here, I don’t think anything out there solves this problem super well on its own (it’s a harder problem than you’d think, since transactions belong to multiple categories).</p>
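<p>To make that concrete, here’s a toy version of the kind of differencing rule I mean; the function, threshold, and categories here are all made up for illustration:</p>

```python
def versus(current, history, threshold=0.25):
    """Flag categories whose spend this week moved more than `threshold`
    (relative) against the average of past weeks. Assumes at least one
    past week in `history`."""
    cats = set(current).union(*history)
    flagged = {}
    for cat in cats:
        baseline = sum(week.get(cat, 0.0) for week in history) / len(history)
        delta = current.get(cat, 0.0) - baseline
        # max(..., 1.0) keeps brand-new categories from dividing by zero
        if abs(delta) > threshold * max(abs(baseline), 1.0):
            flagged[cat] = delta
    return flagged

history = [{"Food & Drink": 100.0}, {"Food & Drink": 110.0}]
flags = versus({"Food & Drink": 200.0, "Travel": 50.0}, history)
```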
<p><img src="/assets/2019-08-18-making-lavender/time-spend.png" alt="spend" class="center-image" /></p>
<p>The biggest pain point here was that AWS Lambda didn’t support deployment packages over 250MB uncompressed. With scipy alone at 70MB, this was a pretty annoying constraint to work around. I had to manually go into the seaborn package, which I use for viz now, and gut out scipy. Probably a better way is to just download dependencies on init.</p>
<h2 id="whats-next">What’s next?</h2>
<p>I’m pretty happy with the app as it is now for personal use.</p>
<p>I may make this available to others (<a href="/about">email me</a> if you want this to happen). The app would send you weekly digests, at 8am Pacific Time on Saturdays.</p>
<p>Before it’s generally publicly available, the email needs a bit of polish, and a static website would be nice, as well as some EULA or something.</p>
Sun, 18 Aug 2019 00:00:00 +0000
https://vlad17.github.io/2019/08/18/making-lavender.html
<h1 id="faiss-part-1">FAISS, Part 1</h1>
<p>FAISS is a powerful GPU-accelerated library for similarity search. It’s available under MIT <a href="https://github.com/facebookresearch/faiss">on GitHub</a>. Even though <a href="https://arxiv.org/abs/1702.08734">the paper</a> came out in 2017, and, under some interpretations, the library has lost its SOTA title, when it comes to practical concerns:</p>
<ul>
<li>the library is actively maintained and cleanly written.</li>
<li>it’s still extremely competitive by any metric, enough so that the bottleneck for your application won’t likely be in FAISS anyway.</li>
<li>if you bug me enough, I may fix my one-line EC2 spin-up script that sets up FAISS deps <a href="https://github.com/vlad17/aws-magic">here</a>.</li>
</ul>
<p>This post will review context and motivation for the paper. Again, the approximate similarity search space may have progressed to different kinds of techniques, but FAISS’s techniques are powerful, simple, and inspirational in their own right.</p>
<h2 id="motivation">Motivation</h2>
<p>At a high level, <strong>similarity search helps us find similar high dimensional real vectors from a fixed “database” of vectors to a given query vector, without resorting to checking each one. In database terms, we’re making an index of high-dimensional real vectors.</strong></p>
<h3 id="who-cares">Who Cares</h3>
<h5 id="spam-detection">Spam Detection</h5>
<p><img src="/assets/2019-07-18-faiss/tinder.jpg" alt="tinder logo" class="center-image" /></p>
<blockquote>
<p>Tinder bot 1 bio: “Hey, I’m just down for whatever you know? Let’s have some fun.”</p>
<p>Tinder bot 2 bio: “Heyyy, I’m just down for whatevvver you know? Let’s have some fun.”</p>
<p>Tinder bot 3 bio: “Heyyy, I’m just down for whatevvver you know!!? I just wanna find someone who wants to have some fun.”</p>
</blockquote>
<p>You’re Tinder, and you know spammers make different accounts and randomly tweak the bios of their bots, so you have to check similarity across all your bios. How?</p>
<h5 id="recommendations">Recommendations</h5>
<p>You’re <img src="/assets/2019-07-18-faiss/fb.png" alt="facebook" style="display:inline" /> or <img src="/assets/2019-07-18-faiss/goog.png" alt="google" style="display:inline" /> and users clicking on ads keep the juices flowing.</p>
<p>Or you’re <img src="/assets/2019-07-18-faiss/amazon.png" alt="amazon" style="display:inline" /> and part of trapping people with convenience is telling them what they want before they want it. Or you’re <img src="/assets/2019-07-18-faiss/netflix.png" alt="netflix" style="display:inline" /> and you’re trying to keep people inside on a Friday night with another Office binge.</p>
<p>Luckily for those companies, their greatest minds have turned those problems into summarizing me as faux-hipster half-effort yuppie as encoded in a dense 512-dimensional vector, which must be matched via inner product with another 512-dimensional vector for Outdoor Voices’ new marketing “workout chic” campaign.</p>
<h3 id="problem-setup">Problem Setup</h3>
<p>You have a set of database vectors \(\{\textbf{y}_i\}_{i=0}^\ell\), each in \(\mathbb{R}^d\). You can do some prep work to create an index. Then at runtime I ask for the \(k\) closest vectors, which might be measured in \(L^2\) distance, or the vectors with the largest inner product.</p>
<p>Formally, we want the set \(L=\text{$k$-argmin}_i\norm{\textbf{x}-\textbf{y}_i}\) given \(\textbf{x}\).</p>
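<p>Brute force answers this query exactly, and it serves as the ground truth that approximate methods get judged against. A sketch:</p>

```python
def k_argmin(x, ys, k):
    """Exact k-nearest neighbors: indices of the k database vectors
    closest to the query x in (squared) L2 distance."""
    sq_dist = lambda y: sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return sorted(range(len(ys)), key=lambda i: sq_dist(ys[i]))[:k]

ys = [(0, 0), (1, 1), (5, 5), (1, 0)]
nearest = k_argmin((0.9, 0.9), ys, 2)  # -> [1, 3]
```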
<p>Overlooking the fact that this is probably an image of \(k\)-nearest neighbors, this summarizes the situation, in two dimensions:</p>
<p><img src="/assets/2019-07-18-faiss/nearest-neighbors.png" alt="nearest neighbors" class="center-image" /></p>
<h5 id="why-is-this-hard">Why is this hard?</h5>
<p>Suppose we have 1M embeddings at a dimensionality of about 1K. This is a very conservative estimate, but it already amounts to scanning over 1GB of data per query if done naively.</p>
<p>Let’s continue to be extremely conservative and say our service is replicated so much that we have one machine per live query per second, which is still a lot of machines. Scanning over 1GB of data serially on one node with 10Gb of memory bandwidth isn’t something you can do at interactive speeds, clocking in at about 1 second of response time for just this extremely crude simplification.</p>
<p>Exact methods for answering the above problem (Branch-and-Bound, LEMP, FEXIPRO) limit the search space. The most recent <a href="https://github.com/stanford-futuredata/optimus-maximus">SOTA for exact search</a> is still 1-2 orders of magnitude slower than approximate methods. For the previous use cases, we don’t care about exactness (though there certainly are cases where it does matter).</p>
<h2 id="related-work">Related Work</h2>
<h5 id="before-faiss">Before FAISS</h5>
<p>FAISS itself is built on product quantization work from its authors, but for context there were a couple of interesting approximate nearest-neighbor search problems around.</p>
<p>Tangentially related is the lineage of hashing-based approaches <a href="https://www.microsoft.com/en-us/research/publication/speeding-up-the-xbox-recommender-system-using-a-euclidean-transformation-for-inner-product-spaces/">Bachrach et al 2014</a> (Xbox), <a href="https://arxiv.org/abs/1405.5869">Shrivastava and Li 2014</a> (L2ALSH), <a href="https://arxiv.org/abs/1410.5518">Neyshabur and Srebro 2015</a> (Simple-ALSH) for solving inner product similarity search. The last paper in particular has a unifying perspective between inner product similarity search and \(L^2\) nearest neighbors (namely a reduction from the former to the latter).</p>
<p>However, for the most part, it wasn’t locally-sensitive hashing, but rather clustering and hierarchical index construction that was the main approach to this problem before. One of the nice things about the FAISS paper in my view is that it is a disciplined epitome of these approaches that’s effectively implemented.</p>
<h5 id="after-faiss">After FAISS</h5>
<p>Recently hot new graph-based approaches have been killing it in the <a href="http://ann-benchmarks.com/">benchmarks</a>. It makes you think FAISS is out, <a href="https://github.com/nmslib/hnswlib">HNSW</a> and <a href="https://github.com/yahoojapan/NGT">NGT</a> are in.</p>
<p><img src="/assets/2019-07-18-faiss/benchmarks.png" alt="benchmarks" class="center-image" /></p>
<p>Just kidding. Like the second place winners for ILSVRC 2012 will tell you, simple and fast beats smart and slow. As <a href="https://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/">this guy</a> proved, a CPU implementation from 2 years in the future still won’t compete with a simpler GPU implementation from the past.</p>
<p><img src="/assets/2019-07-18-faiss/gpucpu.png" alt="gpu vs cpu" class="center-image" /></p>
<p>You might say this is an unfair comparison, but life (resource allocation) doesn’t need to be fair either.</p>
<h2 id="evaluation">Evaluation</h2>
<p>FAISS provides an engine which approximately answers the query \(L=\text{$k$-argmin}_i\norm{\textbf{x}-\textbf{y}_i}\) with the response \(S\).</p>
<p>The metrics for evaluation here are:</p>
<ul>
<li>Index build time, in seconds. For a set of \(\ell\) database vectors, how long does it take to construct the index?</li>
<li>Search time, in seconds, which is the average time it takes to respond to a query.</li>
<li><em>R@k</em>, or recall-at-\(k\). Here the response \(S\) may be slightly larger than \(k\), but we look at the closest \(k\) items in \(S\) with an exact search, yielding \(S_k\). This value is then \(\card{S_k\cap L}/k\), where \(k=\card{L}\).</li>
</ul>
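<p>The last metric in code, a sketch assuming <code>query_dist</code> gives exact distances to the query:</p>

```python
def recall_at_k(response, exact, query_dist, k):
    """R@k: exactly re-rank the (possibly larger) response, keep the best k,
    and measure overlap with the true k-argmin set."""
    s_k = set(sorted(response, key=query_dist)[:k])
    return len(s_k & set(exact)) / k

# toy check: ids double as distances, so the true 3-argmin is {0, 1, 2}
r = recall_at_k([0, 2, 5, 7], [0, 1, 2], query_dist=lambda i: i, k=3)
```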
<h2 id="faiss-details">FAISS details</h2>
<p>In <a href="/2019/07/18/faiss-pt-2.html">the next post</a>, I’ll take a look at how FAISS addresses this problem.</p>
Thu, 18 Jul 2019 00:00:00 +0000
https://vlad17.github.io/2019/07/18/faiss.html
<h1 id="faiss-part-2">FAISS, Part 2</h1>
<p>I’ve <a href="/2019/07/18/faiss.html">previously</a> motivated why nearest-neighbor search is important. Now we’ll look at how <a href="https://arxiv.org/abs/1702.08734">FAISS</a> solves this problem.</p>
<p>Recall that you have a set of database vectors \(\{\textbf{y}_i\}_{i=0}^\ell\), each in \(\mathbb{R}^d\). You can do some prep work to create an index. Then at runtime I ask for the \(k\) closest vectors in \(L^2\) distance.</p>
<p>Formally, we want the set \(L=\text{$k$-argmin}_i\norm{\textbf{x}-\textbf{y}_i}\) given \(\textbf{x}\).</p>
<p>The main paper contributions in this regard were a new algorithm for computing the top-\(k\) scalars of a vector on the GPU and an efficient k-means implementation.</p>
<h2 id="big-lessons-from-faiss">Big Lessons from FAISS</h2>
<p>Parsimony is important. Not only does it indicate you’re using the right representation for your problem, but it’s better for bandwidth and better for cache. E.g., see this <a href="https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index">wiki link</a>, HNSW on 1B vectors at 32 levels results in TB-level index size!</p>
<p>Prioritize parallel-first computing. The underlying algorithmic novelty behind FAISS takes a serially slow algorithm, an \(O(n \log^2 n)\) sort, and parallelizes it to something that takes \(O(\log^2 n)\) serial time. Unlike in serial computing, in parallel settings we can afford extra total work as long as the span (critical path) of our computation DAG stays short. Here, speed means proper hardware-efficient vectorization.</p>
<h2 id="the-gpu">The GPU</h2>
<p>The paper, refreshingly, reviews the GPU architecture.</p>
<p><img src="/assets/2019-07-18-faiss-pt2/gpu.png" alt="gpu" class="center-image" /></p>
<p>Logical compute hierarchy is <code class="highlighter-rouge">grid -> block -> warp -> lane (thread)</code></p>
<p>Memory hierarchy is <code class="highlighter-rouge">main mem (vram) -> global l2 -> stream multiprocessor (SM) l1 + shared mem</code>, going from multi-GB to multi-MB to about <code class="highlighter-rouge">16+48 KB</code>.</p>
<p>There might be one or more blocks scheduled to a single streaming multiprocessor, which is itself a set of cores. Cores have their own floating point processing units and integer units, but other supporting units like the MMU-equivalent are shared.</p>
<p>My takeaways from this section were the usual “maximize the amount of work each core is doing independently, keeping compute density high and memory accesses low, especially shared memory”, but with two important twists:</p>
<ul>
<li>GPU warps (gangs of threads) exhibit worse performance when the threads aren’t performing the same instructions on possibly different data (<em>warp divergence</em>).</li>
<li>Each thread is best kept dealing with the memory in its own lane (which typically is a slice of a 32-strided array that the block is processing with multiple warps in a higher granularity of parallelism), but there can be synchronization points through the register file which exchange memory between the threads.</li>
</ul>
<p>Note there are 32 threads to a warp, we’ll see that come up.</p>
<h2 id="faiss--ivf--adc">FAISS = IVF + ADC</h2>
<p>FAISS answers the question of “what are the closest database points to the query point” by constructing a 2-level tree. Database vectors are further compressed to make the tree smaller.</p>
<p>Given \(n\) database vectors, we cluster with k-means for the top level, getting about \(\sqrt{n}\) centroids. Then, at search time, we use exact search to find the closest centroids, and then among the members of those centroids’ clusters we look for the closest database vectors overall.</p>
<p>For a 2-level tree, a constant factor of \(\sqrt{n}\) is the optimal cluster size since then the exact search that we do is as small as possible at both levels of the tree.</p>
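<p>To see why, note that with \(c\) clusters of about \(n/c\) members each, a single-probe query scans both levels exactly:
\[
\mathrm{cost}(c)=c+\frac{n}{c}\,,\qquad \frac{d}{dc}\left(c+\frac{n}{c}\right)=1-\frac{n}{c^2}=0\iff c=\sqrt{n}\,,
\]
for a total of \(2\sqrt{n}\) comparisons per probe.</p>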
<p>Since it’s possible the point might be near multiple centroids, FAISS looks at the \(\tau\) closest centroids in the top level of the tree, and then searches all cluster members among the \(\tau\) clusters.</p>
<p>So the larger search occurs when looking at the second level.</p>
<p>Compression reduces I/O pressure as the second-level’s database vectors are loaded. Furthermore, the specific compression algorithm chosen for FAISS, <a href="https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf">Product Quantization</a> (PQ) enables distance computation on the codes themselves! The code is computed on the <em>residual</em> of the database vector \(\textbf{y}\) from its centroid \(q_1(\textbf{y})\).</p>
<p><img src="/assets/2019-07-18-faiss-pt2/residual.png" alt="residual" class="center-image" /></p>
<p>The two-level tree format is the inverted file (IVF), which is essentially a list of records for the database vectors associated with each cluster.</p>
<p>ADC, or asymmetric distance computation, refers to the fact that we’re using the code of the database vector and calculating its distance from the exact query vector. This can be made symmetric by using a code for the query vector as well. We might do this because the coded distance computation can actually be faster than a usual Euclidean distance computation.</p>
<p><img src="/assets/2019-07-18-faiss-pt2/adc.png" alt="ADC" class="center-image" /></p>
<h2 id="faiss-the-easy-part">FAISS, the easy part</h2>
<p>The above overview yields a simple algorithm.</p>
<ol>
<li>Compute exact distances to top-level centroids</li>
<li>Compute ADC in inverted list in probed centroids, generating essentially a list of pairs (index of probed database vector, approximate distance to query point)</li>
<li>The \(\ell\) pairs with the smallest second item are extracted, for some \(\ell\) not much larger than \(k\). Then the top \(k\) among those are returned.</li>
</ol>
<p>The meat of the paper is doing these steps quickly.</p>
<h2 id="fast-adc-via-pq">Fast ADC via PQ</h2>
<p>Product Quantization (PQ) boils down to compressing subvectors independently. E.g., we might have a four-dimensional vector \(\textbf{y}=[1, 2, 3, 4]\). We quantize it with \(b=2\) factors as \([(1, 2), (3, 4)]\). Doing this for all our vectors yields \(b\) sets of smaller vectors. The FAISS paper denotes these subvectors as \(\textbf{y}^1=(1, 2), \textbf{y}^2=(3, 4)\).</p>
<p>We then cluster the \(b\) sets independently with 256 centroids. The centroids that these subvectors get assigned to might be \(q^1(\textbf{y}^1)=(1, 1), q^2(\textbf{y}^2)=(4, 4.5)\), which is where the lossy part of the compression comes in. On the plus side, we just encoded 4 floats with 2 bytes!</p>
<p>This compression technique is applied to the <em>residual</em> of the database vectors for their centroids, meaning we have PQ dictionaries for each centroid.</p>
<p>The key insight here is that we can also break up our query vector \(\textbf{x}=[\textbf{x}^1, \cdots, \textbf{x}^b]\), and create distance lookup tables on the sub-vectors individually, so the distance to a database vector is just a sum of \(b\) looked-up values!</p>
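<p>A toy end-to-end sketch of PQ encoding plus the ADC lookup, mirroring the \(b=2\) example above; the tiny hand-picked codebooks stand in for the 256 k-means centroids per subquantizer:</p>

```python
CODEBOOKS = [
    [(1.0, 1.0), (3.0, 0.0)],  # centroids for the first subvector set
    [(4.0, 4.5), (0.0, 0.0)],  # centroids for the second subvector set
]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pq_encode(y):
    """One small code per subvector: the index of its nearest centroid."""
    subs = [y[:2], y[2:]]
    return tuple(min(range(len(cb)), key=lambda i: sq_dist(cb[i], s))
                 for cb, s in zip(CODEBOOKS, subs))

def adc_tables(x):
    """Lookup tables of distances from the exact query's subvectors to each centroid."""
    subs = [x[:2], x[2:]]
    return [[sq_dist(c, s) for c in cb] for cb, s in zip(CODEBOOKS, subs)]

def adc_dist(tables, code):
    # distance to a coded database vector: just b lookups and a sum
    return sum(t[c] for t, c in zip(tables, code))

code = pq_encode([1, 2, 3, 4])       # -> (0, 0), i.e. centroids (1, 1) and (4, 4.5)
tables = adc_tables([1, 2, 3, 4])
```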
<p><img src="/assets/2019-07-18-faiss-pt2/pq-lookup.png" alt="PQ Lookup" class="center-image" /></p>
<h2 id="top-k">Top-k</h2>
<p>OK, so now comes the hard part, we just did steps 1 and 2 really fast, and it’s clear those are super parallelizable algorithms, but how do we get the top (smallest) \(k\) items from the list?</p>
<p>Well, on a CPU, we’d implement this in a straightforward way. Use a max-heap of size \(k\), scan through our list of size \(n\), and then if the next element is smaller than the max of the heap or the heap has size less than \(k\), pop-and-insert or just insert, respectively, into the heap, yielding an \(O(n\log k)\) algorithm.</p>
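<p>That CPU baseline is a few lines with <code>heapq</code> (a min-heap, so we negate values to get max-heap behavior); a sketch:</p>

```python
import heapq

def top_k_smallest(values, k):
    """One pass, O(n log k): keep a size-k max-heap of the best candidates."""
    heap = []  # negated values, so -heap[0] is the current k-th smallest
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:  # beats the current k-th smallest
            heapq.heapreplace(heap, -v)
    return sorted(-v for v in heap)

result = top_k_smallest([9, 1, 7, 3, 8, 2, 6], 3)  # -> [1, 2, 3]
```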
<p>We could parallelize this \(p\) ways by chopping into \(n/p\)-sized chunks, getting \(k\)-max-heaps, and merging all the heaps, but the intrinsic algorithm does not parallelize well. This means this approach works well when you have lots of CPUs, but is not nearly compute-dense enough for tightly-packed GPU threads, 32 to a warp, where you need to do a lot more computation per byte (having each of those threads maintain its own heap results in a lot of data-dependent instruction divergence).</p>
<p>The alternative approach proposed by FAISS is:</p>
<ul>
<li>Create an extremely parallel mergesort</li>
<li>“Chunkify” the CPU algorithm, taking a big bite of the array at a given time, keeping a “messy max-heap” of a lot more than \(k\) (namely, \(k+32t\)) that includes everything the \(k\)-max-heap would.</li>
<li>Every once in a while, do a full sort on the messy max-heap.</li>
</ul>
<p>Squinting from a distance, this looks similar to the original algorithm, but the magic is in the “chunkification” which enables full use of the GPU.</p>
<h3 id="highly-parallel-mergesort">Highly Parallel Mergesort</h3>
<p>As mentioned, this innovation is essentially an in-place mergesort with \(O(n\log^2 n)\) total work but a low computational span (critical-path depth), which is what makes it highly parallel.</p>
<p>The money is in the merge operation, which is based on Batcher’s sorting networks. The invariant is that we maintain a (lexicographically ordered) list of sorted sequences.</p>
<ol>
<li>First, we have one sequence of length at most \(n\) [trivially holds]</li>
<li>Then, we have 2 sequences of length at most \(n/2\)</li>
<li>4 sequences length \(n/4\)</li>
<li>Etc.</li>
</ol>
<p><img src="/assets/2019-07-18-faiss-pt2/odd-size.png" alt="odd size" class="center-image" /></p>
<p>Each merge has \(\log n\) steps, where at each step we might have up to \(n\) swaps, but they are disjoint and can happen in parallel. The key is to see that these \(n\) independent swaps ensure lexicographic ordering among the sequences.</p>
<p>This is the <code class="highlighter-rouge">odd-merge</code> (Algorithm 1) in the paper. There’s additional logic for irregularly-sized lists to be merged. We’ll come back to this.</p>
<p>Once we have a parallel merge that requires logarithmic serial time, the usual merge sort (Algorithm 2), which itself has a recursion tree of logarithmic depth, results in a \(O(\log^2 n)\) serial time (or depth) algorithm, assuming infinite processors.</p>
<p><img src="/assets/2019-07-18-faiss-pt2/merge-sort.png" alt="merge sort" class="center-image" /></p>
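<p>To make the compare-swap structure concrete, here is a small CPU sketch of Batcher’s odd-even mergesort network for power-of-two sizes. The comparator pairs generated within one pass are disjoint, which is exactly what makes them safe to run in parallel; this is an illustration of the idea, not FAISS’s actual kernel, which also handles odd sizes:</p>

```python
def comparators(n):
    """Compare-swap index pairs for Batcher's odd-even mergesort (n a power of 2)."""
    pairs = []
    p = 1
    while p < n:
        k = p
        while k >= 1:
            # pairs emitted in this inner loop are disjoint: parallel-safe
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        pairs.append((i + j, i + j + k))
            k //= 2
        p *= 2
    return pairs

def network_sort(xs):
    """Apply the fixed comparator network; control flow is data-independent."""
    xs = list(xs)
    for a, b in comparators(len(xs)):
        if xs[a] > xs[b]:
            xs[a], xs[b] = xs[b], xs[a]
    return xs
```

<p>Because the sequence of comparisons is fixed ahead of time, the same instruction stream runs in every GPU lane, with no data-dependent branching.</p>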
<h3 id="chunkification">Chunkification</h3>
<p>This leads to WarpSelect, which is the chunkification mentioned earlier. In essence, our messy max-heap is a combination (and thus superset) of:</p>
<ul>
<li>The strict size \(k\) max-heap with the \(k\) lowest values seen so far. In fact, this is sorted when viewed as a 32-stride array.</li>
<li>32 thread queues, each maintained in sorted order.</li>
</ul>
<p>Here \(T_i^j\) is the \(i\)-th element of thread \(j\)’s queue and \(W_{k-1}\) is the largest element of the warp queue; the invariants are \(T_0^j\le T_i^j\) for \(i>0\) and \(T_0^j\ge W_{k-1}\). So if an input is greater than its thread queue head \(T_0^j\), it can be safely ignored (the head is only a weak bound on the true cutoff).</p>
<p><img src="/assets/2019-07-18-faiss-pt2/warp-select.png" alt="warp select" class="center-image" /></p>
<p>On the fast path, the next 32 values are read in, and we do a SIMT (single instruction, multiple-thread) compare on each value assigned to each thread. A primitive instruction checks if any of the warp’s threads had a value below the cutoff of the max heap (if none did, we know for sure none of those 32 values are in the top \(k\) and can move on).</p>
<p>If there was a violation, after the per-lane insertion sort the thread heads might be smaller than they were before. Then we do a full sort of the messy heap, restoring the fact that the strict max-heap has the lowest \(k\) values so far.</p>
<ul>
<li>At this point, it’s clear why we needed a merge sort: the strict max-heap (“warp queue” in the image) is already sorted, so a merge-based sorting algorithm can avoid re-sorting it.
<ul>
<li>Finally, it’s worth pointing out that recasting the fully sorted messy heap into the thread queues maintains the sorted order within each lane.</li>
</ul>
</li>
<li>Further, it’s clear why the FAISS authors created a homebrew merge algorithm compatible with irregular merge sizes, as opposed to existing power-of-2 parallel merge algorithms: the thread queues are irregularly sized compared to \(k\), and rounding the array sizes up would add a lot of overhead.</li>
</ul>
<p>This leads to the question: why have thread queues at all? Why not make their size exactly 1?</p>
<p>This points to a convenient piece of slack: the thread queue length \(t\) lets us trade off the cost of the full merge sort against the per-thread insertion sort done every time new values are read in. The optimal choice depends on \(k\).</p>
<h2 id="results">Results</h2>
<p>Remember, it’s not apples to apples, because FAISS gets a GPU and modern methods use CPUs, but who cares.</p>
<p>Recall from the <a href="/2019/07/18/faiss.html">previous post</a> that the <code class="highlighter-rouge">R@1</code> metric is the average frequency with which the method actually returns the nearest neighbor (it may have the query \(k\) set higher). The different parameters used here don’t matter so much, but I’ll highlight what each row means individually.</p>
<p><a href="https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors">SIFT1M</a></p>
<p><img src="/assets/2019-07-18-faiss-pt2/sift.png" alt="sift" class="center-image" /></p>
<p>HNSW is a modern CPU-based competitor, using an algorithm written two years after the FAISS paper. Flat is naive exhaustive search. In this benchmark, the PQ optimization was not used (database vector distances were computed exactly).</p>
<p><a href="https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors">Deep1B</a></p>
<p><img src="/assets/2019-07-18-faiss-pt2/deep1b.png" alt="deep1b" class="center-image" /></p>
<p>Here, for the very large dataset, the authors do use compression (OPQ indicates a preparatory transformation for the compression).</p>
<p>On the whole, FAISS is still the winner since it can take advantage of hardware. On the CPUs, it’s still a contender when it comes to a memory-speed-accuracy tradeoff.</p>
<h2 id="extensions-and-future-work">Extensions and Future Work</h2>
<p>The authors of the original FAISS work have themselves looked into extensions that combine the FAISS approach with then newer graph-based neighborhood algorithms (<a href="https://arxiv.org/abs/1804.09996">Link and Code</a>).</p>
<p>Other future work that the authors have since performed has been in improving the organization of the two-level tree structure. The centroid-based approach of the IVF implicitly partitions the space with a Voronoi diagram. As the <a href="https://cache-ash04.cdn.yandex.net/download.yandex.ru/company/cvpr2012.pdf">Inverted Multi-Index</a> (IMI) paper explores, this results in a lot of unnecessary neighbors being probed that are far away from the query point but happen to belong to the same Voronoi cell. One extension that now exists in the code base is to use IMI instead of IVF.</p>
<p>It’s also fun to consider how these systems will be evolving over time. As memory bandwidth increases, single node approaches (like FAISS) grow increasingly viable since they can keep compute dense. However, as network speeds improve, distributed approaches with many, many CPUs look attractive. The latter types of algorithms rely more on hierarchy and less on vectorization and compute density.</p>
Thu, 18 Jul 2019 00:00:00 +0000
https://vlad17.github.io/2019/07/18/faiss-pt-2.html
https://vlad17.github.io/2019/07/18/faiss-pt-2.html
parallel, hardware-acceleration

BERT, Part 3: BERT
<h1 id="bert">BERT</h1>
<p>In the last two posts, we reviewed <a href="/2019/03/09/dl-intro.html">Deep Learning</a> and <a href="/2019/06/22/bert-pt-2-transformer.html">The Transformer</a>. Now we can discuss an interesting advance in NLP, BERT, Bidirectional Encoder Representations from Transformers (<a href="https://arxiv.org/abs/1810.04805">arxiv link</a>).</p>
<p>BERT is a self-supervised method, which uses just a large set of unlabeled textual data to learn representations broadly applicable for different language tasks.</p>
<p>At a high level, BERT’s pre-training objective, which is what’s used to get its parameters, is a language modeling (LM) problem. LM is an instance of parametric modeling applied to language.</p>
<blockquote>
<p>Typical LM task: what’s the probability that the next word is “cat” given the sentence is “The dog chased the ????”</p>
</blockquote>
<p>Let’s consider a natural language sentence \(x\). In some way, we’d like to construct a loss function \(L\) for a language modeling task. We’ll keep it abstract for now, but, if we set up the model \(M\) right, and have something that generally optimizes \(L(M(\theta), x)\), then we can interpret one of BERT’s theses as the claim that this representation transfers to new domains.</p>
<p>That is, for some very small auxiliary model \(N\) and a set of parameters \(\theta'\) close enough to \(\theta\), we can optimize a different task’s loss (say, \(L'\), the task that tries to classify sentiment \(y\)) by minimizing \(L'(N(\omega)\circ M(\theta'),(x, y))\).</p>
<p>One of the reasons we might imagine this to work is by viewing networks like \(M(\theta')\) as featurizers that create a representation ready for the final layer to do a simple linear classification on.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/last-layer-feat.png" alt="featurization on the last layer" class="center-image" /></p>
<p>Indeed, the last layer of a neural network performing a classification task is just a logistic regression on the features generated by the layers before it. It makes sense that those features could be useful elsewhere.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/fig1.png" alt="bert figure 1" class="center-image" /></p>
<h2 id="contribution">Contribution</h2>
<p>The motivation for this kind of approach (LM pre-training and then a final fine-tuning step) versus task-specific NLP is twofold:</p>
<ul>
<li>Data volume is much larger for the LM pre-training task</li>
<li>The approach can solve multiple problems at once.</li>
</ul>
<p>Thus, the contributions of the paper are:</p>
<ul>
<li>An extremely robust, generic approach to pretraining. 11 SOTAs in one paper.</li>
<li>Simple algorithm.</li>
<li>Effectiveness is profound because (1) the general principle of self-supervision can likely be applied elsewhere and (2) ablation studies in the paper show that representation is the bottleneck.</li>
</ul>
<h2 id="technical-insights">Technical Insights</h2>
<p>The new training procedure and architecture that BERT provides are conceptually simple.</p>
<p>BERT provides deep, bidirectional, context-sensitive encodings.</p>
<p>Why do we need all three of these things? Let’s consider a training task, next sentence prediction (NSP) to demonstrate.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/deep-bi-cxt-ex.png" alt="example deep bidirectional" class="center-image" /></p>
<p>We can’t claim that this is exactly what’s going on in BERT, but as humans we certainly require bidirectional context to answer. In particular, for some kind of logical relation between the entities in a sentence, we first need (bidirectional) context. I.e., to answer if “buying milk” is something we do in a store, we need to look at the verb, object, and location.</p>
<p>What’s more, to answer complicated queries about the coherence of two sentences, we need to layer additional reasoning beyond the logical relations we can infer at the first level. We might be able to detect inconsistencies at L0, but for more complicated interactions we need to look at a relationship between logical relationships (L1 as pictured above).</p>
<p>So, it may make sense that to answer logical queries of a certain nesting depth, we’d need to recursively apply our bidirectional, contextualization representation up to a corresponding depth (namely, stacking Transformers). In the example, we might imagine this query to look like:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>was-it-the-same-person(
who-did-this("man", "went"),
who-did-this("he", "bought")) &&
is-appropriate-for-location(
"store", "bought", "milk")
</code></pre></div></div>
<h2 id="related-work">Related work</h2>
<p>It’s important to describe existing related work that made strides in this direction. Various previous deep learning architectures have independently proposed using LM for transfer learning to other tasks and deep, bidirectional context (but not all at once).</p>
<p>In particular, relevant works are <a href="https://nlp.stanford.edu/pubs/glove.pdf">GloVe</a>, <a href="https://arxiv.org/abs/1802.05365">ELMo</a>, and <a href="https://openai.com/blog/language-unsupervised/">GPT</a>.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/related-work.png" alt="related work overview" class="center-image" /></p>
<h2 id="training">Training</h2>
<p>As input, BERT uses the BooksCorpus (800M words) and English Wikipedia (2,500M words), totaling 3.3B words, tokenized against a vocabulary of roughly 30K word pieces. There were a few standard NLP featurization techniques applied to this as well (lower casing, for instance), though I think the architecture could’ve handled richer English input.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/fig2.png" alt="bert figure 2" class="center-image" /></p>
<p>But what’s the output? Given just the inputs, how can we create a loss that learns a good context-sensitive representation of each word? This needs to be richer than the context-free representation of each word (i.e., the embedding that each word piece starts as in the first layer of the input to the BERT network).</p>
<p>We might try to recover the original input embedding, but then the network would just learn the identity function. This is the correct answer if we’re just learning on the joint distribution of \((x, x)\) between a sentence and itself.</p>
<p>Instead, BERT trains on sequence <em>recovery</em>. That is, our input is a sentence \(x_{-i}\) missing its \(i\)-th word, and our output is the \(i\)-th word itself, \(x_i\). This is implemented efficiently with masking in practice. That is, the input-output pair is \((\text{“We went [MASK] at the mall.”}, \text{“shopping”})\). In the paper, <code class="highlighter-rouge">[MASK]</code> is the placeholder for a missing word.</p>
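<p>A toy sketch of how such input-output pairs might be constructed. Everything here is illustrative: real BERT masks roughly 15% of word pieces, and (as discussed in the paper) sometimes substitutes a random or unchanged token instead of <code class="highlighter-rouge">[MASK]</code>, which this sketch skips:</p>

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Hide ~mask_prob of the tokens; labels are the originals (None elsewhere)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the network must predict this token here
        else:
            inputs.append(tok)
            labels.append(None)  # no loss contribution at unmasked positions
    return inputs, labels
```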
<p>In addition, BERT adds an auxiliary task, NSP, where a special <code class="highlighter-rouge">[CLS]</code> classification token is prepended to the input; it serves as a marker for “this token should represent the whole context of the input sentence(s),” and its encoding is used as a single fixed-width input for classification. This improves performance slightly (see Table 15 in the original work).</p>
<p>That’s essentially it.</p>
<blockquote>
<p>BERT = Transformer Encoder + MLM + NSP</p>
</blockquote>
<p>There’s an important caveat due to training/test distribution mismatch. See the last section, <a href="#open-questions">Open Questions</a>, below.</p>
<h2 id="fine-tuning">Fine-tuning</h2>
<p>For fine-tuning, we just add one more layer on top of the final encoded sequence that BERT generates.</p>
<p>In the case of class prediction, we apply a classifier to the fixed width embedding of the <code class="highlighter-rouge">[CLS]</code> marker.</p>
<p>In the case of subsequence identification, like in SQuAD, we want to select a start and end by using a start classifier and end classifier applied to each token in the final output sequence.</p>
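<p>A minimal sketch of that start/end scoring (the weight vectors <code class="highlighter-rouge">w_start</code> and <code class="highlighter-rouge">w_end</code> are hypothetical stand-ins for the learned span classifiers; real BERT applies softmaxes and trains these jointly with the encoder):</p>

```python
import numpy as np

def pick_span(H, w_start, w_end):
    """Independently score each token's final encoding in H (s x d) as a span start/end."""
    start_scores = H @ w_start  # (s,) one "span starts here" score per token
    end_scores = H @ w_end      # (s,) one "span ends here" score per token
    # nothing forces end >= start; the two argmaxes are taken independently
    return int(np.argmax(start_scores)), int(np.argmax(end_scores))
```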
<p>For instance, a network is handed a paragraph like the following:</p>
<blockquote>
<p>One of the most famous people born in Warsaw was Maria Skłodowska-Curie, who achieved international recognition for her research on radioactivity and was the first female recipient of the Nobel Prize. Famous musicians include Władysław Szpilman and Frédéric Chopin. Though Chopin was born in the village of Żelazowa Wola, about 60 km (37 mi) from Warsaw, he moved to the city with his family when he was seven months old. Casimir Pulaski, a Polish general and hero of the American Revolutionary War, was born here in 1745.</p>
</blockquote>
<p>And then asked a reading comprehension question like “How old was Chopin when he moved to Warsaw with his family?” to which the answer is the subsequence “seven months old.” Hard stuff! And BERT performs at or above <a href="https://rajpurkar.github.io/SQuAD-explorer/">human level</a>.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/tbl1.png" alt="bert table 1" class="center-image" /></p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/tbl2.png" alt="bert table 2" class="center-image" /></p>
<h2 id="conclusions">Conclusions</h2>
<p>The BERT model is extremely simple, to the point where there’s a mismatch with intuition.</p>
<p>There are some seemingly spurious decisions that don’t have a big effect on training.</p>
<p>First, the segment embeddings indicate different sentences in inputs, but positional embeddings provide positional information anyway. This is seemingly redundant information the network needs to learn to combine.</p>
<p>Second, the start and end indicators for the span predicted for SQuAD are computed independently, where it might make sense to compute the end conditional on the start position. Indeed, it’s possible to get an end before the start (in which case the span is considered empty).</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/separate-span.png" alt="independent span" class="center-image" /></p>
<p>There are probably many such smaller modeling improvements we could make. But the point is that <em>it’s a waste of time</em>. If anything is the most powerful table to take away from this paper, it’s Table 6.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/tbl6.png" alt="bert table 6" class="center-image" /></p>
<p>Above any kind of task-specific tuning or model improvements, the longest pole in the tent is representation. Investing effort in finding the “right” representation (here, bidirectional, deep, contextual word piece embeddings) is what maximizes broad applicability and the potential for transfer learning.</p>
<p><img src="/assets/2019-06-23-bert-pt3-bert/all-representation.png" alt="independent span" class="center-image" /></p>
<h2 id="open-questions">Open Questions</h2>
<h4 id="transfer-learning-distribution-mismatch">Transfer Learning Distribution Mismatch</h4>
<p>At the end of Section 3.1, we notice something weird. In the masked language modeling task, our job is to derive what the <code class="highlighter-rouge">[MASK]</code> token was.</p>
<p>But in the evaluation tasks, <code class="highlighter-rouge">[MASK]</code> never appears. To combat this “mismatch” between the distribution of evaluation-task tokens and that of the MLM task, full sequences are occasionally shown without the <code class="highlighter-rouge">[MASK]</code> tokens, in which case the network is expected to reproduce the original tokens.</p>
<p>Appendix C.2 digs into the robustness of BERT with respect to messing around with the distribution. This is definitely something that deserves some attention.</p>
<p>During pre-training, we’re minimizing a loss with respect to a distribution that doesn’t match the test distribution (where we randomly remove the mask). How is this a well-posed learning problem?</p>
<p>How much should we smooth the distribution with the mask removals? It’s unclear how to properly set up the “mismatch amount”.</p>
<h4 id="richer-inputs">Richer Inputs</h4>
<p>Based on the ability of BERT to perform well even with redundant encodings (segment encoding and positional encoding), and given its large representational capacity, why operate BERT on word pieces? Why not include punctuation or even HTML markup from Wikipedia?</p>
<p>This kind of input could surely offer more signal for fine tuning.</p>
Sun, 23 Jun 2019 00:00:00 +0000
https://vlad17.github.io/2019/06/23/bert-pt3-bert.html
https://vlad17.github.io/2019/06/23/bert-pt3-bert.html
deep-learning

BERT, Part 2: The Transformer
<h1 id="bert-prerequisite-2-the-transformer">BERT Prerequisite 2: The Transformer</h1>
<p>In the last post, we took a look at deep learning from a very high level (<a href="/2019/03/09/dl-intro.html">Part 1</a>). Here, we’ll cover the second and final prerequisite for setting the stage for discussion about BERT, the Transformer.</p>
<p>The Transformer is a novel sequence-to-sequence architecture proposed in Google’s <a href="https://arxiv.org/abs/1706.03762">Attention is All You Need</a> paper. BERT builds on this significantly, so we’ll discuss here why this architecture was important.</p>
<h2 id="the-challenge">The Challenge</h2>
<p>Recall the language of the previous post applied to supervised learning. We’re interested in a broad class of settings where the input \(\textbf{x}\) has some shared structure with the output \(\textbf{y}\), which we don’t know ahead of time. For instance, \(\textbf{x}\) might be an English sentence and \(\textbf{y}\) might be a German sentence with the same context.</p>
<p>For a parameterized model \(M(\theta)\) which might just be a function over \(\textbf{x}\), we recall the \(L\)-layer MLP from last time, where \(\theta=\mat{\theta_1& \theta_2&\cdots&\theta_L}\),
\[
M(\theta)= x\mapsto f_{\theta_L}^{(L)}\circ f_{\theta_{L-1}}^{(L-1)}\circ\cdots\circ f_{\theta_1}^{(1)}(x)\,,
\]
and we define each layer as
\[
f_{\theta_i}=\max(0, W_ix+b_i)\,,\,\,\, \mat{W_i & b_i} = \theta_i\,.
\]</p>
<p>Most feed-forward neural nets (FFNNs) are just variants on this architecture, with some loss typically like \(\norm{M(\theta)(\textbf{x}) - \textbf{y}}^2\).</p>
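<p>The definition above fits in a few lines of numpy (a sketch of exactly the formula given; in practice the final layer is usually left linear rather than ReLU’d):</p>

```python
import numpy as np

def mlp(thetas):
    """Build M(theta): x -> f_L ∘ ... ∘ f_1(x), where f_i(x) = max(0, W_i x + b_i)."""
    def forward(x):
        for W, b in thetas:                 # thetas = [(W_1, b_1), ..., (W_L, b_L)]
            x = np.maximum(0.0, W @ x + b)  # one ReLU layer
        return x
    return forward
```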
<p>One issue with this, and typical FFNNs, is that they’re mappings from some fixed size vector space \(\mathbb{R}^m\) to another \(\mathbb{R}^k\). When your inputs are variable-length sequences like sentences, this doesn’t make sense for two reasons:</p>
<ol>
<li>Sentences can be longer than the width of your input space (not a fundamental issue, you could just make \(m\) really large).</li>
<li>The inputs don’t respect the semantics of the input dimensions.</li>
</ol>
<p>For typical learning tasks, the \(i\)-th input dimension corresponds to a meaningful position in the input space. E.g., for images, this is the \(i\)-th pixel in the space of fixed size \(64\times 64\) images. It’s next to the \((i-1)\)-th and \((i+1)\)-th pixels, and every \(64\times 64\) image \(\textbf{x}\) will also have its \(i\)-th pixel in the \(i\)-th place.</p>
<p>Not so for sentences. In sentences, the subject may be the first or second or third word. It might be preceded by an article, or it might not. If you look at a fixed offset for many different sentences, you’d be hard pressed to find a robust semantics for the word or letter that you see there. So it’s unreasonable to assume a model could extract relevant structure with such a representation.</p>
<h2 id="recursive-neural-networks-rnns">Recurrent Neural Networks (RNNs)</h2>
<p>The typical resolution to this problem in deep learning is to use RNNs. For an overview, see <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">Karpathy’s blog post</a>.</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/rnn.jpeg" alt="RNN" class="center-image" /></p>
<p>To resolve this issue, we can view our input as a variable-length list of fixed length vectors \(\{\textbf{x}_i\}_{i}\). Next, we modify our FFNN to accept two fixed-length parameters at a time step \(i\), a hidden state \(\textbf{h}_i\) and input \(\textbf{x}_i\). It’s the green box in the diagram above.</p>
<p>This retains essential properties of FFNNs that allow it to optimize well (backprop still works). But, from a perspective of input semantics, we’ve resolved our problem by assuming the hidden state \(\textbf{h}_i\) at timestep \(i\) tells the FFNN how to interpret the \(i\)-th sequence element (which could be a word or word part or character in the sentence). The FFNN is then also responsible for updating how the \((i+1)\)-th sequence element is to be interpreted, by returning \(\textbf{h}_{i+1}\) on the evaluation in timestep \(i\).</p>
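<p>In code, the recurrence is just a loop that reuses one set of weights at every timestep. Below is a sketch with a plain tanh cell (LSTM and GRU cells are the common practical choices, but the shape of the computation is the same):</p>

```python
import numpy as np

def rnn(Wh, Wx, b):
    """A minimal recurrent cell: the same 'green box' applied at every timestep."""
    def run(xs, h0):
        h, hs = h0, []
        for x in xs:                          # variable-length input sequence
            h = np.tanh(Wh @ h + Wx @ x + b)  # h_{i+1} computed from (h_i, x_i)
            hs.append(h)
        return hs                             # one hidden state per timestep
    return run
```

<p>Note the parameter count is fixed regardless of sequence length, which is what lets the same network consume sentences of any size.</p>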
<p>We might want to wait until the network reads the entire input if the entire variable-length output may change depending on all parts of the input (the second to last diagram above). This is the case in translation, where words at the end of the source language may end up at the beginning in the target language.</p>
<p>Alternatively, we might do something like try to classify off of the hidden state after reading the sentence, like identifying the sentiment of a text-based review.</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/yelp1.png" alt="get final hidden state" class="center-image" /></p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/yelp2.png" alt="transform final" class="center-image" /></p>
<h2 id="rnn-challenges">RNN challenges</h2>
<p>Consider the task of translating English to Spanish. Let’s suppose our inputs are sequences of words, like</p>
<blockquote>
<p>I arrived at the bank after crossing the {river,road}.</p>
</blockquote>
<p>The proper translation might be either:</p>
<blockquote>
<p>Llegué a la orilla después de cruzar el río.</p>
</blockquote>
<p>or:</p>
<blockquote>
<p>Llegué al banco después de cruzar la calle.</p>
</blockquote>
<p>Notice how we need to look at the <em>whole</em> sentence to translate it correctly. The choice of “river” or “road” affects the translation of “bank”.</p>
<p>This means that the RNN needs to store information about the entire sentence when translating. For longer sentences, we’d definitely need to use a larger hidden state, but also we’re assuming the network would even be able to train to a parameter setting that properly recalls whole-sentence information.</p>
<h2 id="the-transformer">The Transformer</h2>
<p>The problem we faced above is one of <em>context</em>: to translate “bank” properly we need the full context of the sentence. This is what the Transformer architecture addresses. It inspects each word in the context of others.</p>
<p>Again, let’s view each word in our input sequence as some embedded vector \(\textbf{e}_i\) (for context on word embeddings, check out <a href="https://en.wikipedia.org/wiki/Word2vec">the Wikipedia page</a>).</p>
<p>Our goal is to come up with a new embedding for each word, \(\textbf{a}_i\), which contains context from all other words. This is done through a mechanism called attention. For a code-level explanation, see <a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html">The Annotated Transformer</a>, though I find that focusing on a particular word (the one at position \(i\)) helped me understand better.</p>
<p>The following defines (one head of) a Transformer block. A transformer block just contextualizes embeddings. They can be stacked on top of each other and then handed off to the transformer decoder, which is a more complicated kind of transformer that includes attention over both the inputs and outputs. Luckily, we don’t need that for BERT.</p>
<p>Remember, at the end of the day, we’re trying to take one sequence \(\{\textbf{e}_i\}_i\) and convert it into another sequence \(\{\textbf{a}_i\}_i\) which is then used as input for another stage that does the actual transformation. The point is that the representation \(\{\textbf{a}_i\}_i\) is broadly useful for many different decoding tasks.</p>
<ol>
<li>Apply an FFNN pointwise to each of the inputs \(\{\textbf{e}_i\}_i\) to get \(\{\textbf{x}_i\}_i\).</li>
</ol>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/pointwise-ffn.png" alt="pointwise ffnn" class="center-image" /></p>
<ol>
<li>Now consider a fixed index \(i\). How do we contextualize the word at \(\textbf{x}_i\) in the presence of other words \(\textbf{x}_1,\cdots,\textbf{x}_{i-1},\textbf{x}_{i+1},\cdots,\textbf{x}_s\)?</li>
</ol>
<p>We attend to the sequence itself. Attention tells us how much to pay attention to each element when coming up with a fixed-width context for the \(i\)-th element. This is done with the inner product.</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/self-attn.png" alt="self attention" class="center-image" /></p>
<p>After computing how important each element \(\textbf{x}_j\) is to the element in question \(\textbf{x}_i\), written \(\alpha_j\), we compute the weighted sum of the \(\textbf{x}_j\) themselves.</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/value-sum.png" alt="self attention" class="center-image" /></p>
<ol>
<li>After doing this for every index \(i\in[s]\), we get a new sequence \(\{\textbf{a}_i\}_i\). That’s it!</li>
</ol>
<p>This glosses over a couple of details (normalization, multiple heads, and computational tricks), but it’s the gist of self-attention and the Transformer block.</p>
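<p>One head of this computation fits in a few lines of numpy. This is a sketch of the gist described above, not a faithful Transformer layer: the projection matrices \(W_q, W_k, W_v\) stand in for the learned parameters, and the scaling and softmax are part of the normalization details just mentioned:</p>

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv):
    """One attention head: contextualize each embedding e_i against all others."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv         # project the input embeddings
    scores = Q @ K.T / np.sqrt(K.shape[1])   # inner-product relevance of j to i
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)  # softmax over positions j
    return alpha @ V                         # a_i = sum_j alpha_ij * v_j
```

<p>Each output row \(\textbf{a}_i\) is a convex combination of the value vectors, weighted by how strongly position \(i\) attends to each other position.</p>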
<p>One thing worth mentioning is the positional encoding, which makes sure that information about a word being present in the \(i\)-th position is present before the first Transformer block is applied.</p>
<p>After possibly many transformer blocks, we get our \(L\)-th sequence of embeddings, \(\{\textbf{a}^{(L)}_i\}_i\). We plug this as input to another model, the transformer decoder, which uses a similar process to eventually get a loss based on some input-output pair of sentences (e.g., in translation, the decoder converts the previous sequence into \(\{\textbf{b}_j\}_j\), which is compared with the actual translation \(\{\textbf{y}_j\}_j\)).</p>
<h2 id="so-what">So What?</h2>
<p>On the face of it, this all sounds like a bunch of hand-wavy deep learning nonsense. “Attention”, “embedding”, etc. all look like fancy words to apply to math that is operating on meaningless vectors of floating-point numbers. Layer on top of this (lol) the other crap I didn’t cover, like multiple heads, normalization, and various knobs pulled during training, and the whole thing looks suspect.</p>
<p>It’s not clear which parts are essential, but something is doing its job:</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/res.png" alt="Transformer Results" class="center-image" /></p>
<p>And self-attention looks like it’s doing something like what we think it should.</p>
<p><img src="/assets/2019-06-22-bert-pt-2-transformer/attn-viz.png" alt="Transformer Attention" class="center-image" /></p>
<p>Regardless of how much of a deep learning believer you are, this architecture solves problems which require contextualizing our representation of words, and it picks the right things to attend to in examples.</p>
<h2 id="next-time">Next time</h2>
<p>We’ll see how BERT uses the context-aware Transformer to come up with a representation without any supervision.</p>
Sat, 22 Jun 2019 00:00:00 +0000
https://vlad17.github.io/2019/06/22/bert-pt-2-transformer.html
https://vlad17.github.io/2019/06/22/bert-pt-2-transformer.html
deep-learning

BERT, Part 1: Deep Learning Intro
<h1 id="a-modeling-introduction-to-deep-learning">A Modeling Introduction to Deep Learning</h1>
<p>In this post, I’d like to introduce you to some basic concepts of deep learning (DL) from a modeling perspective. I’ve tended to stay away from “intro” style blog posts because:</p>
<ul>
<li>There are so, so many of them.</li>
<li>They’re hard to keep in focus.</li>
</ul>
<p>That said, I was presenting on <a href="https://arxiv.org/abs/1810.04805">BERT</a> for a discussion group at work. This was our first DL paper, so I needed to warm-start a technical audience with a no-frills intro to modeling with deep nets. So here we are, trying to focus what this post will be:</p>
<ul>
<li>It will presume a technically sophisticated reader.</li>
<li>No machine learning (ML) background is assumed.</li>
<li>The main goal is to set the stage for future discussion about BERT.</li>
</ul>
<p>Basically, this is me typing up those notes. Note the above leaves questions about optimization and generalization squarely out of scope.</p>
<h2 id="the-parametric-model">The Parametric Model</h2>
<p>Deep learning is a tool for the generic task of parametric modeling. Parametric modeling (PM) is a term I am generously applying from statistical estimation theory that encapsulates a broad variety of ML buzzwords, including supervised, unsupervised, reinforcement, and transfer learning.</p>
<p>In the most general sense, a parametric model \(M\) accepts some vector of parameters \(\theta\) and describes some structure in a random process. Goodness, what does that mean?</p>
<ul>
<li>Structure in a random process is everything that differentiates it from noise. But what’s “noise”?</li>
<li>When we fix the model \(M\), we’re basically saying there are only certain classes of structure we’re going to represent, and everything else is what we consider noise.</li>
<li>The goal is to pick a “good” model and find parameters for it.</li>
</ul>
<h3 id="a-simple-example">A Simple Example</h3>
<p>For instance, let’s take a simple random process, iid draws from the normal distribution \(z\sim \mathcal{D}= N(\mu, \sigma^2)\) with an unknown mean \(\mu\) and variance \(\sigma^2\). We’re going to try to capture the richest possible structure over \(z\), its actual distribution. One model might be the unit normal, \(M(\theta)=N(\theta, 1)\). Then our setup, and potential sources of error, look like this:</p>
<p><img src="/assets/2019-03-09-dl-intro/model-err.png" alt="sources of error" class="center-image" /></p>
<p>What I call parametric and model mismatch are also known as estimation and approximation error (<a href="https://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning">Bottou and Bousquet 2007</a>).</p>
<p>Here, we have one of the most straightforward instances of PM, parameter estimation (we’re trying to estimate \(\mu\)).</p>
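To make this concrete, here is a minimal numpy sketch of the estimation (the values of \(\mu\) and \(\sigma\) are made up for illustration); the sample mean plays the role of the fitted \(\theta\), even under the model’s mismatched unit-variance assumption:

```python
import numpy as np

# Hypothetical ground truth, unknown to the modeler; sigma != 1 is the model mismatch.
rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0
z = rng.normal(mu, sigma, size=100_000)

# The sample mean is the natural estimate of theta for M(theta) = N(theta, 1).
theta_hat = z.mean()
assert abs(theta_hat - mu) < 0.05  # parametric (estimation) error shrinks with n
```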
<h3 id="revisiting-our-definitions">Revisiting our definitions</h3>
<p>What constitutes a “good” model? Above, we probably want to call models with \(\theta\) near \(\mu\) good ones. But in other cases, it’s not so obvious what makes a good model.</p>
<p>One of the challenges in modeling in general is articulating what we want. This is done through a loss function \(\ell\), where we want models with small losses. In other words, we’d like to find a model \(M\) and related parameters \(\theta\) where
\[
\E_{z\sim \mathcal{D}}\ha{\ell(z, M(\theta))}
\]
is as small as possible (here, for our iid process). Note that this doesn’t have to be the same as the loss function used during optimization to find \(\theta\), but that’s another discussion (there are several reasons for the two to differ).</p>
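As a concrete instance for the normal example above, we can Monte Carlo the expected loss for a couple of parameter choices, using squared error as an illustrative stand-in for \(\ell\) (the specific loss and values are assumptions for the sketch):

```python
import numpy as np

# Monte Carlo estimates of E_z[(z - theta)^2] for the model M(theta) = N(theta, 1).
rng = np.random.default_rng(0)
mu, sigma = 3.0, 1.0
z = rng.normal(mu, sigma, size=1_000_000)

losses = {theta: np.mean((z - theta) ** 2) for theta in (0.0, 3.0)}
assert losses[3.0] < losses[0.0]  # parameters near mu incur smaller expected loss
```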
<h3 id="another-example">Another Example</h3>
<p>Now let’s jump into another modeling task, supervised learning. Here:</p>
<ul>
<li>Our iid random process \(\mathcal{D}\) will be generating pairs \(\pa{\text{some image}, \text{“cat” or “dog”}}\).</li>
<li>The structure we want to capture is that all images of dogs happen to be paired with the label \(\text{“dog”}\) and analogously so for cats.</li>
<li>We’ll gloss over what our model is for now.</li>
</ul>
<p>A loss that captures what we want for our desired structure would be the <em>zero-one loss</em>, which is \(1\) when we’re wrong, \(0\) when we’re right. Let’s fix some model and parameters, so that \(M(\theta)\) is itself a <em>function</em> taking an image and labeling it as a cat or dog, and then let’s see how it does on our loss function.</p>
<p><img src="/assets/2019-03-09-dl-intro/losses.png" alt="sources of error" class="center-image" /></p>
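In code, the zero-one loss for a made-up batch of predictions is just a comparison (the labels and model outputs here are hypothetical):

```python
import numpy as np

# Zero-one loss of a hypothetical cat/dog classifier on four labeled examples.
labels      = np.array(["cat", "dog", "dog", "cat"])
predictions = np.array(["cat", "dog", "cat", "cat"])  # M(theta)(image) outputs

zero_one = (predictions != labels).astype(float)  # 1 when wrong, 0 when right
assert zero_one.mean() == 0.25  # wrong on one of four examples
```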
<h2 id="ok-so-why-deep-learning">OK, so why Deep Learning?</h2>
<p>This post was intentionally structured in a way that takes the attention away from DL. DL is a means to achieving the above PM goals; it’s a means to an end, and being able to reason about higher-level modeling concerns is crucial to understanding the tool.</p>
<p>So, DL is an approach to building models \(M\) and it studies how to find good parameters \(\theta\) for those models.</p>
<h3 id="deep-learning-models">Deep Learning Models</h3>
<p>A DL model is anything that vaguely resembles the prototype below: many parameterized functions composed together to create one function.</p>
<p>A function is usually good enough to capture most of the structure we’re interested in for random processes, given sufficiently sophisticated inputs and outputs. The inputs and outputs to this function can be (not exhaustive):</p>
<ul>
<li>fixed-width multidimensional arrays (casually known as tensors, sort of)</li>
<li>embeddings (numerical translations) of categories (like all the words in the English dictionary)</li>
<li>variable width tensors</li>
</ul>
<p>The parameters this function takes (which differ from its inputs and affect what the function looks like) are fixed-width tensors. I haven’t seen variable-width parameters in DL models, except in some Bayesian interpretations (<a href="https://www.cs.toronto.edu/~hinton/absps/colt93.pdf">Hinton 1993</a>).</p>
<h3 id="the-multi-layer-perceptron">The Multi-Layer Perceptron</h3>
<p>Our prototypical example of a neural network is the Multi-Layer Perceptron, or MLP, which takes a numerical vector input to a numerical vector output. For a parameter vector \(\theta=\mat{\theta_1& \theta_2&\cdots&\theta_L}\), which contains parameters for our \(L\) layers, an MLP looks like:
\[
M(\theta)= x\mapsto f_{\theta_L}^{(L)}\circ f_{\theta_{L-1}}^{(L-1)}\circ\cdots\circ f_{\theta_1}^{(1)}(x)\,,
\]
and we define each layer as
\[
f_{\theta_i}^{(i)}(x)=\max(0, W_ix+b_i)\,.
\]
The parameters \(W_i, b_i\) are set by the contents of \(\theta_i\).</p>
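A minimal numpy sketch of this forward pass, with randomly chosen layer sizes and parameters (every number here is an assumption for illustration):

```python
import numpy as np

def mlp(x, params):
    """Forward pass of the MLP above: params is a list of (W_i, b_i) pairs,
    one per layer, each applied as x -> max(0, W_i x + b_i)."""
    for W, b in params:
        x = np.maximum(0.0, W @ x + b)
    return x

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]  # input dim, two hidden widths, output dim
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes, sizes[1:])]

y = mlp(rng.standard_normal(4), params)
assert y.shape == (2,) and (y >= 0).all()  # final ReLU keeps outputs nonnegative
```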
<p>This is the functional form of linear transforms followed by nonlinearities. It describes what’s going on in this image:</p>
<p><img src="/assets/2019-03-09-dl-intro/mlpi.png" alt="sources of error" class="center-image" /></p>
<h3 id="why-dl">Why DL?</h3>
<p>While it might be believable that functions in general make for great models that could capture structure in a lot of phenomena, why have these particular parameterizations of functions taken off recently?</p>
<p>This is basically the only part of this post that has to do with DL, and most of it’s out of scope.</p>
<p>In my opinion, it boils down to three things.</p>
<p>Deep learning is simultaneously:</p>
<ul>
<li>Flexible in terms of how many functions it can represent for a fixed parameter size.</li>
<li>Amenable to quickly finding so-called low-loss estimates of \(\theta\).</li>
<li>Equipped with regularization strategies that work.</li>
</ul>
<h4 id="flexibility">Flexibility</h4>
<p>The MLP format above might seem strange, but this linearity-followed-by-non-linearity happens to be particularly expressive, in terms of the number of different functions we can represent with a small set of parameters.</p>
<p>The fact that a sufficiently wide neural network can well-approximate smooth functions is well known (<a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">Universal Approximation Theorem</a>), but what’s of particular interest is how linear increases in depth to a network exponentially increase its expressiveness (<a href="https://arxiv.org/abs/1402.1869">Montúfar, et al 2014</a>).</p>
<p><img src="/assets/2019-03-09-dl-intro/montufar2014.png" alt="expressiveness" class="center-image" /></p>
<p>An image from the cited work above demonstrates how composition with non-linearities increases expressiveness. Here, with an absolute value nonlinearity, we can reflect the input space on itself through composition. This means we double the number of linear regions in our neural net by adding a layer.</p>
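We can check this doubling numerically with a toy one-dimensional “network” whose layers are tent maps built from the absolute-value nonlinearity (a sketch in the spirit of the figure, not code from the cited paper):

```python
import numpy as np

def tent(x):
    # one "layer" built from an absolute-value nonlinearity: a tent map on [0, 1]
    return 1 - np.abs(2 * x - 1)

xs = np.linspace(0, 1, 2**16 + 1)  # dyadic grid, so every kink lands on a grid point
ys = xs
for depth in range(1, 5):
    ys = tent(ys)  # compose one more layer
    slopes = np.sign(np.diff(ys))
    pieces = 1 + np.count_nonzero(np.diff(slopes))  # count slope sign changes
    assert pieces == 2**depth  # linear regions double with each layer
```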
<h4 id="efficiency">Efficiency</h4>
<p>One of the papers that kicked off the DL craze was AlexNet (<a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks">Krizhevsky 2012</a>), and a key reason for its impact was that we could efficiently compute the value of a neural network \(M(\theta)\) on a particular image \(x\) using specialized hardware.</p>
<p>Not only does the simple composition of simple functions enable fast <em>forward</em> computation of the model value \(M(\theta)(x)\), but because the operations can be expressed as a directed acyclic graph of almost-everywhere differentiable functions, one can quickly compute <em>reverse</em> automatic derivatives \(\partial_\theta M(\theta)(x)\) in just about the same amount of time.</p>
<p>This is a very happy coincidence. We can compute the functional value of a neural net and its derivative in time linear in the parameter size, and we have a lot of parameters. Here, efficiency matters a lot for the inner loop of the optimization (which uses derivatives with SGD) to find “good” parameters \(\theta\). This efficiency, in turn, enabled a lot of successful research.</p>
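As a toy illustration of such a reverse pass (a hand-rolled sketch, not any particular framework’s API), here is the gradient of a one-layer ReLU network checked against finite differences:

```python
import numpy as np

# Hand-rolled reverse-mode gradient for L(W) = sum(relu(W @ x)), checked
# against finite differences; both passes cost time linear in the parameters.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

def loss(W):
    return np.maximum(0.0, W @ x).sum()

# Reverse pass: dL/dh_i = 1 where the ReLU is active, so dL/dW_ij = 1{h_i > 0} x_j.
h = W @ x
grad = (h > 0).astype(float)[:, None] * x[None, :]

eps = 1e-6
for i in range(3):
    for j in range(4):
        Wp = W.copy()
        Wp[i, j] += eps
        fd = (loss(Wp) - loss(W)) / eps
        assert abs(fd - grad[i, j]) < 1e-3  # numerical and analytic gradients agree
```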
<h4 id="generalization">Generalization</h4>
<p>Finally, neural networks generalize well. This means that given a training set of examples, they are somehow able to have low loss on unseen examples coming from the same random process, just by training on a (possibly altered, or regularized) loss from given examples.</p>
<p>This is particularly counterintuitive for nets due to their expressivity, which is typically at odds with generalization with traditional ML analyses.</p>
<p><a href="https://arxiv.org/abs/1611.03530">Many</a> <a href="https://arxiv.org/abs/1710.05468">theories</a> <a href="https://arxiv.org/abs/1705.05502">for</a> <a href="https://arxiv.org/abs/1503.02406">why</a> <a href="https://arxiv.org/abs/1711.01530">this</a> <a href="https://arxiv.org/abs/1710.09553">occurs</a> have been proposed, but none of them are completely satisfying yet.</p>
<h2 id="next-time">Next time</h2>
<ol>
<li>We’ll review the Transformer, and what it does.</li>
<li>That’ll set us up for some BERT discussion.</li>
</ol>
Sat, 09 Mar 2019 00:00:00 +0000
https://vlad17.github.io/2019/03/09/dl-intro.html
https://vlad17.github.io/2019/03/09/dl-intro.htmldeep-learningNumpy Gems, Part 1<h1 id="numpy-gems-1-approximate-dictionary-encoding-and-fast-python-mapping">Numpy Gems 1: Approximate Dictionary Encoding and Fast Python Mapping</h1>
<p>Welcome to the first installment of <em>Numpy Gems</em>, a deep dive into a library that probably shaped python itself into the language it is today, <a href="http://www.numpy.org/">numpy</a>.</p>
<p>I’ve spoken <a href="https://nbviewer.jupyter.org/github/vlad17/np-learn/blob/master/presentation.ipynb">extensively</a> on numpy (<a href="https://news.ycombinator.com/item?id=15996077">HN discussion</a>), but I think the library is full of delightful little gems that enable perfect instances of API-context fit, the situation where interfaces and algorithmic problem contexts fall in line oh-so-nicely and the resulting code is clean, expressive, and efficient.</p>
<h2 id="what-is-dictionary-encoding">What is dictionary encoding?</h2>
<p>A dictionary encoding is an efficient way of representing data with lots of repeated values. For instance, take the <a href="https://grouplens.org/datasets/movielens/">MovieLens dataset</a>, which contains a list of ratings for a variety of movies.</p>
<p><img src="/assets/2019-01-19-numpy-gems-1/joined.png" alt="movielens movies" class="center-image" /></p>
<p>But the dataset only has around 27K distinct movies for over 20M ratings. If the average movie is rated around 700 times, then it doesn’t make much sense to represent the list of movies for each rating as an array of strings. There’s a lot of needless copies. If we’re trying to build a recommendation engine, then a key part of training is going to involve iterating over these ratings. With so much extra data being transferred between RAM and cache, we’re just asking for our bandwidth to be saturated. Not to mention the gross overuse of RAM in the first place.</p>
<p>That’s why this dataset actually comes with <code class="highlighter-rouge">movieId</code>s, and then each rating refers to a movie through its identifier. Then we store a “dictionary” mapping movie identifiers to movie names and their genre metadata. This solves our problems: no more duplication and much less memory use, at the cost of a single level of indirection.</p>
<p>That’s basically it. It’s a very simple encoding, which makes it easy to integrate efficiently in many algorithms. So much so, that many, many libraries natively support dictionary encoding your data–see factors in <a href="https://www.stat.berkeley.edu/~s133/factors.html">R</a> and <a href="https://pandas.pydata.org/pandas-docs/stable/categorical.html">pandas</a>.</p>
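In numpy alone, a dictionary encoding is one call to <code class="highlighter-rouge">np.unique</code> (the titles below are made up for illustration):

```python
import numpy as np

# A minimal dictionary encoding: distinct values plus integer codes into them.
movies = np.array(["Heat", "Up", "Heat", "Heat", "Up", "Alien"])
dictionary, codes = np.unique(movies, return_inverse=True)

# dictionary holds each distinct title once (sorted); codes are small integers.
assert list(dictionary) == ["Alien", "Heat", "Up"]
assert (dictionary[codes] == movies).all()  # decoding recovers the original data
```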
<h2 id="why-approximate">Why approximate?</h2>
<p>Let’s run with our example. Suppose we have a list of our movie titles, and we’re doing some NLP on them for better recommendations. Usually, that means each of these movies corresponds to some kind of encoding.</p>
<p><img src="/assets/2019-01-19-numpy-gems-1/titles.png" alt="titles" class="center-image" /></p>
<p>Let’s use the built-in pandas categorical dtype, which is a dictionary encoding.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>len(titles) # ---> 20000263
cat_titles = titles.astype(
pd.api.types.CategoricalDtype(
pd.unique(titles)))
len(cat_titles.cat.categories) # ---> 9260
len(cat_titles.cat.codes) # ---> 20000263
</code></pre></div></div>
<p>This stores our data into a densely packed array of integers, the codes, which index into the categories array, which is now a much smaller array of 9K deduplicated strings. But still, if our movie titles correspond to giant floating-point encodings, we’ll still end up shuffling a bunch of memory around. Maybe 9K doesn’t sound so bad to you, but what if we had a larger dataset? Bear with this smaller one for demonstration purposes.</p>
<p>A key observation is that, like most datasets, we’ll observe a power-law like distribution of popularity:</p>
<p><img src="/assets/2019-01-19-numpy-gems-1/movie-popularity.png" alt="movie popularity" class="center-image" /></p>
<p>What this means is that we have a long tail of obscure movies that we just don’t care about. In fact, if we’re OK dropping 5% coverage, which won’t affect our performance too much, we can save a bunch of space.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cdf = counts_desc.cumsum() / counts_desc.sum()
np.searchsorted(cdf, [.95, .99, .999, 1])
# ---> array([3204, 5575, 7918, 9259])
</code></pre></div></div>
<p>Indeed, it looks like dropping the 5% least-popular movies corresponds to needing to support only 1/3 as many movies overall! This can be a huge win, especially if your model considers higher-order interactions (if you like movie X and movie Y, then you might like movie Z). In such models that 1/3 becomes a 1/27th!</p>
<h2 id="how-to-approximate">How to approximate?</h2>
<p>However, if we’re being asked to serve model predictions online or want to train a “catch-all” encoding, then we still need to have a general catch-all “movie title” corresponding to the unknown situation. We have a bunch of dictionary indices in <code class="highlighter-rouge">[0, d)</code>, like <code class="highlighter-rouge">[1, 3, 5, 2, 6, 1, 0, 11]</code>. In total we have <code class="highlighter-rouge">n</code> of these. We also have a list of <code class="highlighter-rouge">e</code> items we actually care about in our approximate dictionary, say <code class="highlighter-rouge">[5, 8, 10, 11]</code>, but this might not be a contiguous range.</p>
<p>What we want is an approximate dictionary encoding with a catch-all, namely we want to get a list of <code class="highlighter-rouge">n</code> numbers between <code class="highlighter-rouge">0</code> and <code class="highlighter-rouge">e</code>, with <code class="highlighter-rouge">e</code> being the catch all.</p>
<p>In the above example, <code class="highlighter-rouge">n = 8, d = 12, e = 4</code>, and the correct result array is <code class="highlighter-rouge">[4, 4, 0, 4, 4, 4, 4, 3]</code>. For something like embeddings, it’s clear how this is useful in greatly reducing the number of things we need to represent.</p>
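A plain-Python baseline for this mapping, reproducing the worked example above, is a dictionary lookup with a default (this is the slow reference the numpy versions below are meant to beat):

```python
# Map raw dictionary indices in [0, d) to [0, e], with e as the catch-all.
selected = [5, 8, 10, 11]
e = len(selected)
lookup = {v: i for i, v in enumerate(selected)}  # selected item -> compact code

dindices = [1, 3, 5, 2, 6, 1, 0, 11]
result = [lookup.get(i, e) for i in dindices]  # unknown items fall back to e
assert result == [4, 4, 0, 4, 4, 4, 4, 3]
```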
<h2 id="the-gem">The Gem</h2>
<p>The above is actually an instance of a translation problem, in the sense that we have some translation mapping from <code class="highlighter-rouge">[0, d)</code> into <code class="highlighter-rouge">[0, e]</code> and we’d like to apply it to every item in the array. Like many things in python, this is most efficient when pushed to C. Indeed, for strings, there’s <a href="https://docs.python.org/3/library/stdtypes.html#str.translate">translate</a> that does this.</p>
<p>We’ll consider two dummy distributions, which will either be extremely sparse (<code class="highlighter-rouge">d > n</code>) or more typical (<code class="highlighter-rouge">d <= n</code>). Both kinds show up in real life.
We extract the most popular <code class="highlighter-rouge">e</code> of these items (or maybe we have some other metric, not necessarily popularity, that extracts these items of interest).
There are more efficient ways of doing the below, but we’re just setting up.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if d < n:
dindices = np.random.geometric(p=0.01, size=(n - d)) - 1
dindices = np.concatenate([dindices, np.arange(d)])
dcounts = np.bincount(dindices)
selected = dcounts.argsort()[::-1][:e]
else:
dindices = np.random.choice(d, n // 2)
frequent = np.random.choice(n, n - n // 2)
dindices = np.concatenate([dindices, frequent])
c = Counter(dindices)
selected = np.asarray(sorted(c, key=c.get, reverse=True)[:e])
</code></pre></div></div>
<p>Let’s look at the obvious implementation. We’d like to map contiguous integers, so let’s implement a mapping as an array, where the array value at an index is the mapping’s value for that index as input. This is the implementation that pandas uses under the hood when you ask it to change its categorical values.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mapping = np.full(d, e)
mapping[selected] = np.arange(e)
result = np.take(mapping, dindices)
</code></pre></div></div>
<p>As can be seen from the code, we’re going to get burned when <code class="highlighter-rouge">d</code> is large, and we can’t take advantage of the fact that <code class="highlighter-rouge">e</code> is small. These benchmarks, performed with <code class="highlighter-rouge">%%memit</code> and <code class="highlighter-rouge">%%timeit</code> jupyter magics on fresh kernels each run, back this sentiment up.</p>
<table class="table table-bordered">
<thead>
<tr>
<th><code class="highlighter-rouge">e</code></th>
<th><code class="highlighter-rouge">d</code></th>
<th><code class="highlighter-rouge">n</code></th>
<th>memory (MiB)</th>
<th>time (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">10^3</code></td>
<td><code class="highlighter-rouge">10^4</code></td>
<td><code class="highlighter-rouge">10^8</code></td>
<td>763</td>
<td>345</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10^3</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td>11</td>
<td>9.62</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10^3</code> </td>
<td><code class="highlighter-rouge">10^8</code> </td>
<td><code class="highlighter-rouge">10^4</code> </td>
<td>763</td>
<td>210</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^4</code></td>
<td><code class="highlighter-rouge">10^8</code></td>
<td>763</td>
<td>330</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td>11</td>
<td>9.66</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^8</code></td>
<td><code class="highlighter-rouge">10^4</code></td>
<td>763</td>
<td>210</td>
</tr>
</tbody>
</table>
<p>This brings us to our first puzzle and numpy gem. How can we re-write this to take advantage of small <code class="highlighter-rouge">e</code>? The trick is to use a sparse representation of our mapping, namely just <code class="highlighter-rouge">selected</code>. We can look in this mapping very efficiently, thanks to <code class="highlighter-rouge">np.searchsorted</code>. Then with some extra tabulation (using <code class="highlighter-rouge">-1</code> as a sentinel value), all we have to ask is where in <code class="highlighter-rouge">selected</code> a given index from <code class="highlighter-rouge">dindices</code> was found.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>selected = np.sort(selected)  # np.searchsorted requires sorted input
searched = np.searchsorted(selected, dindices)
selected2 = np.append(selected, [-1])
searched[selected2[searched] != dindices] = -1
searched[searched == -1] = e
result = searched
</code></pre></div></div>
<p>A couple interesting things happen up there: we switch our memory usage from linear in <code class="highlighter-rouge">d</code> to linear in <code class="highlighter-rouge">n</code>, and completely adapt our algorithm to being insensitive to a high number of unpopular values. Certainly, this performs horribly where <code class="highlighter-rouge">d</code> is small enough that the mapping above is the clear way to go, but the benchmarks expose an interesting tradeoff frontier:</p>
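As a quick sanity check (on assumed random inputs), the <code class="highlighter-rouge">searchsorted</code> trick agrees with the dense-mapping implementation; this sketch condenses the two sentinel steps above into one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, e, n = 1000, 10, 10_000
dindices = rng.integers(0, d, size=n)
selected = np.sort(rng.choice(d, size=e, replace=False))  # sorted for searchsorted

# Dense mapping: an array of size d, with e as the catch-all value.
dense = np.full(d, e)
dense[selected] = np.arange(e)
expected = dense[dindices]

# Sparse mapping: binary search into selected, with a -1 sentinel for misses.
searched = np.searchsorted(selected, dindices)
selected2 = np.append(selected, [-1])  # index e (a miss) hits the sentinel
searched[selected2[searched] != dindices] = e

assert (searched == expected).all()
```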
<table class="table table-bordered">
<thead>
<tr>
<th><code class="highlighter-rouge"> e </code></th>
<th><code class="highlighter-rouge"> d </code></th>
<th><code class="highlighter-rouge"> n </code></th>
<th>memory (MiB)</th>
<th>time (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">10^3</code> </td>
<td><code class="highlighter-rouge">10^4</code> </td>
<td><code class="highlighter-rouge">10^8</code> </td>
<td>1546</td>
<td>5070</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10^3</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td>13</td>
<td>31</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10^3</code></td>
<td><code class="highlighter-rouge">10^8</code></td>
<td><code class="highlighter-rouge">10^4</code></td>
<td>0.24</td>
<td>0.295</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^4 </code></td>
<td><code class="highlighter-rouge">10^8</code></td>
<td>1573</td>
<td>1940</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^6 </code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td>13</td>
<td>17</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^8 </code></td>
<td><code class="highlighter-rouge">10^4</code></td>
<td>0.20</td>
<td>0.117</td>
</tr>
</tbody>
</table>
<p><a href="/assets/2019-01-19-numpy-gems-1/numpy-gems-1.ipynb">Link to benchmarks.</a></p>
Sat, 19 Jan 2019 00:00:00 +0000
https://vlad17.github.io/2019/01/19/numpy-gems-1.html
https://vlad17.github.io/2019/01/19/numpy-gems-1.htmlhardware-accelerationtoolsnumpy-gemsSubgaussian Concentration<h1 id="subgaussian-concentration">Subgaussian Concentration</h1>
<p>This is a quick write-up of a brief conversation I had with Nilesh Tripuraneni and Aditya Guntuboyina a while ago that I thought others might find interesting.</p>
<p>This post focuses on the interplay between two types of concentration inequalities. Concentration inequalities usually describe how some random quantity \(X\) stays near a constant \(c\) (henceforth, \(c\) will be our stand-in for some constant which possibly changes equation-to-equation). Basically, we can quantify how infrequently \(X\) diverges from \(c\) by more than \(t\) with some rate \(r(t)\) which vanishes as \(t\rightarrow\infty\).</p>
<p>\[
\P\pa{\abs{X-c}>t}\le r(t)\,.
\]</p>
<p>In fact, going forward, if \(r(t)=c’\exp(-c’’ O(g(t)))\), we’ll say \(X\) <em>concentrates about</em> \(c\) <em>in rate</em> \(g(t)\).</p>
<p>Subgaussian (sg) random variables (rvs) with parameter \(\sigma^2\) exhibit a strong form of this. They have zero mean and concentrate in rate \(t^2/\sigma^2\).
Equivalently, we may write \(X\in\sg(\sigma^2)\). Subgaussian rvs decay quickly because of a characteristic bound on their moment generating function. In particular, \(X\) is subgaussian if for all \(\lambda\), the following holds:
\[
\E\exp\pa{\lambda X}\le \exp\pa{\frac{1}{2}\lambda^2\sigma^2}\,.
\]</p>
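For a quick numerical sanity check: for \(X\sim N(0,1)\) the bound above holds with equality, which a Monte Carlo estimate of the moment generating function reflects:

```python
import numpy as np

# For X ~ N(0, 1), E exp(lambda X) = exp(lambda^2 / 2) exactly, so a Monte Carlo
# estimate of the left side should track the right side closely.
rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
for lam in (0.5, 1.0, 2.0):
    emp = np.exp(lam * x).mean()
    bound = np.exp(lam**2 / 2)
    assert abs(emp / bound - 1) < 0.1  # matches up to Monte Carlo error
```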
<p>On the other hand, suppose we have \(n\) independent (indep) bounded (bdd) rvs \(X=\ca{X_i}_{i=1}^n\) and a function \(f\) that’s Lipschitz and convex (cvx) in each one. Note being cvx in each variable isn’t so bad, for instance the low-rank matrix completion loss \(\norm{A-UV^\top}^2\) does this in \(U, V\). Then by BLM Thm. 6.10 (p. 180), \(f(X)\) concentrates about its mean quadratically.</p>
<p>This is pretty damn spiffy. You get a <em>function</em> that’s nothing but a little <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">monotonic in averages</a>, and depends on a bunch of different knobs. Said knobs spin independently, and somehow this function behaves <a href="https://en.wikipedia.org/wiki/Talagrand%27s_concentration_inequality">basically constant</a>. This one isn’t a deep property of some distribution, like sg rvs, but rather a deep property of smooth functions on product measures.</p>
<h2 id="a-little-motivation">A Little Motivation</h2>
<p>Concentration lies at the heart of machine learning. For instance, take the well-known probably approximately correct (PAC) learning framework–it’s old, yes, and has been superseded by more generic techniques, but it still applies to simple classifiers we know and love. At its core, it seems to be making something analogous to a counting argument:</p>
<ol>
<li>The set of all possible classifiers is small by assumption.</li>
<li>Since there aren’t many classifiers overall, there can’t be many crappy classifiers.</li>
<li>Crappy classifiers have a tendency of fucking up on random samples of data (like our training set).</li>
<li>Therefore any solution we find that nails our training set is likely not crap (i.e., probably approximately correct).</li>
</ol>
<p>However, this argument can be viewed from a different lens, one which exposes machinery that underlies much more expressive theories about learning like M-estimation or empirical process analysis.</p>
<ol>
<li>The <em>generalization error</em> of our well-trained classifier is no more than twice the worst <em>generalization gap</em> (difference between training and test errors) in our hypothesis class (symmetrization).</li>
<li>For large sample sizes, this gap vanishes because training errors concentrate around the test errors (concentration).</li>
</ol>
<p>For this reason, being able to identify when a random variable (such as a classifier’s generalization gap, before we see its training dataset) concentrates is useful.</p>
<h2 id="ok-get-to-the-point">OK, Get to the Point</h2>
<p>Now that we’ve established why concentration is interesting, I’d like to present the conversation points. Namely, we have a general phenomenon, the <a href="https://en.wikipedia.org/wiki/Concentration_of_measure">concentration of measure</a>.</p>
<p>Recall the concentration of measure result from above: a convex, Lipschitz function \(f\) of bounded variables is basically constant. However, these are onerous conditions.</p>
<p>To some degree, these conditions can be weakened. For starters, convexity need only be quasi-convexity. The Wikipedia article is a bit nebulous, but the previously linked <a href="https://en.wikipedia.org/wiki/Talagrand%27s_concentration_inequality">Talagrand’s Inequality</a> can be used to weaken this requirement (BLM Thm. 7.12, p. 230).</p>
<p>Still:</p>
<ol>
<li>One can imagine that for a function that’s not necessarily globally Lipschitz, but instead just coordinate-wise Lipschitz, we can still give some guarantees.</li>
<li>Why do we need bounded random variables? Perhaps variables that are <em>effectively</em> bounded most of the time are good enough.</li>
</ol>
<p>Our goal here will be to see if there are smooth ways of relaxing the conditions above and framing the concentration rates \(r(t)\) in terms of these relaxations.</p>
<h3 id="coordinate-sensitivity-and-bounded-differences">Coordinate Sensitivity and Bounded Differences</h3>
<p>The concentration of measure bounds above rely on a global Lipschitz property: no matter which way you go, the function \(f\) must lie in a slope-bounded double cone, which can be centered at any of its points; this can be summarized by the property that our \(f:\R^n\rightarrow\R\) satisfies \(\abs{f(\vx)-f(\vy)}\le L\norm{\vx-\vy}\) for all \(\vx,\vy\).</p>
<p><img src="/assets/2018-12-22-subgaussian-concentration/lipschitz_continuity.png" alt="lipschitz continuity image" class="center-image" /></p>
<p>Moreover, why does it matter that the preimage metric space of our \(f\) needs to, effectively, be bounded? All that really matters is how the function \(f\) responds to changes in inputs, right?</p>
<p>Here’s where <a href="https://en.wikipedia.org/wiki/Doob_martingale#McDiarmid's_inequality">McDiarmid’s Inequality</a> comes in, which says that so long as we satisfy the bounded difference property, where
\[
\sup_{\vx, \vx^{(i)}}\abs{f(\vx)-f(\vx^{(i)})}\le c_i\,,
\]
holding wherever \(\vx, \vx^{(i)}\) only differ in position \(i\), then we concentrate with rate \(t^2/\sum_ic_i^2\). The proof basically works by computing the distance of \(f(X)\), our random observation, from \(\E f(X)\), the mean, through a series of successive approximations done by changing each coordinate, one at a time. Adding up these approximations happens to give us a martingale, and it turns out these bounded differences have a concentration (<a href="https://en.wikipedia.org/wiki/Hoeffding%27s_inequality">Hoeffding’s</a>) of their own.</p>
<p>Notice how the rate worsens individually according to the constants \(c_i\) in each dimension.</p>
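A Monte Carlo sketch of this bound for the simplest bounded-differences function, the sample mean of uniform variables, where each \(c_i=1/n\) (the specific \(n\), \(t\), and trial count are arbitrary choices):

```python
import numpy as np

# McDiarmid for f = mean of n iid Uniform[0, 1] draws: each c_i = 1/n, so
# P(|f - E f| > t) <= 2 exp(-2 t^2 / sum_i c_i^2) = 2 exp(-2 t^2 n).
rng = np.random.default_rng(2)
n, trials, t = 100, 20_000, 0.1
f = rng.random((trials, n)).mean(axis=1)

emp = np.mean(np.abs(f - 0.5) > t)       # empirical tail frequency
bound = 2 * np.exp(-2 * t**2 * n)        # McDiarmid's bound, about 0.27 here
assert emp <= bound
```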
<h3 id="whats-in-the-middle">What’s in the Middle?</h3>
<p>We’ve seen how we can achieve concentration (that’s coordinate-wise sensitive in its bounds) by restricting ourselves to:</p>
<ul>
<li>Well-behaved functions and bounded random inputs (Talagrand’s).</li>
<li>Functions with bounded responses to coordinate change (McDiarmid’s).</li>
</ul>
<p>Can we get rid of boundedness altogether now, relaxing it to the probabilistic “boundedness” that is subgaussian concentration? Well, yes and no.</p>
<h3 id="hows-this-possible">How’s this possible?</h3>
<p><a href="https://arxiv.org/abs/1309.1007">Kontorovich 2014</a> claims concentration for generic Lipschitz functions for subgaussian inputs. At first, this may sound too good to be true. Indeed, a famous counterexample (BLM Problem 6.4, p. 211, which itself refers to LT p. 25) finds a particular \(f\) where the following holds for sufficiently large \(n\).
\[
\P\ca{f(X)> \E f(X)+cn^{1/4}}\ge 1/4\,.
\]
Technically, the result is shown for the median, not mean value of \(f\), but by integrating the median concentration inequality for Lipschitz functions of subgaussian variables (LT p. 21), we can state the above, since the mean and median are within a constant of each other (bdd rvs with zero mean are sg).
From the proof (LT, p. 25), \(f(X)\) has rate no better than \(t^2n^{-1/2}\).</p>
<p>Therein lies the resolution for the apparent contradiction: we’re <em>pathologically</em> dependent on the dimension factor.
On the other hand, the bound proven in the aforementioned Kontorovich 2014 paper is that for sg \(X\), we can achieve a concentration rate \(t^2/\sum_i\Delta_{\text{SG}, i}^2\), where \(\Delta_{\text{SG}, i}\) is a subgaussian diameter, which for our purposes is just a constant times \(\sigma_i^2\), the subgaussian parameter for the \(i\)-th position in the \(n\)-dimensional vector \(X\). Setting \(\sigma^2=\max_i\sigma_i^2\), note that the hidden dimensionality emerges, since the Kontorovich rate is then \(t^2/(n\sigma^2)\).</p>
<p>The Kontorovich paper is a nice generalization of McDiarmid’s inequality which replaces the boundedness condition with a subgaussian one. We still incur the dimensionality penalty, but we don’t care about this if we’re making a one-dimensional or fixed-\(n\) statement. In fact, the rest of the Kontorovich paper investigates scenarios where this dimensionality term is cancelled out by a shrinking \(\sigma^2\sim n^{-1}\) (in the paper, this is observed for some stable learning algorithms).</p>
<p>In fact, there’s even quite a bit of room between the Kontorovich bound \(t^2/n\) (fixing the sg diameter now) and the counterexample lower bound \(t^2/\sqrt{n}\). This next statement might be made out of my own ignorance, but it seems like there’s still a lot of open space to map out in terms of what rates are possible to achieve in the non-convex case, if we care about the dimension \(n\) (which we do).</p>
<h1 id="references">References</h1>
<ol>
<li>BLM - Boucheron, Lugosi, Massart (2013), Concentration Inequalities</li>
<li>LT - Ledoux and Talagrand (1991), Probability in Banach Spaces</li>
</ol>
Sat, 22 Dec 2018 00:00:00 +0000
https://vlad17.github.io/2018/12/22/subgaussian-concentration.html
https://vlad17.github.io/2018/12/22/subgaussian-concentration.htmlmachine-learningBeating TensorFlow Training in-VRAM<h1 id="beating-tensorflow-training-in-vram">Beating TensorFlow Training in-VRAM</h1>
<p>In this post, I’d like to introduce a technique that I’ve found helps accelerate mini-batch SGD training in my use case. I suppose this post could also be read as a public grievance directed towards the TensorFlow Dataset API optimizing for the large vision deep learning use-case, but maybe I’m just not hitting the right incantation to get <code class="highlighter-rouge">tf.Dataset</code> working (in which case, <a href="https://github.com/vlad17/vlad17.github.io/issues/new">drop me a line</a>). The solution is to TensorFlow <em>harder</em> anyway, so this shouldn’t really be read as a complaint.</p>
<p>Nonetheless, if you are working with a new-ish GPU that has enough memory to hold a decent portion of your data alongside your neural network, you may find the final training approach I present here useful. The experiments I’ve run fall exactly in line with this “in-VRAM” use case (in particular, I’m training deep reinforcement learning value and policy networks on semi-toy environments, whose training profile is many iterations of training on a small replay buffer of examples). For some more context, you may want to check out an article on the <a href="https://reinforce.io/blog/end-to-end-computation-graphs-for-reinforcement-learning/">TensorForce blog</a>, which suggests that RL people should be building more of their TF graphs like this.</p>
<p>Briefly, if you have a dataset that fits into a GPU’s memory, you’re giving away a lot of speed with the usual TensorFlow pipelining or data-feeding approach, where the CPU delivers mini-batches whose forward/backward passes are computed on the GPU. This gets worse as you move to pricier GPUs, where the ratio of CPU-to-GPU bandwidth to GPU compute speed drops. Pretty easy change for a 2x.</p>
<h2 id="punchline">Punchline</h2>
<p>Let’s get to it. With numbers similar to my use case, 5 epochs of training take about <strong>16 seconds</strong> with the standard <code class="highlighter-rouge">feed_dict</code> approach, <strong>12-20 seconds</strong> with the TensorFlow Dataset API, and <strong>8 seconds</strong> with a custom TensorFlow control-flow construct.</p>
<p>This was tested on an Nvidia Tesla P100 with a compiled TensorFlow 1.4.1 (CUDA 9, cuDNN 7), Python 3.5. Here is the <a href="https://gist.github.com/vlad17/5d67eef9fb06c6a679aeac6d07b4dc9c">test script</a>. I didn’t test it too many times (<a href="https://gist.github.com/vlad17/f43dba5783adfc21b1abab520dd2a8f1">exec trace</a>). Feel free to change the data sizes to see if the proposed approach would still help in your setting.</p>
<p>Let’s fix the toy benchmark supervised task we’re looking at:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="c1"># pretend we don't have X, Y available until we're about
# to train the network, so we have to use placeholders. This is the case
# in, e.g., RL.
</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="c1"># suffix tensors with their shape
# n = number of data points, x = x dim, y = y dim
</span><span class="n">X_nx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">1000</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">,</span> <span class="mi">64</span><span class="p">))</span>
<span class="n">Y_ny</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">column_stack</span><span class="p">([</span><span class="n">X_nx</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">X_nx</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">X_nx</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">X_nx</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)])</span>
<span class="n">nbatches</span> <span class="o">=</span> <span class="mi">10000</span> <span class="c1"># == 5 epochs at 512 batch size
</span><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">512</span></code></pre></figure>
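<p>As a sanity check on these constants (a quick back-of-the-envelope, not part of the benchmark script): 10,000 batches of 512 rows over the 1,024,000-row dataset above works out to exactly the 5 epochs quoted in the punchline.</p>

```python
# dataset/batch constants from the setup above
n = 1000 * 1024      # number of data points
batch_size = 512
nbatches = 10000

# epochs = total rows consumed / dataset size
epochs = nbatches * batch_size / n
print(epochs)  # → 5.0
```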
<h3 id="vanilla-approach">Vanilla Approach</h3>
<p>This is the (docs-discouraged) approach that everyone really uses for training. Prepare a mini-batch on the CPU, ship it off to the GPU. <em>Note code here and below is excerpted (see the test script link above for the full code). It won’t work if you just copy it.</em></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># b = batch size
</span><span class="n">input_ph_bx</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="n">X_nx</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]])</span>
<span class="n">output_ph_by</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="n">Y_ny</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]])</span>
<span class="c1"># mlp = a depth 5 width 32 MLP net
</span><span class="n">pred_by</span> <span class="o">=</span> <span class="n">mlp</span><span class="p">(</span><span class="n">input_ph_bx</span><span class="p">)</span>
<span class="n">tot_loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">output_ph_by</span><span class="p">,</span> <span class="n">pred_by</span><span class="p">)</span>
<span class="n">update</span> <span class="o">=</span> <span class="n">adam</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">tot_loss</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nbatches</span><span class="p">):</span>
<span class="n">batch_ix_b</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="n">X_nx</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,))</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">update</span><span class="p">,</span> <span class="n">feed_dict</span><span class="o">=</span><span class="p">{</span>
<span class="n">input_ph_bx</span><span class="p">:</span> <span class="n">X_nx</span><span class="p">[</span><span class="n">batch_ix_b</span><span class="p">],</span>
<span class="n">output_ph_by</span><span class="p">:</span> <span class="n">Y_ny</span><span class="p">[</span><span class="n">batch_ix_b</span><span class="p">]})</span></code></pre></figure>
<p>This drops whole-dataset loss from around 4500 to around 4, taking around <strong>16 seconds</strong> for training. You might worry that random-number generation might be taking a while, but excluding that doesn’t drop the time more than <strong>0.5 seconds</strong>.</p>
<h3 id="dataset-api-approach">Dataset API Approach</h3>
<p>With the dataset API, we set up a pipeline where TensorFlow orchestrates some dataflow by synergizing more buzzwords on its worker threads. This should constantly feed the GPU by staging the next mini-batch while the current one is sitting on the GPU. This might be the case when there’s a lot of data, but it doesn’t seem to work very well when the data is small and GPU-CPU latency, not throughput, is the bottleneck.</p>
<p>Another unpleasant thing is that all those orchestrated workers, staging areas, buffers, and shuffle queues need magic constants to work well. I tried my best, but performance seems very sensitive to them in this use case. This could be fixed if Dataset could detect (or be told) that the data fits on the GPU and place it there outright.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># make training dataset, which should swallow the entire dataset once
# up-front and then feed it in mini-batches to the GPU
# presumably since we only need to feed stuff in once it'll be faster
</span><span class="n">ds</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="o">.</span><span class="n">from_tensor_slices</span><span class="p">((</span><span class="n">input_ph_nx</span><span class="p">,</span> <span class="n">output_ph_ny</span><span class="p">))</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">repeat</span><span class="p">()</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">buffer_size</span><span class="o">=</span><span class="n">bufsize</span><span class="p">)</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">batch</span><span class="p">(</span><span class="n">batch_size</span><span class="p">)</span>
<span class="c1"># magic that Zongheng Yang (http://zongheng.me/) suggested I add that was
# necessary to keep this from being *worse* than feed_dict
</span><span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">prefetch</span><span class="p">(</span><span class="n">buffer_size</span><span class="o">=</span><span class="p">(</span><span class="n">batch_size</span> <span class="o">*</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">it</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">make_initializable_iterator</span><span class="p">()</span>
<span class="c1"># reddit user ppwwyyxx further suggests folding training into a single call
</span><span class="k">def</span> <span class="nf">while_fn</span><span class="p">(</span><span class="n">t</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">t</span><span class="p">]):</span>
<span class="n">next_bx</span><span class="p">,</span> <span class="n">next_by</span> <span class="o">=</span> <span class="n">it</span><span class="o">.</span><span class="n">get_next</span><span class="p">()</span>
<span class="n">pred_by</span> <span class="o">=</span> <span class="n">mlp</span><span class="p">(</span><span class="n">next_bx</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">next_by</span><span class="p">,</span> <span class="n">pred_by</span><span class="p">)</span>
<span class="n">update</span> <span class="o">=</span> <span class="n">adam</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">update</span><span class="p">]):</span>
<span class="k">return</span> <span class="n">t</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">training</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">while_loop</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="n">t</span> <span class="o"><</span> <span class="n">nbatches</span><span class="p">,</span>
<span class="n">while_fn</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">back_prop</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="n">fd</span> <span class="o">=</span> <span class="p">{</span><span class="n">input_ph_nx</span><span class="p">:</span> <span class="n">X_nx</span><span class="p">,</span> <span class="n">output_ph_ny</span><span class="p">:</span> <span class="n">Y_ny</span><span class="p">}</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">it</span><span class="o">.</span><span class="n">initializer</span><span class="p">,</span> <span class="n">feed_dict</span><span class="o">=</span><span class="n">fd</span><span class="p">)</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">training</span><span class="p">)</span></code></pre></figure>
<p>For a small <code class="highlighter-rouge">bufsize</code>, like <code class="highlighter-rouge">1000</code>, this trains in around <strong>12 seconds</strong>. But then it’s not actually shuffling the data well, since each data point can be displaced by at most 1000 positions. Still, the loss drops from around 4500 to around 4, as in the <code class="highlighter-rouge">feed_dict</code> case. A large <code class="highlighter-rouge">bufsize</code> like <code class="highlighter-rouge">1000000</code>, which you’d think should effectively move the dataset onto the GPU entirely, performs <em>worse</em> than <code class="highlighter-rouge">feed_dict</code> at around <strong>20 seconds</strong>.</p>
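<p>To see why a small buffer shuffles poorly, here’s a rough pure-Python model of a shuffle buffer (my own sketch of the idea, not tf.data’s actual implementation): an element emitted at output position <code class="highlighter-rouge">i</code> must already have entered the buffer, so its original index can exceed its output position by at most <code class="highlighter-rouge">bufsize</code>.</p>

```python
import random

def buffered_shuffle(items, bufsize, seed=0):
    """Rough model of a shuffle buffer: hold up to `bufsize` elements,
    and emit a uniformly random buffered element as each new one arrives."""
    rng = random.Random(seed)
    buf, out = [], []
    for x in items:
        buf.append(x)
        if len(buf) > bufsize:
            # pick a random buffered element to emit
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            out.append(buf.pop())
    rng.shuffle(buf)  # flush whatever remains at the end
    out.extend(buf)
    return out

order = buffered_shuffle(range(100000), bufsize=1000)
# the element at output position i entered the buffer by the time
# i + bufsize + 1 inputs were read, so it originally sat at index <= i + bufsize
max_lead = max(orig - i for i, orig in enumerate(order))
print(max_lead <= 1000)  # → True
```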
<p>I don’t think I’m unfair in counting <code class="highlighter-rouge">it.initializer</code> time in my benchmark (which isn’t that toy, either, since it’s similar to my RL use case size). All the training methods need to load the data onto the GPU, and the data isn’t available until run time.</p>
<h3 id="using-a-tensorflow-loop">Using a TensorFlow Loop</h3>
<p>This post isn’t a tutorial on <code class="highlighter-rouge">tf.while_loop</code> and friends, but this code does what was promised: feed everything onto the GPU once and run all your epochs without asking the CPU for permission to continue.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># generate random batches up front
# i = iterations
</span><span class="n">n</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">input_ph_nx</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">batches_ib</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">random_uniform</span><span class="p">((</span><span class="n">nbatches</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>
<span class="c1"># use a fold + control deps to make sure we only train on the next batch
# after we're done with the first
</span><span class="k">def</span> <span class="nf">fold_fn</span><span class="p">(</span><span class="n">prev</span><span class="p">,</span> <span class="n">batch_ix_b</span><span class="p">):</span>
<span class="n">X_bx</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">input_ph_nx</span><span class="p">,</span> <span class="n">batch_ix_b</span><span class="p">)</span>
<span class="n">Y_by</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">output_ph_ny</span><span class="p">,</span> <span class="n">batch_ix_b</span><span class="p">)</span>
<span class="c1"># removing control deps here probably gives you Hogwild!
</span> <span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">prev</span><span class="p">]):</span>
<span class="n">pred_by</span> <span class="o">=</span> <span class="n">mlp</span><span class="p">(</span><span class="n">X_bx</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">Y_by</span><span class="p">,</span> <span class="n">pred_by</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">adam</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">loss</span><span class="p">)]):</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">constant</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">training</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">foldl</span><span class="p">(</span><span class="n">fold_fn</span><span class="p">,</span> <span class="n">batches_ib</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">back_prop</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="n">fd</span> <span class="o">=</span> <span class="p">{</span><span class="n">input_ph_nx</span><span class="p">:</span> <span class="n">X_nx</span><span class="p">,</span> <span class="n">output_ph_ny</span><span class="p">:</span> <span class="n">Y_ny</span><span class="p">}</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="n">feed_dict</span><span class="o">=</span><span class="n">fd</span><span class="p">)</span></code></pre></figure>
<p>This one crushes the others at around <strong>8 seconds</strong>, again dropping the loss from around 4500 to around 4.</p>
<h2 id="discussion">Discussion</h2>
<p>It’s pretty clear Dataset isn’t feeding as aggressively as it can, and its many widgets and knobs don’t help (well, they do, but only after making me do more work). But, if TF wants to invalidate this blog post, I suppose it could add yet another option that plops the dataset into the GPU.</p>
Sat, 23 Dec 2017 00:00:00 +0000
https://vlad17.github.io/2017/12/23/beating-tf-api-in-vram.html
hardware-acceleration machine-learning tools