Vlad Feinberg - Vlad's Blog
https://vlad17.github.io/
Tue, 31 Mar 2020 00:13:48 +0000
<h1 id="a-broader-emergence-simpsons-part-3-of-3">A Broader Emergence (Simpson’s part 3 of 3)</h1>
<p>One neat takeaway from the previous post concerned the structure of what we were doing.</p>
<p>What did it take for the infinite DAG we were building to become a valid probability distribution?</p>
<p>We can throw some things out there that were necessary for its construction.</p>
<ol>
<li>The infinite graph needed to be a DAG.</li>
<li>We needed inductive “construction rules” $\alpha,\beta$ where we could derive conditional kernels from a finite subset of infinite parents to a larger subset of the infinite parents.</li>
<li>The construction rules need to be internally consistent so as to satisfy Kolmogorov’s extension theorem.</li>
</ol>
<h2 id="plate-notation">Plate Notation</h2>
<p>We’ll borrow plate notation from graphical models literature, where you can generate variable-size models by taking the union of the graphs for each index in the plate.</p>
<p><img src="/assets/2020-simpsons-series/plate-demo.jpg" alt="plate intro" class="center-image" /></p>
<h2 id="examples">Examples</h2>
<p>The first two rules (DAG, construction rules) seem intuitive. Further, from a probabilist’s perspective, rule (3) is just as self-evident. The rule is exactly the Kolmogorov consistency condition: we’ll admit all construction rules that generate joints which are mutually consistent under marginalization.</p>
<p>But this doesn’t quite touch on some interesting interaction with structure. For instance, we have our familiar <strong>infinite Simpson’s paradox</strong> diagram.</p>
<p><img src="/assets/2020-simpsons-series/infinite-simpsons-paradox.jpg" alt="infinite simpsons plate diagram" class="center-image" /></p>
<p>We have our construction rules, which correspond roughly to the two points where an arrow crosses a plate, at $X\leftarrow Z_j$ and $Z_j\rightarrow Y$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta_i(\vz)&=\CP{Y=1}{X=i,Z=\vz}\\
\alpha_i(\vz)&=\CP{Z_{j+1}=1}{X=i,Z=\vz}\,\,.
\end{align} %]]></script>
<p>Next, the consistency rule $\beta_x(\vz)=(1-\alpha_x(\vz))\beta_x(\vz:0)+\alpha_x(\vz)\beta_x(\vz:1)$ seems to correspond to the (undirected) loop formed by $X,Y,Z_j$.</p>
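<p>To see why a rule like this is “automatic” for any genuine joint distribution, here is a minimal numerical sketch (the joint values are made up): the consistency rule is just the law of total probability over the next $Z$ bit, with $\alpha$ and $\beta$ read off as conditionals.</p>

```python
import numpy as np

# Any joint over binary (X, Y, Z_1) automatically satisfies the rule when
# alpha and beta are read off as conditionals: it is the law of total
# probability over the next Z bit. A made-up positive joint suffices.
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))  # p[x, y, z1]
p /= p.sum()

for x in (0, 1):
    beta_empty = p[x, 1].sum() / p[x].sum()       # P(Y=1 | X=x)
    beta_z = p[x, 1, :] / p[x, :, :].sum(axis=0)  # P(Y=1 | X=x, Z_1=z)
    alpha = p[x, :, 1].sum() / p[x].sum()         # P(Z_1=1 | X=x)
    assert np.isclose(
        beta_empty, (1 - alpha) * beta_z[0] + alpha * beta_z[1])
```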
<p>We go on. Here’s the <strong>double infinite Simpson’s paradox</strong> diagram.</p>
<p><img src="/assets/2020-simpsons-series/double-infinite-simpsons-paradox.jpg" alt="double infinite simpsons plate diagram" class="center-image" /></p>
<p>A few things emerge. If we leave the arrow $A\leftarrow B$ in there, it’s clear we have two independent Simpson’s paradox structures. The undirected loops birth two consistency rules, and we expect four builder rules for each plate/arrow crossing.</p>
<p>If we remove the arrow $A\leftarrow B$, the consistency rule goes away: you just need to specify the builder rules between $Z$ and each of $A,B$ individually.</p>
<p>Visually, a few other things become clear. As far as we’re concerned, chains aren’t that important. If we had $X\rightarrow Q\rightarrow Y$ in the above diagrams instead of $X\rightarrow Y$, there would be an extra few terms, but we’d still have only one consistency rule.</p>
<p>Let’s keep going. Here’s where things get spicy. The <strong>overlapping loops</strong> diagram has two cases, (i) with one direction of arrows and (ii) with a collider.</p>
<p><img src="/assets/2020-simpsons-series/overlapping-loops.jpg" alt="two overlapping loops plate diagram" class="center-image" /></p>
<p>Having a <strong>chain in plate</strong> doesn’t seem interesting.</p>
<p><img src="/assets/2020-simpsons-series/chain-in-plate.jpg" alt="chain in plate diagram" class="center-image" /></p>
<p>Finally, we can mix things up with another plate, here with two loops with a <strong>shared regular edge</strong>.</p>
<p><img src="/assets/2020-simpsons-series/shared-regular-edge.jpg" alt="shared regular edge diagram" class="center-image" /></p>
<h2 id="open-questions">Open Questions</h2>
<p>What’s the role of builder rule parameterization? In the infinite Simpson’s paradox, I specifically chose</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta_i(\vz)&=\CP{Y=1}{X=i,Z=\vz}\\
\alpha_i(\vz)&=\CP{Z_{j+1}=1}{X=i,Z=\vz}\,\,.
\end{align} %]]></script>
<p>because $\beta_i$ was a useful parameterization for computing the difference $\Delta_j(\vz)$. Perhaps</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta_{y}(\vz)&=\CP{Z_{j+1}=1}{Y=y,Z=\vz}\\
\alpha_x(\vz)&=\CP{Z_{j+1}=1}{X=x,Z=\vz}
\end{align} %]]></script>
<p>is more natural as the set of builder rules. What constraint does this require?</p>
<p>It seems like we get consistency constraints for every junction that has an “infinite probability flow”. That is, if there are two directed paths like $Z_j\rightarrow X\rightarrow Y$ and $Z_j\rightarrow Y$ with a source that’s an infinite plate and the same sink, then we’ll expect a consistency rule for each path. The paths can have different infinite sources, such as the shared regular edge diagram’s junction for $A_j,B_j\rightarrow X$.</p>
<p>In the case of an overlapping loop (ii), we have three directed paths meeting at junction $B$, so we have a consistency rule over three builder rules simultaneously.</p>
<p>There’s bound to be some very pretty, minimal, category-theoretic way of expressing Kolmogorov-extensible DAGs. This is useful because it gives us a natural parameterization of such conditional structures, with the minimal amount of constraints on conditional probability kernels.</p>
Mon, 01 Jun 2020 00:00:00 +0000
https://vlad17.github.io/2020/06/01/a-broader-emergence.html
<h1 id="an-infinite-simpsons-paradox-simpsons-part-2-of-3">An Infinite Simpson’s Paradox (Simpson’s part 2 of 3)</h1>
<p>This is Problem 9.11 in <em>Elements of Causal Inference</em>.</p>
<p><em>Construct a single Bayesian network on binary $X,Y$ and variables $\{Z_j\}_{j=1}^\infty$ where the difference in conditional expectation,
\[
\Delta_j(\vz_{\le j}) = \CE{Y}{X=1, Z_{\le j}=\vz_{\le j}}-\CE{Y}{X=0, Z_{\le j}=\vz_{\le j}}\,\,,
\]
satisfies $\DeclareMathOperator\sgn{sgn}\sgn \Delta_j=(-1)^{j}$ and $\abs{\Delta_j}\ge \epsilon_j$ for some fixed $\epsilon_j>0$. $\Delta_0$ is unconstrained.</em></p>
<h3 id="proof-overview">Proof overview</h3>
<p>We will do this by induction, constructing a sequence of Bayes nets $\mcC_d$ for $d\in\N$ with variables $X,Y,Z_1,\cdots,Z_d$, such that $\mcC_d\subset\mcC_{d+1}$, in a strict sense. In particular, our nets will be nested so that they have the same structure on common variables. This means that for their entailed respective joints $p_d,p_{d+1}$,
\[
p_d(x, y, z_{1:d})=\int\d{z_{d+1}}p_{d+1}(x, y, z_{1:d+1})\,\,.
\]</p>
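<p>For binary variables, this nesting condition is concrete: integrating out $z_{d+1}$ is just a sum over the last axis. A toy sketch (the array values are arbitrary):</p>

```python
import numpy as np

# p_next plays the role of p_{d+1} over (x, y, z_1, z_2); any nonnegative
# array summing to one illustrates the nesting condition
rng = np.random.default_rng(1)
p_next = rng.random((2, 2, 2, 2))
p_next /= p_next.sum()

p_d = p_next.sum(axis=-1)  # "integrate out" z_{d+1}
assert p_d.shape == (2, 2, 2) and np.isclose(p_d.sum(), 1.0)
```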
<p>Intuitively, this seems to lead us to a limiting structure $\mcC_\infty$ over the infinite set of nodes, but it’s not clear that this necessarily exists. Our ability to generate larger probability spaces by adding independent variables doesn’t help here, since those are finite tools.</p>
<p>For simplicity, we’ll construct $X$ such that its marginal has mass $b(x)=0.5$ for $x=0,1$. We’ll also take $Z_j$ to be binary. Nonetheless, even in this simple setting, the set of realizations of $\{Z_j\}_{j=1}^\infty$ is uncountable, $2^\N$. Assigning probabilities to every subset of this set isn’t easy.</p>
<p>So first we’ll have to tackle well-definedness. What does $\mcC_\infty$ even mean, mathematically? Equipped with this, we can specify more details about the specific $\mcC_\infty$ we want to have that’ll satisfy properties about $\Delta_j$.</p>
<h3 id="well-definedness-of-mcc_infty">Well-definedness of $\mcC_\infty$</h3>
<p><em>Suppose that we have open unit interval-valued functions $\alpha_x,\beta_x$ for $x\in\{0,1\}$ on binary strings that satisfy, for any $j\in\N$ and binary $j$-length string $\vz$, that
\[
\beta_x(\vz)=(1-\alpha_x(\vz))\beta_x(\vz:0)+\alpha_x(\vz)\beta_x(\vz:1)\,\,,
\]
where $\vz:i$ is the concatenation operation (at the end). We construct an object $\mcC_\infty$ defined by finite kernels $p(x|\vz_J),p(y|x,\vz_J),p(\vz_J)$ (that is, for any $J\subset \N$, $\mcC_\infty$ provides us with these functions) that induce a joint distribution over $(X, Y, Z_J)$. Moreover, there exists a unique law $\P$ Markov wrt $\mcC_\infty$ (it is consistent with the kernels), which adheres to the following equality over binary strings $\vz$ of length $j$ with $Z=(Z_1,\cdots, Z_j)$:</em></p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta_i(\vz)&=\CP{Y=1}{X=i,Z=\vz}\\
\alpha_i(\vz)&=\CP{Z_{j+1}=1}{X=i,Z=\vz}\,\,.
\end{align} %]]></script>
<h4 id="proof">Proof</h4>
<p>The proof amounts to specifying Markov kernels inductively in a way that respects the invariant promised, and then applying the Kolmogorov extension theorem. While this would be the way to formally go about it, an inverse order, starting from Kolmogorov, is more illustrative.</p>
<p>The theorem states that given joint distributions $p_N$ over arbitrary finite tuples $N\subset \{X,Y,Z_1, Z_2,\cdots\}$ which match the consistency property $p_K=\int\d{\vv}p_N$, where $\vv$ is the realization of variables $N\setminus K$ for $K\subset N$, there exists a unique law $\P$ matching all the joint distributions on all tuples (even infinite ones). That is, you need to be consistent under marginalization.</p>
<p>Before diving in, we make a few simplifications.</p>
<ol>
<li>Since the variables are all binary it’s easy enough to make sure all our kernels are valid conditional probability distributions; just specify a valid probability for one of the outcomes, the other being the complement.</li>
<li>We’ll focus only on kernels $p(z_j)$, $p(x|\vz_{\le j})$, and $p(y|x, \vz_{\le j})$ for $j\in\N$. It’s easy enough to derive the other ones; for any finite $J\subset \N$, with $m=\max J$, just let
\[
p(x|\vz_J)=\int\d{\vz_{[m]\setminus J}}p(x|\vz_{\le m})p(\vz_{[m]\setminus J})\,\,,
\]
and analogously for $p(y|x, \vz_{\le j})$. Thanks to independence structure, $p(\vz_J)=\prod_{j\in J}p(z_j)$.</li>
</ol>
<p><img src="/assets/2020-simpsons-series/diagram.jpg" alt="bayes net" class="center-image" /></p>
<h5 id="extension">Extension</h5>
<p>This simplification in (2) means that when checking the Kolmogorov extension condition, we needn’t worry about $N$ and $K$ differing by $Z_j$ nodes.</p>
<p>Consider tuples $K\subset N$ over our variables from $\mcC_\infty$, and denote their intersections with $\{Z_j\}_{j\in\N}$ as $Z_{J_K},Z_{J_N}$. Letting $m=\max J_N$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p_K(x, y, \vz_{J_K})&= p_K(y|x, \vz_{J_K})p_K(x|\vz_{J_K})p(\vz_{J_K})\\
&=\int\d{\vz_{[m]\setminus J_K}}p(y|x, \vz_{\le m})p(x|\vz_{\le m})p(\vz_{\le m})\\
&=\int\d{\vz_{J_N\setminus J_K}}\int\d{\vz_{[m]\setminus J_N}}p(y|x, \vz_{\le m})p(x|\vz_{\le m})p(\vz_{\le m})\\
&=\int\d{\vz_{J_N\setminus J_K}}p_N(y|x, \vz_{J_N})p_N(x|\vz_{J_N})p_N(\vz_{J_N})\\
&=\int\d{\vz_{J_N\setminus J_K}}p_N(x, y, \vz_{J_N})\,\,.
\end{align} %]]></script>
<p>The subscripts are important here. Of course, $p_K(\vz_L)=p_N(\vz_L)=p(\vz_L)=\prod_{j\in L}p(z_j)$ for any $L\subset K\subset N$ by independence and kernel specification. Otherwise, the steps above rely on joint decomposition, then simplification (2) applied to $K$, Fubini, then simplification (2) applied to $N$ now in reverse, and finally joint distribution composition.</p>
<p>The above presumes $X,Y\in K$, but it’s clear that we can simply add in the corresponding integrals on the right hand side to recover them if they’re in $N$ after performing the steps above.</p>
<p>The above finishes our use of the extension theorem, relying only on the fact that we constructed valid Markov kernels to provide us a law $\P$ consistent with them. But to actually apply this reasoning, we have to explicitly construct these kernels, which we’ll do with the help of simplification (1).</p>
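<p>As a concrete sanity check of the marginalization chain, here is a small numerical instance with made-up kernels: $Z_1,Z_2$ independent, $X$ depending on both, and $Y$ depending on everything. We verify that the kernel derived via simplification (2) agrees with the conditional computed from the marginalized joint.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
pz = rng.random(2)          # P(Z_j = 1) for j = 1, 2 (made up)
px = rng.random((2, 2))     # P(X = 1 | z_1, z_2)
py = rng.random((2, 2, 2))  # P(Y = 1 | x, z_1, z_2)

# full joint over (x, y, z_1, z_2) per the Bayes net factorization
joint = np.zeros((2, 2, 2, 2))
for x, y, z1, z2 in np.ndindex(2, 2, 2, 2):
    pzz = (pz[0] if z1 else 1 - pz[0]) * (pz[1] if z2 else 1 - pz[1])
    pxx = px[z1, z2] if x else 1 - px[z1, z2]
    pyy = py[x, z1, z2] if y else 1 - py[x, z1, z2]
    joint[x, y, z1, z2] = pzz * pxx * pyy
assert np.isclose(joint.sum(), 1.0)

# simplification (2): derive p(x=1 | z_1) by integrating z_2 out of the
# kernel, and check it matches the conditional of the marginalized joint
px_z1_kernel = px @ np.array([1 - pz[1], pz[1]])
px_z1_joint = joint[1].sum(axis=(0, 2)) / joint.sum(axis=(0, 1, 3))
assert np.allclose(px_z1_kernel, px_z1_joint)
```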
<h5 id="kernel-specification">Kernel Specification</h5>
<p>We show that there exist marginals $p(z_j)$ such that $p(x)$ is the Bernoulli pmf $b(x)$ and $p(z_j|x,\vz_{<j})$ is defined by $\alpha_x(\vz_{<j})$. In particular, we first inductively define $p(x|\vz_{\le j})$ and $p(z_j)$ simultaneously. For $j=0$ set $p(x)=b(x)$. Then for $j>0$ define
\[
p(Z_j=1)=\int\d{(x,\vz_{<j})}\alpha_x(\vz_{<j})p(x|\vz_{<j})p(\vz_{<j})\,\,,
\]
which for $j=1$ simplifies to $p(Z_1=1)=\int \d{x}p(x)\alpha_x(\emptyset)=b(0)\alpha_0(\emptyset)+b(1)\alpha_1(\emptyset)$, and then use that to define
\[
p(x|\vz_{<j},Z_j=1)=\frac{\alpha_x(\vz_{<j})p(x|\vz_{<j})}{p(Z_j=1)}\,\,,
\]
which induces $p(x|\vz_{\le j})$ (the $z_j=0$ case follows by taking the complements of $\alpha_x$ and $p(Z_j=1)$). For the case of $j=1$ the above centered equation simplifies to $\alpha_x(\emptyset)b(x)/p(Z_1=1)$. It is evident from Bayes’ rule applied to $p(x|z_j,\vz_{< j})$ that this is the unique distribution $p(x|\vz_{\le j})$ matching the semantic constraint on $\alpha_x$, assuming the $Z_j$ are independent.</p>
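<p>The first inductive step is small enough to check numerically; here is a sketch with made-up $\alpha_x(\emptyset)$ values and $b(x)=0.5$ as in the text.</p>

```python
import numpy as np

# Sketch of the first inductive step: hypothetical alpha_x(emptyset) values.
b = np.array([0.5, 0.5])      # marginal of X
alpha = np.array([0.3, 0.8])  # alpha_x(emptyset) for x = 0, 1 (made up)

pz1 = b @ alpha               # p(Z_1=1) = b(0) alpha_0 + b(1) alpha_1
post1 = alpha * b / pz1       # p(x | Z_1 = 1) via Bayes' rule
post0 = (1 - alpha) * b / (1 - pz1)
assert np.isclose(post1.sum(), 1.0) and np.isclose(post0.sum(), 1.0)

# round trip: the joint these kernels induce recovers b and alpha
joint = np.stack([post0 * (1 - pz1), post1 * pz1])  # joint[z1, x]
assert np.allclose(joint.sum(axis=0), b)            # X marginal is b(x)
assert np.allclose(joint[1] / joint.sum(axis=0), alpha)
```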
<p>Uniqueness of $p(x|\vz_{\le j})$
follows inductively, as does
$\int \d{\vz_{\le m}}p(x|\vz_{\le m})p(\vz_{\le m})=b(x)$.</p>
<p>While above we could <em>construct</em> conditional pmfs and marginal pmfs to suit our needs, where they formed valid measures simply by construction (i.e., specifying any open unit interval valued $\alpha$ constructs valid pmfs above), we must now use our assumption to validate that the free function $\beta$ induces a valid measure on $Y$.</p>
<p>It must be the case that for any kernel we define that for all $j\in \N$,
\[
p(y|x, \vz_{\le j})=\int\d{z_{j+1}}p(z_{j+1}|x, \vz_{\le j})p(y|x, \vz_{\le j+1})\,\,,
\]
which by the assumption $\beta_x(\vz)=(1-\alpha_x(\vz))\beta_x(\vz:0)+\alpha_x(\vz)\beta_x(\vz:1)$ holds precisely when
\[
p(Y=1|x, \vz_{\le j})=\beta_x(\vz_{\le j})\,\,,
\]
by our definition of $p(z_{j+1})$ above. Then such a specification of kernels is valid.</p>
<h3 id="configuring-delta_j">Configuring $\Delta_j$</h3>
<p>Having done all the work in constructing $\mcC_\infty$, we now just need to specify $\alpha, \beta$ meeting our constraints.</p>
<p>To do this, it’s helpful to work through some examples. We first note a simple equality, which is
\[
\Delta_d(\vz)=\beta_1(\vz)-\beta_0(\vz)\,\,.
\]</p>
<p>For $d=0$, we can just take $\beta_0(\emptyset)=\beta_1(\emptyset)=0.5$.</p>
<p><img src="/assets/2020-simpsons-series/contingency.jpg" alt="contingency table" class="center-image" /></p>
<p>For $d=1$, we introduce $Z_1$. Notice that we are now bound by our constraints,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta_0(\emptyset)&=(1-\alpha_0(\emptyset))\beta_0(0)+\alpha_0(\emptyset)\beta_0(1)\\
\beta_1(\emptyset)&=(1-\alpha_1(\emptyset))\beta_1(0)+\alpha_1(\emptyset)\beta_1(1)\,\,.
\end{align} %]]></script>
<p>At the same time, we’re looking to find settings such that $\forall z,\,\,\beta_1(z)-\beta_0(z)\le -\epsilon_1$.</p>
<p>Luckily, our constraints amount to convexity constraints; algebraically, this means that $\beta_0(0),\beta_0(1)$ must be on either side of $\beta_0(\emptyset)$ and similarly for $\beta_1(\emptyset)$. At the same time, we’d like to make sure that $\beta_0(1)-\beta_1(1)\ge \epsilon_1$. This works out! See the picture below, which sets
\[
(\beta_1(0), \beta_0(0),\beta_1(1),\beta_0(1))=0.5+\pa{-2\epsilon_1,-\epsilon_1,\epsilon_1,2\epsilon_1}\,\,.
\]</p>
<p><img src="/assets/2020-simpsons-series/numberline.jpg" alt="number line" class="center-image" /></p>
<p>The choice of $\alpha_i(\emptyset)$ implied by these convex combinations then meets the constraints.</p>
<p>For the recursive case, recall we need
\[
\beta_x(\vz)=(1-\alpha_x(\vz))\beta_x(\vz:0)+\alpha_x(\vz)\beta_x(\vz:1)\,\,,
\]
so we’ll choose $\beta_x(\vz:1)>\beta_x(\vz)>\beta_x(\vz:0)$, which always admits solutions on the open unit interval, but to ensure that $\beta_1(\vz:z_j)-\beta_0(\vz:z_j)=(-1)^{j}\epsilon_j$, we need another construction similar to the above with the number line. Here’s the next step.</p>
<p><img src="/assets/2020-simpsons-series/recursive-numberline.jpg" alt="recursive number line" class="center-image" /></p>
<p>We can frame the above recursively. Suppose $m$ is the minimum distance from $\beta_1(\vz),\beta_0(\vz)$ to $0,1$. Without loss of generality, assume the parity of $\vz$ is such that we’re interested in having $\beta_1(\vz:z_j)>\beta_0(\vz:z_j)$, which implies by parity as well that in the previous step $\beta_1(\vz)\le \beta_0(\vz)$.</p>
<p>Then for $j=\card{\vz}+1$ set
\[
\mat{\beta_0(\vz:0)\\ \beta_1(\vz:0)\\\beta_0(\vz:1)\\\beta_1(\vz:1) }=
\mat{\beta_1(\vz)-\frac{m+\epsilon_{j}}{2} \\ \beta_1(\vz)-\frac{m-\epsilon_{j}}{2} \\ \beta_0(\vz)+\frac{m-\epsilon_{j}}{2} \\ \beta_0(\vz)+\frac{m+\epsilon_{j}}{2} }\,\,.
\]</p>
<p>This recursion keeps the equations solvable, and by choice of $\epsilon_{j}$ sufficiently small all quantities are within $(0,1)$.</p>
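<p>The recursion is mechanical enough to run. Here is a sketch that takes $m$ to be the distance from $\beta_0(\vz),\beta_1(\vz)$ to the boundary of $(0,1)$ and uses hypothetical $\epsilon_j=4^{-j}/2$ (any sequence shrinking fast enough works), then verifies the alternating signs of $\Delta_j$ and that the implied $\alpha_x$ lies strictly in $(0,1)$.</p>

```python
import numpy as np
from itertools import product

# beta[(x, z)] plays the role of P(Y=1 | X=x, Z_{<=j} = z) for tuples z
depth = 4
eps = {j: 0.5 * 4.0 ** -j for j in range(1, depth + 1)}  # shrinks fast enough
beta = {(0, ()): 0.5, (1, ()): 0.5}

def extend(z):
    j = len(z) + 1
    sign = (-1) ** j                           # target sign of Delta_j
    lo, hi = (0, 1) if sign > 0 else (1, 0)
    p_lo, p_hi = beta[(lo, z)], beta[(hi, z)]  # by parity, p_hi <= p_lo
    m = min(p_lo, p_hi, 1 - p_lo, 1 - p_hi)
    e = eps[j]
    assert 0 < e < m
    # the displayed recursion, relabeled so beta_hi - beta_lo = e at children
    beta[(lo, z + (0,))] = p_hi - (m + e) / 2
    beta[(hi, z + (0,))] = p_hi - (m - e) / 2
    beta[(lo, z + (1,))] = p_lo + (m - e) / 2
    beta[(hi, z + (1,))] = p_lo + (m + e) / 2

for j in range(depth):
    for z in product((0, 1), repeat=j):
        extend(z)

# sgn Delta_j = (-1)^j with |Delta_j| = eps_j at every z
for j in range(1, depth + 1):
    for z in product((0, 1), repeat=j):
        delta = beta[(1, z)] - beta[(0, z)]
        assert np.sign(delta) == (-1) ** j and np.isclose(abs(delta), eps[j])

# the alpha_x(z) implied by the convexity constraint is strictly in (0, 1)
for j in range(depth):
    for z in product((0, 1), repeat=j):
        for x in (0, 1):
            a = (beta[(x, z)] - beta[(x, z + (0,))]) / (
                beta[(x, z + (1,))] - beta[(x, z + (0,))])
            assert 0 < a < 1
```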
Fri, 01 May 2020 00:00:00 +0000
https://vlad17.github.io/2020/05/01/an-infinite-simpsons-paradox.html
<h1 id="observational-causal-inference-simpsons-part-1-of-3">Observational Causal Inference (Simpson’s part 1 of 3)</h1>
<p>In most data analysis, especially in business contexts, we’re looking for answers about how we can do better. This implies that we’re looking for a change in our actions that will improve some measure of performance.</p>
<p>There’s an abundance of passively collected data from analytics. Why not point fancy algorithms at that?</p>
<p>In this post, I’ll introduce a counterexample showing why we shouldn’t be able to extract such information easily.</p>
<h2 id="simpsons-paradox">Simpson’s Paradox</h2>
<p>This has been explained many <a href="https://en.wikipedia.org/wiki/Simpson%27s_paradox">times</a>, so I’ll be brief. Suppose we’re collecting information about which of two treatments, A or B, better cures kidney stones, regardless of their size.</p>
<p><img src="/assets/2020-simpsons-series/simpson.png" alt="simpson table" class="center-image" /></p>
<p>Notice a strange phenomenon: if you know that you have either a small stone or a large stone, you’d want to opt for A, but if you don’t know which of the two stones you have, it looks like B is better, since its cure rate is higher overall.</p>
<p>The counts betray what’s really going on: doctors systematically apply A to harder cases of kidney stones, so even though A performs better on each one, it just has more cases where the average outcome is worse due to their difficulty. Taking the average of the treatment performance across strata resolves the paradox.</p>
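<p>For reference, the classic kidney-stone counts from the linked Wikipedia article (which may not match the table pictured above exactly) reproduce the reversal in a few lines:</p>

```python
import numpy as np

# cured / total from the linked Wikipedia article's kidney-stone table;
# rows are treatments A, B and columns are small, large stones
cured = np.array([[81, 192], [234, 55]])
total = np.array([[87, 263], [270, 80]])

per_stratum = cured / total
overall = cured.sum(axis=1) / total.sum(axis=1)

# A wins within each stone size...
assert (per_stratum[0] > per_stratum[1]).all()
# ...yet B wins when the strata are pooled
assert overall[1] > overall[0]
```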
<p><em>Or does it?</em> In this case, we were given the causal knowledge of how doctors apply the treatment to their patients. It is precisely because there’s a causal arrow pointing from kidney stone size to which treatment you’re given that we want to control for kidney stone size.</p>
<p>From an observational perspective, this knowledge doesn’t exist in a set of examples <code class="language-plaintext highlighter-rouge">(stone size, treatment, cured or not)</code>. As far as that dataset is concerned, stone size could’ve been measured after taking the treatment for a few days, in which case we could still notice the correlations shown in the graph above but wouldn’t want to average in the same way (in particular, treatment A might make the stone bigger before it gets rid of it for some medical reason).</p>
<p>What’s worse, if you don’t see the stone size variable, because you don’t capture it, you’ll make the wrong conclusion about which treatment is more effective.</p>
<h2 id="knee-jerks">Knee-jerks</h2>
<p>There are a few reasonable responses from the optimist on this.</p>
<ol>
<li>We capture all relevant variables and know which causes which.</li>
<li>We can apply more fanciness.</li>
</ol>
<p>Note, more fanciness doesn’t just mean you can dump all your variables into an ML model and see what it spits out. There are other pathological cases this approach <a href="https://journals.sagepub.com/doi/10.1080/07388940500339167">gives rise to</a>.</p>
<p>But, for the main problem of the above (something called colliders), there are techniques that can identify whether or not they’re present using independence testing. So at first glance the optimist’s approach might be tenable: enough variables, enough smarts to only look at variables with total causal effect on our response, and a sophisticated enough model of the interactions, which we presume fits the situation. We’ll get there, right?</p>
<p>Not quite. An extension of Simpson’s paradox points to a deeper epistemic uncertainty in causal modelling.</p>
<h2 id="an-infinite-simpsons-paradox">An Infinite Simpson’s Paradox</h2>
<p>If we can come up with an example of an infinite Simpson’s paradox, where we have two variables $X,Y$ and they are confounded by $Z_1,Z_2, \cdots$, which go on forever, then regardless of how much data we have, and how many variables we capture, we simply will not be able to tell what the correlation between $X$ and $Y$ is. A “confounder” here is like the kidney stone size—an underlying systematic association that colors what our assessment of $X$’s effect on $Y$ should be.</p>
<p>This gives a precise example to point to. Here’s an instance where you can always have access to as many data instances as you want, as many relevant variables as you want, and all the causal information about those variables, and you’ll still end up with the wrong answer about the average effect of $X$ on $Y$.</p>
<p><img src="/assets/2020-simpsons-series/diagram.jpg" alt="bayes net" class="center-image" /></p>
<p>Before jumping into that, let’s be clear about what a diagram like the above means. Every vertex is a random variable. The graph will always be a directed acyclic graph, so there’s an order over the variables sweeping from parents to children, with the eldest parents having no parents themselves (they’re the roots, in this case $Z_j$).</p>
<p>If you define the marginal distribution $p(z_j)$ of the roots and the conditional probabilities of every child given their parents, then you’ve defined the full joint distribution of every variable in the graph. For example, for the graph $\mcG$ below,</p>
<p><img src="/assets/2020-simpsons-series/small-diagram.jpg" alt="bayes net small" class="center-image" /></p>
<p>it’s easy to convince ourselves that for the parent operation $\mathrm{Pa}$ that returns a node’s parents,
\[
p(a, b, c, d)=\prod_{v\in\mcG}p(v|\mathrm{Pa}(v))=p(c)p(d)p(b|c, d)p(a|b, c, d)\,\,,
\]
but what does this mean when we have infinitely many variables as shown in the previous diagram?</p>
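<p>For the finite graph $\mcG$ above, the factorization is easy to make concrete; here is a sketch with arbitrary made-up kernels, multiplied out via broadcasting:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# root marginals and per-child conditional kernels (made-up values)
pc = rng.random(2); pc /= pc.sum()  # p(c)
pd = rng.random(2); pd /= pd.sum()  # p(d)
pb = rng.random((2, 2, 2))          # p(b | c, d), axes (b, c, d)
pb /= pb.sum(axis=0)
pa = rng.random((2, 2, 2, 2))       # p(a | b, c, d), axes (a, b, c, d)
pa /= pa.sum(axis=0)

# joint[a, b, c, d] = p(c) p(d) p(b|c,d) p(a|b,c,d)
joint = pa * pb[None] * pc[None, None, :, None] * pd[None, None, None, :]
assert np.isclose(joint.sum(), 1.0)
# marginalizing children recovers the root marginal p(c)
assert np.allclose(joint.sum(axis=(0, 1, 3)), pc)
```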
Wed, 01 Apr 2020 00:00:00 +0000
https://vlad17.github.io/2020/04/01/observational-causal-inference.html
Numpy Gems, Part 3
<h1 id="subset-isomorphism">Subset Isomorphism</h1>
<p>Much of scientific computing revolves around the manipulation of indices. Most formulas involve sums of things, and at their core the formulas differ by which things we’re summing.</p>
<p>Being particularly clever about indexing helps with that. A complicated example is the <a href="https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm">FFT</a>. A less complicated example is computing the inverse of a permutation:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">inverse</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">empty_like</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
<span class="n">inverse</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="nb">all</span><span class="p">(</span><span class="n">x</span> <span class="o">==</span> <span class="n">inverse</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>True
</code></pre></div></div>
<p>The focus of this post is to expand on a maybe-useful, vectorizable isomorphism between indices that comes up all the time: indexing pairs. In particular, it’s often the case that we’d want to come up with an <em>a priori</em> indexing scheme into a weighted, complete undirected graph on \(V\) vertices and \(E\) edges.</p>
<p>In particular, our edge set is \(\binom{[V]}{2}=\left\{(0, 1), (0, 2), \cdots, (V-2, V-1)\right\}\), the set of \(2\)-tuples ordered so that \(i < j\). Our index set is \(\left[\binom{V}{2}\right]=\left\{0, 1, \cdots, \frac{V(V-1)}{2} - 1\right\}\) (note we’re 0-indexing here).</p>
<p>Can we come up with an isomorphism between these two sets that vectorizes well?</p>
<p>A natural question is why not just use a larger index. Say we’re training a <a href="https://arxiv.org/abs/1511.05493">GGNN</a>, and we want to maintain embeddings for our edges. Our examples might be in a format where we have two vertices \((v_1, v_2)\) available. We’d like to index into an edge array maintaining the corresponding embedding. Here, you may very well get away with using an array of size \(V^2\). That takes about twice as much memory as you need, though.</p>
<p>A deeper problem is simply that you can <em>represent</em> invalid indices, and if your program manipulates the indices themselves, this can cause bugs. This matters in settings like <a href="http://graphblas.org/">GraphBLAS</a> where you’re trying to vectorize classical graph algorithms.</p>
<p>The following presents a completely static isomorphism that doesn’t need to know \(V\) in advance.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># an edge index is determined by the isomorphism from
# ([n] choose 2) to [n choose 2]
</span>
<span class="c1"># mirror (i, j) to (i, j - i - 1) first. then:
</span>
<span class="c1"># (0, 0) (0, 1) (0, 2)
# (1, 0) (1, 1)
# (2, 0)
</span>
<span class="c1"># isomorphism goes in downward diagonals
# like valence electrons in chemistry
</span>
<span class="k">def</span> <span class="nf">c2</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="k">return</span> <span class="n">n</span> <span class="o">*</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span>
<span class="k">def</span> <span class="nf">fromtup</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">):</span>
<span class="n">j</span> <span class="o">=</span> <span class="n">j</span> <span class="o">-</span> <span class="n">i</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">diagonal</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="n">j</span>
<span class="k">return</span> <span class="n">c2</span><span class="p">(</span><span class="n">diagonal</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">i</span>
<span class="k">def</span> <span class="nf">totup</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="c1"># https://math.stackexchange.com/a/1417583
</span> <span class="c1"># sqrt is valid as long as we work with numbers that are small
</span> <span class="c1"># note, importantly, this is vectorizable
</span> <span class="n">diagonal</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">8</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">uint64</span><span class="p">))</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-</span> <span class="n">c2</span><span class="p">(</span><span class="n">diagonal</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">j</span> <span class="o">=</span> <span class="n">diagonal</span> <span class="o">-</span> <span class="n">i</span>
<span class="n">j</span> <span class="o">=</span> <span class="n">j</span> <span class="o">+</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span>
<span class="n">nverts</span> <span class="o">=</span> <span class="mi">1343</span>
<span class="n">edges</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">c2</span><span class="p">(</span><span class="n">nverts</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="nb">all</span><span class="p">(</span><span class="n">fromtup</span><span class="p">(</span><span class="o">*</span><span class="n">totup</span><span class="p">(</span><span class="n">edges</span><span class="p">))</span> <span class="o">==</span> <span class="n">edges</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>True
</code></pre></div></div>
<p>This brings us to our first numpy gem of this post, to check that our isomorphism is surjective, <code class="language-plaintext highlighter-rouge">np.triu_indices</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">left</span><span class="p">,</span> <span class="n">right</span> <span class="o">=</span> <span class="n">totup</span><span class="p">(</span><span class="n">edges</span><span class="p">)</span>
<span class="n">expected_left</span><span class="p">,</span> <span class="n">expected_right</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">triu_indices</span><span class="p">(</span><span class="n">nverts</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="n">Counter</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">))</span> <span class="o">==</span> <span class="n">Counter</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">expected_left</span><span class="p">,</span> <span class="n">expected_right</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>True
</code></pre></div></div>
<p>The advantage over indexing into <code class="language-plaintext highlighter-rouge">np.triu_indices</code> is, of course, the scenario where you <em>don’t</em> want to fully materialize all edges in memory, such as in frontier expansions for graph search.</p>
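<p>For readers jumping in here, the following is a compressed, self-contained reconstruction of the pairing: the names <code class="language-plaintext highlighter-rouge">c2</code>, <code class="language-plaintext highlighter-rouge">fromtup</code>, and <code class="language-plaintext highlighter-rouge">totup</code> match the post’s, but the bodies are re-derived rather than copied, so treat them as a sketch.</p>

```python
import numpy as np

def c2(n):
    # "n choose 2": number of edges among n vertices
    return n * (n - 1) // 2

def fromtup(i, j):
    # edge (i, j) with i < j maps to linear index c2(j) + i
    return c2(j) + i

def totup(x):
    # invert: j is the unique integer with c2(j) <= x < c2(j + 1)
    x = np.asarray(x)
    j = ((1 + np.sqrt(8 * x + 1)) // 2).astype(int)
    i = x - c2(j)
    return i, j

# streaming use: decode arbitrary edge indices without materializing
# all of np.triu_indices(nverts, k=1)
nverts = 1343
some_edges = np.array([0, 5, 10_000, c2(nverts) - 1])
i, j = totup(some_edges)
assert np.all(fromtup(i, j) == some_edges)
assert np.all(i < j) and np.all(j < nverts)
```

<p>Note that <code class="language-plaintext highlighter-rouge">totup</code> never looks at <code class="language-plaintext highlighter-rouge">nverts</code>, which is exactly the online property discussed below.</p>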
<p>You might be wondering how dangerous that <code class="language-plaintext highlighter-rouge">np.sqrt</code> is, especially for large numbers. Since we’re concerned about the values of <code class="language-plaintext highlighter-rouge">np.sqrt</code> for inputs at least 1, and on this domain the mathematical function is sublinear, there’s actually <em>less</em> rounding error in representing the square root of an integer with a double than the input itself. <a href="https://stackoverflow.com/a/22547057/1779853">Details here</a>.</p>
<p>Of course, we’re in trouble if <code class="language-plaintext highlighter-rouge">8 * x + 1</code> cannot be represented by a 64-bit double, even up to ULP error. It’s imaginable to have graphs on <code class="language-plaintext highlighter-rouge">2**32</code> vertices, so it’s not a completely artificial concern, and in principle we’d want to support edge indices up to \(\binom{2^{32}}{2}=2^{63} - 2^{31}\) (the code below conservatively caps at \(2^{63}-2^{32}\)). Numpy correctly refuses to perform the mapping in this case, throwing on <code class="language-plaintext highlighter-rouge">totup(2**61)</code>.</p>
<p>In this case, some simple algebra and recalling that we don’t need a lot of precision anyway will save the day.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="mi">2</span><span class="o">**</span><span class="mi">53</span>
<span class="nb">float</span><span class="p">(</span><span class="mi">8</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="nb">float</span><span class="p">(</span><span class="mi">8</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>True
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">totup_flexible</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">np</span><span class="o">.</span><span class="nb">all</span><span class="p">(</span><span class="n">x</span> <span class="o"><=</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">63</span> <span class="o">-</span> <span class="mi">2</span><span class="o">**</span><span class="mi">32</span><span class="p">)</span>
<span class="k">if</span> <span class="n">x</span> <span class="o">></span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">53</span><span class="p">:</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">uint64</span><span class="p">)</span>
<span class="c1"># in principle, the extra multiplication here could require correction
</span> <span class="c1"># by at most 1 ulp; luckily (s+1)**2 is representable in u64
</span> <span class="c1"># because (sqrt(2)*sqrt(2**63 - 2**32)*(1+3*eps) + 1) is (just square it to see)
</span> <span class="n">s3</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">stack</span><span class="p">([</span><span class="n">s</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">s</span> <span class="o">+</span> <span class="mi">1</span><span class="p">])</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">s</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">s3</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">s3</span><span class="p">)),</span> <span class="n">np</span><span class="o">.</span><span class="n">argmin</span><span class="p">(</span><span class="n">s3</span> <span class="o">**</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">x</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">)]</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">8</span> <span class="o">*</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">uint64</span><span class="p">)</span>
<span class="n">add</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">if</span> <span class="n">x</span> <span class="o">></span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">53</span> <span class="k">else</span> <span class="mi">1</span>
<span class="n">diagonal</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">s</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">diagonal</span> <span class="o">=</span> <span class="n">diagonal</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-</span> <span class="n">c2</span><span class="p">(</span><span class="n">diagonal</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">j</span> <span class="o">=</span> <span class="n">diagonal</span> <span class="o">-</span> <span class="n">i</span>
<span class="n">j</span> <span class="o">=</span> <span class="n">j</span> <span class="o">+</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span>
<span class="n">x</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">63</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">32</span>
<span class="n">fromtup</span><span class="p">(</span><span class="o">*</span><span class="n">totup_flexible</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">==</span> <span class="n">x</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>True
</code></pre></div></div>
<p>At the end of the day, this is mostly useful not for the 2x space savings but for online algorithms that don’t know \(V\) ahead of time.</p>
<p>That said, you can expand the above approach to an isomorphism between larger subsets, e.g., between \(\binom{[V]}{k}\) and \(\left[\binom{V}{k}\right]\) for \(k>2\) (if you do this, I’d be really interested in seeing what you get). To extend this to higher dimensions, you can directly generalize the geometric construction above, slicing through \(k\)-dimensional cones with \((k-1)\)-dimensional hyperplanes and recursively iterating through the nodes. But that’s easier said than done.</p>
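<p>For what it’s worth, one standard route to such an isomorphism for general \(k\), not necessarily the one the slicing construction would yield, is the combinatorial number system: rank the sorted subset \(c_1 < c_2 < \cdots < c_k\) as \(\sum_m\binom{c_m}{m}\). A sketch, with hypothetical helper names:</p>

```python
from itertools import combinations
from math import comb

def subset_rank(subset):
    # combinatorial number system: a sorted k-subset (c_1 < ... < c_k)
    # maps to sum_m comb(c_m, m), a bijection onto [comb(V, k)]
    return sum(comb(c, m) for m, c in enumerate(subset, start=1))

def subset_unrank(x, k):
    # invert greedily: peel off the largest c with comb(c, m) <= x,
    # working from position m = k down to 1
    out = []
    for m in range(k, 0, -1):
        c = m - 1
        while comb(c + 1, m) <= x:
            c += 1
        x -= comb(c, m)
        out.append(c)
    return tuple(reversed(out))

V, k = 7, 3
ranks = range(comb(V, k))
assert all(subset_rank(subset_unrank(x, k)) == x for x in ranks)
assert {subset_unrank(x, k) for x in ranks} == set(combinations(range(V), k))
```

<p>As in the \(k=2\) case, unranking never needs \(V\), so this too works online; it enumerates subsets in colex rather than lex order.</p>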
<p>That’s not to say this is unilaterally better than the simpler representation \(V^k\). The space wasted by the “easy” representation \(V^k\) relative to this “hard” isomorphism-based one is a factor of \(k!\), but since the objects we’re talking about have size \(V^k\) anyway, memory savings alone aren’t a compelling argument for this indexing. It’s not a constant worth scoffing at, but the main reasons to use it are that it’s online and has no “holes” in the indexing.</p>
<p><a href="/assets/2020-03-07-subset-isomorphism/subset-isomorphism.ipynb">Try the notebook out yourself</a>.</p>
Sat, 07 Mar 2020 00:00:00 +0000
https://vlad17.github.io/2020/03/07/subset-isomorphism.html
https://vlad17.github.io/2020/03/07/subset-isomorphism.htmltoolsGraph Coloring for Machine Learning<h1 id="graph-coloring-for-machine-learning">Graph Coloring for Machine Learning</h1>
<p>This month, I posted an entry on Sisu’s engineering blog, discussing an effective strategy for lossless column reduction on sparse datasets.</p>
<p><a href="https://sisu.ai/blog/graph-coloring-for-machine-learning">Check out the blog post there.</a></p>
Sat, 22 Feb 2020 00:00:00 +0000
https://vlad17.github.io/2020/02/22/graph-coloring-for-machine-learning.html
https://vlad17.github.io/2020/02/22/graph-coloring-for-machine-learning.htmlmachine-learningStop Anytime Multiplicative Weights<h1 id="stop-anytime-multiplicative-weights">Stop Anytime Multiplicative Weights</h1>
<p>Multiplicative weights is a simple, randomized algorithm for picking an option among \(n\) choices against an adversarial environment.</p>
<p>The algorithm has widespread applications, but its analysis frequently introduces a learning rate parameter, \(\epsilon\), which we’ll be trying to get rid of.</p>
<p>In this first post, we introduce multiplicative weights and make some practical observations. We follow <a href="https://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf">Arora’s survey</a> for the most part.</p>
<h2 id="problem-setting">Problem Setting</h2>
<p>We play \(T\) rounds. On the \(t\)-th round, the player is to make a (possibly randomized) choice \(I_t\in[n]\), and then observes the losses for all of the choices, the vector \(M_{\cdot, j_t}\), corresponding to the \(j_t\)-th column of a matrix \(M\) unknown to the player. Note \(j_t\) can be any fixed sequence here, perhaps adversarially chosen with advance knowledge of the distribution of all \(I_t\) but not the actual chance value of \(I_t\) itself.</p>
<p>The goal is to have vanishing regret; that is, our average loss should approach the loss of the best single fixed choice in hindsight:
\[
\frac{1}{T}\mathbb{E}\max_{i}\left(\sum_t M(I_t, j_t) - M(i, j_t)\right)
\]</p>
<p>This turns out to be a powerful, widely applicable setting, precisely because we have guarantees in spite of any selected sequence of columns \(j_t\), possibly adversarially chosen.</p>
<p>It turns out the above expected regret will have the same guarantees as pseudo-regret \(\mathbb{E}\left[\sum_t M(I_t, j_t)\right] - \min_i \sum_tM(i, j_t)\) because in our setting \(j_t\) is fixed (<a href="https://vlad17.github.io/2019/12/12/stop-anytime-multiplicative-weights-pt1.html">other adversaries exist</a>, but in this setting where the algorithm doesn’t depend on its own choices even a weak adaptive adversary would work just as well as one that specifies its sequence up-front).</p>
<h2 id="pseudo-regret-guarantees">Pseudo-regret Guarantees</h2>
<p>Fix our setting as above, with the augmented notation \(M(i, j)=M_{ij}\) and \(M(\mathcal{D}, j)=\mathbb{E}[M(I, j)]\) where \(I\sim \mathcal{D}\).</p>
<p>The multiplicative weight update rule (MWUA) tracks a weight vector \(w^{(t)}\) each round \(t\in[T]\). Let \(\Phi_t=\sum_iw_i^{(t)} \). Then we pick expert \(i\) on the \(t\)-th round with probability \(w_i^{(t)}/\Phi_t \). Let this distribution over \([n]\) be \(\mathcal{D}_t\).</p>
<p>MWUA initializes \(w^{(0)}_i =1\) for all \(i\in[n]\). With parameter \(\epsilon \le \frac{1}{2}\) we set the next round’s weights based on the loss of the current round:
\[
w_i^{(t+1)}=w_i^{(t)}(1-\epsilon M(i, j_t))
\]</p>
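<p>A minimal numpy sketch of the loop above; the random loss matrix and column sequence here are stand-ins, and the setting assumes losses in \([0,1]\):</p>

```python
import numpy as np

def mwua(M, cols, eps):
    # M: (n, m) losses in [0, 1]; cols: the column sequence j_1, ..., j_T
    # returns the played distributions D_t, one row per round
    n = M.shape[0]
    w = np.ones(n)
    dists = []
    for j in cols:
        dists.append(w / w.sum())        # D_t = w^(t) / Phi_t
        w = w * (1 - eps * M[:, j])      # the multiplicative update
    return np.array(dists)

rng = np.random.default_rng(0)
M = rng.random((5, 8))
cols = rng.integers(8, size=2000)
dists = mwua(M, cols, eps=0.1)
alg_loss = float((dists * M[:, cols].T).sum())   # sum_t M(D_t, j_t)
best_loss = float(M[:, cols].sum(axis=1).min())  # min_i sum_t M(i, j_t)
# the fixed-rate guarantee derived in the proof that follows:
# alg_loss <= log(n)/eps + (1 + eps) * best_loss
assert alg_loss <= np.log(5) / 0.1 + (1 + 0.1) * best_loss
```

<p>The final assertion is exactly the pseudo-regret bound, so it holds for any choice of loss matrix and column sequence.</p>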
<p><strong>Fixed-rate MW theorem</strong>. Range \(t\in[T]\) and \(i\in[n]\). Fix any sequence of column selections \(j_t \). For any fixed \(i_* \in[n]\), the average performance of MWUA is characterized by the following pseudo-regret bound:
\[
\sum_t M(\mathcal{D}_t, j_t)\le \frac{\log n}{\epsilon}+(1+\epsilon)\sum_t M(i_*, j_t)\,\,,
\]
where dividing throughout by \(T\) and letting \(\epsilon\) shrink appropriately demonstrates vanishing regret.</p>
<p><em>Proof</em>. Bound the weight of our designated index by using induction.
\[
w_{i_* }^{(T)}=\prod_t(1-\epsilon M(i_* , j_t))\ge \prod_t(1-\epsilon)^{ M(i_* , j_t)}=(1-\epsilon)^{ \sum_t M(i_* , j_t)}\,\,,
\]
with the inequality holding by convexity of the \((1-\epsilon)\)-exponential function on the interval \(x\in[0,1]\), \((1-\epsilon)^x\le 1-\epsilon x\).</p>
<p>Next, we bound the potential
\[
\Phi_{t+1}=\sum_iw_i^{(t+1)}=\sum_iw_i^{(t)}(1-\epsilon M(i, j_t))=\Phi_t-\epsilon\Phi_t\sum_i \frac{w_i^{(t)}}{\Phi_t}M(i, j_t)\,\,,
\]
and then replacing the definition of \(\mathcal{D}_t\),
\[
\Phi_{t+1}=\Phi_t(1-\epsilon M(\mathcal{D}_t, j_t))\le \Phi_t\exp \left(-\epsilon M(\mathcal{D}_t, j_t)\right)\,\,,
\]
where we rely on the exponential inequality \(1+x\le e^x\) holding for all \(x\) by the Taylor expansion.</p>
<p>We put everything together with another induction, yielding
\[
\Phi_0\exp\sum_t-\epsilon M(\mathcal{D}_t, j_t)\ge \Phi_T\ge w_{i_* }^{(T)}\ge (1-\epsilon)^{ \sum_t M(i_* , j_t)}\,\,.
\]
Noticing \(\Phi_0=n\), taking logarithms of both sides, and shifting terms to opposite sides of the inequality, we end up at
\[
\log n - \log(1-\epsilon)\sum_tM(i_* , j_t)\ge \epsilon \sum_t M(\mathcal{D}_t, j_t)\,\,.
\]
At this point, applying the inequality \(-\log(1-\epsilon)\le \epsilon(1+\epsilon)\), which only holds for \(0\le \epsilon\le 1/2\), and dividing throughout by \(\epsilon\) gives us the theorem.</p>
<p>To show the inequality, notice it holds tightly for \(\epsilon=0\). Taking derivatives yields \((1-\epsilon)^{-1},1+2\epsilon\). But the former is a geometric series \(1+\epsilon+\epsilon^2+\cdots\), with second and higher order terms bounded above by \(\epsilon\) precisely as long as \(\epsilon\le 1/2\).</p>
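<p>The two elementary inequalities used in the proof, \((1-\epsilon)^x\le 1-\epsilon x\) for \(x\in[0,1]\) and \(-\log(1-\epsilon)\le \epsilon(1+\epsilon)\) for \(\epsilon\le 1/2\), are easy to spot-check numerically (a sanity check, not a proof):</p>

```python
import numpy as np

eps = np.linspace(1e-6, 0.5, 1001)   # the 0 < eps <= 1/2 regime
x = np.linspace(0, 1, 101)

# (1 - eps)^x <= 1 - eps * x on x in [0, 1], by convexity
lhs = (1 - eps[:, None]) ** x[None, :]
assert np.all(lhs <= 1 - eps[:, None] * x[None, :] + 1e-12)

# -log(1 - eps) <= eps * (1 + eps), valid only up to eps = 1/2
assert np.all(-np.log(1 - eps) <= eps * (1 + eps) + 1e-12)
assert not (-np.log(1 - 0.9) <= 0.9 * (1 + 0.9))  # fails past 1/2
```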
<h2 id="stock-market-experts">Stock Market Experts</h2>
<p>There’s a fairly classical example here where we ask \(n\) experts whether they think the stock market will go up or down the next day. We can view this as an \(n\times 2^n\) binary matrix: we choose one of the experts’ advice each day, and each column corresponds to one of the \(2^n\) binary outcomes of whether each expert was correct.</p>
<p>The theorem then guarantees that if we pick an expert according to the MWUA we’ll on average have a correct guess rate within a diminishing margin of the best expert.</p>
<p>But that example is kind of boring and simplified.</p>
<p>Maybe a more realistic setting would have experts being Vanguard ETFs. Monthly, we assess their log-returns and subtract out S&P 500 log returns.</p>
<p>What’s more, we can get the exact multiplicative weights average performance guarantee by simply allocating our portfolio according to the expert weights. Then it’d be much more interesting to see how such a portfolio, e.g., with \(\epsilon = 0.01\), would perform against</p>
<ul>
<li>The best “expert” (Vanguard ETF),</li>
<li>Uniform investment in all ETFs,</li>
<li>And the S&P500 directly,</li>
</ul>
<p>over various timeframes.</p>
<p>One nice thing about this example is it exemplifies the online nature of the algorithm: you don’t need to specify \(M\) up-front, and can still capture adversarial phenomena like stock markets (as long as you’re not such a large player that you start affecting the stock market with your live orders).</p>
<p>As the rest of the survey explores, this is really useful when \(M\) is impossibly large but it’s easy to find the adversarial column \(j_t\) using its structure.</p>
<h2 id="the-epsilon-constant">The \(\epsilon\) Constant</h2>
<p>To me, the elephant in the room is still this \(\epsilon\le 1/2\) that I choose. Let’s see how the literature suggests we choose this number.</p>
<p>Let’s bound first the “best action”. We’ll work with a slightly more common scenario where the largest entry of \(M\) is \(\rho\) (to apply MWUA one simply needs to feed loss \(M/\rho\), then in the final equation we need to replace all \(M\) with \(M/\rho\) as well).</p>
<p>Then \(\min_i M(i, j_t)\le \rho\). More generally, any upper bound \(\lambda\) on the game value \(\lambda^*=\max_\mathcal{P}\min_iM(i,\mathcal{P})=\min_\mathcal{D}\max_jM(\mathcal{D}, j)\) suffices:</p>
<p>\[
\min_i M(i, j_t)\le\max_j\min_i M(i, j)\le \lambda^*\le \lambda
\]</p>
<p>Based on this, the pseudo-regret bound above, which now requires only knowledge of \(\lambda\) (and \(\rho\) for rescaling), reduces to:
\[
\frac{1}{T}\sum_t M(\mathcal{D}_t, j_t) - \frac{1}{T}\sum_t M(i_* , j_t)\le \frac{\rho \log n}{\epsilon T} +\epsilon \lambda
\]</p>
<p>Then for a fixed budget \(T\) it’s easy enough to observe that the optimal rate is \(\epsilon_* (T)=\sqrt{\frac{\rho \log n}{\lambda T}} \) (assuming values are large enough that we needn’t worry about \(\epsilon \le 1/2 \)), giving a gap of \(2\sqrt{\frac{\lambda \rho \log n}{T }} \).</p>
<p>Analogously, for fixed up-front \(\epsilon\), we get an optimal \(T_*(\epsilon)=\frac{\rho \log n}{\lambda \epsilon^2}\) with a gap of \(2\epsilon\lambda\).</p>
<p>Of course we’ll do best by choosing \(\lambda=\lambda^* \), but figuring that value out takes solving the game, which is what we’re trying to do in the first place (<a href="https://cseweb.ucsd.edu/~yfreund/papers/games_long.pdf">Freund and Schapire 1999</a>).</p>
<p>In some scenarios, we might be looking to reach some absolute regret \(\delta\) as fast as possible. Corollary 4 of <a href="https://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf">the survey</a> essentially makes the same \(\rho=\lambda \ge \lambda^* \) upper bound; since at best we can have \(2\epsilon \lambda = \delta \), the horizon \(T\) should be \( \frac{4\rho\lambda \log n }{\delta^2} \).</p>
<p>Note Corollary 4 is worse than this by a factor of 2 because Arora’s survey generalizes to negative and positive losses, but then needs to use a weak upper bound of \(0\) for the negative losses.</p>
<h2 id="some-motivation">Some Motivation</h2>
<p>The above approaches gave us a few settings for \(\epsilon\).</p>
<ul>
<li>If you know your time horizon \(T\), use \(\epsilon_*(T) \).</li>
<li>If you want to get to regret \(\delta\), use \(\epsilon_\delta =\frac{\delta}{2\lambda} \) and \(T_*(\epsilon_\delta ) \).</li>
</ul>
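<p>The two recipes above can be written down directly (hypothetical helper names; \(n\), \(\rho\), \(\lambda\) as in the previous section):</p>

```python
import math

def eps_star(T, n, rho, lam):
    # optimal fixed rate for a known horizon T
    return math.sqrt(rho * math.log(n) / (lam * T))

def T_star(eps, n, rho, lam):
    # horizon at which a fixed rate eps is optimal
    return rho * math.log(n) / (lam * eps ** 2)

def horizon_for_regret(delta, n, rho, lam):
    # rounds needed to reach average regret delta: T_star(eps_delta)
    eps_delta = delta / (2 * lam)
    return T_star(eps_delta, n, rho, lam)  # = 4 * rho * lam * log(n) / delta**2

n, rho, lam = 100, 1.0, 1.0
T = horizon_for_regret(0.05, n, rho, lam)
# consistency: at that horizon the optimal rate is eps_delta again
assert abs(eps_star(T, n, rho, lam) - 0.05 / (2 * lam)) < 1e-12
```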
<p>In fact, for settings where \(\rho=1\), the Freund and Schapire paper finds you cannot improve on this \(T_*(\epsilon_\delta)\) rate even by a constant with any online algorithm. So that’s good to know, it’s just downhill from there.</p>
<p>The Arora paper further shows that as long as \(\rho = O(n^{1/8})\) the lower bound
\[
T=\Omega\left(\frac{\rho^2\log n}{\delta^2}\right)
\]
holds up, where they find a matrix with \(\lambda^*=\Omega(\rho)\).</p>
<p>I’m not aware of other regimes for lower bounds (I <a href="https://www.google.com/search?q=multiplicative%20weights%20lower%20bound%20site%3Ahttp%3A//proceedings.mlr.press/%2A/%23open-problem">Googled around</a>; these still look like open problems since 2012!).</p>
<p>These lower bounds are great, in that they tell us that we should stop looking for improvements. They also tell us that MWUA is optimal and if we want to mess with this setting it’s a good target.</p>
<p>However, MWUA as presented above is not useful in truly online real-life scenarios. Take our <a href="#stock-market-experts">Stock Market Experts</a> example. I don’t know my horizon \(T\). Maybe it’s an exercise in modeling when I want to retire, but it sure would be nice to have a portfolio with a guarantee that I can pull out of the market any time and it won’t be a problem in terms of regret guarantees.</p>
<p>Note that’s not the case with fixed rates: if I build a portfolio using \(\epsilon_* (T_1) \) but pull out early at \(T_2 < T_1 \), my regret ratio can underperform by a factor growing in \(\sqrt{ T_1 /T_2 }\) compared to the \(\epsilon_* (T_2) \) portfolio, and vice versa if I stay in too long.</p>
<p>What would give me peace of mind would be an algorithm with the following guarantee: for any time horizon \(T\), we get performance within some fixed constant of the expected regret of the fixed-rate algorithm run with the optimal \(\epsilon_* (T) \).</p>
<p>What’s more, I need to know \(\lambda,\rho\) to do well (or just \(\rho\) if I don’t think the zero-sum game defined by \(M\) is tilted in my favor, where \(\lambda\sim\rho\)). If something could figure out \(\rho\) too, that’d be great. Arora proposes doing this by using the doubling trick.</p>
<h2 id="experiments">Experiments</h2>
<p>It’s easy to see that the longer the time horizon, the smaller the learning rate should be. Choosing a rule like \(\epsilon_t = \frac{1}{2\sqrt{t}}\) does well in an adversarial environment.</p>
<p>We create game matrices \(M\) of various sizes with entries sampled from a symmetric \(\text{Beta}(0.5, 0.5)\) and compare performance across different \(\epsilon\). <code class="language-plaintext highlighter-rouge">opt</code> is the optimal value \(\lambda^* \) in the games below, which we use to plot \(T_* \) for each of our fixed MWUA runs. At each time \(T\), we plot the optimality gap:
\[
\frac{1}{T}\sum_{t}M(\mathcal{D}_t, j_ t) - \lambda^*\,\,.
\]</p>
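<p>A miniature version of this experiment can be sketched as follows; the greedy column player here is an assumed stand-in for whatever adversary the linked code implements:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 25, 40, 5000
M = rng.beta(0.5, 0.5, size=(n, m))   # entries ~ symmetric Beta(0.5, 0.5)

def avg_loss(rate_fn):
    # MWUA against a greedy column player: each round the adversary
    # picks the worst column for the current distribution D_t
    w = np.ones(n)
    total = 0.0
    for t in range(1, T + 1):
        d = w / w.sum()
        j = int(np.argmax(d @ M))
        total += d @ M[:, j]
        w = w * (1 - rate_fn(t) * M[:, j])
    return total / T

fixed = avg_loss(lambda t: 0.1)                    # a fixed rate
decaying = avg_loss(lambda t: 0.5 / np.sqrt(t))    # eps_t = 1 / (2 sqrt(t))
# both averages necessarily land in [0, 1]; how close each gets to the
# game value is what the plots that follow explore
assert 0.0 <= fixed <= 1.0 and 0.0 <= decaying <= 1.0
```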
<p><img src="/assets/2019-12-25-stop-anytime-multiplicative-weights/25x40.png" alt="25 by 40" class="center-image" /></p>
<p><img src="/assets/2019-12-25-stop-anytime-multiplicative-weights/10x100.png" alt="10 by 100" class="center-image" /></p>
<p><img src="/assets/2019-12-25-stop-anytime-multiplicative-weights/5x200.png" alt="5 by 200" class="center-image" /></p>
<p><a href="https://github.com/vlad17/mw">Code</a> @ <code class="language-plaintext highlighter-rouge">af5ad62</code></p>
<p>What’s super curious here is that square-root decay <strong>dominates</strong> all of the fixed-rate ones, even at their optimal \(T_* \).</p>
<p>Another curiosity is that \(T_*\) looks really off for the final case, where 5 experts square off against an extremely adversarial environment where the column player can choose from 200 columns. To be honest, I don’t know what’s going on here.</p>
<h2 id="related-work">Related Work</h2>
<p>A weaker version of my requirements above might be methods robust to these concerns:</p>
<ol>
<li>Anytime, so a single algorithm works for all \(T\).</li>
<li>Scale-free, so a single algorithm works for all \(\rho\).</li>
</ol>
<p>I was able to find these relevant notes, which more or less put all the issues mentioned above to rest:</p>
<ul>
<li>Sequential decision-making <a href="https://inst.eecs.berkeley.edu/~ee290s/fa18/scribe_notes/EE290S_Lecture_Note_10.pdf">Lecture 10</a>, <a href="https://inst.eecs.berkeley.edu/~ee290s/fa18/scribe_notes/EE290S_Lecture_Note_11.pdf">Lecture 11, Part 1</a>, <a href="https://inst.eecs.berkeley.edu/~ee290s/fa18/scribe_notes/EE290S_Lecture_Note_11_2.pdf">Lecture 11, Part 2</a>, <a href="https://inst.eecs.berkeley.edu/~ee290s/fa18/scribe_notes/EE290S_Lecture_Note_12.pdf">Lecture 12</a></li>
<li><a href="https://arxiv.org/abs/1301.0534">Elegant AdaHedge</a>, so called because it’s the second version of AdaHedge and it doesn’t use budgeting.</li>
<li><a href="https://www.quora.com/What-is-an-intuitive-explanation-for-the-AdaHedge-algorithm">Steven’s Quora Answer</a></li>
</ul>
<p>In short, I want to summarize what I found as the best resources in a field that’s quite saturated: (<a href="https://cseweb.ucsd.edu/~yfreund/papers/games_long.pdf">Freund and Schapire 1999</a>) as the original work and the elegant write-up <a href="https://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf">in Arora’s survey</a>.</p>
<p>Further, <a href="https://arxiv.org/abs/1301.0534">Elegant AdaHedge</a> is both anytime and scale-free.</p>
<p>A recent analysis of the <a href="https://arxiv.org/abs/1809.01382">Decreasing Hedge</a>, shown above as the square-root decay version of hedge, helps tidy some things up.</p>
<p>We also have some specializations:</p>
<ul>
<li>Constant FTL Regret (FlipFlop, from Elegant AdaHedge paper) - constant-factor performance for the worst case and additive-constant performance compared to the follow-the-leader algorithm.</li>
<li>Universal Hedge (Decreasing Hedge*) - the “perform within a constant factor of the optimal-constant MWUA for that horizon” guarantee.</li>
<li>Stochastic Optimality (Decreasing Hedge*) - perform the best when the column player plays randomly (i.e., all experts take losses that are just fixed random variables over time)</li>
</ul>
<p>*Importantly, decreasing hedge isn’t scale-free, so these claims only hold for \(\rho=1\).</p>
Sun, 05 Jan 2020 00:00:00 +0000
https://vlad17.github.io/2020/01/05/stop-anytime-multiplicative-weights.html
https://vlad17.github.io/2020/01/05/stop-anytime-multiplicative-weights.htmlmachine-learningMetaphysics of Causality<h1 id="metaphysics-of-causality">Metaphysics of Causality</h1>
<p>If you read Judea Pearl’s <em>The Book of Why</em>, it makes it seem like exercising observational statistics makes you an ignoramus. Look at the stupid robot:</p>
<p><img src="/assets/2019-12-01-metaphysics-of-causality/ladder.png" alt="ladder" class="center-image" /></p>
<p>Pearl and Jonas Peters (in <em>Elements of Causal Inference</em>) both make a strong distinction, it seems at the physical level, between causal and statistical learning. Correlation is not causation, as it goes.</p>
<p>From a deeply (<a href="https://en.wikipedia.org/wiki/Subjective_idealism">Berkeley-like</a>) skeptical lens, where all that we can be sure of is what we observe, it seems that we nonetheless can recover nice properties of causal modeling even as associative machines through something we can call the <em>Epistemic Backstep</em>.</p>
<p>This is less of a declaration of me knowing better, and more of an attempt to put into words a different take I had as I was reading the aforementioned works.</p>
<h2 id="our-shining-example">Our Shining Example</h2>
<p>Intuitively, the difference between cause and effect seems to be a fundamental property of nature. Let \(B\) be barometric pressure and \(G\) be a pressure gauge’s reading. We can build a structural causal model (SCM), which is some equations which are tied to the edges and vertices of a directed acyclic graph (DAG):
\[
B\rightarrow G
\]
where \(B\) is the cause and \(G\) is the effect. It’s clear to us that the former is a <em>cause</em> because of what interventions do.</p>
<p>If we intervene on \(B\), say, by increasing our elevation, then the gauge starts reading a lower number. There’s clearly a functional dependence there (or a statistical one, say, in the presence of measurement noise).</p>
<p>If we intervene on \(G\), by breaking the glass and turning the measurement needle, our eardrums don’t pop no matter how low we turn the needle.</p>
<p>We point at this asymmetry and say, this is causality in the real world.</p>
<h2 id="the-epistemic-backstep">The Epistemic Backstep</h2>
<p>But now I ask us to take a step back. Why does this example even make sense to us, evoking vivid imagery about how ridiculous a ruptured eardrum would be due to manually changing a barometer’s needle?</p>
<p>Well, it turns out that we have, through media or real-life experiences, learned about and observed barometers. In science class, we may have read about or seen or heard how they turn as pressure changes.</p>
<p>We may never have broken a barometer and changed its needle position, but we’ve certainly seen enough glass being broken in the past and needles moving that we can put two and two together and imagine what that would look like. In those situations, the thing that the needle measures rarely changes.</p>
<p>Stepping back a bit, it turns out that we actually have a lot of observations of some kind of environmental characteristic \(C\) (which might be temperature or pressure), its corresponding entailment \(E\) (a thermometer or barometer reading), and an indicator \(I\) for whether we interacted with the measurement itself: here, “did we increase the reading of our measurement manually.”</p>
<p>So, we actually have a lot of observational evidence of the more generalized system \((C, E, I)\).</p>
<ol>
<li>We’ve seen how barometers read high numbers \(E=1\) at high pressure \(C=1\) by being at low altitude and observing a functioning barometer. We did not mess with the barometer. \((C, E, I)=(1, 1, 0)\).</li>
<li>We’ve seen how barometers behave at high altitude \((0, 0, 0)\).</li>
<li>We’ve seen how ovens increase the temperature in the attached thermometer \((1, 1, 0)\).</li>
<li>How we don’t have a fever if our mom measures our temperature and we’re healthy \((0, 0, 0)\).</li>
<li>After we do jumping jacks to raise our temperature, to get out of school, we see that it works \((0, 1, 1)\).</li>
</ol>
<p>Given a bunch of situations like this, and taking some liberty in our ability to generalize here, it’s totally reasonable that we can come up with the rule \(E=\max(C, I)\) given observational data alone. We might go even further, and model a joint probability on \((C, E, I)\) where the conditional probability distribution of \(C\) given \(E=1,I=1\) ends up just being the marginal probability of \(C\):
\[
p(C) = p(C|E=1,I=1)\,,
\]
as opposed to what happens for \(E\)
\[
\forall i\,,\,\,\,p(E|C=1, I = i) = 1_{E=1}\,.
\]
These <em>observations</em> make for natural matches for causal inference, from which we can infer that there won’t be much effect on pressure by changing the barometer, but we <em>could have known this</em> (at least in theory) just by building up an associative model for what happens when you manually override what the measurement tool says.</p>
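<p>The two displayed conditionals can be checked mechanically on a toy joint where \(C\) and \(I\) are independent coin flips and \(E=\max(C,I)\); the marginals below are assumed purely for illustration:</p>

```python
from itertools import product

p_c, p_i = 0.3, 0.2            # assumed marginals for C and I
joint = {}
for c, i in product([0, 1], repeat=2):
    e = max(c, i)              # the learned rule E = max(C, I)
    joint[(c, e, i)] = (p_c if c else 1 - p_c) * (p_i if i else 1 - p_i)

def p(pred):
    # probability of an event over the (C, E, I) joint
    return sum(pr for (c, e, i), pr in joint.items() if pred(c, e, i))

# p(C=1 | E=1, I=1) equals the marginal p(C=1): forcing the gauge
# tells us nothing about the pressure
cond_c = p(lambda c, e, i: c == 1 and e == 1 and i == 1) / p(lambda c, e, i: e == 1 and i == 1)
assert abs(cond_c - p_c) < 1e-9

# p(E=1 | C=1, I=i) = 1 for either i: high pressure forces a high reading
for i_val in (0, 1):
    cond_e = p(lambda c, e, i: e == 1 and c == 1 and i == i_val) / p(lambda c, e, i: c == 1 and i == i_val)
    assert cond_e == 1.0
```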
<p>By considering the associative model over a wider universe, a universe that includes the interventions <em>themselves</em> as observed variables, and having a strong ability to generalize between related interventions and settings, we can view our causal inference as solely an associative one.</p>
<h2 id="in-short">In Short</h2>
<p>The epistemic backstep proceeds by adding a new variable, \(F\), for “did you fuck with the system you’re modeling,” capturing all the essential ways in which you can fuck with it, and in this manner we can seemingly reduce causal questions to those that in theory an associative machine could answer.</p>
<p>Maybe this is a cheap move, just moving intervention into the arena of observations. I still think it’s a fairly powerful reduction: we can tell things about gauges and barometers having never experimented with them before, as long as we can solve the transfer learning problem between those and settings where we <em>have</em> messed with measuring natural phenomena, maybe thinking back to weight scales in a farmer’s market or trying to get out of school by raising our temperature.</p>
<p>So, we can view randomized controlled trials in this context not as different experimental settings, but rather a way to collect data in a region of the \((C, E, I)\) space that might be sparsely populated otherwise (so we’d have a tough time fitting the data there).</p>
<p>It’s important to note that <em>it doesn’t matter</em> that it’s more convenient for us to model causal inference with DAGs.</p>
<ul>
<li>You may say something like “well, how did you know that jumping jacks would help raise your temperature?”</li>
<li>This suggests that humans do really think causally.</li>
<li>However, the above is a psychological claim about humans, rather than a metaphysical claim about causality.</li>
<li>For all we know, an associative machine may have an exploration policy where, on some days, it sets \(I=1\) just to see what can happen. After gathering some data, it builds something equivalent to a causal model, but without ever explicitly constructing any DAGs.</li>
</ul>
<h2 id="full-circle">Full Circle</h2>
<p>For what it’s worth, maybe the best way to model our new joint density \((C, E, I)\) is by first identifying the causal DAG structure, constructing a full SCM by fitting conditional functions, and then using that SCM for our predictions.</p>
<p>But that seems presumptuous. Surely, viewing this as a more abstract statistical learning problem, there might be more generic ways of finding representations that help us efficiently learn the “full joint” which includes interventions.</p>
<p>Another interesting point is asking questions about counterfactuals. Personally, I don’t find counterfactuals that useful (unless blame is an end in itself), but that’s a discussion for another time. I didn’t want to muddy the waters above, but there’s an example of an associative counterfactual analysis below with the epistemic backstep.</p>
<p>Note that the transfer notions introduced here aren’t related to the <a href="https://arxiv.org/abs/1301.2312">Tian and Pearl</a> transportability between different environments, where the nodes of your SCM stay the same (<a href="https://ftp.cs.ucla.edu/pub/stat_ser/r402.pdf">see here for further developments</a>). What I’m talking about is definitely more of a transfer learning problem, where you’re trying to perform a natural matching based on your past experiences, and it’s learning this matching function that’s interesting to study.</p>
<p>So in sum we have an interesting take on <a href="https://plato.stanford.edu/entries/causation-probabilistic/">Regularity Theory</a>, which doesn’t have the usual drawbacks. Maybe all of this is a grand exercise in identifying a motivation for Robins’ G-estimation. In any case it was fun to think about so here we are.</p>
<h2 id="another-worked-example">Another worked example</h2>
<p>Let’s first look at <a href="https://ftp.cs.ucla.edu/pub/stat_ser/r301-final.pdf">Pearl’s firing squad</a>.</p>
<p><img src="/assets/2019-12-01-metaphysics-of-causality/firing-squad.png" alt="firing squad" class="center-image" /></p>
<p>Say the captain gave the order to fire, both \(R_1,R_2\) did so, and the prisoner died. Now, what would have happened had \(R_1\) not fired?</p>
<p>Pearl says the associative machine breaks down here: it’s a contradiction, since the rifleman always fires when the captain gives the order. So why aren’t we confused when we think about it?</p>
<p>Step back: consider a universe where riflemen can refuse to follow orders. The first rifleman may now be mutinous, \(M\) (add an arrow \(M\rightarrow R_1\)).</p>
<p>In situations where the first rifleman is mutinous but the second isn’t, it’s pretty clear what will happen: the second rifleman still fires, and the prisoner is still shot dead.</p>
<p>To me, it’s only because I’ve seen a lot of movies, read books, heard poems where there’s a duty to disobey that I could reason through this. If all of my experience up to this point has confirmed that riflemen <em>always</em> fire when their commanding officer tells them to, I would’ve been as confused as our associative machine at the counterfactual question.</p>
<p>To close up, we have one big happy joint model
\[
p(C, M, R_1, R_2, D)\,,
\]
and asking the counterfactual is just asking what the value of
\[
p(D=1|C=1, M=1, R_1=0, R_2=1)
\]
is, which is something we can answer given our wider set of observations and the ability to generalize.</p>
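<p>A sketch of this widened firing-squad joint, with my own illustrative assumptions that \(C\) and \(M\) are independent fair coins and the mechanisms are deterministic:</p>

```python
import itertools

# Widened firing-squad joint over (C, M, R1, R2, D): order C and mutiny M
# are independent fair coins (illustrative assumptions); R1 fires iff
# ordered and not mutinous, R2 fires iff ordered, death iff anyone fires.
joint = {}
for c, m in itertools.product([0, 1], repeat=2):
    r1 = int(c == 1 and m == 0)
    r2 = c
    joint[(c, m, r1, r2, max(r1, r2))] = 0.25

# The "counterfactual" is now an ordinary conditional in the wider universe:
den = sum(v for t, v in joint.items() if t[:4] == (1, 1, 0, 1))
num = sum(v for t, v in joint.items() if t == (1, 1, 0, 1, 1))
print(num / den)  # 1.0: the prisoner still dies
```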
Sun, 01 Dec 2019 00:00:00 +0000
https://vlad17.github.io/2019/12/01/metaphysics-of-causality.html
https://vlad17.github.io/2019/12/01/metaphysics-of-causality.html
philosophy
The Triple Staple
<h1 id="the-triple-staple">The Triple Staple</h1>
<p>When reading, I prefer paper to electronic media. Unfortunately, a lot of my reading involves manuscripts from 8 to 100 pages in length, with the original document being an electronic PDF.</p>
<p>Double-sided printing resolves this issue partway. It lets me convert PDFs into paper documents I can focus on, and it works great up to 15 pages: I print the document out and staple it. I’ve tried not stapling the printed pages before, but then the individual sheets frequently get out of order or end up all over the place.</p>
<p><strong>However</strong>, for larger manuscripts I frequently found myself in a pickle:</p>
<ul>
<li>I don’t want to manage loose leaf pages individually.</li>
<li>Staplers that can handle stapling over 15 pages don’t occur naturally, at least near the printers I’m around.</li>
</ul>
<p>Attempting to use a stapler beyond its capacity does not end successfully.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/the-problem.png" alt="weak staplers" class="center-image" /></p>
<p>For a good deal of my life I’ve resigned myself to dealing with a reality of mediocre staplers and even more mediocre workarounds, e.g., a packet on a single topic now needs be represented by 3 independent, separately-stapled documents, which is 2 too many.</p>
<p>I’m confident many others have this problem too. To that end, I’d like to introduce a life hack for all situations where you have documents of up to \(2X\) pages and staplers with penetration power rated at \(X\) pages.</p>
<h2 id="the-problem">The Problem</h2>
<p>I want to staple this thick paper stack.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/initial-conditions.png" alt="initial conditions" class="center-image" /></p>
<p><em>Optimality criteria</em>.</p>
<p>(A) Grip strength of resulting staple.</p>
<p>(B) Non-obstruction of reading material.</p>
<h2 id="solution">Solution</h2>
<ol>
<li>Staple pages \(1\) to \(X\).</li>
<li>Staple pages \(X+1\) to \(2X\).</li>
<li>Peel back the corner of pages \(1\) to \(\lfloor X/2\rfloor\) over the staple. Repeat for pages \(\lfloor 3X/2\rfloor\) to \(2X\).</li>
<li>Insert the exposed corner of pages \(\lfloor X/2\rfloor +1\) to \(\lfloor 3X/2\rfloor - 1\) into the stapler, making sure the folded-away corners of the outer pages are out of the stapler’s line of fire.</li>
<li>Apply the stapler to the middle pages, then fold the outer pages’ corners back up.</li>
</ol>
<h2 id="results">Results</h2>
<p>Step 1 and 2.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-one.png" alt="step 1 and 2" class="center-image" /></p>
<p>Step 3.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-three.png" alt="step 3" class="center-image" /></p>
<p>Step 4.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-four.png" alt="step 4" class="center-image" /></p>
<p>Step 5.</p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-five.png" alt="step 5" class="center-image" /></p>
<p>Additional results (skew angle, front, and back views).</p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-five1.png" alt="step 5 1" class="center-image" /></p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-five3.png" alt="step 5 2" class="center-image" /></p>
<p><img src="/assets/2019-11-30-the-triple-staple/step-five2.png" alt="step 5 3" class="center-image" /></p>
<h2 id="discussion-and-related-work">Discussion and Related Work</h2>
<p>(A) is met due to each staple holding together at least \(X\) pages. Contrast this with related work which only staples two pages \(X,X+1\) with an intermediate staple, resulting in a single point of failure at page \(X\).</p>
<p>(B) UX is equivalent to a single-stapled page, as opposed to binder-clip methodology which frequently requires clipping past the margin.</p>
<h2 id="future-work">Future Work</h2>
<p>There exists a straightforward alternating iteration of our method that can be shown, by induction, to apply to documents of length up to \(nX\) for any \(n\in\mathbb{N}\). We leave evaluation to future work.</p>
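<p>As a proof of concept for the alternating iteration (the function name and interface here are hypothetical, mine alone), here is a sketch that emits a staple plan generalizing the five steps above:</p>

```python
# Hypothetical helper: emit (first_page, last_page) staple spans for a
# document of `pages` pages and a stapler rated for `capacity` pages,
# alternating block staples with offset "bridge" staples at each boundary.
def staple_plan(pages, capacity):
    half = capacity // 2
    blocks = [(i + 1, min(i + capacity, pages))
              for i in range(0, pages, capacity)]
    # each bridge straddles a block boundary, gripping ~capacity pages
    bridges = [(end - half, end + half - 1) for _, end in blocks[:-1]]
    return blocks + bridges

print(staple_plan(30, 15))  # [(1, 15), (16, 30), (8, 21)]
```

<p>For \(2X=30\) pages and capacity \(X=15\), the bridge staple spans pages 8 through 21, matching steps 3 and 4.</p>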
Sat, 30 Nov 2019 00:00:00 +0000
https://vlad17.github.io/2019/11/30/the-triple-staple.html
https://vlad17.github.io/2019/11/30/the-triple-staple.html
tools, joke-post
Numpy Gems, Part 2
<h1 id="prngs">PRNGs</h1>
<p>Trying out something new here with a Jupyter notebook blog post. We’ll keep this short. Let’s see how it goes!</p>
<p>In this episode, we’ll be exploring random number generators.</p>
<p>Usually, you use pseudo-random number generators (PRNGs) to simulate randomness in simulations. In general, randomness is a great way of avoiding integrals: it’s cheaper to average a few samples than to integrate over the whole space, and averages tend to become accurate after just a few samples. This is the <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo method</a>.</p>
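<p>As a quick illustration of that averaging trick (not part of the gem itself), here’s estimating \(\pi\) by sampling instead of integrating:</p>

```python
import numpy as np

# Estimate pi by averaging a cheap indicator instead of integrating:
# the fraction of uniform points landing inside the unit circle is pi/4.
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(100_000, 2))
inside = (pts ** 2).sum(axis=1) <= 1
estimate = 4 * inside.mean()
print(estimate)  # ~3.14, accurate to roughly 0.01 at this sample size
```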
<p>That said, since the priority is speed here, and the more samples, the better, we want to take as many samples as possible, so parallelism seems viable.</p>
<p>This occurs in lots of scenarios:</p>
<ul>
<li>Stochastic simulations of physical systems for risk assessment</li>
<li>Machine learning experiments (e.g., to show a new training method is consistently effective)</li>
<li>Numerical estimation of integrals for scientific equations</li>
<li>Bootstrap estimation in statistics</li>
</ul>
<p>For all of these situations, we also usually want replicable studies.</p>
<p>Seeding is great for making the random PRNG sequence deterministic for one thread, but how do you do this for multiple threads?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">multiprocessing</span> <span class="kn">import</span> <span class="n">Pool</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">ttest_1samp</span>
<span class="k">def</span> <span class="nf">something_random</span><span class="p">(</span><span class="n">_</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">()</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">2056</span>
<span class="k">print</span><span class="p">(</span><span class="s">"stddev {:.5f}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="mi">1</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span>
<span class="k">with</span> <span class="n">Pool</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span>
<span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">something_random</span><span class="p">,</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span>
<span class="n">mu</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>stddev 0.02205
-0.03392958488974697
</code></pre></div></div>
<p>OK, so not seeding doesn’t save us: with a forked <code class="language-plaintext highlighter-rouge">Pool</code>, every worker inherits the parent’s PRNG state, giving us dependent trials. That can really mess up the experiment, and it prevents the very determinism we need!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">seeds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="mi">32</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">something_random</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seeds</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">()</span>
<span class="k">with</span> <span class="n">Pool</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span>
<span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">something_random</span><span class="p">,</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="n">mu</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-0.6038931772504026
</code></pre></div></div>
<p>The common solution I see for this is what we see above, or using <code class="language-plaintext highlighter-rouge">i</code> directly as the seed. It kind of works, in this case, but for the default numpy PRNG, the Mersenne Twister, it’s not a good strategy.</p>
<p><a href="https://docs.scipy.org/doc/numpy/reference/random/parallel.html#seedsequence-spawning">Here’s the full discussion</a> in the numpy docs.</p>
<p>To short circuit to the “gem” ahead of time, the solution is to use the new API.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">numpy.random</span> <span class="kn">import</span> <span class="n">SeedSequence</span><span class="p">,</span> <span class="n">default_rng</span>
<span class="n">ss</span> <span class="o">=</span> <span class="n">SeedSequence</span><span class="p">(</span><span class="mi">12345</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">16</span>
<span class="n">child_seeds</span> <span class="o">=</span> <span class="n">ss</span><span class="o">.</span><span class="n">spawn</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">something_random</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
<span class="n">rng</span> <span class="o">=</span> <span class="n">default_rng</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
<span class="k">return</span> <span class="n">rng</span><span class="o">.</span><span class="n">normal</span><span class="p">()</span>
<span class="k">with</span> <span class="n">Pool</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span>
<span class="n">mu</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">p</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">something_random</span><span class="p">,</span> <span class="n">child_seeds</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">mu</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-0.11130135587093562
</code></pre></div></div>
<p>That said, I think the fun part is in trying to break the old PRNG seeding method to make this gem more magical.</p>
<p>That is, the rest of this blog post will try to find non-randomness that occurs when you seed in an invalid way.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># aperitif numpy trick -- get bits, fast!
</span><span class="k">def</span> <span class="nf">fastbits</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="n">nbytes</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span> <span class="o">+</span> <span class="mi">7</span><span class="p">)</span> <span class="o">//</span> <span class="mi">8</span> <span class="c1"># == ceil(n / 8) but without using floats (gross!)
</span> <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">unpackbits</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="nb">bytes</span><span class="p">(</span><span class="n">nbytes</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">))[:</span><span class="n">n</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">timeit</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span> <span class="o">*</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>39.5 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">timeit</span>
<span class="n">fastbits</span><span class="p">(</span><span class="mi">10</span> <span class="o">*</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2.29 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Attempt 1: will lining up random
# streams break a chi-square test?
</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">10</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x1</span> <span class="o">=</span> <span class="n">fastbits</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="n">x2</span> <span class="o">=</span> <span class="n">fastbits</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">fastbits</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">chisquare</span>
<span class="k">def</span> <span class="nf">simple_pairwise</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
<span class="c1"># do a simple pairwise check on equilength arrays dof = 4 - 1
</span> <span class="c1"># build a contingency table for cases 00 10 01 11
</span> <span class="n">c</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">bincount</span><span class="p">(</span><span class="n">a</span> <span class="o">+</span> <span class="n">b</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">chisquare</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'random'</span><span class="p">,</span> <span class="n">simple_pairwise</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="n">x2</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'seeds 1-2'</span><span class="p">,</span> <span class="n">simple_pairwise</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="n">y1</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>random Power_divergenceResult(statistic=6.848932, pvalue=0.07687191550956339)
seeds 1-2 Power_divergenceResult(statistic=10000003.551559199, pvalue=0.0)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># And now let's try another approach!
</span>
<span class="kn">import</span> <span class="nn">tempfile</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="k">def</span> <span class="nf">size</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="s">'/tmp/x.bz2'</span><span class="p">):</span>
<span class="n">os</span><span class="o">.</span><span class="n">remove</span><span class="p">(</span><span class="s">'/tmp/x.bz2'</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/tmp/x'</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">tobytes</span><span class="p">())</span>
<span class="err">!</span> <span class="n">bzip2</span> <span class="o">-</span><span class="n">z</span> <span class="o">/</span><span class="n">tmp</span><span class="o">/</span><span class="n">x</span>
<span class="k">return</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">getsize</span><span class="p">(</span><span class="s">'/tmp/x.bz2'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">rbytes</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="nb">bytes</span><span class="p">(</span><span class="n">n</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span>
<span class="n">trials</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">trials</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">1000</span> <span class="o">*</span> <span class="mi">1000</span>
<span class="k">print</span><span class="p">(</span><span class="s">'random'</span><span class="p">,</span> <span class="n">size</span><span class="p">(</span><span class="n">rbytes</span><span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="n">trials</span><span class="p">)))</span>
<span class="n">re_seeded</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">trials</span><span class="p">):</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="n">re_seeded</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">rbytes</span><span class="p">(</span><span class="n">n</span><span class="p">))</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">re_seeded</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'seeds 0-255'</span><span class="p">,</span> <span class="n">size</span><span class="p">(</span><span class="n">a</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>random 257131407
seeds 0-255 257135234
</code></pre></div></div>
<p>OK, so bzip2 isn’t easily able to untangle any correlation between the streams (if it could, the concatenated bits from sequentially-seeded streams would compress better than the truly fresh ones).</p>
<p>We’ll need another approach.</p>
<p>There’s a lot of investment in PRNG quality tests.</p>
<p>However, we’re not interested in evaluating whether <em>individual</em> streams are random-looking, which they very well might be. Instead, we want to find out if there’s any dependence between streams. Above we just tried two tests for independence, but they didn’t work well (there’s a lot of ways to be dependent, including ways that don’t fail the chi squared test or bz2-file-size test).</p>
<p>That said, we can use a simple trick, which is to interleave streams from the differently-seeded PRNGs. If the streams are dependent, the resulting interleaved stream is not going to be a realistic random stream. This is from the <a href="https://www.iro.umontreal.ca/~lecuyer/myftp/papers/testu01.pdf">TestU01</a> docs. Unfortunately, my laptop couldn’t really handle running the full suite of tests… Hopefully someone else can break MT for me!</p>
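<p>For the curious, here is a minimal sketch of the interleaving construction (using the legacy <code class="language-plaintext highlighter-rouge">RandomState</code> API to mimic naive sequential seeding); feeding the result to a full test battery is left to a beefier machine:</p>

```python
import numpy as np

# Interleave byte streams from naively-seeded legacy Mersenne Twisters:
# if the per-seed streams are dependent, the interleaved stream should
# fail randomness tests that each individual stream passes.
def interleaved(seeds, n_per_stream):
    streams = [
        np.frombuffer(np.random.RandomState(s).bytes(n_per_stream), np.uint8)
        for s in seeds
    ]
    # column_stack + ravel emits one byte from each stream in turn
    return np.column_stack(streams).ravel()

x = interleaved(range(8), 1000)
print(x.shape)  # (8000,)
```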
<p><a href="/assets/2019-10-20-prngs.ipynb">Try the notebook out yourself</a></p>
Sun, 20 Oct 2019 00:00:00 +0000
https://vlad17.github.io/2019/10/20/prngs.html
https://vlad17.github.io/2019/10/20/prngs.html
tools
Compressed Sensing and Subgaussians
<h1 id="compressed-sensing-and-subgaussians">Compressed Sensing and Subgaussians</h1>
<p>Candes and Tao came up with a broad characterization of compressed sensing solutions <a href="https://statweb.stanford.edu/~candes/papers/RIP.pdf">a while ago</a>. Partially inspired by a past homework problem, I’d like to explore an area of this setting.</p>
<p>This post will dive into the compressed sensing context and then focus on a proof that squared subgaussian random variables are subexponential (the relation between the two will be explained).</p>
<h2 id="compressed-sensing">Compressed Sensing</h2>
<p>For context, we’re interested in the setting where we observe an \(n\)-dimensional vector \(\vy\) that is a random linear transformation \(X\) of a hidden \(p\)-dimensional vector \(\vx_*\):</p>
<p>\[
\vy = X\vx_*
\]</p>
<p>In general, in this setting, we could have \(p>n\). If we wanted to recover \(\vx_*\), the system may be underdetermined. So a least-squares solution \((X^\top X)^{-1}X^\top\vy\) may not exist or may be unstable due to very small \(\lambda_\min(X^{\top} X)\).</p>
<p>In cases where we have knowledge of sparsity, however, namely that \(\norm{\vx_*}_0=k\) with \(k\) less than both \(p\) and \(n\), we can actually recover \(\vx_*\).</p>
<p>In particular, the \(\ell_0\) estimator, which finds
\(
\vx_0=\argmin_{\vx:\norm{\vx}_0\le k}\norm{\vy-X\vx}_2
\), will converge, in the sense that the risk \(\E\norm{\vy-X\vx_0}_2\) is bounded above by \(O\pa{\frac{k\log p}{n}}\). This can be used to show that, under some straightforward assumptions on \(k,X\), we actually converge to the true answer \(\vx_*\). Moreover, while this method seems to depend on \(k\), we can imagine doing hyperparameter search on \(k\).</p>
<p>This all looks great, in that we can recover the original entries of sparse \(\vx_*\), but the problem is solving the minimization problem under the constraint \(\norm{\vx}_0\le k\) is computationally difficult. This is a non-convex set of points with at most \(k\) non-zero entries. We’d need to check every subset to find the optimum (<em>question to self:</em> do we really? You’d think that in a non-adversarial stochastic-\(X\) situation you might want to use \(2k\) instead of \(k\) and then use a greedy algorithm like backward selection and it’d be good enough).</p>
<p>This is why Tao and Candes’ work is so cool. They take the efficiently-computable LASSO estimator,
\[
\vx_\lambda = \argmin_{\vx}\norm{\vy-X\vx}_2^2+\lambda\norm{\vx}_1\,,
\]
and show that under a certain condition on \(X\), the <em>Restricted Isometry Property</em> (RIP), \(\vx_\lambda = \vx_0\). Note the convex \(\ell_1\) penalty replaces the non-convex \(\ell_0\) constraint, so this is an unconstrained convex problem. In essence, the RIP requires that every submatrix formed from a small number of columns of \(X\) has nearly unit singular values, so \(X\) acts almost as an isometry on sparse vectors. Technically, there’s a relaxed condition called the restricted eigenvalue condition, implied by RIP, under which we get a weaker result: LASSO has the same risk as \(\ell_0\).</p>
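<p>To see that the \(\ell_1\)-penalized problem really is efficiently solvable, here is a minimal proximal-gradient (ISTA) sketch in numpy; this is just one of many LASSO solvers, and the problem sizes, penalty, and iteration count below are made up for illustration.</p>

```python
import numpy as np

def ista_lasso(X, y, lam, iters=5000):
    """Proximal gradient descent for 0.5*||y - X x||^2 + lam*||x||_1."""
    # step size 1/L, where L is the Lipschitz constant of the smooth part
    L = np.linalg.norm(X, 2) ** 2
    x = np.zeros(X.shape[1])
    for _ in range(iters):
        z = x - X.T @ (X @ x - y) / L  # gradient step on the quadratic part
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x

rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))
x_star = np.zeros(p)
x_star[[2, 11, 37]] = [1.5, -2.0, 3.0]
y = X @ x_star
x_hat = ista_lasso(X, y, lam=0.1)
```

<p>Each iteration is just a matrix-vector product and a soft-threshold, so the whole thing is polynomial time, in contrast with the subset search above; on this toy instance the recovered support matches that of \(\vx_*\).</p>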
<p>All this is motivation for understanding the question: <strong>what practical conditions on \(X\) ensure the RIP?</strong></p>
<p>It turns out we can characterize a broad class of distributions for the entries of \(X\) that enable this.</p>
<h2 id="subgaussian-random-variables">Subgaussian Random Variables</h2>
<p>Subgaussian random variables have light tails, no heavier than a Gaussian’s. In particular, \(Y\in\sg(\sigma^2)\) when, for all real \(\lambda\),
\[
\E\exp(\lambda Y)\le\exp\pa{\frac{1}{2}\lambda^2\sigma^2}\,.
\]</p>
<p>By the Taylor expansion of \(\exp\), Markov’s inequality, and elementary properties of expectation, we can use the above to show all sorts of properties.</p>
<ul>
<li>Subgaussian variance. \(\var Y\le \sigma^2\)</li>
<li>Zero mean. \(\E Y = 0\)</li>
<li>2-homogeneity. \(\alpha Y\in\sg(\sigma^2\alpha^2)\)</li>
<li>Light tails. \(\P\ca{\abs{Y}>t}\le 2\exp\pa{\frac{-t^2}{2\sigma^2}}\)</li>
<li>Additive closure. \(Z\in\sg(\eta^2 )\independent Y\) implies \(Y+Z\in\sg(\sigma^2+\eta^2)\)</li>
<li>Higher moments. \(\E Y^{4k}\le 8k(2\sigma)^{4k}(2k-1)!\)</li>
</ul>
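<p>The defining MGF bound is easy to check concretely. As a sanity-check sketch using only the standard library: a Rademacher variable \(\epsilon\) uniform on \(\{\pm 1\}\) has \(\E\exp(\lambda\epsilon)=\cosh\lambda\), and comparing Taylor series termwise shows \(\cosh\lambda\le\exp(\lambda^2/2)\), i.e. \(\epsilon\in\sg(1)\).</p>

```python
import math

def rademacher_mgf(lam):
    # E exp(lam * eps) for eps uniform on {-1, +1}
    return math.cosh(lam)

def subgaussian_bound(lam, sigma2=1.0):
    return math.exp(0.5 * lam * lam * sigma2)

# cosh(lam) <= exp(lam^2 / 2) for every lam: termwise,
# lam^(2k)/(2k)! <= lam^(2k)/(2^k k!)
checks = [rademacher_mgf(l) <= subgaussian_bound(l) + 1e-12
          for l in (x / 10 for x in range(-50, 51))]
```

<p>The grid of \(\lambda\) values is arbitrary; the termwise series comparison in the comment is what makes the bound hold everywhere.</p>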
<h2 id="subexponential-random-variables">Subexponential Random Variables</h2>
<p>Subexponential random variables are like subgaussians, but their tails can be heavier: the MGF bound is only required near zero. In particular, \(Y\in\se(\sigma^2,s)\) satisfies the defining inequality for \(\sg(\sigma^2)\) whenever \(\abs{\lambda}<s\).</p>
<p>We don’t really need to know much else about these, but it’s clear we can show similar additive closure and homogeneity properties as in the subgaussian case as long as we do bookkeeping on the second parameter \(s\).</p>
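<p>For a concrete example (a standalone check using the closed-form chi-squared MGF, with loosely chosen constants): if \(Z\) is standard normal then \(\E\exp\pa{\lambda(Z^2-1)}=e^{-\lambda}/\sqrt{1-2\lambda}\), which is finite only for \(\lambda<1/2\), yet satisfies the subgaussian-style bound \(\exp(2\lambda^2)\) near zero, so \(Z^2-1\in\se(4,1/4)\).</p>

```python
import math

def centered_chi2_mgf(lam):
    # E exp(lam * (Z^2 - 1)) for Z ~ N(0, 1); finite only for lam < 1/2
    return math.exp(-lam) / math.sqrt(1.0 - 2.0 * lam)

# Near zero the MGF obeys the subexponential bound exp(2 * lam^2) ...
ok_near_zero = all(
    centered_chi2_mgf(l) <= math.exp(2.0 * l * l) + 1e-12
    for l in (x / 1000 for x in range(-250, 251))
)
# ... but it blows up as lam approaches 1/2, so no global subgaussian
# bound with a fixed sigma^2 can hold
blows_up = centered_chi2_mgf(0.4999) > 10.0
```

<p>This is exactly the shape of behavior we will need below: a squared (sub)gaussian is too heavy-tailed to stay subgaussian, but its MGF is still controlled on a window around zero.</p>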
<p>It turns out that RIP holds for \(X\) with high probability if \(\vu^\top X^\top X\vu\in\se(nc, c')\) for some constants \(c,c'\) and any unit vector \(\vu\).</p>
<p>When entries of \(X\) are independent and identically distributed, \(\vu\) can essentially be taken to be a standard unit vector without loss of generality. This requires some justification but it’s intuitive so I’ll skip it for brevity. This lets us simplify the problem to asking if \(\norm{X_1}^2\in\se(nc, c’)\), where \(X_1\) is the first column of \(X\).</p>
<p>So let’s take the entries of \(X\) to be iid, which, due to additive closure, means that the previous condition can just be \({X}_{11}^2\in\se(c,c')\).</p>
<h2 id="squared-subgaussians">Squared Subgaussians</h2>
<p>Turns out, if the entries of \(X\) are subgaussian and iid, all of the above conditions hold. In particular, we need to show that the first entry \(X_{11}\), when squared and centered, is subexponential.</p>
<p>We focus on a loose but good-enough bound for this use case.</p>
<p>Suppose \(X\in\sg(\sigma^2)\). Then \(X^2-\E X^2\in \se(c\sigma^4,\sigma^{-2}/8)\), again, being very loose with the bound here.</p>
<p>First, consider an arbitrary rv \(Y\). By the conditional Jensen’s inequality, for any \(\lambda\) and an independent copy \(Y'\) of \(Y\),
\[
\E\exp\pa{\lambda (Y-\E Y)}=\E\exp\pa{\CE{\lambda (Y-Y')}{Y}}\le \E\CE{\exp\pa{\lambda (Y-Y')}}{Y}=\E\exp\pa{\lambda (Y-Y')}\,.
\]
Then let \(\epsilon\) be an independent Rademacher random variable, and notice we can replace \(Y-Y'\disteq \epsilon(Y-Y')\) above, since a difference of iid variables is symmetric. Now choose \(Y=X^2\). Then by Taylor expansion and dominated convergence,
\[
\E\exp\pa{\lambda \pa{X^2-\E X^2}}\le \E \exp\pa{\lambda \epsilon \pa{X^2-(X')^2}}=\sum_{k=0}^\infty\frac{\lambda^k\E\ha{\epsilon^k(X^2-(X')^2)^k}}{k!}\,.
\]
Next, notice for odd \(k\), \(\epsilon^k=\epsilon\), so by symmetry the odd terms vanish, leaving the MGF bound
\[
\E\exp\pa{\lambda \pa{X^2-\E X^2}}\le\sum_{k=0}^\infty\frac{\lambda^{2k}\E\ha{\pa{X^2-(X')^2}^{2k}}}{(2k)!}\le 2\sum_{k=0}^\infty\frac{\lambda^{2k}\E\ha{X^{4k}}}{(2k)!}\,,
\]
where above we use the fact that \(x\mapsto x^{2k}\) is monotonic on the nonnegative reals and \(\abs{X^2-(X')^2}\le X^2\) when \(\abs{X}\ge\abs{X'}\), which occurs half the time by symmetry. The other half of the time, we get an equivalent expression with \(X'\) in place of \(X\). By subgaussian higher moments,
\[
\E \exp\pa{\lambda (X^2-\E X^2)}\le 1+c\sum_{k=1}^\infty \frac{k\pa{4\sigma^2\lambda}^{2k}(2k-1)!}{(2k)!}=1+\frac{c}{2}\sum_{k=1}^\infty\pa{4\sigma^2\lambda}^{2k}\,.
\]
Next we assume, crudely, that \(4\sigma^2\abs{\lambda}\le 2^{-1/2}\), so the series above is geometric with ratio \(\pa{4\sigma^2\lambda}^2\le 1/2\), and is therefore bounded by twice its first term. Then,
\[
\E \exp\pa{\lambda (X^2-\E X^2)}\le 1+c\pa{4\sigma^2\lambda}^2\le \exp(c'\sigma^4\lambda^2)\,,
\]
using \(1+x\le e^x\). In particular this holds on \(\abs{\lambda}\le\sigma^{-2}/8\), giving \(X^2-\E X^2\in\se(c'\sigma^4,\sigma^{-2}/8)\) as claimed.</p>
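<p>To sanity-check the final bound numerically (a standalone sketch with a loosely chosen constant): take \(X\) uniform on \([-1,1]\), which is \(\sg(1)\) by Hoeffding’s lemma, with \(\E X^2 = 1/3\), and verify \(\E\exp\pa{\lambda(X^2-\E X^2)}\le\exp(\lambda^2)\) on \(\abs{\lambda}\le 1/8\) by numerical integration.</p>

```python
import math

def mgf_centered_square_uniform(lam, steps=4000):
    # Midpoint-rule estimate of E exp(lam * (X^2 - 1/3)) for X ~ Unif[-1, 1]
    total = 0.0
    for i in range(steps):
        x = -1.0 + (2.0 * i + 1.0) / steps  # midpoint of the i-th cell
        total += math.exp(lam * (x * x - 1.0 / 3.0))
    return total / steps

# check the subexponential MGF bound exp(c * sigma^4 * lam^2), taking c = 1,
# on the window |lam| <= sigma^(-2) / 8 = 1/8
ok = all(
    mgf_centered_square_uniform(l) <= math.exp(l * l)
    for l in (x / 100 for x in range(-12, 13))
)
```

<p>The constant \(c=1\) and the \(\lambda\)-grid are arbitrary choices for the check; the theorem only promises the bound for <em>some</em> constant on the stated window.</p>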
Wed, 11 Sep 2019 00:00:00 +0000
https://vlad17.github.io/2019/09/11/compressed-sensing-subgaussians.html
machine-learning