B.log  RSS feed
http://artem.sobolev.name
Sun, 02 May 2021 00:00:00 UT
Reciprocal Convexity to reverse the Jensen Inequality
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/k7re0131hU/20210502reciprocalconvexitytoreversethejenseninequality.html
<p><a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen’s inequality</a> is a powerful tool often used in mathematical derivations and analyses. It states that for a convex function <span class="math inline">\(f(x)\)</span> and an arbitrary random variable <span class="math inline">\(X\)</span> we have the following <em>upper</em> bound: <span class="math display">\[
f\left(\E X\right)
\le
\E f\left(X\right)
\]</span></p>
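As a quick numeric sanity check of the inequality (the convex function and the two-point distribution below are toy choices of mine, not from the post):

```python
import math

# Jensen's inequality for the convex function f(x) = exp(x):
# f(E[X]) <= E[f(X)] for any random variable X.
f = lambda x: math.exp(x)
xs, probs = [0.0, 2.0], [0.5, 0.5]   # a toy two-point X

EX = sum(p * x for p, x in zip(probs, xs))       # E[X] = 1
EfX = sum(p * f(x) for p, x in zip(probs, xs))   # (1 + e^2) / 2 ~ 4.19
assert f(EX) <= EfX
print(f(EX), EfX)
```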
<p>However, oftentimes we want the inequality to work in the other direction, to give a <em>lower</em> bound. In this post I’ll outline one possible approach to this.</p>
<!--more-->
<h2 id="thetrick">The Trick</h2>
<p>The basic idea is very simple: let’s turn our convex function into a concave function. First, define</p>
<p><span class="math display">\[
\hat{f}(x) = f\left(\tfrac{1}{x}\right)
\]</span></p>
<p>As <a href="https://core.ac.uk/download/pdf/82634388.pdf">defined by Merkle</a>, a function <span class="math inline">\(h(x)\)</span> is called <strong>reciprocally convex</strong> if <span class="math inline">\(h(x)\)</span> is concave and <span class="math inline">\(\hat{h}(x) = h(1/x)\)</span> is convex. For the sake of this discussion we’ll assume <span class="math inline">\(f(x)\)</span> is <strong>reciprocally concave</strong>, that is, <span class="math inline">\(-f(x)\)</span> is reciprocally convex; in other words, <span class="math inline">\(f(x)\)</span> is convex and <span class="math inline">\(\hat{f}(x)\)</span> is concave.</p>
<p>Next, we’ll need an unbiased estimator <span class="math inline">\(Y\)</span> of the reciprocal of the mean <span class="math inline">\(\E X\)</span>, that is, <span class="math inline">\(Y\)</span> should satisfy the following:</p>
<p><span class="math display">\[
\E Y = \frac{1}{\E X}
\]</span></p>
<p>Now, the rest is simple algebra and the standard Jensen’s inequality (remember <span class="math inline">\(\hat{f}\)</span> is concave by definition): <span class="math display">\[
f\left(\E X\right)
= \hat{f}\left(\frac{1}{\E X}\right)
= \hat{f}\left(\E Y\right)
\ge \E \hat{f}\left(Y\right)
= \E f\left(\frac{1}{Y}\right)
\tag{1}
\]</span></p>
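Inequality (1) is easy to sanity-check numerically. Here's a small sketch with <span class="math inline">\(f(x) = -\log(x)\)</span>, which is reciprocally concave (it is convex, while <span class="math inline">\(f(1/x) = \log(x)\)</span> is concave); the two-point <span class="math inline">\(X\)</span> and the hand-built unbiased <span class="math inline">\(Y\)</span> are toy choices of mine:

```python
import math

# f(x) = -log(x): convex, with f_hat(x) = f(1/x) = log(x) concave.
f = lambda x: -math.log(x)

# Toy X with E[X] = 2, and a hand-built Y with E[Y] = 1/2 = 1 / E[X].
xs, ys, probs = [1.0, 3.0], [0.25, 0.75], [0.5, 0.5]

EX = sum(p * x for p, x in zip(probs, xs))
EY = sum(p * y for p, y in zip(probs, ys))
assert abs(EY - 1.0 / EX) < 1e-12        # Y is unbiased for 1/E[X]

lhs = f(EX)                               # f(E[X]) = -log(2)
rhs = sum(p * f(1.0 / y) for p, y in zip(probs, ys))  # E[f(1/Y)]
print(lhs, rhs)
assert lhs >= rhs                         # the reversed-Jensen lower bound (1)
```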
<h3 id="example">Example</h3>
<p>This trick is actually the reason why we can have both <a href="/posts/20190510importanceweightedhierarchicalvariationalinference.html">upper</a> and lower bounds on the log marginal likelihood in latent variable models. Indeed, consider the following example:</p>
<p><span class="math display">\[
f(x) := -\log(x),
\quad X := p(x \mid Z),
\quad Z \sim p(z)
\]</span></p>
<p>This is the standard Variational Inference setup. Putting it all together, we’d like to give bounds on</p>
<p><span class="math display">\[
\log \left( \E_{Z \sim p(z)} p(x \mid Z) \right)
= \log p(x)
\]</span></p>
<p>Normally, in VI we use the standard Jensen’s Inequality to obtain an upper bound on this negative log-likelihood, and all is good. However, sometimes we need lower bounds on the same quantity. This is where the framework above comes to the rescue.</p>
<p>First, it’s easy to see that we’re very lucky – <span class="math inline">\(f(x)\)</span> is indeed reciprocally concave: <span class="math inline">\(-\log(x)\)</span> is convex, and <span class="math inline">\(-\log\tfrac{1}{x} = \log(x)\)</span> is concave.</p>
<p>Next, we need an unbiased estimator <span class="math inline">\(Y\)</span> of the inverse mean of <span class="math inline">\(X\)</span>, that is, an unbiased estimator of <span class="math inline">\(1/p(x)\)</span>. Such an estimator can be constructed as follows:</p>
<p><span class="math display">\[
\frac{1}{p(x)}
= \int \frac{q(z)}{p(x)} dz
= \int \frac{q(z) p(z \mid x)}{p(x) p(z \mid x)} dz
= \E_{p(z \mid x)} \frac{q(z)}{p(x, z)}
\]</span></p>
<p>Where <span class="math inline">\(q(z)\)</span> is an arbitrary distribution. Thus, the estimator <span class="math inline">\(Y\)</span> is generated by the random variable <span class="math inline">\(Z\)</span>: <span class="math display">\[
Y := \frac{q(Z)}{p(x, Z)},
\quad Z \sim p(z \mid x)
\]</span></p>
<p>Now, putting these into (1) we obtain: <span class="math display">\[
-\log p(x)
\ge \E_{p(z \mid x)} \left[ -\log \frac{p(x, z)}{q(z)} \right]
\]</span> Or, equivalently, <span class="math display">\[
\log p(x)
\le \E_{p(z \mid x)} \log \frac{p(x, z)}{q(z)}
\]</span> By the way, for comparison, here’s the classical lower bound obtained through the standard Jensen’s Inequality. Curiously, the only difference is where the random variables <span class="math inline">\(z\)</span> are coming from: <span class="math display">\[
\log p(x)
\ge \E_{q(z)} \log \frac{p(x, z)}{q(z)}
\]</span></p>
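These two bounds can be verified numerically in a model where everything is tractable. Below is a sketch on a toy conjugate Gaussian model of my own choosing (not from the post): <span class="math inline">\(p(z) = \mathcal{N}(0, 1)\)</span>, <span class="math inline">\(p(x \mid z) = \mathcal{N}(z, 1)\)</span>, hence <span class="math inline">\(p(x) = \mathcal{N}(0, 2)\)</span> and <span class="math inline">\(p(z \mid x) = \mathcal{N}(x/2, 1/2)\)</span>, with an arbitrary Gaussian <span class="math inline">\(q(z)\)</span>; both bounds are estimated by plain Monte Carlo and sandwich the exact <span class="math inline">\(\log p(x)\)</span>:

```python
import math, random

random.seed(0)

def log_normal(v, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def log_joint(x, z):                       # log p(x, z) = log p(z) + log p(x|z)
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)

x = 1.5
log_px = log_normal(x, 0.0, 2.0)           # exact log marginal likelihood

q_mean, q_var = 0.5, 1.0                   # an arbitrary variational q(z)
n = 200_000

# Classical lower bound (ELBO): E_{q(z)} log p(x,z)/q(z)
zs = [random.gauss(q_mean, math.sqrt(q_var)) for _ in range(n)]
lower = sum(log_joint(x, z) - log_normal(z, q_mean, q_var) for z in zs) / n

# Reversed-Jensen upper bound: E_{p(z|x)} log p(x,z)/q(z)
zs = [random.gauss(x / 2, math.sqrt(0.5)) for _ in range(n)]
upper = sum(log_joint(x, z) - log_normal(z, q_mean, q_var) for z in zs) / n

print(lower, log_px, upper)
assert lower < log_px < upper
```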
<h3 id="generalization">Generalization</h3>
<p>Why limit ourselves to a particular <span class="math inline">\(\hat{f} = f \circ (1/x)\)</span>? One can consider other invertible functions <span class="math inline">\(g(x)\)</span> instead of <span class="math inline">\(1/x\)</span>. Here’s the recipe:</p>
<ul>
<li>Define <span class="math inline">\(f^{[g]}(x) = f(g(x))\)</span></li>
<li>First, we need <span class="math inline">\(f^{[g]}(x)\)</span> to be concave</li>
<li>Second, we need an unbiased estimator <span class="math inline">\(Y\)</span> of <span class="math inline">\(g^{-1}(\E X)\)</span></li>
</ul>
<p>This leads to a generalization of (1): <span class="math display">\[
f\left(\E X\right)
= f^{[g]}\left( g^{-1}(\E X) \right)
= f^{[g]}\left( \E Y \right)
\ge \E f^{[g]}\left( Y \right)
= \E f\left( g(Y) \right)
\]</span></p>
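As a sanity check of the recipe, here's a toy numeric sketch with a hand-picked pair (my own choice, for illustration only): <span class="math inline">\(f(x) = x^3\)</span>, which is convex for <span class="math inline">\(x > 0\)</span>, and <span class="math inline">\(g(y) = y^{1/6}\)</span>, so that <span class="math inline">\(f^{[g]}(y) = \sqrt{y}\)</span> is concave and <span class="math inline">\(g^{-1}(u) = u^6\)</span>:

```python
# f(x) = x^3 is convex on x > 0; g(y) = y^(1/6) makes
# f^[g](y) = f(g(y)) = sqrt(y) concave, and g^{-1}(u) = u^6.
f = lambda x: x ** 3
g = lambda y: y ** (1.0 / 6.0)

xs, probs = [1.0, 3.0], [0.5, 0.5]
EX = sum(p * x for p, x in zip(probs, xs))        # E[X] = 2

# Y must be unbiased for g^{-1}(E[X]) = 2^6 = 64; built by hand here.
ys = [32.0, 96.0]
EY = sum(p * y for p, y in zip(probs, ys))
assert abs(EY - EX ** 6) < 1e-9

lhs = f(EX)                                        # f(E[X]) = 8
rhs = sum(p * f(g(y)) for p, y in zip(probs, ys))  # E[f(g(Y))] = E[sqrt(Y)]
print(lhs, rhs)
assert lhs >= rhs                                  # the generalized lower bound
```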
<h2 id="conclusion">Conclusion</h2>
<p>This trick is simple, and perhaps obvious even without any fancy words such as reciprocal convexity. Moreover, it has its limitations: you either need to get lucky with <span class="math inline">\(f(x)\)</span> being reciprocally concave, or need to find an invertible <span class="math inline">\(g(x)\)</span> such that <span class="math inline">\(f \circ g\)</span> is concave. But even that’s not enough, as you also need to construct an unbiased estimator <span class="math inline">\(Y\)</span>, and if you fancy practical applications, efficiency of the resulting bound will heavily depend on the quality of this estimator.</p>
<p>Nevertheless, I believe this is an interesting idea and it might prove itself useful in various analyses and derivations.</p>
Artem | http://artem.sobolev.name/posts/20210502reciprocalconvexitytoreversethejenseninequality.html

Not every REINFORCE should be called Reinforcement Learning
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/tf6YO83d6j0/20201129reinforceisnotrl.html
<p>Deep RL is hot these days. It’s one of the most popular topics in the submissions at NeurIPS / ICLR / ICML and other ML conferences. And while the definition of RL is pretty general, in this note I’d argue that the famous REINFORCE algorithm <em>alone</em> is not enough to label your method as a Reinforcement Learning one.</p>
<!--more-->
<h2 id="reinforce">REINFORCE</h2>
<p>REINFORCE is a method introduced by Ronald Williams, commonly cited as coming from “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. Given the long and fruitful history of the method, it’s natural that its definition has evolved, and for different people it might mean somewhat different things, so let me first describe what <strong>I</strong> mean by REINFORCE in this particular discussion<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<p>In this post we’ll assume REINFORCE to be equivalent to the score-function gradient estimator (also known as the log-derivative trick gradient estimator) with a certain (most likely constant<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>) baseline for variance reduction.</p>
<p>I don’t want to reintroduce this method (I believe I <a href="/posts/20171112stochasticcomputationgraphsfixingreinforce.html">already did</a> it quite some time ago), instead I refer an interested reader to a <a href="http://blog.shakirm.com/2015/11/machinelearningtrickoftheday5logderivativetrick/">great blog post by Shakir Mohamed</a>, where the score-function (gradient) estimator is explained.</p>
<h2 id="whatreinforceisusedfor">What REINFORCE is used for</h2>
<p>REINFORCE is used to estimate the gradients of the policy <span class="math inline">\(\pi_\theta(\tau)\)</span> when dealing with the objectives of the following form<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a>: <span class="math display">\[
\mathop{\mathbb{E}}_{\pi_\theta(\tau)} R(\tau) \to \max_{\theta}
\]</span> The REINFORCE gradient estimator is then given by (where <span class="math inline">\(b \in \mathbb{R}\)</span> is a baseline) <span class="math display">\[
\left(R(\tau) - b\right) \nabla_\theta \log \pi_\theta(\tau), \quad\quad \text{where $\tau \sim \pi_\theta(\tau)$}
\]</span> The major benefits of this estimator are:</p>
<ul>
<li>We don’t need to know the reward function <span class="math inline">\(R(\tau)\)</span>, we only need to evaluate it on the sampled trajectories <span class="math inline">\(\tau\)</span>.</li>
<li>There are no assumptions on <span class="math inline">\(R(\tau)\)</span>, it can be nondifferentiable or even discontinuous.</li>
<li>Even the <span class="math inline">\(\tau\)</span> itself could be discrete! We only need the log-probability <span class="math inline">\(\log \pi_\theta(\tau)\)</span> to be differentiable in <span class="math inline">\(\theta\)</span> (but not in <span class="math inline">\(\tau\)</span>).</li>
</ul>
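These properties can be illustrated with a minimal sketch (a single Bernoulli "trajectory" with a black-box reward; the reward values, logit, and baseline below are arbitrary choices of mine) comparing the REINFORCE estimate against the exact gradient:

```python
import math, random

random.seed(0)
sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))

def R(b):                        # black-box reward: only point evaluations, not differentiable
    return 4.0 if b == 1 else 1.0

theta = 0.3                      # logit; pi_theta is Bernoulli(sigmoid(theta))
p1 = sigmoid(theta)

# Exact gradient of E[R(b)] for reference:
# d/dtheta [p1 R(1) + (1 - p1) R(0)] = p1 (1 - p1) (R(1) - R(0))
exact = p1 * (1 - p1) * (R(1) - R(0))

baseline = 2.5                   # a constant baseline; the estimator stays unbiased
n = 200_000
total = 0.0
for _ in range(n):
    b = 1 if random.random() < p1 else 0
    score = b - p1               # grad_theta log pi_theta(b) for the Bernoulli-logit case
    total += (R(b) - baseline) * score
estimate = total / n

print(exact, estimate)
assert abs(exact - estimate) < 0.01
```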
<p>The last two properties make the REINFORCE estimator an appealing choice for the gradient estimation in stochastic computation graphs, which I <a href="/tags/stochastic%20computation%20graphs%20series.html">have written at length</a> about.</p>
<p>There are lots of papers that do use REINFORCE in this exact scenario. For example, in a recent paper <a href="http://proceedings.mlr.press/v119/yoon20a.html">Data Valuation using Reinforcement Learning</a> (DVRL) researchers from Google do exactly that: they define a certain stochastic computation graph that contains discrete binary random variables in it. Then a simple REINFORCE gradient estimator is used to train those layers which cannot be reached by the standard backpropagation.</p>
<p>Notably, the paper cites only one work that has Reinforcement Learning in its title – the original one by Williams. Other than that it seems pretty disconnected from the RL literature. This hints at a question: should it even be described as “using RL”?</p>
<h2 id="communicativevalue">Communicative Value</h2>
<p>Words are used to communicate ideas. When I say “Deep Neural Network” associations fire up in your brain and, provided you’re well-versed in the modern ML, you immediately think of all these modern (well, maybe not all of them) fancy things we call CNNs, ResNets, RNNs, LSTMs, Transformers, GNNs and many-many-many more. But I can also claim that a Logistic Regression (LR) is a special case of fully-connected neural networks, especially if you train them with stochastic optimization methods. But what’s the <em>communicative value</em> of this statement? What information does it convey? Does much of knowledge about LR generalize to Neural Nets? Or, does it benefit hugely from our modern Deep Learning toolkit? When was the last time you used batchnorm to train your Logistic Regression?</p>
<p>What I’m trying to say is that although LR can be technically categorized as a Neural Network, this categorization appears to be useless: it does not open any interesting knowledge / expertise transfer. However, stack a logistic classifier on top of a pretrained neural network and train the whole pipeline end-to-end – and you’re in the <a href="https://twitter.com/realTurboPascal/status/1111136291394068480">#backpropaganda</a> now!</p>
<p>Same goes for REINFORCE: the communicative value of describing methods like the aforementioned DVRL as “using RL” is very small. In my opinion, the distinctive traits of (modern) Reinforcement Learning are:</p>
<ul>
<li>Delayed rewards<a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a></li>
<li>Unknown environment model</li>
<li>A single action at each state</li>
</ul>
<p>When you say you “use RL” it should mean you’ve posed the problem at hand such that it benefits from the vast research produced by RL people that address these traits. It’s this connection that bears communicative value as now you know that advances in RL would translate to your problem, too.</p>
<p>If your problem lacks these traits and you go for RL methods anyway, you ignore much of the useful structure you have in your problem, constraining yourself to methods that are designed for a much harder problem. Keep in mind that RL is hard:</p>
<blockquote class="twittertweet">
<p lang="en" dir="ltr">
When you say, “This is a reinforcement learning problem,” you should say it with the same excitement as “This is NP-hard.”
</p>
— Tim Vieira (<span class="citation">@xtimv</span>) <a href="https://twitter.com/xtimv/status/795050238948110336?ref_src=twsrc%5Etfw">November 5, 2016</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Perhaps a large body of RL work might be solving a problem you don’t even have to start with! Speaking of the REINFORCE method, its biggest problem is large variance, for which people have designed <a href="/posts/20171112stochasticcomputationgraphsfixingreinforce.html">clever baselines</a>, but in RL, one might argue, <a href="https://arxiv.org/abs/1802.10031">such baselines have limited value</a>. On the other hand, Gumbel-Softmax (and <a href="/posts/20171028stochasticcomputationgraphsdiscreterelaxations.html">relaxations</a> in general) – a method one should almost always consider when thinking of training stochastic computation graphs with REINFORCE – is not applicable in the standard RL setting.</p>
<p>In the particular case of DVRL the problem has much more useful structure than the RL literature assumes. It has no delay in feedback, has a fully known environment model, and allows you to take multiple actions at each state – all of these imply you can do things RL people can’t afford. Unsurprisingly, this departure from the standard RL setting is reflected in the absence of RL works in the bibliographic selection.</p>
<h2 id="conclusion">Conclusion</h2>
<p>There are other papers just like the DVRL that use REINFORCE to perform gradient estimation in models with discrete random variables and claim to be doing Reinforcement Learning. While possibly benefitting from all the hype around RL, this narrows the selection of methods to those designed for a much more general and harder problem. I hope I have convinced you that the Venn diagram for RL and REINFORCE should not have one containing the other.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>If to you REINFORCE means something different from or something more than what I describe, then you’d probably agree with my claim. But anyway, let me know in the comments below!<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>The original REINFORCE did assume a certain (probably) constant baseline to be employed, but let’s assume that constant could be 0 to include the vanilla score-function estimator as well.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>In the RL parlance <span class="math inline">\(\tau\)</span> is a trajectory (sequence of stateaction pairs) and <span class="math inline">\(R(\tau)\)</span> is an unknown reward function, which is usually assumed to be comprised of individual rewards per each stateaction pair: <span class="math display">\[ R(\tau) = \sum_{(s_t, a_t) \in \tau} r_t(s_t, a_t) \]</span><a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>For this reason I don’t think bandits should be called RL either.<a href="#fnref4">↩</a></p></li>
</ol>
</div>
Sun, 29 Nov 2020 00:00:00 UT | Artem | http://artem.sobolev.name/posts/20201129reinforceisnotrl.html

A simpler derivation of f-GANs
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/RCe9HJbHaWc/20191201asimplerderivationoffgans.html
<p>I have been looking at the <span class="math inline">\(f\)</span>-GAN derivation while doing some of my research, and found an easier way to derive its lower bound, without invoking convex conjugate functions.</p>
<!--more-->
<p><a href="https://arxiv.org/abs/1606.00709"><span class="math inline">\(f\)</span>-GANs</a> are a generalization of standard GANs to an arbitrary <span class="math inline">\(f\)</span>-divergence. Given a convex function <span class="math inline">\(f\)</span>, the <a href="https://en.wikipedia.org/wiki/Fdivergence"><span class="math inline">\(f\)</span>-divergence</a>, in turn, can be used to measure the “difference” between the data distribution <span class="math inline">\(p_\text{data}(x)\)</span> and our model <span class="math inline">\(q(x)\)</span>:</p>
<p><span class="math display">\[
D_f(p_\text{data}(x) \mid\mid q(x)) = \E_{q(x)} f \left( \frac{p_\text{data}(x)}{q(x)} \right)
\]</span></p>
<p>Of course, we don’t know the data-generating distribution <span class="math inline">\(p_\text{data}(x)\)</span>. Moreover, in a typical GAN setting <span class="math inline">\(q(x)\)</span> is an implicit model, and thus its density is not known either <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>. Thus, to make things tractable, GANs employ sample-based lower bounds <a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>.</p>
<h2 id="simplederivation">Simple Derivation</h2>
<p>Our derivation is based on the following simple inequality, a very well-known fact for convex functions<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a>: a convex function always lies above its tangent line, with equality at the point of tangency (denoted <span class="math inline">\(r(x)\)</span>):</p>
<p><span class="math display">\[
f\left( \frac{p_\text{data}(x)}{q(x)} \right)
\ge
f\left( r(x) \right)
+
f'\left( r(x) \right) \left( \frac{p_\text{data}(x)}{q(x)} - r(x) \right)
\]</span></p>
<p>For any nonnegative function <span class="math inline">\(r(x)\)</span>. Now we take the expected value</p>
<p><span class="math display">\[
\begin{align*}
D_f(p_\text{data}(x) \mid\mid q(x))
&\ge
\E_{q(x)}
\left[
f\left( r(x) \right)
+
f'\left( r(x) \right) \left( \frac{p_\text{data}(x)}{q(x)} - r(x) \right)
\right] \\
& =
\E_{q(x)}
f\left( r(x) \right)
+
\E_{p_\text{data}(x)} f'\left( r(x) \right) - \E_{q(x)} f'\left( r(x) \right) r(x)
\tag{1}
\end{align*}
\]</span></p>
<p>This bound has several nice properties:</p>
<ol style="liststyletype: decimal">
<li>It does not require knowing densities, only having samples.</li>
<li>By construction, it’s a lower bound for all <span class="math inline">\(r(x)\)</span>.</li>
<li>Plugging in <span class="math inline">\(r^*(x) = \frac{p_\text{data}(x)}{q(x)}\)</span> recovers the <span class="math inline">\(f\)</span>-divergence.</li>
</ol>
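These properties are easy to verify numerically. Here's a sketch for <span class="math inline">\(f(t) = t \log t\)</span> (the KL case) on a small discrete example of my own making, where all densities are known exactly:

```python
import math

f  = lambda t: t * math.log(t)         # f(t) = t log t gives the KL divergence
df = lambda t: math.log(t) + 1.0       # f'(t)

p = [0.5, 0.3, 0.2]                    # "data" distribution
q = [0.2, 0.3, 0.5]                    # "model" distribution

def divergence(p, q):                  # D_f(p || q) = E_q f(p / q)
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def bound(p, q, r):                    # the right-hand side of (1) for a candidate r
    return (sum(qi * f(ri) for qi, ri in zip(q, r))
            + sum(pi * df(ri) for pi, ri in zip(p, r))
            - sum(qi * df(ri) * ri for qi, ri in zip(q, r)))

Df = divergence(p, q)
r_star = [pi / qi for pi, qi in zip(p, q)]
assert abs(bound(p, q, r_star) - Df) < 1e-9   # tight at r* = p/q

r_other = [1.0, 2.0, 0.5]                      # an arbitrary positive r
assert bound(p, q, r_other) <= Df              # still a valid lower bound
print(Df, bound(p, q, r_other))
```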
<p>However, this formula looks different from the one in the <span class="math inline">\(f\)</span>-GANs paper. Are they related? We’ll now show they’re exactly the same.</p>
<h2 id="fgansderivation"><span class="math inline">\(f\)</span>-GANs Derivation</h2>
<p>The original derivation, which probably should be attributed to <a href="http://dept.stat.lsa.umich.edu/~xuanlong/Papers/NguyenWainwrightJordan10.pdf">“Estimating divergence functionals and the likelihood ratio by convex risk minimization” by XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan (2010)</a> is based on the <a href="https://en.wikipedia.org/wiki/Convex_conjugate">convex conjugate</a> concept. The convex conjugate <span class="math inline">\(f^*\)</span> for a function <span class="math inline">\(f\)</span> is <span class="math display">\[
f^*(t) = \sup_{u \in \text{dom}(f)} \left[ u t - f(u) \right]
\]</span></p>
<p>Nguyen et al. have shown the following variational characterization of the <span class="math inline">\(f\)</span>-divergence <a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a>: <span class="math display">\[
D_f(p(x) \mid\mid q(x)) = \sup_{T(x)} \left[ \E_{p(x)} T(x) - \E_{q(x)} f^*(T(x)) \right]
\]</span> Where <span class="math inline">\(f^*(t)\)</span> is the aforementioned convex conjugate of <span class="math inline">\(f(t)\)</span>, and the supremum is taken over all functions. However, we’re safe to restrict the range of <span class="math inline">\(T(x)\)</span> to those values where <span class="math inline">\(f^*\)</span> is finite, that is, the set <span class="math inline">\(\mathcal{V} = \{t \in \mathbb{R} \mid f^*(t) < \infty \}\)</span>. Now this form is already amenable to practical applications: just make <span class="math inline">\(T(x)\)</span> a neural network whose activation respects <span class="math inline">\(\mathcal{V}\)</span> and maximize the lower bound w.r.t. its parameters. The question then is how to construct this activation.</p>
<p>Skipping the more detailed analysis, we note that the optimal <span class="math inline">\(T(x)\)</span> is known to be <span class="math display">\[T^*(x) = f'\left( \frac{p(x)}{q(x)} \right)\]</span> Since we’re only interested in approximating the optimal value, we might as well consider the following parametrization for <span class="math inline">\(T(x)\)</span> (using a nonnegative function <span class="math inline">\(r(x)\)</span>): <span class="math display">\[
T(x) = f'(r(x))
\]</span> Which gives us the following objective</p>
<p><span class="math display">\[
D_f(p(x) \mid\mid q(x)) = \sup_{r(x)} \left[ \E_{p(x)} f'(r(x)) - \E_{q(x)} f^*(f'(r(x))) \right]
\tag{2}
\]</span></p>
<p>Finally, we use <a href="https://math.stackexchange.com/a/1428011/463191">an important property of convex conjugate functions</a>: <span class="math display">\[
\begin{align*}
f^*(f'(r(x)))
&= \sup_u \left[ u f'(r(x)) - f(u) \right] \\
&= \sup_u \left[ u f'(r(x)) - r(x) f'(r(x)) - f(u) \right] + r(x) f'(r(x)) \\
&= \sup_u \left[ \underbrace{f(r(x)) + f'(r(x)) (u - r(x)) - f(u)}_\text{$\le 0$ due to convexity of $f$} \right] + r(x) f'(r(x)) - f(r(x)) \\
&= r(x) f'(r(x)) - f(r(x)) \\
\end{align*}
\]</span> Where in the last line we’ve used the fact that for a convex <span class="math inline">\(f(t)\)</span> its tangent at any point is always a lower bound, and the supremum (equal to 0) is achieved at <span class="math inline">\(u = r(x)\)</span>.</p>
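This identity is easy to check numerically; here's a sketch for <span class="math inline">\(f(t) = t \log t\)</span>, whose convex conjugate is <span class="math inline">\(f^*(s) = e^{s-1}\)</span>:

```python
import math

f     = lambda t: t * math.log(t)     # f(t) = t log t
df    = lambda t: math.log(t) + 1.0   # f'(t)
fstar = lambda s: math.exp(s - 1.0)   # its convex conjugate f*(s) = e^(s-1)

for r in [0.1, 0.5, 1.0, 2.0, 7.3]:
    lhs = fstar(df(r))                # f*(f'(r))
    rhs = r * df(r) - f(r)            # r f'(r) - f(r)
    assert abs(lhs - rhs) < 1e-9      # the identity holds (both equal r here)
print("identity verified")
```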
<p>Now we plug this equivalent formula into the objective and obtain <span class="math display">\[
\begin{align*}
D_f(p(x) \mid\mid q(x))
& = \sup_{r(x)} \left[ \mathbb{E}_{p(x)} f'(r(x)) - \mathbb{E}_{q(x)} \left( r(x) f'(r(x)) - f(r(x)) \right) \right] \\
& = \sup_{r(x)} \left[ \mathbb{E}_{q(x)} f(r(x)) + \mathbb{E}_{p(x)} f'(r(x)) - \mathbb{E}_{q(x)} r(x) f'(r(x)) \right]
\end{align*}
\]</span></p>
<p>Which <strong>exactly</strong> recovers formula (1). Moreover, the conjugate identity holds for all realizations of the random variables involved, so not only are the bounds (1) and (2) the same, but so are their stochastic estimates<a href="#fn5" class="footnoteRef" id="fnref5"><sup>5</sup></a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The presented derivation and objective form is interesting for several reasons. First, by design the optimal “discriminator” <span class="math inline">\(r^*(x) = \frac{p_\text{data}(x)}{q(x)}\)</span> is independent of the particular <span class="math inline">\(f\)</span>-divergence used. Second, thinking of <span class="math inline">\(r(x)\)</span> as an approximation to the importance weights gives an intuitive understanding of the different terms in the objective (1): the first term is an <span class="math inline">\(f\)</span>-divergence approximation that uses the learned density ratio <span class="math inline">\(r(x)\)</span> instead of the actual density ratio. The remaining two terms balance the first one to ensure the lower bound guarantee. In particular, the last term uses <span class="math inline">\(r(x)\)</span> as an importance weight to “approximate” the second one so that they cancel out when <span class="math inline">\(r(x)\)</span> is optimal. Last but not least, the presented derivation is <em>simpler</em>.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Actually, most of the time it does not exist at all. But that’s a story for another time.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Although a lower bound on the loss is not something you’d like to minimize, this is how things are done in the GAN realm.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>We assume <span class="math inline">\(f\)</span> is differentiable here, but if it’s not, the statement still holds with <span class="math inline">\(f'\)</span> being replaced with a subgradient.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>Nguyen et al. use a slightly different convention for <span class="math inline">\(f\)</span>-divergences, namely <span class="math display">\[D_f(p(x) \mid\mid q(x)) = \E_{p(x)} f\left(\frac{q(x)}{p(x)}\right)\]</span><a href="#fnref4">↩</a></p></li>
<li id="fn5"><p>As long as you use the same samples to estimate different expectations over the distribution <span class="math inline">\(q(x)\)</span>.<a href="#fnref5">↩</a></p></li>
</ol>
</div>
Sun, 01 Dec 2019 00:00:00 UT | Artem | http://artem.sobolev.name/posts/20191201asimplerderivationoffgans.html

Thoughts on Mutual Information: Alternative Dependency Measures
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/jGy7Czf1a_k/20190915thoughtsonmutualinformationalternativedependencymeasures.html
<p>This post finishes the discussion started in <a href="/posts/20190810thoughtsonmutualinformationmoreestimators.html">Thoughts on Mutual Information: More Estimators</a> with a consideration of alternatives to the Mutual Information.</p>
<!--more-->
<h2 id="mutualinformation">Mutual Information</h2>
<p>Let’s step out a bit and take a critical look at the MI. One of its equivalent definitions says that it’s a KL divergence between the joint distribution and the product of marginals: <span class="math display">\[
\text{MI}[p(x, z)] = D_{KL}(p(x, z) \mid\mid p(x) p(z))
\]</span></p>
<p>Indeed, if the random variables <span class="math inline">\(X\)</span> and <span class="math inline">\(Z\)</span> are independent, then the joint distribution <span class="math inline">\(p(x, z)\)</span> factorizes as <span class="math inline">\(p(x) p(z)\)</span>, and the KL (or any other divergence or distance between probability distributions) is equal to zero. Conversely, the more <span class="math inline">\(X\)</span> and <span class="math inline">\(Z\)</span> are dependent, the further the joint <span class="math inline">\(p(x, z)\)</span> deviates from the product of marginals <span class="math inline">\(p(x) p(z)\)</span>.</p>
<p>But why this particular choice of divergence?</p>
<p>Why not <a href="https://threeplusone.com/pubs/on_jensenshannon.pdf">Jeffreys divergence</a>, <a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">Jensen-Shannon divergence</a>, <a href="https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures">Total Variation distance</a> or <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a>?</p>
<p>The answer to this question lies in the entropic form of the MI: <span class="math display">\[
\text{MI}[p(x, z)]
= D_{KL}(p(x, z) \mid\mid p(x) p(z))
= \mathbb{H}[p(x)] - \E_{p(z)} \mathbb{H}[p(x \mid z)]
\]</span></p>
<p>The Mutual Information is equal to the average reduction in entropy (a measure of uncertainty) of <span class="math inline">\(X\)</span> when we know <span class="math inline">\(Z\)</span>. Such information-theoretic interpretation is the main reason the MI is so widespread. However, there’s a major issue when <span class="math inline">\(X\)</span> and <span class="math inline">\(Z\)</span> are continuous: these entropies become differential ones, and the differential entropy <a href="https://stats.stackexchange.com/a/256238">does not enjoy the same uncertainty-measuring interpretation as the discrete one does</a>.</p>
<p>One particular issue with the continuous Mutual Information is the following one: if <span class="math inline">\(\text{Pr}[X = Z] = 1\)</span>, then the MI attains its maximal value. In the discrete case this maximal value is equal to the entropy of <span class="math inline">\(X\)</span> and finite, but in the continuous case it’s equal to <span class="math inline">\(+\infty\)</span> <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>. Moreover, imagine <span class="math inline">\(X\)</span> and <span class="math inline">\(Z\)</span> are <span class="math inline">\(N\)</span>dimensional random vectors s.t. <span class="math inline">\(\text{Pr}(X_1 = Z_1) = 1\)</span> and the rest components are all independent random variables. Then, it’s easy to show that the MI <span class="math inline">\(I(X, Z) = +\infty\)</span> regardless of <span class="math inline">\(N\)</span>! So if <span class="math inline">\(N\)</span> is in billions these vectors are mostly independent, but one pesky component ruined it all, and we ended up with an infinite mutual “information”.</p>
<p>I hope this convinced you that there’s no information-theoretic interpretation in the continuous case to justify the particular choice of divergence between <span class="math inline">\(p(x, z)\)</span> and <span class="math inline">\(p(x) p(z)\)</span>, which means we’re free to explore alternatives…</p>
<h2 id="fdivergences"><span class="math inline">\(f\)</span>-divergences</h2>
<p>KL divergence is a special case of <a href="https://en.wikipedia.org/wiki/Fdivergence"><span class="math inline">\(f\)</span>-divergences</a> given by <span class="math inline">\(f(t) = t \log t\)</span>. However, other choices are perfectly legal, too:</p>
<ol style="liststyletype: decimal">
<li><a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">Jensen-Shannon divergence</a> corresponds to <span class="math inline">\(f(t) = t \log t - (t+1) \log \tfrac{t+1}{2}\)</span>.</li>
<li><a href="https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures">Total Variation distance</a> corresponds to <span class="math inline">\(f(t) = |t-1|\)</span>.</li>
<li>Jeffreys divergence is given by <span class="math inline">\(f(t) = \tfrac{t-1}{2} \log t\)</span>.</li>
<li>Reverse KL divergence is given by <span class="math inline">\(f(t) = -\log t\)</span>.</li>
</ol>
<p>So, one can consider the <span class="math inline">\(f\)</span>-Mutual Information defined as <span class="math display">\[
I_f(X, Z) := D_f(p(x, z) \mid\mid p(x) p(z)) = \E_{p(x) p(z)} f\left(\frac{p(x, z)}{p(x) p(z)}\right)
\]</span></p>
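On a small discrete joint (a toy of my own making, not from the post), the <span class="math inline">\(f\)</span>-MI is straightforward to evaluate for several choices of <span class="math inline">\(f\)</span>:

```python
import math

# A toy 2x2 joint with uniform marginals and positive dependence.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {0: 0.5, 1: 0.5}
pz = {0: 0.5, 1: 0.5}

def f_mi(f):   # I_f(X, Z) = E_{p(x)p(z)} f( p(x,z) / (p(x) p(z)) )
    return sum(px[x] * pz[z] * f(joint[x, z] / (px[x] * pz[z]))
               for (x, z) in joint)

kl  = f_mi(lambda t: t * math.log(t))   # the usual Mutual Information
rkl = f_mi(lambda t: -math.log(t))      # reverse-KL variant
tv  = f_mi(lambda t: abs(t - 1))        # Total Variation variant
print(kl, rkl, tv)
# All are zero iff X and Z are independent; here all are strictly positive.
assert kl > 0 and rkl > 0 and tv > 0
```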
<p>In general, however, KL, Reverse KL and combinations thereof are the only <span class="math inline">\(f\)</span>-divergences that are additive for independent random variables: if <span class="math inline">\(X_1 \perp X_2\)</span> under both <span class="math inline">\(p(x)\)</span> and <span class="math inline">\(q(x)\)</span>, then <span class="math display">\[\text{KL}(p(x) \mid\mid q(x)) = \text{KL}(p(x_1) \mid\mid q(x_1)) + \text{KL}(p(x_2) \mid\mid q(x_2))\]</span> And thus for <span class="math inline">\(X_1 \perp X_2\)</span> and <span class="math inline">\(Z_1 \perp Z_2\)</span> <span class="math display">\[
I_f(X, Z)
=
I_f(X_1, Z_1)
+
I_f(X_2, Z_2)
\Leftrightarrow
\text{$f$ is a combination of KLs}
\]</span> Is such additivity important, though? Imagine having a sample set of independent objects <span class="math inline">\(X_1, \dots, X_N\)</span> used to extract corresponding representations <span class="math inline">\(Z_1, \dots, Z_N\)</span>. In general, with the <span class="math inline">\(f\)</span>-MI you’re not allowed to use stochastic optimization / minibatching to work with <span class="math inline">\(I_f(X, Z)\)</span>. This is counterintuitive and not something we’d expect from a measure of <em>information</em>.</p>
<p>That said, there are some things to keep in mind:</p>
<ol style="list-style-type: decimal">
<li>In practice you probably can use <span class="math inline">\(\sum_{n=1}^N I_f(X_n, Z_n)\)</span> instead of <span class="math inline">\(I_f(X, Z)\)</span> without having to suffer any consequences.</li>
<li>In some special cases such additivity <em>does</em> hold. These are cases of KL divergence, Reverse KL divergence and any combination thereof.</li>
</ol>
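<p>The (non-)additivity claim is easy to check numerically on small discrete joints. Below is a minimal sketch: the 2×2 tables <code>p1</code>, <code>p2</code> and all constants are my own illustrative choices, not something from a paper — the point is only that the KL-based MI is exactly additive over independent pairs while the Jensen–Shannon one is not.</p>

```python
import numpy as np

def f_mi(joint, f):
    """I_f(X, Z) = E_{p(x)p(z)} f( p(x,z) / (p(x) p(z)) ) for a discrete joint table."""
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    pz = joint.sum(axis=0, keepdims=True)   # marginal p(z)
    prod = px * pz                          # product of marginals
    return float(np.sum(prod * f(joint / prod)))

f_kl = lambda t: t * np.log(t)                                   # forward KL
f_js = lambda t: t * np.log(t) - (t + 1) * np.log((t + 1) / 2)   # Jensen-Shannon

# two arbitrary joints over independent pairs (X1, Z1) and (X2, Z2)
p1 = np.array([[0.4, 0.1], [0.1, 0.4]])
p2 = np.array([[0.3, 0.2], [0.1, 0.4]])
# joint of ((X1, X2), (Z1, Z2)): outer product rearranged into a 4x4 table
p12 = np.einsum('ij,kl->ikjl', p1, p2).reshape(4, 4)

kl_additivity_gap = f_mi(p12, f_kl) - (f_mi(p1, f_kl) + f_mi(p2, f_kl))
js_additivity_gap = f_mi(p12, f_js) - (f_mi(p1, f_js) + f_mi(p2, f_js))
```

<p>For these tables the KL gap is zero up to floating-point error, while the JS gap is small but clearly nonzero.</p>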
<h3 id="lautuminformation">Lautum Information</h3>
<p>Palomar and Verdú introduced the <a href="https://ieeexplore.ieee.org/document/4455754">Lautum Information</a> (I particularly liked their naming: Lautum is Mutual backwards): an analogue of the Mutual Information with the KL’s arguments swapped:</p>
<p><span class="math display">\[
\text{LI}[p(x, z)] = D_{KL}(p(x) p(z) \mid\mid p(x, z))
\]</span></p>
<p>It can be equivalently rewritten as</p>
<p><span class="math display">\[
\begin{align*}
\text{LI}[p(x, z)]
&= \E_{p(x) p(z)} \log \frac{p(x) p(z)}{p(x, z)}
= \E_{p(x) p(z)} \log \frac{p(z)}{p(z \mid x)} \\
&= -\E_{p(x) p(z)} \log p(z \mid x) - \mathbb{H}[p(z)]
\end{align*}
\]</span></p>
<p>Notice that in general the first term is not an entropy, but rather a cross-entropy. Unfortunately, this cross-entropy term lacks an intuitive information-theoretic interpretation – a distinctive feature of the Mutual Information.</p>
<p>Another disadvantage of the Lautum Information is that even for discrete random variables it’s infinite when <span class="math inline">\(X = Z\)</span>. Since one is sampling from the product of marginals <span class="math inline">\(p(x) p(z)\)</span>, you’ll inevitably have some probability mass in the area where <span class="math inline">\(x \not= z\)</span>, but <span class="math inline">\(p(x, z)\)</span> will be exactly zero in such regions, hence the logarithm’s argument will be infinite. However, for continuous random variables even the standard Mutual Information will be infinite in the case of a deterministic invertible dependency.</p>
<p>But how do we estimate the LI? First, the good news is that its only entropy term comes with a minus sign, hence if you seek to lower-bound the LI, then <a href="/posts/20190814thoughtsonmutualinformationformallimitations.html">the formal limitations theorem</a> does not apply. Unfortunately, I’m not aware of any good black-box bounds on the cross-entropy, so we’ll have to assume at least one conditional to be known, say, <span class="math inline">\(p(x \mid z)\)</span>. For the other term we can use any of the plethora of bounds on the log marginal likelihood: <span class="math display">\[
\begin{align*}
\text{LI}[p(x, z)]
&= -\E_{p(x) p(z)} \log p(x \mid z) + \E_{p(x)} \log p(x) \\
&= -\E_{p(x) p(z)} \log p(x \mid z) + \E_{p(x)} \log q(x) + \E_{p(x)} \log \frac{p(x)}{q(x)} \\
&= -\E_{p(x) p(z)} \log p(x \mid z) + \E_{p(x)} \log q(x) + D_\text{KL}(p(x) \mid\mid q(x) ) \\
&\ge \E_{p(x) p(z)} \log \frac{q(x)}{p(x \mid z)}
\end{align*}
\]</span></p>
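<p>To get a feel for this lower bound, here is a minimal Monte Carlo sketch on a toy conjugate Gaussian joint where everything is tractable. All modelling choices below are mine for illustration: <span class="math inline">\(z \sim \mathcal{N}(0,1)\)</span>, <span class="math inline">\(x \mid z \sim \mathcal{N}(z,1)\)</span>, and I set <span class="math inline">\(q(x)\)</span> to the true marginal <span class="math inline">\(\mathcal{N}(0,2)\)</span>, which makes the bound tight; the closed-form LI for a Gaussian pair with correlation <span class="math inline">\(\rho\)</span> follows from the Gaussian KL formula.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500_000

# Toy joint (illustrative choice): z ~ N(0,1), x|z ~ N(z,1)  =>  p(x) = N(0,2)
z = rng.standard_normal(N)                 # samples from p(z)
x = np.sqrt(2.0) * rng.standard_normal(N)  # samples from p(x), drawn independently of z

def log_normal(v, mean, var):
    return -0.5 * (v - mean) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

# Monte Carlo estimate of   LI >= E_{p(x)p(z)} log [ q(x) / p(x|z) ],
# with q(x) set to the true marginal N(0,2) so the inequality is actually tight
li_lower = np.mean(log_normal(x, 0.0, 2.0) - log_normal(x, z, 1.0))

# Closed form for a Gaussian pair with correlation rho:
#   LI = 1/(1 - rho^2) - 1 + 0.5 * log(1 - rho^2);   here rho^2 = 1/2
li_exact = 1.0 / 0.5 - 1.0 + 0.5 * np.log(0.5)
```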
<p>Where <span class="math inline">\(q(x)\)</span> is any (variational) distribution with the same support as <span class="math inline">\(p(x)\)</span>. However, notice that we’ve already assumed <span class="math inline">\(p(x \mid z)\)</span> to be known. With an additional assumption of being able to sample from this conditional we can use <a href="/posts/20190510importanceweightedhierarchicalvariationalinference.html#newsemiimplicithope">SIVI</a> to give a (non-black-box) lower bound on the entropy of <span class="math inline">\(p(x)\)</span>: <span class="math display">\[
\begin{align*}
\text{LI}[p(x, z)]
&= \E_{p(x)} \log p(x) - \E_{p(x) p(z)} \log p(x \mid z) \\
&\le \E_{p(x, z_0)} \E_{p(z_{1:K})} \log \left( \frac{1}{K+1} \sum_{k=0}^K p(x \mid z_k) \right) - \E_{p(x) p(z)} \log p(x \mid z) \\
&= \E_{p(x, z_0)} \E_{p(z_{1:K})} \left[ \log \left( \frac{1}{K+1} \sum_{k=0}^K p(x \mid z_k) \right) - \frac{1}{K} \sum_{k=1}^K \log p(x \mid z_k) \right] \\
\end{align*}
\]</span></p>
<p>Moreover, if the marginal distribution <span class="math inline">\(p(z)\)</span> is known, <a href="/posts/20190510importanceweightedhierarchicalvariationalinference.html#importanceweightedhierarchicalvariationalinference">IWHVI</a> provides a better estimate at the cost of introducing a variational distribution <span class="math inline">\(q(z \mid x)\)</span>: <span class="math display">\[
\begin{align*}
\text{LI}[p(x, z)]
&= \E_{p(x)} \log p(x) - \E_{p(x) p(z)} \log p(x \mid z) \\
&\le \E_{p(x, z_0)} \E_{q(z_{1:K} \mid x)} \log \left( \frac{1}{K+1} \sum_{k=0}^K \frac{p(x, z_k)}{q(z_k \mid x)} \right) - \E_{p(x) p(z)} \log p(x \mid z)
\end{align*}
\]</span></p>
<p>Analogously, one can use <a href="/posts/20160714neuralvariationalimportanceweightedautoencoders.html">IWAE</a> bounds to arrive at the following lower bound on the LI: <span class="math display">\[
\begin{align*}
\text{LI}[p(x, z)]
&= \E_{p(x)} \log p(x) - \E_{p(x) p(z)} \log p(x \mid z) \\
&\ge \E_{p(x)} \E_{q(z_{1:K} \mid x)} \log \left( \frac{1}{K} \sum_{k=1}^K \frac{p(x, z_k)}{q(z_k \mid x)} \right) - \E_{p(x) p(z)} \log p(x \mid z)
\end{align*}
\]</span></p>
<h2 id="wassersteindistance">Wasserstein Distance</h2>
<p>An interesting alternative to both Forward and Reverse KLs is the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein Distance aka Kantorovich–Rubinstein distance aka optimal transport distance</a>. Formally, the Wasserstein-<span class="math inline">\(p\)</span> metric <span class="math inline">\(W_p\)</span> is defined as <span class="math display">\[
W_p(p(x), q(x)) := \left( \inf_{\gamma \in \Gamma(p, q)} \E_{\gamma(x, y)} \|x - y\|_p^p \right)^{1/p}
\]</span></p>
<p>Where <span class="math inline">\(\Gamma(p, q)\)</span> is a set of all possible joint distributions over <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> s.t. <span class="math inline">\(p(x)\)</span> and <span class="math inline">\(q(y)\)</span> are its respective marginals: <span class="math display">\[
\gamma \in \Gamma(p, q) \Leftrightarrow
\int_{\text{dom}(y)} \gamma(x, y) dy = p(x),
\quad
\int_{\text{dom}(x)} \gamma(x, y) dx = q(y)
\]</span></p>
<p>In particular, we’ll be considering the Wasserstein-1 distance <span class="math inline">\(W_1\)</span>: <span class="math display">\[
W_1(p(x), q(x)) := \inf_{\gamma \in \Gamma(p, q)} \E_{\gamma(x, y)} \|x - y\|_1
\]</span></p>
<p>The Wasserstein-1 distance lies at the heart of the <a href="https://arxiv.org/abs/1701.07875">Wasserstein GAN</a>, and it was this paper that suggested using the Kantorovich–Rubinstein duality to estimate the distance in practice: <span class="math display">\[
W_1(p(x), q(x)) \ge \E_{p(x)} f(x) - \E_{q(x)} f(x)
\]</span> Where <span class="math inline">\(f\)</span> is any 1-Lipschitz function. It also seems to be additive for independent random variables – a nice property to have.</p>
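<p>In one dimension the duality is easy to illustrate numerically. The sketch below uses my own toy choice of distributions: <span class="math inline">\(p = \mathcal{N}(0,1)\)</span> and <span class="math inline">\(q = \mathcal{N}(1,1)\)</span>, a pure shift, so <span class="math inline">\(W_1 = 1\)</span> exactly; for equal-size empirical measures in 1-D the primal <span class="math inline">\(W_1\)</span> is just the mean absolute difference of sorted samples, and the 1-Lipschitz critic <span class="math inline">\(f(x) = -x\)</span> is optimal for this particular shift.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# p = N(0,1), q = N(1,1): q is p shifted right by 1, so W1(p, q) = 1 exactly
xp = rng.standard_normal(N)
xq = rng.standard_normal(N) + 1.0

# primal: 1-D W1 between equal-size empirical measures = mean |sorted p - sorted q|
w1_primal = np.mean(np.abs(np.sort(xp) - np.sort(xq)))

# dual lower bound: E_p f - E_q f with the 1-Lipschitz critic f(x) = -x,
# which is optimal here since q is a rightward translation of p
w1_dual = np.mean(-xp) - np.mean(-xq)
```

<p>Both numbers should land near 1, with the dual never exceeding the primal.</p>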
<p>Now we can use this tractable lower bound to estimate the <a href="https://arxiv.org/abs/1903.11780">Wasserstein Dependency Measure</a> <span class="math inline">\(W_1(p(x, z), p(x) p(z))\)</span> – a Wasserstein analogue of the Mutual Information<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>: <span class="math display">\[
W_1(p(x, z), p(x) p(z)) \ge \E_{p(x, z)} f(x, z) - \E_{p(x) p(z)} f(x, z)
\]</span></p>
<p>You can notice that this lower bound is similar to the <a href="/posts/20190814thoughtsonmutualinformationformallimitations.html#thenguyenwainwrightjordanbound">Nguyen-Wainwright-Jordan lower bound</a> on the KL. Unfortunately, it’s not known whether this bound is efficient or if it also exhibits large variance. A thorough theoretical analysis of the Kantorovich–Rubinstein lower bound is an interesting research question.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The Mutual Information is far from being the only dependency measure, yet it’s the most intuitive one, with the intuition coming from information theory. However, as I argued here, this nice interpretation goes out of the window once you introduce continuous random variables, so there’s no need to stick to this particular choice, especially given that there are lots of alternative dependency measures. With the <a href="/posts/20190814thoughtsonmutualinformationformallimitations.html">Formal Limitations</a> paper preventing us from having practical black-box lower bounds on the MI, I believe more and more researchers will study alternative dependence measures, and we already see some pioneering work, such as the Wasserstein Dependency Measure.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>This is, of course, due to the fact that a real number requires an infinite number of bits to be written exactly, allowing one to indeed store an infinite amount of information in a truly real number.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Notably, the authors of the paper did not name their measure something like Wasserstein Mutual Information. Technically speaking, only the KL-based MI can be called <em>something</em>-information since, as we’ve discussed already, it’s the only dependency measure that has an information-theoretic interpretation. In that sense, Lautum Information should have been named differently.<a href="#fnref2">↩</a></p></li>
</ol>
</div>Sun, 15 Sep 2019 00:00:00 UThttp://artem.sobolev.name/posts/20190915thoughtsonmutualinformationalternativedependencymeasures.htmlArtemhttp://artem.sobolev.name/posts/20190915thoughtsonmutualinformationalternativedependencymeasures.htmlThoughts on Mutual Information: Formal Limitations
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/2ct3nzhcA/20190814thoughtsonmutualinformationformallimitations.html
<p>This post continues the discussion started in <a href="/posts/20190810thoughtsonmutualinformationmoreestimators.html">Thoughts on Mutual Information: More Estimators</a>. This time we’ll focus on the drawbacks and limitations of these bounds.</p>
<!--more-->
<p>Let’s start with the elephant in the room: a year ago an interesting preprint was uploaded to arXiv: <a href="https://arxiv.org/abs/1811.04251">Formal Limitations on the Measurement of Mutual Information</a>, in which the authors essentially argue that if you don’t know any densities (the hardest case, according to my hierarchy), then any <strong>distribution-free high-confidence lower bound</strong> on the MI would require <span class="math inline">\(\exp(\text{MI})\)</span> samples, and thus black-box MI lower bounds should be deemed impractical.</p>
<h2 id="formallimitations">Formal Limitations</h2>
<p>The paper mounts a massive attack on distribution-free lower bounds on the Mutual Information. Not only do McAllester and Stratos show that existing bounds are inferior, they also rule out any black-box lower bounds on the KL divergence. The core result that establishes the impossibility of cheap and good lower bounds on the Mutual Information is Theorem 2, which states (in slightly reformulated notation)</p>
<blockquote>
<p><strong>Theorem</strong>: Let <span class="math inline">\(B\)</span> be any distribution-free high-confidence lower bound on <span class="math inline">\(\mathbb{H}[p(x)]\)</span> computed from a sample <span class="math inline">\(x_{1:N} \sim p(x)\)</span>. More specifically, let <span class="math inline">\(B(x_{1:N}, \delta)\)</span> be any real-valued function of a sample and a confidence parameter <span class="math inline">\(\delta\)</span> such that for any <span class="math inline">\(p(x)\)</span>, with probability at least <span class="math inline">\((1 - \delta)\)</span> over a draw of <span class="math inline">\(x_{1:N}\)</span> from <span class="math inline">\(p(x)\)</span>, we have <span class="math display">\[\mathbb{H}[p(x)] \ge B(x_{1:N}, \delta).\]</span> For any such bound, and for <span class="math inline">\(N \ge 50\)</span> and <span class="math inline">\(k \ge 2\)</span>, with probability at least <span class="math inline">\(1 - \delta - 1.01/k\)</span> over the draw of <span class="math inline">\(x_{1:N}\)</span> we have <span class="math display">\[B(x_{1:N}, \delta) \le \log(2k N^2)\]</span></p>
</blockquote>
<p>Indeed, since (in the discrete case) <span class="math inline">\(I(X, X) = H(X)\)</span> <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>, any good<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> lower bound on the MI would give a good lower bound on the entropy, and the theorem above says there are no such bounds (only ones that are either exponentially expensive to compute, or not high-confidence, or not black-box). The authors then argue that one can have good estimators if they forgo the lower-bound guarantee and settle for an estimate that is neither a lower nor an upper bound. However, this is undesirable in many cases, especially when we’d like to compare two numbers.</p>
<p>Unfortunately, I found the paper hard to digest, and as far as I know it’s still not published, so we should probably be cautious about the presented result. Nevertheless, I’ll show below that several often-used bounds do indeed seem to have this limitation.</p>
<h3 id="thenguyenwainwrightjordanbound">The Nguyen-Wainwright-Jordan Bound</h3>
<p>The process we’ve followed so far to derive a lower bound on the MI has been somewhat cumbersome: we first decomposed the MI into some expectation and then used fancy bounds on some of the terms. An alternative and easier approach is to recall that the MI is a certain KL divergence and take any off-the-shelf lower bound on the KL divergence.</p>
<p>One such lower bound can be obtained using Fenchel conjugate functions (<a href="https://arxiv.org/abs/0809.0853">Nguyen et al.</a>; alternatively, see the <a href="https://arxiv.org/abs/1606.00709">f-GANs</a> paper):</p>
<p><span class="math display">\[
KL(p(x) \mid\mid q(x)) \ge
\mathbb{E}_{p(x)} f(x)
-
\mathbb{E}_{q(x)} \exp(f(x))
+ 1
\]</span></p>
<p>Where <span class="math inline">\(f(x)\)</span> (a critic) is any function that takes <span class="math inline">\(x\)</span> as input and outputs a scalar. The optimal choice can be shown to be <span class="math inline">\(f^*(x) = \ln \tfrac{p(x)}{q(x)}\)</span>. And all is nice, except for the lurking menace of the <span class="math inline">\(\exp\)</span> term. Consider a Monte Carlo estimate in the case of the optimal critic (<span class="math inline">\(x_{1:N} \sim p(x), y_{1:M} \sim q(y)\)</span>): <span class="math display">\[
\frac{1}{N} \sum_{n=1}^N \ln \frac{p(x_n)}{q(x_n)}
-
\frac{1}{M} \sum_{m=1}^M \frac{p(y_m)}{q(y_m)}
+ 1
\]</span> The first term is exactly the Monte Carlo estimate of the KL divergence, while the second (the balancing term, as it counterweights the first one) in expectation gives 1. However, the ratio of densities might take on extremely large values and in general has enormous variance. Indeed, the variance of the balancing term is</p>
<p><span class="math display">\[
\begin{align*}
\mathbb{V}_{q(y_{1:M})} \left[ \frac{1}{M} \sum_{m=1}^M \frac{p(y_m)}{q(y_m)} \right]
&=
\frac{1}{M}
\mathbb{V}_{q(y)} \left[ \frac{p(y)}{q(y)} \right]
=
\frac{1}{M}
\E_{q(y)} \left[ \left(\frac{p(y)}{q(y)} \right)^2 - 1 \right] \\
&=
\frac{1}{M}
\E_{p(y)} \left[ \frac{p(y)}{q(y)} - 1 \right]
=
\frac{\E_{p(y)} \left[ \exp \log \frac{p(y)}{q(y)} \right] - 1}{M} \\
& \ge
\frac{\exp \E_{p(y)} \left[ \log \frac{p(y)}{q(y)} \right] - 1}{M}
=
\frac{\exp\left(\text{KL}(p(y) \mid\mid q(y))\right) - 1}{M}
\end{align*}
\]</span></p>
<p>So, one can see that, indeed, the NWJ bound can’t give us a high-confidence few-samples lower bound on any KL, not only the MI. This is because the second term contributes large zero-mean noise to the bound. The only way to drive the magnitude of this noise down is to take more samples, and, as the analysis above shows, the number of samples should be exponential in the KL (the statement could be made more precise by appealing to Chebyshev’s inequality).</p>
<h3 id="thedonskervaradhanestimator">The Donsker-Varadhan Estimator</h3>
<p>Donsker and Varadhan have proposed an essentially tighter bound on the KL divergence, of the form <span class="math display">\[
KL(p(x) \mid\mid q(x)) \ge
\E_{p(x)} f(x)
-
\log \E_{q(x)} \exp(f(x))
\]</span> With the same <span class="math inline">\(f^*(x) = \ln \tfrac{p(x)}{q(x)}\)</span> being an optimal critic. There are two key differences to the previous bound: the first is that it uses a logarithm in front of the balancing term, preventing it from contributing huge variance (but this variance still has to go somewhere, and we’ll see where it goes), and the second (and the most important) is that this bound is no longer amenable to (unbiased) Monte Carlo estimation due to the logarithm outside of the expectation. In practice <a href="https://arxiv.org/abs/1801.04062">people just take an empirical average</a> under the expectation, thus obtaining a biased estimate (which in general is neither a lower nor an upper bound):</p>
<p><span class="math display">\[
\frac{1}{N} \sum_{n=1}^N \ln \frac{p(x_n)}{q(x_n)}
-
\log
\frac{1}{M} \sum_{m=1}^M \frac{p(y_m)}{q(y_m)}
\]</span></p>
<p>It can be shown that the balancing term now has a huge bias and is always negative. It’s also easy to see that the bias converges to 0 as we take more samples <span class="math inline">\(M\)</span>, so one might hope that with moderately many samples we’d have some tolerable bias. Well, this doesn’t seem to be the case.</p>
<p>Take a closer look at the bias of the balancing term <span class="math display">\[
\E_{q(y_{1:M})}
\log
\frac{1}{M} \sum_{m=1}^M \frac{p(y_m)}{q(y_m)}
\]</span> It can be seen as an asymptotically unbiased estimate (and a lower bound for all finite <span class="math inline">\(M\)</span>) of the log normalizing constant of <span class="math inline">\(p(y)\)</span> (which is zero, since <span class="math inline">\(p(y)\)</span> is already normalized) and is well-studied. In particular, <a href="https://arxiv.org/abs/1808.09034">Domke and Sheldon</a> have shown (Theorem 3) that, essentially, the bias of the balancing term converges to 0 at the following rate:</p>
<p><span class="math display">\[
O\left(M^{1} \mathbb{V}_{q(y)} \left[ \frac{p(y)}{q(y)} \right] \right)
\]</span></p>
<p>Which 1) shows us where the variance has gone; and 2) hints that in order to eliminate the bias we’d again need to take an exponential number of samples. I don’t know what happens to the actual variance of the balancing term, but it can only make things worse.</p>
<h3 id="thecontrastivepredictivecodingbound">The Contrastive-Predictive-Coding Bound</h3>
<p>Let’s leave the realm of lower bounds on KL now. Previously I have already presented the InfoNCE bound: <span class="math display">\[
\text{MI}[p(x, z)]
\ge
\E_{p(z_{0:K})}
\E_{p(x \mid z_0)}
\log \frac{p(x \mid z_0)}{\frac{1}{K+1} \sum_{k=0}^K p(x \mid z_k)}
\]</span></p>
<p>Importantly, this bound does not have access to any marginals of the joint <span class="math inline">\(p(x,z)\)</span>. It’s easy to show that this lower bound is itself upper-bounded by <span class="math inline">\(\log (K+1)\)</span>, which confirms the thesis:</p>
<p><span class="math display">\[
\begin{align*}
\E_{p(z_{0:K})}
\E_{p(x \mid z_0)}
\log \frac{p(x \mid z_0)}{\frac{1}{K+1} \sum\limits_{k=0}^K p(x \mid z_k)}
& =
\log (K+1)
+
\E_{p(z_{0:K})}
\E_{p(x \mid z_0)}
\log \frac{p(x \mid z_0)}{\sum\limits_{k=0}^K p(x \mid z_k)} \\
& \le
\log (K+1)
\end{align*}
\]</span> Which is due to the log’s argument being between 0 and 1. So this <span class="math inline">\(\log(K+1)\)</span> upper bound on the lower bound means that if the true MI is much larger than this value, the bound will be very loose.</p>
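<p>The <span class="math inline">\(\log(K+1)\)</span> ceiling is easy to observe numerically. Below is a sketch with a nearly deterministic Gaussian pair whose true MI far exceeds <span class="math inline">\(\log(K+1)\)</span>; the model and all constants are my own choices. Note that the per-sample log-ratio is bounded by <span class="math inline">\(\log(K+1)\)</span> pointwise, so the estimate can never break the ceiling, however large the true MI.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Nearly deterministic toy pair: z ~ N(0,1), x|z ~ N(z, sigma^2) with tiny sigma,
# so the true MI = 0.5*log(1 + 1/sigma^2) is large (~4.6 nats for sigma = 0.01)
sigma2 = 1e-4
true_mi = 0.5 * np.log(1 + 1 / sigma2)

K, R = 9, 50_000
z = rng.standard_normal((R, K + 1))                            # z_{0:K} ~ p(z), iid
x = z[:, :1] + np.sqrt(sigma2) * rng.standard_normal((R, 1))   # x ~ p(x|z_0)

# log p(x|z_k) up to an additive constant that cancels inside the ratio
logp = -0.5 * (x - z) ** 2 / sigma2
m = logp.max(axis=1, keepdims=True)
log_denom = m.squeeze(1) + np.log(np.mean(np.exp(logp - m), axis=1))  # log-mean-exp
infonce = np.mean(logp[:, 0] - log_denom)   # saturates near log(K+1), far below true_mi
```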
<p>Given all these negative results, one might ask themselves whether knowing the marginal <span class="math inline">\(p(z)\)</span> would do much better. Consider the “known prior” case:</p>
<p><span class="math display">\[
\text{MI}[p(x, z)]
\ge
\E_{p(x, z_0)}
\E_{q_\phi(z_{1:K} \mid x)}
\log \frac{\hat\varrho_\eta(x \mid z_0)}{\frac{1}{K+1} \sum_{k=0}^K \hat\varrho_\eta(x \mid z_k) \frac{p(z_k)}{q_\phi(z_k \mid x)}}
\]</span></p>
<p>Then we have <span class="math display">\[
\begin{align*}
\E_{\substack{p(x, z_0) \\ q_\phi(z_{1:K} \mid x)}}
&
\log \frac{\hat\varrho_\eta(x \mid z_0)}{\frac{1}{K+1} \sum_{k=0}^K \hat\varrho_\eta(x \mid z_k) \frac{p(z_k)}{q_\phi(z_k \mid x)}} \\
& =
\log(K+1)
+
\E_{\substack{p(x, z_0) \\ q_\phi(z_{1:K} \mid x)}}
\left[
\log \frac{\hat\varrho_\eta(x \mid z_0) \frac{p(z_0)}{q_\phi(z_0 \mid x)}}{\sum_{k=0}^K \hat\varrho_\eta(x \mid z_k) \frac{p(z_k)}{q_\phi(z_k \mid x)}}
-
\log \frac{p(z_0)}{q_\phi(z_0 \mid x)}
\right]
\\
& \le
\log(K+1)
+
\E_{\substack{p(x, z_0) \\ q_\phi(z_{1:K} \mid x)}}
\log \frac{q_\phi(z_0 \mid x) p(x \mid z_0)}{p(z_0) p(x \mid z_0)} \\
& =
\log(K+1)
+
\E_{p(x, z_0)}
\log \frac{q_\phi(z_0 \mid x) p(x \mid z_0)}{p(z_0 \mid x) p(x)} \\
& =
\log(K+1)
+
\E_{p(x, z)}
\log \frac{p(x \mid z)}{p(x)}
-
\E_{p(x, z)}
\log \frac{p(z \mid x)}{q_\phi(z \mid x)} \\
& =
\log(K+1)
+
\text{MI}[p(x, z)]
-
\text{KL}(p(x,z) \mid\mid q_\phi(z \mid x) p(x))
\end{align*}
\]</span></p>
<p>Which shows that by choosing <span class="math inline">\(q_\phi(z \mid x) = p(z)\)</span> we essentially threw the baby out with the bathwater. Yes, <span class="math inline">\(K\)</span> still needs to be exponential, but this time not in the original MI, but rather in <span class="math inline">\(\text{KL}(p(x,z) \mid\mid q_\phi(z \mid x) p(x))\)</span>, which can be made much smaller with a good choice of the variational distribution <span class="math inline">\(q_\phi(z \mid x)\)</span>.</p>
<p>Also recall that we can reparametrize the bound in terms of <span class="math inline">\(\hat\rho_\eta(x, z) = \hat\varrho_\eta(x \mid z) p(z)\)</span></p>
<p><span class="math display">\[
\begin{align*}
\text{MI}[p(x, z)]
& \ge
\E_{p(x, z_0)}
\E_{q_\phi(z_{1:K} \mid x)}
\log \frac{\hat\rho_\eta(x, z_0)}{\frac{1}{K+1} \sum_{k=0}^K \frac{\hat\rho_\eta(x, z_k) }{q_\phi(z_k \mid x)}}
-
\E_{p(z)}
\log p(z) \\
\text{MI}[p(x, z)]
& =
\E_{p(x, z)} \log p(z \mid x) - \E_{p(z)} \log p(z)
\end{align*}
\]</span></p>
<p>Hence <span class="math display">\[
\E_{p(x, z_0)} \log p(z_0 \mid x)
\ge
\E_{p(x, z_0)}
\E_{q_\phi(z_{1:K} \mid x)}
\log \frac{\hat\rho_\eta(x, z_0)}{\frac{1}{K+1} \sum_{k=0}^K \frac{\hat\rho_\eta(x, z_k) }{q_\phi(z_k \mid x)}}
\]</span> Now we can choose the <span class="math inline">\(p(x)\)</span> marginal freely. Let <span class="math inline">\(p(x) = \delta(x - \tilde{x})\)</span>. Then <span class="math display">\[
\E_{p(z_0 \mid \tilde{x})}
\log p(z_0 \mid \tilde{x})
\ge
\E_{p(z_0 \mid \tilde{x})}
\E_{q_\phi(z_{1:K} \mid \tilde{x})}
\log \frac{\hat\rho_\eta(\tilde{x}, z_0)}{\frac{1}{K+1} \sum_{k=0}^K \frac{\hat\rho_\eta(\tilde{x}, z_k) }{q_\phi(z_k \mid \tilde{x})}}
\]</span></p>
<p>So in a sense (and this is indeed how we derived the bound in the first place), this “known prior” lower bound is based on a distribution-free <em>upper</em> bound on the entropy of <span class="math inline">\(z \mid x\)</span> and avoids lower-bounding any entropies.</p>
<h2 id="butwhydoesitworkinpractice">But why does it work in practice?</h2>
<p>Despite the negative results above, there’s a lot of empirical evidence of successful applications of all of the distribution-free bounds presented above. So what’s going on? Quite possibly, this was the question folks from Google Brain asked themselves in their recent preprint <a href="https://arxiv.org/abs/1907.13625">On Mutual Information Maximization for Representation Learning</a>. In this paper researchers investigated representations obtained by the Mutual Information maximization principle. For example, one finding is that tighter MI estimates surprisingly led to worse performance. Overall, the apparent conclusion of the paper is that the MI estimation perspective does not seem to explain the observed behavior. The authors then suggest the metric learning perspective and reinterpret lower bounds on the MI as metric learning objectives.</p>
<h2 id="conclusion">Conclusion</h2>
<p>All this evidence suggests that estimating the MI is an even harder problem than we used to think. In particular, black-box MI estimation seems to be intractable in non-toy cases. Luckily, representation learning works nevertheless, probably due to a different phenomenon.</p>
<p>However, for many problems it’d be really nice to have a way to quantify the dependence between <span class="math inline">\(x\)</span> and <span class="math inline">\(z\)</span>. A possible approach here is to consider different divergences between the joint <span class="math inline">\(p(x, z)\)</span> and the product of marginals <span class="math inline">\(p(x) p(z)\)</span>. For example, one possible direction is to replace the KL divergence with some other <a href="https://en.wikipedia.org/wiki/Fdivergence"><span class="math inline">\(f\)</span>-divergence</a> (see <a href="https://ieeexplore.ieee.org/document/4455754">Lautum Information</a>, for example), or with the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a>. And there is already some work in this direction: <a href="https://arxiv.org/abs/1903.11780">Wasserstein Dependency Measure for Representation Learning</a> explores, unsurprisingly, the Wasserstein distance, while <a href="https://arxiv.org/abs/1808.06670">Learning deep representations by mutual information estimation and maximization</a> considers the Jensen–Shannon divergence instead of the KL divergence. It’d be curious to see some theorems / efficient bounds for these and other divergences.</p>
<p>Finally, one additional contribution of the Formal Limitations paper is the impossibility of good lower bounds on the KL divergence (supported by the reasoning above). This raises the question: given the whole family of <span class="math inline">\(f\)</span>-divergences and their Fenchel-conjugate-based black-box lower bounds, do all of them exhibit such computationally unfavorable behavior? If not, which ones do?</p>
<p>Thanks to <a href="https://twitter.com/poolio">Ben Poole</a>, <a href="https://twitter.com/eeevgen">Evgenii Egorov</a> and Arseny Kuznetsov for valuable discussions.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Or, in my notation, <span class="math inline">\(\text{MI}[p(x) \delta(z  x)] = \mathbb{H}[p(x)]\)</span>.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>By “good lower bound” I mean a distribution-free high-confidence lower bound that uses a decent number of samples, say, polynomial or even linear in the MI.<a href="#fnref2">↩</a></p></li>
</ol>
</div>Wed, 14 Aug 2019 00:00:00 UThttp://artem.sobolev.name/posts/20190814thoughtsonmutualinformationformallimitations.htmlArtemhttp://artem.sobolev.name/posts/20190814thoughtsonmutualinformationformallimitations.htmlThoughts on Mutual Information: More Estimators
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/H77CedEWiFA/20190810thoughtsonmutualinformationmoreestimators.html
<p>In this post I’d like to show how Self-Normalized Importance Sampling (<a href="/posts/20190510importanceweightedhierarchicalvariationalinference.html">IWHVI</a> and IWAE) and Annealed Importance Sampling can be used to give (sometimes sandwich) bounds on the MI in many different cases.</p>
<!--more-->
<p><a href="https://en.wikipedia.org/wiki/Mutual_information">Mutual Information</a> (MI) is an important concept from Information Theory that captures the amount of information one random variable <span class="math inline">\(X\)</span> carries about the r.v. <span class="math inline">\(Z\)</span>. It is usually denoted <span class="math inline">\(I(X, Z)\)</span>; however, in order to emphasize the underlying joint distribution, I’ll be using the non-standard notation <span class="math inline">\(\text{MI}[p(x,z)]\)</span>. Formally, the MI has the following definition:</p>
<p><span class="math display">\[
\text{MI}[p(x,z)]
:= \E_{p(x, z)} \log \frac{p(x, z)}{p(x) p(z)}
= \E_{p(x, z)} \log \frac{p(x \mid z)}{p(x)}
\]</span></p>
<p>Having such a nice information-measuring interpretation, the MI is a natural objective and/or metric in many problems in Machine Learning. One <a href="https://arxiv.org/abs/1802.04874">particular application</a> is evaluation of encoder-decoder-like architectures, as the MI quantifies the amount of information contained in the code. In particular, in Variational Autoencoders a good decoder should have high <span class="math inline">\(\text{MI}[p(x,z)]\)</span>, meaning the code is very useful for generation. A good encoder though… has to keep a balance between providing enough information to the decoder, while not deviating too much from the prior by, for example, encoding redundant / unnecessary information. <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></p>
<h2 id="miestimation">MI Estimation</h2>
<p>In general estimating the MI is intractable, as it requires knowing the log marginal density <span class="math inline">\(\log p(x)\)</span> or the intractable log posterior <span class="math inline">\(\log p(z \mid x)\)</span>. However, there are many efficient variational bounds that can be employed to give tractable lower or upper bounds on the MI. Many existing bounds are reviewed in the <a href="https://arxiv.org/abs/1905.06922">On Variational Bounds of Mutual Information</a> paper.</p>
<p>It is important to take into account what we know about the joint distribution of <span class="math inline">\(x\)</span> and <span class="math inline">\(z\)</span>. We’ll consider several nested “layers” of decreasing complexity:</p>
<ol style="list-style-type: decimal">
<li><strong>Black-box</strong> case: Distributions <span class="math inline">\(p(x, z)\)</span> we can only sample from, but don’t know any densities of. This way we can form Monte Carlo estimates after some learning, and, surprisingly, we can already give some lower bounds in this case.</li>
<li><strong>Known conditional</strong> case: Distributions <span class="math inline">\(p(x, z)\)</span> we can sample from and know one conditional distribution, say, <span class="math inline">\(p(x \mid z)\)</span>.</li>
<li><strong>Known marginal</strong> case: Distributions <span class="math inline">\(p(x, z)\)</span> we can sample from and know one marginal distribution, say, <span class="math inline">\(p(z)\)</span>.</li>
<li><strong>Known joint</strong> case: Distributions <span class="math inline">\(p(x, z)\)</span> we can sample from and know both a marginal and a conditional distribution, say, <span class="math inline">\(p(z)\)</span> and <span class="math inline">\(p(x \mid z)\)</span>.</li>
<li><strong>Known everything</strong> case: Distributions <span class="math inline">\(p(x, z)\)</span> which we can sample from and know all conditionals and marginals. This is a trivial case and doesn’t require any bounds. The MI can be estimated using Monte Carlo directly, so we’ll omit it from the discussion.</li>
</ol>
<h2 id="boundsbasedonselfnormalizedimportancesampling">Bounds based on SelfNormalized Importance Sampling</h2>
<p>Self-Normalized Importance Sampling (SNIS) has been shown (see my previous posts) to give both lower and upper bounds on the marginal log-likelihood:</p>
<p><span class="math display">\[
\begin{align*}
\text{IWAE}:&
\quad\quad\quad
\log p(x) \ge \E_{q(z_{1:K} \mid x)} \log \frac{1}{K} \sum_{k=1}^K \frac{p(x,z_k)}{q(z_k \mid x)}
\\
\text{IWHVI}:&
\quad\quad\quad
\log p(x) \le \E_{p(z_{0} \mid x)} \E_{q(z_{1:K} \mid x)} \log \frac{1}{K+1} \sum_{k=0}^K \frac{p(x,z_k)}{q(z_k \mid x)}
\end{align*}
\]</span></p>
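<p>As a sanity check, the two bounds can be seen sandwiching the true log marginal likelihood on a toy conjugate Gaussian model where everything is available in closed form. The model, the deliberately imperfect proposal <span class="math inline">\(q(z \mid x) = \mathcal{N}(x/2, 1)\)</span>, and all constants below are my own illustrative choices.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: p(z) = N(0,1), p(x|z) = N(z,1)  =>  p(x) = N(0,2), p(z|x) = N(x/2, 1/2)
x = 1.0
K, R = 10, 20_000   # K importance samples, R outer Monte Carlo repetitions

def log_normal(v, mean, var):
    return -0.5 * (v - mean) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

def log_mean_exp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def log_weights(z):
    # log [ p(x, z) / q(z|x) ] with proposal q(z|x) = N(x/2, 1): right mean, too wide
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    return log_joint - log_normal(z, x / 2.0, 1.0)

# IWAE lower bound: z_{1:K} ~ q(z|x)
zq = x / 2.0 + rng.standard_normal((R, K))
lower = np.mean(log_mean_exp(log_weights(zq), axis=1))

# IWHVI upper bound: z_0 ~ p(z|x) (exact posterior here), z_{1:K} ~ q(z|x)
z0 = x / 2.0 + np.sqrt(0.5) * rng.standard_normal((R, 1))
zall = np.concatenate([z0, x / 2.0 + rng.standard_normal((R, K))], axis=1)
upper = np.mean(log_mean_exp(log_weights(zall), axis=1))

log_px = log_normal(x, 0.0, 2.0)   # exact log-marginal, for comparison
```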
<p>We use such bounds to give sandwich bounds on the intractable entropy term in the MI. Another useful insight is that</p>
<p><span class="math display">\[
\omega(z_{0:K} \mid x) := \frac{\hat\rho(x, z_0) \tau(z_{1:K} \mid x)}{\frac{1}{K+1} \sum_{k=0}^K \frac{\hat\rho(x, z_k)}{\tau(z_k \mid x)}}
\]</span></p>
<p>is a distribution (a valid pdf, to be precise) for almost any (the only condition is the same as in standard importance sampling) unnormalized distribution <span class="math inline">\(\hat\rho(x, z)\)</span> (a joint distribution over <span class="math inline">\(z\)</span> and <span class="math inline">\(x\)</span>) and a normalized distribution <span class="math inline">\(\tau(z \mid x)\)</span> (a distribution over <span class="math inline">\(z\)</span>, possibly conditioned on <span class="math inline">\(x\)</span>). The fact that <span class="math inline">\(\omega(z_{0:K} \mid x)\)</span> is a valid distribution allows us to consider the following KL divergence:</p>
<p><span class="math display">\[
\begin{align*}
0
\le
\text{KL}(p(x, z_0) \tau(z_{1:K} \mid x) \mid\mid p(x) \omega(z_{0:K} \mid x))
=
\E_{p(x, z_0)}
\E_{\tau(z_{1:K} \mid x)}
\log
\frac{p(x, z_0) \tau(z_{1:K} \mid x)}{p(x) \omega(z_{0:K} \mid x)}
\end{align*}
\]</span></p>
<p>Which gives the following lower bound on the MI: <span class="math display">\[
\begin{align*}
\E_{p(x, z_0)}
\E_{\tau(z_{1:K} \mid x)}
\log
\frac{p(x \mid z_0)}{p(x)}
& \ge
\E_{p(x, z_0)}
\E_{\tau(z_{1:K} \mid x)}
\log
\frac{\omega(z_{0:K} \mid x)}{p(z_0) \tau(z_{1:K} \mid x)} \\
& =
\E_{p(x, z_0)}
\E_{\tau(z_{1:K} \mid x)}
\log
\frac{\hat\rho(x, z_0)}{\frac{1}{K+1} \sum_{k=0}^K \frac{\hat\rho(x, z_k)}{\tau(z_k \mid x)}}
-
\E_{p(z)} \log p(z)
\end{align*}
\]</span></p>
<p>Equivalently, by reparametrizing the bound in terms of <span class="math inline">\(\hat\varrho(x \mid z) = \hat\rho(x, z) / p(z)\)</span> we have <span class="math display">\[
\begin{align*}
\E_{p(x, z_0)}
\E_{\tau(z_{1:K} \mid x)}
\log
\frac{p(x \mid z_0)}{p(x)}
\ge
\E_{p(x, z_0)}
\E_{\tau(z_{1:K} \mid x)}
\log
\frac{\hat\varrho(x \mid z_0)}{\frac{1}{K+1} \sum_{k=0}^K \hat\varrho(x \mid z_k) \frac{p(z_k)}{\tau(z_k \mid x)}}
\end{align*}
\]</span></p>
<p>These lower bounds work for any <span class="math inline">\(\hat\rho\)</span> and <span class="math inline">\(\hat\varrho\)</span>, and the optimal choices are <span class="math inline">\(p(x, z)\)</span> and <span class="math inline">\(p(x \mid z)\)</span>, respectively.</p>
<h3 id="knownjoint">Known Joint</h3>
<p>First, we’ll consider the easiest case – when we know the joint distribution in the form of a prior plus a conditional. A prominent example of this class is the VAE decoder, which is defined by some prior in the latent space <span class="math inline">\(p(z)\)</span> and a decoder <span class="math inline">\(p_\theta(x \mid z)\)</span> that uses a neural network to generate a distribution over observations <span class="math inline">\(x\)</span> for a particular <span class="math inline">\(z\)</span>. Computing the MI of the decoder is arguably a natural way to measure the extent of posterior collapse<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>, since MI can be expressed in the following form, which essentially measures the true posterior’s deviation from the prior:</p>
<p><span class="math display">\[
\text{MI}[p(x, z)]
:= \E_{p(x, z)} \log \frac{p(z \mid x)}{p(z)}
= \E_{p(x)} D_{KL}(p(z \mid x) \mid\mid p(z))
\]</span></p>
<p>However, the MI as introduced above requires knowing the marginal <span class="math inline">\(p(x)\)</span>, which is intractable. Luckily, the <a href="/posts/20190426neuralsamplersandhierarchicalvariationalinference.html">Multisample Variational Bounds</a> allow us to give efficient variational sandwich bounds on the MI (where <span class="math inline">\(q_\phi(z \mid x)\)</span> is an encoder with known density<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a>):</p>
<p><span class="math display">\[
\E_{\substack{p_\theta(x, z_0) \\ q_\phi(z_{1:K} \mid x)} }
\log \frac{p_\theta(x \mid z_0)}{\frac{1}{K+1} \sum_{k=0}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)}}
\le
\text{MI}[p_\theta(x, z)]
%= \mathbb{E}_{p_\theta(x, z)} \left[ \log p_\theta(x \mid z) - \log p_\theta(x) \right]
\le
\E_{\substack{p_\theta(x, z_0) \\ q_\phi(z_{1:K} \mid x)} }
\log \frac{p_\theta(x \mid z_0)}{\frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)}}
\]</span></p>
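<p>To make this concrete, here is a hedged numerical sketch of the sandwich (my own illustration, not the post’s code) for a linear-Gaussian “decoder” <span class="math inline">\(z \sim \mathcal{N}(0,1)\)</span>, <span class="math inline">\(x = z + \varepsilon\)</span>, whose true MI is <span class="math inline">\(\tfrac{1}{2}\log 2\)</span> nats; the encoder <span class="math inline">\(q(z \mid x) = \mathcal{N}(x/2, 1)\)</span> is intentionally over-dispersed relative to the exact posterior:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

S, K = 100_000, 16
true_mi = 0.5 * np.log(2.0)       # MI of z ~ N(0,1), x = z + N(0,1)

def log_lik(x, z):                # log p(x | z) = log N(x; z, 1)
    return -0.5 * (x - z)**2 - 0.5 * np.log(2 * np.pi)

def log_joint(x, z):              # log p(x, z)
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi) + log_lik(x, z)

def log_q(x, z):                  # encoder q(z | x) = N(x/2, 1), on purpose not exact
    return -0.5 * (z - x / 2)**2 - 0.5 * np.log(2 * np.pi)

def log_mean_exp(a):
    m = a.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(a - m).mean(axis=1))

z0 = rng.normal(size=(S, 1))              # (x, z_0) ~ p(x, z_0)
x = z0 + rng.normal(size=(S, 1))
zq = x / 2 + rng.normal(size=(S, K))      # z_{1:K} ~ q(z | x)

lw_q = log_joint(x, zq) - log_q(x, zq)
lw_all = np.concatenate([log_joint(x, z0) - log_q(x, z0), lw_q], axis=1)
ll0 = log_lik(x, z0)[:, 0]

mi_lower = (ll0 - log_mean_exp(lw_all)).mean()  # denominator averages k = 0..K
mi_upper = (ll0 - log_mean_exp(lw_q)).mean()    # denominator averages k = 1..K
print(mi_lower, true_mi, mi_upper)
```

<p>Both estimates should land within a few hundredths of a nat of <span class="math inline">\(\tfrac{1}{2}\log 2 \approx 0.347\)</span> at this <span class="math inline">\(K\)</span>.</p>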
<h3 id="knownconditional">Known Conditional</h3>
<p>However, we might not have any of the marginal densities in closed form. For example, this is the case if we want to estimate the amount of information the encoder <span class="math inline">\(q_\phi(z \mid x)\)</span> puts into the code <span class="math inline">\(z\)</span>. Since it only defines the conditional distribution, we pair it with some data-generating process <span class="math inline">\(p(x)\)</span> which defines the population we’d like to evaluate the encoder over.</p>
<p>Direct application of the aforementioned bounds leads to (where <span class="math inline">\(\tau_\eta(x_k \mid z)\)</span> is our variational inverse distribution) <span class="math display">\[
\E_{\substack{p(x_0) \\ q_\phi(z \mid x_0) \\ \tau_\eta(x_{1:K} \mid z)}}
\!\!\!
\log \frac{q_\phi(z \mid x_0)}{\frac{1}{K+1} \sum\limits_{k=0}^K \frac{q_\phi(z \mid x_k) p(x_k)}{\tau_\eta(x_k \mid z)}}
\le
\text{MI}[q_\phi(z \mid x) p(x)]
\le
\!\!\!
\E_{\substack{p(x_0) \\ q_\phi(z \mid x_0) \\ \tau_\eta(x_{1:K} \mid z)}}
\!\!\!
\log \frac{q_\phi(z \mid x_0)}{\frac{1}{K} \sum\limits_{k=1}^K \frac{q_\phi(z \mid x_k) p(x_k)}{\tau_\eta(x_k \mid z)}}
\]</span></p>
<p>However, we might not have access to the density <span class="math inline">\(p(x)\)</span>. In this case one can resort to SIVI-like bounds by setting <span class="math inline">\(\tau_\eta(x \mid z) = p(x)\)</span> and arrive at the following bounds: <span class="math display">\[
\E_{\substack{p(x_{0:K}) \\ q_\phi(z \mid x_0)}}
\log \frac{q_\phi(z \mid x_0)}{\frac{1}{K+1} \sum\limits_{k=0}^K q_\phi(z \mid x_k)}
\le
\text{MI}[q_\phi(z \mid x) p(x)]
\le
\E_{\substack{p(x_{0:K}) \\ q_\phi(z \mid x_0)}}
\log \frac{q_\phi(z \mid x_0)}{\frac{1}{K} \sum\limits_{k=1}^K q_\phi(z \mid x_k)}
\]</span></p>
<p>These bounds are known as a special case of the <a href="https://arxiv.org/abs/1807.03748">InfoNCE bounds</a> and are much looser due to the uninformed proposal <span class="math inline">\(\tau\)</span>.</p>
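<p>A quick sketch (my own, with a made-up one-dimensional model) illustrates why this lower bound is weak: each sample’s contribution is at most <span class="math inline">\(\log(K+1)\)</span>, since the numerator also appears in the denominator’s average, so the bound saturates far below a large true MI:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

S, K, sigma = 50_000, 4, 0.1
true_mi = 0.5 * np.log(1 + 1 / sigma**2)       # about 2.31 nats

# p(x) = N(0,1); a sharp "encoder" q(z | x) = N(x, sigma^2), so the MI is high
x = rng.normal(size=(S, K + 1))                # x_0 plus K negative samples
z = x[:, :1] + sigma * rng.normal(size=(S, 1)) # z ~ q(z | x_0)

log_q = -0.5 * ((z - x) / sigma)**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
m = log_q.max(axis=1, keepdims=True)
log_denom = m[:, 0] + np.log(np.exp(log_q - m).mean(axis=1))
per_sample = log_q[:, 0] - log_denom           # pointwise at most log(K+1)
bound = per_sample.mean()
print(bound, np.log(K + 1), true_mi)
```

<p>With <span class="math inline">\(K = 4\)</span> the estimate cannot exceed <span class="math inline">\(\log 5 \approx 1.61\)</span> nats, no matter that the true MI is roughly 2.31 nats.</p>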
<h3 id="knownprior">Known Prior</h3>
<p>Sometimes we use complex implicit models as decoders and don’t have closed-form densities for <span class="math inline">\(p(x \mid z)\)</span>. The most popular instance of such models is Generative Adversarial Networks, which are similar to VAEs except that they lack a well-defined decoder density <span class="math inline">\(p(x \mid z)\)</span> (though the prior <span class="math inline">\(p(z)\)</span> is typically simple and known). In some sense, this density is degenerate: <span class="math inline">\(p(x \mid z) = \delta(x - f(z))\)</span> where <span class="math inline">\(f\)</span> is the generator’s neural network. Unfortunately, we cannot use such a density in the IWHVI bounds. Are we doomed then? Turns out, not quite. Even in this case it’s still possible to give an efficient multisample variational lower bound: <span class="math display">\[
\text{MI}[p(x, z)]
\ge
\E_{p(x, z_0)}
\E_{q_\phi(z_{1:K} \mid x)}
\log \frac{\hat\rho_\eta(x \mid z_0)}{\frac{1}{K+1} \sum_{k=0}^K \hat\rho_\eta(x \mid z_k) \frac{p(z_k)}{q_\phi(z_k \mid x)}}
\]</span></p>
<p>Where <span class="math inline">\(\hat\rho_\eta(x \mid z) = \exp(h_\eta(x \mid z))\)</span> is any unnormalized energy-based model that essentially estimates the unknown density <span class="math inline">\(p(x \mid z)\)</span>.</p>
<p>I am not aware of any good <em>upper</em> bounds, nor whether they are possible at all. I would guess the answer is negative due to the hardness of upper-bounding the cross-entropy in the black-box case.</p>
<h3 id="knownnothing">Known Nothing</h3>
<p>Staying in the realm of implicit models, let’s assume we trained an implicit inference model (think of GANs with encoders) and would like to estimate the MI, much like in the case of the VAE’s encoder in the “known conditional” section. Denoting our inference model <span class="math inline">\(q(z \mid x)\)</span> and the data-generating process <span class="math inline">\(p(x)\)</span> (both densities are unknown to us, but we can sample from them), we can adapt the previous section’s lower bound by choosing the proposal <span class="math inline">\(\tau_\eta(x \mid z) = p(x)\)</span>: <span class="math display">\[
\text{MI}[q(z \mid x) p(x)]
\ge
\E_{p(x_{0:K})}
\E_{q(z \mid x_0)}
\log \frac{\hat\rho_\eta(z \mid x_0)}{\frac{1}{K+1} \sum_{k=0}^K \hat\rho_\eta(z \mid x_k)}
\]</span></p>
<p>Where <span class="math inline">\(\hat\rho_\eta(z \mid x) = \exp(h_\eta(z \mid x))\)</span> is again an unnormalized energy-based model that estimates the unknown density <span class="math inline">\(q(z \mid x)\)</span>.</p>
<p>While widely applicable, this bound is known to be very loose when the true MI is high. We’ll discuss drawbacks and limitations in the next post on the topic.</p>
<h2 id="boundsbasedonannealedimportancesampling">Bounds based on Annealed Importance Sampling</h2>
<p>SNIS is not the only way to obtain variational sandwich bounds on the log marginal likelihood. Another widely known and powerful approach is <a href="https://arxiv.org/abs/1511.02543">Annealed Importance Sampling</a> (AIS).</p>
<p>AIS uses two distributions, called forward and backward: <span class="math display">\[
q_\rightarrow(z_{1:T} \mid x) = q(z_1 \mid x) \mathcal{T}_2(z_2 \mid z_1, x) \cdots \mathcal{T}_{T}(z_{T} \mid z_{T-1}, x)
\\
q_\leftarrow(z_{T:1} \mid x) = p(z_{T} \mid x) \mathcal{T}_{T}(z_{T-1} \mid z_{T}, x) \cdots \mathcal{T}_{2}(z_{1} \mid z_{2}, x)
\\
q_\leftarrow(x, z_{T:1}) = p(x, z_{T}) \mathcal{T}_{T}(z_{T-1} \mid z_{T}, x) \cdots \mathcal{T}_{2}(z_{1} \mid z_{2}, x)
\]</span></p>
<p>Where <span class="math inline">\(\mathcal{T}_t\)</span> is a transition operator that is designed to leave <span class="math inline">\(p_t(z \mid x) \propto q(z \mid x)^{1-\beta_t} p(x, z)^{\beta_t}\)</span> invariant, and <span class="math inline">\(\beta_{1:T+1}\)</span> is a monotonically increasing sequence s.t. <span class="math inline">\(\beta_1 = 0\)</span> and <span class="math inline">\(\beta_{T+1} = 1\)</span>. That is, in the <em>forward distribution</em> <span class="math inline">\(q_\rightarrow(z_{1:T} \mid x)\)</span> one starts with a sample <span class="math inline">\(z_1\)</span> from some proposal <span class="math inline">\(q(z \mid x)\)</span> and then transforms it into a sample from <span class="math inline">\(p_2(z \mid x)\)</span> using the <span class="math inline">\(\mathcal{T}_t\)</span> transition operator (typically an MCMC kernel). The sample <span class="math inline">\(z_2\)</span> is then analogously transformed into <span class="math inline">\(z_3\)</span> and so on. The <em>backward distribution</em> <span class="math inline">\(q_\leftarrow(z_{T:1} \mid x)\)</span> is similar, except it starts with a true posterior sample <span class="math inline">\(z_T \sim p(z \mid x)\)</span> and then sequentially transforms it into a sample from the proposal <span class="math inline">\(z_1 \sim q(z \mid x)\)</span>.</p>
<p>Then one defines the importance weight</p>
<p><span class="math display">\[
\begin{align*}
w(z_{1:T} \mid x)
&=
\frac{q_\leftarrow(z_{T:1} \mid x)}{q_\rightarrow(z_{1:T} \mid x)}
=
\frac{\hat{p}_2(z_1 \mid x)} {q(z_1 \mid x)}
\frac{\hat{p}_3(z_2 \mid x)} {\hat{p}_2(z_2 \mid x)}
\cdots
\frac{p(x, z_{T})} {\hat{p}_{T}(z_{T} \mid x)} \\
&=
% \left( \tfrac{p(x, z)}{q(z \mid x)} \right)^{\beta_t} q(z \mid x)
\frac{ \left( \tfrac{p(x, z_1)}{q(z_1 \mid x)} \right)^{\beta_2} q(z_1 \mid x) }{q(z_1 \mid x)}
\frac{ \left( \tfrac{p(x, z_2)}{q(z_2 \mid x)} \right)^{\beta_3} q(z_2 \mid x) }{ \left( \tfrac{p(x, z_2)}{q(z_2 \mid x)} \right)^{\beta_2} q(z_2 \mid x) }
\cdots
\frac{ \left( \tfrac{p(x, z_{T})}{q(z_{T} \mid x)} \right)^{\beta_{T+1}} q(z_{T} \mid x) }{ \left( \tfrac{p(x, z_{T})}{q(z_{T} \mid x)} \right)^{\beta_{T}} q(z_{T} \mid x) } \\
&=
\left( \tfrac{p(x, z_1)}{q(z_1 \mid x)} \right)^{\beta_2 - \beta_1}
\left( \tfrac{p(x, z_2)}{q(z_2 \mid x)} \right)^{\beta_3 - \beta_2}
\cdots
\left(\tfrac{p(x, z_{T})}{q(z_{T} \mid x)} \right)^{\beta_{T+1} - \beta_{T}} \\
\end{align*}
\]</span> Where the second identity is due to <span class="math inline">\(\mathcal{T}_t\)</span> satisfying the detailed balance equation. Then one can show that <span class="math display">\[
\E_{q_\rightarrow(z_{1:T} \mid x)} w(z_{1:T} \mid x) = p(x)
\quad\Rightarrow\quad
\E_{q_\rightarrow(z_{1:T} \mid x)} \log w(z_{1:T} \mid x) \le \log p(x),
\\
\E_{q_\leftarrow(z_{T:1} \mid x)} \frac{1}{w(z_{1:T} \mid x)} = \frac{1}{p(x)}
\quad\Rightarrow\quad
\E_{q_\leftarrow(z_{T:1} \mid x)} \log w(z_{1:T} \mid x) \ge \log p(x)
\]</span></p>
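<p>The lower-bound half of this statement is easy to probe numerically. Below is a minimal sketch of mine (not the post’s code) on a conjugate toy model <span class="math inline">\(p(z) = \mathcal{N}(0,1)\)</span>, <span class="math inline">\(p(x \mid z) = \mathcal{N}(z,1)\)</span> with proposal <span class="math inline">\(q(z \mid x) = p(z)\)</span>; the step size 0.8 and <span class="math inline">\(T = 30\)</span> are arbitrary choices, and a plain random-walk Metropolis kernel stands in for a proper MCMC kernel (invariance is all the AIS identity needs):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: p(z) = N(0,1), p(x|z) = N(z,1), proposal q(z|x) = p(z)
x = 1.5
log_px = -x**2 / 4 - 0.5 * np.log(4 * np.pi)          # log N(x; 0, 2)

def log_q(z):      return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
def log_joint(z):  return -0.5 * (z**2 + (x - z)**2) - np.log(2 * np.pi)

T, chains = 30, 5_000
betas = np.linspace(0.0, 1.0, T + 1)                  # beta_1 = 0, ..., beta_{T+1} = 1

def log_target(z, b):                                 # log of q(z|x)^(1-b) p(x,z)^b
    return (1 - b) * log_q(z) + b * log_joint(z)

z = rng.normal(size=chains)                           # z_1 ~ q(z|x)
log_w = (betas[1] - betas[0]) * (log_joint(z) - log_q(z))
for t in range(1, T):
    b = betas[t]
    # one random-walk Metropolis step; it leaves p_t(z|x) invariant
    prop = z + 0.8 * rng.normal(size=chains)
    accept = np.log(rng.uniform(size=chains)) < log_target(prop, b) - log_target(z, b)
    z = np.where(accept, prop, z)
    log_w += (betas[t + 1] - betas[t]) * (log_joint(z) - log_q(z))

lower = log_w.mean()                                  # E log w <= log p(x)
unbiased = np.log(np.exp(log_w).mean())               # log E w -> log p(x)
print(lower, log_px, unbiased)
```

<p>The upper bound is not checked here because it needs exact posterior samples to initialize the backward chain, which is precisely what makes it less practical.</p>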
<p>This gives us another set of sandwich bounds on <span class="math inline">\(\log p(x)\)</span>, which we can use to sandwich the MI: <span class="math display">\[
\boxed{
\E_{q_\leftarrow(x, z_{T:1})}
% \E_{p_\theta(x, z_T)}
% \E_{ \mathcal{T}_{T}(z_{T-1} \mid z_{T}, x) \cdots \mathcal{T}_{2}(z_{1} \mid z_{2}, x) }
\log \frac{p_\theta(x \mid z_T)}{ w(z_{1:T} \mid x) }
\le
\text{MI}[p_\theta(x, z)]
\le
\E_{p_\theta(x, z_0)}
\E_{q_\rightarrow(z_{1:T} \mid x)}
\log \frac{p_\theta(x \mid z_0)}{ w(z_{1:T} \mid x) }
}
\]</span></p>
<p>One can also come up with a “decoder-free” version of the bound in a similar fashion to what we’ve done above. First, introduce the following distributions: <span class="math display">\[
\gamma_\rightarrow(z_{1:T} \mid x) = q(z_1 \mid x) \mathcal{K}_2(z_2 \mid z_1, x) \cdots \mathcal{K}_{T}(z_{T} \mid z_{T-1}, x)
\\
\gamma_\leftarrow(x, z_{T:1}) = p(x, z_{T}) \mathcal{K}_{T}(z_{T-1} \mid z_{T}, x) \cdots \mathcal{K}_{2}(z_{1} \mid z_{2}, x)
\]</span> Where <span class="math inline">\(\mathcal{K}_t\)</span> is now tailored to <span class="math inline">\(\kappa_t(z \mid x) \propto q(z \mid x)^{1-\beta_t} \left( \hat\varrho(x \mid z) p(z) \right)^{\beta_t}\)</span> for a certain unnormalized <span class="math inline">\(\hat\varrho(x \mid z)\)</span>.</p>
<p>Now consider <span class="math display">\[
\begin{align*}
0
& \le
\text{KL}\left(
\gamma_\leftarrow(x, z_{T:1})
\mid\mid
p(x) \gamma_\rightarrow(z_{1:T} \mid x)
\right)
=
\mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})}
\log \frac{\gamma_\leftarrow(x, z_{T:1})}{p(x) \gamma_\rightarrow(z_{1:T} \mid x)} \\
& =
\mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})}
\log
\left[
\frac{p(x, z_{T})}{p(x) q(z_1 \mid x) }
\frac{\mathcal{K}_{2}(z_{1} \mid z_{2}, x)}{\mathcal{K}_2(z_2 \mid z_1, x)}
\cdots
\frac{ \mathcal{K}_{T}(z_{T-1} \mid z_{T}, x) }{\mathcal{K}_{T}(z_{T} \mid z_{T-1}, x)}
\right] \\
& =
\mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})}
\log
\left[
\frac{p(x, z_{T})}{p(x) q(z_1 \mid x) }
\frac{\kappa_2(z_1 \mid x)}{\kappa_2(z_2 \mid x)}
\cdots
\frac{ \kappa_T(z_{T-1} \mid x) }{\kappa_T(z_{T} \mid x)}
\right] \\
& =
% q(z \mid x) \left( \hat\varrho(x \mid z) \frac{p(z)}{q(z \mid x)} \right)^{\beta_t}
\mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})}
\log
\left[
\tfrac{p(x, z_{T})}{p(x) q(z_1 \mid x) }
\tfrac{ q(z_1 \mid x) \left( \frac{\hat\varrho(x \mid z_1) p(z_1)}{q(z_1 \mid x)} \right)^{\beta_2} }{ q(z_2 \mid x) \left( \frac{\hat\varrho(x \mid z_2) p(z_2)}{q(z_2 \mid x)} \right)^{\beta_2} }
\cdots
\tfrac{ q(z_{T-1} \mid x) \left( \frac{\hat\varrho(x \mid z_{T-1}) p(z_{T-1})}{q(z_{T-1} \mid x)} \right)^{\beta_T} }{ q(z_T \mid x) \left( \frac{\hat\varrho(x \mid z_T) p(z_T)}{q(z_T \mid x)} \right)^{\beta_T} }
\right] \\
& =
\mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})}
\log
\left[
\frac{p(x, z_{T})}{p(x)}
\frac{1}{p(z_T) \hat\varrho(x \mid z_T)}
\prod_{t=1}^T
\left( \frac{\hat\varrho(x \mid z_t) p(z_t)}{q(z_t \mid x)} \right)^{\beta_{t+1}-\beta_t}
\right]
\end{align*}
\]</span> Hence <span class="math display">\[
\text{MI}[p(x, z)]
\ge
\mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})}
\log
\frac{\hat\varrho(x \mid z_T)}{\prod_{t=1}^T
\left( \frac{\hat\varrho(x \mid z_t) p(z_t)}{q(z_t \mid x)} \right)^{\beta_{t+1}-\beta_t}}
\]</span></p>
<p>Now we can again reparametrize this formula in terms of <span class="math inline">\(\hat\rho(x, z) = \hat\varrho(x \mid z) p(z)\)</span>:</p>
<p><span class="math display">\[
\text{MI}[p(x, z)]
\ge
\mathbb{E}_{\gamma_\leftarrow(x, z_{T:1})}
\log
\frac{\hat\rho(x, z_T)}{\prod_{t=1}^T
\left( \frac{\hat\rho(x, z_t)}{q(z_t \mid x)} \right)^{\beta_{t+1}-\beta_t}}
- \mathbb{E}_{p(z)} \log p(z)
\]</span></p>
<p>It’s tempting to simply put <span class="math inline">\(q(z \mid x) = p(z)\)</span> to obtain a black-box AIS-based analogue of the InfoNCE bound. However, notice that <span class="math inline">\(\gamma_\leftarrow(x, z_{T:1})\)</span> relies on an MCMC kernel that gradually transforms a sample <span class="math inline">\(z_T \sim p(z_T \mid x)\)</span> into <span class="math inline">\(z_1 \sim p(z_1)\)</span> and thus needs to know this density, so I don’t think one can use AIS in a black-box mode.</p>
<p>Finally, note that while the bound is valid for any <span class="math inline">\(\hat\rho\)</span> and <span class="math inline">\(\hat\varrho\)</span>, it’s not quite differentiable w.r.t. their parameters, as any proper MCMC method requires an accept-reject step, which is not differentiable. Thus if one seeks to learn <span class="math inline">\(\hat\rho\)</span> or <span class="math inline">\(\hat\varrho\)</span>, an alternative objective should be used. Luckily, the SNIS-based lower bound with small <span class="math inline">\(K\)</span> would work just fine.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I presented two different ways to give bounds on the MI in cases of variable complexity. The SNIS-based approach seems to be somewhat novel (the special case of InfoNCE was already known), and is applicable to many different problems. The AIS-based one builds on well-known sandwich bounds on the log marginal likelihood, but I haven’t seen it applied to the problem of MI estimation. The reason might be that it’s more restrictive than the SNIS-based estimators: AIS only works for continuous variables, requires complicated MCMC machinery, and does not seem to allow black-box estimation. On the positive side, the AIS-based estimator should perform much better in high-dimensional problems with large MI, especially if one uses gradient-based kernels like HMC.</p>
<p>Next, I’ll share some of my thoughts on drawbacks of these (and some other) bounds, in particular in light of the notorious “Formal Limitations” paper.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>See <a href="http://approximateinference.org/accepted/HoffmanJohnson2016.pdf">ELBO surgery: yet another way to carve up the variational evidence lower bound</a><a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Oftentimes <span class="math inline">\(D_{KL}(q_\phi(z \mid x) \mid\mid p(z))\)</span> is used to measure the extent of the so-called posterior collapse. One can argue this is an indirect metric, as it does not use the <em>decoder at all</em> and relies on the encoder’s ability to approximate the true posterior better than the prior. Moreover, high <span class="math inline">\(D_{KL}(q_\phi(z \mid x) \mid\mid p(z))\)</span> is likely to be an indicator of a poor encoder <span class="math inline">\(q(z \mid x)\)</span>, whereas the Mutual Information is monotonic in the decoder’s quality.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>As I have argued in the IWHVI paper, it might be beneficial to fit two different encoders for lower and upper bounds, correspondingly.<a href="#fnref3">↩</a></p></li>
</ol>
</div>Sat, 10 Aug 2019 00:00:00 UThttp://artem.sobolev.name/posts/20190810thoughtsonmutualinformationmoreestimators.htmlArtemhttp://artem.sobolev.name/posts/20190810thoughtsonmutualinformationmoreestimators.htmlImportance Weighted Hierarchical Variational Inference
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/nqI3_kSR2Cc/20190510importanceweightedhierarchicalvariationalinference.html
<p>This post finishes the discussion on <a href="/posts/20190426neuralsamplersandhierarchicalvariationalinference.html">Neural Samplers for Variational Inference</a> by introducing some recent results (including mine).</p>
<p>Also, there’s <a href="https://youtu.be/pdSu7XfGhHw">a talk recording</a> of me presenting this post’s content, so if you like videos more than texts, check it out.</p>
<!--more-->
<h2 id="quickrecap">Quick Recap</h2>
<p>It all started with an aspiration for a more expressive variational approximation <span class="math inline">\(q_\phi(z \mid x)\)</span>, since a restricted one limits the expressivity of our hierarchical model <span class="math inline">\(p_\theta(x)\)</span>. We could use the multisample bound, which can lighten the restriction to an arbitrary extent, but the price is more computation, and multiple evaluations of the high-dimensional decoder <span class="math inline">\(p_\theta(x \mid z)\)</span> are especially frustrating.</p>
<p>Instead, we hope to leverage Neural Nets’ universal approximation properties and introduce a hierarchical variational approximation <span class="math inline">\(q_\phi(z \mid x) = \int q_\phi(z, \psi \mid x) d\psi\)</span>, which should be much more expressive, and we can sample from it by passing some simple noise <span class="math inline">\(\psi\)</span> through a neural network that generates<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> the <span class="math inline">\(q_\phi(z \mid \psi, x)\)</span> distribution. However, we lost access to the marginal log-density <span class="math inline">\(\log q_\phi(z \mid x)\)</span>, required by the KL term of the ELBO.</p>
<p>A theoretically sound way out is to give an upper bound on the log-density (to obtain a lower bound on the ELBO), but this bound regularizes <span class="math inline">\(q_\phi(z \mid x)\)</span>, and alleviating this regularization requires a more expressive auxiliary variational distribution <span class="math inline">\(\tau_\eta(\psi \mid x,z)\)</span>. Full circle, full stop. At this point, an efficiently computable multisample variational upper bound on <span class="math inline">\(\log q_\phi(z \mid x)\)</span> would be handy, but our naive attempt to obtain one was unsuccessful. Moreover, it might well be that there are no good bounds at all.</p>
<h2 id="newsemiimplicithope"><del>New</del> Semi-Implicit Hope</h2>
<p>A year ago Mingzhang Yin and Mingyuan Zhou published a paper <a href="https://arxiv.org/abs/1805.11183">Semi-Implicit Variational Inference</a> (SIVI) where they essentially proposed the following multisample surrogate ELBO for our model:</p>
<p><span class="math display">\[
\hat{\mathcal{L}}_K^\text{SIVI}
:=
\E_{q_\phi(z, \psi_0 \mid x)}
\E_{q_\phi(\psi_{1:K} \mid x)}
\log \frac{p_\theta(x, z)}{ \frac{1}{K+1} \sum_{k=0}^K q_\phi(z \mid \psi_k, x) }
\]</span></p>
<p>However, the original paper did not prove that this surrogate is a lower bound for all finite <span class="math inline">\(K\)</span>, only that it converges to the ELBO <span class="math inline">\(\mathcal{L}\)</span> in the limit of infinite <span class="math inline">\(K\)</span>. This was later <a href="https://arxiv.org/abs/1810.02789">shown by Molchanov et al.</a>: the surrogate objective is indeed a lower bound for all finite <span class="math inline">\(K\)</span>. Moreover, since it is a lower bound on the ELBO,</p>
<p><span class="math display">\[
\E_{q_\phi(z, \psi_0 \mid x)}
\E_{q_\phi(\psi_{1:K} \mid x)}
\left[
\log \frac{p_\theta(x, z)}{ q_\phi(z \mid x) }
-
\log \frac{p_\theta(x, z)}{ \frac{1}{K+1} \sum_{k=0}^K q_\phi(z \mid \psi_k, x) }
\right]
\ge 0
\]</span> We can recover an upper bound on the marginal log-density (at least in expectation) <span class="math display">\[
\E_{q_\phi(z \mid x)}
\log q_\phi(z \mid x)
\le
\E_{q_\phi(z \mid x)}
\E_{q_\phi(\psi_0 \mid z, x)}
\E_{q_\phi(\psi_{1:K} \mid x)}
\log \frac{1}{K+1} \sum_{k=0}^K q_\phi(z \mid \psi_k, x)
\]</span></p>
<p>Which does indeed give us a multisample upper bound (not variational, though). Unfortunately, this particular bound has a severe weakness: the samples <span class="math inline">\(\psi_{1:K}\)</span> are <em>uninformed</em> about the <span class="math inline">\(z\)</span> they’re supposed to describe in the <span class="math inline">\(q(z \mid x,\psi_k)\)</span> terms, so they are likely to do a poor job of reconstructing a particular <span class="math inline">\(z\)</span>.</p>
<p>Interestingly, this bound looks similar to the multisample variational <em>lower</em> bound <span class="math inline">\(\mathcal{L}_K\)</span>… <span class="math display">\[
\log q(z \mid x)
\ge
\E_{\tau_\eta(\psi_{1:K} \mid z, x)}
\log \frac{1}{K} \sum_{k=1}^K \frac{q_\phi(z, \psi_k \mid x)}{\tau_\eta(\psi_k \mid x,z)}
\]</span> … when <span class="math inline">\(\tau_\eta(\psi \mid x,z)\)</span> is taken to be <span class="math inline">\(q_\phi(\psi \mid x)\)</span> – the “variational prior”: <span class="math display">\[
\log q(z \mid x)
\ge
\E_{q_\phi(\psi_{1:K} \mid x)}
\log \frac{1}{K} \sum_{k=1}^K q_\phi(z \mid \psi_k, x)
\]</span></p>
<p>The only difference between this lower bound and the SIVI upper bound is that the latter adds one (free, see the previous post for the discussion on free posterior samples) sample from the true inverse model <span class="math inline">\(q_\phi(\psi \mid x,z)\)</span>.</p>
<h2 id="importanceweightedhierarchicalvariationalinference">Importance Weighted Hierarchical Variational Inference</h2>
<p>The natural question to ask then is… could maybe the following be an upper bound on <span class="math inline">\(\log q_\phi(z \mid x)\)</span>? <span class="math display">\[
\mathcal{U}_K
:=
\E_{q_\phi(\psi_0 \mid z, x)}
\E_{\tau_\eta(\psi_{1:K} \mid z, x)}
\log \frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(z, \psi_k \mid x)}{\tau_\eta(\psi_k \mid x,z)}
\]</span> The formula is very bizarre, yet several special cases do give upper bounds:</p>
<ul>
<li>Setting <span class="math inline">\(K=0\)</span> gives the Hierarchical Variational Models (HVM) bound (from the previous post) for arbitrary <span class="math inline">\(\tau_\eta(\psi \mid x,z)\)</span>,</li>
<li>Setting <span class="math inline">\(\tau_\eta(\psi \mid x,z) = q_\phi(\psi \mid x)\)</span> gives the SIVI bound for arbitrary <span class="math inline">\(K\)</span>,</li>
<li>Setting <span class="math inline">\(\tau_\eta(\psi \mid x,z) = q_\phi(\psi \mid {\color{red} z}, x)\)</span> recovers <span class="math inline">\(\log q_\phi(z \mid x)\)</span> exactly.</li>
</ul>
<p>The <a href="https://arxiv.org/abs/1905.03290">Importance Weighted Hierarchical Variational Inference</a> paper gives an affirmative answer. <span class="math inline">\(\mathcal{U}_K\)</span> is indeed an upper bound (a Multisample Variational Upper Bound) for any <span class="math inline">\(K\)</span> and any <span class="math inline">\(\tau(\psi \mid x,z)\)</span>. Moreover, it enjoys the same guarantees as the IWAE bound (the Multisample Variational Lower Bound):</p>
<ol style="liststyletype: decimal">
<li><span class="math inline">\(\mathcal{U}_K \ge \log q_\phi(z \mid x)\)</span></li>
<li><span class="math inline">\(\mathcal{U}_K \ge \mathcal{U}_{K+1}\)</span></li>
<li><span class="math inline">\(\lim_{K \to \infty} \mathcal{U}_K = \log q_\phi(z \mid x)\)</span></li>
</ol>
<p>Combining this bound with the (intractable) ELBO, we obtain the following lower bound on <span class="math inline">\(\log p_\theta(x)\)</span>:</p>
<p><span class="math display">\[
\hat{\mathcal{L}}_K^\text{IWHVI}
:=
\E_{q_\phi(z, \psi_0 \mid x)}
\E_{\tau_\eta(\psi_{1:K} \mid z, x)}
\log \frac{p_\theta(x, z)}{ \frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(z, \psi_k \mid x)}{\tau_\eta(\psi_k \mid x,z)} }
\]</span></p>
<p>To test the bound we used a simple toy task of upperbounding the negative differential entropy <span class="math inline">\(\mathbb{E}_{q(z)} \log q(z)\)</span> of the standard 50dimensional Laplace distribution represented as a <a href="https://statisticaloddsandends.wordpress.com/2018/12/21/laplacedistributionasamixtureofnormals/">Gaussian compound</a>: <span class="math display">\[
\prod_{d=1}^{50} \text{Laplace}(z_d \mid 0, 1) = \int \prod_{d=1}^{50} \mathcal{N}(z_d \mid 0, \psi_d) \text{Exp}(\psi_d \mid \tfrac{1}{2}) d\psi_{1:50}
\]</span></p>
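<p>A one-dimensional version of this toy task can be sketched numerically (my own sketch; the actual experiment was 50-dimensional and optimized <span class="math inline">\(\tau_\eta\)</span>, which is omitted here, so only the <span class="math inline">\(\tau = q(\psi)\)</span> special cases appear). Here <span class="math inline">\(\psi\)</span> plays the role of the variance; the snippet checks the compound identity via moments and compares the <span class="math inline">\(K=0\)</span> (HVM with the prior as <span class="math inline">\(\tau\)</span>) and <span class="math inline">\(K=10\)</span> SIVI-style upper bounds against the true value <span class="math inline">\(-(1+\log 2)\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(4)

S, K = 200_000, 10
# Gaussian compound: psi ~ Exp(rate 1/2) (i.e. mean 2), z | psi ~ N(0, psi)
psi0 = rng.exponential(scale=2.0, size=S)     # psi is the variance
z = np.sqrt(psi0) * rng.normal(size=S)        # z should be Laplace(0, 1)

def log_normal(z, var):                        # log N(z; 0, var)
    return -0.5 * z**2 / var - 0.5 * np.log(2 * np.pi * var)

truth = -(1 + np.log(2.0))    # E log q(z) for q = Laplace(0, 1)

# Upper bounds on E log q(z): the "posterior" sample psi_0 plus
# K uninformed prior samples psi_{1:K} ~ Exp(1/2)
psis = np.concatenate([psi0[:, None],
                       rng.exponential(scale=2.0, size=(S, K))], axis=1)
dens = log_normal(z[:, None], psis)
m = dens.max(axis=1, keepdims=True)

hvm_bound = dens[:, 0].mean()                                     # K = 0
sivi_bound = (m[:, 0] + np.log(np.exp(dens - m).mean(axis=1))).mean()
print(truth, sivi_bound, hvm_bound)   # truth <= K=10 bound <= K=0 bound
```

<p>The ordering of the three printed numbers mirrors the plot: more samples tighten the upper bound towards the true negative entropy.</p>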
<p>The results look good:</p>
<div class="postimage">
<p><img src="/files/iwhviplot.png" style="width: 500px" /> Comparison of IWHVI bounds for different number of optimization steps over <span class="math inline">\(\eta\)</span>.</p>
</div>
<p>Moreover, multisample bounds have been extensively studied and some results translate to our bound as well.</p>
<h3 id="estimatingthemarginalloglikelihoodlogp_thetax">Estimating the marginal log-likelihood <span class="math inline">\(\log p_\theta(x)\)</span></h3>
<p>Increasing <span class="math inline">\(K\)</span> will lead to the bound <span class="math inline">\(\hat{\mathcal{L}}_K^\text{IWHVI}\)</span> approaching the ELBO <span class="math inline">\(\mathcal{L}\)</span>, but the gap between the ELBO and the marginal log-likelihood <span class="math inline">\(\log p_\theta(x)\)</span> is not negligible. Even by employing a more powerful variational distribution we might not be able to overcome the <a href="https://arxiv.org/abs/1802.02550">gap introduced by amortization</a>. The standard approach to evaluate Variational Autoencoders is to use the Multisample Variational Lower Bound (IWAE bound) with large <span class="math inline">\(M\)</span>. Can we tighten our bound in such a way?</p>
<p>It turns out, the answer is yes and the tighter bound is simply</p>
<p><span class="math display">\[
\hat{\mathcal{L}}_K^\text{$M$IWHVI}
:=
\E_{\substack{q_\phi(z_{1:M}, \psi_{1:M, 0} \mid x) \\ \tau_\eta(\psi_{1:M, 1:K} \mid z_{1:M}, x)}}
\log \frac{1}{M} \sum_{m=1}^M \frac{p_\theta(x, z_m)}{ \frac{1}{K+1} \sum_{k=0}^K \frac{q_\phi(z_m, \psi_{m,k} \mid x)}{\tau_\eta(\psi_{m,k} \mid x,z_m)} }
\le
\log p_\theta(x)
\]</span></p>
<p>Essentially we just sampled the original <span class="math inline">\(\hat{\mathcal{L}}_K^\text{IWHVI}\)</span> bound <span class="math inline">\(M\)</span> times (independently) and averaged them all inside the <span class="math inline">\(\log\)</span>.</p>
<h3 id="buttightervariationalboundsarenotnecessarilybetter">But Tighter Variational Bounds are Not Necessarily Better</h3>
<p>It <a href="https://arxiv.org/abs/1802.04537">was observed</a> that training IWAE with large <span class="math inline">\(K\)</span> leads to deterioration of the inference network’s gradients. Namely, the signal-to-noise ratio of the <span class="math inline">\(\nabla_\phi \mathcal{L}_K\)</span> estimates decreases with <span class="math inline">\(K\)</span>, while the signal-to-noise ratio of the <span class="math inline">\(\nabla_\theta \mathcal{L}_K\)</span> estimates increases with <span class="math inline">\(K\)</span>. Luckily, the <a href="https://arxiv.org/abs/1810.04152">Doubly Reparameterized Gradients paper</a> resolved this problem. The same derivations apply to our case, except for an additional term corresponding to a sample from <span class="math inline">\(q_\phi(\psi \mid z, x)\)</span>, which prevents the SNR from increasing, leaving it approximately constant.</p>
<h3 id="debiasingandjackknife">Debiasing and Jackknife</h3>
<p><a href="https://openreview.net/forum?id=HyZoiWRb">Nowozin has shown</a> that Multisample Variational Lower Bound <span class="math inline">\(\mathcal{L}_K\)</span> (the IWAE bound) can be seen as a biased evidence estimate with the bias of order <span class="math inline">\(1/K\)</span>, which can be reduced with <a href="https://en.wikipedia.org/wiki/Jackknife_resampling">Jackknife</a>. This procedure results in an improved estimator with the bias of order <span class="math inline">\(1/K^2\)</span>. By repeating the procedure over and over again <span class="math inline">\(d\)</span> times we obtain an estimator with the bias of order <span class="math inline">\(1/K^{d+1}\)</span>. The price for that is increased variance, computational complexity and loss of bound guarantees.</p>
<p>It can be shown that the Multisample Variational Upper Bound <span class="math inline">\(\mathcal{U}_K\)</span> also has the bias of order <span class="math inline">\(1/(K+1)\)</span> and thus allows the jackknife. We tested the debiased estimator on a toy task but did not use it in more serious experiments due to loss of guarantees.</p>
<h2 id="issiviobsolete">Is SIVI obsolete?</h2>
<p>It depends. In the case of Neural Samplers IWHVI does give a much tighter bound with little extra overhead. However, in some cases the general formulation of IWHVI might be challenging to work with, for example, in the case of <a href="https://arxiv.org/abs/1705.07120">VampPrior</a>-like distributions: <span class="math display">\[
q_\phi(z)
:= \frac{1}{N} \sum_{n=1}^N q_\phi(z \mid x_n)
= \sum_{n=1}^N q_\phi(z \mid n) q(\psi = n)
\]</span> Here <span class="math inline">\(\psi\)</span> is essentially a number from 1 to N and the prior <span class="math inline">\(q(\psi)\)</span> is a uniform distribution. The IWHVI bound would involve <span class="math inline">\(\tau(\psi \mid x,z)\)</span> as a categorical distribution over <span class="math inline">\(N\)</span> outcomes. Learning <span class="math inline">\(\tau\)</span> would require not only <a href="/tags/stochastic%20computation%20graphs%20series.html">advanced gradient estimators</a><a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a> to deal with discrete random variables, but also efficient <a href="http://ruder.io/wordembeddingssoftmax/index.html">softmax estimators</a> to scale favorably to large datasets. In this setting SIVI presents a much simpler alternative, as it frees us from all these hurdles: SIVI only requires sampling from <span class="math inline">\(U\{1, \dots, N\}\)</span>, which is easy.</p>
<p>In many cases though, IWHVI only adds one extra pass of each <span class="math inline">\(z\)</span> through the network that generates the <span class="math inline">\(\tau_\eta(\psi \mid z,x)\)</span> distribution, which is dominated by the <span class="math inline">\(K+1\)</span> passes of <span class="math inline">\(\psi_{0:K}\)</span> through the network that generates the <span class="math inline">\(q_\phi(z \mid x, \psi_k)\)</span> distributions, so its added cost is negligible.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this work we identified a generalized bound that bridges prior work on HVM and SIVI. Such generalized bounds are shown to be much tighter. A particularly nice property is that this multisample bound breaks the vicious cycle we stumbled upon in the last post: increasing the number of samples allows us to tighten the bound without complicating the auxiliary variational distribution <span class="math inline">\(\tau_\eta(\psi \mid x,z)\)</span>, and thus reduces the amount of regularization it imposes on the true inverse model <span class="math inline">\(q_\phi(\psi \mid x,z)\)</span>, which lets us learn expressive Neural Samplers. Although multiple samples are still more computationally expensive than just one sample (HVM), <span class="math inline">\(z\)</span> typically has much lower dimension than <span class="math inline">\(x\)</span>, so this bound is cheaper to evaluate than the IWAE one.</p>
<p>For more details check out the <a href="https://arxiv.org/abs/1905.03290">preprint</a>.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>In the standard VAE the encoder network takes in the observation <span class="math inline">\(x\)</span> and generates <span class="math inline">\(q(z \mid x)\)</span> by outputting the mean and variance of a normal distribution.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Although one can attempt avoiding this particular issue by fitting <span class="math inline">\(\tau(n \mid x,z)\)</span> using the bound with <span class="math inline">\(K=0\)</span>.<a href="#fnref2">↩</a></p></li>
</ol>
</div>Fri, 10 May 2019 00:00:00 UThttp://artem.sobolev.name/posts/20190510importanceweightedhierarchicalvariationalinference.htmlArtemhttp://artem.sobolev.name/posts/20190510importanceweightedhierarchicalvariationalinference.htmlNeural Samplers and Hierarchical Variational Inference
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/ntduRRZd19I/20190426neuralsamplersandhierarchicalvariationalinference.html
<p>This post sets the background for the upcoming post on my work on a more efficient use of neural samplers for Variational Inference.</p>
<!more>
<h2 id="variationalinference">Variational Inference</h2>
<p>At the core of <em>Bayesian Inference</em> lies the well-known Bayes’ theorem, relating our prior beliefs <span class="math inline">\(p(z)\)</span> with those obtained after observing some data <span class="math inline">\(x\)</span>:</p>
<p><span class="math display">\[
p(z \mid x)
=
\frac{p(x \mid z) p(z)}{p(x)}
=
\frac{p(x \mid z) p(z)}{\int p(x, z) dz}
\]</span></p>
<p>However, in most practical cases the denominator <span class="math inline">\(p(x)\)</span> requires intractable integration. Thus the field of Approximate Bayesian Inference seeks to approximate this posterior efficiently. For example, MCMC-based methods essentially use a sample-based empirical distribution as the approximation.</p>
<p>In problems of <em>learning</em> latent variable models (for example, <a href="/posts/20160711neuralvariationalinferencevariationalautoencodersandHelmholtzmachines.html">VAEs</a>) we seek to do maximum likelihood learning for some hierarchical model <span class="math inline">\(p_\theta(x) = \int p_\theta(x, z) dz\)</span>, but computing the integral is intractable and latent variables <span class="math inline">\(z\)</span> are not observed.</p>
<p><a href="/posts/20160701neuralvariationalinferenceclassicaltheory.html">Variational Inference</a> is a method that gained a lot of popularity recently, especially due to its scalability. It nicely allows for simultaneous inference (finding the posterior approximation) and learning (optimizing the parameters of the model) by means of the <em>evidence lower bound</em> (ELBO) on the <em>marginal log-likelihood</em> <span class="math inline">\(\log p_\theta(x)\)</span>, obtained by applying importance sampling followed by Jensen’s inequality:</p>
<p><span class="math display">\[
\log p_\theta(x)
= \log \mathbb{E}_{p_\theta(z)} p_\theta(x \mid z)
= \log \mathbb{E}_{q_\phi(z \mid x)} \frac{p_\theta(x, z)}{q_\phi(z \mid x)}
\ge \mathbb{E}_{q_\phi(z \mid x)} \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}
=: \mathcal{L}
\]</span></p>
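<p>To make the bound concrete, here is a minimal numerical sketch (my illustration, not part of the original derivation) on a toy conjugate model where the marginal likelihood is known in closed form; the particular distributions and the observed value are assumptions chosen for tractability:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    # log density of N(x; mean, var)
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy model: p(z) = N(0, 1), p(x|z) = N(z, 1), so p(x) = N(0, 2) in closed form.
x = 1.5
log_px = log_normal(x, 0.0, 2.0)

def elbo(q_mean, q_var, n=200_000):
    # Monte Carlo estimate of E_q [ log p(x, z) - log q(z|x) ]
    z = rng.normal(q_mean, np.sqrt(q_var), size=n)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    return np.mean(log_joint - log_normal(z, q_mean, q_var))

loose = elbo(0.0, 1.0)    # a deliberately mismatched q(z|x)
tight = elbo(x / 2, 0.5)  # the exact posterior: the bound becomes an equality
```

<p>With the exact posterior as the variational distribution the bound is tight, while the mismatched proposal leaves a gap – which is exactly the KL divergence discussed next.</p>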
<p>This lower bound should be maximized w.r.t. both <span class="math inline">\(\phi\)</span> (variational parameters) and <span class="math inline">\(\theta\)</span> (model parameters). To better understand the effect of such optimization, it’s helpful to consider the gap between the marginal log-likelihood and the bound. It’s easy to show that this gap is equal to a Kullback–Leibler (KL) divergence:</p>
<p><span class="math display">\[
\log p_\theta(x) - \mathbb{E}_{q_\phi(z \mid x)} \log \frac{p_\theta(x,z)}{q_\phi(z \mid x)}
=
D_{KL}(q_\phi(z \mid x) \mid\mid p_\theta(z \mid x))
\]</span></p>
<p>Now it’s easy to see that maximizing the ELBO w.r.t. <span class="math inline">\(\phi\)</span> tightens the bound and performs approximate inference – <span class="math inline">\(q_\phi(z \mid x)\)</span> becomes closer to the true posterior <span class="math inline">\(p_\theta(z \mid x)\)</span> as measured by the KL divergence. While we hope that maximizing the bound w.r.t. <span class="math inline">\(\theta\)</span> increases the marginal log-likelihood <span class="math inline">\(\log p_\theta(x)\)</span>, this is obstructed by the KL divergence. In effect, maximizing the ELBO is equivalent to maximizing the marginal log-likelihood regularized with <span class="math inline">\(D_{KL}(q_\phi(z \mid x) \mid\mid p_\theta(z \mid x))\)</span>, except there’s no hyperparameter to control the strength of this regularization. This regularization prevents the true posterior <span class="math inline">\(p_\theta(z \mid x)\)</span> from deviating too much from the variational distribution <span class="math inline">\(q_\phi(z \mid x)\)</span>, which is not necessarily bad, as you’d then know the true posterior has a somewhat simple form; on the other hand, it prevents us from learning powerful and expressive models <span class="math inline">\(p_\theta(x) = \int p_\theta(x \mid z) p_\theta(z) dz\)</span>. Therefore, if we’re after expressive models <span class="math inline">\(p_\theta(x)\)</span>, we should minimize this regularization effect, for example, by means of more expressive variational approximations.</p>
<p>Intuitively, the tighter the bound, the smaller the regularization effect. And it’s relatively easy to obtain a tighter bound: <span class="math display">\[
\begin{align*}
\log p_\theta(x)
&= \log \mathbb{E}_{p_\theta(z)} p_\theta(x \mid z)
= \log \mathbb{E}_{q_\phi(z_{1:K} \mid x)} \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)} \\
&\ge \mathbb{E}_{q_\phi(z_{1:K} \mid x)} \log\left( \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)} \right)
=: \mathcal{L}_K
\ge \mathcal{L}
\end{align*}
\]</span> That is, by simply taking several samples to estimate the marginal likelihood <span class="math inline">\(p_\theta(x)\)</span> under the logarithm, we made the bound tighter. Such bounds are usually called <a href="/posts/20160714neuralvariationalimportanceweightedautoencoders.html">IWAE bounds</a> (after the <a href="https://arxiv.org/abs/1509.00519">Importance Weighted Autoencoders paper</a> in which they were first introduced), but we’ll be calling them <em>Multisample Variational Lower Bounds</em>. Such bounds <a href="https://arxiv.org/abs/1808.09034">were shown</a> to correspond to using more expressive proposal distributions and are very powerful, but they require multiple evaluations of the decoder <span class="math inline">\(p_\theta(x \mid z)\)</span>, which might be very expensive for complex models, for example, when applying <a href="https://arxiv.org/abs/1802.02032">VAEs to dialogue modelling</a>.</p>
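<p>The tightening effect of multiple samples can be checked numerically. The sketch below (an illustration of mine, not from the paper) evaluates <span class="math inline">\(\mathcal{L}_K\)</span> on a toy Gaussian model with a deliberately loose proposal; the bound increases monotonically in <span class="math inline">\(K\)</span> towards <span class="math inline">\(\log p_\theta(x)\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy model: p(z) = N(0, 1), p(x|z) = N(z, 1), so log p(x) = log N(x; 0, 2).
x = 1.5
log_px = log_normal(x, 0.0, 2.0)

def logmeanexp(a, axis):
    # numerically stable log of the mean of exp(a) along an axis
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(a - m).mean(axis=axis))

def iwae_bound(K, n=50_000):
    # z_{1:K} ~ q(z|x); here the proposal is the (loose) prior N(0, 1)
    z = rng.normal(0.0, 1.0, size=(n, K))
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, 0.0, 1.0)
    return logmeanexp(log_w, axis=1).mean()

L1, L5, L50 = iwae_bound(1), iwae_bound(5), iwae_bound(50)
```

<p>On this toy problem the single-sample bound coincides with the ELBO under the same proposal, and the gap shrinks roughly as <span class="math inline">\(1/K\)</span>.</p>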
<p>An alternative direction is to use a more expressive family of variational distributions <span class="math inline">\(q_\phi(z \mid x)\)</span>. Moreover, with the explosion of Deep Learning we actually know one family of models that has empirically demonstrated terrific approximation capabilities – Neural Networks. We will therefore consider so-called Neural Samplers as generators of approximate posterior <span class="math inline">\(q(z \mid x)\)</span> samples. A <em>Neural Sampler</em> is simply a neural network trained to take some simple (say, Gaussian) random variable <span class="math inline">\(\psi \sim q(\psi \mid x)\)</span> and transform it into a <span class="math inline">\(z\)</span> that has the properties we seek. Canonical examples are GANs and VAEs, and we’ll get back to them later in the discussion.</p>
<p>And using neural nets is not a new idea. There’s been a lot of research along this direction, which we might roughly classify into 3 directions based on how they deal with the intractable <span class="math inline">\(\log q_\phi(z \mid x)\)</span> term:</p>
<ul>
<li>Flows</li>
<li>Estimates</li>
<li>Bounds</li>
</ul>
<p>I’ll briefly cover the first two and then discuss the last one, which is of central relevance to this post.</p>
<h3 id="flows">Flows</h3>
<p>So-called Flow models appeared on the radar with the publication of the <a href="https://arxiv.org/abs/1505.05770">Normalizing Flows paper</a>, and then quickly exploded into a hot topic of research. At the moment there exist dozens of works on all kinds of flows. The basic idea is that if the neural net defining the sampler is invertible, then by computing its Jacobian (the determinant of the Jacobi matrix) we can analytically find the density <span class="math inline">\(q(z \mid x)\)</span>. Flows further restrict the samplers to have efficiently computable Jacobians. For further reading refer to <a href="http://akosiorek.github.io/ml/2018/04/03/norm_flows.html">Adam Kosiorek’s post</a>.</p>
<p>Flows were shown to be very powerful; they even managed to model high-dimensional data directly, as was shown by OpenAI researchers with the <a href="https://openai.com/blog/glow/">Glow model</a>. However, Flow-based models require a neural network specially designed to be invertible and to have an easy-to-compute Jacobian. Such restrictions might lead to inefficiency in parameter usage, requiring many more parameters and much more compute compared to simpler methods. The aforementioned Glow uses a lot of parameters and compute to learn modestly high-resolution images.</p>
<h3 id="estimates">Estimates</h3>
<p>Another direction is to estimate <span class="math inline">\(q_\phi(z \mid x)/p(z)\)</span> by means of auxiliary models. For example, the <a href="http://blog.shakirm.com/2018/01/machinelearningtrickoftheday7densityratiotrick/">Density Ratio Trick</a> lying at the heart of many GANs says that if you have an optimal discriminator <span class="math inline">\(D^*(z, x)\)</span> discerning samples from <span class="math inline">\(q(z \mid x)\)</span> from those from <span class="math inline">\(p(z)\)</span> (for the given <span class="math inline">\(x\)</span>), then the following is true:</p>
<p><span class="math display">\[
\frac{D^*(z, x)}{1 - D^*(z, x)} = \frac{q(z \mid x)}{p(z)}
\]</span></p>
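<p>For intuition, here is a small sketch with two hypothetical one-dimensional densities standing in for <span class="math inline">\(q(z \mid x)\)</span> and <span class="math inline">\(p(z)\)</span>: plugging the Bayes-optimal discriminator into the formula above recovers the density ratio exactly:</p>

```python
import numpy as np

def normal_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical densities: q stands in for q(z|x) at a fixed x, p for the prior p(z).
q = lambda z: normal_pdf(z, 1.0, 0.5)
p = lambda z: normal_pdf(z, 0.0, 1.0)

def d_star(z):
    # Bayes-optimal discriminator for a balanced q-vs-p classification problem
    return q(z) / (q(z) + p(z))

z = np.linspace(-3.0, 3.0, 13)
ratio_from_d = d_star(z) / (1.0 - d_star(z))  # recovers q(z)/p(z)
```

<p>In practice, of course, the discriminator is only approximately optimal, which is exactly where the estimation error comes from.</p>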
<p>In practice we do not have the optimal classifier, so instead we train an auxiliary model to perform such classification. A particularly successful approach along this direction is <a href="https://avg.is.tuebingen.mpg.de/publications/mescheder2017arxiv">Adversarial Variational Bayes</a>. The biggest advantage of this method is the lack of any restrictions on the Neural Sampler (except the standard requirement of differentiability). The disadvantage is that it loses all bound guarantees and inherits a lot of stability issues from GANs.</p>
<h2 id="boundsandhierarchicalvariationalinference">Bounds and Hierarchical Variational Inference</h2>
<p>Arguably, the most natural approach to employing Neural Samplers as variational approximations is to give an efficient lower bound on the ELBO. In particular, we’d like to give a variational lower bound on the intractable term <span class="math inline">\(\log \tfrac{1}{q_\phi(z \mid x)}\)</span>.</p>
<p>You can notice that for the Neural Sampler as described above the marginal density <span class="math inline">\(q_\phi(z \mid x)\)</span> has the form <span class="math inline">\(q_\phi(z \mid x) = \int q_\phi(z \mid x, \psi) q_\phi(\psi \mid x) d\psi\)</span>, very similar to that of the VAE itself! Indeed, the Neural Sampler is a latent variable model like the VAE, except it is conditioned on <span class="math inline">\(x\)</span>. Great – you might think – we’ll just reuse the bounds we derived above, problem solved, right? Well, no. The problem is that we need to give a lower bound on the <strong>negative</strong> marginal log-density, or equivalently, an upper bound on the marginal log-density.</p>
<p>But first we need to figure out one important question: what is <span class="math inline">\(q_\phi(z \mid x, \psi)\)</span>? In the case of a GAN-like procedure we could say that this density is degenerate: <span class="math inline">\(q_\phi(z \mid \psi, x) = \delta(z - f_\phi(\psi, x))\)</span>, where <span class="math inline">\(f_\phi\)</span> is the neural network that generates <span class="math inline">\(z\)</span> from <span class="math inline">\(\psi\)</span>. While the estimation-based approach is fine with this since it doesn’t work with densities directly, for the bounds we need <span class="math inline">\(q_\phi(z \mid x, \psi)\)</span> to be a well-defined density, so from now on we’ll assume it’s some proper density, not the delta function<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<p>Luckily, one can use the following identity</p>
<p><span class="math display">\[
\mathbb{E}_{q_\phi(\psi \mid z, x)} \frac{\tau_\eta(\psi \mid z, x)}{q_\phi(z, \psi \mid x)}
=
\frac{1}{q_\phi(z \mid x)}
\]</span></p>
<p>Where <span class="math inline">\(\tau_\eta(\psi \mid z, x)\)</span> is an arbitrary density we’ll be calling the <em>auxiliary variational distribution</em>. Then, by applying the logarithm and Jensen’s inequality, we obtain a much-needed variational upper bound:</p>
<p><span class="math display">\[
\log q_\phi(z \mid x)
\le
\mathbb{E}_{q_\phi(\psi \mid z, x)} \log \frac{q_\phi(z, \psi \mid x)}{\tau_\eta(\psi \mid z, x)}
:=
\mathcal{U}
\]</span></p>
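<p>This bound is easy to verify by exact enumeration on a tiny discrete toy model (all numbers below are illustrative assumptions of mine): for any auxiliary <span class="math inline">\(\tau\)</span> we get <span class="math inline">\(\mathcal{U} \ge \log q_\phi(z \mid x)\)</span>, with equality when <span class="math inline">\(\tau\)</span> equals the true inverse model:</p>

```python
import numpy as np

# Tiny discrete hierarchical sampler (x is fixed and omitted): psi and z are binary.
q_psi = np.array([0.3, 0.7])                 # q(psi)
q_z_given_psi = np.array([[0.9, 0.1],        # q(z | psi = 0)
                          [0.2, 0.8]])       # q(z | psi = 1)

z = 1
q_joint = q_psi * q_z_given_psi[:, z]        # q(z, psi) at the observed z
log_q_z = np.log(q_joint.sum())              # exact marginal log q(z)
post = q_joint / q_joint.sum()               # true inverse model q(psi | z)

def upper_bound(tau):
    # U = E_{q(psi|z)} log [ q(z, psi) / tau(psi|z) ], computed by enumeration
    return np.sum(post * np.log(q_joint / tau))

U_loose = upper_bound(np.array([0.5, 0.5]))  # an arbitrary auxiliary tau
U_tight = upper_bound(post)                  # tau equal to the true inverse model
```

<p>Indeed, the gap here is exactly <span class="math inline">\(D_{KL}(q_\phi(\psi \mid z, x) \mid\mid \tau_\eta(\psi \mid z, x))\)</span>, which vanishes for the optimal <span class="math inline">\(\tau\)</span>.</p>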
<p>Except – oops – it needs a sample from the true inverse model <span class="math inline">\(q_\phi(\psi \mid z, x)\)</span>, which in general is not any easier to obtain than calculating <span class="math inline">\(\log q_\phi(z \mid x)\)</span> in the first place. Bummer? No – it turns out we can use the fact that the samples <span class="math inline">\(z\)</span> come from the same hierarchical process <span class="math inline">\(q_\phi(z, \psi \mid x)\)</span>! Indeed, since we’re interested in <span class="math inline">\(\log q_\phi(z \mid x)\)</span> averaged over all <span class="math inline">\(z \mid x\)</span>: <span class="math display">\[
\begin{align*}
\mathbb{E}_{q_\phi(z \mid x)}
\log q_\phi(z \mid x)
&\le
\mathbb{E}_{q_\phi(z \mid x)}
\mathbb{E}_{q_\phi(\psi \mid z, x)} \log \frac{q_\phi(z, \psi \mid x)}{\tau_\eta(\psi \mid z, x)} \\
& =
\mathbb{E}_{q_\phi(z, \psi \mid x)} \log \frac{q_\phi(z, \psi \mid x)}{\tau_\eta(\psi \mid z, x)} \\
&=
\mathbb{E}_{q_\phi(\psi \mid x)}
\mathbb{E}_{q_\phi(z \mid \psi, x)} \log \frac{q_\phi(z, \psi \mid x)}{\tau_\eta(\psi \mid z, x)}
\end{align*}
\]</span></p>
<p>These algebraic manipulations show that if we sampled <span class="math inline">\(z\)</span> through a hierarchical scheme, then <span class="math inline">\(\psi\)</span> used to generate this <span class="math inline">\(z\)</span> can be thought of as a free posterior sample<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>. This leads to the following lower bound on the ELBO, introduced in <a href="https://arxiv.org/abs/1511.02386">Hierarchical Variational Models</a> paper:</p>
<p><span class="math display">\[
\log p_\theta(x)
\ge
\mathcal{L}
\ge \mathbb{E}_{q_\phi(z, \psi \mid x)} \log \frac{p_\theta(x, z)}{ \tfrac{q_\phi(z, \psi \mid x)}{\tau_\eta(\psi \mid z, x)} }
\]</span> Interestingly, this bound admits another interpretation. Indeed, it can be equivalently represented as <span class="math display">\[
\log p_\theta(x)
\ge \mathbb{E}_{q_\phi(z, \psi \mid x)} \log \frac{p_\theta(x, z) \tau_\eta(\psi \mid z, x)}{q_\phi(z, \psi \mid x) }
\]</span> This is just the ELBO for an extended model where the latent code <span class="math inline">\(z\)</span> is extended with <span class="math inline">\(\psi\)</span>, and since there was no <span class="math inline">\(\psi\)</span> in the original model <span class="math inline">\(p_\theta(x, z)\)</span>, we extended the model with <span class="math inline">\(\tau_\eta(\psi \mid z, x)\)</span> as well. This view has been investigated in the <a href="https://arxiv.org/abs/1602.05473">Auxiliary Deep Generative Models</a> paper.</p>
<p>Let’s now return to the variational upper bound <span class="math inline">\(\mathcal{U}\)</span>. Can we give a multisample variational upper bound on <span class="math inline">\(\log q_\phi(z \mid x)\)</span> similar to IWAE? Well, following the same logic, we can arrive at the following:</p>
<p><span class="math display">\[
\begin{align*}
\log \frac{1}{q_\phi(z \mid x)}
& =
\log
\mathbb{E}_{q_\phi(\psi_{1:K} \mid z, x)} \frac{1}{K} \sum_{k=1}^K \frac{\tau_\eta(\psi_k \mid z, x)}{q_\phi(z, \psi_k \mid x)} \\
&\ge
\mathbb{E}_{q_\phi(\psi_{1:K} \mid z, x)}
\log \frac{1}{K} \sum_{k=1}^K \frac{\tau_\eta(\psi_k \mid z, x)}{q_\phi(z, \psi_k \mid x)}
\end{align*}
\]</span> <span class="math display">\[
\log q_\phi(z \mid x)
\le
\mathbb{E}_{q_\phi(\psi_{1:K} \mid z, x)}
\log \frac{1}{\frac{1}{K} \sum_{k=1}^K \frac{\tau_\eta(\psi_k \mid z, x)}{q_\phi(z, \psi_k \mid x)}}
\]</span></p>
<p>However, this bound – the Variational Harmonic Mean Estimator – is no good, as it uses <span class="math inline">\(K\)</span> samples from the true inverse model <span class="math inline">\(q_\phi(\psi \mid x,z)\)</span>, whereas we can have only one free sample. The rest would have to be obtained through expensive MCMC sampling, which doesn’t scale well. Interestingly, this estimator was already presented in the original VAE paper (though buried in Appendix D), but discarded as too unstable.</p>
<h3 id="whymultisamplevariationalupperbound">Why multisample variational upper bound?</h3>
<p>The gap between the ELBO and its tractable lower bound can be shown to be <span class="math display">\[
\mathcal{L}
-
\mathbb{E}_{q_\phi(z, \psi \mid x)} \log \frac{p_\theta(x, z)}{ \tfrac{q_\phi(z, \psi \mid x)}{\tau_\eta(\psi \mid z, x)} }
=
D_{KL}(q_\phi(\psi \mid x,z) \mid\mid \tau_\eta(\psi \mid x,z))
\]</span> So since we’ll be using some simple <span class="math inline">\(\tau_\eta(\psi \mid x,z)\)</span>, we’ll be restricting the true inverse model <span class="math inline">\(q_\phi(\psi \mid x,z)\)</span> to also be somewhat simple, limiting the expressivity of <span class="math inline">\(q_\phi(z \mid x)\)</span> and thus of <span class="math inline">\(p_\theta(z \mid x)\)</span>… Looks like we ended up where we started, right? Well, not quite, as we might have gained more than we lost by moving the simple distribution from <span class="math inline">\(q(z \mid x)\)</span> to <span class="math inline">\(\tau(\psi \mid x,z)\)</span>, but it’s still not quite satisfying. A multisample upper bound would allow us to give tighter bounds (which don’t suffer from the regularization as much) without invoking any additional evaluations of the model’s decoder <span class="math inline">\(p_\theta(x \mid z)\)</span> (see the Variational Harmonic Mean Estimator above as an example).</p>
<p>So… Are there efficient multisample variational upper bounds? A year ago you might have thought the answer is “Probably no”, until… [To be continued]</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>The problem is that the delta function is not an actual function but a generalized function, and special care has to be taken when dealing with it.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>This is not a new result, see <a href="https://arxiv.org/abs/1511.02543">Grosse et al.</a>, section 4.2, paragraph on “simulated data”.<a href="#fnref2">↩</a></p></li>
</ol>
</div>Fri, 26 Apr 2019 00:00:00 UThttp://artem.sobolev.name/posts/20190426neuralsamplersandhierarchicalvariationalinference.htmlArtemhttp://artem.sobolev.name/posts/20190426neuralsamplersandhierarchicalvariationalinference.htmlStochastic Computation Graphs: Fixing REINFORCE
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/WajrpLhGE3g/20171112stochasticcomputationgraphsfixingreinforce.html
<p>This is the final post of the <a href="/tags/stochastic%20computation%20graphs%20series.html">stochastic computation graphs series</a>. Last time we discussed models with <a href="/posts/20171028stochasticcomputationgraphsdiscreterelaxations.html">discrete relaxations of stochastic nodes</a>, which allowed us to employ the power of reparametrization.</p>
<p>These methods, however, possess one flaw: they consider different models, thus introducing an inherent bias – your test-time discrete model will be doing something different from what your training-time model did. Therefore in this post we’ll get back to REINFORCE, a.k.a. the Score Function estimator, and see if we can fix its problems.</p>
<!more>
<h2 id="backtoreinforce">Back to REINFORCE</h2>
<p>The REINFORCE<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a> estimator arises from the following identity:</p>
<p><span class="math display">\[
\begin{align*}
\nabla_\theta \mathcal{F}(\theta)
& = \nabla_\theta \mathbb{E}_{p(z \mid \theta)} f(z)
= \nabla_\theta \int f(z) p(z \mid \theta) dz
= \int f(z) \nabla_\theta p(z \mid \theta) dz \\
&= \int f(z) \nabla_\theta \log p(z \mid \theta) p(z \mid \theta) dz
= \mathbb{E}_{p(z \mid \theta)} f(z) \nabla_\theta \log p(z \mid \theta)
\end{align*}
\]</span></p>
<p>This allows us to estimate the gradient of the expected objective using Monte Carlo estimation:</p>
<p><span class="math display">\[
\hat{\nabla}_\theta^{\text{SF}} \mathcal{F} = \frac{1}{L} \sum_{l=1}^L f(z^{(l)}) \nabla_\theta \log p(z^{(l)} \mid \theta)
, \quad \text{ where }
z^{(l)} \sim p(z \mid \theta)
\]</span></p>
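<p>A minimal sketch of this estimator on a Bernoulli toy problem (my example, not from the post) shows that the Monte Carlo average indeed converges to the exact gradient:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Bernoulli model: p(z = 1 | theta) = theta, with a known objective f.
theta = 0.3
f = np.array([1.0, 4.0])                       # f(0), f(1)
exact_grad = f[1] - f[0]                       # d/dtheta [(1 - theta) f(0) + theta f(1)]

n = 400_000
z = (rng.random(n) < theta).astype(int)
# score: d/dtheta log p(z | theta) = z / theta - (1 - z) / (1 - theta)
score = np.where(z == 1, 1.0 / theta, -1.0 / (1.0 - theta))
sf_grad = np.mean(f[z] * score)                # averaged single-sample REINFORCE
```

<p>Note how many samples this tiny one-dimensional problem already needs for a decent estimate – a preview of the variance issue discussed next.</p>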
<p>The downside of this method is that it does not use the gradient information of the objective <span class="math inline">\(f\)</span>. This makes it useful in cases where we don’t have access to such information, for example, in Reinforcement Learning. However, when working with Stochastic Computation Graphs, we usually do have the gradients <span class="math inline">\(\nabla_z f(z)\)</span> available, and I believe methods that intelligently use this gradient should perform better.</p>
<p>However, the score function estimator does not use this information, yet it’s an unbiased estimator of the true gradient. What’s the problem then? The problem is its impractically high variance, which requires one to obtain an astronomical number of samples to make optimization actually feasible<a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>. Recall the intuition behind this from <a href="/posts/20170910stochasticcomputationgraphscontinuouscase.html">the first post</a>: a REINFORCE estimator <span class="math inline">\(\hat{\nabla}_\theta^{\text{SF}} \mathcal{F}\)</span> is just <span class="math inline">\(L\)</span> single-sample gradients averaged together, and each single-sample gradient <span class="math inline">\(f(z) \nabla_\theta \log p(z \mid \theta)\)</span> essentially implements a random search: it wants to increase the probability of a given sample <span class="math inline">\(z\)</span> proportionally to <span class="math inline">\(f(z)\)</span>, and if the latter is negative, to decrease it. Each of the samples then pulls the probability towards itself, and this lack of consensus is the source of the problem.</p>
<p>However, despite REINFORCE being essentially a random search in disguise, not all is lost yet. As we shall see, one can extend it with lots of different tricks, greatly reducing the variance.</p>
<h2 id="controlvariates">Control Variates</h2>
<p>One method for reducing the variance in statistics (and the central one for this post) is the method of <strong>Control Variates</strong>, based on the idea that the sum of two negatively correlated random variables can have lower variance. Indeed, let’s assume we have random variables <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> such that <span class="math inline">\(\mathbb{D}(X) = \sigma^2_x\)</span>, <span class="math inline">\(\mathbb{D}(Y) = \sigma^2_y\)</span> and <span class="math inline">\(\text{Cov}(X, Y) = -\tau \sigma_x \sigma_y\)</span> for some <span class="math inline">\(\tau > 0\)</span>. Then</p>
<p><span class="math display">\[
\mathbb{D}(X + Y) - \mathbb{D}(X)
= \mathbb{D}(Y) + 2 \text{Cov}(X, Y)
= \sigma^2_y - 2 \tau \sigma_x \sigma_y
= \sigma_y (\sigma_y - 2 \tau \sigma_x)
\]</span></p>
<p>So if <span class="math inline">\(\sigma_y < 2 \tau \sigma_x\)</span>, then the sum <span class="math inline">\(X + Y\)</span> will have lower variance than <span class="math inline">\(X\)</span> alone. Of course, <span class="math inline">\(Y\)</span> needs to be centered (<span class="math inline">\(\mathbb{E} Y = 0\)</span>) so as not to bias <span class="math inline">\(X\)</span>, but centering does not affect the variance.</p>
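<p>Here is a classic self-contained illustration of a control variate (my example: estimating <span class="math inline">\(\mathbb{E}[e^U]\)</span> for uniform <span class="math inline">\(U\)</span>, with a centered, negatively correlated correction term whose coefficient is chosen near-optimally):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.random(500_000)

X = np.exp(u)               # naive samples; E[X] = e - 1
# A centered control variate: E[Y] = 0, and Y is negatively correlated with X.
# The coefficient 1.69 is close to the optimal Cov(e^U, U) / Var(U).
Y = -1.69 * (u - 0.5)

mean_naive, mean_cv = X.mean(), (X + Y).mean()
var_naive, var_cv = X.var(), (X + Y).var()
```

<p>Both estimators target the same mean, but the corrected one has a variance dozens of times smaller.</p>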
<p>We’ll be considering control variates of a special form: <span class="math inline">\(b(z) \nabla_\theta \log p(z \mid \theta)\)</span>, where <span class="math inline">\(b\)</span> is a <strong>baseline</strong> and can be either a scalar or a vector (the multiplication is then pointwise)<a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a>. This leads to an estimator of the following form</p>
<p><span class="math display">\[
\hat{\nabla}_\theta^\text{SF} \mathcal{F}(\theta) = (f(z) - b(z)) \nabla_\theta \log p(z \mid \theta)
\]</span></p>
<p>Here I used only one sample to simplify the notation (and will be doing so from now on); in practice you can always average several samples, though that probably won’t help you much<a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a>. However, by using a baseline we might have introduced unwanted bias into our gradient estimate. Let’s see:</p>
<p><span class="math display">\[
\begin{align*}
\mathbb{E}_{p(z \mid \theta)} \hat{\nabla}_\theta^\text{SF} \mathcal{F}(\theta)
&= \mathbb{E}_{p(z \mid \theta)} (f(z) - b(z)) \nabla_\theta \log p(z \mid \theta)
= \nabla_\theta \mathbb{E}_{p(z \mid \theta)} \left[ f(z) - b(z) \right] \\
&= \nabla_\theta \mathbb{E}_{p(z \mid \theta)} f(z) - \nabla_\theta \mathbb{E}_{p(z \mid \theta)} b(z)
\end{align*}
\]</span></p>
<p>Looks like we did indeed bias the estimator! In order to reduce the variance while keeping the estimator unbiased, we should correct <span class="math inline">\(\hat{\nabla}_\theta^\text{SF} \mathcal{F}(\theta)\)</span> for the introduced bias:</p>
<p><span class="math display">\[
\hat{\nabla}_\theta^\text{SF} \mathcal{F}(\theta) = (f(z) - b(z)) \nabla_\theta \log p(z \mid \theta) + \nabla_\theta \mathbb{E}_{p(z \mid \theta)} b(z)
\]</span></p>
<p>This, of course, only works if you can compute the last term analytically. Estimating it with REINFORCE won’t help you, as you’d then recover the standard Score Function estimator.</p>
<p>The easiest baseline one can think of is a constant. It doesn’t introduce any bias: indeed, offsetting the target <span class="math inline">\(f(z)\)</span> should not (and does not) change the true gradient of the expectation. However, as we’ve seen in the first part of the series, it can affect the variance. So let’s use the baseline that minimizes the total variance of the adjusted estimator:</p>
<p><span class="math display">\[
\hat{\nabla}_\theta^\text{SFconst} \mathcal{F}(\theta) = (f(z) - b) \nabla_\theta \log p(z \mid \theta)
\]</span></p>
<p>The total variance along all <span class="math inline">\(D\)</span> coordinates of this gradient estimator is <span class="math display">\[
\begin{align*}
\sum_{d=1}^D &\mathbb{D}\left[\hat{\nabla}_{\theta_d}^\text{SFconst} \mathcal{F}(\theta)\right]
= \sum_{d=1}^D \mathbb{D}\left[(f(z) - b) \nabla_{\theta_d} \log p(z \mid \theta)\right] \\
&= \sum_{d=1}^D \Bigl(
{\scriptsize
\mathbb{D}\left[f(z) \nabla_{\theta_d} \log p(z \mid \theta)\right]
- 2b \, \text{Cov}\left[f(z) \nabla_{\theta_d} \log p(z \mid \theta), \nabla_{\theta_d} \log p(z \mid \theta)\right]
+ b^2 \mathbb{D}\left[\nabla_{\theta_d} \log p(z \mid \theta)\right]
}
\Bigr)
\end{align*}
\]</span></p>
<p>The formula does look a bit terrifying, but we only care about <span class="math inline">\(b\)</span> at the moment, and the variance is quadratic in <span class="math inline">\(b\)</span>. The optimal value is thus obtained by minimizing this quadratic:</p>
<p><span class="math display">\[
b
= \frac{\sum_{d=1}^D \text{Cov}\left[f(z) \nabla_{\theta_d} \log p(z \mid \theta), \nabla_{\theta_d} \log p(z \mid \theta)\right]}{\sum_{d=1}^D \mathbb{D}\left[\nabla_{\theta_d} \log p(z \mid \theta)\right]}
= \frac{\sum_{d=1}^D \mathbb{E}\left[f(z) (\nabla_{\theta_d} \log p(z \mid \theta))^2\right]}{\sum_{d=1}^D \mathbb{E}\left[(\nabla_{\theta_d} \log p(z \mid \theta))^2\right]}
\]</span></p>
<p>Where we used the fact that <span class="math inline">\(\mathbb{E} \nabla_{\theta_d} \log p(z \mid \theta) = 0\)</span> for any <span class="math inline">\(d\)</span>. The moments in this formula cannot be computed analytically, but one can estimate them using running averages.</p>
<p>In the same fashion one can derive the optimal vector-valued baseline <span class="math inline">\(b\)</span> (and even a matrix-valued one!), consisting of individual baselines for each dimension of the gradient:</p>
<p><span class="math display">\[
b_d = \frac{\mathbb{E}\left[f(z) (\nabla_{\theta_d} \log p(z \mid \theta))^2\right]}{\mathbb{E}\left[(\nabla_{\theta_d} \log p(z \mid \theta))^2\right]}
\]</span></p>
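<p>On a small Bernoulli toy problem (an illustration of mine, not from the post) the optimal constant baseline can be estimated from samples via the formula above; here it collapses the variance almost entirely, since with only two outcomes the optimal constant makes the estimator nearly deterministic:</p>

```python
import numpy as np

rng = np.random.default_rng(4)

# Bernoulli toy: p(z = 1 | theta) = theta, with f(0) = 1, f(1) = 4.
theta, n = 0.3, 200_000
f = np.array([1.0, 4.0])

z = (rng.random(n) < theta).astype(int)
score = np.where(z == 1, 1.0 / theta, -1.0 / (1.0 - theta))
g_raw = f[z] * score                                  # plain REINFORCE samples

# Optimal constant baseline, estimated here from the same samples for simplicity
# (in practice one would use running averages instead):
b = np.mean(f[z] * score**2) / np.mean(score**2)
g_base = (f[z] - b) * score

exact_grad = f[1] - f[0]                              # = 3.0 for this toy
```

<p>Both estimators agree in the mean, but the baselined one has orders of magnitude lower variance.</p>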
<h2 id="selfcriticallearning">Self-critical Learning</h2>
<p>Ideally, the baseline approximates <span class="math inline">\(f(z)\)</span> as well as possible without using the actual sample <span class="math inline">\(z\)</span><a href="#fn5" class="footnoteRef" id="fnref5"><sup>5</sup></a>. However, it can still depend on <span class="math inline">\(\theta\)</span> without introducing any bias:</p>
<p><span class="math display">\[
\mathbb{E}_{p(z \mid \theta)} b(\theta) \nabla_\theta \log p(z \mid \theta) =
b(\theta) \mathbb{E}_{p(z \mid \theta)} \nabla_\theta \log p(z \mid \theta) =
0
\]</span></p>
<p>So, how can we use <span class="math inline">\(\theta\)</span> and <span class="math inline">\(f\)</span> to approximate <span class="math inline">\(f(z)\)</span> without touching the sample <span class="math inline">\(z\)</span> itself? The authors of the <a href="https://arxiv.org/abs/1612.00563">Self-critical Sequence Training for Image Captioning</a> paper suggested replacing the stochastic <span class="math inline">\(z\)</span> with the deterministic most probable outcome:</p>
<p><span class="math display">\[
\hat{z} = \text{argmax}_k \; p(z = k \mid \theta)
\]</span></p>
<p>And then we use <span class="math inline">\(f(\hat z)\)</span> as a baseline:</p>
<p><span class="math display">\[
\hat{\nabla}_\theta^\text{SFSC} \mathcal{F}(\theta) = (f(z) - f(\hat{z})) \nabla_\theta \log p(z \mid \theta)
\]</span></p>
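<p>A sketch of the self-critical estimator for a small categorical distribution (illustrative numbers of my choosing): the <span class="math inline">\(f(\hat z)\)</span> baseline leaves the estimator unbiased, as a comparison with the exact softmax gradient confirms:</p>

```python
import numpy as np

rng = np.random.default_rng(5)

# Categorical p(z | theta) parameterized by logits (playing the role of theta).
logits = np.array([0.5, 1.5, -1.0])
p = np.exp(logits) / np.exp(logits).sum()
f = np.array([2.0, -1.0, 3.0])

# Exact gradient of E[f] w.r.t. the logits of a softmax: p_j (f_j - E[f])
exact_grad = p * (f - p @ f)

z_hat = int(p.argmax())                 # the most probable outcome
n = 400_000
z = rng.choice(3, size=n, p=p)
onehot = np.eye(3)[z]
# score for logits: grad log p(z | theta) = onehot(z) - p
grad_sc = np.mean((f[z] - f[z_hat])[:, None] * (onehot - p), axis=0)
```

<p>Note that samples with <span class="math inline">\(f(z) > f(\hat z)\)</span> get a positive learning signal and the rest a negative one, exactly as described above.</p>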
<p>This is a very interesting baseline. Unlike standard REINFORCE, where each sample pulls probability towards itself, this estimator pulls probability in only for samples that are better than the most likely one. Conversely, for samples that are worse than the most likely one, this estimator pushes probability away. In effect, this is just a constant baseline that automatically adapts to whether the probability should be increased or decreased for a given sample <span class="math inline">\(z\)</span>.</p>
<h2 id="specialcases">Special Cases</h2>
<p>When <span class="math inline">\(f\)</span> is of some special form, one can design ad hoc variance reduction techniques. In particular, we’ll consider two of them:</p>
<h3 id="nvil">NVIL</h3>
<p>NVIL stands for <a href="https://arxiv.org/abs/1402.0030">Neural Variational Inference and Learning</a>, after the paper it was introduced in. Essentially, it combines the variance-reduction tricks that people in Reinforcement Learning came up with for REINFORCE (which they usually call the Policy Gradients method). The paper introduced three methods: <em>signal centering</em>, <em>variance normalization</em> and <em>local learning signals</em>. <em>Variance normalization</em> normalizes the gradient by a running-average estimate of its standard deviation – this is what, say, the Adam optimizer would do for you automatically, so we won’t dwell on it.</p>
<p><em>Signal centering</em> can be considered as baseline amortization for the context-dependent case. Let me decipher that: oftentimes the stochastic variable <span class="math inline">\(z\)</span> depends on some context <span class="math inline">\(x\)</span> (for example, the state of the environment in RL, or the observation <span class="math inline">\(x\)</span> in amortized variational inference); the expected objective then becomes <span class="math inline">\(\mathcal{F}(\theta \mid x) = \mathbb{E}_{p(z \mid x,\theta)} f(x, z)\)</span>. We can then make the baseline <span class="math inline">\(b\)</span> depend on <span class="math inline">\(x\)</span> as well without any sacrifice:</p>
<p><span class="math display">\[
\hat{\nabla}_\theta^\text{SF-NVIL} \mathcal{F}(\theta) = (f(x, z) - b(x)) \nabla_\theta \log p(z \mid x, \theta)
\]</span></p>
<p>We could reuse the formulas from the previous section, but that’d require us to store an independent baseline for each <span class="math inline">\(x\)</span> in the training set – this doesn’t scale. Instead, we’ll amortize the baseline using a neural network <span class="math inline">\(b(x \mid \varphi)\)</span> with parameters <span class="math inline">\(\varphi\)</span> and learn it by minimizing the expected squared error <a href="#fn6" class="footnoteRef" id="fnref6"><sup>6</sup></a> <span class="math display">\[\varphi^* = \text{argmin}_\varphi \mathbb{E}_{p(z \mid x,\theta)} (b(x \mid \varphi) - f(x, z))^2\]</span></p>
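<p>As a toy sketch of this baseline fitting (the one-parameter linear “network” <span class="math inline">\(b(x \mid \varphi) = \varphi x\)</span> and the reward <span class="math inline">\(f(x, z) = x + z\)</span> are my own assumptions for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def fit_baseline(theta, x, steps=4000, lr=0.01):
    """Fit b(x | varphi) = varphi * x to f(x, z) = x + z with
    z ~ Bernoulli(sigmoid(theta * x)), by SGD on the squared error."""
    varphi = 0.0
    p = sigmoid(theta * x)
    for _ in range(steps):
        z = float(rng.random() < p)
        f = x + z
        varphi -= lr * 2.0 * (varphi * x - f) * x  # grad of (b(x|varphi) - f)^2
    return varphi

varphi = fit_baseline(theta=0.0, x=1.0)
```

<p>For a fixed context the squared error is minimized by the conditional mean <span class="math inline">\(\mathbb{E}[f(x, z) \mid x]\)</span> (here <span class="math inline">\(x + \sigma(\theta x) = 1.5\)</span>), so the fitted baseline centers the learning signal for that context.</p>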
<p>The <em>local learning signal</em> allows you to exploit some nontrivial structure in <span class="math inline">\(f(z)\)</span> (and <span class="math inline">\(p(z \mid \theta)\)</span>). Namely, suppose we divided our <span class="math inline">\(z\)</span> into <span class="math inline">\(N\)</span> chunks: <span class="math inline">\(z = (z_1, \dots, z_N)\)</span>, and <span class="math inline">\(f\)</span> is a sum of rewards on prefixes: <span class="math inline">\(f(z) = \sum_{n=1}^N f_n(z_{\le n})\)</span> <a href="#fn7" class="footnoteRef" id="fnref7"><sup>7</sup></a>. It’s then obvious that the choice of later chunks <span class="math inline">\(z_n\)</span> does not influence the earlier rewards <span class="math inline">\(f_m\)</span> for <span class="math inline">\(m < n\)</span>. Indeed, one can see that the true gradient obeys the following:</p>
<p><span class="math display">\[
\begin{align*}
\nabla_\theta \mathcal{F}(\theta)
&= \mathbb{E}_{p(z_{\le N} \mid \theta)} \sum_{n=1}^N \left(\sum_{k=1}^N f_k(z_{\le k})\right) \nabla_\theta \log p(z_n \mid z_{<n}, \theta) \\
&= \sum_{n=1}^N \sum_{k=1}^N \mathbb{E}_{p(z_{\le N} \mid \theta)} \left[ f_k(z_{\le k}) \nabla_\theta \log p(z_n \mid z_{<n}, \theta)\right] \\
&= {\scriptsize \sum_{n=1}^N \left(\sum_{k=1}^{n-1} \mathbb{E}_{p(z_{\le n} \mid \theta)} \left[f_k(z_{\le k}) \nabla_\theta \log p(z_n \mid z_{<n}, \theta) \right] + \sum_{k=n}^N \mathbb{E}_{p(z_{\le N} \mid \theta)} \left[f_k(z_{\le k}) \nabla_\theta \log p(z_n \mid z_{<n}, \theta) \right]\right)} \\
&= {\scriptsize \sum_{n=1}^N \left(\mathbb{E}_{z_{<n}} \left[\left(\sum_{k=1}^{n-1} f_k(z_{\le k}) \right) \overbrace{\mathbb{E}_{z_n \mid z_{<n}} \nabla_\theta \log p(z_n \mid z_{<n}, \theta)}^{=0}\right] + \sum_{k=n}^N \mathbb{E}_{z_{\le N}} \left[f_k(z_{\le k}) \nabla_\theta \log p(z_n \mid z_{<n}, \theta) \right]\right)} \\
&= \mathbb{E}_{p(z \mid \theta)} \left[{\scriptsize \sum_{n=1}^N \sum_{k=n}^N f_k(z_{\le k}) \nabla_\theta \log p(z_n \mid z_{<n}, \theta)} \right]
\end{align*}
\]</span></p>
<p>Naturally, the part of the gradient corresponding to the <span class="math inline">\(n\)</span>-th chunk is weighted by the total reward we’d get after deciding upon <span class="math inline">\(z_n\)</span>, since the previous rewards do not depend on <span class="math inline">\(z_n\)</span>.</p>
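<p>Computationally, the per-chunk weight <span class="math inline">\(\sum_{k \ge n} f_k\)</span> (the “reward-to-go”) is just a reversed cumulative sum; a tiny sketch with made-up reward values:</p>

```python
import numpy as np

f = np.array([1.0, -0.5, 2.0, 0.3])      # f_k(z_{<=k}) for N = 4 chunks (made up)
reward_to_go = np.cumsum(f[::-1])[::-1]  # entry n holds sum_{k >= n} f_k
```

<p><code>reward_to_go[0]</code> is the full return, while later chunks are weighted only by the rewards that come after their decision.</p>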
<p>Combined with the context-dependent baseline, the estimator would be</p>
<p><span class="math display">\[
\hat{\nabla}_\theta^\text{SF-NVIL} \mathcal{F}(\theta) =
{\scriptsize \sum_{n=1}^N \sum_{k=n}^N (f_k(x, z_{\le k}) - b_k(x)) \nabla_\theta \log p(z_n \mid x, z_{<n}, \theta)}
\]</span></p>
<p>Moreover, the baseline can be made dependent on some of the previous <span class="math inline">\(z\)</span>, since such a baseline does not introduce any bias:</p>
<p><span class="math display">\[
\begin{align*}
\mathbb{E}_{p(z \mid x, \theta)}
& \sum_{n=1}^N \sum_{k=n}^N b_{n,k}(x, z_{<n}) \nabla_\theta \log p(z_n \mid x, z_{<n}, \theta) \\
& =
\sum_{n=1}^N \sum_{k=n}^N
\mathbb{E}_{p(z_{<n} \mid x, \theta)}
\mathbb{E}_{p(z_n \mid x, z_{<n}, \theta)}
b_{n,k}(x, z_{<n}) \nabla_\theta \log p(z_n \mid x, z_{<n}, \theta) \\
& =
\sum_{n=1}^N \sum_{k=n}^N
\mathbb{E}_{p(z_{<n} \mid x, \theta)} b_{n,k}(x, z_{<n})
\overbrace{\mathbb{E}_{p(z_n \mid x, z_{<n}, \theta)} \nabla_\theta \log p(z_n \mid x, z_{<n}, \theta)}^{=0}
= 0
\end{align*}
\]</span></p>
<p>However, learning <span class="math inline">\(O(N^2)\)</span> different baselines is computationally demanding, so one would probably at least assume some common underlying structure.</p>
<h3 id="vimco">VIMCO</h3>
<p>Another case of exploiting particular structure is the VIMCO (<a href="https://arxiv.org/abs/1602.06725">Variational inference for Monte Carlo objectives</a>) estimator. Again, consider the case of the latent variable <span class="math inline">\(z\)</span> being divided into <span class="math inline">\(N\)</span> chunks, but now the <span class="math inline">\(z_n\)</span> are independent identically distributed samples: <span class="math inline">\(z_n \sim p(z \mid \theta)\)</span>. Suppose <span class="math inline">\(f\)</span> has the following form: <span class="math inline">\(f(z) = g\left(\tfrac{1}{N} {\scriptsize\sum_{n=1}^N} h(z_n)\right)\)</span>. Then the REINFORCE gradient estimate would be:</p>
<p><span class="math display">\[
\begin{align*}
\nabla_\theta \mathcal{F}(\theta)
&= \mathbb{E}_{p(z \mid \theta)} \sum_{n=1}^N g\left(\tfrac{1}{N} {\scriptsize\sum_{j=1}^N} h(z_j)\right) \nabla_\theta \log p(z_n \mid \theta)
\end{align*}
\]</span></p>
<p>The problem with this estimator is that <span class="math inline">\(g(\dots)\)</span> is a common multiplier: it defines the magnitude of the gradient for each of the <span class="math inline">\(N\)</span> samples without any distinction, even though some samples <span class="math inline">\(z_n\)</span> might have turned out better than others. We would like to penalize the better samples less, performing a kind of <em>credit assignment</em>.</p>
<p>Just as in the previous section, we can consider baselines <span class="math inline">\(b_n\)</span> that depend on the samples <span class="math inline">\(z\)</span>. To keep them from biasing the gradient estimate we need to make sure each <span class="math inline">\(b_n\)</span> does not depend on <span class="math inline">\(z_n\)</span>. However, it can depend on all the other samples (denoted <span class="math inline">\(z_{-n}\)</span>), since they are independent of <span class="math inline">\(z_n\)</span>. Thus the bias of such a baseline is:</p>
<p><span class="math display">\[
\begin{align*}
\mathbb{E}_{p(z \mid \theta)}
\sum_{n=1}^N b_n(z_{-n}) \nabla_\theta \log p(z_n \mid \theta)
=
\sum_{n=1}^N
\mathbb{E}_{p(z_{-n} \mid \theta)}
b_n(z_{-n})
\overbrace{
\mathbb{E}_{p(z_n \mid \theta)}
\nabla_\theta \log p(z_n \mid \theta)
}^{=0}
= 0
\end{align*}
\]</span></p>
<p>The authors of the VIMCO paper also suggested an interesting trick to avoid learning <span class="math inline">\(b_n(z_{-n})\)</span>: we want <span class="math inline">\(b_{n}(z_{-n})\)</span> to approximate <span class="math inline">\(f(z)\)</span> as well as possible, and we actually have access to everything we need to compute <span class="math inline">\(f(z)\)</span> except the term that depends on <span class="math inline">\(z_n\)</span>: <span class="math inline">\(h(z_n)\)</span>. However, all samples are identically distributed, so we can approximate this missing term with the average of the others:</p>
<p><span class="math display">\[
\hat h_n(z_{-n}) = \frac{1}{N-1} \sum_{j \not= n} h(z_j) \stackrel{\text{hopefully}}{\approx} h(z_n)
\]</span></p>
<p>Then our baseline becomes</p>
<p><span class="math display">\[
b_n(z_{-n})
=
g\left(\tfrac{{\scriptsize\sum_{j \not= n}} h(z_j) + \hat h_n(z_{-n})}{N} \right)
\]</span></p>
<p>One can also consider other averaging schemes for <span class="math inline">\(\hat h_n(z_{-n})\)</span> to approximate <span class="math inline">\(h(z_n)\)</span>: geometric, harmonic, Minkowski, etc.</p>
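<p>A small sketch of this leave-one-out construction (taking <span class="math inline">\(g = \log\)</span>, a natural choice for importance-weighted objectives, as an assumption):</p>

```python
import numpy as np

def vimco_baselines(h, g=np.log):
    """Per-sample baselines b_n(z_{-n}): replace h(z_n) by the leave-one-out
    arithmetic mean of the other h(z_j), then re-apply g to the average."""
    h = np.asarray(h, dtype=float)
    N = len(h)
    loo_mean = (h.sum() - h) / (N - 1)        # hat h_n(z_{-n})
    return g((h.sum() - h + loo_mean) / N)    # average with h(z_n) swapped out

b = vimco_baselines([1.0, 2.0, 3.0])
```

<p>Each <code>b[n]</code> uses every term of <span class="math inline">\(f(z)\)</span> except <span class="math inline">\(h(z_n)\)</span>, which is exactly what keeps it from biasing the gradient.</p>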
<h2 id="muprop">MuProp</h2>
<p>So far we have been considering only baselines <span class="math inline">\(b\)</span> that have zero expected value and thus do not bias the gradient estimator. However, there are cases when we actually know the baseline’s expectation and can compensate for the introduced bias.</p>
<p>The <a href="https://arxiv.org/abs/1511.05176">MuProp</a> paper suggests using a Taylor expansion as a baseline, provided we can compute certain moments of the distribution <span class="math inline">\(p(z \mid \theta)\)</span> in closed form. For example, if <span class="math inline">\(p(z \mid \theta) = \mathcal{N}(z \mid \mu(\theta), \Sigma(\theta))\)</span>, then we already have access to the first and second moments – the mean and the covariance matrix.</p>
<p>Consider a first-order Taylor expansion of <span class="math inline">\(f(z)\)</span> at <span class="math inline">\(\mu(\theta) = \mathbb{E}_{p(z \mid \theta)} z\)</span>:</p>
<p><span class="math display">\[
b_\theta(z) = f(\mu(\theta)) + \nabla_z f(\mu(\theta))^T (z - \mu(\theta))
\]</span></p>
<p>Then the bias introduced by such baseline would be</p>
<p><span class="math display">\[
\begin{align*}
\mathbb{E}_{p(z \mid \theta)}
&
b_\theta(z) \nabla_\theta \log p(z \mid \theta) \\
&=
\mathbb{E}_{p(z \mid \theta)}
\left[
f(\mu(\theta)) + \nabla_z f(\mu(\theta))^T (z - \mu(\theta))
\right] \nabla_\theta \log p(z \mid \theta) \\
&=
\mathbb{E}_{p(z \mid \theta)}
\left[
\nabla_z f(\mu(\theta))^T z
+
f(\mu(\theta)) - \nabla_z f(\mu(\theta))^T \mu(\theta)
\right] \nabla_\theta \log p(z \mid \theta) \\
&=
\mathbb{E}_{p(z \mid \theta)}
\left[
\nabla_z f(\mu(\theta))^T z \nabla_\theta \log p(z \mid \theta)
\right] \\
& \quad\quad\quad +
\left[
f(\mu(\theta)) - \nabla_z f(\mu(\theta))^T \mu(\theta)
\right]
\overbrace{\mathbb{E}_{p(z \mid \theta)} \nabla_\theta \log p(z \mid \theta)}^{=0} \\
&=
\nabla_z f(\mu(\theta))^T
\mathbb{E}_{p(z \mid \theta)}
\left[
z \nabla_\theta \log p(z \mid \theta)
\right]
=
\nabla_z f(\mu(\theta))^T
\nabla_\theta
\mathbb{E}_{p(z \mid \theta)}
\left[
z
\right] \\
& =
\nabla_z f(\mu(\theta))^T
\nabla_\theta \mu(\theta) =
\nabla_\theta f(\mu(\theta))
\end{align*}
\]</span></p>
<p>So the (first-order) MuProp estimator has the following form:</p>
<p><span class="math display">\[
\hat{\nabla}_\theta^\text{SF-MuProp} \mathcal{F}(\theta) = (f(z) - f(\mu(\theta)) - \nabla_z f(\mu(\theta))^T (z - \mu(\theta))) \nabla_\theta \log p(z \mid \theta) + \nabla_\theta f(\mu(\theta))
\]</span></p>
<p>An appealing property is that not only is this gradient estimator unbiased, but it also uses the gradients of <span class="math inline">\(f\)</span> in the <span class="math inline">\(\nabla_\theta f(\mu(\theta))\)</span> term, essentially propagating the learning signal through the mean of the random variable <span class="math inline">\(z\)</span>, and then correcting for the introduced bias with REINFORCE.</p>
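<p>A minimal sanity check of the whole construction (the Gaussian <span class="math inline">\(p(z \mid \theta) = \mathcal{N}(\theta, 1)\)</span> with <span class="math inline">\(\mu(\theta) = \theta\)</span> and the quadratic <span class="math inline">\(f(z) = z^2\)</span> are my own toy choices):</p>

```python
import numpy as np

rng = np.random.default_rng(2)

def muprop_grad(theta, n=20000):
    """First-order MuProp estimate of grad_theta E_{N(z|theta,1)} z^2,
    whose exact value is 2 * theta (since E z^2 = theta^2 + 1)."""
    z = theta + rng.standard_normal(n)       # samples of N(theta, 1)
    f = z ** 2
    mu = theta                               # mu(theta) = theta here
    baseline = mu ** 2 + 2 * mu * (z - mu)   # f(mu) + f'(mu) (z - mu)
    score = z - theta                        # d/dtheta log N(z | theta, 1)
    correction = 2 * mu                      # d/dtheta f(mu(theta)), the known bias
    return np.mean((f - baseline) * score) + correction

g = muprop_grad(0.7)
```

<p>The residual <span class="math inline">\(f(z) - b_\theta(z)\)</span> is just the second-order remainder <span class="math inline">\((z - \mu)^2\)</span>, so the score-function term has far lower variance than plain REINFORCE, while the analytic term <span class="math inline">\(\nabla_\theta f(\mu(\theta))\)</span> restores unbiasedness.</p>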
<p>One could, of course, envision a second-order baseline, especially considering we have the covariance matrix readily available for many distributions. However, such a baseline would be more computationally demanding, requiring us to compute the Hessian matrix of <span class="math inline">\(f(z)\)</span> and evaluate it at some point, which would cost at least <span class="math inline">\(\text{dim}(z)^2\)</span> computations. Higher-order expansions would require even more computation, so it’s hard to achieve high nonlinearity in the baseline using MuProp alone <a href="#fn8" class="footnoteRef" id="fnref8"><sup>8</sup></a>.</p>
<h2 id="rebar">REBAR</h2>
<p><a href="https://arxiv.org/abs/1703.07370">REBAR</a><a href="#fn9" class="footnoteRef" id="fnref9"><sup>9</sup></a> is a clever way to use the <a href="/posts/20171028stochasticcomputationgraphsdiscreterelaxations.html#gumbelsoftmaxrelaxationakaconcretedistribution">Gumbel-Softmax (aka Concrete) Relaxation</a> as a baseline.</p>
<p>A naive approach to the task would be to recall the Gumbel-Max trick: as we have already seen, this trick gives us a reparametrization, albeit not a differentiable one. However, we can move the non-differentiability into <span class="math inline">\(f(z)\)</span>, and then invoke REINFORCE to estimate the gradient of the average of the non-differentiable function. (From now on we will assume <span class="math inline">\(z\)</span> is a one-hot vector and argmax is an operator that returns a one-hot vector indicating the position of the maximal element of the input; overall we will abuse notation, treating the same <span class="math inline">\(z\)</span> as a one-hot vector or a number depending on the context.)</p>
<p><span class="math display">\[
\nabla_\theta \mathbb{E}_{p(z \mid \theta)} f(z)
= \nabla_\theta \mathbb{E}_{p(\zeta \mid \theta)} f(\text{argmax} \zeta)
= \mathbb{E}_{p(\zeta \mid \theta)} f(\text{argmax} \zeta) \nabla_\theta \log p(\zeta \mid \theta)
\]</span></p>
<p>Where <span class="math inline">\(\zeta_k\)</span> is obtained by shifting an independent standard Gumbel r.v. <span class="math inline">\(\gamma_k\)</span> by the logit of the <span class="math inline">\(k\)</span>-th probability:</p>
<p><span class="math display">\[
\zeta_k = \log p(z = k \mid \theta) + \gamma_k, \quad\quad \gamma_k \sim \text{Gumbel}(0, 1)
\]</span></p>
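<p>Before moving on, a quick numerical sanity check (with an arbitrary three-point distribution of my choosing) that taking the argmax of these shifted Gumbels indeed reproduces <span class="math inline">\(p(z \mid \theta)\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

p = np.array([0.2, 0.5, 0.3])                       # arbitrary p(z = k | theta)
gumbel = -np.log(-np.log(rng.random((100000, 3))))  # standard Gumbel(0, 1) noise
zeta = np.log(p) + gumbel                           # zeta_k = log p_k + gamma_k
freq = np.bincount(zeta.argmax(axis=1), minlength=3) / 100000.0
```

<p><code>freq</code> should match <code>p</code> up to Monte Carlo noise, since <span class="math inline">\(\text{argmax} \zeta\)</span> is distributed exactly as the original categorical variable.</p>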
<p>Thus <span class="math inline">\(\zeta_k\)</span> also has a Gumbel distribution: <span class="math inline">\(\zeta_k \sim \text{Gumbel}(\log p(z = k \mid \theta), 1)\)</span>. Ok, so what have we bought ourselves here? So far it looks like we gained nothing and only complicated the whole thing with these extra <span class="math inline">\(\zeta\)</span>s. However, we just obtained a crucial property: we separated the non-differentiability from the reparametrization. We can now sample continuous reparametrizable <span class="math inline">\(\zeta\)</span>s, and the troublesome part – the argmax – is now a part of <span class="math inline">\(f\)</span>. And this opens up a new way to use <strong>baselines with nonzero expectation</strong>:</p>
<p><span class="math display">\[
\mathbb{E}_{p(\zeta \mid \theta)} (f(\text{argmax} \zeta) - b(\zeta)) \nabla_\theta \log p(\zeta \mid \theta) + \nabla_\theta \mathbb{E}_{p(\zeta \mid \theta)} b(\zeta)
\]</span></p>
<p>And the most interesting thing is that the bias correction term, <span class="math inline">\(\nabla_\theta \mathbb{E}_{p(\zeta \mid \theta)} b(\zeta)\)</span>, is differentiable and reparametrizable, and thus its gradient can be estimated with the reparametrization trick. Now, that’s nice, but we can’t just take any <span class="math inline">\(b(\zeta)\)</span> and hope for variance reduction. In order to actually benefit from such a baseline, we need <span class="math inline">\(b(\zeta) \approx f(\text{argmax} \zeta)\)</span>. Luckily, we already know a way to arrange this: the Gumbel-Softmax relaxation, obtained by setting <span class="math inline">\(b(\zeta) = f(\text{softmax}_\tau(\zeta))\)</span>:</p>
<p><span class="math display">\[
\hat{\nabla}_\theta^\text{SF-REBAR-naive} \mathcal{F}(\theta)
=
(f(\text{argmax} \zeta) - b(\zeta)) \nabla_\theta \log p(\zeta \mid \theta) + \nabla_\theta f(\text{softmax}_\tau(\zeta))
\]</span></p>
<p>However, there’s a reason I called this estimator <em>naive</em>. If you actually try implementing it, you will hardly see any improvement. If you look closely, you’ll notice that we actually increased the variance of the REINFORCE estimator by switching to the <span class="math inline">\(\zeta\)</span>s, and this increase might not be compensated by the Gumbel-Softmax baseline we introduced.</p>
<p>I guess it all looks a bit confusing at this point, so let’s take a closer look at the original REINFORCE estimator and the naive REBAR without a baseline:</p>
<p><span class="math display">\[
\begin{align*}
\hat{\nabla}_\theta^\text{SF} \mathcal{F}(\theta)
&=
f(z) \nabla_\theta \log p(z \mid \theta)
\\
\hat{\nabla}_\theta^\text{SF-REBAR-naive-no-baseline} \mathcal{F}(\theta)
&=
f(\text{argmax} \zeta) \nabla_\theta \log p(\zeta \mid \theta)
\end{align*}
\]</span></p>
<p>You’d think they’re the same, but they’re actually quite different. Not in the first terms, <span class="math inline">\(f(z)\)</span> and <span class="math inline">\(f(\text{argmax} \zeta)\)</span>, as those are basically the same. It’s the second term that’s important to us: vanilla REINFORCE has <span class="math inline">\(\nabla_\theta \log p({\color{red} z} \mid \theta)\)</span>, whereas our naive REBAR has <span class="math inline">\(\nabla_\theta \log p({\color{red} \zeta} \mid \theta)\)</span>. This seemingly innocent difference is a huge deal! To see why, <a href="/posts/20170910stochasticcomputationgraphscontinuouscase.html">recall the REINFORCE intuition</a>: it is not a gradient method, but rather a random search in disguise: it tries a bunch of points and increases the probabilities of those performing well. However, the major problem is that different <span class="math inline">\(\zeta\)</span>s can lead to the same <span class="math inline">\(z\)</span>: indeed, the argmax takes on only a finite number of different values, whereas there’s a continuum of different vectors <span class="math inline">\(\zeta\)</span>. As a result, our naive REBAR estimate would be trying some <span class="math inline">\(\zeta\)</span> (corresponding to some <span class="math inline">\(z\)</span>) and then pulling the probability mass towards (or away from) this point, possibly undoing some useful work it did for a different <span class="math inline">\(\zeta\)</span> (but the same <span class="math inline">\(z\)</span>).</p>
<p>To fix this issue we need to stay in the “space of <span class="math inline">\(\nabla_\theta \log p(z \mid \theta)\)</span>” – that is, to use a control variate of the form <span class="math inline">\(b(z) \nabla_\theta \log p(z \mid \theta)\)</span>. And one is given by the following clever identity:</p>
<p><span class="math display">\[
\begin{align*}
\nabla_\theta \mathbb{E}_{p(\zeta \mid \theta)} b(\text{softmax}_\tau(\zeta))
&=
\nabla_\theta \mathbb{E}_{p(z, \zeta \mid \theta)} b(\text{softmax}_\tau(\zeta)) \\
&=
\nabla_\theta \mathbb{E}_{p(z \mid \theta)} \mathbb{E}_{p(\zeta \mid z, \theta)} b(\text{softmax}_\tau(\zeta)) \\
&=
\mathbb{E}_{p(z \mid \theta)} \mathbb{E}_{p(\zeta \mid z, \theta)} b(\text{softmax}_\tau(\zeta)) \nabla_\theta \log p(z \mid \theta)
\\& \quad +
\mathbb{E}_{p(z \mid \theta)} \nabla_\theta \mathbb{E}_{p(\zeta \mid z, \theta)} b(\text{softmax}_\tau(\zeta))
\end{align*}
\]</span></p>
<p>On the left-hand side we have the usual Gumbel-Softmax relaxed gradient, which we can compute using the reparametrization. On the right-hand side we have a REINFORCE-like gradient – which is a good candidate for a baseline – and another weird-looking term. We can rearrange the terms to express the bias of such a baseline through the other two terms:</p>
<p><span class="math display">\[
\begin{align*}
\mathbb{E}_{p(z \mid \theta)} \mathbb{E}_{p(\zeta \mid z, \theta)} & b(\text{softmax}_\tau(\zeta)) \nabla_\theta \log p(z \mid \theta)
\\ =
\nabla_\theta \mathbb{E}_{p(\zeta \mid \theta)} & b(\text{softmax}_\tau(\zeta))
-
\mathbb{E}_{p(z \mid \theta)} \nabla_\theta \mathbb{E}_{p(\zeta \mid z, \theta)} b(\text{softmax}_\tau(\zeta))
\end{align*}
\]</span></p>
<p>But what about that weird-looking last term? Can it be estimated efficiently? First, note that we do not need to differentiate through <span class="math inline">\(z\)</span> – the dependence through <span class="math inline">\(z\)</span> was already accounted for. The expectation we need to differentiate is taken over <span class="math inline">\(p(\zeta \mid z, \theta)\)</span>, which is a distribution over <span class="math inline">\(\zeta\)</span> such that <span class="math inline">\(\text{argmax} \zeta = z\)</span>. A reassuring observation is that such a random variable is continuous. Moreover, the restriction <span class="math inline">\(\text{argmax} \zeta = z\)</span> defines a connected region of <span class="math inline">\(\mathbb{R}^K\)</span>, which means there does exist a differentiable reparametrization for such a random variable! We won’t be deriving this reparametrization here; please refer to <a href="https://cmaddis.github.io/gumbelmachinery">Chris Maddison’s blog</a>. That said, the reparametrization is</p>
<p><span class="math display">\[
\zeta_k \mid z = \begin{cases}
-\log ( -\log v_k), \quad\quad\quad & \text{if $z = k$}, \\
-\log \left(-\frac{\log v_k}{p(z=k \mid \theta)} - \log v_z \right), \quad\quad\quad & \text{otherwise}.
\end{cases}
\]</span></p>
<p>Where <span class="math inline">\(v \sim U[0,1]^K\)</span> is a <span class="math inline">\(K\)</span>-dimensional standard uniform r.v. Now, having this reparametrization, we can estimate both terms in the bias correction via the reparametrization trick, which leads to the following estimate (I use the notation <span class="math inline">\(\hat{z} \mid z\)</span> to mean a single object, the conditionally relaxed variable; it’s <strong>not</strong> <span class="math inline">\(\hat{z}\)</span> with some <span class="math inline">\(z\)</span> applied to it, and neither is it <span class="math inline">\(b(\cdot \mid z)\)</span>):</p>
<p><span class="math display">\[
\begin{align*}
\hat{\nabla}_\theta^\text{SF-REBAR} \mathcal{F}(\theta)
=
\left[f(z) - b(\hat{z} \mid z) \right] \nabla_\theta \log p(z \mid \theta)
+
\nabla_\theta
b(\hat{z})
-
\nabla_\theta b(\hat{z} \mid z)
\end{align*}
\]</span> We use <span class="math inline">\(u \sim U[0,1]^K\)</span> to reparametrize <span class="math inline">\(\zeta\)</span> (which leads to both <span class="math inline">\(z\)</span> and <span class="math inline">\(\hat{z}\)</span>), and <span class="math inline">\(v \sim U[0,1]^K\)</span> is used to reparametrize <span class="math inline">\(\zeta \mid z\)</span>, see the formula above. The quantities of the REBAR gradient estimate are computed as follows: <span class="math display">\[
\begin{align*}
z = \text{argmax} \zeta,
\quad\quad
\hat{z} = \text{softmax}_\tau(\zeta),
\quad\quad
\hat{z} \mid z = \text{softmax}_\tau(\zeta \mid z),
\\
\zeta_k = \log p(z = k \mid \theta) - \log(-\log u_k),
\quad\quad
\zeta_k \mid z = \text{given above}
\end{align*}
\]</span></p>
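<p>The conditional reparametrization is easy to sketch and check: by construction, every conditional sample must attain its argmax exactly at <span class="math inline">\(z\)</span> (a minimal sketch following the cases above, not the authors’ code):</p>

```python
import numpy as np

rng = np.random.default_rng(4)

def conditional_gumbel(p, z, v):
    """Reparametrize zeta | argmax zeta = z from uniforms v (see cases above)."""
    zeta = -np.log(-np.log(v) / p - np.log(v[z]))  # coordinates k != z
    zeta[z] = -np.log(-np.log(v[z]))               # the argmax coordinate
    return zeta

p = np.array([0.2, 0.5, 0.3])
ok = all(conditional_gumbel(p, 0, rng.random(3)).argmax() == 0 for _ in range(1000))
```

<p>Since <span class="math inline">\(-\log v_k > 0\)</span>, every off-argmax coordinate satisfies <span class="math inline">\(e^{-\zeta_k} \ge e^{-\zeta_z}\)</span>, i.e. <span class="math inline">\(\zeta_k \le \zeta_z\)</span>, so the constraint holds by construction.</p>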
<p>What about <span class="math inline">\(b(\cdot)\)</span>? The authors use <span class="math inline">\(b(\cdot) = \eta f(\cdot)\)</span>, where <span class="math inline">\(\eta\)</span> is a hyperparameter that regulates the strength of the baseline. But it turns out we can avoid a hyperparameter search for this variable…</p>
<h3 id="hyperparameterlearingandrelax">Hyperparameter learning and RELAX</h3>
<p>An important observation is that the gradient estimator we’ve obtained is unbiased<a href="#fn10" class="footnoteRef" id="fnref10"><sup>10</sup></a>. That is, for any choice of the hyperparameters <span class="math inline">\(\tau\)</span> (the Gumbel-Softmax temperature) and <span class="math inline">\(\eta\)</span>, the average value of our estimator is equal to the true gradient. Thus, we can actually learn their values! The only question is, well, which objective should we minimize? We can’t minimize the problem’s loss <span class="math inline">\(f(\cdot)\)</span>, since we already have its gradient. The next logical step is to minimize the <em>variance</em> of the gradient estimator. <span class="math display">\[
\text{Var}\left( \hat{\nabla}_\theta^\text{SF-REBAR} \mathcal{F}(\theta) \right)
=
\sum_{i}
\left(
\mathbb{E}
\left[\hat{\nabla}_{\theta_i}^\text{SF-REBAR} \mathcal{F}(\theta)\right]^2
-
\left[
\mathbb{E} \hat{\nabla}_{\theta_i}^\text{SF-REBAR} \mathcal{F}(\theta)\right]^2
\right)
\]</span> Where the expectation is taken over all randomness. Moreover, since the estimator is unbiased, we can omit the second term in the sum, as it’ll be constant w.r.t. <span class="math inline">\(\tau\)</span> and <span class="math inline">\(\eta\)</span>.</p>
<p>Thus the objective for <span class="math inline">\(\tau\)</span> and <span class="math inline">\(\eta\)</span> is <span class="math display">\[
\begin{align*}
\tau^*, \eta^*
&=
\text{argmin}_{\tau, \eta} \text{Var}\left( \hat{\nabla}_\theta^\text{SF-REBAR} \mathcal{F}(\theta) \right) \\
&=
\text{argmin}_{\tau, \eta}
\mathbb{E}
\sum_{i}
\left[\hat{\nabla}_{\theta_i}^\text{SF-REBAR} \mathcal{F}(\theta)\right]^2
=
\text{argmin}_{\tau, \eta}
\mathbb{E}
\left\|
\hat{\nabla}_{\theta}^\text{SF-REBAR} \mathcal{F}(\theta)
\right\|^2
\end{align*}
\]</span></p>
<p>This optimization problem can be solved using stochastic optimization. We first get a stochastic estimate of the gradient w.r.t. <span class="math inline">\(\theta\)</span>, and then obtain an estimate of the gradient w.r.t. the “hyperparameters” <span class="math inline">\(\tau\)</span> and <span class="math inline">\(\eta\)</span>. A practical implementation is somewhat tricky; the <a href="https://arxiv.org/abs/1802.05098">MagicBox operator</a> might be useful.</p>
<p>Finally, it’s worth noticing that although we can’t apply this estimator in some scenarios like Reinforcement Learning (because we don’t have access to <span class="math inline">\(f(\cdot)\)</span>), it’s possible to introduce a minor modification to overcome this issue. Remember the moment we decided to put <span class="math inline">\(b(\cdot) = \eta f(\cdot)\)</span>? At that moment we could have made any other choice, for example <span class="math inline">\(b(\cdot) = h_\eta(\cdot)\)</span> – a neural network with parameters <span class="math inline">\(\eta\)</span> that takes <span class="math inline">\(\hat{z}\)</span> as input and returns the same thing <span class="math inline">\(f(\cdot)\)</span> would return (a scalar in our case). Then we can learn the parameters <span class="math inline">\(\eta\)</span> of this network in the same way as before.</p>
<p>This gives us the so-called <a href="https://arxiv.org/abs/1711.00123">RELAX</a> gradient estimator: <span class="math display">\[
\hat{\nabla}_\theta^\text{SF-RELAX} \mathcal{F}(\theta)
=
\left[f(z) - h_\eta(\hat{z} \mid z) \right] \nabla_\theta \log p(z \mid \theta)
+
\nabla_\theta
h_\eta(\hat{z})
-
\nabla_\theta h_\eta(\hat{z} \mid z)
\]</span></p>
<p>This estimator does not assume access to the optimized function <span class="math inline">\(f(\cdot)\)</span>, nor its differentiability, so it can be applied in a larger number of scenarios. Of course, having access to a differentiable <span class="math inline">\(f(\cdot)\)</span> would put this estimator at a disadvantage compared to REBAR, since the latter already has a pretty good idea of how the baseline should look.</p>
<p>Overall, I like the REBAR/RELAX gradient estimator for its use of the target function’s gradient <span class="math inline">\(\nabla_z f(\cdot)\)</span> and a nonlinear baseline that closely approximates the target <span class="math inline">\(f(z)\)</span>. However, its effectiveness comes at a cost: you need 3 times more computation – one discrete run <span class="math inline">\(f(z)\)</span>, one relaxed run <span class="math inline">\(f(\hat{z})\)</span> and one conditionally relaxed run <span class="math inline">\(f(\hat{z} \mid z)\)</span> – which is much more than the plain Gumbel-Softmax requires.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This post closes the series on Stochastic Computation Graphs. There are many other methods that I left uncovered – maybe because I consider them weird mathematical hacks, or simply because I didn’t know they existed! Overall, I think the estimators covered in these 3 posts, and the reasoning behind them, establish a solid toolkit for many problems of practical interest.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>REINFORCE stands for <strong>RE</strong>ward <strong>I</strong>ncrement = <strong>N</strong>onnegative <strong>F</strong>actor × <strong>O</strong>ffset <strong>R</strong>einforcement × <strong>C</strong>haracteristic <strong>E</strong>ligibility<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Monte Carlo averaging isn’t very efficient. The variance decreases as <span class="math inline">\(O(1/L)\)</span> for <span class="math inline">\(L\)</span> samples, and typical error (by invoking the CLT) drops as <span class="math inline">\(O(1 / \sqrt{L})\)</span>. That is, to reduce the typical error of MC approximation by a factor of 1000, you’d need an order of millions samples! It’s very hard to beat the high variance by sampling alone.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>One could also use matrix baselines and multiply them by the <span class="math inline">\(\nabla \log p(z\theta)\)</span> as usual, but we won’t cover these – this method does not scale well with number of parameters in <span class="math inline">\(\theta\)</span>.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>Monte Carlo averaging isn’t very efficient. The variance decreases as <span class="math inline">\(O(1/L)\)</span> for <span class="math inline">\(L\)</span> samples, and typical error (by invoking the CLT) drops as <span class="math inline">\(O(1 / \sqrt{L})\)</span>. That is, to reduce the typical error of MC approximation by a factor of 1000, you’d need an order of millions samples! It’s very hard to beat the high variance by sampling alone.<a href="#fnref4">↩</a></p></li>
<li id="fn5"><p>You might ask: wait, what if we use an independent and identically distributed sample <span class="math inline">\(z'\)</span> in the baseline? Consider the following: <span class="math display">\[ \left( f(z) - b(z') \right) \nabla_\theta \log p(z \mid \theta), \quad\quad\quad z, z' \sim p(z \mid \theta) \]</span> This is a valid and unbiased gradient estimate; however, since <span class="math inline">\(z\)</span> and <span class="math inline">\(z'\)</span> are independent, it is essentially a stochastic version of the following estimator: <span class="math display">\[ \left( f(z) - \mathbb{E}_{p(z' \mid \theta)} b(z') \right) \nabla_\theta \log p(z \mid \theta), \quad\quad\quad z \sim p(z \mid \theta) \]</span> So we’re better off approximating that expectation directly with a constant (w.r.t. <span class="math inline">\(z\)</span>) baseline <span class="math inline">\(b(\theta)\)</span>, and this is done in the NVIL method we’ll talk about later.<a href="#fnref5">↩</a></p></li>
<li id="fn6"><p>Actually, it’d make much more sense to minimize the variance of the obtained estimator directly; we’ll discuss this later when talking about the REBAR and RELAX methods.<a href="#fnref6">↩</a></p></li>
<li id="fn7"><p>The Evidence Lower Bound of Variational Inference can be presented in this way. Namely, the ELBO is <span class="math display">\[
\begin{align*}
\mathcal{F}(\theta) &= \mathbb{E}_{q(z_{1, \dots, N} \mid \theta)} \log \frac{p(X, z_1, \dots, z_N \mid \theta)}{q(z_{1, \dots, N} \mid \theta)} \\
&= \mathbb{E}_{q(z_{1, \dots, N} \mid \theta)} \left[ \log p(X \mid z_{1, \dots, N}, \theta) + \sum_{n=1}^N \log \frac{p(z_n \mid z_{<n})}{q(z_n \mid z_{<n}, \theta)} \right]
\end{align*}
\]</span> Then each intermediate layer gives you a reward corresponding to the KL divergence with the prior, and the last layer also gives you the reconstruction reward.<a href="#fnref7">↩</a></p></li>
<li id="fn8"><p>This might be due to the Taylor expansion being an unfortunate choice. Probably, considering some other expansion would be advantageous, but I’m unaware of any such works.<a href="#fnref8">↩</a></p></li>
<li id="fn9"><p>The name is a very clever joke. Rebar is a term from construction works for steel bars that are used to <em>reinforce</em> <em>concrete</em>, and Concrete distribution is the name for the distribution of the GumbelSoftmax relaxed random variables.<a href="#fnref9">↩</a></p></li>
<li id="fn10"><p>Unlike the Gumbel-Softmax, which is biased for all <span class="math inline">\(\tau > 0\)</span>. In a sense, REBAR is a debiased version of the Gumbel-Softmax.<a href="#fnref10">↩</a></p></li>
</ol>
</div>
Sun, 12 Nov 2017 00:00:00 UT
Artem
http://artem.sobolev.name/posts/20171112stochasticcomputationgraphsfixingreinforce.html
Stochastic Computation Graphs: Discrete Relaxations
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/rm0asUsdwas/20171028stochasticcomputationgraphsdiscreterelaxations.html
<p>This is the second post of the <a href="/tags/stochastic%20computation%20graphs%20series.html">stochastic computation graphs series</a>. Last time we discussed models with <a href="/posts/20170910stochasticcomputationgraphscontinuouscase.html">continuous stochastic nodes</a>, for which there are powerful reparametrization techniques.</p>
<p>Unfortunately, these methods don’t work for discrete random variables. Moreover, it looks like there’s no way to backpropagate through discrete stochastic nodes, as there’s no infinitesimal change of random values when you infinitesimally perturb their parameters.</p>
<p>In this post I’ll talk about continuous relaxations of discrete random variables.</p>
<!-- more -->
<h2 id="asymptoticreparametrization">Asymptotic reparametrization</h2>
<p>One way to train models with discrete random variables is to consider an equivalent model with continuous random variables. Let me show you an example. Suppose you have a feedforward neural network for classification that receives <span class="math inline">\(x\)</span> and outputs a distribution over targets <span class="math inline">\(p(y \mid x)\)</span>, where a typical layer looks like <span class="math inline">\(h_k = \sigma(W_k h_{k-1} + b_k)\)</span>. You’d like to apply dropout to each weight of this layer and <em>tune its probabilities</em>. To do so we introduce binary latent variables <span class="math inline">\(z^{(k)}_{ij}\)</span> denoting whether a weight is on or off. There’s one such variable for each weight, so we can stack them into a matrix <span class="math inline">\(Z_k\)</span> of the same shape as the weight matrix <span class="math inline">\(W_k\)</span>. Then elementwise multiplication <span class="math inline">\(W_k \circ Z_k\)</span> zeroes out the dropped weights, so the formula becomes <span class="math inline">\(h_k = \sigma((W_k \circ Z_k) h_{k-1} + b_k)\)</span>. We assume each dropout mask independently follows a Bernoulli distribution: <span class="math inline">\(z_{ij}^{(k)} \sim \text{Bernoulli}(p_{ij}^{(k)})\)</span> (<span class="math inline">\(Z^{(k)} \sim \text{Bernoulli}(P^{(k)})\)</span> for short).</p>
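<p>As a concrete illustration, here’s a minimal NumPy sketch of such a masked layer (the layer sizes, the sigmoid activation, and the keep probability 0.8 are made-up values for the example, not anything prescribed above):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function

d_in, d_out = 5, 4
W = rng.normal(size=(d_out, d_in))           # weight matrix W_k
b = rng.normal(size=d_out)                   # bias b_k
P = np.full((d_out, d_in), 0.8)              # per-weight keep probabilities

h_prev = rng.uniform(size=d_in)              # previous layer's activations h_{k-1}
Z = rng.binomial(1, P)                       # per-weight Bernoulli dropout masks
h = sigma((W * Z) @ h_prev + b)              # W ∘ Z zeroes out the dropped weights
print(h.shape)
```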
<p>In order to learn these masks (or, rather, the parameters of the distribution <span class="math inline">\(q(Z \mid \Lambda)\)</span> over masks, parametrized by <span class="math inline">\(\Lambda\)</span>) we employ the variational inference approach:</p>
<p><span class="math display">\[
\begin{align*}
\log p(y \mid x) \ge
\mathcal{L}(\Lambda)
&= \mathbb{E}_{q(Z \mid \Lambda)} \log \frac{p(y, Z \mid x)}{q(Z \mid \Lambda)} \\
&= \underbrace{\mathbb{E}_{q(Z \mid \Lambda)} \log p(y \mid Z, x)}_{\text{expected likelihood}} - \underbrace{D_{KL}(q(Z \mid \Lambda) \mid\mid p(Z))}_{\text{KL-divergence}}
\to \max_\Lambda
\end{align*}
\]</span></p>
<p>We can’t backpropagate gradients through the discrete sampling procedure, so we need to overcome this problem somehow. Notice, however, that each unit in a layer <span class="math inline">\(h_{k+1}\)</span> is an affine transformation of the <span class="math inline">\(k\)</span>th layer’s nodes followed by a nonlinear activation function. If the <span class="math inline">\(k\)</span>th layer has sufficiently many neurons, then one might expect the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">Central Limit Theorem</a> to hold at least approximately for the preactivations. Namely, consider a single neuron <span class="math inline">\(s\)</span> that takes an affine combination of the previous layer’s neurons <span class="math inline">\(h_{k-1}\)</span> and applies a nonlinearity: <span class="math inline">\(s = \sigma(w^T h + b)\)</span>. In our case, however, we have a vector of masks <span class="math inline">\(z \sim \text{Bernoulli}(P)\)</span>, so the formula becomes <span class="math inline">\(s = \sigma((w \odot \tfrac{z}{P})^T h + b)\)</span> (<span class="math inline">\(\odot\)</span> stands for elementwise multiplication), and if <span class="math inline">\(K=\text{dim}(z)\)</span> is large enough, then we might expect the preactivations <span class="math inline">\(\sum_{k=1}^K \tfrac{z_k}{p_k} w_k h_k + b\)</span> (we divide each weight by its keep probability <span class="math inline">\(p_k\)</span> to keep the expectation unaffected by the noise) to be approximately distributed as <span class="math inline">\(\mathcal{N}\left(w^T h + b, \sum_{k=1}^K \tfrac{1 - p_k}{p_k} w_k^2 h_k^2 \right)\)</span>.</p>
<p>Now suppose that instead of the Bernoulli multiplicative noise <span class="math inline">\(z\)</span> we actually used multiplicative Gaussian noise <span class="math inline">\(\zeta \sim \mathcal{N}(1, (1-P) / P)\)</span> (elementwise division). It’s easy to check that the preactivations <span class="math inline">\((w \odot \zeta)^T h + b\)</span> would then have the same Gaussian distribution with exactly the same parameters. Therefore, in the expected likelihood term of the objective <span class="math inline">\(\mathcal{L}(\Lambda)\)</span> we can replace the discrete mask distribution with the continuous distribution <span class="math inline">\(q(\zeta \mid \Lambda) = \prod_{i,j,k} \mathcal{N}\left(\zeta^{(k)}_{ij} \mid 1, (1-\lambda^{(k)}_{ij})/\lambda^{(k)}_{ij}\right)\)</span>. However, we can’t simply do the same in the KL divergence term. Instead, we need to use simple priors (like a factorized Bernoulli) so that it can be computed in closed form – then we can take deterministic gradients w.r.t. <span class="math inline">\(\Lambda\)</span>.</p>
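<p>We can sanity-check this equivalence numerically. The sketch below (with made-up weights, activations, and keep probabilities) draws preactivations under both noise schemes and compares their first two moments to the CLT prediction:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
K = 200                                  # layer width; the CLT needs this to be large
w = rng.normal(size=K)                   # fixed weights of a single unit
h = rng.uniform(0.5, 1.5, size=K)        # previous layer's activations
p = np.full(K, 0.8)                      # keep probabilities
b = 0.1
n = 20_000                               # number of Monte Carlo samples

# Bernoulli multiplicative noise, scaled by 1/p to keep the mean intact
z = rng.binomial(1, p, size=(n, K))
pre_bern = ((z / p) * w * h).sum(axis=1) + b

# Equivalent Gaussian multiplicative noise zeta ~ N(1, (1-p)/p)
zeta = rng.normal(1.0, np.sqrt((1 - p) / p), size=(n, K))
pre_gauss = (zeta * w * h).sum(axis=1) + b

# Both should match the CLT prediction N(w^T h + b, sum (1-p)/p * w^2 h^2)
mean_clt = w @ h + b
var_clt = ((1 - p) / p * w**2 * h**2).sum()
print(pre_bern.mean(), pre_gauss.mean(), mean_clt)
print(pre_bern.var(), pre_gauss.var(), var_clt)
```

The means agree exactly in expectation (thanks to the 1/p scaling), and the variances agree by construction of the Gaussian noise.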
<p>This example shows us that for some simple models collective behavior of discrete random variables can be accurately approximated by continuous equivalents. I’d call this approach the <strong>asymptotic reparametrization</strong> <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<h2 id="naiverelaxation">Naive Relaxation</h2>
<p>The previous trick is nice and appealing, but has a very limited scope of applicability. If you have just a few discrete random variables, or have other issues preventing you from relying on the CLT, you’re out of luck.</p>
<p>However, consider a binary discrete random variable <span class="math inline">\(z \sim \text{Bernoulli}(p)\)</span>. How would you sample it? Easy! Just sample a uniform r.v. <span class="math inline">\(u \sim U[0,1]\)</span> and set <span class="math inline">\(z = [u > q]\)</span> where <span class="math inline">\(q = 1 - p\)</span> and the brackets denote an indicator function that is equal to one when the argument is true, and zero otherwise. Equivalently, we can rewrite it as <span class="math inline">\(z = H(u - q)\)</span> where <span class="math inline">\(H(x)\)</span> is a step function: it’s zero for negative <span class="math inline">\(x\)</span> and one for positive <span class="math inline">\(x\)</span> <a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>. Now, this is a nice-looking reparametrization, but <span class="math inline">\(H\)</span> is not differentiable <a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a>, so you can’t backpropagate through it. What if we replace <span class="math inline">\(H\)</span> with some differentiable analogue that has a similar shape? One candidate is a sigmoid with temperature, <span class="math inline">\(\sigma_\tau(x) = \sigma\left(\tfrac{x}{\tau}\right)\)</span>: by varying the temperature you can control the steepness of the function. In the limit <span class="math inline">\(\tau \to 0\)</span> we actually recover the step function: <span class="math inline">\(\lim_{\tau \to 0} \sigma_\tau(x) = H(x)\)</span>.</p>
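<p>A quick NumPy sketch of this relaxation (the values of p and tau below are arbitrary): for tiny tau, thresholding the samples recovers Bernoulli(p) frequencies, while for larger tau all samples fall strictly inside (0, 1):</p>

```python
import numpy as np

def relaxed_bernoulli_naive(p, tau, size, rng):
    """zeta = sigma_tau(u - q) with u ~ U[0,1] and q = 1 - p."""
    u = rng.uniform(size=size)
    x = np.clip((u - (1.0 - p)) / tau, -500, 500)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
p = 0.3

# In the tau -> 0 limit we recover hard Bernoulli(p) samples
hard = relaxed_bernoulli_naive(p, 1e-4, 100_000, rng)
print((hard > 0.5).mean())        # ~ 0.3

# For tau > 0 all samples lie strictly inside (0, 1)
soft = relaxed_bernoulli_naive(p, 0.3, 100_000, rng)
print(soft.min(), soft.max())
```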
<p>So the relaxation we’ll consider is <span class="math inline">\(\zeta = \sigma_\tau(u  q)\)</span>. How can we see if it’s a good one? What do we even want from the relaxation? Well, in the end we’ll be using the discrete version of the model, the one with zeros and ones, so we’d definitely like our relaxation to sample zeros and ones often. Actually, we’d even want them to be the modes of the underlying distribution. Let’s see if that’s the case for the proposed relaxation.</p>
<p>The CDF of the relaxed r.v. <span class="math inline">\(\zeta\)</span> is <span class="math display">\[
\mathbb{P}(\zeta < x) = \mathbb{P}(u < q + \tau \sigma^{-1}(x)) = \min(1, \max(0, q + \tau \sigma^{-1}(x)))
\]</span> And the corresponding PDF <span class="math display">\[
\frac{\partial}{\partial x}\mathbb{P}(\zeta < x)
=
\begin{cases}
\frac{\tau}{x (1-x)}, & \sigma\left(-\frac{q}{\tau}\right) < x < \sigma\left(\frac{1-q}{\tau}\right) \\
0, & \text{otherwise}
\end{cases}
\]</span></p>
<p>Even the formula suggests that the support of the distribution of <span class="math inline">\(\zeta\)</span> is never the whole <span class="math inline">\((0, 1)\)</span>, but only approaches it as the temperature <span class="math inline">\(\tau\)</span> goes to zero. For all nonzero temperatures, though, the support excludes some neighborhood of the endpoints, which might bias the model towards intermediate values. This is why you want a random variable with infinite support <a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a>. If the distribution is skewed, then the resulting relaxation will also be skewed, but it’s not a problem since the probabilities are adjusted according to the CDF.</p>
<p>Let’s plot some densities for different <span class="math inline">\(\tau\)</span> (let <span class="math inline">\(q\)</span> be 0.1).</p>
<div class="postimage">
<p><img src="/files/naiverelaxationdensities.png" style="maxwidth: 90%" /></p>
</div>
<p>But having infinite support is not enough. It’s hard to see from the plots, but if the distribution has very light tails (like the Gaussian), then its effective support is still finite. The authors of the Concrete distribution paper notice this, saying that the sigmoid’s squashing rate is not enough to compensate (even if you tweak the temperature!) for the quickly decaying Gaussian density as you approach either of the infinities.</p>
<p>Let’s also think about the impact of the temperature on the relaxation. Intuitively, one would expect that as we decrease the temperature, the relaxation becomes more accurate and the problem becomes “more discrete”, hence it should be harder to optimize. Indeed, <span class="math inline">\(\tfrac{d}{dx}\sigma_\tau(x) = \tfrac{1}{\tau} \sigma_\tau(x) \sigma_\tau(-x)\)</span> – as you decrease the temperature, both sigmoids become steeper, and the derivative approaches an infinitely tall spike at 0 and zero everywhere else <a href="#fn5" class="footnoteRef" id="fnref5"><sup>5</sup></a>.</p>
<div class="postimage">
<p><img src="/files/naiverelaxationvariancebytau.png" style="maxwidth: 90%" /></p>
</div>
<p>As expected, higher approximation accuracy (obtained by lowering the temperature) comes at a cost of increased variance.</p>
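<p>This trade-off is easy to check by Monte Carlo (a sketch; q = 0.7 and the temperature grid below are arbitrary choices): the variance of the pathwise gradient dζ/dq = −ζ(1−ζ)/τ grows as τ shrinks.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
q = 0.7
u = rng.uniform(size=1_000_000)

variances = []
for tau in (1.0, 0.3, 0.1, 0.03):
    # zeta = sigma((u - q) / tau); clip to avoid overflow in exp
    zeta = 1.0 / (1.0 + np.exp(-np.clip((u - q) / tau, -500, 500)))
    grad = -zeta * (1.0 - zeta) / tau    # pathwise gradient d(zeta)/dq
    variances.append(grad.var())

print(variances)   # variance grows as tau decreases
```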
<h2 id="gumbelsoftmaxrelaxationakaconcretedistribution">GumbelSoftmax Relaxation (aka Concrete Distribution)</h2>
<p>We could consider some other distributions (with larger support, like the Gaussian) instead of the uniform in our relaxation, but let’s try a different approach. Let’s see how we can sample an arbitrary <span class="math inline">\(K\)</span>-valued discrete random variable. It’s a well-known fact (the so-called <a href="https://hips.seas.harvard.edu/blog/2013/04/06/thegumbelmaxtrickfordiscretedistributions/">Gumbel Max Trick</a>) that if <span class="math inline">\(\gamma_k\)</span> are i.i.d. <span class="math inline">\(\text{Gumbel}(0, 1)\)</span> random variables, then <span class="math inline">\(\text{argmax}_k \{\gamma_k + \log p_k\} \sim \text{Categorical}(p_1, \dots, p_K)\)</span>, that is, the probability that the <span class="math inline">\(k\)</span>th perturbed r.v. attains the maximal value is exactly <span class="math inline">\(p_k\)</span> <a href="#fn6" class="footnoteRef" id="fnref6"><sup>6</sup></a>. This gives you a sampling procedure: just sample <span class="math inline">\(K\)</span> independent Gumbels, add the corresponding log probabilities, and take the argmax. However, though mathematically elegant, this formula won’t help us much since argmax is not differentiable. Let’s relax it then! We have already seen that the step function <span class="math inline">\(H(x)\)</span> can be seen as a limit of a sigmoid with temperature, <span class="math inline">\(H(x) = \lim_{\tau \to 0} \sigma_\tau(x)\)</span>, so we might expect (and it is indeed the case) that if <span class="math inline">\(\text{argmax}(x)\)</span> is taken to return a one-hot vector indicating the maximum index, it can be viewed as a zero-temperature version of a softmax with temperature: <span class="math inline">\(\text{argmax}(x)_j = \lim_{\tau \to 0} \text{softmax}_\tau(x)_j\)</span> where</p>
<p><span class="math display">\[
\text{softmax}_\tau(x)_j
= \frac{\exp(x_j / \tau)}{\sum_{k=1}^K \exp(x_k / \tau)}
\]</span></p>
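<p>Both the exact Gumbel-Max trick and its softmax relaxation can be verified numerically (the probability vector below is an arbitrary example):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])
n = 100_000

gumbels = rng.gumbel(size=(n, 3))
logits = gumbels + np.log(p)
samples = np.argmax(logits, axis=1)            # Gumbel-Max trick: exact Categorical(p)
print(np.bincount(samples, minlength=3) / n)   # ~ [0.2, 0.5, 0.3]

def softmax_tau(x, tau):
    x = x / tau
    x = x - x.max(axis=-1, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

relaxed = softmax_tau(logits, tau=0.1)         # relaxation: near one-hot for small tau
print((relaxed.argmax(axis=1) == samples).mean())   # 1.0: softmax preserves the argmax
```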
<p>This formula gives us a continuous relaxation of discrete random variables. Let’s see what it corresponds to in the binary case:</p>
<p><span class="math display">\[
\begin{align*}
\zeta
&= \frac{\exp((\gamma_1 + \log p) / \tau)}{\exp((\gamma_1 + \log p) / \tau) + \exp((\gamma_0 + \log (1-p)) / \tau)} \\
&= \frac{1}{1 + \exp((\gamma_0 + \log (1-p) - \gamma_1 - \log p) / \tau)}\\
&= \sigma_\tau\left(\gamma_1 - \gamma_0 + \log \tfrac{p}{1-p}\right)
\end{align*}
\]</span></p>
<p>Then <span class="math inline">\(\gamma_1 - \gamma_0\)</span> has a <a href="https://en.wikipedia.org/wiki/Logistic_distribution">Logistic</a>(0, 1) distribution <a href="#fn7" class="footnoteRef" id="fnref7"><sup>7</sup></a>. This estimator is a bit more efficient since you can generate one Logistic random variable faster than two independent Gumbel r.v.s. <a href="#fn8" class="footnoteRef" id="fnref8"><sup>8</sup></a></p>
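<p>A sketch of the resulting binary sampler (the values of p and tau are arbitrary): a Logistic(0, 1) variable is obtained from a single uniform via the inverse CDF, and thresholding the relaxed sample at 1/2 recovers an exact Bernoulli(p) sample, since ζ &gt; 1/2 exactly when the sigmoid’s argument is positive, which happens with probability exactly p:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p, tau, n = 0.3, 0.5, 200_000

u = rng.uniform(size=n)
logistic = np.log(u) - np.log1p(-u)    # Logistic(0,1) via inverse CDF sigma^{-1}(u)

# binary Gumbel-Softmax (Concrete) sample: sigma_tau(logistic + log(p / (1-p)))
zeta = 1.0 / (1.0 + np.exp(-(logistic + np.log(p / (1 - p))) / tau))

print((zeta > 0.5).mean())   # ~ 0.3: rounding recovers exact Bernoulli(p) samples
```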
<p>Even though this choice of the Logistic distribution in the binary case seems arbitrary, let’s not forget that it’s a special case of a more general relaxation of an arbitrary categorical r.v. If we chose some other<a href="#fn9" class="footnoteRef" id="fnref9"><sup>9</sup></a> distribution in the binary case, we’d have to construct some cumbersome stick-breaking procedure to generalize it to the multivariate case.</p>
<h2 id="marginalizationviacontinuousnoise">Marginalization via Continuous Noise</h2>
<p>An interesting approach was proposed in the <a href="https://arxiv.org/abs/1609.02200">Discrete Variational Autoencoders paper</a>. The core idea is that you can smooth a binary r.v. <span class="math inline">\(z\)</span> with p.m.f. <span class="math inline">\(p(z)\)</span> by adding extra noise variables <span class="math inline">\(\tau_0\)</span> and <span class="math inline">\(\tau_1\)</span> and treating <span class="math inline">\(z\)</span> as a switch between them. Indeed, consider the smoothed r.v. <span class="math inline">\(\zeta = z \cdot \tau_1 + (1 - z) \tau_0\)</span>. Now if we choose <span class="math inline">\(\tau_0\)</span> and <span class="math inline">\(\tau_1\)</span> such that the CDF of the marginal <span class="math inline">\(\zeta\)</span> can be computed and inverted, we will be able to devise a reparametrization for this scheme.</p>
<p>Consider a particular example of <span class="math inline">\(\tau_0 = 0\)</span> – a constant zero, and <span class="math inline">\(\tau_1\)</span> having some continuous distribution. The marginal CDF of <span class="math inline">\(\zeta\)</span> would then be</p>
<p><span class="math display">\[
\mathbb{P}(\zeta < x) = \mathbb{P}(z = 0) [x > 0] + \mathbb{P}(z=1) \mathbb{P}(\tau_1 < x)
\]</span></p>
<p>Now we can invert this CDF:</p>
<p><span class="math display">\[
\mathbb{Q}_\zeta(\rho) = \begin{cases}
\mathbb{Q}_\tau \left( \frac{\rho}{p(z=1)} \right), & \rho \le p(z=1) \mathbb{P}(\tau < 0) \\
\mathbb{Q}_\tau \left( \frac{\rho  p(z = 0)}{1  p(z = 0)} \right), & \rho \ge p(z = 0) + p(z=1) \mathbb{P}(\tau < 0) \\
0, & \text{otherwise}
\end{cases}
\]</span></p>
<p>Where <span class="math inline">\(\mathbb{Q}_\tau(\rho)\)</span> is the inverse of the CDF of <span class="math inline">\(\tau\)</span>, that is, <span class="math inline">\(\mathbb{P}(\tau < \mathbb{Q}_\tau(\rho)) = \rho\)</span>. This formula clearly suggests that if you can compute and invert the CDF of the smoothing noise <span class="math inline">\(\tau\)</span>, you can do the same with the smoothed variable <span class="math inline">\(\zeta\)</span>, essentially giving us a reparametrization for the smoothed r.v. <span class="math inline">\(\zeta\)</span>, so we can backpropagate as usual. <a href="#fn10" class="footnoteRef" id="fnref10"><sup>10</sup></a></p>
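<p>To make this concrete, here’s a sketch with one particular (hypothetical) choice of noise: τ₀ = 0 and τ₁ distributed as the “truncated exponential” p(τ) ∝ exp(βτ) on [0, 1], whose CDF can be inverted in closed form. Since τ₁ ≥ 0, the ℙ(τ &lt; 0) terms vanish and the inverse CDF takes a simpler two-branch form:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p1, beta, n = 0.6, 5.0, 200_000   # p(z=1), noise sharpness, sample count

def q_tau(rho, beta):
    # inverse CDF of p(tau) ∝ exp(beta * tau) on [0, 1]
    return np.log1p(rho * np.expm1(beta)) / beta

def q_zeta(rho, p1, beta):
    # inverse CDF of the marginal of zeta = z * tau_1 (with tau_0 = 0)
    p0 = 1.0 - p1
    inner = np.clip((rho - p0) / p1, 0.0, 1.0)   # clip avoids NaNs in the dead branch
    return np.where(rho <= p0, 0.0, q_tau(inner, beta))

rho = rng.uniform(size=n)           # rho ~ U[0,1], then zeta = Q_zeta(rho)
zeta = q_zeta(rho, p1, beta)
print((zeta == 0).mean())           # ~ p(z=0) = 0.4
print(zeta[zeta > 0].mean())        # ~ 0.81, the mean of tau_1 for beta = 5
```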
<p>However, an attentive reader could spot a problem here. In the multivariate case we typically have some dependency structure, hence the probabilities <span class="math inline">\(p(z_k = 0)\)</span> and <span class="math inline">\(p(z_k = 1)\)</span> depend on the previous samples <span class="math inline">\(z_{<k}\)</span>, which we can’t backpropagate through, and need to relax in the same way.</p>
<p>Consider, for example, a general stochastic computation graph with a 4-dimensional binary random variable <span class="math inline">\(z\)</span>:</p>
<div class="postimage">
<p><img src="/files/dvae.png" style="width: 400px" /></p>
</div>
<p>When applying this trick, we introduce relaxed continuous random variables <span class="math inline">\(\zeta\)</span> as simple transformations of corresponding binary random variables <span class="math inline">\(z\)</span> (red lines), and make <span class="math inline">\(z\)</span> depend on each other only through relaxed variables (purple lines).</p>
<div class="postimage">
<p><img src="/files/dvaesmoothed.png" style="width: 400px" /></p>
</div>
<p>This trick is somewhat similar to the asymptotic reparametrization – you end up with a model that only has continuous random variables, but is equivalent to the original one with discreteness. However, it requires you to significantly alter the model by re-expressing the dependence in <span class="math inline">\(z\)</span> using the continuous relaxations <span class="math inline">\(\zeta\)</span>. This worked fine for the Discrete VAE application, where you want to learn the dependence structure in <span class="math inline">\(z\)</span>, but if you have a specific one in mind, you might be in trouble.</p>
<p>Also, we don’t want to introduce this noise at test time. So we’d like to fix the discrepancy between train and test somehow. One way to do so is to choose <span class="math inline">\(\tau\)</span> that depends on some parameter and adjust it so that <span class="math inline">\(\tau\)</span>’s distribution becomes closer to <span class="math inline">\(\delta(\tau-1)\)</span>. In the paper the authors use the “truncated exponential” distribution <span class="math inline">\(p(\tau) \propto \exp(\beta \tau) [0 \le \tau \le 1]\)</span> where <span class="math inline">\(\beta\)</span> is a (bounded) learnable parameter. The upper bound grows linearly as training progresses – essentially shrinking the noise towards 1 (in the limit of infinite <span class="math inline">\(\beta\)</span> we have <span class="math inline">\(p(\tau) = \delta(\tau-1)\)</span>).</p>
<h2 id="gradientrelaxations">Gradient Relaxations</h2>
<p>There’s also been some research around the following idea: we don’t have any problems with discrete random variables during the forward pass; it’s the differentiation during the backward pass that brings difficulties. Can we approximate the gradient only? Essentially, the idea is to compute the forward pass as usual, but replace the gradient through random samples with some approximation. To a mathematician (like I pretend to be) this sounds very suspicious – the gradient will no longer correspond to the objective; it’s not even clear which objective it would correspond to. However, these methods have some attractive properties, and are well-known in the area, so I feel I have to cover them as well.</p>
<p>One such method is the <a href="https://arxiv.org/abs/1308.3432"><strong>Straight Through</strong> estimator</a>, which backpropagates the gradient <span class="math inline">\(\nabla_\theta\)</span> through a binary random variable <span class="math inline">\(z \sim \text{Bernoulli}(p(\theta))\)</span> as if there were no stochasticity (and no nonlinearity!) in the first place. So in the forward pass you take the outputs (logits) of the layer preceding the discrete stochastic node, squash them with a sigmoid function, then sample a Bernoulli random variable with that probability, and move on to the next layer, possibly sampling some more stochastic nodes along the way. In the backward pass, though, when it comes to differentiating through the discrete sampling procedure, you just go straight to the gradients of the logits, as if there were no sigmoid and no sampling in the first place: <span class="math display">\[\nabla_\theta \text{Bernoulli}(\sigma(g(\theta))) := \nabla_\theta g(\theta)\]</span></p>
<p>This estimator is clearly computing anything but an estimate of the gradient of your model, but the authors claim that at least it gives you the right direction for the gradient. You can also keep the sigmoid function – that’s what <a href="https://arxiv.org/abs/1406.2989">Raiko et al.</a> do: <span class="math display">\[\nabla_\theta \text{Bernoulli}(\sigma(g(\theta))) := \nabla_\theta \sigma(g(\theta))\]</span></p>
<p>Finally, the authors of one of the <a href="https://arxiv.org/abs/1611.01144">Gumbel-Softmax relaxation</a> papers proposed the Straight Through Gumbel estimator, where you, again, compute the forward pass as usual, but in the backward pass pretend you used the Gumbel-Softmax relaxation and backpropagate through it. Hypothetically, as you decrease the temperature, this relaxation becomes more exact. I guess one can try to convince themselves that for small enough <span class="math inline">\(\tau\)</span> this is a reasonable approximation. <span class="math display">\[\nabla_\theta \text{Bernoulli}(\sigma(g(\theta))) := \nabla_\theta \text{RelaxedBernoulli}(\sigma(g(\theta)))\]</span></p>
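<p>To see the bias concretely, here’s a toy sketch (the downstream objective f and the value of θ are made up) comparing the exact gradient of E f(z) for z ∼ Bernoulli(σ(θ)) with the Raiko et al. variant, which backpropagates as if z were σ(θ):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

theta = 0.4
f  = lambda z: (z - 0.3) ** 2         # hypothetical downstream objective
df = lambda z: 2.0 * (z - 0.3)        # its derivative

p = sigma(theta)
dp = p * (1.0 - p)                    # sigma'(theta)
true_grad = dp * (f(1.0) - f(0.0))    # exact d/dtheta E f(z) = sigma'(theta)(f(1) - f(0))

n = 200_000
z = (rng.uniform(size=n) < p).astype(float)   # forward pass: hard Bernoulli samples
st_grad = (df(z) * dp).mean()                 # backward pass pretends z = sigma(theta)

print(true_grad, st_grad)   # the estimates differ: straight-through is biased
```

For this objective the two gradients have the same sign but different magnitudes, matching the claim that the estimator points in a reasonable direction without being unbiased.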
<p>I personally consider these methods mathematically unsound, and advise you to refrain from using them (unless you know what you’re doing – then tell me what your rationale for this particular choice was).</p>
<h2 id="experiments">Experiments</h2>
<p>To please your eyes with some experimental plots, I used all these relaxations to train a Discrete Variational Autoencoder on MNIST. I used a single stochastic layer (shallow encoder) with 3 layers in between, and evaluated the result using a <a href="/posts/20160714neuralvariationalimportanceweightedautoencoders.html">10,000-sample lower bound</a>, which I assume approximates the marginal log-likelihood relatively well.</p>
<div class="postimage">
<p><img src="/files/dvaeexperiments.png" /></p>
</div>
<p>First, the Relaxed ELBO pane tells us all methods have no problem optimizing their target objectives. However, one should refrain from comparing them by this number: these are different relaxations, and they are not even lower bounds on the marginal log-likelihood of some relaxed model <a href="#fn11" class="footnoteRef" id="fnref11"><sup>11</sup></a>.</p>
<p>Instead, let’s look at the second and third panes. The second shows the Evidence Lower Bound for the original discrete model, and the third shows the gap between the discrete ELBO and the relaxed one. First, the marginal likelihood estimate agrees with the discrete ELBO – that’s a good thing, and it means nothing bad is happening to the KL divergence between the true posterior and the approximate one.</p>
<p>You can see that the green line – the Logistic-distribution-based relaxation with unit temperature – actually diverges. This is a direct consequence of the chosen temperature: for <span class="math inline">\(\tau = 1\)</span> the density of the relaxed random variables is unimodal with its mode somewhere in the interior of the [0, 1] interval. This leads the network to adjust to values around this mode, which poorly represent test-time samples.</p>
<p>As you can see, the normal distribution with temperature 0.4 works very well at first, but then starts diverging. This might be because of the problems of the Gaussian distribution we discussed earlier: namely, it has zero mass at exactly 0 and 1, so the network might adapt to having some small nonzero elements, and be very surprised to see them zeroed out completely at test time.</p>
<p>The asymptotic reparametrization seems to be suffering from the inaccuracy of the CLT approximation. A latent code of 200 units is big, but not big enough for the approximation to be exact. Unfortunately, there’s no hyperparameter to adjust the approximation quality. Moreover, the relaxation gap keeps increasing.</p>
<p>The Noise-Relaxed model performs poorly compared to the other methods. This might be a result of poor hyperparameter management: recall that we introduce continuous noise that’s missing at test time. To make the net adjust to the test-time regime we need to make the noise approach <span class="math inline">\(\delta(\tau - 1)\)</span>. However, if you approach it too fast, you’ll see the relaxation gap decreasing, but learning won’t progress much, as the variance of your gradients will be too high.</p>
<p>Straight-through estimators perform surprisingly well: not as well as the Gumbel-Softmax relaxation, but better than what you’d expect from a mathematically unsound method.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post we talked about continuous relaxations of discrete models. These relaxations allow you to invoke the reparametrization trick to backpropagate the gradients, but this comes at the cost of bias: the gradients are no longer unbiased, as you essentially optimize a different objective. In the next blog post we will return to where we started – the score-function estimator – and try to reduce its variance while keeping zero bias.</p>
<p>By the way, if you’re interested, the code for the DVAE implementations is <a href="https://github.com/artsobolev/dvaes">available on my GitHub</a>. However, I should warn you: it’s still work-in-progress, and lacks any documentation. I’ll add it once I’m finished with the series.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>There’s no established name for this technique. Gaussian dropout was proposed in the original <a href="http://jmlr.org/papers/v15/srivastava14a.html">Dropout paper</a>, but its equivalence under the CLT was not stated formally until the <a href="http://proceedings.mlr.press/v28/wang13a.html">Fast dropout training</a> paper. Nor has anyone applied this trick to, say, Discrete Variational Autoencoders. <strong>UPD</strong>: I just discovered an <a href="https://openreview.net/forum?id=BySRH6CpW&noteId=BySRH6CpW">ICLR 2018 submission</a> using this technique to learn discrete weights.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p><span class="math inline">\(H\)</span> is called the Heaviside function, and you can define its behavior at 0 as you like; it doesn’t matter in most cases, as it’s a zero-measure point.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Unless you use generalized functions from distribution theory (not to be confused with probability distributions!). However, that’s a whole different world, and one should be careful doing derivations there.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>Consider a random variable <span class="math inline">\(U\)</span> with support <span class="math inline">\(\mathbb{R}\)</span>. Then <span class="math inline">\(\mathbb{P}(H(U + c) = 1) = \mathbb{P}(U > -c) = 1 - \Phi(-c)\)</span> where <span class="math inline">\(\Phi\)</span> is the CDF of <span class="math inline">\(U\)</span>. Then if you want this probability to be equal to some value <span class="math inline">\(p\)</span>, you should shift <span class="math inline">\(U\)</span> by <span class="math inline">\(c = -\Phi^{-1}(1-p)\)</span>.<a href="#fnref4">↩</a></p></li>
<li id="fn5"><p>This sounds a lot like Dirac’s delta function, which is a wellknown distributional derivative of the Heavyside function.<a href="#fnref5">↩</a></p></li>
<li id="fn6"><p>An alternative derivation of this fact can be seen through a property of exponentially distributed random variables.<a href="#fnref6">↩</a></p></li>
<li id="fn7"><p>This is clearly a special case of the general one with <span class="math inline">\(U\)</span> being <span class="math inline">\(\text{Logistic}(0, 1)\)</span> and <span class="math inline">\(\Phi\)</span> being its CDF.<a href="#fnref7">↩</a></p></li>
<li id="fn8"><p>Let’s say a word or two on how to sample Gumbels and Logistics. For both of them one can analytically derive and invert the CDF, and hence come up with formulas that transform samples from the uniform distribution. For the <span class="math inline">\(\text{Gumbel}(\mu, \beta)\)</span> distribution this transformation is <span class="math inline">\(u \mapsto \mu - \beta \log \log \tfrac{1}{u}\)</span>, for <span class="math inline">\(\text{Logistic}(\mu, \beta)\)</span> it’s <span class="math inline">\(u \mapsto \mu + \beta \sigma^{-1}(u)\)</span>. Hence if you use the general case, you’d need to generate 2 random variables, whereas in the binary case you can use just one. I guess in the general <span class="math inline">\(K\)</span>-variate case you <em>could theoretically</em> use just <span class="math inline">\(K-1\)</span> random variables, but that’d induce some possibly complicated dependence structure on them, and thus unnecessarily complicate the sampling process.<a href="#fnref8">↩</a></p></li>
<li id="fn9"><p>Consider a random variable <span class="math inline">\(U\)</span> with support <span class="math inline">\(\mathbb{R}\)</span>. Then <span class="math inline">\(\mathbb{P}(H(U + c) = 1) = \mathbb{P}(U > -c) = 1 - \Phi(-c)\)</span> where <span class="math inline">\(\Phi\)</span> is the CDF of <span class="math inline">\(U\)</span>. Then if you want this probability to be equal to some value <span class="math inline">\(p\)</span>, you should shift <span class="math inline">\(U\)</span> by <span class="math inline">\(c = -\Phi^{-1}(1-p)\)</span>.<a href="#fnref9">↩</a></p></li>
<li id="fn10"><p>There’s an alternative way to write the sampling formula <span class="math display">\[
\mathbb{Q}_\zeta(\rho) = \begin{cases}
0, & \rho \le p(z=0) \\
\mathbb{Q}_\tau \left( \frac{\rho - p(z = 0)}{1 - p(z = 0)} \right), & \text{otherwise}
\end{cases}
\]</span> This formula has less branching, and is thus more efficient to compute. <br/> Moreover, in general one can avoid inverting the CDF by noticing that <span class="math inline">\(y = [\rho < \mathbb{P}(z=0)] \mathbb{Q}_{\tau_0}\left(\tfrac{\rho}{\mathbb{P}(z=0)}\right) + [\rho > \mathbb{P}(z=0)] \mathbb{Q}_{\tau_1}\left(\tfrac{1-\rho}{1-\mathbb{P}(z=0)}\right)\)</span> for <span class="math inline">\(\rho \sim U[0,1]\)</span> has exactly the same distribution as the marginal <span class="math inline">\(p(\zeta)\)</span>.<a href="#fnref10">↩</a></p></li>
<li id="fn11"><p>The authors of the Concrete distribution paper also relaxed the KL divergence term, which means they optimized a lower bound on the marginal likelihood of a different model; however, this is reported to lead to better results.<a href="#fnref11">↩</a></p></li>
</ol>
</div>
Sat, 28 Oct 2017 00:00:00 UT
http://artem.sobolev.name/posts/20171028stochasticcomputationgraphsdiscreterelaxations.html
Artem
Stochastic Computation Graphs: Continuous Case
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/eHYnuBerCMs/20170910stochasticcomputationgraphscontinuouscase.html
<p>Last year I covered <a href="/tags/modern%20variational%20inference%20series.html">some modern Variational Inference theory</a>. These methods are often used in conjunction with Deep Neural Networks to form deep generative models (VAE, for example) or to enrich deterministic models with stochastic control, which leads to better exploration. Or you might be interested in amortized inference.</p>
<p>All these cases turn your computation graph into a stochastic one – previously deterministic nodes now become random. And it’s not obvious how to do backpropagation through these nodes. In <a href="/tags/stochastic%20computation%20graphs%20series.html">this series</a> I’d like to outline possible approaches. This time we’re going to see why the general approach works poorly, and what we can do in the continuous case.</p>
<!--more-->
<p>First, let’s state the problem more formally. Consider the approximate inference objective:</p>
<p><span class="math display">\[
\mathbb{E}_{q(z \mid x)} \log \frac{p(x, z)}{q(z \mid x)} \to \max_{q(z \mid x)}
\]</span></p>
<p>or a reinforcement learning objective:</p>
<p><span class="math display">\[
\mathbb{E}_{\pi(a \mid s)} R(a, s) \to \max_{\pi}
\]</span></p>
<p>In what follows I’ll use this notation for the objective:</p>
<p><span class="math display">\[
\mathcal{F}(\theta) = \mathbb{E}_{p(x \mid \theta)} f(x) \to \max_{\theta}
\]</span></p>
<p>In that case the (stochastic) computation graph (SCG) can be represented in the following form <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>:</p>
<div class="postimage">
<p><img src="/files/scgthroughrandomness.png" style="width: 400px" /></p>
</div>
<p>Here <span class="math inline">\(\theta\)</span>, in the double circle, is a set of tunable parameters, the blue rhombus is a stochastic node that takes on random values whose distribution depends on <span class="math inline">\(\theta\)</span> (maybe through some complicated but known function, like a neural network), and the orange circle is the value we’re maximizing. In order to estimate <span class="math inline">\(\mathcal{F}(\theta)\)</span> using such a graph, you just take your <span class="math inline">\(\theta\)</span>s, compute <span class="math inline">\(x\)</span>’s distribution, take as many samples from it as you can get, compute <span class="math inline">\(f(x)\)</span> for each one, and then average them.</p>
<p>How do we maximize it though? The workhorse of optimization in modern deep learning is Stochastic Gradient Descent (or, in our case, Ascent), and to apply it here, all we need is a (preferably unbiased and low-variance) estimate <span class="math inline">\(\nabla_\theta \mathcal{F}(\theta)\)</span> of the gradient of the objective w.r.t. <span class="math inline">\(\theta\)</span>. This is seemingly easy for anyone familiar with basic calculus:</p>
<p><span class="math display">\[
\begin{align*}
\nabla_{\theta} \mathcal{F}(\theta)
& = \nabla_{\theta} \mathbb{E}_{p(x \mid \theta)} f(x)
= \nabla_{\theta} \int p(x \mid \theta) f(x) dx \\
& = \int \nabla_{\theta} p(x \mid \theta) f(x) dx
= \int \nabla_{\theta} \log p(x \mid \theta) f(x) p(x \mid \theta) dx \\
& = \mathbb{E}_{p(x \mid \theta)} \nabla_{\theta} \log p(x \mid \theta) f(x)
\end{align*}
\]</span></p>
<p>There you have it! Just sample some <span class="math inline">\(x \sim p(x \mid \theta)\)</span>, calculate <span class="math inline">\(f(x)\)</span> using this sample, and then multiply the result by the gradient of the log density – here’s your unbiased estimate of the true gradient. However, in practice people have observed that this estimator (called the <strong>score-function estimator</strong>, and also <strong>REINFORCE</strong> in the reinforcement learning literature <a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>) has large variance, making it impractical for high-dimensional <span class="math inline">\(x\)</span>.</p>
<p>And it kinda makes sense. Look at the estimator. It does not use the gradient information of <span class="math inline">\(f\)</span>, so it has no guidance on where to move <span class="math inline">\(p(x \mid \theta)\)</span> to make the expectation <span class="math inline">\(\mathcal{F}(\theta)\)</span> higher. Instead, it tries many random <span class="math inline">\(x\)</span>s, for each sample it takes the direction one should go to make this sample more probable, and weights these directions according to the magnitude of <span class="math inline">\(f(x)\)</span>. When averaged, this gives you the true direction that maximizes the objective, but it’s hard to randomly stumble upon a good <span class="math inline">\(x\)</span> using just a few samples (especially early in training, or in high-dimensional spaces), hence the high variance.</p>
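<p>As a concrete illustration, here’s a minimal <code>numpy</code> sketch of the score-function estimator for a one-dimensional Gaussian. The objective <span class="math inline">\(f\)</span> and all constants are made up for the example; note that the code never differentiates <span class="math inline">\(f\)</span>, only the log density:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Made-up objective; the estimator never differentiates it.
    return x ** 2

def score_function_grad(mu, sigma, n_samples):
    """Score-function (REINFORCE) estimates of d/dmu E_{N(mu,sigma^2)}[f(x)].

    Uses grad_mu log N(x | mu, sigma^2) = (x - mu) / sigma^2, so each
    sample contributes f(x) * (x - mu) / sigma^2.
    """
    x = rng.normal(mu, sigma, size=n_samples)
    return f(x) * (x - mu) / sigma ** 2

grads = score_function_grad(mu=1.0, sigma=0.5, n_samples=500_000)
# Unbiased: the mean approaches the true gradient d/dmu (mu^2 + sigma^2) = 2*mu,
# but individual estimates are spread widely around it.
print(grads.mean(), grads.std())
```

<p>Averaged over many samples the estimate is close to the true gradient <span class="math inline">\(2\mu\)</span>, but the per-sample spread is large – exactly the high variance discussed above.</p>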
<p>This points to a need for either ways to reduce the variance of this estimator, or for different, more efficient approaches. In the following we will consider both.</p>
<h2 id="reparametrizationtrick">Reparametrization trick</h2>
<p>Being perfectly aware of the aforementioned limitation, <a href="https://arxiv.org/abs/1312.6114">Kingma et al.</a> used a smart trick in their Variational Autoencoder paper. Basically, the idea is the following: if some random variables can be decomposed into combinations of other random variables, can we transform our stochastic computation graph so that we don’t need to backpropagate through randomness, and instead have stochasticity injected into the model as independent noise?</p>
<p>Turns out, we can. Namely, any Gaussian random variable <span class="math inline">\(x \sim \mathcal{N}(\mu, \sigma^2)\)</span> can be decomposed into an affine transformation of independent standard Gaussian noise: <span class="math inline">\(x = \mu + \sigma \varepsilon\)</span> <a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a> where <span class="math inline">\(\varepsilon \sim \mathcal{N}(0, 1)\)</span> (we reparametrize the distribution, hence the name of the trick).</p>
<p>The SCG then becomes</p>
<div class="postimage">
<p><img src="/files/scggaussianreparametrization.png" style="width: 400px" /></p>
</div>
<p>Here pink arrows denote the “flow” of backpropagation: notice that we do not encounter any sampling nodes along the way – hence we don’t need to use the high-variance score-function estimator. We can even have many layers of stochastic nodes – after the reparametrization we don’t need to differentiate through random samples, we only mix them in. Let us look at the formulas.</p>
<p><span class="math display">\[
\nabla_\theta \mathbb{E}_{p(x \mid \theta)} f(x)
= \nabla_\theta \mathbb{E}_{p(\varepsilon)} f(\mu(\theta) + \sigma(\theta) \varepsilon)
= \mathbb{E}_{p(\varepsilon)} \nabla_\theta f(\mu(\theta) + \sigma(\theta) \varepsilon)
\]</span></p>
<p>Notice that this time we do use the gradient of <span class="math inline">\(f\)</span>! This is the crucial difference between this estimator and the score-function one: in the latter we were averaging random directions using their “scores”, whereas here we learn an affine transformation of independent noise such that transformed samples lie in an area where <span class="math inline">\(f(x)\)</span> is large. The gradient information of <span class="math inline">\(f\)</span> tells us where to move the samples <span class="math inline">\(x\)</span>, and we do so by adjusting <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span>.</p>
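<p>For comparison with the earlier sketch, here is the reparametrized (pathwise) estimator for the same made-up objective <span class="math inline">\(f(x) = x^2\)</span>; this time the gradient flows through the sample itself, so the code needs <span class="math inline">\(f\)</span>’s derivative:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def f_grad(x):
    # Derivative of the made-up objective f(x) = x^2; the pathwise
    # estimator, unlike the score-function one, needs f's gradient.
    return 2 * x

def reparam_grads(mu, sigma, n_samples):
    """Pathwise estimates of the gradients of E_{N(mu,sigma^2)}[f(x)]
    via x = mu + sigma * eps with eps ~ N(0, 1):
      d/dmu    f(mu + sigma*eps) = f'(x)
      d/dsigma f(mu + sigma*eps) = f'(x) * eps
    """
    eps = rng.standard_normal(n_samples)
    fx = f_grad(mu + sigma * eps)
    return fx, fx * eps

g_mu, g_sigma = reparam_grads(mu=1.0, sigma=0.5, n_samples=500_000)
# True gradients of mu^2 + sigma^2 are 2*mu and 2*sigma; the spread
# around them is much smaller than for the score-function estimator.
print(g_mu.mean(), g_sigma.mean())
```

<p>With the same constants as before, the per-sample variance here is an order of magnitude smaller than the score-function one.</p>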
<p>Okay, so it looks like a great method – why not use it everywhere? The problem is that even though you can always transform a uniformly distributed random variable into any other, it’s not always computationally easy <a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a>. For some distributions (the Dirichlet, for example <a href="#fn5" class="footnoteRef" id="fnref5"><sup>5</sup></a>) we simply don’t know any efficient transformations from parameter-free random variables.</p>
<h2 id="generalizedreparametrizationtrick">Generalized reparametrization trick</h2>
<p>The reparametrization trick can be seen as a transformation <span class="math inline">\(\mathcal{T}(\varepsilon \mid \theta)\)</span> of some independent noise into a desired random variable. Conversely, if <span class="math inline">\(\mathcal{T}\)</span> is invertible, <span class="math inline">\(\mathcal{T}^{-1}(x \mid \theta)\)</span> is a “whitening” / “standardizing” transformation: it takes some random variable that depends on parameters <span class="math inline">\(\theta\)</span> and makes it parameter-independent.</p>
<p>What if we find a transformation that maybe does not whiten <span class="math inline">\(x\)</span> completely, but still significantly reduces its dependence on <span class="math inline">\(\theta\)</span>? This is the core idea of <a href="https://arxiv.org/abs/1610.02287">The Generalized Reparameterization Gradient</a> paper. In that case <span class="math inline">\(\varepsilon\)</span> would still depend on <span class="math inline">\(\theta\)</span>, but hopefully only “weakly”.</p>
<p><span class="math display">\[
\begin{align*}
\nabla_\theta \mathbb{E}_{p(x \mid \theta)} f(x)
&= \nabla_\theta \mathbb{E}_{p(\varepsilon \mid \theta)} f(\mathcal{T}(\varepsilon \mid \theta)) \\
&= \underbrace{\mathbb{E}_{p(\varepsilon \mid \theta)} \nabla_\theta f(\mathcal{T}(\varepsilon \mid \theta))}_{g^\text{rep}}
+ \underbrace{\mathbb{E}_{p(\varepsilon \mid \theta)} \nabla_\theta \log p(\varepsilon \mid \theta) f(\mathcal{T}(\varepsilon \mid \theta))}_{g^\text{corr}}
\end{align*}
\]</span></p>
<p>Here <span class="math inline">\(g^\text{rep}\)</span> is our usual reparametrized gradient, and <span class="math inline">\(g^\text{corr}\)</span> is the score-function part of it. It’s easy to see that varying the transformation <span class="math inline">\(\mathcal{T}\)</span> allows you to interpolate between the fully reparametrized gradients and the fully score-function-based gradients. Indeed, if <span class="math inline">\(\mathcal{T}\)</span> whitens <span class="math inline">\(x\)</span> completely, then <span class="math inline">\(p(\varepsilon \mid \theta)\)</span> is independent of <span class="math inline">\(\theta\)</span> and <span class="math inline">\(\nabla_\theta \log p(\varepsilon \mid \theta) = 0\)</span>, leaving us with <span class="math inline">\(g^\text{rep}\)</span> only. If, however, <span class="math inline">\(\mathcal{T}\)</span> is an identity map, which does not do anything, then <span class="math inline">\(\nabla_\theta f(\mathcal{T}(\varepsilon \mid \theta)) = \nabla_\theta f(\varepsilon) = 0\)</span>, and we recover the score-function estimator.</p>
<p>This formula looks great, but it requires us to know the distribution of <span class="math inline">\(\mathcal{T}^{-1}(x \mid \theta)\)</span> to sample <span class="math inline">\(\varepsilon\)</span> from. It’s more convenient to reformulate the gradient in terms of samples from <span class="math inline">\(p(x \mid \theta)\)</span>, which we can do after some algebraic manipulations:</p>
<p><span class="math display">\[
\begin{align*}
g^\text{rep}
=& \mathbb{E}_{p(x \mid \theta)} \nabla_x f(x) \nabla_\theta \mathcal{T}(\varepsilon \mid \theta)
\\
g^\text{corr}
=& \mathbb{E}_{p(x \mid \theta)} \Bigl[\nabla_\theta \log p(x \mid \theta) + \nabla_x \log p(x \mid \theta) \nabla_\theta \mathcal{T}(\varepsilon \mid \theta) \\& \quad\quad\quad\quad + \nabla_\theta \log \text{det} \nabla_\varepsilon \mathcal{T}(\varepsilon \mid \theta)\Bigr] f(x)
\\
& \text{where } \varepsilon = \mathcal{T}^{-1}(x \mid \theta)
\end{align*}
\]</span></p>
<p>In this formulation we sample <span class="math inline">\(x\)</span> as usual, pass it through the “whitening” transformation <span class="math inline">\(\mathcal{T}^{-1}(x \mid \theta)\)</span> to obtain the sample <span class="math inline">\(\varepsilon\)</span>, and substitute these variables into the gradient constituents. One can also see everything but <span class="math inline">\(f(x) \nabla_\theta \log p(x \mid \theta)\)</span> as a <em>control variate</em> (we’ll talk about these later in the series) that uses <span class="math inline">\(f\)</span>’s gradient information and hence can be expected to be quite powerful.</p>
<p>The last question is which transformation to choose. The authors propose the usual standardizing transformation, i.e. subtracting the mean and dividing by the standard deviation. This choice is motivated by the following: a) it’s computationally convenient – recall that we need both <span class="math inline">\(\mathcal{T}\)</span> and <span class="math inline">\(\mathcal{T}^{-1}\)</span> <a href="#fn6" class="footnoteRef" id="fnref6"><sup>6</sup></a>; b) it makes the first two moments independent of <span class="math inline">\(\theta\)</span>, which in some sense makes the resulting variable only “weakly” dependent on it.</p>
<h3 id="rejectionsamplingperspectivecasmlscitation">Rejection sampling perspective <a href="#fn7" class="footnoteRef" id="fnref7"><sup>7</sup></a></h3>
<p>Another interesting perspective on generalized reparametrization comes from the following thought: there are efficient samplers for many distributions – can we somehow backpropagate through the sampling process? This is what the authors of the <a href="http://proceedings.mlr.press/v54/naesseth17a.html">Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms</a> paper decided to find out.</p>
<p>You want to sample from some distribution <span class="math inline">\(p(x \mid \theta)\)</span>, but can’t compute and invert its CDF – what to do then? You can use the <a href="https://en.wikipedia.org/wiki/Rejection_sampling">rejection sampling</a> procedure. Basically, you take some proposal distribution <span class="math inline">\(r(x \mid \theta)\)</span> that is easy to sample from, and find a scaling factor <span class="math inline">\(M_\theta\)</span> such that the scaled proposal is uniformly higher than the target density for all <span class="math inline">\(x\)</span>: <span class="math inline">\(M_\theta r(x \mid \theta) \ge p(x \mid \theta) \; \forall x\)</span>. Then you generate points randomly under the scaled <span class="math inline">\(M_\theta r(x \mid \theta)\)</span> curve, and keep only those that are also below the <span class="math inline">\(p(x \mid \theta)\)</span> curve:</p>
<ol style="liststyletype: decimal">
<li>Generate <span class="math inline">\(x \sim r(x \mid \theta)\)</span>.</li>
<li>Generate <span class="math inline">\(u \sim U[0, M_\theta r(x \mid \theta)]\)</span>.</li>
<li>If <span class="math inline">\(u > p(x \mid \theta)\)</span>, repeat from step 1; else return <span class="math inline">\(x\)</span>.</li>
</ol>
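<p>The three steps above translate directly into code. Below is a small sketch with a made-up target: sampling the half-normal distribution (density <span class="math inline">\(2\varphi(x)\)</span> for <span class="math inline">\(x \ge 0\)</span>) using a standard normal proposal, for which <span class="math inline">\(M = 2\)</span> makes the scaled proposal dominate the target everywhere:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
SQRT_2PI = np.sqrt(2 * np.pi)

def normal_pdf(x):
    return np.exp(-x ** 2 / 2) / SQRT_2PI

def half_normal_pdf(x):
    # Target density: standard normal truncated to [0, inf), renormalized.
    return np.where(x >= 0, 2 * normal_pdf(x), 0.0)

def rejection_sample(n, m=2.0):
    """Draw n samples from the half-normal via rejection sampling.

    The proposal r(x) is the standard normal; m * r(x) >= target(x)
    for all x, with equality on the positive half-line.
    """
    samples = []
    while len(samples) < n:
        x = rng.standard_normal()                # 1. sample the proposal
        u = rng.uniform(0.0, m * normal_pdf(x))  # 2. uniform under the scaled proposal
        if u < half_normal_pdf(x):               # 3. keep points under the target curve
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(20_000)
# The half-normal mean is sqrt(2/pi); about half of the proposals get rejected.
print(xs.mean())
```

<p>The acceptance rate is <span class="math inline">\(1/M\)</span>, so a tighter proposal wastes fewer samples.</p>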
<p>Moreover, at step 1 we can use some transformation <span class="math inline">\(\mathcal{T}(\varepsilon \mid \theta)\)</span> of a sample <span class="math inline">\(\varepsilon \sim r(\varepsilon)\)</span> (provided the scaled density of the transformed variable is uniformly higher). This is how <code>numpy</code> generates Gamma variables: it samples <span class="math inline">\(\varepsilon\)</span> from a standard Gaussian, transforms the sample through some function <span class="math inline">\(x = \mathcal{T}(\varepsilon \mid \theta)\)</span>, and then accepts it with probability <span class="math inline">\(a(x \mid \theta)\)</span> <a href="#fn8" class="footnoteRef" id="fnref8"><sup>8</sup></a>.</p>
<p>Let’s find the density of the <span class="math inline">\(\varepsilon\)</span>s that lead to acceptance of the corresponding <span class="math inline">\(x\)</span>s. Some calculations (provided in the supplementary material) show that</p>
<p><span class="math display">\[
p(\varepsilon \mid \theta) = M_\theta r(\varepsilon) a(\mathcal{T}(\varepsilon \mid \theta) \mid \theta)
\]</span></p>
<p>Note that this density is easy to calculate, and if we reparametrize the generated samples <span class="math inline">\(\varepsilon\)</span> as <span class="math inline">\(x = \mathcal{T}(\varepsilon \mid \theta)\)</span>, we get exactly the samples <span class="math inline">\(x\)</span> we’re looking for. Hence the objective becomes</p>
<p><span class="math display">\[
\mathcal{F}(\theta) = \mathbb{E}_{p(\varepsilon \mid \theta)} f(\mathcal{T}(\varepsilon \mid \theta))
\]</span></p>
<p>Differentiating it w.r.t. <span class="math inline">\(\theta\)</span> gives <span class="math display">\[
\nabla_\theta \mathcal{F}(\theta)
= \mathbb{E}_{p(\varepsilon \mid \theta)} \nabla_\theta f(\mathcal{T}(\varepsilon \mid \theta))
+ \mathbb{E}_{p(\varepsilon \mid \theta)} f(\mathcal{T}(\varepsilon \mid \theta)) \nabla_\theta \log p(\varepsilon \mid \theta)
\]</span></p>
<p>Now compare these addends to the <span class="math inline">\(g^\text{rep}\)</span> and <span class="math inline">\(g^\text{corr}\)</span> from the previous section. You can see that they’re <em>exactly</em> the same!</p>
<p>In the previous section we chose the transformation <span class="math inline">\(\mathcal{T}^{-1}\)</span> so that it removes at least some of the dependency on <span class="math inline">\(\theta\)</span> from the samples <span class="math inline">\(x\)</span>. This section lets us view the same method from the other end: if you have some independent noise <span class="math inline">\(\varepsilon\)</span> and a transformation <span class="math inline">\(\mathcal{T}\)</span> that makes the samples look like samples from the target density <span class="math inline">\(p(x \mid \theta)\)</span>, then you can add some rejection sampling on top to compensate for the mismatch, and still enjoy the lower variance of the gradient estimate.</p>
<h2 id="averysimpleexample">A (very) simple example</h2>
<p>Let’s see how much variance reduction the reparametrization trick actually gets us in a very simple problem. Namely, let’s try to minimize the expected square of a Gaussian random variable <a href="#fn9" class="footnoteRef" id="fnref9"><sup>9</sup></a> (shifted by some positive constant <span class="math inline">\(c\)</span>; we will see later how it comes into play):</p>
<p><span class="math display">\[
\mathcal{F}(\mu, \sigma) = \mathbb{E}_{x \sim \mathcal{N}(\mu, \sigma^2)} [x^2 + c] \to \min_{\mu, \sigma}
\]</span></p>
<p>First, reparametrized objective is</p>
<p><span class="math display">\[
\mathcal{F}^\text{rep}(\mu, \sigma) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, 1)} (\mu + \sigma \varepsilon)^2
\]</span></p>
<p>And its stochastic gradients are <span class="math display">\[
\hat \nabla_\mu \mathcal{F}^\text{rep}(\mu, \sigma) = 2 (\mu + \sigma \varepsilon) \\
\hat \nabla_\sigma \mathcal{F}^\text{rep}(\mu, \sigma) = 2 \varepsilon (\mu + \sigma \varepsilon)
\]</span></p>
<p>The scorefunctionbased gradients are the following:</p>
<p><span class="math display">\[
\hat \nabla_\mu \mathcal{F}^\text{SF}(\mu, \sigma) = \frac{\varepsilon}{\sigma} \left((\mu + \sigma \varepsilon)^2 + c\right) \\
\hat \nabla_\sigma \mathcal{F}^\text{SF}(\mu, \sigma) = \frac{\varepsilon^2 - 1}{\sigma} \left((\mu + \sigma \varepsilon)^2 + c\right)
\]</span></p>
<p>Both estimators are unbiased, but what are the variances of these estimators? WolframAlpha suggests</p>
<p><span class="math display">\[
\begin{align*}
\mathbb{D}\left[\hat \nabla_\mu \mathcal{F}^\text{SF}(\mu, \sigma)\right] &= \frac{(\mu^2 + c)^2}{\sigma^2} + 15 \sigma^2 + 14 \mu^2 + 6 c,
\\
\mathbb{D}\left[\hat \nabla_\mu \mathcal{F}^\text{rep}(\mu, \sigma)\right] &= 4 \sigma^2
\\
\mathbb{D}\left[\hat \nabla_\sigma \mathcal{F}^\text{SF}(\mu, \sigma)\right] &= \frac{2 (c + \mu^2)^2}{\sigma^{2}} + 60 \mu^{2} + 74 \sigma^{2} + 20 c,
\\
\mathbb{D}\left[\hat \nabla_\sigma \mathcal{F}^\text{rep}(\mu, \sigma)\right] &= 4 \mu^2 + 8 \sigma^2
\end{align*}
\]</span></p>
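<p>These expressions are easy to sanity-check numerically. A quick Monte-Carlo sketch (the constants are arbitrary) comparing the empirical variances of the two <span class="math inline">\(\mu\)</span>-gradient estimators against the closed forms above:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, c = 0.5, 0.3, 1.0
eps = rng.standard_normal(1_000_000)

# Per-sample gradient estimates w.r.t. mu for both estimators.
g_sf = eps / sigma * ((mu + sigma * eps) ** 2 + c)   # score-function
g_rep = 2 * (mu + sigma * eps)                       # reparametrized

# Closed-form variances from the formulas above.
var_sf = (mu**2 + c)**2 / sigma**2 + 15 * sigma**2 + 14 * mu**2 + 6 * c
var_rep = 4 * sigma**2

print(g_sf.var(), var_sf)     # empirical vs closed form
print(g_rep.var(), var_rep)
```

<p>With these constants the score-function variance is almost two orders of magnitude larger, and it grows further as <span class="math inline">\(\sigma\)</span> shrinks.</p>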
<p>You can see that not only does the score-function-based gradient always have higher variance, its variance actually explodes as we approach <span class="math inline">\(\mu = 0, \sigma = 0\)</span> (unless <span class="math inline">\(c = 0\)</span> and <span class="math inline">\(\mu\)</span> is small enough to counter <span class="math inline">\(\sigma\)</span>)! This is due to the fact that as the variance shrinks, points even somewhat far from the mean get very tiny probabilities, so the score-function-based gradient thinks it should try very hard to make them more probable.</p>
<p>You might be wondering: how would the generalized reparametrization work here? If we consider the transformation <span class="math inline">\(\mathcal{T}^{-1}(x \mid \mu,\sigma) = x - \mu\)</span> (it “whitens” the first moment only), then we obtain the following gradient estimates:</p>
<p><span class="math display">\[
\hat \nabla_\mu \mathcal{F}^\text{grep}(\mu, \sigma) = 2 (\mu + \varepsilon) \\
\hat \nabla_\sigma \mathcal{F}^\text{grep}(\mu, \sigma) = \frac{\varepsilon^2 - \sigma^2}{\sigma^3} (\mu + \varepsilon)^2
\]</span></p>
<p>This is the reparametrized gradient w.r.t. <span class="math inline">\(\mu\)</span> and the score-function gradient w.r.t. <span class="math inline">\(\sigma\)</span> (notice that <span class="math inline">\(\varepsilon \sim \mathcal{N}(0, \sigma^2)\)</span> in this case). I don’t think this is an interesting scenario, so instead we’ll consider a weird-looking second-moment-whitening transformation <span class="math inline">\(\mathcal{T}^{-1}(x \mid \mu,\sigma) = \frac{x - \mu}{\sigma} + \mu\)</span> with <span class="math inline">\(\mathcal{T}(\varepsilon \mid \mu,\sigma) = \sigma (\varepsilon - \mu) + \mu\)</span>. The gradients for this transformation are:</p>
<p><span class="math display">\[
\begin{align*}
\hat \nabla_\mu \mathcal{F}^\text{grep}(\mu, \sigma) &=
\left(c + \left(\mu + \sigma \left(\varepsilon - \mu\right)\right)^{2}\right) \left(\varepsilon - \mu\right) - 2 \left(\mu + \sigma \left(\varepsilon - \mu\right)\right) \left(\sigma - 1\right)
\\
\hat \nabla_\sigma \mathcal{F}^\text{grep}(\mu, \sigma) &=
2 \left(\varepsilon - \mu\right) \left(\mu + \sigma \left(\varepsilon - \mu\right)\right)
\end{align*}
\]</span></p>
<p>You can already see that the magnitude of the gradients does not explode as <span class="math inline">\(\sigma\)</span> goes to zero. Let’s check the variances:</p>
<p><span class="math display">\[
\begin{align*}
\mathbb{D}\left[\hat \nabla_\mu \mathcal{F}^\text{grep}(\mu, \sigma)\right] &=
(\mu^2 + c)^{2} + 2 c \sigma^{2} + 4 c \sigma + 10 \mu^{2} \sigma^{2} + 4 \mu^{2} \sigma + 7 \sigma^{4} + 4 \sigma^{3} + 4 \sigma^{2}
\\
\mathbb{D}\left[\hat \nabla_\sigma \mathcal{F}^\text{grep}(\mu, \sigma)\right] &=
4 \mu^{2} + 8 \sigma^{2}
\end{align*}
\]</span></p>
<p>First, we see that the variance of gradient w.r.t. <span class="math inline">\(\sigma\)</span> has become identical to the variance of the reparametrized case. Second, we can confirm that the variance does not explode as we approach the optimum.</p>
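<p>Both observations can be checked with a short simulation (arbitrary constants). For this transformation the noise <span class="math inline">\(\varepsilon = \mathcal{T}^{-1}(x)\)</span> is distributed as <span class="math inline">\(\mathcal{N}(\mu, 1)\)</span>, so we can sample it directly:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, c = 0.5, 0.3, 1.0

# For T^{-1}(x) = (x - mu)/sigma + mu, the noise eps is distributed as N(mu, 1).
eps = mu + rng.standard_normal(1_000_000)
x = sigma * (eps - mu) + mu          # T(eps), distributed as N(mu, sigma^2)

# Generalized-reparametrization gradient samples for this transformation.
g_mu = (c + x ** 2) * (eps - mu) - 2 * x * (sigma - 1)
g_sigma = 2 * (eps - mu) * x

# True gradients of E[x^2 + c] = mu^2 + sigma^2 + c are 2*mu and 2*sigma.
print(g_mu.mean(), g_sigma.mean())
# Variance of g_sigma matches the fully reparametrized 4*mu^2 + 8*sigma^2.
print(g_sigma.var(), 4 * mu**2 + 8 * sigma**2)
```

<p>Both estimators come out unbiased, and the empirical variance of the <span class="math inline">\(\sigma\)</span>-gradient agrees with the fully reparametrized one, as claimed.</p>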
<div class="postimage">
<p><img src="/files/scgexample.png" /> Gen Rep 1 is a generalized reparametrization with only 1st moment whitened out,<br/> Gen Rep 2 – with only the second one</p>
</div>
<p>The simulation plots clearly show that the score-function-based gradients and the first generalized reparametrization fail to converge, which is in line with our variance analysis. The second generalized reparametrization, however, performs just as well as the full reparametrization, even though it does have higher variance.</p>
<p>All the code I wrote while working on this post can be found <a href="https://gist.github.com/artsobolev/fec7c052d712889ef69656825634c4d4">here</a>. Though it’s quite messy, I warned you.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We have discussed tricks that make Stochastic Variational Inference with continuous latent variables computationally feasible. However, quite often we’re interested in models with discrete latent variables – for example, a model that dynamically chooses one computation path or another, essentially controlling how much computation time to spend on a given sample. Or a GAN for textual data, where we need a way to backpropagate through the discriminator’s inputs.</p>
<p>We’ll talk about such methods in the next post.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>In this post I’ll only consider models with only one stochastic “layer”, but roughly the same math applies in more general cases.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Sometimes people also call this the <strong>log-derivative trick</strong>; however, in my opinion, the log-derivative trick refers to a derivation technique, namely the fact that <span class="math inline">\(\nabla_\theta p(x\mid\theta) = p(x\mid\theta) \nabla_\theta \log p(x\mid\theta)\)</span>, and it’s a bit incorrect to call the estimator this way.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Equality here means both sides have the same distribution.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>We know that for <span class="math inline">\(X \sim p(x)\)</span> with CDF <span class="math inline">\(F(x)\)</span> we have <span class="math inline">\(F(X) \sim U[0, 1]\)</span>, hence <span class="math inline">\(X = F^{-1}(u)\)</span> for standard uniform <span class="math inline">\(u \sim U[0, 1]\)</span>, so there always exists a (smooth, if <span class="math inline">\(x\)</span> is continuous) transformation from standard uniform noise to any other distribution. However, computing the CDF often requires expensive integration, which is frequently infeasible.<a href="#fnref4">↩</a></p></li>
<li id="fn5"><p>The original VAE paper lists the Dirichlet distribution among the ones that have effective reparametrizations; however, that’s actually not the case, as you still need to generate parametrized Gamma variables.<a href="#fnref5">↩</a></p></li>
<li id="fn6"><p>Technically, you could derive the density <span class="math inline">\(p(\varepsilon \mid \theta)\)</span> and sample from it – that way you wouldn’t need the inverse <span class="math inline">\(\mathcal{T}^{-1}\)</span>. However, it’s not easy in general.<a href="#fnref6">↩</a></p></li>
<li id="fn7"><p>This section is largely based on the <a href="https://casmls.github.io/general/2017/04/25/rsvi.html">Reparameterization Gradients through Rejection Sampling Algorithms</a> blogpost.<a href="#fnref7">↩</a></p></li>
<li id="fn8"><p>Normally that’d be just <span class="math inline">\(a(x \mid \theta) = \tfrac{p(x \mid \theta)}{M_\theta r(x \mid \theta)}\)</span>; however, if we don’t have <span class="math inline">\(r(x \mid \theta)\)</span> readily available, we can express the acceptance probability in terms of <span class="math inline">\(\varepsilon\)</span>: <span class="math display">\[a(\varepsilon \mid \theta) = \tfrac{p(\mathcal{T}(\varepsilon \mid \theta) \mid \theta) \text{det} \nabla_\varepsilon \mathcal{T}(\varepsilon \mid \theta)}{M_\theta r(\varepsilon)}\]</span><a href="#fnref8">↩</a></p></li>
<li id="fn9"><p>One might argue that our approach is flawed, as the optimal distribution is <span class="math inline">\(\mathcal{N}(0, 0)\)</span> which is not a valid distribution. However, here we’re just interested in the gradient dynamics as we approach this optimum.<a href="#fnref9">↩</a></p></li>
</ol>
</div>Sun, 10 Sep 2017 00:00:00 UThttp://artem.sobolev.name/posts/20170910stochasticcomputationgraphscontinuouscase.htmlArtemhttp://artem.sobolev.name/posts/20170910stochasticcomputationgraphscontinuouscase.htmlICML 2017 Summaries
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/Da31J1HffXE/20170814icml2017.html
<p>Just like with <a href="/posts/20161231nips2016summaries.html">NIPS last year</a>, here’s a list of ICML’17 summaries (updated as I stumble upon new ones)</p>
<!--more-->
<ul>
<li><a href="https://olgalitech.wordpress.com/tag/icml2017/">Random ML&Datascience musing</a> by <a href="https://twitter.com/OlgaLiakhovich">Olga Liakhovich</a>
<ul>
<li><a href="https://olgalitech.wordpress.com/2017/08/07/icmlandmynotesonday1/">ICML and my notes on day 1</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/07/brainenduranceorday2aticml2017/">Brain endurance or Day 2 at ICML 2017</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/11/day3aticml2017musicalrnns/">Day 3 at ICML 2017 — musical RNNs</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/11/day4aticml2017moreadversarialnns/">Day 4 at ICML 2017 — more Adversarial NNs</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/11/day56aticmlalldone/">Day 5 & 6 at ICML. All done.</a></li>
</ul></li>
<li><a href="https://keunwoochoi.wordpress.com/2017/08/14/machinelearningformusicdiscoveryworkshopicml2017sydney/">Machine learning for music discovery (workshop)</a> by <a href="https://twitter.com/keunwoochoi">Keunwoo Choi</a></li>
<li><a href="https://gmarti.gitlab.io/ml/2017/08/11/ICML2017fieldreports.html">Field reports from ICML 2017 in Sydney</a> by <a href="https://twitter.com/GautierMarti1">Gautier Marti’s Wander</a></li>
<li><a href="http://www.machinedlearnings.com/2017/08/icml2017thoughts.html">ICML 2017 Thoughts</a> by <a href="https://twitter.com/PaulMineiro">Paul Mineiro</a></li>
<li><a href="http://mattdickenson.com/2017/08/17/icml2017recap/">ICML 2017 Recap</a> by <a href="https://twitter.com/mcdickenson">Matt Dickenson</a></li>
<li><a href="https://www.bulletproof.net.au/internationalconferencemachinelearning2017partone/">International Conference for Machine Learning 2017 – Part One</a> + <a href="https://www.bulletproof.net.au/internationalconferencemachinelearning2017parttwo/">International Conference for Machine Learning 2017 – Part Two</a></li>
</ul>Mon, 14 Aug 2017 00:00:00 UThttp://artem.sobolev.name/posts/20170814icml2017.htmlArtemhttp://artem.sobolev.name/posts/20170814icml2017.htmlOn No Free Lunch Theorem and some other impossibility results
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/32qnGEkYoTM/20170723nofreelunchtheorem.html
<p>The more I talk to people online, the more I hear about the famous No Free Lunch Theorem (NFL theorem). Unfortunately, quite often people don’t really understand what the theorem is about, and what its implications are. In this post I’d like to share my view on the NFL theorem, and some other impossibility results.</p>
<!--more-->
<h3 id="nofreelunchtheoremrevisited">No Free Lunch Theorem Revisited</h3>
<p>First, let’s formally state the NFL theorem. I’ll take theorem statement from the (freely available!) book <a href="http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html"><em>Understanding Machine Learning: From Theory to Algorithms</em></a> by Shai ShalevShwartz and Shai BenDavid.</p>
<p>In a nutshell, the theorem says that whatever learning algorithm you pick, there will always be a problem (= dataset + some metric) that your particular algorithm is incapable of solving, even though in principle the problem could be solved (by some other algorithm, which would have its own kryptonite problem). More formally (I modified the statement to simplify the notation):</p>
<blockquote>
Let <span class="math inline">\(A\)</span> be any learning algorithm for the task of binary classification with respect to the 0−1 loss over a domain <span class="math inline">\(\mathcal{X}\)</span>. Let <span class="math inline">\(m\)</span> be any number smaller than <span class="math inline">\(|\mathcal{X}|/2\)</span>, representing a training set size. Then, there exists a distribution <span class="math inline">\(D\)</span> over <span class="math inline">\(\mathcal{X} \times \{0, 1\}\)</span> such that:
<ol>
<li>
There exists a function <span class="math inline">\(f : \mathcal{X} \mapsto \{0, 1\}\)</span> with <span class="math inline">\(\mathbb{P}(f(x) \not= y \mid (x, y) \sim D) = 0\)</span>.
</li>
<li>
With probability of at least 1/7 over the choice of <span class="math inline">\(S \sim D^m\)</span> we have that <span class="math inline">\(\mathbb{P}(A_S(x) \not= y \mid (x, y) \sim D) \ge 1/8\)</span>
</li>
</ol>
</blockquote>
<p>The idea of the proof is that if you have a fixed training set and some nontrivial number of unseen examples, one can vary the labels of these unseen examples arbitrarily. So, if your algorithm classifies some example correctly, there exists a similar problem whose only difference is a different ground-truth label for this example. Essentially, for the same training set you can construct completely different test sets.</p>
<h3 id="soundsprettyfrustratingisntit">Sounds pretty frustrating, isn’t it?</h3>
<p>The result suggests the impossibility of a universal learning machine that’d be able to take any training set and make the best possible predictions for unseen data. And this <em>is</em> impossible! Another reformulation of the same theorem says that every classification algorithm has expected accuracy 1/2 when averaged over all possible problems. However, the practical implications of the theorem are not so far-reaching.</p>
<p>The theorem essentially says that every problem has an evil doppelgänger that’d break the precious model you trained for so long. However, how likely are you to run into this doppelgänger? How likely is it to run into a problem where the test set differs from the training set so much? And how can our human brains work so well<a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>? Let me expand on the latter thought.</p>
<p>I believe our brains are not magical: they are just another kind of (biological) learning machine, obeying the same mathematical principles, powerful enough to solve the various problems we face every day. Yes, we can’t solve all the problems in the world, but why would we care? In Machine Learning, as a subfield of Artificial Intelligence, we seek to solve problems of <em>practical importance</em>, and in the first place to automate what people already can do. Thus, we have a proof that there exists an algorithm that works reasonably well. It’s right here, in your brain.</p>
<p>So how come we’re able to navigate such a complex world, communicate in such complicated languages, and discover laws of nature through science by thinking hard enough, if for every problem <em>we</em> successfully solve, mathematics has an evil copy? The answer seems to be that these evil copies are very rare. And I believe there’s a reason for that.</p>
<p>Let’s get back to the theorem. Recall that it is essentially based on the fact that for a fixed training set you can vary the test set as you wish. How complicated (for some intuitive notion of complexity) does that make the distribution <span class="math inline">\(p(y \mid x)\)</span> that makes perfect predictions for a given <span class="math inline">\(x\)</span>? Well, if it had one regularity pattern in the training set, and then suddenly changed this pattern in the test set to something completely different, that’d make the target distribution <span class="math inline">\(p(y \mid x)\)</span> more complicated. So, even if every good problem (i.e. one we, humans, can solve) has an evil twin, the twin’s complexity should be higher due to its much more complex regularity pattern.</p>
<p>Thus, I’m sure more complicated problems and objects are less likely in the Universe. Otherwise, we’d not be able to have such a complicated life with our particular instance of a learning machine, implemented in our brains. The NFL theorem states there are hard problems out there, but doesn’t say anything about how common they are, implicitly assuming a uniform distribution, which seems to disagree with our observations.</p>
<h3 id="otherimpossibilityresults">Other impossibility results</h3>
<p>Another similar result is the <a href="https://en.wikipedia.org/wiki/Halting_problem">halting problem</a>, which states that no algorithm can determine, for an arbitrary given program, whether it halts. However, this does not mean that every program’s halting is undecidable. For example, for <a href="https://en.wikipedia.org/wiki/Linear_bounded_automaton">linear bounded automata</a> one actually can decide if a program for such an automaton halts (though that might require an astronomical amount of memory). The result only states there’s no universal decider; every particular class should be inspected separately.</p>
<p>To recap, the idea of this post is that even though theory seemingly limits our capabilities, we should not get discouraged by these results, as they are far more general than what we need in practice. Quite often we can still solve real problems, because the general case includes some really weird functions that reality does not.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Well, we don’t have other baselines, so it’s hard to tell if they indeed work well. But still, human brains are the best learning machines known to humanity.<a href="#fnref1">↩</a></p></li>
</ol>
</div>Sun, 23 Jul 2017 00:00:00 UThttp://artem.sobolev.name/posts/20170723nofreelunchtheorem.htmlArtemhttp://artem.sobolev.name/posts/20170723nofreelunchtheorem.htmlMatrix and Vector Calculus via Differentials
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/5m_RiBN2DWU/20170129matrixandvectorcalculusviadifferentials.html
<p>Many tasks of machine learning can be posed as optimization problems. One comes up with a parametric model, defines a loss function, and then minimizes it in order to learn optimal parameters. One very powerful tool of optimization theory is the use of smooth (differentiable) functions: those that can be locally approximated with a linear function. We all surely know how to differentiate a function, but often it’s more convenient to perform all the derivations in matrix form, since many computational packages like numpy or matlab are optimized for vectorized expressions.</p>
<p>In this post I want to outline the general idea of how one can calculate derivatives in vector and matrix spaces (but the idea is general enough to be applied to other algebraic structures).</p>
<!more>
<h3>
The Gradient
</h3>
<p>What is the gradient? Recall that a smooth function (for now we’ll be considering scalar functions only) <span class="math inline">\(f : \mathcal{X} \to \mathbb{R}\)</span> is one which is approximately linear within some neighborhood of a given point. That means <span class="math inline">\(f(x + dx) - f(x) = \langle g(x), dx \rangle\)</span> (think of <span class="math inline">\(dx\)</span> as of a very small perturbation of <span class="math inline">\(x\)</span>) where <span class="math inline">\(\langle \cdot, \cdot \rangle\)</span> denotes the dot product in the space <span class="math inline">\(\mathcal{X}\)</span>, and <span class="math inline">\(g(x)\)</span> is called the <strong>gradient</strong> of <span class="math inline">\(f(x)\)</span> at the point <span class="math inline">\(x\)</span> (we’ll be using the nabla notation from now on: <span class="math inline">\(g(x) = \nabla f(x)\)</span>).</p>
<p>For example, for functions of one variable (<span class="math inline">\(\mathcal{X} = \mathbb{R}\)</span>) we have <span class="math inline">\(\langle a, b \rangle = a b\)</span>, for functions of several variables (<span class="math inline">\(\mathcal{X} = \mathbb{R}^n\)</span>) it’s the usual dot product <span class="math inline">\(\langle a, b \rangle = a^T b\)</span>, and for functions of matrices (<span class="math inline">\(\mathcal{X} = \mathbb{R}^{n \times m}\)</span>) it generalizes vector dot product: <span class="math inline">\(\langle A, B \rangle = \text{Tr}(A^T B)\)</span>.</p>
<p>Now let’s introduce the notion of the <strong>differential</strong> <span class="math inline">\(df(x)\)</span>: the perturbation of the function <span class="math inline">\(f\)</span> if we perturb <span class="math inline">\(x\)</span> by <span class="math inline">\(dx\)</span>, which we assume to be infinitesimally small. The gradient only affects first-order behavior of <span class="math inline">\(df(x)\)</span>, that is, as <span class="math inline">\(dx\)</span> goes to zero, thus if one expands <span class="math inline">\(df(x)\)</span> in terms of <span class="math inline">\(dx\)</span>, it’s enough to write down only the linear term to find the differential. For example, for smooth scalar-valued functions we have <span class="math inline">\(df(x) = \langle \nabla f(x), dx \rangle\)</span>, so the gradient <span class="math inline">\(\nabla f(x)\)</span> defines a linear coefficient for the differential <span class="math inline">\(df(x)\)</span> <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
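<p>As a quick numerical illustration (my own toy example, not from the post), we can check the defining property of the gradient with a simple function whose gradient we know in closed form:</p>

```python
import numpy as np

# For f(x) = x^T x the gradient is 2x; f(x + dx) - f(x) should match
# <grad f(x), dx> up to terms of order ||dx||^2.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
dx = 1e-6 * rng.normal(size=5)

f = lambda v: v @ v
lhs = f(x + dx) - f(x)
rhs = 2 * x @ dx  # <grad f(x), dx> with grad f(x) = 2x
print(abs(lhs - rhs))  # of order ||dx||^2
```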
<h3>
Calculus
</h3>
<p>Okay, how can we derive the gradient of a function? One way is to take a derivative with respect to each scalar input variable, and then compose a vector out of them. This approach is quite inefficient and messy: you might have to recompute the same vector expressions for each input variable, or compute lots of sums (which does not leverage the benefits of vector operations), or try to compose vector operations out of them. Instead we’ll develop a formal method that will allow us to derive gradients for many functions (just like the differentiation rules you learned in the introduction to calculus) without leaving the realm of vector algebra.</p>
<p>Recall that the differential <span class="math inline">\(df(x)\)</span> of a scalar-valued function <span class="math inline">\(f\)</span> is a linear function of <span class="math inline">\(dx\)</span> determined by the gradient. That means that if we could write down a differential <span class="math inline">\(df(x)\)</span> and then simplify it to <span class="math inline">\(g(x)^T dx + O(\|dx\|^2)\)</span> (we ignore higher-order terms, as they are not linear in <span class="math inline">\(dx\)</span>, and go to zero faster than <span class="math inline">\(dx\)</span> does), we’ll recover the gradient <span class="math inline">\(\nabla f(x) = g(x)\)</span>. This is exactly what we’re going to do: develop a set of formal rules that will allow us to compute differentials of various operations and their combinations.</p>
<p>The general idea is to consider <span class="math inline">\(f(x + \Delta)\)</span>, and manipulate it into something of the form <span class="math inline">\(f(x) + L_x(\Delta) + O(\|\Delta\|^2)\)</span> where <span class="math inline">\(L_x(\Delta)\)</span> is a function of <span class="math inline">\(x\)</span> (maybe constant, though) and <span class="math inline">\(\Delta\)</span> that is linear in <span class="math inline">\(\Delta\)</span> (but not necessarily in <span class="math inline">\(x\)</span>). Then <span class="math inline">\(L_x(dx)\)</span> is exactly the differential <span class="math inline">\(df(x)\)</span>.</p>
<p>Let’s consider an example. Let <span class="math inline">\(f(X) = A X^{-1} + B\)</span> (all variables are square matrices of the same size):</p>
<p><span class="math display">\[
\begin{align*}
f(X + \Delta)
&= A (X + \Delta)^{-1} + B
= A X^{-1} (I + \Delta X^{-1})^{-1} + B \\
&= A X^{-1} \left(\sum_{k=0}^\infty (-\Delta X^{-1})^k \right) + B
= A X^{-1} \left(I - \Delta X^{-1} + O(\|\Delta\|^2) \right) + B \\
&= \underbrace{A X^{-1} + B}_{f(X)} - \underbrace{A X^{-1} \Delta X^{-1}}_{\text{linear in }\Delta} + O(\|\Delta\|^2)
\end{align*}
\]</span></p>
<p>Hence <span class="math inline">\(df(X) = -A X^{-1} dX X^{-1}\)</span> (it’s not of the form <span class="math inline">\(\text{Tr}(g(X)^T dX)\)</span> because <span class="math inline">\(f\)</span> is not scalar-valued, so you can’t extract the gradient from it; the “gradient” would be something like a 4-dimensional tensor). This way we can derive differentials for many common functions and operations:</p>
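<p>The worked example above can be verified with finite differences; the following sketch (mine, with randomly chosen matrices) compares f(X + dX) - f(X) against the derived differential:</p>

```python
import numpy as np

# Finite-difference sanity check of d(A X^{-1} + B) = -A X^{-1} dX X^{-1}.
rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
X = np.eye(n) + 0.1 * rng.normal(size=(n, n))  # keep X well-conditioned
dX = 1e-6 * rng.normal(size=(n, n))

Xinv = np.linalg.inv(X)
df_exact = (A @ np.linalg.inv(X + dX) + B) - (A @ Xinv + B)
df_formula = -A @ Xinv @ dX @ Xinv
print(np.linalg.norm(df_exact - df_formula))  # second-order small
```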
<p><span class="math display">\[
d(\alpha X) = \alpha \, dX \\
d(X + Y) = dX + dY \\
d(XY) = dX \, Y + X \, dY \\
d(X^{-1}) = -X^{-1} dX X^{-1} \\
d(c^T x) = c^T dx \\
d(x^T A x) = x^T (A + A^T) dx \\
d(\text{Tr}(X)) = \text{Tr}(dX) \\
d(\text{det}(X)) = \text{det}(X) \text{Tr}(X^{-1} dX)
\]</span></p>
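<p>Each rule in the table can be spot-checked the same way; for instance, here is a small numerical check (my own) of the determinant rule:</p>

```python
import numpy as np

# Checking d(det(X)) = det(X) Tr(X^{-1} dX) with a small random perturbation.
rng = np.random.default_rng(2)
n = 4
X = np.eye(n) + 0.1 * rng.normal(size=(n, n))
dX = 1e-7 * rng.normal(size=(n, n))

lhs = np.linalg.det(X + dX) - np.linalg.det(X)
rhs = np.linalg.det(X) * np.trace(np.linalg.inv(X) @ dX)
print(abs(lhs - rhs))  # second-order small
```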
<p>It’s also very helpful to derive a rule to deal with function composition: suppose we have <span class="math inline">\(f(x)\)</span> and <span class="math inline">\(g(x)\)</span> with corresponding differentials <span class="math inline">\(df(x)\)</span> and <span class="math inline">\(dg(x)\)</span>. Then the differential of <span class="math inline">\(h(x) = f(g(x))\)</span> is</p>
<p><span class="math display">\[
dh(x) = f(g(x+dx)) - f(g(x)) = f(g(x) + dg(x)) - f(g(x)) = df(y)\big|_{dy = dg(x),\; y = g(x)}
\]</span></p>
<p>That is, we take <span class="math inline">\(df(y)\)</span>, and replace each <span class="math inline">\(dy\)</span> with <span class="math inline">\(dg(x)\)</span>, and each <span class="math inline">\(y\)</span> with <span class="math inline">\(g(x)\)</span>.</p>
<p>These rules allow us to differentiate fairly complicated expressions like <span class="math inline">\(f(X) = \text{det}(X + B) \log(a^T X^{-1} a) - \text{Tr}(X)\)</span></p>
<p><span class="math display">\[
\begin{align*}
df(X)
&= d(\text{det}(X + B) \log (a^T X^{-1} a)) - d(\text{Tr}(X)) \\
&= d(\text{det}(X + B)) \log (a^T X^{-1} a) + \text{det}(X + B) d(\log (a^T X^{-1} a)) - \text{Tr}(dX) \\
&= \text{det}(X + B) \text{Tr}((X+B)^{-1} dX) \log (a^T X^{-1} a) + \frac{\text{det}(X + B)}{a^T X^{-1} a} d(a^T X^{-1} a) - \text{Tr}(dX) \\
&= \text{Tr}\left[\text{det}(X + B)\log (a^T X^{-1} a) (X+B)^{-1} dX \right] - \frac{\text{det}(X + B)}{a^T X^{-1} a} \left(a^T X^{-1} dX X^{-1} a\right) - \text{Tr}(dX) \\
&= \text{Tr}\left[\left(\text{det}(X + B)\log (a^T X^{-1} a) (X+B)^{-1} - \frac{\text{det}(X + B)}{a^T X^{-1} a} X^{-1} a a^T X^{-1} - I\right) dX \right] \\
\end{align*}
\]</span></p>
<p>One way to sanity-check (not a complete check, though!) our derivations is to consider <span class="math inline">\(1 \times 1\)</span> matrices, that is, the scalar case. In the scalar case it all boils down to <span class="math inline">\(f(x) = (x+b) \log(a^2 / x) - x\)</span> with derivative <span class="math inline">\(f'(x) = \log(a^2 / x) - \frac{x+b}{x} - 1\)</span>, which coincides with the formula above for <span class="math inline">\(1 \times 1\)</span> matrices.</p>
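<p>Another way to sanity-check is numerical: the sketch below (my own, with a random positive definite X so the logarithm is defined) evaluates the final Tr[M dX] form of the differential against a finite difference:</p>

```python
import numpy as np

# Finite-difference check of df(X) = Tr[M dX] for
# f(X) = det(X + B) log(a^T X^{-1} a) - Tr(X).
rng = np.random.default_rng(3)
n = 3
C = rng.normal(size=(n, n))
X = C @ C.T + n * np.eye(n)       # positive definite, so a^T X^{-1} a > 0
B = 0.1 * rng.normal(size=(n, n))
a = rng.normal(size=n)

def f(Y):
    return np.linalg.det(Y + B) * np.log(a @ np.linalg.inv(Y) @ a) - np.trace(Y)

Xinv = np.linalg.inv(X)
q = a @ Xinv @ a
detXB = np.linalg.det(X + B)
M = (detXB * np.log(q) * np.linalg.inv(X + B)
     - detXB / q * np.outer(Xinv @ a, a) @ Xinv
     - np.eye(n))

dX = 1e-6 * rng.normal(size=(n, n))
lhs = f(X + dX) - f(X)
rhs = np.trace(M @ dX)
print(abs(lhs - rhs))  # second-order small
```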
<h3 id="thehessian">The Hessian</h3>
<p>The same idea can be used to calculate the Hessian of a function, that is, the coefficient describing the function’s local quadratic behavior. We restrict ourselves to scalar-valued functions of finite-dimensional vectors, but the approach generalizes to other functions if you consider appropriate bilinear maps.</p>
<p>We define the second-order differential recursively, <span class="math inline">\(d^2 f(x) = d(df(x))\)</span>, as a linearization of a linearization (and we need to go deeper!). One might note that linearizing a linear function does not change anything, but we’re actually linearizing not the linear approximation, but the map <span class="math inline">\(x \mapsto df(x)\)</span> itself. In a way, you can think of <span class="math inline">\(df(x)\)</span> as a function of 2 arguments: the point <span class="math inline">\(x\)</span> and an infinitesimal perturbation <span class="math inline">\(dx\)</span>, and we’re linearizing with respect to the first one. Since we have 2 independent linearizations, it’d be incorrect to use the same perturbation in both of them, so we’ll introduce <span class="math inline">\(dx_1\)</span> and <span class="math inline">\(dx_2\)</span> as the first- and second-order perturbations.</p>
<p>If <span class="math inline">\(d^2 f(x)\)</span> at a given point <span class="math inline">\(x\)</span> is a linearization of a linearization, it’s a function of 2 perturbations: <span class="math inline">\(dx_1\)</span> and <span class="math inline">\(dx_2\)</span>. Moreover, it’s linear in both of them, so <span class="math inline">\(d^2 f(x)\)</span> is actually a bilinear map. In the case of a finite-dimensional vector space a bilinear map can be represented by a matrix <span class="math inline">\(H(x)\)</span>, that is, <span class="math inline">\(d^2 f(x) = dx_1^T H(x) dx_2\)</span>. The matrix <span class="math inline">\(H(x)\)</span> is called the <strong>Hessian</strong> and is denoted <span class="math inline">\(\nabla^2 f(x)\)</span>.</p>
<p>Then one uses the same formal rules, expanding <span class="math inline">\(d^2 f(x) = d(df(x))\)</span> by first computing <span class="math inline">\(df(x)\)</span> w.r.t. <span class="math inline">\(dx_1\)</span>, and then differentiating the resultant expression w.r.t. <span class="math inline">\(dx_2\)</span>. Again, let’s consider an example <span class="math inline">\(f(x) = \text{det}(I + x x^T)\)</span></p>
<p><span class="math display">\[
df(x) = 2\,\text{det}(I + x x^T)\, x^T (I + x x^T)^{-1} dx_1
= 2 f(x)\, x^T (I + x x^T)^{-1} dx_1
\]</span></p>
<p>Now, keeping in mind that we can move scalars around (as well as transpose them), we get</p>
<p><span class="math display">\[
\begin{align*}
d^2f(x) &= d(df(x)) \\
&= 2 \overbrace{df(x)}^{=dx_2^T \nabla f(x)} x^T (I + x x^T)^{-1} dx_1
+ 2 f(x) \, d(x^T) (I + x x^T)^{-1} dx_1
+ 2 f(x) \, x^T d((I + x x^T)^{-1}) dx_1 \\
&= 2 dx_2^T \left( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \right) dx_1 \\
&\quad - 2 f(x) \, x^T (I + x x^T)^{-1} (dx_2 x^T + x \, dx_2^T) (I + x x^T)^{-1} dx_1 \\
&= 2 dx_2^T \left( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \right) dx_1 \\
&\quad - 2 f(x) \overbrace{x^T (I + x x^T)^{-1} dx_2}^{\text{scalar, transpose}} x^T (I + x x^T)^{-1} dx_1
- 2 \overbrace{f(x) \, x^T (I + x x^T)^{-1} x}^{\text{scalar, trace}} \, dx_2^T (I + x x^T)^{-1} dx_1 \\
&= 2 dx_2^T \left( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \right) dx_1 \\
&\quad - 2 dx_2^T f(x) (I + x x^T)^{-1} x x^T (I + x x^T)^{-1} dx_1
- 2 dx_2^T f(x) \text{Tr}\left[ (I + x x^T)^{-1} x x^T \right] (I + x x^T)^{-1} dx_1 \\
&= 2 dx_2^T \Bigl( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \\
&\quad - f(x) (I + x x^T)^{-1} x x^T (I + x x^T)^{-1}
- f(x) \text{Tr}\left[ (I + x x^T)^{-1} x x^T \right] (I + x x^T)^{-1}
\Bigr) dx_1 \\
\end{align*}
\]</span></p>
<p>Thus the Hessian is</p>
<p><span class="math display">\[
\begin{align*}
\nabla^2 f(x)
&= 2 (I + x x^T)^{-1} x \left(\nabla f(x)^T - f(x) x^T (I + x x^T)^{-1} \right)
+ 2 f(x) \left(1 - \text{Tr}\left[ (I + x x^T)^{-1} x x^T \right] \right) (I + x x^T)^{-1} \\
&= (I + x x^T)^{-1} x \nabla f(x)^T
+ \left(2 f(x) - \nabla f(x)^T x \right) (I + x x^T)^{-1} \\
&= (I + x x^T)^{-1} x \nabla f(x)^T - \nabla f(x)^T x \, (I + x x^T)^{-1} + 2 f(x) (I + x x^T)^{-1} \\
&= 2 f(x) \left(\left(2 - x^T (I + x x^T)^{-1} x\right) I - (I + x x^T)^{-1}\right) (I + x x^T)^{-1} \\
\end{align*}
\]</span></p>
<p>The funny thing is, <span class="math inline">\(f(x) = \text{det}(I + x x^T)\)</span> can be simplified using the <a href="https://en.wikipedia.org/wiki/Matrix_determinant_lemma">determinant lemma</a> to <span class="math inline">\(f(x) = 1 + x^T x\)</span>. Now this is a very simple function, whose gradient is just <span class="math inline">\(2x\)</span> and whose Hessian is the constant <span class="math inline">\(2I\)</span>. And indeed, the expression above simplifies to <span class="math inline">\(2I\)</span>. Sometimes it’s beneficial to simplify the function first (:</p>
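<p>This collapse is easy to verify numerically; the following check (my own) plugs a random point into the final Hessian expression and compares it with 2I:</p>

```python
import numpy as np

# The long Hessian expression for f(x) = det(I + x x^T) should collapse to
# the constant 2 I, matching the simplified form f(x) = 1 + x^T x.
rng = np.random.default_rng(4)
n = 4
x = rng.normal(size=n)
A = np.linalg.inv(np.eye(n) + np.outer(x, x))        # (I + x x^T)^{-1}
fx = np.linalg.det(np.eye(n) + np.outer(x, x))       # f(x)

H = 2 * fx * ((2 - x @ A @ x) * np.eye(n) - A) @ A
print(np.allclose(H, 2 * np.eye(n)))  # True
```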
<h3 id="conclusion">Conclusion</h3>
<p>In this post I showed how one can derive gradients and Hessians using formal algebraic manipulations with differentials. The same technique is, of course, applicable to infinite-dimensional spaces (calculus of variations) and vector-valued functions (where the linear map is described by the Jacobian matrix).</p>
<h3 id="addendumonthedualnumbers">Addendum on the Dual Numbers</h3>
<p>The set of formal rules described above is not only helpful when calculating gradients by hand, but can also be used to automatically differentiate a function as you evaluate it. Indeed, suppose you need to differentiate some big and complex function <span class="math inline">\(f(x)\)</span>. In the above I showed how one can use formal rules to compute <span class="math inline">\(d f(x)\)</span>, rearrange the result into the form of <span class="math inline">\(\langle g(x), dx\rangle\)</span>, and use <span class="math inline">\(g(x)\)</span> as the gradient. Note that if we use the Taylor expansion of <span class="math inline">\(f(x+dx)\)</span> at <span class="math inline">\(x\)</span> we get <span class="math inline">\(f(x+dx) = f(x) + \langle \nabla f(x), dx \rangle + O(\|dx\|^2)\)</span>, and neither <span class="math inline">\(f(x)\)</span> nor <span class="math inline">\(\nabla f(x)\)</span> contains (or depends on) <span class="math inline">\(dx\)</span>. This means that if we extend our set of numbers by a symbol <span class="math inline">\(dx\)</span> with the property <span class="math inline">\(dx^2 = 0\)</span> (much like we obtained complex numbers by adding a symbol <span class="math inline">\(i\)</span> to the real numbers with the special property <span class="math inline">\(i^2 = -1\)</span>), and evaluate <span class="math inline">\(f(x+dx)\)</span> in this expanded algebra, we will obtain an expression of the form <span class="math inline">\(a + \langle b, dx \rangle\)</span> with <span class="math inline">\(a = f(x)\)</span> and <span class="math inline">\(b = \nabla f(x)\)</span>. And this is a well-known extension of the real numbers called the <a href="https://en.wikipedia.org/wiki/Dual_number">dual numbers</a>.</p>
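<p>To make this concrete, here is a minimal dual-number sketch (my own illustration, not the author’s code; the class and function names are made up, and only addition and multiplication are implemented):</p>

```python
# Numbers of the form a + b*dx with dx^2 = 0: evaluating f(x + dx)
# yields both f(x) (the .a part) and f'(x) (the .b part).
class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b  # value and derivative part

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # (a1 + b1 dx)(a2 + b2 dx) = a1 a2 + (a1 b2 + b1 a2) dx, since dx^2 = 0
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1  # f'(x) = 6x + 2

y = f(Dual(5.0, 1.0))  # evaluate at x = 5 with the symbol dx attached
print(y.a, y.b)        # prints 86.0 32.0
```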
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>If you’re wondering about the particular way of combining the gradient <span class="math inline">\(\nabla f(x)\)</span> and <span class="math inline">\(dx\)</span>, here’s the explanation. Recall that the first-order term is the linearization of the function, that is, a linear transformation <span class="math inline">\(L_x\)</span> applied to <span class="math inline">\(dx\)</span>. Because of the <a href="https://en.wikipedia.org/wiki/Riesz_representation_theorem">Riesz representation theorem</a> this linear transformation <span class="math inline">\(L_x\)</span> can be represented as a scalar product with some element of <span class="math inline">\(\mathcal{X}\)</span> (the gradient, in our case): <span class="math inline">\(L_x(dx) = \langle \nabla f(x), dx \rangle\)</span>. Of course, this logic generalizes to non-scalar-valued functions (like <span class="math inline">\(f : \mathbb{R}^n \to \mathbb{R}^m\)</span>): the gradient is used to define a linear map.<a href="#fnref1">↩</a></p></li>
</ol>
</div>Sun, 29 Jan 2017 00:00:00 UThttp://artem.sobolev.name/posts/20170129matrixandvectorcalculusviadifferentials.htmlArtemhttp://artem.sobolev.name/posts/20170129matrixandvectorcalculusviadifferentials.htmlNIPS 2016 Summaries
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/uXb15XnvxEo/20161231nips2016summaries.html
<p>I did not attend this year’s NIPS, but I’ve gathered many summaries published online by those who did attend the conference.</p>
<!more>
<ul>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5hdofr/d_nips_2016_symposium_on_people_and_machines/">NIPS 2016 Symposium on People and machines: Public views on machine learning, and what this means for machine learning researchers. (Notes and panel discussion)</a> by /u/gcr</li>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5hzvfi/d_nips_2016_summary_wrap_up_and_links_to_slides/">NIPS 2016 summary, wrap up, and links to slides</a> by /u/beamsearch</li>
<li><a href="http://inverseprobability.com/2016/12/13/nipshighlights.html">Post NIPS Reflections</a> by <a href="https://twitter.com/lawrennd">Neil Lawrence</a></li>
<li><a href="https://medium.com/@IgorCarron/somegeneraltakeawaysfromnips2016c3c5ec23bf1a#.rykyvqbvm">Some general take aways from #NIPS2016</a> by <a href="http://nuitblanche.blogspot.com">Igor Carron</a></li>
<li><a href="https://medium.com/@libfun/nips2016experienceandhighlights104e19e4ac95#.umy1vunwa">NIPS 2016 experience and highlights</a> by <a href="https://twitter.com/libfun_sk">Sergey Korolev</a></li>
<li><a href="http://www.machinedlearnings.com/2016/12/nips2016reflections.html">NIPS 2016 Reflections</a> by Paul Mineiro</li>
<li><a href="http://abunchofdata.com/somegeneraltakeawaysfromnips2016/">Some general takeaways from #NIPS2016</a> by Arturo Slim</li>
<li><a href="https://twitter.com/rossfadely">Ross Fadely</a> and <a href="https://twitter.com/mwakanosya">Jeremy Karnowski</a>:
<ul>
<li><a href="https://blog.insightdatascience.com/nips2016day16ae1207cab82">NIPS 2016 — Day 1 Highlights</a></li>
<li><a href="https://blog.insightdatascience.com/nips2016day2highlightsplatformwarsrlandrnns9dca43bc1448#.r2aync4cu">NIPS 2016 — Day 2 Highlights: Platform wars, RL and RNNs</a></li>
<li><a href="https://blog.insightdatascience.com/nips2016day3highlightsrobotsthatknowcarsthatseeandmore1ec958896791#.geqs66a4b">NIPS 2016 — Day 3 Highlights: Robots that know, Cars that see, and more!</a></li>
<li><a href="https://blog.insightdatascience.com/nips2016finalhighlightsdays46likelihoodfreeinferencedessertanalogiesandmuchmoreed7352d321ff#.uil9xf2mt">NIPS 2016 — Final Highlights Days 4–6: Likelihoodfree inference, Dessert analogies, and much more.</a></li>
</ul></li>
<li><a href="https://aichamp.wordpress.com/2016/12/09/nips2016top10/">Key deep learning takeaways from NIPS2016 for applied data scientist</a> by Avkash Chauhan</li>
<li><a href="https://paper.dropbox.com/doc/BradNeubergsNIPS2016NotesXUFRdpNYyBhau0gWcybRo">Brad Neuberg’s NIPS 2016 Notes</a> by Brad Neuberg</li>
<li><a href="https://blog.ought.com/nips2016875bb8fadb8c#.yrr46pb2t">50 things I learned at NIPS 2016</a></li>
<li>Lab 41 by <a href="https://twitter.com/karllab41">Karl Ni</a>:
<ul>
<li><a href="https://gab41.lab41.org/nips2016reviewday16e504bcf1451#.g3uvis858">NIPS 2016 Review, Days 0 & 1</a></li>
<li><a href="https://gab41.lab41.org/nips2016reviewday2daff1088135e#.prd61skhx">NIPS 2016 Review, Day 2</a></li>
<li><a href="https://gab41.lab41.org/nips2016reviewday321c78586a0ec#.a3mmr9wmi">NIPS 2016 Review, Day 3</a></li>
</ul></li>
<li><a href="http://wimlworkshop.org/2016/">WiML 2016</a> (Women in Machine Learning) videos:
<ul>
<li><a href="https://www.periscope.tv/WiMLworkshop/1ypKdAZXVOyGW?">Designing Algorithms for Practical Machine Learning</a> by Maya Gupta</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1DXxyoqrMXgGM?">On the Expressive Power of Deep Neural Networks</a> by Maithra Raghu</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1DXxyoqryqWGM?">Ancestral Causal Inference</a> by Sara Magliacane</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1vOxweXvPwgGB?">Towards a Reasoning Engine for Individualizing Healthcare</a> by <a href="http://www.suchisaria.com/">Suchi Saria</a></li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1vOxweXvqjEGB?">Learning Representations from Time Series Data through Contextualized LSTMs</a> by Madalina Fiterau</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1vAGRXDbvbkxl?">Towards Conversational Recommender Systems</a> by Konstantina Christakopoulou</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1gqGvRjOeWOGB?">LargeScale Machine Learning through Spectral Methods: Theory & Practice</a> by Anima Anandkumar</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1jMJgAkEVajKL?">WiML Updates</a> by Tamara Broderick</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1MnGnXwZmVMxO?">Using Convolutional Neural Networks to Estimate Population Density from High Resolution Satellite Images</a> by Amy Zhang</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1dRKZRYgQwvKB">Graphons and Machine Learning</a> by <span class="citation">@JenniferChayes</span></li>
</ul></li>
<li><a href="https://medium.com/@elluba/nips2016cakerocketaigansandthestyletransferdebate708c46438053#.he59yf9ah">NIPS 2016: cake, Rocket AI, GANs and the style transfer debate</a> by Luba Elliott</li>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5ib4jf/discussion_summary_of_nips_2016_adversarial/">Summary of NIPS 2016 Adversarial Training Workshop: More Theory, Exciting Progress</a> by /u/fhuszar</li>
<li><a href="https://www.taraslehinevych.me/blog/2016/12/14/nipsbarcelona/">NIPS 2016 Notes</a> by /u/lehinevych</li>
<li><a href="http://apeiroto.pe/ml/nips2016.html">NIPS 2016</a> by <a href="https://twitter.com/__hylandSL">Stephanie Hyland</a></li>
<li><a href="http://www.computervisionblog.com/2016/12/nutsandboltsofbuildingdeep.html">Nuts and Bolts of Building Deep Learning Applications: Ng @ NIPS2016</a> by <a href="https://twitter.com/quantombone">Tomasz Malisiewicz</a></li>
<li><a href="http://computerblindness.blogspot.ru/2016/12/nips2016.html">NIPS 2016</a> by <a href="https://twitter.com/ovrdr">Roman Shapovalov</a></li>
<li><a href="http://abunchofdata.com/magentawinsbestdemoatnips2016/">Magenta wins “Best Demo” at NIPS 2016!</a>, checkout the demo <a href="https://magenta.tensorflow.org/2016/12/16/nipsdemo/">here</a></li>
<li><a href="http://www.machinedlearnings.com/2016/12/dialogueworkshoprecap.html?spref=tw">Dialogue Workshop Recap</a> by <a href="https://twitter.com/PaulMineiro">Paul Mineiro</a></li>
<li><a href="https://www.linkedin.com/pulse/nips2016towardsenddynamicdialoguesystemvishalbhalla">NIPS 2016: Towards an end to end Dynamic Dialogue System</a> by <a href="https://twitter.com/vishy_punditry">Vishal Bhalla</a></li>
<li><a href="http://www.sandtable.com/nips2016deepreinforcementlearning/">NIPS 2016: Deep Reinforcement Learning</a> by Leighton Turner</li>
<li><a href="http://www.slideshare.net/SebastianRuder/nips2016highlightssebastianruder">NIPS 2016 Highlights</a> by <a href="http://sebastianruder.com/">Sebastian Ruder</a></li>
<li><a href="http://www.nxn.se/nips2016barcelona/">NIPS 2016 Notes</a> by Valentine Svensson</li>
<li><a href="http://yenhuanli.github.io/blog/2016/12/12/interestingtalksinnips2016/">Some Interesting Talks at NIPS 2016</a> by <a href="https://twitter.com/yenhuan_li">YenHuan Li</a></li>
<li><a href="https://deezer.io/deezerrdgoestonips2016e7a895c2c7ff#.j0tgjb3l6">Deezer R&D goes to NIPS 2016</a></li>
<li><a href="https://blog.cometlabs.io/robotslearningabouthumanvaluesemotionandintenta39ca12c1908#.r4afpw77e">Robots Learning About Human Values, Emotion, and Intent</a> by Malika Cantor</li>
<li><a href="http://ikuz.eu/2016/12/16/notesonnips2016/">Notes on NIPS 2016</a> by Ilya Kuzovkin</li>
<li><a href="http://www.mikelanzetta.com/nips2016tripreport.html">NIPS 2016 Trip Report</a> by <a href="https://twitter.com/noodlefrenzy">Mike Lanzetta</a></li>
<li><a href="http://sebastianruder.com/highlightsnips2016/">Highlights of NIPS 2016: Adversarial learning, Metalearning, and more</a> by <a href="https://twitter.com/seb_ruder">Sebastian Ruder</a></li>
<li><a href="https://livingthing.danmackinlay.name/nips_2016.html">Garbled highlights from NIPS 2016</a> by <a href="https://danmackinlay.name/">Dan Mackinlay</a></li>
<li><a href="http://approximatelycorrect.com/2016/12/28/aisafetyhighlightsfromnips2016/">AI Safety Highlights from NIPS 2016</a> by <a href="https://twitter.com/vkrakovna">Victoria Krakovna</a></li>
<li><a href="http://blog.evjang.com/2017/01/nips2016.html">Summary of NIPS 2016</a> by Eric Jang</li>
<li><a href="http://www.nowozin.net/sebastian/blog/nips2016generativeadversarialtrainingworkshoptalk.html">NIPS 2016 Generative Adversarial Training workshop talk</a></li>
<li><a href="http://hunch.net/?p=5937325">EWRL and NIPS 2016</a> by <a href="http://hunch.net/~jl/">John Langford</a></li>
</ul>
<p>You might also be interested in:</p>
<ul>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5hwqeb/project_all_code_implementations_for_nips_2016/">All Code Implementations for NIPS 2016 papers</a></li>
<li>On RocketAI: [<a href="https://twitter.com/deanpomerleau/status/808011377059254273">1</a>] + [<a href="https://www.reddit.com/r/MachineLearning/comments/5hmdty/discussion_rocketai/db2mit3/">2</a>] + [<a href="https://medium.com/themission/rocketai2016smostnotoriousailaunchandtheproblemwithaihyped7908013f8c9">3</a>]</li>
</ul>Sat, 31 Dec 2016 00:00:00 UThttp://artem.sobolev.name/posts/20161231nips2016summaries.htmlArtemhttp://artem.sobolev.name/posts/20161231nips2016summaries.htmlNeural Variational Inference: Importance Weighted Autoencoders
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/CFIlfE5saw/20160714neuralvariationalimportanceweightedautoencoders.html
<p>Previously we covered <a href="/posts/20160711neuralvariationalinferencevariationalautoencodersandHelmholtzmachines.html">Variational Autoencoders</a> (VAE) — a popular inference tool based on neural networks. In this post we’ll consider a follow-up work from Toronto by Y. Burda, R. Grosse and R. Salakhutdinov, <a href="https://arxiv.org/abs/1509.00519">Importance Weighted Autoencoders</a> (IWAE). The crucial contribution of this work is the introduction of a new lower bound on the marginal log-likelihood <span class="math inline">\(\log p(x)\)</span> which generalizes the ELBO, but also allows one to use less accurate approximate posteriors <span class="math inline">\(q(z \mid x, \Lambda)\)</span>.</p>
<p>For dessert we’ll discuss another paper, <a href="https://arxiv.org/abs/1602.06725">Variational inference for Monte Carlo objectives</a> by A. Mnih and D. Rezende, which aims to broaden the applicability of this approach to models where the reparametrization trick cannot be used (e.g. for discrete variables).</p>
<!more>
<h3>
Importance Weighted Autoencoders
</h3>
<p>Let’s first answer the question: how can one come up with a lower bound on the marginal log-likelihood? In the very beginning of the series, in the <a href="/posts/20160701neuralvariationalinferenceclassicaltheory.html">Classical Theory</a> post, we used some trickery to come up with the ELBO. That massaging of the marginal log-likelihood wasn’t particularly enlightening on how one could invent that lower bound. Now we’re going to consider a principled approach to inventing new lower bounds, based on <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen’s inequality</a>.</p>
<p>Suppose we have some unbiased estimator <span class="math inline">\(f(x, z)\)</span> of <span class="math inline">\(p(x)\)</span>, that is, <span class="math inline">\(\mathbb{E}_z f(x, z) = p(x)\)</span>. Then</p>
<p><span class="math display">\[
\log p(x) = \log \mathbb{E}_z f(x, z) \stackrel{\text{Jensen}}{\ge} \mathbb{E}_z \log f(x, z)
\]</span></p>
<p>In particular, if <span class="math inline">\(z \sim q(z \mid x)\)</span> and <span class="math inline">\(f(x, z) = \tfrac{p(x, z)}{q(z \mid x)}\)</span>, we obtain the standard ELBO. The IWAE paper proposes another estimator (actually, a family of estimators parametrized by an integer <span class="math inline">\(K\)</span>) of the marginal <span class="math inline">\(p(x)\)</span>:</p>
<p><span class="math display">\[
f(x, z_1, \dots, z_K) = \frac{1}{K} \sum_{k=1}^K \frac{p(x, z_k)}{q(z_k \mid x)}
\]</span></p>
<p>Where each <span class="math inline">\(z_k\)</span> comes from the same distribution <span class="math inline">\(q(z_k \mid x) = q(z \mid x)\)</span>. Obviously, <span class="math inline">\(f(x, z_1, \dots, z_K)\)</span> is still an unbiased estimator of <span class="math inline">\(p(x)\)</span>, and therefore <span class="math inline">\(\mathbb{E}_z \log f(x, z_1, \dots, z_K)\)</span> is a valid lower bound on the marginal log-likelihood.</p>
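<p>As a sanity check on this estimator, here’s a minimal NumPy sketch (my own illustration, not code from the paper) that computes <span class="math inline">\(\log \frac{1}{K} \sum_k w_k\)</span> in log-space via log-sum-exp, since the raw importance weights easily under- or overflow; <code>log_p_joint</code> and <code>log_q</code> are hypothetical stand-ins for the model and the approximate posterior:</p>

```python
import numpy as np

def iwae_bound_estimate(log_p_joint, log_q, z_samples):
    """Monte Carlo estimate of the K-sample IWAE bound
    log (1/K) sum_k p(x, z_k) / q(z_k | x), computed in log-space
    with the log-sum-exp trick for numerical stability.

    log_p_joint, log_q: callables returning log p(x, z) and log q(z | x)
    z_samples: K samples z_1..z_K drawn from q(z | x).
    """
    log_w = np.array([log_p_joint(z) - log_q(z) for z in z_samples])
    K = len(log_w)
    m = log_w.max()  # subtract the max so exp() never overflows
    return m + np.log(np.exp(log_w - m).sum()) - np.log(K)
```

<p>With <span class="math inline">\(K = 1\)</span> this reduces to the usual single-sample ELBO estimate.</p>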
<p>Let’s analyze this new lower bound now. First, let’s dissect the ELBO:</p>
<p><span class="math display">\[
\mathcal{L}(\Theta, \Lambda)
= \mathbb{E}_q \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}
= \mathbb{E}_q \left[\log \frac{p(z \mid x, \Theta)}{q(z \mid x, \Lambda)} \right] + \log p(x \mid \Theta)
\]</span></p>
<p>If <span class="math inline">\(q\)</span> approximates the true posterior accurately, the first term (which is a negative KL-divergence, by the way) is close to zero. However, when estimating it using Monte Carlo samples, the ELBO heavily penalizes inaccurate approximations: if <span class="math inline">\(q(z \mid x, \Lambda)\)</span> gives us samples from high-probability regions of the true posterior <span class="math inline">\(p(z \mid x, \Theta)\)</span> only occasionally (say, 20% of the time), the gap between the ELBO and the marginal log-likelihood would be huge (<span class="math inline">\(p(z\mid x, \Theta)\)</span> is small, <span class="math inline">\(q(z \mid x, \Lambda)\)</span> is big), which does not help learning. As you might have guessed, IWAE allows us to use several samples. Let’s see it in detail:</p>
<p><span class="math display">\[
\mathcal{L}_K(\Theta, \Lambda)
:= \mathbb{E}_q \left[\log \frac{1}{K} \sum_{k=1}^K \frac{p(x, z_k \mid \Theta)}{q(z_k \mid x, \Lambda)} \right]
= \mathbb{E}_q \left[\log \frac{1}{K} \sum_{k=1}^K \frac{p(z_k \mid x, \Theta)}{q(z_k \mid x, \Lambda)} \right] + \log p(x \mid \Theta)
\]</span></p>
<p>This averaging of posterior ratios saves us from bad samples wrecking the lower bound, as it’ll be pushed up by good samples (provided the approximation has a reasonable probability of generating a good sample in <span class="math inline">\(K\)</span> attempts). This allows one to perform model inference even with poor approximations <span class="math inline">\(q(z \mid x, \Lambda)\)</span>. The more samples <span class="math inline">\(K\)</span> we use, the less accurate an approximation we can tolerate. In fact, the authors prove the following theorem:</p>
<blockquote>
<p><strong>Theorem 1</strong>. For all <span class="math inline">\(K\)</span>, the lower bounds satisfy <span class="math display">\[
\log p(x \mid \Theta) \ge \mathcal{L}_{K+1}(\Theta, \Lambda) \ge \mathcal{L}_{K}(\Theta, \Lambda)
\]</span></p>
Moreover, if <span class="math inline">\(p(z, x \mid \Theta) / q(z \mid x, \Lambda)\)</span> is bounded, then <span class="math inline">\(\mathcal{L}_{K}(\Theta, \Lambda)\)</span> approaches <span class="math inline">\(\log p(x \mid \Theta)\)</span> as <span class="math inline">\(K\)</span> goes to infinity.
</blockquote>
<p>The convergence result follows from the <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers#Strong_law">strong law of large numbers</a>.</p>
<p>As with VAE, we use the reparametrization trick to avoid backpropagation through stochastic units:</p>
<p><span class="math display">\[
\mathcal{L}_K(\Theta, \Lambda) = \mathbb{E}_{\varepsilon_1, \dots, \varepsilon_K} \log \frac{1}{K} \sum_{k=1}^K \overbrace{\frac{p(x, g(\varepsilon_k; \Lambda) \mid \Theta)}{q(g(\varepsilon_k; \Lambda) \mid x, \Lambda)}}^{w(x, \varepsilon_k, \Theta, \Lambda)}
\]</span></p>
<p>The gradients then are</p>
<p><span class="math display">\[
\nabla_\Theta \mathcal{L}_K(\Theta, \Lambda) = \mathbb{E}_{\varepsilon_1, \dots, \varepsilon_K} \sum_{k=1}^K \hat w_k(x, \varepsilon_{1 \dots K}, \Theta, \Lambda) \nabla_\Theta \log w(x, \varepsilon_k, \Theta, \Lambda) \\
\nabla_\Lambda \mathcal{L}_K(\Theta, \Lambda) = \mathbb{E}_{\varepsilon_1, \dots, \varepsilon_K} \sum_{k=1}^K \hat w_k(x, \varepsilon_{1 \dots K}, \Theta, \Lambda) \nabla_\Lambda \log w(x, \varepsilon_k, \Theta, \Lambda) \\
\text{where } \hat w_k(x, \varepsilon_{1 \dots K}, \Theta, \Lambda) := \frac{w(x, \varepsilon_k, \Theta, \Lambda)}{\sum_{j=1}^K w(x, \varepsilon_j, \Theta, \Lambda)}
\]</span></p>
<p>(We used the identity <span class="math inline">\(\nabla_x f(x) = f(x) \nabla_x \log f(x)\)</span> here).</p>
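<p>Numerically, the normalized weights <span class="math inline">\(\hat w_k\)</span> are just a softmax over the log-weights. A small NumPy sketch (an illustration, not the authors’ code) of the stable way to compute them:</p>

```python
import numpy as np

def normalized_weights(log_w):
    """Self-normalized importance weights w_hat_k = w_k / sum_j w_j.

    Computed as a softmax over the log-weights log_w: subtracting the
    max before exponentiating avoids overflow without changing the
    ratios."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()
```

<p>Note that with <span class="math inline">\(K = 1\)</span> the single weight is always 1, which is why the gradients below collapse to the VAE case.</p>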
<p>Just as one would expect, setting <span class="math inline">\(K=1\)</span> reduces these gradients to the ones we’ve seen in VAEs, as the only importance weight <span class="math inline">\(\hat w_1\)</span> is equal to 1. Unfortunately, this approach does not allow one to decompose the lower bound into the reconstruction error and the KL-divergence in order to compute the latter analytically. However, the authors report indistinguishable performance of the two approaches (with the KL computed analytically or estimated using Monte Carlo) in the case of <span class="math inline">\(K=1\)</span>.</p>
<p>BTW, <a href="http://info.usherbrooke.ca/hlarochelle/index_en.html">Hugo Larochelle</a> writes <a href="https://twitter.com/hugo_larochelle/timelines/639067398511968256">notes</a> on different papers, and he has written and made publicly available <a href="https://www.evernote.com/shard/s189/sh/e2c8a1331814474cb267366600a1921b/06a756f1618cd47ababc7aae0e514dbf">Notes on Importance Weighted Autoencoders</a>.</p>
<h3>
Variational inference for Monte Carlo objectives
</h3>
<p>As I said in the introduction, IWAE has been “generalized” to discrete variables — a case where one cannot employ the reparametrization trick, and instead has to somehow reduce the high variance of a score-function-based estimator. Previously, during our discussion of <a href="/posts/20160705neuralvariationalinferenceblackbox.html">Blackbox VI and variance reduction techniques</a>, we covered the NVIL (Neural Variational Inference and Learning) estimator, which uses another neural network to estimate the marginal likelihood and reduce the variance. This work is built upon a similar idea.</p>
<p>First, let’s derive score-function-based gradients for the variational parameters <span class="math inline">\(\Lambda\)</span> (where <span class="math inline">\(w\)</span> is now defined as <span class="math inline">\(w(x, z, \Theta, \Lambda) = \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}\)</span>, and <span class="math inline">\(\hat w\)</span> is its version normalized across all samples <span class="math inline">\(z_{1\dots K}\)</span>):</p>
<p><span class="math display">\[
\begin{align}
\nabla_\Lambda \mathcal{L}_K(\Theta, \Lambda)
&= \nabla_\Lambda \mathbb{E}_{q(z_1, \dots, z_K \mid x, \Lambda)} \log \frac{1}{K} \sum_{k=1}^K w(x, z_k, \Theta, \Lambda) \\
&= \nabla_\Lambda \int q(z_1, \dots, z_K \mid x, \Lambda) \log \frac{1}{K} \sum_{k=1}^K w(x, z_k, \Theta, \Lambda) \; dz_1 \dots dz_K \\
&= \mathbb{E}_{q(z_1, \dots, z_K \mid x, \Lambda)} \left[ \sum_{k=1}^K \nabla_\Lambda \log q(z_k \mid x, \Lambda) \log \frac{1}{K} \sum_{k=1}^K w(x, z_k, \Theta, \Lambda) \right] \\
& \quad+ \mathbb{E}_{q(z_1, \dots, z_K \mid x, \Lambda)} \nabla_\Lambda \log \sum_{k=1}^K w(x, z_k, \Theta, \Lambda) \\
&= \mathbb{E}_{q(z_1, \dots, z_K \mid x, \Lambda)} \left[ \sum_{k=1}^K \nabla_\Lambda \log q(z_k \mid x, \Lambda) \log \frac{1}{K} \sum_{k=1}^K w(x, z_k, \Theta, \Lambda) \right] \\
& \quad+ \mathbb{E}_{q(z_1, \dots, z_K \mid x, \Lambda)} \left[ \sum_{k=1}^K \hat w_k(x, z_{1 \dots K}, \Theta, \Lambda) \nabla_\Lambda \log w(x, z_k, \Theta, \Lambda) \right]
\end{align}
\]</span></p>
<p>The second term is exactly the gradient of the reparametrized case, and it does not cause us any trouble. The first term, however, has some issues.</p>
<p>First, it does not distinguish individual samples’ contributions: indeed, the gradients for all samples have the same weight of <span class="math inline">\(\log \tfrac{1}{K} \sum_{k=1}^K w(x, z_k, \Theta, \Lambda)\)</span> (called the <em>learning signal</em>), regardless of how probable they are under the true posterior (that is, how well they describe an observation <span class="math inline">\(x\)</span>). Compare it with the second term, where the gradient for each sample <span class="math inline">\(z_k\)</span> is weighted in proportion to its importance weight <span class="math inline">\(\hat w_k\)</span>.</p>
<p>The second problem is that the learning signal is unbounded and can be quite large. Again, the second term does not suffer from this, as the importance weights <span class="math inline">\(\hat w_k\)</span> are normalized to sum to 1.</p>
<p>One can use the NVIL estimator we’ve discussed previously to reduce the variance due to the large magnitude of the learning signal. However, it does not address the problem of all gradients having the same weight. For this, the authors propose to introduce per-sample baselines that minimize dependencies between samples.</p>
<p>This paper has also caught Dr. Larochelle’s attention: <a href="https://www.evernote.com/shard/s189/sh/54a9fb881a714e8ab0e3f13480a68b8d/0663de49b93d397f519c7d7f73b6a441">Notes on Variational inference for Monte Carlo objectives</a>.</p>Thu, 14 Jul 2016 00:00:00 UThttp://artem.sobolev.name/posts/20160714neuralvariationalimportanceweightedautoencoders.htmlArtemhttp://artem.sobolev.name/posts/20160714neuralvariationalimportanceweightedautoencoders.htmlNeural Variational Inference: Variational Autoencoders and Helmholtz machines
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/kx9whnUkp94/20160711neuralvariationalinferencevariationalautoencodersandHelmholtzmachines.html
<p>So far there has been little “neural” in our VI methods. Now it’s time to fix that, as we’re going to consider <a href="https://arxiv.org/abs/1312.6114">Variational Autoencoders</a> (VAE), a paper by D. Kingma and M. Welling, which made a lot of buzz in the ML community. It has two main contributions: a new approach (AEVB) to large-scale inference in non-conjugate models with continuous latent variables, and a probabilistic model of autoencoders as an example of this approach. We then discuss connections to <a href="https://en.wikipedia.org/wiki/Helmholtz_machine">Helmholtz machines</a> — a predecessor of VAEs.</p>
<!more>
<h3 id="autoencodingvariationalbayes">AutoEncoding Variational Bayes</h3>
<p>As noted in the introduction of the post, this approach, called Auto-Encoding Variational Bayes (AEVB), works only for some models with continuous latent variables. Recall from our discussion of <a href="/posts/20160705neuralvariationalinferenceblackbox.html">Blackbox VI</a> and <a href="/posts/20160704neuralvariationalinferencestochasticvariationalinference.html">Stochastic VI</a> that we’re interested in maximizing the ELBO <span class="math inline">\(\mathcal{L}(\Theta, \Lambda)\)</span>:</p>
<p><span class="math display">\[
\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{q(z\mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}
\]</span></p>
<p>It’s not a problem to compute an estimate of the gradient of the ELBO w.r.t. the model parameters <span class="math inline">\(\Theta\)</span>, but estimating the gradient w.r.t. the approximation parameters <span class="math inline">\(\Lambda\)</span> is tricky, as these parameters influence the distribution the expectation is taken over, and as we know from the post on <a href="/posts/20160705neuralvariationalinferenceblackbox.html">Blackbox VI</a>, the naive gradient estimator based on the score function exhibits high variance.</p>
<p>It turns out that for some distributions we can make a change of variables: a sample <span class="math inline">\(z \sim q(z \mid x, \Lambda)\)</span> can be represented as a (differentiable) transformation <span class="math inline">\(g(\varepsilon; \Lambda, x)\)</span> of some auxiliary random variable <span class="math inline">\(\varepsilon\)</span> whose distribution does not depend on <span class="math inline">\(\Lambda\)</span>. A well-known example of such a reparametrization is the Gaussian distribution: if <span class="math inline">\(z \sim \mathcal{N}(\mu, \Sigma)\)</span> then <span class="math inline">\(z\)</span> can be represented as <span class="math inline">\(z = g(\varepsilon; \mu, \Sigma) := \mu + \Sigma^{1/2} \varepsilon\)</span> for <span class="math inline">\(\varepsilon \sim \mathcal{N}(0, I)\)</span>. This transformation is called the <strong>reparametrization trick</strong>. After the reparametrization the ELBO becomes</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}(\Theta, \Lambda)
&= \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \log \frac{p(x, g(\varepsilon; \Lambda, x)\mid \Theta)}{q(g(\varepsilon; \Lambda, x) \mid \Lambda, x)} \\
&\approx \frac{1}{L} \sum_{l=1}^L \log \frac{p(x, g(\varepsilon^{(l)}; \Lambda, x)\mid \Theta)}{q(g(\varepsilon^{(l)}; \Lambda, x) \mid \Lambda, x)} \quad \quad \text{where $\varepsilon^{(l)} \sim \mathcal{N}(0, I)$}
\end{align}
\]</span></p>
<p>This objective is a much better one, as we don’t need to differentiate w.r.t. the expectation’s distribution, essentially putting the variational parameters <span class="math inline">\(\Lambda\)</span> into the same regime as the model parameters <span class="math inline">\(\Theta\)</span>. It’s sufficient now to just take gradients of the ELBO’s estimate, and run any optimization algorithm like <a href="https://arxiv.org/abs/1412.6980">Adam</a>.</p>
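<p>Here’s a tiny toy check (my own, with made-up numbers) of why the trick works: for <span class="math inline">\(z \sim \mathcal{N}(\mu, \sigma^2)\)</span> and <span class="math inline">\(f(z) = z^2\)</span> we know <span class="math inline">\(\mathbb{E} f(z) = \mu^2 + \sigma^2\)</span>, so the true gradient w.r.t. <span class="math inline">\(\mu\)</span> is <span class="math inline">\(2\mu\)</span>, and after reparametrizing the gradient moves inside the expectation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparametrization-trick sanity check (toy numbers):
# for z ~ N(mu, sigma^2) and f(z) = z^2, E f(z) = mu^2 + sigma^2,
# hence d/dmu E f(z) = 2*mu.  Writing z = g(eps) = mu + sigma * eps
# with eps ~ N(0, 1) lets the derivative pass inside the expectation:
# d/dmu f(g(eps)) = 2 * z, which we simply average over samples of eps.
mu, sigma = 1.5, 0.7
eps = rng.standard_normal(200_000)
z = mu + sigma * eps
grad_estimate = float((2 * z).mean())   # close to 2 * mu = 3.0
```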
<p>Oh, and if you wonder what the Auto-Encoding in Auto-Encoding Variational Bayes means, there’s an interesting interpretation of the ELBO in terms of autoencoding:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}(\Theta, \Lambda)
& = \mathbb{E}_{q(z\mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}
= \mathbb{E}_{q(z\mid x, \Lambda)} \log \frac{p(x \mid z, \Theta) p(z \mid \Theta)}{q(z \mid x, \Lambda)} \\
& = \mathbb{E}_{q(z\mid x, \Lambda)} \log p(x \mid z, \Theta) - D_{KL}(q(z \mid \Lambda, x) \mid\mid p(z \mid \Theta)) \\
\end{align}
\]</span></p>
<p>Here the first term can be treated as expected reconstruction (<span class="math inline">\(x\)</span> from the code <span class="math inline">\(z\)</span>) error, while the second one is just a regularizer.</p>
<h3 id="variationalautoencoder">Variational Autoencoder</h3>
<p>One particular application of the AEVB framework comes from using neural networks as the model <span class="math inline">\(p(x \mid z, \Theta)\)</span> (called the <strong>generative network</strong>) and the approximation <span class="math inline">\(q(z \mid x, \Lambda)\)</span> (called the <strong>inference network</strong> or <strong>recognition network</strong>). The model has no special requirements, and <span class="math inline">\(x\)</span> can be discrete or continuous (or mixed). <span class="math inline">\(z\)</span>, however, has to be continuous. Moreover, we need to be able to apply the reparametrization trick. Therefore in many practical applications <span class="math inline">\(q(z \mid x, \Lambda)\)</span> is set to be a Gaussian distribution <span class="math inline">\(q(z \mid \Lambda, x) = \mathcal{N}(z \mid \mu(x; \Lambda), \Sigma(x; \Lambda))\)</span> where <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\Sigma\)</span> are outputs of a neural network taking <span class="math inline">\(x\)</span> as input, and <span class="math inline">\(\Lambda\)</span> denotes the set of the neural network’s weights — the parameters we optimize the ELBO with respect to (the same applies to <span class="math inline">\(\Theta\)</span>). In order to make the reparametrization trick practical, one would like to be able to compute <span class="math inline">\(\Sigma^{1/2}\)</span> quickly. We don’t want to compute this quantity directly, as it’d be too computationally expensive. Instead you might want to predict <span class="math inline">\(\Sigma^{1/2}\)</span> with the neural network in the first place, or consider only diagonal covariance matrices (as is done in the paper).</p>
<p>In the case of a Gaussian inference network <span class="math inline">\(q(z \mid x, \Lambda)\)</span> and a Gaussian prior <span class="math inline">\(p(z \mid \Theta)\)</span> we can compute the KL-divergence <span class="math inline">\(D_{KL}(q(z \mid \Lambda, x) \mid\mid p(z \mid \Theta))\)</span> analytically, see the formula at <a href="http://stats.stackexchange.com/a/60699/62549">stats.stackexchange</a>. This slightly reduces the variance of the gradient estimator, though one can still train a VAE estimating the KL-divergence using Monte Carlo, just like the reconstruction error.</p>
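<p>For reference, the closed form for a diagonal Gaussian posterior against a standard normal prior can be sketched as follows (a small NumPy illustration of mine; the log-variance parametrization is a common convention, and the function name is hypothetical):</p>

```python
import numpy as np

def kl_diag_gaussian_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the closed form used
    when both q(z|x) and the prior p(z) are Gaussian:
        0.5 * sum( exp(log_var) + mu^2 - 1 - log_var )."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

<p>For <span class="math inline">\(\mu = 0\)</span>, <span class="math inline">\(\sigma^2 = 1\)</span> the divergence is zero, as it should be.</p>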
<p>We optimize both generative and inference networks by gradient ascent. This joint optimization pushes both the approximation towards the model, and the model towards the approximation. As a result, the generative network is encouraged to learn latent representations <span class="math inline">\(z\)</span> that exhibit the same independence pattern as the inference network. For example, if the inference network is Gaussian and has diagonal covariance matrices, then the generative model will try to learn representations with independent components.</p>
<p>VAEs have become popular because one can use them as generative models. Essentially, a VAE is an easy-to-train autoencoder with a natural sampling procedure <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>: suppose you’ve trained the model, and now want to sample new points similar to those in the training set. To do so you first sample <span class="math inline">\(z\)</span> from the prior <span class="math inline">\(p(z)\)</span>, and then generate <span class="math inline">\(x\)</span> using the model <span class="math inline">\(p(x \mid z, \Theta)\)</span>. Both operations are easy: the first one is sampling from some standard distribution (a Gaussian, for example), and the second one is just one feedforward pass followed by sampling from another standard distribution (Bernoulli, for example, in case <span class="math inline">\(x\)</span> is a binary image).</p>
<p>If you want to read more on Variational AutoEncoders, I refer you to a great <a href="https://arxiv.org/abs/1606.05908">tutorial by Carl Doersch</a>. Also take a look at Dustin Tran’s post <a href="http://dustintran.com/blog/variationalautoencodersdonottraincomplexgenerativemodels/">Variational autoencoders do not train complex generative models</a> (and see the <a href="https://www.reddit.com/r/MachineLearning/comments/4ph8cq/variational_autoencoders_do_not_train_complex/">reddit discussion</a> also!).</p>
<h3 id="helmholtzmachines">Helmholtz Machines</h3>
<p>In the end I’d like to add a historical perspective. The idea of two networks, one “encoding” an observation <span class="math inline">\(x\)</span> to some latent representation (code) <span class="math inline">\(z\)</span>, and another “decoding” it back is definitely not new. In fact, the whole idea is a special case of the <a href="https://en.wikipedia.org/wiki/Helmholtz_machine">Helmholtz Machines</a> introduced by Geoffrey Hinton 20 years ago.</p>
<p>A Helmholtz machine can be thought of as a neural network of stochastic hidden layers. Namely, we now have <span class="math inline">\(M\)</span> stochastic hidden layers (latent variables) <span class="math inline">\(h_1, \dots, h_M\)</span> (with deterministic <span class="math inline">\(h_0 = x\)</span>) where the layer <span class="math inline">\(h_{m-1}\)</span> is stochastically produced by the layer <span class="math inline">\(h_{m}\)</span>, that is, it is sampled from some distribution <span class="math inline">\(p(h_{m-1} \mid h_m)\)</span>, which, as you might have guessed already, is parametrized in the same way as in usual VAEs. Actually, a VAE is a special case of a Helmholtz machine with just one stochastic layer (but each stochastic layer contains a neural network of arbitrarily many deterministic layers inside of it).</p>
<div style="textalign: center">
<p><img src="/files/Helmholtzmachine.png" style="width: 400px" /></p>
</div>
<p>This image shows an instance of a Helmholtz machine with 2 stochastic layers (blue cloudy nodes), and each stochastic layer having 2 deterministic hidden layers (white rectangles).</p>
<p>The joint model distribution is</p>
<p><span class="math display">\[
p(x, h_1, \dots, h_M \mid \Theta) = p(h_M \mid \Theta) \prod_{m=0}^{M-1} p(h_m \mid h_{m+1}, \Theta)
\]</span></p>
<p>And the approximate posterior is the same, but in inverse order:</p>
<p><span class="math display">\[
q(h_1, \dots, h_M \mid x, \Lambda) = \prod_{m=1}^{M} q(h_{m} \mid h_{m-1}, \Lambda)
\]</span></p>
<p>The <span class="math inline">\(p(x, h_1, \dots, h_{M-1} \mid h_M)\)</span> distribution is usually called a <strong>generative network</strong> (or model), as it allows one to generate samples from the latent representation(s). The approximate posterior <span class="math inline">\(q(h_1, \dots, h_M \mid x, \Lambda)\)</span> in this framework is called a <strong>recognition network</strong> (or model). Presumably, the name reflects the purpose of the network: to recognize the hidden structure of observations.</p>
<p>So, if the VAE is a special case of Helmholtz machines, what’s new then? The standard algorithm for learning Helmholtz machines, the <a href="https://en.wikipedia.org/wiki/Wakesleep_algorithm">Wake-Sleep algorithm</a>, turns out to be optimizing a different objective. Thus, one of the significant contributions of Kingma and Welling is the application of the reparametrization trick to make optimization of the ELBO w.r.t. <span class="math inline">\(\Lambda\)</span> tractable.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>This is not true for other popular autoencoding architectures. Boltzmann Machines are too hard to train properly, while traditional autoencoders (contractive, denoising) are hard to sample from (special procedures involving MCMC are required).<a href="#fnref1">↩</a></p></li>
</ol>
</div>Mon, 11 Jul 2016 00:00:00 UThttp://artem.sobolev.name/posts/20160711neuralvariationalinferencevariationalautoencodersandHelmholtzmachines.htmlArtemhttp://artem.sobolev.name/posts/20160711neuralvariationalinferencevariationalautoencodersandHelmholtzmachines.htmlNeural Variational Inference: Blackbox Mode
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/Rx_ge6y2j6g/20160705neuralvariationalinferenceblackbox.html
<p>In the <a href="/posts/20160704neuralvariationalinferencestochasticvariationalinference.html">previous post</a> we covered Stochastic VI: an efficient and scalable variational inference method for exponential family models. However, there are many more distributions than those belonging to the exponential family, and inference in those cases requires a significant amount of model analysis. In this post we consider <a href="https://arxiv.org/abs/1401.0118">Black Box Variational Inference</a> by Ranganath et al. This work, just as the previous one, comes from the lab of <a href="http://www.cs.columbia.edu/~blei/">David Blei</a> — one of the leading researchers in VI. And, just for dessert, we’ll touch upon another paper, which will finally introduce some neural networks into VI.</p>
<!more>
<h3>
Blackbox Variational Inference
</h3>
<p>As we have learned so far, the goal of VI is to maximize the ELBO <span class="math inline">\(\mathcal{L}(\Theta, \Lambda)\)</span>. When we maximize it over <span class="math inline">\(\Lambda\)</span>, we decrease the gap between the ELBO and the marginal log-likelihood of the considered model <span class="math inline">\(\log p(x \mid \Theta)\)</span>, and when we maximize it over <span class="math inline">\(\Theta\)</span> we actually fit the model. So let’s concentrate on optimizing this objective:</p>
<p><span class="math display">\[
\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{q(z \mid x, \Lambda)} \left[\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right]
\]</span></p>
<p>Let’s find gradients of this objective:</p>
<p><span class="math display">\[
\begin{align}
\nabla_{\Lambda} \mathcal{L}(\Theta, \Lambda)
&= \nabla_{\Lambda} \int q(z \mid x, \Lambda) \left[\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right] dz \\
&= \int \nabla_{\Lambda} q(z \mid x, \Lambda) \left[\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right] dz - \int q(z \mid x, \Lambda) \nabla_{\Lambda} \log q(z \mid x, \Lambda) dz \\
&= \mathbb{E}_{q} \left[\frac{\nabla_{\Lambda} q(z \mid x, \Lambda)}{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right] - \int q(z \mid x, \Lambda) \frac{\nabla_{\Lambda} q(z \mid x, \Lambda)}{q(z \mid x, \Lambda)} dz \\
&= \mathbb{E}_{q} \left[\nabla_{\Lambda} \log q(z \mid x, \Lambda) \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right] - \int \nabla_{\Lambda} q(z \mid x, \Lambda) dz \\
&= \mathbb{E}_{q} \left[\nabla_{\Lambda} \log q(z \mid x, \Lambda) \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right] - \nabla_{\Lambda} \overbrace{\int q(z \mid x, \Lambda) dz}^{=1} \\
&= \mathbb{E}_{q} \left[\nabla_{\Lambda} \log q(z \mid x, \Lambda) \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right]
\end{align}
\]</span></p>
<p>In statistics, <span class="math inline">\(\nabla_\Lambda \log q(z \mid x, \Lambda)\)</span> is known as the <a href="https://en.wikipedia.org/wiki/Score_(statistics)">score function</a>. For more on this “trick” see <a href="http://blog.shakirm.com/2015/11/machinelearningtrickoftheday5logderivativetrick/">a blogpost by Shakir Mohamed</a>. In many cases of practical interest <span class="math inline">\(\log p(x, z \mid \Theta)\)</span> is too complicated for this expectation to be computed in closed form. Recall that we have already used stochastic optimization successfully, so we can settle for just an estimate of the true gradient. We get one by approximating the expectation with Monte Carlo estimates using <span class="math inline">\(L\)</span> samples <span class="math inline">\(z^{(l)} \sim q(z \mid x, \Lambda)\)</span> (in practice we sometimes use just <span class="math inline">\(L=1\)</span> sample, expecting correct averaging to happen automagically due to the use of minibatches):</p>
<p><span class="math display">\[
\nabla_{\Lambda} \mathcal{L}(\Theta, \Lambda)
\approx \frac{1}{L} \sum_{l=1}^L \nabla_{\Lambda} \log q(z^{(l)} \mid x, \Lambda) \log \frac{p(x, z^{(l)} \mid \Theta)}{q(z^{(l)} \mid x, \Lambda)}
\]</span></p>
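<p>To make the estimator concrete, here’s a toy example (mine, not from the paper) for a one-dimensional <span class="math inline">\(q(z \mid \Lambda) = \mathcal{N}(\Lambda, 1)\)</span>, whose score function is simply <span class="math inline">\(z - \Lambda\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Score-function estimator sketch (toy problem):
# for q(z | lam) = N(lam, 1) the score is grad_lam log q(z | lam) = z - lam,
# so grad_lam E_q[f(z)] is estimated by mean( (z - lam) * f(z) ).
# With f(z) = z the true gradient is d(lam)/d(lam) = 1.
lam = 0.3
z = lam + rng.standard_normal(500_000)
grad_estimate = float(((z - lam) * z).mean())   # close to 1, but noisy
```

<p>Even in this one-dimensional toy the estimate is noticeably noisier than a reparametrized one would be, which is exactly the motivation for the variance reduction techniques below.</p>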
<p>For the model parameters <span class="math inline">\(\Theta\)</span> the gradients look even simpler, as we don’t need to differentiate w.r.t. the expectation distribution’s parameters:</p>
<p><span class="math display">\[
\begin{align}
\nabla_{\Theta} \mathcal{L}(\Theta, \Lambda)
&= \mathbb{E}_{q} \nabla_{\Theta} \log p(x, z \mid \Theta)
\approx \frac{1}{L} \sum_{l=1}^L \nabla_{\Theta} \log p(x, z^{(l)} \mid \Theta)
\end{align}
\]</span></p>
<p>We can even “naturalize” these gradients by premultiplying by the inverse Fisher Information Matrix <span class="math inline">\(\mathcal{I}(\Lambda)^{-1}\)</span>. And that’s it! Much simpler than before, right? Of course, there’s no free lunch, so there must be a catch… And there is: the performance of stochastic optimization methods crucially depends on the variance of the gradient estimators. It makes perfect sense: the higher the variance — the less information about the step direction we get. And unfortunately, in practice the aforementioned estimator based on the score function has impractically high variance. Luckily, many variance reduction techniques are known in the Monte Carlo community; we now describe some of them.</p>
<p>The first technique we’ll describe is <strong>Rao-Blackwellization</strong>. The idea is simple: if it’s possible to compute the expectation w.r.t. some of the random variables analytically, you should do it. If you think of it, it’s obvious advice, as you essentially reduce the amount of randomness in your Monte Carlo estimates. But let’s put it more formally: we use the chain rule to rewrite the joint expectation as a marginal expectation of a conditional one:</p>
<p><span class="math display">\[
\mathbb{E}_{X, Y} f(X, Y) = \mathbb{E}_X \left[ \mathbb{E}_{Y \mid X} f(X, Y) \right]
\]</span></p>
<p>Let’s see what happens with variance (in scalar case) when we estimate expectation of <span class="math inline">\(\mathbb{E}_{Y \mid X} f(X, Y)\)</span> instead of expectation of <span class="math inline">\(f(X, Y)\)</span>:</p>
<p><span class="math display">\[
\begin{align}
\text{Var}_X(\mathbb{E}_{Y \mid X} f(X, Y))
&= \mathbb{E}_X (\mathbb{E}_{Y \mid X} f(X, Y))^2 - (\mathbb{E}_{X, Y} f(X, Y))^2 \\
&= \text{Var}_{X,Y}(f(X, Y)) - \mathbb{E}_X \left(\mathbb{E}_{Y \mid X} f(X, Y)^2 - (\mathbb{E}_{Y \mid X} f(X, Y))^2 \right) \\
&= \text{Var}_{X,Y}(f(X, Y)) - \mathbb{E}_X \text{Var}_{Y\mid X} (f(X, Y))
\end{align}
\]</span></p>
<p>This formula says that Rao-Blackwellizing an estimator reduces its variance by <span class="math inline">\(\mathbb{E}_X \text{Var}_{Y\mid X} (f(X, Y))\)</span>. Indeed, you can think of this term as a measure of how much randomness <span class="math inline">\(Y\)</span> contributes, beyond <span class="math inline">\(X\)</span>, to computing <span class="math inline">\(f(X, Y)\)</span>. Suppose <span class="math inline">\(Y = X\)</span>: then you have <span class="math inline">\(\mathbb{E}_X f(X, X)\)</span>, and taking the expectation w.r.t. <span class="math inline">\(Y\)</span> does not reduce the amount of randomness in the estimator. And this is what the formula tells us, as <span class="math inline">\(\text{Var}_{Y \mid X} f(X, Y)\)</span> would be 0 in this case. Here’s another example: suppose <span class="math inline">\(f\)</span> does not use <span class="math inline">\(X\)</span> at all: then only the randomness in <span class="math inline">\(Y\)</span> affects the estimate, and after Rao-Blackwellization we expect the variance to drop to 0. The formula agrees with our expectations, as <span class="math inline">\(\mathbb{E}_X \text{Var}_{Y \mid X} f(X, Y) = \text{Var}_Y f(X, Y)\)</span> since <span class="math inline">\(f(X, Y)\)</span> does not depend on <span class="math inline">\(X\)</span>.</p>
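<p>A quick numerical illustration (my own toy setup) of the variance formula: take independent standard normals and <span class="math inline">\(f(X, Y) = X + Y\)</span>, where the inner expectation is available in closed form.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Rao-Blackwellization demo (toy problem): X, Y ~ N(0, 1) independent,
# f(X, Y) = X + Y.  The inner expectation E_{Y|X} f(X, Y) = X is known
# in closed form, so the Rao-Blackwellized estimator averages X instead
# of X + Y.  Its variance is Var(X) = 1 versus Var(X + Y) = 2, matching
# Var(f) - E_X Var_{Y|X}(f) = 2 - 1.
x = rng.standard_normal(300_000)
y = rng.standard_normal(300_000)
var_naive = float(np.var(x + y))   # close to 2
var_rb = float(np.var(x))          # close to 1
```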
<p>The next technique is <strong>Control Variates</strong>, which is slightly less intuitive. The idea is that we can subtract a scaled zero-mean function <span class="math inline">\(\alpha h(X)\)</span>, which preserves the expectation but can reduce the variance. Again, in the scalar case</p>
<p><span class="math display">\[
\text{Var}(f(X) - \alpha h(X)) = \text{Var}(f(X)) - 2 \alpha \text{Cov}(f(X), h(X)) + \alpha^2 \text{Var}(h(X))
\]</span></p>
<p>The optimal coefficient is <span class="math inline">\(\alpha^* = \frac{\text{Cov}(f(X), h(X))}{\text{Var}(h(X))}\)</span>. This formula reflects an obvious fact: if we want to reduce the variance, <span class="math inline">\(h(X)\)</span> must be correlated with <span class="math inline">\(f(X)\)</span>. The sign of the correlation does not matter, as <span class="math inline">\(\alpha^*\)</span> will adjust. BTW, in reinforcement learning the subtracted term is called a <strong>baseline</strong>.</p>
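<p>Here’s a minimal numeric sketch of a control variate (a standard textbook example, not from the post): estimating <span class="math inline">\(\mathbb{E} e^X\)</span> for <span class="math inline">\(X \sim U(0, 1)\)</span> with the zero-mean control <span class="math inline">\(h(X) = X - 1/2\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100_000)

f = np.exp(x)        # target: E[e^X] = e - 1
h = x - 0.5          # control variate with known zero mean

# Estimate the optimal coefficient alpha* = Cov(f, h) / Var(h) from the sample.
alpha = np.cov(f, h)[0, 1] / h.var()

plain = f
controlled = f - alpha * h

print(plain.mean(), controlled.mean())  # both ~ e - 1
print(plain.var(), controlled.var())    # the variance drops sharply
```

Since <span class="math inline">\(e^X\)</span> is almost linear on <span class="math inline">\([0, 1]\)</span>, the correlation is very high and the residual variance is tiny.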
<p>As we have already learned, <span class="math inline">\(\mathbb{E}_{q(z \mid x, \Lambda)} \nabla_\Lambda \log q(z \mid x, \Lambda) = 0\)</span>, so the score function is a good candidate for the control variate <span class="math inline">\(h\)</span>. Therefore our estimates become</p>
<p><span class="math display">\[
\nabla_{\Lambda} \mathcal{L}(\Theta, \Lambda)
\approx \frac{1}{L} \sum_{l=1}^L \nabla_{\Lambda} \log q(z^{(l)} \mid x, \Lambda) \circ \left(\log \frac{p(x, z^{(l)} \mid \Theta)}{q(z^{(l)} \mid x, \Lambda)} - \alpha^* \right)
\]</span></p>
<p>Where <span class="math inline">\(\circ\)</span> is pointwise multiplication and <span class="math inline">\(\alpha^*\)</span> is a vector of the same dimensionality as <span class="math inline">\(\Lambda\)</span>, with <span class="math inline">\(\alpha^*_i\)</span> being a baseline for the variational parameter <span class="math inline">\(\Lambda_i\)</span>:</p>
<p><span class="math display">\[
\alpha^*_i = \frac{\text{Cov}(\nabla_{\Lambda_i} \log q(z \mid x, \Lambda)\left( \log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right), \nabla_{\Lambda_i} \log q(z \mid x, \Lambda))}{\text{Var}(\nabla_{\Lambda_i} \log q(z \mid x, \Lambda))}
\]</span></p>
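<p>A sketch of the score-function estimator with a constant baseline, on a hypothetical toy problem rather than the variational model above: <span class="math inline">\(q(z \mid \theta) = \mathcal{N}(\theta, 1)\)</span> with learning signal <span class="math inline">\(f(z) = z^2\)</span>, whose true gradient is <span class="math inline">\(2\theta\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000

# q(z | theta) = N(theta, 1); target gradient: d/dtheta E_q[z^2] = 2*theta.
z = rng.normal(theta, 1.0, n)
score = z - theta                 # grad_theta log q(z | theta)
signal = z ** 2                   # the learning signal

baseline = signal.mean()          # constant baseline (a running average in practice)
grad_plain = score * signal
grad_base = score * (signal - baseline)

print(grad_plain.mean(), grad_base.mean())  # both ~ 2 * theta = 3.0
print(grad_plain.var(), grad_base.var())    # the baseline lowers the variance
```

Both estimators are unbiased; only the variance differs.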
<h3>
Neural Variational Inference and Learning
</h3>
<p>Hoooray, neural networks! In this section I’ll briefly describe a variance reduction technique introduced by A. Mnih and K. Gregor in <a href="https://arxiv.org/abs/1402.0030">Neural Variational Inference and Learning in Belief Networks</a>. The idea is surprisingly simple: why not learn a baseline <span class="math inline">\(\alpha\)</span> using a neural network?</p>
<p><span class="math display">\[
\nabla_{\Lambda} \mathcal{L}(\Theta, \Lambda)
\approx \frac{1}{L} \sum_{l=1}^L \nabla_{\Lambda} \log q(z^{(l)} \mid x, \Lambda) \circ \left(\log \frac{p(x, z^{(l)} \mid \Theta)}{q(z^{(l)} \mid x, \Lambda)} - \alpha^* - \alpha(x) \right)
\]</span></p>
<p>Where <span class="math inline">\(\alpha(x)\)</span> is a neural network trained to minimize</p>
<p><span class="math display">\[
\mathbb{E}_{q(z \mid x, \Lambda)} \left( \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} - \alpha^* - \alpha(x) \right)^2
\]</span></p>
<p>What’s the motivation for this objective? The gradient step along <span class="math inline">\(\nabla_\Lambda \mathcal{L}(\Theta, \Lambda)\)</span> can be seen as pushing <span class="math inline">\(q(z\mid x, \Lambda)\)</span> towards <span class="math inline">\(p(x, z \mid \Theta)\)</span>. Since <span class="math inline">\(q\)</span> has to be normalized, like any other proper distribution, it’s actually pushed towards the true posterior <span class="math inline">\(p(z \mid x, \Theta)\)</span>. We can rewrite the gradient <span class="math inline">\(\nabla_\Lambda \mathcal{L}(\Theta, \Lambda)\)</span> as</p>
<p><span class="math display">\[
\begin{align}
\nabla_{\Lambda} \mathcal{L}(\Theta, \Lambda)
&= \mathbb{E}_{q} \left[\nabla_{\Lambda} \log q(z \mid x, \Lambda) \left(\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right) \right] \\
&= \mathbb{E}_{q} \left[\nabla_{\Lambda} \log q(z \mid x, \Lambda) \left(\log p(z \mid x, \Theta) - \log q(z \mid x, \Lambda) + \log p(x \mid \Theta) \right) \right]
\end{align}
\]</span></p>
<p>While this additional <span class="math inline">\(\log p(x \mid \Theta)\)</span> term does not contribute to the expectation, it affects the variance of the estimator. Therefore, <span class="math inline">\(\alpha(x)\)</span> is supposed to estimate the marginal log-likelihood <span class="math inline">\(\log p(x \mid \Theta)\)</span>.</p>
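<p>NVIL’s baseline is a full neural network; as a minimal sketch of the same squared-error training (my own simplification, not the paper’s architecture), here a linear baseline <span class="math inline">\(b(x) = wx + c\)</span> is fitted by SGD to a hypothetical noisy learning signal whose conditional mean is <span class="math inline">\(2x + 1\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learning signal with conditional mean 2*x + 1 plus noise.
# The input-dependent baseline b(x) = w*x + c is trained, NVIL-style, by
# SGD on the squared error between the signal and the baseline.
w, c, lr = 0.0, 0.0, 0.05
for _ in range(20_000):
    x = rng.uniform(-1.0, 1.0)
    s = 2.0 * x + 1.0 + rng.normal(0.0, 0.1)
    err = s - (w * x + c)
    w += lr * err * x        # gradient step on 0.5 * err**2 w.r.t. w
    c += lr * err            # ... and w.r.t. c

print(w, c)  # ~2.0 and ~1.0
```

The baseline converges to the conditional mean of the signal, which is exactly what minimizing the squared error asks for.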
<p>The paper also lists several other variance reduction techniques that can be used in combination with the neural networkbased baseline:</p>
<ul>
<li>
<strong>Constant baseline</strong> — an analogue of <em>Control Variates</em>; uses a running average of <span class="math inline">\(\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda)\)</span> as a baseline
</li>
<li>
<strong>Variance normalization</strong> — normalizes the learning signal to unit variance, which is equivalent to an adaptive learning rate
</li>
<li>
<strong>Local learning signals</strong> — falls outside the scope of this post, as it requires model-specific analysis and alterations, and can’t be used in a black-box regime
</li>
</ul>
<p>Tue, 05 Jul 2016 00:00:00 UT / Artem / http://artem.sobolev.name/posts/20160705neuralvariationalinferenceblackbox.html</p>
<h3>
Neural Variational Inference: Scaling Up
</h3>
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/bYNrSVQbgPU/20160704neuralvariationalinferencestochasticvariationalinference.html
<p>In the <a href="/posts/20160701neuralvariationalinferenceclassicaltheory.html">previous post</a> I covered the well-established classical theory developed in the early 2000s. Since then technology has made huge progress: now we have much more data, and a great need to process it, and to process it fast. In the big data era we have huge datasets and cannot afford too many full passes over them, which might render classical VI methods impractical. Recently M. Hoffman et al. dissected classical Mean-Field VI to introduce stochasticity right into its heart, which resulted in <a href="https://arxiv.org/abs/1206.7051">Stochastic Variational Inference</a>.</p>
<!--more-->
<h3>
Stochastic Variational Inference
</h3>
<p>We start with the model assumptions: we have two types of latent variables, a global latent variable <span class="math inline">\(\beta\)</span> and a bunch of local variables <span class="math inline">\(z_n\)</span>, one for each observation <span class="math inline">\(x_n\)</span>. Recalling our GMM example, <span class="math inline">\(\beta\)</span> can be thought of as the mixture weights <span class="math inline">\(\pi\)</span>, and the <span class="math inline">\(z_n\)</span> are membership indicators, as previously. These variables are assumed to come from some exponential family distribution:</p>
<p><span class="math display">\[
p(x_n, z_n \mid \beta) = h(x_n, z_n) \exp \left( \beta^T t(x_n, z_n) - a_l(\beta) \right) \\
\\
p(\beta) = h(\beta) \exp(\alpha^T t(\beta) - a_g(\alpha))
\]</span></p>
<p>Where <span class="math inline">\(t(\cdot)\)</span> and <span class="math inline">\(h(\cdot)\)</span> are overloaded by their argument, so <span class="math inline">\(t(\beta)\)</span> and <span class="math inline">\(t(z_{nj})\)</span> correspond to two different functions. <span class="math inline">\(t(\cdot)\)</span> gives a <strong>natural parameter</strong> and also the <strong>sufficient statistics</strong>. <span class="math inline">\(a_g\)</span> and <span class="math inline">\(a_l\)</span> are log-normalizing constants, which for exponential family distributions have an interesting property: the gradient of the log-normalizing constant is the expectation of the sufficient statistics, e.g. <span class="math inline">\(\nabla_\alpha a_g(\alpha) = \mathbb{E} t(\beta)\)</span>.</p>
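<p>The identity <span class="math inline">\(\nabla_\alpha a_g(\alpha) = \mathbb{E} t\)</span> is easy to check numerically; here is a tiny sketch for a Bernoulli distribution written in natural form (my example, not the post’s):</p>

```python
import numpy as np

# Bernoulli in natural form: p(t) = exp(eta*t - a(eta)) with a(eta) = log(1 + e^eta).
eta = 0.7

def a(e):
    return np.log1p(np.exp(e))

# Central-difference gradient of the log-normalizer...
grad_a = (a(eta + 1e-6) - a(eta - 1e-6)) / 2e-6

# ...matches the expected sufficient statistic E[t] = sigmoid(eta).
mean_t = 1.0 / (1.0 + np.exp(-eta))

print(grad_a, mean_t)  # both ~0.668
```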
<p>From these assumptions we can derive <em>complete conditionals</em> (conditional distribution given all other hidden variables and observables) for <span class="math inline">\(\beta\)</span> and <span class="math inline">\(z_{nj}\)</span>:</p>
<p><span class="math display">\[
\begin{align}
p(\beta \mid x, z, \alpha)
&\propto \prod_{n=1}^N p(x_n, z_n \mid \beta) p(\beta \mid \alpha) \\
&= h(\beta) \prod_{n=1}^N h(x_n, z_n) \exp \left( \beta^T \sum_{n=1}^N t(x_n, z_n) - N a_l(\beta) + \alpha^T t(\beta) - a_g(\alpha) \right) \\
&\propto h(\beta) \exp \left( \eta_g(x, z, \alpha)^T t(\beta) \right)
\end{align}
\]</span></p>
<p>Where <span class="math inline">\(t(\beta) = (\beta, -a_l(\beta))\)</span> and <span class="math inline">\(\eta_g(x, z, \alpha) = (\alpha_1 + \sum_{n=1}^N t(x_n, z_n), \alpha_2 + N)\)</span>. We see that the (unnormalized) posterior distribution for <span class="math inline">\(\beta\)</span> has the same functional form as the (unnormalized) prior <span class="math inline">\(p(\beta)\)</span>, therefore after normalization it’d be</p>
<p><span class="math display">\[
p(\beta \mid x, z, \alpha)
= h(\beta) \exp \left( \eta_g(x, z, \alpha)^T t(\beta)  a_g(\eta_g(x, z, \alpha)) \right)
\]</span></p>
<p>The same applies to local variables <span class="math inline">\(z_{nj}\)</span>:</p>
<p><span class="math display">\[
p(z_{nj} \mid x_n, z_{n,-j}, \beta)
\propto h(z_{nj}) \exp \left( \eta_l(x_n, z_{n,-j}, \beta)^T t(z_{nj}) \right)
\]</span> Hence <span class="math display">\[
p(z_{nj} \mid x_n, z_{n,-j}, \beta)
= h(z_{nj}) \exp \left( \eta_l(x_n, z_{n,-j}, \beta)^T t(z_{nj}) - a_m(\eta_l(x_n, z_{n,-j}, \beta)) \right)
\]</span></p>
<p>Even though we’ve managed to find the complete conditional for <span class="math inline">\(\beta\)</span>, it might be intractable to find the posterior for all latent variables <span class="math inline">\(p(\beta, z \mid x, \alpha)\)</span>. We therefore turn to the mean field approximation:</p>
<p><span class="math display">\[
q(z, \beta \mid \Lambda) = q(\beta \mid \lambda) \prod_{n=1}^N \prod_{j=1}^J q(z_{nj} \mid \phi_{nj})
\]</span></p>
<p>We assume these marginal distributions come from the exponential family:</p>
<p><span class="math display">\[
q(\beta \mid \lambda) = h(\beta) \exp(\lambda^T t(\beta) - a_g(\lambda)) \\
q(z_{nj} \mid \phi_{nj}) = h(z_{nj}) \exp(\phi_{nj}^T t(z_{nj}) - a_m(\phi_{nj}))
\]</span></p>
<p>Let’s now find the optimal variational parameters by optimizing the ELBO <span class="math inline">\(\mathcal{L}(\Theta, \Lambda)\)</span> (<span class="math inline">\(\Theta\)</span> is the set of model parameters, here just <span class="math inline">\(\alpha\)</span>, and <span class="math inline">\(\Lambda\)</span> contains the variational parameters <span class="math inline">\(\phi\)</span> and <span class="math inline">\(\lambda\)</span>) with respect to <span class="math inline">\(\lambda\)</span> and <span class="math inline">\(\phi_{nj}\)</span>:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}(\lambda)
&= \mathbb{E}_{q} \left( \log p(x, z, \beta) - \log q(\beta) - \log q(z) \right)
= \mathbb{E}_{q} \left( \log p(\beta \mid x, z) - \log q(\beta) \right) + \text{const} \\
&= \mathbb{E}_{q} \left( \eta_g(x, z, \alpha)^T t(\beta) - \lambda^T t(\beta) + a_g(\lambda) \right) + \text{const} \\
&= \left(\mathbb{E}_{q(z)} \eta_g(x, z, \alpha) - \lambda \right)^T \mathbb{E}_{q(\beta)} t(\beta) + a_g(\lambda) + \text{const} \\
&= \left(\mathbb{E}_{q(z)} \eta_g(x, z, \alpha) - \lambda \right)^T \nabla_\lambda a_g(\lambda) + a_g(\lambda) + \text{const}
\end{align}
\]</span></p>
<p>Where we used the aforementioned property of exponential family distributions: <span class="math inline">\(\nabla_\lambda a_g(\lambda) = \mathbb{E}_{q(\beta)} t(\beta)\)</span>. The gradient then is <span class="math display">\[
\nabla_\lambda \mathcal{L}(\lambda)
= \nabla_\lambda^2 a_g(\lambda) \left(\mathbb{E}_{q(z)} \eta_g(x, z, \alpha) - \lambda \right)
\]</span></p>
<p>After setting it to zero we get an update for global latent variables: <span class="math inline">\(\lambda = \mathbb{E}_{q(z)} \eta_g(x, z, \alpha)\)</span>. Following the same reasoning we derive the optimal update for <span class="math inline">\(\phi_{nj}\)</span>:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}(\phi_{nj})
&= \mathbb{E}_{q} \left( \log p(z_{nj} \mid x_n, z_{n,-j}, \beta) - \log q(z_{nj}) \right) + \text{const} \\
&= \mathbb{E}_{q} \left( \eta_l(x_n, z_{n,-j}, \beta)^T t(z_{nj}) - \phi_{nj}^T t(z_{nj}) + a_m(\phi_{nj})\right) + \text{const} \\
&= \left(\mathbb{E}_{q(\beta) q(z_{n,-j})} \eta_l(x_n, z_{n,-j}, \beta) - \phi_{nj} \right)^T \mathbb{E}_{q(z_{nj})} t(z_{nj}) + a_m(\phi_{nj}) + \text{const} \\
\end{align}
\]</span></p>
<p>The gradient then is <span class="math inline">\(\nabla_{\phi_{nj}} \mathcal{L}(\phi) = \nabla_{\phi_{nj}}^2 a_m(\phi_{nj}) \left(\mathbb{E}_{q(\beta) q(z_{n,-j})} \eta_l(x_n, z_{n,-j}, \beta) - \phi_{nj} \right)\)</span>, and the update is <span class="math inline">\(\phi_{nj} = \mathbb{E}_{q(\beta) q(z_{n,-j})} \eta_l(x_n, z_{n,-j}, \beta)\)</span>.</p>
<p>So far we have found the mean-field updates, as well as the corresponding gradients of the ELBO for the variational parameters <span class="math inline">\(\lambda\)</span> and <span class="math inline">\(\phi_{nj}\)</span>. The next step is to transform these gradients into <strong>natural gradients</strong>. Intuitively, the classical gradient defines a local linear approximation, where the notion of locality comes from the Euclidean space. However, the parameters influence the ELBO only through the distributions <span class="math inline">\(q\)</span>, so we might like to alter our idea of locality based on how much the distributions change. This is what the natural gradient does: it defines a local linear approximation where locality means small distance (symmetrized KL-divergence) between distributions. There’s a great formal explanation in the paper, and if you want to read more on that matter, I refer you to a great post by Roger Grosse, <a href="http://www.metacademy.org/roadmaps/rgrosse/dgml">Differential geometry for machine learning</a>.</p>
<p>The natural gradient can be obtained from the usual gradient using a simple linear transformation:</p>
<p><span class="math display">\[
\nabla_\lambda^\text{N} f(\lambda) = \mathcal{I}(\lambda)^{-1} \nabla_{\lambda} f(\lambda)
\]</span></p>
<p>Where <span class="math inline">\(\mathcal{I}(\lambda) := \mathbb{E}_{q(\beta \mid \lambda)} \left[ \nabla_\lambda \log q(\beta \mid \lambda) (\nabla_\lambda \log q(\beta \mid \lambda))^T \right]\)</span> is the Fisher information matrix. Here I considered the parameter <span class="math inline">\(\lambda\)</span> of the distribution <span class="math inline">\(q(\beta \mid \lambda)\)</span>, but you get the idea. For an exponential family distribution this information matrix takes an especially simple form:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{I}(\lambda)
&= \mathbb{E}_q (t(\beta) - \nabla_\lambda a_g(\lambda)) (t(\beta) - \nabla_\lambda a_g(\lambda))^T
= \mathbb{E}_q (t(\beta) - \mathbb{E}_q t(\beta)) (t(\beta) - \mathbb{E}_q t(\beta))^T \\
&= \text{Cov}_q (t(\beta))
= \nabla_\lambda^2 a_g(\lambda)
\end{align}
\]</span></p>
<p>Where we’ve used another <a href="https://en.wikipedia.org/wiki/Exponential_family#Differential_identities_for_cumulants">differential identity for exponential family</a>. All these calculations lead us to the natural gradients of ELBO for variational parameters:</p>
<p><span class="math display">\[
\nabla_\lambda^\text{N} \mathcal{L}(\lambda) = \mathbb{E}_{q(z)} \eta_g(x, z, \alpha) - \lambda \\
\nabla_{\phi_{nj}}^\text{N} \mathcal{L}(\phi_{nj}) = \mathbb{E}_{q(\beta) q(z_{n,-j})} \eta_l(x_n, z_{n,-j}, \beta) - \phi_{nj}
\]</span></p>
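<p>The key ingredient above, <span class="math inline">\(\mathcal{I}(\lambda) = \nabla_\lambda^2 a_g(\lambda)\)</span>, can be sanity-checked numerically; here is a hypothetical Bernoulli example (mine, not from the post):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli with natural parameter eta: log q(t) = eta*t - log(1 + e^eta).
eta = 0.3
p = 1.0 / (1.0 + np.exp(-eta))                 # sigmoid(eta) = E[t]

# Monte Carlo Fisher information E[score^2], with score = t - sigmoid(eta).
t = (rng.uniform(size=500_000) < p).astype(float)
fisher_mc = np.mean((t - p) ** 2)

# Hessian of the log-normalizer a(eta) = log(1 + e^eta) is sigmoid'(eta).
hessian_a = p * (1.0 - p)

print(fisher_mc, hessian_a)  # both ~0.244
```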
<p>Surprisingly, computation-wise, calculating natural gradients is even simpler than calculating classical gradients! There’s an interesting connection between the mean-field update and a natural gradient step. In particular, if we make a step along the natural gradient with step size equal to 1, we get <span class="math inline">\(\lambda^{\text{new}} = \lambda^{\text{old}} + (\mathbb{E}_{q(z)} \eta_g(x, z, \alpha) - \lambda^{\text{old}}) = \mathbb{E}_{q(z)} \eta_g(x, z, \alpha)\)</span>. The same applies to the parameters <span class="math inline">\(\phi\)</span>. This means that the mean-field updates are exactly natural gradient steps, and vice versa.</p>
<p>Recall that we derived the mean-field updates by finding a minimum of the KL-divergence with the true posterior; that is, in just one step (one update) we arrive at the minimum. Obviously, we have the same in the natural gradient formulation, where just one step brings us to the optimum.</p>
<p>Now, the last component is the stochasticity itself. So far we have only played a little with the mean-field update scheme and discovered its connection to natural gradient optimization. Note that we have two kinds of parameters: the local <span class="math inline">\(\phi_{nj}\)</span> and the global parameter <span class="math inline">\(\lambda\)</span>. The first is easy to optimize over, as it depends on only one sample, the <span class="math inline">\(n\)</span>th sample <span class="math inline">\(x_n\)</span>. The second, though, needs to incorporate information from all the samples, which is computationally prohibitive in the large-scale regime. Luckily, now that we know the equivalence between the mean-field update and the natural gradient step, we can borrow ideas from stochastic optimization to make this process more scalable.</p>
<p>Let’s first reformulate the ELBO to include the sum over samples <span class="math inline">\(x_n\)</span>:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}(\Theta, \Lambda)
&= \mathbb{E}_{q} \left[ \log p(\beta \mid \alpha) - \log q(\beta \mid \lambda) + \sum_{n=1}^N \left(\log p(x_n, z_n \mid \beta) - \log q(z_n \mid \phi_n) \right) \right] \\
& = \mathbb{E}_{q} \left[ \log p(\beta \mid \alpha) - \log q(\beta \mid \lambda) + N \mathbb{E}_{I} \left(\log p(x_I, z_I \mid \beta) - \log q(z_I \mid \phi_I) \right) \right]
\end{align}
\]</span></p>
<p>Where <span class="math inline">\(I \sim \text{Unif}\{1, \dots, N\}\)</span> is a uniformly distributed index of a sample. Now let’s estimate <span class="math inline">\(\mathcal{L}\)</span> using a minibatch <span class="math inline">\(S\)</span> of uniformly chosen indices (assume <span class="math inline">\(N\)</span> is divisible by the minibatch size <span class="math inline">\(S\)</span>); this results in an unbiased estimator (its gradient is also unbiased, so we can maximize the true ELBO by maximizing the estimate). The authors of the paper start with a single-sample derivation and then extend it to minibatches, but I decided to go straight to the minibatch case:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}_S(\Theta, \Lambda)
& := \mathbb{E}_{q} \left[ \log p(\beta \mid \alpha) - \log q(\beta \mid \lambda) + \frac{N}{S} \sum_{i \in S} \left(\log p(x_i, z_i \mid \beta) - \log q(z_i \mid \phi_i) \right) \right] \\
& = \mathbb{E}_{q} \left[ \log p(\beta \mid \alpha) - \log q(\beta \mid \lambda) + \sum_{n=1}^{N / S} \sum_{i \in S} \left(\log p(x_i, z_i \mid \beta) - \log q(z_i \mid \phi_i) \right) \right]
\end{align}
\]</span></p>
<p>This estimate is exactly <span class="math inline">\(\mathcal{L}(\Theta, \Lambda)\)</span> calculated on a sample consisting of <span class="math inline">\(\{x_i, z_i\}_{i \in S}\)</span> repeated <span class="math inline">\(N / S\)</span> times. Hence its natural gradient w.r.t. <span class="math inline">\(\lambda\)</span> is</p>
<p><span class="math display">\[
\nabla_\lambda^\text{N} \mathcal{L}_S(\lambda) = \mathbb{E}_{q(z)} \eta_g(\{x_S\}_{n=1}^{N/S}, \{z_S\}_{n=1}^{N/S}, \alpha) - \lambda \\
\]</span></p>
<p>One important note: for stochastic optimization we can’t use a constant step size. As the Robbins-Monro conditions suggest, we need to use a schedule <span class="math inline">\(\rho_t\)</span> such that <span class="math inline">\(\sum \rho_t = \infty\)</span> and <span class="math inline">\(\sum \rho_t^2 < \infty\)</span>. Then the update is <span class="math inline">\(\lambda^{\text{new}} = \lambda^{\text{old}} + \rho_t \nabla_\lambda^\text{N} \mathcal{L}_S(\lambda) = (1 - \rho_t) \lambda^{\text{old}} + \rho_t \mathbb{E}_{q(z)} \eta_g(\{x_S\}_{n=1}^{N/S}, \{z_S\}_{n=1}^{N/S}, \alpha)\)</span></p>
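<p>A minimal sketch of such a Robbins-Monro-style stochastic update (a toy example of mine, tracking a fixed target from noisy observations, not the full SVI procedure):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Robbins-Monro schedule rho_t = (t + tau)^(-kappa) with 0.5 < kappa <= 1:
# the steps sum to infinity while their squares stay summable.
tau, kappa = 1.0, 0.7
target = 5.0        # stands in for the exact natural-gradient target E_q eta_g
lam = 0.0

for t in range(50_000):
    rho = (t + tau) ** (-kappa)
    noisy_target = target + rng.normal(0.0, 1.0)     # minibatch noise
    lam = (1.0 - rho) * lam + rho * noisy_target     # the SVI-style convex update

print(lam)  # converges to ~5.0
```

A constant step size would keep oscillating at a noise-dependent level; the decaying schedule averages the noise away.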
<p>Finally, we have the following optimization scheme:</p>
<ul>
<li>
Start with random initialization for <span class="math inline">\(\lambda^{(0)}\)</span>
</li>
<li>
For <span class="math inline">\(t\)</span> from 0 to MAX_ITER
<ol>
<li>
Sample <span class="math inline">\(S \sim \text{Unif}\{1, \dots, N\}^{S}\)</span>
</li>
<li>
For each sample <span class="math inline">\(i \in S\)</span> update the local variational parameter <span class="math inline">\(\phi_{i,j} = \mathbb{E}_{q(\beta) q(z_{i,-j})} \eta_l(x_i, z_{i,-j}, \beta)\)</span>
</li>
<li>
Replicate the sample <span class="math inline">\(N / S\)</span> times and compute the global update <span class="math inline">\(\hat \lambda = \mathbb{E}_{q(z)} \eta_g(\{x_S\}_{n=1}^{N/S}, \{z_S\}_{n=1}^{N/S}, \alpha)\)</span>
</li>
<li>
Update the global parameter <span class="math inline">\(\lambda^{(t+1)} = (1-\rho_t) \lambda^{(t)} + \rho_t \hat \lambda\)</span>
</li>
</ol>
</li>
</ul>
<p>Mon, 04 Jul 2016 00:00:00 UT / Artem / http://artem.sobolev.name/posts/20160704neuralvariationalinferencestochasticvariationalinference.html</p>
<h3>
Neural Variational Inference: Classical Theory
</h3>
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/tMWZqa8pvO0/20160701neuralvariationalinferenceclassicaltheory.html
<p>As a member of the <a href="http://bayesgroup.ru/">Bayesian methods research group</a> I’m heavily interested in the Bayesian approach to machine learning. One of the strengths of this approach is the ability to work with hidden (unobserved) variables which are interpretable. This power, however, comes at the cost of generally intractable exact inference, which limits the scope of solvable problems.</p>
<p>Another topic which has gained lots of momentum in Machine Learning recently is, of course, Deep Learning. With Deep Learning we can now build big and complex models that outperform most hand-engineered approaches, given lots of data and computational power. The fact that Deep Learning needs a considerable amount of data also requires these methods to be scalable — a really nice property for any algorithm to have, especially in the Big Data epoch.</p>
<p>Given how appealing both topics are, it’s not a surprise there’s been some work to marry the two recently. In this <a href="/tags/modern%20variational%20inference%20series.html">series</a> of blog posts I’d like to summarize recent advances, particularly in variational inference. This is not meant to be an introductory discussion, as prior familiarity with the classical topics (latent variable models, <a href="https://en.wikipedia.org/wiki/Variational_Bayesian_methods">Variational Inference, Mean-field approximation</a>) is required, though I’ll introduce these ideas anyway, just as a reminder and to set up the notation.</p>
<!--more-->
<h3>
Latent Variables Models
</h3>
<p>Suppose you have a probabilistic model that’s easy to describe using some auxiliary variables <span class="math inline">\(Z\)</span> that you don’t observe directly (or would even like to infer given the data). One classical example of this setup is Gaussian Mixture Modeling: we have <span class="math inline">\(K\)</span> components in a mixture, and <span class="math inline">\(z_n\)</span> is a <a href="https://en.wikipedia.org/wiki/Onehot">one-hot</a> vector of dimensionality <span class="math inline">\(K\)</span> indicating which component an observation <span class="math inline">\(x_n\)</span> belongs to. Then, conditioned on <span class="math inline">\(z_n\)</span>, the distribution of <span class="math inline">\(x_n\)</span> is a usual Gaussian distribution: <span class="math inline">\(p(x_{n} \mid z_{nk} = 1) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\)</span> (here, whenever I refer to a distribution, you should read it as its density. At least a <a href="https://en.wikipedia.org/wiki/Generalized_function">generalized one</a>). Therefore the joint distribution of the model is</p>
<p><span class="math display">\[
p(x, z \mid \Theta) = \prod_{n=1}^N \prod_{k=1}^K \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}} \pi_k^{z_{nk}}
\]</span></p>
<p>Where <span class="math inline">\(\pi\)</span> is a probability distribution over <span class="math inline">\(K\)</span> outcomes, and <span class="math inline">\(\Theta\)</span> is the set of all the model’s parameters (<span class="math inline">\(\pi\)</span>, the <span class="math inline">\(\mu\)</span>s and the <span class="math inline">\(\Sigma\)</span>s).</p>
<p>We’d like to do two things with the model: first, we obviously need to learn the parameters <span class="math inline">\(\Theta\)</span>, and second, we’d like to infer the latent variables <span class="math inline">\(z_n\)</span> to know which cluster the observation <span class="math inline">\(x_n\)</span> belongs to, that is, we need to calculate the distribution of <span class="math inline">\(z_n\)</span> conditioned on <span class="math inline">\(x_n\)</span>: <span class="math inline">\(p(z_n \mid x_n)\)</span>.</p>
<p>We want to learn the parameters <span class="math inline">\(\Theta\)</span>, as usual, by maximizing the log-likelihood. Unfortunately, we don’t know the true assignments <span class="math inline">\(z_n\)</span>, and marginalizing them out as in <span class="math inline">\(p(x_n) = \sum_{k=1}^K p(x_n, z_{nk} = 1)\)</span> is not a good idea, as the resulting optimization problem is hard to work with directly. Instead we decompose the log-likelihood as follows:</p>
<p><span class="math display">\[
\begin{align}
\log p(x)
&= \mathbb{E}_{q(z\mid x)} \overbrace{\log p(x)}^{\text{const in $z$}}
= \mathbb{E}_{q(z\mid x)} \log \frac{p(x, z) q(z\mid x)}{p(z \mid x) q(z\mid x)} \\
&= \mathbb{E}_{q(z\mid x)} \log \frac{p(x, z)}{q(z\mid x)} + D_{KL}(q(z\mid x) \mid\mid p(z \mid x))
\end{align}
\]</span></p>
<p>The second term is a Kullback-Leibler divergence, which is always nonnegative and equals zero iff the distributions are equal almost everywhere: <span class="math inline">\(q(z\mid x) = p(z \mid x)\)</span>. Therefore putting <span class="math inline">\(q(z \mid x) = p(z \mid x)\)</span> eliminates the second term, leaving us with <span class="math inline">\(\log p(x) = \mathbb{E}_{p(z \mid x)} \log \frac{p(x, z)}{p(z \mid x)}\)</span>. Therefore all we need to be able to do is calculate the posterior <span class="math inline">\(p(z \mid x)\)</span> and maximize the expectation. This is how the EM algorithm is derived: at the E-step we calculate the posterior <span class="math inline">\(p(z \mid x, \Theta^{\text{old}})\)</span>, and at the M-step we maximize the expectation <span class="math inline">\(\mathbb{E}_{p(z \mid x, \Theta^{\text{old}})} \log \frac{p(x, z \mid \Theta)}{p(z \mid x, \Theta^{\text{old}})}\)</span> with respect to <span class="math inline">\(\Theta\)</span>, keeping <span class="math inline">\(\Theta^{\text{old}}\)</span> fixed.</p>
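<p>The E/M alternation above can be sketched in a few lines; this is a deliberately minimal toy (1-D data, two components, unit variances held fixed), not a full GMM implementation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: two well-separated Gaussian clusters.
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

# EM for a two-component mixture with fixed unit variances.
mu = np.array([-1.0, 1.0])          # initial means
pi = np.array([0.5, 0.5])           # initial mixture weights
for _ in range(50):
    # E-step: responsibilities p(z_nk = 1 | x_n) under current parameters.
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    pi = nk / len(x)

print(np.sort(mu), pi)  # means near [-2, 3], weights near [0.5, 0.5]
```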
<p>Now, all we are left to do is find the posterior <span class="math inline">\(p(z \mid x)\)</span>, which is given by the following deceivingly simple formula known as Bayes’ rule:</p>
<p><span class="math display">\[
p(z \mid x) = \frac{p(x \mid z) p(z)}{\int p(x \mid z) p(z)dz}
\]</span></p>
<p>Of course, there’s no free lunch, and computing the denominator is intractable in the general case. One <strong>can</strong> compute the posterior exactly when the prior <span class="math inline">\(p(z)\)</span> and the likelihood <span class="math inline">\(p(x \mid z)\)</span> are <a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugate</a> (that is, after multiplying the prior by the likelihood you get the same functional form in <span class="math inline">\(z\)</span> as in the prior, so the posterior comes from the same family as the prior but with different parameters), but many models of practical interest don’t have this property. This is where variational inference comes in.</p>
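<p>A quick numeric illustration of conjugacy (a standard Beta-Bernoulli example, not from the post): the closed-form conjugate update matches brute-force integration of Bayes’ rule:</p>

```python
import numpy as np

# Beta(a, b) prior is conjugate to the Bernoulli likelihood: observing k
# successes in n trials gives the posterior Beta(a + k, b + n - k), so the
# normalizing integral in the denominator never has to be computed.
a, b = 2.0, 2.0
k, n = 7, 10

post_mean = (a + k) / (a + b + n)     # closed-form posterior mean

# The same posterior mean by brute force: discretize Bayes' rule on a grid
# (the uniform grid spacing cancels in the ratio).
theta = np.linspace(1e-6, 1.0 - 1e-6, 100_001)
unnorm = theta ** (a - 1 + k) * (1.0 - theta) ** (b - 1 + n - k)
numeric_mean = (theta * unnorm).sum() / unnorm.sum()

print(post_mean, numeric_mean)  # both ~0.6429
```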
<h3>
Variational Inference and Meanfield
</h3>
<p>In the variational inference (VI) framework we approximate the true posterior <span class="math inline">\(p(z \mid x)\)</span> with some other, simpler distribution <span class="math inline">\(q(z \mid x, \Lambda)\)</span>, where <span class="math inline">\(\Lambda\)</span> is a set of (variational) parameters of the (variational) approximation (I may omit <span class="math inline">\(\Lambda\)</span> and <span class="math inline">\(\Theta\)</span> in a “given” clause when it’s convenient; remember, they can always be placed there). One possibility is to divide the latent variables <span class="math inline">\(z\)</span> into groups and force the groups to be independent. In the extreme case each variable gets its own group, assuming independence among all variables: <span class="math inline">\(q(z \mid x) = \prod_{d=1}^D q(z_d \mid x)\)</span>. If we then set about finding the best approximation to the true posterior in this fully factorized class, we will no longer have the optimal <span class="math inline">\(q\)</span> being the true posterior itself, as the true posterior is presumably too complicated to be dealt with in analytic form (which is what we want from the approximation <span class="math inline">\(q\)</span> when we say “simpler distribution”). Therefore we find the optimal <span class="math inline">\(q(z_i)\)</span> by minimizing the KL-divergence with the true posterior (<span class="math inline">\(\text{const}\)</span> denotes terms that are constant w.r.t. <span class="math inline">\(q(z_i)\)</span>):</p>
<p><span class="math display">\[
\begin{align}
D_{KL}(q(z \mid x) \mid\mid p(z \mid x))
&= \mathbb{E}_{q(z_i \mid x)} \left[ \mathbb{E}_{q(z_{-i} \mid x)} \log \frac{q(z_1 \mid x) \dots q(z_D \mid x)}{p(z \mid x)} \right] \\
&= \mathbb{E}_{q(z_i \mid x)} \left[ \log q(z_i \mid x) - \underbrace{\mathbb{E}_{q(z_{-i} \mid x)} \log p(z \mid x)}_{\log f(z_i \mid x)} \right] + \text{const} \\
&= \mathbb{E}_{q(z_i \mid x)} \left[ \log \frac{q(z_i \mid x)}{\tfrac{1}{Z} f(z_i \mid x)} \right] - \log Z + \text{const} \\
&= D_{KL}\left(q(z_i \mid x) \mid\mid \tfrac{1}{Z} f(z_i \mid x)\right) + \text{const}
\end{align}
\]</span></p>
<p>For many models it’s possible to look at <span class="math inline">\(\mathbb{E}_{q(z_{-i} \mid x)} \log p(z \mid x)\)</span> and immediately recognize the logarithm of an unnormalized density of some known distribution.</p>
<p>Another cornerstone of this framework is the notion of the <strong>Evidence Lower Bound</strong> (ELBO): recall the decomposition of the log-likelihood we derived above. In our current setting we cannot compute the right-hand side, as we cannot evaluate the true posterior <span class="math inline">\(p(z \mid x)\)</span>. However, note that the left-hand side (that is, the log-likelihood) does not depend on the variational distribution <span class="math inline">\(q(z \mid x, \Lambda)\)</span>. Therefore, maximizing the first term of the right-hand side w.r.t. the variational parameters <span class="math inline">\(\Lambda\)</span> results in minimizing the second term, the KL-divergence with the true posterior. This implies we can ditch the second term and maximize the first one w.r.t. both the model parameters <span class="math inline">\(\Theta\)</span> and the variational parameters <span class="math inline">\(\Lambda\)</span>:</p>
<p><span class="math display">\[
\text{ELBO:} \quad \mathcal{L}(\Theta, \Lambda) := \mathbb{E}_{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}
\]</span></p>
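<p>The defining property of the ELBO, that it lower-bounds the log-evidence with a gap equal to the KL term, can be checked by Monte Carlo on a hypothetical conjugate Gaussian model (my example, not from the post):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def log_norm(v, m, s2):
    # log density of N(m, s2) evaluated at v
    return -0.5 * np.log(2 * np.pi * s2) - (v - m) ** 2 / (2 * s2)

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1), single observation x0 = 1.
# Marginally x ~ N(0, 2), so the log-evidence is available in closed form.
x0 = 1.0
log_evidence = log_norm(x0, 0.0, 2.0)

# A deliberately imperfect variational q(z) = N(0.3, 1);
# the exact posterior would be N(0.5, 0.5).
z = rng.normal(0.3, 1.0, 200_000)
elbo = np.mean(log_norm(z, 0.0, 1.0) + log_norm(x0, z, 1.0) - log_norm(z, 0.3, 1.0))

print(elbo, log_evidence)  # the ELBO stays strictly below the log-evidence
```

The gap is exactly the KL-divergence between the chosen <span class="math inline">\(q\)</span> and the true posterior; a better <span class="math inline">\(q\)</span> shrinks it.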
<p>Okay, so this covers the basics, but before we get to the neural-network-based methods we need to discuss some general approaches to VI and how to make it scalable. This is what the next blog post is all about.</p>
<p>Fri, 01 Jul 2016 00:00:00 UT / Artem / http://artem.sobolev.name/posts/20160701neuralvariationalinferenceclassicaltheory.html</p>
<h3>
Exploiting Multiple Machines for Embarrassingly Parallel Applications
</h3>
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/FkchBewZeFA/2014-08-01-gnu-parallel.html
<p>During work on my machine learning project I needed to perform some quite computation-heavy calculations several times, each time with slightly different inputs. These calculations were CPU- and memory-bound, so spawning them all at once would only slow down the overall running time because of the increased number of context switches. Yet running 4 of them at a time (one per core of my CPU; actually 3, since other applications need the CPU too) should speed things up.</p>
<p>Fortunately, I have an old laptop with 2 cores, as well as access to a somewhat more modern machine with 4 cores. That results in 10 cores spread across 3 machines (all of them have some version of GNU/Linux installed). The question was how to exploit such a treasury.</p>
<!--more-->
<p>And the answer is GNU Parallel with some additional bells and whistles. GNU Parallel allows one to execute some commands in parallel and even in a distributed way.</p>
<p>The command was as follows:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="ex">parallel</span> -u --wd ... -S :,host1,host2 --trc <span class="dt">{}</span>.emb <span class="st">"sh {}"</span></code></pre></div>
Here we have:
<ul>
<li>
<strong>--wd</strong> stands for the working directory. Three dots (<code>...</code>) mean <code>parallel</code>’s temporary folder
</li>
<li>
<strong>-S</strong> contains the list of hosts, with <code>:</code> being localhost
</li>
<li>
<strong>--trc</strong> stands for “Transfer, Return, Cleanup” and means that we’d like to transfer the executable file to the target host, return the specified file, and do a cleanup
</li>
</ul>
<p><code>parallel</code> accepts a list of command arguments (file names) on standard input and executes a command (<code>sh</code> in my case) for each of them:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="fu">ls</span> -1 jobs/* <span class="kw">|</span> <span class="ex">parallel</span> -u --wd ... -S :,host1,host2 --trc <span class="dt">{}</span>.emb <span class="st">"sh {}"</span></code></pre></div>
<p>There’s a problem: we usually need more than one file to do useful stuff. There are several solutions to that problem:</p>
<ul>
<li>
<strong>Bring all the files manually</strong><br/> It’s a solution, but a somewhat tedious one: setting up a computing environment on several machines is dull
</li>
<li>
<strong>Tar it and do all the stuff in a single command</strong><br/> Looks better, but some shell kung fu is required
</li>
<li>
<strong>Use <a href="http://en.wikipedia.org/wiki/Shar">shar</a></strong><br/> Basically it’s a tar archive with some shell commands for (self-)extraction. I chose this way and glued in some of my code.
</li>
</ul>Fri, 01 Aug 2014 00:00:00 UThttp://artem.sobolev.name/posts/2014-08-01-gnu-parallel.htmlArtemhttp://artem.sobolev.name/posts/2014-08-01-gnu-parallel.htmlOn Sorting Complexity
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/OXwmToFy0jc/2014-05-01-on-sorting-complexity.html
<p>It’s well known that the lower bound for the sorting problem (in the general, comparison-based case) is <span class="math inline">\(\Omega(n \log n)\)</span>. The proof I was taught is somewhat involved and is based on paths in “decision” trees. Recently I discovered an information-theoretic approach to (or reformulation of) that proof.</p>
<!--more-->
<p>First, let’s state the problem: given a set of objects with an ordering, produce the elements of that set in that order. For now it’s completely irrelevant what these objects are, so we can assume they are just the numbers from 1 to n, i.e. some permutation. Thus we’ll be interested in sorting permutations.</p>
<p>We’re given the ordering via a comparison function. It tells us whether one object precedes (i.e., is smaller than) another, outputting True or False. Thus each invocation of the comparator gives us 1 bit of information.</p>
<p>The next question is how many bits we need to represent an arbitrary permutation. It’s just the binary logarithm of the number of all possible permutations of <span class="math inline">\(n\)</span> elements: <span class="math inline">\(\log_2 n!\)</span>. Then we notice that</p>
<p><span class="math display">\[
\log_2 n! = \sum_{k=1}^n \log_2 k \ge \sum_{k=n/2}^{n} \log_2 k
\ge \frac{n}{2} \log_2 \frac{n}{2}
\]</span></p>
<p><span class="math display">\[
\log_2 n! = \sum_{k=1}^n \log_2 k \le n \log_2 n
\]</span></p>
<p>(Or just use <a href="http://en.wikipedia.org/wiki/Stirling%27s_approximation">Stirling’s approximation</a>.) Hence <span class="math inline">\(\log_2 n! = \Theta(n \log n)\)</span>.</p>
<p>So what, you may ask. The key point of the proof is that sorting is essentially a search for the correct permutation of the input one. Since one needs <span class="math inline">\(\log_2 n!\)</span> bits to represent an arbitrary permutation, we <strong>need to obtain that many bits</strong> of information somehow. Now let’s get back to our comparison function. As we’ve already figured out, it gives us only one bit of information per invocation. That implies we need to call it <span class="math inline">\(\log_2 n! = \Theta(n \log n)\)</span> times, and that’s exactly the lower bound for sorting complexity. Q.E.D.</p>
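<p>As a quick sanity check (a small sketch of my own, not part of the original post), we can compute this lower bound numerically:</p>

```cpp
#include <cmath>

// Minimum number of comparisons any comparison sort needs for n elements:
// ceil(log2(n!)), where log2(n!) = sum of log2(k) for k = 2..n.
double log2_factorial(int n) {
    double bits = 0.0;
    for (int k = 2; k <= n; ++k)
        bits += std::log2(static_cast<double>(k));
    return bits;
}
```

<p>For instance, <span class="math inline">\(\log_2 5! = \log_2 120 \approx 6.91\)</span>, so no comparison sort can handle 5 elements in fewer than 7 comparisons in the worst case.</p>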
<p>Non-<span class="math inline">\(n \log n\)</span> sorting algorithms like <a href="http://en.wikipedia.org/wiki/Radix_sort">radix sort</a> are possible because they extract many more bits of information per operation, taking advantage of the numbers’ structure.</p>Thu, 01 May 2014 00:00:00 UThttp://artem.sobolev.name/posts/2014-05-01-on-sorting-complexity.htmlArtemhttp://artem.sobolev.name/posts/2014-05-01-on-sorting-complexity.htmlNamespaced Methods in JavaScript
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/KIo0BuBPx8M/2013-05-23-js-namespaced-methods.html
<p>Once upon a time I was asked (well, actually <a href="http://habrahabr.ru/qa/7130/" title="Javascript: String.prototype.namespace.method и this / Q&A / Хабрахабр">the question</a> wasn’t addressed to me only, but to the whole habrahabr community) whether it’s possible to implement namespaced methods in JavaScript for built-in types, like:</p>
<div class="sourceCode"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="dv">5</span>..<span class="va">rubish</span>.<span class="at">times</span>(<span class="kw">function</span>() <span class="op">{</span> <span class="co">// this function will be called 5 times</span>
<span class="va">console</span>.<span class="at">log</span>(<span class="st">"Hi there!"</span>)<span class="op">;</span>
<span class="op">}</span>)<span class="op">;</span>
<span class="st">"some string"</span>.<span class="va">hask</span>.<span class="at">map</span>(<span class="kw">function</span>(c) <span class="op">{</span> <span class="cf">return</span> <span class="va">c</span>.<span class="va">hask</span>.<span class="at">code</span>()<span class="op">;</span> <span class="op">}</span>)<span class="op">;</span>
<span class="co">// equivalent to</span>
<span class="st">"some string"</span>.<span class="at">split</span>(<span class="st">''</span>).<span class="at">map</span>(<span class="kw">function</span>(c) <span class="op">{</span> <span class="cf">return</span> <span class="va">c</span>.<span class="at">charCodeAt</span>()<span class="op">;</span> <span class="op">}</span>)<span class="op">;</span>
<span class="st">"another string"</span>.<span class="va">algo</span>.<span class="at">lcp</span>(<span class="st">"annotation"</span>)<span class="op">;</span>
<span class="co">// returns longest common prefix of two strings</span></code></pre></div>
<p>As you can see at the link, it’s possible using ECMAScript 5 features. And here’s how: <!--more--></p>
<p>First, let’s point out the main problem with the straightforward approach: <del>it doesn’t work</del> when you write</p>
<div class="sourceCode"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="va">Class</span>.<span class="va">prototype</span>.<span class="va">ns</span>.<span class="at">method</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span>
<span class="cf">return</span> <span class="kw">this</span>.<span class="at">methodA</span>() <span class="op">+</span> <span class="kw">this</span>.<span class="at">methodB</span>()<span class="op">;</span>
<span class="op">}</span></code></pre></div>
<p><code>this</code> points to <code>Class.prototype.ns</code> instead of an instance of <code>Class</code>. The only way to change that is to rebind <code>this</code> to our instance like this:</p>
<div class="sourceCode"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="kw">var</span> instance <span class="op">=</span> <span class="kw">new</span> Class<span class="op">;</span>
<span class="va">instance</span>.<span class="va">ns</span>.<span class="va">method</span>.<span class="at">call</span>(instance)<span class="op">;</span></code></pre></div>
<p>Obviously, that’s not a solution, since in that case it would be a lot easier to write something like</p>
<div class="sourceCode"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="kw">var</span> instance <span class="op">=</span> <span class="kw">new</span> Class<span class="op">;</span>
<span class="va">MegaLibrary</span>.<span class="at">method</span>(instance)<span class="op">;</span></code></pre></div>
<p>Thus we need to somehow return a correct function (with <code>this</code> bound to the <code>instance</code>) when the user calls namespaced methods. This can be done using <a href="http://stackoverflow.com/q/812961/1190430" title="Javascript getters and setters for dummies?  Stack Overflow" target="_blank">getters</a>.</p>
<p>When the user accesses our namespace, we give them a proxy object with a custom getter for every method in the namespace. This getter returns a method with <code>this</code> rebound. The question is: how do we get a reference to the <code>instance</code>? The answer is pretty simple: using getters again! Instead of declaring an ordinary property for the namespace, we can create a property with a custom getter that memoizes a reference to <code>this</code>. Voilà!</p>
Finally, the code is:
<script src="https://gist.github.com/artsobolev/5599917.js"></script>
<h2 id="butwaithowcrossbrowserisit">But wait… How cross browser is it?</h2>
<p>Well, I’m too lazy to test it on all platforms (IE, Opera, FF, Chrome, Node.JS), so I’ll do like the mathematician in a famous anecdote:</p>
<blockquote>
<p>Three employees (an engineer, a physicist and a mathematician) are staying in a hotel while attending a technical seminar. The engineer wakes up and smells smoke. He goes out into the hallway and sees a fire, so he fills a trashcan from his room with water and douses the fire. He goes back to bed.</p>
<p>Later, the physicist wakes up and smells smoke. He opens his door and sees a fire in the hallway. He walks down the hall to a fire hose and after calculating the flame velocity, distance, water pressure, trajectory, etc. extinguishes the fire with the minimum amount of water and energy needed.</p>
<p>Later, the mathematician wakes up and smells smoke. She goes to the hall, sees the fire and then the fire hose. She thinks for a moment and then exclaims, ‘Ah, a solution exists!’ and then goes back to bed.</p>
</blockquote>
As you can see, the key part of the code is ECMAScript 5’s <code>Object.defineProperty</code>. According to kangax’s <a href="http://kangax.github.io/es5-compat-table/#Object.defineProperty" title="ECMAScript 5 compatibility table" target="_blank">ECMAScript 5 compatibility table</a>, it has pretty good support:
<ul>
<li>
IE 9+
</li>
<li>
Opera 12+
</li>
<li>
FF 4+
</li>
<li>
Chrome 7+ (and thus Node.JS too)
</li>
</ul>Thu, 23 May 2013 00:00:00 UThttp://artem.sobolev.name/posts/2013-05-23-js-namespaced-methods.htmlArtemhttp://artem.sobolev.name/posts/2013-05-23-js-namespaced-methods.htmlCrazy Expression Parsing
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/VI7SZVJIoPA/2013-03-30-crazy-expression-parsing.html
<p>Suppose we have an expression like <code>(5+5 * (x^x-5 | y && 3))</code> and we’d like to get some computer-understandable representation of that expression, like:</p>
<p><code>ADD Token[5] (MUL Token[5] (AND (BIT_OR (XOR Token[x] (SUB Token[x] Token[5])) Token[y]) Token[3]))</code></p>
<p>In case you don’t know how to do that, or are looking for a solution right now, you should know that I’m not going to present a correct one. This post is just a joke. You should use either the <a href="http://en.wikipedia.org/wiki/Shunting-yard_algorithm" title="Shunting-yard algorithm — Wikipedia">Shunting-yard algorithm</a> or a <a href="http://en.wikipedia.org/wiki/Recursive_descent_parser">recursive descent parser</a>.</p>
<p>So if you’re ready for madness… Let’s go! <!--more--></p>
<p>Let’s take the <a href="http://en.wikipedia.org/wiki/Don%27t_repeat_yourself">Don’t repeat yourself</a> principle as a justification. Moreover, let’s take it to the extreme: “Don’t repeat”. Indeed, why should we repeat what the compiler’s developers have already done?</p>
Here we go
<script src="https://gist.github.com/artsobolev/5273716.js"></script>
<p>In case you’re wondering what the heck is going on: all constants are converted to instances of the <code>Token</code> class, for which I overloaded all the operators. The overloading is done in a way that preserves the structure of the expression. The only thing we have to do then is to extract that information. In case you’re not familiar with C++, I recommend reading something about operator overloading.</p>
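<p>The core trick can be sketched as follows (a minimal reconstruction of my own, not the gist’s actual code; only two operators are shown):</p>

```cpp
#include <string>

// Each overloaded operator records the operation instead of evaluating it,
// so the compiler's own expression parsing builds the parse tree for us.
struct Token {
    std::string repr;
    explicit Token(int v) : repr("Token[" + std::to_string(v) + "]") {}
    explicit Token(std::string r) : repr(std::move(r)) {}
};

Token operator+(const Token& a, const Token& b) {
    return Token("(ADD " + a.repr + " " + b.repr + ")");
}

Token operator*(const Token& a, const Token& b) {
    return Token("(MUL " + a.repr + " " + b.repr + ")");
}
```

<p>With this, <code>(Token(5) + Token(5) * Token(3)).repr</code> yields <code>(ADD Token[5] (MUL Token[5] Token[3]))</code>: operator precedence is handled by the C++ compiler itself.</p>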
<p>As you can see, g++ and python are required for that “parser”. Unfortunately, the priority of bitwise xor is too low for it to serve as a power operator.</p>Sat, 30 Mar 2013 00:00:00 UThttp://artem.sobolev.name/posts/2013-03-30-crazy-expression-parsing.htmlArtemhttp://artem.sobolev.name/posts/2013-03-30-crazy-expression-parsing.htmlMemoization Using C++11
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/u8nQ5JzLO54/2013-03-29-cpp11-memoization.html
<p>Recently I read the article <a href="http://johnahlgren.blogspot.ru/2013/03/efficient-memoization-using-partial.html" title="John Ahlgren: Efficient Memoization using Partial Function Application">Efficient Memoization using Partial Function Application</a>, in which the author explains function memoization using partial application. While reading it, I thought: “Hmm, can I come up with a more general solution?” As suggested in the comments, one can use variadic templates to achieve this. So here is my version.</p>
<!--more-->
<p>First, let’s do it in a more object-oriented way: we define a template class <code>Memoizator</code> with 2 parameters: a return value type and a list of argument types. We also encapsulate a lookup map and use C++11’s <a href="http://en.cppreference.com/w/cpp/utility/tuple" title="std::tuple - cppreference.com">std::tuple</a> to represent a set of arguments.</p>
The code is as follows:
<script src="https://gist.github.com/artsobolev/5270779.js"></script>
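<p>The essence of the approach can be sketched like this (a hedged reconstruction along the lines described above, not the gist’s exact code):</p>

```cpp
#include <functional>
#include <map>
#include <tuple>

// Memoizator wraps a function and caches its results in a lookup map
// keyed by the tuple of arguments.
template <typename ReturnType, typename... Args>
class Memoizator {
    std::function<ReturnType(Args...)> f_;
    std::map<std::tuple<Args...>, ReturnType> cache_;

public:
    explicit Memoizator(std::function<ReturnType(Args...)> f)
        : f_(std::move(f)) {}

    ReturnType operator()(Args... args) {
        auto key = std::make_tuple(args...);
        auto it = cache_.find(key);
        if (it != cache_.end())
            return it->second;           // cache hit: no recomputation
        ReturnType result = f_(args...);
        cache_.emplace(key, result);
        return result;
    }
};
```

<p>For example, <code>Memoizator&lt;long, int, int&gt;</code> wraps any binary function of two <code>int</code>s, computes each result once, and serves repeated calls from the cache.</p>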
<p>Good, but what about computing the n-th Fibonacci number using memoization? That’s not possible with the current version of <code>Memoizator</code>, since it uses a separate map for each instance, even if the underlying function is the same. It also looks inefficient to store a separate lookup map for each instance of the same function. We’ll fix that by creating a static storage of maps keyed by the function’s address:</p>
<script src="https://gist.github.com/artsobolev/5271223.js"></script>
<p>Now let’s compare the memoized version against the regular one. If we compute the 42nd Fibonacci number using the simple recursive version (with exponential time complexity), we get:</p>
<pre><strong>$ time ./a.out</strong>
267914296
real 0m5.314s
user 0m5.220s
sys 0m0.020s</pre>
Now the memoized one (from the source above):
<pre><strong>$ time ./a.out</strong>
267914296
real 0m0.005s
user 0m0.004s
sys 0m0.004s</pre>
<p>Moreover, our memoization reduced the time complexity from exponential to linear.</p>
<p><strong>UPD</strong>: you can take a look at another implementation here: <a href="http://cpptruths.blogspot.ru/2012/01/general-purpose-automatic-memoization.html" title="c++ truths: General-purpose Automatic Memoization for Recursive Functions in C++11">General-purpose Automatic Memoization for Recursive Functions in C++11</a></p>Fri, 29 Mar 2013 00:00:00 UThttp://artem.sobolev.name/posts/2013-03-29-cpp11-memoization.htmlArtemhttp://artem.sobolev.name/posts/2013-03-29-cpp11-memoization.htmlResizing Policy of std::vector
http://feedproxy.google.com/~r/barmaleyexeblogfeed/~3/qh9J_Kq_jEw/2013-02-10-std-vector-growth.html
Some time ago, when Facebook open-sourced their <a title="Folly is an open-source C++ library developed and used at Facebook" href="https://github.com/facebook/folly">Folly library</a>, I was reading their docs and found <a title="folly/FBVector.h documentation" href="https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md">something interesting</a>. In the section “Memory Handling” they state:
<blockquote>
In fact it can be mathematically proven that a growth factor of 2 is rigorously the worst possible because it never allows the vector to reuse any of its previouslyallocated memory
</blockquote>
<p>I didn’t get it the first time. Recently I recalled that article and decided to figure it out. After reading and googling for a while I finally understood the idea, so I’d like to say a few words about it.</p>
<!--more-->
<p>The problem is as follows: when a vector (or a similar structure with auto-resizing) gets filled, it has to grow. It’s well known that it should grow exponentially in order to preserve constant amortized complexity of insertions, but what growth factor should we choose? At first glance, 2 seems to be OK: it’s not too big, and 2 is common in computer science :). But it turns out that 2 is not so good. Let’s take a closer look at an example:</p>
<p><a href="/files/vectorresizescheme.png"><img src="/files/vectorresizescheme.png" alt="Vector resize scheme" width="495" height="350" class="sizefull" /></a></p>
<p>Suppose we have a vector of initial size <span class="math inline">\(C\)</span>. When it gets filled, we double its size: we allocate memory for a vector of size <span class="math inline">\(2C\)</span> right after our original vector. So now we have a vector of size <span class="math inline">\(2C\)</span>, preceded by the <span class="math inline">\(C\)</span> bytes it occupied when it was small. Then we expand it again and again, and so on. After <span class="math inline">\(n\)</span> expansions we’ll get a vector of size <span class="math inline">\(2^n C\)</span> preceded by <span class="math inline">\(C + 2C + 2^2 C + \dots + 2^{n-1} C\)</span> bytes that were occupied by this vector before.</p>
<p>So what’s the problem? The problem is that after every expansion your vector is too big to fit into the previously allocated memory. By how much? Well, as we know, <span class="math inline">\(2^n - 1 = 1 + 2 + 4 + \dots + 2^{n-1}\)</span>, thus <span class="math inline">\(2^n C - C - 2C - 2^2 C - \dots - 2^{n-1} C = C\)</span>. Therefore you are permanently <span class="math inline">\(C\)</span> bytes short of fitting your vector into the previously allocated space.</p>
<p>Okay, let’s now solve this problem. First, let’s formalize it.</p>
<p>Every time we grow a vector of size <span class="math inline">\(C\)</span> with growth factor <span class="math inline">\(k\)</span>, we do these steps:</p>
<ol>
<li>
Allocate <span class="math inline">\(k C\)</span> bytes
</li>
<li>
Create a new vector there and copy the current vector’s content to the new one
</li>
<li>
Remove the current vector, set the new one as the current
</li>
</ol>
<p>So, as you can see, what we get is a sort of upper bound: you cannot reuse all of the previously allocated <span class="math inline">\(n-1\)</span> chunks when allocating the <span class="math inline">\(n\)</span>-th one, since you need to copy the values from the <span class="math inline">\((n-1)\)</span>-th chunk (though you could copy them into some temporary buffer, but that requires extra memory). So when we allocate the <span class="math inline">\(n\)</span>-th chunk, we need it to be no larger than the total free space from the first <span class="math inline">\(n-2\)</span> allocations: <span class="math display">\[ k^n C \le k^{n-2}C + k^{n-3}C + \dots + kC + C \]</span></p>
<p>As you can see, we can get rid of <span class="math inline">\(C\)</span> since it’s definitely positive: <span class="math display">\[ k^n \le k^{n-2} + k^{n-3} + \dots + k + 1 \]</span></p>
<p>Okay, time to solve some equations! We see something like the sum of a geometric progression, and we could use the formula for it. But I don’t retain it in my head, so I’ll use a little trick here: let’s multiply both sides by <span class="math inline">\(k-1\)</span>. We assume that <span class="math inline">\(k > 1\)</span> (it would be very strange to use a value not greater than 1 as a <em>growth</em> factor): <span class="math display">\[ (k-1) k^n \le (k-1) (k^{n-2} + k^{n-3} + \dots + k + 1) \]</span></p>
<p>Now we can notice that on the right side we have the expansion of <span class="math inline">\(k^{n-1}-1\)</span> (well, maybe remembering this observation is harder than remembering the formula for the sum of a geometric progression…)</p>
<p><span class="math display">\[ (k-1) k^n \le k^{n-1}-1 \]</span> <span class="math display">\[ k^{n+1} - k^n \le k^{n-1}-1 \]</span> <span class="math display">\[ k^{n+1} \le k^n + k^{n-1} - 1 \]</span></p>
<p>Oh, this obstructing 1… It would be so nice if we could throw it away! Wait, but we can! If we add 1 to the right side, we merely increase its value, so it will still serve as an upper bound (or an approximation, since 1 is a constant and is very small compared to <span class="math inline">\(k^n\)</span>).</p>
<p><span class="math display">\[ k^{n+1} \le k^n + k^{n-1} - 1 < k^n + k^{n-1} \]</span> Dividing both sides by <span class="math inline">\(k^{n-1}\)</span>: <span class="math display">\[ k^2 < k + 1 \]</span> <span class="math display">\[ k^2 - k - 1 < 0 \]</span></p>
<p>Solving this simple quadratic inequality, we get <span class="math display">\[ k < \frac{1+\sqrt{5}}{2} \approx 1.61 \]</span></p>
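<p>A quick numeric check (my own sketch, not from the original post): simulate the expansions and test whether the newly requested chunk ever fits into the space freed so far.</p>

```cpp
// Simulates the growth process: returns true if, within `steps` expansions
// with growth factor k, the newly requested chunk ever fits into the
// previously freed space. The chunk we copy from is still live, so it
// does not count as free until after the copy.
bool can_reuse(double k, int steps) {
    double freed = 0.0;  // total size of chunks freed so far
    double size = 1.0;   // current capacity, in units of the initial C
    for (int i = 0; i < steps; ++i) {
        double next = size * k;  // size of the chunk we want to allocate
        if (next <= freed)
            return true;         // it fits into the reusable hole
        freed += size;           // the old chunk is freed after the copy
        size = next;
    }
    return false;
}
```

<p>With a factor of 1.5 reuse kicks in after a handful of expansions, while with a factor of 2 it never happens, matching the bound above.</p>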
<p>So that’s why the growth factor in many dynamic array implementations is 1.5: it is big enough to avoid causing reallocations too frequently, yet small enough to not use memory too extensively.</p>Sun, 10 Feb 2013 00:00:00 UThttp://artem.sobolev.name/posts/2013-02-10-std-vector-growth.htmlArtemhttp://artem.sobolev.name/posts/2013-02-10-std-vector-growth.html