B.log - RSS feed
http://artem.sobolev.name
Sat, 28 Oct 2017 00:00:00 UTStochastic Computation Graphs: Discrete Relaxations
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/rm0asUsdwas/2017-10-28-stochastic-computation-graphs-discrete-relaxations.html
<p>This is the second post of the <a href="/tags/stochastic%20computation%20graphs%20series.html">stochastic computation graphs series</a>. Last time we discussed models with <a href="/posts/2017-09-10-stochastic-computation-graphs-continuous-case.html">continuous stochastic nodes</a>, for which there are powerful reparametrization technics.</p>
<p>Unfortunately, these methods don’t work for discrete random variables. Moreover, it looks like there’s no way to backpropagate through discrete stochastic nodes, as there’s no infinitesimal change of random values when you infinitesimally perturb their parameters.</p>
<p>In this post I’ll talk about continuous relaxations of discrete random variables.</p>
<!--more-->
<h2 id="asymptotic-reparametrization">Asymptotic reparametrization</h2>
<p>One way to train models with discrete random variables is to consider an equivalent model with continuous random variables. Let me show you an example. Suppose you have a feed-forward neural network for classification that receives <span class="math inline">\(x\)</span> and outputs distribution over targets <span class="math inline">\(p(y \mid x)\)</span>, where typical layer looks like <span class="math inline">\(h_k = \sigma(W_k h_{k-1} + b_k)\)</span>. You’d like to apply dropout to each weight of this layer and <em>tune its probabilities</em>. To do so we introduce binary latent variables <span class="math inline">\(z^{(k)}_{ij}\)</span> denoting if weight is on or off. There’s one such variable for each weight, so we can stack them into a matrix <span class="math inline">\(Z_k\)</span> of the same shape as weight matrix <span class="math inline">\(W_k\)</span>. Then element-wise multiplication <span class="math inline">\(W_k \circ Z_k\)</span> would zero out dropped weights, so the formula becomes <span class="math inline">\(h_k = \sigma((W_k \circ Z_k) h_{k-1} + b_k)\)</span>. We assume each dropout mask independently follows Bernoulli distribution: <span class="math inline">\(z_{ij}^{(k)} \sim \text{Bernoulli}(p_{ij}^{(k)})\)</span> (<span class="math inline">\(Z^{(k)} \sim \text{Bernoulli}(P^{(k)})\)</span> for short)</p>
<p>In order to learn these masks (or, rather parameters of the distribution <span class="math inline">\(q(Z \mid \Lambda)\)</span> over masks parametrized by <span class="math inline">\(\Lambda\)</span>) we employ variational inference approach:</p>
<p><span class="math display">\[
\begin{align*}
\log p(y|x) \ge
\mathcal{L}(\Lambda)
&= \mathbb{E}_{q(Z \mid \Lambda)} \log \frac{p(y, Z \mid x)}{q(Z \mid \Lambda)} \\
&= \underbrace{\mathbb{E}_{q(Z \mid \Lambda)} \log p(y \mid Z, x)}_{\text{expected likelihood}} - \underbrace{D_{KL}(q(Z \mid \Lambda) \mid\mid p(Z))}_{\text{KL-divergence}}
\to \max_\Lambda
\end{align*}
\]</span></p>
<p>We can’t backpropagate gradients through discrete sampling procedure, so we need to overcome this problem somehow. Notice, however, that each unit in a layer <span class="math inline">\(h_{k+1}\)</span> is an affine transformation of <span class="math inline">\(k\)</span>th layer’s nodes followed by a nonlinear activation function. If <span class="math inline">\(k\)</span>th layer has sufficiently many neurons, then one might expect the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">Central Limit Theorem</a> to hold at least approximately for the preactivations. Namely, consider a single neuron <span class="math inline">\(s\)</span> that takes an affine combination of previous layer’s neurons <span class="math inline">\(h_{k-1}\)</span> and applies a nonlinearity: <span class="math inline">\(s = \sigma(w^T h + b)\)</span>. In our case, however, we have a vector of masks <span class="math inline">\(z \sim \text{Bernoulli}(P)\)</span>, so the formula becomes <span class="math inline">\(s = \sigma((w \odot \tfrac{z}{P})^T h + b)\)</span> (<span class="math inline">\(\odot\)</span> stands for element-wise multiplication), and if <span class="math inline">\(K=\text{dim}(z)\)</span> is large enough, then we might expect the preactivations <span class="math inline">\(\sum_{k=1}^K \tfrac{z_k}{p_k} w_k h_k + b\)</span> (we divide each weight by its keeping probability <span class="math inline">\(p_k\)</span> to make keep the expectation unaffected by noise) to be approximately distributed as <span class="math inline">\(\mathcal{N}\left(w^T h + b, \sum_{k=1}^K \tfrac{1 - p_k}{p_k} w_k^2 h_k^2 \right)\)</span>.</p>
<p>Now suppose that instead of Bernoulli multiplicative noise <span class="math inline">\(z\)</span> we actually used multiplicative Gaussian noise <span class="math inline">\(\zeta \sim \mathcal{N}(1, (1-P) / P)\)</span> (element-wise division). It’s easy to check then that the preactivations <span class="math inline">\((w \odot \zeta)^T h + b\)</span> would have the same Gaussian distribution with exactly the same parameters. Therefore, we can replace expected likelihood term in objective <span class="math inline">\(\mathcal{L}(\Lambda)\)</span> with a continuous distribution <span class="math inline">\(q(\zeta|\Lambda) = \prod_{i,j,k} \mathcal{N}\left(\zeta^{(k)}_{ij} \mid 1, (1-\lambda^{(k)}_{ij})/\lambda^{(k)}_{ij}\right)\)</span>. However, we can’t simply do the same in the KL divergence term. Instead, we need to use simple priors (like factorized Bernoulli) so that closed form can be computed – then we can take deterministic gradients w.r.t. <span class="math inline">\(\Lambda\)</span>.</p>
<p>This example shows us that for some simple models collective behavior of discrete random variables can be accurately approximated by continuous equivalents. I’d call this approach the <strong>asymptotic reparametrization</strong> <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<h2 id="naive-relaxation">Naive Relaxation</h2>
<p>The previous trick is nice and appealing, but has very limited scope of applicability. If you have just a few discrete random variables or have other issues preventing you from relying on the CLT, you’re out of luck.</p>
<p>However, consider a binary discrete random variable: <span class="math inline">\(z \sim \text{Bernoulli}(p)\)</span>. How would you sample it? Easy! Just sample a uniform r.v. <span class="math inline">\(u \sim U[0,1]\)</span> and see if it’s less than <span class="math inline">\(p\)</span>: <span class="math inline">\(z = [u > q]\)</span> where <span class="math inline">\(q = 1 - p\)</span> and brackets denote an indicator function that is equal to one when the argument is True, and zero otherwise. Equivalently we can rewrite it as <span class="math inline">\(z = H(u - q)\)</span> where <span class="math inline">\(H(x)\)</span> is a step function: it’s zero for negative <span class="math inline">\(x\)</span> and 1 for positive ones <a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>. Now, this is a nice-looking reparametrization, but <span class="math inline">\(H\)</span> is not differentiable <a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a>, so you can’t backpropagate through it. What if we replace <span class="math inline">\(H\)</span> with some differentiable analogue that has a similar shape? One candidate is a sigmoid with temperature: <span class="math inline">\(\sigma_\tau(x) = \sigma\left(\tfrac{x}{\tau}\right)\)</span>: by varying temperature you can control steepness of the function. In the limit of <span class="math inline">\(\tau \to 0\)</span> we actually recover the step function <span class="math inline">\(\lim_{\tau \to 0} \sigma_\tau(x) = H(x)\)</span>.</p>
<p>So the relaxation we’ll consider is <span class="math inline">\(\zeta = \sigma_\tau(u - q)\)</span>. How can we see if it’s a good one? What do we even want from the relaxation? Well, in the end we’ll be using the discrete version of the model, the one with zeros and ones, so we’d definitely like our relaxation to sample zeros and ones often. Actually, we’d even want them to be the modes of the underlying distribution. Let’s see if that’s the case for the proposed relaxation.</p>
<p>CDF of the relaxed r.v. <span class="math inline">\(\zeta\)</span> <span class="math display">\[
\mathbb{P}(\zeta < x) = \mathbb{P}(u < q + \tau \sigma^{-1}(x)) = \min(1, \max(0, q + \tau \sigma^{-1}(x)))
\]</span> And the corresponding PDF <span class="math display">\[
\frac{\partial}{\partial x}\mathbb{P}(\zeta < x)
=
\begin{cases}
\frac{\tau}{x (1-x)}, & \sigma\left(\frac{-q}{\tau}\right) < x < \sigma\left(\frac{1-q}{\tau}\right) \\
0, & \text{otherwise}
\end{cases}
\]</span></p>
<p>Even the formula suggests that the support of the distribution of <span class="math inline">\(\zeta\)</span> is never the whole <span class="math inline">\((0, 1)\)</span>, but only approaches it as temperature <span class="math inline">\(\tau\)</span> goes to zero. For all non-zero temperatures though the support will exclude some neighborhood of the endpoints which might bias the model towards the intermediate values. This is why you want to have a random variable with infinite support <a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a>. If the distribution is skewed, then the resulting relaxation will also be skewed, but it’s not a problem since the probabilities are adjusted according to the CDF.</p>
<p>Let’s plot some densities for different <span class="math inline">\(\tau\)</span> (let <span class="math inline">\(q\)</span> be 0.1).</p>
<div class="post-image">
<p><img src="/files/naive-relaxation-densities.png" style="max-width: 90%" /></p>
</div>
<p>But having infinite support is not enough. It’s hard to see from plots, but if the distribution has very light tails (like Gaussian), then it’s effective support is still finite. Authors of the Concrete Distribution notice this in their paper, saying that sigmoid squashing rate is not enough to compensate (even if you twist the temperature!) for quickly decreasing Gaussian density as you approach either of infinities.</p>
<p>Let’s also think about the impact of the temperature on the relaxation. Intuitively one would expect that as we decrease the temperature, the relaxation becomes more accurate and the problem becomes “more discrete”, hence it should be harder to optimize. Indeed, <span class="math inline">\(\tfrac{d}{dx}\sigma_\tau(x) = \tfrac{1}{\tau} \sigma_\tau(x) \sigma_\tau(-x)\)</span> – as you decrease the temperature, both sigmoids become more steep, and the derivative approaches an infinitely tall spike at 0 and zero everywhere else <a href="#fn5" class="footnoteRef" id="fnref5"><sup>5</sup></a>.</p>
<div class="post-image">
<p><img src="/files/naive-relaxation-variance-by-tau.png" style="max-width: 90%" /></p>
</div>
<p>As expected, higher approximation accuracy (obtained by lowering the temperature) comes at a cost of increased variance.</p>
<h2 id="gumbel-softmax-relaxation-aka-concrete-distribution">Gumbel-Softmax Relaxation (aka Concrete Distribution)</h2>
<p>We could consider some other distributions (with larger support, like the gaussian one) instead of uniform in our relaxation, but let’s try a different approach. Let’s see how we can sample arbitrary <span class="math inline">\(K\)</span>-valued discrete random variables. It’s well-known fact (the so called <a href="https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/">Gumbel Max Trick</a>) that if <span class="math inline">\(\gamma_k\)</span> are i.i.d. <span class="math inline">\(\text{Gumbel}(0, 1)\)</span> random variables, then <span class="math inline">\(\text{argmax}_k \{\gamma_k + \log p_k\} \sim \text{Categorical}(p_1, \dots, p_K)\)</span>, that is, probability that <span class="math inline">\(k\)</span>th perturbed r.v. attains maximal value is exactly <span class="math inline">\(p_k\)</span> <a href="#fn6" class="footnoteRef" id="fnref6"><sup>6</sup></a>. This gives you a sampling procedure: just sample <span class="math inline">\(K\)</span> independent Gumbels, add corresponding log probabilities, and then take the argmax. However, though mathematically elegant, this formula won’t help us much since argmax is not differentiable. Let’s relax it then! We have already seen that the step function <span class="math inline">\(H(x)\)</span> can be seen as a limit of a sigmoid with temperature: <span class="math inline">\(H(X) = \lim_{\tau \to 0} \sigma_\tau(x)\)</span>, so we might expect (and indeed it is) that if you assume that <span class="math inline">\(\text{argmax}(x)\)</span> returns you a one-hot vector indicating the maximum index, it can be viewed as a zero-temperature version of a softmax with temperature: <span class="math inline">\(\text{argmax}(x)_j = \lim_{\tau \to 0} \text{softmax}_\tau(x)_j\)</span> where</p>
<p><span class="math display">\[
\text{softmax}_\tau(x)_j
= \frac{\exp(x_j / \tau)}{\sum_{k=1}^K \exp(x_k / \tau)}
\]</span></p>
<p>This formula gives us continuous relaxation of discrete random variables. Let’s see what it corresponds to in binary case:</p>
<p><span class="math display">\[
\begin{align*}
\zeta
&= \frac{\exp((\gamma_1 + \log p) / \tau)}{\exp((\gamma_1 + \log p) / \tau) + \exp((\gamma_0 + \log (1-p)) / \tau)} \\
&= \frac{1}{1 + \exp((\gamma_0 + \log (1-p) - \gamma_1 - \log p) / \tau)}\\
&= \sigma_\tau\left(\gamma_1 - \gamma_0 + \log \tfrac{p}{1-p}\right)
\end{align*}
\]</span></p>
<p>Then <span class="math inline">\(\gamma_1 - \gamma_0\)</span> has <a href="https://en.wikipedia.org/wiki/Logistic_distribution">Logistic</a>(0, 1) distribution <a href="#fn7" class="footnoteRef" id="fnref7"><sup>7</sup></a>. This estimator is a bit more efficient since you can generate Logistic random variables faster than generating two independent Gumbel r.v.s. <a href="#fn8" class="footnoteRef" id="fnref8"><sup>8</sup></a></p>
<p>Even though this choice of Logistic distribution in binary case seems arbitrary, let’s not forget, that it’s a special case of a more general relaxation of any categorical r.v. If we chose some other<a href="#fn9" class="footnoteRef" id="fnref9"><sup>9</sup></a> distribution in a binary case, we’d have to construct some cumbersome stick-breaking procedure to generalize it to the multivariate case.</p>
<h2 id="marginalization-via-continuous-noise">Marginalization via Continuous Noise</h2>
<p>An interesting approach was proposed in the <a href="https://arxiv.org/abs/1609.02200">Discrete Variational Autoencoders paper</a>. The core idea is that you can smooth binary r.v. <span class="math inline">\(z\)</span> with p.m.f. <span class="math inline">\(p(z)\)</span> by adding extra noise <span class="math inline">\(\tau_1\)</span> and <span class="math inline">\(\tau_2\)</span> and treating <span class="math inline">\(z\)</span> as a swticher between these. Indeed, consider such smoothed r.v. <span class="math inline">\(\zeta = z \cdot \tau_1 + (1 - z) \tau_0\)</span>. Now if we choose <span class="math inline">\(\tau_0\)</span> and <span class="math inline">\(\tau_1\)</span> such that the CDF of marginal <span class="math inline">\(\zeta\)</span> can be computed and inverted, we would be able to devise a reparametrization for this scheme.</p>
<p>Consider a particular example of <span class="math inline">\(\tau_0 = 0\)</span> – a constant zero, and <span class="math inline">\(\tau_1\)</span> having some continuous distribution. The marginal CDF of <span class="math inline">\(\zeta\)</span> would then be</p>
<p><span class="math display">\[
\mathbb{P}(\zeta < x) = \mathbb{P}(z = 0) [\zeta > 0] + \mathbb{P}(z=1) \mathbb{P}(\tau_1 < x)
\]</span></p>
<p>Now we can invert this CDF:</p>
<p><span class="math display">\[
\mathbb{Q}_\zeta(\rho) = \begin{cases}
\mathbb{Q}_\tau \left( \frac{\rho}{p(z=1)} \right), & \rho \le p(z=1) \mathbb{P}(\tau < 0) \\
\mathbb{Q}_\tau \left( \frac{\rho - p(z = 0)}{1 - p(z = 0)} \right), & \rho \ge p(z = 0) + p(z=1) \mathbb{P}(\tau < 0) \\
0, & \text{otherwise}
\end{cases}
\]</span></p>
<p>Where <span class="math inline">\(\mathbb{Q}_\tau(\rho)\)</span> is an inverse of CDF of <span class="math inline">\(\tau\)</span>, that is <span class="math inline">\(\mathbb{P}(\tau < \mathbb{Q}_\tau(\rho)) = \rho\)</span>. This formula clearly suggests that if you can compute and invert CDF of the smoothing noise <span class="math inline">\(\tau\)</span>, you can do the same with the smoothed variable <span class="math inline">\(\zeta\)</span>, essentially giving us reparametrization for the smoothed r.v. <span class="math inline">\(\zeta\)</span>, so we can backpropagate as usual. <a href="#fn10" class="footnoteRef" id="fnref10"><sup>10</sup></a></p>
<p>However, an attentive reader could spot a problem here. In the multivariate case we typically have some dependency structure, hence probabilities <span class="math inline">\(p(z_k = 0)\)</span> and <span class="math inline">\(p(z_k = 1)\)</span> depend on previous samples <span class="math inline">\(z_{<k}\)</span>, which we can’t backpropagate through, and need to relax in the same way.</p>
<p>Consider, for example, a general stochastic computation graph with 4-dimensional binary random variable <span class="math inline">\(z\)</span>:</p>
<div class="post-image">
<p><img src="/files/dvae.png" style="width: 400px" /></p>
</div>
<p>When applying this trick, we introduce relaxed continuous random variables <span class="math inline">\(\zeta\)</span> as simple transformations of corresponding binary random variables <span class="math inline">\(z\)</span> (red lines), and make <span class="math inline">\(z\)</span> depend on each other only through relaxed variables (purple lines).</p>
<div class="post-image">
<p><img src="/files/dvae-smoothed.png" style="width: 400px" /></p>
</div>
<p>This trick is somewhat similar to the asymptotic reparametrization – you end up with a model that only has continuous random variables, but is equivalent to the original one that has discreteness. However, it requires you to significantly alter the model by re-expressing dependence in <span class="math inline">\(z\)</span> using continuous relaxations <span class="math inline">\(\zeta\)</span>. It worked fine for the Discrete VAE application, where you want to learn this dependence structure in <span class="math inline">\(z\)</span>, but if you have a specific one in mind, you might be in trouble.</p>
<p>Also, we don’t want to introduce this noise at the test stage. So we’d like to fix the discrepancy between train and test somehow. One way to do so is to choose <span class="math inline">\(\tau\)</span> that depends on some parameter and adjust it so that <span class="math inline">\(\tau\)</span>’s distribution becomes closer to <span class="math inline">\(\delta(\tau-1)\)</span>. In the paper authors use “truncated exponential” distribution <span class="math inline">\(p(\tau) \propto \exp(\beta \tau) [0 \le \tau \le 1]\)</span> where <span class="math inline">\(\beta\)</span> is a (bounded) learnable parameter. The upper bound grows linearly as training progresses – essentially shrinking the noise towards 1 (in the limit of infinite <span class="math inline">\(\beta\)</span> we have <span class="math inline">\(p(\tau) = \delta(\tau-1)\)</span>).</p>
<h2 id="gradient-relaxations">Gradient Relaxations</h2>
<p>There’s been also some research around the following idea: we don’t have any problems with discrete random variable during the forward pass, it’s differentiation during the backward one that brings difficulties. Can we approximate the gradient only? Essentially the idea is to compute the forward pass as usual, but replace the gradient through random samples with some approximation. To a mathematician (like I pretend to be) this sounds very suspicious – the gradient will no longer correspond to the objective, it’s not even clear which objective it’d correspond to. However, these methods have some attractive properties, and are well-known in the area, so I feel I have to cover them as well.</p>
<p>One of such methods is the <a href="https://arxiv.org/abs/1308.3432"><strong>Straight Through</strong> estimator</a>, which backpropagates the gradient <span class="math inline">\(\nabla_\theta\)</span> through binary random variable <span class="math inline">\(z \sim \text{Bernoulli}(p(\theta))\)</span> as if there was no stochasticity (and nonlinearity!) in the first place. So in the forward pass you take your outputs (logits) of a layer preceding to the discrete stochastic node, squash it by a sigmoid function, then sample a Bernoulli random variable with such probability, and move on to the next layer, possibly sampling some more stochastic nodes along the way. In the backward phase, though, when it comes to differentiate through the discrete sampling procedure, you just go straight to the gradients of logits, like if there was no sigmoid and sampling in the first place: <span class="math display">\[\nabla_\theta \text{Bernoulli}(\sigma(g(\theta))) := \nabla_\theta g(\theta)\]</span></p>
<p>This estimator is clearly computing anything but an estimate of the gradient of your model, but authors claim that at least it gives you right direction for the gradient. You can also keep the sigmoid function – that’s what <a href="https://arxiv.org/abs/1406.2989">Raiko et. al</a> do: <span class="math display">\[\nabla_\theta \text{Bernoulli}(\sigma(g(\theta))) := \nabla_\theta \sigma(g(\theta))\]</span></p>
<p>Finally, authors of one of the <a href="https://arxiv.org/abs/1611.01144">Gumbel-Softmax relaxation</a> papers proposed Straight Through Gumbel where you, again, compute forward pass as usual, but in the backward pass assume there’s Gumbel-Softmax relaxation and backpropagate through it. Hypothetically as you decrease the temperature, this relaxation becomes more exact. I guess one can try to convince themselves that for small enough <span class="math inline">\(\tau\)</span> this is a reasonable approximation. <span class="math display">\[\nabla_\theta \text{Bernoulli}(\sigma(g(\theta))) := \nabla_\theta \text{RelaxedBernoulli}(\sigma(g(\theta)))\]</span></p>
<p>I personally consider these methods mathematically unsound, and advise you to refrain form using them (unless you know what you’re doing – then tell me what was your rationale for this particular choice).</p>
<h2 id="experiments">Experiments</h2>
<p>To please your eyes with some experimental plots, I used all these relaxations to train a Discrete Variational Autoencoder on MNIST. I used a single stochastic layer (shallow encoder) with 3 layers in between, and evaluated the result using <a href="/posts/2016-07-14-neural-variational-importance-weighted-autoencoders.html">10,000-sample lower bound</a>, which I assume approximates marginal log-likelihood relatively well.</p>
<div class="post-image">
<p><img src="/files/dvae-experiments.png" /></p>
</div>
<p>First, the Relaxed ELBO pane tells us all methods have no problems optimizing their target objective. However, one should refrain from comparing them according to this number, since these are different relaxations, they are not even lower bounds for a marginal log-likelihood for some relaxed model <a href="#fn11" class="footnoteRef" id="fnref11"><sup>11</sup></a>.</p>
<p>Instead, let’s look at the second and third panes. The second shows the Evidence Lower Bound for the original discrete model, and the third shows the gap between the discrete ELBO and the relaxed one. First, the marginal likelihood estimation agrees with the discrete ELBO – that’s a good thing and means nothing bad is happening to the KL-divergence between the true posterior and the approximate one.</p>
<p>You can see that the green line – the logistic distribution-based relaxation with unit temperature – actually diverges. This is a direct consequence of the chosen temperature: for <span class="math inline">\(\tau = 1\)</span> the density of relaxed random variables is unimodal and has its mode somewhere in the interior of the [0, 1] interval. This leads the network to adjust to the values around this mode, which poorly represent test time samples.</p>
<p>As you can see the normal distribution with temperature 0.4 works very well at first, but then starts diverging. This might be because of the problems of Gaussian distribution we discussed earlier: namely, it has zero mass at exact 0 and 1: the network might adapt to having some small non-zero elements, and be very surprised to see them zeroed out completely at the testing time.</p>
<p>The asymptotic reparametrization seems to be suffering from the inaccuracy of the CLT approximation. Latent code of 200 units is big, but not infinitely big for the approximation to be exact. Unfortunately, there’s no hyperparameter to adjust the approximation quality. Moreover, the relaxation gap keeps increasing.</p>
<p>The Noise-Relaxed model performs poorly compared to other methods. This might be a result of a poor hyperparameter management: recall that we introduce continuous noise that’s missing at the test time. To make the net adjust to the test time regime we need to make the noise approach <span class="math inline">\(\delta(\tau - 1)\)</span>. However, if you approach it too fast, you’ll see the relaxation gap decreasing, but the learning wouldn’t progress much as the variance of your gradients would be too high.</p>
<p>Straight through estimators perform surprisingly good: not as good as the Gumbel-Softmax relaxation, but better than what you’d expect from a mathematically unsound method.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post we talked about continuous relaxations of discrete models. These relaxations allow you to invoke the reparametrization trick to backpropagate the gradients, but this comes at a cost of bias. The gradients are no more unbiased as you essentially optimize different objective. In the next blogpost we will return back to where we started – to the score-function estimator, and try to reduce its variance keeping zero bias.</p>
<p>By the way, if you’re interested, the code for DVAE implementations is <a href="https://github.com/artsobolev/dvaes">available on my GitHub</a>. However, I should warn you: it’s still work-in-progress, and lacks any documentation. I’ll add it one I’m finished with the series.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>There’s no established name for such technique. Gaussian dropout has been proposed in the original <a href="http://jmlr.org/papers/v15/srivastava14a.html">Dropout paper</a>, but their equivalence under CLT was not stated formally until the <a href="http://proceedings.mlr.press/v28/wang13a.html">Fast dropout training</a> paper. Nor has anyone applied this trick to, say, Discrete Variational Autoencoders. <strong>UPD</strong>: I just discovered an <a href="https://openreview.net/forum?id=BySRH6CpW&noteId=BySRH6CpW">ICLR2018 submission</a> using this technique to learn discrete weights.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p><span class="math inline">\(H\)</span> is called a Heavyside function, and you can define it’s behavior at 0 as you like, it doesn’t matter in most cases as it’s a zero-measure point.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Unless you use generalized functions from the distributions theory (do not confuse with probability distributions!). However, that’s a whole different world, and one should be careful doing derivations there.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>Consider a random variable <span class="math inline">\(U\)</span> with its support being <span class="math inline">\(\mathbb{R}\)</span>. Then <span class="math inline">\(\mathbb{P}(H(U + c) = 1) = \mathbb{P}(U > -c) = 1 - \Phi(-c)\)</span> where <span class="math inline">\(\Phi\)</span> is CDF of <span class="math inline">\(U\)</span>. Then if you want this probability to be equal to some value <span class="math inline">\(p\)</span>, you should shift <span class="math inline">\(U\)</span> by <span class="math inline">\(c = -\Phi^{-1}(1-p)\)</span><a href="#fnref4">↩</a></p></li>
<li id="fn5"><p>This sounds a lot like Dirac’s delta function, which is a well-known distributional derivative of the Heavyside function.<a href="#fnref5">↩</a></p></li>
<li id="fn6"><p>An alternative derivation of this fact can be seen through a property of exponentially distributed random variables.<a href="#fnref6">↩</a></p></li>
<li id="fn7"><p>This is clearly a special case of the general one with <span class="math inline">\(U\)</span> being <span class="math inline">\(\text{Logistic}(0, 1)\)</span> and <span class="math inline">\(\Phi\)</span> being its CDF.<a href="#fnref7">↩</a></p></li>
<li id="fn8"><p>Let’s say a word or two on how to sample Gumbels and Logistics. For both of them one can analytically derive and invert the CDF, and hence come up with formulas to transform samples from uniform distribution. For <span class="math inline">\(\text{Gumbel}(\mu, \beta)\)</span> distribution this transformation is <span class="math inline">\(u \mapsto \mu - \beta \log \log \tfrac{1}{u}\)</span>, for <span class="math inline">\(\text{Logistic}(\mu, \beta)\)</span> it’s <span class="math inline">\(u \mapsto \mu + \beta \sigma^{-1}(u)\)</span>. Hence if you use the general case, you’d need to generate 2 random variables, whereas in binary case you can use just one. I guess in general <span class="math inline">\(K\)</span>-variate case you <em>could theoretically</em> use just <span class="math inline">\(K-1\)</span> random variables, but that’d induce some possibly complicated dependence structure on them, and thus unnecessary complicate the sampling process.<a href="#fnref8">↩</a></p></li>
<li id="fn9"><p>Consider a random variable <span class="math inline">\(U\)</span> with its support being <span class="math inline">\(\mathbb{R}\)</span>. Then <span class="math inline">\(\mathbb{P}(H(U + c) = 1) = \mathbb{P}(U > -c) = 1 - \Phi(-c)\)</span> where <span class="math inline">\(\Phi\)</span> is CDF of <span class="math inline">\(U\)</span>. Then if you want this probability to be equal to some value <span class="math inline">\(p\)</span>, you should shift <span class="math inline">\(U\)</span> by <span class="math inline">\(c = -\Phi^{-1}(1-p)\)</span><a href="#fnref9">↩</a></p></li>
<li id="fn10"><p>There’s an alternative way to write the sampling formula <span class="math display">\[
\mathbb{Q}_\zeta(\rho) = \begin{cases}
0, & \rho \le p(z=0) \\
\mathbb{Q}_\tau \left( \frac{\rho - p(z = 0)}{1 - p(z = 0)} \right), & \text{otherwise}
\end{cases}
\]</span> This formula has less branching, and thus is more efficient to compute. <br/> Moreover, in general one can avoid inverting the CDF by noticing that <span class="math inline">\(y = [\rho < \mathbb{P}(z=0)] \mathbb{Q}_{\tau_0}\left(\tfrac{\rho}{\mathbb{P}(z=0)}\right) + [\rho > \mathbb{P}(z=0)] \mathbb{Q}_{\tau_1}\left(\tfrac{1-\rho}{1-\mathbb{P}(z=0)}\right)\)</span> for <span class="math inline">\(\rho \sim U[0,1]\)</span> has exactly the same distribution as the marginal <span class="math inline">\(p(\zeta)\)</span>.<a href="#fnref10">↩</a></p></li>
<li id="fn11"><p>Authors of the Concrete Distribution paper did also relax the KL-divergence term, which means they optimized lower bound for marginal likelihood in a different model, however, it’s reported to lead to better results.<a href="#fnref11">↩</a></p></li>
</ol>
</div><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/rm0asUsdwas" height="1" width="1" alt=""/>Sat, 28 Oct 2017 00:00:00 UThttp://artem.sobolev.name/posts/2017-10-28-stochastic-computation-graphs-discrete-relaxations.htmlArtemhttp://artem.sobolev.name/posts/2017-10-28-stochastic-computation-graphs-discrete-relaxations.htmlStochastic Computation Graphs: Continuous Case
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/eHYnuBerCMs/2017-09-10-stochastic-computation-graphs-continuous-case.html
<p>Last year I covered <a href="/tags/modern%20variational%20inference%20series.html">some modern Variational Inference theory</a>. These methods are often used in conjunction with Deep Neural Networks to form deep generative models (VAE, for example) or to enrich deterministic models with stochastic control, which leads to better exploration. Or you might be interested in amortized inference.</p>
<p>All these cases turn your computation graph into a stochastic one – previously deterministic nodes now become random. And it’s not obvious how to do backpropagation through these nodes. In <a href="/tags/stochastic%20computation%20graphs%20series.html">this series</a> I’d like to outline possible approaches. This time we’re going to see why general approach works poorly, and see what we can do in a continuous case.</p>
<!--more-->
<p>First, let’s state the problem more formally. Consider the approximate inference objective:</p>
<p><span class="math display">\[
\mathbb{E}_{q(z|x)} \log \frac{p(x, z)}{q(z|x)} \to \max_{q(z|x)}
\]</span></p>
<p>or a reinforcement learning objective:</p>
<p><span class="math display">\[
\mathbb{E}_{\pi(a|s)} R(a, s) \to \max_{\pi}
\]</span></p>
<p>In the following I’ll use the following notation for the objective:</p>
<p><span class="math display">\[
\mathcal{F}(\theta) = \mathbb{E}_{p(x \mid \theta)} f(x) \to \max_{\theta}
\]</span></p>
<p>In that case the (stochastic) computation graph (SCG) can be represented in the following form <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>:</p>
<div class="post-image">
<p><img src="/files/scg-through-randomness.png" style="width: 400px" /></p>
</div>
<p>Here <span class="math inline">\(\theta\)</span>, in double circle is a set of tunable parameters, blue rhombus is a stochastic node that takes on random values, but their distribution depends on <span class="math inline">\(\theta\)</span> (maybe through some complicated but known function, like a neural network), and orange circle is the value we’re maximizing. In order to estimate the <span class="math inline">\(\mathcal{F}(\theta)\)</span> using such graph, you just take your <span class="math inline">\(\theta\)</span>s, compute <span class="math inline">\(x\)</span>’s distribution, take as many samples from it as you can get, compute <span class="math inline">\(f(x)\)</span> for each one, and then just average them.</p>
<p>How do we maximize it though? The workhorse of optimization in modern deep learning is the Stochastic Gradient Descent (or, in our case, Ascent), and if we want to apply it in our case, all we need to compute is an (preferably unbiased and low-variance) estimate of the gradient of the objective <span class="math inline">\(\nabla_\theta \mathcal{F}(\theta)\)</span> w.r.t. <span class="math inline">\(\theta\)</span>. This is seemingly easy for anyone familiar with basic calculus:</p>
<p><span class="math display">\[
\begin{align*}
\nabla_{\theta} \mathcal{F}(\theta)
& = \nabla_{\theta} \mathbb{E}_{p(x \mid \theta)} f(x)
= \nabla_{\theta} \int p(x \mid \theta) f(x) dx \\
& = \int \nabla_{\theta} p(x \mid \theta) f(x) dx
= \int \nabla_{\theta} \log p(x \mid \theta) f(x) p(x \mid \theta) dx \\
& = \mathbb{E}_{p(x \mid \theta)} \nabla_{\theta} \log p(x \mid \theta) f(x) dx
\end{align*}
\]</span></p>
<p>There you have it! Just sample some <span class="math inline">\(x \sim p(x \mid \theta)\)</span>, calculate <span class="math inline">\(f(x)\)</span> using this sample, and then multiply the result by the gradient of log density – here’s your unbiased estimate of the true gradient. However, in practice people have observed that this estimator (called the <strong>score-function estimator</strong>, and also <strong>REINFORCE</strong> in reinforcement learning literature <a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>) has large variance, making it impractical for high-dimensional <span class="math inline">\(x\)</span>.</p>
<p>And it kinda makes sense. Look at the estimator. It does not use gradient information of <span class="math inline">\(f\)</span>, so it does not have any guidance where to move <span class="math inline">\(p(x|\theta)\)</span> to make the expectation <span class="math inline">\(\mathcal{F}(\theta)\)</span> higher. Instead, it tries many random <span class="math inline">\(x\)</span>s, for each sample it takes the direction one should go to make this sample more probable, and weights these directions according to the magnitude of <span class="math inline">\(f(x)\)</span>. When averaged, this gives you true direction to maximize the objective, but it’s hard to randomly stumble upon good <span class="math inline">\(x\)</span> using just a few samples (especially early in training, or in high-dimensional spaces), hence high variance.</p>
<p>This manifests a necessity of either ways to improve the variance of such estimator, or different, more efficient approaches. In the following we will consider both.</p>
<h2 id="reparametrization-trick">Reparametrization trick</h2>
<p>Being perfectly aware of the aforementioned limitation, <a href="https://arxiv.org/abs/1312.6114">Kingma et. al</a> used a smart trick in their Variational Autoencoder paper. Basically, the idea is the following: if some random variables can be decomposed into combinations of other random variables, can we transform our stochastic computation graph such that we don’t need to backpropagate through randomness, and have stochasticity injected into the model as independent noise?</p>
<p>Turns out, we can. Namely, for any gaussian random variable <span class="math inline">\(x \sim \mathcal{N}(\mu, \sigma^2)\)</span> we can decompose it into affine transformation of some independent standard gaussian noise: <span class="math inline">\(x = \mu + \sigma \varepsilon\)</span> <a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a> where <span class="math inline">\(\varepsilon \sim \mathcal{N}(0, 1)\)</span> (we reparametrize the distribution, hence the name of the trick).</p>
The SCG then becomes
<div class="post-image">
<p><img src="/files/scg-gaussian-reparametrization.png" style="width: 400px" /></p>
</div>
<p>Here pink arrows denote the “flow” of backpropagation: notice that we do not encounter any sampling nodes along the way – hence we don’t need to use the high-variance score-function estimator. We can even have many layers of stochastic nodes – after the reparametrization we don’t need to differentiate through random samples, we only mix them in. Let us look at the formulas.</p>
<p><span class="math display">\[
\nabla_\theta \mathbb{E}_{p(x|\theta)} f(x)
= \nabla_\theta \mathbb{E}_{p(\varepsilon)} f(\mu(\theta) + \sigma(\theta) \varepsilon)
= \mathbb{E}_{p(\varepsilon)} \nabla_\theta f(\mu(\theta) + \sigma(\theta) \varepsilon)
\]</span></p>
<p>Notice that this time we do use the gradient of <span class="math inline">\(f\)</span>! This is the crucial difference between this estimator, and the score-function one: in the later we were averaging random directions using their “scores”, whereas here we learn an affine transformation of independent noise such that transformed samples lie in an area that has large <span class="math inline">\(f(x)\)</span>. Gradient information of <span class="math inline">\(f\)</span> tells us where to move samples <span class="math inline">\(x\)</span>, and we do so by adjusting <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span>.</p>
<p>Okay, so it looks like a great method, why not use it everywhere? The problem is that even though you can always transform a uniformly distributed random variable into any other, it’s not always computationally easy <a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a>. For some distributions (Dirichlet, for example <a href="#fn5" class="footnoteRef" id="fnref5"><sup>5</sup></a>) we simply don’t know any effective transformations from parameter-free random variables.</p>
<h2 id="generalized-reparametrization-trick">Generalized reparametrization trick</h2>
<p>The reparametrization trick can be seen as a transformation <span class="math inline">\(\mathcal{T}(\varepsilon | \theta)\)</span> of some independent noise into a desired random variable. Conversely, if <span class="math inline">\(\mathcal{T}\)</span> is invertible, <span class="math inline">\(\mathcal{T}^{-1}(x | \theta)\)</span> is a “whitening” / “standardizing” transformation: it takes some random variable that depends on parameters <span class="math inline">\(\theta\)</span> and makes it parameter-independent.</p>
<p>What if we find a transformation that maybe does not whiten <span class="math inline">\(x\)</span> completely, but still significantly reduce its dependence on <span class="math inline">\(\theta\)</span>? This is the core idea of the <a href="https://arxiv.org/abs/1610.02287">The Generalized Reparameterization Gradient</a> paper. In that case <span class="math inline">\(\varepsilon\)</span> would still depend on <span class="math inline">\(\theta\)</span>, but hopefully only “weakly”.</p>
<p><span class="math display">\[
\begin{align*}
\nabla_\theta \mathbb{E}_{p(x|\theta)} f(x)
&= \nabla_\theta \mathbb{E}_{p(\varepsilon|\theta)} f(\mathcal{T}(\varepsilon \mid \theta)) \\
&= \underbrace{\mathbb{E}_{p(\varepsilon|\theta)} \nabla_\theta f(\mathcal{T}(\varepsilon \mid \theta))}_{g^\text{rep}}
+ \underbrace{\mathbb{E}_{p(\varepsilon|\theta)} \nabla_\theta \log p(\varepsilon|\theta) f(\mathcal{T}(\varepsilon \mid \theta))}_{g^\text{corr}}
\end{align*}
\]</span></p>
<p>Here <span class="math inline">\(g^\text{rep}\)</span> is our usual reparametrized gradient, and <span class="math inline">\(g^\text{corr}\)</span> is the score-function part of it. It’s easy to see that varying the transformation <span class="math inline">\(\mathcal{T}\)</span> allows you to interpolate between the fully reparametrized gradients and the fully score-function-based gradients. Indeed, if <span class="math inline">\(\mathcal{T}\)</span> whitens <span class="math inline">\(x\)</span> completely, then <span class="math inline">\(p(\varepsilon|\theta)\)</span> is independent of <span class="math inline">\(\theta\)</span> and <span class="math inline">\(\nabla_\theta \log p(\varepsilon|\theta) = 0\)</span>, leaving us with <span class="math inline">\(g^\text{rep}\)</span> only. If, however, <span class="math inline">\(\mathcal{T}\)</span> is an identity map, which does not do anything, then <span class="math inline">\(\nabla_\theta f(\mathcal{T}(\varepsilon \mid \theta)) = \nabla_\theta f(\varepsilon) = 0\)</span>, and we recover the score-function estimator.</p>
<p>This formula looks great, but it requires us to know the distribution of <span class="math inline">\(\mathcal{T}^{-1}(x \mid \theta)\)</span> to sample <span class="math inline">\(\varepsilon\)</span> from. It’s more convenient to reformulate the gradient in terms of samples from <span class="math inline">\(p(x|\theta)\)</span>, which we can do after some algebraic manipulations:</p>
<p><span class="math display">\[
\begin{align*}
g^\text{rep}
=& \mathbb{E}_{p(x|\theta)} \nabla_x f(x) \nabla_\theta \mathcal{T}(\varepsilon \mid \theta)
\\
g^\text{corr}
=& \mathbb{E}_{p(x|\theta)} \Bigl[\nabla_\theta \log p(x|\theta) + \nabla_x \log p(x|\theta) \nabla_\theta \mathcal{T}(\varepsilon \mid \theta) \\& \quad\quad\quad\quad+ \nabla_\theta \log |\text{det} \nabla_\varepsilon \mathcal{T}(\varepsilon \mid \theta)|\Bigr] f(x)
\\
& \text{where } \varepsilon = \mathcal{T}^{-1}(x \mid \theta)
\end{align*}
\]</span></p>
<p>In this formulation we sample <span class="math inline">\(x\)</span> as usual, pass it through the “whitening” transformation <span class="math inline">\(\mathcal{T}^{-1}(x | \theta)\)</span> to obtain sample <span class="math inline">\(\varepsilon\)</span>, and substitute these variables into gradient constituents. One can also see everything but <span class="math inline">\(f(x) \nabla_\theta \log p(x \mid \theta)\)</span> as a <em>control variate</em> (we’ll talk about these later in the series) that uses <span class="math inline">\(f\)</span>’s gradient information and hence can be expected to be quite powerful.</p>
<p>The last question is which transformation to choose? The formulas authors propose to use usual standardizing transformation, i.e. to subtract the mean and divide by standard deviation. This choice is motivated by the following: a) it’s computationally convenient, recall that we need both <span class="math inline">\(\mathcal{T}\)</span> and <span class="math inline">\(\mathcal{T}^{-1}\)</span> <a href="#fn6" class="footnoteRef" id="fnref6"><sup>6</sup></a>; b) it makes first two moments independent of <span class="math inline">\(\theta\)</span>, which is some sense makes resulting variable “weakly” dependent on it.</p>
<h3 id="rejection-sampling-perspective-casmls-citation">Rejection sampling perspective <a href="#fn7" class="footnoteRef" id="fnref7"><sup>7</sup></a></h3>
<p>Another interesting perspective on generalized reparametrization comes from the following thought: there are efficient samplers for many distributions, can we somehow backpropagate through the sampling process? This is what authors of the <a href="http://proceedings.mlr.press/v54/naesseth17a.html">Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms</a> paper decided to find out.</p>
<p>You want to sample some distribution <span class="math inline">\(p(x|\theta)\)</span>, but can’t compute and invert its CDF, what to do then? You can use <a href="https://en.wikipedia.org/wiki/Rejection_sampling">rejection sampling</a> procedure. Basically, you take some proposal distribution <span class="math inline">\(r(x \mid \theta)\)</span> that is easy to sample from, find a scaling factor <span class="math inline">\(M_\theta\)</span> such that scaled proposal is uniformly higher than the target density for all <span class="math inline">\(x\)</span>: <span class="math inline">\(M_\theta r(x|\theta) \ge p(x|\theta) \forall x\)</span>. Then you generate points randomly under the scaled <span class="math inline">\(M_\theta r(x|\theta)\)</span> curve, and keep only those that are also below the <span class="math inline">\(p(x|\theta)\)</span> curve:</p>
<ol style="list-style-type: decimal">
<li>Generate <span class="math inline">\(x \sim r(x|\theta)\)</span>.</li>
<li>Generate <span class="math inline">\(u \sim U[0, M_\theta r(x|\theta)]\)</span>.</li>
<li>If <span class="math inline">\(u > p(x|\theta)\)</span>, repeat from step 1, else return <span class="math inline">\(x\)</span>.</li>
</ol>
<p>Moreover, at step 1 we can use some transformation <span class="math inline">\(\mathcal{T}(\varepsilon | \theta)\)</span> of the sample <span class="math inline">\(\varepsilon \sim r(\varepsilon)\)</span> (provided the scaled density of transformed variable is uniformly higher). This is how <code>numpy</code> generates Gamma variables: if samples <span class="math inline">\(\varepsilon\)</span> from standard Gaussian, transforms the sample through some function <span class="math inline">\(x = \mathcal{T}(\varepsilon|\theta)\)</span>, and then accepts it with probability <span class="math inline">\(a(x|\theta)\)</span> <a href="#fn8" class="footnoteRef" id="fnref8"><sup>8</sup></a>.</p>
<p>Let’s find the density of <span class="math inline">\(\varepsilon\)</span>s that lead to acceptance of corresponding <span class="math inline">\(x\)</span>s. Some calculations (provided in supplementary) show that</p>
<p><span class="math display">\[
p(\varepsilon|\theta) = M_\theta r(\varepsilon) a(\mathcal{T}(\varepsilon|\theta)|\theta)
\]</span></p>
<p>Note that this density is easy to calculate, and if we reparametrize generated samples <span class="math inline">\(\varepsilon\)</span>, we’d get samples <span class="math inline">\(x\)</span> we’re looking for <span class="math inline">\(x = \mathcal{T}(\varepsilon|\theta)\)</span>. Hence the objective becomes</p>
<p><span class="math display">\[
\mathcal{F}(\theta) = \mathbb{E}_{p(\varepsilon|\theta)} f(\mathcal{T}(\varepsilon|\theta))
\]</span></p>
<p>Differentiating it w.r.t. <span class="math inline">\(\theta\)</span> gives <span class="math display">\[
\nabla_\theta \mathcal{F}(\theta)
= \mathbb{E}_{p(\varepsilon|\theta)} \nabla_\theta f(\mathcal{T}(\varepsilon|\theta))
+ \mathbb{E}_{p(\varepsilon|\theta)} f(\mathcal{T}(\varepsilon|\theta)) \nabla_\theta \log p(\varepsilon|\theta)
\]</span></p>
<p>Now compare these addends to the <span class="math inline">\(g^\text{rep}\)</span> and <span class="math inline">\(g^\text{corr}\)</span> from the previous section. You can see that they’re <em>exactly</em> the same!</p>
<p>In the previous section we choose the transformation <span class="math inline">\(\mathcal{T}^{-1}\)</span> such that it tries to remove at least some dependency on <span class="math inline">\(\theta\)</span> from samples <span class="math inline">\(x\)</span>. This section allows us to view the same method from the other end: if you have some independent noise <span class="math inline">\(\varepsilon\)</span> and a transformation <span class="math inline">\(\mathcal{T}\)</span> that makes the samples look like samples from the target density <span class="math inline">\(p(x|\theta)\)</span>, then you can add some rejection sampling on top to compensate for the mismatch, and still enjoy the lower variance of gradient estimate.</p>
<h2 id="a-very-simple-example">A (very) simple example</h2>
<p>Let’s see how much variance reduction the reparametrization trick actually gets us in a very simple problem. Namely, let’s try to minimize expected square of a Gaussian random variable <a href="#fn9" class="footnoteRef" id="fnref9"><sup>9</sup></a> (shifted by some positive constant <span class="math inline">\(c\)</span>, we will see later how it comes into play):</p>
<p><span class="math display">\[
\mathcal{F}(\mu, \sigma) = \mathbb{E}_{x \sim \mathcal{N}(\mu, \sigma^2)} [x^2 + c] \to \min_{\mu, \sigma}
\]</span></p>
<p>First, reparametrized objective is</p>
<p><span class="math display">\[
\mathcal{F}^\text{rep}(\mu, \sigma) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, 1)} (\mu + \sigma \varepsilon)^2
\]</span></p>
<p>And its stochastic gradients are <span class="math display">\[
\hat \nabla_\mu \mathcal{F}^\text{rep}(\mu, \sigma) = 2 (\mu + \sigma \varepsilon) \\
\hat \nabla_\sigma \mathcal{F}^\text{rep}(\mu, \sigma) = 2 \varepsilon (\mu + \sigma \varepsilon)
\]</span></p>
<p>The score-function-based gradients are the following:</p>
<p><span class="math display">\[
\hat \nabla_\mu \mathcal{F}^\text{SF}(\mu, \sigma) = \frac{\varepsilon}{\sigma} \left((\mu + \sigma \varepsilon)^2 + c\right) \\
\hat \nabla_\sigma \mathcal{F}^\text{SF}(\mu, \sigma) = \frac{\varepsilon^2 - 1}{\sigma} \left((\mu + \sigma \varepsilon)^2 + c\right)
\]</span></p>
<p>Both estimators are unbiased, but what are the variances of these estimators? WolframAlpha suggests</p>
<p><span class="math display">\[
\begin{align*}
\mathbb{D}\left[\hat \nabla_\mu \mathcal{F}^\text{SF}(\mu, \sigma)\right] &= \frac{(\mu^2 + c)^2}{\sigma^2} + 15 \sigma^2 + 14 \mu^2 + 6 c,
\\
\mathbb{D}\left[\hat \nabla_\mu \mathcal{F}^\text{rep}(\mu, \sigma)\right] &= 4 \sigma^2
\\
\mathbb{D}\left[\hat \nabla_\sigma \mathcal{F}^\text{SF}(\mu, \sigma)\right] &= \frac{2 (c + \mu^2)^2}{\sigma^{2}} + 60 \mu^{2} + 74 \sigma^{2} + 20 c,
\\
\mathbb{D}\left[\hat \nabla_\sigma \mathcal{F}^\text{rep}(\mu, \sigma)\right] &= 4 \mu^2 + 8 \sigma^2
\end{align*}
\]</span></p>
<p>You can see that not only the score-function-based gradient always has a higher variance, its variance actually explodes as we approach <span class="math inline">\(\mu = 0, \sigma = 0\)</span> (unless <span class="math inline">\(c = 0\)</span> and <span class="math inline">\(\mu\)</span> is small enough to counter <span class="math inline">\(\sigma\)</span>)! This is due to the fact that as your variance shrinks, points somewhat far away from the mean get very tiny probabilities, hence score-function-based gradients thinks it should try very hard to make them more probable.</p>
<p>You might be wodering, how would generalized reparametrization work? If we consider <span class="math inline">\(\mathcal{T}^{-1}(x|\mu,\sigma) = x - \mu\)</span> transformation (it “whitens” first moment only), then we obtain the following gradient estimates:</p>
<p><span class="math display">\[
\hat \nabla_\mu \mathcal{F}^\text{g-rep}(\mu, \sigma) = 2 (\mu + \varepsilon) \\
\hat \nabla_\sigma \mathcal{F}^\text{g-rep}(\mu, \sigma) = \frac{\varepsilon^2 - \sigma^2}{\sigma^3} (\mu + \varepsilon)^2
\]</span></p>
<p>This is the reparametrized gradient w.r.t. <span class="math inline">\(\mu\)</span> and score-function gradient w.r.t. <span class="math inline">\(\sigma\)</span> (notice that <span class="math inline">\(\varepsilon \sim \mathcal{N}(0, \sigma^2)\)</span> in this case). I don’t think this is an interesting scenario, so instead we’ll consider a weird-looking second-moment-whitening transformation <span class="math inline">\(\mathcal{T}^{-1}(x|\mu,\sigma) = \frac{x - \mu}{\sigma} + \mu\)</span> with <span class="math inline">\(\mathcal{T}(\varepsilon|\mu,\sigma) = \sigma (\epsilon - \mu) + \mu\)</span>. The gradients for this transformation are:</p>
<p><span class="math display">\[
\begin{align*}
\hat \nabla_\mu \mathcal{F}^\text{g-rep}(\mu, \sigma) &=
\left(c + \left(\mu + \sigma \left(\epsilon - \mu\right)\right)^{2}\right) \left(\epsilon - \mu\right) - 2 \left(\mu + \sigma \left(\epsilon - \mu\right)\right) \left(\sigma - 1\right)
\\
\hat \nabla_\sigma \mathcal{F}^\text{g-rep}(\mu, \sigma) &=
2 \left(\epsilon - \mu\right) \left(\mu + \sigma \left(\epsilon - \mu\right)\right)
\end{align*}
\]</span></p>
<p>You can already see that the magnitude of the gradients does not explode when the variance <span class="math inline">\(\sigma\)</span> goes to zero. Let’s check the variances:</p>
<p><span class="math display">\[
\begin{align*}
\mathbb{D}\left[\hat \nabla_\mu \mathcal{F}^\text{g-rep}(\mu, \sigma)\right] &=
(\mu^2 + c)^{2} + 2 c \sigma^{2} + 4 c \sigma + 10 \mu^{2} \sigma^{2} + 4 \mu^{2} \sigma + 7 \sigma^{4} + 4 \sigma^{3} + 4 \sigma^{2}
\\
\mathbb{D}\left[\hat \nabla_\sigma \mathcal{F}^\text{g-rep}(\mu, \sigma)\right] &=
4 \mu^{2} + 8 \sigma^{2}
\end{align*}
\]</span></p>
<p>First, we see that the variance of gradient w.r.t. <span class="math inline">\(\sigma\)</span> has become identical to the variance of the reparametrized case. Second, we can confirm that the variance does not explode as we approach the optimum.</p>
<div class="post-image">
<p><img src="/files/scg-example.png" /> Gen Rep 1 is a generalized reparametrization with only 1st moment whitened out,<br/> Gen Rep 2 – with only the second one</p>
</div>
<p>The simulation plots clearly show that score-function-based gradients and the first generalized reparametrization fail to converge, which is in line with our variance analysis. The second generalized reparametrization, however, performs just as good, as the full reparametrization, even though it does have higher variance.</p>
<p>All the code I wrote while working on this post can be found <a href="https://gist.github.com/artsobolev/fec7c052d712889ef69656825634c4d4">here</a>. Though it’s quite messy, I warned you.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We have discussed tricks that make Stochastic Variational Inference with continuous latent variables computationally feasible. However, quite often we’re interested in models with discrete latent variables – for example, we might be interested in a model that dynamically chooses one computation path or another, essentially controlling how much computation time to spend on a given sample. Or, train a GAN for textual data – we need a way to backpropagate through discriminator’s inputs.</p>
<p>We’ll talk about such methods in the next post.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>In this post I’ll only consider models with only one stochastic “layer”, but roughly the same math applies in more general cases.<a href="#fnref1">↩</a></p></li>
<li id="fn2"><p>Sometimes people also call this <strong>log-derivative trick</strong>, however, in my opinion, log-derivate trick is about a derivation technique, namely the fact that <span class="math inline">\(\nabla_\theta p(x\mid\theta) = p(x\mid\theta) \nabla_\theta \log p(x\mid\theta)\)</span>, and it’s a bit incorrect to call the estimator this way.<a href="#fnref2">↩</a></p></li>
<li id="fn3"><p>Equality here means both sides have the same distribution.<a href="#fnref3">↩</a></p></li>
<li id="fn4"><p>We know that for <span class="math inline">\(X \sim p(x)\)</span> with c.d.f. <span class="math inline">\(F(x)\)</span> we have <span class="math inline">\(F(X) \sim U[0, 1]\)</span>, hence <span class="math inline">\(X = F^{-1}(u)\)</span> for standard uniform <span class="math inline">\(u \sim U[0, 1]\)</span>, so there always exist a (smooth, if <span class="math inline">\(x\)</span> is continuous) transformation from standard uniform noise to any other distribution. However, computing CDF function often requires expensive integration, which is quite often infeasible.<a href="#fnref4">↩</a></p></li>
<li id="fn5"><p>Original VAE paper lists Dirichlet distribution among ones that have effective reparametrizations, however that’s actually not the case, as you still need to generate parametrized Gamma variables.<a href="#fnref5">↩</a></p></li>
<li id="fn6"><p>Technically, you could derive the density <span class="math inline">\(p(\varepsilon|\theta)\)</span> and to sample from it – this way you’d not need the inverse <span class="math inline">\(\mathcal{T}^{-1}\)</span>. However, it’s not easy in general.<a href="#fnref6">↩</a></p></li>
<li id="fn7"><p>This section is largely based on the <a href="https://casmls.github.io/general/2017/04/25/rsvi.html">Reparameterization Gradients through Rejection Sampling Algorithms</a> blogpost.<a href="#fnref7">↩</a></p></li>
<li id="fn8"><p>Normally that’d be just <span class="math inline">\(a(x|\theta) = \tfrac{p(x|\theta)}{M_\theta r(x|\theta)}\)</span>, however, if we don’t have <span class="math inline">\(r(x|\theta)\)</span> readily available, we can express the acceptance probability in terms of <span class="math inline">\(\varepsilon\)</span>: <span class="math display">\[a(\varepsilon|\theta) = \tfrac{p(\mathcal{T}(\varepsilon|\theta)|\theta) |\text{det} \nabla_\varepsilon \mathcal{T}(\varepsilon|\theta)|}{M_\theta r(\varepsilon)}\]</span><a href="#fnref8">↩</a></p></li>
<li id="fn9"><p>One might argue that our approach is flawed, as the optimal distribution is <span class="math inline">\(\mathcal{N}(0, 0)\)</span> which is not a valid distribution. However, here we’re just interested in the gradient dynamics as we approach this optimum.<a href="#fnref9">↩</a></p></li>
</ol>
</div><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/eHYnuBerCMs" height="1" width="1" alt=""/>Sun, 10 Sep 2017 00:00:00 UThttp://artem.sobolev.name/posts/2017-09-10-stochastic-computation-graphs-continuous-case.htmlArtemhttp://artem.sobolev.name/posts/2017-09-10-stochastic-computation-graphs-continuous-case.htmlICML 2017 Summaries
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/Da31J1HffXE/2017-08-14-icml-2017.html
<p>Just like with <a href="/posts/2016-12-31-nips-2016-summaries.html">NIPS last year</a>, here’s a list of ICML’17 summaries (updated as I stumble upon new ones)</p>
<!--more-->
<ul>
<li><a href="https://olgalitech.wordpress.com/tag/icml2017/">Random ML&Datascience musing</a> by <a href="https://twitter.com/OlgaLiakhovich">Olga Liakhovich</a>
<ul>
<li><a href="https://olgalitech.wordpress.com/2017/08/07/icml-and-my-notes-on-day-1/">ICML and my notes on day 1</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/07/brain-endurance-or-day-2-at-icml-2017/">Brain endurance or Day 2 at ICML 2017</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/11/day-3-at-icml-2017-musical-rnns/">Day 3 at ICML 2017 — musical RNNs</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/11/day-4-at-icml-2017-more-adversarial-nns/">Day 4 at ICML 2017 — more Adversarial NNs</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/11/day-5-6-at-icml-all-done/">Day 5 & 6 at ICML. All done.</a></li>
</ul></li>
<li><a href="https://keunwoochoi.wordpress.com/2017/08/14/machine-learning-for-music-discovery-workshop-icml2017-sydney/">Machine learning for music discovery (workshop)</a> by <a href="https://twitter.com/keunwoochoi">Keunwoo Choi</a></li>
<li><a href="https://gmarti.gitlab.io/ml/2017/08/11/ICML-2017-field-reports.html">Field reports from ICML 2017 in Sydney</a> by <a href="https://twitter.com/GautierMarti1">Gautier Marti’s Wander</a></li>
<li><a href="http://www.machinedlearnings.com/2017/08/icml-2017-thoughts.html">ICML 2017 Thoughts</a> by <a href="https://twitter.com/PaulMineiro">Paul Mineiro</a></li>
<li><a href="http://mattdickenson.com/2017/08/17/icml-2017-recap/">ICML 2017 Recap</a> by <a href="https://twitter.com/mcdickenson">Matt Dickenson</a></li>
<li><a href="https://www.bulletproof.net.au/international-conference-machine-learning-2017-part-one/">International Conference for Machine Learning 2017 – Part One</a> + <a href="https://www.bulletproof.net.au/international-conference-machine-learning-2017-part-two/">International Conference for Machine Learning 2017 – Part Two</a></li>
</ul><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/Da31J1HffXE" height="1" width="1" alt=""/>Mon, 14 Aug 2017 00:00:00 UThttp://artem.sobolev.name/posts/2017-08-14-icml-2017.htmlArtemhttp://artem.sobolev.name/posts/2017-08-14-icml-2017.htmlOn No Free Lunch Theorem and some other impossibility results
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/32qnGEkYoTM/2017-07-23-no-free-lunch-theorem.html
<p>The more I talk to people online, the more I hear about the famous No Free Lunch Theorem (NFL theorem). Unfortunately, quite often people don’t really understand what the theorem is about, and what its implications are. In this post I’d like to share my view on the NFL theorem, and some other impossibility results.</p>
<!--more-->
<h3 id="no-free-lunch-theorem-revisited">No Free Lunch Theorem Revisited</h3>
<p>First, let’s formally state the NFL theorem. I’ll take theorem statement from the (freely available!) book <a href="http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html"><em>Understanding Machine Learning: From Theory to Algorithms</em></a> by Shai Shalev-Shwartz and Shai Ben-David.</p>
<p>In the nutshell, the theorem says that whatever learning algorithm you pick, there will always be a problem (=dataset + some metrics), that your particular algorithm is incapable of solving, even though in principle the problem could be solved (by some other algorithm, which would have its own kryptonite problem). More formally (I modified the statement to distill the notation):</p>
<blockquote>
Let <span class="math inline">\(A\)</span> be any learning algorithm for the task of binary classification with respect to the 0−1 loss over a domain <span class="math inline">\(\mathcal{X}\)</span> . Let <span class="math inline">\(m\)</span> be any number smaller than <span class="math inline">\(|\mathcal{X}|/2\)</span>, representing a training set size. Then, there exists a distribution <span class="math inline">\(D\)</span> over <span class="math inline">\(\mathcal{X} × \{0, 1\}\)</span> such that:
<ol>
<li>
There exists a function <span class="math inline">\(f : \mathcal{X} \mapsto \{0, 1\}\)</span> with <span class="math inline">\(\mathbb{P}(f(x) \not= y \mid (x, y) \sim D) = 0\)</span>.
</li>
<li>
With probability of at least 1/7 over the choice of <span class="math inline">\(S \sim D^m\)</span> we have that <span class="math inline">\(\mathbb{P}(A_S(x) \not= y \mid (x, y) \sim D) \ge 1/8\)</span>
</li>
</ol>
</blockquote>
<p>The idea of the proof is that if you have fixed training set and some nontrivial number of unseen examples, one can vary labels of these unseen examples arbitrarily. So, if your algorithm classifies some example correctly, there exists similar problem with the only difference being different ground truth label for this example. Essentially, for the same training set you can construct completely different test sets.</p>
<h3 id="sounds-pretty-frustrating-isnt-it">Sounds pretty frustrating, isn’t it?</h3>
<p>The result suggests impossibility of the universal learning machine that’d be able to take any training set, and make the best predictions possible for the unseen data. And this <em>is</em> impossible! Another reformulation of the same theorem says that every classification algorithm has accuracy 1/2 when averaged over all possible problems. However, practical implications of the theorem are not so far-reaching.</p>
<p>The theorem essentially says that every problem has an evil doppelgänger, that’d break your precious model you trained so long. However, how likely are you to run into this doppelgänger? How likely is it to run into the problem where the test set differs from the train set so much? Or, how can our human brains work so well <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>? Let me expand the later thought.</p>
<p>I believe our brains are not magical, they are just another kind of a (biological) learning machine, obeying same mathematical principles, powerful enough to solve various problems we face every day. Yes, we can’t solve all the problems in the world, but why would we care? In Machine Learning as a subfield of Artificial Intelligence we seek to solve problems of <em>practical importance</em>, and in the first place automate what people already can do. Thus, we have a proof that there exists an algorithm that works reasonably well. It’s right here, in your brain.</p>
<p>So how come we’re able to navigate in such a complex world, communicate in such complicated languages, and discover laws of nature through science by thinking hard enough, if for every problem <em>we</em> successfully solve the mathematics has an evil copy of? The answer seems to be that these evil copies are very rare. And I believe there’s a reason for that.</p>
<p>Let’s get back to the theorem. Recall, that is essentially based in the fact that for a fixed training set you can vary test set as you wish. How complicated (for some intuitive notion of complexity) does that make the distribution <span class="math inline">\(p(y|x)\)</span> that makes perfect predictions for a given <span class="math inline">\(x\)</span>? Well, if it had one regularity pattern in the training set, and then suddenly changed this pattern in the test set to something completely different, that’d make the target distribution <span class="math inline">\(p(y|x)\)</span> more complicated. So, even if every good problem (i.e. one we, humans, can solve) has an evil twin, twin’s complexity should be higher due to way more complex regularity pattern.</p>
<p>Thus, I’m sure more complicated problems and objects are less likely in the Universe. Otherwise, we’d not be able to have such a complicated life with our particular instance of a learning machine, implemented in our brain. The NFL theorem states there are hard problems out there, but doesn’t say anything how common they are, implicitly assuming uniform distribution, which seems to disagree with our observations.</p>
<h3 id="other-impossibility-results">Other impossibility results</h3>
<p>Another similar result is the <a href="https://en.wikipedia.org/wiki/Halting_problem">halting problem</a>, which states that given any program, you’d not be able to determine if it stops with 100% accuracy. However, this does not mean that every program’s halting is undecidable. For example, for <a href="https://en.wikipedia.org/wiki/Linear_bounded_automaton">linear bounded automata</a> one actually can decide if a program for this automation halts (thought that might require some astronomical amount of memory). This result only states there’s no universal decider, and every particular class should be inspected separately.</p>
<p>To recap, the idea of this post is that even though the theory seemingly limits our capabilities, we should not get discouraged by these results, as they are way more general than we need in practice. Quite often we can still solve real problems due to the fact that the general case includes some really weird functions, but reality does not.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Well, we don’t have other baselines, so it’s hard to tell if they indeed work well. But still, human brains are the best learning machines known to humanity.<a href="#fnref1">↩</a></p></li>
</ol>
</div><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/32qnGEkYoTM" height="1" width="1" alt=""/>Sun, 23 Jul 2017 00:00:00 UThttp://artem.sobolev.name/posts/2017-07-23-no-free-lunch-theorem.htmlArtemhttp://artem.sobolev.name/posts/2017-07-23-no-free-lunch-theorem.htmlMatrix and Vector Calculus via Differentials
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/5m_RiBN2DWU/2017-01-29-matrix-and-vector-calculus-via-differentials.html
<p>Many tasks of machine learning can be posed as optimization problems. One comes up with a parametric model, defines a loss function, and then minimizes it in order to learn optimal parameters. One very powerful tool of optimization theory is the use of smooth (differentiable) functions: those that can be locally approximated with a linear functions. We all surely know how to differentiate a function, but often it’s more convenient to perform all the derivations in matrix form, since many computational packages like numpy or matlab are optimized for vectorized expressions.</p>
<p>In this post I want to outline the general idea of how one can calculate derivatives in vector and matrix spaces (but the idea is general enough to be applied to other algebraic structures).</p>
<!--more-->
<h3>
The Gradient
</h3>
<p>What is the gradient? Recall that smooth function (for now we’ll be considering scalar functions only) <span class="math inline">\(f : \mathcal{X} \to \mathbb{R}\)</span> is one which is approximately linear within some neighborhood of a given point. That means <span class="math inline">\(f(x + dx) - f(x) = \langle g(x), dx \rangle\)</span> (think of <span class="math inline">\(dx\)</span> as of a very small perturbation of <span class="math inline">\(x\)</span>) where <span class="math inline">\(\langle \cdot, \cdot \rangle\)</span> denotes the dot product in the space <span class="math inline">\(\mathcal{X}\)</span>, and <span class="math inline">\(g(x)\)</span> is called the <strong>gradient</strong> of <span class="math inline">\(f(x)\)</span> at the point <span class="math inline">\(x\)</span> (we’ll be using the nabla notation from now on: <span class="math inline">\(g(x) = \nabla f(x)\)</span>).</p>
<p>For example, for functions of one variable (<span class="math inline">\(\mathcal{X} = \mathbb{R}\)</span>) we have <span class="math inline">\(\langle a, b \rangle = a b\)</span>, for functions of several variables (<span class="math inline">\(\mathcal{X} = \mathbb{R}^n\)</span>) it’s the usual dot product <span class="math inline">\(\langle a, b \rangle = a^T b\)</span>, and for functions of matrices (<span class="math inline">\(\mathcal{X} = \mathbb{R}^{n \times m}\)</span>) it generalizes vector dot product: <span class="math inline">\(\langle A, B \rangle = \text{Tr}(A^T B)\)</span>.</p>
<p>Now let’s introduce the notion of <strong>differential</strong> <span class="math inline">\(df(x)\)</span> to be a perturbation of the function <span class="math inline">\(f\)</span> if we perturb <span class="math inline">\(x\)</span> by <span class="math inline">\(dx\)</span>, which we assume to be infinitesimally small. The gradient only affects first-order behavior of <span class="math inline">\(df(x)\)</span>, that is as <span class="math inline">\(dx\)</span> goes to zero, thus if one expands <span class="math inline">\(df(x)\)</span> in terms of <span class="math inline">\(dx\)</span>, it’s enough to write down only the linear term to find the differential. For example, for smooth scalar-valued functions we have <span class="math inline">\(df(x) = \langle \nabla f(x), dx \rangle\)</span>, so the gradient <span class="math inline">\(\nabla f(x)\)</span> defines a linear coefficient for the differential <span class="math inline">\(df(x)\)</span> <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>.</p>
<h3>
Calculus
</h3>
<p>Okay, how can we derive the gradient of a function? One way is to take a derivative with respect to each scalar input variable, and then compose a vector out of it. This approach is quite inefficient and messy: you might have to recompute the same vector expressions for each input variable, or compute lots of sums (which does not leverage benefits of vector operations), or try to compose vector operations out of them. Instead we’ll develop a formal method that will allow us to derive gradients for many functions (just like differentiation rules you learned in the introduction to calculus) without leaving the realm of vector algebra.</p>
<p>Recall that the differential <span class="math inline">\(df(x)\)</span> of a scalar-valued function <span class="math inline">\(f\)</span> is a linear function of <span class="math inline">\(dx\)</span> and the gradient. That means, that if we could write a differential <span class="math inline">\(df(x)\)</span> and then simplify it to <span class="math inline">\(g(x)^T dx + O(\|dx\|^2)\)</span> (we ignore higher-order terms, as they are not linear in <span class="math inline">\(dx\)</span>, and go to zero faster than <span class="math inline">\(dx\)</span> does), we’ll recover the gradient <span class="math inline">\(\nabla f(x) = g(x)\)</span>. This is exactly what we’re going to do: develop a set of formal rules that will allow us to compute differentials of various operations and their combinations.</p>
<p>The general idea is to consider <span class="math inline">\(f(x + \Delta)\)</span>, and manipulate it into something of the form <span class="math inline">\(f(x) + L_x(\Delta) + O(\|\Delta\|^2)\)</span> where <span class="math inline">\(L_x(\Delta)\)</span> is a function of <span class="math inline">\(x\)</span> (maybe constant, though) and <span class="math inline">\(\Delta\)</span> that is linear in <span class="math inline">\(\Delta\)</span> (but not necessarily in <span class="math inline">\(x\)</span>). Then <span class="math inline">\(L_x(dx)\)</span> is exactly the differential <span class="math inline">\(df(x)\)</span>.</p>
<p>Let’s consider an example. Let <span class="math inline">\(f(X) = A X^{-1} + B\)</span> (all variables are square matrices of the same size):</p>
<p><span class="math display">\[
\begin{align*}
f(X + \Delta)
&= A (X + \Delta)^{-1} + B
= A X^{-1} (I + \Delta X^{-1})^{-1} + B \\
&= A X^{-1} \left(\sum_{k=0}^\infty (-\Delta X^{-1})^k \right) + B
= A X^{-1} \left(I - \Delta X^{-1} + O(\|\Delta\|^2) \right) + B \\
&= \underbrace{A X^{-1} + B}_{f(X)} \underbrace{-A X^{-1} \Delta X^{-1}}_{\text{linear in }\Delta} + O(\|\Delta\|^2)
\end{align*}
\]</span></p>
<p>Hence <span class="math inline">\(df(X) = -A X^{-1} dX X^{-1}\)</span> (it’s not of the form <span class="math inline">\(\text{Tr}(g(X)^T dX)\)</span> because <span class="math inline">\(f\)</span> is not scalar-valued, so you can’t extract the gradient from it, because the “gradient” would be something like a 4-dimensional tensor). This way we can derive differentials for many common functions and operations:</p>
<p><span class="math display">\[
d(\alpha X) = \alpha dX \\
d(X + Y) = dX + dY \\
d(XY) = dX Y + X dY \\
d(X^{-1}) = -X^{-1} dX X^{-1} \\
d(c^T x) = c^T dx \\
d(x^T A x) = x^T (A + A^T) dx \\
d(\text{Tr}(X)) = \text{Tr}(dX) \\
d(\text{det}(X)) = \text{det}(X) \text{Tr}(X^{-1} dX)
\]</span></p>
<p>It’s also very helpful to derive a rule to deal with function composition: suppose we have <span class="math inline">\(f(x)\)</span> and <span class="math inline">\(g(x)\)</span> with corresponding differentials <span class="math inline">\(df(x)\)</span> and <span class="math inline">\(dg(x)\)</span>. Then the differential of <span class="math inline">\(h(x) = f(g(x))\)</span> is</p>
<p><span class="math display">\[
dh(x) = f(g(x+dx)) - f(g(x)) = f(g(x) + dg(x)) - f(g(x)) = df(y)|_{dy = dg(x), y = g(x)}
\]</span></p>
<p>That is, we take <span class="math inline">\(df(y)\)</span>, and replace each <span class="math inline">\(dy\)</span> with <span class="math inline">\(dg(x)\)</span>, and each <span class="math inline">\(y\)</span> with <span class="math inline">\(g(x)\)</span>.</p>
<p>These rules allow us to differentiate fairly complicated expressions like <span class="math inline">\(f(x) = \text{det}(X + B) \log(a^T X^{-1} a) - \text{Tr}(X)\)</span></p>
<p><span class="math display">\[
\begin{align*}
df(X)
&= d(\text{det}(X + B) \log (a^T X^{-1} a)) - d(\text{Tr}(X)) \\
&= d(\text{det}(X + B)) \log (a^T X^{-1} a) + \text{det}(X + B) d(\log (a^T X^{-1} a)) - \text{Tr}(dX) \\
&= \text{det}(X + B) \text{Tr}((X+B)^{-1} dX) \log (a^T X^{-1} a) + \frac{\text{det}(X + B)}{a^T X^{-1} a} d(a^T X^{-1} a) - \text{Tr}(dX) \\
&= \text{Tr}\left[\text{det}(X + B)\log (a^T X^{-1} a) (X+B)^{-1} dX \right] - \frac{\text{det}(X + B)}{a^T X^{-1} a} \left(a^T X^{-1} dX X^{-1} a\right) - \text{Tr}(dX) \\
&= \text{Tr}\left[\left(\text{det}(X + B)\log (a^T X^{-1} a) (X+B)^{-1} - \frac{\text{det}(X + B)}{a^T X^{-1} a} X^{-1} a a^T X^{-1} - I\right) dX \right] \\
\end{align*}
\]</span></p>
<p>One way to sanity-check (not complete, though!) our derivations is to consider <span class="math inline">\(1 \times 1\)</span> matrices, that is, scalar case. In scalar case it all boils down to <span class="math inline">\(f(x) = (x+b) \log(a^2 / x) - x\)</span> with derivative <span class="math inline">\(f'(x) = \log(a^2 / x) - \frac{x+b}{x}-1\)</span>, which coincides with the formula above for <span class="math inline">\(1 \times 1\)</span> matrices.</p>
<h3 id="the-hessian">The Hessian</h3>
<p>The same idea can be used to calculate the hessian of a function, that is, a coefficient describing function’s local quadratic behavior. We restrict ourselves with scalar-valued functions of finite-dimensional vectors, but it generalizes to other functions if you consider appropriate bilinear maps.</p>
<p>We define the second order differential recursively <span class="math inline">\(d^2 f(x) = d(df(x))\)</span> as a linearization of a linearization (and we need to go deeper!). One might note that linearization of a linear function does not make any difference, but we’re actually linearizing not the linear approximation, but the map <span class="math inline">\(x \mapsto df(x)\)</span> itself. In a way, you can think of <span class="math inline">\(df(x)\)</span> as a function of 2 arguments: the point <span class="math inline">\(x\)</span> and an infinitesimal perturbation <span class="math inline">\(dx\)</span>. And we’re linearizing with respect to the first one. Since we have 2 independent linearizations, it’s incorrect to use the same perturbation to both of them, so we’ll introduce <span class="math inline">\(dx_1\)</span> and <span class="math inline">\(dx_2\)</span> as first and second order perturbations.</p>
<p>If <span class="math inline">\(df^2(x)\)</span> at a given point <span class="math inline">\(x\)</span> is a linearization of a linearization, it’s a function of 2 perturbations: <span class="math inline">\(dx_1\)</span> and <span class="math inline">\(dx_2\)</span>. Moreover, it’s linear in both of them, so <span class="math inline">\(df^2(x)\)</span> is actually a bilinear map. In case of a finite-dimensional vector space a bilinear map can be represented using a matrix <span class="math inline">\(H(x)\)</span>, that is <span class="math inline">\(d^2f(x) = dx_1^T H(x) dx_2\)</span>. The matrix <span class="math inline">\(H(x)\)</span> is called the <strong>hessian</strong> and denoted <span class="math inline">\(\nabla^2 f(x)\)</span>.</p>
<p>Then one uses the same formal rules, expanding <span class="math inline">\(d^2 f(x) = d(df(x))\)</span> by first computing <span class="math inline">\(df(x)\)</span> w.r.t. <span class="math inline">\(dx_1\)</span>, and then differentiating the resultant expression w.r.t. <span class="math inline">\(dx_2\)</span>. Again, let’s consider an example <span class="math inline">\(f(x) = \text{det}(I + x x^T)\)</span></p>
<p><span class="math display">\[
df(x) = 2\text{det}(I + x x^T) x^T (I + x x^T)^{-1} dx_1
= 2 f(x) x^T (I + x x^T)^{-1} dx_1
\]</span></p>
<p>Now, keeping in mind that we can move scalars around (as well as transpose them), we get</p>
<p><span class="math display">\[
\begin{align*}
d^2f(x) &= d(df(x)) \\
&= 2 \overbrace{df(x)}^{=dx_2^T \nabla f(x)} x^T (I + x x^T)^{-1} dx_1
+ 2 f(x) d(x^T) (I + x x^T)^{-1} dx_1
+ 2 f(x) x^T d((I + x x^T)^{-1}) dx_1 \\
&= 2 dx_2^T \left( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \right) dx_1 \\
&\quad- 2 f(x) x^T (I + x x^T)^{-1} (dx_2 x^T + x dx_2^T) (I + x x^T)^{-1} dx_1 \\
&= 2 dx_2^T \left( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \right) dx_1 \\
&\quad- 2 f(x) \overbrace{x^T (I + x x^T)^{-1} dx_2}^{\text{scalar, transpose}} x^T (I + x x^T)^{-1} dx_1
- 2 \overbrace{f(x) x^T (I + x x^T)^{-1} x}^{\text{scalar, trace}} dx_2^T (I + x x^T)^{-1} dx_1 \\
&= 2 dx_2^T \left( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \right) dx_1 \\
&\quad- 2 dx_2^T f(x) (I + x x^T)^{-1} x x^T (I + x x^T)^{-1} dx_1
- 2 dx_2^T f(x) \text{Tr}\left[ (I + x x^T)^{-1} x x^T \right] (I + x x^T)^{-1} dx_1 \\
&= 2 dx_2^T \Bigl( \nabla f(x) x^T (I + x x^T)^{-1} + f(x) (I + x x^T)^{-1} \\
&\quad -f(x) (I + x x^T)^{-1} x x^T (I + x x^T)^{-1}
-f(x) \text{Tr}\left[ (I + x x^T)^{-1} x x^T \right] (I + x x^T)^{-1}
\Bigr) dx_1 \\
\end{align*}
\]</span></p>
<p>Thus the Hessian is</p>
<p><span class="math display">\[
\begin{align*}
\nabla^2 f(x)
&= 2 (I + x x^T)^{-1} x \left(\nabla f(x)^T - f(x) x^T (I + x x^T)^{-1} \right)
+ 2 f(x) \left(1 - \text{Tr}\left[ (I + x x^T)^{-1} x x^T \right] \right) (I + x x^T)^{-1} \\
&= (I + x x^T)^{-1} x \nabla f(x)^T
+ \left(2 f(x) - \nabla f(x)^T x \right) (I + x x^T)^{-1} \\
&= (I + x x^T)^{-1} x \nabla f(x)^T - \nabla f(x)^T x (I + x x^T)^{-1} + 2 f(x) (I + x x^T)^{-1} \\
&= 2f(x) \left((2 - x^T (I + x x^T)^{-1} x) I - (I + x x^T)^{-1}\right) (I + x x^T)^{-1} \\
\end{align*}
\]</span></p>
<p>The funny thing is, <span class="math inline">\(f(x) = \text{det}(I + x x^T)\)</span> can be simplified using the <a href="https://en.wikipedia.org/wiki/Matrix_determinant_lemma">determinant lemma</a> as <span class="math inline">\(f(x) = 1 + x^T x\)</span>. Now this is a very simple function, whose gradient is just <span class="math inline">\(2x\)</span> and the Hessian is constant <span class="math inline">\(2I\)</span>. And actually, the expression above can be simplified into <span class="math inline">\(2I\)</span>. Sometimes it’s beneficial to simplify the function first (-:</p>
<h3 id="conclusion">Conclusion</h3>
<p>In this post I showed how one can derive gradients and hessians using formal algebraic manipulations with differentials. The same technique is, of course, applicable to infinite-dimensional spaces (calculus of variations) and vector-valued functions (where linear map is described using the Jacobi matrix).</p>
<h3 id="addendum-on-the-dual-numbers">Addendum on the Dual Numbers</h3>
<p>The set of formal rules described above is not only helpful when calculating gradients by hand, but can also be used to automatically differentiate a function as you evaluate it. Indeed, suppose you need to differentiate some big and complex function <span class="math inline">\(f(x)\)</span>. In the above I shoved how one can use formal rules to compute <span class="math inline">\(d f(x)\)</span>, rearrange the result into the form of <span class="math inline">\(\langle g(x), dx\rangle\)</span>, and use <span class="math inline">\(g(x)\)</span> as the gradient. Note that if we use Taylor expansion of <span class="math inline">\(f(x+dx)\)</span> at <span class="math inline">\(x\)</span> we get <span class="math inline">\(f(x+dx) = f(x) + \langle \nabla f(x), dx \rangle + O(\|dx\|^2)\)</span> and neither <span class="math inline">\(f(x)\)</span> nor <span class="math inline">\(\nabla f(x)\)</span> contain (or depend on) <span class="math inline">\(dx\)</span>. This means that if we extend our set of numbers by a symbol <span class="math inline">\(dx\)</span> with a property <span class="math inline">\(\|dx\|^2 = 0\)</span> (much like we have obtained complex numbers by adding a symbol <span class="math inline">\(i\)</span> to the real numbers with a special property <span class="math inline">\(i^2 = -1\)</span>), and evaluate <span class="math inline">\(f(x+dx)\)</span> in this expanded algebra, we will obtain an expression of the form <span class="math inline">\(a + \langle b, dx \rangle\)</span> with <span class="math inline">\(a = f(x)\)</span> and <span class="math inline">\(b = \nabla f(x)\)</span>. And this is a well-known extension of the real numbers called <a href="https://en.wikipedia.org/wiki/Dual_number">dual numbers</a>.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>If you’re wondering about the particular way of combining the gradient <span class="math inline">\(\nabla f(x)\)</span> and <span class="math inline">\(dx\)</span>, here’s the explanation. Recall, that the first-order term is the linearization of a function, that is, it’s linear transformation <span class="math inline">\(L_x\)</span> of <span class="math inline">\(dx\)</span>. Because of the <a href="https://en.wikipedia.org/wiki/Riesz_representation_theorem">Riesz representation theorem</a> this linear transformation <span class="math inline">\(L_x\)</span> can be represented as a scalar product with some element of <span class="math inline">\(\mathcal{X}\)</span> (the gradient, in our case): <span class="math inline">\(L_x(dx) = \langle \nabla f(x), dx \rangle\)</span>. Of course, this logic generalizes to non-scalar-valued functions (like <span class="math inline">\(f : \mathbb{R}^n \to \mathbb{R}^m\)</span>): the gradient is used to define a linear map.<a href="#fnref1">↩</a></p></li>
</ol>
</div><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/5m_RiBN2DWU" height="1" width="1" alt=""/>Sun, 29 Jan 2017 00:00:00 UThttp://artem.sobolev.name/posts/2017-01-29-matrix-and-vector-calculus-via-differentials.htmlArtemhttp://artem.sobolev.name/posts/2017-01-29-matrix-and-vector-calculus-via-differentials.htmlNIPS 2016 Summaries
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/uXb15XnvxEo/2016-12-31-nips-2016-summaries.html
<p>I did not attend this year’s NIPS, but I’ve gathered many summaries published online by those who did attend the conference.</p>
<!--more-->
<ul>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5hdofr/d_nips_2016_symposium_on_people_and_machines/">NIPS 2016 Symposium on People and machines: Public views on machine learning, and what this means for machine learning researchers. (Notes and panel discussion)</a> by /u/gcr</li>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5hzvfi/d_nips_2016_summary_wrap_up_and_links_to_slides/">NIPS 2016 summary, wrap up, and links to slides</a> by /u/beamsearch</li>
<li><a href="http://inverseprobability.com/2016/12/13/nips-highlights.html">Post NIPS Reflections</a> by <a href="https://twitter.com/lawrennd">Neil Lawrence</a></li>
<li><a href="https://medium.com/@IgorCarron/some-general-take-aways-from-nips2016-c3c5ec23bf1a#.rykyvqbvm">Some general take aways from #NIPS2016</a> by <a href="http://nuit-blanche.blogspot.com">Igor Carron</a></li>
<li><a href="https://medium.com/@libfun/nips-2016-experience-and-highlights-104e19e4ac95#.umy1vunwa">NIPS 2016 experience and highlights</a> by <a href="https://twitter.com/libfun_sk">Sergey Korolev</a></li>
<li><a href="http://www.machinedlearnings.com/2016/12/nips-2016-reflections.html">NIPS 2016 Reflections</a> by Paul Mineiro</li>
<li><a href="http://abunchofdata.com/some-general-takeaways-from-nips2016/">Some general takeaways from #NIPS2016</a> by Arturo Slim</li>
<li><a href="https://twitter.com/rossfadely">Ross Fadely</a> and <a href="https://twitter.com/mwakanosya">Jeremy Karnowski</a>:
<ul>
<li><a href="https://blog.insightdatascience.com/nips-2016-day-1-6ae1207cab82">NIPS 2016 — Day 1 Highlights</a></li>
<li><a href="https://blog.insightdatascience.com/nips-2016-day-2-highlights-platform-wars-rl-and-rnns-9dca43bc1448#.r2aync4cu">NIPS 2016 — Day 2 Highlights: Platform wars, RL and RNNs</a></li>
<li><a href="https://blog.insightdatascience.com/nips-2016-day-3-highlights-robots-that-know-cars-that-see-and-more-1ec958896791#.geqs66a4b">NIPS 2016 — Day 3 Highlights: Robots that know, Cars that see, and more!</a></li>
<li><a href="https://blog.insightdatascience.com/nips-2016-final-highlights-days-4-6-likelihood-free-inference-dessert-analogies-and-much-more-ed7352d321ff#.uil9xf2mt">NIPS 2016 — Final Highlights Days 4–6: Likelihood-free inference, Dessert analogies, and much more.</a></li>
</ul></li>
<li><a href="https://aichamp.wordpress.com/2016/12/09/nips2016-top-10/">Key deep learning takeaways from NIPS2016 for applied data scientist</a> by Avkash Chauhan</li>
<li><a href="https://paper.dropbox.com/doc/Brad-Neubergs-NIPS-2016-Notes-XUFRdpNYyBhau0gWcybRo">Brad Neuberg’s NIPS 2016 Notes</a> by Brad Neuberg</li>
<li><a href="https://blog.ought.com/nips-2016-875bb8fadb8c#.yrr46pb2t">50 things I learned at NIPS 2016</a></li>
<li>Lab 41 by <a href="https://twitter.com/karllab41">Karl Ni</a>:
<ul>
<li><a href="https://gab41.lab41.org/nips-2016-review-day-1-6e504bcf1451#.g3uvis858">NIPS 2016 Review, Days 0 & 1</a></li>
<li><a href="https://gab41.lab41.org/nips-2016-review-day-2-daff1088135e#.prd61skhx">NIPS 2016 Review, Day 2</a></li>
<li><a href="https://gab41.lab41.org/nips-2016-review-day-3-21c78586a0ec#.a3mmr9wmi">NIPS 2016 Review, Day 3</a></li>
</ul></li>
<li><a href="http://wimlworkshop.org/2016/">WiML 2016</a> (Women in Machine Learning) videos:
<ul>
<li><a href="https://www.periscope.tv/WiMLworkshop/1ypKdAZXVOyGW?">Designing Algorithms for Practical Machine Learning</a> by Maya Gupta</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1DXxyoqrMXgGM?">On the Expressive Power of Deep Neural Networks</a> by Maithra Raghu</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1DXxyoqryqWGM?">Ancestral Causal Inference</a> by Sara Magliacane</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1vOxweXvPwgGB?">Towards a Reasoning Engine for Individualizing Healthcare</a> by <a href="http://www.suchisaria.com/">Suchi Saria</a></li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1vOxweXvqjEGB?">Learning Representations from Time Series Data through Contextualized LSTMs</a> by Madalina Fiterau</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1vAGRXDbvbkxl?">Towards Conversational Recommender Systems</a> by Konstantina Christakopoulou</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1gqGvRjOeWOGB?">Large-Scale Machine Learning through Spectral Methods: Theory & Practice</a> by Anima Anandkumar</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1jMJgAkEVajKL?">WiML Updates</a> by Tamara Broderick</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1MnGnXwZmVMxO?">Using Convolutional Neural Networks to Estimate Population Density from High Resolution Satellite Images</a> by Amy Zhang</li>
<li><a href="https://www.periscope.tv/WiMLworkshop/1dRKZRYgQwvKB">Graphons and Machine Learning</a> by <span class="citation">@JenniferChayes</span></li>
</ul></li>
<li><a href="https://medium.com/@elluba/nips-2016-cake-rocket-ai-gans-and-the-style-transfer-debate-708c46438053#.he59yf9ah">NIPS 2016: cake, Rocket AI, GANs and the style transfer debate</a> by Luba Elliott</li>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5ib4jf/discussion_summary_of_nips_2016_adversarial/">Summary of NIPS 2016 Adversarial Training Workshop: More Theory, Exciting Progress</a> by /u/fhuszar</li>
<li><a href="https://www.taraslehinevych.me/blog/2016/12/14/nips-barcelona/">NIPS 2016 Notes</a> by /u/lehinevych</li>
<li><a href="http://apeiroto.pe/ml/nips-2016.html">NIPS 2016</a> by <a href="https://twitter.com/__hylandSL">Stephanie Hyland</a></li>
<li><a href="http://www.computervisionblog.com/2016/12/nuts-and-bolts-of-building-deep.html">Nuts and Bolts of Building Deep Learning Applications: Ng @ NIPS2016</a> by <a href="https://twitter.com/quantombone">Tomasz Malisiewicz</a></li>
<li><a href="http://computerblindness.blogspot.ru/2016/12/nips-2016.html">NIPS 2016</a> by <a href="https://twitter.com/ovrdr">Roman Shapovalov</a></li>
<li><a href="http://abunchofdata.com/magenta-wins-best-demo-at-nips-2016/">Magenta wins “Best Demo” at NIPS 2016!</a>, checkout the demo <a href="https://magenta.tensorflow.org/2016/12/16/nips-demo/">here</a></li>
<li><a href="http://www.machinedlearnings.com/2016/12/dialogue-workshop-recap.html?spref=tw">Dialogue Workshop Recap</a> by <a href="https://twitter.com/PaulMineiro">Paul Mineiro</a></li>
<li><a href="https://www.linkedin.com/pulse/nips-2016-towards-end-dynamic-dialogue-system-vishal-bhalla">NIPS 2016: Towards an end to end Dynamic Dialogue System</a> by <a href="https://twitter.com/vishy_punditry">Vishal Bhalla</a></li>
<li><a href="http://www.sandtable.com/nips-2016-deep-reinforcement-learning/">NIPS 2016: Deep Reinforcement Learning</a> by Leighton Turner</li>
<li><a href="http://www.slideshare.net/SebastianRuder/nips-2016-highlights-sebastian-ruder">NIPS 2016 Highlights</a> by <a href="http://sebastianruder.com/">Sebastian Ruder</a></li>
<li><a href="http://www.nxn.se/nips-2016-barcelona/">NIPS 2016 Notes</a> by Valentine Svensson</li>
<li><a href="http://yenhuanli.github.io/blog/2016/12/12/interesting-talks-in-nips-2016/">Some Interesting Talks at NIPS 2016</a> by <a href="https://twitter.com/yenhuan_li">Yen-Huan Li</a></li>
<li><a href="https://deezer.io/deezer-r-d-goes-to-nips-2016-e7a895c2c7ff#.j0tgjb3l6">Deezer R&D goes to NIPS 2016</a></li>
<li><a href="https://blog.cometlabs.io/robots-learning-about-human-values-emotion-and-intent-a39ca12c1908#.r4afpw77e">Robots Learning About Human Values, Emotion, and Intent</a> by Malika Cantor</li>
<li><a href="http://ikuz.eu/2016/12/16/notes-on-nips-2016/">Notes on NIPS 2016</a> by Ilya Kuzovkin</li>
<li><a href="http://www.mikelanzetta.com/nips-2016-trip-report.html">NIPS 2016 Trip Report</a> by <a href="https://twitter.com/noodlefrenzy">Mike Lanzetta</a></li>
<li><a href="http://sebastianruder.com/highlights-nips-2016/">Highlights of NIPS 2016: Adversarial learning, Meta-learning, and more</a> by <a href="https://twitter.com/seb_ruder">Sebastian Ruder</a></li>
<li><a href="https://livingthing.danmackinlay.name/nips_2016.html">Garbled highlights from NIPS 2016</a> by <a href="https://danmackinlay.name/">Dan Mackinlay</a></li>
<li><a href="http://approximatelycorrect.com/2016/12/28/ai-safety-highlights-from-nips-2016/">AI Safety Highlights from NIPS 2016</a> by <a href="https://twitter.com/vkrakovna">Victoria Krakovna</a></li>
<li><a href="http://blog.evjang.com/2017/01/nips2016.html">Summary of NIPS 2016</a> by Eric Jang</li>
<li><a href="http://www.nowozin.net/sebastian/blog/nips-2016-generative-adversarial-training-workshop-talk.html">NIPS 2016 Generative Adversarial Training workshop talk</a></li>
<li><a href="http://hunch.net/?p=5937325">EWRL and NIPS 2016</a> by <a href="http://hunch.net/~jl/">John Langford</a></li>
</ul>
<p>You might also be interested in:</p>
<ul>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5hwqeb/project_all_code_implementations_for_nips_2016/">All Code Implementations for NIPS 2016 papers</a></li>
<li>On RocketAI: [<a href="https://twitter.com/deanpomerleau/status/808011377059254273">1</a>] + [<a href="https://www.reddit.com/r/MachineLearning/comments/5hmdty/discussion_rocketai/db2mit3/">2</a>] + [<a href="https://medium.com/the-mission/rocket-ai-2016s-most-notorious-ai-launch-and-the-problem-with-ai-hype-d7908013f8c9">3</a>]</li>
</ul><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/uXb15XnvxEo" height="1" width="1" alt=""/>Sat, 31 Dec 2016 00:00:00 UThttp://artem.sobolev.name/posts/2016-12-31-nips-2016-summaries.htmlArtemhttp://artem.sobolev.name/posts/2016-12-31-nips-2016-summaries.htmlNeural Variational Inference: Importance Weighted Autoencoders
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/CFI-lfE5saw/2016-07-14-neural-variational-importance-weighted-autoencoders.html
<p>Previously we covered <a href="/posts/2016-07-11-neural-variational-inference-variational-autoencoders-and-Helmholtz-machines.html">Variational Autoencoders</a> (VAE) — popular inference tool based on neural networks. In this post we’ll consider, a followup work from Torronto by Y. Burda, R. Grosse and R. Salakhutdinov, <a href="https://arxiv.org/abs/1509.00519">Importance Weighted Autoencoders</a> (IWAE). The crucial contribution of this work is introduction of a new lower-bound on the marginal log-likelihood <span class="math inline">\(\log p(x)\)</span> which generalizes ELBO, but also allows one to use less accurate approximate posteriors <span class="math inline">\(q(z \mid x, \Lambda)\)</span>.</p>
<p>On a dessert we’ll discuss another paper, <a href="https://arxiv.org/abs/1602.06725">Variational inference for Monte Carlo objectives</a> by A. Mnih and D. Rezende which aims to broaden the applicability of this approach to models where reparametrization trick can not be used (e.g. for discrete variables).</p>
<!--more-->
<h3>
Importance Weighted Autoencoders
</h3>
<p>Let’s first answer the question of how one can come up with a lower bound for the marginal log-likelihood? In the very beginning of the series, <a href="/posts/2016-07-01-neural-variational-inference-classical-theory.html">Classical Theory</a> post, we used some trickery to come up with the ELBO. That massaging of the marginal log-likelihood wasn’t particulary enlightening on how one could invent that lower bound. Now we’re going to consider a principled approach to invention of new lower bounds based on <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen’s inequality</a>.</p>
<p>Suppose we have some unbiased estimator <span class="math inline">\(f(x, z)\)</span> of <span class="math inline">\(p(x)\)</span>, that is, <span class="math inline">\(\mathbb{E}_z f(x, z) = p(x)\)</span>. Then</p>
<p><span class="math display">\[
\log p(x) = \log \mathbb{E}_z f(x, z) \stackrel{\text{Jensen}}{\ge} \mathbb{E}_z \log f(x, z)
\]</span></p>
<p>In particular, if <span class="math inline">\(z \sim q(z \mid x)\)</span> and <span class="math inline">\(f(x, z) = \tfrac{p(x, z)}{q(z \mid x)}\)</span>, we obtain the standard ELBO. The IWAE paper proposes another estimate (actually, a family of estimators parametrized by an integer <span class="math inline">\(K\)</span>) of marginal <span class="math inline">\(p(x)\)</span>:</p>
<p><span class="math display">\[
f(x, z_1, \dots, z_K) = \frac{1}{K} \sum_{k=1}^K \frac{p(x, z_k)}{q(z_k \mid x)}
\]</span></p>
<p>Where each <span class="math inline">\(z_k\)</span> comes from the same distribution <span class="math inline">\(q(z_k \mid x) = q(z \mid x)\)</span>. Obviously, <span class="math inline">\(f(x, z_1, \dots, z_K)\)</span> is still an unbiased estimator of the <span class="math inline">\(p(x)\)</span>, and therefore <span class="math inline">\(\mathbb{E}_z \log f(x, z_1, \dots, z_K)\)</span> is a valid lower-bound on the marginal log-likelihood.</p>
<p>Let’s analyze this new lower-bound now. First, let’s dissect the ELBO:</p>
<p><span class="math display">\[
\mathcal{L}(\Theta, \Lambda)
= \mathbb{E}_q \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}
= \mathbb{E}_q \left[\log \frac{p(z \mid x, \Theta)}{q(z \mid x, \Lambda)} \right] + \log p(x \mid \Theta)
\]</span></p>
<p>If <span class="math inline">\(q\)</span> approximates the true posterior accurately, the first term (which is a KL-divergence, BTW) is close to zero. However, when estmating it using Monte Carlo samples, the ELBO heavily penalizes inaccurate approximations: if <span class="math inline">\(q(z \mid x, \Lambda)\)</span> gives us samples from high probability regions of the true posterior <span class="math inline">\(p(z \mid x, \Theta)\)</span> only occasionally (like 20% of times), the gap between the ELBO and the marginal log-likelihood would be huge (<span class="math inline">\(p(z\mid x, \Theta)\)</span> is small, <span class="math inline">\(q(z \mid x, \Lambda)\)</span> is big), which does not help learning. As you might have guessed, IWAE allows us to use several samples. Let’s see it in detail:</p>
<p><span class="math display">\[
\mathcal{L}_K(\Theta, \Lambda)
:= \mathbb{E}_q \left[\log \frac{1}{K} \sum_{k=1}^K \frac{p(x, z_k \mid \Theta)}{q(z_k \mid x, \Lambda)} \right]
:= \mathbb{E}_q \left[\log \frac{1}{K} \sum_{k=1}^K \frac{p(z_k \mid x, \Theta)}{q(z_k \mid x, \Lambda)} \right] + \log p(x \mid \Theta)
\]</span></p>
<p>This averaging of posterior ratios saves us from bad samples screwing the lower bound, as it’ll be pushed up by good samples (provided the approximation has a reasonable probability of generating a good sample in <span class="math inline">\(K\)</span> attempts). This allows one to perform model inference even with poor approximations <span class="math inline">\(q(z \mid x, \Lambda)\)</span>. The more samples <span class="math inline">\(K\)</span> we use — the less accurate approximation we can tolerate. In fact, authors prove the following theorem:</p>
<blockquote>
<p><strong>Theorem 1</strong>. For all <span class="math inline">\(K\)</span>, the lower bounds satisfy <span class="math display">\[
\log p(x \mid \Theta) \ge \mathcal{L}_{K+1}(\Theta, \Lambda) \ge \mathcal{L}_{K}(\Theta, \Lambda)
\]</span></p>
Moreover, if <span class="math inline">\(p(z, x \mid \Theta) / q(z \mid x, \Lambda)\)</span> is bounded, then <span class="math inline">\(\mathcal{L}_{K}(\Theta, \Lambda)\)</span> approaches <span class="math inline">\(\log p(x \mid \Theta)\)</span> as <span class="math inline">\(K\)</span> goes to infinity.
</blockquote>
<p>The convergence result follows from the <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers#Strong_law">strong law of large numbers</a>.</p>
<p>As with VAE, we use the reparametrization trick to avoid backpropagation through stochastic units:</p>
<p><span class="math display">\[
\mathcal{L}_K(\Theta, \Lambda) = \mathbb{E}_{\varepsilon_1, \dots, \varepsilon_K} \log \frac{1}{K} \sum_{k=1}^K \overbrace{\frac{p(x, g(\varepsilon_k; \Lambda) \mid \Theta)}{q(g(\varepsilon_k; \Lambda) \mid x, \Lambda)}}^{w(x, \varepsilon_k, \Theta, \Lambda)}
\]</span></p>
<p>The gradients then are</p>
<p><span class="math display">\[
\nabla_\Theta \mathcal{L}_K(\Theta, \Lambda) = \mathbb{E}_{\varepsilon_1, \dots, \varepsilon_K} \sum_{k=1}^K \hat w_k(x, \varepsilon_{1 \dots K}, \Theta, \Lambda) \nabla_\Theta \log w(x, \varepsilon_k, \Theta, \Lambda) \\
\nabla_\Lambda \mathcal{L}_K(\Theta, \Lambda) = \mathbb{E}_{\varepsilon_1, \dots, \varepsilon_K} \sum_{k=1}^K \hat w_k(x, \varepsilon_{1 \dots K}, \Theta, \Lambda) \nabla_\Lambda \log w(x, \varepsilon_k, \Theta, \Lambda) \\
\text{where } \hat w_k(x, \varepsilon_{1 \dots K}, \Theta, \Lambda) := \frac{w(x, \varepsilon_k, \Theta, \Lambda)}{\sum_{k=1}^K w(x, \varepsilon_k, \Theta, \Lambda)}
\]</span></p>
<p>(We used the identity <span class="math inline">\(\nabla_x f(x) = f(x) \nabla_x \log f(x)\)</span> here).</p>
<p>Just as one would expect, setting <span class="math inline">\(K=1\)</span> reduces these gradients to ones we’ve seen in VAEs as the only importance weight <span class="math inline">\(\hat w_1\)</span> is equal to 1. Unfortunatelly, this approach does not allow one to decompose the lower-bound into the reconstruction error and KL-divergence to analytically compute the later. However, authors report indistinguishable performance of 2 approaches (with KL computed analytically or estimated using Monte Carlo) in case of <span class="math inline">\(K=1\)</span>.</p>
<p>BTW, <a href="http://info.usherbrooke.ca/hlarochelle/index_en.html">Hugo Larochelle</a> writes <a href="https://twitter.com/hugo_larochelle/timelines/639067398511968256">notes</a> on different papers, and he has written and made publicly available <a href="https://www.evernote.com/shard/s189/sh/e2c8a133-1814-474c-b267-366600a1921b/06a756f1618cd47ababc7aae0e514dbf">Notes on Importance Weighted Autoencoders</a>.</p>
<h3>
Variational inference for Monte Carlo objectives
</h3>
<p>As I said in the introduction, IWAE has been “generalized” to discrete variables — a case when one can not employ the reparametrization trick, and instead has to somehow reduce high variance of a score function-based estimator. Previously, during our discussion of the <a href="/posts/2016-07-05-neural-variational-inference-blackbox.html">Blackbox VI and variance reduction techniques</a> we covered NVIL (Neural Variational Inference and Learning) estimator, which uses another neural network to estimate marginal likelihood and reduce the variance. This work is built upon a similar idea.</p>
<p>First, let’s derive score-function-based gradients for variational parameters <span class="math inline">\(\Lambda\)</span> (where <span class="math inline">\(w\)</span> now is defined as <span class="math inline">\(w(x, z, \Theta, \Lambda) = \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}\)</span>, and <span class="math inline">\(\hat w\)</span> is a normalized across all samples <span class="math inline">\(z_{1\dots K}\)</span> version):</p>
<p><span class="math display">\[
\begin{align}
\nabla_\Lambda \mathcal{L}_K(\Theta, \Lambda)
&= \nabla_\Lambda \mathbb{E}_{q(z_1, \dots, z_K \mid x, \Lambda)} \log \frac{1}{K} \sum_{k=1}^K w(x, z_k, \Theta, \Lambda) \\
&= \nabla_\Lambda \int q(z_1, \dots, z_K \mid x, \Lambda) \log \frac{1}{K} \sum_{k=1}^K w(x, z_k, \Theta, \Lambda) \; dz_1 \dots dz_K \\
&= \mathbb{E}_{q(z_1, \dots, z_K \mid x, \Lambda)} \left[ \sum_{k=1}^K \nabla_\Lambda \log q(z_k \mid x, \Lambda) \log \frac{1}{K} \sum_{k=1}^K w(x, z_k, \Theta, \Lambda) \right] \\
& \quad+ \mathbb{E}_{q(z_1, \dots, z_K \mid x, \Lambda)} \nabla_\Lambda \log \sum_{k=1}^K w(x, z_k, \Theta, \Lambda) \\
&= \mathbb{E}_{q(z_1, \dots, z_K \mid x, \Lambda)} \left[ \sum_{k=1}^K \nabla_\Lambda \log q(z_k \mid x, \Lambda) \log \frac{1}{K} \sum_{k=1}^K w(x, z_k, \Theta, \Lambda) \right] \\
& \quad+ \mathbb{E}_{q(z_1, \dots, z_K \mid x, \Lambda)} \left[ \sum_{k=1}^K \hat w_k(x, z_{1 \dots K}, \Theta, \Lambda) \nabla_\Lambda \log w(x, z_k, \Theta, \Lambda) \right]
\end{align}
\]</span></p>
<p>The second term is exactly the gradient of the reparametrized case, and it does not cause us any troubles. The first term, however has some issues.</p>
<p>First, it does not distinguish individual samples’ contributions: indeed, gradients for all samples have the same weight of <span class="math inline">\(\log \tfrac{1}{K} \sum_{k=1}^K w(x, z_k, \Theta, \Lambda)\)</span> (called the <em>learning signal</em>) regardless of how probable they’re in terms of the true posterior (that is, how well they describe an observation <span class="math inline">\(x\)</span>). Compare it with the second term, where gradient for each sample <span class="math inline">\(z_k\)</span> is weighted in proportion to its importance weight <span class="math inline">\(\hat w_k\)</span>.</p>
<p>Second problem is that the learning signal is unbounded, and can be quite high. Again, the second term does not suffer this as importance weights <span class="math inline">\(\hat w_k\)</span> are normalized to sum to 1.</p>
<p>One can use the NVIL estimator we’ve discussed previously to reduce the variance due to large magnitude of a learning signal. However, it does not address the problem of all gradients having the same weight. For this the authors propose to introduce per-sample baselines that minimize dependencies between samples.</p>
<p>This paper has also caught Dr. Larochelle’s attention: <a href="https://www.evernote.com/shard/s189/sh/54a9fb88-1a71-4e8a-b0e3-f13480a68b8d/0663de49b93d397f519c7d7f73b6a441">Notes on Variational inference for Monte Carlo objectives</a>.</p><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/CFI-lfE5saw" height="1" width="1" alt=""/>Thu, 14 Jul 2016 00:00:00 UThttp://artem.sobolev.name/posts/2016-07-14-neural-variational-importance-weighted-autoencoders.htmlArtemhttp://artem.sobolev.name/posts/2016-07-14-neural-variational-importance-weighted-autoencoders.htmlNeural Variational Inference: Variational Autoencoders and Helmholtz machines
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/kx9whnUkp94/2016-07-11-neural-variational-inference-variational-autoencoders-and-Helmholtz-machines.html
<p>So far we had a little of “neural” in our VI methods. Now it’s time to fix it, as we’re going to consider <a href="https://arxiv.org/abs/1312.6114">Variational Autoencoders</a> (VAE), a paper by D. Kingma and M. Welling, which made a lot of buzz in ML community. It has 2 main contributions: a new approach (AEVB) to large-scale inference in non-conjugate models with continuous latent variables, and a probabilistic model of autoencoders as an example of this approach. We then discuss connections to <a href="https://en.wikipedia.org/wiki/Helmholtz_machine">Helmholtz machines</a> — a predecessor of VAEs.</p>
<!--more-->
<h3 id="auto-encoding-variational-bayes">Auto-Encoding Variational Bayes</h3>
<p>As noted in the introduction of the post, this approach, called Auto-Encoding Variational Bayes (AEVB) works only for some models with continuous latent variables. Recall from our discussion of <a href="/posts/2016-07-05-neural-variational-inference-blackbox.html">Blackbox VI</a> and <a href="/posts/2016-07-04-neural-variational-inference-stochastic-variational-inference.html">Stochastic VI</a>, we’re interested in maximizing the ELBO <span class="math inline">\(\mathcal{L}(\Theta, \Lambda)\)</span>:</p>
<p><span class="math display">\[
\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{q(z\mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}
\]</span></p>
<p>It’s not a problem to compute an estimate of the gradient of the ELBO w.r.t. model parameters <span class="math inline">\(\Theta\)</span>, but estimating the gradient w.r.t. approximation parameters <span class="math inline">\(\Lambda\)</span> is tricky as these parameters influence the distribution the expectation is taken over, and as we know from the post on <a href="/posts/2016-07-05-neural-variational-inference-blackbox.html">Blackbox VI</a>, naive gradient estimator based on score function exhibits high variance.</p>
<p>Turns out, for some distributions we can make change of variables, that is, for some distributions a sample <span class="math inline">\(z \sim q(z \mid x, \Lambda)\)</span> can be represented as a (differentiable) transformation <span class="math inline">\(g(\varepsilon; \Lambda, x)\)</span> of some auxiliary random variable <span class="math inline">\(\varepsilon\)</span> whose distribution does not depend on <span class="math inline">\(\Lambda\)</span>. A well-known example of such reparametrization is Gaussian distribution: if <span class="math inline">\(z \sim \mathcal{N}(\mu, \Sigma)\)</span> then <span class="math inline">\(z\)</span> can be represented as <span class="math inline">\(z = g(\varepsilon; \mu, \Sigma) := \mu + \Sigma^{1/2} \varepsilon\)</span> for <span class="math inline">\(\varepsilon \sim \mathcal{N}(0, I)\)</span>. This transformation is called the <strong>reparametrization trick</strong>. After the reparametrization the ELBO becomes</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}(\Theta, \Lambda)
&= \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \log \frac{p(x, g(\varepsilon; \Lambda, x)\mid \Theta)}{q(g(\varepsilon; \Lambda, x) \mid \Lambda, x)} \\
&\approx \frac{1}{L} \sum_{l=1}^L \log \frac{p(x, g(\varepsilon^{(l)}; \Lambda, x)\mid \Theta)}{q(g(\varepsilon^{(l)}; \Lambda, x) \mid \Lambda, x)} \quad \quad \text{where $\varepsilon^{(l)} \sim \mathcal{N}(0, I)$}
\end{align}
\]</span></p>
<p>This objective is a much better one as we don’t need to differentiate w.r.t. expectation’s distribution, essentially putting the variational parameters <span class="math inline">\(\Lambda\)</span> to the same regime as the model parameters <span class="math inline">\(\Theta\)</span>. It’s sufficient now to just take gradients of the ELBO’s estimate, and run any optimization algorithm like <a href="https://arxiv.org/abs/1412.6980">Adam</a>.</p>
<p>Oh, and if you wonder what Auto-Encoding in Auto-Encoding Variational Bayes means, there’s an interesting interpretation of the ELBO in terms of autoencoding:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}(\Theta, \Lambda)
& = \mathbb{E}_{q(z\mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}
= \mathbb{E}_{q(z\mid x, \Lambda)} \log \frac{p(x \mid z, \Theta) p(z \mid \Theta)}{q(z \mid x, \Lambda)} \\
& = \mathbb{E}_{q(z\mid x, \Lambda)} \log p(x \mid z, \Theta) - D_{KL}(q(z \mid \Lambda, x) \mid\mid p(z \mid \Theta)) \\
\end{align}
\]</span></p>
<p>Here the first term can be treated as expected reconstruction (<span class="math inline">\(x\)</span> from the code <span class="math inline">\(z\)</span>) error, while the second one is just a regularizer.</p>
<h3 id="variational-autoencoder">Variational Autoencoder</h3>
<p>One particular application of AEVB framework comes from using neural networks as the model <span class="math inline">\(p(x \mid z, \Theta)\)</span> (called <strong>generative network</strong>) and the approximation <span class="math inline">\(q(z \mid x, \Lambda)\)</span> (called <strong>inference network</strong> or <strong>recognition network</strong>). The model has no requirements, and <span class="math inline">\(x\)</span> can be discrete or continuous (or mixed). <span class="math inline">\(z\)</span>, however, has to be continuous. Moreover, we need to be able to apply the reparametrization trick. Therefore in many practical applications <span class="math inline">\(q(z \mid x, \Lambda)\)</span> is set to be Gaussian distribution <span class="math inline">\(q(z \mid \Lambda, x) = \mathcal{N}(z \mid \mu(x; \Lambda), \Sigma(x; \Lambda))\)</span> where <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\Sigma\)</span> are outputs of a neural network taking <span class="math inline">\(x\)</span> as input, and <span class="math inline">\(\Lambda\)</span> denotes a set of neural network’s weights — the parameters we optimize the ELBO with respect to (the same applies to <span class="math inline">\(\Theta\)</span>). In order to make the reparametrization trick practical, one would like to be able to compute <span class="math inline">\(\Sigma^{1/2}\)</span> quick. We don’t want to actually compute this quantity directly as it’d be too computationally expensive. Instead you might want to predict <span class="math inline">\(\Sigma^{1/2}\)</span> by the neural network in the first place, or consider only diagonal covariance matrices (as it’s done in the paper).</p>
<p>In case of the Gaussian inference network <span class="math inline">\(q(z \mid x, \Lambda)\)</span> and a Gaussian prior <span class="math inline">\(p(z \mid \Theta)\)</span> we can compute KL-divergence <span class="math inline">\(D_{KL}(q(z \mid \Lambda, x) \mid\mid p(z \mid \Theta))\)</span> analytically, see the formula at <a href="http://stats.stackexchange.com/a/60699/62549">stats.stackexchange</a>. This slightly reduces the variance of the gradient estimator, though one can still train a VAE estimating KL-divergence using Monte Carlo, just like the reconstruction error.</p>
<p>We optimize both generative and inference networks by gradient ascent. This joint optimization pushes both the approximation towards the model, and the model towards the approximation. As a result, the generative network is encouraged to learn latent representations <span class="math inline">\(z\)</span> that exhibit the same independence pattern as the inference network. For example, if the inference network is Gaussian and has diagonal covariance matrices, then the generative model will try to learn representations with independent components.</p>
<p>VAEs have become popular because one can use it as a generative model. Essentially VAE is an easy to train autoencoder with a natural sampling procedure <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>: suppose you’ve trained the model, and now want to sample new samples similar to those you used in the training set. To do so you first sample <span class="math inline">\(z\)</span> from the prior <span class="math inline">\(p(z)\)</span>, and then generate <span class="math inline">\(x\)</span> using the model <span class="math inline">\(p(x \mid z, \Theta)\)</span>. Both operations are easy: the first one is sampling from some standard distribution (like Gaussian, for example), and the second one is just one feed-forward pass followed by sampling from another standard distribution (Bernoulli, for example, in case <span class="math inline">\(x\)</span> is a binary image).</p>
<p>If you want to read more on Variational Auto-Encoders, I refer you to a great <a href="https://arxiv.org/abs/1606.05908">tutorial by Carl Doersch</a>. Also take a look at Dustin Tran’s post <a href="http://dustintran.com/blog/variational-auto-encoders-do-not-train-complex-generative-models/">Variational auto-encoders do not train complex generative models</a> (and see the <a href="https://www.reddit.com/r/MachineLearning/comments/4ph8cq/variational_autoencoders_do_not_train_complex/">reddit discussion</a> also!).</p>
<h3 id="helmholtz-machines">Helmholtz Machines</h3>
<p>In the end I’d like to add a historical perspective. The idea of two networks, one “encoding” an observation <span class="math inline">\(x\)</span> to some latent representation (code) <span class="math inline">\(z\)</span>, and another “decoding” it back is definitely not new. In fact, the whole idea is a special case of the <a href="https://en.wikipedia.org/wiki/Helmholtz_machine">Helmholtz Machines</a> introduced by Geoffrey Hinton 20 years ago.</p>
<p>Helmholtz machine can be thought of as a neural network of stochastic hidden layers. Namely, we now have <span class="math inline">\(M\)</span> stochastic hidden layers (latent variables) <span class="math inline">\(h_1, \dots, h_M\)</span> (with deterministic <span class="math inline">\(h_0 = x\)</span>) where the layer <span class="math inline">\(h_{m-1}\)</span> is stochastically produced by the layer <span class="math inline">\(h_{m}\)</span>, that is, it is samples from some distribution <span class="math inline">\(p(h_{m-1} \mid h_m)\)</span>, which as you might have guessed already is parametrized in the same way as in usual VAEs. Actually, VAEs is a special case of a Helmholtz machine with just one stochastic layer (but each stochastic layer contains a neural network of arbitrarily many deterministic layers inside of it).</p>
<div style="text-align: center">
<p><img src="/files/Helmholtz-machine.png" style="width: 400px" /></p>
</div>
<p>This image shows an instance of a Helmholtz machine with 2 stochastic layers (blue cloudy nodes), and each stochastic layer having 2 deterministic hidden layers (white rectangles).</p>
<p>The joint model distribution is</p>
<p><span class="math display">\[
p(x, h_1, \dots, h_M \mid \Theta) = p(h_M \mid \Theta) \prod_{m=0}^{M-1} p(h_m \mid h_{m+1}, \Theta)
\]</span></p>
<p>And the approximate posterior is the same, but in inverse order:</p>
<p><span class="math display">\[
q(h_1, \dots, h_M \mid x, \Lambda) = \prod_{m=1}^{M} p(h_{m} \mid h_{m-1}, \Theta)
\]</span></p>
<p>The <span class="math inline">\(p(x, h_1, \dots, h_{M-1} \mid h_M)\)</span> distribution is usually called a <strong>generative network</strong> (or model) as it allows one to generate samples from latent representation(s). The approximate posterior <span class="math inline">\(q(h_1, \dots, h_M \mid x, \Lambda)\)</span> in this framework is called a <strong>recognition network</strong> (or model). Presumably, the name reflects the purpose of the network to recognize the hidden structure of observations.</p>
<p>So, if the VAE is a special case of Helmholtz machines, what’s new then? The standard algorithm for learning Helmholtz machines, the <a href="https://en.wikipedia.org/wiki/Wake-sleep_algorithm">Wake-Sleep algorithm</a>, turns out to be optimizing a different objective. Thus, one of significant contributions of Kingma and Welling is application of the reparametrization trick to make optimization of the ELBO w.r.t. <span class="math inline">\(\Lambda\)</span> tractable.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>This is not true for other popular autoencoding architectures. Boltzman Machines are too hard to train properly, while tranditional autoencoders (contractive, denoising) are hard to sample from (special procedures involving MCMC are required).<a href="#fnref1">↩</a></p></li>
</ol>
</div><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/kx9whnUkp94" height="1" width="1" alt=""/>Mon, 11 Jul 2016 00:00:00 UThttp://artem.sobolev.name/posts/2016-07-11-neural-variational-inference-variational-autoencoders-and-Helmholtz-machines.htmlArtemhttp://artem.sobolev.name/posts/2016-07-11-neural-variational-inference-variational-autoencoders-and-Helmholtz-machines.htmlNeural Variational Inference: Blackbox Mode
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/Rx_ge6y2j6g/2016-07-05-neural-variational-inference-blackbox.html
<p>In the <a href="/posts/2016-07-04-neural-variational-inference-stochastic-variational-inference.html">previous post</a> we covered Stochastic VI: an efficient and scalable variational inference method for exponential family models. However, there’re many more distributions than those belonging to the exponential family. Inference in these cases requires significant amount of model analysis. In this post we consider <a href="https://arxiv.org/abs/1401.0118">Black Box Variational Inference</a> by Ranganath et al. This work just as the previous one comes from <a href="http://www.cs.columbia.edu/~blei/">David Blei</a> lab — one of the leading researchers in VI. And, just for the dessert, we’ll touch upon another paper, which will finally introduce some neural networks in VI.</p>
<!--more-->
<h3>
Blackbox Variational Inference
</h3>
<p>As we have learned so far, the goal of VI is to maximize the ELBO <span class="math inline">\(\mathcal{L}(\Theta, \Lambda)\)</span>. When we maximize it by <span class="math inline">\(\Lambda\)</span>, we decrease the gap between the marginal likelihood of the model considered <span class="math inline">\(\log p(x \mid \Theta)\)</span>, and when we maximize it by <span class="math inline">\(\Theta\)</span> we acltually fit the model. So let’s concentrate on optimizing this objective:</p>
<p><span class="math display">\[
\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{q(z \mid x, \Lambda)} \left[\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right]
\]</span></p>
<p>Let’s find gradients of this objective:</p>
<p><span class="math display">\[
\begin{align}
\nabla_{\Lambda} \mathcal{L}(\Theta, \Lambda)
&= \nabla_{\Lambda} \int q(z \mid x, \Lambda) \left[\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right] dz \\
&= \int \nabla_{\Lambda} q(z \mid x, \Lambda) \left[\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right] dz - \int q(z \mid x, \Lambda) \nabla_{\Lambda} \log q(z \mid x, \Lambda) dz \\
&= \mathbb{E}_{q} \left[\frac{\nabla_{\Lambda} q(z \mid x, \Lambda)}{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right] - \int q(z \mid x, \Lambda) \frac{\nabla_{\Lambda} q(z \mid x, \Lambda)}{q(z \mid x, \Lambda)} dz \\
&= \mathbb{E}_{q} \left[\nabla_{\Lambda} \log q(z \mid x, \Lambda) \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right] - \int \nabla_{\Lambda} q(z \mid x, \Lambda) dz \\
&= \mathbb{E}_{q} \left[\nabla_{\Lambda} \log q(z \mid x, \Lambda) \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right] - \nabla_{\Lambda} \overbrace{\int q(z \mid x, \Lambda) dz}^{=1} \\
&= \mathbb{E}_{q} \left[\nabla_{\Lambda} \log q(z \mid x, \Lambda) \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right]
\end{align}
\]</span></p>
<p>In statistics <span class="math inline">\(\nabla_\Lambda \log q(z \mid x, \Lambda)\)</span> is known as <a href="https://en.wikipedia.org/wiki/Score_(statistics)">score function</a>. For more on this “trick” see <a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/">a blogpost by Shakir Mohamed</a>. In many cases of practical interest <span class="math inline">\(\log p(x, z, \mid \Theta)\)</span> is too complicated to compute this expectation in closed form. Recall that we already used stochastic optimization successfully, so we can settle with just an estimate of true gradient. We get one by approximating the expectation using Monte-Carlo estimates using <span class="math inline">\(L\)</span> samples <span class="math inline">\(z^{(l)} \sim q(z \mid x, \Lambda)\)</span> (in practice we sometimes use just <span class="math inline">\(L=1\)</span> sample. We expect correct averaging to happen automagically due to use of minibatches):</p>
<p><span class="math display">\[
\nabla_{\Lambda} \mathcal{L}(\Theta, \Lambda)
\approx \frac{1}{L} \sum_{l=1}^L \nabla_{\Lambda} \log q(z^{(l)} \mid x, \Lambda) \log \frac{p(x, z^{(l)} \mid \Theta)}{q(z^{(l)} \mid x, \Lambda)}
\]</span></p>
<p>For model parameters <span class="math inline">\(\Theta\)</span> gradients look even simpler, as we don’t need to differentiate w.r.t. expectation distribution’s parameters:</p>
<p><span class="math display">\[
\begin{align}
\nabla_{\Theta} \mathcal{L}(\Theta, \Lambda)
&= \mathbb{E}_{q} \nabla_{\Theta} \log p(x, z \mid \Theta)
\approx \frac{1}{L} \sum_{l=1}^L \nabla_{\Theta} \log p(x, z^{(l)} \mid \Theta)
\end{align}
\]</span></p>
<p>We can even “naturalize” these gradients by premultiplying by the inverse Fisher Information Matrix <span class="math inline">\(\mathcal{I}(\Lambda)^{-1}\)</span>. And that’s it! Much simpler than before, right? Of course, there’s no free lunch, so there must be a catch… And there is: performance of stochastic optimization methods crucially depends on the variance of gradient estimators. It makes perfect sense: the higher the variance — the less information about the step direction we get. And unfortunately, in practice the aforementioned estimator based on the score function has impractically high variance. Luckily, in Monte Carlo community there are many variance reductions techniques known, we now describe some of them.</p>
<p>The first technique we’ll describe is <strong>Rao-Blackwellization</strong>. The idea is simple: if it’s possible to compute the expectation w.r.t. some of random variables, you should do it. If you think of it, it’s an obvious advice as you essentially reduce amount of randomness in your Monte Carlo estimates. But let’s put it more formally: we use chain rule to rewrite joint expectation as marginal expectation of conditional one:</p>
<p><span class="math display">\[
\mathbb{E}_{X, Y} f(X, Y) = \mathbb{E}_X \left[ \mathbb{E}_{Y \mid X} f(X, Y) \right]
\]</span></p>
<p>Let’s see what happens with variance (in scalar case) when we estimate expectation of <span class="math inline">\(\mathbb{E}_{Y \mid X} f(X, Y)\)</span> instead of expectation of <span class="math inline">\(f(X, Y)\)</span>:</p>
<p><span class="math display">\[
\begin{align}
\text{Var}_X(\mathbb{E}_{Y \mid X} f(X, Y))
&= \mathbb{E} (\mathbb{E}_{Y \mid X} f(X, Y))^2 - (\mathbb{E}_{X, Y} f(X, Y))^2 \\
&= \text{Var}_{X,Y}(f(X, Y)) - \mathbb{E}_X \left(\mathbb{E}_{Y \mid X} f(X, Y)^2 - (\mathbb{E}_{Y \mid X} f(X, Y))^2 \right) \\
&= \text{Var}_{X,Y}(f(X, Y)) - \mathbb{E}_X \text{Var}_{Y\mid X} (f(X, Y))
\end{align}
\]</span></p>
<p>This formula says that Rao-Blackwellizing an estimator reduces its variance by <span class="math inline">\(\mathbb{E}_X \text{Var}_{Y\mid X} (f(X, Y))\)</span>. Indeed, you can think of this term as of a measure of how much information <span class="math inline">\(Y\)</span> contains about <span class="math inline">\(X\)</span> that’s relevant to computing <span class="math inline">\(f(X, Y)\)</span>. Suppose <span class="math inline">\(Y = X\)</span>: then you have <span class="math inline">\(\mathbb{E}_X f(X, X)\)</span>, and taking expectation w.r.t. <span class="math inline">\(Y\)</span> does not reduce amount of randomness in the estimator. And this is what the formula tells us as <span class="math inline">\(\text{Var}_{Y \mid X} f(X, Y)\)</span> would be 0 in this case. Here’s another example: suppose <span class="math inline">\(f\)</span> does not use <span class="math inline">\(X\)</span> at all: then only randomness in <span class="math inline">\(Y\)</span> affects the estimate, and after Rao-Blackwellization we expect the variance to drop to 0. And the formula agrees with out expectations as <span class="math inline">\(\mathbb{E}_X \text{Var}_{Y \mid X} f(X, Y) = \text{Var}_Y f(X, Y)\)</span> for any <span class="math inline">\(X\)</span> since <span class="math inline">\(f(X, Y)\)</span> does not depend on <span class="math inline">\(X\)</span>.</p>
<p>Next technique is <strong>Control Variates</strong>, which is slightly less intuitive. The idea is that we can add zero-mean function <span class="math inline">\(h(X)\)</span> that’ll preserve the expectation, but reduce the variance. Again, for a scalar case</p>
<p><span class="math display">\[
\text{Var}(f(X) - \alpha h(X)) = \text{Var}(f(X)) - 2 \alpha \text{Cov}(f(X), h(X)) + \alpha^2 \text{Var}(f(X))
\]</span></p>
<p>Optimal <span class="math inline">\(\alpha^* = \frac{\text{Cov}(f(X), h(X))}{\text{Var}(f(X))}\)</span>. This formula reflects an obvious fact: if we want to reduce the variance, <span class="math inline">\(h(X)\)</span> must be correlated with <span class="math inline">\(f(X)\)</span>. Sign of correlation does not matter, as <span class="math inline">\(\alpha^*\)</span> will adjust. BTW, in reinforcement learning <span class="math inline">\(\alpha\)</span> is called <strong>baseline</strong>.</p>
<p>As we already have learned, <span class="math inline">\(\mathbb{E}_{q(z \mid x, \Lambda)} \nabla_\Lambda \log q(z \mid x, \Lambda) = 0\)</span>, so the score function is a good candidate for <span class="math inline">\(h(x)\)</span>. Therefore our estimates become</p>
<p><span class="math display">\[
\nabla_{\Lambda} \mathcal{L}(\Theta, \Lambda)
\approx \frac{1}{L} \sum_{l=1}^L \nabla_{\Lambda} \log q(z^{(l)} \mid x, \Lambda) \circ \left(\log \frac{p(x, z^{(l)} \mid \Theta)}{q(z^{(l)} \mid x, \Lambda)} - \alpha^* \right)
\]</span></p>
<p>Where <span class="math inline">\(\circ\)</span> is pointwise multiplication and <span class="math inline">\(\alpha\)</span> is a vector of <span class="math inline">\(|\Lambda|\)</span> components with <span class="math inline">\(\alpha_i\)</span> being a baseline for variational parameter <span class="math inline">\(\Lambda_i\)</span>:</p>
<p><span class="math display">\[
\alpha^*_i = \frac{\text{Cov}(\nabla_{\Lambda_i} \log q(z \mid x, \Lambda)\left( \log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right), \nabla_{\Lambda_i} \log q(z \mid x, \Lambda))}{\text{Var}(\nabla_{\Lambda_i} \log q(z \mid x, \Lambda)\left( \log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right))}
\]</span></p>
<h3>
Neural Variational Inference and Learning
</h3>
<p>Hoooray, neural networks! In this section I’ll briefly describe a variance reduction technique coined by A. Mnih and K. Gregor in <a href="https://arxiv.org/abs/1402.0030">Neural Variational Inference and Learning in Belief Networks</a>. The idea is surprisingly simple: why not learn a baseline <span class="math inline">\(\alpha\)</span> using a neural network?</p>
<p><span class="math display">\[
\nabla_{\Lambda} \mathcal{L}(\Theta, \Lambda)
\approx \frac{1}{L} \sum_{l=1}^L \nabla_{\Lambda} \log q(z^{(l)} \mid x, \Lambda) \circ \left(\log \frac{p(x, z^{(l)} \mid \Theta)}{q(z^{(l)} \mid x, \Lambda)} - \alpha^* - \alpha(x) \right)
\]</span></p>
<p>Where <span class="math inline">\(\alpha(x)\)</span> is a neural network trained to minimize</p>
<p><span class="math display">\[
\mathbb{E}_{q(z \mid x, \Lambda)} \left( \log \frac{p(x, z^{(l)} \mid \Theta)}{q(z^{(l)} \mid x, \Lambda)} - \alpha^* - \alpha(x) \right)^2
\]</span></p>
<p>What’s the motivation of this objective? The gradient step of <span class="math inline">\(\nabla_\Lambda \mathcal{L}(\Theta, \Lambda)\)</span> can be seen as pushing <span class="math inline">\(q(z\mid x, \Lambda)\)</span> towards <span class="math inline">\(p(x, z \mid \Theta)\)</span>. Since <span class="math inline">\(q\)</span> has to be normalized like any other proper distribution, it’s actually pushed towards the true posterior <span class="math inline">\(p(z \mid x, \Theta)\)</span>. We can rewrite the gradient <span class="math inline">\(\nabla_\Lambda \mathcal{L}(\Theta, \Lambda)\)</span> as</p>
<p><span class="math display">\[
\begin{align}
\nabla_{\Lambda} \mathcal{L}(\Theta, \Lambda)
&= \mathbb{E}_{q} \left[\nabla_{\Lambda} \log q(z \mid x, \Lambda) \left(\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right) \right] \\
&= \mathbb{E}_{q} \left[\nabla_{\Lambda} \log q(z \mid x, \Lambda) \left(\log p(z \mid x, \Theta) - \log q(z \mid x, \Lambda) + \log p(x \mid \Theta) \right) \right]
\end{align}
\]</span></p>
<p>While this additional <span class="math inline">\(\log p(x \mid \Theta)\)</span> term does not contribute to the expectation, it affects the variance on the estimator. Therefore, <span class="math inline">\(\alpha(x)\)</span> is supposed to estimate the marginal log-likelihood <span class="math inline">\(\log p(x \mid \Theta)\)</span>.</p>
<p>The paper also lists several other variance reduction techniques that can be used in combination with the neural network-based baseline:</p>
<ul>
<li>
<strong>Constant baseline</strong> — analogue of <em>Control Variates</em>, uses running average of <span class="math inline">\(\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda)\)</span> as a baseline
</li>
<li>
<strong>Variance normalization</strong> — normalizes the learning signal to unit variance, equivalent to adaptive learning rate
</li>
<li>
<strong>Local learning signals</strong> — falls out of the scope of this post as requires it model-specific analysis and alternations, and can’t be used in Blackbox regime
</li>
</ul><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/Rx_ge6y2j6g" height="1" width="1" alt=""/>Tue, 05 Jul 2016 00:00:00 UThttp://artem.sobolev.name/posts/2016-07-05-neural-variational-inference-blackbox.htmlArtemhttp://artem.sobolev.name/posts/2016-07-05-neural-variational-inference-blackbox.htmlNeural Variational Inference: Scaling Up
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/bYNrSVQbgPU/2016-07-04-neural-variational-inference-stochastic-variational-inference.html
<p>In the <a href="/posts/2016-07-01-neural-variational-inference-classical-theory.html">previous post</a> I covered well-established classical theory developed in early 2000-s. Since then technology has made huge progress: now we have much more data, and a great need to process it and process it fast. In big data era we have huge datasets, and can not afford too many full passes over it, which might render classical VI methods impractical. Recently M. Hoffman et al. dissected classical Mean-Field VI to introduce stochasticity right into its heart, which resulted in <a href="https://arxiv.org/abs/1206.7051">Stochastic Variational Inference</a>.</p>
<!--more-->
<h3>
Stochastic Variational Inference
</h3>
<p>We start with model assumptions: we have 2 types of latent variables, the global latent variable <span class="math inline">\(\beta\)</span> and a bunch of local variables <span class="math inline">\(z_n\)</span> for each observation <span class="math inline">\(x_n\)</span>. Recalling our GMM example, <span class="math inline">\(\beta\)</span> can be thought of as a mixture weights <span class="math inline">\(\pi\)</span>, and <span class="math inline">\(z_n\)</span> are membership indicators, as previously. These variables are assumed to come from some exponential family distribution:</p>
<p><span class="math display">\[
p(x_n, z_n \mid \beta) = h(x_n, z_n) \exp \left( \beta^T t(x_n, z_n) - a_l(\beta) \right) \\
\\
p(\beta) = h(\beta) \exp(\alpha^T t(\beta) - a_g(\alpha))
\]</span></p>
<p>Where <span class="math inline">\(t(\cdot)\)</span> and <span class="math inline">\(h(\cdot)\)</span> are overloaded by their argument, so <span class="math inline">\(t(\beta)\)</span> and <span class="math inline">\(t(z_{nj})\)</span> correspond to two different functions. <span class="math inline">\(t(\cdot)\)</span> gives a <strong>natural parameter</strong> and also <strong>sufficient statistics</strong>. <span class="math inline">\(a_g\)</span> and <span class="math inline">\(a_l\)</span> are log-normalizing constants which for exponential family distributions have an interesting property, namely, the gradient of the log-normalizing constant is the expectation of sufficient statistics: <span class="math inline">\(\nabla_\alpha a_g(\alpha) = \mathbb{E} t(\beta)\)</span>.</p>
<p>From these assumptions we can derive <em>complete conditionals</em> (conditional distribution given all other hidden variables and observables) for <span class="math inline">\(\beta\)</span> and <span class="math inline">\(z_{nj}\)</span>:</p>
<p><span class="math display">\[
\begin{align}
p(\beta \mid x, z, \alpha)
&\propto \prod_{n=1}^N p(x_n, z_n \mid \beta) p(\beta \mid \alpha) \\
&= h(\beta) \prod_{n=1}^N h(x_n, z_n) \exp \left( \beta^T \sum_{n=1}^N t(x_n, z_n) - N a_l(\beta) + \alpha^T t(\beta) - a_g(\alpha) \right) \\
&\propto h(\beta) \exp \left( \eta_g(x, z, \alpha)^T t(\beta) \right)
\end{align}
\]</span></p>
<p>Where <span class="math inline">\(t(\beta) = (\beta, -a_l(\beta))\)</span>, <span class="math inline">\(\eta_g(x, z, \alpha) = (\alpha_1 + \sum_{n=1}^N t(x_n, z_n), \alpha_2 + N)\)</span>. We see that the (unnormalized) posterior distribution for <span class="math inline">\(\beta\)</span> has the same functional form as the (unnormalized) prior <span class="math inline">\(p(\beta)\)</span>, therefore after normalization it’d be</p>
<p><span class="math display">\[
p(\beta \mid x, z, \alpha)
= h(\beta) \exp \left( \eta_g(x, z, \alpha)^T t(\beta) - a_g(\eta_g(x, z, \alpha)) \right)
\]</span></p>
<p>The same applies to local variables <span class="math inline">\(z_{nj}\)</span>:</p>
<p><span class="math display">\[
p(z_{nj} \mid x_n, z_{n,-j}, \beta)
\propto h(z_{nj}) \exp \left( \eta_l(x_n, z_{n,-j}, \beta)^T t(z_{nj}) \right)
\]</span> Hence <span class="math display">\[
p(z_{nj} \mid x_n, z_{n,-j}, \beta)
= h(z_{nj}) \exp \left( \eta_l(x_n, z_{n,-j}, \beta)^T t(z_{nj}) - a_m(\eta_l(x_n, z_{n,-j}, \beta)) \right)
\]</span></p>
<p>Even though we’ve managed to find the complete conditional for <span class="math inline">\(\beta\)</span>, it might be intractable to find the posterior for all latent variables <span class="math inline">\(p(\beta, z \mid x, \alpha)\)</span>. We therefore turn to the mean field approximation:</p>
<p><span class="math display">\[
q(z, \beta \mid \Lambda) = q(\beta \mid \lambda) \prod_{n=1}^N \prod_{j=1}^J q(z_{nj} \mid \phi_{nj})
\]</span></p>
<p>We assume these marginal distributions come from the exponential family:</p>
<p><span class="math display">\[
q(\beta \mid \lambda) = h(\beta) \exp(\lambda^T t(\beta) - a_g(\lambda)) \\
q(z_{nj} \mid \phi_{nj}) = h(z_{nj}) \exp(\phi_{nj}^T t(z_{nj}) - a_m(\phi_{nj}))
\]</span></p>
<p>Let’s find the optimal variational parameters now by optimizing the ELBO <span class="math inline">\(\mathcal{L}(\Theta, \Lambda)\)</span> (<span class="math inline">\(\Theta\)</span> is model parameters, <span class="math inline">\(\alpha\)</span>, and <span class="math inline">\(\Lambda\)</span> contains variational parameters <span class="math inline">\(\phi\)</span> and <span class="math inline">\(\lambda\)</span>) by <span class="math inline">\(\lambda\)</span> and <span class="math inline">\(\phi_{nj}\)</span>:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}(\lambda)
&= \mathbb{E}_{q} \left( \log p(x, z, \beta) - \log q(\beta) - \log q(z) \right)
= \mathbb{E}_{q} \left( \log p(\beta \mid x, z) - \log q(\beta) \right) + \text{const} \\
&= \mathbb{E}_{q} \left( \eta_g(x, z, \alpha)^T t(\beta) - \lambda^T t(\beta) + a_g(\lambda) \right) + \text{const} \\
&= \left(\mathbb{E}_{q(z)} \eta_g(x, z, \alpha) - \lambda \right)^T \mathbb{E}_{q(\beta)} t(\beta) + a_g(\lambda) + \text{const} \\
&= \left(\mathbb{E}_{q(z)} \eta_g(x, z, \alpha) - \lambda \right)^T \nabla_\lambda a_g(\lambda) t(\beta) + a_g(\lambda) + \text{const}
\end{align}
\]</span></p>
<p>Where we used aforementioned property of exponential family distributions: <span class="math inline">\(\nabla_\lambda a_g(\lambda) = \mathbb{E}_{q(\beta)} t(\beta)\)</span>. The gradient then is <span class="math display">\[
\nabla_\lambda \mathcal{L}(\lambda)
= \nabla_\lambda^2 a_g(\lambda) \left(\mathbb{E}_{q(z)} \eta_g(x, z, \alpha) - \lambda \right)
\]</span></p>
<p>After setting it to zero we get an update for global latent variables: <span class="math inline">\(\lambda = \mathbb{E}_{q(z)} \eta_g(x, z, \alpha)\)</span>. Following the same reasoning we derive the optimal update for <span class="math inline">\(\phi_{nj}\)</span>:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}(\phi_{nj})
&= \mathbb{E}_{q} \left( \log p(z_{nj} \mid x_n, z_{n,-j}, \beta) - \log q(z_{nj}) \right) + \text{const} \\
&= \mathbb{E}_{q} \left( \eta_l(x_n, z_{n,-j}, \beta)^T t(z_{nj}) - \phi_{nj}^T t(z_{nj}) + a_m(\phi_{nj})\right) + \text{const} \\
&= \left(\mathbb{E}_{q(\beta) q(z_{n,-j})} \eta_l(x_n, z_{n,-j}, \beta) - \phi_{nj} \right)^T \mathbb{E}_{q(z_{nj})} t(z_{nj}) + a_m(\phi_{nj}) + \text{const} \\
\end{align}
\]</span></p>
<p>The gradient then is <span class="math inline">\(\nabla_{\phi_{nj}} \mathcal{L}(\phi) = \nabla_{\phi_{nj}}^2 a_m(\phi_{nj}) \left(\mathbb{E}_{q(\beta) q(z_{n,-j})} \eta_l(x_n, z_{n,-j}, \beta) - \phi_{nj} \right)\)</span>, and the update is <span class="math inline">\(\phi_{nj} = \mathbb{E}_{q(\beta) q(z_{n,-j})} \eta_l(x_n, z_{n,-j}, \beta)\)</span>.</p>
<p>So far we found mean-field updates, as well as corresponding gradients of the ELBO for variational parameters <span class="math inline">\(\lambda\)</span> and <span class="math inline">\(\phi_{nj}\)</span>. Next step is to transform these gradients into <strong>natural gradients</strong>. Intuitively, classical gradient defines local linear approximation, where the notion of locality comes from the Euclidean space. However, parameters influence the ELBO only through distributions <span class="math inline">\(q\)</span>, so we might like to alter our idea of locality based on how much the distributions change. This is what natural gradient does: it defines local linear approximation where locality means small distance (symmetrized KL-divergence) between distributions. There’s great formal explanation in the paper, and if you want to read more on that matter, I refer you to a great post by Roger Grosse, <a href="http://www.metacademy.org/roadmaps/rgrosse/dgml">Differential geometry for machine learning</a>.</p>
<p>The natural gradient can be obtained from the usual gradient using a simple linear transformation:</p>
<p><span class="math display">\[
\nabla_\lambda^\text{N} f(\lambda) = \mathcal{I}(\lambda)^{-1} \nabla_{\lambda} f(\lambda)
\]</span></p>
<p>Where <span class="math inline">\(\mathcal{I}(\lambda) := \mathbb{E}_{q(\beta \mid \lambda)} \left[ \nabla_\lambda \log q(\beta \mid \lambda) (\nabla_\lambda \log q(\beta \mid \lambda))^T \right]\)</span> is Fisher Information Matrix. Here I considered parameter <span class="math inline">\(\lambda\)</span> of the distribution <span class="math inline">\(q(\beta \mid \lambda)\)</span>, you got the idea. For the exponential family distribution this Information Matrix takes an especially simple form:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{I}(\lambda)
&= \mathbb{E}_q (t(\beta) - \nabla_\lambda a_g(\lambda)) (t(\beta) - \nabla_\lambda a_g(\lambda))^T
= \mathbb{E}_q (t(\beta) - \mathbb{E}_q t(\beta)) (t(\beta) - \mathbb{E}_q t(\beta))^T \\
&= \text{Cov}_q (t(\beta))
= \nabla_\lambda^2 a_g(\lambda)
\end{align}
\]</span></p>
<p>Where we’ve used another <a href="https://en.wikipedia.org/wiki/Exponential_family#Differential_identities_for_cumulants">differential identity for exponential family</a>. All these calculations lead us to the natural gradients of ELBO for variational parameters:</p>
<p><span class="math display">\[
\nabla_\lambda^\text{N} \mathcal{L}(\lambda) = \mathbb{E}_{q(z)} \eta_g(x, z, \alpha) - \lambda \\
\nabla_{\phi_{nj}}^\text{N} \mathcal{L}(\lambda) = \mathbb{E}_{q(\beta) q(z_{n,-j})} \eta_l(x_n, z_{n,-j}, \beta) - \phi_{nj}
\]</span></p>
<p>Surprisingly, computation-wise calculating natural gradients is even simpler that calculating classical gradients! There’s an interesting connection between the mean-field update and a natural gradient step. In particular, if we make a step along the natural gradient with step size equal 1, we’d get <span class="math inline">\(\lambda^{\text{new}} = \lambda^{\text{old}} + (\mathbb{E}_{q(z)} \eta_g(x, z, \alpha) - \lambda^{\text{old}}) = \mathbb{E}_{q(z)} \eta_g(x, z, \alpha)\)</span>. The same applies to parameters <span class="math inline">\(\phi\)</span>. This means that the mean field updates are exactly natural gradient steps, and vice versa.</p>
<p>Recall, we have derived mean field updates by finding a minima of KL-divergence with the true posterior, that is in just one step (one update) we arrive at minimum. Obviously, we have the same in the natural gradient formulation, when just one step brings us to the optimum.</p>
<p>Now, the last component is stochasticity itself. So far we have only played a little with mean-field update scheme, and discovered its connection to the natural gradient optimization. We note that we have 2 parameters: local <span class="math inline">\(\phi_{nj}\)</span> and global parameter <span class="math inline">\(\lambda\)</span>. The first one is easy to optimize over as it depends only on one, <span class="math inline">\(n\)</span>th sample <span class="math inline">\(x_n\)</span>. The second one, though, needs to incorporate information from all the samples, which is computationally prohibitive in large scale regime. Luckily, now once we know the equivalence between mean-field update and natural gradient step, we can borrow ideas from stochastic optimization to make this process more scalable.</p>
<p>Let’s first reformulate the ELBO to include the sum over samples <span class="math inline">\(x_n\)</span>:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}(\Theta, \Lambda)
&= \mathbb{E}_{q} \left[ \log p(\beta \mid \alpha) - \log q(\beta \mid \lambda) + \sum_{n=1}^N \left(\log p(x_n, z_n \mid \beta) - \log q(z_n \mid \phi_n) \right) \right] \\
& = \mathbb{E}_{q} \left[ \log p(\beta \mid \alpha) - \log q(\beta \mid \lambda) + N \mathbb{E}_{I} \left(\log p(x_I, z_I \mid \beta) - \log q(z_I \mid \phi_I) \right) \right]
\end{align}
\]</span></p>
<p>Where <span class="math inline">\(I \sim \text{Unif}\{1, \dots, N\}\)</span> — uniformly distribution index of a sample. Now let’s estimate <span class="math inline">\(\mathcal{L}\)</span> using a sample <span class="math inline">\(S\)</span> (assume <span class="math inline">\(N\)</span> divides by sample size <span class="math inline">\(|S|\)</span>) of uniformly chosen indices, this’d result in an unbiased estimator (it’s gradient would also be unbiased, so we can maximize the true ELBO by maximizing the estimate). Author of the paper start with single-sample derivation and then extend it to minibatches, but I decided I’d go straight to the minibatch case:</p>
<p><span class="math display">\[
\begin{align}
\mathcal{L}_S(\Theta, \Lambda)
& := \mathbb{E}_{q} \left[ \log p(\beta \mid \alpha) - \log q(\beta \mid \lambda) + \frac{N}{|S|} \sum_{i \in S} \left(\log p(x_i, z_i \mid \beta) - \log q(z_i \mid \phi_i) \right) \right] \\
& = \mathbb{E}_{q} \left[ \log p(\beta \mid \alpha) - \log q(\beta \mid \lambda) + \sum_{n=1}^{N / |S|} \sum_{i \in S} \left(\log p(x_i, z_i \mid \beta) - \log q(z_i \mid \phi_i) \right) \right]
\end{align}
\]</span></p>
<p>This estimate is exactly <span class="math inline">\(\mathcal{L}(\Theta, \Lambda)\)</span> calculated on sample consisting of <span class="math inline">\(\{x_i, z_i\}_{i \in S}\)</span> repeated <span class="math inline">\(N / |S|\)</span> times. Hence its natural gradient w.r.t. <span class="math inline">\(\lambda\)</span> is</p>
<p><span class="math display">\[
\nabla_\lambda^\text{N} \mathcal{L}_S(\lambda) = \mathbb{E}_{q(z)} \eta_g(\{x_S\}_{n=1}^{N/|S|}, \{z_S\}_{n=1}^{N/|S|}, \alpha) - \lambda \\
\]</span></p>
<p>One important note: for stochastic optimization we can’t use constant step size. As Robbins-Monro conditions suggest, we need to use schedule <span class="math inline">\(\rho_t\)</span> such that <span class="math inline">\(\sum \rho_t = \infty\)</span> and <span class="math inline">\(\sum \rho_t^2 < \infty\)</span>. Then the update <span class="math inline">\(\lambda^{\text{new}} = \lambda^{\text{old}} + \rho_t \nabla_\lambda^\text{N} \mathcal{L}_S(\lambda) = (1 - \rho_t) \lambda^{\text{old}} + \rho_t \mathbb{E}_{q(z)} \eta_g(\{x_S\}_{n=1}^{N/|S|}, \{z_S\}_{n=1}^{N/|S|}, \alpha)\)</span></p>
Finally we have the following optimization scheme:
<ul>
<li>
Start with random initialization for <span class="math inline">\(\lambda^{(0)}\)</span>
</li>
<li>
For <span class="math inline">\(t\)</span> from 0 to MAX_ITER
<ol>
<li>
Sample <span class="math inline">\(S \sim \text{Unif}\{1, \dots, N\}^{|S|}\)</span>
</li>
<li>
For each sample <span class="math inline">\(i \in S\)</span> update the local variational parameter <span class="math inline">\(\phi_{i,j} = \mathbb{E}_{q(\beta) q(z_{i,-j})} \eta_l(x_i, z_{i,-j}, \beta)\)</span>
</li>
<li>
Replicate the sample <span class="math inline">\(N / |S|\)</span> times and compute the global update <span class="math inline">\(\hat \lambda = \mathbb{E}_{q(z)} \eta_g(\{x_S\}_{n=1}^{N/|S|}, \{z_S\}_{n=1}^{N/|S|}, \alpha)\)</span>
</li>
<li>
Update the global update <span class="math inline">\(\lambda^{(t+1)} = (1-\rho_t) \lambda^{(t)} + \rho_t \hat \lambda\)</span>
</li>
</ol>
</li>
</ul><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/bYNrSVQbgPU" height="1" width="1" alt=""/>Mon, 04 Jul 2016 00:00:00 UThttp://artem.sobolev.name/posts/2016-07-04-neural-variational-inference-stochastic-variational-inference.htmlArtemhttp://artem.sobolev.name/posts/2016-07-04-neural-variational-inference-stochastic-variational-inference.htmlNeural Variational Inference: Classical Theory
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/tMWZqa8pvO0/2016-07-01-neural-variational-inference-classical-theory.html
<p>As a member of <a href="http://bayesgroup.ru/">Bayesian methods research group</a> I’m heavily interested in Bayesian approach to machine learning. One of the strengths of this approach is ability to work with hidden (unobserved) variables which are interpretable. This power however comes at a cost of generally intractable exact inference, which limits the scope of solvable problems.</p>
<p>Another topic which gained lots of momentum in Machine Learning recently is Deep Learning, of course. With Deep Learning we can now build big and complex models that outperform most hand-engineered approaches given lots of data and computational power. The fact that Deep Learning needs a considerable amount of data also requires these methods to be scalable — a really nice property for any algorithm to have, especially in a Big Data epoch.</p>
<p>Given how appealing both topics are it’s not a surprise there’s been some work to marry these two recently. In this <a href="/tags/modern%20variational%20inference%20series.html">series</a> of blogsposts I’d like to summarize recent advances, particularly in variational inference. This is not meant to be an introductory discussion as prior familiarity with classical topics (Latent variable models, <a href="https://en.wikipedia.org/wiki/Variational_Bayesian_methods">Variational Inference, Mean-field approximation</a>) is required, though I’ll introduce these ideas anyway just to remind it and setup the notation.</p>
<!--more-->
<h3>
Latent Variables Models
</h3>
<p>Suppose you have a probabilistic model that’s easy to describe using some auxiliary variables <span class="math inline">\(Z\)</span> that you don’t observe directly (or even would like to infer it given the data). One classical example for this setup is Gaussian Mixture Modeling: we have <span class="math inline">\(K\)</span> components in a mixture, and <span class="math inline">\(z_n\)</span> is a <a href="https://en.wikipedia.org/wiki/One-hot">one-hot</a> vector of dimensionality <span class="math inline">\(K\)</span> indicating which component an observation <span class="math inline">\(x_n\)</span> belongs to. Then, conditioned on <span class="math inline">\(z_n\)</span> the distribution of <span class="math inline">\(x_n\)</span> is a usual Gaussian distribution: <span class="math inline">\(p(x_{n} \mid z_{nk} = 1) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\)</span> (here whenever I refer to a distribution, you should read it as its density. At least <a href="https://en.wikipedia.org/wiki/Generalized_function">generalized one</a>). Therefore the joint distribution of the model is</p>
<p><span class="math display">\[
p(x, z \mid \Theta) = \prod_{n=1}^N \prod_{k=1}^K \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}} \pi_k^{z_{nk}}
\]</span></p>
<p>Where <span class="math inline">\(\pi\)</span> is a probability distribution over <span class="math inline">\(K\)</span> outcomes, and <span class="math inline">\(\Theta\)</span> is a set of all model’s parameters (<span class="math inline">\(\pi\)</span>, <span class="math inline">\(\mu\)</span>s and <span class="math inline">\(\Sigma\)</span>s).</p>
<p>We’d like to do 2 things with the model: first, we obviously need to learn parameters <span class="math inline">\(\Theta\)</span>, and second, we’d like infer these latent variables <span class="math inline">\(z_n\)</span> to know which cluster the observation <span class="math inline">\(x_n\)</span> belongs to, that is, we need to calculate the distribution of <span class="math inline">\(z_n\)</span> conditioned on <span class="math inline">\(x_n\)</span>: <span class="math inline">\(p(z_n \mid x_n)\)</span>.</p>
<p>We want to learn the parameters <span class="math inline">\(\Theta\)</span> as usual by maximizing the log-likelihood. Unfortunately, we don’t know true assignments <span class="math inline">\(z_n\)</span>, and marginalizing over it as in <span class="math inline">\(p(x_n) = \sum_{k=1}^K \pi_k p(x_n, z_{nk} = 1)\)</span> is not a good idea as the resulting optimization problem would lose its convexity. Instead we decompose the log-likelihood as follows:</p>
<p><span class="math display">\[
\begin{align}
\log p(x)
&= \mathbb{E}_{q(z\mid x)} \overbrace{\log p(x)}^{\text{const in $z$}}
= \mathbb{E}_{q(z\mid x)} \log \frac{p(x, z) q(z\mid x)}{p(z \mid x) q(z\mid x)} \\
&= \mathbb{E}_{q(z\mid x)} \log \frac{p(x, z)}{q(z\mid x)} + D_{KL}(q(z\mid x) \mid\mid p(z \mid x))
\end{align}
\]</span></p>
<p>The second term is a Kullback-Leibler divergence, which is always non-negative, and equals zero iff distributions are equal almost everywhere <span class="math inline">\(q(z\mid x) = p(z \mid x)\)</span>. Therefore putting <span class="math inline">\(q(z \mid x) = p(z \mid x)\)</span> eliminates the second term, leaving us with <span class="math inline">\(\log p(x) = \mathbb{E}_{p(z \mid x)} \log \frac{p(x, z)}{p(z \mid x)}\)</span>. Therefore all we need to be able to do is to calculate the posterior <span class="math inline">\(p(z \mid x)\)</span>, and maximize the expectation. This is how EM algorithm is derived: at E-step we calculate the posterior <span class="math inline">\(p(z \mid x, \Theta^{\text{old}})\)</span>, and at M-step we maximize the expectation <span class="math inline">\(\mathbb{E}_{p(z \mid x, \Theta^{\text{old}})} \log \frac{p(x, z \mid \Theta)}{p(z \mid x, \Theta)}\)</span> with respect to <span class="math inline">\(\Theta\)</span> keeping <span class="math inline">\(\Theta^{\text{old}}\)</span> fixed.</p>
<p>Now, all we are left to do is to find the posterior <span class="math inline">\(p(z \mid x)\)</span> which is given by the following deceivingly simple formula knows as a Bayes’ rule.</p>
<p><span class="math display">\[
p(z \mid x) = \frac{p(x \mid z) p(z)}{\int p(x \mid z) p(z)dz}
\]</span></p>
<p>Of course, there’s no free lunch and computing the denominator is intractable in general case. One <strong>can</strong> compute the posterior exactly when the prior <span class="math inline">\(p(z)\)</span> and the likelihood <span class="math inline">\(p(x \mid z)\)</span> are <a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugate</a> (that is, after multiplying the prior by the likelihood you get the same functional form for <span class="math inline">\(z\)</span> as in the prior, thus the posterior comes from the same family as the prior but with different parameters), but many models of practical interest don’t have this property. This is where variational inference comes in.</p>
<h3>
Variational Inference and Mean-field
</h3>
<p>In a variational inference (VI) framework we approximate the true posterior <span class="math inline">\(p(z \mid x)\)</span> with some other simpler distribution <span class="math inline">\(q(z \mid x, \Lambda)\)</span> where <span class="math inline">\(\Lambda\)</span> is a set of (variational) parameters of the (variational) approximation (I may omit <span class="math inline">\(\Lambda\)</span> and <span class="math inline">\(\Theta\)</span> in a “given” clause when it’s convenient, remember, it always could be placed there). One possibility is to divide latent variables <span class="math inline">\(z\)</span> in groups and force the groups to be independent. In extreme case each variable gets its own group, assuming independence among all variables <span class="math inline">\(q(z \mid x) = \prod_{d=1}^D q(z_d \mid x)\)</span>. If we then set about to find the best approximation to the true posterior in this fully factorized class, we will no longer have the optimal <span class="math inline">\(q\)</span> being the true posterior itself, as the true posterior is presumably too complicated to be dealt with in analytic form (which we want from the approximation <span class="math inline">\(q\)</span> when we say “simpler distribution”). Therefore we find the optimal <span class="math inline">\(q(z_i)\)</span> by minimizing the KL-divergence with the true posterior (<span class="math inline">\(\text{const}\)</span> denotes terms that are constant w.r.t. <span class="math inline">\(q(z_i)\)</span>):</p>
<p><span class="math display">\[
\begin{align}
D_{KL}(q(z \mid x) \mid\mid p(z \mid x))
&= \mathbb{E}_{q(z_i \mid x)} \left[ \mathbb{E}_{q(z_{-i} \mid x)} \log \frac{q(z_1 \mid x) \dots q(z_D \mid x)}{p(z \mid x)} \right] \\
&= \mathbb{E}_{q(z_i \mid x)} \left[ \log q(z_i \mid x) - \underbrace{\mathbb{E}_{q(z_{-i} \mid x)} \log p(z \mid x)}_{\log f(z_i \mid x)} \right] + \text{const} \\
&= \mathbb{E}_{q(z_i \mid x)} \left[ \log \frac{q(z_i \mid x)}{\tfrac{1}{Z} f(z_i \mid x)} \right] - \log Z + \text{const} \\
&= D_{KL}\left(q(z_i \mid x) \mid\mid \tfrac{1}{Z} f(z_i \mid x)\right) + \text{const}
\end{align}
\]</span></p>
<p>For many models it’s possible to look into <span class="math inline">\(\mathbb{E}_{q(z_{-i} \mid x)} \log p(z \mid x)\)</span> and immediately recognize logarithm of unnormalized density of some known distribution.</p>
<p>Another cornerstone of this framework is a notion of <strong>Evidence Lower Bound</strong> (ELBO): recall the decomposition of log-likelihood we derived above. In our current setting we can not compute the right hand side as we can not evaluate the true posterior <span class="math inline">\(p(z \mid x)\)</span>. However, note that the left hand side (that is, the log-likelihood) does not depend on the variational distribution <span class="math inline">\(q(z \mid x, \Lambda)\)</span>. Therefore, maximizing the first term of the right hand side w.r.t. variational parameters <span class="math inline">\(\Lambda\)</span> results in minimizing the second term, the KL-divergence with the true posterior. This implies we can ditch the second term, and maximize the first one w.r.t. both model parameters <span class="math inline">\(\Theta\)</span> and variational parameters <span class="math inline">\(\Lambda\)</span>:</p>
<p><span class="math display">\[
\text{ELBO:} \quad \mathcal{L}(\Theta, \Lambda) := \mathbb{E}_{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}
\]</span></p>
<p>Okay, so this covers the basics, but before we go to the neural networks-based methods we need to discuss some general approaches to VI and how to make it scalable. This is what the next blog post is all about.</p><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/tMWZqa8pvO0" height="1" width="1" alt=""/>Fri, 01 Jul 2016 00:00:00 UThttp://artem.sobolev.name/posts/2016-07-01-neural-variational-inference-classical-theory.htmlArtemhttp://artem.sobolev.name/posts/2016-07-01-neural-variational-inference-classical-theory.htmlExploiting Multiple Machines for Embarrassingly Parallel Applications
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/FkchBewZeFA/2014-08-01-gnu-parallel.html
<p>During work on my machine learning project I was needed to perform some quite computation-heavy calculations several times — each time with a bit different inputs. These calculations were CPU and memory bound, so just spawning them all at once would just slow down overall running time because of increased amount of context switches. Yet running 4 (=number of cores in my CPU) of them at a time (actually, 3 since other applications need CPU, too) should speed it up.</p>
<p>Fortunately, I have an old laptop with 2 cores as well as an access to somewhat more modern machine with 4 cores. That results in 10 cores spread across 3 machines (all of`em have some version of GNU Linux installed). The question was how to exploit such a treasury.</p>
<!--more-->
<p>And the answer is GNU Parallel with some additional bells and whistles. GNU Parallel allows one to execute some commands in parallel and even in a distributed way.</p>
<p>The command was as following</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="ex">parallel</span> -u --wd ... -S :,host1,host2 --trc <span class="dt">{}</span>.emb <span class="st">"sh {}"</span></code></pre></div>
Here we have:
<ul>
<li>
<strong>wd</strong> stands for working directory. Three-dots means <code>parallel</code>’s temporary folder
</li>
<li>
<strong>S</strong> contains list of hosts with : being a localhost
</li>
<li>
<strong>trc</strong> stands for “Transfer, Return, Cleanup” and means that we’d like to transfer an executable file to target host, return specified file and do a cleanup
</li>
</ul>
<p><code>parallel</code> accepts list command arguments (file names) in standard input and executes a command (<code>sh</code> in my case) for each of them.</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="fu">ls</span> -1 jobs/* <span class="kw">|</span> <span class="ex">parallel</span> -u --wd ... -S :,host1,host2 --trc <span class="dt">{}</span>.emb <span class="st">"sh {}"</span></code></pre></div>
<p>There’s a problem: we usually need more than one file to do usefull stuff. There are several solutions to that problem</p>
<ul>
<li>
<strong>Bring all files manually</strong><br/> It’s a solution, but somewhat tedious one: setting computing environment on a several machines is dull
</li>
<li>
<strong>tar it and do all the stuff in a command</strong><br/> Looks better, but some shell kung-fu is required
</li>
<li>
<strong>Use <a href="http://en.wikipedia.org/wiki/Shar">shar</a></strong><br/> Basically it’s a tar archive with some shell commands for (self-)extracting. I chose this way and glued in some my code.
</li>
</ul><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/FkchBewZeFA" height="1" width="1" alt=""/>Fri, 01 Aug 2014 00:00:00 UThttp://artem.sobolev.name/posts/2014-08-01-gnu-parallel.htmlArtemhttp://artem.sobolev.name/posts/2014-08-01-gnu-parallel.htmlOn Sorting Complexity
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/OXwmToFy0jc/2014-05-01-on-sorting-complexity.html
<p>It’s well known that lower bound for sorting problem (in general case) is <span class="math inline">\(\Omega(n \log n)\)</span>. The proof I was taught is somewhat involved and is based on paths in “decision” trees. Recently I’ve discovered an information-theoretic approach (or reformulation) to that proof.</p>
<!--more-->
<p>First, let’s state the problem: given a set of some objects with an ordering produce elements of that set in that order. For now it’s completely irrelevant what are these objects, so we can assume them to be just numbers from 1 to n, or some permutation. Thus we’ll be interested in sorting permutations.</p>
<p>We’re given an ordering via a comparison function. It tells us if one object preceds (or is smaller) another outputing True or False. Thus each invocation of the comparator gives us 1 bit of information.</p>
<p>Next question is how many bits we need to represent any permutation. It’s just a binary logarithm of number of all possible permutations of <span class="math inline">\(n\)</span> elements: <span class="math inline">\(\log_2 n!\)</span>. Then we notice that</p>
<p><span class="math display">\[
\log_2 n! = \sum_{k=1}^n \log_2 k \ge \sum_{k=n/2}^{n} \log_2 k
\ge \frac{n}{2} \log_2 \frac{n}{2}
\]</span></p>
<p><span class="math display">\[
\log_2 n! = \sum_{k=1}^n \log_2 k \le n \log_2 n
\]</span></p>
<p>(Or just use the <a href="http://en.wikipedia.org/wiki/Stirling%27s_approximation">Stirling’s approximation</a> formula). Hence <span class="math inline">\(\log_2 n! = \Theta(n \log n)\)</span></p>
<p>So what, you may ask. The key point of proof is that sorting is essentially a search for a correct permutation of the input one. Since one needs <span class="math inline">\(\log_2 n!\)</span> bits to represent any permutation, we <strong>need to get that many bits</strong> of information somehow. Now let’s get back to our comparison function. As we’ve figured out already it’s able to give us only one bit of information per invocation. That implies that we need to call it <span class="math inline">\(\log n! = \Theta(n \log n)\)</span> times. And that’s exactly the lower-bound for sorting complexity. Q.E.D.</p>
<p>Non-<span class="math inline">\(n \log n\)</span> sorting algorithms like <a href="http://en.wikipedia.org/wiki/Radix_sort">RadixSort</a> are possible because they use much more bits of information, taking advantage of numbers’ structure.</p><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/OXwmToFy0jc" height="1" width="1" alt=""/>Thu, 01 May 2014 00:00:00 UThttp://artem.sobolev.name/posts/2014-05-01-on-sorting-complexity.htmlArtemhttp://artem.sobolev.name/posts/2014-05-01-on-sorting-complexity.htmlNamespaced Methods in JavaScript
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/KIo0BuBPx8M/2013-05-23-js-namespaced-methods.html
<p>Once upon a time I was asked (well, actually <a href="http://habrahabr.ru/qa/7130/" title="Javascript: String.prototype.namespace.method и this / Q&A / Хабрахабр">a question</a> wasn’t for me only, but for whole habrahabr’s community) is it possible to implement namespaced methods in JavaScript for built-in types like:</p>
<div class="sourceCode"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="dv">5</span>..<span class="va">rubish</span>.<span class="at">times</span>(<span class="kw">function</span>() <span class="op">{</span> <span class="co">// this function will be called 5 times</span>
<span class="va">console</span>.<span class="at">log</span>(<span class="st">"Hi there!"</span>)<span class="op">;</span>
<span class="op">}</span>)<span class="op">;</span>
<span class="st">"some string"</span>.<span class="va">hask</span>.<span class="at">map</span>(<span class="kw">function</span>(c) <span class="op">{</span> <span class="cf">return</span> <span class="va">c</span>.<span class="va">hask</span>.<span class="at">code</span>()<span class="op">;</span> <span class="op">}</span>)<span class="op">;</span>
<span class="co">// equivalent to</span>
<span class="st">"some string"</span>.<span class="at">split</span>(<span class="st">''</span>).<span class="at">map</span>(<span class="kw">function</span>(c) <span class="op">{</span> <span class="cf">return</span> <span class="va">c</span>.<span class="at">charCodeAt</span>()<span class="op">;</span> <span class="op">}</span>)<span class="op">;</span>
<span class="st">"another string"</span>.<span class="va">algo</span>.<span class="at">lcp</span>(<span class="st">"annotation"</span>)<span class="op">;</span>
<span class="co">// returns longest common prefix of two strings</span></code></pre></div>
<p>As you can see at the link, it’s possible using ECMAScript 5 features. And that’s how: <!--more--></p>
<p>First, let’s point out the main problem with the straightforward approach: <del>it doesn’t work</del> when you write</p>
<div class="sourceCode"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="va">Class</span>.<span class="va">prototype</span>.<span class="va">ns</span>.<span class="at">method</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span>
<span class="cf">return</span> <span class="kw">this</span>.<span class="at">methodA</span>() <span class="op">+</span> <span class="kw">this</span>.<span class="at">methodB</span>()<span class="op">;</span>
<span class="op">}</span></code></pre></div>
<p><code>this</code> points to the <code>Class.prototype.ns</code> instead of an instance of <code>Class</code>. The only way to change it is rebind <code>this</code> to the our instance like this:</p>
<div class="sourceCode"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="kw">var</span> instance <span class="op">=</span> <span class="kw">new</span> Class<span class="op">;</span>
<span class="va">instance</span>.<span class="va">ns</span>.<span class="va">method</span>.<span class="at">call</span>(instance)<span class="op">;</span></code></pre></div>
<p>Obviously, it’s not a solution since in that case it is a lot easier to write something like</p>
<div class="sourceCode"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span class="kw">var</span> instance <span class="op">=</span> <span class="kw">new</span> Class<span class="op">;</span>
<span class="va">MegaLibrary</span>.<span class="at">method</span>(instance)<span class="op">;</span></code></pre></div>
<p>Thus we need to somehow return a correct function (with <code>this</code> binded to the <code>instance</code>) when user calls namespaced methods. This can be done using <a href="http://stackoverflow.com/q/812961/1190430" title="Javascript getters and setters for dummies? - Stack Overflow" target="_blank">getters</a>.</p>
<p>When user accesses our namespace we give him a proxy-object with a custom getter for every method in the namespace. This getter returns a method with rebinded <code>this</code>. The question is: how do we get a reference to the <code>instance</code>? The answer is pretty simple: using getters again! Instead of declaring an ordinary property for the namespace we can create a property with a custom getter memoizing a reference to <code>this</code>. Voilà!</p>
Finally, the code is:
<script src="https://gist.github.com/artsobolev/5599917.js"></script>
<h2 id="but-wait-how-cross-browser-is-it">But wait… How cross browser is it?</h2>
<p>Well, I’m pretty lazy to test it on all platforms (IE, Opera, FF, Chrome, Node.JS), so I’ll do like a mathematician in a famous anecdote:</p>
<blockquote>
<p>Three employees (an engineer, a physicist and a mathematician) are staying in a hotel while attending a technical seminar. The engineer wakes up and smells smoke. He goes out into the hallway and sees a fire, so he fills a trashcan from his room with water and douses the fire. He goes back to bed.</p>
<p>Later, the physicist wakes up and smells smoke. He opens his door and sees a fire in the hallway. He walks down the hall to a fire hose and after calculating the flame velocity, distance, water pressure, trajectory, etc. extinguishes the fire with the minimum amount of water and energy needed.</p>
Later, the mathematician wakes up and smells smoke. She goes to the hall, sees the fire and then the fire hose. She thinks for a moment and then exclaims, ‘Ah, a solution exists!’ and then goes back to bed.
</blockquote>
As you can see, the key part of code is ECMAScript 5’s <code>Object.defineProperty</code>. According to the kangax’s <a href="http://kangax.github.io/es5-compat-table/#Object.defineProperty" title="ECMAScript 5 compatibility table" target="_blank">ECMAScript 5 compatibility table</a> it has pretty good support:
<ul>
<li>
IE 9+
</li>
<li>
Opera 12+
</li>
<li>
FF 4+
</li>
<li>
Chrome 7+ (and thus Node.JS too)
</li>
</ul><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/KIo0BuBPx8M" height="1" width="1" alt=""/>Thu, 23 May 2013 00:00:00 UThttp://artem.sobolev.name/posts/2013-05-23-js-namespaced-methods.htmlArtemhttp://artem.sobolev.name/posts/2013-05-23-js-namespaced-methods.htmlCrazy Expression Parsing
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/VI7SZVJIoPA/2013-03-30-crazy-expression-parsing.html
<p>Suppose we have an expression like <code>(5+5 * (x^x-5 | y && 3))</code> and we’d like to get some computer-understandable representation of that expression, like:</p>
<p><code>ADD Token[5] (MUL Token[5] (AND (BIT_OR (XOR Token[x] (SUB Token[x] Token[5])) Token[y]) Token[3])</code></p>
<p>In case if you don’t know how to do that or are looking for the solutin right now, you should know that I’m not going to present a correct solution. This post is just a joke. You should use either a <a href="http://en.wikipedia.org/wiki/Shunting-yard_algorithm" title="Shunting-yard algorithm — Wikipedia">Shunting-yard algorithm</a> or a <a href="http://en.wikipedia.org/wiki/Recursive_descent_parser">recursive descent parser</a>.</p>
<p>So if you’re ready for madness… Let’s go! <!--more--></p>
<p>Let’s take <a href="http://en.wikipedia.org/wiki/Don%27t_repeat_yourself">Don’t repeat yourself</a> principle as a justification. Moreover, let’s take it to extreme “Don’t repeat”. Indeed, why do we need to repeat what compiler’s developers already did?</p>
Here we go
<script src="https://gist.github.com/artsobolev/5273716.js"></script>
<p>In case you’re wondering what the heck is going on: all constants are converted to instances of <code>Token</code> class, for which I overloaded all the operators. Overloading is done in a way to preserve structure of an expression. The only thing we have to do then is to extract that information. In case you’re not familiar with C++, I recommend you to read something about operator overloading.</p>
<p>As you can see, g++ and python are required for that “parser”. Unfortunatelly, priority of a bitwise xor is too low to serve as a power operator.</p><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/VI7SZVJIoPA" height="1" width="1" alt=""/>Sat, 30 Mar 2013 00:00:00 UThttp://artem.sobolev.name/posts/2013-03-30-crazy-expression-parsing.htmlArtemhttp://artem.sobolev.name/posts/2013-03-30-crazy-expression-parsing.htmlMemoization Using C++11
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/u8nQ5JzLO54/2013-03-29-cpp-11-memoization.html
<p>Recently I’ve read an article <a href="http://john-ahlgren.blogspot.ru/2013/03/efficient-memoization-using-partial.html" title="John Ahlgren: Efficient Memoization using Partial Function Application">Efficient Memoization using Partial Function Application</a>. Author explains function memoization using partial application. When I was reading the article, I thought “Hmmm, can I come up with a more general solution?” And as suggested in comments, one can use variadic templates to achieve it. So here is my version.</p>
<!--more-->
<p>First let’s do it in a more object-oriented way: we define a template class <code>Memoizator</code> with 2 parameters: a return value type and a list of argument’s types. Also we incapsulate a lookup map and will use C++11’s <a href="http://en.cppreference.com/w/cpp/utility/tuple" title="std::tuple - cppreference.com">std::tuple</a> to represent an arguments set.</p>
The code is as follows:
<script src="https://gist.github.com/artsobolev/5270779.js"></script>
<p>Good, but what about computing n-th Fibonacci number using memoization? It’s not possible with a current version of <code>Memoizator</code> since it uses a separate map for each instance even if function is the same. It looks inefficient to store a separate lookup map for each instance of the same function. We’ll fix it by creating a static storage for maps accessed by a function address:</p>
<script src="https://gist.github.com/artsobolev/5271223.js"></script>
<p>Now let’s compare the memoized version against the regular one. If we compute the 42th fibonacci number using simple recursive version (with exponential time complexity), we’d get</p>
<pre><strong>$ time ./a.out</strong>
267914296
real 0m5.314s
user 0m5.220s
sys 0m0.020s</pre>
Now the memoized one (from the source above):
<pre><strong>$ time ./a.out</strong>
267914296
real 0m0.005s
user 0m0.004s
sys 0m0.004s</pre>
<p>Moreover, our memoization reduced time complexity from exponential to linear.</p>
<p><strong>UPD</strong>: you can take a look at another implementation here: <a href="http://cpptruths.blogspot.ru/2012/01/general-purpose-automatic-memoization.html" title="c++ truths: General-purpose Automatic Memoization for Recursive Functions in C++11">General-purpose Automatic Memoization for Recursive Functions in C++11</a></p><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/u8nQ5JzLO54" height="1" width="1" alt=""/>Fri, 29 Mar 2013 00:00:00 UThttp://artem.sobolev.name/posts/2013-03-29-cpp-11-memoization.htmlArtemhttp://artem.sobolev.name/posts/2013-03-29-cpp-11-memoization.htmlResizing Policy of std::vector
http://feedproxy.google.com/~r/barmaley-exe-blog-feed/~3/qh9J_Kq_jEw/2013-02-10-std-vector-growth.html
Sometime ago when Facebook opensourced their <a title="Folly is an open-source C++ library developed and used at Facebook" href="https://github.com/facebook/folly">Folly library</a> I was reading their docs and found <a title="folly/FBvector.h documentation" href="https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md">something interesting</a>. In section “Memory Handling” they state
<blockquote>
In fact it can be mathematically proven that a growth factor of 2 is rigorously the worst possible because it never allows the vector to reuse any of its previously-allocated memory
</blockquote>
<p>I haven’t got it first time. Recently I recalled that article and decided to deal with it. So after reading and googling for a while I finally understood the idea, so I’d like to say a few words about it.</p>
<!--more-->
<p>The problem is as follows: when a vector (or a similar structure with autoresize) gets filled, it should resize. It’s well known that it should grow exponentially in order to preserve constant amortized complexity of insertions, but what growth factor to choose? At first glance, 2 seems to be ok — it’s not so big and 2 is common for computer science :-). But it turns out that 2 is not so good. Let’s take a closer look by example:</p>
<p><a href="/files/vector-resize-scheme.png"><img src="/files/vector-resize-scheme.png" alt="Vector resize scheme" width="495" height="350" class="size-full" /></a></p>
<p>Suppose we’ve a vector of initial size <span class="math inline">\(C\)</span>. When it gets filled, we increase its size twice. We allocate memory for a vector of size <span class="math inline">\(2C\)</span> right after our original vector. So now we have vector of size <span class="math inline">\(2C\)</span> and <span class="math inline">\(C\)</span> bytes before it, where it was when it was small. Then expand it again and again and agian and so on. After <span class="math inline">\(n\)</span> expansions we’ll get a vector of size <span class="math inline">\(2^n C\)</span> preceded by <span class="math inline">\(C + 2C + 2^2 C + \dots 2^{n-1} C\)</span> bytes that were occupied by this vector before.</p>
<p>So what’s the problem? The problem is that after every increasing your vector is too big to fit previously allocated memory. How much is it bigger? Well, as we know <span class="math inline">\(2^n - 1 = 1 + 2 + 4 + \dots + 2^{n-1}\)</span>, thus <span class="math inline">\(2^n C - C - 2C - 2^2 C - \dots - 2^{n-1}C = C\)</span>. Therefore you have permanent lack of <span class="math inline">\(C\)</span> bytes to fit your vector in previously allocated space.</p>
<p>Okay, let’s now solve this problem. First, let’s formalize it.</p>
<p>Every time we increase vector of size <span class="math inline">\(C\)</span> with growing factor <span class="math inline">\(k\)</span> we do these steps:</p>
<ol>
<li>
Allocate <span class="math inline">\(k C\)</span> bytes
</li>
<li>
Create a new vector here and copy current vector’s content to the new one
</li>
<li>
Remove the current vector, set the new one as the current
</li>
</ol>
<p>So as you can see, formula for 2 is sort of upperbound: you can not use all of previously allocated <span class="math inline">\(n-1\)</span> chunks when allocating nth, since you need to copy values from (n-1)th (though you can copy them in some temporary buffer, but it requires extra memory) chunk. So when we allocate nth chunk, we need it to be less than total free space from <span class="math inline">\(n-2\)</span> allocations: <span class="math display">\[ k^n C \le k^{n-2}C + k^{n-3}C + \dots + kC + C \]</span></p>
<p>As you can see, we can get rid of C since it’s definetly positive. <span class="math display">\[ k^n \le k^{n-2} + k^{n-3} + \dots + k + 1 \]</span></p>
<p>Okay, time to solve some equations! We see something like a sum of a geometric progression, and we can use a formula for it. But I don’t retain it in my head, so I’ll use a little trick here. Let’s multiply both sides by <span class="math inline">\(k-1\)</span>. We assume that <span class="math inline">\(k > 1\)</span> (it’s very strange to use values greater than 1 as <em>growth</em> factor) <span class="math display">\[ (k-1) k^n \le (k-1) (k^{n-2} + k^{n-3} + \dots + k + 1) \]</span></p>
<p>Now we can notice that in the right side we have an expansion of <span class="math inline">\(k^{n-1}-1\)</span> (well, maybe to remember this observation is harder than remembering a formula for sum of a geometric progression…)</p>
<p><span class="math display">\[ (k-1) k^n \le k^{n-1}-1 \]</span> <span class="math display">\[ k^{n+1} - k^n \le k^{n-1}-1 \]</span> <span class="math display">\[ k^{n+1} \le k^n + k^{n-1} - 1 \]</span></p>
<p>Oh, this obstructing 1… It would be so nice if we could throw it away! Wait, but we can! If we add 1 to the right side, we will merely increase its value, so it will still suit for an upper bound (or an approximation since 1 is a constant and is very small compared to <span class="math inline">\(k^n\)</span>).</p>
<p><span class="math display">\[ k^{n+1} \le k^n + k^{n-1} - 1 < k^n + k^{n-1} \]</span> <span class="math display">\[ k^2 < k + 1 \]</span> <span class="math display">\[ k^2 - k - 1 < 0 \]</span></p>
<p>Solving this simple quadratic equation, we get <span class="math display">\[ k < \frac{1+\sqrt{5}}{2} \approx 1.61 \]</span></p>
<p>So that’s why growth factor in many dynamic arrays is 1.5: it is pretty big to not cause reallocations too frequently and is small enough to not use memory too extensively.</p><img src="http://feeds.feedburner.com/~r/barmaley-exe-blog-feed/~4/qh9J_Kq_jEw" height="1" width="1" alt=""/>Sun, 10 Feb 2013 00:00:00 UThttp://artem.sobolev.name/posts/2013-02-10-std-vector-growth.htmlArtemhttp://artem.sobolev.name/posts/2013-02-10-std-vector-growth.html