B.log (http://artem.sobolev.name/): Personal blog of Artem Sobolev, a Machine Learning professional with particular interest in Probabilistic Modeling, Bayesian Inference, Deep Learning, and beyond. Sun, 02 May 2021 00:00:00 +0300Reciprocal Convexity to reverse the Jensen Inequality/posts/2021-05-02-reciprocal-convexity-to-reverse-the-jensen-inequality.html<p><a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen's inequality</a> is a powerful tool often used in mathematical derivations and analyses. It states that for a convex function $f(x)$ and an arbitrary random variable $X$ we have the following <em>upper</em> bound:
$$
f\left(\E X\right)
\le
\E f\left(X\right)
$$</p>
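<p>The inequality above can be sanity-checked numerically; a minimal sketch with the illustrative choices $f(x) = e^x$ and $X \sim \mathcal{N}(0, 1)$ (not taken from the post):</p>

```python
# Numeric sanity check of Jensen's inequality f(E[X]) <= E[f(X)]
# for the convex f(x) = exp(x) and standard normal X (illustrative choices).
import math
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

lhs = math.exp(sum(xs) / len(xs))       # f(E[X]), close to exp(0) = 1
rhs = sum(map(math.exp, xs)) / len(xs)  # E[f(X)], close to exp(1/2) ~ 1.65
print(lhs, rhs)  # lhs <= rhs, as the inequality predicts
```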
<p>However, oftentimes we want the …</p>Artem SobolevSun, 02 May 2021 00:00:00 +0300tag:None,2021-05-02:/posts/2021-05-02-reciprocal-convexity-to-reverse-the-jensen-inequality.htmlpostsmathNot every REINFORCE should be called Reinforcement Learning/posts/2020-11-29-reinforce-is-not-rl.html<p>Deep RL is hot these days. It's one of the most popular topics among submissions to NeurIPS / ICLR / ICML and other ML conferences. And while the definition of RL is pretty general, in this note I'd argue that the famous REINFORCE algorithm <em>alone</em> is not enough to label your …</p>Artem SobolevSun, 29 Nov 2020 00:00:00 +0300tag:None,2020-11-29:/posts/2020-11-29-reinforce-is-not-rl.htmlpostsmachine learningRLREINFORCEA simpler derivation of f-GANs/posts/2019-12-01-a-simpler-derivation-of-f-gans.html<p>I have been looking at the $f$-GAN derivation while doing some of my research, and found an easier way to derive its lower bound without invoking convex conjugate functions.</p>
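<p>As background for the REINFORCE note above: the estimator in question is the score-function identity $\nabla_\theta \E_{p_\theta} f(x) = \E_{p_\theta}\left[f(x) \nabla_\theta \log p_\theta(x)\right]$. A minimal sketch with an illustrative toy model (a Gaussian mean parameter and $f(x) = x^2$, not taken from the note):</p>

```python
# Sketch of the REINFORCE (score-function) gradient estimator:
# grad_theta E_{x ~ p_theta}[f(x)] = E_{x ~ p_theta}[f(x) * grad_theta log p_theta(x)].
# The model (x ~ N(theta, 1), f(x) = x^2) is illustrative, not from the post.
import random

random.seed(0)
theta, n = 1.5, 200_000
f = lambda x: x * x  # E[f(x)] = theta^2 + 1, so the exact gradient is 2 * theta

total = 0.0
for _ in range(n):
    x = random.gauss(theta, 1.0)
    score = x - theta        # d/dtheta log N(x; theta, 1) = (x - theta) / 1
    total += f(x) * score

est = total / n
print(est)  # close to 2 * theta = 3; the estimator's high variance is its well-known weakness
```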
<!--more-->
<p><a href="https://arxiv.org/abs/1606.00709">$f$-GANs</a> are a generalization of standard GANs to arbitrary $f$-divergences. Given a convex function $f$, <a href="https://en.wikipedia.org/wiki/F-divergence">$f$-divergence</a>, in turn, can …</p>Artem SobolevSun, 01 Dec 2019 00:00:00 +0300tag:None,2019-12-01:/posts/2019-12-01-a-simpler-derivation-of-f-gans.htmlpostsmachine learningganThoughts on Mutual Information: Alternative Dependency Measures/posts/2019-09-15-thoughts-on-mutual-information-alternative-dependency-measures.html<p>This post finishes the discussion started in <a href="/posts/2019-08-10-thoughts-on-mutual-information-more-estimators.html">Thoughts on Mutual Information: More Estimators</a> with a consideration of alternatives to the Mutual Information.</p>
<!--more-->
<h2>Mutual Information</h2>
<p>Let's step out a bit and take a critical look at the MI. One of its equivalent definitions says that it's a KL-divergence between the …</p>Artem SobolevSun, 15 Sep 2019 00:00:00 +0300tag:None,2019-09-15:/posts/2019-09-15-thoughts-on-mutual-information-alternative-dependency-measures.htmlpostsmachine learningmutual informationThoughts on Mutual Information: Formal Limitations/posts/2019-08-14-thoughts-on-mutual-information-formal-limitations.html<p>This post continues the discussion started in <a href="/posts/2019-08-10-thoughts-on-mutual-information-more-estimators.html">Thoughts on Mutual Information: More Estimators</a>. This time we'll focus on the drawbacks and limitations of these bounds.</p>
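<p>For concreteness, the MI equals the KL divergence between the joint distribution and the product of its marginals, $I(X; Y) = \mathrm{KL}(p(x, y) \,\|\, p(x) p(y))$, which can be checked directly on a small discrete example (the joint below is illustrative):</p>

```python
# MI as the KL divergence between the joint p(x, y) and the product of
# marginals p(x)p(y), computed on an illustrative 2x2 discrete joint.
import math

p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # p(x, y)
px = {x: sum(v for (a, _), v in p.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in p.items() if b == y) for y in (0, 1)}

mi = sum(v * math.log(v / (px[x] * py[y])) for (x, y), v in p.items())
print(mi)  # positive, since X and Y are dependent; it would be 0 for an independent joint
```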
<!--more-->
<p>Let's start with the elephant in the room: a year ago an interesting preprint was uploaded to arXiv: <a href="https://arxiv.org/abs/1811.04251">Formal Limitations on the Measurement of …</a></p>Artem SobolevWed, 14 Aug 2019 00:00:00 +0300tag:None,2019-08-14:/posts/2019-08-14-thoughts-on-mutual-information-formal-limitations.htmlpostsmachine learningmutual informationThoughts on Mutual Information: More Estimators/posts/2019-08-10-thoughts-on-mutual-information-more-estimators.html<p>In this post I'd like to show how Self-Normalized Importance Sampling (<a href="/posts/2019-05-10-importance-weighted-hierarchical-variational-inference.html">IWHVI</a> and IWAE) and Annealed Importance Sampling can be used to give (sometimes sandwich) bounds on the MI in many different cases.</p>
<!--more-->
<p><a href="https://en.wikipedia.org/wiki/Mutual_information">Mutual Information</a> (MI) is an important concept from Information Theory that captures the idea of information …</p>Artem SobolevSat, 10 Aug 2019 00:00:00 +0300tag:None,2019-08-10:/posts/2019-08-10-thoughts-on-mutual-information-more-estimators.htmlpostsmachine learningmutual informationImportance Weighted Hierarchical Variational Inference/posts/2019-05-10-importance-weighted-hierarchical-variational-inference.html<p>This post finishes the discussion on <a href="/posts/2019-04-26-neural-samplers-and-hierarchical-variational-inference.html">Neural Samplers for Variational Inference</a> by introducing some recent results (including mine).</p>
<p>Also, there's <a href="https://youtu.be/pdSu7XfGhHw">a talk recording</a> of me presenting this post's content, so if you prefer videos to text, check it out.</p>
<!--more-->
<h2>Quick Recap</h2>
<p>It all started with an aspiration for a …</p>Artem SobolevFri, 10 May 2019 00:00:00 +0300tag:None,2019-05-10:/posts/2019-05-10-importance-weighted-hierarchical-variational-inference.htmlpostsmachine learningvariational inferenceneural samplersNeural Samplers and Hierarchical Variational Inference/posts/2019-04-26-neural-samplers-and-hierarchical-variational-inference.html<p>This post sets the background for the upcoming post on my work on more efficient use of neural samplers for Variational Inference.</p>
<!--more-->
<h2>Variational Inference</h2>
<p>At the core of <em>Bayesian Inference</em> lies the well-known Bayes' theorem, relating our prior beliefs $p(z)$ with those obtained after observing some data $x$:</p>
<p>$$
p(z …</p>Artem SobolevFri, 26 Apr 2019 00:00:00 +0300tag:None,2019-04-26:/posts/2019-04-26-neural-samplers-and-hierarchical-variational-inference.htmlpostsmachine learningvariational inferenceneural samplersStochastic Computation Graphs: Fixing REINFORCE/posts/2017-11-12-stochastic-computation-graphs-fixing-reinforce.html<p>This is the final post of the <a href="/tags/stochastic-computation-graphs-series.html">stochastic computation graphs series</a>. Last time we discussed models with <a href="/posts/2017-10-28-stochastic-computation-graphs-discrete-relaxations.html">discrete relaxations of stochastic nodes</a>, which allowed us to employ the power of reparametrization.</p>
<p>These methods, however, possess one flaw: they consider different models, thus introducing inherent bias – your test-time discrete model …</p>Artem SobolevSun, 12 Nov 2017 00:00:00 +0300tag:None,2017-11-12:/posts/2017-11-12-stochastic-computation-graphs-fixing-reinforce.htmlpostsmachine learningdeep learningstochastic computation graphs seriesREINFORCEStochastic Computation Graphs: Discrete Relaxations/posts/2017-10-28-stochastic-computation-graphs-discrete-relaxations.html<p>This is the second post of the <a href="/tags/stochastic-computation-graphs-series.html">stochastic computation graphs series</a>. Last time we discussed models with <a href="/posts/2017-09-10-stochastic-computation-graphs-continuous-case.html">continuous stochastic nodes</a>, for which there are powerful reparametrization techniques.</p>
<p>Unfortunately, these methods don't work for discrete random variables. Moreover, it looks like there's no way to backpropagate through discrete stochastic nodes, as …</p>Artem SobolevSat, 28 Oct 2017 00:00:00 +0300tag:None,2017-10-28:/posts/2017-10-28-stochastic-computation-graphs-discrete-relaxations.htmlpostsmachine learningdeep learningvariational inferencestochastic computation graphs seriesStochastic Computation Graphs: Continuous Case/posts/2017-09-10-stochastic-computation-graphs-continuous-case.html<p>Last year I covered <a href="/tags/modern-variational-inference-series.html">some modern Variational Inference theory</a>. These methods are often used in conjunction with Deep Neural Networks to form deep generative models (VAE, for example) or to enrich deterministic models with stochastic control, which leads to better exploration. Or you might be interested in amortized inference.</p>
<p>All …</p>Artem SobolevSun, 10 Sep 2017 00:00:00 +0300tag:None,2017-09-10:/posts/2017-09-10-stochastic-computation-graphs-continuous-case.htmlpostsmachine learningdeep learningstochastic computation graphs seriesREINFORCEICML 2017 Summaries/posts/2017-08-14-icml-2017.html<p>Just like with <a href="/posts/2016-12-31-nips-2016-summaries.html">NIPS last year</a>, here's a list of ICML'17 summaries (updated as I stumble upon new ones)</p>
<!--more-->
<ul>
<li><a href="https://olgalitech.wordpress.com/tag/icml2017/">Random ML&Datascience musing</a> by <a href="https://twitter.com/OlgaLiakhovich">Olga Liakhovich</a><ul>
<li><a href="https://olgalitech.wordpress.com/2017/08/07/icml-and-my-notes-on-day-1/">ICML and my notes on day 1</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/07/brain-endurance-or-day-2-at-icml-2017/">Brain endurance or Day 2 at ICML 2017</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/11/day-3-at-icml-2017-musical-rnns/">Day 3 at ICML 2017 — musical RNNs</a></li>
<li><a href="https://olgalitech.wordpress.com/2017/08/11/day-4-at-icml-2017-more-adversarial-nns/">Day 4 …</a></li></ul></li></ul>Artem SobolevMon, 14 Aug 2017 00:00:00 +0300tag:None,2017-08-14:/posts/2017-08-14-icml-2017.htmlpostsmachine learningICMLconferenceOn No Free Lunch Theorem and some other impossibility results/posts/2017-07-23-no-free-lunch-theorem.html<p>The more I talk to people online, the more I hear about the famous No Free Lunch Theorem (NFL theorem). Unfortunately, quite often people don't really understand what the theorem is about, and what its implications are. In this post I'd like to share my view on the NFL theorem …</p>Artem SobolevSun, 23 Jul 2017 00:00:00 +0300tag:None,2017-07-23:/posts/2017-07-23-no-free-lunch-theorem.htmlpostssemi-mathematicalmachine learningartificial intelligenceMatrix and Vector Calculus via Differentials/posts/2017-01-29-matrix-and-vector-calculus-via-differentials.html<p>Many tasks of machine learning can be posed as optimization problems. One comes up with a parametric model, defines a loss function, and then minimizes it in order to learn optimal parameters. One very powerful tool of optimization theory is the use of smooth (differentiable) functions: those that can be …</p>Artem SobolevSun, 29 Jan 2017 00:00:00 +0300tag:None,2017-01-29:/posts/2017-01-29-matrix-and-vector-calculus-via-differentials.htmlpostsmathNIPS 2016 Summaries/posts/2016-12-31-nips-2016-summaries.html<p>I did not attend this year's NIPS, but I've gathered many summaries published online by those who did attend the conference.</p>
<!--more-->
<ul>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5hdofr/d_nips_2016_symposium_on_people_and_machines/">NIPS 2016 Symposium on People and machines: Public views on machine learning, and what this means for machine learning researchers. (Notes and panel discussion)</a> by /u/gcr</li>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5hzvfi/d_nips_2016_summary_wrap_up_and_links_to_slides/">NIPS 2016 …</a></li></ul>Artem SobolevSat, 31 Dec 2016 00:00:00 +0300tag:None,2016-12-31:/posts/2016-12-31-nips-2016-summaries.htmlpostsmachine learningNIPSconferenceNeural Variational Inference: Importance Weighted Autoencoders/posts/2016-07-14-neural-variational-importance-weighted-autoencoders.html<p>Previously we covered <a href="/posts/2016-07-11-neural-variational-inference-variational-autoencoders-and-Helmholtz-machines.html">Variational Autoencoders</a> (VAE) — popular inference tool based on neural networks. In this post we'll consider, a followup work from Torronto by Y. Burda, R. Grosse and R. Salakhutdinov, <a href="https://arxiv.org/abs/1509.00519">Importance Weighted Autoencoders</a> (IWAE). The crucial contribution of this work is introduction of a new lower-bound on the marginal …</p>Artem SobolevThu, 14 Jul 2016 00:00:00 +0300tag:None,2016-07-14:/posts/2016-07-14-neural-variational-importance-weighted-autoencoders.htmlpostsmachine learningdeep learningvariational inferencemodern variational inference seriesNeural Variational Inference: Variational Autoencoders and Helmholtz machines/posts/2016-07-11-neural-variational-inference-variational-autoencoders-and-helmholtz-machines.html<p>So far we had a little of "neural" in our VI methods. Now it's time to fix it, as we're going to consider <a href="https://arxiv.org/abs/1312.6114">Variational Autoencoders</a> (VAE), a paper by D. Kingma and M. Welling, which made a lot of buzz in ML community. 
It has two main contributions: a new …</p>Artem SobolevMon, 11 Jul 2016 00:00:00 +0300tag:None,2016-07-11:/posts/2016-07-11-neural-variational-inference-variational-autoencoders-and-helmholtz-machines.htmlpostsmachine learningdeep learningvariational inferencemodern variational inference seriesNeural Variational Inference: Blackbox Mode/posts/2016-07-05-neural-variational-inference-blackbox.html<p>In the <a href="/posts/2016-07-04-neural-variational-inference-stochastic-variational-inference.html">previous post</a> we covered Stochastic VI: an efficient and scalable variational inference method for exponential family models. However, there are many more distributions than those belonging to the exponential family. Inference in these cases requires a significant amount of model analysis. In this post we consider <a href="https://arxiv.org/abs/1401.0118">Black Box Variational Inference …</a></p>Artem SobolevTue, 05 Jul 2016 00:00:00 +0300tag:None,2016-07-05:/posts/2016-07-05-neural-variational-inference-blackbox.htmlpostsmachine learningdeep learningvariational inferencemodern variational inference seriesNeural Variational Inference: Scaling Up/posts/2016-07-04-neural-variational-inference-stochastic-variational-inference.html<p>In the <a href="/posts/2016-07-01-neural-variational-inference-classical-theory.html">previous post</a> I covered the well-established classical theory developed in the early 2000s. Since then technology has made huge progress: now we have much more data, and a great need to process it, and to process it fast.
In the big data era we have huge datasets, and cannot afford too …</p>Artem SobolevMon, 04 Jul 2016 00:00:00 +0300tag:None,2016-07-04:/posts/2016-07-04-neural-variational-inference-stochastic-variational-inference.htmlpostsmachine learningdeep learningvariational inferencemodern variational inference seriesNeural Variational Inference: Classical Theory/posts/2016-07-01-neural-variational-inference-classical-theory.html<p>As a member of the <a href="http://bayesgroup.ru/">Bayesian methods research group</a> I'm heavily interested in the Bayesian approach to machine learning. One of the strengths of this approach is the ability to work with hidden (unobserved) variables which are interpretable. This power, however, comes at the cost of generally intractable exact inference, which limits the …</p>Artem SobolevFri, 01 Jul 2016 00:00:00 +0300tag:None,2016-07-01:/posts/2016-07-01-neural-variational-inference-classical-theory.htmlpostsmachine learningdeep learningvariational inferencemodern variational inference seriesExploiting Multiple Machines for Embarrassingly Parallel Applications/posts/2014-08-01-gnu-parallel.html<p>During work on my machine learning project I needed to perform some quite computation-heavy calculations several times — each time with slightly different inputs. These calculations were CPU and memory bound, so spawning them all at once would only slow down the overall running time because of an increased amount …</p>Artem SobolevFri, 01 Aug 2014 00:00:00 +0400tag:None,2014-08-01:/posts/2014-08-01-gnu-parallel.htmlpostsgnu parallellinuxOn Sorting Complexity/posts/2014-05-01-on-sorting-complexity.html<p>It's well known that the lower bound for the sorting problem (in the general case) is
$\Omega(n \log n)$. The proof I was taught is somewhat involved and is
based on paths in "decision" trees. Recently I've discovered an
information-theoretic approach (or reformulation) to that proof.</p>
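<p>The information-theoretic reformulation, presumably, goes along these lines: a comparison sort must distinguish all $n!$ orderings, and each comparison yields at most one bit of information, so at least $\log_2 n! = \Theta(n \log n)$ comparisons are needed. A quick numeric illustration:</p>

```python
# log2(n!) is the entropy of a uniformly random permutation of n elements;
# by Stirling's formula it grows like n * log2(n), matching Omega(n log n).
import math

for n in (10, 100, 1000):
    bits = math.log2(math.factorial(n))  # minimum bits any comparison sort must extract
    print(n, round(bits), round(n * math.log2(n)))  # log2(n!) tracks n * log2(n)
```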
<!--more-->
<p>First, let's state the problem: given …</p>Artem SobolevThu, 01 May 2014 00:00:00 +0400tag:None,2014-05-01:/posts/2014-05-01-on-sorting-complexity.htmlpostsalgortihmscomputer scienceNamespaced Methods in JavaScript/posts/2013-05-23-js-namespaced-methods.html<p>Once upon a time I was asked (well, actually <a href="http://habrahabr.ru/qa/7130/" title="Javascript: String.prototype.namespace.method и this / Q&A / Хабрахабр">the question</a> wasn't just for me, but for the whole habrahabr community) whether it's possible to implement namespaced methods in JavaScript for built-in types like:</p>
<div class="highlight"><pre><span></span><code><span class="mf">5.</span><span class="p">.</span><span class="nx">rubish</span><span class="p">.</span><span class="nx">times</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1">// this function will be called 5 times</span>
<span class="w"> </span><span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">"Hi there!"</span><span class="p">);</span>
<span class="p">});</span>
<span class="s2">"some string …</span></code></pre></div>Artem SobolevThu, 23 May 2013 00:00:00 +0400tag:None,2013-05-23:/posts/2013-05-23-js-namespaced-methods.htmlpostsjavascriptecmascript 5Crazy Expression Parsing/posts/2013-03-30-crazy-expression-parsing.html<p>Suppose we have an expression like <code>(5+5 * (x^x-5 | y && 3))</code> and we'd like to get some computer-understandable representation of that expression, like:</p>
<p><code>ADD Token[5] (MUL Token[5] (AND (BIT_OR (XOR Token[x] (SUB Token[x] Token[5])) Token[y]) Token[3])</code></p>
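<p>This is not the parser from the post — just an illustration that such nested-token representations are what real parsers produce, using Python's stdlib <code>ast</code> module on a Python-compatible variant of the expression (Python has no <code>&amp;&amp;</code> operator, so <code>&amp;</code> stands in for it here):</p>

```python
# Parse a Python-compatible variant of the expression and dump its tree:
# the nested BinOp nodes mirror the ADD/MUL/XOR/BIT_OR structure above.
import ast

tree = ast.parse("5 + 5 * (x ^ x - 5 | y & 3)", mode="eval")
print(ast.dump(tree.body))  # a BinOp(Add) whose right child is the MUL subtree
```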
<p>In case you don't know …</p>Artem SobolevSat, 30 Mar 2013 00:00:00 +0400tag:None,2013-03-30:/posts/2013-03-30-crazy-expression-parsing.htmlpostspythonmadnessMemoization Using C++11/posts/2013-03-29-cpp-11-memoization.html<p>Recently I've read an article, <a href="http://john-ahlgren.blogspot.ru/2013/03/efficient-memoization-using-partial.html" title="John Ahlgren: Efficient Memoization using Partial Function Application">Efficient Memoization using Partial Function Application</a>. The author explains function memoization using partial application. When I was reading the article, I thought "Hmmm, can I come up with a more general solution?" And as suggested in the comments, one can use variadic templates to achieve it. So …</p>Artem SobolevFri, 29 Mar 2013 00:00:00 +0400tag:None,2013-03-29:/posts/2013-03-29-cpp-11-memoization.htmlpostsC++C++11memoizationoptimizationResizing Policy of std::vector/posts/2013-02-10-std-vector-growth.html<p>Some time ago, when Facebook open-sourced their <a title="Folly is an open-source C++ library developed and used at Facebook" href="https://github.com/facebook/folly">Folly library</a>, I was reading their docs and found <a title="folly/FBvector.h documentation" href="https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md">something interesting</a>. In the "Memory Handling" section they state <blockquote>In fact it can be mathematically proven that a growth factor of 2 is rigorously the worst possible because it never allows the vector to reuse any …</blockquote></p>Artem SobolevSun, 10 Feb 2013 00:00:00 +0400tag:None,2013-02-10:/posts/2013-02-10-std-vector-growth.htmlpostsC++math
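<p>The claim in that quote can be checked in a toy allocator model (a deliberate simplification assuming all previously freed blocks are adjacent and can be coalesced): a reallocation to capacity $c \cdot k$ can reuse freed memory only once the freed total reaches it, which never happens for growth factor $k = 2$ but does for $k = 1.5$.</p>

```python
# Toy model of vector growth: track capacities, and check whether the memory
# freed by earlier reallocations ever suffices for the next, larger request.
def can_ever_reuse(factor, steps=64):
    caps = [1.0]
    for _ in range(steps):
        freed = sum(caps[:-1])    # blocks released by earlier reallocations
        new = caps[-1] * factor   # capacity requested by the next growth
        if freed >= new:
            return True
        caps.append(new)
    return False

print(can_ever_reuse(2.0))  # False: 1 + 2 + ... + 2^(k-1) = 2^k - 1 < 2^k, always
print(can_ever_reuse(1.5))  # True: a factor below the golden ratio allows reuse
```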