<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://sirupsen.com</id>
    <title>Simon Eskildsen</title>
    <updated>2026-03-12T00:00:00.000Z</updated>
    <generator>Feed</generator>
    <author>
        <name>Simon Eskildsen</name>
        <email>simon@sirupsen.com</email>
        <uri>https://twitter.com/sirupsen</uri>
    </author>
    <link rel="alternate" href="https://sirupsen.com"/>
    <link rel="self" href="https://sirupsen.com/atom.xml"/>
    <subtitle>Recent content from Simon Eskildsen</subtitle>
    <logo>https://sirupsen.com/favicon.png</logo>
    <icon>https://sirupsen.com/favicon.png</icon>
    <rights>Copyright Simon Eskildsen</rights>
    <entry>
        <title type="html"><![CDATA[Podcast with Geek Narrator on Object Storage Databases]]></title>
        <id>https://sirupsen.com/geeknarrator-object-storage-podcast</id>
        <link href="https://sirupsen.com/geeknarrator-object-storage-podcast"/>
        <updated>2024-11-16T00:00:00.000Z</updated>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[turbopuffer: fast search on object storage]]></title>
        <id>https://sirupsen.com/turbopuffer</id>
        <link href="https://sirupsen.com/turbopuffer"/>
        <updated>2024-07-08T00:00:00.000Z</updated>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 21: Index Merges vs Composite Indexes in Postgres and MySQL]]></title>
        <id>https://sirupsen.com/index-merges</id>
        <link href="https://sirupsen.com/index-merges"/>
        <updated>2022-11-26T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[While working with Readwise on optimizing their database for the impending
launch of their Reader product, I found myself
asking the question: How much faster is a composite index compared to letting
the database do an index merge of multiple indexes? Consider this query:
SELECT count(*)…]]></summary>
        <content type="html"><![CDATA[<p>While working with Readwise on optimizing their database for the impending
launch of their <a href="https://readwise.io/read">Reader product</a>, I found myself
asking the question: How much faster is a composite index compared to letting
the database do an index merge of multiple indexes? Consider this query:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token function">count</span><span class="token punctuation">(</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token comment">/* matches ~100 rows out of 10M */</span>
<span class="token keyword">FROM</span> <span class="token keyword">table</span>
<span class="token keyword">WHERE</span> int1000 <span class="token operator">=</span> <span class="token number">1</span> <span class="token operator">AND</span> int100 <span class="token operator">=</span> <span class="token number">1</span>
<span class="token comment">/* int100 rows are 0..99 and int1000 0..999 */</span>
</code></pre>
<details><summary><a>View Table Definition</a></summary><pre class="language-sql"><code class="language-sql"><span class="token keyword">create</span> <span class="token keyword">table</span> test_table <span class="token punctuation">(</span>
  id <span class="token keyword">bigint</span> <span class="token keyword">primary</span> <span class="token keyword">key</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span>

  text1 <span class="token keyword">text</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span> <span class="token comment">/* 1 KiB of random data */</span>
  text2 <span class="token keyword">text</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span> <span class="token comment">/* 255 bytes of random data */</span>

  <span class="token comment">/* cardinality columns */</span>
  int1000 <span class="token keyword">bigint</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span> <span class="token comment">/* ranges 0..999, cardinality: 1000 */</span>
  int100 <span class="token keyword">bigint</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span> <span class="token comment">/* 0..99, card: 100 */</span>
  int10 <span class="token keyword">bigint</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span> <span class="token comment">/* 0..9, card: 10 */</span>
<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token comment">/* no indexes yet, we create those in the sections below */</span>
</code></pre></details>
<p>We can create a composite index on <code>(int1000, int100)</code>, or we could have two
individual indexes on <code>(int1000)</code> and <code>(int100)</code>, relying on the database to
leverage both indexes.</p>
<p>Having a composite index is faster, but <em>how much</em> faster than the two
individual indexes? Let’s do the napkin math, and then test it in PostgreSQL and
MySQL.</p>
<h2 id="napkin-math">Napkin Math</h2>
<p>We’ll start with the napkin math, and then verify it against Postgres and MySQL.</p>
<h3 id="composite-index-1ms">Composite Index: ~1ms</h3>
<p>The ideal index for this <code>count(*)</code> is:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">CREATE</span> <span class="token keyword">INDEX</span> <span class="token keyword">ON</span> <span class="token keyword">table</span> <span class="token punctuation">(</span>int1000<span class="token punctuation">,</span> int100<span class="token punctuation">)</span>
</code></pre>
<p>It allows the entire count to be performed on this one index.</p>
<p><code>WHERE int1000 = 1 AND int100 = 1</code> matches ~100 records of the 10M total for the
table. <sup><a href="#user-content-fn-1" id="user-content-fnref-1" data-footnote-ref="true" aria-describedby="footnote-label">1</a></sup> The database would do a quick search in the index tree to the leaf in the
index where both columns are <code>1</code>, and then scan forward until the condition no
longer holds.</p>
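<p>A toy sketch of that lookup, under the assumption that the composite index behaves like a sorted list of key tuples (purely illustrative, not real database internals):</p>

```python
import bisect
import random

# Toy model (not database internals): a composite index on
# (int1000, int100) behaves like a sorted list of key tuples.
random.seed(42)
rows = [(random.randrange(1000), random.randrange(100)) for _ in range(1_000_000)]
index = sorted(rows)

# Seek to the first leaf entry where both columns equal 1, then scan
# forward until the condition no longer holds -- the whole count(*)
# is answered from the index alone.
start = bisect.bisect_left(index, (1, 1))
count = 0
for key in index[start:]:
    if key != (1, 1):
        break
    count += 1

print(count)  # ~10 expected: 1,000,000 / (1000 * 100)
```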
<figure><img src="/images/composite-index.png" alt="Illustration of a composite index tree with the leaf node storing the (int1000, int100) tuple" title="Illustration of a composite index tree with the leaf node storing the (int1000, int100) tuple" width="2000" height="1317" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Illustration of a composite index tree with the leaf node storing the (int1000, int100) tuple</figcaption></figure>
<p>For these 64-bit index entries we’d expect to have to scan only the ~100 entries
that match, which is a negligible ~2 KiB. According to the <a href="https://github.com/sirupsen/napkin-math">napkin reference</a>, we can read
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∼</mo><mn>1</mn><mtext> MiB</mtext><mi mathvariant="normal">/</mi><mn>100</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">\sim 1\text{ MiB}/100\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em"></span><span class="mrel">∼</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord">1</span><span class="mord text"><span class="mord"> MiB</span></span><span class="mord">/100</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span> from memory, so this will take absolutely no time.
With the query overhead, navigating the index tree, and everything else, it
theoretically shouldn’t take a database more than a couple <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>100</mn><mtext>-</mtext><mn>500</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">100\text{-}500\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">100</span><span class="mord text"><span class="mord">-</span></span><span class="mord">500</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span>
on the composite index to satisfy this query. <sup><a href="#user-content-fn-2" id="user-content-fnref-2" data-footnote-ref="true" aria-describedby="footnote-label">2</a></sup></p>
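<p>The arithmetic behind that estimate, spelled out (the 16-byte entry size and the fixed overhead are rough assumptions; the memory-read throughput is from the napkin-math reference):</p>

```python
# Rough arithmetic behind the ~100-500us composite-index estimate.
# Assumed figures: ~1 MiB per 100us from memory (napkin-math reference);
# entry size and fixed overhead are guesses.
matching_rows = 100
entry_bytes = 16                            # two 64-bit columns; real entries are bigger
bytes_scanned = matching_rows * entry_bytes # ~1.6 KiB
scan_us = bytes_scanned / (1 << 20) * 100   # far below 1us: negligible
overhead_us = 100                           # B-tree descent, parsing, planning, ...
print(f"scan ~{scan_us:.2f}us, total ~{scan_us + overhead_us:.0f}us")
```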
<h3 id="index-merge-10-30ms">Index Merge: ~10-30ms</h3>
<p>But a database can also do an index merge of two separate indexes:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">CREATE</span> <span class="token keyword">INDEX</span> <span class="token keyword">ON</span> <span class="token keyword">table</span> <span class="token punctuation">(</span>int1000<span class="token punctuation">)</span>
<span class="token keyword">CREATE</span> <span class="token keyword">INDEX</span> <span class="token keyword">ON</span> <span class="token keyword">table</span> <span class="token punctuation">(</span>int100<span class="token punctuation">)</span>
</code></pre>
<p>But how does a database utilize two indexes? And how expensive might this merge be?</p>
<p>How indexes are intersected depends on the database! There are many ways of
finding the intersection of two unordered lists: hashing, sorting, sets,
KD-trees, bitmaps, …</p>
<p>MySQL does what it calls an <a href="https://dev.mysql.com/doc/refman/8.0/en/index-merge-optimization.html">index merge intersection</a>; I haven’t consulted
the source, but most likely it’s sorting. Postgres does index intersection by
<a href="https://www.postgresql.org/docs/current/indexes-bitmap-scans.html">generating a bitmap after scanning each index</a>, and then <code>AND</code>ing them
together.</p>
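<p>Postgres’s bitmap strategy can be sketched like this, with big integers standing in for the per-row bitmaps (a simplification; the real machinery has exact and lossy modes):</p>

```python
# Simplified sketch of bitmap intersection (not Postgres's actual exact/lossy
# bitmap machinery): each index scan sets one bit per matching row position,
# and the two bitmaps are combined with a bitwise AND.
rows = [(i % 1000, i % 100) for i in range(10_000)]  # toy (int1000, int100) values

def scan_to_bitmap(positions):
    bm = 0
    for pos in positions:
        bm |= 1 << pos
    return bm

bm_int1000 = scan_to_bitmap(p for p, (a, _) in enumerate(rows) if a == 1)
bm_int100 = scan_to_bitmap(p for p, (_, b) in enumerate(rows) if b == 1)
both = bm_int1000 & bm_int100        # rows satisfying both conditions
print(bin(both).count("1"))          # -> 10
```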
<p><code>int100 = 1</code> returns about <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>10</mn><mi>M</mi><mo>⋅</mo><mn>1</mn><mi mathvariant="normal">/</mi><mn>100</mn><mo>≈</mo><mn>100</mn><mo separator="true">,</mo><mn>000</mn></mrow><annotation encoding="application/x-tex">10M \cdot 1/100 \approx 100,000</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">10</span><span class="mord mathnormal" style="margin-right:0.10903em">M</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord">1/100</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">100</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">000</span></span></span></span> rows, which is
about ~1.5 MiB to scan. <code>int1000 = 1</code> matches only ~10,000 rows, so in total
we’re reading about <a href="https://github.com/sirupsen/napkin-math"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>200</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">200\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">200</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span></a> worth of memory from both indexes.</p>
<p>After we have the matches from the index, we need to intersect them. In this
case, for simplicity of the napkin math, let’s assume we sort the matches
from both indexes and then intersect from there.</p>
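<p>The merge step itself is a two-pointer walk over the two sorted lists of row IDs (a sketch of the general technique, not any engine’s actual code):</p>

```python
# Two-pointer intersection of two sorted row-ID lists -- the sort-then-merge
# approach assumed in the napkin math above.
def intersect_sorted(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

ids_int1000 = list(range(1, 10_000, 1000))  # toy row IDs matching int1000 = 1
ids_int100 = list(range(1, 10_000, 100))    # toy row IDs matching int100 = 1
print(len(intersect_sorted(ids_int1000, ids_int100)))  # -> 10 rows match both
```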
<p>We can sort <a href="https://github.com/sirupsen/napkin-math"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>1</mn><mtext> MiB</mtext></mrow><annotation encoding="application/x-tex">1\text{ MiB}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">1</span><span class="mord text"><span class="mord"> MiB</span></span></span></span></span> in <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>5</mn><mtext> ms</mtext></mrow><annotation encoding="application/x-tex">5\text{ ms}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">5</span><span class="mord text"><span class="mord"> ms</span></span></span></span></span></a>. So it would take us ~10ms
total to sort it, iterate through both sorted lists for a negligible <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∼</mo><mn>200</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">\sim 200\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em"></span><span class="mrel">∼</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">200</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span> of memory reading, write the intersection to memory for another <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∼</mo><mn>200</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">\sim 200\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em"></span><span class="mrel">∼</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">200</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span>, and then we’ve got the intersection, i.e. the rows that match both
conditions.</p>
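<p>Putting the merge estimate together with the napkin figures (read ~1 MiB/100μs, sort ~1 MiB/5ms; the 16-byte entry size is an assumption):</p>

```python
# Rough arithmetic behind the ~10ms index-merge estimate. Costs assumed from
# the napkin-math reference: read ~1 MiB per 100us, sort ~1 MiB per 5ms.
entry_bytes = 16
rows_int100 = 100_000                # 10M rows / cardinality 100
rows_int1000 = 10_000                # 10M rows / cardinality 1000
total_mib = (rows_int100 + rows_int1000) * entry_bytes / (1 << 20)  # ~1.7 MiB
read_ms = total_mib * 0.1            # ~0.2ms of memory reads
sort_ms = total_mib * 5              # ~8-9ms to sort both match lists
print(f"~{read_ms + sort_ms:.0f} ms before the final merge pass")
```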
<figure><img src="/images/intersection.png" alt="Illustration of intersecting the two indexes with
whatever internal identifier the database uses." title="Illustration of intersecting the two indexes with
whatever internal identifier the database uses." width="1848" height="475" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Illustration of intersecting the two indexes with
whatever internal identifier the database uses.</figcaption></figure>
<p>Thus our napkin math indicates that for our two separate indexes we’d expect the
query to take <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mtext> </mtext><mn>10</mn><mtext> ms</mtext></mrow><annotation encoding="application/x-tex">~10\text{ ms}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mspace nobreak"> </span><span class="mord">10</span><span class="mord text"><span class="mord"> ms</span></span></span></span></span>. The sorting is sensitive to the index size, which
is fairly approximate, so we apply a low multiplier and land at <code>~10-30ms</code>.</p>
<p>As we’ve seen, intersection bears a meaningful cost; on paper we expect it to
be roughly an order of magnitude slower than a composite index. However, 10ms is
still sensible for most situations, and sometimes it is nice not to maintain a
more specialized composite index for the query, for example when you are often
filtering on varying subsets of 10s of columns.</p>
<h2 id="reality">Reality</h2>
<p>Now that we’ve set our expectations from first principles about composite indexes
versus merging multiple indexes, let’s see how Postgres and MySQL fare in
real-life.</p>
<h3 id="composite-index-5ms-">Composite Index: 5ms ✅</h3>
<p>Both MySQL and Postgres perform index-only scans after we create the index:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">/* 10M rows total, int1000 = 1 matches ~10K, int100 matches ~100K */</span>
<span class="token keyword">CREATE</span> <span class="token keyword">INDEX</span> <span class="token keyword">ON</span> <span class="token keyword">table</span> <span class="token punctuation">(</span>int1000<span class="token punctuation">,</span> int100<span class="token punctuation">)</span>
<span class="token keyword">EXPLAIN</span> <span class="token keyword">ANALYZE</span> <span class="token keyword">SELECT</span> <span class="token function">count</span><span class="token punctuation">(</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token keyword">FROM</span> <span class="token keyword">table</span> <span class="token keyword">WHERE</span> int1000 <span class="token operator">=</span> <span class="token number">1</span> <span class="token operator">AND</span> int100 <span class="token operator">=</span> <span class="token number">1</span>
</code></pre>
<pre class="language-text"><code class="language-text">/* postgres, index is ~70 MiB */
Aggregate  (cost=6.53..6.54 rows=1 width=8) (actual time=0.919..0.919 rows=1 loops=1)
  -&gt;  Index Only Scan using compound_idx on test_table  (cost=0.43..6.29 rows=93 width=0) (actual time=0.130..0.909 rows=109 loops=1)
        Index Cond: ((int1000 = 1) AND (int100 = 1))
        Heap Fetches: 0
</code></pre>
<pre class="language-text"><code class="language-text">/* mysql, index is ~350 MiB */
-&gt; Aggregate: count(0)  (cost=18.45 rows=1) (actual time=0.181..0.181 rows=1 loops=1)
    -&gt; Covering index lookup on test_table using compound_idx (int1000=1, int100=1)  (cost=9.85 rows=86) (actual time=0.129..0.151 rows=86 loops=1)
</code></pre>
<p>They each take ~3-5ms when the index is cached. That is a bit slower than
the ~1ms we expected from the napkin math, but in our experience with napkin
math on databases, landing within an order of magnitude is acceptable. We
attribute the difference to the overhead of walking the index. <sup><a href="#user-content-fn-3" id="user-content-fnref-3" data-footnote-ref="true" aria-describedby="footnote-label">3</a></sup></p>
<h3 id="index-merge">Index Merge</h3>
<h4 id="mysql-30-40ms-">MySQL: 30-40ms ✅</h4>
<p>When we execute the query in MySQL it takes ~30-40ms, which tracks well with
the upper end of our napkin math. That means our first-principles understanding
likely lines up with reality!</p>
<p>Let’s confirm it’s doing what we expect by looking at the query plan:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">/* 10M rows total, int1000 = 1 matches ~10K, int100 matches ~100K */</span>
<span class="token keyword">EXPLAIN</span> <span class="token keyword">ANALYZE</span> <span class="token keyword">SELECT</span> <span class="token function">count</span><span class="token punctuation">(</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token keyword">FROM</span> <span class="token keyword">table</span> <span class="token keyword">WHERE</span> int1000 <span class="token operator">=</span> <span class="token number">1</span> <span class="token operator">AND</span> int100 <span class="token operator">=</span> <span class="token number">1</span>
</code></pre>
<pre class="language-text"><code class="language-text">/* mysql, each index is ~240 MiB */
-&gt; Aggregate: count(0)  (cost=510.64 rows=1) (actual time=31.908..31.909 rows=1 loops=1)
    -&gt; Filter: ((test_table.int100 = 1) and (test_table.int1000 = 1))  (cost=469.74 rows=409) (actual time=5.471..31.858 rows=86 loops=1)
        -&gt; Intersect rows sorted by row ID  (cost=469.74 rows=410) (actual time=5.464..31.825 rows=86 loops=1)
            -&gt; Index range scan on test_table using int1000 over (int1000 = 1)  (cost=37.05 rows=18508) (actual time=0.271..2.544 rows=9978 loops=1)
            -&gt; Index range scan on test_table using int100 over (int100 = 1)  (cost=391.79 rows=202002) (actual time=0.324..24.405 rows=99814 loops=1)
/* ~30 ms */
</code></pre>
<p>MySQL’s query plan tells us it’s doing <em>exactly</em> what we expected: getting the
matching entries from each index, intersecting them, and performing the count on
the intersection. Running <code>EXPLAIN</code> without <code>ANALYZE</code>, I confirmed that it’s
serving <em>everything</em> from the indexes and never seeking the full row.</p>
<h4 id="postgres-25-90ms-">Postgres: 25-90ms 🤔</h4>
<p>Postgres is also within an order of magnitude of our napkin math, but at the
higher end and with more variance, in general performing worse than MySQL. Is
its bitmap-based intersection just slower on this query? Or is it doing
something completely different than MySQL?</p>
<p>Let’s look at the query plan, using the same query we ran against MySQL:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">/* 10M rows total, int1000 = 1 matches ~10K, int100 matches ~100K */</span>
<span class="token keyword">EXPLAIN</span> <span class="token keyword">ANALYZE</span> <span class="token keyword">SELECT</span> <span class="token function">count</span><span class="token punctuation">(</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token keyword">FROM</span> <span class="token keyword">table</span> <span class="token keyword">WHERE</span> int1000 <span class="token operator">=</span> <span class="token number">1</span> <span class="token operator">AND</span> int100 <span class="token operator">=</span> <span class="token number">1</span>
</code></pre>
<pre class="language-text"><code class="language-text">/* postgres, each index is ~70 MiB */
Aggregate  (cost=1536.79..1536.80 rows=1 width=8) (actual time=29.675..29.677 rows=1 loops=1)
  -&gt;  Bitmap Heap Scan on test_table  (cost=1157.28..1536.55 rows=95 width=0) (actual time=27.567..29.663 rows=109 loops=1)
        Recheck Cond: ((int1000 = 1) AND (int100 = 1))
        Heap Blocks: exact=109
        -&gt;  BitmapAnd  (cost=1157.28..1157.28 rows=95 width=0) (actual time=27.209..27.210 rows=0 loops=1)
              -&gt;  Bitmap Index Scan on int1000_idx  (cost=0.00..111.05 rows=9948 width=0) (actual time=2.994..2.995 rows=10063 loops=1)
                    Index Cond: (int1000 = 1)
              -&gt;  Bitmap Index Scan on int100_idx  (cost=0.00..1045.94 rows=95667 width=0) (actual time=23.757..23.757 rows=100038 loops=1)
                    Index Cond: (int100 = 1)
Planning Time: 0.138 ms

/* ~30-90ms */
</code></pre>
<p>The query plan confirms that it’s using the <a href="https://www.postgresql.org/docs/current/indexes-bitmap-scans.html">bitmap intersection strategy</a>
for intersecting the two indexes. But that’s not what’s causing the performance
difference.</p>
<p>While MySQL services the entire aggregate (<code>count(*)</code>) from the index, Postgres
actually goes to the heap to get <em>every row</em>. The heap contains the <em>entire</em>
row, which is upwards of 1 KiB. This is expensive, and when the heap cache isn’t
warm, the query takes almost 100ms! <sup><a href="#user-content-fn-5" id="user-content-fnref-5" data-footnote-ref="true" aria-describedby="footnote-label">4</a></sup></p>
<p>As we can tell from the query plan, it seems that Postgres is unable to do
<a href="https://www.postgresql.org/docs/current/indexes-index-only-scans.html">index-only scans</a> in conjunction with index intersection. Maybe in a future
Postgres version they will support this; I don’t see any fundamental reason why
they couldn’t!</p>
<p>Going to the heap doesn’t have a huge impact when it’s only for ~100 records,
especially when the heap is cached. However, if we change the condition to
<code>WHERE int10 = 1 AND int100 = 1</code>, for a total of 10,000 matches, the query
takes 7s on Postgres, versus 200ms in MySQL, where the index-only scan is alive
and kicking!</p>
<p>So MySQL is superior on index merges where there is an opportunity to service
the entire query from the indexes. It is worth pointing out, though, that
Postgres’ lower bound when everything is cached is lower for this particular
intersection size; likely its bitmap-based intersection is faster.</p>
<p>Postgres and MySQL do have roughly equivalent performance on index-only scans
though. For example, if we do <code>int10 = 1</code> Postgres will do its own index-only
scan because only one index is involved.</p>
<p>The first time I ran this index-only scan in Postgres it took over a
second; I had to run <code>VACUUM</code> for the performance to match! In Postgres,
index-only scans require frequent <code>VACUUM</code> on the table to avoid falling
back to fetching the entire row from the heap.</p>
<p><code>VACUUM</code> helps because Postgres has to visit the heap for any records that have
been touched since the last <code>VACUUM</code>, <a href="https://www.postgresql.org/docs/current/indexes-index-only-scans.html">due to its MVCC implementation</a>. In my
experience, this can have serious consequences in production for index-only
scans if you have an update-heavy table where <code>VACUUM</code>
is expensive.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Index merges are ~10x slower than composite indexes because the ad-hoc
intersection isn’t a very fast operation: it requires, for example, sorting the
output of each index scan. Indexes could be optimized further for
intersection, but this would likely have other ramifications for steady-state
load.</p>
<p>If you’re wondering whether you need to add a composite index, or can get away
with creating two single indexes and relying on the database to use both, then
<strong>the rule of thumb we establish is that an index merge will be ~10x slower than
the composite index</strong>. However, we’re still talking less than 100ms in most
cases, as long as you’re operating on 100s of rows (which in a relational,
operational database, hopefully you mostly are).</p>
<p>The gap in performance will widen when intersecting more than two columns, and
with a larger intersection size—I had to limit the scope of this article somewhere.
Roughly an order of magnitude seems like a reasonable assumption, and ~100
matching rows is typical of many real-life queries.</p>
<p>If you are using Postgres, be careful relying on index merging! Postgres doesn’t
do index-only scans after an index merge, so it may have to go to the heap for
potentially 100,000s of records for a <code>count(*)</code>. If you’re only returning 10s
to 100s of rows, that’s usually fine.</p>
<p>Another second-order take-away: If you’re in a situation where you have 10s of
columns filtering in all kinds of combinations, with queries like this:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> id
<span class="token keyword">FROM</span> products
<span class="token keyword">WHERE</span> color <span class="token operator">=</span> <span class="token string">'blue'</span> <span class="token operator">AND</span> <span class="token keyword">type</span> <span class="token operator">=</span> <span class="token string">'sneaker'</span> <span class="token operator">AND</span> activity <span class="token operator">=</span> <span class="token string">'training'</span>
  <span class="token operator">AND</span> season <span class="token operator">=</span> <span class="token string">'summer'</span> <span class="token operator">AND</span> inventory <span class="token operator">&gt;</span> <span class="token number">0</span> <span class="token operator">AND</span> price <span class="token operator">&lt;=</span> <span class="token number">200</span> <span class="token operator">AND</span> price <span class="token operator">&gt;=</span> <span class="token number">100</span>
  <span class="token comment">/* and potentially many, many more rules */</span>
</code></pre>
<p>Then you’re in a bit more of a pickle with Postgres/MySQL. Supporting this
use-case well would require a combinatorial explosion of composite indexes, and
composite indexes are what you’d need for the sub-10ms performance fast websites
require. This is simply impractical.</p>
<p>Unfortunately, for sub-10ms response times, we also can’t rely on index merges
being <em>that</em> fast, because of the ad-hoc intersection. I wrote an article about
solving the problem of queries that have <a href="/napkin/problem-13-filtering-with-inverted-indexes">lots of conditions with Lucene</a>,
which is <em>very</em> good at doing lots of intersections. It would be interesting to
try this with <a href="https://www.postgresql.org/docs/current/gin-intro.html#:~:text=GIN%20stands%20for%20Generalized%20Inverted,appear%20within%20the%20composite%20items.">GIN-indexes</a> (inverted index, similar to what Lucene does) in
Postgres as a comparison. <a href="https://www.postgresql.org/docs/current/bloom.html">Bloom-indexes</a> may also be suited for this.
Columnar database might also be better at this, but I haven’t looked at that
in-depth yet.</p>
<section data-footnotes="true" class="footnotes"><h2 class="sr-only" id="footnote-label">Footnotes</h2>
<ol>
<li id="user-content-fn-1">
<p>The testing is done on a table generated by <a href="https://github.com/sirupsen/napkin-math/blob/master/newsletter/20-compound-vs-combining-indexes/test.rb">this simple script</a>. <a href="#user-content-fnref-1" data-footnote-backref="" aria-label="Back to reference 1" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-2">
<p>There’s extra overhead searching the index B-tree down to the relevant
range, and the reads aren’t <em>entirely</em> sequential in the B-tree.
Additionally, we’re assuming the index is in memory, which is a reasonable
assumption given its tiny size. Reading from SSD should only be <a href="https://github.com/sirupsen/napkin-math">2x
slower</a>, since access is mostly sequential-ish once the first relevant leaf has
been found. Each index entry struct is also bigger than two 64-bit integers,
e.g. it includes the heap location in Postgres or the primary key in MySQL.
Either way, napkin math of a few hundred microseconds still seems fair! <a href="#user-content-fnref-2" data-footnote-backref="" aria-label="Back to reference 2" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-3">
<p>Looking at the real index sizes, the compound index is ~70 MiB in
Postgres, and 350 MiB in MySQL. We’d expect an index of roughly three 64-bit
integers per row (the third being the location on the heap) to be ~230 MiB for 10M
rows. <a href="https://news.ycombinator.com/item?id=33766531">fabien2k on HN</a> pointed out that Postgres does
<a href="https://www.postgresql.org/docs/current/btree-implementation.html">de-duplication</a>, which is likely how it achieves its lower index size.
MySQL has some overhead, which is reasonable for a structure of this size.
Both perform about equally on this, but a smaller index at the same
performance is superior, as it takes up less cache space. <a href="#user-content-fnref-3" data-footnote-backref="" aria-label="Back to reference 3" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-5">
<p>In the first edition of this article, Postgres was going to the heap 100s
of times, instead of just 109 times for the 109 matching rows. It turns
out that the bitmaps and their intersection were exceeding the
<code>work_mem=4MB</code> default setting. This causes Postgres to use a <em>lossy bitmap
intersection</em> with just the heap page rather than the exact row location. Read
more <a href="https://dba.stackexchange.com/a/106267">here.</a> Thanks to /u/therealgaxbo and /u/trulus on Reddit for
<a href="https://www.reddit.com/r/PostgreSQL/comments/z6pviz/comment/iy2ucc9/?utm_source=reddit&amp;utm_medium=web2x&amp;context=3">pointing this out.</a> Either way, Postgres is still not performing an
index-only scan, requiring 109 random disk seeks on a cold cache, taking ~90ms. <a href="#user-content-fnref-5" data-footnote-backref="" aria-label="Back to reference 4" class="data-footnote-backref">↩</a></p>
</li>
</ol>
</section>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Scaling Causal's Spreadsheet Engine from Thousands to Billions of Cells: From Maps to Arrays]]></title>
        <id>https://sirupsen.com/causal</id>
        <link href="https://sirupsen.com/causal"/>
        <updated>2022-07-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Causal's UI
Causal is a spreadsheet built for the 21st century to help people work better
with numbers. Behind Causal’s innocent web UI is a complex calculation engine —
an interpreter that executes formulas on an in-memory, multidimensional
database. The engine sends the result from ]]></summary>
        <content type="html"><![CDATA[<figure><img src="/images/causal/image4.gif" alt="alt_text" title="Causal&#x27;s UI" width="730" height="440" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Causal's UI</figcaption></figure>
<p>Causal is a spreadsheet built for the 21st century to help people work better
with numbers. Behind Causal’s innocent web UI is a complex calculation engine —
an interpreter that executes formulas on an in-memory, multidimensional
database. The engine sends the result from evaluating expressions like <code>Price * Units</code> to the browser, calculating the result for each dimension combination, such
as time, product name, and country: e.g. what the revenue was for a single product,
during February ‘22, in Australia.</p>
<p>In the early days of Causal, the calculation engine ran in JavaScript in <em>the
browser</em>, but that only scaled to 10,000s of cells. So we moved the calculation
engine <em>out</em> of the browser to a Node.js service, getting us acceptable
performance for low 100,000s of cells. In its latest and current iteration, we
moved the calculation engine to Go, getting us to 1,000,000s of cells.</p>
<p>But every time we scale up by an order of magnitude, our customers find new
use-cases that require yet another order of magnitude more cells!</p>
<p>With no more “cheap tricks” of switching the run-time again, how can we scale
the calculation engine 100x, from millions to <em>billions</em> of cells?</p>
<p>In summary: by moving from maps to arrays. 😅 That may seem like an awfully
pedestrian observation, but it certainly wasn’t obvious to us at the outset that
this was the crux of the problem!</p>
<p>We want to take you along on our little journey of what to do once you’ve reached a
dead-end with the profiler. Instead, we’ll approach the problem from first
principles with back-of-the-envelope calculations, and write simple programs to
get a feel for the performance of various data structures. Causal isn’t quite at
billions of cells yet, but we’re rapidly making our way there!</p>
<h3 id="optimizing-beyond-the-profiler-dead-end">Optimizing beyond the profiler dead-end</h3>
<figure><img src="/images/causal/image3.png" alt="alt_text" title="Profile from the calculation engine that
it feels difficult to action" width="1612" height="934" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Profile from the calculation engine that
it feels difficult to action</figcaption></figure>
<p>What does it look like to reach a <em>dead-end</em> with a profiler? When you run a profiler for the first time, you’ll often get something useful: your program’s spending 20% of its time in an auxiliary function <code>log_and_send_metrics()</code> that you know <em>reasonably</em> shouldn’t take 20% of the time.</p>
<p>You peek at the function, see that it’s doing a ridiculous amount of string allocations, UDP-jiggling, and blocking the computing thread… You play this fun and rewarding profile whack-a-mole for a while, getting big and small increments here and there.</p>
<p>But at some point, your profile starts to look a bit like the above: there’s no longer anything that stands out to you as <em>grossly</em> against what’s <em>reasonable.</em> No longer any pesky <code>log_and_send_metrics()</code> eating double-digit percentages of your precious runtime.</p>
<p>The constraints move to your own calibration of what % is reasonable in the profile: It’s spending time in the GC, time allocating objects, a bit of time accessing hash maps, … Isn’t that all <em>reasonable</em>? How can we possibly know whether 5.3% of time scanning objects for the GC is <em>reasonable</em>? Even if we did optimize our memory allocations to get that number to 3%, that’s a puny incremental gain… It’s not going to get us to billions of cells! Should we switch to a non-GC’ed language? Rust?! At a certain point, you’ll go mad trying to turn a profile into a performance roadmap.</p>
<p>When analyzing a system top-down with a profiler, it’s easy to miss the forest for the trees. It helps to take a step back, and analyze the problem from first principles.</p>
<p>We sat down and thought about what, fundamentally, a calculation engine is. With some back-of-the-envelope calculations, what’s the upper bookend of how many cells we could reasonably expect the calculation engine to support?</p>
<p>In my experience, first-principle thinking is <em>required</em> to break out of iterative improvement and make order of magnitude improvements. A profiler can’t be your only performance tool.</p>
<h3 id="approaching-the-calculation-engine-from-first-principles">Approaching the calculation engine from first principles</h3>
<p>To understand that, we have to explain two concepts from Causal that help keep your spreadsheet organized: dimensions and variables.</p>
<p>We might have a variable “Sales” that is broken down by the dimensions “Product” and “Country”. To appreciate how easy it is to build a giant model: if we have 100s of months, 10,000s of products, 10s of countries, and 100 variables, we’ve already created a model with 1B+ cells. In Causal, “Sales” looks like this:</p>
<figure><img src="/images/causal/image5.png" alt="alt_text" title="Sales modeled in Causal&#x27;s UI" width="1380" height="510" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Sales modeled in Causal's UI</figcaption></figure>
<p>In a first iteration we might represent <code>Sales</code> and its cells with a map. This seems innocent enough, especially coming from an original implementation in JavaScript, hastily ported to Go. As we’ll learn in this blog post, there are several performance problems with this data structure, but we’ll take it step by step:</p>
<pre class="language-go"><code class="language-go">sales <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token keyword">map</span><span class="token punctuation">[</span><span class="token builtin">int</span><span class="token punctuation">]</span><span class="token operator">*</span>Cell<span class="token punctuation">)</span>
</code></pre>
<p>The integer index would be the <em>dimension index</em> used to reference a specific cell. It is the index representing the specific dimension combination we’re interested in. For example, for <code>Sales[Toy-A][Canada]</code> the index would be 0, because Toy-A is the 0th <code>Product Name</code> and Canada is the 0th <code>Country</code>. For <code>Sales[Toy-A][United Kingdom]</code> it would be 1 (0th Toy, 1st Country), and for <code>Sales[Toy-C][India]</code> it would be <code>2 * 3 + 2 = 8</code> (2nd Toy, 2nd Country, with three countries per product).</p>
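<p>As a sketch (the helper name and layout are illustrative, not the engine’s actual code), the flat dimension index is the row-major combination of the per-dimension indices:</p>

```go
package main

import "fmt"

// flatIndex maps per-dimension indices to one flat cell index, row-major:
// the last dimension varies fastest. dims holds each dimension's size,
// idx the chosen zero-based index within each dimension.
func flatIndex(dims, idx []int) int {
	flat := 0
	for d := range dims {
		flat = flat*dims[d] + idx[d]
	}
	return flat
}

func main() {
	// Hypothetical model: 3 products × 3 countries, zero-indexed.
	dims := []int{3, 3}
	fmt.Println(flatIndex(dims, []int{0, 0})) // e.g. [Toy-A][Canada] → 0
	fmt.Println(flatIndex(dims, []int{0, 1})) // e.g. [Toy-A][United Kingdom] → 1
	fmt.Println(flatIndex(dims, []int{1, 2})) // 1*3 + 2 → 5
}
```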
<p>An ostensible benefit of the map structure is that if a lot of cells are 0, then we don’t have to store those cells at all. In other words, this data structure seems useful for <em>sparse</em> models.</p>
<p>But to make the spreadsheet come alive, we need to calculate formulas such as <code>Net Profit = Sales * Profit</code>. This simple equation shows the power of Causal’s dimensional calculations, as it will calculate each cell’s unique net profit!</p>
<p>Now that we have a simple mental model of how Causal’s calculation engine works, we can start reasoning about its performance from first principles.</p>
<p>If we multiply two variables of 1B cells of 64 bit floating points each (~<a href="https://github.com/sirupsen/napkin-math#numbers">8 GiB memory</a>) into a third variable, then we have to traverse at least ~24 GiB of memory. If we naively assume this is sequential access (which hashmap access <em>isn’t</em>) and we have SIMD and multi-threading, we can process that memory at a rate of 30ms / 1 GiB, or ~700ms total (and <em>half</em> that time if we were willing to drop to 32-bit floating points and forgo some precision!).</p>
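<p>That estimate can be spelled out as a tiny program (rates taken from the napkin-math reference; the 30ms/GiB figure assumes sequential access with SIMD and multi-threading):</p>

```go
package main

import "fmt"

func main() {
	const cells = 1e9        // cells per variable
	const bytesPerCell = 8.0 // one 64-bit float per cell
	const gib = float64(1 << 30)
	const msPerGiB = 30.0 // sequential rate, SIMD + multi-threaded

	// Read two input variables, write one output variable: 3 passes.
	traffic := 3 * cells * bytesPerCell / gib
	// The article rounds this to ~24 GiB of traffic and ~700ms.
	fmt.Printf("~%.0f GiB of memory traffic, ~%.0f ms\n", traffic, traffic*msPerGiB)
}
```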
<p>So from first-principles, it seems <em>possible</em> to do calculations of billions of
cells in less than a second. Of course, there’s far more complexity below the
surface as we execute the many types of formulas, and computations on
dimensions. But there’s reason for optimism! We will carry through this example
of multiplying variables for <code>Net Profit</code> as it serves as a good proxy for the
performance we can expect on large models, where typically you’ll have fewer,
smaller variables.</p>
<p>In the remainder of this post, we will try to close the gap between smaller Go prototypes and the napkin math. That should serve as evidence of what performance work to focus on in the 30,000+ line calculation engine.</p>
<h3 id="iteration-1-mapintcell-30m-cells-in-6s"><strong>Iteration 1</strong>: <code>map[int]*Cell</code>, 30m cells in ~6s</h3>
<p>In Causal’s calculation engine each <code>Cell</code> in the map was initially ~88 bytes to store various information about the cell such as the formula, dependencies, and other references. We start our investigation by implementing this basic data-structure in Go.</p>
<p>With 10M-cell variables, for a total of 30M cells, it takes almost 6s to compute the <code>Net Profit = Sales * Profit</code> calculation. These numbers from our prototype don’t include all the other overhead that naturally accompanies running in a larger, far more feature-complete code base. In the real engine, this takes a few times longer.</p>
<p>We want to be able to do <em>billions</em> in seconds with plenty of wiggle-room for
necessary overhead, so 10s of millions in seconds won’t fly. We have to do
better. We know from our napkin math, that we <em>should</em> be able to.</p>
<pre class="language-sh-session"><code class="language-sh-session">$ go build main.go &amp;&amp; hyperfine ./main
Benchmark 1: ./napkin
  Time (mean ± σ):      5.828 s ±  0.032 s    [User: 10.543 s, System: 0.984 s]
  Range (min ... max):    5.791 s ...  5.881 s    10 runs
</code></pre>
<pre class="language-go"><code class="language-go"><span class="token keyword">package</span> main

<span class="token keyword">import</span> <span class="token punctuation">(</span>
        <span class="token string">&quot;math/rand&quot;</span>
<span class="token punctuation">)</span>

<span class="token keyword">type</span> Cell88 <span class="token keyword">struct</span> <span class="token punctuation">{</span>
        padding <span class="token punctuation">[</span><span class="token number">80</span><span class="token punctuation">]</span><span class="token builtin">byte</span> <span class="token comment">// just to simulate what would be real stuff</span>
        value   <span class="token builtin">float64</span>
<span class="token punctuation">}</span>

<span class="token keyword">func</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
        <span class="token function">pointerMapIntegerIndex</span><span class="token punctuation">(</span><span class="token number">10_000_000</span><span class="token punctuation">)</span> <span class="token comment">// 3 variables = 30M total</span>
<span class="token punctuation">}</span>

<span class="token keyword">func</span> <span class="token function">pointerMapIntegerIndex</span><span class="token punctuation">(</span>nCells <span class="token builtin">int</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
        one <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token keyword">map</span><span class="token punctuation">[</span><span class="token builtin">int</span><span class="token punctuation">]</span><span class="token operator">*</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>
        two <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token keyword">map</span><span class="token punctuation">[</span><span class="token builtin">int</span><span class="token punctuation">]</span><span class="token operator">*</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>
        res <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token keyword">map</span><span class="token punctuation">[</span><span class="token builtin">int</span><span class="token punctuation">]</span><span class="token operator">*</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>

        rand <span class="token operator">:=</span> rand<span class="token punctuation">.</span><span class="token function">New</span><span class="token punctuation">(</span>rand<span class="token punctuation">.</span><span class="token function">NewSource</span><span class="token punctuation">(</span><span class="token number">0xCA0541</span><span class="token punctuation">)</span><span class="token punctuation">)</span>

        <span class="token keyword">for</span> i <span class="token operator">:=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> nCells<span class="token punctuation">;</span> i<span class="token operator">++</span> <span class="token punctuation">{</span>
                one<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token operator">&amp;</span>Cell88<span class="token punctuation">{</span>value<span class="token punctuation">:</span> rand<span class="token punctuation">.</span><span class="token function">Float64</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">}</span>
                two<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token operator">&amp;</span>Cell88<span class="token punctuation">{</span>value<span class="token punctuation">:</span> rand<span class="token punctuation">.</span><span class="token function">Float64</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">}</span>
        <span class="token punctuation">}</span>

        <span class="token keyword">for</span> i <span class="token operator">:=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> nCells<span class="token punctuation">;</span> i<span class="token operator">++</span> <span class="token punctuation">{</span>
                res<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token operator">&amp;</span>Cell88<span class="token punctuation">{</span>value<span class="token punctuation">:</span> one<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value <span class="token operator">*</span> two<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value<span class="token punctuation">}</span>
        <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
</code></pre>
<h3 id="iteration-2-cell-30m-cells-in-400ms">Iteration 2: <code>[]Cell</code>, 30m cells in ~400ms</h3>
<p>In our napkin math, we assumed <em>sequential</em> memory access. But hashmaps don’t do
sequential memory access. Perhaps this is a far larger offender than our profile
above suggests?</p>
<p>Well, how do hashmaps work? You hash a key to find the <em>bucket</em> that the key/value pair is stored in, and insert the key and value into that bucket. When the average occupancy of the buckets grows to around ~6.5 entries, the number of buckets doubles and all the entries get re-shuffled (fairly expensive, and a good reason to pre-size your maps). Each resize means re-hashing and re-comparing a lot of keys into an ever-larger bucket array.</p>
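<p>A quick way to feel the resizing cost: fill a pre-sized map and a map that has to grow from scratch. This sketch only asserts both end up with the same contents; the timing difference (which varies by machine) is the interesting part:</p>

```go
package main

import (
	"fmt"
	"time"
)

func fill(m map[int]float64, n int) {
	for i := 0; i < n; i++ {
		m[i] = float64(i)
	}
}

func main() {
	const n = 2_000_000

	start := time.Now()
	grown := make(map[int]float64) // starts tiny, doubles its buckets repeatedly
	fill(grown, n)
	fmt.Println("grown:   ", time.Since(start))

	start = time.Now()
	presized := make(map[int]float64, n) // buckets allocated up front, no re-shuffling
	fill(presized, n)
	fmt.Println("presized:", time.Since(start))

	fmt.Println(len(grown) == len(presized)) // true
}
```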
<figure><img src="/images/causal/image2.png" alt="alt_text" title="Array of Structs to Struct of Arrays" width="1219" height="774" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Array of Structs to Struct of Arrays</figcaption></figure>
<p>Let’s think about the performance implications of this from the ground up. Every time we look up a cell from its integer index, the operations we have to perform (and their performance, according to the <a href="https://github.com/sirupsen/napkin-math">napkin math reference</a>):</p>
<ol>
<li><strong>Hash</strong> the integer index to a hashed value: 25ns</li>
<li><strong>Mask</strong> the hashed value to map it to a bucket: 1-5ns</li>
<li><strong>Random memory read</strong> to map the bucket to a pointer to the bucket’s address: 1ns (because it’ll be in the cache)</li>
<li><strong>Random memory read</strong> to read the bucket: 50ns</li>
<li><strong>Equality</strong> operations on up to 6-7 entries in the bucket to locate the right key: 1-10ns</li>
<li><strong>Random memory read</strong> to follow and read the *Cell pointer: 50ns</li>
</ol>
<p>Most of this comes out in the wash; by far the most expensive parts are the random memory reads the map entails. Let’s say ~100ns per look-up: with ~30M look-ups, that’s ~3 seconds in hash look-ups alone. That lines up with the performance we’re seeing. Fundamentally, it really seems like trouble to get to billions of cells with a map.</p>
<p>There’s another problem with our data structure in addition to all the pointer-chasing leading to slow random memory reads: the size of the cell. Each cell is 88 bytes. When a CPU reads memory, it fetches one <em>cache line</em> of 64 bytes at a time. In this case, the entire 88 byte cell doesn’t fit in a single cache line. 88 bytes spans two cache lines, with 128 - 88 = 40 bytes of wasteful fetching of our precious memory bottleneck!</p>
<p>If those 40 bytes belonged to the next cell, that’s not a big deal, since we’re about to use them anyway. However, in this random-memory-read heavy world of using a hashmap that stores pointers, we can’t trust that cells will be adjacent. This is enormously wasteful for our precious memory bandwidth.</p>
<p>In the <a href="https://github.com/sirupsen/napkin-math">napkin math reference</a>, random memory reads are <em>~50x slower than sequential access</em>. A huge reason for this is that the CPU’s memory prefetcher cannot predict memory access. Accessing memory is one of the slowest things a CPU does, and if it can’t preload cache lines, we’re spending <em>a lot</em> of time stalled on memory.</p>
<p>Could we give up the map? We mentioned earlier that a nice property of the map is that it allows us to build sparse models with lots of empty cells. For example, cohort models tend to have half of their cells empty. But perhaps half of the cells being empty is not quite enough to qualify as ‘sparse’?</p>
<p>We could consider mapping the index for the cells into a large, pre-allocated array. Then cell access would be just a <em>single</em> random-read of 50ns! In fact, it’s even better than that: In this particular <code>Net Profit</code>, all the memory access is sequential. This means that the CPU can be smart and prefetch memory because it can reasonably predict what we’ll access next. For a single thread, we know we can do about <a href="https://github.com/sirupsen/napkin-math#numbers">1 GiB/100ms</a>. This is about <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>30</mn><mi>M</mi><mo>⋅</mo><mn>88</mn><mtext> bytes</mtext><mo>≈</mo><mn>2.5</mn><mtext> GiB</mtext></mrow><annotation encoding="application/x-tex">30M \cdot 88 \text{ bytes} \approx 2.5 \text{ GiB}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">30</span><span class="mord mathnormal" style="margin-right:0.10903em">M</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mord">88</span><span class="mord text"><span class="mord"> bytes</span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">2.5</span><span class="mord text"><span class="mord"> GiB</span></span></span></span></span>, so it should take somewhere in the ballpark of 250-300ms. Consider also that the allocations themselves on the first few lines take a bit of time.</p>
<pre class="language-go"><code class="language-go"><span class="token keyword">func</span> <span class="token function">arrayCellValues</span><span class="token punctuation">(</span>nCells <span class="token builtin">int</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
        one <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token punctuation">]</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>
        two <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token punctuation">]</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>
        res <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token punctuation">]</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>

        rand <span class="token operator">:=</span> rand<span class="token punctuation">.</span><span class="token function">New</span><span class="token punctuation">(</span>rand<span class="token punctuation">.</span><span class="token function">NewSource</span><span class="token punctuation">(</span><span class="token number">0xCA0541</span><span class="token punctuation">)</span><span class="token punctuation">)</span>

        <span class="token keyword">for</span> i <span class="token operator">:=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> nCells<span class="token punctuation">;</span> i<span class="token operator">++</span> <span class="token punctuation">{</span>
                one<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value <span class="token operator">=</span> rand<span class="token punctuation">.</span><span class="token function">Float64</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
                two<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value <span class="token operator">=</span> rand<span class="token punctuation">.</span><span class="token function">Float64</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
        <span class="token punctuation">}</span>

        <span class="token keyword">for</span> i <span class="token operator">:=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> nCells<span class="token punctuation">;</span> i<span class="token operator">++</span> <span class="token punctuation">{</span>
                res<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value <span class="token operator">=</span> one<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value <span class="token operator">*</span> two<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value
        <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
</code></pre>
<pre class="language-sh-session"><code class="language-sh-session">napkin:go2 $ go build main.go &amp;&amp;  hyperfine ./main
Benchmark 1: ./main
  Time (mean ± σ):     346.4 ms ±  21.1 ms    [User: 177.7 ms, System: 171.1 ms]
  Range (min ... max):   332.5 ms ... 404.4 ms    10 runs
</code></pre>
<p>That’s great! And it tracks our expectations from our napkin math well (the extra overhead is partially from the random number generator).</p>
<h3 id="iteration-3-threading-250ms">Iteration 3: Threading, 250ms</h3>
<p>Generally, we expect threading to speed things up substantially as we’re able to utilize more cores. However, in this case we’re memory bound, not compute bound. We’re just doing simple calculations between the cells, which is generally the case in real Causal models. Multiplying numbers takes single-digit cycles; fetching memory takes a double- to triple-digit number of cycles. Compute-bound workloads scale well with cores. Memory-bound workloads act differently when scaled up.</p>
<p>If we look at raw memory bandwidth numbers in the <a href="https://github.com/sirupsen/napkin-math#numbers">napkin math reference</a>, a 3x speed-up in a memory-bound workload seems to be our ceiling. In other words, if you’re memory bound, you only need about ~3-4 cores to exhaust memory bandwidth. More won’t help much. But they do help, because a single thread <a href="https://news.ycombinator.com/item?id=16174813">cannot exhaust memory bandwidth on most CPUs</a>.</p>
<p>When <a href="https://gist.github.com/sirupsen/d413b130d0f45d0d35d0bc85b9071abb">implemented</a>, however, we only get a ~1.6x speed-up (400ms → 250ms), not a 3x speed-up (~130ms). I am frankly not sure how to explain this ~120ms gap. If anyone has a theory, <a href="mailto:lukas@causal.app">we’d love to hear it</a>!</p>
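<p>For reference, the threaded multiply is roughly this shape — a simplified sketch of chunking the index range across goroutines; details differ from the linked gist:</p>

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// multiplyChunked computes res[i] = one[i] * two[i], splitting the index
// range across workers so each goroutine streams through its own
// sequential chunk of memory.
func multiplyChunked(res, one, two []float64) {
	workers := runtime.NumCPU()
	chunk := (len(res) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if hi > len(res) {
			hi = len(res)
		}
		if lo >= hi {
			continue
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				res[i] = one[i] * two[i]
			}
		}(lo, hi)
	}
	wg.Wait()
}

func main() {
	n := 10
	one, two, res := make([]float64, n), make([]float64, n), make([]float64, n)
	for i := range one {
		one[i], two[i] = float64(i), 2
	}
	multiplyChunked(res, one, two)
	fmt.Println(res[3], res[9]) // 6 18
}
```

Since the loop is memory bound, a handful of workers is enough to saturate bandwidth; more goroutines mostly add scheduling overhead.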
<p>Either way, we definitely seem to be memory bound now. There are only two ways forward: (1) get more memory bandwidth on a different machine, or (2) reduce the amount of memory we’re using. Let’s try to find some more brrr with (2).</p>
<h3 id="iteration-4-smaller-cells-88-bytes--32-bytes-70ms">Iteration 4: Smaller Cells, 88 bytes → 32 bytes, 70ms</h3>
<p>If we were able to cut the cell size ~3x, from 88 bytes to 32 bytes, we’d expect performance to roughly triple as well! In our simulation tool, we’ll reduce the size of the cell:</p>
<pre class="language-go"><code class="language-go"><span class="token keyword">type</span> Cell32 <span class="token keyword">struct</span> <span class="token punctuation">{</span>
    padding <span class="token punctuation">[</span><span class="token number">24</span><span class="token punctuation">]</span><span class="token builtin">byte</span>
    value   <span class="token builtin">float64</span>
<span class="token punctuation">}</span>
</code></pre>
<p>Indeed, with the threading on top, <a href="https://gist.github.com/sirupsen/d413b130d0f45d0d35d0bc85b9071abb">this gets us to ~70ms</a> which is just around a 3x improvement!</p>
<p>In fact, what is even in that cell struct? The cell stores things like formulas, but for many cells, we don’t <em>actually</em> need the formula stored with the cell. For most cells in Causal, the formula is the same as the <em>previous</em> cell’s. I won’t show the original struct, because it’s confusing, but there are other pointers too, e.g. to the parent variable. By writing the calculation engine’s interpreter to more carefully keep track of context, we should be able to remove various pointers, e.g. to the parent variable. Often, structs get expanded with cruft as a quick way to break through some logic barrier, rather than by carefully restructuring the surrounding context to provide this information on the stack.</p>
<p>As a general pattern, we can reduce the size of the cell by switching from an <em>array of structs</em> design to a <em>struct of arrays</em> design. In other words, if we’re in a cell with index 328 and need the formula for that cell, we look up index 328 in a formula array. These are called <em>parallel arrays</em>. Even if we access a different formula for every single cell, the CPU is smart enough to detect that it’s another sequential access stream. This is generally much faster than chasing pointers.</p>
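<p>A minimal sketch of the parallel-arrays idea (field names are hypothetical, not the engine’s): each attribute lives in its own slice, and they all share the same cell index:</p>

```go
package main

import "fmt"

// Struct-of-arrays layout: instead of one slice of fat Cell structs,
// each attribute gets its own parallel slice, indexed by the cell index.
type Cells struct {
	values     []float64
	formulaIDs []int32 // index into a separate formula table (hypothetical)
}

func newCells(n int) Cells {
	return Cells{
		values:     make([]float64, n),
		formulaIDs: make([]int32, n),
	}
}

func main() {
	cells := newCells(4)
	cells.values[2] = 42.0
	cells.formulaIDs[2] = 7

	// The hot multiply loop only touches the values slice: 8 bytes per
	// cell instead of 88, and perfectly sequential.
	sum := 0.0
	for _, v := range cells.values {
		sum += v
	}
	fmt.Println(sum, cells.formulaIDs[2]) // 42 7
}
```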
<figure><img src="/images/causal/image1.png" alt="" width="1283" height="477" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>None of this is particularly hard to do, but it wasn’t until now that we realized how paramount this was to the engine’s performance! Unfortunately, the profiler isn’t yet helpful enough to tell you that reducing the size of a struct below that 64-byte threshold can lead to non-linear performance increases. You need to know to use tools like <a href="https://linux.die.net/man/1/pahole"><code>pahole(1)</code></a> for that.</p>
<h3 id="iteration-5-float64-w-parallel-arrays-20ms">Iteration 5: <code>[]float64</code> w/ Parallel Arrays, 20ms</h3>
<p>If we want to find the absolute speed-limit for Causal’s performance then, we’d want to imagine that the Cell is just:</p>
<pre class="language-go"><code class="language-go"><span class="token keyword">type</span> Cell8 <span class="token keyword">struct</span> <span class="token punctuation">{</span>
    value   <span class="token builtin">float64</span>
<span class="token punctuation">}</span>
</code></pre>
<p>That’s a total memory usage of 30M ⋅ 8 bytes ≈ 228 MiB, which we can read at 35 μs/MiB <a href="https://github.com/sirupsen/napkin-math#numbers">in a threaded</a> program, so ~8ms.
We won’t get much faster than this, since we also inevitably have to spend time allocating the memory.</p>
<p>When <a href="https://gist.github.com/sirupsen/d413b130d0f45d0d35d0bc85b9071abb">implemented</a>, the raw floats take ~20ms (consider that we have to allocate the memory too) for our 30M cells.</p>
<p>Let’s scale it up. For <a href="https://gist.github.com/sirupsen/d413b130d0f45d0d35d0bc85b9071abb#file-simulator-go-L38">1B cells</a>, this takes ~3.5s. That’s pretty good! Especially considering that the calculation engine already has a lot of caching to ensure we don’t have to re-evaluate every cell in the sheet. But we want to make sure that the worst case of evaluating the entire sheet performs well, and that we have some headroom for inevitable overhead.</p>
<p>Our initial napkin math suggested we could get to ~700ms for 3B cells, so there’s a bit of a gap. We get to ~2.4s for 1B cells by <a href="https://gist.github.com/sirupsen/d413b130d0f45d0d35d0bc85b9071abb#file-simulator-go-L133">moving allocations into the threads that actually need them</a>; closing the gap further would take more investigation. However, localizing allocations starts to get into territory that would be quite hard to implement generically in practice, so we’ll stop around here until we have the luxury of this problem being the bottleneck. There’s plenty of work to make all these transitions in a big, production code-base!</p>
<h3 id="iteration-n-simd-compression-gpu-">Iteration N: SIMD, compression, GPU …</h3>
<p>That said, there are <em>lots</em> of optimizations left. Go’s compiler currently doesn’t emit SIMD instructions, which would let us extract even more memory bandwidth. Another common optimization path for number-heavy programs is to encode the numbers, e.g. with delta-encoding. Because we’re constrained by memory bandwidth more than compute, compression can, counter-intuitively, make the program <em>faster</em>: the CPU stalls for many cycles while waiting for memory access, and we can use those otherwise wasted cycles for the simple arithmetic needed to decompress.</p>
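<p>A minimal sketch of the delta-encoding idea in Go, decoding on the fly while summing. The data here is made up for illustration, and a real implementation would store the deltas in a narrower type to actually save bandwidth:</p>

```go
package main

import "fmt"

// deltaEncode stores each value as the difference from its predecessor.
// For slowly-changing series, deltas are small and compress well,
// trading cheap arithmetic for scarce memory bandwidth.
func deltaEncode(values []float64) []float64 {
	deltas := make([]float64, len(values))
	prev := 0.0
	for i, v := range values {
		deltas[i] = v - prev
		prev = v
	}
	return deltas
}

// sumDecoded reconstructs the original values on the fly while summing,
// using CPU cycles that would otherwise stall waiting on memory.
func sumDecoded(deltas []float64) float64 {
	var cur, sum float64
	for _, d := range deltas {
		cur += d // decode: a running prefix sum restores the value
		sum += cur
	}
	return sum
}

func main() {
	values := []float64{100, 101, 103, 102}
	deltas := deltaEncode(values)   // [100, 1, 2, -1]
	fmt.Println(sumDecoded(deltas)) // prints: 406
}
```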
<p>Another trend from the AI community for number-crunching is to leverage GPUs, which have <em>enormous</em> memory bandwidth. However, moving data back and forth between the CPU and GPU can itself become a serious bottleneck. We’d have to learn what kinds of models could take advantage of this, since we have little experience with GPUs as a team, but we may be able to reuse the many existing ND-array implementations used for training neural nets. This would come with significant complexity, but also serious performance improvements for large models.</p>
<p>Either way there’s <em>plenty</em> of work to get to the faster, simpler design described above in the code-base. This would be further out, but makes us excited about the engineering ahead of us!</p>
<h3 id="conclusion">Conclusion</h3>
<p>Profiling had become a dead end for making the calculation engine faster, so we needed a different approach. Rethinking the core data structure from first principles, and understanding exactly why each part of the current data structure and its access patterns was slow, got us out of disappointing single-digit percentage performance improvements and unlocked order-of-magnitude ones. This way of thinking about designing software is often referred to as data-oriented design, and <a href="https://media.handmade-seattle.com/practical-data-oriented-design/">this talk by Andrew Kelley</a>, the author of the Zig compiler, is an excellent primer that inspired the team.</p>
<p>With these results, we were able to build a technical roadmap for incrementally moving the engine towards a more data-oriented design. The reality is <em>far</em> more complicated, as the calculation engine is north of 40K lines of code. But this investigation gave us confidence in the effort required to change the core of how the engine works, and in the performance improvements that will come over time!</p>
<p>The biggest performance take-aways for us were:</p>
<ol>
<li>When you’re stuck with performance on profilers, start thinking about the problem from first principles</li>
<li>Use indices, not pointers, when possible</li>
<li>Use array of structs when you access almost everything all the time, use struct of arrays when you don’t</li>
<li>Use arrays instead of maps when possible; the data needs to be <em>very</em> sparse for the memory savings to be worth it</li>
<li>Memory bandwidth is precious, and you can’t just parallelize your way out of it!</li>
</ol>
<p>Causal doesn’t smoothly support 1 billion cells yet, but we feel confident in
our ability to iterate our way there. Since starting this work, our small team
has already improved performance more than 3x on real models. If you’re
interested in working on this with Causal, helping them get to 10s of billions of
cells, you should consider joining the Causal team — email
<a href="mailto:lukas@causal.app">lukas@causal.app</a>!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Metrics For Your Web Application's Dashboards]]></title>
        <id>https://sirupsen.com/metrics</id>
        <link href="https://sirupsen.com/metrics"/>
        <updated>2022-03-19T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Whenever I create a dashboard for an application, it’s generally the same
handful of metrics I look to. They’re the ones I always use to orient myself
quickly when Pagerduty fires. They give me the grand overview, and then I’ll
know what logging queries to start writing, code to look at, box to SSH into, or
mitigation to activate. The same metrics are able to tell me during the day
whether the system is ok, and I use them to do napkin math on e.g. capacity
planning and imminent bottlenecks:]]></summary>
        <content type="html"><![CDATA[<p>Whenever I create a dashboard for an application, it’s generally the same
handful of metrics I look to. They’re the ones I always use to orient myself
quickly when Pagerduty fires. They give me the grand overview, and then I’ll
know what logging queries to start writing, code to look at, box to SSH into, or
mitigation to activate. The same metrics are able to tell me during the day
whether the system is ok, and I use them to do napkin math on e.g. capacity
planning and imminent bottlenecks:</p>
<ul>
<li><strong>Web Backend (e.g. Django, Node, Rails, Go, ..)</strong>
<ul>
<li>Response Time <code>p50</code>, <code>p90</code>, <code>p99</code>, <code>sum</code>, <code>avg</code> †</li>
<li>Throughput by HTTP status †</li>
<li><strong>Worker Utilization</strong> <sup><a href="#user-content-fn-web-utilization" id="user-content-fnref-web-utilization" data-footnote-ref="true" aria-describedby="footnote-label">1</a></sup></li>
<li>Request Queuing Time <sup><a href="#user-content-fn-web-queue" id="user-content-fnref-web-queue" data-footnote-ref="true" aria-describedby="footnote-label">2</a></sup></li>
<li>Service calls †
<ul>
<li>Database(s), caches, internal services, third-party APIs, ..</li>
<li>Enqueued jobs are important!</li>
<li><a href="https://sirupsen.com/napkin/problem-11-circuit-breakers">Circuit Breaker tripping</a> † <code>/min</code></li>
<li>Errors, throughput, latency <code>p50</code>, <code>p90</code>, <code>p99</code></li>
</ul>
</li>
<li>Throttling †</li>
<li>Cache hits and misses <code>%</code> †</li>
<li>CPU and Memory Utilization</li>
<li>Exception counts † <code>/min</code></li>
</ul>
</li>
<li><strong>Job Backend (e.g. Sidekiq, Celery, Bull, ..)</strong>
<ul>
<li>Job Execution Time <code>p50</code>, <code>p90</code>, <code>p99</code>, <code>sum</code>, <code>avg</code> †</li>
<li>Throughput by Job Status <code>{error, success, retry}</code> †</li>
<li>Worker Utilization <sup><a href="#user-content-fn-job-utilization" id="user-content-fnref-job-utilization" data-footnote-ref="true" aria-describedby="footnote-label">3</a></sup></li>
<li><strong>Time in Queue</strong> † <sup><a href="#user-content-fn-job-time-in-queue" id="user-content-fnref-job-time-in-queue" data-footnote-ref="true" aria-describedby="footnote-label">4</a></sup></li>
<li><strong>Queue Sizes</strong> † <sup><a href="#user-content-fn-job-queue-size" id="user-content-fnref-job-queue-size" data-footnote-ref="true" aria-describedby="footnote-label">5</a></sup>
<ul>
<li>Don’t forget scheduled jobs and retries!</li>
</ul>
</li>
<li>Service calls <code>p50</code>, <code>p90</code>, <code>p99</code>, <code>count</code>, <code>by type</code> †</li>
<li>Throttling †</li>
<li>CPU and Memory Utilization</li>
<li>Exception counts † <code>/min</code></li>
</ul>
</li>
</ul>
<p><em>† Metrics where you <strong>need</strong> the ability to slice by <code>endpoint</code> or <code>job</code>,
<code>tenant_id</code>, <code>app_id</code>, <code>worker_id</code>, <code>zone</code>, <code>hostname</code>, and <code>queue</code> (for jobs).
This is paramount to be able to figure out if it’s a single endpoint, tenant, or
app that’s causing problems.</em></p>
<p>You can likely cobble a workable chunk of this together from your existing
service provider and APM. The value is for you to know what metrics to pay
attention to, and which key ones you’re missing. The holy grail is <em>one</em>
dashboard for web, and one for job. The more incidents you have, the more
problematic it becomes that you need to visit a dozen URLs to get the metrics
you need.</p>
<p>If you have little of this and need somewhere to start, start with logs. They’re
the lowest common denominator, and if you’re productive in a good logging system
that will take you <em>very</em> far. You can build all these dashboards with logs alone.
Jumping into the detailed logs is usually the next step you take during an
incident if it’s not immediately clear what to do from the metrics.</p>
<p>Use the <a href="https://stripe.com/blog/canonical-log-lines">canonical log line pattern</a> (see figure below), resist
emitting random logs throughout the request as this makes analysis difficult. A
canonical log line is a log emitted at the end of the request with everything
that happened during the request. This makes querying the logs bliss.</p>
<figure><img src="/images/canonical-log-line.png" alt="An example of a canonical log line with a subset of the metrics above, generously provided by &lt;a href=&#x27;https://readwise.io&#x27;&gt;Readwise.io&lt;/a&gt;, who I helped set up canonical log lines for." title="An example of a canonical log line with a subset of the metrics above, generously provided by &lt;a href=&#x27;https://readwise.io&#x27;&gt;Readwise.io&lt;/a&gt;, who I helped set up canonical log lines for." width="864" height="1008" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>An example of a canonical log line with a subset of the metrics above, generously provided by <a href='https://readwise.io'>Readwise.io</a>, who I helped set up canonical log lines for.</figcaption></figure>
<p>Surprisingly, there aren’t good libraries available for the canonical log line
pattern, so I recommend rolling your own. Create a middleware in your job and
web stack to emit the log at the end of the request. If you need to accumulate
metrics throughout the request for the canonical log line, create a thread-local
dictionary for them that you flush in the middleware.</p>
<p>For response time from services, you will need to emit inline logs or metrics.
Consider using an <a href="https://opentelemetry.io/">OpenTelemetry library</a> so you only need to
instrument once and can later add sinks for canonical logs (the sum), metrics,
profiling, and traces.</p>
<p>Notably absent here is monitoring a database, which would take its own post.</p>
<p>Hope this helps you step up your monitoring game. If there’s a metric you feel
strongly that’s missing, please let me know!</p>
<section data-footnotes="true" class="footnotes"><h2 class="sr-only" id="footnote-label">Footnotes</h2>
<ol>
<li id="user-content-fn-web-utilization">
<p>This is one of my favorites. What percentage of threads are
currently busy? If this is <code>&gt;80%</code>, you will start to see counter-intuitive
queuing theory take hold, yielding strange response time patterns.<br/><br/>
It is given as <code>busy_threads / total_threads</code>. <a href="#user-content-fnref-web-utilization" data-footnote-backref="" aria-label="Back to reference 1" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-web-queue">
<p>How long are requests spending in TCP/proxy queues before being
picked up by a thread? Typically you get this by your load-balancer stamping
the request with a <code>X-Request-Start</code> header, then subtracting that from the
current time in the worker thread. <a href="#user-content-fnref-web-queue" data-footnote-backref="" aria-label="Back to reference 2" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-job-utilization">
<p>Same idea as web utilization, but in this case it’s OK for
it to be &gt; 80% for periods of time as jobs are by design allowed to be in the
queue for a while. The central metric for jobs becomes time in queue. <a href="#user-content-fnref-job-utilization" data-footnote-backref="" aria-label="Back to reference 3" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-job-time-in-queue">
<p>The central metric for monitoring a job stack is to know
how long jobs spend in the queue. That will be what you can use to answer
questions such as: Do I need more workers? When will I recover? What’s the
experience for my users right now? <a href="#user-content-fnref-job-time-in-queue" data-footnote-backref="" aria-label="Back to reference 4" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-job-queue-size">
<p>How large is your queue right now? It’s especially amazing to
be able to slice this by job and queue, but your canonical logs, which record
how much has been enqueued, are typically sufficient. <a href="#user-content-fnref-job-queue-size" data-footnote-backref="" aria-label="Back to reference 5" class="data-footnote-backref">↩</a></p>
</li>
</ol>
</section>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 18: Neural Network From Scratch]]></title>
        <id>https://sirupsen.com/napkin/neural-net</id>
        <link href="https://sirupsen.com/napkin/neural-net"/>
        <updated>2022-01-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[In this edition of Napkin Math, we’ll invoke the spirit of the Napkin Math
series to establish a mental model for how a neural network works by building
one from scratch. In a future issue we will do napkin math on performance, as
establishing the first-principle understanding is plenty of ground to cover for
today!
Neural nets are increasingly dominating the field of machine learning / artificial
intelligence: the most sophisticated models for computer vision (e.g. CLIP),
natural lang]]></summary>
        <content type="html"><![CDATA[<p>In this edition of Napkin Math, we’ll invoke the spirit of the Napkin Math
series to establish a mental model for how a neural network works by building
one from scratch. In a future issue we will do napkin math on performance, as
establishing the first-principle understanding is plenty of ground to cover for
today!</p>
<p>Neural nets are increasingly dominating the field of machine learning / artificial
intelligence: the most sophisticated models for computer vision (e.g. CLIP),
natural language processing (e.g. GPT-3), translation (e.g. Google Translate),
and more are based on neural nets. When these artificial neural nets reach some
arbitrary threshold of neurons, we call it <em>deep learning</em>.</p>
<p>A visceral example of Deep Learning’s unreasonable effectiveness comes from
<a href="https://www.listennotes.com/podcasts/the-twiml-ai/systems-and-software-for-xolUkM23Gb0/">this interview</a> with Jeff Dean who leads AI at Google. He explains how
500 lines of TensorFlow outperformed the previous ~500,000 lines of code for
Google Translate’s <em>extremely complicated</em> model. Blew my mind. <sup><a href="#user-content-fn-google" id="user-content-fnref-google" data-footnote-ref="true" aria-describedby="footnote-label">1</a></sup></p>
<p>As a software developer with a predominantly web-related skillset of Ruby,
databases, enough distributed systems knowledge to know to not get fancy, a bit
of hard-earned systems knowledge from debugging incidents, but only high school
level math: <em>neural networks mystify me</em>. How do they work? Why are they so
good? Why are they so slow? Why are GPUs/TPUs used to speed them up? Why do the
biggest models have more neurons than humans, yet still perform worse than the
human brain? <sup><a href="#user-content-fn-gpt3" id="user-content-fnref-gpt3" data-footnote-ref="true" aria-describedby="footnote-label">2</a></sup></p>
<p>In true napkin math fashion, the best course of action to answer those questions
is by implementing a simple neural net from scratch.</p>
<h2 id="mental-model-for-a-neural-net-building-one-from-scratch">Mental Model for a Neural Net: Building one from scratch</h2>
<p>The hardest part of napkin math isn’t the calculation itself: it’s acquiring the
conceptual understanding of a system to come up with an equation for its
performance. Presenting and testing mental models of common systems is the crux
of value from the napkin math series!</p>
<p>The simplest neural net we can draw might look something like this:</p>
<figure><img src="/images/napkin/problem-17-neural-nets/mental-model.jpg" alt="" width="687" height="598" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<ul>
<li><strong>Input layer</strong>. This is a representation of the data that we want to feed to
the neural net. For example, the input layer for a 4x4 pixel grayscale image
that looks like this <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[1, 1, 1, 0.2]</title><rect fill="#000000" x="0" y="0" width="1" height="1"></rect><rect fill="#000000" x="1" y="0" width="1" height="1"></rect><rect fill="#000000" x="0" y="1" width="1" height="1"></rect><rect fill="#E3E3E3" x="1" y="1" width="1" height="1"></rect></svg> could be <code>[1, 1, 1, 0.2]</code>. Meaning the first 3 pixels are darkest (1.0) and the last pixel is
lighter (0.2).</li>
<li><strong>Hidden Layer</strong>. This is the layer that does a bunch of math on the input
layer to convert it to our prediction. <em>Training</em> a model refers to changing the
math of the hidden layer(s) to more often create an output like the training
data. We will go into more detail with this layer in a moment. The values in the
hidden layer are called <em>weights</em>.</li>
<li><strong>Output Layer</strong>. This layer will contain our final prediction. For example,
if we feed it the rectangle from before <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[1, 1, 1, 0.2]</title><rect fill="#000000" x="0" y="0" width="1" height="1"></rect><rect fill="#000000" x="1" y="0" width="1" height="1"></rect><rect fill="#000000" x="0" y="1" width="1" height="1"></rect><rect fill="#E3E3E3" x="1" y="1" width="1" height="1"></rect></svg> we
might want the output layer to be a single number to represent how “dark” a
rectangle is, e.g.: <code>0.8</code>.</li>
</ul>
<p>For example for the image <code><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.8, 0.7, 1, 1]</title><rect fill="#4C4C4C" x="0" y="0" width="1" height="1"></rect><rect fill="#656565" x="1" y="0" width="1" height="1"></rect><rect fill="#000000" x="0" y="1" width="1" height="1"></rect><rect fill="#000000" x="1" y="1" width="1" height="1"></rect></svg> = [0.8, 0.7, 1, 1]</code> we’d expect a value close to 1 (dark!).</p>
<p>In contrast, for <code><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.2, 0.5, 0.4, 0.7]</title><rect fill="#E3E3E3" x="0" y="0" width="1" height="1"></rect><rect fill="#989898" x="1" y="0" width="1" height="1"></rect><rect fill="#B1B1B1" x="0" y="1" width="1" height="1"></rect><rect fill="#656565" x="1" y="1" width="1" height="1"></rect></svg> = [0.2, 0.5, 0.4, 0.7]</code> we
expect something closer to 0 than to 1.</p>
<p>Let’s implement a neural network from our simple mental model. The goal of this
neural network is to take a grayscale 2x2 image and tell us how “dark” it is
where 0 is completely white <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0, 0, 0, 0]</title><rect fill="#FFFFFF" x="0" y="0" width="1" height="1"></rect><rect fill="#FFFFFF" x="1" y="0" width="1" height="1"></rect><rect fill="#FFFFFF" x="0" y="1" width="1" height="1"></rect><rect fill="#FFFFFF" x="1" y="1" width="1" height="1"></rect></svg>, and 1 is
completely black <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[1, 1, 1, 1]</title><rect fill="#000000" x="0" y="0" width="1" height="1"></rect><rect fill="#000000" x="1" y="0" width="1" height="1"></rect><rect fill="#000000" x="0" y="1" width="1" height="1"></rect><rect fill="#000000" x="1" y="1" width="1" height="1"></rect></svg>. We will initialize the
hidden layer with some random values at first, in Python:</p>
<pre class="language-python"><code class="language-python">input_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span> <span class="token number">0.5</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.7</span><span class="token punctuation">]</span>
<span class="token comment"># We randomly initialize the weights (values) for the hidden layer... We will</span>
<span class="token comment"># need to &quot;train&quot; to make these weights give us the output layers we desire. We</span>
<span class="token comment"># will cover that shortly!</span>
hidden_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.98</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.86</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">0.08</span><span class="token punctuation">]</span>

output_neuron <span class="token operator">=</span> <span class="token number">0</span>
<span class="token comment"># This is really matrix multiplication. We explicitly _do not_ use a</span>
<span class="token comment"># matrix/tensor, because they add overhead to understanding what happens here</span>
<span class="token comment"># unless you work with them every day--which you probably don&#x27;t. More on using</span>
<span class="token comment"># matrices later.</span>
<span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>input_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>output_neuron<span class="token punctuation">)</span>
<span class="token comment"># =&gt; 0.68</span>
</code></pre>
<p>Our neural network is giving us <code>model(<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.2, 0.5, 0.4, 0.7]</title><rect fill="#E3E3E3" x="0" y="0" width="1" height="1"></rect><rect fill="#989898" x="1" y="0" width="1" height="1"></rect><rect fill="#B1B1B1" x="0" y="1" width="1" height="1"></rect><rect fill="#656565" x="1" y="1" width="1" height="1"></rect></svg>) = 0.7</code> which is closer to ‘dark’ (1.0) than ‘light’ (0.0). When looking
at this rectangle as a human, we judge it to be more bright than dark, so we
were expecting something below 0.5!</p>
<div class="_articleNotice_uux7j_123"><p>There’s a <a href="https://colab.research.google.com/drive/1YRp9k_ORH4wZMqXLNkc3Ir5w4B5f-8Pa?usp=sharing">notebook</a> with the final code available. You can make a copy and execute it there. For early versions of the code, such as the above, you can create a new cell at the beginning of the notebook and build up from there!</p></div>
<p>The only real thing we can change in our neural network in its current form is
the hidden layer’s values. How do we change the hidden layer values so that the
output neuron is close to 1 when the rectangle is dark, and close to 0 when it’s
light?</p>
<p>We could abandon this approach and just take the average of all the pixels. That
would work well! However, that’s not really the point of a neural net… We’ll
hit an impasse if we one day expand our model to try to implement
<code>recognize_letters_from_picture(img)</code> or <code>is_cat(img)</code>.</p>
<p>Fundamentally, a neural network is just a way to approximate any function. It’s
really hard to sit down and write <code>is_cat</code>, but the same technique we’re using
to implement <code>average</code> through a neural network can be used to implement
<code>is_cat</code>. This is called the <a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">universal approximation theorem</a>: an
artificial neural network can approximate <em>any</em> function!</p>
<p>So, let’s try to teach our simple neural network to take the <code>average()</code> of the
pixels instead of explicitly telling it that that’s what we want! The idea of
this walkthrough example is to understand a neural net with very few values and
low complexity, otherwise it’s difficult to develop an intuition when we move to
1,000s of values and 10s of layers, as real neural networks have.</p>
<p>We can observe that if we <em>manually modify</em> all the hidden layer attributes to
<code>0.25</code>, our neural network is actually an average function!</p>
<pre class="language-python"><code class="language-python">input_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span> <span class="token number">0.5</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.7</span><span class="token punctuation">]</span>
hidden_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.25</span><span class="token punctuation">,</span> <span class="token number">0.25</span><span class="token punctuation">,</span> <span class="token number">0.25</span><span class="token punctuation">,</span> <span class="token number">0.25</span><span class="token punctuation">]</span>

output_neuron <span class="token operator">=</span> <span class="token number">0</span>
<span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>input_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>

<span class="token comment"># Two simple ways of calculating the same thing!</span>
<span class="token comment">#</span>
<span class="token comment"># 0.2 * 0.25 + 0.5 * 0.25 + 0.4 * 0.25 + 0.7 * 25 = 0.45</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>output_neuron<span class="token punctuation">)</span>
<span class="token comment"># Here, we divide by 4 to get the average instead of</span>
<span class="token comment"># multiplying each element.</span>
<span class="token comment">#</span>
<span class="token comment"># (0.2 + 0.5 + 0.4 + 0.7) / 4 = 0.45</span>
<span class="token keyword">print</span><span class="token punctuation">(</span><span class="token builtin">sum</span><span class="token punctuation">(</span>input_layer<span class="token punctuation">)</span> <span class="token operator">/</span> <span class="token number">4</span><span class="token punctuation">)</span>
</code></pre>
<p><code>model(<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.2, 0.5, 0.4, 0.7]</title><rect fill="#E3E3E3" x="0" y="0" width="1" height="1"></rect><rect fill="#989898" x="1" y="0" width="1" height="1"></rect><rect fill="#B1B1B1" x="0" y="1" width="1" height="1"></rect><rect fill="#656565" x="1" y="1" width="1" height="1"></rect></svg>) = 0.45</code> sounds about right. The
rectangle is a little more light than dark.</p>
<p>But that was cheating! We only showed that we <em>can</em> implement <code>average()</code> by
simply changing the hidden layer’s values. But that won’t work if we try to implement
something more complicated. Let’s go back to our original hidden layer
initialized with random values:</p>
<pre class="language-python"><code class="language-python">hidden_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.98</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.86</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">0.08</span><span class="token punctuation">]</span>
</code></pre>
<p>How can we <em>teach</em> our neural network to implement <code>average</code>?</p>
<h2 id="training-our-neural-network">Training our Neural Network</h2>
<p>To teach our model, we need to create some training data. We’ll create some
rectangles and calculate their average:</p>
<pre class="language-python"><code class="language-python">rectangles <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
rectangle_average <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>

<span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Generate a 2x2 rectangle [0.1, 0.8, 0.6, 1.0]</span>
    rectangle <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token builtin">round</span><span class="token punctuation">(</span>random<span class="token punctuation">.</span>random<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
                 <span class="token builtin">round</span><span class="token punctuation">(</span>random<span class="token punctuation">.</span>random<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
                 <span class="token builtin">round</span><span class="token punctuation">(</span>random<span class="token punctuation">.</span>random<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
                 <span class="token builtin">round</span><span class="token punctuation">(</span>random<span class="token punctuation">.</span>random<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">]</span>
    rectangles<span class="token punctuation">.</span>append<span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span>
    <span class="token comment"># Take the _actual_ average for our training dataset!</span>
    rectangle_average<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token builtin">sum</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span> <span class="token operator">/</span> <span class="token number">4</span><span class="token punctuation">)</span>
</code></pre>
<p>Brilliant, so we can now feed these to our little neural network and get a
result! Next step is for our neural network to adjust the values in the hidden
layer based on how its output compares with the actual average in the training
data. This is called our <code>loss</code> function: large loss, very wrong model; small
loss, less wrong model. We can use a standard measure called <a href="https://en.wikipedia.org/wiki/Mean_squared_error"><em>mean squared
error</em></a>:</p>
<pre class="language-python"><code class="language-python"><span class="token comment"># Take the average of all the differences squared!</span>
<span class="token comment"># This calculates how &quot;wrong&quot; our predictions are.</span>
<span class="token comment"># This is called our &quot;loss&quot;.</span>
<span class="token keyword">def</span> <span class="token function">mean_squared_error</span><span class="token punctuation">(</span>actual<span class="token punctuation">,</span> expected<span class="token punctuation">)</span><span class="token punctuation">:</span>
    error_sum <span class="token operator">=</span> <span class="token number">0</span>
    <span class="token keyword">for</span> a<span class="token punctuation">,</span> b <span class="token keyword">in</span> <span class="token builtin">zip</span><span class="token punctuation">(</span>actual<span class="token punctuation">,</span> expected<span class="token punctuation">)</span><span class="token punctuation">:</span>
        error_sum <span class="token operator">+=</span> <span class="token punctuation">(</span>a <span class="token operator">-</span> b<span class="token punctuation">)</span> <span class="token operator">**</span> <span class="token number">2</span>
    <span class="token keyword">return</span> error_sum <span class="token operator">/</span> <span class="token builtin">len</span><span class="token punctuation">(</span>actual<span class="token punctuation">)</span>

<span class="token keyword">print</span><span class="token punctuation">(</span>mean_squared_error<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1.</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token number">2.</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># =&gt; 1.0</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>mean_squared_error<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1.</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token number">3.</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># =&gt; 4.0</span>
</code></pre>
<p>Now we can implement <code>train()</code>:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">model</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">=</span> <span class="token number">0.</span>
    <span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span><span class="token punctuation">:</span>
        output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>
    <span class="token keyword">return</span> output_neuron

<span class="token keyword">def</span> <span class="token function">train</span><span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
  outputs <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
  <span class="token keyword">for</span> rectangle <span class="token keyword">in</span> rectangles<span class="token punctuation">:</span>
      output <span class="token operator">=</span> model<span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>
      outputs<span class="token punctuation">.</span>append<span class="token punctuation">(</span>output<span class="token punctuation">)</span>
  <span class="token keyword">return</span> outputs

hidden_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.98</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.86</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">0.08</span><span class="token punctuation">]</span>
outputs <span class="token operator">=</span> train<span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>

<span class="token keyword">print</span><span class="token punctuation">(</span>outputs<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">:</span><span class="token number">10</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
<span class="token comment"># [1.472, 0.7, 1.369, 0.8879, 1.392, 1.244, 0.644, 1.1179, 0.474, 1.54]</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>rectangle_average<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">:</span><span class="token number">10</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
<span class="token comment"># [0.575, 0.45, 0.549, 0.35, 0.525, 0.475, 0.425, 0.65, 0.4, 0.575]</span>
mean_squared_error<span class="token punctuation">(</span>outputs<span class="token punctuation">,</span> rectangle_average<span class="token punctuation">)</span>
<span class="token comment"># 0.4218</span>
</code></pre>
<p>A good mean squared error is close to 0. Our model isn’t very good. But! We’ve
got the skeleton of a feedback loop in place for updating the hidden layer.</p>
<h3 id="updating-the-hidden-layer-with-gradient-descent">Updating the Hidden Layer with Gradient Descent</h3>
<p>Now what we need is a way to update the hidden layer in response to the mean
squared error / loss. We need to <em>minimize</em> the value of this function:</p>
<pre class="language-python"><code class="language-python">mean_squared_error<span class="token punctuation">(</span>
  train<span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">,</span>
  rectangle_average
<span class="token punctuation">)</span>
</code></pre>
<p>As noted earlier, the only thing we can really change here are the weights in
the hidden layer. How can we possibly know which weights will minimize this
function?</p>
<p>We could randomize the weights, calculate the loss (how wrong the model is,
in our case, with mean squared error), and then save the best ones we see after
some period of time.</p>
<p>We could possibly speed this up: start from the best weights so far, add some
small random numbers to them, and keep the change whenever the loss improves.
This could work, but it sounds slow… it’s likely to get stuck in a local
minimum and not give a very good result, and we’d have trouble scaling it to
1,000s of weights…</p>
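<p>This random-perturbation idea can be sketched in a few lines. This is a toy illustration only (it regenerates a small training set with a fixed seed), not how we’ll actually train:</p>

```python
import random

random.seed(0)

# Same tiny model as before: a weighted sum of the four pixel values.
def model(rectangle, weights):
    return sum(p * w for p, w in zip(rectangle, weights))

def mean_squared_error(actual, expected):
    return sum((a - b) ** 2 for a, b in zip(actual, expected)) / len(actual)

rectangles = [[round(random.random(), 1) for _ in range(4)] for _ in range(100)]
averages = [sum(r) / 4 for r in rectangles]

weights = [0.98, 0.4, 0.86, -0.08]
best_loss = mean_squared_error([model(r, weights) for r in rectangles], averages)

for _ in range(5000):
    # Nudge every weight by a small random amount...
    candidate = [w + random.uniform(-0.05, 0.05) for w in weights]
    loss = mean_squared_error([model(r, candidate) for r in rectangles], averages)
    # ...and keep the nudge only if the model got less wrong.
    if loss < best_loss:
        weights, best_loss = candidate, loss

print(best_loss)  # far below the initial ~0.42, but it took 5,000 guesses
```

<p>It gets there for four weights, but the number of guesses needed explodes as the weight count grows.</p>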
<p>Instead of embarking on this ad-hoc randomization mess, it turns out that
there’s a method called <em>gradient descent</em> to minimize the value of a function!
Gradient descent builds on a bit of calculus that you may not have touched on
since high school. We won’t go into depth here, but will try to introduce <em>just</em>
enough that you understand the concept. <sup><a href="#user-content-fn-3blue1brown" id="user-content-fnref-3blue1brown" data-footnote-ref="true" aria-describedby="footnote-label">3</a></sup></p>
<p>Let’s try to understand gradient descent. Consider some random function whose
graph might look like this:</p>
<figure><img src="/images/napkin/problem-17-neural-nets/function.png" alt="Graph of a random function with some irregular shapes" title="Graph of a function with an irregular curve with a local and global minimum." width="775" height="485" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Graph of a function with an irregular curve with a local and global minimum.</figcaption></figure>
<p>How do we write code to find the minimum, the deepest (second) valley, of this function?</p>
<p>Let’s say that we’re at <code>x=1</code> and we know the <em>slope</em> of the function at this
point. The slope is “how fast the function grows at this very point.” You may
remember this as <em>the derivative</em>. The slope at <code>x=1</code> might be <code>-1.5</code>. This
means that around this point, increasing <code>x</code> by <code>1</code> decreases <code>y</code> by about <code>1.5</code>. We’ll go
into how to figure out the slope in a bit; let’s focus on the concept first.</p>
<figure><img src="/images/napkin/problem-17-neural-nets/function-with-slope.png" alt="Graph function with some slope or derivative" width="1934" height="1256" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>The idea of gradient descent is that since we know the value of our function,
<code>y</code>, is decreasing as we increase <code>x</code>, we can move <code>x</code> proportionally to the
slope, against its sign. In other words, if we increase <code>x</code> by <code>1.5</code>, the
negative of the slope, we step towards the valley.</p>
<p>Let’s take that step of <code>x += 1.5</code>:</p>
<figure><img src="/images/napkin/problem-17-neural-nets/gradient-descent-overshoot.png" alt="Overshooting in gradient descent" width="1934" height="1256" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Ugh, it turns out we stepped <em>too</em> far, past the valley! If we repeat the
step, we’ll land somewhere on the left side of the valley, then bounce back to
the right side. We might <em>never</em> land at the bottom of the valley. Bummer.
Either way, this isn’t the <em>global minimum</em> of the function. We’ll return to that
in a moment!</p>
<p>We can fix the overstepping easily by taking smaller steps. Perhaps we should’ve
stepped by just <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.1</mn><mo>∗</mo><mn>1.5</mn><mo>=</mo><mn>0.15</mn></mrow><annotation encoding="application/x-tex">0.1 * 1.5 = 0.15</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.1</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1.5</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.15</span></span></span></span> instead. That would’ve smoothly landed us at
the bottom of the valley. That multiplier, <code>0.1</code>, is called the <em>learning rate</em>
in gradient descent.</p>
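<p>With a learning rate in hand, the whole gradient descent loop fits in a few lines. As a sketch, let’s minimize <code>f(x) = x ** 2</code>, taking on faith for now that its slope at any <code>x</code> is <code>2 * x</code> (more on computing slopes shortly):</p>

```python
def slope(x):
    return 2 * x  # slope of f(x) = x ** 2 at this point

x = 2.0
learning_rate = 0.1
for _ in range(100):
    # Step against the slope, scaled down by the learning rate.
    x -= learning_rate * slope(x)

print(x)  # very close to 0.0, the bottom of the valley
```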
<figure><img src="/images/napkin/problem-17-neural-nets/minimum.png" alt="Minimum of function with gradient descent" width="1934" height="1256" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>But hang on, that’s not actually the minimum of the function. See that valley to
the right? That’s the <em>actual</em> global minimum. If our initial <code>x</code> value had been
e.g. 3, we might have found the global minimum instead of our local minimum.</p>
<p>Finding the global minimum of a function is <em>hard</em>. Gradient descent will give
us <em>a minimum</em>, but not <em>the minimum</em>. Unfortunately, it turns out it’s the best
weapon we have at our disposal. Especially when we have big, complicated
functions (like a neural net with millions of neurons). Gradient descent will
not always find the global minimum, but something <em>pretty</em> good.</p>
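<p>We can see this with a made-up one-dimensional function that has two valleys, <code>f(x) = (x ** 2 - 1) ** 2 + 0.3 * x</code> (its slope, <code>4 * x * (x ** 2 - 1) + 0.3</code>, is taken as a given here). Where we start determines which valley gradient descent rolls into:</p>

```python
def slope(x):
    # Slope of f(x) = (x ** 2 - 1) ** 2 + 0.3 * x, which has two valleys.
    return 4 * x * (x ** 2 - 1) + 0.3

def descend(x, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        x -= learning_rate * slope(x)
    return x

print(descend(-2.0))  # lands in the left valley, near x = -1.0 (the global minimum)
print(descend(2.0))   # lands in the right, shallower valley, near x = 1.0
```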
<p>This method of using the slope/derivative generalizes. For example, consider
optimizing a function in three dimensions. We can visualize gradient descent
here as <em>rolling a ball to the lowest point.</em> A big neural network has
1,000s of dimensions, but gradient descent still works to minimize the loss!</p>
<figure><img src="/images/napkin/problem-17-neural-nets/descent-3d.png" alt="Depicts a 3-dimensional graph, if we do gradient descent on this we might imagine it as rolling a ball down the hill." title="Depicts a 3-dimensional graph, if we do gradient descent on this we might imagine it as rolling a ball down the hill." width="760" height="624" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Depicts a 3-dimensional graph, if we do gradient descent on this we might imagine it as rolling a ball down the hill.</figcaption></figure>
<h2 id="finalizing-our-neural-network-from-scratch">Finalizing our Neural Network from scratch</h2>
<p>Let’s summarize where we are:</p>
<ul>
<li>We can implement a simple neural net: <code>model()</code>.</li>
<li>Our neural net can figure out how <em>wrong</em> it is for a training set: <code>loss(train())</code>.</li>
<li>We have a method, <em>gradient descent</em>, for tuning our hidden layer’s weights
for the minimum loss. I.e. we have a method to adjust those four random values
in our hidden layer to take a <em>better</em> average as we iterate through the
training data.</li>
</ul>
<p>Now, let’s implement gradient descent and see if we can make our neural net
learn to take the average grayscale of our small rectangles:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">model</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">=</span> <span class="token number">0.</span>
    <span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span><span class="token punctuation">:</span>
        output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>
    <span class="token keyword">return</span> output_neuron

<span class="token keyword">def</span> <span class="token function">train</span><span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
  outputs <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
  <span class="token keyword">for</span> rectangle <span class="token keyword">in</span> rectangles<span class="token punctuation">:</span>
      output <span class="token operator">=</span> model<span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>
      outputs<span class="token punctuation">.</span>append<span class="token punctuation">(</span>output<span class="token punctuation">)</span>

  mean_squared_error<span class="token punctuation">(</span>outputs<span class="token punctuation">,</span> rectangle_average<span class="token punctuation">)</span>

  <span class="token comment"># We go through all the weights in the hidden layer. These correspond to all</span>
  <span class="token comment"># the weights of the function we&#x27;re trying to minimize the value of: our</span>
  <span class="token comment"># model, respective of its loss (how wrong it is).</span>
  <span class="token comment"># </span>
  <span class="token comment"># For each of the weights, we want to increase/decrease it based on the slope.</span>
  <span class="token comment"># Exactly like we showed in the one-weight example above with just x. Now</span>
  <span class="token comment"># we just have 4 values instead of 1! Big models have billions.</span>
  <span class="token keyword">for</span> index<span class="token punctuation">,</span> _ <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    learning_rate <span class="token operator">=</span> <span class="token number">0.1</span>
    <span class="token comment"># But... how do we get the slope/derivative?!</span>
    hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span> <span class="token operator">-=</span> learning_rate <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span><span class="token punctuation">.</span>slope

  <span class="token keyword">return</span> outputs

hidden_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.98</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.86</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">0.08</span><span class="token punctuation">]</span>
train<span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>
</code></pre>
<h3 id="automagically-computing-the-slope-of-a-function-with-autograd">Automagically computing the slope of a function with <code>autograd</code></h3>
<p>The missing piece here is to figure out the <code>slope()</code> after we’ve gone through
our training set. Figuring out the slope/derivative at a certain point is
tricky. It involves a fair bit of math. I am not going to go into the math of
calculating derivatives. Instead, we’ll do what all the machine learning
libraries do: automatically calculate it. <sup><a href="#user-content-fn-nielsen" id="user-content-fnref-nielsen" data-footnote-ref="true" aria-describedby="footnote-label">4</a></sup></p>
<p>Minimizing the loss of a function is absolutely fundamental to machine learning.
The functions (neural networks) are <em>so</em> complicated that manually sitting down
to figure out the derivative like you might’ve done in high school is not
feasible. It’s the mathematical equivalent of writing assembly to implement a
website.</p>
<p>Let’s show one simple example of finding the derivative of a function, before we
let the computers do it all for us. If we have <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><msup><mi>x</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">f(x) = x^2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.8141em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span>, then you might
remember from calculus classes that the derivative is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>f</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><mn>2</mn><mi>x</mi></mrow><annotation encoding="application/x-tex">f&#x27;(x) = 2x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0019em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7519em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span></span></span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">2</span><span class="mord mathnormal">x</span></span></span></span>. In other
words, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">f(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span>‘s slope at any point is <code>2x</code>, telling us it’s increasing
non-linearly. Well that’s exactly how we understand <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>x</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">x^2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8141em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span>, perfect! This means
that for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi><mo>=</mo><mn>2</mn></mrow><annotation encoding="application/x-tex">x = 2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">x</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">2</span></span></span></span> the slope is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>4</mn></mrow><annotation encoding="application/x-tex">4</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">4</span></span></span></span>.</p>
<p>With the basics in order, we can use an <code>autograd</code> package to avoid the messy
business of computing our own derivatives. <code>autograd</code> is an <em>automatic
differentiation engine</em>. <em>grad</em> stands for <em>gradient</em>, which we can think of as the
derivative/slope of a function with more than one parameter.</p>
<p>It’s best to show how it works by using our example from before:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">import</span> torch

<span class="token comment"># A tensor is a matrix in PyTorch. It is the fundamental data-structure of neural</span>
<span class="token comment"># networks. Here we say PyTorch, please keep track of the gradient/derivative</span>
<span class="token comment"># as I do all kinds of things to the parameter(s) of this tensor.</span>
x <span class="token operator">=</span> torch<span class="token punctuation">.</span>tensor<span class="token punctuation">(</span><span class="token number">2.</span><span class="token punctuation">,</span> requires_grad<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>

<span class="token comment"># At this point we&#x27;re applying our function f(x) = x^2.</span>
y <span class="token operator">=</span> x <span class="token operator">**</span> <span class="token number">2</span>

<span class="token comment"># This tells `autograd` to compute the derivative values for all the parameters</span>
<span class="token comment"># involved. Backward is neural network jargon for this operation, which we&#x27;ll</span>
<span class="token comment"># explain momentarily.</span>
y<span class="token punctuation">.</span>backward<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token comment"># And show us the lovely gradient/derivative, which is 4! Sick.</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>x<span class="token punctuation">.</span>grad<span class="token punctuation">)</span>
<span class="token comment"># =&gt; 4</span>
</code></pre>
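<p>To double-check that <code>autograd</code> got it right, we can compare its answer against the analytical derivative <code>2x</code> at a few points (a small sketch of mine, not from the original notebook):</p>

```python
import torch

# f(x) = x^2 has the derivative f'(x) = 2x; autograd should agree at any point.
for value in [0.5, 2.0, -3.0]:
    x = torch.tensor(value, requires_grad=True)
    y = x ** 2
    y.backward()
    assert x.grad.item() == 2 * value
```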
<p><code>autograd</code> is the closest to magic we get. I could do the most ridiculous stuff
with this tensor, and it’ll keep track of all the math operations applied and
have the ability to compute the derivative. We won’t go into how. Partly because
I don’t know how, and this post is long enough.</p>
<p>Just to convince you of this, we can be a little cheeky and do a bunch of random
stuff. I’m trying to really hammer this home, because this is what confused me
the most when learning about neural networks. It wasn’t obvious to me that a
neural network, including executing the loss function on the whole training set,
is <em>just</em> a function, and that however complicated it is, we can still take
its derivative and use gradient descent, even if it has so many dimensions that it can’t be
neatly visualized as a ball rolling down a hill.</p>
<p><code>autograd</code> doesn’t complain as we add complexity and will still calculate the
gradients. In this example we’ll even use a matrix/tensor with a few more elements and
calculate an average (like our loss function <code>mean_squared_error</code>), which is the
kind of thing we’ll calculate the gradients for in our neural network:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">import</span> random
<span class="token keyword">import</span> torch

x <span class="token operator">=</span> torch<span class="token punctuation">.</span>tensor<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span> <span class="token number">0.3</span><span class="token punctuation">,</span> <span class="token number">0.8</span><span class="token punctuation">,</span> <span class="token number">0.1</span><span class="token punctuation">]</span><span class="token punctuation">,</span> requires_grad<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>
y <span class="token operator">=</span> x

<span class="token keyword">for</span> _ <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    choice <span class="token operator">=</span> random<span class="token punctuation">.</span>randint<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">2</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> choice <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">:</span>
        y <span class="token operator">=</span> y <span class="token operator">**</span> random<span class="token punctuation">.</span>randint<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token number">10</span><span class="token punctuation">)</span>
    <span class="token keyword">elif</span> choice <span class="token operator">==</span> <span class="token number">1</span><span class="token punctuation">:</span>
        y <span class="token operator">=</span> y<span class="token punctuation">.</span>sqrt<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">elif</span> choice <span class="token operator">==</span> <span class="token number">2</span><span class="token punctuation">:</span>
        y <span class="token operator">=</span> y<span class="token punctuation">.</span>atanh<span class="token punctuation">(</span><span class="token punctuation">)</span>

y <span class="token operator">=</span> y<span class="token punctuation">.</span>mean<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token comment"># This walks &quot;backwards&quot; y all the way to the parameters to</span>
<span class="token comment"># calculate the derivates / gradient! Pytorch keeps track of a graph of all the</span>
<span class="token comment"># operations.</span>
y<span class="token punctuation">.</span>backward<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token comment"># And here is how quickly the function is changing with respect to these</span>
<span class="token comment"># parameters for our randomized function.</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>x<span class="token punctuation">.</span>grad<span class="token punctuation">)</span>
<span class="token comment"># =&gt; tensor([0.0157, 0.0431, 0.6338, 0.0028])</span>
</code></pre>
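<p>If this still feels like magic, one way to sanity-check it (my own sketch, using a deterministic chain of operations rather than the random one above) is to compare <code>autograd</code>’s gradient against a numerical finite difference:</p>

```python
import torch

def f(t):
    # An arbitrary chain of operations, in the spirit of the random example.
    return (t.sqrt() ** 3 + t).mean()

x = torch.tensor([0.2, 0.3, 0.8, 0.1], dtype=torch.float64, requires_grad=True)
f(x).backward()

# Nudge the first parameter up and down a tiny bit and measure the slope,
# the same "take a small step and see how much f changes" intuition as before.
h = 1e-6
up = x.detach().clone(); up[0] += h
down = x.detach().clone(); down[0] -= h
numeric = (f(up) - f(down)) / (2 * h)

assert abs(x.grad[0].item() - numeric.item()) < 1e-6
```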
<p>Let’s use <code>autograd</code> for our neural net and then run it against our square from
earlier <code>model(<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.2, 0.5, 0.4, 0.7]</title><rect fill="#E3E3E3" x="0" y="0" width="1" height="1"></rect><rect fill="#989898" x="1" y="0" width="1" height="1"></rect><rect fill="#B1B1B1" x="0" y="1" width="1" height="1"></rect><rect fill="#656565" x="1" y="1" width="1" height="1"></rect></svg>) = 0.45</code>:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">import</span> torch <span class="token keyword">as</span> torch

<span class="token keyword">def</span> <span class="token function">model</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">=</span> <span class="token number">0.</span>
    <span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span><span class="token punctuation">:</span>
        output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>
    <span class="token keyword">return</span> output_neuron

<span class="token keyword">def</span> <span class="token function">train</span><span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
  outputs <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
  <span class="token keyword">for</span> rectangle <span class="token keyword">in</span> rectangles<span class="token punctuation">:</span>
      output <span class="token operator">=</span> model<span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>
      outputs<span class="token punctuation">.</span>append<span class="token punctuation">(</span>output<span class="token punctuation">)</span>

  <span class="token comment"># How wrong were we? Our &#x27;loss.&#x27;</span>
  error <span class="token operator">=</span> mean_squared_error<span class="token punctuation">(</span>outputs<span class="token punctuation">,</span> rectangle_average<span class="token punctuation">)</span>

  <span class="token comment"># Calculate the gradient (the derivative for all our weights!)</span>
  <span class="token comment"># This walks &quot;backwards&quot; from the error all the way to the weights to</span>
  <span class="token comment"># calculate it.</span>
  error<span class="token punctuation">.</span>backward<span class="token punctuation">(</span><span class="token punctuation">)</span>

  <span class="token comment"># Now let&#x27;s go update the weights in our hidden layer per our gradient.</span>
  <span class="token comment"># This is what we discussed before: we want to find the valley of this</span>
  <span class="token comment"># four-dimensional space/four-weight function. This is gradient descent!</span>
  <span class="token keyword">for</span> index<span class="token punctuation">,</span> _ <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    learning_rate <span class="token operator">=</span> <span class="token number">0.1</span>
    <span class="token comment"># hidden_layer.grad is something like [0.7070, 0.6009, 0.6840, 0.5302]</span>
    hidden_layer<span class="token punctuation">.</span>data<span class="token punctuation">[</span>index<span class="token punctuation">]</span> <span class="token operator">-=</span> learning_rate <span class="token operator">*</span> hidden_layer<span class="token punctuation">.</span>grad<span class="token punctuation">.</span>data<span class="token punctuation">[</span>index<span class="token punctuation">]</span>

  <span class="token comment"># We have to tell `autograd` that we&#x27;ve just finished an epoch to reset.</span>
  <span class="token comment"># Otherwise it&#x27;d calculate the derivative from multiple epochs.</span>
  hidden_layer<span class="token punctuation">.</span>grad<span class="token punctuation">.</span>zero_<span class="token punctuation">(</span><span class="token punctuation">)</span>
  <span class="token keyword">return</span> error

<span class="token comment"># We use tensors now, but we just use them as if they were normal lists.</span>
<span class="token comment"># We only use them so we can get the gradients.</span>
hidden_layer <span class="token operator">=</span> torch<span class="token punctuation">.</span>tensor<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.98</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.86</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">0.08</span><span class="token punctuation">]</span><span class="token punctuation">,</span> requires_grad<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>

<span class="token keyword">print</span><span class="token punctuation">(</span>model<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span><span class="token number">0.5</span><span class="token punctuation">,</span><span class="token number">0.4</span><span class="token punctuation">,</span><span class="token number">0.7</span><span class="token punctuation">]</span><span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># =&gt; 0.6840000152587891</span>

train<span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>

<span class="token comment"># The hidden layer&#x27;s weights are nudging closer to [0.25, 0.25, 0.25, 0.25]!</span>
<span class="token comment"># They are now [ 0.9093,  0.3399,  0.7916, -0.1330]</span>
<span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f&quot;After: </span><span class="token interpolation"><span class="token punctuation">{</span>model<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span><span class="token number">0.5</span><span class="token punctuation">,</span><span class="token number">0.4</span><span class="token punctuation">,</span><span class="token number">0.7</span><span class="token punctuation">]</span><span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">}</span></span><span class="token string">&quot;</span></span><span class="token punctuation">)</span>
<span class="token comment"># =&gt; 0.5753424167633057</span>
<span class="token comment"># The average of this rectangle is 0.45, closer... but not there yet</span>
</code></pre>
<p>This blew my mind the first time I did this. Look at that. It’s optimizing the
hidden layer for all weights in the right direction! We’re expecting them all
to nudge towards <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.25</mn></mrow><annotation encoding="application/x-tex">0.25</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.25</span></span></span></span> to implement <code>average()</code>. We haven’t told it <em>anything</em>
about average, we’ve just told it how wrong it is through the loss.</p>
<div class="_articleNotice_uux7j_123"><p>It’s important to understand how <code>hidden_layer.grad</code> is set here. The
hidden layer is instantiated as a tensor with an argument telling PyTorch to
keep track of all operations made to it. This allows us to later call <code>backward()</code> on a future tensor that derives from the hidden layer,
in this case, the <code>error</code> tensor, which is in turn derived from the
<code>outputs</code> tensor. You can read more in <a href="https://pytorch.org/docs/1.9.1/generated/torch.Tensor.backward.html">the documentation</a>.</p></div>
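<p>To make this concrete: every tensor derived from <code>hidden_layer</code> carries a <code>grad_fn</code> pointing back at the operation that produced it, and <code>backward()</code> walks that chain (a small sketch of mine):</p>

```python
import torch

hidden_layer = torch.tensor([0.98, 0.4, 0.86, -0.08], requires_grad=True)
rectangle = torch.tensor([0.2, 0.5, 0.4, 0.7])

output = (rectangle * hidden_layer).sum()  # like model()
error = (output - 0.45) ** 2               # like the loss, for one sample

# Each derived tensor remembers how it was made...
print(output.grad_fn)  # a SumBackward0 node
print(error.grad_fn)   # a PowBackward0 node

# ...and backward() follows those nodes back to fill in hidden_layer.grad.
error.backward()
print(hidden_layer.grad)
```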
<p><em>But</em>, the hidden layer isn’t all <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.25</mn></mrow><annotation encoding="application/x-tex">0.25</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.25</span></span></span></span> quite yet, as it must be to
implement <code>average</code>. So how do we get the weights there? Well, let’s repeat
the gradient descent process 100 times and see if we get even closer!</p>
<pre class="language-python"><code class="language-python"><span class="token comment"># An epoch is a training pass over the full data set!</span>
<span class="token keyword">for</span> epoch <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">100</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
   error <span class="token operator">=</span> train<span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>
   <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f&quot;Epoch: </span><span class="token interpolation"><span class="token punctuation">{</span>epoch<span class="token punctuation">}</span></span><span class="token string">, Error: </span><span class="token interpolation"><span class="token punctuation">{</span>error<span class="token punctuation">}</span></span><span class="token string">, Layer: </span><span class="token interpolation"><span class="token punctuation">{</span>hidden_layer<span class="token punctuation">.</span>data<span class="token punctuation">}</span></span><span class="token string">\n\n&quot;</span></span><span class="token punctuation">)</span>
   <span class="token comment"># </span>
   <span class="token comment">#  Epoch: 99, Error: 0.0019292341312393546, Layer: tensor([0.3251, 0.2291, 0.3075, 0.1395])</span>


<span class="token keyword">print</span><span class="token punctuation">(</span>model<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span><span class="token number">0.5</span><span class="token punctuation">,</span><span class="token number">0.4</span><span class="token punctuation">,</span><span class="token number">0.7</span><span class="token punctuation">]</span><span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">.</span>item<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># =&gt; 0.4002</span>
</code></pre>
<p>Pretty close, but not quite there. I ran it for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>300</mn></mrow><annotation encoding="application/x-tex">300</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">300</span></span></span></span> epochs instead (an
iteration over the full training set is referred to as an epoch), and
got:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">print</span><span class="token punctuation">(</span>model<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span><span class="token number">0.5</span><span class="token punctuation">,</span><span class="token number">0.4</span><span class="token punctuation">,</span><span class="token number">0.7</span><span class="token punctuation">]</span><span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">.</span>item<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># Epoch: 299, Error: 1.8315197394258576e-06, Layer: tensor([0.2522, 0.2496, 0.2518, 0.2465])</span>
<span class="token comment"># tensor(0.4485, grad_fn=&lt;AddBackward0&gt;)</span>
</code></pre>
<p>Boom! Our neural net has <em>almost</em> learned to take the average, off by just a
scanty <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.002</mn></mrow><annotation encoding="application/x-tex">0.002</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.002</span></span></span></span>. If we fine-tuned the learning rate and number of epochs we could
probably get it there, but I’m happy with this. <code>model(<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.2, 0.5, 0.4, 0.7]</title><rect fill="#E3E3E3" x="0" y="0" width="1" height="1"></rect><rect fill="#989898" x="1" y="0" width="1" height="1"></rect><rect fill="#B1B1B1" x="0" y="1" width="1" height="1"></rect><rect fill="#656565" x="1" y="1" width="1" height="1"></rect></svg>) = 0.448</code>.</p>
<p>That’s it. That’s your first neural net:</p>
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi><mo stretchy="false">(</mo><mi>r</mi><mi>e</mi><mi>c</mi><mi>t</mi><mi>a</mi><mi>n</mi><mi>g</mi><mi>l</mi><mi>e</mi><mo stretchy="false">)</mo><mo>≈</mo><mi>a</mi><mi>v</mi><mi>g</mi><mo stretchy="false">(</mo><mi>r</mi><mi>e</mi><mi>c</mi><mi>t</mi><mi>a</mi><mi>n</mi><mi>g</mi><mi>l</mi><mi>e</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">model(rectangle) \approx avg(rectangle)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">m</span><span class="mord mathnormal">o</span><span class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-right:0.01968em">l</span><span class="mopen">(</span><span class="mord mathnormal">rec</span><span class="mord mathnormal">t</span><span class="mord mathnormal">an</span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mord mathnormal" style="margin-right:0.01968em">l</span><span class="mord mathnormal">e</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.03588em">vg</span><span class="mopen">(</span><span class="mord mathnormal">rec</span><span class="mord mathnormal">t</span><span class="mord mathnormal">an</span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mord mathnormal" style="margin-right:0.01968em">l</span><span class="mord mathnormal">e</span><span 
class="mclose">)</span></span></span></span>
<h2 id="ok-so-you-just-implemented-the-most-complicated-average-function-ive-ever-seen">OK, so you just implemented the most complicated <code>average</code> function I’ve ever seen…</h2>
<p>Sure did. The thing is, if we adjusted it to look for cats, it’s the
least complicated <code>is_cat</code> you’ll ever see, because our neural network could
implement that too, just by changing the training data. Remember, a neural network
with enough neurons can approximate <em>any function</em>. You’ve just learned all the
building blocks to do it; we merely started with the simplest possible example.</p>
<p>If you give the hidden layer some more neurons, this neural net will be able to
recognize <a href="http://yann.lecun.com/exdb/mnist/">handwritten numbers</a> with decent accuracy (possible fun
exercise for you, see bottom of article), like this one:</p>
<figure><img src="/images/napkin/problem-17-neural-nets/mnist-sample.png" alt="An upscaled version of a handdrawn 3 from the 28x28 MNIST dataset." title="An upscaled version of a handdrawn 3 from the 28x28 MNIST dataset." width="200" height="200" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>An upscaled version of a handdrawn 3 from the 28x28 MNIST dataset.</figcaption></figure>
<h3 id="activation-functions">Activation Functions</h3>
<p>To be truly powerful, there is one paramount modification we have to make to our
neural net. Above, we were implementing the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>a</mi><mi>v</mi><mi>e</mi><mi>r</mi><mi>a</mi><mi>g</mi><mi>e</mi></mrow><annotation encoding="application/x-tex">average</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord mathnormal" style="margin-right:0.02778em">er</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mord mathnormal">e</span></span></span></span> function. However, were
our neural net to implement <code>which_digit(png)</code> or <code>is_cat(jpg)</code>, it wouldn’t work.</p>
<p>Recognizing handwritten digits isn’t a <em>linear</em> function like <code>average()</code>. It’s
non-linear: a crazy function with a crazy shape. To create crazy functions with
crazy shapes, we have to introduce a non-linear component to our neural
network. This is called an <em>activation</em>
function. It can be e.g. <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>R</mi><mi>e</mi><mi>L</mi><mi>u</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><mi>m</mi><mi>a</mi><mi>x</mi><mo stretchy="false">(</mo><mn>0</mn><mo separator="true">,</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">ReLu(x) = max(0, x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.00773em">R</span><span class="mord mathnormal">e</span><span class="mord mathnormal">Lu</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span><span class="mopen">(</span><span class="mord">0</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span>. There are many kinds of
<a href="https://en.wikipedia.org/wiki/Activation_function">activation functions</a> that are good for different things.
<sup><a href="#user-content-fn-activation" id="user-content-fnref-activation" data-footnote-ref="true" aria-describedby="footnote-label">5</a></sup></p>
<figure><img src="/images/napkin/problem-17-neural-nets/relu.png" alt="" width="400" height="333" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>We can apply this simple operation to our neural net:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">model</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">=</span> <span class="token number">0.</span>
    <span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span><span class="token punctuation">:</span>
        output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>
    <span class="token keyword">return</span> <span class="token builtin">max</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> output_neuron<span class="token punctuation">)</span>
</code></pre>
<p>Now, we only have a single output neuron and a handful of weights… that isn’t
much. Good models have hundreds, and the biggest models, like GPT-3, have
billions. So this won’t recognize many digits or cats, but you can easily add
more weights!</p>
<h3 id="matrices">Matrices</h3>
<p>The core operation in our model, the for loop, is a matrix multiplication. We
could rewrite it to use matrices instead, e.g. <code>rectangle @ hidden_layer</code>. PyTorch will
then do the exact same thing, except it’ll now execute in C-land. And if you
have a GPU and add some more weights, it’ll execute on the GPU, which is even
faster. When doing any kind of deep learning, you want to avoid writing
Python loops; they’re just too slow. If you run the code above for 300
epochs, you’ll see that it takes minutes to complete. I left matrices out of this
post to simplify the explanation as much as possible. There’s plenty going on
without them.</p>
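<p>A quick sketch of that rewrite (the <code>@</code> operator is PyTorch’s matrix multiplication; tensor values taken from the earlier examples):</p>

```python
import torch

hidden_layer = torch.tensor([0.98, 0.4, 0.86, -0.08], requires_grad=True)
rectangle = torch.tensor([0.2, 0.5, 0.4, 0.7])

# The Python for loop from model()...
looped = 0.
for index, input_neuron in enumerate(rectangle):
    looped += input_neuron * hidden_layer[index]

# ...is a single matrix (here, dot) product, executed outside of Python.
matmul = rectangle @ hidden_layer

print(matmul.item())  # prints (essentially) the same number as the loop
print(looped.item())
```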
<h2 id="next-steps-to-implement-your-own-neural-net-from-scratch">Next steps to implement your own neural net from scratch</h2>
<p>Even if you’ve carefully read through this article, you won’t fully grasp it
until you’ve gotten your own hands on it. Here are some suggestions on where to
go from here, if you’d like to move beyond the basic understanding you have now:</p>
<ol>
<li>Get the <a href="https://colab.research.google.com/drive/1YRp9k_ORH4wZMqXLNkc3Ir5w4B5f-8Pa?usp=sharing">notebook</a> running and study the code</li>
<li>Change it to far larger rectangles, e.g. 100x100</li>
<li>Add biases in addition to the weights. A model doesn’t just have
weights that are multiplied onto the inputs, but also biases that are added
(<code>+</code>) onto the inputs in each layer.</li>
<li>Rewrite the model to use <a href="https://pytorch.org/docs/stable/tensors.html">PyTorch tensors</a> for matrix operations, as
described in the previous section.</li>
<li>Add 1-2 more layers to the model. Try to have them have different sizes.</li>
<li>Change the tensors to run on GPU (see the <a href="https://pytorch.org/docs/stable/notes/cuda.html">PyTorch
documentation</a>) and see the
performance speed up! Increase the size of the training set and rectangles to
<em>really</em> be able to tell the difference. Make sure you change <code>Runtime &gt; Change Runtime Type</code> in Colab to run on a GPU.</li>
<li>This is a difficult step that will likely take a while, but it’ll be well
worth it: Adapt the code to recognize handwritten digits from the <a href="https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz">MNIST
dataset</a>. You’ll need to use <a href="https://pillow.readthedocs.io/en/stable/"><code>pillow</code></a> to turn
the pixels into a large 1-dimensional tensor as the input layer, as well as a
non-linear activation function like <code>Sigmoid</code> or <code>ReLU</code>. Use <a href="http://neuralnetworksanddeeplearning.com/">Nielsen’s
book</a> as a reference if you get stuck, which does exactly this.</li>
</ol>
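<p>As a starting point for suggestions 3–6, here is one way a two-layer model with biases and a ReLU activation could look in tensor form (a sketch under my own naming, not code from the notebook):</p>

```python
import torch

def model(inputs, w1, b1, w2, b2):
    # Layer 1: matrix multiply, add the biases, then the non-linearity.
    hidden = torch.relu(inputs @ w1 + b1)
    # Layer 2: combine the hidden activations into a single output.
    return hidden @ w2 + b2

torch.manual_seed(0)
w1 = torch.randn(4, 8, requires_grad=True)  # 4 inputs -> 8 hidden neurons
b1 = torch.zeros(8, requires_grad=True)
w2 = torch.randn(8, requires_grad=True)     # 8 hidden neurons -> 1 output
b2 = torch.zeros((), requires_grad=True)

out = model(torch.tensor([0.2, 0.5, 0.4, 0.7]), w1, b1, w2, b2)
out.backward()  # all four parameter tensors now have .grad filled in
```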
<p>I hope you thoroughly enjoyed this walkthrough of a neural net from scratch! In
a future issue we’ll use the mental model we’ve built up here to do some napkin
math on the expected performance of training and using neural nets.</p>
<p><em>Thanks to <a href="https://www.vegardstikbakke.com/">Vegard Stikbakke</a>, <a href="https://www.flyingcroissant.ca/">Andrew Bugera</a> and <a href="https://thundergolfer.com/">Jonathan
Belotti</a> for providing valuable feedback on drafts of this article.</em></p>
<section data-footnotes="true" class="footnotes"><h2 class="sr-only" id="footnote-label">Footnotes</h2>
<ol>
<li id="user-content-fn-google">
<p>This is a good example of <a href="/peak-complexity">Peak Complexity</a>.
The existing phrase-based translation model was iteratively improved with
increasing complexity: distributed systems to look up five-word phrase
frequencies, and so on. The complexity required to improve the model by 1% was
becoming astronomical, a good hint that you need a paradigm shift to reset the complexity.
Deep Learning provided that complexity reset for the translation model. <a href="#user-content-fnref-google" data-footnote-backref="" aria-label="Back to reference 1" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-gpt3">
<p>GPT-3 has ~175 billion weights. The human brain has ~86 billion
neurons. Of course, you cannot technically compare an artificial neuron to a
real one, but it remains an interesting point of reference. <a href="https://lastweekin.ai/p/gpt-3-is-no-longer-the-only-game">It’s
estimated</a> that it cost in the double-digit millions to train it. <a href="#user-content-fnref-gpt3" data-footnote-backref="" aria-label="Back to reference 2" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-3blue1brown">
<p>There’s a brilliant <a href="https://www.youtube.com/watch?v=aircAruvnKk">YouTube series</a> that’ll go
into more depth on the math than I do in this article. This article
accompanies the video nicely, as the video doesn’t go into the implementation. <a href="#user-content-fnref-3blue1brown" data-footnote-backref="" aria-label="Back to reference 3" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-nielsen">
<p>There’s a great, <a href="http://neuralnetworksanddeeplearning.com/">short e-book</a> on implementing a neural
network from scratch available that goes into far more detail on computing the
derivative from scratch. Despite that book existing, I still decided to do this
write-up because calculating the slope manually adds a lot of time and
complexity; I wanted to teach it from scratch without going into those
details. <a href="#user-content-fnref-nielsen" data-footnote-backref="" aria-label="Back to reference 4" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-activation">
<p>I found this pretty strange when I learned about neural networks.
We can use almost any non-linear function and our neural network
works… better? The short answer is yes! The longer answer I’m not
knowledgeable enough to offer… If you write your own handwritten MNIST
neural net (as suggested at the end of the article), you can see for yourself
by adding/removing a non-linear function and looking at the loss. <a href="#user-content-fnref-activation" data-footnote-backref="" aria-label="Back to reference 5" class="data-footnote-backref">↩</a></p>
</li>
</ol>
</section>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Careful Trading Complexity for 'Improvements']]></title>
        <id>https://sirupsen.com/trading-complexity</id>
        <link href="https://sirupsen.com/trading-complexity"/>
        <updated>2021-11-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Often I’ve come across technical proposals along the lines of:

In 6 months we will outgrow our MySQL/Postgres instance. We will need to move our
biggest table to a different horizontally scalable datastore.
If we have a database outage in a region, we will have a complete outage. We
should consider moving to a data-store that’s natively multi-region.
This would be much faster if it was stored in a specialized database. Should
we consider moving to it?
I]]></summary>
        <content type="html"><![CDATA[<p>Often I’ve come across technical proposals along the lines of:</p>
<ul>
<li>In 6 months we will outgrow our MySQL/Postgres instance. We will need to move our
biggest table to a different horizontally scalable datastore.</li>
<li>If we have a database outage in a region, we will have a complete outage. We
should consider moving to a data-store that’s natively multi-region.</li>
<li>This would be much faster if it was stored in a specialized database. Should
we consider moving to it?</li>
<li>If we move to an event-based architecture, our system will be much more
reliable.</li>
</ul>
<p>What these proposals have in common is that they attempt to improve the system
by increasing complexity. Whenever you find yourself arguing for improving
infrastructure by yanking up complexity, you need to be <em>very</em> careful.</p>
<blockquote>
<p>“Simplicity is prerequisite for reliability.”
— Edsger W. Dijkstra</p>
</blockquote>
<p>Theoretically yes: if you move your massive, quickly-growing <code>products</code> table to
a key-value store to alleviate a default-configured relational database
instance, it will probably be faster, cost less, and be easier to scale.</p>
<p>However, in reality the complexity will most likely lead to more downtime (even
if in theory you get less), slower performance because the system is harder to
debug (even if in theory it’s much faster), and worse scalability (because you
don’t know the system well).</p>
<p>More theoretical 9s + increase in complexity =&gt; fewer 9s + more work.</p>
<p>This is all because you’re about to trade known risks for theoretical improvements,
accompanied by a slew of unknown risks. Adopting the new tech would increase
complexity by introducing a whole new system: the operational burden of learning a
new data-store, the developers’ overhead of using another system for a subset of the
data, a more complex development environment, skills that don’t transfer
between the two, and a myriad of other unknown-unknowns. That’s a <em>massive</em>
cost.</p>
<p>I’m a proponent of mastering and abusing existing tools, rather than chasing
greener pastures. The more facility you gain with first-principle reasoning and
<a href="/napkin">napkin math</a>, the closer I’d wager you’ll inch towards this conclusion as
well. A new system theoretically having better guarantees is <em>not</em> enough of an
argument. Adding a new system to your stack is a huge deal and difficult to
undo.</p>
<p>So what do we do with that pesky <code>products</code> table?</p>
<p>Stop thinking about technologies, and start thinking in first-principle
requirements:</p>
<ul>
<li>You need faster inserts/updates</li>
<li>You need terabytes of storage to have runway for the next ~5 years</li>
<li>You need more read capacity</li>
</ul>
<p>The way that the shiny key-value store you’re eyeing achieves this is by not
syncing every write to disk immediately. Well, you can <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit">do that in MySQL too</a>
(and <a href="https://www.postgresql.org/docs/9.4/wal-async-commit.html">Postgres</a>). You could put your table on a new database server with that
setting on. I <a href="/napkin/problem-10-mysql-transactions-per-second">wrote about this in detail</a>.</p>
<p>There’s no reason your relational database can’t handle terabytes. Do the napkin
math: <code>log(n)</code> lookups for that many keys aren’t much worse. Most likely you can
keep it all on one server.</p>
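<p>That napkin math is quick to sketch. With hypothetical numbers, say 10 billion rows and a B-tree fanout of ~200 keys per page, a point lookup only touches a handful of pages:</p>
<pre class="language-python"><code class="language-python">import math

rows = 10_000_000_000    # hypothetical: 10 billion keys
fanout = 200             # hypothetical: keys per B-tree page

# A B-tree lookup reads about log_fanout(rows) pages.
depth = math.ceil(math.log(rows, fanout))
print(depth)             # 5 page reads, the top levels likely cached in memory
</code></pre>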
<p>Why do you think reads would be faster in the other database than your
relational database? It probably caches in memory. Well, relational databases do
that too. You need to spread reads among more databases? Relational databases
can do that too with read-replicas…</p>
<p>Yes, MySQL/Postgres might be <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>25</mn><mo>−</mo><mn>50</mn><mi mathvariant="normal">%</mi></mrow><annotation encoding="application/x-tex">25-50\%</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">25</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.8056em;vertical-align:-0.0556em"></span><span class="mord">50%</span></span></span></span> worse at all those things than a new system. But
it still comes out <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>10</mn><mo separator="true">,</mo><mn>000</mn><mi mathvariant="normal">%</mi></mrow><annotation encoding="application/x-tex">10,000\%</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9444em;vertical-align:-0.1944em"></span><span class="mord">10</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">000%</span></span></span></span> ahead, by not being a new system with all its
associated costs and unknown-unknowns. There’s an underlying rule from evolution
that the more specialized a system is, the less adaptable to change it is,
whether it’s a bird over-fit to its ecosystem or a database you’re only using
for one thing.</p>
<p>We could go through a similar line of reasoning for the other examples. Adopting
a new multi-regional database for a subset of your data will likely yield
<em>more</em> downtime, due to the introduced complexity, than sticking with what
you’ve got.</p>
<p>Don’t adopt a new system unless you can make the first-principle argument for
why your current stack fundamentally can’t handle it. For example, you will
likely reach elemental limitations doing full-text search in a relational
datastore or analytics queries on your production database, by the nature of the
data structures used. If you’re unsure, reach out, and I might be able to help
you!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 16: When To Write a Simulator]]></title>
        <id>https://sirupsen.com/napkin/problem-16-simulation</id>
        <link href="https://sirupsen.com/napkin/problem-16-simulation"/>
        <updated>2021-09-13T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[My rule for when to write a simulator:

Simulate anything that involves more than one probability, probabilities
over time, or queues.

Anything involving probability and/or queues you will need to approach with
humility and care, as they are often deceptively difficult: How many people with
their random, erratic behaviour can you let into the checkout at once to make
sure it doesn’t topple over? How many connections should you allow op]]></summary>
        <content type="html"><![CDATA[<p>My rule for when to write a simulator:</p>
<blockquote>
<p>Simulate <em>anything</em> that involves more than one probability, probabilities
over time, or queues.</p>
</blockquote>
<p><em>Anything</em> involving probability and/or queues you will need to approach with
humility and care, as they are often deceptively difficult: How many people with
their random, erratic behaviour can you let into the checkout at once to make
sure it doesn’t topple over? How many connections should you allow open to a
database when it’s overloaded? What is the best algorithm to prioritize
asynchronous jobs to uphold our SLOs as much as possible?</p>
<p>If you’re in a meeting discussing whether to use algorithm X or Y for this
kind of problem without a simulator (or amazing data), you’re wasting your
time. Unless maybe one of you has a PhD in queuing theory or probability theory.
Probably even then. Don’t trust your intuition for anything the rule above
applies to.</p>
<p>My favourite illustration of how bad your intuition is for these types of
problems is the Monty Hall problem:</p>
<blockquote>
<p>Suppose you’re on a game show, and you’re given the choice of three doors:
Behind one door is a car; behind the others, goats. You pick a door, say No. 1,
and the host, who knows what’s behind the doors, opens another door, say No. 3,
which has a goat. He then says to you, “Do you want to pick door No. 2?”</p>
<p>Is it to your advantage to switch your choice?</p>
<p>— <a href="https://en.wikipedia.org/wiki/Monty_Hall_problem">Wikipedia Entry for the Monty Hall problem</a></p>
</blockquote>
<figure><img src="/images/monty.png" alt="" width="2560" height="1422" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Against your intuition, it is to your advantage to switch your choice. You will
win the car twice as often if you do! This completely stumped me. Take a moment
to think about it.</p>
<p>I frantically read the explanation on <a href="https://en.wikipedia.org/wiki/Monty_Hall_problem">Wikipedia</a> several times: still
didn’t get it. Watched <a href="https://www.youtube.com/watch?v=4Lb-6rxZxx0">videos</a>; now I think that… maybe… I get
it? According to <a href="https://en.wikipedia.org/wiki/Monty_Hall_problem">Wikipedia</a>, Erdős, one of the most renowned
mathematicians in history, also wasn’t convinced until he was shown a simulation!</p>
<p>After writing <a href="https://gist.github.com/sirupsen/87ae5e79064354b0e4f81c8e1315f89b">my simulation</a>, however, I finally feel like I get it.
Writing a simulation not only gives you a result you can trust more than your
intuition but also develops your understanding of the problem dramatically. I
won’t try to offer an in-depth explanation here; click the <a href="https://www.youtube.com/watch?v=4Lb-6rxZxx0">video link
above</a>, or try to implement a simulation — and you’ll see!</p>
<pre class="language-shellsession"><code class="language-shellsession"># https://gist.github.com/sirupsen/87ae5e79064354b0e4f81c8e1315f89b
$ ruby monty_hall.rb
Switch strategy wins: 666226 (66.62%)
No Switch strategy wins: 333774 (33.38%)
</code></pre>
<p>The short of it is that the host <em>always</em> opens a non-winning door, and never
your door, which reveals information about the doors! Your first choice retains
its 1/3 odds, but by switching at this point, incorporating the new information
of the host opening a non-winning door, you improve your odds to 2/3.</p>
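<p>The linked simulation is in Ruby; the same idea fits in a few lines of Python:</p>
<pre class="language-python"><code class="language-python">import random

def monty_hall(trials, switch, seed=42):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # The host opens a door that is neither your pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        if pick == car:
            wins += 1
    return wins / trials

print(monty_hall(100_000, switch=True))   # ~0.666
print(monty_hall(100_000, switch=False))  # ~0.333
</code></pre>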
<p>This is a good example of a deceptively difficult problem. We should simulate
it because it involves <em>probabilities over time</em>. If someone framed the Monty
Hall problem to you, you’d intuitively just say ‘no’ or ‘1/3’. Any problem
involving probabilities over time should <em>humble</em> you. Walk away and quietly go
write a simulation.</p>
<p>Now imagine when you add scale, queues, … as most of the systems you work on
likely have. Thinking you can reason about this off the top of your head might
constitute a case of good ol’ <a href="https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect">Dunning-Kruger</a>. If Bob’s offering a perfect
algorithm off the top of his head, call bullshit (unless he carefully frames it
as a hypothesis to test in a simulator, thank you, Bob).</p>
<p>When I used to do <a href="https://sirupsen.com/my-journey-to-the-international-olympiad-in-informatics/">informatics competitions</a> in high school, I was never
confident in the correctness of my solutions to the more math-heavy tasks — so I would often
write simulations for various things to make sure some condition held in a bunch
of scenarios (often using binary search). Same principle at work: I’m much more
confident most day-to-day developers would be able to write a good simulation
than a closed-form mathematical solution. I once read something about a
mathematician who spent a long time figuring out the optimal strategy in
Monopoly. A computer scientist came along and wrote a simulator in a <em>fraction</em>
of the time.</p>
<h2 id="using-randomness-instead-of-coordination">Using Randomness Instead of Coordination?</h2>
<p>A few years ago, we were revisiting old systems as part of moving to Kubernetes.
One system we had to adapt was a process spun up for every shard to do some
book-keeping. We were discussing how we’d make sure we’d have at least ~2-3
replicas per shard in the K8s setup (for high availability). Previously, we
had a messy static configuration in Chef to ensure we had a service for each
shard and that the replicas spread out among different servers, not something
that translated easily to K8s.</p>
<p>Below, the green dots denote the active replica for each shard. The red dots are
the inactive replicas for each shard:</p>
<figure><img src="/images/randomness-1.png" alt="" width="2000" height="1398" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>We discussed a couple of options: each process consulting some shared service to
coordinate having enough replicas per shard, or creating a K8s deployment per
shard with the 2-3 replicas. Both sounded a bit awkward and error-prone, and we
didn’t love either of them.</p>
<p>As a quick, curious, semi-joking thought experiment I asked:</p>
<blockquote>
<p>“What if each process chooses a shard at random when booting, and we boot
enough that we are near certain every shard has at least 2 replicas?”</p>
</blockquote>
<p>To rephrase the problem in a ‘mathy way’, with <code>n</code> being the number of shards:</p>
<blockquote>
<p>“How many times do you have to roll an <code>n</code>-sided die to ensure you’ve seen each
side at least <code>m</code> times?”</p>
</blockquote>
<figure><img src="/images/randomness-2.png" alt="" width="2000" height="1365" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>This successfully nerd-sniped everyone in the office pod. It didn’t take long
before some were pulling out complicated Wikipedia entries on probability
theory, trawling their email for old student MATLAB licenses, and soon formulas
I had no idea how to parse appeared on the whiteboard.</p>
<p>Insecure that I’d only ever done high school math, I surreptitiously started
writing a simple <a href="https://gist.github.com/sirupsen/8cc99a0d4290c9aa3e6c009fdce1ffec">simulator</a>. After 10 minutes I was done, and they were still
arguing about this and that probability formula. Once I showed them the
simulation the response was: <em>“oh yeah, you could do that too… in fact that’s
probably simpler…”</em> We all had a laugh and referenced that hour endearingly
for years after. (If you know a closed-form mathematical solution, I’d be very
curious! Email me.)</p>
<pre class="language-shellsession"><code class="language-shellsession"># https://gist.github.com/sirupsen/8cc99a0d4290c9aa3e6c009fdce1ffec
$ ruby die.rb
Max: 2513
Min: 509
P50: 940
P99: 1533
P999: 1842
P9999: 2147
</code></pre>
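<p>The gist above is in Ruby; a minimal Python sketch of the same simulation, using a hypothetical 100 shards and a minimum of 2 replicas each (our real numbers were different, so the percentiles won’t match the output above), might look like:</p>
<pre class="language-python"><code class="language-python">import random

def rolls_until_m_of_each(n, m, seed):
    # Roll an n-sided die until every side has come up at least m times.
    rng = random.Random(seed)
    counts = [0] * n
    needy = n          # sides rolled fewer than m times so far
    rolls = 0
    while needy > 0:
        side = rng.randrange(n)
        counts[side] += 1
        if counts[side] == m:
            needy -= 1
        rolls += 1
    return rolls

# Hypothetical: 100 shards, at least 2 replicas per shard.
samples = sorted(rolls_until_m_of_each(100, 2, seed) for seed in range(500))
print("P50:", samples[250])
print("P99:", samples[495])
</code></pre>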
<p>It followed from running the simulation that we’d need to boot 2000+ processes
with this strategy to ensure we’d have <em>at least</em> 2 replicas per shard with
a 99.99% probability. Compare this with the ~400 we’d need if we did some light
coordination. As you can imagine, we then did the napkin math on the cost of 1600 excess
dedicated CPUs to run these book-keepers at ~$10/month per CPU. Was this
strategy worth ~$16,000 a month? Probably not.</p>
<p>Throughout my career I remember countless times complicated Wikipedia entries
have been pulled out as a possible solution. I can’t remember a single time one
was actually implemented over something simpler. Intimidating Wikipedia entries
might be another sign it’s time to write a simulator, if nothing else, to prove that
something simpler might work. For example, you don’t need to know that traffic
probably arrives in a <a href="https://en.wikipedia.org/wiki/Poisson_distribution">Poisson distribution</a> and how to do further analysis
on that. That will just happen in a simulation, even if you don’t know the name.
Not important!</p>
<h2 id="another-real-example-load-shedding">Another Real Example: Load Shedding</h2>
<p>At Shopify, I spent a good chunk of my time on teams that worked on the
reliability of the platform. Years ago, we started working on a ‘load shedder.’
The idea was that when the platform was overloaded we’d prioritize traffic. For
example, if a shop got inundated with traffic (typically bots), how could we
make sure we’d prioritize ‘shedding’ (red arrow below) the lowest value traffic?
Failing that, only degrade that single store?  Failing that, only impact that
shard?</p>
<figure><img src="/images/load-shed.png" alt="" width="2000" height="1010" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Hormoz Kheradmand led most of this effort, and has written <a href="https://hormozk.com/capacity/">this post</a>
about it in more detail. When Hormoz started working on the first load shedder,
we were uncertain about what algorithms might work for shedding traffic fairly.
It was a big topic of discussion in the lively office pod, just like the
dice-problem. Hormoz started <a href="https://github.com/hkdsun/simiload">writing simulations</a> to
develop a much better grasp on how various controls might behave. This worked
out wonderfully, and also served to convince the team that a very simple
algorithm for prioritizing traffic could work, which Hormoz describes in <a href="https://hormozk.com/capacity/">his
post</a>.</p>
<p>Of course, before the simulations, we all started talking about Wikipedia
entries of the complicated, cool stuff we could do. The simple simulations
showed that none of that was necessary — perfect! There’s tremendous value in
exploratory simulation for nebulous tasks that ooze complexity. It gives you a
feedback loop, and typically a justification to keep V1 simple.</p>
<p>Do you need to bin-pack tenants on <code>n</code> shards that are being filled up randomly?
Sounds like <em>probabilities over time</em>, a lot of randomness, and smells of
NP-completeness. It won’t be long before someone points out deep learning is
perfect, or some resemblance to protein folding or whatever… Write a simple
simulation with a few different sizes and see if you can beat random by even a
little bit. Probably random is fine.</p>
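<p>Such a simulation fits in a few lines. Here with made-up tenant sizes and shard counts, comparing random placement to a greedy “least-loaded shard” strategy:</p>
<pre class="language-python"><code class="language-python">import random

def fullest_shard(pick_shard, tenants=1000, shards=10, seed=7):
    # Assign tenants of random size to shards; return the max shard load.
    rng = random.Random(seed)
    loads = [0] * shards
    for _ in range(tenants):
        size = rng.randint(1, 100)   # hypothetical tenant size
        loads[pick_shard(loads, rng)] += size
    return max(loads)

random_placement = fullest_shard(lambda loads, rng: rng.randrange(len(loads)))
least_loaded = fullest_shard(lambda loads, rng: loads.index(min(loads)))
print(random_placement, least_loaded)
</code></pre>
<p>Whether a smarter strategy beats random by enough to matter is exactly what a run of this answers.</p>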
<p>You need to plan for retirement and want to stress-test your portfolio? The
state of the art for this is <a href="https://engaging-data.com/will-money-last-retire-early/">Monte Carlo analysis</a>, which, for the
sake of this post, we can say is a fancy way of saying “simulate lots of
random scenarios.”</p>
<p>I hope you see the value in simulations for getting a handle on these types of
problems. I think you’ll also find that writing simulators is some of the most
fun programming there is. Enjoy!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 15: Increase HTTP Performance by Fitting In the Initial TCP Slow Start Window]]></title>
        <id>https://sirupsen.com/napkin/problem-15</id>
        <link href="https://sirupsen.com/napkin/problem-15"/>
        <updated>2021-07-13T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Did you know that if your site’s under ~12kb the first page will load
significantly faster?  Servers only send a few packets (typically 10)
in the initial round-trip while TCP is warming up (referred to as TCP slow
start). After sending the first set of packets, it needs to wait for
the client to acknowledge it received all those packets.
Quick illustration of transferring ~15kb with an initial TCP slow start window
(also referred to as initial congestion window or initcwnd</code]]></summary>
        <content type="html"><![CDATA[<p>Did you know that if your site’s under ~12kb the first page will load
significantly faster?  Servers only send a few packets (typically 10)
in the initial round-trip while TCP is warming up (referred to as TCP slow
start). After sending the first set of packets, it needs to wait for
the client to acknowledge it received all those packets.</p>
<p>Quick illustration of transferring ~15kb with an initial TCP slow start window
(also referred to as initial congestion window or <code>initcwnd</code>) of 10 versus 30:</p>
<figure><img src="/images/initcwnds.png" alt="" width="2000" height="1904" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
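<p>The arithmetic behind that illustration can be sketched as a tiny model: assume ~1,460 bytes of payload per packet and a congestion window that doubles every roundtrip (no loss), and count roundtrips until the page fits:</p>
<pre class="language-python"><code class="language-python">def data_roundtrips(size_bytes, initcwnd, mss=1460):
    # The window doubles each roundtrip until everything is sent.
    window, sent, roundtrips = initcwnd, 0, 0
    while size_bytes > sent:
        sent += window * mss
        window *= 2
        roundtrips += 1
    return roundtrips

print(data_roundtrips(15_000, initcwnd=10))   # 2 roundtrips
print(data_roundtrips(15_000, initcwnd=30))   # 1 roundtrip
print(data_roundtrips(160_000, initcwnd=10))  # 4 roundtrips
</code></pre>
<p>With a 10-packet window, anything under ~14.6kb of payload fits in the first roundtrip, the same ballpark as the ~12kb rule of thumb above.</p>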
<p>The larger the initial window, the more we can transfer in the first roundtrip,
and the faster your site is on the initial page load. For a large roundtrip time
(e.g. across an ocean), this will start to matter a lot. Here is the approximate
size of the initial window for a number of common hosting providers:</p>
<table><thead><tr><th>Site</th><th>First Roundtrip Bytes (<code>initcwnd</code>)</th></tr></thead><tbody><tr><td><a href="https://readwise.io/">Heroku</a></td><td>~12kb (10 packets)</td></tr><tr><td><a href="https://www.onepeloton.ca/">Netlify</a></td><td>~12kb (10 packets)</td></tr><tr><td><a href="https://fashionnova.com/">Shopify</a></td><td>~12kb (10 packets)</td></tr><tr><td><a href="https://yellowco.co/">Squarespace</a></td><td>~12kb (10 packets)</td></tr><tr><td><a href="https://sirupsen.com/static/html/network-napkin/100kb">Cloudflare</a></td><td>~40kb (30 packets)</td></tr><tr><td><a href="https://www.fastly.com/">Fastly</a></td><td>~40kb (30 packets)</td></tr><tr><td><a href="https://demos.creative-tim.com">Github Pages</a></td><td>~40kb (30 packets)</td></tr><tr><td><a href="https://tailwindcss.com/">Vercel</a></td><td>~40kb (30 packets)</td></tr></tbody></table>
<p>To generate this, I wrote a script, <a href="https://github.com/sirupsen/initcwnd"><code>sirupsen/initcwnd</code></a>, that you can
use to analyze your own site. Based on the report, you can attempt to tune your page
size, or tune your server’s initial slow start window size (<code>initcwnd</code>) (see
bottom of article). It’s important to note that more isn’t necessarily better
here. Hosting providers have a hard job choosing a value. 10 might be the best
setting for your site, or it might be 64. As a rule of thumb, if most of your
clients are high-bandwidth connections, more is better. If not, you’ll need to
strike a balance. Read on, and you’ll be an expert in this!</p>
<figure><img src="/images/initcwnd-script.png" alt="" width="1692" height="904" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<hr/>
<p>Dear Napkin Mathers, it’s been too long. Since the last issue, I’ve left Shopify after 8
amazing years. Ride of a lifetime. For the time being, I’m passing the time with
standup paddleboarding (did a 125K 3-day trip the week after I left),
recreational programming (of which napkin math surely is a part), and learning
some non-computer things.</p>
<p>In this issue, we’ll dig into the details of exactly what happens on the wire
when we do the initial page load of a website over HTTP. As I’ve already hinted
at, we’ll show that there’s a magical byte threshold to be aware of when
optimizing for short-lived, bursty TCP transfers. Staying under this threshold,
or raising it, can potentially save the client several roundtrips.
Especially for sites with a single location that are often requested from far
away (i.e. high roundtrip times), e.g. US -&gt; Australia, this can make a <em>huge</em>
difference. That’s likely the situation you’re in if you’re operating a
SaaS-style service. While we’ll focus on HTTP over the public internet, TCP slow
start can also matter to RPC inside of your data-centre, and especially across
them.</p>
<p>As always, we’ll start by laying out our naive mental model about how we <em>think</em>
loading a site works at layer 4. Then we’ll do the napkin math on expected
performance, and confront our fragile, naive model with reality to see if it
lines up.</p>
<p>So what do we think happens at the TCP-level when we request a site? For
simplicity, we will exclude compression, DOM rendering, Javascript, etc., and
limit ourselves exclusively to downloading the HTML. In other words: <code>curl --http1.1 https://sirupsen.com &gt; /dev/null</code> (note that <a href="https://github.com/sirupsen/initcwnd"><code>sirupsen/initcwnd</code></a>
uses <code>--compressed</code> with <code>curl</code> to reflect reality).</p>
<p>We’d expect something along the lines of:</p>
<ul>
<li>1 DNS roundtrip (we’ll ignore this one, typically cached close by)</li>
<li>1 TCP roundtrip to establish the connection (<code>SYN</code> and <code>SYN+ACK</code>)</li>
<li>2 TLS roundtrips to negotiate a <em>secure</em> connection</li>
<li>1 HTTP roundtrip to request the page and the server sending it</li>
</ul>
<figure><img src="/images/roundtrips-1.png" alt="" width="350" height="469" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>To make things a little more interesting, we’ll choose a site that is
geographically farther from me and isn’t overly optimized: <code>information.dk</code>, a
Danish newspaper. Through some DNS lookups from servers in different geographies
and by using <a href="https://bgp.he.net/ip/109.238.50.144">a looking glass</a>, I can determine that all their HTML traffic
is always routed to a datacenter in Copenhagen. These days, many sites are
routed through e.g. Cloudflare POPs, which will have a nearby data-centre; to
simplify our analysis, we want to make sure that’s not the case.</p>
<p>I’m currently sitting in South-Western Quebec on an LTE connection. I can
determine <a href="https://cln.sh/5Br6AV">through <code>traceroute(1)</code></a> that my traffic is travelling to
Copenhagen through the path Montreal -&gt; New York -&gt; Amsterdam -&gt; Copenhagen.
<a href="https://cln.sh/CFgnEZ">Round-trip time is ~140ms</a>.</p>
<figure><img src="/images/network.jpeg" alt="" width="1758" height="1098" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If we add up the number of round-trips from our napkin model above (excluding
DNS), we’d expect loading the Danish site to take <code>4 * 140ms = 560ms</code>.
Since I’m on an LTE connection where I’m not getting much above 15 mbit/s, we
have to factor in that it takes another <a href="https://www.wolframalpha.com/input/?i=160kb+at+15+mbit%2Fs">~100ms to transfer the data</a>,
in addition to the 4 round-trips. So with our napkin math, we’re expecting that
we should be able to download the 160kb of HTML from a server in Copenhagen
within a ballpark of <code>~660ms</code>.</p>
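<p>This estimate is small enough to spell out in a few lines of Python (a sketch using the ~140ms RTT and ~15 mbit/s LTE bandwidth measured above):</p>

```python
# Napkin model: 4 roundtrips (1 TCP + 2 TLS + 1 HTTP) plus transfer time.
RTT_S = 0.140           # measured roundtrip time to Copenhagen
BANDWIDTH_BIT_S = 15e6  # ~15 mbit/s LTE connection
PAGE_BYTES = 160_000    # ~160kb of HTML

total_s = 4 * RTT_S + PAGE_BYTES * 8 / BANDWIDTH_BIT_S
print(f"expected load time: ~{total_s:.2f}s")  # ~0.65s
```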
<p>Reality, however, has other plans. When I run <code>time curl --http1.1 https://www.information.dk</code> it takes 1.3s! Normally we say that if the napkin
math is within ~10x of reality, the model is likely sound, but that rule of
thumb applies when we deal with nanoseconds and microseconds. Not off by
~<code>640ms</code>!</p>
<p>So what’s going on here? When there’s a discrepancy between the napkin math and
reality, it’s because either (1) the napkin model of the world is incorrect, or
(2) there’s room for optimization in the system. In this case, it’s a bit of
both. Let’s hunt down those 640ms. 👀</p>
<p>To do that, we have to analyze the raw network traffic with Wireshark. Wireshark
brings back many memories… some fond, but mostly frustration from trying to
figure out the causes of intermittent network problems. In this case, for once, it’s
for fun and games! We’ll type <code>host www.information.dk</code> into Wireshark to make
it capture traffic to the site. In our terminal we run the <code>curl</code> command above
for Wireshark to have something to capture.</p>
<p>Wireshark will then give us a nice GUI to help us hunt down the roughly half a
second we haven’t accounted for. One thing to note is that in order to get
Wireshark to understand the TLS/SSL contents of the session it needs to know the
secret negotiated with the server. There’s a complete guide <a href="https://everything.curl.dev/usingcurl/tls/sslkeylogfile">here</a>, but
in short you pass <code>SSLKEYLOGFILE=log.log</code> to your <code>curl</code> command and then point
to that file in Wireshark in the TLS configuration.</p>
<h2 id="problem-1-3-tls-roundtrips-rather-than-2">Problem 1: 3 TLS roundtrips rather than 2</h2>
<figure><img src="/images/wireshark-overview.png" alt="" width="3104" height="2024" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>We see the TCP roundtrip as expected, <code>SYN</code> from the client, then <code>SYN+ACK</code> from
the server. Bueno. But after that it looks fishy. We’re seeing <em>3</em> round-trips
for TLS/SSL instead of the expected 2 from our drawing above!</p>
<figure><img src="/images/wireshark-tls-bad.png" alt="" width="1944" height="648" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>To make sure I wasn’t misunderstanding something, I double-checked with
<code>sirupsen.com</code>, and sure enough, it’s showing the two roundtrips in Wireshark as
anticipated:</p>
<figure><img src="/images/wireshark-tls-good.png" alt="" width="2008" height="292" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If we carefully study the annotated Wireshark dump above for the Danish
newspaper, we can see that the problem is that for whatever reason the server is
waiting for a TCP ack in the middle of transmitting the certificate (packet 9).</p>
<p>To make it a little easier to parse, the exchange looks like this:</p>
<figure><img src="/images/roundtrips-2.png" alt="" width="250" height="581" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Why is the server waiting for a TCP ACK from the client after transmitting ~4398
bytes of the certificate? Why doesn’t the server just send the whole certificate
at once?</p>
<h2 id="bytes-in-flight-or-the-initial-congestion-window">Bytes in flight or the “initial congestion window”</h2>
<p>In TCP, the server carefully monitors how many packets/bytes it has in flight.
Typically, each packet is ~1460 bytes of application data. The server doesn’t
necessarily send <em>all</em> the data it has at once, because the server doesn’t know
how “fat” the pipes are to the client. If the client can only receive 64 kbit/s
currently, then sending e.g. 100 packets could completely clog the network. The
network would most likely drop some random packets, which is even slower to
recover from than sending the packets at a more sustainable pace for the
client.</p>
<p>A <em>major</em> part of the TCP protocol is the balancing act of trying to send as
much data as possible at any given time, while ensuring the server doesn’t
over-saturate the path to the client and lose packets. Losing packets is very
bad for bandwidth in TCP.</p>
<p>The server only keeps a certain amount of packets in flight at any given time.
“In flight” in TCP terms means “unacknowledged” packets, i.e. packets of data
the server has sent to the client that the client hasn’t yet sent an
acknowledgement to the server that it has received. Typically for every
successfully acknowledged packet the server’s TCP implementation will decide to
increase the number of allowed in-flight packets by 1. You may have heard this
simple algorithm referred to as “TCP slow start.” On the flip side, if a packet
has been dropped, the server will decide to keep slightly fewer bytes in
flight. Throughout the connection’s lifetime this dance
is tirelessly performed. In TCP terms, what we’ve called “in-flight” is
referred to as the “congestion window” (or <code>cwnd</code> for short).
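<p>This dance can be sketched as a toy model: send everything the window allows, then double the window on every successful roundtrip. A minimal sketch, assuming ~1460 bytes per packet, no loss, and no receive-window cap:</p>

```python
MSS = 1460  # typical bytes of application data per packet

def roundtrips_to_send(payload_bytes: int, initcwnd: int) -> int:
    """Roundtrips to deliver a payload when the congestion window
    doubles on every successful (loss-free) roundtrip."""
    cwnd, sent, trips = initcwnd, 0, 0
    while sent < payload_bytes:
        sent += cwnd * MSS  # everything allowed in flight this roundtrip
        cwnd *= 2           # ~1 extra packet per acknowledged packet
        trips += 1
    return trips

# A ~6908-byte certificate needs 2 roundtrips with an initial window of
# 3 packets (~4380 bytes), but only 1 with a window of 10 (~14600 bytes).
print(roundtrips_to_send(6908, 3), roundtrips_to_send(6908, 10))  # 2 1
```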
<figure><img src="/images/slow-start.png" alt="" width="2551" height="2093" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Typically after the first packet has been lost the TCP implementation switches
from the simple TCP slow start algorithm to a more complicated <a href="https://upload.wikimedia.org/wikipedia/commons/2/24/TCP_Slow-Start_and_Congestion_Avoidance.svg">“Congestion
Control Algorithm”</a> of which there are dozens. Their job is: Based on what
we’ve observed about the network, how much should we have in flight to maximize
bandwidth?</p>
<p>Now we can go back and understand why the TLS handshake is taking 3 roundtrips
instead of 2. After the client starts the TLS handshake with its <code>TLS HELLO</code>, the
Danish server really, really wants to transfer this ~6908 byte certificate.
Unfortunately, the server’s congestion window (the packets in flight allowed) at
the time just isn’t large enough to accommodate the whole certificate!
<p>Put another way, the server’s TCP implementation has decided it’s <em>not</em>
confident the poor client can receive that many tasty bytes all at once yet —
so it sends a petty 4398 bytes of the certificate. Of course, 63% of a
certificate isn’t enough to move on with the TLS handshake… so the client
sighs, sends a TCP ACK back to the server, which then sends the remaining 2510
bytes of the certificate so the client can move on to perform its part of the TLS
handshake.</p>
<p>Of course, this all seems a little silly… first of all, why is the certificate
6908 bytes?! For comparison, it’s 2635 for my site. That’s not too interesting
to me, though. What’s more interesting: why does the server send only 4398
bytes before waiting for an ACK? That seems scanty for a modern web server!</p>
<p>In TCP, the number of packets the server can send on a brand new connection
before it knows <em>anything</em> about the client is called the “initial congestion window.” In a
configuration context, this is called <code>initcwnd</code>. If you reference the yellow
graph above with the packets in flight, that’s the value at the first roundtrip.</p>
<p>These days, the default for a Linux server is 10 packets, or <code>10 * 1460 = 14600 bytes</code>, where 1460 is roughly the data payload of each packet. That would’ve fit
that monster certificate of the Danish newspaper. Clearly that’s not their
<code>initcwnd</code>, since then the server wouldn’t have patiently waited for my ACK.
Through some digging it appears that prior to <a href="https://blog.cloudflare.com/optimizing-the-linux-stack-for-mobile-web-per/">Linux 3.0.0 <code>initcwnd</code> was
3</a>, or ~<code>3 * 1460 = 4380</code> bytes! That approximately lines up, so it seems
that the Danish newspaper’s <code>initcwnd</code> is 3. We don’t know for sure it’s Linux,
but we know the <code>initcwnd</code> is 3.</p>
<p>Because of the exponential growth of the packets in flight, <code>initcwnd</code> matters
quite a bit for how much data we can send in those first few precious
roundtrips:</p>
<figure><img src="/images/initcwnd-graph.png" alt="" width="2000" height="1650" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>As we saw in the intro, it’s common among CDNs to raise the values from the
default to e.g. 32 (~46kb). This makes sense, as you might be transmitting
images of many megabytes. Waiting for TCP slow start to get to this point can
take a few roundtrips.</p>
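<p>With doubling, the cumulative capacity of the first <em>n</em> roundtrips is a geometric sum: <code>initcwnd * 1460 * (2^n - 1)</code> bytes. A quick sketch of how much the initial window matters, assuming no loss:</p>

```python
MSS = 1460  # approximate data payload per packet

def bytes_after(roundtrips: int, initcwnd: int) -> int:
    # Geometric sum: initcwnd + 2*initcwnd + 4*initcwnd + ... packets.
    return initcwnd * MSS * (2 ** roundtrips - 1)

for icw in (3, 10, 32):
    print(icw, [bytes_after(n, icw) for n in (1, 2, 3, 4)])
# initcwnd 32 can push ~46kb in the very first roundtrip, while
# initcwnd 3 needs three roundtrips just to get past ~30kb.
```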
<p>Among other reasons, this is also why HTTP/2 and HTTP/3 moved in the direction of
sending more data through the same connection: it’s an already “warm” TCP
session. “Warm” meaning that the congestion window / bytes in flight has already
been increased generously from the initial value by the server.</p>
<p>The TCP slow start window is also part of why points of presence (POPs) are
useful. If you connect to a POP in front of your website that’s 10ms
away, negotiate TLS with the POP, and the POP already has a warm connection
with the backend server 100ms away — this improves performance dramatically,
with no other changes. From <code>4 * 100ms = 400ms</code> to <code>3 * 10ms + 100ms = 130ms</code>.</p>
<h2 id="how-many-roundtrips-for-the-http-payload">How many roundtrips for the HTTP payload?</h2>
<p>Now we’ve gotten to the bottom of why we have 3 TLS roundtrips rather than the
expected 2: the initial congestion window is small. The congestion window
(allowed bytes in flight by the server) applies equally to the HTTP payload
that the server sends back to us. If it doesn’t fit inside the congestion
window, then we need multiple round-trips to receive all the HTML.</p>
<p>In Wireshark, we can pull up a TCP view that’ll give us an idea of how many
roundtrips were required to complete the request (<a href="https://github.com/sirupsen/initcwnd"><code>sirupsen/initcwnd</code></a> tries to
guess this for you with an embarrassingly simple algorithm):</p>
<figure><img src="/images/roundtrips-3.png" alt="" width="581" height="335" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>We see the TCP roundtrip, 3 TLS roundtrips, and then 5-6 HTTP roundtrips to get
the ~160kb page! Each little dot in the picture shows a packet, so you’ll notice
that the congestion window (allowed bytes in flight) is roughly doubling every
roundtrip. The server is increasing the size of the window for every successful
roundtrip. A ‘successful roundtrip’ means a roundtrip that didn’t drop packets, and
in some <a href="https://cloud.google.com/blog/products/networking/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster">newer algorithms</a>, a roundtrip that didn’t take too much time.</p>
<p>Typically, the server will continue to double the number of packets (~1460 bytes each) for each successful roundtrip until either an unsuccessful roundtrip happens (slow or dropped packets), <em>or</em> the bytes in flight would exceed the <em>client’s</em> receive window.</p>
<p>When a TCP session starts, the client will advertise how many bytes <em>it</em> allows in flight. This is typically much larger than the server is willing to send off the bat. We can pull this up in the initial <code>SYN</code> packet from the client and see that it’s ~65kb:</p>
<figure><img src="/images/syn-window.png" alt="" width="1002" height="492" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If the session had been much longer and we had pushed up against that window, the client would’ve sent a TCP packet updating the size of its receive window. So there are two windows at play: the <em>congestion window</em>, which the server uses to manage the number of packets in flight, and the client’s <em>receive window</em>. The congestion window is adjusted by the server’s <em>congestion algorithm</em> based on the number of successful roundtrips, but always capped by the client’s receive window.</p>
<p>Let’s look at the number of packets transmitted by the server in each roundtrip:</p>
<ul>
<li>TLS roundtrip: 3 packets (~4kb)</li>
<li>HTTP roundtrip 1: 6 (~8kb)</li>
<li>HTTP roundtrip 2: 10 (~14kb)</li>
<li>HTTP roundtrip 3: 17 (~24kb)</li>
<li>HTTP roundtrip 4: 29 (~41kb)</li>
<li>HTTP roundtrip 5: 48 (~69kb, this in theory would have exceeded the 64kb current
receive window since the client didn’t enlarge it for some reason. The server
only transmitted ~64kb)</li>
<li>HTTP roundtrip 6: 9 (12kb, just the remainder of the data)</li>
</ul>
<p>The growth of the congestion window is a <em>textbook</em> cubic function; it has a
<a href="https://www.wolframalpha.com/input/?i=cubic+fit+3%2C+6%2C+10%2C+17%2C+29%2C+48">perfect fit</a>:</p>
<figure><img src="/images/regression.png" alt="" width="660" height="438" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>I’m not entirely sure why it follows a cubic function; I expected TCP slow start
to just double every roundtrip. :shrug: As far as I can gather, on modern TCP
implementations the congestion window is doubled every roundtrip until a packet
is lost (as is the case for most other sites I’ve analyzed, e.g. the session in
the screenshot below), and only after <em>that</em> might it move to cubic growth.
Perhaps that behaviour has changed; it’s completely up to the TCP implementation.</p>
<p>This is part of why I wrote <code>sirupsen/initcwnd</code>: it spits out the size of the
windows so you don’t have to do any math or guesswork. Here it is for a GitHub repo
(uncompressed):</p>
<figure><img src="/images/initcwnd-script.png" alt="" width="1692" height="904" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<h2 id="consolidating-our-new-model-with-the-napkin-math">Consolidating our new model with the napkin math</h2>
<p>So now we can explain the discrepancy between our simplistic napkin math model
and reality. We assumed 2 TLS roundtrips, but in fact there were 3, because of
the server’s low initial congestion window. We also assumed 1 HTTP
roundtrip, but in fact there were 6, because the server’s congestion window and the
client’s receive window didn’t allow sending everything at once. This brings our
total roundtrips to <code>1 + 3 + 6 = 10</code> roundtrips. With our roundtrip time at
130ms, this lines up perfectly with the 1.3s total time we observed at the top
of the post! This suggests our new, updated mental model of the system reflects
reality well.</p>
<h2 id="ok-cool-but-how-do-i-make-my-own-website-faster">Ok cool but how do I make my own website faster?</h2>
<p>Now that we’ve analyzed this website together, you can use this to analyze your
own website and optimize it. You can do this by running
<a href="https://github.com/sirupsen/initcwnd"><code>sirupsen/initcwnd</code></a> against your website. It uses some very simple
heuristics to guess the windows and their sizes. They don’t always work,
especially not if you’re on a slow connection or the website streams the
response back to the client rather than sending it all at once.</p>
<p>Another thing to be aware of is that the Linux kernel (and likely other kernels)
caches the congestion window size (among other things) with clients via the
route cache. This is great, because it means that we don’t have to renegotiate
it from scratch when a client reconnects. But it might mean that subsequent runs
against the same website will give you a far larger <code>initcwnd</code>. The lowest you
encounter will be the right one. Note also that a site might have a fleet with
servers that have different <code>initcwnd</code> values!</p>
<p>The output of <code>sirupsen/initcwnd</code> will be something like:</p>
<figure><img src="/images/initcwnd-script.png" alt="" width="1692" height="904" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Here we can see the size of the TCP windows. The initial window was 10 packets
for Github.com, and then doubles every roundtrip. The last window isn’t a full
80 packets, because there weren’t enough bytes left from the server.</p>
<p>With this result, we could decide to change the <code>initcwnd</code> to a higher value to
try to send it back in fewer roundtrips. This might, however, have drawbacks
for clients on slower connections and should be done with care. It does show
some promise that CDNs have values in the 30s. Unfortunately, I don’t have access
to enough traffic to study this for myself, as <a href="https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-initcwnd">Google did</a> when
they championed the change from a default of 3 to 10. That document also
explains potential drawbacks in more detail.</p>
<p>The most practical day-to-day takeaway might be that e.g. base64 inlining images
and CSS may come with serious drawbacks if it throws your site over a congestion
window threshold.</p>
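<p>To see that threshold effect, here’s a hypothetical example using a simple model where the congestion window doubles every roundtrip with no loss (the page sizes are made up for illustration):</p>

```python
MSS = 1460  # approximate data payload per packet

def roundtrips_to_send(payload_bytes: int, initcwnd: int) -> int:
    # Simple model: no loss, window doubles every roundtrip.
    cwnd, sent, trips = initcwnd, 0, 0
    while sent < payload_bytes:
        sent += cwnd * MSS
        cwnd *= 2
        trips += 1
    return trips

# With the Linux default initcwnd of 10 (~14.6kb), a 13kb page fits in a
# single roundtrip; inlining 5kb of base64 images/CSS pushes it over the
# threshold and costs a whole extra roundtrip of latency.
print(roundtrips_to_send(13_000, 10), roundtrips_to_send(18_000, 10))  # 1 2
```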
<p>You can change <code>initcwnd</code> with the <code>ip(1)</code> command on Linux; here, from the
default of 10 to 32:</p>
<pre class="language-text"><code class="language-text">simon@netherlands:~$ ip route show
default via 10.164.0.1 dev ens4 proto dhcp src 10.164.0.2 metric 100
10.164.0.1 dev ens4 proto dhcp scope link src 10.164.0.2 metric 100

simon@netherlands:~$ sudo ip route change default via 10.164.0.1 dev ens4 proto dhcp src 10.164.0.2 metric 100 initcwnd 32 initrwnd 32

simon@netherlands:~$ ip route show
default via 10.164.0.1 dev ens4 proto dhcp src 10.164.0.2 metric 100 initcwnd 32 initrwnd 32
10.164.0.1 dev ens4 proto dhcp scope link src 10.164.0.2 metric 100
</code></pre>
<p>Another TCP setting worth tuning is
<code>tcp_slow_start_after_idle</code>. It’s a good name: when set to 1 (the default), the
kernel renegotiates the congestion window after a few seconds of no activity
(e.g. while you read the site). You probably want to set this to 0 in
<code>/proc/sys/net/ipv4/tcp_slow_start_after_idle</code> so it remembers the congestion
window for the next page load.</p>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 14: Using checksums to verify syncing 100M database records]]></title>
        <id>https://sirupsen.com/napkin/problem-14-using-checksums-to-verify</id>
        <link href="https://sirupsen.com/napkin/problem-14-using-checksums-to-verify"/>
        <updated>2021-01-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A common problem you’ve almost certainly faced is to sync two datastores. This problem comes up in numerous shapes and forms: Receiving webhooks and writing them into your datastore, maintaining a materialized view, making sure a cache reflects reality, ensure documents make it from your source of truth to a search index, or your data from your transactional store to your data lake or column store.
<img src="/images/8b99afab-9ae3-47cf-8703-f465aaec1473.png" alt="" width="1380" hei]]></summary>
<content type="html"><![CDATA[<p>A common problem you’ve almost certainly faced is syncing two datastores. This problem comes up in numerous shapes and forms: receiving webhooks and writing them into your datastore, maintaining a materialized view, making sure a cache reflects reality, ensuring documents make it from your source of truth to a search index, or moving your data from your transactional store to your data lake or column store.</p>
<figure><img src="/images/8b99afab-9ae3-47cf-8703-f465aaec1473.png" alt="" width="1380" height="610" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If you’ve built such a system, you’ve almost certainly seen B drift out of sync. Building a completely reliable syncing mechanism is difficult, but perhaps we can build a checksumming mechanism to check if the two datastores are equal in a few seconds?</p>
<p>In this issue of napkin math, we look at implementing a solution to <strong>check whether A and B are in sync for 100M records in a few seconds</strong>. The key idea is to checksum an indexed <code>updated_at</code> column and use a binary search to drill down to the mismatching records. All of this will be explained in great detail, read on!</p>
<h2 id="why-are-syncing-mechanisms-unreliable">Why are syncing mechanisms unreliable?</h2>
<p>If you fire the events for your syncing mechanism after a transaction occurs, such as enqueuing a job, sending a webhook, or emitting a Kafka event, you can’t guarantee that the event <em>actually</em> gets sent after the transaction is committed. Almost certainly, part of the pipeline into database B is leaky due to bugs: perhaps there’s an exception you don’t handle, you drop events above a certain size on the floor, there’s some early return, or a deploy loses an event in a rare edge case.</p>
<p>But <em>even</em> if you’re doing something that’s theoretically bullet-proof, like using the database replication logs through <a href="https://debezium.io/">Debezium</a>, there’s still a good chance a bug somewhere in your syncing pipeline is causing you to lose occasional events. If theoretical guarantees were adequate, <a href="https://jepsen.io/">Jepsen</a> wouldn’t uncover much, would it? A team I worked with even wrote a TLA+ proof, but still found bugs with a solution like the one I describe here! In my experience, a checksumming system should be part of <em>any</em> syncing system.</p>
<p>It would seem to me that building reliable syncing mechanisms would be easier if databases had a standard, fast mechanism to answer the question: <em>“Do databases A and B have all the same data? If not, what’s different?”</em> Over time, as you fix your bugs, drift will of course happen more rarely, but being able to guarantee that the two sides are in sync is a huge step forward.</p>
<p>Unfortunately, this doesn’t exist as a user API in modern databases, but perhaps we can design such a mechanism <em>without</em> modifying the database?</p>
<p>This exploration will be fairly long. If you just want to see the final solution, scroll down to the end. This issue shows how to use napkin math to incrementally justify increasing complexity. While I’ve been thinking about this problem for a while, this is a fairly accurate representation of how I thought about it a few months ago when I started working on it. It’s also worth noting that when doing napkin math, I usually don’t write prototypes like this if I’m fairly confident in my understanding of the system underneath. I’m doing it here to make it more entertaining to read!</p>
<h2 id="assumptions">Assumptions</h2>
<p>Let’s start with some assumptions to plan out our ‘syncing checksum process’:</p>
<ul>
<li>100M records</li>
<li>1KiB per record (~100 GiB total)</li>
</ul>
<p>We’ll assume both ends are SQL-flavoured relational databases, but will address other datastores later, e.g. ElasticSearch.</p>
<h2 id="iteration-1-check-in-batches">Iteration 1: Check in Batches</h2>
<p>As usual, we will start by considering the simplest possible solution for checking whether two databases are in sync: a script that iterates through all records in batches to check if they’re the same. It’ll execute the SQL query below in a loop, iterating through the whole collection on both sides and report mismatches:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
<span class="token keyword">ORDER</span> <span class="token keyword">BY</span> id <span class="token keyword">ASC</span>
<span class="token keyword">LIMIT</span> <span class="token variable">@limit</span> <span class="token keyword">OFFSET</span> <span class="token variable">@offset</span>
</code></pre>
<p>Let’s try to figure out how long this would take: Let’s assume each loop is querying the two databases in parallel and our batches are 10,000 records (10 MiB total) large:</p>
<ul>
<li>In MySQL, reading 10 MiB off SSD at <a href="https://github.com/sirupsen/napkin-math#numbers">200 us/MiB</a> will take ~2ms. We assume   this to be sequential-ish, <a href="http://yoshinorimatsunobu.blogspot.com/2013/10/making-full-table-scan-10x-faster-in.html">but this is not entirely true</a>.</li>
<li>Serializing and deserializing the MySQL protocol at <a href="https://github.com/sirupsen/napkin-math#numbers">5 ms/MiB</a>, for a total of ~2 * 50ms = 100ms.</li>
<li>Network transfer at <a href="https://github.com/sirupsen/napkin-math#numbers">10 ms/MiB</a>, for a total of ~100ms.</li>
</ul>
<p>We’d then expect each batch to take roughly ~200ms.  This would bring our theoretical grand total for this approach to <code>200 ms/batch * (100M / 10_000) batches ~= 30min</code>.</p>
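<p>Spelled out in code (rates taken from the napkin-math reference numbers linked above):</p>

```python
BATCH_MIB = 10                    # 10,000 records at ~1 KiB each
BATCHES = 100_000_000 // 10_000   # 10,000 batches for 100M records

read_s  = BATCH_MIB * 200e-6      # SSD read at 200 us/MiB
serde_s = 2 * BATCH_MIB * 5e-3    # (de)serialize on both ends at 5 ms/MiB
net_s   = BATCH_MIB * 10e-3       # network transfer at 10 ms/MiB

batch_s = read_s + serde_s + net_s
total_min = batch_s * BATCHES / 60
print(f"{batch_s * 1000:.0f}ms per batch, ~{total_min:.0f} minutes total")
```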
<p>To test our hypothesis against reality, I implemented this to <a href="https://github.com/sirupsen/napkin-math/blob/master/newsletter/14-syncing/check.rb">run locally for the first 100 of the 10,000 batches</a>. In this local implementation, we won’t incur the network transfer overhead (we could’ve done this with <a href="https://github.com/shopify/toxiproxy">Toxiproxy</a>). Without the network overhead, we expect a query time in the 100ms ballpark. Running <a href="https://github.com/sirupsen/napkin-math/blob/master/newsletter/14-syncing/check.rb">the script</a>, I get the following plot:</p>
<figure><img src="/images/dfef5830-f658-4268-b655-ec23e64ce90c.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Ugh. The real performance is pretty far from our napkin math lower bound estimate. What’s going on here?</p>
<p>There’s a fundamental problem with our napkin math. Only the <em>very</em> first batch will read only <code>~10 MB</code> off of the SSD in MySQL. <code>OFFSET</code> queries will read through the data <em>before</em> the offset, even if it only returns the data after the offset! Each batch takes 3-5ms more than the last, which lines up well with reading another 10 MiB per batch from the increasing offset.</p>
<p>This is the reason why OFFSET-based pagination causes so much trouble in production systems. If we take the area under the graph here and extend to the 10,000 batches we’d need for our 100M records, we get a <strong>~3 day runtime</strong>.</p>
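<p>A rough model of that quadratic blow-up, using the measured ~3-5ms of extra scan time per batch (the ~100ms fixed per-batch cost for serialization and network is an assumption carried over from the earlier estimate):</p>

```python
BATCHES = 10_000
BASE_S = 0.1       # assumed fixed per-batch cost (serde + network)
GROWTH_S = 0.004   # each batch measured ~3-5ms slower than the last

total_s = sum(BASE_S + i * GROWTH_S for i in range(BATCHES))
print(f"~{total_s / 86_400:.1f} days")  # ~2.3 days
```

With the 5ms end of the measured range instead, the total lands right around the ~3 days above.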
<h2 id="iteration-2-outsmarting-the-optimizer">Iteration 2: Outsmarting the optimizer</h2>
<p>As <code>OFFSET</code> will scan through all these 1 KiB records, what if we scanned an index instead? It’ll be much smaller to skip 100,000s of records on an index where each record only occupies perhaps 64 bit. It’ll still grow linearly with the offset, but passing the previous batch’s 10,000 records is only 10 KiB which would only take a few hundred microseconds to read.</p>
<p>You’d think the optimizer would make this optimization itself, but it doesn’t. So we have to do it ourselves:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
<span class="token keyword">WHERE</span> id <span class="token operator">&gt;</span> <span class="token punctuation">(</span><span class="token keyword">SELECT</span> id <span class="token keyword">FROM</span> <span class="token keyword">table</span> <span class="token keyword">LIMIT</span> <span class="token number">1</span> <span class="token keyword">OFFSET</span> <span class="token variable">@offset</span><span class="token punctuation">)</span>
<span class="token keyword">ORDER</span> <span class="token keyword">BY</span> id <span class="token keyword">ASC</span> 
<span class="token keyword">LIMIT</span> <span class="token number">10000</span><span class="token punctuation">;</span>
</code></pre>
<figure><img src="/images/47a71e04-2c3d-48e6-a7de-c2240d1ac26f.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>It’s better, but just not by enough. It just delays the inevitable scanning of lots of data to find these limits. If we interpolate how long this’d take for 10,000 batches to process our 100M records, we’re still talking on the <strong>order of 14 hours</strong>. The 128x speedup doesn’t carry through, because it only applies to the MySQL part. Network transfer is still a large portion of the total time!</p>
<p>Either way, if you have some OFFSET queries lying around in your codebase, you might want to consider this optimization.</p>
<h2 id="iteration-3-parallelization">Iteration 3: Parallelization</h2>
<p>This seems like an embarrassingly parallel problem: can’t we just run 100 batches of 10,000 records in parallel? Can the database support that? Since we can pre-compute <em>all</em> the LIMITs and OFFSETs up front, let’s exploit that.</p>
<p>This seems kind of difficult to do the napkin math on. Typically when that’s the case, I try to solve the problem backwards: Fundamentally, the machine can <a href="https://github.com/sirupsen/napkin-math#numbers">read sequential SSD at 4 GiB/s</a>, which would be an absolute lower bound for how fast the database can work. The dataset is 100 GiB, as we established in the beginning.</p>
<p>If we’re using our optimization from iteration 2, then our queries are on average processing <code>50M * 64 bit</code> for the sub-query, and the <code>10 MiB</code> of returned data on top. That’s a total of ~400 MiB. So for our 10,000 batches, that’s 4.2 TB of data we will need to munch through with this query. We can read 1 GiB from SSD in 200ms, so that’s 14 minutes in total. That would be the <em>absolute</em> lowest bound, assuming essentially zero overhead from MySQL and not taking into consideration serialization, network, etc.</p>
<p>This also assumes the MySQL instance is doing <em>nothing</em> but serving our query, which is unrealistic. In reality, we’d dedicate <em>maybe</em> 10% of capacity to these queries, which puts us at 2 hours. Still faster, but a far cry from our hope of seconds or minutes. Buuh.</p>
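<p>Working that lower bound out in code (the ~200ms per GB SSD read rate is from the napkin-math reference numbers; the 10% capacity budget is an assumption):</p>

```python
BATCHES = 10_000
MIB_PER_BATCH = 400     # ~50M * 64-bit ids scanned + 10 MiB returned
SSD_S_PER_GB = 0.2      # read ~1 GB off SSD in ~200ms

total_bytes = BATCHES * MIB_PER_BATCH * 2**20   # ~4.2 TB
floor_s = total_bytes / 1e9 * SSD_S_PER_GB
print(f"absolute floor: ~{floor_s / 60:.0f} min")                # ~14 min
print(f"at 10% of DB capacity: ~{floor_s / 0.10 / 3600:.1f} h")  # ~2.3 h
```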
<h2 id="iteration-4-dropping-offset">Iteration 4: Dropping OFFSET</h2>
<p>It’s starting to seem like trouble to use these OFFSET queries, even as sub-queries. We held on to them for a while because they’re nice and easy to reason about, and they mean the queries can be fired off in parallel. We also held on to them to truly show how awful these types of queries are, so hopefully you think twice about using one in a production query again!</p>
<p>If we change our approach to maintain <code>max(id)</code> from the last batch, we can simply change our loop’s query to:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
<span class="token keyword">WHERE</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
<span class="token keyword">ORDER</span> <span class="token keyword">BY</span> id <span class="token keyword">ASC</span>
<span class="token keyword">LIMIT</span> <span class="token number">10000</span><span class="token punctuation">;</span>
</code></pre>
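<p>As a sanity check of the looping logic, here is a minimal Python sketch of the keyset-pagination driver. The names (<code>fetch_batch</code>, the in-memory <code>table</code>) are hypothetical stand-ins for issuing the query above against MySQL:</p>

```python
# Keyset pagination sketch: each batch comes from
# "WHERE id > last_id ORDER BY id ASC LIMIT batch_size", and the max
# id of the batch seeds the next query. fetch_batch simulates that
# query against an in-memory, id-sorted table.
def fetch_batch(table, last_id, batch_size):
    return [row for row in table if row["id"] > last_id][:batch_size]

def iterate_all(table, batch_size=3):
    last_id = 0
    while True:
        batch = fetch_batch(table, last_id, batch_size)
        if not batch:
            break
        yield batch
        last_id = batch[-1]["id"]  # max(id) from the last batch

# Ids need not be contiguous: deleted rows simply leave gaps.
table = [{"id": i} for i in (1, 2, 5, 7, 8, 11, 12)]
batches = list(iterate_all(table))
```

<p>Note that the loop is inherently sequential: each query depends on the previous batch’s <code>max(id)</code>, which is the price we pay for dropping OFFSET.</p>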
<p>This curbed the linear growth!</p>
<figure><img src="/images/6b0263d5-c59f-4127-a573-6b06d615c195.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Now MySQL can use its efficient primary key index to do <a href="https://www.wolframalpha.com/input/?i=log%28100*10%5E6%29%2Flog%281024%2F3*2%2F%288%2B4%29%29+%2B+1++*https%3A%2F%2Fdev.mysql.com%2Fdoc%2Frefman%2F8.0%2Fen%2Festimating-performance.html*">~6 SSD seeks</a> on <code>id</code> and then scan forward. This means we only process and serialize 10 MiB, putting our napkin math consistently around 100ms per batch, as in the original estimate from iteration 1. That means this solution should <strong>finish in about half an hour!</strong> However, we learned in the previous iteration that we can only take 10% of the database’s capacity, so as calculated in iteration 3, we’re back at 2 hours.</p>
<p>We fundamentally need an approach that handles less data, since serialization and network time are now the primary reasons the integrity checking is slow.</p>
<h2 id="iteration-5-checksumming">Iteration 5: Checksumming</h2>
<p>If we want to handle less data, we need to have some way to fingerprint or checksum each record. We could change our query to something along the lines of:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> MD5<span class="token punctuation">(</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token keyword">FROM</span> <span class="token keyword">table</span>
<span class="token keyword">WHERE</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
<span class="token keyword">ORDER</span> <span class="token keyword">BY</span> id <span class="token keyword">ASC</span>
<span class="token keyword">LIMIT</span> <span class="token number">10000</span><span class="token punctuation">;</span>
</code></pre>
<p>If there’s a mismatch, we simply revert to iteration 4 and find the rows that mismatch, but we have to scan far less data as we can assume the majority of it lines up.</p>
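<p>The control flow can be sketched in a few lines of Python: compare cheap per-batch checksums first, and only fetch full rows (iteration-4 style) for the batches that disagree. <code>batch_checksum</code> here is a hypothetical stand-in for the SQL checksum query:</p>

```python
import hashlib

# Compare batches by checksum; only mismatching batches need the
# expensive full-row comparison as a fallback.
def batch_checksum(rows):
    h = hashlib.md5()
    for row in rows:
        h.update(repr(sorted(row.items())).encode())
    return h.hexdigest()

def mismatched_batches(batches_a, batches_b):
    return [i for i, (a, b) in enumerate(zip(batches_a, batches_b))
            if batch_checksum(a) != batch_checksum(b)]

a = [[{"id": 1, "v": "x"}, {"id": 2, "v": "y"}], [{"id": 3, "v": "z"}]]
b = [[{"id": 1, "v": "x"}, {"id": 2, "v": "y"}], [{"id": 3, "v": "Z"}]]
bad = mismatched_batches(a, b)  # only the second batch needs row-level work
```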
<p>Before moving on, let’s see whether the napkin math works out:</p>
<ul>
<li>Reading 10 MiB off SSD at <a href="https://github.com/sirupsen/napkin-math#numbers">200 us/MiB</a> will take ~2ms.</li>
<li>Hashing 10 MiB at <a href="https://github.com/sirupsen/napkin-math#numbers">5 ms/MiB</a> will take ~50ms.</li>
<li>6 SSD seeks to find the ID at <a href="https://github.com/sirupsen/napkin-math#numbers">100 us/seek</a> will take ~600 us.</li>
<li>1 network round-trip of the 16-byte hash at ~250 us.</li>
</ul>
<p>This is promising! In reality, it requires a little more SQL wrestling, for MySQL:</p>
<pre class="language-sql"><code class="language-sql">SELECT max(id) as max_id, MD5(CONCAT(
  MD5(GROUP_CONCAT(UNHEX(MD5(COALESCE(t.col_a, ''))))),
  MD5(GROUP_CONCAT(UNHEX(MD5(COALESCE(t.col_b, ''))))),
  MD5(GROUP_CONCAT(UNHEX(MD5(COALESCE(t.col_c, '')))))
)) as checksum FROM (
  SELECT id, col_a, col_b, col_c FROM `table`
  WHERE id &gt; @max_id_from_last_batch
  ORDER BY id ASC
  LIMIT 10000
) t
</code></pre>
<p>We seem to match our napkin math well:</p>
<figure><img src="/images/4c051bd7-ce00-4b60-ab50-3374366e4a71.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>This is the place to stop if you want to err on the side of safety. This is how we <a href="https://www.usenix.org/conference/srecon19emea/presentation/li">verify the integrity when we move shops between shards at Shopify</a>, which is what this approach is inspired by. However, to push performance further we need to get rid of some of this inline aggregation and hashing which eats up all our performance budget. At 50ms/batch, we’re still at <strong>~10 minutes to complete the checksumming of 100M records</strong>.</p>
<h2 id="iteration-6-checksumming-with-updated_at">Iteration 6: Checksumming with <code>updated_at</code></h2>
<p>Many database schemas have an <code>updated_at</code> column containing the timestamp at which the record was last updated. We can use this as the checksum for the row, assuming the granularity of the timestamp is sufficient (in many cases granularity is only seconds, but e.g. <a href="https://dev.mysql.com/doc/refman/8.0/en/fractional-seconds.html">MySQL supports fractional-second granularity</a>).</p>
<p>A huge performance advantage of this is that we can use an index on <code>updated_at</code>, and no longer read and hash the full 1 KiB rows! We now only need to read and hash the 64-bit timestamps. This cuts the data we need to read per batch from 10 MiB down to ~80 KiB!</p>
<p>Additionally, instead of a hash-based checksum, we can simply use a <code>SUM</code> of the <code>updated_at</code> values. This has the nice properties of being much faster and of not requiring the same sort order in the other database. That becomes very important when checksumming against a database that might not easily return records in the same order, e.g. ElasticSearch/Lucene.</p>
<p>Won’t summing so many records overflow? Nah, UNIX timestamps are currently approaching 32 bits, which means a 64-bit accumulator can sum around 2^32 ~= 4 billion of them without overflowing. Isn’t a sum a poor checksum? Sure, a hash is safer, but this is not crypto, just simple checksumming. It seems sufficient to me. It might not be in your case, in which case you can use MD5, SHA1, or CRC32, or fall back to the solution from iteration 5.</p>
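<p>The overflow argument checks out on the napkin:</p>

```python
# A signed 64-bit accumulator holds up to 2^63 - 1. With UNIX
# timestamps currently below 2^31, billions of timestamps can be
# summed before overflow is a concern.
now = 1_700_000_000                       # a recent UNIX timestamp, < 2**31
rows_until_overflow = (2**63 - 1) // now
assert rows_until_overflow > 4_000_000_000  # billions of rows of headroom
```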
<p>We still need an OFFSET sub-query to find the batch boundary, since we can’t rely on ids increasing by exactly 1: some ids may have been deleted:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token function">max</span><span class="token punctuation">(</span>id<span class="token punctuation">)</span> <span class="token keyword">as</span> max_id<span class="token punctuation">,</span>
  <span class="token function">SUM</span><span class="token punctuation">(</span>UNIX_TIMESTAMP<span class="token punctuation">(</span>updated_at<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">as</span> checksum
<span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span> <span class="token keyword">WHERE</span> id <span class="token operator">&lt;</span> <span class="token punctuation">(</span>
  <span class="token keyword">SELECT</span> id <span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
	<span class="token keyword">WHERE</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
	<span class="token keyword">LIMIT</span> <span class="token number">1</span> <span class="token keyword">OFFSET</span> <span class="token number">10000</span>
<span class="token punctuation">)</span> <span class="token operator">AND</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
</code></pre>
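<p>To make the order-independence concrete, here is a tiny Python sketch of the <code>SUM(updated_at)</code> comparison. The stores are hypothetical in-memory stand-ins for the two databases:</p>

```python
# Because addition is commutative, the two stores can return rows in
# different orders and still produce the same checksum.
def checksum(rows):
    return sum(r["updated_at"] for r in rows)

primary = [{"id": 1, "updated_at": 1_000}, {"id": 2, "updated_at": 1_005}]
replica = list(reversed(primary))            # same rows, different order
drifted = [{"id": 1, "updated_at": 1_000}, {"id": 2, "updated_at": 1_007}]

assert checksum(primary) == checksum(replica)   # order doesn't matter
assert checksum(primary) != checksum(drifted)   # a stale row shows up
```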
<p>Let’s take inventory:</p>
<ul>
<li>Reading 80 KiB of the <code>updated_at</code> index off SSD at <a href="https://github.com/sirupsen/napkin-math#numbers">1 us/8 KiB</a> will take ~10 us.</li>
<li>Summing 80 KiB at <a href="https://github.com/sirupsen/napkin-math#numbers">5 ns/64 bytes</a> will take ~6 us.</li>
<li>6 SSD seeks to find the ID at <a href="https://github.com/sirupsen/napkin-math#numbers">100 us/seek</a> will take ~600 us.</li>
<li>1 network round-trip of the 16-byte hash at ~250 us.</li>
</ul>
<p>In theory, this query should take milliseconds! In reality, there’s overhead involved, and we can’t assume in MySQL that reads are completely sequential as <a href="http://yoshinorimatsunobu.blogspot.com/2013/10/making-full-table-scan-10x-faster-in.html">fragmentation occurs</a> on indexes and the primary key.</p>
<figure><img src="/images/d9518021-e556-466e-b9aa-5e2f50351ae2.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Without the first iteration:</p>
<figure><img src="/images/917cb97e-2dd5-47ab-96bc-f612abece5f5.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>What’s going on? We were expecting single-digit milliseconds, but we’re seeing 20ms per batch! Something is wrong. <strong>20ms per batch still means a total checksumming time of 3 min.</strong> We’ve got more work to do.</p>
<h2 id="iteration-7-using-the-right-indexes">Iteration 7: Using the right indexes</h2>
<p>An <code>EXPLAIN</code> reveals we’re using the <code>PRIMARY</code> key for both queries, which means we’re loading the entire 1 KiB records, not just the 64-bit timestamps from the <code>updated_at</code> index.</p>
<p>Using indexes on <code>(id)</code> and <code>(id, updated_at)</code>, we need to scan <em>much</em> less data. It’s counter-intuitive to create an index on <code>id</code>, since the primary key already acts as an “index.” The problem is that in MySQL the primary key is a clustered index: it holds <em>all</em> the row data, not just the 64-bit id, so scanning it means scanning over <em>a lot</em> of record data. Clustered indexes are great in many cases for minimizing seeks, but problematic in others. Since these indexes already existed, this is another example of the MySQL optimizer not making the right decision for us. Forcing these indexes, our query becomes:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token function">max</span><span class="token punctuation">(</span>id<span class="token punctuation">)</span> <span class="token keyword">as</span> max_id<span class="token punctuation">,</span> 
  <span class="token function">SUM</span><span class="token punctuation">(</span>UNIX_TIMESTAMP<span class="token punctuation">(</span>updated_at<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">as</span> checksum
<span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
<span class="token keyword">FORCE</span> <span class="token keyword">INDEX</span> <span class="token punctuation">(</span><span class="token identifier"><span class="token punctuation">`</span>index_table_id_updated_at<span class="token punctuation">`</span></span><span class="token punctuation">)</span> 
<span class="token keyword">WHERE</span> id <span class="token operator">&lt;</span> <span class="token punctuation">(</span>
  <span class="token keyword">SELECT</span> id
	<span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
	<span class="token keyword">FORCE</span> <span class="token keyword">INDEX</span> <span class="token punctuation">(</span><span class="token identifier"><span class="token punctuation">`</span>index_table_id<span class="token punctuation">`</span></span><span class="token punctuation">)</span>
	<span class="token keyword">WHERE</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
  <span class="token keyword">LIMIT</span> <span class="token number">1</span> <span class="token keyword">OFFSET</span> <span class="token number">10000</span>
<span class="token punctuation">)</span>  <span class="token operator">AND</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
</code></pre>
<figure><img src="/images/4852b7f2-f211-4ac2-b5d7-3633b594562a.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Nice, that’s quite a bit faster. Let’s remove the previous iterations to make the graphs we care about easier to see:</p>
<figure><img src="/images/fe4783ed-9ba2-4580-967a-e9958bc89856.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>5ms per batch is close to the theoretical floor we established in iteration 6! To checksum our full 100M records, this would take 50 seconds. We aren’t going to get much better than this as far as I can tell without modifying MySQL or pre-computing the checksums with e.g. triggers.</p>
<p>What about database constraints? Will this hog the entire database, as we struggled with in the early iterations? Fortunately, this solution is much less I/O-heavy. We need to read only 2-3 GiB of indexes in total to serve these queries. Spread over 50 seconds, that’s tens of MiB/s, so we should be good.</p>
<p>The last trick to consider is not checksumming <em>all</em> records on every pass. We could add a condition to only checksum records updated in the past few minutes, <code>updated_at &gt;= TIMESTAMPADD(MINUTE, -5, NOW())</code>, and do full checks only periodically. You would likely also want to ignore records updated in the past few seconds, to give replication time to catch up: <code>updated_at &lt;= TIMESTAMPADD(SECOND, -30, NOW())</code>. We <em>do</em> still want our fast way to scan all records, as this is by far the safest, and for a database with 10,000s of changes per second it also needs to be <em>fast</em>. The full check is also paramount when we bring up new databases and during development.</p>
<h2 id="what-do-we-do-on-a-mismatch">What do we do on a mismatch?</h2>
<p>Great, so we can now check whether batches are the same across two SQL databases quickly. We could build APIs for this to avoid users querying each other’s database. But what do we do when we have a mismatch?</p>
<p>We could send every record in the batch, but those queries are still fairly taxing, especially if we are checksumming batches of 100,000s of records to optimize the checksumming performance.</p>
<p>We can perform a binary search: if we are checksumming 100,000 records and encounter a mismatch, we cut the range into two queries checksumming 50,000 records each. Whichever half has the mismatch, we slice in two <em>again</em>, until we find the record(s) that don’t match!</p>
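<p>A minimal Python sketch of that bisection, where <code>check</code> is a hypothetical stand-in for the per-range checksum query against each database:</p>

```python
# Recursively split a mismatching id range until the drifted
# records are isolated.
def find_mismatches(ids, check_a, check_b):
    if check_a(ids) == check_b(ids):
        return []                 # checksums agree: nothing to do here
    if len(ids) == 1:
        return list(ids)          # narrowed down to a single bad record
    mid = len(ids) // 2
    return (find_mismatches(ids[:mid], check_a, check_b) +
            find_mismatches(ids[mid:], check_a, check_b))

a = {i: i * 10 for i in range(100_000)}  # id -> updated_at
b = dict(a)
b[42_123] = 0                            # one drifted record
check = lambda store: (lambda ids: sum(store[i] for i in ids))
bad = find_mismatches(range(100_000), check(a), check(b))  # -> [42123]
```

<p>Each level of the recursion costs two checksum queries over half the range, so isolating a single bad record among N takes on the order of 2·log2(N) queries.</p>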
<p>This approach is very similar to the Merkle tree synchronization I described in <a href="https://sirupsen.com/napkin/problem-9/">problem 9</a>. You can think of the approach we’ve landed on here as Merkle tree synchronization between two databases, but it’s simpler just to think of it as checksumming in batches. This approach is also quite similar to how <a href="http://tutorials.jenkov.com/rsync/overview.html">rsync works</a>.</p>
<h2 id="what-about-other-types-of-databases">What about other types of databases?</h2>
<p>While we covered SQL-to-SQL checksumming here, I’ve implemented a prototype of this method to check whether all records from a MySQL database make it to an ElasticSearch cluster. ElasticSearch, just like MySQL, can sum <code>updated_at</code> quickly. Most databases that support any kind of aggregation should work for this. Datastores like Memcached or Redis would require more thought, as they don’t implement aggregations; checking the integrity of a cache this way would be an interesting use-case, but it would require core changes to those datastores.</p>
<p>Hope you enjoyed this. I think this is a neat pattern that I hope to see more adoption of, and perhaps even see some databases and APIs adopt natively. Wouldn’t it be great if you could check that all your data is up-to-date just about everywhere with just a couple of API calls exchanging hashes?</p>
<p>P.S. A few weeks ago this newsletter hit 1,000 subscribers. I’m really grateful to all of you for listening in! It’s been quite fun to write these posts. It’s my favourite kind of recreational programming.</p>
<p>The <a href="https://github.com/sirupsen/napkin-math#numbers">napkin math reference</a> has also recently been extensively updated, in part to support this issue.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 13: Filtering with Inverted Indexes]]></title>
        <id>https://sirupsen.com/napkin/problem-13-filtering-with-inverted-indexes</id>
        <link href="https://sirupsen.com/napkin/problem-13-filtering-with-inverted-indexes"/>
        <updated>2020-11-08T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Database queries are all about filtering. Whether you’re finding rows with a particular name, within a price-range, or those created within a time-window. Trouble, however, ensues for most databases when you have many filters and none of them narrow down the results much.
This problem of filtering on many attributes efficiently has haunted me since Problem 3, and again in Problem 9. Queries that mass-filter are conceptually common in commerce merchandising/collections/discover]]></summary>
        <content type="html"><![CDATA[<p>Database queries are all about filtering. Whether you’re finding rows with a particular name, within a price-range, or those created within a time-window. Trouble, however, ensues for most databases when you have <em>many</em> filters and none of them narrow down the results much.</p>
<p>This problem of filtering on many attributes efficiently has haunted me since Problem 3, and again in Problem 9. Queries that mass-filter are conceptually common in commerce merchandising/collections/discovery/discounts where you expect to narrow down products by many attributes. Devilish queries of the type below might be used to create a “Blue Training Sneaker Summer Mega-Sale” collection. The merchant might have tens of millions of products, and each attribute might be on millions of products. In SQL, it might look something like the following:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> id
<span class="token keyword">FROM</span> products
<span class="token keyword">WHERE</span> color<span class="token operator">=</span>blue <span class="token operator">AND</span> <span class="token keyword">type</span><span class="token operator">=</span>sneaker <span class="token operator">AND</span> activity<span class="token operator">=</span>training 
  <span class="token operator">AND</span> season<span class="token operator">=</span>summer <span class="token operator">AND</span> inventory <span class="token operator">&gt;</span> <span class="token number">0</span> <span class="token operator">AND</span> price <span class="token operator">&lt;=</span> <span class="token number">200</span> <span class="token operator">AND</span> price <span class="token operator">&gt;=</span> <span class="token number">100</span> 
</code></pre>
<p>These are <em>especially</em> challenging when you expect the database to return a
result in a time-frame that’s suitable for a web request (sub 10 ms).
Unfortunately, classic relational databases are typically not suited for serving
these types of queries efficiently on their B-Tree based indexes for a few
reasons. The two arguments that top the list for me:</p>
<ol>
<li><strong>The data doesn’t conform to a strict schema.</strong> A product might have 100s to
1000s of attributes we need to efficiently filter against. This might mean
having extremely wide rows, with 100s of indexes, which leads to a number of
other issues.</li>
<li><strong>Databases struggle to merge multiple indexes.</strong>
<ol>
<li>Index merges aren’t going to get you a &lt; 10 ms response, and creating
composite indexes is impractical if you are filtering by 10s to 100s of
rules. I wrote a <a href="/index-merges">separate post</a> about that.</li>
<li>While MySQL/Postgres can filter by <code>price</code> and <em>then</em> <code>type</code> to serve a
query, it can’t filter efficiently by scanning and cross-referencing
multiple indexes  simultaneously (this requires Zig-Zag joins, see
<a href="https://github.com/cockroachdb/cockroach/issues/23520">here</a> for more context).</li>
</ol>
</li>
</ol>
<p>Using B-Trees for mass-filtering deserves deeper thought and napkin math (these two problems don’t seem impossible to solve), and given how much this problem troubles me, I might follow up with more detail in another issue. It’s also worth noting that Postgres and MySQL both implement inverted indexes, so those could be used instead of the implementation below.</p>
<p>But in this issue we will investigate the inverted index as a possible data-structure for serving many-filter queries efficiently. The inverted index (explained below) is the data-structure that powers search. We will be using Lucene, which is the most popular open-source implementation of the inverted index. It’s what powers ElasticSearch and Solr, the two most popular open-source search engines. You can think of Lucene as the RocksDB/InnoDB of search. Lucene is written in Java.</p>
<p>Why would we want to use a search engine to filter data? Because search as a problem is a superset of our filtering problem. Search is fundamentally about turning a language query <code>blue summer sneakers</code> into a series of filtering operations: intersect products that match <code>blue</code>, <code>summer</code>, and <code>sneaker</code>. Search has a language component, e.g. turning <code>sneakers</code> into <code>sneaker</code>, but the filtering problem is the same. If search is fundamentally language + filtering, perhaps we can use <em>just</em> the filtering bit? Search is typically <em>not</em> implemented on top of B-Tree indexes (what classic databases use), but use an inverted index. Perhaps that can resolve problem (1) and (2) above?</p>
<p>The inverted index is best illustrated through a simple drawing:</p>
<figure><img src="/images/14930fea-d1c1-4b03-b975-0b58431ce592.png" alt="" width="1641" height="1084" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>In our inverted index, each attribute (color, type, activity, ..) maps to a list of product ids that have that attribute. We can create a filter for <code>blue</code>, <code>summer</code>, and <code>sneakers</code> by finding the intersection of product ids that match <code>blue</code>, <code>summer</code>, and <code>sneakers</code> (ids that are present for all terms).</p>
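<p>The core operation can be sketched with the classic two-pointer walk over sorted posting lists. Lucene’s real implementation adds skip structures and other tricks, so treat this as a napkin model only; the postings below are made up:</p>

```python
# Intersect two sorted lists of product ids by advancing whichever
# pointer lags behind; O(len(a) + len(b)) comparisons.
def intersect(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical postings: attribute -> sorted product ids.
index = {
    "blue":    [1, 4, 7, 9, 12],
    "sneaker": [2, 4, 9, 12, 15],
    "summer":  [4, 5, 9, 10, 12, 17],
}
result = intersect(intersect(index["blue"], index["sneaker"]), index["summer"])
# result == [4, 9, 12]
```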
<p>Let’s say we have 10 million products, and we are filtering by 3 attributes with 1.2 million products each. What can we expect the query time to be?</p>
<p>Let’s assume the product ids are each stored as an uncompressed 64-bit integer in memory. We’d expect each attribute’s list to be <code>1.2 million * 64 bit ~= 10mb</code>, or <code>10 * 3 = 30mb</code> total. We assume the intersection algorithm is efficient and reads all the data roughly once (in reality there’s a lot of smart skipping involved, but this is napkin math; we won’t go into detail on how to efficiently merge two sets). We can <a href="https://github.com/sirupsen/napkin-math#numbers">read memory at a rate of <code>1 Mb/100 us</code></a> (sequential reads from SSD are only about twice as slow), so serving the query should take ~<code>0.1 ms * 30 = 3ms</code>. I <a href="https://gist.github.com/sirupsen/0c1d388d94d9de611c54df866e6d1708">implemented this in Lucene</a>, and the napkin math lines up well with reality: in my implementation, this takes ~3-5ms! That’s great news for solving the filtering problem with an inverted index. That’s fairly fast.</p>
<p>Now, does this scale linearly? Including more attributes means scanning more memory. With 8 attributes, for example, we’d expect to scan ~<code>10mb * 8 = 80mb</code> of memory, which should take ~<code>0.1ms * 80 = 8ms</code>. In reality, however, this takes <code>30-60ms</code>, putting our napkin math close to an order of magnitude off. Most likely we have exhausted the CPU’s L3 cache and have to go to main memory more; we hit a similar boundary going from 3 to 4 attributes. It might also suggest there’s room for optimization in Lucene.</p>
<figure><img src="/images/1b9cb6e5-ca15-4a51-9acb-ea83d1facbba.png" alt="" width="1691" height="1030" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Another interesting thing to note is that the inverted index file for our problem is roughly ~261mb. I won’t bore you with the calculation here, but given <a href="https://gist.github.com/sirupsen/0c1d388d94d9de611c54df866e6d1708">the implementation</a>, we can estimate that each product id takes <a href="https://www.wolframalpha.com/input/?i=261mb+%2F+%28257+*+39098+%2B+65+*+153759+%2B+257+*+1209758%29">up ~6.3 bits</a>. This is <em>much</em> smaller than the 64 bits per product id we estimated; the JVM overhead, however, likely makes up for it. Additionally, Lucene doesn’t just store the product ids, but also various other metadata alongside them.</p>
<p>Based on this, it’s looking feasible to use Lucene for mass filtering! While we don’t have an estimate from SQL to measure against yet (and won’t have in this issue), I can assure you this is faster than we’d get with something naive.</p>
<p>But why is it feasible even if 4 attributes take ~20ms (as we can see in the diagram)? Because that’s acceptable-ish performance in a worst-case scenario. In most cases when you’re filtering, some of the attributes will significantly narrow the search space. Since we aren’t that close to the lower bound of performance (what our napkin math tells us), it suggests we might not be constrained by memory bandwidth, but by computation. That in turn suggests threaded execution could speed it up, and sure enough, it does: with 8 threads in Lucene’s read thread pool, we can serve the query for 4 attributes in ~6ms! That’s <em>faster</em> than our 8ms lower-bound. The reason is that Lucene has optimizations built in to skip over potentially large blocks of product ids when intersecting, meaning we don’t have to read all the product ids in the inverted index.</p>
<p>In reality, to go further, we’d want to do more napkin math, but this is showing a lot of promise! Besides more calculations, we’ve left out two big pieces here: sorting and indexing numbers. If there’s interest, I might follow up with that another time. But this is plenty for one issue!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 12: Recommendations]]></title>
        <id>https://sirupsen.com/napkin/problem-12-recommendations</id>
        <link href="https://sirupsen.com/napkin/problem-12-recommendations"/>
        <updated>2020-09-27T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Since last, I sat down with Adam and Jerod from The Changelog podcast to discuss Napkin
Math! This ended up yielding quite a few new subscribers,
welcome everyone!
For today’s edition: Have you ever wondered how recommendations work on a site
like Amazon or Netflix?
]]></summary>
        <content type="html"><![CDATA[<p>Since last, I sat down with Adam and Jerod from <a href="https://changelog.com/podcast/412">The Changelog podcast to discuss Napkin
Math</a>! This ended up yielding quite a few new subscribers,
welcome everyone!</p>
<p>For today’s edition: Have you ever wondered how recommendations work on a site
like Amazon or Netflix?</p>
<figure><img src="/images/a1f1f9c3-be46-4f82-b1f8-32f24e736446.jpeg" alt="" width="2470" height="1204" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>First we need to define similarity/relatedness. There are many ways to do this. We
could determine similarity by having a human label which products are relevant
to each other: if you’re buying black dress shoes, you might be interested in
black shoe polish. But if you’ve got millions of products, that’s a lot of
work!</p>
<p>Instead, most simple recommendation algorithms are based on what’s called
“collaborative filtering”: we find other users who seem to be similar to you.
If we know you’ve got a big overlap in watched TV shows with another user,
perhaps you might like something else that user liked but you haven’t watched yet?
This recommendation method is <em>much</em> less laborious than having a human manually
label content (in reality, big companies do <a href="https://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/">human labeling <em>and</em>
collaborative filtering</a> <em>and</em> <a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/">other dark magic</a>).</p>
<p>In the example below, User 3 looks similar to User 1, so we can infer that they
<em>might</em> like Item D too. In reality, the more columns (items) we can use to
compare, the better the results.</p>
<figure><img src="/images/64eda434-833b-4e6b-b7e0-9084ebd0a52e.png" alt="" width="1949" height="951" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Based on this, we can design a simple algorithm for powering our
recommendations! With <code>N</code> items and <code>M</code> users, we can create the matrix of <code>M x N</code> cells shown in the drawing as a two-dimensional array, representing
check-marks by <code>1</code> and empty cells by <code>0</code>. We can loop through each user and
compare them with every other user, preferring recommendations from users we have more
check-marks in common with. This is a simplification of <a href="https://www.machinelearningplus.com/nlp/cosine-similarity">cosine similarity</a>,
the simple vector math typically used to compare the similarity of two
vectors. The ‘vector’ here is the 0s and 1s for each product for the user.
For the purpose of this article, it’s not terribly important to understand this
in detail.</p>
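<p>As a minimal sketch of the idea (the users and items below are made up for illustration, not taken from the drawing), cosine similarity on 0/1 vectors boils down to counting shared items and normalizing by how many items each user has:</p>

```python
import math

# Cosine similarity between two 0/1 'watched/bought' vectors. For binary
# vectors the dot product is just the count of items both users share.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(a)) * math.sqrt(sum(b))  # works since 1*1 == 1
    return dot / mag if mag else 0.0

# Hypothetical users over items A..D: 1 = interacted, 0 = not.
user1 = [1, 1, 0, 1]
user2 = [0, 1, 1, 0]
user3 = [1, 1, 0, 0]

# user3 shares more history with user1 than user2 does, so user1's items
# that user3 hasn't seen yet become recommendation candidates for user3.
print(cosine_similarity(user1, user3))  # ~0.82
print(cosine_similarity(user1, user2))  # ~0.41
```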
<p><strong>How long does it take to run this algorithm to find similar users for a million users
and a million products?</strong></p>
<p>Each user would have a million bits to represent the columns. That’s <code>10^6 bits = 125 kB</code> per user. For each user, we’d need to look at every other user: <code>125 kB/user * 1 million users = 125 GB</code>. 125 GB is not completely unreasonable to
hold in memory, and since it’s sequential access, even if this was SSD-backed
and not all in memory, it’d still be fast. We can read memory at <a href="https://github.com/sirupsen/napkin-math">~10 GB/s</a>,
so that’s 12.5 seconds to find the most similar user for each user. That’s way
too slow to run as part of a web request!</p>
<p>If we precomputed this in the background on a single machine, it’d take
<code>12.5 s/user * 1 million users = 12.5 million seconds ~= 144 days ~= 20 weeks</code>.
That sounds frightening, but this is an ‘embarrassingly parallel’ problem. It
means we can process User A’s recommendations on one machine, User B’s on
another, and so on. This is what a batch compute job on e.g. Spark would do.
This is really <code>12.5 million CPU-seconds</code>. If we had 3000 cores, it’d take us
about an hour and cost us <code>3000 cores * 1 hour * $0.02 per core-hour = $60</code>. Most likely these
recommendations would earn us way more than $60, so even this is not too bad!
When people talk about Big Data computations, these are the types of large jobs
they’re referring to.</p>
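<p>The arithmetic above can be sketched out directly. The bandwidth and pricing figures are the assumptions from the text, not measurements:</p>

```python
# Napkin math for the user-to-user scan, using the assumptions from the text.
BITS_PER_USER = 10**6                  # one bit per item
BYTES_PER_USER = BITS_PER_USER / 8     # 125 kB
USERS = 10**6
MEM_BANDWIDTH = 10 * 10**9             # ~10 GB/s sequential read

# Comparing one user against everyone means scanning every user's row.
scan_bytes = BYTES_PER_USER * USERS            # 125 GB
seconds_per_user = scan_bytes / MEM_BANDWIDTH  # 12.5 s

# Doing that for every user, spread across a cluster.
total_cpu_seconds = seconds_per_user * USERS   # 12.5 million CPU-seconds
CORES = 3000
hours = total_cpu_seconds / CORES / 3600       # ~1.2 hours
cost = CORES * hours * 0.02                    # at $0.02 per core-hour
print(seconds_per_user, hours, cost)
```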
<p>Even with this simple algorithm, there is <em>plenty</em> of room for optimization.
There will be a lot of zeros in such a wide matrix (it’s ‘sparse’), so we could store
vectors of item ids instead. We could quickly skip users if they have fewer 1s
than the most similar user we’ve already matched with. Additionally, matrix
operations like this one can be run efficiently on a GPU. If I knew more about
GPU programming, I’d do the napkin math on that! It’s on the list for future
editions. The good thing is that the libraries used for computations like this
usually do these types of optimizations for you.</p>
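<p>The sparse-matrix optimization can be sketched like this: store only the ids of the items each user interacted with, and compute overlap as a set intersection (the user and item ids here are made up for illustration):</p>

```python
# Sparse representation: instead of a million-wide bit row per user, store
# only the ids of items they interacted with. Overlap becomes set intersection.
user_items = {
    "user1": {3, 17, 42},
    "user2": {17, 42, 99},
    "user3": {7},
}

def overlap(a, b):
    return len(user_items[a] & user_items[b])

# Most similar user to user1 by raw overlap count:
best = max((u for u in user_items if u != "user1"),
           key=lambda u: overlap("user1", u))
print(best)  # user2
```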
<p>Cool, so this naive algorithm is feasible for a first iteration
of our recommendation system. We compute the recommendations periodically on
a large cluster and shove them into MySQL/Redis/whatever for quick access on our
site.</p>
<p>But there’s a problem… If I just added a spatula to my cart, don’t you want
to immediately recommend other kitchen utensils to me? Our current algorithm is
great for general recommendations, but it isn’t real-time enough to assist
a shopping session. We can’t wait for the batch job to run again. By that time,
we’ll already have bought a shower curtain and forgotten to buy a curtain rod,
since the recommendation never surfaced. Bummer.</p>
<p>What if instead of a big offline computation to figure out user-to-user
similarity, we do a big offline computation to compute item-to-item similarity?
This is what <a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf">Amazon did back in 2003</a> to solve this problem. Today, they
likely do something much more advanced.</p>
<p>We could devise a simple item-to-item similarity algorithm that, for each
item, counts which items customers who bought that item <em>also</em> bought most
often.</p>
<p>The output of this algorithm would be something like the matrix below. Each cell
is the count of customers who bought both items. For example, 17
people bought both item 4 and item 1, which in comparison to the other counts means
it might be a great idea to suggest item 1 to people buying item 4, or
vice-versa!</p>
<figure><img src="/images/49676787-b801-4066-aa59-f6a28ee80d8d.png" alt="" width="1648" height="881" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
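<p>The counting itself can be sketched in a few lines. The orders below are hypothetical, and a real job would run this over billions of orders in a batch framework rather than in one process:</p>

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: the set of items each customer bought.
orders = [
    {"item1", "item4"},
    {"item1", "item2", "item4"},
    {"item2", "item3"},
]

# For every pair of items, count how many customers bought both.
pair_counts = Counter()
for items in orders:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# 'Customers who bought item 4 also bought...' is then a lookup:
also_bought = {pair: n for pair, n in pair_counts.items() if "item4" in pair}
print(also_bought)
```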
<p>This algorithm has even <em>worse</em> complexity than the previous one, because in the worst
case we have to look at each item for each item for each customer: <code>O(N^2 * M)</code>.
In reality, however, most customers haven’t bought that many items, which makes
the complexity generally <code>O(NM)</code>, like our previous algorithm. This means that,
ballpark, the running time is roughly the same (an hour for $60).</p>
<p>Now we’ve got a much more versatile computation for recommendations. If
we store all these recommendations in a database, we can immediately, as part of
serving the page, tell the user which other products they might like based on the
item they’re currently viewing, their cart, past orders, and more. The two
recommendation algorithms complement each other. The first is good for
broad, home-page recommendations, whereas the item-to-item similarity is good
for real-time discovery on e.g. product pages.</p>
<p>My experience with recommendations is quite limited; if you work with these
systems and have any corrections, please let me know! A big part of my incentive
for writing these posts is to explore and learn for myself. Most articles that
talk about recommendations focus on the math involved; you’ll easily be able to
find those. Here I wanted to focus more on the computational aspect and not get
lost in the weeds of linear algebra.</p>
<p>P.S. Do you have experience running Apache Beam/Dataflow at scale? I’m very
interested in talking to you.</p>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 11: Circuit Breakers]]></title>
        <id>https://sirupsen.com/napkin/problem-11-circuit-breakers</id>
        <link href="https://sirupsen.com/napkin/problem-11-circuit-breakers"/>
        <updated>2020-08-22T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[You may have heard of a “circuit breaker” in the context of building resilient
systems: the art of building reliable systems from unreliable components. But what
is a circuit breaker?
Let’s set the scene for today’s napkin math post with a scenario. It’s
pretty close to what our code conceptually looked like
when we started working on resiliency at Shopify back in 2014.
Imagine a function like this (pseudo-Javascript-C-ish is a good common
denominator) ]]></summary>
        <content type="html"><![CDATA[<p>You may have heard of a “circuit breaker” in the context of building resilient
systems: the art of building reliable systems from unreliable components. But what
is a circuit breaker?</p>
<p>Let’s set the scene for today’s napkin math post with a scenario. It’s
pretty close to what our code conceptually looked like
when we started working on resiliency at Shopify back in 2014.</p>
<p>Imagine a function like this (pseudo-Javascript-C-ish is a good common
denominator) that’s part of rendering your commerce storefront:</p>
<pre class="language-javascript"><code class="language-javascript"><span class="token keyword">function</span> <span class="token function">cart_and_session</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
  session <span class="token operator">=</span> <span class="token function">query_session_store_for_session</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>session<span class="token punctuation">)</span> <span class="token punctuation">{</span>
    user <span class="token operator">=</span> <span class="token function">query_db_for_user</span><span class="token punctuation">(</span>session<span class="token punctuation">[</span><span class="token string">&#x27;id&#x27;</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>

  cart <span class="token operator">=</span> <span class="token function">query_carts_store_for_cart</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>cart<span class="token punctuation">)</span> <span class="token punctuation">{</span>
    products <span class="token operator">=</span> <span class="token function">query_db_for_products</span><span class="token punctuation">(</span>cart<span class="token punctuation">.</span>line_items<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
</code></pre>
<p>This calls three different external data-stores: (1) Session store, (2) Cart
store, (3) Database.</p>
<p>Let’s now imagine that the session store is unresponsive. Not down,
<em>unresponsive</em>: meaning every single query to it times out. Default timeouts are
usually hilariously high, so let’s assume a 5 second timeout.</p>
<p>Let’s say we’ve got 4 workers all serving requests with the above code. With the session store timing out,
each worker would be spending 5 seconds in <code>query_session_store_for_session</code> on
<em>every</em> request! This seems bad, because our response time is now at least 5
seconds. But it’s way worse than that. We’re almost certainly <em>down</em>.</p>
<p>Why are we down when a single, auxiliary data-store is timing out? Consider that
before, requests might have taken 100 ms to serve, but now they take at least 5
seconds. Your workers can only serve 1/50th of the requests they could
prior to the session store outage! Unless you’re 50x over-provisioned (not a
great idea), your workers are all busy waiting for the 5s timeout, and the queue
behind the workers is slowly filling up…</p>
<figure><img src="/images/5c6d3d44-9b57-4b75-9f00-44dea022b535.png" alt="" width="1778" height="1307" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>What can we do about this? We could reduce the timeout, which would be a good idea, but it only changes the shape of the problem; it doesn’t eliminate it. But we can implement a circuit breaker! The idea of the
circuit breaker is that if we’ve seen a timeout (or an error of any other kind we
specify) a few times, then we simply raise immediately for the next 15 seconds. When
the circuit is raising, the circuit breaker is “open” (this
vocabulary tripped me up at first: it’s not “closed”). After the 15
seconds, we’ll check whether the resource is healthy again by letting another
request through. If it fails, we’ll open the circuit again.</p>
<p>Won’t raising from the circuit just render a 500? The assumption is that you’ve
made your code resilient, so that if the circuit is open for the session
store, you simply fall back to assuming that people aren’t logged in, instead of letting an exception trickle up the stack.</p>
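<p>That fallback pattern can be sketched in a few lines (this is not Semian’s API, just an illustration of degrading gracefully when the breaker raises):</p>

```python
# Sketch of the fallback pattern: if the session store's circuit is open,
# degrade to an anonymous session instead of failing the whole request.
class CircuitOpenError(Exception):
    pass

def query_session_store():
    # Pretend the circuit breaker is open and raises instantly.
    raise CircuitOpenError()

def cart_and_session():
    try:
        session = query_session_store()
    except CircuitOpenError:
        session = None  # fall back: treat the visitor as logged out
    return session

print(cart_and_session())  # None
```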
<p>We can imagine a simple circuit being implemented like below. It has <em>numerous</em>
problems, but it should paint the basic picture of a circuit.</p>
<pre class="language-javascript"><code class="language-javascript">circuits <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token punctuation">}</span>
<span class="token keyword">function</span> <span class="token function">circuit_breaker</span><span class="token punctuation">(</span><span class="token parameter"><span class="token keyword">function</span> f</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
  <span class="token comment">// Circuit&#x27;s closed, everything&#x27;s likely normal!</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>state <span class="token operator">===</span> <span class="token string">&quot;closed&quot;</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
    <span class="token keyword">try</span> <span class="token punctuation">{</span>
      <span class="token function">f</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span> <span class="token keyword">catch</span><span class="token punctuation">(</span>err<span class="token punctuation">)</span> <span class="token punctuation">{</span>
      <span class="token comment">// Uh-oh, an error occured. Let&#x27;s check if it&#x27;s one we should possibly</span>
      <span class="token comment">// open the circuit on (like a timeout)</span>
      <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">circuit_breaker_error</span><span class="token punctuation">(</span>err<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
        errors <span class="token operator">=</span> circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>errors <span class="token operator">+=</span> <span class="token number">1</span><span class="token punctuation">;</span>
        <span class="token comment">// 3 errors have happened, let&#x27;s open the circuit!</span>
        <span class="token keyword">if</span> <span class="token punctuation">(</span>errors <span class="token operator">&gt;</span> <span class="token number">3</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
          circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>state <span class="token operator">=</span> <span class="token string">&quot;open&quot;</span><span class="token punctuation">;</span>
          circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>opened_at <span class="token operator">=</span> Time<span class="token punctuation">.</span>now<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
      <span class="token punctuation">}</span>
    <span class="token punctuation">}</span>
  <span class="token punctuation">}</span>

  <span class="token keyword">if</span> <span class="token punctuation">(</span>circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>state <span class="token operator">===</span> <span class="token string">&quot;open&quot;</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
    <span class="token comment">// If 15 seconds have passed, let&#x27;s try to close the circuit to let requests</span>
    <span class="token comment">// through again!</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span>Time<span class="token punctuation">.</span>now <span class="token operator">-</span> circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>opened_at <span class="token operator">&gt;</span> <span class="token number">15</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
      circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>state <span class="token operator">=</span> <span class="token string">&quot;closed&quot;</span><span class="token punctuation">;</span>
      <span class="token keyword">return</span> <span class="token function">circuit_breaker</span><span class="token punctuation">(</span>f<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
</code></pre>
<p>What position does that put us in for our session scenario? Once again, it’s best
illustrated with a drawing. Note, I’ve compressed the timeout requests a bit
here (the drawing is not to scale) to fit some ‘normal’ (blue) requests after the
circuits open:</p>
<figure><img src="/images/4f78974a-657c-48be-8e1c-235b21fb23f5.png" alt="" width="1748" height="939" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>After the circuits have all opened, we’re golden! Back to normal despite the
slow resource! The trouble comes when our 15 seconds of open circuit have
passed: then we’re back to needing 3 failures to open the circuits again and
bring us back to capacity. That’s <code>3 * 5s = 15s</code> where we can only serve 3
requests, rather than the normal <code>15s/100ms = 150</code>!</p>
<p>To do some napkin math: since we spend 15 seconds waiting for timeouts to
open the circuits, and 15 seconds with open circuits, we can estimate that we’re
at ~50% capacity with this circuit breaker. The drawing also makes this clear. That’s <em>a lot</em> better than before,
and likely means we’ll remain up if we’re sufficiently over-provisioned.</p>
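<p>That ~50% estimate falls straight out of the numbers used above, which can be sketched as napkin math:</p>

```python
# Rough capacity estimate for the breaker above, using the numbers from the
# text: 5 s resource timeout, 3 errors to open, 15 s open circuit, 100 ms
# normal response time per request, per worker.
resource_timeout = 5.0
error_threshold = 3
error_timeout = 15.0      # time the circuit stays open
normal_response = 0.1

# One full cycle: 3 timed-out requests to re-open, then 15 s of open circuit.
cycle = error_threshold * resource_timeout + error_timeout   # 30 s

# Requests served per cycle: the 3 slow ones, plus normal throughput while open.
served = error_threshold + error_timeout / normal_response   # 153
possible = cycle / normal_response                           # 300 if healthy
print(served / possible)  # 0.51 -> roughly 50% capacity
```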
<p>Now we could start introducing some complexity to the circuit to increase our
capacity. What if we only allowed failing <em>once</em> to re-open the circuit? What if
we decreased the timeout from 5s to 1s? What if we increased the time the
circuit is open from 15 seconds to 45 seconds? What if we open the circuit after 2 failures rather than 3?</p>
<p>Answering those questions is overwhelming. How on earth will we figure out how to configure the circuit so we’re not down when resources are slow? It might have been somewhat simple to
realize it was ~50% capacity with the numbers I’d chosen, but add more
configuration options and we’re in deep trouble.</p>
<p>This brings me to what I think is the most important part of this post: your
circuit breaker is almost certainly configured wrong. When we started
introducing circuit breakers (and bulkheads, another resiliency concept) to
production at Shopify in 2014, we severely underestimated how difficult they
are to configure. It’s puzzling to me how little has been written about
this. Most assume that you drop the circuit in, choose some decent defaults, and off you
go. But in my experience, you’ll find out in your very next outage that it wasn’t good enough… that’s a
less than ideal feedback loop.</p>
<p>The circuit breaker implementation I’m most familiar with is the one in
the <a href="https://github.com/shopify/semian">Ruby resiliency library Semian</a>. To my knowledge, it’s one of the
more complete implementations out there, but all the options make it a <em>devil</em>
to configure. Semian is the implementation we use in all applications at Shopify.</p>
<p>There are at least five configuration parameters relevant for circuit breakers:</p>
<ul>
<li><code>error_threshold</code>. The number of errors a worker must see before
opening the circuit, that is, before it starts rejecting requests instantly. In our
example, it’s hard-coded to 3.</li>
<li><code>error_timeout</code>. The amount of time in seconds until we try to query the
resource again, i.e. how long the circuit stays open. 15 seconds in our example.</li>
<li><code>success_threshold</code>. The number of successes on the circuit before closing it
again, that is, before accepting all requests to the circuit again. In our example
above, this is hard-coded to 1. A number &gt; 1 requires a bit more logic,
which better implementations like Semian will take care of.</li>
<li><code>resource_timeout</code>. The timeout to the resource/data-store protected by the circuit breaker. 5 seconds in our example.</li>
<li><code>half_open_resource_timeout</code>. Timeout for the resource in seconds when the
circuit is checking whether the resource might be back to normal, after the <code>error_timeout</code>. This state is called <code>half_open</code>. Most circuit breaker implementations (including our simple one
above) assume that this is the same as the ‘normal’ timeout for the resource.
The bet Semian makes is that during steady-state we can tolerate a higher
resource timeout, but during failure, we want it to be lower.</li>
</ul>
<p>My co-worker Damian Polan and I came up with some napkin math for what we
think is a good way to think about tuning it. You can read more in <a href="https://shopify.engineering/circuit-breaker-misconfigured">this
post</a> on the Shopify blog. The post includes the ‘circuit breaker
equation’, which will help you figure out the right configuration for your
circuit. If you’ve never thought about something along these lines and aren’t
heavily over-provisioned, I can almost guarantee you that your circuit breaker
is configured wrong. Instead of re-hashing the post, I’d rather send you to <a href="https://shopify.engineering/circuit-breaker-misconfigured">read
it</a> and leave you with the equation below as a teaser. If you’ve ever put a circuit breaker in production, you need to read that post; otherwise, chances are you haven’t actually put a <em>working</em> circuit breaker in production.</p>
<figure><img src="/images/81f5ee49-9539-4235-8091-54f3ae34170b.png" alt="" width="1028" height="292" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Hope you enjoyed this post on resiliency napkin math. Until next time!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 10: MySQL transactions per second vs fsyncs per second]]></title>
        <id>https://sirupsen.com/napkin/problem-10-mysql-transactions-per-second</id>
        <link href="https://sirupsen.com/napkin/problem-10-mysql-transactions-per-second"/>
        <updated>2020-07-17T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for another napkin problem!
Since the beginning of this newsletter I’ve posed problems for you to try to
answer. Then in the next month’s edition, you hear my answer. Talking with a few
of you, it seems many of you read these as posts regardless of their
problem-answer format.
That’s why I’ve decided to experiment with a simpler format: posts where I both
present a problem and solution in one go. This one will be long, since it’ll
include an answer to last month’s.]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for another napkin problem!</p>
<p>Since the beginning of this newsletter I’ve posed problems for you to try to
answer. Then in the next month’s edition, you hear my answer. Talking with a few
of you, it seems many of you read these as posts regardless of their
problem-answer format.</p>
<p>That’s why I’ve decided to experiment with a simpler format: posts where I both
present a problem and solution in one go. This one will be long, since it’ll
include an answer to last month’s.</p>
<p>Hope you enjoy this format! As always, you are encouraged to reach out with
feedback.</p>
<h2 id="problem-10-is-mysqls-maximum-transactions-per-second-equivalent-to-fsyncs-per-second">Problem 10: Is MySQL’s maximum transactions per second equivalent to fsyncs per second?</h2>
<p>How many transactions (‘writes’) per second is MySQL capable of?</p>
<p>A naive model of how a write (a SQL insert/update/delete) to an ACID-compliant
database like MySQL works might be the following (this applies equally to
Postgres, or any other relational/ACID-compliant databases, but we’ll
proceed to work with MySQL as it’s the one I know best):</p>
<ol>
<li>Client sends query to MySQL over an existing connection: <code>INSERT INTO products (name, price) VALUES (&#x27;Sneaker&#x27;, 100)</code></li>
<li>MySQL inserts the new record to the write-ahead-log (WAL) and calls
<code>fsync(2)</code> to tell the operating system to tell the filesystem to tell the
disk to make <em>sure</em> that this data is <em>for sure</em>, pinky-swear committed to
the disk. This step, being the most complex, is depicted below.</li>
<li>MySQL inserts the record into an in-memory page in the backing storage engine
(InnoDB) so the record will be visible to subsequent queries. Why commit to
the storage engine <em>and</em> the WAL? The storage engine is optimized for serving
queries, and the WAL for writing data in a safe manner — we
can’t serve a <code>SELECT</code> efficiently from the WAL!</li>
<li>MySQL returns <code>OK</code> to the client.</li>
<li>MySQL eventually calls <code>fsync(2)</code> to ensure InnoDB commits the page to disk.</li>
</ol>
<figure><img src="/images/87759326-21adeb00-c7dc-11ea-89c7-559ca11530e8.png" alt="Napkin_10" width="1759" height="1198" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>In the event of power-loss at any of these points, the behaviour is well-defined,
without nasty surprises, upholding our dear ACID-compliance.</p>
<p>Splendid! Now that we’ve constructed a naive model of how a relational database
might handle writes safely, we can consider the latency of inserting a new
record into the database. When we consult <a href="https://github.com/sirupsen/napkin-math">the reference napkin numbers</a>, we
see that the <code>fsync(2)</code> in step (2) is by <em>far</em> the slowest operation in the
blocking chain at 1 ms.</p>
<p>For example, the network handling at step (1) takes roughly ~<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>10</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">10\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">10</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span> (TCP Echo
Server is what we can classify as ‘the TCP overhead’). The <code>write(2)</code> itself
prior to the <code>fsync(2)</code> is also negligible at ~<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>10</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">10\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">10</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span>, since this system call
essentially just writes to an in-memory buffer (the ‘page cache’) in the kernel.
This doesn’t guarantee the actual bits are committed on disk, which means an
unexpected loss of power would erase the data, dropping our ACID-compliance on
the floor. Calling <code>fsync(2)</code> guarantees us the bits are persisted on the disk,
which will survive an unexpected system shutdown. The downside is that it’s 100x
slower.</p>
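<p>You can observe the gap yourself with a small sketch timing <code>write(2)</code> against <code>fsync(2)</code>. The absolute numbers vary wildly by disk and filesystem; the large ratio is the point:</p>

```python
import os
import tempfile
import time

# Time write(2) (page cache only) vs fsync(2) (forced to stable storage).
fd, path = tempfile.mkstemp()
data = b"x" * 512

start = time.perf_counter()
os.write(fd, data)
write_us = (time.perf_counter() - start) * 1e6

start = time.perf_counter()
os.fsync(fd)
fsync_us = (time.perf_counter() - start) * 1e6

print(f"write: {write_us:.0f} us, fsync: {fsync_us:.0f} us")
os.close(fd)
os.remove(path)
```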
<p>With that, we should be able to form a simple hypothesis on the maximum
throughput of MySQL:</p>
<blockquote>
<p>The maximum theoretical throughput of MySQL is equivalent to the maximum
number of <code>fsync(2)</code> per second.</p>
</blockquote>
<p>We know that <code>fsync(2)</code> takes 1 ms from earlier, which means we would naively
expect that MySQL would be able to perform in the neighbourhood of: <code>1s / 1ms/fsync = 1000 fsyncs/s = 1000 transactions/s</code> .</p>
<p>Excellent. We’ve now followed the first three of the four napkin math steps: (1) model the
system, (2) identify the relevant latencies, (3) do the napkin math, (4) verify
the napkin calculations against reality.</p>
<p>On to (4: Verifying)! We’ll write a simple benchmark in Rust that writes to
MySQL with 16 threads, doing 1,000 insertions each:</p>
<pre class="language-rust"><code class="language-rust"><span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token number">0</span><span class="token punctuation">..</span><span class="token number">16</span> <span class="token punctuation">{</span>
    handles<span class="token punctuation">.</span><span class="token function">push</span><span class="token punctuation">(</span><span class="token namespace">thread<span class="token punctuation">::</span></span><span class="token function">spawn</span><span class="token punctuation">(</span><span class="token punctuation">{</span>
        <span class="token keyword">let</span> pool <span class="token operator">=</span> pool<span class="token punctuation">.</span><span class="token function">clone</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">move</span> <span class="token closure-params"><span class="token closure-punctuation punctuation">|</span><span class="token closure-punctuation punctuation">|</span></span> <span class="token punctuation">{</span>
            <span class="token keyword">let</span> <span class="token keyword">mut</span> conn <span class="token operator">=</span> pool<span class="token punctuation">.</span><span class="token function">get_conn</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">unwrap</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token comment">// TODO: we should ideally be popping these off a queue in case of a stall</span>
            <span class="token comment">// in a thread, but this is likely good enough.</span>
            <span class="token keyword">for</span> _ <span class="token keyword">in</span> <span class="token number">0</span><span class="token punctuation">..</span><span class="token number">1000</span> <span class="token punctuation">{</span>
                conn<span class="token punctuation">.</span><span class="token function">exec_drop</span><span class="token punctuation">(</span>
                    <span class="token string">r&quot;INSERT INTO products (shop_id, title) VALUES (:shop_id, :title)&quot;</span><span class="token punctuation">,</span>
                    <span class="token macro property">params!</span> <span class="token punctuation">{</span> <span class="token string">&quot;shop_id&quot;</span> <span class="token operator">=&gt;</span> <span class="token number">123</span><span class="token punctuation">,</span> <span class="token string">&quot;title&quot;</span> <span class="token operator">=&gt;</span> <span class="token string">&quot;aerodynamic chair&quot;</span> <span class="token punctuation">}</span><span class="token punctuation">,</span>
                <span class="token punctuation">)</span>
                <span class="token punctuation">.</span><span class="token function">unwrap</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
        <span class="token punctuation">}</span>
    <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>

<span class="token keyword">for</span> handle <span class="token keyword">in</span> handles <span class="token punctuation">{</span>
    handle<span class="token punctuation">.</span><span class="token function">join</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">unwrap</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
<span class="token comment">// 3 seconds, 16,000 insertions</span>
</code></pre>
<p>This takes ~3 seconds to perform 16,000 insertions, or ~5,300 insertions per
second. This is <strong>5x</strong> more than the 1,000 <code>fsync</code> per second our napkin math
told us would be the theoretical maximum transactional throughput!</p>
<p>Typically with napkin math we aim for being within an order of magnitude, which
we are. But, when I do napkin math it usually establishes a lower-bound for the
system, i.e. from first-principles, how fast <em>could</em> this system perform in
ideal circumstances?</p>
<p>Rarely is the system 5x faster than napkin math. When we identify a
significant-ish gap between the real-life performance and the expected
performance, I call it the “first-principle gap.” This is where curiosity sets
in. It typically means there’s (1) an opportunity to improve the system, or (2)
a flaw in our model of the system. In this case, only (2) makes sense, because
the system is faster than we predicted.</p>
<p>What’s wrong with our model of how the system works? Why aren’t fsyncs per
second equal to transactions per second?</p>
<p>First I examined the benchmark… is something wrong? Nope, <code>SELECT COUNT(*) FROM products</code> says 16,000. Is the MySQL I’m using configured to not <code>fsync</code> on every
write? Nope, it’s at the <a href="https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit">safe default</a>.</p>
<p>Then I sat down and thought about it. Perhaps MySQL is <em>not</em> doing an <code>fsync</code>
for every <em>single</em> write? If it’s processing 5,300 insertions per second,
perhaps it’s batching multiple writes together as part of writing to the WAL,
step (2) above? Since each transaction is so short, MySQL would benefit from
waiting a few microseconds to see if other transactions want to ride along
before calling the expensive <code>fsync(2)</code>.</p>
<p>We can test this hypothesis by writing a simple <code>bpftrace</code> script to count the
number of <code>fsync(2)</code> calls for the ~16,000 insertions:</p>
<pre class="language-d"><code class="language-d">tracepoint:syscalls:sys_enter_fsync,tracepoint:syscalls:sys_enter_fdatasync
/comm == &quot;mysqld&quot;/
{
        @fsyncs = count();
}
</code></pre>
<p>Running this during the ~3 seconds it takes to insert the 16,000 records we get
~8,000 <code>fsync</code> calls:</p>
<pre class="language-bash"><code class="language-bash">$ <span class="token function">sudo</span> bpftrace fsync_count.d
Attaching <span class="token number">2</span> probes<span class="token punctuation">..</span>.
^C

@fsyncs: <span class="token number">8037</span>
</code></pre>
<p>This is a peculiar number. If MySQL were batching fsyncs, we’d expect something
far lower. It means we’re on average doing ~2,500 <code>fsync</code> per
second, at a latency of ~0.4ms. That’s twice as fast as the <code>fsync</code> latency we
expect, the 1ms mentioned earlier. For sanity, I ran the script to benchmark
<code>fsync</code> outside MySQL again, no, <a href="https://github.com/sirupsen/napkin-math/blob/fe780331c6f0c6f225a70c8a37c21e0740f7c73c/src/main.rs#L491">still 1ms</a>. <a href="https://gist.github.com/sirupsen/9fd5fe9466e82df073ed8a13ed1f661f#file-napkin-bash">Looked at the
distribution</a>, and it was consistently ~1ms.</p>
<p>There are two things we can draw from this: (1) we’re able to <code>fsync</code> more than
twice as fast as we expected; (2) our hypothesis was correct that MySQL is more
clever than doing one <code>fsync</code> per transaction. However, since <code>fsync</code> was also
faster than expected, batching alone doesn’t explain everything.</p>
<p>If you remember from above, while committing the transaction could theoretically
be a single <code>fsync</code>, other features of MySQL might also call <code>fsync</code>. Perhaps
they’re adding noise?</p>
<p>We need to group <code>fsync</code> by file descriptor to get a better idea of how MySQL
uses <code>fsync</code>. However, the raw file descriptor number doesn’t tell us much. We
can use <code>readlink</code> and the <code>proc</code> file-system to obtain the file name the file
descriptor points to. Let’s write a <a href="https://github.com/iovisor/bpftrace"><code>bpftrace</code> script</a> to see what’s being
<code>fsync</code>’ed:</p>
<pre class="language-d"><code class="language-d">tracepoint:syscalls:sys_enter_fsync,tracepoint:syscalls:sys_enter_fdatasync
/comm == str($1)/
{
  @fsyncs[args-&gt;fd] = count();
  if (@fd_to_filename[args-&gt;fd]) {
  } else {
    @fd_to_filename[args-&gt;fd] = 1;
    system(&quot;echo -n &#x27;fd %d -&gt; &#x27; &amp;1&gt;&amp;2 | readlink /proc/%d/fd/%d&quot;,
           args-&gt;fd, pid, args-&gt;fd);
  }
}

END {
  clear(@fd_to_filename);
}
</code></pre>
<p>Running this while inserting the 16,000 transactions into MySQL gives us:</p>
<pre class="language-bash"><code class="language-bash">personal@napkin:~$ <span class="token function">sudo</span> bpftrace <span class="token parameter variable">--unsafe</span> fsync_count_by_fd.d mysqld
Attaching <span class="token number">5</span> probes<span class="token punctuation">..</span>.
fd <span class="token number">5</span> -<span class="token operator">&gt;</span> /var/lib/mysql/ib_logfile0 <span class="token comment"># redo log, or write-ahead-log</span>
fd <span class="token number">9</span> -<span class="token operator">&gt;</span> /var/lib/mysql/ibdata1 <span class="token comment"># shared mysql tablespace</span>
fd <span class="token number">11</span> -<span class="token operator">&gt;</span> /var/lib/mysql/<span class="token comment">#ib_16384_0.dblwr # innodb doublewrite-buffer</span>
fd <span class="token number">13</span> -<span class="token operator">&gt;</span> /var/lib/mysql/undo_001 <span class="token comment"># undo log, to rollback transactions</span>
fd <span class="token number">15</span> -<span class="token operator">&gt;</span> /var/lib/mysql/undo_002 <span class="token comment"># undo log, to rollback transactions</span>
fd <span class="token number">27</span> -<span class="token operator">&gt;</span> /var/lib/mysql/mysql.ibd <span class="token comment"># tablespace </span>
fd <span class="token number">34</span> -<span class="token operator">&gt;</span> /var/lib/mysql/napkin/products.ibd <span class="token comment"># innodb storage for our products table</span>
fd <span class="token number">99</span> -<span class="token operator">&gt;</span> /var/lib/mysql/binlog.000019 <span class="token comment"># binlog for replication</span>
^C

@fsyncs<span class="token punctuation">[</span><span class="token number">9</span><span class="token punctuation">]</span>: <span class="token number">2</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">12</span><span class="token punctuation">]</span>: <span class="token number">2</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">27</span><span class="token punctuation">]</span>: <span class="token number">12</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">34</span><span class="token punctuation">]</span>: <span class="token number">47</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">13</span><span class="token punctuation">]</span>: <span class="token number">86</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">15</span><span class="token punctuation">]</span>: <span class="token number">93</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">11</span><span class="token punctuation">]</span>: <span class="token number">103</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">99</span><span class="token punctuation">]</span>: <span class="token number">2962</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">5</span><span class="token punctuation">]</span>: <span class="token number">4887</span>
</code></pre>
<p>What we can observe here is that the majority of the writes are to the “redo
log”, what we’ve been calling the “write-ahead-log” (WAL). There are a few <code>fsync</code> calls to
commit the InnoDB table-space, but not nearly as many, as we can always recover
the table-space from the WAL if we crash between them. Reads work just fine prior to
the <code>fsync</code>, as queries can simply be served out of memory by InnoDB.</p>
<p>The only surprising thing here is the substantial volume of writes to the
binlog, which we haven’t mentioned before. You can think of the binlog as the
“replication stream.” It’s a stream of events such as <code>row a changed from x to y</code>, <code>row b was deleted</code>, and <code>table u added column c</code>. The primary replica
streams this to the read-replicas, which use it to update their own data.</p>
<p>When you think about it, the <code>binlog</code> and the WAL need to be kept exactly in
sync. We can’t have something committed on the primary replica, but not
committed to the replicas. If they’re not in sync, this could cause loss of data
due to drift in the read-replicas. The primary could commit a change to the WAL,
lose power, recover, and never write it to the binlog.</p>
<p>Since <code>fsync(2)</code> can only sync a single file-descriptor at a time, how can you
possibly ensure that both the <code>binlog</code> and the WAL contain the transaction?</p>
<p>One solution would be to merge the <code>binlog</code> and the <code>WAL</code> into one log. I’m not
entirely sure why that’s not the case, but likely the reasons are historic. If
you know, let me know!</p>
<p>The solution employed by MySQL is a two-phase commit between the WAL and the
binlog. This requires three <code>fsync</code>s to commit a transaction. <a href="https://www.burnison.ca/notes/fun-mysql-fact-of-the-day-everything-is-two-phase">This</a> and <a href="https://kristiannielsen.livejournal.com/12254.html">this reference</a> explain
the process in more detail. Because the WAL is touched twice as part of the
two-phase commit, we see roughly 2x as many <code>fsync</code>s to it as to the binlog in
the bpftrace output above. The process of grouping multiple transactions into
one two-phase commit is what MySQL calls ‘group commit.’</p>
<p>What we can gather from these numbers is that it seems the ~16,000 transactions
were, thanks to group commit, reduced to ~2,885 commits, or ~5.5 transactions
per commit on average.</p>
<p>But there’s still one thing remaining… why was the average <code>fsync</code> latency
roughly half of what our benchmark measured? Once again, we write a simple
<code>bpftrace</code> script:</p>
<pre class="language-text"><code class="language-text">tracepoint:syscalls:sys_enter_fsync,tracepoint:syscalls:sys_enter_fdatasync
/comm == &quot;mysqld&quot;/
{
        @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_fsync,tracepoint:syscalls:sys_exit_fdatasync
/comm == &quot;mysqld&quot;/
{
        @bytes = lhist((nsecs - @start[tid]) / 1000, 0, 1500, 100);
        delete(@start[tid]);
}
</code></pre>
<p>This gives us the following histogram, confirming that we’re seeing some <em>very</em> fast
<code>fsync</code>s:</p>
<pre class="language-shell-session"><code class="language-shell-session">personal@napkin:~$ sudo bpftrace fsync_latency.d
Attaching 4 probes...
^C

@bytes:
[0, 100)             439 |@@@@@@@@@@@@@@@                                     |
[100, 200)             8 |                                                    |
[200, 300)             2 |                                                    |
[300, 400)           242 |@@@@@@@@                                            |
[400, 500)          1495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[500, 600)           768 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[600, 700)           376 |@@@@@@@@@@@@@                                       |
[700, 800)           375 |@@@@@@@@@@@@@                                       |
[800, 900)           379 |@@@@@@@@@@@@@                                       |
[900, 1000)          322 |@@@@@@@@@@@                                         |
[1000, 1100)         256 |@@@@@@@@                                            |
[1100, 1200)         406 |@@@@@@@@@@@@@@                                      |
[1200, 1300)         690 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[1300, 1400)         803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[1400, 1500)         582 |@@@@@@@@@@@@@@@@@@@@                                |
[1500, ...)         1402 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
</code></pre>
<p>To understand exactly what’s going on here, we’d have to dig into the
file-system we’re using. This is going to be out of scope (otherwise I’m never
going to be sending anything out). But, to not leave you completely hanging,
presumably, <code>ext4</code> is using techniques similar to MySQL’s group commit to batch
writes together in the journal (equivalent to the write-ahead-log of MySQL). In
ext4’s vocabulary, this seems to be called <a href="https://www.kernel.org/doc/Documentation/filesystems/ext4.txt"><code>max_batch_time</code></a>, but the
documentation on it is scant at best. The disk could also be doing this in
addition to, or instead of, the file-system. If you know more about this, please
enlighten me!</p>
<p>The bottom line is that <code>fsync</code> can perform faster during real-life workloads than the
1 ms I get on this machine from repeatedly writing and <code>fsync</code>ing a single file,
most likely because of the ext4 equivalent of group commit, which we won’t see in a
benchmark that never issues multiple <code>fsync</code>s in parallel.</p>
<p>This brings us back around to explaining the discrepancy between real life and
the napkin math of MySQL’s theoretical maximum throughput. We’re able to
achieve at least a 5x increase in throughput over the raw <code>fsync</code> estimate due to:</p>
<ol>
<li>MySQL merging multiple transactions into fewer <code>fsync</code>s through ‘group commits.’</li>
<li>The file-system and/or disk merging multiple <code>fsync</code>s performed in parallel
through its own ‘group commits’, yielding faster performance.</li>
</ol>
<p>In essence, the same technique of batching is used at every layer to improve
performance.</p>
<p>While we didn’t manage to explain <em>everything</em> that’s going on here, I certainly
learned a lot from this investigation. It’d be interesting, in light of this, to play
with the <a href="https://mariadb.com/kb/en/group-commit-for-the-binary-log/#changing-group-commit-frequency">group commit settings</a> to optimize MySQL for throughput over
latency. This could also be tuned at the file-system level.</p>
<h2 id="problem-9-inverted-index">Problem 9: Inverted Index</h2>
<p><a href="https://sirupsen.com/napkin/problem-9/">Last month, we looked at the inverted
index.</a> This data-structure is what’s
behind full-text search, and the way the documents are packed works well for set
intersections.</p>
<figure><img src="/images/66641ef5-efe4-440a-a616-0d30310e7540.png" alt="" width="1007" height="648" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>(A) How long do you estimate it’d take to get the ids for <code>title AND see</code> with 2
million ids for title, and 1 million for see?</strong></p>
<p>Let’s assume that each document id is stored as a 64-bit integer. Then we’re
dealing with <code>1 * 10^6 * 64 bit = 8 Mb</code> and <code>2 * 10^6 * 64 bit = 16 Mb</code>. If we
use an exceptionally simple merge-style set intersection, walking both sorted
lists in lockstep, we need to scan ~<code>24 Mb</code> of sequential memory. According to the
<a href="https://github.com/sirupsen/napkin-math">reference</a>, we can do this in <code>24 Mb * 100 us/Mb = 2.4 ms</code>.</p>
<p>Strangely, the Lucene <a href="https://home.apache.org/~mikemccand/lucenebench/AndHighHigh.html">nightly benchmarks</a> are performing these queries at
roughly 22 QPS, or <code>1000ms/22 = 45ms</code> per query. That’s substantially worse than
our prediction. I was ready to explain why Lucene might be <em>faster</em> (e.g. by
compressing postings to less than 64-bit), but not why it might be 20x slower!
We’ve got ourselves another first-principle gap.</p>
<p>Some slowness can be due to reading from disk, but since the access pattern is
sequential, it <a href="https://github.com/sirupsen/napkin-math">should only be 2-3x slower</a>. The hardware could be different
than the reference, but hardly anything that’d explain 20x. Sending the data to
the client might incur a large penalty, but again, 20x seems enormous. This type
of gap points towards missing something fundamental (as we saw with MySQL).
Unfortunately, this month I didn’t have time to dig much deeper than this, as I
prioritized the MySQL post.</p>
<p><strong>(B) What about title OR see?</strong></p>
<p>In this case we’d have to scan roughly as much memory, but handle more documents
and potentially transfer more back to the client. We’d expect to be roughly in
the same performance ballpark, ~<code>2.4ms</code>.</p>
<p>Lucene in this case is doing <a href="https://home.apache.org/~mikemccand/lucenebench/OrHighHigh.html">roughly half the throughput</a>, which aligns with
our relative expectations. But again, in absolute terms, Lucene’s handling these
queries in ~100ms, which is much, much higher than we expect.</p>
<p><strong>(C) How do the Lucene nightly benchmarks compare for (A) and (B)? This file
shows some of the actual terms used. If they don’t line up, how might you
explain the discrepancy?</strong></p>
<p>Answered inline with (A) and (B).</p>
<p><strong>(D) Let’s imagine that we want title AND see and order the results by the last
modification date of each document. How long would you expect that to take?</strong></p>
<p>If the postings are not stored in that order, we’d naively expect in the worst
case we’d need to sort roughly ~24 Mb of memory, <a href="https://github.com/sirupsen/napkin-math#numbers">at
5ms/Mb</a>. This would land us in the
<code>5 ms/Mb * 24 Mb ~= 120 ms</code> query time ballpark.</p>
<p>In reality, this seems like an unintentional trick question. If ordered by last-modification
date, the postings would already be in roughly that order, since new
documents are appended to the end of the list. That means our sort has to move
far fewer bits around. Even if that weren’t the case, we could store a sorted
list for just this column, which e.g. Lucene allows with doc values.</p>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 9: Inverted Index Performance and Merkle Tree Synchronization]]></title>
        <id>https://sirupsen.com/napkin/problem-9</id>
        <link href="https://sirupsen.com/napkin/problem-9"/>
        <updated>2020-06-07T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for another napkin problem!
As always, consult sirupsen/napkin-math to solve today’s problem, which
has all the resources you need. Keep in mind that with napkin problems you
always have to make your own assumptions about the shape of the problem.
We hit an exciting milestone since last with a total of 500 subscribers! Share the newsletter (ht]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for another napkin problem!</p>
<p>As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> to solve today’s problem, which
has all the resources you need. Keep in mind that with napkin problems you
always have to make your own assumptions about the shape of the problem.</p>
<p>We hit an exciting milestone since the last edition: a total of 500 subscribers! Share the newsletter (<a href="https://sirupsen.com/napkin/">https://sirupsen.com/napkin/</a>) with your friends and co-workers if you find it useful.</p>
<p>The answer to problem 8 is probably the most comprehensive yet… it took me 5 hours
today to prepare this newsletter with an answer I felt was satisfactory.
I hope you enjoy it!</p>
<p>I’m noticing that the napkin math newsletter has evolved from fairly simple
problems to presenting simple models of how various data structures and algorithms work,
then doing napkin math with those models. The complexity has gone way up,
but I hope, in turn, so has your interest.</p>
<p>Let me know how you feel about this evolution by replying. I’m also curious
how many of you simply read through it but don’t necessarily attempt to solve the problems. That’s completely OK, but if 90% of readers use it that way,
I’d consider reframing the newsletter to include the problem <em>and</em> answer in
each edition, rather than the current format.</p>
<p><strong>Problem 9</strong></p>
<p>You may already be familiar with the inverted index. A ‘normal’ index maps e.g.
a primary key to a record, to answer queries efficiently like:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> products <span class="token keyword">WHERE</span> id <span class="token operator">=</span> <span class="token number">611</span>
</code></pre>
<p>An inverted index maps “terms” to ids. To illustrate
in SQL, it may efficiently help answer queries such as:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> id <span class="token keyword">FROM</span> products <span class="token keyword">WHERE</span> title <span class="token operator">LIKE</span> <span class="token string">&quot;%sock%&quot;</span>
</code></pre>
<p>In the SQL databases I’m familiar with, this wouldn’t be the actual syntax; it
varies greatly. A database like ElasticSearch, which uses the inverted index
as its primary data-structure, uses JSON rather than SQL.</p>
<p>The inverted index might look something like this:</p>
<figure><img src="/images/66641ef5-efe4-440a-a616-0d30310e7540.png" alt="" width="1007" height="648" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If we wanted to answer a query to find all documents that include both the words
<code>title</code> and <code>see</code>, <code>query=&#x27;title AND see&#x27;</code>, we’d need to do an intersection of
the two sets of ids (as illustrated in the drawing).</p>
<p><strong>(A)</strong> How long do you estimate it’d take to get the ids for <code>title AND see</code>
with 2 million ids for title, and 1 million for see?</p>
<p><strong>(B)</strong> What about <code>title OR see</code>?</p>
<p><strong>(C)</strong> How do the Lucene nightly benchmarks compare for <a href="https://home.apache.org/~mikemccand/lucenebench/AndHighHigh.html"><strong>(A)</strong></a> and
<a href="https://home.apache.org/~mikemccand/lucenebench/OrHighHigh.html"><strong>(B)</strong></a>? <a href="https://github.com/mikemccand/luceneutil/blob/83e6f737e9316ba829f9cd7e6cb178ed10470fb3/tasks/wikinightly.tasks">This file</a> shows some of the actual terms used. If they don’t
line up, how might you explain the discrepancy?</p>
<p><strong>(D)</strong> Let’s imagine that we want <code>title AND see</code> and order the results by the
last modification date of each document. How long would you expect that to take?</p>
<p><a href="/napkin/problem-10-mysql-transactions-per-second/">Answer is available in the next edition.</a></p>
<p><strong>Answer to Problem 8</strong></p>
<p>Last month <a href="https://sirupsen.com/napkin/problem-8/">we looked at a syncing
problem</a>. What follows is the most
deliberate answer in this newsletter’s short history. It’s a fascinating
problem, and I hope you find it as interesting as I did.</p>
<p>The problem comes down to this: How does a client and server know if they have
the same data? We framed this as a hashing problem. The client and server would
each have a hash, if they match, they have the same data. If not, they need to
sync the documents!</p>
<p>The query for the client and server might look something like this:</p>
<p><code>SELECT SHA1(*) FROM table WHERE user_id = 1</code></p>
<p>For 100,000 records, that’ll in reality return 100,000 hashes. But let’s
assume the hashing function is an aggregate function, without getting bogged down in
database-specific syntax (you can see how to <em>actually</em> do it <a href="https://www.usenix.org/sites/default/files/conference/protected-files/srecon19emea_slides_weingarten.pdf#page=62">here</a>).</p>
<figure><img src="/images/faa046d0-cb70-4852-ae36-4a728236ae6a.png" alt="" width="1313" height="654" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>(a) How much time would you expect the server-side query to take for 100,000
records that the client might have synced? Will it have different performance
than the client-side query?</strong></p>
<p>We’ll assume each row is about 256 bytes on average (<code>2^8</code>), which means we’ll
be reading ~25Mb of data, and subsequently hash it.</p>
<p>Now, will we be reading this from disk or memory? Most databases maintain a
cache of the most frequently read data in memory, but we’ll assume the worst
case here of reading everything from disk.</p>
<p>We know from <a href="https://github.com/sirupsen/napkin-math">the reference</a> that we can hash a Mb in roughly 500 us. The
astute reader might notice that only non-cryptographic hashes are that fast (e.g.
<code>CRC32</code> or <code>SIPHASH</code>), whereas SHA1 is a cryptographic hash (although it’s <a href="https://en.wikipedia.org/wiki/SHA-1">no longer
considered safe for that purpose</a>, it’s used for integrity in e.g.
Git and many other systems). We’re going to assume we can find a non-crypto hash
that’s fast enough with rare collisions. Worst case, you’d sync on your next
change (or force it in the UI).</p>
<p>We can also see that we can read 1 mb sequentially at roughly <code>200 us/mb</code>, and
randomly at roughly <code>10 ms/mb</code>. In <a href="https://sirupsen.com/napkin/problem-5/">Napkin Problem 5</a> we learned that reads
on a multi-tenant database without a composite primary key that includes the
<code>user_id</code> start to look more random than not. We’ll average it out a little,
assume some pre-fetching, some sequential reads, and call it <code>1 ms/mb</code>.</p>
<p>With the caching and disk reads, we’ve got ourselves an approximation of the
query time of the full-table scan: <code>25 Mb * (500 us/Mb + 1 ms/Mb) ~= 40ms</code>.
That’s not terrible, for something that likely wouldn’t happen too often. If
this all came from memory, we can assume hashing speed only to get a lower bound
and get <code>~12.5ms</code>. Not amazing, not terrible. For perspective, that might yield
us <code>1s / 10ms = 100 syncs per second</code> (in reality, we could likely get more by
assuming multiple cores).</p>
<p>Is 100 syncs per second good? If you’ve got 1000 users and they each sync once
an hour, you’re more than covered here (<code>1000/3600 ~= 0.3 syncs per second</code>).
You’d need in the 100,000s of users before this operation would become
problematic.</p>
<p>The second part of the question asks whether the client would have different
performance. The client might be a mobile device, which could easily be <em>much</em>
slower than the server. This is where this solution starts to break down with
this many documents to sync. We don’t have napkin numbers for mobile devices (if
you’ve got access to a mobile CPU you can run the napkin math script on, I’d
love to see it), but it wouldn’t be crazy to assume it’s an order of
magnitude slower (and terrible for the battery).</p>
<p><strong>(b) Can you think of a way to speed up this query?</strong></p>
<p>There are iterative improvements that can be made to the current design. We could
hash the <code>updated_at</code> and store it as a column in the database. We could go a
step further and create an index on <code>(user_id, hash)</code> or <code>(user_id, updated_at)</code>, which would give us much more efficient access to that column!
It would easily mean we’d only have to read 8-12 bytes of data per record,
rather than the previous 256 bytes.</p>
<p>Something else entirely we could do is add a <code>WHERE updated_at ..</code> with a
generous window on either side, only considering those records for sync. This is
do-able, but not very robust. Clocks are out of sync, someone could be offline
for weeks/months, … we have a lot of edge-cases to consider.</p>
<p><strong>Merkle Tree Synchronization</strong></p>
<p>The flaw with our current design is that we still have to iterate through the
100,000 records each time we want to know if a client can sync. Another flaw is
that our current query only gives us a binary answer: the 100,000 records are
synced, or the 100,000 records are not synced.</p>
<p>This query’s answer then leaves us in an uncomfortable situation… should the
client now receive 100,000 records and figure out which ones are out-of-date? Or
let the server do it? This would mean sending those 25 Mb of data back and forth
on each sync! We’re starting to get into question <code>(C)</code>, but let’s explore
this… we might be able to get two birds with one stone here.</p>
<p>What if we could design a data-structure that we maintain at write-time that
would allow us to elegantly answer the question of whether we’re in sync with
the server? Even better, what if this data-structure would tell us which rows
need to be re-synced, so we don’t have to send 100,000 records back and forth?</p>
<p>Let’s consider a Merkle tree (or ‘hash tree’). It’s a simple tree data structure
where the leaf nodes store the hash of individual records. The parent stores the
hash of <em>all</em> its children, until finally the root’s hash is an identity of the
entire state the Merkle tree represents. In other words, the root’s hash is the
answer to the query we discussed above.</p>
<p>The best way to understand a Merkle tree is to study the drawing below a little:</p>
<figure><img src="/images/2f5ff1a5-d6c5-4b38-aa20-c1d82883328d.png" alt="" width="1669" height="844" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>In the drawing I show a MySQL query to generate an equivalent node. It’s likely
not how we’d generate the data-structure in production, but it illustrates the
naive MySQL equivalent. The data-structure would be able to answer such a
query rapidly, whereas MySQL would need to look at each record.</p>
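<p>For intuition, here’s a minimal Python sketch (illustrative, not production code) of building such a tree bottom-up: hash each record into a leaf, then repeatedly hash pairs of children into parents until a single root remains:</p>

```python
import hashlib

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def merkle_tree(records):
    """Build a Merkle tree as a list of levels, leaves first.

    `records` are bytes (e.g. serialized updated_at values). Each
    parent stores the hash of its two children concatenated; a
    trailing odd node at the end of a level is hashed on its own.
    """
    level = [sha1(r) for r in records]
    levels = [level]
    while len(level) > 1:
        level = [sha1(b"".join(level[i:i + 2]))
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels  # levels[-1][0] is the root hash

rows = [b"2020-05-01 09:00:00", b"2020-05-02 10:00:00",
        b"2020-05-02 11:00:00", b"2020-05-03 12:00:00"]
root = merkle_tree(rows)[-1][0]
rows[3] = b"2020-05-03 13:37:00"         # change one record...
assert merkle_tree(rows)[-1][0] != root  # ...and the root changes
```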
<p>If we scale this up to 100,000 records, we can extrapolate how the root would store
<code>(hash, (1..100,000))</code>, its left child would store <code>(hash, (1..50,000))</code>, and
right child would store <code>(hash, (50,001..100,000))</code>, and so on. In that case, to
generate the root’s right node the query in the drawing would look at 50,000
records, too slow!</p>
<p>Let’s assume that the client and the server both have been able to generate this
data-structure somehow. How would they efficiently sync? Let’s draw up a Merkle
tree and data table where one row is different on the server (we’ll make it
slightly less verbose than the last):</p>
<figure><img src="/images/4a216af8-61be-496b-9332-b5f9170b6714.png" alt="" width="1217" height="1175" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Notice how the parents all change when a single record changes. If the server
and client only exchange their Merkle trees, they’d be able to do a simple walk
of the trees and find out that it’s indeed <code>id=4</code> that’s different, and only
sync that row. Of course, in this example with only four records, simply syncing
all the rows would work.</p>
<p>But once again, let’s scale it up. If we scale this simple model up to <code>100,000</code>
rows, we’d still need to exchange 100,000 nodes from the Merkle tree! It’s
slightly less data, since it’s just hashes. Naively, the tree would be <code>~2^18</code>
elements of perhaps 64 bits each, so ~2mb total. An order of magnitude better,
but still a lot of data to sync, especially from a mobile client. Notice here
how we keep justifying each level of complexity by doing quick calculations at
each step to know if we need to optimize further.</p>
<p>Let’s try to work backwards instead… Let’s say our Merkle tree has a maximum
depth of 8: that’s <code>2^8 = 256</code> leaf nodes (this is <a href="http://distributeddatastore.blogspot.com/2013/07/cassandra-using-merkle-trees-to-detect.html">what Cassandra does</a> to
verify integrity between replicas). This means that each leaf would hold
<code>100,000 / 256 ~= 390</code> records. To store a tree of depth 8, we’d need <code>2^(8+1) = 2^9 = 512</code> nodes in a vector/array. Carrying our 64-bit per element assumption
from before to store the hash, that’s a mere 4kb for the entire Merkle tree. Now
to synchronize, we only need to send or receive 4kb!</p>
<p>Now we’ve arrived at a fast Merkle-tree based syncing algorithm:</p>
<ol>
<li>Client decides to sync</li>
<li>Server sends client its 4kb Merkle tree (fast even on 3G, 10-100ms including
round-trip and server-side processing overhead)</li>
<li>Client walks its own and the server’s Merkle tree to detect differences
(operating on <code>2 * 4kb</code> trees, both fit in L1 CPU caches,
nanoseconds to microseconds).</li>
<li>Client identifies the leaf nodes which don’t match (<code>log(n)</code>, super fast
since we’re traversing trees in L1).</li>
<li>Client requests the ids of all those leaf nodes from the server (<code>390 * 256 bytes = 100Kb</code> per mismatch)</li>
</ol>
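<p>Steps 3–4 above can be sketched as a short recursive walk (Python; assumes both trees have the same shape, stored as implicit binary trees in arrays, with the root at index 0 and the children of node <code>i</code> at <code>2i+1</code> and <code>2i+2</code>):</p>

```python
def diff_leaves(a, b, i=0):
    """Return indices of leaf nodes where two Merkle trees disagree.

    Subtrees whose root hashes match are pruned without descending,
    so the walk only touches a few nodes per differing leaf.
    """
    if a[i] == b[i]:
        return []                       # whole subtree is in sync
    left, right = 2 * i + 1, 2 * i + 2
    if left >= len(a):                  # differing leaf: needs re-sync
        return [i]
    return diff_leaves(a, b, left) + diff_leaves(a, b, right)

# Toy trees of depth 2: 7 nodes, leaves at indices 3..6. The hashes
# are stand-in strings; only one leaf (index 6) actually differs.
server = ["root", "L", "R", "h1", "h2", "h3", "h4"]
client = ["root'", "L", "R'", "h1", "h2", "h3", "h4-changed"]
print(diff_leaves(server, client))  # [6]
```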
<p>To actually implement this, we’d need to solve a few production problems. How do
we maintain the Merkle tree on both the client and server-side? It’s paramount
that it’s completely in sync with the table that stores the actual data! If our table
is the <code>orders</code> table, we could imagine maintaining an <code>orders_merkle_tree</code>
table alongside it. We could do this within the transaction in the application,
we could do it with triggers in the writer (or in the read-replicas), build it
based on the replication stream, patch MySQL to maintain this (or base it on the
existing InnoDB checksumming on each leaf), or something else entirely…</p>
<p>Our design has other challenges that’d need to be ironed out, for example, our
current design assumes an <code>auto_increment</code> per user, which is not something most
databases are designed to do. We could solve this by hashing the primary key
into <code>2^8</code> buckets and storing these in the leaf nodes.</p>
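<p>A sketch of that bucketing (Python; the function name is illustrative): hash the primary key and take it modulo the number of leaves, so each row lands deterministically in one of the <code>2^8</code> buckets regardless of how ids are assigned:</p>

```python
import hashlib

def leaf_bucket(primary_key: int, buckets: int = 256) -> int:
    """Deterministically map a primary key to a Merkle-tree leaf."""
    digest = hashlib.sha1(str(primary_key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % buckets

# Every row with a given id always hashes to the same leaf, so the
# client and server agree on the tree layout without coordination.
print(leaf_bucket(42), leaf_bucket(1337))
```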
<p>This answer to <code>(B)</code> also addresses <strong>(C): This is a stretch question, but it’s
fun to think about the full syncing scenario. How would you figure out which
rows haven’t synced?</strong></p>
<p>As mentioned in the previous letter, I would encourage you to watch <a href="https://www.dotconferences.com/2019/12/james-long-crdts-for-mortals">this
video</a> if this topic is interesting to you. The <a href="https://github.com/attic-labs/noms/blob/master/doc/intro.md#prolly-trees-probabilistic-b-trees">Prolly Tree</a> is an
interesting data-structure for this type of work (combining B-trees and Merkle
Trees). Git is based on Merkle trees, I recommend <a href="https://shop.jcoglan.com/building-git/">this book</a> which explains
how Git works by re-implementing Git in Ruby.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Adjacent Possible: Model for Peeking into the Future]]></title>
        <id>https://sirupsen.com/adjacent-possible</id>
        <link href="https://sirupsen.com/adjacent-possible"/>
        <updated>2020-05-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[There are 100s of cases of important discoveries being made independently
by different people at almost exactly the same time: calculus (1600s), the
telegraph (1837), the light bulb (1879), the jet engine (1840), and the
telephone (1876). A recent example was Spectre/Meltdown (2018), possibly the
most impactful publicly disclosed security vulnerability of the past decade.
Despite its fiendish complexity it was <a href="h]]></summary>
        <content type="html"><![CDATA[<p>There are <a href="https://en.wikipedia.org/wiki/List_of_multiple_discoveries">100s of cases</a> of important discoveries being made independently
by different people at almost exactly the same time: calculus (1600s), the
telegraph (1837), the light bulb (1879), the jet engine (1840), and the
telephone (1876). A recent example was Spectre/Meltdown (2018), possibly the
most impactful publicly disclosed security vulnerability of the past decade.
Despite its fiendish complexity it was <a href="https://www.wired.com/story/meltdown-spectre-bug-collision-intel-chip-flaw-discovery/">discovered independently by two
teams</a> that year.</p>
<p>Why does this happen?</p>
<p>In <a href="/books/where-good-ideas-come-from/">“Where Good Ideas Come From”</a>, Johnson explains the idea of the
‘adjacent possible’, pioneered by Stuart Kauffman to describe how biological systems
morph into complex systems. The adjacent possible explains simultaneous
innovation. It’s one of those ideas that to me was so powerful it’s hard to
remember how I thought about innovation prior to learning about it.</p>
<p>To borrow Johnson’s analogy for the adjacent possible: when you build or improve
something, imagine yourself as opening a new door. You’ve unlocked a new room.
This room, in turn, has even <em>more</em> doors to be unlocked. Each innovation or
improvement unlocks even more improvements and innovations. What the doors lead
you to is what we call the ‘adjacent possible.’ The adjacent possible is what’s
about a door away from being invented. I like to visualize the adjacent possible
as coloured (“built”) and uncoloured (“not built”) nodes in a simple graph:</p>
<figure><img src="/images/adjacent-possible/adjacent_possible_simple.png" alt="" width="1235" height="701" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<blockquote>
<p>In human culture, we like to think of breakthrough ideas as sudden
accelerations on the timeline, where a genius jumps ahead fifty years and
invents something that normal minds, trapped in the present moment, couldn’t
possibly have come up with. But the truth is that technological (and
scientific) advances rarely break out of the adjacent possible; the history of
cultural progress is, almost without exception, a story of one door leading to
another door, exploring the palace one room at a time.
— <a href="/books/where-good-ideas-come-from/">Steven Johnson, Where Good Ideas Come From</a></p>
</blockquote>
<p>When Gutenberg invented the printing press, it was in the adjacent possible from
the invention of movable type, ink, paper, and the wine press. He had to
customize the ink, press, and invent molds for the type — but the printing
press was very much ripe for plucking in the adjacent possible.</p>
<figure><img src="/images/adjacent-possible/adjacent_possible_printing_press.png" alt="" width="1271" height="903" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>When you internalize it, you start seeing it everywhere.</p>
<p>Here’s Safi Bahcall painting a picture of navigating the adjacent possible,
focusing in particular on the importance of fundamental research, a door opener
that might not always get the credit and funding it deserves:</p>
<blockquote>
<p>“The vast majority of the most important breakthroughs in drug discovery have
hopped from one lily pad to another until they cleared their last challenge.
Only after the last jump, from the final lily pad, would those ideas win wide
acclaim.”
— <a href="/books/loonshots/">Safi Bahcall, Loonshots</a></p>
</blockquote>
<p>Of course, it took ingenuity for Gutenberg to combine these components to make
the printing press. It’s certainly a pattern that the inventor has a profound
familiarity with each component. Gutenberg grew up close to the wine districts
of South-Western Germany, so he was familiar with the wine press. He had to
customize the press, in the same way that much experimentation led him to come
up with an oil-based ink that worked with his movable type (for which he needed
to invent molds).</p>
<p>But reality is that if Gutenberg hadn’t invented the printing press, someone
else would have. The inventors of the transistor admitted this outright. The
Bell Labs semiconductor team understood that when you are picking off the
adjacent possible, someone else will get there eventually. In this case, the
transistor had come into the adjacent possible from the increased understanding
of e.g. the basic research in atomic structure and understanding of electrons
conducted by scientists such as Bohr and J. J. Thomson.</p>
<blockquote>
<p>“There was little doubt, even by the transistor’s inventors, that if
Shockley’s team at Bell Labs had not gotten to the transistor first, someone
else in the United States or in Europe would have soon after.”
— <a href="/books/the-idea-factory/">Jon Gertner, The Idea Factory: Bell Labs and the Great Age of American Innovation</a></p>
</blockquote>
<p>Edison came to this conclusion too:</p>
<blockquote>
<p>I never had an idea in my life. My so-called inventions already existed in the
environment – I took them out. I’ve created nothing. Nobody does. There’s no
such thing as an idea being brain-born; everything comes from the outside.
— Edison</p>
</blockquote>
<p>Numerous quotes can be found about how innovations are plucked out of the
adjacent possible like ripe fruits:</p>
<blockquote>
<p>[Y]ou do not [make a discovery] until a background knowledge is built up to a
place where it’s almost impossible not to see the new thing, and it often
happens that the new step is done contemporaneously in two different places in
the world, independently.
— a physicist Nobel laureate interviewed by Harriet Zuckerman, in Scientific
Elite: Nobel Laureates in the United States, 1977</p>
</blockquote>
<p>The adjacent possible is a possible explanation for why simultaneous innovation
is so common.</p>
<p>You may recognize the adjacent possible as another angle on Newton’s phrase that
we ‘stand on the shoulders of giants’ (coloured nodes in the adjacent possible).
‘Great artists steal’, because otherwise how would we launch into the adjacent
possible? The greatest artists might just be the ones that create the nodes with
the most connections, such as Picasso’s influence in cubism, or Emerson’s
in transcendentalism.</p>
<p>You might initially think this is a depressing thought. Are all innovations
inevitable? Some teams in history have mowed through the adjacent possible
at unprecedented speeds. Think of the Manhattan Project. The Apollo Project.
Neither of those were in the adjacent possible. They were in the far remote
possible. Many, many doors out. But these teams pushed through. To a company,
the momentum provided by breaking through the adjacent possible first can be
difficult to catch up with, such as Google and their page-rank search algorithm.
Some areas might be simply neglected, e.g. pandemic prevention.</p>
<p>The adjacent possible can teach us an important lesson about being too early. To
someone working in the adjacent possible, being too early and wrong is one and
the same. I’ve heard <a href="https://twitter.com/tobi">Tobi Lutke</a> say a few times that “predicting the future
is easy, but timing it is hard.” Sure, we know that autonomous vehicles are
coming (predicting the future), but are you willing to put any money on when
(predicting timing)?</p>
<p>For example, residential internet was not geared yet for responsive online games
in the early 90s.  It was too early, even if game developers <em>knew</em> it was
eventually going to be a thing. It was in the remote possible, but not the
adjacent possible. Not enough pre-requisite doors had been opened: home internet
speeds weren’t good enough, research on how to deal with network latency was
poor, and setting up servers all around the world to minimize latency was a lot
of work. Being too early means confusing the adjacent and remote possible.</p>
<p>Despite online gaming being too early to become ubiquitous, the stage was set
for the web. Half-coloured nodes signal immaturity:</p>
<figure><img src="/images/adjacent-possible/adj_int.png" alt="" width="1182" height="597" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>While Wilbur Wright knew we’d one day fly (<em>remote possible</em>), he had no idea if
it was in the adjacent possible. He especially didn’t know the timing. But he
went to the Kitty Hawk sand dunes with his flimsy plane anyway:
<blockquote>
<p>“I confess that, in 1901, I said to my brother Orville that men would not fly
for fifty years. Two years later, we ourselves were making flights. This
demonstration of my inability as a prophet gave me such a shock that I have
ever since distrusted myself and have refrained from all prediction—as my
friends of the press, especially, well know. But it is not really necessary to
look too far into the future; we see enough already to be certain that it will
be magnificent. Only let us hurry and open the roads.”
— <a href="/books/the-wright-brothers/">David McCullough, The Wright Brothers</a></p>
</blockquote>
<p>Bell Labs developed the “picture phone” in the 1960s and 1970s, but they found
themselves branching off nodes in the adjacent possible that made it <em>possible</em>,
but without product/market fit. It’s possible to navigate into the adjacent
possible using the wrong doors: <code>camera + cables + packet_switching + tv</code> does
not necessarily equal a successful commercial ‘video phone’. Video telephony
wouldn’t be in the adjacent possible in a shape consumers would embrace for
another 40-50 years when convenience, price, and form factor would change with
every laptop having a webcam and every phone a front-facing camera. Babbage also
got his timing wrong.  He was ~100 years too early with the first computer
design, too.</p>
<figure><img src="/images/adjacent-possible/picturephone.jpg" alt="" width="1536" height="2048" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>These are individual failures, but part of a healthy system. We <em>need</em> people to
try. While I believe this model is useful to reason about what can be built,
it’s just as likely to make you reason incorrectly about why not to build
something. You may very well use this model to be wrong, as an excuse not
to venture into the fog of war. You won’t always know all your dependencies.</p>
<p>In the late 90s, LEGO was aggressively diversifying from the brick into video
games, movies, theme parks, and more.  Like the plastic mold had enabled the
brick’s transition from wood to plastic, they thought that a digital environment
with all possible bricks might start the next wave of innovation for LEGO. They
bought the biggest Silicon Graphics machine in all of Scandinavia and put it in
a tiny town in Denmark to computer-render the bricks to perfection. LEGO was
eager to use the newest graphics technology, the most recently opened door, and
marry it with LEGO.  Unsurprisingly, the graphics team never shipped anything.
When a door’s just been opened, you’re almost certainly going to run into
problems with immaturity (a contemporary example would be cryptocurrency).  You
only have to look at Minecraft’s success a decade later to know what could’ve
succeeded: much simpler graphics. LEGO must’ve gritted their teeth when they saw
Minecraft take off.</p>
<figure><img src="/images/adjacent-possible/darwin_minecraft.png" alt="" width="1190" height="715" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Just because big graphics computers exist doesn’t mean you have to use them.
It’s very easy to confuse the <em>eventually/remote possible</em> with the <em>adjacent
possible</em>. If you find yourself pushing, pushing, and pushing, but every
dependency seems to fail you — your dependencies <a href="https://blog.gardeviance.org/2015/02/an-introduction-to-wardley-value-chain.html">are too immature</a>. Every
project has dependencies, but only the immature ones stand out. You don’t think
about electricity as a risky dependency for a project (but you might have in the
1880s), but consumer adoption of VR certainly would be. Smartphones might have
been a risky dependency a decade ago, but wouldn’t be considered risky by anyone
today. QR-codes might have appeared risky in the West 5 years ago, but are
somewhere between “people get it” and “not completely mature” now. In China,
however, it’s common that food menus come with QR-codes.</p>
<p>When the transistor was invented at Bell Labs, Bell didn’t immediately replace
every vacuum tube amplifier with it in their telephony cabling (amplifiers are
used to counteract the natural fading of the signal over long distances). It
would take at least a decade to get the price, manufacturing, and reliability of
the transistor to the point where it could replace the vacuum tube with half a
century of R&amp;D behind it.  In fact, they were still laying down massive,
cross-country and oceanic cables with vacuum tubes for years after the
transistor was invented, patiently waiting for it to mature. I’m sure you’ve seen a
project fail because, by analogy, you ‘started cabling with transistors
immediately after its discovery.’ Sometimes you just need to bite your lip and
go with the vacuum tube.</p>
<p>Despite this, it didn’t make Bell any less excited about the transistor. They
knew that the vacuum tube’s potential had been maxed out, while the transistor’s
was just starting. Even today, as we reach <code>5nm</code> (orders and orders of magnitude
smaller and faster) transistors, the transistor’s potential still hasn’t been
depleted. Although we’re inching closer and closer…</p>
<blockquote>
<p>“Gordon Moore suggested what would have happened if the automobile industry had
matched the semiconductor business for productivity. “We would cruise
comfortably in our cars at 100,000 mph, getting 50,000 miles per gallon of
gasoline,” Moore said. “We would find it cheaper to throw away our Rolls-Royce
and replace it than to park it downtown for the evening… . We could pass it
down through several generations without any requirement for repair.””
— <a href="/books/the-chip/">T.R. Reid, The Chip</a></p>
</blockquote>
<p>Wilbur Wright made a similar remark about the limits of the airship, after trying
one for the first time on a trip to Europe:</p>
<blockquote>
<p>[Wilbur] judged it a “very successful trial.” But as he was shortly to write, the cost
of such an airship was ten times that of a Flyer, and a Flyer moved at twice the
speed. The flying machine was in its infancy while the airship had “reached its
limit and must soon become a thing of the past.” Still, the spectacle of the
airship over Paris was a grand way to begin a day.” — David McCullough, The
Wright Brothers</p>
</blockquote>
<p>It’s important to note that improving something existing can open doors just as
much as inventing something entirely new. When gas gets 20% cheaper, people
don’t just drive 20% more, they <a href="https://en.wikipedia.org/wiki/Jevons_paradox">might drive 40% more</a>. Behaviour changes.
Suddenly it looks economical to move a little further out, visit that relative
who lives in the country, or drive 10 hours on vacation.</p>
<p>As another example, the current wave of AI is fuelled by the massive
improvements in compute speed over the past few decades, partly from graphics
cards originally developed for video games. AI had been hanging out in the
remote possible for decades, just waiting for compute to hit a certain
speed/cost threshold to make them economically feasible. You might not use AI to
sort your search results if it costs $10 in compute per search, but when the
cost has compounded down to a micro-dollar, it very well might be worth it.</p>
<p>The same iterative improvements are what made the transistor so successful.
Fundamentally, it can do the same as a vacuum tube: amplify and switch signals.
Initially, it was much more expensive, but smaller and more reliable (no light
to attract bugs) — which allowed it to flourish only in niche use-cases far
upmarket, e.g. in the US military. But over time, the transistor beat the
vacuum tube in every way (although, some audiophiles still prefer the ‘sound’ of
vacuum tubes?!).</p>
<p>To use our new vocabulary, the transistor only initially expanded the adjacent
possible for a few cases.  Over time as iterative, consistent improvements were
made to price, size, and reliability, the transistor became the root of the
largest expanse of the ‘possible’ in human history. It didn’t open doors, it
opened up new continents. A more contemporary example might be home and mobile
Internet speeds, for which consistent, iterative improvements have expanded the
adjacent possible with streaming, video games, video chat, and photo-video heavy
social media.</p>
<p>It’s not possible to predict exactly what <a href="/unk-unk/">doors an improvement unlocks</a>.
This is a space of unknown-unknowns, but, hopefully positive ones. If we look at
history, making things cheaper, smaller, faster, and more reliable tends to
expand the adjacent possible. It wasn’t some magical new invention that made AI
take off in the past 7-10 years, it was iterative changes: cheaper, faster
compute, available on demand in the Cloud. Every time these improve by 10%,
something new is feasible.</p>
<p>As an example of perfect timing into the adjacent possible, consider Netflix’s
pivot into streaming. The technology they used initially was a little whacky
(Silverlight), but it was good enough to give them an initial momentum that’s
still carrying them today. They timed the technology and the market perfectly:
home Internet speeds, browser technology, etc.</p>
<p>When you find yourself in a spot where you have your eyes on something that’s a
few doors out from where you’re standing, that means it’s time to reconsider
your approach. When Apple released the iPod in 2001, they surely were eyeing a
phone in the <em>remote possible</em>. They knew that going straight for it, they’d be
blasting through doors at a pace that’d yield an immature, poor product. They
found a way to sustainably open the doors for a phone through the iPod.
When you find a seemingly intractable problem, there’s almost always a
tractable problem worth solving hiding inside of it as a stepping stone.</p>
<p>Framing problems as the ‘adjacent possible’ has been a liberating idea to me. In
the work I do, I try to find the doors that lead to the biggest possible
expansion of the possible. That’s what makes platform work so exciting to me.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 8: Data Synchronization]]></title>
        <id>https://sirupsen.com/napkin/problem-8</id>
        <link href="https://sirupsen.com/napkin/problem-8"/>
        <updated>2020-05-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for another napkin problem!
As always, consult sirupsen/napkin-math to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.
Since last time, I’ve added compression and hashing numbers to the napkin math tab]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for another napkin problem!</p>
<p>As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.</p>
<p>Since last time, I’ve added <a href="https://github.com/sirupsen/napkin-math">compression and hashing numbers</a> to the napkin math table. Plenty more I’d like to see, happy to receive help by someone eager to write some Rust!</p>
<p>About a month ago I did a little pop-up lesson for some kids about <a href="https://www.youtube.com/watch?v=R0aMzNKUAwc">competitive programming</a>. That’s the context where I did my first napkin math. One of the most critical skills in that environment is to know ahead of time whether your solution will be fast enough to solve the problem. It was fun to prepare for the lesson, as I hadn’t done anything in that space for over 6 years. I realized it’s influenced me a lot.</p>
<p>We’re on the 8th newsletter now, and I’d love to receive feedback from all of you (just reply directly to me here). Do you solve the problems? Do you just enjoy reading the problems, but don’t jot much down (that’s cool)?  Would you prefer a change in format (such as the ability to see answers before the next letter)? Do you find the problems are not applicable enough for you, or do you like them?</p>
<p><strong>Problem 8</strong></p>
<p>There might be situations where you want to checksum data in a relational database. For example, you might be <a href="https://www.youtube.com/watch?v=-GqOVx9F5QM&amp;t38m40s=">moving a tenant from one shard to another</a>, and before finalizing the move you want to ensure the data is the same on both ends (to protect against bugs in your move implementation).</p>
<p>Checksumming against databases isn’t terribly common, but can be quite useful for sanity-checking in syncing scenarios (imagine if webhook APIs had a cheap way to check whether the data you have locally is up-to-date, instead of fetching all the data).</p>
<p>We’ll imagine a slightly different scenario. We have a client (web browser with local storage, or mobile) with state stored locally from <code>table</code>. They’ve been lucky enough to be offline for a few hours, and are now coming back online. They’re issuing a sync to get the newest data. This client has offline-capabilities, so our user was able to use the client while on their offline journey. For simplicity, we imagine they haven’t made any changes locally.</p>
<figure><img src="/images/faa046d0-cb70-4852-ae36-4a728236ae6a.png" alt="" width="1313" height="654" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>The query behind an API might look like this (in reality, the query would look more like <a href="https://www.usenix.org/sites/default/files/conference/protected-files/srecon19emea_slides_weingarten.pdf#page=62">this</a>):</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> SHA1<span class="token punctuation">(</span><span class="token keyword">table</span><span class="token punctuation">.</span>updated_at<span class="token punctuation">)</span> <span class="token keyword">FROM</span> <span class="token keyword">table</span> <span class="token keyword">WHERE</span> user_id <span class="token operator">=</span> <span class="token number">1</span>
</code></pre>
<p>The user does the same query locally. If the hashes match, the user is already synced!</p>
<p>If the local and server-side hash don’t match, we’d have to figure out what’s happened since the user was last online and send the changes (possibly in both directions). This can be useful on its own, but can become very powerful for syncing when extended further.</p>
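<p>A sketch of that comparison in application code (Python; the row values and function name are illustrative, not from the post):</p>

```python
import hashlib

def sync_digest(updated_ats):
    """Fold all of a user's updated_at values into one checksum."""
    h = hashlib.sha1()
    for ts in sorted(updated_ats):  # both sides must agree on order
        h.update(ts.encode())
    return h.hexdigest()

server = ["2020-05-01 10:00:00", "2020-05-02 11:30:00"]
client = ["2020-05-01 10:00:00", "2020-05-02 11:30:00"]
print(sync_digest(server) == sync_digest(client))  # True: in sync
```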
<p><strong>(A)</strong> How much time would you expect the server-side query to take for 100,000 records that the client might have synced? Will it have different performance than the client-side query?</p>
<p><strong>(B)</strong> Can you think of a way to speed up this query?</p>
<p><strong>(C)</strong> This is a stretch question, but it’s fun to think about the full syncing scenario. How would you figure out which rows haven’t synced?</p>
<p>If you find this problem interesting, I’d encourage you to watch <a href="https://www.dotconferences.com/2019/12/james-long-crdts-for-mortals">this video</a> (it would help you answer question (C) if you decide to give it a go).</p>
<p><a href="/napkin/problem-9/">Answer is available in the next edition.</a></p>
<p><strong>Answer to Problem 7</strong></p>
<p>In the <a href="https://sirupsen.com/napkin/problem-7/">last problem</a> we looked at revision history (click it for more detail). More specifically, we looked at building revision history on top of an existing relational database with a simple composite primary key design: <code>(id, version)</code> with a full duplication of the row each time it changes. The only thing you knew was that the table was updating roughly 10 times per second.</p>
<figure><img src="/images/e93e3c58-0b13-4d2b-bd8d-b08beae30caf.png" alt="" width="1295" height="921" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>(a) How much extra storage space do you anticipate this simple scheme would require after a month? A year? What would this cost on a standard cloud provider?</strong></p>
<p>The table we’re operating on was called <code>products</code>. Let’s assume somewhere around 256 bytes per product (some larger, some smaller, the biggest variant being the product description). Each update thus generates <code>2^8 = 256</code> bytes. We can extrapolate out to a month: <code>2^8 bytes/update * 10 updates/second * 3600 seconds/hour * 24 hours/day * 30 days/month ~= 6.6 GB/month</code>, or ~<code>80 GB</code> per year. Stored on SSD at a standard cloud provider at <code>$0.1/GB/month</code>, that’ll run us ~$8/month.</p>
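<p>The arithmetic is quick to sanity-check in Python. The 256 bytes/row and the ~$0.1/GB/month SSD price (which is what makes ~80 GB come out to roughly $8/month) are assumptions:</p>

```python
bytes_per_update = 256                  # ~2^8 bytes per product row
updates_per_second = 10

per_month = bytes_per_update * updates_per_second * 3600 * 24 * 30
per_year = per_month * 12

month_gb = per_month / 10**9            # ~6.6 GB/month
year_gb = per_year / 10**9              # ~80 GB/year
cost_per_month = year_gb * 0.1          # ~$8/month at an assumed $0.1/GB/month
```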
<p><strong>(b) Based on (a), would you keep storing it in a relational database, or would you store it somewhere else? Where? Could you store it differently more efficiently without changing the storage engine?</strong></p>
<p>For this table, it doesn’t seem crazy—especially if we look at it as a cost-only problem. The main concern that comes to mind is that this will decrease query performance, at least in MySQL. Every time you load a record, you’re also <a href="https://sirupsen.com/napkin/problem-5/">loading adjacent records as you draw in the 16 KiB page</a> (as determined by the primary key).</p>
<p>Accidental abuse would also become a problem. You might have a well-meaning merchant with a bug in a script that causes them to update their products 100 times/second for a while. Do you need to clear these out? Does it permanently decrease their performance? Capping the number of revisions per product would likely be a sufficient stopgap for a while.</p>
<p>If we moved to compression, we’d likely get a <a href="https://github.com/sirupsen/napkin-math#compression-ratios">3x storage-size decrease</a>. That’s not too significant, and incurs a fair amount of complexity.</p>
<p>If, for one of the reasons above, you needed to move to another engine, I’d likely base the decision on how often it needs to be queried, and what types of queries are required on the revisions (hopefully you don’t need to join on them).</p>
<p>The absolute simplest (and cheapest) would be to store it on GCS/S3, wholesale, no diffs — and then do whatever transformations necessary inside the application. I would hesitate strongly to move to something more complicated than that unless absolutely necessary (if you were doing a lot of version syncing, that might change the queries you’re doing substantially, for example).</p>
<p>Do you have other ideas on how to solve this? Experience? I’d love to hear from you!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 7: Revision History]]></title>
        <id>https://sirupsen.com/napkin/problem-7</id>
        <link href="https://sirupsen.com/napkin/problem-7"/>
        <updated>2020-04-11T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for another napkin problem!
As always, consult sirupsen/napkin-math to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.
I debated putting out a special edition of the newsletter with COVID-related napkin math problems. However, I ultimately decided to resi]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for another napkin problem!</p>
<p>As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.</p>
<p>I debated putting out a special edition of the newsletter with COVID-related napkin math problems. However, I ultimately decided to resist, as it’s exceedingly likely to encourage misinformation. Instead, I am attaching a brief reflection on napkin math in this context.</p>
<p>In the case of COVID, napkin math can be useful to develop intuition. It became painfully clear that there are two types of people: those that appreciate exponentials, and those that don’t. Napkin math and <a href="https://www.washingtonpost.com/graphics/2020/world/corona-simulator/">simple simulations</a> have proved apt at educating about exponential growth and the properties of spread. If you don’t stare at exponential growth routinely, it’s counter-intuitive why you’d want to shut down at a few hundred cases (or less).</p>
<p>However, napkin math is insufficient for informing policy. Napkin math is for informing direction. It’s for rapidly uncovering the fog of war to light up promising paths. Raising alarm bells to dig deeper. It’s the experimenter’s tool.</p>
<p>It’s an inadequate tool when even getting an order-of-magnitude assumption right is difficult. Napkin math for epidemiology is filled with exponentials, which make it mindbogglingly sensitive to minuscule changes in input. The ones we’ve dealt with here haven’t included exponential growth. I’ve been tracking napkin articles on COVID out there from hobbyists, and some of it is outright dangerous. As they say, more lies have been written in Excel than Word.</p>
<p>On that note, on to today’s problem!</p>
<p><strong>Problem 7</strong></p>
<p>Revision history is wonderful. We use it every day in tools like Git and Google Docs. While we might not use it directly all the time, the fact that it’s there makes us feel confident in making large changes. It’s also the backbone for features like real-time collaboration, synchronization, and offline-support.</p>
<p>Many of us develop with databases like MySQL that don’t easily support revision history. They lack the capability to easily answer queries such as: “give me this record the way it looked before this change”, “give me this record at this time and date”, or “tell me what has changed since these revisions.”</p>
<p>It doesn’t strike me as terribly unlikely that years from now, as computing costs continue to fall, revision history will be a default feature. Not a feature reserved for specialized databases like <a href="https://github.com/attic-labs/noms">Noms</a> (if you’re curious about the subject, and an efficient data-structure to answer queries like the above, read about <a href="https://github.com/attic-labs/noms/blob/master/doc/intro.md#prolly-trees-probabilistic-b-trees">Prolly Trees</a>). But today, those features are not particularly common. Most companies do it differently.</p>
<p>Let’s try to analyze what it would look like to get revision history on top of a standard SQL database. As we always do, we’ll start by analyzing the simplest solution. Instead of mutating our records in place, our changes will always <em>copy</em> the entire row, increment a <code>version_number</code> on the record (which is part of the primary key), as well as an <code>updated_at</code> column. Let’s call the table we’re operating on <code>products</code>. I’ll put down one assumption: we’re seeing about 10 updates per second. Then I’ll leave you to form the rest of the assumptions (most of napkin math is about forming assumptions).</p>
<figure><img src="/images/e93e3c58-0b13-4d2b-bd8d-b08beae30caf.png" alt="" width="1295" height="921" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
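<p>As a hypothetical in-memory sketch of the scheme (in Python rather than SQL; all names here are made up): every “update” inserts a full copy of the row under an incremented version, keyed by <code>(id, version)</code>, and nothing is ever mutated in place.</p>

```python
import time

revisions = {}  # (product_id, version) -> row dict

def latest_version(product_id):
    versions = [v for (pid, v) in revisions if pid == product_id]
    return max(versions, default=0)

def update_product(product_id, **changes):
    # Copy the latest row wholesale, apply the changes, bump the version.
    version = latest_version(product_id)
    row = dict(revisions.get((product_id, version), {}))
    row.update(changes, updated_at=time.time())
    revisions[(product_id, version + 1)] = row

update_product(1, title="Shoe")
update_product(1, title="Black Shoe")

# Revision-history queries become simple lookups:
old = revisions[(1, 1)]["title"]   # "Shoe"
new = revisions[(1, 2)]["title"]   # "Black Shoe"
```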
<p>(a) How much extra storage space do you anticipate this simple scheme would require after a month? A year? What would this cost on a standard cloud provider?</p>
<p>(b) Based on (a), would you keep storing it in a relational database, or would you store it somewhere else? Where? Could you store it differently more efficiently without changing the storage engine?</p>
<p><a href="/napkin/problem-8/">Answer is available in the next edition.</a></p>
<p><strong>Answer to Problem 6</strong></p>
<p>The <a href="https://sirupsen.com/napkin/problem-6/">last problem</a> can be summarized as: Is it feasible to build a client-side search feature for a personal website, storing all articles in memory? Could the New York Times do the same thing?</p>
<p>On my website, I have perhaps 100 pieces of public content (these newsletters, blog posts, book reviews). Let’s say that they’re on average 1,000 words of searchable content, with each word being an average of 5 characters/bytes (fairly standard for English; e.g. this email is ~5.1). We get a total of: <code>5 * 10^0 * 10^3 * 10^2 = 5 * 10^5 bytes = 500 KB = 0.5 MB</code>. It’s not crazy to have clients download <code>0.5 MB</code> of cached content, especially considering that gzip seems to compress a blog post at about 1:3.</p>
<p>The second consideration would be: can we search it fast enough? If we do a simple search match, this is essentially about scanning memory. We should be able to read <a href="https://github.com/sirupsen/napkin-math">500 KB in well under a millisecond</a>.</p>
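<p>A quick sanity-check of the numbers in Python; the 5 bytes/word and the ~10 GB/s sequential memory read rate are the assumptions above, and the NYT figures are the ballpark from the next paragraph:</p>

```python
bytes_per_word = 5     # ~5 bytes per English word

# Personal site: ~100 pieces of ~1,000 words each.
bytes_total = bytes_per_word * 1000 * 100          # 500,000 bytes = 0.5 MB

# Naive scan at ~10 GB/s of sequential memory reads.
scan_us = bytes_total / (10 * 10**9) * 10**6       # ~50 microseconds

# NYT ballpark: 30 pieces/day of ~1,000 words for 10 years.
nyt_bytes = bytes_per_word * 1000 * 30 * 365 * 10  # ~550 MB
nyt_scan_ms = nyt_bytes / (10 * 10**9) * 10**3     # ~55 ms
```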
<p>For the New York Times, we might ballpark that they publish 30 pieces of ~1,000-word content a day. While it’d be sweet to index since their beginnings in 1851, we’ll just consider 10 years at this publishing speed as a ballpark: <code>5 * 10^0 * 10^3 * 30 * 365 * 10 ~= 550 MB</code>. That’s too much to do in the browser, so in that case we’d suggest server-side search, especially if we want to go back more than 10 years (by the way, past news coverage is fascinating: I highly recommend reading articles about SARS-CoV-1 from 2002 right now). Searching that much content naively would take about 50 ms, which might be OK, but since this is only 10 years and there’s even more data, we’d likely want to investigate more sophisticated data-structures for search.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 6: In-memory Search]]></title>
        <id>https://sirupsen.com/napkin/problem-6</id>
        <link href="https://sirupsen.com/napkin/problem-6"/>
        <updated>2020-03-07T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for napkin problem number 6!
As always, consult sirupsen/napkin-math to solve today’s problem, which
has all the resources you need. Keep in mind that with napkin problems you
always have to make your own assumptions about the shape of the problem.
Problem 6
Quick napkin calculations are helpful to iterate through simple, naive solutions
and see whether they]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for napkin problem number 6!</p>
<p>As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> to solve today’s problem, which
has all the resources you need. Keep in mind that with napkin problems you
always have to make your own assumptions about the shape of the problem.</p>
<p><strong>Problem 6</strong></p>
<p>Quick napkin calculations are helpful to iterate through simple, naive solutions
and see whether they might be feasible. If they are, it can often speed up
development drastically.</p>
<p>Consider building a search function for your personal website which currently
doesn’t depend on <em>any</em> external services. Do you need one, or can you do
something ultra-simple, like loading <em>all</em> articles into memory and searching
them with Javascript? Can NYT do it?</p>
<p>Feel free to reply with your answers; I’d love to hear them! Mine will be given in the next edition.</p>
<p><a href="/napkin/problem-7/">Answer is available in the next edition.</a></p>
<p><strong>Answer to Problem 5</strong></p>
<p>The question is explained <a href="https://sirupsen.com/napkin/problem-5/">in depth in the past edition</a>. Please refresh
your memory on that first! This is one of my favourite problems in the newsletter
so far, so I highly recommend working through it — even if you’re just doing it
with my answer below.</p>
<figure><img src="/images/ba039ecb-9a11-4e32-b495-fa90f6caef4c.png" alt="" width="1168" height="491" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>(1) When each 16 KiB database page has only 1 relevant row per page, what is the
query performance (with a <code>LIMIT 100</code>)?</strong></p>
<p>This would require 100 random SSD accesses, which we know from <a href="https://github.com/sirupsen/napkin-math">the resource</a> to be <code>100 us</code> each, so a total of 10 ms for this simple query where we have to fetch a full page for each of the 100 rows.</p>
<p><strong>(2) What is the performance of (1) when all the pages are in memory?</strong></p>
<p>We can essentially assume sequential memory read performance for the 16 KiB page, which gets us to <code>(16 KiB / 64 bytes) * 5 ns =~ 1280 ns</code>. This is certainly an upper bound, since we likely won’t have to traverse the whole page in memory. Let’s round it to <code>1 us</code>, giving us a total query time of <code>100 us</code> or <code>0.1 ms</code>, or about <code>100x</code> faster than (1).</p>
<p>In reality, I’ve observed this many times where a query will show up in the slow
query log, but subsequent runs will be up to 100x faster, for exactly this
reason. The solution to avoid this is to change the primary key, which we can
now get into…</p>
<p><strong>(3) What is the performance of this query if we change the primary key to
<code>(shop_id, id)</code> to avoid the worst case of a product per page?</strong></p>
<p>Let’s assume each product is ~128 bytes, so we can fit <code>16 KiB / 128 bytes = 2^14 bytes / 2^7 bytes = 2^7 = 128</code> products per page, which means we only need a single read.</p>
<p>If it’s on disk, <code>100 us</code>, and in memory (per our answer to (2)) around <code>1 us</code>.
In both cases, we improve the worst case by 100x by choosing a good primary key.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 5: Composite Primary Keys]]></title>
        <id>https://sirupsen.com/napkin/problem-5</id>
        <link href="https://sirupsen.com/napkin/problem-5"/>
        <updated>2020-02-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for napkin problem number 5! If
you are wondering why you’re receiving this email, you likely watched my talk on
napkin math and decided to sign
up for some monthly practise.
Since last, in the napkin-math repository I’ve added system call
overhead. I’ve been also been working on <a href="https://github.com/sirupsen/napkin-math/blo]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for napkin problem number 5! If
you are wondering why you’re receiving this email, you likely watched my talk on
<a href="https://www.youtube.com/watch?v=IxkSlnrRFqc">napkin math</a> and decided to sign
up for some monthly practise.</p>
<p>Since last, I’ve added system call overhead to <a href="https://github.com/sirupsen/napkin-math">the napkin-math repository</a>. I’ve also been working on <a href="https://github.com/sirupsen/napkin-math/blob/master/src/main.rs#L594-L675"><code>io_uring(2)</code> disk benchmarks</a>, which leverage <a href="https://lwn.net/Articles/776703/">a new Linux API from 5.1</a> to queue I/O system calls (in more recent kernels networking is also supported; it’s under active development). This avoids system-call overhead and allows the kernel to order the operations as efficiently as it likes.</p>
<p>As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> for resources and help to
solve this edition’s problem! This will also have a link to the archive of past
problems.</p>
<p><strong>Napkin Problem 5</strong></p>
<p>In databases, typically data is ordered on disk by some <em>key</em>. In relational
databases (and definitely MySQL), as an example, the data is ordered by the
primary key of the table. For many schemas, this might be the <code>AUTO_INCREMENT id</code> column. A good primary key is one that <em>stores together records that are
accessed together</em>.</p>
<p>If we have a <code>products</code> table with <code>id</code> as the primary key, we might do a query like this to fetch 100 products for the <code>api</code>:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> products <span class="token keyword">WHERE</span> shop_id <span class="token operator">=</span> <span class="token number">13</span> <span class="token keyword">LIMIT</span> <span class="token number">100</span>
</code></pre>
<p>This is going to zig-zag through the product table pages on disk to load the 100 products. In each page, unfortunately, there are records from other shops (see illustration below) that would never be relevant to <code>shop_id = 13</code>. If we are <em>really</em> unlucky, there may be only 1 product per page / disk read! Each page, we’ll assume, is 16 KiB (the default in e.g. MySQL). In the worst case, we could load 100 * 16 KiB!</p>
<figure><img src="/images/ba039ecb-9a11-4e32-b495-fa90f6caef4c.png" alt="" width="1168" height="491" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>(1) What is the performance of the query in the worst-case, where we load only one
product per page?</p>
<p>(2) What is the worst-case performance of the query when the pages are all in
memory cache (typically that would happen after (1))?</p>
<p>(3) If we changed the primary key to be <code>(shop_id, id)</code>, what would the
performance be when (3a) going to disk, and (3b) hitting cache?</p>
<p>I love seeing your answers, so don’t hesitate to email me those back!</p>
<p><a href="/napkin/problem-6/">Answer is available in the next edition.</a></p>
<p><strong>Answer to Problem 4</strong></p>
<p>The question can be summarized as: How many commands-per-second can a simple,
in-memory, single-threaded data-store do? See <a href="https://buttondown.email/computer-napkins/archive/napkin-problem-4/">the full question in the
archives</a>.</p>
<p>The network overhead of the query is <code>~10us</code> (you can find this number in
<a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a>). We expect each memory read to be random, so the
latency here is <code>50ns</code>. This comes out in the wash against the networking overhead, so
with a single CPU, we estimate that we can roughly do <code>1s/10us = 1 s / 10^-5 s = 10^5 = 100,000</code> commands per second, or about 10x what the team was seeing.
Something must be wrong!</p>
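<p>Spelled out in code, with both base rates taken from sirupsen/napkin-math:</p>

```python
network_overhead_s = 10 * 10**-6    # ~10 us of network overhead per command
memory_read_s = 50 * 10**-9         # one random memory read, ~50 ns

# The 50 ns memory read vanishes next to 10 us of networking.
per_command_s = network_overhead_s + memory_read_s
commands_per_second = 1 / per_command_s    # ~100,000 on a single CPU
```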
<p>Knowing that, you might be interested to know that <a href="https://raw.githubusercontent.com/antirez/redis/6.0/00-RELEASENOTES">Redis 6 rc1 was just released with threaded I/O support</a>.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[How does progress(1) work?]]></title>
        <id>https://sirupsen.com/progress</id>
        <link href="https://sirupsen.com/progress"/>
        <updated>2020-01-26T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We’ll cover a neat little utility called
progress(1). Many common utilities like
cp or gzip don’t spit out a progress bar by default. progress finds those
processes and estimates how far along they are with their operation. For
example, if you’re copying a 10Gb with cp, running progress will indicate
that it’s progressed 1Gb, and has another]]></summary>
        <content type="html"><![CDATA[<p>We’ll cover a neat little utility called
<a href="https://github.com/Xfennec/progress"><code>progress(1)</code></a>. Many common utilities like
<code>cp</code> or <code>gzip</code> don’t spit out a progress bar by default. <code>progress</code> finds those
processes and estimates how far along they are with their operation. For
example, if you’re copying a <code>10 GB</code> file with <code>cp</code>, running <code>progress</code> will indicate
that it’s progressed <code>1 GB</code>, and has another <code>9 GB</code> to go.</p>
<p>Here’s an example, kindly borrowed from the project’s README:</p>
<figure><img src="/images/progress.png" alt="Picture showing progress(1) in a terminal." width="720" height="278" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>What I was interested in is: how does it work? The <a href="https://github.com/Xfennec/progress#how-does-it-work">README</a> briefly goes
over it, but I wanted to go a little deeper. Fortunately, it’s a fairly simple C
program. While this utility works on MacOS, I’ll cover how it works on Linux.
On MacOS, the methods for obtaining the information about the file descriptors
and processes are slightly different, utilizing a library called <code>libproc</code>, due
to the absence of the <code>/proc</code> file-system. That’s as deep as we’ll go on MacOS.</p>
<p>At the heart of <code>progress</code>, we find the function <a href="https://github.com/Xfennec/progress/blob/7a0767dc0b2b6763a4c947ecfe9c140c93655ab9/progress.c#L686"><code>monitor_processes</code></a>.
On Linux, every process exposes itself as a directory on the file-system in
<code>/proc</code> as <code>/proc/&lt;pid&gt;</code>. In that directory there’s, for example, the <code>exe</code> file: a
link pointing to the binary that the process is executing, such as
<code>/bin/tar</code>. There are many other interesting links and files in here. I
open <code>environ</code> regularly in production to check which environment variables a
process was started with. Other files will tell you about its memory usage, various process
configuration, or its priority if the OOM-killer is looking for its next target.</p>
<p><code>progress</code> will look through the <code>exe</code> links for all processes on the system to
find interesting binaries, like <code>cp</code>, <code>cat</code>, <code>tar</code>, <code>grep</code>, <code>cut</code>, <code>gunzip</code>,
<code>sort</code>, <code>md5sum</code>, and many <a href="https://github.com/Xfennec/progress/blob/7a0767dc0b2b6763a4c947ecfe9c140c93655ab9/progress.c#L61-L69">more</a>.</p>
<p>For each of these processes, it’ll scan every file descriptor the process has
opened through the <code>/proc/&lt;pid&gt;/fd</code> and <code>/proc/&lt;pid&gt;/fdinfo</code> directories. These
contain ample information about the file, such as the name of the file, the
size, what position we’re reading at, and so on. <code>progress</code> will skip file
descriptors that are invalid or are not for files, e.g. a socket.</p>
<p><code>progress</code> will find the biggest file descriptor opened by the process (e.g.
whatever <code>cp</code> is copying) and see what offset in the file the process is at.
Based on that, the total file size, and a second read after waiting a second,
it can estimate the progress of the process and its throughput.</p>
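<p>A toy version of the Linux mechanism can be sketched in Python. This is a sketch, not <code>progress</code>’s actual C code: the <code>pos:</code> field and the paths are real <code>/proc</code> conventions, but the function names are made up.</p>

```python
import os
import re

def parse_pos(fdinfo_text):
    # Pull the current file offset out of the contents of
    # /proc/PID/fdinfo/FD, which looks like "pos:\t4096\nflags:\t..."
    return int(re.search(r"^pos:\s*(\d+)", fdinfo_text, re.MULTILINE).group(1))

def fd_progress(pid, fd):
    # Estimate how far the process is through the file it has open on
    # this descriptor, the way progress(1) does: current offset / size.
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        pos = parse_pos(f.read())
    size = os.stat(f"/proc/{pid}/fd/{fd}").st_size
    return pos / size if size else 1.0
```

<p>Sampling <code>fd_progress</code> twice with a second in between gives the throughput estimate.</p>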
<p>Once <code>progress</code> has done this for all processes, it’ll either quit or do it all
over again (this only takes a few milliseconds). To the user, this appears as
continuous monitoring of the processes’ progress!</p>
<p>Of course, this simple method has its limitations. If you’re copying a lot of
small files, then it won’t help you very much. It could be extended to detect
such programs and monitor them, but it’s certainly not trivial. The way it
works also limits its usefulness for network transfers, depending on how the network
program is written. If it streams a file to disk as it transfers it, it’ll work
well; but if it loads the whole thing into memory and then transfers it,
<code>progress</code> won’t know what to do. From the documentation, it appears to work
well for downloads in many browsers, presumably because they pre-allocate
a large file based on the Content-Length header. <code>progress</code> can then
monitor how far along the offset we are.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 4: Redis throughput]]></title>
        <id>https://sirupsen.com/napkin/problem-4</id>
        <link href="https://sirupsen.com/napkin/problem-4"/>
        <updated>2020-01-07T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for napkin problem number four! If you are wondering why you’re receiving this email, you likely watched my talk on napkin math and decided to sign up for some monthly training.
Since last, there has been some smaller updates to the napkin-math repository and the accompanying program. I’ve been brushing up on x86 to ensure that the]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for napkin problem number four! If you are wondering why you’re receiving this email, you likely watched my talk on <a href="https://www.youtube.com/watch?v=IxkSlnrRFqc">napkin math</a> and decided to sign up for some monthly training.</p>
<p>Since last, there have been some smaller updates to <a href="https://github.com/sirupsen/napkin-math">the napkin-math repository</a> and the accompanying program. I’ve been brushing up on x86 to ensure that the base-rates truly represent the upper bound, which will require some smaller changes. The numbers are unlikely to change by an order of magnitude, but I am dedicated to making sure they are accurate. If you’d like to help with providing some napkin calculations, I’d love contributions around serialization (JSON, YAML, …) and compression (Gzip, Snappy, …). I am also working on turning all my notes from the above talk into a long, long blog post.</p>
<p>With that out of the way, this week we’ll do a slightly easier problem than last! As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> for resources and help to solve today’s problem.</p>
<p><strong>Napkin Problem 4</strong></p>
<p>Today, as you were preparing your organic, high-mountain Taiwanese oolong in the kitchenette, one of your lovely co-workers mentioned that they were looking at adding more Redises, because they were trending aggressively towards its max of 10,000 commands per second. You asked them how they were using it (were they running some obscure O(n) command?). They’d used BPF probes to determine that it was all <code>GET &lt;key&gt;</code> and <code>SET &lt;key&gt; &lt;value&gt;</code>. They also confirmed all the values were roughly 64 bytes or less. For those unfamiliar with Redis, it’s a single-threaded, in-memory key-value store written in C.</p>
<p>Unfazed after this encounter, you walk to the window. You look out and sip your high-mountain Taiwanese oolong. As you stare at yet another condominium building being built—it hits you. 10,000 commands per second. 10,000. Isn’t that abysmally low? Shouldn’t something that’s fundamentally ‘just’ doing random memory reads and writes over an established TCP session be able to do more?</p>
<p>What kind of throughput might we be able to expect for a single-thread, as an absolute upper-bound if we disregard I/O? What if we include I/O (and assume it’s blocking each command), so it’s akin to a simple TCP server? Based on that result, would you say that they have more investigation to do before adding more servers?</p>
<p><em>Solution to this problem is <a href="/napkin/problem-5/">available in the next edition</a></em></p>
<p><strong>Answer to Problem 3</strong></p>
<p>You can read the problem in the archive, <a href="https://buttondown.email/computer-napkins/archive/16a42790-e498-4804-8e17-769ff3a30d34">here</a>.</p>
<figure><img src="/images/2042e909-962a-48d6-b1e5-a7e03c6f7092.png" alt="" width="1261" height="644" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>We have 4 bitmaps (one per condition) of <code>10^6</code> product ids, each of 64 bits.
That’s <code>4 * 10^6 * 64 bits = 32 MB</code>. Would this be in memory or on SSDs? Well,
let’s assume the largest merchants have 10^6 products and 10^3 attributes; that
means a total of <code>10^6 * 10^3 * 64 bits = 8 GB</code>. That’d cost us about $8 in
memory, or about $1 to store on disk. In terms of performance, this is nicely
sequential access. For memory, <code>32 MB * 100 us/MB = 3.2 ms</code>. For SSD (about 10x
cheaper, and 10x slower than memory), 30 ms. 30 ms is a bit high, but 3 ms is
acceptable. $8 is not crazy, given that this would be the absolute largest
merchant we have. If cost becomes an issue, we could likely employ good caching.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 3: Membership Intersection Service]]></title>
        <id>https://sirupsen.com/napkin/problem-3</id>
        <link href="https://sirupsen.com/napkin/problem-3"/>
        <updated>2019-12-15T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for napkin problem number three! If you are wondering why you’re receiving this email, you likely watched my talk on napkin math.
This week’s problem is higher level, which is different from the past few. This makes it more difficult, but I hope you enjoy it!
Napkin Problem 3
You are considering how you might implement a set-membership service. Your use-c]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for napkin problem number three! If you are wondering why you’re receiving this email, you likely <a href="https://www.youtube.com/watch?v=IxkSlnrRFqc">watched my talk on napkin math.</a></p>
<p>This week’s problem is higher level, which is different from the past few. This makes it more difficult, but I hope you enjoy it!</p>
<p><strong>Napkin Problem 3</strong></p>
<p>You are considering how you might implement a set-membership service. Your use-case is to build a service to filter products by particular attributes, e.g. efficiently among all products for a merchant get shoes that are: black, size 10, and brand X.</p>
<p>Before getting fancy, you’d like to examine whether the simplest possible algorithm would be sufficiently fast: store, for each attribute, a list of all product ids for that attribute (see drawing below). Each query to your service will take the form: <code>shoe AND black AND size-10 AND brand-x</code>. To serve the query, you find the intersection (i.e. product ids that match in all terms) between all the attributes. This should return the product ids for all products that match that condition. In the case of the drawing below, only P3 (of those visible) matches those conditions.</p>
<figure><img src="/images/7dfa1786-d88e-41bd-b336-30a9092db882.png" alt="Picture illustrating the attributes and product ids." width="400" height="406" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>The largest merchants have 1,000,000 different products. Each product will be represented in this naive data-structure as a 64-bit integer. While simply shown as a list here, you can assume that we can perform the intersections between rows efficiently in O(n) operations. In other words, in the worst case you have to read all the integers for each attribute only once per term in the query. We could implement this in a variety of ways, but the point of the back-of-the-envelope calculation is to not get lost in the weeds of implementation too early.</p>
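<p>For concreteness, the naive algorithm can be sketched in a few lines (the product ids below are hypothetical):</p>

```python
def intersect(attribute_ids):
    """Intersect the product-id lists of every term in the query,
    reading each list only once (O(n) per term)."""
    result = set(attribute_ids[0])
    for ids in attribute_ids[1:]:
        result &= set(ids)  # keep only ids present in both sets
    return result

# Hypothetical data: product 3 is the only one matching all four terms.
matches = intersect([
    [1, 3, 7],  # shoe
    [3, 7, 9],  # black
    [2, 3, 5],  # size-10
    [3, 8],     # brand-x
])
# matches == {3}
```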
<p>What would you estimate the worst-case performance of an average query with 4 AND conditions to be? Based on this result and your own intuition, would you say this algorithm is sufficient or would you investigate something more sophisticated?</p>
<p>As always, you can find resources at <a href="https://github.com/sirupsen/napkin-math">github.com/sirupsen/napkin-math</a>. The talk linked is the best introduction to the topic.</p>
<p>Please reply with your answer!</p>
<p><em>Solution to this problem is <a href="/napkin/problem-4/">available in the next edition</a></em></p>
<p><strong>Answer to Problem 2</strong></p>
<p><em>Your SSD-backed database has a usage-pattern that rewards you with an 80%
page-cache hit-rate (i.e. 80% of disk reads are served directly out of memory
instead of going to the SSD). The median is 50 distinct disk pages for a query
to gather its query results (e.g. InnoDB pages in MySQL). What is the expected
average query time from your database?</em></p>
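<p>The arithmetic in the answer below can be sketched as (the ~100 us random-read figure comes from the reference):</p>

```python
pages = 50           # distinct disk pages for the median query
hit_rate = 0.8       # page-cache hit-rate
ssd_read_s = 100e-6  # ~100 us per random SSD read, per the reference

cached_reads = pages * hit_rate   # 40 reads served from memory (~free)
ssd_reads = pages - cached_reads  # 10 reads go to the SSD
ssd_time_ms = ssd_reads * ssd_read_s * 1000  # ~1 ms lower bound
```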
<p><code>50 * 0.8 = 40</code> disk reads come out of the memory cache. The remaining 10 SSD
reads require a random SSD seek, each of which will take about <code>100 us</code> as per
<a href="https://github.com/sirupsen/napkin-math">the reference</a>. The reference says 64
bytes, but the OS will read a full page at a time from SSD, so this will be
roughly right. So call it a lower bound of <code>1ms</code> of SSD time. The page-cache
reads will all be less than a microsecond, so we won’t even factor them in. It’s
typically the case that we can ignore any memory latency as soon as I/O is
involved. Somewhere between 1-10ms seems reasonable, when you add in
database-overhead and that 1ms for disk-access is a lower-bound.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 2: Expected Database Query Latency]]></title>
        <id>https://sirupsen.com/napkin/problem-2</id>
        <link href="https://sirupsen.com/napkin/problem-2"/>
        <updated>2019-11-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Fellow computer-napkin-mathers, it’s time for napkin problem #2. The last
problem’s solution you’ll find at the end! I’ve updated
sirupsen/napkin-math with last week’s
tips and tricks—consult that repo if you need a refresher. My goal for that
repo is to become a great resource for napkin calculations in the domain of
computers. My talk from SRECON’s video was published this week, you can see it
<a href="https://www.youtube.com/watch?v=Ixk]]></summary>
        <content type="html"><![CDATA[<p>Fellow computer-napkin-mathers, it’s time for napkin problem #2. The last
problem’s solution you’ll find at the end! I’ve updated
<a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> with last week’s
tips and tricks—consult that repo if you need a refresher. My goal for that
repo is to become a great resource for napkin calculations in the domain of
computers. The video of my talk from SRECON was published this week; you can see it
<a href="https://www.youtube.com/watch?v=IxkSlnrRFqc">here.</a></p>
<p><strong>Problem #2: Your SSD-backed database has a usage-pattern that rewards you with
an 80% page-cache hit-rate (i.e. 80% of disk reads are served directly out of
memory instead of going to the SSD). The median is 50 distinct disk pages for a
query to gather its query results (e.g. InnoDB pages in MySQL). What is the
expected average query time from your database?</strong></p>
<p>Reply to this email with your answer, happy to provide you mine ahead of time if
you’re curious.</p>
<p><em>Solution to this problem is <a href="/napkin/problem-3/">available in the next edition</a></em></p>
<p><strong>Last Problem’s Solution</strong></p>
<p><strong>Question:</strong> <strong>How much will the storage of logs cost for a standard,
monolithic 100,000 RPS web application?</strong></p>
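<p>The arithmetic can be sketched in a few lines (rates as in the prose answer: ~1 KB of logs per request and $0.01 per GB-month of disk):</p>

```python
bytes_per_request = 1e3        # ~1 KB of logs per request
requests_per_second = 1e5
seconds_per_day = 9e4          # 86,400, rounded to 9 * 10^4
disk_cost_per_gb_month = 0.01  # $/GB/month, per the reference

bytes_per_day = bytes_per_request * requests_per_second * seconds_per_day
# 9 * 10^12 bytes/day = 9 TB/day

# Keeping a month of logs around: ~270 TB stored, at $0.01/GB/month.
gb_stored = bytes_per_day / 1e9 * 30
cost_per_month = gb_stored * disk_cost_per_gb_month  # ~$2,700/month
```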
<p><strong>Answer:</strong> First I jotted down the basics and converted them to scientific
notation for easy calculation: <code>~1 * 10^3 bytes/request (1 KB)</code>, <code>9 * 10^4 seconds/day</code>, and <code>10^5 requests/second</code>. Then I multiplied these numbers into
storage per day: <code>10^3 bytes/request * 9 * 10^4 seconds/day * 10^5 requests/second = 9 * 10^12 bytes/day = 9 TB/day</code>. Then we need to use the
monthly cost for disk storage from
<a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> (or your cloud’s
pricing calculator) — <code>$0.01 GB/month</code>. So we have <code>9 TB/day * $0.01 GB/month</code>. We
do some unit conversions (you could do this by hand to practise, or on
Wolframalpha) and get to <code>$3 * 10^3 per month</code>, or $3,000 per month. Most of
those who replied got somewhere between $1,000 and $10,000 — well within an
order of magnitude!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 1: Logging Cost]]></title>
        <id>https://sirupsen.com/napkin/problem-1</id>
        <link href="https://sirupsen.com/napkin/problem-1"/>
        <updated>2019-10-19T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends around the world: it’s time for your very first systems estimation problem! Confused why you’re receiving this email? Likely you attended my talk at SRECON 19, where I said that I’d start a newsletter with occasional problems to practise your back-of-the-envelope computer calculation skills—if enough of you subscribed! Enough of you did, so here we are!
Problem #1: How much will the storage of logs cost]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends around the world: it’s time for your very first systems estimation problem! Confused why you’re receiving this email? Likely you <a href="https://www.usenix.org/conference/srecon19emea">attended my talk at SRECON 19</a>, where I said that I’d start a newsletter with occasional problems to practise your back-of-the-envelope computer calculation skills—if enough of you subscribed! Enough of you did, so here we are!</p>
<p><strong>Problem #1: How much will the storage of logs cost for a standard, monolithic 100,000 RPS web application?</strong></p>
<p>Reply to this email with your answer and how you arrived there. Then I’ll send you mine.</p>
<p><em>Solution to this problem is <a href="/napkin/problem-2/">available in the next edition</a></em></p>
<p><strong>Hints</strong></p>
<p>You can find many numbers you might need on <a href="https://github.com/sirupsen/base-rates">sirupsen/base-rates.</a> If you don’t find them, consider submitting a PR! I hope for that repo to grow to be the canonical source for systems napkin math.</p>
<p>Don’t overcomplicate the solution by including e.g. CDN logs, slow query logs, etc. Keep it simple.</p>
<p>You might want to refresh your memory on <a href="https://en.wikipedia.org/wiki/Fermi_problem">Fermi Problems</a>. You need less precision than you think: the goal is just to get the exponent right, the x in n * 10^x.</p>
<p><a href="https://www.wolframalpha.com">Wolframalpha</a> is good at calculating with units, you may use that the first few times—but over time the goal is for you to be able to do these calculations with no aids!</p>
<p>Consider using spaced repetition to remember the numbers you need for today’s problem, e.g. <a href="http://communis.io/">http://communis.io/</a> is a messenger bot.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[2018]]></title>
        <id>https://sirupsen.com/2018</id>
        <link href="https://sirupsen.com/2018"/>
        <updated>2019-01-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Every year, I spend some time reflecting on the year that passed. After reading
last year’s post, I noticed a fair bit of self-indulgent tangent
chasing. Most of which should likely have been separate posts. I’m attempting
less of that this year. I’m continuing to evolve the format, but it’ll probably
be a few years until I settle on one.
Berlin
Jenn took a medium-term assignment in Berlin, so a decent chunk of 2018 I spent
stretched between Be]]></summary>
        <content type="html"><![CDATA[<p>Every year, I spend some time reflecting on the year that passed. After reading
<a href="/2017/">last year’s post</a>, I noticed a fair bit of self-indulgent tangent
chasing. Most of which should likely have been separate posts. I’m attempting
less of that this year. I’m continuing to evolve the format, but it’ll probably
be a few years until I settle on one.</p>
<h2 id="berlin">Berlin</h2>
<p>Jenn took a medium-term assignment in Berlin, so a decent chunk of 2018 I spent
stretched between Berlin and Ottawa. After five years in Ottawa, I was starting
to feel a tad restless. Five years easily turns into 10, and while five years is
a long time, 10 is a really long time. Spending time in Berlin provided an
opportunity to test what life would be like in an “objectively cooler” city,
without committing to a major change. We enjoyed some fantastic weekends in
Berlin: knödel shops where the hairdo-memo said ‘Grease’ (unfortunately, we
missed it, so no mullet this time around), biking across the city with friends
visiting from Denmark to a bus-turned-café, and the weekly kinda-festival at
Mauerpark, where amphitheatres turn into makeshift crowd-karaoke. Despite all of
this, the best thing about the stint in Berlin was, as cliché as it may sound,
the re-appreciation of how good my life is in Ottawa. Berlin is a city that
screams ‘temporary.’ I don’t recall meeting a single person ‘from there’ or a
single person who wanted to stay there permanently.  The city has a faint smell
of millennial quarter-life crisis, I know, because given another year, that’d
likely have been what drew me there! Close to family, but also close to the
global pulse. In contrast, Ottawa has the diametrically opposite effect on
people. After this, I’m pretty okay with that.</p>
<h2 id="reading">Reading</h2>
<p>More so than the satisfaction of chasing a high number of books read, it was a
significant focus-point for 2018 to evolve the system <strong>around</strong> reading. I
increasingly feel that the more time I allocate to processing what I’ve read
(primarily through writing, creating flashcards, and cataloging ideas), the more
long-term reward. I wrote a much longer post about <a href="https://sirupsen.com/read/">the
system</a> I went through most of 2018 with. It’ll
continue to evolve, and I expect to update the post within the next year or two
with the experiments I’m carrying out. The feedback loops on increasing reading
retention are wonderfully and painfully long. Last year, I ended up <a href="https://www.goodreads.com/user_challenges/10779425">reading
around 55 books</a>. Some that
stood out were The Wright Brothers; wonderful story of innovation and fortitude,
The North Water; the fiction that’s kept me most glued since Harry Potter, The
Course of Love; raw and genuine account of long-term relationships, Doing Good
Better; a way to think about charity that appealed to me, and The Goal; part of
the underrated genre of fiction with a refreshingly tangible takeaway.</p>
<h2 id="health">Health</h2>
<p>The frequent flights between the New and Old World were dreadful. The whole
thing clinched for me that the romantic idea of a “Nomad Lifestyle” would be a
nightmare for me. If that phase of life hits me, it’s clear that my shape will
be in 3-month chunks, not backpack-increments. Always coming out of jet lag, or
being about to go in it, was exhausting. That, and the poor seating that invited
poor posture. Under those conditions, it proved challenging to improve physical
health, despite the Gym in Berlin being the best I’ve frequented yet. It had
that dungeon-gym vibe I didn’t know I’d craved that badly. The health hit of jet
lag and transit-nutrition was uplifted by the intimidation factor of the guy
next to you casually deadlifting 500lbs, with his dog taking a nap on the
platform. This year, 2019, I hope to make some strides to improve my physical
fitness. More specifically, I’d like a ball to chase (event, in this context)
and improve my cardio, not just strength.</p>
<p>Inspired by a co-worker’s pulse watch, I decided that’d be an excellent motivator
to incorporate more cardio. Having a heart-rate monitor with a number closely
tuned to how miserable I’m feeling turned out to be a winning bet for tying my
running shoes more often. An unexpected additional benefit was that friends
started popping up in the Apple Watch fitness app. I have no problems with
abusing my competitive gene without shame when it comes to my health. Beating
Jeff turns out to be a great motivator.</p>
<h2 id="work">Work</h2>
<p>2018 became a year of building teams. In 2017, we were about 1.5 teams, but by
the end of 2018, there were 3. The realization that I needed to build these teams
led to an intense hiring cycle. Time well spent. With these teams, we’re able to
do the things that we’ve dreamt about for many years—now, rather than someday.
It was a year with two themes: moving everything to the Cloud, and, improving
reliability. For the former, the team built a tool that allows us to move a shop
from one database to another with virtually no impact to the merchant. With this
tool, we moved every single shop individually from our data-centers to the
cloud. It’s mind-boggling to me that we’ve run every Shopify merchant through
this tool without mangling any.</p>
<p>Long-term, the concern for any company is that development slows down. You
combat that with world-class tooling. One thing we started investing in as a
team is a standard way for all the applications inside the company to
communicate. We started seeing more and more applications built independently,
but the tooling for them to leverage each other wasn’t improving (for the nerds
in the crowd: RPC). We laid the brickwork in 2018, but this year I’m
confident we’ll start to see the first massive benefits within the company from
this foundational investment. Third, we process about 1 billion jobs in the
background at Shopify per day. This infrastructure hasn’t gotten a lot of love
over the past five years, so the third team is built around improving this
machinery. They not only did that but also started experimenting with
automatically scaling workloads based on how busy the platform is. What I’m most
proud of is the increasing autonomy of these teams. Their independence frees up
time in 2019 to focus on the next project and the next squad. If you’re
interested in any of this, you should shoot me an email.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[How I Read]]></title>
        <id>https://sirupsen.com/read</id>
        <link href="https://sirupsen.com/read"/>
        <updated>2018-07-15T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Until a few years ago, I didn’t spend much time reading. Today, I spend a few
hours every week reading, amounting to somewhere between 30 and 50 books a year.
My reading habit has evolved significantly over the past couple of years and
surely will continue to. In this post, I will describe how I approach my
reading. You may think it’s elaborate (other people’s reading systems rub me the
same way), however, keep in mind it’s evolved slowly over the years.

A complex system t]]></summary>
        <content type="html"><![CDATA[<p>Until a few years ago, I didn’t spend much time reading. Today, I spend a few
hours every week reading, amounting to somewhere between 30 and 50 books a year.
My reading habit has evolved significantly over the past couple of years and
surely will continue to. In this post, I will describe how I approach my
reading. You may think it’s elaborate (other people’s reading systems rub me the
same way), however, keep in mind it’s evolved slowly over the years.</p>
<blockquote>
<p>A complex system that works is invariably found to have evolved from a simple
system that worked. A complex system designed from scratch never works and
cannot be patched up to make it work. You have to start over with a working
simple system. – John Gall</p>
</blockquote>
<p>It’s also worth noting that this is not an aspirational post. This is what I
<em>actually</em> do, and have done for a while—otherwise, I wouldn’t think it would be
worth sharing.  I often think of the classic Charlie Munger quote on reading,
he’s not wrong:</p>
<blockquote>
<p>In my whole life, I have known no wise people (over a broad subject matter
area) who didn’t read all the time — none, zero. You’d be amazed at how much
Warren reads—and at how much I read. My children laugh at me. They think I’m
a book with a couple of legs sticking out. – Charlie Munger</p>
</blockquote>
<p>The post is divided into a section for each part of the reading process: (1:
Sourcing), (2: Choosing), (3: Reading), and (4: Processing).</p>
<h2 id="sourcing">Sourcing</h2>
<p>Whenever I stumble upon a recommendation for a book, I will follow the link to
Amazon and send the page to Instapaper. I have a script that automatically
converts any Instapaper book links into rows in an Airtable. Endorsements from
trusted sources will be added, too. <a href="https://gist.github.com/sirupsen/39bd17cbcd713936ccee91d6f5e1b761">This
script</a> will
automatically add metadata about the book from Goodreads such as genre, year
published, author, and so on.</p>
<figure><img src="/images/uuO77yFj0Pfsb6YO0fUdN6ZO.png" alt="Airtable of images" width="2000" height="362" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<h3 id="what-would-i-like-to-improve-about-choosing">What would I like to improve about sourcing?</h3>
<p>Whenever I send the book to Instapaper, I’d like to attach a name to it and
automatically add them as an endorser. There are also certain people whose book
recommendations I seek. Automatically adding their endorsed books to my feed
would be valuable. If I start going deep on a topic, I may want to read a
follow-up book on the topic and will go through my sourcing list first.
Attaching a summary or similar would help make the searches more fruitful. In
general, I would like someone else to solve this problem for me. Improving it
further to aid in choosing would be a non-trivial amount of engineering. See
the next section, (2: Choosing), for a much more elaborate answer to how I’d
like to improve the sourcing and choosing process altogether.</p>
<h3 id="what-are-changes-ive-made-in-sourcing">What are changes I’ve made in sourcing?</h3>
<p>I used to have a habit of buying the books I wanted to read instead of simply
sourcing them. That’s an expensive sourcing method. Inevitably, it grew into a
large number of unread books on my Kindle, which made me often dread opening it.
It felt like an ever-growing to-do list (where each item takes many hours to
complete). This is popularized as an
<a href="https://fs.blog/2013/06/the-antilibrary/">anti-library</a>—I don’t think this
translates to the Kindle world well, but may work for the physical realm for
books you <em>know</em> you want to read. Most importantly, it means that finishing a
book always becomes a new adventure in choosing a new book without considering
the sunk cost of already having bought another book which may mean I read less
relevant books. Generally, I subscribe to not counting money spent on books.
It’s $10, and it could very easily change your life. That’s a bargain. I will
acknowledge this is a privileged argument, but libraries make good allies if
buying is too expensive. Old books (which have stood the test of time, see next
section) often cost pennies on Amazon.</p>
<h2 id="choosing">Choosing</h2>
<p>For choosing books, I have a couple of heuristics I apply as I scurry through my
sourcing list, Google, Goodreads, and other trusted sources:</p>
<p><strong>1. What book is most applicable right now?</strong> If I can find a book that I can
start applying <em>right now</em> in whatever I’m dealing with, it’ll take precedence
over any other heuristic. If I’m about to recruit, reading books about building
teams and recruiting would be highly applicable. With an immediate opportunity
to put it into practice, it is much easier to have things stick and make an
impact. This is the most important heuristic, however, it is often challenging
to find such a book. Especially with the relatively poor sourcing tools I feel
that I have available.</p>
<p><strong>2. Syntopical reading.</strong> If I’ve been diving into a single topic, I may try to
pick up a few more works in the same category to make sure I see the problem
from different angles. I find that this helps strengthen the concepts too, as I
get to run an internal mock dialog between the authors of the books where they
agree or disagree. If it’s on a topic that strongly satisfies (1), I am more
likely than not to do syntopical reading. On the other hand, if I am mostly
looking for an overview of a topic—I may save syntopical reading for the future.</p>
<p><strong>3. Books that have aged well.</strong> If the book has been out 10 years, it’s
<a href="https://en.wikipedia.org/wiki/Lindy_effect">likely it might still be relevant another 10 years from now</a>.
If it’s been out for 100 years, it’s likely it’ll be around for another 100
years. If I am diving into recruiting due to heuristic (1), I’ll look for the
book that’s 10 years old, not the one that was published this spring. In
fast-moving fields, newer can be better, in which case I may start with new, and
then read the old. This applies to e.g. software, where I’ll likely default to
what was published recently but often go back and understand how we ended up
here by reading older material. In most sciences, old is good. I found Darwin’s
original work <strong>surprisingly</strong> readable.</p>
<p><strong>4. What discipline or topic am I weak in?</strong> I believe that at some point,
optimizing for breadth in your reading to complement your depth becomes more
impactful than going even deeper. As Munger puts it, accumulate the big ideas
from the big disciplines. There are so many disciplines where people learn to
think in different ways to solve different problems. Over time, I’d like to get
a rudimentary understanding of most of the major disciplines: law, biology,
economics, history, physics, and the list goes on. This will take a lifetime,
but I think the process will be both enjoyable and useful. I attempt to balance
disciplines, but this easily gets thrown off by other heuristics. There’s a fine
balance with (1). Breadth, (4), is most useful with depth, (1).</p>
<p><strong>5. Modern translations or interpretations are not inherently bad, especially
as introductions to a topic.</strong> Old is good, but can be taken too far. I enjoyed
reading <a href="https://www.amazon.ca/Guide-Good-Life-Ancient-Stoic/dp/0195374614">A Guide to the Good
Life</a> from
2008 as an introduction to stoicism much more than <a href="https://www.goodreads.com/book/show/97411.Letters_from_a_Stoic">Letters from a
Stoic</a> from BC
something something. Here, the concepts applied (Stoicism) have stood the test
of time—but it may be easier to apply if written by someone in the 21st century.
If you’re really into it, by all means, go to the primary source (I did).
Similarly, wanting to take advantage of knowing Danish, I started reading
Kierkegaard a few years ago. I preferred the English translation because you
won’t get chastised for modernizing a translation the same way you would for
modernizing the origin language. If you’re really into a topic, it’s silly to
not go to the primary source a book or two into the topic, though. If you’re
into stoicism (as pointed out here), go to the original works.  They’re very
readable, otherwise, the ideas would not have aged as well as they did.</p>
<p><strong>6. What are my friends reading?</strong> If my friends have read a book, that’s a
free opportunity to talk with them about it or ask them whether it fits my
criteria. It’s a free book club opportunity, helping to nudge the concepts into
long-term memory and get perspective. I don’t want to have 100% overlap with my
friends, but once in a while, if the stars align—I like this opportunity. In
general, I abuse friends’ reading more to assist with (1: Applicability), as
these can be difficult books to find.</p>
<p><strong>7. Audiobooks for narrative, Kindle for anything else.</strong> While less of a
heuristic for choosing the next book, this is still something that I find
useful. If a book has a narrative, such as history, biographies, or novels—then
it falls in the Audiobook bucket for me. I may experiment with re-reads as audio
at some point. For anything else, I’ll read it on my Kindle. Some narratives are
too technical for audiobooks for me; for example, I started listening to a book
about the fall of Enron and found it too difficult to follow through audio due
to the large amount of industry and finance jargon.</p>
<p><strong>8. Skim the free sample of your top <code>x</code> books.</strong> I learned from <a href="https://danieldoyon.co/">Dan Doyon</a>
that Amazon will send you free samples of books. His Kindle is laden with
Kindle samples, and he’ll choose his next book by skimming through 10s of these
to hit one he finds most interesting at that moment. I’ve started adopting
skimming the top samples that come out of the other heuristics. I find this a
useful supporting heuristic for e.g. (1: Applicability) and (4: Breadth). It’s
easy to choose a book, especially on a new topic, where the <strong>idea</strong> of knowing
about it (e.g. basic accounting) sounds intriguing, but you may just not be in
the right place and time for it to be interesting enough to follow through.</p>
<figure><img src="/images/2RrzgDqDXqbXmmsszgDof3UB.png" alt="Pasted image" width="200" height="342" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<h3 id="what-would-i-like-to-improve-about-choosing-1">What would I like to improve about choosing?</h3>
<p>What bothers me most about my choosing and sourcing is that it’s at the wrong
abstraction level. I should be choosing <strong>topics</strong> and <strong>skills</strong> and sorting
those by the applicability heuristic, rather than books. While books are useful,
the ultimate goal here is not to read books—but to learn. There are other ways
to learn than books: courses, classes, conversations, exercises, travel, coding
ideas, crafts, and so on. “Reading” as a way to acquire knowledge is useful, and
I see the majority of my time being spent here for personal development—however,
I would like to not choose the next <strong>book</strong> but the next <strong>topic</strong>. Not: “This
book about photography” but rather “The topic of photography” with the
supporting sourcing and choosing tooling that’ll allow me to then dig into
books.</p>
<p>The tooling I have now does not support my (1: Applicability) and (4: Breadth)
heuristics well. Self-assessing which skills I’m weak in assumes I have no blind
spots, which would be incredibly naïve to believe. (6: Friends) and what they
read help shed some light on those blind spots, but are largely disconnected
from what might be useful for me. I am not sure exactly what I want, but I feel
that I should move towards a list of topics I would like to get into and sort
them by attributes such as current knowledge about the topic, upper-bound return on
investment, lower-bound return on investment, applicability, enjoyment, and
perhaps a couple others. This would allow me to go much wider, from playing
chess (which I likely don’t have a single book in my sourcing list about) to a
rudimentary understanding of a new language (no Spanish grammar books in my
sourcing list, I am afraid), because it would gain me the ability to visualize
my opportunity cost more clearly and put myself another level away from the
currently fairly subjective choice of next book. I certainly wouldn’t deny
that there can be a serendipitous, highly positive benefit to at times choosing
semi-random, recommended books in a broad topic such as management. I feel
that’s what I end up doing most of the time, and I crave more.</p>
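<p>The topic list sketched above could be prototyped in a few lines. The attribute names, the 1–7 scales, and the weights below are my own assumptions for illustration, not a system described in the post:</p>

```python
# Hypothetical sketch of choosing the next *topic* rather than the next book,
# sorted by the attributes mentioned above. Scales (1-7) and weights are
# assumptions, not a validated system.
topics = [
    # (topic, current_knowledge, upper_roi, lower_roi, applicability, enjoyment)
    ("basic accounting", 2, 6, 3, 5, 3),
    ("chess", 3, 3, 1, 2, 6),
    ("spanish", 1, 5, 2, 4, 5),
]

def score(topic):
    _, knowledge, upper_roi, lower_roi, applicability, enjoyment = topic
    # Favor topics with high return bounds and applicability; subtract current
    # knowledge so under-explored topics rank higher; enjoyment sustains habit.
    return upper_roi + lower_roi + 2 * applicability + enjoyment - knowledge

ranked = sorted(topics, key=score, reverse=True)
```

<p>Even a crude score like this makes the opportunity cost between topics visible, which is the whole point of moving the choice up an abstraction level.</p>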
<p>I crave too much structure, but I feel that significant investment into this
aspect would pay serious dividends. It’s likely that I will experiment with an
Airtable for this over the coming years and make changes to this article. Most
of all I hope someone else will build this, but most likely it’s far too
systematic. It is also possible that chaos wins here, but I refuse to
believe I cannot get a system that outperforms chaos by at least 10-20%—which
would be a <strong>major</strong> win over a lifetime.</p>
<h3 id="what-are-changes-ive-made-in-choosing">What are changes I’ve made in choosing?</h3>
<p>This used to be “go down the list on Audible” or “go down the list on the
Kindle” of books already purchased. However, “just in time” choosing has been
much more effective to satisfy the most important heuristic (1): What book can
have the biggest impact for me right now? In general, I would advise looking at
your choosing akin to an efficient factory. You shouldn’t have massive piles of
inventory in front of every machine, but rather optimize the overall throughput
through the factory.</p>
<h2 id="reading">Reading</h2>
<p>Typically, I have about 3 books on the go: An Audiobook, a fiction book on the
Kindle, and a non-fiction work on the Kindle. When reading, I attempt to focus
on a couple of things, most of them aimed at improving retention.</p>
<p><strong>1. Highlights.</strong> I will highlight the interesting parts of a book. Often, I
take notes too, as I have too many times returned to a highlight and had a hard
time figuring out why I found it important at the time of reading. Typing on the
Kindle is painful to begin with, but you get the hang of it eventually. I use
<a href="https://readwise.io">Readwise</a> for working with my highlights (more on
this in the processing stage), and use
<a href="https://blog.readwise.io/tag-your-highlights-while-you-read/">tags</a>, special
tags <a href="https://blog.readwise.io/combine-highlights-on-the-fly/">to combine highlights on the fly</a>, and <a href="https://blog.readwise.io/add-chapters-to-highlights/">their header tags</a> to add sections for
a table of contents. I also highlight words I don’t know (or don’t use), to
later <a href="http://sirupsen.com/airtable/">process them into my vocabulary</a>.</p>
<figure><img src="/images/patty-quote.png" alt="Pasted image" width="1646" height="820" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>2. Skimming and skipping.</strong> I make fairly liberal use of skimming and
skipping, especially in non-fiction where not every chapter will have an
equivalent impact for me. Skimming the first and last few pages of a chapter
often gives you a great idea about whether the chapter is worth reading for you.
For example, years ago I went to Brazil, and before going I wanted to read a
short book about the history and culture of the country. There were 3 chapters
about sports in Brazil, something I wasn’t interested in. I got the gist of it
from the first and last few pages and simply skipped. When I read Principles, I
skipped the biography and went straight to the principles, deciding I’d read the
biography chapter later if the principles were interesting enough. It felt oddly
liberating when I realized there’s no book police that’ll come knocking on your
door when you skip a chapter.</p>
<p><strong>3. Visualizing.</strong>  Ever since reading <a href="https://www.goodreads.com/review/show/2221032060">Moonwalking with Einstein</a> I’ve incorporated
memory palaces into more aspects of my life. I’ve experimented with summarizing
a book as I go in a memory palace, and this worked out quite well. It
meant that it was easier for me to remember the book overall. Memory palaces
aren’t just about being able to memorize a list, but also a concrete way to
connect key points into your wetware. What I found surprising was that when
something would remind me of the points from a book I’ve built a palace for,
I’m thrown right into the memory edifice to connect it. While in the palace, I
find that I will often spend time going backwards and forwards and re-iterate
the other concepts—a form of spaced repetition. There’s still more to explore
here, but there’s certainly something to it. Think of it like when you read a
novel, you’re always visualizing what’s going on. The more effort you put into
this, the easier the novel is to remember. The longer you keep the effort up, the
easier it gets to create more and more elaborate images over time. I haven’t
been as diligent with this practice for the past few books, but I plan to
continue to experiment with it.</p>
<p><strong>4. Metaphors and relations.</strong> This relates back to visualization; anything you
can do to make a book more vivid helps. If you can relate concepts from the book
to something else, it does wonders. A while ago, it felt overdue to gain a
technical understanding of how simple Blockchains work. A friend asked me to
explain it to him, and we constantly related each concept back to concepts and
metaphors we already understood. In about an hour he gained a deep enough
understanding that he could go explain it to someone else, in quite elaborate
technical detail. I attribute that to relating everything to a real-life
metaphor, e.g. ‘hashing’ in cryptography was conceptualized as akin to a fire
turning into ash; impossible to reverse, and the slightest adjustment in initial
conditions would make the configuration of ashes different. One of the most
important relations I find is to attempt to see if the concept would’ve made a
scenario in your life play out differently, had you known it. I like to think of
each past event having <code>n</code> lessons you can extract out of it. It’s important to
not leave <strong>any</strong> lessons on the table, and to suck these experiences dry—you
need to revisit them for decades to come. It’s a bit like a machine learning
algorithm (it’s actually exactly like a machine learning algorithm, which, of
course, is inspired by humans). You’re constantly adding to the algorithm with
new mental models and an enriched understanding of the world. When you’ve
changed the algorithm, you need to re-train it on your dataset consisting of
your collected experience.</p>
<figure><img src="/images/fire.jpg" alt="" width="900" height="814" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>5. Summarize every chapter in your head.</strong> I don’t remember where I read or
heard this, but someone said that one of the best pieces of advice they’d ever
gotten was that every time they’d leave a room, they should stop at the door and
summarize to themselves what just happened. What did you just learn? What just
happened in that meeting? What was on that person’s mind? When I finish a
chapter in a book, I try to quickly summarize it in my head. If I’m building a
palace for the book, I’ll attempt to make up an image and plant it. This is
often surprisingly hard, but I’ve noticed improvements as a result. It’s like
the end of a (good) meeting, where someone will summarize all the actions and
outcomes. Ever been to one where that doesn’t happen? It can feel like a waste
of time.</p>
<p><strong>6. Re-read.</strong> The best books I will try to read again. I’ve done it so far for
perhaps half a dozen books and it’s been rewarding every time. In general, I
think we can treat the best books and articles more like music playlists.
Reading them again and again, with sufficient spacing in between to make them
relevant and fresh anew. For articles, I have a script that’ll feed them back to
me on a spaced repetition schedule automatically in Instapaper. I wrote more
about this <a href="https://sirupsen.com/playlists/">here</a>.</p>
<h3 id="what-would-i-like-to-improve-about-reading">What would I like to improve about reading?</h3>
<p>My retention here is still not quite as good as I would like, although I think a
fair bit of that comes from the processing (next section). I would like to more
diligently build palaces. I haven’t done it for the past 5-10 books I’ve read,
but the ones I did build I’ve found myself going back to more often than not. I don’t
take as many notes on my highlights as I’d like to. I think more focus on these
two will make the biggest difference currently, because they’ll both benefit the
processing stage.</p>
<p>I dream of the day where I can see the highlights of friends. This would be a
fantastic opportunity to start interesting conversations with people and build a
deeper understanding of the book while feeling much less forced than a book
club.</p>
<h3 id="what-are-changes-ive-made-to-reading">What are changes I’ve made to reading?</h3>
<p>My reading process has been fairly additive. I’ve mostly added more and more
structure to the way I read; any extra effort I put in here to twist and turn
the points made ends up being better than <em>not</em> doing it. The fear here is doing
<em>too</em> much. As mentioned in the processing stage, to simplify, I will need to
figure out what works and what doesn’t.</p>
<h2 id="processing">Processing</h2>
<p>Reading, to me, is worth the most if I can remember the ideas. I don’t think you
will always be able to map an idea back to its source; just because you
can’t summarize Thinking, Fast and Slow eloquently doesn’t mean it didn’t
influence you.</p>
<blockquote>
<p>Reading and experience train your model of the world. And even if you forget the experience or what you read, its effect on your model of the world persists. Your mind is like a compiled program you’ve lost the source of. It works, but you don’t know why. – <a href="http://www.paulgraham.com/know.html">Paul Graham</a></p>
</blockquote>
<p>It’s a cliché to complain about the length of books: “This idea could be
explained in five pages! Why would they write an entire book?” This statement
bothers me to no end. If you possess the discipline it takes to reliably
incorporate an idea into your wetware from something article-length, then
you’ve got discipline that you would <strong>not</strong> self-discount with a blanket statement like
that. No-one I’ve talked to who reads 10s of books a year, and has done so for
years, would dream of saying this. They understand that reading is not just
about passing words through your head.</p>
<p>Then why are books long? I’ll gently navigate around the “publishers require it
to be 200+ pages” conspiracy, and instead focus on two points. First, it’s a
form of spaced repetition, a wonderful, proven technique that can be applied
to almost every corner of your life. It turns out, if a book is 200 pages, it’s
going to take you a few spaced repetition cycles to read it, which raises the
probability it’ll stick for you. Unless you are diligent about repetition, my
pet theory is that most things that stick are somewhat random. You hear
something today, and then in the next spaced repetition window a few days from
now; you hear about it again. Then a week or so after that. If you consider how
many new things we hear every day, I don’t think this is so crazy. Especially
given how hyper-aware our brain is for these things, it <em>wants</em> to recognize
them. I’ve noticed this is how most new English words transition from a
spreadsheet to my real, active vocabulary. There’s a hint of random in there.</p>
<p>The second reason books are long, is that different ways of explaining an idea
resonate with different people. For you, it may be that antifragility is best
explained through a fitness analogy; you break down muscle, build them back up,
ta-daa you are now stronger. For the foodie who makes an annual pilgrimage to
New York, antifragility may draw the most connections (and thus stick best) when
applied to why the ramen seems better <em>every</em> time you go back. Remembering an
idea is some combination of the number of connections you can draw and spaced
repetition. Anecdotally, I’ve observed that I remember new information in the
space of software well. I can usually connect it to half a dozen things fairly
quickly, which makes it hard to forget. If you tell me something I don’t know
about the state of Crude Oil, I have little to connect it with and most likely I
will not remember it tomorrow unless I put in more effort: spaced repetition, or
asking enough questions that half a dozen connections start appearing. But
that’s work.</p>
<p>Turns out forming new memories <em>needs</em> to be hard. Otherwise, how is your
brain to know what to remember and what not to? Imagine if every time you looked
at a dining table, every single memory <em>ever</em> that had to do with a table was
readily available. That’d be pretty uncomfortable. (The eyes with the cupcake on
top below are my poor imitation of the exploding head emoji: 🤯)</p>
<figure><img src="/images/o_KP7oCyzXw2ASahKC6kUUZI.png" alt="Scan Jul 11, 2018, 07.41.jpg" width="1410" height="1000" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Here are some of the steps I take after reading a book, which I’ve been doing for a while.</p>
<p><strong>1. Writing a review/summary.</strong> A few weeks after reading a book, typically
I’ll write a short summary and review and publish it on Goodreads
(<a href="https://www.goodreads.com/review/show/2417256899?book_show_action=false">example</a>).
This forces me to extract the key lessons from the book. Typically, I’ll use my
highlights from <a href="http://readwise.io/">Readwise.io</a> to assist in extracting the
key lessons from the book and throw them into the summary. You can see all my
reviews on my <a href="https://www.goodreads.com/user/show/38623347-simon-eskildsen">Goodreads profile</a>.</p>
<figure><img src="/images/-8pAR7XHj8Cu_QSeiARS_IWC.png" alt="Pasted image" width="1408" height="1000" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>2. Converting highlights to index cards.</strong> Either at the same time as doing
the review/summary or later, I will go through my highlights and find the ones
I like most. Often, I end up spending hours (typically on a Saturday or Sunday
morning) going into rabbit holes as part of polishing my highlights. This is
fine, if they’re interesting, it helps me to build connections and stick them in
long-term memory. For the best points in the books, often a combination of
highlights and themes, I’ll create a <strong>physical</strong> index card. I try as much as
possible to draw on the card and think of references to other books.</p>
<figure><img src="/images/index-card.png" alt="" width="1648" height="1000" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>3. Reviewing index cards.</strong> I have two containers for my index cards. One with
index cards that have been processed at least once (left) and one for cards that
have yet to be processed (right). As you can see, the top card in the left box
is the one that was most recently reviewed (2nd of July, 2018) and the card on
top of the right box hasn’t been reviewed yet (only one date). As you see on the
card above, and the card below, there are little symbols under the date. These
symbols have special meanings for what I did with the card at the same time. I
have a dozen or so symbols to experiment with what works best for retention over
time. <code>W</code> below means that I wrote at least 200 words about the content of the
card, attempting to draw new connections and elaborate on the idea. <code>R</code> is
followed by a number and rates how much I’ve applied this idea since last time.
<code>U</code> followed by a number is how useful this idea is, on a scale from 1 to 7.
Long-term, these numbers are meant to inform a better sorting algorithm: if there are two
cards I can review now, I’d prefer the one with a low <code>R</code> value (not applied
yet), a high <code>U</code> value (very useful), and where a long time has passed since last
reviewed. I may digitize this at some point (I’m terrified of losing these
cards), but this has worked well so far. Again, as with (2: Choosing), I think I
can beat randomness and sorting by date by at least 10%, which is a significant
improvement over the long-term. However, I’ll need some data first. Below, you
can see a full list of my symbols. Some are now deprecated, but many I continue
to use.</p>
<figure><img src="/images/d-bxD_1EcycjhEoP_zAV8cf-.png" alt="Pasted image" width="1241" height="1000" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
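<p>The sorting heuristic described above can be sketched in a few lines. The weights and field names below are my own assumptions for illustration; the post only specifies the direction of each factor (low <code>R</code>, high <code>U</code>, long since last review):</p>

```python
from datetime import date

# Hypothetical sketch of the card-sorting heuristic: prefer cards with a low R
# (not yet applied), a high U (very useful), and a long time since the last
# review. The weights (30 days per scale point) are assumptions, not tuned.
def priority(card, today=date(2018, 7, 11)):
    days_since = (today - card["last_reviewed"]).days
    return days_since + 30 * card["U"] - 30 * card["R"]

cards = [
    {"id": "antifragility", "last_reviewed": date(2018, 7, 2), "R": 1, "U": 6},
    {"id": "survivorship bias", "last_reviewed": date(2017, 9, 1), "R": 5, "U": 4},
]

# Review the highest-priority card first.
cards.sort(key=priority, reverse=True)
```

<p>Once the <code>R</code> and <code>U</code> data accumulates, the weights could be fit to whatever actually predicts retention, instead of being guessed.</p>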
<p>When I travel, I usually bring the box of unprocessed cards with me and spend
some time reflecting on those cards. Some call this a “Commonplace Book”; i.e. a
book with all the best snippets from everywhere. Why index cards and not a
notebook? Well, notebooks can only grow so much in size, and are hard to change
without becoming messy. Often, I’ll tear cards apart on a second review,
re-write them for more clarity, and backfill the dates. I can sort them however I
want, which is difficult in a notebook. Airtable would be a fantastic candidate for the
Commonplace book, but the physical aspect currently intrigues me.</p>
<figure><img src="/images/v5NYGhseS7cUet1VPnFwybMU.png" alt="IMG_0960.JPG" width="1333" height="1000" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If you’re after something similar, <a href="https://readwise.io">Readwise</a> has a great
feature to send you some of your highlights every day. Takes minutes to set up
if you’re already using a Kindle.</p>
<p><strong>4. Listening to Podcasts with the author.</strong> After a book, I often find myself
with a slew of questions I wish I could ask the author. That’s exactly why they
get invited to various Podcasts (if they’re alive). With <a href="https://www.listennotes.com/">Podcast search
engines</a> it’s easy to find a Podcast with the
author. The show notes will often reveal what types of questions the interviewer
is going after.</p>
<h3 id="what-would-i-like-to-improve-about-processing">What would I like to improve about processing?</h3>
<p>As mentioned, I may need a new home for these nuggets instead of index cards.
It’s tough to sort them properly, so currently it’s a simple queue based on last
review date. I am about a year behind (i.e. I review cards now that I wrote about a
year ago), because I typically produce cards faster than I can process them. For the
time being, I’m OK with it. I destroy a lot of cards when I review that are not
relevant to me, or I think are covered by something else. I’ve scoured
them quite a few times to try to find something I was sure I had on a card—this
is a frustrating experience. I just don’t have the perfect software for it yet,
and I worry a lot about putting this somewhere and having to convert it around.
To some extent, this has become my most prized possession in that it’s
impossible for me to replace.</p>
<p>Going forward, I’ll likely digitize them to make them searchable. A year or two
from now, I’m going to go through them and review the <code>R</code> and <code>U</code> scores and
correlations with other symbols to find out what works, and what doesn’t. Based
on this, I will create a sorting algorithm for the digitized index cards. Again,
the software in this space is lacking, so it may be a fancy use of Airtable if
nothing better exists by that time.</p>
<h3 id="what-are-changes-ive-made-to-processing">What are changes I’ve made to processing?</h3>
<p>This is the step I’ve invested the most in over the past few years because I feel
this is where the most impact is had. In general, I think that people should
spend 50-60% of their time in this stage over all others. Most spend the
majority of their time in reading. I’ve come to many great realizations writing
about cards and applying them to my life and current situations. My past self
can recognize an idea as useful, recognize that there’s no immediate
application for it, transcribe it to a card, and hope it pops up at a better time.
This setup positions me to increase the probability I get the right idea at the
right time: when it’s most likely to be applied.</p>
<p>Overall, I have not made many changes here other than gradually adding to this
system. I hope in a few years to go through the data on the cards and the
ratings, to figure out which methods work best for retention. Writing? Flash
cards? Memory palaces? Talking to a friend?</p>
<h2 id="future">Future</h2>
<p>I will continue to iterate on this, likely, for the rest of my life. I think
everyone deserves a good reading system. It takes years to build one; you can’t
start out with this, or any other, system—you need to gradually build it over
time. The reading habit comes first; then you start paying more attention
to what you read, you start highlighting, you start taking notes, you start
writing summaries, and slowly a complex system that works for you will evolve
and evolve. I hope this can inspire you to invest more in your reading process.</p>
<p>For book recommendations, see <a href="https://www.goodreads.com/user/show/38623347-simon-eskildsen">my Goodreads profile</a>
especially my <a href="https://www.goodreads.com/review/list/38623347?shelf=reread"><code>reread</code> shelf</a>.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Media Playlists]]></title>
        <id>https://sirupsen.com/playlists</id>
        <link href="https://sirupsen.com/playlists"/>
        <updated>2018-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We have playlists for our favorite music, but don’t re-consume great information
nearly enough. Almost certainly you’ve once watched a documentary (or read a
book) about the environment, after which you
ponder how to reduce your footprint: an electric car, eating less
meat,
or <a href="http://thec]]></summary>
        <content type="html"><![CDATA[<p>We have playlists for our favorite music, but don’t re-consume great information
nearly enough. Almost certainly you’ve once watched a documentary (or read a
book) about <a href="https://www.beforetheflood.com/">the environment</a>, after which you
ponder how to reduce your footprint: an electric car, <a href="https://en.wikipedia.org/wiki/Environmental_impact_of_meat_production">eating less
meat</a>,
or <a href="http://theconversation.com/airline-emissions-and-the-case-for-a-carbon-tax-on-flight-tickets-56598">voluntarily paid carbon
tax</a>
on your air-travel emissions. Then, after a few weeks, the effects mostly fade,
and you gradually return to baseline…</p>
<p>This cycle of a bee entering your bonnet for a short period, only for another
bee to take its place, is ineffective. We pick up gems from conversations,
articles, books, and videos, only to use them for a few days or weeks. Most
things we learn, we forget, unless our environment strongly nudges us to
consider those ideas repeatedly. However, most ideas don’t leap from medium-term
memory into long-term principles. How can we increase our odds of compounding
ideas on top of each other, instead of leap-frogging between new ones?</p>
<p><a href="https://en.wikipedia.org/wiki/Spaced_repetition">Spaced repetition</a> is the
simple idea that the probability of remembering an idea for the long-term
increases dramatically if we’re reminded at an intentional, exponential
schedule. We might discover that the effect where we learn a new
word and start noticing it everywhere is called the ‘frequency illusion.’ To not
forget this, we make sure we’re exposed to this piece of information a few days
from now, then a week after that, two weeks after that, then a month, three
months, and then every six months from there. Spaced repetition is a
well-studied effect, and many (including myself) have had <a href="http://sirupsen.com/airtable">success with this
through flash-cards</a>. We expose ourselves to the
piece of information <em>just</em> before we would forget it, refreshing the memory.</p>
<p>However, the effect doesn’t need to be constrained to fun facts on flashcards.
It can be deep, complex ideas as well. Ideas or ways of thinking that we
incorporate deeper, and deeper into our wetware with each successive
re-consumption of an article, book, or video on some schedule. In the past year,
I’ve been interested in exposing myself to an increasing amount of spaced
repetition outside of flashcards.</p>
<p><a href="http://readwise.io/">Readwise</a> helps me by re-surfacing highlights from my
Kindle and Instapaper.  Quite a few times reading through the daily digest from
Readwise, a highlight came at just the right time to implement it that day or
sparked new connections to form more connected memories. My pet theory is that
the truly useful ideas that make it from books to our life principles are the
ones that strike us at <em>just</em> the right time where we needed that idea. Through
spaced repetition, we increase that probability dramatically.</p>
<p>In general, the more well-connected an idea in your head, the higher the
likelihood that it surfaces at the right time. To me, the definition of a useful
idea is one that’s readily available when you need it. It is hard work, and
takes time, to mold the neural connections to elevate an idea to this status. A
hundred time-tested ideas stored in this fashion are worth a thousand times more
than 10,000 that enter and leave rapidly.</p>
<p>For example, a few months ago, a highlight about <a href="https://en.wikipedia.org/wiki/Survivorship_bias">survivorship
bias</a> came up. This cognitive
bias points out that we don’t adequately value the information <em>not</em> present. We
may be inclined to say that ‘old buildings are more beautiful’ when in fact,
when you think about it, only the beautiful old buildings survive. The ugly ones
are torn down, and new ones will take their place. This idea came up in my
Readwise digest as I was walking to work, at just the right time. It was highly
applicable to a problem we were working through on the team. As a result, I now
see survivorship bias everywhere I look. It feels like that one, deep
application made an order of magnitude more neurons connect than anything I’d
done previously.</p>
<p>While flash cards and Readwise have been helpful, they don’t solve the problem
for me of content that requires more deliberation: a video, an article, or an entire
book. For the first two, a few months ago I built a script that will re-surface
articles or videos saved in Instapaper on a spaced repetition schedule. For
example, I liked <a href="http://www.collaborativefund.com/blog/expectations-vs-forecasts/">this article about Expectations vs
Forecasts</a> in
my Instapaper and archived it. A week later, it came up on top of my to-read
list again. Then a month after that.  I’ll see it again in another few months,
for it to finally only be read every 6 months. This creates a ‘playlist’ of
great articles, with new articles coming up once in a while too. Spending more
time on a few great articles is providing me more value than trying to read
everything. I now mostly skim articles on the first read. If it’s interesting,
I’ll ‘like’ it and go in more depth the second time. I’m finding myself taking
more notes and highlights each time it pops up again. I add videos to Instapaper
too, to recycle the same system.</p>
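<p>The script itself isn’t reproduced in the post, but the core idea can be sketched in a few lines. Everything below (field names, the exact interval values) is an assumption for illustration, derived from the schedule described in the paragraph above:</p>

```python
from datetime import date, timedelta

# Hypothetical sketch of re-surfacing archived articles on a spaced repetition
# schedule: a week after the first read, a month after the second, then every
# six months. Intervals are in days and assumed from the prose, not tuned.
INTERVALS = [7, 30, 180]

def due_articles(archive, today):
    due = []
    for article in archive:
        days = INTERVALS[min(article["reads"] - 1, len(INTERVALS) - 1)]
        # Re-surface once the interval for this read count has elapsed.
        if article["last_read"] + timedelta(days=days) <= today:
            due.append(article["title"])
    return due

archive = [
    {"title": "Expectations vs Forecasts", "last_read": date(2018, 4, 1), "reads": 1},
    {"title": "Expiring vs LT Knowledge", "last_read": date(2018, 5, 30), "reads": 2},
]
```

<p>A real version would fetch the archive folder via the Instapaper API and move due items back to the top of the to-read list; the structure of that loop stays the same.</p>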
<p>While this is good, I hope that the next generation of read-it-later services
will build spaced repetition straight into their core product. I hope they’ll
help with heuristics on when to re-read the old, and when to learn the new. Perhaps treat
the inbox not as a stack, where what I just added comes up on top, but as a queue, where what I
added months or years ago comes next. This helps avoid the cycle of spending the
majority of your time <a href="http://www.collaborativefund.com/blog/expiring-vs-lt-knowledge/">consuming media that expires
rapidly</a>.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Positive Unknown-Unknowns]]></title>
        <id>https://sirupsen.com/unk-unk</id>
        <link href="https://sirupsen.com/unk-unk"/>
        <updated>2018-03-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[When we make decisions, it’s useful to be cognizant of unknown-unknowns. Almost
in every case, we think about unknown-unknowns in a negative sense. If we’re
venturing into unknown territory, we accept that it’s likely we’ll stumble upon
Black
Swans:
improbable events that throw a wrench into our plans. Typically, we’ll draw on
our experience to take the path we figure has the fewest negative
unknown-u]]></summary>
        <content type="html"><![CDATA[<p>When we make decisions, it’s useful to be cognizant of unknown-unknowns. Almost
in every case, we think about unknown-unknowns in a negative sense. If we’re
venturing into unknown territory, we accept that it’s likely we’ll stumble upon
<a href="https://www.amazon.ca/Black-Swan-Improbable-Robustness-Fragility/dp/081297381X">Black
Swans</a>:
improbable events that throw a wrench into our plans. Typically, we’ll draw on
our experience to take the path we figure has the fewest negative
unknown-unknowns. We may choose to stretch something we already know instead of
adopting something new. Brooding on negative unknown-unknowns is extremely
useful, and fairly commonplace.</p>
<p>I think it’s equally useful to invert the traditional thinking about
unknown-unknowns and ask ourselves: How many <em>positive</em> unknown-unknowns might
we face with this option? Might we face more positive black swans than
negative ones? In effect, what would give us the most positive optionality?</p>
<p>When making decisions, we weigh most strongly the first-order effects. We’re not
taught to <a href="https://www.fs.blog/2016/04/second-level-thinking/">systematically think through the second- and third-order
effects</a>. As we get further
away from first-order effects, our ability to predict effects decreases
exponentially. There’s a higher chance that we’ve missed second-order effects
than first-order effects. These missed effects are what we call
unknown-unknowns. There are too many variables to keep track of and the
interactions between them, while governed by simple rules, become unmanageable
to the human brain. You can attempt to combat this with expertise, but you must
face that you won’t catch them all.</p>
<figure><img src="/images/unk-unk.png" alt="" width="2100" height="1275" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>An example might help. Consider the Internet, which had a fairly niche purpose
at first. Yet, it seemed to many that connecting the planet would be a good
idea. There’s no way that those connecting the globe could’ve anticipated the
number of positive unknown-unknown ramifications of the Internet. What they did
project, however, was that the space of unknown-unknown positives for the
Internet was enormous.</p>
<p>Similarly, if we look at cryptocurrencies today, people are smitten with the
potential for the positive unknown-unknowns (and others by greed). What the Internet,
cryptocurrencies, and the printing press have in common is that they’re
foundational platforms with an enormous surface area for positive
unknown-unknowns.</p>
<p>I’ve seen positive unknown-unknowns numerous times when people build platforms.
Someone builds something great and simultaneously takes the time to solve the
problem one layer deeper than they otherwise might have. They sense the
potential in increasing the probability of positive unknown-unknowns, by
supplying the vision of a platform. Internally, two years ago we had a
<a href="http://sirupsen.com/podcast">single employees-only podcast</a>. Today, we have
around ten, ranging from training and interviews about building internal
products to history lessons about the company from our executives. When
it was clear that there was an internal podcast <em>platform</em>, it exploded. The
first podcast went one level deeper to provide a platform, increasing the
surface area for positive unknown-unknowns.</p>
<p>We will have to remain humble to the fact that often we can’t predict all
effects, positive and negative. We can attempt to reason about their size, but
we won’t know for sure. There’s an old Taoist fable that we can interpret as a
story about unknown-unknown second- and third-order effects:</p>
<blockquote>
<p>“When an old farmer’s stallion wins a prize at a country show, his neighbour
calls round to congratulate him, but the old farmer says, “Who knows what is
good and what is bad?”</p>
<p>The next day some thieves come and steal his valuable animal. His neighbour
comes to commiserate with him, but the old man replies, “Who knows what is
good and what is bad?”</p>
<p>A few days later the spirited stallion escapes from the thieves and joins a
herd of wild mares, leading them back to the farm. The neighbour calls to
share the farmer’s joy, but the farmer says, “Who knows what is good and what
is bad?”</p>
<p>The following day, while trying to break in one of the mares, the farmer’s son
is thrown and fractures his leg. The neighbour calls to share the farmer’s
sorrow, but the old man’s attitude remains the same as before.</p>
<p>The following week the army passes by, forcibly conscripting soldiers for the
war, but they do not take the farmer’s son because he cannot walk. The
neighbour thinks to himself, “Who knows what is good and what is bad?” and
realises that the old farmer must be a Taoist sage.”</p>
</blockquote>
<p>It is tempting to believe at any of the critical points in this story that you
know what will happen next with certainty. With the most prized stallion in the
land, riches await! Or, when stolen, that you’ll never see it again. While the
series of events in this story seem <em>highly</em> unlikely, it teaches us that
effects will happen that we could never have imagined. The sum of the
probabilities of unknown-unknowns may outweigh the knowns.</p>
<p>You may be looking at two options for a decision that seem equally good. Have
you considered which one has larger optionality long-term? Third-order effects
that you could by no means predict? With a small modification, could you
increase the surface area for unknown-unknown positives? Can you expose even a
fraction of a platform?</p>
<p>Considering positive unknown-unknowns has changed my mind quite a few times in
the past year. Contemplating optionality is <em>not</em> about making decisions based
on hope. It is one of many mental models in your arsenal to improve your
decisions. Each model gives you a new vantage point to see the problem from to
help you come to a better decision.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Peak Complexity]]></title>
        <id>https://sirupsen.com/peak-complexity</id>
        <link href="https://sirupsen.com/peak-complexity"/>
        <updated>2018-02-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[With the teams I work with, we operate with the idea of peak complexity: the
time at which a project reaches its highest complexity. Peak complexity has
proved a useful mental model to us for reasoning about complexity. It helps
inform decisions about when to step back and refactor, how many people should be
working on the project at a given point in time, and how we should structure the
project.
What we find is that to make something simpler, we typically have to raise the
co]]></summary>
        <content type="html"><![CDATA[<p>With the teams I work with, we operate with the idea of <em>peak complexity</em>: the
time at which a project reaches its highest complexity. Peak complexity has
proved a useful mental model to us for reasoning about complexity. It helps
inform decisions about when to step back and refactor, how many people should be
working on the project at a given point in time, and how we should structure the
project.</p>
<p>What we find is that to make something simpler, we typically have to raise the
complexity momentarily. If you want to organize a messy closet, you take out
everything and arrange it on the floor. When all your winter coats, toques, and
spare umbrellas are laid out beneath you, you’re at peak complexity. The state
of your house is <em>worse</em> than it was before you started. We accept this step as
necessary to organize. Only when it’s all laid out can you decide what goes back
in and what doesn’t, ultimately lowering the complexity from the initial point.</p>
<figure><img src="/images/peak-complexity.png" alt="" width="530" height="364" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>When you’re cleaning your house, you do this one messy place at a time: the
bedroom closet, then the attic, and lastly, the dreaded basement. Doing it all
at once would be utter mayhem: costumes, stamp collections, coats, and Lego sets
everywhere. We manage our series of peak-complexity points by tackling one messy
floor-patch at a time.</p>
<p>This model works for software, too. As we embark on a complex project, we need
to consider the pending complexity peak(s). It’s completely okay to add
complexity along the journey; sometimes you need to momentarily trade technical
debt for speed. But it’s also part of the job to manage your complexity budget.
Be honest with your team about where you reside on the curve. The more
complexity you add, the harder it is to onboard new members to the team.
Typically, your bus factor shrinks, because only a few people can hold this
complexity in their head at a time. With high complexity, the probability of
error increases non-linearly. It’s prudent to review your project’s inflection
points and structure it to have many small peaks. This avoids creating a
Complexity Everest. A big mountain is tough to climb. It gets exponentially
harder the closer you get to the top as oxygen levels decrease, wind increases,
temperature drops, and willpower depletes. That’s why you want to structure your
project into hills that deliver value every step of the way: day-time hikes with
picnic baskets. Sometimes, the inevitable mountain appears—and that’s okay, but
be realistic about what it means to the project.</p>
<figure><img src="/images/peak-complexity-smaller.png" alt="" width="530" height="364" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>The worst thing you can do is build a complexity mountain and not harvest the
simplicity gains on the other side. The descent may require a smaller team and
take less time than it took to climb, but is incredibly important work. As I’ve
written about before, the more you can <a href="/drafts">simplify the mental model of the
software</a>, the more leverage you build. If you fail to recognize peak
complexity and descend, you may be stranded there. This is how you end up
supporting your project forever. It’s also worth noting that peak complexity
isn’t the only peak a project has; there are other resources you can trade for
speed in the short term:</p>
<ul>
<li><strong>Peak Toil.</strong> You trade manual operations/lack of automation for getting
the first iteration of the project shipped sooner. Just as with peak
complexity, it’ll catch up to you.</li>
<li><strong>Peak Money/Cost.</strong> Money is another resource you can often trade for speed, e.g.
by leaving optimization to after the initial version has shipped.</li>
<li><strong>Peak People.</strong> This is the point in time where your project has the most
staff assigned to it; as the project moves into later phases of its
life-cycle, it’ll most likely have fewer people assigned to it, since other
projects need them once the initial version is out. On some projects, again,
you can trade people for speed. An opportunity cost comes with that, of
course.</li>
<li><strong>Peak Stress/Work.</strong> People can sprint to reach some short-term target, but
if you don’t allow them to rest, your people will lose trust in you, get
tired, and will shorten their timescale for decisions.</li>
<li><strong>Peak Sluggishness.</strong> For many projects, you can solve performance later to
get the first iteration out quicker, too. It may be that it’s not worth
solving some algorithmic or data storage problem until you’ve proved that
it’s something customers want.</li>
</ul>
<p>As a lead or project manager, I think it’s your responsibility to be aware of
these peaks when trading the amplitude of a peak for speed on the project. If
you push too many peaks too high, your project will go through a tough
period and may fail for reasons unrelated to the problem it set out to solve.</p>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
</feed>