<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://sirupsen.com</id>
    <title>Simon Eskildsen</title>
    <updated>2026-03-12T00:00:00.000Z</updated>
    <generator>Feed</generator>
    <author>
        <name>Simon Eskildsen</name>
        <email>simon@sirupsen.com</email>
        <uri>https://twitter.com/sirupsen</uri>
    </author>
    <link rel="alternate" href="https://sirupsen.com"/>
    <link rel="self" href="https://sirupsen.com/atom.xml"/>
    <subtitle>Recent content from Simon Eskildsen</subtitle>
    <logo>https://sirupsen.com/favicon.png</logo>
    <icon>https://sirupsen.com/favicon.png</icon>
    <rights>Copyright Simon Eskildsen</rights>
    <entry>
        <title type="html"><![CDATA[Podcast with Geek Narrator on Object Storage Databases]]></title>
        <id>https://sirupsen.com/geeknarrator-object-storage-podcast</id>
        <link href="https://sirupsen.com/geeknarrator-object-storage-podcast"/>
        <updated>2024-11-16T00:00:00.000Z</updated>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[turbopuffer: fast search on object storage]]></title>
        <id>https://sirupsen.com/turbopuffer</id>
        <link href="https://sirupsen.com/turbopuffer"/>
        <updated>2024-07-08T00:00:00.000Z</updated>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 21: Index Merges vs Composite Indexes in Postgres and MySQL]]></title>
        <id>https://sirupsen.com/index-merges</id>
        <link href="https://sirupsen.com/index-merges"/>
        <updated>2022-11-26T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[While working with Readwise on optimizing their database for the impending
launch of their Reader product, I found myself
asking the question: How much faster is a composite index compared to letting
the database do an index merge of multiple indexes? Consider this query:
SELECT count(*)…]]></summary>
        <content type="html"><![CDATA[<p>While working with Readwise on optimizing their database for the impending
launch of their <a href="https://readwise.io/read">Reader product</a>, I found myself
asking the question: How much faster is a composite index compared to letting
the database do an index merge of multiple indexes? Consider this query:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token function">count</span><span class="token punctuation">(</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token comment">/* matches ~100 rows out of 10M */</span>
<span class="token keyword">FROM</span> <span class="token keyword">table</span>
<span class="token keyword">WHERE</span> int1000 <span class="token operator">=</span> <span class="token number">1</span> <span class="token operator">AND</span> int100 <span class="token operator">=</span> <span class="token number">1</span>
<span class="token comment">/* int100 rows are 0..99 and int1000 0..999 */</span>
</code></pre>
<details><summary><a>View Table Definition</a></summary><pre class="language-sql"><code class="language-sql"><span class="token keyword">create</span> <span class="token keyword">table</span> test_table <span class="token punctuation">(</span>
  id <span class="token keyword">bigint</span> <span class="token keyword">primary</span> <span class="token keyword">key</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span>

  text1 <span class="token keyword">text</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span> <span class="token comment">/* 1 KiB of random data */</span>
  text2 <span class="token keyword">text</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span> <span class="token comment">/* 255 bytes of random data */</span>

  <span class="token comment">/* cardinality columns */</span>
  int1000 <span class="token keyword">bigint</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span> <span class="token comment">/* ranges 0..999, cardinality: 1000 */</span>
  int100 <span class="token keyword">bigint</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span> <span class="token comment">/* 0..99, card: 100 */</span>
  int10 <span class="token keyword">bigint</span> <span class="token operator">not</span> <span class="token boolean">null</span><span class="token punctuation">,</span> <span class="token comment">/* 0..9, card: 10 */</span>
<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token comment">/* no indexes yet, we create those in the sections below */</span>
</code></pre></details>
<p>We can create a composite index on <code>(int1000, int100)</code>, or we could have two
individual indexes on <code>(int1000)</code> and <code>(int100)</code>, relying on the database to
leverage both indexes.</p>
<p>Having a composite index is faster, but <em>how much</em> faster than the two
individual indexes? Let’s do the napkin math, and then test it in PostgreSQL and
MySQL.</p>
<h2 id="napkin-math">Napkin Math</h2>
<p>We’ll start with the napkin math, and then verify it against Postgres and MySQL.</p>
<h3 id="composite-index-1ms">Composite Index: ~1ms</h3>
<p>The ideal index for this <code>count(*)</code> is:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">CREATE</span> <span class="token keyword">INDEX</span> <span class="token keyword">ON</span> <span class="token keyword">table</span> <span class="token punctuation">(</span>int1000<span class="token punctuation">,</span> int100<span class="token punctuation">)</span>
</code></pre>
<p>It allows the entire count to be performed on this one index.</p>
<p><code>WHERE int1000 = 1 AND int100 = 1</code> matches ~100 records of the 10M total for the
table. <sup><a href="#user-content-fn-1" id="user-content-fnref-1" data-footnote-ref="true" aria-describedby="footnote-label">1</a></sup> The database would do a quick search in the index tree to the leaf in the
index where both columns are <code>1</code>, and then scan forward until the condition no
longer holds.</p>
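<p>A toy sketch of that lookup, under the assumption that the composite index behaves like a sorted list of key tuples (purely illustrative, not real database internals):</p>

```python
import bisect
import random

# Toy model (not database internals): a composite index on
# (int1000, int100) behaves like a sorted list of key tuples.
random.seed(42)
rows = [(random.randrange(1000), random.randrange(100)) for _ in range(1_000_000)]
index = sorted(rows)

# Seek to the first leaf entry where both columns equal 1, then scan
# forward until the condition no longer holds -- the whole count(*)
# is answered from the index alone.
start = bisect.bisect_left(index, (1, 1))
count = 0
for key in index[start:]:
    if key != (1, 1):
        break
    count += 1

print(count)  # ~10 expected: 1,000,000 / (1000 * 100)
```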
<figure><img src="/images/composite-index.png" alt="Illustration of a composite index tree with the leaf node storing the (int1000, int100) tuple" title="Illustration of a composite index tree with the leaf node storing the (int1000, int100) tuple" width="2000" height="1317" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Illustration of a composite index tree with the leaf node storing the (int1000, int100) tuple</figcaption></figure>
<p>For these 64-bit index entries we’d expect to have to scan only the ~100 entries
that match, which is a negligible ~2 KiB. According to the <a href="https://github.com/sirupsen/napkin-math">napkin reference</a>, we can read
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∼</mo><mn>1</mn><mtext> MiB</mtext><mi mathvariant="normal">/</mi><mn>100</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">\sim 1\text{ MiB}/100\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em"></span><span class="mrel">∼</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord">1</span><span class="mord text"><span class="mord"> MiB</span></span><span class="mord">/100</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span> from memory, so this will take absolutely no time.
With the query overhead, navigating the index tree, and everything else, it
theoretically shouldn’t take a database more than a couple <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>100</mn><mtext>-</mtext><mn>500</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">100\text{-}500\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">100</span><span class="mord text"><span class="mord">-</span></span><span class="mord">500</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span>
on the composite index to satisfy this query. <sup><a href="#user-content-fn-2" id="user-content-fnref-2" data-footnote-ref="true" aria-describedby="footnote-label">2</a></sup></p>
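<p>The arithmetic behind that estimate, spelled out (the 16-byte entry size and the fixed overhead are rough assumptions; the memory-read throughput is from the napkin-math reference):</p>

```python
# Rough arithmetic behind the ~100-500us composite-index estimate.
# Assumed figures: ~1 MiB per 100us from memory (napkin-math reference);
# entry size and fixed overhead are guesses.
matching_rows = 100
entry_bytes = 16                            # two 64-bit columns; real entries are bigger
bytes_scanned = matching_rows * entry_bytes # ~1.6 KiB
scan_us = bytes_scanned / (1 << 20) * 100   # far below 1us: negligible
overhead_us = 100                           # B-tree descent, parsing, planning, ...
print(f"scan ~{scan_us:.2f}us, total ~{scan_us + overhead_us:.0f}us")
```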
<h3 id="index-merge-10-30ms">Index Merge: ~10-30ms</h3>
<p>But a database can also do an index merge of two separate indexes:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">CREATE</span> <span class="token keyword">INDEX</span> <span class="token keyword">ON</span> <span class="token keyword">table</span> <span class="token punctuation">(</span>int1000<span class="token punctuation">)</span>
<span class="token keyword">CREATE</span> <span class="token keyword">INDEX</span> <span class="token keyword">ON</span> <span class="token keyword">table</span> <span class="token punctuation">(</span>int100<span class="token punctuation">)</span>
</code></pre>
<p>But how does a database utilize two indexes? And how expensive might this merge be?</p>
<p>How indexes are intersected depends on the database! There are many ways of
finding the intersection of two unordered lists: hashing, sorting, sets,
KD-trees, bitmaps, …</p>
<p>MySQL does what it calls an <a href="https://dev.mysql.com/doc/refman/8.0/en/index-merge-optimization.html">index merge intersection</a>; I haven’t consulted
the source, but most likely it’s sorting. Postgres does index intersection by
<a href="https://www.postgresql.org/docs/current/indexes-bitmap-scans.html">generating a bitmap after scanning each index</a>, and then <code>AND</code>ing them
together.</p>
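<p>Postgres’s bitmap strategy can be sketched like this, with big integers standing in for the per-row bitmaps (a simplification; the real machinery has exact and lossy modes):</p>

```python
# Simplified sketch of bitmap intersection (not Postgres's actual exact/lossy
# bitmap machinery): each index scan sets one bit per matching row position,
# and the two bitmaps are combined with a bitwise AND.
rows = [(i % 1000, i % 100) for i in range(10_000)]  # toy (int1000, int100) values

def scan_to_bitmap(positions):
    bm = 0
    for pos in positions:
        bm |= 1 << pos
    return bm

bm_int1000 = scan_to_bitmap(p for p, (a, _) in enumerate(rows) if a == 1)
bm_int100 = scan_to_bitmap(p for p, (_, b) in enumerate(rows) if b == 1)
both = bm_int1000 & bm_int100        # rows satisfying both conditions
print(bin(both).count("1"))          # -> 10
```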
<p><code>int100 = 1</code> returns about <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>10</mn><mi>M</mi><mo>⋅</mo><mn>1</mn><mi mathvariant="normal">/</mi><mn>100</mn><mo>≈</mo><mn>100</mn><mo separator="true">,</mo><mn>000</mn></mrow><annotation encoding="application/x-tex">10M \cdot 1/100 \approx 100,000</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">10</span><span class="mord mathnormal" style="margin-right:0.10903em">M</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord">1/100</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">100</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">000</span></span></span></span> rows, which is
about ~1.5 MiB to scan. <code>int1000 = 1</code> matches only ~10,000 rows, so in total
we’re reading about <a href="https://github.com/sirupsen/napkin-math"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>200</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">200\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">200</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span></a> worth of memory from both indexes.</p>
<p>After we have the matches from the index, we need to intersect them. In this
case, for simplicity of the napkin math, let’s assume we sort the matches
from both indexes and then intersect from there.</p>
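<p>The merge step itself is a two-pointer walk over the two sorted lists of row IDs (a sketch of the general technique, not any engine’s actual code):</p>

```python
# Two-pointer intersection of two sorted row-ID lists -- the sort-then-merge
# approach assumed in the napkin math above.
def intersect_sorted(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

ids_int1000 = list(range(1, 10_000, 1000))  # toy row IDs matching int1000 = 1
ids_int100 = list(range(1, 10_000, 100))    # toy row IDs matching int100 = 1
print(len(intersect_sorted(ids_int1000, ids_int100)))  # -> 10 rows match both
```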
<p>We can sort <a href="https://github.com/sirupsen/napkin-math"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>1</mn><mtext> MiB</mtext></mrow><annotation encoding="application/x-tex">1\text{ MiB}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">1</span><span class="mord text"><span class="mord"> MiB</span></span></span></span></span> in <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>5</mn><mtext> ms</mtext></mrow><annotation encoding="application/x-tex">5\text{ ms}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">5</span><span class="mord text"><span class="mord"> ms</span></span></span></span></span></a>. So it would take us ~10ms
total to sort it, iterate through both sorted lists for a negligible <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∼</mo><mn>200</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">\sim 200\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em"></span><span class="mrel">∼</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">200</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span> of memory reading, write the intersection to memory for another <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∼</mo><mn>200</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">\sim 200\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.3669em"></span><span class="mrel">∼</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">200</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span>, and then we’ve got the intersection, i.e. the rows that match both
conditions.</p>
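<p>Putting the merge estimate together with the napkin figures (read ~1 MiB/100μs, sort ~1 MiB/5ms; the 16-byte entry size is an assumption):</p>

```python
# Rough arithmetic behind the ~10ms index-merge estimate. Costs assumed from
# the napkin-math reference: read ~1 MiB per 100us, sort ~1 MiB per 5ms.
entry_bytes = 16
rows_int100 = 100_000                # 10M rows / cardinality 100
rows_int1000 = 10_000                # 10M rows / cardinality 1000
total_mib = (rows_int100 + rows_int1000) * entry_bytes / (1 << 20)  # ~1.7 MiB
read_ms = total_mib * 0.1            # ~0.2ms of memory reads
sort_ms = total_mib * 5              # ~8-9ms to sort both match lists
print(f"~{read_ms + sort_ms:.0f} ms before the final merge pass")
```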
<figure><img src="/images/intersection.png" alt="Illustration of intersecting the two indexes with
whatever internal identifier the database uses." title="Illustration of intersecting the two indexes with
whatever internal identifier the database uses." width="1848" height="475" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Illustration of intersecting the two indexes with
whatever internal identifier the database uses.</figcaption></figure>
<p>Thus our napkin math indicates that for our two separate indexes we’d expect the
query to take <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mtext> </mtext><mn>10</mn><mtext> ms</mtext></mrow><annotation encoding="application/x-tex">~10\text{ ms}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mspace nobreak"> </span><span class="mord">10</span><span class="mord text"><span class="mord"> ms</span></span></span></span></span>. The sorting is sensitive to the index size, which
is fairly approximate, so we apply a low multiplier and land at <code>~10-30ms</code>.</p>
<p>As we’ve seen, intersection bears a meaningful cost; on paper we expect it to
be roughly an order of magnitude slower than a composite index. However, 10ms is
still sensible for most situations, and sometimes it is nice not to maintain a
more specialized composite index for the query, for example when you are often
filtering on varying subsets of 10s of columns.</p>
<h2 id="reality">Reality</h2>
<p>Now that we’ve set our expectations from first principles about composite indexes
versus merging multiple indexes, let’s see how Postgres and MySQL fare in
real-life.</p>
<h3 id="composite-index-5ms-">Composite Index: 5ms ✅</h3>
<p>Both MySQL and Postgres perform index-only scans after we create the index:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">/* 10M rows total, int1000 = 1 matches ~10K, int100 matches ~100K */</span>
<span class="token keyword">CREATE</span> <span class="token keyword">INDEX</span> <span class="token keyword">ON</span> <span class="token keyword">table</span> <span class="token punctuation">(</span>int1000<span class="token punctuation">,</span> int100<span class="token punctuation">)</span>
<span class="token keyword">EXPLAIN</span> <span class="token keyword">ANALYZE</span> <span class="token keyword">SELECT</span> <span class="token function">count</span><span class="token punctuation">(</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token keyword">FROM</span> <span class="token keyword">table</span> <span class="token keyword">WHERE</span> int1000 <span class="token operator">=</span> <span class="token number">1</span> <span class="token operator">AND</span> int100 <span class="token operator">=</span> <span class="token number">1</span>
</code></pre>
<pre class="language-text"><code class="language-text">/* postgres, index is ~70 MiB */
Aggregate  (cost=6.53..6.54 rows=1 width=8) (actual time=0.919..0.919 rows=1 loops=1)
  -&gt;  Index Only Scan using compound_idx on test_table  (cost=0.43..6.29 rows=93 width=0) (actual time=0.130..0.909 rows=109 loops=1)
        Index Cond: ((int1000 = 1) AND (int100 = 1))
        Heap Fetches: 0
</code></pre>
<pre class="language-text"><code class="language-text">/* mysql, index is ~350 MiB */
-&gt; Aggregate: count(0)  (cost=18.45 rows=1) (actual time=0.181..0.181 rows=1 loops=1)
    -&gt; Covering index lookup on test_table using compound_idx (int1000=1, int100=1)  (cost=9.85 rows=86) (actual time=0.129..0.151 rows=86 loops=1)
</code></pre>
<p>They each take ~3-5ms when the index is cached. That is a bit slower than
the ~1ms we expected from the napkin math, but in our experience with napkin
math on databases, landing within an order of magnitude is acceptable. We
attribute the difference to the overhead of walking the index. <sup><a href="#user-content-fn-3" id="user-content-fnref-3" data-footnote-ref="true" aria-describedby="footnote-label">3</a></sup></p>
<h3 id="index-merge">Index Merge</h3>
<h4 id="mysql-30-40ms-">MySQL: 30-40ms ✅</h4>
<p>When we execute the query in MySQL it takes ~30-40ms, which tracks well with
the upper end of our napkin math. That means our first-principles understanding
likely lines up with reality!</p>
<p>Let’s confirm it’s doing what we expect by looking at the query plan:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">/* 10M rows total, int1000 = 1 matches ~10K, int100 matches ~100K */</span>
<span class="token keyword">EXPLAIN</span> <span class="token keyword">ANALYZE</span> <span class="token keyword">SELECT</span> <span class="token function">count</span><span class="token punctuation">(</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token keyword">FROM</span> <span class="token keyword">table</span> <span class="token keyword">WHERE</span> int1000 <span class="token operator">=</span> <span class="token number">1</span> <span class="token operator">AND</span> int100 <span class="token operator">=</span> <span class="token number">1</span>
</code></pre>
<pre class="language-text"><code class="language-text">/* mysql, each index is ~240 MiB */
-&gt; Aggregate: count(0)  (cost=510.64 rows=1) (actual time=31.908..31.909 rows=1 loops=1)
    -&gt; Filter: ((test_table.int100 = 1) and (test_table.int1000 = 1))  (cost=469.74 rows=409) (actual time=5.471..31.858 rows=86 loops=1)
        -&gt; Intersect rows sorted by row ID  (cost=469.74 rows=410) (actual time=5.464..31.825 rows=86 loops=1)
            -&gt; Index range scan on test_table using int1000 over (int1000 = 1)  (cost=37.05 rows=18508) (actual time=0.271..2.544 rows=9978 loops=1)
            -&gt; Index range scan on test_table using int100 over (int100 = 1)  (cost=391.79 rows=202002) (actual time=0.324..24.405 rows=99814 loops=1)
/* ~30 ms */
</code></pre>
<p>MySQL’s query plan tells us it’s doing <em>exactly</em> what we expected: getting the
matching entries from each index, intersecting them, and performing the count on
the intersection. Running <code>EXPLAIN</code> without <code>ANALYZE</code>, I confirmed that it’s
serving <em>everything</em> from the indexes and never seeking the full row.</p>
<h4 id="postgres-25-90ms-">Postgres: 25-90ms 🤔</h4>
<p>Postgres is also within an order of magnitude of our napkin math, but at the
higher end and with more variance, in general performing worse than MySQL. Is
its bitmap-based intersection just slower on this query? Or is it doing
something completely different than MySQL?</p>
<p>Let’s look at the query plan, using the same query we ran against MySQL:</p>
<pre class="language-sql"><code class="language-sql"><span class="token comment">/* 10M rows total, int1000 = 1 matches ~10K, int100 matches ~100K */</span>
<span class="token keyword">EXPLAIN</span> <span class="token keyword">ANALYZE</span> <span class="token keyword">SELECT</span> <span class="token function">count</span><span class="token punctuation">(</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token keyword">FROM</span> <span class="token keyword">table</span> <span class="token keyword">WHERE</span> int1000 <span class="token operator">=</span> <span class="token number">1</span> <span class="token operator">AND</span> int100 <span class="token operator">=</span> <span class="token number">1</span>
</code></pre>
<pre class="language-text"><code class="language-text">/* postgres, each index is ~70 MiB */
Aggregate  (cost=1536.79..1536.80 rows=1 width=8) (actual time=29.675..29.677 rows=1 loops=1)
  -&gt;  Bitmap Heap Scan on test_table  (cost=1157.28..1536.55 rows=95 width=0) (actual time=27.567..29.663 rows=109 loops=1)
        Recheck Cond: ((int1000 = 1) AND (int100 = 1))
        Heap Blocks: exact=109
        -&gt;  BitmapAnd  (cost=1157.28..1157.28 rows=95 width=0) (actual time=27.209..27.210 rows=0 loops=1)
              -&gt;  Bitmap Index Scan on int1000_idx  (cost=0.00..111.05 rows=9948 width=0) (actual time=2.994..2.995 rows=10063 loops=1)
                    Index Cond: (int1000 = 1)
              -&gt;  Bitmap Index Scan on int100_idx  (cost=0.00..1045.94 rows=95667 width=0) (actual time=23.757..23.757 rows=100038 loops=1)
                    Index Cond: (int100 = 1)
Planning Time: 0.138 ms

/* ~30-90ms */
</code></pre>
<p>The query plan confirms that it’s using the <a href="https://www.postgresql.org/docs/current/indexes-bitmap-scans.html">bitmap intersection strategy</a>
for intersecting the two indexes. But that’s not what’s causing the performance
difference.</p>
<p>While MySQL services the entire aggregate (<code>count(*)</code>) from the index, Postgres
actually goes to the heap to get <em>every row</em>. The heap contains the <em>entire</em>
row, which is upwards of 1 KiB. This is expensive, and when the heap cache isn’t
warm, the query takes almost 100ms! <sup><a href="#user-content-fn-5" id="user-content-fnref-5" data-footnote-ref="true" aria-describedby="footnote-label">4</a></sup></p>
<p>As we can tell from the query plan, it seems that Postgres is unable to do
<a href="https://www.postgresql.org/docs/current/indexes-index-only-scans.html">index-only scans</a> in conjunction with index intersection. Maybe in a future
Postgres version they will support this; I don’t see any fundamental reason why
they couldn’t!</p>
<p>Going to the heap doesn’t have a huge impact when it’s only for ~100 records,
especially when the heap is cached. However, if we change the condition to
<code>WHERE int10 = 1 AND int100 = 1</code>, for a total of 10,000 matches, the query
takes 7s on Postgres, versus 200ms in MySQL, where the index-only scan is alive
and kicking!</p>
<p>So MySQL is superior on index merges where there is an opportunity to service
the entire query from the indexes. It is worth pointing out, though, that
Postgres’ lower bound when everything is cached is lower for this particular
intersection size; likely its bitmap-based intersection is faster.</p>
<p>Postgres and MySQL do have roughly equivalent performance on index-only scans
though. For example, if we do <code>int10 = 1</code> Postgres will do its own index-only
scan because only one index is involved.</p>
<p>The first time I ran this index-only scan in Postgres it took over a
second; I had to run <code>VACUUM</code> for the performance to match! In Postgres,
index-only scans require frequent <code>VACUUM</code> on the table to avoid falling
back to fetching the entire row from the heap.</p>
<p><code>VACUUM</code> helps because Postgres has to visit the heap for any records that have
been touched since the last <code>VACUUM</code>, <a href="https://www.postgresql.org/docs/current/indexes-index-only-scans.html">due to its MVCC implementation</a>. In my
experience, this can have serious consequences in production for index-only
scans if you have an update-heavy table where <code>VACUUM</code>
is expensive.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Index merges are ~10x slower than composite indexes because the ad-hoc
intersection isn’t a very fast operation: it requires, for example, sorting the
output of each index scan. Indexes could be optimized further for
intersection, but this would likely have other ramifications for steady-state
load.</p>
<p>If you’re wondering whether you need to add a composite index, or can get away
with creating two single indexes and relying on the database to use both, then
<strong>the rule of thumb we establish is that an index merge will be ~10x slower than
the composite index</strong>. However, we’re still talking less than 100ms in most
cases, as long as you’re operating on 100s of rows (which in a relational,
operational database, hopefully you mostly are).</p>
<p>The gap in performance will widen when intersecting more than two columns, and
with a larger intersection size—I had to limit the scope of this article somewhere.
Roughly an order of magnitude seems like a reasonable assumption, and ~100
matching rows is typical of many real-life queries.</p>
<p>If you are using Postgres, be careful relying on index merging! Postgres doesn’t
do index-only scans after an index merge, so it may have to go to the heap for
potentially 100,000s of records for a <code>count(*)</code>. If you’re only returning 10s
to 100s of rows, that’s usually fine.</p>
<p>Another second-order take-away: If you’re in a situation where you have 10s of
columns filtering in all kinds of combinations, with queries like this:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> id
<span class="token keyword">FROM</span> products
<span class="token keyword">WHERE</span> color <span class="token operator">=</span> <span class="token string">'blue'</span> <span class="token operator">AND</span> <span class="token keyword">type</span> <span class="token operator">=</span> <span class="token string">'sneaker'</span> <span class="token operator">AND</span> activity <span class="token operator">=</span> <span class="token string">'training'</span>
  <span class="token operator">AND</span> season <span class="token operator">=</span> <span class="token string">'summer'</span> <span class="token operator">AND</span> inventory <span class="token operator">&gt;</span> <span class="token number">0</span> <span class="token operator">AND</span> price <span class="token operator">&lt;=</span> <span class="token number">200</span> <span class="token operator">AND</span> price <span class="token operator">&gt;=</span> <span class="token number">100</span>
  <span class="token comment">/* and potentially many, many more rules */</span>
</code></pre>
<p>Then you’re in a bit more of a pickle with Postgres/MySQL. Supporting this
use-case well would require a combinatorial explosion of composite indexes, and
composite indexes are what you’d need for the sub-10ms performance fast websites
require. This is simply impractical.</p>
<p>Unfortunately, for sub-10ms response times, we also can’t rely on index merges
being <em>that</em> fast, because of the ad-hoc intersection. I wrote an article about
solving the problem of queries that have <a href="/napkin/problem-13-filtering-with-inverted-indexes">lots of conditions with Lucene</a>,
which is <em>very</em> good at doing lots of intersections. It would be interesting to
try this with <a href="https://www.postgresql.org/docs/current/gin-intro.html#:~:text=GIN%20stands%20for%20Generalized%20Inverted,appear%20within%20the%20composite%20items.">GIN-indexes</a> (inverted index, similar to what Lucene does) in
Postgres as a comparison. <a href="https://www.postgresql.org/docs/current/bloom.html">Bloom-indexes</a> may also be suited for this.
Columnar database might also be better at this, but I haven’t looked at that
in-depth yet.</p>
<section data-footnotes="true" class="footnotes"><h2 class="sr-only" id="footnote-label">Footnotes</h2>
<ol>
<li id="user-content-fn-1">
<p>The testing is done on a table generated by <a href="https://github.com/sirupsen/napkin-math/blob/master/newsletter/20-compound-vs-combining-indexes/test.rb">this simple script</a>. <a href="#user-content-fnref-1" data-footnote-backref="" aria-label="Back to reference 1" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-2">
<p>There’s extra overhead searching the index B-tree down to the relevant
range, and the reads aren’t <em>entirely</em> sequential in the B-tree.
Additionally, we’re assuming the index is in memory, which is a reasonable
assumption given its tiny size. Reading from SSD should only be <a href="https://github.com/sirupsen/napkin-math">2x
slower</a>, since access is mostly sequential-ish once the first relevant leaf has
been found. Each index entry struct is also bigger than two 64-bit integers,
e.g. it includes the heap location in Postgres or the primary key in MySQL.
Either way, napkin math of a few hundred microseconds still seems fair! <a href="#user-content-fnref-2" data-footnote-backref="" aria-label="Back to reference 2" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-3">
<p>Looking at the real index sizes, the compound index is ~70 MiB in
Postgres, and 350 MiB in MySQL. We’d expect an index of roughly three 64-bit
integers per row (the third being the location on the heap) to be ~230 MiB for 10M
rows. <a href="https://news.ycombinator.com/item?id=33766531">fabien2k on HN</a> pointed out that Postgres does
<a href="https://www.postgresql.org/docs/current/btree-implementation.html">de-duplication</a>, which is likely how it achieves its lower index size.
MySQL has some overhead, which is reasonable for a structure of this size.
Both perform about equally on this, but a smaller index at the same
performance is superior, as it takes up less cache space. <a href="#user-content-fnref-3" data-footnote-backref="" aria-label="Back to reference 3" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-5">
<p>In the first edition of this article, Postgres was going to the heap 100s
of times, instead of just 109 times for the 109 matching rows. It turns
out that the bitmaps and their intersection were exceeding the
<code>work_mem=4MB</code> default setting. This causes Postgres to use a <em>lossy bitmap
intersection</em> with just the heap page rather than the exact row location. Read
more <a href="https://dba.stackexchange.com/a/106267">here.</a> Thanks to /u/therealgaxbo and /u/trulus on Reddit for
<a href="https://www.reddit.com/r/PostgreSQL/comments/z6pviz/comment/iy2ucc9/?utm_source=reddit&amp;utm_medium=web2x&amp;context=3">pointing this out.</a> Either way, Postgres is still not performing an
index-only scan, requiring 109 random disk seeks on a cold cache, taking ~90ms. <a href="#user-content-fnref-5" data-footnote-backref="" aria-label="Back to reference 4" class="data-footnote-backref">↩</a></p>
</li>
</ol>
</section>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Scaling Causal's Spreadsheet Engine from Thousands to Billions of Cells: From Maps to Arrays]]></title>
        <id>https://sirupsen.com/causal</id>
        <link href="https://sirupsen.com/causal"/>
        <updated>2022-07-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Causal's UI
Causal is a spreadsheet built for the 21st century to help people work better
with numbers. Behind Causal’s innocent web UI is a complex calculation engine —
an interpreter that executes formulas on an in-memory, multidimensional
database. The engine sends the result from ]]></summary>
        <content type="html"><![CDATA[<figure><img src="/images/causal/image4.gif" alt="alt_text" title="Causal&#x27;s UI" width="730" height="440" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Causal's UI</figcaption></figure>
<p>Causal is a spreadsheet built for the 21st century to help people work better
with numbers. Behind Causal’s innocent web UI is a complex calculation engine —
an interpreter that executes formulas on an in-memory, multidimensional
database. The engine sends the result from evaluating expressions like <code>Price * Units</code> to the browser, calculating the result for each dimension combination, such
as time, product name, and country: e.g. what the revenue was for a single product,
during February ‘22, in Australia.</p>
<p>In the early days of Causal, the calculation engine ran in JavaScript in <em>the
browser</em>, but that only scaled to 10,000s of cells. So we moved the calculation
engine <em>out</em> of the browser to a Node.js service, getting us acceptable
performance for low 100,000s of cells. In its latest and current iteration, we
moved the calculation engine to Go, getting us to 1,000,000s of cells.</p>
<p>But every time we scale up by an order of magnitude, our customers find new
use-cases that require yet another order of magnitude more cells!</p>
<p>With no more “cheap tricks” of switching the run-time again, how can we scale
the calculation engine 100x, from millions to <em>billions</em> of cells?</p>
<p>In summary: by moving from maps to arrays. 😅 That may seem like an awfully
pedestrian observation, but it certainly wasn’t obvious to us at the outset that
this was the crux of the problem!</p>
<p>We want to take you along on our little journey of what to do once you’ve reached a
dead-end with the profiler. Instead, we’ll approach the problem from first
principles with back-of-the-envelope calculations, and write simple programs to
get a feel for the performance of various data structures. Causal isn’t quite at
billions of cells yet, but we’re rapidly making our way there!</p>
<h3 id="optimizing-beyond-the-profiler-dead-end">Optimizing beyond the profiler dead-end</h3>
<figure><img src="/images/causal/image3.png" alt="alt_text" title="Profile from the calculation engine that
it feels difficult to action" width="1612" height="934" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Profile from the calculation engine that
it feels difficult to action</figcaption></figure>
<p>What does it look like to reach a <em>dead-end</em> with a profiler? When you run a profiler for the first time, you’ll often get something useful: your program’s spending 20% of its time in an auxiliary function <code>log_and_send_metrics()</code> that you know <em>reasonably</em> shouldn’t take 20% of the time.</p>
<p>You peek at the function, see that it’s doing a ridiculous amount of string allocations, UDP-jiggling, and blocking the computing thread… You play this fun and rewarding profile whack-a-mole for a while, getting big and small increments here and there.</p>
<p>But at some point, your profile starts to look a bit like the above: there’s no longer anything that stands out to you as <em>grossly</em> against what’s <em>reasonable.</em> No longer any pesky <code>log_and_send_metrics()</code> eating double-digit percentages of your precious runtime.</p>
<p>The constraints move to your own calibration of what % is reasonable in the profile: It’s spending time in the GC, time allocating objects, a bit of time accessing hash maps, … Isn’t that all <em>reasonable</em>? How can we possibly know whether 5.3% of time scanning objects for the GC is <em>reasonable</em>? Even if we did optimize our memory allocations to get that number to 3%, that’s a puny incremental gain… It’s not going to get us to billions of cells! Should we switch to a non-GC’ed language? Rust?! At a certain point, you’ll go mad trying to turn a profile into a performance roadmap.</p>
<p>When analyzing a system top-down with a profiler, it’s easy to miss the forest for the trees. It helps to take a step back, and analyze the problem from first principles.</p>
<p>We sat down and thought about what, fundamentally, a calculation engine is. With some back-of-the-envelope calculations, what’s the upper bookend of how many cells we could reasonably expect the calculation engine to support?</p>
<p>In my experience, first-principle thinking is <em>required</em> to break out of iterative improvement and make order of magnitude improvements. A profiler can’t be your only performance tool.</p>
<h3 id="approaching-the-calculation-engine-from-first-principles">Approaching the calculation engine from first principles</h3>
<p>To understand that, we have to explain two concepts from Causal that help keep your spreadsheet organized: dimensions and variables.</p>
<p>We might have a variable “Sales” that is broken down by the dimensions “Product” and “Country”. To appreciate how easy it is to build a giant model: if we have 100s of months, 10,000s of products, 10s of countries, and 100 variables, we’ve already created a model with 1B+ cells. In Causal, “Sales” looks like this:</p>
<figure><img src="/images/causal/image5.png" alt="alt_text" title="Sales modeled in Causal&#x27;s UI" width="1380" height="510" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Sales modeled in Causal's UI</figcaption></figure>
<p>In a first iteration we might represent <code>Sales</code> and its cells with a map. This seems innocent enough, especially coming from an original implementation in JavaScript, hastily ported to Go. As we’ll learn in this blog post, there are several performance problems with this data structure, but we’ll take it step by step:</p>
<pre class="language-go"><code class="language-go">sales <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token keyword">map</span><span class="token punctuation">[</span><span class="token builtin">int</span><span class="token punctuation">]</span><span class="token operator">*</span>Cell<span class="token punctuation">)</span>
</code></pre>
<p>The integer index would be the <em>dimension index</em> used to reference a specific cell. It is the index representing the specific dimension combination we’re interested in. For example, for <code>Sales[Toy-A][Canada]</code> the index would be 0, because Toy-A is the 0th <code>Product Name</code> and Canada is the 0th <code>Country</code>. For <code>Sales[Toy-A][United Kingdom]</code> it would be 1 (0th Toy, 1st Country), and for <code>Sales[Toy-C][India]</code> it would be <code>2 * 3 + 2 = 8</code> (2nd Toy, 2nd Country, with three countries per product).</p>
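<p>As a sketch (the helper name and layout are illustrative, not the engine’s actual code), the flat dimension index is the row-major combination of the per-dimension indices:</p>

```go
package main

import "fmt"

// flatIndex maps per-dimension indices to one flat cell index, row-major:
// the last dimension varies fastest. dims holds each dimension's size,
// idx the chosen zero-based index within each dimension.
func flatIndex(dims, idx []int) int {
	flat := 0
	for d := range dims {
		flat = flat*dims[d] + idx[d]
	}
	return flat
}

func main() {
	// Hypothetical model: 3 products × 3 countries, zero-indexed.
	dims := []int{3, 3}
	fmt.Println(flatIndex(dims, []int{0, 0})) // e.g. [Toy-A][Canada] → 0
	fmt.Println(flatIndex(dims, []int{0, 1})) // e.g. [Toy-A][United Kingdom] → 1
	fmt.Println(flatIndex(dims, []int{1, 2})) // 1*3 + 2 → 5
}
```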
<p>An ostensible benefit of the map structure is that if a lot of cells are 0, then we don’t have to store those cells at all. In other words, this data structure seems useful for <em>sparse</em> models.</p>
<p>But to make the spreadsheet come alive, we need to calculate formulas such as <code>Net Profit = Sales * Profit</code>. This simple equation shows the power of Causal’s dimensional calculations, as it will calculate each cell’s unique net profit!</p>
<p>Now that we have a simple mental model of how Causal’s calculation engine works, we can start reasoning about its performance from first principles.</p>
<p>If we multiply two variables of 1B cells of 64 bit floating points each (~<a href="https://github.com/sirupsen/napkin-math#numbers">8 GiB memory</a>) into a third variable, then we have to traverse at least ~24 GiB of memory. If we naively assume this is sequential access (which hashmap access <em>isn’t</em>) and we have SIMD and multi-threading, we can process that memory at a rate of 30ms / 1 GiB, or ~700ms total (and <em>half</em> that time if we were willing to drop to 32-bit floating points and forgo some precision!).</p>
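<p>That estimate can be spelled out as a tiny program (rates taken from the napkin-math reference; the 30ms/GiB figure assumes sequential access with SIMD and multi-threading):</p>

```go
package main

import "fmt"

func main() {
	const cells = 1e9        // cells per variable
	const bytesPerCell = 8.0 // one 64-bit float per cell
	const gib = float64(1 << 30)
	const msPerGiB = 30.0 // sequential rate, SIMD + multi-threaded

	// Read two input variables, write one output variable: 3 passes.
	traffic := 3 * cells * bytesPerCell / gib
	// The article rounds this to ~24 GiB of traffic and ~700ms.
	fmt.Printf("~%.0f GiB of memory traffic, ~%.0f ms\n", traffic, traffic*msPerGiB)
}
```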
<p>So from first-principles, it seems <em>possible</em> to do calculations of billions of
cells in less than a second. Of course, there’s far more complexity below the
surface as we execute the many types of formulas, and computations on
dimensions. But there’s reason for optimism! We will carry through this example
of multiplying variables for <code>Net Profit</code> as it serves as a good proxy for the
performance we can expect on large models, where typically you’ll have fewer,
smaller variables.</p>
<p>In the remainder of this post, we will try to close the gap between smaller Go prototypes and the napkin math. That should serve as evidence of what performance work to focus on in the 30,000+ line calculation engine.</p>
<h3 id="iteration-1-mapintcell-30m-cells-in-6s"><strong>Iteration 1</strong>: <code>map[int]*Cell</code>, 30m cells in ~6s</h3>
<p>In Causal’s calculation engine each <code>Cell</code> in the map was initially ~88 bytes to store various information about the cell such as the formula, dependencies, and other references. We start our investigation by implementing this basic data-structure in Go.</p>
<p>With 10M-cell variables, for a total of 30M cells, it takes almost 6s to compute the <code>Net Profit = Sales * Profit</code> calculation. These numbers from our prototype don’t include all the other overhead that naturally accompanies running in a larger, far more feature-complete code base. In the real engine, this takes a few times longer.</p>
<p>We want to be able to do <em>billions</em> in seconds with plenty of wiggle-room for
necessary overhead, so 10s of millions in seconds won’t fly. We have to do
better. We know from our napkin math, that we <em>should</em> be able to.</p>
<pre class="language-sh-session"><code class="language-sh-session">$ go build main.go &amp;&amp; hyperfine ./main
Benchmark 1: ./napkin
  Time (mean ± σ):      5.828 s ±  0.032 s    [User: 10.543 s, System: 0.984 s]
  Range (min ... max):    5.791 s ...  5.881 s    10 runs
</code></pre>
<pre class="language-go"><code class="language-go"><span class="token keyword">package</span> main

<span class="token keyword">import</span> <span class="token punctuation">(</span>
        <span class="token string">&quot;math/rand&quot;</span>
<span class="token punctuation">)</span>

<span class="token keyword">type</span> Cell88 <span class="token keyword">struct</span> <span class="token punctuation">{</span>
        padding <span class="token punctuation">[</span><span class="token number">80</span><span class="token punctuation">]</span><span class="token builtin">byte</span> <span class="token comment">// just to simulate what would be real stuff</span>
        value   <span class="token builtin">float64</span>
<span class="token punctuation">}</span>

<span class="token keyword">func</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
        <span class="token function">pointerMapIntegerIndex</span><span class="token punctuation">(</span><span class="token number">10_000_000</span><span class="token punctuation">)</span> <span class="token comment">// 3 variables = 30M total</span>
<span class="token punctuation">}</span>

<span class="token keyword">func</span> <span class="token function">pointerMapIntegerIndex</span><span class="token punctuation">(</span>nCells <span class="token builtin">int</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
        one <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token keyword">map</span><span class="token punctuation">[</span><span class="token builtin">int</span><span class="token punctuation">]</span><span class="token operator">*</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>
        two <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token keyword">map</span><span class="token punctuation">[</span><span class="token builtin">int</span><span class="token punctuation">]</span><span class="token operator">*</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>
        res <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token keyword">map</span><span class="token punctuation">[</span><span class="token builtin">int</span><span class="token punctuation">]</span><span class="token operator">*</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>

        rand <span class="token operator">:=</span> rand<span class="token punctuation">.</span><span class="token function">New</span><span class="token punctuation">(</span>rand<span class="token punctuation">.</span><span class="token function">NewSource</span><span class="token punctuation">(</span><span class="token number">0xCA0541</span><span class="token punctuation">)</span><span class="token punctuation">)</span>

        <span class="token keyword">for</span> i <span class="token operator">:=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> nCells<span class="token punctuation">;</span> i<span class="token operator">++</span> <span class="token punctuation">{</span>
                one<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token operator">&amp;</span>Cell88<span class="token punctuation">{</span>value<span class="token punctuation">:</span> rand<span class="token punctuation">.</span><span class="token function">Float64</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">}</span>
                two<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token operator">&amp;</span>Cell88<span class="token punctuation">{</span>value<span class="token punctuation">:</span> rand<span class="token punctuation">.</span><span class="token function">Float64</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">}</span>
        <span class="token punctuation">}</span>

        <span class="token keyword">for</span> i <span class="token operator">:=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> nCells<span class="token punctuation">;</span> i<span class="token operator">++</span> <span class="token punctuation">{</span>
                res<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token operator">&amp;</span>Cell88<span class="token punctuation">{</span>value<span class="token punctuation">:</span> one<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value <span class="token operator">*</span> two<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value<span class="token punctuation">}</span>
        <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
</code></pre>
<h3 id="iteration-2-cell-30m-cells-in-400ms">Iteration 2: <code>[]Cell</code>, 30m cells in ~400ms</h3>
<p>In our napkin math, we assumed <em>sequential</em> memory access. But hashmaps don’t do
sequential memory access. Perhaps this is a far larger offender than our profile
above suggests?</p>
<p>Well, how do hashmaps work? You hash a key to find the <em>bucket</em> that the key/value pair is stored in, and insert the key and value into that bucket. When the average occupancy of the buckets grows to around ~6.5 entries, the number of buckets doubles and all the entries get re-shuffled (fairly expensive, and a good reason to pre-size your maps). Each resize means re-hashing and re-comparing a lot of keys into an ever-larger bucket array.</p>
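<p>A quick way to feel the resizing cost: fill a pre-sized map and a map that has to grow from scratch. This sketch only asserts both end up with the same contents; the timing difference (which varies by machine) is the interesting part:</p>

```go
package main

import (
	"fmt"
	"time"
)

func fill(m map[int]float64, n int) {
	for i := 0; i < n; i++ {
		m[i] = float64(i)
	}
}

func main() {
	const n = 2_000_000

	start := time.Now()
	grown := make(map[int]float64) // starts tiny, doubles its buckets repeatedly
	fill(grown, n)
	fmt.Println("grown:   ", time.Since(start))

	start = time.Now()
	presized := make(map[int]float64, n) // buckets allocated up front, no re-shuffling
	fill(presized, n)
	fmt.Println("presized:", time.Since(start))

	fmt.Println(len(grown) == len(presized)) // true
}
```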
<figure><img src="/images/causal/image2.png" alt="alt_text" title="Array of Structs to Struct of Arrays" width="1219" height="774" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Array of Structs to Struct of Arrays</figcaption></figure>
<p>Let’s think about the performance implications of this from the ground up. Every time we look up a cell from its integer index, the operations we have to perform (and their performance, according to the <a href="https://github.com/sirupsen/napkin-math">napkin math reference</a>):</p>
<ol>
<li><strong>Hash</strong> the integer index to a hashed value: 25ns</li>
<li><strong>Mask</strong> the hashed value to map it to a bucket: 1-5ns</li>
<li><strong>Random memory read</strong> to map the bucket to a pointer to the bucket’s address: 1ns (because it’ll be in the cache)</li>
<li><strong>Random memory read</strong> to read the bucket: 50ns</li>
<li><strong>Equality</strong> operations on up to 6-7 entries in the bucket to locate the right key: 1-10ns</li>
<li><strong>Random memory read</strong> to follow and read the *Cell pointer: 50ns</li>
</ol>
<p>Most of this comes out in the wash; by far the most expensive parts are the random memory reads the map entails. Let’s say ~100ns per look-up: with ~30M look-ups, that’s ~3 seconds in hash look-ups alone. That lines up with the performance we’re seeing. Fundamentally, it really seems like trouble to get to billions of cells with a map.</p>
<p>There’s another problem with our data structure in addition to all the pointer-chasing leading to slow random memory reads: the size of the cell. Each cell is 88 bytes. When a CPU reads memory, it fetches one <em>cache line</em> of 64 bytes at a time. In this case, the entire 88 byte cell doesn’t fit in a single cache line. 88 bytes spans two cache lines, with 128 - 88 = 40 bytes of wasteful fetching of our precious memory bottleneck!</p>
<p>If those 40 bytes belonged to the next cell, that’s not a big deal, since we’re about to use them anyway. However, in this random-memory-read heavy world of using a hashmap that stores pointers, we can’t trust that cells will be adjacent. This is enormously wasteful for our precious memory bandwidth.</p>
<p>In the <a href="https://github.com/sirupsen/napkin-math">napkin math reference</a>, random memory reads are <em>~50x slower than sequential access</em>. A huge reason for this is that the CPU’s memory prefetcher cannot predict memory access. Accessing memory is one of the slowest things a CPU does, and if it can’t preload cache lines, we’re spending <em>a lot</em> of time stalled on memory.</p>
<p>Could we give up the map? We mentioned earlier that a nice property of the map is that it allows us to build sparse models with lots of empty cells. For example, cohort models tend to have half of their cells empty. But perhaps half of the cells being empty is not quite enough to qualify as ‘sparse’?</p>
<p>We could consider mapping the index for the cells into a large, pre-allocated array. Then cell access would be just a <em>single</em> random-read of 50ns! In fact, it’s even better than that: In this particular <code>Net Profit</code>, all the memory access is sequential. This means that the CPU can be smart and prefetch memory because it can reasonably predict what we’ll access next. For a single thread, we know we can do about <a href="https://github.com/sirupsen/napkin-math#numbers">1 GiB/100ms</a>. This is about <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>30</mn><mi>M</mi><mo>⋅</mo><mn>88</mn><mtext> bytes</mtext><mo>≈</mo><mn>2.5</mn><mtext> GiB</mtext></mrow><annotation encoding="application/x-tex">30M \cdot 88 \text{ bytes} \approx 2.5 \text{ GiB}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">30</span><span class="mord mathnormal" style="margin-right:0.10903em">M</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mord">88</span><span class="mord text"><span class="mord"> bytes</span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">2.5</span><span class="mord text"><span class="mord"> GiB</span></span></span></span></span>, so it should take somewhere in the ballpark of 250-300ms. Consider also that the allocations themselves on the first few lines take a bit of time.</p>
<pre class="language-go"><code class="language-go"><span class="token keyword">func</span> <span class="token function">arrayCellValues</span><span class="token punctuation">(</span>nCells <span class="token builtin">int</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
        one <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token punctuation">]</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>
        two <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token punctuation">]</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>
        res <span class="token operator">:=</span> <span class="token function">make</span><span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token punctuation">]</span>Cell88<span class="token punctuation">,</span> nCells<span class="token punctuation">)</span>

        rand <span class="token operator">:=</span> rand<span class="token punctuation">.</span><span class="token function">New</span><span class="token punctuation">(</span>rand<span class="token punctuation">.</span><span class="token function">NewSource</span><span class="token punctuation">(</span><span class="token number">0xCA0541</span><span class="token punctuation">)</span><span class="token punctuation">)</span>

        <span class="token keyword">for</span> i <span class="token operator">:=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> nCells<span class="token punctuation">;</span> i<span class="token operator">++</span> <span class="token punctuation">{</span>
                one<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value <span class="token operator">=</span> rand<span class="token punctuation">.</span><span class="token function">Float64</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
                two<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value <span class="token operator">=</span> rand<span class="token punctuation">.</span><span class="token function">Float64</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
        <span class="token punctuation">}</span>

        <span class="token keyword">for</span> i <span class="token operator">:=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator">&lt;</span> nCells<span class="token punctuation">;</span> i<span class="token operator">++</span> <span class="token punctuation">{</span>
                res<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value <span class="token operator">=</span> one<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value <span class="token operator">*</span> two<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">.</span>value
        <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
</code></pre>
<pre class="language-sh-session"><code class="language-sh-session">napkin:go2 $ go build main.go &amp;&amp;  hyperfine ./main
Benchmark 1: ./main
  Time (mean ± σ):     346.4 ms ±  21.1 ms    [User: 177.7 ms, System: 171.1 ms]
  Range (min ... max):   332.5 ms ... 404.4 ms    10 runs
</code></pre>
<p>That’s great! And it tracks our expectations from our napkin math well (the extra overhead is partially from the random number generator).</p>
<h3 id="iteration-3-threading-250ms">Iteration 3: Threading, 250ms</h3>
<p>Generally, we expect threading to speed things up substantially as we’re able to utilize more cores. However, in this case we’re memory bound, not compute bound. We’re just doing simple calculations between the cells, which is generally the case in real Causal models. Multiplying numbers takes single-digit cycles; fetching memory takes a double- to triple-digit number of cycles. Compute-bound workloads scale well with cores. Memory-bound workloads act differently when scaled up.</p>
<p>If we look at raw memory bandwidth numbers in the <a href="https://github.com/sirupsen/napkin-math#numbers">napkin math reference</a>, a 3x speed-up in a memory-bound workload seems to be our ceiling. In other words, if you’re memory bound, you only need about ~3-4 cores to exhaust memory bandwidth. More won’t help much. But they do help, because a single thread <a href="https://news.ycombinator.com/item?id=16174813">cannot exhaust memory bandwidth on most CPUs</a>.</p>
<p>When <a href="https://gist.github.com/sirupsen/d413b130d0f45d0d35d0bc85b9071abb">implemented</a>, however, we only get a ~1.6x speed-up (400ms → 250ms), not a 3x speed-up (~130ms). I am frankly not sure how to explain this ~120ms gap. If anyone has a theory, <a href="mailto:lukas@causal.app">we’d love to hear it</a>!</p>
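<p>For reference, the threaded multiply is roughly this shape — a simplified sketch of chunking the index range across goroutines; details differ from the linked gist:</p>

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// multiplyChunked computes res[i] = one[i] * two[i], splitting the index
// range across workers so each goroutine streams through its own
// sequential chunk of memory.
func multiplyChunked(res, one, two []float64) {
	workers := runtime.NumCPU()
	chunk := (len(res) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if hi > len(res) {
			hi = len(res)
		}
		if lo >= hi {
			continue
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				res[i] = one[i] * two[i]
			}
		}(lo, hi)
	}
	wg.Wait()
}

func main() {
	n := 10
	one, two, res := make([]float64, n), make([]float64, n), make([]float64, n)
	for i := range one {
		one[i], two[i] = float64(i), 2
	}
	multiplyChunked(res, one, two)
	fmt.Println(res[3], res[9]) // 6 18
}
```

Since the loop is memory bound, a handful of workers is enough to saturate bandwidth; more goroutines mostly add scheduling overhead.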
<p>Either way, we definitely seem to be memory bound now. There are only two ways forward: (1) get more memory bandwidth on a different machine, or (2) reduce the amount of memory we’re using. Let’s try to find some more brrr with (2).</p>
<h3 id="iteration-4-smaller-cells-88-bytes--32-bytes-70ms">Iteration 4: Smaller Cells, 88 bytes → 32 bytes, 70ms</h3>
<p>If we were able to cut the cell size ~3x, from 88 bytes to 32 bytes, we’d expect performance to roughly triple as well! In our simulation tool, we’ll reduce the size of the cell:</p>
<pre class="language-go"><code class="language-go"><span class="token keyword">type</span> Cell32 <span class="token keyword">struct</span> <span class="token punctuation">{</span>
    padding <span class="token punctuation">[</span><span class="token number">24</span><span class="token punctuation">]</span><span class="token builtin">byte</span>
    value   <span class="token builtin">float64</span>
<span class="token punctuation">}</span>
</code></pre>
<p>Indeed, with the threading on top, <a href="https://gist.github.com/sirupsen/d413b130d0f45d0d35d0bc85b9071abb">this gets us to ~70ms</a> which is just around a 3x improvement!</p>
<p>In fact, what is even in that cell struct? The cell stores things like formulas, but for many cells, we don’t <em>actually</em> need the formula stored with the cell. For most cells in Causal, the formula is the same as the <em>previous</em> cell’s. I won’t show the original struct, because it’s confusing, but there are other pointers too, e.g. to the parent variable. By writing the calculation engine’s interpreter to more carefully keep track of context, we should be able to remove various pointers, e.g. to the parent variable. Often, structs get expanded with cruft as a quick way to break through some logic barrier, rather than by carefully restructuring the surrounding context to provide this information on the stack.</p>
<p>As a general pattern, we can reduce the size of the cell by switching from an <em>array of structs</em> design to a <em>struct of arrays</em> design. In other words, if we’re in a cell with index 328 and need the formula for that cell, we look up index 328 in a formula array. These are called <em>parallel arrays</em>. Even if we access a different formula for every single cell, the CPU is smart enough to detect that it’s another sequential access stream. This is generally much faster than chasing pointers.</p>
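<p>A minimal sketch of the parallel-arrays idea (field names are hypothetical, not the engine’s): each attribute lives in its own slice, and they all share the same cell index:</p>

```go
package main

import "fmt"

// Struct-of-arrays layout: instead of one slice of fat Cell structs,
// each attribute gets its own parallel slice, indexed by the cell index.
type Cells struct {
	values     []float64
	formulaIDs []int32 // index into a separate formula table (hypothetical)
}

func newCells(n int) Cells {
	return Cells{
		values:     make([]float64, n),
		formulaIDs: make([]int32, n),
	}
}

func main() {
	cells := newCells(4)
	cells.values[2] = 42.0
	cells.formulaIDs[2] = 7

	// The hot multiply loop only touches the values slice: 8 bytes per
	// cell instead of 88, and perfectly sequential.
	sum := 0.0
	for _, v := range cells.values {
		sum += v
	}
	fmt.Println(sum, cells.formulaIDs[2]) // 42 7
}
```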
<figure><img src="/images/causal/image1.png" alt="" width="1283" height="477" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>None of this is particularly hard to do, but it wasn’t until now that we realized how paramount this was to the engine’s performance! Unfortunately, the profiler isn’t yet helpful enough to tell you that reducing the size of a struct below that 64-byte threshold can lead to non-linear performance increases. You need to know to use tools like <a href="https://linux.die.net/man/1/pahole"><code>pahole(1)</code></a> for that.</p>
<h3 id="iteration-5-float64-w-parallel-arrays-20ms">Iteration 5: <code>[]float64</code> w/ Parallel Arrays, 20ms</h3>
<p>If we want to find the absolute speed-limit for Causal’s performance then, we’d want to imagine that the Cell is just:</p>
<pre class="language-go"><code class="language-go"><span class="token keyword">type</span> Cell8 <span class="token keyword">struct</span> <span class="token punctuation">{</span>
    value   <span class="token builtin">float64</span>
<span class="token punctuation">}</span>
</code></pre>
<p>That’s a total memory usage of 30M ⋅ 8 bytes ≈ 228 MiB, which we can read at 35 μs/MiB <a href="https://github.com/sirupsen/napkin-math#numbers">in a threaded</a> program, so ~8ms.
We won’t get much faster than this, since we also inevitably have to spend time allocating the memory.</p>
<p>When <a href="https://gist.github.com/sirupsen/d413b130d0f45d0d35d0bc85b9071abb">implemented</a>, the raw floats take ~20ms (consider that we have to allocate the memory too) for our 30M cells.</p>
<p>Let’s scale it up. For <a href="https://gist.github.com/sirupsen/d413b130d0f45d0d35d0bc85b9071abb#file-simulator-go-L38">1B cells</a>, this takes ~3.5s. That’s pretty good! Especially considering that the calculation engine already has a lot of caching to ensure we don’t have to re-evaluate every cell in the sheet. But we want to make sure that the worst case of evaluating the entire sheet performs well, and that we have some headroom for inevitable overhead.</p>
<p>Our initial napkin math suggested we could get to ~700ms for 3B cells, so there’s a bit of a gap. We get to ~2.4s for 1B cells by <a href="https://gist.github.com/sirupsen/d413b130d0f45d0d35d0bc85b9071abb#file-simulator-go-L133">moving allocations into the threads that actually need them</a>; closing the gap further would take more investigation. However, localizing allocations starts to get into territory that would be quite hard to implement generically in practice, so we’ll stop around here until we have the luxury of this problem being the bottleneck. There’s plenty of work to make all these transitions in a big, production code-base!</p>
<h3 id="iteration-n-simd-compression-gpu-">Iteration N: SIMD, compression, GPU …</h3>
<p>That said, there are <em>lots</em> of optimizations left. Go’s compiler currently doesn’t emit SIMD instructions, which would let us extract even more memory bandwidth. Another common optimization path for number-heavy programs is to encode the numbers, e.g. with delta-encoding. Because we’re constrained by memory bandwidth more than compute, compression can, counter-intuitively, make the program <em>faster</em>: the CPU stalls for many cycles while waiting for memory access, and we can use those otherwise wasted cycles for the simple arithmetic needed to decompress.</p>
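<p>A minimal sketch of the delta-encoding idea in Go, decoding on the fly while summing. The data here is made up for illustration, and a real implementation would store the deltas in a narrower type to actually save bandwidth:</p>

```go
package main

import "fmt"

// deltaEncode stores each value as the difference from its predecessor.
// For slowly-changing series, deltas are small and compress well,
// trading cheap arithmetic for scarce memory bandwidth.
func deltaEncode(values []float64) []float64 {
	deltas := make([]float64, len(values))
	prev := 0.0
	for i, v := range values {
		deltas[i] = v - prev
		prev = v
	}
	return deltas
}

// sumDecoded reconstructs the original values on the fly while summing,
// using CPU cycles that would otherwise stall waiting on memory.
func sumDecoded(deltas []float64) float64 {
	var cur, sum float64
	for _, d := range deltas {
		cur += d // decode: a running prefix sum restores the value
		sum += cur
	}
	return sum
}

func main() {
	values := []float64{100, 101, 103, 102}
	deltas := deltaEncode(values)   // [100, 1, 2, -1]
	fmt.Println(sumDecoded(deltas)) // prints: 406
}
```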
<p>Another trend from the AI community for number-crunching is to leverage GPUs, which have <em>enormous</em> memory bandwidth. However, moving data back and forth between the CPU and GPU can itself become a serious bottleneck. We’d have to learn what kinds of models could take advantage of this, since we have little experience with GPUs as a team, but we may be able to reuse the many existing ND-array implementations used for training neural nets. This would come with significant complexity, but also serious performance improvements for large models.</p>
<p>Either way there’s <em>plenty</em> of work to get to the faster, simpler design described above in the code-base. This would be further out, but makes us excited about the engineering ahead of us!</p>
<h3 id="conclusion">Conclusion</h3>
<p>Profiling had become a dead end for making the calculation engine faster, so we needed a different approach. Rethinking the core data structure from first principles, and understanding exactly why each part of the current data structure and its access patterns was slow, got us out of disappointing single-digit percentage performance improvements and unlocked order-of-magnitude ones. This way of thinking about designing software is often referred to as data-oriented design, and <a href="https://media.handmade-seattle.com/practical-data-oriented-design/">this talk by Andrew Kelley</a>, the author of the Zig compiler, is an excellent primer that inspired the team.</p>
<p>With these results, we were able to build a technical roadmap for incrementally moving the engine towards a more data-oriented design. The reality is <em>far</em> more complicated, as the calculation engine is north of 40K lines of code. But this investigation gave us confidence in the effort required to change the core of how the engine works, and in the performance improvements that will come over time!</p>
<p>The biggest performance take-aways for us were:</p>
<ol>
<li>When you’re stuck with performance on profilers, start thinking about the problem from first principles</li>
<li>Use indices, not pointers, when possible</li>
<li>Use array of structs when you access almost everything all the time, use struct of arrays when you don’t</li>
<li>Use arrays instead of maps when possible; the data needs to be <em>very</em> sparse for the memory savings to be worth it</li>
<li>Memory bandwidth is precious, and you can’t just parallelize your way out of it!</li>
</ol>
<p>Causal doesn’t smoothly support 1 billion cells yet, but we feel confident in
our ability to iterate our way there. Since starting this work, our small team
has already improved performance more than 3x on real models. If you’re
interested in working on this with Causal, helping them get to 10s of billions of
cells, you should consider joining the Causal team — email
<a href="mailto:lukas@causal.app">lukas@causal.app</a>!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Metrics For Your Web Application's Dashboards]]></title>
        <id>https://sirupsen.com/metrics</id>
        <link href="https://sirupsen.com/metrics"/>
        <updated>2022-03-19T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Whenever I create a dashboard for an application, it’s generally the same
handful of metrics I look to. They’re the ones I always use to orient myself
quickly when Pagerduty fires. They give me the grand overview, and then I’ll
know what logging queries to start writing, code to look at, box to SSH into, or
mitigation to activate. The same metrics are able to tell me during the day
whether the system is ok, and I use them to do napkin math on e.g. capacity
planning and imminent bottlenecks:]]></summary>
        <content type="html"><![CDATA[<p>Whenever I create a dashboard for an application, it’s generally the same
handful of metrics I look to. They’re the ones I always use to orient myself
quickly when Pagerduty fires. They give me the grand overview, and then I’ll
know what logging queries to start writing, code to look at, box to SSH into, or
mitigation to activate. The same metrics are able to tell me during the day
whether the system is ok, and I use them to do napkin math on e.g. capacity
planning and imminent bottlenecks:</p>
<ul>
<li><strong>Web Backend (e.g. Django, Node, Rails, Go, ..)</strong>
<ul>
<li>Response Time <code>p50</code>, <code>p90</code>, <code>p99</code>, <code>sum</code>, <code>avg</code> †</li>
<li>Throughput by HTTP status †</li>
<li><strong>Worker Utilization</strong> <sup><a href="#user-content-fn-web-utilization" id="user-content-fnref-web-utilization" data-footnote-ref="true" aria-describedby="footnote-label">1</a></sup></li>
<li>Request Queuing Time <sup><a href="#user-content-fn-web-queue" id="user-content-fnref-web-queue" data-footnote-ref="true" aria-describedby="footnote-label">2</a></sup></li>
<li>Service calls †
<ul>
<li>Database(s), caches, internal services, third-party APIs, ..</li>
<li>Enqueued jobs are important!</li>
<li><a href="https://sirupsen.com/napkin/problem-11-circuit-breakers">Circuit Breaker tripping</a> † <code>/min</code></li>
<li>Errors, throughput, latency <code>p50</code>, <code>p90</code>, <code>p99</code></li>
</ul>
</li>
<li>Throttling †</li>
<li>Cache hits and misses <code>%</code> †</li>
<li>CPU and Memory Utilization</li>
<li>Exception counts † <code>/min</code></li>
</ul>
</li>
<li><strong>Job Backend (e.g. Sidekiq, Celery, Bull, ..)</strong>
<ul>
<li>Job Execution Time <code>p50</code>, <code>p90</code>, <code>p99</code>, <code>sum</code>, <code>avg</code> †</li>
<li>Throughput by Job Status <code>{error, success, retry}</code> †</li>
<li>Worker Utilization <sup><a href="#user-content-fn-job-utilization" id="user-content-fnref-job-utilization" data-footnote-ref="true" aria-describedby="footnote-label">3</a></sup></li>
<li><strong>Time in Queue</strong> † <sup><a href="#user-content-fn-job-time-in-queue" id="user-content-fnref-job-time-in-queue" data-footnote-ref="true" aria-describedby="footnote-label">4</a></sup></li>
<li><strong>Queue Sizes</strong> † <sup><a href="#user-content-fn-job-queue-size" id="user-content-fnref-job-queue-size" data-footnote-ref="true" aria-describedby="footnote-label">5</a></sup>
<ul>
<li>Don’t forget scheduled jobs and retries!</li>
</ul>
</li>
<li>Service calls <code>p50</code>, <code>p90</code>, <code>p99</code>, <code>count</code>, <code>by type</code> †</li>
<li>Throttling †</li>
<li>CPU and Memory Utilization</li>
<li>Exception counts † <code>/min</code></li>
</ul>
</li>
</ul>
<p><em>† Metrics where you <strong>need</strong> the ability to slice by <code>endpoint</code> or <code>job</code>,
<code>tenant_id</code>, <code>app_id</code>, <code>worker_id</code>, <code>zone</code>, <code>hostname</code>, and <code>queue</code> (for jobs).
This is paramount to be able to figure out if it’s a single endpoint, tenant, or
app that’s causing problems.</em></p>
<p>You can likely cobble a workable chunk of this together from your existing
service provider and APM. The value is for you to know what metrics to pay
attention to, and which key ones you’re missing. The holy grail is <em>one</em>
dashboard for web, and one for job. The more incidents you have, the more
problematic it becomes that you need to visit a dozen URLs to get the metrics
you need.</p>
<p>If you have little of this and need somewhere to start, start with logs. They’re
the lowest common denominator, and if you’re productive in a good logging system
that will take you <em>very</em> far. You can build all these dashboards with logs alone.
Jumping into the detailed logs is usually the next step you take during an
incident if it’s not immediately clear what to do from the metrics.</p>
<p>Use the <a href="https://stripe.com/blog/canonical-log-lines">canonical log line pattern</a> (see figure below), resist
emitting random logs throughout the request as this makes analysis difficult. A
canonical log line is a log emitted at the end of the request with everything
that happened during the request. This makes querying the logs bliss.</p>
<figure><img src="/images/canonical-log-line.png" alt="An example of a canonical log line with a subset of the metrics above, generously provided by &lt;a href=&#x27;https://readwise.io&#x27;&gt;Readwise.io&lt;/a&gt;, who I helped set up canonical log lines for." title="An example of a canonical log line with a subset of the metrics above, generously provided by &lt;a href=&#x27;https://readwise.io&#x27;&gt;Readwise.io&lt;/a&gt;, who I helped set up canonical log lines for." width="864" height="1008" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>An example of a canonical log line with a subset of the metrics above, generously provided by <a href='https://readwise.io'>Readwise.io</a>, who I helped set up canonical log lines for.</figcaption></figure>
<p>Surprisingly, there aren’t good libraries available for the canonical log line
pattern, so I recommend rolling your own. Create a middleware in your job and
web stack to emit the log at the end of the request. If you need to accumulate
metrics throughout the request for the canonical log line, create a thread-local
dictionary for them that you flush in the middleware.</p>
<p>For response time from services, you will need to emit inline logs or metrics.
Consider using an <a href="https://opentelemetry.io/">OpenTelemetry library</a> so you only need to
instrument once and can later add sinks for canonical logs (the sum), metrics,
profiling, and traces.</p>
<p>Notably absent here is monitoring a database, which would take its own post.</p>
<p>Hope this helps you step up your monitoring game. If there’s a metric you feel
strongly that’s missing, please let me know!</p>
<section data-footnotes="true" class="footnotes"><h2 class="sr-only" id="footnote-label">Footnotes</h2>
<ol>
<li id="user-content-fn-web-utilization">
<p>This is one of my favorites. What percentage of threads are
currently busy? If this is <code>&gt;80%</code>, you will start to see counter-intuitive
queuing theory take hold, yielding strange response time patterns.<br/><br/>
It is given as <code>busy_threads / total_threads</code>. <a href="#user-content-fnref-web-utilization" data-footnote-backref="" aria-label="Back to reference 1" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-web-queue">
<p>How long are requests spending in TCP/proxy queues before being
picked up by a thread? Typically you get this by your load-balancer stamping
the request with a <code>X-Request-Start</code> header, then subtracting that from the
current time in the worker thread. <a href="#user-content-fnref-web-queue" data-footnote-backref="" aria-label="Back to reference 2" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-job-utilization">
<p>Same idea as web utilization, but in this case it’s OK for
it to be &gt; 80% for periods of time as jobs are by design allowed to be in the
queue for a while. The central metric for jobs becomes time in queue. <a href="#user-content-fnref-job-utilization" data-footnote-backref="" aria-label="Back to reference 3" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-job-time-in-queue">
<p>The central metric for monitoring a job stack is to know
how long jobs spend in the queue. That will be what you can use to answer
questions such as: Do I need more workers? When will I recover? What’s the
experience for my users right now? <a href="#user-content-fnref-job-time-in-queue" data-footnote-backref="" aria-label="Back to reference 4" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-job-queue-size">
<p>How large is your queue right now? It’s especially amazing to
be able to slice this by job and queue, but your canonical logs, which record
how much has been enqueued, are typically sufficient. <a href="#user-content-fnref-job-queue-size" data-footnote-backref="" aria-label="Back to reference 5" class="data-footnote-backref">↩</a></p>
</li>
</ol>
</section>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 18: Neural Network From Scratch]]></title>
        <id>https://sirupsen.com/napkin/neural-net</id>
        <link href="https://sirupsen.com/napkin/neural-net"/>
        <updated>2022-01-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[In this edition of Napkin Math, we’ll invoke the spirit of the Napkin Math
series to establish a mental model for how a neural network works by building
one from scratch. In a future issue we will do napkin math on performance, as
establishing the first-principle understanding is plenty of ground to cover for
today!
Neural nets are increasingly dominating the field of machine learning / artificial
intelligence: the most sophisticated models for computer vision (e.g. CLIP),
natural lang]]></summary>
        <content type="html"><![CDATA[<p>In this edition of Napkin Math, we’ll invoke the spirit of the Napkin Math
series to establish a mental model for how a neural network works by building
one from scratch. In a future issue we will do napkin math on performance, as
establishing the first-principle understanding is plenty of ground to cover for
today!</p>
<p>Neural nets are increasingly dominating the field of machine learning / artificial
intelligence: the most sophisticated models for computer vision (e.g. CLIP),
natural language processing (e.g. GPT-3), translation (e.g. Google Translate),
and more are based on neural nets. When these artificial neural nets reach some
arbitrary threshold of neurons, we call it <em>deep learning</em>.</p>
<p>A visceral example of Deep Learning’s unreasonable effectiveness comes from
<a href="https://www.listennotes.com/podcasts/the-twiml-ai/systems-and-software-for-xolUkM23Gb0/">this interview</a> with Jeff Dean who leads AI at Google. He explains how
500 lines of TensorFlow outperformed the previous ~500,000 lines of code for
Google Translate’s <em>extremely complicated</em> model. Blew my mind. <sup><a href="#user-content-fn-google" id="user-content-fnref-google" data-footnote-ref="true" aria-describedby="footnote-label">1</a></sup></p>
<p>As a software developer with a predominantly web-related skillset of Ruby,
databases, enough distributed systems knowledge to know to not get fancy, a bit
of hard-earned systems knowledge from debugging incidents, but only high school
level math: <em>neural networks mystify me</em>. How do they work? Why are they so
good? Why are they so slow? Why are GPUs/TPUs used to speed them up? Why do the
biggest models have more neurons than humans, yet still perform worse than the
human brain? <sup><a href="#user-content-fn-gpt3" id="user-content-fnref-gpt3" data-footnote-ref="true" aria-describedby="footnote-label">2</a></sup></p>
<p>In true napkin math fashion, the best course of action to answer those questions
is by implementing a simple neural net from scratch.</p>
<h2 id="mental-model-for-a-neural-net-building-one-from-scratch">Mental Model for a Neural Net: Building one from scratch</h2>
<p>The hardest part of napkin math isn’t the calculation itself: it’s acquiring the
conceptual understanding of a system to come up with an equation for its
performance. Presenting and testing mental models of common systems is the crux
of value from the napkin math series!</p>
<p>The simplest neural net we can draw might look something like this:</p>
<figure><img src="/images/napkin/problem-17-neural-nets/mental-model.jpg" alt="" width="687" height="598" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<ul>
<li><strong>Input layer</strong>. This is a representation of the data that we want to feed to
the neural net. For example, the input layer for a 4x4 pixel grayscale image
that looks like this <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[1, 1, 1, 0.2]</title><rect fill="#000000" x="0" y="0" width="1" height="1"></rect><rect fill="#000000" x="1" y="0" width="1" height="1"></rect><rect fill="#000000" x="0" y="1" width="1" height="1"></rect><rect fill="#E3E3E3" x="1" y="1" width="1" height="1"></rect></svg> could be <code>[1, 1, 1, 0.2]</code>. Meaning the first 3 pixels are darkest (1.0) and the last pixel is
lighter (0.2).</li>
<li><strong>Hidden Layer</strong>. This is the layer that does a bunch of math on the input
layer to convert it to our prediction. <em>Training</em> a model refers to changing the
math of the hidden layer(s) to more often create an output like the training
data. We will go into more detail with this layer in a moment. The values in the
hidden layer are called <em>weights</em>.</li>
<li><strong>Output Layer</strong>. This layer will contain our final prediction. For example,
if we feed it the rectangle from before <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[1, 1, 1, 0.2]</title><rect fill="#000000" x="0" y="0" width="1" height="1"></rect><rect fill="#000000" x="1" y="0" width="1" height="1"></rect><rect fill="#000000" x="0" y="1" width="1" height="1"></rect><rect fill="#E3E3E3" x="1" y="1" width="1" height="1"></rect></svg> we
might want the output layer to be a single number to represent how “dark” a
rectangle is, e.g.: <code>0.8</code>.</li>
</ul>
<p>For example for the image <code><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.8, 0.7, 1, 1]</title><rect fill="#4C4C4C" x="0" y="0" width="1" height="1"></rect><rect fill="#656565" x="1" y="0" width="1" height="1"></rect><rect fill="#000000" x="0" y="1" width="1" height="1"></rect><rect fill="#000000" x="1" y="1" width="1" height="1"></rect></svg> = [0.8, 0.7, 1, 1]</code> we’d expect a value close to 1 (dark!).</p>
<p>In contrast, for <code><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.2, 0.5, 0.4, 0.7]</title><rect fill="#E3E3E3" x="0" y="0" width="1" height="1"></rect><rect fill="#989898" x="1" y="0" width="1" height="1"></rect><rect fill="#B1B1B1" x="0" y="1" width="1" height="1"></rect><rect fill="#656565" x="1" y="1" width="1" height="1"></rect></svg> = [0.2, 0.5, 0.4, 0.7]</code> we
expect something closer to 0 than to 1.</p>
<p>Let’s implement a neural network from our simple mental model. The goal of this
neural network is to take a grayscale 2x2 image and tell us how “dark” it is
where 0 is completely white <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0, 0, 0, 0]</title><rect fill="#FFFFFF" x="0" y="0" width="1" height="1"></rect><rect fill="#FFFFFF" x="1" y="0" width="1" height="1"></rect><rect fill="#FFFFFF" x="0" y="1" width="1" height="1"></rect><rect fill="#FFFFFF" x="1" y="1" width="1" height="1"></rect></svg>, and 1 is
completely black <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[1, 1, 1, 1]</title><rect fill="#000000" x="0" y="0" width="1" height="1"></rect><rect fill="#000000" x="1" y="0" width="1" height="1"></rect><rect fill="#000000" x="0" y="1" width="1" height="1"></rect><rect fill="#000000" x="1" y="1" width="1" height="1"></rect></svg>. We will initialize the
hidden layer with some random values at first, in Python:</p>
<pre class="language-python"><code class="language-python">input_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span> <span class="token number">0.5</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.7</span><span class="token punctuation">]</span>
<span class="token comment"># We randomly initialize the weights (values) for the hidden layer... We will</span>
<span class="token comment"># need to &quot;train&quot; to make these weights give us the output layers we desire. We</span>
<span class="token comment"># will cover that shortly!</span>
hidden_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.98</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.86</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">0.08</span><span class="token punctuation">]</span>

output_neuron <span class="token operator">=</span> <span class="token number">0</span>
<span class="token comment"># This is really matrix multiplication. We explicitly _do not_ use a</span>
<span class="token comment"># matrix/tensor, because they add overhead to understanding what happens here</span>
<span class="token comment"># unless you work with them every day--which you probably don&#x27;t. More on using</span>
<span class="token comment"># matrices later.</span>
<span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>input_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>output_neuron<span class="token punctuation">)</span>
<span class="token comment"># =&gt; 0.68</span>
</code></pre>
<p>Our neural network is giving us <code>model(<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.2, 0.5, 0.4, 0.7]</title><rect fill="#E3E3E3" x="0" y="0" width="1" height="1"></rect><rect fill="#989898" x="1" y="0" width="1" height="1"></rect><rect fill="#B1B1B1" x="0" y="1" width="1" height="1"></rect><rect fill="#656565" x="1" y="1" width="1" height="1"></rect></svg>) = 0.7</code> which is closer to ‘dark’ (1.0) than ‘light’ (0.0). When looking
at this rectangle as a human, we judge it to be more bright than dark, so we
were expecting something below 0.5!</p>
<div class="_articleNotice_uux7j_123"><p>There’s a <a href="https://colab.research.google.com/drive/1YRp9k_ORH4wZMqXLNkc3Ir5w4B5f-8Pa?usp=sharing">notebook</a> with the final code available. You can make a copy and execute it there. For early versions of the code, such as the above, you can create a new cell at the beginning of the notebook and build up from there!</p></div>
<p>The only real thing we can change in our neural network in its current form is
the hidden layer’s values. How do we change the hidden layer values so that the
output neuron is close to 1 when the rectangle is dark, and close to 0 when it’s
light?</p>
<p>We could abandon this approach and just take the average of all the pixels. That
would work well! However, that’s not really the point of a neural net… We’ll
hit an impasse if we one day expand our model to try to implement
<code>recognize_letters_from_picture(img)</code> or <code>is_cat(img)</code>.</p>
<p>Fundamentally, a neural network is just a way to approximate any function. It’s
really hard to sit down and write <code>is_cat</code>, but the same technique we’re using
to implement <code>average</code> through a neural network can be used to implement
<code>is_cat</code>. This is called the <a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">universal approximation theorem</a>: an
artificial neural network can approximate <em>any</em> function!</p>
<p>So, let’s try to teach our simple neural network to take the <code>average()</code> of the
pixels instead of explicitly telling it that that’s what we want! The idea of
this walkthrough example is to understand a neural net with very few values and
low complexity, otherwise it’s difficult to develop an intuition when we move to
1,000s of values and 10s of layers, as real neural networks have.</p>
<p>We can observe that if we <em>manually modify</em> all the hidden layer attributes to
<code>0.25</code>, our neural network is actually an average function!</p>
<pre class="language-python"><code class="language-python">input_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span> <span class="token number">0.5</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.7</span><span class="token punctuation">]</span>
hidden_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.25</span><span class="token punctuation">,</span> <span class="token number">0.25</span><span class="token punctuation">,</span> <span class="token number">0.25</span><span class="token punctuation">,</span> <span class="token number">0.25</span><span class="token punctuation">]</span>

output_neuron <span class="token operator">=</span> <span class="token number">0</span>
<span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>input_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>

<span class="token comment"># Two simple ways of calculating the same thing!</span>
<span class="token comment">#</span>
<span class="token comment"># 0.2 * 0.25 + 0.5 * 0.25 + 0.4 * 0.25 + 0.7 * 25 = 0.45</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>output_neuron<span class="token punctuation">)</span>
<span class="token comment"># Here, we divide by 4 to get the average instead of</span>
<span class="token comment"># multiplying each element.</span>
<span class="token comment">#</span>
<span class="token comment"># (0.2 + 0.5 + 0.4 + 0.7) / 4 = 0.45</span>
<span class="token keyword">print</span><span class="token punctuation">(</span><span class="token builtin">sum</span><span class="token punctuation">(</span>input_layer<span class="token punctuation">)</span> <span class="token operator">/</span> <span class="token number">4</span><span class="token punctuation">)</span>
</code></pre>
<p><code>model(<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.2, 0.5, 0.4, 0.7]</title><rect fill="#E3E3E3" x="0" y="0" width="1" height="1"></rect><rect fill="#989898" x="1" y="0" width="1" height="1"></rect><rect fill="#B1B1B1" x="0" y="1" width="1" height="1"></rect><rect fill="#656565" x="1" y="1" width="1" height="1"></rect></svg>) = 0.45</code> sounds about right. The
rectangle is a little more light than dark.</p>
<p>But that was cheating! We only showed that we <em>can</em> implement <code>average()</code> by
simply changing the hidden layer’s values. But that won’t work if we try to implement
something more complicated. Let’s go back to our original hidden layer
initialized with random values:</p>
<pre class="language-python"><code class="language-python">hidden_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.98</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.86</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">0.08</span><span class="token punctuation">]</span>
</code></pre>
<p>How can we <em>teach</em> our neural network to implement <code>average</code>?</p>
<h2 id="training-our-neural-network">Training our Neural Network</h2>
<p>To teach our model, we need to create some training data. We’ll create some
rectangles and calculate their average:</p>
<pre class="language-python"><code class="language-python">rectangles <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
rectangle_average <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>

<span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">1000</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Generate a 2x2 rectangle [0.1, 0.8, 0.6, 1.0]</span>
    rectangle <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token builtin">round</span><span class="token punctuation">(</span>random<span class="token punctuation">.</span>random<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
                 <span class="token builtin">round</span><span class="token punctuation">(</span>random<span class="token punctuation">.</span>random<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
                 <span class="token builtin">round</span><span class="token punctuation">(</span>random<span class="token punctuation">.</span>random<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
                 <span class="token builtin">round</span><span class="token punctuation">(</span>random<span class="token punctuation">.</span>random<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">]</span>
    rectangles<span class="token punctuation">.</span>append<span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span>
    <span class="token comment"># Take the _actual_ average for our training dataset!</span>
    rectangle_average<span class="token punctuation">.</span>append<span class="token punctuation">(</span><span class="token builtin">sum</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span> <span class="token operator">/</span> <span class="token number">4</span><span class="token punctuation">)</span>
</code></pre>
<p>Brilliant, so we can now feed these to our little neural network and get a
result! Next step is for our neural network to adjust the values in the hidden
layer based on how its output compares with the actual average in the training
data. This is called our <code>loss</code> function: large loss, very wrong model; small
loss, less wrong model. We can use a standard measure called <a href="https://en.wikipedia.org/wiki/Mean_squared_error"><em>mean squared
error</em></a>:</p>
<pre class="language-python"><code class="language-python"><span class="token comment"># Take the average of all the differences squared!</span>
<span class="token comment"># This calculates how &quot;wrong&quot; our predictions are.</span>
<span class="token comment"># This is called our &quot;loss&quot;.</span>
<span class="token keyword">def</span> <span class="token function">mean_squared_error</span><span class="token punctuation">(</span>actual<span class="token punctuation">,</span> expected<span class="token punctuation">)</span><span class="token punctuation">:</span>
    error_sum <span class="token operator">=</span> <span class="token number">0</span>
    <span class="token keyword">for</span> a<span class="token punctuation">,</span> b <span class="token keyword">in</span> <span class="token builtin">zip</span><span class="token punctuation">(</span>actual<span class="token punctuation">,</span> expected<span class="token punctuation">)</span><span class="token punctuation">:</span>
        error_sum <span class="token operator">+=</span> <span class="token punctuation">(</span>a <span class="token operator">-</span> b<span class="token punctuation">)</span> <span class="token operator">**</span> <span class="token number">2</span>
    <span class="token keyword">return</span> error_sum <span class="token operator">/</span> <span class="token builtin">len</span><span class="token punctuation">(</span>actual<span class="token punctuation">)</span>

<span class="token keyword">print</span><span class="token punctuation">(</span>mean_squared_error<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1.</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token number">2.</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># =&gt; 1.0</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>mean_squared_error<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">1.</span><span class="token punctuation">]</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token number">3.</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># =&gt; 4.0</span>
</code></pre>
<p>Now we can implement <code>train()</code>:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">model</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">=</span> <span class="token number">0.</span>
    <span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span><span class="token punctuation">:</span>
        output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>
    <span class="token keyword">return</span> output_neuron

<span class="token keyword">def</span> <span class="token function">train</span><span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
  outputs <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
  <span class="token keyword">for</span> rectangle <span class="token keyword">in</span> rectangles<span class="token punctuation">:</span>
      output <span class="token operator">=</span> model<span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>
      outputs<span class="token punctuation">.</span>append<span class="token punctuation">(</span>output<span class="token punctuation">)</span>
  <span class="token keyword">return</span> outputs

hidden_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.98</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.86</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">0.08</span><span class="token punctuation">]</span>
outputs <span class="token operator">=</span> train<span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>

<span class="token keyword">print</span><span class="token punctuation">(</span>outputs<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">:</span><span class="token number">10</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
<span class="token comment"># [1.472, 0.7, 1.369, 0.8879, 1.392, 1.244, 0.644, 1.1179, 0.474, 1.54]</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>rectangle_average<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">:</span><span class="token number">10</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
<span class="token comment"># [0.575, 0.45, 0.549, 0.35, 0.525, 0.475, 0.425, 0.65, 0.4, 0.575]</span>
mean_squared_error<span class="token punctuation">(</span>outputs<span class="token punctuation">,</span> rectangle_average<span class="token punctuation">)</span>
<span class="token comment"># 0.4218</span>
</code></pre>
<p>A good mean squared error is close to 0. Our model isn’t very good. But! We’ve
got the skeleton of a feedback loop in place for updating the hidden layer.</p>
<h3 id="updating-the-hidden-layer-with-gradient-descent">Updating the Hidden Layer with Gradient Descent</h3>
<p>Now what we need is a way to update the hidden layer in response to the mean
squared error / loss. We need to <em>minimize</em> the value of this function:</p>
<pre class="language-python"><code class="language-python">mean_squared_error<span class="token punctuation">(</span>
  train<span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">,</span>
  rectangle_average
<span class="token punctuation">)</span>
</code></pre>
<p>As noted earlier, the only thing we can really change here are the weights in
the hidden layer. How can we possibly know which weights will minimize this
function?</p>
<p>We could randomize the weights, calculate the loss (how wrong the model is,
in our case, with mean squared error), and then save the best ones we see after
some period of time.</p>
<p>We could possibly speed this up: start from the best weights so far, add some
small random numbers to them, and keep the change whenever the loss improves.
This could work, but it sounds slow… it’s likely to get stuck in a local
minimum and not give a very good result, and we’d have trouble scaling it to
1,000s of weights…</p>
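<p>This random-perturbation idea can be sketched in a few lines. This is a toy illustration only (it regenerates a small training set with a fixed seed), not how we’ll actually train:</p>

```python
import random

random.seed(0)

# Same tiny model as before: a weighted sum of the four pixel values.
def model(rectangle, weights):
    return sum(p * w for p, w in zip(rectangle, weights))

def mean_squared_error(actual, expected):
    return sum((a - b) ** 2 for a, b in zip(actual, expected)) / len(actual)

rectangles = [[round(random.random(), 1) for _ in range(4)] for _ in range(100)]
averages = [sum(r) / 4 for r in rectangles]

weights = [0.98, 0.4, 0.86, -0.08]
best_loss = mean_squared_error([model(r, weights) for r in rectangles], averages)

for _ in range(5000):
    # Nudge every weight by a small random amount...
    candidate = [w + random.uniform(-0.05, 0.05) for w in weights]
    loss = mean_squared_error([model(r, candidate) for r in rectangles], averages)
    # ...and keep the nudge only if the model got less wrong.
    if loss < best_loss:
        weights, best_loss = candidate, loss

print(best_loss)  # far below the initial ~0.42, but it took 5,000 guesses
```

<p>It gets there for four weights, but the number of guesses needed explodes as the weight count grows.</p>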
<p>Instead of embarking on this ad-hoc randomization mess, it turns out that
there’s a method called <em>gradient descent</em> to minimize the value of a function!
Gradient descent builds on a bit of calculus that you may not have touched on
since high school. We won’t go into depth here, but will try to introduce <em>just</em>
enough that you understand the concept. <sup><a href="#user-content-fn-3blue1brown" id="user-content-fnref-3blue1brown" data-footnote-ref="true" aria-describedby="footnote-label">3</a></sup></p>
<p>Let’s try to understand gradient descent. Consider some random function whose
graph might look like this:</p>
<figure><img src="/images/napkin/problem-17-neural-nets/function.png" alt="Graph of a random function with some irregular shapes" title="Graph of a function with an irregular curve with a local and global minimum." width="775" height="485" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Graph of a function with an irregular curve with a local and global minimum.</figcaption></figure>
<p>How do we write code to find the minimum, the deepest (second) valley, of this function?</p>
<p>Let’s say that we’re at <code>x=1</code> and we know the <em>slope</em> of the function at this
point. The slope is “how fast the function grows at this very point.” You may
remember this as <em>the derivative</em>. The slope at <code>x=1</code> might be <code>-1.5</code>. This
means that around this point, increasing <code>x</code> by <code>1</code> decreases <code>y</code> by about <code>1.5</code>. We’ll go
into how to figure out the slope in a bit; let’s focus on the concept first.</p>
<figure><img src="/images/napkin/problem-17-neural-nets/function-with-slope.png" alt="Graph function with some slope or derivative" width="1934" height="1256" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>The idea of gradient descent is that since we know the value of our function,
<code>y</code>, is decreasing as we increase <code>x</code>, we can move <code>x</code> proportionally to the
slope, against its sign. In other words, if we increase <code>x</code> by <code>1.5</code>, the
negative of the slope, we step towards the valley.</p>
<p>Let’s take that step of <code>x += 1.5</code>:</p>
<figure><img src="/images/napkin/problem-17-neural-nets/gradient-descent-overshoot.png" alt="Overshooting in gradient descent" width="1934" height="1256" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Ugh, it turns out we stepped <em>too</em> far, past the valley! If we repeat the
step, we’ll land somewhere on the left side of the valley, then bounce back to
the right side. We might <em>never</em> land at the bottom of the valley. Bummer.
Either way, this isn’t the <em>global minimum</em> of the function. We’ll return to that
in a moment!</p>
<p>We can fix the overstepping easily by taking smaller steps. Perhaps we should’ve
stepped by just <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.1</mn><mo>∗</mo><mn>1.5</mn><mo>=</mo><mn>0.15</mn></mrow><annotation encoding="application/x-tex">0.1 * 1.5 = 0.15</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.1</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">∗</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1.5</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.15</span></span></span></span> instead. That would’ve smoothly landed us at
the bottom of the valley. That multiplier, <code>0.1</code>, is called the <em>learning rate</em>
in gradient descent.</p>
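<p>With a learning rate in hand, the whole gradient descent loop fits in a few lines. As a sketch, let’s minimize <code>f(x) = x ** 2</code>, taking on faith for now that its slope at any <code>x</code> is <code>2 * x</code> (more on computing slopes shortly):</p>

```python
def slope(x):
    return 2 * x  # slope of f(x) = x ** 2 at this point

x = 2.0
learning_rate = 0.1
for _ in range(100):
    # Step against the slope, scaled down by the learning rate.
    x -= learning_rate * slope(x)

print(x)  # very close to 0.0, the bottom of the valley
```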
<figure><img src="/images/napkin/problem-17-neural-nets/minimum.png" alt="Minimum of function with gradient descent" width="1934" height="1256" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>But hang on, that’s not actually the minimum of the function. See that valley to
the right? That’s the <em>actual</em> global minimum. If our initial <code>x</code> value had been
e.g. 3, we might have found the global minimum instead of our local minimum.</p>
<p>Finding the global minimum of a function is <em>hard</em>. Gradient descent will give
us <em>a minimum</em>, but not <em>the minimum</em>. Unfortunately, it turns out it’s the best
weapon we have at our disposal. Especially when we have big, complicated
functions (like a neural net with millions of neurons). Gradient descent will
not always find the global minimum, but something <em>pretty</em> good.</p>
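<p>We can see this with a made-up one-dimensional function that has two valleys, <code>f(x) = (x ** 2 - 1) ** 2 + 0.3 * x</code> (its slope, <code>4 * x * (x ** 2 - 1) + 0.3</code>, is taken as a given here). Where we start determines which valley gradient descent rolls into:</p>

```python
def slope(x):
    # Slope of f(x) = (x ** 2 - 1) ** 2 + 0.3 * x, which has two valleys.
    return 4 * x * (x ** 2 - 1) + 0.3

def descend(x, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        x -= learning_rate * slope(x)
    return x

print(descend(-2.0))  # lands in the left valley, near x = -1.0 (the global minimum)
print(descend(2.0))   # lands in the right, shallower valley, near x = 1.0
```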
<p>This method of using the slope/derivative generalizes. For example, consider
optimizing a function in three dimensions. We can visualize gradient descent
here as <em>rolling a ball to the lowest point.</em> A big neural network has
1,000s of dimensions, but gradient descent still works to minimize the loss!</p>
<figure><img src="/images/napkin/problem-17-neural-nets/descent-3d.png" alt="Depicts a 3-dimensional graph, if we do gradient descent on this we might imagine it as rolling a ball down the hill." title="Depicts a 3-dimensional graph, if we do gradient descent on this we might imagine it as rolling a ball down the hill." width="760" height="624" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>Depicts a 3-dimensional graph, if we do gradient descent on this we might imagine it as rolling a ball down the hill.</figcaption></figure>
<h2 id="finalizing-our-neural-network-from-scratch">Finalizing our Neural Network from scratch</h2>
<p>Let’s summarize where we are:</p>
<ul>
<li>We can implement a simple neural net: <code>model()</code>.</li>
<li>Our neural net can figure out how <em>wrong</em> it is for a training set: <code>loss(train())</code>.</li>
<li>We have a method, <em>gradient descent</em>, for tuning our hidden layer’s weights
for the minimum loss. I.e. we have a method to adjust those four random values
in our hidden layer to take a <em>better</em> average as we iterate through the
training data.</li>
</ul>
<p>Now, let’s implement gradient descent and see if we can make our neural net
learn to take the average grayscale of our small rectangles:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">model</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">=</span> <span class="token number">0.</span>
    <span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span><span class="token punctuation">:</span>
        output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>
    <span class="token keyword">return</span> output_neuron

<span class="token keyword">def</span> <span class="token function">train</span><span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
  outputs <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
  <span class="token keyword">for</span> rectangle <span class="token keyword">in</span> rectangles<span class="token punctuation">:</span>
      output <span class="token operator">=</span> model<span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>
      outputs<span class="token punctuation">.</span>append<span class="token punctuation">(</span>output<span class="token punctuation">)</span>

  mean_squared_error<span class="token punctuation">(</span>outputs<span class="token punctuation">,</span> rectangle_average<span class="token punctuation">)</span>

  <span class="token comment"># We go through all the weights in the hidden layer. These correspond to all</span>
  <span class="token comment"># the weights of the function we&#x27;re trying to minimize the value of: our</span>
  <span class="token comment"># model, respective of its loss (how wrong it is).</span>
  <span class="token comment"># </span>
  <span class="token comment"># For each of the weights, we want to increase/decrease it based on the slope.</span>
  <span class="token comment"># Exactly like we showed in the one-weight example above with just x. Now</span>
  <span class="token comment"># we just have 4 values instead of 1! Big models have billions.</span>
  <span class="token keyword">for</span> index<span class="token punctuation">,</span> _ <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    learning_rate <span class="token operator">=</span> <span class="token number">0.1</span>
    <span class="token comment"># But... how do we get the slope/derivative?!</span>
    hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span> <span class="token operator">-=</span> learning_rate <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span><span class="token punctuation">.</span>slope

  <span class="token keyword">return</span> outputs

hidden_layer <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token number">0.98</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.86</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">0.08</span><span class="token punctuation">]</span>
train<span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>
</code></pre>
<h3 id="automagically-computing-the-slope-of-a-function-with-autograd">Automagically computing the slope of a function with <code>autograd</code></h3>
<p>The missing piece here is to figure out the <code>slope()</code> after we’ve gone through
our training set. Figuring out the slope/derivative at a certain point is
tricky. It involves a fair bit of math. I am not going to go into the math of
calculating derivatives. Instead, we’ll do what all the machine learning
libraries do: automatically calculate it. <sup><a href="#user-content-fn-nielsen" id="user-content-fnref-nielsen" data-footnote-ref="true" aria-describedby="footnote-label">4</a></sup></p>
<p>Minimizing the loss of a function is absolutely fundamental to machine learning.
The functions (neural networks) are <em>so</em> complicated that manually sitting down
to figure out the derivative like you might’ve done in high school is not
feasible. It’s the mathematical equivalent of writing assembly to implement a
website.</p>
<p>Let’s show one simple example of finding the derivative of a function, before we
let the computers do it all for us. If we have <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><msup><mi>x</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">f(x) = x^2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.8141em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span>, then you might
remember from calculus classes that the derivative is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>f</mi><mo mathvariant="normal" lspace="0em" rspace="0em">′</mo></msup><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><mn>2</mn><mi>x</mi></mrow><annotation encoding="application/x-tex">f&#x27;(x) = 2x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0019em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7519em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">′</span></span></span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">2</span><span class="mord mathnormal">x</span></span></span></span>. In other
words, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">f(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.10764em">f</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span>‘s slope at any point is <code>2x</code>, telling us it’s increasing
non-linearly. Well that’s exactly how we understand <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>x</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">x^2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8141em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span></span></span></span>, perfect! This means
that for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi><mo>=</mo><mn>2</mn></mrow><annotation encoding="application/x-tex">x = 2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">x</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">2</span></span></span></span> the slope is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>4</mn></mrow><annotation encoding="application/x-tex">4</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">4</span></span></span></span>.</p>
<p>With the basics in order, we can use an <code>autograd</code> package to avoid the messy
business of computing our own derivatives. <code>autograd</code> is an <em>automatic
differentiation engine</em>. <em>grad</em> stands for <em>gradient</em>, which we can think of as the
derivative/slope of a function with more than one parameter.</p>
<p>It’s best to show how it works by using our example from before:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">import</span> torch

<span class="token comment"># A tensor is a matrix in PyTorch. It is the fundamental data-structure of neural</span>
<span class="token comment"># networks. Here we say PyTorch, please keep track of the gradient/derivative</span>
<span class="token comment"># as I do all kinds of things to the parameter(s) of this tensor.</span>
x <span class="token operator">=</span> torch<span class="token punctuation">.</span>tensor<span class="token punctuation">(</span><span class="token number">2.</span><span class="token punctuation">,</span> requires_grad<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>

<span class="token comment"># At this point we&#x27;re applying our function f(x) = x^2.</span>
y <span class="token operator">=</span> x <span class="token operator">**</span> <span class="token number">2</span>

<span class="token comment"># This tells `autograd` to compute the derivative values for all the parameters</span>
<span class="token comment"># involved. Backward is neural network jargon for this operation, which we&#x27;ll</span>
<span class="token comment"># explain momentarily.</span>
y<span class="token punctuation">.</span>backward<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token comment"># And show us the lovely gradient/derivative, which is 4! Sick.</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>x<span class="token punctuation">.</span>grad<span class="token punctuation">)</span>
<span class="token comment"># =&gt; 4</span>
</code></pre>
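<p>To double-check that <code>autograd</code> got it right, we can compare its answer against the analytical derivative <code>2x</code> at a few points (a small sketch of mine, not from the original notebook):</p>

```python
import torch

# f(x) = x^2 has the derivative f'(x) = 2x; autograd should agree at any point.
for value in [0.5, 2.0, -3.0]:
    x = torch.tensor(value, requires_grad=True)
    y = x ** 2
    y.backward()
    assert x.grad.item() == 2 * value
```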
<p><code>autograd</code> is the closest to magic we get. I could do the most ridiculous stuff
with this tensor, and it’ll keep track of all the math operations applied and
have the ability to compute the derivative. We won’t go into how. Partly because
I don’t know how, and this post is long enough.</p>
<p>Just to convince you of this, we can be a little cheeky and do a bunch of random
stuff. I’m trying to really hammer this home, because this is what confused me
the most when learning about neural networks. It wasn’t obvious to me that a
neural network, including executing the loss function on the whole training set,
is <em>just</em> a function, and that however complicated it is, we can still take
its derivative and use gradient descent, even if it has so many dimensions that it can’t be
neatly visualized as a ball rolling down a hill.</p>
<p><code>autograd</code> doesn’t complain as we add complexity and will still calculate the
gradients. In this example we’ll even use a matrix/tensor with a few more elements and
calculate an average (like our loss function <code>mean_squared_error</code>), which is the
kind of thing we’ll calculate the gradients for in our neural network:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">import</span> random
<span class="token keyword">import</span> torch

x <span class="token operator">=</span> torch<span class="token punctuation">.</span>tensor<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span> <span class="token number">0.3</span><span class="token punctuation">,</span> <span class="token number">0.8</span><span class="token punctuation">,</span> <span class="token number">0.1</span><span class="token punctuation">]</span><span class="token punctuation">,</span> requires_grad<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>
y <span class="token operator">=</span> x

<span class="token keyword">for</span> _ <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">3</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
    choice <span class="token operator">=</span> random<span class="token punctuation">.</span>randint<span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">2</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> choice <span class="token operator">==</span> <span class="token number">0</span><span class="token punctuation">:</span>
        y <span class="token operator">=</span> y <span class="token operator">**</span> random<span class="token punctuation">.</span>randint<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">,</span> <span class="token number">10</span><span class="token punctuation">)</span>
    <span class="token keyword">elif</span> choice <span class="token operator">==</span> <span class="token number">1</span><span class="token punctuation">:</span>
        y <span class="token operator">=</span> y<span class="token punctuation">.</span>sqrt<span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token keyword">elif</span> choice <span class="token operator">==</span> <span class="token number">2</span><span class="token punctuation">:</span>
        y <span class="token operator">=</span> y<span class="token punctuation">.</span>atanh<span class="token punctuation">(</span><span class="token punctuation">)</span>

y <span class="token operator">=</span> y<span class="token punctuation">.</span>mean<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token comment"># This walks &quot;backwards&quot; y all the way to the parameters to</span>
<span class="token comment"># calculate the derivates / gradient! Pytorch keeps track of a graph of all the</span>
<span class="token comment"># operations.</span>
y<span class="token punctuation">.</span>backward<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token comment"># And here is how quickly the function is changing with respect to these</span>
<span class="token comment"># parameters for our randomized function.</span>
<span class="token keyword">print</span><span class="token punctuation">(</span>x<span class="token punctuation">.</span>grad<span class="token punctuation">)</span>
<span class="token comment"># =&gt; tensor([0.0157, 0.0431, 0.6338, 0.0028])</span>
</code></pre>
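<p>If this still feels like magic, one way to sanity-check it (my own sketch, using a deterministic chain of operations rather than the random one above) is to compare <code>autograd</code>’s gradient against a numerical finite difference:</p>

```python
import torch

def f(t):
    # An arbitrary chain of operations, in the spirit of the random example.
    return (t.sqrt() ** 3 + t).mean()

x = torch.tensor([0.2, 0.3, 0.8, 0.1], dtype=torch.float64, requires_grad=True)
f(x).backward()

# Nudge the first parameter up and down a tiny bit and measure the slope,
# the same "take a small step and see how much f changes" intuition as before.
h = 1e-6
up = x.detach().clone(); up[0] += h
down = x.detach().clone(); down[0] -= h
numeric = (f(up) - f(down)) / (2 * h)

assert abs(x.grad[0].item() - numeric.item()) < 1e-6
```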
<p>Let’s use <code>autograd</code> for our neural net and then run it against our square from
earlier <code>model(<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.2, 0.5, 0.4, 0.7]</title><rect fill="#E3E3E3" x="0" y="0" width="1" height="1"></rect><rect fill="#989898" x="1" y="0" width="1" height="1"></rect><rect fill="#B1B1B1" x="0" y="1" width="1" height="1"></rect><rect fill="#656565" x="1" y="1" width="1" height="1"></rect></svg>) = 0.45</code>:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">import</span> torch <span class="token keyword">as</span> torch

<span class="token keyword">def</span> <span class="token function">model</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">=</span> <span class="token number">0.</span>
    <span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span><span class="token punctuation">:</span>
        output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>
    <span class="token keyword">return</span> output_neuron

<span class="token keyword">def</span> <span class="token function">train</span><span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
  outputs <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
  <span class="token keyword">for</span> rectangle <span class="token keyword">in</span> rectangles<span class="token punctuation">:</span>
      output <span class="token operator">=</span> model<span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>
      outputs<span class="token punctuation">.</span>append<span class="token punctuation">(</span>output<span class="token punctuation">)</span>

  <span class="token comment"># How wrong were we? Our &#x27;loss.&#x27;</span>
  error <span class="token operator">=</span> mean_squared_error<span class="token punctuation">(</span>outputs<span class="token punctuation">,</span> rectangle_average<span class="token punctuation">)</span>

  <span class="token comment"># Calculate the gradient (the derivative for all our weights!)</span>
  <span class="token comment"># This walks &quot;backwards&quot; from the error all the way to the weights to</span>
  <span class="token comment"># calculate it.</span>
  error<span class="token punctuation">.</span>backward<span class="token punctuation">(</span><span class="token punctuation">)</span>

  <span class="token comment"># Now let&#x27;s go update the weights in our hidden layer per our gradient.</span>
  <span class="token comment"># This is what we discussed before: we want to find the valley of this</span>
  <span class="token comment"># four-dimensional space/four-weight function. This is gradient descent!</span>
  <span class="token keyword">for</span> index<span class="token punctuation">,</span> _ <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    learning_rate <span class="token operator">=</span> <span class="token number">0.1</span>
    <span class="token comment"># hidden_layer.grad is something like [0.7070, 0.6009, 0.6840, 0.5302]</span>
    hidden_layer<span class="token punctuation">.</span>data<span class="token punctuation">[</span>index<span class="token punctuation">]</span> <span class="token operator">-=</span> learning_rate <span class="token operator">*</span> hidden_layer<span class="token punctuation">.</span>grad<span class="token punctuation">.</span>data<span class="token punctuation">[</span>index<span class="token punctuation">]</span>

  <span class="token comment"># We have to tell `autograd` that we&#x27;ve just finished an epoch to reset.</span>
  <span class="token comment"># Otherwise it&#x27;d calculate the derivative from multiple epochs.</span>
  hidden_layer<span class="token punctuation">.</span>grad<span class="token punctuation">.</span>zero_<span class="token punctuation">(</span><span class="token punctuation">)</span>
  <span class="token keyword">return</span> error

<span class="token comment"># We use tensors now, but we just use them as if they were normal lists.</span>
<span class="token comment"># We only use them so we can get the gradients.</span>
hidden_layer <span class="token operator">=</span> torch<span class="token punctuation">.</span>tensor<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.98</span><span class="token punctuation">,</span> <span class="token number">0.4</span><span class="token punctuation">,</span> <span class="token number">0.86</span><span class="token punctuation">,</span> <span class="token operator">-</span><span class="token number">0.08</span><span class="token punctuation">]</span><span class="token punctuation">,</span> requires_grad<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>

<span class="token keyword">print</span><span class="token punctuation">(</span>model<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span><span class="token number">0.5</span><span class="token punctuation">,</span><span class="token number">0.4</span><span class="token punctuation">,</span><span class="token number">0.7</span><span class="token punctuation">]</span><span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># =&gt; 0.6840000152587891</span>

train<span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>

<span class="token comment"># The hidden layer&#x27;s weights are nudging closer to [0.25, 0.25, 0.25, 0.25]!</span>
<span class="token comment"># They are now [ 0.9093,  0.3399,  0.7916, -0.1330]</span>
<span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f&quot;After: </span><span class="token interpolation"><span class="token punctuation">{</span>model<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span><span class="token number">0.5</span><span class="token punctuation">,</span><span class="token number">0.4</span><span class="token punctuation">,</span><span class="token number">0.7</span><span class="token punctuation">]</span><span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">}</span></span><span class="token string">&quot;</span></span><span class="token punctuation">)</span>
<span class="token comment"># =&gt; 0.5753424167633057</span>
<span class="token comment"># The average of this rectangle is 0.45, closer... but not there yet</span>
</code></pre>
<p>This blew my mind the first time I did this. Look at that. It’s optimizing the
hidden layer for all weights in the right direction! We’re expecting them all
to nudge towards <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.25</mn></mrow><annotation encoding="application/x-tex">0.25</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.25</span></span></span></span> to implement <code>average()</code>. We haven’t told it <em>anything</em>
about average, we’ve just told it how wrong it is through the loss.</p>
<div class="_articleNotice_uux7j_123"><p>It’s important to understand how <code>hidden_layer.grad</code> is set here. The
hidden layer is instantiated as a tensor with an argument telling PyTorch to
keep track of all operations made to it. This allows us to later call <code>backward()</code> on a future tensor that derives from the hidden layer,
in this case, the <code>error</code> tensor, which is in turn derived from the
<code>outputs</code> tensor. You can read more in <a href="https://pytorch.org/docs/1.9.1/generated/torch.Tensor.backward.html">the documentation</a>.</p></div>
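<p>To make this concrete: every tensor derived from <code>hidden_layer</code> carries a <code>grad_fn</code> pointing back at the operation that produced it, and <code>backward()</code> walks that chain (a small sketch of mine):</p>

```python
import torch

hidden_layer = torch.tensor([0.98, 0.4, 0.86, -0.08], requires_grad=True)
rectangle = torch.tensor([0.2, 0.5, 0.4, 0.7])

output = (rectangle * hidden_layer).sum()  # like model()
error = (output - 0.45) ** 2               # like the loss, for one sample

# Each derived tensor remembers how it was made...
print(output.grad_fn)  # a SumBackward0 node
print(error.grad_fn)   # a PowBackward0 node

# ...and backward() follows those nodes back to fill in hidden_layer.grad.
error.backward()
print(hidden_layer.grad)
```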
<p><em>But</em>, the hidden layer isn’t all <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.25</mn></mrow><annotation encoding="application/x-tex">0.25</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.25</span></span></span></span> quite yet, as it must be to
implement <code>average</code>. So how do we get the weights there? Well, let’s repeat
the gradient descent process 100 times and see if we get even closer!</p>
<pre class="language-python"><code class="language-python"><span class="token comment"># An epoch is a training pass over the full data set!</span>
<span class="token keyword">for</span> epoch <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">100</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
   error <span class="token operator">=</span> train<span class="token punctuation">(</span>rectangles<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span>
   <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string-interpolation"><span class="token string">f&quot;Epoch: </span><span class="token interpolation"><span class="token punctuation">{</span>epoch<span class="token punctuation">}</span></span><span class="token string">, Error: </span><span class="token interpolation"><span class="token punctuation">{</span>error<span class="token punctuation">}</span></span><span class="token string">, Layer: </span><span class="token interpolation"><span class="token punctuation">{</span>hidden_layer<span class="token punctuation">.</span>data<span class="token punctuation">}</span></span><span class="token string">\n\n&quot;</span></span><span class="token punctuation">)</span>
   <span class="token comment"># </span>
   <span class="token comment">#  Epoch: 99, Error: 0.0019292341312393546, Layer: tensor([0.3251, 0.2291, 0.3075, 0.1395])</span>


<span class="token keyword">print</span><span class="token punctuation">(</span>model<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span><span class="token number">0.5</span><span class="token punctuation">,</span><span class="token number">0.4</span><span class="token punctuation">,</span><span class="token number">0.7</span><span class="token punctuation">]</span><span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">.</span>item<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># =&gt; 0.4002</span>
</code></pre>
<p>Pretty close, but not quite there. I ran it for <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>300</mn></mrow><annotation encoding="application/x-tex">300</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">300</span></span></span></span> epochs instead (an
iteration over the full training set is referred to as an epoch), and
got:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">print</span><span class="token punctuation">(</span>model<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">0.2</span><span class="token punctuation">,</span><span class="token number">0.5</span><span class="token punctuation">,</span><span class="token number">0.4</span><span class="token punctuation">,</span><span class="token number">0.7</span><span class="token punctuation">]</span><span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">.</span>item<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token comment"># Epoch: 299, Error: 1.8315197394258576e-06, Layer: tensor([0.2522, 0.2496, 0.2518, 0.2465])</span>
<span class="token comment"># tensor(0.4485, grad_fn=&lt;AddBackward0&gt;)</span>
</code></pre>
<p>Boom! Our neural net has <em>almost</em> learned to take the average, off by just a
scanty <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0.002</mn></mrow><annotation encoding="application/x-tex">0.002</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0.002</span></span></span></span>. If we fine-tuned the learning rate and number of epochs we could
probably get it there, but I’m happy with this. <code>model(<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2 2" width="18" height="18" style="position:relative;top:3px"><title>[0.2, 0.5, 0.4, 0.7]</title><rect fill="#E3E3E3" x="0" y="0" width="1" height="1"></rect><rect fill="#989898" x="1" y="0" width="1" height="1"></rect><rect fill="#B1B1B1" x="0" y="1" width="1" height="1"></rect><rect fill="#656565" x="1" y="1" width="1" height="1"></rect></svg>) = 0.448</code>.</p>
<p>That’s it. That’s your first neural net:</p>
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi><mo stretchy="false">(</mo><mi>r</mi><mi>e</mi><mi>c</mi><mi>t</mi><mi>a</mi><mi>n</mi><mi>g</mi><mi>l</mi><mi>e</mi><mo stretchy="false">)</mo><mo>≈</mo><mi>a</mi><mi>v</mi><mi>g</mi><mo stretchy="false">(</mo><mi>r</mi><mi>e</mi><mi>c</mi><mi>t</mi><mi>a</mi><mi>n</mi><mi>g</mi><mi>l</mi><mi>e</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">model(rectangle) \approx avg(rectangle)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">m</span><span class="mord mathnormal">o</span><span class="mord mathnormal">d</span><span class="mord mathnormal">e</span><span class="mord mathnormal" style="margin-right:0.01968em">l</span><span class="mopen">(</span><span class="mord mathnormal">rec</span><span class="mord mathnormal">t</span><span class="mord mathnormal">an</span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mord mathnormal" style="margin-right:0.01968em">l</span><span class="mord mathnormal">e</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">≈</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.03588em">vg</span><span class="mopen">(</span><span class="mord mathnormal">rec</span><span class="mord mathnormal">t</span><span class="mord mathnormal">an</span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mord mathnormal" style="margin-right:0.01968em">l</span><span class="mord mathnormal">e</span><span 
class="mclose">)</span></span></span></span>
<h2 id="ok-so-you-just-implemented-the-most-complicated-average-function-ive-ever-seen">OK, so you just implemented the most complicated <code>average</code> function I’ve ever seen…</h2>
<p>Sure did. The thing is, if we adjusted it to look for cats, it’s the
least complicated <code>is_cat</code> you’ll ever see, because our neural network could
implement that too, just by changing the training data. Remember, a neural network
with enough neurons can approximate <em>any function</em>. You’ve just learned all the
building blocks to do it; we merely started with the simplest possible example.</p>
<p>If you give the hidden layer some more neurons, this neural net will be able to
recognize <a href="http://yann.lecun.com/exdb/mnist/">handwritten numbers</a> with decent accuracy (possible fun
exercise for you, see bottom of article), like this one:</p>
<figure><img src="/images/napkin/problem-17-neural-nets/mnist-sample.png" alt="An upscaled version of a handdrawn 3 from the 28x28 MNIST dataset." title="An upscaled version of a handdrawn 3 from the 28x28 MNIST dataset." width="200" height="200" loading="lazy" decoding="async" style="max-width:100%;height:auto"/><figcaption>An upscaled version of a handdrawn 3 from the 28x28 MNIST dataset.</figcaption></figure>
<h3 id="activation-functions">Activation Functions</h3>
<p>To be truly powerful, there is one paramount modification we have to make to our
neural net. Above, we were implementing the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>a</mi><mi>v</mi><mi>e</mi><mi>r</mi><mi>a</mi><mi>g</mi><mi>e</mi></mrow><annotation encoding="application/x-tex">average</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.03588em">v</span><span class="mord mathnormal" style="margin-right:0.02778em">er</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.03588em">g</span><span class="mord mathnormal">e</span></span></span></span> function. However, were
our neural net to implement <code>which_digit(png)</code> or <code>is_cat(jpg)</code>, it wouldn’t work.</p>
<p>Recognizing handwritten digits isn’t a <em>linear</em> function like <code>average()</code>. It’s
non-linear: a crazy function with a crazy shape. To create crazy functions with
crazy shapes, we have to introduce a non-linear component to our neural
network. This is called an <em>activation</em>
function. It can be e.g. <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>R</mi><mi>e</mi><mi>L</mi><mi>u</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><mi>m</mi><mi>a</mi><mi>x</mi><mo stretchy="false">(</mo><mn>0</mn><mo separator="true">,</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">ReLu(x) = max(0, x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.00773em">R</span><span class="mord mathnormal">e</span><span class="mord mathnormal">Lu</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span><span class="mopen">(</span><span class="mord">0</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span>. There are many kinds of
<a href="https://en.wikipedia.org/wiki/Activation_function">activation functions</a> that are good for different things.
<sup><a href="#user-content-fn-activation" id="user-content-fnref-activation" data-footnote-ref="true" aria-describedby="footnote-label">5</a></sup></p>
<figure><img src="/images/napkin/problem-17-neural-nets/relu.png" alt="" width="400" height="333" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>We can apply this simple operation to our neural net:</p>
<pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">model</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">,</span> hidden_layer<span class="token punctuation">)</span><span class="token punctuation">:</span>
    output_neuron <span class="token operator">=</span> <span class="token number">0.</span>
    <span class="token keyword">for</span> index<span class="token punctuation">,</span> input_neuron <span class="token keyword">in</span> <span class="token builtin">enumerate</span><span class="token punctuation">(</span>rectangle<span class="token punctuation">)</span><span class="token punctuation">:</span>
        output_neuron <span class="token operator">+=</span> input_neuron <span class="token operator">*</span> hidden_layer<span class="token punctuation">[</span>index<span class="token punctuation">]</span>
    <span class="token keyword">return</span> <span class="token builtin">max</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> output_neuron<span class="token punctuation">)</span>
</code></pre>
<p>Now, we only have a single output neuron and a handful of weights… that isn’t
much. Good models have hundreds, and the biggest models, like GPT-3, have
billions. So this won’t recognize many digits or cats, but you can easily add
more weights!</p>
<h3 id="matrices">Matrices</h3>
<p>The core operation in our model, the for loop, is a matrix multiplication. We
could rewrite it to use matrices instead, e.g. <code>rectangle @ hidden_layer</code>. PyTorch will
then do the exact same thing, except it’ll now execute in C-land. And if you
have a GPU and add some more weights, it’ll execute on the GPU, which is even
faster. When doing any kind of deep learning, you want to avoid writing
Python loops; they’re just too slow. If you run the code above for 300
epochs, you’ll see that it takes minutes to complete. I left matrices out of this
post to simplify the explanation as much as possible. There’s plenty going on
without them.</p>
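<p>A quick sketch of that rewrite (the <code>@</code> operator is PyTorch’s matrix multiplication; tensor values taken from the earlier examples):</p>

```python
import torch

hidden_layer = torch.tensor([0.98, 0.4, 0.86, -0.08], requires_grad=True)
rectangle = torch.tensor([0.2, 0.5, 0.4, 0.7])

# The Python for loop from model()...
looped = 0.
for index, input_neuron in enumerate(rectangle):
    looped += input_neuron * hidden_layer[index]

# ...is a single matrix (here, dot) product, executed outside of Python.
matmul = rectangle @ hidden_layer

print(matmul.item())  # prints (essentially) the same number as the loop
print(looped.item())
```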
<h2 id="next-steps-to-implement-your-own-neural-net-from-scratch">Next steps to implement your own neural net from scratch</h2>
<p>Even if you’ve carefully read through this article, you won’t fully grasp it
until you’ve gotten your own hands on it. Here are some suggestions on where to
go from here, if you’d like to move beyond the basic understanding you have now:</p>
<ol>
<li>Get the <a href="https://colab.research.google.com/drive/1YRp9k_ORH4wZMqXLNkc3Ir5w4B5f-8Pa?usp=sharing">notebook</a> running and study the code</li>
<li>Change it to far larger rectangles, e.g. 100x100</li>
<li>Add biases in addition to the weights. A model doesn’t just have
weights that are multiplied onto the inputs, but also biases that are added
(<code>+</code>) onto the inputs in each layer.</li>
<li>Rewrite the model to use <a href="https://pytorch.org/docs/stable/tensors.html">PyTorch tensors</a> for matrix operations, as
described in the previous section.</li>
<li>Add 1-2 more layers to the model. Try to have them have different sizes.</li>
<li>Change the tensors to run on GPU (see the <a href="https://pytorch.org/docs/stable/notes/cuda.html">PyTorch
documentation</a>) and see the
performance speed up! Increase the size of the training set and rectangles to
<em>really</em> be able to tell the difference. Make sure you change <code>Runtime &gt; Change Runtime Type</code> in Colab to run on a GPU.</li>
<li>This is a difficult step that will likely take a while, but it’ll be well
worth it: Adapt the code to recognize handwritten digits from the <a href="https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz">MNIST
dataset</a>. You’ll need to use <a href="https://pillow.readthedocs.io/en/stable/"><code>pillow</code></a> to turn
the pixels into a large 1-dimensional tensor as the input layer, as well as a
non-linear activation function like <code>Sigmoid</code> or <code>ReLU</code>. Use <a href="http://neuralnetworksanddeeplearning.com/">Nielsen’s
book</a> as a reference if you get stuck, which does exactly this.</li>
</ol>
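<p>As a starting point for suggestions 3–6, here is one way a two-layer model with biases and a ReLU activation could look in tensor form (a sketch under my own naming, not code from the notebook):</p>

```python
import torch

def model(inputs, w1, b1, w2, b2):
    # Layer 1: matrix multiply, add the biases, then the non-linearity.
    hidden = torch.relu(inputs @ w1 + b1)
    # Layer 2: combine the hidden activations into a single output.
    return hidden @ w2 + b2

torch.manual_seed(0)
w1 = torch.randn(4, 8, requires_grad=True)  # 4 inputs -> 8 hidden neurons
b1 = torch.zeros(8, requires_grad=True)
w2 = torch.randn(8, requires_grad=True)     # 8 hidden neurons -> 1 output
b2 = torch.zeros((), requires_grad=True)

out = model(torch.tensor([0.2, 0.5, 0.4, 0.7]), w1, b1, w2, b2)
out.backward()  # all four parameter tensors now have .grad filled in
```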
<p>I hope you thoroughly enjoyed this walkthrough of a neural net from scratch! In
a future issue we’ll use the mental model we’ve built up here to do some napkin
math on the expected performance of training and using neural nets.</p>
<p><em>Thanks to <a href="https://www.vegardstikbakke.com/">Vegard Stikbakke</a>, <a href="https://www.flyingcroissant.ca/">Andrew Bugera</a> and <a href="https://thundergolfer.com/">Jonathan
Belotti</a> for providing valuable feedback on drafts of this article.</em></p>
<section data-footnotes="true" class="footnotes"><h2 class="sr-only" id="footnote-label">Footnotes</h2>
<ol>
<li id="user-content-fn-google">
<p>This is a good example of <a href="/peak-complexity">Peak Complexity</a>.
The existing phrase-based translation model was iteratively improved with
increasing complexity: distributed systems to look up five-word phrase
frequencies, and so on. The complexity required to improve the model by 1% was
becoming astronomical, a good hint that you need a paradigm shift to reset the complexity.
Deep Learning provided that complexity reset for the translation model. <a href="#user-content-fnref-google" data-footnote-backref="" aria-label="Back to reference 1" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-gpt3">
<p>GPT-3 has ~175 billion weights. The human brain has ~86 billion
neurons. Of course, you cannot technically compare an artificial neuron to a
real one, but it remains an interesting point of reference. <a href="https://lastweekin.ai/p/gpt-3-is-no-longer-the-only-game">It’s
estimated</a> that it cost in the double-digit millions to train it. <a href="#user-content-fnref-gpt3" data-footnote-backref="" aria-label="Back to reference 2" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-3blue1brown">
<p>There’s a brilliant <a href="https://www.youtube.com/watch?v=aircAruvnKk">YouTube series</a> that’ll go
into more depth on the math than I do in this article. This article
accompanies the video nicely, as the video doesn’t go into the implementation. <a href="#user-content-fnref-3blue1brown" data-footnote-backref="" aria-label="Back to reference 3" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-nielsen">
<p>There’s a great, <a href="http://neuralnetworksanddeeplearning.com/">short e-book</a> on implementing a neural
network from scratch available that goes into far more detail on computing the
derivative from scratch. Despite that book existing, I still decided to do this
write-up because calculating the slope manually adds a lot of time and
complexity; I wanted to teach it from scratch without going into those
details. <a href="#user-content-fnref-nielsen" data-footnote-backref="" aria-label="Back to reference 4" class="data-footnote-backref">↩</a></p>
</li>
<li id="user-content-fn-activation">
<p>I found this pretty strange when I learned about neural networks.
We can use almost any non-linear function and our neural network
works… better? The short answer is yes! The longer answer I’m not
knowledgeable enough to offer… If you write your own handwritten MNIST
neural net (as suggested at the end of the article), you can see for yourself
by adding/removing a non-linear function and looking at the loss. <a href="#user-content-fnref-activation" data-footnote-backref="" aria-label="Back to reference 5" class="data-footnote-backref">↩</a></p>
</li>
</ol>
</section>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Careful Trading Complexity for 'Improvements']]></title>
        <id>https://sirupsen.com/trading-complexity</id>
        <link href="https://sirupsen.com/trading-complexity"/>
        <updated>2021-11-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Often I’ve come across technical proposals along the lines of:

In 6 months we will outgrow our MySQL/Postgres instance. We will need to move our
biggest table to a different horizontally scalable datastore.
If we have a database outage in a region, we will have a complete outage. We
should consider moving to a data-store that’s natively multi-region.
This would be much faster if it was stored in a specialized database. Should
we consider moving to it?
I]]></summary>
        <content type="html"><![CDATA[<p>Often I’ve come across technical proposals along the lines of:</p>
<ul>
<li>In 6 months we will outgrow our MySQL/Postgres instance. We will need to move our
biggest table to a different horizontally scalable datastore.</li>
<li>If we have a database outage in a region, we will have a complete outage. We
should consider moving to a data-store that’s natively multi-region.</li>
<li>This would be much faster if it was stored in a specialized database. Should
we consider moving to it?</li>
<li>If we move to an event-based architecture, our system will be much more
reliable.</li>
</ul>
<p>What these proposals have in common is that they attempt to improve the system
by increasing complexity. Whenever you find yourself arguing for improving
infrastructure by yanking up complexity, you need to be <em>very</em> careful.</p>
<blockquote>
<p>“Simplicity is prerequisite for reliability.”
— Edsger W. Dijkstra</p>
</blockquote>
<p>Theoretically yes: if you move your massive, quickly-growing <code>products</code> table to
a key-value store to alleviate a default-configured relational database
instance, it will probably be faster, cost less, and be easier to scale.</p>
<p>However, in reality the complexity will most likely lead to more downtime (even
if in theory you get less), slower performance because the system is harder to
debug (even if in theory it’s much faster), and worse scalability (because you
don’t know the system well).</p>
<p>More theoretical 9s + increase in complexity =&gt; fewer 9s + more work.</p>
<p>This is all because you’re about to trade known risks for theoretical improvements,
accompanied by a slew of unknown risks. Adopting the new tech would increase
complexity by introducing a whole new system: the operational burden of learning a
new data-store, the developers’ overhead of using another system for a subset of the
data, a more complex development environment, skills that don’t transfer
between the two, and a myriad of other unknown-unknowns. That’s a <em>massive</em>
cost.</p>
<p>I’m a proponent of mastering and abusing existing tools, rather than chasing
greener pastures. The more facility you gain with first-principle reasoning and
<a href="/napkin">napkin math</a>, the closer I’d wager you’ll inch towards this conclusion as
well. A new system theoretically having better guarantees is <em>not</em> enough of an
argument. Adding a new system to your stack is a huge deal and difficult to
undo.</p>
<p>So what do we do with that pesky <code>products</code> table?</p>
<p>Stop thinking about technologies, and start thinking in first-principle
requirements:</p>
<ul>
<li>You need faster inserts/updates</li>
<li>You need terabytes of storage to have runway for the next ~5 years</li>
<li>You need more read capacity</li>
</ul>
<p>The way that the shiny key-value store you’re eyeing achieves this is by not
syncing every write to disk immediately. Well, you can <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit">do that in MySQL too</a>
(and <a href="https://www.postgresql.org/docs/9.4/wal-async-commit.html">Postgres</a>). You could put your table on a new database server with that
setting on. I <a href="/napkin/problem-10-mysql-transactions-per-second">wrote about this in detail</a>.</p>
<p>There’s no reason your relational database can’t handle terabytes. Do the napkin
math: <code>log(n)</code> lookups for that many keys aren’t much worse. Most likely you can
keep it all on one server.</p>
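<p>That napkin math is quick to sketch. With hypothetical numbers, say 10 billion rows and a B-tree fanout of ~200 keys per page, a point lookup only touches a handful of pages:</p>
<pre class="language-python"><code class="language-python">import math

rows = 10_000_000_000    # hypothetical: 10 billion keys
fanout = 200             # hypothetical: keys per B-tree page

# A B-tree lookup reads about log_fanout(rows) pages.
depth = math.ceil(math.log(rows, fanout))
print(depth)             # 5 page reads, the top levels likely cached in memory
</code></pre>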
<p>Why do you think reads would be faster in the other database than your
relational database? It probably caches in memory. Well, relational databases do
that too. You need to spread reads among more databases? Relational databases
can do that too with read-replicas…</p>
<p>Yes, MySQL/Postgres might be <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>25</mn><mo>−</mo><mn>50</mn><mi mathvariant="normal">%</mi></mrow><annotation encoding="application/x-tex">25-50\%</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em"></span><span class="mord">25</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.8056em;vertical-align:-0.0556em"></span><span class="mord">50%</span></span></span></span> worse at all those things than a new system. But
it still comes out <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>10</mn><mo separator="true">,</mo><mn>000</mn><mi mathvariant="normal">%</mi></mrow><annotation encoding="application/x-tex">10,000\%</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9444em;vertical-align:-0.1944em"></span><span class="mord">10</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">000%</span></span></span></span> ahead, by not being a new system with all its
associated costs and unknown-unknowns. There’s an underlying rule from evolution
that the more specialized a system is, the less adaptable to change it is,
whether it’s a bird over-fit to its ecosystem or a database you’re only using
for one thing.</p>
<p>We could go through a similar line of reasoning for the other examples. Adopting
a new multi-regional database for a subset of your data will likely yield
<em>more</em> downtime, due to the introduced complexity, than sticking with what
you’ve got.</p>
<p>Don’t adopt a new system unless you can make the first-principle argument for
why your current stack fundamentally can’t handle it. For example, you will
likely reach elemental limitations doing full-text search in a relational
datastore or analytics queries on your production database, by the nature of the
data structures used. If you’re unsure, reach out, and I might be able to help
you!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 16: When To Write a Simulator]]></title>
        <id>https://sirupsen.com/napkin/problem-16-simulation</id>
        <link href="https://sirupsen.com/napkin/problem-16-simulation"/>
        <updated>2021-09-13T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[My rule for when to write a simulator:

Simulate anything that involves more than one probability, probabilities
over time, or queues.

Anything involving probability and/or queues you will need to approach with
humility and care, as they are often deceptively difficult: How many people with
their random, erratic behaviour can you let into the checkout at once to make
sure it doesn’t topple over? How many connections should you allow op]]></summary>
        <content type="html"><![CDATA[<p>My rule for when to write a simulator:</p>
<blockquote>
<p>Simulate <em>anything</em> that involves more than one probability, probabilities
over time, or queues.</p>
</blockquote>
<p><em>Anything</em> involving probability and/or queues you will need to approach with
humility and care, as they are often deceptively difficult: How many people with
their random, erratic behaviour can you let into the checkout at once to make
sure it doesn’t topple over? How many connections should you allow open to a
database when it’s overloaded? What is the best algorithm to prioritize
asynchronous jobs to uphold our SLOs as much as possible?</p>
<p>If you’re in a meeting discussing whether to use algorithm X or Y for this
kind of problem without a simulator (or amazing data), you’re wasting your
time. Unless maybe one of you has a PhD in queuing theory or probability theory.
Probably even then. Don’t trust your intuition for anything the rule above
applies to.</p>
<p>My favourite illustration of how bad your intuition is for these types of
problems is the Monty Hall problem:</p>
<blockquote>
<p>Suppose you’re on a game show, and you’re given the choice of three doors:
Behind one door is a car; behind the others, goats. You pick a door, say No. 1,
and the host, who knows what’s behind the doors, opens another door, say No. 3,
which has a goat. He then says to you, “Do you want to pick door No. 2?”</p>
<p>Is it to your advantage to switch your choice?</p>
<p>— <a href="https://en.wikipedia.org/wiki/Monty_Hall_problem">Wikipedia Entry for the Monty Hall problem</a></p>
</blockquote>
<figure><img src="/images/monty.png" alt="" width="2560" height="1422" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Against your intuition, it is to your advantage to switch your choice. You will
win the car twice as often if you do! This completely stumped me. Take a moment
to think about it.</p>
<p>I frantically read the explanation on <a href="https://en.wikipedia.org/wiki/Monty_Hall_problem">Wikipedia</a> several times: still
didn’t get it. Watched <a href="https://www.youtube.com/watch?v=4Lb-6rxZxx0">videos</a>; now I think that… maybe… I get
it? According to <a href="https://en.wikipedia.org/wiki/Monty_Hall_problem">Wikipedia</a>, Erdős, one of the most renowned
mathematicians in history, also wasn’t convinced until he was shown a simulation!</p>
<p>After writing <a href="https://gist.github.com/sirupsen/87ae5e79064354b0e4f81c8e1315f89b">my simulation</a>, however, I finally feel like I get it.
Writing a simulation not only gives you a result you can trust more than your
intuition but also develops your understanding of the problem dramatically. I
won’t try to offer an in-depth explanation here; click the <a href="https://www.youtube.com/watch?v=4Lb-6rxZxx0">video link
above</a>, or try to implement a simulation — and you’ll see!</p>
<pre class="language-shellsession"><code class="language-shellsession"># https://gist.github.com/sirupsen/87ae5e79064354b0e4f81c8e1315f89b
$ ruby monty_hall.rb
Switch strategy wins: 666226 (66.62%)
No Switch strategy wins: 333774 (33.38%)
</code></pre>
<p>The short of it is that the host <em>always</em> opens a non-winning door, and never
your door, which reveals information about the doors! Your first choice retains
its 1/3 odds, but by switching at this point, incorporating the new information
of the host opening a non-winning door, you improve your odds to 2/3.</p>
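<p>The linked simulation is in Ruby; the same idea fits in a few lines of Python:</p>
<pre class="language-python"><code class="language-python">import random

def monty_hall(trials, switch, seed=42):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # The host opens a door that is neither your pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        if pick == car:
            wins += 1
    return wins / trials

print(monty_hall(100_000, switch=True))   # ~0.666
print(monty_hall(100_000, switch=False))  # ~0.333
</code></pre>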
<p>This is a good example of a deceptively difficult problem. We should simulate
it because it involves <em>probabilities over time</em>. If someone framed the Monty
Hall problem to you, you’d intuitively just say ‘no’ or ‘1/3’. Any problem
involving probabilities over time should <em>humble</em> you. Walk away and quietly go
write a simulation.</p>
<p>Now imagine when you add scale, queues, … as most of the systems you work on
likely have. Thinking you can reason about this off the top of your head might
constitute a case of good ol’ <a href="https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect">Dunning-Kruger</a>. If Bob’s offering a perfect
algorithm off the top of his head, call bullshit (unless he carefully frames it
as a hypothesis to test in a simulator, thank you, Bob).</p>
<p>When I used to do <a href="https://sirupsen.com/my-journey-to-the-international-olympiad-in-informatics/">informatics competitions</a> in high school, I was never
confident in the correctness of my solutions to the more math-heavy tasks — so I would often
write simulations for various things to make sure some condition held in a bunch
of scenarios (often using binary search). Same principle at work: I’m much more
confident most day-to-day developers would be able to write a good simulation
than a closed-form mathematical solution. I once read something about a
mathematician who spent a long time figuring out the optimal strategy in
Monopoly. A computer scientist came along and wrote a simulator in a <em>fraction</em>
of the time.</p>
<h2 id="using-randomness-instead-of-coordination">Using Randomness Instead of Coordination?</h2>
<p>A few years ago, we were revisiting old systems as part of moving to Kubernetes.
One system we had to adapt was a process spun up for every shard to do some
book-keeping. We were discussing how we’d make sure we’d have at least ~2-3
replicas per shard in the K8s setup (for high availability). Previously, we
had a messy static configuration in Chef to ensure we had a service for each
shard and that the replicas spread out among different servers, not something
that translated easily to K8s.</p>
<p>Below, the green dots denote the active replica for each shard. The red dots are
the inactive replicas for each shard:</p>
<figure><img src="/images/randomness-1.png" alt="" width="2000" height="1398" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>We discussed a couple of options: each process consulting some shared service to
coordinate having enough replicas per shard, or creating a K8s deployment per
shard with the 2-3 replicas. Both sounded a bit awkward and error-prone, and we
didn’t love either of them.</p>
<p>As a quick, curious, semi-joking thought experiment I asked:</p>
<blockquote>
<p>“What if each process chooses a shard at random when booting, and we boot
enough that we are near certain every shard has at least 2 replicas?”</p>
</blockquote>
<p>To rephrase the problem in a ‘mathy way’, with <code>n</code> being the number of shards:</p>
<blockquote>
<p>“How many times do you have to roll an <code>n</code>-sided die to ensure you’ve seen each
side at least <code>m</code> times?”</p>
</blockquote>
<figure><img src="/images/randomness-2.png" alt="" width="2000" height="1365" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>This successfully nerd-sniped everyone in the office pod. It didn’t take long
before some were pulling out complicated Wikipedia entries on probability
theory, trawling their email for old student MATLAB licenses, and soon formulas
I had no idea how to parse appeared on the whiteboard.</p>
<p>Insecure that I’d only ever done high school math, I surreptitiously started
writing a simple <a href="https://gist.github.com/sirupsen/8cc99a0d4290c9aa3e6c009fdce1ffec">simulator</a>. After 10 minutes I was done, and they were still
arguing about this and that probability formula. Once I showed them the
simulation the response was: <em>“oh yeah, you could do that too… in fact that’s
probably simpler…”</em> We all had a laugh and referenced that hour endearingly
for years after. (If you know a closed-form mathematical solution, I’d be very
curious! Email me.)</p>
<pre class="language-shellsession"><code class="language-shellsession"># https://gist.github.com/sirupsen/8cc99a0d4290c9aa3e6c009fdce1ffec
$ ruby die.rb
Max: 2513
Min: 509
P50: 940
P99: 1533
P999: 1842
P9999: 2147
</code></pre>
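<p>The gist above is in Ruby; a minimal Python sketch of the same simulation, using a hypothetical 100 shards and a minimum of 2 replicas each (our real numbers were different, so the percentiles won’t match the output above), might look like:</p>
<pre class="language-python"><code class="language-python">import random

def rolls_until_m_of_each(n, m, seed):
    # Roll an n-sided die until every side has come up at least m times.
    rng = random.Random(seed)
    counts = [0] * n
    needy = n          # sides rolled fewer than m times so far
    rolls = 0
    while needy > 0:
        side = rng.randrange(n)
        counts[side] += 1
        if counts[side] == m:
            needy -= 1
        rolls += 1
    return rolls

# Hypothetical: 100 shards, at least 2 replicas per shard.
samples = sorted(rolls_until_m_of_each(100, 2, seed) for seed in range(500))
print("P50:", samples[250])
print("P99:", samples[495])
</code></pre>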
<p>It followed from running the simulation that we’d need to boot 2000+ processes
with this strategy to ensure we’d have <em>at least</em> 2 replicas per shard with
a 99.99% probability. Compare this with the ~400 we’d need if we did some light
coordination. As you can imagine, we then did the napkin math on the cost of 1600 excess
dedicated CPUs to run these book-keepers at ~$10/month per CPU. Was this
strategy worth ~$16,000 a month? Probably not.</p>
<p>Throughout my career I remember countless times complicated Wikipedia entries
have been pulled out as a possible solution. I can’t remember a single time one
was actually implemented over something simpler. Intimidating Wikipedia entries
might be another sign it’s time to write a simulator, if nothing else, to prove that
something simpler might work. For example, you don’t need to know that traffic
probably arrives in a <a href="https://en.wikipedia.org/wiki/Poisson_distribution">Poisson distribution</a> and how to do further analysis
on that. That will just happen in a simulation, even if you don’t know the name.
Not important!</p>
<h2 id="another-real-example-load-shedding">Another Real Example: Load Shedding</h2>
<p>At Shopify, I spent a good chunk of my time on teams that worked on the
reliability of the platform. Years ago, we started working on a ‘load shedder.’
The idea was that when the platform was overloaded we’d prioritize traffic. For
example, if a shop got inundated with traffic (typically bots), how could we
make sure we’d prioritize ‘shedding’ (red arrow below) the lowest value traffic?
Failing that, only degrade that single store?  Failing that, only impact that
shard?</p>
<figure><img src="/images/load-shed.png" alt="" width="2000" height="1010" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Hormoz Kheradmand led most of this effort, and has written <a href="https://hormozk.com/capacity/">this post</a>
about it in more detail. When Hormoz started working on the first load shedder,
we were uncertain about what algorithms might work for shedding traffic fairly.
It was a big topic of discussion in the lively office pod, just like the
dice-problem. Hormoz started <a href="https://github.com/hkdsun/simiload">writing simulations</a> to
develop a much better grasp on how various controls might behave. This worked
out wonderfully, and also served to convince the team that a very simple
algorithm for prioritizing traffic could work, which Hormoz describes in <a href="https://hormozk.com/capacity/">his
post</a>.</p>
<p>Of course, before the simulations, we all started talking about Wikipedia
entries of the complicated, cool stuff we could do. The simple simulations
showed that none of that was necessary — perfect! There’s tremendous value in
exploratory simulation for nebulous tasks that ooze complexity. It gives you a
feedback loop, and typically a justification to keep V1 simple.</p>
<p>Do you need to bin-pack tenants on <code>n</code> shards that are being filled up randomly?
Sounds like <em>probabilities over time</em>, a lot of randomness, and smells of
NP-completeness. It won’t be long before someone points out deep learning is
perfect, or some resemblance to protein folding or whatever… Write a simple
simulation with a few different sizes and see if you can beat random by even a
little bit. Probably random is fine.</p>
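<p>Such a simulation fits in a few lines. Here with made-up tenant sizes and shard counts, comparing random placement to a greedy “least-loaded shard” strategy:</p>
<pre class="language-python"><code class="language-python">import random

def fullest_shard(pick_shard, tenants=1000, shards=10, seed=7):
    # Assign tenants of random size to shards; return the max shard load.
    rng = random.Random(seed)
    loads = [0] * shards
    for _ in range(tenants):
        size = rng.randint(1, 100)   # hypothetical tenant size
        loads[pick_shard(loads, rng)] += size
    return max(loads)

random_placement = fullest_shard(lambda loads, rng: rng.randrange(len(loads)))
least_loaded = fullest_shard(lambda loads, rng: loads.index(min(loads)))
print(random_placement, least_loaded)
</code></pre>
<p>Whether a smarter strategy beats random by enough to matter is exactly what a run of this answers.</p>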
<p>You need to plan for retirement and want to stress-test your portfolio? The
state of the art for this is <a href="https://engaging-data.com/will-money-last-retire-early/">Monte Carlo analysis</a>, which, for the
sake of this post, we can say is a fancy way of saying “simulate lots of
random scenarios.”</p>
<p>I hope you see the value in simulations for getting a handle on these types of
problems. I think you’ll also find that writing simulators is some of the most
fun programming there is. Enjoy!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 15: Increase HTTP Performance by Fitting In the Initial TCP Slow Start Window]]></title>
        <id>https://sirupsen.com/napkin/problem-15</id>
        <link href="https://sirupsen.com/napkin/problem-15"/>
        <updated>2021-07-13T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Did you know that if your site’s under ~12kb the first page will load
significantly faster?  Servers only send a few packets (typically 10)
in the initial round-trip while TCP is warming up (referred to as TCP slow
start). After sending the first set of packets, it needs to wait for
the client to acknowledge it received all those packets.
Quick illustration of transferring ~15kb with an initial TCP slow start window
(also referred to as initial congestion window or initcwnd</code]]></summary>
        <content type="html"><![CDATA[<p>Did you know that if your site’s under ~12kb the first page will load
significantly faster?  Servers only send a few packets (typically 10)
in the initial round-trip while TCP is warming up (referred to as TCP slow
start). After sending the first set of packets, it needs to wait for
the client to acknowledge it received all those packets.</p>
<p>Quick illustration of transferring ~15kb with an initial TCP slow start window
(also referred to as initial congestion window or <code>initcwnd</code>) of 10 versus 30:</p>
<figure><img src="/images/initcwnds.png" alt="" width="2000" height="1904" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
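<p>The arithmetic behind that illustration can be sketched as a tiny model: assume ~1,460 bytes of payload per packet and a congestion window that doubles every roundtrip (no loss), and count roundtrips until the page fits:</p>
<pre class="language-python"><code class="language-python">def data_roundtrips(size_bytes, initcwnd, mss=1460):
    # The window doubles each roundtrip until everything is sent.
    window, sent, roundtrips = initcwnd, 0, 0
    while size_bytes > sent:
        sent += window * mss
        window *= 2
        roundtrips += 1
    return roundtrips

print(data_roundtrips(15_000, initcwnd=10))   # 2 roundtrips
print(data_roundtrips(15_000, initcwnd=30))   # 1 roundtrip
print(data_roundtrips(160_000, initcwnd=10))  # 4 roundtrips
</code></pre>
<p>With a 10-packet window, anything under ~14.6kb of payload fits in the first roundtrip, the same ballpark as the ~12kb rule of thumb above.</p>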
<p>The larger the initial window, the more we can transfer in the first roundtrip,
and the faster your site is on the initial page load. For a large roundtrip time
(e.g. across an ocean), this will start to matter a lot. Here is the approximate
size of the initial window for a number of common hosting providers:</p>
<table><thead><tr><th>Site</th><th>First Roundtrip Bytes (<code>initcwnd</code>)</th></tr></thead><tbody><tr><td><a href="https://readwise.io/">Heroku</a></td><td>~12kb (10 packets)</td></tr><tr><td><a href="https://www.onepeloton.ca/">Netlify</a></td><td>~12kb (10 packets)</td></tr><tr><td><a href="https://fashionnova.com/">Shopify</a></td><td>~12kb (10 packets)</td></tr><tr><td><a href="https://yellowco.co/">Squarespace</a></td><td>~12kb (10 packets)</td></tr><tr><td><a href="https://sirupsen.com/static/html/network-napkin/100kb">Cloudflare</a></td><td>~40kb (30 packets)</td></tr><tr><td><a href="https://www.fastly.com/">Fastly</a></td><td>~40kb (30 packets)</td></tr><tr><td><a href="https://demos.creative-tim.com">Github Pages</a></td><td>~40kb (30 packets)</td></tr><tr><td><a href="https://tailwindcss.com/">Vercel</a></td><td>~40kb (30 packets)</td></tr></tbody></table>
<p>To generate this, I wrote a script, <a href="https://github.com/sirupsen/initcwnd"><code>sirupsen/initcwnd</code></a>, that you can
use to analyze your own site. Based on the report, you can attempt to tune your page
size, or tune your server’s initial slow start window size (<code>initcwnd</code>) (see
bottom of article). It’s important to note that more isn’t necessarily better
here. Hosting providers have a hard job choosing a value. 10 might be the best
setting for your site, or it might be 64. As a rule of thumb, if most of your
clients are high-bandwidth connections, more is better. If not, you’ll need to
strike a balance. Read on, and you’ll be an expert in this!</p>
<figure><img src="/images/initcwnd-script.png" alt="" width="1692" height="904" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<hr/>
<p>Dear Napkin Mathers, it’s been too long. Since the last issue, I’ve left Shopify after 8
amazing years. Ride of a lifetime. For the time being, I’m passing the time with
standup paddleboarding (did a 125K 3-day trip the week after I left),
recreational programming (of which napkin math surely is a part), and learning
some non-computer things.</p>
<p>In this issue, we’ll dig into the details of exactly what happens on the wire
when we do the initial page load of a website over HTTP. As I’ve already hinted
at, we’ll show that there’s a magical byte threshold to be aware of when
optimizing for short-lived, bursty TCP transfers. Staying under this threshold,
or raising it, can potentially save the client several roundtrips.
Especially for sites with a single location that are often requested from far
away (i.e. high roundtrip times), e.g. US -&gt; Australia, this can make a <em>huge</em>
difference. That’s likely the situation you’re in if you’re operating a
SaaS-style service. While we’ll focus on HTTP over the public internet, TCP slow
start can also matter to RPC inside of your data-centre, and especially across
them.</p>
<p>As always, we’ll start by laying out our naive mental model about how we <em>think</em>
loading a site works at layer 4. Then we’ll do the napkin math on expected
performance, and confront our fragile, naive model with reality to see if it
lines up.</p>
<p>So what do we think happens at the TCP-level when we request a site? For
simplicity, we will exclude compression, DOM rendering, Javascript, etc., and
limit ourselves exclusively to downloading the HTML. In other words: <code>curl --http1.1 https://sirupsen.com &gt; /dev/null</code> (note that <a href="https://github.com/sirupsen/initcwnd"><code>sirupsen/initcwnd</code></a>
uses <code>--compressed</code> with <code>curl</code> to reflect reality).</p>
<p>We’d expect something along the lines of:</p>
<ul>
<li>1 DNS roundtrip (we’ll ignore this one, typically cached close by)</li>
<li>1 TCP roundtrip to establish the connection (<code>SYN</code> and <code>SYN+ACK</code>)</li>
<li>2 TLS roundtrips to negotiate a <em>secure</em> connection</li>
<li>1 HTTP roundtrip to request the page and the server sending it</li>
</ul>
<figure><img src="/images/roundtrips-1.png" alt="" width="350" height="469" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>To make things a little more interesting, we’ll choose a site that is
geographically farther from me and isn’t overly optimized: <code>information.dk</code>, a
Danish newspaper. Through some DNS lookups from servers in different geographies
and by using <a href="https://bgp.he.net/ip/109.238.50.144">a looking glass</a>, I can determine that all their HTML traffic
is always routed to a datacenter in Copenhagen. These days, many sites are
routed through e.g. Cloudflare POPs, which will have a nearby data-centre; to
simplify our analysis, we want to make sure that’s not the case.</p>
<p>I’m currently sitting in South-Western Quebec on an LTE connection. I can
determine <a href="https://cln.sh/5Br6AV">through <code>traceroute(1)</code></a> that my traffic is travelling to
Copenhagen through the path Montreal -&gt; New York -&gt; Amsterdam -&gt; Copenhagen.
<a href="https://cln.sh/CFgnEZ">Round-trip time is ~140ms</a>.</p>
<figure><img src="/images/network.jpeg" alt="" width="1758" height="1098" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If we add up the number of round-trips from our napkin model above (excluding
DNS), we’d expect loading the Danish site to take <code>4 * 140ms = 560ms</code>.
Since I’m on an LTE connection where I’m not getting much above 15 mbit/s, we
have to factor in that it takes another <a href="https://www.wolframalpha.com/input/?i=160kb+at+15+mbit%2Fs">~100ms to transfer the data</a>,
in addition to the 4 round-trips. So with our napkin math, we’re expecting that
we should be able to download the 160kb of HTML from a server in Copenhagen
within a ballpark of <code>~660ms</code>.</p>
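<p>This estimate is small enough to spell out in a few lines of Python (a sketch using the ~140ms RTT and ~15 mbit/s LTE bandwidth measured above):</p>

```python
# Napkin model: 4 roundtrips (1 TCP + 2 TLS + 1 HTTP) plus transfer time.
RTT_S = 0.140           # measured roundtrip time to Copenhagen
BANDWIDTH_BIT_S = 15e6  # ~15 mbit/s LTE connection
PAGE_BYTES = 160_000    # ~160kb of HTML

total_s = 4 * RTT_S + PAGE_BYTES * 8 / BANDWIDTH_BIT_S
print(f"expected load time: ~{total_s:.2f}s")  # ~0.65s
```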
<p>Reality, however, has other plans. When I run <code>time curl --http1.1 https://www.information.dk</code> it takes 1.3s! Normally we say that if the napkin
math is within ~10x of reality, the model is likely sound, but that rule of
thumb applies when we deal with nanoseconds and microseconds. Not off by
~<code>640ms</code>!</p>
<p>So what’s going on here? When there’s a discrepancy between the napkin math and
reality, it’s because either (1) the napkin model of the world is incorrect, or
(2) there’s room for optimization in the system. In this case, it’s a bit of
both. Let’s hunt down those 640ms. 👀</p>
<p>To do that, we have to analyze the raw network traffic with Wireshark. Wireshark
brings back many memories… some fond, but mostly frustration from trying to
figure out the causes of intermittent network problems. In this case, for once, it’s
for fun and games! We’ll type <code>host www.information.dk</code> into Wireshark to make
it capture traffic to the site. In our terminal we run the <code>curl</code> command above
for Wireshark to have something to capture.</p>
<p>Wireshark will then give us a nice GUI to help us hunt down the roughly half a
second we haven’t accounted for. One thing to note is that in order to get
Wireshark to understand the TLS/SSL contents of the session it needs to know the
secret negotiated with the server. There’s a complete guide <a href="https://everything.curl.dev/usingcurl/tls/sslkeylogfile">here</a>, but
in short you pass <code>SSLKEYLOGFILE=log.log</code> to your <code>curl</code> command and then point
to that file in Wireshark in the TLS configuration.</p>
<h2 id="problem-1-3-tls-roundtrips-rather-than-2">Problem 1: 3 TLS roundtrips rather than 2</h2>
<figure><img src="/images/wireshark-overview.png" alt="" width="3104" height="2024" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>We see the TCP roundtrip as expected, <code>SYN</code> from the client, then <code>SYN+ACK</code> from
the server. Bueno. But after that it looks fishy. We’re seeing <em>3</em> round-trips
for TLS/SSL instead of the expected 2 from our drawing above!</p>
<figure><img src="/images/wireshark-tls-bad.png" alt="" width="1944" height="648" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>To make sure I wasn’t misunderstanding something, I double-checked with
<code>sirupsen.com</code>, and sure enough, it’s showing the two roundtrips in Wireshark as
anticipated:</p>
<figure><img src="/images/wireshark-tls-good.png" alt="" width="2008" height="292" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If we carefully study the annotated Wireshark dump above for the Danish
newspaper, we can see that the problem is that for whatever reason the server is
waiting for a TCP ack in the middle of transmitting the certificate (packet 9).</p>
<p>To make it a little easier to parse, the exchange looks like this:</p>
<figure><img src="/images/roundtrips-2.png" alt="" width="250" height="581" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Why is the server waiting for a TCP ACK from the client after transmitting ~4398
bytes of the certificate? Why doesn’t the server just send the whole certificate
at once?</p>
<h2 id="bytes-in-flight-or-the-initial-congestion-window">Bytes in flight or the “initial congestion window”</h2>
<p>In TCP, the server carefully monitors how many packets/bytes it has in flight.
Typically, each packet is ~1460 bytes of application data. The server doesn’t
necessarily send <em>all</em> the data it has at once, because the server doesn’t know
how “fat” the pipes are to the client. If the client can only receive 64 kbit/s
currently, then sending e.g. 100 packets could completely clog the network. The
network would most likely drop some random packets, which is even slower to
recover from than sending the packets at a more sustainable pace for the
client.</p>
<p>A <em>major</em> part of the TCP protocol is the balancing act of trying to send as
much data as possible at any given time, while ensuring the server doesn’t
over-saturate the path to the client and lose packets. Losing packets is very
bad for bandwidth in TCP.</p>
<p>The server only keeps a certain amount of packets in flight at any given time.
“In flight” in TCP terms means “unacknowledged” packets, i.e. packets of data
the server has sent to the client that the client hasn’t yet sent an
acknowledgement to the server that it has received. Typically for every
successfully acknowledged packet the server’s TCP implementation will decide to
increase the number of allowed in-flight packets by 1. You may have heard this
simple algorithm referred to as “TCP slow start.” On the flip side, if a packet
has been dropped, the server will decide to keep slightly fewer bytes in
flight. Throughout the connection’s lifetime this dance
is tirelessly performed. In TCP terms, what we’ve called “in-flight” is
referred to as the “congestion window” (or <code>cwnd</code> for short).
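<p>This dance can be sketched as a toy model: send everything the window allows, then double the window on every successful roundtrip. A minimal sketch, assuming ~1460 bytes per packet, no loss, and no receive-window cap:</p>

```python
MSS = 1460  # typical bytes of application data per packet

def roundtrips_to_send(payload_bytes: int, initcwnd: int) -> int:
    """Roundtrips to deliver a payload when the congestion window
    doubles on every successful (loss-free) roundtrip."""
    cwnd, sent, trips = initcwnd, 0, 0
    while sent < payload_bytes:
        sent += cwnd * MSS  # everything allowed in flight this roundtrip
        cwnd *= 2           # ~1 extra packet per acknowledged packet
        trips += 1
    return trips

# A ~6908-byte certificate needs 2 roundtrips with an initial window of
# 3 packets (~4380 bytes), but only 1 with a window of 10 (~14600 bytes).
print(roundtrips_to_send(6908, 3), roundtrips_to_send(6908, 10))  # 2 1
```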
<figure><img src="/images/slow-start.png" alt="" width="2551" height="2093" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Typically after the first packet has been lost the TCP implementation switches
from the simple TCP slow start algorithm to a more complicated <a href="https://upload.wikimedia.org/wikipedia/commons/2/24/TCP_Slow-Start_and_Congestion_Avoidance.svg">“Congestion
Control Algorithm”</a> of which there are dozens. Their job is: Based on what
we’ve observed about the network, how much should we have in flight to maximize
bandwidth?</p>
<p>Now we can go back and understand why the TLS handshake is taking 3 roundtrips
instead of 2. After the client starts the TLS handshake with its <code>TLS HELLO</code>, the
Danish server really, really wants to transfer this ~6908 byte certificate.
Unfortunately, the server’s congestion window (the packets in flight allowed) at
the time just isn’t large enough to accommodate the whole certificate!
<p>Put another way, the server’s TCP implementation has decided it’s <em>not</em>
confident the poor client can receive that many tasty bytes all at once yet —
so it sends a petty 4398 bytes of the certificate. Of course, 63% of a
certificate isn’t enough to move on with the TLS handshake… so the client
sighs, sends a TCP ACK back to the server, which then sends the remaining 2510
bytes of the certificate so the client can move on to perform its part of the TLS
handshake.</p>
<p>Of course, this all seems a little silly… first of all, why is the certificate
6908 bytes?! For comparison, it’s 2635 for my site. That’s not too interesting
to me, though. What’s more interesting: why does the server send only 4398
bytes before waiting for an ACK? That seems scanty for a modern web server!</p>
<p>In TCP, the number of packets the server can send on a brand new connection
before it knows <em>anything</em> about the client is called the “initial congestion window.” In a
configuration context, this is called <code>initcwnd</code>. If you reference the yellow
graph above with the packets in flight, that’s the value at the first roundtrip.</p>
<p>These days, the default for a Linux server is 10 packets, or <code>10 * 1460 = 14600 bytes</code>, where 1460 is roughly the data payload of each packet. That would’ve fit
that monster certificate of the Danish newspaper. Clearly that’s not their
<code>initcwnd</code>, since then the server wouldn’t have patiently waited for my ACK.
Through some digging it appears that prior to <a href="https://blog.cloudflare.com/optimizing-the-linux-stack-for-mobile-web-per/">Linux 3.0.0 <code>initcwnd</code> was
3</a>, or ~<code>3 * 1460 = 4380</code> bytes! That approximately lines up, so it seems
that the Danish newspaper’s <code>initcwnd</code> is 3. We don’t know for sure it’s Linux,
but we know the <code>initcwnd</code> is 3.</p>
<p>Because of the exponential growth of the packets in flight, <code>initcwnd</code> matters
quite a bit for how much data we can send in those first few precious
roundtrips:</p>
<figure><img src="/images/initcwnd-graph.png" alt="" width="2000" height="1650" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>As we saw in the intro, it’s common among CDNs to raise the values from the
default to e.g. 32 (~46kb). This makes sense, as you might be transmitting
images of many megabytes. Waiting for TCP slow start to get to this point can
take a few roundtrips.</p>
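<p>With doubling, the cumulative capacity of the first <em>n</em> roundtrips is a geometric sum: <code>initcwnd * 1460 * (2^n - 1)</code> bytes. A quick sketch of how much the initial window matters, assuming no loss:</p>

```python
MSS = 1460  # approximate data payload per packet

def bytes_after(roundtrips: int, initcwnd: int) -> int:
    # Geometric sum: initcwnd + 2*initcwnd + 4*initcwnd + ... packets.
    return initcwnd * MSS * (2 ** roundtrips - 1)

for icw in (3, 10, 32):
    print(icw, [bytes_after(n, icw) for n in (1, 2, 3, 4)])
# initcwnd 32 can push ~46kb in the very first roundtrip, while
# initcwnd 3 needs three roundtrips just to get past ~30kb.
```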
<p>Among other reasons, this is also why HTTP/2 and HTTP/3 moved in the direction of
sending more data through the same connection: it’s an already “warm” TCP
session. “Warm” meaning that the congestion window / bytes in flight has already
been increased generously from the initial value by the server.</p>
<p>The TCP slow start window is also part of why points of presence (POPs) are
useful. If you connect to a POP in front of your website that’s 10ms
away, negotiate TLS with the POP, and the POP already has a warm connection
with the backend server 100ms away — this improves performance dramatically,
with no other changes. From <code>4 * 100ms = 400ms</code> to <code>3 * 10ms + 100ms = 130ms</code>.</p>
<h2 id="how-many-roundtrips-for-the-http-payload">How many roundtrips for the HTTP payload?</h2>
<p>Now we’ve gotten to the bottom of why we have 3 TLS roundtrips rather than the
expected 2: the initial congestion window is small. The congestion window
(allowed bytes in flight by the server) applies equally to the HTTP payload
that the server sends back to us. If it doesn’t fit inside the congestion
window, then we need multiple round-trips to receive all the HTML.</p>
<p>In Wireshark, we can pull up a TCP view that’ll give us an idea of how many
roundtrips were required to complete the request (<a href="https://github.com/sirupsen/initcwnd"><code>sirupsen/initcwnd</code></a> tries to
guess this for you with an embarrassingly simple algorithm):</p>
<figure><img src="/images/roundtrips-3.png" alt="" width="581" height="335" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>We see the TCP roundtrip, 3 TLS roundtrips, and then 5-6 HTTP roundtrips to get
the ~160kb page! Each little dot in the picture shows a packet, so you’ll notice
that the congestion window (allowed bytes in flight) is roughly doubling every
roundtrip. The server is increasing the size of the window for every successful
roundtrip. A ‘successful roundtrip’ means a roundtrip that didn’t drop packets, and
in some <a href="https://cloud.google.com/blog/products/networking/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster">newer algorithms</a>, a roundtrip that didn’t take too much time.</p>
<p>Typically, the server will continue to double the number of packets (~1460 bytes each) for each successful roundtrip until either an unsuccessful roundtrip happens (slow or dropped packets), <em>or</em> the bytes in flight would exceed the <em>client’s</em> receive window.</p>
<p>When a TCP session starts, the client will advertise how many bytes <em>it</em> allows in flight. This is typically much larger than the server is willing to send off the bat. We can pull this up in the initial <code>SYN</code> packet from the client and see that it’s ~65kb:</p>
<figure><img src="/images/syn-window.png" alt="" width="1002" height="492" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If the session had been much longer and we had pushed up against that window, the client would’ve sent a TCP packet updating the size of its receive window. So there are two windows at play: the <em>congestion window</em>, which the server uses to manage the number of packets in flight, and the client’s <em>receive window</em>. The congestion window is adjusted by the server’s <em>congestion algorithm</em> based on the number of successful roundtrips, but always capped by the client’s receive window.</p>
<p>Let’s look at the number of packets transmitted by the server in each roundtrip:</p>
<ul>
<li>TLS roundtrip: 3 packets (~4kb)</li>
<li>HTTP roundtrip 1: 6 (~8kb)</li>
<li>HTTP roundtrip 2: 10 (~14kb)</li>
<li>HTTP roundtrip 3: 17 (~24kb)</li>
<li>HTTP roundtrip 4: 29 (~41kb)</li>
<li>HTTP roundtrip 5: 48 (~69kb, this in theory would have exceeded the 64kb current
receive window since the client didn’t enlarge it for some reason. The server
only transmitted ~64kb)</li>
<li>HTTP roundtrip 6: 9 (12kb, just the remainder of the data)</li>
</ul>
<p>The growth of the congestion window is a <em>textbook</em> cubic function; it has a
<a href="https://www.wolframalpha.com/input/?i=cubic+fit+3%2C+6%2C+10%2C+17%2C+29%2C+48">perfect fit</a>:</p>
<figure><img src="/images/regression.png" alt="" width="660" height="438" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>I’m not entirely sure why it follows a cubic function; I expected TCP slow start
to just double every roundtrip. :shrug: As far as I can gather, on modern TCP
implementations the congestion window is doubled every roundtrip until a packet
is lost (as is the case for most other sites I’ve analyzed, e.g. the session in
the screenshot below), and only after <em>that</em> might it move to cubic growth.
Perhaps that behaviour has changed; it’s completely up to the TCP implementation.</p>
<p>This is part of why I wrote <code>sirupsen/initcwnd</code>: it spits out the size of the
windows so you don’t have to do any math or guesswork. Here it is for a GitHub repo
(uncompressed):</p>
<figure><img src="/images/initcwnd-script.png" alt="" width="1692" height="904" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<h2 id="consolidating-our-new-model-with-the-napkin-math">Consolidating our new model with the napkin math</h2>
<p>So now we can explain the discrepancy between our simplistic napkin math model
and reality. We assumed 2 TLS roundtrips, but in fact there were 3, because of
the server’s low initial congestion window. We also assumed 1 HTTP
roundtrip, but in fact there were 6, because the server’s congestion window and the
client’s receive window didn’t allow sending everything at once. This brings our
total roundtrips to <code>1 + 3 + 6 = 10</code> roundtrips. With our roundtrip time at
130ms, this lines up perfectly with the 1.3s total time we observed at the top
of the post! This suggests our new, updated mental model of the system reflects
reality well.</p>
<h2 id="ok-cool-but-how-do-i-make-my-own-website-faster">Ok cool but how do I make my own website faster?</h2>
<p>Now that we’ve analyzed this website together, you can use this to analyze your
own website and optimize it. You can do this by running
<a href="https://github.com/sirupsen/initcwnd"><code>sirupsen/initcwnd</code></a> against your website. It uses some very simple
heuristics to guess the windows and their sizes. They don’t always work,
especially not if you’re on a slow connection or the website streams the
response back to the client rather than sending it all at once.</p>
<p>Another thing to be aware of is that the Linux kernel (and likely other kernels)
caches the congestion window size (among other things) with clients via the
route cache. This is great, because it means that we don’t have to renegotiate
it from scratch when a client reconnects. But it might mean that subsequent runs
against the same website will give you a far larger <code>initcwnd</code>. The lowest you
encounter will be the right one. Note also that a site might have a fleet with
servers that have different <code>initcwnd</code> values!</p>
<p>The output of <code>sirupsen/initcwnd</code> will be something like:</p>
<figure><img src="/images/initcwnd-script.png" alt="" width="1692" height="904" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Here we can see the size of the TCP windows. The initial window was 10 packets
for Github.com, and then doubles every roundtrip. The last window isn’t a full
80 packets, because there weren’t enough bytes left from the server.</p>
<p>With this result, we could decide to change the <code>initcwnd</code> to a higher value to
try to send it back in fewer roundtrips. This might, however, have drawbacks
for clients on slower connections and should be done with care. It does show
some promise that CDNs have values in the 30s. Unfortunately, I don’t have access
to enough traffic to study this for myself, as <a href="https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-initcwnd">Google did</a> when
they championed the change from a default of 3 to 10. That document also
explains potential drawbacks in more detail.</p>
<p>The most practical day-to-day takeaway might be that e.g. base64 inlining images
and CSS may come with serious drawbacks if it throws your site over a congestion
window threshold.</p>
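<p>To see that threshold effect, here’s a hypothetical example using a simple model where the congestion window doubles every roundtrip with no loss (the page sizes are made up for illustration):</p>

```python
MSS = 1460  # approximate data payload per packet

def roundtrips_to_send(payload_bytes: int, initcwnd: int) -> int:
    # Simple model: no loss, window doubles every roundtrip.
    cwnd, sent, trips = initcwnd, 0, 0
    while sent < payload_bytes:
        sent += cwnd * MSS
        cwnd *= 2
        trips += 1
    return trips

# With the Linux default initcwnd of 10 (~14.6kb), a 13kb page fits in a
# single roundtrip; inlining 5kb of base64 images/CSS pushes it over the
# threshold and costs a whole extra roundtrip of latency.
print(roundtrips_to_send(13_000, 10), roundtrips_to_send(18_000, 10))  # 1 2
```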
<p>You can change <code>initcwnd</code> with the <code>ip(1)</code> command on Linux; here, from the
default of 10 to 32:</p>
<pre class="language-text"><code class="language-text">simon@netherlands:~$ ip route show
default via 10.164.0.1 dev ens4 proto dhcp src 10.164.0.2 metric 100
10.164.0.1 dev ens4 proto dhcp scope link src 10.164.0.2 metric 100

simon@netherlands:~$ sudo ip route change default via 10.164.0.1 dev ens4 proto dhcp src 10.164.0.2 metric 100 initcwnd 32 initrwnd 32

simon@netherlands:~$ ip route show
default via 10.164.0.1 dev ens4 proto dhcp src 10.164.0.2 metric 100 initcwnd 32 initrwnd 32
10.164.0.1 dev ens4 proto dhcp scope link src 10.164.0.2 metric 100
</code></pre>
<p>Another TCP setting worth tuning is
<code>tcp_slow_start_after_idle</code>. It’s a good name: when set to 1 (the default), the
kernel renegotiates the congestion window after a few seconds of no activity
(e.g. while you read the site). You probably want to set this to 0 in
<code>/proc/sys/net/ipv4/tcp_slow_start_after_idle</code> so it remembers the congestion
window for the next page load.</p>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 14: Using checksums to verify syncing 100M database records]]></title>
        <id>https://sirupsen.com/napkin/problem-14-using-checksums-to-verify</id>
        <link href="https://sirupsen.com/napkin/problem-14-using-checksums-to-verify"/>
        <updated>2021-01-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A common problem you’ve almost certainly faced is to sync two datastores. This problem comes up in numerous shapes and forms: Receiving webhooks and writing them into your datastore, maintaining a materialized view, making sure a cache reflects reality, ensure documents make it from your source of truth to a search index, or your data from your transactional store to your data lake or column store.
<img src="/images/8b99afab-9ae3-47cf-8703-f465aaec1473.png" alt="" width="1380" hei]]></summary>
<content type="html"><![CDATA[<p>A common problem you’ve almost certainly faced is syncing two datastores. This problem comes up in numerous shapes and forms: receiving webhooks and writing them into your datastore, maintaining a materialized view, making sure a cache reflects reality, ensuring documents make it from your source of truth to a search index, or moving your data from your transactional store to your data lake or column store.</p>
<figure><img src="/images/8b99afab-9ae3-47cf-8703-f465aaec1473.png" alt="" width="1380" height="610" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If you’ve built such a system, you’ve almost certainly seen B drift out of sync. Building a completely reliable syncing mechanism is difficult, but perhaps we can build a checksumming mechanism to check if the two datastores are equal in a few seconds?</p>
<p>In this issue of napkin math, we look at implementing a solution to <strong>check whether A and B are in sync for 100M records in a few seconds</strong>. The key idea is to checksum an indexed <code>updated_at</code> column and use a binary search to drill down to the mismatching records. All of this will be explained in great detail, read on!</p>
<h2 id="why-are-syncing-mechanisms-unreliable">Why are syncing mechanisms unreliable?</h2>
<p>If you fire the events for your syncing mechanism after a transaction occurs, such as enqueuing a job, sending a webhook, or emitting a Kafka event, you can’t guarantee that the event <em>actually</em> gets sent after the transaction is committed. Almost certainly, part of the pipeline into database B is leaky due to bugs: perhaps there’s an exception you don’t handle, you drop events above a certain size on the floor, there’s some early return, or a deploy loses an event in a rare edge case.</p>
<p>But <em>even</em> if you’re doing something that’s theoretically bullet-proof, like using the database replication logs through <a href="https://debezium.io/">Debezium</a>, there’s still a good chance a bug somewhere in your syncing pipeline is causing you to lose occasional events. If theoretical guarantees were adequate, <a href="https://jepsen.io/">Jepsen</a> wouldn’t uncover much, would it? A team I worked with even wrote a TLA+ proof, but still found bugs with a solution like the one I describe here! In my experience, a checksumming system should be part of <em>any</em> syncing system.</p>
<p>It would seem to me that building reliable syncing mechanisms would be easier if databases had a standard, fast mechanism to answer the question: <em>“Do databases A and B have all the same data? If not, what’s different?”</em> Over time, as you fix your bugs, drift will of course happen more rarely, but being able to guarantee that the two sides are in sync is a huge step forward.</p>
<p>Unfortunately, this doesn’t exist as a user API in modern databases, but perhaps we can design such a mechanism <em>without</em> modifying the database?</p>
<p>This exploration will be fairly long. If you just want to see the final solution, scroll down to the end. This issue shows how to use napkin math to incrementally justify increasing complexity. While I’ve been thinking about this problem for a while, this is a fairly accurate representation of how I thought about it a few months ago when I started working on it. It’s also worth noting that when doing napkin math, I usually don’t write prototypes like this if I’m fairly confident in my understanding of the system underneath. I’m doing it here to make it more entertaining to read!</p>
<h2 id="assumptions">Assumptions</h2>
<p>Let’s start with some assumptions to plan out our ‘syncing checksum process’:</p>
<ul>
<li>100M records</li>
<li>1KiB per record (~100 GiB total)</li>
</ul>
<p>We’ll assume both ends are SQL-flavoured relational databases, but will address other datastores later, e.g. ElasticSearch.</p>
<h2 id="iteration-1-check-in-batches">Iteration 1: Check in Batches</h2>
<p>As usual, we will start by considering the simplest possible solution for checking whether two databases are in sync: a script that iterates through all records in batches to check if they’re the same. It’ll execute the SQL query below in a loop, iterating through the whole collection on both sides and report mismatches:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
<span class="token keyword">ORDER</span> <span class="token keyword">BY</span> id <span class="token keyword">ASC</span>
<span class="token keyword">LIMIT</span> <span class="token variable">@limit</span> <span class="token keyword">OFFSET</span> <span class="token variable">@offset</span>
</code></pre>
<p>Let’s try to figure out how long this would take: Let’s assume each loop is querying the two databases in parallel and our batches are 10,000 records (10 MiB total) large:</p>
<ul>
<li>In MySQL, reading 10 MiB off SSD at <a href="https://github.com/sirupsen/napkin-math#numbers">200 us/MiB</a> will take ~2ms. We assume   this to be sequential-ish, <a href="http://yoshinorimatsunobu.blogspot.com/2013/10/making-full-table-scan-10x-faster-in.html">but this is not entirely true</a>.</li>
<li>Serializing and deserializing the MySQL protocol at <a href="https://github.com/sirupsen/napkin-math#numbers">5 ms/MiB</a>, for a total of ~2 * 50ms = 100ms.</li>
<li>Network transfer at <a href="https://github.com/sirupsen/napkin-math#numbers">10 ms/MiB</a>, for a total of ~100ms.</li>
</ul>
<p>We’d then expect each batch to take roughly ~200ms.  This would bring our theoretical grand total for this approach to <code>200 ms/batch * (100M / 10_000) batches ~= 30min</code>.</p>
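<p>Spelled out in code (rates taken from the napkin-math reference numbers linked above):</p>

```python
BATCH_MIB = 10                    # 10,000 records at ~1 KiB each
BATCHES = 100_000_000 // 10_000   # 10,000 batches for 100M records

read_s  = BATCH_MIB * 200e-6      # SSD read at 200 us/MiB
serde_s = 2 * BATCH_MIB * 5e-3    # (de)serialize on both ends at 5 ms/MiB
net_s   = BATCH_MIB * 10e-3       # network transfer at 10 ms/MiB

batch_s = read_s + serde_s + net_s
total_min = batch_s * BATCHES / 60
print(f"{batch_s * 1000:.0f}ms per batch, ~{total_min:.0f} minutes total")
```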
<p>To test our hypothesis against reality, I implemented this to <a href="https://github.com/sirupsen/napkin-math/blob/master/newsletter/14-syncing/check.rb">run locally for the first 100 of the 10,000 batches</a>. In this local implementation, we won’t incur the network transfer overhead (we could’ve done this with <a href="https://github.com/shopify/toxiproxy">Toxiproxy</a>). Without the network overhead, we expect a query time in the 100ms ballpark. Running <a href="https://github.com/sirupsen/napkin-math/blob/master/newsletter/14-syncing/check.rb">the script</a>, I get the following plot:</p>
<figure><img src="/images/dfef5830-f658-4268-b655-ec23e64ce90c.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Ugh. The real performance is pretty far from our napkin math lower bound estimate. What’s going on here?</p>
<p>There’s a fundamental problem with our napkin math. Only the <em>very</em> first batch will read only <code>~10 MB</code> off of the SSD in MySQL. <code>OFFSET</code> queries will read through the data <em>before</em> the offset, even if it only returns the data after the offset! Each batch takes 3-5ms more than the last, which lines up well with reading another 10 MiB per batch from the increasing offset.</p>
<p>This is the reason why OFFSET-based pagination causes so much trouble in production systems. If we take the area under the graph here and extend to the 10,000 batches we’d need for our 100M records, we get a <strong>~3 day runtime</strong>.</p>
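<p>A rough model of that quadratic blow-up, using the measured ~3-5ms of extra scan time per batch (the ~100ms fixed per-batch cost for serialization and network is an assumption carried over from the earlier estimate):</p>

```python
BATCHES = 10_000
BASE_S = 0.1       # assumed fixed per-batch cost (serde + network)
GROWTH_S = 0.004   # each batch measured ~3-5ms slower than the last

total_s = sum(BASE_S + i * GROWTH_S for i in range(BATCHES))
print(f"~{total_s / 86_400:.1f} days")  # ~2.3 days
```

With the 5ms end of the measured range instead, the total lands right around the ~3 days above.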
<h2 id="iteration-2-outsmarting-the-optimizer">Iteration 2: Outsmarting the optimizer</h2>
<p>As <code>OFFSET</code> will scan through all these 1 KiB records, what if we scanned an index instead? It’ll be much smaller to skip 100,000s of records on an index where each record only occupies perhaps 64 bit. It’ll still grow linearly with the offset, but passing the previous batch’s 10,000 records is only 10 KiB which would only take a few hundred microseconds to read.</p>
<p>You’d think the optimizer would make this optimization itself, but it doesn’t. So we have to do it ourselves:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
<span class="token keyword">WHERE</span> id <span class="token operator">&gt;</span> <span class="token punctuation">(</span><span class="token keyword">SELECT</span> id <span class="token keyword">FROM</span> <span class="token keyword">table</span> <span class="token keyword">LIMIT</span> <span class="token number">1</span> <span class="token keyword">OFFSET</span> <span class="token variable">@offset</span><span class="token punctuation">)</span>
<span class="token keyword">ORDER</span> <span class="token keyword">BY</span> id <span class="token keyword">ASC</span> 
<span class="token keyword">LIMIT</span> <span class="token number">10000</span><span class="token punctuation">;</span>
</code></pre>
<figure><img src="/images/47a71e04-2c3d-48e6-a7de-c2240d1ac26f.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>It’s better, but just not by enough. It just delays the inevitable scanning of lots of data to find these limits. If we interpolate how long this’d take for 10,000 batches to process our 100M records, we’re still talking on the <strong>order of 14 hours</strong>. The 128x speedup doesn’t carry through, because it only applies to the MySQL part. Network transfer is still a large portion of the total time!</p>
<p>Either way, if you have some OFFSET queries lying around in your codebase, you might want to consider this optimization.</p>
<h2 id="iteration-3-parallelization">Iteration 3: Parallelization</h2>
<p>This seems like an embarrassingly parallel problem: can’t we just run 100 batches of 10,000 records in parallel? Can the database support that? Since we can pre-compute <em>all</em> the LIMITs and OFFSETs up front, let’s exploit that.</p>
<p>This seems kind of difficult to do the napkin math on. Typically when that’s the case, I try to solve the problem backwards: Fundamentally, the machine can <a href="https://github.com/sirupsen/napkin-math#numbers">read sequential SSD at 4 GiB/s</a>, which would be an absolute lower bound for how fast the database can work. The dataset is 100 GiB, as we established in the beginning.</p>
<p>If we’re using our optimization from iteration 2, then our queries are on average processing <code>50M * 64 bit</code> for the sub-query, and the <code>10 MiB</code> of returned data on top. That’s a total of ~400 MiB. So for our 10,000 batches, that’s 4.2 TB of data we will need to munch through with this query. We can read 1 GiB from SSD in 200ms, so that’s 14 minutes in total. That would be the <em>absolute</em> lowest bound, assuming essentially zero overhead from MySQL and not taking into consideration serialization, network, etc.</p>
<p>This also assumes the MySQL instance is doing <em>nothing</em> but serving our query, which is unrealistic. In reality, we’d dedicate <em>maybe</em> 10% of capacity to these queries, which puts us at 2 hours. Still faster, but a far cry from our hope of seconds or minutes. Buuh.</p>
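<p>Working that lower bound out in code (the ~200ms per GB SSD read rate is from the napkin-math reference numbers; the 10% capacity budget is an assumption):</p>

```python
BATCHES = 10_000
MIB_PER_BATCH = 400     # ~50M * 64-bit ids scanned + 10 MiB returned
SSD_S_PER_GB = 0.2      # read ~1 GB off SSD in ~200ms

total_bytes = BATCHES * MIB_PER_BATCH * 2**20   # ~4.2 TB
floor_s = total_bytes / 1e9 * SSD_S_PER_GB
print(f"absolute floor: ~{floor_s / 60:.0f} min")                # ~14 min
print(f"at 10% of DB capacity: ~{floor_s / 0.10 / 3600:.1f} h")  # ~2.3 h
```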
<h2 id="iteration-4-dropping-offset">Iteration 4: Dropping OFFSET</h2>
<p>It’s starting to seem like trouble to use these OFFSET queries, even as sub-queries. We held on to them for a while because they’re nice and easy to reason about, and they mean the queries can be fired off in parallel. We also held on to them to truly show how awful these types of queries are, so hopefully you think twice about using one in a production query again!</p>
<p>If we change our approach to maintain <code>max(id)</code> from the last batch, we can simply change our loop’s query to:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
<span class="token keyword">WHERE</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
<span class="token keyword">ORDER</span> <span class="token keyword">BY</span> id <span class="token keyword">ASC</span>
<span class="token keyword">LIMIT</span> <span class="token number">10000</span><span class="token punctuation">;</span>
</code></pre>
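<p>As a sanity check of the looping logic, here is a minimal Python sketch of the keyset-pagination driver. The names (<code>fetch_batch</code>, the in-memory <code>table</code>) are hypothetical stand-ins for issuing the query above against MySQL:</p>

```python
# Keyset pagination sketch: each batch comes from
# "WHERE id > last_id ORDER BY id ASC LIMIT batch_size", and the max
# id of the batch seeds the next query. fetch_batch simulates that
# query against an in-memory, id-sorted table.
def fetch_batch(table, last_id, batch_size):
    return [row for row in table if row["id"] > last_id][:batch_size]

def iterate_all(table, batch_size=3):
    last_id = 0
    while True:
        batch = fetch_batch(table, last_id, batch_size)
        if not batch:
            break
        yield batch
        last_id = batch[-1]["id"]  # max(id) from the last batch

# Ids need not be contiguous: deleted rows simply leave gaps.
table = [{"id": i} for i in (1, 2, 5, 7, 8, 11, 12)]
batches = list(iterate_all(table))
```

<p>Note that the loop is inherently sequential: each query depends on the previous batch’s <code>max(id)</code>, which is the price we pay for dropping OFFSET.</p>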
<p>This curbed the linear growth!</p>
<figure><img src="/images/6b0263d5-c59f-4127-a573-6b06d615c195.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Now MySQL can use its efficient primary key index to do <a href="https://www.wolframalpha.com/input/?i=log%28100*10%5E6%29%2Flog%281024%2F3*2%2F%288%2B4%29%29+%2B+1++*https%3A%2F%2Fdev.mysql.com%2Fdoc%2Frefman%2F8.0%2Fen%2Festimating-performance.html*">~6 SSD seeks</a> on <code>id</code> and then scan forward. This means we only process and serialize 10 MiB, putting our napkin math consistently around 100ms per batch, as in the original estimate from iteration 1. That means this solution should <strong>finish in about half an hour!</strong> However, we learned in the previous iteration that we can only take 10% of the database’s capacity, so as calculated in iteration 3, we’re back at 2 hours.</p>
<p>We fundamentally need an approach that handles less data, since serialization and network time are now the primary reasons the integrity checking is slow.</p>
<h2 id="iteration-5-checksumming">Iteration 5: Checksumming</h2>
<p>If we want to handle less data, we need to have some way to fingerprint or checksum each record. We could change our query to something along the lines of:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> MD5<span class="token punctuation">(</span><span class="token operator">*</span><span class="token punctuation">)</span> <span class="token keyword">FROM</span> <span class="token keyword">table</span>
<span class="token keyword">WHERE</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
<span class="token keyword">ORDER</span> <span class="token keyword">BY</span> id <span class="token keyword">ASC</span>
<span class="token keyword">LIMIT</span> <span class="token number">10000</span><span class="token punctuation">;</span>
</code></pre>
<p>If there’s a mismatch, we simply revert to iteration 4 and find the rows that mismatch, but we have to scan far less data as we can assume the majority of it lines up.</p>
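<p>The control flow can be sketched in a few lines of Python: compare cheap per-batch checksums first, and only fetch full rows (iteration-4 style) for the batches that disagree. <code>batch_checksum</code> here is a hypothetical stand-in for the SQL checksum query:</p>

```python
import hashlib

# Compare batches by checksum; only mismatching batches need the
# expensive full-row comparison as a fallback.
def batch_checksum(rows):
    h = hashlib.md5()
    for row in rows:
        h.update(repr(sorted(row.items())).encode())
    return h.hexdigest()

def mismatched_batches(batches_a, batches_b):
    return [i for i, (a, b) in enumerate(zip(batches_a, batches_b))
            if batch_checksum(a) != batch_checksum(b)]

a = [[{"id": 1, "v": "x"}, {"id": 2, "v": "y"}], [{"id": 3, "v": "z"}]]
b = [[{"id": 1, "v": "x"}, {"id": 2, "v": "y"}], [{"id": 3, "v": "Z"}]]
bad = mismatched_batches(a, b)  # only the second batch needs row-level work
```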
<p>Before moving on, let’s see whether the napkin math works out:</p>
<ul>
<li>Reading 10 MiB off SSD at <a href="https://github.com/sirupsen/napkin-math#numbers">200 us/MiB</a> will take ~2ms.</li>
<li>Hashing 10 MiB at <a href="https://github.com/sirupsen/napkin-math#numbers">5 ms/MiB</a> will take ~50ms.</li>
<li>6 SSD seeks to find the ID at <a href="https://github.com/sirupsen/napkin-math#numbers">100 us/seek</a> will take ~600 us.</li>
<li>1 network round-trip of the 16-byte hash at ~250 us.</li>
</ul>
<p>This is promising! In reality, it requires a little more SQL wrestling, for MySQL:</p>
<pre class="language-sql"><code class="language-sql">SELECT max(id) as max_id, MD5(CONCAT(
  MD5(GROUP_CONCAT(UNHEX(MD5(COALESCE(t.col_a, ''))))),
  MD5(GROUP_CONCAT(UNHEX(MD5(COALESCE(t.col_b, ''))))),
  MD5(GROUP_CONCAT(UNHEX(MD5(COALESCE(t.col_c, '')))))
)) as checksum FROM (
  SELECT id, col_a, col_b, col_c FROM `table`
  WHERE id &gt; @max_id_from_last_batch
  ORDER BY id ASC
  LIMIT 10000
) t
</code></pre>
<p>We seem to match our napkin math well:</p>
<figure><img src="/images/4c051bd7-ce00-4b60-ab50-3374366e4a71.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>This is the place to stop if you want to err on the side of safety. This is how we <a href="https://www.usenix.org/conference/srecon19emea/presentation/li">verify the integrity when we move shops between shards at Shopify</a>, which is what this approach is inspired by. However, to push performance further we need to get rid of some of this inline aggregation and hashing which eats up all our performance budget. At 50ms/batch, we’re still at <strong>~10 minutes to complete the checksumming of 100M records</strong>.</p>
<h2 id="iteration-6-checksumming-with-updated_at">Iteration 6: Checksumming with <code>updated_at</code></h2>
<p>Many database schemas have an <code>updated_at</code> column containing the timestamp at which the record was last updated. We can use this as the checksum for the row, assuming the granularity of the timestamp is sufficient (in many cases granularity is only seconds, but e.g. <a href="https://dev.mysql.com/doc/refman/8.0/en/fractional-seconds.html">MySQL supports fractional-second granularity</a>).</p>
<p>A huge performance advantage of this is that we can use an index on <code>updated_at</code>, and no longer read and hash the full 1 KiB rows! We now only need to read and hash the 64-bit timestamps. This cuts the data we need to read per batch from 10 MiB down to ~80 KiB!</p>
<p>Additionally, instead of a hash-based checksum, we can simply use a <code>SUM</code> of the <code>updated_at</code> values. This has the nice properties of being much faster and of not requiring the same sort order in the other database. That becomes very important when checksumming against a database that might not easily return records in the same order, e.g. ElasticSearch/Lucene.</p>
<p>Won’t summing so many records overflow? Nah, UNIX timestamps are currently approaching 32 bits, which means a 64-bit accumulator can sum around 2^32 ~= 4 billion of them without overflowing. Isn’t a sum a poor checksum? Sure, a hash is safer, but this is not crypto, just simple checksumming. It seems sufficient to me. It might not be in your case, in which case you can use MD5, SHA1, or CRC32, or fall back to the solution from iteration 5.</p>
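<p>The overflow argument checks out on the napkin:</p>

```python
# A signed 64-bit accumulator holds up to 2^63 - 1. With UNIX
# timestamps currently below 2^31, billions of timestamps can be
# summed before overflow is a concern.
now = 1_700_000_000                       # a recent UNIX timestamp, < 2**31
rows_until_overflow = (2**63 - 1) // now
assert rows_until_overflow > 4_000_000_000  # billions of rows of headroom
```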
<p>We still need an OFFSET sub-query to find the batch boundary, since we can’t rely on ids increasing by exactly 1: some ids may have been deleted:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token function">max</span><span class="token punctuation">(</span>id<span class="token punctuation">)</span> <span class="token keyword">as</span> max_id<span class="token punctuation">,</span>
  <span class="token function">SUM</span><span class="token punctuation">(</span>UNIX_TIMESTAMP<span class="token punctuation">(</span>updated_at<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">as</span> checksum
<span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span> <span class="token keyword">WHERE</span> id <span class="token operator">&lt;</span> <span class="token punctuation">(</span>
  <span class="token keyword">SELECT</span> id <span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
	<span class="token keyword">WHERE</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
	<span class="token keyword">LIMIT</span> <span class="token number">1</span> <span class="token keyword">OFFSET</span> <span class="token number">10000</span>
<span class="token punctuation">)</span> <span class="token operator">AND</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
</code></pre>
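<p>To make the order-independence concrete, here is a tiny Python sketch of the <code>SUM(updated_at)</code> comparison. The stores are hypothetical in-memory stand-ins for the two databases:</p>

```python
# Because addition is commutative, the two stores can return rows in
# different orders and still produce the same checksum.
def checksum(rows):
    return sum(r["updated_at"] for r in rows)

primary = [{"id": 1, "updated_at": 1_000}, {"id": 2, "updated_at": 1_005}]
replica = list(reversed(primary))            # same rows, different order
drifted = [{"id": 1, "updated_at": 1_000}, {"id": 2, "updated_at": 1_007}]

assert checksum(primary) == checksum(replica)   # order doesn't matter
assert checksum(primary) != checksum(drifted)   # a stale row shows up
```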
<p>Let’s take inventory:</p>
<ul>
<li>Reading 80 KiB of the <code>updated_at</code> index off SSD at <a href="https://github.com/sirupsen/napkin-math#numbers">1 us/8 KiB</a> will take ~10 us.</li>
<li>Summing 80 KiB at <a href="https://github.com/sirupsen/napkin-math#numbers">5 ns/64 bytes</a> will take ~6 us.</li>
<li>6 SSD seeks to find the ID at <a href="https://github.com/sirupsen/napkin-math#numbers">100 us/seek</a> will take ~600 us.</li>
<li>1 network round-trip of the 16-byte hash at ~250 us.</li>
</ul>
<p>In theory, this query should take milliseconds! In reality, there’s overhead involved, and we can’t assume in MySQL that reads are completely sequential as <a href="http://yoshinorimatsunobu.blogspot.com/2013/10/making-full-table-scan-10x-faster-in.html">fragmentation occurs</a> on indexes and the primary key.</p>
<figure><img src="/images/d9518021-e556-466e-b9aa-5e2f50351ae2.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Without the first iteration:</p>
<figure><img src="/images/917cb97e-2dd5-47ab-96bc-f612abece5f5.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>What’s going on? We were expecting single-digit milliseconds, but we’re seeing 20ms per batch! Something is wrong. <strong>20ms per batch still means a total checksumming time of 3 min.</strong> We’ve got more work to do.</p>
<h2 id="iteration-7-using-the-right-indexes">Iteration 7: Using the right indexes</h2>
<p>An <code>EXPLAIN</code> reveals we’re using the <code>PRIMARY</code> key for both queries, which means we’re loading the entire 1 KiB records, not just the 64-bit timestamps from the <code>updated_at</code> index.</p>
<p>Using indexes on <code>(id)</code> and <code>(id, updated_at)</code>, we need to scan <em>much</em> less data. It’s counter-intuitive to create an index on <code>id</code>, since the primary key already acts as an “index.” The problem is that in MySQL the primary key is a clustered index: it holds <em>all</em> the row data, not just the 64-bit id, so scanning it means scanning over <em>a lot</em> of record data. Clustered indexes are great in many cases for minimizing seeks, but problematic in others. Since these indexes already existed, this is another example of the MySQL optimizer not making the right decision for us. Forcing these indexes, our query becomes:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token function">max</span><span class="token punctuation">(</span>id<span class="token punctuation">)</span> <span class="token keyword">as</span> max_id<span class="token punctuation">,</span> 
  <span class="token function">SUM</span><span class="token punctuation">(</span>UNIX_TIMESTAMP<span class="token punctuation">(</span>updated_at<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token keyword">as</span> checksum
<span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
<span class="token keyword">FORCE</span> <span class="token keyword">INDEX</span> <span class="token punctuation">(</span><span class="token identifier"><span class="token punctuation">`</span>index_table_id_updated_at<span class="token punctuation">`</span></span><span class="token punctuation">)</span> 
<span class="token keyword">WHERE</span> id <span class="token operator">&lt;</span> <span class="token punctuation">(</span>
  <span class="token keyword">SELECT</span> id
	<span class="token keyword">FROM</span> <span class="token identifier"><span class="token punctuation">`</span>table<span class="token punctuation">`</span></span>
	<span class="token keyword">FORCE</span> <span class="token keyword">INDEX</span> <span class="token punctuation">(</span><span class="token identifier"><span class="token punctuation">`</span>index_table_id<span class="token punctuation">`</span></span><span class="token punctuation">)</span>
	<span class="token keyword">WHERE</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
  <span class="token keyword">LIMIT</span> <span class="token number">1</span> <span class="token keyword">OFFSET</span> <span class="token number">10000</span>
<span class="token punctuation">)</span>  <span class="token operator">AND</span> id <span class="token operator">&gt;</span> <span class="token variable">@max_id_from_last_batch</span>
</code></pre>
<figure><img src="/images/4852b7f2-f211-4ac2-b5d7-3633b594562a.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Nice, that’s quite a bit faster. Let’s remove the previous iterations to make the graphs we care about easier to see:</p>
<figure><img src="/images/fe4783ed-9ba2-4580-967a-e9958bc89856.png" alt="" width="640" height="480" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>5ms per batch is close to the theoretical floor we established in iteration 6! To checksum our full 100M records, this would take 50 seconds. We aren’t going to get much better than this as far as I can tell without modifying MySQL or pre-computing the checksums with e.g. triggers.</p>
<p>What about database constraints? Will this hog the entire database, as we struggled with in the early iterations? Fortunately, this solution is much less I/O-heavy. We need to read only 2-3 GiB of indexes in total to serve these queries. Spread over 50 seconds, that’s tens of MiB/s, so we should be good.</p>
<p>The last trick to consider is not checksumming <em>all</em> records on every pass. We could add a condition to only checksum records updated in the past few minutes, <code>updated_at &gt;= TIMESTAMPADD(MINUTE, -5, NOW())</code>, and do full checks only periodically. You would likely also want to ignore records updated in the past few seconds, to give replication time to catch up: <code>updated_at &lt;= TIMESTAMPADD(SECOND, -30, NOW())</code>. We <em>do</em> still want our fast way to scan all records, as this is by far the safest, and for a database with 10,000s of changes per second it also needs to be <em>fast</em>. The full check is also paramount when we bring up new databases and during development.</p>
<h2 id="what-do-we-do-on-a-mismatch">What do we do on a mismatch?</h2>
<p>Great, so we can now check whether batches are the same across two SQL databases quickly. We could build APIs for this to avoid users querying each other’s database. But what do we do when we have a mismatch?</p>
<p>We could send every record in the batch, but those queries are still fairly taxing, especially if we are checksumming batches of 100,000s of records to optimize the checksumming performance.</p>
<p>We can perform a binary search: if we are checksumming 100,000 records and encounter a mismatch, we cut the range into two queries checksumming 50,000 records each. Whichever half has the mismatch, we slice in two <em>again</em>, until we find the record(s) that don’t match!</p>
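<p>A minimal Python sketch of that bisection, where <code>check</code> is a hypothetical stand-in for the per-range checksum query against each database:</p>

```python
# Recursively split a mismatching id range until the drifted
# records are isolated.
def find_mismatches(ids, check_a, check_b):
    if check_a(ids) == check_b(ids):
        return []                 # checksums agree: nothing to do here
    if len(ids) == 1:
        return list(ids)          # narrowed down to a single bad record
    mid = len(ids) // 2
    return (find_mismatches(ids[:mid], check_a, check_b) +
            find_mismatches(ids[mid:], check_a, check_b))

a = {i: i * 10 for i in range(100_000)}  # id -> updated_at
b = dict(a)
b[42_123] = 0                            # one drifted record
check = lambda store: (lambda ids: sum(store[i] for i in ids))
bad = find_mismatches(range(100_000), check(a), check(b))  # -> [42123]
```

<p>Each level of the recursion costs two checksum queries over half the range, so isolating a single bad record among N takes on the order of 2·log2(N) queries.</p>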
<p>This approach is very similar to the Merkle tree synchronization I described in <a href="https://sirupsen.com/napkin/problem-9/">problem 9</a>. You can think of the approach we’ve landed on here as Merkle tree synchronization between two databases, but it’s simpler just to think of it as checksumming in batches. This approach is also quite similar to how <a href="http://tutorials.jenkov.com/rsync/overview.html">rsync works</a>.</p>
<h2 id="what-about-other-types-of-databases">What about other types of databases?</h2>
<p>While we covered SQL-to-SQL checksumming here, I’ve implemented a prototype of this method to check whether all records from a MySQL database make it to an ElasticSearch cluster. ElasticSearch, just like MySQL, can sum <code>updated_at</code> quickly. Most databases that support any kind of aggregation should work for this. Datastores like Memcached or Redis would require more thought, as they don’t implement aggregations; checking the integrity of a cache this way would be an interesting use-case, but it would require core changes to those datastores.</p>
<p>Hope you enjoyed this. I think this is a neat pattern that I hope to see more adoption of, and perhaps even see some databases and APIs adopt natively. Wouldn’t it be great if you could check that all your data is up-to-date just about everywhere with just a couple of API calls exchanging hashes?</p>
<p>P.S. A few weeks ago this newsletter hit 1,000 subscribers. I’m really grateful to all of you for listening in! It’s been quite fun to write these posts. It’s my favourite kind of recreational programming.</p>
<p>The <a href="https://github.com/sirupsen/napkin-math#numbers">napkin math reference</a> has also recently been extensively updated, in part to support this issue.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 13: Filtering with Inverted Indexes]]></title>
        <id>https://sirupsen.com/napkin/problem-13-filtering-with-inverted-indexes</id>
        <link href="https://sirupsen.com/napkin/problem-13-filtering-with-inverted-indexes"/>
        <updated>2020-11-08T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Database queries are all about filtering. Whether you’re finding rows with a particular name, within a price-range, or those created within a time-window. Trouble, however, ensues for most databases when you have many filters and none of them narrow down the results much.
This problem of filtering on many attributes efficiently has haunted me since Problem 3, and again in Problem 9. Queries that mass-filter are conceptually common in commerce merchandising/collections/discover]]></summary>
        <content type="html"><![CDATA[<p>Database queries are all about filtering. Whether you’re finding rows with a particular name, within a price-range, or those created within a time-window. Trouble, however, ensues for most databases when you have <em>many</em> filters and none of them narrow down the results much.</p>
<p>This problem of filtering on many attributes efficiently has haunted me since Problem 3, and again in Problem 9. Queries that mass-filter are conceptually common in commerce merchandising/collections/discovery/discounts where you expect to narrow down products by many attributes. Devilish queries of the type below might be used to create a “Blue Training Sneaker Summer Mega-Sale” collection. The merchant might have tens of millions of products, and each attribute might be on millions of products. In SQL, it might look something like the following:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> id
<span class="token keyword">FROM</span> products
<span class="token keyword">WHERE</span> color<span class="token operator">=</span>blue <span class="token operator">AND</span> <span class="token keyword">type</span><span class="token operator">=</span>sneaker <span class="token operator">AND</span> activity<span class="token operator">=</span>training 
  <span class="token operator">AND</span> season<span class="token operator">=</span>summer <span class="token operator">AND</span> inventory <span class="token operator">&gt;</span> <span class="token number">0</span> <span class="token operator">AND</span> price <span class="token operator">&lt;=</span> <span class="token number">200</span> <span class="token operator">AND</span> price <span class="token operator">&gt;=</span> <span class="token number">100</span> 
</code></pre>
<p>These are <em>especially</em> challenging when you expect the database to return a
result in a time-frame that’s suitable for a web request (sub 10 ms).
Unfortunately, classic relational databases are typically not suited for serving
these types of queries efficiently on their B-Tree based indexes for a few
reasons. The two arguments that top the list for me:</p>
<ol>
<li><strong>The data doesn’t conform to a strict schema.</strong> A product might have 100s to
1000s of attributes we need to efficiently filter against. This might mean
having extremely wide rows, with 100s of indexes, which leads to a number of
other issues.</li>
<li><strong>Databases struggle to merge multiple indexes.</strong>
<ol>
<li>Index merges aren’t going to get you a &lt; 10 ms response, and creating
composite indexes is impractical if you are filtering by 10s to 100s of
rules. I wrote a <a href="/index-merges">separate post</a> about that.</li>
<li>While MySQL/Postgres can filter by <code>price</code> and <em>then</em> <code>type</code> to serve a
query, it can’t filter efficiently by scanning and cross-referencing
multiple indexes  simultaneously (this requires Zig-Zag joins, see
<a href="https://github.com/cockroachdb/cockroach/issues/23520">here</a> for more context).</li>
</ol>
</li>
</ol>
<p>Using B-Trees for mass-filtering deserves deeper thought and napkin math (these two problems don’t seem impossible to solve), and given how much this problem troubles me, I might follow up with more detail in another issue. It’s also worth noting that Postgres and MySQL both implement inverted indexes, so those could be used instead of the implementation below.</p>
<p>But in this issue we will investigate the inverted index as a possible data-structure for serving many-filter queries efficiently. The inverted index (explained below) is the data-structure that powers search. We will be using Lucene, which is the most popular open-source implementation of the inverted index. It’s what powers ElasticSearch and Solr, the two most popular open-source search engines. You can think of Lucene as the RocksDB/InnoDB of search. Lucene is written in Java.</p>
<p>Why would we want to use a search engine to filter data? Because search as a problem is a superset of our filtering problem. Search is fundamentally about turning a language query <code>blue summer sneakers</code> into a series of filtering operations: intersect products that match <code>blue</code>, <code>summer</code>, and <code>sneaker</code>. Search has a language component, e.g. turning <code>sneakers</code> into <code>sneaker</code>, but the filtering problem is the same. If search is fundamentally language + filtering, perhaps we can use <em>just</em> the filtering bit? Search is typically <em>not</em> implemented on top of B-Tree indexes (what classic databases use), but use an inverted index. Perhaps that can resolve problem (1) and (2) above?</p>
<p>The inverted index is best illustrated through a simple drawing:</p>
<figure><img src="/images/14930fea-d1c1-4b03-b975-0b58431ce592.png" alt="" width="1641" height="1084" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>In our inverted index, each attribute (color, type, activity, ..) maps to a list of product ids that have that attribute. We can create a filter for <code>blue</code>, <code>summer</code>, and <code>sneakers</code> by finding the intersection of product ids that match <code>blue</code>, <code>summer</code>, and <code>sneakers</code> (ids that are present for all terms).</p>
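<p>The core operation can be sketched with the classic two-pointer walk over sorted posting lists. Lucene’s real implementation adds skip structures and other tricks, so treat this as a napkin model only; the postings below are made up:</p>

```python
# Intersect two sorted lists of product ids by advancing whichever
# pointer lags behind; O(len(a) + len(b)) comparisons.
def intersect(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical postings: attribute -> sorted product ids.
index = {
    "blue":    [1, 4, 7, 9, 12],
    "sneaker": [2, 4, 9, 12, 15],
    "summer":  [4, 5, 9, 10, 12, 17],
}
result = intersect(intersect(index["blue"], index["sneaker"]), index["summer"])
# result == [4, 9, 12]
```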
<p>Let’s say we have 10 million products, and we are filtering by 3 attributes with 1.2 million products each. What can we expect the query time to be?</p>
<p>Let’s assume the product ids are each stored as an uncompressed 64-bit integer in memory. We’d expect each attribute’s list to be <code>1.2 million * 64 bit ~= 10mb</code>, or <code>10 * 3 = 30mb</code> total. We assume the intersection algorithm is efficient and reads all the data roughly once (in reality there’s a lot of smart skipping involved, but this is napkin math; we won’t go into detail on how to efficiently merge two sets). We can <a href="https://github.com/sirupsen/napkin-math#numbers">read memory at a rate of <code>1 Mb/100 us</code></a> (sequential reads from SSD are only about twice as slow), so serving the query should take ~<code>0.1 ms * 30 = 3ms</code>. I <a href="https://gist.github.com/sirupsen/0c1d388d94d9de611c54df866e6d1708">implemented this in Lucene</a>, and the napkin math lines up well with reality: in my implementation, this takes ~3-5ms! That’s great news for solving the filtering problem with an inverted index. That’s fairly fast.</p>
<p>Now, does this scale linearly? Including more attributes means scanning more memory. With 8 attributes, for example, we’d expect to scan ~<code>10mb * 8 = 80mb</code> of memory, which should take ~<code>0.1ms * 80 = 8ms</code>. In reality, however, this takes <code>30-60ms</code>, putting our napkin math close to an order of magnitude off. Most likely we have exhausted the CPU’s L3 cache and have to go to main memory more; we hit a similar boundary going from 3 to 4 attributes. It might also suggest there’s room for optimization in Lucene.</p>
<figure><img src="/images/1b9cb6e5-ca15-4a51-9acb-ea83d1facbba.png" alt="" width="1691" height="1030" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Another interesting thing to note is that the inverted index file for our problem is roughly ~261mb. I won’t bore you with the calculation here, but given <a href="https://gist.github.com/sirupsen/0c1d388d94d9de611c54df866e6d1708">the implementation</a>, we can estimate that each product id takes <a href="https://www.wolframalpha.com/input/?i=261mb+%2F+%28257+*+39098+%2B+65+*+153759+%2B+257+*+1209758%29">up ~6.3 bits</a>. This is <em>much</em> smaller than the 64 bits per product id we estimated; the JVM overhead, however, likely makes up for it. Additionally, Lucene doesn’t just store the product ids, but also various other metadata alongside them.</p>
<p>Based on this, it’s looking feasible to use Lucene for mass filtering! While we don’t have an estimate from SQL to measure against yet (and won’t have in this issue), I can assure you this is faster than we’d get with something naive.</p>
<p>But why is it feasible even if 4 attributes take ~20ms (as we can see in the diagram)? Because that’s acceptable-ish performance in a worst-case scenario. In most cases when you’re filtering, some of the attributes will significantly narrow the search space. Since we aren’t that close to the lower bound of performance (what our napkin math tells us), it suggests we might not be constrained by memory bandwidth, but by computation. That in turn suggests threaded execution could speed it up, and sure enough, it does: with 8 threads in Lucene’s read thread pool, we can serve the query for 4 attributes in ~6ms! That’s <em>faster</em> than our 8ms lower-bound. The reason is that Lucene has optimizations built in to skip over potentially large blocks of product ids when intersecting, meaning we don’t have to read all the product ids in the inverted index.</p>
<p>In reality, to go further, we’d want to do more napkin math, but this is showing a lot of promise! Besides more calculations, we’ve left out two big pieces here: sorting and indexing numbers. If there’s interest, I might follow up with that another time. But this is plenty for one issue!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 12: Recommendations]]></title>
        <id>https://sirupsen.com/napkin/problem-12-recommendations</id>
        <link href="https://sirupsen.com/napkin/problem-12-recommendations"/>
        <updated>2020-09-27T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Since last, I sat down with Adam and Jerod from The Changelog podcast to discuss Napkin
Math! This ended up yielding quite a few new subscribers,
welcome everyone!
For today’s edition: Have you ever wondered how recommendations work on a site
like Amazon or Netflix?
]]></summary>
        <content type="html"><![CDATA[<p>Since last, I sat down with Adam and Jerod from <a href="https://changelog.com/podcast/412">The Changelog podcast to discuss Napkin
Math</a>! This ended up yielding quite a few new subscribers,
welcome everyone!</p>
<p>For today’s edition: Have you ever wondered how recommendations work on a site
like Amazon or Netflix?</p>
<figure><img src="/images/a1f1f9c3-be46-4f82-b1f8-32f24e736446.jpeg" alt="" width="2470" height="1204" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>First we need to define similarity/relatedness. There are many ways to do this. We
could determine similarity by having a human label which products are relevant
to each other: if you’re buying black dress shoes, you might be interested in
black shoe polish. But if you’ve got millions of products, that’s a lot of
work!</p>
<p>Instead, most simple recommendation algorithms are based on what’s called
“collaborative filtering”: we find other users who seem to be similar to you.
If we know you’ve got a big overlap in watched TV shows with another user,
perhaps you might like something else that user liked but you haven’t watched yet?
This recommendation method is <em>much</em> less laborious than having a human manually
label content (in reality, big companies do <a href="https://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/">human labeling <em>and</em>
collaborative filtering</a> <em>and</em> <a href="https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system/">other dark magic</a>).</p>
<p>In the example below, User 3 looks similar to User 1, so we can infer that they
<em>might</em> like Item D too. In reality, the more columns (items) we can use to
compare, the better the results.</p>
<figure><img src="/images/64eda434-833b-4e6b-b7e0-9084ebd0a52e.png" alt="" width="1949" height="951" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Based on this, we can design a simple algorithm for powering our
recommendations! With <code>N</code> items and <code>M</code> users, we can create the matrix of <code>M x N</code> cells shown in the drawing as a two-dimensional array, representing
check-marks by <code>1</code> and empty cells by <code>0</code>. We can loop through each user and
compare them with every other user, preferring recommendations from users we have more
check-marks in common with. This is a simplification of <a href="https://www.machinelearningplus.com/nlp/cosine-similarity">cosine similarity</a>,
the simple vector math typically used to compare the similarity of two
vectors. The ‘vector’ here is the 0s and 1s for each product for the user.
For the purpose of this article, it’s not terribly important to understand this
in detail.</p>
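<p>As a minimal sketch of the idea (the users and items below are made up for illustration, not taken from the drawing), cosine similarity on 0/1 vectors boils down to counting shared items and normalizing by how many items each user has:</p>

```python
import math

# Cosine similarity between two 0/1 'watched/bought' vectors. For binary
# vectors the dot product is just the count of items both users share.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(a)) * math.sqrt(sum(b))  # works since 1*1 == 1
    return dot / mag if mag else 0.0

# Hypothetical users over items A..D: 1 = interacted, 0 = not.
user1 = [1, 1, 0, 1]
user2 = [0, 1, 1, 0]
user3 = [1, 1, 0, 0]

# user3 shares more history with user1 than user2 does, so user1's items
# that user3 hasn't seen yet become recommendation candidates for user3.
print(cosine_similarity(user1, user3))  # ~0.82
print(cosine_similarity(user1, user2))  # ~0.41
```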
<p><strong>How long does it take to run this algorithm to find similar users for a million users
and a million products?</strong></p>
<p>Each user would have a million bits to represent the columns. That’s <code>10^6 bits = 125 kB</code> per user. For each user, we’d need to look at every other user: <code>125 kB/user * 1 million users = 125 GB</code>. 125 GB is not completely unreasonable to
hold in memory, and since it’s sequential access, even if this was SSD-backed
and not all in memory, it’d still be fast. We can read memory at <a href="https://github.com/sirupsen/napkin-math">~10 GB/s</a>,
so that’s 12.5 seconds to find the most similar user for each user. That’s way
too slow to run as part of a web request!</p>
<p>If we precomputed this in the background on a single machine, it’d take
<code>12.5 s/user * 1 million users = 12.5 million seconds ~= 144 days ~= 20 weeks</code>.
That sounds frightening, but this is an ‘embarrassingly parallel’ problem. It
means we can process User A’s recommendations on one machine, User B’s on
another, and so on. This is what a batch compute job on e.g. Spark would do.
This is really <code>12.5 million CPU-seconds</code>. If we had 3000 cores, it’d take us
about an hour and cost us <code>3000 cores * 1 hour * $0.02 per core-hour = $60</code>. Most likely these
recommendations would earn us way more than $60, so even this is not too bad!
When people talk about Big Data computations, these are the types of large jobs
they’re referring to.</p>
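<p>The arithmetic above can be sketched out directly. The bandwidth and pricing figures are the assumptions from the text, not measurements:</p>

```python
# Napkin math for the user-to-user scan, using the assumptions from the text.
BITS_PER_USER = 10**6                  # one bit per item
BYTES_PER_USER = BITS_PER_USER / 8     # 125 kB
USERS = 10**6
MEM_BANDWIDTH = 10 * 10**9             # ~10 GB/s sequential read

# Comparing one user against everyone means scanning every user's row.
scan_bytes = BYTES_PER_USER * USERS            # 125 GB
seconds_per_user = scan_bytes / MEM_BANDWIDTH  # 12.5 s

# Doing that for every user, spread across a cluster.
total_cpu_seconds = seconds_per_user * USERS   # 12.5 million CPU-seconds
CORES = 3000
hours = total_cpu_seconds / CORES / 3600       # ~1.2 hours
cost = CORES * hours * 0.02                    # at $0.02 per core-hour
print(seconds_per_user, hours, cost)
```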
<p>Even with this simple algorithm, there is <em>plenty</em> of room for optimization.
There will be a lot of zeros in such a wide matrix (it’s ‘sparse’), so we could store
vectors of item ids instead. We could quickly skip users if they have fewer 1s
than the most similar user we’ve already matched with. Additionally, matrix
operations like this one can be run efficiently on a GPU. If I knew more about
GPU programming, I’d do the napkin math on that! It’s on the list for future
editions. The good thing is that the libraries used for computations like this
usually do these types of optimizations for you.</p>
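<p>The sparse-matrix optimization can be sketched like this: store only the ids of the items each user interacted with, and compute overlap as a set intersection (the user and item ids here are made up for illustration):</p>

```python
# Sparse representation: instead of a million-wide bit row per user, store
# only the ids of items they interacted with. Overlap becomes set intersection.
user_items = {
    "user1": {3, 17, 42},
    "user2": {17, 42, 99},
    "user3": {7},
}

def overlap(a, b):
    return len(user_items[a] & user_items[b])

# Most similar user to user1 by raw overlap count:
best = max((u for u in user_items if u != "user1"),
           key=lambda u: overlap("user1", u))
print(best)  # user2
```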
<p>Cool, so this naive algorithm is feasible for a first iteration
of our recommendation system. We compute the recommendations periodically on
a large cluster and shove them into MySQL/Redis/whatever for quick access on our
site.</p>
<p>But there’s a problem… If I just added a spatula to my cart, don’t you want
to immediately recommend other kitchen utensils to me? Our current algorithm is
great for general recommendations, but it isn’t real-time enough to assist
a shopping session. We can’t wait for the batch job to run again. By that time,
we’ll already have bought a shower curtain and forgotten to buy a curtain rod,
since the recommendation never surfaced. Bummer.</p>
<p>What if instead of a big offline computation to figure out user-to-user
similarity, we do a big offline computation to compute item-to-item similarity?
This is what <a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf">Amazon did back in 2003</a> to solve this problem. Today, they
likely do something much more advanced.</p>
<p>We could devise a simple item-to-item similarity algorithm that, for each
item, counts which items customers who bought that item <em>also</em> bought most
often.</p>
<p>The output of this algorithm would be something like the matrix below. Each cell
is the count of customers who bought both items. For example, 17
people bought both item 4 and item 1, which in comparison to the other counts means
it might be a great idea to suggest item 1 to people buying item 4, or
vice-versa!</p>
<figure><img src="/images/49676787-b801-4066-aa59-f6a28ee80d8d.png" alt="" width="1648" height="881" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
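<p>The counting itself can be sketched in a few lines. The orders below are hypothetical, and a real job would run this over billions of orders in a batch framework rather than in one process:</p>

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: the set of items each customer bought.
orders = [
    {"item1", "item4"},
    {"item1", "item2", "item4"},
    {"item2", "item3"},
]

# For every pair of items, count how many customers bought both.
pair_counts = Counter()
for items in orders:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# 'Customers who bought item 4 also bought...' is then a lookup:
also_bought = {pair: n for pair, n in pair_counts.items() if "item4" in pair}
print(also_bought)
```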
<p>This algorithm has even <em>worse</em> complexity than the previous one, because in the worst
case we have to look at each item for each item for each customer: <code>O(N^2 * M)</code>.
In reality, however, most customers haven’t bought that many items, which makes
the complexity generally <code>O(NM)</code>, like our previous algorithm. This means that,
ballpark, the running time is roughly the same (an hour for $60).</p>
<p>Now we’ve got a much more versatile computation for recommendations. If
we store all these recommendations in a database, we can immediately, as part of
serving the page, tell the user which other products they might like based on the
item they’re currently viewing, their cart, past orders, and more. The two
recommendation algorithms complement each other. The first is good for
broad, home-page recommendations, whereas the item-to-item similarity is good
for real-time discovery on e.g. product pages.</p>
<p>My experience with recommendations is quite limited; if you work with these
systems and have any corrections, please let me know! A big part of my incentive
for writing these posts is to explore and learn for myself. Most articles that
talk about recommendations focus on the math involved; you’ll easily be able to
find those. Here I wanted to focus more on the computational aspect and not get
lost in the weeds of linear algebra.</p>
<p>P.S. Do you have experience running Apache Beam/Dataflow at scale? I’m very
interested in talking to you.</p>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 11: Circuit Breakers]]></title>
        <id>https://sirupsen.com/napkin/problem-11-circuit-breakers</id>
        <link href="https://sirupsen.com/napkin/problem-11-circuit-breakers"/>
        <updated>2020-08-22T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[You may have heard of a “circuit breaker” in the context of building resilient
systems: the art of building reliable systems from unreliable components. But what
is a circuit breaker?
Let’s set the scene for today’s napkin math post with a scenario. It’s
pretty close to what our code conceptually looked like
when we started working on resiliency at Shopify back in 2014.
Imagine a function like this (pseudo-Javascript-C-ish is a good common
denominator) ]]></summary>
        <content type="html"><![CDATA[<p>You may have heard of a “circuit breaker” in the context of building resilient
systems: the art of building reliable systems from unreliable components. But what
is a circuit breaker?</p>
<p>Let’s set the scene for today’s napkin math post with a scenario. It’s
pretty close to what our code conceptually looked like
when we started working on resiliency at Shopify back in 2014.</p>
<p>Imagine a function like this (pseudo-Javascript-C-ish is a good common
denominator) that’s part of rendering your commerce storefront:</p>
<pre class="language-javascript"><code class="language-javascript"><span class="token keyword">function</span> <span class="token function">cart_and_session</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
  session <span class="token operator">=</span> <span class="token function">query_session_store_for_session</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>session<span class="token punctuation">)</span> <span class="token punctuation">{</span>
    user <span class="token operator">=</span> <span class="token function">query_db_for_user</span><span class="token punctuation">(</span>session<span class="token punctuation">[</span><span class="token string">&#x27;id&#x27;</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>

  cart <span class="token operator">=</span> <span class="token function">query_carts_store_for_cart</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>cart<span class="token punctuation">)</span> <span class="token punctuation">{</span>
    products <span class="token operator">=</span> <span class="token function">query_db_for_products</span><span class="token punctuation">(</span>cart<span class="token punctuation">.</span>line_items<span class="token punctuation">)</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
</code></pre>
<p>This calls three different external data-stores: (1) Session store, (2) Cart
store, (3) Database.</p>
<p>Let’s now imagine that the session store is unresponsive. Not down,
<em>unresponsive</em>: meaning every single query to it times out. Default timeouts are
usually hilariously high, so let’s assume a 5 second timeout.</p>
<p>Let’s say we’ve got 4 workers all serving requests with the above code. With the session store timing out,
each worker would be spending 5 seconds in <code>query_session_store_for_session</code> on
<em>every</em> request! This seems bad, because our response time is now at least 5
seconds. But it’s way worse than that. We’re almost certainly <em>down</em>.</p>
<p>Why are we down when a single, auxiliary data-store is timing out? Consider that
before, requests might have taken 100 ms to serve, but now they take at least 5
seconds. Your workers can only serve 1/50th of the requests they could
prior to the session store outage! Unless you’re 50x over-provisioned (not a
great idea), your workers are all busy waiting for the 5s timeout, and the queue
behind the workers is slowly filling up…</p>
<figure><img src="/images/5c6d3d44-9b57-4b75-9f00-44dea022b535.png" alt="" width="1778" height="1307" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>What can we do about this? We could reduce the timeout, which would be a good idea, but it only changes the shape of the problem; it doesn’t eliminate it. But we can implement a circuit breaker! The idea of the
circuit breaker is that if we’ve seen a timeout (or an error of any other kind we
specify) a few times, then we simply raise immediately for the next 15 seconds. When
the circuit is raising, the circuit breaker is “open” (this
vocabulary tripped me up at first: it’s not “closed”). After the 15
seconds, we’ll check whether the resource is healthy again by letting another
request through. If it fails, we’ll open the circuit again.</p>
<p>Won’t raising from the circuit just render a 500? The assumption is that you’ve
made your code resilient, so that if the circuit is open for the session
store, you simply fall back to assuming that people aren’t logged in, instead of letting an exception trickle up the stack.</p>
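<p>That fallback pattern can be sketched in a few lines (this is not Semian’s API, just an illustration of degrading gracefully when the breaker raises):</p>

```python
# Sketch of the fallback pattern: if the session store's circuit is open,
# degrade to an anonymous session instead of failing the whole request.
class CircuitOpenError(Exception):
    pass

def query_session_store():
    # Pretend the circuit breaker is open and raises instantly.
    raise CircuitOpenError()

def cart_and_session():
    try:
        session = query_session_store()
    except CircuitOpenError:
        session = None  # fall back: treat the visitor as logged out
    return session

print(cart_and_session())  # None
```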
<p>We can imagine a simple circuit being implemented like below. It has <em>numerous</em>
problems, but it should paint the basic picture of a circuit.</p>
<pre class="language-javascript"><code class="language-javascript">circuits <span class="token operator">=</span> <span class="token punctuation">{</span><span class="token punctuation">}</span>
<span class="token keyword">function</span> <span class="token function">circuit_breaker</span><span class="token punctuation">(</span><span class="token parameter"><span class="token keyword">function</span> f</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
  <span class="token comment">// Circuit&#x27;s closed, everything&#x27;s likely normal!</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>state <span class="token operator">===</span> <span class="token string">&quot;closed&quot;</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
    <span class="token keyword">try</span> <span class="token punctuation">{</span>
      <span class="token function">f</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span> <span class="token keyword">catch</span><span class="token punctuation">(</span>err<span class="token punctuation">)</span> <span class="token punctuation">{</span>
      <span class="token comment">// Uh-oh, an error occured. Let&#x27;s check if it&#x27;s one we should possibly</span>
      <span class="token comment">// open the circuit on (like a timeout)</span>
      <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token function">circuit_breaker_error</span><span class="token punctuation">(</span>err<span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
        errors <span class="token operator">=</span> circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>errors <span class="token operator">+=</span> <span class="token number">1</span><span class="token punctuation">;</span>
        <span class="token comment">// 3 errors have happened, let&#x27;s open the circuit!</span>
        <span class="token keyword">if</span> <span class="token punctuation">(</span>errors <span class="token operator">&gt;</span> <span class="token number">3</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
          circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>state <span class="token operator">=</span> <span class="token string">&quot;open&quot;</span><span class="token punctuation">;</span>
          circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>opened_at <span class="token operator">=</span> Time<span class="token punctuation">.</span>now<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
      <span class="token punctuation">}</span>
    <span class="token punctuation">}</span>
  <span class="token punctuation">}</span>

  <span class="token keyword">if</span> <span class="token punctuation">(</span>circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>state <span class="token operator">===</span> <span class="token string">&quot;open&quot;</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
    <span class="token comment">// If 15 seconds have passed, let&#x27;s try to close the circuit to let requests</span>
    <span class="token comment">// through again!</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span>Time<span class="token punctuation">.</span>now <span class="token operator">-</span> circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>opened_at <span class="token operator">&gt;</span> <span class="token number">15</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
      circuits<span class="token punctuation">[</span>f<span class="token punctuation">.</span>id<span class="token punctuation">]</span><span class="token punctuation">.</span>state <span class="token operator">=</span> <span class="token string">&quot;closed&quot;</span><span class="token punctuation">;</span>
      <span class="token keyword">return</span> <span class="token function">circuit_breaker</span><span class="token punctuation">(</span>f<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span>
  <span class="token punctuation">}</span>
<span class="token punctuation">}</span>
</code></pre>
<p>What position does that put us in for our session scenario? Once again, it’s best
illustrated with a drawing. Note, I’ve compressed the timeout requests a bit
here (the drawing is not to scale) to fit some ‘normal’ (blue) requests after the
circuits open:</p>
<figure><img src="/images/4f78974a-657c-48be-8e1c-235b21fb23f5.png" alt="" width="1748" height="939" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>After the circuits have all opened, we’re golden! Back to normal despite the
slow resource! The trouble comes when our 15 seconds of open circuit have
passed: then we’re back to needing 3 failures to open the circuits again and
bring us back to capacity. That’s <code>3 * 5s = 15s</code> where we can only serve 3
requests, rather than the normal <code>15s/100ms = 150</code>!</p>
<p>To do some napkin math: since we spend 15 seconds waiting for timeouts to
open the circuits, and 15 seconds with open circuits, we can estimate that we’re
at ~50% capacity with this circuit breaker. The drawing also makes this clear. That’s <em>a lot</em> better than before,
and likely means we’ll remain up if we’re sufficiently over-provisioned.</p>
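<p>That ~50% estimate falls straight out of the numbers used above, which can be sketched as napkin math:</p>

```python
# Rough capacity estimate for the breaker above, using the numbers from the
# text: 5 s resource timeout, 3 errors to open, 15 s open circuit, 100 ms
# normal response time per request, per worker.
resource_timeout = 5.0
error_threshold = 3
error_timeout = 15.0      # time the circuit stays open
normal_response = 0.1

# One full cycle: 3 timed-out requests to re-open, then 15 s of open circuit.
cycle = error_threshold * resource_timeout + error_timeout   # 30 s

# Requests served per cycle: the 3 slow ones, plus normal throughput while open.
served = error_threshold + error_timeout / normal_response   # 153
possible = cycle / normal_response                           # 300 if healthy
print(served / possible)  # 0.51 -> roughly 50% capacity
```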
<p>Now we could start introducing some complexity to the circuit to increase our
capacity. What if we only allowed failing <em>once</em> to re-open the circuit? What if
we decreased the timeout from 5s to 1s? What if we increased the time the
circuit is open from 15 seconds to 45 seconds? What if we open the circuit after 2 failures rather than 3?</p>
<p>Answering those questions is overwhelming. How on earth will we figure out how to configure the circuit so we’re not down when resources are slow? It might have been somewhat simple to
realize it was ~50% capacity with the numbers I’d chosen, but add more
configuration options and we’re in deep trouble.</p>
<p>This brings me to what I think is the most important part of this post: your
circuit breaker is almost certainly configured wrong. When we started
introducing circuit breakers (and bulkheads, another resiliency concept) to
production at Shopify in 2014, we severely underestimated how difficult they
are to configure. It’s puzzling to me how little has been written about
this. Most assume that you drop the circuit in, choose some decent defaults, and off you
go. But in my experience, you’ll find out in your very next outage that it wasn’t good enough… that’s a
less than ideal feedback loop.</p>
<p>The circuit breaker implementation I’m most familiar with is the one in
the <a href="https://github.com/shopify/semian">Ruby resiliency library Semian</a>. To my knowledge, it’s one of the
more complete implementations out there, but all the options make it a <em>devil</em>
to configure. Semian is the implementation we use in all applications at Shopify.</p>
<p>There are at least five configuration parameters relevant for circuit breakers:</p>
<ul>
<li><code>error_threshold</code>. The number of errors a worker must see before
opening the circuit, that is, before it starts rejecting requests instantly. In our
example, it’s hard-coded to 3.</li>
<li><code>error_timeout</code>. The amount of time in seconds until we try to query the
resource again, i.e. how long the circuit stays open. 15 seconds in our example.</li>
<li><code>success_threshold</code>. The number of successes on the circuit before closing it
again, that is, before accepting all requests to the circuit again. In our example
above, this is hard-coded to 1. A number &gt; 1 requires a bit more logic,
which better implementations like Semian will take care of.</li>
<li><code>resource_timeout</code>. The timeout to the resource/data-store protected by the circuit breaker. 5 seconds in our example.</li>
<li><code>half_open_resource_timeout</code>. Timeout for the resource in seconds when the
circuit is checking whether the resource might be back to normal, after the <code>error_timeout</code>. This state is called <code>half_open</code>. Most circuit breaker implementations (including our simple one
above) assume that this is the same as the ‘normal’ timeout for the resource.
The bet Semian makes is that during steady-state we can tolerate a higher
resource timeout, but during failure, we want it to be lower.</li>
</ul>
<p>My co-worker Damian Polan and I came up with some napkin math for what we
think is a good way to think about tuning it. You can read more in <a href="https://shopify.engineering/circuit-breaker-misconfigured">this
post</a> on the Shopify blog. The post includes the ‘circuit breaker
equation’, which will help you figure out the right configuration for your
circuit. If you’ve never thought about something along these lines and aren’t
heavily over-provisioned, I can almost guarantee you that your circuit breaker
is configured wrong. Instead of re-hashing the post, I’d rather send you to <a href="https://shopify.engineering/circuit-breaker-misconfigured">read
it</a> and leave you with the equation below as a teaser. If you’ve ever put a circuit breaker in production, you need to read that post; otherwise, chances are you haven’t actually put a <em>working</em> circuit breaker in production.</p>
<figure><img src="/images/81f5ee49-9539-4235-8091-54f3ae34170b.png" alt="" width="1028" height="292" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Hope you enjoyed this post on resiliency napkin math. Until next time!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 10: MySQL transactions per second vs fsyncs per second]]></title>
        <id>https://sirupsen.com/napkin/problem-10-mysql-transactions-per-second</id>
        <link href="https://sirupsen.com/napkin/problem-10-mysql-transactions-per-second"/>
        <updated>2020-07-17T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for another napkin problem!
Since the beginning of this newsletter I’ve posed problems for you to try to
answer. Then in the next month’s edition, you hear my answer. Talking with a few
of you, it seems many of you read these as posts regardless of their
problem-answer format.
That’s why I’ve decided to experiment with a simpler format: posts where I both
present a problem and solution in one go. This one will be long, since it’ll
include an answer to last month’s.]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for another napkin problem!</p>
<p>Since the beginning of this newsletter I’ve posed problems for you to try to
answer. Then in the next month’s edition, you hear my answer. Talking with a few
of you, it seems many of you read these as posts regardless of their
problem-answer format.</p>
<p>That’s why I’ve decided to experiment with a simpler format: posts where I both
present a problem and solution in one go. This one will be long, since it’ll
include an answer to last month’s.</p>
<p>Hope you enjoy this format! As always, you are encouraged to reach out with
feedback.</p>
<h2 id="problem-10-is-mysqls-maximum-transactions-per-second-equivalent-to-fsyncs-per-second">Problem 10: Is MySQL’s maximum transactions per second equivalent to fsyncs per second?</h2>
<p>How many transactions (‘writes’) per second is MySQL capable of?</p>
<p>A naive model of how a write (a SQL insert/update/delete) to an ACID-compliant
database like MySQL works might be the following (this applies equally to
Postgres, or any other relational/ACID-compliant databases, but we’ll
proceed to work with MySQL as it’s the one I know best):</p>
<ol>
<li>Client sends query to MySQL over an existing connection: <code>INSERT INTO products (name, price) VALUES (&#x27;Sneaker&#x27;, 100)</code></li>
<li>MySQL inserts the new record to the write-ahead-log (WAL) and calls
<code>fsync(2)</code> to tell the operating system to tell the filesystem to tell the
disk to make <em>sure</em> that this data is <em>for sure</em>, pinky-swear committed to
the disk. This step, being the most complex, is depicted below.</li>
<li>MySQL inserts the record into an in-memory page in the backing storage engine
(InnoDB) so the record will be visible to subsequent queries. Why commit to
the storage engine <em>and</em> the WAL? The storage engine is optimized for serving
queries, and the WAL for writing data in a safe manner — we
can’t serve a <code>SELECT</code> efficiently from the WAL!</li>
<li>MySQL returns <code>OK</code> to the client.</li>
<li>MySQL eventually calls <code>fsync(2)</code> to ensure InnoDB commits the page to disk.</li>
</ol>
<figure><img src="/images/87759326-21adeb00-c7dc-11ea-89c7-559ca11530e8.png" alt="Napkin_10" width="1759" height="1198" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>In the event of power-loss at any of these points, the behaviour is well-defined,
without nasty surprises, upholding our dear ACID-compliance.</p>
<p>Splendid! Now that we’ve constructed a naive model of how a relational database
might handle writes safely, we can consider the latency of inserting a new
record into the database. When we consult <a href="https://github.com/sirupsen/napkin-math">the reference napkin numbers</a>, we
see that the <code>fsync(2)</code> in step (2) is by <em>far</em> the slowest operation in the
blocking chain at 1 ms.</p>
<p>For example, the network handling at step (1) takes roughly ~<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>10</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">10\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">10</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span> (TCP Echo
Server is what we can classify as ‘the TCP overhead’). The <code>write(2)</code> itself
prior to the <code>fsync(2)</code> is also negligible at ~<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>10</mn><mtext> </mtext><mi>μ</mi><mtext>s</mtext></mrow><annotation encoding="application/x-tex">10\,\mu\text{s}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8389em;vertical-align:-0.1944em"></span><span class="mord">10</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">μ</span><span class="mord text"><span class="mord">s</span></span></span></span></span>, since this system call
essentially just writes to an in-memory buffer (the ‘page cache’) in the kernel.
This doesn’t guarantee the actual bits are committed on disk, which means an
unexpected loss of power would erase the data, dropping our ACID-compliance on
the floor. Calling <code>fsync(2)</code> guarantees us the bits are persisted on the disk,
which will survive an unexpected system shutdown. The downside is that it’s 100x
slower.</p>
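<p>You can observe the gap yourself with a small sketch timing <code>write(2)</code> against <code>fsync(2)</code>. The absolute numbers vary wildly by disk and filesystem; the large ratio is the point:</p>

```python
import os
import tempfile
import time

# Time write(2) (page cache only) vs fsync(2) (forced to stable storage).
fd, path = tempfile.mkstemp()
data = b"x" * 512

start = time.perf_counter()
os.write(fd, data)
write_us = (time.perf_counter() - start) * 1e6

start = time.perf_counter()
os.fsync(fd)
fsync_us = (time.perf_counter() - start) * 1e6

print(f"write: {write_us:.0f} us, fsync: {fsync_us:.0f} us")
os.close(fd)
os.remove(path)
```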
<p>With that, we should be able to form a simple hypothesis on the maximum
throughput of MySQL:</p>
<blockquote>
<p>The maximum theoretical throughput of MySQL is equivalent to the maximum
number of <code>fsync(2)</code> per second.</p>
</blockquote>
<p>We know that <code>fsync(2)</code> takes 1 ms from earlier, which means we would naively
expect that MySQL would be able to perform in the neighbourhood of: <code>1s / 1ms/fsync = 1000 fsyncs/s = 1000 transactions/s</code> .</p>
<p>Excellent. We’ve now followed the first three of the four napkin math steps: (1) model the
system, (2) identify the relevant latencies, (3) do the napkin math, (4) verify
the napkin calculations against reality.</p>
<p>On to (4: Verifying)! We’ll write a simple benchmark in Rust that writes to
MySQL with 16 threads, doing 1,000 insertions each:</p>
<pre class="language-rust"><code class="language-rust"><span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token number">0</span><span class="token punctuation">..</span><span class="token number">16</span> <span class="token punctuation">{</span>
    handles<span class="token punctuation">.</span><span class="token function">push</span><span class="token punctuation">(</span><span class="token namespace">thread<span class="token punctuation">::</span></span><span class="token function">spawn</span><span class="token punctuation">(</span><span class="token punctuation">{</span>
        <span class="token keyword">let</span> pool <span class="token operator">=</span> pool<span class="token punctuation">.</span><span class="token function">clone</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">move</span> <span class="token closure-params"><span class="token closure-punctuation punctuation">|</span><span class="token closure-punctuation punctuation">|</span></span> <span class="token punctuation">{</span>
            <span class="token keyword">let</span> <span class="token keyword">mut</span> conn <span class="token operator">=</span> pool<span class="token punctuation">.</span><span class="token function">get_conn</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">unwrap</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token comment">// TODO: we should ideally be popping these off a queue in case of a stall</span>
            <span class="token comment">// in a thread, but this is likely good enough.</span>
            <span class="token keyword">for</span> _ <span class="token keyword">in</span> <span class="token number">0</span><span class="token punctuation">..</span><span class="token number">1000</span> <span class="token punctuation">{</span>
                conn<span class="token punctuation">.</span><span class="token function">exec_drop</span><span class="token punctuation">(</span>
                    <span class="token string">r&quot;INSERT INTO products (shop_id, title) VALUES (:shop_id, :title)&quot;</span><span class="token punctuation">,</span>
                    <span class="token macro property">params!</span> <span class="token punctuation">{</span> <span class="token string">&quot;shop_id&quot;</span> <span class="token operator">=&gt;</span> <span class="token number">123</span><span class="token punctuation">,</span> <span class="token string">&quot;title&quot;</span> <span class="token operator">=&gt;</span> <span class="token string">&quot;aerodynamic chair&quot;</span> <span class="token punctuation">}</span><span class="token punctuation">,</span>
                <span class="token punctuation">)</span>
                <span class="token punctuation">.</span><span class="token function">unwrap</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
        <span class="token punctuation">}</span>
    <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>

<span class="token keyword">for</span> handle <span class="token keyword">in</span> handles <span class="token punctuation">{</span>
    handle<span class="token punctuation">.</span><span class="token function">join</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">unwrap</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
<span class="token comment">// 3 seconds, 16,000 insertions</span>
</code></pre>
<p>This takes ~3 seconds to perform 16,000 insertions, or ~5,300 insertions per
second. This is <strong>5x</strong> more than the 1,000 <code>fsync</code> per second our napkin math
told us would be the theoretical maximum transactional throughput!</p>
<p>Typically with napkin math we aim for being within an order of magnitude, which
we are. But, when I do napkin math it usually establishes a lower-bound for the
system, i.e. from first-principles, how fast <em>could</em> this system perform in
ideal circumstances?</p>
<p>Rarely is the system 5x faster than napkin math. When we identify a
significant-ish gap between the real-life performance and the expected
performance, I call it the “first-principle gap.” This is where curiosity sets
in. It typically means there’s (1) an opportunity to improve the system, or (2)
a flaw in our model of the system. In this case, only (2) makes sense, because
the system is faster than we predicted.</p>
<p>What’s wrong with our model of how the system works? Why aren’t fsyncs per
second equal to transactions per second?</p>
<p>First I examined the benchmark… is something wrong? Nope, <code>SELECT COUNT(*) FROM products</code> says 16,000. Is the MySQL I’m using configured to not <code>fsync</code> on every
write? Nope, it’s at the <a href="https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit">safe default</a>.</p>
<p>Then I sat down and thought about it. Perhaps MySQL is <em>not</em> doing an <code>fsync</code>
for every <em>single</em> write? If it’s processing 5,300 insertions per second,
perhaps it’s batching multiple writes together as part of writing to the WAL,
step (2) above? Since each transaction is so short, MySQL would benefit from
waiting a few microseconds to see if other transactions want to ride along
before calling the expensive <code>fsync(2)</code>.</p>
<p>We can test this hypothesis by writing a simple <code>bpftrace</code> script to count the
number of <code>fsync(2)</code> calls for the ~16,000 insertions:</p>
<pre class="language-d"><code class="language-d">tracepoint:syscalls:sys_enter_fsync,tracepoint:syscalls:sys_enter_fdatasync
/comm == &quot;mysqld&quot;/
{
        @fsyncs = count();
}
</code></pre>
<p>Running this during the ~3 seconds it takes to insert the 16,000 records we get
~8,000 <code>fsync</code> calls:</p>
<pre class="language-bash"><code class="language-bash">$ <span class="token function">sudo</span> bpftrace fsync_count.d
Attaching <span class="token number">2</span> probes<span class="token punctuation">..</span>.
^C

@fsyncs: <span class="token number">8037</span>
</code></pre>
<p>This is a peculiar number. If MySQL were batching fsyncs, we’d expect something
far lower. It means we’re on average doing ~2,500 <code>fsync</code> per
second, at a latency of ~0.4ms. That’s twice as fast as the <code>fsync</code> latency we
expect, the 1ms mentioned earlier. For sanity, I ran the script to benchmark
<code>fsync</code> outside MySQL again, no, <a href="https://github.com/sirupsen/napkin-math/blob/fe780331c6f0c6f225a70c8a37c21e0740f7c73c/src/main.rs#L491">still 1ms</a>. <a href="https://gist.github.com/sirupsen/9fd5fe9466e82df073ed8a13ed1f661f#file-napkin-bash">Looked at the
distribution</a>, and it was consistently ~1ms.</p>
<p>There are two things we can draw from this: (1) we’re able to <code>fsync</code> more than
twice as fast as we expected; (2) our hypothesis was correct that MySQL is more
clever than doing one <code>fsync</code> per transaction. However, since <code>fsync</code> was also
faster than expected, batching alone doesn’t explain everything.</p>
<p>If you remember from above, while committing the transaction could theoretically
be a single <code>fsync</code>, other features of MySQL might also call <code>fsync</code>. Perhaps
they’re adding noise?</p>
<p>We need to group <code>fsync</code> by file descriptor to get a better idea of how MySQL
uses <code>fsync</code>. However, the raw file descriptor number doesn’t tell us much. We
can use <code>readlink</code> and the <code>proc</code> file-system to obtain the file name the file
descriptor points to. Let’s write a <a href="https://github.com/iovisor/bpftrace"><code>bpftrace</code> script</a> to see what’s being
<code>fsync</code>’ed:</p>
<pre class="language-d"><code class="language-d">tracepoint:syscalls:sys_enter_fsync,tracepoint:syscalls:sys_enter_fdatasync
/comm == str($1)/
{
  @fsyncs[args-&gt;fd] = count();
  if (@fd_to_filename[args-&gt;fd]) {
  } else {
    @fd_to_filename[args-&gt;fd] = 1;
    system(&quot;echo -n &#x27;fd %d -&gt; &#x27; &amp;1&gt;&amp;2 | readlink /proc/%d/fd/%d&quot;,
           args-&gt;fd, pid, args-&gt;fd);
  }
}

END {
  clear(@fd_to_filename);
}
</code></pre>
<p>Running this while inserting the 16,000 transactions into MySQL gives us:</p>
<pre class="language-bash"><code class="language-bash">personal@napkin:~$ <span class="token function">sudo</span> bpftrace <span class="token parameter variable">--unsafe</span> fsync_count_by_fd.d mysqld
Attaching <span class="token number">5</span> probes<span class="token punctuation">..</span>.
fd <span class="token number">5</span> -<span class="token operator">&gt;</span> /var/lib/mysql/ib_logfile0 <span class="token comment"># redo log, or write-ahead-log</span>
fd <span class="token number">9</span> -<span class="token operator">&gt;</span> /var/lib/mysql/ibdata1 <span class="token comment"># shared mysql tablespace</span>
fd <span class="token number">11</span> -<span class="token operator">&gt;</span> /var/lib/mysql/<span class="token comment">#ib_16384_0.dblwr # innodb doublewrite-buffer</span>
fd <span class="token number">13</span> -<span class="token operator">&gt;</span> /var/lib/mysql/undo_001 <span class="token comment"># undo log, to rollback transactions</span>
fd <span class="token number">15</span> -<span class="token operator">&gt;</span> /var/lib/mysql/undo_002 <span class="token comment"># undo log, to rollback transactions</span>
fd <span class="token number">27</span> -<span class="token operator">&gt;</span> /var/lib/mysql/mysql.ibd <span class="token comment"># tablespace </span>
fd <span class="token number">34</span> -<span class="token operator">&gt;</span> /var/lib/mysql/napkin/products.ibd <span class="token comment"># innodb storage for our products table</span>
fd <span class="token number">99</span> -<span class="token operator">&gt;</span> /var/lib/mysql/binlog.000019 <span class="token comment"># binlog for replication</span>
^C

@fsyncs<span class="token punctuation">[</span><span class="token number">9</span><span class="token punctuation">]</span>: <span class="token number">2</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">12</span><span class="token punctuation">]</span>: <span class="token number">2</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">27</span><span class="token punctuation">]</span>: <span class="token number">12</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">34</span><span class="token punctuation">]</span>: <span class="token number">47</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">13</span><span class="token punctuation">]</span>: <span class="token number">86</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">15</span><span class="token punctuation">]</span>: <span class="token number">93</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">11</span><span class="token punctuation">]</span>: <span class="token number">103</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">99</span><span class="token punctuation">]</span>: <span class="token number">2962</span>
@fsyncs<span class="token punctuation">[</span><span class="token number">5</span><span class="token punctuation">]</span>: <span class="token number">4887</span>
</code></pre>
<p>What we can observe here is that the majority of the writes are to the “redo
log”, what we’ve been calling the “write-ahead-log” (WAL). There are a few <code>fsync</code> calls to
commit the InnoDB table-space, but not nearly as many, as we can always recover
the table-space from the WAL if we crash between them. Reads work just fine prior to
the <code>fsync</code>, as queries can simply be served out of memory by InnoDB.</p>
<p>The only surprising thing here is the substantial volume of writes to the
binlog, which we haven’t mentioned before. You can think of the binlog as the
“replication stream.” It’s a stream of events such as <code>row a changed from x to y</code>, <code>row b was deleted</code>, and <code>table u added column c</code>. The primary replica
streams this to the read-replicas, which use it to update their own data.</p>
<p>When you think about it, the <code>binlog</code> and the WAL need to be kept exactly in
sync. We can’t have something committed on the primary replica, but not
committed to the replicas. If they’re not in sync, this could cause loss of data
due to drift in the read-replicas. The primary could commit a change to the WAL,
lose power, recover, and never write it to the binlog.</p>
<p>Since <code>fsync(2)</code> can only sync a single file-descriptor at a time, how can you
possibly ensure that both the <code>binlog</code> and the WAL contain the transaction?</p>
<p>One solution would be to merge the <code>binlog</code> and the <code>WAL</code> into one log. I’m not
entirely sure why that’s not the case, but likely the reasons are historic. If
you know, let me know!</p>
<p>The solution employed by MySQL is a two-phase commit between the WAL and the
binlog. This requires three <code>fsync</code>s to commit a transaction. <a href="https://www.burnison.ca/notes/fun-mysql-fact-of-the-day-everything-is-two-phase">This</a> and <a href="https://kristiannielsen.livejournal.com/12254.html">this reference</a> explain
the process in more detail. Because the WAL is touched twice as part of the
two-phase commit, we see roughly 2x as many <code>fsync</code>s to it as to the binlog in
the bpftrace output above. The process of grouping multiple transactions into
one two-phase commit is what MySQL calls ‘group commit.’</p>
<p>What we can gather from these numbers is that it seems the ~16,000 transactions
were, thanks to group commit, reduced to ~2,885 commits, or ~5.5 transactions
per commit on average.</p>
<p>But there’s still one thing remaining… why was the average <code>fsync</code> latency
roughly half of what our benchmark measured? Once again, we write a simple
<code>bpftrace</code> script:</p>
<pre class="language-text"><code class="language-text">tracepoint:syscalls:sys_enter_fsync,tracepoint:syscalls:sys_enter_fdatasync
/comm == &quot;mysqld&quot;/
{
        @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_fsync,tracepoint:syscalls:sys_exit_fdatasync
/comm == &quot;mysqld&quot;/
{
        @bytes = lhist((nsecs - @start[tid]) / 1000, 0, 1500, 100);
        delete(@start[tid]);
}
</code></pre>
<p>This gives us the following histogram, confirming that we’re seeing some <em>very</em> fast
<code>fsync</code>s:</p>
<pre class="language-shell-session"><code class="language-shell-session">personal@napkin:~$ sudo bpftrace fsync_latency.d
Attaching 4 probes...
^C

@bytes:
[0, 100)             439 |@@@@@@@@@@@@@@@                                     |
[100, 200)             8 |                                                    |
[200, 300)             2 |                                                    |
[300, 400)           242 |@@@@@@@@                                            |
[400, 500)          1495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[500, 600)           768 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[600, 700)           376 |@@@@@@@@@@@@@                                       |
[700, 800)           375 |@@@@@@@@@@@@@                                       |
[800, 900)           379 |@@@@@@@@@@@@@                                       |
[900, 1000)          322 |@@@@@@@@@@@                                         |
[1000, 1100)         256 |@@@@@@@@                                            |
[1100, 1200)         406 |@@@@@@@@@@@@@@                                      |
[1200, 1300)         690 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[1300, 1400)         803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[1400, 1500)         582 |@@@@@@@@@@@@@@@@@@@@                                |
[1500, ...)         1402 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
</code></pre>
<p>To understand exactly what’s going on here, we’d have to dig into the
file-system we’re using. This is going to be out of scope (otherwise I’m never
going to be sending anything out). But, to not leave you completely hanging,
presumably, <code>ext4</code> is using techniques similar to MySQL’s group commit to batch
writes together in the journal (equivalent to the write-ahead-log of MySQL). In
ext4’s vocabulary, this seems to be called <a href="https://www.kernel.org/doc/Documentation/filesystems/ext4.txt"><code>max_batch_time</code></a>, but the
documentation on it is scant at best. The disk could also be doing this in
addition to, or instead of, the file-system. If you know more about this, please
enlighten me!</p>
<p>The bottom line is that <code>fsync</code> can perform faster during real-life workloads than the
1 ms I get on this machine from repeatedly writing and <code>fsync</code>ing a single file,
most likely because of the ext4 equivalent of group commit, which we won’t see in a
benchmark that never issues multiple <code>fsync</code>s in parallel.</p>
<p>This brings us back around to explaining the discrepancy between real life and
the napkin math of MySQL’s theoretical maximum throughput. We’re able to
achieve at least a 5x increase in throughput over the raw <code>fsync</code> estimate due to:</p>
<ol>
<li>MySQL merging multiple transactions into fewer <code>fsync</code>s through ‘group commits.’</li>
<li>The file-system and/or disk merging multiple <code>fsync</code>s performed in parallel
through its own ‘group commits’, yielding faster performance.</li>
</ol>
<p>In essence, the same technique of batching is used at every layer to improve
performance.</p>
<p>While we didn’t manage to explain <em>everything</em> that’s going on here, I certainly
learned a lot from this investigation. It’d be interesting, in light of this, to play
with the <a href="https://mariadb.com/kb/en/group-commit-for-the-binary-log/#changing-group-commit-frequency">group commit settings</a> to optimize MySQL for throughput over
latency. This could also be tuned at the file-system level.</p>
<h2 id="problem-9-inverted-index">Problem 9: Inverted Index</h2>
<p><a href="https://sirupsen.com/napkin/problem-9/">Last month, we looked at the inverted
index.</a> This data-structure is what’s
behind full-text search, and the way the documents are packed works well for set
intersections.</p>
<figure><img src="/images/66641ef5-efe4-440a-a616-0d30310e7540.png" alt="" width="1007" height="648" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>(A) How long do you estimate it’d take to get the ids for <code>title AND see</code> with 2
million ids for title, and 1 million for see?</strong></p>
<p>Let’s assume that each document id is stored as a 64-bit integer. Then we’re
dealing with <code>1 * 10^6 * 64 bit = 8 Mb</code> and <code>2 * 10^6 * 64 bit = 16 Mb</code>. If we
use an exceptionally simple merge-style set intersection, walking both sorted
lists in lockstep, we need to scan ~<code>24 Mb</code> of sequential memory. According to the
<a href="https://github.com/sirupsen/napkin-math">reference</a>, we can do this in <code>24 Mb * 100 us/Mb = 2.4 ms</code>.</p>
<p>Strangely, the Lucene <a href="https://home.apache.org/~mikemccand/lucenebench/AndHighHigh.html">nightly benchmarks</a> are performing these queries at
roughly 22 QPS, or <code>1000ms/22 = 45ms</code> per query. That’s substantially worse than
our prediction. I was ready to explain why Lucene might be <em>faster</em> (e.g. by
compressing postings to less than 64-bit), but not why it might be 20x slower!
We’ve got ourselves another first-principle gap.</p>
<p>Some slowness can be due to reading from disk, but since the access pattern is
sequential, it <a href="https://github.com/sirupsen/napkin-math">should only be 2-3x slower</a>. The hardware could be different
than the reference, but hardly anything that’d explain 20x. Sending the data to
the client might incur a large penalty, but again, 20x seems enormous. This type
of gap points towards missing something fundamental (as we saw with MySQL).
Unfortunately, this month I didn’t have time to dig much deeper than this, as I
prioritized the MySQL post.</p>
<p><strong>(B) What about title OR see?</strong></p>
<p>In this case we’d have to scan roughly as much memory, but handle more documents
and potentially transfer more back to the client. We’d expect to be roughly in
the same performance ballpark, ~<code>2.4ms</code>.</p>
<p>Lucene in this case is doing <a href="https://home.apache.org/~mikemccand/lucenebench/OrHighHigh.html">roughly half the throughput</a>, which aligns with
our relative expectations. But again, in absolute terms, Lucene’s handling these
queries in ~100ms, which is much, much higher than we expect.</p>
<p><strong>(C) How do the Lucene nightly benchmarks compare for (A) and (B)? This file
shows some of the actual terms used. If they don’t line up, how might you
explain the discrepancy?</strong></p>
<p>Answered inline with (A) and (B).</p>
<p><strong>(D) Let’s imagine that we want title AND see and order the results by the last
modification date of each document. How long would you expect that to take?</strong></p>
<p>If the postings are not stored in that order, we’d naively expect in the worst
case we’d need to sort roughly ~24 Mb of memory, <a href="https://github.com/sirupsen/napkin-math#numbers">at
5ms/Mb</a>. This would land us in the
<code>5 ms/Mb * 24 Mb ~= 120 ms</code> query time ballpark.</p>
<p>In reality, this seems like an unintentional trick question. If ordered by last-modification
date, the postings would already be in roughly that order, since new
documents are appended to the end of the list. That means our sort has to move
far fewer bits around. Even if that weren’t the case, we could store a sorted
list for just this column, which e.g. Lucene allows with doc values.</p>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 9: Inverted Index Performance and Merkle Tree Synchronization]]></title>
        <id>https://sirupsen.com/napkin/problem-9</id>
        <link href="https://sirupsen.com/napkin/problem-9"/>
        <updated>2020-06-07T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for another napkin problem!
As always, consult sirupsen/napkin-math to solve today’s problem, which
has all the resources you need. Keep in mind that with napkin problems you
always have to make your own assumptions about the shape of the problem.
We hit an exciting milestone since last with a total of 500 subscribers! Share the newsletter (ht]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for another napkin problem!</p>
<p>As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> to solve today’s problem, which
has all the resources you need. Keep in mind that with napkin problems you
always have to make your own assumptions about the shape of the problem.</p>
<p>We hit an exciting milestone since the last edition: a total of 500 subscribers! Share the newsletter (<a href="https://sirupsen.com/napkin/">https://sirupsen.com/napkin/</a>) with your friends and co-workers if you find it useful.</p>
<p>The answer to problem 8 is probably the most comprehensive yet… it took me 5 hours
today to prepare this newsletter with an answer I felt was satisfactory.
I hope you enjoy it!</p>
<p>I’m noticing that the napkin math newsletter has evolved from fairly simple
problems to presenting simple models of how various data structures and algorithms work,
then doing napkin math with those models. The complexity has gone way up,
but I hope, in turn, so has your interest.</p>
<p>Let me know how you feel about this evolution by replying. I’m also curious
how many of you simply read through it but don’t necessarily attempt to solve the problems. That’s completely OK, but if 90% of readers use it that way,
I’d consider reframing the newsletter to include the problem <em>and</em> answer in
each edition, rather than the current format.</p>
<p><strong>Problem 9</strong></p>
<p>You may already be familiar with the inverted index. A ‘normal’ index maps e.g.
a primary key to a record, to answer queries efficiently like:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> products <span class="token keyword">WHERE</span> id <span class="token operator">=</span> <span class="token number">611</span>
</code></pre>
<p>An inverted index maps “terms” to ids. To illustrate
in SQL, it may efficiently help answer queries such as:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> id <span class="token keyword">FROM</span> products <span class="token keyword">WHERE</span> title <span class="token operator">LIKE</span> <span class="token string">&quot;%sock%&quot;</span>
</code></pre>
<p>In the SQL databases I’m familiar with, this wouldn’t be the actual syntax; it
varies greatly. A database like ElasticSearch, which uses the inverted index
as its primary data-structure, uses JSON rather than SQL.</p>
<p>The inverted index might look something like this:</p>
<figure><img src="/images/66641ef5-efe4-440a-a616-0d30310e7540.png" alt="" width="1007" height="648" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If we wanted to answer a query to find all documents that include both the words
<code>title</code> and <code>see</code>, <code>query=&#x27;title AND see&#x27;</code>, we’d need to do an intersection of
the two sets of ids (as illustrated in the drawing).</p>
<p><strong>(A)</strong> How long do you estimate it’d take to get the ids for <code>title AND see</code>
with 2 million ids for title, and 1 million for see?</p>
<p><strong>(B)</strong> What about <code>title OR see</code>?</p>
<p><strong>(C)</strong> How do the Lucene nightly benchmarks compare for <a href="https://home.apache.org/~mikemccand/lucenebench/AndHighHigh.html"><strong>(A)</strong></a> and
<a href="https://home.apache.org/~mikemccand/lucenebench/OrHighHigh.html"><strong>(B)</strong></a>? <a href="https://github.com/mikemccand/luceneutil/blob/83e6f737e9316ba829f9cd7e6cb178ed10470fb3/tasks/wikinightly.tasks">This file</a> shows some of the actual terms used. If they don’t
line up, how might you explain the discrepancy?</p>
<p><strong>(D)</strong> Let’s imagine that we want <code>title AND see</code> and order the results by the
last modification date of each document. How long would you expect that to take?</p>
<p><a href="/napkin/problem-10-mysql-transactions-per-second/">Answer is available in the next edition.</a></p>
<p><strong>Answer to Problem 8</strong></p>
<p>Last month <a href="https://sirupsen.com/napkin/problem-8/">we looked at a syncing
problem</a>. What follows is the most
deliberate answer in this newsletter’s short history. It’s a fascinating
problem, and I hope you find it as interesting as I did.</p>
<p>The problem comes down to this: How does a client and server know if they have
the same data? We framed this as a hashing problem. The client and server would
each have a hash, if they match, they have the same data. If not, they need to
sync the documents!</p>
<p>The query for the client and server might look something like this:</p>
<p><code>SELECT SHA1(*) FROM table WHERE user_id = 1</code></p>
<p>For 100,000 records, that’ll in reality return 100,000 hashes. But let’s
assume the hashing function is an aggregate function, without getting bogged down in
database-specific syntax (you can see how to <em>actually</em> do it <a href="https://www.usenix.org/sites/default/files/conference/protected-files/srecon19emea_slides_weingarten.pdf#page=62">here</a>).</p>
<figure><img src="/images/faa046d0-cb70-4852-ae36-4a728236ae6a.png" alt="" width="1313" height="654" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>(a) How much time would you expect the server-side query to take for 100,000
records that the client might have synced? Will it have different performance
than the client-side query?</strong></p>
<p>We’ll assume each row is about 256 bytes on average (<code>2^8</code>), which means we’ll
be reading ~25Mb of data, and subsequently hash it.</p>
<p>Now, will we be reading this from disk or memory? Most databases maintain a
cache of the most frequently read data in memory, but we’ll assume the worst
case here of reading everything from disk.</p>
<p>We know from <a href="https://github.com/sirupsen/napkin-math">the reference</a> that we can hash a Mb in roughly 500 us. The
astute reader might notice that only non-cryptographic hashes are that fast (e.g.
<code>CRC32</code> or <code>SIPHASH</code>), whereas SHA1 is a cryptographic hash (although it’s <a href="https://en.wikipedia.org/wiki/SHA-1">no longer
considered safe for that purpose</a>, it’s used for integrity in e.g.
Git and many other systems). We’re going to assume we can find a non-crypto hash
that’s fast enough with rare collisions. Worst case, you’d sync on your next
change (or force it in the UI).</p>
<p>We can also see that we can read 1 mb sequentially at roughly <code>200 us/mb</code>, and
randomly at roughly <code>10 ms/mb</code>. In <a href="https://sirupsen.com/napkin/problem-5/">Napkin Problem 5</a> we learned that reads
on a multi-tenant database without a composite primary key that includes the
<code>user_id</code> start to look more random than not. We’ll average it out a little,
assume some pre-fetching, some sequential reads, and call it <code>1 ms/mb</code>.</p>
<p>With the caching and disk reads, we’ve got ourselves an approximation of the
query time of the full-table scan: <code>25 Mb * (500 us/Mb + 1 ms/Mb) ~= 40ms</code>.
That’s not terrible, for something that likely wouldn’t happen too often. If
this all came from memory, we can assume hashing speed only to get a lower bound
and get <code>~12.5ms</code>. Not amazing, not terrible. For perspective, that might yield
us <code>1s / 10ms = 100 syncs per second</code> (in reality, we could likely get more by
assuming multiple cores).</p>
<p>Is 100 syncs per second good? If you’ve got 1000 users and they each sync once
an hour, you’re more than covered here (<code>1000/3600 ~= 0.3 syncs per second</code>).
You’d need in the 100,000s of users before this operation would become
problematic.</p>
<p>The second part of the question asks whether the client would have different
performance. The client might be a mobile device, which could easily be <em>much</em>
slower than the server. This is where this solution starts to break down with
this many documents to sync. We don’t have napkin numbers for mobile devices (if
you’ve got access to a mobile CPU you can run the napkin math script on, I’d
love to see it), but it wouldn’t be crazy to assume it’s an order of
magnitude slower (and terrible for the battery).</p>
<p><strong>(b) Can you think of a way to speed up this query?</strong></p>
<p>There are iterative improvements that can be made to the current design. We could
hash the <code>updated_at</code> and store it as a column in the database. We could go a
step further and create an index on <code>(user_id, hash)</code> or <code>(user_id, updated_at)</code>, which would give us much more efficient access to that column!
It would easily mean we’d only have to read 8-12 bytes of data per record,
rather than the previous 256 bytes.</p>
<p>Something else entirely we could do is add a <code>WHERE updated_at ..</code> with a
generous window on either side, only considering those records for sync. This is
do-able, but not very robust. Clocks are out of sync, someone could be offline
for weeks/months, … we have a lot of edge-cases to consider.</p>
<p><strong>Merkle Tree Synchronization</strong></p>
<p>The flaw with our current design is that we still have to iterate through the
100,000 records each time we want to know if a client can sync. Another flaw is
that our current query only gives us a binary answer: the 100,000 records are
synced, or the 100,000 records are not synced.</p>
<p>This query’s answer then leaves us in an uncomfortable situation… should the
client now receive 100,000 records and figure out which ones are out-of-date? Or
let the server do it? This would mean sending those 25 Mb of data back and forth
on each sync! We’re starting to get into question <code>(C)</code>, but let’s explore
this… we might be able to get two birds with one stone here.</p>
<p>What if we could design a data-structure that we maintain at write-time that
would allow us to elegantly answer the question of whether we’re in sync with
the server? Even better, what if this data-structure would tell us which rows
need to be re-synced, so we don’t have to send 100,000 records back and forth?</p>
<p>Let’s consider a Merkle tree (or ‘hash tree’). It’s a simple tree data structure
where the leaf nodes store the hash of individual records. The parent stores the
hash of <em>all</em> its children, until finally the root’s hash is an identity of the
entire state the Merkle tree represents. In other words, the root’s hash is the
answer to the query we discussed above.</p>
<p>The best way to understand a Merkle tree is to study the drawing below a little:</p>
<figure><img src="/images/2f5ff1a5-d6c5-4b38-aa20-c1d82883328d.png" alt="" width="1669" height="844" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>In the drawing I show a MySQL query to generate an equivalent node. It’s likely
not how we’d generate the data-structure in production, but it illustrates the
naive MySQL equivalent. The data-structure would be able to answer such a
query rapidly, whereas MySQL would need to look at each record.</p>
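<p>For intuition, here’s a minimal Python sketch (illustrative, not production code) of building such a tree bottom-up: hash each record into a leaf, then repeatedly hash pairs of children into parents until a single root remains:</p>

```python
import hashlib

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def merkle_tree(records):
    """Build a Merkle tree as a list of levels, leaves first.

    `records` are bytes (e.g. serialized updated_at values). Each
    parent stores the hash of its two children concatenated; a
    trailing odd node at the end of a level is hashed on its own.
    """
    level = [sha1(r) for r in records]
    levels = [level]
    while len(level) > 1:
        level = [sha1(b"".join(level[i:i + 2]))
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels  # levels[-1][0] is the root hash

rows = [b"2020-05-01 09:00:00", b"2020-05-02 10:00:00",
        b"2020-05-02 11:00:00", b"2020-05-03 12:00:00"]
root = merkle_tree(rows)[-1][0]
rows[3] = b"2020-05-03 13:37:00"         # change one record...
assert merkle_tree(rows)[-1][0] != root  # ...and the root changes
```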
<p>If we scale this up to 100,000 records, we can extrapolate how the root would store
<code>(hash, (1..100,000))</code>, its left child would store <code>(hash, (1..50,000))</code>, and
right child would store <code>(hash, (50,001..100,000))</code>, and so on. In that case, to
generate the root’s right node the query in the drawing would look at 50,000
records, too slow!</p>
<p>Let’s assume that the client and the server both have been able to generate this
data-structure somehow. How would they efficiently sync? Let’s draw up a Merkle
tree and data table where one row is different on the server (we’ll make it
slightly less verbose than the last):</p>
<figure><img src="/images/4a216af8-61be-496b-9332-b5f9170b6714.png" alt="" width="1217" height="1175" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Notice how the parents all change when a single record changes. If the server
and client only exchange their Merkle trees, they’d be able to do a simple walk
of the trees and find out that it’s indeed <code>id=4</code> that’s different, and only
sync that row. Of course, in this example with only four records, simply syncing
all the rows would work.</p>
<p>But once again, let’s scale it up. If we scale this simple model up to <code>100,000</code>
rows, we’d still need to exchange 100,000 nodes from the Merkle tree! It’s
slightly less data, since it’s just hashes. Naively, the tree would be <code>~2^18</code>
elements of perhaps 64 bits each, so ~2mb total. An order of magnitude better,
but still a lot of data to sync, especially from a mobile client. Notice here
how we keep justifying each level of complexity by doing quick calculations at
each step to know if we need to optimize further.</p>
<p>Let’s try to work backwards instead… Let’s say our Merkle tree has a maximum
depth of 8: that’s <code>2^8 = 256</code> leaf nodes (this is <a href="http://distributeddatastore.blogspot.com/2013/07/cassandra-using-merkle-trees-to-detect.html">what Cassandra does</a> to
verify integrity between replicas). This means that each leaf would hold
<code>100,000 / 256 ~= 390</code> records. To store a tree of depth 8, we’d need <code>2^(8+1) = 2^9 = 512</code> nodes in a vector/array. Carrying our 64-bit per element assumption
from before to store the hash, that’s a mere 4kb for the entire Merkle tree. Now
to synchronize, we only need to send or receive 4kb!</p>
<p>Now we’ve arrived at a fast Merkle-tree based syncing algorithm:</p>
<ol>
<li>Client decides to sync</li>
<li>Server sends client its 4kb Merkle tree (fast even on 3G, 10-100ms including
round-trip and server-side processing overhead)</li>
<li>Client walks its own and the server’s Merkle tree to detect differences
(operating on <code>2 * 4kb</code> trees, both fit in L1 CPU caches,
nanoseconds to microseconds).</li>
<li>Client identifies the leaf nodes which don’t match (<code>log(n)</code>, super fast
since we’re traversing trees in L1).</li>
<li>Client requests the ids of all those leaf nodes from the server (<code>390 * 256 bytes = 100Kb</code> per mismatch)</li>
</ol>
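<p>Steps 3–4 above can be sketched as a short recursive walk (Python; assumes both trees have the same shape, stored as implicit binary trees in arrays, with the root at index 0 and the children of node <code>i</code> at <code>2i+1</code> and <code>2i+2</code>):</p>

```python
def diff_leaves(a, b, i=0):
    """Return indices of leaf nodes where two Merkle trees disagree.

    Subtrees whose root hashes match are pruned without descending,
    so the walk only touches a few nodes per differing leaf.
    """
    if a[i] == b[i]:
        return []                       # whole subtree is in sync
    left, right = 2 * i + 1, 2 * i + 2
    if left >= len(a):                  # differing leaf: needs re-sync
        return [i]
    return diff_leaves(a, b, left) + diff_leaves(a, b, right)

# Toy trees of depth 2: 7 nodes, leaves at indices 3..6. The hashes
# are stand-in strings; only one leaf (index 6) actually differs.
server = ["root", "L", "R", "h1", "h2", "h3", "h4"]
client = ["root'", "L", "R'", "h1", "h2", "h3", "h4-changed"]
print(diff_leaves(server, client))  # [6]
```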
<p>To actually implement this, we’d need to solve a few production problems. How do
we maintain the Merkle tree on both the client and server-side? It’s paramount
that it’s completely in sync with the table that stores the actual data! If our table
is the <code>orders</code> table, we could imagine maintaining an <code>orders_merkle_tree</code>
table alongside it. We could do this within the transaction in the application,
we could do it with triggers in the writer (or in the read-replicas), build it
based on the replication stream, patch MySQL to maintain this (or base it on the
existing InnoDB checksumming on each leaf), or something else entirely…</p>
<p>Our design has other challenges that’d need to be ironed out, for example, our
current design assumes an <code>auto_increment</code> per user, which is not something most
databases are designed to do. We could solve this by hashing the primary key
into <code>2^8</code> buckets and storing these in the leaf nodes.</p>
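<p>A sketch of that bucketing (Python; the function name is illustrative): hash the primary key and take it modulo the number of leaves, so each row lands deterministically in one of the <code>2^8</code> buckets regardless of how ids are assigned:</p>

```python
import hashlib

def leaf_bucket(primary_key: int, buckets: int = 256) -> int:
    """Deterministically map a primary key to a Merkle-tree leaf."""
    digest = hashlib.sha1(str(primary_key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % buckets

# Every row with a given id always hashes to the same leaf, so the
# client and server agree on the tree layout without coordination.
print(leaf_bucket(42), leaf_bucket(1337))
```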
<p>This answer to <code>(B)</code> also addresses <strong>(C): This is a stretch question, but it’s
fun to think about the full syncing scenario. How would you figure out which
rows haven’t synced?</strong></p>
<p>As mentioned in the previous letter, I would encourage you to watch <a href="https://www.dotconferences.com/2019/12/james-long-crdts-for-mortals">this
video</a> if this topic is interesting to you. The <a href="https://github.com/attic-labs/noms/blob/master/doc/intro.md#prolly-trees-probabilistic-b-trees">Prolly Tree</a> is an
interesting data-structure for this type of work (combining B-trees and Merkle
Trees). Git is based on Merkle trees, I recommend <a href="https://shop.jcoglan.com/building-git/">this book</a> which explains
how Git works by re-implementing Git in Ruby.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Adjacent Possible: Model for Peeking into the Future]]></title>
        <id>https://sirupsen.com/adjacent-possible</id>
        <link href="https://sirupsen.com/adjacent-possible"/>
        <updated>2020-05-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[There are 100s of cases of important discoveries being made independently
by different people at almost exactly the same time: calculus (1600s), the
telegraph (1837), the light bulb (1879), the jet engine (1840), and the
telephone (1876). A recent example was Spectre/Meltdown (2018), possibly the
most impactful publicly disclosed security vulnerability of the past decade.
Despite its fiendish complexity it was <a href="h]]></summary>
        <content type="html"><![CDATA[<p>There are <a href="https://en.wikipedia.org/wiki/List_of_multiple_discoveries">100s of cases</a> of important discoveries being made independently
by different people at almost exactly the same time: calculus (1600s), the
telegraph (1837), the light bulb (1879), the jet engine (1840), and the
telephone (1876). A recent example was Spectre/Meltdown (2018), possibly the
most impactful publicly disclosed security vulnerability of the past decade.
Despite its fiendish complexity it was <a href="https://www.wired.com/story/meltdown-spectre-bug-collision-intel-chip-flaw-discovery/">discovered independently by two
teams</a> that year.</p>
<p>Why does this happen?</p>
<p>In <a href="/books/where-good-ideas-come-from/">“Where Good Ideas Come From”</a>, Johnson explains the idea of the
‘adjacent possible’, pioneered by Stuart Kauffman to describe how biological systems
morph into complex systems. The adjacent possible explains simultaneous
innovation. It’s one of those ideas that to me was so powerful it’s hard to
remember how I thought about innovation prior to learning about it.</p>
<p>To borrow Johnson’s analogy for the adjacent possible: when you build or improve
something, imagine yourself as opening a new door. You’ve unlocked a new room.
This room, in turn, has even <em>more</em> doors to be unlocked. Each innovation or
improvement unlocks even more improvements and innovations. What the doors lead
you to is what we call the ‘adjacent possible.’ The adjacent possible is what’s
about a door away from being invented. I like to visualize the adjacent possible
as coloured (“built”) and uncoloured (“not built”) nodes in a simple graph:</p>
<figure><img src="/images/adjacent-possible/adjacent_possible_simple.png" alt="" width="1235" height="701" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<blockquote>
<p>In human culture, we like to think of breakthrough ideas as sudden
accelerations on the timeline, where a genius jumps ahead fifty years and
invents something that normal minds, trapped in the present moment, couldn’t
possibly have come up with. But the truth is that technological (and
scientific) advances rarely break out of the adjacent possible; the history of
cultural progress is, almost without exception, a story of one door leading to
another door, exploring the palace one room at a time.
— <a href="/books/where-good-ideas-come-from/">Steven Johnson, Where Good Ideas Come From</a></p>
</blockquote>
<p>When Gutenberg invented the printing press, it was in the adjacent possible from
the invention of movable type, ink, paper, and the wine press. He had to
customize the ink, press, and invent molds for the type — but the printing
press was very much ripe for plucking in the adjacent possible.</p>
<figure><img src="/images/adjacent-possible/adjacent_possible_printing_press.png" alt="" width="1271" height="903" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>When you internalize it, you start seeing it everywhere.</p>
<p>Here’s Safi Bahcall painting a picture of navigating the adjacent possible,
focusing in particular on the importance of fundamental research, a door opener
that might not always get the credit and funding it deserves:</p>
<blockquote>
<p>“The vast majority of the most important breakthroughs in drug discovery have
hopped from one lily pad to another until they cleared their last challenge.
Only after the last jump, from the final lily pad, would those ideas win wide
acclaim.”
— <a href="/books/loonshots/">Safi Bahcall, Loonshots</a></p>
</blockquote>
<p>Of course, it took ingenuity for Gutenberg to combine these components to make
the printing press. It’s certainly a pattern that the inventor has a profound
familiarity with each component. Gutenberg grew up close to the wine districts
of South-Western Germany, so he was familiar with the wine press. He had to
customize the press, in the same way that much experimentation led him to come
up with an oil-based ink that worked with his movable type (for which he needed
to invent molds).</p>
<p>But reality is that if Gutenberg hadn’t invented the printing press, someone
else would have. The inventors of the transistor admitted this outright. The
Bell Labs semiconductor team understood that when you are picking off the
adjacent possible, someone else will get there eventually. In this case, the
transistor had come into the adjacent possible from the increased understanding
of e.g. the basic research in atomic structure and understanding of electrons
conducted by scientists such as Bohr and J. J. Thomson.</p>
<blockquote>
<p>“There was little doubt, even by the transistor’s inventors, that if
Shockley’s team at Bell Labs had not gotten to the transistor first, someone
else in the United States or in Europe would have soon after.”
— <a href="/books/the-idea-factory/">Jon Gertner, The Idea Factory: Bell Labs and the Great Age of American Innovation</a></p>
</blockquote>
<p>Edison came to this conclusion too:</p>
<blockquote>
<p>I never had an idea in my life. My so-called inventions already existed in the
environment – I took them out. I’ve created nothing. Nobody does. There’s no
such thing as an idea being brain-born; everything comes from the outside.
— Edison</p>
</blockquote>
<p>Numerous quotes can be found about how innovations are plucked out of the
adjacent possible like ripe fruits:</p>
<blockquote>
<p>[Y]ou do not [make a discovery] until a background knowledge is built up to a
place where it’s almost impossible not to see the new thing, and it often
happens that the new step is done contemporaneously in two different places in
the world, independently.
— a physicist Nobel laureate interviewed by Harriet Zuckerman, in Scientific
Elite: Nobel Laureates in the United States, 1977</p>
</blockquote>
<p>The adjacent possible is a possible explanation for why simultaneous innovation
is so common.</p>
<p>You may recognize the adjacent possible as another angle on Newton’s phrase that
we ‘stand on the shoulders of giants’ (coloured nodes in the adjacent possible).
‘Great artists steal’, because otherwise how would we launch into the adjacent
possible? The greatest artists might just be the ones that create the nodes with
the most connections, such as Picasso’s influence in cubism, or Emerson’s
in transcendentalism.</p>
<p>You might initially think this is a depressing thought. Are all innovations
inevitable? Some teams in history have mowed through the adjacent possible
at unprecedented speeds. Think of the Manhattan Project. The Apollo Project.
Neither of those were in the adjacent possible. They were in the far remote
possible. Many, many doors out. But these teams pushed through. To a company,
the momentum provided by breaking through the adjacent possible first can be
difficult to catch up with, such as Google and their page-rank search algorithm.
Some areas might be simply neglected, e.g. pandemic prevention.</p>
<p>The adjacent possible can teach us an important lesson about being too early. To
someone working in the adjacent possible, being too early and wrong is one and
the same. I’ve heard <a href="https://twitter.com/tobi">Tobi Lutke</a> say a few times that “predicting the future
is easy, but timing it is hard.” Sure, we know that autonomous vehicles are
coming (predicting the future), but are you willing to put any money on when
(predicting timing)?</p>
<p>For example, residential internet was not geared yet for responsive online games
in the early 90s.  It was too early, even if game developers <em>knew</em> it was
eventually going to be a thing. It was in the remote possible, but not the
adjacent possible. Not enough pre-requisite doors had been opened: home internet
speeds weren’t good enough, research on how to deal with network latency was
poor, and setting up servers all around the world to minimize latency was a lot
of work. Being too early means confusing the adjacent and remote possible.</p>
<p>Despite online gaming being too early to become ubiquitous, the stage was set
for the web. Half-coloured nodes signal immaturity:</p>
<figure><img src="/images/adjacent-possible/adj_int.png" alt="" width="1182" height="597" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>While Wilbur Wright knew we’d one day fly (<em>remote possible</em>), he had no idea if
it was in the adjacent possible. He especially didn’t know the timing. But he
went to the Kitty Hawk sand dunes with his flimsy plane anyway:
<blockquote>
<p>“I confess that, in 1901, I said to my brother Orville that men would not fly
for fifty years. Two years later, we ourselves were making flights. This
demonstration of my inability as a prophet gave me such a shock that I have
ever since distrusted myself and have refrained from all prediction—as my
friends of the press, especially, well know. But it is not really necessary to
look too far into the future; we see enough already to be certain that it will
be magnificent. Only let us hurry and open the roads.”
— <a href="/books/the-wright-brothers/">David McCullough, The Wright Brothers</a></p>
</blockquote>
<p>Bell Labs developed the “picture phone” in the 1960s and 1970s, but they found
themselves branching off nodes in the adjacent possible that made it <em>possible</em>,
but without product/market fit. It’s possible to navigate into the adjacent
possible using the wrong doors: <code>camera + cables + packet_switching + tv</code> does
not necessarily equal a successful commercial ‘video phone’. Video telephony
wouldn’t be in the adjacent possible in a shape consumers would embrace for
another 40-50 years when convenience, price, and form factor would change with
every laptop having a webcam and every phone a front-facing camera. Babbage also
got his timing wrong.  He was ~100 years too early with the first computer
design, too.</p>
<figure><img src="/images/adjacent-possible/picturephone.jpg" alt="" width="1536" height="2048" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>These are individual failures, but part of a healthy system. We <em>need</em> people to
try. While I believe this model is useful to reason about what can be built,
it’s just as likely to make you reason incorrectly about why not to build
something. You may very well use this model to be wrong, as an excuse not
to venture into the fog of war. You won’t always know all your dependencies.</p>
<p>In the late 90s, LEGO was aggressively diversifying from the brick into video
games, movies, theme parks, and more.  Like the plastic mold had enabled the
brick’s transition from wood to plastic, they thought that a digital environment
with all possible bricks might start the next wave of innovation for LEGO. They
bought the biggest Silicon Graphics machine in all of Scandinavia and put it in
a tiny town in Denmark to computer-render the bricks to perfection. LEGO was
eager to use the newest graphics technology, the most recently opened door, and
marry it with LEGO.  Unsurprisingly, the graphics team never shipped anything.
When a door’s just been opened, you’re almost certainly going to run into
problems with immaturity (a contemporary example would be cryptocurrency).  You
only have to look at Minecraft’s success a decade later to know what could’ve
succeeded: much simpler graphics. LEGO must’ve gritted their teeth when they saw
Minecraft take off.</p>
<figure><img src="/images/adjacent-possible/darwin_minecraft.png" alt="" width="1190" height="715" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Just because big graphics computers exist doesn’t mean you have to use them.
It’s very easy to confuse the <em>eventually/remote possible</em> with the <em>adjacent
possible</em>. If you find yourself pushing, pushing, and pushing, but every
dependency seems to fail you — your dependencies <a href="https://blog.gardeviance.org/2015/02/an-introduction-to-wardley-value-chain.html">are too immature</a>. Every
project has dependencies, but only the immature ones stand out. You don’t think
about electricity as a risky dependency for a project (but you might have in the
1880s), but consumer adoption of VR certainly would be. Smartphones might have
been a risky dependency a decade ago, but wouldn’t be considered risky by anyone
today. QR-codes might have appeared risky in the West 5 years ago, but are
somewhere between “people get it” and “not completely mature” now. In China,
however, it’s common that food menus come with QR-codes.</p>
<p>When the transistor was invented at Bell Labs, Bell didn’t immediately replace
every vacuum tube amplifier with it in their telephony cabling (amplifiers are
used to counteract the natural fading of the signal over long distances). It
would take at least a decade to get the price, manufacturing, and reliability of
the transistor to the point where it could replace the vacuum tube with half a
century of R&amp;D behind it.  In fact, they were still laying down massive,
cross-country and oceanic cables with vacuum tubes for years after the
transistor was invented, patiently waiting for it to mature. I’m sure you’ve seen a
project fail because, by analogy, you ‘started cabling with transistors
immediately after its discovery.’ Sometimes you just need to bite your lip and
go with the vacuum tube.</p>
<p>Despite this, it didn’t make Bell any less excited about the transistor. They
knew that the vacuum tube’s potential had been maxed out, while the transistor’s
was just starting. Even today, as we reach <code>5nm</code> (orders and orders of magnitude
smaller and faster) transistors, the transistor’s potential still hasn’t been
depleted. Although we’re inching closer and closer…</p>
<blockquote>
<p>“Gordon Moore suggested what would have happened if the automobile industry had
matched the semiconductor business for productivity. “We would cruise
comfortably in our cars at 100,000 mph, getting 50,000 miles per gallon of
gasoline,” Moore said. “We would find it cheaper to throw away our Rolls-Royce
and replace it than to park it downtown for the evening… . We could pass it
down through several generations without any requirement for repair.””
— <a href="/books/the-chip/">T.R. Reid, The Chip</a></p>
</blockquote>
<p>Wilbur Wright made a similar remark about the limits of the airship, after trying
one for the first time on a trip to Europe:</p>
<blockquote>
<p>[Wilbur] judged it a “very successful trial.” But as he was shortly to write, the cost
of such an airship was ten times that of a Flyer, and a Flyer moved at twice the
speed. The flying machine was in its infancy while the airship had “reached its
limit and must soon become a thing of the past.” Still, the spectacle of the
airship over Paris was a grand way to begin a day.” — David McCullough, The
Wright Brothers</p>
</blockquote>
<p>It’s important to note that improving something existing can open doors just as
much as inventing something entirely new. When gas gets 20% cheaper, people
don’t just drive 20% more, they <a href="https://en.wikipedia.org/wiki/Jevons_paradox">might drive 40% more</a>. Behaviour changes.
Suddenly it looks economical to move a little further out, visit that relative
who lives in the country, or drive 10 hours on vacation.</p>
<p>As another example, the current wave of AI is fuelled by the massive
improvements in compute speed over the past few decades, partly from graphics
cards originally developed for video games. AI had been hanging out in the
remote possible for decades, just waiting for compute to hit a certain
speed/cost threshold to make them economically feasible. You might not use AI to
sort your search results if it costs $10 in compute per search, but when the
cost has compounded down to a micro-dollar, it very well might be worth it.</p>
<p>The same iterative improvements are what made the transistor so successful.
Fundamentally, it can do the same as a vacuum tube: amplify and switch signals.
Initially, it was much more expensive, but smaller and more reliable (no light
to attract bugs) — which allowed it to flourish only in niche use-cases far
upmarket, e.g. in the US military. But over time, the transistor beat the
vacuum tube in every way (although, some audiophiles still prefer the ‘sound’ of
vacuum tubes?!).</p>
<p>To use our new vocabulary, the transistor only initially expanded the adjacent
possible for a few cases.  Over time as iterative, consistent improvements were
made to price, size, and reliability, the transistor became the root of the
largest expanse of the ‘possible’ in human history. It didn’t open doors, it
opened up new continents. A more contemporary example might be home and mobile
Internet speeds, for which consistent, iterative improvements have expanded the
adjacent possible with streaming, video games, video chat, and photo-video heavy
social media.</p>
<p>It’s not possible to predict exactly what <a href="/unk-unk/">doors an improvement unlocks</a>.
This is a space of unknown-unknowns, but, hopefully positive ones. If we look at
history, making things cheaper, smaller, faster, and more reliable tends to
expand the adjacent possible. It wasn’t some magical new invention that made AI
take off in the past 7-10 years, it was iterative changes: cheaper, faster
compute, available on demand in the Cloud. Every time these improve by 10%,
something new is feasible.</p>
<p>As an example of perfect timing into the adjacent possible, consider Netflix’s
pivot into streaming. The technology they used initially was a little whacky
(Silverlight), but it was good enough to give them an initial momentum that’s
still carrying them today. They timed the technology and the market perfectly:
home Internet speeds, browser technology, etc.</p>
<p>When you find yourself in a spot where you have your eyes on something that’s a
few doors out from where you’re standing, that means it’s time to reconsider
your approach. When Apple released the iPod in 2001, they surely were eyeing a
phone in the <em>remote possible</em>. They knew that going straight for it, they’d be
blasting through doors at a pace that’d yield an immature, poor product. They
found a way to sustainably open the doors for a phone through the iPod.
When you find a seemingly intractable problem, there’s almost always a
tractable problem worth solving hiding inside of it as a stepping stone.</p>
<p>Framing problems as the ‘adjacent possible’ has been a liberating idea to me. In
the work I do, I try to find the doors that lead to the biggest possible
expansion of the possible. That’s what makes platform work so exciting to me.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 8: Data Synchronization]]></title>
        <id>https://sirupsen.com/napkin/problem-8</id>
        <link href="https://sirupsen.com/napkin/problem-8"/>
        <updated>2020-05-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for another napkin problem!
As always, consult sirupsen/napkin-math to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.
Since last time, I’ve added compression and hashing numbers to the napkin math tab]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for another napkin problem!</p>
<p>As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.</p>
<p>Since last time, I’ve added <a href="https://github.com/sirupsen/napkin-math">compression and hashing numbers</a> to the napkin math table. Plenty more I’d like to see, happy to receive help by someone eager to write some Rust!</p>
<p>About a month ago I did a little pop-up lesson for some kids about <a href="https://www.youtube.com/watch?v=R0aMzNKUAwc">competitive programming</a>. That’s the context where I did my first napkin math. One of the most critical skills in that environment is to know ahead of time whether your solution will be fast enough to solve the problem. It was fun to prepare for the lesson, as I hadn’t done anything in that space for over 6 years. I realized it’s influenced me a lot.</p>
<p>We’re on the 8th newsletter now, and I’d love to receive feedback from all of you (just reply directly to me here). Do you solve the problems? Do you just enjoy reading the problems, but don’t jot much down (that’s cool)?  Would you prefer a change in format (such as the ability to see answers before the next letter)? Do you find the problems are not applicable enough for you, or do you like them?</p>
<p><strong>Problem 8</strong></p>
<p>There might be situations where you want to checksum data in a relational database. For example, you might be <a href="https://www.youtube.com/watch?v=-GqOVx9F5QM&amp;t38m40s=">moving a tenant from one shard to another</a>, and before finalizing the move you want to ensure the data is the same on both ends (to protect against bugs in your move implementation).</p>
<p>Checksumming against databases isn’t terribly common, but can be quite useful for sanity-checking in syncing scenarios (imagine if webhook APIs had a cheap way to check whether the data you have locally is up-to-date, instead of fetching all the data).</p>
<p>We’ll imagine a slightly different scenario. We have a client (web browser with local storage, or mobile) with state stored locally from <code>table</code>. They’ve been lucky enough to be offline for a few hours, and are now coming back online. They’re issuing a sync to get the newest data. This client has offline-capabilities, so our user was able to use the client while on their offline journey. For simplicity, we imagine they haven’t made any changes locally.</p>
<figure><img src="/images/faa046d0-cb70-4852-ae36-4a728236ae6a.png" alt="" width="1313" height="654" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>The query behind an API might look like this (in reality, the query would look more like <a href="https://www.usenix.org/sites/default/files/conference/protected-files/srecon19emea_slides_weingarten.pdf#page=62">this</a>):</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> SHA1<span class="token punctuation">(</span><span class="token keyword">table</span><span class="token punctuation">.</span>updated_at<span class="token punctuation">)</span> <span class="token keyword">FROM</span> <span class="token keyword">table</span> <span class="token keyword">WHERE</span> user_id <span class="token operator">=</span> <span class="token number">1</span>
</code></pre>
<p>The user does the same query locally. If the hashes match, the user is already synced!</p>
<p>If the local and server-side hash don’t match, we’d have to figure out what’s happened since the user was last online and send the changes (possibly in both directions). This can be useful on its own, but can become very powerful for syncing when extended further.</p>
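<p>A sketch of that comparison in application code (Python; the row values and function name are illustrative, not from the post):</p>

```python
import hashlib

def sync_digest(updated_ats):
    """Fold all of a user's updated_at values into one checksum."""
    h = hashlib.sha1()
    for ts in sorted(updated_ats):  # both sides must agree on order
        h.update(ts.encode())
    return h.hexdigest()

server = ["2020-05-01 10:00:00", "2020-05-02 11:30:00"]
client = ["2020-05-01 10:00:00", "2020-05-02 11:30:00"]
print(sync_digest(server) == sync_digest(client))  # True: in sync
```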
<p><strong>(A)</strong> How much time would you expect the server-side query to take for 100,000 records that the client might have synced? Will it have different performance than the client-side query?</p>
<p><strong>(B)</strong> Can you think of a way to speed up this query?</p>
<p><strong>(C)</strong> This is a stretch question, but it’s fun to think about the full syncing scenario. How would you figure out which rows haven’t synced?</p>
<p>If you find this problem interesting, I’d encourage you to watch <a href="https://www.dotconferences.com/2019/12/james-long-crdts-for-mortals">this video</a> (it would help you answer question (C) if you decide to give it a go).</p>
<p><a href="/napkin/problem-9/">Answer is available in the next edition.</a></p>
<p><strong>Answer to Problem 7</strong></p>
<p>In the <a href="https://sirupsen.com/napkin/problem-7/">last problem</a> we looked at revision history (click it for more detail). More specifically, we looked at building revision history on top of an existing relational database with a simple composite primary key design: <code>(id, version)</code> with a full duplication of the row each time it changes. The only thing you knew was that the table was updating roughly 10 times per second.</p>
<figure><img src="/images/e93e3c58-0b13-4d2b-bd8d-b08beae30caf.png" alt="" width="1295" height="921" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>(a) How much extra storage space do you anticipate this simple scheme would require after a month? A year? What would this cost on a standard cloud provider?</strong></p>
<p>The table we’re operating on was called <code>products</code>. Let’s assume somewhere around 256 bytes per product (some larger, some smaller, the biggest variant being the product description). Each update thus generates <code>2^8 = 256</code> bytes. We can extrapolate out to a month: <code>2^8 bytes/update * 10 updates/second * 3600 seconds/hour * 24 hours/day * 30 days/month ~= 6.6 GB/month</code>, or ~<code>80 GB</code> per year. Stored on SSD at a standard cloud provider at <code>$0.1/GB/month</code>, that’ll run us ~$8/month.</p>
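<p>The arithmetic is quick to sanity-check in Python. The 256 bytes/row and the ~$0.1/GB/month SSD price (which is what makes ~80 GB come out to roughly $8/month) are assumptions:</p>

```python
bytes_per_update = 256                  # ~2^8 bytes per product row
updates_per_second = 10

per_month = bytes_per_update * updates_per_second * 3600 * 24 * 30
per_year = per_month * 12

month_gb = per_month / 10**9            # ~6.6 GB/month
year_gb = per_year / 10**9              # ~80 GB/year
cost_per_month = year_gb * 0.1          # ~$8/month at an assumed $0.1/GB/month
```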
<p><strong>(b) Based on (a), would you keep storing it in a relational database, or would you store it somewhere else? Where? Could you store it differently more efficiently without changing the storage engine?</strong></p>
<p>For this table, it doesn’t seem crazy—especially if we look at it as a cost-only problem. The main concern that comes to mind is that this will decrease query performance, at least in MySQL. Every time you load a record, you’re also <a href="https://sirupsen.com/napkin/problem-5/">loading adjacent records as you draw in the 16 KiB page</a> (as determined by the primary key).</p>
<p>Accidental abuse would also become a problem. You might have a well-meaning merchant with a bug in a script that causes them to update their products 100 times/second for a while. Do you need to clear these out? Does it permanently decrease their performance? Capping the number of revisions per product would likely be a sufficient stopgap for a while.</p>
<p>If we moved to compression, we’d likely get a <a href="https://github.com/sirupsen/napkin-math#compression-ratios">3x storage-size decrease</a>. That’s not too significant, and incurs a fair amount of complexity.</p>
<p>If, for one of the reasons above, you needed to move to another engine, I’d likely base the decision on how often it needs to be queried, and what types of queries are required on the revisions (hopefully you don’t need to join on them).</p>
<p>The absolute simplest (and cheapest) would be to store it on GCS/S3, wholesale, no diffs — and then do whatever transformations necessary inside the application. I would hesitate strongly to move to something more complicated than that unless absolutely necessary (if you were doing a lot of version syncing, that might change the queries you’re doing substantially, for example).</p>
<p>Do you have other ideas on how to solve this? Experience? I’d love to hear from you!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 7: Revision History]]></title>
        <id>https://sirupsen.com/napkin/problem-7</id>
        <link href="https://sirupsen.com/napkin/problem-7"/>
        <updated>2020-04-11T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for another napkin problem!
As always, consult sirupsen/napkin-math to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.
I debated putting out a special edition of the newsletter with COVID-related napkin math problems. However, I ultimately decided to resi]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for another napkin problem!</p>
<p>As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> to solve today’s problem, which has all the resources you need. Keep in mind that with napkin problems you always have to make your own assumptions about the shape of the problem.</p>
<p>I debated putting out a special edition of the newsletter with COVID-related napkin math problems. However, I ultimately decided to resist, as it’s exceedingly likely to encourage misinformation. Instead, I am attaching a brief reflection on napkin math in this context.</p>
<p>In the case of COVID, napkin math can be useful to develop intuition. It became painfully clear that there are two types of people: those that appreciate exponentials, and those that don’t. Napkin math and <a href="https://www.washingtonpost.com/graphics/2020/world/corona-simulator/">simple simulations</a> have proved apt at educating about exponential growth and the properties of spread. If you don’t stare at exponential growth routinely, it’s counter-intuitive why you’d want to shut down at a few hundred cases (or less).</p>
<p>However, napkin math is insufficient for informing policy. Napkin math is for informing direction. It’s for rapidly uncovering the fog of war to light up promising paths. Raising alarm bells to dig deeper. It’s the experimenter’s tool.</p>
<p>It’s an inadequate tool when even getting an order-of-magnitude assumption right is difficult. Napkin math for epidemiology is filled with exponentials, which make it mindbogglingly sensitive to minuscule changes in input. The ones we’ve dealt with here haven’t included exponential growth. I’ve been tracking napkin articles on COVID out there from hobbyists, and some of it is outright dangerous. As they say, more lies have been written in Excel than Word.</p>
<p>On that note, on to today’s problem!</p>
<p><strong>Problem 7</strong></p>
<p>Revision history is wonderful. We use it every day in tools like Git and Google Docs. While we might not use it directly all the time, the fact that it’s there makes us feel confident in making large changes. It’s also the backbone for features like real-time collaboration, synchronization, and offline-support.</p>
<p>Many of us develop with databases like MySQL that don’t easily support revision history. They lack the capability to easily answer queries such as: “give me this record the way it looked before this change”, “give me this record at this time and date”, or “tell me what has changed since these revisions.”</p>
<p>It doesn’t strike me as terribly unlikely that years from now, as computing costs continue to fall, revision history will be a default feature. Not a feature reserved for specialized databases like <a href="https://github.com/attic-labs/noms">Noms</a> (if you’re curious about the subject, and an efficient data-structure to answer queries like the above, read about <a href="https://github.com/attic-labs/noms/blob/master/doc/intro.md#prolly-trees-probabilistic-b-trees">Prolly Trees</a>). But today, those features are not particularly common. Most companies do it differently.</p>
<p>Let’s try to analyze what it would look like to get revision history on top of a standard SQL database. As we always do, we’ll start by analyzing the simplest solution. Instead of mutating our records in place, our changes will always <em>copy</em> the entire row, increment a <code>version_number</code> on the record (which is part of the primary key), as well as an <code>updated_at</code> column. Let’s call the table we’re operating on <code>products</code>. I’ll put down one assumption: we’re seeing about 10 updates per second. Then I’ll leave you to form the rest of the assumptions (most of napkin math is about forming assumptions).</p>
<figure><img src="/images/e93e3c58-0b13-4d2b-bd8d-b08beae30caf.png" alt="" width="1295" height="921" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
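<p>As a hypothetical in-memory sketch of the scheme (in Python rather than SQL; all names here are made up): every “update” inserts a full copy of the row under an incremented version, keyed by <code>(id, version)</code>, and nothing is ever mutated in place.</p>

```python
import time

revisions = {}  # (product_id, version) -> row dict

def latest_version(product_id):
    versions = [v for (pid, v) in revisions if pid == product_id]
    return max(versions, default=0)

def update_product(product_id, **changes):
    # Copy the latest row wholesale, apply the changes, bump the version.
    version = latest_version(product_id)
    row = dict(revisions.get((product_id, version), {}))
    row.update(changes, updated_at=time.time())
    revisions[(product_id, version + 1)] = row

update_product(1, title="Shoe")
update_product(1, title="Black Shoe")

# Revision-history queries become simple lookups:
old = revisions[(1, 1)]["title"]   # "Shoe"
new = revisions[(1, 2)]["title"]   # "Black Shoe"
```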
<p>(a) How much extra storage space do you anticipate this simple scheme would require after a month? A year? What would this cost on a standard cloud provider?</p>
<p>(b) Based on (a), would you keep storing it in a relational database, or would you store it somewhere else? Where? Could you store it differently more efficiently without changing the storage engine?</p>
<p><a href="/napkin/problem-8/">Answer is available in the next edition.</a></p>
<p><strong>Answer to Problem 6</strong></p>
<p>The <a href="https://sirupsen.com/napkin/problem-6/">last problem</a> can be summarized as: Is it feasible to build a client-side search feature for a personal website, storing all articles in memory? Could the New York Times do the same thing?</p>
<p>On my website, I have perhaps 100 pieces of public content (these newsletters, blog posts, book reviews). Let’s say that they’re on average 1,000 words of searchable content, with each word being an average of 5 characters/bytes (fairly standard for English; e.g. this email is ~5.1). We get a total of: <code>5 * 10^0 * 10^3 * 10^2 = 5 * 10^5 bytes = 500 KB = 0.5 MB</code>. It’s not crazy to have clients download <code>0.5 MB</code> of cached content, especially considering that gzip seems to compress a blog post at about 1:3.</p>
<p>The second consideration would be: can we search it fast enough? If we do a simple search match, this is essentially about scanning memory. We should be able to read <a href="https://github.com/sirupsen/napkin-math">500 KB in well under a millisecond</a>.</p>
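<p>A quick sanity-check of the numbers in Python; the 5 bytes/word and the ~10 GB/s sequential memory read rate are the assumptions above, and the NYT figures are the ballpark from the next paragraph:</p>

```python
bytes_per_word = 5     # ~5 bytes per English word

# Personal site: ~100 pieces of ~1,000 words each.
bytes_total = bytes_per_word * 1000 * 100          # 500,000 bytes = 0.5 MB

# Naive scan at ~10 GB/s of sequential memory reads.
scan_us = bytes_total / (10 * 10**9) * 10**6       # ~50 microseconds

# NYT ballpark: 30 pieces/day of ~1,000 words for 10 years.
nyt_bytes = bytes_per_word * 1000 * 30 * 365 * 10  # ~550 MB
nyt_scan_ms = nyt_bytes / (10 * 10**9) * 10**3     # ~55 ms
```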
<p>For the New York Times, we might ballpark that they publish 30 pieces of ~1,000-word content a day. While it’d be sweet to index since their beginnings in 1851, we’ll just consider 10 years at this publishing speed as a ballpark: <code>5 * 10^0 * 10^3 * 30 * 365 * 10 ~= 550 MB</code>. That’s too much to do in the browser, so in that case we’d suggest server-side search, especially if we want to go back more than 10 years (by the way, past news coverage is fascinating: I highly recommend reading articles about SARS-CoV-1 from 2002 right now). Searching that much content naively would take about 50 ms, which might be OK, but since this is only 10 years and there’s even more data, we’d likely want to investigate more sophisticated data-structures for search.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 6: In-memory Search]]></title>
        <id>https://sirupsen.com/napkin/problem-6</id>
        <link href="https://sirupsen.com/napkin/problem-6"/>
        <updated>2020-03-07T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for napkin problem number 6!
As always, consult sirupsen/napkin-math to solve today’s problem, which
has all the resources you need. Keep in mind that with napkin problems you
always have to make your own assumptions about the shape of the problem.
Problem 6
Quick napkin calculations are helpful to iterate through simple, naive solutions
and see whether they]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for napkin problem number 6!</p>
<p>As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> to solve today’s problem, which
has all the resources you need. Keep in mind that with napkin problems you
always have to make your own assumptions about the shape of the problem.</p>
<p><strong>Problem 6</strong></p>
<p>Quick napkin calculations are helpful to iterate through simple, naive solutions
and see whether they might be feasible. If they are, it can often speed up
development drastically.</p>
<p>Consider building a search function for your personal website which currently
doesn’t depend on <em>any</em> external services. Do you need one, or can you do
something ultra-simple, like loading <em>all</em> articles into memory and searching
them with Javascript? Can NYT do it?</p>
<p>Feel free to reply with your answers; I’d love to hear them! Mine will be given in the next edition.</p>
<p><a href="/napkin/problem-7/">Answer is available in the next edition.</a></p>
<p><strong>Answer to Problem 5</strong></p>
<p>The question is explained <a href="https://sirupsen.com/napkin/problem-5/">in depth in the past edition</a>. Please refresh
your memory on that first! This is one of my favourite problems in the newsletter
so far, so I highly recommend working through it — even if you’re just doing it
with my answer below.</p>
<figure><img src="/images/ba039ecb-9a11-4e32-b495-fa90f6caef4c.png" alt="" width="1168" height="491" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>(1) When each 16 KiB database page has only 1 relevant row per page, what is the
query performance (with a <code>LIMIT 100</code>)?</strong></p>
<p>This would require 100 random SSD accesses, which we know from <a href="https://github.com/sirupsen/napkin-math">the resource</a> to be <code>100 us</code> each, so a total of 10 ms for this simple query where we have to fetch a full page for each of the 100 rows.</p>
<p><strong>(2) What is the performance of (1) when all the pages are in memory?</strong></p>
<p>We can essentially assume sequential memory read performance for the 16 KiB page, which gets us to <code>(16 KiB / 64 bytes) * 5 ns =~ 1280 ns</code>. This is certainly an upper bound, since we likely won’t have to traverse the whole page in memory. Let’s round it to <code>1 us</code>, giving us a total query time of <code>100 us</code> or <code>0.1 ms</code>, or about <code>100x</code> faster than (1).</p>
<p>In reality, I’ve observed this many times where a query will show up in the slow
query log, but subsequent runs will be up to 100x faster, for exactly this
reason. The solution to avoid this is to change the primary key, which we can
now get into…</p>
<p><strong>(3) What is the performance of this query if we change the primary key to
<code>(shop_id, id)</code> to avoid the worst case of a product per page?</strong></p>
<p>Let’s assume each product is ~128 bytes, so we can fit <code>16 KiB / 128 bytes = 2^14 bytes / 2^7 bytes = 2^7 = 128</code> products per page, which means we only need a single read.</p>
<p>If it’s on disk, <code>100 us</code>, and in memory (per our answer to (2)) around <code>1 us</code>.
In both cases, we improve the worst case by 100x by choosing a good primary key.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 5: Composite Primary Keys]]></title>
        <id>https://sirupsen.com/napkin/problem-5</id>
        <link href="https://sirupsen.com/napkin/problem-5"/>
        <updated>2020-02-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for napkin problem number 5! If
you are wondering why you’re receiving this email, you likely watched my talk on
napkin math and decided to sign
up for some monthly practise.
Since last, in the napkin-math repository I’ve added system call
overhead. I’ve been also been working on <a href="https://github.com/sirupsen/napkin-math/blo]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for napkin problem number 5! If
you are wondering why you’re receiving this email, you likely watched my talk on
<a href="https://www.youtube.com/watch?v=IxkSlnrRFqc">napkin math</a> and decided to sign
up for some monthly practise.</p>
<p>Since last, I’ve added system call overhead to <a href="https://github.com/sirupsen/napkin-math">the napkin-math repository</a>. I’ve also been working on <a href="https://github.com/sirupsen/napkin-math/blob/master/src/main.rs#L594-L675"><code>io_uring(2)</code> disk benchmarks</a>, which leverage <a href="https://lwn.net/Articles/776703/">a new Linux API from 5.1</a> to queue I/O system calls (in more recent kernels networking is also supported; it’s under active development). This avoids system-call overhead and allows the kernel to order the operations as efficiently as it likes.</p>
<p>As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> for resources and help to
solve this edition’s problem! This will also have a link to the archive of past
problems.</p>
<p><strong>Napkin Problem 5</strong></p>
<p>In databases, typically data is ordered on disk by some <em>key</em>. In relational
databases (and definitely MySQL), as an example, the data is ordered by the
primary key of the table. For many schemas, this might be the <code>AUTO_INCREMENT id</code> column. A good primary key is one that <em>stores together records that are
accessed together</em>.</p>
<p>If we have a <code>products</code> table with <code>id</code> as the primary key, we might do a query like this to fetch 100 products for the <code>api</code>:</p>
<pre class="language-sql"><code class="language-sql"><span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> products <span class="token keyword">WHERE</span> shop_id <span class="token operator">=</span> <span class="token number">13</span> <span class="token keyword">LIMIT</span> <span class="token number">100</span>
</code></pre>
<p>This is going to zig-zag through the product table pages on disk to load the 100 products. In each page, unfortunately, there are records from other shops (see illustration below) that would never be relevant to <code>shop_id = 13</code>. If we are <em>really</em> unlucky, there may be only 1 product per page / disk read! Each page, we’ll assume, is 16 KiB (the default in e.g. MySQL). In the worst case, we could load 100 * 16 KiB!</p>
<figure><img src="/images/ba039ecb-9a11-4e32-b495-fa90f6caef4c.png" alt="" width="1168" height="491" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>(1) What is the performance of the query in the worst-case, where we load only one
product per page?</p>
<p>(2) What is the worst-case performance of the query when the pages are all in
memory cache (typically that would happen after (1))?</p>
<p>(3) If we changed the primary key to be <code>(shop_id, id)</code>, what would the
performance be when (3a) going to disk, and (3b) hitting cache?</p>
<p>I love seeing your answers, so don’t hesitate to email me those back!</p>
<p><a href="/napkin/problem-6/">Answer is available in the next edition.</a></p>
<p><strong>Answer to Problem 4</strong></p>
<p>The question can be summarized as: How many commands-per-second can a simple,
in-memory, single-threaded data-store do? See <a href="https://buttondown.email/computer-napkins/archive/napkin-problem-4/">the full question in the
archives</a>.</p>
<p>The network overhead of the query is <code>~10us</code> (you can find this number in
<a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a>). We expect each memory read to be random, so the
latency here is <code>50ns</code>. This comes out in the wash against the networking overhead, so
with a single CPU, we estimate that we can roughly do <code>1s/10us = 1 s / 10^-5 s = 10^5 = 100,000</code> commands per second, or about 10x what the team was seeing.
Something must be wrong!</p>
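<p>Spelled out in code, with both base rates taken from sirupsen/napkin-math:</p>

```python
network_overhead_s = 10 * 10**-6    # ~10 us of network overhead per command
memory_read_s = 50 * 10**-9         # one random memory read, ~50 ns

# The 50 ns memory read vanishes next to 10 us of networking.
per_command_s = network_overhead_s + memory_read_s
commands_per_second = 1 / per_command_s    # ~100,000 on a single CPU
```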
<p>Knowing that, you might be interested to know that <a href="https://raw.githubusercontent.com/antirez/redis/6.0/00-RELEASENOTES">Redis 6 rc1 was just released with threaded I/O support</a>.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[How does progress(1) work?]]></title>
        <id>https://sirupsen.com/progress</id>
        <link href="https://sirupsen.com/progress"/>
        <updated>2020-01-26T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We’ll cover a neat little utility called
progress(1). Many common utilities like
cp or gzip don’t spit out a progress bar by default. progress finds those
processes and estimates how far along they are with their operation. For
example, if you’re copying a 10Gb with cp, running progress will indicate
that it’s progressed 1Gb, and has another]]></summary>
        <content type="html"><![CDATA[<p>We’ll cover a neat little utility called
<a href="https://github.com/Xfennec/progress"><code>progress(1)</code></a>. Many common utilities like
<code>cp</code> or <code>gzip</code> don’t spit out a progress bar by default. <code>progress</code> finds those
processes and estimates how far along they are with their operation. For
example, if you’re copying a <code>10 GB</code> file with <code>cp</code>, running <code>progress</code> will indicate
that it’s progressed <code>1 GB</code>, and has another <code>9 GB</code> to go.</p>
<p>Here’s an example, kindly borrowed from the project’s README:</p>
<figure><img src="/images/progress.png" alt="Picture showing progress(1) in a terminal." width="720" height="278" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>What I was interested in is: how does it work? The <a href="https://github.com/Xfennec/progress#how-does-it-work">README</a> briefly goes
over it, but I wanted to go a little deeper. Fortunately, it’s a fairly simple C
program. While this utility works on MacOS, I’ll cover how it works on Linux.
On MacOS, the methods for obtaining the information about the file descriptors
and processes are slightly different, utilizing a library called <code>libproc</code>, due
to the absence of the <code>/proc</code> file-system. That’s as deep as we’ll go on MacOS.</p>
<p>At the heart of <code>progress</code>, we find the function <a href="https://github.com/Xfennec/progress/blob/7a0767dc0b2b6763a4c947ecfe9c140c93655ab9/progress.c#L686"><code>monitor_processes</code></a>.
On Linux, every process exposes itself as a directory on the file-system in
<code>/proc</code> as <code>/proc/&lt;pid&gt;</code>. In that directory there’s, for example, the <code>exe</code> file: a
link pointing to the binary that the process is executing, such as
<code>/bin/tar</code>. There are many other interesting links and files in here. I
open <code>environ</code> regularly in production to check which environment variables a
process was started with. Other files will tell you about its memory usage, various process
configuration, or its priority if the OOM-killer is looking for its next target.</p>
<p><code>progress</code> will look through the <code>exe</code> links for all processes on the system to
find interesting binaries, like <code>cp</code>, <code>cat</code>, <code>tar</code>, <code>grep</code>, <code>cut</code>, <code>gunzip</code>,
<code>sort</code>, <code>md5sum</code>, and many <a href="https://github.com/Xfennec/progress/blob/7a0767dc0b2b6763a4c947ecfe9c140c93655ab9/progress.c#L61-L69">more</a>.</p>
<p>For each of these processes, it’ll scan every file descriptor the process has
opened through the <code>/proc/&lt;pid&gt;/fd</code> and <code>/proc/&lt;pid&gt;/fdinfo</code> directories. These
contain ample information about the file, such as the name of the file, the
size, what position we’re reading at, and so on. <code>progress</code> will skip file
descriptors that are invalid or are not for files, e.g. a socket.</p>
<p><code>progress</code> will find the biggest file descriptor opened by the process (e.g.
whatever <code>cp</code> is copying) and see what offset in the file the process is at.
Based on that, the total file size, and a second read after waiting a second,
it can estimate the progress of the process and its throughput.</p>
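<p>A toy version of the Linux mechanism can be sketched in Python. This is a sketch, not <code>progress</code>’s actual C code: the <code>pos:</code> field and the paths are real <code>/proc</code> conventions, but the function names are made up.</p>

```python
import os
import re

def parse_pos(fdinfo_text):
    # Pull the current file offset out of the contents of
    # /proc/PID/fdinfo/FD, which looks like "pos:\t4096\nflags:\t..."
    return int(re.search(r"^pos:\s*(\d+)", fdinfo_text, re.MULTILINE).group(1))

def fd_progress(pid, fd):
    # Estimate how far the process is through the file it has open on
    # this descriptor, the way progress(1) does: current offset / size.
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        pos = parse_pos(f.read())
    size = os.stat(f"/proc/{pid}/fd/{fd}").st_size
    return pos / size if size else 1.0
```

<p>Sampling <code>fd_progress</code> twice with a second in between gives the throughput estimate.</p>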
<p>Once <code>progress</code> has done this for all processes, it’ll either quit or do it all
over again (this only takes a few milliseconds). To the user, this appears as
continuous monitoring of the processes’ progress!</p>
<p>Of course, this simple method has its limitations. If you’re copying a lot of
small files, then it won’t help you very much. It could be extended to detect
such programs and monitor them, but it’s certainly not trivial. The way it
works also limits its usefulness for network transfers, depending on how the network
program is written. If it streams a file to disk as it transfers it, it’ll work
well; but if it loads the whole thing into memory and then transfers it,
<code>progress</code> won’t know what to do. From the documentation, it appears to work
well for downloads in many browsers, presumably because they pre-allocate
a large file based on the Content-Length header. <code>progress</code> can then
monitor how far along the offset we are.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 4: Redis throughput]]></title>
        <id>https://sirupsen.com/napkin/problem-4</id>
        <link href="https://sirupsen.com/napkin/problem-4"/>
        <updated>2020-01-07T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for napkin problem number four! If you are wondering why you’re receiving this email, you likely watched my talk on napkin math and decided to sign up for some monthly training.
Since last, there has been some smaller updates to the napkin-math repository and the accompanying program. I’ve been brushing up on x86 to ensure that the]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for napkin problem number four! If you are wondering why you’re receiving this email, you likely watched my talk on <a href="https://www.youtube.com/watch?v=IxkSlnrRFqc">napkin math</a> and decided to sign up for some monthly training.</p>
<p>Since last, there have been some smaller updates to <a href="https://github.com/sirupsen/napkin-math">the napkin-math repository</a> and the accompanying program. I’ve been brushing up on x86 to ensure that the base-rates truly represent the upper bound, which will require some smaller changes. The numbers are unlikely to change by an order of magnitude, but I am dedicated to making sure they are accurate. If you’d like to help with providing some napkin calculations, I’d love contributions around serialization (JSON, YAML, …) and compression (Gzip, Snappy, …). I am also working on turning all my notes from the above talk into a long, long blog post.</p>
<p>With that out of the way, this week we’ll do a slightly easier problem than last! As always, consult <a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> for resources and help to solve today’s problem.</p>
<p><strong>Napkin Problem 4</strong></p>
<p>Today, as you were preparing your organic, high-mountain Taiwanese oolong in the kitchenette, one of your lovely co-workers mentioned that they were looking at adding more Redises, because they were trending aggressively towards its max of 10,000 commands per second. You asked them how they were using it (were they running some obscure O(n) command?). They’d used BPF probes to determine that it was all <code>GET &lt;key&gt;</code> and <code>SET &lt;key&gt; &lt;value&gt;</code>. They also confirmed all the values were roughly 64 bytes or less. For those unfamiliar with Redis, it’s a single-threaded, in-memory key-value store written in C.</p>
<p>Unfazed after this encounter, you walk to the window. You look out and sip your high-mountain Taiwanese oolong. As you stare at yet another condominium building being built—it hits you. 10,000 commands per second. 10,000. Isn’t that abysmally low? Shouldn’t something that’s fundamentally ‘just’ doing random memory reads and writes over an established TCP session be able to do more?</p>
<p>What kind of throughput might we be able to expect for a single-thread, as an absolute upper-bound if we disregard I/O? What if we include I/O (and assume it’s blocking each command), so it’s akin to a simple TCP server? Based on that result, would you say that they have more investigation to do before adding more servers?</p>
<p><em>Solution to this problem is <a href="/napkin/problem-5/">available in the next edition</a></em></p>
<p><strong>Answer to Problem 3</strong></p>
<p>You can read the problem in the archive, <a href="https://buttondown.email/computer-napkins/archive/16a42790-e498-4804-8e17-769ff3a30d34">here</a>.</p>
<figure><img src="/images/2042e909-962a-48d6-b1e5-a7e03c6f7092.png" alt="" width="1261" height="644" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>We have 4 bitmaps (one per condition) of <code>10^6</code> product ids, each of 64 bits.
That’s <code>4 * 10^6 * 64 bits = 32 MB</code>. Would this be in memory or on SSDs? Well,
let’s assume the largest merchants have 10^6 products and 10^3 attributes; that
means a total of <code>10^6 * 10^3 * 64 bits = 8 GB</code>. That’d cost us about $8 in
memory, or about $1 to store on disk. In terms of performance, this is nicely
sequential access. For memory, <code>32 MB * 100 us/MB = 3.2 ms</code>. For SSD (about 10x
cheaper, and 10x slower than memory), 30 ms. 30 ms is a bit high, but 3 ms is
acceptable. $8 is not crazy, given that this would be the absolute largest
merchant we have. If cost becomes an issue, we could likely employ good caching.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 3: Membership Intersection Service]]></title>
        <id>https://sirupsen.com/napkin/problem-3</id>
        <link href="https://sirupsen.com/napkin/problem-3"/>
        <updated>2019-12-15T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends, from near and far, it’s time for napkin problem number three! If you are wondering why you’re receiving this email, you likely watched my talk on napkin math.
This week’s problem is higher level, which is different from the past few. This makes it more difficult, but I hope you enjoy it!
Napkin Problem 3
You are considering how you might implement a set-membership service. Your use-c]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends, from near and far, it’s time for napkin problem number three! If you are wondering why you’re receiving this email, you likely <a href="https://www.youtube.com/watch?v=IxkSlnrRFqc">watched my talk on napkin math.</a></p>
<p>This week’s problem is higher level, which is different from the past few. This makes it more difficult, but I hope you enjoy it!</p>
<p><strong>Napkin Problem 3</strong></p>
<p>You are considering how you might implement a set-membership service. Your use-case is to build a service to filter products by particular attributes, e.g. efficiently among all products for a merchant get shoes that are: black, size 10, and brand X.</p>
<p>Before getting fancy, you’d like to examine whether the simplest possible algorithm would be sufficiently fast: store, for each attribute, a list of all product ids for that attribute (see drawing below). Each query to your service will take the form: <code>shoe AND black AND size-10 AND brand-x</code>. To serve the query, you find the intersection (i.e. product ids that match in all terms) between all the attributes. This should return the product ids for all products that match that condition. In the case of the drawing below, only P3 (of those visible) matches those conditions.</p>
<figure><img src="/images/7dfa1786-d88e-41bd-b336-30a9092db882.png" alt="Picture illustrating the attributes and product ids." width="400" height="406" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>The largest merchants have 1,000,000 different products. Each product will be represented in this naive data-structure as a 64-bit integer. While simply shown as a list here, you can assume that we can perform the intersections between rows efficiently in O(n) operations. In other words, in the worst case you have to read all the integers for each attribute only once per term in the query. We could implement this in a variety of ways, but the point of the back-of-the-envelope calculation is to not get lost in the weeds of implementation too early.</p>
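<p>For concreteness, the naive algorithm can be sketched in a few lines (the product ids below are hypothetical):</p>

```python
def intersect(attribute_ids):
    """Intersect the product-id lists of every term in the query,
    reading each list only once (O(n) per term)."""
    result = set(attribute_ids[0])
    for ids in attribute_ids[1:]:
        result &= set(ids)  # keep only ids present in both sets
    return result

# Hypothetical data: product 3 is the only one matching all four terms.
matches = intersect([
    [1, 3, 7],  # shoe
    [3, 7, 9],  # black
    [2, 3, 5],  # size-10
    [3, 8],     # brand-x
])
# matches == {3}
```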
<p>What would you estimate the worst-case performance of an average query with 4 AND conditions to be? Based on this result and your own intuition, would you say this algorithm is sufficient or would you investigate something more sophisticated?</p>
<p>As always, you can find resources at <a href="https://github.com/sirupsen/napkin-math">github.com/sirupsen/napkin-math</a>. The talk linked is the best introduction to the topic.</p>
<p>Please reply with your answer!</p>
<p><em>Solution to this problem is <a href="/napkin/problem-4/">available in the next edition</a></em></p>
<p><strong>Answer to Problem 2</strong></p>
<p><em>Your SSD-backed database has a usage-pattern that rewards you with an 80%
page-cache hit-rate (i.e. 80% of disk reads are served directly out of memory
instead of going to the SSD). The median is 50 distinct disk pages for a query
to gather its query results (e.g. InnoDB pages in MySQL). What is the expected
average query time from your database?</em></p>
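<p>The arithmetic in the answer below can be sketched as (the ~100 us random-read figure comes from the reference):</p>

```python
pages = 50           # distinct disk pages for the median query
hit_rate = 0.8       # page-cache hit-rate
ssd_read_s = 100e-6  # ~100 us per random SSD read, per the reference

cached_reads = pages * hit_rate   # 40 reads served from memory (~free)
ssd_reads = pages - cached_reads  # 10 reads go to the SSD
ssd_time_ms = ssd_reads * ssd_read_s * 1000  # ~1 ms lower bound
```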
<p><code>50 * 0.8 = 40</code> disk reads come out of the memory cache. The remaining 10 SSD
reads require a random SSD seek, each of which will take about <code>100 us</code> as per
<a href="https://github.com/sirupsen/napkin-math">the reference</a>. The reference says 64
bytes, but the OS will read a full page at a time from SSD, so this will be
roughly right. So call it a lower bound of <code>1ms</code> of SSD time. The page-cache
reads will all be less than a microsecond, so we won’t even factor them in. It’s
typically the case that we can ignore any memory latency as soon as I/O is
involved. Somewhere between 1-10ms seems reasonable, when you add in
database-overhead and that 1ms for disk-access is a lower-bound.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 2: Expected Database Query Latency]]></title>
        <id>https://sirupsen.com/napkin/problem-2</id>
        <link href="https://sirupsen.com/napkin/problem-2"/>
        <updated>2019-11-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Fellow computer-napkin-mathers, it’s time for napkin problem #2. The last
problem’s solution you’ll find at the end! I’ve updated
sirupsen/napkin-math with last week’s
tips and tricks—consult that repo if you need a refresher. My goal for that
repo is to become a great resource for napkin calculations in the domain of
computers. My talk from SRECON’s video was published this week, you can see it
<a href="https://www.youtube.com/watch?v=Ixk]]></summary>
        <content type="html"><![CDATA[<p>Fellow computer-napkin-mathers, it’s time for napkin problem #2. The last
problem’s solution you’ll find at the end! I’ve updated
<a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> with last week’s
tips and tricks—consult that repo if you need a refresher. My goal for that
repo is to become a great resource for napkin calculations in the domain of
computers. The video of my talk from SRECON was published this week; you can see it
<a href="https://www.youtube.com/watch?v=IxkSlnrRFqc">here.</a></p>
<p><strong>Problem #2: Your SSD-backed database has a usage-pattern that rewards you with
an 80% page-cache hit-rate (i.e. 80% of disk reads are served directly out of
memory instead of going to the SSD). The median is 50 distinct disk pages for a
query to gather its query results (e.g. InnoDB pages in MySQL). What is the
expected average query time from your database?</strong></p>
<p>Reply to this email with your answer, happy to provide you mine ahead of time if
you’re curious.</p>
<p><em>Solution to this problem is <a href="/napkin/problem-3/">available in the next edition</a></em></p>
<p><strong>Last Problem’s Solution</strong></p>
<p><strong>Question:</strong> <strong>How much will the storage of logs cost for a standard,
monolithic 100,000 RPS web application?</strong></p>
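<p>The arithmetic can be sketched in a few lines (rates as in the prose answer: ~1 KB of logs per request and $0.01 per GB-month of disk):</p>

```python
bytes_per_request = 1e3        # ~1 KB of logs per request
requests_per_second = 1e5
seconds_per_day = 9e4          # 86,400, rounded to 9 * 10^4
disk_cost_per_gb_month = 0.01  # $/GB/month, per the reference

bytes_per_day = bytes_per_request * requests_per_second * seconds_per_day
# 9 * 10^12 bytes/day = 9 TB/day

# Keeping a month of logs around: ~270 TB stored, at $0.01/GB/month.
gb_stored = bytes_per_day / 1e9 * 30
cost_per_month = gb_stored * disk_cost_per_gb_month  # ~$2,700/month
```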
<p><strong>Answer:</strong> First I jotted down the basics and converted them to scientific
notation for easy calculation: <code>~1 * 10^3 bytes/request (1 KB)</code>, <code>9 * 10^4 seconds/day</code>, and <code>10^5 requests/second</code>. Then I multiplied these numbers into
storage per day: <code>10^3 bytes/request * 9 * 10^4 seconds/day * 10^5 requests/second = 9 * 10^12 bytes/day = 9 TB/day</code>. Then we need to use the
monthly cost for disk storage from
<a href="https://github.com/sirupsen/napkin-math">sirupsen/napkin-math</a> (or your cloud’s
pricing calculator) — <code>$0.01 GB/month</code>. So we have <code>9 TB/day * $0.01 GB/month</code>. We
do some unit conversions (you could do this by hand to practise, or on
Wolframalpha) and get to <code>$3 * 10^3 per month</code>, or $3,000 per month. Most of
those who replied got somewhere between $1,000 and $10,000 — well within an
order of magnitude!</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Napkin Problem 1: Logging Cost]]></title>
        <id>https://sirupsen.com/napkin/problem-1</id>
        <link href="https://sirupsen.com/napkin/problem-1"/>
        <updated>2019-10-19T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Napkin friends around the world: it’s time for your very first systems estimation problem! Confused why you’re receiving this email? Likely you attended my talk at SRECON 19, where I said that I’d start a newsletter with occasional problems to practise your back-of-the-envelope computer calculation skills—if enough of you subscribed! Enough of you did, so here we are!
Problem #1: How much will the storage of logs cost]]></summary>
        <content type="html"><![CDATA[<p>Napkin friends around the world: it’s time for your very first systems estimation problem! Confused why you’re receiving this email? Likely you <a href="https://www.usenix.org/conference/srecon19emea">attended my talk at SRECON 19</a>, where I said that I’d start a newsletter with occasional problems to practise your back-of-the-envelope computer calculation skills—if enough of you subscribed! Enough of you did, so here we are!</p>
<p><strong>Problem #1: How much will the storage of logs cost for a standard, monolithic 100,000 RPS web application?</strong></p>
<p>Reply to this email with your answer and how you arrived there. Then I’ll send you mine.</p>
<p><em>Solution to this problem is <a href="/napkin/problem-2/">available in the next edition</a></em></p>
<p><strong>Hints</strong></p>
<p>You can find many numbers you might need on <a href="https://github.com/sirupsen/base-rates">sirupsen/base-rates.</a> If you don’t find them, consider submitting a PR! I hope for that repo to grow to be the canonical source for systems napkin math.</p>
<p>Don’t overcomplicate the solution by including e.g. CDN logs, slow query logs, etc. Keep it simple.</p>
<p>You might want to refresh your memory on <a href="https://en.wikipedia.org/wiki/Fermi_problem">Fermi Problems</a>. You need less precision than you think: the goal is just to get the exponent right, the x in n * 10^x.</p>
<p><a href="https://www.wolframalpha.com">Wolframalpha</a> is good at calculating with units, you may use that the first few times—but over time the goal is for you to be able to do these calculations with no aids!</p>
<p>Consider using spaced repetition to remember the numbers you need for today’s problem, e.g. <a href="http://communis.io/">http://communis.io/</a> is a messenger bot.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[2018]]></title>
        <id>https://sirupsen.com/2018</id>
        <link href="https://sirupsen.com/2018"/>
        <updated>2019-01-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Every year, I spend some time reflecting on the year that passed. After reading
last year’s post, I noticed a fair bit of self-indulgent tangent
chasing. Most of which should likely have been separate posts. I’m attempting
less of that this year. I’m continuing to evolve the format, but it’ll probably
be a few years until I settle on one.
Berlin
Jenn took a medium-term assignment in Berlin, so a decent chunk of 2018 I spent
stretched between Be]]></summary>
        <content type="html"><![CDATA[<p>Every year, I spend some time reflecting on the year that passed. After reading
<a href="/2017/">last year’s post</a>, I noticed a fair bit of self-indulgent tangent
chasing. Most of which should likely have been separate posts. I’m attempting
less of that this year. I’m continuing to evolve the format, but it’ll probably
be a few years until I settle on one.</p>
<h2 id="berlin">Berlin</h2>
<p>Jenn took a medium-term assignment in Berlin, so a decent chunk of 2018 I spent
stretched between Berlin and Ottawa. After five years in Ottawa, I was starting
to feel a tad restless. Five years easily turns into 10, and while five years is
a long time, 10 is a really long time. Spending time in Berlin provided an
opportunity to test what life would be like in an “objectively cooler” city,
without committing to a major change. We enjoyed some fantastic weekends in
Berlin: knödel shops where the hairdo-memo said ‘Grease’ (unfortunately, we
missed it, so no mullet this time around), biking across the city with friends
visiting from Denmark to a bus-turned-café, and the weekly kinda-festival at
Mauerpark, where amphitheatres turn into makeshift crowd-karaoke. Despite all of
this, the best thing about the stint in Berlin was, as cliché as it may sound,
the re-appreciation of how good my life is in Ottawa. Berlin is a city that
screams ‘temporary.’ I don’t recall meeting a single person ‘from there’ or a
single person who wanted to stay there permanently.  The city has a faint smell
of millennial quarter-life crisis, I know, because given another year, that’d
likely have been what drew me there! Close to family, but also close to the
global pulse. In contrast, Ottawa has the diametrically opposite effect on
people. After this, I’m pretty okay with that.</p>
<h2 id="reading">Reading</h2>
<p>More so than the satisfaction of chasing a high number of books read, it was a
significant focus-point for 2018 to evolve the system <strong>around</strong> reading. I
increasingly feel that the more time I allocate to processing what I’ve read
(primarily through writing, creating flashcards, and cataloging ideas), the more
long-term reward. I wrote a much longer post about <a href="https://sirupsen.com/read/">the
system</a> I went through most of 2018 with. It’ll
continue to evolve, and I expect to update the post within the next year or two
with the experiments I’m carrying out. The feedback loops on increasing reading
retention are wonderfully and painfully long. Last year, I ended up <a href="https://www.goodreads.com/user_challenges/10779425">reading
around 55 books</a>. Some that
stood out were The Wright Brothers; wonderful story of innovation and fortitude,
The North Water; the fiction that’s kept me most glued since Harry Potter, The
Course of Love; raw and genuine account of long-term relationships, Doing Good
Better; a way to think about charity that appealed to me, and The Goal; part of
the underrated genre of fiction with a refreshingly tangible takeaway.</p>
<h2 id="health">Health</h2>
<p>The frequent flights between the New and Old World were dreadful. The whole
thing clinched for me that the romantic idea of a “Nomad Lifestyle” would be a
nightmare for me. If that phase of life hits me, it’s clear that my shape will
be in 3-month chunks, not backpack-increments. Always coming out of jet lag, or
being about to go in it, was exhausting. That, and the poor seating that invited
poor posture. Under those conditions, it proved challenging to improve physical
health, despite the Gym in Berlin being the best I’ve frequented yet. It had
that dungeon-gym vibe I didn’t know I’d craved that badly. The health hit of jet
lag and transit-nutrition was uplifted by the intimidation factor of the guy
next to you casually deadlifting 500lbs, with his dog taking a nap on the
platform. This year, 2019, I hope to make some strides to improve my physical
fitness. More specifically, I’d like a ball to chase (event, in this context)
and improve my cardio, not just strength.</p>
<p>Inspired by a co-worker’s pulse watch, I decided that’d be an excellent motivator
to incorporate more cardio. Having a heart-rate monitor with a number closely
tuned to how miserable I’m feeling turned out to be a winning bet for tying my
running shoes more often. An unexpected additional benefit was that friends
started popping up in the Apple Watch fitness app. I have no problems with
abusing my competitive gene without shame when it comes to my health. Beating
Jeff turns out to be a great motivator.</p>
<h2 id="work">Work</h2>
<p>2018 became a year of building teams. In 2017, we were about 1.5 teams, but by
the end of 2018, there were 3. The realization that I needed to build these teams
led to an intense hiring cycle. Time well spent. With these teams, we’re able to
do the things that we’ve dreamt about for many years—now, rather than someday.
It was a year with two themes: moving everything to the Cloud, and, improving
reliability. For the former, the team built a tool that allows us to move a shop
from one database to another with virtually no impact to the merchant. With this
tool, we moved every single shop individually from our data-centers to the
cloud. It’s mind-boggling to me that we’ve run every Shopify merchant through
this tool without mangling any.</p>
<p>Long-term, the concern for any company is that development slows down. You
combat that with world-class tooling. One thing we started investing in as a
team is a standard way for all the applications inside the company to
communicate. We started seeing more and more applications built independently,
but the tooling for them to leverage each other wasn’t improving (for the nerds
in the crowd: RPC). We laid the brickwork in 2018, but this year I’m
confident we’ll start to see the first massive benefits within the company from
this foundational investment. Third, we process about 1 billion jobs in the
background at Shopify per day. This infrastructure hasn’t gotten a lot of love
over the past five years, so the third team is built around improving this
machinery. They not only did that but also started experimenting with
automatically scaling workloads based on how busy the platform is. What I’m most
proud of is the increasing autonomy of these teams. Their independence frees up
time in 2019 to focus on the next project and the next squad. If you’re
interested in any of this, you should shoot me an email.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[How I Read]]></title>
        <id>https://sirupsen.com/read</id>
        <link href="https://sirupsen.com/read"/>
        <updated>2018-07-15T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Until a few years ago, I didn’t spend much time reading. Today, I spend a few
hours every week reading, amounting to somewhere between 30 and 50 books a year.
My reading habit has evolved significantly over the past couple of years and
surely will continue to. In this post, I will describe how I approach my
reading. You may think it’s elaborate (other people’s reading systems rub me the
same way), however, keep in mind it’s evolved slowly over the years.

A complex system t]]></summary>
        <content type="html"><![CDATA[<p>Until a few years ago, I didn’t spend much time reading. Today, I spend a few
hours every week reading, amounting to somewhere between 30 and 50 books a year.
My reading habit has evolved significantly over the past couple of years and
surely will continue to. In this post, I will describe how I approach my
reading. You may think it’s elaborate (other people’s reading systems rub me the
same way), however, keep in mind it’s evolved slowly over the years.</p>
<blockquote>
<p>A complex system that works is invariably found to have evolved from a simple
system that worked. A complex system designed from scratch never works and
cannot be patched up to make it work. You have to start over with a working
simple system. – John Gall</p>
</blockquote>
<p>It’s also worth noting that this is not an aspirational post. This is what I
<em>actually</em> do, and have done for a while—otherwise, I wouldn’t think it would be
worth sharing.  I often think of the classic Charlie Munger quote on reading,
he’s not wrong:</p>
<blockquote>
<p>In my whole life, I have known no wise people (over a broad subject matter
area) who didn’t read all the time — none, zero. You’d be amazed at how much
Warren reads—and at how much I read. My children laugh at me. They think I’m
a book with a couple of legs sticking out. – Charlie Munger</p>
</blockquote>
<p>The post is divided into a section for each part of the reading process: (1:
Sourcing), (2: Choosing), (3: Reading), and (4: Processing).</p>
<h2 id="sourcing">Sourcing</h2>
<p>Whenever I stumble upon a recommendation for a book, I will follow the link to
Amazon and send the page to Instapaper. I have a script that automatically
converts any Instapaper book links into rows in an Airtable. Endorsements from
trusted sources will be added, too. <a href="https://gist.github.com/sirupsen/39bd17cbcd713936ccee91d6f5e1b761">This
script</a> will
automatically add metadata about the book from Goodreads such as genre, year
published, author, and so on.</p>
<figure><img src="/images/uuO77yFj0Pfsb6YO0fUdN6ZO.png" alt="Airtable of images" width="2000" height="362" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<h3 id="what-would-i-like-to-improve-about-choosing">What would I like to improve about sourcing?</h3>
<p>Whenever I send the book to Instapaper, I’d like to attach a name to it and
automatically add them as an endorser. There are also certain people whose book
recommendations I seek. Automatically adding their endorsed books to my feed
would be valuable. If I start going deep on a topic, I may want to read a
follow-up book on the topic and will go through my sourcing list first.
Attaching a summary or similar would help make the searches more fruitful. In
general, I would like someone else to solve this problem for me. Improving it
further to aid in choosing would be a non-trivial amount of engineering. See
the next section, (2: Choosing), for a much more elaborate answer to how I’d
like to improve the sourcing and choosing process altogether.</p>
<h3 id="what-are-changes-ive-made-in-sourcing">What are changes I’ve made in sourcing?</h3>
<p>I used to have a habit of buying the books I wanted to read instead of simply
sourcing them. That’s an expensive sourcing method. Inevitably, it grew into a
large number of unread books on my Kindle, which made me often dread opening it.
It felt like an ever-growing to-do list (where each item takes many hours to
complete). This is popularized as an
<a href="https://fs.blog/2013/06/the-antilibrary/">anti-library</a>—I don’t think this
translates to the Kindle world well, but may work for the physical realm for
books you <em>know</em> you want to read. Most importantly, it means that finishing a
book always becomes a new adventure in choosing a new book without considering
the sunk cost of already having bought another book which may mean I read less
relevant books. Generally, I subscribe to not counting money spent on books.
It’s $10, and it could very easily change your life. That’s a bargain. I will
acknowledge this is a privileged argument, but libraries make good allies if
buying is too expensive. Old books (which have stood the test of time, see next
section) often cost pennies on Amazon.</p>
<h2 id="choosing">Choosing</h2>
<p>For choosing books, I have a couple of heuristics I apply as I scurry through my
sourcing list, Google, Goodreads, and other trusted sources:</p>
<p><strong>1. What book is most applicable right now?</strong> If I can find a book that I can
start applying <em>right now</em> in whatever I’m dealing with, it’ll take precedence
over any other heuristic. If I’m about to recruit, reading books about building
teams and recruiting would be highly applicable. With an immediate opportunity
to put it into practice, it is much easier to have things stick and make an
impact. This is the most important heuristic, however, it is often challenging
to find such a book. Especially with the relatively poor sourcing tools I feel
that I have available.</p>
<p><strong>2. Syntopical reading.</strong> If I’ve been diving into a single topic, I may try to
pick up a few more works in the same category to make sure I see the problem
from different angles. I find that this helps strengthen the concepts too, as I
get to run an internal mock dialog between the authors of the books where they
agree or disagree. If it’s on a topic that strongly satisfies (1), I am more
likely than not to do syntopical reading. On the other hand, if I am mostly
looking for an overview of a topic—I may save syntopical reading for the future.</p>
<p><strong>3. Books that have aged well.</strong> If the book has been out 10 years, it’s
<a href="https://en.wikipedia.org/wiki/Lindy_effect">likely it might still be relevant another 10 years from now</a>.
If it’s been out for 100 years, it’s likely it’ll be around for another 100
years. If I am diving into recruiting due to heuristic (1), I’ll look for the
book that’s 10 years old, not the one that was published this spring. In
fast-moving fields, newer can be better, in which case I may start with new, and
then read the old. This applies to e.g. software, where I’ll likely default to
what was published recently but often go back and understand how we ended up
here by reading older material. In most sciences, old is good. I found Darwin’s
original work <strong>surprisingly</strong> readable.</p>
<p><strong>4. What discipline or topic am I weak in?</strong> I believe that at some point,
optimizing for breadth in your reading to complement your depth becomes more
impactful than going even deeper. As Munger puts it, accumulate the big ideas
from the big disciplines. There are so many disciplines where people learn to
think in different ways to solve different problems. Over time, I’d like to get
a rudimentary understanding of most of the major disciplines: law, biology,
economics, history, physics, and the list goes on. This will take a lifetime,
but I think the process will be both enjoyable and useful. I attempt to balance
disciplines, but this easily gets thrown off by other heuristics. There’s a fine
balance with (1). Breadth, (4), is most useful with depth, (1).</p>
<p><strong>5. Modern translations or interpretations are not inherently bad, especially
as introductions to a topic.</strong> Old is good, but can be taken too far. I enjoyed
reading <a href="https://www.amazon.ca/Guide-Good-Life-Ancient-Stoic/dp/0195374614">A Guide to the Good
Life</a> from
2008 as an introduction to stoicism much more than <a href="https://www.goodreads.com/book/show/97411.Letters_from_a_Stoic">Letters from a
Stoic</a> from BC
something something. Here, the concepts applied (Stoicism) have stood the test
of time—but it may be easier to apply if written by someone in the 21st century.
If you’re really into it, by all means, go to the primary source (I did).
Similarly, wanting to take advantage of knowing Danish, I started reading
Kierkegaard a few years ago. I preferred the English translation because you
won’t get chastised for modernizing a translation the same way you would for
modernizing the origin language. If you’re really into a topic, it’s silly to
not go to the primary source a book or two into the topic, though. If you’re
into stoicism (as pointed out here), go to the original works.  They’re very
readable, otherwise, the ideas would not have aged as well as they did.</p>
<p><strong>6. What are my friends reading?</strong> If my friends have read a book, that’s a
free opportunity to talk with them about it or ask them whether it fits my
criteria. It’s a free book club opportunity, helping to nudge the concepts into
long-term memory and get perspective. I don’t want to have 100% overlap with my
friends, but once in a while, if the stars align—I like this opportunity. In
general, I abuse friends’ reading more to assist with (1: Applicability), as
these can be difficult books to find.</p>
<p><strong>7. Audiobooks for narrative, Kindle for anything else.</strong> While less of a
heuristic for choosing the next book, this is still something that I find
useful. If a book has a narrative, such as history, biographies, or novels—then
it falls in the Audiobook bucket for me. I may experiment with re-reads as audio
at some point. For anything else, I’ll read it on my Kindle. Some narratives are
too technical for audiobooks for me; for example, I started listening to a book
about the fall of Enron and found it too difficult to follow through audio due
to the large amount of industry and finance jargon.</p>
<p><strong>8. Skim the free sample of your top <code>x</code> books.</strong> I learned from <a href="https://danieldoyon.co/">Dan Doyon</a>
that Amazon will send you free samples of books. His Kindle is laden with
Kindle samples, and he’ll choose his next book by skimming through 10s of these
to hit one he finds most interesting at that moment. I’ve started adopting
skimming the top samples that come out of the other heuristics. I find this a
useful supporting heuristic for e.g. (1: Applicability) and (4: Breadth). It’s
easy to choose a book, especially on a new topic, where the <strong>idea</strong> of knowing
about it (e.g. basic accounting) sounds intriguing, but you may just not be in
the right place and time for it to be interesting enough to follow through.</p>
<figure><img src="/images/2RrzgDqDXqbXmmsszgDof3UB.png" alt="Pasted image" width="200" height="342" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<h3 id="what-would-i-like-to-improve-about-choosing-1">What would I like to improve about choosing?</h3>
<p>What bothers me most about my choosing and sourcing is that it’s at the wrong
abstraction level. I should be choosing <strong>topics</strong> and <strong>skills</strong> and sorting
those by the applicability heuristic, rather than books. While books are useful,
the ultimate goal here is not to read books—but to learn. There are other ways
to learn than books: courses, classes, conversations, exercises, travel, coding
ideas, crafts, and so on. “Reading” as a way to acquire knowledge is useful, and
I see the majority of my time being spent here for personal development—however,
I would like to not choose the next <strong>book</strong> but the next <strong>topic</strong>. Not: “This
book about photography” but rather “The topic of photography” with the
supporting sourcing and choosing tooling that’ll allow me to then dig into
books.</p>
<p>The tooling I have now does not support my (1: Applicability) and (4: Breadth)
heuristics well. Self-assessing which skills I’m weak in assumes I have no blind
spots, which would be incredibly naïve to believe. (6: Friends) and what they
read help shed some light on those blind spots, but are largely disconnected
from what might be useful for me. I am not sure exactly what I want, but I feel
that I should move towards a list of topics I would like to get into and sort
them by attributes such as current knowledge about the topic, upper-bound return on
investment, lower-bound return on investment, applicability, enjoyment, and
perhaps a couple others. This would allow me to go much wider, from playing
chess (which I likely don’t have a single book in my sourcing list about) to a
rudimentary understanding of a new language (no Spanish grammar books in my
sourcing list, I am afraid), because it would gain me the ability to visualize
my opportunity cost more clearly and put myself another level away from the
currently fairly subjective choice of next book. I certainly wouldn’t deny
that there can be a serendipitous, highly positive benefit to at times choosing
semi-random, recommended books in a broad topic such as management. I feel
that’s what I end up doing most of the time, and I crave more.</p>
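<p>The topic list sketched above could be prototyped in a few lines. The attribute names, the 1–7 scales, and the weights below are my own assumptions for illustration, not a system described in the post:</p>

```python
# Hypothetical sketch of choosing the next *topic* rather than the next book,
# sorted by the attributes mentioned above. Scales (1-7) and weights are
# assumptions, not a validated system.
topics = [
    # (topic, current_knowledge, upper_roi, lower_roi, applicability, enjoyment)
    ("basic accounting", 2, 6, 3, 5, 3),
    ("chess", 3, 3, 1, 2, 6),
    ("spanish", 1, 5, 2, 4, 5),
]

def score(topic):
    _, knowledge, upper_roi, lower_roi, applicability, enjoyment = topic
    # Favor topics with high return bounds and applicability; subtract current
    # knowledge so under-explored topics rank higher; enjoyment sustains habit.
    return upper_roi + lower_roi + 2 * applicability + enjoyment - knowledge

ranked = sorted(topics, key=score, reverse=True)
```

<p>Even a crude score like this makes the opportunity cost between topics visible, which is the whole point of moving the choice up an abstraction level.</p>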
<p>I crave too much structure, but I feel that significant investment into this
aspect would pay serious dividends. It’s likely that I will experiment with an
Airtable for this over the coming years and make changes to this article. Most
of all I hope someone else will build this, but most likely it’s far too
systematic. It is also possible that chaos wins here, but I refuse to
believe I cannot get a system that outperforms chaos by at least 10-20%—which
would be a <strong>major</strong> win over a lifetime.</p>
<h3 id="what-are-changes-ive-made-in-choosing">What are changes I’ve made in choosing?</h3>
<p>This used to be “go down the list on Audible” or “go down the list on the
Kindle” of books already purchased. However, “just in time” choosing has been
much more effective to satisfy the most important heuristic (1): What book can
have the biggest impact for me right now? In general, I would advise looking at
your choosing akin to an efficient factory. You shouldn’t have massive piles of
inventory in front of every machine, but rather optimize the overall throughput
through the factory.</p>
<h2 id="reading">Reading</h2>
<p>Typically, I have about 3 books on the go: An Audiobook, a fiction book on the
Kindle, and a non-fiction work on the Kindle. When reading, I attempt to focus
on a couple of things, most of them aimed at improving retention.</p>
<p><strong>1. Highlights.</strong> I will highlight the interesting parts of a book. Often, I
take notes too, as I have too many times returned to a highlight and had a hard
time figuring out why I found it important at the time of reading. Typing on the
Kindle is painful to begin with, but you get the hang of it eventually. I use
<a href="https://readwise.io">Readwise</a> for working with my highlights (more on
this in the processing stage), and use
<a href="https://blog.readwise.io/tag-your-highlights-while-you-read/">tags</a>, special
tags <a href="https://blog.readwise.io/combine-highlights-on-the-fly/">to combine highlights on the fly</a>, and <a href="https://blog.readwise.io/add-chapters-to-highlights/">their header tags</a> to add sections for
a table of contents. I also highlight words I don’t know (or don’t use), to
later <a href="http://sirupsen.com/airtable/">process them into my vocabulary</a>.</p>
<figure><img src="/images/patty-quote.png" alt="Pasted image" width="1646" height="820" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>2. Skimming and skipping.</strong> I make fairly liberal use of skimming and
skipping, especially in non-fiction where not every chapter will have an
equivalent impact for me. Skimming the first and last few pages of a chapter
often gives you a great idea about whether the chapter is worth reading for you.
For example, years ago I went to Brazil, and before going I wanted to read a
short book about the history and culture of the country. There were 3 chapters
about sports in Brazil, something I wasn’t interested in. I got the gist of it
from the first and last few pages and simply skipped. When I read Principles, I
skipped the biography and went straight to the principles, deciding I’d read the
biography chapter later if the principles were interesting enough. It felt oddly
liberating when I realized there’s no book police that’ll come knocking on your
door when you skip a chapter.</p>
<p><strong>3. Visualizing.</strong>  Ever since reading <a href="https://www.goodreads.com/review/show/2221032060">Moonwalking with Einstein</a> I’ve incorporated
memory palaces into more aspects of my life. I’ve experimented with summarizing
a book as I go in a memory palace, and this worked out quite well. It
meant that it was easier for me to remember the book overall. Memory palaces
aren’t just about being able to memorize a list, but also a concrete way to
connect key points into your wetware. What I found surprising was that when
something would remind me of the points from a book I’ve built a palace for,
I’m thrown right into the memory edifice to connect it. While in the palace, I
find that I will often spend time going backwards and forwards and re-iterate
the other concepts—a form of spaced repetition. There’s still more to explore
here, but there’s certainly something to it. Think of it like when you read a
novel, you’re always visualizing what’s going on. The more effort you put into
this, the easier the novel is to remember. The longer you keep the effort up, the
easier it gets to create more and more elaborate images over time. I haven’t
been as diligent with this practice for the past few books, but I plan to
continue to experiment with it.</p>
<p><strong>4. Metaphors and relations.</strong> This relates back to visualization; anything you
can do to make a book more vivid helps. If you can relate concepts from the book
to something else, it does wonders. A while ago, it felt overdue to gain a
technical understanding of how simple Blockchains work. A friend asked me to
explain it to him, and we constantly related each concept back to concepts and
metaphors we already understood. In about an hour he gained a deep enough
understanding that he could go explain it to someone else, in quite elaborate
technical detail. I attribute that to relating everything to a real-life
metaphor, e.g. ‘hashing’ in cryptography was conceptualized as akin to a fire
turning into ash; impossible to reverse, and the slightest adjustment in initial
conditions would make the configuration of ashes different. One of the most
important relations I find is to attempt to see if the concept would’ve made a
scenario in your life play out differently, had you known it. I like to think of
each past event having <code>n</code> lessons you can extract out of it. It’s important to
not leave <strong>any</strong> lessons on the table, and to suck these experiences dry—you
need to revisit them for decades to come. It’s a bit like a machine learning
algorithm (it’s actually exactly like a machine learning algorithm, which, of
course, is inspired by humans). You’re constantly adding to the algorithm with
new mental models and an enriched understanding of the world. When you’ve
changed the algorithm, you need to re-train it on your dataset consisting of
your collected experience.</p>
<figure><img src="/images/fire.jpg" alt="" width="900" height="814" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>5. Summarize every chapter in your head.</strong> I don’t remember where I read or
heard this, but someone said that one of the best pieces of advice they’d ever
gotten was that every time they’d leave a room, they should stop at the door and
summarize to themselves what just happened. What did you just learn? What just
happened in that meeting? What was on that person’s mind? When I finish a
chapter in a book, I try to quickly summarize it in my head. If I’m building a
palace for the book, I’ll attempt to make up an image and plant it. This is
often surprisingly hard, but I’ve noticed improvements as a result. It’s like
the end of a (good) meeting, where someone will summarize all the actions and
outcomes. Ever been to one where that doesn’t happen? It can feel like a waste
of time.</p>
<p><strong>6. Re-read.</strong> The best books I will try to read again. I’ve done it so far for
perhaps half a dozen books and it’s been rewarding every time. In general, I
think we can treat the best books and articles more like music playlists.
Reading them again and again, with sufficient spacing in between to make them
relevant and fresh anew. For articles, I have a script that’ll feed them back to
me on a spaced repetition schedule automatically in Instapaper. I wrote more
about this <a href="https://sirupsen.com/playlists/">here</a>.</p>
<h3 id="what-would-i-like-to-improve-about-reading">What would I like to improve about reading?</h3>
<p>My retention here is still not quite as good as I would like, although I think a
fair bit of that comes from the processing (next section). I would like to more
diligently build palaces. I haven’t done it for the past 5-10 books I’ve read,
but the ones I did build I’ve found myself going back to more often than not. I don’t
take as many notes on my highlights as I’d like to. I think more focus on these
two will make the biggest difference currently, because they’ll both benefit the
processing stage.</p>
<p>I dream of the day where I can see the highlights of friends. This would be a
fantastic opportunity to start interesting conversations with people and build a
deeper understanding of the book while feeling much less forced than a book
club.</p>
<h3 id="what-are-changes-ive-made-to-reading">What are changes I’ve made to reading?</h3>
<p>My reading process has been fairly additive. I’ve mostly added more and more
structure to the way I read; any extra effort I put in here to twist and turn
the points made ends up being better than <em>not</em> doing it. The fear here is doing
<em>too</em> much. As mentioned in the processing stage, to simplify, I will need to
figure out what works and what doesn’t.</p>
<h2 id="processing">Processing</h2>
<p>Reading, to me, is worth the most if I can remember the ideas. I don’t think you
will always be able to map an idea back to its source; just because you
can’t summarize Thinking, Fast and Slow eloquently doesn’t mean it didn’t
influence you.</p>
<blockquote>
<p>Reading and experience train your model of the world. And even if you forget the experience or what you read, its effect on your model of the world persists. Your mind is like a compiled program you’ve lost the source of. It works, but you don’t know why. – <a href="http://www.paulgraham.com/know.html">Paul Graham</a></p>
</blockquote>
<p>It’s a cliché to complain about the length of books: “This idea could be
explained in five pages! Why would they write an entire book?” This statement
bothers me to no end. If you possess the discipline it takes to reliably
incorporate an idea into your wetware from something article-length, then
you’ve got discipline that you would <strong>not</strong> self-discount with a blanket statement like
that. No-one I’ve talked to who reads 10s of books a year, and has done so for
years, would dream of saying this. They understand that reading is not just
about passing words through your head.</p>
<p>Then why are books long? I’ll gently navigate around the “publishers require it
to be 200+ pages” conspiracy, and instead focus on two points. First, it’s a
form of spaced repetition, a wonderful, proven technique that can be applied
to almost every corner of your life. It turns out, if a book is 200 pages, it’s
going to take you a few spaced repetition cycles to read it, which raises the
probability it’ll stick for you. Unless you are diligent about repetition, my
pet theory is that most things that stick are somewhat random. You hear
something today, and then in the next spaced repetition window a few days from
now; you hear about it again. Then a week or so after that. If you consider how
many new things we hear every day, I don’t think this is so crazy. Especially
given how hyper-aware our brain is for these things, it <em>wants</em> to recognize
them. I’ve noticed this is how most new English words transition from a
spreadsheet to my real, active vocabulary. There’s a hint of random in there.</p>
<p>The second reason books are long, is that different ways of explaining an idea
resonate with different people. For you, it may be that antifragility is best
explained through a fitness analogy; you break down muscle, build them back up,
ta-daa you are now stronger. For the foodie who makes an annual pilgrimage to
New York, antifragility may draw the most connections (and thus stick best) when
applied to why the ramen seems better <em>every</em> time you go back. Remembering an
idea is some combination of the number of connections you can draw and spaced
repetition. Anecdotally, I’ve observed that I remember new information in the
space of software well. I can usually connect it to half a dozen things fairly
quickly, which makes it hard to forget. If you tell me something I don’t know
about the state of Crude Oil, I have little to connect it with and most likely I
will not remember it tomorrow unless I put in more effort: spaced repetition, or
asking enough questions that half a dozen connections start appearing. But
that’s work.</p>
<p>Turns out forming new memories <em>needs</em> to be hard. Otherwise, how is your
brain to know what to remember and what not to? Imagine if every time you looked
at a dining table, every single memory <em>ever</em> that had to do with a table was
readily available. That’d be pretty uncomfortable. (The eyes with the cupcake on
top below are my poor imitation of the exploding head emoji: 🤯)</p>
<figure><img src="/images/o_KP7oCyzXw2ASahKC6kUUZI.png" alt="Scan Jul 11, 2018, 07.41.jpg" width="1410" height="1000" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>Here are some of the steps I take after reading a book, which I’ve been doing for a while.</p>
<p><strong>1. Writing a review/summary.</strong> A few weeks after reading a book, typically
I’ll write a short summary and review and publish it on Goodreads
(<a href="https://www.goodreads.com/review/show/2417256899?book_show_action=false">example</a>).
This forces me to extract the key lessons from the book. Typically, I’ll use my
highlights from <a href="http://readwise.io/">Readwise.io</a> to assist in extracting the
key lessons from the book and throw them into the summary. You can see all my
reviews on my <a href="https://www.goodreads.com/user/show/38623347-simon-eskildsen">Goodreads profile</a>.</p>
<figure><img src="/images/-8pAR7XHj8Cu_QSeiARS_IWC.png" alt="Pasted image" width="1408" height="1000" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>2. Converting highlights to index cards.</strong> Either at the same time as doing
the review/summary or later, I will go through my highlights and find the ones
I like most. Often, I end up spending hours (typically on a Saturday or Sunday
morning) going into rabbit holes as part of polishing my highlights. This is
fine, if they’re interesting, it helps me to build connections and stick them in
long-term memory. For the best points in the books, often a combination of
highlights and themes, I’ll create a <strong>physical</strong> index card. I try as much as
possible to draw on the card and think of references to other books.</p>
<figure><img src="/images/index-card.png" alt="" width="1648" height="1000" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p><strong>3. Reviewing index cards.</strong> I have two containers for my index cards. One with
index cards that have been processed at least once (left) and one for cards that
have yet to be processed (right). As you can see, the top card in the left box
is the one that was most recently reviewed (2nd of July, 2018) and the card on
top of the right box hasn’t been reviewed yet (only one date). As you see on the
card above, and the card below, there are little symbols under the date. These
symbols have special meanings for what I did with the card at the same time. I
have a dozen or so symbols to experiment with what works best for retention over
time. <code>W</code> below means that I wrote at least 200 words about the content of the
card, attempting to draw new connections and elaborate on the idea. <code>R</code> is
followed by a number and rates how much I’ve applied this idea since last time.
<code>U</code> followed by a number is how useful this idea is, on a scale from 1 to 7.
Long-term, these numbers are meant to inform a better sorting algorithm: if there are two
cards I can review now, I’d prefer the one with a low <code>R</code> value (not applied
yet), a high <code>U</code> value (very useful), and where a long time has passed since last
reviewed. I may digitize this at some point (I’m terrified of losing these
cards), but this has worked well so far. Again, as with (2: Choosing), I think I
can beat randomness and sorting by date by at least 10%, which is a significant
improvement over the long-term. However, I’ll need some data first. Below, you
can see a full list of my symbols. Some are now deprecated, but many I continue
to use.</p>
<figure><img src="/images/d-bxD_1EcycjhEoP_zAV8cf-.png" alt="Pasted image" width="1241" height="1000" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
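<p>The sorting heuristic described above can be sketched in a few lines. The weights and field names below are my own assumptions for illustration; the post only specifies the direction of each factor (low <code>R</code>, high <code>U</code>, long since last review):</p>

```python
from datetime import date

# Hypothetical sketch of the card-sorting heuristic: prefer cards with a low R
# (not yet applied), a high U (very useful), and a long time since the last
# review. The weights (30 days per scale point) are assumptions, not tuned.
def priority(card, today=date(2018, 7, 11)):
    days_since = (today - card["last_reviewed"]).days
    return days_since + 30 * card["U"] - 30 * card["R"]

cards = [
    {"id": "antifragility", "last_reviewed": date(2018, 7, 2), "R": 1, "U": 6},
    {"id": "survivorship bias", "last_reviewed": date(2017, 9, 1), "R": 5, "U": 4},
]

# Review the highest-priority card first.
cards.sort(key=priority, reverse=True)
```

<p>Once the <code>R</code> and <code>U</code> data accumulates, the weights could be fit to whatever actually predicts retention, instead of being guessed.</p>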
<p>When I travel, I usually bring the box of unprocessed cards with me and spend
some time reflecting on those cards. Some call this a “Commonplace Book”; i.e. a
book with all the best snippets from everywhere. Why index cards and not a
notebook? Well, notebooks can only grow so much in size, and are hard to change
without becoming messy. Often, I’ll tear cards apart on a second review,
re-write them for more clarity, and backfill the dates. I can sort them however I
want, which is difficult in a notebook. Airtable would be a fantastic candidate for the
Commonplace book, but the physical aspect currently intrigues me.</p>
<figure><img src="/images/v5NYGhseS7cUet1VPnFwybMU.png" alt="IMG_0960.JPG" width="1333" height="1000" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>If you’re after something similar, <a href="https://readwise.io">Readwise</a> has a great
feature to send you some of your highlights every day. Takes minutes to set up
if you’re already using a Kindle.</p>
<p><strong>4. Listening to Podcasts with the author.</strong> After a book, I often find myself
with a slew of questions I wish I could ask the author. That’s exactly why they
get invited to various Podcasts (if they’re alive). With <a href="https://www.listennotes.com/">Podcast search
engines</a> it’s easy to find a Podcast with the
author. The show notes will often reveal what types of questions the interviewer
is going after.</p>
<h3 id="what-would-i-like-to-improve-about-processing">What would I like to improve about processing?</h3>
<p>As mentioned, I may need a new home for these nuggets instead of index cards.
It’s tough to sort them properly, so currently it’s a simple queue based on last
review date. I am about a year behind (i.e. I review cards now that I wrote about a
year ago), because I typically produce cards faster than I can process them. For the
time being, I’m OK with it. I destroy a lot of cards when I review that are not
relevant to me, or I think are covered by something else. I’ve scoured
them quite a few times to try to find something I was sure I had on a card—this
is a frustrating experience. I just don’t have the perfect software for it yet,
and I worry a lot about putting this somewhere and having to convert it around.
To some extent, this has become my most prized possession in that it’s
impossible for me to replace.</p>
<p>Going forward, I’ll likely digitize them to make them searchable. A year or two
from now, I’m going to go through them and review the <code>R</code> and <code>U</code> scores and
correlations with other symbols to find out what works, and what doesn’t. Based
on this, I will create a sorting algorithm for the digitized index cards. Again,
the software in this space is lacking, so it may be a fancy use of Airtable if
nothing better exists by that time.</p>
<h3 id="what-are-changes-ive-made-to-processing">What are changes I’ve made to processing?</h3>
<p>This is the step I’ve invested the most in over the past few years because I feel
this is where the most impact is had. In general, I think that people should
spend 50-60% of their time in this stage over all others. Most spend the
majority of their time in reading. I’ve come to many great realizations writing
about cards and applying them to my life and current situations. My past self
can recognize an idea as useful, recognize that there’s no immediate
application for it, transcribe it to a card, and hope it pops up at a better time.
This setup positions me to increase the probability I get the right idea at the
right time: when it’s most likely to be applied.</p>
<p>Overall, I have not made many changes here other than gradually adding to this
system. I hope in a few years to go through the data on the cards and the
ratings, to figure out which methods work best for retention. Writing? Flash
cards? Memory palaces? Talking to a friend?</p>
<h2 id="future">Future</h2>
<p>I will continue to iterate on this, likely, for the rest of my life. I think
everyone deserves a good reading system. It takes years to build one; you can’t
start out with this, or any other, system—you need to gradually build it over
time. The reading habit comes first; then you start paying more attention
to what you read, you start highlighting, you start taking notes, you start
writing summaries, and slowly a complex system that works for you will evolve
and evolve. I hope this can inspire you to invest more in your reading process.</p>
<p>For book recommendations, see <a href="https://www.goodreads.com/user/show/38623347-simon-eskildsen">my Goodreads profile</a>
especially my <a href="https://www.goodreads.com/review/list/38623347?shelf=reread"><code>reread</code> shelf</a>.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Media Playlists]]></title>
        <id>https://sirupsen.com/playlists</id>
        <link href="https://sirupsen.com/playlists"/>
        <updated>2018-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[We have playlists for our favorite music, but don’t re-consume great information
nearly enough. Almost certainly you’ve once watched a documentary (or read a
book) about the environment, after which you
ponder how to reduce your footprint: an electric car, eating less
meat,
or <a href="http://thec]]></summary>
        <content type="html"><![CDATA[<p>We have playlists for our favorite music, but don’t re-consume great information
nearly enough. Almost certainly you’ve once watched a documentary (or read a
book) about <a href="https://www.beforetheflood.com/">the environment</a>, after which you
ponder how to reduce your footprint: an electric car, <a href="https://en.wikipedia.org/wiki/Environmental_impact_of_meat_production">eating less
meat</a>,
or <a href="http://theconversation.com/airline-emissions-and-the-case-for-a-carbon-tax-on-flight-tickets-56598">voluntarily paid carbon
tax</a>
on your air-travel emissions. Then, after a few weeks, the effects mostly fade,
and you gradually return to baseline…</p>
<p>This cycle of a bee entering your bonnet for a short period, only for another
bee to take its place, is ineffective. We pick up gems from conversations,
articles, books, and videos, only to use them for a few days or weeks. Most
things we learn, we forget, unless our environment strongly nudges us to
consider those ideas repeatedly. However, most ideas don’t leap from medium-term
memory into long-term principles. How can we increase our odds of compounding
ideas on top of each other, instead of leap-frogging between new ones?</p>
<p><a href="https://en.wikipedia.org/wiki/Spaced_repetition">Spaced repetition</a> is the
simple idea that the probability of remembering an idea for the long-term
increases dramatically if we’re reminded at an intentional, exponential
schedule. We might discover that the effect where we learn a new
word and start noticing it everywhere is called the ‘frequency illusion.’ To not
forget this, we make sure we’re exposed to this piece of information a few days
from now, then a week after that, two weeks after that, then a month, three
months, and then every six months from there. Spaced repetition is a
well-studied effect, and many (including myself) have had <a href="http://sirupsen.com/airtable">success with this
through flash-cards</a>. We expose ourselves to the
piece of information <em>just</em> before we would forget it, refreshing the memory.</p>
<p>However, the effect doesn’t need to be constrained to fun facts on flashcards.
It can be deep, complex ideas as well. Ideas or ways of thinking that we
incorporate deeper, and deeper into our wetware with each successive
re-consumption of an article, book, or video on some schedule. In the past year,
I’ve been interested in exposing myself to an increasing amount of spaced
repetition outside of flashcards.</p>
<p><a href="http://readwise.io/">Readwise</a> helps me by re-surfacing highlights from my
Kindle and Instapaper.  Quite a few times reading through the daily digest from
Readwise, a highlight came at just the right time to implement it that day or
sparked new connections to form more connected memories. My pet theory is that
the truly useful ideas that make it from books to our life principles are the
ones that strike us at <em>just</em> the right time where we needed that idea. Through
spaced repetition, we increase that probability dramatically.</p>
<p>In general, the more well-connected an idea in your head, the higher the
likelihood that it surfaces at the right time. To me, the definition of a useful
idea is one that’s readily available when you need it. It is hard work, and
takes time, to mold the neural connections to elevate an idea to this status. A
hundred time-tested ideas stored in this fashion are worth a thousand times more
than 10,000 that enter and leave rapidly.</p>
<p>For example, a few months ago, a highlight about <a href="https://en.wikipedia.org/wiki/Survivorship_bias">survivorship
bias</a> came up. This cognitive
bias points out that we don’t adequately value the information <em>not</em> present. We
may be inclined to say that ‘old buildings are more beautiful’ when in fact,
when you think about it, only the beautiful old buildings survive. The ugly ones
are torn down, and new ones will take their place. This idea came up in my
Readwise digest as I was walking to work, at just the right time. It was highly
applicable to a problem we were working through on the team. As a result, I now
see survivorship bias everywhere I look. It feels like that one, deep
application made an order of magnitude more neurons connect than anything I’d
done previously.</p>
<p>While flash cards and Readwise have been helpful, they don’t solve the problem
for me of content that requires more deliberation: a video, an article, or an entire
book. For the first two, a few months ago I built a script that will re-surface
articles or videos saved in Instapaper on a spaced repetition schedule. For
example, I liked <a href="http://www.collaborativefund.com/blog/expectations-vs-forecasts/">this article about Expectations vs
Forecasts</a> in
my Instapaper and archived it. A week later, it came up on top of my to-read
list again. Then a month after that.  I’ll see it again in another few months,
for it to finally only be read every 6 months. This creates a ‘playlist’ of
great articles, with new articles coming up once in a while too. Spending more
time on a few great articles is providing me more value than trying to read
everything. I now mostly skim articles on the first read. If it’s interesting,
I’ll ‘like’ it and go in more depth the second time. I’m finding myself taking
more notes and highlights each time it pops up again. I add videos to Instapaper
too, to recycle the same system.</p>
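<p>The script itself isn’t reproduced in the post, but the core idea can be sketched in a few lines. Everything below (field names, the exact interval values) is an assumption for illustration, derived from the schedule described in the paragraph above:</p>

```python
from datetime import date, timedelta

# Hypothetical sketch of re-surfacing archived articles on a spaced repetition
# schedule: a week after the first read, a month after the second, then every
# six months. Intervals are in days and assumed from the prose, not tuned.
INTERVALS = [7, 30, 180]

def due_articles(archive, today):
    due = []
    for article in archive:
        days = INTERVALS[min(article["reads"] - 1, len(INTERVALS) - 1)]
        # Re-surface once the interval for this read count has elapsed.
        if article["last_read"] + timedelta(days=days) <= today:
            due.append(article["title"])
    return due

archive = [
    {"title": "Expectations vs Forecasts", "last_read": date(2018, 4, 1), "reads": 1},
    {"title": "Expiring vs LT Knowledge", "last_read": date(2018, 5, 30), "reads": 2},
]
```

<p>A real version would fetch the archive folder via the Instapaper API and move due items back to the top of the to-read list; the structure of that loop stays the same.</p>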
<p>While this is good, I hope that the next generation of read-it-later services
will build spaced repetition straight into their core product. I hope they’ll
help with heuristics on when to re-read the old, and when to learn the new. Perhaps treat
the inbox not as a stack, where what I just added comes up on top, but as a queue, where what I
added months or years ago comes next. This helps avoid the cycle of spending the
majority of your time <a href="http://www.collaborativefund.com/blog/expiring-vs-lt-knowledge/">consuming media that expires
rapidly</a>.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Positive Unknown-Unknowns]]></title>
        <id>https://sirupsen.com/unk-unk</id>
        <link href="https://sirupsen.com/unk-unk"/>
        <updated>2018-03-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[When we make decisions, it’s useful to be cognizant of unknown-unknowns. Almost
in every case, we think about unknown-unknowns in a negative sense. If we’re
venturing into unknown territory, we accept that it’s likely we’ll stumble upon
Black
Swans:
improbable events that throw a wrench into our plans. Typically, we’ll draw on
our experience to take the path we figure has the fewest negative
unknown-u]]></summary>
        <content type="html"><![CDATA[<p>When we make decisions, it’s useful to be cognizant of unknown-unknowns. Almost
in every case, we think about unknown-unknowns in a negative sense. If we’re
venturing into unknown territory, we accept that it’s likely we’ll stumble upon
<a href="https://www.amazon.ca/Black-Swan-Improbable-Robustness-Fragility/dp/081297381X">Black
Swans</a>:
improbable events that throw a wrench into our plans. Typically, we’ll draw on
our experience to take the path we figure has the fewest negative
unknown-unknowns. We may choose to stretch something we already know instead of
adopting something new. Brooding on negative unknown-unknowns is extremely
useful, and fairly commonplace.</p>
<p>I think it’s equally useful to invert the traditional thinking about
unknown-unknowns and ask ourselves: How many <em>positive</em> unknown-unknowns might
we face with this option? Might we face more positive black swans than
negative ones? In effect, what would give us the most positive optionality?</p>
<p>When making decisions, we weigh most strongly the first-order effects. We’re not
taught to <a href="https://www.fs.blog/2016/04/second-level-thinking/">systematically think through the second- and third-order
effects</a>. As we get further
away from first-order effects, our ability to predict effects decreases
exponentially. There’s a higher chance that we’ve missed second-order effects
than first-order effects. These missed effects are what we call
unknown-unknowns. There are too many variables to keep track of and the
interactions between them, while governed by simple rules, become unmanageable
to the human brain. You can attempt to combat this with expertise, but you must
face that you won’t catch them all.</p>
<figure><img src="/images/unk-unk.png" alt="" width="2100" height="1275" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>An example might help. Consider the Internet, which had a fairly niche purpose
at first. Yet, it seemed to many that connecting the planet would be a good
idea. There’s no way that those connecting the globe could’ve anticipated the
number of positive unknown-unknown ramifications of the Internet. What they did
project, however, was that the space of unknown-unknown positives for the
Internet was enormous.</p>
<p>Similarly, if we look at cryptocurrencies today, people are smitten with the
potential for the positive unknown-unknowns (and others by greed). What the Internet,
cryptocurrencies, and the printing press have in common is that they’re
foundational platforms with an enormous surface area for positive
unknown-unknowns.</p>
<p>I’ve seen positive unknown-unknowns numerous times when people build platforms.
Someone builds something great and simultaneously takes the time to solve the
problem one layer deeper than they otherwise might have. They sense the
potential in increasing the probability of positive unknown-unknowns, by
supplying the vision of a platform. Internally, two years ago we had a
<a href="http://sirupsen.com/podcast">single employees-only podcast</a>. Today, we have
around ten, ranging from training and interviews about building internal
products to history lessons about the company from our executives. When
it was clear that there was an internal podcast <em>platform</em>, it exploded. The
first podcast went one level deeper to provide a platform, increasing the
surface area for positive unknown-unknowns.</p>
<p>We will have to remain humble to the fact that often we can’t predict all
effects, positive and negative. We can attempt to reason about their size, but
we won’t know for sure. There’s an old Taoist fable that we can interpret as a
story about unknown-unknown second- and third-order effects:</p>
<blockquote>
<p>“When an old farmer’s stallion wins a prize at a country show, his neighbour
calls round to congratulate him, but the old farmer says, “Who knows what is
good and what is bad?”</p>
<p>The next day some thieves come and steal his valuable animal. His neighbour
comes to commiserate with him, but the old man replies, “Who knows what is
good and what is bad?”</p>
<p>A few days later the spirited stallion escapes from the thieves and joins a
herd of wild mares, leading them back to the farm. The neighbour calls to
share the farmer’s joy, but the farmer says, “Who knows what is good and what
is bad?”</p>
<p>The following day, while trying to break in one of the mares, the farmer’s son
is thrown and fractures his leg. The neighbour calls to share the farmer’s
sorrow, but the old man’s attitude remains the same as before.</p>
<p>The following week the army passes by, forcibly conscripting soldiers for the
war, but they do not take the farmer’s son because he cannot walk. The
neighbour thinks to himself, “Who knows what is good and what is bad?” and
realises that the old farmer must be a Taoist sage.”</p>
</blockquote>
<p>It is tempting to believe at any of the critical points in this story that you
know what will happen next with certainty. With the most prized stallion in the
land, riches await! Or, when stolen, that you’ll never see it again. While the
series of events in this story seem <em>highly</em> unlikely, it teaches us that
effects will happen that we could never have imagined. The sum of the
probabilities of unknown-unknowns may outweigh the knowns.</p>
<p>You may be looking at two options for a decision that seem equally good. Have
you considered which one has larger optionality long-term? Third-order effects
that you could by no means predict? With a small modification, could you
increase the surface area for unknown-unknown positives? Can you expose even a
fraction of a platform?</p>
<p>Considering positive unknown-unknowns has changed my mind quite a few times in
the past year. Contemplating optionality is <em>not</em> about making decisions based
on hope. It is one of many mental models in your arsenal to improve your
decisions. Each model gives you a new vantage point to see the problem from to
help you come to a better decision.</p>]]></content>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
    <entry>
        <title type="html"><![CDATA[Peak Complexity]]></title>
        <id>https://sirupsen.com/peak-complexity</id>
        <link href="https://sirupsen.com/peak-complexity"/>
        <updated>2018-02-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[With the teams I work with, we operate with the idea of peak complexity: the
time at which a project reaches its highest complexity. Peak complexity has
proved a useful mental model to us for reasoning about complexity. It helps
inform decisions about when to step back and refactor, how many people should be
working on the project at a given point in time, and how we should structure the
project.
What we find is that to make something simpler, we typically have to raise the
co]]></summary>
        <content type="html"><![CDATA[<p>With the teams I work with, we operate with the idea of <em>peak complexity</em>: the
time at which a project reaches its highest complexity. Peak complexity has
proved a useful mental model to us for reasoning about complexity. It helps
inform decisions about when to step back and refactor, how many people should be
working on the project at a given point in time, and how we should structure the
project.</p>
<p>What we find is that to make something simpler, we typically have to raise the
complexity momentarily. If you want to organize a messy closet, you take out
everything and arrange it on the floor. When all your winter coats, toques, and
spare umbrellas are laid out beneath you, you’re at peak complexity. The state
of your house is <em>worse</em> than it was before you started. We accept this step as
necessary to organize. Only when it’s all laid out can you decide what goes back
in and what doesn’t, ultimately lowering the complexity from the initial point.</p>
<figure><img src="/images/peak-complexity.png" alt="" width="530" height="364" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>When you’re cleaning your house, you do this one messy place at a time: the
bedroom closet, then the attic, and lastly, the dreaded basement. Doing it all
at once would be utter mayhem: costumes, stamp collections, coats, and Lego sets
everywhere. We manage our series of peak-complexity points by tackling one messy
floor-patch at a time.</p>
<p>This model works for software, too. As we embark on a complex project, we need
to consider the pending complexity peak(s). It’s completely okay to add
complexity along the journey; sometimes you need to momentarily trade technical
debt for speed. But it’s also part of the job to manage your complexity budget.
Be honest with your team about where you reside on the curve. The more
complexity you add, the harder it is to onboard new members to the team.
Typically, your bus factor shrinks, because only a few people can hold this
complexity in their head at a time. With high complexity, the probability of
error increases non-linearly. It’s prudent to review your project’s inflection
points and structure it to have many small peaks. This avoids creating a
Complexity Everest. A big mountain is tough to climb. It gets exponentially
harder the closer you get to the top as oxygen levels decrease, wind increases,
temperature drops, and willpower depletes. That’s why you want to structure your
project into hills that deliver value every step of the way: day-time hikes with
picnic baskets. Sometimes, the inevitable mountain appears—and that’s okay, but
be realistic about what it means to the project.</p>
<figure><img src="/images/peak-complexity-smaller.png" alt="" width="530" height="364" loading="lazy" decoding="async" style="max-width:100%;height:auto"/></figure>
<p>The worst thing you can do is build a complexity mountain and not harvest the
simplicity gains on the other side. The descent may require a smaller team and
take less time than it took to climb, but is incredibly important work. As I’ve
written about before, the more you can <a href="/drafts">simplify the mental model of the
software</a>, the more leverage you build. If you fail to recognize peak
complexity and descend, you may be stranded there. This is how you end up
supporting your project forever. It’s also worth noting that peak complexity
isn’t the only peak a project has; there are other resources you can trade for
speed in the short term:</p>
<ul>
<li><strong>Peak Toil.</strong> You trade manual operations/lack of automation for getting
the first iteration of the project shipped sooner. Just as with peak
complexity, it’ll catch up to you.</li>
<li><strong>Peak Money/Cost.</strong> Money is another resource you can often trade for speed, e.g.
by leaving optimization to after the initial version has shipped.</li>
<li><strong>Peak People.</strong> This is the point in time where your project has the most
staff assigned to it; as the project moves into later phases of its
life-cycle, it’ll most likely have fewer people assigned to it, since other
projects need them once the initial version is out. On some projects, again,
you can trade people for speed. An opportunity cost comes with that, of
course.</li>
<li><strong>Peak Stress/Work.</strong> People can sprint to reach some short-term target, but
if you don’t allow them to rest, your people will lose trust in you, get
tired, and will shorten their timescale for decisions.</li>
<li><strong>Peak Sluggishness.</strong> For many projects, you can solve performance later to
get the first iteration out quicker, too. It may be that it’s not worth
solving some algorithmic or data storage problem until you’ve proved that
it’s something customers want.</li>
</ul>
<p>As a lead or project manager, I think it’s your responsibility to be aware of
these peaks when trading the amplitude of a peak for speed on the project. If
you push too many peaks too high, your project will go through a tough
period and may fail for reasons unrelated to the problem it set out to solve.</p>
        <author>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </author>
        <contributor>
            <name>Simon Eskildsen</name>
            <email>simon@sirupsen.com</email>
            <uri>https://twitter.com/sirupsen</uri>
        </contributor>
    </entry>
</feed>