Tom Moertel's Blog

When Google Takeout Fails

2025-08-27T00:00:00Z

Posted on August 27, 2025

When Google Takeout fails, you’re basically on your own. Here are some quick tips for getting your data when things go wrong.

Oh, you tried to download your precious data from a bunch of Google products and days later you got an email with the subject “We couldn’t create a copy of your Google data”? Surely, for something as important as this, Google is going to Respect the User and provide meaningful help and options for moving forward.

Joke’s on you. Because what Google actually provides is nothing. Seriously, the email says only that “Something went wrong and we couldn’t create a copy of your Google data. You can try to create a new export.”

Well, that’s helpful.

What else can we try? It turns out that you can glean some information by going to takeout.google.com/manage and clicking on the failing export to reveal this screen:

Which of the 22 products failed? Who the f— knows. Google isn’t saying.

What? Did you think that clicking on the “22 products” link would tell you something useful about the failure? Nope. It just lists all 22 products that were in the failed attempt. What about the “Learn more” link? It just takes you to the generic help for Takeout failures, which again boils down to “try again.”

If you want to figure out which products you can’t get your data from, you’ll have to do it the old fashioned way: Binary Search.

I know, I know. It’s absolutely, mind bogglingly absurd that you should have to waste days making repeated Takeout attempts to identify what products are failing you. All because Google’s leadership doesn’t prioritize Takeout—or users—enough to make sure that Takeout actually works reliably.

(Note that I am criticizing Google’s leadership, not the heroic folks within Google that do yeoman’s work on Takeout, despite the lack of prioritization.)

Fortunately, there is a trick to make the binary search slightly easier. If you click the red “Start new request” button on this particular error screen, the UI for starting a new request will be limited to just the products that you tried to include in the failed attempt. (As far as I can tell, this semi-helpful feature is completely undocumented.) That means you can run the binary search as follows:

When a Takeout request fails, go to takeout.google.com/manage, find the failed attempt, and click on it. You will see a screen like the one above.
Now, right click the “Start new request” button and select “Open Link in New Tab” in your web browser. In that new tab, configure a new Takeout request but, crucially, uncheck all but the first half of the listed products. Submit the request and close the tab.
That will take you back to the previous tab. Click the “Start new request” button again. This time, uncheck all but the second half of the listed products. Submit the request.
Now wait. Hours. Maybe days.
If all the Takeout requests succeed, great, you’re done.
Otherwise, for each failing attempt, go back to Step 1 and split that failed attempt into two half-sized attempts by repeating the procedure above.
Eventually, you will narrow the failures down to the individual products that you can’t get your data from. For these products, you can try changing the Takeout settings (if offered) or try to escalate to someone you know within Google who might help. If you’re famous, you can try the Twitter route.

Good luck!

Sampling with SQL

2024-08-23T00:00:00Z

By Tom Moertel

Posted on August 23, 2024

Tags: sampling, sql, probability, poisson process, exponential distribution, statistics

Sampling is one of the most powerful tools you can wield to extract meaning from large datasets. It lets you reduce a massive pile of data into a small yet representative dataset that’s fast and easy to use.

If you know how to take samples using SQL, the ubiquitous query language, you’ll be able to take samples anywhere. No dataset will be beyond your reach!

In this post, we’ll look at some clever algorithms for taking samples. These algorithms are fast and easily translated into SQL.

First, however, I’ll note that many database systems have some built-in support for taking samples. For example, some SQL dialects support a TABLESAMPLE clause. If your system has built-in support—and it does what you need—using it will usually be your best option.

Often, though, the built-in support is limited to simple cases. Let’s consider some realistic scenarios that are more challenging:

We want to be able to take samples with and without replacement.
We want to take weighted samples in which each item in the input dataset is selected with probability in proportion to its corresponding weight.
We want to support the full range of weights we might expect to see in a FAANG-sized dataset, say between \(0\) to \(10^{20}\) for frequency distributions (e.g., clicks or impressions or RPC events) and between \(0\) to \(1\) with values as small as \(10^{-20}\) for normalized probability distributions. In other words, weights are non-negative numbers, possibly very large or very small.
We want to take deterministic samples. This property lets us take repeatable samples and, in some cases, helps query planners produce faster queries.

Sampling without replacement: the A-ES algorithm in SQL

In 2006, Pavlos S. Efraimidis and Paul G. Spirakis published a one-pass algorithm for drawing a weighted random sample, without replacement, from a population of weighted items. It’s quite simple:

Given a population \(V\) indexed by \(i = 1\ldots{}n\) and having weights \(w_i\):

For each \(v_i\) in \(V,\) let \(u_i = \mathrm{random}(0, 1)\) and \(k_i = u_i^{1/w_i}\).
Select the \(m\) items with the largest keys \(k_i\).

That algorithm has a straightforward implementation in SQL:

SELECT *
FROM Population
WHERE weight > 0
ORDER BY -LN(1.0 - RANDOM()) / weight
LIMIT 100  -- Sample size.

You’ll note that we changed the ordering logic a bit. A straight translation would have been

ORDER BY POW(RANDOM(), 1.0 / weight) DESC

Our translation

ORDER BY -LN(1.0 - RANDOM()) / weight

is more numerically stable and also helps to show the connection between the algorithm and the fascinating theory of Poisson processes. This connection makes it easier to understand how the algorithm works. More on that in a moment.

Numerical stability and other tweaks

First, the numerical stability claim. Assume that our SQL system uses IEEE double-precision floating-point values under the hood. When the weights are large, say on the order of \(w_i = 10^{17}\), it doesn’t matter what the random value \(u_i\) is. The corresponding sort key \(k_i = u_i^{1/w_i}\) will almost always be \(1.0\). Consider the interval \(0.01 \le u_i \le 1\), representing 99% of the possible random values \(u_i\). This entire interval gets mapped to \(1.0\) when \(w_i = 10^{17}\):

# Python.
>>> w_i = 1e17
>>> math.pow(0.01, 1.0/w_i) == math.pow(1.0, 1.0/w_i) == 1.0
True

Likewise, when weights are small, say \(w_i = 10^{-17}\), the corresponding sort key will almost always be zero. Consider the interval \(0 \le u_i \le 0.99\), representing 99% of the possible random values \(u_i\):

>>> w_i = 1e-17
>>> math.pow(0.0, 1.0/w_i) == math.pow(0.99, 1.0/w_i) == 0.0
True

For very large (or small) weights, then, the straightforward implementation doesn’t work. The wanted sort ordering is destroyed when very large (or small) powers cause what should be distinct sort keys to collapse into indistinguishable fixed points.

Fortunately, logarithms are order-preserving transformations, so sorting by \(\ln(u_i^{1/w_i}) = \ln(u_i) / w_i\) produces the same ordering as sorting by \(u_i^{1/w_i}\) when we’re using mathematically pure real numbers. But the log-transformed version is much more stable when using floating-point numbers. Distinct random inputs \(u_i\) now produce reliably distinct sort keys \(k_i\), even when the input weights \(w_i\) are very large or very small:

>>> [math.log(u_i) / 1e17 for u_i in (0.01, 0.010001, 0.99, 0.990001)]
[-4.605170185988091e-17, -4.605070190987759e-17,
 -1.005033585350145e-19, -1.0049325753001471e-19]

>>> [math.log(u_i) / 1e-17 for u_i in (0.01, 0.010001, 0.99, 0.990001)]
[-4.605170185988091e+17, -4.605070190987758e+17,
 -1005033585350145.0, -1004932575300147.1]

As a final tweak, we negate the sort keys so that instead of sorting by \(u_i^{1/w_i}\) descending, as in the original algorithm, we do an equivalent sort by \(-\ln(u_i) / w_i\) ascending. Note the leading minus sign. The rationale for flipping the sign will become apparent when we discuss Poisson processes in the next section.

One last numerical subtlety. Why do we generate random numbers with the expression 1.0 - RANDOM() instead of just RANDOM()? Since most implementations of RANDOM(), such as the PCG implementation used by DuckDB, return a floating-point value in the semi-closed range \([0, 1)\), they can theoretically return zero. And we don’t want to take the logarithm of zero. So we instead use 1.0 - RANDOM() to generate a random number in the semi-closed range \((0, 1]\), which excludes zero.

Does this algorithm actually work?

First, what do we mean by work? In this case, we’ll say that we want the algorithm to produce samples that are equivalent to a succession of random draws, each draw removing an item from the population, and each draw fair with respect to the population that remains at the time of the draw. Because the population grows slightly smaller with each draw, the draws in the sample are not independent and identically distributed (iid). In practice, however, when your samples are very small compared to your population of interest, and your weights are such that it’s unlikely that iid draws would draw the same item more than once, you can generally pretend that the draws are iid and get away with it. (When you can’t get away with it, you can use a reweighting scheme to extract unbiased estimates anyway. This subject is worth its own post, so I won’t say more here.)

Anyway, it’s not obvious that assigning a random number \(u_i\) to every row \(i\), then sorting rows on \(-\ln(u_i)/w_i\), and finally taking the top \(m\) rows is a recipe that would result in a sample that has the wanted properties. But it does.

The clearest way to understand what’s going on is to first take a detour through the fascinating theory of Poisson processes. In short, a Poisson process with rate \(\lambda\) is a sequence of arrivals such that the times between successive arrivals are all independent, exponentially distributed random variables with rate \(\lambda\).

Poisson processes have some important (and useful!) properties:

They are memoryless. No matter what has previously happened, the time until the next arrival from a Poisson process with rate \(\lambda\) is exponentially distributed with rate \(\lambda\).
They can be merged. If you have two Poisson processes with rates \(\lambda_1\) and \(\lambda_2\), the arrivals from both processes form a combined Poisson process with rate \(\lambda_1 + \lambda_2\). This holds for any number of processes.
They win races in proportion to their rates. In a race between the very next arrival from a Poisson process with rate \(\lambda_1\) and the very next arrival from a Poisson process with rate \(\lambda_2\), the probability that the first process will win the race is \(\lambda_1 / (\lambda_1 + \lambda_2)\).

Now that we know the basics of Poisson processes, there’s just one more tidbit we need:

The uniform–exponential bridge. If \(X\) is a random variable having a uniform distribution between zero and one, then \(-\ln(X)/\lambda\) has an exponential distribution with rate \(\lambda\).

With the uniform–exponential bridge in mind, we can begin to see what the algorithm is doing when it assigns every row a key \(k_i = -\ln(u_i)/w_i\) and sorts the population by that key. It’s running a race over all the rows in the population! In this race, each row arrives at the finish line at a time that’s an exponentially distributed random variable with a rate corresponding to the row’s weight \(w_i\). The first \(m\) arrivals form the sample.

To prove that this race does the sampling that we want, we will show that it is equivalent to a succession of one-row draws, each draw being fair with respect to the population that remains at the time of the draw. Let the population’s total weight be \(w\), and consider an arbitrary row \(i\) with weight \(w_i\). The algorithm will assign it an exponentially distributed random variable with rate \(w_i\), which corresponds to the very next arrival from a Poisson process with the same rate.

Now consider all rows except \(i\). They too correspond to Poisson processes with rates equal to their weights. And we can merge them into a combined process with rate \(\sum_{j \ne i} w_j = w - w_i\).

Now, using the rule about Poisson races, we know that row \(i\), represented by a process with rate \(\lambda_1 = w_i\), will win the race against those other rows, represented by a combined process with rate \(\lambda_2 = w - w_i\), with probability

\[\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{w_i}{w_i + (w - w_i)} = \frac{w_i}{w}.\]

And since we chose row \(i\) arbitrarily, the same argument applies to all rows. Thus every row’s probability of being drawn is equal to its weight in proportion to the population’s total weight. This proves that running a “race of exponentials” lets us perform one fair draw from a population.

But, after we’ve drawn one row, what’s left but a new, slightly smaller population? And can’t we run a new race on this slightly smaller population to correctly draw another row?

We can. And, since Poisson processes are memoryless, we do not have to generate new arrival times to run this new race. We can reuse the existing arrival times because the arrivals that have already happened have no effect on later arrivals. Thus the next row we draw using the leftover arrival times will be another fair draw.

We can repeat this argument to show that successive rows are chosen fairly in relation to the population that remains at the time of each draw. Thus algorithm A-ES selects a sample of size \(m\) by making \(m\) successive draws, each fair with respect to its remaining population. And that’s the proof.

Tricks for faster samples

Most large analytical datasets will be stored in a column-oriented storage format, such as Parquet. When reading from such datasets, you typically only have to pay for the columns you read. (By “pay”, I mean wait for the query engine to do its work, but if you’re running your query on some tech company’s cloud, you may actually pay in currency too.)

For example, if your dataset contains a table having 100 columns but you need only four of them, the query engine will usually only read those four columns. In row-oriented data stores, by contrast, you’ll generally have to decode entire rows, even if you only want four out of the 100 values in each row. Additionally, most column-oriented stores support some kind of filter pushdown, allowing the storage engine to skip rows when a filtering expression evaluates to false. These two properties—pay for what you read and filter pushdown—are ones we can exploit when taking samples.

Say we have a Population table with billions of rows and around 100 columns. How can we efficiently take a weighted sample of 1000 rows?

We could use the basic sampling formulation, as discussed earlier:

SELECT *
FROM Population
WHERE weight > 0
ORDER BY -LN(1.0 - RANDOM()) / weight
LIMIT 1000  -- Sample size.

But think about what the query engine must do to execute this query. It must read and decode all 100 columns for all of those billions of rows so that it may pass those rows into the sort/limit logic (typically implemented as a TOP_N operation) to determine which rows to keep for the sample. Even though the sample will keep only 0.00001% of those rows, you’ll have to pay to read the entire Population table!

A much faster approach is to only read the columns we need to determine which rows are in the sample. Say our table has a primary key pk that uniquely identifies each row. The following variant on our sampling formulation returns only the primary keys needed to identify the rows in the sample:

SELECT pk
FROM Population
WHERE weight > 0
ORDER BY -LN(1.0 - RANDOM()) / weight
LIMIT 1000  -- Sample size.

This variant only forces the query engine to read two columns: pk and weight: Yes, it still must read those two columns for the billions of rows in the table, but those columns contain small values and can be scanned quickly. After all, that’s what column-oriented stores are designed to do well. The point is that we’re not paying to read about 100 additional columns whose values we’re just going to throw away 99.99999% of the time.

One we have identified the rows in our sample, we can run a second query to pull in the full set of wanted columns for just those rows.

Adding determinism

Our sampling algorithm depends on randomization. If we run our algorithm twice with the same inputs, we’ll get different results each time. Often, that nondeterminism is exactly what we want.

But sometimes it isn’t. Sometimes, it’s useful to be able to control the dice rolls that the algorithm depends on. For example, sometimes it’s useful to be able to repeat a sample. Or almost repeat a sample.

To allow us to control the nature of the randomization used when we take samples, we must replace calls to RANDOM with a deterministic pseudorandom function. One common approach is to hash a primary key and then map the hashed value to a number in the range \([0, 1)\). The following DuckDB macro pseudorandom_uniform will do exactly that:

-- Returns a pseudorandom fp64 number in the range [0, 1). The number
-- is determined by the given `key`, `seed` string, and integer `index`.
CREATE MACRO pseudorandom_uniform(key, seed, index)
AS (
  (HASH(key || seed || index) >> 11) * POW(2.0, -53)
);

We can vary the seed and index parameters to generate independent random values for the same key. For example, if I fix the seed to “demo-seed-20240601” and generate random numbers for the key “key123” over the index values \(1, 2, \ldots, 10\), I get 10 fresh random numbers:

SELECT
  pseudorandom_uniform('key123', 'demo-seed-20240601', i) AS u_key123
FROM RANGE(1, 11) AS t(i);

Ten random numbers for the key “key123” and seed “demo-seed-20240601”.
i	u_key123
1	0.9592606495318252
2	0.6309411348395693
3	0.5673207749533353
4	0.11182926321927167
5	0.3375806483238627
6	0.12881607107157678
7	0.6993372364353198
8	0.94031652266991
9	0.17893798791559323
10	0.6903126337753016

To take deterministic samples, we just replace calls to RANDOM() with calls to our function pseudorandom_uniform().

Now that we can take deterministic samples, we can do even more useful things! For example, we can take samples with replacement.

Sampling with replacement

Earlier, we proved that the A-ES algorithm allows us to take a sample without replacement as a series of successive draws, each draw removing an item from the population, and each draw fair with respect to the population that remains at the time of the draw. But what if we wanted to take a sample with replacement? A sample with replacement requires us to return each item to the population as it is selected so that every selection is fair with respect to the original population, and individual items may be selected more than once.

Can we efficiently implement sampling with replacement in SQL? Yes! But it’s a little trickier. (I haven’t found this algorithm published anywhere; please let me know if you have. It took me some effort to create, but I wouldn’t be surprised if it’s already known.)

Think back to our correctness proof for the A-ES algorithm. For each row \(i\) having a weight \(w_i\), the algorithm imagined a corresponding Poisson process with rate \(\lambda_i = w_i\) and represented the row by the very next arrival from that process. That arrival would occur at time \(k_i = -\ln(u_i) / w_i\), where \(u_i\) is a uniformly distributed random number in the range \((0, 1]\). Then the algorithm sorted all rows by their \(k_i\) values and took the first \(m\) arrivals as the sample.

With one minor tweak to this algorithm, we can take a sample with replacement. That tweak is to consider not just the very next arrival from each row’s Poisson process but all arrivals. Let \(k_{i,j}\) denote the \(j\)th arrival from row \(i\)’s process. Since we know that in a Poisson process the times between successive arrivals are exponentially distributed random variables, we can take the running sum over interarrival times to give us the needed arrival times. That is, \(k_{i,j} = \sum_{r=1}^{j} -\ln(u_{i,r}) / w_i\), where \(u_{i,r}\) represents the \(r\)th uniformly distributed random variable for row \(i\).

One minor problem with this tweaked algorithm is that a Poisson process generates a theoretically infinite series of arrivals. Creating an infinite series for each row and then sorting the arrivals from all of these series is intractable.

Fortunately, we can avoid this problem! Think about how the A-ES algorithm for taking a sample without replacement relates to our proposed intractable algorithm for taking a sample with replacement. We could describe the without algorithm in terms of the with algorithm like so: Prepare to take a sample with replacement, but then ignore all arrivals \(k_{i,j}\) for \(j > 1\); the remaining arrivals must be of the form \(k_{i,1}\), where \(i\) indicates the corresponding row. Then take the first \(m\) of these remaining arrivals as your sample, as before.

Now think about going the other way, from having a without-replacement sample and needing to construct a corresponding with-replacement sample. Let \(S\) be the set of rows sampled without replacement. We know these rows were represented by a corresponding set of arrival times \(k_{i,1}\) for \(i\) in \(S\). We also know that, had the sample been taken with replacement, the race would have included some additional arrival times \(k_{i,j}\) for \(j > 1\) that could have displaced some of the winning rows in \(S\). But, crucially, we also know that if \(S_{*}\) represents the set of rows in the corresponding sample with replacement, then \(S_{*}\) must be contained within \(S\). This claim follows from the fact that if any arrival \(k_{i,j}\) for \(j > 1\) does displace a row among the first \(m\) arrivals, the \(j>1\) requirement implies that the displacing row is a duplicate of some row \(i\) that arrived earlier in the sample at time \(k_{i,1}\); thus, displacement cannot introduce a new row from outside of \(S\).

Therefore, if we have a sample without replacement, we can construct a sample with replacement from its rows. We can ignore all other rows in the population. This makes the problem much more approachable.

So now we can see an algorithm taking shape for taking a sample with replacement of size \(m\):

First, take an \(m\)-sized sample \(S\) without replacement using the efficient A-ES algorithm.
For each sampled row \(i\) in \(S\), generate \(m\) arrivals \(k_{i,j}\) for \(j = 1, 2, \ldots, m\) using the same pseudorandom universe that was used to generate \(S\).
Take the first \(m\) arrivals as the sample.

You may have noticed that step 2 of this algorithm requires us to create \(m\) arrivals for each of the \(m\) rows in \(S\). This step thus requires \(O(m^2)\) time. When \(m\) is large, this time can be prohibitive.

Fortunately, we can use probability theory to reduce this cost to \(O(m)\). The idea is that if row \(i\) is expected to occur in the sample \(n_i\) times, it is very unlikely to occur more than \(c \cdot n_i\) times when \(c \ge 2\). So we don’t need to generate a full \(m\) arrivals for each row \(i\) in \(S\); we can get away with generating \(m_i = \lceil{c \cdot n_i}\rceil\) arrivals instead, for a suitably large \(c\) to assuage our personal level of paranoia.

Here’s a sample implementation as a DuckDB macro:

-- Takes a weighted sample with replacement from a table.
--
-- Args:
--  population_table: The table to sample from. It must have a `pk` column
--    of unique primary keys and a `weight` column of non-negative weights.
--  seed: A string that determines the pseudorandom universe in which the
--    sample is taken. Samples taken with distinct seeds are independent.
--    If you wish to repeat a sample, reuse the sample's seed.
--  sample_size: The number of rows to include in the sample. This value
--    may be larger than the number of rows in the `population_table`.
--
-- Returns a sample of rows from the `population_table`.
CREATE MACRO sample_with_replacement(population_table, seed, sample_size)
AS TABLE (
  WITH
    -- First, take a sample *without* replacement of the wanted size.
    SampleWithoutReplacement AS (
      SELECT *
      FROM query_table(population_table::varchar)
      WHERE weight > 0
      ORDER BY -LN(pseudorandom_uniform(pk, seed, 1)) / weight
      LIMIT sample_size
    ),
    -- Compute the total weight over the sample.
    SampleWithoutReplacementTotals AS (
      SELECT SUM(weight) AS weight
      FROM SampleWithoutReplacement
    ),
    -- Generate a series of arrivals for each row in the sample.
    SampleWithReplacementArrivals AS (
      SELECT
        S.*,
        SUM(-LN(pseudorandom_uniform(pk, seed, trial_index)) / S.weight)
          OVER (PARTITION BY pk ORDER BY trial_index)
          AS rws_sort_key
      FROM SampleWithoutReplacement AS S
      CROSS JOIN SampleWithoutReplacementTotals AS T
      CROSS JOIN
        UNNEST(
          RANGE(1, CAST(2.0 * sample_size * S.weight / T.weight + 2 AS INT)))
        AS I(trial_index)
    )
  -- Form the sample *with* replacement from the first `sample_size` arrivals.
  SELECT * EXCLUDE (rws_sort_key)
  FROM SampleWithReplacementArrivals
  ORDER BY rws_sort_key
  LIMIT sample_size
);

Example of sampling with replacement

As an example of when we might want to sample with replacement instead of without, consider the following population table ThreeToOne that represents the possible outcomes of tossing a biased coin:

`ThreeToOne` population table.
pk	weight
heads	3
tails	1

For this biased coin, “heads” is 3 times as likely as “tails.” We can simulate flipping this coin 10 times by sampling 10 rows from the ThreeToOne population table with replacement:

SELECT pk
FROM sample_with_replacement(ThreeToOne, 'test-seed-20240601', 10);

Results of drawing a 10-row sample with replacement from `ThreeToOne`.
pk
heads
heads
heads
tails
tails
heads
heads
heads
tails
heads

In this sample, we got 7 heads and 3 tails. On average, we would expect about 7.5 heads in each sample of size 10, so our observed sample is close to our expectations.

But maybe we just got lucky. As a stronger test of our SQL sampling logic, let’s take 10,000 samples of size 40 and look at the count of heads across all of the samples. We would expect this count to have a Binomial(size = 40, p = 3/4) distribution. To compare our observed results to the expected distribution, I’ll compute the empirical distribution of the results and plot that over the expected distribution. As you can see in the figure below, the observed distribution closely matches the expected distribution.

Conclusion

Sampling is a powerful tool. And with the SQL logic we’ve just discussed, you can take fast, easy samples from virtually any dataset, no matter how large. And you can take those samples with or without replacement.

What makes it all work is a clever connection to the theory of Poisson processes. Those processes are memoryless and mergeable, and their arrivals win races in proportion to their rates. These properties are exactly what we need to run races that let us take samples.

Beyond what we’ve discussed in this article, there are further ways we can exploit these properties. For example, as a performance optimization, we can predict the arrival time \(t\) of the final row in a sample. Then we can augment our SQL sampling logic with a pushdown filter that eliminates population rows with arrival times greater than \(c \cdot t\) for some constant \(c\). This filtering happens before ORDER/LIMIT processing and can greatly speed queries by eliminating more than 99.99% of rows early on, before they are even fully read on systems that support “late materialization.”

But this article is already too long, so I’ll stop here for now.

References

Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006.

Dice Race (part 1)

2023-06-24T00:00:00Z

By Tom Moertel

Posted on June 24, 2023

Tags: interview-problems, probability, dice-race

In a dice race of degree \(m\) and length \(n\), each player chooses a sequence of \(m\) die rolls. Then we roll a die \(n\) times to generate the race sequence. The player whose sequence occurs first in the race sequence wins. If no player’s sequence occurs in the race sequence, the race is a tie. (If we want to guarantee a winner, we can always set \(n = \infty\).)

For example, say you and I run a race of degree \(m=2\) and length \(n=20\). I might choose 55 as my sequence. You might choose 23. Then we roll a die \(n=20\) times to generate the race sequence:

6 1 6 6 1 5 5 3 1 4 5 2 3 3 2 2 6 4 3 4

In this race, I would win since 55 occurred before 23 in the race sequence.

Now consider the following questions:

Given a choice between 55 and 23, which should you choose? Why?
What is the expected number of die rolls before 55 occurs? What about for 23?
Write code to run a dice race, given degree \(m\), length \(n\), and the players’ chosen sequences.
Use the code to simulate many races between 55 and 23.
How do the results compare with your answer to question 2?

I’ll answer the first two questions in this post and save the rest for a later post.

(Now would be a good time to pause and try to answer the questions yourself.)

Question 1: Given a choice between 55 and 23, which should you choose? Why?

In the example above, 55 won the race. Is it the better choice, in general? To answer the question, let’s think about the task of matching the sequence 55 in a series of die rolls. This task can be modeled as a finite automaton:

We start in the initial state S0. If the next die roll is a 5, we advance to state S1; otherwise, we remain at S0. If we’re at S1 and the next roll is another 5, we advance to S2 and have successfully matched the sequence 55; otherwise, we go all the way back to S0 and start over from the beginning.

Now let’s consider the automaton for the sequence 23:

It’s similar to the automaton for 55 except that when in state S1, we have a 1-in-6 chance to stay at S1. Failure to advance does not always force us to start over at S0. That reduced risk gives the sequence 23 an advantage over 55.

So, given a choice between 55 and 23, you should choose to race with 23.

Question 2: What is the expected number of die rolls before 55 occurs? What about for 23?

To answer this question, we’ll have to use a little algebra and probability theory. Let’s start by finding the expected number of rolls needed to win with the sequence 55.

Let \(x_i\) be the expected number of die rolls needed to reach state S2 from S\(i\). We want to find \(x_0\), the expected number of rolls needed to reach the “match” state S2 from the initial state S0. Referring to the diagram for the 55 automaton, we know that \(x_0\) must satisfy the equation

\[ x_0 = 1 + \tfrac{5}{6} x_0 + \tfrac{1}{6} x_1. \]

That’s because from S0 we must roll one die for a chance to advance, and, once rolled, we have a 5/6 probability of remaining at S0 and a 1/6 probability of advancing to S1. If we remain at S0, the expected cost to reach S2 is \(x_0\); and if we advance to S1, the expected cost to reach S2 is \(x_1\). (While this reasoning is intuitive, it is also justified by the law of total expectation.)

Using similar logic, we can find the equation for \(x_1\):

\[ x_1 = 1 + \tfrac{5}{6} x_0 + \tfrac{1}{6} x_2. \]

And the equation for \(x_2\) is trivial, since it takes zero rolls to reach S2 from S2:

\[ x_2 = 0. \]

Now it’s a matter of solving the system of equations for \(x_0\), \(x_1\), and \(x_2\). Substituting 0 for \(x_2\) in the equation for \(x_1\) gives

\[ x_1 = 1 + \tfrac{5}{6} x_0. \]

And substituting this expression for \(x_1\) in the equation for \(x_0\) gives

\[ x_0 = 1 + \tfrac{5}{6} x_0 + \tfrac{1}{6} \left( 1 + \tfrac{5}{6} x_0 \right) = \tfrac{7}{6} + \tfrac{35}{36} x_0. \]

Subtracting \(\tfrac{35}{36} x_0\) from both sides gives us

\[ \tfrac{1}{36} x_0 = \tfrac{7}{6}. \]

And multiplying both sides by 36 solves the equation:

\[ x_0 = \tfrac{7}{6}(36) = (7)(6) = 42. \]

So, on average, you need 42 rolls to find your sequence when you choose 55.

Now let’s consider the case when you choose 23 as your sequence. Referring to the diagram for the 23 automaton, the corresponding equations are:

\[\begin{align} x_0 & = 1 + \tfrac{5}{6} x_0 + \tfrac{1}{6} x_1 \\ x_1 & = 1 + \tfrac{4}{6} x_0 + \tfrac{1}{6} x_1 + \tfrac{1}{6} x_2 \\ x_2 & = 0 \end{align}\]

Solving this system of equations (I’ll skip the algebra), we find that \(x_0 = 36\).

So, on average, you need 36 die rolls to find your sequence when you choose 23. That’s notably faster than 42 rolls when you choose 55.

A Great Old-Timey Game-Programming Hack

2013-12-14T00:00:00Z

By Tom Moertel

Posted on December 14, 2013

Tags: assembly, 6809, games, hacks

A long time ago, when I was a college undergrad, I spent some time working on computer video games. This was in the 8-bit PC era, when the gaming hardware was almost impossibly slow by today’s standards.

It might not surprise you, then, to learn that game programmers of yore did all sorts of crazy things to make their games run at playable speeds. Crazy, crazy things.

This is a story about one of those things.

While I’ve done my best to recall the important details, I may have gotten some things wrong. If I have, please forgive me. It was a long time ago.

(Note: This post contains some small images as inline-data URLs that some feed readers apparently don’t handle well. If you can’t see the images in your feed, click through to the original article to see them.)

Update 2013-12-16: There have been lots of great comments about this story on Hacker News and Proggit, as well as here on the blog. Thanks for sharing, everyone!

The set-up

A friend of mine, a gifted programmer, was nearly finished with a new game. Somehow, he had squeezed into a 1980’s-era PC, pretty faithfully, what was at the time a graphically impressive coin-op game popular at the arcades.

The only problem was that his version of the game was unplayable. It was too slow, and the choppy motion broke the player’s immersion. It was after all a side-scroller.

My friend, who had been working on the game while also taking a full load of college courses, was starting to seem a little frazzled. Fearing that he had missed some easy optimization, he asked me to take a look at the code.

I did. But there was no easy optimization to be had.

The code was tight. Everywhere I looked, he had already done all the things I could think of. Loops had been unrolled. Unneeded draws had been eliminated. All evidence of waste had vanished.

And, with them, our hopes of an easy fix.

But what about a hard fix? A crazy fix?

Well, that was always a possibility.

Before I get to the crazy stuff, however, I should back up and explain how graphics hardware worked back then.

How graphics hardware worked back then

Here’s the quick version: The graphics hardware didn’t do jack squat.

On the PC in question, if you wanted to draw something onto the screen, you had to do that yourself, byte by byte. No texture mappers, no blitters. Just bytes. Bytes that you had to move yourself.

Fun.

My friend’s game spent most of its time redrawing the background. (Again: side-scroller.) For every frame, it had to draw nearly a screen’s worth of background tiles, shifted for the player’s position.

If my memory serves me, each tile was 28 by 28 pixels. Each pixel was one of 16 colors, taking half a byte to represent. Therefore, tiles were represented in memory as 28 contiguous rows of 14 bytes each. The first 14 bytes represented the first row, the second 14 bytes the second row, and so on.

The screen, however, was 320 pixels wide by 240 pixels high. In memory, then, a screen buffer was laid out as 240 contiguous rows of 160 bytes each.

So when copying a tile at address X onto a screen buffer location starting at address Y, you had to copy 28 rows. To copy each row, you had to copy 14 bytes. Fortunately, the processor for this game was a 6809, which had a few 16-bit index registers and delightful “auto-increment” addressing modes (sort of like applying the postfix ++ operator to pointers in C). That meant you could copy 4 pixels at a time while advancing both the X and Y registers in passing:

    LDU ,X++     ; read a 2-byte word (= 4 pixels) from source tile
    STU ,Y++     ; write it to screen buffer

To copy a full row, you had to do that seven times, so you might sandwich those lines within a 7-count loop:

    LDB #7       ; remaining words <- tile width in words
@LOOP
    LDU ,X++     ; read a 2-byte word (= 4 pixels) from source tile
    STU ,Y++     ; write it to screen buffer
    DECB         ; reduce remaining-word count
    BNE @LOOP    ; loop while words remain

When you were done copying the row, you needed to advance the destination pointer Y so that it pointed to the starting address for the next row you would draw into the screen buffer. Since the screen buffer was 160 bytes wide and a tile was only 14 bytes wide, that meant you had to add their difference to Y:

    LEAY 160-14,Y

And, just like that, you have copied a row onto the screen.

That’s one. To copy a full tile, you need to do the same thing 28 times. So, in turn, you sandwich that code within a 28-count loop.

Putting it all together, and giving names to the important numbers, you might end up with a subroutine like this:

;;; important constants

SCRW = 160          ; screen-buffer width in bytes (= 320 4-bit pixels)
TILW =  14          ; background-tile width in bytes (= 28 4-bit pixels)
TILH =  28          ; background-tile height in rows
WOFF = SCRW - TILW  ; s-b offset from end of one tile row to start of next


COPYTILE
;;;
;;; Copy a 28x28 background tile into a screen buffer.
;;; Arguments:
;;;   X = starting address of background tile
;;;   Y = starting address of destination in screen buffer
;;;
    LDA #TILH               ; remaining rows <- tile height
@COPYROW
    LDB #TILW/2             ; remaining words <- tile width in words
@LOOP
    LDU ,X++                ; read a word (= 4 pixels) from source tile
    STU ,Y++                ; write it to screen buffer
    DECB                    ; reduce remaining-word count
    BNE @LOOP               ; loop while words remain
    ;;
    LEAY WOFF,Y             ; advance dst ptr to start of next dst row
    DECA                    ; reduce remaining-row count
    BNE @COPYROW            ; loop while rows remain
    ;;
    RTS                     ; done! return to caller

And that code would be fine.

If you didn’t care about speed.

Caring about speed

Knowing that the game is probably going to spend most of its time running that code, you do what any good programmer would: start counting cycles. Here’s the inner loop again, with setup and finishing, annotated with cycle counts:

    LDB #TILW/2             ; 2 cycles (set-up)
@LOOP
    LDU ,X++                ; 8
    STU ,Y++                ; 8
    DECB                    ; 2
    BNE @LOOP               ; 3
    ;;
    LEAY WOFF,Y             ; 8 (finishing)

Looking at those counts within the loop, you’re not likely to miss that you’re burning 21 cycles to copy just 4 pixels. To copy a full row, then, works out to 2 cycles + (7 iterations) * (21 cycles/iteration) + 8 cycles = 157 cycles. Ouch.

But this isn’t your first time at the keyboard. You know what to do. Unroll that loop!

    LDU ,X++                ; 8 cycles
    STU ,Y++                ; 8
    LDU ,X++                ; 8
    STU ,Y++                ; 8
    LDU ,X++                ; 8
    STU ,Y++                ; 8
    LDU ,X++                ; 8
    STU ,Y++                ; 8
    LDU ,X++                ; 8
    STU ,Y++                ; 8
    LDU ,X++                ; 8
    STU ,Y++                ; 8
    LDU ,X++                ; 8
    STU ,Y++                ; 8
    LEAY WOFF,Y             ; 8 (finishing)

Now, with the loop overhead reduced to zero – you’ve even eliminated the set-up – each row takes only 7 * (8 + 8) + 8 = 120 cycles. That’s a 30-percent speed-up. Pretty good.

And that’s where most programmers probably would have left it.

But not my friend.

He knew that those ++ operations were costly, 3 cycles apiece. And, with the loop unrolled, he also knew exactly where each word to be read or written was located with respect to X or Y. So he cleverly replaced those 3-cycle post-increments with exact offsets. They cost only 1 cycle apiece, and the 0 offset is actually free:

    LDU  ,X                 ; 5 cycles
    STU  ,Y                 ; 5
    LDU 2,X                 ; 6
    STU 2,Y                 ; 6
    LDU 4,X                 ; 6
    STU 4,Y                 ; 6
    LDU 6,X                 ; 6
    STU 6,Y                 ; 6
    LDU 8,X                 ; 6
    STU 8,Y                 ; 6
    LDU 10,X                ; 6
    STU 10,Y                ; 6
    LDU 12,X                ; 6
    STU 12,Y                ; 6
    LEAX TILW,X             ; 8 (finishing)
    LEAY SCRW,Y             ; 8 (finishing)

With his optimizations, the cycle count per row had been reduced to (5 + 5) + 6 * (6 + 6) + (8 + 8) = 98 cycles. Compared to the original code, that’s 60 percent faster:

original_speed  = (1*row) / (157*cycle)
optimized_speed = (1*row) /  (98*cycle)

speed_up  =  optimized_speed / original_speed - 1  =  157 / 98 - 1  =  0.60.

Putting it all together – and, again, I’m going by memory so the code might have been slightly different – the copy-tile subroutine looked something like this, and it copied a full tile, all 28 rows, in a lean 2893 cycles:

COPYTILE2
;;;
;;; Copy a 28x28 screen tile into a screen buffer.
;;; Arguments:
;;;   X = starting address of background tile
;;;   Y = starting address of destination in screen buffer
;;; Execution time:
;;;   4 + 28 * (82 + 8 + 8 + 2 + 3) + 5 = 2893 cycles
;;;
    LDA #TILH      ; initialize row count (4 cycles)
    ;;
@COPY1
    ;; unroll inner loop (copies one row of 28 pixels in 82 cycles)
    LDU ,X         ; (1) read 4 pixels (5 cycles)
    STU ,Y         ;     write 4 pixels (5 cycles)
    LDU 2,X        ; (2) (6 cycles)
    STU 2,Y        ;     (6 cycles)
    LDU 4,X        ; (3) ...
    STU 4,Y        ;     ...
    LDU 6,X        ; (4)
    STU 6,Y        ;
    LDU 8,X        ; (5)
    STU 8,Y        ;
    LDU 10,X       ; (6)
    STU 10,Y       ;
    LDU 12,X       ; (7)
    STU 12,Y       ;
    ;;
    LEAX TILW,X    ; advance src to start of next row (8 cycles)
    LEAY SCRW,Y    ; advance dst to start of next row (8 cycles)
    DECA           ; reduce remaining count by one (2 cycles)
    BNE @COPY1     ; loop while rows remain (3 cycles)
    ;;
    RTS            ; done!  return to caller (5 cycles)

This code, all in all, was 60% faster than the naive COPYTILE code we started out with.

But that wasn’t fast enough. Not nearly.

So when my friend showed me his code and asked if I could make it faster, I really wanted to help him. I really wanted to be able to answer yes.

But I had to answer no. I hated giving that answer. But, studying the code, I couldn’t see any way to speed it up.

And yet, the hook was set.

Later, I couldn’t get the problem out of my head. Maybe I had missed something. I had grown up on Apple II computers and their 6502 processors. My friend’s code, however, was for the 6809. Maybe it afforded optimizations that I didn’t know about.

With renewing optimism, I dialed into the university’s library to access the card catalog. (These were the days before the World Wide Web.) Through a VT220 terminal emulator, I searched for books about the 6809.

There was one. A 6809 microprocessor manual. It was in the engineering library. Luckily, I was an engineering student and had sign-out privileges.

Just like that, I was off to the library.

A crazy idea

When I got to the engineering library, I found the book right where it was supposed to be, looking a little beat up:

The MC6809-MC6809E Microprocessor Programming Manual.

Not bothering to find a chair, I flipped through the pages standing, looking for some oddball, 6809-specific instruction that might throw a lot of bytes and throw them fast. Page after page, though, turned up nothing.

Then I came upon PSHS. “Push registers on the hardware stack.”

On my beloved 6502, if you wanted to save registers on the stack, you had to do it one register at a time and, even then, by transferring them through the accumulator. It was slow and expensive. So I had learned to avoid the stack when speed mattered.

But on the 6809 you could save all of the registers – or any subset of them – with a single instruction. And, amazingly, it cost a mere 5 cycles, plus 1 cycle more per byte pushed.

Since the processor had three 16-bit general-purpose registers – D, X, and Y – I could load them up and then use a single PSHS instruction to write 6 bytes in just 11 cycles. The corresponding pull instruction, PULS, had the same low cost.

Further, the 6809 had two stack registers, S and U. I could use one as a source pointer and the other as a destination pointer. In theory, with a single PULS/PSHU pair, I could copy 6 bytes in 22 cycles.

That’s crazy, crazy fast.

Excitement mounting, I headed for the front desk and reached for my student ID. I was going to sign out that wonderful little book for further study.

A crazy plan

Walking back to the dorms, I formed my plan.

I would save the S and U registers somewhere, and then point S at the background tile and U at the screen buffer. Then I would pull from S and push to U, copying 6 bytes at a time, using D, X, and Y as go-betweens. To copy the 14 bytes that make up a row would require three such iterations, which unrolled would be about 60 cycles.

When I reached my room, I found a piece of paper and sketched it out:

    PULS D,X,Y    ; first 6 bytes    11 cycles
    PSHU D,X,Y    ;                  11
    PULS D,X,Y    ; second 6 bytes   11
    PSHU D,X,Y    ;                  11
    PULS D        ; final 2 bytes     7
    PSHU D        ;                   7
    LEAU -WOFF,U  ; advance dst ptr   8

Just 66 cycles, including the after-row adjustment to U that prepares it for the next row. (Note that the adjustment is now negative.) For comparison, the naive row-copy loop that we looked at earlier took 157 cycles to do the same thing. And my friend’s optimized code took 98. Already, this crazy idea was looking like a big win.

Still, that last PULS/PSHU pair! What a clunker. It was needed to handle the final two bytes per row, since the rows were 28 pixels = 14 bytes wide, and 6 doesn’t divide 14 evenly.

That darn remainder of 2 bytes!

If only the game had used 24-by-24 tiles instead… But it hadn’t, so I pored through the manual, looking for a way to lower the cost of that inevitable last pair.

And then, by surprise, I struck gold! It was another 6809 oddity, the DP register.

On the 6502 and most 8-bit processors of the era, the lowest 256 bytes of memory was called the zero page. The zero page was special because its memory locations had single-byte addresses and could be accessed with shorter and usually faster instructions.

The 6809’s designers took this idea one step further. They let you use the DP register to designate any page as the zero page, which they called the “direct page.”

But none of my byte-slinging instructions needed to use the direct page. That meant I could use the DP register as an additional one-byte go-between. Now I could copy 7 bytes per pull-push pair!

And 14 is evenly divisible by 7.

With this change, I could copy a whole 28-pixel row and advance to the next in just 5 instructions:

    PULS D,X,Y,DP  ; first 7 bytes    12 cycles
    PSHU D,X,Y,DP  ;                  12
    PULS D,X,Y,DP  ; second 7 bytes   12
    PSHU D,X,Y,DP  ;                  12
    LEAU -WOFF,U   ; advance dst ptr   8

56 cycles! Fifty. Six. Cycles.

That bit of code made me feel just awesome. I had managed to harness every single available register on the machine for byte-slinging! D, X, Y, U, S, and even the oddball DP – they were all fully engaged.

I was feeling great about this solution.

Except for just one thing…

I’ll have a little more crazy, please

If you’re acquainted with stacks, you may have noticed a subtle flaw in my brilliant plan:

Pushes and pulls work in opposite directions.

See, I lied about that little block of code I just showed you. It didn’t actually copy one row from a background tile onto the screen. What it actually did was break the row into two 7-byte blocks – I’ll call them “septets” – and then draw them onto the screen swapped.

Recall that tile rows were laid out sensibly in memory as 14 contiguous bytes, like this:

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0   1   2   3   4   5   6   7   8   9   A   B   C   D |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

So when you pulled-and-pushed a row onto the screen in 7-byte septets, it would get horizontally split like this:

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 7   8   9   A   B   C   D | 0   1   2   3   4   5   6 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The first septet was now second, and the second now first. Note, however, that the bytes within each septet were unchanged; they retained their original ordering.

Further, if you ran the row-copy code multiple times, the vertical stack of rows you copied would end up being upside down. That’s because the code uses pushes to write the rows onto the screen. Pushes move the stack pointer toward lower addresses, and lower addresses correspond to screen lines higher up on the display.

To visualize the havoc that this row-reversal and septet-swapping would wreak, let’s say that you had the following 28-by-28 tile, representing a key:

If you had used 28 applications of my row-copy code to draw it onto the screen, you would have ended up with this not-a-key:

So . . . that was a problem.

But this problem, too, had an elegant solution!

To review:

the rows were upside down
the septets within each row were swapped
but the bytes within each septet were unchanged

The breakthrough was to see “upside down” and “swapped” as two instances of the same thing. Reversal.

Reversal has this handy property: it’s an involution. Reverse something twice, and you get back the original. With this in mind, I reasoned that if the tiles were pre-processed to reverse their rows and then the septets within the rows, the tiles would end up looking fine when they hit the screen. The mangling that occurred during copying and pre-processing would, in effect, undo each other.

Working it out on paper, I also discovered that the pre-processing step could ignore the rows and just treat the tiles as one long, linear sequence of septets.¹ To see why this is so, let’s consider a small tile of 2 rows of 2 septets each:

+---+---+
| a   b |
+---+---+
| c   d |
+---+---+

In memory, it would be laid out in row-major order, that is, as 4 sequential septets:

+---+---+---+---+
| a   b | c   d |
+---+---+---+---+

So, reversing the rows and then swapping the septets within each row is the same as just reversing the septet order in memory:

+---+---+                  +---+---+                  +---+---+
| a   b |                  | c   d |                  | d   c |
+---+---+        =======>  +---+---+        =======>  +---+---+
| c   d |        reverse   | a   b |        swap      | b   a |
+---+---+        rows      +---+---+        septets   +---+---+

+---+---+---+---+          +---+---+---+---+          +---+---+---+---+
| a   b | c   d |          | c   d | a   b |          | d   c | b   a |
+---+---+---+---+          +---+---+---+---+          +---+---+---+---+

So the solution to the mangling problem turned out to be a simple, one-time pre-processing step: Just break each tile into septets and reverse their order.

Knowing that the mangling problem could be solved, I was feeling good about my overall idea. I couldn’t see any other problems, so I sketched out a new copy-tile subroutine to present to my friend. Since the row-copy logic was now just 5 instructions, I used it 4 times, unrolling the sole remaining loop a bit. Now, instead of copying 28 single rows, I copied 7 quad-rows.

The code looked something like this:

COPYTILE3
;;;
;;; Copy a 28x28 screen tile into a screen buffer.
;;; Arguments:
;;;   X = starting address of *septet-reversed* background tile
;;;   Y = starting address of destination in screen buffer
;;; Execution time:
;;;   34 + 7 * (224 + 7 + 3) + 7 + 10 = 1689 cycles
;;;
    ;; setup:  34 cycles
    PSHS U,DP      ; save U and DP (8 cycles)
    STS >SSAVE     ; save S (7 cycles)
    ;;
    LDA #TILH/4    ; initial quad-row count (2 cycles)
    STA >ROWCT     ; (5 cycles)
    ;;
    LEAS ,X                    ; initialize src ptr (4 cycles)
    LEAU (TILH-1)*SCRW+TILW,Y  ; initialize dst ptr (8 cycles)
    ;;
@COPY1
    ;; copy four rows of 28 pixels in 4 * (48 + 8) = 224 cycles
    PULS X,Y,D,DP
    PSHU X,Y,D,DP
    PULS X,Y,D,DP
    PSHU X,Y,D,DP
    LEAU -WOFF,U
    PULS X,Y,D,DP
    PSHU X,Y,D,DP
    PULS X,Y,D,DP
    PSHU X,Y,D,DP
    LEAU -WOFF,U
    PULS X,Y,D,DP
    PSHU X,Y,D,DP
    PULS X,Y,D,DP
    PSHU X,Y,D,DP
    LEAU -WOFF,U
    PULS X,Y,D,DP
    PSHU X,Y,D,DP
    PULS X,Y,D,DP
    PSHU X,Y,D,DP
    LEAU -WOFF,U
    ;;
    DEC >ROWCT     ; reduce remaining quad-row count by one (7 cycles)
    BNE @COPY1     ; loop while quad-rows remain (3 cycles)
    ;;
    LDS  >SSAVE    ; restore S (7 cycles)
    PULS U,DP,PC   ; restore regs and return to caller (10 cycles)

SSAVE ZMD  1       ; stash for saving S while drawing
ROWCT ZMB  1       ; var: remaining rows to copy

Adding up the cycles, the total was just 1689. I had shaved almost 1200 cycles off of my friend’s code. That’s a 70 percent speed-up!

Beaming, I went to show my friend.

The acid test

When I caught up with my friend and said I’d figured out how to make the copy-tile routine 70% faster, his face lit up. When I explained about the whole septets-and-mangling thing, though, his face unlit, doubts rising. But when I showed him the code, he got it. It clicked.

“Let’s try it out!” he said.

In about a half hour’s work, he had the code integrated into the game. After a rebuild and reboot, the game was loading.

The title screen came up.

He was into the game.

And . . .

Holy crap!

The game seemed amazingly fast. In truth, it was probably only a third to a half faster, but that was enough. Some important perceptual threshold had been crossed. The game-play was now smooth, natural. You could feel the difference.

We just sat there, playing the game and grinning. It was a good day.

But it would not last.

In for a penny . . .

A few days after the speed problem was behind us – or so we thought – the game’s finishing touches were applied. One of those finishing touches was sampled sound effects. This required feeding byte-sized morsels to the audio-output DAC a few thousand times each second. To schedule the feedings, my friend had turned on a hardware-clocked interrupt.

Worked like a charm. The sounds sounded great, and everything was perfect.

Except for one thing.

After playing into the game one night, we noticed that some of the tiles were getting corrupted. The more we played, the more corruption we saw.

Oh, no.

Something very bad was happening.

And then it hit us.

What happens if an interrupt goes off during the copy-tile routine?

The processor, to prepare for calling the interrupt handler, would push its current state onto the system stack. But, during the copy-tile routine, there was no system stack. The system-stack register S had been commandeered. And where was it pointing? Right into the memory buffer that held the reference tiles!

Oops.

We slumped. After all that effort, to think that our all-important speed-up was causing memory corruption…

Needing a break, we walked to an all-night diner near campus to eat and think. Over pancakes and bacon, scrapple and grilled stickies, we talked through the problem.

There were only two stack registers, and without commandeering both of them, the copy-tile routine wouldn’t be nearly as fast. There was no way, then, to return the S register to the system without losing our hard-won speed. But there was also no way that we could get reliable sound without using interrupts.

One way or another, then, interrupts were going to be triggered. And, when they were, if copy-tile was running, whatever S was pointing to was going to get hammered.

“How can we prevent the corruption?” I asked.

We sat there, ignoring our food, the question hanging in the air.

Suddenly, my friend slapped the table. He had it.

“Don’t prevent it!” he said.

“What?” I asked.

“Don’t prevent the corruption,” he explained. “Let it happen. Just not to the reference tiles.”

It was simple, really. He continued:

“Just swap S and U in the copy-tile routine. U will point at the reference tiles, and S at the screen buffer. If an interrupt goes off, the corruption will happen where S is pointing – on the screen. The corruption will last only until we redraw the next frame.”

“That’s brilliant!” I said.

Eager to try his solution, we quickly finished our meal.

. . . In for a pound

Walking back to the dorms, though, it bothered us that players might see screen glitches, even if for just a frame. We both felt that there had to be some way to make the hack perfect, if only we could find it.

Later that night, we found it.

Again, it was simple once we saw it. All we had to do was change the tile ordering. Instead of placing the tiles onto the screen from top to bottom, left to right, we would go right to left, bottom to top. In other words, from high addresses to low addresses.

That way, when an interrupt went off and corrupted the memory locations immediately before the tile we were currently drawing, the corruption would be fixed when the very next tile was drawn. And since tile-drawing took place in a buffer that wasn’t displayed until it was fully drawn (and the vertical refresh had occurred), nobody would ever see the corruption.²

It was the perfect hack!

To the old days!

And that’s the way things were. The challenge wasn’t overwhelming complexity, as it is today. The challenge was cramming your ideas into machines so slow, so limited that most ideas didn’t fit.

You had to bang your ideas around, twist them, turn them, searching for something, anything that would help you squeeze them into the machine. Sometimes you found it, and you got one step closer to realizing your ideas. Sometimes you didn’t.

But the search was always instructive.

In this case, the search yielded several small victories that, together, solved the problem. But when you consider all the things we had to do to make that darn game fast enough, it does seem crazy.

We started with a tile-copying subroutine whose core was a hand-tuned, unrolled loop of machine instructions. Then, to buy a 70-percent speed-up:

We replaced this subroutine with a very special manifestation of insanity that commandeered both stack pointers and used pulls and pushes, and every single available register, to draw tiles upside down and horizontally broken.
Then we pre-processed the tiles so that drawing them would actually fix them.
But then – dammit! – interrupts that occurred during drawing could corrupt the reference tiles.
So, to protect the reference tiles, we corrupted the screen buffer instead.
But that corruption would be visible.
So we changed the tile-placement order to repair – on the fly – any corruption that might have occurred, before it could ever be displayed.
And it all worked!

We did that. For 70 percent.

And it was so worth it.

Tell ’em if you got ’em

Anyway, I wanted to share that story with you for two reasons.

The first is that it’s a fun story. It’s one my earliest memories of struggling with a messy computing problem and, through dogged persistence, zeroing in on a solution that was both effective and elegant. The result was powerfully satisfying.

The struggle, I learned, is worth it.

The second reason is to encourage you to tell your stories. I know that, “back in the day,” most video games probably gave rise to many stories like this one. I’d love to hear them. But too many of them are lost, having faded out of memory before anyone thought to preserve them.

If you have a story, please don’t wait. Tell it. Every day you wait, it gets harder.

Tell it.

Exercise: Prove that for all finite sequences of finite sequences, applying (reverse ⋅ concat) is the same as applying (concat ⋅ reverse ⋅ map reverse).↩︎
We also had to make sure that nothing valuable was stored in the memory locations just before a screen buffer. They could, in theory, also be corrupted if an interrupt occurred while placing the top-left septet of the top-left corner tile. This septet’s address corresponded to the start of the buffer.↩︎

Tricks of the trade: Recursion to Iteration, Part 4: The Trampoline

2013-06-12T00:00:00Z

By Tom Moertel

Posted on June 12, 2013

Tags: programming, recursion, iteration, python, recursion-to-iteration series, tail calls, data structures, trampolines

This is the fourth article in a series on converting recursive algorithms into iterative algorithms. If you haven’t read the earlier articles first, you may want to do so before continuing.

In the first article of our series, we showed that if you can convert an algorithm’s recursive calls into tail calls, you can eliminate those tail calls to create an iterative version of the algorithm using The Simple Method. In this article, we’ll look at another way to eliminate tail calls: the trampoline.

The idea behind the trampoline is this: before making a tail call, manually remove the current execution frame from the stack, eliminating stack build-up.

Execution frames and the stack

To understand why we might want to manually remove an execution frame, let’s think about what happens when we call a function. The language runtime needs some place to store housekeeping information and any local variables the function may use, so it allocates a new execution frame on the stack. Then it turns control over to the function. When the function is done, it executes a return statement. This statement tells the runtime to remove the execution frame from the stack and to give control (and any result) back to the caller.

But what if the function doesn’t return right away? What if it makes another function call instead? In that case, the runtime must create a new execution frame for that call and push it onto the stack, on top of the current frame. If the function ends up calling itself many times recursively, each call will add another frame to the stack, and pretty soon we will have eaten up a lot of stack space.

Eliminating stack build-up

To avoid this problem, some programming languages guarantee that they will recycle the current execution frame whenever a function makes a tail call. That is, if the function calls some other function (or itself recursively) and just returns that function’s result verbatim, that’s a tail call. In that case, the runtime will recycle the current function’s execution frame before transferring control to the other function, making it so that the other function will return its result directly to the original function’s caller. This process is called tail-call elimination.

But in languages like Python that don’t offer tail-call elimination, every call, even if it’s a tail call, pushes a new frame onto the stack. So if we want to prevent stack build-up, we must somehow eliminate the current frame from the stack ourselves, before making a tail call.

But how? The only obvious way to eliminate the current frame is to return to our caller. If we’re to make this work, then, the caller must be willing to help us out. That’s where the trampoline comes in. It’s our co-conspirator in the plot to eliminate stack build-up.

The trampoline

Here’s what the trampoline does:

It calls our function f, making itself the current caller.
When f wants to make a recursive tail call to itself, it returns the instruction call(f)(*args, **kwds). The language runtime dutifully removes the current execution frame from the stack and returns control to the trampoline, passing it the instruction.
The trampoline interprets the instruction and calls f back, giving it the supplied arguments, and again making itself the caller.
This process repeats until f wants to return a final result z; then it returns the new instruction result(z) instead. As before, the runtime removes the current execution frame from the stack and returns control to the trampoline.
But now when the trampoline interprets the new instruction it will return z to its caller, ending the trampoline dance.

Now you can see how the trampoline got its name. When our function uses a return statement to remove its own execution frame from the stack, the trampoline bounces control back to it with new arguments.

Here’s a simple implementation. First, we will encode our instructions to the trampoline as triples. We’ll let call(f)(*args, **kwds) be the triple (f, args, kwds), and result(z) be the triple (None, z, None):

def call(f):
    """Instruct trampoline to call f with the args that follow."""
    def g(*args, **kwds):
        return f, args, kwds
    return g

def result(value):
    """Instruct trampoline to stop iterating and return a value."""
    return None, value, None

Now we’ll create a decorator to wrap a function with a trampoline that will interpret the instructions that the function returns:

import functools

def with_trampoline(f):
    """Wrap a trampoline around a function that expects a trampoline."""
    @functools.wraps(f)
    def g(*args, **kwds):
        h = f
        # the trampoline
        while h is not None:
            h, args, kwds = h(*args, **kwds)
        return args
    return g

Note that the trampoline boils down to three lines:

while h is not None:
    h, args, kwds = h(*args, **kwds)
return args

Basically, the trampoline keeps calling whatever function is in h until that function returns a result(z) instruction, at which time the loop exits and z is returned. The original recursive tail calls have been boiled down to a while loop. Recursion has become iteration.

Example: factorial

To see how we might use this implementation, let’s return to the factorial example from the first article in our series:

def factorial(n):
    if n < 2:
        return 1
    return n * factorial(n - 1)

Step one, as before, is to tail-convert the lone recursive call:

 def factorial(n, acc=1):
     if n < 2:
         return acc
     return factorial(n - 1, acc * n)

Now we can create an equivalent function that uses trampoline idioms:

def trampoline_factorial(n, acc=1):
    if n < 2:
        return result(acc)
    return call(trampoline_factorial)(n - 1, n * acc)

Note how the return statements have been transformed.

Finally, we can wrap this function with a trampoline to get a callable version that we can use just like the original:

factorial = with_trampoline(trampoline_factorial)

Let’s take it for a spin:

>>> factorial(5)
120

To really see what’s going on, be sure to use the Online Python Tutor’s visualizer to step through the original, tail-recursive, and trampoline versions of the function. Just open this link: Visualize the execution. (ProTip: use a new tab.)

Why use the trampoline?

As I mentioned at the beginning of this article, if you can convert a function’s recursive calls into tail calls – which you must do to use a trampoline – you can also use the Simple Method to convert the function’s recursion into iteration, eliminating the calls altogether. For example, here’s what the Simple Method does to our original factorial function:

def factorial(n, acc=1):
    while n > 1:
        (n, acc) = (n - 1, acc * n)
    return acc

This version is simpler and more efficient than the trampoline version. So why not use the Simple Method always?

The answer is that the Simple Method is tricky to apply to functions that make tail calls from within loops. Recall that it introduces a loop around a function’s body and replaces recursive tail calls with continue statements. But if the function already has its own loops, replacing a tail call within one of them with a continue statement will restart that inner loop instead of the whole-body loop, as desired. In that case, you must add condition flags to make sure the right loop gets restarted, and that gets old fast. Then, using a trampoline may be a win.

That said, I almost never use trampolines. Getting a function into tail-call form is nine tenths of the battle. If I’ve gone that far already, I’ll usually go the rest of the way to get a tight, iterative version.

Why, then, did we make this effort to understand the trampoline? Two reasons. First, it’s semi-common in programming lore, so it’s best to know about it. Second, it’s a stepping stone to a more-general, more-powerful technique: continuation-passing-style expressions. That’s our subject for next time.

In the meantime, if you want another take on trampolines in Python, Kyle Miller wrote a nice article on the subject: Tail call recursion in Python.

Thanks for reading! As always, if you have questions or comments, please leave a comment on the blog or hit me at @tmoertel.

Tricks of the trade: Recursion to Iteration, Part 3: Recursive Data Structures

2013-06-03T00:00:00Z

By Tom Moertel

Posted on June 3, 2013

Tags: programming, recursion, iteration, python, recursion-to-iteration series, tail calls, data structures

This is the third article in a series on converting recursive algorithms into iterative algorithms. If any of what follows seems confusing, you may want to read the earlier articles first.

This is an extra article that I hadn’t planned. I’m writing it because in a comment on the previous article a reader asked me to show a less mathematical example and suggested tree traversal. So that’s the subject of this article: We’ll take a binary tree and flatten it into a list, first recursively, then iteratively.

The challenge

First, let’s define a binary tree to be either empty or given by a node having three parts: (1) a value, (2) a left subtree, and (3) a right subtree, where both of the subtrees are themselves binary trees. In Haskell, we might define it like so:

data BinaryTree a = Empty | Node a (BinaryTree a) (BinaryTree a)

In Python, which we’ll use for the rest of this article, we’ll say that None represents an empty tree and that the following class represents a node:

import collections
Node = collections.namedtuple('Node', 'val left right')

# some sample trees having various node counts
tree0 = None  # empty tree
tree1 = Node(5, None, None)
tree2 = Node(7, tree1, None)
tree3 = Node(7, tree1, Node(9, None, None))
tree4 = Node(2, None, tree3)
tree5 = Node(2, Node(1, None, None), tree3)

Let us now define a function to flatten a tree using an in-order traversal. The recursive definition is absurdly simple, the data type having only two cases to consider:

def flatten(bst):
    # empty case
    if bst is None:
        return []
    # node case
    return flatten(bst.left) + [bst.val] + flatten(bst.right)

A few tests to check that it does what we expect:

def check_flattener(f):
    assert f(tree0) == []
    assert f(tree1) == [5]
    assert f(tree2) == [5, 7]
    assert f(tree3) == [5, 7, 9]
    assert f(tree4) == [2, 5, 7, 9]
    assert f(tree5) == [1, 2, 5, 7, 9]
    print 'ok'

check_flattener(flatten)  # ok

Our challenge for today is to convert flatten into an iterative version. Other than a new trick – partial evaluation – the transformation is straightforward, so I’ll move quickly.

Let’s do this!

Eliminating the first recursive call

First, let’s separate the base case from the incremental work:

def step(bst):
    return flatten(bst.left) + [bst.val] + flatten(bst.right)

def flatten(bst):
    if bst is None:
        return []
    return step(bst)

And let’s break the incremental work into smaller pieces to see what’s going on.

def step(bst):
    left = flatten(bst.left)
    left.append(bst.val)
    right = flatten(bst.right)
    left.extend(right)
    return left

def flatten(bst):
    if bst is None:
        return []
    return step(bst)

Let’s try to get rid of the first recursive call by assuming that somebody has passed us its result via a secret argument left:

def step(bst, left=None):
    if left is None:
        left = flatten(bst.left)
    left.append(bst.val)
    right = flatten(bst.right)
    left.extend(right)
    return left

def flatten(bst):
    if bst is None:
        return []
    return step(bst)

And now we’ll make step return values that parallel its input arguments:

def step(bst, left=None):
    if left is None:
        left = flatten(bst.left)
    left.append(bst.val)
    right = flatten(bst.right)
    left.extend(right)
    return bst, left  # <-- add bst

def flatten(bst):
    if bst is None:
        return []
    return step(bst)[-1]  # <-- note [-1]

In the first recursive call, the transformation applied to bst is .left, so we want to apply the opposite transformation to bst in the returned values. And what’s the opposite of descending to a node’s left subtree? It’s ascending to the node’s parent. So we want something like this:

# this code does not work!

def step(bst, left=None):
    if left is None:
        left = flatten(bst.left)
    left.append(bst.val)
    right = flatten(bst.right)
    left.extend(right)
    return get_parent(bst), left  # <-- need get_parent

But we’re stuck. We can’t define get_parent because our tree data structure doesn’t keep track of parents, only children.

New plan: Maybe we can assume that someone has passed us the node’s parent and go from there?

But this plan hits the same brick wall: If we add a new argument to accept the parent, we must for parallelism add a new return value to emit the transformed parent, which is the parent of the parent. But we can’t compute the parent of the parent because, as before, we have no way of implementing get_parent.

So we do what mathematicians do when their assumptions hit a brick wall: we strengthen our assumption! Now we assume that someone has passed us all of the parents, right up to the tree’s root. And that assumption gives us what we need:

def step(bst, parents, left=None):
    if left is None:
        left = flatten(bst.left)
    left.append(bst.val)
    right = flatten(bst.right)
    left.extend(right)
    return parents[-1], parents[:-1], left

Note that we’re using the Python stack convention for parents; thus the immediate parent of bst is given by the final element parents[-1].

As a simplification, we can eliminate the bst argument by considering it the final parent pushed onto the stack:

def step(parents, left=None):
    bst = parents.pop()  # <-- bst = top of parents stack
    if left is None:
        left = flatten(bst.left)
    left.append(bst.val)
    right = flatten(bst.right)
    left.extend(right)
    return parents, left

Now that step requires the parents stack as an argument, the base function must provide it:

def flatten(bst):
    if bst is None:
        return []
    parents = [bst]
    return step(parents)[-1]

But we still haven’t eliminated the first recursive call. To do that, we’ll need to pass the step function a value for its left argument, which will cause the recursive call to be skipped.

But we only know what that value should be for one case, the base case, when bst is None; then left must be []. To get to that case from the tree’s root, where bst is definitely not None, we must iteratively replicate the normal recursive calls on bst.left until we hit the leftmost leaf node. And then, to compute the desired result, we must reverse the trip, iterating the step function until we have returned to the tree’s root, where the parents stack must be empty:

def flatten(bst):
    # find initial conditions for secret-feature "left"
    left = []
    parents = []
    while bst is not None:
        parents.append(bst)
        bst = bst.left
    # iterate to compute the result
    while parents:
        parents, left = step(parents, left)
    return left

And just like that, one of the recursive calls has been transformed into iteration. We’re halfway to the finish line!

Eliminating the second recursive call

But we still have to eliminate that final recursive call to flatten, now sequestered in step. Let’s take a closer look at that function after we make its left argument required since it always gets called with a value now:

def step(parents, left):
    bst = parents.pop()
    left.append(bst.val)
    right = flatten(bst.right)
    left.extend(right)
    return parents, left

To get rid of the recursive call to flatten, we’re going to use a new trick: partial evaluation. Basically, we’re going to replace the call to flatten with the function body of flatten, after we rename all its variables to prevent conflicts. So let’s make a copy of flatten and suffix all its variables with 1:

def flatten1(bst1):
    left1 = []
    parents1 = []
    while bst1 is not None:
        parents1.append(bst1)
        bst1 = bst1.left
    while parents1:
        parents1, left1 = step(parents1, left1)
    return left1

And then let’s make its arguments and return values explicit:

    (bst1, ) = ARGUMENTS
    left1 = []
    parents1 = []
    while bst1 is not None:
        parents1.append(bst1)
        bst1 = bst1.left
    while parents1:
        parents1, left1 = step(parents1, left1)
    RETURNS = (left1, )

And then we’ll drop this expansion into step:

def step(parents, left):
    bst = parents.pop()
    left.append(bst.val)
    # -- begin partial evaluation --
    (bst1, ) = (bst.right, )
    left1 = []
    parents1 = []
    while bst1 is not None:
        parents1.append(bst1)
        bst1 = bst1.left
    while parents1:
        parents1, left1 = step(parents1, left1)
    (right, ) = (left1, )
    # -- end partial evaluation --
    left.extend(right)
    return parents, left

Now we can eliminate code by fusion across the partial-evaluation boundary.

First up: left1. We can now see that this variable accumulates values that, in the end, get appended to left (via the return variable right). But we can just as well append those values to left directly, eliminating left1 within the boundary and the call to left.extend(right) without:

def step(parents, left):
    bst = parents.pop()
    left.append(bst.val)
    # -- begin partial evaluation --
    (bst1, ) = (bst.right, )
    # left1 = []  # <-- eliminate and use left instead
    parents1 = []
    while bst1 is not None:
        parents1.append(bst1)
        bst1 = bst1.left
    while parents1:
        parents1, left = step(parents1, left)
    # (right, ) = (left, )  # <-- eliminated
    # -- end partial evaluation --
    # left.extend(right)  # <-- eliminated
    return parents, left

For this next fusion, we’re going to need to recall our base function to get the necessary outside scope:

def step(parents, left):
    bst = parents.pop()
    left.append(bst.val)
    # -- begin partial evaluation --
    (bst1, ) = (bst.right, )
    parents1 = []
    while bst1 is not None:
        parents1.append(bst1)
        bst1 = bst1.left
    while parents1:
        parents1, left = step(parents1, left)
    # -- end partial evaluation --
    return parents, left

def flatten(bst):
    left = []
    parents = []
    while bst is not None:
        parents.append(bst)
        bst = bst.left
    while parents:
        parents, left = step(parents, left)
    return left

When flatten calls step and the code within the partially evaluated region executes, it builds up a stack of nodes parents1 and then calls step iteratively to pop values off of that stack and process them. When it’s finished, control returns to step proper, which then returns to its caller, flatten, with the values (parents, left). But look at what flatten then does with parents: it calls step iteratively to pop values off of that stack and process them in exactly the same way.

So we can eliminate the while loop in step – and the recursive call! – by returning not parents but parents + parents1, which will make the while loop in flatten do the exact same work.

def step(parents, left):
    bst = parents.pop()
    left.append(bst.val)
    # -- begin partial evaluation --
    (bst1, ) = (bst.right, )
    parents1 = []
    while bst1 is not None:
        parents1.append(bst1)
        bst1 = bst1.left
    # while parents1:                            # <-- eliminated
    #     parents1, left = step(parents1, left)  #
    # -- end partial evaluation --
    return parents + parents1, left  # parents -> parents + parents1

And then we can eliminate parents1 completely by taking the values we would have appended to it and appending them directly to parents:

def step(parents, left):
    bst = parents.pop()
    left.append(bst.val)
    # -- begin partial evaluation --
    (bst1, ) = (bst.right, )
    # parents1 = []  # <-- eliminated
    while bst1 is not None:
        parents.append(bst1)  # parents1 -> parents
        bst1 = bst1.left
    # -- end partial evaluation --
    return parents, left  # parents + parents1 -> parents

And now, once we remove our partial-evaluation scaffolding, our step function is looking simple again:

def step(parents, left):
    bst = parents.pop()
    left.append(bst.val)
    bst1 = bst.right
    while bst1 is not None:
        parents.append(bst1)
        bst1 = bst1.left
    return parents, left

For the final leg of our journey – simplification – let’s inline the step logic back into the base function:

def flatten(bst):
    left = []
    parents = []
    while bst is not None:
        parents.append(bst)
        bst = bst.left
    while parents:
        parents, left = parents, left
        bst = parents.pop()
        left.append(bst.val)
        bst1 = bst.right
        while bst1 is not None:
            parents.append(bst1)
            bst1 = bst1.left
        parents, left = parents, left
    return left

Let’s eliminate the trivial argument-binding and return-value assignments:

def flatten(bst):
    left = []
    parents = []
    while bst is not None:
        parents.append(bst)
        bst = bst.left
    while parents:
        # parents, left = parents, left  # = no-op
        bst = parents.pop()
        left.append(bst.val)
        bst1 = bst.right
        while bst1 is not None:
            parents.append(bst1)
            bst1 = bst1.left
        # parents, left = parents, left  # = no-op
    return left

And, finally, factor out the duplicated while loop into a local function:

def flatten(bst):
    left = []
    parents = []
    def descend_left(bst):
        while bst is not None:
            parents.append(bst)
            bst = bst.left
    descend_left(bst)
    while parents:
        bst = parents.pop()
        left.append(bst.val)
        descend_left(bst.right)
    return left

And that’s it! We now have a tight, efficient, and iterative version of our original function. Further, the code is close to idiomatic.

That’s it for this time. If you have any questions or comments, just hit me at @tmoertel or use the comment form below.

Thanks for reading!

Lazy merging in Python using streams

2013-05-26T00:00:00Z

By Tom Moertel

Posted on May 26, 2013

Tags: programming, python, iterators, streams, SICP, functional programming

Recently while solving a programming puzzle in Python, I needed to merge a series of N iterators, each yielding values in sorted order, into a single iterator over the sorted values. The trick is that, when asked for a value from the merged series, you must extract all N iterators’ next values to determine which is the smallest. And then, of course, you can emit only that one value. So what do you do with the remaining N – 1 values you’ve extracted?

Instead of trying to find some place to store them, perhaps it would be better to avoid the problem altogether by not extracting more values than we are prepared to emit. This we can do by converting the iterators into an equivalent form in which the next value is always exposed and hence available for making decisions before extraction. This form is basically the stream of SICP fame.

The idea is to convert each Python iterator into either None (representing an empty stream) or a pair containing the iterator’s next value and the iterator itself:

def iterator_to_stream(iterator):
    """Convert an iterator into a stream (None if the iterator is empty)."""
    try:
        return iterator.next(), iterator
    except StopIteration:
        return None

Then to extract values from the stream, you just apply stream_next to it, and it will hand you back the next value and the updated state of the stream:

def stream_next(stream):
    """Get (next_value, next_stream) from a stream."""
    val, iterator = stream
    return val, iterator_to_stream(iterator)

Since streams expose their next value, they can be ordered by that value. And for my task that was the property that made all the difference:

import heapq

def merge(iterators):
    """Make a lazy sorted iterator that merges lazy sorted iterators."""
    streams = map(iterator_to_stream, map(iter, iterators))
    heapq.heapify(streams)
    while streams:
        stream = heapq.heappop(streams)
        if stream is not None:
            val, stream = stream_next(stream)
            heapq.heappush(streams, stream)
            yield val

An example use:

>>> xs = merge([xrange(3), xrange(2, 9), xrange(5)])
>>> xs
<generator object merge at 0x7fea07c9d320>

>>> list(xs)
[0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8]

Tricks of the trade: Recursion to Iteration, Part 2: Eliminating Recursion with the Time-Traveling Secret Feature Trick

2013-05-14T00:00:00Z

By Tom Moertel

Posted on May 14, 2013

Tags: programming, recursion, iteration, python, google code jam, puzzles, recursion-to-iteration series, tail calls

This is the second post in a series on converting recursive algorithms into iterative algorithms. If you haven’t read the previous post, you probably should. It introduces some terms and background that will be helpful.

Last time, if you’ll recall, we discussed The Simple Method of converting recursive functions into iterative functions. The method, as its name implies, is straightforward and mostly mechanical. The only potential trouble is that, for the method to work, you must convert all of the recursive calls in your function into tail calls.

This task can be tricky. So we also discussed the Secret Feature trick for putting recursive calls into tail-call form. That trick works well for simple recursive calls, but when your calls aren’t so simple, you need a beefier version of the trick.

That’s the subject of this post: the Time-Traveling Secret Feature trick. It’s like sending a T-800 back in time to terminate a function’s recursiveness before it can ever offer resistance in the present.

Yeah.

But we’ll have to work up to it. So stick with the early examples to prepare for the cybernetic brain augmentations to come.

Enough talk! Let’s start with a practical example.

Update 2013-06-17: If you want another example, here’s also a step-by-step conversion of the Fibonacci function.

Computing binomial coefficients

The binomial coefficient \(\binom{n}{k}\) for integers \(n\) and \(k\) gives the number of ways to choose \(k\) items from a set of \(n\) items. For this reason, it’s often pronounced “\(n\) choose \(k\).” It’s very common in combinatorics and statistics and often pops up in the analysis of algorithms. A whole chapter, in fact, is dedicated to them in Graham, Knuth, and Patashnik’s Concrete Mathematics.

Math textbooks define the coefficient like this: \[\binom{n}{k} = \frac{n!}{k!(n-k)!},\] but that form causes all sorts of problems for computers. Fortunately, Concrete Mathematics helpfully points out a lifesaving absorption identity: \[\binom{n}{k} = \frac{n}{k}\binom{n - 1}{k - 1},\] and we know what that is, don’t we? That’s a recursive function just waiting to happen!

And that identity, along with the base case \(\binom{n}{0} = 1\), gives us today’s running example:

def binomial(n, k):
    if k == 0:
        return 1
    return n * binomial(n - 1, k - 1) // k

Before going further, a cautionary point. Math math and computer math are not the same. In math math, \(\frac{nx}{k} = \frac{n}{k}x = \frac{x}{k}n\), but in computer math the equation does not necessarily hold because of finite precision or, in our case, because we’re dealing with integer division that throws away remainders. (By the way, in Python, // is the division operator that throws away remainders.) For this reason, I have been careful to use the form

n * binomial(n - 1, k - 1) // k

instead of the more literal translation

(n // k) * binomial(n - 1, k - 1)

which, if you try it, you’ll see often produces the wrong answer.

Okay, our challenge is set before us. Ready to de-recursivify binomial?

Once again, the Secret Feature trick

First, however, we must put the function’s recursive call into tail-call form. And you remember how to do that, right? With the Secret Feature trick, of course! As a refresher from last time, here’s the trick in a nutshell:

Find a recursive call that’s not a tail call.
Identify what work is being done between that call and its return statement.
Extend the function with a secret feature to do that work, as controlled by a new accumulator argument with a default value that causes it to do nothing.
Use the secret feature to eliminate the old work.
You’ve now got a tail call!
Repeat until all recursive calls are tail calls.

Let’s go through it, by the numbers.

Find a recursive call that’s not a tail call.

Well, there’s only three lines of logic. We’re not talking rocket science here.

def binomial(n, k):
    if k == 0:
        return 1
    return n * binomial(n - 1, k - 1) // k
    #          ^^^^^^^^^^^^^^^^^^^^^^
    #               right here!

Identify what work is being done between that call and its return statement.

In our mind’s eye, let’s lift the recursive call out of the return statement and replace it with the variable x.
```
    x = binomial(n - 1, k - 1)
    return n * x // k
```
Now we can visualize the additional work as a separate function operating on x: multiplying it on the left by one number and dividing it on the right by another:
```
def work(x, lmul, rdiv):
    return lmul * x // rdiv
```
For functions this simple, you can just hold them in your head and inline them into your code as needed. But for more-complicated functions, it really does help to break them out like this. For this example, I’m going to pretend that our work function is more complicated, just to show how to do it.
Extend the function with a secret feature to do that work, as controlled by a new accumulator argument – in this case a pair (lmul, rdiv) – with a default value that causes it to do nothing.
```
def work(x, lmul, rdiv):
    return lmul * x // rdiv

def binomial(n, k, lmul=1, rdiv=1):
    if k == 0:
        return work(1, lmul, rdiv)
    return work(n * binomial(n - 1, k - 1) // k, lmul, rdiv)
```
Note that I just mechanically converted all return {whatever} statements into return work({whatever}, lmul, rdiv).

Use the secret feature to eliminate the old work.

Watch what happens to that final line.

def work(x, lmul, rdiv):
    return lmul * x // rdiv

def binomial(n, k, lmul=1, rdiv=1):
    if k == 0:
        return work(1, lmul, rdiv)
    return binomial(n - 1, k - 1, lmul * n, k * rdiv)

You’ve now got a tail call!

Indeed, we do! All that’s left is to inline the sole remaining work call:

def binomial(n, k, lmul=1, rdiv=1):
    if k == 0:
        return lmul * 1 // rdiv
    return binomial(n - 1, k - 1, lmul * n, k * rdiv)

And to simplify away the needless multiplication by 1 in the return statement:

def binomial(n, k, lmul=1, rdiv=1):
    if k == 0:
        return lmul // rdiv
    return binomial(n - 1, k - 1, lmul * n, k * rdiv)

Repeat until all recursive calls are tail calls.

There was only one, so we’re done! Yay!

With all the recursive calls (just the one) converted into tail calls, we can easily convert the function into iterative form by applying the Simple Method. Here’s what we get after we load the function body into a one-shot loop and convert the tail call into an assignment-and-continue pair.

def binomial_iter(n, k, lmul=1, rdiv=1):
    while True:
        if k == 0:
            return lmul // rdiv
        (n, k, lmul, rdiv) = (n - 1, k - 1, lmul * n, k * rdiv)
        continue
        break

As a final touch, we can tidy up and in the process tuck the accumulators inside the function body to keep them completely secret:

def binomial_iter(n, k):
    lmul = rdiv = 1
    while k > 0:
        (n, k, lmul, rdiv) = (n - 1, k - 1, lmul * n, k * rdiv)
    return lmul // rdiv

And now we have an iterative version of our original binomial function!

A short, cautionary lesson

This next part is subtle but important. To understand what’s going on, you first need to see it. So, once again, let’s use the Online Python Tutor’s excellent Python-runtime visualizer. Open the link below in a new tab if you can, and then read on.

Visualize the execution of binomial(39, 9).

Click the Forward button to advance through each step of the computation. When you get to step 22, where the recursive version has fully loaded the stack with its nested frames, click slowly. Watch the return value (in red) carefully as you advance. See how it gently climbs to the final answer of 211915132, never exceeding that value?

Now continue stepping through to the iterative version. Watch the value of lmul as you advance. See how it grows rapidly, finally reaching a whopping 76899763100160?

This difference matters. While both versions computed the correct answer, the original recursive version would do so without exceeding the capacity of 32-bit signed integers.¹ The iterative version, however, needs a comparatively hoggish 47 bits to faithfully arrive at the correct answer.

Python’s integers, lucky for us, grow as needed to hold their values, so we need not fear overflow in this case, but not all languages for which we might want to use our recursion-to-iteration techniques offer such protections. It’s something to keep in mind the next time you’re taking the recursion out of an algorithm in C:

typedef int32_t Int;

Int binomial(Int n, Int k) {
    if (k == 0)
        return 1;
    return n * binomial(n - 1, k - 1) / k;
}

Int binomial_iter(Int n, Int k) {
    Int lmul = 1, rdiv = 1;
    while (k > 0) {
        lmul *= n; rdiv *= k; n -= 1; k -= 1;
    }
    return lmul / rdiv;
}

int main(int argc, char* argv[]) {
    printf("binomial(39, 9)      = %ld\n", (long) binomial(39, 9));
    printf("binomial_iter(39, 9) = %ld\n", (long) binomial_iter(39, 9));
}

/* Output with Int = int32_t:

binomial(39, 9)      = 211915132
binomial_iter(39, 9) = -4481      <-- oops!

*/

In any case, bigger integers are slower and consume more memory. In one important way, the original, recursive algorithm was better. We have lost something important.

Now let’s get it back.

The importance of respecting associativity and commutativity

It turns out that our iterative version of binomial is the original’s evil twin. Its stack usage looks charming and all, but something else about it is off. Something hard to put our finger on. What is the difference?

It all comes down to grouping and ordering decisions. Implicit grouping and ordering decisions that we overlooked during our code transformations.

Consider how both versions compute \(\binom{5}{2}\).

First, the recursive version. We start out with \(n=5, k=2\) and then recursively expand the expression \[n\binom{n-1}{k-1}/k\] until we hit the base case of \(k=0\). The series of expansions proceeds like so:

\[\binom{5}{2} = 5{\binom{4}{1}} / 2 = 5\cdot\left(4\binom{3}{0} / 1\right) / 2 = 5\cdot(4\cdot(1)/1)/2 = 10.\]

At every step (except for the base case) we perform a multiplication by \(n\) and then a division by \(k\). It’s that division by \(k\) that keeps intermediate results from getting out of hand.

Now lets look at the iterative version. It’s not so obvious what’s going on. (Score another point for recursion over iteration!) But we can puzzle it out. We start out with both accumulators at \(1\) and then loop over the decreasing values of \(k\), building up the accumulators until \(k=0\), at which point we return the quotient of the accumulators.

So the calculation is given by

\[\binom{5}{2} = \frac{\mathit{lmul}}{\mathit{rdiv}} = \frac{((1)\cdot 5)\cdot 4}{((1)\cdot 2)\cdot 1} = 10.\]

Both the numerator and denominator grow and grow and grow until the final division. Not a good trend.

So why did the Secret Feature trick work great for factorial in our previous article but fail us, albeit subtly, now? The answer is that in factorial the extra work being done between each recursive call and its return statement was multiplication and nothing more. And multiplication (of integers that don’t overflow) is commutative and associative. Meaning, ordering and grouping don’t matter.

The lesson, then, is this: Think carefully about whether ordering and grouping matter before using the Secret Feature trick. If it matters, you have two options: One, you can modify your algorithm so that ordering and grouping don’t matter and then use the Secret Feature trick. (But this option is often intractable.) Two, you can use the Time-Traveling Secret Feature trick, which preserves ordering and grouping. And that trick is what we’ve been waiting for.

It’s time.

The Time-Traveling Secret Feature trick

What’s about to happen next is so mind-bendingly awesome that you should probably get a drink to prepare yourself. It’s like combining a T-800 from The Terminator, with the dreams-within-dreams layering from Inception, with the momentary inversion of reality that occurs when the quantum field surrounding an inductive proof collapses and everything snaps into place.

It’s. Really. Cool.

Got your drink? Okay, let’s do this.

Let’s go back to our original, recursive binomial function:

def binomial(n, k):
    if k == 0:
        return 1
    return n * binomial(n - 1, k - 1) // k

Our goal is to create an equivalent iterative version that exactly preserves the properties of the original. Well, how the hell are we going to do that?

Hold that thought for a sec.

Let’s just stop for a moment and think about what that recursive function is doing. We call it to compute some answer \(x_t\). But that answer depends on some earlier answer \(x_{t-1}\) that the function doesn’t know, so it calls itself to get that answer. And so on. Until, finally, it finds one concrete answer \(x_0\) that it actually knows and, from which, it can build every subsequent answer.

So, in truth, our answer \(x_t\) is just the final element in a whole timeline of needed earlier answers:

\[x_0, x_1, \ldots x_t.\]

Well, so what? Why should we care?

Because we’ve watched The Terminator about six hundred times! We know how to get rid of a problem in the present when we’ve seen its timeline: We send a Terminator into the past to rewrite that timeline so the problem never gets created in the first place.

And our little recursive problem with binomial here? We’ve seen its timeline.

That’s right. It’s about to get real.

One more thing. Every single one of these steps preserves the original function’s behavior. You can run your unit tests after every step, and they should all pass.

Let’s begin.

Send a T-800 terminator unit into the function’s timeline, back to the time of \(x_0\).

Oh, we don’t have a T-800 terminator unit? No problem. Assume we’ve sent a T-800 terminator unit into the function’s timeline, back to the time of \(x_0\).

That was easy. What’s next?
Move the recursive call and surrounding work into a helper function.

This might seem like overkill but prevents mistakes. Do you make mistakes? Then just make the helper function already. I’m calling it step for reasons that will shortly become obvious.
```
def step(n, k):
    return n * binomial(n - 1, k - 1) // k

def binomial(n, k):
    if k == 0:
        return 1
    return step(n, k)
```

Partition the helper function into its 3 fundamental parts.

They are:

the recursive call to get the previous answer \(x_{i-1}\),
the work to compute the current answer \(x_i\) from \(x_{i-1}\), and
a return statement applied to \(x_i\) alone:

def step(n, k):
    previous_x = binomial(n - 1, k - 1)  # part (1)
    x = n * previous_x // k              # part (2)
    return x                             # part (3)

def binomial(n, k):
    if k == 0:
        return 1
    return step(n, k)

Secretly prepare to receive communications from the T-800 terminator unit.

You see, once the T-800 gets its cybernetic hands on earlier values in the timeline, we can use its knowledge to eliminate the recursion. We just need some way to get that knowledge from the T-800, now stuck in the past, to us, stuck in the present.

Fortunately, we’ve seen time-travel movies, so we know exactly how this problem is solved. We’ll just use a dead drop! That’s a prearranged secret location where the T-800 will drop values in the past and where we will check for them in the future.

So, per prior arrangement with the T-800, we’ll extend our helper function with a secret feature that checks the dead drop for a previous value \(x_{i-1}\) and, if one’s there, uses it to – muahahahaha! – break the recursive call chain:
```
def step(n, k, previous_x=None):             # <- new stuff
    if previous_x is None:                   # <
        previous_x = binomial(n - 1, k - 1)  # <
    x = n * previous_x // k
    return x

def binomial(n, k):
    if k == 0:
        return 1
    return step(n, k)
```
Ordinarily, we would think of the helper as a function of \(n_i\) and \(k_i\) that computes \(x_i\). That’s because the \(x_{i-1}\) argument can safely be ignored. It has no effect unless the T-800 drops a value into it.

But if the T-800 does drop a value into it, we can see the function as representing the transformation

\[(n_i, k_i, x_{i-1}) \to x_i\]

Note that \(x_{i-1}\) goes in and \(x_i\) comes out. And from that little kernel of power we can do some seriously crazy stuff. For example: reverse the backward flow of time. That is, compute forward through the recursive timeline instead of backward.

Yeah.

Read on.

Modify the helper’s return statement to also pass through its non-secret arguments.

def step(n, k, previous_x=None):
    if previous_x is None:
        previous_x = binomial(n - 1, k - 1)
    x = n * previous_x // k
    return (n, k, x)  # <-- here

def binomial(n, k):
    if k == 0:
        return 1
    return step(n, k)[2]  # <-- note also

Now our helper represents the transformation

\[(n_i, k_i, x_{i-1}) \to (n_i, k_i, x_i)\]

Interesting.

Apply, to the non-secret arguments in the helper’s return statement, the opposite of the transformations applied to them in the recursive call.

Since we passed \(n-1,k-1\) into the recursive call, we’ll pass \(n+1,k+1\) out through the return statement:
```
def step(n, k, previous_x=None):
    if previous_x is None:
        previous_x = binomial(n - 1, k - 1)  # <-- look here
    x = n * previous_x // k
    return (n + 1, k + 1, x)  # <-- apply opposite here

def binomial(n, k):
    if k == 0:
        return 1
    return step(n, k)[2]
```
Now our helper represents the transformation

\[(n_i, k_i, x_{i-1}) \to (n_{i+1}, k_{i+1}, x_i)\]

And now we can reverse the flow of time.

When we started out with our original recursive function, if we asked it for \(x_t\), the function had to go back in the timeline to get \(x_{t-1}\). And to get \(x_{t-1}\), it had to go back even further to get \(x_{t-2}\). And so on, chewing up stack every backward step of the way to \(x_0\). It was heartbreaking, watching it work like that.

But now, we can step the other way. If we have any (\(n_i, k_i, x_{i-1})\) we can get \(x_i\) straight away, no recursion required. In goes \(x_{i-1}\), out comes \(x_i\). We can go forward.

Why, if we knew \(n_1\), \(k_1\), and \(x_0\), we could compute the entire series \(x_0, x_1, \ldots x_t\) with zero recursion by starting at \(i=0\) and working forward.

So let’s get those values!
Determine the initial conditions at the start of the timeline.

For this simple example, most of you can probably determine the initial conditions by inspection. But I’m going to go through the process anyway. You never know when you might need it. So:

What’s the start of the timeline? It’s when the recursive binomial function calls itself so many times that it finally hits one of its base cases, which defines the first entry in the timeline, anchoring the timeline at time \(i = 0\). The base cases in this example are easy to find since we’ve already split out the step logic; all that’s left in binomial is the base-case logic. It’s easy to see that there is only one base case, and it’s when \(k=0\):
```
def step(n, k, previous_x=None):
    if previous_x is None:
        previous_x = binomial(n - 1, k - 1)
    x = n * previous_x // k
    return (n + 1, k + 1, x)

def binomial(n, k):
    if k == 0:    # <-- base case
        return 1
    return step(n, k)[2]
```
From this base case we know that at the very start of the timeline we must have \(i=0\), \(k=0\), and \(x=1\). Thus we know that \(k_0=0\) and \(x_0=1\). And because the timeline ends at time \(i = t\), and we know that \((n_t, k_t) = (n, k)\), the original arguments binomial was called with.

Let’s visualize what we know about the timeline so far:

time \(i\): \(0\) \(1\) \(2\) \(\cdots\) \(t\)

\(n_i\): – – – \(\cdots\) \(n\)

\(k_i\): \(0\) – – \(\cdots\) \(k\)

\(x_i\): \(1\) – – \(\cdots\) ?

The answer we ultimately seek is \(x_t\), marked by a question mark in the timeline.

Now to determine \(n_1\) and \(k_1\). To do that, we must answer this question: How do we go from \((n_i, k_i)\) to \((n_{i+1}, k_{i+1})\)?

We already know the answer! Our step helper computes \((n_i, k_i, x_{i-1}) \to (n_{i+1}, k_{i+1}, x_i)\). So look at its logic: How does it transform its \(n\) and \(k\) arguments? It adds one to them. Thus,

\[\begin{eqnarray*} n_{i+1} & = & n_i + 1,\\ k_{i+1} & = & k_i + 1. \end{eqnarray*}\]

Therefore, \[k_1 = k_0 + 1 = 0 + 1 = 1.\] Since \(k_i\) is \(i\) steps from \(k_0 = 0\) and each step adds \(+1\) we have \[k_i = k_0 + i (+1) = i.\] And when \(i=t\) we know from the equation above that \(t = k_{t} = k\).

Finally, we can compute \(n_1\), which is \(t-1\) reverse steps from \(n_t=n\), the only \(n_i\) that we know so far: \[n_1 = n_t - (t - 1) (+1) = n - k + 1.\]

And now we have our initial conditions:

\[(n_1, k_1, x_0) = (n - k + 1, 1, 0).\]

So now our knowledge of the timeline looks like this:

time \(i\): \(0\) \(1\) \(2\) \(\cdots\) \(t=k\)

\(n_i\): \(n-k\) \(n-k+1\) – \(\cdots\) \(n\)

\(k_i\): \(0\) \(1\) – \(\cdots\) \(k\)

\(x_i\): \(1\) – – \(\cdots\) ?

And with this knowledge, we can step forward through the timeline, from time \(i=1\) to time \(i=2\), and so on, until finally, at the last step when \(i=t\) we will have determined \(x_t\), our desired answer!

Let’s do it!

time \(i\):	\(0\)	\(1\)	\(2\)	\(\cdots\)	\(t\)
\(n_i\):	–	–	–	\(\cdots\)	\(n\)
\(k_i\):	\(0\)	–	–	\(\cdots\)	\(k\)
\(x_i\):	\(1\)	–	–	\(\cdots\)	?

time \(i\):	\(0\)	\(1\)	\(2\)	\(\cdots\)	\(t=k\)
\(n_i\):	\(n-k\)	\(n-k+1\)	–	\(\cdots\)	\(n\)
\(k_i\):	\(0\)	\(1\)	–	\(\cdots\)	\(k\)
\(x_i\):	\(1\)	–	–	\(\cdots\)	?

In the main function, iterate the step helper \(t\) times, starting from the initial conditions. Then return \(x_t\).

def step(n, k, previous_x=None):
    if previous_x is None:
        previous_x = binomial(n - 1, k - 1)
    x = n * previous_x // k
    return (n + 1, k + 1, x)

def binomial(n, k):
    if k == 0:
        return 1
    t = k                                            # <- new
    (n, k, previous_x) = (n - k + 1, 1, 1)           #
    for _i in xrange(1, t + 1):                      #
        (n, k, previous_x) = step(n, k, previous_x)  #
    return previous_x  # = x_t                       #

Boom! That’s 100% iterative. But there’s more!

Remove the base cases from the original function.

Since our iterations start from a base case, the base cases are already incorporated into our new, iterative logic.

def step(n, k, previous_x=None):
    if previous_x is None:
        previous_x = binomial(n - 1, k - 1)
    x = n * previous_x // k
    return (n + 1, k + 1, x)

def binomial(n, k):
    # if k == 0:   # <- remove these lines
    #    return 1  #
    t = k
    (n, k, previous_x) = (n - k + 1, 1, 1)
    for _i in xrange(1, t + 1):
        (n, k, previous_x) = step(n, k, previous_x)
    return previous_x  # = x_t

Boom again!

Remove the secret feature from the step function.

Since previous_x always gets a value now, we can make it a required part of the function.

def step(n, k, previous_x):  # <- remove default value for previous_x
    # if previous_x is None:                  # <- remove these 2 lines
    #    previous_x = binomial(n - 1, k - 1)  #
    x = n * previous_x // k
    return (n + 1, k + 1, x)

def binomial(n, k):
    t = k
    (n, k, previous_x) = (n - k + 1, 1, 1)
    for _i in xrange(1, t + 1):
        (n, k, previous_x) = step(n, k, previous_x)
    return previous_x  # = x_t

Inline the step function.

We’re on it!

def binomial(n, k):
    t = k
    (n, k, previous_x) = (n - k + 1, 1, 1)
    for _i in xrange(1, t + 1):
        x = n * previous_x // k
        (n, k, previous_x) = (n + 1, k + 1, x)
    return previous_x  # = x_t

Apply finishing touches.

This example is pretty tight already, but we can substitute away the x variable.

And, because \(k_i = i\), we can replace the for loop’s induction variable _i with \(k_i\) and correspondingly eliminate \(k_i\) from the step logic.

And, while we’re making stuff better, there’s an obvious optimization. One property of binomial coefficients is that \(\binom{n}{k} = \binom{n}{n-k}\). One property of our code is that it runs for \(t=k\) steps. So when \(k > n-k\), we can reduce the number of steps by solving for \(\binom{n}{n-k}\) instead. Let’s add a couple of lines at the start of the logic to claim this optimization.

And, of course, let’s add a docstring – we’re nice like that.

The final result:
```
def binomial(n, k):
    """Compute binomial coefficient C(n, k) = n! / (k! * (n-k)!)."""
    if k > n - k:
        k = n - k   # 'pute C(n, n-k) instead if it's easier
    t = k
    (n, previous_x) = (n - k + 1, 1)
    for k in xrange(1, t + 1):
        (n, previous_x) = (n + 1, n * previous_x // k)
    return previous_x  # = x_t
```

Let’s review what we just did. It’s mind blowing:

We sent an imaginary T-800 terminator unit back into the function’s timeline.
Then we added a secret feature to our function that allowed the T-800 to send us values from the timeline’s past.
Then we used those values to break the recursive chain, reverse the backward flow of time, and compute the timeline forward and iteratively instead of backward and recursively.
The function then being fully iterative, we removed the secret feature, and the imaginary T-800 winked out of existence. (In an ironic twist of fate, the T-800, via the secret-feature dead drop, had delivered into our hands precisely the knowledge we needed to write both of them out of history!)
And now we have a fast, efficient, and property-preserving iterative version of our original recursive function, and it’s about as good as anything we could hope to conjure up from scratch. (See it in action – just one stack frame and easy on the ints, too.)
And – most importantly – we did it all using a mechanical, repeatable process!

Thanks!

Well, that’s it for this installment. I hope you enjoyed reading it as much as I did writing it. If you liked it (or didn’t), or if you found a mistake, or especially if you can think of any way to help make my explanations better, let me know. Just post a comment or fire me a tweet at @tmoertel.

Until next time, keep your brain recursive and your Python code iterative.

In the visualization, you can’t actually see the largest integer produced by the recursive version’s computation. It’s produced between steps 32 and 33, when the return value from step 32 is multiplied by step 33’s \(n=39\) to arrive at an integer of \(\log_2(39 \cdot 48903492) \approx 30.8\) bits, which is then immediately divided by \(k=9\) to produce the final answer.↩︎

Tricks of the trade: Recursion to Iteration, Part 1: The Simple Method, secret features, and accumulators

2013-05-11T00:00:00Z

By Tom Moertel

Posted on May 11, 2013

Tags: programming, recursion, iteration, python, google code jam, puzzles, recursion-to-iteration series

Alternative title: I wish Python had tail-call elimination.

Recursive programming is powerful because it maps so easily to proof by induction, making it easy to design algorithms and prove them correct.

But recursion is poorly supported by many popular programming languages. Drop a large input into a recursive algorithm in Python, and you’ll probably hit the runtime’s recursion limit. Raise the limit, and you may run out of stack space and segfault.

These are not happy outcomes.

Therefore, an important trick of the trade is knowing how to translate recursive algorithms into iterative algorithms. That way, you can design, prove, and initially code your algorithms in the almighty realm of recursion. Then, after you’ve got things just the way you want them, you can translate your algorithms into equivalent iterative forms through a series of mechanical steps. You can prove your cake and run it in Python, too.

This topic – turning recursion into iteration – is fascinating enough that I’m going to do a series of posts on it. Tail calls, trampolines, continuation-passing style – and more. Lots of good stuff.

For today, though, let’s just look at one simple method and one supporting trick.

The Simple Method

This translation method works on many simple recursive functions. When it works, it works well, and the results are lean and fast. I generally try it first and consider more complicated methods only when it fails.

In a nutshell:

Study the function.
Convert all recursive calls into tail calls. (If you can’t, stop. Try another method.)
Introduce a one-shot loop around the function body.
Convert tail calls into continue statements.
Tidy up.

An important property of this method is that it’s incrementally correct – after every step you have a function that’s equivalent to the original. So if you have unit tests, you can run them after each and every step to make sure you didn’t make a mistake.

Let’s see the method in action.

Example: factorial

With a function this simple, we could probably go straight to the iterative version without using any techniques, just a little noggin power. But the point here is to develop a mechanical process that we can trust when our functions aren’t so simple or our noggins aren’t so powered. So we’re going to work on a really simple function so that we can focus on the process.

Ready? Then let’s show these guys how cyber-commandos get it done! Mark IV style!

Study the original function.

def factorial(n):
    if n < 2:
        return 1
    return n * factorial(n - 1)

Nothing scary here. Just one recursive call. We can do this!

Convert recursive calls into tail calls.
```
 def factorial1a(n, acc=1):
     if n < 2:
         return 1 * acc
     return factorial1a(n - 1, acc * n)
```
(If this step seemed confusing, see the Bonus Explainer at the end of the article for the “secret feature” trick behind the step.)
Introduce a one-shot loop around the function body. You want while True: body ; break.
```
def factorial1b(n, acc=1):
    while True:
        if n < 2:
            return 1 * acc
        return factorial1b(n - 1, acc * n)
        break
```
Yes, I know putting a break after a return is crazy, but do it anyway. Clean-up comes later. For now, we’re going by the numbers.
Replace all recursive tail calls f(x=x1, y=y1, ...) with (x, y, ...) = (x1, y1, ...); continue. Be sure to update all arguments in the assignment.
```
def factorial1c(n, acc=1):
    while True:
        if n < 2:
            return 1 * acc
        (n, acc) = (n - 1, acc * n)
        continue
        break
```
For this step, I copy the original function’s argument list, parentheses and all, and paste it over the return keyword. Less chance to screw something up that way. It’s all about being mechanical.
Tidy the code and make it more idiomatic.
```
def factorial1d(n, acc=1):
    while n > 1:
        (n, acc) = (n - 1, acc * n)
    return acc
```
Okay. This step is less about mechanics and more about style. Eliminate the cruft. Simplify. Make it sparkle.
That’s it. You’re finished!

What have we gained?

We just did five steps of work to convert our original, recursive factorial function into the equivalent, iterative factorial1d function. If our programming language had supported tail-call elimination, we could have stopped at step two with factorial1a. But nooooooo… We had to press on, all the way through step five, because we’re using Python. It’s almost enough to make a cyber-commando punch a kitten.

No, the work wasn’t hard, but it was still work. So what did it buy us?

To see what it bought us, let’s look inside the Python run-time environment. We’ll use the Online Python Tutor’s visualizer to observe the build-up of stack frames as factorial, factorial1a, and factorial1d each compute the factorial of 5.

This is very cool, so don’t miss the link: Visualize It! (ProTip: Open it in a new tab.)

Click the “Forward” button to step through the execution of the functions. Notice what happens in the Frames column. When factorial is computing the factorial of 5, five frames build up on the stack. Not a coincidence.

Same thing for our tail-recursive factorial1a. (You’re right. It is tragic.)

But not for our iterative wonder factorial1d. It uses just one stack frame, over and over, until it’s done. That’s economy!

That’s why we did the work. Economy. We converted O(n) stack use into O(1) stack use. When n could be large, that savings matters. It could be the difference between getting an answer and getting a segfault.

Not-so-simple cases

Okay, so we tackled factorial. But that was an easy one. What if your function isn’t so easy? Then it’s time for more advanced methods.

That’s our topic for next time.

Until then, keep your brain recursive and your Python code iterative.

Bonus Explainer: Using secret features for tail-call conversion

In step 2 of The Simple Method, we converted the recursive call in this code:

def factorial(n):
    if n < 2:
        return 1
    return n * factorial(n - 1)  # <-- right here!

into the recursive tail call in this code:

 def factorial(n, acc=1):
     if n < 2:
         return 1 * acc
     return factorial(n - 1, acc * n)  # <-- oh, yeah!

This conversion is easy once you get the hang of it, but the first few times you see it, it seems like magic. So let’s take this one step by step.

First, the problem. We want to get rid of the n * in the following code:

    return n * factorial(n - 1)

That n * stands between our recursive call to factorial and the return keyword. In other words, the code is actually equivalent to the following:

    x = factorial(n - 1)
    result = n * x
    return result

That is, our code has to call the factorial function, await its result (x), and then do something with that result (multiply it by n) before it can return its result. It’s that pesky intermediate doing something we must get rid of. We want nothing but the recursive call to factorial in the return statement.

So how do we get rid of that multiplication?

Here’s the trick. We extend our function with a multiplication feature and use it to do the multiplication for us.

Shh! It’s a secret feature, though, just for us. Nobody else.

In essence, every call to our extended function will not only compute a factorial but also (secretly) multiply that factorial by whatever extra value we give it. The variables that hold these extra values are often called “accumulators,” so I use the name acc here as a nod to tradition.

So here’s our function, freshly extended:

def factorial(n, acc=1):
    if n < 2:
        return acc * 1
    return acc * n * factorial(n - 1)

See what I did to add the secret multiplication feature? Two things.

First, I took the original function and added an additional argument acc, the multiplier. Note that it defaults to 1 so that it has no effect until we give it some other value (since \(1 \cdot x = x\)). Gotta keep the secret secret, after all.

Second, I changed every single return statement from return {whatever} to return acc * {whatever}. Whenever our function would have returned x, it now returns acc * x. And that’s it. Our secret feature is done! And it’s trivial to prove correct. (In fact, we just did prove it correct! Re-read the second sentence.)

Both changes were entirely mechanical, hard to screw up, and, together, default to doing nothing. These are the properties you want when adding secret features to your functions.

Okay. Now we have a function that computes the factorial of n and, secretly, multiplies it by acc.

Now let’s return to that troublesome line, but in our newly extended function:

    return acc * n * factorial(n - 1)

It computes the factorial of n - 1 and then multiplies it by acc * n. But wait! We don’t need to do that multiplication ourselves. Not anymore. Now we can ask our extended factorial function to do it for us, using the secret feature.

So we can rewrite the troublesome line as

    return factorial(n - 1, acc * n)

And that’s a tail call!

So our tail-call version of the factorial function is this:

def factorial(n, acc=1):
    if n < 2:
        return acc * 1
    return factorial(n - 1, acc * n)

And now that all our recursive calls are tail calls – there was only the one – this function is easy to convert into iterative form using The Simple Method described in the main article.

Let’s review the Secret Feature trick for making recursive calls into tail calls. By the numbers:

Find a recursive call that’s not a tail call.
Identify what work is being done between that call and its return statement.
Extend the function with a secret feature to do that work, as controlled by a new accumulator argument with a default value that causes it to do nothing.
Use the secret feature to eliminate the old work.
You’ve now got a tail call!
Repeat until all recursive calls are tail calls.

With a little practice, it becomes second nature. So…

Exercise: Get your practice on!

You mission is to get rid of the recursion in the following function. Feel like you can handle it? Then just fork the exercise repo and do your stuff to exercise1.py.

def find_val_or_next_smallest(bst, x):
    """Get the greatest value <= x in a binary search tree.

    Returns None if no such value can be found.

    """
    if bst is None:
        return None
    elif bst.val == x:
        return x
    elif bst.val > x:
        return find_val_or_next_smallest(bst.left, x)
    else:
        right_best = find_val_or_next_smallest(bst.right, x)
        if right_best is None:
            return bst.val
        return right_best

Have fun!

Odds and the evidence of a single coin toss

2013-03-04T00:00:00Z

By Tom Moertel

Posted on March 4, 2013

Tags: probability, odds, bayesian, coin-toss-problem, reasoning, evidence

A while ago I wrote about how to update your beliefs in light of the evidence contained within a single coin toss. My article ended with the tantalizing prospect of an easier way to do such updates using odds. That prospect is what we’ll explore in this article.

First, we’ll be using the same probability notation as before. If you’re unfamiliar with it, go back and review the earlier article.

Second, if you’re reading this article from a feed, there’s a good chance that the math is going to look screwy because MathML doesn’t travel well. If so, just click through to the original article for a proper rendering.

Now, to the good stuff.

Introducing odds

The odds on a proposition \(X\), written \(O(X)\), is just the probability that \(X\) is true relative to the probability that it is false: \[O(X) = \frac{P(X)}{P(\neg X)} = \frac{P(X)}{1 - P(X)}\]

Thus given a probability \(p\) we can compute the equivalent odds \(o = p/(1-p)\). This formula maps (monotonically) the intuitive probability scale of \(p \in [0, 1]\) to the seemingly bizarre odds scale of \(o \in [0, \infty]\). Why would we want to do this? The short answer: odds are often easier to work with.

But before we see how odds can make calculations easier, let’s try to gain some intuition about what they represent. Here are some common degrees of belief encoded as both probabilities and odds:

degree of belief	probability	odds
I’m certain it’s false	\(p=0\)	\(o=0\)
I’m certain it’s true	\(p=1\)	\(o=\infty\)
I have no idea	\(p=1/2\)	\(o=1\)

For historical reasons, you’ll often see odds written as a ratio \(n:m\). These ratios can be normalized to the form \(o:1\) where \(o = n/m\). When \(n=m\), the odds are said to be “even” or “one-to-one” or “fifty-fifty”; all such ratios normalize to \(o=1\).

Updating beliefs using odds

If you’ll recall, Bayes’ rule tells us that the plausibility of a proposition \(X\) in light of some new evidence \(E\) is just the proposition’s prior plausibility multiplied by an adjustment factor for the new evidence:

\[(\mbox{new plausibility}) = (\mbox{old plausibility}) \times (\mbox{evidence adjustment})\]

In terms of probabilities, it looks like this:

\[P(X \mid E) = P(X) \times {P(E \mid X)}/{P(E)}\]

What about in terms of odds? Since odds relate the probabilities of \(X\) and \(\neg X\), let’s consider what happens to the probability of \(\neg X\) when we learn about new evidence \(E\). As for \(X\), it’s just a straightforward application of Bayes’ rule:

\[P(\neg X \mid E) = P(\neg X) \times {P(E \mid \neg X)}/{P(E)}\]

Now let’s divide the first equation by the second:

\[\frac{P(X \mid E)}{P(\neg X \mid E)} = \frac{P(X) \times {P(E \mid X)}/{P(E)}}{P(\neg X) \times {P(E \mid \neg X)}/{P(E)}}\]

Breaking up the terms gives us pieces we can begin to recognize:

\[\frac{P(X \mid E)}{P(\neg X \mid E)} = \frac{P(X)}{P(\neg X)} \times \frac{P(E \mid X)}{P(E \mid \neg X)} \times \frac{1/P(E)}{1/P(E)}\]

The term on the left side and the initial term on the right side are now easily recognized as odds. The final term on the right side, involving our prior beliefs about \(E\), helpfully drops out altogether. (This is a big win, as we’ll see later.) What we’re left with is simple and easy to interpret:

\[O(X \mid E) = O(X) \times \frac{P(E \mid X)}{P(E \mid \neg X)}\]

In other words, it’s our familiar belief-updating rule, just in terms of odds:

\[(\mbox{new odds}) = (\mbox{old odds}) \times (\mbox{evidence adjustment})\]

But now the evidence adjustment is a ratio of two likelihoods (and is often called a “likelihood ratio”):

adjustment for probabilities	adjustment for odds
\[\frac{P(E \mid X)}{P(E)}\]	\[\frac{P(E \mid X)}{P(E \mid \neg X)}\]

Note the difference in denominators. It’s often easier to come by \(P(E \mid \neg X)\) than \(P(E)\). As a case in point, let’s return to our earlier coin-toss problem.

Using odds to update our beliefs in the coin-toss problem

If you’ll recall, I had claimed \(S\), that I had a special coin that always comes up heads. You then observed evidence \(H\), that when you flipped the coin it did indeed come up heads. My question to you was this: How much more should you believe \(S\) in light of \(H\)?

This “how much more” is exactly what our evidence adjustment represents. In terms of odds, it’s just the likelihood ratio

\[\frac{P(H \mid S)}{P(H \mid \neg S)}.\]

If the coin is special, we know it will come up heads; thus the numerator is \(1\). But if it’s not special, we have no reason to believe it’s more likely to come up heads or tails; thus the denominator is \(1/2\). And that’s all we need:

\[\frac{P(H \mid S)}{P(H \mid \neg S)} = \frac{1}{1/2} = 2.\]

Problem solved: In light of the coin coming up heads, we should double our odds on the coin being special.

Chaining updates

Notice that in this case the evidence adjustment is a constant multiplier of \(2\). In other words, regardless of our prior odds that the coin is special, when we see the coin come up heads, we should double our odds.

Exercise: What if you flip the coin three times and get three heads? How should this outcome affect your beliefs about \(S\)?

Solution: This three-flip experiment is just the one-flip experiment repeated three times. And each time we know what to do: double our odds on \(S\). Thus observing the three-heads outcome should cause us to scale our prior odds by \(2\times 2\times 2 = 2^3 = 8\).

Using odds, our belief updates become trivial. If the coin continues to come up heads in \(100\) follow-up experiments, we just scale the current odds by \(2^{100}\).

In contrast, recall that when we had solved the problem using probabilities, the evidence adjustment was not constant; it depended upon our prior probability of \(S\). If our prior probability was \(p\), our evidence adjustment was \(2/(1 + p)\). This dependence makes it harder to answer questions about repeated experiments because each experiment affects our current probability \(p\) and thus alters our evidence adjustment. No more easy chaining.

So using odds in this case is a big win (and will often be when testing binary hypotheses).

The point

Odds and probabilities are two ways of encoding degrees of belief. They are equivalent but have different effects on calculation. Using one or the other will often make your life easier, so know how to use both.

P.S. If you find this stuff interesting, the best formal treatment I’ve found is Probability Theory: The Logic of Science, a delightful book by E. T. Jaynes. You can read the first three chapters online. The first chapter is especially rewarding; in it Jaynes builds the logic of “plausible reasoning” from the ground up.