In a previous post I went into the mathematical details of the Lindy effect: power law distributions and so on. The key fact we need for this post is that if something has the kind of survival distribution described by the Lindy effect, then its expected future lifetime equals its current age. For example, a book that has been in print for a century is expected to be in print for another century.

Note that this is all framed in terms of **probability distributions**. It’s not saying, for example, that everything new will go away soon. Everything was once new. Someone watching Hamlet on opening night would be right to speculate that nobody would care about Hamlet in a few years. But now that we know Hamlet has been around for four centuries and is going strong, the Lindy effect would predict that people will be performing Hamlet in the 25th century.

Note that **Lindy takes nothing into account except survival to date**. Someone might have been more bullish on Hamlet after opening night based on other information such as how well the play was received, but that’s beyond the scope of the Lindy effect.

If we apply the Lindy effect to **programming languages**, we only consider how long they’ve been around and whether they’re thriving today. You might think that Go, for example, will be around for a long time based on Google’s enormous influence, but the Lindy effect does not take such information into account.

So, if we assume the Lindy effect holds, here are the expected years when programming languages would die. (How exactly would you define the time of death for a programming language? Doesn’t really matter. This isn’t that precise or that serious.)

Language | Born | Expected death |
---|---|---|
Go | 2009 | 2025 |
C# | 2000 | 2034 |
Java | 1995 | 2039 |
Python | 1991 | 2043 |
Haskell | 1990 | 2044 |
C | 1972 | 2062 |
Lisp | 1959 | 2075 |
Fortran | 1957 | 2077 |
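The table is just an application of the rule that expected future lifetime equals current age. A minimal sketch (taking 2017, the year of this post, as the current year):

```python
def lindy_expected_death(born, current=2017):
    # Expected future lifetime equals current age,
    # so expected death year = current + (current - born).
    return 2 * current - born

print(lindy_expected_death(2009))  # Go -> 2025
print(lindy_expected_death(1957))  # Fortran -> 2077
```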

You can debate what it even means for a language to survive. For example, I’d consider Lisp to be alive and well if in the future people are programming Clojure but not Common Lisp, though others might disagree.

“We don’t know what language engineers will be coding in in the year 2100. However, we do know that it will be called FORTRAN.” — C.A.R. Hoare

A couple of times I’ve used the following LCG (linear congruential random number generator) in examples. An LCG starts with an initial value of *z* and updates *z* at each step by multiplying by a constant *a* and taking the remainder by *m*. The particular LCG I’ve used has *a* = 742938285 and *m* = 2^{31} – 1 = 2147483647.

(I used these parameters because they have maximum range, i.e. every positive integer less than *m* appears eventually, and because it was one of the LCGs recommended in an article years ago. That is, it’s very good as far as 32-bit LCGs go, which isn’t very far. An earlier post shows how it quickly fails the PractRand test suite.)

Let’s pick the seed so that the 100th output of the generator is today’s date in ISO form: 20170816.

We need to solve

*a*^{100}*z* = 20170816 mod 2147483647.

By reducing *a*^{100} mod 2147483647 we find we need to solve

160159497 *z* = 20170816 mod 2147483647

and the solution is *z* = 1898888478. (See How to solve linear congruences.)
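Since *m* = 2³¹ − 1 is prime, the congruence can be solved with a modular inverse. Here's a sketch in Python, assuming Python 3.8+ for the modular inverse via `pow(c, -1, m)`:

```python
a = 742938285
m = 2**31 - 1          # a Mersenne prime, so inverses mod m exist
target = 20170816      # desired 100th output

c = pow(a, 100, m)               # reduce a^100 mod m
z = target * pow(c, -1, m) % m   # solve c*z = target (mod m)

print(c, z)
```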

The following Python code verifies that the solution works.

```python
a = 742938285
z = 1898888478
m = 2**31 - 1

for _ in range(100):
    z = a*z % m

print(z)
```

In my recent correspondence with Melissa O’Neill, she gave me an example that seeds a random number generator so that the 9th and 10th outputs produce the ASCII code for my name.

Here’s the code. The function `next` is the xoroshiro128+ (XOR/rotate/shift/rotate) random number generator. The global array `s` contains the state of the random number generator. Its initial values are the seeds of the generator.

```cpp
#include <cstdio>
#include <cstdint>

// xoroshiro generator taken from
// http://vigna.di.unimi.it/xorshift/xoroshiro128plus.c

uint64_t s[2];

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

uint64_t next(void) {
    const uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    const uint64_t result = s0 + s1;

    s1 ^= s0;
    s[0] = rotl(s0, 55) ^ s1 ^ (s1 << 14); // a, b
    s[1] = rotl(s1, 36); // c

    return result;
}

int main() {
    freopen(NULL, "wb", stdout);

    s[0] = 0x084f31240ed2ec3f;
    s[1] = 0x0aa0d69470975eb8;

    while (1) {
        uint64_t value = next();
        fwrite((void*) &value, sizeof(value), 1, stdout);
    }
}
```

Compile this code then look at a hex dump of the first few outputs. Here’s what you get:

```
cook@mac> gcc xoroshiro.cpp
cook@mac> ./a.out | hexdump -C | head
f7 4a 6a 7f b8 07 f0 12 f8 96 e1 af 29 08 e3 c8  |.Jj.........)...|
15 0e b0 38 01 ef b2 a7 bb e9 6f 4d 77 55 d7 a0  |...8......oMwU..|
49 08 f2 fc 0c b2 ea e8 48 c2 89 1b 31 3f d7 3d  |I.......H...1?.=|
11 eb bc 5e ee 98 b6 3b d9 d1 cc 15 ae 00 fc 2f  |...^...;......./|
3e 20 4a 6f 68 6e 20 44 2e 20 43 6f 6f 6b 20 3c  |> John D. Cook <|
d1 80 49 27 3e 25 c2 4b 2a e3 78 71 9c 9e f7 18  |..I'>%.K*.xq....|
0b bb 1f c6 1c 71 79 29 d6 45 81 47 3b 88 4f 42  |.....qy).E.G;.OB|
7c 1c 19 c4 22 58 51 2d d7 74 69 ac 36 6f e0 3f  ||..."XQ-.ti.6o.?|
78 7f a4 14 1c bc a8 de 17 e3 f7 d8 0c de 2c ea  |x.............,.|
a2 37 83 f9 38 e4 14 77 e3 6a c8 52 d1 0c 29 01  |.7..8..w.j.R..).|
```

(I cut out the line numbers on the left side to make the output fit better horizontally on the page.)

Not only does one pair of seed values put my name in the output; another pair works too. If you change the seed values to

```cpp
s[0] = 0x820e8a6c1baf5b13;
s[1] = 0x5c51f1c4e2e64253;
```

you’ll also see my name in the output:

```
66 9d 95 fe 30 7c 60 de 7c 89 0c 6f cd d6 05 1e  |f...0|`.|..o....|
2b e9 9c cc cd 3d c5 ec 3f e3 88 6c a6 cd 78 84  |+....=..?..l..x.|
20 12 ac f2 2b 3c 89 73 60 09 8d c3 85 68 9e eb  | ...+<.s`....h..|
15 3e c1 0b 07 68 63 e5 73 a7 a8 f2 4b 8b dd d0  |.>...hc.s...K...|
3e 20 4a 6f 68 6e 20 44 2e 20 43 6f 6f 6b 20 3c  |> John D. Cook <|
3f 71 cf d7 5f a6 ab cf 9c 81 93 d1 3d 4c 9e 41  |?q.._.......=L.A|
0d b5 48 9c aa fc 84 d8 c6 64 1d c4 91 79 b4 f8  |..H......d...y..|
0c ac 35 48 fd 38 01 dd cc a4 e4 43 c6 5b e8 c7  |..5H.8.....C.[..|
e1 2e 76 30 0f 9d 41 ff b0 99 84 8d c1 72 5a 91  |..v0..A......rZ.|
06 ea f2 8e d7 87 e5 2c 53 a9 0c a0 f4 cd d1 9b  |.......,S.......|
```

Note that there are limits to how much you can manipulate the output of an RNG. In this case the seed was selected to produce two particular output values, but the rest of the sequence behaves as if it were random. (See Random is as random does.)

Rather than running to completion, PractRand runs until a test fails with an infinitesimally small *p*-value. It runs all tests at a given sample size, then doubles the sample size and runs the tests again.

A while back I wrote about looking for an RNG that would fail the NIST test suite and being surprised that a simple LCG (linear congruential generator) did fairly well. PractRand, however, dismisses this generator with extreme prejudice:

```
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x4a992b2c
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x4a992b2c
length= 64 megabytes (2^26 bytes), time= 2.9 seconds
  Test Name                         Raw       Processed    Evaluation
  BCFN(2+0,13-3,T)                  R=+115128 p = 0          FAIL !!!!!!!!
  BCFN(2+1,13-3,T)                  R=+105892 p = 0          FAIL !!!!!!!!
  ...
  [Low1/32]FPF-14+6/16:(8,14-9)     R= +25.8  p = 1.5e-16    FAIL
  [Low1/32]FPF-14+6/16:(9,14-10)    R= +15.5  p = 8.2e-9     very suspicious
  [Low1/32]FPF-14+6/16:(10,14-11)   R= +12.9  p = 1.8e-6     unusual
  [Low1/32]FPF-14+6/16:all          R=+380.4  p = 8.2e-337   FAIL !!!!!!!
  [Low1/32]FPF-14+6/16:all2         R=+33954  p = 0          FAIL !!!!!!!!
  [Low1/32]FPF-14+6/16:cross        R= +33.6  p = 6.4e-25    FAIL !!
  ...and 9 test result(s) without anomalies
```

I don’t recall the last time I saw a *p*-value of exactly zero. Presumably the *p*-value was so small that it underflowed to zero.

Another 32-bit generator, George Marsaglia’s MWC generator, also fails, though not as spectacularly as the LCG.

```
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x187edf12
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x187edf12
length= 64 megabytes (2^26 bytes), time= 2.9 seconds
  Test Name          Raw       Processed    Evaluation
  DC6-9x1Bytes-1     R=  +6.3  p = 2.2e-3   unusual
  Gap-16:A           R= +33.3  p = 1.6e-28  FAIL !!!
  Gap-16:B           R= +13.6  p = 7.9e-12  FAIL
  ...and 105 test result(s) without anomalies
```

Next let’s see how some well-regarded 64-bit random number generators do. We’ll look at xorshift128+ and xoroshiro128+ by Sebastiano Vigna and David Blackman, the famous Mersenne Twister, and PCG by Melissa O’Neill.

The numbers generated by xoroshiro128+ and xorshift128+ are not random in the least significant bit, and so the PractRand tests end quickly. The authors claim that xoroshiro128+ “passes the PractRand test suite up to (and included) 16TB, with the exception of binary rank tests.” I’ve only run PractRand with its default settings; I have not tried getting it to keep running the rest of the tests.

A lack of randomness in the least significant bits is inconsequential if you’re converting the outputs to floating point numbers, say as part of the process of generating Gaussian random values. For other uses, such as reducing the outputs modulo a small number, a lack of randomness in the least significant bits would be disastrous. (Here numerical analysis and number theory have opposite ideas regarding which parts of a number are most “significant.”)
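To make this concrete, here's a small sketch: the usual 64-bit-to-double conversion keeps only the top 53 bits, while reducing mod 8 looks only at the bottom three bits. (The particular value of `x` below is an arbitrary stand-in for a raw RNG output.)

```python
x = 0x123456789ABCDEF7  # pretend this is a raw 64-bit RNG output

as_double = (x >> 11) * 2.0**-53  # top 53 bits -> double in [0, 1)
low = x % 8                       # only the 3 least significant bits matter

print(as_double, low)
```

So a flaw confined to the low bits vanishes in the floating point use but dominates the modular one.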

At the time of writing, both Mersenne Twister and PCG have gone through 256 GB of generated values and are still running. I intend to add more results tomorrow.

**Update**: Mersenne Twister failed a couple of tests with 512 GB of input. I stopped the tests after PCG passed 16 TB.

```
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0xe15bf63c
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0xe15bf63c
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
  Test Name                  Raw       Processed    Evaluation
  [Low1/64]BRank(12):256(2)  R= +3748  p~= 3e-1129  FAIL !!!!!!!!
  [Low1/64]BRank(12):384(1)  R= +5405  p~= 3e-1628  FAIL !!!!!!!!
  ...and 146 test result(s) without anomalies
```

```
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x8c7c5173
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x8c7c5173
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
  Test Name                  Raw       Processed    Evaluation
  [Low1/64]BRank(12):256(2)  R= +3748  p~= 3e-1129  FAIL !!!!!!!!
  [Low1/64]BRank(12):384(1)  R= +5405  p~= 3e-1628  FAIL !!!!!!!!
  ...and 146 test result(s) without anomalies
```

```
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x300dab94
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x300dab94
length= 256 megabytes (2^28 bytes), time= 3.7 seconds
  no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x300dab94
length= 512 megabytes (2^29 bytes), time= 7.9 seconds
  Test Name                  Raw       Processed    Evaluation
  BCFN(2+2,13-2,T)           R=  -7.0  p =1-1.2e-3  unusual
  [Low1/64]BCFN(2+2,13-6,T)  R=  -5.7  p =1-1.0e-3  unusual
  ...and 167 test result(s) without anomalies

...

rng=RNG_stdin64, seed=0x300dab94
length= 256 gigabytes (2^38 bytes), time= 3857 seconds
  no anomalies in 265 test result(s)

rng=RNG_stdin64, seed=0x300dab94
length= 512 gigabytes (2^39 bytes), time= 8142 seconds
  Test Name                   Raw       Processed     Evaluation
  BRank(12):24K(1)            R=+99759  p~= 0         FAIL !!!!!!!!
  [Low16/64]BRank(12):16K(1)  R= +1165  p~= 1.3e-351  FAIL !!!!!!!
  ...and 274 test result(s) without anomalies
```

```
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x82f88431
test set = normal, folding = standard (64 bit)

rng=RNG_stdin64, seed=0x82f88431
length= 128 megabytes (2^27 bytes), time= 2.0 seconds
  Test Name                Raw      Processed   Evaluation
  [Low1/64]DC6-9x1Bytes-1  R= +6.6  p = 1.6e-3  unusual
  ...and 147 test result(s) without anomalies

rng=RNG_stdin64, seed=0x82f88431
length= 256 megabytes (2^28 bytes), time= 4.7 seconds
  no anomalies in 159 test result(s)

rng=RNG_stdin64, seed=0x82f88431
length= 512 megabytes (2^29 bytes), time= 9.5 seconds
  no anomalies in 169 test result(s)

...

rng=RNG_stdin64, seed=0x82f88431
length= 16 terabytes (2^44 bytes), time= 254943 seconds
  no anomalies in 329 test result(s)
```

Here’s the problem.

- Create a complete graph with *n* nodes, i.e. connect every node to every other node.
- Assign each edge a uniform random weight between 0 and 1.
- Find the minimum spanning tree.
- Add up the weights of the edges in the minimum spanning tree.

The surprise is that as *n* goes to infinity, the expected value of the process above converges to the Riemann zeta function at 3, i.e.

ζ(3) = 1/1³ + 1/2³ + 1/3³ + …

Incidentally, there are closed-form expressions for the Riemann zeta function at positive even integers. For example, ζ(2) = π² / 6. But no closed-form expressions have been found for odd integers.
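You can check the numerical value of ζ(3) with a quick partial sum; the tail beyond *N* terms is about 1/(2*N*²), so 100,000 terms give many correct digits:

```python
# Partial sum of the series for zeta(3); tail error is roughly 1/(2 N^2).
N = 100_000
zeta3 = sum(1/n**3 for n in range(1, N + 1))
print(zeta3)  # 1.2020569...
```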

Here’s a little Python code to play with this.

```python
import networkx as nx
from random import random

N = 1000
G = nx.Graph()
for i in range(N):
    for j in range(i+1, N):
        G.add_edge(i, j, weight=random())

T = nx.minimum_spanning_tree(G)
edges = T.edges(data=True)

print( sum([e[2]["weight"] for e in edges]) )
```

When I ran this, I got 1.2307, close to ζ(3) = 1.20205….

I ran this again, putting the code above inside a loop, averaging the results of 100 simulations, and got 1.19701. That is, the distance between my simulation result and ζ(3) went from 0.03 to 0.003.

There are two reasons I wouldn’t get exactly ζ(3). First, I’m only running a finite number of simulations (100) and so I’m not computing the expected value exactly, but only approximating it. (Probably. As in PAC: probably approximately correct.) Second, I’m using a finite graph, of size 1000, not taking a limit as graph size goes to infinity.

My limited results above suggest that the first reason accounts for most of the difference between simulation and theory. Running 100 replications cut the error down by a factor of 10. This is exactly what you’d expect from the central limit theorem. This suggests that for graphs as small as 1000 nodes, the expected value is close to the asymptotic value.

You could experiment with this, increasing the graph size and increasing the number of replications. But be patient. It takes a while for each replication to run.
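If you don't want the networkx dependency, here's a self-contained sketch using Prim's algorithm (my own illustration, not the code from the post); for a few hundred nodes the MST weight already lands near ζ(3) ≈ 1.202.

```python
import random

def mst_weight(n, sample=random.random):
    # Prim's algorithm on a complete graph with random edge weights.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = sample()

    in_tree = [False] * n
    in_tree[0] = True
    dist = w[0][:]          # lightest known edge from the tree to each node
    total = 0.0
    for _ in range(n - 1):
        v = min((u for u in range(n) if not in_tree[u]), key=dist.__getitem__)
        total += dist[v]
        in_tree[v] = True
        for u in range(n):
            if not in_tree[u] and w[v][u] < dist[u]:
                dist[u] = w[v][u]
    return total

print(mst_weight(300))  # typically close to 1.202
```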

The paper by Frieze considers more than the uniform distribution. You can use any non-negative distribution with finite variance whose cumulative distribution function *F* is differentiable at zero. The more general result replaces ζ(3) with ζ(3) / *F* ′(0). We could, for example, replace the uniform distribution on weights with an exponential distribution. In this case the distribution function is 1 – exp(-*x*), and its derivative at the origin is 1, so our simulation should still produce approximately ζ(3). And indeed it does. When I took the average of 100 runs with exponential weights I got a value of 1.2065.

There’s a little subtlety around using the derivative of the distribution at 0 rather than the density at 0. The derivative of the distribution (CDF) is the density (PDF), so why not just say density? One reason would be to allow the most general probability distributions, but a more immediate reason is that we’re up against a discontinuity at the origin. We’re looking at non-negative distributions, so the density has to be zero to the left of the origin.

When we say the derivative of the distribution at 0, we really mean the derivative at zero of a *smooth extension* of the distribution. For example, the exponential distribution has density 0 for negative *x* and density exp(-*x*) for non-negative *x*. Strictly speaking, the CDF of this distribution is 1 – exp(-*x*) for non-negative *x* and 0 for negative *x*. The left and right derivatives are different, so the derivative doesn’t exist. By saying the distribution function is simply exp(-*x*), we’ve used a smooth extension from the non-negative reals to all reals.

When you’re editing English prose, the semantic units you are concerned with might be words, sentences, or paragraphs. When you’re editing programming language source code, you care about functions or various “balanced expressions” such as the content between two parentheses or two curly brackets.

The following table gives some of the selection commands built into Emacs.

Unit | Command | Key binding |
---|---|---|
word | `mark-word` | M-@ |
paragraph | `mark-paragraph` | M-h |
page | `mark-page` | C-x C-p |
buffer | `mark-whole-buffer` | C-x h |
function | `mark-defun` | C-M-h |
balanced expression | `mark-sexp` | C-M-@ |

The `expand-region` package offers an alternative to several of these commands. More on that later.

The command for selecting a word does just what you expect. Likewise, the commands for selecting a page or a buffer require little explanation. But the meaning of a “paragraph” depends on context (i.e. editing mode), as do the meanings of “function” and “balanced expression.”

When editing source code, a “paragraph” is typically a block of code without blank lines. However, each language implements its own editing mode and could interpret editing units differently. Function definition syntax varies across languages, so `mark-defun` has to be implemented differently in each language mode.

Balanced expressions could have different meanings in different contexts, but they’re fairly consistent. Content between matching delimiters—quotation marks, parentheses, square brackets, curly braces, etc.—is generally considered a balanced expression.

Here’s where `expand-region` comes in. It’s typically bound to `C-=`. It can be used as a substitute for `mark-word` and `mark-sexp`. And if you use it repeatedly, it can replace `mark-defun`.

Each time you call `expand-region` it takes in more context. For example, suppose you’re in text mode with your cursor in the middle of a word. The first call to `expand-region` selects to the end of the word. The second call selects the whole word, i.e. expands backward to the beginning. The next call selects the enclosing sentence, and the next call the enclosing paragraph.

The `expand-region` function works analogously when editing source code. Suppose you’re editing the bit of Emacs Lisp below and have your cursor on the slash between `eshell` and `pwd`.

```lisp
(setq eshell-prompt-function (lambda nil (concat (eshell/pwd) " $ ")))
```

Here’s what sequential invocations of `expand-region` will select.

1. `/pwd`
2. `/pwd/)`
3. `(eshell/pwd)`
4. `(eshell/pwd) " $ ")`
5. `(concat (eshell/pwd) " $ ")`
6. `(concat (eshell/pwd) " $ "))`
7. `(lambda nil (concat (eshell/pwd) " $ "))`
8. `(lambda nil (concat (eshell/pwd) " $ ")))`
9. `(setq eshell-prompt-function (lambda nil (concat (eshell/pwd) " $ ")))`

This is kinda tedious in this particular context because there are a lot of delimiters in a small region. In less dense code you’ll select larger blocks of code with each invocation of `expand-region`. Since each invocation requires only a single key (i.e. hold down Control and repeatedly type `=`), it’s easy to call `expand-region` over and over until you select the region you’d like.


If you randomly select a piece of English prose, extract the *i*‘s, *j*‘s, and *k*‘s, and multiply them together as quaternions, what are you likely to get?

The probability that a letter in this sequence is an *i* is about 91.5%. There’s a 6.5% chance it’s a *k*, and a 2% chance it’s a *j*. (Derived from here.) We’ll assume the probabilities of each letter appearing next are independent.

You could think of the process of multiplying all the *i*‘s, *j*‘s, and *k*‘s together as a random walk on the unit quaternions, an example of a Markov chain. Start at 1. At each step, multiply your current state by *i* with probability 0.915, by *j* with probability 0.02, and by *k* with probability 0.065.

After the first step, you’re most likely at *i*. You could be at *j* or *k*, but nowhere else. After the second step, you’re most likely at -1, though you could be anywhere except at 1. For the first several steps you’re much more likely to be at some places than others. But after 50 steps, you’re about equally likely to be at any of the eight possible values.
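Here's a sketch of that simulation, something I've added for illustration: encode unit quaternions as (sign, letter) pairs and right-multiply using the rules *ij* = *k*, *jk* = *i*, *ki* = *j* (with the reversed products negated).

```python
import random

# Right-multiplication table for the basis quaternions: (a, b) -> a*b
MUL = {('i','i'): (-1,'1'), ('i','j'): ( 1,'k'), ('i','k'): (-1,'j'),
       ('j','i'): (-1,'k'), ('j','j'): (-1,'1'), ('j','k'): ( 1,'i'),
       ('k','i'): ( 1,'j'), ('k','j'): (-1,'i'), ('k','k'): (-1,'1')}

def mul(q, b):
    # Multiply a unit quaternion q = (sign, letter) on the right by i, j, or k.
    s, a = q
    if a == '1':
        return (s, b)
    s2, c = MUL[(a, b)]
    return (s * s2, c)

random.seed(0)
counts = {}
for _ in range(8000):
    q = (1, '1')
    for _ in range(50):
        letter = random.choices('ijk', weights=[0.915, 0.02, 0.065])[0]
        q = mul(q, letter)
    counts[q] = counts.get(q, 0) + 1

print(counts)  # all eight states show up, roughly equally often
```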

I stumbled on a Twitter account yesterday called Wolfram|Alpha Can’t. It posts bizarre queries that Wolfram Alpha can’t answer. Here’s one that caught my eye.

result of extracting the i’s, j’s, and k’s in order from Finnegans Wake and interpreting as a quaternion product

— Wolfram|Alpha Can’t (@wacnt) May 17, 2017

Suppose you did extract all the *i*‘s, *j*‘s, and *k*‘s from James Joyce’s novel Finnegans Wake. How would you answer the question above?

You could initialize an accumulator to 1 and then march through the list, updating the accumulator by multiplying it by the next element. But is there a more efficient way?

Quaternion multiplication is not commutative, i.e. the order in which you multiply things matters. So it would not be enough to have a count of how many times each letter appears. Is there any sort of useful summary of the data short of carrying out the whole multiplication? In other words, could you scan the list while doing something other than quaternion multiplication, something faster to compute? Something analogous to sufficient statistics.

We’re carrying out multiplications in the group Q of unit quaternions, a group with eight elements: ±1, ±*i*, ±*j*, ±*k*. But the input to the question about Finnegans Wake only involves three of these elements. Could that be exploited for some slight efficiency?

How would you best implement quaternion multiplication? Of course the answer depends on your environment and what you mean by “best.”

Note that we don’t actually need to implement quaternion multiplication in general, though that would be sufficient. All we really need is multiplication in the group Q.

You could implement multiplication by a table lookup. You could use an 8 × 3 table; the left side of our multiplication could be anything in Q, but the right side can only be *i*, *j*, or *k*. You could represent quaternions as a list of four numbers—coefficients of 1, *i*, *j*, and *k*—and write rules for multiplying these. You could also represent quaternions as real 4 × 4 matrices or as complex 2 × 2 matrices.

If you have an interesting solution, please share it in a comment below. It could be interesting by any one of several criteria: fast, short, cryptic, amusing, etc.


There are five regular solids in three dimensions:

- tetrahedron
- octahedron (pictured above)
- hexahedron (cube)
- dodecahedron
- icosahedron.

I give a proof here that these are the only five.

The first three of these regular solids generalize to all dimensions, and these generalizations are the only regular solids in dimensions 5 and higher. (There are six regular solids in dimension 4.)

I’ve mentioned generalizations of the cube, the hypercube, lately. I suppose you could call the generalization of an octahedron a “hyperoctahedron” by analogy with the hypercube, though I’ve never heard anybody use that term. Instead, the most common name is **cross polytope**.

This post will focus on the cross polytope. In particular, we’re going to look at the relative volume of a ball inside a cross polytope.

The cross polytope in *n* dimensions is the convex hull of all *n*-dimensional vectors that are ±1 in one coordinate and 0 in all the rest. It is the “plus or minus” part that gives the *cross* polytope its name, i.e. the vertices come in pairs *across* the origin.

In analysis, the cross polytope is the unit ball in ℓ_{1} (“little ell one”), the set of points (*x*_{1}, *x*_{2}, …, *x*_{n}) such that

|*x*_{1}| + |*x*_{2}| + … + |*x*_{n}| ≤ 1.

The ℓ_{1} norm, and hence the ℓ_{1} ball, comes up frequently in **compressed sensing** and in **sparse regression**.

In recent blog posts we’ve looked at how the relative volume of a ball inscribed in a hypercube drops quickly as dimension increases. What about the cross polytope? The relative volume of a ball inscribed in a cross polytope decreases rapidly with dimension as well. But does it decrease faster or slower than the relative volume of a ball inscribed in a hypercube? To answer this, we need to compute the ratio

(volume of inscribed ball / volume of cross polytope) ÷ (volume of inscribed ball / volume of hypercube).

Let’s gather what we need to evaluate this. We need the volume of a ball of radius *r* in *n* dimensions, and as mentioned before this is

π^{n/2} *r*^{n} / Γ(*n*/2 + 1).

A ball sitting inside an *n*-dimensional unit cross polytope will have radius 1/√*n*. This is because if *n* positive numbers sum to 1, the sum of their squares is minimized by making them all equal, and the point (1/*n*, 1/*n*, …, 1/*n*) has norm 1/√ *n*. A ball inside a unit hypercube will have radius 1/2.

The cross polytope has volume 2^{n} / *n*! and the hypercube has volume 1.

Putting this all together, the relative volume of a ball in a cross polytope divided by the relative volume of a ball inside a hypercube is

( π^{n/2} *n*^{−n/2} / Γ(*n*/2 + 1) ) ( *n*! / 2^{n} ) ÷ ( π^{n/2} 2^{−n} / Γ(*n*/2 + 1) )

which fortunately reduces to just

*n*! / *n*^{n/2}.
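A quick numerical sanity check of this reduction, comparing the two relative volumes directly in a modest dimension:

```python
from math import pi, gamma, sqrt, factorial

def ball_volume(r, n):
    # Volume of an n-dimensional ball of radius r
    return pi**(n/2) * r**n / gamma(n/2 + 1)

n = 10
rel_cross = ball_volume(1/sqrt(n), n) / (2**n / factorial(n))
rel_cube  = ball_volume(0.5, n)  # the unit hypercube has volume 1
ratio = rel_cross / rel_cube

print(ratio, factorial(n) / n**(n/2))  # the two agree
```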

But how do we compare *n*! and *n*^{n/2}? That’s a job for Stirling’s approximation. It tells us that for large *n*, the ratio is approximately

√(2π*n*) (√*n* / *e*)^{n},

and so the ratio diverges for large *n*, i.e. the ball in the cross polytope takes up increasingly more relative volume.

Looking back at just the relative volume of the ball inside the cross polytope, and applying Stirling’s approximation again, we see that this relative volume is approximately

√2 (π / 2*e*)^{n/2},

and so the relative volume decreases geometrically as *n* increases, much more slowly than the relative volume of a ball in a hypercube decreases.

At GCHQ in the ’70s, we thought of ourselves as completely Bayesian statisticians. All our data analysis was completely Bayesian, and that was a direct inheritance from Alan Turing. I’m not sure this has ever really been published, but Turing, almost as a sideline during his cryptanalytic work, reinvented Bayesian statistics for himself. The work against Enigma and other German ciphers was fully Bayesian. …

Bayesian statistics was an extreme minority discipline in the ’70s. In academia, I only really know of two people who were working majorly in the field, Jimmy Savage … in the States and Dennis Lindley in Britain. And they were regarded as fringe figures in the statistics community. It’s extremely different now. The reason is that Bayesian statistics *works*. So eventually truth will out. There are many, many problems where Bayesian methods are obviously the right thing to do. But in the ’70s we understood that already in Britain in the classified environment.

We looked at the proportion of volume contained near the corners of a hypercube. If you take the set of points within a distance 1/2 of a corner of a hypercube, you could rearrange these points to form a full ball centered at one corner of the hypercube. Saying that not much volume is located near the corners is equivalent to saying that the sphere packing that centers spheres at points with integer coordinates is not very dense.

We also looked at centering balls inside hypercubes. This is the same sphere packing as above, just shifting every coordinate by 1/2. So saying that a ball in a box doesn’t take up much volume in high dimensions is another way of saying that the integer lattice sphere packing is not very dense.

How much better can we pack spheres? In 24 dimensions, balls centered inside hypercubes have density equal to the volume of a ball of radius 1/2, i.e. (π/4)^{12} / 12!. The most dense packing in 24 dimensions, the Leech lattice sphere packing, has a density of π^{12} / 12!, i.e. it is 2^{24} ≈ 16.8 million times more efficient.

The densest sphere packings have only been proven optimal in dimensions 1, 2, 3, 8, and 24. (The densest regular (lattice) packings are known for dimensions up to 8, but it is conceivable that there exist irregular packings that are more efficient than the most efficient lattice packing.) Dimension 24 is special in numerous ways, and it appears that 24 is a local maximum as far as optimal sphere packing density goes. How does sphere packing based on an integer lattice compare to the best packing in other high dimensions?

Although optimal packings are not known in high dimensions, upper and lower bounds on packing density are known. If Δ is the optimal sphere packing density in dimension *n*, then for large *n* we have roughly

2^{−n} ≤ Δ ≤ 2^{−0.599 n},

the lower bound going back to Minkowski and the upper bound to Kabatiansky and Levenshtein.

The following plot shows how the integer lattice packing density (solid line) compares to the upper and lower bounds (dashed lines).

The upper and lower bounds come from Sphere Packings, Lattices, and Groups, published in 1998. Perhaps tighter bounds have been found since then.

I recently wrote that the corners of a cube stick out more in high dimensions. You can quantify this by centering a ball at a corner and looking at how much of the ball comes from the cube and how much from surrounding space. That post showed that the proportion of volume near a corner goes down rapidly as dimension increases.

About a year ago I wrote a blog post about how formal methods let you explore corner cases. Along the way I said that most cases are corner cases, i.e. most of the volume is in the corners.

Both posts are correct, but they use the term “corner” differently. That is because there are two ideas of “corner” that are the same in low dimensions but diverge in higher dimensions.

Draw a circle and then draw a square just big enough to contain it. You could say that the area in the middle is the area inside the circle and the corners are everything else. Or you could say that the corners are the regions near a vertex of the square, and the middle is everything else. These two criteria aren’t that different. But in high dimensions they’re vastly different.

The post about pointy corners looked at the proportion of volume near the vertices of the cube. The post about formal methods looked at the proportion of volume not contained in a ball in the middle of the cube. As dimension increases, the former goes to zero and the latter goes to one.

In other words, in high dimensions most of the volume is neither near a vertex nor in a ball in the middle. This gives a hint at why sphere packing is interesting in high dimensions. The next post looks at how the sphere packings implicit in this post compare to the best possible packings.

Here’s another surprise: **corners stick out more in high dimensions**. Hypercubes, for example, become pointier as dimension increases.

How might we quantify this? Think of a pyramid and a flag pole. If you imagine a ball centered at the top of a pyramid, a fair proportion of the volume of the ball contains part of the pyramid. But if you do the same for a flag pole, only a small proportion of the ball contains the pole; nearly all the volume of the ball is air.

So one way to quantify how pointy a corner is would be to look at a neighborhood of the corner and measure how much of the neighborhood intersects the solid that the corner is part of. The less volume, the pointier the corner.

Consider a unit square. Put a disk of radius *r* at a corner, with *r* < 1. One quarter of that disk will be inside the square. So the proportion of the square near a particular corner is π*r*²/4, and the proportion of the square near any corner is π*r*².

Now do the analogous exercise for a unit cube. Look at a ball of radius *r* < 1 centered at a corner. One eighth of the volume of that ball contains part of the cube. The proportion of cube’s volume located within a distance *r* of a particular corner is π*r*³/6, and the proportion located within a distance *r* of any corner is 4π*r*³/3.

79% of a square is within a distance 0.5 of some corner, while the corresponding proportion for a cube is only 52%. In that sense, the corners of a cube stick out more than the corners of a square.
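These proportions are easy to check numerically using the formula for the volume of an *n*-ball; each of the 2ⁿ corners contributes 1/2ⁿ of a ball, so the factors of 2ⁿ cancel:

```python
from math import pi, gamma

def near_corner(n, r):
    # Proportion of a unit n-cube within distance r of some corner:
    # 2^n corners, each holding 1/2^n of an n-ball of radius r.
    return pi**(n/2) * r**n / gamma(n/2 + 1)

print(round(near_corner(2, 0.5), 2))  # 0.79 for the square
print(round(near_corner(3, 0.5), 2))  # 0.52 for the cube
```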

Now let’s look at a hypercube of dimension *n*. Let *V* be the volume of an *n*-dimensional ball of radius *r* < 1. The proportion of the hypercube’s volume located within a distance *r* of a particular corner is *V* / 2^{n}, and the proportion located within a distance *r* of any corner is simply *V*.

The equation for the volume *V* is

*V* = π^{n/2} *r*^{n} / Γ(*n*/2 + 1)

If we fix *r* and let *n* vary, this function decreases rapidly as *n* increases.

Saying that corners stick out more in high dimensions is a corollary of the more widely known fact that a ball in a box takes up less and less volume as the dimension of the ball and the box increase.

Let’s set *r* = 1/2 and plot how the volume of a ball varies with dimension *n*.

You could think of this as the volume of a ball sitting inside a unit hypercube, or more relevant to the topic of this post, the proportion of the volume of the hypercube located within a distance 1/2 of a corner.
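To make the numbers concrete, here is a short check (not from the original post) of the corner-proportion formula, using the standard volume formula for an *n*-ball:

```python
from math import pi, gamma

def ball_volume(n, r):
    # Volume of an n-dimensional ball of radius r: pi^(n/2) r^n / Gamma(n/2 + 1)
    return pi**(n/2) * r**n / gamma(n/2 + 1)

# Proportion of a unit hypercube within distance 1/2 of some corner.
# The 2^n corner pieces together make up one whole ball of radius 1/2.
for n in [2, 3, 10, 100]:
    print(n, ball_volume(n, 0.5))
```

For *n* = 2 and 3 this reproduces the 79% and 52% figures above, and the proportion collapses toward zero as *n* grows.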

Suppose you have a sequence *X*_{1}, *X*_{2}, …, *X*_{n} of *n* random variables that each take on the values {-1, 1} with equal probability. You could think of this as a random walk: you start at 0 and take a sequence of steps either to the left or the right.

Let *S*_{n} = *X*_{1} + *X*_{2} + … + *X*_{n} be the sum of the sequence. The expected value of *S*_{n} is 0 by symmetry, but it could be as large as *n* or as small as –*n*. We want to look at how large |*S*_{n}| is likely to be relative to the sequence length *n*.

Here’s the analytical bound (Hoeffding’s inequality for this setting):

P( |*S*_{n}| / *n* ≥ ε ) ≤ 2 exp( –*n*ε² / 2 )


Here is a simulation in Python to illustrate the bound.

```python
from random import randint
import numpy as np
import matplotlib.pyplot as plt

n = 400    # random walk length
N = 10000  # number of repeated walks

reps = np.empty(N)
for i in range(N):
    random_walk = [2*randint(0, 1) - 1 for _ in range(n)]
    reps[i] = abs(sum(random_walk)) / n

plt.hist(reps, bins=41)
plt.xlabel("$|S_n|/n$")
plt.show()
```

And here are the results.
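As a further sanity check, here’s a small experiment (not from the original post) comparing the empirical tail probability against Hoeffding’s bound P(|*S*_{n}|/*n* ≥ ε) ≤ 2 exp(–*n*ε²/2):

```python
import numpy as np

rng = np.random.default_rng(20170707)
n, N = 400, 10000
walks = rng.choice([-1, 1], size=(N, n)).sum(axis=1)

eps = 0.1
empirical = np.mean(np.abs(walks) / n >= eps)  # observed tail probability
bound = 2 * np.exp(-n * eps**2 / 2)            # Hoeffding bound
print(empirical, bound)
```

The bound is conservative: the observed tail probability comes in well under it.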

What exactly do we mean by “nearly all the area” and “near the equator”? You get to decide. Pick your standard of “nearly all the area,” say 99%, and your definition of “near the equator,” say within 5 degrees. Then it’s always possible to take the dimension high enough that your standards are met. The more demanding your standard, the higher the dimension will need to be, but it’s always possible to pick the dimension high enough.

This result is hard to imagine. Maybe a simulation will help make it more believable.

In the simulation below, we take as our “north pole” the point (1, 0, 0, 0, …, 0). We could pick any unit vector, but this choice is convenient. Our equator is the set of points orthogonal to the pole, i.e. that have first coordinate equal to zero. We draw points randomly from the sphere, compute their latitude (i.e. angle from the equator), and make a histogram of the results.

The area of our planet isn’t particularly concentrated near the equator.

But as we increase the dimension, we see more and more of the simulation points are near the equator.

Here’s the code that produced the graphs.

```python
from scipy.stats import norm
from math import sqrt, pi, acos, degrees
import matplotlib.pyplot as plt

def pt_on_sphere(n):
    # Return random point on unit sphere in R^n.
    # Generate n standard normals and normalize length.
    x = norm.rvs(0, 1, n)
    length = sqrt(sum(x**2))
    return x/length

def latitude(x):
    # Latitude relative to plane with first coordinate zero.
    angle_to_pole = acos(x[0]) # in radians
    latitude_from_equator = 0.5*pi - angle_to_pole
    return degrees(latitude_from_equator)

N = 1000 # number of samples

for n in [3, 30, 300, 3000]: # dimension of R^n
    latitudes = [latitude(pt_on_sphere(n)) for _ in range(N)]
    plt.hist(latitudes, bins=int(sqrt(N)))
    plt.xlabel("Latitude in degrees from equator")
    plt.title("Sphere in dimension {}".format(n))
    plt.xlim((-90, 90))
    plt.show()
```

Not only is most of the area near the equator, the amount of area outside a band around the equator decreases very rapidly as you move away from the band. You can see that from the histograms above. They look like a normal (Gaussian) distribution, and in fact we can make that more precise.
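The widths of these histograms shrink like 1/√*n*. Here’s a quick check (a sketch, not from the original post): the first coordinate of a random point on the sphere has variance exactly 1/*n*, so the latitude in degrees should have standard deviation close to degrees(1/√*n*):

```python
import numpy as np

rng = np.random.default_rng(314)
N = 2000
stds = {}
for n in [30, 300, 3000]:
    x = rng.standard_normal((N, n))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # random points on the sphere in R^n
    lat = np.degrees(np.arcsin(x[:, 0]))           # latitude of each point
    stds[n] = lat.std()
    print(n, stds[n], np.degrees(1/np.sqrt(n)))
```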

If *A* is a band around the equator containing at least half the area, then the proportion of the area a distance *r* or greater from *A* is bounded by exp( -(*n*-1)*r*² ). And in fact, this holds for *any* set *A* containing at least half the area; it doesn’t have to be a band around the equator, just any set of large measure.

**Related post**: Willie Sutton and the multivariate normal distribution

This is not to say that MWC is the best generator for every purpose, but any shortcomings of MWC are not apparent from DIEHARDER. The PCG family of generators, for example, is apparently superior to MWC, but you couldn’t necessarily conclude that from these tests.

Unless your application demands more of a random number generator than these tests do, MWC is probably adequate for your application. I wouldn’t recommend it for cryptography or for high-dimensional integration by darts, but it would be fine for many common applications.

George Marsaglia developed the DIEHARD battery of tests in 1995. Physics professor Robert G. Brown later refined and extended Marsaglia’s original test suite to create the DIEHARDER suite. (The name of Marsaglia’s battery of tests was a pun on the Diehard car battery. Brown continued the pun tradition by naming his suite after Die Harder, the sequel to the movie Die Hard.) The DIEHARDER suite is open source. It is designed to be at least as rigorous as the original DIEHARD suite and in some cases more rigorous.

There are 31 distinct kinds of tests in the DIEHARDER suite, but some of these are run multiple times. In total, 114 tests are run.

The main strength of Marsaglia’s MWC algorithm is that it is very short. The heart of the code is only three lines:

```c
m_z = 36969 * (m_z & 65535) + (m_z >> 16);
m_w = 18000 * (m_w & 65535) + (m_w >> 16);
return (m_z << 16) + m_w;
```

You can find a full implementation of a random number generator class based on MWC here.
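For illustration, here’s a Python sketch of those same three lines (not from the post). The masking emulates C’s 32-bit unsigned arithmetic, and the default seed values here are arbitrary nonzero choices:

```python
class MWC:
    """Sketch of Marsaglia's multiply-with-carry generator."""

    def __init__(self, z=362436069, w=521288629):
        # Arbitrary nonzero seed values
        self.m_z, self.m_w = z, w

    def next32(self):
        # The three C lines above, with masks standing in for uint32 overflow
        self.m_z = (36969 * (self.m_z & 65535) + (self.m_z >> 16)) & 0xFFFFFFFF
        self.m_w = (18000 * (self.m_w & 65535) + (self.m_w >> 16)) & 0xFFFFFFFF
        return ((self.m_z << 16) + self.m_w) & 0xFFFFFFFF

rng = MWC()
print([rng.next32() for _ in range(3)])
```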

The heart of PCG is also very short, but a bit more mysterious.

```c
rng->state = oldstate * 6364136223846793005ULL + (rng->inc | 1);
uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
uint32_t rot = oldstate >> 59u;
return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
```

(These are the 64-bit state versions of MWC and PCG. Both have versions based on larger state.)

Because these generators require little code, they’d be relatively easy to step into with a debugger, compared to other RNGs such as Mersenne Twister that require more code and more state.

Out of the 114 DIEHARDER tests run on MWC, all but three returned a pass, and the remaining three returned a weak pass.

A few weak passes are to be expected. The difference between pass, weak pass, and failure is whether a *p*-value falls below a certain threshold. Theory says that ideally *p*-values would be uniformly distributed, and so one would fall outside the region for a strong pass now and then.

Rather than counting strong and weak passes, let’s look at the *p*-values themselves. We’d expect these to be uniformly distributed. Let’s see if they are.
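One standard way to check a batch of *p*-values for uniformity is a Kolmogorov–Smirnov test. Here’s a sketch using simulated uniform values as a stand-in for the 114 reported *p*-values, which aren’t reproduced here:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(7)
pvals = rng.random(114)  # stand-in for the 114 DIEHARDER p-values

# Compare the empirical distribution of the p-values to the uniform distribution
result = kstest(pvals, "uniform")
print(result.statistic, result.pvalue)
```

A small KS *p*-value would indicate the batch of test *p*-values is not plausibly uniform.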

Here are the *p*-values reported by the DIEHARDER tests for MWC:

Here are the corresponding values for PCG:

Neither test has too many small *p*-values, no more than we’d expect. This is normally what we’re concerned about. Too many small *p*-values would indicate that the generated samples are showing behavior that would be rare for truly random input.

But both sets of test results have a surprising number of large *p*-values. Not sure what to make of that. I suspect it says more about the DIEHARDER test suite than the random number generators being tested.

**Update**: I went back to look at some results from Mersenne Twister to see if this pattern of large *p*-values persists there. It does, and in fact the *p*-values are even more biased toward the high end for Mersenne Twister.

One thing I noticed is that the large *p*-values are disproportionately coming from some of the same tests each time. In particular, the repetitions of the `sts_serial` test have an unexpectedly high number of large *p*-values for each generator.

The result looks like a fractal called the Sierpinski triangle or Sierpinski gasket.

Here’s an example:

If the random number generation is biased, the resulting triangle will show it. In the image below, the lower left corner was chosen with probability 1/2, the top with probability 1/3, and the right corner with probability 1/6.

**Update**: Here’s an animated version that lets you watch the process in action.

Here’s Python code to play the chaos game yourself.

```python
from math import sqrt
from numpy import zeros
import matplotlib.pyplot as plt
from random import random, randint

def midpoint(p, q):
    return (0.5*(p[0] + q[0]), 0.5*(p[1] + q[1]))

# Three corners of an equilateral triangle
corner = [(0, 0), (0.5, sqrt(3)/2), (1, 0)]

N = 1000
x = zeros(N)
y = zeros(N)
x[0] = random()
y[0] = random()
for i in range(1, N):
    k = randint(0, 2) # random triangle vertex
    x[i], y[i] = midpoint(corner[k], (x[i-1], y[i-1]))

plt.scatter(x, y)
plt.show()
```

**Update 2**: Peter Norvig posted some Python code with variations on the game presented here, generalizing a triangle to other shapes. If you try the analogous procedure with a square, you simply get a square filled with random dots.

However, you can get what you might expect, the square analog of the Sierpinski triangle, the product of a Cantor set with itself, if you make a couple of modifications. First, pick a *side* at random, not a corner. Second, move 1/3 of the way toward the chosen side, not 1/2 of the way.

Here’s what I got with these changes:

Source: Chaos and Fractals

The journal article announcing PCG gives the results of testing it with the TestU01 test suite. I wanted to try it out by testing it with the DIEHARDER test suite (Robert G. Brown’s extension of George Marsaglia’s DIEHARD test suite) and the NIST Statistical Test Suite. I used what the generator’s website calls the “minimal C implementation.”

The preprint of the journal article is dated 2015 but apparently hasn’t been published yet.

**Update**: See the very informative note by the author of PCG in the comments below.

For the NIST test suite, I generated 10,000,000 bits and divided them into 10 streams.

For the DIEHARDER test suite, I generated 800,000,000 unsigned 32-bit integers. (DIEHARDER requires a **lot** of random numbers as input.)

For both test suites I used the seed (`state`) 20170707105851 and sequence constant (`inc`) 42.

The PCG generator did well on all the NIST tests. For every test, at least 9 out of 10 streams passed. The test authors say you should expect at least 8 out of 10 streams to pass.

Here’s an excerpt from the results. You can find the full results here.

```
--------------------------------------------------------
 C1 C2 C3 ... C10  P-VALUE  PROPORTION  STATISTICAL TEST
--------------------------------------------------------
  2  0  2 ...   0  0.213309    10/10    Frequency
  0  0  1 ...   3  0.534146    10/10    BlockFrequency
  3  0  0 ...   0  0.350485    10/10    CumulativeSums
  1  1  0 ...   2  0.350485    10/10    CumulativeSums
  0  2  2 ...   1  0.911413    10/10    Runs
  0  0  1 ...   1  0.534146    10/10    LongestRun
  0  1  2 ...   0  0.739918    10/10    Rank
  0  4  0 ...   0  0.122325    10/10    FFT
  1  0  0 ...   1  0.000439    10/10    NonOverlappingTemplate
  ...
  2  1  0 ...   0  0.350485     9/10    NonOverlappingTemplate
  0  2  1 ...   0  0.739918    10/10    OverlappingTemplate
  1  1  0 ...   2  0.911413    10/10    Universal
  1  1  0 ...   0  0.017912    10/10    ApproximateEntropy
  1  0  1 ...   1    ----       3/4     RandomExcursions
  ...
  0  0  0 ...   1    ----       4/4     RandomExcursions
  2  0  0 ...   0    ----       4/4     RandomExcursionsVariant
  ...
  0  0  3 ...   0    ----       4/4     RandomExcursionsVariant
  1  2  3 ...   0  0.350485     9/10    Serial
  1  1  1 ...   0  0.739918    10/10    Serial
  1  2  0 ...   0  0.911413    10/10    LinearComplexity
  ...
```

The DIEHARDER suite has 31 kinds of tests, some of which are run many times, making a total of 114 tests. Out of the 114 tests, two returned a weak pass for the PCG input and all the rest passed. A few weak passes are to be expected from running so many tests and so this isn’t a strike against the generator. In fact, it might be suspicious if no tests returned a weak pass.

Here’s an edited version of the results. The full results are here.

```
#=============================================================================#
        test_name     |ntup| tsamples |psamples|  p-value |Assessment
#=============================================================================#
     diehard_birthdays|   0|       100|     100|0.46682782|  PASSED
        diehard_operm5|   0|   1000000|     100|0.83602120|  PASSED
    diehard_rank_32x32|   0|     40000|     100|0.11092547|  PASSED
      diehard_rank_6x8|   0|    100000|     100|0.78938803|  PASSED
     diehard_bitstream|   0|   2097152|     100|0.81624396|  PASSED
          diehard_opso|   0|   2097152|     100|0.95589325|  PASSED
          diehard_oqso|   0|   2097152|     100|0.86171368|  PASSED
           diehard_dna|   0|   2097152|     100|0.24812341|  PASSED
  diehard_count_1s_str|   0|    256000|     100|0.75417270|  PASSED
  diehard_count_1s_byt|   0|    256000|     100|0.25725000|  PASSED
   diehard_parking_lot|   0|     12000|     100|0.59288414|  PASSED
      diehard_2dsphere|   2|      8000|     100|0.79652706|  PASSED
      diehard_3dsphere|   3|      4000|     100|0.14978100|  PASSED
       diehard_squeeze|   0|    100000|     100|0.35356584|  PASSED
          diehard_sums|   0|       100|     100|0.04522121|  PASSED
          diehard_runs|   0|    100000|     100|0.39739835|  PASSED
          diehard_runs|   0|    100000|     100|0.99128296|  PASSED
         diehard_craps|   0|    200000|     100|0.64934221|  PASSED
         diehard_craps|   0|    200000|     100|0.27352733|  PASSED
   marsaglia_tsang_gcd|   0|  10000000|     100|0.10570816|  PASSED
   marsaglia_tsang_gcd|   0|  10000000|     100|0.00267789|   WEAK
           sts_monobit|   1|    100000|     100|0.98166534|  PASSED
              sts_runs|   2|    100000|     100|0.05017630|  PASSED
            sts_serial|   1|    100000|     100|0.95153782|  PASSED
  ...
            sts_serial|  16|    100000|     100|0.59342390|  PASSED
           rgb_bitdist|   1|    100000|     100|0.50763759|  PASSED
  ...
           rgb_bitdist|  12|    100000|     100|0.98576422|  PASSED
  rgb_minimum_distance|   2|     10000|    1000|0.23378443|  PASSED
  ...
  rgb_minimum_distance|   5|     10000|    1000|0.13215367|  PASSED
      rgb_permutations|   2|    100000|     100|0.54142546|  PASSED
  ...
      rgb_permutations|   5|    100000|     100|0.96040216|  PASSED
        rgb_lagged_sum|   0|   1000000|     100|0.66587166|  PASSED
  ...
        rgb_lagged_sum|  31|   1000000|     100|0.00183752|   WEAK
        rgb_lagged_sum|  32|   1000000|     100|0.13582393|  PASSED
       rgb_kstest_test|   0|     10000|    1000|0.74708548|  PASSED
       dab_bytedistrib|   0|  51200000|       1|0.30789191|  PASSED
               dab_dct| 256|     50000|       1|0.89665788|  PASSED
          dab_filltree|  32|  15000000|       1|0.67278231|  PASSED
          dab_filltree|  32|  15000000|       1|0.35348003|  PASSED
         dab_filltree2|   0|   5000000|       1|0.18749029|  PASSED
         dab_filltree2|   1|   5000000|       1|0.92600020|  PASSED
```

This post will implement a couple of the simplest tests in Python and show that the generator does surprisingly well.

The linear congruential generator used here starts with an arbitrary seed, then at each step produces a new number by multiplying the previous number by a constant and taking the remainder by 2^{31} – 1. The multiplier constant was chosen to be one of the multipliers recommended in [1].

We’ll need a couple math functions:

```python
from math import sqrt, log
```

and we need to define the constants for our generator.

```python
# Linear congruence generator (LCG) constants
z = 20170705   # seed
a = 742938285  # multiplier
e = 31         # will need this later
m = 2**e - 1   # modulus
```

Next we form a long string of 0’s and 1’s using our generator.

```python
# Number of random numbers to generate
N = 100000

# Format to print bits, padding with 0's on the left if needed
formatstr = "0" + str(e) + "b"

bit_string = ""
for _ in range(N):
    z = a*z % m # LCG
    bit_string += format(z, formatstr)
```

Next we run a couple tests. First, we count the number of 1’s in our string of bits. We expect about half the bits to be 1’s. We can quantify “about” as within two standard deviations.

```python
def count_ones(string):
    ones = 0
    for i in range(len(string)):
        if string[i] == '1':
            ones += 1
    return ones

ones = count_ones(bit_string)

expected = e*N/2     # e*N bits total, each 1 with probability 1/2
sd = sqrt(0.25*e*N)  # standard deviation of the count of 1's
print( "Number of 1's: {}".format(ones) )
print( "Expected: {:.1f} to {:.1f}".format(expected - 2*sd, expected + 2*sd) )
```

The results are nothing unusual:

```
Number of 1's: 1550199
Expected: 1548239.3 to 1551760.7
```

Next we look at the length of the longest run of 1’s. I’ve written before about the probability of long runs and the code below uses a couple results from that post.

```python
def runs(string):
    max_run = 0
    current_run = 0
    for i in range(len(string)):
        if string[i] == '1':
            current_run += 1
        else:
            max_run = max(max_run, current_run)
            current_run = 0
    return max(max_run, current_run) # in case the string ends in a run of 1's

runlength = runs(bit_string)

expected = -log(0.5*e*N)/log(0.5) # approximate expected longest run
sd = 1/log(2)
print( "Run length: {}".format(runlength) )
print( "Expected: {} to {}".format(expected - 2*sd, expected + 2*sd) )
```

Again the results are nothing unusual:

Run length: 19 Expected: 17.7 to 23.4

Simple random number generators are adequate for many uses. Some applications, such as high dimensional integration and cryptography, require more sophisticated generators, but sometimes it’s convenient and sufficient to use something simple. For example, code using the LCG generator above would be easier to debug than code using the Mersenne Twister. The entire state of the LCG is a single number, whereas the Mersenne Twister maintains an internal state of 312 numbers.

One obvious limitation of the LCG used here is that it couldn’t possibly produce more than 2^{31} – 1 values before repeating itself. Since each value depends only on the previous one, once the generator returns to a value it has produced before, the sequence repeats from there. In fact, [1] shows that it does produce 2^{31} – 1 values before cycling. If the multiplier were not chosen carefully it could have a shorter period.

So our LCG has a period of about two billion values. That’s a lot if you’re writing a little game, for example. But it’s not enough for many scientific applications.
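The full-period claim can be illustrated with a miniature LCG (a toy example, not from the original post): multiplier 3 and modulus 2^{5} – 1 = 31, where 3 happens to be a primitive root mod 31.

```python
# Tiny LCG: multiplier 3, modulus 2**5 - 1 = 31.
# Since 3 is a primitive root mod 31, the period is the full m - 1 = 30.
m, a = 31, 3
z = 1
seen = set()
while z not in seen:
    seen.add(z)
    z = a * z % m
print(len(seen))  # 30 = m - 1, the maximum possible period
```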

* * *

[1] George S. Fishman and Louis R. Moore III, An exhaustive analysis of multiplicative congruential random number generators with modulus 2^{31} – 1, SIAM Journal of Scientific and Statistical Computing, Vol. 7, no. 1, January 1986.

More precisely, let φ(*n*) equal the log of the least common multiple of the numbers 1, 2, …, *n*. There are theorems that give upper and lower bounds on how far φ(*n*) can be from *n*. We won’t prove or even state these bounds here. See [1] for that. Instead, we’ll show empirically that φ(*n*) is approximately *n*.

Here’s some Python code to plot φ(*n*) over *n*. The ratio jumps up sharply after the first few values of *n*. In the plot below, we chop off the first 20 values of *n*.

```python
from numpy import arange, empty
from sympy import ilcm, log
import matplotlib.pyplot as plt

N = 5000
x = arange(N)
phi = empty(N)

M = 1
for n in range(1, N):
    M = ilcm(n, M)
    phi[n] = log(M)

a = 20
plt.plot(x[a:], phi[a:]/x[a:])
plt.xlabel("$n$")
plt.ylabel(r"$\phi(n) / n$")
plt.show()
```

Here’s the graph this produces.

[1] J. Barkley Rosser and Lowell Schoenfeld. Approximate formulas for some functions of prime numbers. Illinois Journal of Mathematics, Volume 6, Issue 1 (1962), 64-94. (On Project Euclid)

If you subscribe by email, you’ll get an email each morning containing the post(s) from the previous day.

I just noticed a problem with email subscription: it doesn’t show SVG images, at least when reading via Gmail; maybe other email clients display SVG correctly. Here’s what a portion of yesterday’s email looks like in Gmail:

I’ve started using SVG for graphs, equations, and a few other images. The main advantage to SVG is that the images look sharper. Also, you can display the same image file at any resolution; no need to have different versions of the image for display at different sizes. And sometimes SVG files are smaller than their raster counterparts.

There may be a way to have web site visitors see SVG and email subscribers see PNG. If not, email subscribers can click on the link at the top of each post to open it in a browser and see all the images.

By the way, RSS readers handle SVG just fine. At least Digg Reader, the RSS reader I use, works well with SVG. The only problem I see is that centered content is always moved to the left.

* * *

The email newsletter is different from the email blog subscription. I only send out a newsletter once a month. It highlights the most popular posts and says a little about what I’ve been up to. I just sent out a newsletter this morning, so it’ll be another month before the next one comes out.

MCMC (Markov Chain Monte Carlo) gives us a way around this impasse. It lets us draw samples from practically any probability distribution. But there’s a catch: the samples are not independent. This lack of independence means that all the familiar theory on convergence of sums of random variables goes out the window.

There’s not much theory to guide assessing the convergence of sums of MCMC samples, but there are heuristics. One of these is **effective sample size** (ESS). The idea is to have a sort of “exchange rate” between dependent and independent samples. You might want to say, for example, that 1,000 samples from a certain Markov chain are worth about as much as 80 independent samples because the MCMC samples are highly correlated. Or you might want to say that 1,000 samples from a different Markov chain are worth about as much as 300 independent samples because although the MCMC samples are dependent, they’re weakly correlated.

Here’s the definition of ESS:

ESS = *n* / (1 + 2 Σ_{k=1}^{∞} ρ(*k*))

where *n* is the number of samples and ρ(*k*) is the correlation at lag *k*.

This behaves well in the extremes. If your samples are independent, your effective sample size equals the actual sample size. If the correlation at lag *k* decreases extremely slowly, so slowly that the sum in the denominator diverges, your effective sample size is zero.

Any reasonable Markov chain is between the extremes. Zero lag correlation is too much to hope for, but ideally the correlations die off fast enough that the sum in the denominator not only converges but also isn’t a terribly large value.
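Here’s a sketch of computing ESS from sample autocorrelations, tried on an AR(1) chain where the lag-*k* correlation is ρ^{k} and so ESS should be about *n*(1 – ρ)/(1 + ρ). The truncation lag below is an arbitrary choice; serious implementations use more careful truncation rules.

```python
import numpy as np

def acf(x, k):
    # Sample autocorrelation at lag k
    xc = x - x.mean()
    return np.dot(xc[:-k], xc[k:]) / np.dot(xc, xc)

def ess(x, max_lag=200):
    # Effective sample size: n / (1 + 2 * sum of autocorrelations),
    # with the sum truncated at an arbitrary max_lag.
    n = len(x)
    return n / (1 + 2*sum(acf(x, k) for k in range(1, max_lag)))

# AR(1) chain with lag-k correlation rho^k
rng = np.random.default_rng(42)
rho, n = 0.9, 100000
noise = rng.standard_normal(n) * np.sqrt(1 - rho**2)
x = np.empty(n)
x[0] = noise[0]
for i in range(1, n):
    x[i] = rho*x[i-1] + noise[i]

print(ess(x), n*(1 - rho)/(1 + rho))  # estimate vs. theoretical value
```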

I’m not sure who first proposed this definition of ESS. There’s a reference to it in Handbook of Markov Chain Monte Carlo where the authors cite a paper [1] in which Radford Neal mentions it. Neal cites B. D. Ripley [2].

[1] Markov Chain Monte Carlo in Practice: A Roundtable Discussion. Robert E. Kass, Bradley P. Carlin, Andrew Gelman and Radford M. Neal. The American Statistician. Vol. 52, No. 2 (May, 1998), pp. 93-100

[2] Stochastic Simulation, B. D. Ripley, 1987.

Here’s some data to illustrate this.

```
|------+-----------------+---------|
|    n | avg. operations | 10*p(n) |
|------+-----------------+---------|
|  100 |          5200.2 |    5410 |
|  200 |         12018.3 |   12230 |
|  300 |         19446.9 |   19870 |
|  400 |         27272.2 |   27410 |
|  500 |         35392.2 |   35710 |
|  600 |         43747.3 |   44090 |
|  700 |         52297.8 |   52790 |
|  800 |         61015.5 |   61330 |
|  900 |         69879.6 |   69970 |
| 1000 |         78873.5 |   79190 |
| 1100 |         87984.4 |   88310 |
| 1200 |         97201.4 |   97330 |
| 1300 |        106515.9 |  106570 |
| 1400 |        115920.2 |  116570 |
| 1500 |        125407.9 |  125530 |
| 1600 |        134973.5 |  134990 |
| 1700 |        144612.1 |  145190 |
| 1800 |        154319.4 |  154010 |
| 1900 |        164091.5 |  163810 |
| 2000 |        173925.1 |  173890 |
|------+-----------------+---------|
```

The maximum difference between the quicksort and prime columns is about 4%. In the latter half of the table, the maximum error is about 0.4%.

What’s going on here? Why should quicksort be related to prime numbers?!

The real mystery is the prime number theorem, not quicksort. The prime number theorem tells us that the *n*th prime number is approximately *n* log *n*. And the number of operations in an efficient sort is proportional to *n* log *n*. The latter is easier to see than the former.

A lot of algorithms have run time proportional to *n* log *n*: mergesort, heapsort, FFT (Fast Fourier Transform), etc. All these have run time approximately proportional to the *n*th prime.

Now for the fine print. What exactly is the average run time for quicksort? It’s easy to say it’s O(*n* log *n*), but getting more specific requires making assumptions. I used as the average number of operations 11.67 *n* log *n* – 1.74 *n*, based on Knuth’s TAOCP, Volume 3. And why 10 times the *n*th prime and not 11.67? I chose 10 to make the example work better. For very large values of *n*, a larger coefficient would work better.
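Here’s a sketch reproducing the table’s columns, using `sympy` for the *n*th prime and the Knuth-based operation-count formula quoted above:

```python
from math import log
from sympy import prime

def quicksort_ops(n):
    # Average quicksort operation count 11.67 n ln n - 1.74 n, per the text
    return 11.67*n*log(n) - 1.74*n

for n in [100, 500, 1000, 2000]:
    print(n, round(quicksort_ops(n), 1), 10*prime(n))
```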


The confidence region for the object flares out over time, something like the bell of a trumpet.

**Why does the region get larger**? Because there’s uncertainty in the velocity, and the velocity gets multiplied by elapsed time.

**Why isn’t the confidence region a cone**? Because that would ignore the uncertainty in the initial position. The result would be too small.

**Why isn’t the confidence region a truncated cone**? That’s not a bad approximation, though it’s a bit too large. If we ignore probability for a moment and treat confidence intervals as deterministic limits, then we get a truncated cone. For example, suppose position and velocity are each within two standard deviations of their estimates. Then we’d estimate position at time *t* to be between *x*_{0} – 2σ_{x} + (*v*_{0} – 2σ_{v})*t* and *x*_{0} + 2σ_{x} + (*v*_{0} + 2σ_{v})*t*.

**So what is the confidence region**? It’s somewhere between the cone and the truncated cone.

The position *x* + *t* *v* is the sum of two random variables. The first has variance σ_{x}² and the second has variance *t*² σ_{v}². Variances of independent random variables add, so the standard deviation for the sum is

√(σ_{x}² + *t*² σ_{v}²) = *t* √(σ_{x}² / *t*² + σ_{v}²)

Note that as *t* increases, the latter approaches *t* σ_{v} from above. Ignoring the uncertainty in initial position underestimates standard deviation, but the relative error decreases as *t* increases.

For large *t*, a confidence interval for position at time *t* is approximately proportional to *t*, so the width of the confidence intervals over time looks like a cone. But for small *t*, the dependence on *t* is more curved than linear.
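Here’s a small numeric sketch (with made-up values σ_{x} = 1 and σ_{v} = 0.1, not from the original post) showing the exact standard deviation approaching the cone approximation *t* σ_{v} from above:

```python
from math import sqrt

sigma_x, sigma_v = 1.0, 0.1  # hypothetical uncertainties in position and velocity

def position_sd(t):
    # Standard deviation of x0 + t*v at time t
    return sqrt(sigma_x**2 + t**2 * sigma_v**2)

for t in [1, 10, 100, 1000]:
    cone = t * sigma_v  # cone approximation, ignoring position uncertainty
    print(t, position_sd(t), cone, position_sd(t)/cone)
```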

Clearly it’s *necessary* that one of the coefficients be irrational. What may be surprising is that it is *sufficient*.

If the coefficients are all rational with common denominator *N*, then the sequence would only contain multiples of 1/*N*. The interval [1/(3*N*), 2/(3*N*)], for example, would never get a sample. If *a*_{0} were irrational but the rest of the coefficients were rational, we’d have the same situation, simply shifted by *a*_{0}.

This is a theorem about what happens in the limit, but we can look at what happens for some large but finite set of terms. And we can use a χ^{2} test to see how evenly distributed our sequence is compared to what one would expect from a random sequence.

In the code below, we use the polynomial *p*(*x*) = √2 *x*² + π*x* + 1 and evaluate *p* at 0, 1, 2, …, 99,999. We then count how many fall in the bins [0, 0.01), [0.01, 0.02), … [0.99, 1] and compute a chi-square statistic on the counts.

```python
from math import pi, sqrt, floor

def p(x):
    return 1 + pi*x + sqrt(2)*x*x

def chisq_stat(O, E):
    return sum( [(o - e)**2/e for (o, e) in zip(O, E)] )

def frac(x):
    return x - floor(x)

N = 100000
data = [frac(p(n)) for n in range(N)]

count = [0]*100
for d in data:
    count[ int(floor(100*d)) ] += 1

expected = [N/100]*100

print(chisq_stat(count, expected))
```

We get a chi-square statistic of 95.4. Since we have 100 bins, there are 99 degrees of freedom, and we should compare our statistic to a chi-square distribution with 99 degrees of freedom. Such a distribution has mean 99 and standard deviation √(2 · 99) ≈ 14.07, so a value of 95.4 is completely unremarkable.

If we had gotten a very large chi-square statistic, say 200, we’d have reason to suspect something was wrong. Maybe a misunderstanding on our part or a bug in our software. Or maybe the sequence was not as uniformly distributed as we expected.

If we had gotten a very small chi-square statistic, say 10, then again maybe we misunderstood something, or maybe our sequence is remarkably evenly distributed, more evenly than one would expect from a random sequence.
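To put the statistic 95.4 on the probability scale, we can compute a *p*-value from the chi-square distribution with 99 degrees of freedom (using `scipy`, not part of the original post):

```python
from scipy.stats import chi2

stat, dof = 95.4, 99
p = chi2.sf(stat, dof)  # probability of a statistic this large or larger
print(p)
```

A *p*-value near the middle of (0, 1) is exactly what an unremarkable statistic should produce.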


When is the first digit of 2^{n} equal to *k*? When 2^{n} is between *k* × 10^{p} and (*k*+1) × 10^{p} for some positive integer *p*. By taking logarithms base 10 we find that this is equivalent to the fractional part of *n* log_{10}2 being between log_{10} *k* and log_{10} (*k* + 1).

The map

*x* ↦ ( *x* + log_{10} 2 ) mod 1

is ergodic. I wrote about irrational rotations a few weeks ago, and this is essentially the same thing. You could scale *x* by 2π and think of it as rotations on a circle instead of arithmetic mod 1 on an interval. The important thing is that log_{10} 2 is irrational.

Repeatedly multiplying by 2 corresponds to adding log_{10} 2 on the log scale. So powers of two correspond to iterates of the map above, starting with *x* = 0. Birkhoff’s Ergodic Theorem tells us that the proportion of iterates of this map that fall in the interval [*a*, *b*] equals *b* – *a*. So for *k* = 1, 2, 3, … 9, the proportion of powers of 2 that start with *k* equals log_{10} (*k* + 1) – log_{10} (*k*) = log_{10} ((*k* + 1) / *k*).

This is Benford’s law. In particular, the proportion of powers of 2 that begin with 1 is equal to log_{10} (2) = 0.301.

Note that the only thing special about 2 is that log_{10} 2 is irrational. Powers of 3 follow Benford’s law as well because log_{10} 3 is also irrational. For what values of *b* do powers of *b* not follow Benford’s law? Those with log_{10} *b* rational, i.e. powers of 10. Obviously powers of 10 don’t follow Benford’s law because their first digit is always 1!

[Interpret the “!” above as factorial or exclamation as you wish.]

Let’s look at powers of 2 empirically to see Benford’s law in practice. Here’s a simple program to look at first digits of powers of 2.

```python
count = [0]*10
N = 10000

def first_digit(n):
    return int(str(n)[0])

for i in range(N):
    n = first_digit( 2**i )
    count[n] += 1

print(count)
```

Unfortunately this only works for moderate values of `N`. It ran in under a second with `N` set to 10,000, but for larger values of `N` it rapidly becomes impractical.

Here’s a much more efficient version that ran in about 2 seconds with `N` = 1,000,000.

```python
from math import log10

N = 1000000
count = [0]*10

def first_digit_2_exp_e(e):
    r = (log10(2.0)*e) % 1
    for i in range(2, 11):
        if r < log10(i):
            return i-1

for i in range(N):
    n = first_digit_2_exp_e( i )
    count[n] += 1

print(count)
```

You could make it more efficient by caching the values of `log10` rather than recomputing them. This brought the run time down to about 1.4 seconds. That’s a nice improvement, but nothing like the orders of magnitude improvement from changing algorithms.

Here are the results comparing the actual counts to the predictions of Benford’s law (rounded to the nearest integer).

```
|---------------+--------+-----------|
| Leading digit | Actual | Predicted |
|---------------+--------+-----------|
|             1 | 301030 |    301030 |
|             2 | 176093 |    176091 |
|             3 | 124937 |    124939 |
|             4 |  96911 |     96910 |
|             5 |  79182 |     79181 |
|             6 |  66947 |     66948 |
|             7 |  57990 |     57992 |
|             8 |  51154 |     51153 |
|             9 |  45756 |     45757 |
|---------------+--------+-----------|
```

The agreement is almost too good to believe, never off by more than 2.

Are the results correct? The inefficient version relied on integer arithmetic and so would be exact. The efficient version relies on floating point, so it’s conceivable that limits of precision caused a leading digit to be calculated incorrectly, but I doubt that happened. Floating point is precise to about 15 significant figures. We start with `log10(2)`, multiply it by numbers up to 1,000,000, and take the fractional part. The result is good to around 9 significant figures, enough to correctly determine which logarithms the result falls between.

**Update**: See Andrew Dalke’s Python script in the comments. He shows a way to efficiently use integer arithmetic.

If *a* and *b* are close, they don’t have to be very large for the beta PDF to be approximately normal. (In all the plots below, the solid blue line is the beta distribution and the dashed orange line is the normal distribution with the same mean and variance.)

On the other hand, when *a* and *b* are very different, the beta distribution can be skewed and far from normal. Note that *a* + *b* is the same in the example above and below.

Why the sharp corner above? The beta distribution is only defined on the interval [0, 1] and so the PDF is zero for negative values.

An application came up today that raised an interesting question: What if *a* + *b* is very large, but *a* and *b* are very different? The former works in favor of the normal approximation but the latter works against it.

The application had a low probability of success but a very large number of trials. Specifically, *a* + *b* would be on the order of a million, but *a* would be less than 500. Does the normal approximation hold for these kinds of numbers? Here are some plots to see.

When *a* = 500 the normal approximation is very good. It’s still pretty good when *a* = 50, but not good at all when *a* = 5.
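To put numbers on what the plots show, here is a standard-library sketch of my own, using the same parameters as above (*a* + *b* = 10⁶ and *a* ∈ {5, 50, 500}). It measures the relative error of the normal approximation one standard deviation above the mean.

```python
from math import lgamma, exp, log, sqrt, pi

def beta_pdf(x, a, b):
    # Beta density via log-gamma to avoid overflow for large a, b
    log_B = lgamma(a) + lgamma(b) - lgamma(a + b)
    return exp((a - 1)*log(x) + (b - 1)*log(1 - x) - log_B)

def normal_pdf(x, m, s):
    return exp(-(x - m)**2/(2*s*s))/(s*sqrt(2*pi))

N = 10**6  # a + b, the number of trials
rels = {}
for a in (5, 50, 500):
    b = N - a
    m = a/N                        # beta mean
    s = sqrt(a*b/(N**2*(N + 1)))   # beta standard deviation
    x = m + s                      # evaluate one sd above the mean
    rels[a] = abs(normal_pdf(x, m, s) - beta_pdf(x, a, b))/beta_pdf(x, a, b)
    print(f"a = {a:3d}: relative error {rels[a]:.3f}")
```

The relative error shrinks as *a* grows, consistent with the plots: poor for *a* = 5, tolerable for *a* = 50, small for *a* = 500.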

**Update**: Mike Anderson suggested using the skewness to quantify how far a beta is from normal. Good idea.

The skewness of a beta(*a*, *b*) distribution is

2(*b* − *a*) √(*a* + *b* + 1) / ((*a* + *b* + 2) √(*ab*))

Let *N* = *a* + *b* and assume *N* is large and *a* is small, so that the ratios of *N*, *N* + 2, *b* − *a*, and *N* − *a* to one another are all approximately 1. Then the skewness is approximately 2/√*a*. In other words, once the number of trials is sufficiently large, skewness hardly depends on the number of trials and only depends on the number of successes.
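A quick numerical check of that approximation (my own sketch), comparing the exact skewness to 2/√*a* for *a* + *b* = 10⁶:

```python
from math import sqrt

def beta_skewness(a, b):
    # Exact skewness of a beta(a, b) distribution
    return 2*(b - a)*sqrt(a + b + 1)/((a + b + 2)*sqrt(a*b))

N = 10**6
for a in (5, 50, 500):
    exact = beta_skewness(a, N - a)
    approx = 2/sqrt(a)
    print(f"a = {a:3d}: exact {exact:.4f}, approx {approx:.4f}")
```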

A transformation *f* is measure-preserving if μ(*E*) = μ(*f*^{ –1}(*E*)) for every measurable set *E*. You can read the right side of this equation as “the measure of the set of points that *f* maps into *E*.” You can apply this condition repeatedly to see that the measure of the set of points mapped into *E* after *n* iterations is still just the measure of *E*.

If *X* is a probability space, i.e. μ(*X*) = 1, then you could interpret the definition of measure-preserving to mean that the probability that a point ends up in *E* after *n* iterations is independent of *n*. We’ll illustrate this below with a simulation.

Let *X* be the half-open unit interval (0, 1] and let *f* be the Gauss map, i.e.

*f*(*x*) = 1/*x* − [1/*x*]

where [*z*] is the integer part of *z*. The function *f* is measure-preserving, though not for the usual Lebesgue measure. Instead it preserves the following measure:

μ(*E*) = (1/log 2) ∫_*E* 1/(1 + *x*) *dx*

Let’s take as our set *E* an interval [*a*, *b*] and test via simulation whether the probability of landing in *E* after *n* iterations really is just the measure of *E*, independent of *n*.

We can’t just generate points uniformly in the interval (0, 1]. We have to generate the points so that the probability of a point coming from a set *E* is μ(*E*). That means the PDF of the distribution must be *p*(*x*) = 1 / (log(2) (1 + *x*)). We use the inverse CDF method to generate points with this density: the CDF is *F*(*x*) = log₂(1 + *x*), so if *u* is uniform on [0, 1], then *F*^{ –1}(*u*) = 2^*u* − 1 has the desired distribution.

```python
from math import log, floor
from random import random

def gauss_map(x):
    y = 1.0/x
    return y - floor(y)

# iterate Gauss map n times
def iterated_gauss_map(x, n):
    while n > 0:
        x = gauss_map(x)
        n = n - 1
    return x

# measure mu( [a, b] )
def mu(a, b):
    return (log(1.0+b) - log(1.0+a))/log(2.0)

# generate samples with Prob(x in E) = mu(E)
def sample():
    u = random()
    return 2.0**u - 1.0

def simulate(num_points, num_iterations, a, b):
    count = 0
    for _ in range(num_points):
        x = sample()
        y = iterated_gauss_map(x, num_iterations)
        if a < y < b:
            count += 1
    return count / num_points

# Change these parameters however you like
a, b = 0.1, 0.25
N = 1000000

exact = mu(a, b)
print("Exact probability:", exact)
print("Simulated probability after n iterations")
for n in range(1, 10):
    simulated = simulate(N, n, a, b)
    print("n =", n, ":", simulated)
```

Here’s the output I got:

```
Exact probability: 0.18442457113742736
Simulated probability after n iterations
n = 1 : 0.184329
n = 2 : 0.183969
n = 3 : 0.184233
n = 4 : 0.184322
n = 5 : 0.184439
n = 6 : 0.184059
n = 7 : 0.184602
n = 8 : 0.183877
n = 9 : 0.184834
```

With 1,000,000 samples, the standard error is √(*p*(1 − *p*)/*N*) ≈ 0.0004, so we expect the results to agree to about 3 decimal places, which is what we see above.

**Related post**: Irrational rotations are ergodic.

(A transformation *f* is ergodic if it is measure-preserving and the only sets *E* with *f*^{ –1}(*E*) = *E* are those with measure 0 or full measure. Rational rotations are measure-preserving but not ergodic. The Gauss map above is ergodic.)
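As a footnote to the footnote, here is a small illustration of my own of what ergodicity buys you. By the Birkhoff ergodic theorem, the fraction of a single orbit of an irrational rotation that lands in an interval converges to the interval’s length:

```python
from math import sqrt

alpha = sqrt(2) % 1   # irrational rotation angle
a, b = 0.1, 0.25      # target interval, Lebesgue measure 0.15

x = 0.0
n = 10**6
hits = 0
for _ in range(n):
    x = (x + alpha) % 1   # irrational rotation of the unit interval
    hits += a < x < b
print(hits/n)  # converges to b - a = 0.15
```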