]]>GCHQ in the ’70s, we thought of ourselves as completely Bayesian statisticians. All our data analysis was completely Bayesian, and that was a direct inheritance from Alan Turing. I’m not sure this has ever really been published, but Turing, almost as a sideline during his cryptoanalytic work, reinvented Bayesian statistics for himself. The work against Enigma and other German ciphers was fully Bayesian. …

Bayesian statistics was an

extrememinority discipline in the ’70s. In academia, I only really know of two people who were working majorly in the field, Jimmy Savage … in the States and Dennis Lindley in Britain. And they were regarded as fringe figures in the statistics community. It’s extremely different now. The reason is that Bayesian statisticsworks. So eventually truth will out. There are many, many problems where Bayesian methods are obviously the right thing to do. But in the ’70s we understood that already in Britain in the classified environment.

We looked at the proportion of volume contained near the corners of a hypercube. If you take the set of points within a distance 1/2 of a corner of a hypercube, you could rearrange these points to form a full ball centered one corner of the hypercube. Saying that not much volume is located near the corners is equivalent to saying that the sphere packing that centers spheres at points with integer coordinates is not very dense.

We also looked at centering balls inside hypercubes. This is the same sphere packing as above, just shifting every coordinate by 1/2. So saying that a ball in a box doesn’t take up much volume in high dimensions is another way of saying that the integer lattice sphere packing is not very dense.

How much better can we pack spheres? In 24 dimensions, balls centered inside hypercubes would have density equal to the volume of a ball of radius 1/2, or (π/2)^{12} / 12!. The most dense packing in 24 dimensions, the Leech lattice sphere packing, has a density of π^{12} / 12!, i.e. it is 2^{12} = 4096 times more efficient.

The densest sphere packings have only been proven in dimensions 1, 2, 3, 8, and 24. (The densest regular (lattice) packings are known for dimensions up to 8, but it is conceivable that there exist irregular packings that are more efficient than the most efficient lattice packing.) Dimension 24 is special in numerous ways, and it appears that 24 is a local maximum as far as optimal sphere packing density. How does sphere packing based on a integer lattice compare to the best packing in other high dimensions?

Although optimal packings are not known in high dimensions, upper and lower bounds on packing density are known. If Δ is the optimal sphere packing density in dimension *n*, then we have the following upper and lower bounds for large *n*:

The following plot shows how the integer lattice packing density (solid line) compares to the upper and lower bounds (dashed lines).

The upper and lower bounds come from Sphere Packings, Lattices, and Groups, published in 1998. Perhaps tighter bounds have been found since then.

]]>I recently wrote that the corners of a cube stick out more in high dimensions. You can quantify this by centering a ball at a corner and looking at how much of the ball comes from the cube and how much from surrounding space. That post showed that the proportion of volume near a corner goes down rapidly as dimension increases.

About a year ago I wrote a blog post about how formal methods let you explore corner cases. Along the way I said that most cases are corner cases, i.e. most of the volume is in the corners.

Both posts are correct, but they use the term “corner” differently. That is because there are two ideas of “corner” that are the same in low dimensions but diverge in higher dimensions.

Draw a circle and then draw a square just big enough to contain it. You could say that the area in the middle is the area inside the circle and the corners are everything else. Or you could say that the corners are the regions near a vertex of the square, and the middle is everything else. These two criteria aren’t that different. But in high dimensions they’re vastly different.

The post about pointy corners looked at the proportion of volume near the vertices of the cube. The post about formal methods looked at the proportion of volume not contained in a ball in the middle of the cube. As dimension increases, the former goes to zero and the latter goes to one.

In other words, in high dimensions most of the volume is neither near a vertex nor in a ball in the middle. This gives a hint at why sphere packing is interesting in high dimensions. The next post looks at how the sphere packings implicit in this post compare to the best possible packings.

]]>Here’s another surprise: **corners stick out more in high dimensions**. Hypercubes, for example, become pointier as dimension increases.

How might we quantify this? Think of a pyramid and a flag pole. If you imagine a ball centered at the top of a pyramid, a fair proportion of the volume of the ball contains part of the pyramid. But if you do the same for a flag pole, only a small proportion of the ball contains pole; nearly all the volume of the ball is air.

So one way to quantify how pointy a corner is would be to look at a neighborhood of the corner and measure how much of the neighborhood intersects the solid that the corner is part of. The less volume, the pointier the corner.

Consider a unit square. Put a disk of radius *r* at a corner, with *r* < 1. One quarter of that disk will be inside the square. So the proportion of the square near a particular corner is π*r*²/4, and the proportion of the square near any corner is π*r*².

Now do the analogous exercise for a unit cube. Look at a ball of radius *r* < 1 centered at a corner. One eighth of the volume of that ball contains part of the cube. The proportion of cube’s volume located within a distance *r* of a particular corner is π*r*³/6, and the proportion located within a distance *r* of any corner is 4π*r*³/3.

The corner of a cube sticks out a little more than the corner of a square. 79% of a square is within a distance 0.5 of a corner, while the proportion is 52% for a cube. In that sense, the corners of a cube stick out a little more than the corners of a square.

Now let’s look at a hypercube of dimension *n*. Let *V* be the volume of an *n*-dimensional ball of radius *r* < 1. The proportion of the hypercube’s volume located within a distance *r* of a particular corner is *V* / 2^{n} and the proportion located with a distance *r* of any corner is simply *V*.

The equation for the volume *V* is

If we fix *r* and let *n* vary, this function decreases rapidly as *n* increases.

Saying that corners stick out more in high dimensions is a corollary of the more widely known fact that a ball in a box takes up less and less volume as the dimension of the ball and the box increase.

Let’s set *r* = 1/2 and plot how the volume of a ball varies with dimension *n*.

You could think of this as the volume of a ball sitting inside a unit hypercube, or more relevant to the topic of this post, the proportion of the volume of the hypercube located with a distance 1/2 of a corner.

]]>Suppose you have a sequence *X*_{1}, *X*_{2}, …, *X*_{n} of *n* random variables that each take on the values {-1, 1} with equal probability. You could think of this as a random walk: you start at 0 and take a sequence of steps either to the left or the right.

Let *S*_{n} = *X*_{1} + *X*_{2} + … + *X*_{n} be the sum of the sequence. The expected value of *S*_{n} is 0 by symmetry, but it could be as large as n or as small as –*n*. We want to look at how large |*S*_{n}| is likely to be relative to the sequence length *n*.

Here’s the analytical bound:

(If you’re reading this via email, you probably can’t see the equation. Here’s why and how to fix it.)

Here is a simulation in Python to illustrate the bound.

from random import randint import numpy as np import matplotlib.pyplot as plt n = 400 # random walk length N = 10000 # number of repeated walks reps = np.empty(N) for i in range(N): random_walk = [2*randint(0, 1) - 1 for _ in range(n)] reps[i] = abs(sum(random_walk)) / n plt.hist(reps, bins=41) plt.xlabel("$|S_n|/n$") plt.show()

And here are the results.

]]>What exactly do we mean by “nearly all the area” and “near the equator”? You get to decide. Pick your standard of “nearly all the area,” say 99%, and your definition of “near the equator,” say within 5 degrees. Then it’s always possible to take the dimension high enough that your standards are met. The more demanding your standard, the higher the dimension will need to be, but it’s always possible to pick the dimension high enough.

This result is hard to imagine. Maybe a simulation will help make it more believable.

In the simulation below, we take as our “north pole” the point (1, 0, 0, 0, …, 0). We could pick any unit vector, but this choice is convenient. Our equator is the set of points orthogonal to the pole, i.e. that have first coordinate equal to zero. We draw points randomly from the sphere, compute their latitude (i.e. angle from the equator), and make a histogram of the results.

The area of our planet isn’t particularly concentrated near the equator.

But as we increase the dimension, we see more and more of the simulation points are near the equator.

Here’s the code that produced the graphs.

from scipy.stats import norm from math import sqrt, pi, acos, degrees import matplotlib.pyplot as plt def pt_on_sphere(n): # Return random point on unit sphere in R^n. # Generate n standard normals and normalize length. x = norm.rvs(0, 1, n) length = sqrt(sum(x**2)) return x/length def latitude(x): # Latitude relative to plane with first coordinate zero. angle_to_pole = acos(x[0]) # in radians latitude_from_equator = 0.5*pi - angle_to_pole return degrees( latitude_from_equator ) N = 1000 # number of samples for n in [3, 30, 300, 3000]: # dimension of R^n latitudes = [latitude(pt_on_sphere(n)) for _ in range(N)] plt.hist(latitudes, bins=int(sqrt(N))) plt.xlabel("Latitude in degrees from equator") plt.title("Sphere in dimension {}".format(n)) plt.xlim((-90, 90)) plt.show()

Not only is most of the area near the equator, the amount of area outside a band around the equator decreases very rapidly as you move away from the band. You can see that from the histograms above. They look like a normal (Gaussian) distribution, and in fact we can make that more precise.

If *A* is a band around the equator containing at least half the area, then the proportion of the area a distance *r* or greater from *A* is bound by exp( -(*n*-1)*r*² ). And in fact, this holds for *any* set *A* containing at least half the area; it doesn’t have to be a band around the equator, just any set of large measure.

**Related post**: Willie Sutton and the multivariate normal distribution

This is not to say that MWC is the best generator for every purpose, but any shortcomings of MWC are not apparent from DIEHARDER. The PCG family of generators, for example, is apparently superior to MWC, but you couldn’t necessarily conclude that from these tests.

Unless your application demands more of a random number generator than these tests do, MWC is probably adequate for your application. I wouldn’t recommend it for cryptography or for high-dimensional integration by darts, but it would be fine for many common applications.

George Marsaglia developed the DIEHARD battery of tests in 1995. Physics professor Robert G. Brown later refined and extended Marsaglia’s original test suite to create the DIEHARDER suite. (The name of Marsaglia’s battery of tests was a pun on the Diehard car battery. Brown continued the pun tradition by naming his suite after Die Harder, the sequel to the movie Die Hard.) The DIEHARDER suite is open source. It is designed to be at least as rigorous as the original DIEHARD suite and in some cases more rigorous.

There are 31 distinct kinds of tests in the DIEHARDER suite, but some of these are run multiple times. In total, 114 tests are run.

The main strength of Marsaglia’s MWC algorithm is that it is very short. The heart of the code is only three lines:

m_z = 36969 * (m_z & 65535) + (m_z >> 16); m_w = 18000 * (m_w & 65535) + (m_w >> 16); return (m_z << 16) + m_w;

You can find a full implementation of a random number generator class based in MWC here.

The heart of PCG is also very short, but a bit more mysterious.

rng->state = oldstate * 6364136223846793005ULL + (rng->inc | 1); uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u; uint32_t rot = oldstate >> 59u; return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));

(These are the 64-bit state versions of MWC and PCG. Both have versions based on larger state.)

Because these generators require little code, they’d be relatively easy to step into with a debugger, compared to other RNGs such as Mersenne Twister that require more code and more state.

Out of the 114 DIEHARDER tests run on MWC, all but three returned a pass, and the rest returned a weak pass.

A few weak passes are to be expected. The difference between pass, weak pass, and failure is whether a *p*-value falls below a certain threshold. Theory says that ideally *p*-values would uniformly distributed, and so one would fall outside the region for a strong pass now and then.

Rather than counting strong and weak passes, let’s look at the *p*-values themselves. We’d expect these to be uniformly distributed. Let’s see if they are.

Here are the *p*-values reported by the DIEHARDER tests for MWC:

Here are the corresponding values for PCG:

Neither test has too many small *p*-values, no more than we’d expect. This is normally what we’re concerned about. Too many small *p*-values would indicate that the generated samples are showing behavior that would be rare for truly random input.

But both sets of test results have a surprising number of large *p*-values. Not sure what to make of that. I suspect it says more about the DIEHARDER test suite than the random number generators being tested.

**Update**: I went back to look at some results from Mersenne Twister to see if this pattern of large *p*-values persists there. It does, and in fact the *p*-values are even more biased toward the high end for Mersenne Twister.

One thing I noticed is that the large *p*-values are disproportionately coming from some of the same tests each time. In particular, the repetitions of the`sts_serial`

test have an unexpectedly high number of large *p*-values for each generator.

The result looks like a fractal called the Sierpinski triangle or Sierpinski gasket.

Here’s an example:

If the random number generation is biased, the resulting triangle will show it. In the image below, the lower left corner was chosen with probability 1/2, the top with probability 1/3, and the right corner with probability 1/6.

**Update**: Here’s an animated version that lets you watch the process in action.

Here’s Python code to play the chaos game yourself.

from scipy import sqrt, zeros import matplotlib.pyplot as plt from random import random, randint def midpoint(p, q): return (0.5*(p[0] + q[0]), 0.5*(p[1] + q[1])) # Three corners of an equilateral triangle corner = [(0, 0), (0.5, sqrt(3)/2), (1, 0)] N = 1000 x = zeros(N) y = zeros(N) x[0] = random() y[0] = random() for i in range(1, N): k = randint(0, 2) # random triangle vertex x[i], y[i] = midpoint( corner[k], (x[i-1], y[i-1]) ) plt.scatter(x, y) plt.show()

**Update 2**: Peter Norvig posted some Python code with variations on the game presented here, generalizing a triangle to other shapes. If you try the analogous procedure with a square, you simply get a square filled with random dots.

However, you can get what you might expect, the square analog of the Sierpinski triangle, the product of a Cantor set with itself, if you make a couple modifications. First, pick a *side* at random, not a corner. Second, move 1/3 of the way toward the chosen side, not 1/2 way.

Here’s what I got with these changes:

Source: Chaos and Fractals

]]>The journal article announcing PCG gives the results of testing it with the TestU01 test suite. I wanted to try it out by testing it with the DIEHARDER test suite (Robert G. Brown’s extension of George Marsaglia’s DIEHARD test suite) and the NIST Statistical Test Suite. I used what the generator’s website calls the “minimal C implementation.”

(The preprint of the journal article is dated 2015 but apparently hasn’t been published yet.)

For the NIST test suite, I generated 10,000,000 bits and divided them into 10 streams.

For the DIEHARDER test suite, I generated 800,000,000 unsigned 32-bit integers. (DIEHARDER requires a **lot** of random numbers as input.)

For both test suites I used the seed (`state`

) 20170707105851 and sequence constant (`inc`

) 42.

The PCG generator did well on all the NIST tests. For every test, at least least 9 out of 10 streams passed. The test authors say you should expect at least 8 out of 10 streams to pass.

Here’s an excerpt from the results. You can find the full results here.

-------------------------------------------------------- C1 C2 C3 ... C10 P-VALUE PROPORTION STATISTICAL TEST -------------------------------------------------------- 2 0 2 0 0.213309 10/10 Frequency 0 0 1 3 0.534146 10/10 BlockFrequency 3 0 0 0 0.350485 10/10 CumulativeSums 1 1 0 2 0.350485 10/10 CumulativeSums 0 2 2 1 0.911413 10/10 Runs 0 0 1 1 0.534146 10/10 LongestRun 0 1 2 0 0.739918 10/10 Rank 0 4 0 0 0.122325 10/10 FFT 1 0 0 1 0.000439 10/10 NonOverlappingTemplate ... 2 1 0 0 0.350485 9/10 NonOverlappingTemplate 0 2 1 0 0.739918 10/10 OverlappingTemplate 1 1 0 2 0.911413 10/10 Universal 1 1 0 0 0.017912 10/10 ApproximateEntropy 1 0 1 1 ---- 3/4 RandomExcursions ... 0 0 0 1 ---- 4/4 RandomExcursions 2 0 0 0 ---- 4/4 RandomExcursionsVariant ... 0 0 3 0 ---- 4/4 RandomExcursionsVariant 1 2 3 0 0.350485 9/10 Serial 1 1 1 0 0.739918 10/10 Serial 1 2 0 0 0.911413 10/10 LinearComplexity ...

The DIEHARDER suite has 31 kinds tests, some of which are run many times, making a total of 114 tests. Out of the 114 tests, two returned a weak pass for the PCG input and all the rest passed. A few weak passes are to be expected from running so many tests and so this isn’t a strike against the generator. In fact, it might be suspicious if no tests returned a weak pass.

Here’s an edited version of the results. The full results are here.

#=============================================================================# test_name |ntup| tsamples |psamples| p-value |Assessment #=============================================================================# diehard_birthdays| 0| 100| 100|0.46682782| PASSED diehard_operm5| 0| 1000000| 100|0.83602120| PASSED diehard_rank_32x32| 0| 40000| 100|0.11092547| PASSED diehard_rank_6x8| 0| 100000| 100|0.78938803| PASSED diehard_bitstream| 0| 2097152| 100|0.81624396| PASSED diehard_opso| 0| 2097152| 100|0.95589325| PASSED diehard_oqso| 0| 2097152| 100|0.86171368| PASSED diehard_dna| 0| 2097152| 100|0.24812341| PASSED diehard_count_1s_str| 0| 256000| 100|0.75417270| PASSED diehard_count_1s_byt| 0| 256000| 100|0.25725000| PASSED diehard_parking_lot| 0| 12000| 100|0.59288414| PASSED diehard_2dsphere| 2| 8000| 100|0.79652706| PASSED diehard_3dsphere| 3| 4000| 100|0.14978100| PASSED diehard_squeeze| 0| 100000| 100|0.35356584| PASSED diehard_sums| 0| 100| 100|0.04522121| PASSED diehard_runs| 0| 100000| 100|0.39739835| PASSED diehard_runs| 0| 100000| 100|0.99128296| PASSED diehard_craps| 0| 200000| 100|0.64934221| PASSED diehard_craps| 0| 200000| 100|0.27352733| PASSED marsaglia_tsang_gcd| 0| 10000000| 100|0.10570816| PASSED marsaglia_tsang_gcd| 0| 10000000| 100|0.00267789| WEAK sts_monobit| 1| 100000| 100|0.98166534| PASSED sts_runs| 2| 100000| 100|0.05017630| PASSED sts_serial| 1| 100000| 100|0.95153782| PASSED ... sts_serial| 16| 100000| 100|0.59342390| PASSED rgb_bitdist| 1| 100000| 100|0.50763759| PASSED ... rgb_bitdist| 12| 100000| 100|0.98576422| PASSED rgb_minimum_distance| 2| 10000| 1000|0.23378443| PASSED ... rgb_minimum_distance| 5| 10000| 1000|0.13215367| PASSED rgb_permutations| 2| 100000| 100|0.54142546| PASSED ... rgb_permutations| 5| 100000| 100|0.96040216| PASSED rgb_lagged_sum| 0| 1000000| 100|0.66587166| PASSED ... rgb_lagged_sum| 31| 1000000| 100|0.00183752| WEAK rgb_lagged_sum| 32| 1000000| 100|0.13582393| PASSED rgb_kstest_test| 0| 10000| 1000|0.74708548| PASSED dab_bytedistrib| 0| 51200000| 1|0.30789191| PASSED dab_dct| 256| 50000| 1|0.89665788| PASSED dab_filltree| 32| 15000000| 1|0.67278231| PASSED dab_filltree| 32| 15000000| 1|0.35348003| PASSED dab_filltree2| 0| 5000000| 1|0.18749029| PASSED dab_filltree2| 1| 5000000| 1|0.92600020| PASSED]]>

This post will implement a couple of the simplest tests in Python and show that the generator does surprisingly well.

The linear congruential generator used here starts with an arbitrary seed, then at each step produces a new number by multiplying the previous number by a constant and taking the remainder by 2^{31} – 1. The multiplier constant was chosen to be one of the multipliers recommended in [1].

We’ll need a couple math functions:

from math import sqrt, log

and we need to define the constants for our generator.

# Linear congruence generator (LCG) constants z = 20170705 # seed a = 742938285 # multiplier e = 31 # will need this later m = 2**e -1 # modulus

Next we form a long string of 0’s and 1’s using our generator

# Number of random numbers to generate N = 100000 # Format to print bits, padding with 0's on the left if needed formatstr = "0" + str(e) + "b" bit_string = "" for _ in range(N): z = a*z % m # LCG bit_string += format(z, formatstr)

Next we run a couple tests. First, we count the number of 1’s in our string of bits. We expect about half the bits to be 1’s. We can quantify “about” as within two standard deviations.

def count_ones(string): ones = 0 for i in range(len(string)): if string[i] == '1': ones += 1 return ones ones = count_ones(bit_string) expected = e*N/2 sd = sqrt(0.25*N) print( "Number of 1's: {}".format(ones) ) print( "Expected: {} to {}".format(expected - 2*sd, expected + 2*sd) )

The results are nothing unusual:

Number of 1's: 1550199 Expected: 1549683.8 to 1550316.2

Next we look at the length of the longest runs on 1’s. I’ve written before about the probability of long runs and the code below uses a couple results from that post.

def runs(string): max_run = 0 current_run = 0 for i in range(len(string)): if string[i] == '1': current_run += 1 else: max_run = max(max_run, current_run) current_run = 0 return max_run runlength = runs(bit_string) expected = -log(0.5*e*N)/log(0.5) sd = 1/log(2) print( "Run length: {}".format(runlength) ) print( "Expected: {} to {}".format(expected - 2*sd, expected + 2*sd) )

Again the results are nothing unusual:

Run length: 19 Expected: 17.7 to 23.4

Simple random number generators are adequate for many uses. Some applications, such as high dimensional integration and cryptography, require more sophisticated generators, but sometimes its convenient and sufficient to use something simple. For example, code using the LCG generator above would be easier to debug than code using the Mersenne Twister. The entire state of the LCG is a single number, whereas the Mersenne Twister maintains an internal state of 312 numbers.

One obvious limitation of the LCG used here is that it couldn’t possibly produce more than 2^{31} – 1 values before repeating itself. Since the state only depends on the last value, every time it comes to a given output, the next output will be whatever the next output was the previous time. In fact, [1] shows that it does produce 2^{31} – 1 values before cycling. If the multiplier were not chosen carefully it could have a shorter period.

So our LCG has a period of about two billion values. That’s a lot if you’re writing a little game, for example. But it’s not enough for many scientific applications.

* * *

[1] George S. Fishman and Louis R. Moore III, An exhaustive analysis of multiplicative congruential random number generators with modulus 2^{31} – 1, SIAM Journal of Scientific and Statistical Computing, Vol. 7, no. 1, January 1986.

More precisely, let φ(*n*) equal the log of the least common multiple of the numbers 1, 2, …, *n*. There are theorems that give upper and lower bounds on how far φ(*n*) can be from *n*. We won’t prove or even state these bounds here. See [1] for that. Instead, we’ll show empirically that φ(*n*) is approximately *n*.

Here’s some Python code to plot φ(*n*) over *n*. The ratio jumps up sharply after the first few values of *n*. In the plot below, we chop off the first 20 values of *n*.

from scipy import arange, empty from sympy.core.numbers import ilcm from sympy import log import matplotlib.pyplot as plt N = 5000 x = arange(N) phi = empty(N) M = 1 for n in range(1, N): M = ilcm(n, M) phi[n] = log(M) a = 20 plt.plot(x[a:], phi[a:]/x[a:]) plt.xlabel("$n$") plt.ylabel("$\phi(n) / n$") plt.show()

Here’s the graph this produces.

[1] J. Barkley Rosser and Lowell Schoenfeld. Approximate formulas for some functions of prime numbers. Illinois Journal of Mathematics, Volume 6, Issue 1 (1962), 64-94. (On Project Euclid)

]]>If you subscribe by email, you’ll get an email each morning containing the post(s) from the previous day.

I just noticed a problem with email subscription: it doesn’t show SVG images, at least when reading via Gmail; maybe other email clients display SVG correctly. Here’s what a portion of yesterday’s email looks like in Gmail:

I’ve started using SVG for graphs, equations, and a few other images. The main advantage to SVG is that the images look sharper. Also, you can display the same image file at any resolution; no need to have different versions of the image for display at different sizes. And sometimes SVG files are smaller than their raster counterparts.

There may be a way to have web site visitors see SVG and email subscribers see PNG. If not, email subscribers can click on the link at the top of each post to open it in a browser and see all the images.

By the way, RSS readers handle SVG just fine. At least Digger Reader, the RSS reader I use, works well with SVG. The only problem I see is that centered content is always moved to the left.

* * *

The email newsletter is different from the email blog subscription. I only send out a newsletter once a month. It highlights the most popular posts and says a little about what I’ve been up to. I just sent out a newsletter this morning, so it’ll be another month before the next one comes out.

]]>MCMC (Markov Chain Monte Carlo) gives us a way around this impasse. It lets us draw samples from practically any probability distribution. But there’s a catch: the samples are not independent. This lack of independence means that all the familiar theory on convergence of sums of random variables goes out the window.

There’s not much theory to guide assessing the convergence of sums of MCMC samples, but there are heuristics. One of these is **effective sample size **(ESS). The idea is to have a sort of “exchange rate” between dependent and independent samples. You might want to say, for example, that 1,000 samples from a certain Markov chain are worth about as much as 80 independent samples because the MCMC samples are highly correlated. Or you might want to say that 1,000 samples from a different Markov chain are worth about as much as 300 independent samples because although the MCMC samples are dependent, they’re weakly correlated.

Here’s the definition of ESS:

where *n* is the number of samples and ρ(*k*) is the correlation at lag *k*.

This behaves well in the extremes. If your samples are independent, your effective samples size equals the actual sample size. If the correlation at lag *k* decreases extremely slowly, so slowly that the sum in the denominator diverges, your effective sample size is zero.

Any reasonable Markov chain is between the extremes. Zero lag correlation is too much to hope for, but ideally the correlations die off fast enough that the sum in the denominator not only converges but also isn’t a terribly large value.

I’m not sure who first proposed this definition of ESS. There’s a reference to it in Handbook of Markov Chain Monte Carlo where the authors cite a paper [1] in which Radford Neal mentions it. Neal cites B. D. Ripley [2].

[1] Markov Chain Monte Carlo in Practice: A Roundtable Discussion. Robert E. Kass, Bradley P. Carlin, Andrew Gelman and Radford M. Neal. The American Statistician. Vol. 52, No. 2 (May, 1998), pp. 93-100

[2] Stochlastic Simulation, B. D. Ripley, 1987.

]]>Here’s some data to illustrate this.

|------+-----------------+---------| | n | avg. operations | 10*p(n) | |------+-----------------+---------| | 100 | 5200.2 | 5410 | | 200 | 12018.3 | 12230 | | 300 | 19446.9 | 19870 | | 400 | 27272.2 | 27410 | | 500 | 35392.2 | 35710 | | 600 | 43747.3 | 44090 | | 700 | 52297.8 | 52790 | | 800 | 61015.5 | 61330 | | 900 | 69879.6 | 69970 | | 1000 | 78873.5 | 79190 | | 1100 | 87984.4 | 88310 | | 1200 | 97201.4 | 97330 | | 1300 | 106515.9 | 106570 | | 1400 | 115920.2 | 116570 | | 1500 | 125407.9 | 125530 | | 1600 | 134973.5 | 134990 | | 1700 | 144612.1 | 145190 | | 1800 | 154319.4 | 154010 | | 1900 | 164091.5 | 163810 | | 2000 | 173925.1 | 173890 | |------+-----------------+---------|

The maximum difference between the quicksort and prime columns is about 4%. In the latter half of the table, the maximum error is about 0.4%.

What’s going on here? Why should quicksort be related to prime numbers?!

The real mystery is the prime number theorem, not quicksort. The prime number theorem tells us that the *n*th prime number is approximately *n* log *n*. And the number of operations in an efficient sort is proportional to *n* log *n*. The latter is easier to see than the former.

A lot of algorithms have run time proportional to *n* log *n*: mergesort, heapsort, FFT (Fast Fourier Transform), etc. All these have run time approximately proportional to the *n*th prime.

Now for the fine print. What exactly is the average run time for quicksort? It’s easy to say it’s O(*n* log *n*), but getting more specific requires making assumptions. I used as the average number of operations 11.67 *n* log *n* – 1.74 *n* based on Knuth’s TAOCP, Volume 3. And why 10 times the *n*th prime and not 11.67? I chose 10 to make the example work better. For very large values on *n*, a larger coefficient would work better.

]]>

The confidence region for the object flares out over time, something like the bell of a trumpet.

**Why does the region get larger**? Because there’s uncertainty in the velocity, and the velocity gets multiplied by elapsed time.

**Why isn’t the confidence region a cone**? Because that would ignore the uncertainty in the initial position. The result would be too small.

**Why isn’t the confidence region a truncated cone**? That’s not a bad approximation, though it’s a bit too large. If we ignore probability for a moment and treat confidence intervals as deterministic limits, then we get a truncated cone. For example, suppose assume position and velocity are each within two standard deviations of their estimates. Then we’d estimate position to be between *x*_{0} – 2σ_{x} + (*v_{0}* – 2σ

**So what is the confidence region**? It’s some where between the cone and the truncated cone.

The position *x* + *t* *v* is the sum of two random variables. The first has variance σ_{x}² and the second has variance *t*² σ_{v}². Variances of independent random variables add, so the standard deviation for the sum is

√(σ_{x}² + *t*² σ_{v}²) = *t* √(σ_{x}² / *t*² + σ_{v}²)

Note that as *t* increases, the latter approaches *t* σ_{v} from above. Ignoring the uncertainty in initial position underestimates standard deviation, but the relative error decreases as *t* increases.

For large *t*, a confidence interval for position at time *t* is approximately proportional to *t*, so the width of the confidence intervals over time look like a cone. But from small *t*, the dependence on *t* is less linear and more curved.

Clearly it’s *necessary* that one of the coefficients be irrational. What may be surprising is that it is *sufficient*.

If the coefficients are all rational with common denominator *N*, then the sequence would only contain multiples of 1/*N*. The interval [1/3*N*, 2/3*N*], for example, would never get a sample. If *a*_{0} were irrational but the rest of the coefficients were rational, we’d have the same situation, simply shifted by *a*_{0}.

This is a theorem about what happens in the limit, but we can look at what happens for some large but finite set of terms. And we can use a χ^{2} test to see how evenly our sequence is compared to what one would expect from a random sequence.

In the code below, we use the polynomial *p*(*x*) = √2 *x*² + π*x* + 1 and evaluate *p* at 0, 1, 2, …, 99,999. We then count how many fall in the bins [0, 0.01), [0.01, 0.02), … [0.99, 1] and compute a chi-square statistic on the counts.

from math import pi, sqrt, floor def p(x): return 1 + pi*x + sqrt(2)*x*x def chisq_stat(O, E): return sum( [(o - e)**2/e for (o, e) in zip(O, E)] ) def frac(x): return x - floor(x) N = 100000 data = [frac(p(n)) for n in range(N)] count = [0]*100 for d in data: count[ int(floor(100*d)) ] += 1 expected = [N/100]*100 print(chisq_stat(count, expected))

We get a chi-square statistic of 95.4. Since we have 100 bins, there are 99 degrees of freedom, and we should compare our statistic to a chi-square distribution with 99 degrees of freedom. Such a distribution has mean 99 and standard deviation √(99*2) = 14.07, so a value of 95.4 is completely unremarkable.

If we had gotten a very large chi-square statistic, say 200, we’d have reason to suspect something was wrong. Maybe a misunderstanding on our part or a bug in our software. Or maybe the sequence was not as uniformly distributed as we expected.

If we had gotten a very small chi-square statistic, say 10, then again maybe we misunderstood something, or maybe our sequence is remarkably evenly distributed, more evenly than one would expect from a random sequence.

**Related posts**:

When is the first digit of 2^{n} equal to *k*? When 2^{n} is between *k* × 10^{p} and (*k*+1) × 10^{p} for some positive integer *p*. By taking logarithms base 10 we find that this is equivalent to the fractional part of *n* log_{10}2 being between log_{10} *k* and log_{10} (*k* + 1).

The map

*x* ↦ ( *x* + log_{10 }2 ) mod 1

is ergodic. I wrote about irrational rotations a few weeks ago, and this is essentially the same thing. You could scale *x* by 2π and think of it as rotations on a circle instead of arithmetic mod 1 on an interval. The important thing is that log_{10 }2 is irrational.

Repeatedly multiplying by 2 is corresponds to adding log_{10 }2 on the log scale. So powers of two correspond to iterates of the map above, starting with *x* = 0. Birkhoff’s Ergodic Theorem tells us that the number of times iterates of this map fall in the interval [*a*, *b*] equals *b* – *a*. So for *k* = 1, 2, 3, … 9, the proportion of powers of 2 start with *k* is equal to log_{10} (*k* + 1) – log_{10} (*k*) = log_{10} ((*k* + 1) / *k*).

This is Benford’s law. In particular, the proportion of powers of 2 that begin with 1 is equal to log_{10} (2) = 0.301.

Note that the only thing special about 2 is that log_{10 }2 is irrational. Powers of 3 follow Benford’s law as well because log_{10 }3 is also irrational. For what values of *b* do powers of *b* not follow Benford’s law? Those with log_{10 }*b* rational, i.e. powers of 10. Obviously powers of 10 don’t follow Benford’s law because their first digit is always 1!

[Interpret the “!” above as factorial or exclamation as you wish.]

Let’s look at powers of 2 empirically to see Benford’s law in practice. Here’s a simple program to look at first digits of powers of 2.

count = [0]*10 N = 10000 def first_digit(n): return int(str(n)[0]) for i in range(N): n = first_digit( 2**i ) count[n] += 1 print(count)

Unfortunately this only works for moderate values of `N`

. It ran in under a second with `N`

set to 10,000 but for larger values of `N`

it rapidly becomes impractical.

Here’s a much more efficient version that ran in about 2 seconds with `N`

= 1,000,000.

from math import log10 N = 1000000 count = [0]*10 def first_digit_2_exp_e(e): r = (log10(2.0)*e) % 1 for i in range(2, 11): if r < log10(i): return i-1 for i in range(N): n = first_digit_2_exp_e( i ) count[n] += 1 print(count)

You could make it more efficient by caching the values of `log10`

rather than recomputing them. This brought the run time down to about 1.4 seconds. That’s a nice improvement, but nothing like the orders of magnitude improvement from changing algorithms.

Here are the results comparing the actual counts to the predictions of Benford’s law (rounded to the nearest integer).

|---------------+--------+-----------| | Leading digit | Actual | Predicted | |---------------+--------+-----------| | 1 | 301030 | 301030 | | 2 | 176093 | 176091 | | 3 | 124937 | 124939 | | 4 | 96911 | 96910 | | 5 | 79182 | 79181 | | 6 | 66947 | 66948 | | 7 | 57990 | 57992 | | 8 | 51154 | 51153 | | 9 | 45756 | 45757 | |---------------+--------+-----------|

The agreement is almost too good to believe, never off by more than 2.

Are the results correct? The inefficient version relied on integer arithmetic and so would be exact. The efficient version relies on floating point and so it’s conceivable that limits of precision caused a leading digit to be calculated incorrectly, but I doubt that happened. Floating point is precise to about 15 significant figures. We start with `log10(2)`

, multiply it by numbers up to 1,000,000 and take the fractional part. The result is good to around 9 significant figures, enough to correctly calculate which log digits the result falls between.

**Update**: See Andrew Dalke’s Python script in the comments. He shows a way to efficiently use integer arithmetic.

If *a* and *b* are close, they don’t have to be very large for the beta PDF to be approximately normal. (In all the plots below, the solid blue line is the beta distribution and the dashed orange line is the normal distribution with the same mean and variance.)

On the other hand, when *a* and *b* are very different, the beta distribution can be skewed and far from normal. Note that *a* + *b* is the same in the example above and below.

Why the sharp corner above? The beta distribution is only defined on the interval [0, 1] and so the PDF is zero for negative values.

An application came up today that raised an interesting question: What if *a* + *b* is very large, but *a* and *b* are very different? The former works in favor of the normal approximation but the latter works against it.

The application had a low probability of success but a very large number of trials. Specifically, *a* + *b* would be on the order of a million, but *a* would be less than 500. Does the normal approximation hold for these kinds of numbers? Here are some plots to see.

When *a* = 500 the normal approximation is very good. It’s still pretty good when *a* = 50, but not good at all when *a* = 5.

**Update**: Mike Anderson suggested using the skewness to quantify how far a beta is from normal. Good idea.

The skewness of a beta(*a*, *b*) distribution is

2(*b* – *a*)√(*a* +* b* + 1) / (*a* + *b* + 2) √(*ab*)

Let *N* = *a* + *b* and assume *N* is large and *a* is small, so that *N*, *N* + 2, *b – **a*, and *N* – *a* are all approximately equal in ratios. Then the skewness is approximately 2 /√*a*. In other words, once the number of trials is sufficiently large, sknewness hardly depends on the number of trials and only depends on the number of successes.

You can read the right side of the equation above as “the measure of the set of points that *f* maps into *E*.” You can apply this condition repeatedly to see that the measure of the set of points mapped into *E* after *n* iterations is still just the measure of *E*.

If *X* is a probability space, i.e. μ( *X *) = 1, then you could interpret the definition of measure-preserving to mean that the probability that a point ends up in *E* after *n* iterations is independent of *n*. We’ll illustrate this below with a simulation.

Let *X* be the half-open unit interval (0, 1] and let *f* be the Gauss map, i.e.

where [*z*] is the integer part of *z*. The function *f* is measure-preserving, though not for the usual Lebesgue measure. Instead it preserves the following measure:

Let’s take as our set *E* an interval [*a*, *b*] and test via simulation whether the probability of landing in *E* after *n* iterations really is just the measure of *E*, independent of *n*.

We can’t just generate points uniformly in the interval (0, 1]. We have to generate the points so that the probability of a point coming from a set *E* is μ(*E*). That means the PDF of the distribution must be *p*(*x*) = 1 / (log(2) (1 + *x*)). We use the inverse-CDF method to generate points with this density.

from math import log, floor from random import random def gauss_map(x): y = 1.0/x return y - floor(y) # iterate gauss map n times def iterated_gauss_map(x, n): while n > 0: x = gauss_map(x) n = n - 1 return x # measure mu( [a,b] ) def mu(a, b): return (log(1.0+b) - log(1.0+a))/log(2.0) # Generate samples with Prob(x in E) = mu( E ) def sample(): u = random() return 2.0**u - 1.0 def simulate(num_points, num_iterations, a, b): count = 0 for _ in range(num_points): x = sample() y = iterated_gauss_map(x, num_iterations) if a < y < b: count += 1 return count / num_points # Change these parameters however you like a, b = 0.1, 0.25 N = 1000000 exact = mu(a, b) print("Exact probability:", exact) print("Simulated probability after n iterations") for n in range(1, 10): simulated = simulate(N, n, a, b) print("n =", n, ":", simulated)

Here’s the output I got:

Exact probability: 0.18442457113742736 Simulated probability after n iterations n = 1 : 0.184329 n = 2 : 0.183969 n = 3 : 0.184233 n = 4 : 0.184322 n = 5 : 0.184439 n = 6 : 0.184059 n = 7 : 0.184602 n = 8 : 0.183877 n = 9 : 0.184834

With 1,000,000 samples, we expect the results to be the same to about 3 decimal places, which is what we see above.

**Related post**: Irrational rotations are ergodic.

(A transformation *f* is ergodic if it is measure preserving and the only sets *E* with *f*^{ –1}(*E*) = *E* are those with measure 0 or full measure. Rational rotations are measure-preserving but not ergodic. The Gauss map above is ergodic.)

Here’s an argument from Brian Beckman for using a functional style of programming in a particular circumstance that I find persuasive. The immediate context is Kalman filtering, but it applies to a broad range of mathematical computation.

By writing a Kalman ﬁlter as a functional fold, we can

test code in friendly environmentsand thendeploy identical code with confidence in unfriendly environments. In friendly environments, data are deterministic, static, and present in memory. In unfriendly, real-world environments, data are unpredictable, dynamic, and arrive asynchronously.

If you write the guts of your mathematical processing as a function to be folded over your data, you can isolate the implementation of your algorithm from the things that make code hardest to test, i.e. the data source. You can test your code in a well-controlled “friendly” test environment and deploy exactly the same code into production, i.e. an “unfriendly” environment.

Brian continues:

The flexibility to deploy exactly the code that was tested is especially important for numerical code like filters. Detecting, diagnosing and correcting numerical issues without repeatable data sequences is impractical. Once code is hardened, it can be critical to deploy exactly the same code, to the binary level, in production, because of numerical brittleness. Functional form makes it easy to test and deploy exactly the same code because it minimizes the coupling between code and environment.

I ran into this early on when developing clinical trial methods first for simulation, then for production. Someone would ask whether we were using the same code in production as in simulation.

“Yes we are.”

“*Exactly* the same code?”

“Well, essentially.”

“Essentially” was not good enough. We got to where we would use the exact same binary code for simulation and production, but something analogous to Kalman folding would have gotten us there sooner, and would have made it easier to enforce this separation between numerical code and its environment across applications.

Why is it important to use the exact same binary code in test and production, not just a recompile of the same source code? Brian explains:

Numerical issues can substantially complicate code, and being able to move exactly the same code, without even recompiling, between testing and deployment can make the difference to a successful application. We have seen many cases where differences in compiler flags, let alone differences in architectures, even between different versions of the same CPU family, introduce enough differences in the generated code to cause qualitative differences in the output.

A filter that behaved well in the lab can fail in practice.

Emphasis added, here and in the first quote above.

Note that this post gives an argument for a functional *style of programming*, not necessarily for the use of *functional programming languages*. Whether the numerical core or the application that uses it would best be written in a functional language is a separate discussion.

**Related posts**:

I was looking over my records and tried to see how my work divides into these three areas. In short, it doesn’t.

The boundaries between these areas are fuzzy or arbitrary to begin with, but a few projects fell cleanly into one of the three categories. However, 85% of my income has come from projects that involve a combination of two areas or all three areas.

If you calculate a confidence interval using R, you could say you’re doing math, statistics, and computing. But for the accounting above I’d simply call that statistics. When I say a project uses math and computation, for example, I mean it requires math outside what is typical in programming, and programming outside what is typical in math.

]]>Let *X* be a multivariate Gaussian random variable with mean zero and let *E* and *F* be two symmetric convex sets, both centered at the origin. The Gaussian correlation inequality says that

Prob(*X *in *E* and *F*) ≥ Prob(*X* in *E*) Prob(*X* in *F*).

Here’s a bit of Python code for illustrating the inequality. For symmetric convex sets we take balls of *p*-norm *r* where *p* ≥ 1 and *r* > 0. We could, for example, set one of the values of *p* to 1 to get a cube and set the other *p* to 2 to get a Euclidean ball.

from scipy.stats import norm as gaussian def pnorm(v, p): return sum( [abs(x)**p for x in v] )**(1./p) def simulate(dim, r1, p1, r2, p2, numReps): count_1, count_2, count_both = (0, 0, 0) for _ in range(numReps): x = gaussian.rvs(0, 1, dim) in_1 = (pnorm(x, p1) < r1) in_2 = (pnorm(x, p2) < r2) if in_1: count_1 += 1 if in_2: count_2 += 1 if in_1 and in_2: count_both += 1 print("Prob in both:", count_both / numReps) print("Lower bound: ", count_1*count_2 * numReps**-2) simulate(3, 1, 2, 1, 1, 1000)

When `numReps`

is large, we expect the simulated probability of the intersection to be greater than the simulated lower bound. In the example above, the former was 0.075 and the latter 0.015075, ordered as we’d expect.

If we didn’t know that the theorem has been proven, we could use code like this to try to find counterexamples. Of course a simulation cannot prove or disprove a theorem, but if we found what appeared to be a counterexample, we could see whether it persists with different random number generation seeds and with a large value of `numReps`

. If so, then we could try to establish the inequality analytically. Now that the theorem has been proven we know that we’re not going to find real counterexamples, but the code is only useful as an illustration.

**Related posts**:

Ergodic functions have the property that “the time average equals the space average.” We’ll unpack what that means and illustrate it by simulation.

Suppose we pick a starting point *x* on the circle then repeatedly rotate it by a golden angle. Take an integrable function *f* on the circle and form the average of its values at the sequence of rotations. This is the time average. The space average is the integral of *f* over the circle, divided by the circumference of the circle. The ergodic theorem says that the time average equals the space average, except possibly for a setting of starting values of measure zero.

More generally, let *X* be a measure space (like the unit circle) with measure μ let *T* be an ergodic transformation (like rotating by a golden angle), Then for almost all starting values *x* we have the following:

Let’s do a simulation to see this in practice by running the following Python script.

from scipy import pi, cos from scipy.constants import golden from scipy.integrate import quad golden_angle = 2*pi*golden**-2 def T(x): return (x + golden_angle) % (2*pi) def time_average(x, f, T, n): s = 0 for k in range(n): s += f(x) x = T(x) return s/n def space_average(f): integral = quad(f, 0, 2*pi)[0] return integral / (2*pi) f = lambda x: cos(x)**2 N = 1000000 print( time_average(0, f, T, N) ) print( space_average(f) )

In this case we get 0.49999996 for the time average, and 0.5 for the space average. They’re not the same, but we only used a finite value of *n*; we didn’t take a limit. We should expect the two values to be close because *n* is large, but we shouldn’t expect them to be equal.

**Update**: The code and results were updated to fix a bug pointed out in the comments below. I had written `... % 2*pi`

when I should have written `... % (2*pi)`

. I assumed the modulo operator was lower precedence than multiplication, but it’s not. It was a coincidence that the buggy code was fairly accurate.

A friend of mine, a programmer with decades of experience, recently made a similar error. He’s a Clojure fan but was writing in C or some similar language. He rightfully pointed out that this kind of error simply cannot happen in Clojure. Lisps, including Clojure, don’t have operator precedence because they don’t have operators. They only have functions, and the order in which functions are called is made explicit with parentheses. The Python code `x % 2*pi`

corresponds to `(* (mod x 2) pi)`

in Clojure, and the Python code `x % (2*pi)`

corresponds to `(mod x (* 2 pi))`

.

**Related**: Origin of the word “ergodic”

I had not seen one of these old saxophones until Carlo Burkhardt sent me photos today of a Pierret Modele 5 Tenor Sax from around 1912.

Here’s a closeup of the octave keys.

And here’s a closeup of the bell where you can see the branding.

]]>Musical notes go around in a circle. After 12 half steps we’re back where we started. What would it sound like if we played intervals that went around this circle at golden angles? I’ll include audio files and the code that produced them below.

A golden interval, moving around the music circle by a golden angle, is a little more than a major third. And so a chord made of golden intervals is like an augmented major chord but stretched a bit.

An augmented major triad divides the musical circle exactly in thirds. For example, C E G#. Each note is four half steps, a major third, from the previous note. In terms of a circle, each interval is 120°. Here’s what these notes sound like in succession and as a chord.

(Download)

If we go around the musical circle in golden angles, we get something like an augmented triad but with slightly bigger intervals. In terms of a circle, each note moves 137.5° from the previous note rather than 120°. Whereas an augmented triad goes around the musical circle at 0°, 120°, and 240° degrees, a golden triad goes around 0°, 137.5°, and 275°.

A half step corresponds to 30°, so a golden angle corresponds to a little more than 4.5 half steps. If we start on C, the next note is between E and F, and the next is just a little higher than A.

If we keep going up in golden intervals, we do not return to the note we stared on, unlike a progression of major thirds. In fact, we never get the same note twice because a golden interval is not a rational part of a circle. Four golden angle rotations amount to 412.5°, i.e. 52.5° more than a circle. In terms of music, going up four golden intervals puts us an octave and almost a whole step higher than we stared.

Here’s what a longer progression of golden intervals sounds like. Each note keeps going but decays so you can hear both the sequence of notes and how they sound in harmony. The intention was to create something like playing the notes on a piano with the sustain pedal down.

(Download)

It sounds a little unusual but pleasant, better than I thought it would.

Here’s the Python code that produced the sound files in case you’d like to create your own. You might, for example, experiment by increasing or decreasing the decay rate. Or you might try using richer tones than just pure sine waves.

from scipy.constants import golden import numpy as np from scipy.io.wavfile import write N = 44100 # samples per second # Inputs: # frequency in cycles per second # duration in seconds # decay = half-life in seconds def tone(freq, duration, decay=0): t = np.arange(0, duration, 1.0/N) y = np.sin(freq*2*np.pi*t) if decay > 0: k = np.log(2) / decay y *= np.exp(-k*t) return y # Scale signal to 16-bit integers, # values between -2^15 and 2^15 - 1. def to_integer(signal): signal /= max(abs(signal)) return np.int16(signal*(2**15 - 1)) C = 262 # middle C frequency in Hz # Play notes sequentially then as a chord def arpeggio(n, interval): y = np.zeros((n+1)*N) for i in range(n): x = tone(C * interval**i, 1) y[i*N : (i+1)*N] = x y[n*N : (n+1)*N] += x return y # Play notes sequentially with each sustained def bell(n, interval, decay): y = np.zeros(n*N) for i in range(n): x = tone(C * interval**i, n-i, decay) y[i*N:] += x return y major_third = 2**(1/3) golden_interval = 2**(golden**-2) write("augmented.wav", N, to_integer(arpeggio(3, major_third))) write("golden_triad.wav", N, to_integer(arpeggio(3, golden_interval))) write("bell.wav", N, to_integer(bell(9, golden_interval, 6)))]]>

You can’t rotate RGB values per se, but you can rotate hues. So my initial idea was to convert RGB to HSV or HSL, rotate the H component, then convert back to RGB. There are some subtleties with converting between RGB and either HSV or HSL, but I’m willing to ignore those for now.

The problem I ran into was that my idea of a color wheel doesn’t match the HSV or HSL color wheels. For example, I’m thinking of green as the complementary color to red, the color 180° away. On the HSV and HSL color wheels, the complementary color to red is cyan. The color wheel I have in mind is the “artist’s color wheel” based on the RYB color space, not RGB. Subtractive color, not additive.

This brings up several questions.

- How do you convert back and forth between RYB and RGB?
- How do you describe the artist’s color wheel mathematically, in RYB or any other system?
- What is a good reference on color theory? I’d like to understand in detail how the various color systems relate, something that spans the gamut (pun intended) from an artist’s perspective down to physics and physiology.

I see adjoint functors.

How often do you see them?

All the time. They’re everywhere. pic.twitter.com/6PkGJ9wP4A

— Paul Phillips (@contrarivariant) May 27, 2017

Mashup of Saunders Mac Lane’s quip “Adjoint functors arise everywhere” and Haley Joel Osment’s famous line from Sixth Sense.

**Related**: Applied category theory

I ran across the above quote from Hamming this morning. It made me wonder whether I tried to prepare students for my past when I used to teach college students.

How do you prepare a student for the future? Mostly by focusing on skills that will always be useful, even as times change: logic, clear communication, diligence, etc.

Negative forecasting is more reliable here than positive forecasting. It’s hard to predict what’s going to be in demand in the future (besides timeless skills), but it’s easier to predict what’s probably not going to be in demand. The latter aligns with Hamming’s exhortation not to prepare students for your past.

]]>