Clearly it’s *necessary* that one of the coefficients be irrational. What may be surprising is that it is *sufficient*.

If the coefficients are all rational with common denominator *N*, then the sequence would only contain multiples of 1/*N*. The interval [1/(3*N*), 2/(3*N*)], for example, would never get a sample. If *a*_{0} were irrational but the rest of the coefficients were rational, we’d have the same situation, simply shifted by *a*_{0}.

This is a theorem about what happens in the limit, but we can look at what happens for some large but finite number of terms. And we can use a χ^{2} test to see how evenly distributed our sequence is compared to what one would expect from a random sequence.

In the code below, we use the polynomial *p*(*x*) = √2 *x*² + π*x* + 1 and evaluate *p* at 0, 1, 2, …, 99,999. We then count how many fall in the bins [0, 0.01), [0.01, 0.02), … [0.99, 1] and compute a chi-square statistic on the counts.

```python
from math import pi, sqrt, floor

def p(x):
    return 1 + pi*x + sqrt(2)*x*x

def chisq_stat(O, E):
    return sum( [(o - e)**2/e for (o, e) in zip(O, E)] )

def frac(x):
    return x - floor(x)

N = 100000
data = [frac(p(n)) for n in range(N)]

count = [0]*100
for d in data:
    count[ int(floor(100*d)) ] += 1

expected = [N/100]*100

print(chisq_stat(count, expected))
```

We get a chi-square statistic of 95.4. Since we have 100 bins, there are 99 degrees of freedom, and we should compare our statistic to a chi-square distribution with 99 degrees of freedom. Such a distribution has mean 99 and standard deviation √(2 × 99) ≈ 14.07, so a value of 95.4 is completely unremarkable.
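As a sanity check on those numbers, here’s a quick look (using SciPy, which the rest of this post doesn’t require) at where 95.4 falls in that distribution:

```python
from scipy.stats import chi2

# chi-square distribution with 99 degrees of freedom
dist = chi2(99)

print(dist.mean())     # 99.0
print(dist.std())      # ≈ 14.07
print(dist.cdf(95.4))  # around 0.4: well inside the bulk of the distribution
```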

If we had gotten a very large chi-square statistic, say 200, we’d have reason to suspect something was wrong. Maybe a misunderstanding on our part or a bug in our software. Or maybe the sequence was not as uniformly distributed as we expected.

If we had gotten a very small chi-square statistic, say 10, then again maybe we misunderstood something, or maybe our sequence is remarkably evenly distributed, more evenly than one would expect from a random sequence.

**Related posts**:

When is the first digit of 2^{n} equal to *k*? When 2^{n} is between *k* × 10^{p} and (*k*+1) × 10^{p} for some nonnegative integer *p*. By taking logarithms base 10 we find that this is equivalent to the fractional part of *n* log_{10} 2 being between log_{10} *k* and log_{10} (*k* + 1).
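To see that the criterion is stated correctly, here’s a quick check of it against direct integer arithmetic (a sketch; `digit_from_fraction` is a name made up for illustration):

```python
from math import log10, floor

def first_digit(n):
    return int(str(n)[0])

# First digit of 2^n recovered from the fractional part of n*log10(2)
def digit_from_fraction(n):
    r = (n * log10(2)) % 1
    return int(floor(10**r))

# Skip n <= 3, where 2^n sits exactly on a bin boundary and floating
# point rounding could tip the result either way.
for n in range(4, 200):
    assert first_digit(2**n) == digit_from_fraction(n)
print("criterion checks out for n = 4, ..., 199")
```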

The map

*x* ↦ ( *x* + log_{10} 2 ) mod 1

is ergodic. I wrote about irrational rotations a few weeks ago, and this is essentially the same thing. You could scale *x* by 2π and think of it as rotations on a circle instead of arithmetic mod 1 on an interval. The important thing is that log_{10} 2 is irrational.

Repeatedly multiplying by 2 corresponds to adding log_{10} 2 on the log scale. So powers of two correspond to iterates of the map above, starting with *x* = 0. Birkhoff’s Ergodic Theorem tells us that the long-run proportion of iterates of this map that fall in the interval [*a*, *b*] equals *b* – *a*. So for *k* = 1, 2, 3, … 9, the proportion of powers of 2 that start with *k* is equal to log_{10} (*k* + 1) – log_{10} (*k*) = log_{10} ((*k* + 1) / *k*).

This is Benford’s law. In particular, the proportion of powers of 2 that begin with 1 is equal to log_{10} (2) = 0.301.

Note that the only thing special about 2 is that log_{10} 2 is irrational. Powers of 3 follow Benford’s law as well because log_{10} 3 is also irrational. For what values of *b* do powers of *b* not follow Benford’s law? Those with log_{10} *b* rational, i.e. powers of 10. Obviously powers of 10 don’t follow Benford’s law because their first digit is always 1!

[Interpret the “!” above as factorial or exclamation as you wish.]

Let’s look at powers of 2 empirically to see Benford’s law in practice. Here’s a simple program to look at first digits of powers of 2.

```python
count = [0]*10
N = 10000

def first_digit(n):
    return int(str(n)[0])

for i in range(N):
    n = first_digit( 2**i )
    count[n] += 1

print(count)
```

Unfortunately this only works for moderate values of `N`. It ran in under a second with `N` set to 10,000, but for larger values of `N` it rapidly becomes impractical.

Here’s a much more efficient version that ran in about 2 seconds with `N` = 1,000,000.

```python
from math import log10

N = 1000000
count = [0]*10

def first_digit_2_exp_e(e):
    r = (log10(2.0)*e) % 1
    for i in range(2, 11):
        if r < log10(i):
            return i-1

for i in range(N):
    n = first_digit_2_exp_e( i )
    count[n] += 1

print(count)
```

You could make it more efficient by caching the values of `log10` rather than recomputing them. This brought the run time down to about 1.4 seconds. That’s a nice improvement, but nothing like the orders of magnitude improvement from changing algorithms.
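For the record, here’s one way the caching might look. This sketch precomputes the bin boundaries once and, while we’re at it, replaces the linear scan with a binary search:

```python
from math import log10
from bisect import bisect

# log10 of the digits 2 through 9: the bin boundaries, computed once
cutoffs = [log10(d) for d in range(2, 10)]

def first_digit_2_exp_e(e):
    r = (log10(2.0) * e) % 1
    return bisect(cutoffs, r) + 1

print(first_digit_2_exp_e(13))  # 2^13 = 8192, so 8
```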

Here are the results comparing the actual counts to the predictions of Benford’s law (rounded to the nearest integer).

|---------------+--------+-----------|
| Leading digit | Actual | Predicted |
|---------------+--------+-----------|
|             1 | 301030 |    301030 |
|             2 | 176093 |    176091 |
|             3 | 124937 |    124939 |
|             4 |  96911 |     96910 |
|             5 |  79182 |     79181 |
|             6 |  66947 |     66948 |
|             7 |  57990 |     57992 |
|             8 |  51154 |     51153 |
|             9 |  45756 |     45757 |
|---------------+--------+-----------|

The agreement is almost too good to believe, never off by more than 2.

Are the results correct? The inefficient version relied on integer arithmetic and so would be exact. The efficient version relies on floating point, so it’s conceivable that limits of precision caused a leading digit to be calculated incorrectly, but I doubt that happened. Floating point is precise to about 15 significant figures. We start with `log10(2)`, multiply it by numbers up to 1,000,000, and take the fractional part. The result is good to around 9 significant figures, enough to correctly determine which interval [log_{10} *k*, log_{10} (*k* + 1)) the result falls in.

**Update**: See Andrew Dalke’s Python script in the comments. He shows a way to efficiently use integer arithmetic.

If *a* and *b* are close, they don’t have to be very large for the beta PDF to be approximately normal. (In all the plots below, the solid blue line is the beta distribution and the dashed orange line is the normal distribution with the same mean and variance.)

On the other hand, when *a* and *b* are very different, the beta distribution can be skewed and far from normal. Note that *a* + *b* is the same in the example above and below.

Why the sharp corner above? The beta distribution is only defined on the interval [0, 1] and so the PDF is zero for negative values.

An application came up today that raised an interesting question: What if *a* + *b* is very large, but *a* and *b* are very different? The former works in favor of the normal approximation but the latter works against it.

The application had a low probability of success but a very large number of trials. Specifically, *a* + *b* would be on the order of a million, but *a* would be less than 500. Does the normal approximation hold for these kinds of numbers? Here are some plots to see.

When *a* = 500 the normal approximation is very good. It’s still pretty good when *a* = 50, but not good at all when *a* = 5.

**Update**: Mike Anderson suggested using the skewness to quantify how far a beta is from normal. Good idea.

The skewness of a beta(*a*, *b*) distribution is

2(*b* – *a*) √(*a* + *b* + 1) / ((*a* + *b* + 2) √(*ab*))

Let *N* = *a* + *b* and assume *N* is large and *a* is small, so that *N*, *N* + 2, *b* – *a*, and *N* – *a* are all approximately equal, i.e. their pairwise ratios are near 1. Then the skewness is approximately 2/√*a*. In other words, once the number of trials is sufficiently large, skewness hardly depends on the number of trials and only depends on the number of successes.
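Here’s a quick numerical check of that approximation, holding *a* + *b* at a million as in the plots above:

```python
from math import sqrt

# exact skewness of a beta(a, b) distribution
def beta_skewness(a, b):
    return 2*(b - a)*sqrt(a + b + 1) / ((a + b + 2)*sqrt(a*b))

# compare the exact skewness to the approximation 2/sqrt(a)
for a in [5, 50, 500]:
    b = 10**6 - a
    print(a, beta_skewness(a, b), 2/sqrt(a))
```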

μ(*E*) = μ(*f*^{–1}(*E*)).

You can read the right side of the equation above as “the measure of the set of points that *f* maps into *E*.” You can apply this condition repeatedly to see that the measure of the set of points mapped into *E* after *n* iterations is still just the measure of *E*.

If *X* is a probability space, i.e. μ(*X*) = 1, then you could interpret the definition of measure-preserving to mean that the probability that a point ends up in *E* after *n* iterations is independent of *n*. We’ll illustrate this below with a simulation.

Let *X* be the half-open unit interval (0, 1] and let *f* be the Gauss map, i.e.

*f*(*x*) = 1/*x* – [1/*x*]

where [*z*] is the integer part of *z*. The function *f* is measure-preserving, though not for the usual Lebesgue measure. Instead it preserves the following measure:
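That measure is the Gauss measure, which weights a set *A* by

```latex
\mu(A) = \frac{1}{\log 2} \int_A \frac{dx}{1 + x}
```

The factor 1/log 2 normalizes μ so that μ((0, 1]) = 1, matching the `mu` function in the code below.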

Let’s take as our set *E* an interval [*a*, *b*] and test via simulation whether the probability of landing in *E* after *n* iterations really is just the measure of *E*, independent of *n*.

We can’t just generate points uniformly in the interval (0, 1]. We have to generate the points so that the probability of a point coming from a set *E* is μ(*E*). That means the PDF of the distribution must be *p*(*x*) = 1 / (log(2) (1 + *x*)). Integrating this PDF gives the CDF *F*(*x*) = log(1 + *x*) / log(2), so the inverse-CDF method generates points with this density by returning 2^{*u*} – 1 for *u* uniform on [0, 1].

```python
from math import log, floor
from random import random

def gauss_map(x):
    y = 1.0/x
    return y - floor(y)

# iterate Gauss map n times
def iterated_gauss_map(x, n):
    while n > 0:
        x = gauss_map(x)
        n = n - 1
    return x

# measure mu( [a, b] )
def mu(a, b):
    return (log(1.0+b) - log(1.0+a))/log(2.0)

# Generate samples with Prob(x in E) = mu(E)
def sample():
    u = random()
    return 2.0**u - 1.0

def simulate(num_points, num_iterations, a, b):
    count = 0
    for _ in range(num_points):
        x = sample()
        y = iterated_gauss_map(x, num_iterations)
        if a < y < b:
            count += 1
    return count / num_points

# Change these parameters however you like
a, b = 0.1, 0.25
N = 1000000

exact = mu(a, b)
print("Exact probability:", exact)
print("Simulated probability after n iterations")
for n in range(1, 10):
    simulated = simulate(N, n, a, b)
    print("n =", n, ":", simulated)
```

Here’s the output I got:

```
Exact probability: 0.18442457113742736
Simulated probability after n iterations
n = 1 : 0.184329
n = 2 : 0.183969
n = 3 : 0.184233
n = 4 : 0.184322
n = 5 : 0.184439
n = 6 : 0.184059
n = 7 : 0.184602
n = 8 : 0.183877
n = 9 : 0.184834
```

With 1,000,000 samples, we expect the results to be the same to about 3 decimal places, which is what we see above.
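That “3 decimal places” claim can be quantified: the simulated proportions are binomial, so their standard error is roughly

```python
from math import sqrt

p = 0.1844   # approximate probability of landing in [a, b]
N = 1000000  # number of sample points

print(sqrt(p*(1 - p)/N))  # ≈ 0.0004
```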

**Related post**: Irrational rotations are ergodic.

(A transformation *f* is ergodic if it is measure preserving and the only sets *E* with *f*^{ –1}(*E*) = *E* are those with measure 0 or full measure. Rational rotations are measure-preserving but not ergodic. The Gauss map above is ergodic.)

Here’s an argument from Brian Beckman for using a functional style of programming in a particular circumstance that I find persuasive. The immediate context is Kalman filtering, but it applies to a broad range of mathematical computation.

By writing a Kalman filter as a functional fold, we can *test code in friendly environments* and then *deploy identical code* with confidence in unfriendly environments. In friendly environments, data are deterministic, static, and present in memory. In unfriendly, real-world environments, data are unpredictable, dynamic, and arrive asynchronously.

If you write the guts of your mathematical processing as a function to be folded over your data, you can isolate the implementation of your algorithm from the things that make code hardest to test, i.e. the data source. You can test your code in a well-controlled “friendly” test environment and deploy exactly the same code into production, i.e. an “unfriendly” environment.
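The shape Brian has in mind can be sketched in a few lines. This toy example folds a running-mean “filter” over observations; it’s a stand-in for a Kalman update, chosen to show the structure rather than the filter itself:

```python
from functools import reduce

# state is (count, mean); each step folds in one observation
def step(state, z):
    n, mean = state
    n += 1
    return (n, mean + (z - mean)/n)

def run(observations, initial=(0, 0.0)):
    return reduce(step, observations, initial)

# the same `step` runs unchanged on a static test list or a live stream
print(run([1.0, 2.0, 3.0, 4.0]))  # (4, 2.5)
```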

Brian continues:

The flexibility to deploy exactly the code that was tested is especially important for numerical code like filters. Detecting, diagnosing and correcting numerical issues without repeatable data sequences is impractical. Once code is hardened, it can be critical to deploy exactly the same code, to the binary level, in production, because of numerical brittleness. Functional form makes it easy to test and deploy exactly the same code because it minimizes the coupling between code and environment.

I ran into this early on when developing clinical trial methods first for simulation, then for production. Someone would ask whether we were using the same code in production as in simulation.

“Yes we are.”

“*Exactly* the same code?”

“Well, essentially.”

“Essentially” was not good enough. We got to where we would use the exact same binary code for simulation and production, but something analogous to Kalman folding would have gotten us there sooner, and would have made it easier to enforce this separation between numerical code and its environment across applications.

Why is it important to use the exact same binary code in test and production, not just a recompile of the same source code? Brian explains:

Numerical issues can substantially complicate code, and being able to move exactly the same code, without even recompiling, between testing and deployment can make the difference to a successful application. We have seen many cases where differences in compiler flags, let alone differences in architectures, even between different versions of the same CPU family, introduce enough differences in the generated code to cause qualitative differences in the output.

A filter that behaved well in the lab can fail in practice.

Emphasis added, here and in the first quote above.

Note that this post gives an argument for a functional *style of programming*, not necessarily for the use of *functional programming languages*. Whether the numerical core or the application that uses it would best be written in a functional language is a separate discussion.

**Related posts**:

I was looking over my records and tried to see how my work divides into these three areas. In short, it doesn’t.

The boundaries between these areas are fuzzy or arbitrary to begin with, but a few projects fell cleanly into one of the three categories. However, 85% of my income has come from projects that involve a combination of two areas or all three areas.

If you calculate a confidence interval using R, you could say you’re doing math, statistics, and computing. But for the accounting above I’d simply call that statistics. When I say a project uses math and computation, for example, I mean it requires math outside what is typical in programming, and programming outside what is typical in math.

Let *X* be a multivariate Gaussian random variable with mean zero and let *E* and *F* be two symmetric convex sets, both centered at the origin. The Gaussian correlation inequality says that

Prob(*X* in *E* and *F*) ≥ Prob(*X* in *E*) Prob(*X* in *F*).

Here’s a bit of Python code for illustrating the inequality. For symmetric convex sets we take balls of *p*-norm *r* where *p* ≥ 1 and *r* > 0. We could, for example, set one of the values of *p* to 1 to get a cube and set the other *p* to 2 to get a Euclidean ball.

```python
from scipy.stats import norm as gaussian

def pnorm(v, p):
    return sum( [abs(x)**p for x in v] )**(1./p)

def simulate(dim, r1, p1, r2, p2, numReps):
    count_1, count_2, count_both = (0, 0, 0)
    for _ in range(numReps):
        x = gaussian.rvs(0, 1, dim)
        in_1 = (pnorm(x, p1) < r1)
        in_2 = (pnorm(x, p2) < r2)
        if in_1:
            count_1 += 1
        if in_2:
            count_2 += 1
        if in_1 and in_2:
            count_both += 1
    print("Prob in both:", count_both / numReps)
    print("Lower bound: ", count_1*count_2 * numReps**-2)

simulate(3, 1, 2, 1, 1, 1000)
```

When `numReps` is large, we expect the simulated probability of the intersection to be greater than the simulated lower bound. In the example above, the former was 0.075 and the latter 0.015075, ordered as we’d expect.

If we didn’t know that the theorem has been proven, we could use code like this to try to find counterexamples. Of course a simulation cannot prove or disprove a theorem, but if we found what appeared to be a counterexample, we could see whether it persists with different random number generation seeds and with a large value of `numReps`. If so, then we could try to establish the inequality analytically. Now that the theorem has been proven we know that we’re not going to find real counterexamples, and the code serves only as an illustration.

**Related posts**:

Ergodic functions have the property that “the time average equals the space average.” We’ll unpack what that means and illustrate it by simulation.

Suppose we pick a starting point *x* on the circle then repeatedly rotate it by a golden angle. Take an integrable function *f* on the circle and form the average of its values at the sequence of rotations. This is the time average. The space average is the integral of *f* over the circle, divided by the circumference of the circle. The ergodic theorem says that the time average equals the space average, except possibly for a set of starting values of measure zero.

More generally, let *X* be a measure space (like the unit circle) with measure μ, and let *T* be an ergodic transformation (like rotating by a golden angle). Then for almost all starting values *x* we have the following:

Let’s do a simulation to see this in practice by running the following Python script.

```python
from numpy import pi, cos   # current SciPy no longer re-exports pi and cos
from scipy.constants import golden
from scipy.integrate import quad

golden_angle = 2*pi*golden**-2

def T(x):
    return (x + golden_angle) % (2*pi)

def time_average(x, f, T, n):
    s = 0
    for k in range(n):
        s += f(x)
        x = T(x)
    return s/n

def space_average(f):
    integral = quad(f, 0, 2*pi)[0]
    return integral / (2*pi)

f = lambda x: cos(x)**2
N = 1000000

print( time_average(0, f, T, N) )
print( space_average(f) )
```

In this case we get 0.49999996 for the time average, and 0.5 for the space average. They’re not the same, but we only used a finite value of *n*; we didn’t take a limit. We should expect the two values to be close because *n* is large, but we shouldn’t expect them to be equal.

**Update**: The code and results were updated to fix a bug pointed out in the comments below. I had written `... % 2*pi` when I should have written `... % (2*pi)`. I assumed the modulo operator had lower precedence than multiplication, but it doesn’t. It was a coincidence that the buggy code was fairly accurate.

A friend of mine, a programmer with decades of experience, recently made a similar error. He’s a Clojure fan but was writing in C or some similar language. He rightfully pointed out that this kind of error simply cannot happen in Clojure. Lisps, including Clojure, don’t have operator precedence because they don’t have operators. They only have functions, and the order in which functions are called is made explicit with parentheses. The Python code `x % 2*pi` corresponds to `(* (mod x 2) pi)` in Clojure, and the Python code `x % (2*pi)` corresponds to `(mod x (* 2 pi))`.

**Related**: Origin of the word “ergodic”

I had not seen one of these old saxophones until Carlo Burkhardt sent me photos today of a Pierret Modele 5 Tenor Sax from around 1912.

Here’s a closeup of the octave keys.

And here’s a closeup of the bell where you can see the branding.

Musical notes go around in a circle. After 12 half steps we’re back where we started. What would it sound like if we played intervals that went around this circle at golden angles? I’ll include audio files and the code that produced them below.

A golden interval, moving around the music circle by a golden angle, is a little more than a major third. And so a chord made of golden intervals is like an augmented major chord but stretched a bit.

An augmented major triad divides the musical circle exactly in thirds. For example, C E G#. Each note is four half steps, a major third, from the previous note. In terms of a circle, each interval is 120°. Here’s what these notes sound like in succession and as a chord.

(Download)

If we go around the musical circle in golden angles, we get something like an augmented triad but with slightly bigger intervals. In terms of a circle, each note moves 137.5° from the previous note rather than 120°. Whereas an augmented triad goes around the musical circle at 0°, 120°, and 240°, a golden triad goes around at 0°, 137.5°, and 275°.

A half step corresponds to 30°, so a golden angle corresponds to a little more than 4.5 half steps. If we start on C, the next note is between E and F, and the next is just a little higher than A.

If we keep going up in golden intervals, we never return to the note we started on, unlike a progression of major thirds. In fact, we never get the same note twice because a golden interval is not a rational part of a circle. Three golden angle rotations amount to 412.5°, i.e. 52.5° more than a full circle. In terms of music, going up three golden intervals puts us an octave and almost a whole step higher than we started.
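The positions described above are easy to compute. Working in half steps mod 12, with C = 0:

```python
from scipy.constants import golden

step = 12 * golden**-2   # a golden interval, about 4.58 half steps
notes = [(i*step) % 12 for i in range(5)]
print([round(n, 2) for n in notes])
# 0 is C, 4.58 is between E and F, 9.17 is just above A,
# and 1.75 is almost a whole step above C (an octave up)
```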

Here’s what a longer progression of golden intervals sounds like. Each note keeps going but decays so you can hear both the sequence of notes and how they sound in harmony. The intention was to create something like playing the notes on a piano with the sustain pedal down.

(Download)

It sounds a little unusual but pleasant, better than I thought it would.

Here’s the Python code that produced the sound files in case you’d like to create your own. You might, for example, experiment by increasing or decreasing the decay rate. Or you might try using richer tones than just pure sine waves.

```python
from scipy.constants import golden
import numpy as np
from scipy.io.wavfile import write

N = 44100 # samples per second

# Inputs:
#     frequency in cycles per second
#     duration in seconds
#     decay = half-life in seconds
def tone(freq, duration, decay=0):
    t = np.arange(0, duration, 1.0/N)
    y = np.sin(freq*2*np.pi*t)
    if decay > 0:
        k = np.log(2) / decay
        y *= np.exp(-k*t)
    return y

# Scale signal to 16-bit integers,
# values between -2^15 and 2^15 - 1.
def to_integer(signal):
    signal /= max(abs(signal))
    return np.int16(signal*(2**15 - 1))

C = 262 # middle C frequency in Hz

# Play notes sequentially then as a chord
def arpeggio(n, interval):
    y = np.zeros((n+1)*N)
    for i in range(n):
        x = tone(C * interval**i, 1)
        y[i*N : (i+1)*N] = x
        y[n*N : (n+1)*N] += x
    return y

# Play notes sequentially with each sustained
def bell(n, interval, decay):
    y = np.zeros(n*N)
    for i in range(n):
        x = tone(C * interval**i, n-i, decay)
        y[i*N:] += x
    return y

major_third = 2**(1/3)
golden_interval = 2**(golden**-2)

write("augmented.wav", N, to_integer(arpeggio(3, major_third)))
write("golden_triad.wav", N, to_integer(arpeggio(3, golden_interval)))
write("bell.wav", N, to_integer(bell(9, golden_interval, 6)))
```

You can’t rotate RGB values per se, but you can rotate hues. So my initial idea was to convert RGB to HSV or HSL, rotate the H component, then convert back to RGB. There are some subtleties with converting between RGB and either HSV or HSL, but I’m willing to ignore those for now.

The problem I ran into was that my idea of a color wheel doesn’t match the HSV or HSL color wheels. For example, I’m thinking of green as the complementary color to red, the color 180° away. On the HSV and HSL color wheels, the complementary color to red is cyan. The color wheel I have in mind is the “artist’s color wheel” based on the RYB color space, not RGB. Subtractive color, not additive.
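For reference, here’s the initial idea in code, using Python’s standard `colorsys` module. It shows the behavior described above: on the HSV wheel, 180° from red lands on cyan:

```python
import colorsys

# rotate the hue of an RGB color (components in [0, 1]) by the given angle
def rotate_hue(rgb, degrees):
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    h = (h + degrees/360) % 1
    return colorsys.hsv_to_rgb(h, s, v)

print(rotate_hue((1.0, 0.0, 0.0), 180))  # cyan: (0.0, 1.0, 1.0)
```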

This brings up several questions.

- How do you convert back and forth between RYB and RGB?
- How do you describe the artist’s color wheel mathematically, in RYB or any other system?
- What is a good reference on color theory? I’d like to understand in detail how the various color systems relate, something that spans the gamut (pun intended) from an artist’s perspective down to physics and physiology.

I see adjoint functors.

How often do you see them?

All the time. They’re everywhere. pic.twitter.com/6PkGJ9wP4A

— Paul Phillips (@contrarivariant) May 27, 2017

Mashup of Saunders Mac Lane’s quip “Adjoint functors arise everywhere” and Haley Joel Osment’s famous line from Sixth Sense.

**Related**: Applied category theory

I ran across the above quote from Hamming this morning. It made me wonder whether I tried to prepare students for my past back when I taught college.

How do you prepare a student for the future? Mostly by focusing on skills that will always be useful, even as times change: logic, clear communication, diligence, etc.

Negative forecasting is more reliable here than positive forecasting. It’s hard to predict what’s going to be in demand in the future (besides timeless skills), but it’s easier to predict what’s probably not going to be in demand. The latter aligns with Hamming’s exhortation not to prepare students for your past.

… in an ideal world, people would learn this material over many years, after having background courses in commutative algebra, algebraic topology, differential geometry, complex analysis, homological algebra, number theory, and French literature.

It is always strange and painful to have to change a habit of mind; though, when we have made the effort, we may find a great relief, even a sense of adventure and delight, in getting rid of the false and returning to the true.

Now spin that rose around a vertical line a distance *R* from the center of the rose. This makes a torus (doughnut) shape whose cross sections look like the rose above. You could think of having a cutout shaped like the rose above and extruding Play-Doh through it, then joining the ends in a loop.

In case you’re curious, the image above was created with the following Mathematica command:

```mathematica
RevolutionPlot3D[{4 + Cos[5 t] Cos[t], Cos[5 t] Sin[t]}, {t, 0, 2 Pi}]
```

What would the volume of resulting solid be?

This sounds like a horrendous calculus homework problem, but it’s actually quite easy. A theorem by Pappus of Alexandria (c. 300 AD) says that the volume is equal to the area times the circumference of the circle traversed by the centroid.

The area of a rose of the form *r* = cos(*k*θ) is simply π/2 if *k* is even and π/4 if *k* is odd. (Recall from the previous post that the number of petals is 2*k* if *k* is even and *k* if *k* is odd.)
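A quick numerical check of the area claim (the factor of two for odd *k* comes from the curve tracing each petal twice over a full sweep of θ):

```python
from math import cos, pi
from scipy.integrate import quad

# area swept in polar coordinates: (1/2) * integral of r(theta)^2
def swept_area(k):
    return quad(lambda t: 0.5*cos(k*t)**2, 0, 2*pi)[0]

print(swept_area(4) / pi)  # 0.5, i.e. area pi/2 for even k
print(swept_area(5) / pi)  # also 0.5, but odd k traces the rose twice,
                           # so the figure's area is half of this: pi/4
```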

The location of the centroid is easy: it’s at the origin by symmetry. If we rotate the rose around a vertical line *x* = *R* then the centroid travels a distance 2π*R*.

So putting all the pieces together, the volume is π²*R* if *k* is even and half that if *k* is odd. (We assume *R* > 1 so that the figure doesn’t intersect itself and no volume gets counted twice.)

We can also find the surface area easily by another theorem of Pappus. The surface area is just the arc length of the rose times the circumference of the circle traced out by the centroid. The previous post showed how to find the arc length, so just multiply that by 2π*R*.

(I rotated the graph 36° so it would be symmetric about the vertical axis rather than the horizontal axis.)

The arc length of a curve in polar coordinates is given by
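In symbols, for a curve *r* = *r*(θ) the length is

```latex
L = \int \sqrt{r^2 + \left(\frac{dr}{d\theta}\right)^2}\, d\theta
```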

and so we can use this to find the length. The integral doesn’t have a closed form in terms of elementary functions. Instead, the result turns out to use a special function *E*(*x*), the “complete elliptic integral of the second kind,” defined by
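With the parameter convention SciPy uses, this function is

```latex
E(x) = \int_0^{\pi/2} \sqrt{1 - x \sin^2\theta}\, d\theta
```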

Here’s the calculation for the length of a rose:

So the arc length of the rose *r* = cos(*k*θ) with θ running from 0 to 2π is 4 *E*(-*k*² + 1). You can calculate *E* in SciPy with `scipy.special.ellipe`.

If we compute the length of the rose at the top of the post, we get 4 *E*(-24) = 21.01. Does that pass the sniff test? Each petal goes from *r* = 0 out to *r* = 1 and back. If the petal were a straight line, this would have length 2. Since the petals are curved, the length of each is a little more than 2. There are five petals, so the result should be a little more than 10. But we got a little more than 20. How can that be? Since 5 is odd, the rose with *k* = 5 traces each petal twice, so we should expect a value of a little more than 20, which is what we got.
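The sniff test can be backed up numerically by comparing the closed form against direct integration of the arc length formula:

```python
from math import cos, sin, sqrt, pi
from scipy.integrate import quad
from scipy.special import ellipe

k = 5
# arc length integrand for r = cos(k*theta): sqrt(r^2 + (dr/dtheta)^2)
integrand = lambda t: sqrt(cos(k*t)**2 + (k*sin(k*t))**2)

numeric = quad(integrand, 0, 2*pi, limit=200)[0]
closed_form = 4*ellipe(1 - k*k)   # 4 E(-24)

print(numeric)      # ≈ 21.01
print(closed_form)  # ≈ 21.01
```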

As *k* gets larger, the petals come closer to being straight lines. So we should expect that 4*E*(-*k*² + 1) approaches 4*k* as *k* gets large. The following plot of *E*(-*k*² + 1) – *k* provides empirical support for this conjecture by showing that the difference approaches 0, and gives an idea of the rate of convergence. It should be possible to prove that *E*(-*k*² + 1) asymptotically approaches *k*, but I haven’t done this.

**Related posts**:

The proof is surprisingly simple. Start with the following:
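The starting point (an image in the original) is the chain of equalities

```latex
\sqrt{1 + \left(\frac{d}{dx}\cosh x\right)^2} = \sqrt{1 + \sinh^2 x} = \cosh x
```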

Now integrate the first and last expressions between two points *a* and *b*. Note that the former integral gives the arc length of cosh between *a* and *b*, and the latter integral gives the area under the graph of cosh between *a* and *b*.

By the way, the most famous catenary may be the Gateway Arch in St. Louis, Missouri.

Here’s an example from [1]. Suppose you want to find the extreme values of *x*³ + 2*xyz* – *x*² on the unit sphere using Lagrange multipliers. This leads to the following system of polynomial equations where λ is the Lagrange multiplier.
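For the stated objective *f* = *x*³ + 2*xyz* – *x*² and constraint *g* = *x*² + *y*² + *z*² – 1, the conditions ∇*f* = λ∇*g* together with the constraint read (a reconstruction; the original displayed the system as an image):

```latex
\begin{aligned}
3x^2 + 2yz - 2x &= 2\lambda x \\
2xz &= 2\lambda y \\
2xy &= 2\lambda z \\
x^2 + y^2 + z^2 &= 1
\end{aligned}
```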

There’s no obvious way to go about solving this system of equations. However, there is a systematic way to approach this problem using a “lexicographic Gröbner basis.” This transforms the problem into something that **looks much worse** but that is actually easier to work with. And most importantly, the transformation is algorithmic. It requires some computation—there are numerous software packages for doing this—but doesn’t require a flash of insight.

The transformed system looks intimidating compared to the original:

We’ve gone from four equations to eight, from small integer coefficients to large fraction coefficients, from squares to seventh powers. And yet we’ve made progress because the four variables are **less entangled** in the new system.

The last equation involves only *z* and factors nicely:

This cracks the problem wide open. We can easily find all the possible values of *z*, and once we substitute values for *z*, the rest of the equations simplify greatly and can be solved easily.

The key is that Gröbner bases transform our problem into a form that, although it appears more difficult, is easier to work with because the variables are somewhat separated. Solving one variable, *z*, is like pulling out a thread that then makes the rest of the threads easier to separate.
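The computation is easy to reproduce in software. Here’s a sketch using SymPy, with the Lagrange conditions written as polynomials that must vanish (the variable ordering λ > *x* > *y* > *z* is one choice; the basis you get depends on it):

```python
from sympy import symbols, groebner

x, y, z, lam = symbols('x y z lam')

# Lagrange conditions for f = x^3 + 2xyz - x^2 on the unit sphere
eqs = [
    3*x**2 + 2*y*z - 2*x - 2*lam*x,
    2*x*z - 2*lam*y,
    2*x*y - 2*lam*z,
    x**2 + y**2 + z**2 - 1,
]

# lexicographic Groebner basis with lam > x > y > z
G = groebner(eqs, lam, x, y, z, order='lex')
for p in G.exprs:
    print(p)
```

Because the order is lexicographic, the basis contains a polynomial in *z* alone: the equation that cracks the problem open.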

**Related**: The great reformulation of algebraic geometry

* * *

[1] David Cox et al. Applications of Computational Algebraic Geometry: American Mathematical Society Short Course January 6-7, 1997 San Diego, California (Proceedings of Symposia in Applied Mathematics)

White noise has equal power at all frequencies, just as white light is a combination of all the frequencies of the visible spectrum. The components of red noise are weighted toward low frequencies, just as red light is at the low end of the visible spectrum. Pink noise is weighted toward low frequencies too, but not as strongly as red. Specifically, the power in red noise drops off like 1/*f*² where *f* is frequency. The power in pink noise drops off like 1/*f*.

Generating pink noise is more complicated than you might think. The book Creating Noise, by Stefan Hollos and J. Richard Hollos, has a good explanation and C source code for generating pink noise and variations such as 1/*f*^{α} noise for 0 < α < 1. If you want even more background, check out Recursive Digital Filters by the same authors.
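The book’s approach is based on filters; as a cheap alternative sketch, you can shape white noise in the frequency domain so its power falls off like 1/*f*:

```python
import numpy as np

def pink_noise(n, seed=0):
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n)
    spectrum = np.fft.rfft(white)
    f = np.fft.rfftfreq(n)
    f[0] = f[1]               # avoid dividing by zero at the DC bin
    spectrum /= np.sqrt(f)    # amplitude ~ 1/sqrt(f), so power ~ 1/f
    return np.fft.irfft(spectrum, n)

samples = pink_noise(2**14)
print(len(samples))  # 16384
```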

If you’d like to hear what pink noise sounds like, here’s a sample that was created using the software in the book with a 6th order filter.

(Download)

**Related posts**:

- Ooh, that sounds like fun!
- Run away!

I’ve been on several projects where the sponsors have identified some aspect of the status quo that clearly needs improving. Working on a project that realizes these problems and is willing to address them sounds like fun. But then the project runs to the opposite extreme and creates something worse.

For example, most software development is sloppy. So a project reacts against this and becomes so formal that nobody can get any work done. In those settings I like to say “Hold off on buying a tux. Just start by tucking your shirt tail in. Maybe that’s as formal as you need to be.”

I’d be more optimistic about a project with more modest goals, say one that wants to move 50% of the way toward an ideal, rather than wanting to jump from one end of a spectrum to the other. Or even better, a project that has identified a direction they want to move in, and thinks in terms of experimenting to find their optimal position along that direction.

Start with a positive integer *n*. Compute 3*n* + 1 and divide by 2 repeatedly until you get an odd number. Then repeat the process. For example, suppose we start with 13. We get 3 · 13 + 1 = 40, and 40/8 = 5, so 5 is the next term in the sequence. Then 3 · 5 + 1 = 16, which is a power of 2, so we get down to 1.

Does this process always reach 1? So far nobody has found a proof or a counterexample.

If you pick a large starting number *n* at random, it appears that not only will the sequence terminate, the values produced by the sequence approximately follow Benford’s law (source). If you’re unfamiliar with Benford’s law, please see the first post in this series.

Here’s some Python code to play with this.

```python
from math import log10, floor

def leading_digit(x):
    y = log10(x) % 1
    return int(floor(10**y))

# 3n+1 iteration
def iterates(seed):
    s = set()
    n = seed
    while n > 1:
        n = 3*n + 1
        while n % 2 == 0:
            n = n // 2  # integer division; n / 2 would lose precision for large n
        s.add(n)
    return s
```

Let’s save the iterates starting with a large starting value:

it = iterates(378357768968665902923668054558637)

Here’s what we get and what we would expect from Benford’s law:

|---------------+----------+-----------|
| Leading digit | Observed | Predicted |
|---------------+----------+-----------|
|             1 |       46 |        53 |
|             2 |       26 |        31 |
|             3 |       29 |        22 |
|             4 |       16 |        17 |
|             5 |       24 |        14 |
|             6 |        8 |        12 |
|             7 |       12 |        10 |
|             8 |        9 |         9 |
|             9 |        7 |         8 |
|---------------+----------+-----------|

We get a chi-square statistic of 12.88 (*p* = 0.116), so the fit is reasonable.
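The counting and chi-square steps aren’t shown above; here is one way they might be filled in. This is a self-contained sketch that repeats the earlier definitions; `observed` and `expected` are names introduced here, not from the post.

```python
from math import log10, floor

def leading_digit(x):
    y = log10(x) % 1
    return int(floor(10**y))

def iterates(seed):
    s = set()
    n = seed
    while n > 1:
        n = 3*n + 1
        while n % 2 == 0:
            n = n // 2
        s.add(n)
    return s

def chisq_stat(O, E):
    return sum((o - e)**2/e for (o, e) in zip(O, E))

it = iterates(378357768968665902923668054558637)

# tally leading digits 1 through 9
observed = [0]*9
for n in it:
    observed[leading_digit(n) - 1] += 1

# expected counts under Benford's law
total = len(it)
expected = [total*log10((d + 1)/d) for d in range(1, 10)]

print(chisq_stat(observed, expected))
```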

Here’s another run with a different starting point.

it = iterates(243963882982396137355964322146256)

which produces

|---------------+----------+-----------|
| Leading digit | Observed | Predicted |
|---------------+----------+-----------|
|             1 |       44 |        41 |
|             2 |       22 |        24 |
|             3 |       15 |        17 |
|             4 |       12 |        13 |
|             5 |       11 |        11 |
|             6 |        9 |         9 |
|             7 |       11 |         8 |
|             8 |        6 |         7 |
|             9 |        7 |         6 |
|---------------+----------+-----------|

This has a chi-square value of 2.166 (*p* = 0.975), which is an even better fit.

Samples from a Cauchy distribution nearly follow Benford’s law. I’ll demonstrate this below. The more data you see, the more confident you should be of this. But with a typical statistical approach, crudely applied NHST (null hypothesis significance testing), the more data you see, the less convinced you are.

This post assumes you’ve read the previous post that explains what Benford’s law is and looks at how well samples from a Weibull distribution follow that law.

This post has two purposes. First, we show that samples from a Cauchy distribution approximately follow Benford’s law. Second, we look at problems with testing goodness of fit with NHST.

We can reuse the code from the previous post to test Cauchy samples, with one modification. Cauchy samples can be negative, so we have to modify our `leading_digit` function to take an absolute value.

```python
def leading_digit(x):
    y = log10(abs(x)) % 1
    return int(floor(10**y))
```

We’ll also need to import `cauchy` from `scipy.stats` and change where we draw samples to use this distribution.

```python
samples = cauchy.rvs(0, 1, N)
```
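Putting the modified pieces together, the whole comparison might look like the following self-contained sketch (it assumes `scipy` is installed; the `random_state` argument is added here for reproducibility and is not from the post):

```python
from math import log10, floor
from scipy.stats import cauchy

def leading_digit(x):
    # use abs() since Cauchy samples can be negative
    y = log10(abs(x)) % 1
    return int(floor(10**y))

N = 1000
samples = cauchy.rvs(0, 1, size=N, random_state=0)

observed = [0]*9
for s in samples:
    observed[leading_digit(s) - 1] += 1

# expected counts under Benford's law
predicted = [N*log10((d + 1)/d) for d in range(1, 10)]
```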

Here’s how a sample of 1000 Cauchy values compared to the prediction of Benford’s law:

|---------------+----------+-----------|
| Leading digit | Observed | Predicted |
|---------------+----------+-----------|
|             1 |      313 |       301 |
|             2 |      163 |       176 |
|             3 |      119 |       125 |
|             4 |       90 |        97 |
|             5 |       69 |        79 |
|             6 |       74 |        67 |
|             7 |       63 |        58 |
|             8 |       52 |        51 |
|             9 |       57 |        46 |
|---------------+----------+-----------|

Here’s a bar graph of the same data.

A common way to measure goodness of fit is to use a chi-square test. The null hypothesis would be that the data follow a Benford distribution. We look at the chi-square statistic for the observed data, based on a chi-square distribution with 8 degrees of freedom (one less than the number of categories, which is 9 because of the nine digits). We compute the *p*-value, the probability of seeing a chi-square statistic this large or larger, and reject our null hypothesis if this *p*-value is too small.

Here’s how our chi-square values and *p*-values vary with sample size.

|-------------+------------+---------|
| Sample size | chi-square | p-value |
|-------------+------------+---------|
|          64 |     13.542 |  0.0945 |
|         128 |     10.438 |  0.2356 |
|         256 |     13.002 |  0.1118 |
|         512 |      8.213 |  0.4129 |
|        1024 |     10.434 |  0.2358 |
|        2048 |      6.652 |  0.5745 |
|        4096 |     15.966 |  0.0429 |
|        8192 |     20.181 |  0.0097 |
|       16384 |     31.855 | 9.9e-05 |
|       32768 |     45.336 | 3.2e-07 |
|-------------+------------+---------|

The *p*-values eventually get very small, but they don’t decrease monotonically with sample size. This is to be expected. If the data came from a Benford distribution, i.e. if the null hypothesis were true, we’d expect the *p*-values to be uniformly distributed, i.e. they’d be equally likely to take on any value between 0 and 1. And not until the two largest samples do we see values that don’t look consistent with uniform samples from [0, 1].

In one sense NHST has done its job. Cauchy samples *do not* exactly follow Benford’s law, and with enough data we can show this. But we’re rejecting a null hypothesis that isn’t that interesting. We’re showing that the data don’t exactly follow Benford’s law rather than showing that they *do* approximately follow Benford’s law.

In 1881, Simon Newcomb noticed that the edges of the first pages in a book of logarithms were dirty while the edges of the later pages were clean. From this he concluded that people were far more likely to look up the logarithms of numbers with leading digit 1 than of those with leading digit 9. Frank Benford studied the same phenomenon later, and now it is known as Benford’s law, or sometimes the Newcomb-Benford law.

A data set follows Benford’s law if the proportion of elements with leading digit *d* is approximately

log_{10}((*d* + 1)/*d*).

You could replace “10” with *b* if you look at the leading digits in base *b*.

Sets of physical constants often satisfy Benford’s law, as I showed here for the constants defined in SciPy.

Incidentally, factorials satisfy Benford’s law exactly in the limit.
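That limiting statement can be probed empirically. Here is a quick sketch, not from the post, that tallies the leading digits of *n*! for *n* up to 1000 and computes the Benford proportions for comparison:

```python
from math import log10, floor

def leading_digit(x):
    y = log10(x) % 1
    return int(floor(10**y))

N = 1000
counts = [0]*9
f = 1
for n in range(1, N + 1):
    f *= n  # f is now n!
    counts[leading_digit(f) - 1] += 1

# Benford's law predictions, as proportions
benford = [log10((d + 1)/d) for d in range(1, 10)]
```

Note that Python’s `math.log10` accepts arbitrarily large integers, so the huge factorials are not a problem.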

The Weibull distribution is a generalization of the exponential distribution. It’s a convenient distribution for survival analysis because it can have decreasing, constant, or increasing hazard, depending on whether the value of a shape parameter γ is less than, equal to, or greater than 1 respectively. The special case of constant hazard, shape γ = 1, corresponds to the exponential distribution.

If the shape parameter of a Weibull distribution is “not too large” then samples from that distribution approximately follow Benford’s law (source). We’ll explore this statement with a little Python code.

SciPy doesn’t contain a Weibull distribution per se, but it does have support for a generalization of the Weibull known as the exponential Weibull. The latter has two shape parameters. We set the first of these to 1 to get the ordinary Weibull distribution.

```python
from math import log10, floor
from scipy.stats import exponweib

def leading_digit(x):
    y = log10(x) % 1
    return int(floor(10**y))

def weibull_stats(gamma):
    distribution = exponweib(1, gamma)
    N = 10000
    samples = distribution.rvs(N)
    counts = [0]*10
    for s in samples:
        counts[leading_digit(s)] += 1
    print(counts)
```

Here’s how the leading digit distribution of a simulation of 10,000 samples from an exponential (Weibull with γ = 1) compares to the distribution predicted by Benford’s law.

|---------------+----------+-----------|
| Leading digit | Observed | Predicted |
|---------------+----------+-----------|
|             1 |     3286 |      3010 |
|             2 |     1792 |      1761 |
|             3 |     1158 |      1249 |
|             4 |      851 |       969 |
|             5 |      754 |       792 |
|             6 |      624 |       669 |
|             7 |      534 |       580 |
|             8 |      508 |       511 |
|             9 |      493 |       458 |
|---------------+----------+-----------|

Looks like a fairly good fit. How could we quantify the fit so we can compare how the fit varies with the shape parameter? The most common approach is to use the chi-square goodness of fit test.

```python
def chisq_stat(O, E):
    return sum( [(o - e)**2/e for (o, e) in zip(O, E)] )
```

Here “O” stands for “observed” and “E” stands for “expected.” The observed counts are the counts we actually saw. The expected values are the values Benford’s law would predict:

```python
expected = [N*log10((i+1)/i) for i in range(1, 10)]
```

Note that we don’t want to pass `counts` to `chisq_stat` but `counts[1:]` instead. This is because `counts` starts at index 0, but leading digits can’t be 0 for positive samples.
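Putting these pieces together, the whole calculation for one shape parameter might look like this. This is a sketch; `weibull_chisq` is a name introduced here, and the sampling is random, so the statistic will vary from run to run.

```python
from math import log10, floor
from scipy.stats import exponweib

def leading_digit(x):
    y = log10(x) % 1
    return int(floor(10**y))

def chisq_stat(O, E):
    return sum((o - e)**2/e for (o, e) in zip(O, E))

def weibull_chisq(gamma, N=10000):
    # Weibull via exponentiated Weibull with first shape parameter 1
    samples = exponweib(1, gamma).rvs(N)
    counts = [0]*10
    for s in samples:
        counts[leading_digit(s)] += 1
    expected = [N*log10((i + 1)/i) for i in range(1, 10)]
    # drop counts[0]: leading digits of positive samples are 1-9
    return chisq_stat(counts[1:], expected)
```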

Here are the chi-square goodness of fit statistics for a few values of γ. (Smaller is better.)

|-------+------------|
| Shape | Chi-square |
|-------+------------|
|   0.1 |      1.415 |
|   0.5 |      9.078 |
|   1.0 |     69.776 |
|   1.5 |    769.216 |
|   2.0 |   1873.242 |
|-------+------------|

This suggests that samples from a Weibull follow Benford’s law fairly well for shape γ < 1, i.e. for the case of decreasing hazard.

The golden ratio φ is (1 + √5)/2. A golden rectangle is one in which the ratio of the longer side to the shorter side is φ. Credit cards, for example, are typically golden rectangles.

You might guess that a golden angle is 1/φ of a circle, but it’s actually 1/φ^{2} of a circle. Let *a* be the length of an arc cut out of a circle by a golden angle and *b* be the length of its complement. Then by definition the ratio of *b* to *a* is φ. In other words, the golden angle is defined in terms of the ratio of its complementary arc, not of the entire circle. [1]

The video below has many references to the golden angle. It says that the golden angle is 137.5 degrees, which is fine given the context of a popular video. But this doesn’t explain where the angle comes from or give its exact value of 360/φ^{2} degrees.

[1] Why does this work out to 1/φ^{2}? The ratio *b*/*a* equals φ, by definition. So the ratio of *a* to the whole circle is

*a*/(*a* + *b*) = *a*/(*a* + φ*a*) = 1/(1 + φ) = 1/φ^{2}

since φ satisfies the quadratic equation 1 + φ = φ^{2}.
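A quick numerical check of this footnote (a sketch; the variable names are mine):

```python
from math import sqrt, isclose

phi = (1 + sqrt(5))/2             # golden ratio
assert isclose(phi**2, 1 + phi)   # phi satisfies 1 + phi = phi^2

golden_angle = 360/phi**2         # in degrees
print(golden_angle)               # about 137.5 degrees
```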

There’s one thing advocates of all the aforementioned systems agree on: the number of basic personality types is a perfect square.

Here’s the airport:

And here’s the book cover:

I’ve written about the image on book cover before. Someone asked me what function it graphed and I decided it was probably the Weierstrass ℘ function.

For more on Weierstrass’ elliptic function and why I think that’s what’s on the cover of A&S, see this post.

Photo of Denver airport via Wikipedia.

Sometimes we’re disappointed with a simple solution because, although we don’t realize it yet, we didn’t properly frame the problem it solves.

I’ve been in numerous conversations where someone says effectively, “I understand that 2+3 = 5, but what if we made it 5.1?” They really want an answer of 5.1, or maybe larger, for reasons they can’t articulate. They formulated a problem whose solution is to add 2 and 3, but that formulation left out something they care about. In this situation, the easy response is to say “No, 2+3 = 5. There’s nothing we can do about that.” The more difficult response is to find out why “5” is an unsatisfactory result.

Sometimes we’re uncomfortable with a simple solution even though it does solve the right problem.

If you work hard and come up with a simple solution, it may look like you didn’t put in much effort. And if someone else comes up with the simple solution, you may look foolish.

Sometimes simplicity is disturbing. Maybe it has implications we have to get used to.

**Update**: A couple people have replied via Twitter saying that we resist simplicity because it’s boring. I think beneath that is that we’re not ready to move on to a new problem.

When you’re invested in a problem, it can be hard to see it solved. If the solution is complicated, you can keep working for a simpler solution. But once someone finds a really simple solution, it’s hard to justify continuing work in that direction.

A simple solution is not something to dwell on but to build on. We want some things to be boringly simple so we can do exciting things with them. But it’s hard to shift from producer to consumer: now that I’ve produced this simple solution, and I’m still a little sad that it’s wrapped up, how can I use it to solve something else?


The holes are all rectangular, so it’s surprising that the geometry is so varied when you slice open a Menger sponge. For example, when you cut it on the diagonal, you can see stars! (I wrote about this here.)

I mentioned this blog post to a friend at Go 3D Now, a company that does 3D scanning and animation, and he created the video below. The video starts out by taking you through the sponge, then at about the half way part the sponge splits apart.
