The sine function is periodic along the real axis, but it grows exponentially along the imaginary axis. I mentioned parenthetically last time that the Jacobi functions are not just periodic but doubly periodic. That means that not only are they periodic as you move along the real axis, they’re periodic when you move along **any **line in the complex plane.We’ll illustrate this with some plots.

It will make our code more compact if we first define a function to split a complex number into its real and imaginary parts.

pair[z_] := {Re[z], Im[z]}

Here’s what it looks like when you plot the real and imaginary parts of the sine function along the line (10 + 0.5*i*)*t*.

ParametricPlot[ pair[ Sin[(0.5 I + 10)t] ], {t, 0, 10}, PlotRange -> All ]

By contrast, here’s a plot of the sn function along the line 1.25 + *it*.

ParametricPlot[ pair[ JacobiSN[1.25 + I t, 0.5] ], {t, 0, 10} ]

It’s a closed loop because sn is periodic in every direction. (By the way, the curve looks like an egg. More posts about egg-shaped curves starting here.)

When you plot sn along various lines you’ll always get closed curves, but you won’t always get such round curves. Here’s an example that doesn’t look like a closed loop because the curve turns around sharply at each end.

ParametricPlot[ pair[ JacobiCN[ (I + 0.5) t, 0.5] ], {t, 0, 10} ]

To show that it really is periodic, we’ll add a vertical time dimension to visualize how the curve is traced out over time.

triple[z_, t_] = {Re[z], Im[z], t} ParametricPlot3D[ triple[JacobiSN[(I + 0.5) t, 0.5], 0.1 t], {t, 0, 20} ]

Here’s another curve plotting sn along a different line.

ParametricPlot[ pair[ JacobiSN[(0.9 + I) t, 0.5] ], {t, 0, 55}, PlotRange -> All, PlotPoints -> 100 ]

As before, we can add a time dimension to imagine how the curve is traced out.

ParametricPlot3D[ triple[JacobiSN[(0.9 + I) t, 0.5], 0.1 t], {t, 0, 150}, PlotRange -> All, PlotPoints -> 200 ]

Here’s a variation on the plot above that stretches the vertical axis so you can better see what’s going on.

ParametricPlot3D[ triple[JacobiSN[(0.9 + I) t, 0.5], 0.2 t], {t, 0, 150}, PlotRange -> All, PlotPoints -> 200 ]]]>

large to assume θ is a good enough approximation to sin θ) and in conformal mapping.

I ran across an article [1] yesterday that shows how Jacobi’s three elliptic functions—sn, cn, and dn—could be defined by one dynamical system

with initial conditions *x*(0) = 0, *y*(0) = 1, and *z*(0) = 1.

The parameter *k* is the modulus. (In Mathematica’s notation below, *k*² is the modulus.) As *k* decreases to 0, sn converges to sine, cn to cosine, and dn to 1. As *k* increases to 1, sn converges tanh, and cn and dn converge to sech. So you could think of *k* as a knob you turn to go from being more like circular functions (ordinary trig functions) to more like hyperbolic functions.

Since we have a dynamical system, let’s plot the solution, varying the modulus each time.The Jacobi functions are periodic (in fact they’re doubly periodic) and so the plots will be closed loops.

f[t_, m_] = {JacobiSN[t, m], JacobiCN[t, m], JacobiDN[t, m]} ParametricPlot3D[f[t, 0.707], {t, 0, 10}, AspectRatio -> 1]

ParametricPlot3D[f[t, 0.99], {t, 0, 20}, AspectRatio -> 1]

ParametricPlot3D[f[t, 0.01], {t, 0, 20}, AspectRatio -> 1]

Note that this last plot is nearly flat because the modulus is small and so *z* is nearly constant. The small modulus also makes the phase portrait nearly circular because *x* is approximately sine and *y* is approximately cosine.

[1] Kenneth R. Meyer. Jacobi Elliptic Functions from a Dynamical Systems Point of View. The American Mathematical Monthly, Vol. 108, No. 8 (Oct., 2001), pp. 729-737

]]>**Frobenius’s theorem** says only real vector spaces that can be made into division algebras are the real numbers, complex numbers, and quaternions.

So can you multiply triples of real numbers? Sure. **Cross product** is one way, but there’s no “division” analog to cross product. For example, the cross product of any vector with itself is zero, so cross product has lots of **zero divisors**, non-zero elements that multiply to zero.

Frobenius tells us we can’t form triples of real numbers into a division algebra, but can we come closer to a division algebra than we get with cross products? It turns out we can! We can define a multiplication that is **commutative**, has an identity element, and has no zero divisors. But what’s missing? The multiplication doesn’t have the distributive property you’d have in a division algebra.

The multiplication operation is complicated to describe. For details see “A Commutative Multiplication of Number Triplets” by Frank R. Pfaff, The American Mathematical Monthly, Vol. 107, No. 2 (Feb., 2000), pp. 156-162.

**Related posts**:

First, how many pieces are there in a set of dominoes? A domino corresponds to an unordered pair of numbers from 0 to *n*. The most popular form has *n* = 6, but there are variations with other values of *n*. You can show that the number of dominoes is

This is because there are *n*+1 possible numbers (since blanks are a possibility) and each one is either a double or not. The number of ways to choose two distinct numbers is the binomial coefficient and the number of doubles is *n*+1.

Another way to look at this is that we are selecting two things from a set of *n*+1 things with replacement and so the number of possibilities is

where the symbol on the left is Stanley’s symbol for selection with replacement.

In any case, there are 28 dominoes when *n* = 6, 55 when *n* = 9, and 91 when *n* = 12.

There are a couple ways to make a magic square of sorts from a set of dominoes. To read more about this, see this post.

How many ways can you cover an *m* by *n* chess board with dominoes? The answer turns out to be

See this post for details.

]]>This post will summarize the results of a question I asked from @AnalysisFact.

If a genie offered to give you a thorough understanding of one theorem, what theorem would you choose?

— Analysis Fact (@AnalysisFact) June 15, 2018

Here are the results by category. Some are not strictly theorems but rather topics or conjectures.

- Curry-Howard correspondence
- P=NP or not
- Collatz Conjecture

- Cohen’s forcing theorem
- Godel’s Incompleteness theorem
- Continuum hypothesis
- Zorn’s lemma

- The ABC conjecture (theorem?)
- Prime number theorem.
- Riemann hypothesis
- Fundamental theorem of finite abelian groups
- Classification of finite simple groups
- Fermat’s last theorem, the unpublished Fermat version

- Thurston’s geometrization conjecture
- Gauss Bonnet theorem
- de Rham’s theorem
- Grothendieck Riemann Roch theorem

- Banach-Tarski theorem.
- Stokes theorem
- Carleson-Hunt theorem
- The epsilon/delta definition of continuity
- Universal approximation theorem

- Navier–Stokes existence and smoothness postulate
- The relativistic version of the Shrodinger equation
- Atiyah-Singer index theorem

*E*=*mc*²- Noether’s theorem
- Liouville’s theorem

- Existence of general equilibrium prices
- Farkas’ lemma
- The graph minor theorem
- Central limit theorem

A couple people picked up the fact that in folk stories, being granted a wish doesn’t usually turn out well.

M. Shah: uh oh. Is this one of those malicious genies that twists language used to make the wish so that you are granted some horrific wish?

Jumpy the Hat: You now understand every single thing about irrational numbers but it’s all wasted because you’re cursed to become NJ Wildberger and you don’t think they exist

M Shah: or you want to thoroughly understand some theorem about Weierstrass’s monster. But little did anyone know that Weierstrass actually did have a pet monster. And it ends up biting your head off because it doesn’t like other things that are continuous.

]]>What’s a book you would like to have read but don’t want to read?

— John D. Cook (@JohnDCook) June 15, 2018

Here are the responses I got, organized by category.

Literature:

- The Brothers Karamazov
- War and Peace
- Crime and Punishment
- Gravity’s Rainbow
- Infinite Jest
- Finnegans Wake
- Ulysses
- Atlas Shrugged
- The Iliad

Math, Science, and Software:

- Art of Computer Programming
- Design Patterns: Elements of Reusable Object-Oriented Software
- Programming: Principles and Practice Using C++
- An Introduction to Default Logic
- A Brief History of Time
- A New Kind of Science
- Partial Differential Equations
- Princeton Companion to Mathematics
- The Algebraic Eigenvalue Problem

History and economics

Religion and philosophy

Misc:

]]>The first is Sylvester’s determinant theorem. If *A* is a *n* by *k* matrix and *B* is a *k* by *n* matrix, then

where *I* is an identity matrix whose dimension is given by its subscript. The theorem is true for any *k*, but it’s especially useful when *k* is small because the matrices on the right side are of size *k*. If *k* = 1, the right side is the determinant of a 1 by 1 matrix, i.e. just a number!

The second is known as the matrix inversion lemma or Woodbury’s matrix identity. It says

which is a lot to take in at once. We’ll unpack it a little at a time.

First of all, the matrices have whatever properties are necessary for the equation to make sense: *A* and *C* are square, invertible matrices, though possibly of different sizes.

Suppose *A* is *n* by *n*. Then *U* necessarily has *n* rows. Let *k* be the number of columns in *U*. Then *C* is *k* by *k* and *V* is *k* by *n*.

To simplify things a little, let *C* be the identity.

Think of *k* as small, maybe even 1. Then *UV* is a low-rank perturbation of *A*. Suppose you have computed the inverse of *A*, or know something about the inverse of *A*, and want to know how that inverse changes when you change *A* by adding a rank *k* matrix to it.

To simplify things further, assume *A* is the identity matrix. Now the matrix inversion lemma reduces to something similar to Sylvester’s theorem above.

To make things clearer, we’ll add subscripts on the identity matrices as above.

If *k* is small, then the matrix between *U* and *V* on the right side is small. If *k* = 1, it’s just a number, and its inverse is just it’s reciprocal.

produces a long sequence of primes. Namely, the values are prime for *n* = 1, 2, 3, …, 40.

The **second** is that the number

is extraordinarily close to an integer. This number is known as Ramanujan’s constant. It differs from its nearest integer by 3 parts in 10^{30}. Ramanujan’s constant equals

262537412640768743.99999999999925…

**There is a connection** between these two facts: The polynomial

returns primes for *n* = 1, 2, 3, …, *k*-1 primes if 4*k* – 1 is a **Heegner number**, and

is almost an integer if *d* is a (large) Heegner number.

Source: The Book of Numbers by Conway and Guy.

So what’s a Heegner number and how many are there? An integer *d* is a Heegner number if the ring generated by appending √-*d* to the integers has unique factorization. There are nine such numbers:

1, 2, 3, 7, 11, 19, 43, 67, 163.

There’s deeper stuff going on here than I understand—modular forms, the *j*-function, etc.—so this post won’t explain everything. There’s something unsatisfying about saying something is “almost” an integer without quantifying. There’s a way to be more precise, but we won’t go there. Instead, we’ll just play with the results.

First we look at the claim that *n*² – *n* + *k* produces primes for *n* = 1 through *k* – 1 if 4*k* – 1 is a Heegner number. The Heegner numbers of the form 4*k* + 1 are 2, 3, 5, 11, and 17. The following code shows that the claim is true for these values of *k*.

k = {2, 3, 5, 11, 17} claim[x_] := AllTrue[ Table[n^2 - n + x, {n, x - 1}], PrimeQ ] AllTrue[k, claim]

This returns `True`

, so the claim is true.

As for exp(π √*d*) being close to an integer, this apparently only true for the last three Heegner numbers.

h = {1, 2, 3, 7, 11, 19, 43, 67, 163} For[i = 1, i < 10, i++, Print[ AccountingForm[ N[ Exp[ Pi Sqrt[ h[[i]] ] ], 31 ] ] ] ]

(The function `AccountingForm`

suppresses scientific notation, making it easier to see where the decimal portion of the number starts.)

Here are the results:

23.1406926327792 85.0196952232072 230.7645883191458 4071.9320952252610 33506.1430655924387 885479.7776801543194 884736743.9997774660349 147197952743.9999986624548 262537412640768743.9999999999993

I manually edited the output to align the decimal points and truncate the decimal places beyond that needed to show that the last number is not an integer.

]]>The US flag has had 13 stripes from the beginning, representing the first 13 states. The number of stars has increased over time as the number of states has increased. Currently there are 50 stars, arranged in alternating rows of 6 and 5 stars.

If California breaks into three states, how would we arrange 52 stars? One way would be alternating rows of 7 and 6. Here’s a quick mock up of a possible 52-star flag I just made:

The difference isn’t that noticeable.

What if California [1] were to secede from the US? We’ve had 49 states before, in the brief period between the beginning of Alaskan statehood and the beginning of Hawaiian statehood. During that time [1], the flag had 7 rows of 7 stars, but the rows were staggered as in the current flag, not lined up in columns as in the 48-star flag before it.

***

[1] I don’t know how many people seriously propose any state leaving the US. The last time a state tried the result was the bloodiest war in US history. There’s a group in Vermont that wants their state to withdraw from the US.

Texans talk about seceding from the union, and you’ll occasionally see SECEDE bumper stickers, but it’s a joke. Maybe a few people take it seriously, but certainly not many.

[2] There is a tradition dating back to 1818 that the flag only changes on July 4. So the period of the 49-star flag didn’t exactly coincide with the period of 49 states.

]]>There’s no simple expression for *p*(*n*), but Ramanujan discovered a fairly simple asymptotic approximation:

How accurate is this approximation? Here’s a little Matheamtica code to see.

p[n_] := PartitionsP[n] approx[n_] := Exp[ Pi Sqrt[2 n/3]] / (4 n Sqrt[3]) relativeError[n_] := Abs[p[n] - approx[n]]/p[n] ListLinePlot[Table[relativeError[n], {n, 100}]]

So for values of *n* around 100, the approximation is off by about 5%.

Since it’s an asymptotic approximation, the relative error decreases (albeit slowly, apparently) as *n* increases. The relative error for *n* = 1,000 is about 1.4% and the relative error for *n* = 1,000,000 is about 0.044%.

**Update**: After John Baez commented on the oscillation in the relative error I decided to go back and look at it more carefully. Do the oscillations end or do they just become too small to see?

To answer this, let’s plot the difference in consecutive terms.

relerr[a_, b_] := Abs[a - b]/b t = Table[relerr[p[n], approx[n]], {n, 300}]; ListLinePlot[Table[ t[[n + 1]] - t[[n]], {n, 60}]]

The plot crosses back and forth across the zero line, indicating that the relative error alternately increases and decreases, but only up to a point. Past *n* = 25 the plot stays below the zero line; the sign changes in the first differences stop.

But now we see that the first differences themselves alternate! We can investigate the alternation in first differences by plotting second differences [1].

ListLinePlot[ Table[ t[[n + 2]] - 2 t[[n + 1]] + t[[n]], {n, 25, 120}] ]

So it appears that the *second* differences keep crossing the zero line for a lot longer, so far out that it’s hard to see. In fact the second differences become positive and stay positive after *n* = 120. But the second differences keep alternating, so you could look at *third* differences …

**Related posts**: Special numbers

[1] The code does a small algebraic simplification that might make some people scratch their heads. All it does is simplify

(*t*_{n+2} – *t*_{n+1}) – (*t*_{n+1} – *t*_{n}).

The book Minimal Perl is useful in this regard. It has chapters on Perl as a better `grep`

, a better `awk`

, a better `sed`

, and a better `find`

. While Perl is not easy to learn, it might be easier to learn a minimal subset of Perl than to learn each of the separate utilities it could potentially replace. I wrote about this a few years ago and have been thinking about it again recently.

Here I want to zoom in on Perl as a better `grep`

. What’s the minimum Perl you need to know in order to use Perl to search files the way `grep`

would?

By using Perl as your `grep`

, you get to use Perl’s more extensive pattern matching features. Also, you get to use one regex syntax rather than wondering about the specifics of numerous regex dialects supported across various programs.

Let *RE* stand for a generic regular expression. To search a file `foo.txt`

for lines containing the pattern *RE*, you could type

perl -wln -e "/RE/ and print;" foo.txt

The Perl one-liner above requires more typing than using grep would, but you could wrap this code in a shell script if you’d like.

If you’d like to print lines that *don’t* match a regex, change the `and`

to `or`

:

perl -wln -e "/RE/ or print;" foo.txt

By learning just a little Perl you can customize your search results. For example, if you’d like to just print the part of the line that matched the regex, not the entire line, you could modify the code above to

perl -wln -e "/RE/ and print $&;" foo.txt

because `$&`

is a special variable that holds the result of the latest match.

***

For daily tips on regular expressions, follow @RegexTip on Twitter.

]]>I just finished listening to the latest episode of Twenty Thousand Hertz, the story behind “Deep Note,” the THX logo sound.

There are a couple mathematical details of the sound that I’d like to explore here: random number generation, and especially Pythagorean tuning.

First is that part of the construction of the sound depended on a random number generator. The voices start in a random configuration and slowly reach the target D major chord at the end.

Apparently the random number generator was not seeded in a reproducible way. This was only mentioned toward the end of the show, and a teaser implies that they’ll go more into this in the next episode.

The other thing to mention is that the final chord is based on Pythagorean tuning, not the more familiar equal temperament.

The lowest note in the final chord is D1. (Here’s an explanation of musical pitch notation.) The other notes are D2, A2, D3, A3, D4, A4, D5, A5, D6, and F#6.

Octave frequencies are a ratio of 2:1, so if D1 is tuned to 36 Hz, then D2 is 72 Hz, D3 is 144 Hz, D4 is 288 Hz, D5 is 576 Hz, and D6 is 1152 Hz.

In Pythagorean tuning, fifths are in a ratio of 3:2. In equal temperament, a fifth is a ratio of 2^{7/12} or 1.4983 [1], a little less than 3/2. So Pythagorean fifths are slightly bigger than equal temperament fifths. (I explain all this here.)

If D2 is 72 Hz, then A2 is 108 Hz. It follows that A3 would be 216 Hz, A4 would be 432 Hz (flatter than the famous A 440), and A5 would be 864 Hz.

The F#6 on top is the most interesting note. Pythagorean tuning is based on fifths being a ratio of 3:2, so how do you get the major third interval for the highest note? By going up by fifths 4 times from D4, i.e. D4 -> A4 -> E5 -> B5 -> F#6.

The frequency of F#6 would be 81/16 of the frequency of D4, or 1458 Hz.

The F#6 on top has a frequency 81/64 that of the D# below it. A Pythagorean major third is a ratio of 81/64 = 1.2656, whereas an equal temperament major third is f 2^{4/12} or 1.2599 [2]. Pythagorean tuning makes more of a difference to thirds than it does to fifths.

A Pythagorean major third is sharper than a major third in equal temperament. Some describe Pythagorean major chords as brighter or sweeter than equal temperament chords. That the effect the composer was going for and why he chose Pythagorean tuning.

Then after specifying the exact pitches for each note, the composer actually changed the pitches of the highest voices a little to make the chord sound fuller. This makes the three voices on each of the highest notes sound like three voices, not just one voice. Also, the chord shimmers a little bit because the random effects from the beginning of Deep Note never completely stop, they are just diminished over time.

[1] The exponent is 7/12 because a half step is 1/12 of an octave, and a fifth is 7 half steps.

[2] The exponent is 4/12 because a major third is 4 half steps.

Musical score above via THX Ltd on Twitter.

]]>The Stirling number of the second kind

counts the number of ways to partition a set of *n* elements into *k* non-empty subsets. The notation on the left is easier to use inline, and the subscript reminds us that we’re talking about Stirling numbers of the second kind. The notation on the right suggests that we’re dealing with something analogous to binomial coefficients, and the curly braces suggest this thing might have something to do with counting sets.

Since the *n*th Bell number counts the total number of ways to partition a set of *n* elements into any number of non-empty subsets, we have

Another connection to Bell is that *S*_{2}(*n*, *k*) is the sum of the coefficients in the partial exponential Bell polynomial *B*_{n, k}.

The Stirling numbers of the first kind

count how many ways to partition a set into *cycles* rather than subsets. A cycle is a sort of ordered subset. The order of elements matters, but only a circular way. A cycle of size *k* is a way to place *k* items evenly around a circle, where two cycles are considered the same if you can rotate one into the other. So, for example, [1, 2, 3] and [2, 3, 1] represent the same cycle, but [1, 2, 3] and [1, 3, 2] represent different cycles.

Since a set with at least three elements can be arranged into multiple cycles, Stirling numbers of the first kind are greater than or equal to Stirling numbers of the second kind, given the same arguments.

We started out by saying Stirling numbers were like binomial coefficients, and here we show that they satisfy addition identities similar to binomial coefficients.

For binomial coefficients we have

To see this, imagine we start with the set of the numbers 1 through *n*. How many ways can we select a subset of *k* items? We have selections that exclude the number 1 and selections that include the number 1. These correspond to the two terms on the right side of the identity.

The analogous identities for Stirling numbers are

The combinatorial proofs of these identities are similar to the argument above for binomial coefficients. If you want to partition the numbers 1 through *n* into *k* subsets (or cycles), either 1 is in a subset (cycle) by itself or not.

Everything above has implicitly assumed *n* and *k* were positive, or at least non-negative, numbers. Let’s look first at how we might handle zero arguments, then negative arguments.

It works out best if we define *S*_{1}(0, *k*) and *S*_{2}(0, *k*) to be 1 if *k* is 0 and zero otherwise. Also we define *S*_{1}(*n*, 0) and *S*_{2}(*n*, 0) to be 1 if *n* is 0 and zero otherwise. (See Concrete Mathematics.)

When either *n* or *k* is zero the combinatoric interpretation is strained. If we let either be negative, there is no combinatorial interpretation, though it can still be useful to consider negative arguments, much like one does with binomial coefficients.

We can take the addition theorems above, which are *theorems* for non-negative arguments, and treat them as *definitions* for negative arguments. When we do, we get the following beautiful dual relationship between Stirling numbers of the first and second kinds:

1, 2, 3, 4, …

Now take partial sums, the *n*th term of the new series being the sum of the first n terms of the previous series. This gives us the triangular numbers, so called because they count the number of coins at each level of a triangular arrangement:

1, 3, 6, 10, …

If we repeat this again we get the tetrahedral numbers, the number of balls on each level of a tetrahedral stack of balls.

1, 4, 10, 20, …

We can repeat this process and general define *T*_{n, d}, the *n*th tetrahedral number of dimension *d*, recursively. We define *T*_{n, 1} = *n* and for *d* > 1,

This is just a formalization of the discussion above.

It turns out there’s a simple expression for tetrahedral number of all dimensions:

Here’s a quick proof by induction. The theorem clearly holds when *n* = 1 or *d* = 1. Now assume it hold for all *n* < *m* and *d* < *m*.

The last line follows from the binomial addition identity

It also turns out that *T*_{n, d} is the number of ways to select *d* things from a set of *n* with replacement.

**Related posts**:

You may have to look at that sum twice to see it correctly. It looks a lot like the sum for *e*^{n} except the roles of *k* and *n* are reversed in the numerator.

The sum reminded me of the equation

that I blogged about a few weeks ago. It’s correct, but you may have to stare at it a little while to see why.

Incidentally, the *n*th Bell number is the *n*th** moment of a Poisson random variable** with mean 1.

There’s also a connection with Bell polynomials. The *n*th Bell number is the sum of the coefficients of the *n*th **complete Bell polynomial**. The *n*th Bell polynomial is sum over *k* of the partial exponential Bell polynomials *B*_{n,k}. The partial (exponential) Bell polynomials are defined here.

You can compute Bell numbers in Python with `sympy.bell`

and in Mathematical with `BellB`

. You can compute Bell numbers by hand, or write your own function in a language that doesn’t provide one, with the recursion relation *B*_{0} = 1 and

for *n* > 0. Bell numbers grow very quickly, faster than exponential, so you’ll need extended precision integers if you want to compute very many Bell numbers.

For theoretical purposes, it is sometimes helpful to work with the exponential generating function of the Bell numbers. The *n*th Bell number is *n*! times the coefficient of *x*^{n} in exp(exp(*x*) – 1).

An impractical but amusing way to compute Bell numbers would be via simulation, finding the expected value of *n*th powers of random draws from a Poisson distribution with mean 1.

But the central limit theorem says nothing about *relative* error. Relative error can diverge to infinity while absolute error converges to zero. We’ll illustrate this with an example.

The average of *N* independent exponential(1) random variables has a gamma distribution with shape *N* and scale 1/*N*.

As *N* increases, the average becomes more like a normal in distribution. That is, the absolute error in approximating the distribution function of gamma random variable with that of a normal random variable decreases. (Note that we’re talking about distribution functions (CDFs) and not densities (PDFs). The previous post discussed a surprise with density functions in this example.)

The following plot shows that the difference between the distributions functions get smaller as *N* increases.

But when we look at the ratio of the tail probabilities, that is Pr(*X* > *t*) / Pr(*Y* > *t*) where *X* is the average of *N* exponential r.v.s and *Y* is the corresponding normal approximation from the central limit theorem, we see that the ratios diverge, and they diverge faster as *N* increases.

To make it clear what’s being plotted, here is the Python code used to draw the graphs above.

import matplotlib.pyplot as plt from scipy.stats import gamma, norm from scipy import linspace, sqrt def tail_ratio(ns): x = linspace(0, 4, 400) for n in ns: gcdf = gamma.sf(x, n, scale = 1/n) ncdf = norm.sf(x, loc=1, scale=sqrt(1/n)) plt.plot(x, gcdf/ncdf plt.yscale("log") plt.legend(["n = {}".format(n) for n in ns]) plt.savefig("gamma_normal_tail_ratios.svg") def cdf_error(ns): x = linspace(0, 6, 400) for n in ns: gtail = gamma.cdf(x, n, scale = 1/n) ntail = norm.cdf(x, loc=1, scale=sqrt(1/n)) plt.plot(x, gtail-ntail) plt.legend(["n = {}".format(n) for n in ns]) plt.savefig("gamma_normal_cdf_diff.svg") ns = [1, 4, 16] tail_ratio([ns) cdf_error(ns)]]>

The first place most people encounter Gibbs phenomena is in Fourier series for a step function. The Fourier series develops “bat ears” near the discontinuity. Here’s an example I blogged about before not with Fourier series but with analogous Chebyshev series.

The series converges rapidly in the middle of the flat parts, but under-shoots and over-shoots near the jumps in the step function.

Runge phenomena is similar, where interpolating functions under- and over-shoot the function they’re approximating.

Both plots above come from this post.

Here’s the example I ran across with the central limit theorem. The distribution of the average of a set of exponential random variables converges to the distribution of a normal random variable. The nice thing about the exponential distribution is that the averages have a familiar distribution: a gamma distribution. If each exponential has mean 1, the average has a gamma distribution with shape *N* and scale 1/*N*. The central limit theorem says this is converging in distribution to a normal distribution with the same mean and variance.

The plot below shows the difference between the density function of the average of *N* exponential random variables and the density function for its normal approximation, for *N* = 10 and for *N* = 400.

Notice that the orange line, corresponding to *N* = 400, is very flat, most of the time. That is, the normal approximation fits very well. But there’s this spike in the middle, something reminiscent of Gibbs phenomena or Runge phenomena. Going from 10 to 400 samples the average error decreases quite a bit, but the maximum error doesn’t go down much at all.

If you go back and look at the Central Limit Theorem, or its more quantitative counterpart the Berry-Esseen theorem, you’ll notice that it applies to the distribution function, not the density, or in other words, the CDF, not the PDF. I think the density functions do converge in this case because the exponential function has a smooth density, but the rate of convergence depends on the norm. It looks like the convergence is fast in square (*L*²) norm, but slow in sup norm. A little experiment shows that this is indeed the case.

Maybe the max norm doesn’t converge at all, i.e. the densities don’t converge pointwise. It looks like the max norm may be headed toward a horizontal asymptote, just like Gibbs phenomena.

**Update**: It seems we do not have uniform convergence. If we let *N* = 1,000,000, the sup norm of the error 0.1836. It appears the sup norm of the error is approaching a lower bound of approximately this value.

For this post I’ll be looking at two-tailed events, the probability that a normal random variable is either less than –*k*σ or greater than *k*σ. If you’re only interested in one of these two probabilities, divide by 2. Also, since the results are independent of σ, let’s assume σ = 1 for convenience.

The following Python code will print a table of *k*-sigma event probabilities.

from scipy.stats import norm for k in range(1, 40): print(k, 2*norm.cdf(-k))

This shows, for example, that a “25 sigma event,” something I’ve heard people talk about with a straight face, has a probability of 6 × 10^{-138}.

The code above reports that a 38 or 39 sigma event has probability 0, exactly 0. That’s because the actual value, while not zero, is so close to zero that floating point precision can’t tell the difference. This happens around 10^{-308}.

What if, despite all the signs warning *hic sunt dracones* you want to compute even smaller probabilities? Then for one thing you’ll need to switch over to log scale in order for the results to be representable as floating point numbers.

Exactly computing these extreme probabilities is challenging, but there are convenient upper and lower bounds on the probabilities that can be derived from Abramowitz and Stegun, equation 7.1.13 with a change of variables. See notes here.

We can use these bounds to compute upper and lower bounds on the base 10 logs of the tail probabilities.

from scipy import log, sqrt, pi def core(t, c): x = 2*sqrt(2/pi)/(t + sqrt(t**2 + c)) ln_p = -0.5*t**2 + log(x) return ln_p/log(10) def log10_upper(t): return core(t, 2*sqrt(2)) def log10_lower(t): return core(t, 4*sqrt(2)/pi)

This tells us that the log base 10 of the probability of a normal random variable being more than 38 standard deviations away from its mean is between -315.23867 and -315.23859. The upper and lower bounds agree to seven significant figures, and the accuracy only improves as *k* gets larger. So for large arguments, we can use either the upper or lower bound as an accurate approximation.

The code above was used to compute this table of tail probabilities for *k* = 1 to 100 standard deviations.

What does that mean? Six sigma means six standard deviations away from the mean of a probability distribution, sigma (σ) being the common notation for a standard deviation. Moreover, the underlying distribution is implicitly a normal (Gaussian) distribution; people don’t commonly talk about “six sigma” in the context of other distributions [1]. Here’s a table to indicate the odds against a *k*-sigma event for various *k*.

|-------+-----------------| | Sigma | Odds | |-------+-----------------| | 1 | 2 : 1 | | 2 | 21 : 1 | | 3 | 370 : 1 | | 4 | 16,000 : 1 | | 5 | 1,700,000 : 1 | | 6 | 500,000,000 : 1 | |-------+-----------------|

If you see something that according to your assumptions should happen twice in a billion tries, maybe you’ve seen something extraordinarily rare, or maybe your assumptions were wrong. Taleb’s comment suggests the latter is more likely.

You could formalize this with Bayes rule. For example, suppose you’re 99% sure the thing you’re looking at has a normal distribution with variance 1, but you’re willing to concede there’s a 1% chance that what you’re looking at has a heavier-tailed distribution, say a Student *t* distribution with 10 degrees of freedom, rescaled to also have variance 1.

It’s hard to tell the two distributions apart, especially in the tails. But although both are small in the tails, the normal is *relatively* much smaller.

Now suppose you’ve seen an observation greater than 6. The Bayes factor in favor of the *t* distribution hypothesis is 272. This means that even though before seeing any data you thought the odds were 99 to 1 in favor of the data coming from a normal distribution, after seeing such a large observation you would put the odds at 272 to 1 in favor of the *t* distribution.

If you allow a small possibility that your assumption of a normal distribution is wrong (see Cromwell’s rule) then seeing an extreme event will radically change your mind. You don’t have to think the heavier-tailed distribution is equally likely, just a possibility. If you *did* think *a priori* that both possibilities were equally likely, the posterior odds for the *t* distribution would be 27,000 to 1.

In this example we’re comparing the normal distribution to a very specific and somewhat arbitrary alternative. Our alternative was just an example. You could have picked a wide variety of alternatives that would have given a qualitatively similar result, reversing your *a priori* confidence in a normal model.

By the way, a *t* distribution with 10 degrees of freedom is not a very heavy-tailed distribution. It has heavier tails than a normal for sure, but not nearly as heavy as a Cauchy, which corresponds to a *t* with only one degree of freedom. If we had used a distribution with a heavier tail, the posterior odds in favor of that distribution would have been higher.

***

[1] A six-sigma event isn’t that rare unless your probability distribution is normal. By Markov’s inequality, the probability is less than 1/36 for any distribution. The rarity of six-sigma events comes from the assumption of a normal distribution more than from the number of sigmas per se.

]]>As I wrote about the other day in the context of rational approximations for π, the most economical rational approximations of an irrational number come from convergents of continued fractions. The book Calendrical Calculations applies continued fractions to various kinds of calendars.

The continued fraction for the number of days in a year is as follows.

This means that the best approximations for the length of a year are 365 days plus a fraction of 1/4, or 7/29, or 8/33, or 23/95, etc.

You could have one leap day every four years. More accurate would be 7 leap days every 29 years, etc. The Gregorian calendar has 97 leap days every 400 years.

If we express the ratio of the length of the year to the length of the month, we get

By taking various convergents we get 37/3, 99/8, 136/11, etc. So if you want to design successively more accurate lunisolar calendars, you’d have 37 lunar months every 3 years, 99 lunar months every 8 years, etc.

I used your post on practical techniques for computing smooth max, which will probably be the difference between being able to actually use my result and not. I wrote up what I’m planning to do here.

His post extends the idea in my post to computing the gradient and Hessian of the soft max.

]]>Now obviously **we know there’s life on Earth**. But if we’re going look for life on other planets, it’s reasonable to ask that our methods return positive results when examining the one planet we know for sure does host life. So scientists looked at the data from Galileo as if it were coming from another planet to see what patterns in the data might indicate life.

I’ve started using looking for life on Earth as a **metaphor**. I’m working on a project right now where I’m looking for a needle in a haystack, or rather *another* needle in a haystack: I knew that one needle existed before I got started. So I want to make sure that my search procedure at least finds the one positive result I already know exists. I explained to a colleague that we need to make sure we can at least find life on Earth.

This reminds me of **simulation work**. You make up a scenario and treat it as the state of nature. Then pretend you don’t know that state, and see how successful your method is at discovering that state. It’s sort of a schizophrenic way of thinking, pretending that half of your brain doesn’t know what the other half is doing.

It also reminds me of **software testing**. The most trivial tests can be surprisingly effective at finding bugs. So you write a few tests to confirm that there’s life on Earth.

[1] I found out about Galileo’s Earth reconnaissance listening to the latest episode of the .NET Rocks! podcast.

]]>Here are three problems that reduce to counting selections with replacement.

For example, suppose you’re running an experiment in which you randomize to *n* different treatments and you want to know how many ways the next *k* subjects can be assigned to treatments. So if you had treatments *A*, *B*, and *C*, and five subjects, you could assign all five to *A*, or four to *A* and one to *B*, etc. for a total of 21 possibilities. Your choosing 5 things from a set of 3, with replacement because you can (and will) assign the same treatment more than once.

For another example, if you’re taking the *k*th order partial derivatives of a function of *n* variables, you’re choosing *k* things (variables to differentiate with respect to) from a set of *n* (the variables). Equality of mixed partials for smooth functions says that all that matters is how many times you differentiate with respect to each variable. Order doesn’t matter: differentiating with respect to *x* and then with respect to *y* gives the same result as taking the derivatives in the opposite order, as long as the function you’re differentiating has enough derivatives. I wrote about this example here.

I recently had an expression come up that required summing over *n* distinct variables, each with a non-negative even value, and summing to 2*k*. How many ways can that be done? As many as dividing all the variable values in half and asking that they sum to *k*. Here the thing being chosen the variable, and since the indexes sum to *k*, I have a total of *k* choices to make, with replacement since I can chose a variable more than once. So again I’m choosing *k* things with replacement from a set of size *n*.

I wrote up a set of notes on sampling with replacement that I won’t duplicate here, but in a nutshell:

The symbol on the left is Richard Stanley’s suggested notation for the number of ways to select *k* things with replacement from a set of size *n*. It turns out that this equals the expression on the right side. The derivation isn’t too long, but it’s not completely obvious either. See the aforementioned notes for details.

I started by saying basic combinatorial problems can be reduced to multiplication, permutations, and combinations. Sampling with replacement can be reduced to a combination, the right side of the equation above, but with non-obvious arguments. Hence the need to introduce a new symbol, the one on the right, that maps directly to problem statements.

Sometimes you just need to count how many ways one can select *k* things with replacement from a set of size *n*. But often you need to actually enumerate the possibilities, say to loop over them in software.

In conducting a clinical trial, you may want to enumerate all the ways pending data could turn out. If you’re going to act the same way, no matter how the pending data work out, there’s no need to wait for the missing data before proceeding. This is something I did many times when I worked at MD Anderson Cancer Center.

When you evaluate a multivariate Taylor series, you need to carry out a sum that has a term for corresponding to each partial derivative. You could naively sum over all possible derivatives, not taking into account equality of mixed partials, but that could be very inefficient, as described here.

The example above summing over even partitions of an even number also comes out of a problem with multivariate Taylor series, one in which odd terms drop out by symmetry.

To find algorithms for enumerating selections with replacement, see Knuth’s TAOCP, volume 4, Fascicle 3.

Think of the denominator of your fraction as something you have to buy. If you have enough budget to buy a three-digit denominator, then you’re better off buying 355/113 rather than 314/100 because the former is more accurate. The approximation 355/113 is accurate to 6 decimal places, whereas 314/100 is only accurate to 2 decimal places.

There’s a way to find the most economical rational approximations to π, or any other irrational number, using continued fractions. If you’re interested in details, see the links below.

Here are the 10 best rational approximations to π, found via continued fractions.

|----------------+----------| | Fraction | Decimals | |----------------+----------| | 3 | 0.8 | | 22/7 | 2.9 | | 333/106 | 4.1 | | 355/113 | 6.6 | | 103993/33102 | 9.2 | | 104348/33215 | 9.5 | | 208341/66317 | 9.9 | | 312689/99532 | 10.5 | | 833719/265381 | 11.1 | | 1146408/364913 | 11.8 | |----------------+----------|

If you only want to know the number of correct decimal places, ignore the fractional parts of the numbers in the Decimals column above. For example, 22/7 gives two correct decimal places. But it almost gives three. (Technically, the Decimals column gives the -log_{10} of the absolute error.)

In case you’re curious, here’s a plot of the absolute errors on a log scale.

*f*(*x*) = 1/(1 + exp(*a* + *bx*))

has a **fixed point**, i.e. *f*(*x*) = *x*.

So given logistic regression parameters *a* and *b*, when does the logistic curve given by *y* = *f*(*x*) cross the line *y* = *x*? Do they always cross? Can they cross twice?

There’s always at least one solution. Because *f*(*x*) is strictly between 0 and 1, the function

*g*(*x*) = *f*(*x*) – *x*

is positive at 0 and negative at 1, and by the intermediate value theorem g(*x*) must be zero for some *x* between 0 and 1.

Sometimes *f* has only one fixed point. It may have two or three fixed points, as demonstrated in the graph below. The case of two fixed points is unstable: the logistic curve is tangent to the line *y* = *x* at one point, and a tiny change would turn this tangent point into either no crossing or two crossings.

If |*b*| < 1, then you can show that the function *f* is a **contraction map** on [0, 1]. In that case there is a unique solution to *f*(*x*) = *x*, and you can find it by starting with an arbitrary value for *x* and repeatedly applying *f* to it. For example, if *a*= 1 and *b* = 0.8 and we start with *x*= 0, after applying *f* ten times we get *x* = *f*(*x*) = 0.233790157.

There are a couple questions left to finish this up. How can you tell from *a* and *b* how many fixed points there will be? The condition |*b*| < 1 is sufficient for *f* to be a contraction map on [0, 1]. Can you find necessary and sufficient conditions?

**Related post**: Sensitivity of logistic regression prediction

Iterations of *f* converge on the golden ratio, no matter where you start (with one exception). The video creates a graph where they connect values of *x* on one line to values of *f*(*x*) on another line. Curiously, there’s an oval that emerges where no lines cross.

Here’s a little Python I wrote to play with this:

import matplotlib.pyplot as plt from numpy import linspace N = 70 x = linspace(-3, 3, N) y = 1 + 1/x for i in range(N): plt.plot([x[i], y[i]], [0, 1]) plt.xlim(-5, 5)

And here’s the output:

In the plot above I just used matplotlib’s default color sequence for each line. In the plot below, I used fewer lines (N = 50) and specified the color for each line. Also, I made a couple changes to the plot command, specifying the color for each line and putting the *x* line above the *y* line.

plt.plot([x[i], y[i]], [0, 1], c="#243f6a")

If you play around with the Python code, you probably want to keep N even. This prevents the *x* array from containing zero.

**Update**: Here’s a variation that extends the lines connecting (*x*, 0) and (*y*, 1). And I make a few other changes while I’m at it.

N = 200 x = linspace(-10, 10, N) y = 1 + 1/x z = 2*y-x for i in range(N): plt.plot([x[i], z[i]], [0, 2], c="#243f6a") plt.xlim(-10, 10)]]>

It’s a joke on Judea Pearl, expert in causal inference, and the Perl programming language, known for its unusual, terse syntax.

**Related:**

Last year I started digging into automatic differentiation and questioned whether I really had mastered calculus. At a high level, automatic differentiation is “just” an application of the chain rule in many variables. But you can make career out of exploring automatic differentiation (AD), and many people do. The basics of AD are not that complicated, but you can go down a deep rabbit hole exploring optimal ways to implement AD in various contexts.

You can make a career out of things that seem even simpler than AD. Thousands of people have made a career out of solving the equation *Ax* = *b* where *A* is a large matrix and the vector *b* is given. In high school you solve two equations in two unknowns, then three equations in three unknowns, and in principle you could keep going. But you don’t solve a sparse system of millions of equations the same way. When you consider efficiency, accuracy, limitations of computer arithmetic, parallel computing, etc. the humble equation *Ax* = *b* is not so simple to solve.

As Richard Feynman said, nearly everything is really interesting if you go into it deeply enough.

**Related posts**:

This is an analogous post for testing whether two data sets come from distributions with the same variance. Statistics texts books often present the *F*-test for this task, then warn in a footnote that the test is highly dependent on the assumption that both data sets come from normal distributions.

Statistics texts give too little attention to robustness in my opinion. Modeling assumptions never hold exactly, so it’s important to know how procedures perform when assumptions don’t hold exactly. Since the *F*-test is one of the rare instances where textbooks warn about a lack of robustness, I expected the *F*-test to perform terribly under simulation, relative to its recommended alternatives **Bartlett’s test** and **Levene’s test**. That’s not exactly what I found.

For my simulations I selected 35 samples from each of two distributions. I selected significance levels for the *F*-test, Bartlett’s test, and Levene’s test so that each would have roughly a 5% error rate under a null scenario, both sets of data coming from the same distribution, and a 20% error rate under an alternative scenario.

I chose my initial null and alternative scenarios to use normal (Gaussian) distributions, i.e. to satisfy the assumptions of the *F*-test. Then I used the same designs for data coming from a heavy-tailed distribution to see how well each of the tests performed.

For the normal null scenario, both data sets were drawn from a normal distribution with mean 0 and standard deviation 15. For the normal alternative scenario I used normal distributions with standard deviations 15 and 25.

Here are the results from the normal distribution simulations.

|----------+-------+--------+---------| | Test | Alpha | Type I | Type II | |----------+-------+--------+---------| | F | 0.13 | 0.0390 | 0.1863 | | Bartlett | 0.04 | 0.0396 | 0.1906 | | Levene | 0.06 | 0.0439 | 0.2607 | |----------+-------+--------+---------|

Here the Type I column is the proportion of times the test incorrectly concluded that identical distributions had unequal variances. The Type II column reports the proportion of times the test failed to conclude that distributions with different variances indeed had unequal variances. Results were based on simulating 10,000 experiments.

The three tests had roughly equal operating characteristics. The only difference that stands out above simulation noise is that the Levene test had larger Type II error than the other tests when calibrated to have the same Type I error.

To calibrate the operating characteristics, I used alpha levels 0.15, 0.04, and 0.05 respectively for the F, Bartlett, and Levene tests.

Next I used the design parameters above, i.e. the alpha levels for each test, but drew data from distributions with a heavier tail. For the null scenario, both data sets were drawn from a Student *t* distribution with 4 degrees of freedom and scale 15. For the alternative scenario, the scale of one of the distributions was increased to 25. Here are the results, again based on 10,000 simulations.

|----------+-------+--------+---------| | Test | Alpha | Type I | Type II | |----------+-------+--------+---------| | F | 0.13 | 0.2417 | 0.2852 | | Bartlett | 0.04 | 0.2165 | 0.2859 | | Levene | 0.06 | 0.0448 | 0.4537 | |----------+-------+--------+---------|

The operating characteristics degraded when drawing samples from a heavy-tailed distribution, *t* with 4 degrees of freedom, but they didn’t degrade uniformly.

Compared to the *F*-test, the Bartlett test had slightly better Type I error and the same Type II error.

The Levene test, had a much lower Type I error than the other tests, hardly higher than it was when drawing from a normal distribution, but had a higher Type II error.

**Conclusion**: The *F*-test is indeed sensitive to departures from the Gaussian assumption, but Bartlett’s test doesn’t seem much better in these particular scenarios. Levene’s test, however, does perform better than the *F*-test, depending on the relative importance you place on Type I and Type II error.

The general equation for the surface of an ellipsoid is

An ellipsoid has three semi-axes: *a*, *b*, and *c*. For a sphere, all three are equal to the radius. For a spheroid, two are equal. In an oblate spheroid, like Earth and Jupiter, *a* = *b* > *c*. For a prolate spheroid like Saturn’s moon Enceladus, *a* = *b* < *c*.

The dwarf planet Haumea is far from spherical. It’s not even a spheroid. It’s a general (tri-axial) ellipsoid, with *a*, *b*, and *c* being quite distinct. It’s three semi-axes are approximately 1000, 750, and 500 kilometers. The dimensions are not known very accurately. The image above is an artists conception of Haumea and its ring, via Astronomy Picture of the Day.

This post explains how to compute the volume and surface area of an ellipsoid. See that post for details. Here we’ll just copy the Python code from that post.

from math import sin, cos, acos, sqrt, pi from scipy.special import ellipkinc, ellipeinc def volume(a, b, c): return 4*pi*a*b*c/3 def area(a, b, c): phi = acos(c/a) k = a*sqrt(b**2 - c**2)/(b*sqrt(a**2 - c**2)) E = ellipeinc(phi, k) F = ellipkinc(phi, k) elliptic = E*sin(phi)**2 + F*cos(phi)**2 return 2.0*pi*c**2 + 2*pi*a*b*elliptic/sin(phi) a, b, c = 1000, 750, 500 print(volume(a, b, c)) print(area(a, b, c))

**Related post**: Planets evenly spaced on a log scale