**I don’t get paid for reviews**. I review things that I find interesting and think that readers would find interesting.

**I don’t do reviews with strings attached**. Most publishers don’t try to attach strings. They simply ask me if I’d like a copy of their book, and that’s that. A couple publishers have tried to exert more control, and I don’t review their books.

**I don’t write negative reviews** because they’re not interesting. There are millions of products you won’t buy this year. Who cares about another thing *not* to buy? A negative review could be interesting if it were for a well-known product that many people were thinking about buying, but I haven’t been asked to review anything like that. If I find something disappointing, I don’t write a review.

**Books need to be on paper**. Electronic files are fine for reference and for short-form reading, but I prefer paper for long-form reading.

**I’m open to reviewing hardware** if it’s something I would use and something that I think my readers would be interested in. I haven’t reviewed hardware to date, but someone offered me a device that I expect to review when it arrives.

Or in terms of probability, what is the expected distance between a fraction *n*/*r*, where *n* is large and fixed and *r* is chosen randomly between 1 and *n*, and the nearest larger integer?

In symbols, the question above is asking for the approximate value of

(1/*n*) ∑_{r=1}^{n−1} ( ⌈*n*/*r*⌉ – *n*/*r* )

for large *n*, i.e. in the limit as *n* goes to ∞. Here ⌈*x*⌉ denotes the ceiling of *x*, the smallest integer greater than or equal to *x*.

Let’s plot this as a function of *n* and see what it looks like. Here’s the Python code.

```python
import matplotlib.pyplot as plt
from numpy import ceil, arange

def f(n):
    return sum(ceil(n/r) - n/r for r in range(1, n)) / n

x = arange(1, 100)
y = [f(n) for n in x]
plt.plot(x, y)
plt.show()
```

And here’s the result.

It appears the graph may be converging to some value, and in fact it is. Charles de la Vallée Poussin proved in 1898 that the limiting value is the Euler–Mascheroni constant γ = 0.5772…. This constant is the limiting difference between the *n*th harmonic number and log *n*, i.e.

We can add a horizontal line to our plot to see how well the graph seems to match γ. To do this we need to import the constant `euler_gamma` from `numpy` and add the line

```python
plt.axhline(y=euler_gamma, linestyle=":")
```

after the plot command. When we do, this is what we see.

It looks like the plot is converging to a value slightly less than γ. Apparently the convergence is very slow. When we go out to 10,000 the plot is closer to being centered around γ, but it still seems to sit below γ more often than above.

If we evaluate our function at *n* = 1,000,000, we get 0.577258… while γ = 0.577215….

At *n* = 10,000,000 we get 0.577218…. So taking 100 times as many terms in our sum gives us one extra correct decimal place, as we’d expect of a random process, since convergence usually goes like 1/√*n*.

Thank you all for reading, commenting, and sharing.

**Update**: For highlights of my posts over the years, see Tim Hopper’s post John Cook’s Ten Year Blogging Endeavour.

The new record prime, 2^{77,232,917} – 1, has 23,249,425 digits when written in base 10.

In base 2, 2^{p} – 1 is a sequence of *p* ones. For example, 31 = 2^{5} – 1, which is 11111 in binary. So in binary, the new record prime is a string of 77,232,917 ones.

You can convert a binary number to hexadecimal (base 16) by starting at the right end and converting blocks of 4 bits into hexadecimal. For example, to convert 101101111 to hex, we break it into three blocks: 1, 0110, and 1111. These blocks convert to 1, 6, and F, and so 101101111 in binary corresponds to 16F in hex.

Now 77,232,917 = 19,308,229 * 4 + 1, so if we break our string of 77,232,917 ones into groups of four, we have a single bit followed by 19,308,229 groups of 4. This means that in hexadecimal, the new prime record is 1FFF…FFF, a 1 followed by 19,308,229 F’s.
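We can check this with a couple of lines of Python. Converting the number to hexadecimal is cheap because 16 is a power of 2, so Python only has to regroup bits rather than produce the 23-million-digit decimal form.

```python
p = 77_232_917

# Hex digits of the record prime 2**p - 1. Base 16 conversion only
# regroups bits, so this is fast despite the size of the number.
h = format(2**p - 1, 'x')

print(h[:4], len(h))  # starts with 1fff..., 19,308,230 hex digits in all
```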

The new record is the 50th Mersenne prime. A Mersenne prime is a prime number that is one less than a power of 2, i.e. of the form 2^{p} – 1. It turns out *p* must be prime before 2^{p} – 1 can be prime. In the case of the new record, 77,232,917 is prime.

It is unknown whether there are infinitely many Mersenne primes. But now we know there are at least 50 of them.

All the recent prime number records have been Mersenne primes because there is an algorithm for testing whether a number of the special form 2^{p} – 1 is prime, the Lucas-Lehmer test.

Mersenne primes are closely related to perfect numbers. A number *n* is perfect if the sum of the numbers less than *n* that divide *n* equals *n*. For example, 28 is divisible by 1, 2, 4, 7, and 14, and 1 + 2 + 4 + 7 + 14 = 28.

Finding a new Mersenne prime means finding a new perfect number. If *m* is a Mersenne prime, then *n* = *m*(*m* + 1)/2 is a perfect number. Conversely, if *n* is an *even* perfect number, *n* has the form *m*(*m* + 1)/2 where *m* is a Mersenne prime. Nobody knows whether odd perfect numbers exist. Since we don’t know whether there are infinitely many Mersenne primes, we don’t know whether there are infinitely many perfect numbers. But there are at least 50 of them.
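Here’s a quick empirical check of the correspondence, using the first few Mersenne primes:

```python
def is_perfect(n):
    # A number is perfect if its proper divisors sum to itself.
    return n == sum(d for d in range(1, n) if n % d == 0)

# First few Mersenne primes m = 2**p - 1, for p = 2, 3, 5, 7
mersenne_primes = [3, 7, 31, 127]
perfect_numbers = [m*(m + 1)//2 for m in mersenne_primes]
print(perfect_numbers)  # [6, 28, 496, 8128]
```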


If the signal is given by a function *h*(*t*) band-limited to frequencies below *f*_{c}, then the Nyquist-Shannon sampling theorem says we can recover *h*(*t*) from samples taken 2*f*_{c} times per second by

*h*(*t*) = ∑_{n} *h*(*n*/2*f*_{c}) sinc(2*f*_{c}*t* – *n*)

where sinc(*x*) = sin(π*x*) / π*x*.

In practice, signals may not be entirely band-limited, but beyond some frequency *f*_{c} higher frequencies can be ignored. This means that the cutoff frequency *f*_{c} is somewhat fuzzy. As we demonstrate below, it’s much better to err on the side of making the cutoff frequency higher than necessary. Sampling at a little less than the necessary frequency can cause the reconstructed signal to be a poor approximation of the original. That is, the sampling theorem is robust to over-sampling but not to under-sampling. There’s no harm in sampling more frequently than necessary. (No harm as far as the accuracy of the equation above goes. There may be economic costs, for example, that come from using an unnecessarily high sampling rate.)

Let’s look at the function *h*(*t*) = cos(18πt) + cos(20πt). The bandwidth of this function is 10 Hz, and so the sampling theorem requires that we sample our function at 20 Hz. If we sample at 20.4 Hz, 2% higher than necessary, the reconstruction lines up with the original function so well that the plots of the two functions agree to the thickness of the plotting line.

But if we sample at 19.6 Hz, 2% less than necessary, the reconstruction is not at all accurate due to problems with aliasing.
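Here’s a sketch of that experiment. This is not the code behind the plots above; it reconstructs the signal from a truncated version of the sum in the sampling theorem, using `numpy.sinc`, which agrees with the sinc defined above.

```python
import numpy as np

def reconstruct(h, fs, t, N=2000):
    # Truncated Shannon series built from the samples h(n/fs), |n| <= N
    n = np.arange(-N, N + 1)
    samples = h(n / fs)
    return np.array([np.sum(samples * np.sinc(fs*ti - n)) for ti in t])

h = lambda t: np.cos(18*np.pi*t) + np.cos(20*np.pi*t)
t = np.linspace(0, 2, 500)

err_over  = np.max(np.abs(reconstruct(h, 20.4, t) - h(t)))  # 2% over-sampled
err_under = np.max(np.abs(reconstruct(h, 19.6, t) - h(t)))  # 2% under-sampled
```

The under-sampled reconstruction is off by an error on the order of the signal itself, while the over-sampled reconstruction only suffers a small error from truncating the infinite sum.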

One rule of thumb is to use the **Engineer’s Nyquist frequency** of 2.5 *f*_{c} which is 25% more than the exact Nyquist frequency. An engineer’s Nyquist frequency is sorta like a baker’s dozen, a conventional safety margin added to a well-known quantity.

**Update**: Here’s a plot of the error, the RMS difference between the signal and its reconstruction, as a function of sampling frequency.

By the way, the function in the example demonstrates beats. The sum of a 9 Hz signal and a 10 Hz signal is a 9.5 Hz signal modulated at 0.5 Hz. More details on beats in this post on AM radio and musical instruments.

Someone wrote to me the other day asking if I could explain a probability example from the Wall Street Journal. (“Proving Investment Success Takes Time,” Spencer Jakab, November 25, 2017.)

Victor Haghani … and two colleagues told several hundred acquaintances who worked in finance that they would flip two coins, one that was normal and the other that was weighted so it came up heads 60% of the time. They asked the people how many flips it would take them to figure out, with a 95% confidence level, which one was the 60% coin. Told to give a “quick guess,” nearly a third said fewer than 10 flips, while the median response was 40. The correct answer is 143.

The anecdote is correct in spirit: it takes longer to discover the better of two options than most people suppose. But it’s jarring to read the answer is precisely 143 when the question hasn’t been stated clearly.

How many flips would it take to figure out which coin is better with a 95% confidence level? For starters, the answer would have to be a distribution, not a single number. You might quickly come to the right conclusion. You *might* quickly come to the *wrong* conclusion. You might flip coins for a long time and never come to a conclusion. Maybe there is a way of formulating the problem so that the *expected value* of the distribution is 143.

How are you to go about flipping the coins? Do you flip both of them, or just flip one coin? For example, you might flip both coins until you are confident that one is better, and conclude that the better one is the one that was designed to come up heads 60% of the time. Or you could just flip one of them and test the hypothesis Prob(heads) = 0.5 versus the alternative Prob(heads) = 0.6. Or maybe you flip one coin two times for every one time you flip the other. Etc.

What do you mean by “95% confidence level”? Is this a frequentist confidence interval? And do you compute the (Bayesian) predictive probability of arriving at such a confidence level? Are you computing the (Bayesian) posterior model probabilities of two models, one in which the first coin has probability of heads 0.5 and the second has probability 0.6 versus the opposite model?

Do you assume that you know a priori that one coin has probability of heads 0.5 and the other 0.6, or do you not assume this and just want to find the coin with higher probability of heads, and evaluate such a model when in fact the probabilities of heads are as stated?

Are you conducting an experiment with a predetermined sample size of 143? Or are you continuously monitoring the data, stopping when you reach your conclusion?

I leave it as an exercise to the reader to implement the various alternatives suggested above and see whether one of them produces 143 as a result. (I did a back-of-the-envelope calculation that suggests there is one.) So the first question is to reverse engineer which problem statement the article was based on. The second question is to decide which problem formulation you believe would be most appropriate in the context of the article.
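Here is one such formulation as a Monte Carlo sketch (my illustration, not the article’s calculation): flip a single coin repeatedly and run a sequential likelihood ratio test of Prob(heads) = 0.5 versus Prob(heads) = 0.6, stopping when one hypothesis is 19 times more likely than the other, i.e. at the 95% level.

```python
import random
from math import log

def flips_to_decide(p_true=0.6, conf=0.95):
    # Flip until the log likelihood ratio crosses the
    # confidence threshold in either direction.
    threshold = log(conf / (1 - conf))
    llr, n = 0.0, 0
    while abs(llr) < threshold:
        n += 1
        if random.random() < p_true:
            llr += log(0.6 / 0.5)  # heads favors the 60% coin
        else:
            llr += log(0.4 / 0.5)  # tails favors the fair coin
    return n

random.seed(20171125)
runs = [flips_to_decide() for _ in range(2000)]
mean_flips = sum(runs) / len(runs)
```

The average stopping time comes out on the order of a hundred flips, not ten, but the distribution is heavily skewed, which is exactly the point: the answer is a distribution, not a number.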

This was yesterday’s image.

Today’s image is surprisingly plain if we use *y* = 18.

This is in part because the least common multiple of 1, 1, and 18 is 18. The image could have no more than 18 vertices. In fact, it has only 6 vertices because the summand above has period 6.

But if we use *y* = 2018 we get something much more complex.

The Exponential Sum of the Day page will use *y* = 18 this year. There will be a few simple images this year but there will also be lots of surprises.

Here are the books:

Click on the image to see a larger version.

Two titles are not possible to read in the photo. These are

- Conduction of heat in solids by Carslaw and Jaeger
- Molecular Thermodynamics of Fluid Phase Equilibria by Prausnitz

The profile of the tower is given by

*y* = –*y*_{*} log |*x*/*x*_{0}|

where *y*_{*} and *x*_{0} are chosen to match the tower’s dimensions.

Here’s a plot of the curve:

And here’s the code that produced the plot:

```python
from numpy import log, exp, linspace, vectorize
import matplotlib.pyplot as plt

# Taken from "Towing Icebergs, Falling Dominoes,
# and Other Adventures in Applied Mathematics"
# by Robert B. Banks

# Constants given in Banks in feet. Convert to meters.
feet_to_meter = 0.0254*12
ystar  = 201*feet_to_meter
x0     = 207*feet_to_meter
height = 984*feet_to_meter

# Solve for where to cut off the curve to match the height of the tower:
# -ystar log(xmin/x0) = height
xmin = x0 * exp(-height/ystar)

def f(x):
    if -xmin < x < xmin:
        return height
    else:
        return -ystar*log(abs(x/x0))

curve = vectorize(f)

x = linspace(-x0, x0, 400)
plt.plot(x, curve(x))
plt.xlim(-2*x0, 2*x0)
plt.xlabel("Meters")
plt.ylabel("Meters")
plt.title("Eiffel Tower")
plt.gca().set_aspect(1)
plt.savefig("eiffel_tower.svg")
```

**Related post**: When length equals area

The St. Louis arch is approximately a catenary, i.e. a hyperbolic cosine.

These have been the most popular math-related posts here this year.

- Golden powers are nearly integers
- How efficient is Morse code?
- Finding numbers in pi
- Common words used as technical terms
- Sierpinski triangle strikes again

See also a list of the top five computing-related posts.

Hermite polynomials are orthogonal polynomials over the real line with respect to the weight given by the standard normal distribution. (There are two conventions for defining Hermite polynomials, what Wikipedia calls the physicist convention and the probabilist convention. We’re using the latter here.)

The first few Hermite polynomials are 1, *x*, *x*^{2} – 1, *x*^{3} – 3*x*, … . You can find the rest of the Hermite polynomials via the recurrence relation

*H*_{n+1}(*x*) = *x* *H*_{n}(*x*) – *n* *H*_{n-1}(*x*).

The odd order Hermite polynomials are odd functions, and the standard normal density is an even function, so the integral of their product is zero. In fact, since *H*_{0} = 1 and the Hermite polynomials are orthogonal, the integral of *any* Hermite polynomial of order *k* > 0 times the standard normal density is zero. In probability terms, if *X* is a standard normal random variable, the expected value of *H*_{k}(*X*) is 0 for all *k* > 0. By contrast, the expected value of *X*^{m} is (*m* – 1)!!, i.e. (*m* – 1) double factorial, if *m* is even, and 0 if *m* is odd.

You could think of the Hermite polynomials as **the right basis to use** when working with normal probability distributions. Writing a polynomial *p*(*x*) = ∑_{k} *c*_{k} *H*_{k}(*x*) as a linear combination of Hermite polynomials is a change of basis that makes integration easy: the expected value of *p*(*X*) is simply *c*_{0}, since every other term integrates to zero. Compare the monomial basis, where the expected value of *X*^{k} is (*k* – 1)!! [*k* even].

Here [*k* even] is the function that returns 1 if *k* is even and 0 otherwise. This notation was introduced by Kenneth Iverson in the APL programming language and has become moderately common in mathematics.
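NumPy implements the probabilist convention in `numpy.polynomial.hermite_e`, so we can check these expected values numerically with Gauss–Hermite-type quadrature for the standard normal:

```python
import numpy as np
from numpy.polynomial.hermite_e import HermiteE, hermegauss, poly2herme

# Quadrature nodes and weights for the weight exp(-x**2/2),
# renormalized so the weights integrate the N(0, 1) density.
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2*np.pi)

def normal_expectation(coefs):
    # Expected value of the polynomial with HermiteE coefficients coefs
    return np.sum(weights * HermiteE(coefs)(nodes))

e_he2 = normal_expectation([0, 0, 1])                    # E[H_2(X)] = 0
e_x4  = normal_expectation(poly2herme([0, 0, 0, 0, 1]))  # E[X^4] = 3!! = 3
```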


The previous post looked at integrating exp( log( *g*(*x*) ) ) by expanding log *g*(*x*) in a power series centered at its maximum. Without loss of generality we can assume the maximum occurs at 0; otherwise do a change of variables first to make this happen.

The first order term of the series is missing because the derivative of *g* is zero at its maximum. Taking the terms of the series up to second order gives us something of the form *a* – *bx*² where *b* > 0. Then

exp(*a* – *bx*²) = exp(*a*) exp(-*bx*²).

The first term on the right is a constant that can be pulled out of the integral, and the second term is a normal distribution density, modulo vertical scaling.

Now suppose we want to use more terms from the power series for log *g*. Let’s write out the series as

log *g*(*x*) = *a* – *bx*² + *R*(*x*)

where *R*(*x*) is the remainder of the series after the quadratic terms. When we exponentiate the series for log *g* we get

exp(log *g*(*x*)) = exp(*a*) exp(-*bx*²) exp(*R*(*x*)).

We think of this as the expected value of exp(*R*(*X*)) where *X* is a normal random variable with variance 1/(2*b*), up to some constant we pull out of the integral.

We could approximate *R*(*x*) with the polynomial formed by the first few terms, but that’s not enough by itself. We need to approximate the expected value of exp(*R*(*X*)), not of *R*(*X*). The trick is to create *another* polynomial and find the expectation of *that* polynomial applied to *X*.

We need to find a power series for exp(*R*(*x*)) and truncate it after a few terms. We don’t need to analytically find the power series for exp(*R*(*x*)) directly before truncating it. We can apply the first few terms of the power series for exp to the first few terms of the power series for *R* and keep all the terms up to the order we want, say up to order *k*, giving us a *k*th order polynomial *P*(*x*).

Once we have the polynomial *P*(*x*), we can find the expected value of *P*(*X*) in terms of normal moments (see this post) or alternatively by writing the polynomial as a sum of Hermite polynomials.

Consider the integral

*I*(*m*, *n*) = ∫ *f*(*x*, *m*, *n*) *dx*

taken over the whole real line, where

*f*(*x*, *m*, *n*) = exp(*mx*) / (1 + exp(*x*))^{m+n}.

This integral comes up in Bayesian logistic regression with a uniform (improper) prior. We will use this integral to illustrate a simple case of Laplace approximation.

The idea of Laplace approximation is to approximate the integral of a Gaussian-like function by the integral of a (scaled) Gaussian with the same mode and same curvature at the mode. That’s the most common, second order Laplace approximation. More generally, Laplace’s approximation depends on truncating a Taylor series, and truncating after the 2nd order term gives the approximation described above. One can construct higher-order Laplace approximations by keeping more terms of the power series.

Let *x*_{0} be the place where *f* takes on its maximum and let *g* = log *f*. Note that *g* also takes on its maximum at *x*_{0} and so its first derivative is zero there. Then the (second order) Laplace approximation has the following derivation.

We can show that

*g*′(*x*) = *m* – (*m* + *n*) exp(*x*) / (1 + exp(*x*))

and

*g*″(*x*) = – (*m* + *n*) exp(*x*) / (1 + exp(*x*))².

It follows that *x*_{0} = log(*m*/*n*), so *g*″(*x*_{0}) = –*mn*/(*m* + *n*), and the Laplace approximation for *I*(*m*, *n*) reduces to

(*m*^{m} *n*^{n} / (*m* + *n*)^{m+n}) √(2π(*m* + *n*)/(*mn*)).

Now let’s compare the Laplace approximation to the exact value of the integral for a few values of *m* and *n*. We would expect the Laplace approximation to be more accurate when the integrand is shaped more like the Gaussian density. This happens when *m* and *n* are large and when they are more nearly equal. (The function *f* is a likelihood, and *m+n* is the number of data points in the likelihood. More data means the likelihood is more nearly normal. Also, *m* = *n* means *f* is symmetric, which improves the normal approximation.)

We’ll look at (*m*, *n*) = (2, 1), (20, 10), (200, 100), and (15, 15). Here’s the plot of *f*(*x*, 2, 1) with its normal approximation, scaled vertically so that the heights match.

Even for small arguments, the Gaussian approximation fits well on the left side, but not so much on the right. For large arguments it would be hard to see the difference between the two curves.

Here are the results of the Laplace approximation and exact integration.

|-----+-----+---------------+---------------+--------------|
|   m |   n | approx        | exact         | rel error    |
|-----+-----+---------------+---------------+--------------|
|   2 |   1 | 0.45481187    | 0.5           | -0.090376260 |
|  20 |  10 | 4.9442206e-9  | 4.99250870e-9 | -0.009672111 |
|  15 |  15 | 8.5243139e-10 | 8.5956334e-10 | -0.008297178 |
| 200 | 100 | 3.6037801e-84 | 3.6072854e-84 | -0.000971728 |
|-----+-----+---------------+---------------+--------------|

The integrals were computed exactly using Mathematica; the results are rational numbers. [1]

Note that when *m* and *n* increase by a factor of 10, the relative error goes down by about a factor of 10. This is in keeping with theory that says the error should be *O*(1/(*m*+*n*)). Also, note that the error for (15, 15) is a little lower than that for (20, 10). The errors are qualitatively just what we expected.
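Here’s Python code that reproduces the table, assuming the integrand *f*(*x*, *m*, *n*) = exp(*mx*) / (1 + exp(*x*))^{m+n} described above. The exact value uses the beta function identity from footnote [1]; the two beta terms there sum to B(*m*, *n*).

```python
from math import exp, lgamma, log, pi, sqrt

def laplace_approx(m, n):
    # Second order Laplace approximation: f(x0) * sqrt(2*pi/|g''(x0)|)
    # with x0 = log(m/n) and g''(x0) = -m*n/(m + n).
    log_f_x0 = m*log(m) + n*log(n) - (m + n)*log(m + n)
    return exp(log_f_x0) * sqrt(2*pi*(m + n)/(m*n))

def exact(m, n):
    # I(m, n) = B(m, n+1) + B(m+1, n) = B(m, n),
    # computed via log-gamma to avoid overflow for large arguments.
    return exp(lgamma(m) + lgamma(n) - lgamma(m + n))

for m, n in [(2, 1), (20, 10), (15, 15), (200, 100)]:
    rel_error = laplace_approx(m, n)/exact(m, n) - 1
    print(m, n, laplace_approx(m, n), exact(m, n), rel_error)
```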

***

[1] I initially used Mathematica to compute the integrals, but then realized that’s not necessary. The change of variables *u* = exp(*x*) / (1 + exp(*x*)) shows that

*I*(*a*, *b*) = *B*(*a*, *b*+1) + *B*(*a*+1, *b*)

where *B* is the beta function. Incidentally, this shows that if *a* and *b* are integers then *I*(*a*, *b*) is rational.

- Programming language life expectancy
- SHA1 no longer recommended, but hardly a failure
- The most disliked programming language
- Improving on the Unix shell
- One practical application of functional programming

I plan to post a list of the top five math-related posts soon.

I recently ran across an essay [1] in which C. S. Lewis expressed similar concerns sixty years ago, referring to “the incubus of Research.” In the essay he describes a young academic in the humanities who

… far from being able or anxious … to add to the sum of human knowledge, wants to acquire a good deal more of the knowledge we already have. He has lately begun to discover how many things he needs to know in order to follow up his budding interests … To head him off from these studies, to pinfold him in some small inquiry whose chief claim is that no one has made it before is cruel and frustrating. It wastes such years as he will never have again …

My favorite part of the quote is describing research as “some small inquiry whose chief claim is that no one has made it before.” As Lewis said elsewhere, striving for originality can thwart originality.

[1] “Interim Report.” First published in *The Cambridge Review* in 1956 and reprinted as chapter 17 of *Present Concerns*.

Find an “entry point” into China of independent intrinsic interest to you, be it basketball, artificial intelligence, Chinese opera, whatever.

In a podcast interview—sorry, I no longer remember which one—Cowen talked more generally about finding entry points or onramps for learning big topics. The blog post mentioned above applies this specifically to China, but he gave other examples of coming to a subject through a side door rather than the front entrance. If I remember correctly, he mentioned learning the politics or economics of a region by first studying its architecture or food.

I’ve stumbled upon a number of intellectual onramps through my career, but I haven’t been as deliberate as Cowen in seeking them out. I had no interest in medicine before I ended up working for the world’s largest cancer center. I learned a bit about cancer and genetics from working at MD Anderson and I’ve since learned a little about other areas of medicine working with various clients. Right now I’m working on projects in nephrology and neurology.

Applied math is my onramp to lots of things I might not pursue otherwise. As John Tukey said, you get to play in everyone else’s backyard.

There are many things I’ve tried and failed to learn via a frontal assault. For example, I’ve tried several times to learn algebraic geometry by simply reading a book on the subject. But I find all the abstract machinery mind-numbing and difficult to absorb, just as Cowen described his first exposure to Chinese history. If I’m ever to learn much algebraic geometry, it will start with an indirect entry point, such as a concrete problem I need to solve.

Let *A*, *B*, and *C* be matrices that are compatible for multiplication. Then (*AB*)*C* = *A*(*BC*). That is, matrix multiplication is associative. But as far as efficiency is concerned, matrix multiplication is not associative: One side of the equation may be much faster to compute than the other.

For any matrix *M*, let rows(*M*) be the number of rows in *M* and let cols(*M*) be the number of columns. When we multiply two matrices *M* and *N*, we multiply every row of *M* by every column of *N* [1]. So the number of vector inner products is rows(*M*) cols(*N*), and the number of multiplications [2] in each inner product is cols(*M*). (We must have cols(*M*) = rows(*N*) because we implicitly assumed it’s possible to multiply *M* and *N*.)

That means the total number of scalar multiplications required to compute *MN* equals

rows(*M*) cols(*M*) cols(*N*) = rows(*M*) rows(*N*) cols(*N*).

Now let’s apply this to computing *ABC*. Suppose *A* is an *m* by *n* matrix and *C* is a *p* by *q* matrix. Then *B* has to be an *n* by *p* matrix in order to be compatible for multiplication.

Computing *AB* requires *mnp* multiplications. Once we have *AB*, an *m* by *p* matrix, the number of multiplications to compute (*AB*)*C* is *mpq*. So the total number of multiplications to compute *ABC* by first computing *AB* is *mp*(*n* + *q*).

If we compute *BC* first, this requires *npq* multiplications, and then multiplying *A* by *BC* requires *mnq* operations, for a total of *nq*(*m + p*).

In summary, computing (*AB*)*C* requires *mp*(*n* + *q*) multiplications, and computing *A*(*BC*) requires (*m + p*)*nq* multiplications. I turned the second term around to emphasize that in both expressions you do something to *m* and *p*, and something else to *n* and *q*. You either multiply and add, or add and multiply.

If you’re trying to minimize these terms, you’d rather add big numbers together than multiply them. So if *m* and *p* are large relative to *n* and *q*, i.e. both *A* and *C* are tall matrices, having more rows than columns, then multiplying *A*(*BC*) is going to be faster than multiplying (*AB*)*C*.

For example, suppose both *A* and *C* have a million rows and a hundred columns. Necessarily *B* would have a hundred rows and a million columns. Computing (*AB*)*C* would require 2×10^{14} multiplications, but computing *A*(*BC*) would take 2×10^{10} multiplications. That is, the latter would be 10,000 times faster.
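A quick illustration in NumPy, with smaller sizes so it runs in a blink. Both orders give the same matrix up to floating point error, but by the counts above they differ by a factor of 40 in scalar multiplications.

```python
import numpy as np

m, n, p, q = 2000, 50, 2000, 50  # A and C tall: many rows, few columns
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, q))

left  = (A @ B) @ C  # m*p*(n+q) = 400 million scalar multiplications
right = A @ (B @ C)  # n*q*(m+p) = 10 million scalar multiplications
```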

**Related**: Applied linear algebra

***

[1] This post assumes we compute matrix products the usual way, taking the product of every row in the left matrix with every column in the right matrix. It’s theoretically possible, though not practical, to multiply matrices in fewer operations. If the matrices have some exploitable special structure then there could be a faster way to multiply them, but not in general.

[2] It is common in numerical linear algebra to just count multiplications because this gives a good idea of how efficient an algorithm is. Counting additions and multiplications would usually lead to the same conclusions. If you really want to be precise, you’d need to count memory operations. In practice memory management makes more of a difference than whether we count just multiplications or multiplications and additions.


The *k*th order term in Taylor’s theorem is a rank *k* tensor. You can think of rank 0 tensors as numbers, rank 1 tensors as vectors, and rank 2 tensors as matrices. Then we run out of familiar objects. A rank 3 tensor requires you to start thinking in terms of tensors rather than more elementary terms.

**There is a way to express Taylor’s theorem using only vectors and matrices**. Maybe not the most elegant approach, depending on one’s taste, but it avoids any handwaving talk of a tensor being a “higher dimensional boxes of numbers” and such.

There’s a small price to pay. You have to introduce two new but simple ideas: the vec operator and the Kronecker product.

The **vec operator** takes an *m* × *n* matrix *A* and returns an *mn* × 1 matrix *v*, i.e. a column vector, by stacking the columns of *A*. The first *m* elements of *v* are the first column of *A*, the next *m* elements of *v* are the second column of *A*, etc.

The **Kronecker product** of an *m* × *n* matrix *A* and a *p* × *q* matrix *B* is an *mp* × *nq* matrix *K* = *A* ⊗ *B*. You can think of *K* as a block partitioned matrix. The *ij* block of *K* is *a*_{ij} *B*. In other words, to form *K*, take each element of *A* and replace it with its product with the matrix *B*.

A couple examples will make this clear.
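A couple of small matrices in NumPy make the definitions concrete (the particular numbers here are arbitrary):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 5],
              [6, 7]])

vecA = A.reshape(-1, order='F')  # stack the columns of A
K = np.kron(A, B)                # 4 x 4; the (i, j) block is A[i, j] * B

print(vecA)  # [1 3 2 4]
```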

Now we write down Taylor’s theorem. Let *f* be a real-valued function of *n* variables. Then

where *f*^{(0)} = *f* and for *k* > 0,

The symbol ⊗ with a number on top means to take the Kronecker product of the argument with itself that many times.

Source: Matrix Differential Calculus by Magnus and Neudecker.

**Related post**: What is a tensor?

I will discuss, among other things, when common sense applies and when correct analysis can be counter-intuitive. There will be ample time at the end of the presentation for Q & A.

If you’re interested in attending, you can register here.

Thanks to Brian Borchers who suggested the subject of this post in a comment on a previous post on transforms and convolutions.

The **moment generating function** (MGF) of a random variable *X* is defined as the expected value of exp(*tX*). By the so-called rule of the unconscious statistician we have

*M*_{X}(*t*) = E[exp(*tX*)] = ∫ exp(*tx*) *f*_{X}(*x*) *dx*

where *f*_{X} is the probability density function of the random variable *X*. The function *M*_{X} is called the moment generating function of *X* because its *n*th derivative, evaluated at 0, gives the *n*th moment of *X*, i.e. the expected value of *X*^{n}.

If we flip the sign on *t* in the integral above, we have the **two-sided Laplace transform** of *f*_{X}. That is, the moment generating function of *X* at *t* is the two-sided Laplace transform of *f*_{X} at –*t*. If the density function is zero for negative values, then the two-sided Laplace transform reduces to the more common (one-sided) Laplace transform.

Since the derivatives of *M*_{X} at zero are the moments of *X*, the power series for *M*_{X} is the **exponential generating function** for the moments. We have

*M*_{X}(*t*) = ∑_{n=0}^{∞} *m*_{n} *t*^{n} / *n*!

where *m*_{n} is the *n*th moment of *X*.
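For example, the standard normal distribution has MGF exp(*t*²/2), and differentiating at zero recovers the moments 1, 0, 1, 0, 3, …, the double factorials (*n* – 1)!! for even *n*. A quick check with SymPy:

```python
import sympy as sp

t = sp.symbols('t')
M = sp.exp(t**2/2)  # MGF of the standard normal distribution

# The nth derivative at 0 gives the nth moment.
moments = [sp.diff(M, t, n).subs(t, 0) for n in range(5)]
print(moments)  # [1, 0, 1, 0, 3]
```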

This terminology needs a little explanation since we’re using “generating function” two or three different ways. The “moment generating function” is the function defined above and only appears in probability. In combinatorics, the (ordinary) generating function of a sequence is the power series whose coefficient of *x*^{n} is the *n*th term of the sequence. The exponential generating function is similar, except that each term is divided by *n*!. This is called the exponential generating function because it looks like the power series for the exponential function. Indeed, the exponential function is the exponential generating function for the sequence of all 1’s.

The equation above shows that *M*_{X} is the exponential generating function for *m*_{n} and the ordinary generating function for *m*_{n}/*n*!.

If a random variable *Y* is defined on the integers, then the (ordinary) generating function for the sequence Prob(*Y* = *n*) is called, naturally enough, the **probability generating function** for *Y*.

The **z-transform** of a sequence, common in electrical engineering, is the (ordinary) generating function of the sequence, but with *x* replaced with 1/*z*.

The **characteristic function** of a random variable is a variation on the moment generating function. Rather than the expected value of exp(*tX*), it uses the expected value of exp(*itX*). This means the characteristic function of a random variable is the **Fourier transform** of its density function.

Characteristic functions are easier to work with than moment generating functions. We haven’t talked about when moment generating functions exist, but it’s clear from the integral above that the right tail of the density has to go to zero faster than *e*^{–x}, which isn’t the case for fat-tailed distributions. That’s not a problem for the characteristic function because the Fourier transform exists for any density function. This is another example of how complex variables simplify problems.

Given the complexity of the plot, the function definition is surprisingly simple:

The Fourier transform is even simpler: it’s the indicator function of [-2π, -π] ∪ [π, 2π], i.e. the function that is 1 on the intervals [-2π, -π] and [π, 2π] but zero everywhere else.

The Shannon wavelet is orthogonal to integer translates of itself. This isn’t obvious in the time domain, but it’s easy to prove in the frequency domain using Parseval’s theorem.

Here’s a plot of the original wavelet and the wavelet shifted to the left by 3:

And here’s a plot of the product of the two wavelets. It’s plausible that the areas above and below the *x*-axis cancel out each other, and indeed they do.
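We can check this numerically. Assuming the standard real form of the Shannon wavelet, ψ(*t*) = 2 sinc(2*t*) – sinc(*t*), whose Fourier transform is the indicator function described above:

```python
import numpy as np
from scipy.integrate import quad

def psi(t):
    # Shannon wavelet; np.sinc(x) = sin(pi*x)/(pi*x), matching this post
    return 2*np.sinc(2*t) - np.sinc(t)

# Inner product with the wavelet shifted by 3: should be near zero.
ip, _  = quad(lambda t: psi(t)*psi(t - 3), -80, 80, limit=2000)

# Norm squared: should be near one.
nrm, _ = quad(lambda t: psi(t)**2, -80, 80, limit=2000)
```

The small departures from 0 and 1 come from truncating the integrals to a finite interval, since the wavelet decays only like 1/*t*.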

**Related post**: Sinc and Jinc integrals

Convolution theorems say that

*T*(*f* * *g*) = *T*(*f*) *T*(*g*)

where *f* and *g* are functions, *T* is an integral transform, and * is a kind of convolution. In words, the transform of a convolution is the product of transforms.

When the transformation changes, the notion of convolution changes.

Here are three examples.

With the Fourier transform defined as

*T*(*f*)(ω) = ∫ *f*(*x*) exp(–*i*ω*x*) *dx*,

convolution is defined as

(*f* * *g*)(*x*) = ∫ *f*(τ) *g*(*x* – τ) *d*τ,

with both integrals taken over the whole real line.

Note: There are many minor variations on the definition of the Fourier transform. See these notes.

With the Laplace transform defined as

*T*(*f*)(*s*) = ∫_{0}^{∞} *f*(*t*) exp(–*st*) *dt*,

convolution is defined as

(*f* * *g*)(*t*) = ∫_{0}^{t} *f*(τ) *g*(*t* – τ) *d*τ.

With the Mellin transform defined as

*T*(*f*)(*s*) = ∫_{0}^{∞} *f*(*x*) *x*^{s–1} *dx*,

convolution is defined as

(*f* * *g*)(*x*) = ∫_{0}^{∞} *f*(τ) *g*(*x*/τ) *d*τ/τ.
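As a sanity check on the Laplace case, SymPy can verify the convolution theorem for a particular pair of functions, say *f*(*t*) = exp(–*t*) and *g*(*t*) = *t* (my choice of example, not from the original post):

```python
import sympy as sp

t, tau, s = sp.symbols('t tau s', positive=True)
f = sp.exp(-t)
g = t

# Laplace convolution: (f * g)(t) = integral of f(tau) g(t - tau) over [0, t]
conv = sp.integrate(f.subs(t, tau) * g.subs(t, t - tau), (tau, 0, t))

lhs = sp.laplace_transform(conv, t, s, noconds=True)
rhs = (sp.laplace_transform(f, t, s, noconds=True)
       * sp.laplace_transform(g, t, s, noconds=True))

# Transform of the convolution equals the product of the transforms.
assert sp.simplify(lhs - rhs) == 0
```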

This post will look at the power series for the gamma function centered at 1 and will use a different plotting style.

Here’s what the contours look like with 12 terms:

The radius of convergence is 1 because the gamma function has a singularity at 0. (The radius of convergence is the distance from the center of the power series to the nearest singularity.) Contour lines correspond to constant phase. The gamma function is real for real arguments, and so you can see the real axis running through the middle of the plot because real numbers have zero phase. The contour lines meet at zeros, which you can see are near a circle of radius 1 centered at *z* = 1.

Here’s what the contour plot looks like with 30 terms:

And here’s what it looks like with just 5 terms:

Here’s the Mathematica code that was used to create the images.

```mathematica
P[x_] = Normal[Series[1.0 Gamma[x], {x, 1, 12}]]
ContourPlot[
    Arg[P[x + I y]], {x, -1, 3}, {y, -2, 2},
    ColorFunction -> "PearlColors", Contours -> 20
]
```

The 1.0 in front of the call to `Gamma` is important. It tells Mathematica to numerically evaluate the coefficients of the power series. Otherwise Mathematica will find the coefficients symbolically and the plots will take forever.

Similarly, the call to `Normal` tells Mathematica not to carry around a symbolic error term `O(x - 1)^13`.

At first the results we needed were in the literature but after a while we ran out of known results and had to learn something about special functions. This was a very unsettling experience for there were very few places to go to really learn about special functions. At least that is what we thought. Actually there were many, but the typical American graduate education which we had did not include anything about hypergeometric functions. And

hypergeometric functions are the key to this subject, as I have found out after many years of fighting them.

Emphasis added.

Askey’s book was written in 1975, and he was describing his experience from ten years before that. Special functions, and in particular hypergeometric functions, went from being common knowledge among mathematicians at the beginning of the 20th century to being arcane by mid century.

I learned little about special functions and nothing about hypergeometric functions as a graduate student. I first ran into hypergeometric functions reading in Concrete Mathematics how they are used in combinatorics and in calculating sums in closed form. Then when I started working in statistics I found that they are everywhere.

Hypergeometric functions are very useful, but not often taught anymore. Like a lot of useful mathematics, they fall between two stools. They’re considered too advanced or arcane for the undergraduate curriculum, and not a hot enough area of research to be part of the graduate curriculum.


The 61st Fibonacci number is 2504730781961. The 62nd is 4052739537881. Since these end in 1 and 1, the 63rd Fibonacci number must end in 2, etc. and so the pattern starts over.

It’s not obvious that the cycle should have length 60, but it is fairly easy to see that there must be a cycle.

In any base, the last digits must cycle. The length of these cycles varies erratically:
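Here's a short function, not from the original post, to compute these cycle lengths, known as Pisano periods:

```python
def pisano(m):
    # Length of the cycle of Fibonacci numbers mod m (requires m >= 2).
    # The state (prev, cur) returns to (1, 1) exactly once per period.
    prev, cur = 1, 1
    k = 0
    while True:
        prev, cur = cur, (prev + cur) % m
        k += 1
        if (prev, cur) == (1, 1):
            return k

print([pisano(m) for m in range(2, 11)])  # [3, 8, 6, 20, 24, 16, 12, 24, 60]
```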

In this post I want to ask a different question: **how often do Fibonacci numbers take on each of the possible last digits in each base**? In other words, how are the Fibonacci numbers distributed mod *m* for various values of *m*?

I haven’t tried to prove any theorems. I’ve just run some examples using the following Python code.

```python
def histogram(base):
    prev = 1
    F = 2
    counts = [0]*base
    counts[F] += 1
    while F != 1 or prev != 1:
        temp = F
        F = (F + prev) % base
        counts[F] += 1
        prev = temp
    return counts
```

In base 10, the last digits of Fibonacci numbers have period 60. Do all digits appear in the cycle? Do they all appear equally often?

Yes and no. Here are the results for base 10:

```
>>> histogram(10)
[4, 8, 4, 8, 4, 8, 4, 8, 4, 8]
```

Each even digit appears 4 times and each odd digit appears 8 times.

What happens if we look at the last two digits, i.e. if we look at Fibonacci numbers mod 100?

```
>>> histogram(100)
[2, 6, 2, 2, …, 2, 6, 2, 2]
```

Each two-digit number *n* appears six times if *n* ≡ 1 mod 4. Otherwise it appears two times.

What about the last three digits? Now we see some zeros. For example, no Fibonacci number is congruent to 4 or 6 mod 1000. (Or mod 8.)

```
>>> histogram(1000)
[2, 3, 2, 1, 0, 3, 0, 1, …, 2, 3, 2, 1, 0, 3, 0, 1]
```

Here’s something curious. The Fibonacci numbers are exactly evenly distributed mod 5.

```
>>> histogram(5)
[4, 4, 4, 4, 4]
```

The same is apparently true for all powers of 5. Not only are all the frequencies the same, they’re all 4’s. I’ve tested this for powers of 5 up to 5^{10}. And conversely, it seems the Fibonacci numbers are not evenly distributed mod *m* unless *m* is a power of 5. I’ve tested this for *m* up to 100,000.
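Here's a quick check for the first few powers of 5, with the `histogram` function repeated so the snippet runs on its own:

```python
def histogram(base):
    # residue counts over one full period of Fibonacci numbers mod base
    prev, F = 1, 2
    counts = [0]*base
    counts[F] += 1
    while (prev, F) != (1, 1):
        prev, F = F, (prev + F) % base
        counts[F] += 1
    return counts

for k in [1, 2, 3]:
    print(5**k, set(histogram(5**k)))  # every residue count is 4
```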

**Conjecture**: The Fibonacci numbers are uniformly distributed mod *m* if and only if *m* is a power of 5.

**Update**: The conjecture is correct, and was proved in 1972 by Niederreiter.


Here are a couple plots to visualize Jentzsch’s theorem using the plotting scheme described in this post.

First, we take the function *f*(*z*) = 1/(1 + *z*²). This function has singularities at *i* and –*i*, and so the radius of convergence is 1. These singularities correspond to the two white dots in the plot below.

Now let’s look at the Taylor series for *f*.

*f*(*z*) = 1 – *z*^{2} + *z*^{4} – *z*^{6} + *z*^{8} – …

The series only converges inside the unit circle, but the partial sums are just polynomials and so are defined everywhere. Here we plot the partial sum up to *z*^{20}.

The black dots around the unit circle are zeros. The color quickly lightens as you move away from the unit circle because the function values grow quickly. Inside the unit circle, the two graphs match well, though of course outside they are very different.
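Incidentally, for this particular series the zeros of the partial sums lie exactly on the unit circle: summing the geometric series shows the partial sum up to *z*^{20} equals (1 + *z*^{22})/(1 + *z*²), so its zeros are the 22nd roots of –1 other than ±*i*. We can confirm this numerically:

```python
import numpy as np

# Coefficients of 1 - z^2 + z^4 - ... + z^20, highest degree first
coeffs = np.zeros(21)
coeffs[::2] = [(-1)**k for k in range(11)]

# moduli of the polynomial's roots
radii = np.abs(np.roots(coeffs))
print(radii.min(), radii.max())  # all radii very close to 1
```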

Here’s another example. This time we’ll look at the gamma function. The power series centered at 1 has radius 1 because the function has a pole at 0.

Here’s a plot of the gamma function. Note the white dot which is the singularity at 0.

And here’s a plot of the first 20 terms of the Taylor series centered at 1.

A family of polynomials *P*_{k} is orthogonal over the interval [-1, 1] with respect to a weight *w*(*x*) if

whenever *m* ≠ *n*.

If *w*(*x*) = 1, we get the Legendre polynomials.

If *w*(*x*) = (1 – *x*²)^{-1/2} we get the Chebyshev polynomials.

These are both special cases of the Jacobi polynomials which have weight *w*(*x*) = (1 – *x*)^{α} (1 + *x*)^{β}. Legendre polynomials correspond to α = β = 0, and Chebyshev polynomials correspond to α = β = -1/2.

The weight function for Jacobi polynomials is a rescaling of the density function of a beta distribution. The change of variables *x* = 1 – 2*u* shows

The right side is proportional to the expected value of *f*(1 – 2*X*) where *X* is a random variable with a beta(α + 1, β + 1) distribution. So for fixed α and β, if *m* ≠ *n* and *X* has a beta(α + 1, β + 1) distribution, then the expected value of *P*_{m}(1 – 2*X*) *P*_{n}(1 – 2*X*) is zero.
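We can verify this numerically with SciPy for a sample choice of parameters and degrees (α = 0.7, β = 1.3, *m* = 2, *n* = 5, all picked arbitrarily):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_jacobi
from scipy.stats import beta as beta_dist

a, b = 0.7, 1.3   # arbitrary Jacobi parameters alpha, beta
m, n = 2, 5       # two different degrees

def integrand(u):
    # E[P_m(1 - 2X) P_n(1 - 2X)] for X ~ beta(a + 1, b + 1)
    x = 1 - 2*u
    return (eval_jacobi(m, a, b, x) * eval_jacobi(n, a, b, x)
            * beta_dist.pdf(u, a + 1, b + 1))

jacobi_ev = quad(integrand, 0, 1)[0]
print(jacobi_ev)  # essentially zero
```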

While we’re at it, we’ll briefly mention two other connections between orthogonal polynomials and probability: Laguerre polynomials and Hermite polynomials.

The Laguerre polynomials are orthogonal over the interval [0, ∞) with weight *w*(*x*) = *x*^{α} exp(–*x*), which is proportional to the density of a gamma distribution with shape α + 1.

There are two minor variations on the Hermite polynomials, depending on whether you take the weight to be exp(-*x*²) or exp(-*x*²/2). These are sometimes known as the physicist’s Hermite polynomials and the probabilist’s Hermite polynomials. Naturally we’re interested in the latter. The probabilist’s Hermite polynomials are orthogonal over (-∞, ∞) with the standard normal (Gaussian) density as the weight.
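Here's a numerical check of that orthogonality, using SciPy's `eval_hermitenorm` for the probabilist's Hermite polynomials and degrees 2 and 6 (chosen so the product is even, making the result nontrivial):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_hermitenorm

# standard normal density, the weight for the probabilist's polynomials
phi = lambda x: np.exp(-x**2/2)/np.sqrt(2*np.pi)

# integral of He_2(x) He_6(x) phi(x) over the whole real line
hermite_ev = quad(lambda x: eval_hermitenorm(2, x)*eval_hermitenorm(6, x)*phi(x),
                  -np.inf, np.inf)[0]
print(hermite_ev)  # essentially zero
```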

First of all, the “Runge” here is Carl David Tolmé Runge, better known for the Runge-Kutta algorithm for numerically solving differential equations. His name rhymes with *cowabunga*, not with *sponge*.

Runge showed that polynomial interpolation at evenly-spaced points can fail spectacularly to converge. His example is the function *f*(*x*) = 1/(1 + *x*²) on the interval [-5, 5], or equivalently, and more convenient here, the function *f*(*x*) = 1/(1 + 25*x*²) on the interval [-1, 1]. Here’s an example with 16 interpolation nodes.

Runge found that in order for interpolation at evenly spaced nodes in [-1, 1] to converge, the function being interpolated needs to be analytic inside a football-shaped [1] region of the complex plane with major axis [-1, 1] on the real axis and minor axis approximately [-0.5255, 0.5255] on the imaginary axis. For more details, see [2].

The function in Runge’s example has a singularity at 0.2*i*, which is inside the football. Polynomial interpolation at evenly spaced points would converge for the function *f*(*x*) = 1/(1 + *x*²) since the singularity at *i* is outside the football.

For another example, consider the function *f*(*x*) = exp(–1/*x*²), defined to be 0 at 0. This function is infinitely differentiable but it is not analytic at the origin. With only 16 interpolation points as above, there’s a small indication of trouble at the ends.

With 28 interpolation points in the plot below, the lack of convergence is clear.

The problem is not polynomial interpolation *per se* but polynomial interpolation at evenly-spaced nodes. Interpolation at Chebyshev points converges for the examples here. The location of singularities affects the *rate* of convergence but not whether the interpolants converge.

**Related**: Help with interpolation

***

[1] American football, that is. The region is like an ellipse but pointy at -1 and 1.

[2] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

The optimal strategy for playing Twenty Questions is for each question to split the remaining possibilities in half. There are a couple ways to justify this strategy: minmax and average.

The **minmax** approach is to minimize the worst thing that could happen. The worst thing that could happen after asking a question is for your subject to be in the larger half of possibilities. For example, asking whether someone is left-handed is not a good strategy: the worst case scenario is “no,” in which case you’ve only narrowed your possibilities by 10%. The best minmax approach is to split the population in half with each question.

The best **average** approach is also to split the possibilities in half each time. With the handedness example, you learn more if the answer is “yes,” but there’s only a 10% chance that the answer is yes. There’s a 10% chance of gaining 3.322 bits of information, but a 90% chance of only gaining 0.152 bits. So the expected number of bits, the entropy, is 0.469 bits. **Entropy is maximized when all outcomes are equally likely**. That means you can’t learn more than 1 bit of information on average from a yes/no question, and you learn the most when both possibilities are equally likely.
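The numbers above come from a short computation:

```python
from math import log2

# Expected information (entropy) from a yes/no question whose
# answer is "yes" with probability p
def bits(p):
    return -(p*log2(p) + (1 - p)*log2(1 - p))

p = 0.1  # probability of being left-handed
print(log2(1/p))        # 3.322 bits learned from a "yes"
print(log2(1/(1 - p)))  # 0.152 bits learned from a "no"
print(bits(p))          # 0.469 bits on average
print(bits(0.5))        # 1 bit, the most a yes/no question can give
```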

Now suppose you want to ask about height and sex. As in the previous post, we assume men and women’s heights are normally distributed with means 70 and 64 inches respectively, and both have standard deviation 3 inches.

If you ask whether a person is taller than 67 inches, you split a mixed population of men and women in half. You will learn 1 bit of information from this question, but you’ve put yourself in a suboptimal position for the next question. A height of at least 67 inches selects half of the adult population in general, but it selects a majority of men and a minority of women. And as we discussed above, uneven splits are suboptimal, in the worst case and on average.

If you’re going to ask about height and sex, ask about sex first. If the person is female, ask next whether her height is above 64 inches. But if the person is male, ask whether his height is above 70 inches. That is, you want to split the population evenly at each step **conditioning on your previous answer**. A cutoff of 67 inches is optimal unconditionally, but suboptimal if you condition on sex.
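Here's the arithmetic behind these statements, using SciPy's normal distribution and the height assumptions above:

```python
from scipy.stats import norm

# Assumptions from the post: men's heights ~ N(70, 3),
# women's ~ N(64, 3), and the population is an even mix of the two.
men = norm(70, 3)
women = norm(64, 3)

p_men = men.sf(67)      # probability a man is taller than 67 inches
p_women = women.sf(67)  # probability a woman is taller than 67 inches
p_mixed = 0.5*p_men + 0.5*p_women

print(p_mixed)                   # 0.5: splits the mixed population in half
print(p_men, p_women)            # about 0.84 and 0.16: very uneven within each sex
print(men.sf(70), women.sf(64))  # 0.5 each: the even conditional splits
```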

The optimal strategy for Twenty Questions is to ask a question that has probability 1/2 of being true, *conditional on all previous data*. You might get lucky with uneven conditional splits, but on average, and in the worst case, you won’t do as well.