I recently ran across an essay [1] in which C. S. Lewis expressed similar concerns sixty years ago, referring to “the incubus of Research.” In the essay he describes a young academic in the humanities who

… far from being able or anxious … to add to the sum of human knowledge, wants to acquire a good deal more of the knowledge we already have. He has lately begun to discover how many things he needs to know in order to follow up his budding interests … To head him off from these studies, to pinfold him in some small inquiry whose chief claim is that no one has made it before is cruel and frustrating. It wastes such years as he will never have again …

My favorite part of the quote is describing research as “some small inquiry whose chief claim is that no one has made it before.” As Lewis said elsewhere, striving for originality can thwart originality.

[1] “Interim Report.” First published in *The Cambridge Review* in 1956 and reprinted as chapter 17 of Present Concerns.

Find an “entry point” into China of independent intrinsic interest to you, be it basketball, artificial intelligence, Chinese opera, whatever.

In a podcast interview—sorry, I no longer remember which one—Cowen talked more generally about finding entry points or onramps for learning big topics. The blog post mentioned above applies this specifically to China, but he gave other examples of coming to a subject through a side door rather than the front entrance. If I remember correctly, he mentioned learning the politics or economics of a region by first studying its architecture or food.

I’ve stumbled upon a number of intellectual onramps through my career, but I haven’t been as deliberate as Cowen in seeking them out. I had no interest in medicine before I ended up working for the world’s largest cancer center. I learned a bit about cancer and genetics from working at MD Anderson and I’ve since learned a little about other areas of medicine working with various clients. Right now I’m working on projects in nephrology and neurology.

Applied math is my onramp to lots of things I might not pursue otherwise. As John Tukey said, you get to play in everyone else’s backyard.

There are many things I’ve tried and failed to learn via a frontal assault. For example, I’ve tried several times to learn algebraic geometry by simply reading a book on the subject. But I find all the abstract machinery mind-numbing and difficult to absorb, just as Cowen described his first exposure to Chinese history. If I’m ever to learn much algebraic geometry, it will start with an indirect entry point, such as a concrete problem I need to solve.

Let *A*, *B*, and *C* be matrices that are compatible for multiplication. Then (*AB*)*C* = *A*(*BC*). That is, matrix multiplication is associative. But as far as efficiency is concerned, matrix multiplication is not associative: one side of the equation may be much faster to compute than the other.

For any matrix *M*, let rows(*M*) be the number of rows in *M* and let cols(*M*) be the number of columns. When we multiply two matrices *M* and *N*, we multiply every row of *M* by every column of *N* [1]. So the number of vector inner products is rows(*M*) cols(*N*), and the number of multiplications [2] in each inner product is cols(*M*). (We must have cols(*M*) = rows(*N*) because we implicitly assumed it’s possible to multiply *M* and *N*.)

That means the total number of scalar multiplications required to compute *MN* equals

rows(*M*) cols(*M*) cols(*N*) = rows(*M*) rows(*N*) cols(*N*).

Now let’s apply this to computing *ABC*. Suppose *A* is an *m* by *n* matrix and *C* is a *p* by *q* matrix. Then *B* has to be an *n* by *p* matrix in order to be compatible for multiplication.

Computing *AB* requires *mnp* multiplications. Once we have *AB*, an *m* by *p* matrix, the number of multiplications to compute (*AB*)*C* is *mpq*. So the total number of multiplications to compute *ABC* by first computing *AB* is *mp*(*n* + *q*).

If we compute *BC* first, this requires *npq* multiplications, and then multiplying *A* by *BC* requires *mnq* operations, for a total of *nq*(*m + p*).

In summary, computing (*AB*)*C* requires *mp*(*n* + *q*) multiplications, and computing *A*(*BC*) requires (*m + p*)*nq* multiplications. I turned the second term around to emphasize that in both expressions you do something to *m* and *p*, and something else to *n* and *q*. You either multiply and add, or add and multiply.

If you’re trying to minimize these terms, you’d rather add big numbers together than multiply them. So if *m* and *p* are large relative to *n* and *q*, i.e. both *A* and *C* are tall matrices, having more rows than columns, then multiplying *A*(*BC*) is going to be faster than multiplying (*AB*)*C*.

For example, suppose both *A* and *C* have a million rows and a hundred columns. Necessarily *B* would have a hundred rows and a million columns. Computing (*AB*)*C* would require 2×10^{14} multiplications, but computing *A*(*BC*) would take 2×10^{10} multiplications. That is, the latter would be 10,000 times faster.
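A small helper function makes these counts easy to check. (The function and its name are mine, not from the original; it just encodes the two formulas above.)

```python
def chain_costs(m, n, p, q):
    """Scalar multiplication counts for both ways of computing ABC,
    where A is m-by-n, B is n-by-p, and C is p-by-q."""
    left = m * p * (n + q)   # cost of (AB)C
    right = n * q * (m + p)  # cost of A(BC)
    return left, right

# A and C each have a million rows and a hundred columns.
left, right = chain_costs(10**6, 100, 10**6, 100)
print(left, right, left // right)  # 2e14 vs 2e10: a factor of 10,000
```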

**Related**: Applied linear algebra

***

[1] This post assumes we compute matrix products the usual way, taking the product of every row in the left matrix with every column in the right matrix. It’s theoretically possible, though not practical, to multiply matrices in fewer operations. If the matrices have some exploitable special structure then there could be a faster way to multiply them, but not in general.

[2] It is common in numerical linear algebra to just count multiplications because this gives a good idea of how efficient an algorithm is. Counting additions and multiplications would usually lead to the same conclusions. If you really want to be precise, you’d need to count memory operations. In practice memory management makes more of a difference than whether we count just multiplications or multiplications and additions.


The *k*th order term in Taylor’s theorem is a rank *k* tensor. You can think of rank 0 tensors as numbers, rank 1 tensors as vectors, and rank 2 tensors as matrices. Then we run out of familiar objects. A rank 3 tensor requires you to start thinking in terms of tensors rather than more elementary terms.

**There is a way to express Taylor’s theorem using only vectors and matrices**. Maybe not the most elegant approach, depending on one’s taste, but it avoids any handwaving talk of a tensor being a “higher dimensional box of numbers” and such.

There’s a small price to pay. You have to introduce two new but simple ideas: the vec operator and the Kronecker product.

The **vec operator** takes an *m* × *n* matrix *A* and returns an *mn* × 1 matrix *v*, i.e. a column vector, by stacking the columns of *A*. The first *m* elements of *v* are the first column of *A*, the next *m* elements of *v* are the second column of *A*, etc.

The **Kronecker product** of an *m* × *n* matrix *A* and a *p* × *q* matrix *B* is an *mp* × *nq* matrix *K* = *A* ⊗ *B*. You can think of *K* as a block partitioned matrix. The *ij* block of *K* is *a*_{ij} *B*. In other words, to form *K*, take each element of *A* and replace it with its product with the matrix *B*.

A couple examples will make this clear.
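In NumPy terms, for instance, vec is a Fortran-order reshape and the Kronecker product is `np.kron`. The sketch below (my example, not from the original post) checks the shapes and the standard identity vec(*AXB*) = (*B*ᵀ ⊗ *A*) vec(*X*), which is the workhorse behind this matrix-style calculus:

```python
import numpy as np

def vec(M):
    # Stack the columns of M into one long column vector.
    return M.reshape(-1, 1, order="F")

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# Kronecker product: each entry a_ij of A is replaced by the block a_ij * B.
K = np.kron(A, B)
assert K.shape == (2 * 4, 3 * 5)

# vec(A X B) = (B^T kron A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
```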

Now we write down Taylor’s theorem. Let *f* be a real-valued function of *n* variables. Then

where *f*^{(0)} = *f* and for *k* > 0,

The symbol ⊗ with a number on top means to take the Kronecker product of the argument with itself that many times.

Source: *Matrix Differential Calculus* by Magnus and Neudecker.

**Related post**: What is a tensor?

I will discuss, among other things, when common sense applies and when correct analysis can be counter-intuitive. There will be ample time at the end of the presentation for Q & A.

If you’re interested in attending, you can register here.

Thanks to Brian Borchers, who suggested the subject of this post in a comment on a previous post on transforms and convolutions.

The **moment generating function** (MGF) of a random variable *X* is defined as the expected value of exp(*tX*). By the so-called rule of the unconscious statistician we have

where *f*_{X} is the probability density function of the random variable *X*. The function *M*_{X} is called the moment generating function of *X* because its *n*th derivative, evaluated at 0, gives the *n*th moment of *X*, i.e. the expected value of *X*^{n}.

If we flip the sign on *t* in the integral above, we have the **two-sided Laplace transform** of *f*_{X}. That is, the moment generating function of *X* at *t* is the two-sided Laplace transform of *f*_{X} at –*t*. If the density function is zero for negative values, then the two-sided Laplace transform reduces to the more common (one-sided) Laplace transform.

Since the derivatives of *M*_{X} at zero are the moments of *X*, the power series for *M*_{X} is the **exponential generating function** for the moments. We have

where *m*_{n} is the *n*th moment of *X*.

This terminology needs a little explanation since we’re using “generating function” two or three different ways. The “moment generating function” is the function defined above and only appears in probability. In combinatorics, the (ordinary) generating function of a sequence is the power series whose coefficient of *x*^{n} is the *n*th term of the sequence. The exponential generating function is similar, except that each term is divided by *n*!. It’s called the exponential generating function because it looks like the power series for the exponential function. Indeed, the exponential function is the exponential generating function for the sequence of all 1’s.

The equation above shows that *M*_{X} is the exponential generating function for *m*_{n} and the ordinary generating function for *m*_{n}/*n*!.
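As a quick sanity check of the moment property (my example, not from the post), we can differentiate the MGF of a standard normal, *M*(*t*) = exp(*t*²/2), and recover its moments:

```python
from sympy import symbols, exp, diff

t = symbols('t')
M = exp(t**2 / 2)  # MGF of the standard normal distribution

# The nth derivative of the MGF at 0 gives the nth moment.
moments = [diff(M, t, n).subs(t, 0) for n in range(5)]
print(moments)  # [1, 0, 1, 0, 3]
```

The odd moments vanish by symmetry, and the fourth moment of a standard normal is indeed 3.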

If a random variable *Y* is defined on the integers, then the (ordinary) generating function for the sequence Prob(*Y* = *n*) is called, naturally enough, the **probability generating function** for *Y.*

The **z-transform** of a sequence, common in electrical engineering, is the (ordinary) generating function of the sequence, but with *x* replaced with 1/*z*.

The **characteristic function** of a random variable is a variation on the moment generating function. Rather than the expected value of exp(*tX*), it uses the expected value of exp(*itX*). This means the characteristic function of a random variable is the **Fourier transform** of its density function.

Characteristic functions are easier to work with than moment generating functions. We haven’t talked about when moment generating functions exist, but it’s clear from the integral above that the right tail of the density has to go to zero faster than *e*^{–x}, which isn’t the case for fat-tailed distributions. That’s not a problem for the characteristic function because the Fourier transform exists for any density function. This is another example of how complex variables simplify problems.

Given the complexity of the plot, the function definition is surprisingly simple: the Shannon wavelet is ψ(*t*) = 2 sinc(2*t*) − sinc(*t*), where sinc(*x*) = sin(π*x*)/(π*x*).

The Fourier transform is even simpler: it’s the indicator function of [-2π, -π] ∪ [π, 2π], i.e. the function that is 1 on the intervals [-2π, -π] and [π, 2π] but zero everywhere else.

The Shannon wavelet is orthogonal to integer translates of itself. This isn’t obvious in the time domain, but it’s easy to prove in the frequency domain using Parseval’s theorem.
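Here’s a sketch of that frequency-domain calculation (my code, not from the post). By Parseval’s theorem, the inner product of the wavelet with its translate by an integer *s* equals (1/2π) ∫ |Ψ(ω)|² *e*^{isω} dω, and since |Ψ(ω)|² is the indicator of π ≤ |ω| ≤ 2π, this reduces to (1/π) ∫ cos(*s*ω) dω over [π, 2π], which is zero for every nonzero integer *s*:

```python
from math import pi, cos
from scipy.integrate import quad

shift = 3  # integer translate of the wavelet

# (1/pi) times the integral of cos(shift * w) over [pi, 2*pi]
inner, err = quad(lambda w: cos(shift * w) / pi, pi, 2 * pi)
print(inner)  # effectively zero
```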

Here’s a plot of the original wavelet and the wavelet shifted to the left by 3:

And here’s a plot of the product of the two wavelets. It’s plausible that the areas above and below the *x*-axis cancel each other out, and indeed they do.

**Related post**: Sinc and Jinc integrals

where *f* and *g* are functions, *T* is an integral transform, and * is a kind of convolution. In words, the transform of a convolution is the product of transforms.

When the transformation changes, the notion of convolution changes.

Here are three examples.

With the Fourier transform defined as

convolution is defined as

Note: There are many minor variations on the definition of the Fourier transform. See these notes.

With the Laplace transform defined as

convolution is defined as

With the Mellin transform defined as

convolution is defined as

This post will look at the power series for the gamma function centered at 1 and will use a different plotting style.

Here’s what the contours look like with 12 terms:

The radius of convergence is 1 because the gamma function has a singularity at 0. (The radius of convergence is the distance from the center of the power series to the nearest singularity.) Contour lines correspond to constant phase. The gamma function is real for real arguments, and so you can see the real axis running through the middle of the plot because real numbers have zero phase. The contour lines meet at zeros, which you can see are near a circle of radius 1 centered at *z* = 1.

Here’s what the contour plot looks like with 30 terms:

And here’s what it looks like with just 5 terms:

Here’s the Mathematica code that was used to create the images.

```mathematica
P[x_] = Normal[Series[1.0 Gamma[x], {x, 1, 12}]]

ContourPlot[
    Arg[P[x + I y]], {x, -1, 3}, {y, -2, 2},
    ColorFunction -> "PearlColors", Contours -> 20
]
```

The 1.0 in front of the call to `Gamma` is important. It tells Mathematica to numerically evaluate the coefficients of the power series. Otherwise Mathematica will find the coefficients symbolically and the plots will take forever.

Similarly, the call to `Normal` tells Mathematica not to carry around a symbolic error term `O(x - 1)^13`.

At first the results we needed were in the literature but after a while we ran out of known results and had to learn something about special functions. This was a very unsettling experience for there were very few places to go to really learn about special functions. At least that is what we thought. Actually there were many, but the typical American graduate education which we had did not include anything about hypergeometric functions. And **hypergeometric functions are the key to this subject**, as I have found out after many years of fighting them.

Emphasis added.

Askey’s book was written in 1975, and he was describing his experience from ten years before that. Special functions, and in particular hypergeometric functions, went from being common knowledge among mathematicians at the beginning of the 20th century to being arcane by mid century.

I learned little about special functions and nothing about hypergeometric functions as a graduate student. I first ran into hypergeometric functions reading in Concrete Mathematics how they are used in combinatorics and in calculating sums in closed form. Then when I started working in statistics I found that they are everywhere.

Hypergeometric functions are very useful, but not often taught anymore. Like a lot of useful mathematics, they fall between two stools. They’re considered too advanced or arcane for the undergraduate curriculum, and not a hot enough area of research to be part of the graduate curriculum.

**Related posts**:

The 61st Fibonacci number is 2504730781961. The 62nd is 4052739537881. Since these end in 1 and 1, the 63rd Fibonacci number must end in 2, and so on: the pattern of last digits starts over.

It’s not obvious that the cycle should have length 60, but it is fairly easy to see that there must be a cycle.

In any base, the last digits must cycle. The length of these cycles varies erratically:

In this post I want to ask a different question: **how often do Fibonacci numbers take on each of the possible last digits in each base**? In other words, how are the Fibonacci numbers distributed mod *m* for various values of *m*?

I haven’t tried to prove any theorems. I’ve just run some examples using the following Python code.

```python
def histogram(base):
    # Count how often each residue appears in one full period
    # of the Fibonacci sequence mod base.
    prev = 1
    F = 2
    counts = [0] * base
    counts[F] += 1
    while F != 1 or prev != 1:
        temp = F
        F = (F + prev) % base
        counts[F] += 1
        prev = temp
    return counts
```

In base 10, the last digits of Fibonacci numbers have period 60. Do all digits appear in the cycle? Do they all appear equally often?

Yes and no. Here are the results for base 10:

```python
>>> histogram(10)
[4, 8, 4, 8, 4, 8, 4, 8, 4, 8]
```

Each even digit appears 4 times and each odd digit appears 8 times.

What happens if we look at the last two digits, i.e. if we look at Fibonacci numbers mod 100?

```python
>>> histogram(100)
[2, 6, 2, 2, …, 2, 6, 2, 2]
```

Each two-digit number *n* appears six times if *n* = 1 mod 4. Otherwise it appears two times.

What about the last three digits? Now we see some zeros. For example, no Fibonacci number is congruent to 4 or 6 mod 1000. (Or mod 8.)

```python
>>> histogram(1000)
[2, 3, 2, 1, 0, 3, 0, 1, …, 2, 3, 2, 1, 0, 3, 0, 1]
```

Here’s something curious. The Fibonacci numbers are exactly evenly distributed mod 5.

```python
>>> histogram(5)
[4, 4, 4, 4, 4]
```

The same is apparently true for all powers of 5. Not only are all the frequencies the same, they’re all 4’s. I’ve tested this for powers of 5 up to 5^{10}. And conversely, it seems the Fibonacci numbers are not evenly distributed mod *m* unless *m* is a power of 5. I’ve tested this for *m* up to 100,000.

**Conjecture**: The Fibonacci numbers are uniformly distributed mod *m* if and only if *m* is a power of 5.

**Update**: The conjecture is correct, and was proved in 1972 by Niederreiter.


Here are a couple plots to visualize Jentzsch’s theorem using the plotting scheme described in this post.

First, we take the function *f*(*z*) = 1/(1 + *z*²). This function has singularities at *i* and –*i*, and so the radius of convergence is 1. These singularities correspond to the two white dots in the plot below.

Now let’s look at the Taylor series for *f*.

*f*(*z*) = 1 – *z*^{2} + *z*^{4} – *z*^{6} + *z*^{8} – …

The series only converges inside the unit circle, but the partial sums are just polynomials and so are defined everywhere. Here we plot the partial sum up to *z*^{20}.

The black dots around the unit circle are zeros. The color quickly lightens as you move away from the unit circle because the function values grow quickly. Inside the unit circle, the two graphs match well, though of course outside they are very different.
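You can check numerically where those zeros fall (my sketch, not from the post). The degree-20 partial sum is a finite geometric series equal to (1 + *z*^{22})/(1 + *z*²), so its roots satisfy *z*^{22} = −1 (excluding ±*i*) and lie exactly on the unit circle:

```python
import numpy as np

# Partial sum 1 - z^2 + z^4 - ... + z^20 of the series for 1/(1 + z^2).
# np.roots wants coefficients listed from the highest degree down.
coeffs = [0] * 21
for k in range(11):
    coeffs[20 - 2 * k] = (-1) ** k  # coefficient of z^(2k)

roots = np.roots(coeffs)
print(np.abs(roots))  # every root has modulus 1, up to rounding
```

In general Jentzsch’s theorem only says zeros cluster near the circle of convergence; the geometric structure here puts them exactly on it.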

Here’s another example. This time we’ll look at the gamma function. The power series centered at 1 has radius 1 because the function has a pole at 0.

Here’s a plot of the gamma function. Note the white dot which is the singularity at 0.

And here’s a plot of the first 20 terms of the Taylor series centered at 1.

A family of polynomials *P*_{k} is orthogonal over the interval [-1, 1] with respect to a weight *w*(*x*) if

whenever *m* ≠ *n*.

If *w*(*x*) = 1, we get the Legendre polynomials.

If *w*(*x*) = (1 – *x*²)^{-1/2} we get the Chebyshev polynomials.

These are both special cases of the Jacobi polynomials, which have weight *w*(*x*) = (1 – *x*)^{α} (1 + *x*)^{β}. Legendre polynomials correspond to α = β = 0, and Chebyshev polynomials correspond to α = β = -1/2.

The weight function for Jacobi polynomials is a rescaling of the density function of a beta distribution. The change of variables *x* = 1 – 2*u* shows

The right side is proportional to the expected value of *f*(1 – 2*X*) where *X* is a random variable with a beta(α + 1, β+1) distribution. So for fixed α and β, if *m* ≠ *n* and *X* has a beta(α + 1, β+1) distribution, then the expected value of *P*_{m}(1 – 2*X*) *P*_{n}(1 – 2*X*) is zero.
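To make this concrete, here’s a numerical check (my example) of the Legendre case α = β = 0, where *X* has a beta(1, 1) distribution, i.e. *X* is uniform on [0, 1]: the expected value of *P*₂(1 − 2*X*) *P*₃(1 − 2*X*) is zero.

```python
from scipy.integrate import quad
from scipy.special import eval_legendre

# E[ P_2(1 - 2X) P_3(1 - 2X) ] with X ~ beta(1, 1), i.e. uniform on [0, 1]
expectation, err = quad(
    lambda u: eval_legendre(2, 1 - 2*u) * eval_legendre(3, 1 - 2*u), 0, 1)
print(expectation)  # effectively zero
```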

While we’re at it, we’ll briefly mention two other connections between orthogonal polynomials and probability: Laguerre polynomials and Hermite polynomials.

The Laguerre polynomials are orthogonal over the interval [0, ∞) with weight *w*(*x*) = *x*^{α} exp(-*x*), which is proportional to the density of a gamma distribution with shape α + 1.

There are two minor variations on the Hermite polynomials, depending on whether you take the weight to be exp(-*x*²) or exp(-*x*²/2). These are sometimes known as the physicist’s Hermite polynomials and the probabilist’s Hermite polynomials. Naturally we’re interested in the latter. The probabilist’s Hermite polynomials are orthogonal over (-∞, ∞) with the standard normal (Gaussian) density as the weight.

First of all, the “Runge” here is Carl David Tolmé Runge, better known for the Runge-Kutta algorithm for numerically solving differential equations. His name rhymes with *cowabunga*, not with *sponge*.

Runge showed that polynomial interpolation at evenly-spaced points can fail spectacularly to converge. His example is the function *f*(*x*) = 1/(1 + *x*²) on the interval [-5, 5], or equivalently, and more convenient here, the function *f*(*x*) = 1/(1 + 25*x*²) on the interval [-1, 1]. Here’s an example with 16 interpolation nodes.

Runge found that in order for interpolation at evenly spaced nodes in [-1, 1] to converge, the function being interpolated needs to be analytic inside a football-shaped [1] region of the complex plane with major axis [-1, 1] on the real axis and minor axis approximately [-0.5255, 0.5255] on the imaginary axis. For more details, see [2].

The function in Runge’s example has singularities at ±0.2*i*, which are inside the football. Polynomial interpolation at evenly spaced points would converge for the function *f*(*x*) = 1/(1 + *x*²) since the singularities at ±*i* are outside the football.
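Here’s a numerical sketch of the contrast, using SciPy’s barycentric interpolator as a stable way to evaluate the interpolating polynomial (my code, not the post’s). The maximum error for 1/(1 + 25*x*²) grows as you add equispaced nodes, while for 1/(1 + *x*²) it shrinks:

```python
import numpy as np
from scipy.interpolate import BarycentricInterpolator

def max_error(f, n):
    # Interpolate f at n evenly spaced nodes on [-1, 1]
    # and measure the worst error on a fine grid.
    nodes = np.linspace(-1, 1, n)
    p = BarycentricInterpolator(nodes, f(nodes))
    x = np.linspace(-1, 1, 2001)
    return np.max(np.abs(f(x) - p(x)))

runge = lambda x: 1 / (1 + 25 * x**2)  # singularities inside the football
tame  = lambda x: 1 / (1 + x**2)       # singularities outside the football

for n in [11, 21, 41]:
    print(n, max_error(runge, n), max_error(tame, n))
```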

For another example, consider the function *f*(*x*) = exp(-1/*x*²), defined to be 0 at 0. This function is infinitely differentiable but it is not analytic at the origin. With only 16 interpolation points as above, there’s a small indication of trouble at the ends.

With 28 interpolation points in the plot below, the lack of convergence is clear.

The problem is not polynomial interpolation *per se* but polynomial interpolation at evenly-spaced nodes. Interpolation at Chebyshev points converges for the examples here. The location of singularities affects the *rate* of convergence but not whether the interpolants converge.

**Related**: Help with interpolation

***

[1] American football, that is. The region is like an ellipse but pointy at -1 and 1.

[2] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

The optimal strategy for playing Twenty Questions is for each question to split the remaining possibilities in half. There are a couple ways to justify this strategy: minmax and average.

The **minmax** approach is to minimize the worst thing that could happen. The worst thing that could happen after asking a question is for your subject to be in the larger half of possibilities. For example, asking whether someone is left-handed is not a good strategy: the worst case scenario is “no,” in which case you’ve only narrowed your possibilities by 10%. The best minmax approach is to split the population in half with each question.

The best **average** approach is also to split the possibilities in half each time. With the handedness example, you learn more if the answer is “yes,” but there’s only a 10% chance that the answer is yes. There’s a 10% chance of gaining 3.322 bits of information, but a 90% chance of only gaining 0.152 bits. So the expected number of bits, the entropy, is 0.469 bits. **Entropy is maximized when all outcomes are equally likely**. That means you can’t learn more than 1 bit of information on average from a yes/no question, and you learn the most when both possibilities are equally likely.
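Those numbers are easy to reproduce (a quick sketch, not from the original post):

```python
from math import log2

def entropy(p):
    # Expected information, in bits, from a yes/no question
    # that is answered yes with probability p.
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(round(entropy(0.1), 3))  # 0.469 bits from the handedness question
print(entropy(0.5))            # 1.0, the maximum for a yes/no question
```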

Now suppose you want to ask about height and sex. As in the previous post, we assume men and women’s heights are normally distributed with means 70 and 64 inches respectively, and both have standard deviation 3 inches.

If you ask whether a person is taller than 67 inches, you split a mixed population of men and women in half. You will learn 1 bit of information from this question, but you’ve put yourself in a suboptimal position for the next question. A height of at least 67 inches selects half of the adult population in general, but it selects a majority of men and a minority of women. And as we discussed above, uneven splits are suboptimal, in the worst case and on average.

If you’re going to ask about height and sex, ask about sex first. If the person is female, ask next whether her height is above 64 inches. But if the person is male, ask whether his height is above 70 inches. That is, you want to split the population evenly at each step **conditioning on your previous answer**. A cutoff of 67 inches is optimal unconditionally, but suboptimal if you condition on sex.

The optimal strategy for Twenty Questions is to ask a question with probability of 1/2 being true, *conditional on all previous data*. You might get lucky with uneven conditional splits, but on average, and in the worst case, you won’t do as well.

When I saw a tweet from Tim Hopper a little while ago, my first thought was “How many bits of PII is that?”. [1]

π Things Only Left Handed Introverts Over 6′ 5″ with O+ Blood Type Will Appreciate

— Tim Hopper (@tdhopper) November 16, 2014

Let’s see. There’s some small correlation between these characteristics, but let’s say they’re independent. (For example, someone over 6′ 5″ is most likely male, and a larger percentage of males than females are left handed. But we’ll let that pass. This is just back-of-the-envelope reckoning.)

About 10% of the population is left-handed (11% for men, 9% for women) and so left-handedness carries -log_{2}(0.1) = 3.3 bits of information.

I don’t know how many people identify as introverts. I believe I’m a mezzovert, somewhere between introvert and extrovert, but I imagine when asked most people would pick “introvert” or “extrovert,” maybe half each. So we’ve got about one bit of information from knowing someone is an introvert.

The most common blood type in the US is O+ at 37% and so that carries 1.4 bits of information. (AB-, the most rare, corresponds to 7.4 bits of information. On average, blood type carries 2.2 bits of information in the US.)

What about height? Adult heights are approximately normally distributed, but not exactly. The normal approximation breaks down in the extremes, and we’re headed that way, but as noted above, this is just a quick and dirty calculation.

Heights in general don’t follow a normal distribution, but heights for men and women separately follow a normal. So for the general (adult) population, height follows a mixture distribution. Assume the average height for women is 64 inches, the average for men is 70 inches, and both have a standard deviation of 3 inches. Then the probability of a man being taller than 6′ 5″ would be about 0.001 and the probability of a woman being that tall would be essentially zero [2]. So the probability that a person is over 6′ 5″ would be about 0.0005, corresponding to about 11 bits of information.

All told, there are 16.7 bits of information in the tweet above, as much information as you’d get after 16 or 17 questions of the game Twenty Questions, assuming all your questions are independent and have probability 1/2 of being answered affirmatively.

***

[1] PII = Personally Identifiable Information

[2] There are certainly women at least 6′ 5″. I can think of at least one woman I know who may be that tall. So the probability shouldn’t be less than 1 in 7 billion. But the normal approximation gives a probability of 8.8 × 10^{-15}. This is an example of where the normal distribution assumption breaks down in the extremes.

for *x* ≥ 1 where *a* > 0 is a shape parameter. The Pareto distribution and the Pareto principle (i.e. “80-20” rule) are named after the same person, the Italian economist Vilfredo Pareto.

Samples from a Pareto distribution obey Benford’s law in the limit as the parameter *a* goes to zero. That is, the smaller the parameter *a*, the more closely the distribution of the first digits of the samples come to following the distribution known as Benford’s law.

Here’s an illustration of this comparing the distribution of 1,000 random samples from a Pareto distribution with shape *a* = 1 and shape *a* = 0.2 with the counts expected under Benford’s law.

Note that this has nothing to do with base 10 per se. If we look at the leading digits as expressed in any other base, such as base 16 below, we see the same pattern.
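Here’s a quick simulation of this (my code, not the original post’s), showing that the first-digit frequencies move toward Benford’s law as the shape parameter shrinks:

```python
import numpy as np

def first_digit_freqs(a, n=100_000, seed=42):
    # Sample a Pareto distribution with shape a via the inverse CDF: x = u**(-1/a)
    rng = np.random.default_rng(seed)
    x = rng.random(n) ** (-1 / a)
    # The leading digit of x is determined by the fractional part of log10(x)
    digits = (10 ** (np.log10(x) % 1)).astype(int)
    return np.array([(digits == d).mean() for d in range(1, 10)])

# Benford's law: P(first digit = d) = log10(1 + 1/d)
benford = np.log10(1 + 1 / np.arange(1, 10))

for a in [1.0, 0.2]:
    dev = np.max(np.abs(first_digit_freqs(a) - benford))
    print(a, dev)  # the deviation shrinks as a decreases
```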

**More posts on Benford’s law**

- Weibull distribution and Benford’s law
- Benford’s law, chi-square, and factorials
- Benford’s law and SciPy constants

More posts on Pareto

Here are some posts on testing a uniform RNG.

Here’s a book chapter I wrote on testing the transformation of a uniform RNG into some other distribution.

A few posts on manipulating a random number generator.

- Manipulating a random number generator
- Reverse engineering the seed of an LCG
- Predicting when an RNG will output a given value

And finally, a post on a cryptographically secure random number generator.

For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.

I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.

Here’s a little Python code for computing this. Enter the *a* and *b* parameters of the prior and the posterior to compute how much information was gained.

```python
from scipy.integrate import quad
from scipy.stats import beta
from numpy import log2

def infogain(post_a, post_b, prior_a, prior_b):
    # Kullback-Leibler divergence, in bits, from the beta prior
    # to the beta posterior
    p = beta(post_a, post_b).pdf
    q = beta(prior_a, prior_b).pdf
    info, error = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
    return info
```

This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine `quad` needs some help when the beta distributions become more concentrated.

You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3, and so you’re expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.

```python
print( infogain(4, 7, 3, 7) )
print( infogain(3, 8, 3, 7) )
```

The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.

- Data scientists can only query aggregate statistics, such as counts and averages.
- These aggregate statistics must be based on results that return at least 1,000 database rows.

This sounds good, but it’s naive. It’s not enough to protect customer privacy.

Someone wants to know how much money customer 123456789 makes. If he asks for this person’s income, the query would be blocked by the first rule. If he asks for the *average* income of customers with ID 123456789 then he gets past the first rule but not the second.

He decides to test whether the customer in question makes a six-figure income. So first he queries for the number of customers with income over $100,000. This is a count, so it gets past the first rule. The result turns out to be 14,254, so it gets past the second rule as well. Now he asks how many customers with ID *not equal to* 123456789 have income over $100,000. This is a valid query as well, and returns 14,253. So by executing only queries that return aggregate statistics on thousands of rows, he found out that customer 123456789 has at least a six-figure income.

Now he goes back and asks for the average income of customers with income over $100,000. Then he asks for the average income of customers with income over $100,000 and with ID not equal to 123456789. With a little algebra, he’s able to find customer 123456789’s exact income.
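The attack is easy to spell out in code. This toy example (the data and numbers are invented, not the ones in the scenario above) uses only counts and averages over thousands of rows, yet recovers one customer’s exact income:

```python
# Toy database: 20,000 customers plus the target, ID 123456789.
incomes = {i: 50_000 + (i % 100) * 1_000 for i in range(20_000)}
incomes[123456789] = 250_000  # the target's (secret) income

def count_over(db, threshold, exclude=None):
    return sum(1 for i, v in db.items() if v > threshold and i != exclude)

def avg_over(db, threshold, exclude=None):
    vals = [v for i, v in db.items() if v > threshold and i != exclude]
    return sum(vals) / len(vals)

# All four queries are aggregates over thousands of rows.
n_all  = count_over(incomes, 100_000)
n_rest = count_over(incomes, 100_000, exclude=123456789)
a_all  = avg_over(incomes, 100_000)
a_rest = avg_over(incomes, 100_000, exclude=123456789)

# A little algebra on the two averages recovers the target's income.
recovered = n_all * a_all - n_rest * a_rest
print(recovered)  # the target's exact income
```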

You might object that it’s cheating to have a clause such as “ID not equal to 123456789” in a query. Of course it’s cheating. It clearly violates the spirit of the law, but not the letter. You might try to patch the rules by saying you cannot ask questions about a small set, nor about the complement of a small set. [1]

That doesn’t work either. Someone could run queries on customers with ID less than or equal to 123456789 and on customers with ID greater than or equal to 123456789. Both these sets and their complements may be large but they let you find out information on an individual.

You may be asking why we would let a data scientist have access to customer IDs at all. Obviously you wouldn’t do that if you wanted to protect customer privacy. The point of this example is that limiting queries to aggregates of large sets is not enough. You can find out information on small sets from information on large sets. This could still be a problem even with obvious identifiers removed.

***

[1] Readers familiar with measure theory might sense a σ-algebra lurking in the background.

You can start by expressing the function output *f*(*z*) in polar form

*f*(*z*) = *r**e*^{iθ}

Then you could plot the magnitude *r* as height and write the phase angle θ on the graph. Here’s a famous example of that approach from the cover of Abramowitz and Stegun.

You can find more on that book and the function it plots here.

This approach makes it easy to visualize the magnitude, but the phase is textual rather than visual. A more visual approach is to use *color* to represent the phase, such as hue in HSV scale.
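Here’s a minimal sketch of that idea in Python. The function being plotted (*z*⁵ − 1) and all the plotting details are my own choices, not taken from any particular book:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # render off-screen
import matplotlib.pyplot as plt

def phase_plot(f, lim=2, n=400):
    x = np.linspace(-lim, lim, n)
    z = x[None, :] + 1j*x[:, None]   # grid of complex numbers
    # Map the phase from [-pi, pi] to [0, 1]; the cyclic 'hsv'
    # colormap then turns it into a hue.
    hue = (np.angle(f(z)) + np.pi) / (2*np.pi)
    plt.imshow(hue, cmap="hsv", origin="lower",
               extent=(-lim, lim, -lim, lim))
    plt.xlabel("Re z")
    plt.ylabel("Im z")
    return hue

hue = phase_plot(lambda z: z**5 - 1)
```

The five zeros of *z*⁵ − 1 show up as five points where all the hues meet.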

You can combine color with height, but sometimes you’ll just use color alone. That is the approach taken in the book Visual Complex Functions. This is more useful than it may seem at first because the magnitude and phase are tightly coupled for an analytic function. The phase alone uniquely determines the function, up to a constant multiple. The image I posted the other day, the one I thought looked like Paul Klee meets Perry the Platypus, is an example of a phase-only plot.

If I’d added more contour lines the plot would be more informative, but it would look less like a Paul Klee painting and less like a cartoon platypus, so I stuck with the defaults. Mathematica has dozens of color schemes for phase plots. I stumbled on “StarryNightColors” and liked it. I imagine I wouldn’t have seen the connection to Perry the Platypus in a different color scheme.

You can visualize magnitude as well as phase if you add another dimension to color. That’s what Daniel Velleman did in a paper that I read recently [1]. He uses hue to represent angle and brightness to represent magnitude. I liked his approach partly on aesthetic grounds. Phase plots using hue only tend to look psychedelic and garish. The varying brightness makes the plots more attractive. I’ll give some examples then include Velleman’s code.

First, here’s a plot of Jacobi’s sn function [2], an elliptic function analogous to sine. I picked it because it has a zero and a pole. Zeros show up as black and poles as white. (The function is elliptic, so it’s periodic horizontally and vertically. Functions like sine are periodic horizontally but not vertically.)

You can see the poles on the left and right and the zero in the middle. Note that the hues swirl in opposite directions around zeros and poles: ROYGBIV counterclockwise around a zero and clockwise around a pole.

Next, here’s a plot of the 5th Chebyshev polynomial. I chose this because I’ve been thinking about Chebyshev polynomials lately, and because polynomials make plots that fade to white in every direction. (That is, |*p*(*z*)| → ∞ as |*z*| → ∞ for all polynomials.)

Finally, here’s a plot of the gamma function. I included this example because you can see the poles as little white dots on the left, and the plot has a nice dark-to-light overall pattern.

Here’s the Mathematica code from Velleman’s paper. Note that in the HSV scale, he uses brightness to change both the saturation (S) and value (V). Unfortunately the function *f* being plotted is a global rather than being passed in as an argument to the plotting function.

```mathematica
brightness[x_] := If[x <= 1, Sqrt[x]/2, 1 - 8/((x + 3)^2)]

ComplexColor[z_] := If[z == 0, Hue[0, 1, 0],
  Hue[Arg[z]/(2 Pi), (1 - (b = brightness[Abs[z]])^4)^0.25, b]]

ComplexPlot[xmin_, ymin_, xmax_, ymax_, xsteps_, ysteps_] := Block[
  {deltax = N[(xmax - xmin)/xsteps], deltay = N[(ymax - ymin)/ysteps]},
  Graphics[
    Raster[
      Table[
        f[x + I y],
        {y, ymin + deltay/2, ymax, deltay},
        {x, xmin + deltax/2, xmax, deltax}],
      {{xmin, ymin}, {xmax, ymax}},
      ColorFunction -> ComplexColor
    ]
  ]
]
```
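For readers who would rather experiment in Python, here’s my rough port of the same color scheme. This translation is my own, not code from the paper:

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def brightness(x):
    # Velleman's brightness ramp: dark near zero, light for large |z|.
    return np.where(x <= 1, np.sqrt(x)/2, 1 - 8/((x + 3)**2))

def complex_color(z):
    # Hue comes from the phase, brightness from the magnitude.
    b = brightness(np.abs(z))
    h = (np.angle(z) / (2*np.pi)) % 1.0
    s = (1 - b**4)**0.25
    return hsv_to_rgb(np.dstack([h, s, b]))

# Example: color the values of f(z) = z^3 - 1 on a grid.
x = np.linspace(-2, 2, 300)
z = x[None, :] + 1j*x[:, None]
rgb = complex_color(z**3 - 1)   # an RGB image, e.g. for plt.imshow(rgb)
```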

[1] Daniel J. Velleman. The Fundamental Theorem of Algebra: A Visual Approach. Mathematical Intelligencer, Volume 37, Number 4, 2015.

[2] I used f[z] = 10 JacobiSN[I z, 1.5]. I multiplied the argument by *i* because I wanted to rotate the picture 90 degrees. And I multiplied the output by 10 to get a less saturated image.

The Kullback-Leibler divergence between two random variables *X* and *Y* with densities *f*_{X} and *f*_{Y} is defined as

KL(*X* || *Y*) = ∫ *f*_{X}(*x*) log( *f*_{X}(*x*) / *f*_{Y}(*x*) ) *dx*

This is pronounced/interpreted several ways:

- The divergence from *Y* to *X*
- The relative entropy of *X* with respect to *Y*
- How well *Y* approximates *X*
- The information gain going from the prior *Y* to the posterior *X*
- The average surprise in seeing *Y* when you expected *X*

A theorem of Gibbs proves that K-L divergence is non-negative. It’s clearly zero if *X* and *Y* have the same distribution.

The K-L divergence of two random variables is an expected value, and so it matters which distribution you’re taking the expectation with respect to. That’s why it’s asymmetric.

As an example, consider the probability densities below, one exponential and one gamma with a shape parameter of 2.

The two densities differ mostly on the left end. The exponential distribution believes this region is likely while the gamma does not. This means that an expectation with respect to the exponential distribution will weigh things in this region more heavily. In an information-theoretic sense, an exponential is a better approximation to a gamma than the other way around.

Here’s some Python code to compute the divergences.

```python
from scipy.integrate import quad
from scipy.stats import expon, gamma
from numpy import inf

def KL(X, Y):
    f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
    return quad(f, 0, inf)

e = expon
g = gamma(a = 2)

print( KL(e, g) )
print( KL(g, e) )
```

This returns

```
(0.5772156649008394, 1.3799968612282498e-08)
(0.4227843350984687, 2.7366807708872898e-09)
```

The first element of each pair is the integral and the second is the error estimate. So apparently both integrals have been computed accurately, and the first is clearly larger. This backs up our expectation that it’s more surprising to see a gamma when expecting an exponential than vice versa.

Although K-L divergence is asymmetric in general, it can be symmetric. For example, suppose *X* and *Y* are normal random variables with the same variance but different means. Then it would be equally surprising to see either one when expecting the other. You can verify this in the code above by changing the `KL` function to integrate over the whole real line

```python
def KL(X, Y):
    f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
    return quad(f, -inf, inf)
```

and trying an example.

```python
from scipy.stats import norm

n1 = norm(1, 1)
n2 = norm(2, 1)

print( KL(n1, n2) )
print( KL(n2, n1) )
```

This returns

```
(0.4999999999999981, 1.2012834963423225e-08)
(0.5000000000000001, 8.106890774205374e-09)
```

and so both integrals are equal to within the error in the numerical integration.
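As a sanity check — this check is mine, not part of the original argument — the divergence between two normals with common variance σ² has the closed form (μ₁ − μ₂)²/(2σ²), which is 1/2 for the means 1 and 2 used above:

```python
from numpy import inf
from scipy.integrate import quad
from scipy.stats import norm

def KL(X, Y):
    f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
    return quad(f, -inf, inf)

mu1, mu2, sigma = 1, 2, 1
closed_form = (mu1 - mu2)**2 / (2*sigma**2)   # = 0.5
numeric, err = KL(norm(mu1, sigma), norm(mu2, sigma))
assert abs(numeric - closed_form) < 1e-6
```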


If the function you’re interpolating is smooth, then interpolating at more points may or may not improve the fit of the interpolation, depending on where you put the points. The famous example of Runge shows that interpolating

*f*(*x*) = 1 / (1 + *x*²)

at more points can make the fit worse. When interpolating at 16 evenly spaced points, the behavior is wild at the ends of the interval.

Here’s the Python code that produced the plot.

```python
import matplotlib.pyplot as plt
from scipy import interpolate
from numpy import linspace

def cauchy(x):
    return (1 + x**2)**-1

n = 16
x = linspace(-5, 5, n)       # points to interpolate at
y = cauchy(x)
f = interpolate.BarycentricInterpolator(x, y)

xnew = linspace(-5, 5, 200)  # points to plot at
ynew = f(xnew)

plt.plot(x, y, 'o', xnew, ynew, '-')
plt.show()
```

However, for smooth functions interpolating at more points *does* improve the fit if you interpolate at the roots of Chebyshev polynomials. As you interpolate at the roots of higher and higher degree Chebyshev polynomials, the interpolants converge to the function being interpolated. The plot below shows how interpolating at the roots of *T*_{16}, the 16th Chebyshev polynomial, eliminates the bad behavior at the ends.

To make this plot, we replaced `x` above with the roots of *T*_{16}, rescaled from the interval [-1, 1] to the interval [-5, 5] to match the example above.

```python
from numpy import cos, pi, array

x = [cos(pi*(2*k-1)/(2*n)) for k in range(1, n+1)]
x = 5*array(x)
```

What if the function we’re interpolating isn’t smooth? If the function has a step discontinuity, we can see the Gibbs phenomenon, similar to what we saw in the previous post. Here’s the result of interpolating the indicator function of the interval [-1, 1] at 100 Chebyshev points. We get the same “bat ears” as before.
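Here’s a sketch of how that last plot can be reproduced. The grid sizes are my choices:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from scipy import interpolate

n = 100
k = np.arange(1, n + 1)
x = 5*np.cos(np.pi*(2*k - 1)/(2*n))   # Chebyshev points scaled to [-5, 5]
y = (np.abs(x) <= 1).astype(float)    # indicator function of [-1, 1]

f = interpolate.BarycentricInterpolator(x, y)
t = np.linspace(-5, 5, 1000)
ft = f(t)

plt.plot(t, ft)
```

The interpolant overshoots 1 and undershoots 0 near the jumps at ±1, which is where the “bat ears” come from.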

**Related**: Help with interpolation

Bessel functions come up naturally when working in polar coordinates, just as sines and cosines come up naturally when working in rectangular coordinates. You can think of Bessel functions as a sort of variation on sine waves. Or even more accurately, a variation on sinc functions, where sinc(*z*) = sin(*z*)/*z*. [1]

A Fourier series represents a function as a sum of sines and cosines of different frequencies. To make things a little simpler here, I’ll only consider Fourier sine series so I don’t have to repeatedly say “and cosine.”

A Fourier-Bessel series does something similar. It represents a function as a sum of rescaled versions of a particular Bessel function. We’ll use the Bessel function *J*_{0} here, but you could pick some other *J*_{ν}.

Fourier series scale the sine and cosine functions by π times integers, i.e. sin(π*z*), sin(2π*z*), sin(3π*z*), etc. Fourier-Bessel series scale by the zeros of the Bessel function: *J*_{0}(λ_{1}*z*), *J*_{0}(λ_{2}*z*), *J*_{0}(λ_{3}*z*), etc. where λ_{n} are the zeros of *J*_{0}. This is analogous to scaling sin(π*z*) by its roots: π, 2π, 3π, etc. So a Fourier-Bessel series for a function *f* looks like

*f*(*z*) = Σ_{n} *c*_{n} *J*_{0}(λ_{n}*z*)

The coefficients *c*_{n} for Fourier-Bessel series can be computed analogously to Fourier coefficients, but with a couple minor complications. First, the basis functions of a Fourier series are orthogonal over [0, 1] without any explicit weight, i.e. with weight 1. And second, the inner product of a basis function with itself doesn’t depend on the frequency. In detail,

∫_{0}^{1} sin(*m*π*z*) sin(*n*π*z*) *dz* = δ_{mn} / 2

Here δ_{mn} equals 1 if *m* = *n* and 0 otherwise.

Fourier-Bessel basis functions are orthogonal with a weight *z*, and the inner product of a basis function with itself depends on the frequency. In detail,

∫_{0}^{1} *J*_{0}(λ_{m}*z*) *J*_{0}(λ_{n}*z*) *z* *dz* = δ_{mn} *J*_{1}(λ_{n})² / 2
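This weighted orthogonality is easy to check numerically; here’s a quick sanity check of my own (the tolerances are arbitrary):

```python
from scipy.integrate import quad
from scipy.special import j0, j1, jn_zeros

lam = jn_zeros(0, 3)   # first three zeros of J0

def inner(m, n):
    # weighted inner product: integral of J0(lam_m z) J0(lam_n z) z dz over [0, 1]
    return quad(lambda z: j0(lam[m]*z)*j0(lam[n]*z)*z, 0, 1)[0]

assert abs(inner(0, 1)) < 1e-6                     # distinct zeros: orthogonal
assert abs(inner(1, 1) - j1(lam[1])**2/2) < 1e-6   # norm squared depends on lam_n
```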

So whereas the coefficients *b*_{n} for a Fourier sine series are given by

*b*_{n} = 2 ∫_{0}^{1} *f*(*z*) sin(*n*π*z*) *dz*

the coefficients for a Fourier-Bessel series are given by

*c*_{n} = (2 / *J*_{1}(λ_{n})²) ∫_{0}^{1} *f*(*z*) *J*_{0}(λ_{n}*z*) *z* *dz*

Fourier and Fourier-Bessel series are examples of orthogonal series, and so by construction they converge in the norm given by their associated inner product. That means that if *S*_{N} is the *N*th partial sum of a Fourier series, then

∫_{0}^{1} (*f*(*z*) − *S*_{N}(*z*))² *dz* → 0 as *N* → ∞,

and the analogous statement for a Fourier-Bessel series is

∫_{0}^{1} (*f*(*z*) − *S*_{N}(*z*))² *z* *dz* → 0 as *N* → ∞.

In short, the series converge in a (weighted) *L*² norm. But how do the series converge pointwise? A lot of harmonic analysis is devoted to answering this question, what conditions on the function *f* guarantee what kind of behavior of the partial sums of the series expansion.

If we look at the Fourier series for a step function, the partial sums converge pointwise everywhere except at the step discontinuity. But the way they converge is interesting. You get a sort of “bat ear” phenomenon where the partial sums overshoot the step function near the discontinuity. This is called the Gibbs phenomenon after Josiah Willard Gibbs, who observed the effect in 1899. (Henry Wilbraham observed the same thing earlier, but Gibbs didn’t know that.)

The Gibbs phenomenon is well known for Fourier series. It’s not as well known that the same phenomenon occurs for other orthogonal series, such as Fourier-Bessel series. I’ll give an example of the Gibbs phenomenon for Fourier-Bessel series taken from [2] and give Python code to visualize it.

We take our function *f*(*z*) to be 1 on [0, 1/2] and 0 on (1/2, 1]. It works out that

*c*_{n} = *J*_{1}(λ_{n}/2) / (λ_{n} *J*_{1}(λ_{n})²)

Here’s the plot with 100 terms. Notice how the partial sums overshoot the mark to the left of 1/2 and undershoot to the right of 1/2.

Here’s the same plot with 1,000 terms.

Here’s the Python code that produced the plot.

```python
import matplotlib.pyplot as plt
from scipy.special import j0, j1, jn_zeros
from numpy import linspace

N = 100  # number of terms in series
roots = jn_zeros(0, N)
coeff = [j1(r/2) / (r*j1(r)**2) for r in roots]
z = linspace(0, 1, 200)

def partial_sum(z):
    return sum([coeff[i]*j0(roots[i]*z) for i in range(N)])

plt.plot(z, partial_sum(z))
plt.xlabel("z")
plt.ylabel("{}th partial sum".format(N))
plt.show()
```

[1] To be precise, as *z* goes to infinity,

*J*_{ν}(*z*) ~ √(2/(π*z*)) cos(*z* − νπ/2 − π/4)

and so the Bessel functions are asymptotically proportional to sin(*z* – φ)/√*z* for some phase shift φ.

[2] Temple H. Fay and P. Kendrik Kloppers. The Gibbs’ phenomenon for Fourier-Bessel series. International Journal of Mathematical Education in Science and Technology, vol. 34, no. 2, 2003, pp. 199–217.

Why that date? I wanted to start with something with a fairly small period, and that one looked interesting. I’ll have to do something different for the images that have a much longer period.

Image made in collaboration with Go 3D Now.


Companies often use an old version of their production database for testing. But what if the production database has **sensitive information** that software developers and testers should not have access to?

You can’t completely remove customer phone numbers from the database, for example, if your software handles customer phone numbers. You have to replace the sensitive data with modified data. The question becomes how to modify it. Three approaches would be

- Use the original data.
- Generate completely new artificial data.
- Use the real data as a guide to generating new data.

We’ll assume the first option is off the table and consider the pros and cons of the other two options.

For example, suppose you collect customer ages. You could replace customer age with a random two-digit number. That’s fine as far as making sure that forms can display two-digit numbers. But maybe the age values matter. Maybe you want your fictional customers in the test database to have the same age distribution as your real customers. Or maybe you want your fictional customer ages to be correlated with other attributes so that you don’t have 11 year-old retirees or 98 year-old clients who can’t legally purchase alcohol.

There are pros and cons to having a realistic test database. A database filled with randomly generated data is likely to find **more bugs**, but a realistic database is likely to find **more important bugs**.

Randomly generated data may contain combinations that have yet to occur in the production data, combinations that will cause an error when they do come up in production. Maybe you’ve never sold your product to someone in Louisiana, and there’s a latent bug that will show up the first time someone from Louisiana places an order. (For example, Louisiana retains vestiges of French law that make it different from all other states.)

On the other hand, randomly generated data may not find the bugs that affect the most customers. You might want the values in your test database to be distributed similarly to the values in real data so that bugs come up in testing with roughly the same frequency as in production. In that case, you probably want the *joint* distributions to match and not just the *unconditional* distributions. If you just match the latter, you could run into oddities such as a large number of teenage retirees as mentioned above.
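Here’s a sketch of one simple way to keep joint structure: resample whole rows rather than sampling each column independently. The toy data and all the details are invented for illustration:

```python
import random

# Fabricated "real" data: (employment status, age) pairs.
real_rows = [("retired", 67), ("retired", 71), ("student", 19),
             ("employed", 34), ("employed", 45), ("employed", 52)]

def synthetic_rows(rows, k, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(k):
        # Resample whole rows so the *joint* distribution carries over --
        # no teenage retirees -- then perturb the age a little. Real
        # deployments would need stronger protection than this jitter.
        status, age = rng.choice(rows)
        out.append((status, age + rng.randint(-3, 3)))
    return out

fake = synthetic_rows(real_rows, 100)
```

Sampling the two columns independently could pair “retired” with age 19; sampling rows keeps the attributes correlated.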

So do you want a random test database or a realistic test database? Maybe both. It depends on your purposes and priorities. You might want to start by testing against a realistic database so that you first find the bugs likely to affect the largest number of customers. Then you might switch to a randomized database that is more effective at flushing out problems with edge cases.

So how would you go about creating a realistic test database that protects customer privacy? The answer depends on several factors. First of all, it depends on what aspects of the real data you want to preserve. Maybe verisimilitude is more important for some fields than others. Once you decide what aspects you want your test database to approximate, how well do you need to approximate them? If you want to do valid statistical analysis on the test database, you may need something sophisticated like **differential privacy**. But if you just want moderately realistic test cases, you can do something much simpler.

Finally, you have to address the **privacy-utility trade-off**. What kinds of privacy protection are you ethically and legally obligated to provide? For example, is your data considered PHI under HIPAA? Once your privacy obligations are clear, you look for ways to maximize utility subject to those privacy constraints.

If you’d like help with this process, let’s talk. I can help you determine what your obligations are and how best to meet them while meeting your business objectives.

where *m*, *d*, and *y* are the current month, day, and two-digit year. The four most recent images show how different these plots can be.

These images are from 10/30/17, 10/31/17, 11/1/17, and 11/2/17.

Consecutive dates often produce very different images for a couple reasons. First, consecutive integers are relatively prime. From a number theoretic perspective, 30 and 31 are very different, for example. (This touches on the motivation for *p*-adic numbers: put a different metric on integers, one based on their prime factorization.)

The other reason consecutive dates produce qualitatively different images is that you might roll over a month or year, such as going from October (10) to November (11). You’ll see a big change when we roll over from 2017 to 2018.

The partial sums are periodic [1] with period lcm(*m*, *d*, *y*). The image for 10/31/17 above has the most points because 10, 31, and 17 are relatively prime. It’s also true that 11, 2, and 17 are relatively prime, but these are smaller numbers.

You could think of the month, day, and year components of the sum as three different gears. The sums repeat when all three gears return to their initial positions. In the image for yesterday, 11/1/17, the middle gear is effectively not there.
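Here’s a sketch of how such a plot can be generated. The summand is my reconstruction from the description above — terms exp(2πi(*n*/*m* + *n*²/*d* + *n*³/*y*)), which are indeed periodic with period lcm(*m*, *d*, *y*) — so treat it as an assumption rather than the site’s exact formula:

```python
import numpy as np
from math import lcm, pi

def expsum_points(m, d, y):
    N = lcm(m, d, y)           # the period of the terms
    n = np.arange(N + 1)
    terms = np.exp(2j*pi*(n/m + n**2/d + n**3/y))
    return np.cumsum(terms)    # partial sums, points in the complex plane

s = expsum_points(10, 31, 17)
# e.g. plt.plot(s.real, s.imag) to draw the image
```

For 10/31/17 the period is 10 · 31 · 17 = 5270, which is why that image has so many points.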

[1] The *terms* of the sums are periodic with period lcm(*m*, *d*, *y*); the partial sums are periodic if the full sum over one period is zero. So *if* the partial sums are periodic, the lcm is a period.

Trefethen gives several famous Yogi Berra quotes and concludes that

Yogiisms are statements that, if taken literally, are meaningless or contradictory or nonsensical or tautological—yet nevertheless convey something true.

An inverse yogiism is the opposite,

[a] statement that is literally true, yet conveys something false.

What a great way to frame a chapter! Now that I’ve heard the phrase, I’m trying to think of inverse yogiisms. Nothing particular has come to mind yet, but I feel like there must be lots of things that fit that description. Trefethen comes up with three inverse yogiisms, and my favorite is the middle one: Faber’s theorem on polynomial interpolation.

Faber’s theorem is a non-convergence result for interpolants of continuous functions. Trefethen quotes several numerical analysis textbooks that comment on Faber’s theorem in a way that implies an overly pessimistic interpretation. Faber’s theorem is true for continuous functions in general, but if the function *f* being interpolated is smooth, or even just Lipschitz continuous, the theorem doesn’t hold. In particular, Chebyshev interpolation produces a sequence of polynomials converging to *f.*

A few years ago I wrote a blog post that shows a famous example due to Carle Runge that if you interpolate *f*(*x*) = 1/(1 + *x*²) over [-5, 5] with evenly spaced nodes, the sequence of interpolating polynomials diverges. In other words, adding more interpolation points makes the fit *worse*.

Here’s the result of fitting a 16th degree polynomial to *f* at evenly spaced nodes.

The error near the ends is terrible, though the fit does improve in the middle. If instead of using evenly spaced nodes you use the roots of Chebyshev polynomials, the interpolating polynomials do in fact converge, and converge quickly. If the *k*th derivative of *f* has bounded variation, then the error in interpolating *f* at *n* points is *O*(*n*^{–k}).
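A quick numerical illustration of that convergence — my own check, with the parameters chosen arbitrarily:

```python
import numpy as np
from scipy import interpolate

def cheb_error(n):
    # Interpolate Runge's function at n Chebyshev points on [-5, 5]
    # and measure the maximum error on a fine grid.
    k = np.arange(1, n + 1)
    x = 5*np.cos(np.pi*(2*k - 1)/(2*n))
    f = interpolate.BarycentricInterpolator(x, 1/(1 + x**2))
    t = np.linspace(-5, 5, 2000)
    return np.max(np.abs(f(t) - 1/(1 + t**2)))

errors = [cheb_error(n) for n in (8, 16, 32, 64)]
assert all(a > b for a, b in zip(errors, errors[1:]))  # error keeps shrinking
```

With evenly spaced nodes the same experiment would show the error growing instead.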

I have fond memories of writing Perl, though it’s been a long time since I used it. I mostly wrote scripts for file munging, the task it does best, and never had to maintain someone else’s Perl code. Under different circumstances I probably would have had less favorable memories.

Perl is a very large, expressive language. That’s a positive if you’re working alone but a negative if working with others. Individuals can carve out their favorite subsets of Perl and ignore the rest, but two people may carve out different subsets. You may personally avoid some feature, but you have to learn it anyway if your colleague uses it. Also, in a large language there’s greater chance that you’ll accidentally use a feature you didn’t intend to. For example, in Perl you might use an array in a scalar context. This works, but not as you’d expect if you didn’t intend to do it.

I suspect that people who like large languages like C++ and Common Lisp are more inclined to like Perl, while people who prefer small languages like C and Scheme have opposite inclinations.
