First imagine a thin wire running through the coil of the shell. In cylindrical coordinates, this wire follows the parameterization

*r* = *e*^{kθ}

*z* = *Tt*

If *T* = 0 this is a logarithmic spiral in the (*r*, θ) plane. For positive *T*, the spiral is stretched so that its vertical position is proportional to its radius.

Next we build a shell by putting a tube around this imaginary wire. The radius *R* of the tube at each point is proportional to the *r* coordinate: *R = Dr.*

The image above was created using *k* = 0.1, *T* = 2.791, and *D* = 0.8845 using Øyvind Hammer’s seashell generating software. You can download Hammer’s software for Windows and experiment with your own shell simulations by adjusting the parameters.

See also Hammer’s book and YouTube video:

]]>Hammer’s code uses a statistical language called Past that I’d never heard of. Here’s my interpretation of his code using Python.

import matplotlib.pyplot as plt from numpy import arange, sin, cos, exp i = arange(5000) x1 = 1.0*cos(i/10.0)*exp(-i/2500.0) y1 = 1.4*sin(i/10.0)*exp(-i/2500.0) d = 450.0 vx = cos(i/d)*x1 - sin(i/d)*y1 vy = sin(i/d)*x1 + cos(i/d)*y1 plt.plot(vx, vy, "k") h = max(vy) - min(vy) w = max(vx) - min(vx) plt.axes().set_aspect(w/h) plt.show()

This code produces what’s called a harmonograph, related to the motion of a pendulum free to move in *x* and *y* directions:

It’s not exactly the same as the movie poster, but it’s definitely similar. If you find a way to modify the code to make it closer to the poster, leave a comment below.

]]>If we know *n* is a Fibonacci number, how can we tell which one it is? That is, if *n* = *F*_{m}, how can we find *m*?

For large *m*, *F*_{m} is approximately φ^{m} / √ 5 and the error decreases exponentially with *m*. By taking logs, we can solve for *m* and round the result to the nearest integer.

We can illustrate this with SymPy. First, let’s get a Fibonacci number.

>>> from sympy import * >>> F = fibonacci(100) >>> F 354224848179261915075

Now let’s forget that we know *F* is a Fibonacci number and test whether it is one.

>>> sqrt(5*F**2 - 4) sqrt(627376215338105766356982006981782561278121)

Apparently 5*F*^{2} – 4 is not a perfect square. Now let’s try 5*F*^{2} + 4.

>>> sqrt(5*F**2 + 4) 792070839848372253127

Bingo! Now that we know it’s a Fibonacci number, which one is it?

>>> N((0.5*log(5) + log(F))/log(GoldenRatio), 10) 100.0000000

So *F* must be the 100th Fibonacci number.

It looks like our approximation gave an exact result, but if we ask for more precision we see that it did not.

>>> N((0.5*log(5) + log(F))/log(GoldenRatio), 50) 99.999999999999999999999999999999999999999996687654

**Related posts**:

In this post we’ll write a little Python code to kick the tires on Cantrell’s approximation. The post also illustrates how to do some common tasks using SciPy and matplotlib.

Here are the imports we’ll need.

import matplotlib.pyplot as plt from scipy import pi, e, sqrt, log, linspace from scipy.special import lambertw, gamma, psi from scipy.optimize import root

First of all, the gamma function has a local minimum *k* somewhere between 1 and 2, and so it only makes sense to speak of its inverse to the left or right of this point. Gamma is strictly increasing for real values larger than *k*.

To find *k* we look for where the derivative of gamma is zero. It’s more common to work with the derivative of the logarithm of the gamma function than the derivative of the gamma function itself. That works just as well because gamma has a minimum where its log has a minimum. The derivative of the log of the gamma function is called ψ and is implemented in SciPy as `scipy.special.psi`

. We use the function `scipy.optimize.root`

to find where ψ is zero.

The `root`

function returns more information than just the root we’re after. The root(s) are returned in the array`x`

, and in our case there’s only one root, so we take the first element of the array:

k = root(psi, 1.46).x[0]

Now here is Cantrell’s algorithm:

c = sqrt(2*pi)/e - gamma(k) def L(x): return log((x+c)/sqrt(2*pi)) def W(x): return lambertw(x) def AIG(x): return L(x) / W( L(x) / e) + 0.5

Cantrell uses `AIG`

for Approximate Inverse Gamma.

How well goes this algorithm work? For starters, we’ll see how well it does when we do a round trip, following the exact gamma with the approximate inverse.

x = linspace(5, 30, 100) plt.plot(x, AIG(gamma(x))) plt.show()

This produces the following plot:

We get a straight line, as we should, so next we do a more demanding test. We’ll look at the absolute error in the approximate inverse. We’ll use a log scale on the *x*-axis since gamma values get large quickly.

y = gamma(x) plt.plot(y, x- AIG(y)) plt.xscale("log") plt.show()

This shows the approximation error is small, and gets smaller as its argument increases.

Cantrell’s algorithm is based on an asymptotic approximation, so it’s not surprising that it improves for large arguments.

**Related posts**:

Morse code was designed so that the most frequently used letters have the shortest codes. In general, code length increases as frequency decreases.

How efficient is Morse code? We’ll compare letter frequencies based on Google’s research with the length of each code, and make the standard assumption that a dash is three times as long as a dot.

|--------+------+--------+-----------| | Letter | Code | Length | Frequency | |--------+------+--------+-----------| | E | . | 1 | 12.49% | | T | - | 3 | 9.28% | | A | .- | 4 | 8.04% | | O | --- | 9 | 7.64% | | I | .. | 2 | 7.57% | | N | -. | 4 | 7.23% | | S | ... | 3 | 6.51% | | R | .-. | 5 | 6.28% | | H | .... | 4 | 5.05% | | L | .-.. | 6 | 4.07% | | D | -.. | 5 | 3.82% | | C | -.-. | 8 | 3.34% | | U | ..- | 5 | 2.73% | | M | -- | 6 | 2.51% | | F | ..-. | 6 | 2.40% | | P | .--. | 8 | 2.14% | | G | --. | 7 | 1.87% | | W | .-- | 7 | 1.68% | | Y | -.-- | 10 | 1.66% | | B | -... | 6 | 1.48% | | V | ...- | 6 | 1.05% | | K | -.- | 7 | 0.54% | | X | -..- | 8 | 0.23% | | J | .--- | 10 | 0.16% | | Q | --.- | 10 | 0.12% | | Z | --.. | 8 | 0.09% | |--------+------+--------+-----------|

There’s room for improvement. Assigning the letter O such a long code, for example, was clearly not optimal.

But how much difference does it make? If we were to rearrange the codes so that they corresponded to letter frequency, how much shorter would a typical text transmission be?

Multiplying the code lengths by their frequency, we find that an average letter, weighted by frequency, has code length 4.5268.

What if we rearranged the codes? Then we would get 4.1257 which would be about 9% more efficient. To put it another way, Morse code achieved 91% of the efficiency that it could have achieved with the same codes. This is relative to Google’s English corpus. A different corpus would give slightly different results.

Toward the bottom of the table above, letter frequencies correspond poorly to code lengths, though this hardly matters for efficiency. But some of the choices near the top of the table are puzzling. The relative frequency of the first few letters has remained stable over time and was well known long before Google. (See ETAOIN SHRDLU.) Maybe there were factors other than efficiency that influenced how the most frequently used characters were encoded.

**Update**: Some sources I looked at said that a dash is three times as long as a dot, including the space between dots or dashes. Others said there is a pause as long as a dot between elements. If you use the latter timing, it takes an average time equal to 6.0054 dots to transmit an English letter, and this could be improved to 5.6616. By that measure Morse code is about 93.5% efficient. (I only added time for space inside the code for a letter because the space between letters is the same no matter how they are coded.)

Universal algebra is the study of features common to familiar algebraic systems … [It] places the algebraic notions in their proper setting; it often reveals connexions between seemingly different concepts and helps to systemize one’s thoughts. … [T]his approach does not usually solve the whole problem for us, but only

.tidies up a mass of rather trivial detail, allowing us to concentrate our powers on the hard core of the problem

Emphasis added. Source: Universal Algebra by P. M. Cohn

**Related**: Applied category theory

The robustness of a graph is defined as

Then [2] explains that

Nis the total number of nodes in the initial network andS(q) is the relative size of the largest connected component after removingqnodes with the highest degrees. The normalization factor 1/Nensures 1/N≤R≤ 0.5. Two special cases exist:R= 1/Ncorresponds to star-like networks whileR= 0.5 represents the case of the fully connected network.

Apparently “relative size” means relative to the size of the original network, not to the network with *q* nodes removed.

There is one difference between [1] and [2]. The former defines the sum up to *N* and the latter to *N*-1. But this doesn’t matter because *S*(*N*) = 0: when you remove *N* nodes from a graph with *N* nodes, there’s nothing left!

I have a couple reservations. One is that it’s not clear whether *R* is well defined.

Consider the phrase “removing *q* nodes with the highest degrees.” What if you have a choice of nodes with the highest degree? Maybe all nodes have degree no more than *n* and several have degree exactly *n*. It seems your choice of node to remove matters. Removing one node of degree *n* may give you a very different network than removing another node of the same degree.

Along the same lines, it’s not clear whether it matters how one removes more than one degree. For *S*(2), for example, does it matter whether I remove the two most connected nodes from the original graph at once, or remove the most connected node first, then the most connected node from what remains?

My second reservation is that I don’t think the formula above gives exactly the extreme values it says. Start with a star graph. Once you remove the center node, there are only isolated points left, and each point is 1/*N* of the original graph. So all the *S*(*q*) are equal to 1/*N*. Then *R* equals (*N* – 1)/ *N*^{2}, which is less than 1/*N*.

With the complete graph, removing a point leaves a complete graph of lower order. So *S*(*q*) = (*N* – *q*)/*N*. Then *R* equals (*N* – 1)/2*N*, which is less than 1/2.

* * *

[1] CM Schneider et al, Mitigation of malicious attacks on networks. P. Natl. Acad. March 8, 2011, vol. 108. no. 10

[2] Big Data of Complex Networks, CRC Press. Matthias Dehmer et al editors.

]]>**JC**: *Can you start off by telling us a little bit about Give Directly, and what you do?*

**PN**: GiveDirectly is the first nonprofit that lets individual donors like you and me send money directly to the extreme poor. And that’s it—we don’t buy them things we think they need, or tell them what they should be doing, or how they should be doing it. Michael Faye and I co-founded GD, along with Jeremy Shapiro and Rohit Wanchoo, because on net we felt (and still feel) the poor have a stronger track record putting money to use than most of the intermediaries and experts who want to spend it for them.

**JC**: *What are common objections you brush up against, and how do you respond?*

**PN**: We’ve all heard and to some extent internalized a lot of negative stereotypes about the extreme poor—you can’t just give them money, they’ll blow it on alcohol, they won’t work as hard, etc. And it’s only in the last decade or so with the advent of experimental testing that we’ve build a broad evidence base showing that in fact quite the opposite is the case—in study after study the poor have used money sensibly, and if anything drank less and worked more. So to us it’s simply a question of catching folks up on the data.

**JC**: *Why do you think randomized controlled trials are emerging in development economics just in the past decade or so when it has been a standard tool gold standard in other areas for much longer?*

**PN**: I agree that experimental testing in development is long overdue. And to be blunt, I think it came late because we worry more about getting real results when we’re helping ourselves than we do when we’re helping others. When it comes to helping others, we get our serotonin from *believing* we’re making a difference, not the actual difference we make (which we may never find out, for example when we give to charities overseas). And so it’s tempting to succumb to wishful thinking rather than rigorous testing.

**JC**: *What considerations went into the design of your pending basic income trial? What would you have loved to do differently methodologically if you had 10X the budget? 100X?*

**PN**: This experiment is all about scale, in a couple of ways. First, there have been some great basic income pilots in the past, but they haven’t committed to supporting people for more than a few years. That’s important because a big argument the “pro” camp makes is that guaranteeing long-term economic security will free people up to take risks, be more creative, etc.—and a big worry the “con” camp raises is that it will cause people to stop trying. So it was important to commit to support over a long period. We’re doing over a decade—12 years—and with more funding we’d go even longer.

Second, it’s important to test this by randomizing at the community level, not just the individual level. That’s because a lot of the debate over basic income is about how community interactions will change (vs purely individual behavior). So we’re enrolling entire villages—and with more funding, we could make that entire counties, etc. That lets you start to understanding impacts on community cohesion, social capital, the macroeconomy, etc.

**JC**: *In what ways do you think math has served as a good or poor guide for development economics over the years?*

**PN**: I think the far more important question is why has math—and in particular statistics—played such a small role in development decision-making, while “success stories” and “theories of change” have played such large ones.

**JC**: *Can you say something about the efficiency of GiveDirectly?*

**PN**: What we’ve tried to do at GD is, first, be very clear about our marginal cost structure—typically around 90% in the hands of the poor, 10% on costs of enrolling them and delivering funds; and second, provide evidence on how these transfers affect a wide range of outcomes and let donors judge for themselves how valuable those outcomes are.

**JC**: *What is your vision for a methodologically sound poverty reduction research program? What are the main pitfalls and challenges you see?*

**PN**: First, we need to run experiments at larger scales. Testing new ideas in a few villages, run by an NGO, is a great start, but it’s not always an accurate to guide to how an intervention will perform when a government tries to deliver it nation-wide, or how doing something at that scale will affect the broader economy (what we call “general equilibrium effects”). I’ve written about this recently with Karthik Muralidharan based on some of our recent experiences running large-scale evaluations in India.

Second, we need to measure value created for the poor. RCTs tell us how an intervention changes “outcomes,” but not how valuable those outcomes are. That’s fine if you want to assign your own values to outcomes—I could be an education guy, say, and care only about years of formal schooling. But if we care at all about the values and priorities of the poor themselves, we need a different approach. One simple step is to ask people how much money an intervention is worth to them—what economists call their “willingness to pay.” If we’re spending $100 on a program, we’d hope it’s worth at least that much to the beneficiary. If not, begs the question why we don’t just give them the money.

**JC**: *What can people do to help?*

**PN**: Lots of things. Here are a few:

- Set up a recurring donation, preferably to the basic income project. Worst case scenario your money will make life much better for someone in extreme poverty; best case, it will also generate evidence that redefines anti-poverty policy.
- Follow ten recipients on GDLive. Share things they say that you find interesting. Give us feedback on the experience (which is very beta).
- Ask five friends whether they give money to poor people. Find out what they think and why. Share the evidence and information we’ve published and then give us feedback—what was helpful? What was missing?
- Ask other charities to publish the experimental evidence on their interventions prominently on their websites, and to explain why they are confident that they can add more value for the poor by spending money on their behalf than the poor could create for themselves if they had the money. Some do! But we need to create a world where simply publishing a few “success stories” doesn’t cut it any more.

**Related post**: Interview with Food for the Hungry CIO

Excerpt from the new book Big Data of Complex Networks:

Big Data and data protection law provide for a number of mutual conflicts: from the perspective of Big Data analytics,

. From the perspective of the law, Big Data is either a big threat … or a major challenge for international and national lawmakers to adopt today’s data protection laws to the latest technological and economic developments.a strict application of data protection law as we know it today would set an immediate end to most Big Data applications

Emphasis added.

The author of the chapter on legal matters is Swiss and writes primarily in a European context, though all countries face similar problems.

I’m not a lawyer, though I sometimes work with lawyers, and sometimes help companies with the statistical aspects of HIPAA law. But as a layman the observation above sounds reasonable to me, that strict application of the law could bring many applications to a halt, for better and for worse.

In my opinion the regulations around HIPAA and de-identification are mostly reasonable. The things it prohibits mostly should be prohibited. And it has a common sense provision in the form of expert determination. If your data uses fall outside the regulation’s specific recommendations but don’t endanger privacy, you can have an expert can certify that this is the case.

]]>When the transition from one state to another is nonlinear, the probability distribution around future states is not normal. There are many variations on the Kalman filter that amount to various approximations to get around this core difficulty: extended Kalman filters, unscented Kalman filters, particle filters, etc. Here I’ll just discuss unscented Kalman filters and particle filters. This will be a very hand-wavy discussion, but it will give the basic ideas.

It’s easy to push discrete random points through a nonlinear transformation. Calculating the effect of a nonlinear transformation on continuous random variables is more work. The idea of an unscented Kalman filter is to create a normal distribution that approximates the result of a nonlinear transformation by seeing what happens to a few carefully chosen points. The idea of particle filtering is to randomly generate a cloud of points and push these points through the nonlinear transformation.

This may make it sound like (unscented) Kalman filters and particle filters are trivial to implement. Far from it. The basic framework is easy to understand, but getting good performance on a particular problem can take a lot of work.

**Related posts**:

We limit our attention to large, symmetric matrices. We fill the entries of the matrix on the diagonal and above the diagonal with random elements, then fill in the elements below the diagonal by symmetry.

If we choose our random values from a thin-tailed distribution, then **Wigner’s semicircle law** tells us what to expect from our distribution. If our matrices are large enough, then we expect the probability density of eigenvalues to be semicircular. To illustrate this, we’ll fill a matrix with samples from a standard normal distribution and compute its eigenvalues with the following Python code.

import numpy as np import matplotlib.pyplot as plt N = 5000 A = np.random.normal(0, 1, (N, N)) B = (A + A.T)/np.sqrt(2) eigenvalues = np.linalg.eigvalsh(B) print(max(eigenvalues), min(eigenvalues)) plt.hist(eigenvalues, bins=70) plt.show()

We first create an *N* by *N* non-symmetric matrix, then make it symmetric by adding it to its transpose. (That’s easier than only creating the upper-triangular elements.) We divide by the square root of 2 to return the variance of each component to its original value, in this case 1. The eigenvalues in this particular experiment ran from -141.095 to 141.257 and their histogram shows the expected semi-circular shape.

Wigner’s semicircle law does not require the samples to come from a normal distribution. Any distribution with finite variance will do. We illustrate this by replacing the normal distribution with a Laplace distribution with the same variance and rerunning the code. That is, we change the definition of the matrix `A`

to

A = np.random.laplace(0, np.sqrt(0.5), (N, N))

and get very similar results. The eigenvalues ran from -140.886 to 141.514 and again we see a semicircular distribution.

But what happens when we draw samples from a heavy-tailed distribution, i.e. one without finite variance? We’ll use a Student-*t* distribution with ν = 1.8 degrees of freedom. When ν > 2 the *t*-distribution has variance ν/(ν – 2), but for smaller values of ν it has no finite variance. We change the definition of the matrix `A`

to the following:

A = np.random.standard_t(1.8, (N, N))

and now the distribution is quite different.

In this case the minimum eigenvalue was -9631.558 and the largest was 9633.853. When the matrix components are selected from a heavy-tailed distribution, the eigenvalues also have a heavier-tailed distribution. In this case, nearly all the eigenvalues are between -1000 and 1000, but the largest and smallest were 10 times larger. The eigenvalues are more variable, and their distribution looks more like a normal distribution and less like a semicircle.

The distributions for all thin-tailed symmetric matrices are the same. They have a universal property. But each heavy-tailed distribution gives rise to a different distribution on eigenvalues. With apologies to Tolstoy, thin-tailed matrices are all alike; every thick-tailed matrix is thick-tailed in its own way.

**Update**: As the first comment below rightfully points out, the diagonal entries should be divided by 2, not sqrt(2). Each of the off-diagonal elements of *A* + *A*^{T} are the sum of two independent random variables, but the diagonal elements are twice what they were in *A*. To put it another way, the diagonal elements are the sum of perfectly correlated random variables, not independent random variables.

I reran the simulations with the corrected code and it made no noticeable difference, either to the plots or to the range of the eigenvalues.

]]>This is an interesting problem for two reasons. First, it illustrates a clever variation on integration by parts; that’s why it was assigned. But it can also be computed using complex variables. As is often the case,** the “complex” approach is simpler**. Below I’ll show the solution the students were expected to find, then one that they wouldn’t not be expected to find.

The traditional approach to this integral is to integrate by parts. Letting *u* = sin(4*x*), the integral becomes

Next we integrate by parts one more time, this time letting *u* = cos(4*x*). This gives us

At this point it looks like we’re getting nowhere. We could keep on integrating by parts forever. Not only that, we’re going in circles: we have an integral that’s just like the one we started with. But then the clever step is to realize that this is a good thing. Let *I* be our original integral. Then

Now we can solve for *I*:

Here’s another approach. Recognize that sin(4*x*) is the imaginary part of exp(4*ix*) and so our integral is the imaginary part of

which we can integrate immediately:

There’s still algebra to do, but the calculus is over. And while the algebra will take a few steps, it’s routine.

First, let’s take care of the fraction.

Next,

and so our integral is the complex part of

which gives us the same result as before.

The complex variable requires one insight: recognizing a sine as the real part of an exponential. The traditional approach requires several insights: two integrations by parts and solving for the original integral.

**Related**:

It depends on how well you shuffle the cards. If you do what’s called a “faro shuffle” then the probability of a pair of cards remaining neighbors is 0. In a faro shuffle, if the cards were originally numbered 1 through 52, the new arrangement will be 1, 27, 2, 28, 3, 29, …, 26, 52. All pairs are split up.

But if you shuffle the cards so well that the new arrangement is a random permutation of the original, there’s about an 86% chance that at least one pair of neighbors remain neighbors. To be more precise, consider permutations of *n* items. As *n* increases, the probability of no two cards remaining consecutive converges to exp(-2), and so the probability of at least two cards remaining next to each other converges to 1 – exp(-2).

By the way, this considers pairs staying next to each other in either order. For example, if the cards were numbered consecutively, then either a permutation with 7 followed by 8 or 8 followed by 7 would count.

More generally, for large *n*, the probability of *k* pairs remaining consecutive after a shuffle is 2^{k}*e*^{-2} / *k*!.

One application of this result would be testing. If you randomly permute a list of things to break up consecutive pairs, it probably won’t work. Instead, you might want to split your list in half, randomize each half separately, then interleave the two half lists as in the faro shuffle.

Another application would be fraud detection. If you’re suspicious that a supposedly random process isn’t splitting up neighbors, you could use the result presented here to calibrate your expectations. Maybe what you’re seeing is consistent with good randomization. Or maybe not. Note that the result here looks at *any* pair remaining together. If a *particular* pair remains together consistently, that’s more suspicious.

Related posts:

Reference: David P. Robbins, The Probability that Neighbors Remain Neighbors After Random Rearrangements. The American Mathematical Monthly, Vol. 87, No. 2, pp. 122-124

]]>Once upon a time, my opinion of category theory was the same as my opinion of Facebook: if I ignore it for long enough, hopefully it will go away. It is now my educated opinion that category theory will

notgo away, and in fact the language of category theory will continue to spread until it becomes the default foundation of mathematics.

More posts on category theory:

]]>gcd(*F*_{m}, *F*_{n}) = *F*_{gcd(m, n)}

In words, the greatest common divisor of the *m*th and *n*th Fibonacci numbers is the *g*th Fibonacci number where *g* is the greatest common divisor of *m* and *n*. You can find a proof here.

M. Wunderlich used this identity to create a short, clever proof that there are infinitely many primes.

Suppose on the contrary that there are only *k* prime numbers, and consider the set of Fibonacci numbers with prime indices: *F*_{p1}, *F*_{p2}, … *F*_{pk}. The Fibonacci theorem above tells us that these Fibonacci numbers are pairwise relatively prime: each pair has greatest common divisor *F*_{1} = 1.

Each of the Fibonacci numbers in our list must have only one prime factor. Why? Because we have assumed there are only *k* prime numbers, and no two of our Fibonacci numbers share a prime factor. But *F*_{19} = 4181 = 37*113. We’ve reached a contradiction, and so there must be infinitely many primes.

**Source**: M. Wunderlich, Another proof of the infinite primes theorem. American Mathematical Monthly, Vol. 72, No. 3 (March 1965), p. 305.

Here are a couple more unusual proofs that there are infinitely many primes. The first uses a product of sines. The second is from Paul Erdős. It also gives a lower bound on π(*N*), the number of primes less than *N*.

Diffie-Hellman public key cryptography is based on the assumption that discrete logarithms are hard to compute. There are algorithms to compute discrete logarithms that are much faster than brute force. For example, baby-step giant-step is a fairly simple algorithm. There are more efficient algorithms as well, but the best known algorithms are still much slower than raising numbers to powers. Whenever you find something that is much harder to undo than to do, it might be useful in cryptography, and that is the case with discrete logs.

Diffie-Hellman encryption requires users to compute exponentials and presumably requires attackers to compute discrete logs. I say “presumably” because it’s a fatal error in cryptography to assume an attacker has to solve the problem you think he’d have to solve. For example, you can create a simple encryption scheme by permuting the alphabet and encrypting each letter to its counterpart in the permutation. Someone might naively think “No one can break this because they’d have to try 26! permutations of the alphabet, over 400 million million million million possibilities!” Except that’s not how anyone approaches a substitution cipher. If it were, you wouldn’t see cryptograms in puzzle books.

As far as we know, discrete logarithms are hard to compute when working over integers mod *p* where *p* is a large prime, except for primes that have certain properties. We’ll look at what those properties are below and how to avoid them.

For a prime *p*, the integers mod *p* form a finite field. They are a group under addition, and the non-zero elements form a group under multiplication. It’s the multiplicative group we care about here. This group has order *p*-1, i.e. it has *p*-1 elements.

A group of prime order has no proper subgroups. But a group of composite order does. And our multiplicative group has order *p*-1, which is composite. (Except for *p* = 3, and cryptography depends on primes far, far bigger than 3.)

Sylow’s theorems tell us something about what kinds of subgroups a group must have. If *s* is prime and *s*^{k} is a factor of the order of our group, then the group has a subgroup of order *s*^{k}. We don’t want our multiplicative group to have any small-order subgroups because these would make it easier to compute discrete logarithms.

A **safe prime** *p* has the form 2*q* + 1 where *q* is also prime. Diffie-Hellman chooses safe primes for moduli because this means the multiplicative group of order *p*-1 = 2*q* has no small subgroups. (It has two small subgroups, {1} and {1, -1}, but these can easily be avoided. The algorithm requires picking a generator *g*, and as long as you don’t pick *g* to be 1 or -1 mod *p*, then *g* generates a group of order *q*, and if *p* is gigantic, so is *q*.) Because *q* is prime, the subgroup of order *q* does not have any further subgroups.

**Related post**: Probability that a number is prime

]]>

The biggest uses for automated theorem proving have been highly specialized applications, not mathematically interesting theorems. Computer chip manufacturers use formal methods to verify that given certain inputs their chips produce certain outputs. Compiler writers use formal methods to verify that their software does the right thing. A theorem saying your product behaves correctly is very valuable to you and your customers, but nobody else. These aren’t the kinds of theorems that anyone would cite the way they might site the Pythagorean theorem. Nobody would ever say “And therefore, by the theorem showing that this particular pacemaker will not fall into certain error modes, I now prove this result unrelated to pacemakers.”

Automated theorem provers are important in these highly specialized applications in part because the results are of such limited interest. For every theorem of wide mathematical interest, there are a large number of mathematicians who are searching for a proof or who are willing to scrutinize a proposed proof. A theorem saying that a piece of electronics performs correctly appeals to only the tiniest audience, and yet is probably much easier (for a computer) to prove.

The term “automated theorem proving” is overloaded to mean a couple things. It’s used broadly to include any use of computing in proving theorems, and it’s used more narrowly to mean software that searches for proofs or even new theorems. Most theorem provers in the broad sense are not automated theorem provers in the more narrow sense but rather proof *assistants*. They verify proofs rather than discover them. (There’s some gray zone. They may search on a small scale, looking for a way to prove a minor narrow result, but not search for the entire proof to a big theorem.) There have been computer-verified proofs of important mathematical theorems, such as the Feit-Thompson theorem from group theory, but I’m not aware of any generally interesting discoveries that have come out of a theorem prover.

**Related post**: Formal methods let you explore the corners

Ideally, a secure hash is “indistinguishable from a random mapping.” [1] So if a hash function has a range of size *N*, how many items can we send through the hash function before we can expect two items to have same hash value? By the pigeon hole principle, we know that if we hash *N*+1 items, two of them are *certain* to have the same hash value. But it’s likely that a much smaller number of inputs will lead to a collision, two items with the same hash value.

The famous birthday problem illustrates this. You could think of birthdays as a random mapping of people into 366 possible values [2]. In a room of less than 366 people, it’s possible that everyone has a different birthday. But in a group of 23 people, there are even odds that two people have the same birthday.

Variations on the birthday problem come up frequently. For example, in seeding random number generators. And importantly for this post, the birthday problem comes up in attacking hash functions.

When *N* is large, it is likely that hashing √*N* values will lead to a collision. We prove this below.

The proof below is a little informal. It could be made more formal by replacing the approximate equalities with equalities and adding the necessary little-o terms.

Suppose we’re hashing *n* items to a range of size *N* = *n*^{2}. The exact probability that all *n* items have unique hash values is given in here. Taking the log of both sides gives us the first line of the proof below.

The first approximation is based on the first three terms in the asymptotic expansion for log Γ given here, applied to both log gamma expressions. (The third terms from the two asymptotic series are the same so they cancel out.) The second line isn’t exactly what you’d get by applying the asymptotic expansion. It’s been simplified a little. The neglected terms are not a mistake but terms that can be left out because they go to zero.

The second approximation comes from using the first two terms in the power series for log(1 + *x*). One term isn’t enough since that would reduce to zero. The final approximation is simply taking the limit as *n* goes to infinity. Concretely, we’re willing to say that a billion and one divided by a billion is essentially 1.

So the probability of no collisions is exp(-1/2) or about 60%, which means there’s a 40% chance of at least one collision. As a rule of thumb, **a hash function with range of size N can hash on the order of √N values before running into collisions**.

This means that with a 64-bit hash function, there’s about a 40% chance of collisions when hashing 2^{32} or about 4 billion items. If the output of the hash function is discernibly different from random, the probability of collisions may be higher. A 64-bit hash function cannot be secure since an attacker could easily hash 4 billion items. A 256-bit or 512-bit hash could in principle be secure since one could expect to hash far more items before collisions are likely. Whether a particular algorithm like SHA3-512 is actually secure is a matter for cryptologists, but it is at least feasible that a hash with a 512-bit range *could* be secure, based on the size of its range, while a 64-bit hash cannot be.

We used an asymptotic argument above rather than numerically evaluating the probabilities because this way we get a more general result. But even if we were only interested in a fix but large *n*, we’d run into numerical problems. This is one of those not uncommon cases where a pencil-and-paper approximation is actually more accurate than direct calculation with no (explicit) approximations.

There are numerous numerical problems with direct calculation of the collision probability. First, without taking logarithms we’d run into overflow and underflow. Second, for large enough *n*, *n*^{2} – *n* = *n*^{2} in floating point representation. IEEE 754 doubles have 53 bits of precision. This isn’t enough to distinguish values that differ, say, in the 128th bit. Finally, the two log gamma terms are large, nearly equal numbers. The cardinal rule of numerical analysis is to avoid subtracting nearly equal numbers. If two numbers agree to *k* bits, you could lose *k* bits of precision in carrying out their difference. See Avoiding overflow, underflow, and loss of precision for more along these lines.

[1] Cryptography Engineering by Ferguson, Schneier, and Kohno

[2] Leap year of course complicates things since February 29 birthdays are less common than other birthdays. Another complication is that birthdays are not entirely evenly distributed for the other days of the year. But these complications don’t ruin the party trick: In a room of 30 people, two people usually share a birthday.

]]>My first reply was this:

sqrt(2017) < 45.

2017 not divisible by 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, or 43.

Ergo prime.

I’m not sure that’s what he had in mind. There’s some implied calculation (which I didn’t actually do), so it’s kinda cheating. It would be interesting if there were something special about 2017 that would allow a more transparent proof.

(Implicit in the proof above is the theorem that if a number has a prime factor, it has a prime factor less than it’s square root. If you multiply together numbers bigger than the square root of *n*, you get a product bigger than *n*.)

Then for fun I gave two more proofs that are more sophisticated but would require far too much work to actually carry out by hand.

The first uses Fermat’s little theorem:

For 0 < *a* < 2017, *a*^{2016} – 1 is divisible by 2017.

2017 is not one of the three Carmichael numbers < 2465.

Ergo prime.

Fermat’s little theorem says that if *p* is prime, then for any 0 < *a* < *p*, *a*^{p – 1} – 1 is divisible by *p*. This is actually an efficient way to prove that a number is *not* prime. Find a number *a* such that the result doesn’t hold, and you’ve proved that *p* isn’t prime. For small numbers, the easiest way to show a number is not prime is to show its factors. But for very large numbers, such as those used in cryptography, it’s efficient to have a way to prove that a number *has* factors without having to actually *produce* those factors.

Unfortunately, Fermat’s little theorem gives a necessary condition for a number to be prime, but not a sufficient condition. It can appear to be prime for every witness (the bases *a* are called witnesses) and still not be a prime. The Carmichael numbers pass the Fermat primailty test without being prime. The first four are 561, 1105, 1729, and 2465.

For more on using Fermat’s little theorem to test for primality, see Probability that a number is prime.

The final proof was this:

2016! + 1 is divisible by 2017, and so by Wilson’s theorem 2017 is prime.

Unlike Fermat’s little theorem, Wilson’s theorem gives necessary and sufficient conditions for a number to be prime. A number *n* is prime if and only if (*n*-1)! + 1 is divisible by *n*. In theory you could use Wilson’s theorem to test whether a number is prime, but this would be horrendously inefficient. 2016! has 5,789 digits. (You can find out how many digits *n*! has without computing it using a trick described here.)

Despite its inefficiency, you can actually use Wilson’s theorem and SymPy to prove that 2017 is prime.

>>> from sympy import factorial >>> x = factorial(2016) + 1 >>> x % 2017 0

]]>

I send out a newsletter at the end of each month. I’ve sent out around 20 so far. They all have two parts:

- a review of the most popular posts of the month, and
- a few words about what I’ve been up to.

That’s it. Short and sweet. I might send out more mail than this someday, but I’ve been doing this for nearly two years I’ve never sent more than one email a month.

If you’d like to subscribe, just enter your email address in the box on the side of the page labeled “Subscribe to my newsletter.” If you’re not reading this directly on the site, say you’re reading it in an RSS reader, then you can follow this link.

]]>The **cover time** of a graph is the expected time it takes a simple random walk on the graph to visit every node. A simple random walk starts at some node, then at each step chooses with equal probability one of the adjacent nodes. The cover time is defined to be the maximum over all starting points of expected time to visit all nodes.

It seems plausible that adding edges to a graph would decrease its cover time. Sometimes this is the case and sometimes it isn’t, as we’ll demonstrate below.

This post will look at three kinds of graphs

- a
**chain**of*n*vertices - a
**complete graph**on*n*vertices - a “
**lollipop**” with*n*vertices

A chain simply connects vertices in a straight line. A complete graph connects every node to every other node.

A lollipop with *n* vertices takes a chain with *n*/2 vertices and complete graph on *n*/2 vertices and joins them together. The image above is a lollipop graph of order 10. The complete graph becomes a clique in the new graph.

The image looks more like a kite than a lollipop because that’s the way my software (networkx) drew it by default. If you’d like, image the complete graph more round and the chain more straight so that the graph more closely resembles a lollipop.

Chains have a long cover time. Complete graphs have a short cover time. What about lollipops?

A plausible answer would be that a lollipop graph would have a moderate cover time, somewhere between that of a complete graph and a chain. But that’s not the case. In fact, the lollipop has a longer cover time than either the complete graph or the chain.

You could think of a lollipop graph with *n* vertices as a chain of length *n* with extra edges added. This shows that adding edges does not always decrease the cover time.

The cover times for the three graphs we consider here are of different orders: O(*n* log *n*) for the complete graph, O(*n*^{2}) for the chain, and O(*n*^{3}) for the lollipop.

Now these are only asymptotic results. Big-O notation leaves out proportionality constants. We know that for sufficiently large *n* the cover times are ordered so that the complete graph is the smallest, then the chain, then the lollipop. But does *n* have to be astronomically large before this happens? What happens with a modest size *n*?

There’s a theoretical result that says the cover time is bounded by 2*m*(*n*-1) where *m* is the number of edges and *n* the number of vertices. This tells us that when we say the cover time for the chain is O(*n*^{2}) we can say more, that it’s less than 2*n*^{2}, so at least in this case the proportionality constant isn’t large.

We’ll look at the cover times with some simulation results below based on *n* = 10 and *n* = 30 based on 10,000 runs.

With 10 nodes each, the cover times for the complete graph, chain, and lollipop were 23.30, 82.25, and 131.08 respectively.

With 30 nodes the corresponding cover times were 114.37, 845.16, and 3480.95.

So the run times were ordered as the asymptotic theory would suggest.

He had four names at various times. A person’s life is heterogeneous, so this could be seen as an advantage. Life’s parts sometimes have little in common, so little that it might appear that various people lived them. When this happens, it is difficult not to feel surprised that all these people carry the same name.

This reminded me of the section of James Scott’s Seeing Like a State that explains how names used to be more variable.

Among some peoples, it is not uncommon for individuals to have different names during different stages of life (infancy, childhood, adulthood) and in some cases after death; added to these are names used for joking, rituals, and mourning and names used for interactions with same-sex friends or with in-laws. Each name is specific to a certain phase of life, social setting, or interlocutor.

If someone’s name had more than one component, the final component might come from their profession (which could change) rather than their ancestry. Scott goes on to say

The invention of permanent, inherited patronyms was … the last step in establishing the necessary preconditions of modern statecraft. In almost every case it was a state project, designed to allow officials to identify, unambiguously, the majority of its citizens.

In short, governments insisted people adopt fixed names to make them easier to tax and to conscript. Before fixed names, governments would ask towns to provide so much tax money or so many soldiers because it could not tax or conscript citizens directly. For a famous example, see Luke’s account of the birth of Jesus: all went to be registered, *each to his own town*.

It’s hard to imagine people not needing fixed names. But when people lived on a smaller scale, interacting with a number of people closer to Dunbar’s number, there was no danger of ambiguity because there was more context.

]]>

where *p* was a positive integer. Here we look at what happens when we let *p* be a *negative* integer and we let *n* go to infinity. We’ll learn more about Bernoulli numbers and we’ll see what is meant by apparently absurd statements such as 1 + 2 + 3 + … = -1/12.

If *p* < -1, then the limit as *n* goes to infinity of *S*_{p}(*n*) is ζ(*-p*). That is, for *s* > 1, the Riemann zeta function ζ(*s*) is defined by

We don’t have to limit ourselves to real numbers *s* > 1; the definition holds for complex numbers *s* with real part greater than 1. That’ll be important below.

When *s* is a positive even number, there’s a formula for ζ(*s*) in terms of the Bernoulli numbers:

The best-known special case of this formula is that

1 + 1/4 + 1/9 + 1/16 + … = π^{2} / 6.

It’s a famous open problem to find a closed-form expression for ζ(3) or any other odd argument.

The formula relating the zeta function and Bernoulli tells us a couple things about the Bernoulli numbers. First, for *n* ≥ 1 the Bernoulli numbers with index 2*n* alternate sign. Second, by looking at the sum defining ζ(2*n*) we can see that it is approximately 1 for large *n*. This tells us that for large *n*, |*B*_{2n}| is approximately (2*n*)! / 2^{2n-1} π^{2n}.

We said above that the sum defining the Riemann zeta function is valid for complex numbers *s* with real part greater than 1. There is a unique analytic extension of the zeta function to the rest of the complex plane, except at *s* = 1. The zeta function is defined, for example, at negative integers, but the sum defining zeta in the half-plane Re(*s*) > 1 is NOT valid.

You may have seen the equation

1 + 2 + 3 + … = -1/12.

This is an abuse of notation. The sum on the left clearly diverges to infinity. But if the sum defining ζ(*s*) for Re(*s*) > 1 were valid for *s* = -1 (which it is not) then the left side would equal ζ(-1). The analytic continuation of ζ *is* valid at -1, and in fact ζ(-1) = -1/12. So the equation above is true if you interpret the left side, not as an ordinary sum, but as a way of writing ζ(-1). The same approach could be used to make sense of similar equations such as

1^{2} + 2^{2} + 3^{2} + … = 0

and

1^{3} + 2^{3} + 3^{3} + … = 1/120.

1 + 2 + 3 + … + *n* = *n*(*n* + 1) / 2

There’s also a formula for the sum of the first *n* squares

1^{2} + 2^{2} + 3^{2} + … + *n*^{2} = *n*(*n* + 1)(2*n* + 1) / 6

and for the sum of the first *n* cubes:

1^{3} + 2^{3} + 3^{3} + … + *n*^{3} = *n*^{2}(*n* + 1)^{2} / 4

It’s natural to ask whether there’s a general formula for all exponents. There is, but it’s not entirely satisfying. There’s a single formula for the sum of the *p*th powers of the first *n* positive integers, but it involves mysterious coefficients known as Bernoulli numbers. So there’s a fairly simple formula for sums of powers in terms of Bernoulli numbers, but there’s no simple formula for Bernoulli numbers.

At first glance this might not seem that useful because we have a formula for a sum in terms of another sum, but note that the sums run over different ranges. The original sum runs up to *n*, and in application *n* might be very large. The sum on the right runs up to the exponent *p*, and in application *p* is usually quite small.

The first few Bernoulli numbers *B*_{j}, starting with *j* = 0, are 1, -1/2, 1/6, 0, -1/30, 0, …. This short list might give you some hope of finding a nice formula for the Bernoulli numbers, but they’re full of surprises. The 12th Bernoulli number, for example, is -691/2730.

So how do you define the Bernoulli numbers? These numbers come up in many contexts, and so there are many ways to define them. One way is to say that the exponential generating function of the Bernoulli numbers is *z* / (exp(*z*)- 1). In other words, *B*^{j} is *j*! times the coefficient of *z*^{j} in the power series for *z* / (exp(*z*)- 1) centered at *z* = 0.

There are a couple ways to look at this definition. The first reaction is probably disappointment. “You won’t just tell me what the Bernoulli numbers are. You tell me that I have to calculate another term of this ugly power series every time I want a new Bernoulli number?!” But from a more advanced perspective, you might be grateful that the unwieldy Bernoulli numbers have such a simple exponential generating function. Often the most useful ways to study special numbers is via their generating functions.

The next post will look at what happens when we let *p* be negative, and when we let *n* go to infinity. In other words, we’ll look at the Riemann zeta function.

By rapid mixing we mean that a random walk approaches its limiting (stationary) probability distribution quickly relative to random walks on other graphs.

Here’s a graph that supports rapid mixing. Pick a prime *p* and label nodes 0, 1, 2, 3, …, *p*-1. Create a circular graph by connecting each node *k* to *k*+1 (mod *p*). Then add an edge between each non-zero *k* to its multiplicative inverse (mod *p*). If an element is it’s own inverse, add a loop connecting the node to itself. For the purpose of this construction, consider 0 to be its own inverse. This construction comes from [1].

Here’s what the graph looks like with *p* = 23.

This graph doesn’t show the three loops from a node to itself, nodes 1, 0, and 22.

At each node in the random walk, we choose an edge uniformly and follow it. The stationary distribution for a random walk on this graph is uniform. That is, in the long run, you’re equally likely to be at any particular node.

If we pick an arbitrary starting node, 13, and take 30 steps of the random walk, the simulated probability distribution is nearly flat.

By contrast, we take a variation on the random walk above. From each node, we move left, right, or stay in place with equal probability. This walk is not as well mixed after 100 steps as the original walk is after only 30 steps.

You can tell from the graph that the walk started around 13. In the previous graph, evidence of the starting point had vanished.

The code below was used to produce these graphs.

To investigate how quickly a walk on this graph converges to its limiting distribution, we could run code like the following.

from random import random import numpy as np import matplotlib.pyplot as plt m = 23 # Multiplicative inverses mod 23 inverse = [0, 1, 12, 8, 6, 14, 4, 10, 3, 18, 7, 21, 2, 16, 5, 20, 13, 19, 9, 17, 15, 11, 22] # Verify that the inverses are correct assert(len(inverse) == m) for i in range(1, m): assert (i*inverse[i] % 23 == 1) # Simulation num_reps = 100000 num_steps = 30 def take_cayley_step(loc): u = random() * 3 if u < 1: # move left newloc = (loc + m - 1) % m elif u < 2: # move right newloc = (loc + 1) % m else: newloc = inverse[loc] # or newloc = loc return newloc stats = [0]*m for i in range(num_reps): loc = 13 # arbitrary fixed starting point for j in range(num_steps): loc = take_cayley_step(loc) stats[loc] += 1 normed = np.array(stats)/num_reps plt.plot(normed) plt.xlim(1, m) plt.ylim(0,2/m) plt.show()

**Related posts**:

* * *

[1] Shlomo Hoory, Nathan Linial, Avi Wigderson. “Expander graphs and their applications.” Bulletin of the American Mathematical Society, Volume 43, Number 4, October 2006, pages 439–561.

]]>If you thumb through Guy Steele’s book Common Lisp: The Language, 2nd Edition, you might be surprised how much space is devoted to defining familiar functions: square root, log, arcsine, etc. He gives some history of how these functions were first defined in Lisp and then refined by the ANSI (X3JI3) standardization committee in 1989.

There are three sources of complexity:

- Complex number arguments
- Multi-valued functions
- +0 and -0

The functions under discussion are defined for either real or complex inputs. This does not complicate things much in itself. Defining some functions for complex arguments, such as the `exp`

function, is simple. The complication comes from the interaction of complex arguments with multi-valued functions and floating point representations of zero.

The tricky functions to define are inverse functions, functions where we have to make a choice of range.

Let’s restrict our attention to real numbers for a moment. How do you define the square root of a positive number *x*? There are two solutions to the equation *y*^{2} = *x*, and √*x* is defined to be the positive solution.

What about the arcsine of *x*? This is the number whose sine is *x*. Except there is a “the” number. There are infinitely many numbers whose sine is *x*, so we have to make a choice. It seems natural to chose values in an interval symmetric about 0, so we take arcsine of *x* to be the number *between -π/2 and π/2* whose sine is *x*.

Now what about arctangent? As with arcsine, we have to make a choice because for any *x* there are infinitely many numbers *y* whose tangent is *x*. Again it’s convenient to define the range to be in an interval symmetric about zero, so we define the arctangent of *x* to be the number *y* *between -π/2 and π/2* whose tangent is *x*. But now we have a subtle complication with tangent we didn’t have with sine because tangent is unbounded. How do we want to define the tangent of a vertical angle? Should we call it ∞ or -∞? What do we want to return if someone asks for the arctangent of ±∞? Should we return π/2 or -π/2?

The discussion shows there are some minor complications in defining inverse functions on the real line. Things get more complicated when working in the complex plane. To take the square root example, it’s easy to say we’re going to define square root so that the square root of a positive number is another positive number. Fine. But which solution to *z*^{2} = *w* should we take to be the square root of a complex number *w*, such as 2 + 3*i* or -5 + 17*i*?

Or consider logarithms. For positive numbers *x* there is only one real number *y* such at exp(*y*) = *x*. But what if we take a negative value of *x* such as -1? There’s no *real* number whose exponential is -1, but there is a complex number. In fact, there are infinitely many complex numbers whose exponential is -1. Which one should we choose?

A little known feature of floating point arithmetic (specified by the IEEE 754 standard) is that there are two kinds of zero: +0 and -0. This sounds bizarre at first, but there are good reasons for this, which I explain here. But defining functions to work properly with two kinds of zero takes a lot of work. This was the main reason the ANSI Common Lisp committee had to revise their definitions of several transcendental functions. If a function has a branch cut discontinuity along the real axis, for example, you want your definition to be continuous as you approach *x* + 0*i* from above and as you approach *x* -0*i* from below.

I’ll cut to the chase and present the solution the X3J13 came up with. For a discussion of the changes this required and the detailed justifications, see Guy Steele’s book.

The first step is to carefully define the two-argument arctangent function `(atan y x)`

for all 16 combinations of *y* and *x* being positive, negative, +0, or -0. Then other functions are defined as follows.

- Define
`phase`

in terms of`atan`

. - Define complex
`abs`

in terms of real`sqrt`

. - Define complex
`log`

in terms of`phase`

, complex`abs`

, and real`log`

. - Define complex
`sqrt`

in terms of complex`log`

. - Define everything else in terms of the functions above.

The actual implementations may not follow this sequence, but they have to produce results consistent with this sequence.

The phase of *z* is defined as the arctangent with arguments the imaginary and real parts of *z*.

The complex log of *z* is defined as log |*z*| + *i* phase(*z*).

Square root of *z* is defined as exp( log(*z*) / 2 ).

The inverses of circular and hyperbolic functions are defined as follows.

Note that there are many ways to define these functions that seem to be equivalent, and are equivalent *in some region*. Getting the branch cuts right is what makes this complicated.

]]>

Bayesian methods are often characterized as “subjective” because the user must choose a prior distribution, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression) or the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.

People worry about prior distributions, not because they’re subjective, but because they’re *explicitly* subjective. There are many other subjective factors, common to Bayesian and Frequentist statistics, but these are *implicitly* subjective.

In practice, prior distributions often don’t make much difference. For example, you might show that an optimistic prior and a pessimistic prior lead to the same conclusion.

If you have so little data that the choice of prior does make a substantial difference, being able to specify the prior is a benefit. Suppose you only have a little data but have to make a decision anyway. A frequentist might say there’s too little data for a statistical analysis to be meaningful. So what do you do? Make a decision entirely subjectively! But with a Bayesian approach, you capture what is known outside of the data at hand in the form of a prior distribution, and update the prior with the little data you have. In this scenario, a Bayesian analysis is less subjective and more informed by the data than a frequentist approach.

]]>Saying “nobody” is a bit of hyperbole. It happens, but not often. Most problems come to you in the language of the client’s business, such as “We want to make sure this machine doesn’t run into that machine” or “We’re trying to predict kidney failure.” [1]

Recently Charles McCreary from CRM Engineering sent me the most specific problem statement I’ve ever seen:

I need to calculate the von Mises yield envelope in the axial force-internal/external pressure domain for Lame’s solution …

That really took me by surprise. Sometimes a lead will mention mathematical terminology like “particle filters” or “finite elements,” though even this level of detail is uncommon. I’ve never seen something so specific.

It’s still the case that a lot of work goes into formulating a problem. I’m sure Charles’ client didn’t approach *him* with this statement. I’m consulting to a consultant who has already done the hard work of reducing the original application down to a purely mathematical problem. (I’d call it “purely mathematical” rather than “pure mathematical” because it’s definitely applied math.) I look forward to digging into it as soon as I wrap up what I’m currently working on.

* * *

[1] Nobody has come to me wanting to predict kidney failure, but I’ve worked on predicting several other maladies that I’ll not mention to maintain confidentiality.

]]>For example, suppose you want to compute the following integral that comes up frequently in probability.

There is no (elementary) function whose derivative is exp(-*x*^{2}). It’s not just hard to find or ugly. It simply doesn’t exist, not within the universe of elementary functions. There are functions whose derivative is exp(-*x*^{2}), but these functions are not finite algebraic combinations of the kinds of functions you’d see in high school.

If you think of the definite integral above as meaning “the result you get when you find an antiderivative, let its arguments go off to ∞ and -∞, and subtract the two limits” then you’ll never calculate it. And when you hear that the antiderivative doesn’t exist (in the world of functions you’re familiar with) then you might think that not only can you not calculate the integral, no one can.

In fact the integral is easy to calculate. It requires an ingenious trick [2], but once you see that trick it’s not hard.

Let *I* be the value of the integral. Changing the integration variable makes no difference, i.e.

and so

This integral can be converted to polar coordinates. Instead of describing the plane as an infinite square with *x* and *y* each going off to infinity in both directions, we can think of it as an infinite disk, with radius going off to infinity. The advantage of this approach is that the Jacobian of the change of variables gives us an extra factor of *r* that makes the exponential integral tractable.

From this we get *I*^{2} = π and so *I* = √π.

This specific trick comes up occasionally. But more generally, it is often the case that definite integrals are easier to compute than indefinite integrals. One of the most common applications of complex analysis is computing such integrals through the magic of contour integration. This leads to a lesson closely related to the one above, namely that **you may not have to do what it looks like you need to do**. In this case, you don’t always need to compute indefinite integrals (anti-derivatives) as an intermediate step to compute definite integrals. [3]

Mathematics is filled with theorems that effectively say that you don’t actually have to compute what you conceptually need to compute. Sometimes you can get by with calculating much less.

* * *

[1] One frustration I’ve had working with statisticians is that many have forgotten the distinction between *what* they want to calculate and *how* they calculate it. This makes it difficult to suggest better ways of computing things.

[2] Lord Kelvin said of this trick “A mathematician is one to whom *that* is as obvious as that twice two makes four is to you. Liouville was a mathematician.”

[3] If you look back carefully, we had to compute the integral of exp(-*r*^{2}) *r*, which you would do by first computing its anti-derivative. But we didn’t have to compute the anti-derivative of the original integrand. We traded a hard (in some sense impossible) anti-derivative problem for an easy one.