It depends on how well you shuffle the cards. If you do what’s called a “faro shuffle” then the probability of a pair of cards remaining neighbors is 0. In a faro shuffle, if the cards were originally numbered 1 through 52, the new arrangement will be 1, 27, 2, 28, 3, 29, …, 26, 52. All pairs are split up.

But if you shuffle the cards so well that the new arrangement is a random permutation of the original, there’s about an 86% chance that at least one pair of neighbors remain neighbors. To be more precise, consider permutations of *n* items. As *n* increases, the probability that no pair of neighbors remains adjacent converges to exp(-2), and so the probability that at least one pair remains adjacent converges to 1 – exp(-2).

By the way, this considers pairs staying next to each other in either order. For example, if the cards were numbered consecutively, then either a permutation with 7 followed by 8 or 8 followed by 7 would count.

More generally, for large *n*, the probability of exactly *k* pairs remaining consecutive after a shuffle is approximately 2^{k}*e*^{-2} / *k*!. That is, the number of surviving pairs follows an approximately Poisson distribution with mean 2.
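Here’s a quick simulation sketch (in Python, with details of my own choosing) to check these numbers:

```python
import random

def surviving_pairs(n=52, trials=20000):
    """For each trial, shuffle 0..n-1 and count how many
    originally adjacent pairs (i, i+1) remain adjacent,
    in either order, after the shuffle."""
    counts = []
    for _ in range(trials):
        deck = list(range(n))
        random.shuffle(deck)
        pos = {card: i for i, card in enumerate(deck)}
        k = sum(abs(pos[i] - pos[i + 1]) == 1 for i in range(n - 1))
        counts.append(k)
    return counts

counts = surviving_pairs()
p_at_least_one = sum(c > 0 for c in counts) / len(counts)
print(p_at_least_one)   # close to 1 - exp(-2) ≈ 0.865
```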

One application of this result would be testing. If you randomly permute a list of things to break up consecutive pairs, it probably won’t work. Instead, you might want to split your list in half, randomize each half separately, then interleave the two half lists as in the faro shuffle.

Another application would be fraud detection. If you’re suspicious that a supposedly random process isn’t splitting up neighbors, you could use the result presented here to calibrate your expectations. Maybe what you’re seeing is consistent with good randomization. Or maybe not. Note that the result here looks at *any* pair remaining together. If a *particular* pair remains together consistently, that’s more suspicious.

Reference: David P. Robbins, The Probability that Neighbors Remain Neighbors After Random Rearrangements. The American Mathematical Monthly, Vol. 87, No. 2, pp. 122-124

Once upon a time, my opinion of category theory was the same as my opinion of Facebook: if I ignore it for long enough, hopefully it will go away. It is now my educated opinion that category theory will *not* go away, and in fact the language of category theory will continue to spread until it becomes the default foundation of mathematics.


gcd(*F*_{m}, *F*_{n}) = *F*_{gcd(m, n)}

In words, the greatest common divisor of the *m*th and *n*th Fibonacci numbers is the *g*th Fibonacci number where *g* is the greatest common divisor of *m* and *n*. You can find a proof here.
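The identity is easy to spot-check with a few lines of Python (a quick sketch, not part of the proof):

```python
from math import gcd

def fib(n):
    """Iterative Fibonacci: F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Spot-check gcd(F_m, F_n) = F_gcd(m, n)
for m in range(1, 30):
    for n in range(1, 30):
        assert gcd(fib(m), fib(n)) == fib(gcd(m, n))
print("identity verified for 1 <= m, n < 30")
```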

M. Wunderlich used this identity to create a short, clever proof that there are infinitely many primes.

Suppose on the contrary that there are only *k* prime numbers, and consider the set of Fibonacci numbers with prime indices: *F*_{p1}, *F*_{p2}, … *F*_{pk}. The Fibonacci theorem above tells us that these Fibonacci numbers are pairwise relatively prime: each pair has greatest common divisor *F*_{1} = 1.

Each of the Fibonacci numbers in our list must have only one prime factor. Why? Because we have assumed there are only *k* prime numbers, and no two of our Fibonacci numbers share a prime factor. But *F*_{19} = 4181 = 37 × 113 has two prime factors, and since 19 is prime, *F*_{19} is in our list. We’ve reached a contradiction, and so there must be infinitely many primes.

**Source**: M. Wunderlich, Another proof of the infinite primes theorem. American Mathematical Monthly, Vol. 72, No. 3 (March 1965), p. 305.

Here are a couple more unusual proofs that there are infinitely many primes. The first uses a product of sines. The second is from Paul Erdős. It also gives a lower bound on π(*N*), the number of primes less than *N*.

Diffie-Hellman public key cryptography is based on the assumption that discrete logarithms are hard to compute. There are algorithms to compute discrete logarithms that are much faster than brute force. For example, baby-step giant-step is a fairly simple algorithm. There are more efficient algorithms as well, but the best known algorithms are still much slower than raising numbers to powers. Whenever you find something that is much harder to undo than to do, it might be useful in cryptography, and that is the case with discrete logs.
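As a concrete illustration, here is a minimal baby-step giant-step sketch in Python. The example values are my own; real cryptographic moduli are enormously larger:

```python
from math import isqrt

def bsgs(g, h, p):
    """Baby-step giant-step: find x with g**x = h (mod p),
    using O(sqrt(p)) time and space. Returns None if no solution."""
    n = isqrt(p - 1) + 1
    # Baby steps: table of g**j for j = 0..n-1
    table = {pow(g, j, p): j for j in range(n)}
    # Giant steps: look for h * (g**-n)**i in the table
    factor = pow(g, -n, p)   # modular inverse power (Python 3.8+)
    gamma = h
    for i in range(n):
        if gamma in table:
            return i * n + table[gamma]
        gamma = gamma * factor % p
    return None

x = bsgs(5, 3, 23)   # discrete log of 3 base 5, mod 23
print(x, pow(5, x, 23))
```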

Diffie-Hellman encryption requires users to compute exponentials and presumably requires attackers to compute discrete logs. I say “presumably” because it’s a fatal error in cryptography to assume an attacker has to solve the problem you think he’d have to solve. For example, you can create a simple encryption scheme by permuting the alphabet and encrypting each letter to its counterpart in the permutation. Someone might naively think “No one can break this because they’d have to try 26! permutations of the alphabet, over 400 million million million million possibilities!” Except that’s not how anyone approaches a substitution cipher. If it were, you wouldn’t see cryptograms in puzzle books.

As far as we know, discrete logarithms are hard to compute when working over integers mod *p* where *p* is a large prime, except for primes that have certain properties. We’ll look at what those properties are below and how to avoid them.

For a prime *p*, the integers mod *p* form a finite field. They are a group under addition, and the non-zero elements form a group under multiplication. It’s the multiplicative group we care about here. This group has order *p*-1, i.e. it has *p*-1 elements.

A group of prime order has no proper subgroups. But a group of composite order does. And our multiplicative group has order *p*-1, which is composite. (Except for *p* = 3, and cryptography depends on primes far, far bigger than 3.)

Sylow’s theorems tell us something about what kinds of subgroups a group must have. If *s* is prime and *s*^{k} is a factor of the order of our group, then the group has a subgroup of order *s*^{k}. We don’t want our multiplicative group to have any small-order subgroups because these would make it easier to compute discrete logarithms.

A **safe prime** *p* has the form 2*q* + 1 where *q* is also prime. Diffie-Hellman chooses safe primes for moduli because this means the multiplicative group of order *p*-1 = 2*q* has no small subgroups. (It has two small subgroups, {1} and {1, -1}, but these can easily be avoided. The algorithm requires picking a generator *g*, and as long as you don’t pick *g* to be 1 or -1 mod *p*, then *g* generates a group of order *q*, and if *p* is gigantic, so is *q*.) Because *q* is prime, the subgroup of order *q* does not have any further subgroups.
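Here’s a small Python sketch illustrating the idea with the toy safe prime 23 = 2 · 11 + 1 (real Diffie-Hellman primes are thousands of bits):

```python
from sympy import isprime

def is_safe_prime(p):
    """p is safe if p = 2q + 1 with q also prime."""
    return isprime(p) and isprime((p - 1) // 2)

assert is_safe_prime(23)   # 23 = 2*11 + 1

def order(g, p):
    """Multiplicative order of g mod p."""
    k, x = 1, g % p
    while x != 1:
        x = x * g % p
        k += 1
    return k

# Apart from 1 and -1 (= 22), every element generates a subgroup
# of order q = 11 or 2q = 22 -- there are no other small subgroups.
orders = {order(g, 23) for g in range(2, 22)}
print(orders)   # {11, 22}
```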

**Related post**: Probability that a number is prime


The biggest uses for automated theorem proving have been highly specialized applications, not mathematically interesting theorems. Computer chip manufacturers use formal methods to verify that given certain inputs their chips produce certain outputs. Compiler writers use formal methods to verify that their software does the right thing. A theorem saying your product behaves correctly is very valuable to you and your customers, but nobody else. These aren’t the kinds of theorems that anyone would cite the way they might cite the Pythagorean theorem. Nobody would ever say “And therefore, by the theorem showing that this particular pacemaker will not fall into certain error modes, I now prove this result unrelated to pacemakers.”

Automated theorem provers are important in these highly specialized applications in part because the results are of such limited interest. For every theorem of wide mathematical interest, there are a large number of mathematicians who are searching for a proof or who are willing to scrutinize a proposed proof. A theorem saying that a piece of electronics performs correctly appeals to only the tiniest audience, and yet is probably much easier (for a computer) to prove.

The term “automated theorem proving” is overloaded to mean a couple things. It’s used broadly to include any use of computing in proving theorems, and it’s used more narrowly to mean software that searches for proofs or even new theorems. Most theorem provers in the broad sense are not automated theorem provers in the more narrow sense but rather proof *assistants*. They verify proofs rather than discover them. (There’s some gray zone. They may search on a small scale, looking for a way to prove a minor narrow result, but not search for the entire proof to a big theorem.) There have been computer-verified proofs of important mathematical theorems, such as the Feit-Thompson theorem from group theory, but I’m not aware of any generally interesting discoveries that have come out of a theorem prover.

**Related post**: Formal methods let you explore the corners

Ideally, a secure hash is “indistinguishable from a random mapping.” [1] So if a hash function has a range of size *N*, how many items can we send through the hash function before we can expect two items to have the same hash value? By the pigeonhole principle, we know that if we hash *N*+1 items, two of them are *certain* to have the same hash value. But it’s likely that a much smaller number of inputs will lead to a collision, two items with the same hash value.

The famous birthday problem illustrates this. You could think of birthdays as a random mapping of people into 366 possible values [2]. In a room of up to 366 people, it’s possible that everyone has a different birthday. But in a group of just 23 people, there are even odds that two people have the same birthday.

Variations on the birthday problem come up frequently. For example, in seeding random number generators. And importantly for this post, the birthday problem comes up in attacking hash functions.

When *N* is large, it is likely that hashing √*N* values will lead to a collision. We prove this below.

The proof below is a little informal. It could be made more formal by replacing the approximate equalities with equalities and adding the necessary little-o terms.

Suppose we’re hashing *n* items to a range of size *N* = *n*^{2}. The exact probability that all *n* items have unique hash values is given here. Taking the log of both sides gives us the first line of the proof below.

The first approximation is based on the first three terms in the asymptotic expansion for log Γ given here, applied to both log gamma expressions. (The third terms from the two asymptotic series are the same so they cancel out.) The second line isn’t exactly what you’d get by applying the asymptotic expansion. It’s been simplified a little. The neglected terms are not a mistake but terms that can be left out because they go to zero.

The second approximation comes from using the first two terms in the power series for log(1 + *x*). One term isn’t enough since that would reduce to zero. The final approximation is simply taking the limit as *n* goes to infinity. Concretely, we’re willing to say that a billion and one divided by a billion is essentially 1.
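The displayed equations being described can be reconstructed along these lines, where *P* is the probability that all *n* hash values are distinct and *N* = *n*^{2}, so that *P* = *N*! / (*N* – *n*)! *N*^{n}:

```latex
\begin{aligned}
\log P &= \log\Gamma(n^2+1) - \log\Gamma(n^2-n+1) - n \log n^2 \\
&\approx \left(n^2 - n + \tfrac12\right)\log\frac{n^2}{n^2-n} - n \\
&= -\left(n^2 - n + \tfrac12\right)\log\left(1 - \frac{1}{n}\right) - n \\
&\approx \left(n^2 - n + \tfrac12\right)\left(\frac{1}{n} + \frac{1}{2n^2}\right) - n \\
&\to -\frac12
\end{aligned}
```

This is a reconstruction following the steps in the surrounding text, not the original figure; the first approximation is the Stirling expansion of both log gamma terms, the second is the two-term series for log(1 – 1/*n*), and the limit is *n* → ∞.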

So the probability of no collisions is exp(-1/2) or about 60%, which means there’s a 40% chance of at least one collision. As a rule of thumb, **a hash function with range of size N can hash on the order of √N values before running into collisions**.
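A quick simulation sketch (with an arbitrarily chosen *N* = 2^{20}) agrees with the rule of thumb:

```python
import random

def collision_probability(N, n, trials=4000):
    """Fraction of trials in which drawing n values uniformly
    from a range of size N produces at least one repeat."""
    hits = 0
    for _ in range(trials):
        seen = set()
        for _ in range(n):
            v = random.randrange(N)
            if v in seen:
                hits += 1
                break
            seen.add(v)
    return hits / trials

# n = sqrt(N): expect collision probability near 1 - exp(-1/2)
p = collision_probability(N=2**20, n=2**10)
print(p)   # close to 0.39
```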

This means that with a 64-bit hash function, there’s about a 40% chance of collisions when hashing 2^{32} or about 4 billion items. If the output of the hash function is discernibly different from random, the probability of collisions may be higher. A 64-bit hash function cannot be secure since an attacker could easily hash 4 billion items. A 256-bit or 512-bit hash could in principle be secure since one could expect to hash far more items before collisions are likely. Whether a particular algorithm like SHA3-512 is actually secure is a matter for cryptologists, but it is at least feasible that a hash with a 512-bit range *could* be secure, based on the size of its range, while a 64-bit hash cannot be.

We used an asymptotic argument above rather than numerically evaluating the probabilities because this way we get a more general result. But even if we were only interested in a fixed but large *n*, we’d run into numerical problems. This is one of those not uncommon cases where a pencil-and-paper approximation is actually more accurate than direct calculation with no (explicit) approximations.

There are numerous numerical problems with direct calculation of the collision probability. First, without taking logarithms we’d run into overflow and underflow. Second, for large enough *n*, *n*^{2} – *n* = *n*^{2} in floating point representation. IEEE 754 doubles have 53 bits of precision. This isn’t enough to distinguish values that differ, say, in the 128th bit. Finally, the two log gamma terms are large, nearly equal numbers. The cardinal rule of numerical analysis is to avoid subtracting nearly equal numbers. If two numbers agree to *k* bits, you could lose *k* bits of precision in carrying out their difference. See Avoiding overflow, underflow, and loss of precision for more along these lines.

[1] Cryptography Engineering by Ferguson, Schneier, and Kohno

[2] Leap year of course complicates things since February 29 birthdays are less common than other birthdays. Another complication is that birthdays are not entirely evenly distributed for the other days of the year. But these complications don’t ruin the party trick: In a room of 30 people, two people usually share a birthday.

My first reply was this:

sqrt(2017) < 45.

2017 not divisible by 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, or 43.

Ergo prime.

I’m not sure that’s what he had in mind. There’s some implied calculation (which I didn’t actually do), so it’s kinda cheating. It would be interesting if there were something special about 2017 that would allow a more transparent proof.

(Implicit in the proof above is the theorem that if a number is composite, it has a prime factor no greater than its square root. If you multiply together numbers bigger than the square root of *n*, you get a product bigger than *n*.)
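The brute-force check that was implied (but skipped) takes only a few lines:

```python
def is_prime(n):
    """Trial division by all d up to sqrt(n)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# 44**2 = 1936 < 2017 < 2025 = 45**2, so checking
# the primes up to 43 suffices.
small_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43]
assert all(2017 % p != 0 for p in small_primes)
assert is_prime(2017)
print("2017 is prime")
```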

Then for fun I gave two more proofs that are more sophisticated but would require far too much work to actually carry out by hand.

The first uses Fermat’s little theorem:

For 0 < *a* < 2017, *a*^{2016} – 1 is divisible by 2017.

2017 is not one of the three Carmichael numbers < 2465.

Ergo prime.

Fermat’s little theorem says that if *p* is prime, then for any 0 < *a* < *p*, *a*^{p – 1} – 1 is divisible by *p*. This is actually an efficient way to prove that a number is *not* prime. Find a number *a* such that the result doesn’t hold, and you’ve proved that *p* isn’t prime. For small numbers, the easiest way to show a number is not prime is to show its factors. But for very large numbers, such as those used in cryptography, it’s efficient to have a way to prove that a number *has* factors without having to actually *produce* those factors.

Unfortunately, Fermat’s little theorem gives a necessary condition for a number to be prime, but not a sufficient condition. It can appear to be prime for every witness (the bases *a* are called witnesses) and still not be a prime. The Carmichael numbers pass the Fermat primality test without being prime. The first four are 561, 1105, 1729, and 2465.
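You can verify that 561 fools the Fermat test with a short loop:

```python
from math import gcd

# 561 = 3 * 11 * 17 is the smallest Carmichael number:
# composite, yet it passes the Fermat test for every
# witness coprime to it.
n = 561
assert n == 3 * 11 * 17
fooled = all(pow(a, n - 1, n) == 1
             for a in range(2, n) if gcd(a, n) == 1)
print(fooled)   # True
```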

For more on using Fermat’s little theorem to test for primality, see Probability that a number is prime.

The final proof was this:

2016! + 1 is divisible by 2017, and so by Wilson’s theorem 2017 is prime.

Unlike Fermat’s little theorem, Wilson’s theorem gives necessary and sufficient conditions for a number to be prime. A number *n* is prime if and only if (*n*-1)! + 1 is divisible by *n*. In theory you could use Wilson’s theorem to test whether a number is prime, but this would be horrendously inefficient. 2016! has 5,789 digits. (You can find out how many digits *n*! has without computing it using a trick described here.)

Despite its inefficiency, you can actually use Wilson’s theorem and SymPy to prove that 2017 is prime.

```python
>>> from sympy import factorial
>>> x = factorial(2016) + 1
>>> x % 2017
0
```


I send out a newsletter at the end of each month. I’ve sent out around 20 so far. They all have two parts:

- a review of the most popular posts of the month, and
- a few words about what I’ve been up to.

That’s it. Short and sweet. I might send out more mail than this someday, but I’ve been doing this for nearly two years and I’ve never sent more than one email a month.

If you’d like to subscribe, just enter your email address in the box on the side of the page labeled “Subscribe to my newsletter.” If you’re not reading this directly on the site, say you’re reading it in an RSS reader, then you can follow this link.

]]>The **cover time** of a graph is the expected time it takes a simple random walk on the graph to visit every node. A simple random walk starts at some node, then at each step chooses with equal probability one of the adjacent nodes. The cover time is defined to be the maximum over all starting points of expected time to visit all nodes.

It seems plausible that adding edges to a graph would decrease its cover time. Sometimes this is the case and sometimes it isn’t, as we’ll demonstrate below.

This post will look at three kinds of graphs:

- a **chain** of *n* vertices
- a **complete graph** on *n* vertices
- a “**lollipop**” with *n* vertices

A chain simply connects vertices in a straight line. A complete graph connects every node to every other node.

A lollipop with *n* vertices takes a chain with *n*/2 vertices and a complete graph on *n*/2 vertices and joins them together. The image above is a lollipop graph of order 10. The complete graph becomes a clique in the new graph.

The image looks more like a kite than a lollipop because that’s the way my software (networkx) drew it by default. If you’d like, imagine the complete graph drawn rounder and the chain straighter so that the graph more closely resembles a lollipop.

Chains have a long cover time. Complete graphs have a short cover time. What about lollipops?

A plausible answer would be that a lollipop graph would have a moderate cover time, somewhere between that of a complete graph and a chain. But that’s not the case. In fact, the lollipop has a longer cover time than either the complete graph or the chain.

You could think of a lollipop graph with *n* vertices as a chain of length *n* with extra edges added. This shows that adding edges does not always decrease the cover time.

The cover times for the three graphs we consider here are of different orders: O(*n* log *n*) for the complete graph, O(*n*^{2}) for the chain, and O(*n*^{3}) for the lollipop.

Now these are only asymptotic results. Big-O notation leaves out proportionality constants. We know that for sufficiently large *n* the cover times are ordered so that the complete graph is the smallest, then the chain, then the lollipop. But does *n* have to be astronomically large before this happens? What happens with a modest size *n*?

There’s a theoretical result that says the cover time is bounded by 2*m*(*n*-1) where *m* is the number of edges and *n* the number of vertices. This tells us that when we say the cover time for the chain is O(*n*^{2}) we can say more, that it’s less than 2*n*^{2}, so at least in this case the proportionality constant isn’t large.

We’ll look at the cover times with some simulation results below, using *n* = 10 and *n* = 30 with 10,000 runs each.

With 10 nodes each, the cover times for the complete graph, chain, and lollipop were 23.30, 82.25, and 131.08 respectively.

With 30 nodes the corresponding cover times were 114.37, 845.16, and 3480.95.

So the simulated cover times were ordered as the asymptotic theory would suggest.
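A simulation along these lines can be sketched with networkx. For simplicity this sketch fixes a single starting node rather than maximizing over starting nodes, so it estimates a lower bound on the cover time:

```python
import random
import networkx as nx

def cover_time(G, start, trials=1000):
    """Average number of steps for a simple random walk from
    `start` to visit every node of G, over many trials."""
    nodes = set(G.nodes)
    total = 0
    for _ in range(trials):
        visited = {start}
        loc = start
        steps = 0
        while visited != nodes:
            loc = random.choice(list(G.neighbors(loc)))
            visited.add(loc)
            steps += 1
        total += steps
    return total / trials

n = 10
results = {
    "complete": cover_time(nx.complete_graph(n), 0),
    "chain":    cover_time(nx.path_graph(n), 0),
    "lollipop": cover_time(nx.lollipop_graph(n // 2, n // 2), 0),
}
print(results)   # complete < chain < lollipop
```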

He had four names at various times. A person’s life is heterogeneous, so this could be seen as an advantage. Life’s parts sometimes have little in common, so little that it might appear that various people lived them. When this happens, it is difficult not to feel surprised that all these people carry the same name.

This reminded me of the section of James Scott’s Seeing Like a State that explains how names used to be more variable.

Among some peoples, it is not uncommon for individuals to have different names during different stages of life (infancy, childhood, adulthood) and in some cases after death; added to these are names used for joking, rituals, and mourning and names used for interactions with same-sex friends or with in-laws. Each name is specific to a certain phase of life, social setting, or interlocutor.

If someone’s name had more than one component, the final component might come from their profession (which could change) rather than their ancestry. Scott goes on to say

The invention of permanent, inherited patronyms was … the last step in establishing the necessary preconditions of modern statecraft. In almost every case it was a state project, designed to allow officials to identify, unambiguously, the majority of its citizens.

In short, governments insisted people adopt fixed names to make them easier to tax and to conscript. Before fixed names, governments would ask towns to provide so much tax money or so many soldiers because it could not tax or conscript citizens directly. For a famous example, see Luke’s account of the birth of Jesus: all went to be registered, *each to his own town*.

It’s hard to imagine people not needing fixed names. But when people lived on a smaller scale, interacting with a number of people closer to Dunbar’s number, there was no danger of ambiguity because there was more context.


In a previous post we looked at sums of powers, *S*_{p}(*n*) = 1^{p} + 2^{p} + … + *n*^{p}, where *p* was a positive integer. Here we look at what happens when we let *p* be a *negative* integer and we let *n* go to infinity. We’ll learn more about Bernoulli numbers and we’ll see what is meant by apparently absurd statements such as 1 + 2 + 3 + … = -1/12.

If *p* < -1, then the limit as *n* goes to infinity of *S*_{p}(*n*) is ζ(*-p*). That is, for *s* > 1, the Riemann zeta function ζ(*s*) is defined by

ζ(*s*) = 1 + 1/2^{s} + 1/3^{s} + 1/4^{s} + …

We don’t have to limit ourselves to real numbers *s* > 1; the definition holds for complex numbers *s* with real part greater than 1. That’ll be important below.

When *s* is a positive even number, there’s a formula for ζ(*s*) in terms of the Bernoulli numbers:

ζ(2*n*) = (-1)^{n+1} (2π)^{2n} *B*_{2n} / 2(2*n*)!

The best-known special case of this formula is that

1 + 1/4 + 1/9 + 1/16 + … = π^{2} / 6.

It’s a famous open problem to find a closed-form expression for ζ(3) or any other odd argument.

The formula relating the zeta function and Bernoulli tells us a couple things about the Bernoulli numbers. First, for *n* ≥ 1 the Bernoulli numbers with index 2*n* alternate sign. Second, by looking at the sum defining ζ(2*n*) we can see that it is approximately 1 for large *n*. This tells us that for large *n*, |*B*_{2n}| is approximately (2*n*)! / 2^{2n-1} π^{2n}.

We said above that the sum defining the Riemann zeta function is valid for complex numbers *s* with real part greater than 1. There is a unique analytic extension of the zeta function to the rest of the complex plane, except at *s* = 1. The zeta function is defined, for example, at negative integers, but the sum defining zeta in the half-plane Re(*s*) > 1 is NOT valid.

You may have seen the equation

1 + 2 + 3 + … = -1/12.

This is an abuse of notation. The sum on the left clearly diverges to infinity. But if the sum defining ζ(*s*) for Re(*s*) > 1 were valid for *s* = -1 (which it is not) then the left side would equal ζ(-1). The analytic continuation of ζ *is* valid at -1, and in fact ζ(-1) = -1/12. So the equation above is true if you interpret the left side, not as an ordinary sum, but as a way of writing ζ(-1). The same approach could be used to make sense of similar equations such as

1^{2} + 2^{2} + 3^{2} + … = 0

and

1^{3} + 2^{3} + 3^{3} + … = 1/120.
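SymPy evaluates the analytic continuation of ζ at negative integers exactly, so these values are easy to check:

```python
from sympy import zeta, Rational

# Values of the analytically continued zeta function
# at negative integers, computed exactly by SymPy
print(zeta(-1))   # -1/12
print(zeta(-2))   # 0
print(zeta(-3))   # 1/120
```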

1 + 2 + 3 + … + *n* = *n*(*n* + 1) / 2

There’s also a formula for the sum of the first *n* squares

1^{2} + 2^{2} + 3^{2} + … + *n*^{2} = *n*(*n* + 1)(2*n* + 1) / 6

and for the sum of the first *n* cubes:

1^{3} + 2^{3} + 3^{3} + … + *n*^{3} = *n*^{2}(*n* + 1)^{2} / 4

It’s natural to ask whether there’s a general formula for all exponents. There is, but it’s not entirely satisfying. There’s a single formula for the sum of the *p*th powers of the first *n* positive integers, but it involves mysterious coefficients known as Bernoulli numbers. So there’s a fairly simple formula for sums of powers in terms of Bernoulli numbers, but there’s no simple formula for Bernoulli numbers.

At first glance this might not seem that useful because we have a formula for a sum in terms of another sum, but note that the sums run over different ranges. The original sum runs up to *n*, and in application *n* might be very large. The sum on the right runs up to the exponent *p*, and in application *p* is usually quite small.
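Here is a sketch of the formula in Python, using SymPy’s Bernoulli numbers. The sign convention for *B*_{1} differs between references (and between SymPy versions), so the code forces *B*_{1} = +1/2, the convention that makes the sum run up to *n*:

```python
from sympy import Rational, bernoulli, binomial

def sum_powers(n, p):
    """Sum of k**p for k = 1..n via the Bernoulli-number formula:
    S_p(n) = (1/(p+1)) * sum_j C(p+1, j) * B_j * n**(p+1-j),
    with B_1 taken to be +1/2."""
    total = 0
    for j in range(p + 1):
        B = Rational(1, 2) if j == 1 else bernoulli(j)
        total += binomial(p + 1, j) * B * n**(p + 1 - j)
    return total / (p + 1)

print(sum_powers(100, 1))   # 5050
print(sum_powers(10, 3))    # 3025 = (10*11/2)**2
```

Note that the outer sum has only *p* + 1 terms, however large *n* is.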

The first few Bernoulli numbers *B*_{j}, starting with *j* = 0, are 1, -1/2, 1/6, 0, -1/30, 0, …. This short list might give you some hope of finding a nice formula for the Bernoulli numbers, but they’re full of surprises. The 12th Bernoulli number, for example, is -691/2730.

So how do you define the Bernoulli numbers? These numbers come up in many contexts, and so there are many ways to define them. One way is to say that the exponential generating function of the Bernoulli numbers is *z* / (exp(*z*) - 1). In other words, *B*_{j} is *j*! times the coefficient of *z*^{j} in the power series for *z* / (exp(*z*) - 1) centered at *z* = 0.

There are a couple ways to look at this definition. The first reaction is probably disappointment. “You won’t just tell me what the Bernoulli numbers are. You tell me that I have to calculate another term of this ugly power series every time I want a new Bernoulli number?!” But from a more advanced perspective, you might be grateful that the unwieldy Bernoulli numbers have such a simple exponential generating function. Often the most useful ways to study special numbers is via their generating functions.
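You can recover the first several Bernoulli numbers directly from this generating function with SymPy:

```python
from sympy import symbols, exp, series, factorial, Rational

z = symbols('z')
# Expand the exponential generating function z / (exp(z) - 1)
f = series(z / (exp(z) - 1), z, 0, 8).removeO()
# B_j is j! times the coefficient of z**j
B = [f.coeff(z, j) * factorial(j) for j in range(7)]
print(B)   # [1, -1/2, 1/6, 0, -1/30, 0, 1/42]
```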

The next post will look at what happens when we let *p* be negative, and when we let *n* go to infinity. In other words, we’ll look at the Riemann zeta function.

By rapid mixing we mean that a random walk approaches its limiting (stationary) probability distribution quickly relative to random walks on other graphs.

Here’s a graph that supports rapid mixing. Pick a prime *p* and label nodes 0, 1, 2, 3, …, *p*-1. Create a circular graph by connecting each node *k* to *k*+1 (mod *p*). Then add an edge between each non-zero *k* and its multiplicative inverse (mod *p*). If an element is its own inverse, add a loop connecting the node to itself. For the purpose of this construction, consider 0 to be its own inverse. This construction comes from [1].

Here’s what the graph looks like with *p* = 23.

This graph doesn’t show the three loops from a node to itself, nodes 1, 0, and 22.

At each node in the random walk, we choose an edge uniformly and follow it. The stationary distribution for a random walk on this graph is uniform. That is, in the long run, you’re equally likely to be at any particular node.

If we pick an arbitrary starting node, 13, and take 30 steps of the random walk, the simulated probability distribution is nearly flat.

By contrast, consider a variation on the random walk above: from each node, we move left, move right, or stay in place with equal probability. This walk is not as well mixed after 100 steps as the original walk is after only 30 steps.

You can tell from the graph that the walk started around 13. In the previous graph, evidence of the starting point had vanished.

The code below was used to produce these graphs.

To investigate how quickly a walk on this graph converges to its limiting distribution, we could run code like the following.

```python
from random import random
import numpy as np
import matplotlib.pyplot as plt

m = 23

# Multiplicative inverses mod 23
inverse = [0, 1, 12, 8, 6, 14, 4, 10, 3, 18, 7, 21, 2,
           16, 5, 20, 13, 19, 9, 17, 15, 11, 22]

# Verify that the inverses are correct
assert(len(inverse) == m)
for i in range(1, m):
    assert (i*inverse[i] % 23 == 1)

# Simulation
num_reps = 100000
num_steps = 30

def take_cayley_step(loc):
    u = random() * 3
    if u < 1:   # move left
        newloc = (loc + m - 1) % m
    elif u < 2: # move right
        newloc = (loc + 1) % m
    else:
        newloc = inverse[loc] # or newloc = loc
    return newloc

stats = [0]*m
for i in range(num_reps):
    loc = 13 # arbitrary fixed starting point
    for j in range(num_steps):
        loc = take_cayley_step(loc)
    stats[loc] += 1

normed = np.array(stats)/num_reps
plt.plot(normed)
plt.xlim(1, m)
plt.ylim(0, 2/m)
plt.show()
```


* * *

[1] Shlomo Hoory, Nathan Linial, Avi Wigderson. “Expander graphs and their applications.” Bulletin of the American Mathematical Society, Volume 43, Number 4, October 2006, pages 439–561.

If you thumb through Guy Steele’s book Common Lisp: The Language, 2nd Edition, you might be surprised how much space is devoted to defining familiar functions: square root, log, arcsine, etc. He gives some history of how these functions were first defined in Lisp and then refined by the ANSI (X3J13) standardization committee in 1989.

There are three sources of complexity:

- Complex number arguments
- Multi-valued functions
- +0 and -0

The functions under discussion are defined for either real or complex inputs. This does not complicate things much in itself. Defining some functions for complex arguments, such as the `exp` function, is simple. The complication comes from the interaction of complex arguments with multi-valued functions and floating point representations of zero.

The tricky functions to define are inverse functions, functions where we have to make a choice of range.

Let’s restrict our attention to real numbers for a moment. How do you define the square root of a positive number *x*? There are two solutions to the equation *y*^{2} = *x*, and √*x* is defined to be the positive solution.

What about the arcsine of *x*? This is the number whose sine is *x*. Except there is no “the” number: there are infinitely many numbers whose sine is *x*, so we have to make a choice. It seems natural to choose values in an interval symmetric about 0, so we take the arcsine of *x* to be the number *between -π/2 and π/2* whose sine is *x*.

Now what about arctangent? As with arcsine, we have to make a choice because for any *x* there are infinitely many numbers *y* whose tangent is *x*. Again it’s convenient to define the range to be in an interval symmetric about zero, so we define the arctangent of *x* to be the number *y* *between -π/2 and π/2* whose tangent is *x*. But now we have a subtle complication with tangent we didn’t have with sine because tangent is unbounded. How do we want to define the tangent of a vertical angle? Should we call it ∞ or -∞? What do we want to return if someone asks for the arctangent of ±∞? Should we return π/2 or -π/2?

The discussion shows there are some minor complications in defining inverse functions on the real line. Things get more complicated when working in the complex plane. To take the square root example, it’s easy to say we’re going to define square root so that the square root of a positive number is another positive number. Fine. But which solution to *z*^{2} = *w* should we take to be the square root of a complex number *w*, such as 2 + 3*i* or -5 + 17*i*?

Or consider logarithms. For positive numbers *x* there is only one real number *y* such at exp(*y*) = *x*. But what if we take a negative value of *x* such as -1? There’s no *real* number whose exponential is -1, but there is a complex number. In fact, there are infinitely many complex numbers whose exponential is -1. Which one should we choose?

A little known feature of floating point arithmetic (specified by the IEEE 754 standard) is that there are two kinds of zero: +0 and -0. This sounds bizarre at first, but there are good reasons for this, which I explain here. But defining functions to work properly with two kinds of zero takes a lot of work. This was the main reason the ANSI Common Lisp committee had to revise their definitions of several transcendental functions. If a function has a branch cut discontinuity along the real axis, for example, you want your definition to be continuous as you approach *x* + 0*i* from above and as you approach *x* -0*i* from below.

I’ll cut to the chase and present the solution the X3J13 committee came up with. For a discussion of the changes this required and the detailed justifications, see Guy Steele’s book.

The first step is to carefully define the two-argument arctangent function `(atan y x)` for all 16 combinations of *y* and *x* being positive, negative, +0, or -0. Then other functions are defined as follows.

- Define `phase` in terms of `atan`.
- Define complex `abs` in terms of real `sqrt`.
- Define complex `log` in terms of `phase`, complex `abs`, and real `log`.
- Define complex `sqrt` in terms of complex `log`.
- Define everything else in terms of the functions above.

The actual implementations may not follow this sequence, but they have to produce results consistent with this sequence.

The phase of *z* is defined as the arctangent with arguments the imaginary and real parts of *z*.
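Signed zeros are exactly what make the 16 cases of the two-argument arctangent meaningful. As an illustration in Python rather than Lisp (`math.atan2` follows the same IEEE 754 conventions), the sign of a zero argument selects which side of the branch cut along the negative real axis you land on:

```python
import math

# Approaching the negative real axis from above (y = +0) gives pi;
# approaching from below (y = -0) gives -pi.
print(math.atan2(0.0, -1.0))    # pi
print(math.atan2(-0.0, -1.0))   # -pi

# The all-zero cases are distinguished too.
print(math.atan2(0.0, -0.0), math.atan2(-0.0, -0.0))   # pi, -pi
```

Without signed zeros there would be no way to distinguish these limits, and functions defined in terms of `atan` would be discontinuous on one side of the cut.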

The complex log of *z* is defined as log |*z*| + *i* phase(*z*).

Square root of *z* is defined as exp( log(*z*) / 2 ).
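Here is a sketch of that sequence of definitions in Python, building phase, complex absolute value, log, and square root out of real functions, in the same order as above. This is an illustration, not a production implementation; real implementations must also handle zeros, infinities, NaNs, and overflow carefully.

```python
import math

def phase(z):
    # phase defined via the two-argument arctangent of (imag, real)
    return math.atan2(z.imag, z.real)

def cabs(z):
    # complex abs in terms of real functions; hypot(x, y) = sqrt(x^2 + y^2)
    return math.hypot(z.real, z.imag)

def clog(z):
    # complex log: log|z| + i * phase(z)
    return complex(math.log(cabs(z)), phase(z))

def csqrt(z):
    # complex sqrt: exp(log(z) / 2)
    w = clog(z) / 2
    return complex(math.exp(w.real) * math.cos(w.imag),
                   math.exp(w.real) * math.sin(w.imag))

print(csqrt(complex(-4, 0)))  # approximately 2i, on the chosen branch
```

Because `csqrt` inherits its branch cut from `clog`, which inherits it from `phase`, getting `atan2` right at the signed zeros settles the behavior of everything downstream.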

The inverses of circular and hyperbolic functions are defined as follows.

Note that there are many ways to define these functions that seem to be equivalent, and are equivalent *in some region*. Getting the branch cuts right is what makes this complicated.

]]>

Bayesian methods are often characterized as “subjective” because the user must choose a prior distribution, that is, a mathematical expression of prior information. The prior distribution requires information and user input, that’s for sure, but I don’t see this as being any more “subjective” than other aspects of a statistical procedure, such as the choice of model for the data (for example, logistic regression), the choice of which variables to include in a prediction, the choice of which coefficients should vary over time or across situations, the choice of statistical test, and so forth. Indeed, Bayesian methods can in many ways be more “objective” than conventional approaches in that Bayesian inference, with its smoothing and partial pooling, is well adapted to including diverse sources of information and thus can reduce the number of data coding or data exclusion choice points in an analysis.

People worry about prior distributions, not because they’re subjective, but because they’re *explicitly* subjective. There are many other subjective factors, common to Bayesian and Frequentist statistics, but these are *implicitly* subjective.

In practice, prior distributions often don’t make much difference. For example, you might show that an optimistic prior and a pessimistic prior lead to the same conclusion.

If you have so little data that the choice of prior does make a substantial difference, being able to specify the prior is a benefit. Suppose you only have a little data but have to make a decision anyway. A frequentist might say there’s too little data for a statistical analysis to be meaningful. So what do you do? Make a decision entirely subjectively! But with a Bayesian approach, you capture what is known outside of the data at hand in the form of a prior distribution, and update the prior with the little data you have. In this scenario, a Bayesian analysis is less subjective and more informed by the data than a frequentist approach.
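To make the small-data point concrete, here is a toy example of my own (not from the post): a beta prior updated with binomial data. With only 10 observations, an optimistic prior and a pessimistic prior pull the posterior mean in noticeably different directions; with 1,000 observations they nearly agree.

```python
def posterior_mean(a, b, successes, trials):
    # A Beta(a, b) prior with s successes in n binomial trials
    # yields a Beta(a + s, b + n - s) posterior, whose mean is:
    return (a + successes) / (a + b + trials)

# Optimistic prior Beta(8, 2), mean 0.8; pessimistic prior Beta(2, 8), mean 0.2.
print(posterior_mean(8, 2, 3, 10))       # 0.55  -- small data: priors disagree
print(posterior_mean(2, 8, 3, 10))       # 0.25
print(posterior_mean(8, 2, 300, 1000))   # about 0.305 -- more data: priors wash out
print(posterior_mean(2, 8, 300, 1000))   # about 0.299
```

The prior matters exactly when the data are scarce, and that is precisely the situation where you have to bring in outside information one way or another.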

]]>Saying “nobody” is a bit of hyperbole. It happens, but not often. Most problems come to you in the language of the client’s business, such as “We want to make sure this machine doesn’t run into that machine” or “We’re trying to predict kidney failure.” [1]

Recently Charles McCreary from CRM Engineering sent me the most specific problem statement I’ve ever seen:

I need to calculate the von Mises yield envelope in the axial force-internal/external pressure domain for Lamé’s solution …

That really took me by surprise. Sometimes a lead will mention mathematical terminology like “particle filters” or “finite elements,” though even this level of detail is uncommon. I’ve never seen something so specific.

It’s still the case that a lot of work goes into formulating a problem. I’m sure Charles’ client didn’t approach *him* with this statement. I’m consulting to a consultant who has already done the hard work of reducing the original application down to a purely mathematical problem. (I’d call it “purely mathematical” rather than “pure mathematical” because it’s definitely applied math.) I look forward to digging into it as soon as I wrap up what I’m currently working on.

* * *

[1] Nobody has come to me wanting to predict kidney failure, but I’ve worked on predicting several other maladies that I’ll not mention to maintain confidentiality.

]]>For example, suppose you want to compute the following integral, which comes up frequently in probability:

∫ exp(-*x*^{2}) *dx*

where the integral is taken over the whole real line.

There is no (elementary) function whose derivative is exp(-*x*^{2}). It’s not just hard to find or ugly. It simply doesn’t exist, not within the universe of elementary functions. There are functions whose derivative is exp(-*x*^{2}), but these functions are not finite algebraic combinations of the kinds of functions you’d see in high school.

If you think of the definite integral above as meaning “the result you get when you find an antiderivative, let its arguments go off to ∞ and -∞, and subtract the two limits” then you’ll never calculate it. And when you hear that the antiderivative doesn’t exist (in the world of functions you’re familiar with) then you might think that not only can you not calculate the integral, no one can.

In fact the integral is easy to calculate. It requires an ingenious trick [2], but once you see that trick it’s not hard.

Let *I* be the value of the integral. Changing the integration variable makes no difference, i.e.

*I* = ∫ exp(-*x*^{2}) *dx* = ∫ exp(-*y*^{2}) *dy*

and so

*I*^{2} = ∫∫ exp(-(*x*^{2} + *y*^{2})) *dx* *dy*

where the double integral is taken over the entire plane.

This integral can be converted to polar coordinates. Instead of describing the plane as an infinite square with *x* and *y* each going off to infinity in both directions, we can think of it as an infinite disk, with radius going off to infinity. The advantage of this approach is that the Jacobian of the change of variables gives us an extra factor of *r* that makes the exponential integral tractable:

*I*^{2} = ∫_{0}^{2π} ∫_{0}^{∞} exp(-*r*^{2}) *r* *dr* *dθ*

From this we get *I*^{2} = π and so *I* = √π.
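As a sanity check (not part of the proof), the value is easy to confirm numerically. The integrand decays so fast that truncating the integral to [-10, 10] loses less than exp(-100); a simple midpoint rule then agrees with √π to several digits.

```python
import math

# Midpoint-rule approximation to the integral of exp(-x^2) over [-10, 10].
# The tail beyond |x| = 10 is smaller than exp(-100), so it's negligible.
n = 100_000
a, b = -10.0, 10.0
h = (b - a) / n
total = h * sum(math.exp(-(a + (i + 0.5) * h) ** 2) for i in range(n))

print(total)               # close to sqrt(pi)
print(math.sqrt(math.pi))
```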

This specific trick comes up occasionally. But more generally, it is often the case that definite integrals are easier to compute than indefinite integrals. One of the most common applications of complex analysis is computing such integrals through the magic of contour integration. This leads to a lesson closely related to the one above, namely that **you may not have to do what it looks like you need to do**. In this case, you don’t always need to compute indefinite integrals (anti-derivatives) as an intermediate step to compute definite integrals. [3]

Mathematics is filled with theorems that effectively say that you don’t actually have to compute what you conceptually need to compute. Sometimes you can get by with calculating much less.

* * *

[1] One frustration I’ve had working with statisticians is that many have forgotten the distinction between *what* they want to calculate and *how* they calculate it. This makes it difficult to suggest better ways of computing things.

[2] Lord Kelvin said of this trick “A mathematician is one to whom *that* is as obvious as that twice two makes four is to you. Liouville was a mathematician.”

[3] If you look back carefully, we had to compute the integral of exp(-*r*^{2}) *r*, which you would do by first computing its anti-derivative. But we didn’t have to compute the anti-derivative of the original integrand. We traded a hard (in some sense impossible) anti-derivative problem for an easy one.

My choir director persuaded me to try anyway, with just a few days before auditions. That wasn’t enough time for me to learn the music with all its strange intervals. But I tried out. I sang the whole thing. As awful as it was, I kept going. It was about as terrible as it could be, just good enough to not be funny. I wanted to walk out, and maybe I should have out of compassion for the judges, but I stuck it out.

I was proud of that audition, not as a musical achievement, but because I powered through something humiliating.

I did better in band than in choir. I made Area in band and tried out for State but didn’t make it. I worked hard for that one and did a fair job, but simply wasn’t good enough.

That turned out well. It was my senior year, and I was debating whether to major in math or music. I’d told myself that if I made State, I’d major in music. I didn’t make State, so I majored in math and took a few music classes for fun. We can never know how alternative paths would have worked out, but it’s hard to imagine that I would have succeeded as a musician. I didn’t have the talent or the temperament for it.

When I was in college I wondered whether I should have done something like acoustical engineering as a sort of compromise between math and music. I could imagine that working out. Years later I got a chance to do some work in acoustics and enjoyed it, but I’m glad I made a career of math. Applied math has given me the chance to work in a lot of different areas—to play in everyone else’s back yard, as John Tukey put it—and I believe it suits me better than music or acoustics would have.

]]>`shell-mode` to work as I’d like. Maybe this will save someone else some time if they want to do the same.
I’ve used a Mac occasionally since the days of the beige toasters, but I never owned one until recently. I’ve said for years that I’d buy a Mac as soon as I have a justification, and I recently started a project that needs a Mac.

I’d heard that Emacs was hard to set up on Mac, but that has not been my experience. I’m running Emacs 25.1 on macOS 10.12.1. Maybe there were problems with earlier versions of Emacs or OS X that I skipped. Or maybe there are quirks I haven’t run into yet. So far my only difficulties have been related to running a shell inside Emacs.

The first problem I ran into is that my path is not the same inside `shell-mode` as in a terminal window. A little searching showed a lot of discussion of this problem but no good solutions. My current solution is to run `source .bash_profile` from my bash shell inside Emacs to manually force it to read the configuration file. There’s probably a way to avoid this, and if you know how, please tell me, but this works OK for now.

Manually sourcing the `.bash_profile` file works for bash but doesn’t work for Eshell. I doubt I’ll have much use for Eshell, however. It’s more useful on Windows when you want a Unix-like shell inside Emacs.

**Update**: Dan Schmidt pointed out in the comments that Emacs reads `.bashrc` rather than `.bash_profile`. It seems that Mac doesn’t read `.bashrc` at all, at least not if it can find a `.bash_profile` file. I created a `.bashrc` file that sources `.bash_profile` and that fixed my problem, though it did not fix the problem with Eshell or the path problem below.
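For concreteness, the bridge file is just this minimal sketch; if your `.bash_profile` does anything that assumes a login shell or an interactive terminal, you may need more care:

```shell
# ~/.bashrc -- bridge so shells that read .bashrc pick up
# the settings kept in .bash_profile
if [ -f "$HOME/.bash_profile" ]; then
    . "$HOME/.bash_profile"
fi
```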

The second problem I had was that Control-up arrow does not scroll through shell history because that key combination has special meaning to the operating system, bringing up Mission Control. Quite a surprise when you expect to scroll through previous commands but instead your entire screen changes.

I got around this by putting the following code in my Emacs config file and using Alt-up and Alt-down instead of Control-up and Control-down to scroll shell history. (I’m using my beloved Microsoft Natural keyboard, so I have an Alt key.)

(add-hook 'shell-mode-hook
          (lambda ()
            (define-key shell-mode-map (kbd "<M-up>") 'comint-previous-input)
            (define-key shell-mode-map (kbd "<M-down>") 'comint-next-input)))

The last problem I had was running the Clojure REPL inside Emacs. When I ran `lein repl` from bash inside Emacs I got an error saying the command was not found. Apparently running `source .bash_profile` didn’t give me entirely the same path in Emacs as in a terminal. I was able to fix this by adding the following to my Emacs config file.

(add-to-list 'exec-path "/usr/local/bin")

This works, though there are a couple things I don’t understand. First, I don’t understand why `/usr/local/bin` was missing from my path inside Emacs. Second, I don’t understand why adding the path customizations from my `.bash_profile` to `exec-path` doesn’t work. Until I need to understand this, I’m willing to let it remain a mystery.

After fixing the problems mentioned in the original post, I ran into another problem. Trying to run LaTeX on a file failed, saying that `pdflatex` couldn’t be found. Adding the path to `pdflatex` to the `exec-path` didn’t work. But the following code from the TeX Stack Exchange did work:

(getenv "PATH")
(setenv "PATH" (concat "/Library/TeX/texbin" ":" (getenv "PATH")))

This is the path for El Capitan and Sierra. The path is different in earlier versions of the OS.

By the way, you can use one configuration file across operating systems by putting code like this in your file.

(cond ((string-equal system-type "windows-nt")
       (progn
         ; Windows-specific configurations
         ...))
      ((string-equal system-type "gnu/linux")
       (progn
         ; Linux-specific configurations
         ...))
      ((string-equal system-type "darwin")
       (progn
         ; Mac-specific configurations
         ...)))

If you need machine-specific configuration for two machines running the same OS, you can test `system-name` rather than `system-type`.

See my answer here and take a look at some of the other answers on the same site.

Yes. See my post Advice for going solo.

Shortly after I went out on my own, I wrote this post responding to questions people had about my particular situation. My answers there remain valid, except one. I said that I planned to do anything I can do well that also pays well. That was true at the time, but I’ve gotten a little more selective since then.

Only in general terms. For example, I did some work with psychoacoustics earlier this year, and lately I’ve been working with medical device startups and giving expert testimony.

Nearly all the work I do is covered under NDA (non-disclosure agreement). Occasionally a project will be public, such as the white paper I wrote for Hitachi Data Systems comparing replication and erasure coding. But usually a project is confidential, though I hope to be able to say more about some projects after they come to market.

I wrote an FAQ post of sorts a few years ago. Here are the questions from that post that people still ask fairly often.

- Can you recommend an introductory statistics book?
- Why do you prefer Python to R?
- How do you run your Twitter accounts?
- Why don’t you use XeTeX?
- Can I use your code?
- How can you test a random number generator?
- How do you type math characters in HTML?

You can use this page to send me a question and see my various contact information. The page also has a link to a vCard you could import into your contact manager.

]]>The University of Texas band’s half time show that year was a beautiful tribute to the fallen A&M students.

]]>I’d say it’s not a book about networks per se but a collection of topics associated with networks: cell phone protocols, search engines, auctions, recommendation engines, etc. It would be a good introduction for non-technical people who are curious about how these things work. More technically inclined folks probably already know much of what’s here.

]]>Productivity tip: Work hard.

— John D. Cook (@JohnDCook) October 8, 2015

I don’t know how people take it, but here’s what I meant by it. Sometimes you can find a smarter way to work, and if you can, I assume you’re doing that. Don’t drive nails with your shoe if you can find a hammer. But ultimately the way to get things done is hard work. You might see some marginal increase in productivity from using some app or another, but there’s nothing that’s going to magically make you 10x more productive without extra effort.

Many people have replied on Twitter “I think you mean ‘work smart.'” At some point “work smarter” wasn’t a cliché, but now it is. The problem of our time isn’t people brute-forcing their way with hard, thoughtless work. We’re more likely to wish for a silver bullet. We’re gnostics.

Smart work is a kind of hard work. It may take less physical work but more mental work. Or less mental work and more emotional work. It’s hard work to try to find a new perspective and take risks.

One last thought: hard work is not necessarily long work. Sometimes it is, but often not. Hard creative work requires bursts of mental or emotional effort that cannot be sustained for long.

]]>

We are very good at building complex software systems that work 95% of the time. But we do not know how to build complex software systems that are ultra-reliably safe (i.e. *P*_{f} < 10^{-7}/hour).

Emphasis added.

Developing medium-reliability and high-reliability software are almost entirely different professions. Using typical software development procedures on systems that must be ultra-reliable would invite disaster. But using extremely cautious development methods on systems that can afford to fail relatively often would be an economic disaster.

**Related post**: Formal validation methods let you explore the corners

]]>

Rivalries seem sillier to outsiders the more similar the two options are. And yet this makes sense. I’ve forgotten the psychological term for this, but it has a name: Similar things compete for space in your brain more than things that are dissimilar. For example, studying French can make it harder to spell English words. (Does *literature* have two t’s in French and one in English or is it the other way around?) But studying Chinese doesn’t impair English orthography.

It’s been said that academic politics are so vicious because the stakes are so small [1]. Similarly, there are fierce technological loyalties because the differences with competing technologies are so small, small enough to cause confusion. My favorite example: I can’t keep straight which languages use `else if`, `elif`, `elseif`, … in branching.

If you have to learn two similar technologies, it may be easier to devote yourself exclusively to one, then to the other, then use both and learn to keep them straight.

**Related post**: Ford-Chevy arguments in technology

[1] I first heard this attributed to Henry Kissinger, but there’s no agreement on who first said it. Several people have said similar things.

]]>

How does the *n*th order Taylor polynomial *T*_{n}(*x*) = 1 + *x* + *x*^{2}/2! + … + *x*^{n}/*n*! compare to its limit, exp(*x*)? We might want to know because it’s often useful to have polynomial upper or lower bounds on exp(*x*).

For *x* > 0 it’s clear that exp(*x*) is larger than *T*_{n}(*x*) since the discarded terms in the power series for exp(*x*) are all positive.

The case of *x* < 0 is more interesting. There exp(*x*) > *T*_{n}(*x*) if *n* is odd and exp(*x*) < *T*_{n}(*x*) if *n* is even.

Define *f*_{n}(*x*) = exp(*x*) – *T*_{n}(*x*). If *x* > 0 then *f*_{n}(*x*) > 0.

We want to show that if *x* < 0 then *f*_{n}(*x*) > 0 for odd *n* and *f*_{n}(*x*) < 0 for even *n*.

For *n* = 1, note that *f*_{1} and its derivative are both zero at 0. Now suppose *f*_{1} is zero at some point *a* < 0. Then by Rolle’s theorem, there is some point *b* with *a* < *b* < 0 where the derivative of *f*_{1} is 0. Since the derivative of *f*_{1} is also zero at 0, there must be some point *c* with *b* < *c* < 0 where the second derivative of *f*_{1} is 0, again by Rolle’s theorem. But the second derivative of *f*_{1} is exp(*x*), which is never 0. So our assumption *f*_{1}(*a*) = 0 leads to a contradiction.

Now *f*_{1}(0) = 0 and *f*_{1}(*x*) ≠ 0 for *x* < 0. So *f*_{1}(*x*) must be always positive or always negative. Which is it? For negative *x*, exp(*x*) is bounded and so

*f*_{1}(*x*) = exp(*x*) – 1 – *x*

is eventually dominated by the –*x* term, which is positive since *x* is negative.

The proof for *n* = 2 is similar. If *f*_{2}(*x*) is zero at some point *a* < 0, then we can use Rolle’s theorem to find a point *b* < 0 where the derivative of *f*_{2} is zero, and a point *c* < 0 where the second derivative is zero, and a point *d* < 0 where the third derivative is zero. But the third derivative of *f*_{2} is exp(*x*) which is never zero.

As before the contradiction shows *f*_{2}(*x*) ≠ 0 for *x* < 0. So is *f*_{2}(*x*) always positive or always negative? This time we have

*f*_{2}(*x*) = exp(*x*) – 1 – *x* – *x*^{2}/2

which is eventually dominated by the –*x*^{2}/2 term, which is negative.

For general *n*, we assume *f*_{n} is zero for some point *x* < 0 and apply Rolle’s theorem *n*+1 times to reach the contradiction that exp(*x*) is zero somewhere. This tells us that *f*_{n}(*x*) is never zero for negative *x*. We then look at the dominant term –*x*^{n}/*n*! to argue that *f*_{n} is positive or negative depending on whether *n* is odd or even.

Another way to show the sign of *f*_{n}(*x*) for negative *x* would be to apply the alternating series theorem to *x* = -1.
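A numerical spot check of this sign pattern (a check, not a proof) takes only a few lines: for a grid of negative *x*, the difference exp(*x*) − *T*_{n}(*x*) should be positive for odd *n* and negative for even *n*.

```python
import math

def T(n, x):
    # Taylor polynomial of exp at 0, through the x^n term
    return sum(x**k / math.factorial(k) for k in range(n + 1))

for n in range(1, 6):
    signs = {math.copysign(1, math.exp(x) - T(n, x))
             for x in [-0.5, -1.0, -2.0, -4.0]}
    print(n, signs)  # odd n -> {1.0}, even n -> {-1.0}
```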

]]>It’s always been “We can’t do it that way. It would be too slow.”

You know what’s slow? Spending all day trying to figure out why it doesn’t work. That’s slow. That’s the slowest thing I know.

In calculus, you’d say more. First you’d say that if a square has side *near* *x*, then it has area *near* *x*^{2}. That is, area is a continuous function of the length of a side. As the length of the side changes, there’s never an abrupt jump in area. Next you could be more specific and say that a small change Δ*x* to a side of length *x* corresponds to approximately a change of 2*x* Δ*x* in the area.

In probability, you ask what is the area of a square like if you pick the length of its side at random. If you pick the length of the side from a distribution with mean μ, does the distribution of the area have mean μ^{2}? No, but if the probability distribution on side length is tightly concentrated around μ, then the distribution on area will be concentrated near μ^{2}. And you can approximate just how near the area is to μ^{2} using the delta method, analogous to the calculus discussion above.

If the distribution on side lengths is not particularly concentrated, finding the distribution on the area is more interesting. It will depend on the specific distribution on side length, and the mean area might not be particularly close to the square of the mean side length. The function to compute area is trivial, and yet the question of what happens when you stick a random variable into that function is not trivial. **Random variables** behave as you might expect when you stick them into linear functions, but **offer surprises when you stick them into nonlinear functions**.

Suppose you pick the length of the side of a square uniformly from the interval [0, 1]. Then the average side is 1/2, and so you might expect the average area to be 1/4. But the expected area is actually 1/3. You could see this a couple ways, analytically and empirically.

First an analytical derivation. If *X* has a uniform [0, 1] distribution and *Z* = *X*^{2}, then the CDF of *Z* is

Prob(*Z* ≤ *z*) = Prob(*X* ≤ √*z*) = √*z*

and so the PDF for *Z*, the derivative of the CDF, is 1/(2√*z*). From there you can compute the expected value by integrating *z* times the PDF:

∫_{0}^{1} *z* / (2√*z*) *dz* = ∫_{0}^{1} √*z* / 2 *dz* = 1/3.

You could check your calculations by seeing whether simulation gives you similar results. Here’s a little Python code to do that.

from random import random

N = 1000000
print( sum([random()**2 for _ in range(N)]) / N )

When I run this, I get 0.33386, close to 1/3.

Now let’s look at an exponential distribution on side length with mean 1. Then a calculation similar to the one above shows that the expected value of the square is 2. You can also check this with simulation. This time we’ll be a little fancier and let SciPy generate our random values for us.

from scipy.stats import expon

print( sum(expon.rvs(size=N)**2)/N )

When I ran this, I got 1.99934, close to the expected value of 2.

You’ll notice that in both examples, the expected value of the area is more than the square of the expected value of the side. This is not a coincidence but a consequence of Jensen’s inequality. Squaring is a convex function, so the expected value of the square is larger than the square of the expected value for any (non-degenerate) random variable.
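Jensen’s inequality is easy to spot-check by simulation for several distributions at once. This sketch uses only the standard library; in every case the mean of the squares comes out larger than the square of the mean.

```python
import random

random.seed(42)
N = 100_000

samplers = {
    "uniform [0,1]": random.random,
    "exponential, mean 1": lambda: random.expovariate(1.0),
    "standard normal": lambda: random.gauss(0.0, 1.0),
}

results = {}
for name, draw in samplers.items():
    xs = [draw() for _ in range(N)]
    mean_sq = sum(x * x for x in xs) / N   # E[X^2]
    sq_mean = (sum(xs) / N) ** 2           # (E[X])^2
    results[name] = (mean_sq, sq_mean)
    print(name, mean_sq, sq_mean)          # mean of squares exceeds square of mean
```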

]]>