This post will take non-random uses for probability in a different direction. We’ll start with discussion of probability and end up proving a classical analysis result.

Let *p* be the probability of success on some experiment and let *X*_{n} count the number of successes in *n* independent trials. The expected value of *X*_{n} is *np*, and so the expected value of *X*_{n}/*n* is *p*.

Now let *f* be a continuous function on [0, 1] and think about *f*(*X*_{n}/*n*). As *n* increases, the distribution of *X*_{n}/*n* clusters more and more tightly around *p*, and so *f*(*X*_{n}/*n*) clusters more and more around *f*(*p*).

Let’s illustrate this with a simulation. Let *p* = 0.6, *n* = 100, and let *f*(*x*) = cos(*x*).

Here’s a histogram of random samples from *X*_{n}/*n*.

And here’s a histogram of random samples from *f*(*X*_{n}/*n*). Note that it’s centered near cos(*p*) = 0.825.
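The histograms can be reproduced with a short simulation. This is a sketch, not the original code; the seed is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # arbitrary seed

p, n, N = 0.6, 100, 10_000
samples = rng.binomial(n, p, size=N) / n   # draws of X_n / n

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(samples, bins=30)                 # clusters around p = 0.6
ax1.set_title("X_n / n")
ax2.hist(np.cos(samples), bins=30)         # clusters around cos(0.6)
ax2.set_title("cos(X_n / n)")
plt.show()
```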

As *n* gets large, the expected value of *f*(*X*_{n}/*n*) converges to *f*(*p*). (You could make all this all rigorous, but I’ll leave out the details to focus on the intuitive idea.)

Now write out the expected value of *f*(*X*_{n}/*n*).
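Since *X*_{n} has a binomial(*n*, *p*) distribution, this works out to

E[ *f*(*X*_{n}/*n*) ] = Σ_{k=0}^{n} *f*(*k*/*n*) C(*n*, *k*) *p*^{k} (1 – *p*)^{n–k},

a sum of the values *f*(*k*/*n*) weighted by binomial probabilities.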

Up to this point we’ve thought of *p* as a fixed probability, and we’ve arrived at an approximation result guided by probabilistic thinking. But the expression above is approximately *f*(*p*) regardless of how we think of it. So now let’s think of *p* varying over the interval [0, 1].

E[ *f*(*X*_{n}/*n*) ] is an *n*th degree **polynomial** in *p* that is close to *f*(*p*). In fact, as *n* goes to infinity, E[ *f*(*X*_{n}/*n*) ] converges uniformly to *f*(*p*).

So for an arbitrary continuous function *f*, we’ve constructed a sequence of polynomials that converge uniformly to *f*. In other words, we’ve proved the **Weierstrass approximation theorem**! The Weierstrass approximation theorem says that for any continuous function *f* on a closed, bounded interval there exists a sequence of polynomials converging uniformly to *f*, and we’ve explicitly constructed such a sequence. The polynomials E[ *f*(*X*_{n}/*n*) ] are known as the **Bernstein polynomials** associated with *f*.

This is one of my favorite proofs. When I saw this in college, it felt like the professor was pulling a rabbit out of a hat. We’re talking about probability when all of a sudden, out pops a result that has nothing to do with randomness.

Now let’s see an example of the Bernstein polynomials approximating a continuous function. Again we’ll let *f*(*x*) = cos(*x*). If we plot the Bernstein polynomials and cosine on the same plot, the approximation is good enough that it’s hard to see what’s what. So instead we’ll plot the differences, cosine minus the Bernstein approximations.
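Here is a sketch of how such difference plots can be produced; the helper name `bernstein` is mine, not from the original.

```python
import numpy as np
from math import comb
import matplotlib.pyplot as plt

def bernstein(f, n, p):
    # n-th Bernstein polynomial of f: E[ f(X_n / n) ] with X_n ~ binomial(n, p)
    return sum(f(k/n) * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

x = np.linspace(0, 1, 200)
for n in [4, 8, 16]:
    diffs = [np.cos(t) - bernstein(np.cos, n, t) for t in x]
    plt.plot(x, diffs, label=f"n = {n}")
plt.legend()
plt.show()
```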

As the degree of the Bernstein polynomials increases, the error decreases. Also, note the vertical scale: cosine of zero is 1, so the errors here are small relative to the size of the function being approximated.

Volker Michel describes this clash of conventions colorfully in his book on constructive approximation.

Many mathematicians have faced weird jigsaw puzzles with misplaced continents after using a data set from a geoscientist. If you ever get such figures, too, or if you are, for example, desperately searching South America in a data set but cannot find it, remember the remark you have just read to solve your problem.


According to Google’s Ngram Viewer, the term “supercomputer” peaked in 1990.

(The term “cluster” is far more common, but it is mostly a non-technical term. It’s used in connection with grapes, for example, much more often than with computers.)

Years ago you’d hear someone say a problem would take a “big computer,” but these days you’re more likely to hear someone say a problem is going to take “a lot of computing power.” Hearing that a problem is going to require a “big computer” sounds as dated as saying something would take “a big generator” rather than saying it would take a lot of electricity.

Like electricity, computing power has been commoditized. We tend to think in terms of the total amount of computing resources needed, measured in, say, CPU-hours or number of low-level operations. We don’t think first about what configuration of machines would deliver these resources any more than we’d first think about what machines it would take to deliver a certain quantity of electrical power.

There are still supercomputers and problems that require them, though an increasing number of computationally intense projects do not require such specialized hardware.

Here’s the page: https://www.johndcook.com/expsum/

Here are a few sample images. Small changes in the coefficients can make a big change in the appearance of the graphs.

Deep neural networks have enough parameters to overfit the data, but there are various strategies to keep this from happening. A common way to avoid overfitting is to **deliberately do a mediocre job** of fitting the model.

When it works well, the shortcomings of the optimization procedure yield a solution that differs from the optimal solution in a beneficial way. But the solution could fail to be useful in several ways. It might be too far from optimal, or deviate from the optimal solution in an unhelpful way, or the optimization method might accidentally do too good a job.

In a nutshell, the disturbing thing is that you have a negative criterion for what constitutes a good solution: one that’s not too close to optimal. But there are a lot of ways to not be too close to optimal. In practice, you experiment until you find **an optimally suboptimal solution**, i.e. the intentionally suboptimal fit that performs best in validation.


Exponential sums also make pretty pictures. If you make a scatter plot of the sequence of partial sums you can get surprising shapes. This is related to the trickiness of estimating such sums: the partial sums don’t simply monotonically converge to a limit.

The exponential sum page at UNSW suggests playing around with polynomials with dates in the denominator. If we take that suggestion with today’s date, we get the curve below:

These are the partial sums of exp(2π*i* *f*(*n*)) where *f*(*n*) = *n*/10 + *n*²/7 + *n*³/17.

[**Update**: You can get an image each day for the current day’s date here.]

Here’s the code that produced the image.

```python
import matplotlib.pyplot as plt
from numpy import array, pi, exp, log

N = 12000

def f(n):
    return n/10 + n**2/7 + n**3/17

z = array([exp(2*pi*1j*f(n)) for n in range(3, N+3)])
z = z.cumsum()

plt.plot(z.real, z.imag, color='#333399')
plt.gca().set_aspect(1)
plt.show()
```

If we use logarithms, we get interesting spirals. Here *f*(*n*) = log(*n*)^{4.1}.

And we can mix polynomials with logarithms. Here *f*(*n*) = log(*n*) + *n*²/100.

In this last image, I reduced the number of points from 12,000 to 1200. With a large number of points the spiral nature dominates and you don’t see the swirls along the spiral as clearly.


What about a function of two variables? If it has two local maxima, does it need to have a local minimum? No, it could have a saddle point in between, a point that is a local minimum in one direction but a local maximum in another direction. But even more surprising, it need not even have a saddle point. A function of two variables could have two local maxima and no other critical points! Here’s an example:

*f*(*x*, *y*) = – (*x*² – 1)² – (*x*²*y* – *x* – 1)²

It’s clear that the function is zero at (-1, 0) and (1, 2), and that the function is negative otherwise. So it clearly has two local maxima. You can write out the partial derivatives with respect to *x* and *y* and see that the only place they’re both zero is at the two local maxima.

Here’s a plot of the function:

And here’s a contour plot:

The two maxima are in the bright patches in the lower left and upper right.

You might be thinking that if you walk between two peaks, you’ve got to go down in between. And that’s true. If you walk in a straight line between (-1, 0) and (1, 2), you’ll run into a local minimum around (0.2316, 1.2316). But that’s only a local minimum along your path. It’s not a local minimum or saddle point of the function in a neighborhood of that point.

I found this example in the book *Single Digits*.

But it’s also possible to write clean source code that is nevertheless obfuscated. For example, it’s not at all obvious what the following code computes.

```python
def f(x):
    return 4*x*(1 - x)

def g(x):
    return x**3.5 * (1 - x)**2.5

total = 0
x = 1/7.0
N = 1000000
for _ in range(N):
    x = f(x)
    total += g(x)

print(total/N)
```

The key to unraveling this mystery is a post I wrote a few days ago that shows the *x*‘s in this code cover the unit interval like random samples from a beta(0.5, 0.5) distribution. That means the code is finding the expected value of *g*(*X*) where *X* has a beta(0.5, 0.5) distribution. The following computes this expected value.
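The computation itself isn’t shown here; one way to carry it out is numerical integration against the beta(0.5, 0.5) density, sketched below.

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

def g(x):
    return x**3.5 * (1 - x)**2.5

# E[ g(X) ] for X ~ beta(0.5, 0.5)
value, _ = integrate.quad(lambda x: g(x) * beta(0.5, 0.5).pdf(x), 0, 1)
print(value)   # matches 1/(60π)
```

The exponents combine nicely: the integrand simplifies to *x*³(1 – *x*)²/π, a beta function integral, which is how the closed form 1/(60π) comes out.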

The output of the program matches 1/(60π) to six decimal places.

**Related post**: What does this code do?

For this post, I’ll only consider scales starting on C. That is, I’ll only consider changing the intervals between notes, not changing the starting note. Also, I’ll only consider subsets of the common chromatic scale; this post won’t get into dividing the octave into more or less than 12 intervals.

First of all we have the major scale — C D E F G A B C — and the “natural” minor scale: A B C D E F G A. The word “natural” suggests there are other minor scales. More on that later.

Then we have the classical modes: Dorian, Phrygian, Lydian, Mixolydian, Aeolian, and Locrian. These have the same intervals as taking the notes of the C major scale and starting on D, E, F, G, A, and B respectively. For example, Dorian has the same intervals as D E F G A B C D. Since we said we’d start everything on C, the Dorian mode would be C D E♭ F G A B♭ C. The Aeolian mode is the same as the natural minor scale.

The harmonic minor scale adds a new wrinkle: C D E♭ F G A♭ B C. Notice that A♭ and B are three half steps apart. In all the scales above, notes were either a half step or a whole step apart. Do we want to consider scales that have such large intervals? It seems desirable to include the harmonic minor scale. But what about this: C E♭ G♭ A C. Is that a scale? Most musicians would think of that as a *chord* or *arpeggio* rather than a *scale*. (It’s a diminished seventh chord. And it would be more common to write the A as a B♭♭.)

We might try to put some restriction on the definition of a scale so that the harmonic minor scale is included and the diminished seventh arpeggio is excluded. Here’s what I settled on. For the purposes of this post, I’ll say that a scale is an ascending sequence of eight notes with two restrictions: the first and last are an octave apart, and no two consecutive notes are more than three half steps apart. This will include modes mentioned above, and the harmonic minor scale, but will exclude the diminished seventh arpeggio. (It also excludes pentatonic scales, which we may or may not want to include.)

One way to enumerate all possible scales would be to start with the chromatic scale and decide which notes to keep. Write out the notes C, C♯, D, … , B, C and write a ‘1’ underneath a note if you want to keep it and a ‘0’ otherwise. We have to start and end on C, so we only need to specify which of the 11 notes in the middle we are keeping. That means we can describe any potential scale as an 11-bit binary number. That’s what I did to carry out an exhaustive search for scales with a little program.
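Here is a sketch of that exhaustive search. It reads the 11 bits from C♯ (most significant) down to B (least significant), so that, for example, the major scale is number 725.

```python
def is_scale(n):
    # Decode the 11-bit number: bit 10 is C#, bit 0 is B.
    # kept holds semitone positions above the starting C (0) up to the octave (12).
    kept = [0] + [i for i in range(1, 12) if (n >> (11 - i)) & 1] + [12]
    if len(kept) != 8:                   # eight notes: C, six in between, C
        return False
    # no two consecutive notes more than three half steps apart
    return all(b - a <= 3 for a, b in zip(kept, kept[1:]))

scales = [n for n in range(2**11) if is_scale(n)]
print(len(scales))   # 266
```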

There are 266 scales that meet the criteria listed here. I’ve listed all of them on another page. Some of these scales have names and some don’t. I’ve noted some names as shown below. I imagine there are others that have names that I have not labeled. I’d appreciate your help filling these in.

| Scale number | Notes                 | Name              |
|--------------|-----------------------|-------------------|
| 693          | C D E F♯ G A B C      | Lydian mode       |
| 725          | C D E F G A B C       | Major             |
| 726          | C D E F G A B♭ C      | Mixolydian mode   |
| 825          | C D E♭ F♯ G A♭ B C    | Hungarian minor   |
| 826          | C D E♭ F♯ G A♭ B♭ C   | Ukrainian Dorian  |
| 854          | C D E♭ F G A B♭ C     | Dorian mode       |
| 858          | C D E♭ F G A♭ B♭ C    | Natural minor     |
| 1235         | C D♭ E F G A♭ B C     | Double harmonic   |
| 1242         | C D♭ E F G A♭ B♭ C    | Phrygian dominant |
| 1257         | C D♭ E F G♭ A♭ B C    | Persian           |
| 1370         | C D♭ E♭ F G A♭ B♭ C   | Phrygian mode     |
| 1386         | C D♭ E♭ F G♭ A♭ B♭ C  | Locrian mode      |


How much information is contained in nationality? That depends on exactly how you define nations versus territories etc., but for this blog post I’ll take this Wikipedia table for my raw data. You can calculate that nationality has entropy of 5.26 bits. That is, on average, nationality is slightly more informative than asking five independent yes/no questions. (See this post for how to calculate information content.)

Entropy measures expected information content. Knowing that someone is from India (population 1.3 billion) carries only 2.50 bits of information. Knowing that someone is from Vatican City (population 800) carries 23.16 bits of information.
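As a sketch, the information content of group membership is the negative log of its probability. The population figures below are rough assumptions; the exact numbers depend on the source.

```python
from math import log2

WORLD = 7.5e9   # rough world population; figures vary by source

def information_bits(population):
    # bits of information in learning someone belongs to a group of this size
    return -log2(population / WORLD)

print(information_bits(1.3e9))   # India: about 2.5 bits
print(information_bits(800))     # Vatican City: about 23 bits
```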

One way to reduce the re-identification risk of **PII** (personally identifiable information) such as nationality is to combine small categories. Suppose we lump all countries with a population under one million into “other.” Then we go from 240 categories down to 160. This hardly makes any difference to the entropy: it drops from 5.26 bits to 5.25 bits. But the information content for the smallest country on the list is now 8.80 bits rather than 23.16.

What about religion? This is also subtle to define, but again I’ll use Wikipedia for my data. Using these numbers, we get an entropy of 2.65 bits. The largest religion, Christianity, has an information content of 1.67 bits. The smallest religion on the list, Rastafari, has an information content of 13.29 bits.

So if nationality carries 5.25 bits of information and religion 2.65 bits, how much information does the combination of nationality and religion carry? At least 5.25 bits, but no more than 5.25 + 2.65 = 7.9 bits on average. For two random variables *X* and *Y*, the **joint entropy** *H*(*X*, *Y*) satisfies

max( *H*(*X*), *H*(*Y*) ) ≤ *H*(*X*, *Y*) ≤ *H*(*X*) + *H*(*Y*)

where *H*(*X*) and *H*(*Y*) are the entropy of *X* and *Y* respectively.
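The bounds are easy to check numerically for any joint distribution. The 2×2 joint distribution below is made up purely for illustration.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                      # take 0 log 0 to be 0
    return -(p * np.log2(p)).sum()

# hypothetical joint distribution of two binary attributes
joint = np.array([[0.4, 0.1],
                  [0.2, 0.3]])

H_X  = entropy(joint.sum(axis=1))     # marginal entropy of X
H_Y  = entropy(joint.sum(axis=0))     # marginal entropy of Y
H_XY = entropy(joint.ravel())         # joint entropy
print(H_X, H_Y, H_XY)
```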

Computing the joint entropy exactly would require getting into the joint distribution of nationality and religion. I’d rather not get into this calculation in detail, except to discuss possible **toxic pairs**. On *average*, the information content of the combination of nationality and religion is no more than the sum of the information content of each separately. But *particular combinations* can be highly informative.

For example, there are not a lot of Jews living in predominantly Muslim countries. According to one source, there are at least five Jews living in Iraq. Other sources put the estimate as “less than 10.” (There are zero Jews living in Libya.)

Knowing that someone is a Christian living in Mexico, for example, would not be highly informative. But knowing someone is a Jew living in Iraq would be extremely informative.

The values appear to bounce all over the place. Let’s look at a histogram of the sequence.

Indeed the points do bounce all over the unit interval, though they more often bounce near one of the ends.

Does that distribution look familiar? You might recognize it from Bayesian statistics. It’s a beta distribution. It’s symmetric, so the two beta distribution parameters are equal. There’s a vertical asymptote on each end, so the parameters are less than 1. In fact, it’s a beta(1/2, 1/2) distribution. It comes up, for example, as the **Jeffreys prior** for Bernoulli trials.

The graph below adds the beta(1/2, 1/2) density to the histogram to show how well it fits.

We have to be careful in describing the relation between the iterations and the beta distribution. The iterates are **ergodic**, not random. The **sequence** of points does not simulate independent draws from any distribution: points near 0.5 are always sent to points near 1, points near 1 are always sent to points near 0, etc. But **in aggregate**, the points do fill out the unit interval like samples from a beta distribution. In the limit, the proportion of points in a region equals the probability of that region under a beta(1/2, 1/2) distribution.

Here’s the Python code that produced the graphs above.

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

def quadratic(x):
    return 4*x*(1 - x)

N = 100000
x = np.empty(N)

# arbitrary irrational starting point
x[0] = 1/np.sqrt(3)
for i in range(1, N):
    x[i] = quadratic(x[i-1])

plt.plot(x[0:100])
plt.xlabel("iteration index")
plt.show()

t = np.linspace(0, 1, 100)
plt.hist(x, bins=t, density=True)
plt.xlabel("bins")
plt.ylabel("counts")
plt.plot(t, beta(0.5, 0.5).pdf(t), linewidth=3)
plt.legend(["beta(1/2, 1/2)"])
plt.show()
```


There are only 256 possible elementary cellular automata, so it’s practical to plot them all. I won’t list all the images here—you can find them all here—but I will give a few examples to show the variety of patterns they produce. As in the previous post, we imagine our grid rolled up into a cylinder, i.e. we’ll wrap around if necessary to find pixels diagonally up to the left and right.

As we discussed in the previous post, the number of a rule comes from what value it assigns to each of eight possible cellular states, turned into a binary number. So it’s plausible that binary numbers with more 1’s correspond to more black pixels. This is roughly true, though the graph below shows that the situation is more complex than that.

Imagine an infinite grid of graph paper and fill in a few squares in one row. The squares in the subsequent rows are filled in or not depending on the state of the square’s upstairs neighbors. For **elementary cellular automata**, the state of a square depends on the square directly above and the two squares diagonally above on each side. In matrix notation, the state of the (*i*, *j*) element depends on elements (*i*-1, *j*-1), (*i*-1, *j*), and (*i*-1, *j*+1).

There are 256 possible elementary cellular automata, and here’s how you can number them. The states of three consecutive cells determine a three-bit binary number. An automaton is determined by the bit it assigns to each of the eight possible three-bit states, so an automaton corresponds to an 8-bit number. In this post we’re interested in **Rule 90**. In binary, 90 is written 01011010, and the table below spells out the rule in detail.

```
000 -> 0
001 -> 1
010 -> 0
011 -> 1
100 -> 1
101 -> 0
110 -> 1
111 -> 0
```

If we start with a single square filled in (i.e. set to 1) then this is the graph we get, i.e. the Sierpinski triangle:

This pattern depends critically on our initial conditions. Roughly speaking, it seems that if you start with regular initial conditions you’ll get regular results. If you start with random initial conditions, you’ll get random-looking results as shown below.

We see the same empty triangles as before, but they’re much smaller and appear scattered throughout.

In order to create a rectangular image, I wrapped the edges: the upper left neighbor of a point on the left edge is the right-most square on the row above, and similarly for the right edge. You could think of this as wrapping our graph paper into a cylinder.
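The procedure above can be sketched in a few lines of NumPy; `np.roll` handles the cylindrical wrapping. Note that Rule 90’s table amounts to XORing the two diagonal neighbors.

```python
import numpy as np

def rule90(rows, cols):
    # Evolve Rule 90 on a cylinder: each new cell is the XOR of its
    # upper-left and upper-right neighbors, wrapping at the edges.
    grid = np.zeros((rows, cols), dtype=int)
    grid[0, cols // 2] = 1             # single filled square in the top row
    for i in range(1, rows):
        grid[i] = np.roll(grid[i-1], 1) ^ np.roll(grid[i-1], -1)
    return grid

grid = rule90(64, 129)   # view with plt.imshow(grid) to see the Sierpinski triangle
```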


This is not possible with a secure random number generator. Or more precisely, it is not practical. It may be theoretically possible, but doing so requires solving a problem currently believed to be extremely time-consuming. (Lots of weasel words here. That’s the nature of cryptography. Security often depends on the assumption that a problem is as hard to solve as experts generally believe it is.)

The Blum Blum Shub algorithm for generating random bits rests on the assumption that a certain number theory problem, the quadratic residuosity problem, is hard to solve. The algorithm is simple. Let *M* = *pq* where *p* and *q* are large primes, both congruent to 3 mod 4. Pick a seed *x*_{0} between 1 and *M* and relatively prime to *M*. Now for *n* > 0, set

*x*_{n+1} = *x*_{n}² mod *M*

and return the least significant bit of *x*_{n+1}. (Yes, that’s a lot of work for just one bit. If you don’t need cryptographic security, there are much faster random number generators.)

Here’s some Python code to illustrate using the generator. The code is intended to be clear, not efficient.

Given two large (not necessarily prime) numbers `x` and `y`, the code below finds primes `p` and `q` for the algorithm and checks that the seed is OK to use.

```python
import sympy

# super secret large numbers
x = 3*10**200
y = 4*10**200
seed = 5*10**300

def next_usable_prime(x):
    # Find the next prime congruent to 3 mod 4 following x.
    p = sympy.nextprime(x)
    while p % 4 != 3:
        p = sympy.nextprime(p)
    return p

p = next_usable_prime(x)
q = next_usable_prime(y)
M = p*q

assert 1 < seed < M
assert seed % p != 0
assert seed % q != 0
```

There’s a little bit of a chicken-and-egg problem here: how do you pick `x`, `y`, and `seed`? Well, you could use a cryptographically secure random number generator ….

Now let’s generate a long string of bits:

```python
# Number of random bits to generate
N = 100000

x = seed
bit_string = ""
for _ in range(N):
    x = x*x % M
    b = x % 2
    bit_string += str(b)
```

I did not test the output thoroughly; I didn’t use anything like DIEHARDER or PractRand as in previous posts, but just ran a couple simple tests described here.

First I look at the balance of 0’s and 1’s.

```
Number of 1's: 50171
Expected: 49683.7 to 50316.2
```

Then the longest run. (See this post for a discussion of expected run length.)

```
Run length: 16
Expected: 12.7 to 18.5
```

Nothing unusual here.

The first two authors of Blum Blum Shub are Lenore and Manuel Blum. The third author is Michael Shub.

I had a chance to meet the Blums at the Heidelberg Laureate Forum in 2014. Manuel Blum gave a talk that year on mental cryptography that I blogged about here and followed up here. He and his wife Lenore were very pleasant to talk with.

In the previous two posts we looked at a randomization scheme for protecting the privacy of a binary response. This post will look briefly at adding noise to continuous or unbounded data. I like to keep the posts here fairly short, but this topic is fairly technical. To keep it short I’ll omit some of the details and give more of an intuitive overview.

The idea of **differential privacy** is to guarantee bounds on how much information may be revealed by someone’s participation in a database. These bounds are described by two numbers, ε (epsilon) and δ (delta). We’re primarily interested in the multiplicative bound described by ε. This number is roughly the number of bits of information an analyst might gain regarding an individual. (The multiplicative bound is exp(ε) and so ε, the natural log of the multiplicative bound, would be the information measure, though technically in *nats* rather than *bits* since we’re using natural logs rather than logs base 2.)

The δ term is added to the multiplicative bound. Ideally δ is 0, that is, we’d prefer **(ε, 0)-differential privacy**, but sometimes we have to settle for **(ε, δ)-differential privacy**. Roughly speaking, the δ term represents the possibility that a few individuals may stand to lose more privacy than the rest, that the multiplicative bound doesn’t apply to everyone. If δ is very small, this risk is very small.

The **Laplace distribution** is also known as the **double exponential distribution** because its distribution function looks like the exponential distribution function with a copy reflected about the *y*-axis; these two exponential curves join at the origin to create a sort of circus tent shape. The absolute value of a Laplace random variable is an exponential random variable.

Why are we interested in this particular distribution? Because we’re interested in multiplicative bounds, and so it’s not too surprising that exponential distributions might make the calculations work out because of the way the exponential scales multiplicatively.

The **Laplace mechanism** adds Laplacian-distributed noise to a function. If Δ*f* is the **sensitivity** of a function *f*, a measure of how revealing the function might be, then adding Laplace noise with scale Δ*f*/ε preserves (ε, 0)-differential privacy.
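Here is a sketch of the Laplace mechanism for a counting query, which has sensitivity 1 because adding or removing one person changes a count by at most 1. The numbers and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Laplace noise with scale Δf/ε gives (ε, 0)-differential privacy
    return true_value + rng.laplace(scale=sensitivity/epsilon)

noisy_count = laplace_mechanism(true_value=1000, sensitivity=1, epsilon=0.1)
print(noisy_count)
```

Smaller ε means stronger privacy and larger noise: the noise scale is inversely proportional to ε.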

Technically, Δ*f* is the *l*_{1} sensitivity. We need this detail because the results for Gaussian noise involve *l*_{2} sensitivity. This is just a matter of whether we measure sensitivity by the *l*_{1} (sum of absolute values) norm or the *l*_{2} (root sum of squares) norm.

The **Gaussian mechanism** protects privacy by adding randomness with a more familiar normal (Gaussian) distribution. Here the results are a little messier. Let ε be strictly between 0 and 1 and pick δ > 0. Then the Gaussian mechanism is (ε, δ)-differentially private provided the scale of the Gaussian noise satisfies

σ ≥ √(2 log(1.25/δ)) Δ_{2}*f* / ε

where Δ_{2}*f* is the *l*_{2} sensitivity of *f*.

It’s not surprising that the *l*_{2} norm appears in this context since the normal distribution and *l*_{2} norm are closely related. It’s also not surprising that a δ term appears: the Laplace distribution is ideally suited to multiplicative bounds but the normal distribution is not.


In the previous post we looked at a simple randomization procedure to obscure individual responses to yes/no questions in a way that retains the statistical usefulness of the data. In this post we’ll generalize that procedure, quantify the privacy loss, and discuss the utility/privacy trade-off.

Suppose we have a binary response to some question as a field in our database. With probability *t* we leave the value alone. Otherwise we replace the answer with the result of a fair coin toss. In the previous post, what we now call *t* was implicitly equal to 1/2. The value recorded in the database could have come from a coin toss and so the value is not definitive. And yet it does contain some information. The posterior probability that the original answer was 1 (“yes”) is higher if a 1 is recorded. We did this calculation for *t* = 1/2 last time, and here we’ll look at the result for general *t*.

If *t* = 0, the recorded result is always random. The field contains no private information, but it is also statistically useless. At the opposite extreme, *t* = 1, the recorded result is pure private information and statistically useful. The closer *t* is to 0, the more privacy we have, and the closer *t* is to 1, the more useful the data is. We’ll quantify this privacy/utility trade-off below.

You can go through an exercise in applying Bayes theorem as in the previous post to show that the probability that the original response is 1, given that the recorded response is 1, is *p*(1 + *t*) / (2*pt* + 1 – *t*),

where *p* is the overall probability of a true response of 1.

The **privacy loss** associated with an observation of 1 is the gain in information due to that observation. Before knowing that a particular response was 1, our estimate that the true response was 1 would be *p*; not having any individual data, we use the group mean. But after observing a recorded response of 1, the posterior probability is the expression above. The information gain is the log base 2 of the ratio of these values: log_{2}( (1 + *t*) / (2*pt* + 1 – *t*) ).

When *t* = 0, the privacy loss is 0. When *t* = 1, the loss is –log_{2}(*p*) bits, i.e. the entire information contained in the response. When *t* = 1/2, the loss is log_{2}(3/(2*p* + 1)) bits.

We’ve looked at the privacy cost of setting *t* to various values. What are the statistical costs? Why not make *t* as small as possible? Well, 0 is a possible value of *t*, corresponding to complete loss of statistical utility. So we’d expect that small positive values of *t* make it harder to estimate *p*.

Each recorded response is a 1 with probability *tp* + (1 – *t*)/2. Suppose there are *N* database records and let *S* be the sum of the recorded values. Then our estimator for *p* is (2*S*/*N* – 1 + *t*) / (2*t*).

The variance of this estimator is inversely proportional to *t*², and so the width of our confidence intervals for *p* is proportional to 1/*t*. Note that the larger *N* is, the smaller we can afford to make *t*.
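A quick simulation shows the whole procedure recovering *p*; the estimator below inverts E[*S*/*N*] = *tp* + (1 – *t*)/2. The values of *p*, *t*, and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed

p, t, N = 0.3, 0.5, 100_000          # arbitrary illustrative values

truth = (rng.random(N) < p).astype(int)
keep = rng.random(N) < t             # with probability t, keep the true answer
coin = rng.integers(0, 2, N)         # otherwise record a fair coin toss
recorded = np.where(keep, truth, coin)

# invert E[S/N] = t p + (1 - t)/2 to estimate p
S = recorded.sum()
p_hat = (S/N - (1 - t)/2) / t
print(p_hat)   # close to 0.3
```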

Previous related posts:

- Randomized response, privacy, and Bayes theorem
- Quantifying the information content of personal data
- HIPAA de-identification
- Information theory
- Database anonymization

**Next up**: Adding Laplace or Gaussian noise and differential privacy

Suppose you want to gather data on an incriminating question. For example, maybe a statistics professor would like to know how many students cheated on a test. Being a statistician, the professor has a clever way to find out what he wants to know while giving each student deniability.

Each student is asked to flip two coins. If the first coin comes up heads, the student answers the question truthfully, yes or no. Otherwise the student reports “yes” if the second coin came up heads and “no” if it came up tails. Every student has deniability because each “yes” answer may have come from an innocent student who flipped tails on the first coin and heads on the second.

How can the professor estimate *p*, the proportion of students who cheated? Around half the students will get a head on the first coin and answer truthfully; the rest will look at the second coin and answer yes or no with equal probability. So the expected proportion of yes answers is *Y* = 0.5*p* + 0.25, and we can estimate *p* as 2*Y* – 0.5.

The calculations above assume that everyone complied with the protocol, which may not be reasonable. If everyone were honest, there’d be no reason for this exercise in the first place. But we could imagine another scenario. Someone holds a database with identifiers and answers to a yes/no question. The owner of the database could follow the procedure above to introduce randomness in the data before giving the data over to someone else.

What can we infer from someone’s randomized response to the cheating question? There’s nothing you can infer with *certainty*; that’s the point of introducing randomness. But that doesn’t mean that the answers contain no information. If we completely randomized the responses, dispensing with the first coin flip, *then* the responses would contain no information. The responses *do* contain information, but not enough to be incriminating.

Let *C* be a random variable representing whether someone cheated, and let *R* be their response, following the randomization procedure above. Given a response *R* = 1, what is the probability that *C* = 1, i.e. that someone cheated? This is a classic application of Bayes’ theorem: since *P*(*R* = 1 | *C* = 1) = 3/4 and *P*(*R* = 1) = (2*p* + 1)/4, we have *P*(*C* = 1 | *R* = 1) = 3*p*/(2*p* + 1).

If we didn’t know someone’s response, we would estimate their probability of having cheated as *p*, the group average. But knowing that their response was “yes” we update our estimate to 3*p* / (2*p* + 1). At the extremes of *p* = 0 and *p* = 1 these coincide. But for any value of *p* strictly between 0 and 1, our estimate goes up. That is, the probability that someone cheated, conditional on knowing they responded “yes”, is higher than the unconditional probability. In symbols, we have

when 0 < *p *< 1. The difference between the left and right sides above is maximized when *p* = (√3 – 1)/2 = 0.366. That is, a “yes” response tells us the most when about 1/3 of the students cheated. When *p* = 0.366, *P*(*C *= 1 | *R*= 1) = 0.634, i.e. the posterior probability is almost twice the prior probability.

when 0 < *p* < 1. The difference between the left and right sides above is maximized when *p* = (√3 – 1)/2 = 0.366. That is, a “yes” response tells us the most when about 1/3 of the students cheated. When *p* = 0.366, *P*(*C* = 1 | *R* = 1) = 0.634, i.e. the posterior probability is almost twice the prior probability.

You could go through a similar exercise with Bayes theorem to show that *P*(*C* = 1 | *R* = 0) = *p*/(3 – 2*p*), which is less than *p* provided 0 < *p* < 1. So if someone answers “yes” to cheating, that does make it more likely that they actually cheated, but not so much more that you can justly accuse them of cheating. (Unless *p* = 1, in which case you’re in the realm of logic rather than probability: if everyone cheated, then you can conclude that any individual cheated.)
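These posteriors are easy to check by simulation. The sketch below uses *p* = 0.366, near the maximally informative value; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2718)    # arbitrary seed

p, N = 0.366, 1_000_000
cheated = rng.random(N) < p
first = rng.random(N) < 0.5          # first coin heads: answer truthfully
second = rng.random(N) < 0.5         # otherwise report the second coin
response = np.where(first, cheated, second)

# empirical P(C = 1 | R = 1); theory says 3p/(2p + 1) ≈ 0.634
print(cheated[response].mean())
```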

**Update**: See the next post for a more general randomization scheme and more about the trade-off between privacy and utility. The post after that gives an overview of randomization for more general kinds of data.

If you would like help with database de-identification, please let me know.

This may seem like an odd question, but it’s actually one I get very often. On my TeXtip twitter account, I include tips on how to create non-English characters such as using

`\AA`

to produce Å. Every time someone will ask “Why not use XeTeX and just enter these characters?” If you can “just enter” non-English characters, then you don’t need a tip. But a lot of people either don’t know how to do this or don’t have a convenient way to do so. Most English speakers only need to type foreign characters occasionally, and will find it easier, for example, to type

`\AA`

or `\ss`

than to learn how to produce Å or ß from a keyboard. If you frequently need to enter Unicode characters, and know how to do so, then XeTeX is great.


Fermat’s little theorem says that for any prime *p* and any integer *a*,

*a*^{p} = *a* (mod *p*).

That is, *a*^{p} and *a* have the same remainder when you divide by *p*. Jordan Ellenberg picked the special case of *a* = 2 as his favorite theorem for the purpose of the podcast. And it’s this special case that can be proved from Pascal’s triangle.

The *p*th row of Pascal’s triangle consists of the coefficients of (*x* + *y*)^{p} when expanded using the binomial theorem. By setting *x* = *y* = 1, you can see that the numbers in the row must add up to 2^{p}. Also, all the numbers in the row are divisible by *p* except for the 1’s on each end. So the remainder when 2^{p} is divided by *p* is 2.

It’s easy to observe that the numbers in the *p*th row, except for the ends, are divisible by *p*. For example, when *p* = 5, the numbers are 1, 5, 10, 10, 5, 1. When *p* = 7, the numbers are 1, 7, 28, 35, 35, 28, 7, 1.

To prove this you have to show that the binomial coefficient *C*(*p*, *k*) is divisible by *p* when 0 < *k* < *p*. When you expand the binomial coefficient into factorials, you see that there’s a factor of *p* on top, and nothing can cancel it out because *p* is prime and the numbers in the denominator are less than *p*.
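The argument above is easy to verify directly. A quick sketch in Python, using the standard library’s `comb`:

```python
from math import comb

# Check that the interior entries of row p are divisible by p,
# and that consequently 2^p leaves remainder 2 when divided by p.
for p in [5, 7, 11, 13]:
    row = [comb(p, k) for k in range(p + 1)]
    assert all(c % p == 0 for c in row[1:-1])  # interior entries divisible by p
    assert sum(row) == 2**p                    # row sums to 2^p
    assert 2**p % p == 2                       # only the two 1's survive mod p
print("verified for p = 5, 7, 11, 13")
```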

By the way, you can form an analogous proof for the general case of Fermat’s little theorem by expanding

(1 + 1 + 1 + … + 1)^{p}

using the multinomial generalization of the binomial theorem.

I was stuck, and my advisor countered by saying “Let me ask you a harder question.” I was still stuck, and so he said “Let me ask you an even harder question.” Then I got it.

By “harder” he meant “more general.” He started with a concrete problem, then made it progressively more abstract until I recognized it. His follow-up questions were *logically* harder but *psychologically* easier.

This incident came to mind when I ran across an example in Lawrence Evans’ control theory course notes. He uses the example to illustrate what he calls an example of mathematical wisdom:

It is sometimes easier to solve a problem by embedding it within a larger class of problems and then solving the larger class all at once.

The problem is to evaluate the integral of the sinc function:

∫₀^∞ sin(*x*)/*x* *dx*.

He does so by introducing the more general problem of evaluating the function

*I*(α) = ∫₀^∞ *e*^{-α*x*} sin(*x*)/*x* *dx*,

which reduces to the sinc integral when α = 0.

We can find the derivative of *I*(α) by differentiating under the integral sign and integrating by parts twice:

*I*′(α) = -1/(1 + α²).

Therefore

*I*(α) = *C* – arctan(α).

As α goes to infinity, *I*(α) goes to zero, and so *C* = π/2 and *I*(0) = π/2.

Incidentally, note that instead of computing an integral in order to solve a differential equation as one often does, we introduced a differential equation in order to compute an integral.
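The closed form *I*(α) = π/2 – arctan(α) can be checked numerically. A small sketch, assuming SciPy’s `quad` for the integration (skipping α = 0, where the integrand decays too slowly for naive quadrature):

```python
from math import atan, exp, pi, sin
from scipy.integrate import quad

def I(alpha):
    # I(alpha) = integral from 0 to infinity of exp(-alpha x) sin(x)/x dx,
    # with the integrand's removable singularity at x = 0 filled in by 1
    integrand = lambda x: exp(-alpha * x) * (sin(x) / x if x else 1.0)
    return quad(integrand, 0, float("inf"))[0]

# Compare the numerical integral with pi/2 - arctan(alpha)
for alpha in [0.5, 1.0, 2.0]:
    print(alpha, I(alpha), pi / 2 - atan(alpha))
```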

If the answer to a question has probability *p*, then it contains -log_{2} *p* **bits of information**. Knowing someone’s sex gives you 1 bit of information because -log_{2}(1/2) = 1.

Knowing whether someone can roll their tongue could give you more or less information than knowing their sex. Estimates vary, but say 75% can roll their tongue. Then knowing that someone *can* roll their tongue gives you 0.415 bits of information, but knowing that they *cannot* roll their tongue gives you 2 bits of information.

On *average*, knowing someone’s tongue rolling ability gives you less information than knowing their sex. The average amount of information, or **entropy**, is

0.75(-log_{2} 0.75) + 0.25(-log_{2} 0.25) = 0.81.

Entropy is maximized when all outcomes are equally likely. But for identifiability, we’re concerned with maximum information as well as average information.

Knowing someone’s zip code gives you a variable amount of information, less for densely populated zip codes and more for sparsely populated zip codes. An average zip code contains about 7,500 people. If we assume a US population of 326,000,000, this means a typical zip code would give us about 15.4 bits of information.

The Safe Harbor provisions of US HIPAA regulations let you use the first three digits of someone’s zip code except when this would represent less than 20,000 people, as it would in several sparsely populated areas. Knowing that an American lives in a region of 20,000 people would give you 14 bits of information about that person.

Birth dates are complicated because age distribution is uneven. Knowing that someone’s birth date was over a century ago is highly informative, much more so than knowing it was a couple decades ago. That’s why the Safe Harbor provisions do not allow including age, much less birth date, for people over 90.

Birthdays are simpler than birth dates. Birthdays are not perfectly evenly distributed throughout the year, but they’re close enough for our purposes. If we ignore leap years, a birthday contains -log_{2}(1/365) or about 8.5 bits of information. If we consider leap years, knowing someone was born on a leap day gives us two extra bits of information.

Independent information is additive. I don’t expect there’s much correlation between sex, geographical region, and birthday, so you could add up the bits from each of these information sources. So if you know someone’s sex, their zip code (assuming 7,500 people), and their birthday (not a leap day), then you have 25 bits of information, which may be enough to identify them.
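The arithmetic above can be reproduced in a few lines. A sketch; the figures for zip code population and US population are the assumptions stated above:

```python
from math import log2

def bits(p):
    # Information (in bits) from learning an event of probability p
    return -log2(p)

sex = bits(1/2)                       # 1 bit
zip_bits = bits(7_500 / 326_000_000)  # roughly 15.4 bits for a typical zip code
birthday = bits(1/365)                # roughly 8.5 bits
total = sex + zip_bits + birthday
print(total)                          # roughly 25 bits

# Average information (entropy) of tongue-rolling ability
entropy = 0.75 * bits(0.75) + 0.25 * bits(0.25)
print(entropy)                        # roughly 0.81 bits
```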

This post didn’t consider correlated information. For example, suppose you know someone’s zip code and primary language. Those two pieces of information together don’t provide as much information as the sum of the information they provide separately because language and location are correlated. I may discuss the information content of correlated information in a future post. (**Update**: Here is a post on correlated pairs of data.)

**Related**: HIPAA de-identification

This article gives the following example. Suppose beauty and acting ability were uncorrelated. Knowing how attractive someone is would give you no advantage in guessing their acting ability, and vice versa. Suppose further that successful actors have a combination of beauty and acting ability. Then among successful actors, the beautiful would tend to be poor actors, and the unattractive would tend to be good actors.

Here’s a little Python code to illustrate this. We take two independent attributes, distributed like IQs, i.e. normal with mean 100 and standard deviation 15. As the sum of the two attributes increases, the correlation between the two attributes becomes more negative.

```python
from numpy import arange
from scipy.stats import norm, pearsonr
import matplotlib.pyplot as plt

# Correlation.
# The function pearsonr returns correlation and a p-value.
def corr(x, y):
    return pearsonr(x, y)[0]

x = norm.rvs(100, 15, 10000)
y = norm.rvs(100, 15, 10000)
z = x + y

span = arange(80, 260, 10)
c = [corr(x[z > low], y[z > low]) for low in span]

plt.plot(span, c)
plt.xlabel("minimum sum")
plt.ylabel("correlation coefficient")
plt.show()
```

I first noticed this when taking complex analysis where the **Cauchy integral formula** comes up over and over. When I first saw the formula I thought it was surprising, but certainly didn’t think “I bet we’re going to use this all the time.” The Cauchy integral formula was discovered after many of the results that textbooks now prove using it. Mathematicians realized over time that they could organize a class in complex variables more efficiently by proving the Cauchy integral formula as early as possible, then use it to prove much of the rest of the syllabus.

In functional analysis, it’s the **Hahn-Banach theorem**. This initially unimpressive theorem turns out to be the workhorse of functional analysis. Reading through a book on functional analysis you’ll see “By the Hahn-Banach theorem …” so often that you start to think “Really, that again? What does it have to do here?”

In category theory, it’s the **Yoneda lemma**. The most common four-word phrase in category theory must be “by the Yoneda lemma.” Not only is it the *most* cited theorem in category theory, it may be the *only* highly cited theorem in category theory.

The most cited theorem in machine learning is probably **Bayes’ theorem**, but I’m not sure Bayes’ theorem looms as large in ML as the previous theorems do in their fields.

Every area of math has theorems that come up more often than others, such as the **central limit theorem** in probability and the **dominated convergence theorem** in real analysis, but I can’t think of any theorems that come up as frequently as Hahn-Banach and Yoneda do in their areas.

As with people, there are theorems that attract attention and theorems that get the job done. These categories may overlap, but often they don’t.


The extremes are easy. If you pick only from one population, then the resulting distribution will be exactly as wide as the distribution of that population. But what about in the middle? If you pick from both populations with equal probability, will the width of the resulting distribution be approximately the average of the widths of the two populations separately?

To make things more specific, we’ll draw from two populations: Cauchy and normal. With probability η we will sample from a Cauchy distribution with scale γ and with probability (1-η) we will sample from a normal distribution with scale σ. The resulting combined distribution is a mixture, known in spectroscopy as a **pseudo-Voigt distribution**. In that field, the Cauchy distribution is usually called the Lorentz distribution.

(Why “pseudo”? A **Voigt** random variable is the *sum* of a Cauchy and a normal random variable. Its PDF is a *convolution* of a Cauchy and a normal PDF. A **pseudo-Voigt** random variable is the *mixture* of a Cauchy and a normal random variable. Its PDF is a *convex combination* of a Cauchy PDF and a normal PDF. In fact, the convex combination coefficients are η and (1-η) mentioned above. Convex combinations are easier to work with than convolutions, at least in some contexts, and the pseudo-Voigt distribution is a convenient approximation to the Voigt distribution.)

We will measure the width of distributions by full width at half maximum (FWHM). In other words, we’ll measure how far apart the two points are where the distribution takes on half of its maximum value.

It’s not hard to calculate that the FWHM for a Cauchy distribution with scale γ is 2γ, and the FWHM for a normal distribution with scale σ is 2 √(2 log 2) σ.

If we have a convex combination of Cauchy and normal distributions, we’d expect the FWHM to be at least roughly the same convex combination of the FWHM of the separate distributions, i.e. we’d expect the FWHM of our mixture to be

2ηγ + 2(1 – η)√(2 log 2)σ.

How close is that guess to being correct? It has to be exactly correct when η is 0 or 1, but how well does it do in the middle? Here are a few experiments fixing γ = 1 and varying σ.
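Here’s one way to run such an experiment. A sketch assuming SciPy; the helper names `fwhm` and `convex_guess` are mine, and the FWHM is found by root-finding on the mixture PDF:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import cauchy, norm

def fwhm(eta, gamma, sigma):
    # Pseudo-Voigt PDF: a mixture of Cauchy and normal, both centered at 0
    pdf = lambda x: (eta * cauchy.pdf(x, scale=gamma)
                     + (1 - eta) * norm.pdf(x, scale=sigma))
    half_max = pdf(0) / 2
    # The PDF is symmetric and unimodal, so the FWHM is twice
    # the positive point where it crosses half its maximum
    root = brentq(lambda x: pdf(x) - half_max, 1e-9, 100 * (gamma + sigma))
    return 2 * root

def convex_guess(eta, gamma, sigma):
    # Convex combination of the two separate FWHMs
    return 2 * eta * gamma + 2 * (1 - eta) * np.sqrt(2 * np.log(2)) * sigma

gamma = 1
for sigma in [0.5, 1, 2]:
    for eta in [0, 0.25, 0.5, 0.75, 1]:
        print(eta, sigma, fwhm(eta, gamma, sigma), convex_guess(eta, gamma, sigma))
```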

“Many hands make light work” — Often

But many hands make *more* work — Always

I’ve seen this over and over. But I think I’ve found an exception. When work is overwhelming, a lot of time is absorbed by discouragement and indecision. In that case, new people can make a big improvement. They not only get work done, but they can make others feel more like working.

Flood cleanup is like that, and that’s what motivated this note. Someone new coming by to help energizes everyone else. And with more people, you see progress sooner and make more progress, in a sort of positive feedback loop.

This is all in the context of fairly small teams. There must be a point where adding more people decreases productivity per person or even total productivity. I’ve heard reports of a highly bureaucratic relief organization that makes things worse when they show up to “help.” The ideal team size is somewhere between a couple discouraged individuals and a bloated bureaucracy.

**Related post**: Optimal team size

Relearning _______ from a _______ perspective.

Not relearning something forgotten, but going back over something you already know well from a different starting point, a different approach, etc.

Have any experiences along these lines you’d like to share in the comments? Anything you have relearned, attempted to relearn, or would like to relearn from a new angle?

My family and I are doing fine. Our house has not flooded, and at this point it looks like it will not flood. We’ve only lost electricity for a second or two.

Of course not everyone in Houston is doing so well. Harvey has done tremendous damage. Downtown was hit especially hard, and apparently they are in for more heavy rain. But it looks like the worst may be over for my area.

**Update** (5:30 AM, August 28): More flooding overnight, some of it nearby. We’re still OK. It looks like the heaviest rain is over, but there’s still rain in the forecast and there’s no place for more rain to go.

Houston has two enormous reservoirs west of town that together hold about half a billion cubic meters of water. This morning they started releasing water from the reservoirs to prevent dams from breaking.

Space City Weather has been the best source of information. The site offers “hype-free forecasts for greater Houston.” It’s a shame that a news source should have to describe itself as “hype-free,” but they are indeed hype-free and other sources are not.

**Update** (August 29): Looks like the heavy rain is over. We’re expecting rain for a few more days, but the water is receding faster than it’s collecting, at least on the northwest side.

Given an LCA group *G*, the Fourier transform takes a function on *G* and returns a function on the dual group of *G*. We said this much last time, but we didn’t define the dual group; we just stated examples. We also didn’t say just how you define a Fourier transform in this general setting.

Before we can define a dual group, we have to define group homomorphisms. A **homomorphism** between two groups *G* and *H* is a function *h* between the groups that preserves the group structure. If the group operation is denoted by addition on *G* and by multiplication on *H* (as it will be in our application), then saying *h* preserves the group structure means

*h*(*x* + *y*) = *h*(*x*) *h*(*y*)

for all *x* and *y* in *G*.

Next, let *T* be the unit circle, i.e. complex numbers with absolute value 1. *T* is a group with respect to multiplication. (Why *T* for circle? This is a common notation, anticipating generalization to toruses in all dimensions. A circle is a one-dimensional torus.)

Now a **character** on *G* is a continuous homomorphism from *G* to *T*. The set of all characters on *G* is the **dual group** of *G*. Call this group Γ. If *G* is an LCA group, then so is Γ.
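As a concrete check, take *G* to be the integers mod *m* with addition. Each residue *k* gives a character *n* ↦ exp(2π*i* *kn*/*m*); the sketch below verifies the homomorphism property numerically (the helper name `character` is just for illustration).

```python
import cmath

# A character on the integers mod m: chi_k(n) = exp(2 pi i k n / m)
def character(k, m):
    return lambda n: cmath.exp(2j * cmath.pi * k * n / m)

m, k = 12, 5
chi = character(k, m)

# Homomorphism property: chi(x + y) = chi(x) chi(y), addition taken mod m
for x in range(m):
    for y in range(m):
        assert abs(chi((x + y) % m) - chi(x) * chi(y)) < 1e-9

# Every value lies on the unit circle T
assert all(abs(abs(chi(n)) - 1) < 1e-12 for n in range(m))
print("chi is a character on the integers mod", m)
```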

The classical Fourier transform is defined by an integral. To define the Fourier transform on a group we have to have a way to do integration on that group. And there’s a theorem that says we can always do that. For every LCA group, there exists a **Haar measure** μ, and this measure is nice enough to develop our theory. This measure is essentially unique: Any two Haar measures on the same LCA group must be proportional to each other. In other words, the measure is unique up to multiplying by a constant.

On a discrete group—for our purposes, think of the integers and the integers mod *m*—Haar measure is just counting; the measure of a set is the number of things in the set. And integration with respect to this measure is summation.

Let *f* be a function in *L*¹(*G*), i.e. an absolutely integrable function on *G*. Then the Fourier transform of *f* is a function on Γ defined by

*f̂*(γ) = ∫ *f*(*x*) γ(-*x*) *d*μ(*x*),

with the integral taken over *G*.

What does this have to do with the classical Fourier transform? The classical Fourier transform takes a function of time and returns a function of frequency. The correspondence between the classical Fourier transform and the abstract Fourier transform is to associate the frequency ω with the character that takes *x* to the value exp(*i*ω*x*).

There are multiple slightly different conventions for the classical Fourier transform cataloged here. These correspond to different constant multiples in the choice of measure on *G* and Γ, i.e. whether to divide by or multiply by √(2π), and in the correspondence between frequencies and characters, whether ω corresponds to exp(±*i*ω*x*) or exp(±2π*i*ω*x*).

Is there a **general theory** that unifies all these related but different things? Why yes, yes there is.

Everything in the opening paragraph is simply a Fourier transform, each in a different context. And the contexts correspond to groups. Specifically, locally compact Abelian groups.

Some of these groups are easier to see than others. Clearly the real numbers with addition form a group: the sum of two real numbers is a real number, etc. But where are the groups in the other contexts?

You can think of a periodic function as a function on a circle; the function values have to agree at both ends of an interval, so you might as well think of those two points as the same point, i.e. join them to make a circle. Shifting along an interval, wrapping around if necessary, corresponds to a rotation of the circle, and rotations form a group. So analyzing a periodic function into a set of Fourier coefficients is a Fourier transform on the circle.

You can think of a set of Fourier coefficients as a function on the integers, mapping *n* to the *n*th coefficient. Synthesizing a set of Fourier coefficients into a periodic function is a Fourier transform on the group of integers.

What about a discrete Fourier transform (DFT)? If you have a function sampled at *m* points, you could think of those points as the group of integers mod *m*. Your sampled points constitute a function on the integers mod *m*, and the DFT is a Fourier transform on that group.

Note that the DFT is a Fourier transform in its own right. It’s not an approximation per se, though it’s nearly always used as part of an approximation process. You can start with a continuous function, approximate it by a finite set of samples, compute the DFT of these samples, and the result will give you an approximation to the Fourier transform of the original continuous function.

What about functions of several variables? These are functions on groups too. A function of two real variables, for example, is a function on *R*², which is a group with (vector) addition.

A Fourier transform takes a function defined on a group and returns a function defined on the dual of that group. I go into exactly what a dual group is in my next post, but for now, just note that the Fourier transform takes a function defined on one group and returns a function defined on another group.

The dual of the circle is the integers, and vice versa. That’s why the Fourier transform of a function on the circle is an infinite set of Fourier coefficients, which we think of as a function on the integers. The Fourier transform of the function on the integers, i.e. a set of Fourier coefficients, is a function on the circle, i.e. a periodic function.

The dual group of the real numbers is the real numbers again. That’s why the Fourier transform of a function on the real line is another function on the real line.

The integers mod *m* is also its own dual group. So the DFT takes a set of *m* numbers and returns a set of *m* numbers.
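A minimal sketch of that last point: summing a function against each character of the integers mod *m* reproduces the usual DFT, which we can check against NumPy’s FFT.

```python
import numpy as np

# DFT as a Fourier transform on the integers mod m: integrate (i.e. sum,
# since Haar measure on a discrete group is counting measure) the function
# times the conjugate of each character n -> exp(2 pi i k n / m)
def dft(f):
    m = len(f)
    n = np.arange(m)
    return np.array([np.sum(f * np.exp(-2j * np.pi * k * n / m))
                     for k in range(m)])

rng = np.random.default_rng(0)
f = rng.standard_normal(8)
print(np.allclose(dft(f), np.fft.fft(f)))  # the two transforms agree
```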

What do locally compact and Abelian mean? And why do we make these assumptions?

Let’s start with Abelian. This just means that the group operation is commutative. When we’re adding real numbers, or composing rotations of a circle, these operations are commutative.

Local compactness is a more technical requirement. The circle is compact, and so are the integers mod *m*. But if we restricted our attention to compact groups, that would leave out the integers and the real numbers. These spaces are not compact, but they’re locally compact, and that’s enough for the theory to go through.

It turns out that LCA groups are a sort of theoretical sweet spot. Assuming groups are LCA is general enough to include the examples we care about the most, but it’s not so general that the theory becomes harder and the results less powerful.

This post relates Fourier series (analysis and synthesis) to Fourier transforms (on the real line) by saying they’re both special cases of Fourier analysis on LCA groups. There are a couple other ways to connect Fourier series and Fourier transforms.

You can take the Fourier **transform** (not Fourier **series**) of a periodic function two ways: by restricting it to one period and defining it to be zero everywhere else, or by letting it repeat forever across the real line and taking the Fourier transform in the sense of generalized functions. You can read more about these two approaches in this post.