converges or diverges, stirring up a lot of discussion. Sam Walters has been part of this discussion and pointed to a paper that says this is known as the Flint Hills series.

My first thought was to replace the sine term with a random variable and see what happens, because the values of *n* mod 2π act a lot like a random sequence. To be precise, the sequence of values is ergodic.

If *X* is a uniform random variable on the interval [0, 2π], then the random variable *Y =* 1/sin(*X*)² is fat tailed, so fat that it has no finite expected value. If *Y* had a finite expected value E[*Y*], then one might expect the Flint Hills sum to be roughly E[*Y*] ζ(3), i.e. the Flint Hills sum with the sine terms replaced by E[*Y*]. But since E[*Y*] diverges, this suggests that the Flint Hills series diverges.
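Here's a quick Monte Carlo sketch of the fat tail (a simulation added for illustration, not a proof): the sample means of 1/sin(*X*)² keep drifting upward as the sample size grows, dominated by rare draws near multiples of π, just as you'd expect for a distribution with no finite mean.

```python
import math
import random

random.seed(20190101)  # fixed seed so the run is reproducible

def sample_mean(n):
    # mean of 1/sin(X)^2 for n draws of X uniform on [0, 2*pi]
    total = 0.0
    for _ in range(n):
        x = random.uniform(0, 2*math.pi)
        total += 1/math.sin(x)**2
    return total / n

for n in [10**3, 10**4, 10**5]:
    print(n, sample_mean(n))
```

The running means don't settle down; a handful of draws close to a multiple of π dominate each sum.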

Of course this doesn’t settle the question because the values of *n* mod 2π are not actually random. They simply act as if they were random *in some contexts*. For example, if you wanted to use them as if they were random values in order to do Monte Carlo integration, they would work just fine.

The question is whether the values act sufficiently like random values in the context of the Flint Hills problem. This is not clear, and is a problem in number theory rather than in probability. (Though number theory and probability are surprisingly interconnected.)

Sam Walters suggested considering a variation on the Flint Hills problem where we replace sin(*n*) with sin(2πθ*n*) for some constant θ, the original problem corresponding to θ = 1/2π. I suspect the series diverges for some (almost all?) values of θ. That is, unless you pick θ very carefully, number theory won’t rescue the series from divergence.

When I started this blog I routed my RSS feed through Feedburner, and now Feedburner is no longer working for my site.

If you **subscribed by RSS**, please check the feed URL. It should be

https://www.johndcook.com/blog/feed

which was previously forwarded to a Feedburner URL. If you subscribe directly to the feed with my domain, it should work. I had some problems with it earlier, but I believe they’ve been fixed.

Also, if you **subscribed by email**, you went through Feedburner. You won’t receive any more posts by email that way. I’m considering possible replacements.

In the meantime, you can sign up for monthly highlights from the blog via email. Each month I highlight the three or four most popular posts. Once in a while I may add a paragraph about what I’m up to. Short and sweet.

You can also find me on Twitter.

**Update**: Maybe Feedburner is working after all. I’m not sure what’s going on. Still, it seems Feedburner is hanging on by a thread. The steps above are a good way to prepare for when it goes away.

That is, big O tilde notation ignores logarithmic factors. For example, the **FFT algorithm** computes the discrete Fourier transform of a sequence of length *n* in

*O*(*n* log *n*)

steps, and so you could write this as *Õ*(*n*). This notation comes up frequently in **computational number theory** where algorithms often have a mix of polynomial and logarithmic terms.

A couple days ago I blogged about algorithms for multiplying large numbers. The **Schönhage-Strassen algorithm** has a run time on the order of

*O*(*n* log(*n*) log(log(*n*))),

which you could write as simply *Õ*(*n*).

**Shor’s quantum algorithm** for factoring *n*-bit integers has a run time of

*O*(*n*² log(*n*) log(log(*n*))),

which we could write as *Õ*(*n*²). The **fast Euclidean algorithm** improves on the ancient Euclidean algorithm by reducing run time from *O*(*n*²) down to

*O*(*n* log²(*n*) log(log(*n*))),

which could be written simply as *Õ*(*n*).

The definition at the top of the post says we can ignore powers of the logarithm, but the previous examples contained iterated logarithms. This is permissible because log(*x*) < *x*, and so log(log(*x*)) < log(*x*). [2]

- Four uncommon but handy notations
- How fast can you multiply large integers?
- RSA numbers and factoring

[1] Big O notation can be confusing at first. For example, the equal sign doesn’t have its standard meaning. For more details, see these notes.

[2] Sometimes in mathematics a superscript on a function is an exponent to be applied to the output, and sometimes it indicates the number of times a function should be iterated. That is, *f*²(*x*) could mean *f*(*x*)² or *f*( *f*(*x*) ). The former is the convention for logarithms, and we follow that convention here.

Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure.

**Tabular data** is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to work with.

**Non-tabular data** could mean anything other than tabular data, but in practice it often means **text**, or it could mean data with a graph structure or some other structure.

My point here isn’t to quibble over language usage but to offer a constructive suggestion: **say what structure data has, not what structure it doesn’t have**.

Discussions about “unstructured data” are often unproductive because two people can use the term, with two different ideas of what it means, and think they’re in agreement when they’re not. Maybe an executive and a sales rep shake hands on an agreement that isn’t really an agreement.

Eventually there will have to be a discussion of what structure data actually has rather than what structure it lacks, and to what degree that structure is exploitable. Having that discussion sooner rather than later can save a lot of money.

One form of “unstructured” data is free text fields. These fields are not free of structure. They usually contain prose, written in a particular language, or at most a small number of languages. That’s a start. There should be more exploitable structure from context. Is the text a pathology report? A Facebook status? A legal opinion?

Clients will ask how to de-identify **free text fields**. You can’t. If the text is truly *free*, it could be anything, by definition. But if there’s some known structure, then maybe there’s some practical way to anonymize the data, especially if there’s some tolerance for error.

For example, a program may search for and mask probable names. Such a program would find “Elizabeth” but might fail to find “the queen.” Since there are only a couple queens [1], this would be a privacy breach. Such software would also have false positives, such as masking the name of the ocean liner Queen Elizabeth 2. [2]
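To make the false-negative/false-positive tradeoff concrete, here's a toy sketch (mine, not any real de-identification software) that masks capitalized words. It catches “Elizabeth,” misses “the queen,” and mangles the ship's name:

```python
import re

def mask_names(text):
    # Naive heuristic: mask any capitalized word.
    # Real de-identification software is far more sophisticated.
    return re.sub(r"\b[A-Z][a-z]+\b", "[NAME]", text)

print(mask_names("Elizabeth was admitted."))            # masks the name
print(mask_names("the queen was admitted."))            # false negative: nothing masked
print(mask_names("I sailed on the Queen Elizabeth 2."))  # false positive: ship name masked
```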

[1] The Wikipedia list of current sovereign monarchs lists only two women, Queen Elizabeth II of the UK and Queen Margrethe II of Denmark.

[2] The ship, also known as QE2, is Queen Elizabeth 2, while the monarch is Queen Elizabeth II.

If you’re multiplying integers with tens of thousands of decimal digits, the most efficient algorithm is the Schönhage-Strassen algorithm, which takes

O(*n* log *n* log log *n*)

operations. For smaller numbers, another algorithm, such as Karatsuba’s algorithm, might be faster. And for impractically large numbers, Fürer’s algorithm is faster.

What is impractically large? Let’s say a number is impractically large if storing it requires more than a billion dollars worth of RAM. If I did my back-of-the-envelope math correctly, you can buy enough RAM to store about 2^{57} bits for about a billion dollars. The Schönhage-Strassen algorithm is more efficient than Fürer’s algorithm for numbers with less than 2^{64} bits.
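Here's the back-of-the-envelope arithmetic in code. The price of $60 per GB of RAM is an assumption made for the sake of illustration; at that price a billion dollars buys about 2^{57} bits.

```python
import math

budget = 1e9         # dollars
price_per_gb = 60.0  # assumed price of RAM in dollars per GB (hypothetical)

gigabytes = budget / price_per_gb
bits = gigabytes * 2**30 * 8  # GB -> bytes -> bits

print(math.log2(bits))  # roughly 57
```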

**Related post**: How fast can you multiply matrices?

Of course this is just an aphorism. Sometimes discoveries are indeed named after their discoverers. But the times when this isn’t the case are more interesting.

Stigler’s law reveals something about human nature. The people who get credit for an idea are the ones who make their discovery known. They work in the open rather than in obscurity or secrecy. They may be the first to explain an idea well, or the first to popularize it, or the first to apply it to a problem that people care about. Apparently people value these things more than they value strict accounts of historical priority.

**Related post**: Stigler’s law and Avogadro’s number

I used the Python module unidecode to convert names to ASCII before searching, and that fixed the problem. Here’s a small example showing how the code works.

import unidecode

for x in ["Poincaré", "Gödel"]:
    print(x, unidecode.unidecode(x))

This produces

Poincaré Poincare
Gödel Godel

Installing the `unidecode` module also installs a command line utility by the same name. So you could, for example, pipe text to that utility.

As someone pointed out on Hacker News, this isn’t so impressive for Western languages, but if you need to project Arabic, Russian or Chinese, unidecode is close to black magic:

>>> from unidecode import unidecode
>>> unidecode("北亰")
'Bei Jing '

(Someone has said in the comments that 北亰 is a typo and should be 北京. I can’t say whether this is right, but I can say that `unidecode` transliterates both to “Bei Jing.”)

I titled this post “Projecting Unicode to ASCII” because this code is a **projection** in the mathematical sense. A projection is a function *P* such that for all inputs *x*,

*P*( *P*(*x*) ) = *P*(*x*).

That is, applying the function twice does the same thing as applying the function once. The name comes from projection in the colloquial sense, such as projecting a three dimensional object onto a two dimensional plane. An equivalent term is to say *P* is **idempotent**. [1]

The `unidecode` function maps the full range of Unicode characters into the range 0x00 to 0x7F, and if you apply it to a character already in that range, the function leaves it unchanged. So the function is a projection, or you could say the function is idempotent.
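Since `unidecode` is a third-party module, here's a quick way to check the projection property with a simpler stand-in: a crude projection onto ASCII that drops non-ASCII characters rather than transliterating them.

```python
def to_ascii(s):
    # A crude projection onto ASCII: drop anything outside 0x00-0x7F.
    # (unidecode is smarter: it transliterates instead of dropping.)
    return s.encode("ascii", "ignore").decode("ascii")

once = to_ascii("Poincaré")
twice = to_ascii(to_ascii("Poincaré"))
print(once, once == twice)  # applying twice does the same as applying once
```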

Projection is such a simple condition that it hardly seems worth giving it a name. And yet it is extremely useful. A general principle in user interface design is to make something a projection if the user expects it to be a projection. Users probably don’t have the vocabulary to say “I expected this to be a projection” but they’ll be frustrated if something is almost a projection but not quite.

For example, if software has a button to convert an image from color to grayscale, it would be surprising if (accidentally) clicking the button a second time had any effect. It would be unexpected if it returned the original color image, and it would be even more unexpected if it did something else, such as keeping the image in grayscale but lowering the resolution.

- Contractions and fixed points
- Area of a triangle and its projections
- Converting between Unicode and LaTeX

[1] The term “idempotent” may be used more generally than “projection,” the latter being more common in linear algebra. Some people may think of a projection as a *linear* idempotent function. We’re not exactly doing linear algebra here, but people do think of portions of Unicode geometrically, speaking of “planes.”

This evening I stumbled on the fact that John von Neumann and Fibonacci both have asteroids named after them. Then I wondered how many more famous mathematicians have asteroids named after them. As it turns out, most of them: Euclid, Gauss, Cauchy, Noether, Gödel, Ramanujan, … It’s easier to look at the names that are *not* assigned to asteroids.

I wrote a little script to search for the 100 greatest mathematicians (according to James Allen’s list) and look for names missing from a list of named asteroids. Here are the ones that were missing, in order of Allen’s list.

- Alexandre Grothendieck
- Hermann K. H. Weyl
- Brahmagupta
- Aryabhata
- Apollonius
- Carl Ludwig Siegel
- Diophantus
- Muhammed al-Khowarizmi
- Bhascara (II) Acharya
- Alhazen ibn al-Haytham
- Andrey N. Kolmogorov
- Eudoxus
- Panini
- Jakob Steiner
- Hermann G. Grassmann
- M. E. Camille Jordan
- Jean-Pierre Serre
- Michael F. Atiyah
- Atle Selberg
- Pappus

So of the top 100 list of mathematicians, 80 have asteroids named after them and the 20 above do not.

Alexandre Grothendieck is the greatest mathematician—11th on James Allen’s list—not to have an asteroid named in his honor.

The HIPAA Privacy Rule offers two ways to say that data has been de-identified: Safe Harbor and expert determination. This post is about the former. I help companies with the latter.

The Safe Harbor provision lists 18 categories of data that would cause a data set to not be considered de-identified unless an expert determines the data does not pose a significant re-identification risk.

Some of the items prohibited by Safe Harbor are obvious: telephone number, email address, social security number, etc. Others are not so obvious. In order for data to fall under the Safe Harbor provision, one must remove

All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 …

Why are these dates a problem? Birth dates are clearly useful in identifying individuals; when combined with zip code and sex, they give enough information to uniquely identify 87% of Americans. (More details here.) But why admission or discharge dates?

Latanya Sweeney demonstrated here how dates of hospital admission can be used to identify individuals. She purchased a set of anonymized health records for the state of Washington for 2011 and compared the records to newspaper stories. She simply did a LexisNexis search on the term “hospitalized” to find news stories about people who were hospitalized, then searched the medical records for the personal details from the newspaper articles.

In the discussion section of her article Sweeney points out that although she searched newspapers, one could get similar information from other sources, such as employee leave records or from a record of someone asking to pay a bill late due to illness.

There are ways to retain the information in dates without jeopardizing privacy. For example, one could jitter the dates by adding a random offset. However, the way to do this depends on context and can be subtle. For example, Netflix jittered the dates in its Netflix Prize data set by +/- two weeks, but this was not enough to prevent a privacy breach [1]. And if you add too much randomness, the utility of the data degrades. That’s why the HIPAA Privacy Rule includes the provision to obtain expert determination that your procedures are adequate in your context.
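Here's what jittering might look like in code. This is a minimal sketch of the general idea, using a ±14 day window like the Netflix data set (which, again, proved inadequate there); the function name and the fixed seed are for illustration only.

```python
import random
from datetime import date, timedelta

random.seed(42)  # fixed for reproducibility; real jittering would not use a public seed

def jitter_date(d, max_days=14):
    # shift a date by a uniform random offset in [-max_days, max_days]
    return d + timedelta(days=random.randint(-max_days, max_days))

admission = date(2011, 6, 15)
print(jitter_date(admission))
```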

[1] Arvind Narayanan and Vitaly Shmatikov. Robust De-anonymization of Large Sparse Datasets, or How to Break Anonymity of the Netflix Prize Dataset.

First of all, you can read exactly what these exponential sums are here. These plots can be periodic in two senses. The first is simply repeating the same sequence of points. The second is repeatedly producing the same pattern of points, but shifted. Here are a couple examples from last year.

The sum for New Years Eve was periodic in the first sense. Plotting more points would make no difference.

The sum for Christmas Eve was periodic in the latter sense, repeating the pattern 12 times as it moves down the page. If we increased the value of *N* in the formula that generates the plots, we’d see more repetitions extending further down.

Since 19 is a prime, there will be more exponential sums this year that are periodic in this latter sense. I doubled the value of *N* so the plot will always display at least two periods.

The fact that 19 is prime also means there will be more intricate plots this year. For example, here’s the plot for March 17, 2019

compared to the plot for March 17, 2018.

There will also be a few simpler but strange plots. The plot for January 19 is unlike anything I’ve seen since starting this page.

**Goldbach’s conjecture** says that every even number greater than 2 can be written as the sum of two primes. The **odd Goldbach conjecture**, a.k.a. the **weak Goldbach conjecture**, says that every odd number greater than 5 can be written as the sum of three primes. The odd Goldbach conjecture isn’t really a conjecture anymore because Harald Helfgott proved it in 2013, though the original Goldbach conjecture remains unsettled.

The odd Goldbach conjecture says it should be possible to write 2019 as the sum of three primes. And in fact there are 2,259 ways to write 2019 as a sum of three primes, listing the primes in non-decreasing order.

3 + 5 + 2011
3 + 13 + 2003
3 + 17 + 1999
...
659 + 659 + 701
659 + 677 + 701
673 + 673 + 673

Lagrange’s four-square theorem says that every non-negative integer can be written as the sum of four squares. 2019 is a non-negative integer, so it can be written as the sum of four squares. In fact there are 66 ways to write 2019 as a sum of four squares.

 0  1 13 43
 0  5 25 37
 0  7 11 43
...
16 19 21 31
17 23 24 25
19 20 23 27

Sometimes there is a unique way to write a number as the sum of four squares. The last time a year could be written uniquely as a sum of four squares was 1536, and the next time will be in 2048.
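Here's a brute-force check of these counts, including the uniqueness claims for 1536 and 2048:

```python
from math import isqrt

def four_square_reps(n):
    # all ways to write n as a^2 + b^2 + c^2 + d^2 with 0 <= a <= b <= c <= d
    reps = []
    m = isqrt(n)
    for a in range(m + 1):
        for b in range(a, m + 1):
            if a*a + b*b > n:
                break
            for c in range(b, m + 1):
                s = a*a + b*b + c*c
                if s > n:
                    break
                d = isqrt(n - s)
                if d >= c and d*d == n - s:
                    reps.append((a, b, c, d))
    return reps

print(len(four_square_reps(2019)))  # 66
print(len(four_square_reps(1536)))  # unique representation
print(len(four_square_reps(2048)))  # unique representation
```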

2019 is a semiprime, i.e. the product of two primes, 3 and 673. For every semiprime *s*, there are either one or two distinct groups of order *s*.

As explained here, if *s* = *pq* with *p* > *q*, all groups of order *s* are isomorphic if *q* is not a factor of *p*-1. If *q* does divide *p*-1 then there are exactly two non-isomorphic groups of order *s*, one abelian and one non-abelian. In our case, 3 does divide 672, so there are two groups of order 2019. The first is easy: the cyclic group of order 2019. The second is more complex.

You could take the **direct product** of the cyclic groups of order 3 and 673, but that turns out to be isomorphic to the cyclic group of order 2019. But if you take the **semidirect product** you get the other group of order 2019, the non-abelian one.

Starting with two groups *G* and *H*, the direct product *G* × *H* is the Cartesian product of *G* and *H* with multiplication defined componentwise. The semidirect product of *G* and *H*, written [1]

also starts with the Cartesian product of *G* and *H* but defines multiplication differently.

In our application, *G* will be the integers mod 673 with **addition** and *H* will be a three-element subgroup of the integers mod 673 with multiplication [2]. Let *H* be the set {1, 255, 417} with respect to multiplication [3]. Note that 1 is its own inverse and 255 and 417 are inverses of each other.

Now the group product of (*g*_{1}, *h*_{1}) and (*g*_{2}, *h*_{2}) is defined to be

(*g*_{1} + *h*_{1}^{-1}*g*_{2}, *h*_{1} *h*_{2})

So, for example, the product of (5, 255) and (334, 417) is (5 + 417*334, 255*417) which reduces to (645, 1) working mod 673.

(We haven’t defined the semidirect product in general, but the procedure above suffices to define the semidirect product for any two groups of prime order, and so it is sufficient to find all groups of semiprime order.)

Note that our group is non-abelian. For example, if we reverse the order of multiplication above we get (263, 1).

The identity element is just the pair consisting of the identity elements from *G* and *H*. In our case, this is (0, 1) because 0 is the additive identity and 1 is the multiplicative identity.

The inverse of an element (*g*, *h*) is given by

(-*gh*, *h*^{-1}).

So, for example, the inverse of (600, 255) is (444, 417).

The goal of this post is to be concrete rather than general.

So to make everything perfectly explicit, we’ll write a little Python code implementing the product and inverse.

def hinv(h):
    if h == 255:
        return 417
    if h == 417:
        return 255
    if h == 1:
        return 1

def prod(x, y):
    g1, h1 = x
    g2, h2 = y
    g3 = (g1 + hinv(h1)*g2) % 673
    h3 = (h1*h2) % 673
    return (g3, h3)

def group_inv(x):
    g, h = x
    return ((-g*h) % 673, hinv(h))

x = (5, 255)
y = (334, 417)
print(prod(x, y))
print(prod(y, x))
print(group_inv((600, 255)))

The following code verifies that our group satisfies the axioms of a group.

from itertools import product

identity = (0, 1)
h_list = [1, 255, 417]

def elem(x):
    g, h = x
    g_ok = (0 <= g <= 672)
    h_ok = (h in h_list)
    return (g_ok and h_ok)

# Materialize the group as a list. An iterator straight from product()
# would be exhausted after one pass, silently skipping the checks below.
group = list(product(range(673), h_list))
assert(len(group) == 2019)

# closed under multiplication
for x in group:
    for y in group:
        assert( elem(prod(x, y)) )

# multiplication is associative
# (note: this triple loop runs 2019^3 times, so it is very slow)
for x in group:
    for y in group:
        for z in group:
            xy_z = prod(prod(x, y), z)
            x_yz = prod(x, prod(y, z))
            assert(xy_z == x_yz)

# identity acts like it's supposed to
for x in group:
    assert( prod(x, identity) == x )
    assert( prod(identity, x) == x )

# every element has an inverse
for x in group:
    ginv = group_inv(x)
    assert( elem(ginv) )
    assert( prod(x, ginv) == identity )
    assert( prod(ginv, x) == identity )

[1] The symbol for semidirect product is ⋊. It’s U+22CA in Unicode and `\rtimes` in LaTeX.

[2] In general, the semidirect product depends on a choice of an action of the group *H* on the group *G*. Here the action is multiplication by an element of *H*. Different actions can result in different groups. Sometimes the particular choice of action is made explicit as a subscript on the ⋊ symbol.

[3] How did I find these numbers? There are 672 non-zero numbers mod 673, so I picked a number, it happened to be 5, and raised it to the powers 672/3 and 2*672/3.
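In code, that computation looks like this. Raising 5 to the power 672/3 gives an element whose cube is 5^672 = 1 mod 673 by Fermat’s little theorem, and the result should reproduce the set {1, 255, 417}:

```python
# cube roots of unity mod 673, found by raising 5 to the powers 672/3 and 2*672/3
h1 = pow(5, 672 // 3, 673)
h2 = pow(5, 2 * 672 // 3, 673)

print(h1, h2)                            # the two non-trivial elements of H
print(pow(h1, 3, 673), pow(h2, 3, 673))  # both 1, by Fermat's little theorem
print(h1 * h2 % 673)                     # 1: they are inverses of each other
```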


I read somewhere that, contrary to popular belief, Kansas is not the flattest state in the US. Instead, Florida is the flattest, and Kansas was several notches further down the list.

(**Update**: Nevertheless, Kansas is literally flatter than a pancake. Thanks to Andreas Krause for the link.)

How would you measure the flatness of a geographic region? The simplest approach would be to look at the range of elevation from the highest point to the lowest point, so that’s what I did. Here are elevations and differences for each state, measured in feet.

|----------------+-------+------+-------|
| State          |  High |  Low |  Diff |
|----------------+-------+------+-------|
| Florida        |   345 |    0 |   345 |
| Delaware       |   450 |    0 |   450 |
| Louisiana      |   535 |   -7 |   542 |
| Mississippi    |   807 |    0 |   807 |
| Rhode Island   |   814 |    0 |   814 |
| Indiana        |  1257 |  322 |   935 |
| Illinois       |  1237 |  279 |   958 |
| Ohio           |  1549 |  456 |  1093 |
| Iowa           |  1670 |  479 |  1191 |
| Wisconsin      |  1952 |  581 |  1371 |
| Michigan       |  1982 |  571 |  1411 |
| Missouri       |  1772 |  230 |  1542 |
| Minnesota      |  2303 |  600 |  1703 |
| New Jersey     |  1804 |    0 |  1804 |
| Connecticut    |  2382 |    0 |  2382 |
| Alabama        |  2405 |    0 |  2405 |
| Arkansas       |  2756 |   56 |  2700 |
| North Dakota   |  3507 |  751 |  2756 |
| Pennsylvania   |  3215 |    0 |  3215 |
| Kansas         |  4042 |  679 |  3363 |
| Maryland       |  3363 |    0 |  3363 |
| Massachusetts  |  3491 |    0 |  3491 |
| South Carolina |  3563 |    0 |  3563 |
| Kentucky       |  4144 |  256 |  3888 |
| Vermont        |  4396 |   95 |  4301 |
| Nebraska       |  5427 |  840 |  4587 |
| West Virginia  |  4865 |  240 |  4625 |
| Oklahoma       |  4977 |  289 |  4688 |
| Georgia        |  4787 |    0 |  4787 |
| Maine          |  5269 |    0 |  5269 |
| New York       |  5348 |    0 |  5348 |
| Virginia       |  5732 |    0 |  5732 |
| South Dakota   |  7247 |  968 |  6279 |
| New Hampshire  |  6293 |    0 |  6293 |
| Tennessee      |  6647 |  177 |  6470 |
| North Carolina |  6690 |    0 |  6690 |
| Texas          |  8753 |    0 |  8753 |
| New Mexico     | 13169 | 2844 | 10325 |
| Wyoming        | 13812 | 3100 | 10712 |
| Montana        | 12808 | 1801 | 11007 |
| Colorado       | 14439 | 3314 | 11125 |
| Oregon         | 11247 |    0 | 11247 |
| Utah           | 13527 | 2001 | 11526 |
| Idaho          | 12667 |  712 | 11955 |
| Arizona        | 12631 |   69 | 12562 |
| Nevada         | 13146 |  479 | 12667 |
| Hawaii         | 13806 |    0 | 13806 |
| Washington     | 14419 |    0 | 14419 |
| California     | 14505 | -282 | 14787 |
| Alaska         | 20335 |    0 | 20335 |
|----------------+-------+------+-------|

By the measure used in the table above, Florida is about 10 times flatter than Kansas.
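By this measure, the relative flatness of two states is just the ratio of their elevation ranges. Using the numbers from the table:

```python
# elevation ranges (high minus low) in feet, from the table above
ranges = {"Florida": 345, "Kansas": 3363, "Alaska": 20335}

print(ranges["Kansas"] / ranges["Florida"])  # about 10
```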

The centroid of the continental US is in Kansas, and if you look at the center of the map above, you’ll see that there’s an elevation change across Kansas.

If I remember correctly, the source I saw said that Kansas was something like 9th, though in the table above it’s 20th. Maybe that source measured flatness differently. If you had a grid of measurements throughout each state, you could compute something like a Sobolev norm that accounts for gradients.

I haven’t seen the northern lights in person, and I didn’t know until she told me that the lights appear gray to the naked eye, like smoke. Sometimes the lights have a hint of color, but the color is nowhere near as intense as in photographs. Photos like the one above are not false color images. The green light is really there, even though it is not nearly as vivid to the naked eye as it is to a camera.

The reason the aurora appear gray is that human eyes don’t see in color well in dim light. At very low light levels, our vision is based entirely on rods, photoreceptor cells that are very sensitive to light but not to color. If aurora are bright enough, cones are activated and color becomes visible.

Why green in particular? It comes from excited oxygen atoms, emitting radiation at a wavelength of 557.7 nm. Other colors can be found in aurora, but green is most common for physical and physiological reasons. The physical reasons involve radiation and the chemical composition of the atmosphere. The physiological reason is that human vision is most sensitive around the frequency of green light.

Crockford recommends the following **check sum** procedure, a simple error detection code:

The check symbol encodes the number modulo 37, 37 being the least prime number greater than 32.

That is, we take the remainder when the base 32 number is divided by 37 and append the result to the original encoded number. The remainder could be larger than 31, so we need to expand our alphabet of symbols. Crockford recommends using the symbols *, ~, $, =, and U to represent remainders of 32, 33, 34, 35, and 36.

Crockford says his check sum will “detect wrong-symbol and transposed-symbol errors.” We will show that this is the case in the proof below.

Here’s a little Python code to demonstrate how the checksum works

from base32_crockford import encode, decode

s = "H88CMK9BVJ1V"
w = "H88CMK9BVJ1W"  # wrong last char
t = "H88CMK9BVJV1"  # transposed last chars

def append_checksum(s):
    return encode(decode(s), checksum=True)

print(append_checksum(s))
print(append_checksum(w))
print(append_checksum(t))

This produces the following output.

H88CMK9BVJ1VP
H88CMK9BVJ1WQ
H88CMK9BVJV1E

The checksum character of the original string is P. When the last character is changed, the checksum changes from P to Q. Similarly, transposing the last two characters changes the checksum from P to E.

The following code illustrates that the check sum can be a non-alphabetic character.

s = "H88CMK9BVJ10" n = decode(s) r = n % 37 print(r) print(encode(n, checksum=True))

This produces

32
H88CMK9BVJ10*

As we said above, a remainder of 32 is represented by *.

If you change one character in a base 32 number, its remainder by 37 will change as well, and so the check sum changes.

Specifically, if you change the *n*th digit from the right, counting from 0, by an amount *k*, then you change the number by *k* 32^{n}. Since 0 < *k* < 32, *k* is not divisible by 37, nor is 32^{n}, a power of 2. Because 37 is prime, *k* 32^{n} is not divisible by 37 [1]. The same argument holds if we replace 37 by any larger prime.

Now what about transpositions? If you swap consecutive digits *a* and *b* in a number, you also change the remainder by 37 (or any larger prime) and hence the check sum.

Again, let’s be specific. Suppose we transpose the *n*th and *n*+1st digits from the right, again counting from 0. Denote these digits by *a* and *b* respectively. Then swapping these two digits changes the number by an amount

(*b* 32^{n+1} + *a* 32^{n}) – (*a* 32^{n+1} + *b* 32^{n}) = 31(*b* – *a*) 32^{n}

If *a* ≠ *b*, then *b* – *a* is a number between -31 and 31, but not 0, and so *b* – *a* is not divisible by 37. Neither is 31, nor any power of 2. So we’ve changed the remainder by 37, i.e. changed the check sum. And as before, the same argument works for 37 replaced by any larger prime.
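The argument can also be checked exhaustively in code. The sketch below confirms that changing a single base-32 digit, or transposing two adjacent digits, always changes the value mod 37 (digit positions up to 12, matching the examples above):

```python
# Single-digit changes: changing digit n by k != 0 changes the number
# by k * 32**n, which must not be divisible by 37.
for n in range(13):
    for k in range(1, 32):
        assert (k * 32**n) % 37 != 0

# Adjacent transpositions: swapping unequal digits a and b in positions
# n and n+1 changes the number by 31*(b - a)*32**n, which must not be
# divisible by 37.
for n in range(13):
    for diff in range(1, 32):  # |b - a| ranges over 1..31
        assert (31 * diff * 32**n) % 37 != 0

print("no undetected single-digit or transposition errors mod 37")
```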

[1] A prime *p* divides a product *ab* only if it divides *a* or it divides *b*. This isn’t true for composite numbers. For example, 6 divides 4*9 = 36, but 6 doesn’t divide 4 or 9.

By convention, math books typically represent numbers in bases larger than 10 by using letters as new digit symbols following 9. For example, base 16 would use 0, 1, 2, …, 9, A, B, C, D, E, and F as its “digits.” This works for bases up to 36; base 36 would use all the letters of the alphabet. There’s no firm convention for whether to use upper or lower case letters.

The common use for base 64 encoding isn’t to represent bits as numbers per se, but to have an efficient way to transmit bits in a context that requires text characters.

There are around 100 possible characters on a keyboard, and 64 is the largest power of 2 less than 100 [1], and so base 64 is the most dense encoding using common characters in a base that is a power of 2.

Base 64 encoding does not follow the math convention of using the digits first and then adding more symbols; it’s free not to because there’s no intention of treating the output as numbers. Instead, the capital letters A through Z represent the numbers 0 through 25, the lower case letters a through z represent the numbers 26 through 51, and the digits 0 through 9 represent the numbers 52 through 61. The symbol + is used for 62 and / is used for 63.
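The mapping is easy to write down and check against the standard library's encoder:

```python
import base64
import string

alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
assert len(alphabet) == 64

# Check against the standard library: the 6-bit value 0 encodes as 'A',
# 26 as 'a', 52 as '0', 62 as '+', and 63 as '/'.
for value in [0, 26, 52, 62, 63]:
    # place the 6-bit value in the top of a 3-byte group, so it becomes
    # the first output character
    encoded = base64.b64encode(bytes([value << 2, 0, 0]))
    assert chr(encoded[0]) == alphabet[value]

print(alphabet)
```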

Douglas Crockford proposed an interesting form of base 32 encoding. His encoding mostly follows the math convention: 0, 1, 2, …, 9, A, B, …, except he does not use the letters I, L, O, and U. This eliminates the possibility of confusing i, I, or l with 1, or confusing O with 0. Crockford had one more letter he could eliminate, and he chose U in order to avoid an “accidental obscenity.” [2]
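Crockford's 32-symbol alphabet can be constructed directly: the ten digits, then the alphabet with I, L, O, and U removed.

```python
import string

excluded = set("ILOU")
crockford = string.digits + "".join(c for c in string.ascii_uppercase
                                    if c not in excluded)

print(crockford)
assert len(crockford) == 32
```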

Crockford’s base 32 encoding is a compromise between efficiency and human legibility. It is more efficient than hexadecimal, representing 25% more bits per character. It’s less efficient than base 64, representing 17% fewer bits per character, but is more legible than base 64 encoding because it eliminates commonly confused characters.

His encoding is also case insensitive. He recommends using only capital letters for output, but permitting upper or lower case letters in input. This is in the spirit of **Postel’s law**, also known as **the robustness principle**:

Be conservative in what you send, and liberal in what you accept.

See the next post for an explanation of Crockford’s check sum proposal.

Here’s a Python script to generate passwords using Crockford’s base 32 encoding.

from secrets import randbits
from base32_crockford import encode

def gen_pwd(numbits):
    print(encode(randbits(numbits)))

For example, `gen_pwd(60)` would create a 12-character password with 60 bits of entropy, and this password would be free of commonly confused characters.

[1] We want to use powers of 2 because it’s easy to convert between base 2 and base 2^{n}: start at the right end and convert bits in groups of *n*. For example, to convert a binary string to hexadecimal (base 2^{4} = 16), convert groups of four bits each to hexadecimal. So to convert the binary number 101111001 to hex, we break it into 1 0111 1001 and convert each piece to hex, with 1 -> 1, 0111 -> 7, and 1001 -> 9, to find 101111001 -> 179. If we used a base that is not a power of 2, the conversion would be more complicated and not so localized.
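The worked example can be checked in a line or two of Python:

```python
# convert binary 101111001 to hex by grouping bits as 1 0111 1001
n = int("101111001", 2)
print(format(n, "x"))  # prints 179
```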

[2] All the words on George Carlin’s infamous list include either an I or a U, and so none can result from Crockford’s base 32 encoding. If one were willing to risk accidental obscenities, it would be good to put U back in and remove S since the latter resembles 5, particularly in some fonts.

**Astronomy**

**Programming**

**Computer arithmetic**

**Math**

**Statistics**

Written in hexadecimal, the newly discovered prime is

For decades the largest known prime has been a Mersenne prime because there’s an efficient test for checking whether a Mersenne number is prime. I explain the test here.

There are now 51 known Mersenne primes. There may be infinitely many Mersenne primes, but this hasn’t been proven. Also, the newly discovered Mersenne prime is the 51st *known* Mersenne prime, but it may not be the 51st Mersenne prime, i.e. there may be undiscovered Mersenne primes hiding between the last few that have been discovered.

Three weeks ago I wrote a post graphing the trend in Mersenne primes. Here’s the updated post. Mersenne primes have the form 2^{p} − 1 where *p* is a prime, and this graph plots the values of *p* on a log scale. See the earlier post for details.

Because there’s a one-to-one correspondence between Mersenne primes and even perfect numbers, the new discovery means there is also a new perfect number. *M* is a Mersenne prime if and only if *M*(*M* + 1)/2 is an even perfect number. This is also the *M*th triangular number.
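The correspondence is easy to verify for small exponents. Here’s a quick check using SymPy (my sketch; the list of exponents is just the first few that yield Mersenne primes):

```python
from sympy import isprime, divisors

# For each Mersenne prime M = 2**p - 1, the number M(M+1)/2 should be
# an even perfect number, i.e. the sum of all its divisors equals 2x itself.
for p in [2, 3, 5, 7, 13]:
    M = 2**p - 1
    assert isprime(M)              # M is a Mersenne prime
    perfect = M * (M + 1) // 2     # the Mth triangular number
    assert sum(divisors(perfect)) == 2 * perfect
    print(perfect)                 # prints 6, 28, 496, 8128, 33550336
```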

We’ll briefly review our approach to adaptive randomization but not go into much detail. For a more thorough introduction, see, for example, this report.

Adaptive randomization designs **allow the randomization probabilities to change** in response to accumulated outcome data so that more subjects are assigned to (what appear to be) more effective treatments. They also **allow for continuous monitoring**, so one can stop a trial early if we’re sufficiently confident that we’ve found the best treatment.

Of course we don’t *know* what treatments are more effective or else we wouldn’t be running a clinical trial. We have an idea based on the data seen so far at any point in the trial, and that assessment may change as more data become available.

We could simply put each patient on what appears to be the best arm (the “play-the-winner” strategy), but this would forgo the benefits of randomization. Instead we compromise, continuing to randomize, but increasing the randomization probability for what appears to be the best treatment at the time a subject enters the trial.

By monitoring the performance of each treatment arm, we can drop a poorly performing arm and assign subjects to the better treatment arms. This is particularly important for multi-arm trials. We want to weed out the poor treatments quickly so we can focus on the more promising treatments.

Continuous monitoring opens the possibility of stopping trials early if there is a clear winner. If all treatments perform similarly, more patients will be used. The maximum number of patients will be enrolled only if necessary.

Randomizing an equal number of patients to each of several treatment arms would require a lot of subjects. A multi-arm adaptive trial turns into a two-arm trial once the other arms are dropped. We’ll present simulation results below that demonstrate this.

Running a big trial with several treatment arms could be **more cost effective** than running several smaller trials because there is a certain fixed cost associated with running any trial, no matter how small: protocol review, IRB approval, etc.

There has been some skepticism about whether two-arm adaptively randomized trials live up to their hype. Trial design is a multi-objective optimization problem, and it’s easy to claim victory by doing better by one criterion while doing worse by another. In my opinion, adaptive randomization is more promising for multi-arm trials than for two-arm trials.

In my experience, multi-arm trials benefit more from early stopping than from adapting randomization probabilities. That is, one may treat more patients effectively by randomizing equally but dropping poorly performing treatments. Instead of reducing the probability of assigning patients to a poor treatment arm, continue to randomize equally so you can more quickly gather enough evidence to drop the arm.

I initially thought that gradually decreasing the randomization probability of a poorly performing arm would be better than keeping the randomization probability equal until it suddenly drops to zero. But experience suggests this intuition was wrong.

I designed a 4-arm trial with a uniform prior on the probability of response on each arm. The maximum accrual was set to 400 patients.

An arm is suspended if the posterior probability of it being best drops below 0.05. (Note: A suspended arm is not necessarily closed. It may become active again in response to more data.)

Subjects are randomized equally to all available arms. If only one arm is available, the trial stops. Each trial was simulated 1,000 times.
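For concreteness, here is a minimal sketch of a design along these lines. This is my own reconstruction, not the software used for the simulations reported below: it assumes binary outcomes, independent Beta(1, 1) priors, and a Monte Carlo estimate of the posterior probability that each arm is best.

```python
import numpy as np

rng = np.random.default_rng(20190101)  # arbitrary seed for reproducibility

def prob_best(successes, failures, ndraws=2000):
    # Monte Carlo estimate of P(arm i has the highest response rate),
    # with independent Beta(1, 1) priors on each arm's response rate
    draws = rng.beta(successes + 1, failures + 1, size=(ndraws, len(successes)))
    return np.bincount(draws.argmax(axis=1), minlength=len(successes)) / ndraws

def simulate_trial(true_rates, max_accrual=400, threshold=0.05):
    k = len(true_rates)
    successes, failures = np.zeros(k), np.zeros(k)
    for _ in range(max_accrual):
        p = prob_best(successes, failures)
        active = np.where(p >= threshold)[0]  # suspended arms may re-activate later
        if len(active) <= 1:
            break  # only one arm remains available: stop the trial
        arm = rng.choice(active)  # randomize equally among available arms
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes + failures  # patients assigned to each arm

counts = simulate_trial([0.3, 0.4, 0.5, 0.6])
print(counts, counts.sum())
```

Averaging over many simulated trials gives numbers of the kind reported in the tables below; the posterior sample size, seed, and stopping rule details here are all assumptions for illustration.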

In the first scenario, I assume the true probabilities of successful response on the treatment arms are 0.3, 0.4, 0.5, and 0.6 respectively. The treatment arm with 30% response was dropped early in 99.5% of the simulations, and on average only 12.8 patients were assigned to this treatment.

|-----+----------+----------------+----------|
| Arm | Response | Pr(early stop) | Patients |
|-----+----------+----------------+----------|
|   1 |      0.3 |          0.995 |     12.8 |
|   2 |      0.4 |          0.968 |     25.7 |
|   3 |      0.5 |          0.754 |     60.8 |
|   4 |      0.6 |          0.086 |     93.9 |
|-----+----------+----------------+----------|

An average of 193.2 patients were used out of the maximum accrual of 400. Note that 80% of the subjects were allocated to the two best treatments.

Here are the results for the second scenario. Note that in this scenario there are two bad treatments and two good treatments. As we’d hope, the two bad treatments are dropped early and the trial concentrates on deciding the better of the two good treatments.

|-----+----------+----------------+----------|
| Arm | Response | Pr(early stop) | Patients |
|-----+----------+----------------+----------|
|   1 |     0.35 |          0.999 |     11.7 |
|   2 |     0.45 |          0.975 |     22.5 |
|   3 |     0.60 |          0.502 |     85.6 |
|   4 |     0.65 |          0.142 |    111.0 |
|-----+----------+----------------+----------|

An average of 230.8 patients were used out of the maximum accrual of 400. Now 85% of patients were assigned to the two best treatments. More patients were used in this scenario because the two best treatments were harder to tell apart.

The contraction mapping theorem says that if a function moves points closer together, then there must be some point the function doesn’t move. We’ll make this statement more precise and give a historically important application.

A function *f* on a metric space *X* is a **contraction** if there exists a constant *q* with 0 ≤ *q* < 1 such that for any pair of points *x* and *y* in *X*,

*d*( *f*(*x*), *f*(*y*) ) ≤ *q* *d*(*x*, *y*)

where *d* is the metric on *X*.

A point *x* is a **fixed point** of a function *f* if *f*(*x*) = *x*.

**Banach’s fixed point theorem**, also known as the **contraction mapping theorem**, says that every contraction on a complete metric space has a fixed point. The proof is constructive: start with any point in the space and repeatedly apply the contraction. The sequence of iterates will converge to the fixed point.

Kepler’s equation for an object in an elliptical orbit says

*M* + *e* sin *E* = *E*

where *M* is the **mean anomaly**, *e* is the **eccentricity**, and *E* is the **eccentric anomaly**. These “anomalies” are parameters that describe the location of an object in orbit. Kepler solved for *E* given *M* and *e* using the contraction mapping theorem, though he didn’t call it that.

Kepler speculated that it is not possible to solve for *E* in closed form—he was right—and used a couple iterations [1] of

*f*(*E*) = *M* + *e* sin *E*

to find an approximate fixed point. Since the mean anomaly is a good approximation for the eccentric anomaly, *M* makes a good starting point for the iteration. The iteration will converge from any starting point, as we will show below, but you’ll get a useful answer sooner starting from a good approximation.

Kepler came up with his idea for finding *E* around 1620, and Banach stated his fixed point theorem three centuries later. Kepler had the idea of Banach’s theorem, but he didn’t have a rigorous formulation of the theorem or a proof.

In modern terminology, the real line is a complete metric space and so we only need to prove that the function *f* above is a contraction. By the mean value theorem, it suffices to show that the absolute value of its derivative is less than 1. That is, we can use an upper bound on |*f* ′| as the *q* in the definition of a contraction.

Now

*f* ′(*E*) = *e* cos *E*

and so

|*f* ′(*E*)| ≤ *e*

for all *E*. If our object is in an elliptical orbit, *e* < 1 and so we have a contraction.

The following example comes from [2], though the author uses Newton’s method to solve Kepler’s equation. This is more efficient, but anachronistic.

Consider a satellite on a geocentric orbit with eccentricity *e* = 0.37255. Determine the true anomaly at three hours after perigee passage, and calculate the position of the satellite.

The author determines that *M* = 3.6029 and solves Kepler’s equation

*M* + *e* sin *E* = *E*

for *E*, which she then uses to solve for the true anomaly and position of the satellite.

The following Python code shows the results of the first 10 iterations of Kepler’s equation.

```python
from math import sin

M = 3.6029
e = 0.37255

E = M  # the mean anomaly makes a good starting point
for _ in range(10):
    E = M + e*sin(E)
    print("%.6f" % E)
```

This produces

```
3.437070
3.494414
3.474166
3.481271
3.478772
3.479650
3.479341
3.479450
3.479412
3.479425
```

and so it appears the iteration has converged to *E* = 3.4794 to four decimal places.

Note that this example has a fairly large eccentricity. Presumably Kepler would have been concerned with much smaller eccentricities. The eccentricity of Jupiter’s orbit, for example, is around 0.05. For such small values of *e* the iteration would converge more quickly.

[1] Bertil Gustafsson says in his book Scientific Computing: A Historical Perspective that Kepler used only two iterations. Since *M* gives a good starting approximation to *E*, two iterations would give a good answer. I imagine Kepler would have done more iterations if necessary but found empirically that two were enough. Incidentally, it appears Gustafsson has a sign error in his statement of Kepler’s equation.

[2] Euler Celestial Analysis by Dora Musielak.

]]>Here is the corresponding LaTeX source code.

There were two problems: the trademark symbol and the non-printing symbol denoted by a red underscore in the source file. The trademark was a non-ASCII character (Unicode U+2122) and the underscore represented a non-printing character (U+00A0). At first I only noticed the trademark symbol, and I fixed it by including a LaTeX package to allow Unicode characters:

\usepackage[utf8x]{inputenc}

An alternative fix, one that doesn’t require including a new package, would be to replace the trademark Unicode character with `\texttrademark\`. Note the trailing backslash. Without the backslash there would be no space after the trademark symbol. The problem with the unprintable character would remain, but the character could just be deleted.

I found out there are two Unicode code points that render the trademark glyph, U+0099 and U+2122. The former is in the Latin 1 Supplement section and is officially a control character. The correct code point for the trademark symbol is the latter. Unicode files U+2122 under Letterlike Symbols and gives it the official name TRADE MARK SIGN.

[1] Jay Schinfeld, Fady Sharara, Randy Morris, Gianpiero D. Palermo, Zev Rosenwaks, Eric Seaman, Steve Hirshberg, John Cook, Cristina Cardona, G. Charles Ostermeier, and Alexander J. Travis. Cap-Score Prospectively Predicts Probability of Pregnancy, Molecular Reproduction and Development. To appear.

]]>Suppose two people share one prime. That is, one person chooses primes *p* and *q* and the other chooses *p* and *r*, and *q* ≠ *r*. (This has happened [1].) Then you can easily find *p*. All three primes can be obtained in less than a millisecond as illustrated by the Python script below.

In a way it would be better to share *both* your primes with someone else than to share just one. If both your primes are the same as someone else’s, you could read each other’s messages, but a third party attacker could not. If you’re lucky, the person you share primes with doesn’t know that you share primes or isn’t interested in reading your mail. But if that person’s private key is ever disclosed, so is yours.

Nearly all the time required to run the script is devoted to finding the primes, so we time just the factorization.

```python
from secrets import randbits
from sympy import nextprime, gcd
from timeit import default_timer as timer

numbits = 2048

p = nextprime(randbits(numbits))
q = nextprime(randbits(numbits))
r = nextprime(randbits(numbits))

m = p*q
n = p*r

t0 = timer()
g = gcd(m, n)
assert(p == g)
assert(q == m//p)
assert(r == n//p)
t1 = timer()

# Print time in milliseconds
print(1000*(t1 - t0))
```

There are a few noteworthy things about the Python code.

First, it uses a cryptographic random number generator, not the default generator, to create 2048-bit random numbers.

Second, it uses the portable `default_timer`, which chooses the appropriate timer on different operating systems. Note that it returns wall clock time, not CPU time.

Finally, it uses integer division `//` to recover *q* and *r*. If you use the customary division operator `/`, Python will carry out the integer division and then attempt to cast the result to a float, resulting in the error message “OverflowError: integer division result too large for a float.”

If you wanted to find the greatest common divisor of two small numbers by hand, you’d probably start by factoring the two numbers. But as Euclid knew, you don’t have to *factor* two numbers before you can find the greatest *common factor* of the numbers. For large numbers, the Euclidean algorithm is orders of magnitude faster than factoring.
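For reference, the classic algorithm fits in a few lines (a textbook sketch, not necessarily what SymPy does internally):

```python
def euclid_gcd(a, b):
    # gcd(a, b) = gcd(b, a mod b); iterate until the remainder is zero
    while b:
        a, b = b, a % b
    return a

# Find the common factor without factoring either number
print(euclid_gcd(2*3*7, 3*7*11))  # prints 21
```

Each step replaces the larger number by a remainder, so the inputs shrink rapidly and no factoring is ever needed.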

Nobody has announced being able to factor a 1024-bit number yet. The numbers *m* and *n* above have four times as many bits. The largest of the RSA numbers factored so far has 768 bits. This was done in 2009 and took approximately two CPU millennia, *i.e.* around 2000 CPU years.

The classic Euclidean algorithm takes *O*(*n*²) operations to find the greatest common divisor of two integers of length *n*. But there are modern sub-quadratic variations on the Euclidean algorithm that take

*O*(*n* log²(*n*) log(log(*n*)))

operations, which is much faster for very large numbers. I believe SymPy is using the classic Euclidean algorithm, but it could transparently switch to using the fast Euclidean algorithm for large inputs.

- How long does it take to find large primes?
- Fermat’s factoring trick
- Most RSA implementations use the same public exponent

[1] As noted in the comments, this has happened due to a faulty software implementation, but it shouldn’t happen by chance. By the prime number theorem, the number of *n*-bit primes is approximately 2^{n} / (*n* log 2). If *M* people choose 2*M* primes, each *n* bits long, the probability of two of these primes being the same is roughly

2^{2-n} *M* *n* log 2.

If *M* = 7.5 billion (the current world population) and *n* = 1024, the probability of two primes being the same is about 10^{-295}. This is roughly the probability of winning the Powerball jackpot 40 times in a row.
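As a sanity check on that arithmetic, we can evaluate the formula in logarithms to avoid underflow (a quick back-of-the-envelope script):

```python
from math import log10, log

M = 7.5e9    # world population
n = 1024     # bits per prime

# log10 of 2^(2-n) * M * n * log(2)
log10_p = (2 - n)*log10(2) + log10(M) + log10(n*log(2))
print(log10_p)   # about -295
```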

Suppose you have a football field with area *A*. If you make two parallel sides twice as long, then the area will be 2*A*. If you double the length of the sides again, the area will be 4*A*. Following this reasoning to its logical conclusion, you could double the length of the sides as many times as you wish, say 15 times, and each time the area doubles.

Except that’s not true. By the time you’ve doubled the length of the sides 15 times, you have a shape so big that it is far from being a rectangle. The fact that Earth is round matters a lot for a figure that big.

Euclidean geometry models our world really well for rectangles the size of a football field, or even rectangles the size of Kansas. But eventually it breaks down. If the top extends to the north pole, your rectangle becomes a spherical triangle.

The problem in this example isn’t logic; it’s geometry. If you double the length of the sides of a Euclidean rectangle 15 times, you do double the area 15 times. A football field is not exactly a Euclidean rectangle, though it’s close enough for all practical purposes. Even Kansas is a Euclidean rectangle for most practical purposes. But a figure on the surface of the earth with sides thousands of miles long is definitely not Euclidean.

Models are based on experience with data within some range. The surprising thing about Newtonian physics is not that it breaks down at a subatomic scale and at a cosmic scale. The surprising thing is that it is usually adequate for everything in between.

Most models do not scale up or down over anywhere near as many orders of magnitude as Euclidean geometry or Newtonian physics. If a dose-response curve, for example, is linear based on observations in the range of 10 to 100 milligrams, nobody in his right mind would expect the curve to remain linear for doses up to a kilogram. It wouldn’t be surprising to find out that linearity breaks down before you get to 200 milligrams.

Kevin Kelly is one of the most optimistic people writing about technology, but there’s a nuance to his optimism that isn’t widely appreciated.

Kelly sees technological progress as steady and inevitable, but not monotone. He has often said that new technologies create almost as many problems as they solve. Maybe it’s 10 steps forward and 9 steps back, but as long as the net motion is forward, progress accumulates.

A naive form of technological optimism believes that everything is getting better every day in every way for everyone, and to suggest otherwise is heresy. Kevin Kelly has been called a “wild-eyed optimist,” but even he is not a dogmatic optimist on a micro scale.

Even if most people believe an innovation is an improvement by the criteria they care most about, that doesn’t mean it’s an improvement for every purpose. You’re not a Luddite just because you think Version 8 of some software package was better than Version 9 for some task.

***

Photo of Kevin Kelly by Christopher Michel [CC BY 2.0] via Wikipedia.

]]>Recall the setup for RSA encryption given in the previous post.

- Select two very large prime numbers *p* and *q*.
- Compute *n* = *pq* and φ(*n*) = (*p* − 1)(*q* − 1).
- Choose an encryption key *e* relatively prime to φ(*n*).
- Calculate the decryption key *d* such that *ed* = 1 (mod φ(*n*)).
- Publish *e* and *n*, and keep *d*, *p*, and *q* secret.

φ is Euler’s totient function, defined here.

There’s a complication in the first step. Maybe you *think* the numbers *p* and *q* are prime, but they’re not. In that case the calculation of φ in step 2 is wrong.

The numbers *p* and *q* need to be “very large,” where the definition of what constitutes large changes over time due to Moore’s law and progress with factoring algorithms. Currently *p* and *q* would typically have at least 2048 bits each. It’s easy to find numbers that large that are *probably* prime, but it’s difficult to be certain.

At the moment, the fastest way to test for primes has a small chance of making a false positive error, but no chance of a false negative. That is, if the test says a number is composite, it is certainly composite. But if the test says a number may be prime, there is a small chance that it is not prime. (Details here.) So in practice RSA starts by finding two large **probable primes** or **pseudoprimes**.

Discussions of RSA encryption often point out that large pseudoprimes are very rare, so it isn’t a problem that RSA starts with pseudoprimes. But what does that mean? Is there a one in a trillion chance that your private key won’t work and nobody can send you messages? Or that you can receive some messages and not others?

RSA encryption works by taking a message *m* and raising it to the exponent *e* modulo *n* where *e* and *n* are defined at the top of the post. To decrypt the message, you raise it to the exponent *d* modulo *n* where *d* is your private decryption key. Because *d* was computed to satisfy

*ed* = 1 (mod φ(*n*)),

Euler’s theorem says that we’ll get our message back. We’ll give an example below.

If *p* and *q* are prime, then φ(*n*) = (*p* – 1)(*q* – 1). But if we’re wrong in our assumption that one of these factors is prime, our calculation of φ(*n*) will be wrong. Will our encryption and decryption process work anyway? Maybe.

We’ll do three examples below, all using small numbers. We’ll use *e* = 257 as our public encryption exponent and *m* = 42 as the message to encrypt in all examples.

In the first example *p* and *q* are indeed prime, and everything works as it should. In the next two examples we will replace *p* with a pseudoprime. In the second example everything works despite our error, but in the third example decryption fails.

We’ll start with *p* = 337 and *q* = 283. Both are prime. The Python code below shows that *d* = 60833 and the encrypted message is 45431. Everything works as advertised.

Now we use *p* = 341 and *q* = 283. Here *p* is a pseudoprime for base 2, i.e.

2^{340} = 1 (mod 341)

and so 341 passes Fermat’s primality test [1] for *b* = 2. Now *d* = 10073 and the encrypted message is 94956. Decrypting the encrypted message works, even though *p* is not prime and our calculation for φ is wrong. In fact, the process works not just for our message *m* = 42 but for every message.

Here again we let *p* be the pseudoprime 341 but set *q* to the prime 389. Now *d* = 6673, the encrypted message is 7660, but decrypting the encrypted message returns 55669, not 42 as we started with. Decryption fails for all other messages as well.

If we use the correct value for φ(*pq*) the example works. That is, if we use

φ(341*389) = φ(11*31*389) = 10*30*388

rather than the incorrect value 340*388 the decryption process recovers our original message.
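We can check this claim with SymPy, which computes φ from the true factorization (a sketch; the variable names are mine):

```python
from sympy import totient, mod_inverse

p, q = 341, 389      # 341 = 11*31 is not prime
e, m = 257, 42
n = p * q

phi = int(totient(n))   # 10*30*388 = 116400, not the incorrect 340*388
d = mod_inverse(e, phi)

# Encryption followed by decryption now recovers the message
assert pow(pow(m, e, n), d, n) == m
print(phi)  # prints 116400
```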

The examples above show that if we mistakenly think one of our numbers is a prime when it is only a pseudoprime, decryption might succeed or it might fail. In either case, we assume *n* = *pq* has two prime factors when in fact it has more, and so *n* is easier to factor than we thought.

Here’s the code for the examples above.

```python
from sympy import gcd, mod_inverse

message = 42
e = 257

def example(p, q):
    n = p*q
    phi = (p-1)*(q-1)
    assert( gcd(e, phi) == 1 )
    d = mod_inverse(e, phi)
    assert( d*e % phi == 1 )
    encrypted = pow(message, e, n)
    decrypted = pow(encrypted, d, n)
    return (message == decrypted)

print(example(337, 283))
print(example(341, 283))
print(example(341, 389))
```

[1] Fermat’s primality test is explained here. In our example we only tried one base, *b* = 2. If we had also tried *b* = 3 we would have discovered that 341 is not prime. In practice one would try several different bases. Using the heuristic that failures are independent for each base, it’s very unlikely that a composite number would be a pseudoprime for each of, say, 50 different bases.

Imagine this conversation.

“Could you tell me your social security number?”

“Absolutely not! That’s private.”

“OK, how about just the last four digits?”

“Oh, OK. That’s fine.”

When I was in college, professors would post grades by the last four digits of student social security numbers. Now that seems incredibly naive, but no one objected at the time. Using these four digits rather than names would keep your grades private from the most lazy observer but not from anyone willing to put out a little effort.

There’s a widespread belief in the US that your social security number is a deep secret, and that telling someone your social security number gives them power over you akin to a fairy telling someone his true name. On the other hand, we also believe that telling someone just the last four digits of your SSN is harmless. Both are wrong. It’s not that hard to find someone’s full SSN, and revealing the last four digits gives someone a lot of information to use in identifying you.

In an earlier post I looked at how easily most people could be identified by the combination of birth date, sex, and zip code. We’ll use the analytical results from that post to look at how easily someone could be identified by their birthday, state, and the last four digits of their SSN [1]. Note that the previous post used birth *date*, i.e. including year, where here we only look at birth *day*, i.e. month and day but no year. Note also that there’s nothing special about social security numbers for our purposes. The last four digits of your phone number would provide just as much information.

If you know someone lives in Wyoming, and you know their birthday and the last four digits of their SSN, you can uniquely identify them 85% of the time, and in an addition 7% of cases you can narrow down the possibilities to just two people. In Texas, by contrast, the chances of a birthday and four-digit ID being unique are 0.03%. The chances of narrowing the possibilities to two people are larger but still only 0.1%.

Here are results for a few states. Note that even though Texas has between two and three times the population of Ohio, it’s over 100x harder to uniquely identify someone with the information discussed here.

|-----------+------------+--------+--------|
| State     | Population | Unique | Pairs  |
|-----------+------------+--------+--------|
| Texas     | 30,000,000 |  0.03% |  0.11% |
| Ohio      | 12,000,000 |  3.73% |  6.14% |
| Tennessee |  6,700,000 | 15.95% | 14.64% |
| Wyoming   |    600,000 | 84.84% |  6.97% |
|-----------+------------+--------+--------|
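The Unique column is easy to approximate analytically. Assuming 366 possible birthdays and 10,000 four-digit endings, uniform and independent, a person is uniquely identified when none of the other *N* − 1 residents shares their combination. The following rough calculation (mine, not the analysis from the earlier post) comes close to the figures above:

```python
D = 366 * 10_000   # birthday x last-four-digits combinations

def prob_unique(N):
    # Probability that none of the other N-1 people share your combination
    return (1 - 1/D) ** (N - 1)

for state, N in [("Texas", 30_000_000), ("Ohio", 12_000_000),
                 ("Tennessee", 6_700_000), ("Wyoming", 600_000)]:
    print(state, prob_unique(N))
```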

[1] In that post we made the dubious simplifying assumption that birth dates were uniformly distributed from 0 to 78 years. This assumption is not accurate, but it was good enough to prove the point that it’s easier to identify people than you might think. Here our assumptions are better founded. Birthdays are nearly uniformly distributed, though there are some slight irregularities. The **last** four digits of social security numbers are uniformly distributed, though the first digits are correlated with the state.

Here’s the setup for RSA encryption in a nutshell.

- Select two very large prime numbers *p* and *q*.
- Compute *n* = *pq* and φ(*n*) = (*p* − 1)(*q* − 1).
- Choose an encryption key *e* relatively prime to φ(*n*).
- Calculate the decryption key *d* such that *ed* = 1 (mod φ(*n*)).
- Publish *e* and *n*, and keep *d*, *p*, and *q* secret.

The asymmetry in the encryption scheme comes from the fact that the person who knows *p* and *q* can compute *n*, φ(*n*), and *d* easily, but no one else can find *d* without factoring *n*. (Or so it seems.)

Here we focus on the third step above. In theory *e* could be very large, but very often *e* = 65537. In that case, we don’t actually choose *e* to be relatively prime to φ(*n*). Instead, we pick *p* and *q*, and hence *n*, and usually φ(*n*) will be relatively prime to 65537, but if it’s not, we choose *p* and *q* again until gcd(65537, φ(*n*)) = 1. Kraft and Washington [1] have this to say about the choice of *e*:

… surprisingly, *e* does not have to be large. Sometimes *e* = 3 has been used, although it is usually recommended that larger values be used. In practice, the most common value is *e* = 65537. The advantages are that 65537 is a relatively large prime, so it’s easier to arrange that gcd(*e*, φ(*n*)) = 1, and it is one more than a power of 2, so raising a number to the 65537th power consists mostly of squarings …

In short, the customary value of *e* was chosen for efficiency. Also, there aren’t many choices for similar values of *e* [3].

Heninger and Shacham [2] go into more detail.

The choice *e* = 65537 = 2^{16} + 1 is especially widespread. Of the certificates observed in the UCSD TLS Corpus (which was obtained by surveying frequently-used TLS servers), 99.5% had *e* = 65537, and all had *e* at most 32 bits.

So the “most common value” mentioned by Kraft and Washington appears to be used 99.5% of the time.

Is it a problem that *e* is very often the same number? At first glance, no. If the number *e* is going to be public anyway, there doesn’t seem to be any harm in selecting it even before you choose *p* and *q*.

There may be a problem, however, that the value nearly everyone has chosen is fairly small. In particular, [2] gives an algorithm for recovering private RSA keys when some of the bits are known and the exponent *e* is small. How would some of the bits be known? If someone has physical access to a computer, they can recover some bits from its memory.

[1] James S. Kraft and Lawrence C. Washington. An Introduction to Number Theory with Cryptography. 2nd edition.

[2] Nadia Heninger and Hovav Shacham. Reconstructing RSA Private Keys from Random Key Bits.

[3] The property that makes *e* an efficient exponent is that it is a Fermat prime, i.e. it has the form 2^{2^{m}} + 1, here with *m* = 4. Raising a number to a Fermat prime exponent takes 2^{m} squarings and one multiplication. And being prime increases the chances that it will be relatively prime to φ(*n*). Unfortunately 65537 may be the largest Fermat prime. We know for certain it’s the largest Fermat prime that can be written in less than a billion digits. So if there are more Fermat primes, they’re too large to use.

However, there are *smaller* Fermat primes than 65537, and in fact the quote from [1] mentions the smallest, 3, has been used in practice. The remaining Fermat primes are 5, 17, and 257. Perhaps these have been used in practice as well, though that may not be a good idea.
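To make the operation count concrete, here is a sketch of left-to-right square-and-multiply that counts squarings and multiplications (the function and counts are my illustration, not from either reference):

```python
def pow_with_counts(base, exp, mod):
    # Left-to-right square-and-multiply, starting from the exponent's leading bit
    result = base % mod
    squarings = multiplications = 0
    for bit in bin(exp)[3:]:          # the bits after the leading 1
        result = result * result % mod
        squarings += 1
        if bit == '1':
            result = result * base % mod
            multiplications += 1
    return result, squarings, multiplications

r, s, m = pow_with_counts(3, 65537, 1_000_003)
assert r == pow(3, 65537, 1_000_003)
print(s, m)   # 16 squarings and 1 multiplication for e = 65537
```

Because 65537 in binary is a 1 followed by fifteen 0s and a final 1, almost all of the work is squaring.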

A similar issue arises in differential privacy. A practical implementation of differential privacy must rule out some queries a priori in some way that is not data-dependent. Otherwise it could be a violation of differential privacy **either to answer or refuse to answer** the same query. Short of refusing the query entirely, it could also be informative to reply “Are you sure you want to submit that query? It’s going to use up a lot of your **privacy budget**.”

Differential privacy adds random noise to query results, but in a more sophisticated way than Yandex blurring out portions of photos. Adding noise selectively can itself reveal information about what the noise is trying to conceal. As the FAS article indicated, by blurring out only the sensitive areas, the blurred maps point out their exact location.

**Selective security measures** can have a similar effect. By placing greater protection on some information, you can mark that information as being particularly important. Once you realize this, otherwise nonsensical security measures start to make sense: applying uniform security results in excessive protection for some information, but keeps attackers from being able to identify the most valuable targets.

In historical order, or at least in the order of Gustafsson’s book:

- Fixed point iteration
- Newton’s method for root-finding
- Runge phenomenon
- Gauss quadrature
- Taylor series
- Fourier series
- Gibbs phenomenon
- Fourier transform
- Runge-Kutta
- Linear programming
- Nonlinear optimization
- Iterative linear solvers
- Natural cubic spline interpolation
- Wavelets
- Fast Fourier Transform (FFT)
- Singular value decomposition
- Monte Carlo methods
- MCMC

The links above include numerical methods I have written about. What about the methods I have *not* written about?

I like to write fairly short, self-contained blog posts, and I’ve written about algorithms compatible with that. The methods in Gustafsson’s book that I haven’t written about are mostly methods in partial differential equations. I don’t see how to write short, self-contained posts about numerical methods for PDEs. In any case, my impression is that not many readers would be interested. If you are one of the few readers who *would* like to see more about PDEs, you may enjoy the following somewhat personal rambling about PDEs.

I studied PDEs in grad school—mostly abstract theory, but also numerical methods—and expected PDEs to be a big part of my career. I’ve done some professional work with differential equations, but the demand for other areas, particularly probability and statistics, has been far greater. In college I had the impression that applied math was practically synonymous with differential equations. I think that was closer to being true a generation or two ago than it is now.

My impression of the market demand for various kinds of math is no doubt colored by my experience. When I was at MD Anderson we had one person in biomathematics working in PDEs, and then none when she left. There may have been people at MDACC doing research into PDEs for modeling cancer, but they weren’t in the biostatistics or biomathematics departments. I know that people are working with differential equations in cancer research, but not at the world’s largest cancer center, and so I expect there aren’t many doing research in that area.

Since leaving MDACC to work as a consultant I’ve seen little demand for differential equations, more in Kalman filters than in differential equations per se. A lot of companies hire people to numerically solve PDEs, but there don’t seem to be many who want to use a consultant for such work. I imagine most companies with an interest in PDEs are large and have a staff of engineers and applied mathematicians working together. There’s more demand for statistical consulting because companies are likely to have an occasional need for statistics. The companies that need PDEs, say for making finite element models of oil reservoirs or airplanes, have an ongoing need and hire accordingly.

Suppose average life expectancy is around 78 years. If everyone is between 0 and 78, then there are 78*365 possible birth dates and twice that many combinations of birth date and sex.

What’s the average population of a US zip code? We’ll use 9,330 for reasons explained in [1].

We have 56,940 possible birth date and sex combinations for 9,330 people. There have to be many unused birth date and sex combinations in a typical zip code, and it’s plausible that many combinations will only be used once. We’ll run a Python simulation to see just how many we’d expect to be used one time.

The array `demo` below will keep track of the possible demographic values, i.e. combinations of birth date and sex. We'll loop over the population of the zip code, randomly assigning everyone to a demographic value, then see what proportion of demographic values is used only once.

```python
from random import randrange
from numpy import zeros

zctasize = 9330        # average ZCTA population
demosize = 365*78*2    # birth date and sex combinations

demo = zeros(demosize)
for _ in range(zctasize):
    d = randrange(demosize)
    demo[d] += 1

unique = len(demo[demo == 1])
print(unique/zctasize)
```

I ran this simulation 10 times and got values ranging from 84.3% to 85.7%.

As Road White points out in the comments, you can estimate the number of unique demographics by a probability calculation.

Suppose there are *z* inhabitants in our zip code and *d* demographic categories. We’re assuming here (and above) that all demographic categories are equally likely, even though that’s not true of birth dates.

We start by looking at a particular demographic category. The probability that exactly one person falls in that category is

$$\binom{z}{1}\,\frac{1}{d}\left(1 - \frac{1}{d}\right)^{z-1} = \frac{z}{d}\left(1 - \frac{1}{d}\right)^{z-1}.$$

To find the expected number of demographic slots with exactly one entry we multiply by *d*, and to get the proportion *p* of people this represents we divide by *z*:

$$p = \left(1 - \frac{1}{d}\right)^{z-1}$$

and so

$$p \approx \exp(-z/d),$$

which in our case is 84.8%, consistent with our simulation above.

Here we used the Taylor series approximation log(1 − 1/*d*) ≈ −1/*d*, which gives

$$\left(1 - \frac{1}{d}\right)^{z-1} = \exp\left((z-1)\log\left(1 - \frac{1}{d}\right)\right) \approx \exp\left(-\frac{z}{d}\right).$$

If *z* is of the same magnitude as *d* or smaller, then the error in the approximation above is *O*(1/*d*).
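As a quick numerical check (my own sketch, using the numbers from this post), we can compare the exact proportion with its exponential approximation:

```python
from math import exp

z, d = 9330, 365*78*2   # zip code population and demographic categories

exact = (1 - 1/d)**(z - 1)   # exact proportion of singleton slots
approx = exp(-z/d)           # Taylor series approximation

print(exact, approx, abs(exact - approx))
```

Both values come out near 0.849, and their difference is on the order of 1/*d*, consistent with the error bound above.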

You could also work this as a Poisson distribution, then condition on the probability of a slot being occupied.

By the way, a similar calculation shows that the expected number of demographic slots containing two people is *r* exp(-*r*)/2 where *r* = *z*/*d*. So while 85% can be uniquely identified, another 7% can be narrowed down to two possibilities.
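To sanity-check that last figure, here is a sketch extending the simulation above to count slots containing exactly two people and compare against *r* exp(−*r*)/2 (this check is mine, not part of the original calculation):

```python
from math import exp
from random import randrange
from numpy import zeros

z, d = 9330, 365*78*2
r = z / d

# Theoretical count of two-person slots, relative to z
print(r * exp(-r) / 2)

# Simulation: assign everyone a random demographic slot
demo = zeros(d)
for _ in range(z):
    demo[randrange(d)] += 1
print(len(demo[demo == 2]) / z)
```

Both numbers come out near 0.07, in line with the 7% figure.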

[1] I don’t know, but I do know the average population of a “zip code tabulation area” or ZCTA, and that’s 9,330 according to the data linked to here. As I discuss in that post, the Census Bureau reports population by ZCTA, not by zip code per se, for several reasons. Zip codes can cross state boundaries, they are continually adjusted, and some zip codes contain only post office boxes.
