Suppose that you are in love with a lady on Neptune and that she returns the sentiment. It will be some consolation for the melancholy separation if you can say to yourself at some—possibly prearranged—moment, “She is thinking of me now.” Unfortunately a difficulty has arisen because we have had to abolish Now … She will have to think of you continuously for eight hours on end in order to circumvent the ambiguity of “Now.”

This reminded me of The Book of Strange New Things. This science fiction novel has several themes, but one is the effect of separation on a relationship. Even if you have faster-than-light communication, how does it affect you to know that your husband or wife is light years away? The communication delay might be no more than was common before the telegraph, but the psychological effect is different.

**Related post**: Eddington’s constant

He says at one point “If I were a Christian …” implying that he is not, but his philosophy of software echoes the Christian idea of grace, a completely free gift rather than something earned. If you want to use my software without giving anything back in return, enjoy. If you’re motivated by gratitude, not obligation, to give something back, that’s great. Works follow grace. Works don’t earn grace.

I was thinking about making a donation to a particular open source project that has been important to my business when I stumbled on DHH’s talk. While watching it, I reconsidered that donation. The software is freely given, no strings attached. I don’t take anything from anyone else by using it. Etc. Then I made the donation anyway, out of a sense of gratitude rather than a sense of obligation.

My biggest contributions to open source software have been unconscious. I had no idea that code from this blog was being used in hundreds of open source projects until Tim Hopper pointed it out.

Most contributions to open source software are in kind, i.e. contributing code. But cash helps too. Here are a couple ideas if you’d like to donate a little cash.

You could buy some swag with a project logo on it, especially some of the more overpriced swag. Maybe your company’s rules would allow buying swag when they wouldn’t allow making a donation. It feels odd to deliberately buy something overpriced—they want how much for this coffee mug?!—but that’s how the project can make a little money.

If you’d like to make a cash donation, but you’re hesitating because it’s not tax deductible, here’s a possibility: deduct your own tax. Give the after-tax amount that corresponds to the amount you would have given before taxes. For example, suppose you’d like to donate $100 cash if it were tax deductible, and suppose that your marginal tax rate is 30%. Then donate $70. It’s the same to you as donating $100 and saving $30 on your taxes.
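The arithmetic is simple enough to put in a couple lines of Python. This is just a sketch of the calculation above; `after_tax_donation` is a name I made up, not anything from a tax library.

```python
def after_tax_donation(pretax, marginal_rate):
    # Give the after-tax equivalent of a deductible donation of `pretax`:
    # it costs you the same as donating `pretax` and deducting it.
    return pretax * (1 - marginal_rate)

print(after_tax_donation(100, 0.30))  # 70.0
```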

Disclaimer: I am not an accountant or a tax attorney. And in case it ever needs to be said, I’m not a lepidopterist, auto mechanic, or cosmetologist either.

Both are true, but they involve limits in different spaces. Weierstrass’ theorem applies to convergence over a compact interval of the real line, and Morera’s theorem applies to convergence over compact sets in the complex plane. The uniform limit of polynomials is better behaved when it has to be uniform in two dimensions rather than just one.

This post is a sort of variation on the post mentioned above, again comparing convergence over the real line and convergence in the complex plane, this time looking at what happens when you conjugate variables [1].

There’s an abstract version of Weierstrass’ theorem due to Marshall Stone known as the Stone-Weierstrass theorem. It generalizes the real interval of Weierstrass’ theorem to any compact Hausdorff space [2]. The compact Hausdorff space we care about for this post is the unit disk in the complex plane.

There are two versions of the Stone-Weierstrass theorem, one for real-valued functions and one for complex-valued functions. I’ll state the theorems below for those who are interested, but my primary interest here is a corollary to the complex Stone-Weierstrass theorem. It says that any continuous complex-valued function on the unit disk can be approximated as closely as you like with polynomials in *z* and the conjugate of *z* with complex coefficients, i.e. by polynomials of the form

*p*(*z*, *z̄*) = Σ_{m,n} *c*_{mn} *z*^{m} *z̄*^{n}.

When you throw in conjugates, things change a lot. The uniform limit of polynomials in *z* alone must be analytic, very well behaved. But the uniform limit of polynomials in *z* and *z* conjugate is merely continuous. It can have all kinds of sharp edges etc.

Conjugation opens up a lot of new possibilities, for better or for worse. As I wrote about here, an analytic function can only do two things to a tiny disk: stretch it or rotate it. It cannot flip it over, as conjugation does.

By adding or subtracting a variable and its conjugate, you can separate out the real and imaginary parts. The parts are no longer inextricably linked and this allows much more general functions. The magic of complex analysis, the set of theorems that seem too good to be true, depends on the real and imaginary parts being tightly coupled.
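Concretely, Re *z* = (*z* + *z̄*)/2 and Im *z* = (*z* – *z̄*)/2*i*. A quick numerical check with Python’s built-in complex type:

```python
z = 3 + 4j
re = (z + z.conjugate()) / 2   # (z + z-bar) / 2 recovers the real part
im = (z - z.conjugate()) / 2j  # (z - z-bar) / 2i recovers the imaginary part
print(re, im)  # (3+0j) (4+0j)
```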

***

Now, for those who are interested, here are the statements of the Stone-Weierstrass theorems.

Let *X* be a compact Hausdorff space. The set of real or complex valued functions on *X* forms an algebra. That is, it is closed under taking linear combinations and products of its elements.

The real Stone-Weierstrass theorem says a subalgebra of the continuous real-valued functions on *X* is dense if it contains a non-zero constant function and if it separates points. The latter condition means that for any two points, you can find functions in the subalgebra that take on different values on the two points.

If we take the interval [0, 1] as our Hausdorff space, the Stone-Weierstrass theorem says that the subalgebra of polynomials is dense.

The complex Stone-Weierstrass theorem says a subalgebra of the continuous complex-valued functions on *X* is dense if it contains a non-zero constant function, separates points, and is closed under taking conjugates. The statement above about polynomials in *z* and *z* conjugate follows immediately.

***

[1] For a complex variable *z* of the form *a* + *bi* where *a* and *b* are real numbers and *i* is the imaginary unit, the conjugate is *a* – *bi*.

[2] A Hausdorff space is a general topological space. All it requires is that for any two distinct points in the space, you can find disjoint open sets that each contain one of the points.

In many cases, a Hausdorff space is the most general setting where you can work without running into difficulties, especially if it is compact. You can often prove something under much stronger assumptions, then reuse the proof to prove the same result in a (compact) Hausdorff space.

See this diagram of 34 kinds of topological spaces, moving from strongest assumptions at the top to weakest at the bottom. Hausdorff is almost at the bottom. The only thing weaker on that diagram is *T*_{1}, also called a Fréchet space.

In a *T*_{1} space, you can separate points, but not simultaneously. That is, given two points *p*_{1} and *p*_{2}, you can find an open set *U*_{1} around *p*_{1} that does not contain *p*_{2}, and you can find an open set *U*_{2} around *p*_{2} that does not contain *p*_{1}. But you cannot necessarily find these sets in such a way that *U*_{1} and *U*_{2} are disjoint. If you could always choose these open sets to be disjoint, you’d have a Hausdorff space.

The **naive** haven’t heard of power laws or don’t know much about them. They probably tend to expect things to be more uniformly distributed than they really are.

The **enthusiasts** have read books about fractals, networks, power laws, etc. and see power laws everywhere.

The **skeptics** say the prevalence of power laws is overestimated.

Within the skeptics are the **pedants**. They’ll point out that nothing exactly follows a power law over the entirety of its range, which is true but unhelpful. Nothing follows any theoretical distribution exactly and everywhere. This is true of the normal distribution as much as it is true of power laws, but that doesn’t mean that normal distributions and power laws aren’t useful models.

If you asked a power law enthusiast how zip code populations are distributed, their first guess would of course be a power law. They might suppose that 80% of the US population lives in 20% of the zip codes, following Pareto’s 80-20 rule.

The enthusiasts wouldn’t be too far off. They would do better than the uniformitarians who might expect that close to 20% of the population lives in 20% of the zip codes.

It turns out that about 70% of Americans live in the zip codes in the top 20th percentile of population. To include 80% of the population, you have to include the top 27% of zip codes.
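The computation behind numbers like these is straightforward. Here’s a sketch in Python; the function name and the toy data are mine, invented for illustration (the real calculation would use the census ZCTA populations).

```python
import numpy as np

def top_share(populations, pct=0.20):
    # Fraction of the total population living in the largest pct of areas.
    pops = np.sort(np.asarray(populations))[::-1]  # descending
    k = int(np.ceil(pct * len(pops)))              # size of the top pct
    return pops[:k].sum() / pops.sum()

# toy data: five areas, where the single largest holds half the people
print(top_share([10, 20, 30, 40, 100]))  # 0.5
```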

But a power law implies more than just the 80-20 rule. It implies that something like the 80-20 rule holds on multiple scales, and that’s not true of zip codes.

The signature of a power law is a straight line on a log-log plot. Power laws never hold exactly and everywhere, but a lot of things approximately follow a power law over a useful range, typically in the middle.

What do we get if we make a log-log plot of zip code population?

This looks more like a squircle than a straight line.

If we zoom in on just part of the plot, say the largest 2,000 zip codes [1], we get something that has a flat spot, but the plot bows outward, and continues to bow outward as we zoom in on different parts of it.

Why is the distribution of zip code populations not a power law? One reason is that zip codes are artificial. They were designed to make it easier to distribute mail, and so there was a deliberate effort to make them somewhat uniform in population. The population of the largest zip code is only 12 times the average.

You’re more likely to see power laws in something organic like city populations. The largest city in the US is nearly 1,400 times larger than the average city [2].

Another reason zip code populations do not follow a power law is that zip codes are somewhat of a geographical designation. I say “somewhat” because the relationship between zip codes and geography is complicated. See [1].

But because there’s at least some attempt to accommodate geography, and because the US population is very thin in large regions of the country, there are many sparsely populated zip codes, even when you roll them up from 5 digits to just 3 digits, and that’s why the curve drops very sharply at the end.

- Identification using birthday, sex, and zip code
- Passwords and power laws
- Estimating power law exponents

[1] This post is based on ZCTAs (Zip Code Tabulation Areas) according to the 2010 census. ZCTAs are not exactly zip codes for reasons explained here.

[2] It depends on how you define “city,” but the average local jurisdiction population in the United States is around 6,200 and the population of New York City is around 8,600,000, which is 1,400 times larger.

Before that, I wrote a post on Bernstein’s proof that used his eponymous polynomials to prove Weierstrass’ theorem. This is my favorite proof because it’s an example of using results from probability to prove a statement that has nothing to do with randomness.

This morning I’ll present one more way to prove the approximation theorem, this one due to Landau.

The Landau kernel is defined as

*L*_{n}(*x*) = (1 – *x*²)^{n}

on the interval [-1, 1].

Denote its integral by

*k*_{n} = ∫_{-1}^{1} (1 – *x*²)^{n} *dx*.

Let *f*(*x*) be any continuous function on [-1, 1]. Then the convolution of the normalized Landau kernel with *f* gives a sequence of polynomial approximations that converge uniformly to *f*. By “normalized” I mean dividing the kernel by its integral so that it integrates to 1.

For each *n*,

*p*_{n}(*x*) = (1/*k*_{n}) ∫_{-1}^{1} *f*(*t*) (1 – (*x* – *t*)²)^{n} *dt*

is a polynomial in *x* of degree 2*n*, and as *n* goes to infinity this converges uniformly to *f*(*x*).

There are a few connections I’d like to mention. First, the normalized Landau kernel is essentially a beta distribution density, just scaled to live on [-1, 1] rather than [0, 1].

And as with Bernstein’s proof of the Weierstrass approximation theorem, you could use probability to prove Landau’s result. Namely, you could use the fact that for two independent random variables *X* and *Y*, the PDF of their sum is the convolution of their PDFs.

The normalizing constants *k*_{n} have a simple closed form in terms of double factorials:

*k*_{n} = 2 (2*n*)!! / (2*n* + 1)!!.
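Taking the Landau kernel to be the standard (1 – *x*²)^{n}, here’s a Python sketch that checks the double-factorial formula *k*_{n} = 2 (2*n*)!!/(2*n* + 1)!! against numerical quadrature, and confirms that convolving with the normalized kernel approximates a function at a point.

```python
from math import cos
from scipy.integrate import quad

def k_closed(n):
    # k_n = 2 * (2n)!! / (2n+1)!!, the claimed closed form for the
    # integral of the Landau kernel (1 - x^2)^n over [-1, 1]
    num = den = 1
    for j in range(2, 2 * n + 1, 2):
        num *= j
    for j in range(1, 2 * n + 2, 2):
        den *= j
    return 2 * num / den

# check the closed form against numerical quadrature
n1 = 8
k_num, _ = quad(lambda x: (1 - x**2)**n1, -1, 1)
print(abs(k_closed(n1) - k_num))  # essentially zero

# convolving f with the normalized kernel approximates f; check at x = 0
n2 = 100
p0, _ = quad(lambda t: cos(t) * (1 - t**2)**n2, -1, 1)
p0 /= k_closed(n2)
print(p0)  # close to cos(0) = 1
```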

I don’t know which Landau is responsible for the Landau kernel. I’ve written before about Edmund Landau and his Big O notation, and I wrote about Lev Landau and his license plate game. Edmund was a mathematician, so it makes sense that he might be the one to come up with another proof of Weierstrass’ theorem. Lev was a physicist, and I could imagine he would be interested in the Landau kernel as an approximation to the delta function.

If you know which of these Landaus, or maybe another, is behind the Landau kernel, please let me know.

**Update**: Someone sent me this paper which implies Edmund Landau is the one we’re looking for.

The post mentioned above uses a proof by Bernstein. And in that post I used the absolute value function as an example. Not only is |*x*| an example, you could go the other way around and use it as a step in the proof. That is, there is a proof of the Weierstrass approximation theorem that starts by proving the special case of |*x*| and then uses that result to build a proof for the general case.

There have been many proofs of Weierstrass’ theorem, and recently I ran across a proof due to Lebesgue. Here I’ll show how Lebesgue constructed a sequence of polynomials approximating |*x*|. It’s like pulling a rabbit out of a hat.

The starting point is the binomial theorem. If *x* and *y* are real numbers with |*x*| > |*y*| and *r* is any real number, then

(*x* + *y*)^{r} = Σ_{k=0}^{∞} C(*r*, *k*) *x*^{r–k} *y*^{k}.

Now apply the theorem substituting 1 for *x* and *x*² – 1 for *y* above and you get

|*x*| = (1 + (*x*² – 1))^{1/2} = Σ_{k=0}^{∞} C(1/2, *k*) (*x*² – 1)^{k}.

The partial sums of the right hand side are a sequence of polynomials converging to |*x*| on the interval [-1, 1].
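Here’s a sketch of those partial sums in Python, computing the binomial coefficients C(1/2, *k*) by the recurrence C(1/2, *k*+1) = C(1/2, *k*)(1/2 – *k*)/(*k* + 1); the function name is mine.

```python
def lebesgue_abs(x, n_terms=200):
    # Partial sum of |x| = sum_k C(1/2, k) (x^2 - 1)^k, valid on [-1, 1].
    total = 0.0
    coeff = 1.0       # C(1/2, 0)
    y = x * x - 1
    ypow = 1.0        # y^k
    for k in range(n_terms):
        total += coeff * ypow
        ypow *= y
        coeff *= (0.5 - k) / (k + 1)   # C(1/2, k) -> C(1/2, k+1)
    return total

print(lebesgue_abs(0.6))   # approximately 0.6
print(lebesgue_abs(-0.8))  # approximately 0.8
```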

***

If you’re puzzled by the binomial coefficient with a top number that isn’t a positive integer, see the general definition of binomial coefficients. The top number can even be complex, and indeed the binomial theorem holds for complex *r*.

You might also be puzzled by the binomial theorem being an infinite sum. Surely if *r* is a positive integer we should get the more familiar binomial theorem which is a *finite* sum. And indeed we do. The general definition of binomial coefficients ensures that if *r* is a positive integer, all the binomial coefficients with *k* > *r* are zero.

It may be impossible to prove that a choice was not deliberate, but you can show good faith by providing evidence that the choice was deliberate *by a different criterion* than the one in question.

I’ll give four examples, three positive and one negative.

My previous three blog posts looked at different aspects of the Cliff random number generator. The generator needs a seed between 0 and 1 to start. Suppose I chose 0.8121086949937715 as my seed. On the one hand, that’s a number with no apparent special features. But you might ask “Hey, why that number?” and you’d be right. I show in the first post in the series how that number was chosen to make the generator start off producing duplicate output.

In the next two posts in the series, I chose π – 3 as my seed. That’s a recognizable number and obviously a deliberate choice. But it has no apparent connection to the random number generator, and so it’s reasonable to assume that the seed wasn’t chosen to make the generator look good or bad.

The SHA-2 cryptographic hash function needs seventy-two 32-bit numbers for its initial state, numbers that needed to be “random” in some sense. But if the values were actually chosen at random, critics would suspect that the values were chosen to provide a back door. And maybe there is a clever way to pick the initial state that provides a non-obvious exploitable weakness.

The designers of SHA-2 chose the square roots of the first consecutive primes to fill one set of constants, and the cube roots of the first consecutive primes to fill another. See code here.
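As a quick check, the well-known SHA-256 constants can be recovered this way: the first initial hash value, 0x6a09e667, is the fractional part of √2, and the first round constant, 0x428a2f98, is the fractional part of the cube root of 2 (both taken to 32 bits, per FIPS 180-4). Here’s a sketch; `frac_bits` is my name, not an API.

```python
from math import sqrt

def frac_bits(x):
    # First 32 bits of the fractional part of x.
    return int((x - int(x)) * 2**32) & 0xFFFFFFFF

# SHA-256's first initial hash value comes from the square root of 2
print(hex(frac_bits(sqrt(2))))     # 0x6a09e667
# and its first round constant comes from the cube root of 2
print(hex(frac_bits(2 ** (1/3))))  # 0x428a2f98
```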

The initial state is definitely not random. Someone looking at the state would eventually discover where it came from. So the choice was obviously deliberate, but apparently not designed by any cryptographic criterion.

Daniel Bernstein’s elliptic curve Curve25519 is widely trusted in part because Bernstein made his design choices transparent. The curve is

*y*² = *x*³ + 486662*x*² + *x*

over the finite field with 2^{255}-19 elements, hence the name.

2^{255}-19 is the largest prime less than 2^{255}, and being close to 2^{255} makes the method efficient to implement. The coefficient 486662 is less obvious. But Bernstein explains in his paper that he took the three smallest possible values of this parameter that met the explicit design criteria, and then rejected two of them on objective grounds described at the bottom of the paper.
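The primality claim is easy to check. Here’s a sketch using a Miller-Rabin test written from scratch; it’s a probabilistic test, so strictly speaking this gives overwhelming evidence rather than a proof.

```python
def is_probable_prime(n, bases=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    # Miller-Rabin primality test with fixed witness bases.
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    # write n - 1 = d * 2^s with d odd
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True

p = 2**255 - 19
print(is_probable_prime(p))                                      # True
print(any(is_probable_prime(2**255 - k) for k in range(1, 19)))  # False
```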

The design of elliptic curve NIST P-384 is not as transparent as that of Curve25519, which has led to speculation that NIST may have designed the curve to have a back door.

The curve has Weierstrass form

*y*² = *x*³ – 3*x* + *b*

over the finite field with *p* elements where

*p* = 2^{384} – 2^{128} – 2^{96} + 2^{32} – 1.

As with Curve25519, the choice of field size was motivated by efficiency; the pattern of powers of 2 enables some tricks for efficient implementation. Also, there are objective reasons why the linear coefficient is -3. But the last coefficient *b* is the 383-bit number

27580193559959705877849011840389048093056905856361568521428707301988689241309860865136260764883745107765439761230575

which has raised some eyebrows. NIST says the value was chosen at random, though not directly. More specifically, NIST says it first generated a certain 160-bit random number, then applied a common key stretching algorithm to obtain *b* above. Researchers are divided over whether they believe this. See more in this post.

Sometimes you can’t prove that a choice wasn’t deliberate. In that case the best you can do is show that the choice *was* deliberate, but by an innocent criterion, one unrelated to the matter at hand.

I tried to do this in the Cliff RNG blog posts by using π as my seed. I couldn’t actually use π because the seed had to be between 0 and 1, but there’s an obvious way to take π and produce a number between 0 and 1.

The designers of SHA-2 took a similar route. Just as π is a natural choice for a real number, primes are natural choices for integers. They couldn’t use integers directly, but they made complicated patterns from simple integers in a natural way by taking roots.

Daniel Bernstein gained the cryptography community’s trust by making his design criteria transparent. Given his criteria, the design was almost inevitable.

NIST was not as persuasive as Daniel Bernstein. They claim to have chosen a 160-bit number at random, and they may very well have. But there’s no way to know whether they generated a lot of 160-bit seeds until they found one that resulted in a constant term that has some special property. They may have chosen their seed in good faith, but they have not been completely persuasive.

Sometimes it’s not enough to act in good faith; you have to make a persuasive case that you acted in good faith.

Yesterday I posted about testing the generator with the DIEHARDER test suite, the successor to George Marsaglia’s DIEHARD battery of RNG tests.

This morning I discovered something about the Cliff RNG which led to discovering something about DIEHARDER. The latter is more important: I don’t think anyone is using the Cliff RNG for serious work, but people are definitely using DIEHARDER.

The Cliff RNG has a short period, and yet many of the DIEHARDER tests passed. However, several of the tests failed as well, and perhaps these tests failed due to the short period, in particular rgb_lagged_sum. But at least some tests can pass despite a short period.

Since the Cliff RNG maintains internal state as a floating point number and outputs integers, the period is a bit subtle.

The state of the Cliff RNG is a floating point number between 0 and 1, and so it has 2^{53} possible values. (See Anatomy of a floating point number.) That means the maximum possible period is 2^{53}. We use the internal state *x* to create 32-bit integers *n* by multiplying *x* by 2^{32} and truncating to an integer value. The maximum period could conceivably be larger than 2^{32} because an output value *n* could repeat but correspond to two different *x*‘s. The actual period, at least in my experiment, was between 2^{22} and 2^{23}.

I seeded the Cliff RNG with π – 3 (why?) and found that starting with index *i* = 3,075,302, the output values repeat with period *p* = 6,705,743. So there was a burn-in period before the state entered a cycle, but it would repeat that cycle forever. Not only are the output values repeating, the state values *x* repeat. (Otherwise it would be possible for the integer outputs to cycle for a while then break out.)
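A search like this can be automated with a cycle-finding method such as Brent’s algorithm, which finds the cycle length and the length of the burn-in tail without storing the whole sequence. Here’s a sketch; rather than grinding through millions of Cliff RNG iterations, it’s demonstrated on a tiny made-up map with a tail of length 1 feeding a 4-cycle.

```python
def brent(f, x0):
    # Brent's cycle detection: returns (lam, mu) where lam is the cycle
    # length and mu is the index where the cycle starts.
    power = lam = 1
    tortoise, hare = x0, f(x0)
    while tortoise != hare:
        if power == lam:          # start a new power-of-two window
            tortoise, power, lam = hare, power * 2, 0
        hare = f(hare)
        lam += 1
    # find mu: advance one pointer lam steps, then move both in lockstep
    tortoise = hare = x0
    for _ in range(lam):
        hare = f(hare)
    mu = 0
    while tortoise != hare:
        tortoise, hare = f(tortoise), f(hare)
        mu += 1
    return lam, mu

# toy map: 0 -> 1 -> 2 -> 3 -> 4 -> 1 (tail of length 1, cycle of length 4)
f = lambda x: {0: 1, 1: 2, 2: 3, 3: 4, 4: 1}[x]
print(brent(f, 0))  # (4, 1)
```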

It’s possible that starting with other seeds, the generator has other cycle lengths, longer or shorter. But I stumbled across one cycle of period 6,705,743.

I wrote a chapter for O’Reilly’s book Beautiful Testing in which I discuss How to test a random number generator. More specifically, how to test a non-uniform random number generator. That is, starting with a trusted uniform random number generator, transform the values to have some other probability distribution. This is the most common scenario. Few developers write their own core RNG, but it’s fairly common to have to use a core RNG to create a custom distribution.

If you do want to test a uniform random number generator, as I do in this post, there are test suites like DIEHARDER. This is one of the oldest and best known test suites. There are newer and more rigorous suites, like TestU01 that I blog about here. There’s also the NIST statistical test suite that I write about in the same post.

These tests are a little challenging to build and use. I’ve had clients ask me to run these tests for them and help them interpret the results. If you’d like for me to do that for you, let’s talk.

I produced a file of a billion 32-bit integers by multiplying the output values, which were floating point numbers between 0 and 1, by 2^{32} and truncating to integer. Then I ran the DIEHARDER random number generator test suite.

The results were interesting. Before running the tests, I thought the tests would nearly all pass or nearly all fail, more likely the latter. But what happened was that many tests passed and some failed hard [1].

Here’s a list of the tests that passed:

- diehard_birthdays
- diehard_rank_32x32
- diehard_rank_6x8
- diehard_bitstream
- diehard_oqso
- diehard_dna
- diehard_count_1s_str
- diehard_count_1s_byt
- diehard_runs
- sts_monobit
- sts_serial
- rgb_bitdist
- rgb_kstest_test
- dab_dct
- dab_filltree2
- dab_monobit2

The tests that failed were:

- diehard_parking_lot
- diehard_2sphere
- diehard_3sphere
- diehard_squeeze
- diehard_craps
- marsaglia_tsang_gcd
- rgb_lagged_sum
- dab_bytedistrib

I’ve left out a few test results that were ambiguous as well as tests that were described as “Suspect” and “Do not use” on the DIEHARDER web site.

The site I mentioned in the previous post where I ran across this generator said that it passed a spherical generation test. I assume the implementation of that test was less demanding than the version included in DIEHARD. But the generator does well by other tests.

The lagged sum test tests for autocorrelation. Maybe the failure of this test has something to do with the fixed points discussed earlier.

**Update**: After writing this post I discovered that the generator has a short period, as I discuss here. That explains why the lagged sum test fails: the output has perfect autocorrelation at a lag equal to the period.

[1] By “failed hard” I mean the test returned a *p*-value of zero. The *p*-value couldn’t actually be zero, but it was close enough that the displayed value was exactly zero.

*x*_{n+1} = | 100 log(*x*_{n}) mod 1 |

for *n* > 0. The article linked to above says that this generator passes a test of randomness based on generating points on a sphere.
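Here’s a minimal Python implementation of the recurrence, taking the absolute value before reducing mod 1 as in the code later in the post. The function name is mine.

```python
from math import log, pi

def cliff(seed, n):
    # Cliff RNG: x_{k+1} = |100 log(x_k)| mod 1, with 0 < seed < 1
    x = seed
    out = []
    for _ in range(n):
        x = abs(100 * log(x)) % 1
        out.append(x)
    return out

values = cliff(pi - 3, 5)
print(values)  # five values, each in [0, 1)
```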

While the long term distribution of the generator may be good, it has a problem with its sequential behavior, namely that it has an infinite number of fixed points. If the generator ever reaches one of these points, it gets stuck forever.

Here’s a proof. Since *x* is between 0 and 1, log(*x*) is always negative. So we can replace the absolute value above with a negative. A fixed point of the generator is a solution to

*x* = -100 log(*x*) – *k*

for some integer *k*. Define

*f*(*x*, *k*) = -100 log(*x*) – *x* – *k.*

For each non-negative *k*, the limit of *f*(*x*, *k*) as *x* goes to 0 is ∞ and the limit as *x* goes to 1 is negative, so somewhere in between it is zero.

So in exact operations over the reals, there is a fixed point for each non-negative integer *k*. However, when implemented in finite precision arithmetic, it manages to get itself unstuck as the following Python code shows with *k* arbitrarily chosen to be 20.

```python
from numpy import log
from scipy.optimize import bisect

r = bisect(lambda x: -100*log(x) - x - 20, 0.4, 0.999)
print(r)
for i in range(10):
    r = abs(-100*log(r)) % 1
    print(r)
```

This produces

```
0.8121086949937715
0.8121086949252678
0.8121087033605505
0.8121076646716787
0.8122355649800639
0.7964876237426814
0.7543688054472710
0.1873898667258977
0.4563983607272064
0.4389252746489802
0.3426097596058071
```

In infinite precision, *r* above would be a fixed point. In floating point precision, *r* is not. But it does start out nearly fixed. The first iteration only changes *r* in the 11th decimal place, and it doesn’t move far for the next few iterations.

**Update**: See the next post for how the random number generator does on the DIEHARDER test suite.

The *k*th fixed point is the solution to *f*(*x*, *k*) = 0. The following code plots the fixed points as a function of *k*.

```python
from numpy import arange, log
from scipy.optimize import bisect
import matplotlib.pyplot as plt

t = arange(300)
y = [bisect(lambda x: -100*log(x) - x - k, 0.001, 0.999) for k in t]
plt.plot(t, y)
plt.xlabel("k")
plt.ylabel("fixed point")
```

The fixed points cluster toward zero, or they would in infinite precision arithmetic. As we showed above, the Cliff random number generator performs better in practice than in theory. Maybe the generator does well after wandering close to zero, but I wouldn’t be surprised if it has a bias toward the low end of the interval.

Advocates of difficult-to-learn technologies say that tools should be optimized for experienced users, that ease of learning is over-rated because you only learn a tool once but use it for much longer. That makes sense if you use a tool continuously. If you use a tool occasionally, however, you might learn it once and relearn it many times.

The ease of relearning a technology should be emphasized more. As you’re learning a programming language, for example, it may be difficult to imagine forgetting it and needing to relearn it down the road. But you might ask yourself

If I put this down for a couple years and then have to come back to it, what language would I wish I’d written it in?

A while back I debated relearning Perl for the kind of text munging projects that Perl was designed for. But not only would I have to relearn Perl once, I’d have to relearn it every time I revisit the code. Perl does not stick in my head without constant use. Awk, on the other hand, is small and simple, and has a lot of the benefits of Perl. You can learn the basics of Awk in a day, and so if you have to, you can relearn it in a day.

**Something easy to learn is also easy to relearn**.

However, the converse isn’t necessarily true. Some things may be hard to learn but easy to pick back up. For example, I found LaTeX hard to learn but easy to relearn after not using it for several years. A lot of other tools seem almost as hard to relearn every time I pick them up. I think part of what made LaTeX easy to pick back up was its internal consistency. It’s a little quirky, but it has conceptual integrity.

I’ve used Mathematica off and on ever since it came out. Sometimes I’d go for years without using it, but it has always been easy to pick back up. Mathematica is easy to return to because its syntax is consistent and predictable. Mathematica has conceptual integrity. I find R much harder to use because the inconsistent syntax fades from my memory between uses.

Conceptual integrity comes from strong leadership, even a “benevolent dictator.” Donald Knuth shaped TeX and Stephen Wolfram shaped Mathematica. R has been more of an egalitarian effort, and it shows.

The “Tidyverse” of libraries on top of R is more consistent than the base language, due to Hadley Wickham doing so much of the work himself. In fact, the Tidyverse was initially called the “Hadleyverse,” though Hadley didn’t like that name.

One of my favorite theorems is the Weierstrass approximation theorem. It says that every continuous function can be approximated as closely as you like by polynomials. This is surprising because polynomials are very special, well behaved functions, and a merely continuous function can be much worse behaved.

For example, a polynomial cannot have any kinks in it, unlike the absolute value function. But even though an individual polynomial cannot have a kink, a sequence of polynomials can approach a kink.

Morera’s theorem [1] says that the uniform limit of analytic functions is analytic. But Weierstrass’s theorem says that the uniform limit of analytic functions (namely polynomials) can have a kink in it, which an analytic function cannot have. What gives?

Weierstrass’s theorem is about uniform convergence over an interval of the real line.

Morera’s theorem is about uniform convergence on compact subsets of an open set in the complex plane.

We’ll show an example below where a sequence of polynomials converges to |*x*| on an interval of the real line but not in a disk containing the interval.

The Weierstrass approximation theorem tells us that there *exists* a sequence of polynomials converging uniformly to any continuous function on a compact interval. But we can go a step further and actually *construct* a sequence of such polynomials. The polynomials fall out of a proof of the Weierstrass theorem using probability! There’s nothing random going on here, and yet we can take a set of inequalities that fall out of calculations motivated by probability and construct our approximations.

Here is the *n*th Bernstein polynomial approximation to a function *g* in Mathematica code.

```mathematica
B[x_, n_, g_] := Sum[Binomial[n, k] x^k (1 - x)^(n - k) g[k/n], {k, 0, n}]
```

We can then use the following code to show the convergence of Bernstein polynomials.

```mathematica
f[x_] := Abs[x - 1/2]

Plot[{B[x, 4, f], B[x, 10, f], B[x, 30, f], f[x]}, {x, 0, 1},
  PlotLegends -> "Expressions"]
```

Next we’ll take a particular Bernstein polynomial for *f* in the sequence and plot the difference between it and *f* over the complex plane.

```mathematica
ComplexPlot3D[B[z, 6, f] - f[z], {z, 0, 1 + I}]
```

The polynomial is close to *f* along the interval [0, 1] on the real line, but the further we move away from the real axis the further it gets from *f*. Furthermore, the distance increases as we increase the degree of the polynomial. The following code looks at the distance between *B*_{n}(*i*) and *f*(*i*) for *n* = 1, 2, 3, …, 10.

```mathematica
Table[N[Abs[B[I, n, f] - f[I]]], {n, 1, 10}]
```

It returns

```
0.6180  1.9021  1.9021  3.4085  3.4085  7.3941  7.3941  23.4511  23.4511  92.4377
```
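For readers without Mathematica, here’s a sketch of the same Bernstein construction in Python. It checks that the polynomial tracks *f* on [0, 1] and reproduces the first entry of the table above, 0.6180, at *z* = *i*.

```python
from math import comb

def bernstein(g, n, x):
    # n-th Bernstein polynomial of g, evaluated at x (x may be complex)
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) * g(k / n)
               for k in range(n + 1))

f = lambda t: abs(t - 0.5)

print(bernstein(f, 30, 0.25))                    # close to f(0.25) = 0.25
print(abs(bernstein(f, 1, 1j) - abs(1j - 0.5)))  # 0.6180..., as in the table
```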

So there’s no contradiction between the theorems of Weierstrass and Morera. The Bernstein polynomials do indeed converge uniformly to *f* over [0, 1] but they do not converge in the complex plane.

[1] I don’t know whether this theorem is exactly due to Morera, but it follows directly from Morera’s theorem.

Let *X*, *Y*, and *Z* be three unit vectors. If *X* is nearly parallel to *Y*, and *Y* is nearly parallel to *Z*, then *X* is nearly parallel to *Z*.

Here’s a proof. Think of *X*, *Y*, and *Z* as points on a unit sphere. Then saying that *X* and *Y* are nearly parallel means that the two points are close together on the sphere. The statement above follows from the triangle inequality on the sphere:

dist(*X*, *Z*) ≤ dist(*X*, *Y*) + dist(*Y*, *Z*).

So if the two terms on the right are small, the term on the left is small, though maybe not quite as small: it is no more than twice the larger of the other two angles.

We can be a little more quantitative. Let *a* be the angle between *X* and *Y*, *b* the angle between *Y* and *Z*, and *c* the angle between *X* and *Z*. Then the law of cosines for spherical trigonometry says

cos *c* = cos *a* cos *b* + sin *a* sin *b* cos γ

where γ is the angle between the arcs *a* and *b*. If *a* and *b* are small, then sin *a* and sin *b* are also small (see here), and so we have the approximation

cos *c* ≈ cos *a* cos *b*.

The error in the approximation is sin *a* sin *b* cos γ, the product of two small numbers and a number with absolute value no more than 1.

The geometric exercise above was inspired by a discussion of correlation.

Correlation of random variables is not transitive. Correlation corresponds to directions not being perpendicular. If *X* is not perpendicular to *Y*, and *Y* is not perpendicular to *Z*, it might be the case that *X* is perpendicular to *Z*.

But if we replace “not perpendicular” with “nearly parallel” we see that we do have something like transitivity. That is, correlation of random variables is not transitive, but *high* correlation is.

If the angles *a*, *b*, and *c* above are correlation angles, then we have the approximation

corr(*X*, *Z*) ≈ corr(*X*, *Y*) corr(*Y*, *Z*)

if all the correlations are near 1.
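Here's a quick numerical check of the identity behind this, using arbitrary made-up vectors and a fixed random seed. The exact error term sin *a* sin *b* cos γ is bounded in absolute value by sin *a* sin *b*, and that bound is what the sketch verifies.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

def unit(v):
    return v / np.linalg.norm(v)

x = unit(rng.standard_normal(10))
y = unit(x + 0.1 * rng.standard_normal(10))  # nearly parallel to x
z = unit(y + 0.1 * rng.standard_normal(10))  # nearly parallel to y

cxy, cyz, cxz = x @ y, y @ z, x @ z  # cosines of the angles a, b, c

# |cos c - cos a cos b| <= sin a sin b
bound = np.sqrt(1 - cxy**2) * np.sqrt(1 - cyz**2)
assert abs(cxz - cxy * cyz) <= bound + 1e-12
```

Since the noise level 0.1 keeps all three vectors highly correlated, the product `cxy * cyz` lands close to `cxz`.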

Exercise for the reader: interpret the error term in the geometric problem in statistical terms.

We stopped the construction where we did because the next triangle to be added would overlap the first triangle, which would clutter the image. But we could certainly have kept going.

If we do keep going, then the set of hypotenuse angles will be dense in the circle, with no repeats.

The *n*th triangle has sides of length 1 and √*n*, and so the tangent of *n*th triangle’s acute angle is 1/√*n*. The angle formed by the *n*th hypotenuse is thus

arctan(1) + arctan(1/√2) + arctan(1/√3) + … + arctan(1/√*n*).

Here’s a plot of the first 99 hypotenuse angles.

Here’s the code that produced the plot.

from numpy import *
import matplotlib.pyplot as plt

plt.axes().set_aspect(1)
N = 100
theta = cumsum(arctan(arange(1, N)**-0.5))
plt.scatter(cos(theta), sin(theta))
plt.savefig("theodorus_angles.svg")

If we change `N` to 500 we get a solid ring because the angles are closer together than the default thickness of dots in a scatterplot.

How would you plot this spiral? At each step, you need to draw a segment of length 1, perpendicular to the hypotenuse of the previous triangle. There are two perpendicular directions, and you want to choose the one that moves counterclockwise.

If we step outside the *xy* plane, we can compute the cross product of the unit vector in the *z* direction with the vector (*x*, *y*). The cross product will be perpendicular to both, and by the right-hand rule, it will point in the counterclockwise direction.

The cross product of (0, 0, 1) and (*x*, *y*, 0) is (-*y*, *x*, 0), so the direction we want to go in the *xy* plane is (-*y*, *x*). We divide this vector by its length to get a vector of length 1, then add it to our previous point. [1]

Here’s Python code to draw the spiral.

import matplotlib.pyplot as plt

def next_vertex(x, y):
    h = (x**2 + y**2)**0.5
    return (x - y/h, y + x/h)

plt.axes().set_aspect(1)
plt.axis('off')

# base of the first triangle
plt.plot([0, 1], [0, 0])

N = 17
x_old, y_old = 1, 0
for n in range(1, N):
    x_new, y_new = next_vertex(x_old, y_old)
    # draw short side
    plt.plot([x_old, x_new], [y_old, y_new])
    # draw hypotenuse
    plt.plot([0, x_new], [0, y_new])
    x_old, y_old = x_new, y_new
plt.savefig("theodorus.png")

If you’re not familiar with the `plot` function above, you might expect its two arguments to be points. But the first argument is a list of *x* coordinates and the second a list of *y* coordinates.

Update: See the next post for how the hypotenuse angles fill in a circle.

[1] Here’s an alternative derivation using complex numbers.

Label the current vertex in the complex plane *z*_{n}. Then *z*_{n}/|*z*_{n}| has length 1 and points in the same direction from the origin. Multiplying this point by *i* rotates it a quarter turn counterclockwise, so

*z*_{n+1} = *z*_{n} + *i* *z*_{n}/|*z*_{n}|.

Two years after the publication of RSA, Michael Rabin created an alternative that is provably as hard to break as factoring integers.

Like RSA, Rabin’s method begins by selecting two large primes, *p* and *q*, and keeping them secret. Their product *n* = *pq* is the public key. Given a message *m* in the form of a number between 0 and *n*, the cipher text is simply

*c* = *m*² mod *n*.

Rabin showed that if you can recover *m* from *c*, then you can factor *n*.

However, Rabin’s method does not decrypt uniquely. For every cipher text *c*, there are four possible clear texts. This may seem like a show stopper, but there are ways to get around it. You could pad messages with some sort of hash before encryption so that it is extremely unlikely that more than one of the four decryption possibilities will have the right format. It is also possible to make the method uniquely invertible by placing some restrictions on the primes used.
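Here's a toy sketch that makes the four-fold ambiguity concrete. The primes are far too small for real use, and I've assumed *p* ≡ *q* ≡ 3 (mod 4), which makes modular square roots easy to compute. (`pow(x, -1, n)` for modular inverses requires Python 3.8+.)

```python
def rabin_decrypt(c, p, q):
    # assumes p = q = 3 (mod 4): a square root of c mod p is c^((p+1)/4)
    rp = pow(c, (p + 1)//4, p)
    rq = pow(c, (q + 1)//4, q)
    n = p*q
    # combine the roots mod p and mod q via the Chinese Remainder Theorem
    yp, yq = pow(p, -1, q), pow(q, -1, p)
    r = (rp*q*yq + rq*p*yp) % n
    s = (rp*q*yq - rq*p*yp) % n
    return {r, n - r, s, n - s}  # the four possible clear texts

p, q = 7, 11          # toy primes, far too small for real use
n = p*q
m = 20                # the clear text
c = m*m % n           # encryption is just squaring mod n
assert m in rabin_decrypt(c, p, q)
```

All four decryptions are legitimate square roots of *c* mod *n*; nothing in the ciphertext says which one was the original message.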

I don’t know that Rabin’s method has been used in practice. I suspect that the assurance that attacks are as hard as factoring integers isn’t valuable enough to inspire people to put in the effort to harden the method for practical use [1].

There’s growing suspicion that factoring may not be as hard as we’ve thought. And we know that if large scale quantum computing becomes practical, factoring integers will be easy.

My impression is that researchers are more concerned about a breakthrough in factoring than they are about a way to break RSA without factoring [2, 3].

[1] One necessary step in practical implementation of Rabin’s method would be to make sure *m* > √*n*. Otherwise you could recover *m* by taking the *integer* square root rather than having to take a modular square root. The former is easy and the latter is hard.

[2] There are attacks against RSA that do not involve factoring, but they mostly exploit implementation flaws. They are not a frontal assault on the math problem posed by RSA.

[3] By “breakthrough” I meant a huge improvement in efficiency. But another kind of breakthrough is conceivable. Someone could prove that factoring really is hard (on classical computers) by establishing a lower bound. That seems very unlikely, but it would be interesting. Maybe it would spark renewed interest in Rabin’s method.

This post looks at **Aitken’s method** for speeding up the convergence of a series. It does not matter whether the series is an alternating series or not. Aitken’s method works best when the series is converging approximately like a geometric series, which is often true for power series of analytic functions.

Aitken acceleration estimates the sum of a series by taking the last three partial sums and extrapolating to where the partial sums appear to be going. That is, if the last three partial sums are *s*_{0}, *s*_{1}, and *s*_{2}, it estimates the sum as

*s*_{2} − (*s*_{2} − *s*_{1})² / (*s*_{2} − 2*s*_{1} + *s*_{0}).

We’ll use the exponential function as our example, estimating exp(0.3) first by using six terms of the Taylor series for exp(*x*) and then by applying Aitken’s method.

Here’s a little Python code to carry out our example.

from math import exp, factorial
from numpy import cumsum

def aitken(s0, s1, s2):
    return s2 - (s2 - s1)**2/(s2 - 2*s1 + s0)

def exp_partial_sums(x, N):
    terms = [x**n/factorial(n) for n in range(N)]
    return cumsum(terms)

x, N = 0.3, 6
partial = exp_partial_sums(x, N)

# estimate taking the last partial sum
p = partial[-1]

# estimate applying Aitken acceleration
a = aitken(partial[-3], partial[-2], partial[-1])

# error analysis
error_p = exp(x) - p
error_a = exp(x) - a
print(error_p, error_a, error_p/error_a)

This says the error simply using partial sums is 1.0576e-06. The error using Aitken acceleration is -2.3498e-07, which is 4.5 times smaller.

If we go back to the example of the Struve function in the previous post, the approximation error using Aitken acceleration is about 3 times smaller than simply using partial sums. So Aitken acceleration would have been more appropriate than Euler acceleration.

Why would you use an acceleration method rather than just adding up more terms of the series? The latter might be the thing to do, depending on context. If the terms of the series are expensive to calculate, acceleration will be more economical since it only requires a few arithmetic operations. If the terms of your series involve special functions, the additional cycles needed to apply Aitken acceleration are negligible.

Sometimes you only have a limited number of terms, such as when you’ve derived the first few terms from a hand calculation. For example, maybe you’ve computed a series solution to a differential equation. In that case, Aitken’s method lets you squeeze a little more accuracy out of the terms you have.

For our demo we’ll evaluate the Struve function defined by the series

*H*_{ν}(*x*) = Σ_{m = 0}^{∞} (−1)^{m} (*x*/2)^{2m + ν + 1} / (Γ(*m* + 3/2) Γ(*m* + ν + 3/2)).

Note that the terms in the series are relatively expensive to evaluate since each requires evaluating a gamma function. Euler’s acceleration method will be inexpensive relative to computing the terms it takes as input.

Here’s what we get by evaluating the first partial sums for *H*_{1.2}(3.4):

2.34748 0.67236 1.19572 1.10378 1.11414 1.11332

So if we were to stop here, we’d report 1.11332 as the value of *H*_{1.2}(3.4).
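These partial sums can be reproduced in Python by summing terms of the form (−1)^m (*x*/2)^{2m+ν+1} / (Γ(*m* + 3/2) Γ(*m* + ν + 3/2)). Here's a sketch; the function name is my own.

```python
from math import gamma
from itertools import accumulate

def struve_partial_sums(v, x, N):
    # first N terms of the power series for the Struve function H_v(x)
    terms = [(-1)**m * (x/2)**(2*m + v + 1) / (gamma(m + 1.5) * gamma(m + v + 1.5))
             for m in range(N)]
    return list(accumulate(terms))

sums = struve_partial_sums(1.2, 3.4, 6)
```

The list `sums` matches the six partial sums displayed above.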

Next let’s see what we’d get using Euler’s transformation for alternating series. I’ll use full precision in my calculations internally but only display four digits, to save horizontal space that we’ll need shortly.

2.3474 1.5099
0.6723 0.9340
1.1957 1.1497
1.1037 1.1089
1.1141 1.1137
1.1133

Now we repeat the process again, taking averages of consecutive terms in each column to produce the next column.

2.3474 1.5099 1.2219 1.1319 1.1087 1.1058
0.6723 0.9340 1.0418 1.0856 1.1029
1.1957 1.1497 1.1293 1.1203
1.1037 1.1089 1.1113
1.1141 1.1137
1.1133

The terms across the top are the Euler approximations to the series. The final term 1.1058 is the Euler approximation based on all six terms. And it is **worse** than what we get from simply taking partial sums!

The exact value is 1.11337… and so the sixth partial sum was accurate to four decimal places, but our clever method was only good to one decimal place.

What went wrong?! Euler’s method is designed to speed up the convergence of **slowly converging** alternating series. But our series converges pretty quickly because it has two terms in the denominator that grow like factorials. When you apply acceleration when you don’t need to, **you can make things worse**. Since our series was already converging quickly, we would have done better to use Aitken acceleration, the topic of the next post.

But Euler’s method works well when it’s needed. For example, let’s look at the slowly converging series

π = 4 – 4/3 + 4/5 – 4/7 + 4/9 – …

Then we get the following array.

4.0000 3.3333 3.2000 3.1619 3.1492 3.1445
2.6666 3.0666 3.1238 3.1365 3.1399
3.4666 3.1809 3.1492 3.1434
2.8952 3.1174 3.1376
3.3396 3.1578
2.9760

The sequence of partial sums is along the left side, and the series of Euler terms is across the top row. Neither gives a great approximation to π, but the approximation error using Euler’s acceleration method is 57 times smaller than simply using the partial sums.
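The repeated-averaging scheme is only a few lines of Python. Here's a direct transcription of the arrays above, collecting the top row of the triangle; the function name is my own and the code is not optimized.

```python
from itertools import accumulate

def euler_top_row(partial_sums):
    # repeatedly average consecutive entries; the first entry of each
    # column is the top row of the triangular array
    row = list(partial_sums)
    top = [row[0]]
    while len(row) > 1:
        row = [(u + v)/2 for u, v in zip(row, row[1:])]
        top.append(row[0])
    return top

# partial sums of pi = 4 - 4/3 + 4/5 - 4/7 + ...
partial = list(accumulate(4*(-1)**k / (2*k + 1) for k in range(6)))
top = euler_top_row(partial)
```

The last entry of `top` is the Euler approximation based on all six terms, much closer to π than the sixth partial sum is.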

There’s a terrific visualization of data breach statistics at Information is Beautiful, and they share their data here. Note that the data only includes breaches of over 30,000 records.

By using their data, we can look at the total number of records breached each year. The figure for 2019 is projected: since the data stop in June of this year, I’ve assumed there will be as many breaches in the second half of 2019 as the first half.

Note that records breached is measured in billions.

Need to respond to a data incident/breach? We can help.

The Diffie-Hellman problems are formulated for an Abelian group. The main group we have in mind is the multiplicative group of non-zero integers modulo a large prime *p*. But we start out more generally because the Diffie-Hellman problems are harder over some groups than others as we will see below.

Let *g* be a generator of a group. When we think of the group operation written as multiplication, this means that every element of the group is a power of *g*.

Let *a* and *b* be two integers. Given *g*^{a} and *g*^{b}, the CDH (computational Diffie-Hellman) problem is to compute *g*^{ab}. Note that the exponent is *ab* and not *a* + *b*. The latter would be easy: just multiply *g*^{a} and *g*^{b}.

To compute *g*^{ab} we apparently need to know either *a* or *b*, which we are not given. Solving for either exponent is the discrete logarithm problem, which is impractical for some groups.

It’s conceivable that there is a way to solve CDH without solving the discrete logarithm problem, but at this time the most efficient attacks on CDH compute discrete logs.

Diffie-Hellman key exchange depends on the assumption that CDH is a hard problem.

Suppose Alice and Bob both agree on a generator *g*, and select private keys *a* and *b* respectively. Alice sends Bob *g*^{a} and Bob sends Alice *g*^{b}. This doesn’t reveal their private keys because the discrete logarithm problem is (believed to be) hard. Now both compute *g*^{ab} and use it as their shared key. Alice computes *g*^{ab} by raising *g*^{b} to the power *a*, and Bob computes *g*^{ab} by raising *g*^{a} to the power *b*.
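Here's the exchange in miniature. The prime below is a toy value I chose for illustration; real implementations use standardized groups with primes thousands of bits long.

```python
import secrets

p = 23  # toy safe prime (p = 2*11 + 1); far too small for real use
g = 5   # a generator of the multiplicative group mod 23

a = secrets.randbelow(p - 2) + 1  # Alice's secret exponent
b = secrets.randbelow(p - 2) + 1  # Bob's secret exponent

A = pow(g, a, p)  # Alice sends g^a to Bob
B = pow(g, b, p)  # Bob sends g^b to Alice

shared_alice = pow(B, a, p)  # Alice computes (g^b)^a
shared_bob   = pow(A, b, p)  # Bob computes (g^a)^b
assert shared_alice == shared_bob  # both equal g^(ab) mod p
```

An eavesdropper sees `A` and `B` but not `a` or `b`, and so faces exactly the CDH problem.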

If someone intercepted the exchange between Alice and Bob, they would possess *g*^{a} and *g*^{b} and would want to compute *g*^{ab}. This is the CDH problem.

When working over the integers modulo a large prime *p*, CDH is hard if *p*-1 has a large factor, such as when *p* – 1 = 2*q* for a prime *q*. If *p*-1 has only small factors, i.e. if *p*-1 is “smooth”, then the discrete logarithm problem is tractable via the Silver-Pohlig-Hellman algorithm [1]. But if *p* is large and *p*-1 has a large prime factor, CDH is hard as far as we currently know.

The DDH (decision Diffie-Hellman) problem also starts with knowledge of *g*^{a} and *g*^{b}. But instead of asking to compute *g*^{ab}, it asks whether one can distinguish *g*^{ab} from *g*^{c} for a randomly chosen *c* with probability better than 0.5.

Clearly DDH is weaker than CDH because if we can solve CDH we know the answer to the DDH question with certainty. Still, the DDH problem is believed to be hard over some groups, such as certain elliptic curves, but not over other groups, as illustrated below.

Suppose we’re working in the multiplicative group of non-zero integers modulo a prime *p*. Using Legendre symbols, which are practical to compute even for very large numbers, you can determine whether *g*^{a} is a square mod *p*, which happens if and only if *a* is even. So by computing the Legendre symbol for *g*^{a} mod *p*, we know the parity of *a*. Similarly we can find the parity of *b*, and so we know the parity of *ab*. If *ab* is odd but *g*^{c} is a square mod *p*, we know that the answer to the DDH problem is no. Similarly if *ab* is even but *g*^{c} is not a square mod *p*, we also know that the answer to the DDH problem is no.

Since half the positive integers mod *p* are squares, we can answer the DDH problem with certainty half the time by the parity argument above. If our parity argument is inconclusive we have to guess. So overall we can answer the DDH problem correctly with probability 0.75.
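Here's the parity leak in code, with toy numbers of my own choosing. Euler's criterion computes the Legendre symbol as *x*^{(p−1)/2} mod *p*.

```python
def legendre(x, p):
    # Euler's criterion: 1 for squares mod p, p - 1 for non-squares
    return pow(x, (p - 1)//2, p)

p, g = 23, 5   # toy parameters; 5 generates the nonzero integers mod 23
a, b = 7, 10   # secret exponents an eavesdropper wants to learn about

ga, gb = pow(g, a, p), pow(g, b, p)  # the public values

# g^a is a square mod p exactly when a is even, so the Legendre symbols
# of the public values leak the parity of a, b, and hence of ab
assert (legendre(ga, p) == 1) == (a % 2 == 0)
assert (legendre(gb, p) == 1) == (b % 2 == 0)
```

Nothing secret was needed: the attacker computes both Legendre symbols directly from the intercepted values.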

[1] As is common when you have a lot of names attached to a theorem, there were multiple discoveries. Silver discovered the algorithm first, but Pohlig and Hellman discovered it independently and published first.

It turns out there are no strict magic squares formed by knight’s tours. This was proved in 2003. See a news article here.

- Magic squares as matrices
- Consecutive pair magic squares
- Alphamagic squares in English, French, and Spanish
- Consecutive primes magic square
- Magic squares associated with Mars and Jupiter
- A king’s magic tour

I’ve written before about Milton’s use of the word in Paradise Lost. Milton is alluding to the four elements of antiquity: air, earth, fire, and water.

I recently found out the word quaternion appears in the Latin Vulgate as well. Acts 12:4 records that Herod had Peter guarded by four squads of four soldiers each, “quattuor quaternionibus militum.”

**Update**: Thanks to Phil for pointing out in the comments that *quaternion* appears in the King James as well, “four quaternions of soldiers.”

There’s little relation between technical difficulty and usefulness.

Fermat’s last theorem says there are no positive integer solutions to

*a*^{n} + *b*^{n} = *c*^{n}

for integers *n* > 2. This theorem was proven three and a half centuries after Fermat proposed it. The theorem is well known, even in pop culture. For example, Captain Picard tries to prove it in Star Trek: The Next Generation. Fermat’s last theorem was famous for being an open problem that was easy to state. Now that it has been proven, perhaps it will fade from popular consciousness.

The mathematics developed over the years in attempts to prove Fermat’s last theorem has been very useful, but the theorem itself is of no practical importance that I’m aware of.

Fermat’s little theorem says that if *p* is a prime and *a* is any integer not divisible by *p*, then *a*^{p − 1} − 1 is divisible by *p*. This theorem is unknown in pop culture but familiar in math circles. It’s proved near the beginning of any introductory number theory class.

The theorem, and its generalization by Euler, comes up constantly in applications of number theory. See three examples here.
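The theorem is easy to check numerically, and its contrapositive gives a quick compositeness test:

```python
p = 101  # a prime
# a^(p-1) = 1 (mod p) for every a not divisible by p
assert all(pow(a, p - 1, p) == 1 for a in range(1, p))

# contrapositive: if the congruence fails, n cannot be prime
n = 15
assert pow(2, n - 1, n) != 1  # so 15 is certainly composite
```

(The converse can fail: Carmichael numbers pass this test for every base coprime to them, which is why practical primality tests use stronger variants.)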

Suppose *p* is an odd prime and let *E* be the elliptic curve with equation

*y*² = *x*³ + *ax*² + *bx* + *c*

where all the variables come from the integers mod *p*. Now pick some *d* that is not a square mod *p*, i.e. there is no *x* such that *x*² – *d* is divisible by *p*. Then the curve *E*‘ with equation

*dy*² = *x*³ + *ax*² + *bx* + *c*

is a twist of *E*. More explicitly, it’s a **quadratic twist** of *E*. This is usually what people have in mind when they just say *twist* with no modifier, but there are other kinds of twists.

Let’s get concrete. We’ll work with integers mod 7. Then the squares are 1, 4, and 2. (If that last one looks strange, note that 3*3 = 2 mod 7.) And so the non-squares are 3, 5, and 6. Suppose our curve *E* is

*y*² = *x*³ + 3*x* + 2.

Then we can form a twist of *E* by multiplying the left side by 3, 5, or 6. Let’s use 3. So *E*‘ given by

3*y*² = *x*³ + 3*x* + 2

is a twist of *E*.

There’s something potentially misleading in the term “twisted curve”. The curve *E*‘ is a twist of *E*, so *E*‘ is twisted relative to *E*, *and vice versa*. You can’t say *E*‘ is twisted and *E* is not.

But you might object that *E*‘ has the form

*dy*² = *x*³ + *ax*² + *bx* + *c*

where *d* is a non-square, and *E* does not, so *E*‘ is the one that’s twisted. But a change of variables can leave the curve the same while changing the form of the equation. Every curve has an equation in which the coefficient of *y*² is 1 and a form in which the coefficient is a non-square. Specifically,

*dy*² = *x*³ + *ax*² + *bx* + *c*

and

*y*² = *x*³ + *dax*² + *d*²*bx* + *d*³*c*

specify the same curve.

The two curves *E* and *E*‘ are not isomorphic over the field of integers mod *p*, but they are isomorphic over the quadratic extension of the integers mod *p*. That is, if you extend the field by adding the square root of *d*, the two curves are isomorphic over that field.

For a given *x*, *f*(*x*) = *x*³ + *ax*² + *bx* + *c* is either a square or a non-square. If it is a square, then

*y*² =* f*(*x*)

has a solution and

*dy*² =* f*(*x*)

does not. Conversely, if *f*(*x*) is not a square, then

*dy*² =* f*(*x*)

has a solution and

*y*² =* f*(*x*)

does not. So a given *x* coordinate either corresponds to a point on *E* or a point on *E*‘, but not both.

In elliptic curve cryptography, it’s good if not only is the curve you’re using secure, so are its twists. Suppose you intend to work over a curve *E*, and someone sends you an *x* coordinate that’s not on *E*. If you don’t check whether *x* corresponds to a point on *E*, you are effectively working on a twist *E*‘ rather than *E*. That can be a security hole if *E*‘ is not as secure a curve as *E*, i.e. if the discrete logarithm problem on *E*‘ can be solved more efficiently than the same problem on *E*.

If you generate 256 random bits and apply a secure 256-bit hash algorithm, an attacker wanting to recover your input can’t do much better than brute force, hashing 256-bit strings hoping to find one that matches your hash value. Even then, the attacker may have found a collision, another string of bits that happens to have the same hash value.

But if you know the input comes from a restricted set, a set small enough to search by brute force, then hash functions are reversible. If I know that you’ve either hashed “yes” or “no,” then I can apply the hash function to both and see which one it was.

Suppose someone has attempted to anonymize a data set by hashing personally identifying information (PII) such as name, phone number, etc. These inputs come from a small enough space that a brute force search is easy.

For instance, suppose someone has applied a cryptographic hash to first names. Then all an attacker needs to do is find a list of common names, hash them all, and see which hash values match. I searched for a list of names and found this, a list of the 1000 most popular baby girl and boy names in California in 2017.

The data set was compiled based on 473,441 births. Of those births, 366,039 had one of the 2,000 names. That is, 77% of the babies had one of the 1,000 most common names for their sex.

I wrote a little script to read in all the names and compute a SHA256 hash. The program took a fraction of a second to run. With the output of this program, I can’t re-identify every first name in a data set, but I could re-identify 77% of them, assuming my list of names is representative [1].

Now, for example, if I see the hash value

96001f228724d6a56db13d147a9080848103cf7d67bf08f53bda5d038777b2e6

in the data set, I can look this value up in my list of 2000 hash values and see that it is the hashed value of ACHILLES [2].

If you saw the hash value above and had no idea where it came from—it could be the hash of a JPEG image file for all you know—it would be hopeless to try to figure out what produced it. But if you suspect it’s the hash of a first name, it’s trivial to reverse.
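Here's a sketch of that kind of re-identification script, with a made-up miniature name list standing in for the real one.

```python
import hashlib

# stand-in for a list of the most common first names
names = ["ACHILLES", "EMMA", "NOAH", "OLIVIA", "LIAM"]

# precompute the reverse lookup table: hash value -> name
table = {hashlib.sha256(n.encode()).hexdigest(): n for n in names}

def reidentify(hash_value):
    # dictionary lookup, not a cryptographic attack
    return table.get(hash_value)  # None if the name isn't in our list
```

Re-identifying a hashed name is then a dictionary access, taking essentially no time at all.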

Hashing numbers is simpler but more computationally intense. I had to do a little research to find a list of names, but I know that social security numbers are simply 9-digit numbers. There are only a billion possible nine-digit numbers, so it’s feasible to hash them all to make a look-up table.

I tried this using a little Python script on an old laptop. In 20 seconds I was able to create a file with the hash values of a million SSNs, so presumably I could have finished the job in about five and a half hours. Of course it would take even less time with more efficient code or a more powerful computer.

One way to improve the situation would be to use a key, a random string of bits that you combine with values before hashing. An attacker not knowing your secret key value could not do something as simple as what was described above.

However, in a large data set, such as one from a data breach, an attacker could apply frequency analysis to get some idea how hash values correspond to names. Hash values that show up most frequently in the data probably correspond to popular names etc. This isn’t definitive, but it is useful information. You might be able to tell, for example, that someone has a common first name and a rare last name. This could help narrow down the possibilities for identifying someone.
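A keyed hash such as HMAC blocks the table-lookup attack, but it is still deterministic, and that determinism is what makes frequency analysis possible. A sketch:

```python
import hmac, hashlib, secrets

key = secrets.token_bytes(32)  # secret key, not stored with the data

def keyed_hash(name):
    return hmac.new(key, name.encode(), hashlib.sha256).hexdigest()

# an attacker without the key can't build a lookup table, but the
# same name still maps to the same value every time it appears
assert keyed_hash("EMMA") == keyed_hash("EMMA")
assert keyed_hash("EMMA") != keyed_hash("NOAH")
```

Counting how often each output value occurs in a large data set still reveals the frequency distribution of the inputs.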

Several people have commented that the problem goes away if you use a unique salt value for each record. In some ways this is true, but in others it is not.

If you do use a unique salt per record, and save the salt with the data, then you can still do a brute force attack. The execution time is now quadratic rather than linear, but still feasible.

If you throw the salt away, then you’ve replaced each identifier with a random string, effectively removing these columns since they’re now filled with useless random noise. Why not just delete them?

You could use a unique salt per user rather than per record. Then you couldn’t identify a given record, but you could tell when two records belong to the same person. But in this case, why not just assign random IDs to each user? Use the salt itself as the ID. No need to use a hash function.

Note that the argument above for keys applies to using a custom hashing algorithm, either something you write from scratch or by combining rounds of established methods.

Kerckhoffs’ principle advises against relying on security-by-obscurity. It says that you should assume that your algorithm will become known, which often turns out to be the case. But no matter: any hashing scheme that always maps the same input to the same output is vulnerable to frequency analysis.

- California’s new CCPA and deidentified data
- Comparing truncation to differential privacy
- Data privacy consulting

[1] Name frequencies change over time and space, but I imagine the data from California in 2017 would be adequate to identify most Americans of any age and in any state. Certainly this would not meet the HIPAA standard that “the risk is very small that the information could be used … to identify an individual.”

[2] The California baby name data set standardized all names to capital letters and removed diacritical marks. With a little extra effort, one could hash variations such as JOSE, JOSÉ, Jose, and José.


Comparisons of technologies are multi-faceted. When someone says “*X* is better than *Y*” I want to ask “By *all* criteria? There’s *nothing* better about *Y*?”

When people say *X* is better but *Y* won, it’s often the case that they think the criteria by which *X* is better are most important, but a majority placed more weight on the criteria by which *Y* is better.

Some plausible explanations for why people stick with *Y*:

- *X* is better than *Y* for some tasks, but not many people consider those tasks most important.
- *X* is marginally better than *Y*, but not enough better to justify the costs of switching.
- *X* was better than *Y*, but they didn’t do the work of letting people know about it and explaining why it was better.
- *X* is technically better than *Y*, but the personalities associated with *X* turn people off.

Sometimes the better alternative simply doesn’t win. The race doesn’t go to the swift, nor the battle to the strong. But often a broader perspective explains why the winner won.

**Related post**: Ford-Chevy arguments


Suppose you have an integrand that looks roughly like a normal density, something with a single peak that drops off fairly quickly on either side of the peak. The majority of integrals that arise in basic applications of probability and statistics fit this description.

You can approximate the value of the integral as the area of a box. The height of the box is the value of the integrand at its peak. The width of the box is the **FWHM**: full width at half maximum, i.e. the distance between the two points where the integrand takes on half its maximum value.

This post will include several examples to illustrate how well, or how poorly, the technique works.

First let’s look at exp(-*x*²), the normal density function apart from some scaling factors, integrated over the whole real line. The maximum of the function is 1, occurring at *x* = 0. So half the maximum is 1/2, and exp(-*x*²) = 1/2 at *x* = ± √ log 2. The FWHM is thus 2 √ log 2, and since our maximum is 1, the FWHM itself is our approximation to the integral. This gives 1.665, whereas the exact value is √π = 1.772.
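Here's the arithmetic for this first example in code, using the known closed form √π for the exact value.

```python
from math import sqrt, log, pi

height = 1.0                  # max of exp(-x^2), attained at x = 0
fwhm = 2*sqrt(log(2))         # exp(-x^2) = 1/2 at x = +/- sqrt(log 2)
approx = height * fwhm        # box approximation: height times FWHM

exact = sqrt(pi)              # the Gaussian integral over the real line
relative_error = (exact - approx)/exact
```

The approximation undershoots by about 6%, which is respectable for such a crude rule.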

Next, let’s look at integrating 1/(1 + *x*^{2n}) for positive integers *n*. In every case, the maximum is 1. The integrand takes on half its maximum value at *x* = ± 1, so the FWHM is always 2. So we would approximate the integral as 2 in every case, independent of *n*.

When *n* = 1, the correct value of the integral is π and so our approximation of 2 is pretty bad. But that’s not surprising. We said above that our integrand needs to drop off fairly quickly, but that’s not the case here. We have a fat tailed distribution, the Cauchy, and so it’s not surprising that our approximation is poor.

However, as *n* increases, our integral drops off more quickly, and the approximation improves. When *n* = 2, the exact value of the integral is 2.221. When *n* = 3, the integral is 2.094. When *n* = 10 the integral is 2.008. And in the limit as *n* goes to infinity, the integral is 2 and our approximation is exact.

The integrands in the examples above were all symmetric about the origin. That may give the misleading impression that our approximation only applies to symmetric integrals. The examples also all had a maximum of 1, which could give the impression that the integral approximation is just the FWHM, not the FWHM times the maximum. We’ll correct both of these impressions in the next example.

Our task is to approximate the integral

∫_{-∞}^{∞} *e*^{ax} / (1 + *e*^{x})^{a + b} *dx*

which comes up in Bayesian logistic regression with an improper prior.

The integrand is symmetric about 0 when *a* = *b*. But otherwise the integrand is not symmetric, and the peak occurs at log(*a*/*b*) rather than at 0, and the value at that peak is

*a*^{a} *b*^{b} / (*a* + *b*)^{a + b}.

We’ll use the following Python code to help with our calculations.

from numpy import log, exp
from scipy.optimize import brentq

def logistic_max(a, b):
    return a**a * b**b / (a + b)**(a + b)

def fwhm(a, b):
    hm = 0.5*logistic_max(a, b)
    f = lambda x: a*x - (a + b)*log(1 + exp(x)) - log(hm)
    left = brentq(f, -2, 0)
    right = brentq(f, 0, 2)
    return right - left

def approx(a, b):
    return logistic_max(a, b)*fwhm(a, b)

And now for a couple examples. When *a* = 3 and *b* = 2, we get an approximate value of 0.0761, compared to an exact value of 0.0833, a relative error of 8.6%.

When *a* = 10 and *b* = 12, we get an approximate value of 2.647 × 10^{-7}, compared to an exact value of 2.835 × 10^{-7}, a relative error of 6.6%.
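The exact values quoted above are Beta function values: substituting *t* = *e*^{x}/(1 + *e*^{x}) turns the integral into B(*a*, *b*) = Γ(*a*)Γ(*b*)/Γ(*a* + *b*). Here's a check using only the standard library; the crude Riemann sum is accurate here because the integrand decays exponentially in both directions.

```python
from math import gamma, exp

def beta(a, b):
    return gamma(a)*gamma(b)/gamma(a + b)

def integral(a, b, n=20000, lo=-40.0, hi=40.0):
    # left Riemann sum; endpoints are negligible for this integrand
    h = (hi - lo)/n
    return h*sum(exp(a*(lo + k*h)) / (1 + exp(lo + k*h))**(a + b)
                 for k in range(n))

assert abs(integral(3, 2) - beta(3, 2)) < 1e-6
```

For *a* = 3, *b* = 2 this gives B(3, 2) = 1/12 ≈ 0.0833, the exact value quoted above.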

First, you can use “five tomatoes” as a mnemonic for remembering that there are 5280 feet in a mile.

“Five tomatoes” is a mnemonic for the number of feet in a mile because it sounds like “five two eight oh.”

The number of meters in a kilometer is 1,000. Alas, there’s no mnemonic for that. ;)

— Units (@UnitFact) June 21, 2019

As I’ve argued earlier, systems of units that seem awkward now were designed according to different criteria. If you pointed out to a medieval Englishman that the number of feet in a mile is hard to remember, he might ask “Why would you want to convert feet into miles?”

The number of feet in a mile is very nearly

exp(π √67 / 3).

The exact value of the expression above is 5280.0000088… The difference between a mile and exp(π √67 / 3) feet is less than the length of an E. coli bacterium.
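Ordinary double precision is enough to see the near miss:

```python
from math import exp, pi, sqrt

x = exp(pi * sqrt(67) / 3)
# x = 5280.0000088..., close to but not exactly 5280
assert 1e-6 < x - 5280 < 1e-4
```

The value lands above 5280 by less than a hundred-thousandth of a foot.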

When people see examples like this, an expression that can’t be an integer and yet is eerily close to being an integer, some ask whether there’s a reason. Others think the question is meaningless. “The number equals whatever it equals, and it happens to be near an integer. So what?”

But there *is* some deeper math going on here. I touch on the reason in this post. 67 is a Heegner number, and so exp(π √67) is nearly an integer. That integer is 5280³ + 744 and so exp(π √67 / 3) ≈ 5280.

Lurking in the background is something called the *j*-function. It has something to do with why some expressions are very nearly integers. Incidentally, the number 744 mentioned above is the constant term in the Laurent expansion for the *j*-function.

The discriminant of the quadratic equation

*ax*² + *bx* + *c* = 0

is

Δ = *b*² – 4*ac*.

If the discriminant Δ is zero, the equation has a double root, i.e. there is a unique *x* that makes the equation zero, and it counts twice as a root. If the discriminant is not zero, there are two distinct roots.
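As a quick illustration, here's a small helper of my own (not from the post) that computes the roots via the discriminant; cmath handles the negative-discriminant case:

```python
import cmath

def quadratic_roots(a, b, c):
    # roots of a x^2 + b x + c = 0 via the quadratic formula
    disc = b*b - 4*a*c
    sq = cmath.sqrt(disc)
    return (-b + sq) / (2*a), (-b - sq) / (2*a)

print(quadratic_roots(1, -2, 1))  # discriminant 0: double root at 1
print(quadratic_roots(1, 0, 1))   # discriminant -4: conjugate roots ±i
```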

Cubic equations also have a discriminant. For a cubic equation

*ax*³ + *bx*² + *cx* + *d* = 0

the discriminant is given by

Δ = 18*abcd* – 4*b*³*d* + *b*²*c*² – 4*ac*³ – 27*a*²*d*².

If Δ = 0, the equation has a multiple root, but otherwise it has three distinct roots.
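To illustrate, here's a direct translation of the formula into Python (my own helper, not part of the original post). The polynomial (*x* – 1)²(*x* + 2) = *x*³ – 3*x* + 2 has a double root, so its discriminant should be zero:

```python
def cubic_discriminant(a, b, c, d):
    # discriminant of a x^3 + b x^2 + c x + d
    return 18*a*b*c*d - 4*b**3*d + b**2*c**2 - 4*a*c**3 - 27*a**2*d**2

print(cubic_discriminant(1, 0, -3, 2))  # 0: (x - 1)^2 (x + 2) has a double root
print(cubic_discriminant(1, 0, -1, 0))  # 4: x^3 - x has three distinct real roots
```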

A change of variable can reduce the general cubic equation to a so-called “depressed” cubic equation of the form

*x*³ + *px* + *q* = 0

in which case the discriminant simplifies to

Δ = –4*p*³ – 27*q*².

Here are a couple interesting connections. The idea of reducing a cubic equation to a depressed cubic goes back to Cardano (1501–1576). What’s called a depressed cubic in this context is known as the Weierstrass (1815–1897) form in the context of elliptic curves. That is, an elliptic curve of the form

*y*² = *x*³ + *ax* + *b*

is said to be in Weierstrass form. In other words, an elliptic curve is in Weierstrass form if the right hand side is a depressed cubic.

Furthermore, an elliptic curve is required to be non-singular, which means it must satisfy

4*a*³ + 27*b*² ≠ 0.

In other words, the discriminant of the right hand side is non-zero. In the context of elliptic curves, the discriminant is defined to be

Δ = –16(4*a*³ + 27*b*²)

which is the same as the discriminant above, except for a factor of 16 that simplifies some calculations with elliptic curves.
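Here's the non-singularity condition as a one-line check (a hypothetical helper of mine, not standard library code):

```python
def is_nonsingular(a, b):
    # y^2 = x^3 + a x + b defines an elliptic curve iff 4 a^3 + 27 b^2 != 0
    return 4*a**3 + 27*b**2 != 0

print(is_nonsingular(-3, 2))  # False: x^3 - 3x + 2 has a double root
print(is_nonsingular(0, 7))   # True
```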

In the context of solving quadratic and cubic equations, we’re usually implicitly working with real or complex numbers. Suppose all the coefficients of a quadratic equation are real. If the discriminant is positive, there are two distinct real roots. If the discriminant is negative, there are two distinct complex roots, and these roots are complex conjugates of each other.

Similar remarks hold for cubic equations when the coefficients are all real. If the discriminant is positive, there are three distinct real roots. If the discriminant is negative, there is one real root and a complex conjugate pair of complex roots.

In the first section I only considered whether the discriminant was zero, and so the statements are independent of the field the coefficients come from.

For elliptic curves, one works over a variety of fields. Maybe real or complex numbers, but also finite fields. In most of the blog posts I’ve written about elliptic curves, the field is integers modulo a large prime.

If the equation

*x*² = *n* mod *p*

has a solution then *n* is a square mod *p*, or in classical terminology, *n* is a quadratic residue mod *p*. Half of the numbers between 0 and *p* are quadratic residues and half are not. The residues are distributed somewhat randomly. For large *p*, quadratic residues are distributed enough like random numbers to be useful in cryptographic applications. For example, see this post. But the distribution isn’t entirely random as we’ll demonstrate here.
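One standard way to test for quadratic residues is Euler's criterion: *n* is a residue mod an odd prime *p* exactly when *n*^((*p*–1)/2) = 1 mod *p*. A quick sketch:

```python
def is_quadratic_residue(n, p):
    # Euler's criterion, valid for odd prime p and n not divisible by p
    return pow(n, (p - 1)//2, p) == 1

print(is_quadratic_residue(2, 7))  # True: 3^2 = 2 mod 7
print(is_quadratic_residue(3, 7))  # False
```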

First let’s look at a couple illustrations, using *p* = 101 and *p* = 103. We’ll use the following Python code to plot the residues.

```python
import matplotlib.pyplot as plt

for p in [101, 103]:
    s = {x**2 % p for x in range(1, p)}
    for q in s:
        plt.vlines(q, 0, 1)
    plt.title("Quadratic residues mod {}".format(p))
    plt.savefig("residues_mod_{}.svg".format(p))
    plt.close()
```

Here are the plots.

If you look closely at the first image, you’ll notice it appears to be symmetric about the middle. And if you look closely at the second image, you’ll notice it appears to be anti-symmetric, i.e. the left side has bars where the right side has gaps and vice versa.

The following code shows that these observations are correct.

```python
p = 101
s = {x**2 % p for x in range(1, p)}
for q in s:
    assert p - q in s

p = 103
s = {x**2 % p for x in range(1, p)}
for q in s:
    assert p - q not in s
```

Both of these observations hold more generally. If *p* = 1 mod 4, the residues are symmetric: *n* is a quadratic residue if and only if *p* – *n* is a quadratic residue. And if *p* = 3 mod 4, the residues are anti-symmetric: *n* is a quadratic residue if and only if *p* – *n* is *not* a quadratic residue. Both statements follow from the first supplement to quadratic reciprocity, which says –1 is a quadratic residue mod *p* if and only if *p* = 1 mod 4.
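Here's a brute-force check of the underlying fact, namely that –1 is a residue mod *p* exactly when *p* = 1 mod 4, for small primes (my own snippet, using trial division for simplicity):

```python
def is_prime(n):
    # trial division, fine for small n
    return n > 1 and all(n % d for d in range(2, int(n**0.5) + 1))

for p in range(3, 200):
    if not is_prime(p):
        continue
    # test whether -1 (i.e. p - 1) is a quadratic residue via Euler's criterion
    minus_one_is_residue = pow(p - 1, (p - 1)//2, p) == 1
    assert minus_one_is_residue == (p % 4 == 1)
print("checked")
```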

If you look again at the plot of residues for *p* = 103, you’ll notice that not only is it anti-symmetric, it is also darker on the left side. We can verify with the following code

```python
print(len([q for q in s if q < 51]))
print(len([q for q in s if q > 51]))
```

that there are 28 residues on the left and 23 on the right. This is a general pattern called quadratic excess. The quadratic excess of an odd prime *p*, denoted *E*(*p*), is the number of quadratic residues to the left of the middle minus the number to the right of the middle. If *p* = 1 mod 4, *E*(*p*) is zero by symmetry, but if *p* = 3 mod 4, *E*(*p*) is always positive.
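Here's a little function of my own (not from the post) that computes *E*(*p*) directly:

```python
def quadratic_excess(p):
    # residues left of the middle minus residues right of the middle
    s = {x*x % p for x in range(1, p)}
    mid = p / 2
    return sum(1 for q in s if q < mid) - sum(1 for q in s if q > mid)

print(quadratic_excess(103))  # 5, matching the counts 28 and 23 above
print(quadratic_excess(101))  # 0 by symmetry
```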

Here is a plot of quadratic excess for primes less than 3000 and congruent to 3 mod 4.

In this post I’ve said a couple things that are in tension. At the top I said that quadratic residues act sufficiently like random bits to be useful in cryptography. Then later on I showed that they have major departures from random behavior. That would be a problem if you looked at the entire sequence of residues, but if *p* has thousands of digits and you only sample a few of the residues, that’s a different situation.

Suppose you pick two large primes *p* and *q*, keep them secret, and tell me their product *N* = *pq*. If you then pick a random value *n* and ask me whether it’s a quadratic residue mod *N*, the best I can do is say “Maybe, with probability 1/2.” There are cryptosystems that depend on this being a computationally hard problem unless the factors *p* and *q* are known.

However, you need to be a little careful. If *n* is smaller than √*N*, I could test whether *n* is a square, working in the integers, not the integers mod *N*. If it is a square integer, then it’s a square mod *N*. If it’s not a square integer, it might be a square mod *N*, but I wouldn’t know.
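The integer square test can be done exactly with math.isqrt, avoiding floating point even for numbers with thousands of digits:

```python
from math import isqrt

def is_square_integer(n):
    # exact test for whether n is a perfect square, no floating point involved
    r = isqrt(n)
    return r * r == n

print(is_square_integer(10**100))      # True
print(is_square_integer(10**100 + 1))  # False
```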

There’s a related issue I may explore in the future. Suppose instead of just asking whether a particular number is a quadratic residue, you take a range of numbers and return 0 or 1 depending on whether each number is a residue. Can I distinguish the sequence from a random sequence using statistical tests? If I took a sequence of length greater than *N*/2 then I could tell by looking at symmetry or antisymmetry. But what if the range is small relative to *N*? Maybe *N* is a thousand-digit number but you only give me a sequence of a million bits. Would the distribution be distinguishable from uniform random bits? Is there any sort of autocorrelation that might betray the sequence as not random?