Elliptic curves have been studied for many years by pure mathematicians with no intention to apply the results to anything outside math itself. And yet elliptic curves have become a critical part of applied cryptography.

Elliptic curves are very concrete. There are some subtleties in the definition—more on that in a moment—but they’re essentially the set of points satisfying a simple equation. And yet a lot of extremely abstract mathematics has been developed out of necessity to study these simple objects. And while the objects are in some sense simple, the questions that people naturally ask about them are far from simple.

A preliminary definition of an elliptic curve is the set of points satisfying

*y*² = *x*³ + *ax + b.*

This preliminary definition requires some qualifications. The values *x*, *y*, *a*, and *b* come from some field, and that field is an important part of the definition of an elliptic curve. If that field is the real numbers, then all elliptic curves can be put in the form above, known as the Weierstrass form. (That’s a theorem, not a definition.) For fields of characteristic 2 or 3, the Weierstrass form isn’t general enough. Also, we require that

4*a*³ + 27*b*² ≠ 0.

The other day I wrote about Curve1174, a particular elliptic curve used in cryptography. The points on this curve satisfy

*x*² + *y*² = 1 – 1174 *x*² *y*²

This equation does *not* specify an elliptic curve if we’re working over the real numbers. But Curve1174 is defined over the integers modulo *p* = 2^{251} – 9. There it *is* an elliptic curve. It is equivalent to a curve in Weierstrass form, though that’s not true when working over the reals. So whether an equation defines an elliptic curve depends on the field its constituents come from.

An elliptic curve is not an ellipse, and it may not be a curve in the usual sense.

There is a connection between elliptic curves and ellipses, but it’s indirect. Elliptic curves are related to the integrals you would write down to find the length of a portion of an ellipse.

Working over the real numbers, an elliptic curve is a curve in the geometric sense. Working over a finite field, an elliptic curve is a finite set of points, not a continuum. Working over the complex numbers, an elliptic curve is a two-dimensional surface. The name “curve” is extended by analogy to elliptic curves over general fields.
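To make the finite-field case concrete, here’s a brute-force sketch in Python. The prime and coefficients are made up for illustration; over GF(97) the “curve” is just a finite list of pairs:

```python
def curve_points(a, b, p):
    "All affine points (x, y) with y^2 = x^3 + ax + b over GF(p)."
    return [(x, y) for x in range(p) for y in range(p)
            if (y*y - (x**3 + a*x + b)) % p == 0]

# A small, made-up example: a = 2, b = 3 over GF(97)
pts = curve_points(2, 3, 97)
```

Here `pts` is a finite set of points, not a continuum, which is what “curve” means in this setting.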

In this section we’ll give the full definition of an elliptic curve, though we’ll be deliberately vague about some of the details.

The definition of an elliptic curve is not in terms of equations of a particular form. It says an elliptic curve is a

- smooth,
- projective,
- algebraic curve,
- of genus one,
- having a specified base point.

Working over real numbers, **smoothness** can be specified in terms of derivatives. But what does smoothness mean when working over a finite field? You take the derivative equations from the real case and extend them by analogy to other fields. You can “differentiate” polynomials in settings where you can’t take limits by defining derivatives algebraically. (The condition 4*a*³ + 27*b*² ≠ 0 above is to guarantee smoothness.)

Informally, **projective** means we add “points at infinity” as necessary to make things more consistent. Formally, we’re not actually working with pairs of coordinates (*x*, *y*) but equivalence classes of triples of coordinates (*x*, *y*, *z*). You can usually think in terms of pairs of values, but the extra value is there when you need it to deal with points at infinity.

An **algebraic curve** is the set of points satisfying a polynomial equation.

The **genus** of an algebraic curve is roughly the number of holes it has. Over the complex numbers, the genus of an algebraic curve really is the number of holes. As with so many ideas in algebra, a theorem from a familiar context is taken as a definition in a more general context.

We haven’t talked about the idea of a **base point**. There’s a way to add points on an elliptic curve, and the choice of a base point is a choice of where to put the additive identity. In the post on Curve1174, we go into the addition in detail, and the base point is (0, 1). In elliptic curve cryptography, the choice of base point can be very important. This post gives an example, specifying the base point on a curve used in the implementation of Bitcoin.

This is a good move, but unnecessary. Here’s what I mean by that. The update was likely unnecessary for reasons I’ll explain below, but it was easy to do, and it increased consistency across Microsoft’s product line. It’s also good PR.

Let’s back up a bit. SHA-1 and SHA-2 are secure hash functions [1]. They take a file, in this case a Microsoft software update, and return a relatively small number, small relative to the original file size. In the case of SHA-1, the result is 160 bits (20 bytes). They’re designed so that if a file is changed, the function value is nearly certain to change. That is, it’s extremely unlikely that a change to the file would not result in a change to the hash value.
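To see this in Python, hash some content twice with `hashlib`, changing a single byte in between. (The payloads here are made up; any byte strings work.)

```python
import hashlib

update   = b"example update payload"
tampered = b"example update payloaX"   # a single byte changed

h1 = hashlib.sha1(update).hexdigest()
h2 = hashlib.sha1(tampered).hexdigest()

# Each digest is 160 bits, i.e. 40 hex characters, and the two differ
print(h1)
print(h2)
```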

The concern isn’t accidental changes. The probability of accidentally producing two files with the same hash value is tiny, as I show here.

The concern is a clever attacker who could modify the software update in such a way that the hash function remains unchanged, bypassing the hash as a security measure. That would be harder to do with SHA-2 than with SHA-1, hence Microsoft’s decision years ago to move to SHA-2 for new versions of the operating system, and its recent decision to make the change retroactive.

By a collision we mean two files that hash to the same value. It’s obvious from the pigeon hole principle [2] that collisions are possible, but how hard are they to produce deliberately?

Google demonstrated two years ago that it could produce two PDF files with the same SHA-1 hash value. But doing so required over 6,500 years of CPU time running in parallel [3]. Also, Google started with a file designed to make collisions possible. According to their announcement,

We started by creating a PDF prefix specifically crafted to allow us to generate two documents with arbitrary distinct visual contents, but that would hash to the same SHA-1 digest.

It would be harder to start with a specified input, such as a software update file, and generate a collision. It would be harder still to generate a collision that had some desired behavior.

According to this page, it’s known how to tamper with *two* files simultaneously so that they will have the same SHA-1 hash values. This is what Google did, at the cost of thousands of CPU years. But so far, nobody has been able to start with a given file and create another file with the same SHA-1 value.

As I said at the beginning, it made sense for Microsoft to decide to move from SHA-1 to SHA-2 because the cost of doing so was small. But the use of SHA-1 hash codes is probably not the biggest security risk in Windows 7.

- Putting Google’s SHA-1 collision in perspective
- Probability of secure hash function collisions
- The hash function menagerie

[1] SHA-1 is a hash function, but SHA-2 is actually a family of hash functions: SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256. All are believed to provide at least 112 bits of security, while SHA-1 provides less than 63.

The SHA-*x* functions output *x* bits. The SHA-*x*/*y* functions use *x* bits of internal state and output *y* bits. To be consistent with this naming convention, SHA-1 should be called SHA-160.

[2] The pigeon hole principle says that if you put more than *n* things into *n* boxes, one of the boxes has to have more than one thing. If you hash files of more than *n* bits to *n*-bit numbers, at least two files have to go to the same value.

[3] If you were to rent this much CPU time in the cloud at 5 cents per CPU hour, it would cost about $2,800,000. If the only obstacle were the cost of computing resources, someone might be willing to pay that to tamper with a Microsoft update.

There’s some truth to that. MD5 was very popular, and remains popular years after it was proven insecure. And now variations on SHA like SHA1 and SHA256 are commonly used. But there are a **lot more** cryptographic hash functions in common use.

If Python’s `hashlib` is a reliable guide, the most common hashing algorithms are

- MD5
- SHA1
- SHA224
- SHA256
- SHA384
- SHA512

because these are the six algorithms guaranteed to be supported on every platform, as listed in the `algorithms_guaranteed` attribute of `hashlib`.

The `algorithms_available` attribute of `hashlib` includes additional algorithms available in a particular installation. On the computer I’m using at the moment, it lists 18 hash algorithms in addition to those on the guaranteed list.

Mathematica supports the hash functions on `hashlib`’s guaranteed list, and a few more:

- Adler32
- CRC32
- MD2
- MD4
- RIPEMD160
- RIPEMD160SHA256
- SHA256SHA256

The first two hashes, Adler32 and CRC32, were never intended to be secure. They were designed as error detection codes and weren’t meant to be tamper-resistant. As the names imply, MD2 and MD4 were predecessors to MD5.

The hash that Mathematica calls RIPEMD160SHA256 is SHA256 followed by RIPEMD160. This combination gets a name of its own because it is used in Bitcoin. Finally, SHA256SHA256 is simply SHA256 applied twice.
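The doubled hash is a one-liner with `hashlib`. (RIPEMD160 support depends on the local OpenSSL build, so only the SHA256SHA256 construction is shown here.)

```python
import hashlib

def sha256d(data):
    "SHA-256 applied twice, i.e. the SHA256SHA256 construction."
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

digest = sha256d(b"hello")   # 32 bytes
```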

The hash functions mentioned above are the most commonly used, but there are hundreds of others in use. The hashcat password cracking tool lists **260 kinds of hash functions** it can attack.

Some of these hash functions are fundamental algorithms, such as Whirlpool and variations of GOST. Some are combinations of primitive functions, such as salted or iterated variations. Many of them are vendor and product specific. For example, hashcat lists nine different hashing algorithms associated with various versions of Microsoft Office, six algorithms for Cisco products, five algorithms for SAP, etc.

It’s interesting to speculate on why there are so many custom hash functions: hashing technology progress, differing emphases on security and speed, not-invented-here syndrome, etc.

There’s something going on that isn’t exactly security-by-obscurity, i.e. relying on keeping your encryption algorithm a secret. The hashing algorithms for all the products mentioned above are well known, but there may be some small advantage to being a little bit off the beaten path.

People have built special hardware and software for attacking popular hashing algorithms, and doing something a little different could prevent this from working on your hash. Of course doing something a little different could also introduce a weakness you didn’t anticipate. Creating your own encryption algorithm is a bad idea unless you’re an expert, and often even if you are an expert. But making a new hash function by combining secure primitives is not as dangerous as creating your own encryption algorithm.

However, there’s a special case where the details are not complicated: the so-called Edwards curves. I’ll look briefly at Edwards curves in general, then focus on Curve1174, a particular Edwards curve used in cryptography.

The example here could be used in an introductory group theory course with no reference to elliptic curves. Just think of it as a funny way to add pairs of integers.

For a particular class of elliptic curve, Edwards curves, the addition formula is simpler than usual. As mentioned a few days ago, an Edwards curve has the form

*x*² + *y*² = 1 + *d* *x*² *y*²

where *d* is not 0 or 1 in the underlying finite field. Then addition on the curve is given by

(*x*_{1}, *y*_{1}) + (*x*_{2}, *y*_{2}) = ((*x*_{1}*y*_{2} + *x*_{2}*y*_{1}) / (1 + *d* *x*_{1}*x*_{2}*y*_{1}*y*_{2}), (*y*_{1}*y*_{2} – *x*_{1}*x*_{2}) / (1 – *d* *x*_{1}*x*_{2}*y*_{1}*y*_{2})).

When *d* is a square, there are some exceptions. When *d* is not a square, as will be the case in our application, the denominators are never zero, and so the formula above is all there is to the addition rule.

Note that the division in the formula above is division in the underlying finite field, i.e. multiplication by the multiplicative inverse.
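For example, Python’s built-in three-argument `pow` computes modular inverses directly (Python 3.8 and later), so dividing by *b* mod *p* is just multiplying by `pow(b, -1, p)`:

```python
p = 2**251 - 9
b = 1174

binv = pow(b, -1, p)           # multiplicative inverse of b mod p
assert (b * binv) % p == 1     # dividing by b means multiplying by binv
```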

We’re interested in Curve1174, a particular elliptic curve used in cryptography. The underlying field is GF(*p*), the integers modulo the prime *p* = 2^{251} – 9. Also, *d* = -1174, whence the curve takes its name.

Over the real numbers Curve1174 is a closed curve in the plane, though we’re interested in the curve over the integers mod *p*. (By the way, if *d* > 0 you get a curve that looks like a squircle.)

We consider the pairs of integers (*x*, *y*) that lie on the curve, i.e. those that satisfy

*x*² + *y*² = 1 + *d* *x*² *y*² mod *p*.

You can show that the sum of two points on the curve is another point on the curve, if you define addition with the formula above. The identity element for addition is the pair (0, 1). The additive inverse of a point (*x*, *y*) is the point (-*x*, *y*). So we have a group. Addition is commutative, and so in fact we have an Abelian group.

We can implement addition on Curve1174 in a few lines of Python.

```python
from sympy import mod_inverse

def divide(a, b, p):
    "Compute a/b in GF(p)."
    return (a*mod_inverse(b, p)) % p

def group_add(x1, y1, x2, y2, p, d):
    "Add the points (x1, y1) and (x2, y2) on the Edwards curve."
    x3 = divide(x1*y2 + x2*y1, 1 + d*x1*x2*y1*y2, p)
    y3 = divide(y1*y2 - x1*x2, 1 - d*x1*x2*y1*y2, p)
    return (x3, y3)
```

The only thing we needed SymPy for was the `mod_inverse` function. It wouldn’t take much work to write your own `mod_inverse` from scratch, using a variation on the Euclidean algorithm as outlined here.

It’s clear that (1, 0) is a point on the curve, and so we can add it to itself with the code

```python
p = 2**251 - 9
d = -1174
print(group_add(1, 0, 1, 0, p, d))
```

and find that it equals

(0, 3618502788666131106986593281521497120414687020801267626233049500247285301238),

which may come as a bit of a surprise. Arithmetic here is not intuitive; it scrambles up points well, which hints at why the curve is useful in cryptography.

Let’s find another point on the curve. Let’s set *x* = 2019 and see what *y* is. When we come up with the equation *y* must satisfy, the Jacobi symbol shows there is no solution.

When *x* = 2025 there is a solution, and we can compute it using `sqrt_mod` from `sympy.ntheory`.

```python
from sympy.ntheory import sqrt_mod

x = 2025
k = divide(1 - x*x, 1 - d*x*x, p)
y = sqrt_mod(k, p)
```

This says the point

(2025, 588747530266665079407582947937120321357732884331117971504880828350684014295)

is on Curve1174. And since *x* and *y* only appear as squares in the equation defining the curve, once we find an (*x*, *y*) pair on the curve, the points (±*x*, ±*y*) are also on the curve.

Just for grins, let’s double the point (*x*, *y*) above, i.e. add it to itself. This works out to

(2795920935947049934301363619759082573282734750667150130900931245990107374027, 2577351770662637935098262284237063829290291047539093190165388036658162531660).

In general it can be hard to compute how many points lie on an elliptic curve, but in the case of Curve1174 the number of points is known. Bernstein et al. computed that the number of points on Curve1174 is *p* + 1 – *t* where *t* is a big number, but much smaller than *p*, on the order of the square root of *p*. Specifically,

*t* = 45330879683285730139092453152713398836.

Why not just absorb the 1 into *t*? This was done to match the notation in Hasse’s theorem. See the footnote here.

What does all this have to do with cryptography? Cryptographers like to find problems that can be computed easily but that are hard to reverse. Most public key cryptography methods depend on the difficulty of undoing one of three things:

- multiplication,
- modular exponentiation, or
- multiplication over an elliptic curve.

RSA encryption, for example, depends on the difficulty of factoring the product of two large primes.

The **elliptic curve discrete logarithm problem** (**ECDLP**) is the problem of undoing multiplication over an elliptic curve. If *n* is an integer and *P* is a point on the curve, we can compute *Q* = *nP* easily. If *n* is large, we don’t just add *P* to itself *n* times. Instead we double it log_{2}*n* times and add the necessary intermediate results, analogous to fast exponentiation.
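Double-and-add can be sketched for Curve1174, using the Edwards curve addition rule. This is for illustration only, not an implementation suitable for cryptographic use; it uses Python’s built-in `pow` for modular inverses rather than SymPy.

```python
p = 2**251 - 9
d = -1174

def edwards_add(P, Q):
    "Add two points on x^2 + y^2 = 1 + d x^2 y^2 over GF(p)."
    x1, y1 = P
    x2, y2 = Q
    k = d*x1*x2*y1*y2
    x3 = (x1*y2 + x2*y1) * pow(1 + k, -1, p) % p
    y3 = (y1*y2 - x1*x2) * pow(1 - k, -1, p) % p
    return (x3, y3)

def scalar_mult(n, P):
    "Compute nP with about log2(n) doublings (double-and-add)."
    R = (0, 1)                  # additive identity
    while n:
        if n & 1:
            R = edwards_add(R, P)
        P = edwards_add(P, P)   # double
        n >>= 1
    return R
```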

It’s easy to compute *Q* given *n* and *P*, but it’s hard to compute *n* given *P* and *Q*. This is the elliptic curve discrete logarithm problem that ECC protocols rely on for their security.

Are these one-liners real or mythology? To some extent, they’re both. Below I’ll give a famous real example. Then I’ll argue that even though such examples do occur, they may create unrealistic expectations.

In 1986, Jon Bentley posted the following exercise:

Given a text file and an integer *k*, print the *k* most common words in the file (and the number of their occurrences) in decreasing frequency.

Donald Knuth wrote an elegant program in response. Knuth’s program runs for 17 pages in his book Literate Programming.

McIlroy’s solution is short enough to quote below [1].

```shell
tr -cs A-Za-z '
' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
```
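To see the pipeline in action, here’s a hypothetical run on a one-line sample, asking for the top two words:

```shell
printf 'The cat and the dog and the bird\n' |
tr -cs A-Za-z '
' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed 2q
```

The first `tr` puts each word on its own line, the second lowercases everything, `sort | uniq -c` counts the words, `sort -rn` orders by count, and `sed 2q` keeps the top two lines: `the` (3 occurrences) and `and` (2).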

McIlroy’s response to Knuth was like Abraham Lincoln’s response to Edward Everett at Gettysburg. Lincoln’s famous address was 50x shorter than that of the orator who preceded him [2].

Knuth and McIlroy had very different objectives and placed different constraints on themselves, and so their solutions are not directly comparable. But McIlroy’s solution has become famous. Knuth’s solution is remembered, if at all, as the verbose program that McIlroy responded to.

The stereotype of a Unix wizard is someone who could improvise programs like the one above. Maybe McIlroy carefully thought about his program for days, looking for the most elegant solution. That would seem plausible, but in fact he says the script was “written on the spot and worked on the first try.” He said that the script was similar to one he had written a year before, but it still counts as an improvisation.

McIlroy’s script was a real example of the kind of wizardry attributed to Unix adepts. Why can’t more people quickly improvise scripts like that?

The exercise that Bentley posed was the kind of problem that programmers like McIlroy solved routinely at the time. The tools he piped together were developed precisely for such problems. McIlroy didn’t see his solution as extraordinary but said “Old UNIX hands know instinctively how to solve this one in a jiffy.”

The traditional Unix toolbox is full of utilities for text manipulation. Not only are they useful, but they compose well. This composability depends not only on the tools themselves, but also the shell environment they were designed to operate in. (The latter is why some utilities don’t work as well when ported to other operating systems, even if the functionality is duplicated.)

Bentley’s exercise was clearly text-based: given a text file, produce a text file. What about problems that are not text manipulation? The trick to being productive from a command line is to turn problems into text manipulation problems. The output of a shell command is text. Programs are text. Once you get into the necessary mindset, everything is text. This may not be the most efficient approach to a given problem, but it’s a possible strategy.

The hard part on the path to becoming a command line wizard, or any kind of wizard, is thinking about how to apply existing tools to your particular problems. You could memorize McIlroy’s script and be prepared next time you need to report word frequencies, but applying the *spirit* of his script to your particular problems takes work. Reading one-liners that other people have developed for their work may be inspiring, or intimidating, but they’re no substitute for thinking hard about your particular work.

You get faster at anything with repetition. Maybe you don’t solve any particular kind of problem often enough to be fluent at solving it. If someone can solve a problem by quickly typing a one-liner in a shell, maybe they are clever, or maybe their job is repetitive. Or maybe both: maybe they’ve found a way to make semi-repetitive tasks repetitive enough to automate. One way to become more productive is to split semi-repetitive tasks into more creative and more repetitive parts.

[1] The odd looking line break is a quoted newline.

[2] Everett’s speech contained 13,607 words while Lincoln’s Gettysburg Address contained 272, a ratio of almost exactly 50 to 1.

The named elliptic curves are over a prime field, i.e. a finite field with a prime number of elements *p*, denoted GF(*p*). The number of points on the elliptic curve is on the order of *p* [1].

The curve names usually contain a number which is the number of bits in the binary representation of *p*. Let’s see how that plays out with a few named elliptic curves.

| Curve name | Bits in *p* |
|---|---|
| ANSSI FRP256v1 | 256 |
| BN(2, 254) | 254 |
| brainpoolP256t1 | 256 |
| Curve1174 | 251 |
| Curve25519 | 255 |
| Curve383187 | 383 |
| E-222 | 222 |
| E-382 | 382 |
| E-521 | 521 |
| Ed448-Goldilocks | 448 |
| M-221 | 221 |
| M-383 | 383 |
| M-511 | 511 |
| NIST P-224 | 224 |
| NIST P-256 | 256 |
| secp256k1 | 256 |

In Curve25519, *p* = 2^{255} – 19 and in Curve383187, *p* = 2^{383} – 187. Here the number of bits in *p* is the first part of the name, with the final constant of *p* tacked on the end.

The only mystery on the list is Curve1174 where *p* has 251 bits. The equation for the curve is

*x*² + *y*² = 1 – 1174 *x*² *y*²

and so the 1174 in the name comes from a coefficient rather than from the number of bits in *p*.

The equation for Curve1174 doesn’t look like an elliptic curve. It doesn’t have the familiar (Weierstrass) form

*y*² = *x*³ + *ax* + *b*.

It is an example of an Edwards curve, named after Harold Edwards. So are all the curves above whose names start with “E”. These curves have the form

*x*² + *y*² = 1 + *d* *x*² *y*².

where *d* is not 0 or 1. So some Edwards curves are named after their *d* parameter and some are named after the number of bits in *p*.

It’s not obvious that an Edwards curve can be changed into Weierstrass form, but apparently it’s possible; this paper goes into the details.

The advantage of Edwards curves is that the elliptic curve group addition has a simple, convenient form. Also, when *d* is not a square in the underlying field, there are no exceptional points to consider for group addition.

Is *d* = -1174 a square in the field underlying Curve1174? For that curve *p* = 2^{251} – 9, and we can use the Jacobi symbol code from earlier this week to show that *d* is not a square.

```python
p = 2**251 - 9
d = p - 1174
print(jacobi(d, p))
```

This prints -1, indicating that *d* is not a square. Note that we set *d* to *p* – 1174 rather than -1174 because our code assumes the first argument is positive, and -1174 and *p* – 1174 are equivalent mod *p*.

**Update**: More on addition on Curve1174.

[1] It is difficult to compute the exact number of points on an elliptic curve over a prime field. However, the number is roughly *p* ± 2√*p*. More precisely, Hasse’s theorem says that the number of points *N* satisfies

|*N* – (*p* + 1)| ≤ 2√*p*.

We will present Python code for playing with the entropy extractor. (μRNG is extremely efficient, but the Python code here is not; it’s just for illustration.) The code will show how to use the `pyfinite` library to do arithmetic over a finite field.

The μRNG generator starts with three bit streams—*X*, *Y*, and *Z*—each with at least 1/3 bit min-entropy per bit.

Min-entropy is Rényi entropy with α = ∞. For a Bernoulli random variable that takes on two values, one with probability *p* and the other with probability 1 – *p*, the min-entropy is

-log_{2} max(*p*, 1-*p*).

So requiring min-entropy of at least 1/3 means the two probabilities are less than 2^{-1/3} = 0.7937.
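A quick check of that arithmetic in Python:

```python
from math import log2

def min_entropy(p):
    "Min-entropy, in bits, of a Bernoulli(p) random variable."
    return -log2(max(p, 1 - p))

# A max probability of 2^(-1/3) gives exactly 1/3 bit of min-entropy
```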

Take eight bits (one byte) at a time from *X*, *Y*, and *Z*, and interpret each byte as an element of the finite field with 2^{8} elements. Then compute

*X***Y* + *Z*

in this field. The resulting stream of bits will be independent and uniformly distributed, or very nearly so.

We will need the `bernoulli` class from SciPy for generating our input bit streams, and the `pyfinite` library for doing finite field arithmetic on the bits.

```python
from scipy.stats import bernoulli
from pyfinite import ffield
```

And we will need a couple bit manipulation functions.

```python
def bits_to_num(a):
    "Convert an array of bits to an integer."
    x = 0
    for i in range(len(a)):
        x += a[i]*2**i
    return x

def bitCount(n):
    "Count how many bits are set to 1."
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count
```

The following function generates random bytes using the entropy extractor. The input bit streams have *p* = 0.79, corresponding to min-entropy 0.34.

```python
def generate_byte():
    "Generate a byte using the entropy extractor."
    b = bernoulli(0.79)
    x = bits_to_num(b.rvs(8))
    y = bits_to_num(b.rvs(8))
    z = bits_to_num(b.rvs(8))
    F = ffield.FField(8)
    return F.Add(F.Multiply(x, y), z)
```

Note that 79% of the bits produced by the Bernoulli generator will be 1’s. But we can see that the output bytes are about half 1’s and half 0’s.

```python
s = 0
N = 1000
for _ in range(N):
    s += bitCount(generate_byte())
print(s/(8*N))
```

This returned 0.50375 the first time I ran it and 0.49925 the second time.

For more details see the μRNG paper.


The entropy of a binary random variable taking one value with probability *p* and the other with probability 1 – *p* is

*S* = –*p* log_{2} *p* – (1-*p*) log_{2} (1-*p*).

It’s common to start with *p* and compute entropy, but recently I had to go the other way around: given entropy, solve for *p*. It’s easy to come up with an approximate solution.

Entropy in this case is approximately quadratic

*S* ≈ 4*p*(1-*p*)

and so

*p* ≈ (1 ± √(1-*S*))/2.

This is a good approximation if *S* is near 0 or 1 but mediocre in the middle. You could solve for *p* numerically, say with Newton’s method, to get more accuracy if needed.
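Here’s one way to do the refinement, using bisection rather than Newton’s method for simplicity. It finds the root with *p* ≤ 1/2; the other root is its mirror image 1 – *p*.

```python
from math import log2

def entropy(p):
    "Binary entropy of a Bernoulli(p) random variable."
    return -p*log2(p) - (1 - p)*log2(1 - p)

def p_from_entropy(S, tol=1e-12):
    "Solve entropy(p) = S for the root with p <= 1/2, by bisection."
    lo, hi = 1e-12, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if entropy(mid) < S:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```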

For example, if you need to know the last digit of *a*×*b*, it might seem you need to know *a* and *b* so you can multiply them together. But you only have to know the last digits of *a* and *b*. In fact, if one of the last digits is 0, that’s all you need to know.
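In code, the claim is simply:

```python
a, b = 123456789, 987654321

# The last digit of a*b depends only on the last digits of a and b
assert (a * b) % 10 == ((a % 10) * (b % 10)) % 10

# If either factor ends in 0, the product ends in 0
assert (1230 * b) % 10 == 0
```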

As a math consultant, I often tell clients they don’t need information that they think they need. That news may come as a relief, or it may cause anxiety. I may tell a client, for instance, that missing data cannot change a conclusion, so it’s not worth waiting for. Whether that brings relief or anxiety depends on whether they believe me.

There’s a physics demonstration where you have a heavy ball on a long cable. You pull back the ball like a pendulum and let it touch your chin. Then let the ball go and stand still. If you’re convinced of the physical laws governing the motion of the ball, you can stand there without flinching. You know that just as it left your chin with zero velocity, it will return with zero velocity. In fact, because of friction, it won’t quite return to your chin. But if you’re not fully convinced of this explanation, you’ll be afraid that a wrecking ball is about to smash your face, and understandably so.

When you tell clients that they don’t need something they think they need, it may come across as if you’re asking them to stand still as a wrecking ball swings toward their face. It’s not enough to be correct; you need to be persuasive as well. Examples help. As with the physics demo, you can put your own face on the line before asking them to do the same.

David Johnston left a comment saying that he and his colleagues used this extension to finite fields as part of the construction of μRNG, a remarkably efficient true random number generator. The finite field version of Erdős-Szemerédi leads to a way to combine three physical but non-uniform sources of randomness into a uniformly distributed stream of bits.

Bourgain, Katz, and Tao proved that the result

max{|*A*+*A*|, |*A*·*A*|} ≥ *c*|*A*|^{1+ε}

extends to subsets *A* from a finite field *F* with conditions on the field *F* and the size of *A* relative to *F*.

It suffices for *F* to have prime order. More generally, and importantly for applications, it also holds for fields of order 2^{p} for prime *p*.

Given a constant δ > 0, if the size of the set *A* satisfies

|*F*|^{δ} < |*A*| < |*F*|^{1-δ}

then the theorem holds where the constant *c* depends on δ.

The **Legendre symbol** (*a* | *p*) is a notation defined where *a* is a positive integer and *p* is prime. It is defined to be 0 if *a* is a multiple of *p*, 1 if *a* has a square root mod *p*, and -1 otherwise.

The Jacobi symbol is a generalization of the Legendre symbol and uses the same notation. It relaxes the requirement that *p* be prime and only requires that *p* is odd.

If *m* has prime factors *p*_{i} with exponents *e*_{i}, then the Jacobi symbol is defined by

(*a* | *m*) = ∏_{i} (*a* | *p*_{i})^{e_{i}}.

Note that the symbol on the left is a Jacobi symbol while the symbols on the right are Legendre symbols.

The Legendre and Jacobi symbols are **not** fractions, but they act in some ways like fractions, and so the notation is suggestive. They come up in applications of number theory, so it’s useful to be able to compute them.

Since the Legendre symbol is a special case of the Jacobi symbol, we only need an algorithm for computing the latter.

In the earlier post mentioned above, I outline an algorithm for computing Legendre symbols. The code below is more explicit, and more general. It’s Python code, but it doesn’t depend on any libraries or special features of Python, so it could easily be translated to another language. The algorithm is taken from Algorithmic Number Theory by Bach and Shallit. Its execution time is *O*( (log *a*)(log *n*) ).

```python
def jacobi(a, n):
    "Compute the Jacobi symbol (a | n) for odd n > a > 0."
    assert n > a > 0 and n % 2 == 1
    t = 1
    while a != 0:
        while a % 2 == 0:
            a //= 2
            r = n % 8
            if r == 3 or r == 5:
                t = -t
        a, n = n, a
        if a % 4 == n % 4 == 3:
            t = -t
        a %= n
    if n == 1:
        return t
    else:
        return 0
```

To test the code we randomly generate positive integers *a* and odd integers *n* greater than *a*. We compare our self-contained Jacobi symbol function to the one in SymPy.

```python
from random import randrange
from sympy import jacobi_symbol

N = 1000
for _ in range(100):
    a = randrange(1, N)
    n = randrange(a + 1, 2*N)
    if n % 2 == 0:
        n += 1
    j1 = jacobi_symbol(a, n)
    j2 = jacobi(a, n)
    if j1 != j2:
        print(a, n, j1, j2)
```

This prints nothing, suggesting that we coded the algorithm correctly.
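As a further sanity check, when the modulus is an odd prime the Jacobi symbol reduces to the Legendre symbol, and Euler’s criterion computes the Legendre symbol directly:

```python
def legendre(a, p):
    "Legendre symbol (a | p) for odd prime p, via Euler's criterion."
    t = pow(a, (p - 1)//2, p)
    return -1 if t == p - 1 else t

# For example, 2 is a square mod 7 (3*3 = 9 = 2 mod 7) but 3 is not
```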

Twitter gave me the handle @data_tip even though that’s not what I typed in, and what I typed in is not being used. Apparently they don’t let you pick your handle any more.

The most common dose finding method is the **3+3** rule. There are countless variations on this theme, but the basic idea is that you give a dose of an experimental drug to three people. If all three are OK, you go up a dose next time. If two out of three are OK, you give that dose again. If only one out of three is OK, you stop [1].

The 3+3 algorithm implicitly assumes deterministic thinking, at least in part. The assumption is that if three out of three patients respond well, we **know** the dose is safe [2].

If you increase the dose level and the next three patients experience adverse events, you stop the trial. Why? Because you *know* that the new dose is dangerous, and you *know* the previous dose was safe. You can only escalate because you assume you have complete knowledge based on three samples.

But if we treat three patients at a particular dose level and none have an adverse reaction we **do not know** for certain that the dose level is safe, though we may have sufficient confidence in its safety to try the next dose level. Similarly, if we treat three patients at a dose and all have an adverse reaction, we do not know for certain that the dose is toxic.

A Bayesian dose-finding method estimates toxicity probabilities given the data available. It might decide at one point that a dose appears safe, then reverse its decision later based on more data. Similarly, it may reverse an initial assessment that a dose is unsafe.

A dose-finding method based on posterior probabilities of toxicity is not strictly a dose *escalation* method because it can explore in two directions. It may decide that the next dose level to explore is higher or lower than the current level.

In Phase I studies of chemotherapeutics, you conventionally start at the lowest dose. This makes sense. These are toxic agents, and you naturally want to start at a dose you have reason to believe isn’t too toxic. (NB: I say “too toxic” because chemotherapy is toxic. You hope that it’s toxic to a tumor without being too toxic for the patient host.)

But on closer inspection maybe you shouldn’t start at the lowest dose. Suppose you want to test 100 mg, 200 mg, and 300 mg of some agent. Then 100 mg is the lowest dose, and it’s ethical to start at 100 mg. Now what if we add a dose of 50 mg to the possibilities? Did the 100 mg dose suddenly become unethical as a starting dose?

If you have reason to believe that 100 mg is a tolerable dose, why not start with that dose, even if you add a lower dose in case you’re wrong? This makes sense if you think in terms of dose-finding, but not if you think only in terms of dose escalation. If you can only escalate, then it’s impossible to ever give a dose below the starting dose.

[1] I have heard, but I haven’t been able to confirm, that the 3+3 method has its origin in a method proposed by John Tukey during WWII for testing bombs. When testing a mechanical system, like a bomb, there is much less uncertainty than when testing a drug in a human. In a mechanical setting, you may have a lot more confidence from three samples than you would in a medical setting.

[2] How do you explain the situation where one out of three has an adverse reaction? Is the dose safe or not? Here you naturally switch to probabilistic thinking because deterministic thinking leads to a contradiction.


RSA encryption depends on 5 numbers:

- Large primes *p* and *q*
- The modulus *n* = *pq*
- Encryption key *e*
- Decryption key *d*

The numbers *p*, *q*, and *d* are kept secret, and the numbers *e* and *n* are made public. The encryption method relies on the assumption that in practice one cannot factor *n* into *p* and *q*.

All five numbers should be chosen anew for each certificate [1], but in practice numbers are sometimes reused.

The primes *p* and *q* should be unique to each use of the method, but in practice duplicate primes have turned up: two moduli may have one of their two prime factors in common, which lets you factor both moduli using the Euclidean algorithm.
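To illustrate with toy numbers (real RSA primes have hundreds of digits), here is a hypothetical pair of moduli that share a prime factor:

```python
from math import gcd

# Toy example: two RSA moduli that share the prime factor 1000003.
p, q1, q2 = 1000003, 1000033, 1000037
n1, n2 = p * q1, p * q2

# The Euclidean algorithm recovers the shared prime immediately,
# which factors both moduli.
shared = gcd(n1, n2)
print(shared, n1 // shared, n2 // shared)
```

Factoring a modulus is supposed to be infeasible; computing a greatest common divisor is nearly instantaneous, even for numbers with hundreds of digits.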

The encryption key *e* does not need to be unique, or so we thought. In practice the encryption exponent *e* is usually 65537, i.e. 2^{16} + 1, because this makes implementation faster. One study found that this exponent was used 99.5% of the time. However, this allows an attack on RSA encryption if any of the bits of *p* or *q* can be recovered from computer memory. More on this here.

The author of this post found several instances of certificates with shared moduli. One way this could happen is if a negligent certificate authority recycles *p*, *q*, and *n*, only generating new *e* and *d* pairs for each user. If you and someone else share a modulus *n*, you can use your (*e*, *d*) pair to factor *n*, and from knowing their public key *e* you can recover their private key *d*.

Duplicate moduli cannot happen by chance. As described here, the probability of having one shared prime due to random selection is roughly the probability of winning the Powerball jackpot 40 times in a row. The probability of having two shared primes due to random selection is inconceivably small.

[1] Everyone agrees that *p*, *q*, and hence *n* should be unique. This also means that *d* will be unique. But there is disagreement over whether the exponent *e* should be unique, though reusing *e* does lead to a possible attack as described here.

**Supercookies**, also known as **evercookies** or **zombie cookies**, are like browser cookies in that they can be used to track you, but are much harder to remove.

The way I first heard supercookies described was as a cookie that you can appear to delete, but as soon as you do, software rewrites the cookie. Like the Hydra from Greek mythology, cutting off a head does no good because it grows back [1].

This explanation is oversimplified. It doesn’t quite work that way.

A supercookie is not a cookie *per se*. It’s anything that can be used to uniquely identify your browser: font fingerprinting, flash cache, cached images, browser plugins and preferences, etc. Deleting your cookies has no effect because **a supercookie is not a cookie**.

However, a supercookie can work with other code to recreate deleted cookies, and so the simplified description is not entirely wrong. A supercookie could alert web sites that a cookie has been deleted, and allow those sites to replace that cookie, or update the cookie if some browser data has changed.

You can ask sites to not track you, but this works on an honor system and is ignored with impunity, even (especially?) by the best known companies.

Apple has announced that it is removing Do Not Track from its Safari browser because the feature is worse than useless. Servers don’t honor it, and it gives a false sense of privacy. Not only that, the DNT setting is one more bit that servers could use to identify you! Because only about 9% of users turn on DNT, knowing that someone has it turned on gives about 3.5 bits of information toward identifying that person.

How do you remove supercookies? You can’t. As explained above, **a supercookie isn’t a file that can be removed**. It’s a **procedure** for exploiting a combination of data.

You could remove specific ways that sites try to identify you. You could, for example, remove Flash to thwart attempts to exploit Flash’s data, cutting off one head of the Hydra. This might block the way some companies track you, but there are others.

It’s an arms race. As fingerprinting techniques become well known, browser developers and users try to block them, and those intent on identifying you come up with more creative approaches.

Given the efforts companies use to identify individuals (or at least their devices), **it seems it must be worth it**. At least companies believe it’s worth it, and for some it probably is. But there are reasons to believe that tracking isn’t as valuable as it seems. For example, this article argues that the most valuable targeting information is **freely given**. For example, you know who is interested in buying weighted blankets? People who search on weighted blankets!

There have been numerous anecdotal stories recently of companies that have changed their marketing methods in order to comply with GDPR and have increased their sales. These are only anecdotes, but they suggest that at least for some companies, there are **profitable alternatives** to identifying customers who don’t wish to be identified.

[1] In the Greek myth, cutting off one head of the Hydra caused *two* heads to grow back. Does deleting a supercookie cause it to come back stronger? Maybe. Clearing your cookies is another user behavior that can be used to fingerprint you.

Erdős and Szemerédi proved that there are constants *c* and ε > 0 such that

max{|*A*+*A*|, |*A***A*|} ≥ *c*|*A*|^{1+ε}

In other words, either *A*+*A* or *A***A* is substantially bigger than *A*. Erdős and Szemerédi only proved that some positive ε exists, but they suspected ε could be chosen close to 1, i.e. that either |*A*+*A*| or |*A***A*| is bounded below by a fixed multiple of |*A*|² or nearly so. George Shakan later showed that one can take ε to be any value less than

1/3 + 5/5277 = 0.3342808…

but the conjecture remains that the upper limit on ε is 1.

The following Python code will let you explore the sum-product conjecture empirically. It randomly selects sets of size *N* from the non-negative integers less than *R*, then computes the sum and product sets using set comprehensions.

```python
from numpy.random import choice

def trial(R, N):
    # R = integer range, N = sample size
    x = choice(R, N, replace=False)
    s = {a+b for a in x for b in x}
    p = {a*b for a in x for b in x}
    return (len(s), len(p))
```

When I first tried this code I thought it had a bug. I called `trial` 10 times and got the same values for |*A*+*A*| and |*A***A*| every time. That was because I chose *R* large relative to *N*. In that case, it is likely that every sum and every product will be unique, aside from the redundancy from commutativity. That is, if *R* >> *N*, it is likely that |*A*+*A*| and |*A***A*| will both equal *N*(*N*+1)/2. Things get more interesting when *N* is closer to *R*.

The Erdős-Szemerédi problem is a problem in combinatorics, looking for deterministic lower bounds. But it seems natural to consider a probabilistic extension. Instead of asking about **lower bounds** on |*A*+*A*| and |*A***A*| you could ask for the **distribution** on |*A*+*A*| and |*A***A*| when the sets *A* are drawn from some probability distribution.

If the set *A* is drawn from a continuous distribution, then |*A*+*A*| and |*A***A*| both equal *N*(*N*+1)/2 with probability 1. Only careful choices, ones that would happen randomly with probability zero, could prevent the sums and products from being unique, modulo commutativity, as in the case *R* >> *N* above.

If the set *A* is an arithmetic sequence then |*A*+*A*| is small and |*A***A*| is large, and the opposite holds if *A* is a geometric sequence. So it might be interesting to look at the correlation of |*A*+*A*| and |*A***A*| when *A* comes from a discrete distribution, such as choosing *N* integers uniformly from [1, *R*] when *N*/*R* is not too small.
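A quick check of these two extremes (my own illustration, using the first ten terms of each kind of sequence):

```python
def sizes(xs):
    # Return (|A+A|, |A*A|) for a list of numbers xs.
    s = {a + b for a in xs for b in xs}
    p = {a * b for a in xs for b in xs}
    return len(s), len(p)

arithmetic = list(range(1, 11))        # 1, 2, ..., 10
geometric = [2**k for k in range(10)]  # 1, 2, 4, ..., 512

print(sizes(arithmetic))  # (19, 42): few sums, many products
print(sizes(geometric))   # (55, 19): many sums, few products
```

For the arithmetic sequence the sums collapse onto the 19 values from 2 to 20, while the products fill out a multiplication table; for the geometric sequence the roles reverse, and the sums hit the maximum possible count of *N*(*N*+1)/2 = 55.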

A normal distribution has the familiar **bell curve** shape. A Laplace distribution, also known as a double exponential distribution, is pointed in the middle, like a pole holding up a **circus tent**.

A normal distribution has very **thin tails**, i.e. probability density drops very rapidly as you move further from the middle, like exp(-x²). The Laplace distribution has **moderate tails** [1], going to zero like exp(-|x|).

So normal and Laplace distributions are qualitatively very different, both in the center and in the tails. So why would you want to replace one by the other?

The normal distribution is convenient to use in mathematical statistics. Whether it is realistic in application depends on context, but it’s convenient and conventional. The Laplace distribution is convenient and conventional in differential privacy. There’s no need to ask whether it is realistic because Laplace noise is added deliberately; the distribution assumption is exactly correct by construction. (See this post for details.)

When mathematical statistics and differential privacy combine, it could be convenient to “approximate” a Laplace distribution by a normal distribution [2].

So if you wanted to replace a Laplace distribution with a normal distribution, which one would you choose? Both distributions are symmetric about their means, so it’s natural to pick the means to be the same. So without loss of generality, we’ll assume both distributions have mean 0. The question then becomes how to choose the scale parameters.

You could just set the two scale parameters to be the same, but that’s similar to the **Greek letter fallacy**, assuming two parameters have the same meaning just because they have the same symbol. Because the two distributions have different tail weights, their scale parameters serve different functions.

One way to replace a Laplace distribution with a normal would be to pick the scale parameter of the normal so that the two distributions’ **quantiles match** at a chosen probability level. For example, you might want both distributions to have 95% of their probability mass in the same interval.

I’ve written before about how to solve for scale parameters given two quantiles. We find two quantiles of the Laplace distribution, then use the method in that post to find the corresponding normal distribution scale (standard deviation).

The Laplace distribution with scale *s* has density

*f*(*x*) = exp(-|*x*|/*s*) / (2*s*)

If we want to solve for the quantile *x* such that Prob(*X* ≤ *x*) = *p*, we have

*x* = –*s* log(2 – 2*p*).

Using the formula derived in the previously mentioned post,

σ = *x* / Φ^{-1}(*p*)

where Φ is the cumulative distribution function of the standard normal.
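For example, here is a sketch of matching the 95th percentile of a Laplace distribution with scale 1, using Python’s `statistics.NormalDist` for Φ^{-1}:

```python
from math import log
from statistics import NormalDist

s, p = 1.0, 0.95
x = -s * log(2 - 2*p)                # Laplace 95th percentile, about 2.303
sigma = x / NormalDist().inv_cdf(p)  # normal sigma with the same 95th percentile
print(round(sigma, 4))  # 1.3999
```

Note that the matched σ is about 1.4, noticeably larger than the Laplace scale of 1: the normal’s thin tails force a larger scale parameter to push the same amount of mass out to the same point.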


[1] The normal distribution is the canonical example of a thin-tailed distribution, while exponential tails are conventionally the boundary between thick and thin. “Thick tailed” and “thin tailed” are often taken to mean thicker than exponential and thinner than exponential respectively.

[2] You could use a Gaussian mechanism rather than a Laplace mechanism for similar reasons, but this makes the differential privacy theory more complicated. Rather than working with ε-differential privacy you have to work with (ε, δ)-differential privacy. The latter is messier and harder to interpret.

The law specifically mentions **probabilistic identifiers**.

“Probabilistic identifier” means the identification of a consumer or a device to a degree of certainty of more probable than not based on any categories of personal information included in, or similar to, the categories enumerated in the definition of personal information.

So anything that gives you better than a 50% chance of guessing personal data fields [1]. That could be **really** broad. For example, the fact that you’re reading this blog post makes it “more probable than not” that you have a college degree, and education is one of the categories mentioned in the law.

What are these enumerated categories of personal information mentioned above? They start out specific:

Identifiers such as a real name, alias, postal address, unique personal identifier, online identifier Internet Protocol address, email address, …

but then they get more vague:

purchasing or consuming histories or tendencies … interaction with an Internet Web site … professional or employment-related information.

And in addition to the vague categories are “any categories … similar to” these.

What is the significance of a probabilistic identifier? That’s hard to say. A large part of the CCPA is devoted to definitions, and some of these definitions don’t seem to be used. Maybe this is a consequence of the bill being rushed to a vote in order to avoid a ballot initiative. Maybe the definitions were included in case they’re needed in a future amended version of the law.

The CCPA seems to give probabilistic identifiers the same status as deterministic identifiers:

“Unique identifier” or “Unique personal identifier” means … or probabilistic identifiers that can be used to identify a particular consumer or device.

That seems odd. Data that can give you a “more probable than not” guess at someone’s “purchasing or consuming histories” hardly seems like a unique identifier.

It’s interesting that the CCPA says “a particular consumer or **device**.” That would seem to include browser fingerprinting. That could be a big deal. Identifying devices, but not directly people, is a major industry.

[1] Nothing in this blog post is legal advice. I’m not a lawyer and I don’t give legal advice. I enjoy working with lawyers because the division of labor is clear: they do law and I do math.

Web sites may not be able to identify *you*, but they can probably identify *your web browser*. Your browser sends a lot of information back to web servers, and the combination of settings for a particular browser are usually unique. To get an idea what information we’re talking about, you could take a look at Device Info.

One of the pieces of information that gets sent back to servers is the list of fonts installed on your device. Your font fingerprint is just one component of your browser fingerprint, but it’s an easy component to understand.

Various applications install their own fonts. If you’ve installed Microsoft Office, for example, that would be evident in your list of fonts. However, Office is ubiquitous, so that information doesn’t go very far to identifying you. Maybe the *lack* of fonts installed with Office would be more conspicuous.

Less common software goes further toward identifying you. For example, I have Mathematica on one of my computers, and along with it Mathematica fonts, something that’s not too common.

Then there are the fonts you’ve installed deliberately, many of them free. Maybe you’ve installed fonts to support various languages, such as Hebrew and Greek fonts for Bible scholars. Maybe you have dyslexia and have installed fonts that are easier for you to read. Maybe you’ve installed a font because it contains technical symbols you need for your work. These increase the chances that your combination of fonts is unique.

Maybe you have purchased a few commercial fonts. One of the reasons to buy fonts is to have something that doesn’t look so common. This also makes the font fingerprint of your browser less common.

Servers have to query whether particular fonts are installed. An obscure font would go a long way toward identifying you. But if a font is truly obscure, the server isn’t likely to ask whether it’s installed. So the greatest privacy risk comes from moderately uncommon fonts [1].

Your browser fingerprint is probably unique, unless you have a brand new device, or you’ve made a deliberate effort to keep your fingerprint generic. So while a site may not know who you are, it can recognize whether you’ve been there before, and customize the content you receive accordingly. Maybe you’ve looked at the same product three times without buying, and so you get a nudge to encourage you to go ahead and buy.

(It’ll be interesting to see what effect the California Consumer Privacy Act has on this when it goes into effect the beginning of next year.)

Since there’s more than enough information to uniquely identify your browser, fingerprints are robust to changes. Installing a new font won’t throw advertisers off your trail. If you still have the same monitor size, same geographic location, etc. then advertisers can update your fingerprint information to include the new font. You might even get an advertisement for more fonts if they infer you’re a typography aficionado.

[1] Except for a **spearphishing** attack. A server might check for the presence of fonts that, although uncommon in general, are likely to be on the target’s computer. For example, if someone wanted to detect my browser in particular, they know I have Mathematica fonts installed because I said so above. And they might guess that I have installed the Greek and Hebrew fonts I mentioned. They might also look for obscure fonts I’ve mentioned in the blog, such as Unifont, Andika, and Inconsolata.

Physicist Lev Landau used to play a mental game with Soviet license plates [1]. The plates had the form of two digits, a dash, two more digits, and some letters.

His game was to apply high school math operators to the numbers on both side of the dash so that the dash could be replaced by an equal sign. For example, given the license plate 44-74, one solution would be

4! + 4 = 7*4.

Note that we can insert operators, such as !, +, and *, but not more digits.

Is there a solution for every possible license plate? That depends on what class of operators you allow.

You could trivialize the game by applying the fractional part operation {*x*} to both sides since the fractional part of an integer is zero. You could disallow the fractional part operator on the grounds that it’s not clearly a high school math operation, or just disallow it because it makes the game uninteresting.

It turns out there’s a universal solution, starting with the observation that

√(*n* + 1) = sec arctan √*n*.

If one side is greater than the other by one, the formula above gives an immediate solution. For example, a solution for the license plate 89-88 would be

√89 = sec arctan √88.

If the difference is greater, the formula can be applied repeatedly. For example, we could apply the formula twice to obtain

√(*n* + 2) = sec arctan √(*n* + 1) = sec arctan sec arctan √*n*

and so a possible solution for 35-37 is

sec arctan sec arctan √35 = √37.
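We can verify this solution numerically (a quick check, not part of the original game):

```python
from math import sqrt, atan, cos

sec = lambda x: 1 / cos(x)

# sec(arctan(t)) = sqrt(1 + t**2), so each application bumps n up by one:
# sqrt(35) -> sqrt(36) -> sqrt(37).
lhs = sec(atan(sec(atan(sqrt(35)))))
print(abs(lhs - sqrt(37)) < 1e-12)  # True
```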

Given that a solution is always possible, we can make the game more interesting by looking for the simplest solution. We have some intuitive idea what this means. With our example of 44-74, the first solution

4! + 4 = 7*4

is simpler than the universal solution

sec arctan sec arctan … √44 = √74

which would require applying secant and arctangent each 30 times.

The Kolmogorov complexity of an object is the length of the shortest computer program to produce the object. We could compute the Kolmogorov complexity of the functions applied to the digits on each side in order to measure how complex a solution is.

To make this precise, we need to specify what our programming language is, and that’s not as easy as it sounds. If we think of mathematics notation as a programming language, do we want to count ! as one character and arctan as 6 characters? That doesn’t seem right. If we wrote “arctan” as “atn” we would use fewer characters without producing a different solution.

To make things more objective, we could look at the length of actual computer programs rather than imagining math notation to be a programming language. Say we choose Python. Then here are a couple functions that compute our two solutions for license plate 44-74.

```python
from math import sqrt, cos, atan

def f():
    sec = lambda x: 1/cos(x)
    y = sqrt(44)
    for _ in range(30):
        y = sec(atan(y))
    return y

def g():
    return sqrt(74)
```

We could measure the complexity of our functions `f` and `g` by counting the number of characters in each one. But there still are difficulties.

What about the import statement? It should count toward the length of `f` because it uses everything imported, but `g` could have used a shorter statement that only imported `sqrt`. More fundamentally, are we cheating by even importing a library?

Furthermore, the two functions above don’t produce exactly the same output due to limited precision. We could imagine that our imported functions are infinitely precise, but then we’re not really using Python but rather an idealized version of Python.

And what about the loop? That introduced new digits, 3 and 0, and so violates the rules of Landau’s game. So should we unroll the loop before calculating complexity?

Kolmogorov complexity is a very useful concept, but it’s more of a thought experiment than something you can practically compute. We can imagine the shortest program to compute something, but we can seldom know that we’ve actually found such a program. All we can know in practice is upper bounds.

In theory you can enumerate all Turing machines of a given length, or all Python programs of a given length, and find the shortest one that does a given task [2], but the list grows exponentially with length.

However, it is possible to compute the length of particular programs if we deal with some of the complications mentioned above. We could make Landau’s game a two-player game by seeing who could come up with the simpler solution in a fixed amount of time.

If we allow sine and degree in our set of operators, there’s a universal solution due to B. S. Gorobets. For *n* ≥ 6, *n*! is a multiple of 360, and so

sin (*n*!)° = 0.

And if *n* is less than 6, its two-digit representation begins with zero, so we can multiply the digits to get zero.
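A quick numerical check of Gorobets’s observation (illustrative only):

```python
from math import factorial, sin, radians

# For n >= 6, n! is a multiple of 360, so sin(n! degrees) = 0.
for n in range(6, 11):
    assert factorial(n) % 360 == 0
    print(sin(radians(factorial(n) % 360)))  # 0.0 each time
```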

If we disallow transcendental functions, we block Gorobets’ trick and we have functions whose length we can objectively measure in a programming language.

[1] Harun Šiljak. Soviet Street Mathematics: Landau’s License Plate Game. Mathematical Intelligencer, https://doi.org/10.1007/s00283-017-9743-9

[2] In addition to having to test an exponentially growing set of programs, there’s the problem of knowing if even one program will complete.

If you run a program and see it complete, then you know it completes! If you don’t see it complete, it might just need more time, or it might never finish. The halting problem theorem says that there’s no program that can tell whether another arbitrary program will terminate. You still might be able to determine whether a particular program terminates, but you might not.

Photo credit CC BY-SA 3.0 via Wikimedia Commons

(The Russian caption reads: rear license plate of the USSR, 1936 pattern.)

In addition to chronological blog posts, there are about 200 “pages” on the site, mostly technical notes. These include the most popular content on the site.

***

The following is a rambling discussion of things that have changed over the history of the blog.

I started out posting a little more than once a day on average. I’ve slowed down a little, but I still wrote about 300 posts over the last year. My posts are getting a little longer. They’re still fairly short, but a little longer than they used to be.

I’ve also started using more footnotes. That has worked out well, letting me write for two audiences at the same time: those who want a high-level overview and those who want technical details.

More people are reading from mobile devices, so I moved to a responsive design. I also use more SVG images because they look great across a variety of platforms.

HTTPS has become more common, and I switched to HTTPS a while back.

Google nearly killed RSS when it killed the most popular RSS reader. I announce most of my posts on Twitter because Twitter has sorta taken the place of RSS. You can still subscribe by RSS, and many do, but not as many as before the demise of Google Reader.

I’ve been writing more about privacy and security lately because I’m doing more work with privacy and security.

I was reluctant to start blogging, but I gave it a chance after several people exhorted me to do it. I was especially hesitant to allow comments, but the signal to noise ratio has been a pleasant surprise (aside from the millions of comments blocked by my spam filter). I’ve learned a lot from your feedback. I’ve met a lot of friends and clients through the blog and am very glad I started it.

This post will look at economics and power laws in the context of password cracking. Increasing the cost of verifying a password does not impact attackers proportionately because the attackers have a power law on their side.

In an earlier post I explained how key stretching increases the time required to verify a password. The idea is that if authentication systems can afford to spend more computation time verifying a password, they can force attackers to spend more time as well.

The stretching algorithm increases the time required to test a single salt and password combination by a factor of *r* where *r* is a parameter in the algorithm. But increasing the amount of work per instance by a factor of *r* does not decrease the number of users an attacker can compromise by a factor of *r*.

If an attacker is going through a list of compromised passwords, the number of user passwords recovered for a fixed amount of computing power decreases by less than *r* because passwords follow a power law distribution [1]. The *n*th most common password appears with frequency proportional to *n*^{–s} for some constant *s*. Studies suggest *s* is between 0.5 and 0.9. Let’s say *s* = 0.7.

Suppose an attacker has a sorted list of common passwords. He starts with the most common password and tries it for each user. Then he tries the next most common password, and keeps going until he runs out of resources. If you increase the difficulty of testing each password by *r*, the attacker will have to stop after trying fewer passwords.

Suppose the attacker had time to test the 1000 most common passwords, but then you set *r* to 1000, increasing the resources required for each individual test by 1000. Now the attacker can only test the most common password. The most common password accounts for about 1/24 of the users of the 1000 most common passwords. So the number of **passwords** the attacker can test went down by 1000, but the number of **users** affected only went down by 24.
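The 1/24 figure comes from normalizing the power law over the 1000 most common passwords (a back-of-the-envelope computation with the assumed exponent *s* = 0.7):

```python
# Fraction of users (among those using one of the 1000 most common
# passwords) accounted for by the single most common password,
# assuming frequency proportional to n**(-s) with s = 0.7.
s = 0.7
total = sum(n**-s for n in range(1, 1001))
print(1 / total)  # roughly 0.042, i.e. about 1 user in 24
```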

It would take 1000 times longer if the attacker still tested the same number of passwords. But the attacker hits diminishing returns after the first password tested. By testing passwords in order of popularity, his return on resources does not need to go down by a factor of 1000. In this example it only went down by a factor of 24.

How much the hacker’s ROI decreases depends on several factors. It never decreases by more than *r*, and often it decreases less.

[1] Any mention of power laws brings out two opposite reactions, one enthusiastic and one pedantic. The naive reaction is “Cool! Another example of how power laws are everywhere!” The pedantic reaction is “Few things actually follow a power law. When people claim that data follow a power law distribution the data usually follow a slightly different distribution.”

The pedantic reaction is technically correct but not very helpful. The enthusiastic reaction is correct with a little nuance: many things approximately follow a power law distribution over a significant range. Nothing follows a power law distribution exactly and everywhere, but the same is true of any probability distribution.

It doesn’t matter for this post whether data exactly follow a power law. This post is just a sort of back-of-the-envelope calculation. The power law distribution has some empirical basis in this context, and it’s a convenient form for calculations.

Yesterday my wife and I watched our daughter’s junior varsity soccer game. Several statistical questions came to mind.

Larger schools tend to have better sports teams. If the talent distributions of a large school and a small school are the same, the larger school will have a better team because its players are the best from a larger population. If one school is twice as big as another, its team may consist of the top 5% while the other school’s team consists of its top 10%.
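This size effect is easy to see in a toy simulation (my own illustration, with made-up numbers: normally distributed talent and teams of the top 11 players):

```python
import random
import statistics

random.seed(42)

def team_avg(school_size, team_size=11):
    # Average "talent" of the top team_size players out of school_size.
    talents = [random.gauss(0, 1) for _ in range(school_size)]
    return statistics.mean(sorted(talents)[-team_size:])

big = statistics.mean(team_avg(2000) for _ in range(200))
small = statistics.mean(team_avg(1000) for _ in range(200))
print(big > small)  # the larger school fields the stronger team on average
```

Swapping `random.gauss` for a thick-tailed distribution would be one way to explore the questions below.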

Does size benefit a school’s top (varsity) team or its second (junior varsity) team more? Would you expect more variability in varsity or junior varsity scores? Does your answer depend on whether you assume a thin-tailed (e.g. normal) or thick tailed (e.g. Cauchy) distribution on talent?

What if two schools have the same size, but one has a better athletic program, say due to better coaching. Suppose this shifts the center of the talent distribution. Does such a shift benefit varsity or junior varsity teams more?

Suppose both the varsity and junior varsity teams from two schools are playing each other, as was the case last night. If you know the outcome of the junior varsity game, how much should that influence your prediction of the outcome of the varsity game? Has anyone looked at this, either as an abstract model or by analyzing actual scores?

**Related post**: Probability of winning the World Series


I won’t put a popup on my site cajoling you to subscribe, nor will I ask you to sign up before letting you read something I’ve written. My newsletter is just for people who want to read it.

I’ve been sending out a monthly newsletter for about four years. I’ve never sent out more than one email a month, though I might some day if I have a good reason to.

Each newsletter highlights a few of the most popular posts since the previous newsletter. I used to say a little about my business each month, or maybe something personal, but I quit when that got to be repetitive. I may go back to doing that occasionally if I have something to say that I believe you’ll find interesting.

Storing passwords in plain text is the least secure thing a server could do. If the password list is leaked, someone knows all the passwords with no effort.

A better approach would be to run passwords through a secure hash function before storing them. The server would never store a password per se, only its hashed value. When someone enters a password, the server would check whether the hash of the password matches the stored hashed value. If the list of hashed values is leaked, it will take some effort to recover the original passwords, though maybe not much effort, as I wrote about in my previous post.

The next step in sophistication is to append a random value, called a **salt**, to the password before applying the hash function. The server would store each user’s salt value and the hash of the password with the salt added.

Now if the user has a naive password like “qwerty” you couldn’t crack the password by doing a simple Google search. You can find the hash value of “qwerty” by searching, but not the hash value of qwerty + random salt, because although the former is common the latter is probably unique. You could still crack the password if the hash is insecure, but it would take a little effort.
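Salting might be sketched like this. The details are illustrative assumptions: the salt length, the concatenation order, and the use of SHA-256 are all choices a real system would make differently, typically with a purpose-built scheme like bcrypt or Argon2.

```python
import hashlib
import secrets

def hash_with_salt(password, salt=None):
    # Generate a fresh random salt unless one was supplied
    if salt is None:
        salt = secrets.token_hex(16)
    digest = hashlib.sha256((password + salt).encode('utf-8')).hexdigest()
    # The server stores the salt alongside the salted hash
    return salt, digest

salt1, digest1 = hash_with_salt("qwerty")
salt2, digest2 = hash_with_salt("qwerty")
# Same password, different salts, so the digests (almost certainly) differ,
# and a precomputed table of common-password hashes is useless
print(digest1 != digest2)
```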

Suppose an attacker has a list of salt values and corresponding hash values for salt + password. They could guess passwords, hashing each guess with a user’s salt value, to see whether any hash values match. They would start with the most common passwords and probably get some matches.

**Key stretching** is a way to make this brute force search more time consuming by requiring repeated hashing. In the following stretching algorithm, *p* is the password, *s* is the salt, *h* is the hash function, and || means string concatenation.

Now the time required to test each password has been multiplied by *r*. The idea is to pick a value of *r* that is affordable for legitimate use but expensive for attacks.
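Here is one common stretching variant as a sketch: start with x₀ = *h*(*p* || *s*) and repeatedly hash, folding the password and salt back in each round. The round structure and the use of SHA-256 are assumptions for illustration; real systems use standardized constructions such as PBKDF2, bcrypt, scrypt, or Argon2 rather than a hand-rolled loop.

```python
import hashlib

def stretch(p, s, r):
    # x_0 = h(p || s), then r more rounds of x_i = h(x_{i-1} || p || s)
    x = hashlib.sha256((p + s).encode('utf-8')).digest()
    for _ in range(r):
        x = hashlib.sha256(x + (p + s).encode('utf-8')).digest()
    return x.hex()

# Each password guess now costs r + 1 hash evaluations instead of one
print(stretch("qwerty", "a3f1c2d9", 100000))
```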

Hash functions are not reversible in general. MD5 is a 128-bit hash, and so it maps any string, no matter how long, into 128 bits. Obviously if you hash all strings of length, say, 129 bits, some of them have to hash to the same value. (Another win for the pigeonhole principle.)

And yet in practice it may be possible to reverse a hash, given some context. In the context of short, weak passwords, it may be possible to determine what text was hashed to create a particular value. All it may take is a quick web search [2]. For example, let’s take an insecure password like “password” and run it through a bit of Python to compute its MD5 hash.

```python
>>> import hashlib
>>> def foo(x):
...     print(hashlib.md5(x.encode('utf-8')).hexdigest())
...
>>> foo("password")
5f4dcc3b5aa765d61d8327deb882cf99
```

A Google search returns over 21,000 hits on `5f4dcc3b5aa765d61d8327deb882cf99`, and the second one shows that it’s the hash of “password”.

If I try a slightly less naive password like “p@$$word” I still get several hits, indicating that the hash is part of a list of compromised passwords.

Not every hash of a short string can be reversed this way. If I hash my business phone number, for example, I get something that does not yet appear in Google searches. The hash could still be reversed easily, but it would take more than just a Google search.

See the next post for how salting can thwart web search attacks.

[1] MD5 was invented in 1992 and the first flaw was discovered in 1996. Experts started moving away from MD5 at that time. More flaws were discovered over time. In 2010 the CMU Software Engineering Institute declared that MD5 was “cryptographically broken and unsuitable for further use.” It’s still being used, though not as much. MD5 still makes a useful checksum, though it’s not cryptographically secure.

[2] The same would be true of a secure hash of an insecure password. For example, SHA-256 is better than MD5, but you could look up the SHA-256 hash values of terrible passwords too. But MD5 hashes are easier to search on. In my experiments, I got far fewer hits searching on SHA-256 hash values.

If you’re trying to reverse hash values on your own computer without a web search, the MD5 value would require much less computation than the SHA-256 value.

There’s a branch of math that studies how people wait in line: **queueing theory**. It’s not just about people standing in line, but about any system with clients and servers.

An introduction to queueing theory, about what you’d learn in one or two lectures, is very valuable for understanding how the world around you works. But after a brief introduction, you quickly hit diminishing returns on your effort, unless you want to specialize in the theory or you have a particular problem you’re interested in.

The simplest models in queueing theory have nice closed-form solutions. They’re designed to be easy to understand, not to be highly realistic in application. They make simplifying assumptions, and minor variations no longer have nice solutions. But they make very useful toy models.

One of the insights from basic queueing theory is that **variability** in arrival and service times **matters a great deal**. This is unlike revenue, for example, where you can estimate total revenue simply by multiplying the estimated number of customers by the estimated average revenue per customer.

A cashier who can just barely keep up with a stream of customers **on average** is in big trouble. If customers arrived at regular intervals and all required the same amount of time to serve, then if things are OK on average, they’re OK. But if there’s not much slack, and there’s the slightest bit of variability in arrival or service times, long lines will form.

I work through a specific example here. The example also shows that when a system is near capacity, adding a new server can make a huge improvement. In that example, going from one server to two doesn’t just cut the waiting time in half, but reduces it by a couple orders of magnitude.
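The one-server-versus-two effect can be computed with the Erlang C formula for an M/M/c queue. The arrival and service rates below are illustrative assumptions, chosen so that a single server can just barely keep up.

```python
from math import factorial

def erlang_c(c, a):
    # Probability an arriving customer has to wait in an M/M/c queue,
    # where a = lambda/mu is the offered load and c the number of servers
    top = a**c / factorial(c)
    bottom = (1 - a/c) * sum(a**k / factorial(k) for k in range(c)) + top
    return top / bottom

def mean_wait(lam, mu, c):
    # Expected time in queue; requires lam < c * mu for stability
    return erlang_c(c, lam / mu) / (c * mu - lam)

lam, mu = 5.8, 6.0  # arrivals per hour; service rate per server per hour
print(mean_wait(lam, mu, 1) * 60)  # one server: about 290 minutes
print(mean_wait(lam, mu, 2) * 60)  # two servers: about 3 minutes
```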

As I mentioned at the beginning, queueing theory doesn’t just apply to shoppers and cashiers. It also applies to diverse situations such as cars entering a highway and page requests hitting a web server. See more on the latter here.

Here’s a closeup of the octave keys.

Paul says he found his instrument in an antique shop. It has no serial number or manufacturer information. If you know anything about this model, please leave a comment below.

The question of whether the series ∑ 1/(*n*³ sin²(*n*)) converges or diverges has stirred up a lot of discussion. Sam Walters has been part of this discussion and pointed to a paper that says this is known as the Flint Hills series.
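For concreteness, here’s a quick look at partial sums of the Flint Hills series ∑ 1/(*n*³ sin²(*n*)). The partial sums grow slowly but jump whenever *n* lands unusually close to a multiple of π; the computation says nothing about convergence, it just shows the series’ behavior.

```python
from math import sin

def flint_hills_partial(N):
    # Partial sum of 1 / (n^3 * sin(n)^2) for n = 1 .. N
    return sum(1.0 / (n**3 * sin(n)**2) for n in range(1, N + 1))

for N in (10, 100, 1000, 10000):
    print(N, flint_hills_partial(N))
```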

My first thought was to replace the sine term with a random variable and see what happens, because the values of *n* mod 2π act a lot like a random sequence. To be precise, the sequence is ergodic.

If *X* is a uniform random variable on the interval [0, 2π], then the random variable *Y* = 1/sin²(*X*) is fat tailed, so fat that it has no finite expected value. If *Y* had a finite expected value E[*Y*], then one might expect the Flint Hills sum to be roughly E[*Y*] ζ(3), i.e. the Flint Hills sum with each 1/sin² term replaced by E[*Y*]. But since E[*Y*] is infinite, this suggests that the Flint Hills series diverges.

Of course this doesn’t settle the question because the values of *n* mod 2π are not actually random. They simply act as if they were random *in some contexts*. For example, if you wanted to use them as if they were random values in order to do Monte Carlo integration, they would work just fine.
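For example, treating *n* mod 2π as uniform samples to estimate the average of sin²(*x*) over [0, 2π], which is exactly 1/2, gives a good answer:

```python
from math import pi, sin

def average_sin_squared(N):
    # Use n mod 2*pi as if they were uniform samples on [0, 2*pi]
    return sum(sin(n % (2 * pi))**2 for n in range(1, N + 1)) / N

print(average_sin_squared(100_000))  # close to 0.5
```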

The question is whether the values act sufficiently like random values in the context of the Flint Hills problem. This is not clear, and is a problem in number theory rather than in probability. (Though number theory and probability are surprisingly interconnected.)

Sam Walters suggested considering a variation on the Flint Hills problem where we replace sin(*n*) with sin(2πθ*n*) for some constant θ, the original problem corresponding to θ = 1/(2π). I suspect the series diverges for some (almost all?) values of θ. That is, unless you pick θ very carefully, number theory won’t rescue the series from divergence.

When I started this blog I routed my RSS feed through Feedburner, and now Feedburner is no longer working for my site.

If you **subscribed by RSS**, please check the feed URL. It should be

https://www.johndcook.com/blog/feed

which was previously forwarded to a Feedburner URL. If you subscribe directly to the feed with my domain, it should work. I had some problems with it earlier, but I believe they’ve been fixed.

Also, if you **subscribed by email**, you went through Feedburner. You won’t receive any more posts by email that way. I’m considering possible replacements.

In the meantime, you can sign up for monthly highlights from the blog via email. Each month I highlight the three or four most popular posts. Once in a while I may add a paragraph about what I’m up to. Short and sweet.

You can also find me on Twitter.

**Update**: Maybe Feedburner is working after all. I’m not sure what’s going on. Still, it seems Feedburner is hanging on by a thread. The steps above are a good way to prepare for when it goes away.