
Let’s look at the question in base *b*, so we can address binary and base 10 at the same time. The **prime number theorem** says that for large *n*, the number of primes less than *b*^{n} is approximately

*b*^{n} / *n* log *b*

and so the probability that an *n*-digit number base *b* is prime is roughly 1 / *n* log *b*. If you select an *odd* *n*-digit number, you double your chances.

If you generate odd *n*-digit numbers at random, how long would it be until you find a prime? You might get lucky and find one on your first try. Or you might have extraordinarily bad luck and find a composite every time until you give up.

The number of tries until you find a prime is a **geometric random variable** with

*p* = 2 / *n* log *b*.

The expected value is 1 / *p*.

So, for example, if you wanted to find a 256-bit prime, you would need to test on average 256 log(2) / 2 = 89 candidates. And if you wanted to find a 100-digit prime, you’d need to test on average 100 log(10) / 2 = 115 candidates.

You could get more accurate bounds by using a refined variation of the prime number theorem. Also, we implicitly looked at the problem of finding a number with *at most* *n* digits rather than exactly *n* digits. Neither of these would make much difference to our final result. To back up this claim, let’s compare our estimate to a **simulation** that actually looks for primes.

The histogram above was based on finding 10,000 primes, each with 100 digits, and counting how many candidates were evaluated each time. The plot matches a geometric distribution well. The mean from the simulation was 113.5, close to the 115 estimated above.
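Here is a sketch of how such a simulation might look, using SymPy's `isprime`. (This is a smaller run than the 10,000-prime experiment above, and the function name is mine.)

```python
import random

from sympy import isprime

def count_candidates(digits):
    """Count how many random odd candidates we test before hitting a prime."""
    tries = 0
    while True:
        tries += 1
        # a random odd number with exactly `digits` digits
        n = random.randrange(10**(digits - 1) + 1, 10**digits, 2)
        if isprime(n):
            return tries

# The sample mean should be near digits * log(10) / 2,
# e.g. about 58 candidates for 50-digit primes.
sample = [count_candidates(50) for _ in range(300)]
print(sum(sample) / len(sample))
```

The counts returned by `count_candidates` are draws from (approximately) a geometric distribution, so their histogram should look like the one above.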

For **cryptographic applications**, you often don’t want just any prime. You might want a safe prime or a prime satisfying some other property, in which case you’ll need to test more primes. But your probability of success will likely still be geometrically distributed.

It’s plausible that for large *p*, *p* being prime and 2*p* + 1 being prime are independent events, at least approximately, and a numerical experiment suggests that’s true. I generated 10,000 primes, each 100 digits long, and 48 were safe primes, about what you’d expect from theory.

**Schnorr groups**, named after Claus Schnorr, are multiplicative subgroups of a finite field. They are designed for cryptographic application, namely as a setting for hard discrete logarithm problems.

The application of Schnorr groups to **Schnorr signatures** was patented, but the patent ran out in 2008.

There has been a proposal to include Schnorr signatures in Bitcoin, but it has not been accepted at this point. (The proposal would use Schnorr *signatures* but over an elliptic curve rather than a Schnorr *group*.)

Pick a large prime *p*. Then the integers mod *p* form a finite field. The non-zero elements form an Abelian group of order *p* – 1 under multiplication. Since *p* – 1 is even, it is composite, and so it can be factored as *qr* where *q* is prime. We want *q* to be large as well because it will be the order of our Schnorr group.

Pick any *h* such that *h*^{r} is not congruent to 1 mod *p*. Then *g* = *h*^{r} is the generator of our Schnorr group. Why does *g* have order *q*? A simple calculation shows

*g*^{q} = *h*^{rq} = *h*^{p – 1} = 1 (mod *p*)

where the last step follows from Fermat’s little theorem. This shows that the order of *g* divides *q*, and since *q* is prime, the order of *g* is either 1 or *q*. But by construction *g* is not congruent to 1 (mod *p*), so its order is *q*.

Here’s a toy example using small numbers. Let *p* = 23, *q* = 11 and *r* = 2. Then we can pick *h* to be any number that is not a root of 1 (mod 23), i.e. any number except 1 or 22. Say we pick *h* = 7. Then *g* = 49 = 3 (mod 23).

Here’s a Python script to verify our claims and print out the elements of our Schnorr group.

    from sympy import isprime

    p, q, r, h = 23, 11, 2, 7

    assert(isprime(p))
    assert(isprime(q))
    assert(p - 1 == q*r)

    g = h**r % p
    assert(g != 1)
    assert(g**q % p == 1)

    for i in range(q):
        print(g**i % p)

This shows that our group elements are {1, 3, 9, 4, 12, 13, 16, 2, 6, 18, 8}.

In theory we could use the same script with much larger numbers, but in that case we wouldn’t want to print out the elements. Also, the code would take forever. We naively computed *a*^{b} (mod *m*) by computing *a*^{b} first, then taking the remainder modulo *m*. It is far more efficient to reduce modulo *m* along the way. That’s what Python’s three-argument `pow` function does.

Here’s a revised version that runs quickly with large numbers.

    from sympy import isprime

    p = 115792089237316195423570985008687907852837564279074904382605163141518161494337
    q = 341948486974166000522343609283189
    r = 338624364920977752681389262317185522840540224
    h = 3141592653589793238462643383279502884197

    assert(isprime(p))
    assert(isprime(q))
    assert(p - 1 == q*r)

    g = pow(h, r, p)
    assert(g != 1)
    assert(pow(g, q, p) == 1)

Photo of Claus Schnorr by Konrad Jacobs CC BY-SA 2.0 de

*y*² = *x*³ + 7

over a finite field to be described shortly.

Addition on elliptic curves in the plane is defined geometrically in terms of where lines intercept the curve. We won’t go into the geometry here, except to say that it boils down to a set of equations involving real numbers. But we’re not working over real numbers; we’re working over a finite field.

The idea is to take the equations motivated by the geometry in the plane then use those equations to define addition when you’re not working over real numbers but over a different field. In the case of secp256k1, the field is the finite field of integers mod *p* where

*p* = 2^{256} – 2^{32} – 977

Here *p* was chosen to be relatively close to 2^{256}. It’s not the largest prime less than 2^{256}; there are a lot of primes between *p* and 2^{256}. Other factors also went into the choice of *p*. Note that we’re not working in the integers mod *p* per se; we’re working in an Abelian group whose addition law is defined by an elliptic curve over the integers mod *p*.

Next, we pick a base point *g* on the elliptic curve. The standard defining secp256k1 says that *g* is

0279BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798

in “compressed form” or

0479BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

in “uncompressed form”.

The base point is a specially chosen point on the elliptic curve, and so it is a *pair* of numbers mod *p*, not a single number. How do you extract *x* and *y* from these compressed or uncompressed forms?

The compressed form only gives *x* and you’re supposed to solve for *y*. The uncompressed form gives you both *x* and *y*. However, the numbers are slightly encoded. In compressed form, the string starts with either “02” or “03” and the rest of the string is the hexadecimal representation of *x*. There will be two values of *y* satisfying

*y*² = *x*³ + 7 mod *p*

and the “02” or “03” tells you which one to pick. If the compressed form starts with 02, pick the root that is even, i.e. whose least significant bit is 0. And if the compressed form starts with 03, pick the root that is odd. (The two roots add to *p*, and *p* is odd, so one of the roots will be even and one will be odd.)

The uncompressed form will always start with 04. After this follow the hex representations of *x* and *y* concatenated together.
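Here is a sketch of the decompression step in Python. Since *p* ≡ 3 (mod 4), a modular square root of *w* can be computed as *w*^{(p+1)/4}. The function name is mine.

```python
p = 2**256 - 2**32 - 977

def decompress(prefix, x):
    """Recover y from a compressed point: prefix byte 0x02 or 0x03, then x."""
    w = (pow(x, 3, p) + 7) % p       # w = x^3 + 7 mod p
    y = pow(w, (p + 1) // 4, p)      # a square root, valid since p = 3 mod 4
    assert y * y % p == w            # fails if w is not a quadratic residue
    if (y % 2 == 0) != (prefix == 0x02):
        y = p - y                    # switch to the other root
    return y
```

Applied with prefix 0x02 to the *x* coordinate of the base point, this recovers the base point’s *y* coordinate.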

In either case we have

x = 79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798

and

y = 483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

We can verify this with a little Python code:

    x = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
    y = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
    p = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F

    assert((y*y - x*x*x - 7) % p == 0)

Starting with our base point *g*, define *k**g* to be *g* added to itself *k* times. Note again that the sense of “addition” here is addition in the elliptic curve, not addition in the field of integers mod *p*. **The key to elliptic curve cryptography** is that *kg* can be computed efficiently, but solving for *k* starting from the product *kg* cannot. You can compute *kg* using the fast exponentiation algorithm, but solving for *k* requires computing discrete logarithms. (This is the **ECDLP**: Elliptic Curve Discrete Logarithm Problem.)
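Here is a sketch of how *kg* can be computed efficiently, using the double-and-add form of fast exponentiation with textbook affine-coordinate formulas. (This is for illustration only; it is not constant-time and so not suitable for real cryptographic use.)

```python
p = 2**256 - 2**32 - 977

def ec_add(P, Q):
    """Add two points on y^2 = x^3 + 7 over GF(p); None is the identity."""
    if P is None: return Q
    if Q is None: return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                  # P + (-P) = identity
    if P == Q:
        m = 3 * x1 * x1 * pow(2 * y1, -1, p) % p     # tangent slope
    else:
        m = (y2 - y1) * pow(x2 - x1, -1, p) % p      # secant slope
    x3 = (m * m - x1 - x2) % p
    y3 = (m * (x1 - x3) - y1) % p
    return (x3, y3)

def scalar_mult(k, P):
    """Compute kP with O(log k) group operations (double-and-add)."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
gy = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

# every multiple of the base point stays on the curve
x, y = scalar_mult(12345, (gx, gy))
assert (y * y - x**3 - 7) % p == 0
```

Computing `scalar_mult(k, g)` takes a few hundred point operations even for 256-bit *k*; recovering *k* from the result is the discrete logarithm problem.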

Why is this called “exponentiation” and not “multiplication”? Arithmetic on the elliptic curve is commutative, and in commutative (i.e. Abelian) groups the group operation is usually denoted as addition. And repeated addition is called multiplication.

But in general group theory, the group operation is denoted as multiplication, and repeated application of the group operation is called exponentiation. It’s conventional to use the general term “exponentiation” even though over an Abelian group it makes more sense to call it multiplication.

You undo exponentiation by taking logarithms, so the process of solving for *k* is called the discrete logarithm problem. The security of elliptic curve cryptography depends on the difficulty of computing discrete logarithms.

The best algorithms for solving the discrete logarithm problem in a group of size *n* currently require O(√*n*) operations. How big is *n* in our case?

The base point *g* was chosen to have a large order, and in fact its order is approximately 2^{256}. Specifically, the order of *g* written in hexadecimal is

*n* = FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141.

This means that we get approximately 256/2 = 128 bits of security because √(2^{256}) = 2^{128}.

Currying is a simple but useful idea. (It’s named after logician Haskell Curry [1] and has nothing to do with spicy cuisine.) If you have a function of two variables, you can think of it as a function of one variable that returns a function of one variable. So starting with a function *f*(*x*, *y*), we can think of this as a function that takes a number *x* and returns a function *f*(*x*, -) of *y*. The dash is a placeholder as in this recent post.

If you’ve taken calculus then you saw this in the context of Fubini’s theorem:
∫∫ *f*(*x*, *y*) *dx* *dy* = ∫ ( ∫ *f*(*x*, *y*) *dx* ) *dy*

To integrate the function of two variables *f*(*x*, *y*), you can temporarily fix *y* and integrate the remaining function of *x*. This gives you a number, the value of an integral, for each *y*, so it’s a function of *y*. Integrate that function, and you have the value of the original function of two variables.

The first time you see this you may think it’s a definition, but it’s not. You can define the integral on the left directly, and it will equal the result of the two nested integrations on the right. Or at least the two sides will often be equal. The conditions on Fubini’s theorem tell you exactly when the two sides are equal.

A more sophisticated version of the same trick occurs in partial differential equations. If you have an evolution equation, a PDE for a function of one time variable and several space variables, you can think of it as an ODE via currying. For each time value *t*, you get a function of the spatial variables. So you can think of your solution as a path in a space of functions. The spatial derivatives specify an operator on that space of functions.

(I’m glossing over details here because spelling everything out would take a lot of writing, and might obscure the big idea, which is relevant for this post. If you’d like the full story, you can see, for example, my graduate advisor’s book. It was out of print when I studied it, but now it’s a cheap Dover paperback.)

In the Haskell programming language (also named after Haskell Curry) you get currying for free. In fact, there’s no other way to express a function of two variables. For example, suppose you want to implement the function *f*(*x*, *y*) = *x*² + *y*.

Prelude> f x y = x**2 + y

Then Haskell thinks of this as a function of one variable (i.e. *x*), that returns a function of one variable (i.e. *f*(*x*, -)) which itself returns a number (i.e. *f*(*x*, *y*)). You can see this by asking the REPL for the type of `f`:

    Prelude> :info f
    f :: Floating a => a -> a -> a

Technically, Haskell, just like lambda calculus, only has functions of one variable. You could create a product datatype consisting of a pair of variables and have your function take that as an argument, but it’s still a function of one variable, though that variable is not atomic.
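The same idea can be spelled out in Python, where currying has to be done by hand (or with `functools.partial`). This is just an illustrative sketch; the `curry` helper is mine.

```python
from functools import partial

def f(x, y):
    return x**2 + y

def curry(g):
    """Turn a two-argument function into a function returning a function."""
    return lambda x: lambda y: g(x, y)

# curry(f)(3) plays the role of f(3, -)
assert curry(f)(3)(4) == f(3, 4) == 13
assert partial(f, 3)(4) == 13
```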

The way you’d formalize currying in category theory is to say that the following is a natural isomorphism:
Hom(*A* × *B*, *C*) ≅ Hom(*A*, Hom(*B*, *C*))

For more on what Hom means, see this post.

- Weakening the requirements for a group (especially section on semigroups)
- Categorical products
- Beta reduction

[1] In concordance with Stigler’s law of eponymy, currying was not introduced by Curry but by Gottlob Frege. It was then developed by Moses Schönfinkel and developed further by Haskell Curry.

Bojan Šavrič, Tom Patterson, and Bernhard Jenny have developed a new map projection called **Equal Earth** that nicely balances several competing criteria, including aesthetics.

The Equal Earth projection satisfies the following mathematical criteria:

- **Equal area**: The area of a region on the globe is proportional to the area of its projection.
- **Straight parallels**: Lines of equal latitude project onto horizontal lines.
- **Regularly distributed meridians**: At a given latitude, the spacing between lines of longitude is equal.
- **Bilateral symmetry**: The projection is symmetric with respect to the *x*-axis (equator) and the *y*-axis (prime meridian).

Here are the equations for the Equal Earth projection:

*x* = 2√3 λ cos θ / (3(9*a*_{4}θ^{8} + 7*a*_{3}θ^{6} + 3*a*_{2}θ^{2} + *a*_{1}))

*y* = *a*_{4}θ^{9} + *a*_{3}θ^{7} + *a*_{2}θ^{3} + *a*_{1}θ

sin θ = (√3 / 2) sin φ

Here λ and φ are longitude and latitude respectively. The parametric latitude θ is introduced for convenience and is simply a rescaling of latitude φ.

The parameters *a*_{i} are given below.

    a_1 =  1.340264
    a_2 = -0.081106
    a_3 =  0.000893
    a_4 =  0.003796
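As a sketch (assuming the projection equations from the paper; the function and variable names are mine), the projection can be coded directly:

```python
import math

A1, A2, A3, A4 = 1.340264, -0.081106, 0.000893, 0.003796

def equal_earth(lam, phi):
    """Project longitude lam and latitude phi (radians) to map coordinates."""
    # parametric latitude: a rescaling of phi
    theta = math.asin(math.sqrt(3) / 2 * math.sin(phi))
    x = 2 * math.sqrt(3) * lam * math.cos(theta) / (
        3 * (9 * A4 * theta**8 + 7 * A3 * theta**6 + 3 * A2 * theta**2 + A1))
    y = A4 * theta**9 + A3 * theta**7 + A2 * theta**3 + A1 * theta
    return x, y
```

Whatever the exact coefficients, the bilateral symmetry property is easy to check numerically: negating λ negates *x*, and negating φ negates *y*.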

You can find more in their paper: Bojan Šavrič, Tom Patterson, and Bernhard Jenny. The Equal Earth map projection. *International Journal of Geographical Information Science*, 2018. https://doi.org/10.1080/13658816.2018.1504949


The **butterfly effect** is the semi-serious claim that a butterfly flapping its wings can cause a tornado half way around the world. It’s a poetic way of saying that some systems show **sensitive dependence on initial conditions**, that the slightest change now can make an enormous difference later. Often this comes up in the context of **nonlinear**, **chaotic** systems but it isn’t limited to that. I give an example here of a linear differential equation whose solutions start out essentially the same but eventually diverge completely.

Once you think about these things for a while, you start to see nonlinearity and potential butterfly effects everywhere. There are tipping points everywhere waiting to be tipped!

But **a butterfly flapping its wings usually has no effect**, even in sensitive or chaotic systems. You might even say *especially* in sensitive or chaotic systems.

Sensitive systems are not always and everywhere sensitive to everything. They are sensitive in particular ways under particular circumstances, and can otherwise be quite resistant to influence. Not only may a system be insensitive to butterflies, it may even be relatively insensitive to **raging bulls**. The raging bull may have little more long-term effect than a butterfly. This is what I’m calling the other butterfly effect.

In some ways chaotic systems are *less* sensitive to change than other systems. And this can be a good thing. Resistance to control also means resistance to unwanted tampering. Chaotic or stochastic systems can have a sort of self-healing property. They are more stable than more orderly systems, though in a different way.

The lesson that many people draw from their first exposure to complex systems is that there are high leverage points, if only you can find them and manipulate them. They want to insert a butterfly at just the right time and place to bring about a desired outcome. Instead, we should humbly evaluate to what extent it is possible to steer complex systems at all. We should evaluate what aspects can be steered and how well they can be steered. The most effective intervention may not come from tweaking the inputs but from changing the structure of the system.

The diagram below summarizes the rules:

Imagine yourself in the position of Sam Kass and Karen Bryla designing the game. You first try adding one extra move, but it turns out that’s not possible.

The main property of Rock, Paper, Scissors is that no player has a winning strategy. That implies **you can only add an even number of extra moves**, keeping the total number of moves odd. That way, for every move one player makes, the other player has an equal number of winning and losing moves. Otherwise some moves would be better and others worse. So you can’t add just one move, say Lizard. You have to add two (or four, or six, …).

How many ways could you assign rules to an extension of Rock, Paper, Scissors adding two more moves?

Number the possible moves 1 through 5. We will make a table of which moves beat which, with +1 in the (*i*, *j*) position if move *i* beats move *j* and -1 if *j* beats *i*. There will be zeros on the diagonal since it’s a tie if both players make the same move.

Let’s start with the original Rock, Paper, Scissors. In order for this game to not have a winning strategy, the table must be filled in as below, with the only option being to set *a* = 1 or *a* = -1.

    |---+----+----+----|
    |   |  1 |  2 |  3 |
    |---+----+----+----|
    | 1 |  0 |  a | -a |
    |---+----+----+----|
    | 2 | -a |  0 |  a |
    |---+----+----+----|
    | 3 |  a | -a |  0 |
    |---+----+----+----|

If 1, 2, and 3 correspond to Rock, Paper, and Scissors, then *a* = 1 according to the usual rules, but we’ll allow the possibility that the usual rules are reversed. (If you don’t like that, just set *a* = 1).

Next we fill in the rest of the moves. The table must be **skew-symmetric**, i.e. the (*i*, *j*) element must be the negative of the (*j*, *i*) element, because if (*i*, *j*) is a winning move then (*j*, *i*) is a losing move and vice versa. Also, **the rows and columns must sum to zero**. Together these requirements greatly reduce the number of possibilities.

    |---+----+----+----+----+----|
    |   |  1 |  2 |  3 |  4 |  5 |
    |---+----+----+----+----+----|
    | 1 |  0 |  a | -a |  b | -b |
    |---+----+----+----+----+----|
    | 2 | -a |  0 |  a |  c | -c |
    |---+----+----+----+----+----|
    | 3 |  a | -a |  0 |  d | -d |
    |---+----+----+----+----+----|
    | 4 | -b | -c | -d |  0 |  d |
    |---+----+----+----+----+----|
    | 5 |  b |  c |  d | -d |  0 |
    |---+----+----+----+----+----|

The values of *a*, *b*, and *c* may each be chosen independently to be 1 or -1. If *b* + *c* = 0, then *d* can be chosen freely. Otherwise *b* and *c* have the same sign, and *d* must have the opposite sign. So all together there are 12 possibilities (6 if you insist *a* = 1). These are listed below.

    |---+---+---+---|
    | a | b | c | d |
    |---+---+---+---|
    | + | + | + | - |
    | + | + | - | + |
    | + | + | - | - |
    | + | - | + | + |
    | + | - | + | - |
    | + | - | - | + |
    | - | + | + | - |
    | - | + | - | + |
    | - | + | - | - |
    | - | - | + | + |
    | - | - | + | - |
    | - | - | - | + |
    |---+---+---+---|
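The count of 12 can be checked by brute force, enumerating every skew-symmetric ±1 table that extends the 3×3 block and has zero row sums. (This is a quick sanity-check script of mine, not from the original analysis.)

```python
import itertools

def balanced_extensions():
    """Count 5x5 skew-symmetric +/-1 tables with zero diagonal and zero row
    sums whose upper-left 3x3 block is (possibly reversed) Rock, Paper,
    Scissors."""
    count = 0
    for a in (1, -1):
        # seven free entries above the diagonal, outside the 3x3 block
        for b14, b15, c24, c25, d34, d35, e45 in itertools.product((1, -1), repeat=7):
            M = [[   0,    a,   -a,  b14,  b15],
                 [  -a,    0,    a,  c24,  c25],
                 [   a,   -a,    0,  d34,  d35],
                 [-b14, -c24, -d34,    0,  e45],
                 [-b15, -c25, -d35, -e45,    0]]
            if all(sum(row) == 0 for row in M):
                count += 1
    return count

print(balanced_extensions())  # 12
```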

The version of the rules used on The Big Bang Theory corresponds to the second row of the table above: *a* = *b* = *d* = 1 and *c* = -1.

Here’s another way to count the number of possible designs. Suppose we start with traditional Rock, Paper, Scissors, corresponding to *a* = 1 in the notation above. Now let’s add the rules for Lizard. We can pick any two things from {Rock, Paper, Scissors, Spock} and say that Lizard beats them, and then the other two things must beat Lizard. There are 6 ways to choose 2 things from a set of 4.

Once we’ve decided the rules for Lizard, we have no choice regarding Spock. Spock’s rules must be the opposite of Lizard’s rules in order to balance everything out. If we decide Lizard beats Rock, for example, then Rock must beat Spock so two things beat Rock and Rock beats two things.

If we’re willing to consider reversing the usual rules of Rock, Paper, Scissors, i.e. setting *a* = -1, then there are 6 more possibilities, for a total of 12.

By the way, we can see from the approach above how to add more moves. If we wanted to add Stephen Hawking and Velociraptor to our game, then we have 20 choices: we choose 3 things out of 6 for Hawking to beat, and the rest of the rules are determined by these choices. Velociraptor has to be the anti-Hawking. If we decide that Hawking beats Paper, Lizard, and Spock, then we’d get the rules in the diagram below.

You might want to design the game so that for any subset of three moves you have a game with no winning strategy. Here’s an example of why. If the subset (1, 2, 4) is a fair game, then *a* = *c*. But if the subset (2, 3, 4) is a fair game, then *a* = *-c*. So one of the two games must have a winning strategy.

The first graph above was made with **Graphviz** using the code below.

    digraph rock {
        node [fontname = "helvetica"]
        "Rock" -> "Scissors"
        "Rock" -> "Lizard"
        "Paper" -> "Rock"
        "Paper" -> "Spock"
        "Scissors" -> "Paper"
        "Scissors" -> "Lizard"
        "Lizard" -> "Spock"
        "Lizard" -> "Paper"
        "Spock" -> "Rock"
        "Spock" -> "Scissors"
        layout = circo
    }

Save the code to a file, say `rock.gv`, then run the command

dot -Tpng rock.gv > rock.png

to produce a PNG file.

To make the above definition precise, we need to say what kinds of objects and what kinds of functions we’re talking about. That is, we specify a category *C* that the objects belong to, and the functions are the morphisms of that category [1]. For example, in the context of groups, Hom(*A*, *B*) would be the set of group homomorphisms [2] between *A* and *B*, but in the context of continuous groups (Lie groups), we would restrict Hom(*A*, *B*) to be *continuous* group homomorphisms.

To emphasize that Hom refers to a set of morphisms in a particular category, sometimes you’ll see the name of the category as a subscript, as in Hom_{C}(*A*, *B*). Sometimes you’ll see the name of the category as a replacement for Hom, as in *C*(*A*, *B*). You may also see the name of the category in square brackets, as in [*C*](*A*, *B*).

So far Hom has been a set, but you’ll also see Hom as a functor. This is where the notation takes a little more interpretation. You may see a capital H with objects as superscripts or subscripts:
*H*^{A} and *H*_{A}

You may also see a lower case h instead. And I’ll use the name of the category instead of “Hom” just to throw that variation in too.

I’ll explain what these mean and give a mnemonic for remembering which is which.

The dash in Hom(*A*, -) and Hom(-, *A*) means “insert your object here.” That is, Hom(*A*, *–*) takes an object *B* and maps it to the set of morphisms Hom(*A*, *B*), and Hom(-, *A*) takes an object *B* to Hom(*B*, *A*).

So Hom(*A*, -) and Hom(-, *A*) each take an object in the category *C* to a set of morphisms, i.e. an object in the category Set. But that’s only half of what it takes to be a functor. A functor not only maps objects in one category to objects in another category, it also maps morphisms in one category to morphisms in the other. (And it does so in a way that the maps of objects and the maps of morphisms interact coherently.)

Where do the morphisms come in? We’ve established what *H*^{A} and *H*_{A} do to objects, but what do they do to morphisms?

Suppose we have a morphism *f*: *X* → *Y* in the category *C*. How does *H*^{A} take *f* to a function from Hom(*A*, *X*) to Hom(*A*, *Y*)? For each *g* in Hom(*A*, *X*), the composition *fg* is an element of Hom(*A*, *Y*).

Next suppose *f*: *Y* → *X* (note the reversed order) and a function *g* in Hom(*X*, *A*). Now the composition *gf* is an element of Hom(*Y*, *A*). Note that before, we applied *f* after *g*; here we pre-compose with *f*, i.e. apply *f* before *g*.

Aside from the notation, what’s going on is very simple. If you have a function from *A* to *X* and a function from *X* to *Y*, then the composition of the two is a function from *A* to *Y*. Similarly, if you have a function from *Y* to *X* and a function from *X* to *A*, the composition is a function from *Y* to *A*.
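In Python terms (an illustrative sketch, with names of my choosing), the two actions on morphisms are just post-composition and pre-composition:

```python
def hom_covariant(f):
    """H^A on morphisms: send f: X -> Y to the map g |-> f∘g."""
    return lambda g: lambda a: f(g(a))

def hom_contravariant(f):
    """H_A on morphisms: send f: Y -> X to the map g |-> g∘f."""
    return lambda g: lambda y: g(f(y))

double = lambda x: 2 * x      # f: X -> Y
add_one = lambda a: a + 1     # g: A -> X

# post-compose: (f∘g)(10) = double(add_one(10)) = 22
assert hom_covariant(double)(add_one)(10) == 22

# pre-compose: (g∘f)(10) = add_one(double(10)) = 21
assert hom_contravariant(double)(add_one)(10) == 21
```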

Note that *H*^{A} is a covariant functor, but *H*_{A} is contravariant. More on covariant vs. contravariant below.

How can you keep *H*^{A} and *H*_{A} straight? It’s common to use superscript notation *Y*^{X} to indicate the set of functions from the superscript object *X* to the base object *Y*. You may have seen this before even if you don’t think you have.

The notation *Y*² denotes the product of *Y* with itself, such as *R*² for the plane, i.e. pairs of real numbers. A pair of things in *Y* is a function from a two-element set to *Y*. You could think of (*y*_{1}, *y*_{2}) as the result of mapping the set {1, 2} into *Y*.

You may also have seen the notation 2^{X} for the power set of *X*, i.e. the set of all subsets of *X*. You could think of the power set of *X* being the set of maps from *X* to the Boolean values (true, false) where an element *x* is mapped to true if and only if *x* is in that particular subset.

The notation using *H* or *h* with a superscript *A* stands for Hom(*A*, -), i.e. morphisms *out of* *A*, which is consistent with the usage described above. And so the other is the other, i.e. a subscript *A* stands for Hom(-, *A*), i.e. morphisms *into* *A*.

(Unfortunately, some authors use the opposite of the convention used here, which blows the whole mnemonic away. But the convention used here is most common.)

We’re close to being able to state one of the most important theorems in category theory, the Yoneda lemma. (Lemmas often turn out to be more useful and better known than the theorem they were first used to prove.) So while we’re in the neighborhood, we’ll take a look.

A corollary of the Yoneda lemma says
Hom(*H*^{A}, *H*^{B}) = Hom(*B*, *A*)

The meaning of “Hom” is different on the left and right because we’re looking at morphisms between different kinds of objects. On the right we have sets of morphisms in our category *C* as above. The left side takes more explanation.

What kind of things are *H*^{A} and *H*^{B}? They are functors from *C* to Set. The class of functors between two categories forms a category itself. The functors are the objects in this new category, and natural transformations are the morphisms. So Hom in this context is the set of natural transformations between the two functors.

What kind of things are *H*_{A} and *H*_{B}? They are contravariant functors from *C* to Set, and contravariant functors also form a category. However, contemporary category theory doesn’t like to speak of contravariant functors, preferring to only work with covariant functors, and leaving the term “covariant” implicit. So rather than saying *H*_{A} and *H*_{B} are contravariant functors on *C*, most contemporary writers would say they are (covariant) functors on a new category *C*^{op}, where “op” stands for opposite. That is, *C*^{op} is a category with all the same objects as *C*, but with all the arrows turned around. Every morphism from *A* to *B* in *C* corresponds to a morphism from *B* to *A* in *C*^{op}.

[1] Morphisms are a generalization of functions. Often morphisms are functions, but they might not be. But still, they have to be things that compose like functions.

[2] The name Hom is a shortened form of “homomorphism.” Different areas of math have different names for structure-preserving functions, and category theory wanted to have one name for them all. It used “Hom” as an allusion to what structure-preserving functions are called in group theory. Similarly, “morphism” is also a shortened form of “homomorphism.” I suppose the goal was to use names reminiscent of group theory, but different enough to remind the reader that the category theory counterparts are distinct.

Incidentally, “homomorphism” comes from the Greek roots meaning “similar” and “shape.” A homomorphism is a function between similar objects (objects in the same category) that preserves structure (shape).

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We used Bayesian methods to create **adaptive clinical trial designs**, such as clinical trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the **first patient** in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?

Thanks to the magic of **Bayes theorem**, you don’t have to have an *ad hoc* rule that says “Treat the first patient this way, then turn on the Bayesian machine to determine how to treat the next patient.” No, you use Bayes theorem from beginning to end. There’s no need to handle the first patient differently because expert opinion is already there, captured in the form of prior distributions (and the structure of the probability model).

Each patient is treated according to **all information available** at the time. At first all available information is prior information. After you have data on one patient, most of the information you have is still prior information, but Bayes’ theorem updates this prior information with your lone observation. As more data becomes available, the Bayesian machine incorporates it all, automatically shifting weight away from the prior and toward the data.
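As a toy illustration of this shift from prior to data (a sketch with a made-up success-probability model, not a trial design): with a conjugate beta prior on a response rate, each observation simply updates the beta parameters, and the posterior mean drifts from the prior mean toward the observed rate.

```python
def update(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior (conjugacy)."""
    return alpha + successes, beta + failures

def mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Expert opinion encoded as a Beta(2, 8) prior: expected response rate 0.2.
a, b = 2, 8
print(mean(a, b))   # 0.2

# Observe 7 responses out of 10 patients; the posterior mean moves toward 0.7.
a, b = update(a, b, 7, 3)
print(mean(a, b))   # 0.45
```

With more data the prior’s influence shrinks automatically: after 70 responses in 100 patients the posterior mean would be 72/110 ≈ 0.65, close to the observed rate.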

**The cold start problem for business applications is easier** than the cold start problem for clinical trials. First of all, most business applications don’t have the potential to cost people their lives. Second, business applications typically have fewer competing criteria to balance.

What if you’re not sure where to draw your prior information? Bayes can handle that too. You can use **Bayesian model selection** or **Bayesian model averaging** to determine which source (or weighting of sources) best fits the new data as it comes in.

Once you’ve decided to use a Bayesian approach, there’s still plenty of work to do, but the Bayesian approach provides scaffolding for that work, a framework for moving forward.

I thought that the graph was tracing out one octagon, then another, then another. But that’s not what it’s doing at all. You can see for yourself by going to the exponential sum page and clicking the animate link.

The curve is being traced out in back-and-forth strokes. It’s drawing thin wedges, not octagons. You can get a better picture of this by looking at the real and imaginary parts (*x* and *y* coordinates) separately.

It would be hard to look at the first plot and imagine the second, or to look at the second plot and imagine the first.

where *m*, *d*, and *y* are the month, day, and (two-digit) year. The *n*/*d* term is simply *n*, an integer, when *d* = 1, and so it has no effect because exp(2π*i* *n*) = 1. Here’s today’s sum, the plot formed by the partial sums above.

It’s possible in principle to work out exactly where each point on the curve is. In order to do so, you’d need to compute

Since we can use the Pythagorean theorem to compute sine from cosine and vice versa, we only need to know how to compute the sine of five degrees.

OK, so how do we do that? The sine and cosine of 45 degrees and of 30 degrees are well known, and we can use the trig identity for sin(*A* – *B*) to find that the sine of 15 degrees is (√3 – 1)/2√2. So how do we get from the sine of 15 degrees to the sine of 5 degrees? We can borrow a trick from medieval astronomers. They did something similar to our work above, then solved the trig identity
sin 3θ = 3 sin θ – 4 sin³ θ

as a cubic equation in sin θ to find the sine of 1/3 of an angle with known sine.
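Before handing anything to Mathematica, we can verify numerically that the sine of 5 degrees does satisfy the cubic 3*x* – 4*x*³ = sin 15°, and that our expression for sin 15° is correct:

```python
import math

s5 = math.sin(math.radians(5))
s15 = math.sin(math.radians(15))

# sine of 15 degrees from the subtraction formula: (sqrt(3) - 1)/(2 sqrt(2))
assert abs(s15 - (math.sqrt(3) - 1) / (2 * math.sqrt(2))) < 1e-12

# triple angle identity sin 3θ = 3 sin θ - 4 sin³ θ, with θ = 5 degrees
assert abs(3 * s5 - 4 * s5**3 - s15) < 1e-12
```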

If we ask Mathematica to solve this equation for us

Solve[3 x -4 x^3 == Sin[15 Degree], x]

we get three roots, and it’s the middle one that we’re after. (We can evaluate the roots numerically and tell that the middle one has the right magnitude.) So Mathematica says that the sine of 5 degrees is

Unfortunately, this is the end of the line for Mathematica. We know that the expression above must be real, but Mathematica is unable to find its real part. I tried running `FullSimplify` on this expression, and aborted evaluation after it ran for 20 minutes or so. If anybody wants to pick up where Mathematica left off, let me know your solution. It may be possible to continue with Mathematica by giving the software some help, or it may be easier to continue with pencil and paper.

**Update**: Based on the comments below, it seems this is about as far as we can go. This was a surprise to me. We’re solving a cubic equation, of degree less than 5, so the roots are expressible in terms of radicals. But that doesn’t mean we can eliminate the appearance of the *i*’s, even though we know the imaginary part is zero.

If we start the sequence with (1, 1) we get the sequence

1, 1, 2, 3, 5, 4, 3, 7, 5, 6, 11, 17, 14, 31, 15, 23, 19, 21, 20, 41, 61, 51, 56, 107, 163, 135, 149, 142, 97, 239, 168, 37, 41, 39, 40, 79, 17, 48, 13, 61, 37, 49, 43, 46, 89, 45, 67, 56, 41, 97, 69, 83, 76, 53, 43, 48, 13, …

We know it will keep repeating because 48 followed by 13 has occurred before, and the “memory” of the sequence only stretches back two steps.

If we start with (6, 3) we get

6, 3, 3, 3, …

It’s easy to see that if the sequence ever produces the same value *n* twice in a row, then it will produce only that value from then on: *n* + *n* = 2*n* is obviously not prime, its smallest prime factor is 2, and dividing 2*n* by 2 gives *n* back again.

Conway believes that this sequence will always get stuck in a cycle but this has not been proven.

Here’s Python code to try out Conway’s conjecture:

```python
from sympy import isprime, primefactors

def subprime(a, b):
    seq = [a, b]
    repeated = False
    while not repeated:
        s = seq[-1] + seq[-2]
        if not isprime(s):
            # divide by the smallest prime factor
            s //= primefactors(s)[0]
        repeated = check_repeat(seq, s)
        seq.append(s)
    return seq

def check_repeat(seq, s):
    if not s in seq:
        return False
    if seq[-1] == s:
        return True
    # all previous occurrences of the last element in seq
    indices = [i for i, x in enumerate(seq) if x == seq[-1]]
    for i in indices[:-1]:
        if seq[i+1] == s:
            return True
    return False
```

I’ve verified by brute force that the sequence repeats for all starting values less than 1000.

**Related posts**:

The previous post looked at the distribution of eigenvalues for very general random matrices. In this post we will look at the eigenvalues of matrices with more structure. Fill an *n* by *n* matrix *A* with values drawn from a standard normal distribution and let *M* be the average of *A* and its transpose, i.e. *M* = ½(*A* + *A*^{T}). The eigenvalues will all be real because *M* is symmetric.

This is called a “Gaussian Orthogonal Ensemble” or GOE. The term is standard but a little misleading because such matrices may not be orthogonal.

The joint probability distribution for the eigenvalues of *M* has three terms: a constant term that we will ignore, an exponential term, and a product term. (Source)

The exponential term is the same as in a multivariate normal distribution. This says the probability density drops off quickly as you go away from the origin, i.e. it’s rare for eigenvalues to be too big. The product term is the product of the distances between each pair of eigenvalues. This says it’s also rare for eigenvalues to be very close together.

(The missing constant to turn the expression above from a proportionality to an equation is whatever it has to be for the right side to integrate to 1. When trying to qualitatively understand a probability density, it usually helps to ignore proportionality constants. They are determined by the rest of the density expression, and they’re often complicated.)

If eigenvalues are neither tightly clumped together, nor too far apart, we’d expect that the distance between them has a distribution with a hump away from zero, and a tail that decays quickly. We will demonstrate this with a simulation, then give an exact distribution.

The following Python code simulates 2 by 2 Gaussian matrices.

```python
import matplotlib.pyplot as plt
import numpy as np

n = 2
reps = 1000

diffs = np.zeros(reps)
for r in range(reps):
    A = np.random.normal(scale=n**-0.5, size=(n, n))
    M = 0.5*(A + A.T)
    w = np.linalg.eigvalsh(M)   # eigenvalues of a symmetric matrix, sorted
    diffs[r] = abs(w[1] - w[0])

plt.hist(diffs, bins=int(reps**0.5))
plt.show()
```

This produced the following histogram:

The exact probability distribution is *p*(*s*) = *s* exp(-*s*²/4)/2. This result is known as “Wigner’s surmise.”
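We can check the stated density against simulation. One subtlety, if I’ve done the scaling arithmetic right: *p*(*s*) = *s* exp(-*s*²/4)/2 corresponds to filling *A* with standard normal entries, with no 1/√*n* scaling; with the scaling in the code above the gaps shrink by a factor of √2. A sketch under that assumption:

```python
import numpy as np

# Monte Carlo check of p(s) = (s/2) exp(-s^2/4) for 2x2 symmetrized
# matrices with standard normal entries (no 1/sqrt(n) scaling).
# The mean of this density is sqrt(pi), about 1.7725.
rng = np.random.default_rng(20230101)
reps = 100_000
gaps = np.empty(reps)
for r in range(reps):
    A = rng.standard_normal((2, 2))
    M = 0.5*(A + A.T)
    w = np.linalg.eigvalsh(M)
    gaps[r] = w[1] - w[0]

assert abs(gaps.mean() - np.pi**0.5) < 0.03
```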

The probability distribution need not be normal. It can be any distribution, shifted to have mean 0 and scaled to have variance 1/*n*, provided the tail of the distribution isn’t so thick that the variance doesn’t exist. If you don’t scale the variance to 1/*n* you still get a circle, just not a *unit* circle.

We’ll illustrate the circular law with a uniform distribution. The uniform distribution has mean 1/2 and variance 1/12, so we will subtract 1/2 and multiply each entry by √(12/*n*).

Here’s our Python code:

```python
import matplotlib.pyplot as plt
import numpy as np

n = 100

# Shift the uniform entries to mean 0 and scale the variance to 1/n
M = np.random.random((n, n)) - 0.5
M *= (12/n)**0.5

w = np.linalg.eigvals(M)
plt.scatter(np.real(w), np.imag(w))
plt.gca().set_aspect(1)
plt.show()
```

When *n*=100 we get the following plot.

When *n*=1000 we can see the disk filling in more.

Note that the points are symmetric about the real axis. All the entries of *M* are real, so its characteristic polynomial has all real coefficients, and so its roots come in conjugate pairs. If we randomly generated complex entries for *M* we would not have such symmetry.

**Related post**: Fat-tailed random matrices

- The most important skill in software development (742)
- Four hours of concentration (329)
- Why programmers are not paid in proportion to productivity (278)
- Golden powers are nearly integers (277)
- Why isn’t everything normally distributed? (250)
- Kalman filters and functional programming (215)
- Planets evenly spaced on log scale (168)
- Nobody will steal your idea (166)
- How Thomas Jefferson prepared for meetings (159)
- Coming full circle (157)

Everyone knows what a **million** is. Any larger than that and you may run into trouble. The next large number name, **billion**, means 10^{9} to some people and 10^{12} to others.

When writing for those who take one billion to mean 10^{9}, your audience may or may not know that one **trillion** is 10^{12} according to that convention. You cannot count on people knowing the names of larger numbers: **quadrillion**, **quintillion**, **sextillion**, etc.

To support this last claim, let’s look at the frequency of large number names according to Google’s Ngram viewer. Presumably the less often a word is used, the less likely people are to recognize it.

Here’s a bar chart of the data from 2005, plotted on a logarithmic scale. The chart has a roughly linear slope, which means the frequency of the words drops exponentially. **Million** is used more than 24,000 times as often as **septillion**.

Here’s the raw data.

| Name        | Frequency    |
|-------------|--------------|
| Million     | 0.0096086564 |
| Billion     | 0.0030243298 |
| Trillion    | 0.0002595139 |
| Quadrillion | 0.0000074383 |
| Quintillion | 0.0000016745 |
| Sextillion  | 0.0000006676 |
| Septillion  | 0.0000003970 |

There is a consistency to these names, though they are not well known. Using the convention that these names refer to powers of 1000, the pattern is

[Latin prefix for *n*] + llion = 1000^{n+1}

So for example, billion is 1000^{2+1} because bi- is the Latin prefix for 2. A trillion is 1000^{3+1} because tri- corresponds to 3, and so forth. A duodecillion is 1000^{13} because the Latin prefix duodec- corresponds to 12.

| Name              | Meaning |
|-------------------|---------|
| Billion           | 10^{9}  |
| Trillion          | 10^{12} |
| Quadrillion       | 10^{15} |
| Quintillion       | 10^{18} |
| Sextillion        | 10^{21} |
| Septillion        | 10^{24} |
| Octillion         | 10^{27} |
| Nonillion         | 10^{30} |
| Decillion         | 10^{33} |
| Undecillion       | 10^{36} |
| Duodecillion      | 10^{39} |
| Tredecillion      | 10^{42} |
| Quattuordecillion | 10^{45} |
| Quindecillion     | 10^{48} |
| Sexdecillion      | 10^{51} |
| Septendecillion   | 10^{54} |
| Octodecillion     | 10^{57} |
| Novemdecillion    | 10^{60} |
| Vigintillion      | 10^{63} |
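The naming rule is easy to encode. Here’s a toy Python sketch; the prefix spellings in the dictionary are mine:

```python
# Sketch of the rule: [Latin prefix for n] + llion = 1000^(n+1)
latin = {"bi": 2, "tri": 3, "quadri": 4, "quinti": 5, "sexti": 6,
         "septi": 7, "octi": 8, "noni": 9, "deci": 10, "duodeci": 12}

def llion(n):
    # Value of the -llion whose Latin prefix corresponds to n
    return 1000**(n + 1)

assert llion(latin["bi"]) == 10**9        # billion
assert llion(latin["tri"]) == 10**12      # trillion
assert llion(latin["duodeci"]) == 10**39  # duodecillion
```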

Here’s the Mathematica code that was used to create the plot above.

```mathematica
BarChart[
  {0.0096086564, 0.0030243298, 0.0002595139, 0.0000074383,
   0.0000016745, 0.0000006676, 0.0000003970},
  ChartLabels -> {"million", "billion", "trillion", "quadrillion",
                  "quintillion", "sextillion", "septillion"},
  ScalingFunctions -> "Log",
  ChartStyle -> "DarkRainbow"
]
```

Since 2^{32} = 82595524*52 + 48, there are 82595525 ways to generate the numbers 0 through 47, but only 82595524 ways to generate the numbers 48 through 51. As Melissa points out in her post, the bias here is small, but the bias increases linearly with the size of our “deck.” To clarify, it is the *relative* bias that increases, not the *absolute* bias.

Suppose you want to generate a number between 0 and *M* – 1, where *M* is less than 2^{32} and not a power of 2. There will be 1 + ⌊2^{32}/*M*⌋ ways to generate a 0, but only ⌊2^{32}/*M*⌋ ways to generate *M* – 1. The *difference* in the probability of generating 0 versus generating *M* – 1 is 1/2^{32}, independent of *M*. However, the *ratio* of the two probabilities is 1 + 1/⌊2^{32}/*M*⌋, or approximately 1 + *M*/2^{32}.

As *M* increases, both the favored and unfavored outcomes become increasingly rare, but the ratio of their respective probabilities approaches 2.
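The counting argument is easy to verify exhaustively if we shrink the word size. Here’s a sketch using 8 bits instead of 32 (my down-scaled example, not from the original):

```python
# Generate 0..M-1 by reducing an 8-bit word mod M and count outcomes.
bits, M = 8, 52
counts = [0] * M
for x in range(2**bits):
    counts[x % M] += 1

# 256 = 4*52 + 48: outcomes 0..47 occur 5 times, outcomes 48..51 occur 4.
assert counts[0] == 2**bits // M + 1    # favored outcome
assert counts[M - 1] == 2**bits // M    # unfavored outcome
# Absolute difference is always one count; the ratio is 1 + 1/(2^bits // M).
assert counts[0] / counts[M - 1] == 1 + 1/(2**bits // M)
```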

Whether this makes any practical difference depends on your context. In general, the need for random number generator quality increases with the volume of random numbers needed.

Under conventional assumptions, the sample size necessary to detect a difference between two very small probabilities *p*_{1} and *p*_{2} is approximately

8(*p*_{1} + *p*_{2})/(*p*_{1} – *p*_{2})²

and so in the card example, we would have to deal roughly 6 × 10^{18} cards to detect the bias between one of the more likely cards and one of the less likely cards.
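Plugging the card example into the formula (my arithmetic):

```python
# p1, p2: probabilities of a more likely and a less likely card value
# when reducing a 32-bit word mod 52, as computed earlier in the post.
p1 = 82595525 / 2**32
p2 = 82595524 / 2**32

# Approximate sample size to detect the difference
n = 8 * (p1 + p2) / (p1 - p2)**2
assert 5.5e18 < n < 6e18   # roughly 6 * 10^18 deals
```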

***

Sometimes math uses a new font rather than a new alphabet, such as Fraktur. This is common in Lie groups when you want to associate related symbols to a Lie group and its Lie algebra. By convention a Lie group is denoted by an ordinary Latin letter and its associated Lie algebra is denoted by the same letter in Fraktur font.

To produce Fraktur letters in LaTeX, load the `amssymb` package and use the command `\mathfrak{}`.

Symbols such as `\mathfrak{A}` are math symbols and can only be used in math mode. They are not intended to be a substitute for setting text in Fraktur font. This is consistent with the semantic distinction in Unicode described below.

The Unicode standard tries to distinguish the appearance of a symbol from its semantics, though there are compromises. For example, the Greek letter Ω has Unicode code point U+03A9, while the ohm sign Ω for electrical resistance is U+2126, even though they are rendered the same [1].

The letters *a* through *z*, rendered in Fraktur font and used as mathematical symbols, have Unicode values U+1D51E through U+1D537. These values are in the “Supplementary Multilingual Plane” and do not commonly have font support [2].

The corresponding letters A through Z are encoded as U+1D504 through U+1D51C, though interestingly a few letters are missing. The code point U+1D506, which you’d expect to be Fraktur C, is reserved, as are the spots corresponding to H, I, and R. These are reserved because the corresponding letters were already encoded in the Letterlike Symbols block: ℭ (U+212D), ℌ (U+210C), ℑ (U+2111), and ℜ (U+211C). However, the corresponding bold versions U+1D56C through U+1D585 have no such gaps [3].
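A small Python sketch (mine) makes the ranges concrete. Note that the simple offset works for the lowercase and bold alphabets, but would fail for the plain uppercase letters C, H, I, and R because of the reserved code points:

```python
# Math Fraktur lowercase: 'a'..'z' -> U+1D51E..U+1D537
def frak_lower(c):
    return chr(0x1D51E + ord(c) - ord('a'))

assert frak_lower('a') == '\U0001D51E'   # Fraktur a
assert frak_lower('z') == '\U0001D537'   # Fraktur z

# Bold Fraktur uppercase has no gaps: 'A'..'Z' -> U+1D56C..U+1D585
def frak_bold_upper(c):
    return chr(0x1D56C + ord(c) - ord('A'))

assert frak_bold_upper('Z') == '\U0001D585'
```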

[1] At least they usually are. A font designer could choose to provide different glyphs for the two symbols. I used the same character for both because I thought some readers might not see the ohm sign properly rendered.

[2] If you have the necessary fonts installed you should see the alphabet in Fraktur below:

𝔞 𝔟 𝔠 𝔡 𝔢 𝔣 𝔤 𝔥 𝔦 𝔧 𝔨 𝔩 𝔪 𝔫 𝔬 𝔭 𝔮 𝔯 𝔰 𝔱 𝔲 𝔳 𝔴 𝔵 𝔶 𝔷

I can see these symbols from my desktop and from my iPhone, but not from my Android tablet. Same with the symbols below.

[3] Here are the bold upper case and lower case Fraktur letters in Unicode:

𝕬 𝕭 𝕮 𝕯 𝕰 𝕱 𝕲 𝕳 𝕴 𝕵 𝕶 𝕷 𝕸 𝕹 𝕺 𝕻 𝕼 𝕽 𝕾 𝕿 𝖀 𝖁 𝖂 𝖃 𝖄 𝖅

𝖆 𝖇 𝖈 𝖉 𝖊 𝖋 𝖌 𝖍 𝖎 𝖏 𝖐 𝖑 𝖒 𝖓 𝖔 𝖕 𝖖 𝖗 𝖘 𝖙 𝖚 𝖛 𝖜 𝖝 𝖞 𝖟

For each positive integer *k*, and non-negative integer *n*, the *k*th homotopy group of the sphere *S*^{n} is a finitely generated Abelian group, something that can be described by a finite list of numbers. So we’re looking at simply writing a function that takes two integers as input and returns a list of integers. This function is implemented in an online calculator that lets you lookup homotopy groups.

Computing homotopy groups of spheres is far from easy. The first Fields medal given to a topologist was for partial work along these lines. There are still groups that haven’t been computed, and potentially more Fields medals to win. But our task is much more modest: simply to **summarize** what has been discovered.

This is not going to be too easy, as suggested by the sample of results in the table below.

This table was taken from the Homotopy Type Theory book, and was in turn based on the Wikipedia article on homotopy groups of spheres.

To give an example of what we’re after, the table says that π_{13}(*S*²), the 13th homotopy group of the 2-sphere, is *Z*_{12} × *Z*_{2}. All we need to know is the subscripts on the *Z*‘s, the orders of the cyclic factors, and so our function would take as input (13, 2) and return (12, 2).

The table tells us that π_{8}(*S*^{4}) = *Z*_{2}^{2}. This is another way of writing *Z*_{2} × *Z*_{2}, and so our function would take (8, 4) as input and return (2, 2).

When I said above that our function would return a list of integers I glossed over one thing: some of the *Z*‘s don’t have a subscript. That is, some of the factors are the group of integers, not the group of integers mod some finite number. So we need to add an extra symbol to indicate a factor with no subscript. I’ll use ∞ because the integers form the infinite cyclic group. For example, our function would take (7, 4) and return (∞, 12). I will also use 1 to denote the trivial group, the group with 1 element.

Some results are unknown, and so we’ll return an empty list for these.

The order of the numbers in the output doesn’t matter, but we will list the numbers in descending order because that appears to be conventional.

Some of the values of our function can be filled in by a general theorem, and some will simply be data.

If we call our function *f*, then there is a theorem that says *f*(*k*, *n*) = (1) if *k* < *n*. This accounts for the zeros in the upper right corner of the chart above.

There’s another theorem that says *f*(*n*+*m*, *n*) is independent of *n* if *n* > *m* + 1. These are the so-called stable homotopy groups.

The rest of the groups are erratic; we can’t do much better than just listing them as data.
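Here’s a minimal sketch of what such a function could look like. The names and the handful of table entries are mine, transcribed from the examples above:

```python
INF = float('inf')   # stands for a Z factor, the infinite cyclic group

# A few entries from the table discussed above; (k, n) -> cyclic orders.
# An empty tuple means the group is unknown or absent from our data.
table = {
    (13, 2): (12, 2),    # pi_13(S^2) = Z_12 x Z_2
    (8, 4):  (2, 2),     # pi_8(S^4)  = Z_2 x Z_2
    (7, 4):  (INF, 12),  # pi_7(S^4)  = Z x Z_12
}

def f(k, n):
    if k < n:            # theorem: pi_k(S^n) is trivial for k < n
        return (1,)
    return table.get((k, n), ())

assert f(3, 7) == (1,)
assert f(13, 2) == (12, 2)
assert f(7, 4) == (INF, 12)
```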

(By the way, the corresponding results for **homology** rather than **homotopy** are ridiculously simple by comparison. For *k* > 0, the *k*th homology group of *S*^{n} is isomorphic to the integers if *k* = *n* and is trivial otherwise.)

Which has more mobility on an empty chessboard: a rook, a king or a knight?

On an ordinary 8 by 8 chessboard, a rook can move to any one of 14 squares, no matter where it starts. A knight and a king both have 8 possible moves from the middle of the board, fewer moves from an edge.

If we make the board bigger or higher dimensional, what happens to the relative mobility of each piece? Assume we’re playing chess on a hypercube lattice, *n* points along each edge, in *d* dimensions. So standard chess corresponds to *n* = 8 and *d* = 2.

A **rook** can move to any square in a coordinate direction, so it can choose one of *n*-1 squares in each of *d* dimensions, for a total of (*n*-1)*d* possibilities.

A **king** can move a distance of 0, 1, or -1 in each coordinate. In *d* dimensions, this gives him 3^{d} – 1 options. (Minus one for the position he’s initially on: moving 0 in every coordinate isn’t a move!)

What about a **knight**? There are *C*(*d*, 2) ways to choose two coordinate directions. For each pair of directions, there are 8 possibilities: two ways to choose which direction to move two steps in, and then whether to move + or – in each direction. So there are 4*d*(*d* – 1) possibilities.

In short:

- A king’s mobility increases exponentially with dimension and is independent of the size of the board.
- A rook’s mobility increases linearly with dimension and linearly with the size of the board.
- A knight’s mobility increases quadratically with dimension and is independent of the size of the board.
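The three counts are easy to put side by side in code. Here’s a quick sketch of the formulas above, evaluated from the middle of the board:

```python
from math import comb

def rook_moves(n, d):
    return (n - 1) * d

def king_moves(n, d):
    return 3**d - 1

def knight_moves(n, d):
    return 8 * comb(d, 2)   # = 4*d*(d-1)

# Standard chess: n = 8, d = 2
assert rook_moves(8, 2) == 14
assert king_moves(8, 2) == 8
assert knight_moves(8, 2) == 8
```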

The rules above are not the only way to generalize chess rules to higher dimensions. Here’s an interesting alternative way to define a knight’s moves: a knight can move to any lattice point a distance √5 away. In dimensions up to 4, this corresponds to the definition above. But in dimension 5 and higher, there’s a new possibility: moving one step in each of five dimensions. In that case, the number of possible knight moves increases with dimension like *d*^{5}.

**Related post**: A knight’s random walk in 3 or 4 dimensions


So what do we mean by 3D chess? For this post, we’ll have a three dimensional lattice of possible positions, of size 8 by 8 by 8. You could think of this as a set of 8 ordinary chess boards stacked vertically. To generalize a knight’s move to this new situation, we’ll say that a knight move consists of moving 2 steps in one direction and 1 step in an orthogonal direction. For example, he might move up two levels and over one position horizontally.

Suppose our knight walks randomly through the chess lattice. At each point, he evaluates all possible moves and chooses one randomly with all possible moves having equal probability. How long on average will it take our knight to return to where he started?

As described in the post about the two dimensional problem, we can find the average return time using a theorem about Markov chains.

The solution is to view the problem as a random walk on a graph. The vertices of the graph are the squares of a chess board and the edges connect legal knight moves. The general solution for the time to first return is simply 2*N*/*k*, where *N* is the number of edges in the graph and *k* is the number of edges meeting at the starting point.

The problem reduces to counting *N* and *k*. This is tedious in two dimensions, and gets harder in higher dimensions. Rather than go through a combinatorial argument, I’ll show how to compute the result with a little Python code.

To count the number of edges *N*, we’ll add up the number of edges at each node in our graph, and then divide by 2 since this process will count each edge twice. We will iterate over our lattice, generate all potential moves, and discard those that would move outside the lattice.

```python
from itertools import product, combinations

def legal(v):
    # A position is legal if every coordinate is on the board.
    return all(0 <= t < 8 for t in v)

count = 0
for position in product(range(8), range(8), range(8)):
    # Choose two directions
    for d in combinations(range(3), 2):
        # Move 1 step in one direction and 2 steps in the other.
        for step in [1, 2]:
            for sign0 in [-1, 1]:
                for sign1 in [-1, 1]:
                    move = list(position)
                    move[d[0]] += sign0*step
                    move[d[1]] += sign1*(3 - step)
                    if legal(move):
                        count += 1

# Each edge is counted once from each end.
print(count // 2)
```

This tells us that there are *N* = 4,032 edges in our graph of possible moves. The number of starting moves *k* depends on our starting point. For example, if we start at a corner, then we have 6 possibilities. In this case we should expect our knight to return to his starting point in an average of 2 × 4032/6 = 1,344 moves.

We can easily modify the code above. To look at different size lattices, we could change all the 8’s above. The function `legal` would need more work if the lattice were not the same size in each dimension.

We could also look at four dimensional chess by adding one more range to the position loop and changing the combinations to come from `range(4)` rather than `range(3)`. In case you’re curious, in four dimensions, the graph capturing legal knight moves in an 8 by 8 by 8 by 8 lattice has 64,512 edges. If our knight started in a corner, he’d have 12 possible starting moves, so we’d expect him to return to his starting position on average after 2 × 64512/12 = 10,752 moves.

Something analogous happens in mathematics. While pursuing one goal, mathematicians spin off tools that are useful for other purposes. Algebraic topology might be the NASA of pure mathematics. It’s worth taking a course in algebraic topology even if you don’t care about topology, just to see a lot of widely used ideas in their original habitat.

Number theory has developed an enormous amount of technical machinery. If a graduate student in number theory described to you what she’s working on, you probably wouldn’t recognize it as number theory. But as far as I know, not much of the machinery of number theory has been applied to anything besides number theory. Maybe there’s an arbitrage opportunity: not to apply number theory per se, but to apply the *machinery* of number theory.

Suppose you have an observation that either came from *X* or *Y*, and *a priori* you believe that both are equally probable. Assume *X* and *Y* are both normally distributed with standard deviation 1, but that the mean of *X* is 0 and the mean of *Y* is 1. The probability you assign to *X* and *Y* after seeing data will change, depending on what you see. The larger the value you observe, the more sure you can be that it came from *Y*.

This plot shows the posterior probability that an observation came from *Y*, given that we know the observation is greater than the value on the horizontal axis.

Suppose I’ve seen the exact value, and it’s 4, but all I tell you is that it’s bigger than 2. Then you would say it probably came from *Y*. When I go back and tell you that in fact it’s bigger than 3, you would be more sure it came from *Y*. The more information I give you, the more convinced you are. This isn’t the case with every probability distribution.

Now let’s suppose *X* and *Y* have a Cauchy distribution with unit scale, with *X* having mode 0 and *Y* having mode 1. The plot below shows how likely it is that our observation came from *Y* given a lower bound on the observation.

We are most confident that our data came from *Y* when we know that our data is greater than 1. But the larger our lower bound is, the further we look out in the tails, the *less* confident we are! If we know, for example, that our data is at least 5, then we still think that it’s more likely that it came from *Y* than from *X*, but we’re not as sure.

As above, suppose I’ve seen the data value but only tell you lower bounds on its value. Suppose I see a value of 4, but only tell you it’s positive. When I come back and tell you that the value is bigger than 1, your confidence goes up that the sample came from *Y*. But as I give you more information, telling you that the sample is bigger than 2, then bigger than 3, your confidence that it came from *Y* goes *down*, just the opposite of the normal case.
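Both behaviors are easy to reproduce numerically. This sketch (my own, using the survival functions in `scipy.stats`) computes the posterior probability that a sample came from *Y* given a lower bound *z*, with equal priors:

```python
from scipy.stats import norm, cauchy

def posterior_Y(dist_X, dist_Y, z):
    # P(came from Y | observation > z), with equal prior probabilities
    tail_X, tail_Y = dist_X.sf(z), dist_Y.sf(z)
    return tail_Y / (tail_X + tail_Y)

# Normal case: confidence in Y increases with the lower bound
p = [posterior_Y(norm(0), norm(1), z) for z in (2, 3, 4)]
assert p[0] < p[1] < p[2]

# Cauchy case: past the mode, confidence in Y decreases
q = [posterior_Y(cauchy(0), cauchy(1), z) for z in (2, 3, 4)]
assert q[0] > q[1] > q[2]
```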

What accounts for the very different behavior?

In the normal example, seeing a value of 5 or more from *Y* is unlikely, but seeing a value so large from *X* is very unlikely. Both tails are getting smaller as you move to the right, but in *relative* terms, the tail of *X* is getting thinner much faster than the tail of *Y*.

In the Cauchy example, the value of both tails gets smaller as you move to the right, but the *relative* difference between the two tails is decreasing. Seeing a value greater than 10, say, from *Y* is unlikely, but it would only be slightly less likely from *X*.

In between thin tails and thick tails are medium tails. The tails of the Laplace (double exponential) distribution decay exponentially. Exponential tails are often considered the boundary between thin tails and thick tails, or super-exponential and sub-exponential tails.

Suppose you have two Laplace random variables with the same scale, with *X* centered at 0 and *Y* centered at 1. What is the posterior probability that a sample came from *Y* rather than *X*, given that it’s at least some value *z* > 1? It’s **constant**! Specifically, it’s *e*/(1 + *e*).
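A quick check of that constant. My derivation: for *z* > 1 the two tail probabilities are exp(-*z*)/2 and exp(-(*z* – 1))/2, so their ratio is independent of *z*.

```python
from math import exp

def posterior_Y(z):
    # P(came from Y | observation > z) for Laplace(0,1) vs Laplace(1,1),
    # equal priors, valid for z > 1
    tail_X = 0.5 * exp(-z)
    tail_Y = 0.5 * exp(-(z - 1))
    return tail_Y / (tail_X + tail_Y)

# The posterior is the same for every lower bound z > 1: e/(1 + e)
for z in (1.5, 2, 5, 10):
    assert abs(posterior_Y(z) - exp(1)/(1 + exp(1))) < 1e-12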

In the normal and Cauchy examples, it didn’t really matter that one distribution was centered at 0 and the other at 1. We’d get the same qualitative behavior no matter what the shift between the two distributions. The limit would tend to 1 for the normal distribution and 1/2 for the Cauchy. The constant value of the posterior probability with the Laplace example depends on the size of the shift between the two.

We’ve assumed that *X* and *Y* are equally likely *a priori*. The limiting value in the normal case does not depend on the prior probabilities of *X* and *Y* as long as they’re both positive. The prior probabilities will affect the limiting values for the Cauchy and Laplace cases.

For anyone who wants a more precise formulation of the examples above, let *B* be a non-degenerate Bernoulli random variable and define *Z* = *BX* + (1 – *B*)*Y*. We’re computing the conditional probability Pr(*B* = 0 | *Z* > *z*) using Bayes’ theorem. If *X* and *Y* are normally distributed, the limit of Pr(*B* = 0 | *Z* > *z*) as *z* goes to infinity is 1. If *X* and *Y* are Cauchy distributed, the limit is the unconditional probability Pr(*B* = 0).

In the normal case, as *z* goes to infinity, the distribution of *B* carries no useful information.

In the Cauchy case, as *z* goes to infinity, the observation *z* carries no useful information.

He would proceed very slowly at the beginning of the semester, so slowly that you didn’t see how he could possibly cover the course material by the end. But his pace would gradually increase to the point that he was going very quickly at the end. And yet the pace increased so smoothly that you were hardly aware of it. By understanding the first material thoroughly, you were able to go through the latter material quickly.

If you’ve got 15 weeks to cover 15 chapters, don’t assume the optimal pace is to cover one chapter every week.

I often read technical books the way the professor mentioned above lectured. The density of completely new ideas typically decreases as a book progresses. If your reading pace is proportional to the density of new ideas, you’ll start slow and speed up.

The preface may be the most important part of the book. For some books I’ve read only the preface and felt like I got a lot out of the book.

The last couple chapters of technical books can often be ignored. It’s common for authors to squeeze in something about their research at the end of a book, even if it’s out of character with the rest of the book.

There’s a general solution to the problem for any curve. Given a parameterization *p*(*t*), where *p* is vector-valued, the length covered from time 0 to time *t* is

*s*(*t*) = ∫_{0}^{*t*} ‖*p*′(τ)‖ *d*τ

If you change the time parameterization by inverting this function, solving for *t* as a function of *s*, then the total length of curve traversed by *p*(*t*(*s*)) up to time *s* is *s*. This is called either the **unit speed parameterization** or **parameterization by arc length**.

**The hard part is inverting *s*(*t*)**. If you had to find a unit speed parameterization in a calculus class, the problem would be carefully designed so that the function *s*(*t*) could be inverted in closed form.

My response to my friend’s question was that there probably is a closed-form unit speed parameterization, but it would probably involve elliptic functions. He didn’t need much resolution, and decided to do something ad hoc.

Starting with the parameterization *p*(*t*) = (*a* cos(*t*), *b* sin(*t*)), the arc length *s*(*t*) is given by a special function known as an “incomplete elliptic integral of the second kind.” I remembered that the Jacobi elliptic functions were related to the inverses of elliptic integrals, so my hunch was that you could make a unit speed parameterization using Jacobi elliptic functions. Maybe you can, but it’s not as easy as I thought because the Jacobi functions are related to the inverses of elliptic integrals of the *first* kind.
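We can check the elliptic integral claim numerically. In `scipy`, `ellipeinc(phi, m)` is the incomplete elliptic integral of the second kind; the change of variables below (my own bookkeeping, substituting φ = π/2 – τ) relates it to the arc length integral:

```python
from math import sin, cos, pi
from scipy.integrate import quad
from scipy.special import ellipeinc

a, b = 3.0, 2.0
m = 1 - (b/a)**2   # scipy's parameter convention, m = k^2

def arc_length(t):
    # Arc length of p(t) = (a cos t, b sin t) from 0 to t, by quadrature
    speed = lambda u: (a**2*sin(u)**2 + b**2*cos(u)**2)**0.5
    return quad(speed, 0, t)[0]

def arc_length_elliptic(t):
    # Same quantity via the incomplete elliptic integral of the second kind
    return a*(ellipeinc(pi/2, m) - ellipeinc(pi/2 - t, m))

for t in (0.3, 1.0, 1.5):
    assert abs(arc_length(t) - arc_length_elliptic(t)) < 1e-9
```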

**Elliptic integrals** are so named because they were first identified by computing arc length for a (portion of) an ellipse. **Elliptic functions** were discovered by inverting elliptic *integrals*, but not the same class of elliptic integrals that give the arc length of an ellipse. (There may well be a transformation that makes these more directly related, but I’m not aware of it.)

Incidentally, **elliptic curves** are related to elliptic *functions*, but they are not ellipses! There is a connection from ellipses to elliptic curves, but it’s historical and indirect.

What if we had a more general curve than an ellipse, say something parameterized by **cubic splines**? Cubic splines are piecewise cubic polynomials, patched together in such a way that the first and second derivatives are continuous across the patches. We can find the length of a spline by finding the length of each polynomial patch.

If *p*(*t*) is the parameterization of a curve in 2 or 3 dimensions (or really any number of dimensions) and each component of *p* is a cubic polynomial, then each component of the derivative of *p* is a quadratic polynomial, and so the sum of the squares of the components is a fourth degree polynomial. So finding the arc length involves integrating the square root of a fourth degree polynomial. This makes it an **elliptic integral**!

Unfortunately, knowing that the arc length of a cubic spline corresponds to an elliptic integral is not so useful because it could be any type of elliptic integral, depending on its parameters. You’d have to do some work first to put it into a form where you could call on elliptic integrals to finish your problem.

The elliptic integral path is something of a dead end. It could still be useful if you needed high accuracy, or if you had some restriction on the class of curves you’re interested in. But in general, you’d need to use numerical integration to find the arc length.

You could also find unit-speed parameterizations numerically, using root-finding to invert *s*(*t*) at particular points. Since *s* is an increasing function of *t*, you could use a bisection method, which is not the most efficient approach but very robust.

It takes a fair amount of computation to carry out root finding when each function evaluation requires computing a numerical integral. But this would work, and depending on your context it could be efficient enough.
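Here’s a sketch of that numerical approach for the ellipse (my implementation, using Brent’s method from `scipy` rather than plain bisection, since *s* is increasing and easy to bracket):

```python
from math import sin, cos, pi
from scipy.integrate import quad
from scipy.optimize import brentq

a, b = 3.0, 2.0

def speed(t):
    # |p'(t)| for p(t) = (a cos t, b sin t)
    return (a**2*sin(t)**2 + b**2*cos(t)**2)**0.5

def s_of_t(t):
    return quad(speed, 0, t)[0]

def t_of_s(s, t_max=2*pi):
    # Invert the increasing arc length function by root-finding
    return brentq(lambda t: s_of_t(t) - s, 0, t_max)

total = s_of_t(2*pi)
t_half = t_of_s(total/2)
assert abs(s_of_t(t_half) - total/2) < 1e-8
assert abs(t_half - pi) < 1e-6   # by symmetry, half the length at t = pi
```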

If you needed more efficiency, say for a real-time embedded system, you could take a different approach. Your spline is probably an approximation to something else, and so your solution only needs to be as accurate as the spline approximation. This gives you the wiggle room to do something more efficient. You might change your parameterization slightly to make the arc length calculations easier, or come up with a way to approximately invert the arc length function, something that takes advantage of your particular problem and error tolerance.

In this post I’ll write *C*(*n*) rather than *C*_{n} because that will be easier to read later on.

Catalan numbers come up in all kinds of applications. For example, the number of ways to parenthesize an expression with *n* terms is the *n*th Catalan number *C*(*n*).

Here’s a strange theorem regarding Catalan numbers that I found in *Catalan Numbers with Applications* by Thomas Koshy:

The number of digits in *C*(10^{n}) … converges to the number formed by the digits on the right side of the decimal point in log_{10} 4 = 0.60205999132…

Let’s see what that means. *C*(10) equals 16,796, which of course has 5 digits.

Next, *C*(100) equals

896,519,947,090,131,496,687,170,070,074,100,632,420,837,521,538,745,909,320

which has 57 digits. These numbers are getting really big really fast, so I’ll give a table of just the number of digits of a few more examples rather than listing the Catalan numbers themselves.

|---+------------------|
| n | # C(10^n) digits |
|---+------------------|
| 1 |                5 |
| 2 |               57 |
| 3 |              598 |
| 4 |             6015 |
| 5 |            60199 |
| 6 |           602051 |
| 7 |          6020590 |
| 8 |         60205987 |
|---+------------------|

I stopped at *n* = 8 because my computer locked up when I tried to compute *C*(10^{9}). I was computing these numbers with the following Mathematica code:

Table[ IntegerLength[ CatalanNumber[10^n] ], {n, 1, 8} ]

Computing `CatalanNumber[10^9]` was too much for Mathematica. So how might we extend the table above?

We don’t need to know the Catalan numbers themselves, only how many digits they have. And we can compute the number of digits from an approximate value of their logarithm. Taking logarithms also helps us avoid overflow.

We’ll write Python below to determine the number of digits, in part to show that we don’t need any special capability of something like Mathematica.

We need three facts before we write the code:

- The number of decimal digits in a number *x* is 1 + ⌊log_{10} *x*⌋, where ⌊*y*⌋ is the floor of *y*, the greatest integer not greater than *y*.
- *n*! = Γ(*n* + 1)
- log_{10}(*x*) = log(*x*) / log(10)

Note that when I write “log” I *always* mean natural log. More on that here.

This code can compute the number of digits of *C*(10^{n}) quickly for large *n*.

from numpy import log, floor
from scipy.special import gammaln

def log_catalan(n):
    return gammaln(2*n + 1) - 2*gammaln(n + 1) - log(n + 1)

def catalan_digits(n):
    return 1 + floor(log_catalan(n) / log(10))

for n in range(1, 14):
    print(n, catalan_digits(10.**n))

The code doesn’t run into trouble until `n` = 306, at which point `gammaln` overflows. (Note the dot after 10 in the last line. Without the dot Python computes `10**n` as an integer, and that has problems when `n` = 19.)

How would you go about proving the theorem above? We want to show that the number of digits in *C*(10^{n}), divided by 10^{n}, converges to log_{10} 4, i.e.

(1 + ⌊log_{10} *C*(10^{n})⌋) / 10^{n} → log_{10} 4.

We can switch to natural logs by multiplying both sides by log(10). Also, in the limit we don’t need the 1 or the floor in the numerator. So it suffices to prove

log *C*(10^{n}) / 10^{n} → log 4.

Now we see this has nothing to do with base 10. We only need to prove

log *C*(*n*) / *n* → log 4,

and that is a simple exercise using Stirling’s approximation.
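For the record, here is that exercise sketched out, using *C*(*n*) = (2*n*)! / (*n*!² (*n* + 1)) and the leading terms log *n*! = *n* log *n* − *n* + *O*(log *n*) of Stirling’s approximation:

```latex
\begin{align*}
\log C(n) &= \log (2n)! - 2\log n! - \log(n+1) \\
          &= 2n\log(2n) - 2n - 2\bigl(n\log n - n\bigr) + O(\log n) \\
          &= 2n\log 2 + O(\log n),
\end{align*}
```

and dividing by *n* gives log *C*(*n*) / *n* → 2 log 2 = log 4.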

Our proof showed that this theorem ultimately doesn’t have anything to do with base 10. If we did everything in another base, we’d get analogous results.

To give a taste of that, let’s work in base 7. If we look at the Catalan numbers *C*(7^{n}) and count how many “digits” their base 7 representations have, we get the same pattern as the base 7 representation of log_{7} 4.

Note that the base appears in four places:

- which Catalan numbers we’re looking at,
- which base we express the Catalan numbers in,
- which base we take the log of 4 in, and
- which base we express that result in.

If you forget one of these, as I did at first, you won’t get the right result!

Here’s a little Mathematica code to do an example with *n* = 8.

BaseForm[ 1 + Floor[ Log[7, CatalanNumber[7^8]] ], 7 ]

This returns 46623341_{7} and the code

BaseForm[Log[7, 4.], 7]

returns 0.4662336_{7}.
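We can cross-check the first of those results in Python, reusing the log-gamma trick from earlier. The `to_base7` helper here is my own, not from the post:

```python
from math import floor, log
from scipy.special import gammaln

def log_catalan(n):
    # natural log of the nth Catalan number via log-gamma
    return gammaln(2*n + 1) - 2*gammaln(n + 1) - log(n + 1)

def to_base7(k):
    # base 7 representation of a positive integer, as a string
    digits = ""
    while k:
        digits = str(k % 7) + digits
        k //= 7
    return digits

# number of base 7 digits of C(7^8), expressed in base 7
count = 1 + floor(log_catalan(7.0**8) / log(7))
print(to_base7(int(count)))
```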

and so we have the following right triangle.

The hypotenuse will always be irrational because the only Fibonacci numbers that are squares are 1 and 144, and 144 is the 12th Fibonacci number.

Here are a few numerical results to give an idea of the accuracy of the bounds on a log scale. The first column is the argument *n*. The second is the log of the CBC (central binomial coefficient) minus the log of the lower bound. The third column is the log of the upper bound minus the log of the CBC.

|---+--------+--------|
| n | lower  | upper  |
|---+--------+--------|
| 1 | 0.0000 | 0.3465 |
| 2 | 0.0588 | 0.2876 |
| 3 | 0.0793 | 0.2672 |
| 4 | 0.0896 | 0.2569 |
| 5 | 0.0958 | 0.2507 |
| 6 | 0.0999 | 0.2466 |
| 7 | 0.1029 | 0.2436 |
| 8 | 0.1051 | 0.2414 |
| 9 | 0.1069 | 0.2396 |
|---+--------+--------|

And here’s a plot of the same data taking *n* out further.

So the ratio of the upper bound to the CBC and the ratio of the CBC to the lower bound both quickly approach an asymptote, and the lower bound is better.
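The bounds themselves aren't restated in this excerpt, but the table is consistent with the standard inequalities 4^{n}/√(4*n*) ≤ *C*(2*n*, *n*) ≤ 4^{n}/√(2*n*). Assuming those are the bounds in question, this sketch reproduces the table:

```python
from math import comb, log

def log_gaps(n):
    # log(CBC) - log(lower bound) and log(upper bound) - log(CBC),
    # assuming the bounds 4^n / sqrt(4n) <= C(2n, n) <= 4^n / sqrt(2n)
    log_cbc = log(comb(2*n, n))
    log_lower = n * log(4) - 0.5 * log(4*n)
    log_upper = n * log(4) - 0.5 * log(2*n)
    return log_cbc - log_lower, log_upper - log_cbc

for n in range(1, 10):
    lo, hi = log_gaps(n)
    print(f"{n} {lo:.4f} {hi:.4f}")
```

Note that at *n* = 1 the lower bound is exact, which matches the 0.0000 entry in the table.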

For vectors *x* and *y* in three dimensions, the dot product satisfies

*x* · *y* = ||*x*|| ||*y*|| cos θ

where θ is the angle between the two vectors. In higher dimensions, we can turn this theorem around and use it as the *definition* of the angle between two vectors based on their dot product.

(The plots below are jagged because they’re based on random sampling.)

Here’s the Euclidean distance between *xy* and *yx* for quaternions from the earlier post:

And here’s the corresponding angular distance:

The range has changed from [0, 2] to [0, π], and the distribution now shifts left instead of right.

Here’s a similar graph, looking at the angular distance between *xy* and *yx* for octonions, something I haven’t plotted before in Euclidean distance.

This graph is more symmetric, which we might expect: since octonions have less algebraic structure than quaternions, we might expect the relationship between *xy* and *yx* to behave more erratically, and for the plot to look more like a normal distribution.

Finally, let’s revisit the distance between (*xy*)*z* and *x*(*yz*) for octonions. Here is the distribution of the Euclidean distance from a previous post:

And here is the corresponding histogram based on angular distance.

These plots are based on uniform random samples of quaternions and octonions of length 1, i.e. points from the unit spheres in 4 and 8 dimensions respectively. Quaternions and octonions have the property that the product of unit length vectors is another unit length vector, and the angle between two unit vectors is the inverse cosine of their dot product.
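A minimal sketch of this kind of simulation, for the quaternion *xy* versus *yx* plot (my own code, not necessarily what generated the plots above):

```python
import numpy as np

rng = np.random.default_rng()

def random_unit(dim):
    # uniform point on the unit sphere in `dim` dimensions
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def quat_mult(p, q):
    # Hamilton product of quaternions p = (w, x, y, z), q likewise
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def angular_distance(u, v):
    # angle between unit vectors; clip guards against rounding past +/-1
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))

angles = []
for _ in range(10_000):
    x, y = random_unit(4), random_unit(4)
    angles.append(angular_distance(quat_mult(x, y), quat_mult(y, x)))
angles = np.array(angles)
```

A histogram of `angles` gives the kind of plot discussed above, with values in [0, π].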

I thought that sedenions also had this norm property, that the product of unit length vectors has unit length. Apparently not, as I discovered by trying to take the inverse cosine of a number larger than 1. So what is the distribution of lengths that come from multiplying two sedenions of length 1? Apparently the mean is near 1, and here’s a histogram.
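One way to run that experiment (my own sketch, not the author's code) is to build sedenions from the Cayley–Dickson construction, which doubles the dimension at each step: reals, complexes, quaternions, octonions, sedenions at dimensions 1, 2, 4, 8, 16.

```python
import numpy as np

rng = np.random.default_rng()

def conj(x):
    # Cayley-Dickson conjugate: negate every component except the first
    return np.concatenate(([x[0]], -x[1:]))

def cd_mult(x, y):
    # Cayley-Dickson product: (a, b)(c, d) = (ac - conj(d) b, da + b conj(c))
    n = len(x)
    if n == 1:
        return x * y
    a, b = x[:n//2], x[n//2:]
    c, d = y[:n//2], y[n//2:]
    return np.concatenate((
        cd_mult(a, c) - cd_mult(conj(d), b),
        cd_mult(d, a) + cd_mult(b, conj(c)),
    ))

def random_unit(dim):
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# norms of products of random unit sedenions: no longer identically 1
norms = np.array([np.linalg.norm(cd_mult(random_unit(16), random_unit(16)))
                  for _ in range(1000)])
```

At dimension 8 and below the product of unit vectors has unit norm; at dimension 16 the `norms` array spreads out around 1, which is the distribution in the histogram.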
