Here’s one way fractional derivatives could be defined. Suppose the Fourier transform of *f*(*x*) is *g*(ξ). Then for positive integer *n*, the *n*th derivative of *f*(*x*) has Fourier transform (2π*i*ξ)^{n} *g*(ξ). So you could take the *n*th derivative of *f*(*x*) as follows: take the Fourier transform, multiply by (2π*i*ξ)^{n}, and take the inverse Fourier transform. This suggests the same procedure could be used to define the *n*th derivative when *n* is not an integer.

Here’s another way to define fractional derivatives that doesn’t use Fourier transforms, the Grünwald-Letnikov derivative. It’s a fairly direct approach.

The starting point is to note that for positive integer *n*, the *n*th derivative can be written as

$$f^{(n)}(x) = \lim_{h \to 0^+} \frac{\Delta_h^n f(x)}{h^n}$$

where

$$\Delta_h f(x) = f(x) - f(x - h)$$

and Δ^{n} iterates Δ. For example,

$$\Delta_h^2 f(x) = f(x) - 2f(x - h) + f(x - 2h)$$

In general, for positive integer *n*, we have

$$\Delta_h^n f(x) = \sum_{k=0}^{n} (-1)^k \binom{n}{k} f(x - kh)$$

We could set the upper limit of the sum to *n*, but there’s no harm in setting it to ∞ because the binomial coefficients will be zero for *k* larger than *n*. And in this form, we can replace the integer *n* with any positive real number α to define the αth derivative. That is, the Grünwald-Letnikov derivative of *f* is given by

$$D^\alpha f(x) = \lim_{h \to 0^+} \frac{1}{h^\alpha} \sum_{k=0}^{\infty} (-1)^k \binom{\alpha}{k} f(x - kh)$$

See these notes for the definition of binomial coefficients for possibly non-integer arguments and for an explanation why for integer *n* the coefficients are eventually zero.
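
Here’s a quick numerical sketch of the Grünwald-Letnikov idea (my own toy code, not from the notes above): truncate the sum once the argument leaves [0, *x*], and compare the half-derivative of *f*(*x*) = *x* against the known closed form 2√(*x*/π).

```python
from math import sqrt, pi

def binom(alpha, k):
    # generalized binomial coefficient C(alpha, k); alpha need not be an integer
    b = 1.0
    for i in range(k):
        b *= (alpha - i) / (i + 1)
    return b

def gl_derivative(f, x, alpha, h=0.001):
    # Grunwald-Letnikov derivative, truncating the sum at k = x/h
    # (terms beyond that would evaluate f to the left of 0)
    n = int(round(x / h))
    s = sum((-1)**k * binom(alpha, k) * f(x - k*h) for k in range(n + 1))
    return s / h**alpha

# half-derivative of f(x) = x at x = 1; the exact value is 2*sqrt(x/pi)
approx = gl_derivative(lambda t: t, 1.0, 0.5)
exact = 2*sqrt(1/pi)
```

The approximation is first-order accurate in *h*, so shrinking the step size improves the answer slowly but steadily.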

**Related post**: Mittag-Leffler functions

Yesterday I was on Amazon.com and noticed that nearly all the books they recommended for me were either about Lisp or mountain climbing. I thought this was odd, and mentioned it on Twitter. Carl Vogel had a witty reply: “I guess they weren’t sure whether you want to figuratively or literally look down on everyone.”

The stereotypical Lisp programmer does look down on everyone. But this is based on a tiny, and perhaps unrepresentative, sample of people writing *about* Lisp compared to the much larger number of people who are writing *in* Lisp.

Lisp has been around for over 50 years and shows no signs of going away. There are a lot of people writing Lisp in obscurity. Kate Gregory said something similar about C++ developers, calling them the dark matter of programmers because there are a lot of them but they don’t make much of a splash. They’re quietly doing their job, not speaking at conferences or writing much about their language.

I imagine there are a lot of humble Lisp programmers. It takes some humility to commit to an older technology, especially given the pervasive neomania of the programmer community. It also takes some humility to work on projects that have been around for years or that are deep within the infrastructure of more visible projects, which is where I expect a lot of Lisp lives.

You can do very clever things in Lisp, but you don’t have to. As Ed Post famously said, “The determined Real Programmer can write FORTRAN programs in any language.” There must be a lot of code out there that writes `(f x)` instead of `f(x)` but otherwise isn’t that different from FORTRAN.

All four involve the unknown function φ(*x*) in an integral with a kernel *K*(*x*, *y*) and all have an input function *f*(*x*). In all four the integration ranges from some fixed lower limit. In the **Volterra equations**, the upper limit of integration is the variable *x*, while in the **Fredholm equations**, the upper limit of integration is a fixed constant.

The so-called equations **of the first kind** only involve the unknown function φ inside the integral. The equations **of the second kind** also involve φ outside the integral.

So the four equations above are

- Volterra equation of the first kind
- Volterra equation of the second kind
- Fredholm equation of the first kind
- Fredholm equation of the second kind

Here’s a diagram to make these easier to keep straight:

In general, the theory of Volterra equations is easier than that of Fredholm equations. And while equations of the first kind look simpler at first, it’s common to reduce equations of the first kind to equations of the second kind and concentrate on the latter.

There are many variations on this theme. The *x* in Volterra equations could be a vector. The integral could be, for example, a double or triple integral. In Fredholm equations, the integration may be over a fixed general region. Maybe you’re integrating over a watermelon, as the late William Guy would say. You could have nonlinear versions of these equations where instead of multiplying *K*(*x*, *y*) times φ(*y*) you have a kernel *K*(*x*, *y*, φ(*y*)) that is some nonlinear function of φ.

You may see references to Volterra or Fredholm equations of the **third kind**. These are an extension of the second kind, where a function *A*(*x*) multiplies the φ outside the integral. Equations of the second kind are the most important since the first and third kinds can often be reduced to the second kind.
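
To make the Volterra equation of the second kind concrete, here’s a sketch of a standard numerical approach: march forward in *x*, approximating the integral with the trapezoid rule. The kernel *K* ≡ 1 and input *f* ≡ 1 below are toy choices of mine, picked because the exact solution is then exp(*x*).

```python
import numpy as np

def volterra2(f, K, x):
    # Solve phi(x) = f(x) + integral_0^x K(x, y) phi(y) dy on the grid x
    # by stepping forward and applying the trapezoid rule at each step.
    n = len(x)
    h = x[1] - x[0]
    phi = np.zeros(n)
    phi[0] = f(x[0])
    for i in range(1, n):
        s = 0.5*K(x[i], x[0])*phi[0]
        s += sum(K(x[i], x[j])*phi[j] for j in range(1, i))
        # solve for phi[i], which appears on both sides of the equation
        phi[i] = (f(x[i]) + h*s) / (1 - 0.5*h*K(x[i], x[i]))
    return phi

x = np.linspace(0, 1, 101)
phi = volterra2(lambda t: 1.0, lambda s, t: 1.0, x)
# with f = 1 and K = 1, the exact solution of phi(x) = 1 + int_0^x phi is exp(x)
```

Marching forward like this works precisely because the upper limit of integration is *x*: each new value of φ depends only on values already computed. A Fredholm equation, with its fixed upper limit, instead couples all the unknowns at once and leads to a dense linear system.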

Fluctuation strength reaches its maximum at a modulation frequency of around 4 Hz. For much higher modulation frequencies, one perceives roughness rather than fluctuation. The reference value for one vacil is a 1 kHz tone, fully modulated at 4 Hz, at a sound pressure level of 60 decibels. In other words

(1 + sin(8π*t*)) sin(2000π*t*)

where *t* is time in seconds.

Since the carrier frequency is 250 times greater than the modulation frequency, you can’t see both in the same graph. In this plot, the rapid oscillations of the carrier fill in as solid blue, while the modulation is visible as the envelope.

Here’s what the reference for one vacil would sound like:

**See also**: What is an asper?

(1 + sin(140π*t*)) sin(2000π*t*)

where *t* is time in seconds.

Here’s what that sounds like (if you play this at 60 dB, about the loudness of a typical conversation at one meter):

And here’s the Python code that made the file:

```python
from scipy.io.wavfile import write
from numpy import arange, pi, sin, int16

def f(t, f_c, f_m):
    # t   = time
    # f_c = carrier frequency
    # f_m = modulation frequency
    return (1 + sin(2*pi*f_m*t))*sin(2*f_c*pi*t)

def to_integer(signal):
    # Take samples in [-1, 1] and scale to 16-bit integers,
    # values between -2^15 and 2^15 - 1.
    return int16(signal*(2**15 - 1))

N = 48000        # samples per second
x = arange(3*N)  # three seconds of audio

# 1 asper corresponds to a 1 kHz tone, 100% modulated at 70 Hz, at 60 dB
data = f(x/N, 1000, 70)
write("one_asper.wav", N, to_integer(data))
```

**See also**: What is a vacil?

and we can generalize this to the Mittag-Leffler function

$$E_{\alpha, \beta}(x) = \sum_{k=0}^\infty \frac{x^k}{\Gamma(\alpha k + \beta)}$$

which reduces to the exponential function when α = β = 1. There are a few other values of α and β for which the Mittag-Leffler function reduces to more familiar functions. For example,

$$E_{2,1}(x) = \cosh(\sqrt{x})$$

and

$$E_{1/2,\, 1}(x) = e^{x^2} \operatorname{erfc}(-x)$$

where erfc(*x*) is the complementary error function.

Mittag-Leffler was one person, not two. When I first saw the Mittag-Leffler theorem in complex analysis, I assumed it was named after two people, Mittag and Leffler. But the theorem and the function discussed here are named after one man, the Swedish mathematician Magnus Gustaf (Gösta) Mittag-Leffler (1846–1927).

The function that Mr. Mittag-Leffler originally introduced did not have a β parameter; that generalization came later. The function *E*_{α} is *E*_{α, 1}.
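
To compute the function numerically you can sum the series directly. Here’s a quick sketch of mine; it’s fine for moderate arguments, though serious implementations use more careful algorithms.

```python
from math import gamma, exp

def mittag_leffler(x, alpha, beta=1.0, terms=100):
    # partial sum of sum_k x^k / Gamma(alpha*k + beta);
    # keep alpha*terms modest, or gamma() will overflow
    return sum(x**k / gamma(alpha*k + beta) for k in range(terms))

# sanity check: E_{1,1} is the exponential function
val = mittag_leffler(2.0, 1.0)
```

Since the gamma function in the denominator grows super-exponentially, the series converges for every *x*, just as the exponential series does.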

Just as you can make a couple probability distributions out of the exponential function, you can make a couple probability distributions out of the Mittag-Leffler function.

The exponential function exp(-*x*) is positive over [0, ∞) and integrates to 1, so we can define a probability distribution whose density function (PDF) is *f*(*x*) = exp(-*x*) and whose distribution function (CDF) is *F*(*x*) = 1 – exp(-*x*). The Mittag-Leffler distribution has CDF 1 – *E*_{α}(-*x*^{α}) and so reduces to the exponential distribution when α = 1. For 0 < α < 1, the Mittag-Leffler distribution is a heavy-tailed generalization of the exponential. [1]

The Poisson distribution comes from taking the power series for exp(λ), normalizing it to 1, and using the *k*th term as the probability mass for *k*. That is,

$$P(X = k) = \frac{1}{\exp(\lambda)} \frac{\lambda^k}{k!}$$

The analogous discrete Mittag-Leffler distribution [2] has probability mass function

$$P(X = k) = \frac{1}{E_{\alpha, \beta}(\lambda)} \frac{\lambda^k}{\Gamma(\alpha k + \beta)}$$

In addition to probability and statistics, the Mittag-Leffler function comes up in fractional calculus. It plays a role analogous to that of the exponential function in classical calculus. Just as the solution to the simple differential equation

$$\frac{dy}{dx} = ay$$

is exp(*ax*), for 0 < μ < 1, the solution to the **fractional differential equation**

$$\frac{d^\mu y}{dx^\mu} = ay$$

is *x*^{μ-1} *E*_{μ, μ}(*a* *x*^{μ}). Note that this reduces to exp(*ax*) when μ = 1. [3]

[1] Gwo Dong Lin. Journal of Statistical Planning and Inference 74 (1998) 1–9, On the Mittag–Leffler distributions

[2] Subrata Chakraborty, S. H. Ong. Mittag-Leffler function distribution: A new generalization of hyper-Poisson distribution. arXiv:1411.0980v1

[3] Keith Oldham, Jan Myland, Jerome Spanier. An Atlas of Functions. Springer.

This post was inspired by a paper by Brian Beckman (in progress) that shows how a Kalman filter can be implemented as a fold. From a Bayesian perspective, the thing that makes the Kalman filter work is that a certain multivariate normal model has a conjugate prior. This post shows that conjugate models more generally can be implemented as folds over the data. That’s interesting, but what does it buy you? Brian’s paper discusses this in detail, but one advantage is that it completely separates the accumulator function from the data, so the former can be tested in isolation.
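
The idea in miniature (my own toy example, not from the paper): a conjugate Bayesian update is an accumulator function, so updating on a data stream is literally a fold. Here’s a beta-binomial version.

```python
from functools import reduce

def update(prior, datum):
    # conjugate beta-binomial update; datum is 0 or 1
    a, b = prior
    return (a + datum, b + 1 - datum)

data = [1, 0, 1, 1, 0, 1]
posterior = reduce(update, data, (1, 1))  # (1, 1) is a uniform prior
```

Note that `update` never touches the data stream itself, so you can unit test it on a handful of (prior, datum) pairs without any notion of where the data comes from — the separation of concerns mentioned above.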

At the time Brian was working on one big paper in private. This has since been split into several papers and they’re now public.

- Kalman Folding, Part 1
- Kalman Folding 2: Tracking and System Dynamics
- Kalman Folding 3: Derivations
- Kalman Folding 4: Streams and Observables
- Kalman Folding 5: Non-Linear models and the EKF
- Kalman Folding 6: ??
- Kalman Folding 7: A Small Streams Library


This made me think of corners in the geometric sense. If you have a sphere in a box in high dimensions, **nearly all the volume is in the corners**, i.e. outside the sphere. This is more than a metaphor. You can think of software options geometrically, with each independent choice corresponding to a dimension. Paths through a piece of software that are individually rare may account for nearly all use when considered together.

With a circle inside a square, nearly 78.5% of the area is inside the circle. With a ball sitting inside a 3-D box, 52.4% of the volume is inside the ball. As the dimension increases, the proportion of volume inside the sphere rapidly decreases. For a 10-dimensional sphere sitting in a 10-dimensional box, 0.25% of the volume is in the sphere. Said another way, 99.75% of the volume is in the corners.

When you go up to 100 dimensions, the proportion of volume inside the sphere is about 2 parts in 10^{70}, a 1 followed by 70 zeros [1]. If 100 dimensions sounds like pure fantasy, think about a piece of software with more than 100 features. Those feature combinations multiply like geometric dimensions [2].

Here’s a little Python code you could use to see how much volume is in a sphere as a function of dimension.

```python
from scipy.special import gamma
from math import pi

def unit_sphere_volume(n):
    return pi**(0.5*n)/gamma(0.5*n + 1)

def unit_cube_volume(n):
    return 2**n

def ratio(n):
    return unit_sphere_volume(n) / unit_cube_volume(n)

print( [ratio(n) for n in range(1, 20)] )
```

* * *

[1] There are names for such extremely large numbers. These names are hardly ever used—scientific notation is much more practical—but they’re fun to say. 10^{70} is ten duovigintillion in American nomenclature, ten undecilliard in European.

[2] Geometric dimensions are perfectly independent, but software feature combinations are not. In terms of logic, some combinations may not be possible. Or in terms of probability, the probability of exploring some paths is conditional on the probability of exploring other paths. Even so, there are inconceivably many paths through any large software system. And in large-scale operations, events that should “never happen” happen regularly.


People best understand computer programs in a different order than compilers do. This is a key idea of literate programming, and one that distinguishes literate programs from heavily commented programs.

Traditional source code, no matter how heavily commented, is presented in the order dictated by the compiler. The computer is the primary audience. Literate programming is more humanistic in the sense that the primary audience is a human. The *computer* has to go to extra effort to arrange the code for its needs. As Donald Knuth describes it in his book on literate programming,

The practitioner of literate programming … strives for a program that is comprehensible because its concepts have been introduced **in an order that is best for human understanding**, using a mixture of formal and informal methods that nicely reinforce each other. [emphasis added]

There are two steps in processing literate programs: **weaving** and **tangling**. You take files containing prose and code, and *weave* them into documentation and *tangle* them into source code. Tools like Sweave and Pweave focus on the weave process, as their names imply. The weave side of literate programming has gotten the most attention.

A half-hearted approach to literate programming doesn’t require much of a tangle process. A well-commented program has no tangle step at all. A weave document that follows the order of the source code has a trivial tangle step: save the code to its own file, manually or automatically, but don’t rearrange it. But a full-fledged literate program may make the tangle program work harder, rearranging code fragments from human-friendly to compiler-friendly order.
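
To see what the tangle step does, here’s a minimal sketch of mine, using noweb-style `<<name>>=` chunk definitions. The sample document explains the main program before its helper, and tangling rearranges the chunks into the order the interpreter needs.

```python
import re

def tangle(source, root):
    # collect named chunks: "<<name>>=" starts one, a lone "@" ends it
    chunks, name = {}, None
    for line in source.splitlines():
        m = re.match(r'<<(.+)>>=$', line)
        if m:
            name = m.group(1)
            chunks.setdefault(name, [])
        elif line.strip() == '@':
            name = None
        elif name is not None:
            chunks[name].append(line)

    def expand(name):
        # recursively replace chunk references with their contents
        out = []
        for line in chunks[name]:
            m = re.match(r'\s*<<(.+)>>$', line)
            out.extend(expand(m.group(1)) if m else [line])
        return out

    return "\n".join(expand(root))

# human order: the main program first, the helper explained afterward
doc = """<<main>>=
<<helpers>>
print(f(2))
@
<<helpers>>=
def f(x):
    return x*x
@"""

code = tangle(doc, "main")
```

After tangling, `code` holds the helper definition first and the call to it second — compiler order, not presentation order.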

The most obvious feature of literate programming is that it requires careful explanation. Here’s more from the paragraph I quoted above, filling in the part I left out.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with explanation and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible …

The discipline of explaining every piece of code leads to better code. It serves a similar purpose to writing unit tests. I saw somewhere—I can’t remember where now—that Knuth hates unit testing and sees it as redundant effort. Presumably this is because unit testing and literate programming overlap. Unit tests are a kind of commentary on code, explaining how it is used, exploring its limitations, etc.

Knuth understands that literate programming doesn’t replace the need for *testing*, just *unit testing*. He explained somewhere—again I forget where—that he would test TeX by spending days at a time trying fiendishly to break it.

When I read Knuth’s book, I could see the value of carefully explaining code. What I didn’t appreciate was the value of presenting code in a different order than the order of source code.

I’m working on a project now where a few lines of code may require a few paragraphs of explanation. That’s what got me thinking about literate programming. My intention was to write my documentation in the same order as the code. It took a while to realize I had stumbled on an ideal application of literate programming: a complicated algorithm that needs to be explained carefully, both in order to hand over to the client and to reduce the chances of errors. The best order to understand this algorithm is definitely not top-down going through the code.

I think I understand better now why literate programming hasn’t gained much of an audience. I used to think that it was because developers hate writing prose. That’s part of it. Most programmers I’ve worked with would much rather write a hundred lines of unit tests than write one complete sentence.

But that’s not the whole story. There are quite a few programmers who are willing and able to write prose. Why don’t more of them use literate programming?

I think part of the reason is that having a non-trivial tangle process is a barrier to adoption. A programmer can decide to start writing more extensive comments, gradually working up to essay-like explanations. But it’s one thing to say “I’m going to heavily comment my C code” and quite another to say “I’m not going to write C per se any more. I’m going to write CWEB files that compile to C.” Even if a programmer wants to write CWEB in secret, just checking in the tangled C files, other programmers will edit these C files and the changes won’t be reflected in the CWEB source. Also, the output of tangle is *less* readable than ordinary code. The programmer secretly using CWEB as a preprocessor would appear to be writing undocumented code.

Tricky code benefits from a literate presentation, but routine code does not benefit so much. You either have to have two ways of writing code—straight source for routine code and literate programs for the tricky parts—or impose the overhead of literate programming everywhere. Most code is mundane and repetitive, challenging because of its volume rather than its cleverness. Knuth doesn’t write this kind of code. He only writes code that benefits from a literate presentation.

To write a good literate program, not only do you need to be able to program, and be willing and able to write good prose, on top of that you need a good sense for storytelling, arranging the code for the benefit of other readers. If this is done poorly, the result is harder to understand than a traditional program.

I may use literate programming more now that I’m starting to understand it, at least for my own benefit and hopefully for the benefit of clients. I usually deliver algorithms or libraries, not large applications, and so it wouldn’t be too much work to create two versions of my results. I could create a literate program, then weave a report, and manually edit the tangled code into a traditional form convenient for the client.

Any number in Pascal’s triangle that is *not* in the outer two layers will appear at least three times, usually four.

Pascal’s triangle contains the binomial coefficients C(*n*, *r*) where *n* ranges over non-negative integers and *r* ranges from 0 to *n*. The outer layers are the elements with *r* equal to 0, 1, *n*-1, and *n*.

We’ll write some Python code to explore how often the numbers up to 1,000,000 appear. How many rows of Pascal’s triangle should we compute? The smallest number on row *n* is C(*n*, 2). Now 1,000,000 is between C(1414, 2) and C(1415, 2) so we need row 1414. This means we need *N* = 1415 below because the row numbers start with 0.

I’d like to use a NumPy array for storing Pascal’s triangle. In the process of writing this code I realized that a NumPy array with `dtype int` doesn’t contain Python’s arbitrary-sized integers. This makes sense because NumPy is designed for efficient computation, and using a NumPy array to contain huge integers is unnatural. But I’d like to do it anyway, and the way to make it happen is to set `dtype` to `object`.

```python
import numpy as np
from collections import Counter

N = 1415 # Number of rows of Pascal's triangle to compute

Pascal = np.zeros((N, N), dtype=object)
Pascal[0, 0] = 1
Pascal[1, 0] = Pascal[1, 1] = 1

for n in range(2, N):
    for r in range(0, n+1):
        Pascal[n, r] = Pascal[n-1, r-1] + Pascal[n-1, r]

c = Counter()
for n in range(4, N):
    for r in range(2, n-1):
        p = Pascal[n, r]
        if p <= 1000000:
            c[p] += 1
```

When we run this code, we find that our counter contains 1732 elements. That is, of the numbers up to one million, only 1732 of them appear inside Pascal’s triangle when we disallow the outer two layers. (The second layer contains consecutive integers, so every positive integer appears in Pascal’s triangle. But most integers *only* appear in the second layer.)

When *Single Digits* speaks of “Any number in Pascal’s triangle that is *not* in the outer two layers” this cannot refer to numbers that are not in the outer two layers because *every* natural number appears in the outer two layers. Also, when it says the number “will appear at least three times, usually four” it is referring to the entire triangle, i.e. including the outer two layers. So another way to state the sentence quoted above is as follows.

Define the interior of Pascal’s triangle to be the triangle excluding the outer two layers. Every number *n* in the interior of Pascal’s triangle appears twice more outside of the interior, namely as C(*n*, 1) and C(*n*, *n*-1). Most often *n* appears at least twice in the interior as well.

This means that any number you find in the interior of Pascal’s triangle, interior meaning not in the two outer layers, will appear at least three times in the full triangle, usually more.

Here are the statistics from our code above regarding just the *interior* of Pascal’s triangle.

- One number, 3003, appears six times.
- Six numbers appear four times: 120, 210, 1540, 7140, 11628, and 24310.
- Ten numbers appear only once: 6, 20, 70, 252, 924, 3432, 12870, 48620, 184756, and 705432.
- The large majority of numbers, 1715 out of 1732, appear twice.

This is very much like the idea of homotopy from topology, a continuous deformation of one thing into another. No discontinuities along the way — no ripping, no jumping suddenly from one thing to another.

* * *

**Working with people with complementary skills is a blast**, but you’re unlikely to experience that in an academic project. You might get some small degree of specialization. Maybe one of the mechanical engineers on a project has more artistic ability than the other mechanical engineers, for example. But this is hardly like the experience of working with a team of people who are all great at different things.


How could we do better? One proposed strategy is the Thue-Morse sequence. Someone has to choose first, so let’s say it’s A. Then B chooses next. At this point A is still at an advantage, so we let B choose *again* before giving A the next pick. So the first four choices are ABBA. The next four choices reverse this: BAAB. Then the next eight choices reverse the pattern of the first eight choices. So the first 16 choices are ABBABAABBAABABBA. This process has been called “taking turns taking turns.”

In terms of 0’s and 1’s, the sequence is defined by *t*_{0} = 0, *t*_{2n} = *t*_{n}, and *t*_{2n+1} = 1 – *t*_{2n}.
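
There’s also a handy closed form: *t*_{n} is the parity of the number of 1 bits in the binary expansion of *n*, which is easy to check against the recurrence. A one-liner sketch:

```python
def thue_morse(n):
    # t_n is the parity of the number of 1 bits in n
    return bin(n).count("1") % 2

# reproduce the first 16 picks described above
picks = "".join("AB"[thue_morse(n)] for n in range(16))
```

This agrees with the recurrence because appending a 0 bit (going from *n* to 2*n*) preserves the bit count, while appending a 1 bit (going to 2*n* + 1) flips its parity.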

How well does this procedure work? Let’s simulate it by generating random values representing skill levels. We’ll sort the values and assign them to the two teams using the Thue-Morse sequence and look at the difference between the total point values on each team. Equivalently, we can sum the values, alternating the signs of the terms according to the Thue-Morse sequence.

```python
import numpy as np

# Thue-Morse sequence
TM = np.array([1, -1, -1, 1])

# Simple alternating signs
Alt = np.array([1, -1, 1, -1])

N = 1000000 # number of simulations
y = TM      # change y to Alt to swap out strategies
total = 0

for _ in range(N):
    x = sorted(np.random.random(4), reverse=True)
    total += sum(x*y)

print(total/N)
```

When we use the alternating sequence, there’s an average skill difference between the two teams of around 0.4. This is a huge imbalance relative to expected total skill of 2. (Each skill is a uniform random value between 0 and 1, so the total skill of all four players is 2 on average.)

When we use the Thue-Morse sequence, the average difference in my simulation was 0.000144. With one million simulations, our results are good to about three decimal places [1], and so the difference is indistinguishable from 0 to the resolution of our simulation. (When I repeated the simulation I got -0.000635 as the average difference.)

There are several ways to explore this further. One would be to work out the exact expected values analytically. Another would be to look at distributions other than uniform.

* * *

[1] The error in the average of *N* simulations is on the order of 1/√*N* by the central limit theorem.

*p*(*x*) = (*x*^{2} – 3)^{2} + (*x* + 7)^{2}

is obviously never negative for real values of *x*. What about the converse: If a real polynomial is never negative, is it a sum of squares? Yes, indeed it is.

What about polynomials in two variables? There the answer is no. David Hilbert (1862–1943) knew that there must be positive polynomials that are not a sum of squares, but no one produced a specific example until Theodore Motzkin in 1967. His polynomial

*p*(*x*, *y*) = 1 – 3*x*^{2}*y*^{2} + *x*^{2} *y*^{4} + *x*^{4} *y*^{2}

is never negative but cannot be written as a sum of any number of squares. Here’s a plot:

Source: Single Digits
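
As a quick numerical sanity check (my own sketch, not from the book), you can evaluate Motzkin’s polynomial on a grid and confirm it never dips below zero; its minimum value of 0 is attained at (*x*, *y*) = (±1, ±1).

```python
import numpy as np

def p(x, y):
    # Motzkin's polynomial: nonnegative, yet not a sum of squares
    return 1 - 3*x**2*y**2 + x**2*y**4 + x**4*y**2

xs = np.linspace(-2, 2, 401)
X, Y = np.meshgrid(xs, xs)
vals = p(X, Y)

lowest = vals.min()  # should be (essentially) zero, never negative
```

Of course a grid search proves nothing; the standard proof of nonnegativity applies the AM-GM inequality to the terms *x*⁴*y*², *x*²*y*⁴, and 1.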


Some zip codes are so sparsely populated that people living in these areas are relatively easy to identify if you have other data. The so-called Safe Harbor provision of HIPAA (Health Insurance Portability and Accountability Act) says that it’s usually OK to include the first three digits of someone’s zip code in de-identified data. But there are 17 areas so thinly populated that even listing the first three digits of their zip code is considered too much of an identification risk. These are areas such that the first three digits of the zip code are:

- 036
- 059
- 063
- 102
- 203
- 556
- 692
- 790
- 821
- 823
- 830
- 831
- 878
- 879
- 884
- 890
- 893

This list could change over time. These are the regions that currently contain fewer than 20,000 people, the criterion given in the HIPAA regulations.

Knowing that someone is part of an area containing 20,000 people hardly identifies them. The concern is that in combination with other information, zip code data is more informative in these areas.

**Related post**: Bayesian clinical trials in one zip code

When I was in college, I sat in on a communication workshop for Latin American preachers. This was unusual since I’m neither Latin American nor a preacher, but I’m glad I was there.

I learned several things in that workshop that I’ve used ever since. For example, when you’re gesturing about something moving forward in time, move your hand from left to right **from the audience’s perspective**. Since English speakers (and for the audience of this workshop, Spanish speakers) read from left to right, we think of time progressing from left to right. If you see someone talking about time moving forward, but you see motion from right to left, you feel a subtle cognitive dissonance. (Presumably you should reverse this when speaking to an audience whose primary language is Hebrew or Arabic.)

Another lesson from that workshop, the one I want to focus on here, is that you don’t always need to convey how you arrived at an idea. Specifically, the leader of the workshop said that if you discover something interesting from reading the New Testament in Greek, you can usually present your point persuasively using the text in your audience’s language without appealing to Greek. This isn’t always possible—you may need to explore the meaning of a Greek word or two—but you can use Greek for your personal study without necessarily sharing it publicly. **The point isn’t to hide anything, only to consider your audience**. In a room full of Greek scholars, bring out the Greek.

This story came up in a recent conversation with Brent Yorgey about category theory. You might discover something via category theory but then share it without discussing category theory. If your audience is well versed in category theory, then go ahead and bring out your categories. But otherwise your audience might be bored or intimidated, as many people would be listening to an argument based on the finer points of Koine Greek grammar. Microsoft’s LINQ software, for example, was inspired by category theory principles, but you’d be hard pressed to find any reference to this because most programmers don’t want to know or need to know where it came from. They just want to know how to use it.

Some things may sound profound when expressed in esoteric language, such as category theory or Koine Greek, that don’t seem so profound in more down-to-earth language. Expressing yourself in a different language helps filter out pedantry from useful ideas. (On the other hand, some things that looked like pure pedantry have turned out to be very useful. Some hairs are worth splitting.)

Sometimes you have to introduce a new term because there isn’t a colloquial counterpart. Monads are a good example, a concept from category theory that has entered software development. A monad is what it is, and analogies to burritos and other foods don’t really help. Better to introduce the term and say plainly what it is.

* * *

More on applied category theory

These two subjects have a lot of overlap, and some tweets will combine both, but many will be strictly about one or the other. So some content will be strictly about programming, some strictly about math, and some combining ideas from both.

The length of a phone number varies by country, but a US phone number is a 10-digit number, and 10-digit numbers are often divisible by three different prime numbers, give or take a couple. Assuming phone numbers are scattered among possible 10-digit numbers in a way that doesn’t bias their number of prime factors, these numbers will often have between 1 and 5 prime factors. If a country has 9- or 11-digit phone numbers, the result is essentially the same.

Let ω(*n*) be the number of distinct prime factors of *n*. Then the Erdős–Kac theorem says roughly that ω(*n*) is distributed like a normal random variable with mean and variance log log *n*. More precisely, fix two numbers *a* and *b*. For a given value of *x*, count the proportion of positive integers less than *x* where

(ω(*n*) – log log *n*) / sqrt( log log *n*)

is between *a* and *b*. Then in the limit as *x* goes to infinity, this proportion approaches the probability that a standard normal random variable is between *a* and *b*.

So by that heuristic, the number of distinct prime factors of a 10-digit number is approximately normally distributed with mean and variance log log 10^11 = 3.232, and such a distribution is between 1 and 5 around 73% of the time.
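
Here’s a quick check of that 73% figure using the normal heuristic above (a sketch; the mean and variance follow the post’s choice of 10^{11}):

```python
from math import log, sqrt
from scipy.stats import norm

mu = log(log(10**11))   # mean and variance are both log log n
sigma = sqrt(mu)

# probability a normal(mu, sigma) variable lands between 1 and 5
p = norm.cdf((5 - mu)/sigma) - norm.cdf((1 - mu)/sigma)
```

The computed probability comes out near 0.73, matching the figure quoted above.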

My business phone number, for example, is 8324228646. Obviously this is divisible by 2. In fact it equals 2 × 3^{2} × 462457147, and so it has exactly three distinct prime factors: 2, 3, and 462457147.

Here’s how you could play with this using Python.

```python
from sympy.ntheory import factorint

def omega(n):
    return len(factorint(n))
```

I looked in SymPy and didn’t see an implementation of ω(*n*) directly, but it does have a function factorint that returns the prime factors of a number, along with their multiplicities, in a dictionary. So ω(*n*) is just the size of that dictionary.

I took the first 20 phone numbers in my contacts and ran them through `omega` and got results consistent with what you’d expect from the theory above. One was prime, and none had more than five factors.
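
If you’d rather not pull in SymPy, a dependency-free ω(*n*) via trial division works fine at phone-number scale (a sketch of mine; `factorint` is the better tool for large inputs):

```python
def omega(n):
    # count distinct prime factors by trial division
    count, p = 0, 2
    while p*p <= n:
        if n % p == 0:
            count += 1
            while n % p == 0:
                n //= p
        p += 1
    # whatever is left over is a prime factor itself
    return count + (1 if n > 1 else 0)

factors_of_phone = omega(8324228646)  # 2 * 3^2 * 462457147
```

Trial division only needs to run up to the square root of the remaining cofactor, so even the large prime factor above is handled quickly.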

The examples in the ditaa site all have arrows between boxes, but you don’t have to have boxes.

Here’s the ditaa source:

```
A₀ ---> A₁ ---> A₂ ---> A₃ ---> A₄
|       |       |       |       |
| f₀    | f₁    | f₂    | f₃    | f₄
|       |       |       |       |
v       v       v       v       v
B₀ ---> B₁ ---> B₂ ---> B₃ ---> B₄
```

and here’s the image it produces:

It’s not pretty. You could make a nicer image with LaTeX. But as the old saying goes, the remarkable thing about a dancing bear is not that it dances well but that it dances at all.

The trick to getting the subscripts is to use Unicode characters 0x208*n* for subscript *n*. As I noted at the bottom of this post, ditaa isn’t strictly limited to ASCII art. You can use Unicode characters as well. You may or may not be able to see the subscripts in the source code; they are not part of the most widely supported set of characters.

* * *

[1] The Five Lemma is a diagram-chasing result from homological algebra. It lets you infer properties of the middle function *f* from properties of the other *f*‘s.

Someone has proved that the limiting distribution of leading digits of factorials exactly satisfies Benford’s law. But if we didn’t know this, we might use a chi-square statistic to measure how well the empirical results match expectations. As I argued in the previous post, statistical tests don’t apply here, but they can be useful anyway in giving us a way to measure the size of the deviations from theory.

Benford’s law makes a better illustration of the chi-square test than the example of prime remainders because the bins are unevenly sized, which they’re allowed to be in general. In the prime remainder post, they were all the same size.

The original post on leading digits of factorial explains why we compute the leading digits the way we do. Only one detail has changed: the original post used Python 2 and this one uses Python 3. Integer division was the default in Python 2, but now in Python 3 we have to use `//` to explicitly ask for integer division, floating point division being the new default.

Here’s a plot of the distribution of the leading digits for the first 500 factorials.

And here’s code to compute the chi-square statistic:

from math import factorial, log10

def leading_digit_int(n):
    while n > 9:
        n = n//10
    return n

def chisq_stat(O, E):
    return sum( [(o - e)**2/e for (o, e) in zip(O, E)] )

# Waste the 0th slot to make the code simpler.
digits = [0]*10
N = 500
for i in range(N):
    digits[ leading_digit_int( factorial(i) ) ] += 1

expected = [ N*log10(1 + 1/n) for n in range(1, 10) ]

print( chisq_stat(digits[1:], expected) )

This gives a chi-square statistic of 7.693, very near the mean value of 8 for a chi-square distribution with eight degrees of freedom. (There are eight degrees of freedom, not nine, because if we know how many factorials start with the digits 1 through 8, we know how many start with 9.)

So the chi-square statistic suggests that the deviation from Benford’s law is just what we’d expect from random data following Benford’s law. And as we said before, this suggestion turns out to be correct.


Using statistical tests on number theory problems is kind of odd. There’s nothing random going on, so in that sense the whole enterprise is unjustified. Nevertheless, statistical tests can be suggestive. They certainly don’t prove theorems, but they can give reason to suspect a theorem is true or false. In that sense, applying statistical tests to number theory isn’t all that different from applying them to more traditional settings.

First we’ll look at the remainders of primes modulo 4. Except for 2, all primes are odd, and so they have remainder either 1 or 3 when divided by 4. Brian Hayes wrote recently that Chebyshev noticed in the 1850s that there seem to be more primes with remainder 3. Is the imbalance larger than one would expect to see from fair coin tosses?

Here’s some Python code to find the proportion of the first million primes (after 2) that have remainder 3 when divided by 4.

from sympy import prime
from math import sqrt

n = 1000000
rem3 = 0
for i in range(2, n+2):
    if prime(i) % 4 == 3:
        rem3 += 1
p_hat = rem3/n

This shows that of the first million odd primes, 500,202 are congruent to 3 mod 4. Would it be unusual for a coin to come up heads this many times in a million flips? To find out we’d compute the z-statistic:

z = (p̂ − p) / √(pq/n)

Here *p* is the proportion under the null hypothesis, *q* = 1 – *p*, and *n* is the sample size. In our case, the null hypothesis is *p* = 0.5 and *n* = 1,000,000. [1]

The code

p = 0.5
q = 1 - p
z = (p_hat - p)/sqrt(p*q/n)

shows that z = 0.404, hardly a suspiciously large value. If we were looking at random values we’d see a z-score this large or larger 34% of the time. So in this case the test doesn’t suggest much.
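If you don’t have SciPy handy, the one-sided probability above can be sketched with the standard normal survival function built from `math.erfc`; the function name `normal_sf` is mine.

```python
from math import erfc, sqrt

def normal_sf(z):
    # P(Z >= z) for a standard normal Z
    return 0.5 * erfc(z / sqrt(2))

print(normal_sf(0.404))  # about 0.34
```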

[1] The derivation of the z statistic is fairly quick. If the proportion of successes is *p*, then the number of successes out of *n* trials is binomial(*n*, *p*). For large *n*, this has approximately the same distribution as a normal distribution with the same mean and variance, mean *np* and variance *npq*. The *proportion* of successes then has approximately mean *p* and standard deviation √(*pq/n*). Subtracting the mean and dividing by the standard deviation normalizes the distribution to have mean 0 and variance 1. So under the null hypothesis, the z statistic has a standard normal distribution.

Four posts about musical instrument acoustics:

- It could be quicker than creating a graphical image.
- You can paste them into plain text documents like source code files.
- They can be produced programmatically.
- There is software to turn ASCII art into more polished images.

Today I’ll post a few notes about how to create graphical versions of ASCII diagrams in Emacs with org-mode.

You can embed and execute source code in org-mode files. I wrote a couple posts about this, one showing how to run Python and R inside org-mode and another showing how to mix languages in org-mode. The latter shows Perl calling R and Python, all in 14 lines of code.

There are currently 39 programming languages that org-mode can call by default. In addition to conventional programming languages like the ones listed above, org-mode also supports ditaa, the language that treats ASCII art as a specification language to produce graphics.

You can edit code blocks just as you would other text in an org-mode file. But if you’d like to edit a code block in its language’s mode, type `C-c '` from inside the code block. If you’re editing Python code, for example, this will open a new window, in Python mode, containing just that code block. If you type `C-c '` inside a ditaa code block, Emacs opens a new window in “artist mode,” a mode specifically for editing ASCII art.

You can run code inside org-mode two ways: interactively and for export. With your cursor inside a block of code, type `C-c C-c` to execute the code and report the results. You can also export an entire org-mode document and have the results of code execution embedded in the final document. This works much the same as “*weave” projects like Sweave, Pweave, and Hweave. But while each of these is specific to a particular programming language (R, Python, and Haskell respectively), org-mode works with dozens of languages, including ditaa.

You embed ditaa code just like you’d embed any other code. In the first post mentioned above, I gave this example of calling R:

#+begin_src R
sqrt(42)
#+end_src

Here’s the analogous code for ditaa:

#+BEGIN_SRC ditaa :file foo.png
+-------+
| Hello |
+-------+
#+END_SRC

The markers to begin and end a source code segment can be upper or lower case. I used lower case in my previous post, but it may be more readable to use upper case so that the markers stand out better from their arguments.

The R example above didn’t use any header arguments, though it could have. With ditaa, you must provide a header argument: the name of the file to save the graphics in.

If you run the ditaa code by navigating inside the code block and running `C-c C-c`, Emacs will add a couple lines after the code block:

#+RESULTS:
[[file:foo.png]]

This is the literal text, what you’d see if you opened the file in another editor. But org-mode uses the double brackets for links. You wouldn’t see the double brackets. Instead you’d see a hyperlink with text `file:foo.png`. Clicking on that link opens the image.

You can export the org-mode file with the command `C-c C-e`. This brings up a menu of export options: `h` for HTML, `l` for LaTeX, etc. Within each of these are further options. For example, you could type `l` for LaTeX and then type `l` again to select export to LaTeX, or type `p` to have org-mode run LaTeX on the file and produce a PDF. If you know you want a PDF, you could do this all in one command: `C-c C-e l p`.

You can control whether org-mode exports code or the *results* of running code (or both or neither). Org-mode exports the results of ditaa code, i.e. graphics files, by default. This makes sense: your PDF file will have a nice image version of your diagram, not the ASCII art used as its source code.

By default, the only programming language you can run inside org-mode is Emacs Lisp. This makes sense. You want to be deliberate about what code you run, but if you don’t want to run Emacs Lisp you’d better not run Emacs!

Inside your Emacs configuration file, you specify what languages org-mode is allowed to run. Here’s an example allowing Python and ditaa:

(org-babel-do-load-languages
 'org-babel-load-languages
 '((python . t)
   (ditaa . t)))

Recent versions of Emacs are supposed to ship with ditaa, but the ditaa jar file was missing from the two computers I experimented with. Emacs complained

org-babel-execute:ditaa: Could not find ditaa.jar at ...

and so I copied `ditaa.jar` to the place where Emacs said it was looking for it. That worked, but it’s kind of a kludge because now I had two copies of the jar file. A better solution is to tell Emacs where you already have ditaa.

I like to use the exact same `init.el` file on every machine I use and so I added a section where I put OS-specific and machine-specific configuration. Here I put a different path to ditaa for each OS.

;;----------------------------------------------------------------------
;; OS specific
(cond
 ((string-equal system-type "windows-nt") ; Microsoft Windows
  (progn
    (setq-default ispell-program-name "C:/bin/Aspell/bin/aspell.exe")
    ;; etc.
    (setq org-ditaa-jar-path "c:/bin/ditaa/ditaa.jar")))
 ((string-equal system-type "gnu/linux") ; Linux
  (progn
    (setq x-select-enable-clipboard t)
    ;; etc.
    (setq org-ditaa-jar-path "/usr/bin/ditaa"))))

Old technologies never die. Instead, their range of application shrinks. Or maybe it grows when conditions change.

ASCII art, drawing pictures with fixed-width plain text characters, is no longer how many people want to produce diagrams. Just fire up Adobe Illustrator and you get incomparably more resolution of expression.

And yet there are times when ASCII art comes in handy. You can, for example, paste it into source code files. Someone more familiar with Emacs than Illustrator may be able to produce a simple diagram in the former faster than the latter. And it can be relatively easy to programmatically produce a large number of ASCII art diagrams, depending on the nature of the diagrams.
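As a sketch of the programmatic angle, here’s how you might generate a ditaa-style box from Python. The function `ditaa_box` is a hypothetical helper of mine, not part of any library.

```python
def ditaa_box(label):
    # Wrap a label in a box drawn with +, -, and | characters.
    inner = "| " + label + " |"
    border = "+" + "-" * (len(inner) - 2) + "+"
    return "\n".join([border, inner, border])

print(ditaa_box("Hello"))
```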

It’s also possible to use ASCII art as a way to specify nicely rendered images. I’ll show how to do that with ditaa below.

Here’s an ASCII version of the conjugate prior diagram I made some time ago:

                      +-------------+
                      |             |
                      | Exponential |
                      |             |
                      +-------------+
                             |
                             | lambda
                             v
+-------------+       +-------------+       +-------------+
|             |  tau  |             | lambda|             |
|  Lognormal  |------>|    Gamma    |<------|   Poisson   |
|             |       |             |---+   |             |
+-------------+       +-------------+   |   +-------------+
   |                     ^      ^       |
   |                     |      | beta  |
   |                     |      +-------+
   |                 tau |
   |                     |
   |                  +-------------+
   | mu               |             |
   +----------------->|   Normal    |
                      |             |----+
                      +-------------+    |
                           ^             | mu
                           |             |
                           +-------------+

And here’s the image produced by `ditaa` processing the ASCII diagram above:

**Update**: See my next post on how to create ASCII art diagrams and their graphic version from ditaa using Emacs org mode.

**Update**: When I first made the diagram above, I tried using Greek letters, e.g. using β rather than “beta,” but this didn’t work. I thought “OK, I suppose it’s not really *ASCII* art if you use non-ASCII characters.” But someone told me Unicode characters worked for him, so I tried again when I wrote the follow up post and it worked.

My first attempt, from a Windows laptop, calling ditaa from the command line, did not work. My second attempt, running inside org-mode from a Windows desktop, did work. My third attempt, running from Emacs on Linux also worked.

**JC**: Chris, I know you do a lot of work with Ruby on Rails. What do you think of Ruby without Rails? Would you be as interested in Ruby if the Rails framework had been written in some other language?

**CT**: Let me back up a little bit and give you some of my background. I started out as an engineer and I used VB because it was what I had for the task at hand. Then when I decided to buckle down and become a real developer I chose Python because it seemed like the most engineering-oriented alternative. It seemed less like an enterprise language, more small and nimble. I chose Python over Ruby because of my engineering background. Python seemed more serious, while Ruby seemed more like a hipster language. Ruby sounded frivolous, but I kept hearing good things about it, especially with Rails. So like a lot of people I came to Ruby through Rails. It was the functionality and ease of use that got me hooked, but I do love Ruby as a language, the beauty and expressiveness of it. It reads more like prose than other languages. It’s designed for people rather than machines. But it’s also a very large language and hard to parse because of that. Over time though I’ve seen people abuse the looseness, the freedom in Ruby, and that’s caused me to look at stricter options like Haskell and other functional languages.

**JC**: I only looked at Ruby briefly, and when I saw the relative number of numerical libraries for Python and Ruby I thought “Well, looks like it’s Python for me.”

It seems like Ruby bears some resemblance to Perl, for better or worse.

**CT**: Absolutely. Ruby has two spiritual ancestors. One is Perl and the other is Smalltalk. I think both of those are great, and many of the things I love about Ruby come from that lineage. Perl contributed the get-things-done attitude, the looseness and terseness, the freedom to interact at any level of abstraction.

It’s kinda odd. I keep coming back to The Zen of Python. One of the things it says is that explicit is better than implicit, and I really think that’s true. And yet I work in Ruby and Rails where implicit is the name of the game. So I have some cognitive dissonance over that. I love Ruby on Rails, but I’m starting to look at other languages and frameworks to see if something else might fit as well.

**JC**: Do you have the freedom to choose what language and framework you work in? Do clients just ask for a web site, or do they dictate the technology?

**CT**: We have a mix. A lot of clients just want a web app, but some, especially large companies, want us to use their technology stack. So while we do a lot of Rails, we also do some Python, Haskell, etc.

**JC**: Do you do everything soup-to-nuts or do you have some specialization?

**CT**: We have three roles at thoughtbot: designer, web developer, and mobile developer. The designers might do some JavaScript, but they’re mostly focused on user experience, testing, and design.

**JC**: How do you keep everything straight? The most intimidating thing to me about web development is all the diverse tools in play: the language for your logic, JavaScript, CSS, HTML, SQL, etc.

**CT**: There’s definitely some of that, but we outsource some parts of the stack. We host applications on Heroku, giving them responsibility for platform management. They run on top of AWS so they handle all the scaling issues so we can focus on the code. We’ll deploy to other environments if our client insists, but our preference is to go with Heroku.

Similarly, Rails has a lot of functionality for the database layer, so we don’t write a lot of SQL by hand. We’re all knowledgeable of SQL, but we’re not DBA-level experts. We scale up on that as necessary, but we want to focus on the application.

**JC**: Shifting gears a little bit, how do you program differently in a dynamic language like Ruby than you would in a stricter language like C++? And is that a good thing?

**CT**: One thing about Ruby, and dynamic languages in general, is that testing becomes all the more critical. There are a lot of potential runtime errors you have to test for. Whereas with something like Haskell you can program a lot of your logic into the type system. Ruby lets you work more freely, but Haskell leads to more robust applications. Some of our internal software at thoughtbot is written in Haskell.

**JC**: I was excited about using Haskell, but when I used it on a production project I ran into a lot of frustrations that you wouldn’t anticipate from working with Haskell in the small.

**CT**: Haskell does seem to have a more aggressive learning curve than other languages. There’s a lot of Academia in it, and in a way that’s good. The language hasn’t compromised its vision, and it’s been able to really develop some things thoroughly. But it also has a kind of academic heaviness to it.

There’s a language out there called Elm that’s inspired by Haskell and the whole ML family of languages that compiles down to JavaScript. It presents a friendlier interface to the whole type-driven, functional way of thinking. The developers of the language have put a lot of effort into making it approachable, without having to understand comonads and functors and all that.

**JC**: My difficulties with Haskell weren’t the theory but things like the lack of tooling and above all the difficulty of package management.

**CT**: Cabal Hell.

**JC**: Right.

**CT**: My understanding is that that’s improved dramatically with new technologies like Stack. We’re scaling up internally on Haskell. That’s the next area we’d like to get into. I’ll be able to say more about that down the road.

* * *

Check out Upcase for training materials on tools like Vim, Git, Rails, and Tmux.

These have all been numerical algorithms. Insertion sort is an example of a non-numerical algorithm that could be implemented as a fold.

Insertion sort is not the fastest sorting algorithm. It takes O(*n*^{2}) operations to sort a list of size *n* while the fastest algorithms take O(*n* log *n*) operations. But insertion sort is simple, and as it works its way down a list, the portion it has processed is sorted. If you interrupt it, you have correct results given the input so far. If you interrupt something like a quicksort, you don’t have a sorted list. If you’re receiving a stream of data points and want to sort them as they come, you have to use insertion sort rather than something like quicksort.

The function to fold over a list looks like this:

f as a = [x | x <- as, x < a] ++ [a] ++ [x | x <- as, x >= a]

Given a sorted list `as` and a value `a`, return a new sorted list that has inserted the element `a` in the proper position. Our function `f` does this by joining together the list of the elements less than `a`, the list containing only `a`, and the list of elements at least as big as `a`.

Here’s how we could use this to alphabetize the letters in the word “example.”

foldl f [] "example"

This returns `"aeelmpx"`.

Haskell takes our function `f` and an empty list of characters `[]` and returns a new list of characters by folding `f` over the list of characters making up the string `"example"`.

You can always watch how `foldl` works by replacing it with `scanl` to see intermediate results.

scanl f [] "example"

returns

["", "e", "ex", "aex", "aemx", "aempx", "aelmpx", "aeelmpx"]
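For comparison, here’s a sketch of the same fold in Python, where `functools.reduce` plays the role of `foldl`. The name `insert_sorted` is mine.

```python
from functools import reduce

def insert_sorted(xs, a):
    # Insert a into the sorted list xs, as in the Haskell f above.
    return ([x for x in xs if x < a]
            + [a]
            + [x for x in xs if x >= a])

print("".join(reduce(insert_sorted, "example", [])))  # aeelmpx
```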

The results are evenly distributed with some variation, just like dice rolls. In fact, given these results and results from a set of real dice rolls, most people would probably think the former are real because they’re more evenly distributed. A chi-squared goodness of fit test shows that the results are **too evenly distributed** compared to real dice rolls.

At the end of my previous post, I very briefly discuss what happens when you look at “dice” with more than six sides. Here I’ll go into a little more detail and look at a large number of examples.

In short, you either get a suspiciously good fit or a terrible fit. If you look at the remainder when dividing primes by *m*, you get values between 1 and *m*-1. You can’t get a remainder of 0 because primes aren’t divisible by *m* (or anything else!). If *m* itself is prime, then you get all the numbers between 1 and *m*-1, and as we’ll show below you get them in very even proportions. But if *m* isn’t prime, there are some remainders you can’t get.
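A short sketch makes the attainable remainders concrete: a prime *p* > *m* can leave remainder *r* mod *m* only if gcd(*r*, *m*) = 1, so composite *m* rules some remainders out. The helper `attainable` is mine.

```python
from math import gcd

def attainable(m):
    # Remainders mod m that a prime larger than m can actually have.
    return [r for r in range(1, m) if gcd(r, m) == 1]

print(attainable(7))   # [1, 2, 3, 4, 5, 6] -- every remainder is possible
print(attainable(10))  # [1, 3, 7, 9] -- the rest are ruled out
```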

The sequence of remainders looks random in the sense of being unpredictable. (Of course it *is* predictable by the algorithm that generates them, but it’s not predictable in the sense that you could look at the sequence out of context and guess what’s coming next.) The sequence is biased, and that’s the big news. Pairs of consecutive primes have correlated remainders. But I’m just interested in showing a different departure from a uniform distribution, namely that the results are too evenly distributed compared to random sequences.

The table below gives the chi-square statistic and *p*-value for each of several primes. For each prime *p*, we take remainders mod *p* of the next million primes after *p* and compute the chi-square goodness of fit statistic with *p*-2 degrees of freedom. (Why *p*-2? There are *p*-1 different remainders, and the chi-square test for *k* possibilities has *k*-1 degrees of freedom.)

The *p*-value column gives the probability of seeing a fit this good or better from uniform random data. (The *p* in *p*-value is unrelated to our use of *p* to denote a prime. It’s an unfortunate convention of statistics that everything is denoted *p*.) After the first few primes, the *p*-values are extremely small, indicating that such an even distribution of values would be astonishing from random data.

|-------+------------+------------|
| Prime | Chi-square |    p-value |
|-------+------------+------------|
|     3 |     0.0585 |   2.88e-02 |
|     5 |     0.0660 |   5.32e-04 |
|     7 |     0.0186 |   1.32e-07 |
|    11 |     0.2468 |   2.15e-07 |
|    13 |     0.3934 |   6.79e-08 |
|    17 |     0.5633 |   7.64e-10 |
|    19 |     1.3127 |   3.45e-08 |
|    23 |     1.1351 |   2.93e-11 |
|    29 |     1.9740 |   3.80e-12 |
|    31 |     2.0052 |   3.11e-13 |
|    37 |     2.5586 |   3.92e-15 |
|    41 |     3.1821 |   9.78e-16 |
|    43 |     4.4765 |   5.17e-14 |
|    47 |     3.7142 |   9.97e-18 |
|    53 |     3.7043 |   3.80e-21 |
|    59 |     7.0134 |   2.43e-17 |
|    61 |     5.1461 |   6.45e-22 |
|    67 |     7.1037 |   5.38e-21 |
|    71 |     7.6626 |   6.13e-22 |
|    73 |     7.5545 |   4.11e-23 |
|    79 |     8.0275 |   3.40e-25 |
|    83 |    12.1233 |   9.92e-21 |
|    89 |    11.4111 |   2.71e-24 |
|    97 |    12.4057 |   2.06e-26 |
|   101 |    11.8201 |   3.82e-29 |
|   103 |    14.4733 |   3.69e-26 |
|   107 |    13.8520 |   9.24e-29 |
|   109 |    16.7674 |   8.56e-26 |
|   113 |    15.0897 |   1.20e-29 |
|   127 |    16.4376 |   6.69e-34 |
|   131 |    19.2023 |   6.80e-32 |
|   137 |    19.1728 |   1.81e-34 |
|   139 |    22.2992 |   1.82e-31 |
|   149 |    22.8107 |   6.67e-35 |
|   151 |    22.8993 |   1.29e-35 |
|   157 |    30.1726 |   2.60e-30 |
|   163 |    26.5702 |   3.43e-36 |
|   167 |    28.9628 |   3.49e-35 |
|   173 |    31.5647 |   7.78e-35 |
|   179 |    33.3494 |   2.46e-35 |
|   181 |    36.3610 |   2.47e-33 |
|   191 |    29.1131 |   1.68e-44 |
|   193 |    29.9492 |   2.55e-44 |
|   197 |    34.2279 |   3.49e-41 |
|   199 |    36.7055 |   1.79e-39 |
|   211 |    41.0392 |   8.42e-40 |
|   223 |    39.6699 |   1.73e-45 |
|   227 |    42.3420 |   2.26e-44 |
|   229 |    37.1896 |   2.02e-50 |
|   233 |    45.0111 |   4.50e-44 |
|   239 |    43.8145 |   2.27e-47 |
|   241 |    51.3011 |   1.69e-41 |
|   251 |    47.8670 |   6.28e-48 |
|   257 |    44.4022 |   1.54e-53 |
|   263 |    51.5905 |   7.50e-49 |
|   269 |    59.8398 |   3.92e-44 |
|   271 |    59.6326 |   6.02e-45 |
|   277 |    52.2383 |   2.80e-53 |
|   281 |    52.4748 |   1.63e-54 |
|   283 |    64.4001 |   2.86e-45 |
|   293 |    59.7095 |   2.59e-52 |
|   307 |    65.2644 |   1.64e-52 |
|   311 |    63.1488 |   1.26e-55 |
|   313 |    68.6085 |   7.07e-52 |
|   317 |    63.4099 |   1.72e-57 |
|   331 |    66.3142 |   7.20e-60 |
|   337 |    70.2918 |   1.38e-58 |
|   347 |    71.3334 |   3.83e-61 |
|   349 |    75.8101 |   3.38e-58 |
|   353 |    74.7747 |   2.33e-60 |
|   359 |    80.8957 |   1.35e-57 |
|   367 |    88.7827 |   1.63e-54 |
|   373 |    92.5027 |   7.32e-54 |
|   379 |    86.4056 |   5.67e-60 |
|   383 |    74.2349 |   3.13e-71 |
|   389 |   101.7328 |   9.20e-53 |
|   397 |    86.9403 |   1.96e-65 |
|   401 |    90.3736 |   3.90e-64 |
|   409 |    92.3426 |   2.93e-65 |
|   419 |    95.9756 |   8.42e-66 |
|   421 |    91.1197 |   3.95e-70 |
|   431 |   100.3389 |   1.79e-66 |
|   433 |    95.7909 |   1.77e-70 |
|   439 |    96.2274 |   4.09e-72 |
|   443 |   103.6848 |   6.96e-68 |
|   449 |   105.2126 |   1.07e-68 |
|   457 |   111.9310 |   1.49e-66 |
|   461 |   106.1544 |   7.96e-72 |
|   463 |   116.3193 |   1.74e-65 |
|   467 |   116.2824 |   1.02e-66 |
|   479 |   104.2246 |   3.92e-79 |
|   487 |   116.4034 |   9.12e-73 |
|   491 |   127.2121 |   6.69e-67 |
|   499 |   130.9234 |   5.90e-67 |
|   503 |   118.4955 |   2.60e-76 |
|   509 |   130.9212 |   6.91e-70 |
|   521 |   118.6699 |   6.61e-82 |
|   523 |   135.4400 |   3.43e-71 |
|   541 |   135.9210 |   3.13e-76 |
|   547 |   120.0327 |   2.41e-89 |
|-------+------------+------------|

This post uses a fold to compute **mean**, **variance**, **skewness**, and **kurtosis**. See this earlier post for an object-oriented approach. The code below seems cryptic out of context. The object-oriented post gives references for where these algorithms are developed. The important point for this post is that we can compute mean, variance, skewness, and kurtosis **all in one pass through the data** even though textbook definitions appear to require at least two passes. It’s also worth noting that the functional version is less than half as much code as the object-oriented version.

(Algorithms that work in one pass through a stream of data, updating for each new input, are sometimes called “online” algorithms. This is unfortunate now that “online” has come to mean something else.)

The Haskell function `moments` below returns the number of samples and the mean, but does not directly return variance, skewness and kurtosis. Instead it returns moments from which these statistics can easily be calculated using the `mvsk` function.

moments (n, m1, m2, m3, m4) x = (n', m1', m2', m3', m4')
    where
        n' = n + 1
        delta = x - m1
        delta_n = delta / n'
        delta_n2 = delta_n**2
        t = delta*delta_n*n
        m1' = m1 + delta_n
        m4' = m4 + t*delta_n2*(n'*n' - 3*n' + 3) + 6*delta_n2*m2 - 4*delta_n*m3
        m3' = m3 + t*delta_n*(n' - 2) - 3*delta_n*m2
        m2' = m2 + t

mvsk (n, m1, m2, m3, m4) = (m1, m2/(n-1.0), (sqrt n)*m3/m2**1.5, n*m4/m2**2 - 3.0)

Here’s an example of how you would use this Haskell code to compute statistics for the list [2, 30, 51, 72]:

ghci> mvsk $ foldl moments (0,0,0,0,0) [2, 30, 51, 72]
(38.75, 894.25, -0.1685, -1.2912)

The `foldl` applies `moments` first to its initial value, the 5-tuple of zeros. Then it iterates over the data, taking data points one at a time and visiting each point only once, returning a new state from `moments` each time. Another way to say this is that after processing each data point, `moments` returns the 5-tuple that it would have returned if the data only consisted of the values up to that point.
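The same one-pass idea carries over to Python’s `reduce`. Here’s a simplified sketch that tracks only the first two moments, enough for mean and variance; the `step` function is my own stripped-down analogue, not a translation of the full `moments` function.

```python
from functools import reduce

def step(state, x):
    # One online update of (count, mean, sum of squared deviations).
    n, mean, m2 = state
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    return (n, mean, m2)

n, mean, m2 = reduce(step, [2, 30, 51, 72], (0, 0.0, 0.0))
print(mean, m2 / (n - 1))  # mean ~ 38.75, variance ~ 894.25, matching the Haskell output
```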

For a non-numerical example of folds, see my post on sorting.

Yesterday Brian Hayes wrote a post about the distribution of primes. He showed how you could take the remainder when primes are divided by 7 and produce something that looks like rolls of six-sided dice. Here we apply the chi-square goodness of fit test to show that the rolls are **too evenly distributed** to mimic randomness. This post does not assume you’ve seen the chi-square test before, so it serves as an introduction to this goodness of fit test.

In Brian Hayes’ post, he looks at the remainder when consecutive primes are divided by 7, starting with 11. Why 11? Because it’s the smallest prime bigger than 7. Since no prime is divisible by any other prime, all the primes after 7 will have a remainder of between 1 and 6 inclusive when divided by 7. So the results are analogous to rolling six-sided dice.

The following Python code looks at prime remainders and (pseudo)random rolls of dice and computes the chi-square statistic for both.

First, we import some functions we’ll need.

from sympy import prime
from random import random
from math import ceil

The function `prime` takes an argument *n* and returns the *n*th prime. The function `random` produces a pseudorandom number between 0 and 1. The ceiling function `ceil` rounds its argument up to an integer. We’ll use it to convert the output of `random` into dice rolls.

In this example we’ll use six-sided dice, but you could change `num_sides` to simulate other kinds of dice. With six-sided dice, we divide by 7, and we start our primes with the fifth prime, 11.

num_sides = 6
modulus = num_sides + 1

# Find the index of the smallest prime bigger than num_sides
index = 1
while prime(index) <= modulus:
    index += 1

We’re going to take a million samples and count how many times we see 1, 2, …, 6. We’ll keep track of our results in an array of length 7, wasting a little bit of space since the 0th slot will always be 0. (Because the remainder when dividing a prime by a smaller number is always positive.)

# Number of samples
N = 1000000

observed_primes = [0]*modulus
observed_random = [0]*modulus

Next we “roll” our dice two ways, using prime remainders and using a pseudorandom number generator.

for i in range(index, N+index):
    m = prime(i) % modulus
    observed_primes[m] += 1
    m = int(ceil(random()*num_sides))
    observed_random[m] += 1

The chi-square goodness of fit test depends on the observed number of events in each cell and the expected number. We expect 1/6th of the rolls to land in cell 1, 2, …, 6 for both the primes and the random numbers. But in a general application of the chi-square test, you could have a different expected number of observations in each cell.

expected = [N/num_sides for i in range(modulus)]

The chi-square test statistic sums (O – E)^{2}/E over all cells, where O stands for “observed” and E stands for “expected.”

def chisq_stat(O, E):
    return sum( [(o - e)**2/e for (o, e) in zip(O, E)] )

Finally, we compute the chi-square statistic for both methods.

ch = chisq_stat(observed_primes[1:], expected[1:])
print(ch)

ch = chisq_stat(observed_random[1:], expected[1:])
print(ch)

Note that we chop off the first element of the observed and expected lists to get rid of the 0th element that we didn’t use.

When I ran this I got 0.01865 for the prime method and 5.0243 for the random method. Your results for the prime method should be the same, though you might have a different result for the random method.

Now, how do we interpret these results? Since we have six possible outcomes, our test statistic has a chi-square distribution with five degrees of freedom. It’s one less than the number of possibilities because the total counts have to sum to N; if you know how many times 1, 2, 3, 4, and 5 came up, you can calculate how many times 6 came up.

A chi-square distribution with ν degrees of freedom has expected value ν. In our case, we expect a value around 5, and so the chi-square value of 5.0243 is unremarkable. But the value of 0.01865 is remarkably small. A large chi-square statistic would indicate a poor fit, the observed numbers being suspiciously far from their expected values. But a small chi-square value suggests the fit is suspiciously good, closer to the expected values than we’d expect of a random process.

We can be precise about how common or unusual a chi-square statistic is by computing the probability that a sample from the chi-square distribution would be larger or smaller. The `cdf` gives the probability of seeing a value this small or smaller, i.e. a fit this good or better. The `sf` gives the probability of seeing a value this large or larger, i.e. a fit this bad or worse. (The `scipy` library uses `sf` for “survival function,” another name for the ccdf, the complementary cumulative distribution function.)

from scipy.stats import chi2

print(chi2.cdf(ch, num_sides-1), chi2.sf(ch, num_sides-1))

This says that for the random rolls, there’s about a 41% chance of seeing a better fit and a 59% chance of seeing a worse fit. Unremarkable.

But it says there’s only a 2.5 in a million chance of seeing a better fit than we get with prime numbers. The fit is suspiciously good. In a sense this is not surprising: prime numbers are **not** random! And yet in another sense it is surprising since there’s a heuristic that says primes act like random numbers unless there’s a good reason why in some context they don’t. This departure from randomness is the subject of research published just this year.

If you look at dice with 4 or 12 sides, you get a suspiciously good fit, but not as suspicious as with 6 sides. But with 8 or 20-sided dice you get a very bad fit, so bad that its probability underflows to 0. This is because the corresponding moduli, 9 and 21, are composite, which means some of the cells in our chi-square test will have no observations. (Suppose *m* has a proper factor *a*. Then if a prime *p* were congruent to *a* mod *m*, *p* would have to be divisible by *a*.)
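You can check the empty-cell claim directly. This sketch uses a small homemade sieve (the helper `primes_up_to` is mine) to show that remainders divisible by 3 never occur mod 9:

```python
def primes_up_to(n):
    # Simple Sieve of Eratosthenes.
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(n**0.5) + 1):
        if sieve[i]:
            for j in range(i * i, n + 1, i):
                sieve[j] = False
    return [i for i in range(2, n + 1) if sieve[i]]

residues = {p % 9 for p in primes_up_to(10000) if p > 9}
print(sorted(residues))  # [1, 2, 4, 5, 7, 8] -- 3 and 6 never appear
```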

**Update**: See the next post for a systematic look at different moduli.

You don’t have to use “dice” that correspond to regular solids. You could consider 10-sided “dice,” for example. For such numbers it may be easier to think of spinners than dice, a spinner with 10 equal arc segments that it could land in.

**Related post**: Probability that a number is prime