A naive approach would be to gloss over the fact that you have discrete data and use the MLE (maximum likelihood estimator) for continuous data. That does a very poor job [1]. The discrete case needs its own estimator.

To illustrate this, we start by generating 5,000 samples from a discrete power law with exponent 3.

import numpy.random alpha = 3 n = 5000 x = numpy.random.zipf(alpha, n)

The continuous MLE is very simple to implement:

alpha_hat = 1 + n / sum(log(x))

Unfortunately, it gives an estimate of 6.87 for alpha, though we know it should be around 3.

The MLE for the discrete power law distribution satisfies

Here ζ is the Riemann zeta function, and *x*_{i} are the samples. Note that the left size of the equation is the derivative of log ζ, or what is sometimes called the logarithmic derivative.

There are three minor obstacles to finding the estimator using Python. First, SciPy doesn’t implement the Riemann zeta function ζ(*x*) per se. It implements a generalization, the Hurwitz zeta function, ζ(*x*, *q*). Here we just need to set *q* to 1 to get the Riemann zeta function.

Second, SciPy doesn’t implement the derivative of zeta. We don’t need much accuracy, so it’s easy enough to implement our own. See an earlier post for an explanation of the implementation below.

Finally, we don’t have an explicit equation for our estimator. But we can easily solve for it using the bisection algorithm. (Bisect is slow but reliable. We’re not in a hurry, so we might as use something reliable.)

from scipy import log from scipy.special import zeta from scipy.optimize import bisect xmin = 1 def log_zeta(x): return log(zeta(x)) def log_deriv_zeta(x): h = 1e-5 return (log_zeta(x+h) - log_zeta(x-h))/(2*h) t = -sum( log(x/xmin) )/n def objective(x): return log_deriv_zeta(x) - t a, b = 1.01, 10 alpha_hat = bisect(objective, a, b, xtol=1e-6) print(alpha_hat)

We have assumed that our data follow a power law immediately from *n* = 1. In practice, power laws generally fit better after the first few elements. The code above works for the more general case if you set `xmin`

to be the point at which power law behavior kicks in.

The bisection method above searches for a value of the power law exponent between 1.01 and 10, which is somewhat arbitrary. However, power law exponents are very often between 2 and 3 and seldom too far outside that range.

The code gives an estimate of α equal to 2.969, very near the true value of 3, and much better than the naive estimate of 6.87.

Of course in real applications you don’t know the correct result before you begin, so you use something like a confidence interval to give you an idea how much uncertainty remains in your estimate.

The following equation [2] gives a value of σ from a normal approximation to the distribution of our estimator.

So an approximate 95% confidence interval would be the point estimate +/- 2σ.

from scipy.special import zeta from scipy import sqrt def zeta_prime(x, xmin=1): h = 1e-5 return (zeta(x+h, xmin) - zeta(x-h, xmin))/(2*h) def zeta_double_prime(x, xmin=1): h = 1e-5 return (zeta(x+h, xmin) -2*zeta(x,xmin) + zeta(x-h, xmin))/h**2 def sigma(n, alpha_hat, xmin=1): z = zeta(alpha_hat, xmin) temp = zeta_double_prime(alpha_hat, xmin)/z temp -= (zeta_prime(alpha_hat, xmin)/z)**2 return 1/sqrt(n*temp) print( sigma(n, alpha_hat) )

Here we use a finite difference approximation for the second derivative of zeta, an extension of the idea used above for the first derivative. We don’t need high accuracy approximations of the derivatives since statistical error will be larger than the approximation error.

In the example above, we have α = 2.969 and σ = 0.0334, so a 95% confidence interval would be [2.902, 3.036].

* * *

[1] Using the continuous MLE with discrete data is not so bad when the minimum output *x*_{min} is moderately large. But here, where *x*_{min} = 1 it’s terrible.

[2] Equation 3.6 from Power-law distributions in empirical data by Aaron Clauset, Cosma Rohilla Shalzi, and M. E. J. Newman.

]]>

My most popular account is CompSciFact, tweets about computer science and related topics.

AlgebraFact is for algebra, number theory, and miscellaneous pure math. (Miscellaneous applied math is more likely to end up on AnalysisFact.)

ProbFact is for probability.

DataSciFact is for data science: statistics, machine learning, visualization, etc.

You can find a full list of my various Twitter accounts here.

]]>**Explorers**are people who ask — are there objects with such and such properties and if so, how many? …**Alchemists …**are those whose greatest excitement comes from finding connections between two areas of math that no one had previously seen as having anything to do with each other.**Wrestlers**… thrive not on equalities between numbers but on inequalities, what quantity can be estimated or bounded by what other quantity, and on asymptotic estimates of size or rate of growth. This tribe consists chiefly of analysts …**Detectives**… doggedly pursue the most difficult, deep questions, seeking clues here and there, sure there is a trail somewhere, often searching for years or decades. …

I’m some combination of alchemist and wrestler. I suppose most applied mathematicians are. Applications usually require taking ideas developed in one context and using them in another. They take often complex things then estimate and bound them by things easier to understand.

One of my favorite proofs is Bernstein’s proof of the Weierstrass approximation theorem. It appeals to both alchemists and wrestlers. It takes an inequality from probability and uses it in an entirely different context, one with no randomness in sight, and uses it to explicitly construct an approximation satisfying the theorem.

I thought of David Mumford’s tribes when I got an email a couple days ago from someone who wrote to tell me he found in one of my tech reports a function that he’d studied in his own research. My tech report was motivated by a problem in biostatistics, while he was looking at material structural fatigue. The connection between remote fields was a bit of alchemy, while the content of the tech report, an upper bound on an integral, was a bit of wrestling.

]]>The most obvious way to approximate a derivative would be to simply stick a small step size into the definition of derivative:

*f’*(*x*) ≈ (*f*(*x*+*h*) – *f*(*x*)) /* h*

However, we could do much better using

*f’*(*x*) ≈ (*f*(*x*+*h*) – *f*(*x-h*)) /* 2h*

To see why, expand *f*(*x*) in a power series:

*f*(*x* + *h*) = *f*(*x*) + *h* *f*‘(*x*) + *h*^{2} *f*”(*x*)/2 + *O*(*h*^{3})

A little rearrangement shows that the error in the one-sided difference, the first approximation above, is *O*(*h*). Now if you replace *h* with –*h* and do a little algebra you can also show that the two-sided difference is *O*(*h*^{2}). When *h* is small, *h*^{2} is very small, so the two-sided version will be more accurate for sufficiently small *h*.

So how small should *h* be? The smaller the better,* in theory*. In computer arithmetic, you lose precision whenever you subtract two nearly equal numbers. The more bits two numbers share, the more bits of precision you may lose in the subtraction. In my application, *h* = 10^{-5} works well: the precision after the subtraction in the numerator is comparable to the precision of the (two-sided) finite difference approximation. The following code was adequate for my purposes.

from scipy.special import zeta def zeta_prime(x): h = 1e-5 return (zeta(x+h,1) - zeta(x-h,1))/(2*h)

The zeta function in SciPy is Hurwitz zeta function, a generalization of the Riemann zeta function. Setting the second argument to 1 gives the Riemann zeta function.

There’s a variation on the method above that works for real-valued functions that extend to a complex analytic function. In that case you can use the complex step differentiation trick to use

Im( *f*(*x*+*ih*)/*h* )

to approximate the derivative. It amounts to the two-sided finite difference above, except you don’t need to have a computer carry out the subtraction, and so you save some precision. Why’s that? When *x* is real, *x* + *ih* and *x* – *ih* are complex conjugates, and *f*(x – *ih*) is the conjugate of *f*(*x* + *ih*), i.e. conjugation and function application commute in this setting. So (*f*(*x*+*ih*) – *f*(*x-ih*)) is twice the imaginary part of *f*(*x* + *ih*).

SciPy implements complex versions many special functions, but unfortunately not the zeta function.

]]>Dana Mackenzie, “What in the Name of Euclid Is Going On Here?”, Science, 2005

]]>

… thinking very critically about what shells get used for and what they’re actually good at and what they’re not good at.

I’ve wondered about this but never reached any satisfying conclusions. I was curious to hear Anthony’s ideas, so I asked him for another interview. (I interviewed Anthony and his co-author Katy Huff regarding their book Effective Computation in Physics.

* * *

**JC**: If your shell speaks your programming language, then what else does it need to do?

**AS**: It’s an interesting question. People have tried to use Python as a shell for years and years and they came up with a bunch of different potential solutions, but none of them quite worked because the language wasn’t built around that idea. It ended up being more verbose than people want from a shell. The main purpose of the shell, in my opinion, is to run other code and to glue things together. Python does that really well for libraries and functions, but it doesn’t do that so well for executables. Bash deals with executables really well, but it’s terrible for dealing with even simple conditional logic. Like a lot of people, I wanted something that would do all these things simultaneously and do them all well. But you quickly end up where many traditional computer science people are not willing to go: context-sensitive parsing. It’s something they teach you to be afraid of in school .

**JC**: But you do it all the time. How can you get away from it?

**AS**: You can’t, but people want to avoid it in their core languages. The major programming languages keep it out. You’ll find it quarantined to domain-specific languages where the damage is small.

**JC**: So you have something in mind like Perl? There the behavior of a function can depend entirely on whether it’s being used in a scalar context or an array context.

**AS**: That’s right. Perl does some of this. The language Forth is completely built around this. It’s all context-sensitive.

You brought up something interesting [in a previous email] about the overlap between shells and editors. Those things are completely separate in my mind, but for a lot of people they get merged very quickly. For instance, Emacs has the ability to run a shell inside the editor, and people use that all the time.

**JC**: The way I work is that I start something at the command line, then it gets a little complicated, and I switch over to writing a script and regret not having done that sooner. I especially do that with something like R. This is just going to be a few quick calculations, so I’ll do it right from the REPL. Then things get more complicated …

**AS**: IPython sorta has that too, the old IPython readline shell. You just wanted to do something simple that bash couldn’t do quickly or easily, so you open up the IPython command line. Inevitably it ends up taking more lines than you wanted it to. That is part of why the Jupyter notebook is so great.

**JC**: One thing I noticed about PowerShell was that system administrators were ecstatic when it came out and would say how much they loved the command line. Then Microsoft put out this ISE, sort of an IDE for PowerShell, and everyone moved there. So they’re not really using the command line anymore. They’re excited about PowerShell as a programming language, not as an interactive shell per se.

In Bruce Payette’s PowerShell book he fields questions asking why PowerShell did something some way they find odd and his answer is always “Because it’s a *shell*.”

**AS**: Do you have any examples?

**JC**: For example, functions don’t use parentheses around their arguments or commas between their arguments because that’s not what people expect from a shell. You expect to type something like `ls`

, not `ls()`

with parentheses at the end. There were more subtle examples than this, but they’re not fresh on my mind.

**AS**: That’s where I think that tools like Python plumbum are lacking. It’s an all-Python environment, so you have to use Python syntax even when it’s cumbersome. It prevents you from having to import subprocess and worry about that all the time, but it doesn’t do much more than that.

**JC**: When you were writing xonsh, where there times you wished you could change the Python language? Or things you’d do differently in the shell if you weren’t aiming for 100% Python compatibility?

**AS**: That’s interesting. Python is deceptively simple. It has a lot of little pieces to it. It’s very natural and intuitive to use, but re-implementing the parser for Python was more work than I expected. There are a lot of little gotchas in the parser. I spent a lot of time on tuples and function argument grouping. The way they’re handled looks very similar but they’re handled completely differently for no reason that’s readily apparent.

There’s also this ambiguity between Python commands and shell commands if you’re trying to do both simultaneously, and that’s frustrating. That’s the hard part, figuring out when you’re in a subprocess and when you’re in Python mode.

**JC**: It’s hard for you as an implementer, but hopefully users can be blissfully ignorant of the issues and it just does what they expect.

I guess you’re walking a fine line, because as soon as you say you want the shell to infer what people mean, you start getting into the kinds of complications you have in Perl where things depend so heavily on context, and that sort of thing is contrary to the spirit of Python.

**AS**: Yeah, exactly! After going through this exercise, there is one thing I’d like to change about Python. Python is white space-sensitive at the beginning of a line, but not after the first non-white space character. For example, you can put as many spaces around a binary operator as you like, or none at all. That’s really, really frustrating. If you enforced PEP 8, requiring exactly one white space around every binary operator, you’d be able to resolve these currently ambiguous cases between subprocess mode and Python mode very naturally. But I can’t imagine a world in which people would agree to this.

**JC**: What shell would you use if you weren’t using xonsh?

**AS**: I probably would use bash. Fish is really nice in some ways, and things like zsh have nice features too. What I used to do is go back and forth between working in an IPython shell and a bash shell, and between those two I could pretty much get the job done.

**JC**: Do you use Emacs?

**AS**: No, I don’t use Emacs or Vim or any of those editors. I use an editor I wrote, kinda like nano. I’ve used Emacs and Vim, but they got in my way too much, so I wanted something else. This is sort of the same thing as xonsh; I want my tools to get out of my way. I want the barrier to entry to doing what I want to be basically zero. You can spend years and years becoming a master of some of these tools and then you’re really effective, but I want to just open up the editor and start typing text. The same thing with the shell. I just want to open it up and get to work and not have to keep going back to the documentation.

You do not want to be an edge case in this future we are building.

Systems run by algorithms can be more efficient on average, but make life harder on the edge cases, people who are exceptions to the system developers’ expectations.

Algorithms, whether encoded in software or in rigid bureaucratic processes, can unwittingly discriminate against minorities. The problem isn’t *recognized* minorities, such as racial minorities or the disabled, but *unrecognized* minorities, people who were overlooked.

For example, two twins were recently prevented from getting their drivers licenses because DMV software couldn’t tell their photos apart. Surely the people who wrote the software harbored no malice toward twins. They just didn’t anticipate that two drivers licence applicants could have indistinguishable photos.

I imagine most people reading this have had difficulty with software (or bureaucratic procedures) that didn’t anticipate something about them; everyone is an edge case in some context. Maybe you don’t have a middle name, but a form insists you cannot leave the middle name field blank. Maybe there are more letters in your name or more children in your family than a programmer anticipated. Maybe you choose not to use some technology that “everybody” uses. Maybe you happen to have a social security number that hashes to a value that causes a program to crash.

When software routinely fails, there obviously has to have a human override. But as software improves for most people, there’s less apparent need to make provision for the exceptional cases. So things could get harder for edge cases as they get better for more people.

**Related posts**:

A plausible guess is that project lead time would be proportional to the logarithm of the company size. If a company with *n* employees has a hierarchy with every manager having *m* subordinates, the number of management layers would be around log_{m}(*n*). If every project has to be approved by every layer of management, lead time should be logarithmic in the company size. This implies huge companies only take a little longer to start projects than medium-sized companies, and that doesn’t match my experience.

In my experience, lead time is proportional to something like the square root of the company size:

*T* = *k* √ *E*

where *T* is lead time, *k* is a proportionality constant, and *E* is the number of employees. For example, someone told me that he moved to a company 1000 times bigger and things seem to move about 30 times slower. That would be consistent with a square root rule.

If *T* is measured in days and *k* = 0.5, the square root rule would say that a solo entrepreneur could start a project in half a day, and a company of 130,000 employees would take six months. That seems about right. Of course small companies can move slowly, and large companies can move quickly. But it’s a good rule of thumb to say individuals operate on a scale of days, small-to-medium companies on the scale of weeks, and large companies on the scale of months.

The reason may be that large companies scale up well, but they don’t scale down well. They can put together large deals fairly quickly, relative to the size of the deal, but not small deals.

]]>Today I visited Bastrop State Park on the way home from Austin. Some trees, particularly oaks, survived the fires. Pines have come back on their own in parts of the park. A volunteer working in the park told me that some of these new trees are 10 feet tall, though I didn’t see these myself. In other parts volunteers have planted pines. Here’s a photo I took this morning.

Most of the new growth in the forest is underbrush, in some places thicker than in the photo above. The same volunteer mentioned above said that the park is already planning prescribed burning in some areas to clear the underbrush and protect the viable trees.

]]>

One of the questions that came up was how to compare two studies with different sample sizes. Of course there are many factors involved, but I said that as a general rule of thumb, a study with four times the sample size will give confidence intervals that are half as wide. They loved that. In the midst of what to them was a sea of statistical mumbo jumbo, here was something they could grab onto. I also pointed out a few things I thought doctors would want to hear and two or three buzzwords the sales people should learn.

The scientific literature on their product was favorable, but the company was not able to convey this because the sales reps didn’t have the words to use. I gave them the words by translating scientific jargon to simple language.

If you’d like for me to give your sales team the words they need, please contact me.

]]>Any claim coming from an observational study is most likely to be wrong.

They back up this assertion with data about observational studies later contradicted by prospective studies.

Much has been said lately about the assertion that most published results are false, particularly observational studies in medicine, and I won’t rehash that discussion here. Instead I want to cut to the process Young and Karr propose for improving the quality of observational studies. They summarize their proposal as follows.

The main technical idea is to split the data into two data sets, a modelling data set and a holdout data set. The main operational idea is to require the journal to accept or reject the paper based on an analysis of the modelling data set without knowing the results of applying the methods used for the modelling set on the holdout set and to publish an addendum to the paper giving the results of the analysis of the holdout set.

They then describe an eight-step process in detail. One step is that cleaning the data and dividing it into a modelling set and a holdout set would be done by different people than the modelling and analysis. They then explain why this would lead to more truthful publications.

The holdout set is the key. Both the author and the journal know there is a sword of Damocles over their heads. Both stand to be embarrassed if the holdout set does not support the original claims of the author.

* * *

The full title of the article is *Deming, data and observational studies: A process out of control and needing fixing. *It* a*ppeared in the September 2011 issue of Significance.

Update: The article can be found here.

]]>I’m most familiar with intellectual property in the form of software, but I imagine the same applies to many other forms of intellectual property. Some forms of data are easy to understand, such as a list of passwords. But others, such as source code, require a large amount of context beyond the data. One reason acquisitions fail so often is that the physical assets of a company are not enough. The most valuable assets a company has are often intangible.

Of course companies should protect their intellectual property, but a breach is not necessarily a disaster. On the other hand, the loss of institutional memory may be a disaster.

]]>These applications involve large amounts of data. But more importantly they involve **new data**, not simply greater quantities of data we’ve had before. Cheap sensors make it possible to measure things more directly and in higher resolution than before. We have sources of data, such as social media, that are qualitatively different from what we’ve had in the past.

Simply saying we have more data than before obscures what’s happening. For example, we don’t know more about consumer behavior than a generation ago because we do more phone surveys and have more customer satisfaction post cards to fill out. We know more because we can observe things we couldn’t observe before.

Clever analysis deserves some credit for the successes of big data, but more credit goes to new sources of data and the technologies that make these sources possible.

]]>But if you have less traffic so that the number of visitors involved in a test is appreciable, you might be concerned with possible lost revenue during the test itself. The point of A/B testing is to improve profitability *after* the test, not during the test. If you also want to consider profitability *during* the test, you might want to consider more alternatives.

My experience with testing comes from a context where the stakes are higher than improving conversion on web sites: treating cancer patients. You want to find out which treatments performed better for the sake of future patients, those who were treated after the randomized trial. But you also want to treat the participants in the clinical trial effectively. Two ways we would do that are early stopping rules and adaptive randomization. Both practices are applicable to A/B testing web pages.

A conventional clinical trial might take a few hundred patients and randomize half to one treatment and half to another. But if one treatment appears to be much more effective, at some point it becomes unconscionable to keep assigning the less effective treatment. So you stop the experiment early. You might want to do the same with web designs. If you planned to show two variations of a page to 500 visitors each, but after 100 visitors it’s obvious which version is performing better, you’d like to stop the test and show everyone the better page. On the other hand, if you have so many visitors that you’re not concerned with what happens to the 1000 visitors in the test, just let the test run to completion.

Another approach is to compromise between equal randomization and early stopping. Suppose A is performing better than B, but not so much better that you’re willing to stop and declare A the winner. You might keep randomizing, but increase the probability that the test will assign A. If A really is better, more visitors will see the better page. But if you’re wrong and B is really better, you may still discover this because some visitors are still seeing B. If B keeps performing better, the tide will turn and the test will prefer it. This is called adaptive randomization. The more evidence there is that one version is better, the higher the probability that you’ll show people that version.

One way to use adaptive randomization is variable experiment sizes. Instead of deciding a test size in advance, you test until you’re satisfied that you’ve found a winner. That may require fewer visitors than a conventional A/B test. It may also require more, but only when there’s a good reason to. The test may go into overtime, so to speak, because the two versions are performing similarly, in which case you’d like to keep testing longer to find which is better.

It’s easy to fall into thinking that the winner of a test will be used forever, whether you’re testing web pages or cancer treatments. But this isn’t the case. The winner will eventually be tested against something else, maybe very soon. This means that you might want to put a little more emphasis on the performance *during* the test and not just performance *after* the test, because there may not be much opportunity for performance after the test.

If you’d like to discuss how adaptive randomization could benefit your testing, please let me know.

]]>

I had a brief interview last night. Someone took this photo as we were setting up.

]]>The planets have elliptical orbits with the sun at one focus, but these ellipses are nearly circles centered at the sun. We’ll assume the orbits are perfectly circular and lie in the same plane. (Now that Pluto is not classified as a planet, we can say without qualification that the planets have nearly circular orbits. Pluto’s orbit is much more elliptical than any of the planets.)

We can work in astronomical units (AUs) so that the distance from the Earth to the sun is 1. We can also work in units of years so that the period is also 1. Then we could describe the position of the Earth at time *t* as exp(2π*it*).

Mars has a larger orbit and a longer period. By Kepler’s third law, the size of the orbit and the period are related: the square of the period is proportional to the cube of the radius. Because we’re working in AUs and years, the proportionality constant is 1. If we denote the radius of Mars’ orbit by *r*, then its orbit can be described by

*r* exp(2π*i* (*r*^{-3/2} *t* ))

Here we pick our initial time so that at *t* = 0 the two planets are aligned.

The distance between the planets is just the absolute value of the difference between their positions:

| exp(2π*it*) – *r* exp(2π*i* (*r*^{-3/2} *t*)) |

The following code computes and plots the distance from Earth to Mars over time.

from scipy import exp, pi, absolute, linspace import matplotlib.pyplot as plt def earth(t): return exp(2*pi*1j*t) def mars(t): r = 1.524 # semi-major axis of Mars orbit in AU return r*exp(2*pi*1j*(r**-1.5*t)) def distance(t): return absolute(earth(t) - mars(t)) x = linspace(0, 20, 1000) plt.plot(x, distance(x)) plt.xlabel("Time in years") plt.ylabel("Distance in AU") plt.ylim(0, 3) plt.show()

And the output looks like this:

Notice that the distance varies from about 0.5 to about 2.5. That’s because the radius of Mars’ orbit is about 1.5 AU. So when the planets are exactly in phase, they are 0.5 AU apart and when they’re exactly out of phase they are 2.5 AU apart. In other words the distance ranges from 1.5 – 1 to 1.5 + 1.

The distance function seems to be periodic with period about 2 years. We can do a little calculation by hand to show that is the case and find the period exactly.

The distance squared is the distance times its complex conjugate. If we let ω = *r *^{-3/2} then the distance squared is

*d*^{2}(*t*) = (exp(2π*it*) – *r* exp(2π*i*ω*t*)) (exp(-2π*it*) – *r* exp(-2π*i*ω*t*))

which simplifies to

1 + *r*^{2} – 2*r* cos(2π(1 – ω)*t*)

and so the (squared) distance is periodic with period 1/(1 – ω) = 2.13.

Notice that the plot of distance looks more angular at the minima and more rounded near the maxima. Said another way, the distance changes more rapidly when the planets leave their nearest approach than their furthest approach. You can prove this by taking square root of *d*^{2}(*t*) and computing its derivative.

Let *f*(*t*) = 1 + *r*^{2} – 2*r* cos(2π(1 – ω)*t*). By the chain rule, the derivative of the square root of *f*(*t*) is 1/2 *f*(*t*)^{-1/2} *f*‘(*t*). Near a maximum or a minimum, *f*‘(*t*) takes on the same values. But the term *f*(*t*)^{-1/2} is largest when *f*(*t*) is smallest and vice versa because of the negative exponent.

One of the surprises from differential equations is that an infinitely concentrated input usually results in a diffuse output. A *fundamental solution* to a differential equation is a solution to the equation with a Dirac delta as the forcing function. In a sense, your input is so concentrated that it’s not actually a function. And yet the output may be a nice continuous function, and not one that is not particularly concentrated.

The situation is analogous to striking a bell. The input, the hammer blow to the bell, is extremely short, but the response of the bell is long and smooth. Solving a differential equation with a delta function as input is like learning about a bell by listening to how it rings when you strike it. A better analogy would be striking the bell in many places; a fundamental solution actually solves for a delta function with a position argument, not just a single delta function.

If you’re curious how this informal talk of “infinitely concentrated” input and delta “functions” can be made rigorous, start by reading this post.

**Related post**: Life lessons from differential equations

If a student answers BACDEFGHIJ, then did they make two mistakes or just one? Two events are in the wrong position, but they made one transposition error. The simplest way to grade such a test would be to count the number of events that are in the correct position. Is this the most fair way to grade?

If you decide to count how many transpositions are needed to correct a student’s answer, do you count any transposition or only adjacent transpositions? For example, if someone answered JBCDEFGHIA, then transposing the A and the J is enough to put the results in order. But reversing the first and last event seems like a bigger mistake than reversing the first two events. Counting only adjacent transpositions would penalize this mistake more. You would have to swap the J with each of the eight letters between J and A. But it hardly seems that answering JBCDEFGHIA is eight times worse than answering BACDEFGHIJ.

Maybe counting transpositions is too much work. So we just go back to counting how many events are in the right place. But then suppose someone answers JABCDEFGHI. This is completely wrong since every event is in the wrong position. But the student obviously knows something, since the relative order of nearly all of the events is correct. From one perspective there was only one mistake: J comes last, not first.

What is the worst possible answer? Maybe getting the order exactly backward? If you have an odd number of events, then getting the order backward means one event is in the right place, and so that doesn’t receive the lowest possible score.

This is an interesting problem beyond grading exams. (As for grading exams, I’d suggest simply not using questions of this type on an exam.) In manufacturing, how serious a mistake is it to reverse two consecutive components versus two distant components? You could also ask the same question when comparing DNA sequences or other digital signals. The best way to assign a distance between the actual and desired sequence would depend entirely on context.

]]>Or maybe not. A new study of three contemporary hunter-gatherer tribes found that they stay awake long after dark and sleep an average of 6.5 hours a night. They also don’t nap much [1]. This suggests the way we sleep may not be that different from our ancient forebears.

Historian A. Roger Ekirch suggested that before electric lighting it was common to sleep in two four-hour segments with an hour or so of wakefulness in between. His theory was based primarily on medieval English texts that refer to “first sleep” and “second sleep” and has other literary support as well. A small study found that subjects settled into the sleep pattern Ekirch predicted when they were in a dark room for 14 hours each night for a month. But the hunter-gatherers don’t sleep this way.

Maybe latitude is an important factor. The hunter-gatherers mentioned above live between 2 and 20 degrees south of the equator whereas England is 52 degrees north of the equator. Maybe two-phase sleep was more common at high latitudes with long winter nights. Of course there are many differences between modern/ancient [2] hunter-gatherers and medieval Western Europeans besides latitude.

Two studies have found two patterns of how people sleep without electric lights. Maybe electric lights don’t have as much impact on how people sleep as other factors.

**Related post**: Paleolithic nonsense

* * *

[1] The study participants were given something like a Fitbit to wear. The article said that naps less than 15 minutes would be below the resolution of the monitors, so we don’t know how often the participants took cat naps. We only know that they rarely took longer naps.

[2] There is an implicit assumption that the contemporary hunter-gatherers live and, in particular, sleep like their ancient ancestors. This seems reasonable, though we can’t be certain. There is also the bigger assumption that the tribesmen represent not only *their* ancestors but all paleolithic humans. Maybe they do, and we don’t have much else to go on, but we don’t know. I suspect there was more diversity in the paleolithic era than we assume.

Unlike the formula I wrote about a few days ago relating Fibonacci numbers and pi, this one is not as simple to prove. The numerator inside the root is easy enough to estimate asymptotically, but estimating the denominator depends on the distribution of primes.

**Source**: Yuri V. Matiyasevich and Richard K. Guy, A new formula for π, American Mathematical Monthly, Vol 93, No. 8 (October 1986), pp. 631-635.

]]>

PACE, which stands for Property Assessed Clean Energy, is a nation-wide program that makes long-term financing available for energy upgrades, repaid through an annual assessment added to your property tax bill. Though it is a national initiative, each state must create its own PACE program, and so there is some variety in the forms PACE can take. Texas passed legislation in June 2013 authorizing local governments to implement PACE programs. Yesterday the Houston City Council unanimously passed a Resolution of Intent to adopt PACE.

I have been working with PACE Houston, a private company that develops PACE projects by providing strategic advice to property owners. If you’re in the Houston area and are interested in help with PACE financing, you could contact me, or go to the contact page on the PACE Houston web site.

]]>*I’m a data scientist*. Not sure what that means, but it sounds cool.

*I study machine learning*. Hmm. Maybe interesting, maybe a little ominous.

*I’m into big data*. Exciting or passé, depending on how many times you’ve heard the term.

Even though each of these descriptions makes a different impression, they’re all essentially the same thing. You could throw in a few more terms too, like artificial intelligence, inferential science, decision theory, or inverse probability.

There are distinctions. These terms don’t entirely overlap, but the overlap is huge. They all have to do with taking data and making an inference.

“Decision-making under uncertainty” emphasizes that you never have complete data, and yet you need to make decisions anyway. “Decision theory” emphasizes that the whole point of analyzing data is to do something as a result, and suggests that focusing directly on the decision itself, rather than proxies along the way, is the best way to do this.

“Data science” stresses that there is more to the process of making inferences than what falls under the traditional heading of “statistics.” Statistics has never been only about “the grotesque phenomenon generally known as mathematical statistics,” as Francis Anscombe described it. Things like data cleaning and visualization have always been part of the practice of statistics, though not the theory of statistics. Data science also emphasizes the role of computation. Some say a data scientist is a statistician who can program. Some say data science is statistics on a Mac.

Despite the hype around the term data science, it’s growing on me. It has its drawbacks, but so does every other name.

Machine learning, like decision theory, emphasizes the ultimate goal of doing something with data rather than creating an accurate model of the process that generates the data. If you can create such a model, so much the better. But it may not be necessary to have a great model in order to accomplish what you originally set out to do. “Naive Bayes,” for example, is a classification algorithm that is admittedly naive. It knowingly makes a gross simplification, assuming events are independent that we know are certainly not independent, and yet it often works well enough.

“Big data” is a big can of worms. It is often concerned with data sets that are indeed big, but it also implies other things, such as the way the data become available, as a real time stream rather than as a complete static set. See Erik Meijer’s Big data cube. And that’s just when the term “big data” is used in some fairly meaningful way. It’s also used so broadly as to be meaningless.

The term “statistics” literally means the mathematics of the interests of states, as in governments, because these were the first applications of statistics. So while “statistics” may be the most established and perhaps most respectable term discussed here, it’s not great. As I remarked here, “The term *statistics* would be equivalent to *governmentistics*, a historically accurate but otherwise useless term.” Statistics emphasizes probability models and mathematical rigor more than other variations on data analysis do. Statisticians criticize machine learning folks for being sloppy. Machine learning folks criticize statisticians for being too conservative, or for being too focused on description and not focused enough on prediction.

Bayesian statistics is much older than what is now sometimes called “classical” statistics. It was essential dormant during the first half of the 20th century before experiencing a renaissance in the second half of the century. Bayesian statistics was originally called “inverse probability” for good reason. Probability theory takes the probabilities of events as given and makes inferences about possible outcomes. Bayesian statistics does the inverse, taking data as given and inferring the probabilities that lead to the data. All statistics does something like this, but Bayesian statistics is consistent in forming all inference directly as probabilities. Frequetist (“classical”) statistics also infers probabilities, but the results, things like *p*-values and confidence intervals, are not the probabilities of what most people think they are. See Anthony O’Hagan’s description here.

Data analysis has gone by many names over time, sometimes with meaningful distinctions and sometimes not. Often people make a distinction without a difference.

]]>As mysterious as this equation may seem, it’s not hard to prove. The arctangent identity

shows that the sum telescopes, leaving only the first term, arctan(1) = π/4. To prove the arctangent identity, take the tangent of both sides, use the addition law for tangents, and use the Fibonacci identity

See this post for an even more remarkable formula relating Fibonacci numbers and π.

]]>But how would I know if someone had learned English fluently? If they were fluent, I’d assume they were native speakers. I knew people had learned English as a second language **because** they weren’t fluent. This is **selection bias**, where the selection of data you see is influenced by the very thing you’re interested in.

A famous example of selection bias is that of British bombers returning from missions over Germany in World War II. Statistician Abraham Wald advised the RAF to add armor precisely where these bombers were **not** shot, reasoning that he was only able to inspect bombers that survived their missions. More on this story here.

When I was in college, I had a roommate who had learned Spanish fluently as a second language. I thought “That’s not possible!” though of course it is possible. I immediately began to wonder what else that I thought was impossible was merely difficulty.

]]>There can only be a finite number of cases, because *n*! grows faster than 10^{n} for *n* > 10, and it’s reasonable to guess that 23 might be the largest case. Turns out it’s not, but it’s close. The only cases where *n*! has *n* digits are 1, 22, 23, and 24. Once you’ve found these by brute force, it’s not hard to show that they must be the only ones because of the growth rate of *n*!.

Is there a convenient way to find the number of digits in n! without having to compute *n*! itself? Sure. For starters, the number of digits in the base 10 representation of a number *x* is

⌊ log_{10} *x* ⌋ + 1.

where ⌊ *z* ⌋ is the floor of *z*, the largest integer less than or equal to *z*. The log of the factorial function is easier to compute than the factorial itself because it won’t overflow. You’re more likely to find a function to compute the log of the gamma function than the log of factorial, and more likely to find software that uses natural logs than logs base 10. So in Python, for example, you could compute the number of digits with this:

from scipy.special import gammaln from math import log, floor def digits_in_factorial(n): return floor( gammaln(n+1)/log(10.0) ) + 1

What about a more elementary formula, one that doesn’t use the gamma function? If you use Stirling’s approximation for factorial and take log of that you should at least get a good approximation. Here it is again in Python:

from math import log, floor, pi def stirling(n): return floor( ((n+0.5)*log(n) - n + 0.5*log(2*pi))/log(10) ) + 1

The code above is exact for every *n* > 2 as far as I’ve tested, up to *n* = 1,000,000. (Note that one million factorial is an extremely large number. It has 5,565,709 digits. And yet we can easily say something about this number, namely how many digits it has!)

The code may break down somewhere because the error in Stirling’s approximation or the limitations of floating point arithmetic. Stirling’s approximation gets more accurate as *n* increases, but it’s conceivable that a factorial value could be so close to a power of 10 that the approximation error pushes it from one side of the power of 10 to the other. Maybe that’s not possible and someone could prove that it’s not possible.

You could extend the code above to optionally take another base besides 10.

def digits_in_factorial(n, b=10): return floor( gammaln(n+1)/log(b) ) + 1 def stirling(n, b=10): return floor( ((n+0.5)*log(n) - n + 0.5*log(2*pi))/log(b) ) + 1

The code using Stirling’s approximation still works for all *n* > 2, even for *b* as small as 2. This is slightly surprising since the number of bits in a number is more detailed information than the number of decimal digits.

My title speaks of “data analysis” not “statistics”, and of “computation” not “computing science”; it does not speak of “mathematics”, but only last. Why? …

My brother-in-squared-law, Francis J. Anscombe has commented on my use of “data analysis” in the following words:

Whereas the content of Tukey’s remarks is always worth pondering, some of his terminology is hard to take. He seems to identify “statistics” with the grotesque phenomenon generally known as “mathematical statistics”, and finds it necessary to replace “statistical analysis” with “data analysis.”

(Tukey calls Anscombe his “brother-in-squared-law” because Anscombe was a fellow statistician as well as his brother-in-law. At first I thought Tukey had said “brother-in-law-squared”, which could mean his brother-in-law’s brother-in-law, but I suppose it was a pun on the role of least-square methods in statistics.)

Tukey later says

I … shall stick to this attitude today, and shall continue to use the words “data analysis”, in part to indicate that we can take probability seriously, or leave it alone, as may from time to time be appropriate or necessary.

It seems Tukey was reserving the term “statistics” for that portion of data analysis which is rigorously based on probability.

]]>As with financial arbitrage, the hard part is spotting opportunities, not necessarily acting on them. If you want to be a hero with regular expressions, as in the xkcd cartoon below, the key isn’t knowing regular expressions well. The key is knowing *about* regular expressions in a context where no one else does.

To spot a technical arbitrage opportunity, you need to know what that technology can and cannot (easily) do. You also need to recognize situations where the technology can help. You don’t need to be able to stand up to a quiz show style oral exam.

**Related post**: You can be a hero with a simple idea

The life he had chosen was a cocoon. Surrounded by a web of old manuscripts and scholarly papers, he would achieve tenure, publish frequently, teach a group of carefully selected graduate students, be treated like a celebrity by the handful of people who had the faintest idea who he was, and go to his grave deluded into thinking he had achieved greatness when in fact he stayed in school all his life. Where was the plunge into the unknown?

I don’t believe the author meant to imply that an academic career is necessarily so insular. In the context of the quote, Ivan says his father, also a scholar, “hadn’t stayed in the cocoon” but had taken great risks. But there is certainly the danger of living in a tiny world and believing that you’re living in a much larger world. Others may face the same danger, though it seems particularly acute for academics.

It’s interesting that Ivan criticizes scholars for avoiding the unknown. Scholars would say they’re all about pursuing the unknown. But scholars pursue known unknowns, questions that can be precisely formulated within the narrow bounds of their discipline. The “plunge into the unknown” here refers to unknown unknowns, messier situations where not only are the outcomes unknown, even the risks themselves are unknown.

]]>There’s only one way a function on the real line can be periodic. But if you think of functions of a complex variable, it makes sense to look at functions that are periodic in two different directions. Sine and cosine are periodic as you move horizontally across the complex plane, but not if you move in any other direction. But you could imagine a function that’s periodic vertically as well as horizontally.

A doubly periodic function satisfies *f*(*x*) = *f*(*x* + ω_{1}) and *f*(*x*) = *f*(*x* + ω_{2}) for all *x* and for two different fixed complex periods, ω_{1} and ω_{2}, with different angular components, i.e. the two periods are not real multiples of each other. For example, the two periods could be 1 and *i*.

How many doubly periodic functions are there? The answer depends on how much regularity you require. If you ask that the functions be differentiable everywhere as functions of a complex variable (i.e. entire), the only doubly periodic functions are constant functions [1]. But if you relax your requirements to allow functions to have singularities, there’s a wide variety of functions that are doubly periodic. These are the *elliptic* functions. They’re periodic in two independent directions, and meromorphic (i.e. analytic except at isolated poles). [2]

What about triply periodic functions? If you require them to be meromorphic, then the only triply periodic functions are constant functions. To put it another way, if a meromorphic function is periodic in three directions, it’s periodic in every direction for every period, i.e. constant. If a function has three independent periods, you can construct a sequence with a limit point where the function is constant, and so it’s constant everywhere.

* * *

[1] Another way to put this is to say that elliptic functions must have at least one pole inside the parallelogram determined by the lines from the origin to ω_{1} and ω_{2}. A doubly periodic function’s values everywhere are repeats of its values on this parallelogram. If the function were continuous over this parallelogram (i.e. with no poles) then it would be bounded over the parallelogram and hence bounded everywhere. But Liovuille’s theorem says a bounded entire function must be constant.

[2] We don’t consider arbitrary singularities, only isolated poles. There are doubly periodic functions with essential singularities, but these are outside the definition of elliptic functions.

]]>Bob Martin published a dialog yesterday about the origin of structured programming, the idea that programs should not be written with `goto`

statements but should use less powerful, more specialized ways to transfer control. Edsgar Dijkstra championed this idea, most famously in his letter Go-to statement considered harmful. Since then there have been countless “considered harmful” articles that humorously allude to Dijkstra’s letter.

Toward the end of the dialog, Uncle Bob’s interlocutor says “Hurray for Dijkstra” for inventing the new technology of structured programming. Uncle Bob corrects him

New Technology? No, no, you misunderstand. … He didn’t invent anything. What he did was to identify something we

shouldn’t do. That’s not a technology. That’s adiscipline.

Huh? I thought Structured Programming made things better.Oh, it did. But not by giving us some new tools or technologies. It made things better by taking away a damaging tool.

The money quote is the last line above: **It made things better by taking away a damaging tool**.

Creating new tools gets far more attention than removing old tools. How might we be better off by letting go of a tool? When our first impulse is that we need a new technology, might we need a new discipline instead?

Few people have ever been able to convince an entire profession to let go of a tool they assumed was essential. If we’re to have any impact, most of us will need to aim much, much lower. It’s enough to improve our personal productivity and possibly that of a few peers. Maybe you personally would be better off without something that is beneficial to most people.

What are some technologies you’ve found that you’re better off not using?

**Related posts**: