So if you’re familiar with math but not Linux, and you run across the `base32` utility, you might naturally assume that the command converts numbers to base 32 using the symbols 0, 1, 2, …, 9, A, B, C, …, V. That’s a reasonable guess, but it actually uses the symbols A, B, C, …, Z, 2, 3, 4, 5, 6, and 7. It’s all described in RFC 3548.

What’s going on? The purpose of base 32 encoding is to render binary data in a way that is human readable and capable of being processed by software that was originally written with human readable input in mind. The purpose is not to carry out mathematical operations.
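For example, Python’s standard `base64` module implements this encoding:

```python
import base64

# Encode five bytes: 40 bits split into eight 5-bit groups,
# each mapped to the alphabet A-Z, 2-7.
encoded = base64.b32encode(b'hello')
print(encoded)  # b'NBSWY3DP'
```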

Note that the digit 0 is not used, because it’s visually similar to the letter O. The digit 1 is also not used, perhaps because it looks like a lowercase `l` in some fonts.

I don’t use `sed` very often, but it’s very handy when I do use it, particularly when needing to make a small change to a large file.
Lately I’ve been trying to fix a 30 MB JSON file that has been corrupted somehow. The file is one very long line.

Emacs was unable to open the file. (It might have eventually opened the file, but I killed the process when it took longer than I was willing to wait.)

Emacs can open large files, but it has trouble with long lines. Somewhere its data structures assume lines are not typically millions of characters long.

I used `sed` to add line breaks after closing brackets

sed -i 's/]/]\n/g' myfile.json

and then I was able to open the file in Emacs.

If the problem with my JSON file were simply a missing closing brace—it’s not—then I could add a closing brace with

sed -i 's/$/}/' myfile.json

I had a friend in college who got a job because of a `sed` script he wrote as an intern.

A finite element program would crash when it attempted to take too large a time step, but the program would not finish by the time the results were needed if it always took tiny time steps. So they’d let the program crash occasionally, then edit the configuration file with a smaller time step and restart the program.

They were asking engineers to work around the clock so someone could edit the configuration file and restart the finite element program if it crashed in the middle of the night. My friend wrote a shell script to automate this process, using `sed` to do the file editing. He eliminated the need for a night shift and got a job offer.

Γ(10.3) ≈ 0.7 × 9! + 0.3 × 10! = 1342656.

But the exact value is closer to 716430.69, and so our estimate is 87% too high. Not a very good approximation.

Now let’s try again, applying linear interpolation to the log of the gamma function. Our approximation is

log Γ(10.3) ≈ 0.7 × log 9! + 0.3 × log 10! = 13.4926

while the actual value is 13.4820, an error of about 0.08%. If we take exponentials to get an approximation of Γ(10.3), not log Γ(10.3), the error is larger, about 1%, but still much better than 87% error.
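These calculations are easy to check numerically; here is a quick sketch in Python using `math.lgamma`:

```python
from math import lgamma, exp, log, factorial

# Linear interpolation of Gamma(10.3) directly...
direct = 0.7*factorial(9) + 0.3*factorial(10)

# ...and of log Gamma(10.3)
loginterp = 0.7*log(factorial(9)) + 0.3*log(factorial(10))

exact = exp(lgamma(10.3))
print(direct / exact)          # roughly 1.87: direct interpolation is far off
print(exp(loginterp) / exact)  # roughly 1.01: interpolating the log does much better
```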

The gamma function grows very quickly, and so the log gamma function is usually easier to work with.

As a bonus, the Bohr–Mollerup theorem says that log gamma is a convex function. This tells us that not only does linear interpolation give an approximation, it gives us an upper bound.

The Bohr–Mollerup theorem essentially says that the gamma function is the only function that extends factorial from a function on the integers to a log-convex function on the real numbers. This isn’t quite true since it’s actually Γ(*x* + 1) that extends factorial. Showing the gamma function is unique is the hard part. In the preceding paragraph we used the easy direction of the theorem, saying that gamma is log-convex.

We could write a program to estimate the volume of a high-dimensional sphere this way. But there’s a problem: very few random samples will fall in the sphere. The ratio of the volume of a sphere to the volume of a box it fits in goes to zero as the dimension increases. We might take a large number of samples and none of them fall inside the sphere. In this case we’d estimate the volume as zero. This estimate would have small absolute error, but 100% relative error.

So instead of actually writing a program to randomly sample a high dimensional cube, let’s **imagine** that we did. Instead of doing a big Monte Carlo study, we could **be clever and use theory**.

Let *n* be our dimension. We want to draw uniform random samples from [−1, 1]^{n} and see whether they land inside the unit sphere. So we’d draw *n* random samples from [−1, 1] and see whether the sum of their squares is less than or equal to 1.

Let *X*_{i} be a uniform random variable on [−1, 1]. We want to know the probability that

*X*_{1}² + *X*_{2}² + *X*_{3}² + … + *X*_{n}² ≤ 1.

This would be an ugly calculation, but since we’re primarily interested in the case of large *n*, we can approximate the sum using the central limit theorem (CLT). We can show, using the transformation theorem, that each *X*_{i}² has mean 1/3 and variance 4/45. The CLT says that the sum has approximately the distribution of a normal random variable with mean *n*/3 and variance 4*n*/45.
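The moments quoted above are easy to sanity-check by simulation (a sketch, not a proof):

```python
import random

# Check that X^2, for X uniform on [-1, 1], has mean 1/3 and variance 4/45.
random.seed(0)
N = 100_000
squares = [random.uniform(-1, 1)**2 for _ in range(N)]
mean = sum(squares) / N
var = sum((s - mean)**2 for s in squares) / N
print(mean, var)  # approximately 0.333 and 0.0889
```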

The approach above turns out to be a bad idea, though it’s not obvious why.

The CLT does provide a good approximation of the sum above, **near the mean**. But we have a sum with mean *n*/3, with *n* large, and we’re asking for the probability that the sum is less than 1. In other words, we’re asking for the probability **in the tail** where the CLT approximation error is a bad (relative) fit. More on this here.
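To see how badly the tail approximation behaves, we can compare the CLT estimate with the exact probability, which is the volume of the unit ball divided by the volume of the cube [−1, 1]^n. A sketch for *n* = 10:

```python
from math import pi, sqrt, erf, gamma

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n = 10
# Exact probability: volume of the unit n-ball over the volume of [-1, 1]^n
exact = pi**(n/2) / gamma(n/2 + 1) / 2**n
# CLT approximation for P(sum of squares <= 1)
approx = normal_cdf(1, n/3, sqrt(4*n/45))
print(exact, approx)  # the CLT estimate is off by more than a factor of 2
```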

This post turned out not to be about what I thought it would be about. I thought it would lead to an asymptotic approximation for the volume of an *n*-dimensional sphere. I would compare the approximation to the exact value and see how well it did. Except it did terribly. So instead, this post is a cautionary tale about remembering how convergence works in the CLT.

and wondered how many sums of this form can be evaluated in closed form like this. Quite a few, it turns out.

Sums of the form

evaluate to a rational number when *k* is a non-negative integer and *c* is a rational number with |*c*| > 1. Furthermore, there is an algorithm for finding the value of the sum.

The sums can be evaluated using the **polylogarithm** function Li_{s}(*z*) defined as

using the identity

We then need to have a way to evaluate Li_{s}(*z*). This cannot be done in closed form in general, but it can be done when *s* is a negative integer as above. To evaluate Li_{−k}(*z*) we need to know two things. First,

and second,

Now Li_{0}(*z*) is a rational function of *z*, namely *z*/(1 − *z*). The derivative of a rational function is a rational function, and multiplying a rational function of *z* by *z* produces another rational function, so Li_{s}(*z*) is a rational function of *z* whenever *s* is a non-positive integer.

Assuming the results cited above, we can prove the identity

stated at the top of the post. The sum equals Li_{−3}(1/2), and

The result comes from plugging in *z* = 1/2 and getting out 26.

When *k* and *c* are positive integers, the sum

is not necessarily an integer, as it is when *k* = 3 and *c* = 2, but it is always rational. It looks like the sum is an integer if *c* = 2; I verified that the sum is an integer for *c* = 2 and *k* = 1 through 10 using the `PolyLog` function in Mathematica.
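A quick numerical check (not a proof) is also possible, since the partial sums converge rapidly:

```python
def power_sum(k, c, terms=300):
    # approximates the sum over n >= 1 of n^k / c^n
    return sum(n**k / c**n for n in range(1, terms))

print(power_sum(3, 2))  # approximately 26, the identity at the top of the post

# The sums for c = 2 appear to be integers for k = 1 through 10
for k in range(1, 11):
    s = power_sum(k, 2)
    assert abs(s - round(s)) < 1e-4
```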

**Update**: Here is a proof that the sum is an integer when *c* = 2. From a comment by Theophylline on Substack.

The sum is occasionally an integer for larger values of *c*. For example,

and

(0, 0, 0, …, 0)

and

(0, 0, 0, …, 1)

can’t have a radius larger than 1/2 because they’ll intersect at (0, 0, 0, …, 1/2).

Now center a little red sphere at the center of the cube, the point where every coordinate equals 1/2, and blow it up until it touches the blue spheres. Which is larger, the blue spheres or the red one?

The answer depends on *n*. In low dimensions, the blue spheres are larger. But in higher dimensions, the red sphere is larger.

The distance from the center of the hypercube to any of the vertices is √*n* / 2, and so the radius of the red sphere is (√*n* − 1) / 2. When *n* = 4, the radius of the red sphere equals the radius of the blue spheres.

The unit hypercube and the blue spheres will fit inside a hypercube whose sides have length 2. But if *n* > 9 the red sphere will spill out of this larger hypercube.
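The dimension thresholds are easy to verify:

```python
from math import sqrt

def red_radius(n):
    # distance from the cube's center to a vertex, minus the blue radius 1/2
    return (sqrt(n) - 1) / 2

print(red_radius(4))   # 0.5, the same as the blue spheres
print(red_radius(9))   # 1.0, exactly reaching the larger hypercube
print(red_radius(10))  # about 1.08, spilling out of it
```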

This post was motivated by a post by Aryeh Kontorovich on Twitter.

Obviously the unrestricted player would be expected to win more games, but how many more? At least one, because the unrestricted player knows what the restricted player will do on the last round.

If *n* is large, then in the early rounds of the game it makes little difference that one of the players is restricted. The restriction doesn’t give the unrestricted player much exploitable information. But in the later rounds of the game, the limitations on the restricted player’s moves increase the other player’s chances of winning.

It turns out that if the unrestricted player uses an optimal strategy, he can expect to win *O*(√*n*) more rounds than he loses. More precisely, the expected advantage approaches 1.4658√*n* as *n* grows. You can find a thorough analysis of the problem in this paper.

This function of *q* and *n* comes up in other contexts as well and has a name that we will get to shortly.

Leo August Pochhammer (1841–1920) defined the *k*th rising power of *x* by

Rising and falling powers come up naturally in combinatorics, finite difference analogs of calculus, and in the definition of hypergeometric functions.

The *q*-analog of the Pochhammer symbol is defined as

Like the Pochhammer symbol, the *q*-Pochhammer symbol also comes up in combinatorics.

In general, *q*-analogs of various concepts are generalizations that reduce to the original concept in some sort of limit as *q* goes to 1. The relationship between the *q*-Pochhammer symbol and the Pochhammer symbol is

For a simpler introduction to *q*-analogs, see this post on *q*-factorial and *q*-binomial coefficients.

The motivation for this post was to give a name to the function that gives the probability that a random *n* by *n* matrix over a finite field of order *q* is invertible. In the notation above, this function is (1/*q*; 1/*q*)_{n}.

There’s a confusing notational coincidence here. The number of elements in a finite field is usually denoted by *q*. The *q* in *q*-analogs such as the *q*-Pochhammer symbol has absolute value less than 1. It’s a coincidence that they both use the letter *q*, and the *q* in our application of *q*-Pochhammer symbols is the reciprocal of the *q* representing the order of a finite field.

I mentioned in the previous post that the probability of the matrix being invertible is a decreasing function of *n*. This probability decreases to a positive limit, varying with the value of *q*. This limit is (1/*q;* 1/*q*)_{∞}. Here the subscript ∞ denotes that we take the limit in (1/*q;* 1/*q*)_{n} as *n* goes to infinity. There’s no problem here because the infinite product converges.

The *q*-Pochhammer symbol (*a*; *q*)_{n} is implemented in Mathematica as `QPochhammer[a, q, n]`, and the special case (*q*; *q*)_{∞} is implemented as `QPochhammer[q]`. We can use the latter to make the following plot.

Plot[QPochhammer[q], {q, -1, 1}]

Recall that the *q* in our motivating application is the reciprocal of the *q* in the *q*-Pochhammer symbol. This says for large fields, the limiting probability that an *n* by *n* matrix is invertible as *n* increases is near 1, but that for smaller fields the limiting probability is also smaller. For *q* = 2, the probability is 0.288788.

Plot[QPochhammer[1/q], {q, 2, 100}, PlotRange -> All]

The number of possible *n* × *n* matrices with entries from a field of size *q* is *q*^{n²}. The set of invertible *n* × *n* matrices over a field with *q* elements is *GL*_{n}(*q*) and the number of elements in this set is [1]

The probability that an *n* × *n* matrix is invertible is then

which is an increasing function of *q* and a decreasing function of *n*. More on this function in the next post.
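This probability is also easy to compute directly; a minimal sketch:

```python
def prob_invertible(q, n):
    # probability that a random n x n matrix over GF(q) is invertible,
    # i.e. the product (1 - q^-1)(1 - q^-2)...(1 - q^-n)
    p = 1.0
    for k in range(1, n + 1):
        p *= 1 - q**(-k)
    return p

print(prob_invertible(2, 50))  # about 0.2888, essentially the limiting value
print(prob_invertible(5, 50))  # about 0.76, larger for the larger field
```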

- Spaces and subspaces over finite fields
- Finite projective planes
- Finite field Diffie-Hellman encryption

[1] Robert A. Wilson. The Finite Simple Groups. Springer, 2009.

The post Solvability of linear systems over finite fields first appeared on John D. Cook.

Someone recently asked me why medical tests always have an error rate. It’s a good question.

A test is necessarily a proxy, a substitute for something else. You don’t run a test to determine whether someone has a gunshot wound: you just look.

A test for a disease is based on a pattern someone discovered. People who have a certain condition *usually* have a certain chemical in their blood, and people who do not have the condition *usually* do not have that chemical in their blood. But it’s not a direct observation of the disease.

“Usually” is as good as you can do in biology. It’s difficult to make universal statements. When I first started working with genetic data, I said something to a colleague about genetic sequences being made of A, C, G, and T. I was shocked when he replied “Usually. This is biology. Nothing always holds.” It turns out some viruses have Zs (aminoadenine) rather than As (adenine).

Error rates may be very low and still be misleading. A positive test for a rare disease is usually a false positive, even if the test is highly accurate. This is a canonical example of the need to use Bayes’ theorem. The details are written up here.
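Here is a quick Bayes’ theorem calculation, with made-up numbers, illustrating the point:

```python
# Hypothetical numbers: a disease with prevalence 1 in 1,000,
# and a test with 99% sensitivity and 99% specificity.
prevalence = 0.001
sensitivity = 0.99
specificity = 0.99

p_positive = sensitivity*prevalence + (1 - specificity)*(1 - prevalence)
p_disease_given_positive = sensitivity*prevalence / p_positive
print(p_disease_given_positive)  # about 0.09: most positives are false positives
```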

The more often you test for something, the more often you’ll find it, correctly or incorrectly. The more conditions you test, the more conditions you find, correctly or incorrectly.

Wouldn’t it be great if your toilet had a built in lab that tested you for hundreds of diseases several times a day? No, it would not! The result would be frequent false positives, leading to unnecessary anxiety and expense.

Up to this point we’ve discussed medical tests, but the same applies to any kind of test. Surveillance is a kind of test, one that also has false positives and false negatives. The more often Big Brother reads your email, the more likely it is that he will find something justifying a knock on your door.

Imagine parallel parking is available along the shoulder of a road, but no parking spaces are marked.

The first person to park picks a spot on the shoulder at random. Then another car also chooses a spot along the shoulder at random, with the constraint that the second car can’t overlap the first. This process continues until no more cars can park. How many people can park this way?

Assume all cars have the same length, and we choose our units so that cars have length 1. The expected number of cars that can park is a function *M*(*x*) of the length of the parking strip *x*. Clearly if *x* < 1 then *M*(*x*) = 0. Alfréd Rényi [1] found that for *x* ≥ 1, *M*(*x*) satisfies the equation

This is an unusual equation, difficult to work with because it defines *M* only implicitly. It’s not even clear that the equation has a solution. But it does, and the ratio of *M*(*x*) to *x* approaches a constant as *x* increases.

The number *m* is known as **Rényi’s parking constant**.

This says for a long parking strip, parking sequentially at random will allow about 3/4 as many cars to park as if you were to pack the cars end-to-end.
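The random parking process is easy to simulate exactly: the first car splits the strip into two independent sub-strips, so a short recursive sketch suffices.

```python
import random

def park(length):
    # number of cars (each of length 1) that end up parked on a strip
    # of the given length under random sequential parking
    if length < 1:
        return 0
    pos = random.uniform(0, length - 1)  # left end of the first car
    return 1 + park(pos) + park(length - pos - 1)

random.seed(0)
x = 500.0
trials = 200
estimate = sum(park(x) for _ in range(trials)) / (trials * x)
print(estimate)  # close to Renyi's constant 0.7476
```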

[1] Steven R. Finch. Mathematical Constants. Cambridge University Press, 2003.

The post Rényi’s parking constant first appeared on John D. Cook.

The word *planet* means “wanderer.” This is because the planets appear to wander in the night sky. Unlike stars, which appear to orbit the earth in a circle as the earth orbits the sun, planets occasionally appear to reverse course. When planets appear to move backward this is called retrograde motion.

Here’s what the motion of Venus would look like over a period of 8 years as explored here.

Venus completes 13 orbits around the sun in the time it takes Earth to complete 8 orbits. The ratio isn’t exactly 13 to 8, but it’s very close. Five times over the course of eight years Venus will appear to reverse course for a few days. How many days? We will get to that shortly.

When we speak of the motion of the planets through the night sky, we’re not talking about their rising and setting each day due to the rotation of the earth on its axis. We’re talking about their motion from night to night. The image above is how an observer far above the Earth and not rotating with the Earth would see the position of Venus over the course of eight years.

The orbit of Venus as seen from earth is beautiful but complicated. From the Copernican perspective, the orbits of Earth and Venus are simply concentric circles. You may bristle at my saying planets have circular rather than elliptical orbits [1]. The orbits are not exactly circles, but are so close to circular that you cannot see the difference. For the purposes of this post, we’ll assume planets orbit the sun in circles.

There is a surprisingly simple equation [2] for finding the times when a planet will appear to change course:

cos(*kt*) = √(*rR*) / (*r* + *R* − √(*rR*))

Here *r* is the radius of Earth’s orbit and *R* is the radius of the other planet’s orbit [3]. The constant *k* is the difference in angular velocities of the two planets. You can solve this equation for the times when the apparent motion changes.

Note that the equation is entirely symmetric in *r* and *R*. So a Venusian observing Earth and an Earthling observing Venus would agree on the times when the apparent motions of the two planets reverse.

Let’s find when Venus enters and leaves retrograde motion. Here are the constants we need.

from math import pi, sqrt

r = 1                # AU
R = 0.72332          # AU
venus_year = 224.70  # days
earth_year = 365.24  # days

k = 2*pi/venus_year - 2*pi/earth_year
c = sqrt(r*R) / (r + R - sqrt(r*R))

With these constants we can now plot cos(*kt*) and see when it equals *c*.

This shows there are five times over the course of eight years when Venus is in apparent retrograde motion.

If we set time *t* = 0 to be a time when Earth and Venus are aligned, we start in the middle of retrograde period. Venus enters prograde motion 21 days later, and the next retrograde period begins at day 563. So out of every 584 days, Venus spends 42 days in retrograde motion and 542 days in prograde motion.
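Solving cos(*kt*) = *c* for the first positive root gives the durations quoted above; a self-contained sketch:

```python
from math import pi, sqrt, acos

r = 1.0              # AU
R = 0.72332          # AU
venus_year = 224.70  # days
earth_year = 365.24  # days

k = 2*pi/venus_year - 2*pi/earth_year  # relative angular velocity, radians/day
c = sqrt(r*R) / (r + R - sqrt(r*R))

t = acos(c) / k  # days after alignment until Venus resumes prograde motion
print(t)         # about 21 days, so each retrograde period lasts about 42 days
```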

[1] Planets do not exactly orbit in circles. They don’t *exactly* orbit in ellipses either. Modeling orbits as ellipses is much more accurate than modeling them as circles, but still not perfectly accurate.

[2] 100 Great Problems of Elementary Mathematics: Their History and Solution. By Heinrich Dörrie. Dover, 1965.

[3] There’s nothing unique about observing planets from Earth. Here “Earth” simply means the planet you’re viewing from.

The post Calculating when a planet will appear to move backwards first appeared on John D. Cook.

The business principle of kaizen, from the Japanese 改善 (“improvement”), is based on the assumption that incremental improvements accumulate. But quantifying how improvements accumulate takes some care.

Two successive 1% improvements amount to a 2% improvement. But two successive 50% improvements amount to a 125% improvement. So sometimes you can add, and sometimes you cannot. What’s going on?

An *x*% improvement multiplies something by 1 + *x*/100. For example, if you earn 5% interest on a principal of *P* dollars, you now have 1.05 *P* dollars.

So an *x*% improvement followed by a *y*% improvement multiplies by

(1 + *x*/100)(1 + *y*/100) = 1 + (*x* + *y*)/100 + *xy*/10000.

If *x* and *y* are small, then *xy*/10000 is negligible. But if *x* and *y* are large, the product term may not be negligible, depending on context. I go into this further in this post: Small probabilities add, big ones don’t.
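The multiplication rule above is a one-liner:

```python
def combine(x, y):
    # net percent improvement from an x% improvement followed by a y% one
    return ((1 + x/100) * (1 + y/100) - 1) * 100

print(combine(1, 1))    # about 2.01: essentially additive
print(combine(50, 50))  # 125.0: the cross term matters
```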

Now let’s look at a variation. Suppose doing one thing by itself brings an *x*% improvement and doing another thing by itself makes a *y*% improvement. How much improvement could you expect from doing both?

For example, suppose you find through A/B testing that changing the font on a page increases conversions by 10%. And you find in a separate A/B test that changing an image on the page increases conversions by 15%. If you change the font and the image, would you expect a 25% increase in conversions?

The issue here is not so much whether it is appropriate to add percentages. Since

1.1 × 1.15 = 1.265

you don’t get a much different answer whether you multiply or add. But maybe you could change the font and the image and conversions increase 12%. Maybe either change alone creates a better impression, but together they don’t make a better impression than doing one of the changes. Or maybe the new font and the new image clash somehow and doing both changes together *lowers* conversions.

The statistical term for what’s going on is interaction effects. A sequence of small improvements creates an additive effect if the improvements are independent. But the effects could be dependent, in which case the whole is less than the sum of the parts. This is typical. Assuming that improvements are independent is often overly optimistic. But sometimes you run into a synergistic effect and the whole is greater than the sum of the parts.

In the example above, we imagine testing the effect of a font change and an image change separately. What if we first changed the font, then with the new font tested the image? That’s better. If there were a clash between the new font and the new image we’d know it.

But we’re missing something here. If we had tested the image first and then tested the new font with the new image, we might have gotten different results. In general, the order of sequential testing matters.

If you have a small number of things to test, you can discover interaction effects by doing a factorial design, either a full factorial design or a fractional factorial design.

If you have a large number of things to test, you’ll have to do some sort of sequential testing. Maybe you do some combination of sequential and factorial testing, guided by which effects you have reason to believe will be approximately independent.

In practice, a testing plan needs to balance simplicity and statistical power. Sequentially testing one option at a time is simple, and may be fine if interaction effects are small. But if interaction effects are large, sequential testing may be leaving money on the table.

If you’d like some help with testing, or with web analytics more generally, we can help.

The post Do incremental improvements add, multiply, or something else? first appeared on John D. Cook.

I wondered whether it also sounds like a sawtooth wave, and indeed it does. More on that shortly.

The Clausen function can be defined in terms of its Fourier series:

The function commonly known as *the* Clausen function is one of a family of functions, hence the subscript 2. The Clausen functions for all non-negative integers *n* are defined by replacing 2 with *n* on both sides of the defining equation.

The Fourier coefficients decay quadratically, as do those of a triangle wave or sawtooth wave, as discussed here. This implies the function Cl_{2}(*x*) cannot have a continuous derivative. In fact, the derivative of Cl_{2}(*x*) is infinite at 0. This follows quickly from the integral representation of the function.
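A truncated Fourier series makes both claims easy to check numerically (a sketch, not production code):

```python
from math import sin, pi

def clausen2(x, terms=100_000):
    # truncated Fourier series: Cl_2(x) = sum over n of sin(n x)/n^2
    return sum(sin(n*x) / n**2 for n in range(1, terms + 1))

# Cl_2(pi/2) is Catalan's constant, about 0.9160
print(clausen2(pi/2))

# Difference quotients near 0 keep growing, reflecting the infinite derivative
for h in [0.1, 0.01, 0.001]:
    print(clausen2(h) / h)
```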

The fundamental theorem of calculus shows that the derivative

blows up at 0.

Now suppose we create an audio clip of Cl_{2}(440*x*). This creates a sound with pitch A 440, but rather than a sine wave it has an unpleasant buzzing sound, much like a sawtooth wave.

The harshness of the sound is due to the slow decay of the Fourier coefficients; the Fourier coefficients of more pleasant musical sounds decay much faster than quadratically.

Next you draw a circle through the vertices of the triangle, and draw a square outside that.

Then you draw a circle through the vertices of the square, and draw a pentagon outside that.

Then you think “Will this ever stop?!”, meaning the meeting, but you could ask a similar question about your doodling: does your sequence of doodles converge to a circle of finite radius, or do they grow without bound?

An *n*-gon circumscribed on the outside of a circle of radius *r* is inscribed in a circle of radius *r* / cos(π/*n*).

So if you start with a unit circle, the radius of the circle through the vertices of the *N*-gon is the product of the factors 1/cos(π/*n*) for *n* = 3, 4, …, *N*,

and the limit as *N* → ∞ exists. The limit is known as the **polygon circumscribing constant**, and it equals 8.7000366252….
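We can check the value of the constant by computing a partial product (a quick numerical sketch):

```python
from math import cos, pi

# Partial product of sec(pi/n) for n = 3 .. N approaches
# the polygon circumscribing constant 8.7000366252...
p = 1.0
N = 1_000_000
for n in range(3, N + 1):
    p /= cos(pi/n)
print(p)  # about 8.7000
```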

We can visualize the limit by making a plot with large *N*. The plot is less cluttered if we leave out the circles and just show the polygons. *N* = 30 in the plot below.

The reciprocal of the polygon circumscribing constant is known as the **Kepler-Bouwkamp constant**. The Kepler-Bouwkamp constant is the limiting radius if you were to reverse the process above, *inscribing* polygons at each step rather than *circumscribing* them. It would make sense to call the Kepler-Bouwkamp constant the polygon *inscribing* constant, but for historical reasons it is named after Johannes Kepler and Christoffel Bouwkamp.

The specification for the NPI number format says that the first digit must be either 1 or 2. Currently every NPI in the database starts with 1. There are about 8.4 million NPIs currently, so it’ll be a while before they’ll need to roll the first digit over to 2.

The last digit of the NPI is a check sum. The check sum uses the Luhn algorithm, the same check sum used for credit cards and other kinds of identifiers. The Luhn algorithm was developed in 1954 and was designed to be easy to implement by hand. It’s kind of a quirky algorithm, but it will catch all single-digit errors and nearly all transposition errors.

The Luhn algorithm is not applied to the NPI itself, but to the number formed by prepending 80840 to the first nine digits of the NPI.

For example, let’s look at 1993999998. This is not (currently) anyone’s NPI, but it has a valid NPI format because the Luhn checksum of 80840199399999 is 8. We will verify this with the code below.

The following code computes the Luhn checksum.

def checksum(payload):
    digits = [int(c) for c in reversed(str(payload))]
    s = 0
    for i, d in enumerate(digits):
        if i % 2 == 0:
            t = 2*d
            if t > 9:
                t -= 9
            s += t
        else:
            s += d
    return (s*9) % 10

And the following checks whether the last digit of a number is the checksum of the previous digits.

def verify(fullnumber):
    payload = fullnumber // 10
    return checksum(payload) == fullnumber % 10

And finally, the following validates an NPI number.

def verify_npi(npi):
    return verify(int("80840" + str(npi)))

Here we apply the code above to the hypothetical NPI number mentioned above.

assert(checksum(80840199399999) == 8)
assert(verify(808401993999998))
assert(verify_npi(1993999998))

How can you possibly solve a mission-critical problem with millions of variables—when the worst-case computational complexity of *every known algorithm* for that problem is exponential in the number of variables?

SAT (Satisfiability) solvers have seen dramatic orders-of-magnitude performance gains for many problems through algorithmic improvements over the last couple of decades or so. The SAT problem—finding an assignment of Boolean variables that makes a given Boolean expression true—represents the archetypal NP-complete problem and in the general case is intractable.

However, for many practical problems, solutions can be found very efficiently by use of modern methods. This “killer app” of computer science, as described by Donald Knuth, has applications to many areas, including software verification, electronic design automation, artificial intelligence, bioinformatics, and planning and scheduling.

Its uses are surprising and diverse, from running billion dollar auctions to solving graph coloring problems to computing solutions to Sudoku puzzles. As an example, I’ve included a toy code below that uses SMT, a relative of SAT, to find the English language suffix rule for regular past tense verbs (“-ed”) from data.

When used as a machine learning method, SAT solvers are quite different from other methods such as neural networks. SAT solvers can for some problems have long or unpredictable runtimes (though MAXSAT can sometimes relax this restriction), whereas neural networks have essentially fixed inference cost (though looping agent-based models do not).

On the other hand, answers from SAT solvers are always guaranteed correct, and the process is interpretable; this is currently not so for neural network-based large language models.

To understand better how to think about this difference in method capabilities, we can take a lesson from the computational science community. There, it is common to have a well-stocked computational toolbox of both slow, accurate methods and fast, approximate methods.

In computational chemistry, ab initio methods can give highly accurate results by solving Schrödinger’s equation directly, but only scale to limited numbers of atoms. Molecular dynamics (MD), however, relies more on approximations, but scales efficiently to many more atoms. Both are useful in different contexts. In fact, the two methodologies can cross-pollinate, for example when ab initio calculations are used to devise force fields for MD simulations.

A lesson to take from this: it is paramount to find the best tool for the given problem, using any and all means at one’s disposal.

The following are some of my favorite general references on SAT solvers:

- Donald Knuth, The Art Of Computer Programming: Volume 4 Fascicle 6, Satisfiability. A coherent introduction to the methods, accessible to the mathematically inclined.
- Marijn J.H. Heule, Advanced Topics in Logic: Automated Reasoning and Satisfiability. Slides from an advanced class by one of the world experts.
- A. Biere, H. van Maaren, M. Heule, eds, Handbook of Satisfiability. An in-depth tome on state of the art, arranged topically.
- Daniel Kroening and Ofer Strichman, Decision Procedures: An Algorithmic Point of View. An accessible book on an important class of extensions to SAT: satisfiability modulo theories (SMT), which make SAT-type methods applicable to a much wider range of problems.
- Documentation for Z3, an innovative open-source solver for solving SAT, SMT, MAXSAT and combinatorial optimization problems.
- Simons Institute workshops here, here, here, here, and here covering a great range of topics and state of the art research.

It would seem that unless P = NP, commonly suspected to be false, the solution of these kinds of problems for any possible input is hopelessly beyond reach of even the world’s fastest computers. Thankfully, many of the problems we care about have an internal structure that makes them much more solvable (and likewise for neural networks). Continued improvement of SAT/SMT methods, in theory and implementation, will greatly benefit the effective solution of these problems.

import csv
import z3

def char2int(c):
    return ord(c) - ord('a')

def int2char(i):
    return chr(i + ord('a'))

# Access the language data from the file.
with open('eng_cols.txt', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    table = [row for row in reader]
nrow, ncol = len(table), len(table[0])

# Identify which columns of input table have stem and targeted word form.
stem_col, form_col = 0, 1

# Calculate word lengths.
nstem = [len(table[i][stem_col]) for i in range(nrow)]
nform = [len(table[i][form_col]) for i in range(nrow)]

# Length of suffix being sought.
ns = 2

# Initialize optimizer.
solver = z3.Optimize()

# Define variables to identify the characters of suffix; add constraints.
var_suf = [z3.Int(f'var_suf_{i}') for i in range(ns)]
for i in range(ns):
    solver.add(z3.And(var_suf[i] >= 0, var_suf[i] < 26))

# Define variables to indicate whether the given word matches the rule.
var_m = [z3.Bool(f'var_m_{i}') for i in range(nrow)]

# Loop over words.
for i in range(nrow):
    # Constraint on number of characters.
    constraints = [nform[i] == nstem[i] + ns]
    # Constraint that the form contains the stem.
    for j in range(nstem[i]):
        constraints.append(
            table[i][stem_col][j] == table[i][form_col][j]
            if j < nform[i] else False)
    # Constraint that the end of the word form matches the suffix.
    for j in range(ns):
        constraints.append(
            char2int(table[i][form_col][nform[i]-1-j]) == var_suf[j]
            if j < nform[i] else False)
    # var_m[i] is the "and" of all these constraints.
    solver.add(var_m[i] == z3.And(constraints))

# Seek suffix that maximizes number of matches.
count = z3.Sum([z3.If(var_m[i], 1, 0) for i in range(nrow)])
solver.maximize(count)

# Run solver, output results.
if solver.check() == z3.sat:
    model = solver.model()
    suf = [model[var_suf[i]] for i in range(ns)]
    print('Suffix identified: ' +
          ''.join(list([int2char(suf[i].as_long())
                        for i in range(ns)]))[::-1])
    print('Number of matches: ' + str(model.evaluate(count)) +
          ' out of ' + str(nrow) + '.')
    var_m_values = [model[var_m[i]] for i in range(nrow)]
    print('Matches:')
    for i in range(nrow):
        if var_m_values[i]:
            print(table[i][stem_col], table[i][form_col])

The post Getting some (algorithmic) SAT-isfaction first appeared on John D. Cook.

The second step is always the same, applying Lagrange interpolation with enough points to get the accuracy you need. But the first step, range reduction, depends on the function being evaluated. And as the previous post concluded, evaluating more advanced functions generally requires more advanced range reduction.

For the gamma function, the identity

Γ(*x* + 1) = *x* Γ(*x*)

can be used to reduce the problem of computing Γ(*x*) for any real *x* to the problem of computing Γ(*x*) for *x* in the interval [1, 2]. If *x* is large, the identity will have to be applied many times and so this would be a lot of work. However, the larger *x* is, the more accurate Stirling’s approximation becomes.
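In code, this range reduction might look like the following sketch, which applies Γ(*x*) = (*x* − 1) Γ(*x* − 1) repeatedly and checks the result against `math.gamma` (the helper name `gamma_reduced` is mine, not from the post):

```python
from math import gamma  # used here only to verify the reduction

def gamma_reduced(x):
    """Reduce computing gamma(x) for x > 2 to computing gamma(t), t in [1, 2].

    Returns (factor, t) such that gamma(x) = factor * gamma(t),
    using the identity gamma(x) = (x - 1) * gamma(x - 1).
    """
    factor = 1.0
    while x > 2:
        x -= 1
        factor *= x
    return factor, x

factor, t = gamma_reduced(5.5)
print(factor * gamma(t))  # agrees with gamma(5.5)
```

For large *x* the loop runs many times, which is why Stirling's approximation takes over in that regime.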

Computing Γ(*x* + *iy*) is more complex, pardon the pun. We can still use the identity above to reduce the *real* part *x* of the argument to lie in the interval [1, 2], but what about the *imaginary* part *y*?

Abramowitz and Stegun, Table 6.7, tabulates the principal branch of log Γ(*x* + *iy*) for *x* from 1 to 2 and for *y* from 0 to 10, both in increments of 0.1. Generally the logarithm of the gamma function is more useful in computation than the gamma function itself. It is also easier to interpolate, which I imagine is the main reason A&S tabulate it rather than the gamma function *per se*. A note below Table 6.7 says that linear interpolation gives about 3 significant figures, and eight-point interpolation gives about 8 figures.

By the Schwarz reflection principle, Γ(*x* − *iy*) is the complex conjugate of Γ(*x* + *iy*), and with this we can extend our range on *y* to [−10, 10].

What about larger *y*? We have two options: the duplication formula and Stirling’s approximation.

The duplication formula

Γ(*z*) Γ(*z* + ½) = 2^{1−2*z*} √π Γ(2*z*)

lets us compute Γ(2*z*) if we can compute Γ(*z*) and Γ(*z* + ½).
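The duplication formula in question is Legendre's; here is a quick numerical check on the real line (the test point *z* = 1.7 is arbitrary):

```python
from math import gamma, pi, sqrt

z = 1.7  # arbitrary test point
lhs = gamma(z) * gamma(z + 0.5)
rhs = 2**(1 - 2*z) * sqrt(pi) * gamma(2*z)  # Legendre duplication formula
print(lhs, rhs)  # the two sides agree to floating point accuracy
```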

Stirling’s approximation for Γ(*z*) is accurate when |*z*| is large, and |*x* + *iy*| is large when |*y*| is large.

For example, the Handbook of Mathematical Functions edited by Abramowitz and Stegun tabulates sines and cosines in increments of one tenth of a degree, from 0 degrees to 45 degrees. What if your angle was outside the range 0° to 45° or if you needed to specify your angle more precisely than 1/10 of a degree? What if you wanted, for example, to calculate cos 203.147°?

The high-level answer is that you would use range reduction and interpolation. You’d first use range reduction to reduce the problem of working with any angle to the problem of working with an angle between 0° and 45°, then you’d use interpolation to get the necessary accuracy for a value within this range.

OK, but how exactly do you do the range reduction, and how exactly do you do the interpolation? This isn’t deep, but it’s not trivial either.

Since sine and cosine have a period of 360°, you can add or subtract some multiple of 360° to obtain an angle between −180° and 180°.

Next, you can use parity to reduce the range further. That is, since sin(−*x*) = −sin(*x*) and cos(−*x*) = cos(*x*) you can reduce the problem to computing the sine or cosine of an angle between 0 and 180°.

The identities sin(180° − *x*) = sin(*x*) and cos(180° − *x*) = −cos(*x*) let you reduce the range further to between 0° and 90°.

Finally, the identities cos(*x*) = sin(90° − *x*) and sin(*x*) = cos(90° − *x*) can reduce the range to 0° to 45°.
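The whole chain of reductions, for cosine, can be sketched in a few lines of Python (the function name and return convention are mine):

```python
import math

def reduce_cos(deg):
    """Express cos(deg degrees) as sign * (sin or cos)(angle), angle in [0, 45]."""
    x = ((deg + 180.0) % 360.0) - 180.0  # periodicity: now in [-180, 180)
    x = abs(x)                           # parity: cos(-x) = cos(x)
    sign = 1
    if x > 90:
        x = 180 - x                      # cos(180 - x) = -cos(x)
        sign = -sign
    if x > 45:
        return sign, 'sin', 90 - x       # cos(x) = sin(90 - x)
    return sign, 'cos', x

print(reduce_cos(203.147))  # cos 203.147 degrees reduces to -cos 23.147 degrees
```

With the angle reduced to [0°, 45°], the table lookup and interpolation take over.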

You can fill in between the tabulated angles using interpolation, but how accurate will your result be? How many interpolation points will you need to use in order to get single precision, e.g. an error on the order of 10^{−7}?

The tables tell you. As explained in this post on using a table of logarithms, the tables have a notation at the bottom of the table that tells you how many Lagrange interpolation points to use and what kind of accuracy you’ll get. Five interpolation points will give you roughly single precision accuracy, and the notation gives you a little more accurate error bound. The post on using log tables also explains how Lagrange interpolation works.
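For a feel of how linear interpolation performs on a tenth-of-a-degree grid, here is a quick sketch, with `math.cos` standing in for the printed table entries:

```python
import math

def cos_deg(d):
    # stand-in for looking up a tabulated cosine value
    return math.cos(math.radians(d))

# Interpolate cos 23.147 degrees between tabulated values at 23.1 and 23.2.
x0, x1 = 23.1, 23.2
p = (23.147 - x0) / (x1 - x0)
approx = (1 - p) * cos_deg(x0) + p * cos_deg(x1)
exact = cos_deg(23.147)
print(abs(approx - exact))  # error on the order of 1e-7
```

More interpolation points shrink the error further, which is what the five-point Lagrange recipe buys you.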

I intend to write more posts on using tables. The general pattern is always range reduction and interpolation, but it takes more advanced math to reduce the range of more advanced functions.

**Update**: The next post shows how to use tables to compute the gamma function for complex arguments.

Kepler noted in 1609 that you could approximate the perimeter of an ellipse as the perimeter of a circle whose radius is the mean of the semi-axes of the ellipse, where the mean could be either the arithmetic mean or the geometric mean. The previous post showed that the arithmetic mean is more accurate, and that it under-estimates the perimeter. This post will explain both of these facts.

There are several series for calculating the perimeter of an ellipse. In 1798 James Ivory came up with a series that converges more rapidly than previously discovered series. Ivory’s series is

*P* = π(*a* + *b*) Σ_{n ≥ 0} ( (2*n* − 3)!! / (2*n*)!! )² *h*^{n}

where

*h* = ( (*a* − *b*) / (*a* + *b*) )².

If you’re not familiar with the !! notation, see this post on multifactorials.

The 0th order approximation using Ivory’s series, dropping all the infinite series terms, corresponds to Kepler’s approximation using the arithmetic mean of the semi-axes *a* and *b*. By convention the semi-major axis is labeled *a* and the semi-minor axis *b*, but the distinction is unnecessary here since Ivory’s series is symmetric in *a* and *b*.

Note that *h* ≥ 0 and *h* = 0 only if the ellipse is a circle. So the terms in the series are positive, which explains why Kepler’s approximation under-estimates the perimeter.

Using just one term in Ivory’s series gives a very good approximation

*P* ≈ π(*a* + *b*)(1 + *h*/4).

The approximation error increases as the ratio of *a* to *b* increases, but even for a highly eccentric ellipse like the orbit of the Mars Orbiter Mission, the ratio of *a* to *b* is only 2, and the relative approximation error is about 1/500, about 12 times more accurate than Kepler’s approximation.
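As a numerical check (my own sketch, with *a* = 2 and *b* = 1 to match the ratio above), we can compare Kepler's arithmetic-mean approximation and the one-term series against a perimeter computed by quadrature:

```python
import math

def perimeter(a, b, n=20000):
    # trapezoid rule for the ellipse arc length over one quadrant
    h = (math.pi / 2) / n
    total = 0.0
    for i in range(n + 1):
        t = i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.hypot(a * math.sin(t), b * math.cos(t))
    return 4 * total * h

a, b = 2.0, 1.0
exact = perimeter(a, b)
kepler = math.pi * (a + b)            # arithmetic-mean approximation
h2 = ((a - b) / (a + b))**2
one_term = math.pi * (a + b) * (1 + h2 / 4)

print(exact - kepler)    # both approximations fall short of the true perimeter,
print(exact - one_term)  # but the one-term version falls short by far less
```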

*P* ≈ 2π(ab)^{½}

or

*P* ≈ π(*a* + *b*).

In other words, you can approximate the perimeter of an ellipse by the circumference of a circle of radius *r* where *r* is either the geometric mean or arithmetic mean of the semi-major and semi-minor axes.

How good are these approximations, particularly when *a* and *b* are roughly equal? Which one is better?

We can choose our unit of measurement so that the semi-minor axis *b* equals 1, then plot the error in the two approximations as *a* increases.

We see from this plot that both approximations give lower bounds, and that the arithmetic mean is more accurate than the geometric mean.

Incidentally, if we used the geometric mean of the semi-axes as the radius of a circle when approximating the *area* then the results would be exactly correct. But for perimeter, the arithmetic mean is better.

Next, if we just consider ellipses in which the semi-major axis is no more than twice as long as the semi-minor axis, the arithmetic approximation is within 2% of the exact value and the geometric approximation is within 8%. Both approximations are good when *a* ≈ *b*.

The next post goes into more mathematical detail, explaining why Kepler’s approximation behaves as it does and giving ways to improve on it.

One way to capture the structure of a graph is its adjacency matrix *A*. Each element *a*_{ij} of this matrix equals 1 if there is an edge between the *i*th and *j*th node and a 0 otherwise. If you square the matrix *A*, the (*i*, *j*) entry of the result tells you how many paths of length 2 are between the *i*th and *j*th nodes. In general, the (*i,* *j*) entry of *A*^{n} tells you how many paths of length *n* there are between the corresponding nodes.
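Here is a small sketch of the path-counting property on a toy graph (a triangle with a pendant node, my own example):

```python
import numpy as np

# Adjacency matrix for the edges 0-1, 1-2, 2-0, 2-3.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

A2 = A @ A
print(A2[0, 1])  # 1: the only length-2 path from node 0 to node 1 is 0-2-1
print(A2[0, 0])  # 2: out-and-back along each of node 0's two edges
print(A2[0, 3])  # 1: the path 0-2-3
```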

Calculating eigenvector centrality requires finding an eigenvector associated with the largest eigenvalue of *A*. One way to find such an eigenvector is the power method. You start with a random initial vector and repeatedly multiply it by *A*. This produces a sequence of vectors that converges to the eigenvector we’re after.

Conceptually this is the same as computing *A*^{n} first and multiplying it by the random initial vector. So not only does the matrix *A*^{n} count paths of length *n*, for large *n* it helps us compute eigenvector centrality.

Now for a little fine print. The power method will converge for any starting vector that has some component in the direction of the eigenvector you’re trying to find. This is almost certainly the case if you start with a vector chosen at random. The power method also requires that the matrix have a unique eigenvalue of largest magnitude and that its associated eigenspace have dimension 1. The post on eigenvector centrality stated that these conditions hold, provided the network is connected.

In principle, you could calculate eigenvector centrality by computing *A*^{n} for some large *n*. In practice, you’d never do that. For a square matrix of size *N*, a matrix-vector product takes *O*(*N*²) operations, whereas squaring *A* requires *O*(*N*³) operations. So you would repeatedly apply *A* to a vector rather than computing powers of *A*.

Also, you wouldn’t use the power method unless *A* is sparse, making it relatively efficient to multiply by *A*. If most of the entries of *A* are zeros, and there is an exploitable pattern to where the non-zero elements are located, you can multiply *A* by a vector with far less than *N*² operations.
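A minimal sketch of the power method, using a dense matrix for simplicity (the function name and the toy graph are mine):

```python
import numpy as np

def eigenvector_centrality(A, iters=200, seed=0):
    """Power method: repeatedly apply A to a random vector and normalize."""
    rng = np.random.default_rng(seed)
    x = rng.random(A.shape[0])
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    return x

# Triangle with a pendant node: edges 0-1, 1-2, 2-0, 2-3.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

x = eigenvector_centrality(A)
print(x)  # node 2, the best-connected node, gets the highest centrality
```

For a large sparse network you would store `A` in a sparse format and the loop body would stay the same.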

Anyone with more than casual experience with ChatGPT knows that prompt engineering is a thing. Minor or even trivial changes in a chatbot prompt can have significant effects, sometimes even dramatic ones, on the output [1]. For simple requests it may not make much difference, but for detailed requests it could matter a lot.

Industry leaders said they thought this would be a temporary limitation. But we are now a year and a half into the GPT-4 era, and it’s still a problem. And since the number of possible prompts grows exponentially with prompt length, it can be hard to find a good prompt for a given task.

One proposed solution is to use search procedures to automate the prompt optimization / prompt refinement process. Given a base large language model (LLM) and an input (a prompt specification, commonly with a set of prompt/answer pair samples for training), a search algorithm seeks the best form of a prompt to use to elicit the desired answer from the LLM.

This approach is sometimes touted [2] as a possible solution to the problem. However, it is not without limitations.

A main one is cost. With this approach, one search for a good prompt can take many, many trial-and-error invocations of the LLM, with cost measured in dollars, compared to the fraction of a cent for a single token of a single prompt. I know of one report of someone who does LLM prompting with such a tool full time for his job, at a cost of about $1,000/month (though, for certain kinds of task, one might alternatively seek a good prompt “template” and reuse it across many near-identical queries, to save costs).

This being said, it would seem that for now (depending on budget) our best option for difficult prompting problems is to use search-based prompt refinement methods. Various new tools have come out recently (for example, [3-6]). The following is a report on some of my (very preliminary) experiences with a couple of these tools.

The first is PromptAgent [5]. It’s a research code available on GitHub. The method is based on Monte Carlo tree search (MCTS), which tries out multiple chains of modification of a seed prompt and pursues the most promising. MCTS can be a powerful method, being part of the AlphaGo breakthrough result in 2016.

I ran one of the PromptAgent test problems using GPT-4/GPT-3.5 and interrupted it after it rang up a couple of dollars in charges. Looking at the logs, I was somewhat amazed that it generated long detailed prompts that included instructions to the model for what to pay close attention to, what to look out for, and what mistakes to avoid—presumably based on inspecting previous trial prompts generated by the code.

Unfortunately, PromptAgent is a research code and not fully productized, so it would take some work to adapt to a specific user problem.

DSPy [6] on the other hand is a finished product available for general users. DSPy is getting some attention lately not only as a prompt optimizer but also more generally as a tool for orchestrating multiple LLMs as agents. There is not much by way of simple examples for how to use the code. The website does have an AI chatbot that can generate sample code, but the code it generated for me required significant work to get it to behave properly.

I ran with the MIPRO optimizer, which is the most well-suited to prompt optimization. My experience with running the code was that it generated many random prompt variations but did not do in-depth prompt modifications like PromptAgent. PromptAgent does one thing, prompt refinement, and does it well, unlike DSPy, which has multiple uses. DSPy would be well served to implement more powerful prompt refinement algorithms.

I would wholeheartedly agree that it doesn’t seem right that an LLM should be so dependent on the wording of a prompt. Hopefully, future LLMs, with training on more data and other improvements, will do a better job without the need for such lengthy trial-and-error processes.

[1] “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting,” https://openreview.net/forum?id=RIu5lyNXjT

[2] “AI Prompt Engineering Is Dead” (https://spectrum.ieee.org/prompt-engineering-is-dead), March 6, 2024

[3] “Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing,” https://openreview.net/forum?id=OXv0zQ1umU

[4] “Large Language Models as Optimizers,” https://openreview.net/forum?id=Bb4VGOWELI

[5] “PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization,” https://openreview.net/forum?id=22pyNMuIoa

[6] “DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines,” https://openreview.net/forum?id=sY5N0zY5Od


Let’s look at a few motivating examples before we get into the details.

Suppose you want to advertise something, say a book you’ve written, and you’re hoping people will promote it on Twitter. Would you rather get a shout out from someone with more followers or someone with fewer followers? All else being equal, more followers is better. But even better would be a shout out from someone whose followers have a lot of followers.

Suppose you’re at a graduation ceremony. Your mind starts to wander after the first few hundred people walk across the stage, and you start to think about how a cold might spread through the crowd. The dean handing out diplomas could be a superspreader because he’s shaking hands with everyone as they receive their diplomas. But an inconspicuous parent in the audience may also be a superspreader because she’s a flight attendant and will be in a different city every day for the next few days. And not only is she a traveler, she’s in contact with travelers.

Ranking web pages according to the number of inbound links they have was a good idea because this takes advantage of revealed preferences: instead of asking people what pages they recommend, you can observe what pages they implicitly recommend by linking to them. An even better idea was Page Rank, weighting inbound links by how many inbound links the linking pages themselves have.

The idea of eigenvector centrality is to give each node a rank proportional to the sum of the ranks of the adjacent nodes. This may seem circular, and it kinda is.

To know the rank of a node, you have to know the ranks of the nodes linking to it. But to know their ranks, you have to know the ranks of the nodes linking to them, etc. There is no **sequential** solution to the problem because you’d end up in an infinite regress, even for a finite network. But there is a **simultaneous** solution, considering all pages at once.

You want to assign to the *i*th node a rank *x*_{i} proportional to the sum of the ranks of all adjacent nodes. A more convenient way to express this is to compute the weighted sum of the ranks of *all* nodes, with weight 1 for adjacent nodes and weight 0 for non-adjacent nodes. That is, you want to compute

*x*_{i} = λ^{−1} Σ_{j} *a*_{ij} *x*_{j}

where the *a*’s are the components of the adjacency matrix *A*. Here *a*_{ij} equals 1 if there is an edge between nodes *i* and *j* and it equals 0 otherwise.

This is equivalent to looking for solutions to the system of equations

*A* *x* = λ *x*

where *A* is the adjacency matrix and *x* is the vector of node ranks. If there are *N* nodes, then *A* is an *N* × *N* matrix and *x* is a column vector of length *N*.

In linear algebra terminology, *x* is an eigenvector of the adjacency matrix *A*, hence the name eigenvector centrality.

There are several mathematical details to be concerned about. Does the eigenvalue problem defining *x* have a solution? Is the solution unique (up to scalar multiples)? What about the eigenvalue λ? Does the adjacency matrix have multiple eigenvalues of largest magnitude, which would mean there are multiple candidate eigenvectors?

If *A* is the adjacency matrix for a connected graph, then there is a unique eigenvalue λ of largest magnitude, there is a unique corresponding eigenvector *x*, and all the components of *x* are non-negative. It is often said that this is a result of the Perron–Frobenius theorem, but there’s a little more to it than that because you need the hypothesis that the graph is connected.

The matrix *A* is non-negative, but not necessarily positive: some entries may be zero. But if the graph is connected, there’s a path between any node *i* and any other node *j*, which means that for some *n* the (*i*, *j*) entry of *A*^{n} is not zero. So although *A* is not necessarily positive, some power of *A* is positive. This puts us in the positive case of the Perron–Frobenius theorem, which is nicer than the non-negative case.

This section has discussed undirected graphs, but Page Rank applies to *directed* graphs because links have a direction. Similar theorems hold for directed graphs.

If you were to do a linear regression on the data you’d get a relation

lumens = *a* × watts + *b*

where the intercept term *b* is not zero. But this doesn’t make sense: a light bulb that is turned off doesn’t produce light, and it certainly doesn’t produce negative light. [1]

You may be able to fit the regression and ignore *b*; it’s probably small. But what if you wanted to *require* that *b* = 0? Some regression software will allow you to specify a zero intercept, and some will not. But it’s easy enough to compute the slope *a* without using any regression software.

Let **x** be the vector of input data, the wattage of the LED bulbs. And let **y** be the corresponding light output in lumens. The regression line uses the slope *a* that minimizes

‖*a* **x** − **y**‖² = *a*² **x** · **x** − 2*a* **x** · **y** + **y** · **y**.

Setting the derivative with respect to *a* to zero shows

*a* = **x** · **y** / **x** · **x**
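In NumPy the zero-intercept slope is one line. Here is a sketch with made-up wattage/lumens numbers (the data below are hypothetical, not the measurements behind the plot):

```python
import numpy as np

# Hypothetical LED bulb data: wattage and light output in lumens.
watts = np.array([4.0, 6.0, 9.0, 13.0, 17.0])
lumens = np.array([410.0, 620.0, 880.0, 1300.0, 1700.0])

# Slope of the zero-intercept fit: a = x.y / x.x
a = watts @ lumens / (watts @ watts)
print(a)  # about 100 lumens per watt for this made-up data

# Least squares with a single column and no intercept gives the same answer.
a_lstsq, *_ = np.linalg.lstsq(watts[:, None], lumens, rcond=None)
```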

Now there’s more to regression than just line fitting. A proper regression analysis would look at residuals, confidence intervals, etc. But the calculation above was good enough to conclude that LED lights put out about 100 lumens per watt.

It’s interesting that making the model more realistic, i.e. requiring *b* = 0, is either a complication or a simplification, depending on your perspective. It complicates using software, but it simplifies the math.

- Best line to fit three points
- Logistic regression quick takes
- Linear regression and post quantum cryptography

[1] The orange line in the image above is the least squares fit for the model *y* = *ax*, but it’s not quite the same line you’d get if you fit the model *y* = *ax* + *b*.

The answer to the first question is that I wrote about the local maxima of the sinc function three years ago. That post shows that the derivative of the sinc function sin(*x*)/*x* is zero if and only if *x* is a fixed point of the tangent function.

As for why that should be connected to zeros of a Bessel function, that one’s pretty easy. In general, Bessel functions cannot be expressed in terms of elementary functions. But the Bessel functions whose order is an integer plus ½ can.

For integer *n*,

*J*_{n+½}(*x*) = (−1)^{n} √(2/π) *x*^{n+½} ( (1/*x*) d/d*x* )^{n} (sin *x* / *x*)

So when *n* = 1, we’ve got the derivative of sinc right there in the definition.

Not only can you bootstrap tables to calculate logarithms of real numbers not given in the tables, you can also bootstrap a table of logarithms and a table of arctangents to calculate logarithms of complex numbers.

One of the examples in Abramowitz and Stegun (Example 7, page 90) is to compute log(2 + 3*i*). How could you do that with tables? Or with a programming language that doesn’t support complex numbers?

Now we have to be a little careful about what we mean by the logarithm of a complex number.

In the context of real numbers, the logarithm of a real number *x* is the real number *y* such that *e*^{y} = *x*. This equation has a unique solution if *x* is positive and no solution otherwise.

In the context of complex numbers, **a** logarithm of the complex number *z* is any complex number *w* such that *e*^{w} = *z*. This equation has no solution if *z* = 0, and it has infinitely many solutions otherwise: for any solution *w*, *w* + 2*n*π*i* is also a solution for all integers *n*.
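A quick check of this multivaluedness with Python's `cmath`:

```python
import cmath

z = 2 + 3j
w = cmath.log(z)  # principal value

# w solves e^w = z, and so does w shifted by any multiple of 2 pi i.
for n in (-1, 0, 1, 2):
    print(cmath.isclose(cmath.exp(w + 2j * cmath.pi * n), z))  # True each time
```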

If you write the complex number *z* in polar form

*z* = *r* *e*^{iθ}

then

log(*z*) = log(*r*) + *i*θ.

The proof is immediate:

*e*^{log(r) + iθ} = *e*^{log(r)} *e*^{iθ} = *r* *e*^{iθ}.

So computing the logarithm of a complex number boils down to computing its magnitude *r* and its argument θ.

The equation defining a logarithm has a unique solution if we make a branch cut along the negative real axis and restrict θ to be in the range −π < θ ≤ π. This is called the **principal branch** of log, sometimes written Log. As far as I know, every programming language that supports complex logarithms uses the principal branch implicitly. For example, in Python (NumPy), `log(x)` computes the principal branch of the log function.

Going back to the example mentioned above,

log(2 + 3*i*) = log( √(2² + 3²) ) + arctan(3/2) *i* = ½ log(13) + arctan(3/2) *i*.

This could easily be computed by looking up the logarithm of 13 and the arc tangent of 3/2.
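The table-style computation, using only a real logarithm and a real arctangent, agrees with a complex log routine:

```python
import math
import cmath

# log(2 + 3i) from real functions only, as if read from tables
log_part = 0.5 * math.log(13)  # half the log of 2^2 + 3^2
arg_part = math.atan2(3, 2)    # arctan(3/2), with the quadrant handled
w = complex(log_part, arg_part)

print(w)
print(cmath.log(2 + 3j))  # same value
```

Using `atan2` rather than `atan` is what picks the angle in the right quadrant, which is exactly the point of the ±2 ± 3*i* exercise below.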

The exercise in A&S actually asks the reader to calculate log(±2 ± 3*i*). The reason for the variety of signs is to require the reader to pick the value of θ that lies in the range −π < θ ≤ π. For example,

log(−2 + 3*i*) = ½ log(13) + (π − arctan(3/2)) *i*.

Before calculators were common, function values would be looked up in a table. For example, here is a piece of a table of logarithms from Abramowitz and Stegun, affectionately known as A&S.

But you wouldn’t just “look up” logarithm values. If you needed to know the value of a logarithm at a point where it is explicitly tabulated, then yes, you’d simply look it up. If you wanted to know the log of 1.754, then there it is in the table. But what if, for example, you wanted to know the log of 1.7543?

Notice that function values are given to 15 significant figures but input values are only given to four significant figures. If you wanted 15 sig figs in your output, presumably you’d want to specify your input to 15 sig figs as well. Or maybe you only needed 10 figures of precision, in which case you could ignore the rightmost column of decimal places in the table, but you still can’t directly specify input values to 10 figures.

If you go to the bottom of the column of A&S in the image above, you see this:

What’s the meaning of the mysterious square bracket expression? It’s telling you that for the input values in the range of this column, i.e. between 1.750 and 1.800, the error using linear interpolation will be less than 4 × 10^{−8}, and that if you want full precision, i.e. 15 sig figs, then you’ll need to use Lagrange interpolation with 5 points.

So going back to the example of wanting to know the value of log(1.7543), we could calculate it using

0.7 × log(1.754) + 0.3 × log(1.755)

and expect the error to be less than 4 × 10^{−8}.

We can confirm this with a little Python code.

>>> from math import log
>>> exact = log(1.7543)
>>> approx = 0.7*log(1.754) + 0.3*log(1.755)
>>> exact - approx
3.411265947494968e-08

Python uses double precision arithmetic, which is accurate to between 15 and 16 figures—more on that here—and so the function calls above are essentially the same as the tabulated values.

Now suppose we want the value of *x* = 1.75430123456789. The hint in square brackets says we should use Lagrange interpolation at five points, centered at the nearest tabulated value to *x*. That is, we’ll use the values of log at 1.752, 1.753, 1.754, 1.755, and 1.756 to compute the value of log(*x*).

Here’s the Lagrange interpolation formula, given in A&S as equation 25.2.15.

We illustrate this with the following Python code.

import numpy as np

def interpolate(fs, p, h):
    # five-point Lagrange interpolation, A&S equation 25.2.15
    s = (p**2 - 1)*p*(p - 2)*fs[0]/24
    s -= (p - 1)*p*(p**2 - 4)*fs[1]/6
    s += (p**2 - 1)*(p**2 - 4)*fs[2]/4
    s -= (p + 1)*p*(p**2 - 4)*fs[3]/6
    s += (p**2 - 1)*p*(p + 2)*fs[4]/24
    return s

xs = np.linspace(1.752, 1.756, 5)
fs = np.log(xs)
h = 0.001
x = 1.75430123456789
p = (x - 1.754)/h

print(interpolate(fs, p, h))
print(np.log(x))

This prints

0.5620706206909348
0.5620706206909349

confirming that the interpolated value is indeed accurate to 15 figures.

Lagrange interpolation takes a lot of work to carry out by hand, and so sometimes you might use other techniques, such as transforming your calculation into one for which a Taylor series approximation converges quickly. In any case, sophisticated use of numerical tables was not simply a matter of looking things up.

A book of numerical tables enables you to do calculations without a computer. More than that, understanding how to do calculations **without** a computer helps you program calculations **with** a computer. Computers have to evaluate functions somehow, and one way is interpolating tabulated values.

For example, you could think of a digital image as a numerical table, the values of some ideal analog image sampled at discrete points. The screenshots above are interpolated: the HTML specifies a width smaller than that of the original screenshots. You’re not seeing the original image; you’re seeing a new image that your computer has created for you using interpolation.

Interpolation is a kind of compression. A&S would be 100 billion times larger if it tabulated functions at 15 figure inputs. Instead, it tabulated functions for 4 figure inputs and gives you a recipe (Lagrange interpolation) for evaluating the functions at 15 figure inputs if you desire. This is a very common pattern. An SVG image, for example, does not tell you pixel values, but gives you equations for calculating pixel values at whatever scale is needed.

He talks about how software developers bemoan duct taping systems together, and would rather work on core technologies. He thinks it is some tragic failure, that if only wise system design was employed, you wouldn’t be doing all the duct taping.

Wrong.

Every expansion in capabilities opens up the opportunity to duct tape it to new areas, and this is where a lot of value creation happens. Eventually, when a sufficient amount of duct tape is found in an area, it is an opportunity for systemic redesigns, but you don’t wait for that before grabbing newly visible low hanging fruit!

The realistic alternative to duct tape and other aesthetically disappointing code is often no code.

You decide to take a peek at the data after only 300 randomizations, even though your statistician warned you in no uncertain terms not to do that. Something about alpha spending.

You can’t unsee what you’ve seen. Now what?

Common sense says it matters what you saw. If 148 people were randomized to Design A, and every single one of them bought your product, while 10 out of the 152 people randomized to Design B bought your product, common sense says you should call Design A the winner and push it into production ASAP.

But what if you saw somewhat better results for Design A? You can have *some* confidence that Design A is better, though not as much as the nominal confidence level of the full experiment. This is what your (frequentist) statistician was trying to protect you from.

When your statistician designed your experiment, he obviously didn’t know what data you’d see, so he designed a *process* that would be reliable in a certain sense. When you looked at the data early, you violated the process, and so now your actual practice no longer has the probability of success initially calculated.

But you don’t care about the process; you want to know whether to deploy Design A or Design B. And you saw the data that you saw. Particularly in the case where the results were lopsidedly in favor of Design A, your gut tells you that you know what to do next. You might reasonably say “I get what you’re saying about repeated experiments and all that. (OK, not really, but let’s say I do.) But look what happened? Design A is a runaway success!”

In formal terms, your common sense is telling you to condition on the observed data. If you’ve never studied Bayesian statistics you may not know exactly what that means and how to calculate it, but it’s intuitively what you’ve done. You’re making a decision based on what you actually saw, not on the basis of a hypothetical sequence of experiments you didn’t run and won’t run.

Bayesian statistics does formally what your intuition does informally. This is important because even though your intuition is a good guide in extreme cases, it can go wrong when things are less obvious. As I wrote about recently, smart people can get probability very wrong, even when their intuition is correct in some instances.
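Here is a sketch of what conditioning on the lopsided data looks like, using a Beta-Binomial model with uniform priors (the model choice and the Monte Carlo approach are mine, not the post's):

```python
import random

random.seed(2024)

# Posteriors with uniform Beta(1, 1) priors: Beta(1 + successes, 1 + failures).
# Design A: 148 buyers out of 148. Design B: 10 buyers out of 152.
def p_a_beats_b(n=100_000):
    """Monte Carlo estimate of P(conversion rate of A > conversion rate of B)."""
    wins = 0
    for _ in range(n):
        a = random.betavariate(1 + 148, 1 + 0)
        b = random.betavariate(1 + 10, 1 + 142)
        if a > b:
            wins += 1
    return wins / n

print(p_a_beats_b())  # essentially 1: Design A is almost certainly better
```

In the lopsided case the posterior says what your gut says. In closer cases the same calculation still works, which is where intuition alone starts to fail.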

If you’d like help designing experiments or understanding their results, we can help.
