One way to capture the structure of a graph is its adjacency matrix *A*. Each element *a*_{ij} of this matrix equals 1 if there is an edge between the *i*th and *j*th node and a 0 otherwise. If you square the matrix *A*, the (*i*, *j*) entry of the result tells you how many paths of length 2 are between the *i*th and *j*th nodes. In general, the (*i,* *j*) entry of *A*^{n} tells you how many paths of length *n* there are between the corresponding nodes.
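The path-counting claim is easy to verify numerically. Here's a small sketch (the three-node path graph is a made-up example, not from the post):

```python
import numpy as np

# Adjacency matrix of the path graph 0-1-2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

A2 = A @ A  # the (i, j) entry counts paths of length 2 from node i to node j
print(A2[0, 2])  # 1: the single path 0-1-2
print(A2[1, 1])  # 2: the paths 1-0-1 and 1-2-1
```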

Calculating eigenvector centrality requires finding an eigenvector associated with the largest eigenvalue of *A*. One way to find such an eigenvector is the power method. You start with a random initial vector and repeatedly multiply it by *A*. This produces a sequence of vectors that converges to the eigenvector we’re after.
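Here's a minimal sketch of the power method in NumPy (the graph below is a made-up example):

```python
import numpy as np

def power_method(A, iterations=100):
    """Approximate the dominant eigenvector of A by repeated multiplication."""
    rng = np.random.default_rng(0)
    x = rng.random(A.shape[0])  # random starting vector
    for _ in range(iterations):
        x = A @ x
        x /= np.linalg.norm(x)  # normalize so the vector doesn't overflow
    return x

# Adjacency matrix of a small connected graph with edges 0-1, 0-2, 1-2, 2-3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

x = power_method(A)
print(x)  # node 2, the best-connected node, gets the largest component
```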

Conceptually this is the same as computing *A*^{n} first and multiplying it by the random initial vector. So not only does the matrix *A*^{n} count paths of length *n*, for large *n* it helps us compute eigenvector centrality.

Now for a little fine print. The power method will converge for any starting vector that has some component in the direction of the eigenvector you’re trying to find. This is almost certainly the case if you start with a vector chosen at random. The power method also requires that the matrix have a single eigenvalue of largest magnitude and that its associated eigenspace have dimension 1. The post on eigenvector centrality stated that these conditions hold, provided the network is connected.

In principle, you could calculate eigenvector centrality by computing *A*^{n} for some large *n*. In practice, you’d never do that. For a square matrix of size *N*, a matrix-vector product takes *O*(*N*²) operations, whereas a matrix-matrix product such as squaring *A* requires *O*(*N*³) operations. So you would repeatedly apply *A* to a vector rather than computing powers of *A*.

Also, you wouldn’t use the power method unless *A* is sparse, making it relatively efficient to multiply a vector by *A*. If most of the entries of *A* are zeros, and there is an exploitable pattern to where the non-zero elements are located, you can multiply *A* by a vector with far fewer than *N*² operations.
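For instance, storing only the adjacency lists lets you multiply by *A* in time proportional to the number of edges. A sketch in plain Python (the five-node cycle is a made-up example):

```python
def sparse_matvec(adj, x):
    """Multiply an adjacency matrix by x, where adj[i] lists the neighbors of node i."""
    return [sum(x[j] for j in adj[i]) for i in range(len(adj))]

# Cycle on 5 nodes: each node is adjacent to its two neighbors
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
print(sparse_matvec(adj, [1, 1, 1, 1, 1]))  # [2, 2, 2, 2, 2]: every node has degree 2
```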

Anyone with more than casual experience with ChatGPT knows that prompt engineering is a thing. Minor or even trivial changes in a chatbot prompt can have significant effects, sometimes even dramatic ones, on the output [1]. For simple requests it may not make much difference, but for detailed requests it could matter a lot.

Industry leaders predicted this would be a temporary limitation, but we are now a year and a half into the GPT-4 era, and it’s still a problem. And since the number of possible prompts grows exponentially with prompt length, it can sometimes be hard to find a good prompt for a given task.

One proposed solution is to use search procedures to automate the prompt optimization / prompt refinement process. Given a base large language model (LLM) and an input (a prompt specification, commonly with a set of prompt/answer pair samples for training), a search algorithm seeks the best form of a prompt to use to elicit the desired answer from the LLM.

This approach is sometimes touted [2] as a possible solution to the problem. However, it is not without limitations.

A main one is cost. With this approach, one search for a good prompt can take many, many trial-and-error invocations of the LLM, with cost measured in dollars, compared to a fraction of a cent for a single token of a single prompt. I know of one report of someone who does LLM prompting with such a tool full time for his job, at a cost of about $1,000/month (though, for certain kinds of task, one might alternatively seek a good prompt “template” and reuse it across many near-identical queries to save costs).

This being said, it would seem that for now (depending on budget) our best option for difficult prompting problems is to use search-based prompt refinement methods. Various new tools have come out recently (for example, [3–6]). The following is a report on some of my (very preliminary) experiences with a couple of these tools.

The first is PromptAgent [5]. It’s a research code available on GitHub. The method is based on Monte Carlo tree search (MCTS), which tries out multiple chains of modification of a seed prompt and pursues the most promising. MCTS can be a powerful method, being part of the AlphaGo breakthrough result in 2016.

I ran one of the PromptAgent test problems using GPT-4/GPT-3.5 and interrupted it after it rang up a couple of dollars in charges. Looking at the logs, I was somewhat amazed that it generated long detailed prompts that included instructions to the model for what to pay close attention to, what to look out for, and what mistakes to avoid—presumably based on inspecting previous trial prompts generated by the code.

Unfortunately, PromptAgent is a research code and not fully productized, so it would take some work to adapt to a specific user problem.

DSPy [6] on the other hand is a finished product available for general users. DSPy is getting some attention lately not only as a prompt optimizer but also more generally as a tool for orchestrating multiple LLMs as agents. There is not much by way of simple examples for how to use the code. The web site does have an AI chatbot that can generate sample code, but the code it generated for me required significant work to get it to behave properly.

I ran with the MIPRO optimizer, which is the most well-suited to prompt optimization. In my experience the code generated many random prompt variations but did not make in-depth prompt modifications the way PromptAgent does. PromptAgent does one thing, prompt refinement, and does it well, unlike DSPy, which has multiple uses. DSPy would be well served by implementing more powerful prompt refinement algorithms.

I would wholeheartedly agree that it doesn’t seem right for an LLM to be so dependent on the wording of a prompt. Hopefully, future LLMs, with training on more data and other improvements, will do a better job without the need for such lengthy trial-and-error processes.

[1] “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting,” https://openreview.net/forum?id=RIu5lyNXjT

[2] “AI Prompt Engineering Is Dead,” https://spectrum.ieee.org/prompt-engineering-is-dead, March 6, 2024.

[3] “Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing,” https://openreview.net/forum?id=OXv0zQ1umU

[4] “Large Language Models as Optimizers,” https://openreview.net/forum?id=Bb4VGOWELI

[5] “PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization,” https://openreview.net/forum?id=22pyNMuIoa

[6] “DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines,” https://openreview.net/forum?id=sY5N0zY5Od

The post The search for the perfect prompt first appeared on John D. Cook.

Let’s look at a few motivating examples before we get into the details.

Suppose you want to advertise something, say a book you’ve written, and you’re hoping people will promote it on Twitter. Would you rather get a shout out from someone with more followers or someone with fewer followers? All else being equal, more followers is better. But even better would be a shout out from someone whose followers have a lot of followers.

Suppose you’re at a graduation ceremony. Your mind starts to wander after the first few hundred people walk across the stage, and you start to think about how a cold might spread through the crowd. The dean handing out diplomas could be a superspreader because he’s shaking hands with everyone as they receive their diplomas. But an inconspicuous parent in the audience may also be a superspreader because she’s a flight attendant and will be in a different city every day for the next few days. And not only is she a traveler, she’s in contact with travelers.

Ranking web pages according to the number of inbound links they have was a good idea because this takes advantage of revealed preferences: instead of asking people what pages they recommend, you can observe what pages they implicitly recommend by linking to them. An even better idea was Page Rank, weighing inbound links by how many links the linking pages have.

The idea of eigenvector centrality is to give each node a rank proportional to the sum of the ranks of the adjacent nodes. This may seem circular, and it kinda is.

To know the rank of a node, you have to know the ranks of the nodes linking to it. But to know their ranks, you have to know the ranks of the nodes linking to them, and so on. There is no **sequential** solution to the problem because you’d end up in an infinite regress, even for a finite network. But there is a **simultaneous** solution, considering all pages at once.

You want to assign to the *i*th node a rank *x*_{i} proportional to the sum of the ranks of all adjacent nodes. A more convenient way to express this is to compute the weighted sum of the ranks of *all* nodes, with weight 1 for adjacent nodes and weight 0 for non-adjacent nodes. That is, you want to compute

Σ_{j} *a*_{ij} *x*_{j}

where the *a*‘s are the components of the adjacency matrix *A*. Here *a*_{ij} equals 1 if there is an edge between nodes *i* and *j* and it equals 0 otherwise.

This is equivalent to looking for solutions to the system of equations

λ*x* = *Ax*

where *A* is the adjacency matrix and *x* is the vector of node ranks. If there are *N* nodes, then *A* is an *N* × *N* matrix and *x* is a column vector of length *N*.

In linear algebra terminology, *x* is an eigenvector of the adjacency matrix *A*, hence the name eigenvector centrality.
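In code, this boils down to one eigenvector computation. A sketch in NumPy (the graph is a made-up example):

```python
import numpy as np

# Adjacency matrix of a small connected graph with edges 0-1, 0-2, 1-2, 2-3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# A is symmetric, so eigh applies; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)
x = eigenvectors[:, -1]          # eigenvector for the largest eigenvalue
x = np.abs(x) / np.abs(x).sum()  # fix the sign and normalize the ranks to sum to 1
print(x)  # node 2 has the highest centrality
```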

There are several mathematical details to be concerned about. Does the eigenvalue problem defining *x* have a solution? Is the solution unique (up to scalar multiples)? What about the eigenvalue λ? Could the largest eigenvalue be repeated, which would mean there are multiple corresponding eigenvectors?

If *A* is the adjacency matrix for a connected graph, then there is a unique eigenvalue λ of largest magnitude, there is a unique corresponding eigenvector *x*, and all the components of *x* are non-negative. It is often said that this is a result of the Perron–Frobenius theorem, but there’s a little more to it than that because you need the hypothesis that the graph is connected.

The matrix *A* is non-negative, but not necessarily positive: some entries may be zero. But if the graph is connected, there’s a path between any node *i* and any other node *j*, which means that for some power of *A*, the (*i*, *j*) entry of that power is not zero. So although *A* is not necessarily positive, some power of *A* is positive. This puts us in the positive case of the Perron–Frobenius theorem, which is nicer than the non-negative case.

This section has discussed undirected graphs, but Page Rank applies to *directed* graphs because links have a direction. However, similar theorems hold for directed graphs.

If you were to do a linear regression on the data you’d get a relation

lumens = *a* × watts + *b*

where the intercept term *b* is not zero. But this doesn’t make sense: a light bulb that is turned off doesn’t produce light, and it certainly doesn’t produce negative light. [1]

You may be able to fit the regression and ignore *b*; it’s probably small. But what if you wanted to *require* that *b* = 0? Some regression software will allow you to specify a zero intercept, and some will not. But it’s easy enough to compute the slope *a* without using any regression software.

Let **x** be the vector of input data, the wattage of the LED bulbs, and let **y** be the corresponding light output in lumens. The regression line uses the slope *a* that minimizes

‖*a* **x** − **y**‖² = *a*² **x** · **x** − 2*a* **x** · **y** + **y** · **y**.

Setting the derivative with respect to *a* to zero shows

*a* = **x** · **y** / **x** · **x**
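In NumPy this is one line. The wattage and lumen values below are made up for illustration; they are not the data from the post:

```python
import numpy as np

# Hypothetical LED bulb data: wattage and light output in lumens
x = np.array([5.0, 9.0, 12.0, 15.0])
y = np.array([480.0, 900.0, 1230.0, 1500.0])

# Slope of the zero-intercept least squares fit y = a x
a = np.dot(x, y) / np.dot(x, x)
print(a)  # close to 100 lumens per watt
```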

Now there’s more to regression than just line fitting. A proper regression analysis would look at residuals, confidence intervals, etc. But the calculation above was good enough to conclude that LED lights put out about 100 lumens per watt.

It’s interesting that making the model more realistic, i.e. requiring *b* = 0, is either a complication or a simplification, depending on your perspective. It complicates using software, but it simplifies the math.


[1] The orange line in the image above is the least squares fit for the model *y* = *ax*, but it’s not quite the same line you’d get if you fit the model *y* = *ax* + *b*.

The answer to the first question is that I wrote about the local maxima of the sinc function three years ago. That post shows that the derivative of the sinc function sin(*x*)/*x* is zero if and only if *x* is a fixed point of the tangent function.

As for why that should be connected to the zeros of a Bessel function, that one’s pretty easy. In general, Bessel functions cannot be expressed in terms of elementary functions. But the Bessel functions whose order is an integer plus ½ can.

For integer *n*,

*J*_{n+½}(*x*) = (−1)^{n} √(2/π) *x*^{n+½} ( (1/*x*) d/d*x* )^{n} (sin *x* / *x*).

So when *n* = 1, we’ve got the derivative of sinc right there in the definition.

Not only can you bootstrap tables to calculate logarithms of real numbers not given in the tables, you can also bootstrap a table of logarithms and a table of arctangents to calculate logarithms of complex numbers.

One of the examples in Abramowitz and Stegun (Example 7, page 90) is to compute log(2 + 3*i*). How could you do that with tables? Or with a programming language that doesn’t support complex numbers?

Now we have to be a little careful about what we mean by the logarithm of a complex number.

In the context of real numbers, the logarithm of a real number *x* is the real number *y* such that *e*^{y} = *x*. This equation has a unique solution if *x* is positive and no solution otherwise.

In the context of complex numbers, **a** logarithm of the complex number *z* is any complex number *w* such that *e*^{w} = *z*. This equation has no solution if *z* = 0, and it has infinitely many solutions otherwise: for any solution *w*, *w* + 2*n*π*i* is also a solution for all integers *n*.

If you write the complex number *z* in polar form

*z* = *r* *e*^{iθ}

then

log(*z*) = log(*r*) + *i*θ.

The proof is immediate:

*e*^{log(r) + iθ} = *e*^{log(r)} *e*^{iθ} = *r* *e*^{iθ}.

So computing the logarithm of a complex number boils down to computing its magnitude *r* and its argument θ.

The equation defining a logarithm has a unique solution if we make a branch cut along the negative real axis and restrict θ to be in the range −π < θ ≤ π. This is called the **principal branch** of log, sometimes written Log. As far as I know, every programming language that supports complex logarithms uses the principal branch implicitly. For example, in Python (NumPy), `log(x)` computes the principal branch of the log function.

Going back to the example mentioned above,

log(2 + 3*i*) = log( √(2² + 3²) ) + arctan(3/2) *i* = ½ log(13) + arctan(3/2) *i*.

This could easily be computed by looking up the logarithm of 13 and the arc tangent of 3/2.
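Here’s a sketch of the table-style calculation in Python, using only real-valued functions, checked against the built-in complex logarithm:

```python
import cmath
import math

# log(2 + 3i) using only real logs and arctangents
x, y = 2.0, 3.0
log_r = 0.5 * math.log(x*x + y*y)  # log r = half of log 13
theta = math.atan2(y, x)           # arctan(3/2), since 2 + 3i is in the first quadrant
print(log_r, theta)

# Compare with a library that supports complex numbers directly
w = cmath.log(complex(x, y))
print(w.real, w.imag)
```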

The exercise in A&S actually asks the reader to calculate log(±2 ± 3*i*). The reason for the variety of signs is to require the reader to pick the value of θ that lies in the range −π < θ ≤ π. For example,

log(−2 + 3*i*) = ½ log(13) + (π − arctan(3/2)) *i*.

Before calculators were common, function values would be looked up in a table. For example, here is a piece of a table of logarithms from Abramowitz and Stegun, affectionately known as A&S.

But you wouldn’t just “look up” logarithm values. If you needed to know the value of a logarithm at a point where it is explicitly tabulated, then yes, you’d simply look it up. If you wanted to know the log of 1.754, then there it is in the table. But what if, for example, you wanted to know the log of 1.7543?

Notice that function values are given to 15 significant figures but input values are only given to four significant figures. If you wanted 15 sig figs in your output, presumably you’d want to specify your input to 15 sig figs as well. Or maybe you only needed 10 figures of precision, in which case you could ignore the rightmost column of decimal places in the table, but you still can’t directly specify input values to 10 figures.

If you go to the bottom of the column of A&S in the image above, you see this:

What’s the meaning of the mysterious square bracket expression? It’s telling you that for the input values in the range of this column, i.e. between 1.750 and 1.800, the error using linear interpolation will be less than 4 × 10^{−8}, and that if you want full precision, i.e. 15 sig figs, then you’ll need to use Lagrange interpolation with 5 points.

So going back to the example of wanting to know the value of log(1.7543), we could calculate it using

0.7 × log(1.754) + 0.3 × log(1.755)

and expect the error to be less than 4 × 10^{−8}.

We can confirm this with a little Python code.

```python
>>> from math import log
>>> exact = log(1.7543)
>>> approx = 0.7*log(1.754) + 0.3*log(1.755)
>>> exact - approx
3.411265947494968e-08
```

Python uses double precision arithmetic, which is accurate to between 15 and 16 figures—more on that here—and so the function calls above are essentially the same as the tabulated values.

Now suppose we want the value of *x* = 1.75430123456789. The hint in square brackets says we should use Lagrange interpolation at five points, centered at the nearest tabulated value to *x*. That is, we’ll use the values of log at 1.752, 1.753, 1.754, 1.755, and 1.756 to compute the value of log(*x*).

Here’s the five-point Lagrange interpolation formula, given in A&S as equation 25.2.15. With *p* = (*x* − *x*₀)/*h* and *f*_{k} = *f*(*x*₀ + *kh*),

*f*(*x*) ≈ (*p*² − 1)*p*(*p* − 2)/24 · *f*_{−2} − (*p* − 1)*p*(*p*² − 4)/6 · *f*_{−1} + (*p*² − 1)(*p*² − 4)/4 · *f*₀ − (*p* + 1)*p*(*p*² − 4)/6 · *f*₁ + (*p*² − 1)*p*(*p* + 2)/24 · *f*₂.

We illustrate this with the following Python code.

```python
import numpy as np

def interpolate(fs, p, h):
    s  = (p**2 - 1)*p*(p - 2)*fs[0]/24
    s -= (p - 1)*p*(p**2 - 4)*fs[1]/6
    s += (p**2 - 1)*(p**2 - 4)*fs[2]/4
    s -= (p + 1)*p*(p**2 - 4)*fs[3]/6
    s += (p**2 - 1)*p*(p + 2)*fs[4]/24
    return s

xs = np.linspace(1.752, 1.756, 5)
fs = np.log(xs)
h = 0.001
x = 1.75430123456789
p = (x - 1.754)/h

print(interpolate(fs, p, h))
print(np.log(x))
```

This prints

```
0.5620706206909348
0.5620706206909349
```

confirming that the interpolated value is indeed accurate to 15 figures.

Lagrange interpolation takes a lot of work to carry out by hand, and so sometimes you might use other techniques, such as transforming your calculation into one for which a Taylor series approximation converges quickly. In any case, sophisticated use of numerical tables was not simply a matter of looking things up.

A book of numerical tables enables you to do calculations without a computer. More than that, understanding how to do calculations **without** a computer helps you program calculations **with** a computer. Computers have to evaluate functions somehow, and one way is interpolating tabulated values.

For example, you could think of a digital image as a numerical table, the values of some ideal analog image sampled at discrete points. The screenshots above are interpolated: the HTML specifies the width to be less than that of the original screenshots. You’re not seeing the original image; you’re seeing a new image that your computer has created for you using interpolation.

Interpolation is a kind of compression. A&S would be 100 billion times larger if it tabulated functions at 15-figure inputs. Instead, it tabulates functions at 4-figure inputs and gives you a recipe (Lagrange interpolation) for evaluating the functions at 15-figure inputs if you desire. This is a very common pattern. An SVG image, for example, does not give you pixel values, but gives you equations for calculating pixel values at whatever scale is needed.

He talks about how software developers bemoan duct taping systems together, and would rather work on core technologies. He thinks it is some tragic failure, that if only wise system design was employed, you wouldn’t be doing all the duct taping.

Wrong.

Every expansion in capabilities opens up the opportunity to duct tape it to new areas, and this is where a lot of value creation happens. Eventually, when a sufficient amount of duct tape is found in an area, it is an opportunity for systemic redesigns, but you don’t wait for that before grabbing newly visible low hanging fruit!

The realistic alternative to duct tape and other aesthetically disappointing code is often no code.

You decide to take a peek at the data after only 300 randomizations, even though your statistician warned you in no uncertain terms not to do that. Something about alpha spending.

You can’t unsee what you’ve seen. Now what?

Common sense says it matters what you saw. If 148 people were randomized to Design A, and every single one of them bought your product, while 10 out of the 152 people randomized to Design B bought your product, common sense says you should call Design A the winner and push it into production ASAP.

But what if you saw somewhat better results for Design A? You can have *some* confidence that Design A is better, though not as much as the nominal confidence level of the full experiment. This is what your (frequentist) statistician was trying to protect you from.

When your statistician designed your experiment, he obviously didn’t know what data you’d see, so he designed a *process* that would be reliable in a certain sense. When you looked at the data early, you violated the process, and so now your actual practice no longer has the probability of success initially calculated.

But you don’t care about the process; you want to know whether to deploy Design A or Design B. And you saw the data that you saw. Particularly in the case where the results were lopsidedly in favor of Design A, your gut tells you that you know what to do next. You might reasonably say “I get what you’re saying about repeated experiments and all that. (OK, not really, but let’s say I do.) But look what happened? Design A is a runaway success!”

In formal terms, your common sense is telling you to condition on the observed data. If you’ve never studied Bayesian statistics you may not know exactly what that means and how to calculate it, but it’s intuitively what you’ve done. You’re making a decision based on what you actually saw, not on the basis of a hypothetical sequence of experiments you didn’t run and won’t run.

Bayesian statistics does formally what your intuition does informally. This is important because even though your intuition is a good guide in extreme cases, it can go wrong when things are less obvious. As I wrote about recently, smart people can get probability very wrong, even when their intuition is correct in some instances.

If you’d like help designing experiments or understanding their results, we can help.

The post Condition on your data first appeared on John D. Cook.

Suppose you’re running an A/B test to determine whether a web page produces more sales with one graphic versus another. You plan to randomly assign image A or B to 1,000 visitors to the page, but after only randomizing 500 visitors you want to look at the data. Is this OK or not?

Of course there’s nothing morally or legally wrong with looking at interim data, but is there anything statistically wrong? That depends on what you do with your observation.

There are basically two statistical camps, Frequentist and Bayesian. (There are others, but these two account for the lion’s share.) Frequentists say no, you should not look at interim data, unless you apply something called **alpha spending**. Bayesians, on the other hand, say go ahead. Why shouldn’t you look at your data? I remember one Bayesian colleague mocking alpha spending as being embarrassing.

So who is right, the Frequentists or the Bayesians? **Both are right**, given their respective criteria.

If you want to achieve success, as defined by Frequentists, you have to play by Frequentist rules, alpha spending and all. Suppose you design a hypothesis test to have a confidence level α, and you look at the data midway through an experiment. If the results are conclusive at the midpoint, you stop early. This procedure does **not** have a confidence level α. You would have to require stronger evidence for early stopping, as specified by alpha spending.
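A quick simulation illustrates the problem. The code below (a sketch with made-up conversion rates, not part of the original post) tests at the nominal 5% level both at the midpoint and at the end of each experiment; even though both designs are identical, the procedure rejects noticeably more than 5% of the time:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000        # planned visitors per arm
z_crit = 1.96   # two-sided critical value for alpha = 0.05
trials = 2000
false_positives = 0

for _ in range(trials):
    # Null hypothesis is true: both designs convert at the same 10% rate
    a = rng.random(n) < 0.10
    b = rng.random(n) < 0.10
    for m in (n // 2, n):  # peek at the halfway point, then test at the end
        pa, pb = a[:m].mean(), b[:m].mean()
        p = (pa + pb) / 2
        se = np.sqrt(2 * p * (1 - p) / m)
        if abs(pa - pb) > z_crit * se:
            false_positives += 1
            break

print(false_positives / trials)  # noticeably above the nominal 0.05
```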

The Bayesian interprets the data differently. This approach says to quantify what is believed before conducting the experiment in the form of a prior probability distribution. Then after each data point comes in, you update your prior distribution using Bayes’ theorem to produce a posterior distribution, which becomes your new prior distribution. That’s it. At every point along the way, this distribution captures all that is known about what you’re testing. Your planned sample size is irrelevant, and the fact that you’ve looked at your data is irrelevant.
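For binary outcomes such as conversions, the conjugate Beta-Binomial model makes this updating concrete. A sketch (the counts below are hypothetical):

```python
import numpy as np

# Hypothetical conversion data: (conversions, visitors) for each design
a_success, a_total = 148, 148
b_success, b_total = 10, 152

# Start from a uniform Beta(1, 1) prior; the posterior after observing
# s successes in t trials is Beta(1 + s, 1 + t - s)
rng = np.random.default_rng(42)
post_a = rng.beta(1 + a_success, 1 + a_total - a_success, size=100_000)
post_b = rng.beta(1 + b_success, 1 + b_total - b_success, size=100_000)

# Posterior probability that Design A has the higher conversion rate
prob_a_better = (post_a > post_b).mean()
print(prob_a_better)  # essentially 1 for data this lopsided
```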

Now regarding your A/B test, **why** are you looking at the data early? If it’s simply out of curiosity, and cannot affect your actions, then it doesn’t matter. But if you act on your observation, you change the Frequentist operating characteristics of your experiment.

Stepping back a little, why are you conducting your test in the first place? If you want to make the correct decision with probability 1 − α over an infinite sequence of experiments, then you need to take into account alpha spending, or else you will lower that probability.

But if you don’t care about a hypothetical infinite sequence of experiments, you may find the Bayesian approach more intuitive. What do you know right at this point in the experiment? It’s all encapsulated in the posterior distribution.

Suppose your experiment size is a substantial portion of your potential visitors. You want to experiment with graphics, yes, but you also want to make money along the way. You’re ultimately concerned with profit, not publishing a scientific paper on your results. Then you could use a Bayesian approach to maximize your expected profit. This leads to things like “**bandits**,” so called by analogy to slot machines (“one-armed bandits”).

If you want to keep things simple, and if the experiment size is negligible relative to your expected number of visitors, just design your experiment according to Frequentist principles and don’t look at your data early.

But if you have good business reasons to want to look at the data early, not simply to satisfy curiosity, then you should probably interpret the interim data from a Bayesian perspective. As the next post explains, the Bayesian approach aligns well with common sense.

I’d recommend taking **either** a Frequentist approach or a Bayesian approach, but not complicating things by hybrid approaches such as alpha spending or designing Bayesian experiments to have desired Frequentist operating characteristics. The middle ground is more complicated and prone to subtle mistakes, though we can help you navigate this middle ground if you need to.

If you need help designing, conducting, or interpreting experiments, we can help. If you want/need to look at interim results, we can show you how to do it the right way.

`\label{foo}` and referenced like `\ref{foo}`. Referring to sections by labels rather than hard-coded numbers allows references to automatically update when sections are inserted, deleted, or rearranged.

For every reference there ought to be a label. A label without a corresponding reference is fine, though it might be a mistake. If you have a reference with no corresponding label, and one label without a reference, there’s a good chance the reference is a typo variation on the unreferenced label.

We’ll build up a one-liner for comparing labels and references. We’ll use `grep` to find patterns that look like labels by searching for `label{` followed by any string of characters up to but not including a closing brace. We don’t want the `label{` part, just what follows it, so we’ll use look-behind syntax to exclude it from the match.

Here’s our regular expression:

```
(?<=label{)[^}]+
```

We’re using Perl-style look-behind syntax, so we’ll need to give `grep` the `-P` option. Also, we only want the match itself, not matching lines, so we’ll also use the `-o` option. This will print all the labels:

```
grep -oP '(?<=label{)[^}]+' foo.tex
```

The regex for finding references is the same with `label` replaced with `ref`.

To compare the list of labels and the list of references, we’ll use the `comm` command. For more on `comm`, see Set theory at the command line.

We could save the labels to a file, save the references to a file, and run `comm` on the two files. But we’re more interested in the differences between the two lists than in the lists themselves, so we can pass both as streams to `comm` using the `<(...)` syntax. Finally, `comm` assumes its inputs are sorted, so we pipe the output of both `grep` commands to `sort`.

Here’s our one-liner:

```
comm <(grep -oP '(?<=label{)[^}]+' foo.tex | sort) <(grep -oP '(?<=ref{)[^}]+' foo.tex | sort)
```

This will produce three columns of output: labels which are not references, references which are not labels, and labels that are also references.

If you just want to see references that don’t refer to a label, give `comm` the option `-13`. This suppresses the first and third columns of output, leaving only the second column: references that are not labels.

You can also add a `-u` option (*u* for *unique*) to the calls to `sort` to suppress multiple instances of the same label or the same reference.
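The same comparison can be sketched in Python with the `re` module, an alternative if you’d rather not use the shell (the document below is a made-up example):

```python
import re

tex = r"""
\section{Introduction}\label{intro}
\section{Methods}\label{methods}
As we argued in Section~\ref{intro}, see also Section~\ref{method}.
"""

# Same look-behind regexes as the grep commands
labels = set(re.findall(r'(?<=label{)[^}]+', tex))
refs = set(re.findall(r'(?<=ref{)[^}]+', tex))

print(sorted(labels - refs))  # ['methods']: labels never referenced
print(sorted(refs - labels))  # ['method']: references with no label, likely typos
```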

Note that the right hand side is not a series in φ but rather in sin φ.

Why might you know sin φ and want to calculate sin *m*φ / cos φ? This doesn’t seem like a sufficiently common task for the series to be well-known. The references are over a century old, and maybe the series were useful in hand calculations in a way that isn’t necessary anymore.

However, [1] was using the series for a theoretical derivation, not for calculation; the author was doing some hand-wavy derivation, sticking the difference operator *E* into a series as if it were a number, a technique known as “umbral calculus.” The name comes from the Latin word *umbra* for shadow. The name referred to the “shadowy” nature of the technique which wasn’t made rigorous until much later.

The series above terminates if *m* is an even integer. But there are no restrictions on *m*, and in general the series is infinite.

The series obviously has trouble if cos φ = 0, i.e. when φ = ±π/2, but it converges for all *m* if −π/2 < φ < π/2.

If *m* = 1, sin *m*φ / cos φ is simply tan φ. The function tan φ has a complicated power series in φ involving Bernoulli numbers, but it has a simpler power series in sin φ.
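The *m* = 1 case is easy to check numerically. Writing *s* = sin φ, we have tan φ = *s*(1 − *s*²)^{−½}, and expanding with the binomial series gives tan φ = *s* + *s*³/2 + 3*s*⁵/8 + ⋯ (this expansion is my own sketch, not taken from the cited references):

```python
import math

phi = 0.3
s = math.sin(phi)

# First three terms of the power series for tan(phi) in powers of sin(phi)
approx = s + s**3/2 + 3*s**5/8
print(abs(math.tan(phi) - approx))  # tiny truncation error
```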

[1] G. J. Lidstone. Notes on Everett’s Interpolation Formula. 1922

[2] E. W. Hobson. A Treatise on Plane Trigonometry. Fourth Edition, 1918. Page 276.

The post A “well-known” series first appeared on John D. Cook.

I wrote years ago about how striking it was to see two senior professors arguing over an undergraduate probability exercise. As I commented in that post, “Professors might forget how to do a calculus problem, or make a mistake in a calculation, but you wouldn’t see two professors defending incompatible solutions.”

Not only do smart people often get probability wrong, they can be very confident while doing so. The same applies to cryptography.

I recently learned of a cipher J. E. Littlewood invented that he believed was unbreakable. His idea was essentially a stream cipher, simulating a one-time pad by using a pseudorandom number generator. He assumed that since a one-time pad is secure, his simulacrum of a one-time pad would be secure. But it was not, for reasons explained in this paper.

Littlewood was a brilliant mathematician, but he was naive, and even arrogant, about cryptography. Here’s the opening to the paper in which he explained his method.

The legend that every cipher is breakable is of course absurd, though still widespread among people who should know better. I give a sufficient example …

He seems to be saying “Here’s a little example off the top of my head that shows how easy it is to create an unbreakable cipher.” He was the one who should have known better.

Richard Feynman’s Nobel Prize winning discoveries in quantum electrodynamics were partly inspired by his randomly observing a spinning dinner plate one day in the cafeteria. Paul Feyerabend said regarding science discovery, “The only principle that does not inhibit progress is: anything goes” (within relevant ethical constraints, of course).

Ideas can come from anywhere, including physical play. Various books can improve creative discovery skills, like George Pólya’s *How to Solve It*, Isaac Watts’ *Improvement of the Mind*, W. J. J. Gordon’s *Synectics*, and methodologies like mind mapping and C-K theory, to name a few. Many software products present themselves as shiny new tools promising help. However, we are not just disembodied minds interacting with a computer, but integrated beings whose reasoning is interwoven with memories, life history, emotions, and multisensory interaction with the world. The tactile is certainly a key avenue of learning, discovering, and understanding.

Fidget toys are popular. Different kinds of toys have different semiotics with respect to how they interplay with our imaginations. Like Legos. Structured, like Le Corbusier-style architecture, or multidimensional arrays or tensors, or the snapping together of many software components with well-defined interfaces, with regular scaling from the one to many. Or to take a much older example, Tinkertoys—the analogy of the graph, interconnectedness, semi-structured but composable, like DNA or protein chains, or complex interrelated biological processes, or neuronal connections, or the wild variety within order of human language.

As creative workers, we seek ideas from any and every place to help us in what we do. The tactile, the physical, is a vital place to look.

The post Thinking by playing around first appeared on John D. Cook.

This means that linear combinations of the polynomials

1, *x*, *x*², *x*³, …

are dense in *C* [0, 1].

Do you need all these powers of *x*? Could you approximate any continuous function arbitrarily well if you left out one of these powers, say *x*^{7}? Yes, you could.

You cannot omit the constant polynomial 1, but you can leave out any other power of *x*. In fact, you can leave out a lot of powers of *x*, as long as the sequence of exponents doesn’t thin out too quickly.

Herman Müntz proved in 1914 that a necessary and sufficient pair of conditions on the exponents of *x* is that the first exponent is 0 and that the sum of the reciprocals of the exponents diverges.

In other words, the sequence of powers of *x*

*x*^{λ0}, *x*^{λ1}, *x*^{λ2}, …

with

λ_{0} < λ_{1} < λ_{2} < …

is dense in *C* [0, 1] if and only if λ_{0} = 0 and

1/λ_{1} + 1/λ_{2} + 1/λ_{3} + … = ∞

Euler proved in 1737 that the sum of the reciprocals of the primes diverges, so the sequence

1, *x*^{2}, *x*^{3}, *x*^{5}, *x*^{7}, *x*^{11}, …

is dense in *C* [0, 1]. We can find a polynomial as close as we like to any particular continuous function if we combine enough prime powers.

Let’s see how well we can approximate |*x* − ½| using prime exponents up to 11.

The polynomial above is

0.4605 − 5.233 *x*^{2} + 7.211 *x*^{3} + 0.9295 *x*^{5} − 4.4646 *x*^{7} + 1.614 *x*^{11}.

This polynomial is not the best possible uniform approximation: it’s the least squares approximation. That is, it minimizes the 2-norm and not the ∞-norm. That’s because it’s convenient to do a least squares fit in Python using `scipy.optimize.curve_fit`.
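Here is a sketch of how such a fit might be carried out. The sample grid and the resulting coefficients are illustrative; the fitted values depend on the grid and may differ slightly from the figures quoted above.

```python
import numpy as np
from scipy.optimize import curve_fit

# Least squares fit of |x - 1/2| on [0, 1] using the constant term
# and the prime exponents 2, 3, 5, 7, 11.
def model(x, a0, a2, a3, a5, a7, a11):
    return a0 + a2*x**2 + a3*x**3 + a5*x**5 + a7*x**7 + a11*x**11

x = np.linspace(0, 1, 1001)
y = np.abs(x - 0.5)

coeffs, _ = curve_fit(model, x, y)
max_err = np.max(np.abs(model(x, *coeffs) - y))
print(coeffs, max_err)
```

The largest error occurs near the kink at *x* = ½, where no polynomial can match the non-differentiable target exactly.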

Incidentally, the Müntz approximation theorem holds for the 2-norm as well.

Here are handy approximations for logarithms base 10,

log_{10}(*x*) ≈ (*x* – 1)/(*x* + 1),

base *e*,

log_{e}(*x*) ≈ 2(*x* – 1)/(*x* + 1),

and base 2,

log_{2}(*x*) ≈ 3(*x* – 1)/(*x* + 1).

These can be used to mentally approximate logarithms to moderate accuracy, accurate enough for quick estimates.
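As a quick sanity check, here is a sketch evaluating the three approximations, each of the form *c*(*x* − 1)/(*x* + 1) with *c* = 1, 2, 3 for bases 10, *e*, and 2, at the sample point *x* = 2:

```python
import math

# The three mental approximations, each of the form c*(x - 1)/(x + 1).
def approx_log(x, c):
    return c*(x - 1)/(x + 1)

x = 2
print(approx_log(x, 1), math.log10(x))  # base 10
print(approx_log(x, 2), math.log(x))    # base e
print(approx_log(x, 3), math.log2(x))   # base 2: exact at x = 2
```

At *x* = 2 the base-2 approximation happens to be exact, and the other two are within a few hundredths of the true values.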

Here’s what’s curious about the approximations: the proportionality constants are apparently wrong, and yet the approximations are each fairly accurate.

It is **not** the case that

log_{e}(*x*) = 2 log_{10}(*x*).

In fact,

log_{e}(*x*) = log_{e}(10) log_{10}(*x*) = 2.303 log_{10}(*x*),

and so it seems that the approximation for natural logarithms should be off by 15%. But it’s not. The error is less than 2.5%.

Similarly,

log_{2}(*x*) = log_{2}(10) log_{10}(*x*) = 3.32 log_{10}(*x*)

and so the approximation for logarithms base 2 should be off by about 10%. But it’s not. The error here is also less than 2.5%.

What’s going on?

First of all, the approximation errors are nonlinear functions of *x* and the three approximation errors are not proportional. Second, each approximation is intended for use only over a limited range of *x*, and that valid range depends on the base.

Here’s a plot of the three error functions.

This plot makes it appear that the approximation error is much worse for natural logs and logs base 2 than for logs base 10. And it would be if we ignored the range of each approximation. Here’s another plot of the approximation errors, plotting each over only its valid range.

When restricted to their valid ranges, the approximations for logarithms base *e* and base 2 are *more* accurate than the approximation for logarithms base 10. Both errors are small, but in opposite directions.

Here’s a look at the relative approximation errors.

We can see that the relative errors for log 2 and log *e* are less than 2.5%, while the relative error for log 10 can be up to 15%.

The post Logarithm approximation curiosity first appeared on John D. Cook.

It turns out that if 2^{k} − 1 is prime then *k* must be prime, so Mersenne primes have the form 2^{p} − 1 with *p* prime. What about the converse? If *p* is prime, is 2^{p} − 1 also prime? No, because, for example, 2^{11} − 1 = 2047 = 23 × 89.

If *p* is not just a prime but a Mersenne prime, then is 2^{p} − 1 a prime? Sometimes, but not always. The first counterexample is *p* = 8191.

There is an interesting chain of iterated Mersenne primes: 2^{2} − 1 = 3, 2^{3} − 1 = 7, 2^{7} − 1 = 127, and 2^{127} − 1 = *M*_{12} are all prime.

This raises the question of whether *m* = 2^{M12} − 1 is prime. Direct testing using available methods is completely out of the question. The only way we’ll ever know is if there is some theoretical result that settles the question.

Here’s an easier question. Suppose *m* is prime. Where would it fall on the list of Mersenne primes if conjectures about the distribution of Mersenne primes are true?

This post reports the following conjecture on the distribution of Mersenne primes.

It has been conjectured that as *x* increases, the number of primes *p* ≤ *x* such that 2^{p} − 1 is also prime is asymptotically

*e*^{γ} log *x* / log 2

where γ is the Euler-Mascheroni constant.

If that conjecture is true, the number of primes less than *M*_{12} that are the exponents of Mersenne primes would be approximately

*e*^{γ} log *M*_{12} / log 2 = 226.2.

So if *m* is a Mersenne prime, it may be the 226th Mersenne prime, or *M*_{n} for some *n* around 226, if the conjectured distribution of Mersenne primes is correct.
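The arithmetic above is easy to check with a short script (the value of γ is hard-coded):

```python
import math

# Conjectured count of Mersenne-prime exponents below M12 = 2^127 - 1:
# e^gamma * log(M12) / log 2, gamma the Euler-Mascheroni constant.
gamma = 0.5772156649015329
M12 = 2**127 - 1
estimate = math.exp(gamma) * math.log(M12) / math.log(2)
print(round(estimate, 1))  # approximately 226.2
```

Since log *M*_{12} / log 2 is essentially 127, the estimate is just 127 *e*^{γ}.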

We’ve discovered a dozen Mersenne primes since the turn of the century and we’re up to 51 discovered so far. We’re probably not going to get up to the 226th Mersenne prime, if there even is a 226th Mersenne prime, any time soon.

The post Iterated Mersenne primes first appeared on John D. Cook.

This is wrong, but it’s a common mistake. And one reason it’s common is that **a variation on the mistake is approximately correct**, which we will explain shortly.

It’s obvious the reasoning in the opening paragraph is wrong when you extend it to five, or especially six, attempts. Are you certain to succeed after five attempts? What does it even mean that you have a 120% chance of success after six attempts?!

But let’s reduce the probabilities in the opening paragraph. If there’s a 2% chance of success on your first attempt, is there a 4% chance of success in two attempts and a 6% chance of success in three attempts? *Yes*, approximately.

Here is the correct formula for the probability of an event happening in two tries:

P(*A* ∪ *B*) = P(*A*) + P(*B*) − P(*A* ∩ *B*).

In words, the probability of *A* or *B* happening equals the probability of *A* happening, plus the probability of *B* happening, minus the probability of *A* and *B* both happening. The last term is a correction term. Without it, you’re counting some possibilities twice.

So if the probability of success on each attempt is 0.02, the probability of success on two attempts is

0.02 + 0.02 − 0.0004 = 0.0396 ≈ 0.04.

When the probabilities of *A* and *B* are each small, the probability of *A* and *B* both happening is an order of magnitude smaller, assuming independence [2]. The smaller the probabilities of *A* and *B*, the less the correction term matters.

If the probability of success on each attempt is 0.2, now the probability of success after two attempts is 0.36. Simply adding probabilities and neglecting the correction term is incorrect, but not terribly far from correct in this case.

When you consider more attempts, things get more complicated. The probability of success after three attempts is given by

P(*A* ∪ *B* ∪ *C*) = P(*A*) + P(*B*) + P(*C*) − P(*A* ∩ *B*) − P(*A* ∩ *C*) − P(*B* ∩ *C*) + P(*A* ∩ *B* ∩ *C*)

as I discuss here. Adding the probabilities of success separately over-estimates the correct probability. So you correct by subtracting the probabilities of pairs of successes. But this over-corrects, because you need to add back in the probability of all three successes.

If *A*, *B*, and *C* all have a 20% probability, the probability of *A* or *B* or *C* happening is 48.8%, not 60%, again assuming independence.

The error from naively adding probabilities increases as the number of probabilities increases.

Now let’s look at the general case. Suppose your probability of success on each attempt is *p*. Then your probability of failure on each independent attempt is 1 − *p*. The probability of at least one success out of *n* attempts is the complement of the probability of all failures, i.e.

1 − (1 − *p*)^{n}.

When *p* is small, and when *n* is small, we can approximate this by *np*. That’s why naively adding probabilities works when the probabilities are small and there aren’t many of them. Here’s a way to say this precisely using the binomial theorem:

1 − (1 − *p*)^{n} = *np* − C(*n*, 2) *p*² + C(*n*, 3) *p*³ − …

The exact probability is *np* plus (*n* − 1) terms that involve higher powers of *p*. When *p* and *n* are sufficiently small, these terms can be ignored.
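The numbers in the examples above can be reproduced with a few lines of Python:

```python
# Probability of at least one success in n independent attempts,
# each with success probability p.
def at_least_one(p, n):
    return 1 - (1 - p)**n

print(at_least_one(0.02, 2))  # approximately 0.04
print(at_least_one(0.2, 2))   # 0.36
print(at_least_one(0.2, 3))   # 0.488
```
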

[1] I’m deliberately not saying who. My point here is not to rub his nose in his mistake. This post will be online long after the particular video has been forgotten.

[2] Assuming *A* and *B* are independent. This is not always the case, and wrongly assuming independence can have disastrous consequences as I discuss here, but that’s a topic for another day.

***

Logistic regression models the probability of a yes/no event occurring. It gives you more information than a model that simply tries to classify yeses and nos. I advised a client to move from an uninterpretable classification method to logistic regression and they were so excited about the result that they filed a patent on it.

It’s too late to patent logistic regression, but they filed a patent on the application of logistic regression to their domain. I don’t know whether the patent was ever granted.

***

The article cited above is entitled “Rough approximations to move between logit and probability scales.” Here is a paragraph from the article giving its motivation.

When working in this space, it’s helpful to have some approximations to move between the logit and probability scales. (As an analogy, it is helpful to know that for a normal distribution, the interval ± 2 standard deviations around the mean covers about 95% of the distribution, while ± 3 standard deviations covers about 99%.)

Here are half the results from the post; the other half follow by symmetry.

    |-------+-------|
    | prob  | logit |
    |-------+-------|
    | 0.500 |     0 |
    | 0.750 |     1 |
    | 0.900 |     2 |
    | 0.950 |     3 |
    | 0.999 |     7 |
    |-------+-------|

Zero on the logit scale corresponds exactly to a probability of 0.5. The other values are approximate.

When I say the rest of the table follows by symmetry, I’m alluding to the fact that

logit(1 − *p*) = − logit(*p*).

So, for example, because logit(0.999) ≈ 7, logit(0.001) ≈ −7.
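A sketch checking the table and the symmetry relation:

```python
import math

# logit(p) = log(p / (1 - p)); note that logit(1 - p) = -logit(p).
def logit(p):
    return math.log(p/(1 - p))

print(logit(0.75))   # approximately 1
print(logit(0.9))    # approximately 2
print(logit(0.999))  # approximately 7
print(logit(0.001))  # approximately -7
```
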

***

The post reminded me of the decibel scale. As I wrote in this post, “It’s a curious and convenient fact that many decibel values are close to integers.”

- 3 dB ≈ 2
- 6 dB ≈ 4
- 7 dB ≈ 5
- 9 dB ≈ 8

I was curious whether the logit / probability approximations were as accurate as these decibel approximations. Alas, they are not. They are rough approximations, as advertised in the title, but still useful.

***

The post also reminded me of a comment by Andrew Gelman and Jennifer Hill on why natural logs are natural for regression.

The post Logistic regression quick takes first appeared on John D. Cook.

Suppose you want to evaluate the function

*u*(*z*) = (*e*^{z} − 1 − *z*)/*z*²

for small values of *z*, say *z* = 10^{−8}. This example comes from [1].

The Python code

    from numpy import exp

    def f(z):
        return (exp(z) - 1 - z)/z**2

    print(f(1e-8))

prints -0.607747099184471.

Now suppose you suspect numerical difficulties and compute your result to 50 decimal places using `bc -l`.

    scale = 50
    z = 10^-8
    (e(z) - 1 - z)/z^2

Now you get *u*(*z*) = .50000000166666667….

This suggests the original calculation was completely wrong. What’s going on?

For small *z*,

*e*^{z} ≈ 1 + *z*

and so we lose precision when directly evaluating the numerator in the definition of *u*. In our example, we lost *all* precision.

The mean value theorem from complex analysis says that the value of an analytic function at a point equals the continuous average of the values over a circle centered at that point. If we approximate this average by taking the average of 16 values in a circle of radius 1 around our point, we get full accuracy. The Python code

    from numpy import exp, pi, arange

    def g(z):
        N = 16
        ws = z + exp(2j*pi*arange(N)/N)
        return sum(f(ws))/N

returns

0.5000000016666668 + 8.673617379884035e-19j

which departs from the result calculated with `bc` in the 16th decimal place.

At a high level, we’re avoiding numerical difficulties by averaging over points far from the difficult region.

[1] Lloyd N. Trefethen and J. A. C. Weideman. The Exponentially Convergent Trapezoid Rule. SIAM Review. Vol. 56, No. 3. pp. 385–458.

The post Numerical application of mean value theorem first appeared on John D. Cook.

The most common way to numerically approximate the derivative of a function is the finite difference

*f*′ (*x*) ≈ ( *f*(*x* + *h*) − *f*(*x*) ) / *h*.

How small should *h* be? Since the exact value of the derivative is the limit as *h* goes to zero, the smaller *h* is the better. Except that’s not true in computer arithmetic. When *h* is too small, *f*(*x* + *h*) is so close to *f*(*x*) that the subtraction loses precision.

One way to see this is to think of the extreme case of *h* so small that *f*(*x* + *h*) equals *f*(*x*) to machine precision. Then the derivative is approximated by 0, regardless of what the actual derivative is.

As *h* gets smaller, the approximation error decreases, but the numerical error increases, and so there is some optimal value of *h*.

You can do significantly better by using a symmetric approximation for the derivative:

*f*′ (*x*) ≈ ( *f*(*x* + *h*) − *f*(*x* − *h*) ) / 2*h*.

For a given value of *h*, this formula has about the same numerical error but less approximation error. You still have a trade-off between approximation error and numerical error, but it’s a better trade-off.

If the function *f* that you want to differentiate is analytic, i.e. differentiable as a function of a complex variable, you can take the step *h* to be a complex number. When you do, the numerical difficulty completely goes away, and you can take *h* much smaller.

Suppose *h* is a small real number and you take

*f*′ (*x*) ≈ ( *f*(*x* + *ih*) − *f*(*x* − *ih*) ) / 2*ih*.

Since *f* takes real values along the real axis, *f*(*x* − *ih*) is the complex conjugate of *f*(*x* + *ih*) by the Schwarz reflection principle. And so rather than canceling out, the imaginary parts of the two terms in the numerator add. We have

*f*(*x* + *ih*) − *f*(*x* − *ih*) = 2*i* Im *f*(*x* + *ih*)

and so

*f*′ (*x*) ≈ Im *f*(*x* + *ih*) / *h*.

Let’s take the derivative of the gamma function Γ(*x*) at 1 using each of the three methods above. The exact value is −γ where γ is the Euler-Mascheroni constant. The following Python code shows the accuracy of each approach.

    from scipy.special import gamma

    def diff1(f, x, h):
        return (f(x + h) - f(x))/h

    def diff2(f, x, h):
        return (f(x + h) - f(x - h))/(2*h)

    γ = 0.5772156649015328606065
    x = 1
    h = 1e-7
    exact = -γ

    approx1 = diff1(gamma, x, h)
    approx2 = diff2(gamma, x, h)
    approx3 = diff2(gamma, x, h*1j)

    print(approx1 - exact)
    print(approx2 - exact)
    print(approx3 - exact)

This prints

9.95565755390615e-08 1.9161483510998778e-10 (9.103828801926284e-15-0j)

In this example the symmetric finite difference approximation is about 500 times more accurate than the asymmetric approximation, but the complex step approximation is 10,000,000 times more accurate.

It’s a little awkward that the complex step approximation returns a complex number, albeit one with zero complex part. We can eliminate this problem by using the approximation

*f*′ (*x*) ≈ Im *f*(*x* + *ih*) / *h*

which will return a real number. When we implement this in Python as

    def diff3(f, x, h):
        return (f(x + h*1j) / h).imag

we see that it produces the same result as `diff2` with an imaginary step, but without the zero imaginary part.

Bob gives the example that if you want to get one integration point in each quadrant of a 20-dimensional space, you need a million samples. (2^{20} to be precise.) But you may be able to get sufficiently accurate results with a lot less than a million samples.

If you wanted to be certain to have one sample in each quadrant, you could sample at (±1, ±1, ±1, …, ±1). But if for some weird reason you wanted to sample randomly *and* hit each quadrant, you have a coupon collector problem. The expected number of samples to hit each of *N* cells with uniform [1] random sampling with replacement is

*N*( log(*N*) + γ )

where γ is Euler’s constant. So if *N* = 2^{20}, the expected number of draws would be over 15 million.
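The expected count above can be checked directly:

```python
import math

# Coupon collector: expected number of uniform random draws (with
# replacement) needed to hit all N cells at least once, N*(log N + gamma).
N = 2**20
gamma = 0.5772156649015329
expected = N*(math.log(N) + gamma)
print(f"{expected:,.0f}")  # over 15 million
```
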

- The coupon collector problem and π
- Collecting a large number of coupons
- Consecutive coupon collector problem

[1] We’re assuming each cell is sampled independently with equal probability each time. Markov chain Monte Carlo is more complicated than that because the draws are *not* independent.

Gian-Carlo Rota made a profound observation on the application of theory.

One frequently notices, however, a wide gap between the bare statement of a principle and the skill required in recognizing that it applies to a particular problem.

This isn’t quite what he said. I made two small edits to generalize his actual statement. He referred specifically to the application of the inclusion-exclusion principle to problems in combinatorics. Here is his actual quote from [1].

One frequently notices, however, a wide gap between the bare statement of the principle and the skill required in recognizing that it applies to a particular combinatorial problem.

This post will expand a little on Rota’s original statement and a couple applications of the more general version of his statement.

I don’t think Rota would have disagreed with my edited version of his statement, but it’s interesting to think of his original context. The inclusion-exclusion principle seems like a simple problem solving strategy: you may count a set of things by first over-counting, then correcting for the over-counting.

For example, a few days ago I wrote about a graph created by turning left at the *n*th step if *n* is divisible by 7 or contains a digit 7. Suppose you wanted to count how many times you turn in the first 100 steps. You could count the number of positive integers up to 100 that are divisible by 7, then the number that contain the digit 7, then subtract the number that both are divisible by 7 and contain a 7.
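That count is small enough to verify by brute force. A quick sketch of the inclusion-exclusion calculation, checked against a direct count:

```python
# Count integers 1..100 that are divisible by 7 or contain the digit 7,
# by inclusion-exclusion.
div7 = {n for n in range(1, 101) if n % 7 == 0}
has7 = {n for n in range(1, 101) if '7' in str(n)}
count = len(div7) + len(has7) - len(div7 & has7)
print(count)  # equals len(div7 | has7)
```
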

You can carry this a step further by over-counting, then over-correcting for your over-counting, then correcting for your over-correction. This is the essence of the following probability theorem:

P(*A* ∪ *B* ∪ *C*) = P(*A*) + P(*B*) + P(*C*) − P(*A* ∩ *B*) − P(*A* ∩ *C*) − P(*B* ∩ *C*) + P(*A* ∩ *B* ∩ *C*).

The inclusion-exclusion principle is a clever idea, but not *that* clever. And yet Rota discusses how this idea was developed over decades into the Möbius inversion principle, which has diverse applications, including Euler characteristic and the calculus of finite differences.

Bayesian statistics is a direct application of Bayes’ theorem. Bayes’ theorem is a fairly simple idea, and yet people make careers out of applying it.

When I started working in the Biostatistics Department at MD Anderson, a bastion of Bayesian statistics, I was surprised how subtle Bayesian statistics is. I probably first saw Bayes’ theorem as a teenager, and yet it was not easy to wrap my head around Bayesian statistics. I would think “This is simple. Why is this hard?” The core principle was simple, but the application was not trivial.

When I took introductory physics, we would get stuck on homework problems and ask our professor for help. He would almost always begin by saying “Well, *F* = *ma*.”

This was infuriating. Yes, we know *F* = *ma*. But how does that let us solve this problem?!

There’s more to Newtonian mechanics than Newton’s laws, a lot more. And most of it is never made explicit. You pick it up by osmosis after you’ve worked hundreds of exercises.

[1] G. C. Rota. On the Foundations of Combinatorial Theory I. Theory of Möbius Functions. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 2 (1964) 340–368.

The post Up and down the abstraction ladder first appeared on John D. Cook.

Many people first encounter `make` in the context of building C programs. The program `make` reads an input file to tell it how to make things. To make a C program, you compile the source files into object files, then link the object files together.

You can tell `make` what depends on what, so it will not do any unnecessary work. If you tell it that a file `foo.o` depends on a file `foo.c`, then it will rebuild `foo.o` if that file is older than the file `foo.c` that it depends on. Looking at file timestamps allows `make` to save time by not doing unnecessary work. It also makes it easier to dictate what order things need to be done in.

There is nothing about `make` that is limited to compiling programs. It simply orchestrates commands in a declarative way. Because you state what the dependencies are, but let it figure out if and when each command needs to be run, a makefile is more like a Prolog program than a Python script.

I have a client that needs a dozen customized reports, each of which requires a few steps to create, and the order of the steps is important. I created the reports by hand. Then, of course, something changed and all the reports needed to be remade. So then I wrote a Python script to manage the next build. (And the next, and the next, …). But I should have written a makefile. I’m about to have to revise the reports, and this time I’ll probably use a makefile.

Here’s a very simple example of using a makefile to create documents. For more extensive examples of how to use makefiles for a variety of tasks, see How I stopped worrying and loved Makefiles. [1]

Suppose you need to create a PDF and a Microsoft Word version of a report from a LaTeX file [2].

    all: foo.pdf foo.docx

    foo.pdf: foo.tex
    	pdflatex foo.tex

    foo.docx: foo.tex
    	pandoc foo.tex --from latex --to docx > foo.docx

    clean:
    	rm foo.aux foo.log

If the file `foo.pdf` does not exist, or exists but is older than `foo.tex`, the command `make foo.pdf` will create the PDF file by running `pdflatex`. If `foo.pdf` is newer than `foo.tex`, then running `make foo.pdf` will return the message

make: 'foo.pdf' is up to date.

If you run `make` with no argument, it will build both `foo.pdf` and `foo.docx` as needed.

The command `make clean` deletes auxiliary files created by `pdflatex`. This shows that a `make` action does not have to “make” anything; it simply runs a command.

[1] The blog post title is an allusion to the 1964 movie *Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb*.

[2] I use LaTeX for everything I can. This saves time in the long run, if not in the moment, because of consistency. I used to use Word for proposals and such, and LaTeX for documents requiring equations. My workflow became more efficient when I switched to using LaTeX for everything.

The post Making documents with makefiles first appeared on John D. Cook.

One of the benefits of a scripting language like Python is that it gives you generalizations for free. For example, take the function `sorted`. If you give it a list of integers, it will return a list of numerically sorted integers. But you could also give it a list of strings, and it will return a list of alphabetically sorted strings. In fact, you can give it more general collections of more general things. The user gets generality for free: more functionality without complicating the syntax for the most concrete use case.

Here’s a similar example in math. Suppose you prove that for invertible matrices *A* and *B* of the same size

(*AB*)^{−1} = *B*^{−1} *A*^{−1}.

You can prove just as easily that the same is true for any elements *A* and *B* in a group. In fact, the proof may be simpler. Because you have less to work with in a group, you’re led by the nose to a simple proof.
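A quick numerical sanity check of the matrix identity, using randomly generated invertible matrices (made diagonally dominant so they are guaranteed invertible):

```python
import numpy as np

# Verify (AB)^(-1) = B^(-1) A^(-1) for a pair of random matrices.
rng = np.random.default_rng(42)
A = rng.random((3, 3)) + 3*np.eye(3)  # diagonally dominant => invertible
B = rng.random((3, 3)) + 3*np.eye(3)

lhs = np.linalg.inv(A @ B)
rhs = np.linalg.inv(B) @ np.linalg.inv(A)
print(np.allclose(lhs, rhs))  # True
```
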

But not all generalizations are free. In the sorting example, suppose you want your code to do more than sort lists, but instead perform any operation on lists that satisfies some properties that sorting has. This is the kind of thing some Haskell programmers love to do. The result is a kind of impenetrable code that takes talent to write.

Mathematical theorems are usually discovered in some concrete setting, then codified in a more general setting. There is some optimal level of generality that clarifies the original discovery without introducing more difficulty. But it is possible to push past this degree of generality at the cost of less intuitive hypotheses and more complicated proofs. The generalization comes at a cost. The cost may be worthwhile or not, depending on one’s motivations.

A gratuitous abstraction is one that has no apparent benefit and comes at some cost. A gratuitous abstraction may turn out to be useful in the future, but this rarely happens. And if it does happen, it will likely be because someone was led to re-discover the abstract result while pursuing a particular problem, not because they read the abstract result and thought of an application.

This post was motivated by looking at generalizations of the Möbius inversion formula. Mr. Möbius introduced his formula nearly two centuries ago in the context of a concrete problem in number theory. The formula has been greatly generalized since then in ways that would appear to be gratuitous, introducing incidence algebras and so forth. But these generalizations have made the formula more widely applicable, particularly to problems in combinatorics.

I won’t get into Möbius inversion now, but I may in the future. Here is a post I wrote a couple years ago that touches on Möbius inversion.

I was uncritical of generalization as an undergrad. Striving for maximum generality was fine with me. Then I hit my abstraction tolerance in grad school, and became leery of gratuitous abstraction. I’ve been disappointed by going down highly abstract rabbit holes that never paid off. But Möbius inversion reminds me that levels of abstraction that seem gratuitous may be practical after all.

Later I learned that a “standard” deck of cards is not quite as standard as I thought. The “standard 52-card deck” is indeed standard in the English-speaking world, but there are variations used in other countries.

The Unicode values for the characters representing playing cards are laid out in a nice pattern. The last digit of the Unicode value corresponds to the point value of the card (aces low), and the second-to-last digit corresponds to the suit.

The ace of spades has code point U+1F0A1, and the aces of hearts, diamonds, and clubs have values U+1F0B1, U+1F0C1, and U+1F0D1 respectively. The three of spades is U+1F0A3 and the seven of hearts is U+1F0B7.

But there’s a surprise if you’re only aware of the standard 52-card deck: there’s a knight between the jack and the queen. The Unicode values for jacks end in B (B_{hex} = 11_{ten}) as you’d expect, but queens end in D and kings end in E. The cards ending in C are knights, cards that don’t exist in the standard 52-card deck.

Here’s a little Python code that illustrates the Unicode layout.

    for i in range(4):
        for j in range(14):
            print(chr(0x1f0a1 + j + 16*i), end='')
        print()

The output is too small to read as plain text, but here’s a screen shot from running the code above in a terminal with a huge font.

Playing cards have their origin in tarot cards, which are commonly associated with the occult. But John C. Wright argues that tarot cards were based on Christian iconography before they became associated with the occult.

The post A deck of cards first appeared on John D. Cook.

If JWST can see Titan with this kind of resolution, how well could it see Pluto or other planets? In this post I’ll do some back-of-the-envelope calculations, only considering the apparent size of objects, ignoring other factors such as how bright an object is.

The apparent size of an object of radius *r* at a distance *d* is proportional to (*r*/*d*)². [1]

Of course the JWST isn’t located on Earth, but JWST is close to Earth relative to the distance to Titan.

Titan is about 1.4 × 10^{12} meters from here, and has a radius of about 2.6 × 10^{6} m. Pluto is about 6 × 10^{12} m away, and has a radius of 1.2 × 10^{6} m. So the apparent size of Pluto would be around 90 times smaller than the apparent size of Titan. Based on only the factors considered here, JWST could take a decent photo of Pluto, but it would be lower resolution than the photo of Titan above, and far lower resolution than the photos taken by New Horizons.

Could JWST photograph an exoplanet? There are two confirmed exoplanets orbiting Proxima Centauri, 4 × 10^{16} m away. The larger of these has a mass slightly larger than Earth’s, and presumably has a radius about equal to that of Earth, 6 × 10^{6} m. Its apparent size would be 150 million times smaller than that of Titan.

So no, it would not be possible for JWST to photograph an exoplanet.
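The ratios above can be reproduced in a couple of lines, using the radii and distances (in meters) quoted earlier:

```python
# Apparent size is proportional to (r/d)^2. Compare Titan, Pluto,
# and a Proxima Centauri planet.
def apparent_size(r, d):
    return (r/d)**2

titan = apparent_size(2.6e6, 1.4e12)
pluto = apparent_size(1.2e6, 6e12)
exo = apparent_size(6e6, 4e16)

print(titan/pluto)  # roughly 90
print(titan/exo)    # roughly 150 million
```
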

[1] In case the apparent size calculation feels like a rabbit out of a hat, here’s a quick derivation. Imagine looking out on a sphere of radius *d*. This sphere has area 4π*d*². A distant planet takes up π*r*² of area on this sphere, which is 0.25(*r*/*d*)² of your total field of vision.

Here’s my recreation of the first 1000 steps:

You can see that in general it makes a lot of turns at first and then turns less often as the density of primes thins out.

I wondered what the analogous walk would look like for Fizz Buzz. There are several variations on Fizz Buzz, and the one that produced the most interesting visualization was to turn left when a number either is divisible by 7 or contains the digit 7.
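Here is a minimal sketch of how such a walk can be generated, assuming a unit step on every integer and a 90° left turn before each qualifying step:

```python
# Walk one unit per integer n, turning left 90 degrees first
# whenever n is divisible by 7 or contains the digit 7.
def fizz_buzz_walk(steps):
    x, y, dx, dy = 0, 0, 1, 0
    points = [(0, 0)]
    for n in range(1, steps + 1):
        if n % 7 == 0 or '7' in str(n):
            dx, dy = -dy, dx  # rotate direction 90 degrees left
        x, y = x + dx, y + dy
        points.append((x, y))
    return points

points = fizz_buzz_walk(1000)
```

Plotting the resulting points with any plotting library reproduces the figure.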

Here’s what that looks like:

I wrote about something similar four years ago using a different rule for turning. That post looks at a walk on the complex numbers, turning left when you land on a Gaussian prime, a generalization of primes for complex numbers. These images have more symmetry and form closed loops. For example, here’s the walk you get starting at 3 + 5*i*.

The starting point matters. See the earlier post for more examples with different starting points.

Students may reasonably come away from an introductory course with the false impression that it is common for ODEs to have closed-form solutions because it *is* common in the class.

My education reacted against this. We were told from the beginning that differential equations rarely have closed-form solutions and that therefore we wouldn’t waste time learning how to find such solutions. I didn’t learn the classical solution techniques until years later when I taught an ODE class as a postdoc.

I also came away with a false impression, the idea that differential equations almost *never* have closed-form solutions in practice, especially nonlinear equations, and above all partial differential equations (PDEs). This isn’t far from the truth, but it is an exaggeration.

I specialized in nonlinear PDEs in grad school, and I don’t recall ever seeing a closed-form solution. I heard rumors of a nonlinear PDE with a closed-form solution, the KdV equation, but I saw this as the exception that proves the rule. It was the only nonlinear PDE of practical importance with a closed-form solution, or so I thought.
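The KdV solution in question is the one-soliton solution of *u*_t + 6*uu*_x + *u*_xxx = 0, a sech² bump of height *c*/2 traveling right at speed *c*. It can be checked symbolically; a sketch, assuming SymPy is available:

```python
import sympy as sp

x, t, c = sp.symbols("x t c", positive=True)

# One-soliton solution of the KdV equation u_t + 6 u u_x + u_xxx = 0
u = (c / 2) * sp.sech(sp.sqrt(c) / 2 * (x - c * t)) ** 2

# Substitute into the equation; the residual should simplify to zero.
residual = sp.diff(u, t) + 6 * u * sp.diff(u, x) + sp.diff(u, x, 3)
print(sp.simplify(residual.rewrite(sp.exp)))  # prints 0
```

Rewriting the hyperbolic functions in terms of exponentials before simplifying makes the cancellation easy for the symbolic engine to find.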

It is unusual for a nonlinear PDE to have a closed-form solution, but it is not unheard of. There are numerous examples of nonlinear PDEs, equations with important physical applications, that have closed-form solutions.

Yesterday I received a review copy of *Analytical Methods for Solving Nonlinear Partial Differential Equations* by Daniel Arrigo. If I had run across a book by that title as a grad student, it would have sounded as eldritch as a book on the biology of Bigfoot or the geography of Atlantis.

A few pages into the book there are nine exercises asking the reader to verify closed-form solutions to nonlinear PDEs:

- a nonlinear diffusion equation
- Fisher’s equation
- FitzHugh–Nagumo equation
- Burgers’ equation
- Liouville’s equation
- Sine-Gordon equation
- Korteweg–de Vries (KdV) equation
- modified Korteweg–de Vries (mKdV) equation
- Boussinesq’s equation

These are not artificial examples crafted to have closed-form solutions. These are differential equations that were formulated to model physical phenomena such as groundwater flow, nerve impulse transmission, and acoustics.

It remains true that differential equations, and especially nonlinear PDEs, typically must be solved numerically in applications. But the number of nonlinear PDEs with closed-form solutions is not insignificant.

Related posts:

- Period of a nonlinear pendulum
- Rational solutions to KdV
- Trading generalized derivatives for classical ones

[1] These techniques are not as haphazard as they seem. At a deeper level, they’re all about exploiting various forms of symmetry.

The post Closed-form solutions to nonlinear PDEs first appeared on John D. Cook.

Julia. Scala. Lua. TypeScript. Haskell. Go. Dart. Various computer languages new and old are sometimes proposed as better alternatives to mainstream languages. But compared to mainstream choices like Python, C, C++, and Java (cf. the TIOBE Index), are they worth using?

Certainly it depends a lot on the planned use: is it a one-off small project, or a large industrial-scale software application?

Yet even a one-off project can quickly grow to production-scale, with accompanying growing pains. Startups sometimes face a growth crisis when the nascent code base becomes unwieldy and must be refactored or fully rewritten (or you could do what Facebook/Meta did and just write a new compiler to make your existing code base run better).

The scope of different types of software projects and their requirements is so incredibly diverse that any single viewpoint from experience runs a risk of being myopic and thus inaccurate for other kinds of projects. With this caveat, I’ll share some of my own experience from observing projects for many dozens of production-scale software applications written for leadership-scale high performance computing systems. These are generally on a scale of 20,000 to 500,000 lines of code and often require support of mathematical and scientific libraries and middleware for build support, parallelism, visualization, I/O, data management and machine learning.

Here are some of the main issues regarding choice of programming languages and compilers for these codes:

1. **Language and compiler sustainability**. While the lifetime of computing systems is measured in years, the lifetime of an application code base can sometimes be measured in decades. Is the language it is written in likely to survive and be well-supported long into the future? For example, Fortran, though still used and frequently supported, is a less common language, thus requiring special effort from vendors, with fewer developer resources than more popular languages. Is there a diversity of multiple compilers from different providers to mitigate risk? A single provider means a single point of failure, a high risk; what happens if the supplier loses funding? Are the language and compilers likely to be adaptable for future computer hardware trends (though sometimes this is hard to predict)? Is there a large customer base to help ensure future support? Similarly, is there an adequate pool of available programmers deeply skilled in the language? Does the language have a well-featured standard library ecosystem and good support for third-party libraries and frameworks? Is there good tool support (debuggers, profilers, build tools)?

2. Related to this is the question of **language governance**. How are decisions about future updates to the language made? Is there broad participation from the user community and responsiveness to their needs? I’ve known members of the C++ language committee; from my limited experience they seem very reasonable and thoughtful about future directions of the language. On the other hand, some standards have introduced features that scarcely anyone ever uses—a waste of time and more clutter for the standard.

3. **Productivity**. It is said that programmer productivity depends on the ability of a few lines of code to express high-level abstractions that do a lot with minimal syntax. Does the language permit this? Does the language standard make sense (coherent, cohesive) and follow the principle of least surprise? At the same time, the language should not engulf what might better be handled modularly by a library. For example, a matrix-matrix product that is bound up with the language might be highly productive for simple cases but have difficulty supporting the many variants of matrix-matrix product provided for example by the NVIDIA CUTLASS library. Also, in-language support for CUDA GPU operations, for example, would make it hard for the language not to lag behind in support of the frequent new releases of CUDA.

4. **Strategic advantage**. The 10X improvement rule states that an innovation is only worth adopting if it offers 10X improvement compared to existing practice according to some metric. If switching to a given new language doesn’t bring significant improvement, it may not be worth doing. This is particularly true if there is an existing code base of some size. A related question is whether the new language offers an incremental transition path for an existing code to the new language (in many cases this is difficult or impossible).

5. **Performance (execution speed)**. Does the language allow one to get down to bare-metal performance rather than going through costly abstractions or an interpreter layer? Are the features of the underlying hardware exposed for the user to access? Is performance predictable? Can one get a sense of the performance of each line of code just by inspection, or is this obscured by abstractions or a complex compilation process? Does the language rely on just-in-time compilation or garbage collection that can fire at unpredictable times? This can be a problem for parallel computing, where one process unexpectedly performing such an operation can cause “hangs” while the other processes wait. Do the compiler developers provide good support for effective and accurate code optimization options? Have results from standardized, non-cherry-picked benchmarks been published (kernel benchmarks, proxy apps, full applications)?

Early adopters provide a vibrant “early alert” system for new language and tool developments that are useful for small projects and may be broadly impactful. Python was recognized early in the scientific computing community for its potential complementary use with standard languages for large scientific computations. When it comes to planning large-scale software projects, however, a range of factors based on project requirements must be considered to ensure highest likelihood of success.

The post Choosing a Computer Language for a Project first appeared on John D. Cook.