We can solve this equation in closed form (depending on your definition of closed form) by multiplying by an integrating factor.

The left side factors to

and so

The indefinite integral above cannot be evaluated in elementary terms, though it can be evaluated in terms of special functions.

We didn’t specify where the integration starts or the constant *c*. You can make the lower limit of integration whatever you want, and the value of *c* will be whatever is necessary to satisfy initial conditions.

The initial conditions hardly matter for large *x* because the constant *c* is multiplied by a negative exponential. As we’ll see below, the solution *y* decays like 1/*x* while the effect of the initial condition decays exponentially.

I wrote a few months ago about power series solutions for differential equations, but that approach won’t work here, at least not over a very large range. If *y* has a Taylor series expansion, so does its derivative, and so *y*′ + *y* has a Taylor series. But the right side of our differential equation has a singularity at 0.

What if instead of assuming *y* has a Taylor series expansion, we assume it has a Laurent series expansion? That is, instead of assuming *y* is a weighted sum of *positive* powers of *x*, we assume it’s a weighted sum of *negative* powers of *x*:

Then formally solving for the coefficients we find that

There’s one problem with this approach: The series diverges for all *x* because *n*! grows faster than *x*^{n}.

And yet the series gives a useful approximation! With a convergent series, we fix *x* and let *n* go to infinity. The divergent series above is an asymptotic series, and so we fix *n* and let *x* go to infinity.

To see how well the asymptotic solution compares to the analytic solution, we’ll pick the lower limit of integration to be 1 and pick *c* = 0 in the analytic solution.
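To make this concrete, here is a sketch in Python. The equation itself appears only as an image above, so I am assuming it is *y*′ + *y* = 1/*x*, which fits both the singularity at 0 and the factorially growing series coefficients. With lower limit of integration 1 and *c* = 0, the analytic solution can be written in terms of the exponential integral Ei.

```python
from math import exp, factorial
from scipy.special import expi  # exponential integral Ei

def exact(x):
    # e^{-x} * integral_1^x e^t/t dt, written via Ei, i.e. c = 0
    return exp(-x) * (expi(x) - expi(1))

def asymptotic(x, n):
    # first n terms of the (divergent) series sum_{k>=1} (k-1)!/x^k
    return sum(factorial(k - 1) / x**k for k in range(1, n + 1))

x = 10.0
for n in (1, 2, 3):
    print(n, abs(asymptotic(x, n) - exact(x)) / exact(x))
```

At *x* = 10 the relative error drops from roughly 10% with one term to under 1% with three terms, even though adding many more terms would eventually make things worse.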

The plot above shows the analytic solution, and the first and second order asymptotic solutions. (The *n*th order solution takes the first *n* terms of the asymptotic series.) Note that the three curves differ substantially at first but quickly converge together.

Now let’s look at the relative error.

That’s interesting. Eventually the second order approximation is more accurate than the first order, but not at first. In both cases the relative error hits a local minimum, bounces back up, then slowly decreases.

Does this pattern continue if we move on to a third order approximation? Yes, as illustrated below.

The graphs above suggest that higher order approximations are eventually more accurate, but we might do better than any particular order by letting the order vary, picking the optimal order for each *x*. That’s the idea behind **superasymptotics**. For each *x*, the superasymptotic series sums terms in the asymptotic series until the terms start getting larger.

When this approach works, the superasymptotic approximation can converge exponentially fast, even though the error of any fixed-order truncation only decreases polynomially fast.

If the arguments differ by 1/2, there is no closed formula, but there are useful approximations. I’ve needed something like this a few times lately.

The simplest approximation is

You could motivate or interpret this as saying Γ(*x* + 1/2) is approximately the geometric mean between Γ(*x* + 1) and Γ(*x*). As we’ll see in the plot below, this approximation is good to a couple significant figures for moderate values of *x*.
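As a quick sanity check, here is a sketch using the standard library’s `math.gamma`:

```python
from math import gamma, sqrt

x = 10.0
exact = gamma(x + 0.5)
approx = sqrt(gamma(x) * gamma(x + 1))  # geometric mean of the neighbors
rel_error = abs(approx - exact) / exact
print(rel_error)
```

At *x* = 10 the relative error is a little over 1%, consistent with the couple-of-significant-figures claim.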

There is another approximation that is a little more complicated but much more accurate.

The following plot shows the relative error in both approximations.

By the way, the first approximation above is a special case of the more general approximation

Source: J. S. Frame. An Approximation to the Quotient of Gamma Functions. The American Mathematical Monthly, Vol. 56, No. 8 (Oct., 1949), pp. 529-535.


The Laplace distribution with scale β has density

The Laplace distribution is also called the double exponential because it looks like two mirror-image exponential distributions glued together.

Note that the scale β is not the standard deviation. The standard deviation is √2 β.

To generate samples from a Laplace distribution with scale β, generate two independent exponential samples with mean β and return their difference.

If you don’t have an API for generating exponential random values, generate uniform random values and return the negative of the log. That will produce exponential values with mean 1. To make random values with mean β, just multiply the results by β.

If you want to generate Laplace values in Python, you could simply use the `laplace` function in `scipy.stats`. But I’ll write a generator from scratch just to show what you might do in another environment where you didn’t have exponential or Laplace generators.

```python
from math import log
from random import random

def exp_sample(mean):
    return -mean*log(random())

def laplace(scale):
    e1 = exp_sample(scale)
    e2 = exp_sample(scale)
    return e1 - e2
```

**Related**: Stand-alone numerical code, useful when you need a few common mathematical functions but are in an environment that doesn’t provide them, or when you want to avoid adding a library to your project.

I heard somewhere that Pluto receives more sunlight than you might think, enough to read by, and that sunlight on Pluto is much brighter than moonlight on Earth. I forget where I heard that, but I’ve done a back-of-the-envelope calculation to confirm that it’s true.

Pluto is about 40 AU from the sun, i.e. forty times as far from the sun as we are. The inverse square law implies Pluto gets 1/1600 as much light from the sun as Earth does.

Direct sun is between 30,000 and 100,000 lux (lumens per square meter). We’ll go with the high end because that’s an underestimate of how bright the sun would be if we didn’t have an atmosphere between us and the sun. So at high noon on Pluto you’d get at least 60 lux of sunlight.

Civil twilight is roughly enough light to read by, and that’s 3.4 lux. Moonlight is less than 0.3 lux.
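The arithmetic above is short enough to write out:

```python
earth_noon_lux = 100_000     # high-end estimate of direct sunlight on Earth
distance_au = 40             # Pluto's distance from the sun in AU
pluto_lux = earth_noon_lux / distance_au**2  # inverse square law
print(pluto_lux)             # 62.5, i.e. "at least 60 lux"

civil_twilight_lux = 3.4
moonlight_lux = 0.3
print(pluto_lux / civil_twilight_lux)  # many times brighter than civil twilight
```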

60 lux would be comparable to indoor lighting in a hall or stairway.

But what if you want to minimize the worst case **relative** error? If you chose 36, as above, but the correct answer turned out to be 30, then your relative error is 6/30 = 1/5.

But suppose you had guessed 35. The numbers in the interval [30, 42] furthest away from 35 are of course the end points, 30 and 42. In either case, the relative error would be 1/6.

Let’s look at the general problem for an interval [*a*, *b*]. It seems the thing to do is to pick *x* so that the relative error is the same whether the true value is at either extreme, *a* or *b*. In that case

(*x* – *a*) / *a* = (*b* – *x*) / *b*

and if we solve for *x* we get

*x* = 2*ab*/(*a* + *b*).

In other words, the worst case error is minimized when we pick *x* to be the **harmonic mean** of *a* and *b*.
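A minimal check of the example above:

```python
def minimax_relative_guess(a, b):
    # the harmonic mean minimizes the worst-case relative error over [a, b]
    return 2*a*b / (a + b)

x = minimax_relative_guess(30, 42)
print(x)                         # 35.0
print((x - 30)/30, (42 - x)/42)  # both equal 1/6
```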

Related post: Where to wait for an elevator

The previous post looked at how much information is contained in zip codes. This post will look at how much information is contained in someone’s age, birthday, and birth date. Combining zip code with birth date will demonstrate the plausibility of Latanya Sweeney’s famous result [1] that 87% of the US population can be identified based on zip code, sex, and birth date.

Birthday is the easiest. There is a small variation in the distribution of birthdays, but this doesn’t matter for our purposes. The amount of information in a birthday, to three significant figures, is 8.51 bits, whether you include or exclude leap days. You can assume all birthdays are equally common, or use actual demographic data. It only makes a difference in the 3rd decimal place.

I’ll be using the following age distribution data found on Wikipedia.

| Age range | Population |
|-----------|------------|
| 0–4   | 20201362 |
| 5–9   | 20348657 |
| 10–14 | 20677194 |
| 15–19 | 22040343 |
| 20–24 | 21585999 |
| 25–29 | 21101849 |
| 30–34 | 19962099 |
| 35–39 | 20179642 |
| 40–44 | 20890964 |
| 45–49 | 22708591 |
| 50–54 | 22298125 |
| 55–59 | 19664805 |
| 60–64 | 16817924 |
| 65–69 | 12435263 |
| 70–74 | 9278166  |
| 75–79 | 7317795  |
| 80–84 | 5743327  |
| 85+   | 5493433  |

To get data for each particular age, I’ll assume ages are evenly distributed in each group, and I’ll assume the 85+ group consists of people from ages 85 to 92. [2]

With these assumptions, there are 6.4 bits of information in age. This seems plausible: if all ages were uniformly distributed between 0 and 63, there would be exactly 6 bits of information since 2^{6} = 64.
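Here is a sketch of that computation, using the table above and the stated assumptions (ages uniform within each bin, the 85+ group spread over ages 85–92):

```python
from math import log2

birthday_bits = log2(365)   # about 8.51 bits, ignoring leap days
print(birthday_bits)

bins = [20201362, 20348657, 20677194, 22040343, 21585999, 21101849,
        19962099, 20179642, 20890964, 22708591, 22298125, 19664805,
        16817924, 12435263, 9278166, 7317795, 5743327, 5493433]

counts = []
for b in bins[:-1]:
    counts += [b/5]*5        # five ages per bin
counts += [bins[-1]/8]*8     # ages 85 through 92

total = sum(counts)
age_bits = -sum(c/total * log2(c/total) for c in counts)
print(age_bits)              # about 6.4 bits
```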

If we assume birthdays are uniformly distributed within each age, then age and birth date are independent. The information contained in the birth date would be the sum of the information contained in birthday and age, or 8.5 + 6.4 = 14.9 bits.

The previous post showed there are 13.8 bits of information in a zip code. There are about an equal number of men and women, so sex adds 1 bit. So zip code, sex, and birth date would give a total of 29.7 bits. Since the US population is between 2^{28} and 2^{29}, it’s plausible that we’d have enough information to identify everyone.

We’ve made a number of simplifying assumptions. We were a little fast and loose with age data, and we’ve assumed independence several times. We know that sex and age are not independent: more babies are boys, but women live longer. Still, Latanya Sweeney found empirically that you can identify 87% of Americans using the combination of zip code, sex, and birth date [1]. Her study was based on 1990 census data, and at that time the US population was a little less than 2^{28}.

- Randomized response and Bayes’ theorem
- Handedness, blood type, and introversion
- Toxic pairs and re-identification
- Database anonymization for testing

***

[1] Latanya Sweeney. “Simple Demographics Often Identify People Uniquely”. Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh 2000. Available here.


If you know someone’s US zip code, how much do you know about them? We can use entropy to measure the amount of information in bits.

To quantify the amount of information in a zip code, we need to know how many zip codes there are, and how evenly people are divided into zip codes.

There are about 43,000 zip codes in the US. The number fluctuates over time due to small adjustments.

**Average** information is maximized by dividing people into categories as **evenly** as possible. Information about one **particular** person is maximized by dividing people into categories as **unevenly** as possible. To see this, suppose there were only two zip codes. The information we’d expect to learn from a zip code would be maximized if we divided people into two equal groups. Suppose on the other hand you were in one zip code and everyone else in the other. On average, zip code would reveal very little about someone, though it would reveal a lot about you!

If everyone were divided evenly into one of 43,000 zip codes, the amount of information revealed by knowing someone’s zip code would be about 15.4 bits, a little more information than asking 15 independent yes/no questions, each with equally likely answers.

But zip codes are not evenly populated. How much information is there in an actual five-digit zip code? To answer that question we need to know the population of each zip code. That’s a little tricky. Zip codes represent mail delivery points, not geographical areas. Technically the US Census Bureau tracks population by ZCTA (Zip Code Tabulation Area) rather than zip code per se. Population by ZCTA is freely available, but difficult to find. I gave up trying to find the data from official sources but was able to find it here.

We can go through the data and find the probability *p* of someone living in each ZCTA and add up –*p* log_{2} *p* over each area. When we do, we find that a ZCTA contains 13.83 bits of information. (We knew it had to be less than 15.4 because uneven distribution reduces entropy.)
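In code, the entropy computation is only a few lines. The ZCTA population data isn’t reproduced here, so this sketch only checks the uniform case mentioned earlier; feeding in actual population counts gives the 13.83-bit figure.

```python
from math import log2

def entropy_bits(populations):
    # Shannon entropy of which area a randomly chosen person lives in
    total = sum(populations)
    return -sum(p/total * log2(p/total) for p in populations if p > 0)

uniform_bits = entropy_bits([1]*43000)
print(uniform_bits)   # about 15.4 bits
```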

The Safe Harbor provision of US HIPAA law lists zip codes as a quasi-identifier. Five digit zip codes do not fall under Safe Harbor. Three digit zip codes (the first three digits of five digit zip codes) do fall under Safe Harbor, mostly. Some areas are so sparsely populated that even three-digit zip code areas are considered too informative. Any three-digit zip code with fewer than 20,000 people is excluded. You can find a snapshot of the list here, though the list may change over time.

If we repeat our calculation for three-digit zip codes, we find that they carry about 9.17 bits of information. It makes little difference to the result whether you include sparse regions, exclude them, or lump them all into one region.

See the next post on information contained in age, birthday, and birth date.

**Related posts**:

Dark debt is found in complex systems and the anomalies it generates are complex system failures. Dark debt is not recognizable at the time of creation. … It arises from the unforeseen interactions of hardware or software with other parts of the framework. …

The challenge of dark debt is a difficult one. Because it exists mainly in interactions between pieces of the complex system, it cannot be appreciated by examination of those pieces. …

Unlike technical debt, which can be detected and, in principle at least, corrected by refactoring, dark debt surfaces through anomalies. Spectacular failures like those listed above do not arise from technical debt.

**Complexity posts**:

The generalization of the product rule is known as the Leibniz rule. It’s fairly simple and looks a lot like the binomial theorem.

The generalization of the chain rule is known as Faà di Bruno’s theorem. It’s more complicated, and it uses exponential Bell polynomials, something I’ve blogged about a few times lately.
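As a sketch, Faà di Bruno’s formula can be checked symbolically with SymPy’s `bell()`. The particular functions sin and exp here are just an illustrative choice:

```python
import sympy as sp

x, u = sp.symbols('x u')
f = sp.sin(u)    # outer function
g = sp.exp(x)    # inner function
n = 4

lhs = sp.diff(f.subs(u, g), x, n)
# Faa di Bruno: (f o g)^(n) = sum_k f^(k)(g) * B_{n,k}(g', ..., g^{(n-k+1)})
rhs = sum(sp.diff(f, u, k).subs(u, g)
          * sp.bell(n, k, [sp.diff(g, x, j) for j in range(1, n - k + 2)])
          for k in range(1, n + 1))

assert sp.simplify(lhs - rhs) == 0
```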

More Bell polynomial posts:

You need at least two samples, and often two are enough, but you might get any number, and those larger numbers will pull the expected value up.

Here’s a simulation program in Python.

```python
from random import random
from collections import Counter

N = 1000000
c = Counter()
for _ in range(N):
    x = 0
    steps = 0
    while x < 1:
        x += random()
        steps += 1
    c[steps] += 1

print( sum([ k*c[k] for k in c.keys() ])/N )
```

When I ran it I got 2.718921. There’s a theoretical result first proved by W. Weissblum that the expected value is *e* = 2.71828… Our error was on the order of 1/√N, which is what we’d expect from the central limit theorem.

Now we can explore further in a couple directions. We could take a look at the *distribution* of the number of steps, not just its expected value. Printing `c` shows us the raw data.

```
Counter({2: 499786, 3: 333175, 4: 125300, 5: 33466,
         6: 6856, 7: 1213, 8: 172, 9: 29, 10: 3})
```

And here’s a plot.

We could also generalize the problem by taking powers of the random numbers. Here’s what we get when we use exponents 1 through 20.

There’s a theoretical result that the expected number of steps is asymptotically equal to *cn* where

I computed *c* = 1.2494. The plot above shows that the dependence on the exponent *n* does look linear. The simulation results appear to be higher than the asymptotic prediction by a constant, but that’s consistent with the asymptotic prediction: a constant offset becomes negligible relative to *n* as *n* increases.

Reference for theoretical results: D. J. Newman and M. S. Klamkin. Expectations for Sums of Powers. The American Mathematical Monthly, Vol. 66, No. 1 (Jan., 1959), pp. 50-51.

As a student, one of the things that seemed curious to me about power series was that a function might have a simple power series, but its inverse could be much more complicated. For example, the coefficients in the series for arctangent have a very simple pattern

but the coefficients in the series for tangent are mysterious.

There’s no obvious pattern to the coefficients. It is possible to write the sum in closed form, but this requires introducing the Bernoulli numbers which are only slightly less mysterious than the power series for tangent.
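You can peek at these coefficients with SymPy:

```python
import sympy as sp

x = sp.symbols('x')
s = sp.series(sp.tan(x), x, 0, 8)
print(s)   # x + x**3/3 + 2*x**5/15 + 17*x**7/315 + O(x**8)
```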

To a calculus student, this is bad news: a simple, familiar function has a complicated power series. But this is good news for combinatorics. Reading the equation from right to left, it says a complicated sequence has a simple generating function!

The example above suggests that inverting a power series, i.e. starting with the power series for a function and computing the power series for its inverse, is going to be complicated. We introduce exponential Bell polynomials to encapsulate some of the complexity, analogous to introducing Bernoulli numbers above.

Assume the function we want to invert, *f*(*x*), satisfies *f*(0) = 0 and its derivative satisfies *f*′(0) ≠ 0. Assume *f* and its inverse *g* have the series representations

and

Note that this isn’t the power series per se. The coefficients here are *k*! times the ordinary power series coefficients. (Compare with ordinary and exponential generating functions explained here.)

Then you can compute the series coefficients for *g* from the coefficients for *f* as follows. We have *g*_{1} = 1/*f*_{1} (things start out easy!) and for *n* ≥ 2,

where the *B*‘s are exponential Bell polynomials,

and

is the *k*th rising power of *n*. (More on rising and falling powers here.)

Let’s do an example in Python. Suppose we want a series for the inverse of the gamma function near 2. The equations above assume we’re working in a neighborhood of 0 and that our function is 0 at 0. So we will invert the series for *f*(*x*) = Γ(*x*+2) – 1 and then adjust the result to find the inverse of Γ(*x*).

```python
import numpy as np
from scipy.special import gamma
from sympy import factorial, bell

def rising_power(a, k):
    return gamma(a+k)/gamma(a)

# Taylor series coefficients for gamma centered at 2
gamma_taylor_coeff = [
     1.00000000000000,
     0.42278433509846,
     0.41184033042643,
     0.08157691924708,
     0.07424901075351,
    -0.0002669820687,
     0.01115404571813,
    -0.0028526458211,
     0.00210393334069,
    -0.0009195738388,
     0.00049038845082,
]
N = len(gamma_taylor_coeff)

f_exp_coeff = np.zeros(N)
for i in range(1, N):
    f_exp_coeff[i] = gamma_taylor_coeff[i]*factorial(i)

# Verify the theoretical requirements on f
assert( f_exp_coeff[0] == 0 )
assert( f_exp_coeff[1] != 0 )

f_hat = np.zeros(N-1)
for k in range(1, N-1):
    f_hat[k] = f_exp_coeff[k+1] / ((k+1)*f_exp_coeff[1])

g_exp_coeff = np.zeros(N)
g_exp_coeff[1] = 1/f_exp_coeff[1]

for n in range(2, N):
    s = 0
    for k in range(1, n):
        # Note that the last argument to bell must be a list,
        # and not a NumPy array.
        b = bell(n-1, k, list(f_hat[1:(n-k+1)]))
        s += (-1)**k * rising_power(n, k) * b
    g_exp_coeff[n] = s / f_exp_coeff[1]**n

def g(x):
    return sum([ x**i * g_exp_coeff[i] / factorial(i) for i in range(1, N) ])

def gamma_inverse(x):
    return 2. + g(x-1.)

# Verify you end up where you started when you do a round trip
for x in [1.9, 2.0, 2.1]:
    print( gamma_inverse(gamma(x)) )
```

Output:

```
1.90000003424331
2.00000000000000
2.09999984671755
```

You can think of it as a finite sum or an infinite sum: binomial coefficients are zero when the top number is a non-negative integer and the bottom number is a larger integer. See the general definition of binomial coefficients.
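The identity referred to above appears only as an image; presumably it is the diagonal-sum identity *F*_{n+1} = Σ_{k} C(*n* − *k*, *k*), which is easy to check numerically:

```python
from math import comb

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# F(n+1) equals a sum along a diagonal of Pascal's triangle;
# comb(n-k, k) is 0 once k exceeds n-k, so the sum is effectively finite
for n in range(12):
    assert sum(comb(n - k, k) for k in range(n + 1)) == fib(n + 1)
print("identity holds for n < 12")
```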

**Source**: Sam E. Ganis. Notes on the Fibonacci Sequence. The American Mathematical Monthly, Vol. 66, No. 2 (Feb., 1959), pp. 129-130.

**More Fibonacci number posts**:

I worked as a project manager earlier in my career, but what I’m doing now feels completely different and much more pleasant. **Strip away the bureaucracy and politics, and project management just feels like getting work done**.

I have several people working for me now, but project management doesn’t take much time. I still spend most of my time working on my own projects.

Sometimes these are surprising. The plot of the partial sums might bounce all over in the process of filling in a very symmetric plot. Here’s an example of that.


You can find details of what everything in the diagram means here.

How can you quantify these approximations? One way is to use Kullback-Leibler divergence. In this post I’ll illustrate this for the normal approximation to the beta and gamma distributions.

The Kullback-Leibler divergence between two random variables *X* and *Y* is defined as

We will compute this integral numerically in the code below to create graphs of how K-L divergence varies with parameters.

Here are the imports we’ll need.

```python
from scipy.integrate import quad
from scipy.stats import beta, gamma, norm
from numpy import inf  # scipy.inf was an alias for NumPy's inf, since removed
import matplotlib.pyplot as plt
```

As the shape parameters of a beta distribution become large, the probability distribution becomes approximately normal (Gaussian).

Here is code that will take the two shape parameters of a beta distribution, construct a normal approximation by moment matching, and compute the quality of the approximation as measured by Kullback-Leibler divergence.

```python
def KL_beta_norm(a, b):
    b_dist = beta(a, b)
    n = norm(b_dist.mean(), b_dist.std())
    f = lambda x: -b_dist.pdf(x)*(n.logpdf(x) - b_dist.logpdf(x))
    integral, error = quad(f, 0, 1)
    return integral
```

And here we make our plot.

```python
x = [2**i for i in range(11)]
y = [KL_beta_norm(t, 2*t) for t in x]
plt.plot(x, y)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("a")
plt.ylabel("KL divergence")
plt.title("Comparing beta(a, 2a) and its normal approximation")
plt.savefig("beta_KL.svg")
plt.close()
```

The result is nearly linear on a log-log scale.

I made the *b* parameter twice the *a* parameter to show that you don’t need symmetry. When you do have symmetry, i.e., *a* = *b*, the approximation quality is better and the graph is even straighter.

As the shape parameter of a gamma distribution increases, the probability density becomes more and more like that of a normal distribution. We can quantify this with the following code.

```python
def KL_gamma_norm(shape):
    g = gamma(shape)
    n = norm(g.mean(), g.std())
    f = lambda x: -g.pdf(x)*(n.logpdf(x) - g.logpdf(x))
    mode = max(0, shape-1)
    integral1, error1 = quad(f, 0, mode)
    integral2, error2 = quad(f, mode, inf)
    return integral1 + integral2
```

The integration code is a little more complicated this time. For small shape parameters, code analogous to that for the beta distribution will work just fine. But for larger parameters, the integration fails. The numerical integration routine needs a little help. The largest contribution to the integral is located near the mode of the gamma distribution. For large shape parameters, the integration routine misses this peak and grossly underestimates the integral. We break the integral into two pieces at the mode of the gamma distribution so the integration routine can’t miss it. This is a small example of why numerical integration cannot be completely automated. You have to know something about what you’re integrating.

(The `quad()` function has a parameter `points` to let you tell it about points of interest like this, but it doesn’t work when one of the limits of integration is infinite.)

The plotting code is essentially the same as that for the beta distribution. As before, the plot is linear on a log-log scale.

You could do a similar analysis on the other approximations in the distribution relationship diagram above.

- Details for error bound on normal approximation to the Poisson distribution
- Error in the normal approximation to the beta distribution
- Error in the normal approximation to the binomial distribution
- Error in the normal approximation to the gamma distribution
- Error in the normal approximation to the Poisson distribution
- Error in the normal approximation to the t distribution
- Error in the Poisson approximation to the binomial distribution

The potential polynomials *A*_{r,n} are homogeneous polynomials of degree *r* in *n* variables. They are usually defined implicitly by

They can also be defined explicitly in terms of ordinary Bell polynomials *B*_{n,k} by

Does anyone know why they’re called “potential” polynomials? Is there some analogy with a physical potential? [1]

By the way, potential polynomials are called “ordinary” in the same sense that Bell polynomials and generating functions are called ordinary: to contrast with exponential forms that insert factorial scaling factors. See the previous post for details.

* * *

[1] When I was learning electricity and magnetism, I was confused by the phrase “potential difference.” Did that mean that there isn’t a difference yet, but there was the potential for a difference? No, it means a difference in electrical potential. There’s still a sense in which “potential” carries with it the idea of possibility. A rock sitting on top of a mountain has a lot of gravitational potential, energy that could be released if the rock started falling.

Potential polynomials could sound like functions that aren’t polynomials yet but have the potential to be, maybe if they just tried harder.

]]>There are Bell polynomials of one variable and Bell polynomials of several variables. The latter are sometimes called “partial” Bell polynomials. In differential equations, “ordinary” means univariate (ODEs) and “partial” means multivariate (PDEs). So if “partial” Bell polynomials are the multivariate form, you might assume that “ordinary” Bell polynomials are the univariate form. Unfortunately that’s not the case.

It seems that the “partial” in the context of Bell polynomials was taken from differential equations, but the term “ordinary” was not. Ordinary and partial are not opposites in the context of Bell polynomials. A Bell polynomial can be ordinary *and* partial!

“Ordinary” in this context is the opposite of “exponential,” not the opposite of “partial.” The analogy is with generating functions, not differential equations. The ordinary generating function of a sequence multiplies the *n*th term by *x*^{n} and sums. The exponential generating function of a sequence multiplies the *n*th term by *x*^{n}/*n*! and sums. For example, the ordinary generating function for the Fibonacci numbers is

while the exponential generating function is

where

The definitions of exponential and ordinary Bell polynomials are given below, but the difference between the two that I wish to point out is that the former divides the *k*th polynomial argument by *k*! while the latter does not. They also differ by a scaling factor. The exponential form of *B*_{n,k} has a factor of *n*! where the ordinary form has a factor of *k*!.

Based on the colloquial meaning of “ordinary” you might assume that it is the default. And indeed that is the case with generating functions. Without further qualification, generating function means ordinary generating function. You’ll primarily see explicit references to *ordinary* generating functions in a context where it’s necessary to distinguish them from some other kind of generating function. Usually the word “ordinary” will be written in parentheses to emphasize that an otherwise implicit assumption is being made explicit. In short, “ordinary” and “customary” correspond in the context of generating functions.

But in the context of Bell polynomials, it seems that the exponential form is more customary. At least in some sources, an unqualified reference to Bell polynomials refers to the exponential variety. That’s the case in SymPy, where the function `bell()` implements exponential Bell polynomials and there is no function to compute ordinary Bell polynomials.
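Using the scaling just described (the exponential form divides the *k*th argument by *k*! and has a factor of *n*! where the ordinary form has *k*!), you can get ordinary Bell polynomials out of SymPy’s exponential `bell()`. This is a sketch; the helper name is mine, and it’s worth checking against your preferred definition.

```python
import sympy as sp

def ordinary_bell(n, k, xs):
    # ordinary B-hat_{n,k}(x1, ..., x_{n-k+1}) from the exponential form:
    # scale the j-th argument by j!, then multiply by k!/n!
    scaled = [sp.factorial(j) * xj for j, xj in enumerate(xs, start=1)]
    return sp.expand(sp.factorial(k) / sp.factorial(n) * sp.bell(n, k, scaled))

x1, x2 = sp.symbols('x1 x2')
print(ordinary_bell(3, 2, [x1, x2]))   # 2*x1*x2
```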

Now for the actual definitions. We can make our definitions more concise if we first define multi-index notation. A **multi-index**

is an *n*-tuple of non-negative integers. The **norm** of a multi-index is defined by

and the **factorial** of a multi-index is defined by

If *x* = (*x*_{1}, *x*_{2}, …, *x*_{n}) is an *n*-tuple of real numbers then *x* raised to the **power** of a multi-index α is defined by

The **ordinary Bell polynomial** *B*_{n,k} is

where *x* is an (*n* – *k* + 1)-tuple and α ranges over all multi-indices subject to two constraints: the norm of α is *k* and

The **exponential Bell polynomials** can then be defined in terms of the ordinary Bell polynomials by

You can subscribe to the blog via RSS or email.

I often use SVG images because they look great on a variety of devices, but most email clients won’t display that format. If you subscribe by email, there’s always a link to open the article in a browser and see all the images.

I also have a monthly newsletter. It highlights the most popular posts of the month and usually includes a few words about what I’ve been up to.

I have 17 Twitter accounts that post daily on some topic and one personal account.

The three most popular are CompSciFact, AlgebraFact, and ProbFact. These accounts are a little broader than the names imply. For example, if I run across something that doesn’t fall neatly into another mathematical category, I’ll post it to AlgebraFact. Miscellaneous applied math tends to end up on AnalysisFact.

You can find a list of all accounts and their descriptions here.

I don’t keep up with replies to my topical accounts, but I usually look at replies to my personal account. If you want to be sure I get your message, please call or send me email.

Occasionally people ask whether there’s a team of people behind my Twitter accounts. Nope. Just me. I delegate other parts of my business, but not Twitter. I schedule most of the technical account tweets in advance, but I write them. My personal account is mostly spontaneous.

Here’s my contact info.

My phone number isn’t on there. It’s 832.422.8646. If you’d like, you can import my contact info as a vCard or use the QR code below.

A straight line has zero curvature, and a circle with radius *r* has curvature 1/*r*. So in a rounded square the curvature jumps from 0 to 1/*r* where the flat side meets the circular corner. But in the figure above there’s no abrupt change in curvature but instead a smooth transition. More on that in just a bit.

To get a smoother transition from the corners to the flat sides, designers use what mathematicians would call *L*^{p} balls, curves satisfying

in rectangular coordinates or

in polar coordinates.

When *p* = 2 we get a circle. When *p* = 2.5, we get a square version of the superellipse. As *p* increases the corners get sharper, pushing out toward the corner of a square. Some sources define the squircle to be the case *p* = 4 but others say *p* = 5. The image at the top of the post uses *p* = 4. [*]
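In polar form the squircle is simple to evaluate; here is a sketch:

```python
import numpy as np

def squircle_r(theta, p):
    # polar radius of the L^p ball |x|^p + |y|^p = 1
    return (np.abs(np.cos(theta))**p + np.abs(np.sin(theta))**p)**(-1/p)

print(squircle_r(0.0, 4))      # 1.0 on the axes
print(squircle_r(np.pi/4, 4))  # 2**(1/4), about 1.189, at the corner
```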

To show how the curvature changes, let’s plot the curvature on top of the squircle. The inner curve is the squircle itself, and radial distance to the outer curve is proportional to the curvature.

Here’s the plot for *p* = 4.

And here’s the plot for *p* = 5.

If we were to make the same plot for a rounded square, the curvature would be zero over the flat sides and jump to some constant value over the circular caps. We can see above that the curvature is largest over the corners but continuously approaches zero toward the middle of the sides.

**Related**: Swedish Superellipse

[*] Use whatever value of *p* you like. The larger *p* is, the closer the figure becomes to a square. The closer *p* is to 2, the closer the figure is to a circle.

You can even use smaller values of *p*. The case *p* = 1 gives you a diamond. If *p* is less than 1, the sides bend inward. The plot below shows the case *p* = 0.5.

As *p* gets smaller, the sides pull in more. As *p* goes to zero, the figure looks more and more like a plus sign.

If the function *f*(*x*) is constant, the differential equation is easy to solve in closed form. But when it is not constant, it is very unlikely that closed form solutions exist. But there may be useful closed form approximate solutions.

There is no linear term in the equation above, but there is a standard change of variables to eliminate a linear term and put any second order linear equation with variable coefficients in the form above.

The Liouville-Green approximate solutions are linear combinations of the functions

*f*(*x*)^{-1/4} exp( ± ∫ √*f*(*x*) d*x* )
The constant of integration doesn’t matter. It would only change the solution by a constant multiple, so any indefinite integral will do.

To try out the Liouville-Green approximation, let's look at the equation

*y*″ = *x*² *y*.

Here *f*(*x*) = *x*² and so the LG approximate solutions are

*A* exp(*x*²/2)/√*x* + *B* exp(−*x*²/2)/√*x*.
Let's see how these solutions compare to a numerical solution of the equation. To make things more definite, we need initial conditions. Let's say *A* = *B* = 1, which corresponds to initial conditions *y*(1) = 2.25525 and *y*′(1) = −0.0854354.

Here’s a plot of the L-G approximation and a numerical solution.

Here’s the Mathematica code that was used to create the plot above.

s = NDSolve[
    {y''[x] == x^2 y[x], y[1] == 2.25525, y'[1] == -0.0854354},
    y[x], {x, 1, 3}]

g[x_] = (Exp[-x^2/2] + Exp[x^2/2])/Sqrt[x]

Plot[
    {Evaluate[y[x] /. s], g[x]}, {x, 1, 3},
    PlotLabels -> {"Numerical", "LG"}]
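For a check that doesn't require Mathematica, here's a Python sketch (mine, not from the post) of why the growing LG function is a good approximation: for *y* = exp(*x*²/2)/√*x* the relative residual (*y*″ − *x*²*y*)/*y* works out to 3/(4*x*²), which goes to zero as *x* grows.

```python
import math

def lg(x):
    """Growing Liouville-Green approximate solution for y'' = x^2 y."""
    return math.exp(x**2 / 2) / math.sqrt(x)

def relative_residual(y, x, h=1e-4):
    """(y'' - x^2 y) / y, with y'' estimated by a central difference."""
    ypp = (y(x - h) - 2*y(x) + y(x + h)) / h**2
    return ypp / y(x) - x**2

# The residual matches 3/(4 x^2), so the approximation improves for large x.
for x in (2.0, 4.0, 8.0):
    print(x, relative_residual(lg, x), 3 / (4 * x**2))
```

The same computation applied to exp(−*x*²/2)/√*x* gives the identical residual, so both LG functions nearly satisfy the equation for large *x*.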

For more on Liouville-Green approximations, see Olver’s classic book Asymptotics and Special Functions.

**Related post**: Mechanical vibrations

I’m writing code based on a theoretical result in a journal article.

First the program gets absurd results. Theory pinpoints a bug in the code.

Then the code confirms my suspicions of an error in the paper.

The code also uncovers, not an error per se, but an important missing detail in the paper.

Then code and theory differ by about 1%. Uncertain whether this is theoretical approximation error or a subtle bug.

Then try the code for different arguments, ones which theory predicts will have less approximation error, and now code and theory differ by 0.1%.

Then I have more confidence in (my understanding of) the theory and in my code.

Hermann Weyl is my great model. He used to write beautiful literature. Reading it was a joy because he put a lot of thought into it. Hermann Weyl wrote a book called The Classical Groups, very famous book, and at the same time a book appeared by Chevalley called Lie Groups. Chevalley’s book was compact, all the theorems are there, very precise, Bourbaki-style definitions. If you want to learn about it, it’s all there. Hermann Weyl’s book was the other way around. It is discursive, expansive, it’s a bigger book, it tells you much more than you need to know. It’s the kind of book you read ten times. Chevalley’s book you read once. …

In the introduction he explains that he’s German, he’s writing in a foreign language, he apologizes saying he is writing in a language that was “not sung by the gods at my cradle.” You can’t get any more poetic than that. I’m a great admirer of his style, of his exposition and his mathematics. They go together.

Here’s the portion of the preface that Atiyah is quoting, where Weyl apologizes eloquently for his lack of eloquence in writing English.

The gods have imposed upon my writing the yoke of a foreign tongue that was not sung at my cradle.

“Was dies heissen will, weiss jeder,
Der im Traum pferdlos geritten,”

I am tempted to say with Gottfried Keller. Nobody is more aware than myself of the attendant loss of vigor, ease and lucidity of expression.

Photographer Bernhard Kreutzer came by while I was interviewing Atiyah at the first Heidelberg Laureate Forum in 2013.

and the following two functions based on the series. One takes only the first non-zero term:

def short_series(x):
    return 0.14285714*x

and a second that takes three non-zero terms:

def long_series(x):
    return 0.1425714*x - 4.85908649e-04*x**3 + 4.9582515e-07*x**5

Which is more accurate? Let’s make a couple of plots to see.

First here are the results on the linear scale.

Note that the short series is far more accurate than the long series! The differences are more dramatic on the log scale. There you can see that you get more correct significant figures from the short series as the angle approaches zero.

What’s going on? Shouldn’t you get *more* accuracy from a longer Taylor series approximation?

Yes, but there’s an error in our code. The leading coefficient in `long_series` is wrong in the 4th decimal place. That small error in the most important term outweighs the benefit of adding more terms to the series. The simpler code, implemented correctly, is better than the more complicated code with a small error.

The moral of the story is to **focus on the leading term.** Until it’s right, the rest of the terms don’t matter.

**Update**: Based on some discussion after publishing this post, I think my comment about `short_series` being simpler misdirected attention from my main point. Here’s a variation on the same argument based on two equally complex functions.

def typo1(x):
    return 0.1425714*x - 4.85908649e-04*x**3 + 4.9582515e-07*x**5

def typo2(x):
    return 0.14285714*x - 5.85908649e-04*x**3 + 4.9582515e-07*x**5

Both functions have one typo: `typo1` in the first non-zero coefficient and `typo2` in the second non-zero coefficient. These two functions behave almost exactly like `long_series` and `short_series` respectively. A typo in the leading coefficient has a much bigger impact on accuracy than a typo in one of the other coefficients.
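The coefficients above match the Taylor series of sin(*x*/7); assuming that is the underlying function (an assumption on my part — the post doesn't restate it here), we can check the claim numerically:

```python
import math

# Assumption: the coefficients match the Taylor series of sin(x/7),
# so we treat sin(x/7) as the true function being approximated.
def typo1(x):  # one typo, in the first non-zero coefficient
    return 0.1425714*x - 4.85908649e-04*x**3 + 4.9582515e-07*x**5

def typo2(x):  # one typo, in the second non-zero coefficient
    return 0.14285714*x - 5.85908649e-04*x**3 + 4.9582515e-07*x**5

x = 1.0
true = math.sin(x / 7)
err1 = abs(typo1(x) - true)
err2 = abs(typo2(x) - true)
print(err1, err2)  # err1 is roughly three times larger than err2
```

The gap widens as *x* shrinks, since the cubic and quintic terms fade and the leading coefficient dominates.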

**Related post**: Life lessons from differential equations

The most obvious way to compute multiple integrals is to use product methods, analogous to the way you learn to compute multiple integrals by hand. Unfortunately the amount of work necessary with this approach grows exponentially with the number of variables, and so the approach is impractical beyond modest dimensions.

The idea behind Monte Carlo integration, or integration by darts, is to estimate the area under a (high dimensional) surface by taking random points and counting how many fall below the surface. More details here. The error goes down like the reciprocal of the square root of the number of points. The error estimate has nothing to do with the dimension per se. In that sense Monte Carlo integration is indeed independent of dimension, and for that reason is often used for numerical integration in dimensions too high to use product methods.

Suppose you’re trying to estimate what portion of a (high dimensional) box some region takes up, and you know a priori that the proportion is in the ballpark of 1/2. You could refine this to a more accurate estimate with Monte Carlo, throwing darts at the box and seeing how many land in the region of interest.

But in practice, the regions we want to estimate are small relative to any box we’d put around them. For example, suppose you want to estimate the volume of a unit ball in *n* dimensions by throwing darts at the unit cube in *n* dimensions. When *n* = 10, only about 1 dart in 400 will land in the ball. When *n* = 100, only one dart out of 10^{70} will land inside the ball. (See more here.) It’s very likely you’d never have a dart land inside the ball, no matter how much computing power you used.
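A small simulation (a sketch, not from the post) illustrates the *n* = 10 case: only about 1 dart in 400 lands inside the ball.

```python
import math
import random

random.seed(42)
n, N = 10, 200_000

# Throw darts at the cube [-1, 1]^10 and count how many land in the unit ball.
hits = sum(
    1 for _ in range(N)
    if sum(random.uniform(-1, 1)**2 for _ in range(n)) < 1
)
estimate = hits / N

# Exact proportion: volume of the unit 10-ball, pi^5 / 5!, over the cube's volume 2^10.
exact = math.pi**5 / math.factorial(5) / 2**n
print(estimate, exact)  # both around 0.0025, i.e. about 1 dart in 400
```

Running the same loop with *n* = 100 would essentially never record a hit, which is the point of the discussion below.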

If no darts land inside the region of interest, then you would estimate the volume of the region of interest to be zero. Is this a problem? Yes and no. The volume is very small, and the **absolute** error in estimating a small number to be zero is small. But the **relative** error is 100%.

(For an analogous problem, see this discussion about the approximation sin(θ) = θ for small angles. It’s not just saying that one small number is approximately equal to another small number: all small numbers are approximately equal in absolute error! The small angle approximation has small *relative* error.)

If you want a small *relative* error in Monte Carlo integration, and you usually do, then you need to come up with a procedure that puts a larger proportion of integration points in the region of interest. One such technique is importance sampling, so called because it puts more samples in the important region. The closer the importance sampler fits the region of interest, the better the results will be. But you may not know enough a priori to create an efficient importance sampler.

So while the absolute accuracy of Monte Carlo integration does not depend on dimension, the problems you want to solve with Monte Carlo methods typically get harder with dimension.

Integrating a polynomial in several variables over a ball or sphere is easy. For example, take the polynomial *xy*² + 5*x*²*z*² in three variables. The integral of the first term, *xy*², is zero. If any variable in a term has an odd exponent, then the integral of that term is zero by symmetry. The integral over half of the sphere (ball) will cancel out the integral over the opposite half of the sphere (ball). So we only need to be concerned with terms like 5*x*²*z*².

Now in *n* dimensions, suppose the exponents of *x*_{1}, *x*_{2}, …, *x*_{n} are *a*_{1}, *a*_{2}, …, *a*_{n} respectively. If any of the *a*‘s are odd, the integral over the sphere or ball will be zero, so we **assume all the a‘s are even**. In that case the integral over the unit **sphere** is

2 *B*(*b*_{1}, *b*_{2}, …, *b*_{n})

where

*B*(*b*_{1}, *b*_{2}, …, *b*_{n}) = Γ(*b*_{1}) Γ(*b*_{2}) ⋯ Γ(*b*_{n}) / Γ(*b*_{1} + *b*_{2} + ⋯ + *b*_{n})

is the **multivariate beta function** and for each *i* we define *b*_{i} = (*a*_{i} + 1)/2. When *n* = 2 then *B* is the (ordinary) beta function.

Note that the integral over the unit sphere doesn’t depend on the dimension of the sphere.

The integral over the unit **ball** is

2 *B*(*b*_{1}, *b*_{2}, …, *b*_{n}) / (*a*_{1} + *a*_{2} + ⋯ + *a*_{n} + *n*)

which is proportional to the integral over the sphere, where the proportionality constant depends on the sum of the exponents (the original exponents, the *a*‘s, not the *b*‘s) and the dimension *n*.

Note that if we integrate the constant polynomial 1 over the unit sphere, we get the surface area of the unit sphere, and if we integrate it over the unit ball, we get the volume of the unit ball.
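Here is a Python sketch of these formulas (the function names are mine). It checks the surface area and volume just mentioned, and the 5*x*²*z*² term from the opening example.

```python
from math import gamma, pi, prod

def sphere_integral(a):
    """Integral of x1^a1 * ... * xn^an over the unit sphere in R^n (all ai even)."""
    b = [(ai + 1) / 2 for ai in a]
    return 2 * prod(gamma(bi) for bi in b) / gamma(sum(b))

def ball_integral(a):
    """Integral of the same monomial over the unit ball in R^n."""
    return sphere_integral(a) / (sum(a) + len(a))

# Integrating the constant polynomial 1 recovers the surface area 4*pi of the
# unit sphere and the volume 4*pi/3 of the unit ball in three dimensions.
print(sphere_integral((0, 0, 0)))  # ≈ 12.566
print(ball_integral((0, 0, 0)))    # ≈ 4.189

# The term 5 x^2 z^2 from the opening example, integrated over the unit ball.
print(5 * ball_integral((2, 0, 2)))  # ≈ 0.598, i.e. 4*pi/21
```

`math.prod` requires Python 3.8 or later.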

You can find a derivation for the integral results above in [1]. The proof is basically Liouville’s trick for integrating the normal distribution density, but backward. Instead of going from rectangular to polar coordinates, you introduce a normal density and go from polar to rectangular coordinates.

[1] Gerald B. Folland, How to Integrate a Polynomial over a Sphere. The American Mathematical Monthly, Vol. 108, No. 5 (May, 2001), pp. 446-448.

For any random variable *X* with 0 mean, or negative mean, there’s an inequality that bounds the 3rd moment, *m*_{3}, in terms of the 4th moment, *m*_{4}:

*m*_{3} ≤ (4/27)^{1/4} *m*_{4}^{3/4}

The following example shows that this bound is the best possible.

Define

*u* = (1 − √3)/√2

*v* = (1 + √3)/√2

*p* = (3 + √3)/6

and suppose *X* = *u* with probability *p* and *X* = *v* with probability *q* = 1 – *p*. Then *X* has mean 0, and you can show that exact equality holds in the inequality above.

Here’s some Mathematica code to verify this claim.

u = (1 - Sqrt[3])/Sqrt[2]
v = (1 + Sqrt[3])/Sqrt[2]
p = (Sqrt[3] + 1)/(2 Sqrt[3])
q = 1 - p
m3 = p u^3 + q v^3
m4 = p u^4 + q v^4
Simplify[(4/27)^(1/4) m4^(3/4) / m3]

As hoped, the code returns 1.
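The same check in Python, using floating point rather than symbolic arithmetic, so equality holds up to rounding:

```python
import math

u = (1 - math.sqrt(3)) / math.sqrt(2)
v = (1 + math.sqrt(3)) / math.sqrt(2)
p = (3 + math.sqrt(3)) / 6
q = 1 - p

mean = p*u + q*v                       # 0, up to rounding
m3 = p*u**3 + q*v**3                   # works out to sqrt(2)
m4 = p*u**4 + q*v**4                   # works out to 3
ratio = (4/27)**0.25 * m4**0.75 / m3   # 1, up to rounding
print(mean, ratio)
```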

Reference: Iosif Pinelis, Relations Between the First Four Moments, American Mathematical Monthly, Vol 122, No 5, pp 479-481.

`\underbrace`. It does what it sounds like it does: it puts a brace under its argument.

I used this a few days ago in the post on the new prime record when I wanted to show that the record prime is written in hexadecimal as a 1 followed by a long string of Fs.

The code that produced it is

1\underbrace{\mbox{FFF \ldots FFF}}_\mbox{{\normalsize 19,308,229 F's}}

The sizing is a little confusing. Without `\normalsize` the text under the brace would be as large as the text above.

I’ve written before about the syntax of Emacs regular expressions. It’s a pretty conservative subset of the features you may be used to from other environments as summarized in the diagram below.

But there are many, many ways to use regular expressions in Emacs. I did a quick search and found that about 15% of the pages in the massive Emacs manual contain at least one reference to regular expressions. Exhaustively listing the uses of regular expressions would not be practical or very interesting. Instead, I’ll highlight a few uses that I find helpful.

One of the most frequently used features in Emacs is incremental search. You can search forward or backward for a string, searching as you type, with the commands `C-s` (`isearch-forward`) and `C-r` (`isearch-backward`). The regular expression counterparts of these commands are `C-M-s` (`isearch-forward-regexp`) and `C-M-r` (`isearch-backward-regexp`).

Note that the regular expression commands add the Alt (meta) key to their string counterparts. Also, note that Emacs consistently refers to regular expressions as `regexp` and never, as far as I know, as `regex`. (Emacs relies heavily on conventions like this to keep the code base manageable.)

A common task in any editor is to search and replace text. In Emacs you can replace all occurrences of a regular expression with `replace-regexp` or interactively choose which instances to replace with `query-replace-regexp`.

You can delete all lines in a file that contain a given regular expression with `flush-lines`. You can also invert this command, specifying which lines *not* to delete, with `keep-lines`.

One lesser-known but handy feature is `align-regexp`. This command will insert white space as needed so that all instances of a regular expression in a region align vertically. For example, if you have a sequence of assignment statements in a programming language, you could have all the equal signs line up by using `align-regexp` with the regular expression consisting simply of an equal sign. Of course you could also align based on a much more complex pattern.

Although this feature is primarily used when editing source code, you could also use it in other contexts, such as aligning poetry or ASCII art diagrams.

The Emacs directory editor `dired` is something like the Windows File Explorer or the OSX Finder, but text-based. `dired` has many features that use regular expressions. Here are a few of the more common ones.

You can mark files based on the file names with `% m` (`dired-mark-files-regexp`) or based on the contents of the files with `% g` (`dired-mark-files-containing-regexp`). You can also mark files for deletion with `% d` (`dired-flag-files-regexp`).

Inside `dired` you can search across a specified set of files by typing `A` (`dired-do-find-regexp`), and you can interactively search and replace across a set of files by typing `Q` (`dired-do-find-regexp-and-replace`).

The help apropos command (`C-h a`) can take a string or a regular expression.

The command to list available faces (`list-faces-display`) can take a string or a regular expression.

Interactive highlighting commands (`highlight-regexp`, `unhighlight-regexp`, `highlight-lines-matching-regexp`) take a regular expression argument.

You can use a regular expression to specify which buffers to close with `kill-matching-buffers`.

Maybe the largest class of uses for regular expressions in Emacs is configuration. Many customizations in Emacs, such as giving Emacs hints to determine the right editing mode for a file or how to recognize comments in different languages, use regular expressions as arguments.

You can find more posts on regular expressions and on Emacs by going to my technical notes page. Note that the outline at the top has links for regular expressions and for Emacs.

For daily tips on regular expressions or Unix-native tools like Emacs, follow @RegexTip and @UnixToolTip on Twitter.

Michael Spivak put references to “yellow pig” in some of his books. I’ve heard that he put allusions to yellow pigs in *all* his books, but I don’t have all his books, and I haven’t been able to find yellow pigs in two of his books that I do own.

Spivak’s calculus text is dedicated to the memory of Y. P.

If you look up yellow pig in the index of the book, it will take you to a page that makes a passing reference to “whole hog.”

Spivak’s publishing company, Publish or Perish Press, originally used a pig as part of its logo.

The web site now has no logo. His most recent book, Physics for Mathematicians: Mechanics I, uses a different logo.

The cover of Spivak’s Differential Geometry, Volume 1, second edition, has two yellow drawings of a pig.

If you look up yellow pig in the index, it takes you to a page that doesn’t mention pigs, but does include a drawing that looks something like a ham.

I do not see a reference to yellow pig in Spivak’s first book, Calculus on Manifolds. It was published by Benjamin Cummings. Maybe they would not allow Easter eggs, or maybe the idea of including Easter eggs didn’t occur to Spivak until he had his own publishing company. I also do not see a reference to yellow pigs in his recent Physics for Mathematicians book.

**Summary of books mentioned above**:

Not all those tensor components are unique. Assuming our function *f* is smooth, the order in which partial derivatives are taken doesn’t matter. It only matters which variables you differentiate with respect to and how many times.

There is a potentially unique component of the *k*th order derivative for every choice of *k* items (the variables you’re differentiating with respect to) from a set of *n* items (the variables) with replacement (because you can differentiate with respect to the same variable multiple times). Richard Stanley introduced the notation ((*n* choose *k*)), a binomial coefficient written with doubled parentheses, for this quantity, and it turns out that

((*n* choose *k*)) = (*n* + *k* − 1 choose *k*).

See Counting selections with replacement.

If *n* or *k* are large, this is a very large number. Suppose you have a function of *n* = 1,000 variables and you want to take the 5th derivative. The derivative has 10^{15} components, but “only” about 10^{13} distinct components. 8,416,958,750,200 to be exact. A floating point number (IEEE 754 double) takes up 8 bytes, so storing the distinct components would require 62,711 gigabytes. So you could store the distinct components if you had a thousand computers with 64 GB of RAM each.
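The arithmetic above is easy to reproduce (`math.comb` computes the binomial coefficient):

```python
import math

n, k = 1000, 5

total = n**k                        # all k-th order components: 10^15
distinct = math.comb(n + k - 1, k)  # selections with replacement
gigabytes = 8 * distinct / 2**30    # 8 bytes per IEEE 754 double

print(distinct)          # 8416958750200
print(round(gigabytes))  # about 62,711 GB
```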

An application that depends on high-order derivatives of functions of many variables has to exploit some structure, such as sparsity or symmetry, and not directly compute and store the derivatives.

By the way, if both *n* and *k* are large, the approximate number of distinct components estimated via Stirling’s approximation is

√( (*m* + *k*) / (2π *m* *k*) ) (*m* + *k*)^{m+k} / ( *m*^{m} *k*^{k} )

where *m* = *n* − 1.

***

[1] The value of the function is a real number, a scalar, a 0th order tensor. The first derivative, the gradient, is a vector, a 1st order tensor. The second derivative, the Hessian, is a matrix, a 2nd order tensor. There aren’t more common names for tensors of order three and higher. You could think of a 3rd order tensor as a three dimensional box of numbers, a stack of matrices.

]]>