- Constant functions are PR.
- The function that picks the *i*th element of a list of *n* arguments is PR.
- The successor function *S*(*n*) = *n* + 1 is PR.
- PR functions are closed under composition.
- PR functions are closed under **primitive recursion**.

The last axiom obviously gives PR functions their name. So what is primitive recursion? Given a PR function *f* that takes *k* arguments, and another PR function *g* that takes *k*+2 arguments, the primitive recursion of *f* and *g* is a function *h* of *k*+1 arguments satisfying two properties:

*h*(0, *x*_{1}, …, *x*_{k}) = *f*(*x*_{1}, …, *x*_{k})

*h*(*S*(*y*), *x*_{1}, …, *x*_{k}) = *g*(*y*, *h*(*y*, *x*_{1}, …, *x*_{k}), *x*_{1}, …, *x*_{k})
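Primitive recursion is easy to implement. Here's a short Python sketch (the function names are mine, not standard): the recursion above unrolls into a loop, and familiar functions like addition and multiplication fall out as special cases.

```python
def primitive_recursion(f, g):
    """Return h with h(0, *xs) = f(*xs) and
    h(S(y), *xs) = g(y, h(y, *xs), *xs)."""
    def h(y, *xs):
        acc = f(*xs)
        for i in range(y):
            acc = g(i, acc, *xs)
        return acc
    return h

# Addition from the successor function:
# add(0, x) = x and add(S(y), x) = S(add(y, x))
add = primitive_recursion(lambda x: x, lambda y, r, x: r + 1)

# Multiplication from addition:
# mul(0, x) = 0 and mul(S(y), x) = add(x, mul(y, x))
mul = primitive_recursion(lambda x: 0, lambda y, r, x: add(x, r))
```

For example, `add(3, 4)` evaluates to 7 and `mul(3, 4)` to 12.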

Not every computable function is primitive recursive. For example, Ackermann’s function is a **general recursive function**, but not a primitive recursive function. General recursive functions are Turing complete. Turing machines, lambda calculus, and general recursive functions are all equivalent models of computation, but the first two are better known.
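For reference, here's Ackermann's function in Python, a direct transcription of its usual definition. It's total and computable, but it grows faster than any primitive recursive function, which is how one shows it isn't primitive recursive.

```python
import sys
sys.setrecursionlimit(10000)  # the recursion gets deep quickly

def ackermann(m, n):
    """Ackermann's function: computable, but not primitive recursive."""
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))
```

Even tiny arguments produce enormous values: `ackermann(3, 3)` is 61, and `ackermann(4, 2)` already has 19,729 digits.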

For this post, the main thing about general recursive functions is that, as the name implies, they are more general than primitive recursive functions.

Now we switch from functions to sets. The **characteristic function** of a set *A* is the function that is 1 for elements of *A* and zero everywhere else. In other areas of math, there is a sort of duality between functions and sets via characteristic functions. For example, the indicator function of a measurable set is a measurable function. And the indicator function of a convex set is a convex function. But in recursive functions, there’s an unexpected wrinkle in this analogy.

A set of integers is **recursively enumerable** if it is either empty or the image of a **general** recursive function. But there’s a theorem, due to Alonzo Church, that a non-empty recursively enumerable set is actually the image of a **primitive** recursive function. So although general recursive functions are more general, you can’t tell that from looking at function images. For example, although the Ackermann function is not primitive recursive, there is a primitive recursive function with the same image.

In finite dimensional Euclidean space, linear functions are continuous. You can put a different norm on a Euclidean space than the one it naturally comes with, but all norms give rise to the same topology and hence the same continuous functions. (This is useful in numerical analysis where you’d like to look at a variety of norms. The norms give different analytical results, but they’re all topologically equivalent.)

In an infinite dimensional normed space, linear functions are not necessarily continuous. If the dimension of a space is only a trillion, all linear functions are continuous, but when you jump from high dimension to infinite dimension, you can have discontinuous linear functions. But if you look at this more carefully, the change isn't really so sudden.

If a linear function is discontinuous, its finite dimensional approximations are continuous, but the **degree** of continuity is degrading as dimension increases. For example, suppose a linear function stretches the *n*th basis vector by a factor of *n*. The bigger *n* gets, the more the function stretches in the *n*th dimension. As long as *n* is bounded, this is continuous, but in a sense it is less continuous as *n* increases. The fact that the infinite dimensional version is discontinuous tells you that the finite dimensional versions, while technically continuous, scale poorly with dimension. (See practical continuity for more discussion along these lines.)
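Here's a small numpy sketch of the stretching operator described above (my own illustration). Each finite dimensional truncation is continuous, with operator norm equal to the dimension, so the norms grow without bound as the dimension increases.

```python
import numpy as np

def stretch(x):
    """Stretch the n-th coordinate of x by a factor of n (1-indexed)."""
    return np.arange(1, len(x) + 1) * x

def truncation_norm(dim):
    """Operator norm of the dim-dimensional truncation of the stretch
    operator: the largest stretch factor, attained at the last basis
    vector."""
    e = np.zeros(dim)
    e[-1] = 1.0  # the basis vector that is stretched the most
    return np.linalg.norm(stretch(e))
```

As `dim` grows, `truncation_norm(dim)` grows linearly. That's the quantitative sense in which the truncations, while continuous, become "less continuous."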

A Banach space is a *complete* normed linear space. Finite dimensional normed spaces are always complete (i.e. every Cauchy sequence in the space converges to a point in the space) but this might not happen in infinite dimensions.

In basic linear algebra, the dual of a vector space *V* is the space of linear functionals on *V*, i.e. the set of linear maps from *V* to the reals. This space is denoted *V*^{*}. If *V* has dimension *n*, *V*^{*} has dimension *n*, and all *n*-dimensional spaces are isomorphic, so the distinction between a space and its dual seems pedantic. But in general a Banach space and its dual are not isomorphic and so it's easier to tell them apart.

The second dual of a vector space, *V*^{**} is the dual of the dual space. In finite dimensional spaces, *V*^{**} is naturally isomorphic to *V*. In Banach spaces, *V* is isomorphic to a *subset* of *V*^{**}. And even when *V* is isomorphic to *V*^{**}, it might not be *naturally* isomorphic to *V*^{**}. (Here “natural” means natural in the category theory sense of natural transformations.)

*S*^{2} = *A*^{2} + *B*^{2} + *C*^{2}.

There’s an elegant proof of this theorem here using differential forms. Below I’ll sketch a less elegant but more elementary proof.

You could prove the identity above by using the fact that the area of a triangle spanned by two vectors is half the length of their cross product. Suppose *a*, *b*, and *c* are the locations of the three corners of *T*. Then

*S*^{2} = *v*^{2}/4

where

*v* = (*a* – *b*) × (*c* – *b*)

and by *v*^{2} we mean the dot product of *v* with itself.

Write out the components of *v*^{2}/4 and you get three squared terms. Notice that when you set the *x* components to zero, i.e. project onto the *yz* plane, the first of the three terms is unchanged and the other two are zero. In other words, the first of the three terms of *v*^{2}/4 is *A*^{2}. A similar argument shows that the other two terms are *B*^{2} and *C*^{2}.
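Here's a quick numerical check of the identity, a sketch using randomly chosen vertices. The argument above doesn't use any special property of the triangle, so the identity holds for any triangle in space.

```python
import numpy as np

def tri_area(a, b, c):
    """Area of the triangle with vertices a, b, c: half the length of
    the cross product of two edge vectors."""
    return 0.5 * np.linalg.norm(np.cross(a - b, c - b))

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 3))  # a random triangle in space

S = tri_area(a, b, c)
# Areas of the projections onto the yz, xz, and xy coordinate planes,
# obtained by zeroing out the x, y, or z components respectively
A = tri_area(*(v * np.array([0, 1, 1]) for v in (a, b, c)))
B = tri_area(*(v * np.array([1, 0, 1]) for v in (a, b, c)))
C = tri_area(*(v * np.array([1, 1, 0]) for v in (a, b, c)))
# S**2 agrees with A**2 + B**2 + C**2 up to rounding error
```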


… it was a question of Amistics, which was a term that had been coined ages ago by a Moiran anthropologist to talk about the choices that different cultures made as to which technologies they would, and would not, make part of their lives. The word went all the way back to the Amish people … All cultures did this, frequently without being consciously aware that they had made collective choices.

Related post by Kevin Kelly: Amish Hackers

It turns out there’s a remarkable closed-form solution:

Here are some questions you may have.

*But what if n and m are both odd? You can’t tile such a board with dominoes.*

Yes, in that case the formula evaluates to zero.

*Do you need an absolute value somewhere? Or a floor or ceiling?*

No. It looks like the double product could be a general complex number, but it’s real. In fact, it’s always the square of an integer.

**Update**: Apparently the formula *does* need an absolute value, not to turn complex values into real values but to turn negative integers into positive ones. See Aaron Meurer’s example below.

*Does it work numerically?*

Apparently so. If you evaluated the product in a package that could symbolically manipulate the cosines, the result would be exact. In floating point it cannot be, but at least in my experiments the result is correct when rounded to the nearest integer. For example, there are 12,988,816 ways to tile a standard 8 by 8 chessboard with dominoes, and the following Python script returns 12988816.0. For sufficiently large arguments the result will not always round to the correct answer, but for moderate-sized arguments it should.

```python
from numpy import pi, cos, sqrt

def num_tilings(m, n):
    prod = 1
    for k in range(1, m+1):
        for l in range(1, n+1):
            prod *= 2*cos(pi*k/(m+1)) + 2j*cos(pi*l/(n+1))
    return sqrt(abs(prod))

print(num_tilings(8, 8))
```

*The code looks wrong. Shouldn’t the ranges go up to m and n?*

No, Python ranges are half-open intervals. `range(a, b)` goes from `a` to `b-1`. That looks unnecessarily complicated, but it makes some things easier.

*You said that there was no need for absolute values, but your code has one.*

Yes, because while in theory the imaginary part will be exactly zero, in floating point arithmetic the imaginary part might be small but not zero.

*Where did you find this formula?*

Thirty-three Miniatures: Mathematical and Algorithmic Applications of Linear Algebra

Amplitude modulation multiplies a carrier signal by

1 + *d* sin(2π *f t*)

where *d* is the modulation depth, *f* is the modulation frequency, and *t* is time.

Here are some examples you can listen to. We use a pure 1000 Hz tone and Gaussian white noise as carriers, and vary modulation depth and frequency continuously over 10 seconds. The modulation depth example varies depth from 0 to 1. Modulation frequency varies from 0 to 120 Hz.
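If you'd like to generate signals like these yourself, here's a minimal numpy sketch of amplitude modulation applied to a pure tone. The parameter names are mine; writing the result to a WAV file (e.g. with `scipy.io.wavfile.write`) is left out.

```python
import numpy as np

def am_tone(carrier_freq=1000, mod_freq=60, depth=0.5,
            duration=1.0, sample_rate=44100):
    """Multiply a pure tone by the envelope 1 + d sin(2 pi f t)."""
    t = np.linspace(0, duration, int(duration * sample_rate), endpoint=False)
    carrier = np.sin(2 * np.pi * carrier_freq * t)
    envelope = 1 + depth * np.sin(2 * np.pi * mod_freq * t)
    return t, envelope * carrier

t, signal = am_tone()
```

With depth 0.5 the envelope swings between 0.5 and 1.5, so the modulated signal never exceeds 1.5 in absolute value.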

First, here’s a pure tone with increasing modulation depth.

Next we vary the modulation frequency.

Now we switch over to Gaussian white noise, first varying depth.

And finally white noise with varying modulation frequency. This one sounds like a prop-driven airplane taking off.

**Related**: Psychoacoustics consulting

In linear algebra, vector spaces have a field of scalars. This is where the coefficients in linear combinations come from. For real vector spaces, the scalars are real numbers. For complex vector spaces, the scalars are complex numbers. For vector spaces over any field *K*, the elements of *K* are called scalars.

But there is a more restrictive use of *scalar* in tensor calculus. There a scalar is **not just a number, but a number whose value does not depend on one’s choice of coordinates**. For example, the temperature at some location is a scalar, but the first coordinate of a location depends on your choice of coordinate system. Temperature is a scalar, but *x*-coordinate is not. Scalars are numbers, but not all numbers are scalars.

The linear algebraic use of *scalar* is more common among mathematicians, the coordinate-invariant use among physicists. The two uses of *scalar* are a special case of the two uses of *tensor* described in the previous post. Linear algebra thinks of tensors simply as things that take in vectors and return numbers. The physics/tensor analysis view of tensors includes behavior under changes of coordinates. You can think of a scalar as a zeroth order tensor, one that behaves as simply as possible under a change of coordinates, i.e. doesn’t change at all.

In the second post we said that a tensor is a multilinear functional. A *k*-tensor takes *k* vectors and returns a number, and it is linear in each argument if you hold the rest constant. We mentioned that this relates to the “box of numbers” idea of a tensor. You can describe how a *k*-tensor acts by writing out *k* nested sums. The terms in these sums are called the **components** of the tensor.

Tensors are usually defined in a way that has more structure. They vary from point to point in a space, and they do so in a way that in some sense is independent of the coordinates used to label these points. At each point you have a tensor in the sense of a multilinear functional, but the emphasis is usually on the changes of coordinates.

Tensors in the sense that we’re talking about here come in two flavors: covariant and contravariant. They also come in mixtures; more on that later.

We consider two coordinate systems, one denoted by *x*’s and another by *x*’s with bars on top. The components of a tensor in the *x*-bar coordinate system will also have bars on top. For a covariant tensor of order one, the components satisfy

\[ \bar{T}_i = \frac{\partial x^r}{\partial \bar{x}^i} \, T_r \]

First of all, coordinates are written with superscripts. So *x*^{r} is the *r*th coordinate, not *x* to the power *r*. Also, this uses Einstein summation notation: there is an implicit sum over repeated indexes, in this case *r*.

The components of a contravariant tensor of order one satisfy a similar but different equation:

\[ \bar{T}^i = \frac{\partial \bar{x}^i}{\partial x^r} \, T^r \]

The components of a covariant tensor are written with subscripts, and the components of a contravariant tensor with superscripts. In the equation for covariant components, the partial derivatives are with respect to the new coordinates, the *x* bars. In the equation for contravariant components, the partial derivatives are with respect to the original coordinates, the *x*‘s. Mnemonic: when the indexes go down (covariant tensors) the new coordinates go down (in the partial derivatives). When the indexes go up, the new coordinates go up.

For covariant tensors of order two, the change of coordinates formula is

\[ \bar{T}_{ij} = \frac{\partial x^r}{\partial \bar{x}^i} \frac{\partial x^s}{\partial \bar{x}^j} \, T_{rs} \]

Here the summation convention says that there are two implicit sums, one over *r* and one over *s*.

The contravariant counterpart says

\[ \bar{T}^{ij} = \frac{\partial \bar{x}^i}{\partial x^r} \frac{\partial \bar{x}^j}{\partial x^s} \, T^{rs} \]

In general you could have tensors that are a mixture of covariant and contravariant. A tensor with covariant order *p* and contravariant order *q* has *p* subscripts and *q* superscripts. The partial derivatives have *x*-bars on bottom corresponding to the covariant components and *x*-bars on top corresponding to contravariant components.

We initially said a tensor was a multilinear functional. A tensor of order *k* takes *k* vectors and returns a number. Now we’d like to refine that definition to take two kinds of vectors. A tensor with covariant order *p* and contravariant order *q* takes *p* contravariant vectors and *q* covariant vectors. In linear algebra terms, instead of simply taking *k* elements of a vector space *V*, we say our tensor takes *p* vectors from *V* and *q* vectors from the dual space *V*^{*}.

You may be familiar with the terms *covariant* and *contravariant* from category theory, or its application to object oriented programming. The terms are related. As Michael Spivak explains, “It’s very easy to remember which kind of vector field is covariant, and which is contravariant — it’s just the opposite of what it logically ought to be [from category theory].”

For example, you may have seen tensor product splines. Suppose you have a function over a rectangle that you’d like to approximate by patching together polynomials so that the interpolation has the specified values at grid points, and the patches fit together smoothly. In one dimension, you do this by constructing splines. Then you can bootstrap your way from one dimension to two dimensions by using tensor product splines. A tensor product spline in *x* and *y* is a sum of terms consisting of a spline in *x* and a spline in *y*. Notice that a tensor product spline is not simply a product of two ordinary splines, but a *sum* of such products.

If *X* is the vector space of all splines in the *x*-direction and *Y* the space of all splines in the *y*-direction, the space of tensor product splines is the tensor product of the spaces *X* and *Y*. Suppose a set *s*_{i}, for *i* running from 1 to *n*, is a basis for *X*. Similarly, suppose *t*_{j}, for *j* running from 1 to *m*, is a basis for *Y*. Then the products *s*_{i} *t*_{j} form a basis for the tensor product of *X* and *Y*, the tensor product splines over the rectangle. Notice that if *X* has dimension *n* and *Y* has dimension *m* then their tensor product has dimension *nm*. Notice that if we only allowed products of splines, not sums of products of splines, we’d get a much smaller space, one of dimension *n*+*m*.
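Here's a toy Python illustration of the dimension count, using monomials as stand-ins for splines to keep the code short. The tensor product basis consists of all pairwise products of basis functions, so its size is the product of the two dimensions.

```python
n, m = 4, 3
s = [lambda x, i=i: x**i for i in range(n)]  # basis for X (stand-in "splines")
t = [lambda y, j=j: y**j for j in range(m)]  # basis for Y

# The tensor product basis: all products s_i(x) t_j(y)
basis = [lambda x, y, f=f, g=g: f(x) * g(y) for f in s for g in t]
assert len(basis) == n * m  # dimension multiplies: 4 * 3 = 12, not 4 + 3
```

A general tensor product spline is then a linear combination of all `n * m` of these product functions, which is why it is a *sum* of products rather than a single product.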

We can use the same process to define the tensor product of any two vector spaces. A basis for the tensor product is all products of basis elements in one space and basis elements in the other. There’s a more general definition of tensor products that doesn’t involve bases sketched below.

You can also define tensor products of *modules*, a generalization of vector spaces. You could think of a module as a vector space where the scalars come from a ring instead of a field. Since rings are more general than fields, modules are more general than vector spaces.

The tensor product of two modules over a commutative ring is defined by taking the Cartesian product and modding out by the necessary relations to make things bilinear. (This description is very hand-wavy. A detailed presentation needs its own blog post or two.)

Tensor products of modules hold some surprises. For example, let *m* and *n* be two relatively prime integers. You can think of the integers mod *m* or *n* as a module over the integers. The tensor product of these modules is zero because you end up modding out by everything: since *am* + *bn* = 1 for some integers *a* and *b*, every simple tensor *x* ⊗ *y* equals (*am* + *bn*)(*x* ⊗ *y*) = *a*(*mx* ⊗ *y*) + *b*(*x* ⊗ *ny*) = 0. This kind of collapse doesn’t happen over vector spaces.

The first two posts in this series:

I plan to leave the algebraic perspective aside for a while, though as I mentioned above there’s more to come back to.

Next I plan to write about the analytic/geometric view of tensors. Here we get into things like changes of coordinates and it looks at first as if a tensor is something completely different.

A dot product is an example of a tensor. It takes two vectors and returns a number. And it’s linear in each argument. Suppose you have vectors *u*, *v*, and *w*, and a real number *a*. Then the dot product (*u* + *v*, *w*) equals (*u*, *w*) + (*v*, *w*) and (*au*, *w*) = *a*(*u*, *w*). This shows that the dot product is linear in its first argument, and you can show similarly that it is linear in the second argument.

Determinants are also tensors. You can think of the determinant of an *n* by *n* matrix as a function of its *n* rows (or columns). This function is linear in each argument, so it is a tensor.

The introduction to this series mentioned the interpretation of tensors as a box of numbers: a matrix, a cube, etc. This is consistent with our definition because you can write a multilinear functional as a sum. For every vector that a tensor takes in, there is an index to sum over. A tensor taking *n* vectors as arguments can be written as *n* nested summations. You could think of the coefficients of this sum being spread out in space, each index corresponding to a dimension.
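To make the nested-sum picture concrete, here's a small Python example: a 2-tensor stored as a matrix of components, applied to two vectors by summing over both indices.

```python
import numpy as np

# Components of a 2-tensor (a bilinear functional) stored as a matrix
T = np.array([[1.0, 2.0],
              [3.0, 4.0]])

def apply_tensor(T, u, v):
    """Evaluate the bilinear functional as a doubly nested sum."""
    return sum(T[i, j] * u[i] * v[j]
               for i in range(T.shape[0]) for j in range(T.shape[1]))

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
# The nested sum agrees with the matrix expression u^T T v
assert apply_tensor(T, u, v) == u @ T @ v
```

A 3-tensor would need a cube of components and a triply nested sum, and so on, one index per argument.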

Tensor products are simple in this context as well. If you have a tensor *S* that takes *m* vectors at a time, and another tensor *T* that takes *n* vectors at a time, you can create a tensor that takes *m* + *n* vectors by sending the first *m* of them to *S*, the rest to *T*, and multiply the results. That’s the tensor product of *S* and *T*.
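In code, this construction is nearly a one-liner. Here's a sketch, using the dot product as both *S* and *T*; the argument `m` tells the product how many vectors to route to *S*.

```python
import numpy as np

def tensor_product(S, m, T):
    """Tensor product of S (which takes m vectors) and T: send the first
    m vectors to S, the rest to T, and multiply the results."""
    def ST(*vecs):
        return S(*vecs[:m]) * T(*vecs[m:])
    return ST

dot = lambda u, v: float(np.dot(u, v))  # a 2-tensor
ST = tensor_product(dot, 2, dot)        # a 4-tensor

u = np.array([1.0, 0.0])
v = np.array([2.0, 0.0])
assert ST(u, v, u, v) == dot(u, v) * dot(u, v)
```

Since `S` and `T` are each linear in every argument, the product is too, so `ST` really is a 4-tensor.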

The discussion above about tensors and tensor products still leaves a lot of questions unanswered. We haven’t considered the most general definition of tensor or tensor product. And we haven’t said anything about how tensors arise in application, what they have to do with geometry or changes of coordinates. I plan to address these issues in future posts. I also plan to write about other things in between posts on tensors.

**Next post in series**: Tensor products


The word “tensor” is shrouded in mystery. The same term is applied to many different things that don’t appear to have much in common with each other.

You might have heard that a tensor is a box of numbers. Just as a matrix is a rectangular collection of numbers, a tensor could be a cube of numbers or even some higher-dimensional set of numbers.

You might also have heard that a tensor is something that has upper and lower indices, such as the Riemann tensor above, things that have arcane manipulation rules such as “Einstein summation.”

Or you might have heard that a tensor is something that changes coordinates like a tensor. A tensor is as a tensor does. Something that behaves the right way under certain changes of variables is a tensor.

And then there are things that aren’t called tensors, but they have tensor *products*. These seem simple enough in some cases—you think “I didn’t realize that has a name. So it’s called a tensor product. Good to know.” But then in other cases tensor products seem more elusive. If you look in an advanced algebra book hoping for a simple definition of a tensor product, you might be disappointed and feel like the book is being evasive or even poetic because it describes what a tensor product *does* rather than what it *is*. That is, the definition is behavioral rather than constructive.

What do all these different uses of the word “tensor” have to do with each other? Do they have anything to do with the TensorFlow machine learning library that Google released recently? That’s something I’d like to explore over a series of blog posts.

**Next posts in the series**:

The above quote makes me think of a connection Fourier made between triangles and thermodynamics.

Trigonometric functions were first studied because they relate angles in a right triangle to ratios of the lengths of the triangle’s sides. For the most basic applications of trigonometry, it only makes sense to consider positive angles smaller than a right angle. Then somewhere along the way someone discovered that it’s convenient to define trig functions for any angle.

Once you define trig functions for any angle, you begin to think of these functions as being associated with *circles* rather than triangles. More advanced math books refer to trig functions as *circular* functions. The triangles fade into the background. They’re still there, but they’re drawn inside a circle. (Hyperbolic functions are associated with hyperbolas the same way circular functions are associated with circles.)

Now we have functions that historically arose from studying triangles, but they’re defined on the whole real line. And we ask the kinds of questions about them that we ask about other functions. How fast do they change from point to point? How fast does their rate of change change? And here we find something remarkable. The rate of change of a sine function is proportional to a cosine function and vice versa. And if we look at the rate of change of the rate of change (the second derivative or acceleration), sine functions yield more sine functions and cosine functions yield more cosine functions. In more sophisticated language, sines and cosines are eigenfunctions of the second derivative operator.

Here’s where thermodynamics comes in. You can use basic physics to derive an equation for describing how heat in some object varies over time and location. This equation is called, surprisingly enough, the heat equation. It relates second derivatives of heat in space with first derivatives in time.

Fourier noticed that the heat equation would be easy to solve if only he could work with functions that behave very nicely with regard to second derivatives, i.e. sines and cosines! If only everything were sines and cosines. For example, the temperature in a thin rod over time is easy to determine if the initial temperature distribution is given by a sine wave. Interesting, but not practical.

However, the initial distribution doesn’t have to be a sine, or a cosine. We can still solve the heat equation if the initial distribution is a sum of sines. And if the initial distribution is approximately a sum of sines and cosines, then we can compute an approximate solution to the heat equation. So what functions are approximately a sum of sines and cosines? All of them!
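Here's a small numpy sketch of that solution process for a rod on [0, π] with its ends held at temperature zero, with the diffusivity constant set to 1 for simplicity. Each sine mode in the initial temperature decays independently, the *n*th mode by a factor of exp(−*n*²*t*).

```python
import numpy as np

def heat_solution(coeffs, x, t):
    """Temperature u(x, t) on [0, pi], ends held at zero, given the sine
    series of the initial temperature u(x, 0) = sum_n coeffs[n-1] sin(n x).
    The n-th mode decays by exp(-n^2 t)."""
    u = np.zeros_like(x)
    for n, b in enumerate(coeffs, start=1):
        u += b * np.exp(-n**2 * t) * np.sin(n * x)
    return u

x = np.linspace(0, np.pi, 200)
# Initial temperature sin(x) + 0.5 sin(3x); the third mode decays
# nine times as fast as the first
u = heat_solution([1.0, 0.0, 0.5], x, t=0.5)
```

The hard analytical work is all hidden in the first step: expanding the initial temperature in a sine series.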

Well, not quite all functions. But lots of functions. More functions than people originally thought. Pinning down exactly what functions can be approximated arbitrarily well by sums of sines and cosines (i.e. which functions have convergent Fourier series) was a major focus of 19th century mathematics.

So if someone asks what use they’ll ever have for trig identities, tell them they’re important if you want to solve the heat equation. That’s where I first used some of these trig identities often enough to remember them, and that’s a fairly common experience for people in math and engineering. Solving the heat equation reviews everything you learn in trigonometry, even though there are not necessarily any triangles or circles in sight.

]]>I find Shinichi Mochizuki’s proof of the abc conjecture fascinating. Not the content of the proof—which I do not understand in the least—but the circumstances of the proof. Most mathematics, no matter how arcane it appears to outsiders, is not that original. Experts in a specific area can judge just how much or how little a new paper adds to their body of knowledge, and usually it’s not much. But Mochizuki’s work is such a departure from the mainstream that experts have been trying to understand it for four years now.

Five days ago, Nature published an article headlined Monumental Proof to Torment Mathematicians for Years to Come.

… Kedlaya says that the more he delves into the proof, the longer he thinks it will take to reach a consensus on whether it is correct. He used to think that the issue would be resolved perhaps by 2017. “Now I’m thinking at least three years from now.”

Others are even less optimistic. “The constructions are generally clear, and many of the arguments could be followed to some extent, but the overarching strategy remains totally elusive for me,” says mathematician Vesselin Dimitrov of Yale University in New Haven, Connecticut. “Add to this the heavy, unprecedentedly indigestible notation: these papers are unlike anything that has ever appeared in the mathematical literature.”

But today, New Scientist has an article with the headline Mathematicians finally starting to understand epic ABC proof. According to this article,

At least 10 people now understand the theory in detail, says Fesenko, and the IUT papers have almost passed peer review so should be officially published in a journal in the next year or so.

It’s interesting that the proof is so poorly understood that experts disagree on just how well it is understood.

**Related posts**:

A function *f* is continuous if nearby points go to nearby points. A discontinuity occurs when some points are close together but their images are far apart, such as when you have a threshold effect. For example, suppose you pay $3 to ship packages that weigh less than a pound and $5 to ship packages that weigh a pound or more. Two packages can weigh almost the same, one a little less than a pound and one a little more, but not cost almost the same to ship. The difference in their shipping cost is $2, no matter how close together they are, as long as they’re on opposite sides of the one-pound threshold.

A practical notion of continuity has some idea of **resolution**. Suppose in our example that packages below one pound shipped for $3.00 and packages that weigh a pound or more ship for $3.05. You might say “I don’t care about differences of a nickel.” And so at that resolution, the shipping costs are continuous.

The key to understanding continuity is being precise about the two uses of “nearby” in the statement that a continuous function sends *nearby* points to *nearby* points. What do you mean by *nearby*? How close is close enough? In the pure mathematical definition of continuity, the answer is “as close as you want.” You specify any tolerance you like, *no matter how small*, and call it ε. If for any ε someone picks, it’s always possible to specify a small enough neighborhood around *x* that those points are mapped within ε of *f*(*x*), then *f* is continuous at *x*.

For applications, we modify this definition by putting a lower bound on ε. A function is continuous at *x*, for the purposes of a particular application, if for every ε **larger than the resolution of the problem**, you can find a neighborhood of *x* so small that all the points in that neighborhood are mapped within ε of *f*(*x*). In our shipping example, if you only care about differences in rates larger than $1, then if the rates change by $0.05 at the one-pound threshold, the rates are continuous as far as you’re concerned. But if the rates jump by $2 at one pound, then the rates are discontinuous for your purposes.
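Here's the shipping example in Python, with a crude test for continuity at a given resolution. This fixed-δ check is only an illustration, not the full ε-δ definition.

```python
def shipping_cost(pounds, jump):
    """$3.00 base rate, plus a jump at the one-pound threshold."""
    return 3.00 if pounds < 1 else 3.00 + jump

def practically_continuous_at(f, x, resolution, delta=1e-6):
    """Does f vary by less than the resolution we care about near x?"""
    return abs(f(x + delta) - f(x - delta)) < resolution

# At a $1 resolution, a $0.05 jump is continuous but a $2 jump is not
assert practically_continuous_at(lambda w: shipping_cost(w, 0.05), 1.0, 1.00)
assert not practically_continuous_at(lambda w: shipping_cost(w, 2.00), 1.0, 1.00)
```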

When you see a smooth curve on a high-resolution screen, it’s continuous as far as your vision is concerned. Nearby points on the curve go to points that are nearby as far as the resolution of your vision is concerned, even though strictly speaking the curve could have jump discontinuities at every pixel.

If you take a function that is continuous in the pure mathematical sense, then any multiple of that function is also continuous. If you make the function 10x larger, you just need to get closer to each point *x* to find a neighborhood so that all points get within ε of *f*(*x*). But in practical application, a multiple of a continuous function might not be continuous. If your resolution on shipping rates is $1, and the difference in cost between shipping a 15 ounce and a 17 ounce package is $0.05, then it’s continuous for you. But if the rates were suddenly 100x greater, now you care, because now the difference in cost is $5.

With the example of the curve on a monitor, if you were to zoom in on the image, at some point you’d see the individual pixels and so the image would no longer be continuous as far as your vision is concerned.

We’ve been focusing on what nearness means for the output, ε. Now let’s focus on nearness for the input. Introducing a restriction on ε lets us say some functions are continuous for a particular application that are not continuous in the pure mathematical sense. We can also introduce a restriction on the resolution of the input, call it δ, so that the opposite is true: some functions are continuous in the pure mathematical sense that are not continuous for a particular application.

The pure mathematical definition of continuity of *f* at *x* is that for every ε > 0, there exists a δ > 0 such that if |*x* – *y*| < δ, then |*f*(*x*) – *f*(*y*)| < ε. But how small does δ have to be? Maybe too small for application. Maybe points would have to be closer together than they can actually be in practice. If a function changes rapidly, but smoothly, then it’s continuous in the pure mathematical sense, but it may be discontinuous for practical purposes. The more rapidly the function changes, the smaller δ has to be for points within δ of *x* to end up within ε of *f*(*x*).

So an applied definition of continuity would look something like this.

A function *f* is continuous at *x*, for the purposes of a particular application, if for every ε > the resolution of the problem, there exists a δ > the lower limit for that application, such that if |*x* – *y*| < δ, then |*f*(*x*) – *f*(*y*)| < ε.

The Riemann-Liouville fractional integral starts from the observation that for positive integer *n*, the *n*-fold repeated integral of *f* can be written as a single integral:

\[ (I^n f)(x) = \frac{1}{(n-1)!} \int_0^x (x-t)^{n-1} f(t)\, dt \]

This motivates a definition of fractional integrals

\[ (I^\alpha f)(x) = \frac{1}{\Gamma(\alpha)} \int_0^x (x-t)^{\alpha-1} f(t)\, dt \]

which is valid for any complex α with positive real part. Derivatives and integrals are inverses for integer degree, and we use this to define **fractional derivatives**: the derivative of degree *n* is the integral of degree –*n*. So if we could define fractional *integrals* for any degree, we could define a *derivative* of degree α to be an integral of degree -α.

Unfortunately we can’t do this directly since our definition only converges in the right half-plane. But for (ordinary) differentiable *f*, we can integrate the Riemann-Liouville definition of fractional integral by parts:

\[ (I^\alpha f)(x) = \frac{f(0)\, x^\alpha}{\Gamma(\alpha+1)} + \frac{1}{\Gamma(\alpha+1)} \int_0^x (x-t)^\alpha f'(t)\, dt \]

We can use the right side of this equation to define the left side when the real part of α is bigger than -1. And if *f* has two ordinary derivatives, we can repeat this process to define fractional integrals for α with real part bigger than -2. We can repeat this process to define the fractional integrals (and hence fractional derivatives) for any degree we like, provided the function has enough ordinary derivatives.
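Here's a numerical sketch of the Riemann-Liouville integral, checked against a case where the answer is known in closed form: the half-integral of *f*(*t*) = *t* at *x* is Γ(2)/Γ(5/2) · *x*^{3/2}. Midpoint quadrature sidesteps the integrable singularity at *t* = *x*, though convergence there is slow, so this is a rough check rather than an accurate method.

```python
from math import gamma

def frac_integral(f, alpha, x, n=100_000):
    """Riemann-Liouville fractional integral of order alpha of f at x,
    computed by the composite midpoint rule."""
    h = x / n
    total = sum((x - (i + 0.5) * h)**(alpha - 1) * f((i + 0.5) * h)
                for i in range(n)) * h
    return total / gamma(alpha)

# Half-integral of f(t) = t at x = 1; exact value is Gamma(2)/Gamma(2.5)
approx = frac_integral(lambda t: t, 0.5, 1.0)
exact = gamma(2) / gamma(2.5)
```

Applying the half-integral twice recovers the ordinary integral, which is a good sanity check to run on any implementation.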

See previous posts for two other ways of defining fractional derivatives, via Fourier transforms and via the binomial theorem.

- *N*_{5}, the 95th percentile of loudness, measured in sone (which is confusingly called the 5th percentile)
- ω_{S}, a function of sharpness in acum and of loudness
- ω_{FR}, a function of fluctuation strength (in vacil), roughness (in asper), and loudness

Specifically, Zwicker calculates *PA*, psychoacoustic annoyance, by

\[ PA = N_5 \left( 1 + \sqrt{\omega_S^2 + \omega_{FR}^2} \right) \]

A geometric visualization of the formula is given below.

Here’s an example of computing annoyance using two sound files from previous posts, a leaf blower and a simulated kettledrum. I calibrated both to have sound pressure level 80 dB. But because of the different composition of the sounds, i.e. more high frequency components in the leaf blower, the leaf blower is much louder than the kettledrum (39 sone vs 15 sone) at the same sound pressure level. The annoyance of the leaf blower works out to about 56 while the kettledrum was only about 19.


The most recent episode of 99% Invisible tells the story of the Corps of Engineers’ enormous physical model of the Mississippi basin, nearly half of the area of the continental US. Spanning over 200 acres, the model was built during WWII and was shut down in 1993.

Here are some of my favorite lines from the show:

The reason engineers continue to rely on [physical models] is because today, in 2016, we still do not have the computers or the science to do all the things that physical models can do. …

Hydraulic engineering gets into some of the most complicated math there is. Allegedly when Albert Einstein’s son Hans said he wanted to study how sediment moves underwater, Einstein asked him why he wanted to work on something so complicated.

The physics involved happen on such a small scale that we still haven’t built equations complex enough to capture them. And so Stanford Gibson, a world-class numerical modeler, is one of the most ardent supporters of physical models.

But then I have a quibble. The show goes on to say “a physical model doesn’t require equations at all.” That’s not true. When you build a small thing to study how a big thing works, you’ve got to have some theory relating the behavior of the two. If the real thing is 1000 times bigger than the model, does that mean you can simply multiply measurements from the model by 1000 to predict measurements of reality? Sometimes. **Some effects scale linearly, but some do not**.

It would have been more accurate to say a physical model doesn’t require *as many* equations as an entirely mathematical model. **The physical model is a partial mathematical model** because of the mathematics necessary to extrapolate from the model to the thing being modeled.

Another line from the show that I liked was a quote from Stanford Gibson, introduced above.

The idea that science demystifies the world, I just don’t understand that. I feel that the deeper down the scientific rabbit hole I go, the bigger and grander and more magical the world seems.

**Related post**: Toy problems

Here’s one way fractional derivatives could be defined. Suppose the Fourier transform of *f*(*x*) is *g*(ξ). Then for positive integer *n*, the *n*th derivative of *f*(*x*) has Fourier transform (2πiξ)^{n} *g*(ξ). So you could take the *n*th derivative of *f*(*x*) as follows: take the Fourier transform, multiply by (2πiξ)^{n}, and take the inverse Fourier transform. This suggests the same procedure could be used to define the *n*th derivative when *n* is not an integer.
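
Here’s a small numerical sketch of this idea for periodic functions, using NumPy’s FFT. The function name and grid are my own; the multiplier (i*k*)^α plays the role of the (2πiξ)^{n} factor above.

```python
import numpy as np

def fourier_frac_deriv(f_vals, alpha):
    # Fractional derivative of a periodic sample on [0, 2*pi):
    # transform, multiply by (i*k)**alpha, transform back.
    n = len(f_vals)
    k = np.fft.fftfreq(n, d=1/n)   # integer wavenumbers
    mult = (1j*k)**alpha           # principal branch of the complex power
    return np.fft.ifft(mult * np.fft.fft(f_vals)).real

# The half-derivative of sin(x) should be sin(x + pi/4),
# since each derivative of order alpha shifts the phase by alpha*pi/2.
x = np.linspace(0, 2*np.pi, 256, endpoint=False)
half = fourier_frac_deriv(np.sin(x), 0.5)
```

With α = 1 this recovers the ordinary derivative, since multiplying by i*k* in the frequency domain is differentiation.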

Here’s another way to define fractional derivatives that doesn’t use Fourier transforms, the Grünwald-Letnikov derivative. It’s a fairly direct approach.

The starting point is to note that for positive integer *n*, the *n*th derivative can be written as

where

and Δ^{n} iterates Δ. For example,

In general, for positive integer *n*, we have

We could set the upper limit of the sum to *n*, but there’s no harm in setting it to ∞ because the binomial coefficients will be zero for *k* larger than *n*. And in this form, we can replace the integer *n* with any positive real number α to define the αth derivative. That is, the Grünwald-Letnikov derivative of *f* is given by

See these notes for the definition of binomial coefficients for possibly non-integer arguments and for an explanation why for integer *n* the coefficients are eventually zero.
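
As a sketch of what this looks like numerically: the Grünwald-Letnikov derivative is the limit as *h* → 0 of *h*^{-α} Σ (-1)^{k} C(α, *k*) *f*(*x* - *kh*), and truncating the sum gives an approximation. The helper name and step size below are my own choices, not from the post.

```python
from scipy.special import binom

def gl_derivative(f, x, alpha, h=1e-3, terms=1000):
    # Truncated Grünwald-Letnikov sum: the alpha-th derivative of f at x
    # is approximately h**(-alpha) * sum_k (-1)^k * C(alpha, k) * f(x - k*h).
    s = sum((-1)**k * binom(alpha, k) * f(x - k*h) for k in range(terms))
    return s / h**alpha

# For integer alpha the binomial coefficients eventually vanish,
# and we recover an ordinary finite-difference derivative.
d = gl_derivative(lambda t: t**2, 1.0, 1)   # should be close to 2
```

Note that the sum samples *f* at an evenly spaced sequence of points reaching back from *x*, which is the non-local behavior discussed below.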

Notice that fractional derivatives require **non-local information**. Ordinary derivatives at a point are determined by the values of the function in an arbitrarily small neighborhood of that point. But notice how the fractional derivative, as defined in this post, depends on the values of the function at an evenly spaced infinite sequence of points. If we define fractional derivatives via Fourier transform, the non-local nature is more apparent since the Fourier transform at any point depends on the values of the function everywhere.

This non-local feature can be good or bad. If you want to model a phenomenon with non-local dependence, fractional differential equations might be a good way to go. But if your phenomenon is locally determined, fractional differential equations might be a poor fit.

**Related post**: Mittag-Leffler functions

**Update**: Yet another way to define fractional derivatives, this time via fractional integrals, which can be defined in terms of ordinary integrals

Yesterday I was on Amazon.com and noticed that nearly all the books they recommended for me were either about Lisp or mountain climbing. I thought this was odd, and mentioned it on Twitter. Carl Vogel had a witty reply: “I guess they weren’t sure whether you want to figuratively or literally look down on everyone.”

The stereotype Lisp programmer does look down on everyone. But this is based on a tiny, and perhaps unrepresentative, sample of people writing *about* Lisp compared to the much larger number of people who are writing *in* Lisp.

Lisp has been around for over 50 years and shows no signs of going away. There are a lot of people writing Lisp in obscurity. Kate Gregory said something similar about C++ developers, calling them the dark matter of programmers because there are lot of them but they don’t make much of a splash. They’re quietly doing their job, not speaking at conferences or writing much about their language.

I imagine there are a lot of humble Lisp programmers. It takes some humility to commit to an older technology, especially given the pervasive neomania of the programmer community. It also takes some humility to work on projects that have been around for years or that are deep within the infrastructure of more visible projects, which is where I expect a lot of Lisp lives.

You can do very clever things in Lisp, but you don’t have to. As Ed Post famously said, “The determined Real Programmer can write FORTRAN programs in any language.” There must be a lot of code out there that writes `(f x)` instead of `f(x)` but otherwise isn’t that different from FORTRAN.

All four involve the unknown function φ(*x*) in an integral with a kernel *K*(*x*, *y*) and all have an input function *f*(*x*). In all four the integration ranges from some fixed lower limit. In the **Volterra equations**, the upper limit of integration is the variable *x*, while in the **Fredholm equations**, the upper limit of integration is a fixed constant.

The so-called equations **of the first kind** only involve the unknown function φ inside the integral. The equations **of the second kind** also involve φ outside the integral.

So the four equations above are

- Volterra equation of the first kind
- Volterra equation of the second kind
- Fredholm equation of the first kind
- Fredholm equation of the second kind

Here’s a diagram to make these easier to keep straight:

In general, the theory of Volterra equations is easier than that of Fredholm equations. And while equations of the first kind look simpler at first, it’s common to reduce equations of the first kind to equations of the second kind and concentrate on the latter.

There are many variations on this theme. The *x* in Volterra equations could be a vector. The integral could be, for example, a double or triple integral. In Fredholm equations, the integration may be over a fixed general region. Maybe you’re integrating over a watermelon, as the late William Guy would say. You could have nonlinear versions of these equations where instead of multiplying *K*(*x*, *y*) times φ(*y*) you have a kernel *K*(*x*, *y*, φ(*y*)) that is some nonlinear function of φ.

You may see references to Volterra or Fredholm equations of the **third kind**. These are an extension of the second kind, where a function *A*(*x*) multiplies the φ outside the integral. Equations of the second kind are the most important since the first and third kinds can often be reduced to the second kind.
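
To make the Volterra/Fredholm distinction concrete, here’s a small numerical sketch (my own code, not from any library) that solves a Volterra equation of the second kind, φ(*x*) = *f*(*x*) + ∫₀ˣ *K*(*x*, *y*) φ(*y*) d*y*. Because the upper limit is *x*, the solution can be marched forward one grid point at a time.

```python
import numpy as np

def solve_volterra2(f, K, x):
    # Solve phi(x) = f(x) + int_0^x K(x, y) phi(y) dy on the grid x
    # by the trapezoidal rule, stepping forward in x.
    n = len(x)
    h = x[1] - x[0]
    phi = np.zeros(n)
    phi[0] = f(x[0])
    for i in range(1, n):
        # known part of the quadrature: weights h/2 at y=0, h in between
        s = 0.5*h*K(x[i], x[0])*phi[0]
        s += h*sum(K(x[i], x[j])*phi[j] for j in range(1, i))
        # the unknown phi[i] appears with weight h/2; solve for it
        phi[i] = (f(x[i]) + s) / (1 - 0.5*h*K(x[i], x[i]))
    return phi

# phi(x) = 1 + int_0^x phi(y) dy has exact solution e^x
x = np.linspace(0, 1, 101)
phi = solve_volterra2(lambda t: 1.0, lambda s, t: 1.0, x)
```

A Fredholm equation, with a fixed upper limit, would instead couple every grid point to every other and require solving one dense linear system, which hints at why the Volterra theory is easier.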

Fluctuation strength reaches its maximum at a modulation frequency of around 4 Hz. For much higher modulation frequencies, one perceives roughness rather than fluctuation. The reference value for one vacil is a 1 kHz tone, fully modulated at 4 Hz, at a sound pressure level of 60 decibels. In other words

(1 + sin(8π*t*)) sin(2000π*t*)

where *t* is time in seconds.
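
Analogous to the asper example later in this collection, here’s a short sketch (my own, not from the original post) that writes this one-vacil reference signal to a WAV file.

```python
import numpy as np
from scipy.io.wavfile import write

N = 48000                # samples per second
t = np.arange(3*N) / N   # three seconds of audio

# 1 vacil reference: 1 kHz carrier, 100% modulated at 4 Hz
data = (1 + np.sin(8*np.pi*t)) * np.sin(2000*np.pi*t)

# data ranges over [-2, 2]; halve it before scaling to 16-bit integers
samples = np.int16(data/2 * (2**15 - 1))
write("one_vacil.wav", N, samples)
```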

Since the carrier frequency is 250 times greater than the modulation frequency, you can’t see both in the same graph. In this plot, the carrier is solid blue compared to the modulation.

Here’s what the reference for one vacil would sound like:

**See also**: What is an asper?

(1 + sin(140π*t*)) sin(2000π*t*)

where *t* is time in seconds.

Here’s what that sounds like (if you play this at 60 dB, about the loudness of a typical conversation at one meter):

And here’s the Python code that made the file:

```python
from scipy.io.wavfile import write
from numpy import arange, pi, sin, int16

def f(t, f_c, f_m):
    # t   = time
    # f_c = carrier frequency
    # f_m = modulation frequency
    return (1 + sin(2*pi*f_m*t))*sin(2*f_c*pi*t)

def to_integer(signal):
    # Take samples in [-1, 1] and scale to 16-bit integers,
    # values between -2^15 and 2^15 - 1.
    return int16(signal*(2**15 - 1))

N = 48000        # samples per second
x = arange(3*N)  # three seconds of audio

# 1 asper corresponds to a 1 kHz tone, 100% modulated at 70 Hz, at 60 dB.
# f takes values in [-2, 2], so halve it to fit the [-1, 1] range
# that to_integer expects.
data = f(x/N, 1000, 70) / 2
write("one_asper.wav", N, to_integer(data))
```

**See also**: What is a vacil?

and we can generalize this to the Mittag-Leffler function

which reduces to the exponential function when α = β = 1. There are a few other values of α and β for which the Mittag-Leffler function reduces to more familiar functions. For example,

and

where erfc(*x*) is the complementary error function.
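
As a quick numerical check of these special cases, here’s a naive partial-sum implementation of *E*_{α, β} directly from its power series. The helper name is mine, and this is only a sketch: truncating the series is fine for moderate arguments but is not a robust algorithm in general.

```python
from math import gamma

def mittag_leffler(x, alpha, beta=1.0, terms=50):
    # Partial sum of E_{alpha, beta}(x) = sum_k x^k / Gamma(alpha*k + beta).
    # Fifty terms is plenty for moderate |x|; math.gamma would overflow
    # for much larger arguments of the Gamma function.
    return sum(x**k / gamma(alpha*k + beta) for k in range(terms))

# E_1(1) should be e, and E_2(1) = cosh(sqrt(1)) = cosh(1),
# matching the reductions to familiar functions mentioned above.
```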

Mittag-Leffler was one person, not two. When I first saw the Mittag-Leffler theorem in complex analysis, I assumed it was named after two people, Mittag and Leffler. But the theorem and the function discussed here are named after one man, the Swedish mathematician Magnus Gustaf (Gösta) Mittag-Leffler (1846–1927).

The function that Mr. Mittag-Leffler originally introduced did not have a β parameter; that generalization came later. The function *E*_{α} is *E*_{α, 1}.

Just as you can make a couple probability distributions out of the exponential function, you can make a couple probability distributions out of the Mittag-Leffler function.

The exponential function exp(-*x*) is positive over [0, ∞) and integrates to 1, so we can define a probability distribution whose density function (PDF) is *f*(*x*) = exp(-*x*) and whose distribution function (CDF) is *F*(*x*) = 1 – exp(-*x*). The Mittag-Leffler distribution has CDF 1 – *E*_{α}(-*x*^{α}) and so reduces to the exponential distribution when α = 1. For 0 < α < 1, the Mittag-Leffler distribution is a heavy-tailed generalization of the exponential. [1]

The Poisson distribution comes from taking the power series for exp(λ), normalizing it to 1, and using the *k*th term as the probability mass for *k*. That is,

The analogous discrete Mittag-Leffler distribution [2] has probability mass function

In addition to probability and statistics, the Mittag-Leffler function comes up in fractional calculus. It plays a role analogous to that of the exponential function in classical calculus. Just as the solution to the simple differential equation

is exp(*ax*), for 0 < μ < 1, the solution to the **fractional differential equation**

is *ax*^{μ-1} *E*_{μ, μ}(*ax*^{μ}). Note that this reduces to exp(*ax*) when μ = 1. [3]

[1] Gwo Dong Lin. Journal of Statistical Planning and Inference 74 (1998) 1–9, On the Mittag–Leffler distributions

[2] Subrata Chakraborty, S. H. Ong. Mittag-Leffler function distribution: A new generalization of hyper-Poisson distribution. arXiv:1411.0980v1

[3] Keith Oldham, Jan Myland, Jerome Spanier. An Atlas of Functions. Springer.

This post was inspired by a paper by Brian Beckman (in progress) that shows how a Kalman filter can be implemented as a fold. From a Bayesian perspective, the thing that makes the Kalman filter work is that a certain multivariate normal model has a conjugate prior. This post shows that conjugate models more generally can be implemented as folds over the data. That’s interesting, but what does it buy you? Brian’s paper discusses this in detail, but one advantage is that it completely separates the accumulator function from the data, so the former can be tested in isolation.
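
Here’s a minimal sketch of that idea in Python (my own illustration, not Brian’s code): conjugate Bayesian updating of a normal mean with known variance, written as a fold (`reduce`) whose accumulator is the (mean, precision) pair. The accumulator function never touches the data as a whole, so it can be tested on its own.

```python
from functools import reduce

def update(state, x):
    # Conjugate update for a normal mean with known observation variance 1.
    # The accumulator state is the current (posterior mean, posterior precision).
    mean, prec = state
    new_prec = prec + 1.0                  # each observation adds precision 1
    new_mean = (prec*mean + x) / new_prec  # precision-weighted average
    return (new_mean, new_prec)

prior = (0.0, 1.0)                         # standard normal prior on the mean
data = [2.1, 1.9, 2.0, 2.2]
posterior = reduce(update, data, prior)    # fold the updates over the data
```

Because normal models are conjugate, folding the observations one at a time gives exactly the same posterior as processing them in a single batch.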

At the time Brian was working on one big paper in private. This has since been split into several papers and they’re now public.

- Kalman Folding, Part 1
- Kalman Folding 2: Tracking and System Dynamics
- Kalman Folding 3: Derivations
- Kalman Folding 4: Streams and Observables
- Kalman Folding 5: Non-Linear models and the EKF
- Kalman Folding 6: ??
- Kalman Folding 7: A Small Streams Library


This made me think of corners in the geometric sense. If you have a sphere in a box in high dimensions, **nearly all the volume is in the corners**, i.e. outside the sphere. This is more than a metaphor. You can think of software options geometrically, with each independent choice corresponding to a dimension. Paths through a piece of software that are individually rare may account for nearly all use when considered together.

With a circle inside a square, about 78.5% of the area is inside the circle. With a ball sitting inside a 3-D box, 52.4% of the volume is inside the ball. As the dimension increases, the proportion of volume inside the sphere rapidly decreases. For a 10-dimensional sphere sitting in a 10-dimensional box, only 0.25% of the volume is in the sphere. Said another way, 99.75% of the volume is in the corners.

When you go up to 100 dimensions, the proportion of volume inside the sphere is about 2 parts in 10^{70}, a 1 followed by 70 zeros [1]. If 100 dimensions sounds like pure fantasy, think about a piece of software with more than 100 features. Those feature combinations multiply like geometric dimensions [2].

Here’s a little Python code you could use to see how much volume is in a sphere as a function of dimension.

```python
from scipy.special import gamma
from math import pi

def unit_sphere_volume(n):
    return pi**(0.5*n)/gamma(0.5*n + 1)

def unit_cube_volume(n):
    return 2**n

def ratio(n):
    return unit_sphere_volume(n) / unit_cube_volume(n)

print( [ratio(n) for n in range(1, 20)] )
```

* * *

[1] There are names for such extremely large numbers. These names are hardly ever used—scientific notation is much more practical—but they’re fun to say. 10^{70} is ten duovigintillion in American nomenclature, ten undecilliard in European.

[2] Geometric dimensions are perfectly independent, but software feature combinations are not. In terms of logic, some combinations may not be possible. Or in terms of probability, the probability of exploring some paths is conditional on the probability of exploring other paths. Even so, there are inconceivably many paths through any large software system. And in large-scale operations, events that should “never happen” happen regularly.


People best understand computer programs in a different order than compilers do. This is a key idea of literate programming, and one that distinguishes literate programs from heavily commented programs.

Traditional source code, no matter how heavily commented, is presented in the order dictated by the compiler. The computer is the primary audience. Literate programming is more humanistic in the sense that the primary audience is a human. The *computer* has to go to extra effort to arrange the code for its needs. As Donald Knuth describes it in his book on literate programming,

The practitioner of literate programming … strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that nicely reinforce each other. [emphasis added]

There are two steps in processing literate programs: **weaving** and **tangling**. You take files containing prose and code, and *weave* them into documentation and *tangle* them into source code. Tools like Sweave and Pweave focus on the weave process, as their names imply. The weave side of literate programming has gotten the most attention.

A half-hearted approach to literate programming doesn’t require much of a tangle process. A well-commented program has no tangle step at all. A weave document that follows the order of the source code has a trivial tangle step: save the code to its own file, manually or automatically, but don’t rearrange it. But a full-fledged literate program may make the tangle program work harder, rearranging code fragments from human-friendly to compiler-friendly order.
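
To show what a non-trivial tangle step does, here’s a toy sketch in Python. The `<<name>>=` chunk syntax loosely follows noweb; everything here is my own illustration, not a real literate programming tool. The source presents the main program first, human-style, and tangling rearranges the chunks into compiler order.

```python
import re

# A tiny literate source: the top-level chunk comes first for the reader,
# but it refers to a helper chunk defined later.
source = """
<<main>>=
<<helpers>>
print(add(2, 3))

<<helpers>>=
def add(a, b):
    return a + b
"""

def parse_chunks(text):
    # Collect named chunks; a blank line ends the current chunk.
    chunks, name = {}, None
    for line in text.splitlines():
        m = re.match(r"<<(.+)>>=$", line)
        if m:
            name = m.group(1)
            chunks[name] = []
        elif name is not None and line.strip():
            chunks[name].append(line)
        elif not line.strip():
            name = None
    return chunks

def tangle(chunks, root):
    # Expand <<name>> references recursively into compiler-ready order.
    out = []
    for line in chunks[root]:
        m = re.match(r"<<(.+)>>$", line.strip())
        out.extend(tangle(chunks, m.group(1)) if m else [line])
    return out

code = "\n".join(tangle(parse_chunks(source), "main"))
```

After tangling, the helper definition lands above its use, even though the literate source introduced it afterward.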

The most obvious feature of literate programming is that it requires careful explanation. Here’s more from the paragraph I quoted above, filling in the part I left out.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with explanation and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible …

The discipline of explaining every piece of code leads to better code. It serves a similar purpose to writing unit tests. I saw somewhere—I can’t remember where now—that Knuth hates unit testing and sees it as redundant effort. Presumably this is because unit testing and literate programming overlap. Unit tests are a kind of commentary on code, explaining how it is used, exploring its limitations, etc.

Knuth understands that literate programming doesn’t replace the need for *testing*, just *unit testing*. He explained somewhere—again I forget where—that he would test TeX by spending days at a time trying fiendishly to break it.

When I read Knuth’s book, I could see the value of carefully explaining code. What I didn’t appreciate was the value of presenting code in a different order than the order of source code.

I’m working on a project now where a few lines of code may require a few paragraphs of explanation. That’s what got me thinking about literate programming. My intention was to write my documentation in the same order as the code. It took a while to realize I had stumbled on an ideal application of literate programming: a complicated algorithm that needs to be explained carefully, both in order to hand over to the client and to reduce the chances of errors. The best order to understand this algorithm is definitely not top-down going through the code.

I think I understand better now why literate programming hasn’t gained much of an audience. I used to think that it was because developers hate writing prose. That’s part of it. Most programmers I’ve worked with would much rather write a hundred lines of unit tests than write one complete sentence.

But that’s not the whole story. There are quite a few programmers who are willing and able to write prose. Why don’t more of them use literate programming?

I think part of the reason is that having a non-trivial tangle process is a barrier to adoption. A programmer can decide to start writing more extensive comments, gradually working up to essay-like explanations. But it’s one thing to say “I’m going to heavily comment my C code” and quite another to say “I’m not going to write C per se any more. I’m going to write CWEB files that compile to C.” Even if a programmer wants to write CWEB in secret, just checking in the tangled C files, other programmers will edit these C files and the changes won’t be reflected in the CWEB source. Also, the output of tangle is *less* readable than ordinary code. The programmer secretly using CWEB as a preprocessor would appear to be writing undocumented code.

Tricky code benefits from a literate presentation, but routine code does not benefit so much. You either have to have two ways of writing code—straight source for routine code and literate programs for the tricky parts—or impose the overhead of literate programming everywhere. Most code is mundane and repetitive, challenging because of its volume rather than its cleverness. Knuth doesn’t write this kind of code. He only writes code that benefits from a literate presentation.

To write a good literate program, not only do you need to be able to program, and need to be willing and able to write good prose, on top of that you need to have a good sense for story telling, arranging the code for the benefit of other readers. If this is done poorly, the result is harder to understand than traditional programs.

I may use literate programming more now that I’m starting to understand it, at least for my own benefit and hopefully for the benefit of clients. I usually deliver algorithms or libraries, not large applications, and so it wouldn’t be too much work to create two versions of my results. I could create a literate program, then weave a report, and manually edit the tangled code into a traditional form convenient for the client.

Any number in Pascal’s triangle that is *not* in the outer two layers will appear at least three times, usually four.

Pascal’s triangle contains the binomial coefficients C(*n*, *r*) where *n* ranges over non-negative integers and *r* ranges from 0 to *n*. The outer layers are the elements with *r* equal to 0, 1, *n*-1, and *n*.

We’ll write some Python code to explore how often the numbers up to 1,000,000 appear. How many rows of Pascal’s triangle should we compute? The smallest number on row *n* is C(*n*, 2). Now 1,000,000 is between C(1414, 2) and C(1415, 2) so we need row 1414. This means we need *N* = 1415 below because the row numbers start with 0.

I’d like to use a NumPy array for storing Pascal’s triangle. In the process of writing this code I realized that a NumPy array with `dtype int` doesn’t contain Python’s arbitrary-sized integers. This makes sense because NumPy is designed for efficient computation, and using a NumPy array to contain huge integers is unnatural. But I’d like to do it anyway, and the way to make it happen is to set `dtype` to `object`.

```python
import numpy as np
from collections import Counter

N = 1415  # Number of rows of Pascal's triangle to compute

Pascal = np.zeros((N, N), dtype=object)
Pascal[0, 0] = 1
Pascal[1, 0] = Pascal[1, 1] = 1

for n in range(2, N):
    for r in range(0, n+1):
        Pascal[n, r] = Pascal[n-1, r-1] + Pascal[n-1, r]

c = Counter()
for n in range(4, N):
    for r in range(2, n-1):
        p = Pascal[n, r]
        if p <= 1000000:
            c[p] += 1
```

When we run this code, we find that our counter contains 1732 elements. That is, of the numbers up to one million, only 1732 of them appear inside Pascal’s triangle when we disallow the outer two layers. (The second layer contains consecutive integers, so every positive integer appears in Pascal’s triangle. But most integers *only* appear in the second layer.)

When *Single Digits* speaks of “Any number in Pascal’s triangle that is *not* in the outer two layers” this cannot refer to numbers that are not in the outer two layers because *every* natural number appears in the outer two layers. Also, when it says the number “will appear at least three times, usually four” it is referring to the entire triangle, i.e. including the outer two layers. So another way to state the sentence quoted above is as follows.

Define the interior of Pascal’s triangle to be the triangle excluding the outer two layers. Every number *n* in the interior of Pascal’s triangle appears twice more outside of the interior, namely as C(*n*, 1) and C(*n*, *n*-1). Most often *n* appears at least twice in the interior as well.

This means that any number you find in the interior of Pascal’s triangle, interior meaning not in the two outer layers, will appear at least three times in the full triangle, usually more.

Here are the statistics from our code above regarding just the *interior* of Pascal’s triangle.

- One number, 3003, appears six times.
- Six numbers appear four times: 120, 210, 1540, 7140, 11628, and 24310.
- Ten numbers appear only once: 6, 20, 70, 252, 924, 3432, 12870, 48620, 184756, and 705432.
- The large majority of numbers, 1715 out of 1732, appear twice.
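
As a spot check on the first bullet, here’s a short self-contained snippet using `math.comb` to find every interior appearance of 3003. Searching 100 rows is enough, because the smallest interior entry of row *n* is C(*n*, 2), which already exceeds 3003 beyond row 78.

```python
from math import comb

# Find all interior positions (2 <= r <= n-2) where 3003 appears
hits = [(n, r) for n in range(4, 100)
               for r in range(2, n - 1)
               if comb(n, r) == 3003]
# Expect the symmetric pairs from rows 14, 15, and 78
```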

This is very much like the idea of homotopy from topology, a continuous deformation of one thing into another. No discontinuities along the way — no ripping, no jumping suddenly from one thing to another.

* * *
