Growing compute power and shrinking sensors open up possibilities we’re only beginning to explore. Even when the things we want to observe elude direct measurement, we may be able to infer them from other things that we can now measure accurately, inexpensively, and in high volume.

In order to infer what you’d *like* to measure from what you *can* measure, you need a mathematical model. Or if you’d like to make predictions about the future from data collected in the past, you need a model. And that’s where I come in. Several companies have hired me to help them create medical devices by working on mathematical models. These might be statistical models, differential equations, or a combination of the two. I can’t say much about the projects I’ve worked on, at least not yet. I hope that I’ll be able to say more once the products come to market.

I started my career doing mathematical modeling (partial differential equations) but wasn’t that interested in statistics or medical applications. Then through an unexpected turn of events, I ended up spending a dozen years working in the biostatistics department of the world’s largest cancer center.

Since I left MD Anderson and started my consultancy, several companies have approached me for help with mathematical problems associated with their ideas for medical devices. These are ideal projects because they combine my earlier experience in mathematical modeling with my more recent experience with medical applications.

If you have an idea for a medical device, or know someone who does, let’s talk. I’d like to help.


“A bird may love a fish but where would they build a home together?” — Fiddler on the Roof


Suppose *f*(*x*) is a function of several variables, i.e. *x* is a vector, and *g*(*x*) = *c* is a constraint. Then the Lagrange multiplier theorem says that at the maximum of *f* subject to the constraint *g*(*x*) = *c* we have ∇*f* = λ ∇*g*.

Where does this mysterious λ come from? And why should the gradient of your objective function be related to the gradient of a constraint? These seem like two different things that shouldn’t even be comparable.

Here’s the geometric explanation. The set of points satisfying *g*(*x*) = *c* is a surface. And for any *k*, the set of points satisfying *f*(*x*) = *k* is also a surface. Imagine *k* very large, larger than the maximum of *f* on the surface defined by *g*(*x*) = *c*. You could think of the surface *g*(*x*) = *c* being a balloon inside the larger balloon *f*(*x*) = *k*.

Now gradually decrease *k*, like letting the air out of the outer balloon, until the surfaces *g*(*x*) = *c* and *f*(*x*) = *k* first touch. At that point, the two surfaces will be tangent, and so their normal vectors, given by their gradients, point in the same direction. That is, ∇*f* and ∇*g* are parallel, and so ∇*f* is some multiple of ∇*g*. Call that multiple λ.
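To make the tangency condition concrete, here is a small numeric check. The example function and constraint are my own choices for illustration: maximize *f*(*x*, *y*) = *x* + *y* subject to *g*(*x*, *y*) = *x*² + *y*² = 1.

```python
from math import sqrt

# Numeric check of ∇f = λ∇g for f(x, y) = x + y on the circle
# g(x, y) = x^2 + y^2 = 1. By symmetry the maximum is at x = y = 1/√2.
x = y = 1 / sqrt(2)
grad_f = (1.0, 1.0)
grad_g = (2 * x, 2 * y)                 # gradient of the constraint
lam = grad_f[0] / grad_g[0]             # λ read off from the first component
residual = grad_f[1] - lam * grad_g[1]  # the second component agrees too
print(lam, residual)
```

The residual is zero (up to rounding), confirming that the two gradients are parallel at the constrained maximum, with λ = 1/√2.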

I don’t know how well that explanation works when written down. But when I heard Jim Vick explain it, moving his hands in the air, it was an eye-opener.

This is not a rigorous proof, and it does not give the most general result possible, but it explains what’s going on. It’s something to keep in mind when reading proofs that are more rigorous or more general. As I comment on here,

Proofs serve two main purposes: to establish that a proposition is true, and to show *why* it is true.

The literally hand-wavy proof scores low on the former criterion and high on the latter.

***

Jim Vick was a great teacher. Some of us affectionately called him The Grinning Demon because he was always smiling, even while he gave devilishly hard homework. He was Dean of Natural Sciences when I took a couple classes from him. He later became Vice President for Student Affairs and kept teaching periodically. He has since retired but still teaches.

After taking his topology course, several of us asked him to teach a differential geometry course. He hesitated because it’s a challenge to put together an undergraduate differential geometry course. The usual development of differential geometry uses so much machinery that it’s typically a graduate-level course.

Vick found a book that starts by looking only at manifolds given by level sets of smooth functions, like the surfaces discussed above. Because these surfaces sit inside a Euclidean space, you can quickly get to some interesting geometric results using elementary methods. We eventually got to the more advanced methods, but by then we had experience in a more tangible setting. As Michael Atiyah said, abstraction should follow experience, not precede it.

With no more information than this, what would you estimate the probability to be that the treatment is effective in the next subject? Easy: 0.7.

Now what would you estimate the probability to be that the treatment is effective in the next **two** subjects? You might say 0.49, and that would be correct if we *knew* that the probability of response is 0.7. But there’s uncertainty in our estimate. We don’t know that the response rate is 70%, only that we saw a 70% response rate in our small sample.

If the probability of success is *p*, then the probability of *s* successes and *f* failures in the next *s* + *f* subjects is given by

C(*s* + *f*, *s*) *p*^{s} (1 − *p*)^{f}

But if our probability of success has some uncertainty and we assume it has a beta(*a*, *b*) distribution, then the **predictive probability** of *s* successes and *f* failures is given by

C(*s* + *f*, *s*) B(*a* + *s*, *b* + *f*) / B(*a*, *b*)

where B(*a*, *b*) = Γ(*a*) Γ(*b*) / Γ(*a* + *b*) is the beta function.

In our example, after seeing 7 successes out of 10 subjects, we estimate the probability of success by a beta(7, 3) distribution. Then this says the predictive probability of two successes is approximately 0.51, a little higher than the naive estimate of 0.49. Why is this?

We’re not assuming the probability of success is 0.7, only that the mean of our estimate of the probability is 0.7. The actual probability might be higher or lower. The predictive probability computes the probability of the outcome under each possible value of the success probability, then takes a weighted average, weighting each value by its probability. The contributions from values above and below 0.7 approximately balance out, but the former carry a little more weight, and so we get roughly what we did before.

If this doesn’t seem right, note that mean and median aren’t the same thing for asymmetric distributions. A beta(7,3) distribution has mean 0.7, but it has a probability of 0.537 of being larger than 0.7.

If our initial experiment had shown 70 successes out of 100 instead of 7 out of 10, the predictive probability of two successes would have been 0.492, closer to the value based on the point estimate, but still different.
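The numbers above can be checked directly from the predictive probability formula. Here is a short sketch using only the standard library (the function names are mine):

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    # log of the beta function B(a, b) = Γ(a)Γ(b)/Γ(a+b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def predictive(s, f, a, b):
    # predictive probability of s successes and f failures
    # when the success probability has a beta(a, b) distribution
    return comb(s + f, s) * exp(log_beta(a + s, b + f) - log_beta(a, b))

print(predictive(2, 0, 7, 3))    # ≈ 0.509, the 0.51 quoted above
print(predictive(2, 0, 70, 30))  # ≈ 0.492
```

For two successes and no failures the binomial coefficient is 1, so the first value reduces to B(9, 3)/B(7, 3) = 56/110 ≈ 0.509.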

The further we look ahead, the more difference there is between using a point estimate and using a distribution that incorporates our uncertainty. Here are the probabilities for the number of successes out of the next 100 outcomes, using the point estimate 0.7 and using the predictive probability with a beta(7,3) distribution.

So if we’re sure that the probability of success is 0.7, we’re pretty confident that out of 100 trials we’ll see between 60 and 80 successes. But if we model our uncertainty in the probability of response, we get quite a bit of uncertainty when we look ahead to the next 100 subjects. Now we can say that the number of responses is likely to be between 30 and 100.

Expect to see tweets about constructive logic, type theory, formal proofs, proof assistants, etc.

The image for the account is a bowtie, a pun on formality. It’s also the symbol for natural join in relational algebra.

The number 3435 has the following curious property:

3435 = 3^{3} + 4^{4} + 3^{3} + 5^{5}.

It is called a Münchausen number, an allusion to the fictional Baron Münchausen. When each digit is raised to its own power and summed, you get the original number back. The only other Münchausen number is 1.

At least in base 10. You could look at Münchausen numbers in other bases. If you write out a number *n* in base *b*, raise each of its “digits” to its own power, take the sum, and get *n* back, you have a Münchausen number in base *b*. For example 28 is a Münchausen number in base 9 because

28_{ten} = 31_{nine} = 3^{3} + 1^{1}

Daan van Berkel proved that there are only finitely many Münchausen numbers in any given base. In fact, he shows that a Münchausen number in base *b* cannot be greater than 2*b*^{b}, and so you could do a brute-force search to find all the Münchausen numbers in any base.

The upper bound 2*b*^{b} grows very quickly with *b*, and so brute force becomes impractical for large *b*. If you wanted to find all the hexadecimal Münchausen numbers you’d have to search 2·16^{16} = 36,893,488,147,419,103,232 numbers. How could you do this more efficiently?
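Before optimizing, here is the straightforward digit test as a sketch. Note one convention caveat: authors differ on how to treat the digit 0, and Python evaluates 0**0 as 1; the examples below contain no zero digits, so the choice doesn’t matter here.

```python
def is_munchausen(n, b=10):
    # Sum each base-b digit raised to its own power and compare with n.
    # Caution: Python evaluates 0**0 as 1; conventions for digit 0 vary.
    total, m = 0, n
    while m:
        d = m % b
        total += d ** d
        m //= b
    return total == n

print(is_munchausen(3435))     # True:  3^3 + 4^4 + 3^3 + 5^5 = 3435
print(is_munchausen(28, b=9))  # True:  28 is 31 in base 9, and 3^3 + 1^1 = 28
```

A more efficient search would enumerate multisets of digits rather than all numbers up to 2*b*^{b}, since the digit-power sum doesn’t depend on digit order.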

Suppose you have an expression (λ*x*.*x* + 2)*y*, which means apply to *y* the function that takes its argument and adds 2. Then β-reduction rewrites this expression as *y* + 2. In a more complicated expression, you might be able to apply β-reduction several times. When you do apply β-reduction several times, does the process always stop? And if you apply β-reduction to parts of an expression in a different order, can you get different results?

You might reasonably expect that if you apply β-reduction enough times you eventually get an expression you can’t reduce any further. *Au contraire*!

Consider the expression (λ*x*.*xx*) (λ*x*.*xx*). Beta reduction says to replace each *x* in the body of the first function with the argument, which is the second copy of (λ*x*.*xx*). But when we do that, we get the original expression (λ*x*.*xx*) (λ*x*.*xx*) back. Beta reduction gets stuck in an infinite loop.

Next consider the expression *L* = (λ*x*.*xxy*) (λ*x*.*xxy*). Applying β-reduction the first time gives (λ*x*.*xxy*) (λ*x*.*xxy*) *y* or *Ly*. Applying β-reduction again yields *Lyy*. Beta “reduction” doesn’t reduce the expression at all but makes it bigger.
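As a loose analogy in Python, which evaluates eagerly, the self-application term behaves the same way: evaluating the analogue of (λ*x*.*xx*) (λ*x*.*xx*) recurses forever until the interpreter gives up.

```python
import sys

# Ω = (λx.xx)(λx.xx) has no normal form; the Python analogue of
# evaluating it recurses without end, so the interpreter bails out.
sys.setrecursionlimit(500)
omega = lambda x: x(x)
try:
    omega(omega)
except RecursionError:
    print("reduction never terminates")
```

This is only an analogy: Python's evaluation order is one particular reduction strategy, but for this term every strategy loops.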

The technical term for what we’ve seen is that β-reduction is not *normalizing*. A rewrite system is *strongly normalizing* if applying the rules in any order eventually terminates. It’s *weakly normalizing* if there’s at least some sequence of applying the rules that terminates. Beta reduction is neither strongly nor weakly normalizing in the context of (untyped) lambda calculus.

In simply typed lambda calculus, we assign types to every variable, and functions have to take the right type of argument. This additional structure prevents examples such as those above that fail to normalize. If *x* is a function that takes an argument of type *A* and returns an argument of type *B* then you can’t apply *x* to itself. This is because *x* takes something of type *A*, not something of type function from *A* to *B*. You can prove that not only does this rule out specific examples like those above, it rules out all possible examples that would prevent β-reduction from terminating.

To summarize, β-reduction is not normalizing, not even weakly, in the context of untyped lambda calculus, but it is strongly normalizing in the context of simply typed lambda calculus.

Although β-reduction is not normalizing for untyped lambda calculus, the Church-Rosser theorem says it is *confluent*. That is, if an expression *P* can be transformed by β-reduction two different ways into expressions *M* and *N*, then there is an expression *T* such that both *M* and *N* can be reduced to *T*. This implies that if β-reduction *does* terminate, then it terminates in a unique expression (up to α-conversion, i.e. renaming bound variables). Simply typed lambda calculus is confluent as well, and so in that context we can say β-reduction always terminates in a unique expression (again up to α-conversion).

If I flip a coin a million times, I’m virtually certain to get 50 percent heads and 50 percent tails.

Depending on how you understand that line, it’s either imprecise or false. The more times you flip the coin, the **more** likely you are to get **nearly** half heads and half tails, but the **less** likely you are to get **exactly** half of each. I assume Dr. Lienhard knows this and that by “50 percent” he meant “nearly half.”

Let’s make the fuzzy statements above more quantitative. Suppose we flip a coin 2*n* times for some large number *n*. Then a calculation using Stirling’s approximation shows that the probability of *n* heads and *n* tails is approximately

1/√(π*n*)

which goes to zero as *n* goes to infinity. If you flip a coin a million times, there’s less than one chance in a thousand that you’d get exactly half heads.
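The exact probability and the Stirling approximation are easy to compare directly (a quick sketch using the standard library):

```python
from math import comb, sqrt, pi

def prob_exact_half(n):
    # exact probability of exactly n heads in 2n fair coin flips
    return comb(2 * n, n) / 4 ** n

# the exact value approaches 1/sqrt(pi * n) as n grows
for n in (10, 100, 10_000):
    print(n, prob_exact_half(n), 1 / sqrt(pi * n))

# a million flips means n = 500,000: well under one chance in a thousand
print(1 / sqrt(pi * 500_000))
```

The last line prints roughly 0.0008, matching the claim that exactly half heads in a million flips has less than one chance in a thousand.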

Next, let’s quantify the statement that *nearly* half the tosses are likely to be heads. The normal approximation to the binomial tells us that for large *n*, the number of heads out of 2*n* tosses is approximately distributed like a normal distribution with the same mean and variance, i.e. mean *n* and variance *n*/2. The *proportion* of heads is thus approximately normal with mean 1/2 and variance 1/8*n*. This means the standard deviation is 1/√(8*n*). So, for example, about 95% of the time the proportion of heads will be 1/2 plus or minus 2/√(8*n*). As *n* goes to infinity, the width of this interval goes to 0. Alternatively, we could pick some fixed interval around 1/2 and show that the probability of the proportion of heads being outside that interval goes to 0.

With data from other distributions, the mean and variance may not be sufficient statistics, and in fact there may be no (useful) sufficient statistics [1]. The full data set is more informative than any summary of the data. But out of habit people may think that the mean and variance are enough.

Probability distributions are an idealization, of course, and so data never exactly “come from” a distribution. But if you’re satisfied with a distributional idealization of your data, there may be useful sufficient statistics.

Suppose you have data with such large outliers that you seriously doubt that it could be coming from anything appropriately modeled as a normal distribution. You might say the definition of sufficient statistics is wrong, that the full data set tells you something you couldn’t know from the summary statistics. But the sample mean and variance are still sufficient statistics in this case. They really are sufficient, *conditional on the normality assumption*, which you don’t believe! The cognitive dissonance doesn’t come from the definition of sufficient statistics but from acting on an assumption you believe to be false.

***

[1] Technically every distribution has sufficient statistics, though the sufficient statistic might be the same size as the original data set, in which case the sufficient statistic hasn’t contributed anything useful. Roughly speaking, distributions have useful sufficient statistics if they come from an “exponential family,” a set of distributions whose densities factor a certain way.

The source file for a Jupyter notebook is a JSON document containing formatting instructions, encoded images, and the notebook content. Looking at a notebook file in a text editor is analogous to unzipping a Word document and looking at the XML inside. Here’s a sample:

```
" if (with_grid_):\n",
" plt.grid (which=\"both\",ls=\"-\",color=\"0.65\")\n",
" plt.show() "
] }, { "cell_type": "code", "execution_count": 18,
"metadata": { "collapsed": false }, "outputs": [ { "data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAawAAAESCAYAAA...
```

It’s hard to diff two Jupyter notebooks because content changes don’t map simply to changes in the underlying file.
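One workaround is to diff on content alone. Here is a sketch that pulls just the cell sources out of a notebook’s JSON, assuming the standard nbformat 4 layout with a top-level "cells" list (the file path in the commented usage line is hypothetical):

```python
import json

def cell_sources(path):
    # Return each cell's source as one string, ignoring outputs,
    # metadata, and embedded images, so notebooks can be compared
    # on their actual content.
    with open(path) as f:
        nb = json.load(f)
    return ["".join(cell["source"]) for cell in nb["cells"]]

# sources = cell_sources("analysis.ipynb")
```

Running this on two versions of a notebook and diffing the results sidesteps most of the noise in the raw JSON.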

We can look a little closer at WYSIWYG and ask what you see **where** and what you get **where**. Word documents and Jupyter notebooks are **visually** WYSIWYG: what you see in the interactive environment (browser) is what you get visually in the final product. That’s very convenient, except for version control and comparing changes.

Org-mode files are **functionally** WYSIWYG: what you see in the interactive environment (editor) is what you get functionally in the final code.

You could say that HTML and LaTeX are functionally or **causally** WYSIWYG whereas Word is visually WYSIWYG. Word is visually WYSIWYG in the sense that what you see on your monitor is essentially what you’ll see coming out of a printer. But Word doesn’t always let you see what’s causing your document to look as it does. Why can’t I delete this? Why does it insist on putting that here? HTML and LaTeX work at a lower level, for better and for worse. You may not be able to anticipate how things will look while editing the source, e.g. where a line will break, but cause and effect are transparent.

Everybody wants WYSIWYG, but they have different priorities regarding what they want to see and are concerned about different aspects of what they get.

The limitations of floating point arithmetic are something to be aware of, and ignoring these limitations can cause problems, like crashing airplanes. On the other hand, floating point arithmetic is usually far more reliable than the application it is being used in.

It’s well known that if you multiply two floating point numbers, *a* and *b*, then divide by *b*, you might not get exactly *a* back. Your result might differ from *a* by one part in ten quadrillion (10^{16}). To put this in perspective, suppose you have a rectangle approximately one kilometer on each side. Someone tells you the exact area and the exact length of one side. If you solve for the length of the missing side using standard (IEEE 754) floating point arithmetic, your result could be off by as much as the width of a helium atom.
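You can see both how often this happens and how tiny the error is with a quick experiment (a sketch; the exact counts depend on the random values drawn):

```python
import random

# Multiply then divide: mathematically a*b/b == a, but in floating
# point the two roundings sometimes leave the result off by an ulp.
random.seed(42)
exact, total, worst = 0, 10_000, 0.0
for _ in range(total):
    a, b = random.random(), random.random()
    c = a * b / b
    exact += (c == a)
    worst = max(worst, abs(c - a) / a)

print(f"{exact}/{total} recovered exactly; worst relative error {worst:.2e}")
```

The worst-case relative error stays on the order of 10^{−16}, i.e. a couple of units in the last place, consistent with the helium-atom comparison above.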

Most of the problems attributed to “round off error” are actually approximation error. As Nick Trefethen put it,

If rounding errors vanished, 90% of numerical analysis would remain.

Still, despite the extreme precision of floating point, in some contexts rounding error is a practical problem. You have to learn in what context floating point precision matters and how to deal with its limitations. **This is no different than anything else in computing**.

For example, most of the time you can imagine that your computer has an unlimited amount of memory, though sometimes you can’t. It’s common for a computer to have enough memory to hold the entire text of Wikipedia (currently around 12 gigabytes). This is usually far more memory than you need, and yet for some problems it’s not enough.


Building proofs and programs are very similar activities, but there is one important difference: when looking for a proof it is often enough to find one, however complex it is. On the other hand, not all programs satisfying a specification are alike: even if the eventual result is the same, efficient programs must be preferred. The idea that the details of proofs do not matter—usually called *proof irrelevance*—clearly justifies letting the computer search for proofs …

The quote is saying “Any proof will do, but among programs that do the same job, more efficient programs are better.” And this is certainly true, depending on what you want from a proof.

Proofs serve two main purposes: to establish that a proposition is true, and to show *why* it is true. If you’re only interested in the existence of a proof, then yes, any proof will do. But proofs have a sort of pedagogical efficiency just as programs have an execution efficiency. Given two proofs of the same proposition, the more enlightening proof is better by that criterion.

I assume the authors of the above quote would not disagree because they say it is *often* enough to find one, implying that for some purposes simpler proofs are better. And existence does come first: a complex proof that exists is better than an elegant proof that does not exist! Once you have a proof you may want to look for a simpler proof. Or not, depending on your purposes. Maybe you have an elegant proof that you’re convinced must be essentially true, even if you doubt some details. In that case, you might want to complement it with a machine-verifiable proof, no matter how complex.

**Source**: Interactive Theorem Proving and Program Development. Coq’Art: The Calculus of Inductive Constructions

For example, take English letter frequencies. These frequencies are fairly well known. E is the most common letter, followed by T, then A, etc. The string of letters “ETAOIN SHRDLU” comes from the days of Linotype when letters were arranged in that order, in decreasing order of frequency. Sometimes you’d see ETAOIN SHRDLU in print, just as you might see “QWERTY” today.

Morse code is also based on English letter frequencies. The length of a letter in Morse code varies approximately inversely with its frequency, a sort of precursor to Huffman encoding. The most common letter, E, is a single dot, while the rarer letters like J and Q have a dot and three dashes. (So does Y, even though it occurs more often than some letters with shorter codes.)
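A few entries make the pattern visible. The codes below are from the standard international Morse table:

```python
# Common letters get short codes; rare letters get long ones.
morse = {
    "E": ".",    "T": "-",            # most frequent: one symbol each
    "A": ".-",   "O": "---",
    "I": "..",   "N": "-.",
    "J": ".---", "Q": "--.-", "Y": "-.--",  # rare: four symbols each
}
for letter, code in morse.items():
    print(f"{letter}: {code} (length {len(code)})")
```

Each of J, Q, and Y uses one dot and three dashes, just in different orders.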

So how frequently does the letter E, for example, appear in English? That depends on what you mean by English. You can count how many times it appears, for example, in a particular edition of A Tale of Two Cities, but that isn’t the same as its frequency in English. And if you’d picked the novel Gadsby instead of A Tale of Two Cities you’d get very different results since that book was written without using a single letter E.

Peter Norvig reports that E accounted for 12.49% of English letters in his analysis of the Google corpus. That’s a better answer than just looking at Gadsby, or even A Tale of Two Cities, but it’s still not *English*.

What might we mean by “English” when discussing letter frequency? Written or spoken English? Since when? American, British, or worldwide? If you mean blog articles, I’ve altered the statistics from what they were a moment ago by publishing this. Introductory statistics books avoid this kind of subtlety by distinguishing between samples and populations, but in this case the population isn’t a fixed thing. When we say “English” as a whole we have in mind some idealization that strictly speaking doesn’t exist.

If we want to say, for example, what the frequency of the letter E is in English as a whole, not some particular English corpus, we can’t answer that to too many decimal places. Nor can we say, for example, which letter is the 18th most frequent. Context could easily change the second decimal place in a letter’s frequency or, among the less common letters, its frequency rank.

And yet, for practical purposes we can say E is the most common letter, then T, etc. We can design better Linotype machines and telegraphy codes using our understanding of letter frequency. At the same time, we can’t expect too much of this information. Anyone who has worked a cryptogram puzzle knows that you can’t say with certainty that the most common letter in a particular sample *must* correspond to E, the next to T, etc.

By the way, Peter Norvig’s analysis suggests that ETAOIN SHRDLU should be updated to ETAOIN SRHLDCU.


I’m happier with the paragraph above if you replace “calculus” with “analysis.” Analysis certainly seeks to understand and model things that change continually, but calculus per se is the mechanism of analysis.

I used to think it oddly formal for people to say “differential and integral calculus.” Is there any other kind? Well yes, yes there is, though I didn’t know that at the time. A calculus is a system of rules for computing things. Differential and integral calculus is a system of rules for calculating derivatives and integrals. Lately I’ve thought about other calculi more than differential calculus: propositional calculus, lambda calculus, calculus of inductive constructions, etc.

In my first career I taught (differential and integral) calculus and was frustrated with students who would learn how to calculate derivatives but never understood what a derivative was or what it was good for. In some sense though, they got to the core of what a calculus is. It would be better if they knew what they were calculating and how to apply it, but they still learn something valuable by knowing how to carry out the manipulations. A computer science major, for example, who gets through (differential) calculus knowing how to calculate derivatives without knowing what they are is in a good position to understand lambda calculus later.

]]>**Dimensional analysis** is a well-established method of error detection. Simply checking that you’re not doing something like adding things that aren’t meaningful to add is surprisingly useful for detecting errors. For example, if you’re computing an area, does your result have units of area? This seems like a ridiculously basic question to ask, and yet it is more effective than it would seem.
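As a toy illustration (not a real units library; the class and the dimension encoding are mine), a value that carries its units around can catch meaningless additions automatically:

```python
class Quantity:
    # A value tagged with its dimensions, e.g. {"m": 2} for area.
    def __init__(self, value, dims):
        self.value, self.dims = value, dict(dims)

    def __mul__(self, other):
        # multiplication adds exponents: m * m = m^2
        dims = dict(self.dims)
        for d, p in other.dims.items():
            dims[d] = dims.get(d, 0) + p
        return Quantity(self.value * other.value, dims)

    def __add__(self, other):
        # addition only makes sense for matching dimensions
        if self.dims != other.dims:
            raise TypeError(f"cannot add {self.dims} to {other.dims}")
        return Quantity(self.value + other.value, self.dims)

length = Quantity(3.0, {"m": 1})
width = Quantity(4.0, {"m": 1})
area = length * width        # dims {"m": 2}: units of area, as expected
try:
    area + length            # adding an area to a length is a unit error
except TypeError as e:
    print("caught:", e)
```

Real unit libraries do much more, but even this much turns a silent numeric mistake into a loud type error.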

**Type theory** and **category theory** are extensions of the idea of dimensional analysis. Simply asking of a function “Exactly what kind of thing does it take in? And what sort of thing does it put out?” is a powerful way to find errors. And in practice it may be hard to get answers to these questions. When you’re working with a system that wasn’t designed to make such things explicit and clear, it may be that nobody knows for certain, or people believe they know but have incompatible ideas. Type theory and category theory can do much more than nudge you to define functions well, but even that much is a good start.

Specifying types is more powerful than it seems. With **dependent types**, you can incorporate so much information in the type system that showing that an instance of a type exists is equivalent to proving a theorem. This is a consequence of the **Curry-Howard** **correspondence** (“Propositions as types”) and is the basis for proof assistants like **Coq**.
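As a miniature illustration of propositions-as-types, here is a sketch in Lean: writing down a term of type *A* → *A* is the same act as proving that *A* implies *A*.

```lean
-- Curry–Howard in miniature: constructing a term of type A → A
-- is the same act as proving that A implies A.
theorem a_implies_a (A : Prop) : A → A :=
  fun a => a
```

The identity function is the proof; the type checker verifying the term is the proof checker verifying the argument.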

I’d like to suggest the term **Big Logic** for the application of logic on a large scale, using logic to prove properties of systems that are too complex for a typical human mind to thoroughly understand. It’s well known that systems have gotten larger, more connected, and more complex. It’s not as well known how much the tools of Big Logic have improved, making formal verification practical for more projects than it used to be.

The sides of a spherical triangle are formed by great circular arcs through the vertices. Since the sides are portions of a circle, they can be measured as angles. So in spherical trig you have this interesting interplay of two kinds of angles: the angles formed at the intersections of the sides, and the angles describing the sides themselves.

Here’s how you form the dual of a spherical triangle. Suppose the vertices of the triangle are *A*, *B*, and *C*. Think of the arc connecting *A* and *B* as an equator, and let *C*′ be the corresponding pole that lies on the same side of the arc as the original triangle *ABC*. Do the analogous process to find the points *A*′ and *B*′. The triangle *A*′*B*′*C*′ is the dual of the triangle *ABC*. (This idea goes back to the Persian mathematician Abu Nasr Mansur circa 1000 AD.)

The sides in *A*′*B*′*C*′ are the supplementary angles of the corresponding intersection angles in *ABC*, and the intersection angles in *A*′*B*′*C*′ are the supplementary angles of the corresponding sides in *ABC*.

In his paper “Duality in the formulas of spherical trigonometry,” published in American Mathematical Monthly in 1909, W. A. Granville gives the following duality principle:

If the sides of a spherical triangle be denoted by Roman letters *a*, *b*, *c* and the supplements of the corresponding opposite angles by the Greek letters α, β, γ, then from any given formula involving any of these six parts, we may write down a dual formula simply by interchanging the corresponding Greek and Roman letters.

**Related**: Notes on Spherical Trigonometry

- Constant functions are PR.
- The function that picks the *i*th element of a list of *n* arguments is PR.
- The successor function *S*(*n*) = *n* + 1 is PR.
- PR functions are closed under composition.
- PR functions are closed under **primitive recursion**.

The last axiom obviously gives PR functions their name. So what is primitive recursion? Given a PR function *f* that takes *k* arguments, and another PR function *g* that takes *k* + 2 arguments, the primitive recursion of *f* and *g* is a function *h* of *k* + 1 arguments satisfying two properties:

- *h*(0, *x*_{1}, …, *x*_{k}) = *f*(*x*_{1}, …, *x*_{k})
- *h*(*S*(*y*), *x*_{1}, …, *x*_{k}) = *g*(*y*, *h*(*y*, *x*_{1}, …, *x*_{k}), *x*_{1}, …, *x*_{k})
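The two defining equations translate directly into code. Here is a sketch (the function names are mine) that builds *h* from *f* and *g* by iteration, then uses it to define addition and multiplication:

```python
def primitive_recursion(f, g):
    # h(0, *xs) = f(*xs);  h(y + 1, *xs) = g(y, h(y, *xs), *xs)
    def h(y, *xs):
        acc = f(*xs)
        for i in range(y):
            acc = g(i, acc, *xs)
        return acc
    return h

# addition: add(y, x) = x + y, with f(x) = x and g(y, r, x) = S(r)
add = primitive_recursion(lambda x: x, lambda y, r, x: r + 1)

# multiplication: mul(y, x) = x * y, with f(x) = 0 and g(y, r, x) = add(r, x)
mul = primitive_recursion(lambda x: 0, lambda y, r, x: add(r, x))

print(add(3, 4))  # 7
print(mul(3, 4))  # 12
```

Because the recursion always counts down from *y*, a primitive recursive definition is essentially a bounded loop.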

Not every computable function is primitive recursive. For example, Ackermann’s function is a **general recursive function**, but not a primitive recursive function. General recursive functions are Turing complete. Turing machines, lambda calculus, and general recursive functions are all equivalent models of computation, but the first two are better known.
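Ackermann’s function is easy to write with general recursion, which is exactly the point: it grows faster than any primitive recursive function, so no bounded-loop definition can express it.

```python
def ackermann(m, n):
    # general recursive but not primitive recursive
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

print(ackermann(2, 3))  # 9
print(ackermann(3, 3))  # 61
```

Even modest arguments blow up quickly: ackermann(4, 2) already has 19,729 decimal digits, so don’t try much beyond the values above.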

For this post, the main thing about general recursive functions is that, as the name implies, they are more general than primitive recursive functions.

Now we switch from functions to sets. The **characteristic function** of a set *A* is the function that is 1 for elements of *A* and zero everywhere else. In other areas of math, there is a sort of duality between functions and sets via characteristic functions. For example, the indicator function of a measurable set is a measurable function. And the indicator function of a convex set is a convex function. But in recursive functions, there’s an unexpected wrinkle in this analogy.

A set of integers is **recursively enumerable** if it is either empty or the image of a **general** recursive function. But there’s a theorem, due to Alonzo Church, that a non-empty recursively enumerable set is actually the image of a **primitive** recursive function. So although general recursive functions are more general, you can’t tell that from looking at function images. For example, although the Ackermann function is not primitive recursive, there is a primitive recursive function with the same image.

In finite dimensional Euclidean space, linear functions are continuous. You can put a different norm on a Euclidean space than the one it naturally comes with, but all norms give rise to the same topology and hence the same continuous functions. (This is useful in numerical analysis where you’d like to look at a variety of norms. The norms give different analytical results, but they’re all topologically equivalent.)

In an infinite dimensional normed space, linear functions are not necessarily continuous. If the dimension of a space is only a trillion, all linear functions are continuous, but when you jump from high dimension to infinite dimension, you can have discontinuous linear functions. But if you look at this more carefully, there isn’t a really sudden change.

If a linear function is discontinuous, its finite dimensional approximations are continuous, but the **degree** of continuity is degrading as dimension increases. For example, suppose a linear function stretches the *n*th basis vector by a factor of *n*. The bigger *n* gets, the more the function stretches in the *n*th dimension. As long as *n* is bounded, this is continuous, but in a sense it is less continuous as *n* increases. The fact that the infinite dimensional version is discontinuous tells you that the finite dimensional versions, while technically continuous, scale poorly with dimension. (See practical continuity for more discussion along these lines.)

A Banach space is a *complete* normed linear space. Finite dimensional normed spaces are always complete (i.e. every Cauchy sequence in the space converges to a point in the space) but this might not happen in infinite dimensions.

In basic linear algebra, the dual of a vector space *V* is the space of linear functionals on *V*, i.e. the set of linear maps from *V* to the reals. This space is denoted *V*^{*}. If *V* has dimension *n*, *V*^{*} has dimension *n*, and all *n*-dimensional spaces are isomorphic, so the distinction between a space and its dual seems pedantic. But in general a Banach space and its dual are not isomorphic, and so it’s easier to tell them apart.

The second dual of a vector space, *V*^{**}, is the dual of the dual space. In finite dimensional spaces, *V*^{**} is naturally isomorphic to *V*. In Banach spaces, *V* is isomorphic to a *subspace* of *V*^{**}. And even when *V* is isomorphic to *V*^{**}, it might not be *naturally* isomorphic to *V*^{**}. (Here “natural” means natural in the category theory sense of natural transformations.)

**JC**: Can you say a little about yourself?

**TD**: I’m Trevor Dawson, I’m 25, born in the California Bay Area. I enjoy woodworking, ceramics, soccer, and travelling. I consider myself an environmentalist.

**JC**: What is your role at Borrego Solar?

**TD**: I am a Cost Reduction Analyst and I focus on applying Lean principles to identify and remove waste from both our internal processes and construction in the field. I use data to identify problems, prioritize them, and to verify the effectiveness of our solutions. I work with a variety of teams to reduce the cost or time of our projects.

Solar is a very fast-paced industry. Policy changes and technological improvements are being developed quickly and we have to respond quickly. A key function of my job is to assign measurable cost benefits to new practices and help ensure Borrego Solar continues to be an industry leader.

**JC**: What is your technical background and education?

**TD**: I graduated with a Bachelor of Science in Industrial & Systems Engineering (IE) from the University of Washington. I spent 3.5 years as an IE implementing process improvements on Boeing’s 777 Manufacturing Wing Line in Seattle, WA. I gained valuable experience in Lean, schedule optimization, design of experiments, and big data efforts. At Borrego, I get to apply those skills to help accelerate the adoption of the most time-tested, renewable energy source of all: the sun.

**JC**: What math, physics, or technical skills do you use every day?

**TD**: Addition, algebra, and simple statistics. I like to think I’ve mastered the basics. I also use a lot of my industrial engineering training to help gather and analyze data like design of experiments, time studies, and lean problem solving methodology.

I mostly work in Excel and use Power Pivot to drive large, cumbersome data into neat summary tables. Although the analysis can be a challenge, the hard work is rolling it up and presenting it in a way that is meaningful and convincing. When you’re suggesting a business decision, especially when it challenges the norm, your internal customers want to know the answer but they are equally interested in your process. For example, how does the business case change if a defined constraint or coefficient changes? The solar industry is dynamic and still maturing, so we have to be especially poised in our decision-making.

**JC**: What do you use much less than you expected?

**TD**: Calculus. I spent so much time learning calculus and even other things like differential equations but haven’t had much opportunity to apply them. However, I do think calculus taught me important practical problem solving skills and I put that to use now tackling large problems that span multiple pages.

**JC**: What math or technical skill do you wish you had more of or understood better?

**TD**: Excel programming and design. Excel rules the world, and although I was introduced to it at school, I think more intense courses should be commonplace. Regarding design, execution is the hardest part of any business decision, and design would help communicate results and suggestions much more effectively. A business needs verifiable proof that the suggested change is real and if executed will perform as predicted. This stage of verifying the effectiveness of a project could be improved with better design skills and may even reduce the amount of touch time and communications all the way through from inception to completion of a project.

**JC**: Anything else you’d like us to know?

**TD**: Go solar!

*S*^{2} = *A*^{2} + *B*^{2} + *C*^{2}.

There’s an elegant proof of this theorem here using differential forms. Below I’ll sketch a less elegant but more elementary proof.

You could prove the identity above by using the fact that the area of a triangle spanned by two vectors is half the length of their cross product. Suppose *a*, *b*, and *c* are the locations of the three corners of *T*. Then

4*S*^{2} = *v*^{2}

where

*v* = (*a* – *b*) × (*c* – *b*)

and by *v*^{2} we mean the dot product of *v* with itself.

Write out the components of *v*^{2} and you get three squared terms. Notice that when you set the *x* components to zero, i.e. project onto the *yz* plane, the first of the three terms is unchanged and the other two become zero. In other words, the first of the three terms of *v*^{2} is 4*A*^{2}, since the first component of *v* is twice the area of the projected triangle. A similar argument shows that the other two terms are 4*B*^{2} and 4*C*^{2}, and dividing by 4 gives the identity.
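The identity is easy to check numerically. Here is a sketch using NumPy, with an arbitrary random triangle:

```python
import numpy as np

rng = np.random.default_rng(0)

def area(p, q, r):
    # Area of the triangle with vertices p, q, r: half the length
    # of the cross product of two edge vectors.
    return np.linalg.norm(np.cross(q - p, r - p)) / 2

a, b, c = rng.standard_normal((3, 3))   # a random triangle in 3-space

S = area(a, b, c)
# Projections onto the yz, xz, and xy coordinate planes:
# zero out the x, y, and z coordinates respectively.
A = area(a * [0, 1, 1], b * [0, 1, 1], c * [0, 1, 1])
B = area(a * [1, 0, 1], b * [1, 0, 1], c * [1, 0, 1])
C = area(a * [1, 1, 0], b * [1, 1, 0], c * [1, 1, 0])

print(np.isclose(S**2, A**2 + B**2 + C**2))  # True
```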


… it was a question of Amistics, which was a term that had been coined ages ago by a Moiran anthropologist to talk about the choices that different cultures made as to which technologies they would, and would not, make part of their lives. The word went all the way back to the Amish people … All cultures did this, frequently without being consciously aware that they had made collective choices.

Related post by Kevin Kelly: Amish Hackers

It turns out there’s a remarkable closed-form solution. The number of ways to tile an *m* by *n* board with dominoes is

( ∏_{k=1}^{m} ∏_{l=1}^{n} ( 2 cos(π*k*/(*m*+1)) + 2*i* cos(π*l*/(*n*+1)) ) )^{1/2}

Here are some questions you may have.

*But what if n and m are both odd? You can’t tile such a board with dominoes.*

Yes, in that case the formula evaluates to zero.

*Do you need an absolute value somewhere? Or a floor or ceiling?*

No. It looks like the double product could be a general complex number, but it’s real. In fact, it’s always the square of an integer.

**Update**: Apparently the formula *does* need an absolute value, not to turn complex values into real values but to turn negative integers into positive ones. See Aaron Meurer’s example below.

*Does it work numerically?*

Apparently so. If you evaluated the product in a package that could manipulate the cosines symbolically, the result would be exact. In floating point it cannot be, but at least in my experiments the result is correct when rounded to the nearest integer. For example, there are 12,988,816 ways to tile a standard 8 by 8 chessboard with dominoes, and the following Python script returns 12988816.0. For sufficiently large arguments the result will not always round to the correct answer, but for moderate-sized arguments it should.

```python
from numpy import pi, cos, sqrt

def num_tilings(m, n):
    prod = 1
    for k in range(1, m+1):
        for l in range(1, n+1):
            prod *= 2*cos(pi*k/(m+1)) + 2j*cos(pi*l/(n+1))
    return sqrt(abs(prod))

print(num_tilings(8, 8))
```

*The code looks wrong. Shouldn’t the ranges go up to m and n?*

No, Python ranges are half-open intervals: `range(a, b)` goes from `a` to `b-1`. That looks unnecessarily complicated, but it makes some things easier.

*You said that there was no need for absolute values, but your code has one.*

Yes, because while in theory the imaginary part will be exactly zero, in floating point arithmetic the imaginary part might be small but not zero.

*Where did you find this formula?*

Thirty-three Miniatures: Mathematical and Algorithmic Applications of Linear Algebra

Amplitude modulation multiplies a carrier signal by

1 + *d* sin(2π *f t*)

where *d* is the modulation depth, *f* is the modulation frequency, and *t* is time.

Here are some examples you can listen to. We use a pure 1000 Hz tone and Gaussian white noise as carriers, and vary modulation depth and frequency continuously over 10 seconds. The modulation depth examples vary depth from 0 to 1, and the modulation frequency examples vary frequency from 0 to 120 Hz.

First, here’s a pure tone with increasing modulation depth.

Next we vary the modulation frequency.

Now we switch over to Gaussian white noise, first varying depth.

And finally white noise with varying modulation frequency. This one sounds like a prop-driven airplane taking off.
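Clips like these are easy to generate. Here is a sketch for the tone-with-increasing-depth case; the sample rate and the fixed modulation frequency are my own assumptions, not necessarily those used for the audio above:

```python
import numpy as np

fs = 44100                      # sample rate in Hz (an assumption)
T = 10.0                        # clip length in seconds
t = np.arange(int(fs * T)) / fs

carrier = np.sin(2 * np.pi * 1000 * t)   # 1000 Hz pure tone

# Sweep the modulation depth d linearly from 0 to 1 over the clip,
# holding the modulation frequency fixed (4 Hz here, an assumption).
d = t / T
f = 4
signal = (1 + d * np.sin(2 * np.pi * f * t)) * carrier

# Normalize to [-1, 1] before writing out as audio.
signal /= np.max(np.abs(signal))
print(signal.shape)  # (441000,)
```

Swapping the carrier for `np.random.default_rng().standard_normal(len(t))` gives the white noise versions.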

**Related**: Psychoacoustics consulting

In linear algebra, vector spaces have a field of scalars. This is where the coefficients in linear combinations come from. For real vector spaces, the scalars are real numbers. For complex vector spaces, the scalars are complex numbers. For vector spaces over any field *K*, the elements of *K* are called scalars.

But there is a more restrictive use of *scalar* in tensor calculus. There a scalar is **not just a number, but a number whose value does not depend on one’s choice of coordinates**. For example, the temperature at some location is a scalar, but the first coordinate of a location depends on your choice of coordinate system. Temperature is a scalar, but *x*-coordinate is not. Scalars are numbers, but not all numbers are scalars.

The linear algebraic use of *scalar* is more common among mathematicians, the coordinate-invariant use among physicists. The two uses of *scalar* are a special case of the two uses of *tensor* described in the previous post. Linear algebra thinks of tensors simply as things that take in vectors and return numbers. The physics/tensor analysis view of tensors includes behavior under changes of coordinates. You can think of a scalar as a 0th order tensor, one that behaves as simply as possible under a change of coordinates, i.e. doesn’t change at all.

In the second post we said that a tensor is a multilinear functional. A *k*-tensor takes *k* vectors and returns a number, and it is linear in each argument if you hold the rest constant. We mentioned that this relates to the “box of numbers” idea of a tensor. You can describe how a *k*-tensor acts by writing out *k* nested sums. The terms in these sums are called the **components** of the tensor.

Tensors are usually defined in a way that has more structure. They vary from point to point in a space, and they do so in a way that in some sense is independent of the coordinates used to label these points. At each point you have a tensor in the sense of a multilinear functional, but the emphasis is usually on the changes of coordinates.

Tensors in the sense that we’re talking about here come in two flavors: covariant and contravariant. They also come in mixtures; more on that later.

We consider two coordinate systems, one denoted by *x*’s and another by *x*’s with bars on top. The components of a tensor in the *x*-bar coordinate system will also have bars on top. For a covariant tensor of order one, the components satisfy

*T̄*_{i} = (∂*x*^{r} / ∂*x̄*^{i}) *T*_{r}.

First of all, coordinates are written with superscripts. So *x*^{r} is the *r*th coordinate, not *x* to the power *r*. Also, this uses Einstein summation notation: there is an implicit sum over repeated indexes, in this case over *r*.

The components of a contravariant tensor of order one satisfy a similar but different equation:

*T̄*^{i} = (∂*x̄*^{i} / ∂*x*^{r}) *T*^{r}.

The components of a covariant tensor are written with subscripts, and the components of a contravariant tensor with superscripts. In the equation for covariant components, the partial derivatives are with respect to the new coordinates, the *x* bars. In the equation for contravariant components, the partial derivatives are with respect to the original coordinates, the *x*‘s. Mnemonic: when the indexes go down (covariant tensors) the new coordinates go down (in the partial derivatives). When the indexes go up, the new coordinates go up.
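A numerical sanity check helps here. The gradient of a scalar function is the classic covariant vector of order one, and for a linear change of coordinates x̄ = *A*x the partial derivatives ∂*x*^{r}/∂*x̄*^{i} are just the entries of *A*⁻¹. The function and matrix below are arbitrary choices for illustration:

```python
import numpy as np

def grad(f, p, h=1e-6):
    # Central-difference gradient of f at the point p.
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

# A scalar function and a linear change of coordinates x_bar = A x.
f = lambda x: np.sin(x[0]) + x[0] * x[1] ** 2
A = np.array([[3.0, 1.0], [0.5, 2.0]])
Ainv = np.linalg.inv(A)

p = np.array([0.7, -1.2])        # a point in x coordinates
p_bar = A @ p                    # the same point in x-bar coordinates

f_bar = lambda xb: f(Ainv @ xb)  # f expressed in the new coordinates

T = grad(f, p)                   # covariant components in x coordinates
T_bar = grad(f_bar, p_bar)       # components in x-bar coordinates

# Covariant law: T_bar_i = (dx^r / dx_bar^i) T_r, and for this linear
# change of coordinates dx^r / dx_bar^i is Ainv[r, i].
print(np.allclose(T_bar, Ainv.T @ T, atol=1e-5))  # True
```

The new coordinates appear “down” in the partial derivatives, matching the mnemonic for lower indexes.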

For covariant tensors of order two, the change of coordinates formula is

*T̄*_{ij} = (∂*x*^{r} / ∂*x̄*^{i}) (∂*x*^{s} / ∂*x̄*^{j}) *T*_{rs}.

Here the summation convention says that there are two implicit sums, one over *r* and one over *s*.

The contravariant counterpart says

*T̄*^{ij} = (∂*x̄*^{i} / ∂*x*^{r}) (∂*x̄*^{j} / ∂*x*^{s}) *T*^{rs}.

In general you could have tensors that are a mixture of covariant and contravariant. A tensor with covariant order *p* and contravariant order *q* has *p* subscripts and *q* superscripts. The partial derivatives have *x*-bars on bottom corresponding to the covariant components and *x*-bars on top corresponding to contravariant components.

We initially said a tensor was a multilinear functional. A tensor of order *k* takes *k* vectors and returns a number. Now we’d like to refine that definition to take two kinds of vectors. A tensor with covariant order *p* and contravariant order *q* takes *p* contravariant vectors and *q* covariant vectors. In linear algebra terms, instead of simply taking *k* elements of a vector space *V*, we say our tensor takes *p* vectors from *V* and *q* vectors from the dual space *V*^{*}.

You may be familiar with the terms *covariant* and *contravariant* from category theory, or its application to object oriented programming. The terms are related. As Michael Spivak explains, “It’s very easy to remember which kind of vector field is covariant, and which is contravariant — it’s just the opposite of what it logically ought to be [from category theory].”

For example, you may have seen tensor product splines. Suppose you have a function over a rectangle that you’d like to approximate by patching together polynomials so that the interpolation has the specified values at grid points, and the patches fit together smoothly. In one dimension, you do this by constructing splines. Then you can bootstrap your way from one dimension to two dimensions by using tensor product splines. A tensor product spline in *x* and *y* is a sum of terms consisting of a spline in *x* and a spline in *y*. Notice that a tensor product spline is not simply a product of two ordinary splines, but a *sum* of such products.

If *X* is the vector space of all splines in the *x*-direction and *Y* the space of all splines in the *y*-direction, the space of tensor product splines is the tensor product of the spaces *X* and *Y*. Suppose the set *s*_{i}, for *i* running from 1 to *n*, is a basis for *X*. Similarly, suppose *t*_{j}, for *j* running from 1 to *m*, is a basis for *Y*. Then the products *s*_{i} *t*_{j} form a basis for the tensor product of *X* and *Y*, the tensor product splines over the rectangle. Notice that if *X* has dimension *n* and *Y* has dimension *m*, then their tensor product has dimension *nm*. Notice also that if we only allowed products of splines, not sums of products of splines, we’d get a much smaller set, one described by only *n* + *m* parameters.
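Here is a small numerical sketch of the dimension count, using monomials as stand-in 1-D bases (real B-spline bases would behave the same way):

```python
import numpy as np

n, m = 4, 3
x = np.linspace(0, 1, 7)
y = np.linspace(0, 1, 5)

# Toy 1-D bases sampled on grids: n monomials in x, m monomials in y.
S = np.array([x**i for i in range(n)])        # shape (n, 7)
T = np.array([y**j for j in range(m)])        # shape (m, 5)

# Tensor product basis: all n*m products s_i(x) t_j(y), each one an
# outer product on the grid.
basis = np.array([np.outer(S[i], T[j]) for i in range(n) for j in range(m)])
print(basis.shape)   # (12, 7, 5)

# Flattening each basis surface and checking the rank shows the n*m
# products are linearly independent: the space has dimension n*m = 12,
# not n + m = 7.
print(np.linalg.matrix_rank(basis.reshape(n * m, -1)))   # 12
```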

We can use the same process to define the tensor product of any two vector spaces. A basis for the tensor product is all products of basis elements in one space and basis elements in the other. There’s a more general definition of tensor products that doesn’t involve bases sketched below.

You can also define tensor products of *modules*, a generalization of vector spaces. You could think of a module as a vector space where the scalars come from a ring instead of a field. Since rings are more general than fields, modules are more general than vector spaces.

The tensor product of two modules over a commutative ring is defined by taking the Cartesian product and modding out by the relations necessary to make things bilinear. (This description is very hand-wavy. A detailed presentation needs its own blog post or two.)

Tensor products of modules hold some surprises. For example, let *m* and *n* be two relatively prime integers. You can think of the integers mod *m* or mod *n* as a module over the integers. The tensor product of these modules is zero because you end up modding out by everything: since *m* and *n* are relatively prime, 1 = *am* + *bn* for some integers *a* and *b*, and so every elementary tensor satisfies *x* ⊗ *y* = *a*(*mx* ⊗ *y*) + *b*(*x* ⊗ *ny*) = 0. This kind of collapse doesn’t happen over vector spaces.

The first two posts in this series:

I plan to leave the algebraic perspective aside for a while, though as I mentioned above there’s more to come back to.

Next I plan to write about the analytic/geometric view of tensors. Here we get into things like changes of coordinates and it looks at first as if a tensor is something completely different.

A dot product is an example of a tensor. It takes two vectors and returns a number. And it’s linear in each argument. Suppose you have vectors *u*, *v*, and *w*, and a real number *a*. Then the dot product (*u* + *v*, *w*) equals (*u*, *w*) + (*v*, *w*) and (*au*, *w*) = *a*(*u*, *w*). This shows that the dot product is linear in its first argument, and you can show similarly that it is linear in the second argument.

Determinants are also tensors. You can think of the determinant of an *n* by *n* matrix as a function of its *n* rows (or columns). This function is linear in each argument, so it is a tensor.
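Both examples are easy to check numerically. Here is a sketch verifying that the determinant of a 3 by 3 matrix is linear in its first row, with the other rows held fixed (the vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
u, v, w, z = rng.standard_normal((4, 3))
a = 2.5

# det as a function of the three rows of a 3 by 3 matrix
det = lambda r1, r2, r3: np.linalg.det(np.array([r1, r2, r3]))

# Additivity and homogeneity in the first row:
print(np.isclose(det(u + v, w, z), det(u, w, z) + det(v, w, z)))  # True
print(np.isclose(det(a * u, w, z), a * det(u, w, z)))             # True
```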

The introduction to this series mentioned the interpretation of tensors as a box of numbers: a matrix, a cube, etc. This is consistent with our definition because you can write a multilinear functional as a sum. For every vector that a tensor takes in, there is an index to sum over. A tensor taking *n* vectors as arguments can be written as *n* nested summations. You could think of the coefficients of this sum being spread out in space, each index corresponding to a dimension.

Tensor products are simple in this context as well. If you have a tensor *S* that takes *m* vectors at a time, and another tensor *T* that takes *n* vectors at a time, you can create a tensor that takes *m* + *n* vectors by sending the first *m* of them to *S*, the rest to *T*, and multiply the results. That’s the tensor product of *S* and *T*.
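In code this construction is nearly a one-liner. The particular tensors below are arbitrary choices for illustration: *S* is the dot product and *T* is some linear functional.

```python
import numpy as np

# S is a 2-tensor (the dot product); T is a 1-tensor
# (an arbitrary linear functional, chosen for illustration).
S = lambda u, v: u @ v
T = lambda w: w[0] + 2 * w[1] - w[2]

# The tensor product of S and T takes 2 + 1 vectors: feed the
# first m to S, the rest to T, and multiply the results.
def tensor_product(S, m, T):
    return lambda *vecs: S(*vecs[:m]) * T(*vecs[m:])

ST = tensor_product(S, 2, T)

rng = np.random.default_rng(3)
u, v, w = rng.standard_normal((3, 3))
print(np.isclose(ST(u, v, w), (u @ v) * T(w)))  # True
```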

The discussion above about tensors and tensor products still leaves a lot of questions unanswered. We haven’t considered the most general definition of tensor or tensor product. And we haven’t said anything about how tensors arise in application, or what they have to do with geometry or changes of coordinates. I plan to address these issues in future posts. I also plan to write about other things in between posts on tensors.

**Next post in series**: Tensor products


The word “tensor” is shrouded in mystery. The same term is applied to many different things that don’t appear to have much in common with each other.

You might have heard that a tensor is a box of numbers. Just as a matrix is a rectangular collection of numbers, a tensor could be a cube of numbers or even some higher-dimensional set of numbers.

You might also have heard that a tensor is something that has upper and lower indices, such as the Riemann tensor above, things that have arcane manipulation rules such as “Einstein summation.”

Or you might have heard that a tensor is something that changes coordinates like a tensor. A tensor is as a tensor does. Something that behaves the right way under certain changes of variables is a tensor.

And then there are things that aren’t called tensors, but they have tensor *products*. These seem simple enough in some cases—you think “I didn’t realize that has a name. So it’s called a tensor product. Good to know.” But then in other cases tensor products seem more elusive. If you look in an advanced algebra book hoping for a simple definition of a tensor product, you might be disappointed and feel like the book is being evasive or even poetic, because it describes what a tensor product *does* rather than what it *is*. That is, the definition is behavioral rather than constructive.

What do all these different uses of the word “tensor” have to do with each other? Do they have anything to do with the TensorFlow machine learning library that Google released recently? That’s something I’d like to explore over a series of blog posts.

**Next posts in the series**:

The above quote makes me think of a connection Fourier made between triangles and thermodynamics.

Trigonometric functions were first studied because they relate angles in a right triangle to ratios of the lengths of the triangle’s sides. For the most basic applications of trigonometry, it only makes sense to consider positive angles smaller than a right angle. Then somewhere along the way someone discovered that it’s convenient to define trig functions for any angle.

Once you define trig functions for any angle, you begin to think of these functions as being associated with *circles* rather than triangles. More advanced math books refer to trig functions as *circular* functions. The triangles fade into the background. They’re still there, but they’re drawn inside a circle. (Hyperbolic functions are associated with hyperbolas the same way circular functions are associated with circles.)

Now we have functions that historically arose from studying triangles, but they’re defined on the whole real line. And we ask the kinds of questions about them that we ask about other functions. How fast do they change from point to point? How fast does their rate of change change? And here we find something remarkable. The rate of change of a sine function is proportional to a cosine function and vice versa. And if we look at the rate of change of the rate of change (the second derivative or acceleration), sine functions yield more sine functions and cosine functions yield more cosine functions. In more sophisticated language, sines and cosines are eigenfunctions of the second derivative operator.

Here’s where thermodynamics comes in. You can use basic physics to derive an equation for describing how heat in some object varies over time and location. This equation is called, surprisingly enough, the heat equation. It relates second derivatives of heat in space with first derivatives in time.

Fourier noticed that the heat equation would be easy to solve if only he could work with functions that behave very nicely with regard to second derivatives, i.e. sines and cosines! If only everything were sines and cosines. For example, the temperature in a thin rod over time is easy to determine if the initial temperature distribution is given by a sine wave. Interesting, but not practical.
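Here is a quick finite-difference check that a decaying sine solves the heat equation *u*_t = *k* *u*_xx on a rod (the constants are arbitrary choices for illustration):

```python
import numpy as np

k, n = 0.5, 3   # diffusivity and mode number (assumed values)

# If the initial temperature on [0, pi] is sin(n x), the solution of
# u_t = k u_xx is the same sine, decaying exponentially in time.
u = lambda x, t: np.exp(-k * n**2 * t) * np.sin(n * x)

# Check the PDE with central differences at a sample point.
x, t, h = 1.1, 0.3, 1e-5
u_t  = (u(x, t + h) - u(x, t - h)) / (2 * h)
u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h**2
print(np.isclose(u_t, k * u_xx, atol=1e-4))  # True
```

The sine shape never changes; only its amplitude decays, which is what being an eigenfunction of the second derivative buys you.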

However, the initial distribution doesn’t have to be a sine, or a cosine. We can still solve the heat equation if the initial distribution is a sum of sines. And if the initial distribution is approximately a sum of sines and cosines, then we can compute an approximate solution to the heat equation. So what functions are approximately a sum of sines and cosines? All of them!

Well, not quite all functions. But lots of functions. More functions than people originally thought. Pinning down exactly what functions can be approximated arbitrarily well by sums of sines and cosines (i.e. which functions have convergent Fourier series) was a major focus of 19th century mathematics.
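For a concrete example, here are partial sums of the Fourier sine series for a square wave. The maximum error away from the jumps shrinks as more terms are added (the grid choices below are arbitrary):

```python
import numpy as np

# Square wave: +1 on (0, pi). Sample away from the jumps at 0 and pi,
# where convergence is slow (the Gibbs phenomenon).
x = np.linspace(0.5, np.pi - 0.5, 500)

# Fourier sine series of the square wave:
# (4/pi) * sum of sin((2k+1) x) / (2k+1) over k = 0, 1, 2, ...
def partial_sum(x, N):
    k = np.arange(N)
    terms = np.sin(np.outer(x, 2 * k + 1)) / (2 * k + 1)
    return (4 / np.pi) * terms.sum(axis=1)

errors = [np.max(np.abs(partial_sum(x, N) - 1)) for N in [1, 10, 100]]
print(errors)  # decreasing: more terms, better approximation
```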

So if someone asks what use they’ll ever have for trig identities, tell them they’re important if you want to solve the heat equation. That’s where I first used some of these trig identities often enough to remember them, and that’s a fairly common experience for people in math and engineering. Solving the heat equation reviews everything you learn in trigonometry, even though there are not necessarily any triangles or circles in sight.

I find Shinichi Mochizuki’s proof of the abc conjecture fascinating. Not the content of the proof—which I do not understand in the least—but the circumstances of the proof. Most mathematics, no matter how arcane it appears to outsiders, is not that original. Experts in a specific area can judge just how much or how little a new paper adds to their body of knowledge, and usually it’s not much. But Mochizuki’s work is such a departure from the mainstream that experts have been trying to understand it for four years now.

Five days ago, Nature published an article headlined Monumental Proof to Torment Mathematicians for Years to Come.

… Kedlaya says that the more he delves into the proof, the longer he thinks it will take to reach a consensus on whether it is correct. He used to think that the issue would be resolved perhaps by 2017. “Now I’m thinking at least three years from now.”

Others are even less optimistic. “The constructions are generally clear, and many of the arguments could be followed to some extent, but the overarching strategy remains totally elusive for me,” says mathematician Vesselin Dimitrov of Yale University in New Haven, Connecticut. “Add to this the heavy, unprecedentedly indigestible notation: these papers are unlike anything that has ever appeared in the mathematical literature.”

But today, New Scientist has an article with the headline Mathematicians finally starting to understand epic ABC proof. According to this article,

At least 10 people now understand the theory in detail, says Fesenko, and the IUT papers have almost passed peer review so should be officially published in a journal in the next year or so.

It’s interesting that the proof is so poorly understood that experts disagree on just how well it is understood.

**Related posts**: