When does the thing on the left equal the thing on the right?

A few things could go wrong:

- Maybe not all the terms on the right side exist, i.e. the function *f* might not be infinitely differentiable.
- Maybe *f* is infinitely differentiable but the series diverges.
- Maybe *f* is infinitely differentiable but the series converges to something other than *f*.

The canonical example of the last problem is the function *g*(*x*) defined to be exp(-1/*x*^{2}) for positive *x* and 0 otherwise. This function is so flat as it approaches the origin that all of its derivatives are zero there. So all the coefficients in Taylor’s formula are zero. But the function *g* is positive for positive *x*.

Everything above can be found in a standard calculus text, but the following results are hardly known.

Being infinitely differentiable is necessary but not sufficient for a function to be real analytic. What additional requirements would be sufficient?

We start by examining what went wrong with the function *g*(*x*) above. It has a Taylor series at every point, but the radii of convergence go to zero as we approach the origin. (At the origin itself the Taylor series is identically zero, so it converges everywhere, but to the zero function rather than to *g*.)

Alfred Pringsheim claimed that it is enough to require that the radii of convergence be bounded below on an interval. If we let ρ(*x*) be the radius of convergence for a series centered at *x*, then a function *f* is analytic in an open interval *J* if ρ(*x*) has some lower bound δ > 0 in *J*. Pringsheim’s theorem was correct, but his proof was flawed. The proof was accepted for 40 years until Ralph Boas discovered the flaw.

Here is a generalization of Pringsheim’s theorem. (It’s not clear to me from my source [1] who first proved this theorem. Perhaps Boas, but in the same context Boas mentions the work of others on the problem.)

Let *f* be infinitely differentiable in an open interval *J*. Suppose the radius of convergence ρ(*x*) for a Taylor series centered at *x* is positive for each *x* in the interval *J*. Suppose also that for every point *p* in *J* we have

lim sup_{*x* → *p*} ρ(*x*)/|*x* – *p*| > 1.

Then *f* is analytic on the interval *J*.

Note that if ρ(*x*) is bounded below by δ > 0 then the limit above is infinite and so Pringsheim’s theorem follows as a special case.

The function *g*(*x*) above does not satisfy the hypothesis of the theorem on any open interval around 0 because if we set *p* = 0, the limit above is 1, not greater than 1.

* * *

[1] R. P. Boas. When Is a *C*^{∞} Function Analytic? Mathematical Intelligencer, Vol 11, No. 4, pp. 34–37.


Here’s how I’ve approached this dilemma in the past. Information and outcomes are not directly comparable, so any objective function combining the two is going to add incommensurate things. One way around this is to put a value not on information *per se* but on what you’re going to **do** with the information.

In the context of a clinical trial, you want to treat patients in the trial effectively, and you want a high probability of picking the best treatment at the end. You can’t add patients and probabilities meaningfully. But **why** do you want to know which treatment is best? **So you can treat future patients effectively.** The value of knowing which treatment is best is the increase in the expected number of future successful treatments. This gives you a meaningful objective: maximize the expected number of effective treatments, counting both patients in the trial and future patients treated as a result of the trial.

The hard part is guessing what will be done with the information learned in the trial. Of course this isn’t exactly known, but it’s the key to estimating the value of the information. If nobody will be treated any differently in the future no matter how the trial turns out—and unfortunately this may be the case—then the trial should be optimized strictly for the benefit of the people in the trial. On the other hand, if a trial will determine the standard of care for thousands of future patients, then there is more value in picking the best treatment.

What is a logarithm? It’s the solution to an exponential equation. For example, the logarithm base 10 of 2 is the solution to the equation 10^{x} = 2, and so *x* = 0.30103, approximately. Similarly, you could look for the logarithm base 10 of 2 modulo 19. This is an integer value of *x* such that 10^{x} = 2 mod 19, and you can show that 17 is a solution.

If we work modulo an integer *n*, the discrete logarithm base *a* of a number *y* is an integer *x* such that *a*^{x} = *y*. For some values of *a* there may not be a solution, but there will be a solution if *a* is a generator of the multiplicative group of integers mod *n*.

Brute force is the simplest algorithm for finding discrete logarithms: try *x* = 1, 2, 3, …, *n* until you find a value of *x* satisfying *a*^{x} = *y*. The problem with brute force is that the expected run time is on the order of *n*, and *n* is often very large in applications.

Discrete logarithms are expensive to compute, but we can do better than brute force. (Cryptographic applications take advantage of the fact that discrete logarithms are hard to compute.) There’s a simple algorithm by Daniel Shanks, known as the baby-step giant-step algorithm, that reduces the run time from order *n* to order roughly √*n*. (Actually O(√*n* log *n*) for reasons we’ll see soon.)

Let *s* be the ceiling of the square root of *n*. The idea is to search for *x* by taking giant steps forward (multiples of *s*) and baby (single) steps backward.

Let *G* be the set of pairs (*a*^{gs} mod *n*, *gs*) where *g* ranges from 1 to *s*. These are the giant steps.

Let *B* be the set of pairs (*ya*^{b} mod *n*, *b*) where *b* ranges from 0 to *s* – 1. These are the baby steps.

Now look for a pair in *G* and a pair in *B* with matching first components, i.e. *ya*^{b} = *a*^{gs}. Then *x* = *gs* – *b* is the required solution.

Constructing the sets *G* and *B* requires O(*s*) operations, and matching the two sets takes O(*s* log *s*) operations.

Here’s an example, going back to the problem above of finding the discrete log base 10 of 2 mod 19, using a little Python code to help with the calculations.

The square root of 19 is between 4 and 5 so *s* = 5.

>>> [(2*10**r % 19, r) for r in range(5)]
[(2, 0), (1, 1), (10, 2), (5, 3), (12, 4)]

>>> [(10**(5*t) % 19, 5*t) for t in range(1, 6)]
[(3, 5), (9, 10), (8, 15), (5, 20), (15, 25)]

The number 5 appears as the first element of a pair in both *B* and *G*. We have (5, 3) in *B* and (5, 20) in *G* so *x* = 20 – 3 = 17.
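Putting the pieces together, here is a sketch of the whole algorithm in Python. (The function name `bsgs` and the use of a dictionary for the matching step are my choices, not part of Shanks’ original description.)

```python
from math import isqrt

def bsgs(a, y, n):
    """Solve a**x = y (mod n) for x by baby-step giant-step.
    Returns None if no match is found."""
    s = isqrt(n) + 1  # at least the ceiling of sqrt(n)
    # Baby steps: store y * a^b mod n for b = 0, ..., s-1.
    # A dict makes the matching step O(1) per lookup.
    baby = {y * pow(a, b, n) % n: b for b in range(s)}
    # Giant steps: look for a^(g*s) mod n among the baby steps.
    for g in range(1, s + 1):
        v = pow(a, g * s, n)
        if v in baby:
            return g * s - baby[v]
    return None

print(bsgs(10, 2, 19))  # the example above: 17
```

The dictionary replaces the O(*s* log *s*) sort-and-match step with hashing, at the cost of O(*s*) memory.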

**Related**: Applied number theory

It’s possible that treatment *X* is doing so poorly that you want to end the trial without going any further. It’s also possible that *X* is doing so well that you want to end the trial early. Both of these are rare. Most of the time an interim analysis is more concerned with **futility**. You might want to stop the trial early not because the results are really good, or really bad, but because the results are really mediocre! That is, treatments *X* and *Y* are performing so similarly that you’re afraid that you won’t be able to declare one or the other better.

Maybe treatment *X* is doing a little better than *Y*, but not so much better that you can declare with confidence that *X* is better. You might want to stop for futility if you project that not only do you not have enough evidence now, you don’t believe you will have enough evidence by the end of the trial.

Futility analysis is more about resources than ethics. If *X* is doing poorly, ethics might dictate that you stop giving *X* to patients so you stop early. If *X* is doing spectacularly well, ethics might dictate that you stop giving the control treatment, if there is an active control. But if *X* is doing so-so, there’s usually not an ethical reason to stop, unless *X* is worse than *Y* on some secondary criteria, such as having worse side effects. You want to end futile studies so you can save resources and get on with the next study, and you could argue that’s an ethical consideration, though less direct.

Futility analysis isn’t about your current estimate of effectiveness. It’s about what you think your estimate of effectiveness will be in the future. That is, it’s a second order prediction. You’re trying to understand the effectiveness of the **trial**, not of the **treatment** per se. You’re not trying to estimate a parameter, for example, but trying to estimate what range of estimates you’re likely to make.

This is why **predictive probability** is natural for interim analysis. You’re trying to predict **outcomes**, not **parameters**. (This is subtle: you’re trying to estimate the probability of **outcomes** that lead to certain estimates of **parameters**, namely those that allow you to reach a conclusion with pre-specified significance.)

Predictive probability is a Bayesian concept, but it is useful in analyzing frequentist trial designs. You may have frequentist conclusion criteria, such as a *p*-value threshold or some requirements on a confidence interval, but you want to know how likely it is that if the trial continues, you’ll see data that lead to meeting your criteria. In that case you want to compute the (**Bayesian**) predictive probability of meeting your **frequentist** criteria!

**Related post**: Uncertainty in a probability


How long is the non-repeating part? How long is the period of the repeating part?

The answer depends on the prime factorization of the denominator *b*. If *b* has the form

*b* = 2^{α} 5^{β}

then the decimal expansion terminates after *r* digits, where *r* = max(α, β).

Otherwise *b* has the factorization

*b* = 2^{α} 5^{β} *d*

where *d* is some integer greater than 1 and relatively prime to 10. Then the decimal expansion of *a*/*b* has *r* non-repeating digits and a repeating part with period *s* where *r* is as above, and *s* is the smallest positive integer such that

*d* | 10^{s} – 1,

i.e. the smallest *s* such that 10^{s} – 1 is divisible by *d*. How do we know there exists any such integer *s*? This isn’t obvious.

Euler’s generalization of Fermat’s little theorem tells us that

*d* | 10^{φ(d)} – 1

where φ is Euler’s totient function, and so we could take *s* = φ(*d*), though this may not be the smallest such *s*. So not only does *s* exist, we know that it is at most φ(*d*) ≤ *d* – 1. This means that the period of the repeating part of *a*/*b* is no more than *d* – 1 where *d* is what’s left after factoring out as many 2’s and 5’s from *b* as possible.
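As a sanity check, the recipe above is easy to code up. This is my own sketch (function name and all), not from the source:

```python
def expansion_lengths(b):
    """For 1/b, return (r, s): the number of non-repeating decimal digits r
    and the period s of the repeating part (s = 0 if the expansion terminates)."""
    alpha = beta = 0
    while b % 2 == 0:
        b //= 2
        alpha += 1
    while b % 5 == 0:
        b //= 5
        beta += 1
    r = max(alpha, beta)
    d = b  # what's left after factoring out the 2s and 5s
    if d == 1:
        return r, 0  # expansion terminates
    s = 1
    while pow(10, s, d) != 1:  # smallest s with d | 10^s - 1
        s += 1
    return r, s

print(expansion_lengths(6))  # 1/6 = 0.1666...: one non-repeating digit, period 1
print(expansion_lengths(7))  # 1/7 = 0.142857...: no non-repeating part, period 6
```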

By the way, this means that you can take any integer *d*, not divisible by 2 or 5, and find some integer *k* such that *dk* consists only of 9’s.

**Related post**: Cyclic fractions

Not all that speed improvement comes from changing languages. Some of it comes from better algorithms, eliminating redundancy, etc.

If code is running 100 times slower than you’d like, why not just run it on 100 processors? Sometimes that’s the way to go. But maybe the code doesn’t split up easily into pieces that can run in parallel. Or maybe you’d rather run the code on your laptop than send it off to the cloud. Or maybe you’d like to give your code to someone else and you want *them* to be able to run the code conveniently.

It’s sometimes possible to tweak R code to make it faster without rewriting it, especially if it is naively using loops for things that could easily be vectorized. And it’s possible to use better algorithms without changing languages.
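As an aside, the same loop-versus-vector contrast is easy to see in Python with NumPy; the R version would replace an explicit `for` loop with vectorized arithmetic in exactly the same way. (This snippet is illustrative only.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10**6)

# Naive loop: square each element one at a time in interpreted code.
squares_loop = np.empty_like(x)
for i in range(len(x)):
    squares_loop[i] = x[i] * x[i]

# Vectorized: one expression, with the loop running in compiled code.
squares_vec = x * x

assert np.allclose(squares_loop, squares_vec)
```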

Beyond these high-level changes, there are a number of low-level changes that may give you a small speed-up. This way madness lies. I’ve seen blog posts to the effect “I rewrote this part of my code in the following non-obvious way, and for reasons I don’t understand, it ran 30% faster.” Rather than spending hours or days experimenting with such changes and hoping for a small speed up, I use a technique fairly sure to give a 10x speed up, and that is rewriting (part of) the code in C++.

If the R script is fairly small, and if I have C++ libraries to replace all the necessary R libraries, I’ll rewrite the whole thing in C++. But if the script is long, or has dependencies I can’t replace, or only has a small section where nearly all the time is spent, I may just rewrite that portion in C++ and call it from R using Rcpp.

The R programs I’ve worked on often compute something approximately by simulation that could be calculated exactly much faster. This isn’t because the R language encourages simulation, but because the language is used by statisticians who are more inclined to use simulation than analysis.

Sometimes a simulation amounts to computing an integral. It might be possible to compute the integral in closed form with some pencil-and-paper work. Or it might be possible to recognize the integral as a special function for which you have efficient evaluation code. Or maybe you have to approximate the integral, but you can do it more efficiently by numerical analysis than by simulation.

Sometimes it’s possible to speed up code, written in any language, simply by not calculating the same thing unnecessarily. This could be something simple like moving code out of inner loops that doesn’t need to be there, or it could be something more sophisticated like memoization.

The first time a memoized function is called with a new set of arguments, it computes the result and saves it in some sort of look-up table, such as a hash, keyed by the arguments. The next time the function is called with the same arguments, the result is retrieved from the table rather than recomputed.

Memoization works well when the set of unique arguments is fairly small and the calculation is expensive relative to the cost of looking up results. Sometimes the set of *potential* arguments is very large, and it looks like memoization won’t be worthwhile, but the set of *actual* arguments is small because some arguments are used over and over.
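In Python, for instance, memoization is one decorator away (R has similar facilities, such as the memoise package). The “expensive” function below is a made-up stand-in:

```python
from functools import lru_cache
from math import sqrt

calls = 0

@lru_cache(maxsize=None)  # look-up table keyed by the arguments
def expensive(n):
    global calls
    calls += 1
    return sum(sqrt(k) for k in range(n))  # stand-in for real work

expensive(100_000)  # computed and cached
expensive(100_000)  # same arguments: retrieved, not recomputed
assert calls == 1
```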


The knee-jerk response [to neural networks] from statisticians was “What’s the big deal? A neural network is just a nonlinear model, not too different from many other generalizations of linear models.”

While this may be true, neural networks brought a new energy to the field. They could be scaled up and generalized in a variety of ways … And most importantly, they were able to solve problems on a scale far exceeding what the statistics community was used to. This was part computing scale expertise, part liberated thinking and creativity on the part of this computer science community.

After enjoying considerable popularity for a number of years, neural networks were somewhat sidelined by new inventions in the mid 1990’s. … Neural networks were passé. But then they re-emerged with a vengeance after 2010 … the reincarnation now being called *deep learning*.

Most people learn R as they learn statistics: Here’s a statistical concept, and here’s how you can compute it in R. Statisticians aren’t that interested in the R language itself but see it as connective tissue between commands that are their primary interest.

This works for statisticians, but it makes the language hard for non-statisticians to approach. Years ago I managed a group of programmers who supported statisticians. At the time, there were no books for learning R without concurrently learning statistics. This created quite a barrier to entry for programmers whose immediate concern was not the statistical content of an R program.

Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why anyone is interested in R—but it doesn’t *start* there. The first 40% or so of the book is devoted to basic language features, things you’re supposed to pick up by osmosis from a book focused more on statistics than on R *per se*. This is the book I wish I could have handed my programmers who had to pick up R.

Mathematical objects are usually defined internally. For example, the Cartesian product *P* of two sets *A* and *B* is defined to be the set of all ordered pairs (*a*, *b*) where *a* comes from *A* and *b* comes from *B*. The definition of *P* depends on the elements of *A* and *B* but it does not depend on any other sets.

Category theory turns this inside-out. Operations such as taking products are not defined in terms of elements of objects. Category theory makes no use of elements or subobjects [1]. It defines things by how they act, not their inner workings. People often stress what category theory **does not** depend on, but they less often stress what it **does** depend on. The definition of the product of two objects in any category depends on **all** objects in that category: The definition of the product of objects *A* and *B* contains the phrase “such that for any other object *X* …” [More on categorical products].
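In the category of sets, for example, the universal property of the product can be sketched in code: for any other object *X* with maps into *A* and *B*, there is a unique mediating map into *A* × *B*. (This snippet is my illustration, not from the source.)

```python
# The mediating map <f, g>: X -> A x B guaranteed by the universal
# property of the product: fst(pair(f, g)(x)) == f(x) and
# snd(pair(f, g)(x)) == g(x) for every x in X.
def pair(f, g):
    return lambda x: (f(x), g(x))

fst = lambda p: p[0]
snd = lambda p: p[1]

f = lambda n: n + 1   # a map X -> A
g = lambda n: n * 2   # a map X -> B
h = pair(f, g)        # the unique map X -> A x B

assert fst(h(10)) == f(10) and snd(h(10)) == g(10)
```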

The payoff for this inside-out approach to products is that you can say something simultaneously about everything that acts like a product, whether it’s products of sets, products of fields (i.e. that they don’t exist), products of groups, etc. You can’t say something valid across multiple categories if you depend on details unique to one category.

**This isn’t unique to products**. Universal properties are everywhere. That is, you see definitions containing “such that for any other object *X* …” all the time. In this sense, category theory is extremely non-local. The definition of a widget often depends on **all** widgets.

There’s a **symmetry** here. Traditional definitions depend on the internal workings of objects, but only on the objects themselves. There are no third parties involved in the definition. Categorical definitions have zero dependence on internal workings, but depend on the behavior of everything in the category. There are an infinite number of third parties involved! [2] You can have a definition that requires complete internal knowledge but zero external knowledge, or a definition that requires zero internal knowledge and an infinite amount of external knowledge.

**Related**: Applied category theory

* * *

[1] Category theory does have notions analogous to elements and subsets, but they are defined the same way everything else is in category theory, in terms of objects and morphisms, not by appealing to the inner structure of objects.

[2] You can have a category with a finite number of objects, but usually categories are infinite. In fact, they are usually so large that they are “classes” of objects rather than sets.

*t* = *W*/*n*

where *t* is the calendar time to completion, *W* is a measure of how much work is to be done, and *n* is the number of people. This assumes everything on the project can be done in parallel. Nobody waits for anybody else.

The next refinement is to take into account the proportion of work that can be done in parallel. Call this *p*. Then we have

*t* = *W*[1 – *p*(*n*-1)/*n*].

If everything can be done in parallel, *p* = 1 and *t* = *W*/*n* as before. But if nothing can be done in parallel, *p* = 0, and so *t* = *W*. In other words, the total time is the same whether one person is on the project or more. This is essentially Amdahl’s law.

With the equation above, adding people never slows things down. And if *p* > 0, every additional person helps at least a little bit.

Next we add a term to account for communication cost. Assume communication costs are proportional to the number of communication paths, *n*(*n* – 1)/2. Call the proportionality constant *k*. Now we have

*t* = *W*[1 – *p*(*n*-1)/*n* + *kn*(*n*-1)/2].

If *k* is small but positive, then at first adding more people causes a project to complete sooner. But beyond some optimal team size, adding more people causes the project to take longer.
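To see the trade-off numerically, here is a quick sketch; the parameter values are made up for illustration:

```python
def completion_time(W, p, n, k):
    """t = W [1 - p(n-1)/n + k n(n-1)/2]"""
    return W * (1 - p * (n - 1) / n + k * n * (n - 1) / 2)

# Suppose 90% of the work parallelizes and communication cost is small.
W, p, k = 100, 0.9, 0.002
times = {n: completion_time(W, p, n, k) for n in range(1, 31)}
best = min(times, key=times.get)  # team size minimizing calendar time
print(best)
```

With these values the completion time falls until the team reaches eight people, then rises again.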

Of course none of this is exact. Project time estimation doesn’t follow any simple formula. Think of these equations more as rough guides or metaphors. It’s certainly true that beyond a certain size, adding more people to a project can slow the project down. Kevlin gave examples of projects that were put back on track by reducing the number of people working on them.

My quibble with the equation above is that I don’t think the cost of more people is primarily communication. Communication paths in a real project are not the simple trees of org charts, but neither are they complete graphs. And if the problem were simply communication, then improved communication would mitigate the cost of adding people to a project, though I imagine it hardly does.

I think the cost of adding people to a project has more to do with Parkinson’s Law which says that **people make work for each other**. (The aphorism form of Parkinson’s Law says that work expands to the time allowed. But the eponymous book explains **why** work expands, and it is in part because people make extra work for each other.)

I wrote about a similar theme in the blog post Maybe you only need it because you have it. Here’s the conclusion of that post:

Suppose a useless project adds staff. These staff need to be managed, so they hire a manager. Then they hire people for IT, accounting, marketing, etc. Eventually they have their own building. This building needs security, maintenance, and housekeeping. No one questions the need for the security guard, but the guard would not have been necessary without the original useless project.

When something seems absolutely necessary, maybe it’s only necessary because of something else that isn’t necessary.

Mads said that now with .NET Core, C# code runs about as fast on Linux as Windows. Maybe a few percent slower on Linux. Scott Hanselman joined the conversation and explained that with .NET Core, the same code runs on every platform. The Mono project duplicated much of the functionality of the .NET framework, but it was an entirely independent implementation and not as efficient.

I had assumed the difference was due to compiler optimizations or lack thereof, but Scott and Mads said that the difference was mostly the code implementation. There are some compiler optimizations that are better on the Windows side, and so C# might run a little faster on Windows, but the performance is essentially the same on both platforms.

I could recommend C# to a client running Linux if there’s a 5% performance penalty, but a 500% performance penalty would have been a show-stopper. Now I’d consider using C# on a project where I need more performance than Python or R, but want something easier to work with than C++.

Years ago I developed software with the Microsoft stack, but I moved away from Microsoft tools when I quit doing the kind of software development the tools are geared for. So I don’t write C# much any more. It’s been about a year since I had a project where I needed to write C# code. But I like the C# language. You can tell that a lot of thought has gone into the design, and the tool support is great. Now that the performance is better on Linux I’d consider using it for numerical simulations.

It’s been fun to see some people I haven’t seen since I spoke at the GOTO and YOW conferences four years ago.

Photo above by conference photographer Fritz Schumann.

Growing compute power and shrinking sensors open up possibilities we’re only beginning to explore. Even when the things we want to observe elude direct measurement, we may be able to infer them from other things that we can now measure accurately, inexpensively, and in high volume.

In order to infer what you’d *like* to measure from what you *can* measure, you need a mathematical model. Or if you’d like to make predictions about the future from data collected in the past, you need a model. And that’s where I come in. Several companies have hired me to help them create medical devices by working on mathematical models. These might be statistical models, differential equations, or a combination of the two. I can’t say much about the projects I’ve worked on, at least not yet. I hope that I’ll be able to say more once the products come to market.

I started my career doing mathematical modeling (partial differential equations) but wasn’t that interested in statistics or medical applications. Then through an unexpected turn of events, I ended up spending a dozen years working in the biostatistics department of the world’s largest cancer center.

Since leaving MD Anderson and starting my consultancy, I’ve been approached by several companies for help with mathematical problems associated with their ideas for medical devices. These are ideal projects because they combine my earlier experience in mathematical modeling with my more recent experience with medical applications.

If you have an idea for a medical device, or know someone who does, let’s talk. I’d like to help.


“A bird may love a fish but where would they build a home together?” — Fiddler on the Roof


Suppose *f*(*x*) is a function of several variables, i.e. *x* is a vector, and *g*(*x*) = *c* is a constraint. Then the Lagrange multiplier theorem says that at the maximum of *f* subject to the constraint *g* we have ∇*f* = λ ∇*g*.

Where does this mysterious λ come from? And why should the gradient of your objective function be related to the gradient of a constraint? These seem like two different things that shouldn’t even be comparable.

Here’s the geometric explanation. The set of points satisfying *g*(*x*) = *c* is a surface. And for any *k*, the set of points satisfying *f*(*x*) = *k* is also a surface. Imagine *k* very large, larger than the maximum of *f* on the surface defined by *g*(*x*) = *c*. You could think of the surface *g*(*x*) = *c* being a balloon inside the larger balloon *f*(*x*) = *k*.

Now gradually decrease *k*, like letting the air out of the outer balloon, until the surfaces *g*(*x*) = *c* and *f*(*x*) = *k* first touch. At that point, the two surfaces will be tangent, and so their normal vectors, given by their gradients, point in the same direction. That is, ∇*f* and ∇*g* are parallel, and so ∇*f* is some multiple of ∇*g*. Call that multiple λ.
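Here is a toy numerical check (my own example, not from the lecture): maximize *f*(*x*, *y*) = *xy* subject to *x* + *y* = 1.

```python
# Maximize f(x, y) = x*y on the constraint x + y = 1.
# Parameterize the constraint by x and search a fine grid.
best_x = max((i / 1000 for i in range(1001)), key=lambda x: x * (1 - x))
best_y = 1 - best_x

# At the maximizer (1/2, 1/2): grad f = (y, x) = (1/2, 1/2) and
# grad g = (1, 1), so grad f = (1/2) grad g. The multiplier is 1/2.
assert abs(best_x - 0.5) < 1e-9
```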

I don’t know how well that explanation works when written down. But when I heard Jim Vick explain it, moving his hands in the air, it was an eye-opener.

This is not a rigorous proof, and it does not give the most general result possible, but it explains what’s going on. It’s something to keep in mind when reading proofs that are more rigorous or more general. As I comment on here,

Proofs serve two main purposes: to establish that a proposition is true, and to show *why* it is true.

The literally hand-wavy proof scores low on the former criterion and high on the latter.

* * *

Jim Vick was a great teacher. Some of us affectionately called him The Grinning Demon because he was always smiling, even while he gave devilishly hard homework. He was Dean of Natural Sciences when I took a couple classes from him. He later became Vice President for Student Affairs and kept teaching periodically. He has since retired but still teaches.

After taking his topology course, several of us asked him to teach a differential geometry course. He hesitated because it’s a challenge to put together an undergraduate differential geometry course. The usual development of differential geometry uses so much machinery that it’s typically a graduate-level course.

Vick found a book that starts by looking only at manifolds given by level sets of smooth functions, like the surfaces discussed above. Because these surfaces sit inside a Euclidean space, you can quickly get to some interesting geometric results using elementary methods. We eventually got to the more advanced methods, but by then we had experience in a more tangible setting. As Michael Atiyah said, abstraction should follow experience, not precede it.

With no more information than this, what would you estimate the probability to be that the treatment is effective in the next subject? Easy: 0.7.

Now what would you estimate the probability to be that the treatment is effective in the next **two** subjects? You might say 0.49, and that would be correct if we **knew** that the probability of response is 0.7. But there’s uncertainty in our estimate. We don’t know that the response rate is 70%, only that we saw a 70% response rate in our small sample.

If the probability of success is *p*, then the probability of *s* successes and *f* failures in the next *s* + *f* subjects is given by

C(*s* + *f*, *s*) *p*^{s} (1 – *p*)^{f}.

But if our probability of success has some uncertainty and we assume it has a beta(*a*, *b*) distribution, then the **predictive probability** of *s* successes and *f* failures is given by

C(*s* + *f*, *s*) B(*a* + *s*, *b* + *f*) / B(*a*, *b*)

where

B(*a*, *b*) = Γ(*a*) Γ(*b*) / Γ(*a* + *b*).
In our example, after seeing 7 successes out of 10 subjects, we estimate the probability of success by a beta(7, 3) distribution. Then this says the predictive probability of two successes is approximately 0.51, a little higher than the naive estimate of 0.49. Why is this?

We’re not assuming the probability of success is 0.7, only that the mean of our estimate of the probability is 0.7. The actual probability might be higher or lower. The predictive probability averages the probability of the outcome over all possible values of the success probability, weighting each value by its probability under the beta(7, 3) distribution. The contributions from values above and below 0.7 approximately balance out, but the former carry a little more weight, and so we get a slightly higher answer than before.

If this doesn’t seem right, note that mean and median aren’t the same thing for asymmetric distributions. A beta(7,3) distribution has mean 0.7, but it has a probability of 0.537 of being larger than 0.7.

If our initial experiment had shown 70 successes out of 100 instead of 7 out of 10, the predictive probability of two successes would have been 0.492, closer to the value based on the point estimate, but still different.
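The numbers above are easy to reproduce. Here is a sketch of my own, using the log-gamma function to avoid overflow for large arguments:

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    # log of the beta function B(a, b) = gamma(a) gamma(b) / gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def predictive_prob(s, f, a, b):
    """Predictive probability of s successes and f failures when the
    success probability has a beta(a, b) distribution."""
    return comb(s + f, s) * exp(log_beta(a + s, b + f) - log_beta(a, b))

print(round(predictive_prob(2, 0, 7, 3), 3))    # 0.509, vs 0.7**2 = 0.49
print(round(predictive_prob(2, 0, 70, 30), 3))  # 0.492
```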

The further we look ahead, the more difference there is between using a point estimate and using a distribution that incorporates our uncertainty. Here are the probabilities for the number of successes out of the next 100 outcomes, using the point estimate 0.7 and using predictive probability with a beta(7, 3) distribution.

So if we’re sure that the probability of success is 0.7, we’re pretty confident that out of 100 trials we’ll see between 60 and 80 successes. But if we model our uncertainty in the probability of response, we get quite a bit of uncertainty when we look ahead to the next 100 subjects. Now we can say that the number of responses is likely to be between 30 and 100.

Expect to see tweets about constructive logic, type theory, formal proofs, proof assistants, etc.

The image for the account is a bowtie, a pun on formality. It’s also the symbol for natural join in relational algebra.

The number 3435 has the following curious property:

3435 = 3^{3} + 4^{4} + 3^{3} + 5^{5}.

It is called a Münchausen number, an allusion to the fictional Baron Münchausen. When each digit is raised to its own power and summed, you get the original number back. The only other Münchausen number is 1.

At least in base 10. You could look at Münchausen numbers in other bases. If you write out a number *n* in base *b*, raise each of its “digits” to its own power, take the sum, and get *n* back, you have a Münchausen number in base *b*. For example 28 is a Münchausen number in base 9 because

28_{ten} = 31_{nine} = 3^{3} + 1^{1}

Daan van Berkel proved that there are only finitely many Münchausen numbers in any given base. In fact, he shows that a Münchausen number in base *b* cannot be greater than 2*b*^{b}, and so you could do a brute-force search to find all the Münchausen numbers in any base.

The upper bound 2*b*^{b} grows very quickly with *b* and so brute force becomes impractical for large *b*. If you wanted to find all the hexadecimal Münchausen numbers you’d have to search 2 × 16^{16} = 36,893,488,147,419,103,232 numbers. How could you do this more efficiently?
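One standard trick, sketched below in Python (this is my own sketch, not van Berkel’s method): the sum of *d*^{d} over the digits doesn’t depend on the order of the digits, so instead of searching all numbers up to 2*b*^{b} you can search multisets of digits, a vastly smaller space, and check whether the resulting sum has the same digits.

```python
from itertools import combinations_with_replacement

def munchausen_numbers(base=10):
    """Find Münchausen numbers in a base by iterating over multisets of
    digits instead of over all numbers up to 2 * base**base.
    Uses the convention 0**0 = 1, as Python does."""
    found = set()
    # A Münchausen number is less than 2 * base**base, so it has at
    # most base + 1 digits in the given base.
    for length in range(1, base + 2):
        for digits in combinations_with_replacement(range(base), length):
            total = sum(d ** d for d in digits)
            # Expand total in the given base and compare digit multisets.
            t, ds = total, []
            while t:
                t, r = divmod(t, base)
                ds.append(r)
            if sorted(ds) == sorted(digits):
                found.add(total)
    return sorted(found)

print(munchausen_numbers(10))  # [1, 3435]
```

For base 10 this examines a few hundred thousand digit multisets rather than twenty billion numbers.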

Suppose you have an expression (λ*x*.*x* + 2)*y*, which means apply to *y* the function that takes its argument and adds 2. Then β-reduction rewrites this expression as *y* + 2. In a more complicated expression, you might be able to apply β-reduction several times. When you do apply β-reduction several times, does the process always stop? And if you apply β-reduction to parts of an expression in a different order, can you get different results?

You might reasonably expect that if you apply β-reduction enough times you eventually get an expression you can’t reduce any further. *Au contraire*!

Consider the expression (λ*x*.*xx*) (λ*x*.*xx*). Beta reduction says to substitute the argument, the second (λ*x*.*xx*), for each occurrence of *x* in the body of the first. But when we do that, we get the original expression (λ*x*.*xx*) (λ*x*.*xx*) back. Beta reduction gets stuck in an infinite loop.
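You can watch this non-termination happen by transliterating the term into Python, where each β-reduction step becomes a function call. This is just an illustration, not a faithful lambda calculus interpreter: Python evaluates eagerly, so the self-application recurses until the interpreter gives up.

```python
# (λx. x x) applied to itself: each reduction step is another call,
# so evaluation never reaches a final value
omega = lambda x: x(x)

try:
    omega(omega)
except RecursionError:
    print("no normal form: reduction never terminates")
```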

Next consider the expression *L* = (λ*x*.*xxy*) (λ*x*.*xxy*). Applying β-reduction the first time gives (λ*x*.*xxy*) (λ*x*.*xxy*) *y* or *Ly*. Applying β-reduction again yields *Lyy*. Beta “reduction” doesn’t reduce the expression at all but makes it bigger.

The technical term for what we’ve seen is that β-reduction is not *normalizing*. A rewrite system is *strongly normalizing* if applying the rules in any order eventually terminates. It’s *weakly normalizing* if there’s at least some sequence of applying the rules that terminates. Beta reduction is neither strongly nor weakly normalizing in the context of (untyped) lambda calculus.

In simply typed lambda calculus, we assign types to every variable, and functions have to take the right type of argument. This additional structure prevents examples such as those above that fail to normalize. If *x* is a function that takes an argument of type *A* and returns a value of type *B*, then you can’t apply *x* to itself. This is because *x* takes something of type *A*, not something of type function from *A* to *B*. You can prove that not only does this rule out specific examples like those above, it rules out all possible examples that would prevent β-reduction from terminating.

To summarize, β-reduction is not normalizing, not even weakly, in the context of untyped lambda calculus, but it is strongly normalizing in the context of simply typed lambda calculus.

Although β-reduction is not normalizing for untyped lambda calculus, the Church-Rosser theorem says it is *confluent*. That is, if an expression *P* can be transformed by β-reduction two different ways into expressions *M* and *N*, then there is an expression *T* such that both *M* and *N* can be reduced to *T*. This implies that if β-reduction *does* terminate, then it terminates in a unique expression (up to α-conversion, i.e. renaming bound variables). Simply typed lambda calculus is confluent as well, and so in that context we can say β-reduction always terminates in a unique expression (again up to α-conversion).

If I flip a coin a million times, I’m virtually certain to get 50 percent heads and 50 percent tails.

Depending on how you understand that line, it’s either imprecise or false. The more times you flip the coin, the **more** likely you are to get **nearly** half heads and half tails, but the **less** likely you are to get **exactly** half of each. I assume Dr. Lienhard knows this and that by “50 percent” he meant “nearly half.”

Let’s make the fuzzy statements above more quantitative. Suppose we flip a coin 2*n* times for some large number *n*. Then a calculation using Stirling’s approximation shows that the probability of *n* heads and *n* tails is approximately

1/√(π*n*)

which goes to zero as *n* goes to infinity. If you flip a coin a million times, there’s less than one chance in a thousand that you’d get exactly half heads.
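It’s easy to check the approximation against the exact binomial probability. A quick sketch in Python:

```python
from math import comb, pi, sqrt

def prob_exact_half(n):
    # exact probability of n heads in 2n fair coin tosses
    return comb(2 * n, n) / 4**n

# compare with the Stirling-based approximation 1/sqrt(pi n)
for n in [10, 100, 10_000]:
    print(prob_exact_half(n), 1 / sqrt(pi * n))
```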

Next, let’s quantify the statement that *nearly* half the tosses are likely to be heads. The normal approximation to the binomial tells us that for large *n*, the number of heads out of 2*n* tosses is approximately distributed like a normal distribution with the same mean and variance, i.e. mean *n* and variance *n*/2. The *proportion* of heads is thus approximately normal with mean 1/2 and variance 1/(8*n*). This means the standard deviation is 1/√(8*n*). So, for example, about 95% of the time the proportion of heads will be 1/2 plus or minus 2/√(8*n*). As *n* goes to infinity, the width of this interval goes to 0. Alternatively, we could pick some fixed interval around 1/2 and show that the probability of the proportion of heads being outside that interval goes to 0.
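A short simulation checks the 95% claim. The parameters below (*n* = 1000 and 500 experiments) are my own illustrative choices, not from the text.

```python
import random
from math import sqrt

random.seed(42)

n = 1000                       # each experiment flips 2n = 2000 coins
half_width = 2 / sqrt(8 * n)   # the approximate 95% interval is 1/2 ± this

trials = 500
inside = 0
for _ in range(trials):
    heads = sum(random.random() < 0.5 for _ in range(2 * n))
    if abs(heads / (2 * n) - 0.5) <= half_width:
        inside += 1

print(inside / trials)   # should come out near 0.95
```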

With data from other distributions, the mean and variance may not be sufficient statistics, and in fact there may be no (useful) sufficient statistics. The full data set is more informative than any summary of the data. But out of habit people may think that the mean and variance are enough.

Probability distributions are an idealization, of course, and so data never exactly “come from” a distribution. But if you’re satisfied with a distributional idealization of your data, there may be useful sufficient statistics.

Suppose you have data with such large outliers that you seriously doubt that it could be coming from anything appropriately modeled as a normal distribution. You might say the definition of sufficient statistics is wrong, that the full data set tells you something you couldn’t know from the summary statistics. But the sample mean and variance are still sufficient statistics in this case. They really are sufficient, *conditional on the normality assumption*, which you don’t believe! The cognitive dissonance doesn’t come from the definition of sufficient statistics but from acting on an assumption you believe to be false.

***

[1] Technically every distribution has sufficient statistics, though the sufficient statistic might be the same size as the original data set, in which case the sufficient statistic hasn’t contributed anything useful. Roughly speaking, distributions have useful sufficient statistics if they come from an “exponential family,” a set of distributions whose densities factor a certain way.

The source file for a Jupyter notebook is a JSON document containing formatting instructions, encoded images, and the notebook content. Looking at a notebook file in a text editor is analogous to unzipping a Word document and looking at the XML inside. Here’s a sample:

" if (with_grid_):\n", " plt.grid (which=\"both\",ls=\"-\",color=\"0.65\")\n", " plt.show() " ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAawAAAESCAYAAA...

It’s hard to diff two Jupyter notebooks because content changes don’t map simply to changes in the underlying file.

We can look a little closer at WYSIWYG and ask what you see **where** and what you get **where**. Word documents and Jupyter notebooks are **visually** WYSIWYG: what you see in the interactive environment (browser) is what you get visually in the final product. This is very convenient, except for version control and comparing changes.

Org-mode files are **functionally** WYSIWYG: what you see in the interactive environment (editor) is what you get functionally in the final code.

You could say that HTML and LaTeX are functionally or **causally** WYSIWYG whereas Word is visually WYSIWYG. Word is visually WYSIWYG in the sense that what you see on your monitor is essentially what you’ll see coming out of a printer. But Word doesn’t always let you see what’s causing your document to look as it does. Why can’t I delete this? Why does it insist on putting that here? HTML and LaTeX work at a lower level, for better and for worse. You may not be able to anticipate how things will look while editing the source, e.g. where a line will break, but cause and effect are transparent.

Everybody wants WYSIWYG, but they have different priorities regarding what they want to see and are concerned about different aspects of what they get.

The limitations of floating point arithmetic are something to be aware of, and ignoring these limitations can cause problems, like crashing airplanes. On the other hand, floating point arithmetic is usually far more reliable than the application it is being used in.

It’s well known that if you multiply two floating point numbers, *a* and *b*, then divide by *b*, you might not get exactly *a* back. Your result might differ from *a* by one part in ten quadrillion (10^{16}). To put this in perspective, suppose you have a rectangle approximately one kilometer on each side. Someone tells you the exact area and the exact length of one side. If you solve for the length of the missing side using standard (IEEE 754) floating point arithmetic, your result could be off by as much as the width of a helium atom.
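You can see this in any language with IEEE 754 doubles. The particular values 0.1 and 3 below are my own example:

```python
# multiply then divide does not always round-trip exactly
a, b = 0.1, 3.0
roundtrip = (a * b) / b
print(roundtrip == a)       # False
print(abs(roundtrip - a))   # on the order of 1e-17: one part in 10^16 of a
```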

Most of the problems attributed to “round off error” are actually approximation error. As Nick Trefethen put it,

If rounding errors vanished, 90% of numerical analysis would remain.

Still, despite the extreme precision of floating point, in some contexts rounding error is a practical problem. You have to learn in what context floating point precision matters and how to deal with its limitations. **This is no different than anything else in computing**.

For example, most of the time you can imagine that your computer has an unlimited amount of memory, though sometimes you can’t. It’s common for a computer to have enough memory to hold the entire text of Wikipedia (currently around 12 gigabytes). This is usually far more memory than you need, and yet for some problems it’s not enough.

**More on floating point computing**:

Building proofs and programs are very similar activities, but there is one important difference: when looking for a proof it is often enough to find one, however complex it is. On the other hand, not all programs satisfying a specification are alike: even if the eventual result is the same, efficient programs must be preferred. The idea that the details of proofs do not matter—usually called *proof irrelevance*—clearly justifies letting the computer search for proofs …

The quote is saying “Any proof will do, but among programs that do the same job, more efficient programs are better.” And this is certainly true, depending on what you want from a proof.

Proofs serve two main purposes: to establish that a proposition is true, and to show *why* it is true. If you’re only interested in the existence of a proof, then yes, any proof will do. But proofs have a sort of pedagogical efficiency just as programs have an execution efficiency. Given two proofs of the same proposition, the more enlightening proof is better by that criterion.

I assume the authors of the above quote would not disagree because they say it is *often* enough to find one, implying that for some purposes simpler proofs are better. And existence does come first: a complex proof that exists is better than an elegant proof that does not exist! Once you have a proof you may want to look for a simpler proof. Or not, depending on your purposes. Maybe you have an elegant proof that you’re convinced must be essentially true, even if you doubt some details. In that case, you might want to complement it with a machine-verifiable proof, no matter how complex.

**Source**: Interactive Theorem Proving and Program Development. Coq’Art: The Calculus of Inductive Constructions

For example, take English letter frequencies. These frequencies are fairly well known. E is the most common letter, followed by T, then A, etc. The string of letters “ETAOIN SHRDLU” comes from the days of Linotype when letters were arranged in that order, in decreasing order of frequency. Sometimes you’d see ETAOIN SHRDLU in print, just as you might see “QWERTY” today.

Morse code is also based on English letter frequencies. The length of a letter in Morse code varies approximately inversely with its frequency, a sort of precursor to Huffman encoding. The most common letter, E, is a single dot, while the rarer letters like J and Q have a dot and three dashes. (So does Y, even though it occurs more often than some letters with shorter codes.)

So how frequently does the letter E, for example, appear in English? That depends on what you mean by English. You can count how many times it appears, for example, in a particular edition of A Tale of Two Cities, but that isn’t the same as its frequency in English. And if you’d picked the novel Gadsby instead of A Tale of Two Cities you’d get very different results since that book was written without using a single letter E.

Peter Norvig reports that E accounted for 12.49% of English letters in his analysis of the Google corpus. That’s a better answer than just looking at Gadsby, or even A Tale of Two Cities, but it’s still not *English*.

What might we mean by “English” when discussing letter frequency? Written or spoken English? Since when? American, British, or worldwide? If you mean blog articles, I’ve altered the statistics from what they were a moment ago by publishing this. Introductory statistics books avoid this kind of subtlety by distinguishing between samples and populations, but in this case the population isn’t a fixed thing. When we say “English” as a whole we have in mind some idealization that strictly speaking doesn’t exist.

If we want to say, for example, what the frequency of the letter E is in English as a whole, not some particular English corpus, we can’t answer that to too many decimal places. Nor can we say, for example, which letter is the 18th most frequent. Context could easily change the second decimal place in a letter’s frequency or, among the less common letters, its frequency rank.

And yet, for practical purposes we can say E is the most common letter, then T, etc. We can design better Linotype machines and telegraphy codes using our understanding of letter frequency. At the same time, we can’t expect too much of this information. Anyone who has worked a cryptogram puzzle knows that you can’t say with certainty that the most common letter in a particular sample *must* correspond to E, the next to T, etc.

By the way, Peter Norvig’s analysis suggests that ETAOIN SHRDLU should be updated to ETAOIN SRHLDCU.
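Counting letter frequencies in a corpus of your choosing takes only a few lines of Python. The sample sentence below is mine, and its ranking of course says nothing about English as a whole:

```python
from collections import Counter

def letter_frequencies(text):
    # count letters only, case-insensitively, and normalize to proportions
    letters = [c for c in text.upper() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: counts[c] / total
            for c in sorted(counts, key=counts.get, reverse=True)}

freqs = letter_frequencies("It was the best of times, it was the worst of times.")
print(list(freqs)[:3])   # the three most common letters in this tiny sample
```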


I’m happier with the paragraph above if you replace “calculus” with “analysis.” Analysis certainly seeks to understand and model things that change continually, but calculus per se is the mechanism of analysis.

I used to think it oddly formal for people to say “differential and integral calculus.” Is there any other kind? Well yes, yes there is, though I didn’t know that at the time. A calculus is a system of rules for computing things. Differential and integral calculus is a system of rules for calculating derivatives and integrals. Lately I’ve thought about other calculi more than differential calculus: propositional calculus, lambda calculus, calculus of inductive constructions, etc.

In my first career I taught (differential and integral) calculus and was frustrated with students who would learn how to calculate derivatives but never understood what a derivative was or what it was good for. In some sense though, they got to the core of what a calculus is. It would be better if they knew what they were calculating and how to apply it, but they still learn something valuable by knowing how to carry out the manipulations. A computer science major, for example, who gets through (differential) calculus knowing how to calculate derivatives without knowing what they are is in a good position to understand lambda calculus later.

**Dimensional analysis** is a well-established method of error detection. Simply checking that you’re not doing something like adding things that aren’t meaningful to add is surprisingly useful for detecting errors. For example, if you’re computing an area, does your result have units of area? This seems like a ridiculously basic question to ask, and yet it is more effective than it would seem.
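Here is a minimal sketch of the idea as executable code. The `Quantity` class and its exponent tuple are hypothetical, invented for illustration; real unit libraries are far more thorough.

```python
# Dimensional analysis as a runtime check: each quantity carries
# exponents for (length, time), and addition demands matching dimensions.
class Quantity:
    def __init__(self, value, length=0, time=0):
        self.value = value
        self.dims = (length, time)

    def __add__(self, other):
        if self.dims != other.dims:
            raise TypeError(f"cannot add dimensions {self.dims} and {other.dims}")
        return Quantity(self.value + other.value, *self.dims)

    def __mul__(self, other):
        # multiplying quantities adds their dimension exponents
        dims = tuple(a + b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value * other.value, *dims)

width = Quantity(3.0, length=1)
height = Quantity(4.0, length=1)
area = width * height            # dims (2, 0): an area, as expected

try:
    width + area                 # adding a length to an area is meaningless
except TypeError as e:
    print("caught:", e)
```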

**Type theory** and **category theory** are extensions of the idea of dimensional analysis. Simply asking of a function “Exactly what kind of thing does it take in? And what sort of thing does it put out?” is a powerful way to find errors. And in practice it may be hard to get answers to these questions. When you’re working with a system that wasn’t designed to make such things explicit and clear, it may be that nobody knows for certain, or people believe they know but have incompatible ideas. Type theory and category theory can do much more than nudge you to define functions well, but even that much is a good start.

Specifying types is more powerful than it seems. With **dependent types**, you can incorporate so much information in the type system that showing that an instance of a type exists is equivalent to proving a theorem. This is a consequence of the **Curry-Howard** **correspondence** (“Propositions as types”) and is the basis for proof assistants like **Coq**.
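As a tiny illustration of propositions as types, here is modus ponens written in Lean 4 syntax. This is my own example, not from any source cited here; the proof of the implication is literally a function.

```lean
-- Under Curry-Howard, a proof of A → (A → B) → B is a program
-- of that type: take a value of A and a function, and apply it.
theorem modus_ponens (A B : Prop) : A → (A → B) → B :=
  fun a f => f a
```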

I’d like to suggest the term **Big Logic** for the application of logic on a large scale, using logic to prove properties of systems that are too complex for a typical human mind to thoroughly understand. It’s well known that systems have gotten larger, more connected, and more complex. It’s not as well known how much the tools of Big Logic have improved, making formal verification practical for more projects than it used to be.

The sides of a spherical triangle are formed by great circular arcs through the vertices. Since the sides are portions of a circle, they can be measured as angles. So in spherical trig you have this interesting interplay of two kinds of angles: the angles formed at the intersections of the sides, and the angles describing the sides themselves.

Here’s how you form the dual of a spherical triangle. Suppose the vertices of the triangle are *A*, *B*, and *C*. Think of the arc connecting *A* and *B* as an equator, and let *C*′ be the corresponding pole that lies on the same side of the arc as the original triangle *ABC*. Do the analogous process to find the points *A*′ and *B*′. The triangle *A*′*B*′*C*′ is the dual of the triangle *ABC*. (This idea goes back to the Persian mathematician Abu Nasr Mansur circa 1000 AD.)

The sides in *A*′*B*′*C*′ are the supplementary angles of the corresponding intersection angles in *ABC*, and the intersection angles in *A*′*B*′*C*′ are the supplementary angles of the corresponding sides in *ABC*.

In his paper “Duality in the formulas of spherical trigonometry,” published in American Mathematical Monthly in 1909, W. A. Granville gives the following duality principle:

If the sides of a spherical triangle be denoted by Roman letters *a*, *b*, *c* and the supplements of the corresponding opposite angles by the Greek letters α, β, γ, then from any given formula involving any of these six parts, we may write down a dual formula simply by interchanging the corresponding Greek and Roman letters.
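For example, the spherical law of cosines, written in terms of the six parts above, dualizes to the law of cosines for angles (in LaTeX notation):

```latex
% Law of cosines for sides, written with \alpha = \pi - A:
\cos a = \cos b \cos c - \sin b \sin c \cos \alpha
% Interchanging Roman and Greek letters gives the dual:
\cos \alpha = \cos \beta \cos \gamma - \sin \beta \sin \gamma \cos a
% which, back in terms of the angles A, B, C, is the law of cosines for angles:
\cos A = -\cos B \cos C + \sin B \sin C \cos a
```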

**Related**: Notes on Spherical Trigonometry

- Constant functions are PR.
- The function that picks the *i*th element of a list of *n* arguments is PR.
- The successor function *S*(*n*) = *n*+1 is PR.
- PR functions are closed under composition.
- PR functions are closed under **primitive recursion**.

The last axiom obviously gives PR functions their name. So what is primitive recursion? Given a PR function *f* that takes *k* arguments, and another PR function *g* that takes *k*+2 arguments, the primitive recursion of *f* and *g* is a function *h* of *k*+1 arguments satisfying two properties:

- *h*(0, *x*_{1}, …, *x*_{k}) = *f*(*x*_{1}, …, *x*_{k})
- *h*(*S*(*y*), *x*_{1}, …, *x*_{k}) = *g*(*y*, *h*(*y*, *x*_{1}, …, *x*_{k}), *x*_{1}, …, *x*_{k})
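The definition translates directly into a higher-order function. Here is a sketch in Python, with addition and multiplication built by primitive recursion (the function names are mine):

```python
def primitive_recursion(f, g):
    # Build h from f and g per the two defining equations:
    # h(0, xs) = f(xs) and h(y+1, xs) = g(y, h(y, xs), xs)
    def h(y, *xs):
        if y == 0:
            return f(*xs)
        return g(y - 1, h(y - 1, *xs), *xs)
    return h

# Addition: add(0, x) = x and add(S(y), x) = S(add(y, x))
add = primitive_recursion(lambda x: x, lambda y, r, x: r + 1)

# Multiplication: mult(0, x) = 0 and mult(S(y), x) = add(x, mult(y, x))
mult = primitive_recursion(lambda x: 0, lambda y, r, x: add(x, r))

print(add(3, 4), mult(3, 4))   # 7 12
```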

Not every computable function is primitive recursive. For example, Ackermann’s function is a **general recursive function**, but not a primitive recursive function. General recursive functions are Turing complete. Turing machines, lambda calculus, and general recursive functions are all equivalent models of computation, but the first two are better known.

For this post, the main thing about general recursive functions is that, as the name implies, they are more general than primitive recursive functions.
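Ackermann’s function is short enough to quote in full. A sketch in Python:

```python
def ackermann(m, n):
    # General recursive but not primitive recursive: the recursion
    # cannot be bounded by any fixed nesting of loops.
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

print(ackermann(2, 3), ackermann(3, 3))   # 9 61
```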

Now we switch from functions to sets. The **characteristic function** of a set *A* is the function that is 1 for elements of *A* and zero everywhere else. In other areas of math, there is a sort of duality between functions and sets via characteristic functions. For example, the indicator function of a measurable set is a measurable function. And the indicator function of a convex set is a convex function. But in recursive functions, there’s an unexpected wrinkle in this analogy.

A set of integers is **recursively enumerable** if it is either empty or the image of a **general** recursive function. But there’s a theorem, due to Alonzo Church, that a non-empty recursively enumerable set is actually the image of a **primitive** recursive function. So although general recursive functions are more general, you can’t tell that from looking at function images. For example, although the Ackermann function is not primitive recursive, there is a primitive recursive function with the same image.

In finite dimensional Euclidean space, linear functions are continuous. You can put a different norm on a Euclidean space than the one it naturally comes with, but all norms give rise to the same topology and hence the same continuous functions. (This is useful in numerical analysis where you’d like to look at a variety of norms. The norms give different analytical results, but they’re all topologically equivalent.)

In an infinite dimensional normed space, linear functions are not necessarily continuous. If the dimension of a space is only a trillion, all linear functions are continuous, but when you jump from high dimension to infinite dimension, you can have discontinuous linear functions. But if you look at this more carefully, there isn’t a really sudden change.

If a linear function is discontinuous, its finite dimensional approximations are continuous, but the **degree** of continuity is degrading as dimension increases. For example, suppose a linear function stretches the *n*th basis vector by a factor of *n*. The bigger *n* gets, the more the function stretches in the *n*th dimension. As long as *n* is bounded, this is continuous, but in a sense it is less continuous as *n* increases. The fact that the infinite dimensional version is discontinuous tells you that the finite dimensional versions, while technically continuous, scale poorly with dimension. (See practical continuity for more discussion along these lines.)
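The stretching example can be made concrete. Below is a sketch in Python of the diagonal map *T*(*e*_{k}) = *k* *e*_{k} restricted to *n* coordinates; its worst-case stretch factor (its operator norm) is attained at the *n*th basis vector and equals *n*, so it grows without bound with the dimension.

```python
from math import sqrt

def T(x):
    # stretch the k-th coordinate by a factor of k
    return [k * v for k, v in enumerate(x, start=1)]

def norm(x):
    return sqrt(sum(v * v for v in x))

# The stretch factor at the n-th basis vector is exactly n.
for n in [10, 100, 1000]:
    e_n = [0.0] * (n - 1) + [1.0]
    print(norm(T(e_n)) / norm(e_n))
```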

A Banach space is a *complete* normed linear space. Finite dimensional normed spaces are always complete (i.e. every Cauchy sequence in the space converges to a point in the space) but this might not happen in infinite dimensions.

In basic linear algebra, the dual of a vector space *V* is the space of linear functionals on *V*, i.e. the set of linear maps from *V* to the reals. This space is denoted *V*^{*}. If *V* has dimension *n*, *V*^{*} has dimension *n*, and all *n*-dimensional spaces are isomorphic, so the distinction between a space and its dual seems pedantic. But in general a Banach space and its dual are not isomorphic and so it’s easier to tell them apart.

The second dual of a vector space, *V*^{**}, is the dual of the dual space. In finite dimensional spaces, *V*^{**} is naturally isomorphic to *V*. In Banach spaces, *V* is isomorphic to a *subspace* of *V*^{**}. And even when *V* is isomorphic to *V*^{**}, it might not be *naturally* isomorphic to *V*^{**}. (Here “natural” means natural in the category theory sense of natural transformations.)