Here’s a percolation problem for QR codes: What is the probability that there is a path from one side of a QR code to the opposite side? How far across a QR code would you expect to be able to go? For example, the QR code below was generated from my contact information. It’s not possible to go from one side to the other, and the red line shows what I believe is the deepest path into the code from a side.

This could make an interesting programming exercise. A simple version would be to start with a file of bits representing a particular QR code and find the deepest path into the corresponding image.
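Here's a minimal sketch of that simple version in Python. Everything about the conventions is an assumption on my part: the code is a 0/1 grid, a "path" is a 4-connected walk through light (0) modules, entry is from the left edge, and "depth" is the furthest column reached.

```python
from collections import deque

def deepest_path(grid):
    """Breadth-first search from the left edge through light (0) modules.

    Returns the deepest column index reachable, where a "path" is taken
    to mean a 4-connected walk through modules of the same color.
    Returns -1 if no light module sits on the left edge.
    """
    n, m = len(grid), len(grid[0])
    seen = set()
    queue = deque()
    for r in range(n):
        if grid[r][0] == 0:
            queue.append((r, 0))
            seen.add((r, 0))
    deepest = -1
    while queue:
        r, c = queue.popleft()
        deepest = max(deepest, c)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < m and grid[rr][cc] == 0 and (rr, cc) not in seen:
                seen.add((rr, cc))
                queue.append((rr, cc))
    return deepest

# A toy 4x4 "code": a light path reaches column 2 but not the right edge.
toy = [
    [0, 0, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
```

Swapping the roles of light and dark modules, or the entry side, is a one-line change.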

The next step up would be to generate simplified QR codes, requiring certain bits to be set, such as the patterns in three of the four corners that allow a QR reader to orient itself.

The next step in sophistication would be to implement the actual QR encoding algorithm, including its error correction encoding, then use this to encode random data.

(Because of the error correction used by QR codes, you could scan the image above and your QR reader would ignore the red path. It would even work if a fairly large portion of the image were missing because the error correction introduces a lot of redundancy.)

You could say that an empty sum is 0 because 0 is the additive identity and an empty product is 1 because 1 is the multiplicative identity. If you’d like a simple answer, maybe you should stop reading here.

The problem with the answer above is that it doesn’t say why an operation on an empty set should be defined to be the identity for that operation. The identity is certainly a plausible candidate, but why should it make sense to even define an operation on an empty set, and why should the identity turn out so often to be the definition that makes things proceed smoothly?

The convention that the sum over an empty set should be defined as 0, and that a product over an empty set should be defined to be 1 works well in very general settings where “sum”, “product”, “0”, and “1” take on abstract meanings.
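Programming languages bake in the same convention. A quick Python illustration (nothing here is specific to Python; most languages with a fold/reduce behave the same way):

```python
import math

# A sum over no terms is the additive identity,
# and a product over no factors is the multiplicative identity.
empty_sum = sum([])        # 0
empty_product = math.prod([])  # 1

# The convention keeps identities like
# prod(A + B) == prod(A) * prod(B) true even when one list is empty.
a, b = [2, 3], []
assert math.prod(a + b) == math.prod(a) * math.prod(b)
```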

The **ultimate generalization** of products is the notion of products in category theory. Similarly, the ultimate generalization of sums is categorical co-products. (Co-products are sometimes called sums, but they’re usually called co-products due to a symmetry with products.) Category theory simultaneously addresses a wide variety of operations that could be called products or sums (co-products).

The particular advantage of bringing category theory into this discussion is that it has definitions of product and co-product that are the same for any number of objects, including zero objects; there is no special definition for empty products. Empty products and co-products are a **consequence** of a more general definition, **not special cases** defined by convention.

In the category of sets, products are Cartesian products. The product of a set with *n* elements and one with *m* elements is one with *nm* elements. Also in the category of sets, co-products are disjoint unions. The co-product of a set with *n* elements and one with *m* elements is one with *n+m* elements. These examples show a connection between products and sums in arithmetic and products and co-products in category theory.

You can find the full definition of a categorical product here. Below I give the definition leaving out details that go away when we look at empty products.

The product of a set of objects is an object *P* such that given any other object *X* … there exists a unique morphism from *X* to *P* such that ….

If you’ve never seen this before, you might rightfully wonder what in the world this has to do with products. You’ll have to trust me on this one. [1]

When the set of objects is empty, the missing parts of the definition above don’t matter, so we’re left with requiring that there is a unique morphism [2] from each object *X* to the product *P*. In other words, *P* is a terminal object, often denoted 1. **So in category theory, you can say empty products are 1**.

But that seems like a leap, since “1” now takes on a new meaning that isn’t obviously connected to the idea of 1 we learned as a child. How is an object such that every object has a unique arrow to it at all like, say, the number of noses on a human face?

We drew a connection between arithmetic and categories before by looking at the cardinality of sets. We could define the product of the numbers *n* and *m* as the number of elements in the product of a set with *n* elements and one with *m* elements. Similarly we could define 1 as the cardinality of the terminal object, also denoted 1. This works because there is a unique map from any set to a set with one element. Pick your favorite one-element set and call it 1. Any other choice is isomorphic to your choice.

Now for empty sums. The following is the definition of co-product (sum), leaving out details that go away when we look at empty co-products.

The co-product of a set of objects is an object *S* such that given any other object *X* … there exists a unique morphism from *S* to *X* such that ….

As before, when the set of objects is empty, the missing parts don’t matter. Notice that the direction of the arrow in the definition is reversed: there is a unique morphism from the co-product *S* to any object *X*. In other words, *S* is an initial object, denoted for good reasons as 0. [3]

In set theory, the initial object is the empty set. (If that hurts your head, you’re not alone. But if you think of functions as sets of ordered pairs, it makes a little more sense. The function from the empty set to any other set is an empty set of ordered pairs!) The cardinality of the initial object 0 is the integer 0, just as the cardinality of the terminal object 1 is the integer 1.

===

[1] Category theory has to define operations entirely in terms of objects and morphisms. It can’t look inside an object and describe things in terms of elements the way you’d usually do to define the product of two numbers or two sets, so the definition of product has to look very different. The benefit of this extra work is a definition that applies much more generally.

To understand the general definition of products, start by understanding the product of two objects. Then learn about categorical limits and how products relate to limits. (As with products, the categorical definition of limits will look entirely different from familiar limits, but they’re related.)

[2] Morphisms are a generalization of functions. In the category of sets, morphisms *are* functions.

[3] Sometimes initial objects are denoted by ∅, the symbol for the empty set, and sometimes by 0. To make things more confusing, a “zero,” spelled out as a word rather than a symbol, has a different but related meaning in category theory: an object that is both initial and terminal.

Other methods, such as fuzzy logic, may be useful, though they must violate common sense (at least as defined by Cox’s theorem) under some circumstances. They remain useful when they provide approximately the results that probability would have provided, at less effort, and stay away from edge cases that deviate too far from common sense.

There are various kinds of uncertainty, principally epistemic uncertainty (lack of knowledge) and aleatory uncertainty (randomness), and various philosophies for how to apply probability. One advantage to the Bayesian approach is that it handles epistemic and aleatory uncertainty in a unified way.

Blog posts related to quantifying uncertainty:

- How loud is the evidence?
- The law of small numbers
- Example of the law of small numbers
- Laws of large numbers and small numbers
- Plausible reasoning
- What is a confidence interval?
- Learning is not the same as gaining information
- What a probability means
- Irrelevant uncertainty
- Probability and information
- False positives for medical papers
- False positives for medical tests
- Most published research results are false
- Determining distribution parameters from quantiles
- Fitting a triangular distribution
- Musicians, drunks, and Oliver Cromwell

This exercise gave me confidence that mathematical definitions were created by ordinary mortals like myself. It also began my habit of examining definitions carefully to understand what motivated them.

One question that comes up frequently is why zero factorial equals 1. The pedantic answer is “Because it is defined that way.” This answer alone is not very helpful, but it does lead to the more refined question: Why is 0! defined to be 1?

The answer to the revised question is that many formulas are simpler if we define 0! to be 1. If we defined 0! to be 0, for example, countless formulas would have to add disqualifiers such as “except when *n* is zero.”

For example, the binomial coefficients are defined by

*C*(*n*, *k*) = *n*! / (*k*! (*n* – *k*)!).

The binomial coefficient *C*(*n*, *k*) tells us how many ways one can take a set of *n* things and select *k* of them. For example, the number of ways to deal a hand of five cards from a deck of 52 is *C*(52, 5) = 52! / (5! 47!) = 2,598,960.

How many ways are there to deal a hand of 52 cards from a deck of 52 cards? Obviously one: the deck is the hand. But our formula says the answer is

*C*(52, 52) = 52! / (52! 0!),

and the formula is only correct if 0! = 1. If 0! were defined to be anything else, we’d have to say “The number of ways to deal a hand of *k* cards from a deck of *n* cards is *C*(*n*, *k*), **except** when *k* = 0 or *k* = *n*, in which case the answer is 1.” (See [1] below for picky details.)

The example above is certainly not the only one where it is convenient to define 0! to be 1. Countless theorems would be more awkward to state if 0! were defined any other way.
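A quick sanity check of the card-dealing numbers above, using Python's factorial (which follows the 0! = 1 convention):

```python
import math

assert math.factorial(0) == 1

# Number of 5-card hands from a 52-card deck
hands = math.factorial(52) // (math.factorial(5) * math.factorial(47))
assert hands == 2598960

# Dealing the entire deck: the formula gives 1 only because 0! = 1
whole_deck = math.factorial(52) // (math.factorial(52) * math.factorial(0))
assert whole_deck == 1
```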

Sometimes people appeal to the gamma function for justification that 0! should be defined to be 1. The gamma function extends factorial to real numbers, and the gamma function value associated with 0! is 1. (In detail, *n*! = Γ(*n*+1) for positive integers *n* and Γ(1) = 1.) This is reassuring, but it raises another question: Why should the gamma function be authoritative?

Indeed, there are many ways to extend factorial to non-integer values, and historically many ways were proposed. However, the gamma function won and its competitors have faded into obscurity. So why did it win? Analogous to the discussion above, we could say that the gamma function won because more formulas work out simply with this definition than with others. That is, you can very often replace *n*! with Γ(*n* + 1) in a formula valid for positive integer values of *n* and get a new formula valid for real or even complex values of *n*.

There is another reason why gamma won, and that’s the Bohr–Mollerup theorem. It says that if you’re looking for a function *f*(*x*) defined for *x* > 0 that satisfies *f*(1) = 1 and *f*(*x*+1) = *x* *f*(*x*), then the gamma function is the only log-convex solution. Why should we look for log-convex functions? Because factorial is log-convex, and so this is a natural property to require of its extension.

**Update**: Occasionally I hear someone say that the gamma function (shifting its argument by 1) is the only analytic function that extends factorial to the complex plane, but this isn’t true. For example, if you add sin(πx) to the gamma function, you get another analytic function that takes on the same values as gamma for positive integer arguments.
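Both claims are easy to check numerically. The following sketch uses the standard library's `math.gamma`; comparisons are approximate because of floating point.

```python
import math

# The gamma function extends factorial: n! = gamma(n+1).
for n in range(1, 10):
    assert math.isclose(math.gamma(n + 1), math.factorial(n))

def g(x):
    # Another analytic function agreeing with gamma at the positive
    # integers, since sin(pi * n) = 0 there.
    return math.gamma(x) + math.sin(math.pi * x)

for n in range(1, 10):
    assert math.isclose(g(n), math.gamma(n), rel_tol=1e-12)
```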

**Related posts**:

- Why are empty products 1?
- Why are natural logarithms natural?
- Another reason natural logarithms are natural

===

[1] Theorems about binomial coefficients have to make some restrictions on the arguments. See these notes for full details. But in the case of dealing cards, the only necessary constraints are the **natural** ones: we assume the number of cards in the deck and the number we want in a hand are non-negative integers, and that we’re not trying to draw more cards for a hand than there are in a deck. Defining 0! as 1 keeps us from having to make any **unnatural** qualifications such as “unless you’re dealing the entire deck.”

I tried. I tried to learn some statistics actually when I was younger and it’s a beautiful subject. But at the time I think I found the shakiness of the philosophical underpinnings too scary for me. I felt a little nauseated all the time. Math is much more comfortable. You know where you stand. You know what’s proved and what’s not. It doesn’t have quite the same ethical and moral dimension that statistics has. I was never able to get comfortable with it the way my parents were.

In its simplest form the 80-20 rule says 80% of your outputs come from 20% of your inputs. You might find that 80% of your revenue comes from 20% of your customers, or 80% of your headaches come from 20% of your employees, or 80% of your sales come from 20% of your sales reps. The exact numbers 80 and 20 are not important, though they work surprisingly well as a rule of thumb.

The more general principle is that a large portion of your results come from a small portion of your inputs. Maybe it’s not 80-20 but something like 90-5, meaning 90% of your results come from 5% of your inputs. Or 90-13, or 95-10, or 80-25, etc. Whatever the proportion, it’s usually the case that some inputs are far more important than others. The alternative, assuming that everything is equally important, is usually absurd.

The 80-20 rule sounds too good to be true. If 20% of inputs are so much more important than the others, why don’t we just concentrate on those? In an earlier post, I gave four reasons. These were:

- We don’t look for 80/20 payoffs. We don’t see 80/20 rules because we don’t think to look for them.
- We’re not clear about criteria for success. You can’t concentrate your efforts on the 20% with the biggest returns until you’re clear on how you measure returns.
- We’re unclear how inputs relate to outputs. It may be hard to predict what the most productive activities will be.
- We enjoy less productive activities more than more productive ones. We concentrate on what’s fun rather than what’s effective.

I’d like to add another reason to this list, and that is that we may find it hard to believe just how unevenly distributed the returns on our efforts are. We may have an idea of how things are ordered in importance, but **we don’t appreciate just how much more important the most important things are**. We mentally compress the range of returns on our efforts.

Making a list of options suggests the items on the list are roughly equally effective, say within an order of magnitude of each other. But it may be that the best option would be 100 times as effective as the next best option. (I’ve often seen that, for example, in optimizing software. Several ideas would reduce runtime by a few percent, while one option could reduce it by a couple orders of magnitude.) If the best option also takes the most effort, it may not seem worthwhile because we underestimate just how much we get in return for that effort.

If you’d like to contribute an endorsement, please contact me.

The appeal of magic is that it promises to render objects plastic to the will without one’s getting too entangled with them. Treated at arm’s length, the object can issue no challenge to the self. … The clearest contrast … that I can think of is the repairman, who must submit himself to the broken washing machine, listen to it with patience, notice its symptoms, and then act accordingly. He cannot treat it abstractly; the kind of agency he exhibits is not at all magical.

**Related post**: Programming languages and magic

Economic forecasting is useful for predicting the future up to about ten years ahead. Beyond ten years the quantitative changes which the forecast assesses are usually sidetracked or made irrelevant by qualitative changes in the rules of the game. Qualitative changes are produced by human cleverness … or by human stupidity … Neither cleverness nor stupidity are predictable.

Source: Infinite in All Directions, Chapter 10, Engineers’ Dreams.

In everyone’s pocket right now is a computer far more powerful than the one we flew on Voyager, and I don’t mean your cell phone—I mean the key fob that unlocks your car.

These days *technology* is equated with *computer technology*. For example, the other day I heard someone talk about bringing chemical engineering and technology together, as if chemical engineering isn’t technology. If technology only means computer technology, then the *Voyager* probes are very low-tech.

And yet *Voyager 1* has left the solar system! (Depending on how you define the solar system.*) It’s the most distant man-made object, about 20 billion kilometers away. It’s still sending back data 38 years after it launched, and is expected to keep doing so for a few more years before its power supply runs too low. *Voyager 2* is doing fine as well, though it’s taking longer to leave the solar system. Surely this is a far greater technological achievement than a key fob.

===

* *Voyager 1* has left the heliosphere, far beyond Pluto, and is said to be in the “interstellar medium.” But it won’t reach the Oort cloud for another 300 years and won’t leave the Oort cloud for 30,000 years.

Source: The Interstellar Age: Inside the Forty-Year Voyager Mission

Introductory courses explain Monte Carlo integration as follows.

- Plot the function you want to integrate.
- Draw a box that contains the graph.
- Throw darts (random points) at the box.
- Count the proportion of darts that land between the graph and the horizontal axis.
- Estimate the area under the graph by multiplying the area of the box by the proportion above.
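The recipe above can be sketched in a few lines of Python. The integrand, box, and dart count are my own toy choices: the integral of x² over [0, 1], which equals 1/3, with the unit square as the box.

```python
import random

def naive_mc_integrate(f, a, b, height, n=100_000, seed=42):
    """Estimate the integral of f over [a, b] by throwing darts at the
    box [a, b] x [0, height]. Assumes 0 <= f(x) <= height on [a, b]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.uniform(a, b)
        y = rng.uniform(0, height)
        if y <= f(x):
            hits += 1
    # Area of the box times the proportion of darts under the graph
    return (b - a) * height * hits / n

estimate = naive_mc_integrate(lambda x: x * x, 0.0, 1.0, 1.0)
```

With 100,000 darts the estimate is typically good to about two digits; the error shrinks like 1/√n.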

In principle this is correct, but this is far from how Monte Carlo integration is usually done in practice.

For one thing, Monte Carlo integration is seldom used to integrate functions of one variable. Instead, it is mostly used on functions of many variables, maybe hundreds or thousands of variables. This is because more efficient methods exist for low-dimensional integrals, but very high dimensional integrals can usually only be computed using Monte Carlo or some variation like quasi-Monte Carlo.

If you draw a box around your integrand, especially in high dimensions, it may be that nearly all your darts fall outside the region you’re interested in. For example, suppose you throw a billion darts and none land inside the volume determined by your integration problem. Then the point estimate for your integral is 0. Assuming the true value of the integral is positive, the relative error in your estimate is 100%. You’ll need a lot more than a billion darts to get an accurate estimate. But is this example realistic? Absolutely. Nearly all the volume of a high-dimensional cube is in the “corners” and so putting a box around your integrand is naive. (I’ll elaborate on this below. [1])

So how do you implement Monte Carlo integration in practice? The next step up in sophistication is to use “importance sampling.” [2] Conceptually you’re still throwing darts at a box, but not with a uniform distribution. You find a probability distribution that approximately matches your integrand, and throw darts according to that distribution. The better the fit, the more efficient the importance sampler. You could think of naive importance sampling as using a uniform distribution as the importance sampler. It’s usually not hard to find an importance sampler much better than that. The importance sampler is so named because it concentrates more samples in the important regions.

Importance sampling isn’t the last word in Monte Carlo integration, but it’s a huge improvement over naive Monte Carlo.

===

[1] So what does it mean to say most of the volume of a high-dimensional cube is in the corners? Suppose you have an *n*-dimensional cube that runs from -1 to 1 in each dimension and you have a ball of radius 1 inside the cube. To make the example a little simpler, assume *n* is even, *n* = 2*k*. Then the volume of the cube is 4^{k} and the volume of the sphere is π^{k} / *k*!. If *k* = 1 (*n* = 2) then the sphere (circle in this case) takes up π/4 of the volume (area), about 79%. But when *k* = 100 (*n* = 200), the ball takes up 3.46×10^{-169} of the volume of the cube. You could never generate enough random samples from the cube to ever hope to land a single point inside the ball.
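These numbers are easy to reproduce, working in log space to avoid overflow:

```python
import math

def ball_to_cube_ratio(k):
    """Volume of the unit ball in n = 2k dimensions, pi^k / k!,
    divided by the volume 4^k of the cube [-1, 1]^n."""
    log_ratio = k * math.log(math.pi) - math.lgamma(k + 1) - k * math.log(4)
    return math.exp(log_ratio)

# k = 1 (n = 2): the circle fills pi/4, about 79%, of the square.
# k = 100 (n = 200): reproduces the 3.46e-169 figure above.
```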

[2] In a nutshell, importance sampling replaces the problem of integrating *f*(*x*) with that of integrating (*f*(*x*) / *g*(*x*)) *g*(*x*) where *g*(*x*) is the importance sampler, a probability density. Then the integral of (*f*(*x*) / *g*(*x*)) *g*(*x*) is the expected value of (*f*(*X*) / *g*(*X*)) where *X* is a random variable with density given by the importance sampler. It’s often a good idea to use an importance sampler with slightly heavier tails than the original integrand.
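Here is a bare-bones illustration of [2] in Python. The integrand and sampler are my own choices for the example: f is a Gaussian bump whose integral over the real line is √(2π), and the importance sampler is a standard Cauchy, which matches f's general shape but has heavier tails, per the advice above.

```python
import math
import random

def importance_sample(f, g_density, g_sampler, n=100_000, seed=42):
    """Estimate the integral of f as the average of f(X)/g(X), X ~ g."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = g_sampler(rng)
        total += f(x) / g_density(x)
    return total / n

# Integrand: a Gaussian bump; its exact integral is sqrt(2*pi).
f = lambda x: math.exp(-0.5 * x * x)

# Importance sampler: standard Cauchy density and sampler
# (inverse-CDF sampling via the tangent function).
g_density = lambda x: 1.0 / (math.pi * (1.0 + x * x))
g_sampler = lambda rng: math.tan(math.pi * (rng.random() - 0.5))

estimate = importance_sample(f, g_density, g_sampler)
```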

If you’d like some help with numerical integration, let me know.

The following illustration of this difference comes from a talk by Luis Pericci last week. He attributes the example to “Bernardo (2010)” though I have not been able to find the exact reference.

In an experiment to test the existence of extra sensory perception (ESP), researchers wanted to see whether a person could influence some process that emitted binary data. (I’m going from memory on the details here, and I have not found Bernardo’s original paper. However, you could ignore the experimental setup and treat the following as hypothetical. The point here is not to investigate ESP but to show how Bayesian and Frequentist approaches could lead to opposite conclusions.)

The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is *p* = 0.5. The alternative hypothesis was that *p* is not 0.5. There were *N* = 104,490,000 bits emitted during the experiment, and *s* = 52,263,471 were 1’s. The *p*-value, the probability of an imbalance this large or larger under the assumption that *p* = 0.5, is 0.0003. Such a tiny *p*-value would be regarded as extremely strong evidence in favor of ESP given the way *p*-values are commonly interpreted.

The Bayes factor, however, is 18.7, meaning that the null hypothesis appears to be about 19 times more likely than the alternative. The alternative in this example uses Jeffreys’ prior, Beta(0.5, 0.5).

So given the data and assumptions in this example, the Frequentist concludes there is very strong evidence **for** ESP while the Bayesian concludes there is strong evidence **against** ESP.

The following Python code shows how one might calculate the *p*-value and Bayes factor.

from scipy.stats import binom
from numpy import log, exp
from scipy.special import betaln

N = 104490000
s = 52263471

# sf is the survival function, i.e. complementary cdf
# ccdf multiplied by 2 because we're doing a two-sided test
print("p-value: ", 2*binom.sf(s, N, 0.5))

# Compute the log of the Bayes factor to avoid underflow.
logbf = N*log(0.5) - betaln(s+0.5, N-s+0.5) + betaln(0.5, 0.5)
print("Bayes factor: ", exp(logbf))

Here are some of the pros and cons of the term. (Listing “cons” first seems backward, but I’m currently leaning toward the pro side, so I thought I should conclude with it.)

The term “data scientist” is sometimes used to imply more novelty than is there. There’s not a great deal of difference between data science and statistics, though the new term is more fashionable. (Someone quipped that data science is statistics on a Mac.)

Similarly, the term *data scientist* is sometimes used as an excuse for ignorance, as in “I don’t understand probability and all that stuff, but I don’t need to because I’m a data scientist, not a statistician.”

The big deal about data science isn’t data but the science of drawing *inferences* from the data. *Inference science* would be a better term, in my opinion, but that term hasn’t taken off.

*Data science* could be a useful umbrella term for statistics, machine learning, decision theory, etc. Also, the title *data scientist* is rightfully associated with people who have better computational skills than statisticians typically have.

While the term *data science* isn’t perfect, there’s little to recommend the term *statistics* other than that it is well established. The root of *statistics* is *state*, as in a *government*. This is because statistics was first applied to the concerns of bureaucracies. The term *statistics* would be equivalent to *governmentistics*, a historically accurate but otherwise useless term.

So a request like “Please send me the data from your experiment” becomes “Please send me the measurements from your experiment.” Same thing.

But rousing statements about the power of data become banal or even ridiculous. For example, here’s an article from Forbes after substituting *measurements* for *data*:


The Hottest Jobs In IT: Training Tomorrow’s Measurements Scientists

If you thought good plumbers and electricians were hard to find, try getting hold of a measurements scientist. The rapid growth of big measurements and analytics for use within businesses has created a huge demand for people capable of extracting knowledge from measurements.

…

Some of the top positions in demand include business intelligence analysts, measurements architects, measurements warehouse analysts and measurements scientists, Reed says. “We believe the demand for measurements expertise will continue to grow as more companies look for ways to capitalize on this information,” he says.

…

This distinction is valid in **broad strokes**, though things are fuzzier than it admits. Some statisticians are content with constructing models, while others look further down the road to how the models are used. And machine learning experts vary in their interest in creating accurate models.

Clinical trial design usually comes under the heading of statistics, though in spirit it’s more like machine learning. The goal of a clinical trial is to answer some question, such as whether a treatment is safe or effective, while also having safeguards in place to stop the trial early if necessary. There is an underlying model—implicit in some methods, more often explicit in newer methods—that guides the conduct of the trial, but the accuracy of this model *per se* is not the primary goal. Some designs have been shown to be fairly robust, leading to good decisions even when the underlying probability model does not fit well. For example, I’ve done some work with clinical trial methods that model survival times with an exponential distribution. No one believes that an exponential distribution, i.e. one with constant hazard, accurately models survival times in cancer trials, and yet methods using these models do a good job of stopping trials early that should stop early and letting trials continue that should be allowed to continue.

Experts in machine learning are more accustomed to the idea of inaccurate models sometimes producing good results. The best example may be naive Bayes classifiers. The word “naive” in the name is a frank admission that these classifiers treat events as independent that are known not to be independent. These methods can do well at their ultimate goal, such as distinguishing spam from legitimate email, even though they make a drastic simplifying assumption.

There have been papers that look at why naive Bayes works surprisingly well. Naive Bayes classifiers work well when the errors due to wrongly assuming independence affect positive and negative examples roughly equally. The inaccuracies of the model sort of wash out when the model is reduced to a binary decision, classifying as positive or negative. Something similar happens with the clinical trial methods mentioned above. The ultimate goal is to make correct go/no-go decisions, not to accurately model survival times. The naive exponential assumption affects both trials that should and should not stop, and the model predictions are reduced to a binary decision.
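To make the naive Bayes example concrete, here's a toy classifier in pure Python. The documents and words are invented for illustration; a real spam filter would differ in many ways, but the structure, multiplying per-word probabilities as if words were independent, is the essential point.

```python
import math
from collections import Counter

def train_nb(docs):
    """Train a toy naive Bayes classifier on (label, words) pairs,
    with add-one smoothing. A sketch, not production code."""
    priors = Counter()
    counts = {}
    vocab = set()
    for label, words in docs:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    total_docs = sum(priors.values())
    model = {}
    for label, c in counts.items():
        denom = sum(c.values()) + len(vocab)
        model[label] = (
            math.log(priors[label] / total_docs),
            {w: math.log((c[w] + 1) / denom) for w in vocab},
            math.log(1 / denom),  # fallback for unseen words
        )
    return model

def classify(model, words):
    # Score each class as if word occurrences were independent events.
    def score(label):
        log_prior, log_probs, fallback = model[label]
        return log_prior + sum(log_probs.get(w, fallback) for w in words)
    return max(model, key=score)

docs = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["win", "prize"]),
    ("ham", ["meeting", "tomorrow"]),
    ("ham", ["lunch", "tomorrow", "money"]),
]
model = train_nb(docs)
```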

One way to fit a triangular distribution to data would be to set *a* to the minimum value and *b* to the maximum value. You could pick *a* and *b* to be the smallest and largest *possible* values, if these values are known. Otherwise you could use the smallest and largest values in the data, or make the interval a little larger if you want the density to be positive at the extreme data values.

How do you pick *c*? One approach would be to pick it so the resulting distribution has the same mean as the data. The triangular distribution has mean

(*a* + *b* + *c*)/3

so you could simply solve for *c* to match the sample mean.

Another approach would be to pick *c* so that the resulting distribution has the same *median* as the data. This approach is more interesting because it cannot always be done.

Suppose your sample median is *m*. You can always find a point *c* so that half the area of the triangle lies to the left of a vertical line drawn through *m*. However, this might require the foot *c* to be to the left or the right of the base [*a*, *b*]. In that case the resulting triangle is obtuse and so sides of the triangle do not form the graph of a function.

For the triangle to give us the graph of a density function, *c* must be in the interval [*a*, *b*]. Such a density has a median in the range

[*b* – (*b* – *a*)/√2, *a* + (*b* – *a*)/√2].

If the sample median *m* is in this range, then we can solve for *c* so that the distribution has median *m*. The solution is

*c* = *b* – 2(*b* – *m*)^{2} / (*b* – *a*)

if *m* < (*a* + *b*)/2 and

*c* = *a* + 2(*a* – *m*)^{2} / (*b* – *a*)

otherwise.

For example, suppose you have measured the value of a function at 100 points. Unbeknownst to you, the data come from a cubic polynomial plus some noise. You can fit these 100 points exactly with a 99th degree polynomial, but this gives you the illusion that you’ve learned more than you really have. But if you divide your data into test and training sets of 50 points each, overfitting on the training set will result in a terrible fit on the test set. If you fit a cubic polynomial to the training data, you should do well on the test set. If you fit a 49th degree polynomial to the training data, you’ll fit it perfectly, but do a horrible job with the test data.
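You can watch this happen with a little NumPy. I've scaled the example down to 10 training and 10 test points with a 9th-degree fit so the polynomial arithmetic stays numerically tame, but the phenomenon is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The hidden truth: a cubic polynomial.
    return 2 * x**3 - x + 0.5

x_train = np.linspace(-1, 1, 10)
x_test = rng.uniform(-1, 1, 10)
y_train = f(x_train) + 0.05 * rng.standard_normal(10)
y_test = f(x_test) + 0.05 * rng.standard_normal(10)

def rmse(degree, x_eval, y_eval):
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.sqrt(np.mean((np.polyval(coeffs, x_eval) - y_eval) ** 2)))

# Degree 9 interpolates the 10 training points essentially exactly...
train_err_9 = rmse(9, x_train, y_train)
# ...which says nothing about how it does on fresh data.
test_err_9 = rmse(9, x_test, y_test)

train_err_3 = rmse(3, x_train, y_train)
test_err_3 = rmse(3, x_test, y_test)
```

The degree-9 fit drives the training error to numerical zero while its test error stays at or above the noise level; the cubic leaves residual noise in the training set but generalizes.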

Now suppose we have two kinds of models to fit. We train each on the training set, and pick the one that does better on the test set. We’re not over-fitting because we haven’t used the test data to fit our model. Except we really are: we used the test set to select a model, though we didn’t use the test set to fit the parameters in the two models. Think of a larger model as a tree. The top of the tree tells you which model to pick, and under that are the parameters for each model. When we think of this new hierarchical model as “the model,” then we’ve used our test data to fit part of the model, namely to fit the bit at the top.

With only two models under consideration, this isn’t much of a problem. But if you have a machine learning package that tries millions of models, you can be over-fitting in a subtle way, and this can give you more confidence in your final result than is warranted.

The distinction between parameters and models is fuzzy. That’s why “Bayesian model averaging” is ultimately just Bayesian modeling. You could think of the model selection as simply another parameter. Or you could go the other way around and think of each parameter value as an index for a family of models. So if you say you’re only using the test data to select models, not parameters, you could be fooling yourself.

For example, suppose you want to fit a linear regression to a data set. That is, you want to pick *m* and *b* so that *y* = *mx* + *b* is a good fit to the data. But now I tell you that you are only allowed to fit models with one degree of freedom. You’re allowed to do cross validation, but you’re only allowed to use the test data for model selection, not model fitting.

So here’s what you could do. Pick a constant value of *b*, call it *b*_{0}. Now fit the one-parameter model *y* = *mx* + *b*_{0} on your training data, selecting the value of *m* only to minimize the error in fitting the training set. Now pick another value of *b*, call it *b*_{1}, and see how well it does on the test set. Repeat until you’ve found the best value of *b*. You’ve essentially used the training and test data to fit a *two*-parameter model, albeit awkwardly.
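Here is a sketch of that scheme; the data, the grid of candidate intercepts, and the seed are invented for illustration. Each fixed intercept *b* gives a legal one-parameter model: the training set picks the slope *m*, the test set “selects” *b*, and together they recover both parameters of the line.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.uniform(0, 10, 40)
y = 3 * x + 7 + rng.standard_normal(40)   # true slope 3, intercept 7

x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

best = None
for b in np.linspace(0, 14, 141):   # candidate values b_0, b_1, ...
    # One-parameter model y = m x + b: least squares for m on the training set
    m = np.sum(x_train * (y_train - b)) / np.sum(x_train**2)
    # "Model selection" on the test data
    err = np.mean((y_test - (m * x_test + b)) ** 2)
    if best is None or err < best[0]:
        best = (err, m, b)

_, m, b = best
print(m, b)   # close to the true slope and intercept: a two-parameter fit in disguise
```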

In particular I wonder what applications there may be of number theory, especially analytic number theory. I’m not thinking of the *results* of number theory but rather the elegant *machinery* developed to attack problems in number theory. I expect more of this machinery could be useful to problems outside of number theory.

I also wonder about category theory. The theory certainly finds uses within pure mathematics, but I’m not sure how useful it is in direct application to problems outside of mathematics. Many of the reported applications don’t seem like applications at all, but window dressing applied after-the-fact. On the other hand, there are also instances where categorical thinking led the way to a solution, but did its work behind the scenes; once a solution was in hand, it could be presented more directly without reference to categories. So it’s hard to say whether applications of category theory are over-reported or under-reported.

The mathematical literature can be misleading. When researchers say their work has various applications, they may be blowing smoke. At the same time, there may be real applications that are never mentioned in journals, either because the work is proprietary or because it is not deemed *original* in the academic sense of the word.

Hereafter, when they come to model heaven

And calculate the stars: how they will wield

The mighty frame, how build, unbuild, contrive

To save appearances, how gird the sphere

With centric and eccentric scribbled o’er,

Cycle and epicycle, orb in orb.

**Related post** Quaternions in Paradise Lost

This is a special case of Beatty’s theorem.

A common model says that men’s and women’s heights are normally distributed with means of 70 and 64 inches respectively, both with a standard deviation of 3 inches. A woman with negative height would be 21.33 standard deviations below the mean, and a man with negative height would be 23.33 standard deviations below the mean. These events have probability 3 × 10^{-101} and 10^{-120} respectively. Or to write them out in full

0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003

and

0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001.

As I mentioned on Twitter yesterday, if you’re worried about probabilities that require scientific notation to write down, you’ve probably exceeded the resolution of your model. I imagine most probability models are good to two or three decimal places at most. When model probabilities are extremely small, factors outside the model become more important than ones inside.

According to Wolfram Alpha, there are around 10^{80} atoms in the universe. So picking one particular atom at random from all atoms in the universe would be on the order of a billion trillion times more likely than running into a woman with negative height. Of course negative heights are not just unlikely, they’re impossible. As you travel from the mean out into the tails, the first problem you encounter with the normal approximation is not that the probability of negative heights is over-estimated, but that the probability of extremely short and extremely tall people is *under*-estimated. There exist people whose heights would be impossibly unlikely according to this normal approximation. See examples here.

Probabilities such as those above have no practical value, but it’s interesting to see how you’d compute them anyway. You could find the probability of a man having negative height by typing `pnorm(-23.33)` into R or `scipy.stats.norm.cdf(-23.33)` into Python. Without relying on such software, you could use the bounds

φ(*x*) |*x*| / (*x*^{2} + 1) < Φ(*x*) < φ(*x*) / |*x*|, valid for *x* < 0,

where φ and Φ are the standard normal density and distribution function, with *x* equal to -21.33 and -23.33. For a proof of these bounds and tighter bounds see these notes.
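Here’s a sketch of both routes in Python. I use `math.erfc` rather than SciPy so it needs only the standard library, via the identity Φ(*x*) = erfc(−*x*/√2)/2, and I take the bounds to be the standard normal tail bounds φ(*x*)|*x*|/(*x*^{2} + 1) < Φ(*x*) < φ(*x*)/|*x*| for negative *x*.

```python
from math import erfc, exp, pi, sqrt

def phi(x):
    # standard normal density
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):
    # standard normal CDF via the complementary error function
    return erfc(-x / sqrt(2)) / 2

for x in (-21.33, -23.33):
    lower = phi(x) * abs(x) / (x * x + 1)
    upper = phi(x) / abs(x)
    print(lower, Phi(x), upper)   # the CDF value falls between the bounds
```

The two CDF values come out around 3 × 10^{-101} and 10^{-120}, matching the figures above, and `scipy.stats.norm.cdf(-23.33)` should agree.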

In the Star Trek episode “All Our Yesterdays” the people of the planet Sarpeidon have escaped into their past because their sun is about to become a supernova. They did this via a time machine called the Atavachron.

One detail of the episode has stuck with me since I first saw it many years ago: although people can go back to any period in history, they have to be prepared somehow, and once prepared they cannot go back. Kirk, Spock, and McCoy only have hours to live because they traveled back in time via the Atavachron without being properly prepared. (Kirk is in a period analogous to Renaissance England while Spock and McCoy are in an ice age.)

If such time travel were possible, I expect you would indeed need to be prepared. Life in Renaissance England or the last ice age would be miserable for someone with contemporary expectations, habits, fitness, etc., though things weren’t as bad for the people at the time. Neither would life be entirely pleasant for someone thrust into our time from the past. Cultures work out their own solutions to life’s problems, and these solutions form a package. It may not be possible to swap components in and out à la carte and maintain a working solution.

If that’s the case, why aren’t more phenomena normally distributed? Someone asked me this morning specifically about phenotypes with many genetic inputs.

The central limit theorem says that the sum of many **independent**, **additive** effects is approximately normally distributed [2]. Genes are more digital than analog, and do not produce independent, additive effects. For example, the effects of dominant and recessive genes act more like max and min than addition. Genes do not appear independently—if you have some genes, you’re more likely to have certain other genes—nor do they act independently—some genes determine how other genes are expressed.
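A quick simulation of this contrast, with purely illustrative uniform “effects”: combining ten independent effects by addition gives a symmetric, bell-shaped result, while combining the same effects by max, as with a dominant gene, gives a skewed one.

```python
import numpy as np

rng = np.random.default_rng(3)
effects = rng.uniform(0, 1, (100_000, 10))   # 10 independent effects per individual

added = effects.sum(axis=1)   # additive combination
maxed = effects.max(axis=1)   # dominant-gene-style combination

def skewness(v):
    d = v - v.mean()
    return np.mean(d**3) / v.std() ** 3

print(skewness(added))   # near 0: symmetric, already close to normal
print(skewness(maxed))   # strongly negative: piled up near 1, not bell-shaped
```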

Height is influenced by environmental effects as well as genetic effects, such as nutrition, and these environmental effects may be more additive or independent than genetic effects.

Incidentally, if effects are independent but multiplicative rather than additive, the result may be approximately log-normal rather than normal.
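A companion sketch for the multiplicative case, again with invented uniform factors: the product of 30 independent positive factors is heavily skewed, but its logarithm, being a sum of 30 independent terms, is close to normal. That is the log-normal picture.

```python
import numpy as np

rng = np.random.default_rng(4)
factors = rng.uniform(0.5, 2.0, (100_000, 30))
product = factors.prod(axis=1)   # multiplicative combination of effects

def skewness(v):
    d = v - v.mean()
    return np.mean(d**3) / v.std() ** 3

print(skewness(np.log(product)))   # near 0: the log is approximately normal
print(skewness(product))           # large and positive: the raw product is not
```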

***

Fine print:

[1] Men’s heights follow a normal distribution, and so do women’s. Adults not sorted by sex follow a mixture distribution as described here and so the distribution is flatter on top than a normal. It gets even more complicated when you consider that there are slightly more women than men in the world. And as with many phenomena, the normal distribution is a better description near the middle than at the extremes.

[2] There are many variations on the central limit theorem. The classical CLT requires that the random variables in the sum be identically distributed as well, though that isn’t so important here.

Of course lie detectors can’t tell whether someone is lying. They can only tell whether someone is exhibiting physiological behavior believed to be associated with lying. How well the latter predicts the former is a matter of debate.

I saw a presentation of a machine learning package the other day. Some of the questions implied that the audience had a magical understanding of machine learning, as if an algorithm could extract answers from data that do not contain the answer. The software simply searches for patterns in data by seeing how well various possible patterns fit, but there may be no pattern to be found. Machine learning algorithms cannot *generate* information that isn’t there any more than a polygraph machine can predict the future.
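Here’s that point in miniature; the data and the choice of a 1-nearest-neighbor rule are my own. A flexible algorithm can always find a “pattern” in its training data, but when the labels carry no information about the features, nothing transfers to new data.

```python
import numpy as np

rng = np.random.default_rng(5)

X = rng.standard_normal((400, 5))
y = rng.integers(0, 2, 400)   # labels unrelated to the features: no answer in the data

Xtr, ytr = X[:200], y[:200]
Xte, yte = X[200:], y[200:]

def predict(points):
    # 1-nearest-neighbor: copy the label of the closest training point
    dists = ((points[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    return ytr[np.argmin(dists, axis=1)]

print((predict(Xtr) == ytr).mean())   # 1.0: a perfect "pattern" found in training data
print((predict(Xte) == yte).mean())   # near 0.5: the pattern was never there
```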

The following lines from Book V of Paradise Lost, starting at line 180, are quoted in Kuipers’ book:

Air and ye elements, the eldest birth

Of nature’s womb, that in quaternion run

Perpetual circle, multiform, and mix

And nourish all things, let your ceaseless change

Vary to our great maker still new praise.

When I see *quaternion* I naturally think of Hamilton’s extension of the complex numbers, discovered in 1843. Paradise Lost, however, was published in 1667.

Milton uses *quaternion* to refer to the four elements of antiquity: air, earth, water, and fire. The last three are “the eldest birth of nature’s womb” because they are mentioned in Genesis before air is mentioned.


You can find most of the links from previous Wednesday posts on one page by going to technical notes from the navigation menu at the top of the site.


A graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. A grapheme, or more fully, a *grapheme cluster string*, is a single user-visible character, which in turn may be several characters (codepoints) long. For example … a “ȫ” is a single grapheme but one, two, or even three characters, depending on normalization.

In case the character ȫ doesn’t display correctly for you, here it is:

First, *graphene* has little to do with *grapheme*, but it’s geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word *graphene* comes from graphite, the “lead” in pencils. The origin of *grapheme* has nothing to do with graphene but was an analogy to *phoneme*.)

Second, the example shows how complicated the details of Unicode can get. The Perl code below expands on the details of the comment about ways to represent ȫ.

This demonstrates that the character `.` in regular expressions matches any single character, but `\X` matches any single grapheme. (Well, almost. The character `.` usually matches any character *except a newline*, though this can be modified via optional switches. But `\X` matches any grapheme including newline characters.)

```perl
use v5.14;   # enables say

# U+022B, o with diaeresis and macron
my $a = "\x{22B}";

# U+00F6 U+0304, (o with diaeresis) + macron
my $b = "\x{F6}\x{304}";

# o U+0308 U+0304, o + diaeresis + macron
my $c = "o\x{308}\x{304}";

my @versions = ($a, $b, $c);

# All versions display the same.
say @versions;

# The versions have length 1, 2, and 3.
# Only $a contains one character and so matches .
say map {length $_ if /^.$/} @versions;

# All versions consist of one grapheme.
say map {length $_ if /^\X$/} @versions;
```
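The same comparison can be made in Python with only the standard library; this is my own companion sketch, not from the original post. Python’s `len` counts codepoints, and the standard `re` module has no grapheme match like Perl’s `\X`, but `unicodedata.normalize` shows the three spellings are canonically equivalent.

```python
import unicodedata

a = "\u022b"          # U+022B, o with diaeresis and macron, precomposed
b = "\u00f6\u0304"    # U+00F6 + U+0304: (o with diaeresis) + combining macron
c = "o\u0308\u0304"   # o + U+0308 + U+0304: o + combining diaeresis + macron

# All three display the same but contain 1, 2, and 3 characters.
print([len(s) for s in (a, b, c)])   # [1, 2, 3]

# Canonical composition (NFC) collapses all three to the single codepoint U+022B.
print([unicodedata.normalize("NFC", s) == a for s in (a, b, c)])   # [True, True, True]
```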

If the only criticism is that something is too easy or “OK for beginners” then maybe it’s a threat to people who invested a lot of work learning to do things the old way.

The problem with the “OK for beginners” put-down is that everyone is a beginner sometimes. Professionals are often beginners because they’re routinely trying out new things. And being easier for beginners doesn’t exclude the possibility of being easier for professionals too.

Sometimes we assume that harder must be better. I know I do. For example, when I first used Windows, it was so much easier than Unix that I assumed Unix must be better for reasons I couldn’t articulate. I had invested so much work learning to use the Unix command line, it must have been worth it. (There are indeed advantages to doing some things from the command line, but not the work I was doing at the time.)

There often are advantages to doing things the hard way, but something isn’t necessarily better *because* it’s hard. The easiest tool to pick up may not be the best tool for long-term use, but then again it might be.

Most of the time you want to *add* the easy tool to your toolbox, not take the old one out. Just because you can use specialized professional tools doesn’t mean that you always have to.

**Related post**: Don’t be a technical masochist

- CRMSimulator is used to design CRM trials, dose-finding based only on toxicity outcomes.
- BMA-CRMSimulator is a variation on CRMSimulator using Bayesian model averaging.
- EffTox is used for dose-finding based on toxicity and efficacy outcomes.
- TTEConduct and TTEDesigner are for safety monitoring of single-arm trials with time-to-event outcomes.
- Multc Lean monitors two binary outcomes, efficacy and toxicity.
- Adaptive Randomization is for multiple-arm trials using outcome-adaptive randomization.
- More software from MD Anderson

**Last week’s resource post**: Stand-alone numerical code
