The following illustration of this difference comes from a talk by Luis Pericci last week. He attributes the example to “Bernardo (2010)” though I have not been able to find the exact reference.

In an experiment to test the existence of extra sensory perception (ESP), researchers wanted to see whether a person could influence some process that emitted binary data. (I’m going from memory on the details here, and I have not found Bernardo’s original paper. However, you could ignore the experimental setup and treat the following as hypothetical. The point here is not to investigate ESP but to show how Bayesian and Frequentist approaches could lead to opposite conclusions.)

The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is *p* = 0.5. The alternative hypothesis was that *p* is not 0.5. There were *N* = 104,490,000 bits emitted during the experiment, and *s* = 52,263,471 were 1’s. The *p*-value, the probability of an imbalance this large or larger under the assumption that *p* = 0.5, is 0.0003. Such a tiny *p*-value would be regarded as extremely strong evidence in favor of ESP given the way *p*-values are commonly interpreted.

The Bayes factor, however, is 5.95, meaning that the null hypothesis appears to be about six times more likely than the alternative. The alternative in this example uses Jeffreys’ prior, Beta(0.5, 0.5).

So given the data and assumptions in this example, the Frequentist concludes there is strong evidence **for** ESP while the Bayesian concludes there is substantial evidence **against** ESP.

The following Python code shows how one might calculate the *p*-value and Bayes factor.

from scipy.stats import binom from scipy import log, exp from scipy.special import betaln N = 104490000 s = 52263471 # sf is the survival function, i.e. complementary cdf # ccdf multiplied by 2 because we're doing a two-sided test print("p-value: ", 2*binom.sf(s, N, 0.5)) # Compute the log of the Bayes factor to avoid underflow. logbf = N*log(0.5) - betaln(s+0.5, N-s+0.5) print("Bayes factor: ", exp(logbf))]]>

Here are some of the pros and cons of the term. (Listing “cons” first seems backward, but I’m currently leaning toward the pro side, so I thought I should conclude with it.)

The term “data scientist” is sometimes used to imply more novelty than is there. There’s not a great deal of difference between data science and statistics, though the new term is more fashionable. (Someone quipped that data science is statistics on a Mac.)

Similarly, the term *data scientist* is sometimes used as an excuse for ignorance, as in “I don’t understand probability and all that stuff, but I don’t need to because I’m a data scientist, not a statistician.”

The big deal about data science isn’t data but the science of drawing *inferences* from the data. *Inference science* would be a better term, in my opinion, but that term hasn’t taken off.

*Data science* could be a useful umbrella term for statistics, machine learning, decision theory, etc. Also, the title *data scientist* is rightfully associated with people who have better computational skills than statisticians typically have.

While the term *data science* isn’t perfect, there’s little to recommend the term *statistics* other than that it is well established. The root of *statistics* is *state*, as in a *government*. This is because statistics was first applied to the concerns of bureaucracies. The term *statistics* would be equivalent to *governmentistics*, a historically accurate but otherwise useless term.

So a request like “Please send me the data from your experiment” becomes “Please send me the measurements from your experiment.” Same thing.

But rousing statements about the power of data become banal or even ridiculous. For example, here’s an article from Forbes after substituting *measurements* for *data*:

]]>

The Hottest Jobs In IT: Training Tomorrow’s Measurements ScientistsIf you thought good plumbers and electricians were hard to find, try getting hold of a measurements scientist. The rapid growth of big measurements and analytics for use within businesses has created a huge demand for people capable of extracting knowledge from measurements.

…

Some of the top positions in demand include business intelligence analysts, measurements architects, measurements warehouse analysts and measurements scientists, Reed says. “We believe the demand for measurements expertise will continue to grow as more companies look for ways to capitalize on this information,” he says.

…

This distinction is valid in **broad strokes**, though things are fuzzier than it admits. Some statisticians are content with constructing models, while others look further down the road to how the models are used. And machine learning experts vary in their interest in creating accurate models.

Clinical trial design usually comes under the heading of statistics, though in spirit it’s more like machine learning. The goal of a clinical trial is to answer some question, such as whether a treatment is safe or effective, while also having safeguards in place to stop the trial early if necessary. There is an underlying model—implicit in some methods, more often explicit in newer methods—that guides the conduct of the trial, but the accuracy of this model *per se* is not the primary goal. Some designs have been shown to be fairly robust, leading to good decisions even when the underlying probability model does not fit well. For example, I’ve done some work with clinical trial methods that model survival times with an exponential distribution. No one believes that an exponential distribution, i.e. one with constant hazard, accurately models survival times in cancer trials, and yet methods using these models do a good job of stopping trials early that should stop early and letting trials continue that should be allowed to continue.

Experts in machine learning are more accustomed to the idea of inaccurate models sometimes producing good results. The best example may be naive Bayes classifiers. The word “naive” in the name is a frank admission that these classifiers model as independent events known not to be independent. These methods can do well at their ultimate goal, such as distinguishing spam from legitimate email, even though they make a drastic simplifying assumption.

There have been papers that look at why naive Bayes works surprisingly well. Naive Bayes classifiers work well when the errors due to wrongly assuming independence effect positive and negative examples roughly equally. The inaccuracies of the model sort of wash out when the model is reduced to a binary decision, classifying as positive or negative. Something similar happens with the clinical trial methods mentioned above. The ultimate goal is to make correct go/no-go decisions, not to accurately model survival times. The naive exponential assumption effects both trials that should and should not stop, and the model predictions are reduced to a binary decision.

]]>One way to fit a triangular distribution to data would be to set *a* to the minimum value and *b* to the maximum value. You could pick *a* and *b* are the smallest and largest *possible* values, if these values are known. Otherwise you could use the smallest and largest values in the data, or make the interval a little larger if you want the density to be positive at the extreme data values.

How do you pick *c*? One approach would be to pick it so the resulting distribution has the same mean as the data. The triangular distribution has mean

(*a* + *b* + *c*)/3

so you could simply solve for *c* to match the sample mean.

Another approach would be to pick *c* so that the resulting distribution has the same *median* as the data. This approach is more interesting because it cannot always be done.

Suppose your sample median is *m*. You can always find a point *c* so that half the area of the triangle lies to the left of a vertical line drawn through *m*. However, this might require the foot *c* to be to the left or the right of the base [*a*, *b*]. In that case the resulting triangle is obtuse and so sides of the triangle do not form the graph of a function.

For the triangle to give us the graph of a density function, *c* must be in the interval [*a*, *b*]. Such a density has a median in the range

[*b* – (*b* – *a*)/√2, *a* + (*b* – *a*)/√2].

If the sample median *m* is in this range, then we can solve for *c* so that the distribution has median *m*. The solution is

*c* = *b* – 2(*b* – *m*)^{2} / (*b* – *a*)

if *m* < (*a* + *b*)/2 and

*c* = *a* + 2(*a* – *m*)^{2} / (*b* – *a*)

otherwise.

]]>For example, suppose you have measured the value of a function at 100 points. Unbeknownst to you, the data come from a cubic polynomial plus some noise. You can fit these 100 points exactly with a 99th degree polynomial, but this gives you the illusion that you’ve learned more than you really have. But if you divide your data into test and training sets of 50 points each, overfitting on the training set will result in a terrible fit on the test set. If you fit a cubic polynomial to the training data, you should do well on the test set. If you fit a 49th degree polynomial to the training data, you’ll fit it perfectly, but do a horrible job with the test data.

Now suppose we have two kinds of models to fit. We train each on the training set, and pick the one that does better on the test set. We’re not over-fitting because we haven’t used the test data to fit our model. Except we really are: we used the test set to select a model, though we didn’t use the test set to fit the parameters in the two models. Think of a larger model as a tree. The top of the tree tells you which model to pick, and under that are the parameters for each model. When we think of this new hierarchical model as “the model,” then we’ve used our test data to fit part of the model, namely to fit the bit at the top.

With only two models under consideration, this isn’t much of a problem. But if you have a machine learning package that tries millions of models, you can be over-fitting in a subtle way, and this can give you more confidence in your final result than is warranted.

The distinction between parameters and models is fuzzy. That’s why “Bayesian model averaging” is ultimately just Bayesian modeling. You could think of the model selection as simply another parameter. Or you could go the other way around and think of of each parameter value as an index for a family of models. So if you say you’re only using the test data to select models, not parameters, you could be fooling yourself.

For example, suppose you want to fit a linear regression to a data set. That is, you want to pick *m* and *b* so that *y* = *mx* + *b* is a good fit to the data. But now I tell you that you are only allowed to fit models with one degree of freedom. You’re allowed to do cross validation, but you’re only allowed to use the test data for model selection, not model fitting.

So here’s what you could do. Pick a constant value of *b*, call it *b*_{0}. Now fit the one-parameter model *y* = *mx* + *b*_{0} on your training data, selecting the value of *m* only to minimize the error in fitting the training set. Now pick another value of *b*, call it *b*_{1}, and see how well it does on the test set. Repeat until you’ve found the best value of *b*. You’ve essentially used the training and test data to fit a *two*-parameter model, albeit awkwardly.

In particular I wonder what applications there may be of number theory, especially analytic number theory. I’m not thinking of the *results* of number theory but rather the elegant *machinery* developed to attack problems in number theory. I expect more of this machinery could be useful to problems outside of number theory.

I also wonder about category theory. The theory certainly finds uses within pure mathematics, but I’m not sure how useful it is in direct application to problems outside of mathematics. Many of the reported applications don’t seem like applications at all, but window dressing applied after-the-fact. On the other hand, there are also instances where categorical thinking led the way to a solution, but did its work behind the scenes; once a solution was in hand, it could be presented more directly without reference to categories. So it’s hard to say whether applications of category theory are over-reported or under-reported.

The mathematical literature can be misleading. When researchers say their work has various applications, they may be blowing smoke. At the same time, there may be real applications that are never mentioned in journals, either because the work is proprietary or because it is not deemed *original* in the academic sense of the word.

Hereafter, when they come to model heaven

And calculate the stars: how they will wield the

The mighty frame, how build, unbuild, contrive

To save appearances, how gird the sphere

With centric and eccentric scribbled o’er,

Cycle and epicycle, orb in orb.

**Related post** Quaternions in Paradise Lost

This is a special case of Beatty’s theorem.

]]>A common model says that men’s and women’s heights are normally distributed with means of 70 and 64 inches respectively, both with a standard deviation of 3 inches. A woman with negative height would be 21.33 standard deviations below the mean, and a man with negative height would be 23.33 standard deviations below the mean. These events have probability 3 × 10^{-101} and 10^{-120} respectively. Or to write them out in full

0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003

and

0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001.

As I mentioned on Twitter yesterday, if you’re worried about probabilities that require scientific notation to write down, you’ve probably exceeded the resolution of your model. I imagine most probability models are good to two or three decimal places at most. When model probabilities are extremely small, factors outside the model become more important than ones inside.

According to Wolfram Alpha, there are around 10^{80} atoms in the universe. So picking one particular atom at random from all atoms in the universe would be on the order of a billion trillion times more likely than running into a woman with negative height. Of course negative heights are not just unlikely, they’re impossible. As you travel from the mean out into the tails, the first problem you encounter with the normal approximation is not that the probability of negative heights is over-estimated, but that the probability of extremely short and extremely tall people is *under*-estimated. There exist people whose heights would be impossibly unlikely according to this normal approximation. See examples here.

Probabilities such as those above have no practical value, but it’s interesting to see how you’d compute them anyway. You could find the probability of a man having negative height by typing `pnorm(-23.33)`

into R or `scipy.stats.norm.cdf(-23.33)`

into Python. Without relying on such software, you could use the bounds

with *x* equal to -21.33 and -23.33. For a proof of these bounds and tighter bounds see these notes.

In the Star Trek episode “All Our Yesterdays” the people of the planet Sarpeidon have escaped into their past because their sun is about to become a supernova. They did this via a time machine called the Atavachron.

One detail of the episode has stuck with me since I first saw it many years ago: although people can go back to any period in history, they have to be prepared somehow, and once prepared they cannot go back. Kirk, Spock, and McCoy only have hours to live because they traveled back in time via the Atavachron without being properly prepared. (Kirk is in a period analogous to Renaissance England while Spock and McCoy are in an ice age.)

If such time travel were possible, I expect you would indeed need to be prepared. Life in Renaissance England or the last ice age would be miserable for someone with contemporary expectations, habits, fitness, etc., though things weren’t as bad for the people at the time. Neither would life be entirely pleasant for someone thrust into our time from the past. Cultures work out their own solutions to life’s problems, and these solutions form a package. It may not be possible to swap components in and out à la carte and maintain a working solution.

]]>If that’s the case, why aren’t more phenomena normally distributed? Someone asked me this morning specifically about phenotypes with many genetic inputs.

The central limit theorem says that the sum of many **independent**, **additive** effects is approximately normally distributed [2]. Genes are more digital than analog, and do not produce independent, additive effects. For example, the effects of dominant and recessive genes act more like max and min than addition. Genes do not appear independently—if you have some genes, you’re more likely to have certain other genes—nor do they act independently—some genes determine how other genes are expressed.

Height is influenced by environmental effects as well as genetic effects, such as nutrition, and these environmental effects may be more additive or independent than genetic effects.

Incidentally, if effects are independent but multiplicative rather than additive, the result may be approximately log-normal rather than normal.

***

Fine print:

[1] Men’s heights follow a normal distribution, and so do women’s. Adults not sorted by sex follow a mixture distribution as described here and so the distribution is flatter on top than a normal. It gets even more complicated when you considered that there are slightly more women than men in the world. And as with many phenomena, the normal distribution is a better description near the middle than at the extremes.

[2] There are many variations on the central limit theorem. The classical CLT requires that the random variables in the sum be identically distributed as well, though that isn’t so important here.

]]>Of course lie detectors can’t tell whether someone is lying. They can only tell whether someone is exhibiting physiological behavior believed to be associated with lying. How well the latter predicts the former is a matter of debate.

I saw a presentation of a machine learning package the other day. Some of the questions implied that the audience had a magical understanding of machine learning, as if an algorithm could extract answers from data that do not contain the answer. The software simply searches for patterns in data by seeing how well various possible patterns fit, but there may be no pattern to be found. Machine learning algorithms cannot *generate* information that isn’t there any more than a polygraph machine can predict the future.

The following lines from Book V of Paradise Lost, starting at line 180, are quoted in Kuipers’ book:

Air and ye elements, the eldest birth

Of nature’s womb, that inquaternionrun

Perpetual circle, multiform, and mix

And nourish all things, let your ceaseless change

Vary to our great maker still new praise.

When I see *quaternion* I naturally think of Hamilton’s extension of the complex numbers, discovered in 1843. Paradise Lost, however, was published in 1667.

Milton uses *quaternion* to refer to the four elements of antiquity: air, earth, water, and fire. The last three are “the eldest birth of nature’s womb” because they are mentioned in Genesis before air is mentioned.

]]>

You can find most of the links from previous Wednesday posts on one page by going to technical notes from the navigation menu at the top of the site.

]]>

graphemeA graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. Grapheme, or more fully, agrapheme cluster stringis a single user-visiblecharacter, which in turn may be several characters (codepoints) long. For example … a “ȫ” is a single grapheme but one, two, or even three characters, depending onnormalization.

In case the character ȫ doesn’t display correctly for you, here it is:

First, *graphene* has little to do with *grapheme*, but it’s geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word *graphene* comes from graphite, the “lead” in pencils. The origin of *grapheme* has nothing to do with graphene but was an analogy to *phoneme*.)

Second, the example shows how complicated the details of Unicode can get. The Perl code below expands on the details of the comment about ways to represent ȫ.

This demonstrates that the character `.`

in regular expressions matches any single character, but `\X`

matches any single grapheme. (Well, almost. The character `.`

usually matches any character *except a newline*, though this can be modified via optional switches. But `\X`

matches any grapheme including newline characters.)

# U+0226, o with diaeresis and macron my $a = "\x{22B}"; # U+00F6 U+0304, (o with diaeresis) + macron my $b = "\x{F6}\x{304}"; # o U+0308 U+0304, o + diaeresis + macron my $c = "o\x{308}\x{304}"; my @versions = ($a, $b, $c); # All versions display the same. say @versions; # The versions have length 1, 2, and 3. # Only $a contains one character and so matches . say map {length $_ if /^.$/} @versions; # All versions consist of one grapheme. say map {length $_ if /^\X$/} @versions;]]>

If the only criticism is that something is too easy or “OK for beginners” then maybe it’s a threat to people who invested a lot of work learning to do things the old way.

The problem with the “OK for beginners” put-down is that everyone is a beginner sometimes. Professionals are often beginners because they’re routinely trying out new things. And being easier for beginners doesn’t exclude the possibility of being easier for professionals too.

Sometimes we assume that harder must be better. I know I do. For example, when I first used Windows, it was so much easier than Unix that I assumed Unix must be better for reasons I couldn’t articulate. I had invested so much work learning to use the Unix command line, it must have been worth it. (There are indeed advantages to doing some things from the command line, but not the work I was doing at the time.)

There often are advantages to doing things the hard way, but something isn’t necessary better *because* it’s hard. The easiest tool to pick up may not be best tool for long-term use, but then again it might be.

Most of the time you want to *add* the easy tool to your toolbox, not take the old one out. Just because you can use specialized professional tools doesn’t mean that you always have to.

**Related post**: Don’t be a technical masochist

- CRMSimulator is used to design CRM trials, dose-finding based only on toxicity outcomes.
- BMA-CRMSimulator is a variation on CRMSimulator using Bayesian model averaging.
- EffTox is used for dose-finding based on toxicity and efficacy outcomes.
- TTEConduct and TTEDesigner are for safety monitoring of single-arm trials with time-to-event outcomes.
- Multc Lean monitors two binary outcomes, efficacy and toxicity.
- Adaptive Randomization is for multiple-arm trials using outcome-adaptive randomization.
- More software from MD Anderson

**Last week’s resource post**: Stand-alone numerical code

]]>

Since your goal is to find the best dose, it seems natural to compare dose-finding methods by how often they find the best dose. This is what is most often done in the clinical trial literature. But this seemingly natural criterion is actually artificial.

Suppose a trial is testing doses of 100, 200, 300, and 400 milligrams of some new drug. Suppose further that on some scale of goodness, these doses rank 0.1, 0.2, 0.5, and 0.51. (Of course these goodness scores are unknown; the point of the trial is to estimate them. But you might make up some values for simulation, pretending with half your brain that these are the true values and pretending with the other half that you don’t know what they are.)

Now suppose you’re evaluating two clinical trial designs, running simulations to see how each performs. The first design picks the 400 mg dose, the best dose, 20% of the time and picks the 300 mg dose, the second best dose, 50% of the time. The second design picks each dose with equal probability. The latter design picks the *best* dose more often, but it picks a *good* dose less often.

In this scenario, the two largest doses are essentially equally good; it hardly matters how often a method distinguishes between them. The first method picks one of the two good doses 70% of the time while the second method picks one of the two good doses only 50% of the time.

This example was exaggerated to make a point: obviously it doesn’t matter how often a method can pick the better of two very similar doses, not when it very often picks a bad dose. But there are less obvious situations that are quantitatively different but qualitatively the same.

The goal is actually to find a good dose. Finding the absolute best dose is impossible. The most you could hope for is that a method finds with high probability the best of the four *arbitrarily chosen doses* under consideration. Maybe the *best* dose is 350 mg, 843 mg, or some other dose not under consideration.

A simple way to make evaluating dose-finding methods less arbitrary would be to estimate the benefit to patients. Finding the best dose is only a matter of curiosity in itself unless you consider how that information is used. Knowing the best dose is important because you want to treat future patients as effectively as you can. (And patients in the trial itself as well, if it is an adaptive trial.)

Suppose the measure of goodness in the scenario above is probability of successful treatment and that 1,000 patients will be treated at the dose level picked by the trial. Under the first design, there’s a 20% chance that 51% of the future patients will be treated successfully, and a 50% chance that 50% will be. The expected number of successful treatments from the two best doses is 352. Under the second design, the corresponding number is 252.5.

(To simplify the example above, I didn’t say how often the first design picks each of the two lowest doses. But the first design will result in at least 382 expected successes and the second design 327.5.)

You never know how many future patients will be treated according to the outcome of a clinical trial, but there must be some implicit estimate. If this estimate is zero, the trial is not worth conducting. In the example given here, the estimate of 1,000 future patients is irrelevant: the future patient horizon cancels out in a comparison of the two methods. The patient horizon matters when you want to include the benefit to patients in the trial itself. The patient horizon serves as a way to weigh the interests of current versus future patients, an ethically difficult comparison usually left implicit.

]]>Albert Einstein, 1954

]]>The opposite of an idiot would not be someone who is wise, but someone who takes too much outside input, someone who passively lets others make all his decisions or who puts every decision to a vote.

An idiot lives only in his own world; the opposite of an idiot has no world of his own. Both are foolish, but I think the Internet encourages more of the latter kind of foolishness. It’s not turning people into idiots, it’s turning them into the opposite.

]]>Maybe the company is good at attracting investors. Maybe they have enormous economies of scale that overcome their diseconomies of scale. Maybe they’ve lobbied politicians to protect the company from competition. (That’s a form of success, though not an honorable one.) Maybe they serve a market well, but you don’t see it because you’re not part of that market.

]]>The code is available in Python, C++, and C# versions. It could easily be translated into other languages since it hardly uses any language-specific features.

I wrote these functions for projects where you don’t have a numerical library available or would like to minimize dependencies. If you have access to a numerical library, such as SciPy in Python, then by all means use it (although SciPy is missing some of the random number generators provided here). In C++ and especially C#, it’s harder to find some of this functionality.

**Last week**: Code Project articles

**Next week**: Clinical trial software

As the number of steps *N* goes to infinity, the probability that the proportion of your time in positive territory is less than *x* approaches 2 arcsin(√*x*)/π. The arcsine term gives this rule its name, the arcsine law.

Here’s a little Python script to illustrate the arcsine law.

import random from numpy import arcsin, pi, sqrt def step(): u = random.random() return 1 if u < 0.5 else -1 M = 1000 # outer loop N = 1000 # inner loop x = 0.3 # Use any 0 < x < 1 you'd like. outer_count = 0 for _ in range(M): n = 0 position= 0 inner_count = 0 for __ in range(N): position += step() if position > 0: inner_count += 1 if inner_count/N < x: outer_count += 1 print (outer_count/M) print (2*arcsin(sqrt(x))/pi)]]>

Aleksandr Khinchin proved that for almost all real numbers *x*, as *n* → ∞ the geometric means converge. Not only that, they converge to the same constant, known as Khinchin’s constant, 2.685452001…. (“Almost all” here mean in the sense of measure theory: the set of real numbers that are exceptions to Khinchin’s theorem have measure zero.)

To get an idea how fast this convergence is, let’s start by looking at the continued fraction expansion of π. In Sage, we can type

continued_fraction(RealField(100)(pi))

to get the continued fraction coefficient

[3, 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14, 2, 1, 1, 2, 2, 2, 2, 1, 84, 2, 1, 1, 15, 3]

for π to 100 decimal places. The geometric mean of these coefficients is 2.84777288486, which only matches Khinchin’s constant to 1 significant figure.

Let’s try choosing random numbers and working with more decimal places.

There may be a more direct way to find geometric means in Sage, but here’s a function I wrote. It leaves off any leading zeros that would cause the geometric mean to be zero.

from numpy import exp, mean, log def geometric_mean(x): return exp( mean([log(k) for k in x if k > 0]) )

Now let’s find 10 random numbers to 1,000 decimal places.

for _ in range(10): r = RealField(1000).random_element(0,1) print(geometric_mean(continued_fraction(r)))

This produced

2.66169890535 2.62280675227 2.61146463641 2.58515620064 2.58396664032 2.78152297661 2.55950338205 2.86878898900 2.70852612496 2.52689450535

Three of these agree with Khinchin’s constant to two significant figures but the rest agree only to one. Apparently the convergence is very slow.

If we go back to π, this time looking out 10,000 decimal places, we get a little closer:

print(geometric_mean(continued_fraction(RealField(10000)(pi))))

produces 2.67104567579, which differs from Khinchin’s constant by about 0.5%.

]]>]]>It was not a matter of everything in mathematics collapsing in on itself, with one branch turning out to have been merely a recapitulation of another under a different guise. Rather, the principle was that every sufficiently beautiful mathematical system was rich enough to mirror in part — and sometimes in a complex and distorted fashion — every other sufficiently beautiful system. Nothing became sterile and redundant, nothing proved to have been a waste of time, but everything was shown to be magnificently intertwined.

In nature, the optimum is almost always in the middle somewhere. Distrust assertions that the optimum is at an extreme point.

When I first read this I immediately thought of several examples where theory said that an optima was at an extreme, but experience said otherwise.

Linear programming (LP) says the opposite of Akin’s law. The optimal point for a linear objective function subject to linear constraints is *always* at an extreme point. The constraints form a many-sided shape—you could think of it something like a diamond—and the optimal point will always be at one of the corners.

Nature is not linear, though it is often approximately linear over some useful range. One way to read Akin’s law is to say that even when something is approximately linear in the middle, there’s enough non-linearity at the extremes to pull the optimal points back from the edges. Or said another way, when you have an optimal value at an extreme point, there may be some unrealistic aspect of your model that pushed the optimal point out to the boundary.

**Related post**: Data calls the model’s bluff

But “maximum tolerated dose” implies a degree of personalization that rarely exists in clinical trials. Phase I chemotherapy trials don’t try to find the maximum dose that any particular patient can tolerate. They try to find a dose that is toxic to a certain percentage of the trial participants, say 30%. (This rate may seem high, but it’s typical. It’s not far from the toxicity rate implicit in the so-called 3+3 rule or from the explicit rate given in many CRM (“continual reassessment method”) designs.)

It’s tempting to think of “30% toxicity rate” as meaning that each patient experiences a 30% toxic reaction. But that’s not what it means. It means that each patient has a 30% chance of a toxicity, however toxicity is defined in a particular trial. If toxicity were defined as kidney failure, for example, then 30% toxicity rate means that each patient has a 30% probability of kidney failure, not that they should expect a 30% reduction in kidney function.

**Related posts**:

It’s hard imagine how investors could abandon something as large and expensive as a shopping mall. And yet it must have been a sensible decision. If anyone disagreed, they could buy the abandoned mall on the belief that they could make a profit.

The idea that you should stick to something just because you’ve invested in it goes by many names: sunk cost fallacy, escalation of commitment, gambler’s ruin, etc. If further investment will simply lose more money, there’s no economic reason to continue investing, regardless of how much money you’ve spent. (There may be non-economic reasons. You may have a moral obligation to fulfill a commitment or to clean up a mess you’ve made.)

Most of us have not faced the temptation to keep investing in an unprofitable shopping mall, but everyone is tempted by the sunk cost fallacy in other forms: finishing a novel you don’t enjoy reading, holding on to hopelessly tangled software, trying to earn a living with a skill people no longer wish to pay for, etc.

According to Peter Drucker, “It cannot be said often enough that one should not postpone; one abandons.”

The first step in a growth policy is not to decide where and how to grow.

It is to decide what to abandon. In order to grow, a business must have a systematic policy to get rid of the outgrown, the obsolete, the unproductive.

It’s usually more obvious what someone else should abandon than what we should abandon. Smart businesses turn to outside consultants for such advice. Smart individuals turn to trusted friends. An objective observer without our emotional investment can see things more clearly than we can.

***

This posted started out as a shorter post on Google+.

Photo above by Steve Petrucelli via flickr

]]>