The post How to choose colors for maps and heat maps appeared first on All About Statistics.

Have you ever looked at a statistical graph that uses bright, garish colors and thought, "Why in the world did that guy choose those awful colors?" Don't be "that guy"! Your choice of colors for a graph can make a huge difference in how well your visualization is perceived by the reader. This is especially true for heat maps and choropleth maps, in which the colors of small area elements must be distinguishable. This article describes how you can choose good color schemes for heat maps and choropleth maps. Most of the article is language-independent, but at the end I show how to create graphs with good color schemes in SAS.

I first became aware of the importance of color design in statistical graphics when I looked at the incredibly beautiful *Atlas of United States Mortality* by Linda Pickle and her colleagues (1997).
One of the reasons that the *Atlas* is so successful at conveying complex statistical data on maps is that it uses carefully designed color schemes and visualization strategies that were the result of pioneering research by Cynthia Brewer, Dan Carr, and many others. An example from the *Atlas* is shown to the left. The color schemes were designed so that one color does not visually dominate the others. Bright reds and yellows are nowhere to be found. Fully saturated colors are avoided in favor of slightly muted colors. The result is that graphs in the *Atlas* enable you to see variations in the data without being biased by visual artifacts that can result from poor color choices.

Choosing good colors for maps and graphs is not easy, which is why every statistician and graphics designer should know about Cynthia Brewer's ColorBrewer.org web site. The web site enables you to choose from dozens of color schemes that were carefully chosen to be aesthetically pleasing as well as statistically unbiased.

Brewer's color schemes are implemented in SAS/IML 13.1 software through the PALETTE function. (Other statistical languages also provide access to these schemes, such as via the RColorBrewer package in R.) The PALETTE function enables you to get an array of colors by knowing the name of a color scheme. Some of the available color schemes are shown at the left. Brewer's "BLUES" color scheme is a single-hue progression that is similar to the two-color ramp in the SAS HTMLBlue ODS style. The "YLORRD" color scheme is a yellow-orange-red sequential color ramp that shows a blended progression from yellow to red.

The color ramps at the left are appropriate for visualizing data when it is important to differentiate high values from low values. For these so-called *sequential* color schemes, the reference level is the lowest value.

There are other possible color schemes. When the reference value is in the middle of the data range (such as zero or an average value), you should use a diverging color scheme, which uses a neutral color for the reference value. Low values are represented by using one hue and high values by using a different hue.

Examples of diverging color schemes are shown to the right. The first diverging color scheme ("BRBG" for brown-blue-green) is the default color ramp that is used to construct heat maps of discrete data in SAS/IML software. For example, a five-color version of that color ramp was used to visualize a binned correlation matrix in my recent article about how to create heat maps with a discrete color ramp. The "RDBU" scheme (a red-blue bipolar progression) is often used when one extreme is "good" (blue) and the other extreme is "bad" (red). The "SPECTRAL" color scheme is a useful alternative to the often-seen "rainbow" color ramp, which is notoriously bad for visualization and should be avoided. A spectral progression is often used for weather-related intensities, such as storm severity.

Finally, there are qualitative color schemes, as shown to the left. These schemes are useful for visualizing data that have no inherent ordering, such as ethnicities of people, different diseases, types of music, and other nominal categories.

Although these color schemes were designed for choropleth maps, the same principles are useful for designing heat maps with a small number of discrete categories. You can use the PALETTE function in SAS/IML software to choose effective colors for heat maps and other statistical graphics.

The PALETTE function is easy to use: Specify the name of the color scheme and the number of colors that you want to generate. These schemes are designed for a small number of discrete categories, so the schemes support between eight and 12 distinct and distinguishable colors.

As an example, suppose that you have computed a binned correlation matrix, but you want to override the default color scheme. For your application, you want negative values to be red and positive to be blue, so you decide on the bipolar "RDBU" color scheme. The following SAS/IML statements continue the example from my previous post:

/* The disCorr matrix is defined in the previous example. */
ramp = palette("RDBU", 5);          /* create diverging 5-color ramp */
call HeatmapDisc(disCorr) colorramp=ramp   /* use COLORRAMP= option */
     title="Binned Correlations"
     xvalues=varNames yvalues=varNames;

As a second example, the heat map at the left uses a discrete set of colors to visualize the most frequently appearing bigrams (two-letter combinations) in an English corpus. The values of the matrix have been binned into five categories. Red cells indicate bigrams that appear more than 2% of the time. Dark orange cells indicate bigrams that appear between 1% and 2%. Light orange represents between 0.5% and 1%; light yellow represents less than 0.5% of the time. Gray cells represent bigrams that appear rarely or not at all. You can download the SAS program that computes this discrete heat map of bigram frequency.

Incidentally, you can also use the PALETTE function as an easy way to generate a continuous color ramp: obtain three or five anchor colors and interpolate between them to visualize intermediate values. In my article about hexagonal bin plots, the continuous color ramp used for the density map of ZIP codes was created by using the colors from the call `palette("YLORRD",5)`.
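The interpolation idea is language-independent. Here is a minimal Python sketch of blending between anchor colors to build a continuous ramp; the hex values are illustrative stand-ins, not the exact YLORRD colors:

```python
def hex_to_rgb(h):
    h = h.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_hex(rgb):
    return "#%02X%02X%02X" % rgb

def interpolate_ramp(anchors, n):
    """Return n colors evenly spaced along the piecewise-linear
    path through the anchor colors (n >= 2)."""
    rgbs = [hex_to_rgb(a) for a in anchors]
    out = []
    for k in range(n):
        t = k * (len(rgbs) - 1) / (n - 1)   # position along the path
        i = min(int(t), len(rgbs) - 2)      # which segment we are in
        f = t - i                           # fraction within that segment
        mixed = tuple(round(a + f * (b - a))
                      for a, b in zip(rgbs[i], rgbs[i + 1]))
        out.append(rgb_to_hex(mixed))
    return out

anchors = ["#FFFFB2", "#FD8D3C", "#BD0026"]   # yellow -> orange -> red (illustrative)
ramp = interpolate_ramp(anchors, 7)
```

The endpoints of the ramp are the first and last anchors, and the intermediate entries are linear blends, which is all that "interpolating between anchor colors" means here.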

What method do you use to choose colors for maps or heat maps that display categories? Do you have a favorite color scheme? Leave a comment.

tags: 13.1, Heat maps, Statistical Graphics

**Please comment on the article here:** **The DO Loop**


The post p-value vs Bayes appeared first on All About Statistics.

p-value and Bayes are the two hottest words in statistics. Actually, I still cannot understand why the debate between frequentist statistics and Bayesian statistics has lasted so long. What are the essential arguments behind it? (Can anyone help me with this?) In my view, they are just two ways of solving practical problems. Frequentists use a random version of the proof-by-contradiction argument (i.e., a small p-value indicates that the null hypothesis is less likely to be true), while Bayesians use a learning argument to update their beliefs through data. Besides, mathematicians use partial differential equations (PDEs) to model the real underlying process for the analysis. These are just different methodologies for dealing with practical problems. What, then, is the point of the long-lasting debate between frequentist and Bayesian statistics?

Although my current research area is mostly in the frequentist statistics domain, I am becoming more and more of a Bayesian lover, since it's so natural. When I was teaching introductory statistics courses for undergraduate students at Michigan State University, I divided the course into three parts: Exploratory Data Analysis (EDA) using R, Bayesian Reasoning, and Frequentist Statistics. I found that at the end of the semester, the most memorable example in my students' minds was one from the second part (Bayesian Reasoning): the Monty Hall problem, which was mentioned in the article that just came out in the NYT. (Regarding the argument from Professor Andrew Gelman, please also check out Professor Gelman's response.) "Mr. Hall, longtime host of the game show "Let's Make a Deal," hides a car behind one of three doors and a goat behind each of the other two. The contestant picks Door No. 1, but before opening it, Mr. Hall opens Door No. 2 to reveal a goat. Should the contestant stick with No. 1 or switch to No. 3, or does it matter?" And the Bayesian approach to this problem "would start with one-third odds that any given door hides the car, then update that knowledge with the new data: Door No. 2 had a goat. The odds that the contestant guessed right — that the car is behind No. 1 — remain one in three. Thus, the odds that she guessed wrong are two in three. And if she guessed wrong, the car must be behind Door No. 3. So she should indeed switch." What a natural argument! Bayesian babies and Google's untrained search for YouTube cats (the methods of deep learning) are all excellent examples showing that Bayesian statistics IS a remarkable way of solving problems.
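The switching argument in the quoted passage is also easy to verify by brute-force simulation. A minimal Python sketch (the trial count and seed are arbitrary choices of mine):

```python
import random

def monty_hall(switch, trials=100_000, seed=1):
    """Simulate the Monty Hall game; return the contestant's win rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)        # door hiding the car
        pick = rng.randrange(3)       # contestant's first pick
        # Host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the remaining unopened door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

stay = monty_hall(switch=False)
swap = monty_hall(switch=True)
```

Running this gives a win rate near 1/3 for staying and near 2/3 for switching, matching the Bayesian update described above.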

What about p-values? This random version of the proof-by-contradiction argument is also a great way of solving problems, as shown by the fact that it has helped solve so many problems in various scientific areas, especially in the bio-world. Check out today's post from Simply Statistics, "You think P-values are bad? I say show me the data," and also an earlier one: On the scalability of statistical procedures: why the p-value bashers just don't get it.
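To make the "random proof-by-contradiction" concrete, a one-sided binomial p-value can be computed by direct enumeration. A small Python sketch with made-up numbers (60 heads in 100 tosses of a supposedly fair coin; the scenario is mine, not from the post):

```python
from math import comb

def binom_pvalue(k, n, p=0.5):
    """One-sided p-value: P(X >= k) for X ~ Binomial(n, p).

    A small value says: if the null (here, a fair coin) were true,
    data at least this extreme would be very unlikely."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# 60 heads in 100 tosses of a supposedly fair coin:
pval = binom_pvalue(60, 100)
```

Here the p-value comes out just under 0.03, so at the conventional 0.05 level we would "contradict" the fair-coin hypothesis.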

**Please comment on the article here:** **Honglang Wang's Blog**


The post You think P-values are bad? I say show me the data. appeared first on All About Statistics.

Both the scientific community and the popular press are freaking out about reproducibility right now. I think they have good reason to, because even the US Congress is now investigating the transparency of science. It has been driven by the very public reproducibility disasters in genomics and economics.

There are three major components to a reproducible and replicable study from a computational perspective: (1) the raw data from the experiment must be available, (2) the statistical code and documentation to reproduce the analysis must be available and (3) a correct data analysis must be performed.

There have been successes and failures in releasing all the data, but PLoS' policy on data availability and the AllTrials initiative hold some hope. The most progress has been made on making code and documentation available: Galaxy, knitr, and iPython make it easier than ever before to distribute literate programs, and people are actually using them!

The trickiest part of reproducibility and replicability is ensuring that people perform a good data analysis. The first problem is that we actually don't know which statistical methods lead to higher reproducibility and replicability in users' hands. Articles like the one that just came out in the NYT suggest that using one type of method (Bayesian approaches) over another (p-values) will address the problem. But the real story is that those are still 100% philosophical arguments. We actually have very little good data on whether analysts will perform better analyses using one method or another. I agree with Roger in his tweet storm (quick, someone is wrong on the internet, Roger, fix it!):

5/If using Bayesian methods made you a better scientist, that would be great. But I need to see the evidence on that first.

— Roger D. Peng (@rdpeng) September 30, 2014

This is even more of a problem because the data deluge demands that almost all data analysis be performed by people with basic to intermediate statistics training at best. There is no way around this in the short term. There just aren't enough trained statisticians/data scientists to go around. So we need to study statistics just like any other human behavior to figure out which methods work best in the hands of the people most likely to be using them.

**Please comment on the article here:** **Simply Statistics**


The post Example 2014.11: Contrasts the basic way for R appeared first on All About Statistics.

As we discuss in section 6.1.4 of the second edition, R and SAS handle categorical variables and their parameterization in models quite differently. SAS treats them on a procedure-by-procedure basis, which leads to some odd differences in capabilities and default parameterizations. For example, in the

In R, in contrast, categorical variables can be designated as "factors", and the parameterization is stored as an attribute of the factor.

In section 6.1.4, we demonstrate how the parameterization of a factor can be easily changed on the fly, in R, in

We begin by simulating censored survival data as in Example 7.30. We'll also export the data to use in R.

data simcox;
  beta1 = 2;
  lambdaT = 0.002;   * baseline hazard;
  lambdaC = 0.004;   * censoring hazard;
  do i = 1 to 10000;
    x1 = rantbl(0, .25, .25, .25);
    linpred = exp(-beta1*(x1 eq 4));
    t = rand("WEIBULL", 1, lambdaT * linpred);   * time of event;
    c = rand("WEIBULL", 1, lambdaC);             * time of censoring;
    time = min(t, c);                            * which came first?;
    censored = (c lt t);
    output;
  end;
run;

proc export data=simcox replace
  outfile="c:/temp/simcox.csv"
  dbms=csv;
run;

Now we'll fit the data in SAS, using effect coding.

proc phreg data=simcox;
  class x1 (param=effect);
  model time*censored(0) = x1;
run;

We reproduce the rather unexciting results here for comparison with R.

                      Parameter    Standard
Parameter       DF     Estimate       Error
x1        1      1     -0.02698     0.03471
x1        2      1     -0.01211     0.03437
x1        3      1     -0.05940     0.03458

In R we read the data in, then use the

We excerpt the relevant output to demonstrate equivalence with SAS.

library(survival)   # provides coxph() and Surv()
simcox <- read.csv("c:/temp/simcox.csv")
sc2 <- transform(simcox, x1.eff = C(as.factor(x1), contr.sum(4)))
effmodel <- coxph(Surv(time, censored) ~ x1.eff, data = sc2)
summary(effmodel)

             coef  exp(coef)  se(coef)
x1.eff1  -0.02698    0.97339   0.03471
x1.eff2  -0.01211    0.98797   0.03437
x1.eff3  -0.05940    0.94233   0.03458
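The equivalence above rests on both systems using the same sum-to-zero ("effect") contrast. For illustration only, here is a minimal Python sketch of the contrast rows that `param=effect` in SAS and `contr.sum(4)` in R both encode:

```python
def sum_coding(levels):
    """Sum-to-zero ("effect") coding: each level except the last gets
    an indicator column; the last level is -1 in every column, so each
    column of the contrast matrix sums to zero."""
    k = len(levels)
    rows = {}
    for i, lev in enumerate(levels):
        if i < k - 1:
            rows[lev] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            rows[lev] = [-1] * (k - 1)
    return rows

coding = sum_coding([1, 2, 3, 4])
# coding[1] -> [1, 0, 0], ..., coding[4] -> [-1, -1, -1]
```

Because the columns sum to zero, each coefficient is interpreted as a deviation from the overall mean, and the omitted level's effect is recovered as minus the sum of the others — which is why three coefficients suffice for four levels in the output above.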

**Please comment on the article here:** **SAS and R**


The post Running RStudio via Docker in the Cloud appeared first on All About Statistics.

Deploying applications via Docker containers is the current talk of the town. I had heard about Docker and played around with it a little, but when Dirk Eddelbuettel posted his R and Docker talk last Friday I got really excited and had to have a go myself.

My aim was to rent some resources in the cloud, pull an RStudio Server container and run RStudio in a browser. It was actually surprisingly simple to get started.

I chose Digital Ocean as my cloud provider. They have many Linux systems to choose from and also a pre-built Docker system.

About a minute after I had kicked off the Docker droplet, I could log in to the system in a browser window and start pulling the Docker image, e.g. Dirk's container.

Once the downloads finished I could start the RStudio Server using the `docker run` command and log in to an RStudio session. To my surprise, even my googleVis package worked out of the box. The plot command opened just another browser window to display the chart; here the output of the `WorldBank` demo.

All of this was done within minutes in a browser window. I didn't even use a terminal window. So, that's how you run R on an iPad. Considering that the cost for the server was $0.015 per hour, I wonder why I should buy my own server, or indeed a new computer.

**Please comment on the article here:** **mages' blog**



The post Julia: Random Number Generator Functions appeared first on All About Statistics.

# In this post I will explore the built-in random number functions in Julia.

# These can be found in the documentation at:

# http://docs.julialang.org/en/latest/stdlib/base/#random-numbers

# As with most random number generators it is useful to manually

# set the 'seed'. This allows for replication of results across

# multiple runs.

# Setting the seed is accomplished with the simple 'srand' function:

# We can verify that setting the seed results in identical draws from

# the random distribution by using the uniform [0,1) random draw

# function 'rand':

srand(123)

rand()

# 0.7684476751965699

srand(123); rand()

# 0.7684476751965699

# Julia's random draws have the convenience of shaping themselves
# into arrays of the size specified by their arguments.

# For example a 2 by 10 random array

a1 = rand(2,10)

# 2x10 Array{Float64,2}:

# 0.940782 0.790201 0.900925 0.031831 … 0.759755 0.234544 0.627093

# 0.48 0.356221 0.529253 0.900681 0.19178 0.0976698 0.946697

# You can also use Julia to modify an existing array by filling it with

# random elements.

# First create a 3 x 3 of zeros

z1 = zeros(3,3)

# 3x3 Array{Float64,2}:

# 0.0 0.0 0.0

# 0.0 0.0 0.0

# 0.0 0.0 0.0

# Now populate it with random uniforms

rand!(z1)

# Notice the ! notation for functions that modify their arguments

z1

# 3x3 Array{Float64,2}:

# 0.615666 0.76537 0.256698

# 0.984495 0.322722 0.0808458

# 0.775836 0.0696626 0.275759

# rand(r) can also be used to draw elements from a range r.

# For example, to draw a 2 x 2 array with elements drawn from 1 to 10.

rand(1:10,2,2)

# 2x2 Array{Int64,2}:

# 10 3

# 9 9

# We might also want to generate random boolean values which could be

# accomplished with either

rand(0:1,2,2).==1

# true false

# false true


# Or the specific function

randbool(2,2)

# false false

# true false

# The final type of random variable that comes with the base

# install of Julia is the random normal generator:

randn(3,2)

# 3x2 Array{Float64,2}:

# -1.51181 0.139519

# -0.21379 -0.30305

# -0.711524 -0.048655

# This generates standard normals, but of course we can easily rescale

# a standard normal to have standard deviation x and mean y

x = 90

y = -34

randn(3,2)*x+y

# 3x2 Array{Float64,2}:

# -4.43119 -101.457

# -137.832 38.9093

# -9.66817 -20.2061


# To conclude, I would like to say that while Julia's random number generator

# works as we would like it to in many ways, it is also limited in its

# ability to generate, from the standard library, random numbers from many

# different distributions. R, for instance, has built-in functions for

# drawing from the gamma distribution, the Poisson, the binomial, etc.

# That said, this should not be considered a major drawback, since it is

# possible to script one's own random draw functions pretty easily given the

# right references. Given that Julia is on version 0.3, we can look forward

# to many improvements before its first official release.

**Please comment on the article here:** **Econometrics by Simulation**


The post Two hour marathon in 2041 appeared first on All About Statistics.

Today Dennis Kimetto ran the Berlin Marathon in 2:02:57, shaving 26 seconds off the previous world record. That means it's time to check in on the world record progression and update my update from last year of my article from two years ago. The following is a revised version of that article, including the new data point.

Abstract: I propose a model that explains why world record progressions in running speed are improving linearly, and should continue as long as the population of potential runners grows exponentially. Based on recent marathon world records, I extrapolate that we will break the two-hour barrier in 2041.

-----

Let me start with the punchline:

In general, linear extrapolation of a time series is a dubious business, but in this case I think it is justified:

1) The distribution of running speeds is not a bell curve. It has a long tail of athletes who are much faster than normal runners. Below I propose a model that explains this tail, and suggests that there is still room between the fastest human ever born and the fastest possible human.

2) I’m not just fitting a line to arbitrary data; there is theoretical reason to expect the progression of world records to be linear, which I present below. And since there is no evidence that the curve is starting to roll over, I think it is reasonable to expect it to continue for a while.

3) Finally, I am not extrapolating beyond reasonable human performance. The target pace for a two-hour marathon is 13.1 mph, which is slower than the current world record for the half marathon (58:23 minutes, or 13.5 mph). It is unlikely that the current top marathoners will be able to maintain this pace for two hours, but we have no reason to think that it is beyond theoretical human capability.
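The pace comparison in point 3 is easy to check. A quick Python sanity check of the arithmetic, using 26.219 miles for the marathon distance:

```python
# Arithmetic check of the paces quoted in the text.
marathon_mi = 26.219
half_mi = marathon_mi / 2

two_hour_pace = marathon_mi / 2.0            # mph needed for a 2:00:00 marathon
half_wr_hours = (58 * 60 + 23) / 3600.0      # half-marathon WR of 58:23, in hours
half_wr_pace = half_mi / half_wr_hours       # mph of the half-marathon WR
```

The two-hour marathon pace works out to about 13.1 mph and the half-marathon record pace to about 13.5 mph, so the target pace is indeed slower than a pace a human has already sustained for an hour.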

My model, and the data it is based on, are below.

-----

In April 2011, I collected the world record progression for running events of various distances and plotted speed versus year (here's the data, mostly from Wikipedia). The following figure shows the results:

You might notice a pattern; for almost all of these events, the world record progression is a remarkably straight line. I am not the first person to notice, but as far as I know, no one has proposed an explanation for the shape of this curve.

Until now -- I think I know why these lines are straight. Here are the pieces:

1) Each person's potential is determined by several factors that are independent of each other; for example, your VO2 max and the springiness of your tendons are probably unrelated.

2) Each runner's overall potential is limited by their weakest factor. For example, if there are 10 factors, and you are really good at 9 of them, but below average on the 10th, you will probably not be a world-class runner.

3) As a result of (1) and (2), potential is not normally distributed; it is long-tailed. That is, most people are slow, but there are a few people who are much faster.

4) Runner development has the structure of a pipeline. At each stage, the fastest runners are selected to go on to the next stage.

5) If the same number of people went through the pipeline each year, the rate of new records would slow down quickly.

6) But the number of people in the pipeline actually grows exponentially.

7) As a result of (5) and (6) the rate of new records is linear.

8) This model suggests that linear improvement will continue as long as the world population grows exponentially.

Let's look at each of those pieces in detail:

Physiological factors that determine running potential include VO2 max, anaerobic capacity, height, body type, muscle mass and composition (fast and slow twitch), conformation, bone strength, tendon elasticity, healing rate and probably more. Psychological factors include competitiveness, persistence, tolerance of pain and boredom, and focus.

Most of these factors have a large inherent component; they are mostly independent of each other, and any one of them can be a limiting factor. That is, if you are good at all of them but one, and bad at that one, you will not be a world-class runner. To summarize: there is only one way to be fast, but there are a lot of ways to be slow.

As a simple model of these factors, we can generate a random person by picking N random numbers, where each number is drawn from a normal distribution and passed through a logistic transform. This yields a bell-shaped distribution bounded between 0 and 1, where 0 represents the worst possible value (at least for purposes of running speed) and 1 represents the best possible value.

Then to represent the running potential of each person, we compute the minimum of these factors. Here's what the code looks like:

import math, random

def Logistic(x):
    return 1 / (1 + math.exp(-x))   # maps the real line to (0, 1)

def GeneratePerson(n=10):
    factors = [random.normalvariate(0.0, 1.0) for i in range(n)]
    logs = [Logistic(x) for x in factors]
    return min(logs)

Yes, that's right, I just reduced a person to a single number. Cue the humanities majors lamenting the blindness and arrogance of scientists. Then explain that this is supposed to be an explanatory model, so simplicity is a virtue. A model that is as rich and complex as the world is not a model.

Here's what the distribution of potential looks like for different values of N:

When N=1, there are many people near the maximum value. If we choose 100,000 people at random, we are likely to see someone near 98% of the limit. But as N increases, the probability of large values drops fast. For N=5, the fastest runner out of 100,000 is near 85%. For N=10, he is at 65%, and for N=50 only 33%.

In this kind of lottery, it takes a long time to hit the jackpot. And that's important, because it suggests that even after 7 billion people, we might not have seen anyone near the theoretical limit.

Let's see what effect this model has on the progression of world records. Imagine that we choose a million people and test them one at a time for running potential (and suppose that we have a perfect test). As we perform tests, we keep track of the fastest runner we have seen, and plot the "world record" as a function of the number of tests.

Here's the code:

def WorldRecord(m=100000, n=10):
    data = []
    best = 0.0
    for i in range(m):
        person = GeneratePerson(n)
        if person > best:
            best = person
            data.append((i / m, best))
    return data

And here are the results with M=100,000 people and the number of factors N=10:

The x-axis is the fraction of people we have tested. The y-axis is the potential of the best person we have seen so far. As expected, the world record increases quickly at first and then slows down.

In fact, the time between new records grows geometrically. To see why, consider this: if it took 100 people to set the current record, we expect it to take 100 more to exceed the record. Then after 200 people, it should take 200 more. After 400 people, 400 more, and so on. Since the time between new records grows geometrically, this curve is logarithmic.
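This geometric spacing is a classical fact about records in i.i.d. sequences: among m independent draws, the expected number of running records is the harmonic number H_m ≈ ln m, which is exactly the logarithmic behavior described above. A quick Python check (the sample sizes are arbitrary choices of mine):

```python
import random

def count_records(m, rng):
    """Count how many times a new maximum appears in m uniform draws."""
    best, records = float("-inf"), 0
    for _ in range(m):
        x = rng.random()
        if x > best:
            best, records = x, records + 1
    return records

rng = random.Random(42)
m, reps = 1000, 2000
avg = sum(count_records(m, rng) for _ in range(reps)) / reps
harmonic = sum(1.0 / i for i in range(1, m + 1))   # H_m ~ ln(m) + 0.5772
```

For m = 1000 the harmonic number is about 7.5, and the simulated average number of records lands right on it: a thousand "tests" produce only a handful of new world records.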

So if we test the same number of people each year, the progression of world records is logarithmic, not linear. But if the number of tests per year grows exponentially, that's the same as plotting the previous results on a log scale. Here's what you get:

That's right: a log curve on a log scale is a straight line. And I believe that that's why world records behave the way they do.

This model is unrealistic in some obvious ways. We don't have a perfect test for running potential and we don't apply it to everyone. Not everyone with the potential to be a runner has the opportunity to develop that potential, and some with the opportunity choose not to.

But the pipeline that selects and trains runners behaves, in some ways, like the model. If a person with record-breaking potential is born in Kenya, where running is the national sport, the chances are good that he will be found, have opportunities to train, and become a world-class runner. It is not a certainty, but the chances are good.

If the same person is born in rural India, he may not have the opportunity to train; if he is in the United States, he might have options that are more appealing.

So in some sense the relevant population is not the world, but the people who are likely to become professional runners, given the talent. As long as this population is growing exponentially, world records will increase linearly.

That said, the slope of the line depends on the parameter of exponential growth. If economic development increases the fraction of people in the world who have the opportunity to become professional runners, these curves could accelerate.

So let's get back to the original question: when will a marathoner break the 2-hour barrier? Before 1970, marathon times improved quickly; since then the rate has slowed. But progress has been consistent for the last 40 years, with no sign of an impending asymptote. So I think it is reasonable to fit a line to recent data and extrapolate. Here are the results [based on 2011 data; see above for the update]:

The red line is the target: 13.1 mph. The blue line is a least squares fit to the data. So here's my prediction: there will be a 2-hour marathon in 2043. I'll be 76, so my chances of seeing it are pretty good. But that's a topic for another article (see Chapter 1 of Think Bayes).
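The extrapolation itself is just solving a linear equation. A minimal sketch follows; the starting pace and slope are assumed round numbers for illustration, not the actual fitted values.

```python
MARATHON_MILES = 26.21875        # 26 miles, 385 yards
target = MARATHON_MILES / 2.0    # 2-hour pace: about 13.11 mph

# assumed values for illustration: a pace near the 2011 record and a
# hypothetical fitted slope, in mph gained per year
v0, year0 = 12.7, 2011
slope = 0.013

crossing = year0 + (target - v0) / slope   # year the fitted line hits the target
```

With these assumed numbers the line crosses the target in the early 2040s, consistent with the prediction above.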

**Please comment on the article here:** **Probably Overthinking It**

The post Two hour marathon in 2041 appeared first on All About Statistics.

]]>The post TURF Analysis: A Bad Answer to the Wrong Question appeared first on All About Statistics.

]]>Now that R has a package performing Total Unduplicated Reach and Frequency (TURF) Analysis, it might be a good time to issue a warning to all R users. DON'T DO IT!

The technique itself is straight out of media buying from the 1950s. Given n alternative advertising options (e.g., magazines), which set of size k will reach the most readers and be seen most often? Unduplicated reach is the primary goal because we want everyone in the target audience to see the ad. In addition, it was believed that seeing the ad more than once made it more effective (that is, until wearout), which is why frequency is a component. When TURF is used to create product lines (e.g., flavors of ice cream to carry given limited freezer space), frequency tends to be downplayed and the focus placed on reaching the largest percentage of potential customers. All this seems simple enough until one looks carefully at the details and realizes that we are interpreting random variation.
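The computation behind TURF is brute-force enumeration of item sets. A minimal sketch in Python (the 0/1 purchase matrix here is randomly generated filler, not the turfR example data):

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
# made-up data: 180 respondents x 10 items, 1 = would buy
buy = (rng.random((180, 10)) < 0.3).astype(int)

def turf(buy, k):
    """Rank all k-item sets by unduplicated reach, then frequency."""
    results = []
    for combo in combinations(range(buy.shape[1]), k):
        subset = buy[:, combo]
        reach = subset.any(axis=1).mean()   # bought at least one item in the set
        freq = subset.sum(axis=1).mean()    # average number bought from the set
        results.append((combo, reach, freq))
    return sorted(results, key=lambda r: (r[1], r[2]), reverse=True)

top10 = turf(buy, 3)[:10]
```

The point to notice is that the ranking at the top of this list is determined by tiny differences in sample proportions.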

The R package turfR includes an example showing how to use its turf() function by setting n to 10 and letting k range from 3 to 6.


This code produces a considerable amount of output. I will show only the 10 best triplets from the 120 possible sets of three that can be formed from 10 alternatives. The rchX column gives the weighted proportion of the 180 individuals in the dataset who would buy at least one of the items in the combination, where the columns labeled 1 through 10 mark which items are included. Thus, according to the first row, 99.9% would buy something if Items 8, 9, and 10 were offered for sale.

|    | combo | rchX     | frqX     | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----|-------|----------|----------|---|---|---|---|---|---|---|---|---|----|
| 1  | 120   | 0.998673 | 2.448993 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1  |
| 2  | 119   | 0.998673 | 2.431064 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1  |
| 3  | 99    | 0.995773 | 1.984364 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1  |
| 4  | 110   | 0.992894 | 2.185398 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1  |
| 5  | 64    | 0.991567 | 1.898693 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1  |
| 6  | 109   | 0.990983 | 2.106944 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1  |
| 7  | 97    | 0.99085  | 1.966436 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1  |
| 8  | 116   | 0.989552 | 2.341179 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1  |
| 9  | 85    | 0.989552 | 2.042792 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1  |
| 10 | 36    | 0.989552 | 1.800407 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1  |

The sales pitch for TURF depends on showing only the "best" solution for 3 through 6. Once we look down the list, we find that there are lots of equally good combinations with different products (e.g., the combination in the 7th position yields 99.1% reach with products 4, 7 and 10). With a sample size of 180, I do not need to run a bootstrap to know that the drop from 99.9% to 99.1% reflects random variation or error.
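A back-of-the-envelope standard error makes the point. Treating the two reach figures as independent proportions from 180 respondents (they are actually positively correlated, which only strengthens the argument):

```python
import math

n = 180
p1, p2 = 0.999, 0.991          # top reach vs. the 7th-ranked reach

# standard error of the difference between two independent proportions
se_diff = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
z = (p1 - p2) / se_diff        # about 1.1 standard errors: not a real difference
```

A gap of roughly one standard error is exactly what sampling noise looks like.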

Of course, the data from turfR is simulated, but I have worked with many clients and many different datasets across a range of categories and I have never found anything but random differences among the top solutions. I have seen solutions where the top several hundred combinations cannot be distinguished based on reach, which is reasonable given that the number of combinations increases rapidly with n and k (e.g., the R function choose(30,5) indicates that there are 142,506 possible combinations of 30 things in sets of 5). You can find an example of what I see over and over again by visiting the TURF website for XLSTAT software.
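The combinatorial explosion is easy to verify:

```python
from math import comb

n_triplets = comb(10, 3)     # 120 sets in the turfR example
n_sets = comb(30, 5)         # 142,506 for 30 items in sets of 5
n_bigger = comb(30, 10)      # 30,045,015: it grows fast
```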

Obviously, there is no single best item combination that dominates all others. It could have been otherwise. For example, it is possible that the market consists of distinct segments with each wanting one and only one item.

With no overlap in this Venn diagram, it is clear that vanilla is the best single item, followed by vanilla plus chocolate as the best pair, and so on, had there been more flavors separated in this manner.

However, consumer segments are seldom defined by individual offerings in the market. You do not stop buying toothpaste because your brand has been discontinued. TURF asks the wrong question because consumer segmentation is not item-based.

As a quick example, consider credit card reward programs with their categories covering airlines, cash back, gas rebates, hotels, points, shopping, and travel. Each category can contain multiple reward offers. A TURF analysis would seek the best individual rewards while ignoring the categories. Yet comparison websites use categories to organize searches because consumer segments are structured around the benefits offered by each category.

The TURF Analysis procedure from XLSTAT allows you to download an Excel file with purchase intention ratings for 27 items from 185 respondents. A TURF analysis would require that we set a cutoff score to transform the 1 through 5 ratings into a 0/1 binary measure. I prefer to maintain the 5-point scale and treat purchase intent as an intensity score after subtracting one, so that the scale ranges from 0=not at all to 4=quite sure. A nonnegative matrix factorization (NMF) reveals that the 27 items in the columns fall into 8 separable row categories: red marks a high probability of membership, while yellow marks values close to zero, the categories where the product does not belong.
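For readers who want to try this, here is a minimal NMF sketch using the classic Lee-Seung multiplicative updates in plain numpy. The ratings matrix below is random filler standing in for the XLSTAT file, so it will not reproduce the heatmap's structure; it only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(2)
# stand-in for the real data: 185 respondents x 27 items, rescaled to 0-4
V = rng.integers(1, 6, size=(185, 27)).astype(float) - 1

k = 8                                   # number of latent categories
W = rng.random((185, k)) + 0.1          # respondent weights on the categories
H = rng.random((k, 27)) + 0.1           # item coefficients (the heatmap rows)

eps = 1e-9
initial_err = np.linalg.norm(V - W @ H)
for _ in range(200):                    # Lee-Seung updates for Frobenius loss
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
final_err = np.linalg.norm(V - W @ H)
```

Each item's column of H can then be inspected to see which of the k categories it loads on, which is what the heatmap displays.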

The above heatmap displays the coefficients for each of the 27 products, as the original Excel file names them. Unfortunately, we have only the numbers and no description of the 27 products. Still, it is clear that interest has an underlying structure and that perhaps we ought to consider grouping the products based on shared features, benefits, or usages. For example, what do Products 5, 6 and 17, clustered together at the end of this heatmap, have in common? To be clear, we are looking for stable effects that can be found in the data and in the market where purchases are actually made.

The right question asks about consumer heterogeneity and whether it supports product differentiation. Different product offerings are only needed when the market contains segments seeking different benefits. Those advocating TURF analysis often use ice cream flavors as their example, as I did in the above Venn diagram. What if the benefit driving sales of less common flavors was not the flavor itself but the variety associated with a new flavor, or a special occasion when one wants to deviate from the norm? A segmentation, whether NMF or another clustering procedure, would uncover a group interested in less typical flavors (probably many such flavors). This is what I found in the purchase history of whiskey drinkers: a number of segments each buying one of the major brands, plus a special-occasion or variety-seeking segment buying many niche brands. All of this is missed by a TURF analysis, which gives us instead a bad answer to the wrong question.

First, download the Excel file, convert it to csv format, and set the working directory to the location of the data file.

test <- read.csv("demoTurf.csv")


**Please comment on the article here:** **Engaging Market Research**

The post TURF Analysis: A Bad Answer to the Wrong Question appeared first on All About Statistics.

]]>The post It’s time for job application now! appeared first on All About Statistics.

]]>I collected the following series on applying for faculty positions in 2011, when I was in the second year of my PhD. Now it's my turn to apply for jobs, so I will share the following useful materials with all of you who want to apply for jobs this year.

- Applying for Jobs: Application Materials
- Applying for Jobs : Sending out Applications
- Applying for Jobs : Phone Interviews
- Applying for Jobs: On-Site Interviews
- Applying for Jobs: the Job Talk

**Please comment on the article here:** **Honglang Wang's Blog**

The post It’s time for job application now! appeared first on All About Statistics.

]]>