The post What does CNN have in common with Carmen Reinhart, Kenneth Rogoff, and Richard Tol: They all made foolish, embarrassing errors that would never have happened had they been using R Markdown appeared first on Statistical Modeling, Causal Inference, and Social Science.

Had the CNN team used an integrated statistical analysis and display system such as R Markdown, nobody would’ve needed to type in the numbers by hand, and the above embarrassment never would’ve occurred.

And CNN *should* be embarrassed about this: it’s much worse than a simple typo, as it indicates they don’t have control over their data. Just like those Rasmussen pollsters whose numbers add up to 108%. I sure wouldn’t hire *them* to do a poll for me!

I was going to follow this up by saying that Carmen Reinhart and Kenneth Rogoff and Richard Tol should learn about R Markdown—but maybe that sort of software would not be so useful to them. Without the possibility of transposing or losing entire columns of numbers, they might have a lot more difficulty finding attention-grabbing claims to publish.

Ummm . . . I better clarify this. I’m *not* saying that Reinhart, Rogoff, and Tol did their data errors on purpose. What I’m saying is that their cut-and-paste style of data processing enabled them to make errors which resulted in dramatic claims which were published in leading journals of economics. Had they done smooth analyses of the R Markdown variety (actually, I don’t know if R Markdown was available back in 2009 or whenever they all did their work, but you get my drift), it wouldn’t have been so easy for them to get such strong results, and maybe they would’ve been a bit less certain about their claims, which in turn would’ve been a bit less publishable.

To put it another way, sloppy data handling gives researchers yet another “degree of freedom” (to use Uri Simonsohn’s term) and biases claims to be more dramatic. Think about it. There are three options:

1. If you make no data errors, fine.

2. If you make an inadvertent data error that *works against* your favored hypothesis, you look at the data more carefully and you find the error, going back to the correct dataset.

3. But if you make an inadvertent data error that *supports* your favored hypothesis (as happened to Reinhart, Rogoff, and Tol), you have no particular motivation to check, and you just go for it.

Put these together and you get a systematic bias in favor of your hypothesis.
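The mechanism can be checked with a toy simulation (a sketch with made-up numbers: the true effect is fixed at zero, and each study has a 30% chance of a data error of fixed size). Errors that work against the favored hypothesis get caught and corrected; errors that favor it go unchecked:

```python
import random

random.seed(42)

def run_study(error_rate=0.3, error_size=1.0):
    """One study of a null effect, with asymmetric error-checking."""
    clean = random.gauss(0, 1)            # unbiased estimate; true effect is 0
    if random.random() < error_rate:      # otherwise option 1: no error
        corrupted = clean + random.choice([-1, 1]) * error_size
        if corrupted > clean:             # option 3: error favors the hypothesis -> report it
            return corrupted
        # option 2: error hurts the hypothesis -> recheck, find it, fix it
    return clean

estimates = [run_study() for _ in range(200_000)]
mean = sum(estimates) / len(estimates)
print(mean)   # positive (about +0.15 here), even though nothing real is going on
```

Averaged over many studies, the reported estimate is pulled away from zero: exactly the systematic bias described above.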

Science is degraded by looseness in data handling, just as it is degraded by looseness in thinking. This is one reason that I agree with Dean Baker that the Excel spreadsheet error was worth talking about and was indeed part of the bigger picture.

Reproducible research is higher-quality research.

**P.S.** Some commenters write that, even with Markdown or some sort of integrated data-analysis and presentation program, data errors can still arise. Sure. I’ll agree with that. But I think the three errors discussed above are all examples of cases where an interruption in the data flow caused the problem, with the clearest example being the CNN poll, where, I can only assume, the numbers were calculated using one computer program, then someone read the numbers off a screen or a sheet of paper and typed them into another computer program to create the display. This would not have happened using an integrated environment.

The post Shamer shaming appeared first on Statistical Modeling, Causal Inference, and Social Science.

I can’t recall when I first saw “shaming” used in its currently popular sense. I remember noting “slut shaming” and “fat shaming” but did they first become popular two years ago? Three? At any rate, “shaming” is now everywhere…and evidently it’s a very bad thing.

When I first saw the term, I agreed with the message it was trying to convey: it is bad to try to make people feel ashamed of being fat, or of wanting to have sex. Indeed, I’d say it’s bad to try to make people feel ashamed of anything that isn’t unethical or morally wrong or at least irritating. Down with slut shaming! Down with fat shaming! Down with gay shaming!

But somehow all criticism seems to have become “shaming.” A few days ago I posted a message to my neighborhood listserv, reminding people that (1) we are in a severe drought (I live in California), (2) washing one’s car with a hose uses a lot of water, and indeed is a fineable offense if you don’t use a nozzle that shuts off the water when you release it, (3) all commercial car washes in our area recycle their water, and (4) our storm drains empty directly into a creek. The next day I got an angry email from a neighbor: how dare I shame him for washing his car on the street?

On this blog, Andrew has frequently posted about researchers doing shameful things, such as plagiarizing, and refusing to admit to major mistakes in their published work. (There’s nothing shameful about making a mistake, at least not if you’ve tried hard to get it right, but it is shameful to refuse to admit it). And, sure enough, some people have complained that Andrew is “shaming” these people.

Plagiarist-shaming, academic fraud-shaming, hack journalist-shaming, all of those are evidently in the same unacceptable category as fat-shaming and slut-shaming. There is nothing shameful in the world, except trying to make somebody feel ashamed. Shamer-shaming is the only kind of shaming that is OK.

This post is by Phil Price

The post Palko’s on a roll appeared first on Statistical Modeling, Causal Inference, and Social Science.

At least we can all agree that ad hominem and overly general attacks are bad: A savvy critique of the way in which opposition of any sort can be dismissed as “ad hominem” attacks. As a frequent critic (and subject of criticism), I agree with Palko that this sort of dismissal is a bad thing.

Wondering where the numbers come from — Rotten Tomatoes: These numbers really are rotten. Palko writes:

This figure indicates a “Good” rating. How does that translate to “Rotten”? . . . this is pretty clearly a glitch and it’s a glitch in the easy part of review aggregation . . . This brings up one of my [Palko's] problems with data-driven journalism. Reporters and bloggers are constantly barraging us with graphs and analyses and of course, narratives looking at things like Rotten Tomatoes rankings. All too often, though, their process starts with the data as given. They spend remarkably little time asking where the data came from or whether it’s worth bothering with.

I’ll just throw in the positive message that criticism can improve the numbers. After seeing this post, maybe the people at the website in question will be motivated to clean their data a bit.

The education reform movement has never lent itself to the standard left/right axis. Not only was its support bipartisan; it was often the supporters on the left who were quickest to embrace privatization, deregulation and market-based solutions. Zephyr Teachout may be a sign that anomaly is ending.

I’d also be interested in seeing poll data on this (if it’s possible to get good data, given the low salience of this issue for many voters). My guess is that, even if many leaders on the left were supportive of privatization, etc., that these were not so popular among rank-and-file, lower-income left-leaning voters.

In any case, I’m fascinated by this topic for several reasons, including its inherent importance and the compelling stories of various education-reform scams and scandals (well relayed by Palko over the past few years). Also, from a political-science perspective, I’ve always been interested in issues that *don’t* line up with the usual partisan divide.

Driverless Cars and Uneasy Riders: Dramatic claims are being made about the potential fuel and economic-efficiency gains from the use of driverless cars. Palko (and I) are skeptical.

Another story that needs to be on our radar — ECOT: Yet another education reform scam that should be a scandal. Eternal vigilance etc.

I know I go on about ignoring Canada’s education system: Palko links to, and criticizes, a report that’s so bad in so many dimensions that it probably deserves its own post here or at the Monkey Cage.

Selection effects on steroids: I’m not such a fan of the expression “on steroids”—to me it’s a bit of a journalism cliché that should’ve died along with the 80s—but the statistical, and policy, point is important. Selection bias is one of those things that we statisticians have known about and talked about forever, but even so we probably don’t talk about it enough. As a researcher and as a teacher, I feel the challenge is to go beyond criticism and move to adjustment. But criticism is often a necessary first step.

Support your local journalists: Yup.

I know I pick on Netflix a lot: “the way flacks and hacks force real stories into standard narratives”

The essential distinction between charter schools and charter school chains:

The charter school sector is highly diverse. It ranges from literal mom and pop operations to nation-wide corporations. The best of these schools get good results, genuinely care about their students and can fill an important educational niche. The worst aggressively cook data to conceal mediocre results and gouge the taxpayers.

If current trends hold, I think charter schools will be nearly as diverse and I’m not optimistic about who the winners will be.

As Steven Levitt would say, incentives matter.

The post What do you do to visualize uncertainty? appeared first on Statistical Modeling, Causal Inference, and Social Science.

What do you do to visualize uncertainty?

Do you only use static methods (e.g. error bounds)?

Or do you also make use of dynamic means (e.g. have the display vary over time proportional to the error, so you don’t know exactly where the top of the bar is, since it moves while you’re watching)? Have you any thoughts on this topic?

I assume that since a Bayesian generates a posterior dist’n the output should not be a point but rather a dist’n; and, you being the most prolific Bayesian I know, I figure you’ve got three or four old papers that you’ve written on it.

OK, sure, when you put it that way, my collaborators and I do have a few papers on the topic:

Visualization in Bayesian data analysis

Visualizing distributions of covariance matrices

Multiple imputation for model checking: completed-data plots with missing and latent data

A Bayesian formulation of exploratory data analysis and goodness-of-fit testing

All maps of parameter estimates are misleading

But I don’t really have much else to say right now. Dynamic graphics seem like a good idea but I’ve never programmed them myself. In many settings it will work to display point estimates, but sometimes this can create big problems (as discussed in some of the above-linked papers) because Bayesian point estimates will tend to be too smooth—less variable—compared to the variation in the underlying parameters being modeled.
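That last point about over-smooth point estimates can be shown in a few lines (a sketch with assumed numbers: a normal-normal model with unit variances, not any particular example from the papers above):

```python
import random
import statistics

random.seed(0)

# Hypothetical hierarchical setup: true parameters theta_j ~ N(0, tau^2),
# noisy observations y_j ~ N(theta_j, sigma^2).
tau, sigma = 1.0, 1.0
thetas = [random.gauss(0, tau) for _ in range(10_000)]
ys = [random.gauss(t, sigma) for t in thetas]

# The posterior mean in this normal-normal model shrinks each y toward 0:
k = tau**2 / (tau**2 + sigma**2)      # shrinkage factor, 0.5 here
post_means = [k * y for y in ys]

print(statistics.stdev(thetas))       # about 1.0: variation in the true parameters
print(statistics.stdev(post_means))   # about 0.71: the point estimates are smoother
```

So a map or display of posterior means will look systematically less variable than the underlying parameters being modeled.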

So I’m kicking this one out to the commenters to see if they can offer some useful suggestions.

The post They know my email but they don’t know me appeared first on Statistical Modeling, Causal Inference, and Social Science.

Good afternoon,

I wanted to see if the data my colleague David sent to you was of any interest. I have attached here additional animated Gifs from PwC’s CEO survey. Let me know if you would be interested in featuring these pieces or in a guest post by PwC.

Best,

**** on behalf of **

Attached were two infographics which you can bet I’m not including here.

P.S. Just to be clear: I don’t think unsolicited emails are so horrible; I myself send emails to strangers all the time. Nor am I offended by the content. I just think it’s funny that there are people out there who think I’m interested in publishing animated chartjunk.

The post More bad news for the buggy-whip manufacturers appeared first on Statistical Modeling, Causal Inference, and Social Science.

The main factor is technology. It’s a major cause of today’s response-rate problems – but it’s also the solution.

For decades, survey research has revolved around the telephone, and it’s worked very well. But Americans’ relationship with their phones has radically changed. It’s no surprise that survey research will have to as well. . . .

In the future, we are unlikely to live in a country in which information is scant. We are certain to live in one in which information is collected in different ways. The transition is under way, and the federal government is among those institutions that will need to adapt.

Let’s hope that the American Association for Public Opinion Research can adapt too.

The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Tues:** They know my email but they don’t know me

**Wed:** What do you do to visualize uncertainty?

**Thurs:** Sokal: “science is not merely a bag of clever tricks . . . Rather, the natural sciences are nothing more or less than one particular application — albeit an unusually successful one — of a more general rationalist worldview”

**Fri:** Question about data mining bias in finance

**Sat:** Estimating discontinuity in slope of a response function

**Sun:** I can’t think of a good title for this one.

The post Six quotes from Kaiser Fung appeared first on Statistical Modeling, Causal Inference, and Social Science.

One of the biggest myths of Big Data is that data alone produce complete answers.

Their “data” have done no arguing; it is the humans who are making this claim.

That last one is an appropriate response to the Freshman Fallacy.

The post He just ordered a translation from Diederik Stapel appeared first on Statistical Modeling, Causal Inference, and Social Science.

So I am applying for a DC driver’s license and needed a translation of my Spanish license to show to the DMV. I go to http://www.onehourtranslation.com/ and as I prepare to pay I see a familiar face in the bottom banner:

It appears Stapel is one of their “over 15,000 dedicated professional translators” (or maybe they put his picture there unauthorized). Either way, I’m now worried I may get a made-up/plagiarized translation.

There are worse ways for a multilingual person to make a living . . . .

Perhaps they could get Bruno Frey to do some translations too. He’d only have to do it once, then he could just copy it over and over and over.

The post What is the purpose of a poem? appeared first on Statistical Modeling, Causal Inference, and Social Science.

As is often the case, I’m on the blog to procrastinate: in this case, my colleagues and I are preparing a new course and there’s tons of important work to be done. I’m getting tired of reading comments on economics and empiricism and so I scooted over to Basbøll’s blog and kicked off a brief comment thread about the academic entertainer Slavoj Zizek. At first I was going to post and continue that discussion here, but I don’t give two poops about Slavoj Zizek, so I followed Basbøll’s blogroll link to “Stupid Motivational Tricks” and right away found something interesting.

The something interesting that I found was a post by Jonathan Mayhew about someone who’s the poet laureate of North Carolina. I had no idea that an individual state would have a poet laureate but it seems like a good idea, a quite reasonable nearly cost-free thing to do; indeed it would be cool to have all sorts of official state art. In reading the post I was mildly irritated by Mayhew’s use of “NC” as a generic replacement for “North Carolina.” The abbreviation is fine in some contexts but I found it a bit jarring to read, “The literary community of NC . . .” On the other hand, it’s just a goddam blog so I don’t know what I’m supposed to be expecting.

But I’m getting completely off the point here. What happened is that Mayhew quoted a couple of mediocre passages from poems by two of North Carolina’s poet laureates (apparently they just had a changing of the guard).

Mayhew’s reactions gave me some thoughts of my own regarding the purpose of poetry. I’ll first copy what he wrote and then give my reflections.

Mayhew quotes from the previous laureate:

“Joan and I were in Raleigh together

for the first time to take the tour

for new vista volunteers

at North Carolina’s Central Prison…”

and then shares his reaction:

Ouch. It’s fine to use seemingly plain language, etc… but no rhythm, nothing going on in the language. This kind of writing just causes physical pain to me.

Then he quotes from the recent laureate:

“I’m grateful for my car, he says,

voice raspy with hard living.

Tossed on the seat, a briefcase

covered with union stickers,

stuffed with unemployment forms,

want ads, old utility bills,

birth certificate, school application

papers for the skinny ten-year-old

sitting beside him who loves baseball…”

This he characterizes as worse than the first poem (“not much worse,” though), but I don’t quite understand where this ranking is coming from, given that he follows up with, “More is going on in her language, actually. It’s not exactly good, but it’s salvageable, with some concreteness there at least.”

I assume that we can all agree, though, that it’s hard to judge either poem, or either poet, by these short excerpts. Both excerpts radiate mediocrity but of course a bit of mediocrity can do the job in the context of a larger message. I’m pretty sure that, for almost any major poet, you could without much difficulty find passages that, if shown to me in isolation, would not sparkle and could indeed look a bit like hackwork. I mean, sure, “voice raspy with hard living” sounds cliched, but who among us does not grab a cliche from time to time. For all we know from this excerpt, the use of the cliche is part of the point in establishing the narrator’s voice.

OK, let me be clear here. I’m not trying to get all contrarian on you and praise these two poets. I have no problem giving Mayhew the benefit of the doubt, I’ll assume he read a bit by each of them and with these excerpts is giving something of a true sense of these poems’ style and content. So I will accept (until convinced otherwise) that these poets are indeed mediocre.

**What is the purpose of a poem?**

And this brings us to today’s topic. The thing that bothers me about Mayhew’s post (even though I have a feeling I’d agree with him 100% about the strengths and weaknesses of these poems, and I wouldn’t be surprised if we share many tastes about and attitudes toward literature) is the implicit attitude that I see there, which I feel I’ve seen in other discussions of poetry, which is that the purpose of a poem is to be wonderful.

Huh? “The purpose of a poem is to be wonderful.” That seems like a reasonable statement, no? Who could disagree with that?

To see my problem with this statement (which, to be fair, Mayhew never said, but which I see as implicit in his post), consider the related question, “What is the purpose of a novel?” Or, for that matter, what is the purpose of a research article? Or what is the purpose of a song?

My point is that I think it’s a bad attitude to think that the purpose of a poem is to be wonderful. It’s insulting to poetry to give it such a narrow range. A poem is a sort of song without music and, as such can have many different purposes.

OK, procrastination successful. An hour spent, now time for bed.

The post mysterious shiny things appeared first on Statistical Modeling, Causal Inference, and Social Science.

To reproduce it yourself: download ui.R, server.R, and healthexp.Rds

Have a folder called “App-Health-Exp” in your working directory, with ui.R and server.R in the “App-Health-Exp” folder. Have the dataset healthexp.Rds in your working directory.

Then run this code:

if (!require(devtools)) install.packages("devtools")
devtools::install_github("jcheng5/googleCharts")
install.packages("dplyr")
install.packages("shiny")
library(shiny)
library(googleCharts)
library(dplyr)

data = readRDS("healthexp.Rds")
head(data)

# Problem isn't the data; it seems that Switzerland is in Europe
# in both 2001 and 2002:
data[data$Year == 2001 & data$Country == "Switzerland",]
data[data$Year == 2002 & data$Country == "Switzerland",]

runApp("App-Health-Exp")

Anyone know what is happening?

The post *Bayesian Cognitive Modeling* Examples Ported to Stan appeared first on Statistical Modeling, Causal Inference, and Social Science.

There’s a new intro to Bayes in town.

- Michael Lee and Eric-Jan Wagenmakers. 2014.
*Bayesian Cognitive Modeling: A Practical Course*. Cambridge University Press.

This book’s a wonderful introduction to applied Bayesian modeling. But don’t take my word for it — you can download and read the first two parts of the book (hundreds of pages including the bibliography) for free from the book’s home page (linked in the citation above). One of my favorite parts of the book is the collection of interesting and instructive example models coded in BUGS and JAGS (also available from the home page). As a computer scientist, I prefer reading code to narrative!

In both spirit and form, the book’s similar to Lunn, Jackson, Best, Thomas, and Spiegelhalter’s *BUGS Book*, which wraps their seminal set of example models up in textbook form. It’s also similar in spirit to Kruschke’s *Doing Bayesian Data Analysis*, especially in its focus on applied cognitive psychology examples.

One of Lee and Wagenmaker’s colleagues, Martin Šmíra, has been porting the example models to Stan and the first batch is already available in the new Stan example model repository (hosted on GitHub):

- GitHub: stan-dev/example-models

Many of the models involve discrete parameters in the BUGS formulation which need to be marginalized out in the Stan models. The Stan 2.5 manual is adding a whole new chapter with some non-trivial marginalizations (change point models, CJS mark-recapture models, and categorical diagnostic accuracy models).
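For a sense of what that marginalization looks like, here is a toy two-component normal mixture (a sketch, not one of the ported models): the discrete component indicator z is summed out on the log scale with a log-sum-exp, which is the same trick the Stan versions use.

```python
from math import exp, log, pi, sqrt

def normal_lpdf(y, mu, sigma):
    """Log density of N(y | mu, sigma)."""
    return -0.5 * ((y - mu) / sigma) ** 2 - log(sigma * sqrt(2 * pi))

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + log(exp(a - m) + exp(b - m))

def log_lik(y, lam=0.5, mu=(-2.0, 2.0), sigma=1.0):
    # Marginalize the indicator z:
    #   p(y) = lam * N(y | mu1, sigma) + (1 - lam) * N(y | mu2, sigma),
    # computed as log_sum_exp of the two log-scale terms.
    return log_sum_exp(log(lam) + normal_lpdf(y, mu[0], sigma),
                       log(1 - lam) + normal_lpdf(y, mu[1], sigma))

print(round(log_lik(0.0), 3))   # -2.919
```

In a Stan model this expression would go straight into the `target +=` statement, with z never appearing as a parameter.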

Expect the rest soon! And feel free to jump on the Stan users group to discuss the models and how they’ve been coded.

*Warning: The models are embedded as strings in R code. We’re looking for a volunteer to pull the models out of the R code and generate data for them in a standalone file that could be used in PyStan or CmdStan.*

If you’d like to contribute Stan models to our example repo, the README at the bottom of the front page of the GitHub repository linked above contains information on what we’d like to get. We only need open-source distribution rights — authors retain copyright for all their work on Stan. Contact us either via e-mail or via the Stan users group.

The post One-tailed or two-tailed appeared first on Statistical Modeling, Causal Inference, and Social Science.

This image of a two-tailed lizard (from here, I can’t find the name of the person who took the picture) never fails to amuse me.

But let us get to the question at hand . . .

Richard Rasiej writes:

I’m currently teaching a summer session course in Elementary Statistics. The text that I was given to use is Triola’s Elementary Statistics, 12th ed.

Let me quote a problem on inference from two proportions:

11. Is Echinacea Effective for Colds? Rhinoviruses typically cause common colds. In a test of the effectiveness of echinacea, 40 of the 45 subjects treated with echinacea developed rhinovirus infections. In a placebo group, 88 of the 103 subjects developed rhinovirus infections (based on data from “An Evaluation of Echinacea Angustifolia in Experimental Rhinovirus Infections,” by Turner et al., New England Journal of Medicine, Vol. 353, No. 4). We want to use a 0.05 significance level to test the claim that echinacea has an effect on rhinovirus infection.

The answer in the back of the teacher’s edition sets up the hypothesis test as H0: p1 = p2, H1: p1 ≠ p2, gives a test statistic of z = 0.57, uses critical values of +/- 1.96, and gives a P-value of .5686.

I was having a hard time explaining the rationale for the book’s approach to my students. My thinking was that since there is no point in claiming that echinacea has an effect on the common cold unless you think it helps, we should be doing a one-tailed test with H0: p1 = p2, H1: p1 < p2. We would still fail to reject the null hypothesis, but with a P-value of .2843.

Or, is what I am missing that, if you are testing the claim that something has an effect you want to also test the possibility that the effect is the opposite of what you’d normally want (e.g. this herb is bad for you, or inhaling smoke is good for you, etc.)?

Any advice you could give me on how best to parse this problem for my students would be greatly appreciated. I already feel very nervous stating, in effect, “well, that’s not the way I would do it.”

My reply:

The quick answer is that maybe echinacea is bad for you! Really though the example is pretty silly, as one can simply compare 40/45 and 88/103 and look at the sampling variability of the proportions. I don’t see that the hypothesis test and p-value add anything.

This doesn’t sound like much, but, amazingly enough, Rasiej replied later that day:

I guess I was led astray by the lead-in to the problem, which seemed to imply that there was a benefit. Obviously it’s better to read the claim carefully and take it literally. So, “test the claim that echinacea has an effect” is two-tailed since ANY effect, beneficial or not, would be significant.

That said, I do agree with you that the example is silly, given the data in the problem.

Thanks again for your insights. They helped in my class today.

Perhaps (maybe I should say “probably”) he was just being polite, but I prefer to think that even a brief reply can convey some useful understanding. Also I think it’s a good general message to take what people say literally. This is not a message that David Brooks likes to hear, I think, but it is, to me, an essential aspect of statistical thinking.

**P.S.** Perhaps I should stress that in my response above I wasn’t saying that confidence intervals are some kind of wonderful automatic replacement for p-values. I was just saying that, in this particular case, it seems to me that you’d want a summary of the information provided by the experiment, and that this summary is best provided by the estimated proportions and their standard errors. To set it up in a p-value context would seem to imply that you’re planning on making a decision about echinacea based on this single experiment, but that wouldn’t make sense at all! No need to jump the gun and go all the way to a decision statement; it seems enough to just summarize the information in the data.
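For what it’s worth, here’s the arithmetic of that summary (a quick sketch; the counts come from the textbook problem quoted above):

```python
from math import sqrt

x1, n1 = 40, 45      # echinacea group: infections / subjects
x2, n2 = 88, 103     # placebo group
p1, p2 = x1 / n1, x2 / n2

# The summary I'd want: estimated difference and its standard error.
diff = p1 - p2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(f"{diff:.3f} +/- {se:.3f}")    # 0.035 +/- 0.058: tiny difference, lots of noise

# The textbook's pooled two-proportion z statistic, for comparison:
pool = (x1 + x2) / (n1 + n2)
z = diff / sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
print(f"z = {z:.2f}")                # z = 0.57, matching the book
```

The difference is a small fraction of its standard error, which says everything the hypothesis test says and more.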

The post “It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet” appeared first on Statistical Modeling, Causal Inference, and Social Science.

A couple months ago in a discussion of differences between econometrics and statistics, I alluded to the well-known fact that everyday uncertainty aversion can’t be explained by a declining marginal utility of money.

What really bothers me—it’s been bothering me for *decades* now—is that this is a simple fact that “everybody knows” (indeed, in comments some people asked why I was making such a big deal about this triviality), but, even so, it remains standard practice within economics to use this declining-marginal-utility explanation.

I don’t have any econ textbooks handy but here’s something from the Wikipedia entry for risk aversion:

Risk aversion is the reluctance of a person to accept a bargain with an uncertain payoff rather than another bargain with a more certain, but possibly lower, expected payoff.

OK so far. And now for their example:

A person is given the choice between two scenarios, one with a guaranteed payoff and one without. In the guaranteed scenario, the person receives $50. In the uncertain scenario, a coin is flipped to decide whether the person receives $100 or nothing. The expected payoff for both scenarios is $50, meaning that an individual who was insensitive to risk would not care whether they took the guaranteed payment or the gamble. However, individuals may have different risk attitudes. A person is said to be:

risk-averse (or risk-avoiding) – if he or she would accept a certain payment (certainty equivalent) of less than $50 (for example, $40), rather than taking the gamble and possibly receiving nothing. . . .

They follow up by defining risk aversion in terms of the utility of money:

The expected utility of the above bet (with a 50% chance of receiving 100 and a 50% chance of receiving 0) is,

E(u)=(u(0)+u(100))/2,

and if the person has the utility function with u(0)=0, u(40)=5, and u(100)=10 then the expected utility of the bet equals 5, which is the same as the known utility of the amount 40. Hence the certainty equivalent is 40.

But this is just wrong. It’s not *mathematically* wrong but it’s wrong in any practical sense, in that a utility function that curves this way between 0 and 100 can’t possibly make any real-world sense.

Way down on the page there’s one paragraph saying that this model has “come under criticism from behavioral economics.”

But this completely misses the point!

It would be as if you went to the Wikipedia entry on planetary orbits and saw a long and involved discussion of the Ptolemaic model, with much discussion of the modern theory of epicycles (image above from Wikipedia, taken from the Astronomy article in the first edition of the Encyclopaedia Britannica), and then, way down on the page, a paragraph saying something like,

The notion of a geocentric universe has come under criticism from Copernican astronomy.

Again, this is frustrating because it’s so simple, it’s so obvious that any utility function that curves so much between 0 and 100 *can’t* keep going forward in any reasonable sense.
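One way to see why it can’t keep going, sketched in the style of Rabin’s calibration argument (the specific bet sizes here are illustrative, mine rather than the post’s): suppose a person turns down a 50/50 lose-$10 / gain-$11 bet at *every* wealth level. Concavity then forces u'(w + 21) ≤ (10/11)·u'(w), so marginal utility decays geometrically, and no gain, however large, can compensate a modest loss.

```python
# Work in units where u'(w) = 1 at current wealth w.
# Each rejected lose-$10/gain-$11 bet implies u'(w + 21) <= (10/11) * u'(w).
step, r = 21, 10 / 11

# Utility available from ANY gain is bounded by a geometric series:
max_gain_utility = step / (1 - r)              # 21 * 11 = 231

# Utility cost of losing m steps of $21 grows at least geometrically:
def loss_utility(m):
    return sum(step * (1 / r) ** k for k in range(m))

m = 1
while loss_utility(m) < max_gain_utility:
    m += 1
print(step * m)   # 168: they must also reject a 50/50 lose-$168 / gain-ANYTHING bet
```

A utility function curvy enough to explain small-stakes risk aversion thus implies absurd behavior at modest stakes, which is exactly why the declining-marginal-utility story falls apart.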

It’s an example I used to give as a class-participation activity in my undergraduate decision analysis class and which I wrote up a few years later in an article on classroom demonstrations.

I’m not claiming any special originality for this result. As I wrote in my recent post,

The general principle has been well-known forever, I’m sure.

Indeed, unbeknownst to me, Matt Rabin published a paper a couple years later with a more formal treatment of the same topic, and I don’t recall ever talking with him about the problem (nor was it covered in Mr. Cutlip’s economics class in 11th grade), so I assume he figured it out on his own. (It would be hard for me to imagine someone thinking hard about curving utility functions and *not* realizing they can’t explain everyday risk aversion.)

In response, commenter Megan agreed with me on the substance but wrote:

I am sure it has NOT been well-known forever. It’s only been known for 26 years and no one really understands it yet.

I’m pretty sure the Swedish philosopher who proved the mathematical phenomenon 10 years before you and 12 years before Matt Rabin was the first to identify it. The Hansson (1988)/Gelman (1998)/Rabin (2000) paradox is up there with Ellsberg (1961), Samuelson (1963) and Allais (1953).

**Not so obvious after all?**

Megan’s comment got me thinking: maybe this problem with using a nonlinear utility function for money is *not* so inherently obvious. Sure, it was obvious to me in 1992 or so when I was teaching decision analysis, but I was a product of my time. Had I taught the course in 1983, maybe the idea wouldn’t have come to me at all.

Let me retrace my thoughts, as best as I can now recall them. What I’d really like is a copy of my lecture notes from 1992 or 1994 or whenever it was that I first used the example, to see how it came up. But I can’t locate these notes right now. As I recall, I taught the first part of my decision analysis class using standard utility theory, first having students solve basic expected-monetary-value optimization problems and then going through the derivation of the utility function given the utility axioms. Then I talked about violations of the axioms and went on from there.

It was a fun course and I taught it several times, at Berkeley and at Columbia. Actually, the first time I taught the subject it was something of an accident. Berkeley had an undergraduate course on Bayesian statistics that David Blackwell had formerly taught. He had retired so they asked me to teach it. But I wasn’t comfortable teaching Bayesian statistics at the undergraduate level—this was before Stan, and it seemed to me it would take the students all semester just to get up to speed on the math, with no time to do anything interesting—so I decided to teach decision analysis instead, using the same course number. One particular year I remember—I think it was 1994—we had a really fun group of undergrad stat majors, and a whole bunch of them were in the course. A truly charming bunch of students.

Anyway, when designing the course I read through a bunch of textbooks on decision analysis, and the nonlinear utility function for money always came up as the first step beyond “expected monetary value.” After that came utility of multidimensional assets (the famous example of the value of a washer and a dryer, compared to two washers or two dryers), but the nonlinear utility for money, used sometimes to *define* risk aversion, came first.

But the authors of many of these books were also aware of the Kahneman, Slovic, and Tversky revolution. There was a ferment, but it still seemed like utility theory was tweakable and that the “heuristics and biases” research merely reflected a difficulty in *measuring* the relevant subjective probabilities and utilities. It was only a few years later that a book came out with the beautifully on-target title, “The Construction of Preference.”

Anyway, here’s the point. Maybe the problem with utility theory in this context was obvious to Hansson, and to me, and to Yitzhak, because we’d been primed by reading the work by Kahneman, Slovic, Tversky, and others exploring the failures of the utility model in practice. In retrospect, that work too should not have been a surprise—after all, utility theory was at that time already a half-century old and it had been developed in the behavioristic tradition of psychology, predating the cognitive revolution of the 1950s.

I can’t really say, but it does seem that sometimes the time is ripe for an idea, and maybe this particular idea only seemed so trivial to me because it was already accepted that utility theory had problems modeling preferences. Once you accept the *empirical* problem, it’s not so hard to imagine there’s a theoretical problem too.

And, make no mistake about it, the problem is both empirical and theoretical. You don’t need any experimental data at all to see the problem here.

Also, let me emphasize that the solution to the problem is *not* to say that people’s preferences are correct and so the utility model is wrong. Rather, in this example *I find utility theory to be useful* in demonstrating why the sort of everyday risk aversion exhibited by typical students (and survey respondents) does not make financial sense. Utility theory is an excellent normative model here.

Which is why it seems particularly silly to be defining these preferences in terms of a nonlinear utility curve that could never be.

It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet.

The post “It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet” appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Suspiciously vague graph purporting to show “percentage of slaves or serfs in the world” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Phillip Middleton,

Is technology making you work harder? Or giving you more time off?

Seriously, it feels like it’s enabling me to work around the clock! Heck, I’m writing this email at 37,000 feet on a Virgin America flight from DC to LA at 11 p.m. ET.

So that being said, I want to share the actual DATA with you about Work vs. Leisure. . . .

It’s easy to forget that for centuries — for millennia — the “workforce” was ALL of us.

A few people lived in luxury, but the vast majority were slaves and serfs who did the work. In 1750, 75 percent of people on the planet worked to support the top 25 percent.

Let’s look at the numbers. It’s extraordinary how this has changed over time.

You’ll notice that by 2000, the global percentage of slaves and serfs in the world is down to 10 percent. As artificial intelligence and robotics come online, this number is going to drop down to zero.

Hey, if only artificial intelligence and robotics had existed in 1863, then Lincoln could’ve freed the—whaaaaa? What’s with that graph, anyway? Let’s look at the data, indeed. That curve looks suspiciously smooth!

Where did “the numbers” come from? The source says “Simon, pp. 171-177” but that’s not quite enough information. Luckily, we make rapid progress via Google. A search on “percentage of slaves or serfs in the world” takes us to this 2001 book by Stephen Moore and Julian Simon and the following quote:

A larger percentage of the world’s inhabitants are freer than ever before in history. Economic historian Stanley Engerman has noted that as recently as the late 18th century, “The bulk of mankind, over 95 percent, were miserable slaves or [sic] despotic tyrants.” . . . The figure shows the decline of slavery from 1750 through the end of the 20th century.

This one’s kinda weird because they put 1917 exactly halfway between 1750 and 2000, which isn’t quite right. It’s almost like they just drew a curve freehand through some made-up numbers! Also a bit odd is that Moore and Simon’s curve is not consistent with their own citation: in their text, they say the proportion of slaves in the late 18th century was 95%, but in the graph it’s around 70%.

The next step, I suppose, is to track down “Simon, pp. 171-77; and authors’ calculations.” But I’m getting tired. Maybe someone else could follow this up for me?

In summary, the graph looks bogus to me. Some of these tech zillionaires seem to have no B.S. filter at all! Perhaps to be successful in that area it helps to be a bit credulous?

**P.S.** From comments below it seems clear that this graph has been created from a few nonexistent data points. It’s pretty horrible that Diamandis labeled this as “actual DATA.” I guess that’s just further confirmation that when people shout in ALL CAPS, they don’t know what they’re talking about!

The post My talk at the Simons Foundation this Wed 5pm appeared first on Statistical Modeling, Causal Inference, and Social Science.

To learn about the human world, we should accept uncertainty and embrace variation. We illustrate this concept with various examples from our recent research (the above examples are with Yair Ghitza and Aki Vehtari) and discuss more generally how statistical methods can help or hinder the scientific process.

The post My talk with David Schiminovich this Wed noon: “The Birth of the Universe and the Fate of the Earth: One Trillion UV Photons Meet Stan” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Wed 10 Sept 2014, 12-1pm in the Statistics Department large seminar room (Social Work Bldg room 903, Columbia University).

The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Tues:** Suspiciously vague graph purporting to show “percentage of slaves or serfs in the world”

**Wed:** “It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet”

**Thurs:** One-tailed or two-tailed

**Fri:** What is the purpose of a poem?

**Sat:** He just ordered a translation from Diederik Stapel

**Sun:** Six quotes from Kaiser Fung

The post Likelihood from quantiles? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Many observers, esp. engineers, have a tendency to record their observations as {quantile, CDF} pairs, e.g.,

x     CDF(x)
3.2   0.26
4.7   0.39
etc.

I suspect that their intent is to do some kind of “least-squares” analysis by computing theoretical CDFs from a model, e.g. Gamma(a, b), then regressing the observed CDFs against the theoretical quantiles, iterating the model parameters to minimize something, perhaps the K-S statistic.
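
To make that setup concrete, here is a minimal sketch of the kind of least-squares fit the correspondent describes, in Python with only the standard library. It substitutes an exponential distribution for the Gamma(a, b) so the CDF has a simple closed form, and a crude grid search for the iterative minimization; both substitutions are just for illustration:

```python
import math

# The quoted {quantile, CDF} pairs from the email.
pairs = [(3.2, 0.26), (4.7, 0.39)]

# Closed-form CDF of an exponential(rate) distribution (a stand-in for
# the Gamma(a, b) model mentioned in the letter).
def cdf(x, rate):
    return 1.0 - math.exp(-rate * x)

# Sum of squared differences between observed and theoretical CDF values.
def sse(rate):
    return sum((F_obs - cdf(x, rate)) ** 2 for x, F_obs in pairs)

# Crude grid search over the rate parameter, standing in for whatever
# iterative optimizer one would actually use.
best_rate = min((r / 1000 for r in range(1, 2000)), key=sse)
print(best_rate, sse(best_rate))
```

This treats the reported CDF values as data to be matched, which is exactly the move my reply below questions: it sidesteps the issue of what the measurements actually are and how they vary.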

I was wondering whether standard MCMC methods would be invalidated if the likelihood factor were constructed using CDFs instead of PDFs (or probability mass). That is, the likelihood would be the product of F(x) values instead of the derivative, f(x). My intuition tells me that it shouldn’t matter since the result is still a product of probabilities but the apparent lack of literature examples gives me pause.

My reply: I don’t know enough about this sort of problem to give you a real answer, but in general the likelihood is the probability distribution of the data (given parameters), hence in setting up the likelihood you want to get a sense of what the measurements actually are. Is that “3.2” measured with error, or are you concerned with variation across different machines or whatever? Once you know this, maybe you can model the measurements directly.

The post Some time in the past 200 years the neighborhood has changed appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post How does inference for next year’s data differ from inference for unobserved data from the current year? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I recently came across your blog post from 2009 about how statistical analysis differs when analyzing an entire population rather than a sample.

I understand the part about conceptualizing the problem as involving a stochastic data generating process, however, I have a query about the paragraph on ‘making predictions about future cases, in which case the relevant uncertainty comes from the year-to-year variation’.

Wouldn’t the random-data-generating-process conceptualization cover the situation where you’re interested in making predictions about future cases? I just wanted to check that I’m not missing the importance of the year-to-year variation—this, presumably, wouldn’t be the random variation that’s necessary for inferential statistics to apply, as the year-to-year variation might be systematic rather than random?

My reply:

See for example the Gelman and King JASA paper from 1990. The point is that variation among units within a given year is not the same as variation within a unit from year to year.

We used a multilevel model.

But the real point here is that we were able to transform a somewhat philosophical question (What is the meaning of statistical inference if the entire population is observed?) into a technical question regarding variance within and between years. A lot of progress in statistical methods goes this way, that topics that formerly were consigned to philosophy can get subsumed into quantitative modeling.

The post Confirmationist and falsificationist paradigms of science appeared first on Statistical Modeling, Causal Inference, and Social Science.

The general issue is how we think about research hypotheses and statistical evidence. Following Popper etc., I see two basic paradigms:

**Confirmationist:** You gather data and look for evidence in support of your research hypothesis. This could be done in various ways, but one standard approach is via statistical significance testing: the goal is to reject a null hypothesis, and then this rejection will supply evidence in favor of your preferred research hypothesis.

**Falsificationist:** You use your research hypothesis to make specific (probabilistic) predictions and then gather data and perform analyses with the goal of rejecting your hypothesis.

In confirmationist reasoning, a researcher starts with hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A.

In falsificationist reasoning, it is the researcher’s actual hypothesis A that is put to the test.

How do these two forms of reasoning differ? In confirmationist reasoning, the research hypothesis of interest does *not* need to be stated with any precision. It is the null hypothesis that needs to be specified, because that is what is being rejected. In falsificationist reasoning, there is no null hypothesis, but the research hypothesis must be precise.

**In our research we bounce**

It is tempting to frame falsificationists as the Popperian good guys who are willing to test their own models and confirmationists as the bad guys (or, at best, as the naifs) who try to do research in an indirect way by shooting down straw-man null hypotheses.

And indeed I do see the confirmationist approach as having serious problems, most notably in the leap from “B is rejected” to “A is supported,” and also in various practical ways because the evidence against B isn’t always as clear as outside observers might think.

But it’s probably most accurate to say that each of us is sometimes a confirmationist and sometimes a falsificationist. In our research we bounce between confirmation and falsification.

Suppose you start with a vague research hypothesis (for example, that being exposed to TV political debates makes people more concerned about political polarization). This hypothesis can’t yet be falsified as it does not make precise predictions. But it seems natural to seek to confirm the hypothesis by gathering data to rule out various alternatives. At some point, though, if we really start to like this hypothesis, it makes sense to fill it out a bit, enough so that it can be tested.

In other settings it can make sense to check a model right away. In psychometrics, for example, or in various analyses of survey data, we start right away with regression-type models that make very specific predictions. If you start with a full probability model of your data and underlying phenomenon, it makes sense to try right away to falsify (and thus, improve) it.

**Dominance of the falsificationist rhetoric**

That said, Popper’s ideas are pretty dominant in how we think about scientific (and statistical) evidence. And it’s my impression that null hypothesis significance testing is generally understood as being part of a Popperian, falsificationist approach to science.

So I think it’s worth emphasizing that, when a researcher is testing a null hypothesis that he or she does not believe, in order to supply evidence in favor of a preferred hypothesis, this is confirmationist reasoning. It may well be good science (depending on the context) but it’s *not* falsificationist.

**The “I’ve got statistical significance and I’m outta here” attitude**

This discussion arose when Mayo wrote of a controversial recent study, “By the way, since Schnall’s research was testing ‘embodied cognition’ why wouldn’t they have subjects involved in actual cleansing activities rather than have them unscramble words about cleanliness?”

This comment was interesting to me because it points to a big problem with a lot of social and behavioral science research, which is a vagueness of research hypotheses and an attitude that anything that rejects the null hypothesis is evidence in favor of the researcher’s preferred theory.

Just to clarify, I’m not saying that this is a particular problem with classical statistical methods; the same problem would occur if, for example, researchers were to declare victory when a 95% posterior interval excludes zero. The problem that I see here, and that I’ve seen in other cases too, is that there is little or no concern with issues of measurement. Scientific measurement can be analogized to links on a chain, and each link—each place where there is a gap between the object of study and what is actually being measured—is cause for concern.

All of this is a line of reasoning that is crucial to science but is often ignored (in my own field of political science as well, where we often just accept survey responses as data without thinking about what they correspond to in the real world). One area where measurement is taken very seriously is psychometrics, but it seems that the social psychologists don’t think so much about reliability and validity. One reason, perhaps, is that psychometrics is about quantitative measurement, whereas questions in social psychology are often framed in a binary way (Is the effect there or not?). And once you frame your question in a binary way, there’s a temptation for a researcher, once he or she has found a statistically significant comparison, to just declare victory and go home.

The *measurements* in social psychology are often quantitative; what I’m talking about here is that the *research hypotheses* are framed in a binary way (really, a unary way in that the researchers just about always seem to think their hypotheses are actually true). This motivates the “I’ve got statistical significance and I’m outta here” attitude. And, if you’ve got statistical significance already and that’s your goal, then who cares about reliability and validity, right? At least, that’s the attitude, that once you have significance (and publication), it doesn’t really matter exactly what you’re measuring, because you’ve proved your theory.

I am not intending to be cynical or to imply that I think these researchers are trying to do bad science. I just think that the combination of binary or unary hypotheses along with a data-based decision rule leads to serious problems.

The issue is that research projects are framed as quests for confirmation of a theory. And once confirmation (in whatever form) is achieved, there is a tendency to declare victory and not think too hard about issues of reliability and validity of measurements.

To this, Mayo wrote:

I agreed that “the measurements used in the paper in question were not” obviously adequately probing the substantive hypothesis. I don’t know that the projects are framed as quests “for confirmation of a theory”, rather than quests for evidence of a statistical effect (in the midst of the statistical falsification arg at the bottom of this comment). Getting evidence of a genuine, repeatable effect is at most a necessary but not a sufficient condition for evidence of a substantive theory that might be thought to (statistically) entail the effect (e.g., a cleanliness prime causes less judgmental assessments of immoral behavior—or something like that). I’m not sure that they think about general theories—maybe “embodied cognition” could count as general theory here. Of course the distinction between statistical and substantive inference is well known. I noted, too, that the so-called NHST is purported to allow such fallacious moves from statistical to substantive and, as such, is a fallacious animal not permissible by Fisherian or NP tests.

I agree that issues about the validity and relevance of measurements are given short shrift and that the emphasis—even in the critical replication program—is on (what I called) the “pure” statistical question (of getting the statistical effect).

I’m not sure I’m getting to your concern Andrew, but I think that they see themselves as following a falsificationist pattern of reasoning (rather than a confirmationist one). They assume it goes something like this:

If the theory T (clean prime causes less judgmental toward immoral actions) were false, then they wouldn’t get statistically significant results in these experiments, so getting stat sig results is evidence for T.

This is fallacious when the conditional fails.

And I replied that I think these researchers are following a confirmationist rather than falsificationist approach. Why do I say this? Because when they set up a nice juicy hypothesis and other people fail to replicate it, they don’t say: “Hey, we’ve been falsified! Cool!” Instead they give reasons why they haven’t been falsified. Meanwhile, when they falsify things themselves, they falsify the so-called straw-man null hypotheses that they don’t believe.

The pattern is as follows: Researcher has hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A. I don’t see this as falsificationist reasoning, because the researchers’ actual hypothesis (that is, hypothesis A) is never put to the test. It is only B that is put to the test. To me, testing B in order to provide evidence in favor of A is confirmationist reasoning.

Again, I don’t see this as having anything to do with Bayes vs non-Bayes, and all the same behavior could happen if every p-value were replaced by a confidence interval.

I understand falsificationism to be that you take the hypothesis you love, try to understand its implications as deeply as possible, and use these implications to test your model, to make falsifiable predictions. The key is that you’re setting up your own favorite model to be falsified.

In contrast, the standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But, rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify and thus represent as evidence in favor of A.

As I said above, this has little to do with p-values or Bayes; rather, it’s about the attitude of trying to falsify the null hypothesis B rather than trying to falsify the researcher’s hypothesis A.

Take Daryl Bem, for example. His hypothesis A is that ESP exists. But does he try to make falsifiable predictions, predictions for which, if they happen, his hypothesis A is falsified? No, he gathers data in order to falsify hypothesis B, which is someone else’s hypothesis. To me, a research program is confirmationist, not falsificationist, if the researchers are never trying to set up their own hypotheses for falsification.

That might be ok—maybe a confirmationist approach is fine, and I’m sure that lots of important things have been learned in this way. But I think we should label it for what it is.

**Summary for the tl;dr crowd**

In our paper, Shalizi and I argued that Bayesian inference does not have to be performed in an inductivist mode, despite a widely-held belief to the contrary. Here I’m arguing that classical significance testing is not necessarily falsificationist, despite a widely-held belief to the contrary.

The post Why isn’t replication required before publication in top journals? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I don’t recall seeing, on your blog or elsewhere, this question raised directly. Of course there is much talk about the importance of replication, mostly by statisticians, and economists are grudgingly following suit with top journals requiring datasets and code.

But why not make it a simple requirement? No replication, no publication.

I suppose that it would be too time-consuming (many reviewers shirk even that basic duty) and that there is a risk of theft of intellectual property.

My reply: In this context, “replication” can mean two things. The first meaning is that the authors supply enough information that the exact analysis can be replicated (this information would include raw data (suitably anonymized if necessary), survey forms, data collection protocols, computer programs and scripts, etc.). Some journals already do require this; for example, we had to do it for our paper in the Quarterly Journal of Political Science. The second meaning of “replication” is that the authors would actually have to replicate their study, ideally with a preregistered design, as in the “50 shades of gray” paper. This second sort of replication is great when it can be done, but it’s not in general so easy in fields such as political science or economics where we work with historical data.

The post I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence appeared first on Statistical Modeling, Causal Inference, and Social Science.

**The quotes**

Here’s one: “You have no choice but to accept that the major conclusions of these studies are true.”

Ahhhh, but we do have a choice!

First, the background. We have two quotes from this paper by E. J. Wagenmakers, Ruud Wetzels, Denny Borsboom, Rogier Kievit, and Han van der Maas.

Here’s Alan Turing in 1950:

I assume that the reader is familiar with the idea of extra-sensory perception, and the meaning of the four items of it, viz. telepathy, clairvoyance, precognition and psycho-kinesis. These disturbing phenomena seem to deny all our usual scientific ideas. How we should like to discredit them! Unfortunately the statistical evidence, at least for telepathy, is overwhelming.

Wow! Overwhelming evidence isn’t what it used to be.

In all seriousness, it’s interesting that Turing, who was in some ways an expert on statistical evidence, was fooled in this way. After all, even those psychologists who currently believe in ESP would not, I think, hold that the evidence for telepathy *as of 1950* was overwhelming. I say this because it does not seem so easy for researchers to demonstrate ESP using the protocols of the 1940s; instead there is a continuing effort to come up with new designs.

How could Turing have thought this? I don’t know much about Turing but it does seem, when reading old-time literature, that belief in the supernatural was pretty common back then, lots of mention of ghosts etc. And at an intuitive level there does seem, at least to me, an intuitive appeal to the idea that if we just concentrate hard enough, we can read minds, move objects, etc. Also, remember that, as of 1950, the discovery and popularization of quantum mechanics was not so far in the past. Given all the counterintuitive features of quantum physics and radioactivity, it does not seem at all unreasonable that there could be some new phenomena out there to be discovered. Things feel a bit different in 2014 after several decades of merely incremental improvements in physics.

To move things forward a few decades, Wagenmakers et al. mention “the phenomenon of social priming, where a subtle cognitive or emotional manipulation influences overt behavior. The prototypical example is the elderly walking study from Bargh, Chen, and Burrows (1996); in the priming phase of this study, students were either confronted with neutral words or with words that are related to the concept of the elderly (e.g., ‘Florida’, ‘bingo’). The results showed that the students’ walking speed was slower after having been primed with the elderly-related words.”

They then pop out this 2011 quote from Daniel Kahneman:

When I describe priming studies to audiences, the reaction is often disbelief . . . The idea you should focus on, however, is that disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.

And that brings us to the beginning of this post, and my response: No, you *don’t* have to accept that the major conclusions of these studies are true. Wagenmakers et al. note, “At the 2014 APS annual meeting in San Francisco, however, Hal Pashler presented a long series of failed replications of social priming studies, conducted together with Christine Harris, the upshot of which was that disbelief does in fact remain an option.”

**Where did Turing and Kahneman go wrong?**

Overstating the strength of empirical evidence. How does that happen? As Eric Loken and I discuss in our Garden of Forking Paths article (echoing earlier work by Simmons, Nelson, and Simonsohn), statistically significant comparisons are not hard to come by, even by researchers who are not actively fishing through the data.

The other issue is that when any real effects are almost certainly tiny (as in ESP, or social priming, or various other bank-shot behavioral effects such as ovulation and voting), statistically significant patterns can be systematically misleading (as John Carlin and I discuss here).
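
A quick simulation shows the mechanism (a sketch with made-up numbers: a true effect of 0.2 measured with standard error 1). Conditional on reaching statistical significance, an estimate must be large, so the significant estimates systematically exaggerate a tiny true effect, and a nontrivial share even have the wrong sign:

```python
import random

random.seed(1)
true_effect = 0.2   # a tiny true effect (illustrative)
se = 1.0            # a noisy measurement (illustrative)

# Simulate many studies; keep only the "statistically significant" ones.
estimates = [random.gauss(true_effect, se) for _ in range(100_000)]
significant = [est for est in estimates if abs(est) > 1.96 * se]

# Type M (magnitude) error: how much significant estimates exaggerate.
exaggeration = sum(abs(est) for est in significant) / len(significant) / true_effect
# Type S (sign) error: share of significant estimates with the wrong sign.
wrong_sign = sum(est < 0 for est in significant) / len(significant)

print(f"share significant: {len(significant) / len(estimates):.3f}")
print(f"average exaggeration factor among significant results: {exaggeration:.1f}")
print(f"share of significant results with the wrong sign: {wrong_sign:.2f}")
```

With numbers like these, a researcher who filters on significance is guaranteed to report an effect many times larger than the truth, whatever the truth happens to be.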

Still and all, it’s striking to see brilliant people such as Turing and Kahneman making this mistake. Especially Kahneman, given that he and Tversky wrote the following in a famous paper:

People have erroneous intuitions about the laws of chance. In particular, they regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. The prevalence of the belief and its unfortunate consequences for psychological research are illustrated by the responses of professional psychologists to a questionnaire concerning research decisions.

Indeed.

**Having an open mind**

It’s good to have an open mind. Psychology journals publish articles on ESP and social priming, even though these may seem implausible, because implausible things sometimes are true.

It’s good to have an open mind. When a striking result appears in the dataset, it’s possible that this result does *not* represent an enduring truth or even a pattern in the general population but rather is just an artifact of a particular small and noisy dataset.

One frustration I’ve had in recent discussions regarding controversial research is the seeming unwillingness of researchers to entertain the possibility that their published findings are just noise. Maybe not, maybe these are real effects being discovered, but you should at least *consider* the possibility that you’re chasing noise. Despite what Turing and Kahneman say, you can keep an open mind.

**P.S.** Some commenters thought that I was disparaging Alan Turing and Daniel Kahneman. I wasn’t. Turing and Kahneman both made big contributions to science, almost certainly much bigger than anything I will ever do. And I’m not criticizing them for believing in ESP and social priming. What I am criticizing them for is their insistence that the evidence is “overwhelming” and that the rest of us “have no choice” but to accept these hypotheses. Both Turing and Kahneman, great as they are, overstated the strength of the statistical evidence.

And that’s interesting. When stupid people make a mistake, that’s no big deal. But when brilliant people make a mistake, it’s worth noting.

The post I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Questions about “Too Good to Be True” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I manage a team tasked with, among other things, analyzing data on Air Traffic operations to identify factors that may be associated with elevated risk. I think it’s fair to characterize our work as “data mining” (e.g., using rule induction, Bayesian, and statistical methods).

One of my colleagues sent me a link to your recent article “Too Good to Be True” (Slate, July 24). Obviously, as my friend has pointed out, your article raises questions about the validity of what I’m doing.

A few thoughts/questions:

(1) I agree with your overall point, but I’m having trouble understanding the specific complaint with the “red/pink” study. In their case, if I’m understanding the authors’ rebuttal, they were not asking “what color is associated with fertility” and then mining the data to find a color…any color…which seemed to have a statistical association. They started by asking “is red/pink associated with fertility”, no? In which case, I think the point they’re making seems fair?

(2) But, your argument definitely applies to the kind of work I’m doing. In my case, I’m asking an open-ended question: “Are there any relationships?” Well, of course, you would say, the odds are that you must find relationships…even if they are not really there.

(3) So let’s take a couple of examples. There are thousands of economists building models to explain some economic phenomenon. All of these models are based on the same underlying data: the U.S. Income and Product Accounts. There are then tens of thousands of models built—only a handful are publication-worthy. So, by the same logic, with that many people studying the same sample, it would be statistically true that many of the published papers in even the best economics journals are false?

(4) Another example: one of the things that we have uncovered is that, in the case of Runway Incursions, errors committed by air traffic controllers are many times more likely to result in a collision than errors committed by a pilot. The p-value here is pretty low—although the confidence interval is large because, thankfully, we don’t have a lot of collisions. What is your reaction to this finding?

(5) A caveat: In my case, we use the statistically significant findings to point us in directions that deserve more study. Basically as a form of triage (because we don’t have the resources to address every conceivable hazard in the airspace system). Perhaps fortunately, most of the people I deal with (primarily pilots and air traffic controllers) don’t understand statistics. So, the safety case we build must be based on more than just a mechanical analysis of the data.

My reply:

(1) Whether or not the authors of the study were “mining the data,” I think their analysis was contingent on the data. They had many data-analytic choices, including rules for which cases to include or exclude and which comparisons to make, as well as what colors to study. Their protocol and analysis were not pre-registered. The point is that, even though they did an analysis that was consistent with their general research hypothesis, there are many degrees of freedom in the specifics, and these specifics can well be chosen in light of the data.

This topic is really worth an article of its own . . . and, indeed, Eric Loken and I have written that article! So, instead of replying in detail in this post, I’ll point you toward The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.

(2) You write, “the odds are that you must find relationships . . . even if they are not really there.” I think the relationships are there but that they are typically small, and they exist in the context of high levels of variation. So the issue isn’t so much that you’re finding things that aren’t there, but rather that, if you’re not careful, you’ll think you’re finding large and consistent effects, when what’s really there are small effects of varying direction.

(3) You ask, “by the same logic, with that many people studying the same sample, it would be statistically true that many of the published papers in even the best economics journals are false?” My response: No, I don’t think that framing statistical statements as “true” or “false” is the most helpful way to look at things. I think it’s fine for lots of people to analyze the same dataset. And, for that matter, I think it’s fine for people to use various different statistical methods. But methods have assumptions attached to them. If you’re using a Bayesian approach, it’s only fair to criticize your methods if the probability distributions don’t seem to make sense. And if you’re using p-values, then you need to consider the reference distribution over which the long-run averaging is taking place.

(4) You write: “in the case of Runway Incursions, errors committed by air traffic controllers are many times more likely to result in a collision than errors committed by a pilot. The p-value here is pretty low—although the confidence interval is large because, thankfully, we don’t have a lot of collisions. What is your reaction to this finding?” My response is, first, I’d like to see all the comparisons that you might be making with these data. If you found one interesting pattern, there might well be others, and I wouldn’t want you to limit your conclusions to just whatever happened to be statistically significant. Second, your finding seems plausible to me but I’d guess that the long-run difference will probably be lower than what you found in your initial estimate, as there is typically a selection process by which larger differences are more likely to be noticed.

(5) Your triage makes some sense. Also let me emphasize that it’s not generally appropriate to wait on statistical significance before making decisions.

The post Questions about “Too Good to Be True” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Bad Statistics: Ignore or Call Out? appeared first on Statistical Modeling, Causal Inference, and Social Science.


]]>**Tues:** Questions about “Too Good to Be True”

**Wed:** I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence

**Thurs:** Why isn’t replication required before publication in top journals?

**Fri:** Confirmationist and falsificationist paradigms of science

**Sat:** How does inference for next year’s data differ from inference for unobserved data from the current year?

**Sun:** Likelihood from quantiles?

We’ve got a full week of statistics for you. Welcome back to work, everyone!

The post On deck this month appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Questions about “Too Good to Be True”

I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence

Why isn’t replication required before publication in top journals?

Confirmationist and falsificationist paradigms of science

How does inference for next year’s data differ from inference for unobserved data from the current year?

Likelihood from quantiles?

My talk with David Schiminovich this Wed noon: “The Birth of the Universe and the Fate of the Earth: One Trillion UV Photons Meet Stan”

Suspicious graph purporting to show “percentage of slaves or serfs in the world”

“It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet”

One-tailed or two-tailed

What is the purpose of a poem?

He just ordered a translation from Diederik Stapel

Six quotes from Kaiser Fung

More bad news for the buggy-whip manufacturers

They know my email but they don’t know me

What do you do to visualize uncertainty?

Sokal: “science is not merely a bag of clever tricks . . . Rather, the natural sciences are nothing more or less than one particular application — albeit an unusually successful one — of a more general rationalist worldview”

Question about data mining bias in finance

Estimating discontinuity in slope of a response function

I can’t think of a good title for this one.

Study published in 2011, followed by successful replication in 2003 [sic]

My talk at the University of Michigan on Fri 25 Sept

I’m sure that my anti-Polya attitude is completely unfair

Waic for time series

MA206 Program Director’s Memorandum

“An exact fishy test”

People used to send me ugly graphs, now I get these things

If you do an experiment with 700,000 participants, you’ll (a) have no problem with statistical significance, (b) get to call it “massive-scale,” (c) get a chance to publish it in a ~~tabloid~~ top journal. Cool!

Carrie McLaren was way out in front of the anti-Gladwell bandwagon

The post On deck this month appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Avoiding model selection in Bayesian social research appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Don Rubin and I argue with Adrian Raftery. Here’s how we begin:

Raftery’s paper addresses two important problems in the statistical analysis of social science data: (1) choosing an appropriate model when so much data are available that standard P-values reject all parsimonious models; and (2) making estimates and predictions when there are not enough data available to fit the desired model using standard techniques.

For both problems, we agree with Raftery that classical frequentist methods fail and that Raftery’s suggested methods based on BIC can point in better directions. Nevertheless, we disagree with his solutions because, in principle, they are still directed off-target and only by serendipity manage to hit the target in special circumstances. Our primary criticisms of Raftery’s proposals are that (1) he promises the impossible: the selection of a model that is adequate for specific purposes without consideration of those purposes; and (2) he uses the same limited tool for model averaging as for model selection, thereby depriving himself of the benefits of the broad range of available Bayesian procedures.

Despite our criticisms, we applaud Raftery’s desire to improve practice by providing methods and computer programs for all to use and applying these methods to real problems. We believe that his paper makes a positive contribution to social science, by focusing on hard problems where standard methods can fail and exposing failures of standard methods.

We follow up with sections on:

- “Too much data, model selection, and the example of the 3x3x16 contingency table with 113,556 data points”

- “How can BIC select a model that does not fit the data over one that does”

- “Not enough data, model averaging, and the example of regression with 15 explanatory variables and 47 data points.”

And here’s something we found on the web with Raftery’s original article, our discussion and other discussions, and Raftery’s reply. Enjoy.

The post Avoiding model selection in Bayesian social research appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post When we talk about the “file drawer,” let’s not assume that an experiment can easily be characterized as producing strong, mixed, or weak results appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I thought you might be interested in our paper [the paper is by Annie Franco, Neil Malhotra, and Gabor Simonovits, and the link is to a news article by Jeffrey Mervis], forthcoming in Science, about publication bias in the social sciences given your interest and work on research transparency.

Basic summary: We examined studies conducted as part of the Time-sharing Experiments for the Social Sciences (TESS) program, where: (1) we have a known population of conducted studies (some published, some unpublished); and (2) all studies exceed a quality threshold as they go through peer review. We found that having null results made experiments 40 percentage points less likely to be published and 60 percentage points less likely to even be written up.

My reply:

Here’s a funny bit from the news article: “Stanford political economist Neil Malhotra and two of his graduate students . . .” You know you’ve hit the big time when you’re the only author who gets mentioned in the news story!

More seriously, this is great stuff. I would only suggest that, along with the file drawer, you remember the garden of forking paths. In particular, I’m not so sure about the framing in which an experiment can be characterized as producing “strong results,” “mixed results,” or “null results.” Whether a result is strong or not would seem to depend on how the data are analyzed, and the point of the forking paths is that with a given dataset it is possible for noise to appear as strong. I gather from the news article that TESS is different in that any given study is focused on a specific hypothesis, but even so I would think there is a bit of flexibility in how the data are analyzed and a fair number of potentially forking paths. For example, the news article mentions “whether voters tend to favor legislators who boast of bringing federal dollars to their districts over those who tout a focus on policy matters.” But of course this could be studied in many different ways.

In short, I think this is important work you have done, and I just think that we should go beyond the “file drawer” because I fear that this phrase lends too much credence to the idea that a reported p-value is a legitimate summary of a study.

P.S. There’s also a statistical issue that every study is counted only once, as either a 1 (published) or 0 (unpublished). If Bruno Frey ever gets involved, you’d have to have a system where any result gets a number from 0 to 5, representing the number of different times it’s published.

The post When we talk about the “file drawer,” let’s not assume that an experiment can easily be characterized as producing strong, mixed, or weak results appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Pre-election survey methodology: details from nine polling organizations, 1988 and 1992 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>By the way, the paper has a (small) error. The two outlying “h” points in Figure 1b are a mistake. I can’t remember what we did wrong, but I do remember catching the mistake; I think it was before publication but too late for the journal to fix the error. The actual weighted results for the Harris polls are *not* noticeably different from those of the other surveys at those dates.

Polling has changed in the past twenty years, but I think this paper is still valuable, partly in giving a sense of the many different ways that polling organizations can attempt to get a representative sample, and partly as a convenient way to shoot down the conventional textbook idea of survey weights as inverse selection probabilities. (Remember, survey weighting is a mess.)

The post Pre-election survey methodology: details from nine polling organizations, 1988 and 1992 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post One of the worst infographics ever, but people don’t care? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Perhaps prompted by the ALS Ice Bucket Challenge, this infographic has been making the rounds:

I think this is one of the worst I have ever seen. I don’t know where it came from, so I can’t give credit/blame where it’s due.

Let’s put aside the numbers themselves – I haven’t checked them, for one thing, and I’d also say that for this comparison one would be most interested in (government money plus donations) rather than just donations — and just look at this as an information display. What are some things I don’t like about it? Jeez, I hardly know where to begin.

1. It takes a lot of work to figure it out. (a) You have to realize that each color is associated with a different cause — my initial thought was that the top circles represent deaths and dollars for the first cause, the second circles are for the second cause, etc. (b) Even once you’ve realized what is being displayed, and how, you pretty much have to go disease by disease to see what is going on; there’s no way to grok the whole pattern at once. (c) Other than pink for breast cancer and maybe red for AIDS none of the color mappings are standardized in any sense, so you have to keep referring back to the legend at the top. (d) It’s not obvious (and I still don’t know) if the amount of “money raised” for a given cause refers only to the specific fundraising vehicle mentioned in the legend for each disease. It’s hard to believe they would do it that way, but maybe they do.

2. Good luck if you’re colorblind.

3. Maybe I buried the lede by putting this last: did you catch the fact that the area of the circle isn’t the relevant parameter? Take a look at the top two circles on the left. The upper one should be less than twice the size of the second one. It looks like they made the *diameter* of the circle proportional to the quantity, rather than the area; a classic way to mislead with a graphic.
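The distortion is easy to quantify (a toy calculation in Python, with hypothetical values): a circle's area grows with the square of its diameter, so encoding the quantity in the diameter makes every ratio look like its square.

```python
value_big, value_small = 2.0, 1.0   # hypothetical: one quantity is twice the other

# Correct encoding: area proportional to value, so diameter ~ sqrt(value).
perceived_ratio_area_encoding = value_big / value_small

# What the graphic appears to do: diameter proportional to value,
# so the *areas* compare as the square of the true ratio.
perceived_ratio_diameter_encoding = (value_big / value_small) ** 2

print(perceived_ratio_area_encoding)      # 2.0
print(perceived_ratio_diameter_encoding)  # 4.0: a 2:1 difference reads as 4:1
```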

At a bare minimum, this graphic could be improved by (a) fixing the terrible mistake with the sizes of the circles, (b) putting both columns in the same order (that is, first row is one disease, second row is another, etc.), (c) taking advantage of the new ordering to label each row so you don’t need the legend. This would also make it much easier to see the point the display is supposed to make.

As a professional data analyst I’d rather just see a scatterplot of money vs deaths, but I know a lot of people don’t understand scatterplots. I can see the value of using circle sizes for a general audience. But I can’t see how anyone could like this graphic. Yet three of my friends (so far) have posted it on Facebook, with nary a criticism of the display.

[Note added the next day:

The graphic is even worse than I thought. As several people have pointed out, my suspicion is true: the numbers do not show the total donations to fight the diseases listed, they show only the donations to a single organization. For instance, according to the legend the pink color represents donations to fight **breast cancer**, but the number is not for breast cancer as a whole, it's only for Komen Race for the Cure.

If they think people are interested in contributions to only a single charity in each category --- which seems strange, but let's assume that's what they want and just look at the display --- then they need a title that is much less ambiguous, and the labels need to emphasize the charity and not the disease.]

This post is by Phil Price.

The post One of the worst infographics ever, but people don’t care? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Discussion of “A probabilistic model for the spatial distribution of party support in multiparty elections” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I like the discussion, and it includes some themes that keep showing up: the idea that modeling is important and you need to understand what your model is doing to the data. It’s not enough to just interpret the fitted parameters as is, you need to get in there, get your hands dirty, and examine all aspects of your fit, not just the parts that relate to your hypotheses of interest.

There is a continuity between the criticisms I addressed of that paper in 1994, and our recent criticisms of some applied models, for example of that regression estimate of the health effects of air pollution in China.

The post Discussion of “A probabilistic model for the spatial distribution of party support in multiparty elections” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Dave Blei course on Foundations of Graphical Models appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Dave Blei writes:

This course is cross listed in Computer Science and Statistics at Columbia University.

It is a PhD level course about applied probabilistic modeling. Loosely, it will be similar to this course.

Students should have some background in probability, college-level mathematics (calculus, linear algebra), and be comfortable with computer programming.

The course is open to PhD students in CS, EE and Statistics. However, it is appropriate for quantitatively-minded PhD students across departments. Please contact me [Blei] if you are a PhD student who is interested, but cannot register.

Research in probabilistic graphical models has forged connections between signal processing, statistics, machine learning, coding theory, computational biology, natural language processing, computer vision, and many other fields. In this course we will study the basics and the state of the art, with an eye on applications. By the end of the course, students will know how to develop their own models, compute with those models on massive data, and interpret and use the results of their computations to solve real-world problems.

Looks good to me!

The post Dave Blei course on Foundations of Graphical Models appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Review of “Forecasting Elections” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Political scientists are aware that most voters are consistent in their preferences, and one can make a good guess just looking at the vote counts in the previous election.

Objective analysis of a few columns of numbers can regularly outperform pundits who use inside knowledge.

The rationale for forecasting electoral vote directly . . . is mistaken.

The book’s weakness is its unquestioning faith in linear regression . . . We should always be suspicious of any grand claims made about a linear regression with five parameters and only 11 data points. . . .

Funny that I didn’t suggest the use of informative prior distributions. Only recently have I been getting around to this point.

And more:

The fact that U.S. elections can be successfully forecast with little effort, months ahead of time, has serious implications for our understanding of politics. In the short term, improved predictions will lead to more sophisticated campaigns, focusing more than ever on winnable races and marginal states.

The post Review of “Forecasting Elections” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Discussion of “Maximum entropy and the nearly black object” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Under the “nearly black” model, the normal prior is terrible, the entropy prior is better and the exponential prior is slightly better still. (An even better prior distribution for the nearly black model would combine the threshold and regularization ideas by mixing a point mass at 0 with a proper distribution on [0, infinity].) Knowledge that an image is nearly black is strong prior information that is not included in the basic maximum entropy estimate.
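As a rough illustration of that mixture idea (a Python sketch; the mixing weight of 0.9 and the exponential rate are made-up values, not anything from the paper), draws from such a spike-and-slab prior combine exact zeros with occasional positive intensities:

```python
import random

random.seed(3)

def nearly_black_draw(p_zero=0.9, rate=1.0):
    """One pixel from a spike-and-slab prior: a point mass at 0 with
    probability p_zero, otherwise an exponential draw on (0, inf)."""
    return 0.0 if random.random() < p_zero else random.expovariate(rate)

pixels = [nearly_black_draw() for _ in range(10_000)]
frac_exactly_zero = sum(p == 0.0 for p in pixels) / len(pixels)
print(round(frac_exactly_zero, 2))  # ~0.9: most pixels exactly zero, i.e. "nearly black"
```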

Overall I liked the Donoho et al. paper but I was a bit disappointed in their response to me. To be fair, the paper had lots of comments and I guess the authors didn’t have much time to read each one, but still I didn’t think they got my main point, which was that the Bayesian approach was a pretty direct way to get most of the way to their findings. To put it another way, that paper had a lot to offer (and of course those authors followed it up with lots of other hugely influential work) but I think there was value right away in thinking about the different estimates in terms of prior distributions, rather than treating the Bayesian approach as a sort of sidebar.

The post Discussion of “Maximum entropy and the nearly black object” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Review of “Forecasting Elections”

**Wed:** Discussion of “A probabilistic model for the spatial distribution of party support in multiparty elections”

**Thurs:** Pre-election survey methodology: details from nine polling organizations, 1988 and 1992

**Fri:** Avoiding model selection in Bayesian social research

**Sat, Sun:** You might not be aware, but the NYC Labor Day parade is not held on Labor Day, as it would interfere with everyone’s holiday plans. Instead it’s held on the following weekend.

The post Poker math showdown! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>In comments, Rick Schoenberg wrote:

One thing I tried to say as politely as I could in [the book, "Probability with Texas Holdem Applications"] on p146 is that there’s a huge error in Chen and Ankenman’s “The Mathematics of Poker” which renders all the calculations and formulas in the whole last chapter wrong or meaningless or both. I’ve never received a single ounce of feedback about this though, probably because only like 2 people have ever read my whole book.

Jerrod Ankenman replied:

I haven’t read your book, but I’d be happy to know what you think is a “huge” error that invalidates “the whole last chapter” that no one has uncovered so far. (Also, the last chapter of our book contains no calculations—perhaps you meant the chapter preceding the error?). If you contacted one of us about it in the past, it’s possible that we overlooked your communication, although I do try to respond to criticism or possible errors when I can. I’m easy to reach; firstname.lastname@yale.edu will work for a couple more months.

Hmmm, what’s on page 146 of Rick’s book? It comes up if you search inside the book on Amazon:

So that’s the disputed point right there. Just go to the example on page 290 where the results are normally distributed with mean and variance 1, check that R(1)=-14%, then run the simulation and check that the probability of the bankroll starting at 1 and reaching 0 or less is approximately 4%.

I went on to Amazon but couldn’t access page 290 of Chen and Ankenman’s book to check this. I did, however, program the simulation in R as I thought Rick was suggesting:

    waiting <- function(mu, sigma, nsims, T) {
      time_to_ruin <- rep(NA, nsims)
      for (i in 1:nsims) {
        virtual_bankroll <- 1 + cumsum(rnorm(T, mu, sigma))
        if (any(virtual_bankroll < 0)) {
          time_to_ruin[i] <- min((1:T)[virtual_bankroll < 0])
        }
      }
      return(time_to_ruin)
    }

    a <- waiting(mu=1, sigma=1, nsims=10000, T=100)
    print(mean(!is.na(a)))
    print(table(a))

Which gave the following result:

    > print(mean(!is.na(a)))
    [1] 0.0409
    > print(table(a))
    a
      1   2   3   4   5   6   8   9
    218 107  53  13   9   7   1   1

These results indicate that (i) the probability is indeed about 4%, and (ii) T=100 is easily enough to get the asymptotic value here.

Actually, the first time I did this I kept getting a probability of ruin of 2% which didn't seem right--I couldn't believe Rick would've got this simple simulation wrong--but then I found the bug in my code: I'd written "cumsum(1+rnorm(T,mu,sigma))" instead of "1+cumsum(rnorm(T,mu,sigma))".
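That bug is easy to reproduce in miniature (a Python sketch with made-up winnings; the code above is R, but the arithmetic is identical): the buggy version re-adds the starting bankroll at every step instead of once.

```python
from itertools import accumulate

bankroll, steps = 10, [5, -2, 3]   # hypothetical starting bankroll and per-step winnings

# correct: bankroll + cumsum(steps), the stake is added once
correct = [bankroll + s for s in accumulate(steps)]

# buggy: cumsum(bankroll + steps), the stake is re-added at every step
buggy = list(accumulate(bankroll + s for s in steps))

print(correct)  # [15, 13, 16]
print(buggy)    # [15, 23, 36]
```

The buggy trajectory drifts upward much faster, which is exactly why it understates the probability of ruin.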

So maybe Chen and Ankenman really did make a mistake. Or maybe Rick is misinterpreting what they wrote. There's also the question of whether Chen and Ankenman's mathematical error (assuming they did make the mistake identified by Rick) actually renders all the calculations and formulas in their whole last chapter, or their second-to-last chapter, wrong or meaningless or both.

P.S. According to the caption at the Youtube site, they're playing rummy, not poker, in the above clip. But you get the idea.

P.P.S. I fixed a typo pointed out by Juho Kokkala in an earlier version of my code.

The post Poker math showdown! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post How Many Mic’s Do We Rip appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Our technical comment on Kinney and Atwal’s paper on MIC and equitability has come out in PNAS along with their response. Similarly to Ben Murrell, who also wrote you a note when he published a technical comment on the same work, we feel that they “somewhat missed the point.” Specifically: one statistic can be more or less equitable than another, and our claim has been that MIC is more equitable than other existing methods in a wide variety of settings. Contrary to what Kinney and Atwal write in their response (“Falsifiability or bust”), this claim is indeed falsifiable — it’s just that they have not falsified it.

2. We’ve just posted a new theoretical paper that defines both equitability and MIC in the language of estimation theory and analyzes them in that paradigm. In brief, the paper contains a proof of a formal relationship between power against independence and equitability that shows that the latter can be seen as a generalization of the former; a closed-form expression for the population value of MIC and an analysis of its properties that lends insight into aspects of the definition of MIC that distinguish it from mutual information; and new estimators for this population MIC that perform better than the original statistic we introduced.

3. In addition to our paper, we’ve also written a short FAQ for those who are interested in a brief summary of where the conversation and the literature on MIC and equitability are at this point, and what is currently known about the properties of these two objects.

PS – at your suggestion, the theory paper now has some pictures!

We’ve posted on this several times before:

16 December 2011: Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets

26 Mar 2012: Further thoughts on nonparametric correlation measures

14 Mar 2014: The maximal information coefficient

1 May 2014: Heller, Heller, and Gorfine on univariate and multivariate information measures

7 May 2014: Once more on nonparametric measures of mutual information

I still haven’t formed a firm opinion on these things. Summarizing pairwise dependence in large datasets is a big elephant, and I guess it makes sense that different researchers who work in different application areas will have different perspectives on the problem.

The post How Many Mic’s Do We Rip appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Recently in the sister blog appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Meritocracy won’t happen: the problem’s with the ‘ocracy’

Does the sex of your child affect your political attitudes?

More hype about political attitudes and neuroscience

Modern polling needs innovation, not traditionalism

Who cares about copycat pollsters?

No, all Americans are not created equal when it comes to belief in conspiracy theories

*Six* does not just mean *a lot*: preschoolers see number words as specific

The post Recently in the sister blog appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Replication Wiki for economics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I have been working on a replication project funded by the Institute for New Economic Thinking during the last two years and read several of your blog posts that touched the topic.

The Replication Wiki can help with research as well as with teaching replication to students. We have taught seminars at several faculties that drew on the information in this database. In the starting phase the focus was on some leading journals in economics, and we now cover more than 1800 empirical studies and 142 replications. Replication results can be published as replication working papers of the University of Göttingen’s Center for Statistics.

Teaching and providing access to information will raise awareness for the need for replications, provide a basis for research about the reasons why replications so often fail and how this can be changed, and educate future generations of economists about how to make research replicable.

The post Replication Wiki for economics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post The field is a fractal appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>There is a real joy in doing mathematics, in learning ways of thinking that explain and organize and simplify. One can feel this joy discovering new mathematics, rediscovering old mathematics, learning a way of thinking from a person or text, or finding a new way to explain or to view an old mathematical structure.

This inner motivation might lead us to think that we do mathematics solely for its own sake. That’s not true: the social setting is extremely important. We are inspired by other people, we seek appreciation by other people, and we like to help other people solve their mathematical problems. What we enjoy changes in response to other people. Social interaction occurs through face-to-face meetings. It also occurs through written and electronic correspondence, preprints, and journal articles. One effect of this highly social system of mathematics is the tendency of mathematicians to follow fads. For the purpose of producing new mathematical theorems this is probably not very efficient: we’d seem to be better off having mathematicians cover the intellectual field much more evenly. But most mathematicians don’t like to be lonely, and they have trouble staying excited about a subject, even if they are personally making progress, unless they have colleagues who share their excitement.

Fun quote but I disagree with the implications of the last bit. The trouble with the quote is the implication that there is a natural measure on “the intellectual field” so that it can be covered “evenly.” But I think the field is more of a fractal with different depths at different places, depending on how closely you look.

If we wanted to model this formally, we might say that researchers decide, based on what other people are doing, which parts of their fields are worth deeper study. It’s not just about being social, the point is that there’s no uniform distribution. To put it another way, following “fads,” in some sense, is *a necessity, not a choice*. This is not to say that whatever is currently being done is best; perhaps there should be more (or less) time-averaging, of the sort that we currently attain, for example, by appointing people to long-term job contracts (hence all the dead research that my colleagues at Berkeley were continuing to push, back in the 90s). I just want to emphasize that *some* measure needs to be constructed, somehow.

The post The field is a fractal appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “A hard case for Mister P” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’m working on a problem that at first seemed like a clear case where multilevel modeling would be useful. As I’ve dug into it I’ve found that it doesn’t quite fit the usual pattern, because it seems to require a very difficult post-stratification.

Here is the problem. You have a question in a survey, and you need to estimate the proportion of positive responses to the question for a large number (100) of different subgroups of the total population at which the survey is aimed. The sample size for some of these subgroups can be rather small. If these were disjoint subgroups then this would be a standard multi-level modeling problem, but they are not disjoint: each subgroup is defined by one or two variables, but there are a total of over 30 variables used to define subgroups.

For example, if x[i], 1 <= i <= 30, are the variables used to define subgroups, subgroup i for i <= 30 might be defined as those individuals for which x[i] > 1, with the other subgroup definitions involving combinations of two or possibly three variables. Examples of these subgroup definitions include patterns such as

- x1 == 1 OR x2 == 1 OR x3 == 1

- (x1 == 1 OR x1 == 2) AND x3 < 4

You could do a multilevel regression with post-stratification, but that post-stratification step looks very difficult. It seems that you would need to model the 30-dimensional joint distribution of the 30 variables describing subgroups.
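To make the setup concrete, here is a small sketch of how overlapping subgroups defined by logical predicates yield raw proportions. All data here are simulated, and the predicates just mirror the examples above; none of this is Van Horn’s actual survey:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy survey: 30 ordinal variables x1..x30 (values 1-4) plus a binary response y.
n = 500
x = {i: rng.integers(1, 5, size=n) for i in range(1, 31)}
y = rng.integers(0, 2, size=n)

# Overlapping subgroups as boolean predicates over the x's.
subgroups = {
    "s1": (x[1] == 1) | (x[2] == 1) | (x[3] == 1),
    "s2": ((x[1] == 1) | (x[1] == 2)) & (x[3] < 4),
}

# Raw proportion of positive responses in each subgroup -- the quantity
# a multilevel model would stabilize for the smaller subgroups.
raw = {name: float(y[mask].mean()) for name, mask in subgroups.items()}
print(raw)
```

The same respondent can fall in both s1 and s2, which is exactly what breaks the usual disjoint-groups multilevel setup.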

Have you encountered this kind of problem before, or know of some relevant papers to read?

I replied:

In your example, I agree that it sounds like it would be difficult to compute things on 2^30 cells or however many groups you have in the population. Maybe some analytic approach would be possible? What are your 30 variables?

And then he responded:

The 30+ variables are a mixture of categorical and ordinal survey responses indicating things like the person’s role in their organization, decision-making influence, familiarity with various products and services, and recognition of various ad campaigns. So you might have subgroups such as “people who recognize any of our ad campaigns,” or “people who recognize ad campaign W,” or “people with purchasing influence for product space X,” or more tightly defined subgroups such as “people with job description Y who are familiar with product Z.”

Here’s some more context. I’m looking for ways of getting better information out of tracking studies. In marketing research a tracking study is a survey that is run repeatedly to track how awareness and opinions change over time, often in the context of one or more advertising campaigns that are running during the study period. These surveys contain audience definition questions, as well as questions about familiarity with products, awareness of particular ads, and attitudes towards various products.

It’s hard to get clients to really understand just how large sampling error can be, so there tends to be a lot of upset and hand wringing when they see an unexplained fluctuation from one month to the next. Thus, there’s significant value in finding ways to (legitimately) stabilize estimates.

Where things get interesting is when the client wants to push the envelope by

a) running surveys more often, but with a smaller sample size, so that the total number surveyed per month remains the same, or

b) tracking results for many different overlapping subgroups.

I’m seeing some good results for handling (a) by treating the responses in each subgroup over time as a time series and applying a simple state-space model with a binomial error model; this is based on the assumption that the quantities being tracked don’t typically change radically from one week to the next. This kind of modeling is less useful in the early stages of the study, however, when you don’t yet have much information on the typical degree of variation from one time period to the next. Multilevel modeling for (b) seems like a good candidate for the next improvement in estimation, and would help even in the early stages of the study, but as I mentioned, the post-stratification looks difficult.
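A minimal sketch of the kind of binomial state-space smoothing described (not Van Horn’s actual model): a grid-based forward filter in which the latent logit of the tracked proportion follows a Gaussian random walk. The innovation sd, grid bounds, and data below are all illustrative assumptions:

```python
import numpy as np
from math import comb

def filter_binomial_rw(successes, trials, sigma=0.3, grid_size=201):
    """Forward filter for y_t ~ Binomial(n_t, p_t), where logit(p_t)
    follows a Gaussian random walk with sd `sigma`.  Works on a fixed
    grid over logit(p) and returns the filtered posterior mean of each p_t.
    (sigma, grid bounds, and grid size are illustrative choices.)"""
    theta = np.linspace(-4.0, 4.0, grid_size)      # grid over logit(p)
    p = 1.0 / (1.0 + np.exp(-theta))
    # Row-stochastic random-walk transition matrix on the grid.
    d = theta[:, None] - theta[None, :]
    trans = np.exp(-0.5 * (d / sigma) ** 2)
    trans /= trans.sum(axis=1, keepdims=True)
    belief = np.full(grid_size, 1.0 / grid_size)   # flat prior over the grid
    means = []
    for yt, nt in zip(successes, trials):
        belief = belief @ trans                    # predict: diffuse the state
        lik = np.array([comb(nt, yt) * pk**yt * (1 - pk)**(nt - yt) for pk in p])
        belief = belief * lik                      # update on this period's count
        belief = belief / belief.sum()
        means.append(float(np.sum(belief * p)))
    return means

# Four periods of tracking data, n = 100 interviews each (made-up numbers).
est = filter_binomial_rw([40, 42, 55, 50], [100, 100, 100, 100])
print([round(e, 3) for e in est])
```

The random-walk prior is what does the stabilizing: a one-period jump in the raw counts gets pulled back toward the recent history, which is the behavior the correspondent reports wanting.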

Now here’s me again:

I see what you’re saying about the poststrat being difficult. In this case, one starting point could be to make somewhat arbitrary (but reasonable) guesses for the sizes of the poststrat cells—for example, just use the proportion of respondents in the different categories in your sample—and then go from there. The point is that the poststrat would be giving you stability, even if it’s not matching quite to the population of interest.
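The arithmetic of that starting point is just a weighted average of cell-level estimates, with observed sample shares standing in for the unknown population cell sizes. A minimal sketch, with made-up numbers:

```python
import numpy as np

# Cell-level estimates from a (hypothetical) multilevel model, and
# observed sample counts used as rough stand-ins for population cell sizes.
cell_estimates = np.array([0.52, 0.31, 0.44, 0.60])  # modeled Pr(yes) per cell
cell_counts = np.array([120, 35, 80, 15])            # respondents per cell

weights = cell_counts / cell_counts.sum()            # sample shares as sizes
poststratified = float(np.sum(weights * cell_estimates))
print(round(poststratified, 4))
```

The stability comes from the modeled cell estimates, not the weights, so rough weights can still be useful even when they don’t match the population exactly.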

And Van Horn came back with:

You write, “one starting point could be to make somewhat arbitrary (but reasonable) guesses for the sizes of the poststrat cells.”

But there are millions of poststrat cells… Or are you thinking of doing some simple modeling of the distribution for the poststrat cells, e.g. treating the stratum-defining variables as independent?

That sounds like it could often be a workable approach.

Just to stir the pot, though . . . One could argue that a good solution should have good asymptotic behavior, in the sense that, in the limit of a large subgroup sample size, the estimate for the proportion should tend to the empirical proportion. Certainly if one of the subgroups is large, in which case one would expect the empirical proportion to be a good estimate for that subgroup, and my multilevel-model-with-poststrat gives an estimate that differs significantly from the “obvious” answer, this is likely to raise questions about the validity of the approach. It seems to me that, to achieve this asymptotic behavior, I’d need to be able to model the distribution of poststrat cells at arbitrary levels of detail as the sample size increases. This line of thought has me looking into Bayesian nonparametric modeling.

Fun stuff.

The post “A hard case for Mister P” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stroopy names appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>And this all makes me wonder: is there a psychology researcher somewhere with a dog named Stroopy? Probably so.

P.S. I just made the mistake of googling “Stroopy.” Don’t do it. I was referring to this.

The post Stroopy names appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Some quick disorganized tips on classroom teaching appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Anyway, here I am preparing my course on statistical computing and graphics and thinking of points to mention during the week on classroom teaching. My old approach would be to organize these points in outline format and then “cover” them in class. Instead, though, I’ll stick them here and then I can assign this to students to read ahead of time, freeing up class time for actual discussion.

**Working in pairs:**

This is the biggie, and there are lots of reasons to do it. When students are working in pairs, they seem less likely to drift off, and with two students there is more of a chance that one of them is interested in the topic. Students learn from teaching each other, and they can work together toward solutions. It doesn’t always work for students to do *homeworks* in pairs or groups—I have a horrible suspicion that they’ll often just split up the task, with one student doing problem #1, another doing problem #2, and so forth—but having them work in pairs during class seems like a no-lose proposition.

**The board:**

Students don’t pay attention all the time nor do they have perfect memories; hence, use the blackboard as a storage device. For example, if you are doing a classroom activity (such as the candy weighing), outline the instructions on the board at the same time as you explain them to the class. For another example, when you’re putting lots of stuff on the board, organize it a bit: start at the top-left and continue across and down, and organize the board into columns with clear labels. In both cases, the idea is that if a student is lost, he or she can look up at the board and have a chance to see what’s up.

Another trick is to load up the board with relevant material before the beginning of class period, so that it’s all ready for you when you need it.

**The projector:**

It’s becoming standard to use beamer (powerpoint) slide presentations in classroom teaching as well as with research lectures. I think this is generally a good idea, and I have just a few suggestions:

- Minimize the number of words on the slides. If you know what you’re talking about, you can pretty much just jump from graph to graph.

- The trouble with this strategy is that, without seeing the words on the screen, it can be hard to remember what to say. This suggests that what we really need is a script (or, realistically, a set of notes) to go along with the slide show. Logistically this is a bit of a mess—it’s hard enough to keep a set of slides updated without having to keep the script aligned at the same time—and as a result I’ve tended to err on the side of keeping too many words on my slides (see here, for example). But maybe it’s time for me to bite the bullet and move to a slides-and-script format.

Another intriguing possibility is to go with the script and ditch the slides entirely. Indeed, you don’t even need a script; all you need are some notes or just an idea of what you want to be talking about. I discovered this gradually over the past few years when giving talks (see here for some examples). I got into the habit of giving a little introduction and riffing a bit before getting to the first slide. I started making these ad libs longer and longer, until at one point I gave a talk that started with 20 minutes of me talking off the cuff. It seemed to work well, and the next step was to give an entire talk with no slides at all. The audience was surprised at first but it went just fine. Most of the time I come prepared with a beamer file full of more slides than I’ll ever be able to use, but it’s reassuring to know that I don’t really need any of them.

Finally, assuming you do use slides in your classes, there’s the question of whether to make the slides available to the students. I’m always getting requests for the slides but I really don’t like it when students print them out. I fear that students are using the slides as a substitute for the textbook, also that if the slides are available, students will think they don’t need to pay attention during class because they can always read the slides later.

It’s funny: Students are eager to sign up for a course to get that extra insight they’ll obtain from attending classes, beyond whatever they’d get by simply reading the textbook and going through the homework problems on their own. But once they’re in class, they have a tendency to drift off, and I need to pull all sorts of stunts to keep them focused.

**The board and the projector, together:**

Just cos your classroom has a projector, that don’t mean you should throw away your blackboard (or whiteboard, if you want to go that stinky route). Some examples:

- I think it works better to write out an equation or mathematical derivation in real time rather than to point at different segments of an already-displayed formula.

- It can help to mix things up a little. After a few minutes of staring at slides it can be refreshing to see some blackboard action.

- You can do some fun stuff by projecting onto the blackboard. For example, project x and y axes and some data onto the board, then have a pair of students come up and draw the regression line with chalk. Different students can draw their lines, then you click onto the next slide which projects the actual line.

**Handouts:**

Paper handouts can be a great way to increase the effective “working memory” for the class. Just remember not to overload a handout. Putting something on paper is not the same thing as having it be read. You should figure out ahead of time what you’re going to be using in class and then refer to it as it arises.

I like to give out roughly two-thirds as many handouts as there are people in the audience. This gives the handouts a certain scarcity value, and it gets students discussing in pairs since they’re sharing the handouts already. I found that when I’d give a handout to every person in the room, many people would just stick the handout in their notebook. The advantage of not possessing something is that you’re more motivated to consume it right away.

**Live computer demonstrations:**

These can go well. It perhaps goes without saying that you should try the demo at home first and work out the bugs, then prepare all the code as a script which you can execute on-screen, one paragraph of code at a time. Give out the script as a handout and then the students can follow along and make notes. And you should decide ahead of time how fast you want to go. It can be fine to do a demo fast to show how things work in real life, or it can be fine to go slowly and explain each line of code. But before you start you should have an idea of which of these you want to do.

**Multiple screens:**

When doing computing, I like to have four windows open at once: the R text editor, the R console, an R graphics window (actually nowadays I’ll usually do this as a refreshable pdf or png window rather than bothering with the within-R graphics window), and a text editor for whatever article or document I’m writing.

But it doesn’t work to display 4 windows on a projected screen: there’s just not enough resolution, and, even if resolution were not a problem, the people in the back of the room won’t be able to read it all. So I’m reluctantly forced to go back and forth between windows. That’s one reason it can help to have some of the material in handout form.

What I’d really like is multiple screens in the classroom so I can project different windows on to different screens and show all of them at once. But I never seem to be in rooms with that technology.

**Jitts:**

That’s “just in time teaching”; see here for details. I do this with all my classes now.

**Peer instruction:**

This is something where students work together in pairs on hard problems. It’s an idea from physics teaching that seems great to me but I’ve never succeeded in implementing true “peer instruction” in my classes. I have them work in pairs, yes, but the problems I give them don’t look quite like the “Concept Tests” that are used in the physics examples I’ve seen. The problem, perhaps, is that intro physics is just taught at a higher level than intro statistics. In my intro statistics classes, it’s hard enough to get the students to learn about the basics, without worrying about getting them into more advanced concepts. So when I have students work in pairs, it’s typically on more standard problems.

**Drills:**

In addition to these pair or small-group activities, I like the idea of quick drills that I shoot out to the whole class and students do, individually, right away. I want them to be able to handle basic skills such as sqrt(p*(1-p)/n) or log(a*x^(2/3)) instantly.
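For example, the first of those drill expressions is the standard error of a sample proportion; the kind of instant calculation meant here looks like:

```python
from math import sqrt

def binomial_se(p, n):
    """Standard error of a sample proportion: sqrt(p*(1-p)/n)."""
    return sqrt(p * (1 - p) / n)

# The classic mental-math case: p = 0.5 and n = 100 gives 0.05.
print(binomial_se(0.5, 100))
```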

**Getting their attention:**

You want your students to stay awake and interested, to enter the classroom full of anticipation and to leave each class period with a brainful of ideas to discuss. Like a good movie, your class should be a springboard for lots of talk.

But you don’t want to get attention for the wrong things. An extreme example is the Columbia physics professor who likes to talk about his marathon-fit body and at one point felt the need to strip to his underwear in front of his class. This got everyone talking—but not about physics. At a more humble level, I sometimes worry that I’ll do goofy things in class to get a laugh, but then the students remember the goofiness and not the points I was trying to convey. Most statistics instructors probably go too far in the other direction, with a deadpan demeanor that puts the students to sleep.

It’s ok to be “a bit of a character” to the extent that this motivates the students to pay attention to you. But, again, I generally recommend that you structure the course so that you talk less and the students talk more.

**Walking around the classroom:**

Or wheeling around, if that’s your persuasion. Whatever. My point here is that you want your students to spend a lot of the class time working on problems in pairs. While they’re doing this, you (and your teaching assistants, if this is a large so-called lecture class with hundreds of students) should walk around the classroom, looking over students’ shoulders, answering questions, and getting a sense of where they are getting stuck.

**Teaching tips in general:**

As I explained in my book with Deb Nolan, I’m not a naturally good teacher and I struggle to get students to participate in class. Over the decades I’ve collected lots of tricks because I need all the help I can get. If you’re a naturally good teacher or if your classes already work then maybe you do without these ideas.

**Preparation:**

It’s not clear how much time should be spent preparing the course ahead of time. I think it’s definitely a good idea to write the final exam and all the homeworks before the class begins (even though I don’t always do this!) because then it gives you a clearer sense of where you’re heading. Beyond that, it depends. I’m often a disorganized teacher and I think it helps me a lot to organize the entire class before the semester begins.

Other instructors are more naturally organized and can do just fine with a one-page syllabus that says which chapters are covered in which weeks. These high-quality instructors can then just go into each class, quickly get a sense of where the students are stuck, and adapt the class accordingly. For them, too much preparation might well backfire.

My problem is that I’m not so good at individualized instruction; even in a small class, it’s hard for me to keep track of where each student is getting stuck, and what the students’ interests and strengths are. I’d like to do better on this, but for now I’ve given up on trying to adapt my courses for individuals. Instead I’ve thrown a lot of effort into detailed planning of my courses, with the hope that these teaching materials will be useful for other instructors.

**Students won’t (in general) reach your level of understanding:**

You don’t teach students facts or even techniques, you teach them the skills needed to solve problems (including the skills needed to find the solution on their own). And there’s no point in presenting things they’re not supposed to learn; for example, if a mathematical derivation is important, put it on the exam with positive probability. And if students aren’t gonna get it anyway (my stock example here is the sampling distribution of the sample mean), just don’t cover it. That’s much better, I think, than wasting everyone’s time and diluting everyone’s trust level with a fake-o in-class derivation.

**The road to a B**:

You want a plan by which a student can get by and attain partial mastery of the material. See discussion here.

**Evaluation:**

What, if anything, did the students actually learn during the semester?

You still might want to evaluate what your students are actually learning, but we don’t usually do this. I don’t even do it, even though I talk about it. Creating a pre-test and post-test is work! And it requires some hard decisions. Whereas not testing at all is easy. And even when educators try to do such evaluations, they’re often sloppy, with threats to validity you could drive a truck through. At the very least, this is all worth thinking about.

**Relevance of this advice to settings outside college classrooms:**

Teaching of advanced material happens all over, not just in university coursework, and much of the above advice holds more generally. The details will change with the goals—if you’re giving a talk on your latest research, you won’t want the audience to be spending most of the hour working in pairs on small practice problems—but the general principles apply.

Anyway, it was pretty goofy that I used to teach a course on teaching and stand up and say all these things. It makes a lot more sense to write it here and reserve class time for more productive purposes.

**One more thing**

I can also add to this post between now and the beginning of class. So if you have any ideas, please share them in the comments.

The post Some quick disorganized tips on classroom teaching appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Stroopy names

**Wed:** “A hard case for Mister P”

**Thurs:** The field is a fractal

**Fri:** Replication Wiki for economics

**Sat, Sun:** As Chris Hedges would say: Stop me if you’ve heard this one before

The post My courses this fall at Columbia appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>We’ll be going through the book, section by section. Follow the link to see slides and lecture notes from when I taught this course a couple years ago. This course has a serious workload: each week we have three homework problems, one theoretical, one computational, and one applied.

Stat 6191, Statistical Communication and Graphics, TuTh 10-11:30 in room C05 Social Work Bldg:

This is an entirely new course that will be structured around student participation. I’m still working out the details but here’s the current plan of topics for the 13 weeks:

1. Introducing yourself and telling a story

2. Introducing the course

3. Presenting and improving graphs

4. Graphing data

5. Graphing models

6. Dynamic graphics

7. Programming

8. Writing

9. Giving a research presentation

10. Collaboration and consulting

11. Teaching a class

12-13. Student projects

**Why am I teaching these courses?**

The motivation for the Bayesian Data Analysis class is obvious. There’s a continuing demand for this course, and rightly so, as Bayesian methods are increasingly powerful for a wide range of applications. Now that our book is available, I see the BDA course as having three roles: (1) the lectures serve as a guide to the book, we talk through each section and point to tricky points and further research; (2) the regular schedule of homework assignments gives students a lot of practice applying and thinking about Bayesian methods; and (3) students get feedback from the instructor, teaching assistant, and others in the class.

The idea of the communication and graphics class is that statistics is all about communication to oneself and to others. I used to teach a class on teaching statistics but then I realized that classroom teaching is just one of many communication tasks, along with writing, graphics, programming, and various forms of informal contact. I think it’s important for this class to *not* be conducted through lectures, or guest lectures, or whatever, but rather as much as possible via student participation.

The post My courses this fall at Columbia appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Psychohistory” and the hype paradox appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I thought you might be interested in this post.

I was asked about this by someone at Skytree and replied with this link to Tyler Vigen’s Spurious Correlations. What’s most interesting about Vigen’s site is not his video (he doesn’t go into the dangers of correlating time series, for example), but his examples themselves.

The GDELT project is a good case, I think, of how Big Data is wagging the analytic dog. The bigger the data, the more important the analysis. There seem to be at least a few at Google who have caught this disease.

The post Lee links to above is called “Towards Psychohistory: Uncovering the Patterns of World History with Google BigQuery” and is full of grand claims about using a database of news media stories “to search for the basic underlying patterns of global human society” and that “world history, at least the view of it we see through the news media, is highly cyclic and predictable.” Also some pretty graphs such as this one:

I responded to Lee:

Yes, I agree, the grand claims seem bogus to me. But it’s hard for me to judge because I’m not clear exactly what they’re plotting. Is it number of news articles each day including the word “protest” and the country name? Or all news articles featuring the country with a conflict theme?

In any case, perhaps the best analogy is to the excitement in previous eras regarding statistical regularity. A famous example is the number of suicides in a country each year. Typically these rates are stable, just showing some “statistical” variation. And this can seem a bit mysterious. Suicide is the ultimate individual decision yet the number from year to year shows a stunning regularity. Other examples would be the approximate normality of various distributions of measurements, and various appearances of Zipf’s law. In each case, the extreme claims regarding the distributions typically end up seeming pretty silly, but there is something there. In this case, the Google researchers are, as they say, learning something about statistical patterns of media coverage. And that’s fine. I wish they could do it without the hype—but perhaps the hype is the price that we must pay for the work to get done.

And Lee replied:

I’m not clear either regarding the dependent variable.

A few (sort of) random thoughts.

1) There’s little attention given to the number of considerable patterns in the second 60 day period. Not the number of *possible* patterns, because the dependent variable is presumably continuous or at least presents many possible values. I mean instead the number of patterns the researcher would have considered different from each other — a sort of JND measure of how they visually interpret the prediction. My guess is that there are not very many such patterns — in other words, they have a categorical prior over very few values. As evidence of this, they seem to be ignoring relatively small-scale variation in the first case and highlighting it in the second. Very subjective and post-hoc.

2) They appear to be willing to compare different series on different time scales in order to find similar patterns. This is reminiscent of dynamic time warping, which works OK for bird calls but is questionable for historical data. What are the limits of this flexibility in actual practice? One series covering only January and another covering the whole year that are deemed to be similar? I don’t see them explicitly ruling out such extreme comparisons.

3) Rather broadly, this appears to be similar to “charting” methods for picking stocks, which have been discredited for many years. Similar patterns don’t necessarily predict similar outcomes because context matters. Different exogenous variables can produce similar patterns for very different reasons. Put another way, one can find similar patterns in different time series that are based on fundamentally different processes, particularly on a small scale (60 days in this case?).

4) Searching that many correlations based on “p=.05” is arbitrary. I know they need a magic number to help filter, but why give it this appearance of legitimacy?

5) They say, “Whether these patterns capture the actual psychohistorical equations governing all of human life or, perhaps far more likely, a more precise mathematical definition of how journalism shapes our understanding of global events, they demonstrate the unprecedented power of the new generation of “big data” tools like Google BigQuery …” I have no idea what they mean here. Perhaps there is some dynamical system underlying these types of historical events, but until someone identifies plausible variables, I find the observation both breathless and uninteresting.

6) I’m all for BIG DATA. After all, I now work at a machine learning company. But statistics is about using methods that minimize the chance of our being fooled by randomness or bias. The methods used here, it seems to me, offer none of these protections.

I still have positive feelings about this effort because, even though the big claims have gotta be bogus, setting aside the hype, ya gotta start somewhere. On the other hand, one can be legitimately annoyed by the hype, in that, without the hype, we never would’ve heard about this in the first place.

The post “Psychohistory” and the hype paradox appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Luck vs. skill in poker appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The thread of our recent discussion of quantifying luck vs. skill in sports turned to poker, motivating the present post.

**1. Can good poker players really “read” my cards and figure out what’s in my hand?**

For a couple of years in grad school a group of us had a regular Thursday-night poker game, nickel-dime-quarter with a maximum bet of $2, I believe it was. I did OK; it wasn’t hard to be a steady winner by just laying low most of the time and raising when I had a good hand. Since then I’ve played only very rarely (one time was a memorable experience with some journalists and a foul-mouthed old-school politico—I got out of that one a couple hundred dollars up but with no real desire to return), but I did have a friend who was really good. I played a couple of times with him and some others, and it was like the kind of thing you hear about: he seemed to be able to tell what cards I was holding. Don’t get me wrong here, I’m not saying that he was cheating or that it was uncanny or anything, and it’s not like he was taking my money every hand. As always in a limit game, the outcomes had a lot of randomness. But from time to time, in big hands, it really did seem like he was figuring me out. I didn’t think to ask him how he was doing it, but I was impressed.

Upon recent reflection, though (many years later), it seems to me that I was slightly missing the point. The key is that my friend didn’t need to “read” me or know what I had; all he needed to do was make the right bets (or, to be more precise, make betting decisions that would perform well on average). He could well have made some educated guesses about my holdings based on my betting patterns (or even my “tells”) and used that updated probability distribution to make more effective betting decisions. The point is that, in many, many settings, he doesn’t need to guess my cards; he just needs a reasonable probability distribution (which might be implicit). For example, in some particular situation in a particular hand, perhaps it would be wise for him to fold if the probability is more than 30% that a particular hole card of mine is an ace. With no information, he’d assess this event as having an (approximate) 2% probability. So do I have that ace? He just needs to judge whether the probability is greater or less than 30%, an assessment that he can do using lots of information available to him. But once he makes that call, if he does it right (as he will, often enough; that’s part of what it means to be a good poker player), it’ll seem to me like he was reading my hand.
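To make that concrete, here is a toy calculation. The 2% prior is from the discussion above, but the betting-pattern likelihoods are made-up numbers, purely for illustration; the point is that the decision depends only on which side of the 30% threshold the updated probability lands.

```python
def posterior_ace(prior=0.02, p_bet_if_ace=0.60, p_bet_if_no_ace=0.15):
    """Bayes' rule: P(that hole card is the ace | observed betting pattern).
    The two likelihoods are hypothetical numbers, purely for illustration."""
    num = p_bet_if_ace * prior
    return num / (num + p_bet_if_no_ace * (1 - prior))

p = posterior_ace()   # about 0.075: well above the 2% prior...
fold = p > 0.30       # ...but still below the 30% fold threshold
```

Under these made-up numbers the observed betting pattern nearly quadruples the prior, yet the right play is still not to fold: the skill is in the threshold comparison, not in literally naming the card.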

**2. Some references on luck vs. skill in poker**

Louis Raes pointed to three papers:

Ben van der Genugten and Peter Borm wrote quite a bit on poker and the extent to which skill or luck is important. This work is mainly geared towards Dutch regulation but is interesting nonetheless.

See:

http://link.springer.com/article/10.1007/s001860400347#page-1

http://link.springer.com/chapter/10.1007/978-1-4615-4627-6_3#page-1

http://link.springer.com/article/10.1007/BF02579073#page-1

**3. Rick Schoenberg’s definition**

Rick Schoenberg sent along an excerpt from his book, Probability with Texas Holdem Applications. Rick writes:

Surprisingly, a lot of books on game theory do not define the words “luck” or “skill”, maybe because it is very hard to do so. . . . in poker I [Rick] define skill as equity gained during the betting rounds and luck as equity gained during the deal of the cards. I then go through a televised twenty-something hand battle between Howard Lederer and Dario Minieri, two players with about as opposite styles as you can get, and try to quantify how much of Lederer’s win was due to luck and how much was due to skill.

I’ll go through Rick’s material and intersperse some comments.

Here’s section 4.4 of Rick’s book:

The determination of whether Texas Hold’em is primarily a game of luck or skill has recently become the subject of intense legal debate. Complicating things is the fact that the terms luck and skill are extremely difficult to define. Surprisingly, rigorous definitions of these terms appear to have eluded most books and journal articles on game theory. A few articles have defined skill in terms of the variance in results among different players, with the idea that players should perform more similarly if a game is mostly based on luck, but their results might differ more substantially if the game is based on skill. Another definition of skill is the extent to which players can improve, and it has been shown that poker does indeed possess a significant amount of this feature (e.g. Dedonno and Detterman, 2008). Others have defined skill in terms of the variation in a given player’s results, since less variation would indicate that fewer repetitions are necessary in order to determine the statistical significance of one’s long-term edge in the game, and hence the sooner one can establish that one’s average profits or losses are primarily due to skill rather than short term luck.

These definitions above are obviously extremely problematic for various reasons. One is that they rely on the game in question being played repeatedly before even a crude assessment of luck or skill could be made. More importantly, there are many contests of skill wherein the differences between players are small, or where one’s results vary wildly. For instance, in Olympic trials of 100-meter sprints, the differences between finishers are typically quite small, often just hundredths of a second. This hardly implies that the results are based on luck. There are also sporting events where an individual’s results might vary widely from one day to another, e.g. pitching in baseball, but that hardly means that luck plays a major role.

Regarding the quantification of the amount of luck or skill in a particular game of poker, a possibility might be to define luck as equity gained when cards are dealt by the dealer, and skill as equity gained by one’s actions during the betting rounds. (Recall that equity was defined in Section 4.3 as the product cp.) That is, there are several reasons you might gain equity during a hand:

* The cards dealt by the dealer (whether the players’ hole cards or the flop, turn, or river) give you a greater chance of winning the hand in a showdown.

* The size of the pot is increased while your chance to win the hand in a showdown is better than that of your opponent(s).

* By betting, you get others to fold and thus win a pot that you might otherwise have lost.

Certainly, anyone would characterize the first case as luck, unless perhaps one believes in ESP or time travel.

Uh oh. Daryl Bem’s getting off the bus right here, I think!

Thus [Rick continues], a possible way to estimate the skill in poker can be obtained by looking at the second and third cases above. That is, we may view one’s skill as being comprised of the equity that one gains during the betting rounds, whereas luck is equity gained by the deal of the cards. The nice thing about this is that it is easily quantifiable, and one may dissect a particular poker game and analyze how much equity each player gained due to luck or skill.

There are obvious objections to this. First, why equity? One’s equity (which is sometimes called express equity) in a pot is defined as one’s expected return from the pot given no future betting, and the assumption of no future betting may seem absurdly simplistic and unrealistic. On the other hand, unlike implied equity which would account for betting on future betting rounds, express equity is unambiguously defined and easy to compute.

Second, situations can occur where one would expect a terrible player to gain equity during the betting rounds against even the greatest player in the world, so to attribute such equity gains to skill might be objectionable. For instance, in heads up Texas Hold’em, if the two players are dealt AA and KK, one would expect the player with KK to put a great deal of chips in while way behind, and this situation seems more like bad luck for the player with KK than any deficit in skill.

One possible response to this objection is that skill is difficult to define, and in fact most poker players, probably due to their huge and fragile egos, tend to chalk up losses for virtually any reason as being solely due to bad luck. In some sense, anything can be attributed to luck if one has a general enough definition of the word. Even if a player does an amazingly skillful poker play, such as folding a very strong hand because of an observed tell or betting pattern, one could argue that it was lucky that the player happened to observe that tell, or even that the player was lucky to have been born with the ability to discern the tell. On the other hand, situations like the AA versus KK example truly do seem like bad luck. It is difficult to think of any remedy to this problem. It may be that the word skill is too strong a word, and that while it may be of interest to analyze hands in terms of equity, one should instead use the term equity gained during the betting rounds rather than skill in what follows.

We should distinguish between the two concerns that Rick is noting here. First, the division between luck and skill is inherently arbitrary (in a similar way to the arbitrariness of the division between prior and likelihood in a hierarchical model). To take things to extremes, you could say that a skillful player is lucky to have been born with such skill. As Rick explains, some things are definitely luck, but nothing can really be defined as definitely being skill.

Rick’s second concern, which I share, is that in his example it does not seem like skill that, if you happen to get the pair of aces, your equity just keeps going up because the other player doesn’t know what you have. So I agree with him that his definition has real problems.

Rick continues:

Below is an extended example intended to illustrate the division of luck and skill in a given game of Texas Hold’em. The example involves the end of a tournament on Poker After Dark televised during the first week of October 2009. Dario Minieri and Howard Lederer were the final two players. Since this portion of the tournament involves only these two players, and since all hands (or virtually all) were televised, this example provides us with an opportunity to attempt to parse out how much of Lederer’s win was due to skill and how much was due to luck.

Technical side note: Before we begin, we need to clarify a couple of potential ambiguities. There is some ambiguity in the definition of equity before the flop, since the small and big blind have put in different amounts of chips. The definition used here is that preflop equity is the expected profit (equity one would have in the pot after calling minus cost), assuming the big blind and small blind call as well, or the equity one would have by folding, whichever happens to be greater. For example, in heads up Texas Hold’em with blinds of 800 and 1600, the preflop equity for the big blind is 2bp – 1600, and max{2bp – 1600, -800} for the small blind, where p is the probability of winning the pot in a showdown, and b is the amount of the big blind. Define increases in the size of the pot as relative to the big blind, i.e. increasing the pot size by calling preflop does not count as skill. The probability p of winning the hand in a showdown was obtained using the odds calculator at cardplayer.com, and the probability of a tie is divided equally among the two players in determining p.
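Those definitions drop straight into code. Here is a minimal sketch (the function and argument names are mine, not Rick’s), using b = 1600 and a small blind of 800 as in the example:

```python
def preflop_equity(p, b=1600, sb=800, small_blind=False):
    """Preflop equity per the side note: expected profit with a 2b pot,
    assuming both blinds call; the small blind may instead take the
    equity of folding (-sb) if that is greater."""
    call_eq = 2 * b * p - b            # 2bp - 1600 in the text's notation
    return max(call_eq, -sb) if small_blind else call_eq

# Hand 4: Minieri's 70.065% showdown probability as the big blind.
print(round(preflop_equity(0.70065), 2))        # 642.08

# A weak hand in the small blind: folding caps the loss at 800.
print(preflop_equity(0.20, small_blind=True))   # -800
```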

Example 4.4.1. Below are summaries of all 27 hands shown on Poker After Dark in October 2009 between Dario Minieri and Howard Lederer in the Heads Up portion of the tournament, with each hand’s equity gains and losses broken down as luck or skill. Each hand is analyzed from Minieri’s perspective, i.e. “skill -100” refers to 100 chips of equity gained by Lederer during a betting round. The question we seek to address is: how much of Lederer’s win was due to skill, and how much of it was due to luck?

Based on the concerns discussed above, “skill” doesn’t seem to be the right word here. On the other hand, maybe this is no big deal because the “luck” aspects will ultimately average out.

For example [Rick continues], here is a detailed breakdown of hand 4, where the blinds were 800/1600, Minieri was dealt A♣ J♣, Lederer had A♥ 9♥, Minieri raised to 4300 and Lederer called. The flop was 6♣ 10♠ 10♣, Lederer checked, Minieri bet 6500, and Lederer folded.

a) Preflop dealing (luck): Minieri +642.08. Minieri was dealt a 70.065% probability of winning the pot in a showdown, so Minieri’s increase in equity is 70.065% x 3200 – 1600 = 642.08 chips. Lederer was dealt a 29.935% probability to win the pot in a showdown, so his increase in equity is 29.935% x 3200 – 1600 = -642.08 chips.

b) Preflop betting (skill): Minieri +1083.51. The pot was increased to 8600, and 8600 – 3200 = 5400. Minieri paid an additional 2700 but had 70.065% x 5400 = 3783.51 additional equity, so Minieri’s expected profit due to betting was 3783.51 – 2700 = 1083.51 chips. Correspondingly, Lederer’s expected profit due to betting was -1083.51 chips, since 29.935% x 5400 – 2700 = -1083.51.

c) Flop dealing (luck): Minieri +1362.67. After the flop was dealt, Minieri’s probability of winning the 8600 chip pot in a showdown went from 70.065% to 85.91%. So by luck, Minieri increased his equity by (85.91% – 70.065%) x 8600 = +1362.67 chips.

d) Flop betting (skill): Minieri +1211.74. Because of the betting on the flop, Minieri’s equity went from 85.91% of the 8600 pot to 100% of the pot, so Minieri increased his equity by (100% – 85.91%) x 8600 = 1211.74 chips.

So during the hand, by luck, Minieri increased his equity by 642.08 + 1362.67 = 2004.75 chips. By skill, Minieri increased his equity by 1083.51 + 1211.74 = 2295.25 chips. Notice that the total = 2004.75 + 2295.25 = 4300, which is the number of chips Minieri won from Lederer in the hand.

Note that [Never use the expression "note that"! Also avoid "very" and "obviously" --- ed.] before the heads-up battle began, the broadcast reported that Minieri had 72,000 chips, and Lederer 48,000. Minieri must have won some chips in hands they did not televise, because the grand total has Minieri losing about 74,500 chips.
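Steps (a) through (d) can be replayed in a few lines, with every number taken straight from the text; the luck and skill pieces sum exactly to the 4300 chips that changed hands:

```python
b = 1600                                  # big blind

# (a) Preflop deal (luck): Minieri is dealt a 70.065% showdown probability.
luck_deal = 0.70065 * 2 * b - b           # +642.08

# (b) Preflop betting (skill): the pot grows from 3200 to 8600;
#     Minieri pays 2700 more for a 70.065% share of the extra 5400.
skill_preflop = 0.70065 * 5400 - 2700     # +1083.51

# (c) Flop deal (luck): his showdown probability jumps to 85.91%.
luck_flop = (0.8591 - 0.70065) * 8600     # +1362.67

# (d) Flop betting (skill): Lederer folds, so Minieri's share becomes 100%.
skill_flop = (1 - 0.8591) * 8600          # +1211.74

luck, skill = luck_deal + luck_flop, skill_preflop + skill_flop
assert abs(luck + skill - 4300) < 1e-6    # the whole 4300-chip pot is split
```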

(Blinds 800 and 1600.)

Hand 1. Lederer A♣ 7♠, Minieri 6♠ 6♦. Lederer 43.535%, Minieri 56.465%. Lederer raises to 4300. Minieri raises to 47800. Lederer folds.

Luck +206.88. Skill +4093.12.

Hand 2. Minieri 4♠ 2♦, Lederer K♠ 7♥. Minieri 34.36%, Lederer 65.64%. Minieri raises to 4300, Lederer raises all in for 43500, Minieri folds.

Luck -500.48. Skill -3799.52.

Hand 3. Lederer 6♥ 3♦, Minieri A♦ 9♣. Lederer 34.965%, Minieri 65.035%. Lederer folds in the small blind.

Luck +481.12. Skill +318.88.

Hand 4. Minieri A♣ J♣, Lederer A♥ 9♥. Minieri 70.065%, Lederer 29.935%. Minieri raises to 4300, Lederer calls 2700. Flop 6♣ 10♠ 10♣. Minieri 85.91%, Lederer 14.09%. Lederer checks, Minieri bets 6500, Lederer folds.

Luck +2004.75. Skill +2295.25.

Hand 5. Lederer 5♠ 3♥, Minieri 7♦ 6♠. Lederer 35.765%, Minieri 64.235%. Lederer folds in the small blind.

Luck +455.52. Skill +344.48.

Hand 6. Minieri K♥ 10♣, Lederer 5♦ 2♦. Minieri 61.41%, Lederer 38.59%. Minieri raises to 3200, Lederer raises to 9700, Minieri folds.

Luck +365.12. Skill -3565.12.

Hand 7. Minieri 10♦ 7♠, Lederer Q♣ 2♥. Minieri 43.57%, Lederer 56.43%. Minieri raises to 3200, Lederer calls 1600. Flop 8♠ 2♠ Q♥. Minieri 7.27%, Lederer 92.73%. Lederer checks, Minieri bets 3200, Lederer calls. Turn 4♦. Minieri 0%, Lederer 100%. Lederer checks, Minieri bets 10,000, Lederer calls. River A♥. Lederer checks, Minieri checks.

Luck -205.76 – 2323.20 – 930.56 = -3459.52.

Skill -205.76 – 2734.72 – 10000 = -12940.48.

Hand 8. Lederer 7♣ 2♦, Minieri 9♣ 4♦. Minieri 64.28%, Lederer 35.72%. Lederer folds.

Luck +456.96. Skill +343.04.

Hand 9. Minieri 4♠ 2♣, Lederer 8♥ 7♦. Minieri 34.345%, Lederer 65.655%. Minieri raises to 3200, Lederer calls 1600. Flop 3♦ 9♥ J♥. Minieri 22.025%, Lederer 77.975%. Lederer checks, Minieri bets 4800, Lederer folds.

Luck -500.96 – 788.48 = -1289.44. Skill -500.96 + 4990.40 = +4489.44.

Hand 10. Lederer K♠ 5♠, Minieri K♥ 7♣. Minieri 59.15%, Lederer 40.85%. Lederer calls 800, Minieri raises to 6400, Lederer folds.

Luck +292.80. Skill +1307.20.

Hand 11. Minieri A♥ 8♥, Lederer 6♥ 3♠. Minieri 66.85%, Lederer 33.15%. Minieri raises to 3200. Lederer folds.

Luck +539.20. Skill +1060.80.

Hand 12. Lederer A♦ 4♦, Minieri 7♦ 3♥. Minieri 34.655%, Lederer 65.345%. Lederer raises to 4300, Minieri raises to 11500, Lederer folds.

Luck -491.04. Skill +4791.04.

Hand 13. Minieri 6♣ 3♣, Lederer K♠ 6♠. Minieri 29.825%, Lederer 70.175%. Minieri raises to 4800, Lederer calls 3200. Flop 5♥ J♣ 5♣. Minieri 47.425%, Lederer 52.575%. Lederer checks, Minieri bets 6000, Lederer folds.

Luck -645.60 + 1689.60 = +1044. Skill -1291.20 + 5047.20 = +3756.

Hand 14. Lederer 7♦ 5♠, Minieri 8♦ 5♦. Minieri 69.44%, Lederer 30.56%. Lederer calls 800, Minieri checks. Flop K♥ 10♠ 8♣. Minieri 94.395%, Lederer 5.605%. Minieri checks, Lederer bets 1800, Minieri calls. Turn 7♠. Minieri 95.45%, Lederer 4.55%. Minieri checks, Lederer checks. River 6♥. Check, check.

Luck +622.08 + 798.56 + 71.74 + 309.40 = 1801.78. Skill 0 + 1598.22 + 0 + 0 = 1598.22.

(Blinds 1000/2000.)

Hand 15. Minieri 9♦ 5♠, Lederer A♥ 5♦. Minieri 26.755%, Lederer 73.245%. Minieri calls 1000, Lederer raises to 7000, Minieri raises to 14000, Lederer calls 7000. Flop 10♠ Q♦ 6♥. Minieri 15.35%, Lederer 84.65%. Lederer checks, Minieri bets 14000, Lederer folds.

Luck -929.80 – 3193.40 = -4123.20. Skill -5578.80 + 23702 = +18123.20.

Hand 16. Lederer 5♠ 5♥, Minieri A♣ J♦. Minieri 46.085%, Lederer 53.915%. Lederer calls 1000, Minieri raises to 26800, Lederer calls all in. The board is 3♠ 9♠ K♠ 10♦ 9♦.

Luck -156.60 – 24701.56 = -24858.16. Skill -1941.84.

Hand 17. Minieri K♣ 10♣, Lederer 7♦ 5♦. Minieri 62.22%, Lederer 37.78%. Minieri raises to 5000, Lederer calls 3000. Flop J♠ J♦ 4♠. Minieri 69.90%, Lederer 30.10%. Check check. Turn 8♠. Minieri 77.27%, Lederer 22.73%. Lederer bets 6000, Minieri folds.

Luck +488.80 + 768 + 737 = +1993.80. Skill +733.20 + 0 – 7727 = -6993.80.

Hand 18. Lederer 5♠ 5♣, Minieri 10♠ 6♥. Minieri 46.12%, Lederer 53.88%. Lederer calls 1000, Minieri checks. Flop 7♣ 8♣ Q♥. Minieri 38.235%, Lederer 61.765%. Minieri checks, Lederer bets 2000, Minieri calls. Turn J♥. Minieri 22.73%, Lederer 77.27%. Minieri bets 4000, Lederer folds.

Luck -155.20 – 315.40 – 1240.40 = -1711. Skill 0 – 470.60 + 6181.60 = +5711.

Hand 19. Lederer K♥ 5♠, Minieri K♣ 10♦. Minieri 73.175%, Lederer 26.825%. Lederer raises to 5000, Minieri calls 3000. Flop J♦ 8♥ 10♥. Minieri 92.575%, Lederer 7.425%. Check, check. Turn 5♦. Minieri 95.45%, Lederer 4.55%. Minieri bets 6000, Lederer folds.

Luck +927 + 1940 + 287.50 = +3154.50. Skill +1390.50 + 0 + 455 = +1845.50.

Hand 20. Minieri 7♣ 2♠, Lederer Q♠ 9♠. Minieri 30.205%, Lederer 69.795%. Minieri raises to 6000. Lederer calls 4000. Flop A♦ A♠ Q♦. Minieri 1.165%, Lederer 98.835%. Lederer checks, Minieri bets 6000, Lederer calls. Turn J♣. Minieri 0%, Lederer 100%. Lederer checks, Minieri bets 14000, Lederer raises to 35800, Minieri folds.

Luck -791.80 – 3484.80 – 279.60 = -4556.20. Skill -1583.60 – 5860.20 – 14000 = -21443.80.

Hand 21. Minieri 10♥ 3♦, Lederer Q♥ J♠. Minieri 30.00%, Lederer 70.00%. Minieri calls 1000, Lederer checks. Flop 8♠ 4♥ J♣. Minieri 4.34%, Lederer 95.66%. Lederer checks, Minieri bets 2000, Lederer raises to 7500, Minieri raises to 18500, Lederer raises all-in, Minieri folds.

Luck -800 – 1026.40 = -1826.40. Skill 0 – 18673.60 = -18673.60.

Hand 22. Lederer A♠ 2♦, Minieri 5♣ 3♥. Minieri 42.345%, Lederer 57.655%. Lederer calls 1000. Minieri checks. Flop K♠ 10♣ 3♠. Minieri 80.10%, Lederer 19.90%. Check check. Turn Q♠. Minieri 65.91%, Lederer 34.09%. Check, Lederer bets 2000, Minieri folds.

Luck -306.20 + 1510.20 – 567.60 = +636.40. Skill 0 + 0 – 2636.40 = -2636.40.

(Blinds 1500/3000.)

Hand 23. Minieri 7♥ 7♣, Lederer 8♦ 3♦. Minieri 68.175%, Lederer 31.825%. Minieri all-in for 21,700, Lederer folds.

Luck +1090.50. Skill +1909.50.

Hand 24. Minieri Q♥ 5♥, Lederer 8♦ 5♦. Minieri 68.37%, Lederer 31.63%. Minieri all-in for 26,200, Lederer folds.

Luck +1102.20. Skill +1897.80.

Hand 25. Lederer 9♣ 3♣, Minieri 5♦ 2♦. Minieri 40.63%, Lederer 59.37%. Lederer folds.

Luck -562.20. Skill +2060.20.

Hand 26. Minieri 10♣ 2♠, Lederer 7♣ 7♥. Minieri 29.04%, Lederer 70.96%. Minieri folds.

Luck -1257.60. Skill -242.40.

Hand 27. Lederer Q♣ 9♣, Minieri A♣ 5♠. Minieri 55.37%, Lederer 44.63%. Lederer all-in for 29,200. Minieri calls. Board 7♣ 6♣ 10♠ Q♠ 6♦.

Luck +322.20 – 32336.08 = -32013.88. Skill +2813.88.

Grand Totals: Luck -61023.59. Skill -13478.41.

Overall, although Lederer’s gains were primarily (about 81.9%) due to luck, Lederer also gained more equity due to skill than Minieri. On the first 19 hands, Minieri actually gained 20,836.41 in equity due to skill, and it appeared that Minieri was outplaying Lederer. On hands 20 and 21, however, Minieri tried two huge unsuccessful bluffs, both on hands (especially hand 20) where he should probably have strongly suspected that Lederer would be likely to call, and on those two hands combined, Minieri lost 40,117.40 in equity due to skill. Although Minieri played very well on every other hand, all of Minieri’s good plays on other hands could not overcome the huge loss of skill equity from just those two hands.

It is important to note that the player who gains the most equity due to skill does not always win. In the first 19 hands of this example, for instance, Minieri gained 20836.41 in equity attributed to skill, but because of bad luck, Minieri actually lost a total of 2800 chips over these same 19 hands. The bad luck Minieri suffered on hand 16 negated most of his gains due to skillful play. A common misconception is that one’s luck will ultimately balance out, i.e. that one’s total good luck will eventually exactly equal one’s total bad luck, but this is not true. Assuming one plays the same game repeatedly and independently, and assuming the expected value of one’s equity due to luck is 0, which seems reasonable, then one’s average equity per hand gained by luck will ultimately converge to zero. This is the law of large numbers, and is discussed further in Section 7.4. It does not imply that one’s total equity gained by luck will converge to zero, however. Potential misconceptions about the laws of large numbers and arguments about possible overemphasis on equity are discussed in Section 7.4.
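A quick simulation makes the distinction concrete. This is a generic sketch, not calibrated to poker in any way: each hand’s “luck” is just a mean-zero shock. The average luck per hand shrinks toward zero as hands accumulate, while the running total keeps wandering.

```python
import random

random.seed(0)
n = 100_000
# Each hand's luck is a mean-zero shock (sd of 1000 chips, arbitrary).
luck = [random.gauss(0, 1000) for _ in range(n)]

average = sum(luck) / n   # law of large numbers: shrinks toward 0
total = sum(luck)         # does NOT shrink; it wanders on the order
                          # of 1000 * sqrt(n) chips
```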

To conclude this Section, a nice [I'd avoid the word "nice" too! --- AG] illustration of the potential pitfalls of analyzing a hand purely based on equity is a recent hand from Season 7 of High Stakes Poker. In this hand, with blinds of $400 and $800 plus $100 antes from each of the 8 players, after Bill Klein straddled for $1600, Phil Galfond raised to $3500 with Q♠ 10♥, Robert Croak [now that's a great poker name. --- ed.] called in the big blind with A♣ J♣, Klein called with 10♠ 6♠, and the other players folded. The flop came J♠ 9♥ 2♠, giving Croak top pair, Klein a flush draw, and Galfond an open-ended straight draw. Croak bet $5500, Klein raised to $17500, and Galfond and Croak called. At this point, it is tempting to compute Klein’s probability of winning the hand by computing the probability of exactly one more spade coming on the turn and river without making a full house for Croak, or the turn and river including two 6s, or a 10 and a 6. This would yield a probability of [(8 x 35 - 4 - 4) + C(3,2) + 2x3] ÷ C(43,2) = 281/903 ~ 31.12%, and Klein could also split the pot with a straight if the turn and river were KQ or Q8 or 78, without a spade, which has a probability of [3x3 + 3x3 + 3x3] ÷ C(43,2) = 27/903 ~ 2.99%. These seem to be the combinations Klein needs, and one would not expect Klein to win the pot with a random turn and river combination not on this list, and especially not if the turn and river contain a king and a jack with no spades.

But look at what actually happened in the hand. The turn was the K♣, giving Galfond a straight, and Croak checked, Klein bet $28000, Galfond raised to $67000, Croak folded, and Klein called. The river was the J♥, Klein bluffed $150000, and Galfond folded, giving Klein the $348,200 pot!
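Rick’s counting can be checked mechanically. The sketch below just evaluates the arithmetic exactly as he gives it (I am not re-deriving the combinations, only re-running his terms):

```python
from math import comb

combos = comb(43, 2)   # 903 possible turn/river pairs from 43 unseen cards

# Klein wins: exactly one more spade without filling Croak up,
# or two 6s, or a 10 and a 6 (the terms exactly as given in the text).
win = (8 * 35 - 4 - 4) + comb(3, 2) + 2 * 3        # 281
print(win, combos, round(win / combos, 4))          # 281 903 0.3112

# Split pot: KQ, Q8, or 78 with no spade, 3 x 3 combinations each.
split = 3 * (3 * 3)                                 # 27
print(round(split / combos, 4))                     # 0.0299
```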

P.S. The image above, from Barney Townshend, shows some playing cards that Schoenberg designed a few years earlier.

The post Luck vs. skill in poker appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Updike and O’Hara appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>When Updike received the National Book Foundation Medal for Distinguished Contribution to American Letters, in 1998, two of [his second wife's] children were present, but his were not invited.

Menand’s article seemed insightful to me but I was surprised to not see the name “John O’Hara” once. Updike seems so clearly to be a follower of O’Hara, both in form (lots of New Yorker short stories and bestselling novels) and also in content (they wrote a lot about sex and a lot about social class). Here’s Menand:

Updike wanted to do with the world of mid-century middle-class American Wasps what Proust had done with Belle Époque Paris and Joyce had done with a single day in 1904 Dublin—and, for that matter, Jane Austen had done with the landed gentry in the Home Counties at the time of the Napoleonic Wars and James had done with idle Americans living abroad at the turn of the nineteenth century. He wanted to biopsy a minute sample of the social tissue and reproduce the results in the form of a permanent verbal artifact.

That sounds a lot like O’Hara, no? Also this:

Updike believed that people in that world sought happiness, and that, contrary to the representations of novelists like Cheever and Kerouac, they often found it. But he thought that the happiness was always edged with dread, because acquiring it often meant ignoring, hurting, and damaging other people.

And this:

Updike’s identification with Berks County and its un-cosmopolitan ways . . . was crucial to a deeply defended and fundamentally spurious conception of himself as an ordinary middle-American guy. He wanted to rescue serious fiction from what he saw as a doctrinaire rejection of middle-class life . . .

Sure, there were differences between the two authors, most notably that Updike was famous for having excelled at Harvard, whereas O’Hara was famous for resenting that he’d not gone to Yale. Also, O’Hara wrote lots of things in the old-fashioned story-with-a-twist style, whereas Updike’s plots were more straightforward, one might say more modernist in avoiding neat plotting. Overall, though, lots of similarities.

I’m not saying that Updike is a clone of O’Hara but I was surprised that Menand didn’t mention him at all.

P.S. In searching on the web, I came across this article by Lorin Stein that quotes Fran Lebowitz as describing O’Hara as “underrated.” Which is funny to me because Fran Lebowitz is perhaps the most overrated writer I’ve ever heard of.

P.P.S. More interesting than all the above is this 1973 essay, “O’Hara, Cheever & Updike,” by Alfred Kazin.

The post Updike and O’Hara appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>