The post “An exact fishy test” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Here’s an example: I came up with 10 random numbers:

> round(.5+runif(10)*100)
 [1] 56 23 70 83 29 74 23 91 25 89

and entered them into Macartan’s app, which promptly responded:

Unbelievable!

You chose the numbers 56 23 70 83 29 74 23 91 25 89

But these are clearly not random numbers. We can tell because random numbers do not contain patterns but the numbers you entered show a fairly obvious pattern.

Take another look at the sequence you put in. You will see that the number of prime numbers in this sequence is: 5. But the ‘expected number’ from a random process is just 2.5. How odd is this pattern? Quite odd in fact. The probability that a truly random process would turn up numbers like this is just p=0.074 (i.e. less than 8%).

Try again (with really random numbers this time)!

ps: you might think that if the p value calculated above is high (for example if it is greater than 15%) that this means that the numbers you chose are not all that odd; but in fact it means that the numbers are really particularly odd since the fishy test produces p values above 15% for less than 2% of all really random numbers. For more on how to fish see here.
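For a rough check of the prime-count numbers in the app’s message, here is a minimal Python sketch. It assumes the ten numbers are independent uniform draws from the integers 1 to 100 and that the tail probability is a plain binomial. The post does not describe the app’s actual calculation, so the result need not reproduce its p=0.074 exactly. The function names are made up for illustration.

```python
from math import comb

def is_prime(n):
    """Trial-division primality check, adequate for small integers."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def prime_count_test(numbers, low=1, high=100):
    """Count primes and compute a binomial upper-tail probability.

    Assumption (not from the post): each number is an independent
    uniform draw from the integers low..high.
    """
    p_prime = sum(is_prime(k) for k in range(low, high + 1)) / (high - low + 1)
    n = len(numbers)
    observed = sum(is_prime(x) for x in numbers)
    expected = n * p_prime
    # P(X >= observed) for X ~ Binomial(n, p_prime)
    tail = sum(comb(n, k) * p_prime ** k * (1 - p_prime) ** (n - k)
               for k in range(observed, n + 1))
    return observed, expected, tail

observed, expected, tail = prime_count_test([56, 23, 70, 83, 29, 74, 23, 91, 25, 89])
print(observed, expected, round(tail, 3))
```

Under these assumptions the sketch finds 5 primes against an expected 2.5, with a tail probability of about 0.078. That is close to the app’s figure but not identical, so the app presumably computes something slightly different.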

The post MA206 Program Director’s Memorandum appeared first on Statistical Modeling, Causal Inference, and Social Science.

A couple years ago I gave a talk at West Point. It was fun. The students are all undergraduates, and most of the instructors were just doing the job for two years or so between other assignments. The permanent faculty were focused on teaching and organizing the curriculum.

As part of my visit I sat in on an intro statistics class and did a demo for them (probably it was the candy weighing but I don’t remember). At that time I picked up an information sheet for the course: “Memorandum for Academic Year (AY) 13-02 MA206 Students, United States Military Academy.” Lots of details (as one would expect in that military-bureaucratic way), also this list of specific objectives of the course:

1. Understanding the notion of randomness and the role of variability and sampling in making inference.

2. Apply the axioms and basic properties of probability and conditional probability to quantify the likelihood of events.

3. Employ models using discrete or continuous random variables to answer basic probability questions.

4. Be able to draw appropriate conclusions from confidence intervals.

5. Construct hypothesis tests and draw appropriate conclusions from p-values.

6. Apply and assess linear regression models for point estimation and association between explanatory and dependent variables.

7. Critically evaluate statistical arguments in print media and scientific journals.

This is all ok except for items 4 and 5, I suppose.

Also, at the end, a list of rules, beginning with:

a. All cadets are expected to maintain proper military bearing and appearance during instruction in accordance with appropriate regulations.

b. Respect others in the classroom – No profanity, unprofessional jokes, or unprofessional computer items . . .

e. Jackets are not permitted in the classroom . . .

g. Drinks must be inside a closed container (plastic bottle with a top, for example) or in the Dean-approved mug . . .

and ending with this:

j. Rules common to blackboards, written work, and examinations:

1) Draw and label figures or graphs when appropriate.

2) Report numerical answers using the appropriate number of significant digits and units of measure.

Now those are some rules I can get behind. They should be part of every statistics honor code.

The post Free Stan T-shirt to the first “little twerp” who does a (good) Bayesian analysis of Jon Lee Anderson’s height appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’d like to see a Stan implementation of the analysis presented in this comment by Gary from a year and a half ago.

The post “Derek Jeter was OK” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Tom Scocca files a bizarrely sane column summarizing the famous shortstop’s accomplishments:

Derek Jeter was an OK ballplayer. He was pretty good at playing baseball, overall, and he did it for a pretty long time. . . . You have to be good at baseball to last 20 seasons in the major leagues. . . . He was a successful batter in productive lineups for many years. . . . He was not Ted Williams or Rickey Henderson. Spectators did not come away from seeing Derek Jeter marveling at the stupendous, unimaginable feats of hitting they had seen. But he did lots and lots of damage. He got many big hits and contributed to many big rallies. Pitchers would have preferred not to have to pitch to him. . . . His considerable athletic abilities allowed him to sometimes make spectacular leaping and twisting plays on misjudged balls that better shortstops would have played routinely. People enjoyed watching him make those plays, and that enjoyment led to his winning five Gold Gloves. That misplaced acclaim, in turn, helped spur more advanced analysis of defensive play in baseball, a body of knowledge which will ensure that no one ever again will be able to play shortstop as badly as Jeter for as long as he did. And that gave fans something to argue about, which is an important part of sports.

Scocca keeps going in this vein:

Regardless, on balance, Jeter’s good hitting helped his team more than his bad fielding hurt it. The statistical ledger says so—by Wins Above Replacement, according to Baseball Reference, his glovework drops him from being the 20th most productive position player of all time to the 58th. Having the 58th most productive career among non-pitchers in major-league history is still a solid achievement.

And still more:

When [Alex] Rodriguez showed up in the Bronx, Jeter would not yield the job. It was a selfish decision and the situation hurt the team. But powerful egos, misplaced competitiveness, and unrealistic self-appraisals are common features in elite athletes. Whatever wrong Jeter may have done in the intrasquad rivalry, it was the Yankees’ fault for not managing him better.

And this:

Like most star athletes of his era, he kept his public persona intentionally blank and dull . . . Depending on their allegiances, baseball fans could imagine him to be classy or imagine him to be pissy, and the limited evidence could support either conclusion.

I love this Scocca post because its hilariousness (which is intentional, I believe) is entirely contingent on its context. Sportswriting is so full of hype (either of the “Jeter is a hero” variety or the “Jeter’s no big whoop” variety or the “Hey, look at my cool sabermetrics” variety or the “Hey, look at what a humanist I am” variety) that it just comes off (to me) as flat-out funny to see a column that just plays it completely straight, a series of declarative sentences that tell it like it is.

Of course, if all the sportswriters wrote like this, it would be boring. But as long as all the others feel they need some sort of angle, this pitch-it-down-the-middle style will work just fine. The confounding of expectations and all that.

P.S. Also this from a commenter to Scocca’s post:

He also inspired people to like baseball again after the lockout and didn’t juice.

The post Waic for time series appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’m currently working on a model comparison paper using WAIC, and would like to ask you the following question about the WAIC computation:

I have data of one participant that consist of 100 sequential choices (you can think of these data as being a time series). I want to compute the WAIC for these data. Now I’m wondering how I should compute the predictive density. I think there are two possibilities:

(1) I compute the predictive density of the whole sequence (i.e., I consider the whole sequence as one data point, which means that n=1 in Equations (11) – (12) of your 2013 Stat Comput paper.)

(2) I compute the predictive density for each choice (i.e., I consider each choice as one data point, which means that n=# choices in Equations (11) – (12) of your 2013 Stat Comput paper.)

My quick thought was that Waic is an approximation to leave-one-out cross-validation and this computation gets more complicated with correlated data.

But I passed the question on to Aki, the real expert on this stuff. Aki wrote:

This is an interesting question and there is no simple answer.

First we should consider what is your predictive goal:

(1) predict whole sequence for another participant

(2) predict a single choice given all other choices

or

(3) predict the next choice given the choices in the sequence so far?

If your predictive goal is

(1) then you should note that WAIC is based on an asymptotic argument and it is not generally accurate with n=1. Watanabe has said (personal communication) that he thinks that this is not a sensible scenario for WAIC, but if (1) is really your prediction goal, then I think that this might be the best you can do. It seems that when n is small, WAIC will usually underestimate the effective complexity of the model, and thus would give over-optimistic performance estimates for more complex models.

(2) WAIC should work just fine here (unless your model says that there is no dependency between the choices, i.e., having 100 separate models with each having n=1). Correlated data here means just that it is easier to predict a choice if you know the previous choices and the following choices. This may make the difference between some models small compared to scenario (1).

(3) WAIC can’t handle this, and you would need to use a specific form of cross-validation (I think I should write a paper on this).
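The bookkeeping difference between the questioner’s options (1) and (2) can be sketched in Python. This is a toy setup, not the questioner’s model: it assumes an i.i.d. Bernoulli choice model with a Beta(1,1) prior, simulated choices, and the variance-based effective-parameter term. The names `waic` and `log_mean_exp` are illustrative, and nothing here addresses predictive goal (3).

```python
import math
import random
from statistics import variance

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def waic(log_lik):
    """WAIC from log_lik[s][i]: log density of data point i under posterior draw s."""
    n_points = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_points):
        column = [row[i] for row in log_lik]
        lppd += log_mean_exp(column)   # log pointwise predictive density
        p_waic += variance(column)     # posterior variance of the log density
    return {"lppd": lppd, "p_waic": p_waic, "waic": -2.0 * (lppd - p_waic)}

# Simulated placeholder data: 100 binary choices (hypothetical, for illustration only)
random.seed(1)
choices = [1 if random.random() < 0.7 else 0 for _ in range(100)]
k = sum(choices)

# Posterior draws of the choice probability under a Beta(1, 1) prior
draws = [random.betavariate(1 + k, 1 + len(choices) - k) for _ in range(2000)]

# Option (2): one column per choice, so n = number of choices
pointwise = [[math.log(t) if y == 1 else math.log(1.0 - t) for y in choices]
             for t in draws]
# Option (1): the whole sequence is a single data point, so n = 1
whole_sequence = [[sum(row)] for row in pointwise]

per_choice = waic(pointwise)
as_one_point = waic(whole_sequence)
print("per choice:  ", per_choice)
print("as one point:", as_one_point)
```

In this toy setup the per-choice version should report an effective number of parameters near 1, while the whole-sequence version tends to come out smaller, which is in line with Aki’s remark that WAIC with small n tends to underestimate model complexity.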

The post Study published in 2011, followed by successful replication in 2003 [sic] appeared first on Statistical Modeling, Causal Inference, and Social Science.

This one is like shooting fish in a barrel but sometimes the job just has to be done. . . .

The paper is by Daryl Bem, Patrizio Tressoldi, Thomas Rabeyron, and Michael Duggan, it’s called “Feeling the Future: A Meta-Analysis of 90 Experiments on the Anomalous Anticipation of Random Future Events,” and it begins like this:

In 2011, the Journal of Personality and Social Psychology published a report of nine experiments purporting to demonstrate that an individual’s cognitive and affective responses can be influenced by randomly selected stimulus events that do not occur until after his or her responses have already been made and recorded (Bem, 2011). To encourage exact replications of the experiments, all materials needed to conduct them were made available on request. We can now report a meta-analysis of 90 experiments from 33 laboratories in 14 different countries which yielded an overall positive effect in excess of 6 sigma . . . A Bayesian analysis yielded a Bayes Factor of 7.4 × 10^-9 . . . An analysis of p values across experiments implies that the results were not a product of “p-hacking” . . .

Actually, no.

There is a lot of selection going on here. For example, they report that 57% (or, as they quaintly put it, “56.6%”) of the experiments had been published in peer reviewed journals or conference proceedings. Think of all the unsuccessful, unpublished replications that didn’t get caught in the net. But of course almost any result that happened to be statistically significant would be published, hence a big bias. Second, they go back and forth, sometimes considering all replications, other times ruling some out as not following protocol. At one point they criticize internet experiments which is fine, but again it’s more selection because if the results from the internet experiments had looked good, I don’t think we’d be seeing that criticism. Similarly, we get statements like, “If we exclude the 3 experiments that were not designed to be replications of Bem’s original protocol . . .”. This would be a lot more convincing if they’d defined their protocols clearly ahead of time.
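To see how selection alone can manufacture a large combined effect, here is a small Python simulation. Everything in it is hypothetical: the true effect is set to exactly zero, the reporting rule (all studies with z above a cutoff, plus a random fraction of the rest) is invented for illustration, and the combination uses Stouffer’s method rather than whatever the meta-analysis actually did. It is not a model of the Bem et al. data.

```python
import random
from math import sqrt

def stouffer_z(z_values):
    """Combine independent z-scores: sum(z) / sqrt(k)."""
    return sum(z_values) / sqrt(len(z_values))

def simulate_selection(n_studies=2000, z_cutoff=1.645, keep_rate=0.1, seed=0):
    """Simulate studies with zero true effect, then filter them.

    Hypothetical selection rule: every study with z above z_cutoff is
    reported; the rest are reported with probability keep_rate.
    """
    rng = random.Random(seed)
    all_z = [rng.gauss(0.0, 1.0) for _ in range(n_studies)]
    reported = [z for z in all_z if z > z_cutoff or rng.random() < keep_rate]
    return all_z, reported

all_z, reported = simulate_selection()
z_all = stouffer_z(all_z)
z_reported = stouffer_z(reported)
print(f"all {len(all_z)} studies: combined z = {z_all:.2f}")
print(f"{len(reported)} reported studies: combined z = {z_reported:.2f}")
```

With no true effect, the combined z over all simulated studies stays near zero, while the combined z over the reported subset can run well past the usual thresholds. How far depends entirely on the assumed selection rule.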

I question the authors’ claims that various replications are “exact.” Bem’s paper was published in 2011, so how can it be that experiments performed as early as 2003 are exact replications? That makes no sense. Just to get an idea of what was going on, I tried to find one of the earlier studies that was stated to be an exact replication. I looked up the paper by Savva et al. (2005), “Further testing of the precognitive habituation effect using spider stimuli.” I could not find this one but I found a related one, also on spider stimuli. In what sense is this an “exact replication” of Bem? I looked at the Bem (2011) paper, searched on “spider,” and all I could find is a reference to Savva et al.’s 2004 work.

This baffled me so I went to the paper linked above and searched on “exact replication” to see how they defined the term. Here’s what I found:

“To qualify as an exact replication, the experiment had to use Bem’s software without any procedural modifications other than translating on-screen instructions and stimulus words into a language other than English if needed.”

I’m sorry, but, no. Using the same software is not enough to qualify as an “exact replication.”

This issue is central to the paper at hand. For example, there is a discussion on page 18 on “the importance of exact replications”: “When a replication succeeds, it logically implies that every step in the replication ‘worked’ . . .”

Beyond this, the individual experiments have multiple comparisons issues, just as did the Bem (2011) paper. We see very few actual preregistrations, and my impression is that when something counts as a successful replication there is still a lot of wiggle room regarding data inclusion rules, which interactions to study, etc.

**Who cares?**

The ESP context makes this all look like a big joke, but the general problem of researchers creating findings out of nothing, that seems to be a big issue in social psychology and other research areas involving noisy measurements. So I think it’s worth holding a firm line on this sort of thing. I have a feeling that the authors of this paper think that if you have a p-value or Bayes factor of 10^-9 then your evidence is pretty definitive, even if some nitpickers can argue on the edges about this or that. But it doesn’t work that way. The garden of forking paths is multiplicative, and with enough options it’s not so hard to multiply up to factors of 10^-9 or whatever. And it’s not like you have to be trying to cheat; you just keep making reasonable choices given the data you see, and you can get there, no problem. Selecting ten-year-old papers and calling them “exact replications” is one way to do it.

**P.S.** I found the delightful image above by googling *bullwinkle crystal ball* but I can’t seem to track down who to give the credit to. Jay Ward, Alex Anderson, and Bill Scott, I suppose. It doesn’t seem to matter so much who actually got the screenshot and posted it on the web.

The post Why I’m still not persuaded by the claim that subliminal smiley-faces can have big effects on political attitudes appeared first on Statistical Modeling, Causal Inference, and Social Science.

What were these powerful “irrelevant stimuli” that were outweighing the impact of subjects’ prior policy views? Before seeing each policy statement, each subject was subliminally exposed (for 39 milliseconds — well below the threshold of conscious awareness) to one of three images: a smiling cartoon face, a frowning cartoon face, or a neutral cartoon face. . . . the subliminal cartoon faces substantially altered their assessments of the policy statements . . .

I followed up with a post expressing some skepticism:

It’s clear that when the students [the participants in the experiment] were exposed to positive priming, they expressed more positive thoughts . . . But I don’t see how they make the leap to their next statement, that these cartoon faces “significantly and consistently altered [students'] thoughts and considerations on a political issue.” I don’t see a change in the number of positive and negative expressions as equivalent to a change in political attitudes or considerations.

I wrote:

Unfortunately they don’t give the data or any clear summary of the data from experiment No. 2, so I can’t evaluate it. I respect Larry Bartels, and I see that he characterized the results as the “subliminal cartoon faces substantially altered their assessments of the policy statements — and the resulting negative and positive thoughts produced substantial changes in policy attitudes.” But based on the evidence given in the paper, I can’t evaluate this claim. I’m not saying it’s wrong. I’m just saying that I can’t express judgment on it, given the information provided.

Larry then followed up with a post saying that further information was in chapter 3 of Erisen’s Ph.D. dissertation, available online here.

And Erisen sent along a note which I said I would post. Erisen’s note is here:

As a close follower of the Monkey Cage, it is a pleasure to see some interest in affect, unconscious stimuli, perceived (or registered) but unappreciated influences. Accordingly I thought it is now the right time for me to contribute to the discussion.

First, I would like to begin with clarifying conceptual issues with respect to affective priming. Affective priming is not subliminal advertising, nor is it a subliminal message. Subliminal ads (or messages) were used back in the 1970s with questionable methods and current priming studies rarely refer to these approaches.

Second, it is quite normal to be skeptical because no earlier research has attempted to address these kinds of issues in political science. When they first hear about affective influences, people may naturally consider the consequences for measuring political attitudes and political preferences. These conclusions may be especially meaningful for democratic theory, as mentioned by Larry Bartels in an earlier post.

But, fear not, this is not a stand-alone research study. Rather, it is part of an overall research program (Lodge and Taber, 2013) and there are various studies on unconscious stimuli and contextual effects. We name these “perceived but unappreciated effects” in our paper. In addition, we cite some other work on contextual cues (Berger et al., 2008), facial attractiveness (Todorov and Uleman, 2004), the “RATS” ad (Weinberger and Westen, 2008), the Willie Horton ad (Mendelberg, 2001), upbeat music or threatening images in political ads (Brader, 2006), which all provide examples of priming. There is a great deal of research in social psychology that offers other relevant examples of the social or political effects of affective primes.

Third, with respect to the outcomes, I would like to refer the reader to our path analyses (provided in the paper and in The Rationalizing Voter) that show the effects of affect-triggered thoughts on policy preferences (see below). What can be inferred from these results? We can say that controlling for prior attitudes affective primes not only directly affected policy preferences but also indirectly affected preferences through affect-evoked thoughts. The effects on political attitudes and preferences are significant as we discuss in greater detail in the paper.

Fourth, these results were consistent across six experiments that I conducted for my dissertation. Priming procedure was about the same in all those studies and patterns across different dependent variables were quite similar.

Finally, we do not argue that voters cannot make decisions based on “enlightened preferences.” As we repeatedly state in the paper, affective cues color attitudes and preferences but this does not mean that voters’ decisions are necessarily irrational.

Both Bartels and Erisen posted path diagrams in support of their argument, so perhaps I should clarify that I’ve never understood these path diagrams. If an intervention has an effect on political attitudes, I’d like to see a comparison of the political attitudes with and without the intervention. No amount of path diagrams will convince me until I see the direct comparison. You could argue with some justification that my ignorance in this area is unfortunate, but you should also realize that there are a lot of people like me who don’t understand those graphs—and I suspect that many of those people who *do* like and understand path diagrams would also like to see the direct comparisons too. So, purely from the perspective of communication, I think it makes sense to connect the dots and not just show a big model without the intermediate steps. Otherwise you’re essentially asking the reader to take your claims on faith.
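The kind of direct comparison I have in mind can be sketched in a few lines of Python: average attitude with the intervention, average without, and a standard error for the difference. The data below are simulated placeholders on an assumed 1 to 7 scale, not numbers from the study, and the function name is made up for illustration.

```python
import random
from math import sqrt
from statistics import mean, variance

def difference_in_means(treated, control):
    """Return the difference in means and its standard error (unequal variances)."""
    diff = mean(treated) - mean(control)
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return diff, se

# Simulated placeholder data, NOT from the study: attitude scores on a 1-7 scale
rng = random.Random(42)
control = [min(7.0, max(1.0, rng.gauss(4.0, 1.5))) for _ in range(100)]
treated = [min(7.0, max(1.0, rng.gauss(4.3, 1.5))) for _ in range(100)]

diff, se = difference_in_means(treated, control)
print(f"difference = {diff:.2f}, standard error = {se:.2f}")
print(f"rough 95% interval: [{diff - 2 * se:.2f}, {diff + 2 * se:.2f}]")
```

A table or plot of exactly these quantities, for each outcome and each priming condition, is the intermediate step that would let a reader judge the size of the effect before looking at any path model.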

Again, I’m not saying that Erisen is wrong in his claims, just that the evidence he’s shown me is too abstract to convince me. I realize that he knows a lot more about his experiment and his data than I do and I’m pretty sure that he is much more informed on this literature than I am, so I respect that he feels he can draw certain strong conclusions from his data. But, for me, I have to go on what information is available to me.

P.S. In his post, Larry also refers to the study of Andrew Healy, Neil Malhotra, and Cecilia Hyunjung Mo on college football games and election outcomes. That was an interesting study but, as I wrote when it came out a couple years ago, I think its implications were much less than were claimed at the time in media reports. Yes, people can be affected by irrelevant stimuli related to mood, but it matters what are the magnitudes of such effects.

The post “How to disrupt the multi-billion dollar survey research industry” appeared first on Statistical Modeling, Causal Inference, and Social Science.

P.P.S. Slightly relevant to this discussion: Satvinder Dhingra wrote to me:

An AAPOR probability-based survey methodologist is a man who, when he finds non-probability Internet opt-in samples constructed to be representative of the general population work in practice, wonders if they work in theory.

My reply:

Yes, although to be fair they will say that they’re not so sure that these methods work in practice. To which my reply is that I’m not so sure that their probability samples work so well in practice either!

The post Some will spam you with a six-gun and some with a fountain pen appeared first on Statistical Modeling, Causal Inference, and Social Science.

A few weeks ago the following came in the email:

Dear Professor Gelman,

I am writing you because I am a prospective doctoral student with considerable interest in your research. My name is Xian Zhao, but you can call me by my English name Alex, a student from China. My plan is to apply to doctoral programs this coming fall, and I am eager to learn as much as I can about research opportunities in the meantime.

I will be on campus next Monday, September 15th, and although I know it is short notice, I was wondering if you might have 10 minutes when you would be willing to meet with me to briefly talk about your work and any possible opportunities for me to get involved in your research. Any time that would be convenient for you would be fine with me, as meeting with you is my first priority during this campus visit.

Thanks you in advance for your consideration.

Sincerely

Alex

To which I’d responded as follows:

Hi, I’m meeting someone at 10am, you can come by at 9:50 and we can talk, then you can listen in on the meeting if you want.

A

And then I got this:

Dear Professor Gelman,

Thanks for your reply. I really appreciate your arranging to meet with me, but because of a family emergency I have to reschedule my visit. I apologize for any inconvenience this has caused you.

Alex

OK, at this point the rest of you can probably see where this story is going. But I didn’t think again about this until I received the following email yesterday:

Dear Professor Gelman,

A few weeks ago, you received an email from a Chinese student with the title “Prospective doctoral student on campus next Monday,” in which a meeting with you was requested. That email was part of our field experiment about how people react to Chinese students who present themselves by using either a Chinese name or an Anglicized name. This experiment was thoroughly reviewed and approved by the IRB (Institutional Review Board) of Kansas University. The IRB determined that a waiver of informed consent for this experiment was appropriate.

Here we will explain the purpose and expected results of this study. Many foreign students adopt Anglicized names when they come to the U.S., but little research has examined whether name selection affects how these individuals are treated. In this study, we are interested in whether the way a Chinese student presents him/herself could influence the email response rate, response speed, and the request acceptance rate from white American faculty members. The top 30 Universities in each social science and science area ranked by U.S. News & World Report were selected. We visited these department websites, including yours, and from the list of faculty we randomly chose one professor who appeared to be a U.S. citizen and White. You were either randomly assigned into the Alex condition in which the Chinese student in the email introduced him/herself as Alex or into the Xian condition in which the same Chinese student in the email presented him/herself as Xian (a Chinese name). Except for the name presented in the email, all other information was identical across these two conditions.

We predict that participants in the Alex condition will more often comply with the request to meet and respond more quickly than those in the Xian condition. But we also predict that because the prevalence of Chinese students is greater in the natural than social sciences in the U.S., faculty participants in the natural sciences will respond more favorably to Xian than faculty participants in the social sciences.

We apologize for not informing you that were participating in a study. Our institutional IRB deemed informed consent unnecessary in this case because of the minimal risk involved and because an investigation of this sort could not reasonably be done if participants knew, from the start, of our interest. We hope the email caused no more than minimal intrusion into your day, and that you quickly received the cancellation response if you agreed to meet.

Please note that no identifying information is being stored with our data from this study. We did keep a master list with your email address, and a corresponding participant number. But your response (yes or no, along with latency) was recorded in a separate file that does not contain your email address or any other identifying information about you. Please also note that we recognize there are many reasons why you may or may not have responded to the email, including scheduling conflicts, travel, etc. An individual response of “yes” or “no” to the meeting request actually tells us nothing about whether the name used by the bogus student had some influence. But in the aggregate, we can assess whether or not there is bias in favor of Chinese students who anglicize their names. We hope that this study will draw attention to how names can shape people’s reactions to others. Practically, the results may also shed light on the practices and policies of cultural adaptation.

Please know that you have the right to withdraw your response from this study at this time. If you do not want us to use your response in this study, please contact us by using the following contact information.

Thank you for taking the time to participating in this study. If you have questions now or in the future, or would like to learn the results of the study later in the semester [after November 30th], please contact one of the researchers below.

Xian Zhao, M.E.
Department of Psychology
University of Kansas
Lawrence, KS 66045
zhaoxianpsych@ku.edu

Monica Biernat, Ph.D.
Department of Psychology
University of Kansas
Lawrence, KS 66045
biernat@ku.edu

“Thank you for taking the time to participating in this study,” indeed. Thank *you* for not taking the time to proofread your damn email, pal.

I responded as follows to the email from “Xian Zhao, M.E.” and “Monica Biernat, Ph.D.”:

No problem. I know your time is as valuable as mine, so in the future whenever I get a request from a student to meet, I will forward that student to you. I hope you enjoy talking with statistics students, because you’re going to be hearing from a lot of them during the next few years!

Andrew

I guess the next logical step is for me to receive an email such as:

Dear Professor Gelman,

A few weeks ago, you received an email from two scholars with the title, “Names and Attitudes Toward Foreigners: A Field Experiment,” which purportedly described an experiment which was done in which you involuntarily participated without compensation, an experiment in which you were encouraged to alter your schedule on behalf of a nonexisting student, thus decreasing by some small amount the level of trust between American faculty and foreign students, all for the purpose of somebody’s Ph.D. thesis. Really, though, this “experiment” was itself an experiment to gauge your level of irritation at this experience.

Yours, etc.

In all seriousness, this is how the world of research works: A couple of highly-paid professors from Ivy League business schools do a crap-ass study that gets lots of publicity. This filters down, and the next thing you know, some not-so-well-paid researchers in Kansas are doing the same thing. Sure, they might not land their copycat study into a top journal, but surely they can publish it *somewhere*. And then, with any luck, they’ll get some publicity. Hey, they already did!

Good job, Xian Zhao, M.E., and Monica Biernat, Ph.D. You got some publicity. Now could you stop wasting all of our time?

Thanks in advance.

Yours, etc.

P.S. In case you’re wondering, the above picture (from the webpage of Edward Smith, but I don’t know who actually made the image) was the very first link in a google image search on *waste of time*. Thanks, Google—you came through for me again!

P.P.S. No, no, I won’t *really* forward student requests to Zhao and Biernat. Not out of any concern for Z & B—perhaps handling dozens of additional student requests per week would keep them out of trouble—but because it would be a waste of the students’ time.

P.P.P.S. When I encountered the fake study by Katherine Milkman and Modupe Akinola a few years ago, I didn’t think much of it. It was a one-off and I didn’t change my pattern of interactions with students. But now that I’ve received two of these from independent sources, I’m starting to worry. Either there are lots and lots and lots of studies out there, or else the jokers who are doing these studies are each mailing this crap out to thousands and thousands of professors. But, hey, email is free, so why not, right?

P.P.P.P.S. I just got this email:

Dear Dr. Gelman,

Apologize again that this study brought you troubles and wasted your valuable time. Sincerely hope our apology can make you feel better.

Regards

Xian Zhao

I appreciate the apology but they still wasted thousands of people’s time.

The post Some will spam you with a six-gun and some with a fountain pen appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Why I’m still not persuaded by the claim that subliminal smiley-faces can have big effects on political attitudes

**Wed:** Study published in 2011, followed by successful replication in 2003 [sic]

**Thurs:** Waic for time series

**Fri:** MA206 Program Director’s Memorandum

**Sat:** “An exact fishy test”

**Sun:** People used to send me ugly graphs, now I get these things

]]>The post I can’t think of a good title for this one. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I recently read in the MIT Technology Review about some researchers claiming to remove “bias” from the wisdom of crowds by focusing on those more “confident” in their views.

I [Lee] was puzzled by this result/claim because I always thought that people who (1) are more willing to reassess their priors and (2) “hedgehogs” were more accurate forecasters.

I clicked through to the article and noticed this line: “tasks such as to estimate the length of the border between Switzerland and Italy, the correct answer being 734 kilometers.”

Ha! Haven’t they ever read Mandelbrot?

]]>The post Estimating discontinuity in slope of a response function appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The method is a close cousin of regression discontinuity and has gotten a lot of traction recently among economists, with over 20 papers in the past few years, though less among statisticians.

We propose a simple placebo test based on constructing RK estimates at placebo policy kinks. Our placebo test substantially changes the findings from two RK papers (one which is revise and resubmit at Econometrica by David Card, David Lee, Zhuan Pei and Andrea Weber and another which is forthcoming in AEJ: Applied by Camille Landais). If applied more broadly — I think it is likely to change the conclusions of other RK papers as well.

Regular readers will know that I have some skepticism about certain regression discontinuity practices, so I’m sympathetic to this line from Ganong and Jager’s abstract:

Statistical significance based on conventional p-values may be spurious.

I have not read this new paper in detail but, just speaking generally, I’d imagine it would be difficult to estimate a change in slope. It seems completely reasonable to me that slopes will be changing all the time—that’s just nonlinearity!—but unless the changes are huge, they’ve gotta be hard to estimate from data, and I’d think the estimates would be supersensitive to whatever else is included in the model.
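To make the worry about nonlinearity concrete, here is a minimal simulation sketch in R. It is not the procedure from the Ganong and Jager paper, just an illustration under assumed settings: the quadratic curve, the noise level, the sample size, and the helper name `kink_estimate` are all made up for the example. The simulated data contain no kink anywhere, yet a piecewise-linear fit reports a clear change in slope at each placebo location.

```r
set.seed(123)

# Assumed setup: a smooth curve with no kink at all, plus noise.
n <- 2000
x <- runif(n, -1, 1)
y <- x^2 + rnorm(n, 0, 0.1)

# Hypothetical helper: fit a continuous piecewise-linear model with a
# knot at k and return the estimated change in slope at that knot.
kink_estimate <- function(k) {
  hinge <- pmax(x - k, 0)
  fit <- lm(y ~ x + hinge)
  unname(coef(fit)["hinge"])
}

# "Placebo" locations where we know there is no true kink.
placebo_kinks <- c(-0.5, 0, 0.5)
estimates <- sapply(placebo_kinks, kink_estimate)
print(round(estimates, 2))
```

Ordinary curvature is enough to produce a nonzero "kink" estimate, so the estimate depends heavily on how the smooth part of the curve is modeled.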

The Ganong and Jager paper looks interesting to me. I hope that someone will follow it up with a more model-based approach focused on estimation and uncertainty rather than hypothesis testing and p-values. Ultimately I think there should be kinks and discontinuities all over the place.

]]>The post What does CNN have in common with Carmen Reinhart, Kenneth Rogoff, and Richard Tol: They all made foolish, embarrassing errors that would never have happened had they been using R Markdown appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Had the CNN team used an integrated statistical analysis and display system such as R Markdown, nobody would’ve needed to type in the numbers by hand, and the above embarrassment never would’ve occurred.

And CNN *should* be embarrassed about this: it’s much worse than a simple typo, as it indicates they don’t have control over their data. Just like those Rasmussen pollsters whose numbers add up to 108%. I sure wouldn’t hire *them* to do a poll for me!
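Here is a minimal sketch of what an integrated workflow buys you. The counts and labels below are hypothetical, not CNN's or Rasmussen's data. The point is only that the percentages and the text that goes on screen come from the same object, with a sanity check run before anything is displayed.

```r
# Hypothetical raw counts; in a real R Markdown document these would
# come straight from the survey data file.
counts <- c("Candidate A" = 420, "Candidate B" = 380, "Undecided" = 200)

# Percentages are computed, never retyped.
pct <- round(100 * counts / sum(counts))

# Guard against totals like 108%: allow only rounding slack.
stopifnot(abs(sum(pct) - 100) <= 1)

# The display text is built from the same object as the calculation.
display <- paste0(names(pct), ": ", pct, "%")
cat(display, sep = "\n")
```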

I was going to follow this up by saying that Carmen Reinhart and Kenneth Rogoff and Richard Tol should learn about R Markdown—but maybe that sort of software would not be so useful to them. Without the possibility of transposing or losing entire columns of numbers, they might have a lot more difficulty finding attention-grabbing claims to publish.

Ummm . . . I better clarify this. I’m *not* saying that Reinhart, Rogoff, and Tol did their data errors on purpose. What I’m saying is that their cut-and-paste style of data processing enabled them to make errors which resulted in dramatic claims which were published in leading journals of economics. Had they done smooth analyses of the R Markdown variety (actually, I don’t know if R Markdown was available back in 2009 or whenever they all did their work, but you get my drift), it wouldn’t have been so easy for them to get such strong results, and maybe they would’ve been a bit less certain about their claims, which in turn would’ve been a bit less publishable.

To put it another way, sloppy data handling gives researchers yet another “degree of freedom” (to use Uri Simonsohn’s term) and biases claims to be more dramatic. Think about it. There are three options:

1. If you make no data errors, fine.

2. If you make an inadvertent data error that *works against* your favored hypothesis, you look at the data more carefully and you find the error, going back to the correct dataset.

3. But if you make an inadvertent data error that *supports* your favored hypothesis (as happened to Reinhart, Rogoff, and Tol), you have no particular motivation to check, and you just go for it.

Put these together and you get a systematic bias in favor of your hypothesis.
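The three options above can be sketched as a small simulation in R. Everything numeric here is an assumption chosen for illustration: a true effect of zero, a 20% error rate, and errors that are equally likely to push the estimate up or down. Errors that hurt the favored hypothesis get caught; errors that help are left in.

```r
set.seed(1)

n_studies <- 10000
true_effect <- 0                      # assumed: no real effect

# Estimate each study would report with clean data handling.
clean <- rnorm(n_studies, true_effect, 1)

# Assumed: 20% of studies suffer a data error that shifts the estimate
# up or down with equal probability.
has_error <- runif(n_studies) < 0.2
shift <- rnorm(n_studies, 0, 1)

# Option 2: errors that hurt the hypothesis (shift < 0) get found and fixed.
# Option 3: errors that help (shift > 0) are never checked.
uncaught <- has_error & shift > 0
reported <- ifelse(uncaught, clean + shift, clean)

round(c(clean = mean(clean), reported = mean(reported)), 3)
```

Even though the errors themselves are symmetric, the reported estimates average higher than the clean ones, because only one direction of error survives.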

Science is degraded by looseness in data handling, just as it is degraded by looseness in thinking. This is one reason that I agree with Dean Baker that the Excel spreadsheet error was worth talking about and was indeed part of the bigger picture.

Reproducible research is higher-quality research.

**P.S.** Some commenters write that, even with Markdown or some sort of integrated data-analysis and presentation program, data errors can still arise. Sure. I’ll agree with that. But I think the three errors discussed above are all examples of cases where an interruption in the data flow caused the problem, with the clearest example being the CNN poll, where, I can only assume, the numbers were calculated using one computer program, then someone read the numbers off a screen or a sheet of paper and typed them into another computer program to create the display. This would not have happened using an integrated environment.

]]>The post Shamer shaming appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I can’t recall when I first saw “shaming” used in its currently popular sense. I remember noting “slut shaming” and “fat shaming” but did they first become popular two years ago? Three? At any rate, “shaming” is now everywhere…and evidently it’s a very bad thing.

When I first saw the term, I agreed with the message it was trying to convey: it is bad to try to make people feel ashamed of being fat, or of wanting to have sex. Indeed, I’d say it’s bad to try to make people feel ashamed of anything that isn’t unethical or morally wrong or at least irritating. Down with slut shaming! Down with fat shaming! Down with gay shaming!

But somehow all criticism seems to have become “shaming.” A few days ago I posted a message to my neighborhood listserv, reminding people that (1) we are in a severe drought (I live in California), (2) washing one’s car with a hose uses a lot of water, and indeed is a fineable offense if you don’t use a nozzle that shuts off the water when you release it, (3) all commercial car washes in our area recycle their water, and (4) our storm drains empty directly into a creek. The next day I got an angry email from a neighbor: how dare I shame him for washing his car on the street?

On this blog, Andrew has frequently posted about researchers doing shameful things, such as plagiarizing, and refusing to admit to major mistakes in their published work. (There’s nothing shameful about making a mistake, at least not if you’ve tried hard to get it right, but it is shameful to refuse to admit it). And, sure enough, some people have complained that Andrew is “shaming” these people.

Plagiarist-shaming, academic fraud-shaming, hack journalist-shaming, all of those are evidently in the same unacceptable category as fat-shaming and slut-shaming. There is nothing shameful in the world, except trying to make somebody feel ashamed. Shamer-shaming is the only kind of shaming that is OK.

This post is by Phil Price

]]>The post Palko’s on a roll appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>At least we can all agree that ad hominem and overly general attacks are bad: A savvy critique of the way in which opposition of any sort can be dismissed as “ad hominem” attacks. As a frequent critic (and subject of criticism), I agree with Palko that this sort of dismissal is a bad thing.

Wondering where the numbers come from — Rotten Tomatoes: These numbers really are rotten. Palko writes:

This figure indicates a “Good” rating. How does that translate to “Rotten”? . . . this is pretty clearly a glitch and it’s a glitch in the easy part of review aggregation . . . This brings up one of my [Palko’s] problems with data-driven journalism. Reporters and bloggers are constantly barraging us with graphs and analyses and, of course, narratives looking at things like Rotten Tomatoes rankings. All too often, though, their process starts with the data as given. They spend remarkably little time asking where the data came from or whether it’s worth bothering with.

I’ll just throw in the positive message that criticism can improve the numbers. After seeing this post, maybe the people at the website in question will be motivated to clean their data a bit.

The education reform movement has never lent itself to the standard left/right axis. Not only was its support bipartisan; it was often the supporters on the left who were quickest to embrace privatization, deregulation and market-based solutions. Zephyr Teachout may be a sign that anomaly is ending.

I’d also be interested in seeing poll data on this (if it’s possible to get good data, given the low salience of this issue for many voters). My guess is that, even if many leaders on the left were supportive of privatization, etc., these were not so popular among rank-and-file, lower-income left-leaning voters.

In any case, I’m fascinated by this topic for several reasons, including its inherent importance and the compelling stories of various education-reform scams and scandals (well relayed by Palko over the past few years). Also, from a political-science perspective, I’ve always been interested in issues that *don’t* line up with the usual partisan divide.

Driverless Cars and Uneasy Riders: Dramatic claims are being made about the potential fuel and economic-efficiency gains from the use of driverless cars. Palko (and I) are skeptical.

Another story that needs to be on our radar — ECOT: Yet another education reform scam that should be a scandal. Eternal vigilance etc.

I know I go on about ignoring Canada’s education system: Palko links to, and criticizes, a report that’s so bad in so many dimensions that it probably deserves its own post here or at the Monkey Cage.

Selection effects on steroids: I’m not such a fan of the expression “on steroids” (to me it’s a bit of a journalism cliche that should’ve died along with the 80s), but the statistical, and policy, point is important. Selection bias is one of these things that we statisticians have known about and talked about forever, but even so we probably don’t talk about it enough. As a researcher and as a teacher, I feel the challenge is to go beyond criticism and move to adjustment. But criticism is often a necessary first step.

Support your local journalists: Yup.

I know I pick on Netflix a lot: “the way flacks and hacks force real stories into standard narratives”

The essential distinction between charter schools and charter school chains:

The charter school sector is highly diverse. It ranges from literal mom and pop operations to nation-wide corporations. The best of these schools get good results, genuinely care about their students and can fill an important educational niche. The worst aggressively cook data to conceal mediocre results and gouge the taxpayers.

If current trends hold, I think charter schools will be nearly as diverse and I’m not optimistic about who the winners will be.

As Steven Levitt would say, incentives matter.

]]>The post What do you do to visualize uncertainty? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>What do you do to visualize uncertainty?

Do you only use static methods (e.g. error bounds)?

Or do you also make use of dynamic means (e.g. have the display vary over time proportional to the error, so you don’t know exactly where the top of the bar is, since it moves while you’re watching)?

Have you any thoughts on this topic?

I assume that since a Bayesian generates a posterior dist’n the output should not be point but rather a dist’n; and you being the most prolific Bayesian I know that you’ve got three or four old papers that you’ve written on it.

OK, sure, when you put it that way, my collaborators and I do have a few papers on the topic:

Visualization in Bayesian data analysis

Visualizing distributions of covariance matrices

Multiple imputation for model checking: completed-data plots with missing and latent data

A Bayesian formulation of exploratory data analysis and goodness-of-fit testing

All maps of parameter estimates are misleading

But I don’t really have much else to say right now. Dynamic graphics seem like a good idea but I’ve never programmed them myself. In many settings it will work to display point estimates, but sometimes this can create big problems (as discussed in some of the above-linked papers) because Bayesian point estimates will tend to be too smooth—less variable—compared to the variation in the underlying parameters being modeled.
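The “too smooth” point can be seen in a small simulation. This is a sketch under an assumed normal-normal model with known variances; the values of `J`, `tau`, and `sigma` are arbitrary choices for illustration. The posterior means vary less than the parameters they estimate, while single posterior draws have about the right spread.

```r
set.seed(42)

J <- 5000      # number of parameters (assumed)
tau <- 1       # sd of the true parameters (assumed)
sigma <- 1     # sd of the measurement error (assumed)

theta <- rnorm(J, 0, tau)        # true parameters
y <- rnorm(J, theta, sigma)      # one noisy observation of each

# Normal-normal posterior with prior mean 0 and known variances.
shrink <- tau^2 / (tau^2 + sigma^2)
post_mean <- shrink * y
post_sd <- sqrt(shrink) * sigma

# One random draw from each posterior distribution.
post_draw <- rnorm(J, post_mean, post_sd)

round(c(truth = sd(theta),
        point_estimates = sd(post_mean),
        posterior_draws = sd(post_draw)), 2)
```

A map or bar chart of `post_mean` would understate the variation in `theta`; a display based on posterior draws would not, though each individual draw is noisier.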

So I’m kicking this one out to the commenters to see if they can offer some useful suggestions.

]]>The post They know my email but they don’t know me appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Good afternoon,

I wanted to see if the data my colleague David sent to you was of any interest. I have attached here additional animated Gifs from PwC’s CEO survey. Let me know if you would be interested in featuring these pieces or in a guest post by PwC.

Best,

**** on behalf of **

Attached were two infographics which you can bet I’m not including here.

P.S. Just to be clear: I don’t think unsolicited emails are so horrible; I myself send emails to strangers all the time. Nor am I offended by the content. I just think it’s funny that there are people out there who think I’m interested in publishing animated chartjunk.

]]>The post More bad news for the buggy-whip manufacturers appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The main factor is technology. It’s a major cause of today’s response-rate problems – but it’s also the solution.

For decades, survey research has revolved around the telephone, and it’s worked very well. But Americans’ relationship with their phones has radically changed. It’s no surprise that survey research will have to as well. . . .

In the future, we are unlikely to live in a country in which information is scant. We are certain to live in one in which information is collected in different ways. The transition is under way, and the federal government is among those institutions that will need to adapt.

Let’s hope that the American Association for Public Opinion Research can adapt too.

]]>The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** They know my email but they don’t know me

**Wed:** What do you do to visualize uncertainty?

**Thurs:** Sokal: “science is not merely a bag of clever tricks . . . Rather, the natural sciences are nothing more or less than one particular application — albeit an unusually successful one — of a more general rationalist worldview”

**Fri:** Question about data mining bias in finance

**Sat:** Estimating discontinuity in slope of a response function

**Sun:** I can’t think of a good title for this one.

The post Six quotes from Kaiser Fung appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>One of the biggest myths of Big Data is that data alone produce complete answers.

Their “data” have done no arguing; it is the humans who are making this claim.

That last one is an appropriate response to the Freshman Fallacy.

]]>The post He just ordered a translation from Diederik Stapel appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>So I am applying for a DC driver’s license and needed a translation of my Spanish license to show to the DMV. I go to http://www.onehourtranslation.com/ and as I prepare to pay I see a familiar face in the bottom banner:

It appears Stapel is one of their “over 15,000 dedicated professional translators” (or maybe they put his picture there unauthorized). Either way now worried I may get a made up/plagiarized translation.

There are worse ways for a multilingual person to make a living . . . .

Perhaps they could get Bruno Frey to do some translations too. He’d only have to do it once, then he could just copy it over and over and over.

]]>The post What is the purpose of a poem? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>As is often the case, I’m on the blog to procrastinate: in this case, my colleagues and I are preparing a new course and there’s tons of important work to be done. I’m getting tired of reading comments on economics and empiricism and so I scooted over to Basbøll’s blog and kicked off a brief comment thread about the academic entertainer Slavoj Zizek. At first I was going to post and continue that discussion here, but I don’t give two poops about Slavoj Zizek, so I followed Basbøll’s blogroll link to “Stupid Motivational Tricks” and right away found something interesting.

The something interesting that I found was a post by Jonathan Mayhew about someone who’s the poet laureate of North Carolina. I had no idea that an individual state would have a poet laureate but it seems like a good idea, a quite reasonable nearly cost-free thing to do, indeed it would be cool to have all sorts of official state art. In reading the post I was mildly irritated by Mayhew’s use of “NC” as a generic replacement for “North Carolina.” The abbreviation is fine in some contexts but I found it a bit jarring to read, “The literary community of NC . . .” On the other hand, it’s just a goddam blog so I don’t know what I’m supposed to be expecting.

But I’m getting completely off the point here. What happened is that Mayhew quoted a couple of mediocre passages from poems by two of North Carolina’s poet laureates (apparently they just had a changing of the guard).

Mayhew’s reactions gave me some thoughts of my own regarding the purpose of poetry. I’ll first copy what he wrote and then give my reflections.

Mayhew quotes from the previous laureate:

“Joan and I were in Raleigh together

for the first time to take the tour

for new vista volunteers

at North Carolina’s Central Prison…”

and then shares his reaction:

Ouch. It’s fine to use seemingly plain language, etc… but no rhythm, nothing going on in the language. This kind of writing just causes physical pain to me.

Then he quotes from the recent laureate:

“I’m grateful for my car, he says,

voice raspy with hard living.

Tossed on the seat, a briefcase

covered with union stickers,

stuffed with unemployment forms,

want ads, old utility bills,

birth certificate, school application

papers for the skinny ten-year-old

sitting beside him who loves baseball…”

This he characterizes as worse than the first poem (“not much worse,” though), but I don’t quite understand where this ranking is coming from, given that he follows up with, “More is going on in her language, actually. It’s not exactly good, but it’s salvageable, with some concreteness there at least.”

I assume that we can all agree, though, that it’s hard to judge either poem, or either poet, by these short excerpts. Both excerpts radiate mediocrity but of course a bit of mediocrity can do the job in the context of a larger message. I’m pretty sure that, for almost any major poet, you could without much difficulty find passages that, if shown to me in isolation, would not sparkle and could indeed look a bit like hackwork. I mean, sure, “voice raspy with hard living” sounds cliched, but who among us does not grab a cliche from time to time? For all we know from this excerpt, the use of the cliche is part of the point in establishing the narrator’s voice.

OK, let me be clear here. I’m not trying to get all contrarian on you and praise these two poets. I have no problem giving Mayhew the benefit of the doubt; I’ll assume he read a bit by each of them and with these excerpts is giving something of a true sense of these poems’ style and content. So I will accept (until convinced otherwise) that these poets are indeed mediocre.

**What is the purpose of a poem?**

And this brings us to today’s topic. The thing that bothers me about Mayhew’s post (even though I have a feeling I’d agree with him 100% about the strengths and weaknesses of these poems, and I wouldn’t be surprised if we share many tastes about and attitudes toward literature) is the implicit attitude that I see there, which I feel I’ve seen in other discussions of poetry, which is that the purpose of a poem is to be wonderful.

Huh? “The purpose of a poem is to be wonderful.” That seems like a reasonable statement, no? Who could disagree with that?

To see my problem with this statement (which, to be fair, Mayhew never said, but which I see as implicit in his post), consider the related question, “What is the purpose of a novel?” Or, for that matter, what is the purpose of a research article? Or what is the purpose of a song?

My point is that I think it’s a bad attitude to think that the purpose of a poem is to be wonderful. It’s insulting to poetry to give it such a narrow range. A poem is a sort of song without music and, as such, can have many different purposes.

OK, procrastination successful. An hour spent, now time for bed.

]]>The post mysterious shiny things appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>

To reproduce it yourself: download ui.R, server.R, and healthexp.Rds

Have a folder called “App-Health-Exp” in your working directory, with ui.R and server.R in the “App-Health-Exp” folder. Have the dataset healthexp.Rds in your working directory.

Then run this code:

    if (!require(devtools)) install.packages("devtools")
    devtools::install_github("jcheng5/googleCharts")
    install.packages("dplyr")
    install.packages("shiny")

    library(shiny)
    library(googleCharts)
    library(dplyr)

    data = readRDS("healthexp.Rds")
    head(data)

    # Problem isn't the data, it seems that Switzerland is in Europe
    # in both 2001 and 2002:
    data[data$Year == 2001 & data$Country == "Switzerland",]
    data[data$Year == 2002 & data$Country == "Switzerland",]

    runApp("App-Health-Exp")

Anyone know what is happening?

]]>The post *Bayesian Cognitive Modeling* Examples Ported to Stan appeared first on Statistical Modeling, Causal Inference, and Social Science.

There’s a new intro to Bayes in town.

- Michael Lee and Eric-Jan Wagenmakers. 2014.
*Bayesian Cognitive Modeling: A Practical Course*. Cambridge University Press.

This book’s a wonderful introduction to applied Bayesian modeling. But don’t take my word for it — you can download and read the first two parts of the book (hundreds of pages including the bibliography) for free from the book’s home page (linked in the citation above). One of my favorite parts of the book is the collection of interesting and instructive example models coded in BUGS and JAGS (also available from the home page). As a computer scientist, I prefer reading code to narrative!

In both spirit and form, the book’s similar to Lunn, Jackson, Best, Thomas, and Spiegelhalter’s *BUGS Book*, which wraps their seminal set of example models up in textbook form. It’s also similar in spirit to Kruschke’s *Doing Bayesian Data Analysis*, especially in its focus on applied cognitive psychology examples.

One of Lee and Wagenmakers’ colleagues, Martin Šmíra, has been porting the example models to Stan and the first batch is already available in the new Stan example model repository (hosted on GitHub):

- GitHub: stan-dev/example-models

Many of the models involve discrete parameters in the BUGS formulation which need to be marginalized out in the Stan models. The Stan 2.5 manual is adding a whole new chapter with some non-trivial marginalizations (change point models, CJS mark-recapture models, and categorical diagnostic accuracy models).
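For readers who have not marginalized a discrete parameter before, here is a sketch of the idea in plain R rather than Stan. It uses a two-component normal mixture as an assumed example; this is not one of the book's models, and the function names and data are made up. Instead of sampling a discrete indicator for each observation, you sum it out of the likelihood on the log scale.

```r
# Numerically stable log(exp(a) + exp(b)), elementwise.
log_sum_exp2 <- function(a, b) {
  m <- pmax(a, b)
  m + log(exp(a - m) + exp(b - m))
}

# Log likelihood with the indicator z_i in {1, 2} summed out:
# p(y_i) = theta * N(y_i | mu1, sigma) + (1 - theta) * N(y_i | mu2, sigma)
mixture_loglik <- function(y, theta, mu1, mu2, sigma) {
  lp1 <- log(theta) + dnorm(y, mu1, sigma, log = TRUE)
  lp2 <- log1p(-theta) + dnorm(y, mu2, sigma, log = TRUE)
  sum(log_sum_exp2(lp1, lp2))
}

y <- c(-1.2, 0.3, 2.8, 3.1)   # made-up data
mixture_loglik(y, theta = 0.4, mu1 = 0, mu2 = 3, sigma = 1)
```

In a Stan program the same sum would typically be written with Stan's built-in `log_sum_exp` function, with only the continuous parameters left to sample.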

Expect the rest soon! And feel free to jump on the Stan users group to discuss the models and how they’ve been coded.

*Warning: The models are embedded as strings in R code. We’re looking for a volunteer to pull the models out of the R code and generate data for them in a standalone file that could be used in PyStan or CmdStan.*

If you’d like to contribute Stan models to our example repo, the README at the bottom of the front page of the GitHub repository linked above contains information on what we’d like to get. We only need open-source distribution rights — authors retain copyright for all their work on Stan. Contact us either via e-mail or via the Stan users group.

The post One-tailed or two-tailed appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>This image of a two-tailed lizard (from here, I can’t find the name of the person who took the picture) never fails to amuse me.

But let us get to the question at hand . . .

Richard Rasiej writes:

I’m currently teaching a summer session course in Elementary Statistics. The text that I was given to use is Triola’s Elementary Statistics, 12th ed.

Let me quote a problem on inference from two proportions:

11. Is Echinacea Effective for Colds? Rhino viruses typically cause common colds. In a test of the effectiveness of echinacea, 40 of the 45 subjects treated with echinacea developed rhinovirus infections. In a placebo group, 88 of the 103 subjects developed rhinovirus infections (based on data from “An Evaluation of Echinacea Angustifolia in Experimental Rhinovirus Infections,” by Turner et. al., New England Journal of Medicine, Vol. 353, No. 4). We want to use a 0.05 significance level to test the claim that echinacea has an effect on rhinovirus infection.

The answer in the back of the teacher’s edition sets up the hypothesis test as H0: p1 = p2, H1: p1 <> (not equal to) p2, gives a test statistic of z = 0.57, uses critical values of +/- 1.96, and gives a P-value of .5686.

I was having a hard time explaining the rationale for the book’s approach to my students. My thinking was that since there is no point in claiming that echinacea has an effect on the common cold unless you think it helps, we should be doing a one-tailed test with H0: p1 = p2, H1: p1 < p2. We would still fail to reject the null hypothesis, but with a P-value of .2843.

Or, is what I am missing that, if you are testing the claim that something has an effect you want to also test the possibility that the effect is the opposite of what you’d normally want (e.g. this herb is bad for you, or inhaling smoke is good for you, etc.)?

Any advice you could give me on how best to parse this problem for my students would be greatly appreciated. I already feel very nervous stating, in effect, “well, that’s not the way I would do it.”

My reply:

The quick answer is that maybe echinacea is bad for you! Really though the example is pretty silly, as one can simply compare 40/45 and 88/103 and look at the sampling variability of the proportions. I don’t see that the hypothesis test and p-value add anything.
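For concreteness, here is that comparison as a short R sketch using the counts from the problem (the variable names are mine). It computes each proportion with its standard error, then the pooled z statistic of the kind the textbook answer reports.

```r
infected <- c(echinacea = 40, placebo = 88)
n <- c(echinacea = 45, placebo = 103)

# Estimated infection rates and their standard errors.
p_hat <- infected / n
se <- sqrt(p_hat * (1 - p_hat) / n)
print(round(rbind(p_hat, se), 3))

# Difference in proportions and its (unpooled) standard error.
difference <- unname(p_hat["echinacea"] - p_hat["placebo"])
se_diff <- sqrt(sum(se^2))

# Pooled two-proportion z statistic and two-sided p-value.
p_pool <- sum(infected) / sum(n)
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z <- difference / se_pool
p_two_sided <- 2 * pnorm(-abs(z))

print(round(c(difference = difference, se_diff = se_diff,
              z = z, p_two_sided = p_two_sided), 3))
```

The estimated difference is about 3.5 percentage points with a standard error of about 6 points, and the z statistic comes out near the book's 0.57. The two proportions and their standard errors already tell the story.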

This doesn’t sound like much, but, amazingly enough, Rasiej replied later that day:

I guess I was led astray by the lead-in to the problem, which seemed to imply that there was a benefit. Obviously it’s better to read the claim carefully and take it literally. So, “test the claim that echinacea has an effect” is two-tailed since ANY effect, beneficial or not, would be significant.

That said, I do agree with you that the example is silly, given the data in the problem.

Thanks again for your insights. They helped in my class today.

Perhaps (maybe I should say “probably”) he was just being polite, but I prefer to think that even a brief reply can convey some useful understanding. Also I think it’s a good general message to take what people say literally. This is not a message that David Brooks likes to hear, I think, but it is, to me, an essential aspect of statistical thinking.

**P.S.** Perhaps I should stress that in my response above I wasn’t saying that confidence intervals are some kind of wonderful automatic replacement for p-values. I was just saying that, in this particular case, it seems to me that you’d want a summary of the information provided by the experiment, and that this summary is best provided by the estimated proportions and their standard errors. To set it up in a p-value context would seem to imply that you’re planning on making a decision about echinacea based on this single experiment, but that wouldn’t make sense at all! No need to jump the gun and go all the way to a decision statement; it seems enough to just summarize the information in the data.

The post One-tailed or two-tailed appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>A couple months ago in a discussion of differences between econometrics and statistics, I alluded to the well-known fact that everyday uncertainty aversion can’t be explained by a declining marginal utility of money.

What really bothers me—it’s been bothering me for *decades* now—is that this is a simple fact that “everybody knows” (indeed, in comments some people asked why I was making such a big deal about this triviality), but, even so, it remains standard practice within economics to use this declining-marginal-utility explanation.

I don’t have any econ textbooks handy but here’s something from the Wikipedia entry for risk aversion:

Risk aversion is the reluctance of a person to accept a bargain with an uncertain payoff rather than another bargain with a more certain, but possibly lower, expected payoff.

OK so far. And now for their example:

A person is given the choice between two scenarios, one with a guaranteed payoff and one without. In the guaranteed scenario, the person receives $50. In the uncertain scenario, a coin is flipped to decide whether the person receives $100 or nothing. The expected payoff for both scenarios is $50, meaning that an individual who was insensitive to risk would not care whether they took the guaranteed payment or the gamble. However, individuals may have different risk attitudes. A person is said to be:

risk-averse (or risk-avoiding) – if he or she would accept a certain payment (certainty equivalent) of less than $50 (for example, $40), rather than taking the gamble and possibly receiving nothing. . . .

They follow up by defining risk aversion in terms of the utility of money:

The expected utility of the above bet (with a 50% chance of receiving 100 and a 50% chance of receiving 0) is,

E(u)=(u(0)+u(100))/2,

and if the person has the utility function with u(0)=0, u(40)=5, and u(100)=10 then the expected utility of the bet equals 5, which is the same as the known utility of the amount 40. Hence the certainty equivalent is 40.

But this is just wrong. It’s not *mathematically* wrong but it’s wrong in any practical sense, in that a utility function that curves this way between 0 and 100 can’t possibly make any real-world sense.
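Here is one way to see the problem numerically. This is a sketch under an assumption that is mine, not Wikipedia’s: I take an exponential utility function, u(x) = 1 - exp(-a*x), choose the curvature a so that $40 is the certainty equivalent of a coin flip between $0 and $100, and then ask what the same curve implies for a much bigger bet. The function names and the bisection bounds are made up for this illustration.

```python
import math

# Wikipedia's toy utility values: the bet's expected utility equals u(40).
u = {0: 0.0, 40: 5.0, 100: 10.0}
expected_utility = 0.5 * u[0] + 0.5 * u[100]

def certainty_equivalent(a, low, high):
    """Certainty equivalent of a 50/50 bet between low and high,
    under the ASSUMED utility u(x) = 1 - exp(-a * x)."""
    mean_exp = 0.5 * math.exp(-a * low) + 0.5 * math.exp(-a * high)
    return -math.log(mean_exp) / a

def solve_risk_aversion(target_ce, low, high, lo=1e-6, hi=1.0, iters=100):
    """Bisection for the curvature a that gives the target certainty equivalent."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if certainty_equivalent(mid, low, high) > target_ce:
            lo = mid  # not yet risk-averse enough: increase curvature
        else:
            hi = mid
    return 0.5 * (lo + hi)

a = solve_risk_aversion(40.0, 0.0, 100.0)
ce_million = certainty_equivalent(a, 0.0, 1_000_000.0)

print(f"expected utility of the $0/$100 bet: {expected_utility}")
print(f"curvature a: {a:.5f}")
print(f"certainty equivalent of a $0/$1,000,000 coin flip: ${ce_million:.2f}")
```

With this particular curve, the same person would take less than $100 for sure over a 50/50 shot at a million dollars. A different functional form would give different numbers, but the curve has to bend very hard somewhere to produce a $40 certainty equivalent on a $100 bet.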

Way down on the page there’s one paragraph saying that this model has “come under criticism from behavioral economics.”

But this completely misses the point!

It would be as if you went to the Wikipedia entry on planetary orbits and saw a long and involved discussion of the Ptolemaic model, with much discussion of the modern theory of epicycles (image above from Wikipedia, taken from the Astronomy article in the first edition of the Encyclopaedia Britannica), and then, way down on the page, a paragraph saying something like,

The notion of a geocentric universe has come under criticism from Copernican astronomy.

Again, this is frustrating because it’s so simple, it’s so obvious that any utility function that curves so much between 0 and 100 *can’t* keep going forward in any reasonable sense.

It’s an example I used to give as a class-participation activity in my undergraduate decision analysis class and which I wrote up a few years later in an article on classroom demonstrations.

I’m not claiming any special originality for this result. As I wrote in my recent post,

The general principle has been well-known forever, I’m sure.

Indeed, unbeknownst to me, Matt Rabin published a paper a couple years later with a more formal treatment of the same topic, and I don’t recall ever talking with him about the problem (nor was it covered in Mr. Cutlip’s economics class in 11th grade), so I assume he figured it out on his own. (It would be hard for me to imagine someone thinking hard about curving utility functions and *not* realizing they can’t explain everyday risk aversion.)

In response, commenter Megan agreed with me on the substance but wrote:

I am sure it has NOT been well-known forever. It’s only been known for 26 years and no one really understands it yet.

I’m pretty sure the Swedish philosopher who proved the mathematical phenomenon 10 years before you and 12 years before Matt Rabin was the first to identify it. The Hansson (1988)/Gelman (1998)/Rabin (2000) paradox is up there with Ellsberg (1961), Samuelson (1963) and Allais (1953).

**Not so obvious after all?**

Megan’s comment got me thinking: maybe this problem with using a nonlinear utility function for money is *not* so inherently obvious. Sure, it was obvious to me in 1992 or so when I was teaching decision analysis, but I was a product of my time. Had I taught the course in 1983, maybe the idea wouldn’t have come to me at all.

Let me retrace my thoughts, as best as I can now recall them. What I’d really like is a copy of my lecture notes from 1992 or 1994 or whenever it was that I first used the example, to see how it came up. But I can’t locate these notes right now. As I recall, I taught the first part of my decision analysis class using standard utility theory, first having students solve basic expected-monetary-value optimization problems and then going through the derivation of the utility function given the utility axioms. Then I talked about violations of the axioms and went on from there.

It was a fun course and I taught it several times, at Berkeley and at Columbia. Actually, the first time I taught the subject it was something of an accident. Berkeley had an undergraduate course on Bayesian statistics that David Blackwell had formerly taught. He had retired so they asked me to teach it. But I wasn’t comfortable teaching Bayesian statistics at the undergraduate level (this was before Stan, and it seemed to me it would take the students all semester just to get up to speed on the math, with no time to do anything interesting), so I decided to teach decision analysis instead, using the same course number. One particular year I remember (I think it was 1994) we had a really fun bunch of undergrad stat majors, and a whole bunch of them were in the course. A truly charming bunch of students.

Anyway, when designing the course I read through a bunch of textbooks on decision analysis, and the nonlinear utility function for money always came up as the first step beyond “expected monetary value.” After that came utility of multidimensional assets (the famous example of the value of a washer and a dryer, compared to two washers or two dryers), but the nonlinear utility for money, used sometimes to *define* risk aversion, came first.

But the authors of many of these books were also aware of the Kahneman, Slovic, and Tversky revolution. There was a ferment, but it still seemed like utility theory was tweakable and that the “heuristics and biases” research merely reflected a difficulty in *measuring* the relevant subjective probabilities and utilities. It was only a few years later that a book came out with the beautifully on-target title, “The Construction of Preference.”

Anyway, here’s the point. Maybe the problem with utility theory in this context was obvious to Hansson, and to me, and to Yitzhak, because we’d been primed by reading the work by Kahneman, Slovic, Tversky, and others exploring the failures of the utility model in practice. In retrospect, that work too should not have been a surprise; after all, utility theory was at that time already a half-century old and it had been developed in the behavioristic tradition of psychology, predating the cognitive revolution of the 1950s.

I can’t really say, but it does seem that sometimes the time is ripe for an idea, and maybe this particular idea only seemed so trivial to me because it was already accepted that utility theory had problems modeling preferences. Once you accept the *empirical* problem, it’s not so hard to imagine there’s a theoretical problem too.

And, make no mistake about it, the problem is both empirical and theoretical. You don’t need any experimental data at all to see the problem here:

Also, let me emphasize that the solution to the problem is *not* to say that people’s preferences are correct and so the utility model is wrong. Rather, in this example *I find utility theory to be useful* in demonstrating why the sort of everyday risk aversion exhibited by typical students (and survey respondents) does not make financial sense. Utility theory is an excellent normative model here.

Which is why it seems particularly silly to be defining these preferences in terms of a nonlinear utility curve that could never be.

It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet.

]]>The post Suspiciously vague graph purporting to show “percentage of slaves or serfs in the world” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Phillip Middleton,

Is technology making you work harder? Or giving you more time off?

Seriously, it feels like it’s enabling me to work around the clock! Heck, I’m writing this email at 37,000 feet on a Virgin America flight from DC to LA at 11 p.m. ET.

So that being said, I want to share the actual DATA with you about Work vs. Leisure. . . .

It’s easy to forget that for centuries — for millenia — the “workforce” was ALL of us.

A few people lived in luxury, but the vast majority were slaves and serfs who did the work. In 1750, 75 percent of people on the planet worked to support the top 25 percent.

Let’s look at the numbers. It’s extraordinary how this has changed over time.

You’ll notice that by 2000, the global percentage of slaves and serfs in the world is down to 10 percent. As artificial intelligence and robotics come online, this number is going to drop down to zero.

Hey, if only artificial intelligence and robotics had existed in 1863, then Lincoln could’ve freed the—whaaaaa? What’s with that graph, anyway? Let’s look at the data, indeed. That curve looks suspiciously smooth!

Where did “the numbers” come from? The source says “Simon, pp. 171-177” but that’s not quite enough information. Luckily, we make rapid progress via Google. A search on “percentage of slaves or serfs in the world” takes us to this 2001 book by Stephen Moore and Julian Simon and the following quote:

A larger percentage of the world’s inhabitants are freer than ever before in history. Economic historian Stanley Engerman has noted that as recently as the late 18th century, “The bulk of mankind, over 95 percent, were miserable slaves or [sic] despotic tyrants.” . . . The figure shows the decline of slavery from 1750 through the end of the 20th century.

This one’s kinda weird because they put 1917 exactly halfway between 1750 and 2000, which isn’t quite right. It’s almost like they just drew a curve freehand through some made-up numbers! Also a bit odd is that Moore and Simon’s curve is not consistent with their own citation: in their text, they say the proportion of slaves in the late 18th century was 95%, but in the graph it’s around 70%.
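The axis arithmetic is easy to check. A tiny sketch, using only the three years mentioned above (the variable names are mine):

```python
start, end, marked = 1750, 2000, 1917

# Where the halfway point of the time axis actually falls.
midpoint = (start + end) / 2

# How far along the 1750-2000 span the year 1917 sits, as a fraction.
fraction = (marked - start) / (end - start)

print(f"true midpoint: {midpoint:.0f}")
print(f"1917 sits {fraction:.1%} of the way along the axis")
```

So on an evenly spaced axis, 1917 belongs about two-thirds of the way across, and the year at the halfway mark should be 1875.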

The next step, I suppose, is to track down “Simon, pp. 171-77; and authors’ calculations.” But I’m getting tired. Maybe someone else could follow this up for me?

In summary, the graph looks bogus to me. Some of these tech zillionaires seem to have no B.S. filter at all! Perhaps to be successful in that area it helps to be a bit credulous?

**P.S.** From comments below it seems clear that this graph has been created from a few nonexistent data points. It’s pretty horrible that Diamandis labeled this as “actual DATA.” I guess that’s just further confirmation that when people shout in ALL CAPS, they don’t know what they’re talking about!

]]>The post My talk at the Simons Foundation this Wed 5pm appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>To learn about the human world, we should accept uncertainty and embrace variation. We illustrate this concept with various examples from our recent research (the above examples are with Yair Ghitza and Aki Vehtari) and discuss more generally how statistical methods can help or hinder the scientific process.

]]>The post My talk with David Schiminovich this Wed noon: “The Birth of the Universe and the Fate of the Earth: One Trillion UV Photons Meet Stan” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Wed 10 Sept 2014, 12-1pm in the Statistics Department large seminar room (Social Work Bldg room 903, Columbia University).

The post My talk with David Schiminovich this Wed noon: “The Birth of the Universe and the Fate of the Earth: One Trillion UV Photons Meet Stan” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Suspiciously vague graph purporting to show “percentage of slaves or serfs in the world”

**Wed:** “It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet”

**Thurs:** One-tailed or two-tailed

**Fri:** What is the purpose of a poem?

**Sat:** He just ordered a translation from Diederik Stapel

**Sun:** Six quotes from Kaiser Fung

The post Likelihood from quantiles? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Many observers, esp. engineers, have a tendency to record their observations as {quantile, CDF} pairs, e.g.,

x CDF(x)

3.2 0.26

4.7 0.39

etc.

I suspect that their intent is to do some kind of “least-squares” analysis by computing theoretical CDFs from a model, e.g. Gamma(a, b), then regressing the observed CDFs against the theoretical quantiles, iterating the model parameters to minimize something, perhaps the K-S statistic.

I was wondering whether standard MCMC methods would be invalidated if the likelihood factor were constructed using CDFs instead of PDFs (or density mass). That is, the likelihood would be the product of F(x) values instead of the derivative, f(x). My intuition tells me that it shouldn’t matter since the result is still a product of probabilities but the apparent lack of literature examples gives me pause.

My reply: I don’t know enough about this sort of problem to give you a real answer, but in general the likelihood is the probability distribution of the data (given parameters), hence in setting up the likelihood you want to get a sense of what the measurements actually are. Is that “3.2” measured with error, or are you concerned with variation across different machines or whatever? Once you know this, maybe you can model the measurements directly.
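To illustrate why it matters what the measurements actually are, here is a small sketch. Everything in it is an assumption for illustration: I treat the two x values quoted above as if they were direct observations from an exponential distribution with unknown rate (not the Gamma model the correspondent mentions), and compare a log-likelihood built from the density f(x) with a "pseudo-likelihood" built from the CDF F(x). The function names and the grid of candidate rates are made up.

```python
import math

xs = [3.2, 4.7]  # the two x values quoted in the question

def log_lik_pdf(rate, xs):
    """Log-likelihood from the exponential density f(x) = rate * exp(-rate * x)."""
    return sum(math.log(rate) - rate * x for x in xs)

def log_lik_cdf(rate, xs):
    """Log of the product of CDF values F(x) = 1 - exp(-rate * x)."""
    return sum(math.log(1.0 - math.exp(-rate * x)) for x in xs)

# Grid of candidate rates: 0.05, 0.10, ..., 2.00
rates = [0.05 * k for k in range(1, 41)]

best_pdf = max(rates, key=lambda r: log_lik_pdf(r, xs))
best_cdf = max(rates, key=lambda r: log_lik_cdf(r, xs))

print(f"rate maximizing the density-based likelihood: {best_pdf:.2f}")
print(f"rate maximizing the product of CDF values:    {best_cdf:.2f}")
```

The density-based likelihood peaks near 1/mean(x), as it should. The product of CDF values keeps increasing with the rate, so its maximum lands on the edge of whatever grid you choose. In this toy setting, at least, the product of F(x) values is not behaving like the probability of the data, which is why it helps to first pin down what the recorded numbers represent.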

]]>The post Some time in the past 200 years the neighborhood has changed appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Some time in the past 200 years the neighborhood has changed appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post How does inference for next year’s data differ from inference for unobserved data from the current year? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I recently came across your blog post from 2009 about how statistical analysis differs when analyzing an entire population rather than a sample.

I understand the part about conceptualizing the problem as involving a stochastic data generating process; however, I have a query about the paragraph on ‘making predictions about future cases, in which case the relevant uncertainty comes from the year-to-year variation’.

Wouldn’t the random-data-generating-process conceptualization cover the situation where you’re interested in making predictions about future cases? I just wanted to check that I’m not missing the importance of the year-to-year variation– this, presumably, wouldn’t be the random variation that’s necessary for inferential statistics to apply, as the year-to-year variation might be systematic rather than random?

My reply:

See for example the Gelman and King JASA paper from 1990. The point is that variation among units within a given year is not the same as variation within a unit from year to year.

We used a multilevel model.

But the real point here is that we were able to transform a somewhat philosophical question (What is the meaning of statistical inference if the entire population is observed?) into a technical question regarding variance within and between years. A lot of progress in statistical methods goes this way, that topics that formerly were consigned to philosophy can get subsumed into quantitative modeling.
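A toy sketch of the distinction between the two kinds of variation, with entirely hypothetical numbers (three units observed in three years). This is a descriptive comparison using sample variances, not the multilevel model from the paper, and the unit names are placeholders.

```python
from statistics import mean, variance

# Hypothetical outcomes: one list of yearly values per unit.
outcomes = {
    "unit_a": [50.0, 52.0, 54.0],
    "unit_b": [60.0, 62.0, 64.0],
    "unit_c": [70.0, 72.0, 74.0],
}
n_years = 3

# Variation among units within a given year, averaged over years.
among_units = mean(
    [variance([series[t] for series in outcomes.values()]) for t in range(n_years)]
)

# Variation within a unit from year to year, averaged over units.
across_years = mean([variance(series) for series in outcomes.values()])

print(f"variance among units within a year:    {among_units:.1f}")
print(f"variance within a unit across years:   {across_years:.1f}")
```

In this made-up table the units differ a lot from each other but each one moves only a little from year to year. If the goal is to predict next year's value for a given unit, it is the second number, not the first, that is relevant to the uncertainty.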

]]>The post Confirmationist and falsificationist paradigms of science appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The general issue is how we think about research hypotheses and statistical evidence. Following Popper etc., I see two basic paradigms:

**Confirmationist:** You gather data and look for evidence in support of your research hypothesis. This could be done in various ways, but one standard approach is via statistical significance testing: the goal is to reject a null hypothesis, and then this rejection will supply evidence in favor of your preferred research hypothesis.

**Falsificationist:** You use your research hypothesis to make specific (probabilistic) predictions and then gather data and perform analyses with the goal of rejecting your hypothesis.

In confirmationist reasoning, a researcher starts with hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A.

In falsificationist reasoning, it is the researcher’s actual hypothesis A that is put to the test.

How do these two forms of reasoning differ? In confirmationist reasoning, the research hypothesis of interest does *not* need to be stated with any precision. It is the null hypothesis that needs to be specified, because that is what is being rejected. In falsificationist reasoning, there is no null hypothesis, but the research hypothesis must be precise.

**In our research we bounce**

It is tempting to frame falsificationists as the Popperian good guys who are willing to test their own models and confirmationists as the bad guys (or, at best, as the naifs) who try to do research in an indirect way by shooting down straw-man null hypotheses.

And indeed I do see the confirmationist approach as having serious problems, most notably in the leap from “B is rejected” to “A is supported,” and also in various practical ways because the evidence against B isn’t always as clear as outside observers might think.

But it’s probably most accurate to say that each of us is sometimes a confirmationist and sometimes a falsificationist. In our research we bounce between confirmation and falsification.

Suppose you start with a vague research hypothesis (for example, that being exposed to TV political debates makes people more concerned about political polarization). This hypothesis can’t yet be falsified as it does not make precise predictions. But it seems natural to seek to confirm the hypothesis by gathering data to rule out various alternatives. At some point, though, if we really start to like this hypothesis, it makes sense to fill it out a bit, enough so that it can be tested.

In other settings it can make sense to check a model right away. In psychometrics, for example, or in various analyses of survey data, we start right away with regression-type models that make very specific predictions. If you start with a full probability model of your data and underlying phenomenon, it makes sense to try right away to falsify (and thus, improve) it.

**Dominance of the falsificationist rhetoric**

That said, Popper’s ideas are pretty dominant in how we think about scientific (and statistical) evidence. And it’s my impression that null hypothesis significance testing is generally understood as being part of a Popperian, falsificationist approach to science.

So I think it’s worth emphasizing that when a researcher is testing a null hypothesis that he or she does not believe, in order to supply evidence in favor of a preferred hypothesis, this is confirmationist reasoning. It may well be good science (depending on the context) but it’s *not* falsificationist.

**The “I’ve got statistical significance and I’m outta here” attitude**

This discussion arose when Mayo wrote of a controversial recent study, “By the way, since Schnall’s research was testing ‘embodied cognition’ why wouldn’t they have subjects involved in actual cleansing activities rather than have them unscramble words about cleanliness?”

This comment was interesting to me because it points to a big problem with a lot of social and behavioral science research, which is a vagueness of research hypotheses and an attitude that anything that rejects the null hypothesis is evidence in favor of the researcher’s preferred theory.

Just to clarify, I’m not saying that this is a particular problem with classical statistical methods; the same problem would occur if, for example, researchers were to declare victory when a 95% posterior interval excludes zero. The problem that I see here, and that I’ve seen in other cases too, is that there is little or no concern with issues of measurement. Scientific measurement can be analogized to links on a chain, and each link—each place where there is a gap between the object of study and what is actually being measured—is cause for concern.

All of this is a line of reasoning that is crucial to science but is often ignored (in my own field of political science as well, where we often just accept survey responses as data without thinking about what they correspond to in the real world). One area where measurement is taken very seriously is psychometrics, but it seems that the social psychologists don’t think so much about reliability and validity. One reason, perhaps, is that psychometrics is about quantitative measurement, whereas questions in social psychology are often framed in a binary way (Is the effect there or not?). And once you frame your question in a binary way, there’s a temptation for a researcher, once he or she has found a statistically significant comparison, to just declare victory and go home.

The *measurements* in social psychology are often quantitative; what I’m talking about here is that the *research hypotheses* are framed in a binary way (really, a unary way in that the researchers just about always seem to think their hypotheses are actually true). This motivates the “I’ve got statistical significance and I’m outta here” attitude. And, if you’ve got statistical significance already and that’s your goal, then who cares about reliability and validity, right? At least, that’s the attitude, that once you have significance (and publication), it doesn’t really matter exactly what you’re measuring, because you’ve proved your theory.

I am not intending to be cynical or to imply that I think these researchers are trying to do bad science. I just think that the combination of binary or unary hypotheses along with a data-based decision rule leads to serious problems.

The issue is that research projects are framed as quests for confirmation of a theory. And once confirmation (in whatever form) is achieved, there is a tendency to declare victory and not think too hard about issues of reliability and validity of measurements.

To this, Mayo wrote:

I agreed that “the measurements used in the paper in question were not” obviously adequately probing the substantive hypothesis. I don’t know that the projects are framed as quests “for confirmation of a theory”, rather than quests for evidence of a statistical effect (in the midst of the statistical falsification arg at the bottom of this comment). Getting evidence of a genuine, repeatable effect is at most a necessary but not a sufficient condition for evidence of a substantive theory that might be thought to (statistically) entail the effect (e.g., a cleanliness prime causes less judgmental assessments of immoral behavior—or something like that). I’m not sure that they think about general theories–maybe “embodied cognition” could count as general theory here. Of course the distinction between statistical and substantive inference is well known. I noted, too, that the so-called NHST is purported to allow such fallacious moves from statistical to substantive and, as such, is a fallacious animal not permissible by Fisherian or NP tests.

I agree that issues about the validity and relevance of measurements are given short shrift and that the emphasis–even in the critical replication program–is on (what I called) the “pure” statistical question (of getting the statistical effect).

I’m not sure I’m getting to your concern Andrew, but I think that they see themselves as following a falsificationist pattern of reasoning (rather than a confirmationist one). They assume it goes something like this:

If the theory T (clean prime causes less judgmental toward immoral actions) were false, then they wouldn’t get statistically significant results in these experiments, so getting stat sig results is evidence for T.

This is fallacious when the conditional fails.

And I replied that I think these researchers are following a confirmationist rather than falsificationist approach. Why do I say this? Because when they set up a nice juicy hypothesis and other people fail to replicate it, they don’t say: “Hey, we’ve been falsified! Cool!” Instead they give reasons why they haven’t been falsified. Meanwhile, when they falsify things themselves, they falsify the so-called straw-man null hypotheses that they don’t believe.

The pattern is as follows: Researcher has hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A. I don’t see this as falsificationist reasoning, because the researchers’ actual hypothesis (that is, hypothesis A) is never put to the test. It is only B that is put to the test. To me, testing B in order to provide evidence in favor of A is confirmationist reasoning.

Again, I don’t see this as having anything to do with Bayes vs non-Bayes, and all the same behavior could happen if every p-value were replaced by a confidence interval.

I understand falsificationism to be that you take the hypothesis you love, try to understand its implications as deeply as possible, and use these implications to test your model, to make falsifiable predictions. The key is that you’re setting up your own favorite model to be falsified.

In contrast, the standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But, rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify and thus represent as evidence in favor of A.

As I said above, this has little to do with p-values or Bayes; rather, it’s about the attitude of trying to falsify the null hypothesis B rather than trying to falsify the researcher’s hypothesis A.

Take Daryl Bem, for example. His hypothesis A is that ESP exists. But does he try to make falsifiable predictions, predictions for which, if they happen, his hypothesis A is falsified? No, he gathers data in order to falsify hypothesis B, which is someone else’s hypothesis. To me, a research program is confirmationist, not falsificationist, if the researchers are never trying to set up their own hypotheses for falsification.

That might be ok. Maybe a confirmationist approach is fine; I’m sure that lots of important things have been learned in this way. But I think we should label it for what it is.

**Summary for the tl;dr crowd**

In our paper, Shalizi and I argued that Bayesian inference does not have to be performed in an inductivist mode, despite a widely-held belief to the contrary. Here I’m arguing that classical significance testing is not necessarily falsificationist, despite a widely-held belief to the contrary.

]]>The post Why isn’t replication required before publication in top journals? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I don’t recall seeing, on your blog or elsewhere, this question raised directly. Of course there is much talk about the importance of replication, mostly by statisticians, and economists are grudgingly following suit with top journals requiring datasets and code.

But why not make it a simple requirement? No replication, no publication.

I suppose that it would be too time-consuming (many reviewers shirk even that basic duty) and that there is a risk of theft of intellectual property.

My reply: In this context, “replication” can mean two things. The first meaning is that the authors supply enough information that the exact analysis can be replicated (this information would include raw data (suitably anonymized if necessary), survey forms, data collection protocols, computer programs and scripts, etc.). Some journals already do require this; for example, we had to do it for our paper in the Quarterly Journal of Political Science. The second meaning of “replication” is that the authors would actually have to replicate their study, ideally with a preregistered design, as in the “50 shades of gray” paper. This second sort of replication is great when it can be done, but it’s not in general so easy in fields such as political science or economics where we work with historical data.

]]>The post I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**The quotes**

Here’s one: “You have no choice but to accept that the major conclusions of these studies are true.”

Ahhhh, but we do have a choice!

First, the background. We have two quotes from this paper by E. J. Wagenmakers, Ruud Wetzels, Denny Borsboom, Rogier Kievit, and Han van der Maas.

Here’s Alan Turing in 1950:

I assume that the reader is familiar with the idea of extra-sensory perception, and the meaning of the four items of it, viz. telepathy, clairvoyance, precognition and psycho-kinesis. These disturbing phenomena seem to deny all our usual scientific ideas. How we should like to discredit them! Unfortunately the statistical evidence, at least for telepathy, is overwhelming.

Wow! Overwhelming evidence isn’t what it used to be.

In all seriousness, it’s interesting that Turing, who was in some ways an expert on statistical evidence, was fooled in this way. After all, even those psychologists who currently believe in ESP would not, I think, hold that the evidence for telepathy *as of 1950* was overwhelming. I say this because it does not seem so easy for researchers to demonstrate ESP using the protocols of the 1940s; instead there is continuing effort to come up with new designs.

How could Turing have thought this? I don’t know much about Turing but it does seem, when reading old-time literature, that belief in the supernatural was pretty common back then, with lots of mention of ghosts etc. And there does seem, at least to me, to be an intuitive appeal to the idea that if we just concentrate hard enough, we can read minds, move objects, etc. Also, remember that, as of 1950, the discovery and popularization of quantum mechanics was not so far in the past. Given all the counterintuitive features of quantum physics and radioactivity, it does not seem at all unreasonable that there could be some new phenomena out there to be discovered. Things feel a bit different in 2014 after several decades of merely incremental improvements in physics.

To move things forward a few decades, Wagenmakers et al. mention “the phenomenon of social priming, where a subtle cognitive or emotional manipulation influences overt behavior. The prototypical example is the elderly walking study from Bargh, Chen, and Burrows (1996); in the priming phase of this study, students were either confronted with neutral words or with words that are related to the concept of the elderly (e.g., ‘Florida’, ‘bingo’). The results showed that the students’ walking speed was slower after having been primed with the elderly-related words.”

They then pop out this 2011 quote from Daniel Kahneman:

When I describe priming studies to audiences, the reaction is often disbelief . . . The idea you should focus on, however, is that disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.

And that brings us to the beginning of this post, and my response: No, you *don’t* have to accept that the major conclusions of these studies are true. Wagenmakers et al. note, “At the 2014 APS annual meeting in San Francisco, however, Hal Pashler presented a long series of failed replications of social priming studies, conducted together with Christine Harris, the upshot of which was that disbelief does in fact remain an option.”

**Where did Turing and Kahneman go wrong?**

Overstating the strength of empirical evidence. How does that happen? As Eric Loken and I discuss in our Garden of Forking Paths article (echoing earlier work by Simmons, Nelson, and Simonsohn), statistically significant comparisons are not hard to come by, even by researchers who are not actively fishing through the data.

The other issue is that when any real effects are almost certainly tiny (as in ESP, or social priming, or various other bank-shot behavioral effects such as ovulation and voting), statistically significant patterns can be systematically misleading (as John Carlin and I discuss here).
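To make the point about tiny effects concrete, here is a minimal simulation sketch in R. The true effect of 0.1 and the standard error of 1 are assumed values chosen purely for illustration, not numbers from any of the studies mentioned. "Type S" (wrong sign) and "Type M" (exaggerated magnitude) are used here as shorthand for the two ways a statistically significant estimate can mislead.

```r
# Sketch: what statistically significant estimates look like when the
# true effect is tiny relative to the noise.
# Assumed (hypothetical) values: true effect 0.1, standard error 1.
set.seed(123)
true_effect <- 0.1
se <- 1
n_sims <- 100000

# One noisy estimate per hypothetical replication of the study
estimate <- rnorm(n_sims, mean = true_effect, sd = se)
significant <- abs(estimate) > 1.96 * se

# Share of replications that reach p < 0.05
power <- mean(significant)
# Among the significant ones: share with the wrong sign (Type S)
type_s <- mean(estimate[significant] < 0)
# Among the significant ones: average exaggeration factor (Type M)
type_m <- mean(abs(estimate[significant])) / true_effect

print(round(c(power = power, type_s = type_s, type_m = type_m), 3))
```

Under these assumed numbers only a small share of replications reach significance, and those that do overstate the effect many times over, with a substantial fraction getting the sign wrong.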

Still and all, it’s striking to see brilliant people such as Turing and Kahneman making this mistake. Especially Kahneman, given that he and Tversky wrote the following in a famous paper:

People have erroneous intuitions about the laws of chance. In particular, they regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics. The prevalence of the belief and its unfortunate consequences for psychological research are illustrated by the responses of professional psychologists to a questionnaire concerning research decisions.

Indeed.

**Having an open mind**

It’s good to have an open mind. Psychology journals publish articles on ESP and social priming, even though these may seem implausible, because implausible things sometimes are true.

It’s good to have an open mind. When a striking result appears in the dataset, it’s possible that this result does *not* represent an enduring truth or even a pattern in the general population but rather is just an artifact of a particular small and noisy dataset.

One frustration I’ve had in recent discussions regarding controversial research is the seeming unwillingness of researchers to entertain the possibility that their published findings are just noise. Maybe not, maybe these are real effects being discovered, but you should at least *consider* the possibility that you’re chasing noise. Despite what Turing and Kahneman say, you can keep an open mind.

**P.S.** Some commenters thought that I was disparaging Alan Turing and Daniel Kahneman. I wasn’t. Turing and Kahneman both made big contributions to science, almost certainly much bigger than anything I will ever do. And I’m not criticizing them for believing in ESP and social priming. What I am criticizing them for is their insistence that the evidence is “overwhelming” and that the rest of us “have no choice” but to accept these hypotheses. Both Turing and Kahneman, great as they are, overstated the strength of the statistical evidence.

And that’s interesting. When stupid people make a mistake, that’s no big deal. But when brilliant people make a mistake, it’s worth noting.

The post Questions about “Too Good to Be True” appeared first on Statistical Modeling, Causal Inference, and Social Science.

I manage a team tasked with, among other things, analyzing data on Air Traffic operations to identify factors that may be associated with elevated risk. I think it’s fair to characterize our work as “data mining” (e.g., using rule induction, Bayesian, and statistical methods).

One of my colleagues sent me a link to your recent article “Too Good to Be True” (Slate, July 24). Obviously, as my friend has pointed out, your article raises questions about the validity of what I’m doing.

A few thoughts/questions:

(1) I agree with your overall point, but I’m having trouble understanding the specific complaint with the “red/pink” study. In their case, if I’m understanding the authors’ rebuttal, they were not asking “what color is associated with fertility” and then mining the data to find a color…any color…which seemed to have a statistical association. They started by asking “is red/pink associated with fertility”, no? In which case, I think the point they’re making seems fair?

(2) But, your argument definitely applies to the kind of work I’m doing. In my case, I’m asking an open ended question: “Are there any relationships?” Well, of course, you would say, the odds are that you must find relationships…even if they are not really there.

(3) So let’s take a couple of examples. There are 1,000’s of economists building models to explain some economic phenomenon. All of these models are based on the same underlying data: the U.S. Income and Product Accounts. There are then 10,000’s of models built—only a handful of which are publication-worthy. So, by the same logic, with that many people studying the same sample, it would be statistically true that many of the published papers in even the best economics journals are false?

(4) Another example: one of the things that we have uncovered is that, in the case of Runway Incursions, errors committed by air traffic controllers are many times more likely to result in a collision than errors committed by a pilot. The p-value here is pretty low—although the confidence interval is large because, thankfully, we don’t have a lot of collisions. What is your reaction to this finding?

(5) A caveat: In my case, we use the statistically significant findings to point us in directions that deserve more study. Basically as a form of triage (because we don’t have the resources to address every conceivable hazard in the airspace system). Perhaps fortunately, most of the people I deal with (primarily pilots and air traffic controllers) don’t understand statistics. So, the safety case we build must be based on more than just a mechanical analysis of the data.

My reply:

(1) Whether or not the authors of the study were “mining the data,” I think their analysis was contingent on the data. They had many data-analytic choices, including rules for which cases to include or exclude and which comparisons to make, as well as what colors to study. Their protocol and analysis were not pre-registered. The point is that, even though they did an analysis that was consistent with their general research hypothesis, there are many degrees of freedom in the specifics, and these specifics can well be chosen in light of the data.

This topic is really worth an article of its own . . . and, indeed, Eric Loken and I have written that article! So, instead of replying in detail in this post, I’ll point you toward The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.

(2) You write, “the odds are that you must find relationships . . . even if they are not really there.” I think the relationships are there but that they are typically small, and they exist in the context of high levels of variation. So the issue isn’t so much that you’re finding things that aren’t there, but rather that, if you’re not careful, you’ll think you’re finding large and consistent effects, when what’s really there are small effects of varying direction.

(3) You ask, “by the same logic, with that many people studying the same sample, it would be statistically true that many of the published papers in even the best economics journals are false?” My response: No, I don’t think that framing statistical statements as “true” or “false” is the most helpful way to look at things. I think it’s fine for lots of people to analyze the same dataset. And, for that matter, I think it’s fine for people to use various different statistical methods. But methods have assumptions attached to them. If you’re using a Bayesian approach, it’s only fair to criticize your methods if the probability distributions don’t seem to make sense. And if you’re using p-values, then you need to consider the reference distribution over which the long-run averaging is taking place.

(4) You write: “in the case of Runway Incursions, errors committed by air traffic controllers are many times more likely to result in a collision than errors committed by a pilot. The p-value here is pretty low—although the confidence interval is large because, thankfully, we don’t have a lot of collisions. What is your reaction to this finding?” My response is, first, I’d like to see all the comparisons that you might be making with these data. If you found one interesting pattern, there might well be others, and I wouldn’t want you to limit your conclusions to just whatever happened to be statistically significant. Second, your finding seems plausible to me but I’d guess that the long-run difference will probably be lower than what you found in your initial estimate, as there is typically a selection process by which larger differences are more likely to be noticed.

(5) Your triage makes some sense. Also let me emphasize that it’s not generally appropriate to wait on statistical significance before making decisions.
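Points (2) and (4) can be illustrated with a small simulation. This is only a sketch, and it rests on a deliberately extreme assumption: all 20 comparisons are pure noise, which, as argued in (2), is not quite how real data behave. The number of comparisons and the group sizes are hypothetical.

```r
# Sketch: how often does at least one of many comparisons come out
# "statistically significant" when there is nothing there at all?
# Assumed (hypothetical) setup: 20 independent two-group comparisons,
# 50 observations per group, no true differences anywhere.
set.seed(1)
n_comparisons <- 20
n_per_group <- 50
n_sims <- 1000

any_significant <- replicate(n_sims, {
  p_values <- replicate(n_comparisons, {
    t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value
  })
  any(p_values < 0.05)
})

simulated <- mean(any_significant)
analytic <- 1 - 0.95^n_comparisons  # if the comparisons are independent

print(round(c(simulated = simulated, analytic = analytic), 3))
```

The simulated proportion should land near the analytic value, which is why it helps to look at all the comparisons that could have been made rather than only the ones that happened to cross the threshold.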

The post Bad Statistics: Ignore or Call Out? appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Tues:** Questions about “Too Good to Be True”

**Wed:** I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence

**Thurs:** Why isn’t replication required before publication in top journals?

**Fri:** Confirmationist and falsificationist paradigms of science

**Sat:** How does inference for next year’s data differ from inference for unobserved data from the current year?

**Sun:** Likelihood from quantiles?

We’ve got a full week of statistics for you. Welcome back to work, everyone!

The post On deck this month appeared first on Statistical Modeling, Causal Inference, and Social Science.

Questions about “Too Good to Be True”

I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence

Why isn’t replication required before publication in top journals?

Confirmationist and falsificationist paradigms of science

How does inference for next year’s data differ from inference for unobserved data from the current year?

Likelihood from quantiles?

My talk with David Schiminovich this Wed noon: “The Birth of the Universe and the Fate of the Earth: One Trillion UV Photons Meet Stan”

Suspicious graph purporting to show “percentage of slaves or serfs in the world”

“It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet”

One-tailed or two-tailed

What is the purpose of a poem?

He just ordered a translation from Diederik Stapel

Six quotes from Kaiser Fung

More bad news for the buggy-whip manufacturers

They know my email but they don’t know me

What do you do to visualize uncertainty?

Sokal: “science is not merely a bag of clever tricks . . . Rather, the natural sciences are nothing more or less than one particular application — albeit an unusually successful one — of a more general rationalist worldview”

Question about data mining bias in finance

Estimating discontinuity in slope of a response function

I can’t think of a good title for this one.

Study published in 2011, followed by successful replication in 2003 [sic]

I’m sure that my anti-Polya attitude is completely unfair

Waic for time series

MA206 Program Director’s Memorandum

“An exact fishy test”

People used to send me ugly graphs, now I get these things

If you do an experiment with 700,000 participants, you’ll (a) have no problem with statistical significance, (b) get to call it “massive-scale,” (c) get a chance to publish it in a ~~tabloid~~ top journal. Cool!

Carrie McLaren was way out in front of the anti-Gladwell bandwagon

The post Avoiding model selection in Bayesian social research appeared first on Statistical Modeling, Causal Inference, and Social Science.

Don Rubin and I argue with Adrian Raftery. Here’s how we begin:

Raftery’s paper addresses two important problems in the statistical analysis of social science data: (1) choosing an appropriate model when so much data are available that standard P-values reject all parsimonious models; and (2) making estimates and predictions when there are not enough data available to fit the desired model using standard techniques.

For both problems, we agree with Raftery that classical frequentist methods fail and that Raftery’s suggested methods based on BIC can point in better directions. Nevertheless, we disagree with his solutions because, in principle, they are still directed off-target and only by serendipity manage to hit the target in special circumstances. Our primary criticisms of Raftery’s proposals are that (1) he promises the impossible: the selection of a model that is adequate for specific purposes without consideration of those purposes; and (2) he uses the same limited tool for model averaging as for model selection, thereby depriving himself of the benefits of the broad range of available Bayesian procedures.

Despite our criticisms, we applaud Raftery’s desire to improve practice by providing methods and computer programs for all to use and applying these methods to real problems. We believe that his paper makes a positive contribution to social science, by focusing on hard problems where standard methods can fail and exposing failures of standard methods.

We follow up with sections on:

– “Too much data, model selection, and the example of the 3x3x16 contingency table with 113,556 data points”

– “How can BIC select a model that does not fit the data over one that does”

– “Not enough data, model averaging, and the example of regression with 15 explanatory variables and 47 data points.”

And here’s something we found on the web with Raftery’s original article, our discussion and other discussions, and Raftery’s reply. Enjoy.

The post When we talk about the “file drawer,” let’s not assume that an experiment can easily be characterized as producing strong, mixed, or weak results appeared first on Statistical Modeling, Causal Inference, and Social Science.

I thought you might be interested in our paper [the paper is by Annie Franco, Neil Malhotra, and Gabor Simonovits, and the link is to a news article by Jeffrey Mervis], forthcoming in Science, about publication bias in the social sciences given your interest and work on research transparency.

Basic summary: We examined studies conducted as part of the Time-sharing Experiments for the Social Sciences (TESS) program, where: (1) we have a known population of conducted studies (some published, some unpublished); and (2) all studies exceed a quality threshold as they go through peer review. We found that having null results made experiments 40 percentage points less likely to be published and 60 percentage points less likely to even be written up.

My reply:

Here’s a funny bit from the news article: “Stanford political economist Neil Malhotra and two of his graduate students . . .” You know you’ve hit the big time when you’re the only author who gets mentioned in the news story!

More seriously, this is great stuff. I would only suggest that, along with the file drawer, you remember the garden of forking paths. In particular, I’m not so sure about the framing in which an experiment can be characterized as producing “strong results,” “mixed results,” or “null results.” Whether a result is strong or not would seem to depend on how the data are analyzed, and the point of the forking paths is that with a given dataset it is possible for noise to appear as strong. I gather from the news article that TESS is different in that any given study is focused on a specific hypothesis, but even so I would think there is a bit of flexibility in how the data are analyzed and a fair number of potentially forking paths. For example, the news article mentions “whether voters tend to favor legislators who boast of bringing federal dollars to their districts over those who tout a focus on policy matters.” But of course this could be studied in many different ways.

In short, I think this is important work you have done, and I just think that we should go beyond the “file drawer” because I fear that this phrase lends too much credence to the idea that a reported p-value is a legitimate summary of a study.

P.S. There’s also a statistical issue that every study is counted only once, as either a 1 (published) or 0 (unpublished). If Bruno Frey ever gets involved, you’d have to have a system where any result gets a number from 0 to 5, representing the number of different times it’s published.

The post Pre-election survey methodology: details from nine polling organizations, 1988 and 1992 appeared first on Statistical Modeling, Causal Inference, and Social Science.

By the way, the paper has a (small) error. The two outlying “h” points in Figure 1b are a mistake. I can’t remember what we did wrong, but I do remember catching the mistake, I think it was before publication but too late for the journal to fix the error. The actual weighted results for the Harris polls are *not* noticeably different from those of the other surveys at those dates.

Polling has changed in the past twenty years, but I think this paper is still valuable, partly in giving a sense of the many different ways that polling organizations can attempt to get a representative sample, and partly as a convenient way to shoot down the conventional textbook idea of survey weights as inverse selection probabilities. (Remember, survey weighting is a mess.)

The post One of the worst infographics ever, but people don’t care? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Perhaps prompted by the ALS Ice Bucket Challenge, this infographic has been making the rounds:

I think this is one of the worst I have ever seen. I don’t know where it came from, so I can’t give credit/blame where it’s due.

Let’s put aside the numbers themselves – I haven’t checked them, for one thing, and I’d also say that for this comparison one would be most interested in (government money plus donations) rather than just donations — and just look at this as an information display. What are some things I don’t like about it? Jeez, I hardly know where to begin.

1. It takes a lot of work to figure it out. (a) You have to realize that each color is associated with a different cause — my initial thought was that the top circles represent deaths and dollars for the first cause, the second circles are for the second cause, etc. (b) Even once you’ve realized what is being displayed, and how, you pretty much have to go disease by disease to see what is going on; there’s no way to grok the whole pattern at once. (c) Other than pink for breast cancer and maybe red for AIDS, none of the color mappings are standardized in any sense, so you have to keep referring back to the legend at the top. (d) It’s not obvious (and I still don’t know) if the amount of “money raised” for a given cause refers only to the specific fundraising vehicle mentioned in the legend for each disease. It’s hard to believe they would do it that way, but maybe they do.

2. Good luck if you’re colorblind.

3. Maybe I buried the lede by putting this last: did you catch the fact that the area of the circle isn’t the relevant parameter? Take a look at the top two circles on the left. The upper one should be less than twice the size of the second one. It looks like they made the *diameter* of the circle proportional to the quantity, rather than the area; a classic way to mislead with a graphic.
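The diameter-versus-area problem in item 3 comes down to a couple of lines of arithmetic. Here is a small R sketch using two made-up quantities, one twice the other (these are not numbers from the infographic):

```r
# Two hypothetical quantities, the first twice the second
values <- c(a = 2, b = 1)

# Misleading encoding: diameter proportional to the value,
# so the drawn area grows with the square of the value
area_if_diameter_scaled <- pi * (values / 2)^2

# Honest encoding: area proportional to the value,
# so the radius grows with the square root of the value
radius_if_area_scaled <- sqrt(values / pi)
area_if_area_scaled <- pi * radius_if_area_scaled^2

ratio_misleading <- area_if_diameter_scaled[["a"]] / area_if_diameter_scaled[["b"]]
ratio_honest <- area_if_area_scaled[["a"]] / area_if_area_scaled[["b"]]

print(c(misleading = ratio_misleading, honest = ratio_honest))
```

When the diameter carries the value, a quantity that is twice as large gets a circle with four times the area, which is the distortion described above.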

At a bare minimum, this graphic could be improved by (a) fixing the terrible mistake with the sizes of the circles, (b) putting both columns in the same order (that is, first row is one disease, second row is another, etc.), (c) taking advantage of the new ordering to label each row so you don’t need the legend. This would also make it much easier to see the point the display is supposed to make.

As a professional data analyst I’d rather just see a scatterplot of money vs deaths, but I know a lot of people don’t understand scatterplots. I can see the value of using circle sizes for a general audience. But I can’t see how anyone could like this graphic. Yet three of my friends (so far) have posted it on Facebook, with nary a criticism of the display.

[Note added the next day:

The graphic is even worse than I thought. As several people have pointed out, my suspicion is true: the numbers do not show the total donations to fight the diseases listed, they show only the donations to a single organization. For instance, according to the legend the pink color represents donations to fight **breast cancer**, but the number is not for breast cancer as a whole, it's only for Komen Race for the Cure.

If they think people are interested in contributions to only a single charity in each category --- which seems strange, but let's assume that's what they want and just look at the display --- then they need a title that is much less ambiguous, and the labels need to emphasize the charity and not the disease.]

This post is by Phil Price.

The post Discussion of “A probabilistic model for the spatial distribution of party support in multiparty elections” appeared first on Statistical Modeling, Causal Inference, and Social Science.

I like the discussion, and it includes some themes that keep showing up: the idea that modeling is important and you need to understand what your model is doing to the data. It’s not enough to just interpret the fitted parameters as is, you need to get in there, get your hands dirty, and examine all aspects of your fit, not just the parts that relate to your hypotheses of interest.

There is a continuity between the criticisms I addressed of that paper in 1994, and our recent criticisms of some applied models, for example of that regression estimate of the health effects of air pollution in China.

The post Dave Blei course on Foundations of Graphical Models appeared first on Statistical Modeling, Causal Inference, and Social Science.

Dave Blei writes:

This course is cross listed in Computer Science and Statistics at Columbia University.

It is a PhD level course about applied probabilistic modeling. Loosely, it will be similar to this course.

Students should have some background in probability, college-level mathematics (calculus, linear algebra), and be comfortable with computer programming.

The course is open to PhD students in CS, EE and Statistics. However, it is appropriate for quantitatively-minded PhD students across departments. Please contact me [Blei] if you are a PhD student who is interested, but cannot register.

Research in probabilistic graphical models has forged connections between signal processing, statistics, machine learning, coding theory, computational biology, natural language processing, computer vision, and many other fields. In this course we will study the basics and the state of the art, with an eye on applications. By the end of the course, students will know how to develop their own models, compute with those models on massive data, and interpret and use the results of their computations to solve real-world problems.

Looks good to me!

The post Review of “Forecasting Elections” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Political scientists are aware that most voters are consistent in their preferences, and one can make a good guess just looking at the vote counts in the previous election.

Objective analysis of a few columns of numbers can regularly outperform pundits who use inside knowledge.

The rationale for forecasting electoral vote directly . . . is mistaken.

The book’s weakness is its unquestioning faith in linear regression . . . We should always be suspicious of any grand claims made about a linear regression with five parameters and only 11 data points. . . .
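The worry about five parameters and only 11 data points can be checked with a quick simulation. This is a sketch in R that assumes four predictors plus an intercept (one way of counting five parameters) and data that are pure noise; it does not use the book's data or its particular regression.

```r
# Sketch: how well does a 5-parameter linear regression "fit"
# 11 data points when there is no relationship at all?
# Assumed (hypothetical) setup: 4 noise predictors plus an intercept.
set.seed(42)
n <- 11
k <- 4
n_sims <- 2000

r2 <- replicate(n_sims, {
  x <- matrix(rnorm(n * k), nrow = n, ncol = k)
  y <- rnorm(n)  # outcome unrelated to any predictor
  summary(lm(y ~ x))$r.squared
})

mean_r2 <- mean(r2)
share_above_half <- mean(r2 > 0.5)

print(round(c(mean_r2 = mean_r2, share_above_half = share_above_half), 3))
```

With pure noise the expected R-squared is k/(n-1), which is 0.4 in this setup, so a fit that looks respectable can arise from nothing at all.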

Funny that I didn’t suggest the use of informative prior distributions. Only recently have I been getting around to this point.

And more:

The fact that U.S. elections can be successfully forecast with little effort, months ahead of time, has serious implications for our understanding of politics. In the short term, improved predictions will lead to more sophisticated campaigns, focusing more than ever on winnable races and marginal states.

The post Discussion of “Maximum entropy and the nearly black object” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Under the “nearly black” model, the normal prior is terrible, the entropy prior is better and the exponential prior is slightly better still. (An even better prior distribution for the nearly black model would combine the threshold and regularization ideas by mixing a point mass at 0 with a proper distribution on [0, infinity].) Knowledge that an image is nearly black is strong prior information that is not included in the basic maximum entropy estimate.
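The mixture prior mentioned in the parenthetical can be sketched for a single pixel. The R snippet below is an illustration under assumptions chosen here for convenience, not taken from the discussion or from Donoho et al.: one observation y with normal noise of standard deviation 1, prior probability w = 0.9 that the true intensity is exactly 0, and otherwise an exponential prior with rate 1. The function name `posterior_prob_zero` is hypothetical.

```r
# Sketch: posterior probability that a pixel is truly black (theta = 0)
# under a prior that mixes a point mass at 0 with an exponential density.
# Assumed model: y ~ normal(theta, sigma); theta = 0 with probability w,
# otherwise theta ~ exponential(rate).
posterior_prob_zero <- function(y, w = 0.9, rate = 1, sigma = 1) {
  # Likelihood of y if the pixel is exactly black
  lik_spike <- dnorm(y, mean = 0, sd = sigma)
  # Marginal likelihood of y if the pixel has positive intensity
  lik_slab <- integrate(
    function(theta) dnorm(y, mean = theta, sd = sigma) * dexp(theta, rate = rate),
    lower = 0, upper = Inf
  )$value
  w * lik_spike / (w * lik_spike + (1 - w) * lik_slab)
}

y_values <- c(0, 1, 3, 5)
probs <- sapply(y_values, posterior_prob_zero)
print(round(setNames(probs, y_values), 3))
```

The point mass does the thresholding (small observations are attributed to a black pixel) while the continuous part shrinks the rest, which is the combination of ideas described above.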

Overall I liked the Donoho et al. paper but I was a bit disappointed in their response to me. To be fair, the paper had lots of comments and I guess the authors didn’t have much time to read each one, but still I didn’t think they got my main point, which was that the Bayesian approach was a pretty direct way to get most of the way to their findings. To put it another way, that paper had a lot to offer (and of course those authors followed it up with lots of other hugely influential work) but I think there was value right away in thinking about the different estimates in terms of prior distributions, rather than treating the Bayesian approach as a sort of sidebar.

**Tues:** Review of “Forecasting Elections”

**Wed:** Discussion of “A probabilistic model for the spatial distribution of party support in multiparty elections”

**Thurs:** Pre-election survey methodology: details from nine polling organizations, 1988 and 1992

**Fri:** Avoiding model selection in Bayesian social research

**Sat, Sun:** You might not be aware, but the NYC Labor Day parade is not held on Labor Day, as it would interfere with everyone’s holiday plans. Instead it’s held on the following weekend.

The post Poker math showdown! appeared first on Statistical Modeling, Causal Inference, and Social Science.

In comments, Rick Schoenberg wrote:

One thing I tried to say as politely as I could in [the book, "Probability with Texas Holdem Applications"] on p146 is that there’s a huge error in Chen and Ankenman’s “The Mathematics of Poker” which renders all the calculations and formulas in the whole last chapter wrong or meaningless or both. I’ve never received a single ounce of feedback about this though, probably because only like 2 people have ever read my whole book.

Jerrod Ankenman replied:

I haven’t read your book, but I’d be happy to know what you think is a “huge” error that invalidates “the whole last chapter” that no one has uncovered so far. (Also, the last chapter of our book contains no calculations—perhaps you meant the chapter preceding the error?). If you contacted one of us about it in the past, it’s possible that we overlooked your communication, although I do try to respond to criticism or possible errors when I can. I’m easy to reach; firstname.lastname@yale.edu will work for a couple more months.

Hmmm, what’s on page 146 of Rick’s book? It comes up if you search inside the book on Amazon:

So that’s the disputed point right there. Just go to the example on page 290 where the results are normally distributed with mean 1 and variance 1, check that R(1)=-14%, then run the simulation and check that the probability of the bankroll starting at 1 and reaching 0 or less is approximately 4%.

I went on to Amazon but couldn’t access page 290 of Chen and Ankenman’s book to check this. I did, however, program the simulation in R as I thought Rick was suggesting:

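    # Simulate nsims bankroll paths of length T and record the first step
    # at which each path drops below zero (NA if it never does)
    waiting <- function(mu, sigma, nsims, T){
      time_to_ruin <- rep(NA, nsims)
      for (i in 1:nsims){
        # Bankroll starts at 1 and changes by a normal(mu, sigma) amount each step
        virtual_bankroll <- 1 + cumsum(rnorm(T, mu, sigma))
        if (any(virtual_bankroll < 0)) {
          time_to_ruin[i] <- min((1:T)[virtual_bankroll < 0])
        }
      }
      return(time_to_ruin)
    }
    a <- waiting(mu=1, sigma=1, nsims=10000, T=100)
    print(mean(!is.na(a)))  # estimated probability of ruin within T steps
    print(table(a))         # distribution of the time to ruin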

Which gave the following result:

    > print(mean(!is.na(a)))
    [1] 0.0409
    > print(table(a))
    a
      1   2   3   4   5   6   8   9
    218 107  53  13   9   7   1   1

These results indicate that (i) the probability is indeed about 4%, and (ii) T=100 is easily enough to get the asymptotic value here.
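To put a rough number on "about 4%," here's a quick sketch of the Monte Carlo standard error, using only the counts printed above and the usual binomial formula. The variable names are just for illustration.

```r
# Counts copied from the table(a) output above (ruin times 1,2,3,4,5,6,8,9)
nsims <- 10000
ruin_counts <- c(218, 107, 53, 13, 9, 7, 1, 1)

p_hat <- sum(ruin_counts) / nsims          # estimated probability of ruin
se <- sqrt(p_hat * (1 - p_hat) / nsims)    # binomial standard error of that estimate

print(c(estimate = p_hat, se = se))
```

The standard error comes out to about 0.002, so the estimate is roughly 4% plus or minus a few tenths of a percentage point. And since the latest ruin in the table happens at step 9, stopping at T=100 loses essentially nothing.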

Actually, the first time I did this I kept getting a probability of ruin of 2% which didn't seem right--I couldn't believe Rick would've got this simple simulation wrong--but then I found the bug in my code: I'd written "cumsum(1+rnorm(T,mu,sigma))" instead of "1+cumsum(rnorm(T,mu,sigma))".
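Here's a small sketch of what that misplaced "1+" does. The seed and the names steps, right, and wrong are made up for this illustration; the point is just that the buggy version adds 1 at every step rather than once at the start.

```r
set.seed(1)                            # arbitrary seed, only for reproducibility
steps <- rnorm(5, mean = 1, sd = 1)    # five simulated results

right <- 1 + cumsum(steps)   # intended: starting bankroll of 1 plus cumulative results
wrong <- cumsum(1 + steps)   # buggy: no starting bankroll, but an extra 1 added per step

print(wrong - right)         # the gap grows by 1 each step: 0, 1, 2, 3, 4

# In the buggy version the first value is normal with mean 2 and sd 1,
# so the chance of being below zero after one step is:
print(pnorm(-2))
```

That last number is about 0.023, and later steps add very little because the buggy path drifts upward at 2 per step, so a ruin probability near 2% is what you'd expect from the bug.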

So maybe Chen and Ankenman really did make a mistake. Or maybe Rick is misinterpreting what they wrote. There's also the question of whether Chen and Ankenman's mathematical error (assuming they did make the mistake identified by Rick) actually renders all the calculations and formulas in their whole last chapter, or their second-to-last chapter, wrong or meaningless or both.
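One possible way to line up the two numbers, and this is only a guess since I couldn't see page 290: if R(b) is the standard ruin probability for continuous-time Brownian motion with drift mu and variance sigma^2 per unit time, exp(-2*mu*b/sigma^2), then R(1) with mu=1 and sigma=1 is about 14%. The function name risk_of_ruin below is hypothetical, and I'm not claiming this is the formula in either book.

```r
# ASSUMPTION: ruin probability for continuous-time Brownian motion with
# drift mu and standard deviation sigma per unit time, starting at bankroll b.
# Not confirmed to be the formula on page 290 of Chen and Ankenman.
risk_of_ruin <- function(b, mu, sigma) {
  exp(-2 * mu * b / sigma^2)
}

print(risk_of_ruin(b = 1, mu = 1, sigma = 1))   # exp(-2), about 0.135
```

Under that reading, the gap from the simulated 4% would come from the simulation only checking the bankroll at integer steps: a continuously monitored path can dip below zero between steps, so its ruin probability can only be higher.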

P.S. According to the caption at the YouTube site, they're playing rummy, not poker, in the above clip. But you get the idea.

P.P.S. I fixed a typo pointed out by Juho Kokkala in an earlier version of my code.


]]>