The post John Bohannon’s chocolate-and-weight-loss hoax study actually understates the problems with standard p-value scientific practice appeared first on Statistical Modeling, Causal Inference, and Social Science.

“Slim by Chocolate!” the headlines blared. A team of German researchers had found that people on a low-carb diet lost weight 10 percent faster if they ate a chocolate bar every day. It made the front page of Bild, Europe’s largest daily newspaper, just beneath their update about the Germanwings crash. From there, it ricocheted around the internet and beyond, making news in more than 20 countries and half a dozen languages. . . .

My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.

How did the study go?

5 men and 11 women showed up, aged 19 to 67. . . . After a round of questionnaires and blood tests to ensure that no one had eating disorders, diabetes, or other illnesses that might endanger them, Frank randomly assigned the subjects to one of three diet groups. One group followed a low-carbohydrate diet. Another followed the same low-carb diet plus a daily 1.5 oz. bar of dark chocolate. And the rest, a control group, were instructed to make no changes to their current diet. They weighed themselves each morning for 21 days, and the study finished with a final round of questionnaires and blood tests.

A sample size of 16 might seem pretty low to you, but remember this, from a couple of years ago in Psychological Science:

So, yeah, these small-N studies are a thing. Bohannon writes, “And almost no one takes studies with fewer than 30 subjects seriously anymore. Editors of reputable journals reject them out of hand before sending them to peer reviewers.” Tell that to Psychological Science!

Bohannon continues:

Onneken then turned to his friend Alex Droste-Haars, a financial analyst, to crunch the numbers. One beer-fueled weekend later and… jackpot! Both of the treatment groups lost about 5 pounds over the course of the study, while the control group’s average body weight fluctuated up and down around zero. But the people on the low-carb diet plus chocolate? They lost weight 10 percent faster. Not only was that difference statistically significant, but the chocolate group had better cholesterol readings and higher scores on the well-being survey.

To me, the conclusion is obvious: Beer has a positive effect on scientific progress! They just need to run an experiment with a no-beer control group, and . . .

Ok, you get the point. But a crappy study is not enough. All sorts of crappy work is done all the time but doesn’t make it into the news. So Bohannon did more:

I called a friend of a friend who works in scientific PR. She walked me through some of the dirty tricks for grabbing headlines. . . .

The key is to exploit journalists’ incredible laziness. If you lay out the information just right, you can shape the story that emerges in the media almost like you were writing those stories yourself. In fact, that’s literally what you’re doing, since many reporters just copied and pasted our text.

Take a look at the press release I cooked up. It has everything. In reporter lingo: a sexy lede, a clear nut graf, some punchy quotes, and a kicker. And there’s no need to even read the scientific paper because the key details are already boiled down. I took special care to keep it accurate. Rather than tricking journalists, the goal was to lure them with a completely typical press release about a research paper.

**It’s even worse than Bohannon says!**

I think Bohannon’s stunt is just great and is a wonderful jab at the Ted-talkin, tabloid-runnin statistical significance culture that is associated so much with science today.

My only statistical comment is that Bohannon actually *understates* the way in which statistical significance can be found via the garden of forking paths.

Bohannon’s understatement comes in a few ways:

1. He writes:

If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. . . .

P(winning) = 1 – (1-p)^n [or, as Ed Wegman would say, 1 – (1-p)*n — ed.]

With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05.

That’s all fine, but actually it’s much worse than that, because researchers can, and do, also look at subgroups and interactions. 18 measurements correspond to a lot more than 18 possible tests! I say this because I can already see a researcher saying, “No, we only looked at one outcome variable so this couldn’t happen to us.” But that would be mistaken. As Daryl Bem demonstrated oh-so-eloquently, many, many possible comparisons can come from a single outcome.
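The 60% figure is easy to check, and the same formula shows how quickly things degrade once subgroups and interactions are counted as extra implicit comparisons. A minimal sketch (the 50-comparison count is an invented illustration, not a number from Bohannon’s study):

```python
# Chance of at least one "significant" result among n independent null
# tests, each at threshold p: 1 - (1 - p)^n.
p = 0.05

p_any_18 = 1 - (1 - p) ** 18   # the 18 measurements in the study
print(round(p_any_18, 2))      # → 0.6, the ~60% quoted above

# Subgroups and interactions turn 18 measurements into far more than 18
# implicit comparisons; 50 here is just an invented illustration.
p_any_50 = 1 - (1 - p) ** 50
print(round(p_any_50, 2))      # → 0.92
```

(The independence assumption overstates things a bit when outcomes are correlated, but the qualitative point survives.)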

2. Bohannon then writes:

It’s called p-hacking—fiddling with your experimental design and data to push p under 0.05—and it’s a big problem. Most scientists are honest and do it unconsciously. They get negative results, convince themselves they goofed, and repeat the experiment until it “works”.

Sure, but it’s not just that. As Eric Loken and I discussed in our recent article, multiple comparisons can be a problem even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Even if a researcher performs only a single comparison on his or her data, and thus did not do any “fishing” or “fiddling” at all, the garden of forking paths is *still* a problem, because the particular data analysis that was chosen is typically informed by the data. That is, a researcher will, after looking at the data, choose data-exclusion rules and a data analysis. A unique analysis is done for these data, but the analysis depends on those data. Mathematically this is of course very similar to performing a lot of tests and selecting the ones with good p-values, but it can feel very different.
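The forking-paths point can be made concrete with a small simulation (the five candidate outcomes, sample size, and known-variance z test are all invented for illustration): even though only one test is formally run, choosing *which* comparison after seeing the data inflates the false-positive rate well past 5%.

```python
import random

random.seed(1)

def z_stat(xs):
    # z statistic for testing mean = 0 with known sd = 1
    n = len(xs)
    return sum(xs) / n * n ** 0.5

n_sims, n_obs, n_outcomes = 2000, 20, 5
false_positives = 0
for _ in range(n_sims):
    # Five null outcomes; the analyst eyeballs all of them...
    zs = [z_stat([random.gauss(0, 1) for _ in range(n_obs)])
          for _ in range(n_outcomes)]
    # ...but formally tests only the most promising one.
    if abs(max(zs, key=abs)) > 1.96:
        false_positives += 1

rate = false_positives / n_sims
print(rate)  # roughly 1 - 0.95**5 ≈ 0.23, not the nominal 0.05
```

The analyst has run “one test,” but the procedure as a whole behaves like five.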

I always worry, when people write about p-hacking, that they mislead by giving the wrong impression that, if a researcher performs only one analysis on his or her data, all is ok.

3. Bohannon notes in passing that he excluded one person from his study, and elsewhere he notes that researchers “drop ‘outlier’ data points” in their quest for scientific discovery. But I think he could’ve emphasized this a bit more: researcher degrees of freedom are not just about running lots of tests on your data; they’re also about the flexibility in the rules for what data to exclude and how to code your responses. (Mark Hauser is an extreme case here, but even with simple survey responses there are coding issues in the very, very common setting that a numerical outcome is dichotomized.)

4. Finally, Bohannon is, I think, a bit too optimistic when he writes:

Luckily, scientists are getting wise to these problems. Some journals are trying to phase out p value significance testing altogether to nudge scientists into better habits.

I agree that p-values are generally a bad idea. But I think the real problem is with null hypothesis significance testing more generally, the idea that the goal of science is to find “true positives.”

In the real world, effects of interest are generally not true or false, it’s not so simple. Chocolate does have effects, and of course chocolate in our diet is paired with sugar and can also be a substitute for other desserts, etc etc etc. So, yes, I do think chocolate will have effects on weight. The effects will be positive for some people and negative for others, they’ll vary in their magnitude and they’ll vary situationally. If you try to nail this down as a “true” or “false” claim, you’re already going down the wrong road, and I don’t see it as a solution to replace p-values by confidence intervals or Bayes factors or whatever. I think we just have to get off this particular bus entirely. We need to embrace variation and accept uncertainty.

Again, just to be clear, I think Bohannon’s story is great, and I’m not trying to be picky here. Rather, I want to support what he did by putting it in a larger statistical perspective.

The post Cracked.com > Huffington Post, Wall Street Journal, New York Times appeared first on Statistical Modeling, Causal Inference, and Social Science.

As long as enterprising P.R. firms are willing to supply unsourced data, lazy journalists (or whatever you call these people) will promote it.

We saw this a few years ago in a Wall Street Journal article by Robert Frank (not the academic economist of the same name) that purported to give news on the political attitudes of the super-rich but was really just credulously reporting unsubstantiated statements from some consulting company.

And of course we saw this a couple years ago when New York Times columnist David Brooks promoted some fake statistics on ethnicity and high school achievement.

I get it: journalism is hard work, and sometimes a reporter or columnist will take a little break and just report a press release or promote the claims of some political ideologue. It happens. But I don’t have to like it.

The post What’s the worst joke you’ve ever heard? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Here’s my contender, from the book “1000 Knock-Knock Jokes for Kids”:

– Knock Knock.

– Who’s there?

– Ann

– Ann who?

– An apple fell on my head.

There’s something beautiful about this one. It’s the clerihew of jokes. Zero cleverness. It lacks any sense of inevitability, in that any sentence whatsoever could work here, as long as it begins with the word “An.”

The post Stock, flow, and two smoking regressions appeared first on Statistical Modeling, Causal Inference, and Social Science.

In a comment on our recent discussion of stock and flow, Tom Fiddaman writes:

Here’s an egregious example of statistical stock-flow confusion that got published.

Fiddaman is pointing to a post of his from 2011 discussing a paper that “examines the relationship between CO2 concentration and flooding in the US, and finds no significant impact.”

Here’s the title and abstract of the paper in question:

Has the magnitude of floods across the USA changed with global CO2 levels?

R. M. Hirsch & K. R. Ryberg

Abstract

Statistical relationships between annual floods at 200 long-term (85–127 years of record) streamgauges in the coterminous United States and the global mean carbon dioxide concentration (GMCO2) record are explored. The streamgauge locations are limited to those with little or no regulation or urban development. The coterminous US is divided into four large regions and stationary bootstrapping is used to evaluate if the patterns of these statistical associations are significantly different from what would be expected under the null hypothesis that flood magnitudes are independent of GMCO2. In none of the four regions defined in this study is there strong statistical evidence for flood magnitudes increasing with increasing GMCO2. One region, the southwest, showed a statistically significant negative relationship between GMCO2 and flood magnitudes. The statistical methods applied compensate both for the inter-site correlation of flood magnitudes and the shorter-term (up to a few decades) serial correlation of floods.

And here’s Fiddaman’s takedown:

There are several serious problems here.

First, it ignores bathtub dynamics. The authors describe causality from CO2 -> energy balance -> temperature & precipitation -> flooding. But they regress:

ln(peak streamflow) = beta0 + beta1 × global mean CO2 + error

That alone is a fatal gaffe, because temperature and precipitation depend on the integration of the global energy balance. Integration renders simple pattern matching of cause and effect invalid. For example, if A influences B, with B as the integral of A, and A grows linearly with time, B will grow quadratically with time.
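The linear-in, quadratic-out behavior of an integrator is easy to verify numerically; here’s a minimal sketch (time units and horizon are arbitrary):

```python
# B is the running integral (cumulative sum) of A. If A grows linearly
# in t, B grows quadratically: B_t = t(t+1)/2 for A_t = t.
T = 50
A = list(range(1, T + 1))  # linear driver

B, total = [], 0
for a in A:
    total += a
    B.append(total)

# Matches the quadratic closed form exactly
assert all(B[t - 1] == t * (t + 1) // 2 for t in range(1, T + 1))
print(B[9], B[49])  # → 55 1275
```

So regressing the level of B on the level of A pattern-matches the wrong functional form, which is Fiddaman’s complaint about regressing flood magnitudes directly on CO2 concentration.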

This sort of thing comes up a lot in political science, where the right thing to do is not so clear. For example, suppose we’re comparing economic outcomes under Democratic and Republican presidents. The standard thing to look at is economic growth. But maybe it is changes in growth that should matter? As Jim Campbell points out, if you run a regression using economic growth as an outcome, you’re implicitly assuming that these effects on growth persist indefinitely, and that’s a strong assumption.

Anyway, back to Fiddaman’s critique of that climate-change regression:

The situation is actually worse than that for climate, because the system is not first order; you need at least a second-order model to do a decent job of approximating the global dynamics, and much higher order models to even think about simulating regional effects. At the very least, the authors might have explored the usual approach of taking first differences to undo the integration, though it seems likely that the data are too noisy for this to reveal much.

Second, it ignores a lot of other influences. The global energy balance, temperature and precipitation are influenced by a lot of natural and anthropogenic forcings in addition to CO2. Aerosols are particularly problematic since they offset the warming effect of CO2 and influence cloud formation directly. Since data for total GHG loads (CO2eq), total forcing and temperature, which are more proximate in the causal chain to precipitation, are readily available, using CO2 alone seems like willful ignorance. The authors also discuss issues “downstream” in the causal chain, with difficult-to-assess changes due to human disturbance of watersheds; while these seem plausible (not my area), they are not a good argument for the use of CO2. The authors also test other factors by including oscillatory climate indices, the AMO, PDO and ENSO, but these don’t address the problem either. . . .

I’ll skip a bit, but there’s one more point I wanted to pick up on:

Fourth, the treatment of nonlinearity and distributions is a bit fishy. The relationship between CO2 and forcing is logarithmic, which is captured in the regression equation, but I’m surprised that there aren’t other important nonlinearities or nonnormalities. Isn’t flooding heavy-tailed, for example? I’d like to see just a bit more physics in the model to handle such issues.

If there’s a monotonic pattern, it should show up even if the functional form is wrong. But in this case Fiddaman has a point, in that the paper he’s criticizing makes a big deal about *not* finding a pattern, in which case, yes, using a less efficient model could be a problem.

Similarly with this point:

Fifth, I question the approach of estimating each watershed individually, then examining the distribution of results. The signal to noise ratio on any individual watershed is probably pretty horrible, so one ought to be able to do a lot better with some spatial pooling of the betas (which would also help with issue three above).

Fiddaman concludes:

I think that it’s actually interesting to hold your nose and use linear regression as a simple screening tool, in spite of violated assumptions. If a relationship is strong, you may still find it. If you don’t find it, that may not tell you much, other than that you need better methods. The authors seem to hold to this philosophy in the conclusion, though it doesn’t come across that way in the abstract.

The post An inundation of significance tests appeared first on Statistical Modeling, Causal Inference, and Social Science.

The last three research papers I’ve read contained 51, 49 and 70 significance tests (counting conservatively), and to the extent that I’m able to see the forest for the trees, mostly poorly motivated ones.

I wonder what the motivation behind this deluge of tests is.

Is it wanton obfuscation (seems unlikely), a legalistic conception of what research papers are (i.e. ‘don’t blame us, we’ve run that test, too!’), or something else? Perhaps you know of some interesting paper that discusses this phenomenon? Or whether it has an established name?

It’s not primarily the multiple comparisons problem but more the inundation aspect I’m interested in here.

He also links to this post of his on the topic. Just a quick comment on his post: he is trying to estimate a treatment effect via a before-after comparison; he’s plotting y-x vs. x and running into a big regression-to-the-mean pattern.

Actually he’s plotting y/x not y-x but that’s irrelevant for the present discussion.

Anyway, I think he should have a treatment and a control group and plot y vs. x (or, in this case, log y vs. log x) with separate lines for the two groups: the difference between the lines represents the treatment effect.

I don’t have an example with his data, but the general idea is to plot the after-measurement against the before-measurement, with separate fitted lines for the treatment and control groups.
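As a hedged sketch of the pitfall (all numbers invented): with before scores x and after scores y that are completely unrelated, the change-vs-baseline comparison still shows a strong negative correlation. That’s regression to the mean, not a treatment effect.

```python
import random

random.seed(0)

n = 1000
x = [random.gauss(0, 1) for _ in range(n)]  # "before" scores
y = [random.gauss(0, 1) for _ in range(n)]  # "after" scores, independent of x

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

r_yx = corr(y, x)                                  # near 0: no relationship
r_change = corr([b - a for a, b in zip(x, y)], x)  # near -0.71: pure artifact
print(round(r_yx, 2), round(r_change, 2))
```

Plotting y vs. x with separate fitted lines for the two groups avoids the artifact; the vertical gap between the lines estimates the treatment effect.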

Back to the original question: I think it’s good to display more rather than less but I agree with Vanhove that if you want to display more, just display raw data. Or, if you want to show a bunch of comparisons, please structure them in a reasonable way and display as a readable grid. All these p-values in the text, they’re just a mess.

Thinking about this from a historical perspective, I feel (or, at least, hope) that null hypothesis significance tests—whether expressed using p-values, Bayes factors, or any other approach—are on their way out. But, until they go away, we may be seeing more and more of them leading to the final flame-out.

In the immortal words of Jim Thompson, it’s always lightest just before the dark.

The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Tues:** Stock, flow, and two smoking regressions

**Wed:** What’s the worst joke you’ve ever heard?

**Thurs:** Cracked.com > Huffington Post, Wall Street Journal, New York Times

**Fri:** Measurement is part of design

**Sat:** “17 Baby Names You Didn’t Know Were Totally Made Up”

**Sun:** What to do to train to apply statistical models to political science and public policy issues

The post Chess + statistics + plagiarism, again! appeared first on Statistical Modeling, Causal Inference, and Social Science.

In response to this post (in which I noted that the Elo chess rating system is a static model which, paradoxically, is used for the purposes of studying changes), Keith Knight writes:

It’s notable that Glickman’s work is related to some research by Harry Joe at UBC, which in turn was inspired by data provided by Nathan Divinsky who was (wait for it) a co-author of one of your favourite plagiarists, Raymond Keene.

In the 1980s, Keene and Divinsky wrote a book, Warriors of the Mind, which included an all-time ranking of the greatest chess players – it was actually Harry Joe who did the modeling and analysis although Keene and Divinsky didn’t really give him credit for it. (Divinsky was a very colourful character – he owned a very popular restaurant in Vancouver and was briefly married to future Canadian Prime Minister Kim Campbell. Certainly not your typical Math professor!)

I wonder what Chrissy would think of this?

Knight continues:

And speaking of plagiarism, check out the two attached papers. Somewhat amusingly (and to their credit), the plagiarized version actually cites the original paper!

“Double Blind,” indeed!

The post Kaiser’s beef appeared first on Statistical Modeling, Causal Inference, and Social Science.

The Numbersense guy writes in:

Have you seen this?

It has one of your pet peeves… let’s draw some data-driven line in the categorical variable and show significance.

To make it worse, he adds a final paragraph saying essentially this is just a silly exercise that I hastily put together and don’t take it seriously!

Kaiser was pointing me to a news article by economist Justin Wolfers, entitled “Fewer Women Run Big Companies Than Men Named John.”

Here’s what I wrote back to Kaiser:

I took a look and it doesn’t seem so bad. Basically the sex difference is so huge that it can be dramatized in this clever way. So I’m not quite sure what you dislike about it.

Kaiser explained:

Here’s my beef with it…

Just to make up some numbers. Let’s say there are 500 male CEOs and 25 female CEOs so the aggregate index is 20.

Instead of reporting that number, they reduce the count of male CEOs while keeping the females fixed. So let’s say 200 of those male CEOs are named Richard, William, John, and whatever the 4th name is. So they now report an index of 200/25 = 8.

Problem 1 is that this only “works” if they cherry pick the top male names, probably the 4 most common names from the period where most CEOs are born. As he admitted at the end, this index is not robust as names change in popularity over time. Kind of like that economist who said that anyone whose surname begins with A-N has a better chance of winning the Nobel Prize (or some such thing).

Problem 2: we may need an experiment to discover which of the following two statements is more effective/persuasive:

a) there are 20 male CEOs for every female CEO in America

b) there are 8 male CEOs named Richard, William, John and David for every female CEO in America

For me, I think b) is more complex to understand, and in fact the magnitude of the issue has been artificially reduced by restricting to 4 names!

How about that?

I replied that I agree that the picking-names approach destroys much of the quantitative comparisons. Still, I think the point here is that the differences are so huge that this doesn’t matter. It’s a dramatic comparison. The relevant point, perhaps, is that these ratios shouldn’t be used as any sort of “index” for comparisons between scenarios. If Wolfers just wants to present the story as a way of dramatizing the underrepresentation of women, that works. But it would not be correct to use this to compare representation of women in different fields or in different eras.
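Kaiser’s toy numbers reduce to two lines of arithmetic, and the sketch below (using the counts invented in his email above) shows why the name-based version shouldn’t be treated as an index: it moves whenever the chosen set of names changes, while the underlying imbalance doesn’t.

```python
male_ceos, female_ceos = 500, 25  # Kaiser's made-up counts
named_subset = 200                # males with the 4 chosen first names

plain_ratio = male_ceos / female_ceos    # 20 male CEOs per female CEO
name_index = named_subset / female_ceos  # 8 "named" male CEOs per female CEO
print(plain_ratio, name_index)  # → 20.0 8.0
```

Pick five names instead of four and the “index” changes, with no change at all in the data.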

I wonder if the problem is that econ has these gimmicky measures, for example the cost-of-living index constructed using the price of the Big Mac, etc. I don’t know why, but these sorts of gimmicks seem to have some sort of appeal.

The post John Lott as possible template for future career of “Bruno” Lacour appeared first on Statistical Modeling, Causal Inference, and Social Science.

The recent story about the retracted paper on political persuasion reminded me of the last time that a politically loaded survey was discredited because the researcher couldn’t come up with the data.

I’m referring to John Lott, the “economist, political commentator, and gun rights advocate” (in the words of Wikipedia) who is perhaps more well known on the internet by the name of Mary Rosh, an alter ego he created to respond to negative comments (among other things, Lott used the Rosh handle to refer to himself as “the best professor I ever had”).

Again from Wikipedia:

Lott claimed to have undertaken a national survey of 2,424 respondents in 1997, the results of which were the source for claims he had made beginning in 1997. However, in 2000 Lott was unable to produce the data, or any records showing that the survey had been undertaken. He said the 1997 hard drive crash that had affected several projects with co-authors had destroyed his survey data set, the original tally sheets had been abandoned with other personal property in his move from Chicago to Yale, and he could not recall the names of any of the students who he said had worked on it. . . .

On the other hand, ~~Rosh~~ Lott has continued to insist that the survey actually happened. So he shares that with Michael LaCour, the coauthor of the recently retracted political science paper.

I have nothing particularly new to say about either case, but I was thinking that some enterprising reporter might call up Lott and see what he thinks about all this.

Also, Lott’s career offers some clues as to what might happen next to LaCour. Lott’s academic career dissipated and now he seems to spend his time running an organization called the Crime Prevention Research Center which is staffed by conservative scholars, so I guess he pays the bills by raising funds for this group.

One could imagine LaCour doing something similar—but he got caught with data problems *before* receiving his UCLA social science PhD, so his academic credentials aren’t so strong. But, speaking more generally, given that it appears that respected scholars (and, I suppose, funders, but I can’t be so sure of that as I don’t see a list of funders on the website) are willing to work with Lott, despite the credibility questions surrounding his research, I suppose that the same could occur with LaCour. Perhaps, like Lott, he has the right mixture of ability, brazenness, and political commitment to have a successful career in advocacy.

The above might all seem like unseemly speculation—and maybe it is—but this sort of thing is important. Social science isn’t just about the research (or, in this case, the false claims masquerading as research); it’s also about the social and political networks that promote the work.

The post Creativity is the ability to see relationships where none exist appeared first on Statistical Modeling, Causal Inference, and Social Science.

Brent Goldfarb and Andrew King, in a paper to appear in the journal Strategic Management, write:

In a recent issue of this journal, Bettis (2012) reports a conversation with a graduate student who forthrightly announced that he had been trained by faculty to “search for asterisks”. The student explained that he sifted through large databases for statistically significant results and “[w]hen such models were found, he helped his mentors propose theories and hypotheses on the basis of which the ‘asterisks’ might be explained” (p. 109). Such an approach, Bettis notes, is an excellent way to find seemingly meaningful patterns in random data. He expresses concern that these practices are common, but notes that unfortunately “we simply do not have any baseline data on how big or small are the problems” (Bettis, 2012: p. 112).

In this article, we [Goldfarb and King] address the need for empirical evidence . . . in research on strategic management. . . .

Bettis (2012) reports that computer power now allows researchers to sift repeatedly through data in search of patterns. Such specification searches can greatly increase the probability of finding an apparently meaningful relationship in random data. . . . just by trying four functional forms for X, a researcher can increase the chance of a false positive from one in twenty to about one in six. . . .

Simmons et al. (2011) contend that some authors also push almost-significant results over thresholds by removing or gathering more data, by dropping experimental conditions, by adding covariates to specified models, and so on.

And, beyond this, there’s the garden of forking paths: even if a researcher performs only *one* analysis of a given dataset, the multiplicity of choices involved in data coding and analysis are such that we can typically assume that different comparisons would have been studied had the data been different. That is, you can have misleading p-values without any cheating or “fishing” or “hacking” going on.

Goldfarb and King continue:

When evidence is uncertain, a single example is often considered representative of the whole (Tversky & Kahneman, 1973). Such inference is incorrect, however, if selection occurs on significant results. In fact, if “significant” results are more likely to be published, coefficient estimates will inflate the true magnitude of the studied effect — particularly if a low powered test has been used (Stanley, 2005).

They conducted a study of “estimates reported in 300 published articles in a random stratified sample from five top outlets for research on strategic management . . . [and] 60 additional proposals submitted to three prestigious strategy conferences.”

And here’s what they find:

We estimate that between 24% and 40% of published findings based on “statistically significant” (i.e. p<0.05) coefficients could not be distinguished from the Null if the tests were repeated once. Our best guess is that for about 70% of non-confirmed results, the coefficient should be interpreted to be zero. For the remaining 30%, the true B is not zero, but insufficient test power prevents an immediate replication of a significant finding. We also calculate that the magnitude of coefficient estimates of most true effects are inflated by 13%.

I’m surprised their estimated exaggeration factor is only 13%; I’d have expected much higher, even if only conditioning on “true” effects (however that is defined).

I have not tried to follow the details of the authors’ data collection and analysis process and thus can neither criticize nor endorse their specific findings. But I’m sympathetic to their general goals and perspective.

As a commenter wrote in an earlier discussion, it is the combination of a strawman with the concept of “statistical significance” (ie the filtering step) that seems to be a problem, not the p-value per se.

The post Weggy update: it just gets sadder and sadder appeared first on Statistical Modeling, Causal Inference, and Social Science.

Best quote from Mashey’s write-up:

None of this made any sense to me, but then I am no lawyer. As it turned out, it made no sense to good lawyers either . . .

Lifting an encyclopedia is pretty impressive and requires real muscles. Lifting from Wikipedia, not so much.

In all seriousness, this is really bad behavior. Copying and garbling material from other sources and not giving the references? Uncool. Refusing to admit error? That’ll get you a regular column in a national newspaper. A 2 million dollar lawsuit? That’s unconscionable escalation, it goes beyond chutzpah into destructive behavior. I don’t imagine that ~~Raymond Keene~~ Bernard Dunn would be happy about what is being done in his name.

The post Can talk therapy halve the rate of cancer recurrence? How to think about the statistical significance of this finding? Is it just another example of the garden of forking paths? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’m writing to you now about another matter about which I hope you will offer an opinion. Here is a critique of a study, as well as the original study that claimed to find an effect of group psychotherapy on time to recurrence and survival of early breast cancer patients. In the critique I note that confidence intervals for the odds ratio of raw events for both death and recurrence have P values between .3 and .5. The authors’ claims are based on dubious adjusted analyses. I’ve tried for a number of years to get the data for reanalysis, but the latest effort ended in the compliance officer for Ohio State University pleading that the data were the investigator’s intellectual property. The response apparently written by the investigator invoked you as a rationale for her analytic decisions. I wonder if you could comment on that.

Here is the author’s invoking of you:

In analyzing the data and writing the manuscript, Andersen et al. (2008) were fully aware of opinions and data regarding the use of covariates. See, for example, a recent discussion (2011) among investigators about this issue and the response of Andrew Gelman, an expert on applied Bayesian data analysis and hierarchical models. Gelman’s (2011) provided positive recommendations for covariate inclusion and are corroborated by studies examining covariate selection and entry, which appeared prior to and now following Gelman’s statement in 2011.

Here’s what Coyne sent me:

“Psychologic Intervention Improves Survival for Breast Cancer Patients: A Randomized Clinical Trial,” a 2008 article by Barbara Andersen, Hae-Chung Yang, William Farrar, Deanna Golden-Kreutz, Charles Emery, Lisa Thornton, Donn Young, and William Carson, which reported that a talk-therapy intervention reduced the risk of breast cancer recurrence and death from breast cancer, with a hazard ratio of approximately 0.5 (that is, the instantaneous risk of recurrence, or of death, at any point was claimed to have been reduced by half).

“Finding What Is Not There: Unwarranted Claims of an Effect of Psychosocial Intervention on Recurrence and Survival,” a 2009 article by Michael Stefanek, Steven Palmer, Brett Thombs, and James Coyne, arguing that the claims in the aforementioned article were implausible on substantive grounds and could be explained by a combination of chance variation and opportunistic statistical analysis.

A report from Ohio State University ruling that Barbara Andersen, the lead researcher on the controversial study, was not required to share her raw data with Stefanek et al., as they had requested so they could perform an independent analysis.

I took a look and replied to Coyne as follows:

1. I noticed this bit in the Ohio State report:

“The data, if disclosed, would reveal pending research ideas and techniques. Consequently, the release of such information would put those using such data for research purposes in a substantial competitive disadvantage as competitors and researchers would have access to the unpublished intellectual property of the University and its faculty and students.”

I see what they’re saying but it still seems a bit creepy to me. Think of it from the point of view of the funders of the study, or the taxpayers, or the tuition-paying students. I can’t imagine that they all care so much about the competitive position of the university (or, as they put it, the “University”).

Also, given that the article was published in 2008, how could it be that the data could “reveal pending research ideas and techniques” in 2014? I mean, sure, my research goes slowly too, but . . . 6 years???

I read the report you sent me, which has quotes from your comments along with the author’s responses. It looks like the committee did not make a judgment on this; they just seemed to report what you wrote, and what the authors wrote, without comment.

Regarding the more general points about preregistration, I have mixed feelings. On one hand, I agree that, because of the garden of forking paths, it’s hard to know what to make of the p-values that come out of a study that had flexible rules on data collection, multiple endpoints, and the like. On the other hand, I’ve never done a preregistered study myself. So I do feel that if a non-preregistered study is analyzed _appropriately_, it should be possible to get useful inferences. For example, if there are multiple endpoints, it’s appropriate to analyze all the endpoints, not to just pick one. When a study has a data-dependent stopping rule, the information used in the stopping rule should be included in the analysis. And so on.
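As a hedged illustration of the multiple-endpoints point (a made-up setup, nothing from the study under discussion): if a researcher can report whichever of k independent null endpoints comes out significant, the effective false-positive rate is well above the nominal 5%.

```python
import random

random.seed(2)

def best_of_k_significant(k, n_sims=20000):
    """Fraction of simulated studies in which at least one of k independent
    null endpoints reaches |z| > 1.96.  Reporting that endpoint after the
    fact makes the nominal 5% error rate an understatement."""
    hits = sum(
        any(abs(random.gauss(0, 1)) > 1.96 for _ in range(k))
        for _ in range(n_sims)
    )
    return hits / n_sims

print(best_of_k_significant(1))  # close to the nominal 0.05
print(best_of_k_significant(5))  # roughly 1 - 0.95**5, i.e. about 0.23
```

Analyzing all the endpoints together, rather than picking one, is what keeps the error rate honest.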

On a more specific point, you argued that the study in question used a power analysis that was too optimistic. You perhaps won’t be surprised to hear that I am inclined to believe you on that, given that all the incentives go in the direction of making optimistic assumptions about treatment effects. Looking at the details: “The trial was powered to detect a doubling of time to an endpoint . . . cancer recurrences.” Then in the report when they defend the power analysis, they talk about survival rates but I don’t see anything about time to an endpoint. They then retreat to a retrospective justification, that “we conducted the power analysis based on the best available data sources of the early 1990’s, and multiple funding agencies (DoD, NIH, ACS) evaluated and approved the validity of our study proposal and, most importantly, the power analysis for the trial.” So their defense here is ultimately procedural rather than substantive: Maybe their assumptions were too optimistic, but everyone was optimistic back then. This doesn’t much address the statistical concerns, but it is relevant to the implications of ethical malfeasance.

Regarding the reference to my work: Yes, I have recommended that, even in a randomized trial, it can make sense to control for relevant background variables. This is actually a continuing area of research in that I think that we should be using informative priors to stabilize these adjustments, to get something more reasonable than would be obtained by simple least squares. I do agree with you that it is appropriate to do an unadjusted analysis as well. Unfortunately researchers do not always realize this.

Regarding some of the details of the regression analysis: the discussion brings up various rules and guidelines, but really it depends on contexts. I agree with the report that it can be ok for the number of adjustment variables to exceed 1/10 of the number of data points. There’s also some discussion of backward elimination of predictors. I agree with you that this is in general a bad idea (and certainly the goal in such a setting should not be “to reach a parsimonious model” as claimed in this report). However, practical adjustment can involve adding and removing variables, and this can sometimes take the form of backward elimination. So it’s hard to say what’s right, just from this discussion. I went into the paper and they wrote, “By using a backward elimination procedure, any covariates with P < .25 with an endpoint remained in the final model for that endpoint.” This indeed is poor practice; regrettably, it may well be standard practice.

2. Now I was curious so I read all of the 2008 paper. I was surprised to hear that psychological intervention improves survival for breast cancer patients. It says that the intervention will “alter health behaviors, and maintain adherence to cancer treatment and care.” Sure, ok, but, still, it’s pretty hard to imagine that this will double the average time to recurrence. Doubling is a lot! Later in the paper they mention “smoking cessation” as one of the goals of the treatment. I assume that smoking cessation would reduce recurrence rates. But I don’t see any data on smoking in the paper, so I don’t know what to do with this.

I’m also puzzled because, in their response to your comments, the author or authors say that time-to-recurrence was the unambiguous primary endpoint, but in the abstract they don’t say anything about time-to-recurrence, instead giving proportion of recurrence and survival rates conditional on the time period of the study. Also, the title says Survival, not Time to Recurrence.

The estimated effect sizes (an approximate 50% reduction in recurrence and a 50% reduction in death) are implausibly large, but of course this is what you get from the statistical significance filter. Given the size of the study, the reported effects would *have* to be just about this large, else they wouldn’t be statistically significant.
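A back-of-envelope calculation, with assumed round numbers rather than anything taken from the paper, shows why: with about 110 patients per arm and a base rate near 30%, only a large observed difference can clear the significance threshold at all.

```python
import math

# Assumed round numbers (not from the paper): roughly 110 patients per arm
# and a recurrence rate near 30%.
p, n_per_arm = 0.30, 110

# Standard error of a difference in proportions between two arms this size.
se_diff = math.sqrt(2 * p * (1 - p) / n_per_arm)

# Any observed difference smaller than this cannot reach p < 0.05.
min_significant = 1.96 * se_diff
print(round(min_significant, 3))  # roughly 0.12, i.e. a 12-point difference
```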

OK, now to the results: “With 11 years median follow-up, disease recurrence had occurred for 62 of 212 (29%) women, 29 in the Intervention arm and 33 in the Assessment–only arm.” Ummm, that’s 29/114 = 25% for the intervention group and 33/113 = 29% in the control group, a difference of 4 percentage points. So I don’t see how they can get those dramatic results shown in figure 3. To put it another way, in their dataset, the probability of recurrence-free survival was 75/114 = 66% in the treatment group and 65/113 = 58% in the control group. (Or, if you exclude the people who dropped out of the study, 75/109 = 69% in the treatment group and 65/103 = 63% in the control group). A 6 or 8 percentage point difference ain’t nothing, but Figure 3 shows much bigger effects. OK, I see, Figure 3 is just showing survival for the first 5 years. But, if differences are so dramatic after 5 years and then reduce in the following years, that’s interesting too. Overall I’m baffled by the way in which this article goes back and forth between different time durations.
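These raw proportions are easy to check directly from the counts quoted above:

```python
# Raw counts as quoted above from the 2008 paper.
print(f"recurrence: {29/114:.0%} (intervention) vs {33/113:.0%} (control)")
print(f"recurrence-free survival: {75/114:.0%} vs {65/113:.0%}")
print(f"excluding dropouts: {75/109:.0%} vs {65/103:.0%}")
```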

3. Now time to read your paper with Stefanek et al. Hmmm, at one point you write, “There were no differences in unadjusted rates of recurrence or survival between the intervention and control groups.” But there were such differences, no? The 4% reported above? I agree that this difference is not statistically significant and can be explained by chance, but I wouldn’t call it “no difference.”

Overall, I am sympathetic with your critique, partly on general grounds and partly because, yes, there are lots of reasonable adjustments that could be done to these data. The authors of the article in question spend lots of time saying that the treatment and control groups are similar on their pre-treatment variables—but then it turns out that the adjustment for pre-treatment variables is necessary for their findings. This does seem like a “garden of forking paths” situation to me. And the response of the author or authors is, sadly, consistent with what I’ve seen in other settings: a high level of defensiveness coupled with a seeming lack of interest in doing anything better.

I am glad that it was possible for you to publish this critique. Sometimes it seems that this sort of criticism faces a high hurdle to reach publication.

I sent the above to Coyne, who added this:

For me it’s a matter of not only scientific integrity, but what we can reasonably tell cancer patients about what will extend their lives. They are vulnerable and predisposed to grab at anything they can, but also to feel responsible when their cancer progresses in the face of information that it should be controllable by positive thinking or by taking advantage of some psychological intervention. I happen to believe in support groups as an opportunity for cancer patients to find support and the rewards of offering support to others in the same predicament. If patients want those experiences, they should go to readily available support groups. However they should not go with the illusion that it is prolonging their life or that not going is shortening it.

I have done a rather extensive and thorough systematic review and analysis of the literature, and I can find no evidence that, in clinical trials in which survival was an a priori outcome, an advantage was found for psychological interventions.


The post BREAKING . . . Princeton decides to un-hire Kim Jong-Un for tenure-track assistant professorship in aeronautical engineering appeared first on Statistical Modeling, Causal Inference, and Social Science.

Full story here.

Here’s the official quote:

As you’ve correctly noted, at this time the individual is not a Princeton University employee. We will review all available information and determine next steps.

And here’s what Kim has to say:

I’m gathering evidence and relevant information so I can provide a single comprehensive response. I will do so at my earliest opportunity.


The post “In my previous post on the topic, I expressed surprise at the published claim but no skepticism” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Don’t believe everything you read in the tabloids, that’s for sure.

**P.S.** I googled to see what else was up with this story and found this article which reported that someone claimed that Don Green’s retraction (see above link for details) was the first for political science.

I guess it depends on how you define “retraction” and how you define “political science.” Cos a couple of years ago I published this:

In the paper, “Should the Democrats move to the left on economic policy?” AOAS 2 (2), 536-549 (2008), by Andrew Gelman and Cexun Jeffrey Cai, because of a data coding error on one of the variables, all our analysis of social issues is incorrect. Thus, arguably, all of Section 3 is wrong until proven otherwise. We thank Yang Yang Hu for discovering this error and demonstrating its importance.

Officially this is a correction not a retraction. And, although it’s entirely a political science paper, it was not published in a political science journal. So maybe it doesn’t count. I’d guess there are others, though. I don’t think Aristotle ever retracted his claim that slavery is cool, but give him time, the guy has a lot on his plate.


The post Objects of the class “Foghorn Leghorn” appeared first on Statistical Modeling, Causal Inference, and Social Science.

The other day I saw some kids trying to tell knock-knock jokes. The only one they really knew was the one that goes: Knock knock. Who’s there? Banana. Banana who? Knock knock. Who’s there? Banana. Banana who? Knock knock. Who’s there? Orange. Orange who? Orange you glad I didn’t say banana?

Now that’s a fine knock-knock joke, among the best of its kind, but what interests me here is that it’s clearly not a basic k-k; rather, it’s an inspired parody of the form. For this to be the most famous knock-knock joke—in some circles, the only knock-knock joke—seems somehow wrong to me. It would be as if everybody were familiar with Duchamp’s Mona-Lisa-with-a-moustache while never having heard of Leonardo’s original.

Here’s another example: Spinal Tap, which lots of people have heard of without being familiar with the hair-metal acts that inspired it.

The poems in Alice’s Adventures in Wonderland and Through the Looking Glass are far far more famous now than the objects of their parody.

I call this the Foghorn Leghorn category, after the Warner Brothers cartoon rooster (“I say, son . . . that’s a joke, son”) who apparently was based on a famous radio character named Senator Claghorn. Claghorn has long been forgotten, but, thanks to reruns, we all know about that silly rooster.

And I think “Back in the USSR” is much better known than the original “Back in the USA.”

Here’s my definition: a parody that is more famous than the original.

**Some previous cultural concepts**

Objects of the class “Whoopi Goldberg”

Objects of the class “Weekend at Bernie’s”

**P.S.** Commenter Jhe has a theory:

I’m not entirely surprised that often the parody is better known than its object. The parody illuminates some aspect of culture which did not necessarily stand out until the parody came along. The parody takes the class of objects being parodied and makes them obvious and memorable.


The post Bayesian inference: The advantages and the risks appeared first on Statistical Modeling, Causal Inference, and Social Science.

I would not refer to the existing prediction algorithm as frequentist. Frequentist refers to the evaluation of statistical procedures but it doesn’t really say where the estimate or prediction comes from. Rather, I’d say that the Bayesian prediction approach succeeds by adding model structure and prior information.

The advantages of Bayesian inference include:

1. Including good information should improve prediction,

2. Including structure can allow the method to incorporate more data (for example, hierarchical modeling allows partial pooling so that external data can be included in a model even if these external data share only some characteristics with the current data being modeled).

The risks of Bayesian inference include:

3. If the prior information is wrong, it can send inferences in the wrong direction.

4. Bayes inference combines different sources of information; thus it is no longer an encapsulation of a particular dataset (which is sometimes desired, for reasons that go beyond immediate predictive accuracy and instead touch on issues of statistical communication).

OK, that’s all background. The point is that we can compare Bayesian inference with existing methods. The point is not that the philosophies of inference are different—it’s not Bayes vs frequentist, despite what you sometimes hear. Rather, the issue is that we’re adding structure and prior information and partial pooling, and we have every reason to think this will improve predictive performance, but we want to check.
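As a minimal sketch of the partial-pooling idea in point 2 (toy numbers, a normal-normal model, and an assumed between-group standard deviation tau, none of it from any study discussed here):

```python
# Each group's raw estimate is pulled toward the grand mean, with noisier
# groups pulled harder; this is the shrinkage that partial pooling buys.
group_estimates = [2.0, 0.5, -1.0, 3.5]  # made-up per-group estimates
group_ses = [0.5, 2.0, 1.0, 3.0]         # their standard errors
tau = 1.0                                # assumed between-group sd

grand_mean = sum(group_estimates) / len(group_estimates)

def partial_pool(est, se, mu, tau):
    """Precision-weighted compromise between a group's own estimate and
    the grand mean (normal-normal model)."""
    w = (1 / se**2) / (1 / se**2 + 1 / tau**2)
    return w * est + (1 - w) * mu

pooled = [partial_pool(e, s, grand_mean, tau)
          for e, s in zip(group_estimates, group_ses)]
print([round(x, 2) for x in pooled])
```

Note how the noisy third and fourth groups end up close to the grand mean, while the precise first group keeps most of its own estimate.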

To evaluate, I think we can pretty much do what you say: use the ROC curve as a basic summary, and do graphical exploration, cross-validation (and related methods such as WAIC), and external validation.
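For the ROC summary, here is a minimal rank-based AUC computation (my own sketch with made-up scores; in practice the scores would be held-out or cross-validated predictions):

```python
# AUC via the rank interpretation: the probability that a randomly chosen
# positive case outranks a randomly chosen negative case, ties counting half.
def auc(scores_pos, scores_neg):
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))

print(auc([0.9, 0.8, 0.6], [0.7, 0.4, 0.2]))  # 8 of 9 pairs ranked correctly
```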


The post New Alan Turing preprint on Arxiv! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Dan Kahan writes:

I know you are on 30-day delay, but since the blog version of you will be talking about Bayesian inference in couple of hours, you might like to look at paper by Turing, who is on 70-yr delay thanks to British declassification system, who addresses the utility of using likelihood ratios for helping to form a practical measure of evidentiary weight (“bans” & “decibans”) that can guide cryptographers (who presumably will develop sense of professional judgment calibrated to the same).

Actually it’s more like a 60-day delay, but whatever.

The Turing article is called “The Applications of Probability to Cryptography,” it was written during the Second World War, and it’s awesome.

Here’s an excerpt:

The evidence concerning the possibility of an event occurring usually divides into a part about which statistics are available, or some mathematical method can be applied, and a less definite part about which one can only use one’s judgement. Suppose for example that a new kind of traffic has turned up and that only three messages are available. Each message has the letter V in the 17th place and G in the 18th place. We want to know the probability that it is a general rule that we should find V and G in these places. We first have to decide how probable it is that a cipher would have such a rule, and as regards this one can probably only guess, and my guess would be about 1/5,000,000. This judgement is not entirely a guess; some rather insecure mathematical reasoning has gone into it, something like this:-

The chance of there being a rule that two consecutive letters somewhere after the 10th should have certain fixed values seems to be about 1/500 (this is a complete guess). The chance of the letters being the 17th and 18th is about 1/15 (another guess, but not quite as much in the air). The probability of a letter being V or G is 1/676 (hardly a guess at all, but expressing a judgement that there is no special virtue in the bigramme VG). Hence the chance is 1/(500 × 15 × 676) or about 1/5,000,000. This is however all so vague, that it is more usual to make the judgment “1/5,000,000” without explanation.

The question as to what is the chance of having a rule of this kind might of course be resolved by statistics of some kind, but there is no point in having this very accurate, and of course the experience of the cryptographer itself forms a kind of statistics.

The remainder of the problem is then solved quite mathematically. . . .
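Turing’s combined guess is straightforward to check:

```python
from fractions import Fraction

# Turing's three guessed factors, as quoted above.
p_rule = Fraction(1, 500)      # a rule fixing two consecutive letters
p_position = Fraction(1, 15)   # that the letters are the 17th and 18th
p_letters = Fraction(1, 676)   # that the fixed letters are V and G (1/26**2)

combined = p_rule * p_position * p_letters
print(combined)  # 1/5070000, which Turing rounds to "about 1/5,000,000"
```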

He’s so goddamn *reasonable*. He’s everything I aspire to.

Reasonableness is, I believe, an underrated trait in research. By “reasonable,” I don’t mean a supine acceptance of the status quo, but rather a sense of the connections of the world, a sort of generalized numeracy, an openness and honesty about one’s sources of information. “This judgement is not entirely a guess; some rather insecure mathematical reasoning has gone into it”—exactly!

Damn this guy is good. I’m glad to see he’s finally posting his stuff on Arxiv.


The post Bob Carpenter’s favorite books on GUI design and programming appeared first on Statistical Modeling, Causal Inference, and Social Science.

I would highly recommend two books that changed the way I thought about GUI design (though I’ve read a lot of them):

* Jeff Johnson. GUI Bloopers.

I read the first edition in book form and the second in draft form (the editor contacted me based on my enthusiastic Amazon feedback, which was mighty surprising). I also like

* Stephen Krug. Don’t Make Me Think.

I think I read the first edition and it’s now up to the 3rd. And also this one about general design:

* Robin Williams. Non-Designers’ Design Book. (not web specific, but great advice in general on layout)

It’s also many editions past where I first read it. I’m also a huge fan of

* Hunt and Thomas. The Pragmatic Programmer.

for general programming and development advice. We’ve implemented most of the recommended practices in Stan’s workflow.


The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Tues:** Bayesian inference: The advantages and the risks

**Wed:** Objects of the class “Foghorn Leghorn”

**Thurs:** “Physical Models of Living Systems”

**Fri:** Creativity is the ability to see relationships where none exist

**Sat:** Kaiser’s beef

**Sun:** Chess + statistics + plagiarism, again!

The post “Do we have any recommendations for priors for student_t’s degrees of freedom parameter?” appeared first on Statistical Modeling, Causal Inference, and Social Science.

I recommend as an easy default option

real nu;

nu ~ gamma(2, 0.1);

This was proposed and analysed by Juárez and Steel (2010) (Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics 28, 52–66). Juárez and Steel compare this to the Jeffreys prior and report that the difference is small. Simpson et al. (2014) (arXiv:1403.4630) propose a theoretically well justified “penalised complexity (PC) prior,” which they show to have good behavior for the degrees of freedom, too. The PC prior might be the best choice, but it requires numerical computation of the prior (which could be computed on a grid and interpolated, etc.). It would be feasible to implement it in Stan, but it would require some work. Unfortunately no one has compared the PC prior and this gamma prior directly, but based on discussion with Daniel Simpson, although the PC prior would be better, this gamma(2, 0.1) prior is not a bad choice. Thus, I would use it until someone implements the PC prior for the degrees of freedom of the Student’s t in Stan.
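To get a feel for what this prior says (a quick check of my own; Stan parameterizes the gamma by shape and rate, so gamma(2, 0.1) has mean shape/rate = 20, while Python’s `gammavariate` takes shape and scale = 1/rate):

```python
import random
import statistics

random.seed(3)

# Draw from gamma(shape=2, rate=0.1), i.e. scale = 1/0.1 = 10.
draws = [random.gammavariate(2, 1 / 0.1) for _ in range(20000)]

prior_mean = statistics.mean(draws)
frac_heavy = sum(d < 10 for d in draws) / len(draws)
print(round(prior_mean, 1))  # near 20
print(round(frac_heavy, 2))  # about a quarter of the mass below nu = 10
```

So the prior is centered on fairly Gaussian-looking t distributions but keeps substantial mass on strongly heavy-tailed ones.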


The post Are you ready to go fishing in the data lake? appeared first on Statistical Modeling, Causal Inference, and Social Science.


The post Apology to George A. Romero appeared first on Statistical Modeling, Causal Inference, and Social Science.

Good Afternoon Mr. Gelman,

I am reaching out to you on behalf of Pearson Education who would like to license an excerpt of text from How Many Zombies Do You Know? for the following, upcoming textbook program:

Title: Writing Today

Author: Richard Johnson-Sheehan and Charles Paine

Edition: 3

Anticipated Pub Date: 01/2015

For this text, beginning with “The zombie menace has so far,” (page 101) and ending with “Journal of the American Statistical Association,” (409-423), Pearson would like to request US & Canada distribution, English language, a 150,000 print run, and a 7 year term in all print and non-print media versions, including ancillaries, derivatives and versions whole or in part.

The requested material is approximately 550 words and was originally published March 31, 2010 on Scienceblogs.com

If you could please look over the attached license request letter and return it to us, it would be much appreciated. If you need to draw up an invoice, please include all granted rights within the body of your invoice (the above, underlined portion). . . .

I decided to charge them $150 (I had no idea, I just made that number up) and I sent along the following message:

Also, at the bottom of page 2, they have a typo in my name (so please cross that out and replace with my actual last name!) and also please cross out “Author: George A. Romano”. Finally, please cross out the link (http://scienceblogs.com/appliedstatistics/2010/07/01/how-many-zombies-do-you-know-u/) and replace by: http://arxiv.org/pdf/1003.6087.pdf

I got the $150 and they told me they’d send me a copy of the book. And last month it came in the mail. So cool! I’ve always fancied myself a writer so I loved the idea of having an entry in a college writing textbook. (Yeah, yeah, I know some people say that college is a place where kids learn how to write badly. Whatever.)

I quickly performed what Yair calls a “Washington read” and found my article. It’s right there on page 266, one of the readings in the Analytical Reports chapter. B-b-b-ut . . . they altered my deathless prose!

– They removed the article’s abstract. That’s fine, the abstract wasn’t so funny.

– My name in the author list pointed to the following hilarious footnote which they removed: “Department of Statistics, Columbia University, New York. Please do not tell my employer that I spent any time doing this.”

– George A. Romero’s name in the author list pointed to the following essential footnote which they removed: “Not really.”

– They changed “et al.” to “et. al.” That’s just embarrassing for them. Making a mistake is one thing, but *changing* something correct into a mistake, that’s just sad. It reminds me of when one of my coauthors noticed the word “comprises” in the paper I’d written and scratched it out and replaced it with “is comprised of.” Ugh.

– They removed Section 4 of the paper, which read:

Technical note

We originally wrote this article in Word, but then we converted it to Latex to make it look more like science.

Ouch. That hurts.

But the biggest problem was, by keeping Romero’s name on the article and removing the disclaimer, they made it look like Romero actually was involved in this silly endeavor. Indeed, in their intro they refer to “the authors,” and later they refer to “Gelman and Romero’s article.” That’s better than “Gleman and Romano,” but, still, it doesn’t seem right to assign any of the blame for this to Romero. I’d have no problem sharing the credit but I have no idea how he’d feel about it.

At least they kept in the ZDate joke.

**P.S.** Overall I’m happy to see my article in this textbook. But it’s funny to see where it got messed up.


The post I actually think this infographic is ok appeared first on Statistical Modeling, Causal Inference, and Social Science.

So much to go with here, but I [Duckenfield] would just highlight the bars as the most egregious problem as it is implied that the same number of people are in each category. Obviously that is not the case — the top 1% and the 90-99% group, even if the coverage were comprehensive which it isn’t, would have fewer people in them than the decile groups.

But even more to the point, there is no reason to think that the top 10 jobs in each category all yield the same total number of jobs, since there must be dozens, if not more, broad categories of employment, most of which are left off. They’d have been better to have a big “other jobs” block at the end so the bars balanced out. But I suspect the coverage of the top 10 jobs in each category is under 50%, so you’d see a lot of “other.”

And this leads to the implication that the total number of people making certain incomes might be the same. But all that is the same is their percentage of the employment among the top 10 jobs in the income range.

And I’m not entirely sure that the median salary, which conveniently looks to be $40K, is correct. Household income is more like $50K and the median wage earner had a *median* net compensation according to Social Security of around $27500 (although I expect that has deductions and other withholdings from SS wages like health insurance). And the *average* net compensation was about $42500.

My reply: I agree that these graphs have problems but I kinda like them because they do contain a lot of information, if you don’t over-interpret them.


The post The connection between varying treatment effects and the well-known optimism of published research findings appeared first on Statistical Modeling, Causal Inference, and Social Science.

I thought this article [by Hunt Allcott and Sendhil Mullainathan], although already a couple of years old, fits very well into the themes of your blog—in particular the idea that the “true” treatment effect is likely to vary a lot depending on all kinds of factors that we can and cannot observe, and that especially large estimated effects are likely telling us as much about the sample as about the Secrets of the Social World.

They find that sites that choose to participate in randomized controlled trials are selected on characteristics correlated with the estimated treatment effect, and they have some ideas about “suggestive tests of external validity.”

I’d be curious about where you agree and disagree with their approach.

I pointed this to Avi, who wrote:

I’m actually a big fan of this paper (and of Hunt and Sendhil). Rather than look at the original NBER paper, however, I’d point you to Hunt’s recent revision, which looks at 111 experiments (!!) rather than the 14 experiments analyzed in the first study.

In particular, Hunt uses the first 10 experiments they conducted to predict the results of the next 101, finding that the predicted effect is significantly larger than the observed effect in those remaining trials.

Good stuff. I haven’t had the time to look at any of this, actually, but it does all seem both relevant to our discussions and important more generally. It’s good to see economists getting into this game. The questions they’re looking at are similar to issues of Type S and Type M error that are being discussed in psychology research, and I feel that, more broadly, we’re seeing a unification of models of the scientific process, going beyond the traditional “p less than .05” model of discovery. I’m feeling really good about all this.


]]>The post “I mean, what exact buttons do I have to hit?” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Unfortunately there’s the expectation that if you start with a scientific hypothesis and do a randomized experiment, there should be a high probability of learning an enduring truth. And if the subject area is exciting, there should consequently be a high probability of publication in a top journal, along with the career rewards that come with this. I’m not morally outraged by this: it seems fair enough that if you do good work, you get recognition. I certainly don’t complain if, after publishing some influential papers, I get grant funding and a raise in my salary, and so when I say that researchers expect some level of career success and recognition, I don’t mean this in a negative sense at all.

I do think, though, that this attitude is mistaken from a statistical perspective. If you study small effects and use noisy measurements, anything you happen to see is likely to be noise, as is explained in this now-classic article by Katherine Button et al. On statistical grounds, you can, and should, expect lots of strikeouts for every home run—call it the Dave Kingman model of science—or maybe no home runs at all. But the training implies otherwise, and people are just expecting the success rate you might see if Rod Carew were to get up to bat in your kid’s Little League game.
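The strikeouts-and-home-runs point can be checked with a quick simulation (my sketch, not from any particular paper; the numbers are illustrative): with a small true effect and noisy measurements, the estimates that clear the significance bar are wildly exaggerated and occasionally have the wrong sign.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1   # small true effect
se = 0.5            # noisy measurement: the standard error of each study
n_studies = 100_000

estimates = rng.normal(true_effect, se, size=n_studies)
significant = np.abs(estimates) > 1.96 * se

power = significant.mean()
type_m = np.abs(estimates[significant]).mean() / true_effect  # exaggeration ratio
type_s = (estimates[significant] < 0).mean()                  # wrong-sign rate

print(f"power {power:.2f}, exaggeration {type_m:.1f}x, wrong sign {type_s:.2f}")
```

With these numbers only about 1 in 20 studies reaches significance, and those that do overstate the effect roughly tenfold—lots of strikeouts, and the apparent home runs aren’t what they seem.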

To put it another way, the answer to the question, “I mean, what exact buttons do I have to hit?” is that *there is no such button*.


]]>The post My talk at MIT this Thursday appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Anyway, it seems that MIT is starting some sort of statistics program, and they invited me to the inaugural symposium! Which is cool.

My talk is this Thurs, 14 May, at 10am at room 46-3002. I’m pretty sure this is a new building.

Here are my title and abstract:

Little Data: How Traditional Statistical Ideas Remain Relevant in a Big-Data World; or, The Statistical Crisis in Science; or, Open Problems in Bayesian Data Analysis

“Big Data” is more than a slogan; it is our modern world in which we learn by combining information from diverse sources of varying quality. But traditional statistical questions—how to generalize from sample to population, how to compare groups that differ, and whether a given data pattern can be explained by noise—continue to arise. Often a big-data study will be summarized by a little p-value. Recent developments in psychology and elsewhere make it clear that our usual statistical prescriptions, adapted as they were to a simpler world of agricultural experiments and random-sample surveys, fail badly and repeatedly in the modern world in which millions of research papers are published each year. Can Bayesian inference help us out of this mess? Maybe, but much research will be needed to get to that point.

(The title is long because I wasn’t sure which of 3 talks to give, so I thought I’d give them all.)


]]>The post There’s something about humans appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>If we can’t trust p-values, does experimental science involving human variation just have to start over?

In the comments, Rahul wrote:

Isn’t the qualifier about human variation redundant? If we cannot trust p-values we cannot trust p-values.

My reply:

At a technical level, a lot of the problems arise when signal is low and noise is high. Various classical methods of statistical inference perform a lot better in settings with clean data. Recall that Fisher, Yates, etc., developed their p-value-based methods in the context of controlled experiments in agriculture.

Statistics really is more difficult with humans: it’s harder to do experimentation, outcomes of interest are noisy, there’s noncompliance, missing data, and experimental subjects who can try to figure out what you’re doing and alter their responses correspondingly.


]]>The post There’s No Such Thing As Unbiased Estimation. And It’s a Good Thing, Too. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Following our recent post on econometricians’ traditional privileging of unbiased estimates, there were a bunch of comments echoing the challenge of teaching this topic, as students as well as practitioners often seem to want the comfort of an absolute standard such as best linear unbiased estimate or whatever. Commenters also discussed the tradeoff between bias and variance, and the idea that unbiased estimates can overfit the data.

I agree with all these things but I just wanted to raise one more point: In realistic settings, unbiased estimates simply *don’t exist*. In the real world we have nonrandom samples, measurement error, nonadditivity, nonlinearity, etc etc etc.
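A tiny simulation (mine, assuming classical measurement error in the predictor) makes the point concrete: once the predictor is measured with error, the textbook “unbiased” least-squares slope is in fact biased toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_beta = 100_000, 2.0

x = rng.normal(size=n)                  # the predictor we wish we observed
y = true_beta * x + rng.normal(size=n)
x_obs = x + rng.normal(size=n)          # what we actually measure (error variance 1)

# Least-squares slope of y on the observed predictor
beta_hat = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
print(beta_hat)  # attenuated: roughly true_beta * var(x) / (var(x) + 1) = 1.0
```

The nominally unbiased procedure recovers about half the true slope, and no amount of additional data fixes it.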

So forget about it. We’re living in the real world.

**P.S.** Perhaps this will help. It’s my impression that many practitioners in applied econometrics and statistics think of their estimation choice kinda like this:

1. The unbiased estimate. It’s the safe choice, maybe a bit boring and maybe not the most efficient use of the data, but you can trust it and it gets the job done.

2. A biased estimate. Something flashy, maybe Bayesian, maybe not, it might do better but it’s risky. In using the biased estimate, you’re stepping off base—the more the bias, the larger your lead—and you might well get picked off.

This is not the only dimension of choice in estimation—there’s also robustness, and other things as well—but here I’m focusing on the above issue.

Anyway, to continue, if you take the choice above and combine it with the unofficial rule that statistical significance is taken as proof of correctness (in econ, this would also require demonstrating that the result holds under some alternative model specifications, but “p less than .05” is still key), then you get the following decision rule:

A. Go with the safe, unbiased estimate. If it’s statistically significant, run some robustness checks and, if the result doesn’t go away, stop.

B. If you don’t succeed with A, you can try something fancier. But . . . if you do that, everyone will know that you tried plan A and it didn’t work, so people won’t trust your finding.

So, in a sort of Gresham’s Law, all that remains is the unbiased estimate. But, hey, it’s safe, conservative, etc, right?

And that’s where the present post comes in. My point is that the unbiased estimate does not exist! There is no safe harbor. Just as we can never get our personal risks in life down to zero (despite what Gary Becker might have thought in his ill-advised comment about deaths and suicides), there is no such thing as unbiasedness. And it’s a good thing, too: recognition of this point frees us to do better things with our data right away.


]]>**Tues:** There’s something about humans

**Wed:** Humility needed in decision-making

**Thurs:** The connection between varying treatment effects and the well-known optimism of published research findings

**Fri:** I actually think this infographic is ok

**Sat:** Apology to George A. Romero

**Sun:** “Do we have any recommendations for priors for student_t’s degrees of freedom parameter?”

The post Collaborative filtering, hierarchical modeling, and . . . speed dating appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Jonah Sinick posted a few things on the famous speed-dating dataset and writes:

The main element that I seem to have been missing is principal component analysis of the different rating types.

The basic situation is that the first PC is something that people are roughly equally responsive to, while people vary a lot with respect to responsiveness to the second PC, and the remaining PCs don’t play much of a role at all, so that you can just allow the coefficient of the second PC to vary.

Despite feeling like I understand the qualitative phenomenon, if I do a train/test split, the multilevel model doesn’t yield better log loss (though there are other respects in which the multilevel model yields clear improvements), and I haven’t isolated the reason. I don’t think that there’s a quick fix – I’ve run into ~5 apparently deep statistical problems in the course of thinking about this. The situation is further complicated by the fact that in this context the issues are intertwined.

And he adds:

Do you know of researchers who work at the intersection of collaborative filtering and hierarchical modeling? Googling yields some papers that seem like they might fall into this category, but in each case it would take me a while to parse what the authors are doing.
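Here’s a rough sketch of the PCA-plus-varying-slope setup Sinick describes, on simulated ratings rather than the actual speed-dating data (the dimensions, the no-pooling per-rater fit, and the outcome model are all my assumptions; a real multilevel model would partially pool the slopes):

```python
import numpy as np

rng = np.random.default_rng(2)
n_raters, n_dates, n_traits = 50, 20, 4

# Simulated ratings: each rater scores n_dates partners on n_traits attributes
ratings = rng.normal(size=(n_raters * n_dates, n_traits))
centered = ratings - ratings.mean(axis=0)

# Principal components via SVD
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]  # axis everyone responds to about equally
pc2 = centered @ vt[1]  # axis whose weight varies across raters

# Simulated outcome: common slope on PC1, rater-specific slope on PC2
rater = np.repeat(np.arange(n_raters), n_dates)
b2 = rng.normal(size=n_raters)
liking = pc1 + b2[rater] * pc2 + rng.normal(0, 0.5, size=len(pc1))

# No-pooling estimates of each rater's PC2 slope
slopes = np.array([
    np.linalg.lstsq(
        np.column_stack([np.ones(n_dates), pc1[rater == j], pc2[rater == j]]),
        liking[rater == j], rcond=None)[0][2]
    for j in range(n_raters)])
print(np.corrcoef(slopes, b2)[0, 1])  # per-rater slopes track the true variation
```

In the real data the per-rater slopes would be much noisier, which is exactly where partial pooling should help.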


]]>The post Social networks spread disease—but they also spread practices that reduce disease appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Zelner follows up:

This made me think of my favorite figure from this paper, which showed the impact of relative network position within villages on risk. Basically, less-connected households in low average-degree villages were at higher risk than more-connected households in those villages, but in high average-degree places there was no meaningful relative degree effect.

Here it is:


]]>The post What I got wrong (and right) about econometrics and unbiasedness appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>“Unbiasedness”: You keep using that word. I do not think it means what you think it means.

The talk went all right—people seemed ok with what I was saying—but I didn’t see a lot of audience involvement. It was a bit like pushing at an open door. Everything I said sounded so reasonable, it didn’t seem clear where the interesting content was.

The talk went like this: I discussed various examples where people were using classical unbiased estimates. There was one example with a silly regression discontinuity analysis controlling for a cubic polynomial, where it’s least squares so it’s unbiased but the model makes no sense. And another example with a simple comparison in a randomized experiment, where selection bias (the statistical significance filter and various play-the-winner decisions) push the estimate higher so that the reported estimate is biased, even though the particular statistical procedure being reported is nominally unbiased.

My point was: Here are these methods that respected researchers (including economists) use, that get published in top journals, but which are clearly wrong, in the sense of giving estimates and uncertainty statements that we don’t believe.

Why are people using such methods, in one case using a clearly inappropriate model and in the other case avoiding clearly appropriate adjustments?

I think part of the problem is a prioritizing of “unbiasedness” and a misunderstanding of what this really means in the practical world of data analysis and publication. The idea is that unbiased estimates are seen as pure, and that it’s ok to use an analysis that’s evidently flawed, if it does not “bias” the estimate. So, in a regression discontinuity setting, it’s considered ok to control for a high-degree polynomial because this fits into the general idea that, if you control for unnecessary predictors in a regression, you’re fine: it adds no bias and all that might happen is that your standard error gets bigger. Now, ok, nobody wants a big standard error, but remember that the usual goal in applied work is “statistical significance.” So . . . as long as your estimate is more than 2 standard errors away from 0, you’re cool. The price you paid in terms of variance was, apparently, not too high.
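The variance cost is easy to see in a simulation (mine, assuming a linear underlying trend, not any specific published analysis): with a true jump of 2 at the cutoff, both the linear and the cubic specifications are unbiased, but the cubic “control” simply inflates the standard error of the jump estimate.

```python
import numpy as np

rng = np.random.default_rng(5)

def rd_se(degree, n=500, reps=2000):
    """SD across simulations of the estimated jump at the cutoff."""
    ests = []
    for _ in range(reps):
        x = rng.uniform(-1, 1, size=n)            # running variable, cutoff at 0
        t = (x >= 0).astype(float)                # treatment indicator
        y = 2.0 * t + x + rng.normal(size=n)      # true jump 2, linear trend
        X = np.column_stack([np.ones(n), t] + [x**k for k in range(1, degree + 1)])
        ests.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return np.std(ests)

se_linear, se_cubic = rd_se(1), rd_se(3)
print(se_linear, se_cubic)  # the cubic buys no bias reduction here, just a larger SE
```

The inflation comes from collinearity: near the cutoff, the higher-order polynomial terms can mimic the jump, so the data have a harder time telling them apart.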

In my talk, I then continued by briefly describing our Xbox analysis, as an example of how we can succeed by adjusting. Instead of clinging to a nominally unbiased procedure, we can reduce actual bias by modeling.

As I said above, the people in the audience (mostly economists and political scientists) pretty much agreed with everything I said, except that they disagreed with my claim that “minimizing bias is the traditional first goal of econometrics.”

Or, maybe they accepted that this was a *traditional* first goal but they said that econometrics has moved on. In particular, I was informed that econometricians are much more interested in interval estimation than point estimation, and their typical first goal now is coverage. In fact, I was told that I could pretty much keep my talk as it was and just replace “unbiasedness” with “coverage” everywhere and it would still work. Thus, various conventional approaches for obtaining 95% intervals are believed to be ok because they have 95% coverage. But, because of selection and omitted variables, these intervals *don’t* have that nominal coverage. And that’s good to know.
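The loss of nominal coverage under selection is easy to simulate (my numbers, not from the talk): the 95% intervals have 95% coverage unconditionally, but among the results that get reported because they are statistically significant, coverage is far lower.

```python
import numpy as np

rng = np.random.default_rng(3)
true_effect, se, n = 0.5, 1.0, 200_000   # a smallish true effect, noisy estimates

est = rng.normal(true_effect, se, size=n)
covered = np.abs(est - true_effect) < 1.96 * se
significant = np.abs(est) > 1.96 * se

overall = covered.mean()                # ~0.95, as advertised
selected = covered[significant].mean()  # far below 0.95
print(overall, selected)
```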

The other point that was made to me after the talk was that, yes, some of the work I criticized was by respected economists—but this work was not published in econ journals. One of the papers was published in the tabloid PPNAS, for example. And these Princeton people assured me that had the work I’d criticized been presented in their seminar, they would’ve seen the problems—the omitted variable bias in one example and the selection bias in the other.

The point I made which still holds, I think, is the critique of what is commonly viewed as inferential conservatism. I feel that a central stream in econometric thinking is to play it safe, to favor unbiasedness and robustness over efficiency. And my central point is that the choices that look like “playing it safe” (for example, using least squares with no shrinkage, or taking simple comparisons with no adjustments) are, in practice, only used when the resulting estimates are more than 2 standard errors away from 0—and this selection sets us up for lots of problems.

So, I agree that it’s misleading to think of unbiasedness as the first goal in modern econometrics, but it remains my impression that there’s a misguided tendency in econometrics to downplay methods that increase statistical efficiency.


]]>The post A question about physics-types models for flows in economics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’ve been attempting to generate a set of visually (animated in fact) mappable models which represent measurable forces that demonstrate effects on localized economic (census block level) outcomes, which in turn affect and are affected by regional education dynamics, brick/mortar business development, etc… This is coming out of some reading and observation of mine that such forces appear interestingly to resemble fluid-like behavior.

Who might you know who seems fairly keen in geospatial/geotemporal economics? While my model perspective is currently quite bayesian, and mode of interpretation physical (fluids, non-linear dynamical systems, etc), I have yet to make any reasonably solid links between the two (i.e. solidly connecting the modeled factors fairly predictive and explanatory nature over space-time), and the seemingly fluid nature of their occurrence (i.e. think ideas like ‘ebb’ ‘flow’, ‘ripples’, vortices (and other decaying phenomena, sinks, etc). Think I’m hitting a brick wall and need a bit of external thought, or a scotch-laden reboot.

He continues:

This entire ‘thing’ i’ve been chasing came out of a picture I was drawing when imagining the free-market as a bathtub. Each spigot is an economic output, and the drain an input for one entity (no matter the scale). The ‘water table’ of the tub is simply the summary state of this economic system, measurable in currency, to the entities who share in that system. Obviously, spigots and drains can vary in size (spending vs hoarding at any level for example). Big drains and small spigots can lower the water table – less capacity of the system available left for which the rest can compete –> problem (cost vs living std gap, etc etc blah blah). Nothing new here – perhaps not even my imagery.

I don’t know anything about this but maybe some of you do?


]]>The post In criticism of criticism of criticism appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I do a lot of criticism. I’m sure you can think of lots of things that I like to criticize, but to keep things simple, let’s focus on graphics criticism, for example this post where I criticized a graph for false parallelism.

At this point some people would say that graphics criticism is mean, and not just mean but counterproductive, that we should be more constructive and less critical, you catch more flies with honey than with vinegar, etc. Let me call this *criticism criticism*.

If I’m in a quippy mood, I’ll reply that maybe you can catch more flies with honey than with vinegar, but you can capture even more flies with poop!

More seriously though, here’s an advantage of criticism that we don’t always hear about, which is the positive way in which criticism can shift the discourse.

The audience for criticism is not just the direct objects of the criticism, it’s also the community more generally.

Take the above-linked example. It may be that Dan Kahan, the person who made this graph, will be motivated to think harder about his goals when plotting data. Or maybe my negativity will just send Dan into a defensive crouch. Or, perhaps most likely, he’ll find my criticism to be pretty much irrelevant to his larger goals. That’s fine. But I hope that other readers of this post will be made aware of the Gricean messages being sent implicitly by various choices in a graphical display.

In political science, I proceed on two tracks: I criticize graphs I don’t like, and I make graphs following my own principles. The criticism I do, makes me more aware of my goals, of how to do better. And, in the field of political science, sure, there are some people who think it’s funny that I’m always there to criticize a graph. But, in the past 10 years, I’ve been seeing more and more intense, informative graphs coming from researchers in that field. It’s taken a while, but I’m seeing forward movement. I think criticism is part of this. Criticism makes us aware of our goals and what we need to do to get there.

So I’m a critic of criticism of criticism. And this post is an exercise in criticism criticism criticism.

**P.S.** In graphics, as in all fields, our criticism should come with an effort to understand the context of the behavior we’re criticizing. This is what Antony Unwin and I tried to do, for example, when we wrote about infovis and statistical graphics. Typically the behavior we don’t like is done in service of *some* goals, and it’s a good idea to try to understand these goals. Criticism is a valuable part of this process.


]]>The post He’s looking for probability puzzles appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I recently created a little probability puzzle app for android, and I was wondering whether you have any suggestions for puzzles that are engaging, approachable to someone who hasn’t taken a probability course, and don’t involve coins or dice. I think my easy puzzles are easy enough, but I’m having trouble thinking of new ones that are different enough from what’s already there.

Feel free to post any suggestions in the comments.


]]>The post A causal-inference version of a statistics problem: If you fit a regression model with interactions, and the underlying process has an interaction, your coefficients won’t be directly interpretable appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>A colleague pointed me to a recent paper, “Does Regression Produce Representative Estimates of Causal Effects?” by Peter Aronow and Cyrus Samii, which begins:

With an unrepresentative sample, the estimate of a causal effect may fail to characterize how effects operate in the population of interest. What is less well understood is that conventional estimation practices for observational studies may produce the same problem even with a representative sample.

Linear regression, controlling for pre-treatment variables, is the standard method for causal inference in experiments and observational studies. The idea is that the regression on background variables serves to adjust for differences between the treatment and control group so that comparable groups are effectively being compared in the causal analysis.

Aronow and Samii’s point is that, when the treatment effect varies (that is, when the treatment is more effective for some people than others, which of course is the case in general), the estimate from a regression, even if it controls for the right variables, will not in general give an easily interpretable estimate of an average treatment effect.

They’re right; once the treatment effect can vary, the linear model is no longer correct, and so estimates from linear regression will not generally have any clean interpretation.
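A small simulation (mine; the binary covariate, propensities, and effects are chosen purely for illustration) shows the problem: the regression that “controls for” x recovers a variance-of-treatment-weighted average of the stratum effects, not the average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

x = rng.integers(0, 2, size=n)            # binary covariate, 50/50
p_treat = np.where(x == 0, 0.5, 0.9)      # propensity depends on x
t = rng.random(n) < p_treat
tau = np.where(x == 0, 1.0, 3.0)          # treatment effect varies with x
y = x + tau * t + rng.normal(size=n)

ate = tau.mean()                          # the average treatment effect, 2.0

# OLS of y on (1, t, x): the usual "controls for x" regression
X = np.column_stack([np.ones(n), t, x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(ate, beta[1])  # the regression coefficient lands near 1.5, not 2
```

The regression downweights the x = 1 stratum, where treatment assignment has little variance, even though that is where the effect is largest.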

This is related to my 2004 article on varying treatment effects. (That paper appeared in an obscure volume so I’m not criticizing Aronow and Samii for not knowing about it.)

In short, yes, when you have varying treatment effects the additivity assumption of regression no longer holds. I’m not quite sure why they keep referring to “multiple regression” but maybe that is just their way of emphasizing that the problem arises in general, not just when there’s only one pre-treatment predictor.

I don’t know that I really see the point of the weighting scheme of Aronow and Samii: ultimately I think the right way to go is to just fit the more general model that allows treatment effect variation, as we do for example in this paper from 2008. Now that we have Stan, it’s much easier to fit this sort of model that has multiple variance components.

That said, the statistical problem is not easy, as the variance of the treatment effects is essentially nonidentified without strong assumptions about the structure of the problem. (In my 2004 paper I talk about additive, subtractive, and replacement models for treatment effects, but these are just three special cases in a continuous space of possible structures.) So: prior information is needed.

One thing that I think would help—it comes up in our 2004 and 2008 papers—is a model that explicitly allows treatment to have a different structure than control, that is, does *not* consider the two treatments symmetrically.

In any case, let me emphasize that (a) I agree with the authors that varying treatment effects are important, and (b) I am not particularly interested in the various definitions of “average treatment effect” that float around in the causal inference literature. I understand why Rubin and others like these expressions, as they make precise what people are estimating (or trying to estimate) with these models. But, ultimately, I’d like to model treatment effect variation directly.

Also, I hope they clean up the figures before publication. Their world map features a hugely inflated Greenland and Antarctica. Whassup with that? There are better map projections out there than a straight lat-long grid. And the tables: average age “47.58”—c’mon!

And one other, minor thing: on p.2, they write that people “praise matching methods as an alternative to regression.” I know that some people think this, but it would be a good idea to make clear that matching is not, in general, an *alternative* to regression (see here and here). You do matching and then you can run regression on your matched set. Matching deals with lack of overlap, regression deals with imbalance.

Anyway, I think this is a valuable paper in that they’re drawing people’s attention to a real problem. There are various ways to attack the problem and I can’t say I’ve come up with any great solutions myself, so I’m glad to see other people taking a crack at it.

**Experiments and observational studies**

I showed the above to my colleague, who added:

Keep in mind the point of their paper is, no longer can we prioritize obs/regression over experiment for reasons of external validity, which they see as devastating given that they allow for almost no other benefits to obs work, if any.

I’ve written about experiments and observational studies before:

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that “To find out what happens when you change something, it is necessary to change it.”

At the same time, in my capacity as a social scientist, I’ve published many applied research papers, almost none of which have used experimental data. . . .

So, in response to my colleague, let me just say that the main advantage of observational studies is obvious: it’s that observational data are already available, or are relatively cheap to gather, whereas experiments take effort and in many settings can’t easily be done at all.


]]>**Tues:** He’s looking for probability puzzles

**Wed:** In criticism of criticism of criticism

**Thurs:** A question about physics-types models for flows in economics

**Fri:** What I got wrong (and right) about econometrics and unbiasedness

**Sat:** Social networks spread disease—but they also spread practices that reduce disease

**Sun:** Collaborative filtering, hierarchical modeling, and . . . speed dating

The post On deck this month appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>He’s looking for probability puzzles

In criticism of criticism of criticism

A question about physics-types models for flows in economics

What I got wrong (and right) about econometrics and unbiasedness

Social networks spread disease—but they also spread practices that reduce disease

Collaborative filtering, hierarchical modeling, and . . . speed dating

Characterizing the spatial structure of defensive skill in professional basketball

There’s something about humans

Humility needed in decision-making

The connection between varying treatment effects and the well-known optimism of published research findings

I actually think this infographic is ok

Apology to George A. Romero

“Do we have any recommendations for priors for student_t’s degrees of freedom parameter?”

Bob Carpenter’s favorite books on GUI design and programming

Bayesian inference: The advantages and the risks

Objects of the class “Foghorn Leghorn”

“Physical Models of Living Systems”

Creativity is the ability to see relationships where none exist

Kaiser’s beef

Chess + statistics + plagiarism, again!

An inundation of significance tests

Stock, flow, and two smoking regressions

What’s the worst joke you’ve ever heard?

Cracked.com > Huffington Post, Wall Street Journal, New York Times

Measurement is part of design

“17 Baby Names You Didn’t Know Were Totally Made Up”

What to do to train to apply statistical models to political science and public policy issues


]]>The post Inventor of Arxiv speaks at Columbia this Tues 4pm appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I [Ginsparg] will give a very brief sociological overview of the current metastable state of scholarly research communication, and then a technical discussion of the practical implications of literature and usage data considered as computable objects, using arXiv as exemplar. Some of these algorithms scale to larger data sets.


]]>The post Forget about pdf: this looks much better, it makes all my own papers look like kids’ crayon drawings by comparison. appeared first on Statistical Modeling, Causal Inference, and Social Science.


]]>The post Which of these classes should he take? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I for many years wanted to pursue medicine but after recently completing a master of public health, I caught the statistics bug. I need to complete the usual minimum prerequisites for graduate study in statistics (calculus through multivariable calculus plus linear algebra) but want to take additional math courses as highly competitive stats and biostats programs either require or highly recommend more than the minimum. I could of course end up earning a whole other bachelor degree in math if I tried to take all the recommended courses. Could you please rank the following courses according to importance/practical utility in working in statistics and in applying for a competitive stats PhD program? This would greatly assist me in prioritising which courses to complete.

1. Mathematical modeling

2. Real analysis

3. Complex analysis

4. Numerical analysis

My quick advice:

– “Mathematical modeling”: I don’t know what’s in this class. But, from the title, it seems very relevant to statistics.

– “Real analysis”: Not so relevant to real-world statistics but important for PhD applications because it’s a way to demonstrate that you understand math. And understanding math _is_ important to real-world statistics. Thus, the point of a “real analysis” class for a statistician is not so much that you learn real analysis, which is pretty irrelevant for most things, but that it demonstrates that you can do real analysis.

– “Complex analysis”: A fun topic but you’ll probably never ever need it, so no need to take this one.

– “Numerical analysis”: I don’t know what’s in this class. You could take it but it’s not really necessary.

The post “The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Continuing from yesterday’s quotation of my 2012 article in Epidemiology:

Like many Bayesians, I have often represented classical confidence intervals as posterior probability intervals and interpreted one-sided p-values as the posterior probability of a positive effect. These are valid conditional on the assumed noninformative prior but typically do not make sense as unconditional probability statements.

The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong. At first this may sound paradoxical, that a noninformative or weakly informative prior yields posteriors that are too forceful—and let me deepen the paradox by stating that a stronger, more informative prior will tend to yield weaker, more plausible posterior statements.

How can it be that adding prior information weakens the posterior? It has to do with the sort of probability statements we are often interested in making. Here is an example from Gelman and Weakliem (2009). A sociologist examining a publicly available survey discovered a pattern relating attractiveness of parents to the sexes of their children. He found that 56% of the children of the most attractive parents were girls, compared to 48% of the children of the other parents, and the difference was statistically significant at p<0.02. The assessments of attractiveness had been performed many years before these people had children, so the researcher felt he had support for a claim of an underlying biological connection between attractiveness and sex ratio.

The original analysis by Kanazawa (2007) had multiple comparisons issues, and after performing a regression rather than selecting the most significant comparison, we get a p-value closer to 0.2 than the stated 0.02. For the purposes of our present discussion, though, in which we are evaluating the connection between p-values and posterior probabilities, it will not matter much which number we use. We shall go with p=0.2 because it seems like a more reasonable analysis given the data.

Let θ be the true (population) difference in sex ratios of attractive and less attractive parents. Then the data under discussion (with a two-sided p-value of 0.2), combined with a uniform prior on θ, yields a 90% posterior probability that θ is positive. Do I believe this? No. Do I even consider this a reasonable data summary? No again. We can derive these No responses in three different ways, first by looking directly at the evidence, second by considering the prior, and third by considering the implications for statistical practice if this sort of probability statement were computed routinely.

First off, a claimed 90% probability that θ>0 seems too strong. Given that the p-value (adjusted for multiple comparisons) was only 0.2—that is, a result that strong would occur a full 20% of the time just by chance alone, even with no true difference—it seems absurd to assign a 90% belief to the conclusion. I am not prepared to offer 9 to 1 odds on the basis of a pattern someone happened to see that could plausibly have occurred by chance, nor for that matter would I offer 99 to 1 odds based on the original claim of the 2% significance level.
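The 90% figure is a one-line normal calculation: under a flat prior, the posterior probability that θ>0 is Φ(z), where z is the z-score recovered from the two-sided p-value. A quick sketch of the check (the function name is mine; Python’s stdlib `statistics.NormalDist` supplies Φ and its inverse):

```python
from statistics import NormalDist

def flat_prior_prob_positive(two_sided_p):
    # With a uniform prior on theta, the posterior is normal(estimate, se),
    # so Pr(theta > 0 | data) = Phi(estimate / se) = Phi(z).
    z = NormalDist().inv_cdf(1 - two_sided_p / 2)  # z-score of the estimate
    return NormalDist().cdf(z)

print(round(flat_prior_prob_positive(0.2), 2))   # 0.9
print(round(flat_prior_prob_positive(0.02), 2))  # 0.99: the 99-to-1 odds above
```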

Second, the prior uniform distribution on θ seems much too weak. There is a large literature on sex ratios, with factors such as ethnicity, maternal age, and season of birth corresponding to differences in the probability of girl birth of less than 0.5 percentage points. It is a priori implausible that sex-ratio differences corresponding to attractiveness are larger than for these other factors. Assigning an informative prior centered on zero shrinks the posterior toward zero, and the resulting posterior probability that θ>0 moves to a more plausible value in the range of 60%, corresponding to the idea that the result is suggestive but not close to convincing.
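That shrinkage can be made concrete with a conjugate normal-normal update. The numbers below are illustrative, not from the article: the observed difference is 0.08 (56% versus 48%), its standard error is backed out of the two-sided p = 0.2, and the prior standard deviation of 0.01 is an assumption, deliberately generous relative to the sub-half-percentage-point effects in the sex-ratio literature:

```python
from statistics import NormalDist

y = 0.08                                    # observed difference, 56% - 48%
se = y / NormalDist().inv_cdf(1 - 0.2 / 2)  # ~0.062, backed out of p = 0.2
prior_sd = 0.01                             # assumed N(0, 0.01) prior

# Precision-weighted average of the prior mean (0) and the data estimate:
post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
post_mean = post_var * y / se**2
print(round(NormalDist().cdf(post_mean / post_var**0.5), 2))  # 0.58, down from 0.90
```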

Third, consider what would happen if we routinely interpreted one-sided p-values as posterior probabilities. In that case, an experimental result that is 1 standard error from zero—that is, exactly what one might expect from chance alone—would imply an 83% posterior probability that the true effect in the population has the same direction as the observed pattern in the data at hand. It does not make sense to me to claim 83% certainty—5 to 1 odds—based on data that not only could occur by chance but in fact represent an expected level of discrepancy. This system-level analysis accords with my criticism of the flat prior: as Greenland and Poole note in their article, the effects being studied in epidemiology typically range from -1 to 1 on the logit scale, hence analyses assuming broader priors will systematically overstate the probabilities of very large effects and will overstate the probability that an estimate from a small sample will agree in sign with the corresponding population quantity.
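The roughly-5-to-1 figure above is again just a normal calculation: an estimate exactly one standard error from zero, combined with a flat prior, gives posterior probability Φ(1) ≈ 0.84 that the population effect has the observed sign. A quick check:

```python
from statistics import NormalDist

# Flat-prior probability that the true effect shares the observed sign,
# when the estimate sits exactly 1 standard error from zero:
p_same_sign = NormalDist().cdf(1.0)
print(round(p_same_sign, 2))                      # 0.84
print(round(p_same_sign / (1 - p_same_sign), 1))  # 5.3, i.e. roughly 5-to-1 odds
```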

Rather than relying on noninformative priors, I prefer the suggestion of Greenland and Poole to bound posterior probabilities using real prior information.

OK, I did discuss some buffoonish research here. But, look, no mockery! I was using the silly stuff as a lever to better understand some statistical principles. And that’s ok.

The post There are 6 ways to get rejected from PLOS: (1) theft, (2) sexual harassment, (3) running an experiment without a control group, (4) keeping a gambling addict away from the casino, (5) chapter 11 bankruptcy proceedings, and (6) having no male co-authors appeared first on Statistical Modeling, Causal Inference, and Social Science.

[The author] and her colleague have appealed to the unnamed journal, which belongs to the PLoS family . . .

I thought PLOS published just about everything! This is not a slam on PLOS. Arxiv publishes everything too, and Arxiv is great.

The funny thing is, I do think there are cases where having both male and female coauthors gives a paper more credibility, sometimes undeserved. For example, if you take a look at those papers on ovulation and voting, and ovulation and clothing, and fat arms and political attitudes, you’ll see these papers have authors of both sexes, which insulates them from the immediate laugh-them-out-of-the-room reaction that they might get were they written by men only. Having authors of both sexes does not of course exempt them from direct criticisms of the work; I just think that a paper on “that time of the month” written by men would, for better or worse, get a more careful review.

**P.S.** Also, one thing I missed in my first read of this story: the reviewer wrote:

Perhaps it is not so surprising that on average male doctoral students co-author one more paper than female doctoral students, just as, on average, male doctoral students can probably run a mile race a bit faster than female doctoral students . . . And it might well be that on average men publish in better journals . . . perhaps simply because men, perhaps, on average work more hours per week than women, due to marginally better health and stamina.

“Marginally better health and stamina”—that’s a laff and a half! Obviously this reviewer is no actuary and doesn’t realize that men die at a higher rate than women at every age.

On the plus side, it’s pretty cool that James Watson is still reviewing journal articles, giving something back to the community even in retirement. Good on ya, Jim! Don’t let the haters get you down.

The post Good, mediocre, and bad p-values appeared first on Statistical Modeling, Causal Inference, and Social Science.

In theory the p-value is a continuous measure of evidence, but in practice it is typically trichotomized approximately into strong evidence, weak evidence, and no evidence (these can also be labeled highly significant, marginally significant, and not statistically significant at conventional levels), with cutoffs roughly at p=0.01 and 0.10.

One big practical problem with p-values is that they cannot easily be compared. The difference between a highly significant p-value and a clearly non-significant p-value is itself not necessarily statistically significant. . . . Consider a simple example of two independent experiments with estimates ± standard error of 25 ± 10 and 10 ± 10. The first experiment is highly statistically significant (two and a half standard errors away from zero, corresponding to a (normal-theory) p-value of about 0.01) while the second is not significant at all. Most disturbingly here, the difference is 15 ± 14, which is not close to significant . . .

In short, the p-value is itself a statistic and can be a noisy measure of evidence. This is a problem not just with p-values but with any mathematically equivalent procedure, such as summarizing results by whether the 95% confidence interval includes zero.
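The arithmetic of the 25 ± 10 versus 10 ± 10 example is easy to reproduce. A sketch using normal-theory two-sided p-values (`two_sided_p` is my name for the helper; for independent experiments the standard errors of a difference combine in quadrature):

```python
import math
from statistics import NormalDist

def two_sided_p(est, se):
    # Normal-theory two-sided p-value for an estimate with standard error se.
    return 2 * (1 - NormalDist().cdf(abs(est) / se))

print(round(two_sided_p(25, 10), 3))   # 0.012: "highly significant"
print(round(two_sided_p(10, 10), 2))   # 0.32: not significant at all
se_diff = math.hypot(10, 10)           # ~14.1: ses of a difference add in quadrature
print(round(two_sided_p(25 - 10, se_diff), 2))  # 0.29: the difference is not
                                                # close to significant
```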

**Good, mediocre, and bad p-values**

For all their problems, p-values sometimes “work” to convey an important aspect of the relation of data to model. Other times a p-value sends a reasonable message but does not add anything beyond a simple confidence interval. In yet other situations, a p-value can actively mislead. Before going on, I will give examples of each of these three scenarios.

**A p-value that worked.** Several years ago I was contacted by a person who suspected fraud in a local election (Gelman, 2004). Partial counts had been released throughout the voting process and he thought the proportions for the different candidates looked suspiciously stable, as if they had been rigged ahead of time to aim for a particular result. Excited to possibly be at the center of an explosive news story, I took a look at the data right away. After some preliminary graphs—which indeed showed stability of the vote proportions as they evolved during election day—I set up a hypothesis test comparing the variation in the data to what would be expected from independent binomial sampling. When applied to the entire dataset (27 candidates running for six offices), the result was not statistically significant: there was no less (and, in fact, no more) variance than would be expected by chance. In addition, an analysis of the 27 separate chi-squared statistics revealed no particular patterns. I was left to conclude that the election results were consistent with random voting (even though, in reality, voting was certainly not random—for example, married couples are likely to vote at the same time, and the sorts of people who vote in the middle of the day will differ from those who cast their ballots in the early morning or evening), and I regretfully told my correspondent that he had no case.

In this example, we did not interpret a non-significant result as a claim that the null hypothesis was true or even as a claimed probability of its truth. Rather, non-significance revealed the data to be compatible with the null hypothesis; thus, my correspondent could not argue that the data indicated fraud.
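The kind of binomial-variation test described here can be sketched in a few lines: pool a candidate’s overall vote share, then compare batch-to-batch variation to what independent binomial sampling implies. (This is my reconstruction of the kind of statistic involved, not the code from Gelman, 2004; the data layout is invented for illustration.)

```python
def binomial_chisq(batches):
    """Chi-square statistic for whether a candidate's counts across reporting
    batches look like independent binomial draws with a common proportion.
    `batches` is a list of (votes_for_candidate, batch_total) pairs; under the
    null, the statistic is approximately chi^2 with len(batches) - 1 df."""
    p = sum(x for x, n in batches) / sum(n for x, n in batches)  # pooled share
    return sum((x - n * p) ** 2 / (n * p * (1 - p)) for x, n in batches)

# Perfectly stable shares give a statistic far below its df (too little
# variation, the pattern the correspondent suspected); uneven ones exceed it.
print(binomial_chisq([(40, 100), (40, 100)]))             # 0.0
print(round(binomial_chisq([(50, 100), (30, 100)]), 2))   # 8.33, against df = 1
```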

**A p-value that was reasonable but unnecessary.** It is common for a research project to culminate in the estimation of one or two parameters, with publication turning on a p-value being less than a conventional level of significance. For example, in our study of the effects of redistricting in state legislatures (Gelman and King, 1994), the key parameters were interactions in regression models for partisan bias and electoral responsiveness. Although we did not actually report p-values, we could have: what made our paper complete was that our findings of interest were more than two standard errors from zero, thus reaching the p<0.05 level. Had our significance level been much greater (for example, estimates that were four or more standard errors from zero), we would doubtless have broken up our analysis (for example, separately studying Democrats and Republicans) in order to broaden the set of claims that we could confidently assert. Conversely, had our regressions not reached statistical significance at the conventional level, we would have performed some sort of pooling or constraining of our model in order to arrive at some weaker assertion that reached the 5% level. (Just to be clear: we are not saying that we would have performed data dredging, fishing for significance; rather, we accept that sample size dictates how much we can learn with confidence; when data are weaker, it can be possible to find reliable patterns by averaging.)

In any case, my point here is that in this example it would have been just fine to summarize our results via p-values even though we did not happen to use that formulation.

**A misleading p-value.** Finally, in many scenarios p-values can distract or even mislead, either a non-significant result wrongly interpreted as a confidence statement in support of the null hypothesis, or a significant p-value that is taken as proof of an effect. A notorious example of the latter is the recent paper of Bem (2011), which reported statistically significant results from several experiments on ESP. At brief glance, it seems impressive to see multiple independent findings that are statistically significant (and combining the p-values using classical rules would yield an even stronger result), but with enough effort it is possible to find statistical significance anywhere (see Simmons, Nelson, and Simonsohn, 2011).

The focus on p-values seems both to have weakened the study (by encouraging the researcher to present only some of his data so as to draw attention away from non-significant results) and to have led reviewers to inappropriately view a low p-value (indicating a misfit of the null hypothesis to data) as strong evidence in favor of a specific alternative hypothesis (ESP) rather than other, perhaps more scientifically plausible alternatives such as measurement error and selection bias.

I’ve written on these issues in many other places but the questions keep coming up so I thought it was worth reposting.

Tomorrow I’ll highlight another part of this article, this time dealing with Bayesian inference.

The post Carl Morris: Man Out of Time [reflections on empirical Bayes] appeared first on Statistical Modeling, Causal Inference, and Social Science.

When Carl Morris came to our department in 1989, I and my fellow students were so excited. We all took his class. The funny thing is, though, the late 1980s might well have been the worst time to be Carl Morris, from the standpoint of what was being done in statistics at that time—not just at Harvard, but in the field in general. Carl has made great contributions to statistical theory and practice, developing ideas which have become particularly important in statistics in the last two decades. In 1989, though, Carl’s research was not in the mainstream of statistics, or even of Bayesian statistics.

When Carl arrived to teach us at Harvard, he was both a throwback and ahead of his time.

Let me explain. Two central aspects of Carl’s research are the choice of probability distribution for hierarchical models, and frequency evaluations in hierarchical settings where both Bayesian calibration (conditional on inferences) and classical bias and variance (conditional on unknown parameter values) are relevant. In Carl’s terms, these are “NEF-QVF” and “empirical Bayes.” My point is: both of these areas were hot at the beginning of Carl’s career and they are hot now, but somewhere in the 1980s they languished.

In the wake of Charles Stein’s work on admissibility in the late 1950s there was an interest, first theoretical but with clear practical motivations, in producing lower-risk estimates: getting the benefits of partial pooling while maintaining good statistical properties conditional on the true parameter values, producing the Bayesian omelet without cracking the eggs, so to speak. In this work, the functional form of the hierarchical distribution plays an important role—and in a different way than had been considered in statistics up to that point. In classical distribution theory, distributions are typically motivated by convolution properties (for example, the sum of two gamma distributions with a common shape parameter is itself gamma), or by stable laws such as the central limit theorem, or by some combination or transformation of existing distributions. But in Carl’s work, the choice of distribution for a hierarchical model can be motivated by the properties of the resulting partially pooled estimates. In this way, Carl’s ideas are truly non-Bayesian, because he is considering the distribution of the parameters in a hierarchical model not as a representation of prior belief about the set of unknowns, and not as a model for a population of parameters, but as a device to obtain good estimates.

So, using a Bayesian structure to get good classical estimates. Or, Carl might say, using classical principles to get better Bayesian estimates. I don’t know that they used the term “robust” in the 1950s and 1960s, but that’s how we could think of it now.

The interesting thing is, if we take Carl’s work seriously (and we should), we now have two principles for choosing a hierarchical model. In the absence of prior information about the functional form of the distribution of group-level parameters, and in the absence of prior information about the values of the hyperparameters that would underlie such a model, we should use some form with good statistical properties. On the other hand, if we _do_ have good prior information, we should of course use it—even R. A. Fisher accepted Bayesian methods in those settings where the prior distribution is known.

But, then, what do we do in those cases in between—the sorts of problems that arose in Carl’s applied work in health policy and other areas? I learned from Carl to use our prior information to structure the model, for example to pick regression coefficients, to decide which groups to pool together, to decide which parameters to model as varying, and then to use robust hierarchical modeling to handle the remaining, unexplained variation. This general strategy wasn’t always so clear in the theoretical papers on empirical Bayes, but it came through in Carl’s applied work, as well as that of Art Dempster, Don Rubin, and others, much of which flowered in the late 1970s—not coincidentally, a few years after Carl’s classic articles with Brad Efron that put hierarchical modeling on a firm foundation that connected with the edifice of theoretical statistics, gradually transforming these ideas from a parlor trick into a way of life.

In a famous paper, Efron and Morris wrote of “Stein’s paradox in statistics,” but as a wise man once said, once something is understood, it is no longer a paradox. In un-paradoxing shrinkage estimation, Efron and Morris finished the job that Gauss, Laplace, and Galton had begun.
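The estimator at the heart of that story fits in a few lines. Here is the textbook positive-part James-Stein rule for several independent normal means, shrinking toward zero and assuming known, equal standard errors (a sketch of the idea, not of any particular Efron-Morris analysis):

```python
def james_stein(y, sigma=1.0):
    """Positive-part James-Stein: shrink k >= 3 independent normal estimates
    toward zero by a data-determined factor; this beats the raw estimates in
    total squared error, which is the 'paradox' Efron and Morris explained."""
    k = len(y)
    if k < 3:
        return list(y)  # shrinkage only helps for three or more means
    shrink = max(0.0, 1 - (k - 2) * sigma**2 / sum(v * v for v in y))
    return [shrink * v for v in y]

print([round(v, 3) for v in james_stein([1.0, 2.0, 3.0, 4.0])])
# [0.933, 1.867, 2.8, 3.733]
```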

So far, so good. We’ve hit the 1950s, the 1960s, and the 1970s. But what happened next? Why do I say that, as of 1989, Carl’s work was “out of time”? The simplest answer would be that these ideas were a victim of their own success: once understood, no longer mysterious. But it was more than that. Carl’s specific research contribution was not just hierarchical modeling but the particular intricacies involved in the combination of data distribution and group-level model. His advice was not simply “do Bayes” or even “do empirical Bayes” but rather had to do with a subtle examination of this interaction. And, in the late 1980s and early 1990s, there wasn’t so much interest in this in the field of statistics. On one side, the anti-Bayesians were still riding high in their rejection of all things prior, even in some quarters a rejection of probability modeling itself. On the other side, a growing number of Bayesians—inspired by applied successes in fields as diverse as psychometrics, pharmacology, and political science—were content to just fit models and not worry about their statistical properties.

Similarly with empirical Bayes, a term which in the hands of Efron and Morris represented a careful, even precarious, theoretical structure intended to capture classical statistical criteria in a setting where the classical ideas did not quite apply, a setting that mixed estimation and prediction—but which had devolved into little more than shorthand for “Bayesian inference, plugging in point estimates for the hyperparameters.” In an era where the purveyors of classical theory didn’t care to wrestle with the complexities of empirical Bayes, and where Bayesians had built the modeling and technical infrastructure needed to fit full Bayesian inference, hyperpriors and all, there was not much of a market for Carl’s hybrid ideas.

This is why I say that, at the time Carl Morris came to Harvard, his work was honored and recognized as pathbreaking, but his actual research agenda was outside the mainstream.

As noted above, though, I think things have changed. The first clue—although it was not at all clear to me at the time—was Trevor Hastie and Rob Tibshirani’s lasso regression, which was developed in the early 1990s and which has of course become increasingly popular in statistics, machine learning, and all sorts of applications. Lasso is important to me partly as the place where Bayesian ideas of shrinkage or partial pooling entered what might be called the Stanford school of statistics. But for the present discussion what is most relevant is the centrality of the functional form. The point of lasso is not just partial pooling, it’s partial pooling with a double-exponential (Laplace) prior. As I said, I did not notice the connection with Carl’s work and other Stein-inspired work back when lasso was introduced—at that time, much was made of the shrinkage of certain coefficients all the way to zero, which indeed is important (especially in practical problems with large numbers of predictors), but my point here is that the ideas of the late 1950s and early 1960s again become relevant. It’s not enough just to say you’re partial pooling—it matters _how_ this is being done.
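The “how” is concrete in the orthonormal-design case, where the lasso solution for each coefficient is plain soft thresholding: small estimates are set exactly to zero, large ones are shifted by a constant. (A minimal sketch; a normal prior, i.e. ridge, would instead rescale every coefficient by a common factor, a different pattern of partial pooling.)

```python
def soft_threshold(z, lam):
    """Lasso update for one coefficient under an orthonormal design:
    argmin over b of 0.5 * (z - b)**2 + lam * abs(b)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # estimates inside [-lam, lam] are pooled all the way to zero

print([soft_threshold(z, 1.0) for z in (-3.0, -0.5, 0.5, 3.0)])
# [-2.0, 0.0, 0.0, 2.0]
```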

In recent years there’s been a flood of research on prior distributions for hierarchical models, for example the work by Nick Polson and others on the horseshoe distribution, and the issues raised by Carl in his classic work are all returning. I can illustrate with a story from my own work. A few years ago some colleagues and I published a paper on penalized marginal maximum likelihood estimation for hierarchical models using, for the group-level variance, a gamma prior with shape parameter 2, which has the pleasant feature of keeping the point estimate off of zero while allowing it to be arbitrarily close to zero if demanded by the data (a pair of properties that is not satisfied by the uniform, lognormal, or inverse-gamma distributions, all of which had been proposed as classes of priors for this model). I was (and am) proud of this result, and I linked it to the increasingly popular idea of weakly informative priors. After talking with Carl, I learned that these ideas were not original to me; indeed, they are closely related to the questions that Carl has been wrestling with for decades in his research, as they relate both to the technical issue of the combination of prior and data distributions and to the larger concerns about default Bayesian (or Bayesian-like) inferences.

In short: in the late 1980s, it was enough to be Bayesian. Or, perhaps I should say, Bayesian data analysis was in its artisanal period, and we tended to be blissfully ignorant about the dependence of our inferences on subtleties of the functional forms of our models. Or, to put a more positive spin on things: when our inferences didn’t make sense, we changed our models, hence the methods we used (in concert with the prior information implicitly encoded in that innocent-sounding phrase, “make sense”) had better statistical properties than one would think based on theoretical analysis alone. Real-world inferences can be superefficient, as Xiao-Li Meng might say, because they make use of tacit knowledge.

In recent years, however, Bayesian methods (or, more generally, regularization, thus including lasso and other methods that are only partly in the Bayesian fold) have become routine, to the extent that we need to think of them as defaults, which means we need to be concerned about . . . their frequency properties. Hence the re-emergence of truly empirical Bayesian ideas such as weakly informative priors, and the re-emergence of research on the systematic properties of inferences based on different classes of priors or regularization. Again, this all represents a big step beyond the traditional classification of distributions: in the robust or empirical Bayesian perspective, the relevant properties of a prior distribution depend crucially on the data model to which it is linked.

So, over 25 years after taking Carl’s class, I’m continuing to see the centrality of his work to modern statistics: ideas from the early 1960s that were in many ways ahead of their time.

Let me conclude with the observation that Carl seemed to us to be a “man out of time” on the personal level as well. In 1989 he seemed ageless to us both physically and in his personal qualities, and indeed I still view him that way. When he came to Harvard he was not young (I suppose he was about the same age as I am now!) but he had, as the saying goes, the enthusiasm of youth, which indeed continues to stay with him. At the same time, he has always been even-tempered, and I expect that, in his youth, people remarked upon his maturity. It has been nearly fifty years since Carl completed his education, and his ideas remain fresh, and I continue to enjoy his warmth, humor, and insights.

The post What’s the most important thing in statistics that’s not in the textbooks? appeared first on Statistical Modeling, Causal Inference, and Social Science.

As I wrote a couple of years ago:

Statistics does not require randomness. The three essential elements of statistics are measurement, comparison, and variation. Randomness is one way to supply variation, and it’s one way to model variation, but it’s not necessary. Nor is it necessary to have “true” randomness (of the dice-throwing or urn-sampling variety) in order to have a useful probability model.

For my money, the #1 neglected topic in statistics is **measurement**.

In most statistics texts that I’ve seen, there’s a lot on data analysis and some stuff on data collection—sampling, random assignment, and so forth—but nothing at all on measurement. Nothing on reliability and validity but, even more than that, nothing on the *concept* of measurement, the idea of considering the connection between the data you gather and the underlying object of your study.

It’s funny: the data model (the “likelihood”) is central to much of the theory and practice of statistics, but the steps that are required to make this work—the steps of measurement and assessment of measurements—are hidden.

When it comes to the question of how to take a sample or how to randomize, or the issues that arise (nonresponse, spillovers, selection, etc.) that interfere with the model, statistics textbooks take the practical issues seriously—even an intro statistics book will discuss topics such as blinding in experiments and self-selection in surveys. But when it comes to measurement, there’s silence, just an implicit assumption that the measurement is what it is, that it’s valid and that it’s as reliable as it needs to be.

**Bad things happen when we don’t think seriously about measurement**

And then what happens? Bad, bad things.

In education—even statistics education—we don’t go to the trouble of accurately measuring what students learn. Why? Part of it is surely that measurement takes effort, and we have other demands on our time. But it’s more than that. I think a large part is that we don’t carefully think about evaluation as a measurement issue and we’re not clear on what we want students to learn and how we can measure this. Sure, we have vague ideas, but nothing precise. In other aspects of statistics we aim for precision, but when it comes to measurement, we turn off our statistics brain. And I think this is happening, in part, because the topic of measurement is tucked away in an obscure corner of statistics and is then forgotten.

And in research too, we see big problems. Consider all those “power = .06” experiments, these “Psychological Science”-style papers we’ve been talking so much about in recent years. A common thread in these studies is sloppy, noisy, biased measurement. Just a lack of seriousness about measurement and, in particular, a resistance to the sort of within-subject designs which much more directly measure the within-person variation that is often of interest in such studies.

Measurement, measurement, measurement. It’s central to statistics. It’s central to how we learn about the world.

The post Eccentric mathematician appeared first on Statistical Modeling, Causal Inference, and Social Science.

What I liked about Wilkinson’s article is how it captured Zhang’s eccentricities with affection but without condescension. Zhang is not like the rest of us, but from reading the article, I get the sense of him as an individual, not defined by his mathematical abilities.

At one level, sure, duh: each of us is an individual. I’m an unusual person myself so maybe it’s a bit rich for me to put the “eccentric” label on some mathematician I’ve never met.

But I think there’s more to it than that. For one thing, I think the usual way to frame an article about someone like this is to present him as a one-of-a-kind genius, to share stories about how brilliant he is. Here, though, you get the idea that Zhang is a top mathematician but not that he has some otherworldly brilliance. Similarly, he solved a really tough problem but we don’t have to hear all about how he’s the greatest of all time. Rather, I get the idea from Wilkinson that Zhang’s life is worth living even if he hadn’t done this great work. Of course, without that, the idea for the article never would’ve come up in the first place, but still.

Here’s a paragraph. I don’t know if it conveys the feeling I’m trying to share but here goes:

Zhang met his wife, to whom he has been married for twelve years, at a Chinese restaurant on Long Island, where she was a waitress. Her name is Yaling, but she calls herself Helen. A friend who knew them both took Zhang to the restaurant and pointed her out. “He asked, ‘What do you think of this girl?'” Zhang said. Meanwhile, she was considering him. To court her, Zhang went to New York every weekend for several months. The following summer, she came to New Hampshire. She didn’t like the winters, though, and moved to California, where she works at a beauty salon. She and Zhang have a house in San Jose, and he spends school vacations there.

So gentle, both on the part of Zhang and of Wilkinson. New Yorker, E. B. White-style, and I mean that in a good way here. It could’ve come straight out of Charlotte’s Web. And it’s such a relief to read after all the Erdos-Feynman-style hype, not to mention all the latest crap about tech zillionaires. I just wish I could’ve met Stanislaw Ulam.

**Tues:** What’s the most important thing in statistics that’s not in the textbooks?

**Wed:** Carl Morris: Man Out of Time [reflections on empirical Bayes]

**Thurs:** “The general problem I have with noninformatively-derived Bayesian probabilities is that they tend to be too strong.”

**Fri:** Good, mediocre, and bad p-values

**Sat:** Which of these classes should he take?

**Sun:** Forget about pdf: this looks much better, it makes all my own papers look like kids’ crayon drawings by comparison.

The post This year’s Atlantic Causal Inference Conference: 20-21 May appeared first on Statistical Modeling, Causal Inference, and Social Science.

The conference will take place May 20-21 (with a short course on May 19th) and the web site for the conference is here. The deadline for submitting a poster title for the poster session is this Friday. Junior researchers (graduate students, postdoctoral fellows, and assistant professors) whose poster demonstrates exceptional research will also be considered for the Thomas R. Ten Have Award, which recognizes “exceptionally creative or skillful research on causal inference.” The two award winners will be invited to speak at the 2016 Atlantic Causal Inference Conference.

We held the first conference in this series ten years ago at Columbia, and I’m glad to see it’s still doing well.

The post Statistical analysis on a dataset that consists of a population appeared first on Statistical Modeling, Causal Inference, and Social Science.

Donna Towns writes:

I am wondering if you could help me solve an ongoing debate?

My colleagues and I are discussing (disagreeing) on the ability of a researcher to analyze information on a population. My colleagues are sure that a researcher is unable to perform statistical analysis on a dataset that consists of a population, whereas I believe that statistical analysis is appropriate if you are testing future outcomes. For example, a group of inmates in a detention centre receive a new program. As it would contravene ethics to withhold it, all offenders receive the program. Therefore, a researcher would need to compare them against a group of inmates prior to the introduction of the program. Assuming, or after confirming, that these two populations are similar, are we able to apply statistical analysis to compare the outcomes of these two populations (such as time to return to detention)? If so, what would be the methodologies used? Do you happen to know of any articles that discuss this issue?

I replied with a link to this post from 2009, which concludes:

If you set up a model including a probability distribution for these unobserved outcomes, standard errors will emerge.
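
To make that concrete, here's a minimal sketch of the superpopulation idea applied to the detention-centre example (my own illustration, with entirely made-up numbers): even when both cohorts are fully observed, modeling the outcomes as draws from a probability distribution gives you a standard error for the comparison, which is what you need to generalize to future inmates.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical months-to-return data for two fully observed cohorts
before = rng.exponential(scale=10.0, size=300)  # pre-program cohort
after = rng.exponential(scale=12.0, size=280)   # post-program cohort

# Difference in mean time to return between the two cohorts
diff = after.mean() - before.mean()

# Under an i.i.d. superpopulation model, the usual standard error
# for a difference in means applies, even though no one was sampled
se = np.sqrt(before.var(ddof=1) / len(before) + after.var(ddof=1) / len(after))
print(f"difference in mean time to return: {diff:.2f} +/- {se:.2f} months")
```

The standard error here quantifies uncertainty about the superpopulation (future cohorts under similar conditions), not sampling error within the observed groups, which is why it makes sense even for "data on a population."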
