The post Some U.S. demographic data at zipcode level conveniently in R appeared first on Statistical Modeling, Causal Inference, and Social Science.

I chuckled when I read your recent “R Sucks” post. Some of the comments were a bit … heated … so I thought I’d send you an email instead.

I agree with your point that some of the datasets in R are not particularly relevant. The way that I’ve addressed that is by adding more interesting datasets to my packages. For an example of this you can see my blog post choroplethr v3.1.0: Better Summary Demographic Data. By typing just a few characters you can now view eight demographic statistics (race, income, etc.) of each state, county and zip code in the US. Additionally, mapping the data is trivial.

I haven’t tried this myself, but assuming it works . . . that’s great to be able to make maps of American Community Survey data at the zipcode level!


The post Survey weighting and that 2% swing appeared first on Statistical Modeling, Causal Inference, and Social Science.

Nate Silver agrees with me that much of that shocking 2% swing can be explained by systematic differences between sample and population: survey respondents included too many Clinton supporters, even after corrections from existing survey adjustments.

In Nate’s words, “Pollsters Probably Didn’t Talk To Enough White Voters Without College Degrees.” Last time we looked carefully at this, my colleagues and I found that pollsters weighted for sex x ethnicity and age x education, but not by ethnicity x education.

I could see that this could be an issue. It goes like this: Surveys typically undersample less-educated people, I think even relative to their proportion of voters. So you need to upweight the less-educated respondents. But less-educated respondents are more likely to be African Americans and Latinos, so this will cause you to upweight these minority groups. Once you’re through with the weighting (whether you do it via Mister P or classical raking or Bayesian Mister P), you’ll end up matching your target population on ethnicity and education, but not on their interaction, so you could end up with too few low-income white voters.
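The margins-versus-interaction point can be seen in a few lines of code. Here is a minimal sketch of raking (iterative proportional fitting) on a hypothetical 2x2 ethnicity-by-education table; every number below is invented for illustration, not taken from any actual survey:

```python
import numpy as np

# Hypothetical population joint distribution:
# rows = white / nonwhite, cols = no college / college grad.
pop = np.array([[0.40, 0.30],
                [0.20, 0.10]])
# Hypothetical survey sample: over-represents college grads,
# with a DIFFERENT ethnicity-education interaction than the population.
sample = np.array([[0.25, 0.40],
                   [0.10, 0.25]])

def rake(tab, row_margins, col_margins, iters=500):
    # Iterative proportional fitting: alternately rescale rows and columns
    # until the table matches both sets of target margins.
    tab = tab.copy()
    for _ in range(iters):
        tab *= (row_margins / tab.sum(axis=1))[:, None]
        tab *= (col_margins / tab.sum(axis=0))[None, :]
    return tab

raked = rake(sample, pop.sum(axis=1), pop.sum(axis=0))
print(raked.sum(axis=1), pop.sum(axis=1))  # ethnicity margins now agree
print(raked.sum(axis=0), pop.sum(axis=0))  # education margins now agree
print(raked[0, 0], pop[0, 0])              # the joint cell still disagrees
```

Raking preserves the sample's odds ratio, so the raked table matches the population on each margin separately while the low-education-white cell itself can remain off, which is exactly the "too few low-income white voters" scenario described above.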

There’s also the gender gap: you want the right number of low-income white male and female voters in each category. In particular, we found that in 2016 the gender gap increased with education, so if your sample gets some of these interactions wrong, you could be biased.

Also a minor thing: Back in the 1990s the ethnicity categories were just white / other and there were 4 education categories: no HS / HS / some college / college grad. Now we use 4 ethnicity categories (white / black / hisp / other) and 5 education categories (splitting college grad into college grad / postgraduate degree). Still just 2 sexes though. For age, I think the standard is 18-29, 30-44, 45-64, and 65+. But given how strongly nonresponse rates vary by age, it could make sense to use more age categories in your adjustment.

Anyway, Nate’s headline makes sense to me. One thing surprises me, though. He writes, “most pollsters apply demographic weighting by race, age and gender to try to compensate for this problem. It’s less common (although by no means unheard of) to weight by education, however.” Back when we looked at this, a bit over 20 years ago, we found that some pollsters didn’t weight at all, some weighted only on sex, and some weighted on sex x ethnicity and age x education. The surveys that did very little weighting relied on the design to get a more representative sample, either using quota sampling or using tricks such as asking for the youngest male adult in the household.

Also, Nate writes, “the polls may not have reached enough non-college voters. It’s a bit less clear whether this is a longstanding problem or something particular to the 2016 campaign.” All the surveys I’ve seen (except for our Xbox poll!) have massively underrepresented young people, and this has gone back for decades. So no way it’s just 2016! That’s why survey organizations adjust for age. There’s always a challenge, though, in knowing what distribution to adjust to, as we don’t know turnout until after the election—and not even then, given all the problems with exit polls.

**P.S.** The funny thing is, back in September, Sam Corbett-Davies, David Rothschild, and I analyzed some data from a Florida poll and came up with the estimate that Trump was up by 1 in that state. This was a poll where the other groups analyzing the data estimated Clinton up by 1, 3, or 4 points. So, back then, our estimate was that a proper adjustment (in this case, using party registration, which we were able to do because this poll sampled from voter registration lists) would shift the polls by something like 2% (that is, 4% in the differential between the two candidates). But we didn’t really do anything with this. I can’t speak for Sam or David, but I figured this was just one poll, and I didn’t take it so seriously.

In retrospect maybe I should’ve thought more about the idea that mainstream pollsters weren’t adjusting their numbers enough. And in retrospect Nate should’ve thought of that too! Our analysis was no secret; it appeared in the New York Times. So Nate and I were both guilty of taking the easy way out, looking at poll aggregates and not doing the work to get inside the polls. We’re doing that now, in December, but we should’ve been doing it in October. Instead of obsessing over details of poll aggregation, we should’ve been working more closely with the raw data.

**P.P.S.** Could someone please forward this email to Nate? I don’t think he’s getting my emails any more!


The post How can you evaluate a research paper? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Shea Levy writes:

You ended a post from last month [i.e., Feb.] with the injunction to not take the fact of a paper’s publication or citation status as meaning anything, and instead that we should “read each paper on its own.” Unfortunately, while I can usually follow e.g. the criticisms of a paper you might post, I’m not confident in my ability to independently assess arbitrary new papers I find. Assuming, say, a semester of a biological sciences-focused undergrad stats course and a general willingness and ability to pick up any additional stats theory or practice, what should someone in the relevant fields do to get to the point where they can meaningfully evaluate each paper they come across?

My reply: That’s a tough one. My own view of research papers has become much more skeptical over the years. For example, I devoted several posts to the Dennis-the-Dentist paper without expressing any skepticism at all—and then Uri Simonsohn comes along and shoots it down. So it’s hard to know what to say. I mean, even as of 2007, I think I had a pretty good understanding of statistics and social science. And look at all the savvy people who got sucked into that Bem ESP thing—not that they thought Bem had proved ESP, but many people didn’t realize how bad that paper was, just on statistical grounds.

So what to do to independently assess new papers?

I think you have to go Bayesian. And by that I *don’t* mean you should be assessing your prior probability that the null hypothesis is true. I mean that you have to think about effect sizes, on one side, and about measurement, on the other.

It’s not always easy. For example, I found the claimed effect sizes for the Dennis/Dentist paper to be reasonable (indeed, I posted specifically on the topic). For that paper, the problem was in the measurement, or one might say the likelihood: the mapping from underlying quantity of interest to data.

Other times we get external information, such as the failed replications in ovulation-and-clothing, or power pose, or embodied cognition. But we should be able to do better, as all these papers had major problems which were apparent, even before the failed reps.

One cue which we’ve discussed a lot: if a paper’s claim relies on p-values, and they have lots of forking paths, you might just have to set the whole paper aside.

Medical research: I’ve heard there’s lots of cheating, lots of excluding patients who are doing well under the control condition, lots of ways to get people out of the study, lots of playing around with endpoints.

The trouble is, this is all just a guide to skepticism. But I’m not skeptical about *everything*.

And the solution can’t be to ask Gelman. There’s only one of me to go around! (Or two, if you count my sister.) And I make mistakes too!

So I’m not sure. I’ll throw the question to the commentariat. What do you say?


The post An exciting new entry in the “clueless graphs from clueless rich guys” competition appeared first on Statistical Modeling, Causal Inference, and Social Science.

Jeff Lax points to this post from Matt Novak linking to a post by Matt Taibbi that shares the above graph from newspaper columnist / rich guy Thomas Friedman.

I’m not one to spend precious blog space mocking bad graphs, so I’ll refer you to Novak and Taibbi for the details.

One thing I *do* want to point out, though, is that this is not necessarily the worst graph promulgated recently by a zillionaire. Let’s never forget this beauty which was being spread on social media by wealthy human Peter Diamandis:


The post Interesting epi paper using Stan appeared first on Statistical Modeling, Causal Inference, and Social Science.

Just thought I’d send along this paper by Justin Lessler et al. Thought it was both clever & useful and a nice ad for using Stan for epidemiological work.

Basically, what this paper is about is estimating the true prevalence and case fatality ratio of MERS-CoV [Middle East Respiratory Syndrome Coronavirus Infection] using data collected via a mix of passive and active surveillance, which if treated naively will result in an overestimate of case fatality and an underestimate of burden, because only the most severe cases are caught via passive surveillance. All of the interesting modeling details are in the supplementary information.
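The selection bias the email describes is easy to simulate. The sketch below is not the paper's model; the severity, fatality, and detection rates are invented numbers, just to show how preferential detection of severe cases inflates a naive case fatality ratio and shrinks the apparent case count:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                  # hypothetical infections

severe = rng.random(n) < 0.2                 # assumed: 20% of cases severe
# assumed fatality: 30% if severe, 1% if mild
dies = rng.random(n) < np.where(severe, 0.30, 0.01)
# passive surveillance mostly catches severe cases (assumed detection rates)
detected = rng.random(n) < np.where(severe, 0.90, 0.05)

true_cfr = dies.mean()
naive_cfr = dies[detected].mean()            # computed from detected cases only
print(true_cfr, naive_cfr)                   # naive CFR runs several times too high
print(detected.sum(), n)                     # detected count understates burden
```

Correcting this requires modeling the detection process itself, which is what the paper's mix of passive and active surveillance data makes possible.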


The post “A bug in fMRI software could invalidate 15 years of brain research” appeared first on Statistical Modeling, Causal Inference, and Social Science.

About 50 people pointed me to this press release or the underlying PPNAS research article, “Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates,” by Anders Eklund, Thomas Nichols, and Hans Knutsson, who write:

Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.

This is all fine (I got various emails with lines such as, “Finally, a PPNAS paper you’ll appreciate”), and I’m guessing it won’t surprise Vul, Harris, Winkielman, and Pashler one bit.

I continue to think that the false-positive, false-negative thing is a horrible way to look at something like brain activity, which is happening all over the place all the time. The paper discussed above looks like a valuable contribution and I hope people follow up by studying the consequences of these fMRI issues using continuous models.
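The paper's contribution is about cluster-extent inference specifically, but the underlying multiplicity arithmetic is easy to check for yourself. This toy simulation (not a reproduction of their analysis) just shows how a nominal 5% per-test threshold behaves at the analysis level when each null analysis involves many tests:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, m, n_analyses = 0.05, 100, 4000   # 100 null tests per analysis (assumed)

z = rng.standard_normal((n_analyses, m))
per_test_rate = (np.abs(z) > 1.96).mean()             # ~5%, as advertised
any_hit_rate = (np.abs(z) > 1.96).any(axis=1).mean()  # per-analysis rate

print(per_test_rate)                  # each individual test is fine
print(any_hit_rate)                   # but almost every analysis "finds" something
print(1 - (1 - alpha) ** m)           # theoretical value for independent tests
```

Real fMRI corrections are supposed to control this analysis-level rate at 5%; the Eklund et al. finding is that the spatial-autocorrelation assumptions behind the standard cluster corrections fail on real data, so the controlled rate can climb toward the uncorrected regime shown here.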


The post OK, sometimes the concept of “false positive” makes sense. appeared first on Statistical Modeling, Causal Inference, and Social Science.

Paul Alper writes:

I know by searching your blog that you hold the position, “I’m negative on the expression ‘false positives.'”

Nevertheless, I came across this. In the medical/police/judicial world, false positive is a very serious issue:

$2

Cost of a typical roadside drug test kit used by police departments. Namely, is that white powder you’re packing baking soda or blow? Well, it turns out that these cheap drug tests have some pretty significant problems with false positives. One study found 33 percent of cocaine field tests in Las Vegas between 2010 and 2013 were false positives. According to Florida Department of Law Enforcement data, 21 percent of substances identified by the police as methamphetamine were not methamphetamine. [ProPublica]

The ProPublica article is lengthy:

Tens of thousands of people every year are sent to jail based on the results of a $2 roadside drug test. Widespread evidence shows that these tests routinely produce false positives. Why are police departments and prosecutors still using them? . . .

The Harris County district attorney’s office is responsible for half of all exonerations by conviction-integrity units nationwide in the past three years — not because law enforcement is different there but because the Houston lab committed to testing evidence after defendants had already pleaded guilty, a position that is increasingly unpopular in forensic science. . . .

The Texas Criminal Court of Appeals overturned Albritton’s conviction in late June, but before her record can be cleared, that reversal must be finalized by the trial court in Houston. Felony records are digitally disseminated far and wide, and can haunt the wrongly convicted for years after they are exonerated. Until the court makes its final move, Amy Albritton — for the purposes of employment, for the purposes of housing, for the purposes of her own peace of mind — remains a felon, one among unknown tens of thousands of Americans whose lives have been torn apart by a very flawed test.

Yes, I agree. There are cases where “false positive” and “false negative” make sense. Just not in general for scientific hypotheses. I think the statistical framework of hypothesis testing (Bayesian or otherwise) is generally a mistake. But in settings in which individuals are in one of some number of discrete states, it can make a lot of sense to think about false positives and negatives.
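In the discrete-state setting, the relevant arithmetic is just Bayes' rule. The sensitivity, specificity, and prevalence below are assumed numbers for illustration (they are not the actual operating characteristics of any field test), though they happen to land in the same ballpark as the figures quoted above:

```python
# Illustrative Bayes calculation; all three inputs are assumptions.
sens = 0.90   # P(test positive | substance really is the drug)
spec = 0.95   # P(test negative | substance is not the drug)
prev = 0.10   # P(substance really is the drug) among tested samples

p_pos = sens * prev + (1 - spec) * (1 - prev)   # total positive rate
ppv = sens * prev / p_pos                       # P(drug | positive test)
print(ppv, 1 - ppv)
```

With these made-up inputs, a third of positives are false even though the test is "95% specific," because most of the substances being tested are innocent. The point is that the false-positive rate that matters to a defendant depends on prevalence, not just on the test.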

The funny thing is, someone once told me that he had success teaching the concepts of type 1 and 2 errors by framing the problem in terms of criminal defendants. My reaction was that he was leading the students exactly in the wrong direction!

I haven’t commented on the politics of the above story, but of course I agree that it’s horrible. Imagine being sent to prison based on some crappy low-quality lab test. There’s a real moral hazard here: the people who do these tests and promote them based on bad data aren’t at risk of going to prison themselves, even though they’re putting others in jeopardy.


The post An election just happened and I can’t stop talking about it appeared first on Statistical Modeling, Causal Inference, and Social Science.

Some things I’ve posted elsewhere:

The Electoral College magnifies the power of white voters (with Pierre-Antoine Kremp)

I’m not impressed by this claim of vote rigging

And, in case you missed it:

Explanations for that shocking 2% shift

Coming soon:

What theories in political science got supported or shot down by the 2016 election? (with Bob Erikson)

A bunch of maps and graphs (with Rob Trangucci, Imad Ali, and Doug Rivers)


The post Reminder: Instead of “confidence interval,” let’s say “uncertainty interval” appeared first on Statistical Modeling, Causal Inference, and Social Science.

We had a vigorous discussion the other day on confusions involving the term “confidence interval,” what does it mean to have “95% confidence,” etc. This is as good a time as any for me to remind you that I prefer the term “uncertainty interval”. The uncertainty interval tells you how much uncertainty you have. That works pretty well, I think. Also, I prefer 50% intervals. More generally, I think confidence intervals are overrated for reasons discussed here and here.
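A quick simulation of what a 50% interval delivers under repeated sampling (normal data; the usual ±0.674-standard-error interval, using the normal quantile as an approximation):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 3.0, 50, 20_000            # true mean, sample size, replications

x = rng.normal(mu, 1.0, size=(reps, n))
est = x.mean(axis=1)
se = x.std(axis=1, ddof=1) / np.sqrt(n)

# 50% interval: estimate +/- 0.674 standard errors (normal approximation)
lo, hi = est - 0.674 * se, est + 0.674 * se
coverage = ((lo < mu) & (mu < hi)).mean()
print(coverage)                           # close to 0.50 across replications
```

Half the time the interval misses, by construction; the virtue of 50% intervals is that this is obvious, so nobody is tempted to treat the interval as a near-certain range.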


The post Happiness formulas appeared first on Statistical Modeling, Causal Inference, and Social Science.

Jazi Zilber writes:

Have you heard of “the happiness formula”?

Lyubomirsky et al. 2005. Happiness = 0.5 genetic, 0.1 circumstances, 0.4 “intentional activity”

They took the 0.4 unexplained variance and argued it is “intentional activity”

Cited hundreds of times by everybody.

The absurd thing is, to you even explaining it is unneeded. For others, I do not know how to explain it.

No, I hadn’t heard of it. So I googled *happiness formula*. And what turned up was a silly-looking formula (but *not* the formula of Lyubomirsky et al. 2005), and some reasonable advice. For example, this from Alexandra Sifferlin in Time magazine in 2014:

Researchers at University College London were able to create an equation that could accurately predict the happiness of over 18,000 people, according to a new study.

First, the researchers had 26 participants complete decisionmaking tasks in which their choices either led to monetary gains or losses. The researchers used fMRI imaging to measure their brain activity, and asked them repeatedly, “How happy are you now?” Based on the data the researchers gathered from the first experiment, they created a model that linked self-reported happiness to recent rewards and expectations.

Here’s what the equation looks like:

Yeah, yeah, I know what you’re thinking . . . it looks like B.S., right? But, as I said, the ultimate advice seemed innocuous enough:

The researchers were not surprised by how much rewards influenced happiness, but they were surprised by how much expectations could. The researchers say their findings do support the theory that if you have low expectations, you can never be disappointed, but they also found that the positive expectations you have for something—like going to your favorite restaurant with a friend—is a large part of what develops your happiness.

Nothing as ridiculous as that formula quoted by Zilber above.

So I next googled *Lyubomirsky et al. 2005* and I found the paper Zilber was talking about, and . . . yeah, it has it all! An exploding pie chart, a couple of 3-d bar charts that would make Ed Tufte spin in his, ummm, Tufte’s still alive so I guess it would make him spin in his chair, they’re so bad. Oh, yeah, also a “longitudinal path model” with asterisks indicating low p-values. What more could you possibly desire? The whole paper made me happy, in a perverse way. By which I mean, it made me sad.

The good news, though, is that this 2005 paper does not seem so influential anymore. At least, when you google *happiness formula* it does not come up on the first page of listings. So that’s one thing we can be happy about.


The post Discussion on overfitting in cluster analysis appeared first on Statistical Modeling, Causal Inference, and Social Science.

Ben Bolker wrote:

It would be fantastic if you could suggest one or two starting points for the idea that/explanation why BIC should naturally fail to identify the number of clusters correctly in the cluster-analysis context.

Bob Carpenter elaborated:

Ben is finding that using BIC to select number of mixture components is selecting too many components given the biological knowledge of what’s going on. These seem to be reasonable mixture models like HMMs for bison movement with states corresponding to transiting and foraging and resting, with the data (distance moved and turning angle) being clearly multimodal.

First (this is more to Christian): Is this to be expected if the models are misspecified and the data’s relatively small?

Second (this one more to Andrew): What do you recommend doing in terms of modeling? The ecologists are already on your page w.r.t. adding predictors (climate, time of day or year) and general hierarchical models over individuals in a population.

Number of components isn’t something we can put a prior on in Stan other than by having something like many mixture components with asymmetric priors or by faking up a Dirichlet process a la some of the BUGS examples. I’ve seen some work on mixtures of mixtures which looks cool, and gets to Andrew’s model expansion inclinations, but it’d be highly compute intensive.

X replied:

Gilles Celeux has been working for many years on the comparison between AIC, BIC and other-ICs for mixtures and other latent class models. Here is one talk he gave on the topic. With the message that BIC works reasonably well for density estimation but not for estimating the number of clusters. Here is also his most popular paper on such information criteria, including ICL.

I am rather agnostic on the use of such information criteria, as they fail to account for prior information or prior opinion on what makes two components distinct rather than identical. In that sense I feel like the problem is non-identifiable if components are not distinguishable in some respect. And as a density estimation problem, the main drawback in having many components is an increased variability. This is not a Bayesian/frequentist debate, unless prior inputs can make components make sense. And prior modelling fights against over-fitting by picking priors on the weights near zero (in the Rousseau-Mengersen 2012 sense).

And then I wrote:

I think BIC is fundamentally different from AIC, WAIC, LOO, etc, in that all those other things are estimates of out-of-sample prediction error, while BIC is some weird thing that under certain ridiculous special cases corresponds to an approximation to the log marginal probability.

Just to continue along these lines: I think it makes more sense to speak of “choosing” the number of clusters or “setting” the number of clusters, not “estimating” the number of clusters, because the number of clusters is not in general a Platonic parameter that it would make sense to speak of estimating. I think this comment is similar to what X is saying, just in slightly different language (although both in English, pas en français).

To put it another way, what does it mean to say “too many components given the biological knowledge of what’s going on”? This depends on how “component” is defined. I don’t mean this as a picky comment: I think this is fundamental to the question. To move the discussion to an area I know more about: suppose we want to characterize voters. We could have 4 categories: Dem, Rep, Ind, Other. We could break this down more, place voters on multiple dimensions, maybe identify 12 or 15 different sorts of voters. Ultimately, though, we’re each individuals, so we could define 300 million clusters, one for each American. It seems to me that the statement “too many components” has to be defined with respect to what you will be doing with the components. To put it another way: what’s the cost to including “too many” components? Is the cost that estimates will be too noisy? If so, there is some interaction between #components and the prior being used on the parameters: one might have a prior that works well for 4 or 5 components but not so well when there are 20 or 25 components.

Actually, I can see some merit to the argument that there can just about never be more than 4 or 5 clusters, ever. My argument goes like this: if you’re talking “clusters” you’re talking about a fundamentally discrete process. But once you have more than 4 or 5, you can’t really have real discreteness; instead things slide into a continuous model.

OK, back to the practical question. Here I like the idea of using LOO (or WAIC) in that I understand what it’s doing: it’s an estimate of out-of-sample prediction error, and I can take that for what it is.

To get to the modeling question: if Ben is comfortable with a model with, say, between 3 and 6 clusters, then I think he should just fit a model with 6 clusters. Just include all 6 and let some of them be superfluous if that’s what the model and data want. One way to keep the fitting under control is to regularize a bit by putting strong priors on the weights of the mixture components, so that the first components are large in expectation and later components are smaller. You can do this with an informative Dirichlet prior on the vector of lambda parameters. I’ve never tried this but it seems to me like it could work.

Also (and I assume this is being done already, but I’ll mention it just in case): don’t forget to put informative priors on the parameters in each mixture component. I don’t know the details of this particular model, but, just for example, if we are fitting a mixture of normals, it’s important to constrain the variances of the normals because the model will blow up with infinite likelihood at points where any variance equals zero. The constraint can be “soft,” for example lognormal priors on the scale parameters, or a hierarchical prior on the scale parameters with a proper prior on how much they vary. The same principle applies to other sorts of mixture models.

And Aki added:

If the whole distribution is multimodal, it is easier to identify the number of modes and say that these correspond to clusters. Even if we have “true” clusters, if they are overlapping so that there are no separate modes, the number of clusters is not well identified unless we have a lot of information about the shape of each cluster. Example: using a mixture of Gaussians to fit Student-t data: as n goes to infinity, the number of components (clusters) goes to infinity. Depending on the amount of model misspecification and the separability of clusters, we may not be able to identify the number of clusters no matter which criterion we use. In simulated examples with a true small number of clusters, a criterion which favors a small number of clusters is likely to perform well (LOO (or WAIC) is likely to favor more clusters than marginal likelihood, BIC, or WBIC). In Andrew’s voters example, and in many medical examples I’ve seen, there are no clear clusters, as the variation between individuals is mostly continuous, or discrete in high dimensions.
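Aki's Student-t example is easy to reproduce. Here's a crude sketch (a hand-rolled 1D EM, not any of the packages discussed above) computing BIC for k = 1..4 Gaussian mixture components on data drawn from a single t distribution, i.e., one heavy-tailed "cluster":

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=2000)   # heavy-tailed data, no true mixture

def mixture_logdens(x, w, mu, sd):
    # per-point, per-component log density, and the log mixture density
    logp = (np.log(w) - np.log(sd) - 0.5 * np.log(2 * np.pi)
            - 0.5 * ((x[:, None] - mu) / sd) ** 2)
    m = logp.max(axis=1, keepdims=True)
    return logp, m + np.log(np.exp(logp - m).sum(axis=1, keepdims=True))

def gmm_bic(x, k, iters=300):
    # crude EM for a k-component 1D Gaussian mixture, then BIC
    n = len(x)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread-out init
    sd = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        logp, log_mix = mixture_logdens(x, w, mu, sd)
        r = np.exp(logp - log_mix)                  # E step: responsibilities
        w = r.mean(axis=0)                          # M step
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / r.sum(axis=0))
        sd = np.maximum(sd, 1e-3)                   # guard against collapse
    _, log_mix = mixture_logdens(x, w, mu, sd)
    return -2 * log_mix.sum() + (3 * k - 1) * np.log(n)

bics = {k: gmm_bic(x, k) for k in (1, 2, 3, 4)}
print(bics)   # extra components soak up the tails, so BIC prefers k > 1
```

The extra components here are not "real" clusters; they are the scale mixture that a Gaussian family needs in order to imitate heavy tails, which is exactly why "estimating the number of clusters" with an information criterion can mislead under misspecification.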


The post “Breakfast skipping, extreme commutes, and the sex composition at birth” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Bhash Mazumder sends along a paper (coauthored with Zachary Seeskin) which begins:

A growing body of literature has shown that environmental exposures in the period around conception can affect the sex ratio at birth through selective attrition that favors the survival of female conceptuses. Glucose availability is considered a key indicator of the fetal environment, and its absence as a result of meal skipping may inhibit male survival. We hypothesize that breakfast skipping during pregnancy may lead to a reduction in the fraction of male births. Using time use data from the United States we show that women with commute times of 90 minutes or longer are 20 percentage points more likely to skip breakfast. Using U.S. census data we show that women with commute times of 90 minutes or longer are 1.2 percentage points less likely to have a male child under the age of 2. Under some assumptions, this implies that routinely skipping breakfast around the time of conception leads to a 6 percentage point reduction in the probability of a male child.
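The abstract's last step, from a 1.2 percentage point difference in male births to a 6 point implied effect of breakfast skipping, is a Wald-style ratio of the two quoted differences (this is the "under some assumptions" instrumental-variables arithmetic, using only the numbers in the abstract):

```python
# Back-of-envelope implied by the abstract's two numbers:
first_stage = 0.20    # long commute -> +20 pp probability of skipping breakfast
reduced_form = 0.012  # long commute -> -1.2 pp probability of a male child

implied_effect = reduced_form / first_stage
print(implied_effect)  # the quoted 6 percentage point reduction
```

The ratio form makes clear how fragile the headline number is: any noise in either the 20-point or the 1.2-point estimate propagates directly into the 6-point claim.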

Here are the key graphs. First, showing that people with long commute times are more likely to be skipping breakfast:

I have no idea how 110% of people are supposed to be skipping breakfast, but whatever.

And, second, showing that people with long commute times are less likely to have boy babies:

I have no idea what’s going on with these bars that start at 49.8%, but whatever. Maybe someone can tell these people that it’s ok to plot points, you don’t need big gray bars attached?

Anyway, what can I say . . . I don’t buy it. This second graph, in particular: everything looks too noisy to be useful.

Or, to put it another way: The general hypothesis seems reasonable: when the fetus gets less nourishment, it’s more likely that a boy fetus doesn’t survive. But this all looks really, really noisy. Also, there’s the statistical significance filter, so the estimates they report are overestimates.
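The significance-filter claim is easy to check by simulation. The effect size and standard error below are assumed numbers chosen to mimic a low-power setting like this one, not estimates from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, se, reps = 0.012, 0.008, 200_000   # assumed small effect, noisy data

est = rng.normal(true_effect, se, reps)          # repeated noisy estimates
signif = np.abs(est) > 1.96 * se                 # which ones reach p < 0.05
exaggeration = est[signif].mean() / true_effect

print(signif.mean())    # power well under 50%
print(exaggeration)     # significant estimates overstate the true effect
```

When power is low, the estimates that clear the significance bar are, on average, well above the truth, which is why "get a new data set and don't expect the pattern to repeat" is the right prediction.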

To put it another way: Get a new data set, and I don’t expect to see the pattern repeat.

That said, there are papers in this literature that are a lot worse. For example, Mazumder and Seeskin cite a Mathews, Johnson, and Neil paper on correlation between maternal diet and sex ratio that had a sample size of only 740, which makes it absolutely useless for learning anything at all, given actual effect sizes on sex ratios. They could’ve just as well been publishing random numbers. But that was 2008, back before people knew about these problems. We can only hope that the editors of “Proceedings of the Royal Society B: Biological Sciences” know better today.


The post Abraham Lincoln and confidence intervals appeared first on Statistical Modeling, Causal Inference, and Social Science.

Our recent discussion with mathematician Russ Lyons on confidence intervals reminded me of a famous logic paradox, in which equality is not as simple as it seems.

The classic example goes as follows: Abraham Lincoln is the 16th president of the United States, but this does not mean that one can substitute the two expressions “Abraham Lincoln” and “the 16th president of the United States” at will. For example, consider the statement, “If things had gone a bit differently in 1860, Stephen Douglas could have become the 16th president of the United States.” This becomes flat-out false if we do the substitution: “If things had gone a bit differently in 1860, Stephen Douglas could have become Abraham Lincoln.”

Now to confidence intervals. I agree with Rink Hoekstra, Richard Morey, Jeff Rouder, and Eric-Jan Wagenmakers that the following sort of statement, “We can be 95% confident that the true mean lies between 0.1 and 0.4,” is not in general a correct way to describe a classical confidence interval. Classical confidence intervals represent statements that are correct under repeated sampling based on some model; thus the correct statement (as we see it) is something like, “Under repeated sampling, the true mean will be inside the confidence interval 95% of the time” or even “Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.” Russ Lyons, however, felt the statement “We can be 95% confident that the true mean lies between 0.1 and 0.4” was just fine; in his view, this is “the very meaning of ‘confidence.’”

This is where Abraham Lincoln comes in. We can all agree on the following summary:

A. Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.

And we could even perhaps feel that the phrase “confidence interval” implies “averaging over repeated samples,” and thus the following statement is reasonable:

B. “We can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.”

Now consider the other statement that caused so much trouble:

C. “We can be 95% confident that the true mean lies between 0.1 and 0.4.”

In a problem where the confidence interval is [0.1, 0.4], “the lower and upper endpoints of the confidence interval” is just “0.1 and 0.4.” So B and C are the same, no? No. Abraham Lincoln, meet the 16th president of the United States.

In statistical terms, once you supply numbers on the interval, you’re conditioning on it. You’re no longer implicitly averaging over repeated samples. Just as, once you supply a name to the president, you’re no longer implicitly averaging over possible elections.

So here’s what happened. We can all agree on statement A. Statement B is a briefer version of A, eliminating the explicit mention of replications because they are implicit in the reference to a confidence interval. Statement C does a seemingly innocuous switch but, as a result, implies conditioning on the interval, thus resulting in a much stronger statement that is not necessarily true (that is, in mathematical terms, is not in general true).

None of this is an argument over statistical *practice*. One might feel that classical confidence statements are a worthy goal for statistical procedures, or maybe not. But, like it or not, confidence statements are all about repeated sampling and are not in general true about any *particular* interval that you might see.
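Statement A is easy to verify by simulation: over repeated samples the interval covers the true mean 95% of the time, even though any single realized interval, like [0.1, 0.4], either contains the true mean or it doesn't. A quick sketch with a known-variance normal model (all numbers invented):

```python
import random
from math import sqrt

random.seed(1)
true_mean, sd, n, reps = 0.25, 0.5, 100, 10_000
z = 1.96

covered = 0
for _ in range(reps):
    xs = [random.gauss(true_mean, sd) for _ in range(n)]
    xbar = sum(xs) / n
    half = z * sd / sqrt(n)  # known-sigma interval, to keep it simple
    if xbar - half <= true_mean <= xbar + half:
        covered += 1

print(covered / reps)  # close to 0.95, averaging over repeated samples
```

The 95% lives in the loop, not in any one interval: condition on a particular pair of endpoints and the coverage statement no longer applies.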

**P.S.** More here.

The post Abraham Lincoln and confidence intervals appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post I’m only adding new posts when they’re important . . . and this one’s really important. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Durf Humphries writes:

I’m a fact-checker and digital researcher in Atlanta. Your blog has been quite useful to me this week. Your statistics and explanations are impressive, but the decision to ornament your articles with such handsome cats? That’s divine genius and it’s apparent that these are not random cats, but carefully curated critters that compliment the scholarship. Bravo.

He adds:

I’ve also attached a picture of my cats that you are welcome to use. Their names are Peach (the striped one) and Pancake (the very, very dark gray one).

The post I’m only adding new posts when they’re important . . . and this one’s really important. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post How best to partition data into test and holdout samples? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Bill Harris writes:

In “Type M error can explain Weisburd’s Paradox,” you reference Button et al. 2013. While reading that article, I noticed figure 1 and the associated text describing the 50% probability of failing to detect a significant result with a replication of the same size as the original test that was just significant.

At that point, something clicked: what advice do people give for holdout samples, for those who test that way?

R’s Rattle has a default partition of 70/15/15 (percentages). http://people.duke.edu/~rnau/three.htm recommends at least a 20% holdout — 50% if you have a lot of data.

Seen in the light of Button 2013 and Gelman 2016, I wonder if it’s more appropriate to have a small training sample and a larger test or validation sample. That way, one can explore data without too much worry, knowing that a significant number of results could be spurious, but the testing or validation will catch that. With a 70/15/15 or 80/20 split, you risk wasting test subjects by finding potentially good results and then having a large chance of rejecting the result due to sampling error.

My reply:

I’m not so sure about your intuition. Yes, if you hold out 20%, you don’t have a lot of data to be evaluating your model and I agree with you that this is bad news. But usually people do 5-fold cross-validation, right? So, yes you hold out 20%, but you do this 5 times, so ultimately you’re fitting your model to all the data.
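For concreteness, here's the bookkeeping of K-fold cross-validation — a bare sketch with no libraries, just the index arithmetic:

```python
def kfold_indices(n, k=5):
    """Split indices 0..n-1 into k folds; yield (train, test) index lists
    so that every index lands in a test fold exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

held_out = []
for train, test in kfold_indices(20, k=5):
    assert not set(train) & set(test)  # no leakage between train and test
    held_out.extend(test)

print(sorted(held_out) == list(range(20)))  # True: every point held out once
```

Each fit sees 80% of the data, but across the five fits the evaluation touches all of it — which is the contrast with the one-shot holdout scheme discussed next.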

Hmmmm, but I’m clicking on your link and it does seem that people recommend this sort of one-shot validation on a subset (see image above). And this does seem like it would be a problem.

I suppose the most direct way to check this would be to run a big simulation study trying out different proportions for the test/holdout split and seeing what performs best. A lot will depend on how much of the decision making is actually being done at the evaluation-of-the-holdout stage.

I haven’t thought much about this question—I’m more likely to use leave-one-out cross-validation as, for me, I use such methods not for model selection but for estimating the out-of-sample predictive properties of models I’ve already chosen—but maybe others have thought about this.

I’ve felt for a while (see here and here) that users of cross-validation and out-of-sample testing often seem to forget that these methods have sampling variability of their own. The winner of a cross-validation or external validation competition is just the winner for that particular sample.

**P.S.** My main emotion, though, when receiving the above email was pleasure or perhaps relief to learn that at least one person is occasionally checking to see what’s new on my published articles page! I hadn’t told anyone about this new paper so it seems that he found it there just by browsing. (And actually I’m not sure of the publication status here: the article was solicited by the journal but then there’ve been some difficulties, we’ve brought in a coauthor . . . who knows what’ll happen. It turns out I really like the article, even though I only wrote it as a response to a request and I’d never heard of Weisburd’s paradox before, but if the Journal of Quantitative Criminology decides it’s too hot for them, I don’t know where I could possibly send it. This happens sometimes in statistics, that an effort in some very specific research area or sub-literature has some interesting general implications. But I can’t see a journal outside of criminology really knowing what to do with this one.)

The post How best to partition data into test and holdout samples? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post a2 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Wow.

**P.S.** In the comment thread, Peter Dorman has an interesting discussion of Carlsen’s errors so far during the tournament.

The post a2 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Deep learning, model checking, AI, the no-homunculus principle, and the unitary nature of consciousness appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Bayesian data analysis, as my colleagues and I have formulated it, has a human in the loop.

Here’s how we put it on the very first page of our book:

The process of Bayesian data analysis can be idealized by dividing it into the following three steps:

1. Setting up a full probability model—a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.

2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution—the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.

3. Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.
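To make the three steps concrete, here is a toy version (my example, not from the book): a conjugate normal model with known variance, where step 2 is available in closed form and step 3 is a posterior predictive check on a test statistic.

```python
import random
from math import sqrt

random.seed(0)

# Step 1: set up the model.  y_i ~ Normal(theta, sigma^2), known sigma,
# with prior theta ~ Normal(mu0, tau0^2).
mu0, tau0, sigma = 0.0, 10.0, 1.0
y = [random.gauss(1.5, sigma) for _ in range(50)]  # the observed data

# Step 2: condition on the data.  Conjugate normal posterior for theta.
n, ybar = len(y), sum(y) / len(y)
post_var = 1 / (1 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + n * ybar / sigma**2)

# Step 3: check the fit.  Compare an observed test statistic (here max(y))
# to its distribution over posterior predictive replications of the data.
T_obs = max(y)
reps, extreme = 1000, 0
for _ in range(reps):
    theta = random.gauss(post_mean, sqrt(post_var))
    y_rep = [random.gauss(theta, sigma) for _ in range(n)]
    extreme += max(y_rep) >= T_obs
print("posterior predictive p-value for max(y):", extreme / reps)
```

A p-value near 0 or 1 would flag misfit and send us back to step 1; for real problems the statistics, and the judgment about what counts as misfit, are where the human currently comes in.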

How does this fit in with goals of performing statistical analysis using artificial intelligence? Lots has been written on “machine learning” but in practice this often captures just part of the process. Here I want to discuss the possibilities for automating the entire process.

Currently, human involvement is needed in all three steps listed above, but in different amounts:

1. Setting up the model involves a mix of look-up and creativity. We typically pick from some conventional menu of models (linear regressions, generalized linear models, survival analysis, Gaussian processes, splines, Bart, etc etc). Tools such as Stan allow us to put these pieces together in unlimited ways, in the same way that we can formulate paragraphs by putting together words and sentences. Right now, a lot of human effort is needed to set up models in real problems, but I could imagine an automatic process that constructs models from parts, in the same way that there are computer programs to write sports news stories.

2. Inference given the model is the most nearly automated part of data analysis. Model-fitting programs still need a bit of hand-holding for anything but the simplest problems, but it seems reasonable to assume that the scope of the “self-driving inference program” will gradually increase. Just for example, we can automatically monitor the convergence of iterative simulations (that came in 1990!) and, with Nuts, we don’t have to tune the number of steps in Hamiltonian Monte Carlo. Step by step, we should be able to make our inference algorithms more automatic, also with automatic checks (for example, based on adaptive fake-data simulations) to flag problems when they do appear.
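The 1990-era convergence monitor mentioned above — the Gelman and Rubin potential scale reduction factor — can itself be sketched in a few lines (a simplified version; modern practice uses split chains and further refinements):

```python
from math import sqrt

def rhat(chains):
    """Potential scale reduction factor (Gelman and Rubin 1992) from a
    list of equal-length chains of draws for one parameter."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m              # within-chain
    var_plus = (n - 1) / n * W + B / n
    return sqrt(var_plus / W)

# Two chains exploring the same region vs. two chains stuck in different places:
mixed = [[0.1 * i % 1 for i in range(100)], [0.1 * (i + 3) % 1 for i in range(100)]]
stuck = [[0.1 * i % 1 for i in range(100)], [5 + 0.1 * i % 1 for i in range(100)]]
print(rhat(mixed))  # about 1: chains agree
print(rhat(stuck))  # much greater than 1: convergence has failed
```

The point of the example is the automation: a number close to 1 lets the fitting program proceed on its own, with no human eyeballing of trace plots required.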

3. The third step—identifying model misfit and, in response, figuring out how to improve the model—seems like the toughest part to automate. We often learn of model problems through open-ended exploratory data analysis, where we look at data to find unexpected patterns and compare inferences to our vast stores of statistical experience and subject-matter knowledge. Indeed, one of my main pieces of advice to statisticians is to integrate that knowledge into statistical analysis, both in the form of formal prior distributions and in a willingness to carefully interrogate the implications of fitted models.

How would an AI do step 3? One approach is to simulate the human in the loop by explicitly building a model-checking module that takes the fitted model, uses it to make all sorts of predictions, and then checks this against some database of subject-matter information. I’m not quite sure how this would be done, but the idea is to try to program up the Aha process of scientific revolutions.

**The conscious brain: decision-making homunculus or interpretive storyteller?**

There is another way to go, though, and I thought of this after seeing Julien Cornebise speak at Google about a computer program that his colleagues wrote to play video games. He showed the program “figuring out” how to play a simulation the 1970s arcade classic game, Breakout. What was cool was not just how it could figure out how to position the cursor to always get to the ball on time, but how the program seemed to learn strategies: Cornebise pointed out how, after a while, the program seemed to have figured out how to send the ball up around the blocks to the top where it would knock out lots of bricks:

OK, fine. What does this have to do with model checking, except to demonstrate that in this particular example no model checking seems to be required as the model does just fine?

Actually, I don’t know on that last point, as it’s possible the program required some human intervention to get to the point that it could learn on its own how to win at Breakout.

But let me continue. For many years, cognitive psychologists have been explaining to us that our conscious mind doesn’t really make decisions as we usually think of it, at least not for many regular aspects of daily life. Instead, we do what we’re gonna do, and our conscious mind is a sort of sportscaster, observing our body and our environment and coming up with stories that explain our actions.

To return to the Breakout example, you could imagine a plug-in module that would observe the game and do some postprocessing—some statistical analysis on the output—and notice that, all of a sudden, the program was racking up the score. The module would interpret this as the discovery of a new strategy, and do some pattern recognition to figure out what’s going on. If this happens fast enough, it could feel like the computer “consciously” decided to try out the bounce-the-ball-along-the-side-to-get-to-the-top strategy.

That’s not quite what the human players do: we can imagine the strategy without it happening yet. But of course the computer could do so too, via a simulation model of the game.

Now let’s return to step 3 of Bayesian data analysis: model checking and improvement. Maybe it’s possible for some big model to be able to learn and move around model space, and to suddenly come across better solutions. This could look like model checking and improvement, from the perspective of the sportscaster part of the brain (or the corresponding interpretive plug-in to the algorithm) even though it’s really just blindly fitting a model.

All that is left, then, is the idea of a separate module that identifies problems with model fit based on comparisons of model inferences to data and prior information. I think that still needs to be built.

The post Deep learning, model checking, AI, the no-homunculus principle, and the unitary nature of consciousness appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>See here: My next 170 blog posts (inbox zero and a change of pace).

So to find out what’s next, just click there and scroll down. All the posts are there (except for various topical items that I’ve inserted).

The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “US Election Analysis 2016: Media, Voters and the Campaign” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Actually, at least one of these chapters was written before the election. When the editors asked me if I could contribute to this book, I said, sure, and I pointed them to this article from a few weeks ago, “Trump-Clinton Probably Won’t Be a Landslide. The Economy Says So.” After the election, I changed the tenses of a few verbs and produced “Trump-Clinton was expected to be close: the economy said so.”

I haven’t had a chance to read any of the other chapters, but I glanced at the table of contents and noticed that one of them was by Ken Cosgrove, from Mad Men, writing on brand loyalty and politics. Cool! I knew the guy had published a short story in the Atlantic, so I guess it was only a matter of time before he dabbled in political science as well.

The post “US Election Analysis 2016: Media, Voters and the Campaign” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Unfinished (so far) draft blog posts appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Most of the time when I start writing a blog post, I continue till it’s finished. As of this writing this blog has 7128 posts published, 137 scheduled, and 434 unpublished drafts sitting in the folder.

434 might sound like a lot, but we’ve been blogging for over 10 years, and a bunch of those drafts never really got started.

Anyway, just for your amusement, I thought I’d share the titles of the draft posts, most of which are unfinished and probably will never be finished. They’re listed in reverse chronological order, and I’m omitting all the posts that I hadn’t bothered to title.

Here are the most recent few:

- Of polls and prediction markets: More on #BrexitFail
- Deep learning, model checking, AI, the no-homunculus principle, and the unitary nature of consciousness
- “Simple, Scalable and Accurate Posterior Interval Estimation”
- ESP and the Bork effect
- Hey, PPNAS . . . this one is the fish that got away.
- The new quantitative journalism
- Trying to make some sense of it all, But I can see that it makes no sense at all . . . Stuck in a local maximum with you
- Statistical significance, the replication crisis, and a principle from Bill James
- Is retraction only for “the worst of the worst”?
- Stan – The Bayesian Data Scientist’s Best Friend [this one’s from Aki]
- How to think about a study that’s iffy but that’s not obviously crap
- The identification trap
- nisbett
- The penumbra of shooting victims
- What is the prior distribution for treatment effects in social psychology?
- You can’t do Bayesian Inference for LDA! [by Bob]
- Product vs. Research Code: The Tortoise and the Hare [another one from Bob; he was busy that week!]
- Party like it’s 2005
- The challenge of constructive criticism
- We got mooks [This one I actually posted, and then one of my colleagues asked me to take it down because my message wasn’t 100% positive.]
- Can’t Stop Won’t Stop Splittin
- Some statistical lessons from the middle-aged-mortality-trends story
- Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.
- If I have not seen far, it’s cos I’m standing on the toes of midgets
- Ovulation and clothing: More forking paths [this one was in the Zombies category and I think we’ve run enough posts on the topic]
- How to get help with Stan [from Daniel. I don’t know why he didn’t post it.]
- Running Stan [also from Daniel]
- Stan taking over the world
- Why is Common Core so crappy?
- Attention-grabbing crap, statistics edition
- Optimistic or pessimistic priors
- I hate hate hate hate this graph. Not so much because it’s a terrible graph—which it is—but because it’s [Yup, that’s it. I guess I didn’t even finish the title of this one!]
- Some more statistics quotes!
- Show more of the time series
- Just in case there was any confusion
- “Steven Levitt from Freakonomics describes why he’s obsessed with golf” [Enough already on this guy. — ed.]
- A statistical communication problem!
- What should be in an intro stat course?
- Postdoc opportunities to work with our research group!!
- When you call me bayesian, I know I’m not the only one
- The NIPS Experiment [from Bob]
- Sociology comments
- When is a knave also a fool?
- Income Inequality: A Question of Velocity or Acceleration? [by David K. Park]
- Economics now = Freudian psychology in the 1950s [I already posted on the topic, so this post must be some sort of old draft.]
- “College Hilariously Defends Buying $219,000 Table”
- Having a place to put my thoughts
- ; vs |
- The (useful) analogy between *preregistration of a replication study* and *randomization in an experiment*
- It’s somewhat about the benjamins [Hey, I like that title!]
- Intellectuals’ appreciation of old pop culture
- Is it hype or is it real?
- Scientific and scholarly disputes
- Book by tobacco-shilling journalist given to Veterans Affairs employees
- Alphabetism
- I ain’t got no watch and you keep asking me what time it is

That takes us back to Oct 2014. Some of these are close to finished and maybe I’ll post soon; others are on topics we’ve already done to death; and some of the others, I have no idea what I was going to say. That last post above, I remember thinking of the idea when I was riding my bike and that Dylan song came on. When I got home, I wrote the title of the post but failed to put anything in the main text box, and now I completely forget my intentions. Too bad, it’s a good title.

**P.S.** I wrote the above a few months ago and I have a couple more drafts now in the pile.

The post Unfinished (so far) draft blog posts appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Individual and aggregate patterns in the Equality of Opportunity research project appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’ve been looking at the work of the Equality of Opportunity Project and noticed that you had commented on some of their work.

Since you are somewhat familiar with the work, and since they do not respond to my queries, I thought I’d ask you about something that is bothering me. I, too, was somewhat put off by their repeated use of the word “causation.” But what really concerns me is that it appears that the work is based on taking huge samples (millions of people) and doing the analysis based on aggregations of them into deciles. Isn’t this demonstrating ecological correlation—which would be fine, except that their interpretations all involve predictive and causative statements at the individual level. In other words, they find a close relationship between various aggregate measures—such as the percentile income rank and the percent attending college—and then interpret that correlation as representing individual correlations. The individual correlations are guaranteed to be weaker than the aggregate ones, and perhaps not even in the same direction.

There is significant effort in this work and it will take me a long time to understand exactly what they have done, but I thought you might be able to save me a bunch of time by telling me whether this is something worth pursuing. I would think that these researchers would be well aware of ecological correlations, but I was constantly puzzled by why their scatterplots have so few points when the sample sizes are so large. Finding a strong linear correlation between aggregate measures conveys a compelling story—but it may not be a true story.

My reply: I’m not sure. This update (which Lehman pointed me to) shows a bunch of individual-level results as well. So it reminds me of our Red State Blue State project where we used individual-level data where possible but also examined aggregate patterns.
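Lehman's general worry — aggregate correlations overstating individual-level ones — is easy to demonstrate with fake data (toy numbers, nothing to do with the actual Equality of Opportunity data):

```python
import random
from math import sqrt

random.seed(7)

def corr(xs, ys):
    """Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    vx = sum((a - mx) ** 2 for a in xs) / n
    vy = sum((b - my) ** 2 for b in ys) / n
    return cov / sqrt(vx * vy)

# Individual-level data: outcome is a weak function of x plus a lot of noise.
x = [random.gauss(0, 1) for _ in range(50_000)]
y = [0.2 * xi + random.gauss(0, 1) for xi in x]
print(round(corr(x, y), 2))  # weak individual-level correlation, about 0.2

# Aggregate into deciles of x and correlate the decile means: the averaging
# washes out the individual noise, so the ten points line up almost perfectly.
pairs = sorted(zip(x, y))
k = len(pairs) // 10
dx = [sum(p[0] for p in pairs[i * k:(i + 1) * k]) / k for i in range(10)]
dy = [sum(p[1] for p in pairs[i * k:(i + 1) * k]) / k for i in range(10)]
print(round(corr(dx, dy), 2))  # close to 1 at the decile level
```

So a clean ten-point scatterplot is consistent with a weak (or, with confounding, even oppositely-signed) individual relationship — which is why it matters that individual-level results are also reported.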

The post Individual and aggregate patterns in the Equality of Opportunity research project appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Know-it-all neuroscientist explains Trump’s election victory to the rest of us appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>There’s a horrible piece in “Scientific American” arguing that knowing some neuroscience explains the results of the election, in particular why people voted for Trump (https://blogs.scientificamerican.com/mind-guest-blog/trump-s-victory-and-the-neuroscience-of-rage/)

It says things like “To understand this election you must understand the brain’s threat detection mechanism”, or “This neuroscience perspective explains the seemingly incomprehensible situation of a privileged billionaire becoming the champion of working class men and women…”, or “Whether these real divisions in society will explode into factionalism or unite us will be determined by the same neural circuitry in our prefrontal cortex that separates ‘us from them.’”

It’s bad on so many levels.

On Facebook, I pointed out that the explanatory force in this piece is located entirely on the psychological/social level: people feel angry and threatened etc by social disruption and alienation etc and then make a decision based on their emotions and on not reasoning etc.

The neural apparatus that is invoked doesn’t do any explanatory work. You can test that by omitting, in turn, the neural apparatus and the psychosocial story and see which one does the explaining. So it’s another instance of the overhyping of neuroscience by people who should know better.

Also, insofar as the psychosocial story rests on evolutionary reasoning (“us” vs “them”, tribalism…) it is pure speculation as we don’t have reliable evidence about the minds and social behaviors of our ancestors.

But that’s only the beginning of the problems with this piece. Others are

– it presumes, without evidence, to know why Trump voters voted the way they did

– it assumes, without evidence, that *all* Trump voters voted for the very same reason. So it brutally ignores the variability in voter motivations and all the nuances of the social, economic, ethnic and personality backgrounds that play into them. It ignores the role of the media in shaping the perception of reality and in creating mass behaviors. It compresses all this into a reductionistic non-explanation of “you did this because your prefrontal cortex did X and your limbic system did Y”.

Finally, there’s some irony in people now fearing what will happen to science under the new science-challenged president. The fact is if science is on the level of this piece (and much, if not most of it, is), then it does deserve to go down the drain.

I agree with Alex. The article in question begins:

Pollsters, politicians, much of the press and public are dismayed by Donald Trump’s surprising victory in the presidential election, but not neuroscientists.

So . . . neuroscientists all predicted that Hillary Clinton would get 50% of the two-party vote, but not 52%? Pretty cool, huh?

Also this bit:

Whom to marry, where to live, or even what entrée to select from a dinner menu, are decisions we make not by reason, but rather by how we “feel.”

Wow—science! Jeez, and before reading this I thought my choice of steak instead of chicken had been driven by hard, rational calculation. It’s a good thing we have neuroscience to tell us that we rely on emotion to decide who we marry. Who’d a thunk it?

As a political scientist, though, what really irks me is the bit about “The pollsters got it wrong because . . .” Off by 2%, dude! I agree that this 2% was a problem—and it was more than 2% in some key states—but, shoot, man, what kind of accuracy are you expecting here? Suppose the polls had been off by 1%? Then would that be ok with you?

As a small-d democrat, I’m also repulsed by this guy’s characterization of Republican voters (yes, they voted approx 50% Republican for the House and Senate, not just for president, also they gave Mitt Romney 48% of the two-party vote in 2012, etc.) as “an appeal to the brain’s limbic system.”

Also bizarre is this bit: “It is impossible to feel love or hate for someone you have not yet met.” Millions of people feel hate for Donald Trump or Hillary Clinton (or both!) without having met either of them.

This particular Hari Seldon concludes with a statement about “the electorate has concluded.” Kind of amazing how the electoral college is part of neuroscience too.

What I’m wondering is, if this was all so damn obvious, why didn’t he publish it *before* the election? That would’ve been much cooler.

The funny thing is, I know of only one neuroscientist who made a public prediction of the 2016 election. He gave Hillary Clinton a 99% chance of winning the election. Afterward he wrote that “the polls were off, massively.” Actually the polling errors were not “massive”; they were in line with historical polling errors. This neuroscientist wrote, “The entire polling industry . . . ended up with data that missed tonight’s results by a very large margin.” Actually the margin wasn’t so large. But it was consequential. The neuroscientist who predicted the election wrong ahead of time called the result a “giant surprise,” but the other neuroscientist who wrote after the election wasn’t bewildered at all.

Maybe the two neuroscientists should get together and work this one out.

The post Know-it-all neuroscientist explains Trump’s election victory to the rest of us appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Good news! PPNAS releases updated guidelines for getting a paper published in their social science division appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**From zero to Ted talk in 18 simple steps: Rolf Zwaan explains how to do it!**

The advice is from 2013 but I think it still just might work. Here’s Zwaan:

How to Cook up Your Own Social Priming Article

1. Come up with an idea for a study. Don’t sweat it. It’s not as hard as it looks. All you need to do is take an idiomatic expression and run with it. Here we go: the glass is half-full or the glass is half-empty.

2. Create a theoretical background. Surely there is some philosopher (preferably a Greek one) who has said something remotely relevant about optimists and pessimists while staring at a wine glass. Include him. For extra flavor you might want to add an anthropologist or a sociologist into the mix; Google is your friend here. Top it off with a few social psychology references. There, you have your theoretical framework. That wasn’t so hard, was it?

3. Think of a manipulation. Again, this is nothing to get nervous about. All you need to do is take the expression literally. Imagine this scenario. The subject is in a room. In the glass-full condition, a confederate comes in with an empty glass and a bottle of water. She then pours the glass half full and leaves the room. In the glass-half-empty condition, she comes in with a full glass and a bottle. She then pours half the glass back into the bottle and leaves.

4. Think of a dependent measure. This is where the fun begins. As you may know, the dependent measure of choice in social priming research is candy. You simply cannot go wrong with candy! So let’s say the subjects get to choose ten pieces of differently colored pieces of candy from a container that has equal numbers of orange and brown M&Ms. Your prediction here is that people in the half-full condition will be more likely to pick the cheery orange M&Ms than those in the half-empty condition, who will tend to prefer the gloomy brown ones.

5. Get a sample. You don’t want to overdo it here. About 30 students from a nondescript university will do nicely. Only 30 in a between-subjects design?, you worry. Worry no more. This is how we roll in social priming.

6. Run Experiment 1. Don’t fuss about issues like the age and gender of the subjects and details of the procedure; you won’t be reporting them anyway.

7. Analyze the results. Normally, you’d worry that you might not find an effect. But this is social priming remember? You are guaranteed to find an effect. In fact, your effect size will be around .8. That’s social priming for you!

8. Now on to Experiment 2. Come up with a new manipulation. What’s wrong with the glass and bottle from Experiment 1? you might wonder. Are you kidding? This is social priming research. You need a new manipulation. Just let your imagination run wild. How about balloons? In the half-full condition, the confederate walks in with an inflated balloon and lets half the air out in front of the subject. In the half empty condition, she half-inflates a balloon. And bingo! You’re done (careful with the word bingo, by the way; it makes people walk real slow).

9. Think of a new dependent measure. Why not have the subjects list their favorite TV shows? Your prediction here is that the half-full condition will list more sitcoms like Seinfeld and Big Bang Theory than the half-empty condition, which will list more crime shows like CSI and Law & Order (or maybe one of those stupid vampire shows). You could also include a second dependent measure. How about having subjects indicate how much they identify with Winnie de Pooh characters? Your prediction here is obvious: the half full condition will identify with Tigger the most while the half empty condition will prefer Eeyore by a landslide.

10. Repeat steps 5-7.

11. Now you are ready to write your General Discussion. You want to discuss the implications of your research. Don’t be shy here. Talk about the major implications for business, health, education, and politics this research so evidently has.

12. For garnish, add a quirky celebrity quote. Don’t work yourself into a lather. Just go to www.goodreads.com to find a quote. Here, I already did the work for you: “Some people see the glass half full. Others see it half empty. I see a glass that’s twice as big as it needs to be.” ― George Carlin. Just say something clever like: Unless you are like George Carlin, it does make a difference whether the glass is half empty or half full.

13. The next thing you need is an amusing title. And here your preparatory work really pays off. Just use the expression from Step 1 as your main title, describe your (huge) effect in the subtitle, and you’re done: Is the Glass Half Empty or Half Full? The Effect of Perspective on Mood.

14. Submit to a journal that regularly publishes social priming research. They’ll eat it up.

15. Wax poetic about your research in the public media. If it wasn’t a good idea to be modest in the General Discussion, you really need to let loose here. Like all social priming research, your work has profound consequences for all aspects of society. Make sure the taxpayer (and your Dean, haha) knows about it.

16. If bloggers are critical about your work, just ignore them. They’re usually cognitive psychologists with nothing better to do.

17. Once you’ve worked through this example, you might try your hand at more advanced topics like coming out of the closet. Imagine all the fun you’ll have with that one!

18. Good luck!

This is all so perfect, I just have nothing to add. You know how journals have style guides, and instructions on what sort of papers they like to publish? Wouldn’t it be just perfect if PPNAS (see here or here or here or . . .) linked to the Rolf Zwaan page, completely deadpan, saying this is the path to getting a paper published in their social science division?

The post Good news! PPNAS releases updated guidelines for getting a paper published in their social science division appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Thinking more seriously about the design of exploratory studies: A manifesto appeared first on Statistical Modeling, Causal Inference, and Social Science.

In the middle of a long comment thread on a silly Psychological Science paper, Ed Hagen wrote:

Exploratory studies need to become a “thing.” Right now, they play almost no formal role in social science, yet they are essential to good social science. That means we need to put as much effort in developing standards, procedures, and techniques for exploratory studies as we have for confirmatory studies. And we need academic norms that reward good exploratory studies so there is less incentive to disguise them as confirmatory.

Yesyesyesyesyesyesyesyesyesyesyes.

The problem goes like this:

1. Exploratory work gets no respect. Do an exploratory study and you’ll have a difficult time getting it published.

2. So, people don’t want to do exploratory studies, and when someone *does* do an exploratory study, he or she is motivated to cloak it in confirmatory language. (Our hypothesis was Z, we did test Y, etc.)

3. If you tell someone you will interpret their study as being exploratory, they may well be insulted, as if you’re saying their study is only exploration and not real science.

4. Then there’s the converse: it’s hard to criticize an exploratory study. It’s just exploratory, right? Anything goes!

And here’s what I think:

Exploration is important. In general, hypothesis testing is overrated and hypothesis generation is underrated, so it’s a good idea for data to be collected with exploration in mind.

But exploration, like anything else, can be done well or it can be done poorly (or anywhere in between). To describe a study as “exploratory” does not get it off the hook for problems of measurement, conceptualization, etc.

For example, Ed Hagen in that thread mentioned that horrible ovulation and clothing paper, and its even more horrible followup where the authors pulled the outdoor temperature variable out of a hat to explain away an otherwise embarrassing non-replication (which shouldn’t’ve been embarrassing at all given the low low power and many researcher degrees of freedom of the original study which had gotten them on the wrong track in the first place). As I wrote in response to Hagen, I love exploratory studies, but gathering crappy one-shot data on a hundred people and looking for the first thing that can explain your results . . . that’s low-quality exploratory research.

**From “EDA” to “Design of exploratory studies”**

With the phrase “Exploratory Data Analysis,” the statistician John Tukey named and gave initial shape to a whole new way of thinking formally about statistics. Tukey of course did not invent data exploration, but naming the field gave a boost to thinking about it formally (in the same way that, to a much lesser extent, our decades of writing about posterior predictive checks has given a sense of structure and legitimacy to Bayesian model checking). And that’s all fine. EDA is great, and I’ve written about connections between EDA and Bayesian modeling; see here and here.

But today I want to talk about something different, which is the idea of *design* of an exploratory study.

Suppose you know ahead of time that your theories are a bit vague and omnidirectional, that all sorts of interesting things might turn up that you will want to try to understand, and you want to move beyond the outmoded Psych Sci / PPNAS / Plos-One model of chasing p-values in a series of confirmatory studies.

You’ve thought it through and you want to do it right. You know it’s time for exploration first and confirmation later, if at all. So you want to design an exploratory study.

What principles do you have? What guidelines? If you look up “design” in statistics or methods textbooks, you’ll find a lot of power calculations, maybe something on bias and variance, and perhaps some advice on causal identification. All these topics are *relevant* to data exploration and hypothesis generation, but not directly so, as the output of the analysis is not an estimate or hypothesis test.

So I think we—the statistics profession—should be offering guidelines on the design of exploratory studies.

An analogy here is observational studies. Way back when, causal inference was considered to come from experiments. Observational studies were second best, and statistics textbooks didn’t give any advice on the design of observational studies. You were supposed to just take your observational data, feel bad that they didn’t come from experiments, and go from there. But then Cochran, and Rosenbaum, and Angrist and Pischke, wrote textbooks on observational studies, including advice on how to design them. We’re gonna be doing observational studies, so let’s do a good job at them, which includes thinking about how to plan them.

Same thing with exploratory studies. Data-based exploration and hypothesis generation are central to science. Statisticians should be involved in the design as well as the analysis of these studies.

So what advice should we give? What principles do we have for the design of exploratory studies?

Let’s try to start from scratch, rather than taking existing principles such as power, bias, and variance that derive from confirmatory statistics.

– Measurement. I think this has to be the #1 principle. Validity and reliability: that is, you’re measuring what you think you’re measuring, and you’re measuring it precisely. Related: within-subject designs or, to put it more generally, structured measurements. If you’re interested in studying people’s behavior, measure it over and over, ask people to keep diaries, etc. If you’re interested in improving education, measure lots of outcomes, try to figure out what people are actually learning. And so forth.

– Open-endedness. Measuring lots of different things. This goes naturally with exploration.

– Connections between quantitative and qualitative data. You can learn from those open-ended survey responses—but only if you look at them.

– Where possible, collect or construct continuous measurements. I’m thinking of this partly because graphical data analysis is an important part of just about any exploratory study. And it’s hard to graph data that are entirely discrete.

I think much more can be said here. It would be great to have some generally useful advice for the design of exploratory studies.

The post Anti-immigration attitudes: they didn’t want a bunch of Hungarian refugees coming in the 1950s appeared first on Statistical Modeling, Causal Inference, and Social Science.

A few days ago, an article in the New York Times by Amanda Taub said that working-class support for Donald Trump reflected a “crisis of white identity.” Today, Ross Douthat said that it reflected the “thinning out of families.” The basic idea in both was that working-class (i.e., less-educated) people’s opposition to immigration is a symptom of anxiety about something else.

In September 1957, the days of the baby boom and the “affluent society,” when unions were strong and no one was talking about a crisis of white identity or masculinity, the Gallup Poll asked “UNDER THE PRESENT IMMIGRATION LAWS, THE HUNGARIAN REFUGEES WHO CAME TO THIS COUNTRY AFTER THE REVOLTS LAST YEAR HAVE NO PERMANENT RESIDENCE AND CAN BE DEPORTED AT ANY TIME. DO YOU THINK THE LAW SHOULD OR SHOULD NOT BE CHANGED SO THAT THESE REFUGEES CAN STAY HERE PERMANENTLY?”

42% said yes, and 43% said no.

In July 1958, another Gallup Poll asked “IN EUROPE THERE ARE STILL ONE HUNDRED AND SIXTY THOUSAND REFUGEES WHO LEFT HUNGARY TO ESCAPE THE COMMUNISTS. IT HAS BEEN SUGGESTED THAT THE U.S. PERMIT SIXTY-FIVE THOUSAND OF THESE PEOPLE TO COME TO THIS COUNTRY. WOULD YOU APPROVE OR DISAPPROVE OF THIS PLAN?”

33% approved and 55% disapproved.

With both questions, education made a difference for opinions. For example, in 1958, 55% of the people with a college degree favored letting the refugees come to the United States, compared to 31% of those without college degrees. The only other demographic factor that made a clear difference was that Jews were more likely to favor letting the refugees stay.

The 1957 survey also had a question about the Brown v. Board of Education decision against school segregation—people who approved were more likely to favor letting the refugees stay. The 1958 survey had a series of questions about whether you would vote for various religious or racial minorities for president—people who were more tolerant were more likely to favor letting the refugees come to the United States.

The Hungarian refugees were white, Christian, and could be seen as part of a clear story of oppression vs. resistance. Despite this, most people, especially less educated people, were not in favor of letting them stay in the United States. So the contemporary opposition to immigration, and the tendency for it to be stronger among less educated people, are not a reflection of something specific to today, but continue a long-standing pattern. Of course, an increase in the number of immigrants today presumably makes the issue more important. But the basic pattern is not new.

[Data from the Roper Center for Public Opinion Research]

Interesting: among those who expressed an opinion, over 60% opposed letting those 65,000 anti-communist Hungarian refugees come to the U.S. And, as Weakliem points out, it’s hard to explain this based on ethnic prejudice, which is how we usually think about earlier anti-immigrant movements such as the Know Nothings of the 1850s.

Just one thing, though: There was a big recession in 1958. So people could’ve been reacting to that. In retrospect the 1958 recession doesn’t seem like much, but at the time people didn’t know if we were going to jump into another great depression.

The post Only on the internet . . . appeared first on Statistical Modeling, Causal Inference, and Social Science.

It started with this completely reasonable message:

Professor,

I was unable to run your code here:

https://www.r-bloggers.com/downloading-option-chain-data-from-google-finance-in-r-an-update/

Besides a small typo [you have a 1 after names (options)], the code fails when you actually run the function. The error I get is a lexical error:

Error: lexical error: invalid character inside string.

{" ":{" ":2016," ":11," ":18},"

(right here) ------^

If you could help me understand where things went wrong, I can fix your code for you.

Thanks!

I didn’t remember any such code, but I followed the link, it was to a post from 2015 entitled “Downloading Option Chain Data from Google Finance in R: An Update,” and written by someone named Andrew, but not me.

So I replied:

Hi, that post was not by me!

A few hours later I got this reply in my inbox:

Yes it was. It says “by andrew.” Stop lying, professor.

Huh? Intonation is notoriously difficult to convey in typed speech. This was so over the top it must be someone goofing around. So I replied:

I guess you’re joking, right? I’m not the only person with that name.

And then he shoots back with this:

Unbelievable! Do you take me as a fool?

Ummmm . . . I better not touch that one!

**P.S.** I got one more email from this guy! He wrote:

Nothing to say? Fine, I will post a blog on this experience. The world will see how you were embarrassed to admit that your code had flaws!

Add another one to the list of professors with big egos and “no flaws.” I am now your 2nd biggest enemy. Your 1st is your own ego.

This was starting to get weird so I sent him one more email:

Hey, no kidding, it was a different Andrew. Follow the link and it goes here:

It’s by someone named Andrew Collier. I’m Andrew Gelman. Different people.

I hope this works. The internet is a scary place.

The post Sniffing tears perhaps not as effective as claimed appeared first on Statistical Modeling, Causal Inference, and Social Science.

Marcel van Assen has a story to share:

In 2011 a rather amazing article was published in Science where the authors claim that “We found that merely sniffing negative-emotion-related odorless tears obtained from women donors induced reductions in sexual appeal attributed by men to pictures of women’s faces.”

The article is this:

Gelstein, S., Yeshurun, Y., Rozenkrantz, L., Shushan, S., Frumin, I., Roth, Y., & Sobel, N. (2011). Human tears contain a chemosignal. Science, 331(6014), 226-230.

Ad Vingerhoets, an expert on crying, and a coworker Asmir Gračanin were amazed by this result and decided to replicate the study in several ways (my role in this paper was minor, i.e. doing and reporting some statistical analyses when the paper was already largely written). This resulted in:

Gračanin, A., van Assen, M. A., Omrčen, V., Koraj, I., & Vingerhoets, A. J. (2016). Chemosignalling effects of human tears revisited: Does exposure to female tears decrease males’ perception of female sexual attractiveness? Cognition and Emotion, 1-12.

The paper failed to replicate the findings in the original study.

Original findings that do not get replicated are nothing special, but unfortunately core business. What IS striking, however, is the response of Sobel to the article of Gračanin et al. (2016). See …

Sobel, N. (2016). Revisiting the revisit: added evidence for a social chemosignal in human emotional tears. Cognition and Emotion, 1-7.

Sobel re-analyzes the data of Gračanin et al., and after extensive fishing (with p-values just below .05) he concludes that the original study was right and the Gračanin et al. study bad. Irrespective of whether chemosignalling actually exists, Sobel’s response is imo a beautiful and honest defense, where p-hacking is explicitly acknowledged and its consequences not understood.

We also wrote a short response to Sobel’s comment, commenting on the p-hacking of Sobel.

Gračanin, A., Vingerhoets, A. J., & van Assen, M. A. (2016). Response to comment on “Chemosignalling effects of human tears revisited: Does exposure to female tears decrease males’ perception of female sexual attractiveness?”. Cognition and Emotion, 1-2.

To save time, if you’re interested, I recommend reading Sobel (2016) first.

I asked van Assen why he characterized Sobel’s horrible bit of p-hacking as “a beautiful and honest defense,” and he [van Assen] responded:

I think it is beautiful (in the sense that I like it) because it is honest. I also think it is a beautiful and excellent example of how one should NOT react to a failed replication, and of NOT understanding how p-hacking works.

This is about emotions; although I was involved in this project, I ENJOYED the comment of Sobel because of its tone and content, even though I did not agree with its content at all.

Our response to Sobel’s comment supports the fact that Sobel has been p-hacking. Vingerhoets asked BEFORE the replication if it mattered that Tilburg had no lab, and Sobel said ‘no’; AFTERWARDS, when the replication failed, he believed it IS a problem.

None of this is new, of course. By this time we should not be surprised that Science publishes a paper with no real scientific content. As we’ve discussed many times, newsworthiness rather than correctness is the key desideratum in publication in these so-called tabloid journals. The reviewers just assume the claims in submitted papers are correct and then move on to the more important (to them) problem of deciding whether the story is big and important enough for their major journal.

I agree with van Assen that this particular case is notable in that the author of the original study flat-out admits to p-hacking and still doesn’t care.

Gračanin et al. tell it well in their response:

Generally, a causal theory should state that “under conditions X, it holds that if A then B”. Relevant to our discussion in particular and evaluating results of replications in general are conditions X, which are called scope conditions. Suppose an original study concludes that “if A then B”, but fails to specify conditions X, while the hypothesis was tested under condition XO. The replication study subsequently tested under condition XR and concludes that “if A then B” does NOT hold. Leaving aside statistical errors, two different conclusions can be drawn. First, the theory holds in condition XO (and perhaps many other conditions) but not in condition XR. Second, the theory is not valid. We argue that the second explanation should be taken very seriously . . .

They continue:

What seems remarkable and inconsistent is that Sobel regards some of our as well as Oh, Kim, Park, and Cho’s (2012; Oh) findings as strong support for his theory, despite the fact that there was no sad context present in these studies. Apparently, in case of a failure to find corroborating results, the sad context is regarded crucial, but if some of our and Oh’s findings point in the same direction as his original findings, the lack of sad context and exact procedures are no longer important issues.

And this:

Sobel concludes that we did not dig very deep in our data to probe for a possible effect. That is true. We did not try to dig at all. Our aim was to test if human emotional tears act as a social chemosignal, using a different research methodology and with more statistical power than the original study; we were not on a fishing expedition.

I find the defensive reaction of Sobel to be understandable but disappointing. I’m just so so so tired of researchers who use inappropriate statistical methods and then can’t let go of their mistakes.

It makes me want to cry.

The post Josh Miller hot hand talks in NYC and Pittsburgh this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

Joshua Miller (the person who, with Adam Sanjurjo, discovered why the so-called “hot hand fallacy” is not really a fallacy) will be speaking on the topic this week.

In New York, Thurs 17 Nov, 12:30pm, 19 W 4th St, room 517, Center for Experimental Social Science seminar.

In Pittsburgh, Fri 18 Nov, 12pm, 4716 Posvar Hall, University of Pittsburgh, Experimental Economics seminar.

And here’s the latest version of their paper, Surprised by the Gambler’s and Hot Hand Fallacies? A Truth in the Law of Small Numbers.

Josh gave a talk on this here a few months ago, and it was great. Lots of data, theory, and discussion. So I recommend you check this out.

The post “Men with large testicles” appeared first on Statistical Modeling, Causal Inference, and Social Science.

We also commented on a paper in PNAS. The original paper was by Fanelli & Ioannidis on “US studies may overestimate effect sizes in softer research.”

Their analyses smelled, with a rather obscure “to the power .25” transformation of effect size. We re-analyzed their data in a more obvious way, and found nothing.

We were also amazed that the original paper’s authors waved away our re-analyses.

The most interesting PNAS article I [van Assen] read in recent years is the one with Rilling as one of the authors, arguing that men with bigger balls take less care of their children than men with smaller balls (yes, I mean testicles).

It had some p-values very close to .05. At that time I requested their data. I got the data, together with some requests not to share them. I did not follow up on this paper… Recently, I noticed Rilling is also the author of at least one of the neuropsychology papers with huge correlations between a behavioral measure and neural activity.

OK, and here’s the promised cat picture:

Cute, huh?

The post Kaggle Kernels appeared first on Statistical Modeling, Causal Inference, and Social Science.

In late August, Kaggle launched an open data platform where data scientists can share data sets. In the first few months, our members have shared over 300 data sets on topics ranging from election polls to EEG brainwave data. It’s only a few months old, but it’s already a rich repository for interesting data sets.

It’s also a nice place to share reproducible data science. We have built a tool called Kaggle Kernels, which allows data scientists and statisticians to share notebooks and scripts in Python or R on top of the data. If you find analysis you want to extend, you can “fork it” which gives you a reproducible version without going through the pain of replicating the author’s environment. It’s useful for learning new techniques (by being able to fork and play with other’s code), to share your side project with a large community and to draw attention to your research and store it in a way that can be easily reproduced.

He adds:

We don’t support Stan yet but we inevitably will.

Sooner rather than later, I hope!

**P.S.** Jamie Hall of Kaggle writes:

We’ve got RStan and PyStan ready to go in Kernels now. It would be fantastic to see some examples of the best ways to use them.

**P.P.S.** Aki has made a Kaggle notebook Bayesian Logistic Regression with rstanarm, and it works just fine.

The post Stan Webinar, Stan Classes, and StanCon appeared first on Statistical Modeling, Causal Inference, and Social Science.

We have a number of Stan-related events in the pipeline. On 22 Nov, Ben Goodrich and I will be holding a free webinar called Introduction to Bayesian Computation Using the rstanarm R Package.

Here is the abstract:

The goal of the rstanarm package is to make it easier to use Bayesian estimation for most common regression models via Stan while preserving the traditional syntax that is used for specifying models in R and R packages like lme4. In this webinar, Ben Goodrich, one of the developers of rstanarm, will introduce the most salient features of the package.

To demonstrate these features, we will fit a model to loan repayments data from Lending Club and show why, in order to make rational decisions for loan approval or interest rate determination, we need a full posterior distribution as opposed to point predictions available in non-Bayesian statistical software.

As part of the upcoming StanCon 2017, we will be teaching a number of classes on Bayesian inference and statistical modeling. Here is the lineup:

- Introduction to Bayesian Inference with Stan (2 days): 19 – 20 Jan 2017
- Stan for Finance and Econometrics (1 day): 20 Jan 2017
- Stan for Pharmacometrics (1 day): 20 Jan 2017
- Advanced Stan: Programming, Debugging, Optimizing (1 day): 20 Jan 2017

For Stan users and readers of this blog, please use the code “stanusers” to get a 10% discount.

We hope to see many of you online and in person.

The post Should scientists be allowed to continue to play in the sandbox after they’ve pooped in it? appeared first on Statistical Modeling, Causal Inference, and Social Science.

This picture is on topic (click to see the link), but I’d like it even if it weren’t! I think I’ll be illustrating all my posts for a while with adorable cat images. This is a lot more fun than those Buzzfeed-style headlines we were doing a few months ago. . . .

Anyway, back to the main topic. I got this email from Jonathan Kurtzman:

Thought you might be interested in this: researcher who had to retract 13 papers for image manipulation fraud, who is banned from funding in Germany where she worked, has received a grant from Cancer Research UK. CRUK relies on the school to review proposals, so …

The linked article, by Ivan Oransky and Adam Marcus, is entitled, “Do scientific fraudsters deserve a second chance?”

When put that way, sure, how can you deny someone a second chance? Indeed, I have worked with various people who have done scientific misconduct. No cases of image manipulation that I know of, but some plagiarism, some publishing of results known to be erroneous or meaningless, and some misleading presentation of research in which negative results were buried. These people have continued to get funding, and it’s not like I called up funding agencies to blow the whistle.

On the other hand, if I were on a panel considering funding, would I give any money to someone who’d cheated in this way? Nope. The cost-benefit just doesn’t work out. Even honest, well-intentioned researchers do sloppy work all the time. If you have someone who comes in with the willingness to cheat, the game’s over before it began.

I’m not saying these people should starve. I hope there’s a useful way for them to contribute to society. I just wouldn’t want them working on any scientific project that I’m involved with. Maybe they could help out in some other way, for example making sure the lab equipment is kept up to date, or moving furniture, or operating the copy machine, or, I dunno, there must be lots of ways they could help. Just keep them away from the data!

The post Goucher College is looking for a founding director of their Quantitative Reasoning Center appeared first on Statistical Modeling, Causal Inference, and Social Science.

We are currently searching for a founding Director of the Quantitative Reasoning Center. Goucher is a small liberal arts college and we are trying to make data analytics and quant reasoning a larger part of our core curriculum.

The academic discipline of the center is open. While it’s advertised as mathematics education or a closely related quantitative discipline, they would certainly consider a candidate with strong quant skills in the social sciences. It would be a good opportunity for someone who wants to build a program and would enjoy working at a liberal arts college.

This looks great. I’m always running into talented Ph.D. students and recent graduates in statistics or quantitative social science who really want to teach, but they find themselves struggling to get jobs at universities that don’t really value their teaching. This seems like a great opportunity for such people. It’s only one job so it won’t do much by itself to solve the problem, but there’s no reason other colleges can’t create such positions too. We started Quantitative Methods in the Social Sciences at Columbia nearly twenty years ago and it’s still going strong.

When people email me job ads, I don’t usually post them on the blog, but I’m sharing this one because I’m hoping it will be part of a trend of such positions.

The post Stan Case Studies: A good way to jump in to the language appeared first on Statistical Modeling, Causal Inference, and Social Science.

Here’s a way to jump in: Stan Case Studies. Find one you like and try it out.

P.S. I blogged this last month but it’s so great I’m blogging it again. For this post, the target audience is not already-users of Stan but new users.

The post More on my paper with John Carlin on Type M and Type S errors appeared first on Statistical Modeling, Causal Inference, and Social Science.

Deborah Mayo asked me some questions about that paper (“Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors”), and here’s how I responded:

I am not happy with the concepts of “power,” “type 1 error,” and “type 2 error,” because all these are defined in terms of statistical significance, which I am more and more convinced is a bad way of trying to summarize data. The concepts of Type S and Type M errors are not perfect but I think they are a step forward.

Now, one odd thing about my paper with Carlin is that it gives some tools that I recommend others use when designing and evaluating their research, but I would not typically use these tools directly myself! Because I am not wanting to summarize inference by statistical significance.

But let’s set that aside, recognizing that my paper with Carlin is intended to improve current practice which remains focused on statistical significance.

One key point of our paper is that “design analysis” considered broadly (that is, calculations or estimations of the frequency properties of statistical methods) can be useful and relevant, even _after_ the data have been gathered. This runs against usual expert advice from top statisticians. The problem is that there’s a long ugly history of researchers doing crappy “post hoc power analysis” where they perform a power calculation, using the point estimate from the data as their assumed true parameter value. This procedure can be very misleading, either by getting researchers off the hook (“sure, I didn’t get statistical significance, but that’s because I had low power”) or by encouraging overconfidence. So there’s lots of good advice in the stat literature, telling people not to do those post-hoc power analyses. What Carlin and I recommend is different in that we recommend using real prior information to posit the true parameter values.

The other key point of our paper is the statistical significance filter, which we rename as the exaggeration factor. The exaggeration factor is always greater than 1, but it can be huge if the signal is much smaller than noise.

Finally, this all fits in with the garden of forking paths. If researchers were doing preregistered experiments, then in crappy “power = .06” studies, they’d only get statistical significance 6% of the time. And, sure, those 6% of cases would be disasters, but at least in the other 94% of cases, researchers would give up. But with the garden of forking paths, researchers can just about always get statistical significance, hence they come face to face with the problems that Carlin and I discuss in our paper.
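To make the "power = .06" scenario concrete, here's a quick Monte Carlo sketch in Python. This is not the retrodesign calculation from the paper; the true effect and standard error below are hypothetical numbers, chosen only so that power comes out around 6%:

```python
import random
import statistics

def retro_sim(true_effect, se, n_sims=100_000, seed=1):
    """Replicate a study whose estimate is Normal(true_effect, se),
    keep the 'statistically significant' replications (|estimate| > 1.96*se),
    and summarize what those significant estimates look like."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    signif = [est for est in estimates if abs(est) > 1.96 * se]
    power = len(signif) / n_sims
    exaggeration = statistics.mean(abs(est) for est in signif) / true_effect
    type_s = sum(est < 0 for est in signif) / len(signif)
    return power, exaggeration, type_s

# A noisy study: true effect of 0.1 on a scale where the standard error is 0.37.
power, exaggeration, type_s = retro_sim(0.1, 0.37)
```

With these numbers, power is about 6%, the significant estimates overstate the true effect by a factor of around 8 or 9, and roughly a fifth of them have the wrong sign. That is the sense in which the 6% of "successes" are disasters.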

I hope this background is helpful. Published papers get revised so many times that their original motivation can become obscured.

**P.S.** Here’s the first version of that paper. It’s from May, 2011. I didn’t realize I’d been thinking about this for such a long time!

P.P.S. In comments, Art Owen points to a recent paper of his, “Confidence intervals with control of the sign error in low power settings,” following up on some of the above ideas.

The post More on my paper with John Carlin on Type M and Type S errors appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post En bra statistiklektion appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>According to the translation, the show also includes “how paralyzed monkeys can walk again with the help of wireless burglar connection of neural pathways, how overfed planted salmon survive poorly in the wild, and how much of society’s most important job that can be done from home when snow chaos so requires.”

The post En bra statistiklektion appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post The role of models and empirical work in political science appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I recently posted a review of A Model Discipline, by Clarke and Primo, on Amazon.com. My review is entitled “Why Physics Envy will Persist,” at

http://www.amazon.com/gp/review/R3I8GC5V1ZSYVI/ref=cm_cr_pr_rvw_ttl?ASIN=019538220X

As you likely know, they are critical of the widespread belief among political scientists in the hypothetico-deductive method. As part of my review of the book, I offer an explanation as to why, three years later, H-D hegemony continues as strong as ever.

I responded that I actually received a copy of that book in the mail awhile ago but hadn’t actually looked at it. I indeed am a bit bothered by the whole “Empirical Implications of Theoretical Models” thing. I see lots and lots of political science projects which are set up as a set of research hypotheses that are then to be empirically tested, and the whole thing often seems bogus to me.

I’m more and more becoming convinced of Dan Kahan’s idea that the paradigmatic task of empirical science is not the testing of hypotheses but the gathering of data in order to *distinguish* between competing models of the world.

The post The role of models and empirical work in political science appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Election surprise, and Three ways of thinking about probability appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Background:** Hillary Clinton was given a 65% or 80% or 90% chance of winning the electoral college. She lost.

**Naive view:** The poll-based models and the prediction markets said Clinton would win, and she lost. The models are wrong!

**Slightly sophisticated view:** The predictions were probabilistic. 1-in-3 events happen a third of the time. 1-in-10 events happen a tenth of the time. Polls have nonsampling error. We know this, and the more thoughtful of the poll aggregators included this in their model, which is why they were giving probabilities in the range 65% to 90%, not, say, 98% or 99%.

**More sophisticated view:** Yes, the probability statements are not invalidated by the occurrence of a low-probability event. But we can learn from these low-probability outcomes. In the polling example, yes an error of 2% is within what one might expect from nonsampling error in national poll aggregates, but the point is that nonsampling error has a reason: it’s not just random. In this case it seems to have arisen from a combination of differential nonresponse, unexpected changes in turnout, and some sloppy modeling choices. It makes sense to try to understand this, not to just say that random things happen and leave it at that.

This also came up in our discussions of the betting markets’ failures on Trump in the Republican primaries, Leicester City, and Brexit. Dan Goldstein correctly wrote that “Prediction markets have to occasionally ‘get it wrong’ to be calibrated,” but, once we recognize this, we should also, if possible, do what the plane-crash investigators do: open up the “black box” and try to figure out what went wrong that could’ve been anticipated.
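Goldstein's calibration point is easy to see in a simulation sketch (Python, purely illustrative): a forecaster who repeatedly announces a 75% probability, and is perfectly calibrated, must still be "wrong" about a quarter of the time.

```python
import random

# A perfectly calibrated forecaster announces 75% many times;
# the predicted event should still fail about 25% of the time.
rng = random.Random(0)
n = 100_000
outcomes = [rng.random() < 0.75 for _ in range(n)]
miss_rate = 1 - sum(outcomes) / n  # close to 0.25
```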

Hindsight gets a bad name but we can learn from our failures and even from our successes—if we look with a critical eye and get inside the details of our forecasts rather than just staring at probabilities.

The post Election surprise, and Three ways of thinking about probability appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post David Rothschild and Sharad Goel called it (probabilistically speaking) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>David Rothschild and Sharad Goel write:

In a new paper with Andrew Gelman and Houshmand Shirani-Mehr, we examined 4,221 late-campaign polls — every public poll we could find — for 608 state-level presidential, Senate and governor’s races between 1998 and 2014. Comparing those polls’ results with actual electoral results, we find the historical margin of error is plus or minus six to seven percentage points. . . .

Systematic errors imply that these problems persist, to a lesser extent, in poll averaging, as shown in the above graph.

David and Sharad conclude:

This November, we would not be at all surprised to see Mrs. Clinton or Mr. Trump beat the state-by-state polling averages by about two percentage points. We just don’t know which one would do it.

Yup. 2 percentage points. They wrote this on October 6, 2016.

The post David Rothschild and Sharad Goel called it (probabilistically speaking) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Can a census-tract-level regression analysis untangle correlation between lead and crime? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Daniel Hawkins pointed me to a post by Kevin Drum entitled, “Crime in St. Louis: It’s Lead, Baby, Lead,” and the associated research article by Brian Boutwell, Erik Nelson, Brett Emo, Michael Vaughn, Mario Schootman, Richard Rosenfeld, Roger Lewis, “The intersection of aggregate-level lead exposure and crime.”

The short story is that the areas of St. Louis with more crime and poverty had higher lead levels (as measured from kids in the city who were tested for lead in their blood).

Here’s their summary:

I had a bit of a skeptical reaction—not about the effects of lead, I have no idea about that—but on the statistics. Looking at those maps above, the total number of data points is not large, and those two predictors are so highly correlated, I’m surprised that they’re finding what seem to be such unambiguous effects. In the abstract it says n=459,645 blood measurements and n=490,433 crimes, but for the purpose of the regression, n is the number of census tracts in their dataset, about 100.

So I contacted the authors of the paper and one of them, Erik Nelson, did some analyses for me.

First he ran the basic regression—no Poisson, no spatial tricks, just regression of log crime rate on lead exposure and index of social/economic disadvantage. Data are at the census tract level, and lead exposure is the proportion of kids’ lead measurements from that census tract that were over some threshold. I think I’d prefer a continuous measure but that will do for now.

In this simple regression, the coefficient for lead exposure was large and statistically significant.
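I don't have their census-tract data, but the structure of the problem is easy to sketch on simulated data (Python here; the numbers below are invented, chosen only to mimic two highly correlated predictors and about 100 tracts):

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented stand-in for ~100 census tracts: a lead-exposure measure and a
# disadvantage index that are strongly correlated, plus a log crime rate
# that depends on both.
n_tracts = 100
disadvantage = rng.normal(0, 1, n_tracts)
lead = 0.8 * disadvantage + rng.normal(0, 0.6, n_tracts)
corr = np.corrcoef(lead, disadvantage)[0, 1]  # around 0.8
log_crime = 1.0 + 0.5 * lead + 0.7 * disadvantage + rng.normal(0, 0.5, n_tracts)

# Ordinary least squares of log crime rate on both predictors
X = np.column_stack([np.ones(n_tracts), lead, disadvantage])
coef, *_ = np.linalg.lstsq(X, log_crime, rcond=None)
resid = log_crime - X @ coef
sigma2 = resid @ resid / (n_tracts - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
```

The point of the exercise: with predictors this collinear, the standard errors on the individual coefficients are inflated relative to what either predictor would get alone, which is why clean separation of lead from disadvantage at n of about 100 would be surprising.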

Then I asked for a scatterplot: log crime rate vs. lead exposure, indicating census tracts with three colors tied to the terciles of disadvantage.

And here it is:

He also fit a separate regression line for each tercile of disadvantage.

As you can see, the relation between lead and crime is strong, especially for census tracts with less disadvantage.

Erik also made separate plots for violent and non-violent crimes. They look pretty similar:

In summary: the data are what they are. The correlation seems real, not just an artifact of a particular regression specification. It’s all observational so we shouldn’t overinterpret it, but the pattern seems worth sharing.

The post Can a census-tract-level regression analysis untangle correlation between lead and crime? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post How effective (or counterproductive) is universal child care? Part 2 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Yesterday we discussed the difficulties of learning from a small, noisy experiment, in the context of a longitudinal study conducted in Jamaica where researchers reported that an early-childhood intervention program caused a 42%, or 25%, gain in later earnings. I expressed skepticism.

Today I want to talk about a paper making an opposite claim: “Canada’s universal childcare hurt children and families.”

I’m skeptical of this one too.

Here’s the background. I happened to mention the problems with the Jamaica study in a talk I gave recently at Google, and afterward Hal Varian pointed me to this summary by Les Picker of a recent research article:

In Universal Childcare, Maternal Labor Supply, and Family Well-Being (NBER Working Paper No. 11832), authors Michael Baker, Jonathan Gruber, and Kevin Milligan measure the implications of universal childcare by studying the effects of the Quebec Family Policy. Beginning in 1997, the Canadian province of Quebec extended full-time kindergarten to all 5-year olds and included the provision of childcare at an out-of-pocket price of $5 per day to all 4-year olds. This $5 per day policy was extended to all 3-year olds in 1998, all 2-year olds in 1999, and finally to all children younger than 2 years old in 2000.

(Nearly) free child care: that’s a big deal. And the gradual rollout gives researchers a chance to estimate the effects of the program by comparing, at each age, children who were and were not eligible for it.

The summary continues:

The authors first find that there was an enormous rise in childcare use in response to these subsidies: childcare use rose by one-third over just a few years. About a third of this shift appears to arise from women who previously had informal arrangements moving into the formal (subsidized) sector, and there were also equally large shifts from family and friend-based child care to paid care. Correspondingly, there was a large rise in the labor supply of married women when this program was introduced.

That makes sense. As usual, we expect elasticities to be between 0 and 1.

But what about the kids?

Disturbingly, the authors report that children’s outcomes have worsened since the program was introduced along a variety of behavioral and health dimensions. The NLSCY contains a host of measures of child well being developed by social scientists, ranging from aggression and hyperactivity, to motor-social skills, to illness. Along virtually every one of these dimensions, children in Quebec see their outcomes deteriorate relative to children in the rest of the nation over this time period.

More specifically:

Their results imply that this policy resulted in a rise of anxiety of children exposed to this new program of between 60 percent and 150 percent, and a decline in motor/social skills of between 8 percent and 20 percent. These findings represent a sharp break from previous trends in Quebec and the rest of the nation, and there are no such effects found for older children who were not subject to this policy change.

Also:

The authors also find that families became more strained with the introduction of the program, as manifested in more hostile, less consistent parenting, worse adult mental health, and lower relationship satisfaction for mothers.

I just find all this hard to believe. A doubling of anxiety? A decline in motor/social skills? Are these day care centers really *that* horrible? I guess it’s possible that the kids are ruining their health by giving each other colds (“There is a significant negative effect on the odds of being in excellent health of 5.3 percentage points.”)—but of course I’ve also heard the opposite, that it’s better to give your immune system a workout than to be preserved in a bubble. They also report “a policy effect on the treated of 155.8% to 394.6%” in the rate of nose/throat infection.

OK, here’s the research article.

The authors seem to be considering three situations: “childcare,” “informal childcare,” and “no childcare.” But I don’t understand how these are defined. Every child is cared for in some way, right? It’s not like the kid’s just sitting out on the street. So I’d assume that “no childcare” is actually informal childcare: mostly care by mom, dad, sibs, grandparents, etc. But then what do they mean by the category “informal childcare”? If parents are trading off taking care of the kid, does this count as informal childcare or no childcare? I find it hard to follow exactly what is going on in the paper, starting with the descriptive statistics, because I’m not quite sure what they’re talking about.

I think what’s needed here is some more comprehensive organization of the results. For example, consider this paragraph:

The results for 6-11 year olds, who were less affected by this policy change (but not unaffected due to the subsidization of after-school care) are in the third column of Table 4. They are largely consistent with a causal interpretation of the estimates. For three of the six measures for which data on 6-11 year olds is available (hyperactivity, aggressiveness and injury) the estimates are wrong-signed, and the estimate for injuries is statistically significant. For excellent health, there is also a negative effect on 6-11 year olds, but it is much smaller than the effect on 0-4 year olds. For anxiety, however, there is a significant and large effect on 6-11 year olds which is of similar magnitude as the result for 0-4 year olds.

The first sentence of the above excerpt has a cover-all-bases kind of feeling: if results are similar for 6-11 year olds as for 2-4 year olds, you can go with “but not unaffected”; if they differ, you can go with “less affected.” Various things are pulled out based on whether they are statistically significant, and they never return to the result for anxiety, which would seem to contradict their story. Instead they write, “the lack of consistent findings for 6-11 year olds confirm that this is a causal impact of the policy change.” “Confirm” seems a bit strong to me.

The authors also suggest:

For example, higher exposure to childcare could lead to increased reports of bad outcomes with no real underlying deterioration in child behaviour, if childcare providers identify negative behaviours not noticed (or previously acknowledged) by parents.

This seems like a reasonable guess to me! But the authors immediately dismiss this idea:

While we can’t rule out these alternatives, they seem unlikely given the consistency of our findings both across a broad spectrum of indices, and across the categories that make up each index (as shown in Appendix C). In particular, these alternatives would not suggest such strong findings for health-based measures, or for the more objective evaluations that underlie the motor-social skills index (such as counting to ten, or speaking a sentence of three words or more).

Health, sure: as noted above, I can well believe that these kids are catching colds from each other.

But what about that motor-skills index? Here are their results from the appendix:

I’m not quite sure whether + or – is desirable here, but I do notice that the coefficients for “can count out loud to 10” and “spoken a sentence of 3 words or more” (the two examples cited in the paragraph above) go in opposite directions. That’s fine—the data are the data—but it doesn’t quite fit their story of consistency.

More generally, the data are addressed in a scattershot manner. For example:

We have estimated our models separately for those with and without siblings, finding no consistent evidence of a stronger effect on one group or another. While not ruling out the socialization story, this finding is not consistent with it.

This appears to be the classic error of interpretation of a non-rejection of a null hypothesis.
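The underlying fallacy is the familiar one: the difference between "significant" and "not significant" is not itself statistically significant. A hypothetical two-group comparison (my numbers, not theirs) makes the point:

```python
import math

# Hypothetical: group A's effect estimate is 0.25 (se 0.10), "significant";
# group B's is 0.15 (se 0.10), "not significant". But the between-group
# difference is only 0.10, with standard error sqrt(0.10^2 + 0.10^2).
diff = 0.25 - 0.15
se_diff = math.sqrt(0.10 ** 2 + 0.10 ** 2)
z = diff / se_diff  # about 0.71, nowhere near significant
```

So "no consistent evidence of a stronger effect on one group or another" is very weak evidence against the socialization story.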

And here’s their table of key results:

As quantitative social scientists we need to think harder about how to summarize complicated data with multiple outcomes and many different comparisons.

I see the current standard ways to summarize this sort of data are:

(a) Focus on a particular outcome and a particular comparison (choosing these ideally, though not usually, using preregistration), present that as the main finding and then tag all else as speculation.

Or, (b) Construct a story that seems consistent with the general pattern in the data, and then extract statistically significant or nonsignificant comparisons to support your case.

Plan (b) is what was done here, and I think it has problems: lots of stories can fit the data, and there’s a real push toward sweeping any anomalies aside.

For example, how *do* you think about that coefficient of 0.308 with standard error 0.080 for anxiety among the 6-11-year-olds? You can say it’s just bad luck with the data, or that the standard error calculation is only approximate and the real standard error should be higher, or that it’s some real effect caused by what was happening in Quebec in these years—but the trouble is that any of these explanations could be used just as well to explain the 0.234 with standard error 0.068 for 2-4-year-olds, which directly maps to one of their main findings.

Once you start explaining away anomalies, there’s just a huge selection effect in which data patterns you choose to take at face value and which you try to dismiss.

So maybe approach (a) is better—just pick one major outcome and go with it? But then you’re throwing away lots of data, that can’t be right.

I am unconvinced by the claims of Baker et al., but it’s not like I’m saying their paper is terrible. They have an identification strategy, and clean data, and some reasonable hypotheses. I just think their statistical analysis approach is not working. One trouble is that statistics textbooks tend to focus on stand-alone analyses—getting the p-value right, or getting the posterior distribution, or whatever, and not on how these conclusions fit into the big picture. And of course there’s lots of talk about exploratory data analysis, and that’s great, but EDA is typically not plugged into issues of modeling, data collection, and inference.

**What to do?**

OK, then. Let’s forget about the strengths and the weaknesses of the Baker et al. paper and instead ask, how *should* one evaluate a program like Quebec’s nearly-free preschool? I’m not sure. I’d start from the perspective of trying to learn what we can from what might well be ambiguous evidence, rather than trying to make a case in one direction or another. And lots of graphs, which would let us see more in one place; that’s much better than tables and asterisks. But, exactly what to do, I’m not sure. I don’t know whether the policy analysis literature features any good examples of this sort of exploration. I’d like to see something, for this particular example and more generally as a template for program evaluation.

The post How effective (or counterproductive) is universal child care? Part 2 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Explanations for that shocking 2% shift appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The title of this post says it all. A 2% shift in public opinion is not so large and usually would not be considered shocking. In this case the race was close enough that 2% was consequential.

Here’s the background:

Four years ago, Mitt Romney received 48% of the two-party vote and lost the presidential election. This year, polls showed Donald Trump with about 48% of the two-party vote. When the election came around, Trump ended up with nearly 50% of the two-party vote—according to the latest count, he lost the popular vote to Hillary Clinton by only 200,000 votes. Because of the way the votes were distributed in states, Trump won the electoral college and thus the presidency.

In this earlier post I graphed the Romney-Trump swing by state and also made this plot showing where Trump did better or worse than the polls in 2016:

Trump outperformed the polls in several key swing states and also in lots of states that were already solidly Republican.

**The quantitative pundits**

Various online poll aggregators were giving pre-election probabilities ranging from 66% to 99%. These probabilities were high because Clinton had been leading in the polls for months; the probabilities were not 100% because it was recognized that the final polls might be off by quite a bit from the actual election outcome. Small differences in how the polls were averaged corresponded to large apparent differences in win probabilities; hence we argued that the forecasts that were appearing were not as different as they seemed, based on those reported odds.

The final summary is that the polls were off by about 2% (or maybe 3%, depending on which poll averaging you’re using), which, again, is a real error of moderate size that happened to be highly consequential given the distribution of the votes in the states this year. Also we ignored correlations in some of our data, thus producing illusory precision in our inferences based on polls, early voting results, etc.
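To see why a 2% polling error moves win probabilities so much, treat the leader's margin as normal with some total (sampling plus nonsampling) error. The numbers here are illustrative, not from any particular aggregator:

```python
import math

def win_prob(lead_pct, error_sd_pct):
    """Probability the leader's margin stays positive, treating the final
    polling error as Normal(0, error_sd_pct); normal CDF via erf."""
    return 0.5 * (1 + math.erf(lead_pct / (error_sd_pct * math.sqrt(2))))

# A 2-point lead looks very different depending on how much total error you allow:
p_small_error = win_prob(2.0, 1.0)  # about 0.98
p_large_error = win_prob(2.0, 3.0)  # about 0.75

# And a small difference in the poll average shifts the probability a lot:
p_three_point_lead = win_prob(3.0, 3.0)  # about 0.84
```

This is the sense in which aggregators who disagreed by a point or two in their poll averages could report very different-looking odds.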

**What happened?**

Several explanations have been offered. It’s hard at this point to adjudicate among them, but I’ll share what thoughts I have:

– Differential voter turnout. It says here that voter turnout was up nearly 5% from the previous election. Voter turnout could have been particularly high among Trump-supporting demographic and geographic groups. Or maybe not! It says here that voter turnout was *down* this year. Either way, the story would be that turnout was high for Republicans relative to Democrats, compared to Obama’s elections.

– Last-minute change in vote. If you believe the exit poll, Trump led roughly 48%-41% among the 14% of voters who say they decided in the past week. That corresponds to a bump in Trump’s 2-party vote percentage of approximately .14*(.48-.41)/2 = 0.005, or half a percentage point. That’s something.

– Differential nonresponse. During the campaign we talked about the idea that swings in the polls mostly didn’t correspond to real changes in vote preferences but rather came from changes in nonresponse patterns: when there was good news for Trump, his supporters responded more to polls, and when there was good news for Clinton, her supporters were more likely to respond. But this left the question of where was the zero point.

When we analyzed a Florida poll last month, adjusting for party registration, we gave Trump +1 in the state, while the estimates from the others ranged from Clinton +1 to Clinton +4. That gives you most of the shift right there. This was just one poll so I didn’t take it too seriously at the time but maybe I should’ve.

– Trump supporters not admitting their support to pollsters. It’s possible, but I’m skeptical of this mattering too much, given that Trump outperformed the polls the most in states such as North Dakota and West Virginia where I assume respondents would’ve had little embarrassment in declaring their support for him, while he did no better than the polls’ predictions in solidly Democratic states. Also, Republican candidates outperformed expectations in the Senate races, which casts doubt on the model in which respondents would not admit they supported Trump; rather, the Senate results are consistent with differential nonresponse or unexpected turnout or opposition to Hillary Clinton.

– Third-party collapse. Final polls had Johnson at 5% of the vote. He actually got 3%, and it’s a reasonable guess that most of this 2% went to Trump.

– People dissuaded from voting because of long lines or various measures making it more difficult to vote. I have no idea how big or small this one is. This must matter a lot more in some states than in others.
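As a quick check of the late-decider arithmetic above (exit-poll numbers as quoted, taken at face value):

```python
# 14% of voters deciding in the final week, breaking 48-41 for Trump,
# shifts his two-party share by about half a percentage point.
late_share = 0.14
bump = late_share * (0.48 - 0.41) / 2  # about 0.005
```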

I’m sure there are some other things I missed. Let me just emphasize that the goal in this exercise is to understand the different factors that were going on, not to identify one thing or another that could’ve flipped the election outcome. The election was so close that any number of things could’ve swung enough votes for that.

**P.S.** Two other parts of the story:

– Voter enthusiasm. The claim has been made that Trump’s supporters had more enthusiasm for their candidate. They were part of a movement (as with Obama 2008) in a way that was less so for Clinton’s supporters. That enthusiasm could transfer to unexpectedly high voter turnout, with the twist that this would be hard to capture in pre-election surveys if Trump’s supporters were, at the same time, less likely to respond to pollsters.

– The “ground game” and social media. One reason the election outcome came as a surprise is that we kept hearing stories about Hillary Clinton’s professional campaign and big get-out-the-vote operation, as compared to Donald Trump’s campaign which seemed focused on talk show appearances and twitter. But maybe the Trump campaign’s social media efforts were underestimated.

**P.P.S.** One more thing: I think one reason for the shock is that people are reacting not just to the conditional probability, Pr (Trump wins | Trump reaches Election Day with 48% of two-party support in the polls), but to the unconditional probability, Pr (Trump becomes president of the United States | our state of knowledge two years ago). That unconditional probability is very low. And I think a lot of the stunned reaction is, in part, that things got this far.

To use a poker analogy: if you’re drawing to an inside straight on the river, the odds are (typically) against you. But the real question is how you got to the final table of the WSOP in the first place.

The post Explanations for that shocking 2% shift appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post A 2% swing: The poll-based forecast did fine (on average) in blue states; they blew it in the red states appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>First let’s compare the 2016 election to 2012, state by state:

Now let’s look at how the 2016 election turned out, compared to the polls:

This is interesting. In the blue states (those won by Obama in 2012), Trump did about as well as predicted from the polls (on average, but not in the key swing states of Pennsylvania and Florida). But in the red states, Trump did much better than predicted.

**P.S.** More here.

The post A 2% swing: The poll-based forecast did fine (on average) in blue states; they blew it in the red states appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post How effective (or counterproductive) is universal child care? Part 1 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>We’ve talked before about various empirically-based claims of the effectiveness of early childhood intervention. In a much-publicized 2013 paper based on a study of 130 four-year-old children in Jamaica, Paul Gertler et al. claimed that a particular program caused a 42% increase in the participants’ earnings as young adults. (It was a longitudinal study, and these particular kids were followed up for 20 years.) At the time I expressed skepticism based on the usual reasons of the statistical significance filter, researcher degrees of freedom, and selection problems with the data.

A year later, Gertler et al. released an updated version of their paper, this time with the estimate downgraded to 25%. I never quite figured out how this happened, but I have to admit to being skeptical of the 25% number too.

One problem is that a lot of this research seems to be presented in propaganda form. For example:

From the published article: “A substantial literature shows that U.S. early childhood interventions have important long-term economic benefits.”

From the press release: “Results from the Jamaica study show substantially greater effects on earnings than similar programs in wealthier countries. Gertler said this suggests that early childhood interventions can create a substantial impact on a child’s future economic success in poor countries.”

These two quotes, taken together, imply that (a) these interventions have large and well-documented effects in the U.S., but (b) these effects are not as large as the 25% reported for the Jamaica study.

But how does that work? How large, exactly, were the “important long-term economic benefits”? An increase of 10% in earnings, perhaps? 15%? If so, then do they really have evidence that the Jamaica program had effects that were not only clearly greater from zero, but clearly greater than 10% or 15%?

I doubt it.

Rather, I suspect they’re trying to have it both ways, to simultaneously claim that their results are consistent with the literature and that they’re new and exciting.

I’m perfectly willing to believe that early childhood intervention can have large and beneficial effects, and that these effects could be even larger in Jamaica than in the United States. What I’m not convinced of is that this particular study offers the evidence that is claimed. I’m worried that the researchers are chasing noise. That is, it’s not clear to me how much they learned from this new experiment, beyond what they already knew (or thought they knew) from the literature.

*This was the first of a series of two posts. Tune in tomorrow for part 2.*

The post How effective (or counterproductive) is universal child care? Part 1 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Election forecasting updating error: We ignored correlations in some of our data, thus producing illusory precision in our inferences appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>In retrospect, a key mistake in the forecast updating that Kremp and I did, was that we ignored the correlation in the partial information from early-voting tallies. Our model had correlations between state-level forecasting errors (but maybe the correlations we used were still too low, hence giving us illusory precision in our national estimates), but we did not include any correlations at all in the errors from the early-voting estimates. That’s why our probability forecasts were, wrongly, so close to 100% (as here).
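The effect of ignored correlations is easy to quantify in a stylized way (Python sketch; sigma, k, and rho below are made-up values, not the ones from our model). For the average of k errors with common standard deviation sigma and pairwise correlation rho, the variance is (sigma^2/k)(1 + (k-1)rho):

```python
import math

def se_of_mean(sigma, k, rho):
    """Standard error of the average of k equally correlated errors."""
    return math.sqrt((sigma ** 2 / k) * (1 + (k - 1) * rho))

# Treating 50 state-level errors as independent vs. moderately correlated:
se_independent = se_of_mean(2.0, 50, 0.0)  # about 0.28
se_correlated = se_of_mean(2.0, 50, 0.5)   # about 1.43
```

In this stylized example, ignoring the correlation makes the national estimate look five times more precise than it really is, which is exactly how probability forecasts drift toward 100%.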

The post Election forecasting updating error: We ignored correlations in some of our data, thus producing illusory precision in our inferences appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post What if NC is a tie and FL is a close win for Clinton? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Let’s run the numbers, Kremp:

```r
> update_prob2(clinton_normal=list("NC"=c(50,2), "FL"=c(52,2)))
Pr(Clinton wins the electoral college) = 95%
```

That’s good news for Clinton.

What if both states are tied?

```r
> update_prob2(clinton_normal=list("NC"=c(50,2), "FL"=c(50,2)))
Pr(Clinton wins the electoral college) = 90%
```

**P.S.** To be complete I should include all the states that were already called (KY, MA, etc.) but this would add essentially no information so I won’t bother.

OK, Ok, just to illustrate:

```r
> update_prob2(trump_states=c("KY","IN"), clinton_states=c("IL","MA"), clinton_normal=list("NC"=c(50,2), "FL"=c(50,2)))
Pr(Clinton wins the electoral college) = 90%
```

You see, no change.

**P.P.S.** What if Florida is close but Clinton loses there?

```r
> update_prob2(trump_states=c("FL"), clinton_normal=list("NC"=c(50,2), "FL"=c(50,1)))
Pr(Clinton wins the electoral college) = 75%
[nsim = 37716; se = 0.2%]
```

Her chance goes down to 75%. Still better than Trump’s 25%.

**P.P.P.S.** And what if NC and FL are both close but Trump wins both?

```r
> update_prob2(trump_states=c("NC","FL"), clinton_normal=list("NC"=c(50,1), "FL"=c(50,1)))
Pr(Clinton wins the electoral college) = 65%
```

The post Election updating software update appeared first on Statistical Modeling, Causal Inference, and Social Science.

Kremp updated the program with an option to specify the winner in each state, and also to give an estimate and standard deviation when you have some idea of the vote share.

Here’s an example, based on some out-of-date (from a few hours ago) estimates of Clinton getting 51.5% of the vote in Colorado, 51.5% in Florida, 52.7% in Iowa, 50.7% in Nevada, 52.2% in Ohio, 46.2% in Pennsylvania, and 56.7% in Wisconsin, with standard deviations of 2% in each case:

```r
> update_prob2(clinton_normal = list("CO" = c(51.5, 2), "FL" = c(51.5, 2), "IA" = c(52.7, 2), "NV" = c(50.7, 2), "OH" = c(52.2, 2), "PA" = c(46.2, 2), "WI" = c(56.7, 2)))
[nsim = 100000; se = 0%]
```

Again, I don’t particularly trust those numbers. But, again, you can now play along and throw in as many states as you want in this way without worrying about the simulations crashing.
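I don’t know exactly how update_prob2 conditions on these estimate-and-standard-deviation inputs, but one standard way to do this kind of updating (a hypothetical Python sketch with made-up numbers, not Kremp’s actual R code) is importance weighting: take the pre-election simulation draws and reweight each one by the normal likelihood of the reported vote-share estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake pre-election simulations of Clinton's two-party share (%) in two states
n_sim = 50_000
prior_mean = np.array([51.0, 49.5])       # hypothetical forecast means
cov = np.array([[9.0, 6.3], [6.3, 9.0]])  # sd 3 points, correlation 0.7
sims = rng.multivariate_normal(prior_mean, cov, size=n_sim)

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Reweight each draw by the likelihood of the reported estimates:
# state 1 looks like 50 +/- 2, state 2 looks like 52 +/- 2
w = normal_pdf(sims[:, 0], 50, 2) * normal_pdf(sims[:, 1], 52, 2)
w /= w.sum()

p_state2 = w[sims[:, 1] > 50].sum()  # reweighted win probability in state 2
print(round(p_state2, 2))
```

Because no draws are discarded, reweighting is more stable than rejection sampling when you condition on many states at once, which would be consistent with the simulations no longer crashing.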

**P.S.** Kremp updated again. Go to his site, refresh it, download the new files on Github, and do some R and Stan!

The post Now that 7pm has come, what do we know? appeared first on Statistical Modeling, Causal Inference, and Social Science.

On TV they said that Trump won Kentucky and Indiana (no surprise), Clinton won Vermont (really no surprise), but South Carolina, Georgia, and Virginia were too close to call.

I’ll run Pierre-Antoine Kremp’s program conditioning on this information, coding states that are “too close to call” as being somewhere between 45% and 55% of the two-party vote for each candidate:

```r
> update_prob(trump_states = c("KY","IN"), clinton_states = c("VT"), clinton_scores_list=list("SC"=c(45,55), "GA"=c(45,55), "VA"=c(45,55)))
Pr(Clinton wins the electoral college) = 95%
[nsim = 65433; se = 0.1%]
```

Just a rough guess, still; obviously this all depends on the polls-based model, which was giving Clinton a 90% chance of winning before any votes were counted.
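Conditioning on interval information like “too close to call” is simple to mimic with rejection sampling (a rough Python sketch of the general idea, with made-up forecast numbers, not Kremp’s code): keep only the forward-simulation draws consistent with the intervals. The nsim in the output, smaller than the full 100,000, is presumably the count of draws that survive the conditioning.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fake forward simulations of Clinton's two-party share (%) in three
# "too close to call" states (made-up means, not the model's forecasts)
n_sim = 100_000
mean = np.array([45.0, 47.5, 51.0])
sd, rho = 3.0, 0.7
cov = sd**2 * (rho * np.ones((3, 3)) + (1 - rho) * np.eye(3))
sims = rng.multivariate_normal(mean, cov, size=n_sim)

# Keep only the draws where every such state lands "somewhere between 45 and 55"
keep = np.all((sims >= 45) & (sims <= 55), axis=1)
conditioned = sims[keep]
print(keep.sum())  # plays the role of the nsim reported in the output
```

Any downstream probability (like the electoral-college total) is then just a fraction computed over the surviving draws, which is also why the reported standard error grows as more conditioning throws away draws.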

The post What might we know at 7pm? appeared first on Statistical Modeling, Causal Inference, and Social Science.

At 7pm, the polls will be closed in the following states: KY, GA, IN, NH, SC, VT, VA.

Let’s list these in order of projected Trump/Clinton vote share: KY, IN, SC, GA, NH, VA, VT.

I’ll use Kremp’s updating program to compute Trump and Clinton’s probabilities of winning, under his model, for several different scenarios.

First, with no information except the pre-election polls:

```r
> update_prob()
Pr(Clinton wins the electoral college) = 90%
[nsim = 100000; se = 0.1%]
```

Clinton has a 90% chance of winning.

Now let’s consider the best possible scenario for Trump at 7pm, in which he wins Kentucky, Indiana, South Carolina, Georgia, New Hampshire, and Virginia (but not Vermont, cos let’s get serious):

```r
> update_prob(trump_states = c("KY","IN","SC","GA","NH","VA"), clinton_states = c("VT"))
Pr(Clinton wins the electoral college) = 2%
[nsim = 1340; se = 0.4%]
```

Next-best option for Trump: he wins all the states except Virginia and Vermont:

```r
> update_prob(trump_states = c("KY","IN","SC","GA","NH"), clinton_states = c("VA","VT"))
Pr(Clinton wins the electoral college) = 28%
[nsim = 3856; se = 0.7%]
```

Most likely scenario: Trump wins Kentucky, Indiana, South Carolina, and Georgia, but loses New Hampshire, Virginia, and Vermont:

```r
> update_prob(trump_states = c("KY","IN","SC","GA"), clinton_states = c("NH","VA","VT"))
Pr(Clinton wins the electoral college) = 93%
[nsim = 88609; se = 0.1%]
```

Or Trump just wins Kentucky, Indiana, and South Carolina:

```r
> update_prob(trump_states = c("KY","IN","SC"), clinton_states = c("GA","NH","VA","VT"))
Pr(Clinton wins the electoral college) = 100%
[nsim = 5240; se = 0%]
```

**P.S.** Kremp writes:

If you remember, I have a polling error term in my forecast, so all polls can be off in any given state by the same amount. And these polling errors are correlated across states. I picked a 0.7 correlation—which may be a bit high. It made the model more conservative about Clinton’s chances, but today, it’s going to make it jump to conclusions when the first results come in.

Interesting. I’m not sure if Kremp’s model really is overreacting: I’d guess that errors across states will have a very high correlation. I guess we’ll see once all the data come in.
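The “jumping to conclusions” concern is just the flip side of the usual conditional-normal arithmetic (here checked numerically in Python with made-up numbers, not the model’s actual parameters): with equally variable, equicorrelated polling errors, observing an error of delta in one state shifts the expected error in every other state by rho times delta.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, rho = 2.0, 0.7  # made-up polling-error sd (points) and correlation

# Correlated polling errors in two states
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
err = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

# Condition (approximately) on the first state's polls being off by about -3
observed = err[np.abs(err[:, 0] - (-3.0)) < 0.1]
shift = observed[:, 1].mean()  # expected polling error in the other state
print(shift)  # theory says rho * (-3) = -2.1
```

So with rho = 0.7, one state coming in 3 points below its polls moves the expectation for every other state down by about 2 points; whether that is “overreacting” depends entirely on whether 0.7 is too high.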

**P.P.S.** I forgot Kentucky in my first version of this post. Kentucky was never going to be close so including it does not change the numbers at all. But for completeness I updated the code.

The post Blogging the election at Slate appeared first on Statistical Modeling, Causal Inference, and Social Science.
