The post Solution to the helicopter design problem appeared first on Statistical Modeling, Causal Inference, and Social Science.

Here’s the question:

In the helicopter activity, pairs of students design paper “helicopters” and compete to create the copter that takes longest to reach the ground when dropped from a fixed height. The two parameters of the helicopter, a and b, correspond to the length of certain cuts in the paper, parameterized so that each of a and b must be more than 0 and less than 1. In the activity, students are allowed to make 20 test helicopters at design points (a,b) of their choosing. The students measure how long each copter takes to reach the ground, and then they are supposed to fit a simple regression (not hierarchical or even Bayesian) to model this outcome as a function of a and b. Based on this model, they choose the optimal a,b and then submit this to the class. Here is the question. Why is it inappropriate for that regression model to be linear?

And here’s the answer: a linear model in a and b has no interior stationary point, so its fitted optimum is necessarily on the boundary. But we already know the solution can’t be on the boundary (each of a and b must be more than 0 and less than 1). You need a nonlinear model, for example one with quadratic terms, to get an internal optimum.
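To make the boundary argument concrete, here is a minimal sketch (my own illustration, with made-up drop-time data, so the numbers mean nothing): the best-fitting plane always attains its grid maximum over the unit square at a corner, while a quadratic fit can put the optimum inside.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical experiment: 20 design points in the open unit square.
a = rng.uniform(0.05, 0.95, 20)
b = rng.uniform(0.05, 0.95, 20)
# Assumed true surface: flight time peaks at an interior point (0.6, 0.4).
t = 3 - 4 * (a - 0.6) ** 2 - 4 * (b - 0.4) ** 2 + rng.normal(0, 0.1, 20)

def fit_and_argmax(design):
    """Least-squares fit, then argmax of the fitted surface on a grid."""
    beta, *_ = np.linalg.lstsq(design(a, b), t, rcond=None)
    g = np.linspace(0.01, 0.99, 99)
    ga, gb = np.meshgrid(g, g)
    pred = design(ga.ravel(), gb.ravel()) @ beta
    i = np.argmax(pred)
    return ga.ravel()[i], gb.ravel()[i]

# Linear model: intercept plus a and b.
linear = lambda a, b: np.column_stack([np.ones_like(a), a, b])
# Quadratic model: adds squared and interaction terms.
quad = lambda a, b: np.column_stack(
    [np.ones_like(a), a, b, a**2, b**2, a * b])

print("linear model optimum:   ", fit_and_argmax(linear))
print("quadratic model optimum:", fit_and_argmax(quad))
```

Because the linear predictor is monotone in each of a and b, its optimum lands at a corner of the square no matter what the data say; the quadratic fit can recover an interior peak.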

I was happy to see that all 4 of the students got this correct.


The post No, Michael Jordan didn’t say that! appeared first on Statistical Modeling, Causal Inference, and Social Science.

First verse. There’s an article by a journalist,

- The odds, continually updated, by F.D. Flam in the *NY Times*

to which Andrew responded in blog form,

- No, I didn’t say that, by Andrew Gelman, on this blog.

Second verse. There’s an article by a journalist,

- Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Other Huge Engineering Efforts, by Lee Gomes, published in *IEEE Spectrum* (the “light” magazine you get when you join IEEE).

to which Michael Jordan responded in blog form,

- Big Data, Hype, the Media and Other Provocative Words to Put in a Title, by Michael Jordan on the Berkeley Amp Lab’s blog.

Whenever I (Bob, not Andrew) read a story in an area I know something about (slices of computer science, linguistics, and statistics), I’m almost always struck by the inaccuracies. The result is that I mistrust journalists writing about topics I don’t know anything about, such as foreign affairs, economics, or medicine.


The post Some questions from our Ph.D. statistics qualifying exam appeared first on Statistical Modeling, Causal Inference, and Social Science.

- In the helicopter activity, pairs of students design paper “helicopters” and compete to create the copter that takes longest to reach the ground when dropped from a fixed height. The two parameters of the helicopter, a and b, correspond to the length of certain cuts in the paper, parameterized so that each of a and b must be more than 0 and less than 1. In the activity, students are allowed to make 20 test helicopters at design points (a,b) of their choosing. The students measure how long each copter takes to reach the ground, and then they are supposed to fit a simple regression (not hierarchical or even Bayesian) to model this outcome as a function of a and b. Based on this model, they choose the optimal a,b and then submit this to the class. Here is the question. Why is it inappropriate for that regression model to be linear?
- You are designing an experiment to estimate a linear dose-response pattern, where the dose x can take on the values 1, 2, and 3, and the response is continuous. Suppose that there is no systematic error and that the measurement variance is proportional to x. You have 100 people in your experiment. How should you allocate them among the x = 1, 2, and 3 conditions to best estimate the dose-response slope?
- It is sometimes said that the p-value is uniformly distributed if the null hypothesis is true. Give two different reasons why this statement is not in general true. The question concerns real examples, not just toy examples, so your reasons should not involve degenerate situations such as zero sample size or infinite data values.

You can try these at home; also try to guess which of these problems were easy for the students and which were hard. (One of them was solved correctly by all 4 students who took the exam, while another turned out to be so difficult that none of the students got close to the right answer.) I’ll post solutions over the next three days.
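For readers who want to try the sample-allocation question numerically, here is a brute-force sketch. It assumes (my assumption, not stated in the problem) that the analysis is weighted least squares with inverse-variance weights 1/x, since Var(y | x) is proportional to x.

```python
import numpy as np
from itertools import product

def slope_var(n1, n2, n3, sigma2=1.0):
    """Variance of the WLS slope estimate when Var(y|x) = sigma2 * x
    and n_x subjects are measured at dose x."""
    x = np.repeat([1.0, 2.0, 3.0], [n1, n2, n3])
    w = 1.0 / x                      # inverse-variance weights
    X = np.column_stack([np.ones_like(x), x])
    cov = sigma2 * np.linalg.inv(X.T @ (w[:, None] * X))
    return cov[1, 1]                 # variance of the slope coefficient

# Search over all allocations of 100 subjects across the three doses
# (requiring at least one subject at x=1 and at x=3 so the fit is defined).
best = min(((n1, n2, 100 - n1 - n2)
            for n1, n2 in product(range(101), repeat=2)
            if n1 + n2 <= 100 and n1 > 0 and 100 - n1 - n2 > 0),
           key=lambda n: slope_var(*n))
print("lowest-variance allocation (n1, n2, n3):", best)
print("for comparison, an equal split gives:", slope_var(34, 33, 33))
```

Swapping in ordinary least squares (all weights equal) changes the variance formula and can change the preferred allocation, which is part of what makes the question interesting.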


The post Stan 2.5, now with MATLAB, Julia, and ODEs appeared first on Statistical Modeling, Causal Inference, and Social Science.

As usual, you can find everything on the Stan home page.

Drop us a line on the stan-users group if you have problems with installs or questions about Stan or coding particular models.

**New Interfaces**

We’d like to welcome two new interfaces:

- MatlabStan by Brian Lau, and

- Stan.jl (for Julia) by Rob Goedman.

The new interface home pages are linked from the Stan home page.

**New Features**

The biggest new feature is a differential equation solver (Runge-Kutta from Boost’s odeint with coupled sensitivities). We also added new cbind and rbind functions, is_nan and is_inf functions, a num_elements function, a mechanism to throw exceptions to reject samples with printed messages, and two new distributions, the Frechet and 2-parameter Pareto (both contributed by Alexey Stukalov).

**Backward Compatibility**

Stan 2.5 is fully backward compatible with earlier 2.x releases and will remain so until Stan 3 (which is not yet designed, much less scheduled).

**Revised Manual**

In addition to the ODE documentation, there is a new chapter on marginalizing discrete latent parameters with several example models, new sections on regression priors for coefficients and noise scale in ordinary, hierarchical, and multivariate settings, along with new chapters on all the algorithms used by Stan for MCMC sampling, optimization, and diagnosis, with configuration information and advice.

**Preview of 2.6 and Beyond**

Our plans for major features in the near future include stiff ODE solvers, a general MATLAB/R-style array/matrix/vector indexing and assignment syntax, and uncertainty estimates for penalized maximum likelihood estimates via Laplace approximations with second-order autodiff.

**Release Notes**

Here are the release notes.

v2.5.0 (20 October 2014)

**New Features**

- ordinary differential equation solver, implemented by coupling the user-specified system with its sensitivities (#771)
- add reject() statement for user-defined rejections/exceptions (#458)
- new num_elements() function that applies to all containers (#1026)
- added is_nan() and is_inf() functions (#592)
- nested reverse-mode autodiff, primarily for the ODE solver (#1031)
- added get_lp() function to remove any need for bare lp__ (#470)
- new functions cbind() and rbind() like those in R (#787)
- added modulus function in a way that is consistent with integer division across platforms (#577)
- exposed pareto_type_2_rng (#580)
- added Frechet distribution and multi_gp_cholesky distribution (thanks to Alexey Stukalov for both)

**Enhancements**

- removed Eigen code insertion for numeric traits and replaced with order-independent metaprogram (#1065)
- cleaned up error messages to provide clearer error context and more informative messages (#640)
- extensive tests for higher-order autodiff in densities (#823)
- added context factory
- deprecated lkj_cov density (#865)
- trying again with informational/rejection message (#223)
- more code moved from interfaces into Stan common libraries, including a var_context factory for configuration
- moved example models to their own repo (stan-dev/example-models) and included as submodule for stan-dev/stan (#314)
- added per-iteration interrupt handler to BFGS optimizer (#768)
- worked around unused-function warnings from gcc (#796)
- fixed error messages in vector-to-array conversion (#579, thanks Kevin S. Van Horn)
- fixed gp-fit.stan example to be as efficient as the manual version (#782)
- update to Eigen version 3.2.2 (#1087)

**Builds**

- pulled out testing into a Python script for developers to simplify makes
- libstan dependencies handled properly and dependencies regenerated, including working around a bug in GNU make 3.8.1 (#1058, #1061, #1062)

**Bug Fixes**

- deal with covariant return structure in functions (allows data-only variables to alternate with parameter version); involved adding new traits metaprograms promote_scalar and promote_scalar_type (#849)
- fixed error message on check_nonzero_size (#1066)
- fix arg config printing after random seed generation (#1049)
- logical conjunction and disjunction operators short circuit (#593)
- cleaned up parser bug preventing variables starting with reserved names (#866)
- fma() function calls underlying platform fma (#667)
- removed upper bound on number of function arguments (#867)
- cleaned up code to remove compiler warnings (#1034)
- share likely() and unlikely() macros to avoid redundancy warnings (#1002)
- complete review of function library for NaN behavior and consistency of calls for double and autodiff values, with extensive documentation and extensive new unit tests; enhances NaN testing in built-in test functions (several dozen issues in the #800 to #902 range)
- fixed Eigen assert bugs with NO_DEBUG in tests (#904)
- fix to makefile to allow builds in g++ 4.4 (thanks to Ewan Dunbar)
- fix precedence of exponentiation in language (#835)
- allow size-zero inputs in data and initialization (#683)

**Documentation**

- new chapter on the differential equation solver
- new sections on default priors for regression coefficients and scales, including hierarchical and multivariate based on full Cholesky parameterization
- new part on algorithms, with chapters on HMC/NUTS, optimization, and diagnostics
- new chapter on models with latent discrete parameters
- using latexmk through make for LaTeX compilation
- changed page numbering to be contiguous throughout so page numbers match PDF viewer page numbers
- all user-supplied corrections applied from next-manual issue
- section on identifiability with priors, including discussion of the K-1 parameterization of softmax and IRT
- new section on convergence monitoring
- extensive corrections from Andrew Gelman on regression models and notation
- added discussion of the hurdle model in the zero-inflation section
- updated built-in function doc to clarify several behaviors (#1025)


The post Sailing between the Scylla of hyping of sexy research and the Charybdis of reflexive skepticism appeared first on Statistical Modeling, Causal Inference, and Social Science.

Recently I had a disagreement with Larry Bartels which I think is worth sharing with you. Larry and I took opposite positions on the hot topic of science criticism.

To put things in a positive way, Larry was writing about some interesting recent research which I then constructively criticized.

To be more negative, Larry was hyping some sexy research and I was engaging in mindless criticism.

The balance between promotion and criticism is always worth discussing, but particularly so in this case because of three factors:

1. The research in question is on the borderline. The conclusions in question are not rock-solid—they depend on how you look at the data and are associated with p-values like 0.10 rather than 0.0001—but neither are they silly. Some of the findings definitely seem real, and the debate is more about how far to take it than whether there’s anything there at all. Nobody in the debate is claiming that the findings are empty; there’s only a dispute about their implications.

2. The topic—the effect of unperceived messages on political attitudes—is important.

3. And, finally, Larry and I generally respect each other, both as scholars and as critics. So, even though we might be talking past each other regarding the details of this particular debate, we each recognize that the other has something valuable to say, both regarding methods and public opinion.

**What it’s all about**

The background is here:

We had a discussion last month on the sister blog regarding the effects of subliminal messages on political attitudes. It started with a Larry Bartels post entitled “Here’s how a cartoon smiley face punched a big hole in democratic theory,” with the subtitle, “Fleeting exposure to ‘irrelevant stimuli’ powerfully shapes our assessments of policy arguments,” discussing the results of an experiment conducted a few years ago and recently published by Cengiz Erisen, Milton Lodge and Charles Taber. Larry wrote:

What were these powerful “irrelevant stimuli” that were outweighing the impact of subjects’ prior policy views? Before seeing each policy statement, each subject was subliminally exposed (for 39 milliseconds — well below the threshold of conscious awareness) to one of three images: a smiling cartoon face, a frowning cartoon face, or a neutral cartoon face. . . . the subliminal cartoon faces substantially altered their assessments of the policy statements . . .

I followed up with a post expressing some skepticism:

Unfortunately they don’t give the data or any clear summary of the data from experiment No. 2, so I can’t evaluate it. I respect Larry Bartels, and I see that he characterized the results as the “subliminal cartoon faces substantially altered their assessments of the policy statements — and the resulting negative and positive thoughts produced substantial changes in policy attitudes.” But based on the evidence given in the paper, I can’t evaluate this claim. I’m not saying it’s wrong. I’m just saying that I can’t express judgment on it, given the information provided.

Larry then followed up with a post saying that further information was in chapter 3 of Erisen’s Ph.D. dissertation and presented as evidence this path analysis:

along with this summary:

In this case, subliminal exposure to a smiley cartoon face reduced negative thoughts about illegal immigration, increased positive thoughts about illegal immigration, and (crucially for Gelman) substantially shifted policy attitudes.

And Erisen sent along a note with further explanation, the centerpiece of which was another path analysis.

Unfortunately I still wasn’t convinced. The trouble is, I just get confused whenever I see these path diagrams. What I really want to see is a direct comparison of the political attitudes with and without the intervention. No amount of path diagrams will convince me until I see the direct comparison.

However, I had not read all of the relevant chapter of Erisen’s dissertation in detail. I’d looked at the graphs (which had results of path analyses, and data summaries on positive and negative thoughts, but no direct data summaries of issue attitudes) and at some of the tables. It turns out, though, that there were some direct comparisons of issue attitudes in the text of the dissertation but not in the tables and figures.

I’ll get back to that in a bit, but first let me return to what I wrote at the time, in response to Erisen and Bartels:

I’m not saying that Erisen is wrong in his claims, just that the evidence he [and Larry] have shown me is too abstract to convince me. I realize that he knows a lot more about his experiment and his data than I do, and I’m pretty sure that he is much more informed on this literature than I am, so I respect that he feels he can draw certain strong conclusions from his data. But, for me, I have to go on what information is available to me.

Why do these claims from path analysis confuse me? An example is given in a comment by David Harris, who reports that Erisen et al. “seem to acknowledge that the effect of their priming on people’s actual policy evaluations is nil” but that they then follow up with a convoluted explanation involving a series of interactions.

Convoluted can be OK—real life is convoluted—but I’d like to see some simple comparisons. If someone wants to claim that “Fleeting exposure to ‘irrelevant stimuli’ powerfully shapes our assessments of policy arguments,” I’d like to see if these fleeting exposures indeed have powerful effects. In an observational setting, such effects can be hard to “tease out,” as the saying goes. But in this case the researchers did a controlled experiment, and I’d like to see the direct comparison as a starting point.

Commenter Dean Eckles wrote:

The answer is that those effects are not significant at conventional levels in Exp 2. From ch. 3 (pages 89-91) of Cengiz Erisen’s dissertation (from https://dspace.sunyconnect.suny.edu/handle/1951/52338) we have:

Illegal Immigration: “In the first step of the mediation model a simple regression shows the effect of affective prime on the attitude (beta = .34; p < .07). Although not hypothesized, this confirms the direct influence of the affective prime on the illegal immigration attitude.”

Energy Security: “As before, the first step of the mediation model ought to present the effect of the prime on one’s attitude. In this mediation model, however, the affective prime does not change energy security attitude directly (beta = −.10; p > .10). Yet, as discussed before, the first step of mediation analysis is not required to establish the model (Shrout & Bolger 2002; MacKinnon 2008).”

So (the cynic in me says), this pretty much covers it. The direct result was not statistically significant. When it went in the expected direction and was not statistically significant, it was taken as a confirmation of the hypothesis. When it went in the wrong direction and was not statistically significant, it was dismissed as not being required.

**Back to the debate**

OK, so here you have the story as I see it: Larry heard of an interesting study regarding subliminal messages, a study that made a lot of sense especially in light of the work of Larry and others regarding the ways in which voters can be swayed by information that logically should be irrelevant to voting decisions or policy positions (and, indeed, consistent with the work of Kahneman, Slovic, and Tversky regarding shortcuts and heuristics in decision making). The work seemed solid and was supported by several statistical analyses. And there does seem to be something there (in particular, Erisen shows strong evidence of the stimulus affecting the numbers of positive and negative thoughts expressed by the students in his experiment). But the evidence for the headline claim—that the subliminal smiley-faces affect political attitudes themselves, not just positive and negative expressions—is not so clear.

That’s my perspective. Now for Larry’s. As he saw it, my posts were sloppy: I reacted to the path analyses presented by him and Erisen and did not look carefully within Erisen’s Ph.D. thesis to find the direct comparisons. Here’s what Larry wrote:

Now it seems that one of your commenters has read (part of) the dissertation chapter and found two tests of the sort you claimed were lacking, one of which indicates a substantial effect (.34 on a six-point scale) and the other of which indicates no effect. If you or your commenter bothered to keep reading, you would find four more tests, two of which (involving different issues) indicate substantial effects (.40 and .51) and two of which indicate no effects. The three substantial effects (out of six) have reported p-values of <.07, <.08, and >.10. How likely is that set of results to occur by chance? Do you really want to argue that the appropriate way to assess this evidence is one .05 test at a time?

Hmmm, I’ll have to think about this one.

My quick response is as follows:

1. Sure, if we accept the general quality of the measurements in this study (no big systematic errors, etc.) then there’s very clear evidence of the subliminal stimuli having effects on positive and negative expressions, hence it’s completely reasonable to expect effects on other survey responses including issue attitudes.

2. That is, we’re not in “Bem” territory here. Conditional on the experiments being done competently, there are real effects here.

3. Given that the stimuli can affect issue attitudes, it’s reasonable to expect variation, to expect some positive and some negative effects, and for the effects to vary across people and across situations.

4. So if I wanted to study these effects, I’d be inclined to fit a multilevel model to allow for the variation and to better estimate average effects in the context of variation.

5. When it comes to *specific* effects, and to specific claims of large effects (recall the original claim that the stimulus “*powerfully* [emphasis added] shapes our assessments of policy arguments,” elsewhere “substantially altered,” elsewhere “significantly and consistently altered,” elsewhere “punched a big hole in democratic theory”), I’d like to see some strong evidence. And these “p less than .07” and “p greater than .10” things don’t look like strong evidence to me.

6. I agree that these results are consistent with *some* effect on issue attitudes but I don’t see the evidence for the large effects that have been claimed.

7. Finally, I respect the path analyses for what they are, and I’m not saying Erisen *shouldn’t* have done them, but I think it’s fair to say that these are the sorts of analyses that are used to understand large effects that exist; they don’t directly address the question of the effects of the stimulus on policy attitudes (which is how we could end up with explanation of large effects that cancel out).

As a Bayesian, I do accept Larry’s criticism that it was odd for me to claim that there was no evidence just because p was not less than 0.05. Even weak evidence should shift my priors a bit, no?

And I agree that weak evidence is not the same as zero evidence.

So let me clarify that, conditional on accepting the quality of Erisen’s experimental protocols (which I have no reason to question), I have no doubt that some effects are there. The question is about the size and the direction of the effects.

**Summary**

In some sense, the post-publication review process worked well: Larry promoted the original work on the sister blog, which gave it a wider audience. I read Larry’s post and offered my objection on the sister blog and here, and, in turn, Erisen and various commenters replied. And, eventually, after a couple of email exchanges, I finally got the point that Larry had been trying to explain to me, that Erisen *did* have the direct comparisons I’d been asking for; they were just in the text of his dissertation and not in the tables and figures.

This post-publication discussion was slow and frustrating (especially for Larry, who was rightly annoyed that I kept saying that the information wasn’t available to me, when it was there in the dissertation all along), but I still think it moved forward in a better way than would’ve happened *without* the open exchange, if, for example, all we’d had were a series of static, published articles presenting one position or another.

But these questions are difficult and somewhat unstable because of the massive selection effects in play. This discussion had its frustrating aspects on both sides but things are typically much worse! Most studies in political science don’t get discussed on the Monkey Cage or on this blog, and what we see is typically bimodal: a mix of studies that we like and think are worth sharing, and studies that we dislike and think are worth taking the time to debunk.

But I don’t go around looking for studies to shoot down! What typically happens is they get hyped by somebody else (whether it be Freakonomics, or David Brooks, or whoever) and then I react.

In this case, Larry posted on a research finding that he thought was important and perhaps had not received enough attention. I was skeptical. After all the dust has settled, I remain skeptical about any effects of the subliminal message on political attitudes. I think Larry remains convinced, and maybe our disagreement ultimately comes down to priors, which makes sense given that the evidence from the data is weak.

Meanwhile, new studies get published, and get neglected, or hyped, or both. I offer no general solution to how to handle these—clearly, the standard system of scientific publishing has its limitations—here I just wanted to raise some of these issues in a context where I see no easy answers.

To put it another way, I think social science can—and should—do better than we usually do. For a notorious example, consider “Reinhart and Rogoff”: a high-profile paper published in a top journal with serious errors that were not corrected for *several years* after publication.

On one hand, the model of discourse described in my above post is not at all scalable—Larry Bartels and I are just 2 guys, after all, and we have finite time available for this sort of thing. On the other hand, consider the many thousands of researchers who spend so many hours refereeing papers for journals. Surely this effort could be channeled in a more useful way.


The post Try a spaghetti plot appeared first on Statistical Modeling, Causal Inference, and Social Science.

Joe Simmons writes:

I asked MTurk NFL fans to consider an NFL game in which the favorite was expected to beat the underdog by 7 points in a full-length game. I elicited their beliefs about sample size in a few different ways (materials .pdf; data .xls).

Some were asked to give the probability that the better team would be winning, losing, or tied after 1, 2, 3, and 4 quarters. If you look at the average win probabilities, their judgments look smart. But this graph is super misleading, because the fact that the average prediction is wise masks the fact that the average person is not. Of the 204 participants sampled, only 26% assigned the favorite a higher probability to win at 4 quarters than at 3 quarters than at 2 quarters than at 1 quarter. About 42% erroneously said, at least once, that the favorite’s chances of winning would be greater for a shorter game than for a longer game.

How good people are at this depends on how you ask the question, but no matter how you ask it they are not very good.

The explicit warning, “This Graph is Super Misleading,” is a great idea.

But don’t stop there! You can do better. The next step is to follow it up with a spaghetti plot showing people’s estimates. If you click through the links, you see there are about 200 respondents, and 200 is a lot to show in a spaghetti plot, but you could handle this by breaking up the people into a bunch of categories (for example, based on age, sex, and football knowledge) thus allowing a grid of smaller graphs, each of which wouldn’t have too many lines.

**P.S.** Jeff Leek points out that sometimes a spaghetti plot won’t work so well because there are too many lines to plot and all you get is a mess (sort of like the above plate-o-spag image, in fact). He suggests the so-called lasagna plot, which is a sort of heat map, and which seems to have some similarities to Solomon Hsiang’s “watercolor” uncertainty display.

A heat map could be a good idea but let me also remind everyone that there are some solutions to overplotting of the lines in a spaghetti plot, some ways to keep the spaghetti structure while losing some of the messiness. Here are some strategies, in increasing order of complexity:

1. Simply plot narrower lines. Graphics devices have improved, and thin lines can work well.

2. Just plot a random sample of the lines. If you have 100 patients in your study, just plot 20 lines, say.

3. Small multiples: for example, a 2×4 grid broken down by male/female and 4 age categories. Within each sub-plot you don’t have so many lines so less of a problem with overplotting.

4. Alpha-blending.


The post Three ways to present a probability forecast, and I only like one of them appeared first on Statistical Modeling, Causal Inference, and Social Science.

I think the National Weather Service knows what they’re doing on this one.


The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Tues:** Try a spaghetti plot

**Wed:** I ain’t got no watch and you keep asking me what time it is

**Thurs:** Some questions from our Ph.D. statistics qualifying exam

**Fri:** Solution to the helicopter design problem

**Sat:** Solution to the problem on the distribution of p-values

**Sun:** Solution to the sample-allocation problem


The post “Your Paper Makes SSRN Top Ten List” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Dear Andrew Gelman:

Your paper, “WHY HIGH-ORDER POLYNOMIALS SHOULD NOT BE USED IN REGRESSION DISCONTINUITY DESIGNS”, was recently listed on SSRN’s Top Ten download list for: PSN: Econometrics, Polimetrics, & Statistics (Topic) and Political Methods: Quantitative Methods eJournal.

As of 02 September 2014, your paper has been downloaded 17 times. You may view the abstract and download statistics at: http://ssrn.com/abstract=2486395.

Top Ten Lists are updated on a daily basis. . . .

The paper (with Guido Imbens) is here.

What amused me, though, was how low the number was. 17 downloads isn’t so many. I guess it doesn’t take much to be in the top 10!


The post Hoe noem je? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Reviewing my notes and books on categorical data analysis, I see that the term “nominal” is widely employed to refer to variables without any natural ordering. I was a language major in undergrad and knew that the etymology of nominal is the Latin word nomen (from the Online Etymological Dictionary: early 15c., “pertaining to nouns,” from Latin nominalis “pertaining to a name or names,” from nomen (genitive nominis) “name,” cognate with Old English nama (see name (n.)). Meaning “of the nature of names” (in distinction to things) is from 1610s. Meaning “being so in name only” first recorded 1620s.)

So variables without a natural order, such as gender (male-female) or transport mode (walk, bicycle, bus, train, car), are just coded 0, 1, and so on. Yet the textbook writers do not explain that nominal just means name, which, it seems to me, would help the students better understand the application.

Do you know when this usage was first introduced into statistics?

I have no idea but maybe you, the readers, can offer some insight?


]]>The post How do companies use Bayesian methods? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’m in Northwestern’s Predictive Analytics grad program. I’m working on a project providing Case Studies of how companies use certain analytic processes and want to use Bayesian Analysis as my focus.

The problem: I can find tons of work on how one might apply Bayesian Statistics to different industries but very little on how companies actually do so except as blurbs in larger pieces.

I was wondering if you might have ideas of where to look for cases of real life companies using Bayesian principles as an overall strategy.

Some examples that come to mind are pharmaceutical companies that use hierarchical pharmacokinetic/pharmacodynamic modeling, as well as people on the Stan users list who are using Bayes in various business settings. And I know that some companies do formal decision analysis which I think is typically done in a Bayesian framework. And I’ve given some short courses at companies, which implies that they’re *interested* in Bayesian methods, though I don’t really know if they ended up following my particular recommendations.

Perhaps readers can supply other examples?

The post How do companies use Bayesian methods? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Prediction Market Project for the Reproducibility of Psychological Science appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>

The second prediction market project for the reproducibility project will soon be up and running – please participate! There will be around 25 prediction markets, each representing a particular study that is currently being replicated. Each study (and thus market) can be summarized by a key hypothesis that is being tested, which you will get to bet on.

In each market in which you participate, you will bet on a binary outcome: whether the effect in the replication study is in the same direction as the original study, and is statistically significant with a p-value smaller than 0.05.

Everybody is eligible to participate in the prediction markets: it is open to all members of the Open Science Collaboration discussion group – you do not need to be part of a replication for the Reproducibility Project. However, you cannot bet on your own replications.

Each study/market will have a prospectus with all available information so that you can make informed decisions.

The prediction markets are subsidized. All participants will get about $50 on their prediction account to trade with. How much money you make depends on how you bet on different hypotheses (on average participants will earn about $50 on a Mastercard (or the equivalent) gift card that can be used anywhere Mastercard is used).

The prediction markets will open on October 21, 2014 and close on November 4.

If you are willing to participate in the prediction markets, please send an email to Siri Isaksson by October 19 and we will set up an account for you. Before we open up the prediction markets, we will send you a short survey.

The prediction markets are run in collaboration with Consensus Point.

If you have any questions, please do not hesitate to email Siri Isaksson.

The post Prediction Market Project for the Reproducibility of Psychological Science appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Statistical Communication and Graphics Manifesto appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Statistical communication includes graphing data and fitted models, programming, writing for specialized and general audiences, lecturing, working with students, and combining words and pictures in different ways.

The common theme of all these interactions is that we need to consider our statistical tools in the context of our goals.

Communication is not just about conveying prepared ideas to others: often our most important audience is ourselves, and the same principles that suggest good ways of communication with others also apply to the methods we use to learn from data.

See also the description of my statistical communication and graphics course, where we try to implement the above principles.

[I'll be regularly updating this post, which I sketched out (with the help of the students in my statistical communication and graphics course this semester) and put here so we can link to it from the official course description.]

The post Statistical Communication and Graphics Manifesto appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post My course on Statistical Communication and Graphics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>We will study and practice many different aspects of statistical communication, including graphing data and fitted models, programming in Rrrrrrrr, writing for specialized and general audiences, lecturing, working with students and colleagues, and combining words and pictures in different ways.

You learn by doing: each week we have two classes that are full of student participation, and before each class you have a pile of readings, a homework assignment, and jitts.

You learn by teaching: you spend a lot of time in class explaining things to your neighbor.

You learn by collaborating: you’ll do a team project which you’ll present at the end of the semester.

The course will take a lot of effort on your part, effort which should be aligned with your own research and professional goals. And you will get the opportunity to ask questions of guest stars who will illustrate diverse perspectives in statistical communication and graphics.

See also the statistical communication and graphics manifesto.

[I'll be regularly updating this post, which I sketched out (with the help of the students in my statistical communication and graphics course this semester) and put here so we can link to it from the official course description.]

The post My course on Statistical Communication and Graphics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post The Fault in Our Stars: It’s even worse than they say appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>In our recent discussion of publication bias, a commenter link to a recent paper, “Star Wars: The Empirics Strike Back,” by Abel Brodeur, Mathias Le, Marc Sangnier, Yanos Zylberberg, who point to the notorious overrepresentation in scientific publications of p-values that are just below 0.05 (that is, just barely statistically significant at the conventional level) and the corresponding underrepresentation of p-values that are just above the 0.05 cutoff.

Brodeur et al. correctly (in my view) attribute this pattern not just to selection (the much-talked-about “file drawer”) but also to data-contingent analyses (what Simmons, Nelson, and Simonsohn call “p-hacking” and what Loken and I call “the garden of forking paths”). They write:

We have identified a misallocation in the distribution of the test statistics in some of the most respected academic journals in economics. Our analysis suggests that the pattern of this misallocation is consistent with what we dubbed an inflation bias: researchers might be tempted to inflate the value of those almost-rejected tests by choosing a “significant” specification. We have also quantified this inflation bias: among the tests that are marginally significant, 10% to 20% are misreported.

They continue with “These figures are likely to be lower bounds of the true misallocation as we use very conservative collecting and estimating processes”—but I would go much further. One way to put it is that there are (at least) *three* selection processes going on here:

1. (“the file drawer”) Significant results (traditionally presented in a table with asterisks or “stars,” hence the photo above) are more likely to get published.

2. (“inflation”) Near-significant results get jiggled a bit until they fall into the box

3. (“the garden of forking paths”) The direction of an analysis is continually adjusted in light of the data.
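To see how these selection processes distort the published record, here is a toy simulation (my sketch, not Brodeur et al.'s analysis): under a true null, p-values are uniform, so each 0.01-wide bin should hold about 1% of results; but if near misses get extra tries at alternative specifications, mass shifts from just above 0.05 to just below it:

```python
import random

random.seed(1)

def published_p(tries=3):
    """One 'study' under a true null: p ~ Uniform(0,1).
    If the result is a near miss (0.05 < p < 0.10), the analyst
    re-runs up to `tries` alternative specifications and keeps the
    smallest p -- a crude stand-in for 'inflation' / forking paths."""
    p = random.random()
    if 0.05 < p < 0.10:
        for _ in range(tries):
            p = min(p, random.random())
    return p

ps = [published_p() for _ in range(100_000)]
just_below = sum(0.04 < p <= 0.05 for p in ps) / len(ps)
just_above = sum(0.05 < p <= 0.06 for p in ps) / len(ps)
# Under a pure uniform null both bins would hold ~1% of p-values;
# here selection pushes mass from just above 0.05 to just below it.
```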

Brodeur et al. point out that item 1 doesn’t tell the whole story, and they come up with an analysis (featuring a “lemma” and a “corollary”!) explaining things based on item 2. But I think item 3 is important too.

The point is that the analysis is a moving target. Or, to put it another way, there’s a one-to-many mapping from scientific theories to statistical analyses.

So I’m wary of any general model explaining scientific publication based on a fixed set of findings that are then selected or altered. In many research projects, there is either no baseline analysis or else the final analysis is so far away from the starting point that the concept of a baseline is not so relevant.

Although maybe things are different in certain branches of economics, in that people are arguing over an agreed-upon set of research questions.

P.S. I only wish I’d known about these people when I was still in Paris; we could’ve met and talked.

The post The Fault in Our Stars: It’s even worse than they say appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post I didn’t say that! Part 2 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The Garden of Forking Paths paper, by Eric Loken and myself, just appeared in American Scientist. Here’s our manuscript version (“The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ‘fishing expedition’ or ‘p-hacking’ and the research hypothesis was posited ahead of time”), and here’s the final, trimmed and edited version (“The Statistical Crisis in Science”) that came out in the magazine.

Russ Lyons read the published version and noticed the following sentence, actually the second sentence of the article:

Researchers typically express the confidence in their data in terms of p-value: the probability that a perceived result is actually the result of random variation.

How horrible! Russ correctly noted that the above statement is completely wrong, on two counts:

1. To the extent the p-value measures “confidence” at all, it would be confidence in the null hypothesis, not confidence in the data.

2. In any case, the p-value is not not not not not “the probability that a perceived result is actually the result of random variation.” The p-value is the probability of seeing something at least as extreme as the data, if the model (in statistics jargon, the “null hypothesis”) were true.
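The correct definition is easy to make concrete with a simulation (my illustration, not from the American Scientist article). Under a null model of 20 fair coin flips, the two-sided p-value for observing 15 heads is the probability, under the null, of a result at least that far from the null expectation:

```python
import random

random.seed(0)

# Null model: 20 coin flips with P(heads) = 0.5. Observed: 15 heads.
# Two-sided p-value by simulation: the share of null datasets at least
# as extreme as the observation.
n_flips, observed_heads = 20, 15
extreme = abs(observed_heads - n_flips / 2)  # distance from the null mean

sims = 200_000
count = 0
for _ in range(sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if abs(heads - n_flips / 2) >= extreme:
        count += 1
p_value = count / sims
# The exact two-sided p-value here is about 0.041: small, but it is the
# probability of data this extreme *if the null were true* -- not the
# probability that the result "is due to random variation."
```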

**How did this happen?**

The editors at American Scientist liked our manuscript but it was too long, also parts of it needed explaining for a nontechnical audience. So they cleaned up our article and added bits here and there. This is standard practice at magazines. It’s not just Raymond Carver and Gordon Lish.

Then they sent us the revised version and asked us to take a look. They didn’t give us much time. That too is standard with magazines. They have production schedules.

We went through the revised manuscript but not carefully enough. *Really* not carefully enough, given that we missed a glaring mistake—*two* glaring mistakes—in the very first paragraph of the article.

This is ultimately not the fault of the editors. The paper is our responsibility and it’s our fault for not checking the paper line by line. If it was worth writing and worth publishing, it was worth checking.

**P.S.** Russ also points out that the examples in our paper all are pretty silly and not of great practical importance, and he wouldn’t want readers of our article to get the impression that “the garden of forking paths” is only an issue in silly studies.

That’s a good point. The problems of nonreplication etc affect all sorts of science involving human variation. For example there is a lot of controversy about something called “stereotype threat,” a phenomenon that is important if real. For another example, these problems have arisen in studies of early childhood intervention and the effects of air pollution. I’ve mentioned all these examples in talks I’ve given on this general subject, they just didn’t happen to make it into this particular paper. I agree that our paper would’ve been stronger had we mentioned some of these unquestionably important examples.

The post I didn’t say that! Part 2 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Wed:** The Fault in Our Stars: It’s even worse than they say

**Thurs:** Buggy-whip update

**Fri:** The inclination to deny all variation

**Sat:** Hoe noem je?

**Sun:** “Your Paper Makes SSRN Top Ten List”

The post 10th anniversary of “Statistical Modeling, Causal Inference, and Social Science” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>During this time, we’ve had 5688 posts, 48799 comments, and who knows how many readers.

On this tenth anniversary, I’d like to thank my collaborators on all the work I’ve blogged, my co-bloggers (“This post is by Phil”), our commenters, Alex Tabarrok for linking to us way back when, and also the many many people who’ve pointed us to interesting research, interesting graphs, bad research, bad graphs, and links to the latest stylings of David Brooks and Satoshi Kanazawa.

It’s been fun, and I think this blog has been (and I hope will remain) an excellent communication channel on all sorts of topics, statistical and otherwise. Through the blog I’ve met friends, colleagues, and collaborators (including some, such as Basbøll and Palko, whom I’ve still never met in person); I’ve been motivated to think hard about ideas that I otherwise would never have encountered; and I’m pretty sure I’ve motivated many people to examine ideas that *they* otherwise would not have thought seriously about.

The blog has been enlivened with a large and continuing cast of characters, including lots of “bad guys” such as . . . well, no need to list these people here. It’s enough to say they’ve provided us with plenty of entertainment and food for thought.

We’ve had some epic comment threads and enough repeating topics that we had to introduce the Zombies category. We’ve had comments or reactions from culture heroes including Gerd Gigerenzer, Judea Pearl, Helen DeWitt, and maybe even Scott Adams (but we can’t be sure about that last one). We’ve had fruitful exchanges with other researchers such as Christian Robert, Deborah Mayo, and Dan Kahan who have blogs of their own, and, several years back, we launched the internet career of the late Seth Roberts.

Here are the titles of the first five posts from our blog (in order):

A weblog for research in statistical modeling and applications, especially in social sciences

The Electoral College favors voters in small states

Why it’s rational to vote

Bayes and Popper

Overrepresentation of small states/provinces, and the USA Today effect

As you can see, some of our recurrent themes showed up early on.

Here are the next five:

Sensitivity Analysis of Joanna Shepherd’s DP paper

Unequal representation: comments from David Samuels

Problems with Heterogeneous Choice Models

Morris Fiorina on C-SPAN

A fun demo for statistics class

And the ten after that:

Red State/Blue State Paradox

Statistical issues in modeling social space

2 Stage Least Squares Regression for Death Penalty Analysis

Partial pooling of interactions

Bayesian Methods for Variable Selection

Reference for variable selection

The blessing of dimensionality

Why poll numbers keep hopping around by Philip Meyer

Matching, regression, interactions, and robustness

Homer Simpson and mixture models

(Not all these posts are by me.)

See you again in 2024!

The post 10th anniversary of “Statistical Modeling, Causal Inference, and Social Science” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Illinois chancellor who fired Salaita accused of serial self-plagiarism.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The first was a review by Kimberle Crenshaw of a book by Joan Biskupic about Supreme Court judge Sonia Sotomayor. Crenshaw makes the interesting point that Sotomayor, like many political appointees of the past, was chosen in part because of her ethnic background, but that unlike various other past choices (for example, Antonin Scalia, the first Italian American on the court), “Sotomayor’s ethnicity is still viewed [by many] with skepticism.”

I was reminded of Laurence “ten-strike” Tribe’s statement that Sotomayor is “not nearly as smart as she seems to think she is,” a delightfully paradoxical sentence that one could imagine being said by Humpty Dumpty or some other Lewis Carroll character. More to the point, Tribe got caught plagiarizing a few years ago.

So here’s the question. Based on the letter where the above quote appears, Tribe seems to consider himself to be pretty smart (smarter than Sotomayor, that’s for sure). But, from my perspective, what kind of smart person plagiarizes? Not a very smart person, right?

But maybe I’m completely missing the point. If some of the world’s best athletes are doping, maybe some of the world’s best scholars are plagiarizing? It’s hard for me to wrap my head around this one. Also, in fairness to Tribe, he’s over 70 years old. Maybe he used to be smart when he was younger.

The second story came to me via an email from John Transue who pointed me to a post by Ali Abunimah, “Illinois chancellor who fired Salaita accused of serial self-plagiarism.” I had to follow some links to see what was going on here: apparently there was a professor who got fired after pressure on the university from a donor.

I hadn’t heard of Stephen Salaita (the prof who got fired) or Phyllis Wise (the University of Illinois administrator who apparently was in charge of the process), but apparently there’s some controversy about her publication record from her earlier career as a medical researcher.

It looks like a simple case of Arrow’s theorem, that any result can only be published at most five times. Wise seemed to have published the particular controversial paper only three different times, so she has two freebies to go.

As I discussed a couple years ago (click here and scroll down to “It’s 1995”), in some places Arrow’s theorem is such a strong expectation that you’re penalized if you *don’t* publish several versions of the same paper.

But, to get back to the main thread here: to what extent does Wise’s unscholarly behavior—and it is definitely unscholarly and uncool to copy your old papers without making clear the source, even if it’s not as bad as many other academic violations, it’s something you shouldn’t do, and it demonstrates an ethical lapse or a level of sloppiness so extreme as to cast doubt on one’s scholarship—to what extent should this lead us to mistrust her other decisions, in this case in the role of university administrator?

In some sense this doesn’t matter at all: Wise could’ve been the most upstanding, rule-following scientist of all time and the supporters of Salaita would still be strongly disagreeing with her decision and the process used to make it (just as we can all give a hearty laugh at Laurence Tribe’s obnoxiousness, even if he’d never in his life put his name on someone else’s writing).

Or maybe it is relevant, in that Wise’s disregard for the rules in science might be matched by her disregard for the rules in administration. And Tribe’s diminished capacities as a scholar, as revealed by his plagiarism, might lead one to doubt his judgment of the intellectual capacities of his colleagues.

**P.S.** A vocal segment of our readership gets annoyed when I write about plagiarism. I continue to insist that my thoughts in this area have scholarly value (see here and here, for example, and that latter article even appeared in a peer-reviewed journal!), but I *am* influenced by the judgments of others, and so I do feel a little bad about these posts, so I’ve done youall a favor by posting this one late at night on a weekend when nobody will be reading. So there’s that.

The post “Illinois chancellor who fired Salaita accused of serial self-plagiarism.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Science tells us that fast food lovers are more likely to marry other fast food lovers appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Emma Pierson writes:

I’m a statistician working at the genetics company 23andMe before pursuing a master’s in statistics at Oxford on a Rhodes scholarship. I’ve really enjoyed reading your blog, and we’ve been doing some social science research at 23andMe which I thought might be of interest. We have about half a million customers answering thousands of survey questions on everything from homosexuality to extroversion to infidelity, which as you can imagine produces an interesting dataset.

1. We found that customers who answer our survey questions in the middle of the night are significantly less happy and significantly more likely to be manic. See here and here.

2. Using genetic data, we identified 15,000 couples along with the child they had had together. We showed that couples tended to be similar — 97% of traits showed a positive correlation between woman and man, even when we controlled for race and age — although there were often intriguing exceptions.

At this point Pierson shows an info graphic that says that “punctual people,” “skiers,” “hikers,” “non-smokers,” “fast food lovers,” and “apology prone people” are more likely to marry each other, while, in contrast, “early birds” are more likely to marry “night owls,” and “human GPSs” are more likely to marry “constant wrong turners.”

She continues:

We also showed that couples who were dissimilar in terms of BMI or age tended to be less happy (even when we controlled for individual BMI + age). See here and here.

What I’d really like to see is the full list. I’m not so interested in learning that skiers and hikers are likely to marry each other, but if I could see an (organized) list of all the traits they look at, this could be interesting.

The post Science tells us that fast food lovers are more likely to marry other fast food lovers appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post When am I a conservative and when am I a liberal (when it comes to statistics, that is)? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Here I am one day:

Let me conclude with a statistical point. Sometimes researchers want to play it safe by using traditional methods — most notoriously, in that recent note by Michael Link, president of the American Association of Public Opinion Research, arguing against non-probability sampling on the (unsupported) grounds that such methods have “little grounding in theory.” But in the real world of statistics, there’s no such thing as a completely safe method. Adjusting for party ID might seem like a bold and risky move, but, based on the above research, it could well be riskier to not adjust.

I’ve written a lot about the benefits of overcoming the scruples of traditionalists and using (relatively) new methods, specifically Bayesian multilevel models, to solve problems (such as estimation of public opinion in small subgroups of the population) that would otherwise be either impossible or would be done on a completely sloppy, ad hoc basis.
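As a hedged sketch of the core idea behind those multilevel models (my toy code, not Gelman's, with made-up numbers and with the variance components assumed known for simplicity): each subgroup estimate is a precision-weighted compromise between the subgroup's raw mean and the overall mean, so sparse subgroups get shrunk heavily while large ones barely move:

```python
# Partial pooling for group means (normal approximation): each group
# estimate is a precision-weighted average of the group's raw mean and
# the grand mean. Small groups get pulled in more.

def partial_pool(group_means, group_sizes, sigma2=1.0, tau2=0.05):
    """sigma2: within-group variance; tau2: between-group variance.
    Both are assumed known here for simplicity."""
    total = sum(group_sizes)
    grand = sum(m * n for m, n in zip(group_means, group_sizes)) / total
    pooled = []
    for m, n in zip(group_means, group_sizes):
        w = (n / sigma2) / (n / sigma2 + 1 / tau2)  # weight on raw mean
        pooled.append(w * m + (1 - w) * grand)
    return pooled

# Hypothetical state-level opinion shares from a lopsided sample:
means = [0.70, 0.48, 0.52]
sizes = [5, 500, 800]       # the first "state" has only 5 respondents
est = partial_pool(means, sizes)
# The n=5 group's 0.70 shrinks sharply toward the grand mean (~0.51);
# the large groups barely move from their raw means.
```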

On the other hand, sometimes I’m a conservative curmudgeon, for example in my insistence that claims about beauty and sex ratios, or menstrual cycles and voting, are bogus (or, to be precise, that those claims are pure theoretical speculation, unsupported by the data that purport to back them up).

What’s the deal? How to resolve this? One way to get a handle on this is in each case to think about the alternative. The balance depends on the information available in the problem at hand. In a sense, I’m always a curmudgeonly conservative (as in that delightful image of Grampa Simpson above) in that I’m happy to use prior information and I don’t think I should defer to whatever piece of data happens to be in front of me.

This is the point that Aleks Jakulin and I made in our article, “Bayes: radical, liberal, or conservative?”

Consider the polling scene, where I’m a liberal or radical in wanting to use non-probability sampling (gasp!). But, really, this stance of mine has two parts:

1 (conservative): I don’t particularly trust raw results from probability sampling, as the nonresponse rate is so high and so much adjustment needs to be done to such surveys anyway.

2 (liberal): I think with careful modeling we can do a lot more than just estimate toplines and a few crosstabs.

Now consider those junk psychology studies that get published in tabloid journals based on some flashy p-values. Again, I have two stances:

1 (conservative): Just cos someone sees a pattern in 100 online survey responses, I don’t see this as strong evidence for a pattern in the general population, let alone as evidence for a general claim about human nature or biology or whatever.

2 (liberal): I’m open to the possibility that there are interesting patterns to be discovered, and I recommend careful measurement and within-subject designs to estimate these things accurately.

The post When am I a conservative and when am I a liberal (when it comes to statistics, that is)? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Varieties of description in political science appeared first on Statistical Modeling, Causal Inference, and Social Science.

I am organizing a panel at next year’s American Political Science Association meeting tentatively entitled “Varieties of Description.” The idea is to compare and contrast the ways in which different disciplines approach descriptive inference: how they go about collecting data, how they validate descriptive inferences, and what ontological assumptions they make. The panel tries to expand on the work by John Gerring and others that identifies description as a distinct element of social analysis. The panel currently has contributors who analyze the structure of historical description as well as thick anthropological-like description. To round out the panel, I am interested in finding contributors who take a more quantitative approach to description. This could involve the use of descriptive statistics or exploratory data analysis to describe how, when, and where social phenomena unfold, or work that employs these tools to generate new theoretical insights. Or it could involve work on concept formation or typologizing.

I responded by pointing him to my paper with Basbøll in which we say that storytelling, like exploratory data analysis more generally, can be seen as a form of model checking: an interesting story is one which acts as a counterexample to some model (explicit or implicit) of interest.

As we write in our article:

One might imagine a statistician criticizing storytellers for selection bias, for choosing the amusing, unexpected, and atypical rather than the run-of-the-mill boring reality that should form the basis for most of our social science. But then how can we also say the opposite, that stories benefit from being anomalous? We reconcile this apparent contradiction by placing stories in a different class of evidence from anecdotal data as usually conceived. The purpose of a story is not to pile on evidence in support of one theory or another but rather to shine a spotlight on an anomaly—a problem with an existing model—and to stand as an immutable object that conveys the complexity of reality.

Anyway, if anyone is interested in participating in Kreuzer’s panel at next year’s APSA meeting, feel free to contact him directly.

The post Varieties of description in political science appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Science does not advance by guessing” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>And, speaking as a statistician and statistical educator, I think there’s a big problem with the usual treatment of statistics and scientific discovery in statistics articles and textbooks, in that the usual pattern is for the theory and experimental design to be airlifted in from who-knows-where and then the statistical methods are just used to prove (beyond some reasonable doubt) that the theory is correct, via a p-value or a posterior probability or whatever. As Seth pointed out many times, this skips the most key question of where the theory came from, and in addition it skips the almost-as-key question of how the study is designed.

I do have a bit of a theory of where theories come from, and that is from anomalies: in a statistical sense, predictions from an existing model that do not make sense or that contradict the data. We discuss this in chapter 6 of BDA, and Cosma Shalizi and I frame it from a philosophical perspective in our philosophy paper. Or, for a completely non-technical, “humanistic,” take, see my paper with Thomas Basbøll on the idea that good stories are anomalous and immutable.

The idea is that we have a tentative model of the world, and we push that model, and gather data, and find problems with the model, and it is the anomalies that motivate us to go further.

The new theories themselves, though, where do they come from? That’s another question. It seems to me that new theories often come via analogies from other fields (or from other subfields, within physics, for example). At this point I think I should supply some examples but I don’t quite have the energy.

My real point is that sometimes it does seem like science advances by guessing, no? At least, retrospectively, it seems like Bohr, Dirac, etc., just kept guessing different formulations and equations and pushing them forward and getting results. Or, to put it another way, these guys did do “deep investigation of the content and the apparent contradictions of previous empirically successful theories.” But then they guessed too. But their guesses were highly structured, highly constrained. The guesses of Dirac etc. were mathematically sophisticated, not the sort of thing that some outsider could’ve come up with.

How does this relate to, say, political science or economics? I’m not sure. I do think that outsiders can make useful contributions to these fields but there does need to be some sense of the theoretical and empirical constraints.

The post “Science does not advance by guessing” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post When there’s a lot of variation, it can be a mistake to make statements about “typical” attitudes appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>1. There’s a tendency for scientific results to be framed in absolute terms (in psychology, this corresponds to general claims about the population) but that can be a mistake in that sometimes the most important part of the story is variation; and

2. Before getting to the comparisons, it can make sense to just look at the data.

Here’s the background. I came across a post by Leif Nelson, who wrote:

Recently *Science* published a paper [by Timothy Wilson, David Reinhard, Erin Westgate, Daniel Gilbert, Nicole Ellerbeck, Cheryl Hahn, Casey Brown, and Adi Shaked] concluding that people do not like sitting quietly by themselves. . . .

The reason I [Nelson] write this post is that upon analyzing the data for those studies, I arrived at an inference opposite the authors’. They write things like:

Participants typically did not enjoy spending 6 to 15 minutes in a room by themselves with nothing to do but think. (abstract)

It is surprisingly difficult to think in enjoyable ways even in the absence of competing external demands. (p.75, 2nd column)

The untutored mind does not like to be alone with itself (last phrase).

But the raw data point in the opposite direction: people reported to enjoy thinking. . . .

In the studies, people sit in a room for a while and then answer a few questions when they leave, including how enjoyable, how boring, and how entertaining the thinking period was, in 1-9 scales (anchored at 1 = “not at all”, 5 = “somewhat”, 9 = “extremely”). Across the nine studies, 663 people rated the experience of thinking, the overall mean for these three variables was M=4.94, SD=1.83 . . . Which is to say, people endorse the midpoint of the scale composite: “somewhat boring, somewhat entertaining, and somewhat enjoyable.”

Five studies had means below the midpoint, four had means above it.

I see no empirical support for the core claim that “participants typically did not enjoy spending 6 to 15 minutes in a room by themselves.”

Here are the data:

Nelson writes:

Out of 663 participants, MOST (69.6%) [or, as we would say in statistics, "70%" --- ed.] said that the experience was somewhat enjoyable or better.

If I were trying out a new manipulation and wanted to ensure that participants typically DID enjoy it, I would be satisfied with the distribution above. I would infer people typically enjoy being alone in a room with nothing to do but think.

Nelson concludes:

If readers think that the electric shock finding is interesting conditional on the (I think, erroneous) belief that it is not enjoyable to be alone in thought, then the finding is surely even more interesting if we instead take the data at face value: Some people choose to self-administer an electric shock despite enjoying sitting alone with their thoughts.

He also asked the authors if they had any comments on his reaction that their paper showed a finding opposite to what they’d claimed, and the authors sent him a reply in which they wrote that they “were continually surprised by these results” which reminded me of our earlier discussion of how to interpret surprising results.

**Reconciling the article and the criticism**

It’s a challenge to go back and forth reading the original article, Nelson’s comments, and Wilson and Gilbert’s reply. I agree with Nelson that it seems incorrect to state that people did not enjoy being alone with their thoughts, given that more than two-thirds of the people in the study reported the experience to be “somewhat enjoyable” or better. On the other hand, Wilson and Gilbert point out that “The percentage who admitted cheating [doing other activities beyond just sitting and thinking] ranged from 32% to 54% . . . 67% of men and 25% of women opted to shock themselves rather than ‘just think’ . . .”

The resolution, I think, is that we have to avoid the tendency to think deterministically. There’s variation! As shown in the above histogram, some people reported thinking to be “not at all enjoyable,” some reported it to be “somewhat enjoyable,” and there were a lot of people in the middle. Given this, it’s not so helpful to make statements about what people “typically” enjoy (as in the abstract of the paper).

Finally, let me return to my original point about respecting the data. In their reply, Wilson and Gilbert write, “we believe the preponderance of the evidence does not favor Professor Nelson’s claim that most people in our studies enjoyed thinking.” Looking at the above graph, it all seems to depend on how you categorize the “somewhat enjoyable” response.

Perhaps it’s most accurate to say that (a) two-thirds of respondents find thinking to be at least somewhat enjoyable, and, at the same time, (b) two-thirds of respondents find thinking to be no more than somewhat enjoyable! The glass is both two-thirds empty (according to Wilson et al.) and two-thirds full (according to Nelson).
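The glass-both-ways point can be seen in a quick sketch (the ratings below are made up for illustration, not the study’s raw data): on a 1-9 scale with midpoint 5, the share at or above “somewhat” and the share at or below “somewhat” overlap, so both can exceed 50% at once.

```python
# Hypothetical composite enjoyment ratings on the 1-9 scale (midpoint = 5).
ratings = [3, 4, 5, 5, 5, 6, 6, 7, 8]

# Share who found the experience at least "somewhat enjoyable" (>= 5) ...
at_least_somewhat = sum(r >= 5 for r in ratings) / len(ratings)
# ... and share who found it no more than "somewhat enjoyable" (<= 5).
at_most_somewhat = sum(r <= 5 for r in ratings) / len(ratings)

print(f"at least somewhat enjoyable: {at_least_somewhat:.0%}")   # 78%
print(f"no more than somewhat enjoyable: {at_most_somewhat:.0%}")  # 56%
# Both shares include the people who answered exactly 5, so both can top 50%.
```

The same trick works on the real histogram: whether the finding reads as "most enjoyed it" or "most didn't" depends entirely on which side of "somewhat" you assign the midpoint responses.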

**P.S.** Nelson credits the paper to Science, the journal where it is published. I think it’s more appropriate to credit the authors, so I’ve done it that way (see brackets in the first paragraph of quoted material above). The authors are the ones who do the work; the journal is just a vessel where it is published.

**P.P.S.** Zach wins the thread with this comment:

I enjoy thinking, but I can do that any time. Put me in a room with a way to safely shock myself and I’ll take the opportunity to experiment.


The post Rational != Self-interested appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’ve said it before (along with Aaron Edlin and Noah Kaplan) and I’ll say it again. Rationality and self-interest are two dimensions of behavior. An action can be:

1. Rational and self-interested

2. Irrational and self-interested

3. Rational and altruistic

4. Irrational and altruistic.

It’s easy enough to come up with examples of all of these.

Before going on, let me just quickly deal with three issues that sometimes come up:

– Yes, these are really continuous scales, not binary.

– Sure, you can tautologically define all behavior as “rational” in that everything is done for some reason. But such an all-encompassing definition is not particularly interesting, as it drains all meaning from the term.

– Similarly, if you want you can tautologically define all behavior as self-interested, in the sense that if you do something nice for others that does not benefit yourself (for example, donate a kidney to some stranger), you must be doing it because you want to, so that’s self-interested. But, as I wrote a few years ago, the challenge in all such arguments is to avoid circularity. If selfishness means maximizing utility, and if we always maximize utility (by definition, otherwise it isn’t our utility, right?), then we’re always selfish. But then that’s like, if everything in the world is the color red, would we have a word for “red” at all? I’m using self-interested in the more usual sense of giving instrumental benefits.

To put it another way, if “selfish” means utility-maximization, which by definition is always being done (possibly to the extent of being second-order rational by rationally deciding not to spend the time to exactly optimize our utility function), then everything is selfish. Then let’s define a new term, “selfish2,” to represent behavior that benefits ourselves instrumentally without concern for the happiness of others. Then my point is that rationality is not the same as selfish2.

**What’s new here?**

The above is all background. It came to mind after I read this recent post by Rajiv Sethi regarding agent-based models. Sethi quotes Chris House who wrote:

The reason that economists set up their theories this way – by making assumptions about goals and then drawing conclusions about behavior – is that they are following in the central tradition of all of economics, namely that allocations and decisions and choices are guided by self-interest. This goes all the way back to Adam Smith and it’s the organizing philosophy of all economics. Decisions and actions in such an environment are all made with an eye towards achieving some goal or some objective. For consumers this is typically utility maximization – a purely subjective assessment of well-being. For firms, the objective is typically profit maximization. This is exactly where rationality enters into economics. Rationality means that the “agents” that inhabit an economic system make choices based on their own preferences.

No no no no no. Self-interest is the end, rationality is the means. You can pursue non-self-interested goals in rational or irrational ways, and you can pursue self-interested goals in rational or irrational ways.

Sethi’s post is about the relevance of agent-based models to the study of economics and finance, and is worth reading on its own terms. But it also reminds me of the general point that we should not conflate rationality with self-interest. I can see the appeal of such a confusion, as it seems to be associated with a seemingly hard-headed, objective view of the world. But really it’s an oversimplification that can lead to lots of confusion.

P.S. House’s blog is subtitled, “Economics, chess and anything else on my mind.” This got me interested so I entered “chess” into the search box but all that came out was this, which isn’t about chess at all. So that was a disappointment.

P.P.S. Some commenters asked for examples so I added some in comments. I’ll repeat them here.

First, the real-life example:

Some students in my class are designing and building a program to display inferences from Stan. They are focused on others’ preferences; they want to make a program that works for others, for the various populations of users out there. And they are trying to achieve this goal in a rational way.

Second, the quick examples:

Rational and self-interested: investing one’s personal money in index funds based on a judgment that this is the savvy way to long-term financial reward.

Rational and non-self-interested: donating thousands of dollars to a charity recommended by GiveWell.

Irrational and self-interested: day trading based on tips you find on sucker-oriented websites and gradually losing your assets in commission fees.

Irrational and non-self-interested: rushing into a burning building, risking your life to save your pet goldfish that was gonna die in 2 days anyway.

You could argue about the details of any of these examples but the point is that rationality is about the means and self-interest is about the ends.


The post “We have used Stan to study dead dolphins” appeared first on Statistical Modeling, Causal Inference, and Social Science.

In response to our call for references to successful research using Stan, Matthieu Authier points us to this:

@article{
  year={2014},
  journal={Biodiversity and Conservation},
  volume={23},
  number={10},
  doi={10.1007/s10531-014-0741-3},
  title={How much are stranding records affected by variation in reporting rates? A case study of small delphinids in the Bay of Biscay},
  url={http://dx.doi.org/10.1007/s10531-014-0741-3},
  keywords={Monitoring; Marine mammal; Strandings},
  author={Authier, Matthieu and Peltier, Hélène and Dorémus, Ghislain and Dabin, Willy and Van Canneyt, Olivier and Ridoux, Vincent},
  pages={2591-2612},
}

Next stop, flying squirrels!


The post “Regular Customer: It was so much easier when I was a bum. I didn’t have to wake up at 4am to go to work, didn’t have all these bills and girlfriends.” appeared first on Statistical Modeling, Causal Inference, and Social Science.


**Tues:** Rational != Self-interested

**Wed:** When there’s a lot of variation, it can be a mistake to make statements about “typical” attitudes

**Thurs:** “Science does not advance by guessing”

**Fri:** When am I a conservative and when am I a liberal (when it comes to statistics, that is)?

**Sat:** Science tells us that fast food lovers are more likely to marry other fast food lovers

**Sun:** 10th anniversary of “Statistical Modeling, Causal Inference, and Social Science”

The post On deck this month appeared first on Statistical Modeling, Causal Inference, and Social Science.

“Regular Customer: It was so much easier when I was a bum. I didn’t have to wake up at 4am to go to work, didn’t have all these bills and girlfriends.”

Rational != Self-interested

When there’s a lot of variation, it can be a mistake to make statements about “typical” attitudes

“Science does not advance by guessing”

When am I a conservative and when am I a liberal (when it comes to statistics, that is)?

Science tells us that fast food lovers are more likely to marry other fast food lovers

10th anniversary of “Statistical Modeling, Causal Inference, and Social Science”

In one of life’s horrible ironies, I wrote a paper “Why we (usually) don’t have to worry about multiple comparisons” but now I spend lots of time worrying about multiple comparisons

The Fault in Our Stars: It’s even worse than they say

Buggy-whip update

The inclination to deny all variation

Hoe noem je?

“Your Paper Makes SSRN Top Ten List”

The Fallacy of Placing Confidence in Confidence Intervals

Try a spaghetti plot

I ain’t got no watch and you keep asking me what time it is

Some questions from our Ph.D. statistics qualifying exam

Solution to the helicopter design problem

Solution to the problem on the distribution of p-values

Solution to the sample-allocation problem

A key part of statistical thinking is to use additive rather than Boolean models

Yes, I’ll help people for free but not like this!

I love it when I can respond to a question with a single link

Boo! Who’s afraid of availability bias?

That last one is a special Halloween-themed post. I hope you enjoy it.

The post On deck this month appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Anova is great—if you interpret it as a way of structuring a model, not if you focus on F tests appeared first on Statistical Modeling, Causal Inference, and Social Science.

I saw on your blog post that you listed aggregation as one of the desirable things to do. Do you agree with the following argument? I want to point out a problem with repeated measures ANOVA in talk:

In a planned experiment, say a 2×2 design, when we do a repeated measures ANOVA, we aggregate all responses by subject for each condition. This actually leads us to underestimate the variability within subjects. The better way is to use linear mixed models (even in balanced designs) because they allow us to stay faithful to the experiment design and to describe how we think the data were generated.

The issue is that in a major recent paper the authors did an ANOVA after they fail to get statistical significance with lmer. Even ignoring the cheating and p-value chasing aspect of it, I think that using ANOVA is statistically problematic for the above reason alone.

My response: Yes, this is consistent with what I say in my 2005 Anova paper, I think. But I consider that sort of hierarchical model to be a (modern version of) Anova. As a side note, classical Anova is kinda weird because it is mostly based on point estimates of variance parameters. But classical textbook examples are typically on the scale of 5×5 datasets, and in these cases the estimated variances are very noisy.
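The correspondent’s aggregation point can be seen in a small simulation (the numbers below are hypothetical, chosen just to show the mechanism): averaging each subject’s repeated trials before the analysis, as in a repeated-measures ANOVA, makes the data look far less noisy than they are at the trial level.

```python
import numpy as np

# Simulated repeated-measures data: 20 subjects, 40 trials each.
# Each observation = a subject effect plus trial-to-trial noise.
rng = np.random.default_rng(0)
n_subj, n_trials = 20, 40
subj_effect = rng.normal(0, 1, size=(n_subj, 1))                   # between-subject SD = 1
trials = subj_effect + rng.normal(0, 2, size=(n_subj, n_trials))   # trial-noise SD = 2

trial_level_sd = trials.std()               # variability in the raw data
aggregated_sd = trials.mean(axis=1).std()   # variability after averaging per subject

print(f"trial-level SD: {trial_level_sd:.2f}")
print(f"aggregated SD:  {aggregated_sd:.2f}")
# Averaging 40 trials shrinks the noise component's SD by roughly sqrt(40),
# so the aggregated means understate the within-subject variability.
```

A mixed model fit to the trial-level data keeps both variance components in the model instead of averaging one of them away, which is the sense in which it stays faithful to how the data were generated.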


The post Carrie McLaren was way out in front of the anti-Gladwell bandwagon appeared first on Statistical Modeling, Causal Inference, and Social Science.


The post 65% of principals say that at least 30% of students . . . wha?? appeared first on Statistical Modeling, Causal Inference, and Social Science.

The OECD put out a report drawing on their PISA and TALIS data:

http://oecdeducationtoday.blogspot.ie/2014/07/poverty-and-perception-of-poverty-how.html

I notice that it’s already attracted an NY Times op-ed by David Leonhardt:

http://www.nytimes.com/2014/07/23/upshot/principals-in-us-are-more-likely-to-consider-their-students-poor.html

There are a number of things I find strange in its analysis and interpretation but, for starters, there’s the horizontal axis in the chart that’s reproduced in both the original and the NYT piece. As best I can tell the data is actually drawn from Table 2.4A here:

http://www.keepeek.com/Digital-Asset-Management/oecd/education/talis-2013-results_9789264196261-en#page43

So what’s actually being measured for each country is “the percentage of teachers working in schools whose principals estimated that 30% or more of their pupils came from socioeconomically disadvantaged homes”. Then what’s initially interesting in the discussion is how the measures on the vertical axis (a supposedly “objective” measure of disadvantage used in the PISA survey) differ from those on the horizontal, i.e. looking at points that lie significantly above or below the diagonal. So Brazil and the US are obvious outliers, although Singapore, Serbia and Croatia are by a proportional measure also fairly notable. So what caught me first is that this measure is obviously affected by the distribution of disadvantage across schools, e.g. if disadvantage (PISA-measure) is concentrated spatially then you can get a high score on the horizontal axis without having a correspondingly high score on the vertical axis. A highly skewed distribution of school size will also affect things (as I guess will a skewed distribution of teachers, but presumably that’s highly correlated with school size).

The discussion on the third dimension, shown in the bubbles, also seems to me to be dubious, but that’s more complicated.

I don’t really have anything to say on this except that I agree these numbers are hard to interpret.
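One way to see the correspondent’s point about the threshold measure is with a toy calculation (the numbers are invented for illustration): the same overall disadvantage rate can put nearly all schools over a 30% cutoff, or only a few, depending on how the disadvantage is distributed across schools.

```python
# Share of disadvantaged pupils in each of 10 hypothetical schools.
# Both scenarios have the same overall rate (32%).
even_spread = [0.32] * 10                # disadvantage spread evenly
concentrated = [0.80] * 4 + [0.00] * 6   # disadvantage concentrated in 4 schools

def share_over(rates, cut):
    """Fraction of schools at or above the threshold."""
    return sum(r >= cut for r in rates) / len(rates)

print(share_over(even_spread, 0.30))   # 1.0 -> every principal reports >= 30%
print(share_over(concentrated, 0.30))  # 0.4 -> only 4 of 10 do
```

So the horizontal axis of the chart mixes together the overall level of disadvantage and its spatial distribution (plus school size and teacher allocation), which makes cross-country comparisons of the raw threshold measure hard to interpret.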


The post Rss move appeared first on Statistical Modeling, Causal Inference, and Social Science.


The post International Journal of Epidemiology versus Hivemind and the Datagoround appeared first on Statistical Modeling, Causal Inference, and Social Science.

The Hivemind wins (see the comment thread here, which is full of detective work from various commenters).

As I wrote as a postscript to that earlier post, maybe we should call this the “stone soup” or “Bem” phenomenon, when a highly flawed work stimulates interesting, thoughtful discussion.


The post In defense of stories and classroom activities, from a resubmission letter from 1999 appeared first on Statistical Modeling, Causal Inference, and Social Science.

Here’s a part of the letter, a response to some questions of one of the reviewers:

With regard to the comment that “You present absolutely no evidence that any of these demonstration methods is actually helpful. For at least a couple of these demonstrations you need to collect data to see if your tools are helping in understanding the concept. I will let you worry about how to measure this but this is a must”:

Of course, your statement is true, but consider the alternative, which is to do examples like this on the blackboard. We haven’t seen Moore & McCabe or Mosteller or anyone else conducting experiments to show that class-participation demos are _not_ better than straight lectures. And, given this state of uncertainty, we think that it’s useful to consider this alternative approach to teaching this material.

We agree that it would be a good idea for someone to collect data on the effectiveness of various teaching approaches. As all are well aware, this is a potentially huge research project. In the meantime, we think that presenting a bunch of demos in an easy-to-use format is potentially a major contribution. Our feeling is that a paper like this should have either (a) some really cool stuff that people can go out and use right away, or (b) some perhaps-boring stuff but with some evidence that it “works” (e.g., studies showing that students learn better when they work in groups). We think that there is room in the literature for papers like ours of type (a) and also other papers of type (b).

You might also notice that all the papers of the form, “A new proof of the central limit theorem” or whatever, never seem to have evidence of whether they are effective in class. Why? Because it seems evident that if such a new proof can increase statistical understanding, then it’s a good thing and can in some way be usefully integrated into a course. We think this is similar with the demos in our paper: they are ultimately about increasing understanding by focusing on the fact that statistics is, in reality, a participatory process with many actors. This is a deep truth which is obscured when a professor merely does blackboard material. (We have added this point in the conclusion to our article.)

. . .

Finally, the referee writes, “I think this paper needs more work so that it is not just a set of interesting stories.” Actually, I think that interesting stories (with useful directions) is not a bad thing. I wouldn’t want all the Teacher’s corner articles to be like that, but the occasional such article, if of high quality, is a contribution, I believe, in that people might actually read the article and use it to improve their teaching.

I continue to hold and express this pluralistic attitude toward research and publication.


The post Can anyone guess what went wrong here? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Dear Professor Gelman:

The editor of ** asked me to write to see if you would be willing to review MS ** entitled

**

We are hoping for a review within the next 2-3 weeks if possible. I would appreciate if you confirm whether you are willing to advise me on this by clicking on the url below

**

This site will also not only allow you to choose an alternative due date, but also to suggest alternative referees if you are unable to review.

If you choose to review the manuscript you can upload your report and cover letter via our secure online form at

**

This is a secure form and your report will be transmitted anonymously. You should supply either the title or the MS number, **, to ensure that your report is properly filed.

Thanks for your assistance. I very much value your advice.

Sincerely,

I’ve omitted identifying details as there’s no point in embarrassing the journal editor. We all make mistakes, and this is not a big one.

Anyway, here’s the riddle: What was horribly wrong about the above email?

And here’s a hint: There’s no way you can figure out the problem merely from what I’ve sent you above. You’ll have to guess.

And another hint: The email came from a legitimate journal, not one of those “predatory” or spam journals.

I’ll give the answer tomorrow, but I’m guessing some of you will figure this out right away.

**P.S.** OK, OK, you win. Everybody guessed it already (see comments). I guess this puzzle was too easy.


The post Are Ivy League schools overrated? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I [Palko] am in broad agreement with this New Republic article by William Deresiewicz [entitled "Don't Send Your Kid to the Ivy League: The nation's top colleges are turning our kids into zombies"] and I’ll try to blog on it if I can get caught up with more topical threads. I was particularly interested in the part about there being a “non-aggression pact” outside of the sciences.

This fits in with something I’ve noticed. I know this sounds harsh, but when I run across someone who is at the top of their profession and yet seems woefully underwhelming, they often have Ivy League BAs in non-demanding majors (For example, Jeff Zucker, Harvard, History. John Tierney, Yale, American Studies). My working hypothesis is that, while everyone who graduates from an elite school has an advantage in terms of reputation and networks, the actual difficulty of completing certain degrees isn’t that high relative to non-elite schools. Thus a history degree from Harvard isn’t worth that much more than a history degree from a Cal State school.

And David Brooks graduated from the University of Chicago with a degree in history . . .

In all seriousness, I don’t know if I agree with the claim in the headline of that article Palko links to.

I was very impressed by some of the Harvard undergrads I taught. Then again, they were statistics majors. In the old days, statistics might have been considered the soft option compared to math, but I don’t think that’s the case anymore. If anything, math majors are sometimes the sleepwalkers who happened to be good at math in school and never thought of stepping off the track. Anyway, it’s hard for me to make any general statements considering that I don’t teach many undergrads at all at Columbia.

Palko responded:

Yeah, I don’t want to put down Harvard grads, even the history majors. I’m sure that a disproportionate number of the brightest, most promising young historians are working on Harvard B.A.s. What’s more, I suspect most of them are developing valuable relationships with some of the most important names in their field.

What I’m wondering about is the popular notion that Ivy League schools are hard to get into and hard to get through. The first part is certainly true and the second appears to be true for STEM (which also has an additional self-selection bias). I’m just not sure it holds for all fields.

I don’t think there’s any question that selection bias, networking opportunities and halo effects play a large role here. What if they account for most of the benefit of attending an elite school for most students? This is worrisome from both sides: students are twisting themselves into knots to meet artificial and frankly somewhat odd selection criteria; and we’re giving the students who meet these odd criteria huge advantages in terms of wealth, career, and influence.

That can’t be good.


The post No, I didn’t say that! appeared first on Statistical Modeling, Causal Inference, and Social Science.

That’s fine, and Flam captured the general “affect” of our discussion—the idea that Bayes allows the use of prior information, and that p-values can’t be taken at face value. As I discuss below, I like Flam’s article, I’m glad it’s out there, and I’m grateful that she took the time to get my perspective.

Unfortunately, though, some of the details got garbled.

Flam never put quotation marks around anything I said, and I know that with journalism there isn’t always time to check every paragraph. After I saw the article online I pointed out the mistakes and Flam asked the NYT editors to correct them so I hope this will be done soon.

In the meantime, I’ll post the corrections here.

In the article, it says:

But there’s a danger in this [p-value] tradition, said Andrew Gelman, a statistics professor at Columbia. Even if scientists always did the calculations correctly — and they don’t, he argues — accepting everything with a p-value of 5 percent means that one in 20 “statistically significant” results are nothing but random noise.

No no no no no. I recommended correcting as follows:

But there’s a danger in this tradition, said Andrew Gelman, a statistics professor at Columbia. Even if scientists always did the calculations correctly — and they don’t, he argues — accepting everything with a p-value of 5 percent can lead to spurious findings—cases where an observed “statistically significant” pattern in data does not reflect a corresponding pattern in the population—far more than 5 percent of the time. The weaker the signal and the noisier the measurements, the more likely that a pattern, even if statistically significant, will not replicate.

To the outsider this might sound almost the same, but on a technical level it makes a big difference!
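The technical difference shows up in a quick simulation (the 90/10 mix of null and real effects, and the effect size, are invented here just to show the mechanism): when true effects are rare and weak, far more than 5 percent of the results clearing p < .05 are pure noise.

```python
import numpy as np

# Simulate 100,000 studies: 90% test a truly zero effect, 10% a weak real one.
rng = np.random.default_rng(1)
n = 100_000
is_null = rng.random(n) < 0.9
true_effect = np.where(is_null, 0.0, 1.0)   # real effects are 1 SE unit (low power)
z = true_effect + rng.normal(size=n)        # observed z-score
significant = np.abs(z) > 1.96              # "statistically significant" at p < .05

# Among the significant results, what share came from null effects?
false_share = is_null[significant].mean()
print(f"share of significant results that are noise: {false_share:.0%}")
```

Even though the test's error rate is 5% per null hypothesis, the noise studies vastly outnumber the low-powered real ones among the "winners," which is the sense in which accepting everything at p < .05 leads to spurious findings far more than 5 percent of the time.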

The article then says that I say:

The proportion of wrong results published in prominent journals is probably even higher

I would change this to:

This could well be an even bigger problem with prominent journals

Later the article refers to the notorious fecundity-and-voting study and says:

Dr. Gelman re-evaluated the study using Bayesian statistics. That allowed him to look at probability not simply as a matter of results and sample sizes, but in the light of other information that could affect those results.

He factored in data showing that people rarely change their voting preference over an election cycle, let alone a menstrual cycle. When he did, the study’s statistical significance evaporated.

This is not correct. I did not re-evaluate the study using Bayesian methods, nor did I claim to have done so.

Here’s my suggested revision:

Dr. Gelman felt this result was not consistent with polling data showing that people rarely change their voting preference over an election cycle, let alone a menstrual cycle. And after accounting for the many different analyses that could have been performed on the data, the study’s statistical significance evaporated.

Finally, the article writes of me:

He suggests using Bayesian calculations not necessarily to replace classical statistics but to flag spurious results.

I wouldn’t quite put it that way! I prefer:

He says that in such studies there is strong prior information, which can be included using Bayesian methods or in other ways.

**Putting it into perspective**

I suppose journalists find it difficult to deal with academics because we’re so picky. As I noted above, I think the article captured the general sense of what I was saying, and overall I like the article; I like how Flam quoted people who had varying perspectives; I think it’s important for people to see statistics as a pluralistic field with different tools for solving different problems.

But I do think the details matter (and I certainly don’t want people to think I said things I didn’t say, or that I did things I didn’t do) so I hope the corrections can be made soon. And I stand by the larger point that lots of bad stuff happens when people think that “statistically significant” + “vague theory” = truth. I can’t say that I’m *surprised* that Kristina Durante, the author of the fecundity-and-voting study, stands by those claims, but I think it’s too bad. The point is not that there’s anything horrible about Durante (a person whom I’ve never met), nor do I know of anything horrible about Daryl Bem, etc., but that there is widespread confusion about how to do statistics, especially when studying small effects in the presence of large measurement errors (that’s one of the things I discuss in my above-cited article), and I’m glad to get these concerns out there, as precisely as is possible within the format of a newspaper article.

In any case, this’ll be an excellent example for my statistical communication class!

**P.S.** I also just noticed this bit from the article:

The essence of the frequentist technique is to apply probability to data. If you suspect your friend has a weighted coin, for example, and you observe that it came up heads nine times out of 10, a frequentist would calculate the probability of getting such a result with an unweighted coin. The answer (about 1 percent) is not a direct measure of the probability that the coin is weighted; it’s a measure of how improbable the nine-in-10 result is — a piece of information that can be useful in investigating your suspicion.

By contrast, Bayesian calculations go straight for the probability of the hypothesis, factoring in not just the data from the coin-toss experiment but any other relevant information — including whether you’ve previously seen your friend use a weighted coin.
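Before getting to what's wrong with this example: the "about 1 percent" figure quoted above is just a binomial tail probability, which a couple of lines of Python (used here purely for illustration) can confirm:

```python
from math import comb

# Probability of 9 or more heads in 10 flips of a fair coin
p = sum(comb(10, k) for k in (9, 10)) / 2**10
print(p)  # 11/1024, approximately 0.0107 -- "about 1 percent"
```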

No!!!!!!!!!!!!!! Weighting a coin does not (appreciably) affect the probability that a coin lands heads. You can load a die but you can’t bias a coin. Yes, with practice you can *throw* a coin (weighted or otherwise) to generally land heads or tails, but, no, there is no such thing as a weighted coin which has an appreciably greater than 50% chance of generally landing heads. No big deal but this is one of my pet peeves. Also, beyond the flaws in this particular example, I don’t think it’s a good representation of science, in that the point to me is not to distinguish fair from unfair coins (equivalently, to distinguish randomness from non-randomness) but rather to understand the many real patterns in the world, which are not purely random but can be buried in noise if we’re not careful, hence motivating noise-reduction efforts such as this, with Sharad Goel, David Rothschild, and Doug Rivers. (And my point there was not to promote that work but to illustrate my general point with an example.)

The post No, I didn’t say that! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Some general principles of Bayesian data analysis, arising from a Stan analysis of John Lee Anderson’s height appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>God is in every leaf of every tree. The leaf in question today is the height of journalist and Twitter aficionado Jon Lee Anderson, a man who got some attention a couple years ago after disparaging some dude for having too high a tweets-to-followers ratio. Anderson called the other guy a “little twerp” which made me wonder if he (Anderson) suffered from “tall person syndrome,” that problem that some people of above-average height have, that they think they’re more important than other people because they literally look down on them.

After I raised this issue, a blog commenter named Gary posted an estimate of Anderson’s height using information available on the internet:

Based on this picture: http://farm3.static.flickr.com/2235/1640569735_05337bb974.jpg he appears to be fairly tall. But the perspective makes it hard to judge.

Based on this picture: http://www.catalinagarcia.com/cata/Libraries/BLOG_Images/Cata_w_Jon_Lee_Anderson.sflb.ashx he appears to be about 9-10 inches taller than Catalina Garcia.

But how tall is Catalina Garcia? Not that tall – she’s shorter than the high-wire artist Phillipe Petit http://www.catalinagarcia.com/cata/Libraries/BLOG_Images/Cata_w_Philippe_Petite.sflb.ashx. And he doesn’t appear to be that tall… about the same height as Claire Danes: http://cdn.theatermania.com/photo-gallery/Petit_Danes_Daldry_2421_4700.jpg – who according to Google is 5′ 6″.

So if Jon Lee Anderson is 10″ taller than Catalina Garcia, who is 2″ shorter than Philippe Petit, who is the same height as Claire Danes, then he is 6′ 2″ tall.

I have no idea who Catalina Garcia is, but she makes a decent ruler.
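Gary's chain of comparisons is simple arithmetic, and it's worth checking. A quick sketch (Python for illustration only; heights in inches, point estimates taken straight from his comment):

```python
claire = 66                # Claire Danes, per Google: 5'6"
philippe = claire          # Petit looks about the same height as Danes
catalina = philippe - 2    # Garcia appears about 2" shorter than Petit
jon = catalina + 10        # Anderson appears about 10" taller than Garcia
print(jon)  # 74 inches, i.e. 6'2"
```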

I happened to run across that comment the other day (when searching the blog for Tom Scocca) and it inspired me to put out a call for the above analysis to be implemented in Stan. A couple of other faithful commenters (Andrew Whalen and Daniel Lakeland) did this. But I wasn’t quite satisfied with either of those efforts (sorry, I’m picky, what can I say? You must’ve known this going in). So I just did it myself.

**Modeling**

Before getting to my model, let me emphasize that nothing fancy is going on. I’m pretty much just translating Gary’s above comment into statistical notation.

Here’s my Stan program:

```
transformed data {
  real mu_men;
  real mu_women;
  real sigma_men;
  real sigma_women;
  mu_men <- 69.1;
  mu_women <- 63.7;
  sigma_men <- 2.9;
  sigma_women <- 2.7;
}
parameters {
  real Jon;
  real Catalina;
  real Phillipe;
  real Claire;
  real<lower=0,upper=1> Jon_shoe_1;
  real<lower=0,upper=4> Catalina_shoe_1;
  real<lower=0,upper=4> Catalina_shoe_2;
  real<lower=0,upper=1> Phillipe_shoe_1;
  real<lower=0,upper=1> Phillipe_shoe_2;
  real<lower=0,upper=4> Claire_shoe_1;
}
model {
  Jon ~ normal(mu_men, sigma_men);
  Catalina ~ normal(mu_women, sigma_women);
  Phillipe ~ normal(mu_men, sigma_men);
  Claire ~ normal(66, 1);
  (Jon + Jon_shoe_1) - (Catalina + Catalina_shoe_1) ~ normal(9.5, 1.5);
  (Catalina + Catalina_shoe_2) - (Phillipe + Phillipe_shoe_1) ~ normal(2, 1);
  (Phillipe + Phillipe_shoe_2) - (Claire + Claire_shoe_1) ~ normal(0, 1);
  Jon_shoe_1 ~ beta(2, 2);
  Catalina_shoe_1 / 4 ~ beta(2, 2);
  Catalina_shoe_2 / 4 ~ beta(2, 2);
  Phillipe_shoe_1 ~ beta(2, 2);
  Phillipe_shoe_2 ~ beta(2, 2);
  Claire_shoe_1 / 4 ~ beta(2, 2);
}
```

I’ll present the results in a moment, but first here’s a quick discussion of some of the choices that went into the model:

- I got the population distributions of heights of men and women from a 1992 article in the journal Risk Analysis, “Bivariate distributions for height and weight of men and women in the United States,” by J. Brainard and D. E. Burmaster, which is the reference that Deb Nolan and I used for the heights distribution in our book on Teaching Statistics.

- I assumed that men’s shoe heights were between 0 and 1 inches, and that women’s shoe heights were between 0 and 4 inches, in all cases using a beta(2,2) distribution to model the distribution. This is a hack in so many ways (for one thing, nobody in these pictures is barefoot so 0 isn’t the right lower bound; for another, some men do wear elevator shoes and boots with pretty high heels) but, as always, ya gotta start somewhere.

- I took the height comparisons as stated in Gary’s comment, giving a standard deviation of 1 inch for each, except that I gave a standard deviation of 1.5 inches for the “9 or 10 inches” comparison between Jon and Catalina, since that seemed like a tougher call.

- Based on the statement that Claire was 66 inches tall, I gave her a prior of 66 with a standard deviation of 1.

**Inference**

I saved the stan program as “heights.stan” and ran it from R:

```
heights <- stan_run("heights.stan", chains=4, iter=1000)
print(heights)
```

```
                 mean se_mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Jon              74.3     0.1 1.8  70.6  73.1  74.3  75.5  77.9   877    1
Catalina         65.1     0.1 1.5  62.3  64.1  65.1  66.2  67.9   754    1
Phillipe         66.1     0.0 1.3  63.7  65.2  66.1  67.0  68.6   813    1
Claire           65.5     0.0 1.0  63.8  64.9  65.5  66.1  67.5  1162    1
Jon_shoe_1        0.5     0.0 0.2   0.1   0.4   0.5   0.7   0.9  1658    1
Catalina_shoe_1   1.5     0.0 0.8   0.2   0.9   1.4   2.1   3.3  1708    1
Catalina_shoe_2   2.6     0.0 0.8   0.9   2.1   2.7   3.2   3.8  1707    1
Phillipe_shoe_1   0.5     0.0 0.2   0.1   0.3   0.5   0.6   0.9  1391    1
Phillipe_shoe_2   0.5     0.0 0.2   0.1   0.4   0.5   0.7   0.9  1562    1
Claire_shoe_1     1.6     0.0 0.8   0.2   1.0   1.5   2.2   3.3  1390    1
lp__            -21.6     0.1 2.5 -27.6 -23.1 -21.2 -19.8 -17.8   749    1
```

OK, everything seems to have converged, and it looks like Jon is somewhere between 6'1" and 6'4".

Tables are ugly. Let's make some graphs:

```
sims <- extract(heights, permuted=FALSE)
mon <- monitor(sims, warmup=0)
library("arm")

png("heights1.png", height=170, width=500)
par(bg="gray90")
subset <- 1:4
coefplot(rev(mon[subset,"mean"]), rev(mon[subset,"sd"]),
         varnames=rev(dimnames(mon)[[1]][subset]),
         main="Estimated heights in inches (+/- 1 and 2 s.e.)\n",
         cex.main=1, cex.var=1, mar=c(0,4,5.1,2))
dev.off()

png("heights2.png", height=180, width=500)
par(bg="gray90")
subset <- 5:10
coefplot(rev(mon[subset,"mean"]), rev(mon[subset,"sd"]),
         varnames=rev(c("Jon", "Catalina 1", "Catalina 2",
                        "Phillipe 1", "Phillipe 2", "Claire")),
         main="Estimated shoe heights in inches (+/- 1 and 2 s.e.)\n",
         cex.main=1, cex.var=1, mar=c(0,4,5.1,2))
dev.off()
```

That is, we get two coefficient plots (heights1.png and heights2.png), one showing the estimated heights and one the estimated shoe heights.

**Model criticism**

OK, now let's do some model criticism. What's in this graph that we don't believe, that doesn't make sense?

- Most obviously, some of the intervals for shoe height go negative. But that's actually not our model, it's coming from our crude summary of inference as +/- 2 sd. If instead we used the simulated quantiles directly, this problem would not arise.

- Catalina's shoes are estimated to be taller in her second picture (the one with Phillipe) than in the first, with Jon. But that's not so unreasonable, given the pictures. If anything, perhaps the intervals overlap too much. But that is just telling us that we might have additional information from the photos that is not captured in our model.

- The inferences for everyone's heights seem pretty weak. Is it really possible that Phillipe Petit could be 5'9" tall (as is implied by the upper bound of his 95% posterior interval)? Maybe not. Again, this implies that we have additional prior information that could be incorporated into the model to make better predictions.
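That first point is easy to demonstrate with a quick simulation (a sketch, not the model in the post: draws from a skewed, bounded distribution standing in for the posterior on a shoe height). A mean ± 2 sd summary can poke below zero even though every simulated value respects the bound, while quantile-based intervals cannot:

```python
import random

random.seed(1)
# Stand-in posterior draws for a shoe height bounded in (0, 4), skewed toward 0
draws = sorted(4 * random.betavariate(1, 3) for _ in range(4000))

mean = sum(draws) / len(draws)
sd = (sum((x - mean) ** 2 for x in draws) / (len(draws) - 1)) ** 0.5
lo_sd, hi_sd = mean - 2 * sd, mean + 2 * sd   # lower end goes negative
lo_q = draws[int(0.025 * len(draws))]         # simulation quantiles stay in (0, 4)
hi_q = draws[int(0.975 * len(draws))]
```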

Fitting a model, making inferences, evaluating these inferences to see if we have additional information we could include: That's what it's all about.

**Software criticism**

Finally, let's do the same thing with our code. What went wrong during the above process:

- First off, my Stan model wasn't compiling. It was producing an error at some weird place in the middle of the program. I couldn't figure out what was going on. Then, at some point in cutting and pasting, I realized what had happened: my text editor was using a font in which lower-case-l and the number 1 were indistinguishable. And I'd accidentally switched one for the other. I changed the font and fixed the problem.

- Again Stan gave an error, this time even more mysterious:

```
Error in compileCode(f, code, language = language, verbose = verbose) :
  Compilation ERROR, function(s)/method(s) not created!
Agreeing to the Xcode/iOS license requires admin privileges, please re-run as root via sudo.
In addition: Warning message:
running command '/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB file3d6829c6b35e.cpp 2> file3d6829c6b35e.cpp.err.txt' had status 1
```

I posted the problem on stan-users and Daniel Lee replied that Apple had automatically updated Xcode and I needed to do a few clicks on my computer to activate the permissions.

- Then it ran, indeed, it ran on the first try, believe it or not!

- There were some issues with the R code. The calls to coefplot are a bit ugly, I had to do a bit of fiddling to get everything to look OK. It would be better to be able to do this directly from rstan, or at least to be able to make these plots with a bit less effort.

- Umm, that's about it. Actually the programming wasn't too bad.

**Summary**

I like Bayesian (Jaynesian) data analysis. You lay out your model step by step, and when the inferences don't seem right (either because of being in the wrong place, or being too strong, or too weak), you can go back and figure out what went wrong, or what information is available that you could throw into the model.

P.S. to Andrew Whalen and Daniel Lakeland: Don't worry, you've still earned your Stan T-shirts. Just email me with your size, and your shirts will be in the mail.

The post Some general principles of Bayesian data analysis, arising from a Stan analysis of John Lee Anderson’s height appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Are Ivy League schools overrated?

**Wed:** Can anyone guess what went wrong here?

**Thurs:** What went wrong

**Fri:** 65% of principals say that at least 30% of students . . . wha??

**Sat:** Carrie McLaren was way out in front of the anti-Gladwell bandwagon

**Sun:** Anova is great—if you interpret it as a way of structuring a model, not if you focus on F tests

The post People used to send me ugly graphs, now I get these things appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>We found a sinusoidal pattern in CMM [cutaneous malignant melanoma] risk by season of birth (P = 0.006). . . . Adjusted odds ratios for CMM by season of birth were 1.21 [95% confidence interval (CI), 1.05–1.39; P = 0.008] for spring, 1.07 (95% CI, 0.92–1.24; P = 0.40) for summer and 1.12 (95% CI, 0.96–1.29; P = 0.14) for winter, relative to fall. . . . In this large cohort study, persons born in spring had increased risk of CMM in childhood through young adulthood, suggesting that the first few months of life may be a critical period of UVR susceptibility.

Rinaldi expresses concern about multiple comparisons, along with skepticism about the hypothesis that in Sweden 2-3 months old babies get some sunshine completely naked.

**P.S.** Some of the comments below are fascinating, far more so than the original paper! Maybe we should call this the “stone soup” or “Bem” phenomenon, when a work that is fairly empty of inherent interest (and likely does not represent any real, persistent pattern) gets a lot of people thinking furiously about a topic.

The post People used to send me ugly graphs, now I get these things appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “An exact fishy test” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Here’s an example: I came up with 10 random numbers:

```
> round(.5+runif(10)*100)
 [1] 56 23 70 83 29 74 23 91 25 89
```

and entered them into Macartan’s app, which promptly responded:

Unbelievable!

You chose the numbers 56 23 70 83 29 74 23 91 25 89

But these are clearly not random numbers. We can tell because random numbers do not contain patterns but the numbers you entered show a fairly obvious pattern.

Take another look at the sequence you put in. You will see that the number of prime numbers in this sequence is: 5. But the `expected number’ from a random process is just 2.5. How odd is this pattern? Quite odd in fact. The probability that a truly random process would turn up numbers like this is just p=0.074 (i.e. less than 8%).

Try again (with really random numbers this time)!

ps: you might think that if the p value calculated above is high (for example if it is greater than 15%) that this means that the numbers you chose are not all that odd; but in fact it means that the numbers are really particularly odd since the fishy test produces p values above 15% for less than 2% of all really random numbers. For more on how to fish see here.
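The app's prime-count claim is easy to sanity-check. A sketch (Python; assuming each of the ten numbers is drawn uniformly from 1 to 100, so each is prime with probability 25/100, and taking the binomial upper tail; the app's exact p=0.074 is evidently computed a bit differently, but it lands in the same "less than 8%" territory):

```python
from math import comb

# 25 of the integers 1..100 are prime, so P(prime) = 0.25 per draw
p_prime = 0.25
n = 10
# Upper-tail probability of 5 or more primes among 10 draws
tail = sum(comb(n, k) * p_prime**k * (1 - p_prime)**(n - k)
           for k in range(5, n + 1))
print(round(tail, 4))  # 0.0781
```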

The post “An exact fishy test” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post MA206 Program Director’s Memorandum appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>A couple years ago I gave a talk at West Point. It was fun. The students are all undergraduates, and most of the instructors were just doing the job for two years or so between other assignments. The permanent faculty were focused on teaching and organizing the curriculum.

As part of my visit I sat in on an intro statistics class and did a demo for them (probably it was the candy weighing but I don’t remember). At that time I picked up an information sheet for the course: “Memorandum for Academic Year (AY) 13-02 MA206 Students, United States Military Academy.” Lots of details (as one would expect in that military-bureaucratic ways), also this list of specific objectives of the course:

1. Understanding the notion of randomness and the role of variability and sampling in making inference.

2. Apply the axioms and basic properties of probability and conditional probability to quantify the likelihood of events.

3. Employ models using discrete or continuous random variables to answer basic probability questions.

4. Be able to draw appropriate conclusions from confidence intervals.

5. Construct hypothesis tests and draw appropriate conclusions from p-values.

6. Apply and assess linear regression models for point estimation and association between explanatory and dependent variables.

7. Critically evaluate statistical arguments in print media and scientific journals.

This is all ok except for items 4 and 5, I suppose.

Also, at the end, a list of rules, beginning with:

a. All cadets are expected to maintain proper military bearing and appearance during instruction in accordance with appropriate regulations.

b. Respect others in the classroom – No profanity, unprofessional jokes, or unprofessional computer items . . .

e. Jackets are not permitted in the classroom . . .

g. Drinks must be inside a closed container (plastic bottle with a top, for example) or in the Dean-approved mug . . .

and ending with this:

j. Rules common to blackboards, written work, and examinations:

1) Draw and label figures or graphs when appropriate.

2) Report numerical answers using the appropriate number of significant digits and units of measure.

Now those are some rules I can get behind. They should be part of every statistics honor code.

The post MA206 Program Director’s Memorandum appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Free Stan T-shirt to the first “little twerp” who does a (good) Bayesian analysis of Jon Lee Anderson’s height appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’d like to see a Stan implementation of the analysis presented in this comment by Gary from a year and a half ago.

The post Free Stan T-shirt to the first “little twerp” who does a (good) Bayesian analysis of Jon Lee Anderson’s height appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Derek Jeter was OK” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Tom Scocca files a bizarrely sane column summarizing the famous shortstop’s accomplishments:

Derek Jeter was an OK ballplayer. He was pretty good at playing baseball, overall, and he did it for a pretty long time. . . . You have to be good at baseball to last 20 seasons in the major leagues. . . . He was a successful batter in productive lineups for many years. . . . He was not Ted Williams or Rickey Henderson. Spectators did not come away from seeing Derek Jeter marveling at the stupendous, unimaginable feats of hitting they had seen. But he did lots and lots of damage. He got many big hits and contributed to many big rallies. Pitchers would have preferred not to have to pitch to him. . . . His considerable athletic abilities allowed him to sometimes make spectacular leaping and twisting plays on misjudged balls that better shortstops would have played routinely. People enjoyed watching him make those plays, and that enjoyment led to his winning five Gold Gloves. That misplaced acclaim, in turn, helped spur more advanced analysis of defensive play in baseball, a body of knowledge which will ensure that no one ever again will be able to play shortstop as badly as Jeter for as long as he did. And that gave fans something to argue about, which is an important part of sports.

Scocca keeps going in this vein:

Regardless, on balance, Jeter’s good hitting helped his team more than his bad fielding hurt it. The statistical ledger says so—by Wins Above Replacement, according to Baseball Reference, his glovework drops him from being the 20th most productive position player of all time to the 58th. Having the 58th most productive career among non-pitchers in major-league history is still a solid achievement.

And still more:

When [Alex] Rodriguez showed up in the Bronx, Jeter would not yield the job. It was a selfish decision and the situation hurt the team. But powerful egos, misplaced competitiveness, and unrealistic self-appraisals are common features in elite athletes. Whatever wrong Jeter may have done in the intrasquad rivalry, it was the Yankees’ fault for not managing him better.

And this:

Like most star athletes of his era, he kept his public persona intentionally blank and dull . . . Depending on their allegiances, baseball fans could imagine him to be classy or imagine him to be pissy, and the limited evidence could support either conclusion.

I love this Scocca post because its hilariousness (which is intentional, I believe) is entirely contingent on its context. Sportswriting is so full of hype (either of the “Jeter is a hero” variety or the “Jeter’s no big whoop” variety or the “Hey, look at my cool sabermetrics” variety or the “Hey, look at what a humanist I am” variety) that it just comes off (to me) as flat-out funny to see a column that just plays it completely straight, a series of declarative sentences that tell it like it is.

Of course, if all the sportswriters wrote like this, it would be boring. But as long as all the others feel they need some sort of angle, this pitch-it-down-the-middle style will work just fine. The confounding of expectations and all that.

P.S. Also this from a commenter to Scocca’s post:

He also inspired people to like baseball again after the lockout and didn’t juice.

The post “Derek Jeter was OK” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Waic for time series appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’m currently working on a model comparison paper using WAIC, and would like to ask you the following question about the WAIC computation:

I have data of one participant that consists of 100 sequential choices (you can think of these data as being a time series). I want to compute the WAIC for these data. Now I’m wondering how I should compute the predictive density. I think there are two possibilities:

(1) I compute the predictive density of the whole sequence (i.e., I consider the whole sequence as one data point, which means that n=1 in Equations (11) – (12) of your 2013 Stat Comput paper.)

(2) I compute the predictive density for each choice (i.e., I consider each choice as one data point, which means that n=# choices in Equations (11) – (12) of your 2013 Stat Comput paper.)

My quick thought was that Waic is an approximation to leave-one-out cross-validation and this computation gets more complicated with correlated data.

But I passed the question on to Aki, the real expert on this stuff. Aki wrote:

This is an interesting question and there is no simple answer.

First, we should consider what your predictive goal is:

(1) predict whole sequence for another participant

(2) predict a single choice given all other choices

or

(3) predict the next choice given the choices in the sequence so far?

If your predictive goal is

(1) then you should note that WAIC is based on an asymptotic argument and it is not generally accurate with n=1. Watanabe has said (personal communication) that he thinks this is not a sensible scenario for WAIC, but if (1) is really your prediction goal, then I think this might be the best you can do. It seems that when n is small, WAIC will usually underestimate the effective complexity of the model, and thus would give over-optimistic performance estimates for more complex models.

(2) WAIC should work just fine here (unless your model says that there is no dependency between the choices, i.e., having 100 separate models, each with n=1). Correlated data here means just that it is easier to predict a choice if you know the previous choices and the following choices. This may make the difference between some models small compared to scenario (1).

(3) WAIC can’t handle this, and you would need to use a specific form of cross-validation (I think I should write a paper on this).
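To make scenario (2) concrete, here is a sketch of the WAIC computation from the 2013 Stat Comput paper (plain Python; `log_lik[s][i]` is an assumed layout, the log-likelihood of data point i under posterior draw s, so scenario (2) has n=100 columns and scenario (1) a single column):

```python
import math

def waic(log_lik):
    # log_lik: S x n matrix of pointwise log-likelihoods
    # (S posterior draws, n data points)
    S, n = len(log_lik), len(log_lik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(n):
        col = [log_lik[s][i] for s in range(S)]
        # log pointwise predictive density: log of the average likelihood
        lppd += math.log(sum(math.exp(x) for x in col) / S)
        # effective number of parameters: posterior variance of the log-likelihood
        m = sum(col) / S
        p_waic += sum((x - m) ** 2 for x in col) / (S - 1)
    return -2.0 * (lppd - p_waic)   # on the deviance scale
```

Feeding in the full S × 100 pointwise matrix gives the scenario (2) number; summing each draw's row into a single column gives the scenario (1) number, with the asymptotic caveat above.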

The post Waic for time series appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Study published in 2011, followed by successful replication in 2003 [sic] appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>This one is like shooting fish in a barrel but sometimes the job just has to be done. . . .

The paper is by Daryl Bem, Patrizio Tressoldi, Thomas Rabeyron, and Michael Duggan, it’s called “Feeling the Future: A Meta-Analysis of 90 Experiments on the Anomalous Anticipation of Random Future Events,” and it begins like this:

In 2011, the Journal of Personality and Social Psychology published a report of nine experiments purporting to demonstrate that an individual’s cognitive and affective responses can be influenced by randomly selected stimulus events that do not occur until after his or her responses have already been made and recorded (Bem, 2011). To encourage exact replications of the experiments, all materials needed to conduct them were made available on request. We can now report a meta-analysis of 90 experiments from 33 laboratories in 14 different countries which yielded an overall positive effect in excess of 6 sigma . . . A Bayesian analysis yielded a Bayes Factor of 7.4 × 10^-9 . . . An analysis of p values across experiments implies that the results were not a product of “p-hacking” . . .

Actually, no.

There is a lot of selection going on here. For example, they report that 57% (or, as they quaintly put it, “56.6%”) of the experiments had been published in peer reviewed journals or conference proceedings. Think of all the unsuccessful, unpublished replications that didn’t get caught in the net. But of course almost any result that happened to be statistically significant would be published, hence a big bias. Second, they go back and forth, sometimes considering all replications, other times ruling some out as not following protocol. At one point they criticize internet experiments which is fine, but again it’s more selection because if the results from the internet experiments had looked good, I don’t think we’d be seeing that criticism. Similarly, we get statements like, “If we exclude the 3 experiments that were not designed to be replications of Bem’s original protocol . . .”. This would be a lot more convincing if they’d defined their protocols clearly ahead of time.

I question the authors’ claims that various replications are “exact.” Bem’s paper was published in 2011, so how can it be that experiments performed as early as 2003 are exact replications? That makes no sense. Just to get an idea of what was going on, I tried to find one of the earlier studies that was stated to be an exact replication. I looked up the paper by Savva et al. (2005), “Further testing of the precognitive habituation effect using spider stimuli.” I could not find this one but I found a related one, also on spider stimuli. In what sense is this an “exact replication” of Bem? I looked at the Bem (2011) paper, searched on “spider,” and all I could find is a reference to Savva et al.’s 2004 work.

This baffled me so I went to the paper linked above and searched on “exact replication” to see how they defined the term. Here’s what I found:

“To qualify as an exact replication, the experiment had to use Bem’s software without any procedural modifications other than translating on-screen instructions and stimulus words into a language other than English if needed.”

I’m sorry, but, no. Using the same software is not enough to qualify as an “exact replication.”

This issue is central to the paper at hand. For example, there is a discussion on page 18 on “the importance of exact replications”: “When a replication succeeds, it logically implies that every step in the replication ‘worked’ . . .”

Beyond this, the individual experiments have multiple comparisons issues, just as did the Bem (2011) paper. We see very few actual preregistrations, and my impression is that when something counts as a successful replication there is still a lot of wiggle room regarding data inclusion rules, which interactions to study, etc.

**Who cares?**

The ESP context makes this all look like a big joke, but the general problem of researchers creating findings out of nothing, that seems to be a big issue in social psychology and other research areas involving noisy measurements. So I think it’s worth holding a firm line on this sort of thing. I have a feeling that the authors of this paper think that if you have a p-value or Bayes factor of 10^-9 then your evidence is pretty definitive, even if some nitpickers can argue on the edges about this or that. But it doesn’t work that way. The garden of forking paths is multiplicative, and with enough options it’s not so hard to multiply up to factors of 10^-9 or whatever. And it’s not like you have to be trying to cheat; you just keep making reasonable choices given the data you see, and you can get there, no problem. Selecting ten-year-old papers and calling them “exact replications” is one way to do it.

**P.S.** I found the delightful image above by googling *bullwinkle crystal ball* but I can’t seem to track down who to give the credit to. Jay Ward, Alex Anderson, and Bill Scott, I suppose. It doesn’t seem to matter so much who actually got the screenshot and posted it on the web.

The post Study published in 2011, followed by successful replication in 2003 [sic] appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Why I’m still not persuaded by the claim that subliminal smiley-faces can have big effects on political attitudes appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>What were these powerful “irrelevant stimuli” that were outweighing the impact of subjects’ prior policy views? Before seeing each policy statement, each subject was subliminally exposed (for 39 milliseconds — well below the threshold of conscious awareness) to one of three images: a smiling cartoon face, a frowning cartoon face, or a neutral cartoon face. . . . the subliminal cartoon faces substantially altered their assessments of the policy statements . . .

I followed up with a post expressing some skepticism:

It’s clear that when the students [the participants in the experiment] were exposed to positive priming, they expressed more positive thoughts . . . But I don’t see how they make the leap to their next statement, that these cartoon faces “significantly and consistently altered [students'] thoughts and considerations on a political issue.” I don’t see a change in the number of positive and negative expressions as equivalent to a change in political attitudes or considerations.

I wrote:

Unfortunately they don’t give the data or any clear summary of the data from experiment No. 2, so I can’t evaluate it. I respect Larry Bartels, and I see that he characterized the results as the “subliminal cartoon faces substantially altered their assessments of the policy statements — and the resulting negative and positive thoughts produced substantial changes in policy attitudes.” But based on the evidence given in the paper, I can’t evaluate this claim. I’m not saying it’s wrong. I’m just saying that I can’t express judgment on it, given the information provided.

Larry then followed up with a post saying that further information was in chapter 3 of Erisen’s Ph.D. dissertation, available online here.

And Erisen sent along a note which I said I would post. Erisen’s note is here:

As a close follower of the Monkey Cage, it is a pleasure to see some interest in affect, unconscious stimuli, perceived (or registered) but unappreciated influences. Accordingly I thought it is now the right time for me to contribute to the discussion.

First, I would like to begin by clarifying conceptual issues with respect to affective priming. Affective priming is not subliminal advertising, nor is it a subliminal message. Subliminal ads (or messages) were used back in the 1970s with questionable methods, and current priming studies rarely refer to these approaches.

Second, it is quite normal to be skeptical because no earlier research has attempted to address these kinds of issues in political science. When they first hear about affective influences, people may naturally consider the consequences for measuring political attitudes and political preferences. These conclusions may be especially meaningful for democratic theory, as mentioned by Larry Bartels in an earlier post.

But, fear not, this is not a stand-alone research study. Rather, it is part of an overall research program (Lodge and Taber, 2013) and there are various studies on unconscious stimuli and contextual effects. We name these “perceived but unappreciated effects” in our paper. In addition, we cite some other work on contextual cues (Berger et al., 2008), facial attractiveness (Todorov and Uleman, 2004), the “RATS” ad (Weinberger and Westen, 2008), the Willie Horton ad (Mendelberg, 2001), upbeat music or threatening images in political ads (Brader, 2006), which all provide examples of priming. There is a great deal of research in social psychology that offers other relevant examples of the social or political effects of affective primes.

Third, with respect to the outcomes, I would like to refer the reader to our path analyses (provided in the paper and in *The Rationalizing Voter*) that show the effects of affect-triggered thoughts on policy preferences (see below). What can be inferred from these results? We can say that, controlling for prior attitudes, affective primes not only directly affected policy preferences but also indirectly affected preferences through affect-evoked thoughts. The effects on political attitudes and preferences are significant, as we discuss in greater detail in the paper.

Fourth, these results were consistent across six experiments that I conducted for my dissertation. The priming procedure was about the same in all those studies, and the patterns across different dependent variables were quite similar.

Finally, we do not argue that voters cannot make decisions based on “enlightened preferences.” As we repeatedly state in the paper, affective cues color attitudes and preferences but this does not mean that voters’ decisions are necessarily irrational.

Both Bartels and Erisen posted path diagrams in support of their argument, so perhaps I should clarify that I’ve never understood these path diagrams. If an intervention has an effect on political attitudes, I’d like to see a comparison of the political attitudes with and without the intervention. No amount of path diagrams will convince me until I see the direct comparison. You could argue with some justification that my ignorance in this area is unfortunate, but you should also realize that there are a lot of people like me who don’t understand those graphs—and I suspect that many of those people who *do* like and understand path diagrams would also like to see the direct comparisons too. So, purely from the perspective of communication, I think it makes sense to connect the dots and not just show a big model without the intermediate steps. Otherwise you’re essentially asking the reader to take your claims on faith.
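To make concrete what a "direct comparison" means here, a minimal sketch follows. The data are simulated and purely hypothetical (not from Erisen's experiment); the point is only that the comparison Gelman asks for is just a difference in mean attitude between the primed and control groups, with a rough standard error, before any path model is fit.

```python
import random

# Hypothetical simulated data: attitude scores on a 0-10 scale for a
# control group and a group exposed to a positive subliminal prime.
random.seed(0)
control = [random.gauss(5.0, 2.0) for _ in range(100)]  # no prime
treated = [random.gauss(5.3, 2.0) for _ in range(100)]  # smiley-face prime

def mean(xs):
    return sum(xs) / len(xs)

def se(xs):
    # standard error of the mean, using the sample variance
    m = mean(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return (var / len(xs)) ** 0.5

# The "direct comparison": difference in means and its standard error.
diff = mean(treated) - mean(control)
diff_se = (se(treated) ** 2 + se(control) ** 2) ** 0.5
print(f"estimated effect of prime: {diff:.2f} (se {diff_se:.2f})")
```

Reporting this single number and its uncertainty is the intermediate step Gelman wants to see before (or alongside) any path diagram.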

Again, I’m not saying that Erisen is wrong in his claims, just that the evidence he’s shown me is too abstract to convince me. I realize that he knows a lot more about his experiment and his data than I do, and I’m pretty sure that he is much more informed on this literature than I am, so I respect that he feels he can draw certain strong conclusions from his data. But, for me, I have to go with what information is available to me.

P.S. In his post, Larry also refers to the study by Andrew Healy, Neil Malhotra, and Cecilia Hyunjung Mo on college football games and election outcomes. That was an interesting study but, as I wrote when it came out a couple of years ago, I think its implications were much smaller than was claimed at the time in media reports. Yes, people can be affected by irrelevant stimuli related to mood, but the magnitudes of such effects matter.

The post Why I’m still not persuaded by the claim that subliminal smiley-faces can have big effects on political attitudes appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “How to disrupt the multi-billion dollar survey research industry” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>P.P.S. Slightly relevant to this discussion: Satvinder Dhingra wrote to me:

An AAPOR probability-based survey methodologist is a man who, when he finds that non-probability Internet opt-in samples constructed to be representative of the general population work in practice, wonders whether they work in theory.

My reply:

Yes, although to be fair they will say that they’re not so sure that these methods work in practice. To which my reply is that I’m not so sure that their probability samples work so well in practice either!

The post “How to disrupt the multi-billion dollar survey research industry” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>