The post Better to just not see the sausage get made appeared first on Statistical Modeling, Causal Inference, and Social Science.

Mike Carniello writes:

This article in the NYT leads to the full text, in which these statements are buried (no pun intended):

What is the probability that two given texts were written by the same author? This was achieved by posing an alternative null hypothesis H0 (“both texts were written by the same author”) and attempting to reject it by conducting a relevant experiment. If its outcome was unlikely (P ≤ 0.2), we rejected the H0 and concluded that the documents were written by two individuals. Alternatively, if the occurrence of H0 was probable (P > 0.2), we remained agnostic.

See the footnote to this table:

Ahhh, so horrible. The larger research claims might be correct, I have no idea. But I hate to see such crude statistical ideas being used, it’s like using a pickaxe to dig for ancient pottery.


The post Letters we never finished reading appeared first on Statistical Modeling, Causal Inference, and Social Science.

Over the last several years, a different kind of science book has found a home on consumer bookshelves. Anchored by meticulous research and impeccable credentials, these books bring hard science to bear on the daily lives of the lay reader; their authors—including Malcolm Gladwell . . .

OK, then.

The book might be ok, though. I wouldn’t judge it on its publicity material.


The post Free workshop on Stan for pharmacometrics (Paris, 22 September 2016); preceded by (non-free) three day course on Stan for pharmacometrics appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Workshop: Stan for Pharmacometrics Day**

If you are interested in a free day of Stan for pharmacometrics in Paris on 22 September 2016, see the registration page:

Julie Bertrand (statistical pharmacologist from Paris-Diderot and UCL) has finalized the program:

When | Who | What |
---|---|---|
09:00–09:30 | Registration | |
09:30–10:00 | Bob Carpenter | Introduction to the Stan Language and Model Fitting Algorithms |
10:00–10:30 | Michael Betancourt | Using Stan for Bayesian Inference in PK/PD Models |
10:30–11:00 | Bill Gillespie | Prototype Stan Functions for Bayesian Pharmacometric Modeling |
11:00–11:30 | Coffee break | |
11:30–12:00 | Sebastian Weber | Bayesian popPK for Pediatrics – bridging from adults to pediatrics |
12:00–12:30 | Solene Desmee | Using Stan for individual dynamic prediction of the risk of death in nonlinear joint models: Application to PSA kinetics and survival in metastatic prostate cancer |
12:30–13:30 | Lunch | |
13:30–14:00 | Marc Vandemeulebroecke | A longitudinal Item Response Theory model to characterize cognition over time in elderly subjects |
14:00–14:30 | William Barcella | Modeling correlated binary variables: an application to lower urinary tract symptoms |
14:30–15:00 | Marie-Karelle Riviere | Evaluation of the Fisher information matrix without linearization in nonlinear mixed effects models for discrete and continuous outcomes |
15:00–15:30 | Coffee break | |
15:30–16:00 | Dan Simpson | TBD |
16:00–16:30 | Frederic Bois | Bayesian hierarchical modeling in pharmacology and toxicology / about what we need next |
16:30–17:00 | Everyone | Discussion |

**Course: Bayesian Inference with Stan for Pharmacometrics**

For the three days preceding the workshop (19–21 September 2016), Michael Betancourt, Daniel Lee, and I will be teaching a course on Stan for Pharmacometrics. This, alas, is not free, but if you’re interested, registration details are here:

It’s going to be very hands-on and by the end you should be fitting hierarchical PK/PD models based on compartment differential equations.

P.S. As Andrew keeps pointing out, all proceeds (after overhead) go directly toward Stan development. It turns out to be very difficult to get funding to maintain software that people use, because most funding is directed at “novel” research rather than software development, which means prototypes, not solid code. These courses help immensely to supplement our grant funding and let us continue to maintain Stan and its interfaces.


The post A day in the life appeared first on Statistical Modeling, Causal Inference, and Social Science.

So in this post I’ll just tell you everything I’ve been thinking about today, Thurs 14 Apr 2016.

Actually I’ll start with yesterday, when I posted an update to our Prior Choice Recommendations wiki. There had been a question on the Stan mailing list about priors for cutpoints in ordered logistic regression and this reminded me of a few things I wanted to add, not just on ordered regression but in various places in the wiki. This wiki is great and I’ll devote a full post to it sometime.

Also yesterday I edited a post on this sister blog. Posting there is a service to the political science profession, and it’s good to reach Washington Post readers, which is a different audience than we have here. But it can also be exhausting, as I need to explain everything, whereas for you regular readers I can just speak directly.

This morning I taught my class on design and analysis of sample surveys. Today’s class was on Mister P. This led into a 20-minute discussion about the history and future of sample surveys. I don’t know much about the history of sample surveys. Why was there no Gallup Poll in 1900? How much random sampling was being done, anywhere, before 1930? I don’t know. After that, the class was all R/Stan demos and discussion. I had some difficulties. I took an old R script from last year’s class but it didn’t run: I’d deleted some of the data files—Census PUMS files I needed for the poststratification—so I needed to get them again.

After that I biked downtown to give a talk at Baruch College, where someone had asked me to speak. On the way down I heard this story, which the This American Life producers summarize as follows:

When Jonathan Goldstein was 11, his father gave him a book called Ultra-Psychonics: How to Work Miracles with the Limitless Power of Psycho-Atomic Energy. The book was like a grab bag of every occult, para-psychology, and self-help book popular at the time. It promised to teach you how to get rich, control other people’s minds, and levitate. Jonathan found the book in his apartment recently and decided to look into the magical claims the book made.

It turns out that the guy who wrote the book was just doing it to make money:

At the time, Schaumberger was living in New Jersey and making a decent wage as an editor at a publishing house that specialized in occult self help books with titles like “Secrets From Beyond The Pyramids” and “The Magic Of Chantomatics.” And he was astonished by the amount of money he saw writers making. . . .

Looking at it now, it seems obvious it was a lark. It almost reads like a parody of another famous science fiction slash self-help book with a lot of pseudoscience jargon that, for legal reasons, I will only say rhymes with diuretics.

Take, for instance, the astral spur. You were supposed to use it at the race track to give your horse extra energy, and it involved standing on one foot and projecting a psychic laser at your horse’s hindquarters.

Then there’s the section on ultra vision influence. The road to domination is explained this way– one, sit in front of a mirror and practice staring fixedly into your own eyes. Two, practice the look on animals. Cats are the best. See if you can stare down a cat. Don’t be surprised if the cat seems to win the first few rounds. Three, practice the look on strangers on various forms of public transport. Stare steadily at someone sitting opposite you until you force them to turn their head away or look down. You have just mastered your first human subject.

I’m listening to this and I’m thinking . . . power pose! It’s just like power pose. It *could* be true, it kinda sounds right, it involves discipline and focus.

One difference is that power pose has a “p less than .05” attached to it. But, as we’ve seen over and over again, “p less than .05” doesn’t mean very much.

The other difference is that, presumably, the power pose researchers are sincere, whereas this guy was just gleefully making it all up. And yet . . . there’s this, from his daughter:

Well, he was very familiar with all these things. The “Egyptian Book of the Dead” was a big one, because there was always this thing of, well, maybe if they had followed the formulas correctly, maybe something . . . He may have wanted to believe. It may be that in his private thoughts, there were some things in there that he believed in.

I think there may be something going on here, the idea that, even if you make it up, if you will it, you can make it true. If you just try hard enough. I wonder if the power-pose researchers and the ovulation-and-clothing researchers and all the rest, I wonder if they have a bit of this attitude, that if they just really really try, it will all become true.

And then there was more. I’ve had my problems with This American Life from time to time, but this one was a great episode. It had this cool story of a woman who was caring for her mother with dementia, and she (the caregiver) and her husband learned about how to “get inside the world” of the mother so that everything worked much more smoothly. I’m thinking I should try this approach when talking with students!

OK, so I got to my talk. It went ok, I guess. I wasn’t really revved up for it. But by the time it was over I was feeling good. I think I’m a good speaker but one thing that continues to bug me is that I rarely elicit many questions. (Search this blog for Brad Paley for more on this.)

After my talk, on the way back, another excellent This American Life episode, including a goofy/chilling story of how the FBI was hassling some US Taliban activist and trying to get him to commit crimes so they could nail him for terrorism. Really creepy: they seemed to want to create crimes where none existed, just so they could take credit for catching another terrorist.

Got home and started typing this up.

What else relevant happened recently? On Monday I spoke at a conference on “Bayesian, Fiducial, and Frequentist Inference.” My title was “Taking Bayesian inference seriously,” and this was my abstract:

Over the years I have been moving toward the use of informative priors in more and more of my applications. I will discuss several examples from theory, application, and computing where traditional noninformative priors lead to disaster, but a little bit of prior information can make everything work out. Informative priors also can resolve some of the questions of replication and multiple comparisons that have recently shaken the world of science. It’s funny for me to say this, after having practiced Bayesian statistics for nearly thirty years, but I’m only now realizing the true value of the prior distribution.

I don’t know if my talk quite lived up to this, but I *have* been thinking a lot about prior distributions, as was indicated at the top of this post. On the train ride to and from the conference (it was in New Jersey) I talked with Deborah Mayo. I don’t really remember anything we said—that’s what happens when I don’t take notes—but Mayo assured me she’d remember the important parts.

I also had an idea for a new paper, to be titled, “Backfire: How methods that attempt to avoid bias can destroy the validity and reliability of inferences.” OK, I guess I need a snappier title, but I think it’s an important point. Part of this material was in my talk, “‘Unbiasedness’: You keep using that word. I do not think it means what you think it means,” which I gave last year at Princeton—that was before Angus Deaton got mad at me, he was really nice during that visit and offered a lot of good comments, both during and after the talk—but I have some new material too. I want to work in the bit about the homeopathic treatments that have been so popular in social psychology.

Oh, also I received emails today from 2 different journals asking me to referee submitted papers, someone emailed me his book manuscript the other day, asking for comments, and a few other people emailed me articles they’d written.

I’m not complaining, nor am I trying to “busy-brag.” I love getting interesting things to read, and if I feel too busy I can just delete these messages. My only point is that there’s a lot going on, which is why it can be a challenge to limit myself to one blog post per day.

Finally, let me emphasize that I’m *not* saying there’s anything special about me. Or, to put it another way, sure, I’m special, and so are each of you. You too can do a Nicholson Baker and dissect every moment of your lives. That’s what blogging’s all about. God is in every leaf etc.


The post Hey pollsters! Poststratify on party ID, or we’re all gonna have to do it for you. appeared first on Statistical Modeling, Causal Inference, and Social Science.

In five days, Clinton’s lead increased from 5 points to 12 points. And the Democratic party ID margin increased from 3 points to 10 points.

No, I don’t think millions of voters switched to the Democratic party. I think Democrats were just more likely to respond in that second poll. And, remember, survey response rates are around 10%, whereas presidential election turnout is around 60%, so it makes sense that we’d see big swings in differential nonresponse to polls, swings that should not be expected to map to comparable swings in differential voting turnout.

We’ve been writing about this a lot recently. Remember this post, and this earlier graph from Abramowitz:

and this news article with David Rothschild, and this research article with Rothschild, Doug Rivers, and Sharad Goel, and this research article from 2001 with Cavan Reilly and Jonathan Katz? The cool kids know about this stuff.

I’m telling you this for free cos, hey, it’s part of my job as a university professor. (The job is divided into teaching, research, and service; this is service.) But I know that there are polling and news organizations that make money off this sort of thing. So, my advice to you: start poststratifying on party ID. It’ll give you a leg up on the competition.

That is, assuming your goal is to assess opinion and not just to manufacture news. If what you’re looking for is headlines, then by all means go with the raw poll numbers. They jump around like nobody’s business.

**P.S.** Two questions came up in discussion:

1. *If this is such a good idea, why aren’t pollsters doing it already?* Many answers here, including (a) some pollsters *are* doing it already, (b) other pollsters get benefit from headlines, and you get more headlines with noisy data, (c) survey sampling is a conservative field and many practitioners resist new ideas (just search this blog for “buggy whip” for more on that topic), and, most interestingly, (d) response rates keep going down, so differential nonresponse might be a bigger problem now than it used to be.

2. *Suppose I want to poststratify on party ID? What numbers should I use?* If you’re poststratifying on party ID, you don’t simply want to adjust to party registration data: party ID is a survey response, and party registration is something different. The simplest approach would be to take some smoothed estimate of the party ID distribution from many surveys: this won’t be perfect but it should be better than taking any particular poll, and much better than not poststratifying at all. To get more sophisticated, you could model the party ID distribution as a slowly varying time series as in our 2001 paper but I doubt that’s really necessary here.
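To make the mechanics concrete, here is a toy sketch in R of poststratifying a single poll question on party ID. All numbers here (the target distribution and the mini-poll) are hypothetical, invented for illustration; a real application would use a smoothed party ID distribution estimated from many surveys, as discussed above.

```r
# Smoothed party ID target distribution (hypothetical numbers)
target <- c(dem = 0.33, rep = 0.27, ind = 0.40)

# One noisy poll: each respondent's party ID and candidate support (0/1)
poll <- data.frame(
  pid     = c("dem", "dem", "rep", "ind", "ind", "dem", "rep", "ind"),
  support = c(1, 1, 0, 1, 0, 1, 0, 1)
)

# Raw estimate: a simple mean, at the mercy of who happened to respond
raw <- mean(poll$support)

# Poststratified estimate: cell means reweighted to the target party ID mix
cell_means <- tapply(poll$support, poll$pid, mean)
post <- sum(cell_means[names(target)] * target)
```

The raw mean moves around with whichever partisans happened to respond; the poststratified estimate fixes the party ID mix at the target, so swings in differential nonresponse by party no longer show up as swings in the topline.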


The post His varying slopes don’t seem to follow a normal distribution appeared first on Statistical Modeling, Causal Inference, and Social Science.

Doré writes:

I have a question about multilevel modeling I’m hoping you can help with.

What should one do when random effects coefficients are clearly not normally distributed (i.e., coef(lmer(y~x+(x|id))) )? Is this a sign that the model should be changed? Or can you stick with this model and infer that the assumption of normally distributed coefficients is incorrect?

I’m seeing strongly leptokurtic random slopes in a context where I have substantive interest in the shape of this distribution. That is, it would be useful to know if there are more individuals with “extreme” and fewer with “moderate” slopes than you’d expect of a normal distribution.

My reply: You can fit a mixture model, or even better you can have a group-level predictor that breaks up your data appropriately. To put it another way: What are your groups? And which are the groups that have low slopes and which have high slopes? Or which have slopes near the middle of the distribution and which have extreme slopes? You could fit a mixture model where the variance varies, but I think you’d be better off with a model using group-level predictors. Also, I recommend using Stan, which is more flexible than lmer and gives you the full posterior distribution.
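To see why a mixture is a natural candidate here, a small R simulation (my own sketch, with made-up numbers): if most people have modest slopes but a minority has much more variable slopes, the marginal distribution of slopes is heavy-tailed, even though each component is normal.

```r
set.seed(123)
n <- 1e5
# Two latent groups: 90% with slopes near 0, 10% with much more variable slopes
group <- rbinom(n, 1, 0.1)
slopes <- rnorm(n, mean = 0, sd = ifelse(group == 1, 3, 0.5))

# Sample excess kurtosis; 0 for a normal distribution
z <- (slopes - mean(slopes)) / sd(slopes)
excess_kurtosis <- mean(z^4) - 3   # well above 0 for this mixture
```

A single normal distribution for the slopes, which is what lmer assumes, has no way to capture this; a group-level predictor that identifies the high-variance subgroup, or a mixture over the slopes, can.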

Doré then added:

My groups are different people reporting life satisfaction annually surrounding a stressful life event (divorce, bereavement, job loss). I take it that the kurtosis is a clue that there are unobserved person-level factors driving this slope variability? With my current data I don’t have any person-level predictors that could explain this variability, but certainly it would be good to try to find some.


The post Postdoc in Finland with Aki appeared first on Statistical Modeling, Causal Inference, and Social Science.

The person hired will participate in research on Gaussian processes, functional constraints, big data, approximate Bayesian inference, model selection and assessment, deep learning, and survival analysis models (e.g., cardiovascular diseases and cancer). Methods will be implemented mostly in GPy and Stan. The research will be done in collaboration with Columbia University (Andrew and the Stan group), the University of Sheffield, Imperial College London, the Technical University of Denmark, the National Institute for Health and Welfare, the University of Helsinki, and Helsinki University Central Hospital.

See more details here


The post Balancing bias and variance in the design of behavioral studies: The importance of careful measurement in randomized experiments appeared first on Statistical Modeling, Causal Inference, and Social Science.

When studying the effects of interventions on individual behavior, the experimental research template is typically: Gather a bunch of people who are willing to participate in an experiment, randomly divide them into two groups, assign one treatment to group A and the other to group B, then measure the outcomes. If you want to increase precision, do a pre-test measurement on everyone and use that as a control variable in your regression. But in this post I argue for an alternative approach—study individual subjects using repeated measures of performance, with each one serving as their own control.

As long as your design is not constrained by ethics, cost, realism, or a high drop-out rate, the standard randomized experiment approach gives you clean identification. And, by ramping up your sample size N, you can get all the precision you might need to estimate treatment effects and test hypotheses. Hence, this sort of experiment is standard in psychology research and has been increasingly popular in political science and economics with lab and field experiments.

However, the clean simplicity of such designs has led researchers to neglect important issues of measurement . . .

I summarize:

One motivation for between-subject design is an admirable desire to reduce bias. But we shouldn’t let the apparent purity of randomized experiments distract us from the importance of careful measurement. Real-world experiments are imperfect—they do have issues with ethics, cost, realism, and high drop-out, and the strategy of doing an experiment and then grabbing statistically-significant comparisons can leave a researcher with nothing but a pile of noisy, unreplicable findings.

Measurement is central to economics—it’s the link between theory and empirics—and it remains important, whether studies are experimental, observational, or some combination of the two.
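As a toy illustration of the precision point (my own sketch, not from the post): when person-to-person variation dwarfs measurement noise, a within-subject design that uses each subject as their own control can beat a between-subject design by a wide margin.

```r
set.seed(42)
n <- 200           # subjects
subject_sd <- 2    # large person-to-person variation
noise_sd <- 0.5    # small measurement noise
effect <- 0.3      # true treatment effect
alpha <- rnorm(n, 0, subject_sd)   # subject-level baselines

# Between-subject: half treated, half control, one measurement each
treat <- rep(0:1, each = n / 2)
y_between <- alpha + effect * treat + rnorm(n, 0, noise_sd)
se_between <- summary(lm(y_between ~ treat))$coefficients["treat", "Std. Error"]

# Within-subject: each person measured under both conditions;
# the baseline alpha cancels in the within-person difference
d <- (effect + rnorm(n, 0, noise_sd)) - rnorm(n, 0, noise_sd)
se_within <- sd(d) / sqrt(n)
```

Here the between-subject standard error is dominated by subject_sd, while the within-subject comparison only has to fight the measurement noise. The usual caveats (order effects, carryover) are exactly where careful measurement design comes in.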

I have no idea who reads that blog but it’s always good to try to reach new audiences.


The post Evil collaboration between Medtronic and FDA appeared first on Statistical Modeling, Causal Inference, and Social Science.

Medtronic ran a retrospective study of 3,647 Infuse patients from 2006-2008 but shut it down without reporting more than 1,000 “adverse events” to the government within 30 days, as the law required.

Medtronic, which acknowledges it should have reported the information promptly, says employees misfiled it. The company eventually reported the adverse events to the FDA more than five years later.

Medtronic filed four individual death reports from the study in July 2013. Seven months later, the FDA posted a three-sentence summary of 1,039 other adverse events from the Infuse study, but deleted the number from public view, calling it a corporate trade secret.

Wow. I feel bad for that FDA employee who did this: it must be just horrible to have to work for the government when you have such exquisite sensitivity to corporate secrets. I sure hope that he or she gets a good job in some regulated industry after leaving government service.


The post Bayesian inference completely solves the multiple comparisons problem appeared first on Statistical Modeling, Causal Inference, and Social Science.

I promised I wouldn’t do any new blogging until January but I’m here at this conference and someone asked me a question about the above slide from my talk.

The point of the story in that slide is that flat priors consistently give bad inferences. Or, to put it another way, the routine use of flat priors results in poor frequency properties in realistic settings where studies are noisy and effect sizes are small. (More here.)

Saying it that way, it’s obvious: Bayesian methods are calibrated if you average over the prior. If the distribution of effect sizes that you average over is not the same as the prior distribution you’re using in the analysis, your Bayesian inferences will in general have problems.

But, simple as this statement is, the practical implications are huge, because it’s standard to use flat priors in Bayesian analysis (just see most of the examples in our books!) and it’s even more standard to take classical maximum likelihood or least squares inferences and interpret them Bayesianly, for example interpreting a 95% interval that excludes zero as strong evidence for the sign of the underlying parameter.

In our 2000 paper, “Type S error rates for classical and Bayesian single and multiple comparison procedures,” Francis Tuerlinckx and I framed this in terms of researchers making “claims with confidence.” In classical statistics, you make a claim with confidence on the sign of an effect if the 95% confidence interval excludes zero. In Bayesian statistics, one can make a comparable claim with confidence if the 95% *posterior* interval excludes zero. With a flat prior, these two are the same. But with a Bayesian prior, they are different. In particular, with normal data and a normal prior centered at 0, the Bayesian interval is always more likely to include zero, compared to the classical interval; hence we can say that Bayesian inference is more conservative, in being less likely to result in claims with confidence.

Here’s the relevant graph from that 2000 paper:

This plot shows the probability of making a claim with confidence, as a function of the variance ratio, based on the simple model:

- True effect: theta is simulated from normal(0, tau).
- Data: y are simulated from normal(theta, sigma).
- Classical 95% interval: y +/- 2*sigma.
- Bayesian 95% interval: theta.hat.bayes +/- 2*theta.se.bayes, where theta.hat.bayes = y * (1/sigma^2) / (1/sigma^2 + 1/tau^2) and theta.se.bayes = sqrt(1 / (1/sigma^2 + 1/tau^2)).
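Under this simple model the claim-with-confidence probabilities can be computed in closed form, since marginally y ~ normal(0, sqrt(tau^2 + sigma^2)). Here is a short R sketch (mine, not the code from the 2000 paper):

```r
# Probability of a "claim with confidence" for the classical and Bayesian
# procedures, as a function of tau (with sigma fixed)
claim_probs <- function(tau, sigma = 1) {
  sd_y <- sqrt(tau^2 + sigma^2)   # marginal sd of y
  p_classical <- 2 * pnorm(-2 * sigma / sd_y)
  shrink <- (1 / sigma^2) / (1 / sigma^2 + 1 / tau^2)
  se_bayes <- sqrt(1 / (1 / sigma^2 + 1 / tau^2))
  # Bayes claims when |shrink * y| > 2 * se_bayes, i.e. |y| > 2*se_bayes/shrink
  p_bayes <- 2 * pnorm(-2 * se_bayes / (shrink * sd_y))
  c(classical = p_classical, bayes = p_bayes)
}

claim_probs(tau = 0.01)  # tau/sigma near 0: classical near 5%, Bayes near 0
claim_probs(tau = 0.5)
```

With tau = 0.5 and sigma = 1 this gives about 7.4% for the classical procedure and roughly 0.006% for the Bayesian one, in line with the simulation results reported below.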

What’s really cool here is what happens when tau/sigma is near 0, which we might call the “Psychological Science” or “PPNAS” domain. In that limit, the classical interval has a 5% chance of excluding 0. Of course, that’s what the 95% interval is all about: if there’s no effect, you have a 5% chance of seeing something.

But . . . look at the Bayesian procedure. There, the probability of a claim with confidence is essentially 0 when tau/sigma is low. This is right: in this setting, the data only very rarely supply enough information to determine the sign of any effect. But this can be counterintuitive if you have classical statistical training: we’re so used to hearing about 5% error rate that it can be surprising to realize that, if you’re doing things right, your rate of making claims with confidence can be much lower.

We *are* assuming here that the prior distribution and the data model are correct—that is, we compute probabilities by averaging over the data-generating process in our model.

**Multiple comparisons**

OK, so what does this have to do with multiple comparisons? The usual worry is that if we are making a lot of claims with confidence, we can be way off if we don’t do some correction. And, indeed, with the classical approach, if tau/sigma is small, you’ll still be making claims with confidence 5% of the time, and a large proportion of these claims will be in the wrong direction (a “type S,” or sign, error) or much too large (a “type M,” or magnitude, error), compared to the underlying truth.

With Bayesian inference (and the correct prior), though, this problem disappears. Amazingly enough, you *don’t* have to correct Bayesian inferences for multiple comparisons.

I did a demonstration in R to show this, simulating a million comparisons and seeing what the Bayesian method does.

Here’s the R code:

```r
setwd("~/AndrewFiles/research/multiplecomparisons")
library("arm")

spidey <- function(sigma, tau, N) {
  cat("sigma = ", sigma, ", tau = ", tau, ", N = ", N, "\n", sep="")
  theta <- rnorm(N, 0, tau)
  y <- theta + rnorm(N, 0, sigma)
  signif_classical <- abs(y) > 2*sigma
  cat(sum(signif_classical), " (", fround(100*mean(signif_classical), 1),
      "%) of the 95% classical intervals exclude 0\n", sep="")
  cat("Mean absolute value of these classical estimates is",
      fround(mean(abs(y)[signif_classical]), 2), "\n")
  cat("Mean absolute value of the corresponding true parameters is",
      fround(mean(abs(theta)[signif_classical]), 2), "\n")
  cat(fround(100*mean((sign(theta)!=sign(y))[signif_classical]), 1),
      "% of these are the wrong sign (Type S error)\n", sep="")
  theta_hat_bayes <- y * (1/sigma^2) / (1/sigma^2 + 1/tau^2)
  theta_se_bayes <- sqrt(1 / (1/sigma^2 + 1/tau^2))
  signif_bayes <- abs(theta_hat_bayes) > 2*theta_se_bayes
  cat(sum(signif_bayes), " (", fround(100*mean(signif_bayes), 1),
      "%) of the 95% posterior intervals exclude 0\n", sep="")
  cat("Mean absolute value of these Bayes estimates is",
      fround(mean(abs(theta_hat_bayes)[signif_bayes]), 2), "\n")
  cat("Mean absolute value of the corresponding true parameters is",
      fround(mean(abs(theta)[signif_bayes]), 2), "\n")
  cat(fround(100*mean((sign(theta)!=sign(theta_hat_bayes))[signif_bayes]), 1),
      "% of these are the wrong sign (Type S error)\n", sep="")
}

sigma <- 1
tau <- .5
N <- 1e6
spidey(sigma, tau, N)
```

Here's the first half of the results:

```
sigma = 1, tau = 0.5, N = 1e+06
73774 (7.4%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.45
Mean absolute value of the corresponding true parameters is 0.56
13.9% of these are the wrong sign (Type S error)
```

So, when tau is half of sigma, the classical procedure yields claims with confidence 7% of the time. The estimates are huge (after all, they have to be at least two standard errors from 0), much higher than the underlying parameters. And 14% of these claims with confidence are in the wrong direction.

The next half of the output shows the results from the Bayesian intervals:

```
62 (0.0%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 0.95
Mean absolute value of the corresponding true parameters is 0.97
3.2% of these are the wrong sign (Type S error)
```

When tau is half of sigma, Bayesian claims with confidence are extremely rare. When there *is* a Bayesian claim with confidence, it will be large. That makes sense: the posterior standard error is sqrt(1/(1/1 + 1/.5^2)) = 0.45, and so any posterior mean corresponding to a Bayesian claim with confidence here will have to be at least 0.9. The average for these million comparisons turns out to be 0.95.

So, hey, watch out for selection effects! But no, not at all. If we look at the underlying *true effects* corresponding to these claims with confidence, these have a mean of 0.97 (in this simulation; in other simulations of a million comparisons, we get means such as 0.89 or 1.06). And very few of these are in the wrong direction; indeed, with enough simulations you'll find a type S error rate of a bit less than 2.5%, which is what you'd expect, given that these 95% posterior intervals exclude 0, so something less than 2.5% of the interval will be of the wrong sign.

So, the Bayesian procedure only very rarely makes a claim with confidence. But, when it does, it's typically picking up something real, large, and in the right direction.

We then re-ran with tau = 1, a world in which the standard deviation of true effects is equal to the standard error of the estimates:

```r
sigma <- 1
tau <- 1
N <- 1e6
spidey(sigma, tau, N)
```

And here's what we get:

```
sigma = 1, tau = 1, N = 1e+06
157950 (15.8%) of the 95% classical intervals exclude 0
Mean absolute value of these classical estimates is 2.64
Mean absolute value of the corresponding true parameters is 1.34
3.9% of these are the wrong sign (Type S error)
45634 (4.6%) of the 95% posterior intervals exclude 0
Mean absolute value of these Bayes estimates is 1.68
Mean absolute value of the corresponding true parameters is 1.69
1.0% of these are the wrong sign (Type S error)
```

The classical estimates remain too high, on average about twice as large as the true effect sizes; the Bayesian procedure is more conservative, making fewer claims with confidence and not overestimating effect sizes.

**Bayes does better because it uses more information**

We should not be surprised by these results. The Bayesian procedure uses more information and so it can better estimate effect sizes.

But this can seem like a problem: what if this prior information on theta isn’t available? I have two answers. First, in many cases, some prior information *is* available. Second, if you have a lot of comparisons, you can fit a multilevel model and estimate tau. Thus, what can seem like the worst multiple comparisons problems are not so bad.
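On the second answer, the simplest version of estimating tau from the data is a method-of-moments sketch (my illustration, not the post's code), using the fact that var(y) = tau^2 + sigma^2:

```python
import numpy as np

def estimate_tau(y, sigma):
    """Moment estimate of the sd of the true effects, given
    estimates y that each have known standard error sigma."""
    tau_sq = np.var(y, ddof=1) - sigma**2
    return np.sqrt(max(tau_sq, 0.0))  # truncate at 0 if noise dominates

# Check on simulated data with known tau:
rng = np.random.default_rng(2)
sigma, tau = 1.0, 0.5
theta = rng.normal(0, tau, 100_000)
y = rng.normal(theta, sigma)
print(estimate_tau(y, sigma))  # should come out close to 0.5
```

A full multilevel model would estimate tau and all the theta_j jointly (and propagate the uncertainty in tau), but this moment estimate conveys the idea.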

One should also be able to obtain comparable results non-Bayesianly by setting a threshold so as to control the type S error rate. The key is to go beyond the false-positive, false-negative framework, to set the goals of estimating the sign and magnitudes of the thetas rather than to frame things in terms of the unrealistic and uninteresting theta=0 hypothesis.

**P.S.** Now I know why I swore off blogging! The analysis, the simulation, and the writing of this post took an hour and a half of my work time.

**P.P.S.** Sorry for the ugly code. Let this be a motivation for all of you to learn how to code better.

The post Bayesian inference completely solves the multiple comparisons problem appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post One more thing you don’t have to worry about appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>So I have been convinced by the futility of NHT for my scientific goals and by the futility of significance testing (in the sense of using p-values as a measure of the strength of evidence against the null). So convinced that I have been teaching this for the last 2 years. Yesterday I bumped into this paper [“To P or not to P: on the evidential nature of P-values and their place in scientific inference,” by Michael Lew] which I thought makes a very strong argument for the validity of using significance testing for the above purpose. Furthermore—by his 1:1 mapping of p-values to likelihood functions—he kind of obliterates the difference between the Bayesian and frequentist perspectives. My questions are: 1. is his argument sound? 2. what does this mean regarding the use of p-values as measures of strength of evidence?

I replied that it all seems a bit nuts to me. If you’re not going to use p-values for hypothesis testing (and I agree with the author that this is not a good idea), why bother with p-values at all? It seems weird to use p-values to summarize the likelihood; why not just use the likelihood and do Bayesian inference directly? Regarding that latter point, see this paper of mine on p-values.

Eitam followed up:

But aren’t you surprised that the p-values *do* summarize the likelihood?

I replied that I did not read the paper in detail, but for any given model and sample size, I guess it makes sense that any two measures of evidence can be mapped to each other.
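For a concrete version of that mapping: with a normal test statistic, the two-sided p-value is an invertible function of |z|, so for a fixed model and sample size the p-value pins down |z| and hence (up to sign) the likelihood curve N(z, 1). A sketch of the inversion (my illustration, not from Lew's paper):

```python
from math import erf, sqrt

def p_from_z(z):
    """Two-sided p-value for a z statistic."""
    phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi(abs(z)))

def z_from_p(p, tol=1e-10):
    """Invert the two-sided p-value back to |z| by bisection
    (p_from_z is strictly decreasing in |z|)."""
    lo, hi = 0.0, 40.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if p_from_z(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(z_from_p(0.05), 2))  # 1.96
```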

The post One more thing you don’t have to worry about appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Evil collaboration between Medtronic and FDA

**Wed:** His varying slopes don’t seem to follow a normal distribution

**Thurs:** A day in the life

**Fri:** Letters we never finished reading

**Sat:** Better to just not see the sausage get made

**Sun:** Oooh, it burns me up

The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Taking Bayesian Inference Seriously [my talk tomorrow at Harvard conference on Big Data] appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Taking Bayesian Inference Seriously

Over the years I have been moving toward the use of informative priors in more and more of my applications. I will discuss several examples from theory, application, and computing where traditional noninformative priors lead to disaster, but a little bit of prior information can make everything work out. Informative priors also can resolve some of the questions of replication and multiple comparisons that have recently shook the world of science. It’s funny for me to say this, after having practiced Bayesian statistics for nearly thirty years, but I’m only now realizing the true value of the prior distribution.

The post Taking Bayesian Inference Seriously [my talk tomorrow at Harvard conference on Big Data] appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Kaiser Fung on the ethics of data analysis appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Kaiser Fung on the ethics of data analysis appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Michael Porter as new pincushion appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>New Zealand seems to score well on his index so perhaps I shouldn’t complain, but Michael Porter was well known in this part of the world 25 years ago when our government commissioned him to write a report titled “Upgrading New Zealand’s Competitive Advantage” (but known colloquially as the Porter Project.) Back then (perhaps not quite so much now) our government departments were in thrall of any overseas “expert” who could tell us what to do, and especially so if their philosophy happened to align with that of the government of the day.

Anyway this critique written at the time by one of our leading political economists suggests that his presentation and analysis skills weren’t the greatest back then either.

I followed the link and read the article by Brian Easton, which starts out like this:

]]>Flavour of the moment is Upgrading New Zealand’s Competitive Advantage, the report of the so-called Porter Project. Its 178 pages (plus appendices) are riddled with badly labelled graphs; portentous diagrams which, on reflection, say nothing; chummy references to “our country”, when two of the three authors are Americans; and platitudes dressed up as ‘deep and meaningful’ sentiments.

Toward the end of the review, Easton sums up:

It would be easy enough to explain this away as the usual shallowness of a visiting guru passing through; but New Zealand’s Porter Project spent about $1.5 million (of taxpayers’ money) on a report which is largely a recycling of conventional wisdom and material published elsewhere. Even if there were more and deeper case studies, the return on the money expended would still be low.

But that’s just leading up to the killer blow:

Particularly galling is the book’s claim that we should improve the efficiency of government spending. The funding of this report would have been a good place to start. It must be a candidate for the lowest productivity research publication ever funded by government.

In all seriousness, I expect that Michael Porter is so used to getting paid big bucks that he hardly noticed where the $1.5 million went. (I guess that’s 1.5 million New Zealand dollars, so something like $750,000 U.S.) Wasteful government spending on other people, sure, that’s horrible, but when the wasteful government spending goes directly to you, that’s another story.

The post Michael Porter as new pincushion appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Things that sound good but aren’t quite right: Art and research edition appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>For example consider this blog comment from Chris G:

Years ago I heard someone suggest these three questions for assessing a work of art:

1. What was the artist attempting to do?

2. Were they successful?

3. Was it worth doing?

I think those apply equally well to assessing research.

The idea of applying these same standards to research as to art, that was interesting. And the above 3 questions sounded good too—at first. But then I got to thinking about all sorts of art and science that didn’t fit the above rules. As I wrote:

There are many cases of successful art, and for that matter successful research, that were created by accident, where the artist or researcher was just mucking around, or maybe just trying to do something to pay the bills, and something great came out of it.

I’m not saying you’ll get much from completely random mucking around of the monkeys-at-a-typewriter variety. And in general I do believe in setting goals and working toward them. But artistic and research success often does seem to come in part by accident, or as a byproduct of some other goals.

The post Things that sound good but aren’t quite right: Art and research edition appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post An ethnographic study of the “open evidential culture” of research psychology appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Science studies scholars have shown that the management of natural complexity in lab settings is accomplished through a mixture of technological standardization and tacit knowledge by lab workers. Yet these strategies are not available to researchers who study difficult research objects. Using 16 months of ethnographic data from three laboratories that conduct experiments on infants and toddlers, the author shows how psychologists produce statistically significant results under challenging circumstances by using strategies that enable them to bridge the distance between an uncontrollable research object and a professional culture that prizes methodological rigor. This research raises important questions regarding the value of restrictive evidential cultures in challenging research environments.

And it concludes:

Open evidential cultures may be defensible under certain conditions. When problems are pressing and progress needs to be made quickly, creativity may be prized over ascetic rigor. Certain areas of medical or environmental science may meet this criterion. Developmental psychology does not. However, it may meet a second criterion. When research findings are not tightly coupled with some piece of material or social technology—that is, when the “consumers” of such science do not significantly depend on the veracity of individual articles—then local culture can function as an internal mechanism for evaluation in the field. Similar to the way oncologists use a “web of trials” rather than relying on a single, authoritative study or how weather forecasters use multiple streams of evidence and personal experience to craft a prediction, knowledge in such fields may develop positively even in a literature that contains more false positives than would be expected by chance alone.

It’s an interesting article, because usually discussions of research practices are all about what is correct, what should be done or not done, what do the data really tell us, etc. But here we get an amusing anthropological take on things, treating scientists’ belief in their research findings with the same respect that we treat tribal religious beliefs. This paper is not normative, it’s descriptive. And description is important. As I often say, if we want to understand the world, it helps to know what’s actually happening out there!

I like the term “open evidential culture”: it’s descriptive without being either condescending, on one hand, or apologetic, on the other.

The post An ethnographic study of the “open evidential culture” of research psychology appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan Course up North (Anchorage, Alaska) 23–24 Aug 2016 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Daniel Lee’s heading up to Anchorage, Alaska to teach a two-day Stan course at the Alaska chapter of the American Statistical Association (ASA) meeting in Anchorage. Here’s the rundown:

I hear Alaska’s beautiful in the summer—16 hour days in August and high temps of 17 degrees celsius. Plus Stan!

**More Upcoming Stan Events**

All of the Stan-related events of which we are aware are listed on:

After Alaska, Daniel and Michael Betancourt will be joining me in Paris, France on 19–21 September to teach a three-day course on Pharmacometric Modeling using Stan. PK/PD in Stan is now a whole lot easier after Sebastian Weber integrated CVODES (pun intended) to solve stiff differential equations with control over tolerances and max steps per iteration.

The day after the course in Paris, on 22 September, we (with Julie Bertrand and France Mentre) are hosting a one-day Workshop on Pharmacometric Modeling with Stan.

**Your Event Here**

Let us know if you hear about other Stan-related events (meetups, courses, workshops) and we can post them on our events page and advertise them right here on the blog.

The post Stan Course up North (Anchorage, Alaska) 23–24 Aug 2016 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post What’s gonna happen in November? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>2016 may be strange with Trump. Do you have any thoughts on how people might go about modeling a strange election? When I asked you about predictability and updating election forecasts, you stated that models that rely on polls at different points should be designed to allow for surprises. You have touted the power of weakly informative priors. Could those be a good tool for this situation?

I received this message on 4 Apr and I’m typing this on 9 Apr but it’s 17 Aug in blog time. So you’re actually reading a response that’s 4 months old.

What is it that they say: History is journalism plus time? I guess political science is political journalism plus time.

Anyway . . . whenever people asked me about the primary elections, I’d point them to my 2011 NYT article, Why Are Primaries Hard to Predict? Here’s the key bit:

Presidential general election campaigns have several distinct features that distinguish them from most other elections:

1. Two major candidates;

2. The candidates clearly differ in their political ideologies and in their positions on economic issues;

3. The two sides have roughly equal financial and organizational resources;

4. The current election is the latest in a long series of similar contests (every four years);

5. A long campaign, giving candidates a long time to present their case and giving voters a long time to make up their minds.

OK, now to Hassan’s question. I don’t really have a good answer! I guess I’d take as a starting point the prediction from a Hibbs-like model predicting the election based on economic conditions during the past year, presidential popularity, and party balancing. Right now the economy seems to be going OK though not great, Obama is reasonably popular, and party balancing favors the Democrats because the Republicans control both houses of Congress. So I’m inclined to give the Democratic candidate (Hillary Clinton, I assume) the edge. But that’s just my guess, I haven’t run the numbers. There’s also evidence from various sources that more extreme candidates don’t do so well, so if Sanders is the nominee, I’d assume he’d get a couple percentage points less than Clinton would. Trump . . . it’s hard to say. He’s not ideologically extreme, on the other hand he is so unpopular (even more so than Clinton), it’s hard to know what to say. So I find this a difficult election to predict. And once August rolls around, it’s likely there will be some completely different factors that I haven’t even thought about! From a statistical point of view, I guess I’d just add an error term which would increase my posterior uncertainty.

It’s not so satisfying to say this, but I don’t have much to offer as an election forecast beyond what you could read in any newspaper. I’m guessing that statistical tools will be more relevant in modeling what will happen in individual states, relative to the national average. As Kari Lock and I wrote a few years ago, it can be helpful to decompose national trends and the positions of the states. So maybe by the time this post appears here, I’ll have more to say.

**P.S.** This seems like a natural for the sister blog but I’m afraid the Washington Post readers would get so annoyed at me for saying I can’t make a good forecast! So I’m posting it here.

The post What’s gonna happen in November? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post How schools that obsess about standardized tests ruin them as measures of success appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Mark Palko and I wrote this article comparing the Success Academy chain of charter schools to Soviet-era factories:

According to the tests that New York uses to evaluate schools, Success Academies ranks at the top of the state — the top 0.3 percent in math and the top 1.5 percent in English, according to the founder of the Success Academies, Eva Moskowitz. That rivals or exceeds the performance of public schools in districts where homes sell for millions of dollars.

But it took three years before any Success Academy students were accepted into New York City’s elite high school network — and not for lack of trying. After two years of zero-percent acceptance rates, the figure rose to 11 percent this year, still considerably short of the 19 percent citywide average.

News coverage of those figures emphasized that that acceptance rate was still higher than the average for students of color (the population Success Academy mostly serves). But from a statistical standpoint, we would expect extremely high scores on the state exam to go along with extremely high scores on the high school application exams. It’s not clear why race should be a factor when interpreting one and not the other.

The explanation for the discrepancy would appear to be that in high school admissions, everybody is trying hard, so the motivational tricks and obsessive focus on tests at Success Academy schools has less of an effect. Routine standardized tests are, by contrast, high stakes for schools but low stakes for students. Unless prodded by teachers and anxious administrators, the typical student may be indifferent about his or her performance. . . .

We summarize:

In general, competition is good, as are market forces and data-based incentives, but they aren’t magic. They require careful thought and oversight to prevent gaming and what statisticians call model decay. . . .

What went wrong with Success Academy is, paradoxically, what also seems to have gone right. Success Academy schools have excelled at selecting out students who will perform poorly on state tests and then preparing their remaining students to test well. But their students do not do so well on tests that matter to the students themselves.

Like those Soviet factories, Success Academy and other charter schools have been under pressure to perform on a particular measure, and are reminding us once again what Donald Campbell told us 40 years ago: Tampering with the speedometer won’t make the car go faster.

The post How schools that obsess about standardized tests ruin them as measures of success appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post ~~Calorie labeling reduces obesity~~ Obesity increased more slowly in California, Seattle, Portland (Oregon), and NYC, compared to some other places in the west coast and northeast that didn’t have calorie labeling appeared first on Statistical Modeling, Causal Inference, and Social Science.

I wonder if you might have some perspective to offer on this analysis by Partha Deb and Carmen Vargas regarding restaurant calorie counts.

[Thin columnist] Cass Sunstein says it proves “that calorie labels have had a large and beneficial effect on those who most need them.”

I wonder about the impact of using self-reported BMI as a primary input and also the effect of confounding variables. Someone also suggested that investigator degrees of freedom is an important consideration.

They’re using data from a large national survey (Behavioral Risk Factor Surveillance System) and comparing self-reported body mass index of people who lived in counties with calorie-labeling laws, compared to counties without such laws, and they come up with these (distorted) maps:

Here’s their key finding:

The two columns correspond to two different models they used to adjust for demographic differences between the people in the two groups of counties. As you can see, average BMI seems to have increased faster in the no-calorie-labeling counties.

On the other hand, if you look at the map, it seems like they’re comparing {California, Seattle, Portland (Oregon), and NYC} to everyone else (with Massachusetts somewhere in the middle), and there are big differences between these places. So I don’t know how seriously we can attribute the differences between those trends to food labeling.

Also, figure 5 of that paper, showing covariate balance, is just goofy. I recommend simpler and more readable dotplots as in chapter 10 of ARM. Figure 4 is a bit mysterious too; I’m not quite clear on what is gained by the barplots on the top: aren’t they just displaying the means of the normal distributions on the bottom? And Figures 1 and 2, the maps, look weird: they’re using some bad projection, maybe making the rookie mistake of plotting latitude vs. longitude, not realizing that when you’re away from the equator one degree of latitude is not the same distance as one degree of longitude.

As to the Cass Sunstein article (“Calorie Counts Really Do Fight Obesity”), yeah, it seems a bit hypey. Key Sunstein quote: “All in all, it’s a terrific story.” Even aside from the causal identification issues discussed above, don’t forget that the difference between “significant” and “non-significant” is not itself statistically significant.

Speaking quite generally, I agree with Sunstein when he writes:

A new policy might have modest effects on Americans as a whole, but big ones on large subpopulations. That might be exactly the point! It’s an important question to investigate.

But of course researchers—even economists—have been talking about varying treatment effects for a while. So to say we can draw this “large lesson” from this particular study . . . again, a bit of hype going on here. It’s fine for Sunstein if this particular paper has woken him up to the importance of interactions, but let’s not let his excitement about the general concept, and his eagerness to tell a “terrific story” and translate it into policy, distract us from the big problems of interpreting the claims made in this paper.

And, to return to the multiple comparisons issue, ultimately what’s important is not so much what the investigators did or might have done, but rather what the data say. I think the right approach would be some sort of hierarchical model that allows for effects in all groups, rather than a search for a definitive result in some group or another.

**P.S.** Kyle referred to the article by Deb and Vargas as a “NBER analysis” but that’s not quite right. NBER is just a consortium that publishes these papers. To call their paper an NBER analysis would be like calling this blog post “a WordPress analysis” because I happen to be using this particular software.

The post ~~Calorie labeling reduces obesity~~ Obesity increased more slowly in California, Seattle, Portland (Oregon), and NYC, compared to some other places in the west coast and northeast that didn’t have calorie labeling appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post The history of characterizing groups of people by their averages appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I stumbled across this article on the End of Average.

I didn’t know about Todd Rose, thus I had a look at his Wikipedia entry:

Rose is a leading figure in the science of the individual, an interdisciplinary field that draws upon new scientific and mathematical findings that demonstrate that it is not possible to draw meaningful inferences about human beings using statistical averages.

Hmmm. I guess you would have something to say about that last sentence. To me, it sounds either trivial, if we interpret it in the sense illustrated by the US Air Force later on in the same page, i.e., that individuals whose properties (weight, height, chest, etc.) are “close” to those of the Average Human Being are very rare, provided the number of properties is sufficiently high. Or plain wrong, if it’s a claim that statistics cannot be used to draw useful inferences on some specific population of individuals (American voters, middle-aged non-Hispanic white men, etc.). Either way, I think this would make a nice entry for your blog.

My reply: I’m not sure! On one hand, I love to be skeptical; on the other hand, since you’re telling me I won’t like it, I’m inclined to say I like it, just to be contrary!

OK, that was my meta-answer. Seriously, though . . . I haven’t looked at Rose’s book, but I kinda like his Atlantic article that you linked to, in that it has an interesting historical perspective. Of course we can draw meaningful inferences using statistical averages—any claim otherwise seems just silly. But if the historical work is valid, we can just go with that and ignore any big claims about the world. Historians can have idiosyncratic views about the present but still give us valuable insights about how we got to where we are today.

The post The history of characterizing groups of people by their averages appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Tax Day: The Birthday Dog That Didn’t Bark appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Let’s go to the data:

These are data from 1968-1988 so it would certainly be interesting to see new data, but here’s what we got:

– April 1st has a lot less

– Maybe something going on Apr 15, but not much; really, nothing going on there at all.

– A lot less on vacation holidays such as July 4th, Labor Day, etc.

– Extra births before xmas and between xmas and New Year’s, which makes sense: the baby has to come out sometime!

– Day-of-week effects were increasing over the years.

But, really, nothing going on with April 15th. April Fools is where it’s at.

I just don’t think tax day is such a big deal. It looms large in the folklore of comedy writers and editorial writers, but for regular people it’s just a pain in the ass and then it’s over, not like, “Hey, I don’t want my kid to have an April Fools birthday.”

The post Tax Day: The Birthday Dog That Didn’t Bark appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** ~~Calorie labeling reduces obesity~~ Obesity increased more slowly in California, Seattle, Portland (Oregon), and NYC, compared to some other places in the west coast and northeast that didn’t have calorie labeling

**Wed:** What’s gonna happen in November?

**Thurs:** An ethnographic study of the “open evidential culture” of research psychology

**Fri:** Things that sound good but aren’t quite right: Art and research edition

**Sat:** Michael Porter as new pincushion

**Sun:** Kaiser Fung on the ethics of data analysis

The post Modeling correlation of issue attitudes and partisanship within states appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I have taught myself multilevel modeling using your book and read your work with Delia Baldassarri about partisanship and issue alignment. I have a question related to these two works.

I want to find the level of correlation between partisanship and issues at the state level. Your work with Professor Baldassarri estimated the correlation at the national level, but I want to estimate it at the state level. The problem is that ANES is designed to be a national representative sample, so without using multilevel modeling, the estimated state level correlation is useless.

If I run a varying-intercept, varying-slope model with states as the group variable, I can use these estimates as somewhat comparable to correlation coefficients, though they are not the same. If I run a linear regression, I know the coefficient differs from the correlation coefficient by the ratio of the standard deviations of x and y. However, even though I understand that multilevel model coefficients are a weighted average of the group-level and whole-sample coefficients, I don’t know how to compare multilevel coefficients with correlation coefficients.

Given my situation, I have two questions.

1) Is it OK to use estimates from a varying-intercept, varying-slope model to compare the state level correlation of partisanship and issue positions?

2) If no, how can I derive a correlation coefficient to compare state level correlations?

My reply: Yes, I think that rather than modeling correlations, if you’re interested in partisanship and issue attitudes, it would make sense to simply regress issue attitudes on partisanship, with varying intercepts and slopes for states. The varying slopes are what you’re interested in. That said, I’m guessing that ANES won’t have nearly enough data to estimate varying slopes with any level of accuracy. You should pool several years of ANES or else use larger surveys such as Annenberg and Pew as we did for most of our Red State Blue State analysis.
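On the reader's point that a regression coefficient differs from a correlation by a ratio of standard deviations: the identity slope = r * sd(y)/sd(x) is easy to verify numerically. An illustrative sketch with simulated data (my example, not from the exchange):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 2, 10_000)              # stand-in for partisanship
y = 0.6 * x + rng.normal(0, 1, 10_000)    # stand-in for issue attitude

# Least-squares slope of y on x, and the equivalent expression
# via the correlation coefficient.
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
r = np.corrcoef(x, y)[0, 1]
print(slope, r * np.std(y, ddof=1) / np.std(x, ddof=1))  # identical
```

This is why a varying-slopes model of issue attitude on partisanship answers essentially the same question as the state-level correlations, just on the regression scale.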

The post Modeling correlation of issue attitudes and partisanship within states appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Science reporters are getting the picture appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I don’t really have anything to add here beyond what I’ve blogged on these topics before. (I mean, sure, I *could* laugh at this quote, “The average person cannot evaluate a scientific finding for themselves any more easily than they can represent themselves in court or perform surgery on their own appendix,” which came from a psychology professor who is notorious for claiming that the replication rate in psychology is “statistically indistinguishable from 100%”—but I won’t go there.)

No, I just wanted to express pleasure that journalists are seeing the big picture here. At this point there’s a large cohort of science writers who’ve moved beyond the “Malcolm Gladwell” or “Freakonomics” model of scientist-as-hero, or the “David Brooks” model of believing anything that confirms your political views, or even the controversy-in-the-lab model, to a clearer view of science as a collective enterprise. We really do seem to be moving forward, even in the past five or ten years. Science reporters are no longer stenographers; they are active citizens of the scientific community.

The post Science reporters are getting the picture appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Will youths who swill Red Bull become adult cocaine addicts? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The above is the question asked to me by Michael Stutzer, who writes:

I have attached an increasingly influential paper [“Effects of Adolescent Caffeine Consumption on Cocaine Sensitivity,” by Casey O’Neill, Sophia Levis, Drew Schreiner, Jose Amat, Steven Maier, and Ryan Bachtell] purporting to show the effects of caffeine use in adolescents (well, lab rats anyway) on biomarkers of rewards to cocaine use later in life. I’d like to see a Gelmanized analysis of the statistics. Note, for example, Figure 2, panels A and B. Figure 2, Panel A contrasts the later (adult) response to cocaine between 16 rats given caffeine as adolescents, vs. 15 rats who weren’t given caffeine as adolescents. Panel B contrasts the adult response to cocaine between 8 rats given caffeine only as adults, vs. 8 rats who weren’t given caffeine. The authors make much of the statistically significant difference in means in Panel A, and the apparent lack of statistical significance in Panel B, although the sign of the effect appears to still be there. But N=8 likely resulted in a much larger calculated standard error in Panel B than the N=16 did in Panel A. I wonder if the results would have held in a balanced design with N=16 rats used in both experiments, or with larger N in both. In addition, perhaps a Bonferroni correction should be made, because the authors could have just lumped the 24 caffeine-swilling (16 adolescent + 8 adult) rats together and tested the difference in the mean response between them and the 23 adolescent and adult rats who weren’t given caffeine. The authors may have done that correction when they contrasted the separate Panels’ differences in means (they purport to do that in the other panels), but the legend doesn’t indicate it.

Because the paper is getting a lot of citations, some lab should try to replicate all this with larger sample sizes, and perturbations of the experimental procedures.

My reply:

I don’t have the equipment to replicate this one myself, so I’ll post your request here.

It’s hard for me to form any judgment about the paper because all these biology details are so technical, I just don’t have the energy to track everything that’s going on.

Just to look at some details, though: It’s funny how hard it is to find the total number of rats in the experiment, just by reading the paper. N is not in the abstract or in the Materials and Methods section. In the Results section I see that one cohort had 32 adolescent and 20 adult rats. So there must be other cohorts in the study?

I also find frustrating the convention that everything is expressed as a hypothesis test. **The big big trouble with hypothesis tests is that the p-values are basically uninterpretable if the null hypothesis is false.** Just for example:

What’s the point of that sort of thing? If there’s a possibility there is no change, then, sure, I can see the merit of including the p-value. But when the p-value is .0001 . . . c’mon, who cares about the damn F statistic, just give me the numbers: what was the average fluid consumption per day for the different animals? Also I have a horrible feeling their F-test was not appropriate, cos I’m not clear on what those 377 and 429 are.
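To make Stutzer's point about the unbalanced design concrete: with the same effect and the same within-group spread, going from 16 to 8 rats per group inflates the standard error of the difference by a factor of sqrt(2), which alone can move a comparison from "significant" to "not significant." A quick sketch (all numbers hypothetical, not taken from the paper):

```python
import math

# Hypothetical numbers for illustration: the same underlying effect and
# spread, measured with two different per-group sample sizes.
effect = 1.0   # difference in group means
sd = 1.2       # within-group standard deviation, assumed equal across groups

def z_stat(n_per_group):
    # Standard error of a difference in means shrinks like 1/sqrt(n)
    se = sd * math.sqrt(2.0 / n_per_group)
    return effect / se, se

z16, se16 = z_stat(16)
z8, se8 = z_stat(8)

print(f"n=16 per group: se={se16:.3f}, z={z16:.2f}")  # comfortably past 1.96
print(f"n= 8 per group: se={se8:.3f}, z={z8:.2f}")    # short of 1.96
```

With identical effect and spread, only the n=16 comparison clears the conventional 1.96 cutoff, which is exactly the worry about Panel B.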

I’d like to conclude by saying two things that may at first seem contradictory, but which are not:

1. This paper looks like it has lots of statistical errors.

2. I’m not trying to pick on the authors of this paper. And, despite the errors, they may be studying something real.

The lack of contradiction comes because, as I wrote last month, statistics is like basketball, or knitting. It’s hard. There’s no reason we should expect a paper written by some statistical amateurs to *not* have mistakes, any more than we’d expect the local high school team to play flawless basketball or some recreational knitter to be making flawless sweaters.

It does not give me joy to poke holes in the statistical analysis of a random journal article, any more than I’d want to complain about Aunt Edna’s sweaters or laugh at the antics of whoever is the gawky kid who’s playing center for the Hyenas this year. Everyone’s trying their best, and I respect that. To point out statistical errors in a published paper is not an exercise in “debunking,” it’s just something that I’ll notice, and it’s relevant to the extent that the paper’s conclusions lean on its statistical analysis.

And one reason we sometimes want brute-force preregistered replications is because then we don’t have to worry about so many statistical issues.

The post Will youths who swill Red Bull become adult cocaine addicts? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post A little story of the Folk Theorem of Statistical Computing appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>A colleague and I were working on a data analysis problem: a very simple overdispersed Poisson regression with a hierarchical, varying-intercept component. We ran it and it was super slow and not close to converging after 2000 iterations. We took a look and found the problem: the predictor matrix of our regression lacked a constant term. The constant term comes in by default if you do glmer or rstanarm, but if you write it as X*beta in a Stan model, you have to remember to put in a column of 1’s in the X matrix (or to add a “mu” to the regression model), and we’d forgotten to do that.

Once we added in that constant term, the model (a) ran much faster (cos adaptation was smoother) and (b) converged just fine after just 100 iterations.
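For the record, the fix is one line. A sketch in numpy, with a made-up predictor x, of the design matrix that should be passed as X to a Stan model written as X*beta:

```python
import numpy as np

np.random.seed(0)
n = 50
x = np.random.normal(size=n)  # a made-up predictor, for illustration only

# What we had: no constant term, so X*beta had no intercept
X_bad = x.reshape(-1, 1)

# What we needed: a column of 1's so that X @ beta includes an intercept
X_good = np.column_stack([np.ones(n), x])

print(X_good.shape)              # (50, 2)
print((X_good[:, 0] == 1).all())  # True: first column is all 1's
```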

Yet another instance of the folk theorem.

The post A little story of the Folk Theorem of Statistical Computing appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post NBA in NYC appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Jason Rosenfeld writes:

We’re holding the first ever NBA Basketball Analytics Hackathon on Saturday, September 24 at Terminal 23 in midtown Manhattan.

I can’t *guarantee* that Bugs will be there, but ya never know!

The post NBA in NYC appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Are stereotypes statistically accurate? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Lin Bian and Andrei Cimpian write:

In his book Social Perception and Social Reality, Lee Jussim suggests that people’s beliefs about various groups (i.e., their stereotypes) are largely accurate. We unpack this claim using the distinction between generic and statistical beliefs—a distinction supported by extensive evidence in cognitive psychology, linguistics, and philosophy. Regardless of whether one understands stereotypes as generic or statistical beliefs about groups, skepticism remains about the rationality of social judgments.

Bian and Cimpian start by distinguishing what cognitive psychologists call “statistical” and “generic” beliefs about categories. This is pretty cool. Here they go:

Consider the statements below:

(1a) Fewer than 1% of mosquitoes carry the West Nile virus.

(1b) Mosquitoes carry the West Nile virus.

(2a) The majority of books are paperbacks.

(2b) Books are paperbacks.

Statements (1a) and (2a) are statistical: They express a belief about a certain number or proportion of the members of a category. Statements (1b) and (2b) are generic: They express a belief about the category as a whole rather than a specific number, quantity, or proportion. . . .

The fact that generic claims – and the beliefs they express – are not about numbers or quantities has a crucial consequence: It severs their truth conditions from the sort of statistical data that one could objectively measure in the world. . . .

This point is illustrated by the examples above. Both (1a) and (1b) are considered true: Although very few mosquitoes actually carry the West Nile virus, participants judge the generic claim (that mosquitoes, as a category, carry the West Nile virus) to be true as well. . . .

In contrast, even though (2a) is true – paperbacks are indeed very common – few believe that books, as a category, are paperbacks (i.e., [2b] is false). . . .

Bian and Cimpian continue:

These are not isolated examples. The literature is replete with instances of generic claims that either are judged true despite unimpressive statistical evidence or judged false despite overwhelming numbers . . . the rules that govern which generic beliefs are deemed true and which are deemed false are so baroque and so divorced from the statistical facts that many linguists and philosophers have spent the better part of 40 years debating them. . . .

And to return to stereotyping:

All of the foregoing applies to beliefs about social groups as well. . . . The distinction between statistical and generic beliefs is operative regardless whether these beliefs concern mosquitoes, books, and other categories of non-human entities, or women, African Americans, Muslims, and other categories of humans.

And, the punch line:

Generic beliefs about social groups, just like other generic beliefs, are typically removed from the underlying statistics.

**Statistics vs. stereotypes**

Bian and Cimpian follow up with two examples:

More people hold the generic belief that Muslims are terrorists than hold the generic belief that Muslims are female. However, there are vastly more Muslims who are female than there are Muslims who are terrorists. . . .

Compare, for instance, “Asians are really good at math” and “Asians are right-handed.” Many more people would agree with the former generic claim than with the latter, while simultaneously being aware that the statistics go the opposite way.

OK, let’s unpack these. Here the statistics are so obviously counter to the stereotype that there has to be something else going on. In this case, I’d say the relevant statistical probabilities are not that Muslims are likely to be terrorists, or that Asians are more likely to be math whizzes, but that Muslims are *more likely* than other groups to be terrorists, or that Asians are *more likely* than other groups to be math whizzes. Maybe these statements aren’t correct either (I guess it would all depend on how all these things are defined), but that would seem to be the statistics to look at.
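The absolute-versus-relative point can be made concrete with the mosquito example (the carrier rates below are invented for illustration):

```python
# Hypothetical carrier rates, chosen only to illustrate the point:
# the absolute rate P(WNV | mosquito) is tiny, but the rate *relative to
# other insects* is large, which is arguably what the generic tracks.
p_wnv_given_mosquito = 0.008      # under 1%, as in the quoted example
p_wnv_given_other_insect = 0.0001

relative_rate = p_wnv_given_mosquito / p_wnv_given_other_insect
print(f"absolute rate: {p_wnv_given_mosquito:.1%}")          # 0.8%
print(f"relative rate: {relative_rate:.0f}x other insects")  # 80x
```

A generic can ride on the relative rate even when the absolute rate is negligible.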

The stereotypes of a group, that is, would seem to be defined relative to other groups.

This does not tell the whole story either, though, as I’m sure that lots of stereotyping is muddled by what Kahneman and Tversky called availability bias.

Bian and Cimpian continue—you can read the whole thing—by discussing whether stereotypes should be considered as “generic beliefs” or “statistical beliefs.” As a statistician I’m not so comfortable with this distinction—I’m inclined to feel that generic beliefs are also a form of statistical belief, if the statistical question is framed the right way—but I do think they’re on to something in trying to pin down what people are thinking when they use stereotypes in their reasoning.

**P.S.** I sent the above to Susan, who added:

The issues you’re raising are ones that have been discussed a fair amount in the literature. Some of these ideas have been studied with experiments, but others have not (i.e., they’ve been discussed but not formally tested).

I agree that statistical info goes beyond just P(feature|category) (e.g., P(West Nile Virus|mosquito)). As I think you’re saying, one could also ask: what about distinctiveness, which is the opposite — P(category|feature) (e.g., P(mosquito|WNV))? Although distinctiveness can make a generic more acceptable, generics need not be distinctive (e.g., “Lions eat meat”; “Dogs are 4-legged”; “Cats have good hearing” are all non-distinctive but good generics). There are even properties that are relatively infrequent (i.e., true of less than half the category) and are non-distinctive, but make good generics (e.g., “Ducks lay eggs”; “Goats produce milk”; “Peacocks are colorful”). Finally, there are features that are frequent and distinctive but don’t (ordinarily) make good generics (e.g., “People are right-handed”; “Bees are sterile”; “Turtles die in infancy”).

I think that people are doing some assessment of how conceptually central a feature is, where centrality could be cued by any of a number of factors, including: prevalence, distinctiveness, danger/harm/threat (we have data on this as well — dangerous features make for better generics than benign features), and biological folk theories (e.g., features that only adults have are more likely to be in generics than features that only babies have — e.g., we say “Swans are beautiful”, not “Swans are ugly”).

This in turn gives me two thoughts:

1. Why we think swans are beautiful . . . that’s an interesting one, I’m sure there’s been lots written about that!

2. “People are right-handed” . . . that’s a great example. We are much more likely to be right-handed, compared to other animals (which generally have weak or no hand preference). And the vast majority of people are righties. Yet, saying “people are right-handed” does seem wrong. On the other hand, if 80% of cats, say, were right-handed, maybe we’d be ok saying “cats are right-handed.” I guess there must be some kind of Grice going on here too.

**P.P.S.** In comments, Chris Martin points to further responses by Jussim and others.

The post Are stereotypes statistically accurate? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post George Orwell on the Olympics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>If you wanted to add to the vast fund of ill-will existing in the world at this moment, you could hardly do it better than by a series of football matches between Jews and Arabs, Germans and Czechs, Indians and British, Russians and Poles, and Italians and Jugoslavs, each match to be watched by a mixed audience of 100,000 spectators. I do not, of course, suggest that sport is one of the main causes of international rivalry; big-scale sport is itself, I think, merely another effect of the causes that have produced nationalism. Still, you do make things worse by sending forth a team of eleven men, labelled as national champions, to do battle against some rival team, and allowing it to be felt on all sides that whichever nation is defeated will “lose face”.

I’m a sports fan myself but I can see his point.

The post George Orwell on the Olympics appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post You won’t be able to forget this one: Alleged data manipulation in NIH-funded Alzheimer’s study appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Nick Menzies writes:

I thought you might be interested in this case in our local news in Boston.

This is a case of alleged data manipulation as part of a grant proposal, with the (former) lead statistician as the whistleblower. Is a very large grant, so high stakes both in terms of reputation and money.

It seems that the courts have sided with the alleged data manipulators over the whistleblower.

I assume this does not look good for the statistician. Separate from the issue of whether data manipulation (or analytic searching) occurred, it makes clear that calling out malfeasance comes with huge professional risk.

The original link no longer works but a search yields the record of the case here. In a news report, Michael Macagnone writes:

Kenneth Jones has alleged that the lead author of the study used to justify the grants, Dr. Ronald Killiany, falsified data by remeasuring certain MRI scans. . . .

It was Jones’s responsibility as chief statistician in the 2001 study—an examination of whether brain scans could determine who would contract Alzheimer’s disease—to verify the reliability of data . . . he alleges he was fired from his position on the study for questioning the use of the information. . . .

I know nothing at all about this case.

The post You won’t be able to forget this one: Alleged data manipulation in NIH-funded Alzheimer’s study appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Bootstrapping your posterior appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I bumped into your paper with John Carlin, Beyond Power Calculations, and encountered your concept of the hypothetical replication of the point estimate. In my own work I have used a similarly structured (but for technical reasons, differently motivated) concept which I have informally been calling the “consensus posterior.”

Specifically, supposing a prior distribution for the true effect size D, I observe the experimental data and compute the posterior belief over values of D. Then I consider the “replication thought experiment” as follows:

1. Assuming my *posterior* as the distribution for true value of D …

2. Consider the posterior distribution that another researcher would obtain if they performed the identical experiment, assuming they hold the same prior but that the true distribution of D is as my posterior distribution

3. I then get a distribution over posterior distributions that I think of as the set of beliefs other researchers might hold, given they start from the same assumptions and have the same experimental capability that I do. Then I can calculate various point values from this “consensus” distribution. As I’m sure is clear to you, the consensus distribution is always much wider and less spiky than even the semi-smoothed distributions arising from a weakly-informative prior.

4. In my field (automated control systems, distributed computing, sensor fusion, and some elements of machine learning), I have found this provides a much more stable and well-conditioned signal for triggering automated sensor-driven behaviors than conventional techniques (even conventional Bayesian techniques). We are especially sensitive in our work to multimodal distributions and especially relative magnitudes of local peaks (because we often want to trigger a complex response or set of responses, not just get one average-case point estimate).

5. perhaps the most important

Now, having just read your paper, the variant I describe above seems “obvious,” so I wonder if you have thought on this subject before and could point me to any additional resources. Or, perhaps, if you see some fatal flaw that I have missed I would appreciate being told about that as well. “Works in the field” doesn’t necessarily mean “sound in principle,” and I would like to know if I am treading somewhere dangerous.

My reply: This could work, but much depends on how precise is your posterior distribution. If your data and your prior are weak, you could still have problems with your distribution being under-regularized (that is, being too far from zero).
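To see both the construction and the caveat, here is a small conjugate normal-normal sketch of my correspondent's steps 1-3 (all numbers hypothetical): draw a "true" D from the current posterior, simulate a replicate experiment, recompute the posterior, and pool draws across replicates.

```python
import random, statistics

random.seed(1)

# Conjugate normal-normal setup; all numbers are hypothetical:
prior_mu, prior_sd = 0.0, 2.0   # shared prior on the true effect D
sigma = 1.0                     # known sd of a single experiment's estimate
y = 1.5                         # our observed estimate

def posterior(y_obs):
    # Standard normal-normal conjugate update
    post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / sigma**2)
    post_mean = post_var * (prior_mu / prior_sd**2 + y_obs / sigma**2)
    return post_mean, post_var**0.5

post_mean, post_sd = posterior(y)

# "Consensus" thought experiment: a replicating researcher's data is
# generated with the true D drawn from *our* posterior; they then update
# the same prior.  Pool draws from all such replicate posteriors.
consensus_draws = []
for _ in range(20000):
    d_true = random.gauss(post_mean, post_sd)   # step 1: D ~ our posterior
    y_rep = random.gauss(d_true, sigma)         # step 2: their experiment
    m_rep, s_rep = posterior(y_rep)             # their posterior
    consensus_draws.append(random.gauss(m_rep, s_rep))

consensus_sd = statistics.stdev(consensus_draws)
print(post_sd, consensus_sd)  # the pooled consensus is clearly wider
```

The pooled "consensus" draws are visibly wider than the original posterior; how much wider depends on how precise that posterior is, which is the caveat about weak data and weak priors.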

The post Bootstrapping your posterior appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post I know I said I wouldn’t blog for awhile, but this one was just too good to resist appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Have you heard of the “victory pose”? It’s a way to change your body chemistry almost instantly by putting your hands above your head like you won something. That’s a striking example of how easy it is to manipulate your mood and thoughts by changing your body’s condition.

So easy to do, yet so hard to replicate . . .

Adams is a bit of a Gregg Easterbrook in that he alternates science speculation with savvy political commentary:

I’ve been watching the Democratic National Convention and wondering if this will be the first time in history that we see a candidate’s poll numbers plunge after a convention.

But even better is when he mixes the two together:

Based on what I know about the human body, and the way our thoughts regulate our hormones, the Democratic National Convention is probably lowering testosterone levels all over the country. Literally, not figuratively. And since testosterone is a feel-good chemical for men, I think the Democratic convention is making men feel less happy. They might not know why they feel less happy, but they will start to associate the low feeling with whatever they are looking at when it happens, i.e. Clinton.

Keep this up, Scott, and a Ted talk’s in your future!

You’re talking about Scott Adams. He’s not talking about you.

The post I know I said I wouldn’t blog for awhile, but this one was just too good to resist appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post DrawMyData appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>This web page is something I constructed recently. You might find it useful for making artificial datasets that demonstrate a particular point for students. At any rate, if you have any feedback on it I’d be interested to hear it. I’ve tried to keep it as simple as possible but in due course, I’d like to add more descriptive stats, an optional third variable and maybe an option to make one categorical. It might also be interesting to get students to use it themselves to construct ‘pathological’ datasets where the stats would lead them astray, to help them understand how to spot problems and the importance of diagnostic plotting.

The post DrawMyData appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Shameless little bullies claim that published triathlon times don’t replicate appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Paul Alper sends along this inspiring story of Julie Miller, a heroic triathlete who just wants to triathle in peace, but she keeps getting hassled by the replication police. Those shameless little bullies won’t let her just do her thing, instead they harp on technicalities like missing timing clips and crap like that. Who cares about missing timing clips? Her winning times were statistically significant, that’s what matters to me. And her recorded victories were peer reviewed. But, no, those second stringers can’t stop with their sniping.

I for one don’t think this running star should resist any calls for her to replicate her winning triathlon times. The replication rate of those things is statistically indistinguishable from 100%, after all! Track and field has become preoccupied with prevention and error detection—negative psychology—at the expense of exploration and discovery.

In fact, I’m thinking the American Statistical Association could give this lady the Founders Award, which hasn’t really had a worthy recipient since 2002.

The post Shameless little bullies claim that published triathlon times don’t replicate appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Bootstrapping your posterior

**Wed:** You won’t be able to forget this one: Alleged data manipulation in NIH-funded Alzheimer’s study

**Thurs:** Are stereotypes statistically accurate?

**Fri:** Will youths who swill Red Bull become adult cocaine addicts?

**Sat:** Science reporters are getting the picture

**Sun:** Modeling correlation of issue attitudes and partisanship within states

The post Documented forking paths in the Competitive Reaction Time Task appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Baruch Eitan writes:

This is some luscious garden of forking paths.

Indeed. Here’s what Malte Elson writes at the linked website:

The Competitive Reaction Time Task, sometimes also called the Taylor Aggression Paradigm (TAP), is one of the most commonly used tests to purportedly measure aggressive behavior in a laboratory environment. . . .

While the CRTT ostensibly measures how much unpleasant, or even harmful, noise a participant is willing to administer to a nonexistent confederate, that amount of noise can be extracted as a measure in myriad different ways using various combinations of volume and duration over one or more trials. There are currently 120 publications in which results are based on the CRTT, and they reported 147 different quantification strategies in total!

Elson continues:

This archive does not contain all variations of the CRTT, as some procedural differences are so substantial that their quantification strategies would be impossible to compare. . . . Given the number of different versions of the CRTT measure that can be extracted from its use in a study, it is very easy for a researcher to analyze several (or several dozen) versions of the CRTT outcome measures in a study, running hypothesis tests with one version of the measure after another until a version is found that produces the desired pattern of results. . . .
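To get a feel for how one task can yield 147 quantification strategies, it helps to just count combinations. The choice dimensions below are invented for illustration; the arithmetic is the point:

```python
from itertools import product

# Hypothetical analysis choices for a CRTT-like outcome measure:
measures = ["volume", "duration", "volume*duration"]
trials = ["first trial", "first 5", "all 25", "post-provocation only"]
transforms = ["raw", "log", "standardized", "trimmed"]

paths = list(product(measures, trials, transforms))
print(len(paths))  # 3 * 4 * 4 = 48 analyses from just three small choices
```

Three modest-looking choices already give dozens of defensible-sounding analyses; a few more choices and 147 is easy to reach.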

The post Documented forking paths in the Competitive Reaction Time Task appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Smooth poll aggregation using state-space modeling in Stan, from Jim Savage appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I just saw your post on poll bounces; have been thinking the same myself. Why are the poll aggregators so jumpy about new polls?

Annoyed, I put together a poll aggregator that took a state-space approach to the unobserved preferences; nothing more than the 8 schools (14 polls?) example with a time-varying mean process and very small updates to the state.

One of the things I was thinking of was to use aggregated demographic polling data (from the polling companies’ cross-tabs) as a basis for estimating individual states for each demographic cell, and then performing post-stratification on those. Two benefits: a) having a time-varying state deals nicely with the decaying importance of old polls, and b) getting hold of unit-level polling data for MRP is tough if you’re me (perhaps tough if you’re you?).

Here’s the plot:

A full writeup, automated data scraping, model etc. is below.

Here’s the zipfile with everything.

My only comment is that you should be able to do even better—much better—by also including party ID among the predictors in the model, then fitting a state-space model to the underlying party ID proportions and poststratifying on it as well. That would fix some of the differential nonresponse stuff we’ve been talking about.

And here’s Jim’s writeup:

This tutorial covers how to build a low-to-high frequency interpolation model in which we have possibly many sources of information that occur at various frequencies. The example I’ll use is drawing inference about the preference shares of Clinton and Trump in the current presidential campaign. This is a good example for this sort of imputation:

- Data (polls) are sporadically released. Sometimes we have many released simultaneously; at other times there may be many days with no releases.
- The various polls don’t necessarily agree. They might have different methodologies or sampling issues, resulting in quite different outcomes. We want to build a model that can incorporate this.
There are two ingredients to the polling model: a multi-measurement model, typified by Rubin’s 8 schools example, and a state-space model. Let’s briefly describe these.

**Multi-measurement model and the 8 schools example**

Let’s say we run a randomized control trial in 8 schools. Each school $i$ reports its own treatment effect $te_i$, which has a standard error $\sigma_i$. There are two questions the 8-schools model tries to answer:

- If you administer the experiment at one of these schools, say, school 1, and have your estimate of the treatment effect $te_1$, what do you expect would be the treatment effect if you were to run the experiment again? In particular, would your expectations of the treatment effect in the next experiment change once you learn the treatment effects of the other schools?
- If you roll out the experiment at a new school (school 9), what do we expect the treatment effect to be?
The statistical model that Rubin proposed is that each school has its own true latent treatment effect $y_i$, around which our treatment effects are distributed:

$$te_i \sim \mathrm{N}(y_i, \sigma_i)$$

These “true” but unobserved treatment effects are in turn distributed according to a common hyper-distribution with mean $\mu$ and standard deviation $\tau$:

$$y_i \sim \mathrm{N}(\mu, \tau)$$

Once we have priors for $\mu$ and $\tau$, we can estimate the above model with Bayesian methods.
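Conditional on the hyperparameters, the posterior mean for each school is a precision-weighted compromise between its own estimate and the group mean. A numerical sketch with made-up values (this is shrinkage given $\mu$ and $\tau$, not the full Bayesian fit):

```python
# Conditional on hyperparameters mu and tau, each school's posterior mean
# is a precision-weighted average of its estimate te_i and the group mean.
mu, tau = 8.0, 5.0            # hypothetical hyperparameter values
te_i, sigma_i = 28.0, 15.0    # one school's estimate and standard error

w = (1 / tau**2) / (1 / tau**2 + 1 / sigma_i**2)  # weight on the group mean
post_mean = w * mu + (1 - w) * te_i
print(round(post_mean, 1))  # 10.0: shrunk most of the way back toward mu
```

A noisy school estimate (large $\sigma_i$) gets pulled strongly toward the group mean; a precise one barely moves.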

**A state-space model**

State-space models are a useful way of dealing with noisy or incomplete data, like our polling data. The idea is that we can divide our model into two parts:

- **The state.** We don’t observe the state; it is a latent variable. But we know how it changes through time (or at least how large its potential changes are).
- **The measurement.** Our state is measured with imprecision. The measurement model is the distribution of the data that we observe around the state.

A simple example might be consumer confidence, an unobservable latent construct about which our survey responses should be distributed. So our state-space model would be:

The state:

$$\mathrm{conf}_t \sim \mathrm{N}(\mathrm{conf}_{t-1}, \sigma)$$

which simply says that consumer confidence is a random walk with normal innovations with a standard deviation $\sigma$, and

$$\mathrm{survey\_measure}_t \sim \mathrm{N}(\mathrm{conf}_t, \tau)$$

which says that our survey measures are normally distributed around the true latent state, with standard deviation $\tau$.

Again, once we provide priors for the initial value of the state $\mathrm{conf}_0$ and $\tau$, we can estimate this model quite easily.

The important thing to note is that we have a model for the state even if there is no observed measurement. That is, we know how consumer confidence should progress even for the periods in which there are no consumer confidence surveys. This makes state-space models ideal for data with irregular frequencies or missing data.
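A tiny simulation of the consumer-confidence setup makes this concrete (numbers hypothetical; this is a bare-bones forward filter, not the Stan model): predict the state every day, and update only on days when a survey arrives.

```python
import random

random.seed(2)
sigma, tau = 0.25, 1.0   # state innovation sd and survey noise sd (hypothetical)

# Simulate a latent random walk, with surveys observed on only some days
T = 30
conf = [50.0]
for t in range(1, T):
    conf.append(random.gauss(conf[-1], sigma))
surveys = {t: random.gauss(conf[t], tau) for t in range(T) if t % 4 == 0}

# Forward filter: predict every day, update only when a survey exists
m, v = 50.0, 1.0         # prior mean and variance for conf_0
est = []
for t in range(T):
    v += sigma**2        # predict: the state may have drifted
    if t in surveys:     # update: only on days with a measurement
        k = v / (v + tau**2)  # gain on the new survey
        m, v = m + k * (surveys[t] - m), v * (1 - k)
    est.append(m)

print(len(est))  # an estimate for every day, though only 8 days had surveys
```

The filter produces an estimate (with growing uncertainty) even on survey-free days, which is exactly why the state-space framing handles irregular polls so naturally.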

**Putting it together**

As you can see, these two models are very similar: they involve making inference about a latent quantity from noisy measurements. The first shows us how we can aggregate many noisy measurements together *within a single time period*, while the second shows us how to combine irregular noisy measures *over time*. We can now combine these two models to aggregate multiple polls over time.

The data generating process I had in mind is a very simple model where each candidate’s preference share is an unobserved state, which polls try to measure. Unlike some volatile poll aggregators, I assume that the unobserved state can move according to a random walk with normal disturbances of standard deviation 0.25%. This greatly smoothes out the sorts of fluctuations we see around the conventions etc.

That is, we have the state for candidate $c$ in time $t$ evolving according to

$$\mathrm{Vote\ share}_{c,t} \sim \mathrm{N}(\mathrm{Vote\ share}_{c,t-1}, 0.25)$$

with measurements being made of this in the polls. Each poll $p$ at time $t$ is distributed according to

$$\mathrm{poll}_{c,p,t} \sim \mathrm{N}(\mathrm{Vote\ share}_{c,t}, \tau)$$

I give an initial state prior of 50% to Clinton and a 30% prior to Trump in May of last year. As we get further from that initial period, the impact of the prior is dissipated.
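This data generating process is easy to simulate directly (a sketch, not part of the original code; the poll noise is set to 2 points purely for illustration):

```python
import random

random.seed(3)
T, tau = 100, 2.0   # days, and a hypothetical poll noise sd in points

# State: the vote share follows a random walk with sd-0.25 daily steps
share = [50.0]      # the 50% initial prior mean used for Clinton
for t in range(1, T):
    share.append(random.gauss(share[-1], 0.25))

# Measurement: on days when a poll happens, it is noise around the state
polls = [(t, random.gauss(share[t], tau))
         for t in range(T) if random.random() < 0.3]

drift = max(share) - min(share)
print(len(polls), round(drift, 2))  # the state drifts slowly; polls scatter widely
```

With daily innovations of only 0.25 points, the latent share moves far less day to day than the polls do, which is what produces the smoothing in the fitted model.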

The code to download the data and run the model (in the attached zip file) is below.

```r
library(rvest); library(dplyr); library(ggplot2); library(rstan); library(reshape2); library(stringr); library(lubridate)
options(mc.cores = parallel::detectCores())
source("theme.R")

# The polling data
realclearpolitics_all <- read_html("http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html#polls")

# Scrape the data
polls <- realclearpolitics_all %>%
  html_node(xpath = '//*[@id="polling-data-full"]/table') %>%
  html_table() %>%
  filter(Poll != "RCP Average")

# Function to convert string dates to actual dates
get_first_date <- function(x){
  last_year <- cumsum(x=="12/22 - 12/23")>0
  dates <- str_split(x, " - ")
  dates <- lapply(1:length(dates), function(x) as.Date(paste0(dates[[x]],
                  ifelse(last_year[x], "/2015", "/2016")), format = "%m/%d/%Y"))
  first_date <- lapply(dates, function(x) x[1]) %>% unlist
  second_date <- lapply(dates, function(x) x[2]) %>% unlist
  data_frame(first_date = as.Date(first_date, origin = "1970-01-01"),
             second_date = as.Date(second_date, origin = "1970-01-01"))
}

# Convert dates to dates, impute MoE for missing polls with average of non-missing,
# and convert MoE to standard deviation (assuming MoE is the full 95% two sided interval length)
polls <- polls %>%
  mutate(start_date = get_first_date(Date)[[1]],
         end_date = get_first_date(Date)[[2]],
         N = as.numeric(gsub("[A-Z]*", "", Sample)),
         MoE = as.numeric(MoE)) %>%
  select(end_date, `Clinton (D)`, `Trump (R)`, MoE) %>%
  mutate(MoE = ifelse(is.na(MoE), mean(MoE, na.rm = T), MoE),
         sigma = MoE/4) %>%
  arrange(end_date)

# Stretch out to get missing values for days with no polls
polls3 <- left_join(data_frame(end_date = seq(from = min(polls$end_date),
                                              to = as.Date("2016-08-04"),
                                              by = "day")), polls) %>%
  group_by(end_date) %>%
  mutate(N = 1:n()) %>%
  rename(Clinton = `Clinton (D)`, Trump = `Trump (R)`)

# One row for each day, one column for each poll on that day, -9 for missing values
Y_clinton <- polls3 %>% dcast(end_date ~ N, value.var = "Clinton") %>%
  dplyr::select(-end_date) %>% as.data.frame %>% as.matrix
Y_clinton[is.na(Y_clinton)] <- -9
Y_trump <- polls3 %>% dcast(end_date ~ N, value.var = "Trump") %>%
  dplyr::select(-end_date) %>% as.data.frame %>% as.matrix
Y_trump[is.na(Y_trump)] <- -9

# Do the same for margin of errors for those polls
sigma <- polls3 %>% dcast(end_date ~ N, value.var = "sigma") %>%
  dplyr::select(-end_date) %>% as.data.frame %>% as.matrix
sigma[is.na(sigma)] <- -9

# Run the two models
clinton_model <- stan("state_space_polls.stan",
                      data = list(T = nrow(Y_clinton), polls = ncol(Y_clinton),
                                  Y = Y_clinton, sigma = sigma, initial_prior = 50))
trump_model <- stan("state_space_polls.stan",
                    data = list(T = nrow(Y_trump), polls = ncol(Y_trump),
                                Y = Y_trump, sigma = sigma, initial_prior = 30))

# Pull the state vectors
mu_clinton <- extract(clinton_model, pars = "mu", permuted = T)[[1]] %>% as.data.frame
mu_trump <- extract(trump_model, pars = "mu", permuted = T)[[1]] %>% as.data.frame

# Rename to get dates
names(mu_clinton) <- unique(paste0(polls3$end_date))
names(mu_trump) <- unique(paste0(polls3$end_date))

# Summarise uncertainty for each date
mu_ts_clinton <- mu_clinton %>% melt %>%
  mutate(date = as.Date(variable)) %>%
  group_by(date) %>%
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975),
            candidate = "Clinton")
mu_ts_trump <- mu_trump %>% melt %>%
  mutate(date = as.Date(variable)) %>%
  group_by(date) %>%
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975),
            candidate = "Trump")

# Plot results
bind_rows(mu_ts_clinton, mu_ts_trump) %>%
  ggplot(aes(x = date)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = candidate), alpha = 0.1) +
  geom_line(aes(y = median, colour = candidate)) +
  ylim(30, 60) +
  scale_colour_manual(values = c("blue", "red"), "Candidate") +
  scale_fill_manual(values = c("blue", "red"), guide = F) +
  geom_point(data = polls3, aes(x = end_date, y = Clinton), size = 0.2, colour = "blue") +
  geom_point(data = polls3, aes(x = end_date, y = Trump), size = 0.2, colour = "red") +
  lendable::theme_lendable() +
  xlab("Date") + ylab("Implied vote share") +
  ggtitle("Poll aggregation with state-space smoothing",
          subtitle = paste("Prior of 50% initial for Clinton, 30% for Trump on",
                           min(polls3$end_date)))
```

The post Smooth poll aggregation using state-space modeling in Stan, from Jim Savage appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “What can recent replication failures tell us about the theoretical commitments of psychology?” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Psychology/philosophy professor Stan Klein was motivated by our power pose discussion to send along this article which seems to me to be a worthy entry in what I’ve lately been calling “the literature of exasperation,” following in the tradition of Meehl etc.

I offer one minor correction. Klein writes, “I have no doubt that the complete reinstatement of experimental conditions will ensure a successful replication of a task’s outcome.” I think this statement is too optimistic. Redo the same experiment on the same people but re-randomize, and anything can happen. If the underlying effect is near zero (as I’d guess is the case, for example, in the power pose example), then there’s no reason to expect success even in an exact replication.

More to the point is Klein’s discussion of the nature of theorizing in psychology research. Near the end of his article he discusses the materialist doctrine “that reality, in its entirety, must be composed of quantifiable, material substances.”

That reminds me of one of the most ridiculous of many ridiculous hyped studies in the past few decades, a randomized experiment purporting to demonstrate the effectiveness of intercessory prayer (p=.04 after performing 3 hypothesis tests, not that anyone’s counting; Deb and I mention it in our Teaching Statistics book). What amazed me about this study—beyond the philosophically untenable (to me) idea that God is unable to interfere with the randomization but will go to the trouble of improving the health of the prayed-for people by some small amount, just enough to assure publication—was *the effort the researchers put in to diminish any possible treatment effect*.

It’s reasonable to think that prayer could help people in many ways, for example it is comforting to know that your friends and family care enough about your health to pray for it. But in this study they chose people to pray who had no connection to the people prayed for—and the latter group were not even told of the intervention. The experiment was explicitly designed to remove all but supernatural effects, somewhat in the manner that a magician elaborately demonstrates that there are no hidden wires, nothing hidden in the sleeves, etc. Similarly with Bargh’s embodied cognition study: the elderly words were slipped into the study so unobtrusively as to almost remove any chance they could have an effect.

I suppose if you tell participants to think about elderly people and then they walk slower, this is boring; it only reaches the status of noteworthy research if the treatment is imperceptible. Similarly for other bank-shot ideas such as the correlation between menstrual cycle and political attitudes. There seems to be something that pushes researchers to attenuate their treatments to zero, at which point they pull out the usual bag of tricks to attain statistical significance. It’s as if they were taking ESP research as a model. See discussion here on “piss-poor omnicausal social science.”

Klein’s paper, “The Unplanned Obsolescence of Psychological Science and an Argument for Its Revival”, is relevant to this discussion.

The post “What can recent replication failures tell us about the theoretical commitments of psychology?” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Don’t believe the bounce appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Alan Abramowitz sent us the above graph, which shows the results from a series of recent national polls, for each plotting Hillary Clinton’s margin in support (that is, Clinton minus Trump in the vote-intention question) vs. the Democratic Party’s advantage in party identification (that is, percentage Democrat minus percentage Republican).

This is about as clear a pattern as you’ll ever see in social science: Swings in the polls are driven by swings in differential nonresponse. After the Republican convention, Trump supporters were stoked, and they were more likely to respond to surveys. After the Democratic convention, the reverse: Democrats are more likely to respond, driving Clinton up in the polls.

David Rothschild and I have the full story up at Slate:

You sort of know there is a convention bounce that you should sort of ignore, but why? What’s actually in a polling bump? The recent Republican National Convention featured conflict and controversy and one very dark acceptance speech—enlivened by some D-list celebrities (welcome back Chachi!)—but it was still enough to give nominee Donald Trump a big, if temporary, boost in many polls. This swing, which occurs predictably in election after election, is typically attributed to the persuasive power of the convention, with displays of party unity persuading partisans to vote for their candidate and cross-party appeals coaxing over independents and voters of the other party.

Recent research, however, suggests that swings in the polls can often be attributed not to changes in voter intention but in changing patterns of survey nonresponse: What seems like a big change in public opinion turns out to be little more than changes in the inclinations of Democrats and Republicans to respond to polls. We learned this from a study we performed [with Sharad Goel and Wei Wang] during the 2012 election campaign using surveys conducted on the Microsoft Xbox. . . .

Our Xbox study showed that very few respondents were changing their vote preferences—less than 2 percent during the final month of the campaign—and that most, fully two-thirds, of the apparent swings in the polls (for example, a big surge for Mitt Romney after the first debate) were explainable by swings in the percentages of Democrats and Republicans responding to the poll. This nonresponse is very loosely correlated with likeliness to vote but mainly reflects passing inclinations to participate in polling. . . . large and systematic changes in nonresponse had the effect of amplifying small changes in actual voter intention. . . .

[See this paper, also with Doug Rivers, with more, including supporting information from other polls.]

We can apply these insights to the 2016 convention bounces. For example, Reuters/Ipsos showed a swing from a 15-point Clinton lead on July 14 to a 2-point Trump lead on July 27. Who was responding in these polls? The pre-convention survey saw 53 percent Democrats, 38 percent Republican, and the rest independent or supporters of other parties. The post-convention respondents looked much different, at 46 percent Democrat, 43 percent Republican. The 17-point swing in the horse-race gap came with a 12-point swing in party identification. Party identification is very stable, and there is no reason to expect any real swings during that period; thus, it seems that about two-thirds of the Clinton-Trump swing in the polls comes from changes in response rates. . . .

Read the whole thing.

The political junkies among you have probably been seeing all sorts of graphs online showing polls and forecasts jumping up and down. These calculations typically don’t adjust for party identification (an idea we wrote about back in 2001, but without realizing the political implications that come from systematic, rather than random, variation in nonresponse) and thus can vastly overestimate swings in preferences.
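As a rough check on the Reuters/Ipsos arithmetic quoted above, here is a back-of-the-envelope sketch in Python. The simplifying assumption (mine, for illustration) is that party identification is stable in the population and identifiers vote the party line, so each point of swing in D-minus-R identification among respondents produces a point of swing in the reported margin:

```python
# Reuters/Ipsos toplines quoted above (Clinton-minus-Trump margin, in points)
pre_margin, post_margin = 15, -2
margin_swing = pre_margin - post_margin  # 17-point swing in the horse race

# Party identification of respondents (Democrat minus Republican, in points)
pre_id, post_id = 53 - 38, 46 - 43
id_swing = pre_id - post_id              # 12-point swing in who answered the poll

# Under the party-line assumption, the share of the poll swing attributable
# to differential nonresponse rather than changed minds is roughly:
share_nonresponse = id_swing / margin_swing
print(round(share_nonresponse, 2))
```

That ratio, 12/17, is where the "about two-thirds" figure in the excerpt comes from.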

The post Don’t believe the bounce appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post The p-value is a random variable appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>P values from identical experiments can differ greatly in a way that is surprising to many. The failure to appreciate this wide variability can lead researchers to expect, without adequate justification, that statistically significant findings will be replicated, only to be disappointed later.

I agree that the randomness of the p-value—the fact that it is a function of data and thus has a sampling distribution—is an important point that is not well understood. Indeed, I think that the z-transformation (the normal cdf, which takes a z-score and transforms it into a p-value) is in many ways a horrible thing, in that it takes small noisy differences in z-scores and elevates them into the apparently huge differences between p=.1, p=.01, p=.001. This is the point of the paper with Hal Stern, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” The p-value, like any data summary, is a random variable with a sampling distribution.
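The variability described in the quoted passage above is easy to simulate. In this sketch (standard-library Python; the true expected z-score of 2 is an arbitrary choice of mine, giving a "typical" p-value near 0.05), identical experiments produce p-values that span orders of magnitude:

```python
import math
import random

def p_two_sided(z):
    """Two-sided p-value for a z-score, via the normal cdf."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
# Replications of the same experiment: each draws z ~ Normal(2, 1)
pvals = sorted(p_two_sided(random.gauss(2, 1)) for _ in range(10_000))

# 10th percentile, median, and 90th percentile of the replication p-values:
# the central 80% of identical experiments covers a huge range of p.
print(pvals[1_000], pvals[5_000], pvals[9_000])
```

Roughly, the middle experiment lands near p = 0.05, but a tenth of the replications come in below 0.01 and another tenth above 0.4, which is the point about expecting replication of "significant" findings without justification.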

Incidentally, I have the same feeling about cross-validation-based estimates and even posterior distributions: all of these are functions of the data and thus have sampling distributions, but theoreticians and practitioners alike tend to forget this and instead treat them as truths.

A problem with this particular article is that it takes p-values at face value, whereas in real life p-values typically are the product of selection, as discussed by Uri Simonsohn et al. a few years ago in their “p-hacking” article and as discussed by Eric Loken and myself a couple years ago in our “garden of forking paths” article. I think real-world p-values are much more optimistic than the nominal p-values discussed by Lazzeroni et al. But in any case I think they’re raising an important point that’s been under-emphasized in textbooks and in the statistics literature.

The post The p-value is a random variable appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Guy Fieri wants your help! For a TV show on statistical models for real estate appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I got the following email from David Mulholland:

I’m a producer at Citizen Pictures where we produce Food Network’s “Diners, Dives and Drive-Ins” and Bravo’s digital series, “Going Off The Menu,” among others. A major network is working with us to develop a show that pits “data” against a traditional real estate agent to see who can find a home buyer the best house for them. In this show, both the real estate agent and the data team each choose two properties using their very different methods. The show will ask the question: “Who will do a better job of figuring out what the client wants, ‘data’ or the traditional real estate agent?”

TV and real estate are two topics I know nothing about, so I pointed Mulholland to some Finnish dudes who do sophisticated statistical modeling of the housing market. They didn’t think it was such a good fit for them, with Janne Sinkkonen remarking that “Models are good at finding trends and averages from large, geographically or temporally sparse data. The richness of a single case, seen on the spot, is much better evaluated by a human.”

That makes sense, but it is also possible that a computer-assisted human can do better than a human alone. Say you have a model that gives quick price estimates for every house. Those estimates are sitting on the computer. A human then goes to house X and assesses its value at, say, $350,000. The human then looks up and sees that the computer gave an assessment, based on some fitted algorithm, of $420,000. What does the human conclude? Not necessarily that the computer is wrong; rather, at this point the human can introspect and consider why the computer estimate is so far off. What features of the house make it so much less valuable than the computer “thinks”? Perhaps some features not incorporated into the computer’s model, for example the state of the interior of the house, or a bad paint job and unkempt property, or something about the location that had not been in the model. This sort of juxtaposition can be valuable.
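To make the juxtaposition described above concrete, here is a hypothetical sketch; the listings, the dollar figures beyond the house-X example, and the 10% tolerance are all made up for illustration:

```python
# Hypothetical side-by-side of human and model assessments:
# flag houses where the two disagree enough to be worth introspecting on.
listings = [
    ("house X", 350_000, 420_000),  # the example from the text: human vs. model
    ("house Y", 610_000, 598_000),  # invented: close enough to pass
]

def worth_a_second_look(human, model, tol=0.10):
    """Flag when estimates differ by more than tol, as a fraction of the model's."""
    return abs(human - model) / model > tol

flagged = [name for name, human, model in listings
           if worth_a_second_look(human, model)]
print(flagged)
```

The flagged houses are exactly where the human should ask what the model is missing (interior condition, paint job, location quirks), or what the human is missing.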

That said, I still know nothing about real estate or about what makes good TV, so I offered to post Mulholland’s question here. He said sure, and added:

I’m particularly delighted to hear your analysis of a “computer-assisted human” as that is a direction we have been investigating. Simply put, we do not have the resources to implement any sort of fully computerized solution. I think the computer-assisted human is definitely a direction we would take.

I’d love to hear the thoughts of blog readers. At the moment, the big question we are considering is, “Assuming that we have full access to a user’s data (with the user’s cooperation of course . . . data examples include Facebook, web browser history, online shopping history, geotracking, etc), how can we use human and computer to best sort through this data to find the house the user will like the most?”

Ball’s in your court now, TV-savvy blog readers!

The post Guy Fieri wants your help! For a TV show on statistical models for real estate appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Amazon NYC decision analysis jobs appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Amazon is having a hiring event (Sept 8/9) here in NYC. If you are interested in working on demand forecasting either here in NYC or in Seattle send your resume to rcariapp@amazon.com by September 1st, 2016.

Here’s the longer blurb:

Amazon Supply Chain Optimization Technologies (SCOT) builds systems that automate decisions in Amazon’s supply chain. These systems are responsible for predicting customer demand; optimization of sourcing, buying and placement of all inventory, and ensuring optimal customer experience from an order fulfillment promise perspective. In other words, our systems automatically decide how much to buy of which items, from which vendors/suppliers, which fulfilment centers to put them in, how to get it there, how to get it to the customer and what to promise customers – all while maximizing customer satisfaction and minimizing cost.

Could be interesting, and it’s always fun to work on real decision problems!

The post Amazon NYC decision analysis jobs appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post In policing (and elsewhere), regional variation in behavior can be huge, and perhaps give a clue about how to move forward. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Here’s Sethi:

Moskos is not arguing here that the police can do no wrong; he is arguing instead that in the aggregate, whites and blacks are about equally likely to be victims of bad shootings. . . .

Moskos offers another, quite different reason why bias in individual incidents might not be detected in aggregate data: large regional variations in the use of lethal force.

To see the argument, consider a simple example of two cities that I’ll call Eastville and Westchester. In each of the cities there are 500 police-citizen encounters annually, but the racial composition differs: 40% of Eastville encounters and 20% of Westchester encounters involve blacks. There are also large regional differences in the use of lethal force: in Eastville 1% of encounters result in a police killing while the corresponding percentage in Westchester is 5%. That’s a total of 30 killings, 5 in one city and 25 in the other.

Now suppose that there is racial bias in police use of lethal force in both cities. In Eastville, 60% of those killed are black (instead of the 40% we would see in the absence of bias). And in Westchester the corresponding proportion is 24% (instead of the no-bias benchmark of 20%). Then we would see 3 blacks killed in one city and 6 in the other. That’s a total of 9 black victims out of 30. The black share of those killed is 30%, which is precisely the black share of total encounters. Looking at the aggregate data, we see no bias. And yet, by construction, the rate of killing per encounter reflects bias in both cities.
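Sethi's two-city arithmetic checks out; here it is as a short sketch, with the numbers taken directly from the example above:

```python
# Encounters, black share of encounters, killings per encounter, and black
# share of those killed, per the hypothetical Eastville/Westchester example.
cities = [
    # (name, encounters, black share, kill rate, black share of victims)
    ("Eastville",   500, 0.40, 0.01, 0.60),  # biased: 60% vs. 40% benchmark
    ("Westchester", 500, 0.20, 0.05, 0.24),  # biased: 24% vs. 20% benchmark
]

encounters       = sum(n for _, n, *_ in cities)
black_encounters = sum(n * b for _, n, b, _, _ in cities)
killed           = sum(n * k for _, n, _, k, _ in cities)
black_killed     = sum(n * k * v for _, n, _, k, v in cities)

# Both aggregate shares come out to exactly 30%: bias in each city,
# invisible in the pooled data.
print(black_encounters / encounters, black_killed / killed)
```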

This is just a simple example to make a logical point. Does it have empirical relevance? Are regional variations in killings large enough to have such an effect? Here is Moskos again:

Last year in California, police shot and killed 188 people. That’s a rate of 4.8 per million. New York, Michigan, and Pennsylvania collectively have 3.4 million more people than California (and 3.85 million more African Americans). In these three states, police shot and killed… 53 people. That’s a rate of 1.2 per million. That’s a big difference.

Were police in California able to lower their rate of lethal force to the level of New York, Michigan, and Pennsylvania… 139 fewer people would be killed by police. And this is just in California… If we could bring the national rate of people shot and killed by police (3 per million) down to the level found in, say, New York City… we’d reduce the total number of people killed by police 77 percent, from 990 to 231!

This is a staggeringly large effect.

Additional evidence for large regional variations comes from a recent report by the Center for Policing Equity. The analysis there is based on data provided voluntarily by a dozen (unnamed) departments. Take a close look at Table 6 in that document, which reports use of force rates per thousand arrests. The medians for lethal force are 0.29 and 0.18 for blacks and whites respectively, but the largest recorded rates are much higher: 1.35 for blacks and 3.91 for whites. There is at least one law enforcement agency that is killing whites at a rate more than 20 times greater than that of the median agency.

On the reasons for these disparities, one can only speculate:

I really don’t know what some departments and states are doing right and others wrong. But it’s hard for me to believe that the residents of California are so much more violent and threatening to cops than the good people of New York or Pennsylvania. I suspect lower rates of lethal force has a lot to do with recruitment, training, verbal skills, deescalation techniques, not policing alone, and more restrictive gun laws.

This is all important in its own right but I also wanted to highlight it as an example of a more general principle about different levels of variation when considering policy interventions.

One of my favorite examples here is smoking: it’s really hard to have an individual-level intervention to help people quit smoking. But aggregate interventions, such as banning indoor smoking, seem to work. This seems a bit paradoxical: after all, aggregate changes are nothing but aggregations of individual changes, so how could it be easier to change the smoking behavior of many thousands of people, than to change behaviors one at a time? But that’s how it is. Individual decisions are not so individual, as is most obvious, perhaps, in the variation across populations and across eras in family size: nowadays, it’s trendy in the U.S. to have 3 kids; a couple decades back, 2 was the standard; and a few decades earlier, 4-child families were common. We make our individual choices based on what other people are doing. And, again, it’s really hard to quit smoking, which can make it seem like smoking is as inevitable as death or taxes, but smoking rates vary a lot by country, and by state within this country.

To return to the policing example, we’ve had lots of discussion about whether or not particular cops or particular police departments are racially biased—lots of comparisons *within* cities—but Moskos argues we have not been thinking hard enough about comparisons *between* cities. An interesting point, and it would be good to see it on the agenda.

The post In policing (and elsewhere), regional variation in behavior can be huge, and perhaps give a clue about how to move forward. appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post My next 170 blog posts (inbox zero and a change of pace) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>And one result was to fill up the blog through mid-January.

I think I’ve been doing enough blogging recently, so my plan now is to stop for awhile and instead transfer my writing energy into articles and books. We’ll see how it goes.

Just to give you something to look forward to, below is a list of what’s in the queue. I’m sure I’ll be interpolating new posts on this and that—there is an election going on, after all; indeed, I just inserted a politics-related post two days ago. And other things come up sometimes that just can’t wait. Also, my co-bloggers are free to post on Stan or whatever else they want, whenever they want.

But this is what’s on deck so far:

In policing (and elsewhere), regional variation in behavior can be huge, and perhaps can give a clue about how to move forward.

Guy Fieri wants your help! For a TV show on statistical models for real estate

The p-value is a random variable

“What can recent replication failures tell us about the theoretical commitments of psychology?”

Documented forking paths in the Competitive Reaction Time Task

Shameless little bullies claim that published triathlon times don’t replicate

Bootstrapping your posterior

You won’t be able to forget this one: Alleged data manipulation in NIH-funded Alzheimer’s study

Are stereotypes statistically accurate?

Will youths who swill Red Bull become adult cocaine addicts?

Science reporters are getting the picture

Modeling correlation of issue attitudes and partisanship within states

Tax Day: The Birthday Dog That Didn’t Bark

The history of characterizing groups of people by their averages

~~Calorie labeling reduces obesity~~ Obesity increased more slowly in California, Seattle, Portland (Oregon), and NYC, compared to some other places in the west coast and northeast that didn’t have calorie labeling

What’s gonna happen in November?

An ethnographic study of the “open evidential culture” of research psychology

Things that sound good but aren’t quite right: Art and research edition

Michael Porter as new pincushion

Kaiser Fung on the ethics of data analysis

One more thing you don’t have to worry about

Evil collaboration between Medtronic and FDA

His varying slopes don’t seem to follow a normal distribution

A day in the life

Letters we never finished reading

Better to just not see the sausage get made

Oooh, it burns me up

Birthdays and heat waves

Publication bias occurs within as well as between projects

Graph too clever by half

Take that, Bruno Frey! Pharma company busts through Arrow’s theorem, sets new record!

A four-way conversation on weighting and regression for causal inference

How paracompact is that?

In Bayesian regression, it’s easy to account for measurement error

Garrison Keillor would be spinning etc

“Brief, decontextualized instances of colaughter”

The new quantitative journalism

It’s not about normality, it’s all about reality

Hokey mas, indeed

You may not be interested in peer review, but peer review is interested in you

Hypothesis Testing is a Bad Idea (my talk at Warwick, England, 2:30pm Thurs 15 Sept)

Genius is not enough: The sad story of Peter Hagelstein, living monument to the sunk-cost fallacy

Bayesian Statistics Then and Now

No guarantee

Let’s play Twister, let’s play Risk

“Evaluating Online Nonprobability Surveys”

Redemption

Pro Publica Surgeon Scorecard Update

Hey, PPNAS . . . this one is the fish that got away

FDA approval of generic drugs: The untold story

Acupuncture paradox update

More p-value confusion: No, a low p-value does not tell you that the probability of the null hypothesis is less than 1/2

Multicollinearity causing risk and uncertainty

Andrew Gelman is not the plagiarism police because there is no such thing as the plagiarism police.

Cracks in the thin blue line

Politics and chance

I refuse to blog about this one

“Find the best algorithm (program) for your dataset.”

NPR’s gonna NPR

Why the garden-of-forking-paths criticism of p-values is not like a famous Borscht Belt comedy bit

Don’t trust Rasmussen polls!

Astroturf “patient advocacy” group pushes to keep drug prices high

It’s not about the snobbery, it’s all about reality: At last, I finally understand hatred of “middlebrow”

The never-back-down syndrome and the fundamental attribution error

Michael Lacour vs John Bargh and Amy Cuddy

It’s ok to criticize

“The Prose Factory: Literary Life in England Since 1918” and “The Windsor Faction”

Note to journalists: If there’s no report you can read, there’s no study

Heimlich

No, I don’t think the Super Bowl is lowering birth weights

Gray graphs look pretty

Should you abandon that low-salt diet?

Transparency, replications, and publication

Is it fair to use Bayesian reasoning to convict someone of a crime?

“Marginally Significant Effects as Evidence for Hypotheses: Changing Attitudes Over Four Decades”

Some people are so easy to contact and some people aren’t.

Should Jonah Lehrer be a junior Gladwell? Does he have any other options?

Advice on setting up audio for your podcast

The Psychological Science stereotype paradox

We have a ways to go in communicating the replication crisis

Authors of AJPS paper find that the signs on their coefficients were reversed. But they don’t care: in their words, “None of our papers actually give a damn about whether it’s plus or minus.” All right, then!

Another failed replication of power pose

“How One Study Produced a Bunch of Untrue Headlines About Tattoos Strengthening Your Immune System”

Ptolemaic inference

How not to analyze noisy data: A case study

The problems are everywhere, once you know to look

“Generic and consistent confidence and credible regions”

Happiness of liberals and conservatives in different countries

Conflicts of interest

“It’s not reproducible if it only runs on your laptop”: Jon Zelner’s tips for a reproducible workflow in R and Stan

Unintentional parody of Psychological Science-style research redeemed by Dan Kahan insight

Rotten all the way through

Some modeling and computational ideas to look into

How to improve science reporting? Dan Vergano sez: It’s not about reality, it’s all about a salary

Kahan: “On the Sources of Ordinary Science Knowledge and Ignorance”

Why I prefer 50% to 95% intervals

How effective (or counterproductive) is universal child care? Part 1

How effective (or counterproductive) is universal child care? Part 2

“Another terrible plot”

Can a census-tract-level regression analysis untangle correlation between lead and crime?

The role of models and empirical work in political science

More on my paper with John Carlin on Type M and Type S errors

Should scientists be allowed to continue to play in the sandbox after they’ve pooped in it?

“Men with large testicles”

Sniffing tears perhaps not as effective as claimed

Thinking more seriously about the design of exploratory studies: A manifesto

From zero to Ted talk in 18 simple steps: Rolf Zwaan explains how to do it!

Individual and aggregate patterns in the Equality of Opportunity research project

Unfinished (so far) draft blog posts

Deep learning, model checking, AI, the no-homunculus principle, and the unitary nature of consciousness

How best to partition data into test and holdout samples?

Abraham Lincoln and confidence intervals

“Breakfast skipping, extreme commutes, and the sex composition at birth”

Discussion on overfitting in cluster analysis

Happiness formulas

OK, sometimes the concept of “false positive” makes sense.

“A bug in fMRI software could invalidate 15 years of brain research”

Interesting epi paper using Stan

How can you evaluate a research paper?

Some U.S. demographic data at zipcode level conveniently in R

So little information to evaluate effects of dietary choices

Frustration with published results that can’t be reproduced, and journals that don’t seem to care

Using Stan in an agent-based model: Simulation suggests that a market could be useful for building public consensus on climate change

Data 1, NPR 0

Dear Major Textbook Publisher

“So such markets were, and perhaps are, subject to bias from deep pocketed people who may be expressing preference more than actual expectation”

Temple Grandin

fMRI clusterf******

How to think about the p-value from a randomized test?

Avoiding selection bias by analyzing all possible forking paths

The social world is (in many ways) continuous but people’s mental models of the world are Boolean

Science journalist recommends going easy on Bigfoot, says you should bash mammograms instead

Applying statistical thinking to the search for extraterrestrial intelligence

An efficiency argument for post-publication review

Bayes is better

What’s powdery and comes out of a metallic-green cardboard can?

Low correlation of predictions and outcomes is no evidence against hot hand

Jail for scientific fraud?

Is the dorsal anterior cingulate cortex “selective for pain”?

Quantifying uncertainty in identification assumptions—this is important!

If I had a long enough blog delay, I could just schedule this one for 1 Jan 2026

Historical critiques of psychology research methods

p=.03, it’s gotta be true!

Objects of the class “George Orwell”

Sorry, but no, you can’t learn causality by looking at the third moment of regression residuals

“The Pitfall of Experimenting on the Web: How Unattended Selective Attrition Leads to Surprising (Yet False) Research Conclusions”

Two unrelated topics in one post: (1) Teaching useful algebra classes, and (2) doing more careful psychological measurements

Transformative treatments

Comment of the year

Migration explaining observed changes in mortality rate in different geographic areas?

Fragility index is too fragile

When you add a predictor the model changes so it makes sense that the coefficients change too.

Nooooooo, just make it stop, please!

“Which curve fitting model should I use?”

We fiddle while Rome burns: p-value edition

The Lure of Luxury

Confirmation bias

Problems with randomized controlled trials (or any bounded statistical analysis) and thinking more seriously about story time

When do stories work, Process tracing, and Connections between qualitative and quantitative research

A small, underpowered treasure trove?

Problems with “incremental validity” or more generally in interpreting more than one regression coefficient at a time

No evidence of incumbency disadvantage?

To know the past, one must first know the future: The relevance of decision-based thinking to statistical analysis

Powerpose update

Absence of evidence is evidence of alcohol?

“Estimating trends in mortality for the bottom quartile, we found little evidence that survival probabilities declined dramatically.”

SETI: Modeling in the “cosmic haystack”

There should really be something here for everyone. I don’t remember half these posts myself, and I look forward to reading them when they come out!

**P.S.** It’s a good thing I blog for free because nobody could pay me enough for the effort that goes into it.

The post My next 170 blog posts (inbox zero and a change of pace) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post All maps of parameter estimates remain misleading appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Roland Rau writes:

After many years of applying frequentist statistical methods in mortality research, I just began to learn about the application of Bayesian methods in demography. Since I also wanted to change a part of my research focus on spatial models, I discovered your 1999 paper with Phil Price, All maps of parameter estimates are misleading. As this article is already 17 years old, I wanted to ask whether you think that the last part of the final sentence of the article—“we know of no satisfactory solution to the problem of generating maps for general use”—is still valid. Or would you recommend some other technique to avoid the pitfalls of plotting observed rates or posterior means/medians?

My reply:

For the reasons discussed in our article, I think that there is inherently no way to avoid a map of parameter estimates being misleading in some way (unless variation is tiny or the data have some symmetry so that all sample sizes are identical). It’s just not possible to project the globe of multivariate uncertainty onto the plane of point estimates.
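The problem is easy to demonstrate with a small simulation (illustrative numbers only, nothing from any actual mortality data): give every area the same true rate but very different sample sizes, and the most extreme observed rates will nearly all come from the small areas, which are exactly the ones a map of raw rates would highlight.

```python
import random

random.seed(1)

# Hypothetical setup: every area has the same true rate, but sample
# sizes differ by a factor of 100, as county populations do.
true_rate = 0.1
sizes = [20] * 50 + [2000] * 50  # 50 small areas, 50 large areas

# Observed rate in each area (binomial sampling).
rates = [(n, sum(random.random() < true_rate for _ in range(n)) / n)
         for n in sizes]

# Rank areas by how far their observed rate sits from the truth.
extremes = sorted(rates, key=lambda r: abs(r[1] - true_rate), reverse=True)[:10]

# The most extreme observed rates come almost entirely from the
# small areas, even though nothing real distinguishes them.
small_among_extremes = sum(n == 20 for n, _ in extremes)
print(small_among_extremes, "of the 10 most extreme areas are small")
```

Shrinking toward a common mean fixes the ranking but then the map understates variation in the small areas; that trade-off is the sense in which no single map of point estimates can be fully honest.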

That said, there could well be new ideas in how best to map uncertainty and variation. So I expect there *has* been progress in mapping parameter estimates in the past twenty years, even if there are fundamental mathematical constraints that will always be with us.

The post A kangaroo, a feather, and a scale walk into Viktor Beekman’s office appeared first on Statistical Modeling, Causal Inference, and Social Science.

E. J. writes:

I enjoyed your kangaroo analogy [see also here—ed.] and so I contacted a talented graphical artist—Viktor Beekman—to draw it. The drawing is on Flickr under a CC license.

Thanks, Viktor and E.J.!

The post Stan 2.11 Good, Stan 2.10 Bad appeared first on Statistical Modeling, Causal Inference, and Social Science.

We are happy to announce that all of the interfaces have been updated to Stan 2.11. There was a subtle bug introduced in 2.10 where a probabilistic acceptance condition was being checked twice. Sorry about that and thanks for your patience. We’ve added some additional tests to catch this kind of thing going forward.
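To get a feel for why a doubled acceptance check matters, here is a toy sketch (not Stan’s actual code, just the arithmetic of the failure mode): if an accept step that should fire with probability alpha is checked twice with fresh randomness, it fires with probability alpha squared, which biases the sampler.

```python
import random

random.seed(0)

def accept_once(alpha):
    # Intended behavior: accept with probability alpha.
    return random.random() < alpha

def accept_twice(alpha):
    # Buggy behavior: the same probabilistic condition evaluated
    # twice; acceptance now requires both independent checks to pass.
    return random.random() < alpha and random.random() < alpha

alpha = 0.8
n = 100_000
rate_once = sum(accept_once(alpha) for _ in range(n)) / n
rate_twice = sum(accept_twice(alpha) for _ in range(n)) / n

print(round(rate_once, 2))   # close to alpha = 0.8
print(round(rate_twice, 2))  # close to alpha**2 = 0.64
```

With alpha = 0.8 the doubled check accepts roughly 64% of the time instead of 80%, which is enough to distort the distribution being sampled.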

As usual, instructions on downloading all of the interfaces are linked from:

The bug was introduced in 2.10, so 2.9 (pre language syntax enhancements) should still be OK.

There are also a couple of bonus bug fixes: printing now works if the current iteration is rejected, and integer division by zero is now rejected rather than crashing.

**Thanks to everyone who helped make this happen**

We found the bug due to a reproducible example posted to our user list by Yannick Jadoul. Thanks!

Thanks to Joshua Pritikin, who updated OpenMx (a structural equation modeling package that uses Stan’s automatic differentiation), so that the CRAN release could go through.

Thanks also to Michael Betancourt, Daniel Lee, Allen Riddell, and Ben Goodrich of the Stan dev team for stepping up to fix Stan itself and get the PyStan and RStan interfaces out ASAP.

**What’s next?**

We’re aiming to release minor versions at most quarterly. Here’s what should be in the next release (2.12):

- substantial speed improvements to our matrix arithmetic
- compound declare and define statements
- elementwise versions of all unary functions for arrays, vectors, and matrices
- command refactor (mostly under the hood, but will make new command functionality much easier)
- explicit control of proportionality (dropping constants) in probability mass and density functions in the language
- vector-based lower- and upper-bounds constraints for variables

After the next release, we’ll bring the example model code up to our current recommendations on priors and on Stan programming. Then, after the command refactor, the way will be clear for the Stan 3 versions of our interfaces, where we’ll be making all of our interfaces consistent and giving them more fine-grained control. Stay tuned!

The post Even social scientists can think like pundits, unfortunately appeared first on Statistical Modeling, Causal Inference, and Social Science.

I [Safford] actually hold to the idea that the winning candidate for President is always the one who has a clearer view of the challenges and opportunities facing the country and articulates a viable roadmap for how to navigate them.

I disagree entirely. I think Safford’s view is naive and it is based on the idea that the outcome of the election is determined by the candidates and the campaign. In contrast, I believe the political science research that says that economic conditions are the most important factor determining the election outcome. You could go through election after election and be hard-pressed to make the case that the winning candidate had a clearer view of the challenges etc. Sure, you can find some examples: arguably Reagan had a clearer view than Carter, and Obama had a clearer view than McCain, and . . . ummmm, maybe that’s about it. Or maybe not. You could make a pretty good argument for either candidate in most elections, from 1948 onward.

Also Safford should really really watch out about that “always.” In the postwar period, there have been three elections that were essentially ties: 1960, 1968, and 2000. Even if you want to make the case (a case that I completely disagree with) that presidential elections are won by candidates expressing a clearer view and a more viable roadmap, still, you can’t hope to think that this will work every time, not given that some elections are basically coin flips.

The above fallacies—the idea that elections are determined by candidates and campaigns, and the idea that there is some key by which the election outcome can be known deterministically—appear a lot in political journalism, and my colleagues and I spend a bit of time at the sister blog explaining why they’re wrong. I don’t usually see academic researchers making these errors, though.

I looked up Safford and he’s done a lot of qualitative work on labor and social networks. This is important stuff, and I expect that if I started opining on the effects of labor union strategies, I’d be about as confused as Safford is when writing about electoral politics. So I’d like to emphasize that I’m not trying to slam the guy for making this mistake. We all make mistakes, and what’s a blog for, if not to put some of our more casual speculations out for general criticism. (In contrast, I was more annoyed a few years ago when political theorist David Runciman had the BBC as his platform for spreading pundit-level errors about U.S. public opinion.) So, no hard feelings, it’s just interesting to see an academic make a mistake that I usually associate with pundits.
