The post USAs usannsynlige presidentkandidat. appeared first on Statistical Modeling, Causal Inference, and Social Science.

It’s a news article by Linda May Kallestein, which begins as follows:

The socialist Bernie Sanders: Can a 73-year-old Jew, born to Polish immigrants, raised in modest circumstances, and who wants to introduce social democracy on the Scandinavian model, have a chance of becoming the next president of the United States?

And here’s my quote:

I actually said it in English, but you get the picture. Not as exciting as the time I was quoted in Private Eye, but I’ll still take it.

The full story is on the sister blog.

The post To understand the replication crisis, imagine a world in which everything was published. appeared first on Statistical Modeling, Causal Inference, and Social Science.

John Snow points me to this post by psychology researcher Lisa Feldman Barrett who reacted to the recent news on the non-replication of many psychology studies with a contrarian, upbeat take, entitled “Psychology Is Not in Crisis.”

Here’s Barrett:

An initiative called the Reproducibility Project at the University of Virginia recently reran 100 psychology experiments and found that over 60 percent of them failed to replicate — that is, their findings did not hold up the second time around. . . .

But the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works. . . . Science is not a body of facts that emerge, like an orderly string of light bulbs, to illuminate a linear path to universal truth. Rather, science (to paraphrase Henry Gee, an editor at Nature) is a method to quantify doubt about a hypothesis, and to find the contexts in which a phenomenon is likely. Failure to replicate is not a bug; it is a feature. It is what leads us along the path — the wonderfully twisty path — of scientific discovery.

All this is fine. Indeed, I’ve often spoken of the fractal nature of science: at any time scale, whether it be minutes or days or years, we see a mix of forward progress and sudden shocks, realizations that much of what we’ve thought was true, isn’t. Scientific discovery is indeed both wonderful and unpredictable.

But Barrett’s article disturbs me too, for two reasons. First, yes, failure to replicate is a feature, not a bug—but only if you respect that feature, if you use the failure to replicate as an occasion to reassess your beliefs. But if you just complacently say it’s no big deal, then you’re not taking the opportunity to learn.

Here’s an example. The recent replication paper by Nosek et al. had many examples of published studies that did not replicate. One example was described in Benedict Carey’s recent New York Times article as follows:

Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.

Carey got a quote from the author of that original study. To my disappointment, the author did *not* say something like, “Hey, it looks like we might’ve gone overboard on that original study, that’s fascinating to see that the replication did not come out as we would’ve thought.” Instead, here’s what we got:

In an email, Paola Bressan, a psychologist at the University of Padua and an author of the original mate preference study, identified several such differences — including that her sample of women were mostly Italians, not American psychology students — that she said she had forwarded to the Reproducibility Project. “I show that, with some theory-required adjustments, my original findings were in fact replicated,” she said.

“Theory-required adjustments,” huh? Unfortunately, just about anything can be interpreted as theory-required. Just ask Daryl Bem.

We can actually see what the theory says. Philosopher Deborah Mayo went to the trouble to look up Bressan’s original paper, which said the following:

Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extra pair mating with the former. Only if conception occurs, however, do the evolutionary benefits of such a strategy overcome its costs. Accordingly, we predicted that (a) partnered women should prefer attached men, because such men are more likely than single men to have pair-bonding qualities, and hence to be good replacement partners, and (b) this inclination should reverse when fertility rises, because attached men are less available for impromptu sex than single men.

Nothing at all about Italians there! Apparently this bit of theory requirement wasn’t apparent until *after* the replication didn’t work.

What if the replication *had* resulted in statistically significant results in the same direction as expected from the earlier, published paper? Would Bressan have called up the Reproducibility Project and said, “Hey—if the results replicate under these different conditions, something must be wrong. My theory requires that the model won’t work with American college students!” I really really don’t think so. Rather, I think Bressan would call it a win.

And that’s my first problem with Barrett’s article. I feel like she’s taking a heads-I-win, tails-you-lose position. A successful replication is welcomed as a confirmation, an unsuccessful replication indicates new conditions required for the theory to hold. Nowhere does she consider the third option: that the original study was capitalizing on chance and in fact never represented any general pattern in *any* population. Or, to put it another way, that any true underlying effect is too small and too variable to be measured by the noisy instruments being used in some of those studies.

As the saying goes, when effect size is tiny and measurement error is huge, you’re essentially trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.

My second problem with Barrett’s article is at the technical level. She writes:

Suppose you have two well-designed, carefully run studies, A and B, that investigate the same phenomenon. They perform what appear to be identical experiments, and yet they reach opposite conclusions. Study A produces the predicted phenomenon, whereas Study B does not. . . . Does this mean that the phenomenon in question is necessarily illusory? Absolutely not. If the studies were well designed and executed, it is more likely that the phenomenon from Study A is true

*only under certain conditions* [emphasis in the original].

At one level, there is nothing to disagree with here. I don’t really like the presentation of phenomena as “true” or “false”—pretty much everything we’re studying in psychology has *some* effect—but, in any case, all effects vary. The magnitude and even the direction of any effect will vary across people and across scenarios. So if we interpret the phrase “the phenomenon is true” in a reasonable way, then, yes, it will only be true under certain conditions—or, at the very least, vary in importance across conditions.

The problem comes when you look at specifics. Daryl Bem found some comparisons in his data which, when looked at in isolation, were statistically significant. These patterns did not show up in replication. Satoshi Kanazawa found a correlation between beauty and sex ratio in a certain dataset. When he chose a particular comparison, he found p less than .05. What do we learn from this? Do we learn that, in the general population, beautiful parents are more likely to have girls? No. The most we can learn is that the Journal of Theoretical Biology can be fooled into publishing patterns that come from noise. (His particular analysis was based on a survey of 3000 people. A quick calculation using prior information on sex ratios shows that you would need data on hundreds of thousands of people to estimate any effect of the sort that he was looking for.) And then there was the himmicanes and hurricanes study which, ridiculous as it was, falls well within the borders of much of the theorizing done in psychology research nowadays. And so on, and so on, and so on.
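That back-of-the-envelope calculation can be made explicit. In this sketch the 0.3-percentage-point effect size is an assumption chosen for illustration (plausible effects in the sex-ratio literature are of that order); the point is that the noise in a survey of 3000 dwarfs any such effect:

```python
import math

# Assumed for illustration: a plausible true difference in Pr(girl)
# between groups is on the order of 0.3 percentage points.
p = 0.5          # baseline Pr(girl); close enough for the variance
effect = 0.003   # assumed true difference in proportions

# Standard error of a difference in proportions, n = 3000 split in half:
n_half = 1500
se_3000 = math.sqrt(p * (1 - p) / n_half + p * (1 - p) / n_half)
print(f"se with n = 3000: {se_3000:.3f}")  # roughly 1.8 percentage points

# Sample size per group for 80% power at two-sided alpha = .05:
z_alpha, z_beta = 1.96, 0.84
n_per_group = (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / effect ** 2
print(f"n per group: {n_per_group:,.0f}")
```

With these numbers the standard error is about six times the assumed effect, and the required sample runs to hundreds of thousands of people per group, which is the point of the parenthetical above.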

We could let Barrett off the hook on the last quote above because she does qualify her statement with, “If the studies were well designed and executed . . .” But there’s the rub. How do we know if a study was well designed and executed? Publication in Psychological Science or PPNAS is not enough—lots and lots of poorly designed and executed studies appear in these journals. It’s almost as if the standards for publication are not just about how well designed and executed a study is, but also about how flashy the claims are, and whether there is a “p less than .05” somewhere in the paper. It’s almost as if reviewers often can’t tell whether a study is well designed and executed. Hence the demand for replication, hence the concern about unreplicated studies, or studies that for mathematical reasons are essentially dead on arrival because the noise is so much greater than the signal.

**Imagine a world in which everything was published**

A close reading of Barrett’s article reveals the centrality of the condition that studies be “well designed and executed,” and lots of work by statisticians and psychology researchers in recent years (Simonsohn, Button, Nosek, Wagenmakers, etc etc) has made it clear that current practice, centered on publication thresholds (whether it be p-value or Bayes factor or whatever), won’t do so well at filtering out the poorly designed and executed studies.

To discourage or disparage or explain away failed replications is to give a sort of “incumbency advantage” to published claims, which puts a burden on the publication process that it cannot really handle.

To better understand what’s going on here, imagine a thought experiment where *everything* is published, where there’s no such thing as Science or Nature or Psychological Science or JPSP or PPNAS; instead, everything’s published on Arxiv. Every experiment everyone does. And with no statistical significance threshold. In this world, nobody has ever heard of inferential statistics. All we see are data summaries, regressions, etc., but no standard errors, no posterior probabilities, no p-values.

What would we do then? Would Barrett reassure us that we shouldn’t be discouraged by failed replications, that everything already published (except, perhaps, for “a few bad eggs”) should be taken as likely to be true? I assume (hope) not. The only way this sort of reasoning can work is if you believe the existing system screens out the bad papers. But the point of various high-profile failed replications (for example, in the field of embodied cognition) is that, no, the system does not work so well. This is one reason the replication movement is so valuable, and this is one reason I’m so frustrated by people who dismiss replications or who claim that replications show that “the system works.” It only works if you take the information from the failed replications (and the accompanying statistical theory, which is the sort of thing that I work on) and *do something about it*!

As I wrote in an earlier discussion on this topic:

Suppose we accept this principle [that published results are to be taken as true, even if they fail to be replicated in independent studies by outsiders]. How, then, do we treat an unpublished paper? Suppose someone with a Ph.D. in biology posts a paper on Arxiv (or whatever is the biology equivalent), and it can’t be replicated? Is it ok to question the original paper, to treat it as only provisional, to label it as unreplicated? That’s ok, right? I mean, you can’t just post something on the web and automatically get the benefit of the doubt that you didn’t make any mistakes. Ph.D.’s make errors all the time (just like everyone else). . . .

Now we can engage in some salami slicing. According to Bissell (as I interpret here), if you publish an article in Cell or some top journal like that, you get the benefit of the doubt and your claims get treated as correct until there are multiple costly, failed replications. But if you post a paper on your website, all you’ve done is make a claim. Now suppose you publish in a middling journal, say, the Journal of Theoretical Biology. Does that give you the benefit of the doubt? What about Nature Neuroscience? PNAS? Plos-One? I think you get my point. A publication in Cell is nothing more than an Arxiv paper that happened to hit the right referees at the right time. Sure, approval by 3 referees or 6 referees or whatever is something, but all they did is read some words and look at some pictures.

It’s a strange view of science in which a few referee reports is enough to put something into a default-believe-it mode, but a failed replication doesn’t count for anything.

**I’m a statistician so I’ll conclude with a baseball analogy**

Bill James once wrote with frustration about humanist-style sportswriters, the sort of guys who’d disparage his work and say they didn’t care about the numbers, that they cared about how the athlete actually played. James’s response was that if these sportswriters really wanted to talk baseball, that would be fine—but oftentimes their arguments ended up having the form: So-and-so hit .300 in Fenway Park one year, or so-and-so won 20 games once, or whatever. His point was that these humanists were actually making their arguments using statistics. They were just using statistics in an uninformed way. Hence his dictum that the alternative to good statistics is not “no statistics,” it’s “bad statistics.”

That’s how I feel about the people who deny the value of replications. They talk about science and they don’t always want to hear my statistical arguments, but then if you ask them why we “have no choice but to accept” claims about embodied cognition or whatever, it turns out that their evidence is nothing but some theory and a bunch of p-values. Theory can be valuable but it won’t convince anybody on its own; rather, theory is often a way to interpret data. So it comes down to the p-values.

Believing a theory is correct because someone reported p less than .05 in a Psychological Science paper is like believing that a player belongs in the Hall of Fame because he hit .300 once in Fenway Park.

This is not a perfect analogy. Hitting .300 anywhere is a great accomplishment, whereas “p less than .05” can easily represent nothing more than an impressive talent for self-delusion. But I’m just trying to get at the point that ultimately it is statistical summaries and statistical models that are being used to make strong (and statistically ridiculous) claims about reality; hence statistical criticisms, and external data such as those from replications, are relevant.

If, like Barrett, you want to dismiss replications and say there’s no crisis in science: Fine. But then publish everything and accept that all data are telling you something. Don’t privilege something that happens to have been published once and declare it true. If you do that, and you follow up by denying the uncertainty that is revealed by failed replications (and was earlier revealed, on the theoretical level, by this sort of statistical analysis), well, then you’re offering nothing more than complacent happy talk.

**P.S.** Fred Hasselman writes:

I helped analyze the replication data of the Bressan & Stranieri study.

There were two replication samples:

Original effect is a level comparison after a 2x2x2 ANOVA: F(1, 194) = 7.16, p = .008, f = 0.19; t(49) = 2.45, p = .02, Cohen’s d = 0.37

Replication 1 (in-lab): N = 263, power > 99%, Cohen’s d = .06

Replication 2 (online): N = 317, power > 99%, Cohen’s d = .09

Initially I did not have the time to read the entire article. I recently did, because I wanted to use the study as an example in a lecture.

I completely agree with the comparisons to Bem-logic.

What I ended up doing is showing the original materials and elaborating on the theory behind the hypothesis during the lecture.

After seeing the stimuli, learning about the hypothesis, but before learning about the replication studies, there was a consensus among students (99% female) that claims like the first sentence of the abstract should disqualify the study as a serious work of science:

ABSTRACT—Because men of higher genetic quality tend to be poorer partners and parents than men of lower genetic quality, women may profit from securing a stable investment from the latter, while obtaining good genes via extrapair mating with the former.

Really.

Think about it.

Men of higher genetic quality are poorer partners and parents.

That’s a fact you know.

And this genetic quality of men (yes, they mean attractiveness) is why women want their babies, more so than babies from their current partner (the ugly variety of men, but very sweet and good with kids).

My brain hurts.

Thankfully the conclusion is very modest:

In humans’ evolutionary past, the switch in preference from less to more sexually accessible men associated with each ovulatory episode would have been highly adaptive. Our data are consistent with the idea that, although the length of a woman’s reproductive lifetime and the extent of the potential mating network have expanded considerably over the past 50,000 years, this unconscious strategy guides women’s mating choices still.

Erratum: We meant ‘this unconscious strategy guides Italian women’s mating choices still’.

Dayum.

The post Stan attribution appeared first on Statistical Modeling, Causal Inference, and Social Science.

I worry that I get too much credit for Stan. So let me clarify. I didn’t write Stan. Stan is written in C++, and I’ve never in my life written a line of C, or C+, or C++, or C+++, or C-, or any of these things.

Here’s a quick description of what we’ve all done, listed in order of joining the development team.

• Andrew Gelman (Columbia University)

chief of staff, chief of marketing, chief of fundraising, chief of modeling, chief of training, max marginal likelihood, expectation propagation, posterior analysis, R, founder

• Bob Carpenter (Columbia University)

language design, parsing, code generation, autodiff, templating, ODEs, probability functions, constraint transforms, manual, web design / maintenance, fundraising, support, training, C++, founder

• Matt Hoffman (Adobe Creative Technologies Lab)

NUTS, adaptation, autodiff, memory management, (re)parameterization, C++, founder

• Daniel Lee (Columbia University)

chief of engineering, CmdStan (founder), builds, continuous integration, testing, templates, ODEs, autodiff, posterior analysis, probability functions, error handling, refactoring, C++, training

• Ben Goodrich (Columbia University)

RStan, multivariate probability functions, matrix algebra, (re)parameterization, constraint transforms, modeling, R, C++, training

• Michael Betancourt (University of Warwick)

chief of smooth manifolds, MCMC, Riemannian HMC, geometry, analysis and measure theory, ODEs, CmdStan, CDFs, autodiff, transformations, refactoring, modeling, variational inference, logos, web design, C++, training

• Marcus Brubaker (University of Toronto, Scarborough)

optimization routines, code efficiency, matrix algebra, multivariate distributions, C++

• Jiqiang Guo (NPD Group)

RStan (founder), C++, Rcpp, R

• Peter Li (Columbia University)

RNGs, higher-order autodiff, ensemble sampling, Metropolis, example models, C++

• Allen Riddell (Dartmouth College)

PyStan (founder), C++, Python

• Marco Inacio (University of São Paulo)

functions and distributions, C++

• Jeffrey Arnold (University of Washington)

emacs mode, pretty printing, manual, emacs

• Rob J. Goedman (D3Consulting b.v.)

parsing, Stan.jl, C++, Julia

• Brian Lau (CNRS, Paris)

MatlabStan, MATLAB

• Mitzi Morris (Lucidworks)

parsing, testing, C++

• Rob Trangucci (iSENTIUM)

max marginal likelihood, multilevel modeling and poststratification, template metaprogramming, training, C++, R

• Jonah Sol Gabry (Columbia University)

shinyStan (founder), R

• Alp Kucukelbir (Columbia University)

variational inference, C++

• Robert L. Grant (St. George’s, University of London & Kingston University)

StataStan, Stata

• Dustin Tran (Harvard University)

variational inference, C++

Development Team Alumni

These are developers who have made important contributions in the past, but are no longer contributing actively.

• Michael Malecki (Crunch.io, YouGov plc)

original design, modeling, logos, R

• Yuanjun Guo (Columbia University)

dense mass matrix estimation, C++

The post Constructing an informative prior using meta-analysis appeared first on Statistical Modeling, Causal Inference, and Social Science.

I am trying to construct an informative prior by synthesizing or collecting some information from literature (meta-analysis) and then to apply that to a real data set (it is longitudinal data) for over 20 years follow-up.

In constructing the prior using the meta-analysis data, the issue of publication bias came up. I have tried looking to see if there is any literature on this but it seems almost all the articles on Bayesian meta-analysis do not actually account for this issue apart from one (Givens, Smith and Tweedie 1997).

My thinking was that I could assume a data augmentation approach by fitting a joint model with the assumption that the observed data are normally distributed, and that the unobserved studies probably exist but were not included in my collection and can be thought of as missing data (missing not at random, or non-ignorable missingness). This way a Bernoulli distribution could be used to account for the missingness.

But according to Lesaffre and Lawson (2012, p. 196), in hierarchical models the data augmentation approach enters in a quite natural way via the latent (unobserved) random effects. This statement suggests to me that my earlier idea may not be necessary and may even bias the posterior estimates.

My reply: You could certainly do this, build a model in which there are a bunch of latent unreported studies and then go from there. I don’t know how well this would work, though, for two reasons:

1. Estimating what’s missing based on the shape of the distribution—that’s tough. Inferences will be so sensitive to all sorts of measurement and selection issues, and I’d be skeptical of whatever comes out.

2. You’re trying to adjust for unreported studies in a meta-analysis. But I’d be much more worried about choices in data processing and analysis in each of the studies you have. As I’ve written many times, I think the file-drawer problem is overrated and it’s nothing compared to the garden of forking paths.
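For what it’s worth, the stakes can be seen in a minimal simulation of the selection process my correspondent describes. The true effect, the per-study standard error, and the publish-only-if-significant rule are all assumptions for illustration; the unpublished draws play the role of the latent missing studies:

```python
import random
import statistics

random.seed(1)

true_effect = 0.2   # assumed true effect
se = 0.15           # assumed per-study standard error
n_studies = 10_000

published = []
for _ in range(n_studies):
    est = random.gauss(true_effect, se)
    if abs(est / se) > 1.96:   # crude publication rule: significant only
        published.append(est)

# A prior naively centered on the published mean would be too optimistic:
print(f"true effect:          {true_effect}")
print(f"mean published est.:  {statistics.mean(published):.3f}")
print(f"fraction published:   {len(published) / n_studies:.2f}")
```

Under these assumed numbers the published mean runs nearly double the true effect, and roughly three-quarters of the studies never appear, which is exactly the missingness a selection model would have to reconstruct from the shape of the observed distribution. Hence point 1 above: the correction is possible in principle but fragile in practice.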

The post Uri Simonsohn warns us not to be falsely reassured appeared first on Statistical Modeling, Causal Inference, and Social Science.

I agree with Uri Simonsohn that you don’t learn much by looking at the distribution of all the p-values that have appeared in some literature. Uri explains:

Most p-values reported in most papers are irrelevant for the strategic behavior of interest.

Covariates, manipulation checks, main effects in studies testing interactions, etc. Including them we underestimate p-hacking and we overestimate the evidential value of data. Analyzing all p-values asks a different question, a less sensible one. Instead of “Do researchers p-hack what they study?” we ask “Do researchers p-hack everything?”

He demonstrates with an example and summarizes:

Looking at all p-values is falsely reassuring.

I agree and will just add two comments:

1. I prefer the phrase “garden of forking paths” because I think the term “p-hacking” suggests intentionality or even cheating. Indeed, in the quoted passage above, Simonsohn refers to “strategic behavior.” I have no doubt that *some* strategic behavior and even outright cheating goes on, but I like to emphasize that the garden of forking paths can occur even when a researcher does only one analysis of the data at hand and does not directly “fish” for statistical significance.

The idea is that analyses are contingent on data, and researchers can and do make choices in data coding, data exclusion, and data analysis in light of the data they see, setting various degrees of freedom in reasonable-seeming ways that support their model of the world, thus being able to obtain statistical significance at a high rate, merely by capitalizing on chance patterns in data. It’s the forking paths, but it doesn’t feel like “hacking,” nor is it necessarily “strategic behavior” in the usual sense of the term.

2. If p-values are what we have, it makes sense to learn what we can from them, as in the justly influential work of Uri Simonsohn, Greg Francis, and others. But, looking at the big picture, once we move to the goal of learning about underlying effects, I think we want to be analyzing raw data (and in the context of prior information), not merely pushing these p’s around. P-values are crude data summaries, and a lot of information can be lost by moving from raw data to p-values. Doing science using published p-values is like trying to paint a picture using salad tongs.
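Simonsohn’s dilution point is easy to reproduce in a few lines. In this sketch the number of forking-path analyses behind each focal claim and the number of irrelevant p-values per paper are made-up assumptions; the focal results are hacked nulls, yet pooling all reported p-values makes the literature look nearly healthy:

```python
import random

random.seed(2)

papers = 2000
tries = 20   # assumed number of analyses tried per focal (null) claim

focal, everything = [], []
for _ in range(papers):
    # p-hacked focal result: the best of many looks at pure noise
    p_focal = min(random.random() for _ in range(tries))
    focal.append(p_focal)
    # irrelevant, honest p-values: covariates, manipulation checks, ...
    irrelevant = [random.random() for _ in range(10)]
    everything.extend([p_focal] + irrelevant)

frac_sig = lambda ps: sum(p < 0.05 for p in ps) / len(ps)
print(f"focal p-values < .05: {frac_sig(focal):.2f}")      # far above 5%
print(f"all p-values < .05:   {frac_sig(everything):.2f}") # looks much tamer
```

The focal p-values scream selection, but diluted ten-to-one with honest p-values the overall rate of significance looks unremarkable, which is why looking at all p-values is falsely reassuring.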

The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Tues:** Stan attribution

**Wed:** Cannabis/IQ follow-up: Same old story

**Thurs:** Defining conditional probability

**Fri:** In defense of endless arguments

**Sat:** Emails I never finished reading

**Sun:** BREAKING . . . Sepp Blatter accepted $2M payoff from Dennis Hastert

The post “Another bad chart for you to criticize” appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Performing design calculations (type M and type S errors) on a routine basis? appeared first on Statistical Modeling, Causal Inference, and Social Science.

I am conducting a survival analysis (median follow up ~10 years) of subjects who enrolled on a prospective, non-randomized clinical trial for newly diagnosed multiple myeloma. The data were originally collected for research purposes and specifically to determine PFS and OS of the investigational regimen versus historic controls. The trial has been closed to new enrollment for many years; however, we are monitoring for disease progression and all cause mortality.

Here is the crux of the issue. Although data were prospectively collected for research purposes, my investigational variable was collected but not reported as a variable. The results of the prospective trial (PFS and OS) have been previously published in Blood. I am updating the original report with the long-term follow up, but am also exploring the potential impact of my new variable on PFS and OS. I have not yet analyzed the data and do not know the potential impact, or magnitude of impact, on PFS or OS. If I am interpreting your paper correctly, I believe that I should treat the power calculation on a post-hoc basis and utilize Type S and Type M analysis.

I know this is brief, if you would offer a comment or a direction I would be deeply grateful. I am sure it is obvious that I don’t study statistics, I focus on the biology of multiple myeloma.

Fair enough. I’m no expert on myeloma. As a matter of fact, I don’t even know what myeloma is! (Yes, I could google it, but that would be cheating.) Based on the above paragraphs, I assume it is a blood-related disease.

Anyway, my response is, yes, I think it would be a good idea to do some design analysis, using your best scientific understanding to hypothesize an effect size and then going from there, to see what “statistical significance” really implies in such a case, given your sample size and error variance. The key is to hypothesize a reasonable effect size—don’t just use the point estimate from a recent study, as this can be contaminated by the statistical significance filter.
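Such a design analysis can be sketched along the lines of the Gelman and Carlin (2014) “retrodesign” calculation. This is a minimal stand-alone version; the effect size and standard error in the example call are placeholders, to be replaced with values motivated by the scientific literature rather than by the study’s own point estimate:

```python
import random
from statistics import NormalDist

def retrodesign(true_effect, se, alpha=0.05, n_sims=100_000, seed=0):
    """Design analysis in the style of Gelman and Carlin (2014):
    power, type S error rate, and exaggeration ratio (type M)."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    lam = true_effect / se
    power = 1 - nd.cdf(z - lam) + nd.cdf(-z - lam)
    type_s = nd.cdf(-z - lam) / power   # Pr(wrong sign | significant)
    # exaggeration ratio by simulation: E[|estimate| | significant] / effect
    rng = random.Random(seed)
    sig = []
    for _ in range(n_sims):
        est = rng.gauss(true_effect, se)
        if abs(est) / se > z:
            sig.append(abs(est))
    exaggeration = sum(sig) / len(sig) / true_effect
    return power, type_s, exaggeration

# Placeholder numbers: a small hypothesized effect, noisy measurement
power, type_s, exaggeration = retrodesign(true_effect=1.0, se=3.3)
print(f"power {power:.2f}, type S {type_s:.2f}, exaggeration {exaggeration:.1f}")
```

With a small hypothesized effect and a noisy design, power sits barely above the significance level, a nontrivial share of significant results have the wrong sign, and significant estimates overstate the true effect several-fold. That is what “statistical significance” would really imply in such a case.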

The post New paper on psychology replication appeared first on Statistical Modeling, Causal Inference, and Social Science.

The Open Science Collaboration, a team led by psychology researcher Brian Nosek, organized the replication of 100 published psychology experiments. They report:

A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes.

“Despite” is a funny way to put it. Given the statistical significance filter, we’d expect published estimates to be overestimates. And then there’s the garden of forking paths, which just makes things more so. It would be meaningless to try to obtain a general value for the “Edlin factor” but it’s gotta be less than 1, so *of course* exact replications should produce weaker evidence than claimed from the original studies.
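The logic of the significance filter fits in a tiny simulation. The effect size and noise level here are assumptions for illustration: publish only the significant originals, replicate each one exactly, and the replications come out at about half the published magnitude even though nothing about the experiment changed.

```python
import random
import statistics

random.seed(3)

true_effect, se = 0.1, 0.08   # assumed: modest effect, noisy measurement

orig_published, replications = [], []
for _ in range(20_000):
    original = random.gauss(true_effect, se)
    if abs(original / se) > 1.96:   # the statistical significance filter
        orig_published.append(original)
        # an exact replication: same design, same population, new noise
        replications.append(random.gauss(true_effect, se))

print(f"mean published original: {statistics.mean(orig_published):.3f}")
print(f"mean exact replication:  {statistics.mean(replications):.3f}")
```

The replications are unbiased for the true effect; it’s the published originals that are inflated. So the “decline” in the Nosek et al. results is exactly what the filter predicts.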

Things may change if and when it becomes standard to report Bayesian inferences with informative priors, but as long as researchers are reporting selected statistically-significant comparisons—and, no, I don’t think that’s about to change, even with the publication and publicity attached to this new paper—we can expect published estimates to be overestimates.

That said, even though these results are no surprise, I still think they’re valuable.

As I told Monya Baker in an interview for a news article, “this new work is different from many previous papers on replication (including my own) because the team actually replicated such a large swathe of experiments. In the past, some researchers dismissed indications of widespread problems because they involved small replication efforts or were based on statistical simulations. But they will have a harder time shrugging off the latest study. The value of this project is that hopefully people will be less confident about their claims.”

Nosek et al. provide some details in their abstract:

The mean effect size of the replication effects was half the magnitude of the mean effect size of the original effects, representing a substantial decline. Ninety-seven percent of original studies had significant results. Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.

This is all fine, again the general results are no surprise but it’s good to see some hard numbers with real experiments. The only thing that bothers me in the above sentence is the phrase, “if no bias in original results is assumed . . .” Of course there is bias in the original results (see discussion above), so this just seems like a silly assumption to make. I think I know where the authors are coming from—they’re saying, even if there was no bias, there’d be problems—but really the no-bias assumption makes no sense given the statistical significance filter, so this seems unnecessary.

Anyway, great job! This was a big effort and it deserves all the publicity it’s getting.

Disclaimer: I am affiliated with the Open Science Collaboration. I’m on the email list, and at one point I was one of the zillion authors of the article. At some point I asked to be removed from the author list, as I felt I hadn’t done enough—I didn’t do any replication, nor did I do any data analysis, all I did was participate in some of the online discussions. But I do feel generally supportive of the project and am happy to be associated with it in whatever way that is.

The post New paper on psychology replication appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post A political sociological course on statistics for high school students appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I am designing a semester-long non-AP Statistics course for high school juniors and seniors. I am wondering if you had some advice for the design of my class. My current thinking for the design of the class includes:

0) Brief introduction to R/RStudio, descriptive statistics, and data-sheet structure.

1) Great Migration in the 20th-century U.S. Students will read sections of “The Warmth of Other Suns”. Each student will explore the size of the Great Migration from the South in an industrial city of their choice. We will use the IPUMS census microdata to estimate white and black migration from Southern states and use the income figures to compare migrants and non-migrant residents over the years 1910–1980. The old teaching software package Fathom used to do the sampling from IPUMS easily, but the Census sampling feature no longer works with newer operating systems. I will have the students sample directly from the University of Minnesota site and then decode their samples in Excel and RStudio. A final part of the project will be visits with retired people who were part of the migration.

2) I plan to have the students divide into working groups to prepare statistical information for lobbying elected officials on a social problem of their choice. We have access to the AFSC’s Criminal Justice program near our school, and immigration rights might be a fruitful topic to study after our examination of migration.

3) It will be primary season again next Spring and I would love to have the students look at geographical effects in political elections. We will, of course, study polling and survey design and explore sampling distributions.

I have just picked up copies of your texts “A Quantitative Tour…” and “Teaching Statistics…” and I plan to mine them for other activities to explore. I also will be catching up on reading your blog!

This sounds great! My only tip is to do as much of the data analysis yourself first so you can be sure your students can handle it. I did some IPUMS stuff recently and there were lots of little details with the data that were difficult to handle at first.

Perhaps readers of this blog will have other suggestions.

The post A political sociological course on statistics for high school students appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Vizzy vizzy vizzy viz appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Nadia Hassan points me to this post by Matthew Yglesias, who writes:

Here’s a very cool data visualization from HowMuch.net that took me a minute to figure out because it’s a little bit unorthodox. The way it works is that it visualizes the entire world’s economic output as a circle. That circle is then subdivided into a bunch of blobs representing the economy of each major country. And then each country-blob is sliced into three chunks — one for manufacturing, one for services, and one for agriculture.

What do I like about this image and what don’t I like?

Paradoxically, the *best thing* about this graph may also be its *worst*: Its tricky, puzzle-like characteristic (it even looks like some sort of hi-tech jigsaw puzzle) makes it hard to read, hard to follow, but at the same time gratifying for the reader who goes to the trouble of figuring it out.

It’s the Chris Rock effect: Some graphs give the pleasant feature of visualizing things we already knew, shown so well that we get a shock of recognition, the joy of relearning what we already know, but seeing it in a new way that makes us think more deeply about all sorts of related topics.

As a statistician, I can tell you a whole heap of things I don’t like about this graph, starting with the general disorganization—there’s no particular way to find any country you might be looking for, and there seems to be no logic to the spatial positions—I have no idea what Australia is doing in the middle of the circle, or why South Korea and Switzerland are long and thin while Mexico and India are more circular. The breakdown of economy into services/industry/agriculture is particularly confusing because of all the different shapes, and for heaven’s sake, why are the numbers given to a hyper-precise two decimal places?? (You might wonder what it means to say that Russia is “2.49%” of the world economy, given that, last time I checked, readily-available estimates of Russia’s GDP per capita varied by more than a factor of five!)

Yglesias’s post is headlined, “This striking diagram will change how you look at the world economy,” and I can believe it will change people’s understanding, not because the data are presented clearly or because the relevant comparisons are easily available, but because the display is unusual enough that it might motivate people to stare at these numbers that they otherwise might ignore.

Some of the problems with this graph can be seen by carefully considering this note from Yglesias:

You can see some cool things here.

For example, compare the US and China. Our economy is much larger than theirs, but our industrial sectors are comparable in size, and China’s agriculture sector looks to be a little bit larger. Services are what drive the entire gap.

The UK and France have similarly sized overall economies, but agriculture is a much bigger slice of the French pie.

For all that Russia gets played up as some kind of global menace, its economy produces less than Italy. Put all the different European countries together, and Russia looks pathetic.

You often hear the phrase “China and India,” but you can see here that the two Asian giants are in very different shape economically.

The only African nation on this list, South Africa, has a smaller economy than Colombia.

What struck me about all these items is how *difficult* it actually is to find them in the graph. Comparing the U.S. with China on their industry sector, that’s tough: you have to figure out which color is which—it’s particularly confusing here because the color codes for the two countries are different—and then compare two quite different shapes, a task that would make Jean Piaget flip out. The U.K. and France can be compared without too much difficulty but only because they happen to be next to each other, through some quirk of the algorithm. Comparing China and India is not so easy—it took me a while to find India on this picture. And finding South Africa was even trickier.

My point is not that the graph is “bad”—I’d say it’s excellent for its #1 purpose, which is to draw attention to these numbers. It’s just an instructive example for what one might want in a data display.

**The click-through solution**

As always, I recommend what I call the “click-through solution”: Start with a visually grabby graphic like this one, something that takes advantage of the Chris Rock effect to suck the viewer in. Then click and get a suite of statistical graphs that allow more direct visual comparisons of the different countries and different sectors of the economy. Then click again to get a spreadsheet with all the numbers and a list of sources.

The post Vizzy vizzy vizzy viz appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan’s 3rd birthday! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>If you’re around and want to celebrate with some Stan developers and users, feel free to join us:

Monday, August 31.

6 – 9 pm

Untamed Sandwiches

43 W 39th St

New York, NY

If you didn’t know, we also have a Stan Users NYC group that meets every few months.

Thanks and hope to see some of you there.

The post Stan’s 3rd birthday! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Can you change your Bayesian prior?” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’m very curious as to how you would answer this for subjective Bayesians, at least. I found this section of my book showed various positions, not in agreement.

I responded on her blog:

As we discuss in BDA and elsewhere, one can think of one’s statistical model, at any point in time, as a placeholder, an approximation or compromise given constraints of computation and of expressing one’s model. In many settings the following iterative procedure makes sense:

1. Set up a placeholder model (that is, whatever statistical model you might fit).

2. Perform inference (no problem, now that we have Stan!).

3. Look at the posterior inferences. If some of the inferences don’t “make sense,” this implies that you have additional information that has not been incorporated into the model. Improve the model and return to step 1.
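One concrete version of step 3 is a posterior predictive check: simulate replicated data from the fitted model and see whether it can reproduce salient features of the observed data. Here is a toy sketch of the idea (my own illustration, not code from BDA or Stan; for brevity it plugs in point estimates rather than full posterior draws):

```python
import random
import statistics

# Toy posterior predictive check: fit a normal model to data that
# actually contains a gross outlier, then ask whether replicated data
# from the fitted model can reproduce the observed maximum.
random.seed(42)
data = [random.gauss(0, 1) for _ in range(100)] + [9.0]  # one outlier

mu = statistics.mean(data)
sigma = statistics.stdev(data)

reps = 1000
count = 0
for _ in range(reps):
    y_rep = [random.gauss(mu, sigma) for _ in range(len(data))]
    if max(y_rep) >= max(data):
        count += 1

p_value = count / reps
print(p_value)  # near zero: the fitted model can't explain the observed max
```

A tiny tail probability here signals that the inferences don’t “make sense”—the cue, in step 3, to improve the model and go around the loop again.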

If you look carefully you’ll see I said nothing about “prior,” just “model.” So my answer to your question is: Yes, you can change your statistical model. Nothing special about the “prior.” You can change your “likelihood” too.

And Mayo responded:

Thanks. But surely you think it’s problematic for a subjective Bayesian who purports to be coherent?

I wrote back: No, subjective Bayesianism is inherently incoherent. As I’ve written, if you could in general express your knowledge in a subjective prior, you wouldn’t need Bayesian Data Analysis or Stan or anything else: you could just look at your data and write your subjective posterior distribution. The prior and the data models are just models; they’re not in practice correct or complete.

More here on noninformative priors.

And here’s an example of the difficulty of throwing around ideas like “prior probability” without fully thinking them through.

The post “Can you change your Bayesian prior?” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “The belief was so strong that it trumped the evidence before them.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>We’ve previously talked about bloggers trying to live on a food stamp budget for a week (yeah, that’s a thing). One of the many odd recurring elements of these posts is a litany of complaints about life without caffeine because…

I had already understood that coffee, pistachios and granola, staples in my normal diet, would easily blow the weekly budget.

Which is really weird because coffee isn’t all that expensive.

Palko then goes into detail about how easy it is to buy a can of ground coffee at the supermarket for the cost of 5 or 10 cents a cup.

He continues:

On the other end, if you go to $0.15 or $0.20 a cup and you know how to shop, you can move up into some surprisingly high-quality whole bean coffee . . . you can do better than the typical cup of diner coffee for a dime and better than what you’d get from most coffee houses for a quarter.

To be clear, I’m not recommending that everyone rush out to Wal-Mart for a big ol’ barrel of Great Value Classic Roast. If your weekly food budget is more than fifty dollars a week, bargain coffee should be near the bottom of your concerns.

But here’s the important point—that is, important in general, not just for coffee drinkers (of which I am not one):

What we’re interested in here are perceptions. The people we discussed earlier suffered through a week of headaches and other caffeine-withdrawal pains, not because they couldn’t afford it but because the belief that they couldn’t afford it was so strong that it trumped the evidence before them.

This comes up a lot. People condition on information that isn’t true.

The post “The belief was so strong that it trumped the evidence before them.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** “Can you change your Bayesian prior?”

**Wed:** How to analyze hierarchical survey data with post-stratification?

**Thurs:** A political sociological course on statistics for high school students

**Fri:** Questions about data transplanted in kidney study

**Sat:** Performing design calculations (type M and type S errors) on a routine basis?

**Sun:** “Another bad chart for you to criticize”

The post We provide a service appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I got the attached solicitation [see below], and Google found me your blog post on the topic. Thank you for quickly explaining what’s going on here!

As far as I can see, they’ve removed the mention of payment from this first contact message – so they’re learning!

But also they have enough past clients to be able to include some nice clips. Ah, the pathological results of making academics feel obliged to self-promote.

This time the email didn’t come from “Nick Bagnall,” it came from “Josh Carpanini.” Still spam. But, as I wrote last time, it’s better than mugging old ladies for spare change or selling Herbalife dealerships.

P.S. Here’s the solicitation:

From: Josh Carpanini

Date: Friday, June 5, 2015

Subject: International Innovation – Highlighting Impacts of Technology Research

Dear Dr **,

I hope this message finds you well.

I was hoping to speak with you at some point in the next few days about an upcoming Technology edition of International Innovation. I have come across some of your research and I am very interested to discuss with you the possibility of highlighting your work within the forthcoming July edition.

I would like to create an article about your work within our next edition; this would be similar in format to some of the attached example articles from previous editions. As you can see, the end result would be a piece looking at the wider implications and impact of your current research. . . .

The post We provide a service appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Plaig! (non-Wegman edition) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>What initially disturbed me about the art of Shepard Fairey is that it displays none of the line, modeling and other idiosyncrasies that reveal an artist’s unique personal style. His imagery appears as though it’s xeroxed or run through some computer graphics program; that is to say, it is machine art that any second-rate art student could produce. . . .

Fairey’s Greetings from Iraq is not a direct scan or tracing of the FAP print, but it does indicate an over reliance on borrowing the design work of others. There was no political point or ironic statement to be made by expropriating the FAP print – it was simply the act of an artist too lazy to come up with an original artwork. . . .

Some supporters of Shepard Fairey like to toss around a long-misunderstood quote by Pablo Picasso, “Good artists copy, great artists steal.” Aside from the ridiculous comparison of Fairey to Picasso, there’s little doubt that Picasso was referring to the “stealing” of aesthetic flourishes and stylings practiced by master artists, and not simply carting off their works and putting his signature to them.

A last ditch defense used by Fairey groupies is to acknowledge that their champion does indeed “borrow” the works of other artists both living and deceased, but it is argued that the plundered works are all in the “public domain”, and therefore the rights of artists have not been violated. There are those who say that artists should have the right to alter and otherwise modify already existing works in order to produce new ones or to make pertinent statements. Despite some reservations I generally agree with that viewpoint – provided that such a process is completely transparent. . . .

I’m reminded of George Orwell’s classic slam on lazy and dishonest writing:

Each of these passages has faults of its own, but, quite apart from avoidable ugliness, two qualities are common to all of them. The first is staleness of imagery; the other is lack of precision. The writer either has a meaning and cannot express it, or he inadvertently says something else, or he is almost indifferent as to whether his words mean anything or not. This mixture of vagueness and sheer incompetence . . .

Laziness and dishonesty go together, and that fits the stories of Shepard Fairey and Ed Wegman as well. You copy from someone else, and you have nothing of your own to add, so you hide your sources, and this sends you into a sort of spiral of lies. In which case, why do any work at all? In Fairey’s case, the work is all about promotion, not about the art itself. In Wegman’s case, the work all goes into lawsuits and backroom maneuvering, not into the statistics.

Once you’re hiding your sources, you might as well cut corners on the product, eh?

The post Plaig! (non-Wegman edition) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post That was easy appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Are you available this afternoon or Wednesday to talk about a fact-check article I’m doing on Gov. Scott Walker’s statement that Wisconsin is a “blue” state?

I’m aware, of course, that Wisconsin has voted for the Democratic presidential nominee in each election since 1988.

But I’d like to talk about whether there are other common ways that states are labeled as red or blue (or perhaps purple).

Tues and Wed have already passed, so it’s probably too late, but here’s my response: I would call Wisconsin a 50-50 or “purple” state, in that its vote split has been very close to the national average in recent presidential elections.

The post That was easy appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Aahhhhh, young people! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Ummmm, let’s do a quick calculation: 50 – 12 = 38. If you assume the average woman lives to be 80, then the proportion of the population who is menstruating is approximately .52*38/80 = .247 (the .52 being the share of the population that is female).

25% is hardly “basically half”!

But if you’re a young adult, I guess you don’t think so much about people who are under 12 or over 50.
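The back-of-the-envelope calculation above, in code form (the 52%-female and ages-12-to-50 figures are the post’s rough assumptions, not precise demographics):

```python
# Rough share of the population that is menstruating, per the post's assumptions:
# menstruation from roughly age 12 to 50, average lifespan 80, 52% female.
frac_female = 0.52
menstruating_years = 50 - 12  # 38 years
lifespan = 80

share = frac_female * menstruating_years / lifespan
print(round(share, 3))  # 0.247 -- about a quarter, not "basically half"
```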

I was similarly amused by the mistake of Beall and Tracy, authors of that now-famous ovulation-and-clothing study, who thought that peak fertility started 6 days after menstruation. If you’re young, you’ve probably been reminded by sex-ed classes that you can get pregnant at any time. It’s only when you get older that you learn about which are the most important days if you’re trying to get pregnant.

The post Aahhhhh, young people! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Data-analysis assignments for BDA class? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>So, I need a bunch of examples. I’d appreciate your suggestions. Here’s what I’ve got so far:

Classic examples:

8 schools

Arsenic in Bangladesh

Modern classics:

World Cup

Speed dating

Hot hand

Gay rights opinions by age

The effects of early childhood intervention in Jamaica

I’m also not clear on how to set things up: Do I just throw them example after example and have them try their best, or do I start with simple one- and two-parameter models and then go from there?

One idea is to go on two parallel tracks, with open-ended real-data examples that follow no particular order, and fake-data, confidence-building examples that go through the chapters in the book.

Anyway, any suggestions of yours would be appreciated. Thanks.

The post Data-analysis assignments for BDA class? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Soylent 1.5” < black beans and yoghurt appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Mark Palko quotes Justin Fox:

On Monday, software engineer Rob Rhinehart published an account of his new life without alternating electrical current — which he has undertaken because generating that current “produces 32 percent of all greenhouse gases, more than any other economic sector.” Connection to the power grid isn’t all Rhinehart has given up. He also doesn’t drive, wash his clothes (or hire anyone else to wash them) or cook anything but coffee and tea. But he still lives in a big city (Los Angeles) and is chief executive officer of a corporation with $21.5 million in venture capital funding.

That corporation is Rosa Labs, the maker of Soylent, a “macronutritious food beverage” designed to free its buyers from the drudgery of shopping, cooking and chewing. In the 2,900-word post on his personal blog, Rhinehart worked in an extended testimonial for Soylent 2.0, a new, improved version of the drink — algae and soy seem to be the two most important ingredients — that will begin shipping in October.

Fox’s piece is headlined, “Soylent Is Weird, But It’s Good Weird.”

But *is* it really “good weird”? Or, if so, what kind of “good” is it?

According to Palko, Soylent is *not so nutritious*.

Here’s the comparison of 115 grams of Soylent to comparable servings of black beans and nonfat Greek yoghurt (nutrition labels shown in the original post).

And I think it’s safe to say it’s *not so delicious*.

Nor is it so amazingly convenient. Palko writes:

Nor do you have to cook to do better than Soylent. I did a quick check at the grocery store last night and I found lots of frozen entrees that gave you more nutrition for less calories than Rosa Lab’s product.

In summary:

Basically, when you cut through all of the pseudo science and buzzwords and LOOKATME antics, Rhinehart is simply peddling a mediocre protein shake with the same tired miracle food claims that marketers have been using since John Harvey Kellogg gave C.W. Post his first enema.

**The paradox . . . or is it?**

At first this seems like a paradox . . . Silicon Valley genius, $21 million in venture capital funding . . . how could it be just a scam?

But then you realize that nutrition has nothing to do with it (other than as a marketing concept).

Recall that the goal of the people who invested 21 million dollars in this product is not to give people healthy and satisfying meals; it’s to have the image of something healthy and satisfying.

Is Soylent a scam? Yes and no. It’s a scam to the people who are being sold the product, but maybe not to the investors.

Perhaps the whole Silicon Valley thing is a distraction, and the right analogy is to something like the movie Battleship, which was universally agreed to be crap but still sold jillions of dollars worth of tickets.

So, when business writer Justin Fox writes that Soylent is “good” and that it is “an interesting product,” this would be like a movie reviewer saying that Battleship is a good movie. It was good to its investors, I assume!

And it’s disappointing for a business writer to credulously take Rhinehart’s word on the health benefits of “macronutrient balance” and “glycemic index” of the products he’s selling, without just going to the supermarket and comparing to the label on a can of black beans and a tub of yoghurt.

But is Soylent a good model for a business? I guess that depends on whether potential consumers view it as a sugary, fatty, bad-tasting alternative to beans and yoghurt, or as a healthy processed-food alternative to a breakfast of cornflakes and Coca-Cola.

And that in turn must depend in part on press coverage. As Palko has written elsewhere on his blog, A Statistician Walks into a Grocery Store, journalists typically don’t seem to have a good framework for writing about food and nutrition, especially when it comes to low budgets.

So, in that sense, the credulous news reports on Soylent (and it’s not just Justin Fox; see, for example, this gee-whiz article by Lizzie Widdicombe in the New Yorker, subtitled, “Has a tech entrepreneur come up with a product to replace our meals?”) are just part of the larger picture.

Food and nutrition reporting have little context. Imagine if entertainment reporting were the same way:

Battleship: The Hamlet for the 21st Century

The post “Soylent 1.5” < black beans and yoghurt appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Daniel on Stan at the NYC Machine Learning Meetup appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Bob gave a talk there 3.5 years ago. My talk will be light and include where we’ve been and where we’re going.

P.S. If you make it, find me. I have Stan stickers to give out.

P.P.S. Stan is on twitter.

The post Daniel on Stan at the NYC Machine Learning Meetup appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Macartan Humphreys on the Worm Wars appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>My Columbia political science colleague shares “What Has Been Learned from the Deworming Replications: A Nonpartisan View”:

Last month there was another battle in a dispute between economists and epidemiologists over the merits of mass deworming. In brief, economists claim there is clear evidence that cheap deworming interventions have large effects on welfare via increased education and ultimately job opportunities. It’s a best buy development intervention. Epidemiologists claim that although worms are widespread and can cause illnesses sometimes, the evidence of important links to health is weak and knock-on effects of deworming to education seem implausible. . . .

So. Deworming: good for educational outcomes or not?

You’ll have to click through to read the details, but here’s Macartan’s quick summary:

The conclusions that I take away though are that (a) the magnitude and significance of spillover effects are in doubt because of the measurement issues and the inference issues; (b) the inferences on the main effects are also in doubt because of the problems with identification and explanation. Neither of the main claims is demonstrably incorrect, but there are good grounds to doubt both of them.

What about policy? Macartan continues:

A number of commentators have argued that the policy implications are more or less unchanged. This includes organizations that focus specifically on the evidence base for policy (such as CGD and GiveWell).

Perhaps the most important point of confusion is what policy conclusions this discussion could affect. Many are defending deworming for non-educational reasons. But the discussion of the MK [Miguel and Kremer] paper really only matters for the education motivation. And perhaps primarily for the short-term school attendance motivation. Like much other literature in this area it finds only weak evidence for direct health benefits (beyond the strong evidence for the removal of worms). It also does not claim to find evidence on actual performance. Although many groups endorse deworming for health reasons, and rank it as a top priority, this, curiously, goes against the weight of evidence as summarized in the Cochrane reports at least. If the consensus for deworming for health reasons still stands it is not because of this paper.

Does the challenge to this paper weaken the case for deworming for educational reasons? I find it hard to see how it cannot.

I have a few comments of my own, not on deworming—I know nothing about that—but on some of the statistical points raised by Macartan’s post.

– The 800-pound gorilla in the room is opportunity cost, or cost-benefit analysis. As you say, who could be against de-worming kids? I’m reminded of Jeff Sachs’s argument that *all* of these sorts of interventions are worth doing, and that rather than trying so hard to rank the cost-effectiveness of different health and economic interventions, the rich countries should just kick in that 1% of GDP or whatever and do all of them. I’m not saying Sachs is necessarily right on this, I’m just saying that most of the discussion seems to be on traditional statistical grounds (Is there an effect? Is it statistically significant? Has it been proven beyond a reasonable doubt?) and the cost-benefit or opportunity cost calculations are implicit. Once or twice, cost-benefit calculations do get done, but not in a serious way. For example, Macartan points to a “60 to 1” benefit-to-cost ratio for deworming claimed by the Copenhagen Consensus, but apparently those guys just took the point estimate of effectiveness (which is a biased estimate, possibly hugely biased; see more on this below) and ran with it.

– Macartan talks about multiple comparisons, which is fine (though I’d prefer hierarchical modeling rather than classical corrections; see here and here). Macartan mentions the statistical significance filter: Statistically significant estimates tend to overestimate the magnitude of true effects (we call it the type M error or exaggeration factor here). This can be a big deal, especially once things get to the decision stage.

– Macartan mentions development economist Paul Gertler. I’ve only encountered his work once, and it was a case where he hyped and exaggerated (unintentionally, I’m sure) an effect size. I contacted him about it and asked him if he was concerned about the statistical significance filter, and he did not reply. Apparently he was happy reporting an overestimate. It was an early-childhood intervention experiment in Jamaica. Again, who could object to helping poor kids?

– I share Macartan’s skepticism about the spillovers. One problem here is that researchers have an incentive to make a “discovery.” De-worming helps kids, ok, that’s fine. But a spillover effect, that’s news. But the paradox is that these surprising findings are *more* subject to the statistical significance filter. The headline claims can be the biggest overestimates. And this is completely consistent with the calculation in section 3.4.1 of Macartan’s report. It is similar to the calculation that Eric Loken and I did regarding the notorious claim that women in a certain part of their monthly cycle were more likely to wear red. The researchers were proud of making this discovery with such a noisy measuring instrument, but if you back out how large the effect would’ve had to be, for the claimed effect to show in the population, it would have to be unrealistically huge. And of course this happened with that horrible LaCour study—the claimed effects in the aggregate implied huge effects in the subgroup of the population who would’ve been affected by the treatment.

– I don’t like Macartan’s section 4.2, “Can we be a bit more Bayesian?” I guess I’d like him to be a bit *more* Bayesian. In particular, I really don’t like the sort of binary thinking in which deworming works or doesn’t work for some purpose. To me, the concern is not that deworming or whatever is a “dud” but rather that it is not as effective as the published record might suggest. For a Bayesian decision analysis I’d prefer to do it straight, with costs, benefits, and a continuous parameter that represents the effectiveness of the treatment. Even setting the decision analysis aside, you can do Bayesian inference: just say there’s a true (population, average) causal effect and that you have a prior for it. Then it’s simple inference, an inverse-variance weighted average of the data and the prior information, no need for tricky probability formulas.
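That inverse-variance weighted average is just the normal-normal conjugate update; here is a minimal sketch (the numbers in the example are invented, not from the deworming literature):

```python
# Minimal sketch of the inverse-variance weighted average:
# posterior for a normal likelihood y ~ N(theta, se^2)
# combined with a normal prior theta ~ N(mu0, tau^2).
def normal_posterior(y, se, mu0, tau):
    precision = 1 / se**2 + 1 / tau**2
    mean = (y / se**2 + mu0 / tau**2) / precision
    return mean, precision ** -0.5

# e.g., a noisy estimate of 0.6 (s.e. 0.3) combined with a
# skeptical prior centered at 0 with sd 0.15 (invented numbers):
mean, sd = normal_posterior(0.6, 0.3, 0.0, 0.15)
print(round(mean, 3), round(sd, 3))  # 0.12 0.134
```

The skeptical prior pulls the noisy estimate most of the way toward zero, which is exactly the partial-pooling behavior that binary “works or doesn’t work” thinking throws away.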

Finally, I appreciate the way that, in his report, Macartan moves back and forth between the details and the big questions. These connections are a key part of any methodological analysis.

The post Macartan Humphreys on the Worm Wars appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post My 2 classes this fall appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Modern Bayesian methods offer an amazing toolbox for solving science and engineering problems. We will go through the book Bayesian Data Analysis and do applied statistical modeling in Stan, using R (or Python or Julia if you prefer) to preprocess the data and postprocess the analysis. We will also discuss the relevant theory and get to open questions in model building, computing, evaluation, and expansion. The course is intended for students who want to do applied statistics and also those who are interested in working on statistics research problems.

Stat 8307, Statistical Communication and Graphics

Communication is central to your job as a quantitative researcher. Our goal in this course is for you to improve at all aspects of statistical communication, including writing, public speaking, teaching, informal conversation and collaboration, programming, and graphics. With weekly assignments and group projects, this course offers you a chance to get practice and feedback on a range of communication skills. All this is in the context of statistics; in particular we will discuss the challenges of visualizing uncertainty and variation, and the ways in which a deeper integration of these concepts into statistical practice could help resolve the current statistical crisis in science. Statistics research is not separate from communication; the two are intertwined, and this course is about you putting in the work to become a better writer, teacher, speaker, and statistics practitioner.

The communication and graphics course should be no problem; I’ll teach it pretty much how I taught it last year, with 2 meetings a week, diaries, jitts, homeworks, class discussions, projects, etc.

The Bayes class I’ll be doing in a new way. It’ll meet once a week, and my plan is for the first half of each class to be a discussion of material from the book and in the second half for students to work together using Stan, with me and the teaching assistant walking around helping. Also, the homeworks will be more Stan-centered. The idea is for the students to really learn applied Bayesian statistics, as well as to have a chance to grapple with important theoretical concepts and to be introduced to the research frontier. We’ll see how it goes. The key will be coming up with in-class and homework assignments that give students the chance to fit Bayesian models for interesting problems.

The post My 2 classes this fall appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** “Soylent 1.5” < black beans and yoghurt

**Wed:** 0.05 is a joke

**Thurs:** Data-analysis assignments for BDA class

**Fri:** Aahhhhh, young people!

**Sat:** Plaig! (non-Wegman edition)

**Sun:** We provide a service

The post Rockin the tabloids appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The prevailing structures of personal reputation and career advancement [in biology] mean the biggest rewards often follow the flashiest work, not the best. . . .

We all know what distorting incentives have done to finance and banking. The incentives my colleagues face are not huge bonuses, but the professional rewards that accompany publication in prestigious journals – chiefly Nature, Cell and Science.

These luxury journals are supposed to be the epitome of quality, publishing only the best research. Because funding and appointment panels often use place of publication as a proxy for quality of science, appearing in these titles often leads to grants and professorships. But the big journals’ reputations are only partly warranted. . . .

These journals aggressively curate their brands, in ways more conducive to selling subscriptions than to stimulating the most important research. Like fashion designers who create limited-edition handbags or suits, they know scarcity stokes demand, so they artificially restrict the number of papers they accept. . . .

A paper can become highly cited because it is good science – or because it is eye-catching, provocative or wrong. Luxury-journal editors know this, so they accept papers that will make waves because they explore sexy subjects or make challenging claims. . . . It builds bubbles in fashionable fields where researchers can make the bold claims these journals want . . .

In extreme cases, the lure of the luxury journal can encourage the cutting of corners, and contribute to the escalating number of papers that are retracted as flawed or fraudulent. . . .

Sharif don’t like it.

The post Rockin the tabloids appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Why couldn’t Breaking Bad find Mexican Mexicans? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post ShinyStan v2.0.0 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**ShinyStan v2.0.0 released**

ShinyStan v2.0.0 is now available on CRAN. This is a major update with a new look and a lot of new features. It also has a new(ish) name: **ShinyStan** is the app/GUI and **shinystan** the R package (both had formerly been shinyStan for some reason apparently not important enough for me to remember). Like earlier versions, this version has enhanced functionality for Stan models but is compatible with MCMC output from other software packages too.

You can install the new version from CRAN like any other package:

`install.packages("shinystan")`

If you prefer a version with a few minor typos fixed you can install from Github using the devtools package:

`devtools::install_github("stan-dev/shinystan", build_vignettes = TRUE)`

(Note: after installing the new version and checking that it works we recommend removing the old one by running `remove.packages("shinyStan")`.)

If you install the package and want to try it out without having to first fit a model you can launch the app using the preloaded demo model:

`library(shinystan)`

`launch_shinystan_demo()`

**Notes**

This update contains a lot of changes: new features, greater UI stability, and an entirely new look. Some release notes can be found on GitHub and there are also some instructions for getting started on the ShinyStan wiki page. Here are two highlights:

- The new interactive diagnostic plots for Hamiltonian Monte Carlo. In particular, these are designed for models fit with Stan using NUTS (the No-U-Turn Sampler).
- The `deploy_shinystan` function, which lets you easily deploy ShinyStan apps for your models to RStudio’s ShinyApps hosting service. Each of your apps (i.e. each of your models) will have a unique URL. To use this feature please also install the shinyapps package: `devtools::install_github("rstudio/shinyapps")`.

The plan is to release a minor update with bug fixes and other minor tweaks in a month or so. So if you find anything we should fix or change (or if you have any other suggestions) we’d appreciate the feedback.

The post ShinyStan v2.0.0 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Harry S. Truman, Jesus H. Christ, Roy G. Biv appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Hey—Don’t trust anything coming from the Tri-Valley Center for Human Potential! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>We selected 12 already published research articles by investigators from prestigious and highly productive American psychology departments, one article from each of 12 highly regarded and widely read American psychology journals with high rejection rates (80%) and nonblind refereeing practices.

With fictitious names and institutions substituted for the original ones (e.g., Tri-Valley Center for Human Potential), the altered manuscripts were formally resubmitted to the journals that had originally refereed and published them 18 to 32 months earlier. Of the sample of 38 editors and reviewers, only three (8%) detected the resubmissions. This result allowed nine of the 12 articles to continue through the review process to receive an actual evaluation: eight of the nine were rejected. Sixteen of the 18 referees (89%) recommended against publication and the editors concurred. The grounds for rejection were in many cases described as “serious methodological flaws.”

Amusing. On the plus side, it could reflect a positive trend, that crappy papers that were getting accepted 2 years ago, would get rejected now.

The post Hey—Don’t trust anything coming from the Tri-Valley Center for Human Potential! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Reprint of “Observational Studies” by William Cochran followed by comments by current researchers in observational studies appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Norman Breslow

Thomas Cook

David Cox & Nanny Wermuth

Stephen Fienberg

Joseph Gastwirth & Barry Graubard

Andrew Gelman

Ben Hansen & Adam Sales

Miguel Hernan

Jennifer Hill

Judea Pearl

Paul Rosenbaum

Donald Rubin

Herbert Smith

Mark van der Laan

Tyler VanderWeele

Stephen West.

My discussion is called “The state of the art in causal inference: Some changes since 1972.”

Cochran’s article and all the discussions are downloadable in a convenient pdf here, at the journal’s website. Lots to chew on.

The post Reprint of “Observational Studies” by William Cochran followed by comments by current researchers in observational studies appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Wasting time reading old comment threads appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I was linking to something and came across this hilarious thread, which culminated in this revelation by commenter Jrc:

True story: after reading this post, http://andrewgelman.com/2011/01/12/picking_pennies/, I started going to the Jamaican store around the corner. I was eating a lot of those things by the end. Its probably good that we moved.

The post Wasting time reading old comment threads appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post It’s hard to replicate (that is, duplicate) analyses in sociology appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I read the comments. The topic arouses a lot of passion. Some of the commenters are pretty rude! And, yes, I’m glad to see this post, given my own frustrating experience trying to re-analyze a sociology study. To me, one of the big problems is the idea that once a paper is published, it is considered to be Truth, by the authors, by promotion committees, by the ASR and the NYT. Take that away, and everything changes: all of a sudden there’s not so much incentive to hide your data.

The post It’s hard to replicate (that is, duplicate) analyses in sociology appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Neither time nor stomach appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Mark Palko writes:

Thought you might be interested in an EngageNY lesson plan for statistics. So far no (-2)x(-2) = -4 (based on a quick read), but still kind of weak. It bothers me that they keep talking about randomization but only for order of test; they assigned treatment A to the first ten of each batch.

Maybe I’m just in a picky mood.

I replied that I don’t like this bit at all: “Students use a randomization distribution to determine if there is a significant difference between two treatments.”

I don’t like randomization tests and I really really really don’t like the idea that the purpose of a study is “to determine if there is a significant difference between two treatments.”

Also it’s a bit weird that it’s in Algebra II. This doesn’t seem like algebra at all.

Palko added:

If you have the time (and the stomach), I’d recommend going through the entire “Topic D” section. You’ll find lots more to blog about.

I fear I have neither time nor stomach for this.

The post Neither time nor stomach appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Fitting a multilevel model appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I have a question about the use of BRT (Boosting regression tree). I am planning to write an article about the effects of soil fauna and understory fine roots on forest soil organic carbon. The experiment was conducted in a subtropical forest area in China. There were 16 blocks each with 5 coarse-mesh nylon bag wrapped and 5 fine-mesh nylon bag wrapped soil cores throughout four forests (forests A/B/C/D). Each forest had four different blocks. The coarse mesh (4 mm) bags allowed the access of both the entire soil fauna and understory plant fine roots. The meshes of fine bags were on the whole 0.038 mm, except that a row of 4 mm holes were distributed evenly at the lateral side and a circle of 4 mm holes at the circumferential bottom. Ideally, we supposed that this kind of fine bags could allow the access of most of soil fauna through 4 mm holes, while getting rid of the roots by the overall 0.038 mm meshes.

Originally, the dataset generated from this experiment should be 160 pieces of record constituted of soil total carbon, dissolved organic carbon, microbial biomass carbon plus soil fauna abundances and fine roots biomass. The soil fauna data was of coarse taxonomic resolution and mainly included six abundant fauna taxa, i.e., Acari, Collembola, Hymenoptera, Nematode, Enchytraeid. For fine roots data, parts of records in forests B (18 records) and C (17 records) were missing because of faulty operation. As a result, fine roots biomass had records at only 78% soil cores.

Because BRT can handle missing data in predictor variables by using surrogates, I decided to use BRT to determine the relative influence of predictor variables on the individual response for soil carbon, especially for the abundant fauna taxa and fine roots. So, can BRT be applicable to the abovementioned dataset, which only has 160 records and one of its main explanatory variable covers less than 80% records?

I have sent a similar mail to Professor *** who advised me to read your book “Data Analysis Using Regression and Multilevel/Hierarchical Models” for the resolution. Since my data are structured (samples within blocks within forests), a generalized linear mixed model will be suited. But I am worrying about that, while removing the records with missing data about fine roots, the experimental design would be an unbalanced one and at the expense of less useful data of other variables such as soil carbon and soil fauna taxa. Upon the unbalanced design, I will use Redundancy Analysis as a tool for multivariate ANOVA. Through RDA, I might select the several most influential fauna taxa probably with fine root. After this action, I would use Variation Partitioning to quantifying the variation explained by these factor. However, I am puzzled that, should the model used by RDA involve the effects of mesh size of nylon bag (fine mesh versus coarse mesh) and blocks factors or as well as forests? I am mainly using software R for statistical analyses. If the alternative of BRT is RDA, how can the model of RDA involve the data structure which seems to be a split-plot design? I feel that it is difficult to use functions such as adonis {vegan} in R to define the data structure of split-plot designs, especially under the condition of unbalanced design. What’s worse, my focus is the effects of soil fauna and fine root instead of mesh size/blocks factors/forests.

My reply:

You write, “BRT can handle missing data in predictor variables by using surrogates.” But any method for prediction can handle missing data by using surrogates, no? Or maybe I’m missing something here. In any case, here are some general comments:

– If your missingness is only in the outcome variable, then just fit a multilevel model, no problem.

– If some of your predictors have some missingness, I recommend imputing the missing values first using some multivariate imputation method. We have an R package “mi” to do this, but other options are available too.

– Unbalanced data don’t present any difficulty; indeed, multilevel models are well suited to unbalanced data. Unbalanced data is one of the main motivations for fitting multilevel models.

– Similarly, you’re fine with any mix of nested and non-nested factors. Split-plot is fine too, the multilevel model will handle it automatically, as discussed in my 2005 paper.

– Once it’s time for you to fit your multilevel model, I recommend you use Stan. In my book with Jennifer we use lmer, but now I use Stan. We just haven’t updated the book yet.

The post Fitting a multilevel model appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Dan Kahan doesn’t trust the Turk appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Dan Kahan writes:

I [Kahan] think serious journals should adopt policies announcing that they won’t accept studies that use M Turk samples for types of studies they are not suited for. . . . Here is my proposal:

Pending a journal’s adoption of a uniform policy on M Turk samples, the journal should oblige authors who use M Turk samples to give a full account of why the authors believe it is appropriate to use M Turk workers to model the reasoning process of ordinary members of the U.S. public. The explanation should consist of a full accounting of the authors’ own assessment of why they are not themselves troubled by the objections that have been raised to the use of such samples; they shouldn’t be allowed to dodge the issue by boilerplate citations to studies that purport to “validate” such samples for all purposes, forever & ever. Such an account helps readers to adjust the weight that they afford study findings that use M Turk samples in two distinct ways: by flagging the relevant issues for their own critical attention; and by furnishing them with information about the depth and genuineness of the authors’ own commitment to reporting research findings worthy of being credited by people eager to figure out the truth about complex matters.

There are a variety of key points that authors should be obliged to address.

First, M Turk workers recruited to participate in “US resident only” studies have been shown to misrepresent their nationality. Obviously, inferences about the impact of partisan affiliations distinctive of US society on the reasoning of members of the U.S. general public cannot validly be made on the basis of samples that contain a “substantial” proportion of individuals from other societies (Shapiro, Chandler and Muller 2013) Some scholars have recommended that researchers remove from their “US only” M Turk samples those subjects who have non-US IP addresses. However, M Turk workers are aware of this practice and openly discuss in on-line M Turk forums how to defeat it by obtaining US-IP addresses for use on “US worker” only projects (Chandler, Mueller & Paolacci 2014). Why do the authors not view this risk as one that makes using M Turk workers inappropriate in a study like this one?

Second, M Turk workers have demonstrated by their behavior that they are not representative of the sorts of individuals that studies of political information-processing are supposed to be modeling. Conservatives are grossly under-represented among M Turk workers who represent themselves as being from the U.S. (Richey 2012). One can easily “oversample” conservatives to generate adequate statistical power for analysis. But the question is whether it is satisfactory to draw inferences about real US conservatives generally from individuals who are doing something that such a small minority of real U.S. conservatives are willing to do. It’s easy to imagine that the M Turk “US” conservatives lack sensibilities that ordinary US conservatives normally have—such as the sort of disgust sensibilities that are integral to their political outlooks (Haidt & Hersch 2001), and that would likely deter them from participating in a “work force” a major business activity of which is “tagging” the content of on-line porn. These unrepresentative “US” conservatives might well not react as strongly or dismissively toward partisan arguments on a variety of issues. Is this not a concern for the authors? It is for me, and I’m sure would be for many readers trying to assess what to make of a study like this.

Third, there are in fact studies that have investigated this question and concluded that M Turk workers do not behave the way that US general population or even US student samples do when participating in political information-processing experiments (Krupnikov & Levine 2014). Readers will care about this—and about whether the authors care.

Fourth, Amazon M Turk worker recruitment methods are not fixed and are neither warranted nor designed to be calibrated to generate samples suitable for scholarly research. No serious person who cares about getting at the truth would accept the idea that a particular study done at a particular time could “validate” M Turk, for the obvious reason that Amazon doesn’t publicly disclose its recruitment procedures, can change them and has on multiple occasions, and is completely oblivious to what researchers care about. A scholar who decides it’s “okay” to use M Turk anyway should tell readers why this does not trouble him or her.

Fifth, M Turk workers share information about studies and how to respond to them (Chandler, Mueller & Paolacci 2014). This makes them completely unsuitable for studies that use performance-based reasoning proficiency measures, which M Turk workers have been massively exposed to. But it also suggests that the M Turk workforce is simply not an appropriate place to recruit subjects from; they are evincing a propensity to behave in a manner that makes all of their responses highly suspect. Imagine you discovered that the firm you had retained to recruit your sample had a lounge in which subjects about to take the study could discuss it w/ those who just had completed it; would you use the sample, and would you keep coming back to that firm to supply you with study subjects in the future? If this does not bother the authors, they should say so; that’s information that many critical readers will find helpful in evaluating their work.

I [Kahan] feel pretty confident M Turk samples are not long for this world.

OK, so far so good. But I’d bet the other direction on whether M Turk samples (or something similar) are long for this world. Remember Gresham’s Law?

Kahan does also give a positive argument, that there is a better alternative:

Google Consumer Surveys now enables researchers to field a limited number of questions for between $1.10 & $3.50 per complete– a fraction of the cost charged by on-line firms that use valid & validated recruitment and stratification methods.

Google Consumer Surveys has proven its validity in the only way that a survey mode–random-digit dial, face-to-face, on-line –can: by predicting how individuals will actually evince their opinions or attitudes in real-world settings of consequence, such as elections. Moreover, if Google Surveys goes into the business of supplying high-quality scholarly samples, they will be obliged to be transparent about their sampling and stratification methods and to maintain them (or update them for the purposes of making them even more suited for research) over time. . . .

The problem right now w/ Google Consumer Surveys is that the number of questions is limited and so, as far as I can tell, is the complexity of the instrument that one is able to use to collect the data, making experiments infeasible.

But I predict that will change.

OK, maybe so. But it does seem to me that M Turk’s combination of low cost and low validity will make it an attractive option for many researchers.

Some background:

Don’t trust the Turk (also see discussion in comments, back from the days when the sister blog had a useful comments section)

Researchers are rushing to Amazon’s Mechanical Turk. Should they?

That latter post, by Kathleen Searles and John Barry Ryan, concludes that “platitudes such as ‘Don’t trust the Turk’ are nice, but, as is often the case in life, they are too simple to be followed.”

I actually think “Don’t trust the Turk” is a slogan not a platitude but I take their point, and indeed even though Searles and Ryan are broadly pro-Turk while Kahan is anti-Turk, these researchers all offer the common perspective that when evaluating a data source you need to consider the purpose for which it will be used.

**P.S.** Some good discussion in comments. “Don’t trust the Turk” doesn’t mean “Never *use* the Turk.” It means: Be aware of the Turk’s limitations. Don’t exhibit the sort of blind faith associated with the buggy-whip lobby and their purported “grounding in theory.”

The post Dan Kahan doesn’t trust the Turk appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Neither time nor stomach

**Wed:** Reprint of “Observational Studies” by William Cochran followed by comments by current researchers in observational studies

**Thurs:** Hey—Don’t trust anything coming from the Tri-Valley Center for Human Potential!

**Fri:** Harry S. Truman, Jesus H. Christ, Roy G. Biv

**Sat:** Why couldn’t Breaking Bad find Mexican Mexicans?

**Sun:** Rockin the tabloids

The post Stan at JSM2015 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>And everyone should check out Andrew’s breakout performance in “A Stan is Born”.

Update: Turns out I missed even more Stan! There was a great session just this morning, that unfortunately I was not able to post earlier due to some logistical issues (i.e. my inadvertently leaving my laptop behind after my talk yesterday…). Seth will also be talking about his sweet Gaussian processes Tuesday at 10:35 AM.

The post Stan at JSM2015 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Monte Carlo and the Holy Grail appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>On 31 Dec 2010, someone wrote in:

A British Bayesian curiosity: Adrian Smith has just been knighted, and so becomes Sir Adrian. He can’t be the first Bayesian knight, as Harold Jeffreys was Sir Harold.

I replied by pointing to this discussion from 2008, and adding: Perhaps Spiegelhalter can be knighted next. Or maybe Ripley!

My correspondent replied the next day:

I doubt that Ripley will ever get an Honour. But Spiegelhalter did strike me as the most likely next Bayesian knight in 5-10 years. I would not want to put a number on the probability. Please don’t quote me in public on that, unless anonymously.

Now here it is, 4 1/2 years later, and the person informs me that David Spiegelhalter was indeed knighted in 2014!

All hail Lord Spiegelhalter! He will smite you if you communicate statistics poorly.

The post Monte Carlo and the Holy Grail appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Classifying causes of death using “verbal autopsies” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>In areas without complete-coverage civil registration and vital statistics systems there is uncertainty about even the most basic demographic indicators. In such areas the majority of deaths occur outside hospitals and are not recorded. Worldwide, fewer than one-third of deaths are assigned a cause, with the least information available from the most impoverished nations. In populations like this, verbal autopsy (VA) is a commonly used tool to assess cause of death and estimate cause-specific mortality rates and the distribution of deaths by cause. VA uses an interview with caregivers of the decedent to elicit data describing the signs and symptoms leading up to the death. This paper develops a new statistical tool known as InSilicoVA to classify cause of death using information acquired through VA. InSilicoVA shares uncertainty between cause of death assignments for specific individuals and the distribution of deaths by cause across the population. Using side-by-side comparisons with both observed and simulated data, we demonstrate that InSilicoVA has distinct advantages compared to currently available methods.

As I’ve been saying a lot recently, *measurement* is a central and underrated aspect of statistics, so I’m always happy to see serious research on measurement-error models. I hope this project is directly useful and also stimulates further work in this area.

The post Classifying causes of death using “verbal autopsies” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post The secret to making a successful conference presentation appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>in 20 minutes, something like this:

– What is Stan?

– Where does Stan work well?

– Current and future Stan research. For the JSM audience it could be good to spend some time on our exciting future research ideas.

The goal is not to teach people Stan, it’s to get them excited about it. You can also look in our JEBS paper for material, as we do some comparison of Stan with other Bayesian model-fitting options.

Regarding surveys, you can say that you personally are not currently working on survey data but the past and current development of Stan has been motivated by various applications of mine, including survey analysis, and we are currently being supported by the polling company YouGov.

Stan has the potential to revolutionize survey inference, as follows: More and more, surveys are not representative of the population. Problems with non-response, self-selection, etc. So we want to weight or adjust for as many variables as possible so as to match sample to population. But it’s well known that if you try to weight on a lot of variables, and their interactions, the weights will be super-noisy. Better and more stable to do MRP, i.e. hierarchical Bayes. Stan allows us to build big models fast, with lots of predictors and (as will be necessary) informative priors.
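The poststratification step at the end of MRP is simple enough to sketch in a few lines (all numbers hypothetical; in practice the cell estimates would come from a hierarchical model fit in Stan and the counts from census data):

```python
# cell -> (model-based estimate of support, population count for that cell)
cells = {
    ("18-29", "no college"): (0.50, 30_000),
    ("18-29", "college"):    (0.58, 20_000),
    ("65+",   "no college"): (0.38, 35_000),
    ("65+",   "college"):    (0.45, 15_000),
}
total = sum(n for _, n in cells.values())
mrp_estimate = sum(theta * n for theta, n in cells.values()) / total
# the population estimate is the count-weighted average of the cell estimates
```

Nothing in this step depends on the sample being representative; it's the multilevel modeling step that makes the per-cell estimates stable enough to weight over many more cells than you could ever weight on directly.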

The key piece of advice (the secret to giving a good talk) is in bold above.

P.S. And if the computer for the presentation is linked to the sound system, he can start off with the Stan trailer.

The post The secret to making a successful conference presentation appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post When does Bayes do the job? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’m writing a paper where I discuss one of the advantages of Bayesian inference, namely that it scales up to complex problems where maximum likelihood would simply be unfeasible or unattractive. I have an example where 2000 parameters are estimated in a nonlinear hierarchical model; MLE would not fare well in this case.

I recall that you have also stressed this issue, and I’d like to acknowledge that. Do you have pointers to a few of your papers where you explicitly mention this? Ideally I would just take a quotation.

I responded:

Bayes will do this but only with informative priors. With noninformative priors, the Bayes answer can sometimes be worse than maximum likelihood; see section 3 of this 1996 paper which I absolutely love.

Then there’s this paper about why, with hierarchical Bayes, we don’t need to worry about multiple comparisons.

Here’s a quote from that paper:

Researchers from nearly every social and physical science discipline have found themselves in the position of simultaneously evaluating many questions, testing many hypotheses, or comparing many point estimates. . . . we believe that the problem is not multiple testing but rather insufficient modeling of the relationship between the corresponding parameters of the model. Once we work within a Bayesian multilevel modeling framework and model these phenomena appropriately, we are actually able to get more reliable point estimates. A multilevel model shifts point estimates and their corresponding intervals toward each other (by a process often referred to as “shrinkage” or “partial pooling”), whereas classical procedures typically keep the point estimates stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p values corresponding to intervals of fixed width). In this way, multilevel estimates make comparisons appropriately more conservative, in the sense that intervals for comparisons are more likely to include zero. As a result we can say with confidence that those comparisons made with multilevel estimates are more likely to be valid. At the same time this “adjustment” does not sap our power to detect true differences as many traditional methods do.

That’s a bit long but maybe you can take what you need!
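And here’s a little numerical illustration of the shrinkage being described (a toy simulation of mine, not from the paper; the hyperparameters are treated as known for simplicity):

```python
import numpy as np

# Toy illustration of partial pooling: J groups with true effects theta_j,
# noisy raw estimates y_j, and multilevel (normal-normal) posterior means
# that shrink each y_j toward the group-level mean.
rng = np.random.default_rng(1)

J, tau, sigma = 50, 0.5, 1.0           # groups, between-group sd, within-group se
theta = rng.normal(0.0, tau, size=J)   # true group effects
y = rng.normal(theta, sigma)           # raw (unpooled) estimates

# Normal-normal conjugacy: the posterior mean is a precision-weighted average
# of the raw estimate and the group-level mean (taken as 0 here, with tau and
# sigma treated as known for simplicity).
shrink = tau**2 / (tau**2 + sigma**2)
theta_hat = shrink * y                 # multilevel estimates

# The multilevel estimates are pulled toward each other...
print("raw spread:       ", y.max() - y.min())
print("multilevel spread:", theta_hat.max() - theta_hat.min())
# ...and have smaller error against the truth on average.
print("raw RMSE:        ", np.sqrt(np.mean((y - theta) ** 2)))
print("multilevel RMSE: ", np.sqrt(np.mean((theta_hat - theta) ** 2)))
```

The point estimates move toward each other, and the multilevel root-mean-squared error beats the raw estimates, which is the sense in which the “adjustment” does not sap power.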

You also might enjoy this paper with Aleks on whether Bayes is radical, liberal, or conservative.

The post When does Bayes do the job? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post How Hamiltonian Monte Carlo works appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>(Statement 1) If the kinetic energy comes from a distribution $L$ which is not a symmetric distribution, then thanks to the “Conservation of the Hamiltonian” property we’ll still be able to accept the proposal with probability 1 if we are computing Hamilton’s equations exactly.

(Statement 2) But if we are approximating Hamilton’s equations by discretizing time (using leapfrog), then the acceptance probability [from $(q_0, p_0)$ to $(q_1, p_1)$] won’t simply be $\exp(U_{0}+K_{0}-U_{1}-K_{1})$.

We would have to find $p^*$ for which $(q_{0}, p_{0})$ will be the proposal values if we start from $(q_{1}, p^*)$.

And, in this case, the acceptance probability will be: $\exp(U_{0}+K_{0}-U_{1}-K_{1})L(p^*)/L(p_0)$.

Are both statements correct? If so, it seems that it’s pretty difficult to find the value of $p^*$…

I replied that this is a job for Betancourt. And Betancourt responded:

Let’s review the basics of HMC. We have our target distribution, pi(q) = exp(-V(q)), and the conditional distribution of the momenta, pi(p|q) = exp(-T(q, p)). The HMC transition is then (i) sample the momenta, p ~ pi(p|q) (ii) evolve using Hamiltonian flow, (q, p) -> \phi_{t} (q, p) (iii) marginalize the momenta, (q, p) -> q. Each step in the operation preserves pi(q) and so the entire transition preserves pi(q) and yields a desired sample. T(q, p) can be almost anything and there is no Metropolis correction here.

In practice we can’t do (ii) exactly and we have to approximate the flow with a symplectic integrator. This introduces error and pi(q) is no longer preserved exactly. Typically we fix this by treating the approximate flow as a proposal and add a Metropolis correction, but this requires that the flow be reversible which it is not. To make the flow reversible we can do a few things — we can sample the integration time from any distribution symmetric around zero or, if the kinetic energy is symmetric with respect to p, we can add a momentum flip at the end of the flow, (q, p) -> (q, -p). Outside of NUTS people typically consider a fixed integration time which leaves only the second option, hence the importance of the symmetry of the kinetic energy.

What you’ve proposed is to sample from a distribution pi(p|q) = exp(-L(q, p)) that is not related to the kinetic energy. Immediately this will cause a problem because (i) will preserve only exp(-L) while (ii) preserves only exp(-T) and the combined transition no longer has a stationary distribution. So the exact HMC algorithm doesn’t work. We can still try to use this as a Metropolis-Hastings proposal (although a poorly performing one) if we can make it reversible. How we make it reversible depends on the choice of L(q, p), but in general it will not be an easy problem.

Choosing a kinetic energy is a somewhat subtle problem. Nominally there are no constraints on T(q, p) which is usually a really bad sign — the more the math constrains our options the less tuning we have to do and the more robust the algorithm will be. One way to introduce constraints is to introduce more structure, such as a metric. It turns out that a metric gives a canonical family of kinetic energies with nice properties, and every member of that family is symmetric because of the symmetry of the metric. So asymmetric kinetic energies are not only awkward to use they’re actually really hard to motivate in the first place.

For all the gory details see http://arxiv.org/abs/1410.5110.
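To make the moving parts concrete, here is a minimal HMC sketch (my own toy code, not from the linked paper): a symmetric Gaussian kinetic energy, leapfrog integration, the momentum flip that makes a fixed-length trajectory reversible, and the Metropolis correction for discretization error, targeting a standard normal:

```python
import numpy as np

rng = np.random.default_rng(2)

def U(q):      return 0.5 * q**2   # potential energy: -log density of N(0, 1)
def grad_U(q): return q

def leapfrog(q, p, eps, L):
    p = p - 0.5 * eps * grad_U(q)          # initial half step for momentum
    for _ in range(L - 1):
        q = q + eps * p                    # full step for position
        p = p - eps * grad_U(q)            # full step for momentum
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)          # final half step for momentum
    return q, -p                           # momentum flip: makes the proposal reversible

def hmc(q0, n_iter=5000, eps=0.2, L=10):
    q, draws, accepts = q0, [], 0
    for _ in range(n_iter):
        p = rng.normal()                       # (i) sample momentum from exp(-T)
        q_new, p_new = leapfrog(q, p, eps, L)  # (ii) approximate the Hamiltonian flow
        # Exact flow would conserve H and always accept; leapfrog only
        # approximately conserves it, so accept with min(1, exp(H_old - H_new)).
        if rng.random() < np.exp(U(q) + 0.5 * p**2 - U(q_new) - 0.5 * p_new**2):
            q, accepts = q_new, accepts + 1
        draws.append(q)                        # (iii) marginalize out the momentum
    return np.array(draws), accepts / n_iter

draws, rate = hmc(0.0)
print(f"acceptance rate {rate:.2f}, mean {draws.mean():.2f}, sd {draws.std():.2f}")
```

Because the Gaussian kinetic energy is symmetric in p, the flipped momentum costs nothing in the acceptance ratio, and the leapfrog error is small enough that nearly every proposal is accepted; with an asymmetric L, as discussed above, none of this machinery goes through so cleanly.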

The post How Hamiltonian Monte Carlo works appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Pro Publica’s new Surgeon Scorecards appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>You should definitely weigh in on this…

Pro Publica created “Surgeon Scorecards” based upon risk-adjusted surgical complication rates. They used hierarchical modeling via the lmer package in R.

For the detailed methodology, click the “how we calculated complications” link, then at the top of that next page click on the detailed methodology link to download a publication-quality pdf.

At least three doctors have raised objections:

https://statmd.wordpress.com/2015/07/30/the-little-mixed-model-that-could-but-shouldnt-be-used-to-score-surgical-performance/

http://www.datasurg.net/2015/07/24/an-alternative-presentation-of-the-propublica-surgeon-scorecard/

http://www.kevinmd.com/blog/2015/07/why-the-surgeon-scorecard-is-a-journalistic-low-point-for-propublica.html

Curious as to your critique of Pro Publica’s methodology and results.

Next time this sort of thing is done, maybe they’ll use Stan. But that’s not really the point. The real point is that, yes, I probably should weigh in on this, but it would take a bit of work! This is not your run-of-the-mill “p less than .05” paper in PPNAS, it’s a serious project.

I quickly read through the online critiques and I saw some good points and some bad points. The bad points were some generic ranting against “shrinkage”; the blogger in question didn’t seem to realize that these issues arise in any prediction problem and represent inferential uncertainty that is the inevitable consequence of variation. Another blogger complained about wide uncertainty intervals but, again, that’s just life. The more important criticisms involved data quality, and that’s something I can’t really comment on, at least without reading the report in more detail.

It’s too bad. Something dumb like himmicanes and hurricanes is easy to criticize, easy for me to post on. But an important topic like rating doctors, that would require a lot more work for me to say anything definitive.

I will say, though, that I like what Pro Publica is doing. No model is perfect, but I think this is the way to start: You fit a model, do the best you can, be open about your methods, then invite criticism. You can then take account of the criticisms, include more information, and do better.

So go for it, Pro Publica. Don’t stop now! Consider your published estimates as a first step in a process of continual quality improvement.

The post Pro Publica’s new Surgeon Scorecards appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan Meetup Talk in Ann Arbor this Wednesday (5 Aug 2015) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>To see the abstract and register to attend:

Neat—Wordpress includes a preview of the link.

We’ll see how many people show up expecting to see Andrew despite my saying it’s me and the talk announcement saying it’s me.

The post Stan Meetup Talk in Ann Arbor this Wednesday (5 Aug 2015) appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post The plagiarist next door strikes back: Different standards of plagiarism in different communities appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Commenters on this blog sometimes tell me not to waste so much time talking about plagiarism. And in the grand scheme of things, what could be more trivial than plagiarism in an obscure German book of chess anecdotes? Yet this is what I have come to talk with you about today.

As usual, I will make the claim that this discussion has more general relevance, that a careful exploration of this particularly trivial topic can give us larger insights into statistics and human understanding.

Earlier this year I learned that a fellow student from my graduate program, Christian Hesse, who for many years has been a math professor in Germany, ripped off material for a chess book that he wrote. In the comments to my post on the story, Christian wrote, “The author falsely accuses me of copying material for my chess book.” And by “the author,” Christian is talking about me. But, from the linked material from Edward Winter, it seems pretty clear that Chrissy *did* copy material, introducing errors in the process.

As I wrote in response to Chrissy’s comment, if you’re gonna copy material, you should give the source. Otherwise you’re misleading your readers and allowing yourself to propagate misinformation. And why do that?

But that’s all background. Today I want to focus on a particular aspect of this dispute, which is Chrissy’s implicit argument that this sort of copying is standard in the world of chess, and that Edward Winter and others accuse of plagiarism “anybody who later writes about the same chess games and matches, chess positions, studies, events.” I think what Chrissy is saying is that it’s commonplace to do what he did, which is to copy material from an old chess magazine or book, not attribute the source, and not check it for accuracy. And, indeed, you can go far in the world of chess journalism by copying, much more shamelessly than Christian Hesse has ever done.

So here’s the question: If everybody does it, and Christian’s book has been well received (he quotes a glowing newspaper review), then should we care?

I don’t mean, “Should we care?” as in “Is it important?” In that case, no, we shouldn’t care, any more than we should care whether Tom Brady had his footballs deflated or whether Pete Rose gambled on baseball or whether Lance Armstrong ever uttered a true sentence in his life. This is not an Ed Wegman situation in which professional misconduct was used in the service of potentially consequential political activities, or even a Mark Hauser story in which professional misconduct was used to waste a lot of people’s time and money. It’s just chess. It’s just sports.

No, when I say, “Should we care?”, I mean “Does this matter in the context of the chess book?” In the same way as we could care about Pete Rose because we are baseball fans and don’t want to see the sport become as scripted as the NBA, for example.

And this gets back to the way in which we (that is, Thomas Basbøll, me, and anyone else out there who happens to agree with us) like to frame plagiarism in particular, and scholarly misconduct more generally: **It’s not about the wrongdoing, it’s about the corruption of the communication channel.**

OK, so that wasn’t so pithy; maybe one of you can punch this up enough that it can make it into the lexicon?

Ok, to continue: If the problem with Chrissy’s copying-without-attribution is that he’s cheating, one could well respond that, no, he’s not cheating: the value-added in his chess book does not come from the games and stories themselves but in how he arranges them. If the problem is that he’s breaking the rules, one could well respond that, no, in the chess world, “the rules” allow this sort of thing. If the problem is that he’s stealing, one could respond that chess games are free for all to share, as are stories, and even any directly copied material might be in the public domain by now anyway.

T. S. Eliot wrote, “Immature poets imitate; mature poets steal.” Similar quotes were then attributed to Igor Stravinsky and Pablo Picasso, two other great artists from the modernist period.

Also, Martin Luther King, Jr., plagiarized—have I mentioned that recently??

So, sure, steal and steal and steal away. It’s a waste land out there.

No, the problem with copying without attribution is, *if* anyone’s going to want to care about these stories, they can learn a lot more by knowing where the stories come from. As Basbøll and I wrote, it’s a statistical crime. Which is one reason it makes me sad to see a statistician doing it.

**P.S.** Chrissy also notes that the world chess champion and his father are dear friends of his. I think it’s safe to say that Chrissy is a much better chess player than I am. Much much much better! If we played a game where I got an hour and he got 2 minutes, I’m almost certain he’d destroy me. Also, let me be clear that I am *not* claiming that his book has no value. Not at all! It could well contain some plagiarized material and some good material. Even some of the plagiarized material could be good. Actually, it *should* be good, otherwise why copy it?

The post The plagiarist next door strikes back: Different standards of plagiarism in different communities appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Tues:** Pro Publica’s new Surgeon Scorecards

**Wed:** How Hamiltonian Monte Carlo works

**Thurs:** When does Bayes do the job?

**Fri:** Here’s a theoretical research project for you

**Sat:** Classifying causes of death using “verbal autopsies”

**Sun:** All hail Lord Spiegelhalter!

The post Spam! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>On Jun 11, 2015, at 11:29 AM, Joanna Caldwell wrote:

Webinar: Tips & Tricks to Improve Your Logistic Regression

. . .

Registration Link: . . . Abstract: Logistic regression is a commonly used tool to analyze binary classification problems. However, logistic regression still faces the limitations of detecting nonlinearities and interactions in data. In this webinar, you will learn more advanced and intuitive machine learning techniques that improve on standard logistic regression in accuracy and other aspects. As an APPLIED example, we will demonstrate using a banking dataset where we will predict future financial stress of a loan applicant in order to determine whether they should be granted a loan. Although the focus is related to finance and loans, the concepts are relevant for anyone who actively uses logistic regression and wishes to improve accuracy and predictor understanding. . . .

I wrote onto the list:

Now we’re getting spam on the Stan list?? This is really weird! Unless they’re actually using Stan, but I doubt it. “More advanced and intuitive techniques,” indeed!

And Bob replied:

The spam came from an account with their receptionist’s name on it. We have been getting spam all along, which is why we have a registration step. Daniel banned this user, but dedicated spammers are hard to keep out, especially if they go the manual labor route.

Dedicated spammers, huh? I guess it’s a sort of backhanded compliment to Stan, that we’re considered important enough for some sleazoids to sic their secretary on us. Not enough phone calls to answer today, so spend the afternoon posting unwanted ads on listservs. Maybe they’re using some sort of advanced Bayesian filter to decide whose time to waste here.

The post Spam! appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Statistical/methodological prep for a career in neuroscience research? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’m currently a software developer, but I’m trying to transition to the neuroscience research world. Do you have any general advice or recommended resources to prepare me to perform sound and useful experimental design and analyses? I have a very basic stats background from undergrad plus eclectic bits and pieces I’ve picked up since, and have a fairly strong mathematical background.

I’m not sure! I think my book with Jennifer is a good place to start on statistical analysis, and for design I recommend the classic book by Box, Hunter, and Hunter.

I also recommend this paper by Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu, a group of software engineers (I presume) at Microsoft. These authors don’t seem so connected to the statistics literature—in a follow-up paper they rediscover a whole bunch of already well-known stuff and present it as new—but this particular article is crisp and applied, and I like it.

Maybe readers have other suggestions?

The post Statistical/methodological prep for a career in neuroscience research? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post If you leave your datasets sitting out on the counter, they get moldy appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I had a look at the dataset on speed dating you put online, and I found some big inconsistencies. Since a lot of people are using it, I hope this can help to fix them (or hopefully I made a mistake in interpreting the dataset).

Here are the problems I found.

1. Field dec is not consistent at all (boolean for a big chunk of the dataset, in the range 1-10 later). Should this be the field of the decision, and dec_o be the decision of the partner? Should dec and match be the same thing? I tried to use match instead of dec but then I get the following problem.

2. I tried to see if matches are consistent (if my partner decided yes it should mean that in his record I see a match): if I look at the record with iid x and pid y, dec_o=1 should mean that in the record with iid y and pid x I should see a match (in match or dec). This is not in general true. So dec_o is not consistent with the matches.

3. Same thing for like and attr_o (or attr and attr_o)
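The cross-check in point 2 is easy to script. Here’s a sketch using the field names from the email (iid, pid, dec, dec_o); the toy records here are made up for illustration, and the real check would load the posted file instead:

```python
# Made-up records: one row per (iid, pid) interaction, as in the email.
records = [
    {"iid": 1, "pid": 2, "dec": 1, "dec_o": 1},
    {"iid": 1, "pid": 3, "dec": 0, "dec_o": 1},
    {"iid": 2, "pid": 1, "dec": 1, "dec_o": 0},   # dec_o disagrees with (1, 2)'s dec
    {"iid": 2, "pid": 3, "dec": 1, "dec_o": 0},
]

# Index each record by its (iid, pid) pair so we can look up the mirror record.
by_pair = {(r["iid"], r["pid"]): r for r in records}

# Consistency requires: the dec_o stored in record (x, y) equals the dec
# stored in the mirror record (y, x), whenever that mirror record exists.
inconsistent = [
    (r["iid"], r["pid"])
    for r in records
    if (r["pid"], r["iid"]) in by_pair
    and r["dec_o"] != by_pair[(r["pid"], r["iid"])]["dec"]
]
print(inconsistent)  # → [(2, 1)]
```

Running this on the toy data flags record (2, 1), whose dec_o does not match what person 1 actually decided; the correspondent is reporting that the real dataset fails exactly this kind of check.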

I sent this to Ray Fisman, the source of the data, who replied:

Saurabh Bhargava used the underlying files and has posted data in a replication file for a study in the Review of Economics and Statistics.

I’m glad somebody put those data in the freezer.

The post If you leave your datasets sitting out on the counter, they get moldy appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>