The post Why I think the top batting average will be higher than .311: Over-pooling of point predictions in Bayesian inference appeared first on Statistical Modeling, Causal Inference, and Social Science.

At this point in the season, folks are interested in extreme stats and want to predict final season measures. On the morning of Saturday May 20, here are the leading batting averages:

Justin Turner .379

Ryan Zimmerman .375

Buster Posey .369

At the end of this season, who among these three will have the highest average? . . . these batting averages are based on a small number of at-bats (between 120 and 144), and one expects all of these extreme averages to move toward the mean as the season progresses. One might think that Turner will win the batting crown, but certainly not with a batting average of .379. . . .

I’m scheduling this post to appear in October, at which point we’ll know the answer!

Albert makes his season predictions not just using current batting average, but also using strikeout rates, home run rates, and batting average on balls in play. I think he’s only using data from the current year, which doesn’t seem quite right, but I guess it’s fine given that this is a demonstration of statistical methods and is not intended to represent a full-information prediction. In any case, here’s what he concludes in May:

I [Albert] predict Posey to finish with a .311 average followed by Zimmerman at .305 and Turner at .297.

These are reasonable predictions. But . . . I’m guessing that the league-leading batting average will be higher than .311!

Why do I say this? Check out recent history. The top batting averages in the past ten seasons (listed most recent year first) have been .348, .333, .319, .331, .336, .337, .336, .342, .364, .340. Actually, it looks like the top batting average in MLB has *never* been as low as .311. So I doubt that will happen this year. In 2016 there appear to have been 12 players who batted over .311 during the season.

What happened? Nothing wrong with Albert’s predictions. He’s just giving the posterior mean for each player, and posterior means cannot be combined directly to give an inference for the maximum over all players. Assuming he’s fitting his models in Stan—there’s no good reason to do otherwise—he’s also getting posterior simulation draws. He could then simulate, say, 1000 possibilities for the end-of-season records—and there he’d find that in just about any particular simulation the top batting average will exceed .311. Lots of players have a chance to make it, not just the three listed above.
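
To make the point concrete, here’s a minimal sketch of that simulation in Python. The posterior means and standard deviations below are made-up stand-ins, not Albert’s actual model output: 50 hypothetical qualifying hitters with end-of-season posterior means spread between .260 and .315, each with posterior sd .012.

```python
import random

random.seed(1)

# Hypothetical posterior summaries for 50 qualifying hitters.
# These numbers are illustrative stand-ins, not Albert's model output.
n_players = 50
post_means = [0.260 + 0.055 * i / (n_players - 1) for i in range(n_players)]
post_sd = 0.012

# Simulate 1000 possible end-of-season leaderboards and record the max each time.
n_sims = 1000
league_max = []
for _ in range(n_sims):
    draws = [random.gauss(m, post_sd) for m in post_means]
    league_max.append(max(draws))

frac_above_311 = sum(m > 0.311 for m in league_max) / n_sims
mean_of_max = sum(league_max) / n_sims

print(f"P(top average > .311) ~ {frac_above_311:.2f}")
print(f"E[max] ~ {mean_of_max:.3f} vs max of means = {max(post_means):.3f}")
```

Even though no single player’s posterior mean reaches .311 here, the simulated league-leading average almost always does: the mean of the maximum of a set of random variables exceeds the maximum of their individual means.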

This is not to diss Albert’s post; I’m just extending it by demonstrating the perils of estimating extreme values from point predictions. It’s an issue that Phil and I discussed in our article, All maps of parameter estimates are misleading.

**P.S.** This post is appearing, as scheduled, on 19 Oct, during the playoffs. The season’s over so we can check what happened:

Buster Posey hit .320

Ryan Zimmerman hit .303

Justin Turner hit .322.

The league-leading batting averages were Charlie Blackmon at .331 and Jose Altuve at .346. So Albert’s predictions were not far off (these three batters did a bit better than the point predictions, but I assume they’re well within the margin of error) and, indeed, it was two other hitters who won the batting titles.

From a math point of view, it’s an interesting example of how the mean of the maximum of a set of random variables is higher than the max of the individual means.

The post Beyond “power pose”: Using replication failures and a better understanding of data collection and analysis to do better science appeared first on Statistical Modeling, Causal Inference, and Social Science.

So. A bunch of people pointed me to a New York Times article by Susan Dominus about Amy Cuddy, the psychology researcher and Ted-talk star famous for the following claim (made in a paper written with Dana Carney and Andy Yap and published in 2010):

That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.

Awkwardly enough, no support for that particular high-stakes claim was ever presented in the journal article where it appeared. And, even more awkwardly, key claims for which the paper *did* offer some empirical evidence failed to show up in a series of external replication studies, first by Ranehill et al. in 2015 and then more recently by various other research teams (see, for example, here). Following up on the Ranehill et al. paper was an analysis by Joe Simmons and Uri Simonsohn explaining how Carney, Cuddy, and Yap could’ve gotten it wrong in the first place. Also awkward was a full retraction by first author Dana Carney, who detailed many ways in which the data were handled in order to pull out apparently statistically significant findings.

Anyway, that’s all background. I think Dominus’s article is fair, given the inevitable space limitations. I wouldn’t’ve chosen to write an article about Amy Cuddy—I think Eva Ranehill or Uri Simonsohn would be much more interesting subjects. But, conditional on the article being written largely from Cuddy’s perspective, I think it portrays the rest of us in a reasonable way. As I said to Dominus when she interviewed me, I don’t have any personal animosity toward Cuddy. I just think it’s too bad that the Carney/Cuddy/Yap paper got all that publicity and that Cuddy got herself tangled up in defending it. It’s admirable that Carney just walked away from it all. And it was probably a good call by Yap to pretty much avoid any further involvement in the matter.

The only thing that really bugged me about the NYT article is when Cuddy is quoted as saying, “Why not help social psychologists instead of attacking them on your blog?” and there is no quoted response from me. I remember this came up when Dominus interviewed me for the story, and I responded right away that I *have* helped social psychologists! A lot. I’ve given many talks during the past few years to psychology departments and at professional meetings, and I’ve published several papers in psychology and related fields on how to do better applied research, for example here, here, here, here, here, here, here, and here. I even wrote an article, with Hilda Geurts, for The Clinical Neuropsychologist! So, yeah, I do spend some time helping social psychologists.

Dominus also writes, “Gelman considers himself someone who is doing others the favor of pointing out their errors, a service for which he would be grateful, he says.” This too is accurate, and let me also emphasize that this is a service for which I not only *would* be grateful. I actually *am* grateful when people point out my errors. It’s happened several times; see for example here. When we do science, we can make mistakes. That’s fine. What’s important is to learn from our mistakes.

In summary, I think Dominus’s article was fair, but I do wish she hadn’t let that particular false implication by Cuddy, the claim that I didn’t help social psychologists, go unchallenged. Then again, I also don’t like it that Cuddy baselessly attacked the work of Simmons and Simonsohn and to my knowledge never has apologized for that. (I’m thinking of Cuddy’s statement, quoted here, that Simmons and Simonsohn “are flat-out wrong. Their analyses are riddled with mistakes . . .” I never saw Cuddy present any evidence for these claims.)

**Good people can do bad science. Indeed, if you have bad data you’ll do bad science (or, at best, report null findings), no matter how good a person you are.**

Let me continue by saying something I’ve said before, which is that being a scientist, and being a good person, does not necessarily mean that you’re doing good science. I don’t know Cuddy personally, but given everything I’ve read, I imagine that she’s a kind, thoughtful, and charming person. I’ve heard that Daryl Bem is a nice guy too. And I expect Satoshi Kanazawa has many fine features too. In any case, it’s not my job to judge these people nor is it their job to judge me. A few hundred years ago, I expect there were some wonderful, thoughtful, intelligent, good people doing astrology. That doesn’t mean that they were doing good science!

If your measurements are too noisy (again, see here for details), it doesn’t matter how good a person you are, you won’t be able to use your data to make replicable predictions of the world or evaluate your theories: You won’t be able to do empirical science.

Conversely, if Eva Ranehill, or Uri Simonsohn, or I, or anyone else, performs a replication (and don’t forget the time-reversal heuristic) or analyzes your experimental protocol or looks carefully at your data and finds that your data are too noisy for you to learn anything useful, then they *may* be saying you’re doing bad science, but they’re not saying you’re a bad person.

As the subtitle of Dominus’s excellent article says, “suddenly, the rules changed.” It happened over several years, but it really did feel like something sudden. And, yes, Carney, Cuddy, and Yap ideally should’ve known back in 2010 that they were chasing patterns in noise. But they, like many others, didn’t. They, and we, were fortunate to have Ranehill et al. reveal some problems in their study with the failed replication. And they, and we, were fortunate to have Simmons, Simonsohn, and others explain in more detail how they could’ve got things wrong. Through this and other examples of failed studies (most notably Bem’s ESP paper, but also the hopelessly flawed work of Kanazawa and many others), and through lots of work by psychologists such as Nosek and others, we are developing a better understanding of how to do research on unstable, context-dependent human phenomena. There’s no reason to think of the authors of those fatally flawed papers as being bad people. We learn, individually and collectively, from our mistakes. We’re all part of the process, and Dominus is doing the readers of the New York Times a favor by revealing one part of that process from the inside. Instead of the usual journalistic trope of scientist as hero, it’s science as community, including confusion, miscommunication, error, and an understanding that a certain research method that used to be popular and associated with worldly success—the method of trying out some vaguely motivated idea, gathering a bunch of noisy data, and looking for patterns—doesn’t work so well at producing sensible or replicable results. That’s a good thing to know, and it could well be interesting for outsiders to see the missteps it took for us all to get there.

**Selection bias in what gets reported**

When people make statistical errors, I don’t say “gotcha,” I feel sad. Even when I joke about it, I’m not happy to see the mistakes; indeed, I often blame the statistics profession—including me, as a textbook writer!—for portraying statistical methods as tools for routine discovery: Do the randomization, gather the data, pass statistical significance and collect $200.

Regarding what gets mentioned in the newspapers and in the blogs, there’s some selection bias. A lot of selection bias, actually. Suppose, for example, that Daryl Bem had not made the serious, fatal mistakes he’d made in his ESP research. Suppose he’d fit a hierarchical model or done a preregistered replication or used some other procedure to avoid jumping at patterns in noise. That would’ve been great. And then he most likely would’ve found nothing distinguishable from a null effect, no publication in JPSP (no, I don’t think they’d publish the results of a large multi-year study finding no effect for a phenomenon that most psychologists don’t believe in the first place), no article on Bem in the NYT . . . indeed, I never would’ve heard of Bem!

Think of the thousands of careful scientists who, for whatever combination of curiosity or personal interests or heterodoxy, decide to study offbeat topics such as ESP or the effect of posture on life success—but who conduct their studies carefully, gathering high-quality data and using designs and analyses that minimize the chances of being fooled by noise. These researchers will, by and large, quietly find null results, which for very reasonable dog-bite-man reasons will typically be unpublishable, or publishable only in minor journals, and will be unlikely to inspire much news coverage. So we won’t hear about them.

Conversely, I’ll accept the statement that Cuddy in her Ted talks could be inspiring millions of people in a good way, even if power pose does nothing, or even does more harm than good. (I assume it depends on context, that power pose will do more good than harm in some settings, and more harm than good in others). The challenge for Cuddy—and in all seriousness I hope she follows up on this—is to be this inspirational figure, to communicate to those millions, in a way that respects the science. I hope Cuddy can stop insulting Simmons and Simonsohn, forget about the claims of the absolute effects of power pose, and move forward, sending the message that people can help themselves by taking charge of their environment, by embodying who they want to be. The funny thing is, I think that pretty much *is* the message of that famous Ted talk, and that the message would be stronger *without* silly, unsupported claims such as “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.”

**A way forward**

People criticize Cuddy for hyping her science and making it into a Ted talk. But, paradoxically, I’m now thinking we should be saying the opposite. The Ted talk has a lot going for it: it’s much stronger than the journal articles that purportedly justify and back it up. I have the impression that Cuddy and others think the science of power pose needs to be defended in part because of its role in this larger edifice, but I recommend that Cuddy and her colleagues go the other way: follow the lead of Dana Carney, Eva Ranehill, et al., and abandon the scientific claims, which ultimately were based on an overinterpretation of noise (again, recall the time-reversal heuristic)—and then let the inspirational Ted talk advice fly free of that scientific dead end. There are lots of interesting ways to study how people can help themselves through tools such as posture and visualization, but I think these have to be studied for real, not through crude button-pushing ideas such as power pose but through careful studies on individuals, recognizing that different postures, breathing exercises, yoga moves, etc., will work for different people. Lots of interesting ideas here, and it does these ideas no favor to tie them to some silly paper published in 2010 that happened to get a bunch of publicity. The idea is to take the positive aspects of the work of Cuddy and others—the inspirational message that rings true for millions of people—and to investigate it using a more modern, data-rich, within-person model of scientific investigation. That’s the sort of thing that *should* one day be appearing in the pages of Psychological Science.

I think Cuddy has the opportunity to take her fame and her energy and her charm and her good will and her communication skills and her desire to change the world and take her field in a useful direction. Or not. It’s her call, and she has no obligation to do what *I* think would be a good idea. I just wanted to emphasize that there’s no reason her career, or even her famous Ted talk, has to rely on a particular intriguing idea (on there being a large and predictable effect of a certain pose) that happened not to work out. And I thank Dominus for getting us all to think about these issues.

**P.S.** There are a bunch of comments, including from some people who strongly disagree with me—which I appreciate! That is, I appreciate that people who disagree are taking the trouble to share their perspective and engage in discussion here.

There are a lot of details above so maybe it would be a good idea to emphasize some key points:

1. I thought Dominus’s article was excellent and fair to all sides. There were a couple points that I wish had not been left out, but of course it’s the reporter’s job to write the story how she thinks is best. By raising these points, I’m not saying Dominus should’ve written her article in a different way, I’m just adding my perspective. In particular, I didn’t want people to have the impression that I don’t help social psychologists, given that I’ve put in a lot of work during the past few years doing just that! In any case, if Simmons or Cuddy or any of the other people mentioned in the article want to share their perspective here too, I’d be happy to post their remarks as separate entries on this blog.

2. I have no desire to act as gatekeeper for scientific research. I think it’s fine that the original Carney, Cuddy, and Yap paper was published, and that Ranehill et al. and other replication studies were published, and that the analysis by Simmons and Simonsohn was published, and so on.

3. I’m not trying to tell Cuddy (or, for that matter, Simmons, Simonsohn, Bem, or anyone else) what to do. I’ll offer suggestions based on my current statistical understanding, and I’ll even publish research papers offering suggestions, as I think that’s part of my job. These suggestions are just that, offered with full awareness that all these researchers are independent agents who are free to follow all, some, or none of my advice.

4. I do disagree with some things that Cuddy and her collaborators have written, for example the quote from the journal article at the top of this post and the quote from Cuddy characterizing the work of Simmons and Simonsohn. When I disagree, I explain why. At the same time, I recognize that what I’m doing is offering my perspective and, again, expressing my disagreement does not represent any attempt to stop others from expressing their views and their reasons for these views.

The post No tradeoff between regularization and discovery appeared first on Statistical Modeling, Causal Inference, and Social Science.

Here’s a simple example:

Suppose the prior distribution, as estimated by the hierarchical model, is that the population of effects has mean 0 and standard deviation of 0.1. And now suppose that the data-based estimate for one of the treatment effects is 0.5 with a standard error of 0.2 (thus, statistically significant at conventional levels). Also assume normal distributions all around. Then the posterior distribution for this particular treatment effect is normal with mean (0/0.1^2 + 0.5/0.2^2)/(1/0.1^2 + 1/0.2^2) = 0.10, with standard deviation 1/sqrt(1/0.1^2 + 1/0.2^2) = 0.09. Based on this inference, there’s an 87% posterior probability that the treatment effect is positive.

We could expand this hypothetical example by considering possible alternative prior distributions for the unknown treatment effect. Uniform(-inf,inf) is just too weak. Perhaps normal(0,0.1) is also weakly informative, and maybe the actual population distribution of the true effects is something like normal(0,0.05). In that case, using the normal(0,0.1) prior as above will under-pool, that is, the inference will be anti-conservative and be too susceptible to noise.

With a normal(0,0.05) prior and normal(0.5,0.2) data, you’ll get a posterior that’s normal with mean (0/0.05^2 + 0.5/0.2^2)/(1/0.05^2 + 1/0.2^2) = 0.03, with standard deviation 1/sqrt(1/0.05^2 + 1/0.2^2) = 0.05. Thus, the treatment effect is likely to be small, and there’s a 72% chance that it is positive.
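
These precision-weighted calculations are easy to check in a few lines of Python, using only the normal-normal formulas above (nothing here is specific to any particular model):

```python
import math

def normal_posterior(prior_mean, prior_sd, data_mean, data_se):
    """Conjugate normal-normal update via precision weighting."""
    prior_prec = 1 / prior_sd**2
    data_prec = 1 / data_se**2
    post_mean = (prior_mean * prior_prec + data_mean * data_prec) / (prior_prec + data_prec)
    post_sd = 1 / math.sqrt(prior_prec + data_prec)
    # Posterior probability that the treatment effect is positive
    p_positive = 0.5 * (1 + math.erf(post_mean / (post_sd * math.sqrt(2))))
    return post_mean, post_sd, p_positive

# Prior normal(0, 0.1), estimate 0.5 with standard error 0.2:
print(normal_posterior(0, 0.1, 0.5, 0.2))   # mean 0.10, sd 0.09, P(positive) ~0.87

# Tighter prior normal(0, 0.05), same data:
print(normal_posterior(0, 0.05, 0.5, 0.2))  # mean 0.03, sd 0.05, P(positive) ~0.73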

Also, all this assumes zero bias in measurement and estimation, which is just about never correct but can be an ok approximation when standard errors are large. Once the standard error becomes small, then we should think about including an error term to allow for bias, to avoid ending up with too-strong claims.

**Regularization vs. discovery?**

The above procedure is an example of *regularization* or smoothing, and from the Bayesian perspective it’s the right thing to do, combining prior information and data to get probabilistic inference.

A concern is sometimes raised, however, that regularization gets in the way of *discovery*. By partially pooling estimates toward zero, are we reducing our ability to discover new and surprising effects?

My answer is no, there’s *not* a tradeoff between regularization and discovery.

How is that? Consider the example above, with the 0 ± 0.05 prior and 0.5 ± 0.2 data. Our prior pulls the estimate to 0.03 ± 0.05, thus moving the estimate from clearly statistically significant (2.5 standard errors away from 0) to not even close to statistical significance (less than 1 standard error from zero).

So we’ve lost the opportunity for discovery, right?

No.

There’s nothing stopping you from gathering more data to pursue this possible effect you’ve discovered. Or, if you can’t gather such data, you just have to accept this uncertainty.

If you want to be more open to discovery, you can pursue more leads and gather more and higher quality data. That’s how discovery happens.

B-b-b-but, you might say, what about discovery by luck? By regularizing, are we losing the ability to get lucky? Even if our hypotheses are mere lottery tickets, why throw away tickets that might contain a winner?

Here, my answer is: If you want to label something that may well be wrong as a “discovery,” that’s fine by me! No need for a discovery to represent certainty or even near-certainty. In the above example, we have a 73% posterior probability of seeing a positive effect in an exact replication study. Call that a discovery if you’d like. Integrate this discovery into your theoretical and practical understanding of the world and use it to decide where to go next.

**P.S.** The above could be performed using longer-tailed distributions if that’s more appropriate for the problem under consideration. The numbers will change but the general principles are the same.

The post From perpetual motion machines to embodied cognition: The boundaries of pseudoscience are being pushed back into the trivial. appeared first on Statistical Modeling, Causal Inference, and Social Science.

This exchange came from a comment thread last year.

Diana Senechal points to this bizarre thing:

Brian Little says in Me, Myself, and Us (regarding the “lemon introvert test”):

One of the more interesting ways of informally assessing extraversion at the biogenic level is to do the lemon-drop test. [Description of experiment omitted from present quote—DS.] For some people the swab will remain horizontal. For others it will dip on the lemon juice end. Can you guess which? For the extraverts, the swab stays relatively horizontal, but for introverts it dips. . . . I have done this exercise on myself a number of times, and each time my swab dips deeply. I am, at least by this measure, a biogenic introvert.

I mean, really . . .

This claim has (at least) two serious problems: first, the weirdly overconfident button-pushing model of science, which reminds me of phrenological ideas from the schoolyard that how you lift your hands or cross your legs reveals some deep truth about you; and, second, the complete lack of understanding of variation, the idea that this thing would work every time. (Recall this similar attitude of researchers who felt the need for their theory to explain *every case*.)

Here, though, I want to point out a more positive take on this story. From my response to Senechal on that thread:

Maybe we’re making progress. 50 years ago we’d be hearing rumors of perpetual motion machines, car engines that ran on water, spoon bending, and even bigfoot. Now the purveyors of pseudoscience have moved to embodied cognition, lemon juice extraversion, power pose, and himmicanes. The boundaries of pseudoscience are being pushed back into the trivial. From perpetual motion to the lemon juice test: in the grand scheme of things this is a retreat.

Just to elaborate on this point: Bigfoot of course is trivial, but the point about perpetual motion machines etc. is that, if they were real, they’d imply huge changes in our understanding of physics. Similarly with the old story about the car engine that ran on water that some guy built in his backyard but was then suppressed by the powers-that-be in Detroit: if true, this would have implied huge changes in our understanding of physics and of economics. By comparison, the new cargo cult science of embodied cognition, shark attacks, smiley faces, beauty and sex ratio, etc., is so moderate and trivial: Any of these claims *could* be true; they’re just not well supported by the evidence, and as a whole they don’t fit together. To put it another way, the himmicanes story is not as obviously silly as, say, those photographs of fairies that fooled Arthur Conan Doyle, or the bending spoons and Noah’s ark stories that fooled people in the 1970s. So I think this is a positive development, that even though pseudoscience still is prominent in NPR, Ted, PPNAS, etc., it’s a fuzzier sort of pseudoscience, not as demonstrably wrong as perpetual motion machines, Nessie, and all that old-school hocus-pocus.

The post Analyzing New Zealand fatal traffic crashes in Stan with added open-access science appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’ll get to the meat of this post in a second, but I just wanted to highlight how the study I’m about to talk about was done in the open and how that helped everyone. Tim Makarios read the study and responded in the blog comments,

Hold on. As I first skimmed this post, I happened to have, on the coffee table next to me, half a sheet of The Evening Post dated July 18, 1984. At the bottom of page 2, there’s a note saying “The Ministry of Transport said today that 391 road deaths had been reported so far this year. This compared with 315 at the same time last year.”

How is there such a big discrepancy with your chart of Sam Warburton’s data?

to which Peter Ellis, the author of the study responded in typical open-source fashion, encouraging the original poster to dig deeper and report back,

Good question. I’m not a traffic crash expert, the spirit of open source is – you let me know when you think you’ve worked it out. Obviously these are measuring two different things, I’m interested to know what! Thanks.

The question apparently prompted Peter to look himself; he followed up with

Looks like I misread the data for that particular bit of the analysis and my graphic was only showing *driver* deaths. I’ve updated it and the source code so it shows total casualties, which are consistent with that number in your old paper. Thanks for alerting me to that.

I love to see open science in action! Anyway, onto the real topic here.

**New Zealand fatal traffic crashes**

The above was an exchange about Peter Ellis’s analysis,

- New Zealand fatal traffic crashes (blog post with source code)

In his at-a-glance summary, Peter says,

I explore half a million rows of disaggregated crash data for New Zealand, and along the way illustrate geo-spatial projections, maps, forecasting with ensembles of methods, a state space model for change over time, and a generalized linear model for understanding interactions in a three-way cross tab.

I’d highly recommend it if you are interested in spatio-temporal modeling in particular or even just in plotting. It has great plots, very nice Stan code, and lots of great exploratory and Bayesian data analysis.

**Cold wind from the north**

We’ve had a rash of work lately on spatial models; must be the wind blowing down from the north (Finland and Toronto, specifically).

**Contribute to Stan case studies?**

Peter, if you’re listening and would be willing to release this open source, it’d make a great Stan case study. If you already submitted it for StanCon 2018, thanks! We’ll all be getting the New Year’s gift of a couple dozen new Stan case studies!

The post Beyond forking paths: using multilevel modeling to figure out what can be learned from this survey experiment appeared first on Statistical Modeling, Causal Inference, and Social Science.

From a statistical perspective, the article by Tella and Rotemberg is a disaster of forking paths, as can be seen even from the abstract:

We present a simple model of populism as the rejection of “disloyal” leaders. We show that adding the assumption that people are worse off when they experience low income as a result of leader betrayal (than when it is the result of bad luck) to a simple voter choice model yields a preference for incompetent leaders. These deliver worse material outcomes in general, but they reduce the feelings of betrayal during bad times. Some evidence consistent with our model is gathered from the Trump-Clinton 2016 election: on average, subjects primed with the importance of competence in policymaking decrease their support for Trump, the candidate who scores lower on competence in our survey. But two groups respond to the treatment with a large (between 5 and 7 percentage points) increase in their support for Donald Trump: those living in rural areas and those that are low educated, white and living in urban and suburban areas.

There are just so many reasonable interactions that one could look at here, and no reason at all to expect a “needle in a haystack” situation in which there are one or two very large effects and a bunch of zeroes. So it doesn’t make sense to pull out various differences that happen to be large in these particular data and then spin out stories. The trouble is that this approach has poor statistical properties under repeated sampling: with another dataset sampled from the same population, you could find other patterns and tell other stories.

It’s not that Tella and Rotemberg are necessarily wrong in their conclusions (or Cowen wrong in taking these conclusions seriously), but I don’t think these data are helping here: they all might be better off just speculating based on other things they’ve heard.

What to do, then? Preregistered replication (as in 50 shades of gray), sure. But, before then, I’d suggest multilevel modeling and partial pooling to get a better handle on what can be learned from their existing data.

This could be an interesting project: to get the raw data from the above study and reanalyze using multilevel modeling.
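
For concreteness, here’s a minimal Python sketch of the kind of partial pooling I have in mind. The subgroup estimates and standard errors below are invented placeholders (the real exercise would use the raw data from the study), and the between-group variation is estimated crudely by the method of moments rather than with a full multilevel model in Stan:

```python
import math

# Hypothetical subgroup treatment effects (percentage points) and standard errors.
# These are invented placeholders, not the paper's actual estimates.
estimates = [6.0, -2.0, 1.5, 5.5, -1.0, 0.5, 2.0, -3.0]
ses = [2.5, 2.0, 1.8, 2.6, 2.2, 1.9, 2.1, 2.4]

# Precision-weighted grand mean.
weights = [1 / s**2 for s in ses]
mu = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)

# Crude method-of-moments estimate of the between-group sd tau:
# observed dispersion in excess of what sampling error alone would imply.
n = len(estimates)
excess = sum((y - mu) ** 2 - s**2 for y, s in zip(estimates, ses)) / (n - 1)
tau = math.sqrt(max(excess, 0.01))

# Partial pooling: each subgroup estimate is shrunk toward mu,
# with noisier estimates shrunk more.
pooled = [
    (y / s**2 + mu / tau**2) / (1 / s**2 + 1 / tau**2)
    for y, s in zip(estimates, ses)
]

for y, p in zip(estimates, pooled):
    print(f"raw {y:+5.1f} -> pooled {p:+5.2f}")
```

The point of the exercise: the large subgroup differences that drive the headline claims get pulled toward the overall average, and only effects that are well supported by the data survive the pooling.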

The post Baseball, apple pie, and Stan appeared first on Statistical Modeling, Causal Inference, and Social Science.

St. Louis Cardinals Baseball Development Analyst

Tampa Bay Rays Baseball Research and Development Analyst

The post Freelance orphans: “33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??” appeared first on Statistical Modeling, Causal Inference, and Social Science.

As you may know, the relatively recent “orphan drug” laws allow (basically) companies that can prove an off-patent drug treats an otherwise untreatable illness to obtain intellectual property protection for otherwise generic or dead drugs. This has led to a new business of trying large numbers of combinations of otherwise-unused drugs against a large number of untreatable illnesses, with a large number of success criteria.

Charcot-Marie-Tooth is a moderately rare genetic degenerative peripheral nerve disease with no known treatment. CMT causes the Schwann cells, which surround the peripheral nerves, to weaken and eventually die, leading to demyelination of the nerves, a loss of nerve conduction velocity, and an eventual loss of nerve efficacy.

PXT3003 is a drug currently in Phase 2 clinical testing to treat CMT. PXT3003 consists of a mixture of low doses of baclofen (an off-patent muscle relaxant), naltrexone (an off-patent medication used to treat alcoholism and opiate dependency), and sorbitol (a sugar substitute).

Pre-phase 2 results from PXT3003 are shown here.

I call your attention to Figure 2, and note that in Phase 2, efficacy will be measured exclusively by the ONLS score.

My reply: 33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??
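For the record, the arithmetic behind that reply is a two-line binomial calculation (assuming, unrealistically, that the 33 comparisons are independent):

```python
from math import comb

# 33 independent comparisons, each "significant" at the 5% level with
# probability 0.05 under the null: expected count 33 * 0.05 = 1.65.
n, p = 33, 0.05
expected = n * p
# Probability of 4 or more significant results by chance alone:
tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(4, n + 1))
print(round(expected, 2))  # 1.65
print(round(tail, 3))      # about 0.08
```

So even under the pure null you’d see 4 or more “significant” results about 8% of the time, and with correlated comparisons and flexibility in defining outcomes, the real chance is higher still.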

I sent this exchange to a colleague, who wrote:

In a past life I did mutual fund research. One of the fun things that fund managers do is “incubate” dozens of funds with their own money. Some do very well, others do miserably. They liquidate the poorly performing funds and “open” the high-performing funds to public investment (of course, reporting the fantastic historical earnings to the fund databases). Then sit back and watch the inflows (and management fees) pour in.
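That incubation trick is easy to simulate (a sketch with invented numbers):

```python
import random
import statistics

random.seed(7)

# Hypothetical sketch (all numbers invented): 50 incubated funds with
# zero true skill -- annual returns are pure noise around a 6% market
# return.  After 3 years, only the top 5 are "opened" to the public.
n_funds, years = 50, 3
market, vol = 0.06, 0.15
histories = [[random.gauss(market, vol) for _ in range(years)]
             for _ in range(n_funds)]
fund_means = sorted(statistics.mean(h) for h in histories)

overall = statistics.mean(fund_means)        # all funds: about the market
marketed = statistics.mean(fund_means[-5:])  # the advertised survivors
print(round(overall, 3), round(marketed, 3))
```

The survivors’ advertised track record sits well above the market return even though no fund has any skill at all; selection does all the work.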


The post Stan case studies appeared first on Statistical Modeling, Causal Inference, and Social Science.

2017:

Modeling Loss Curves in Insurance with RStan, by Mick Cooney

Splines in Stan, by Milad Kharratzadeh

Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data, by Mitzi Morris

The QR Decomposition for Regression Models, by Michael Betancourt

Robust RStan Workflow, by Michael Betancourt

Robust PyStan Workflow, by Michael Betancourt

Typical Sets and the Curse of Dimensionality, by Bob Carpenter

Diagnosing Biased Inference with Divergences, by Michael Betancourt

Identifying Bayesian Mixture Models, by Michael Betancourt

How the Shape of a Weakly Informative Prior Affects Inferences, by Michael Betancourt

2016:

Exact Sparse CAR Models in Stan, by Max Joseph

A Primer on Bayesian Multilevel Modeling using PyStan, by Chris Fonnesbeck

The Impact of Reparameterization on Point Estimates, by Bob Carpenter

Hierarchical Two-Parameter Logistic Item Response Model, by Daniel Furr

Rating Scale and Generalized Rating Scale Models with Latent Regression, by Daniel Furr

Partial Credit and Generalized Partial Credit Models with Latent Regression, by Daniel Furr

Rasch and Two-Parameter Logistic Item Response Models with Latent Regression, by Daniel Furr

Two-Parameter Logistic Item Response Model, by Daniel Furr, Seung Yeon Lee, Joon-Ho Lee, and Sophia Rabe-Hesketh

Cognitive Diagnosis Model: DINA model with independent attributes, by Seung Yeon Lee

Pooling with Hierarchical Models for Repeated Binary Trials, by Bob Carpenter

2015:

Multiple Species-Site Occupancy Model, by Bob Carpenter

2014:

Soil Carbon Modeling with RStan, by Bob Carpenter


The post “Bayesian evidence synthesis” appeared first on Statistical Modeling, Causal Inference, and Social Science.

My colleagues and I have a paper recently accepted in the journal Psychological Science in which we “bang” on Bayes factors. We explicitly show how the Bayes factor varies according to tau (I thought you might find this interesting for yourself and your blog’s readers). There is also a very nice figure.

Here is a brief excerpt:

Whereas BES [a new method introduced by EJ Wagenmakers] assumes zero between-study variability, a multilevel model does not make this assumption and allows for examining the influence of heterogeneity on Bayes factors. Indeed, allowing for some variability substantially reduced the evidence in favor of an effect….In conclusion, we strongly caution against BES and suggest that researchers wanting to use Bayesian methods adopt a multilevel approach.
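The tau-dependence they describe is easy to illustrate. Here’s a cartoon version in Python (invented study estimates, a N(0, 1) prior on the overall effect, and tau treated as fixed rather than given a prior, so this is not their actual model):

```python
from math import exp, pi, sqrt

def normal_pdf(x, var):
    return exp(-0.5 * x * x / var) / sqrt(2 * pi * var)

def bf10(effects, ses, tau, prior_var=1.0):
    """Bayes factor for mu != 0 vs mu = 0 in a normal meta-analysis
    with known between-study sd tau (a cartoon: a real multilevel
    analysis would put a prior on tau rather than fixing it)."""
    w = [1.0 / (s * s + tau * tau) for s in ses]
    w_tot = sum(w)
    y_bar = sum(wi * yi for wi, yi in zip(w, effects)) / w_tot
    # The weighted mean is sufficient for mu, so the BF is the ratio of
    # its marginal densities under mu ~ N(0, prior_var) vs mu = 0.
    return (normal_pdf(y_bar, 1.0 / w_tot + prior_var)
            / normal_pdf(y_bar, 1.0 / w_tot))

# Made-up numbers: four small studies with modest positive effects.
effects = [0.25, 0.40, 0.15, 0.35]
ses = [0.12, 0.15, 0.10, 0.14]
for tau in [0.0, 0.1, 0.2, 0.4]:
    print(tau, round(bf10(effects, ses, tau), 1))
```

With these numbers the Bayes factor collapses from strong evidence at tau = 0 toward even odds once tau is comparable to the effect sizes: allowing for heterogeneity eats the evidence, which is their point.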

My reply: I just have one suggestion if it’s not too late for you to make the change. In your title you refer to “Bayesian Evidence Synthesis” which is a kind of brand name for a particular method. One could also speak of “Bayesian evidence synthesis” as referring to methods of synthesizing evidence using Bayesian models, for example this would include the multilevel approach that you prefer. The trouble is that many (most) readers of your paper will not have heard of the brand name “Bayesian Evidence Synthesis”—I had not heard of it myself!—and so they will erroneously take your paper as a slam on Bayesian evidence synthesis.

This is similar to how you can be a democrat without being a Democrat, or a republican without being a Republican.

**P.S.** Williams replied in comments. He and his collaborators revised the paper, including changing the title; the new version is here.


The post Mick Cooney: case study on modeling loss curves in insurance with RStan appeared first on Statistical Modeling, Causal Inference, and Social Science.

All the Stan case studies are here.


The post “I agree entirely that the way to go is to build some model of attitudes and how they’re affected by recent weather and to fit such a model to “thick” data—rather than to zip in and try to grab statistically significant stylized facts about people’s cognitive illusions in this area.” appeared first on Statistical Modeling, Causal Inference, and Social Science.

I don’t have much to say here, except that:

1. It’s nearly a year later but Christmas is coming again so here’s my post.

2. Yes, the effects of local weather on climate change attitudes do seem worth studying in a more systematic way. I agree entirely that the way to go is to build some model of attitudes and how they’re affected by recent weather and to fit such a model to “thick” data—rather than to zip in and try to grab statistically significant stylized facts about people’s cognitive illusions in this area.

There’s some general principle here, I think, which is worth exploring further.

3. Yes, they really do say “Happy holidays” here. Or “Have a good winter break.” Also “Merry Christmas” etc. I’ve never heard anyone say “Season’s greetings”—that just sounds like something you’d see on a Hallmark card.

And here’s what Reynolds sent me:

I figure it’s likely you will not reply given it’s nearly Christmas and you may have already come across it. But I thought it might be worth covering on your blog. I think it fits the theme of probably-correct-social-science-that-could-be-done-better, which you do seem to cover a bit.

Yesterday a study was published in the Proceedings of the National Academy of Sciences linking the proportion of people in an area who believe that climate change is happening with the most recent local weather record.

They conclude that people who have experienced more recent record highs are more likely to accept climate change than people who have experienced record lows (actually it’s slightly more complicated—“the number of days per year for which the year of the record high temperature is more recent than the year of the record low temperature”).

While I find the claim plausible if not likely I don’t really understand the analysis they have done, and find it all a bit weird.

I think they may have fallen into the camp of – got access to lots of noisy data, built a metric, done some statistical tests with significant p-values and then come up with a narrative to fit the results. Something weird is going on with the significance levels: “Levels of significance (*5%; **1%; ***10%).”

Wouldn’t that normally be ***.1%? Or just (*.05; **.01; ***.001), so you don’t end up making typos with percentages.

Interesting that usually you have to explain to people that weather is not climate (and that weather data is noisy), but in this case it’s how people’s experience of one (which is still going to be noisy) affects their belief about the other.

So I don’t think they have really collected data that truly tests their theory, and then built and tested a model to explain the data. I notice from looking at the maps of America that pretty much everywhere has had “red” levels of TMax and the regions that have the most blue levels are mostly in the mid-west. Wouldn’t it be better to build a model that also factors in other predictors of belief like region, climate in that region, wealth, levels of education, etc.? Shouldn’t it also include record highs for each season, rather than just the year? A particularly hot winter or chilly summer might convince someone more/less that climate change is happening than yet another stinking hot Californian summer.

In Australia it feels like we have been living through a handful of “Once in a lifetime weather events” each year.

I guess there’s just so many different maximums that can be broken – hottest day, month, season or year.

The more I think about it the more things I’d want to check, but I’ve already got more than enough ideas for my own research so I should stop.

What I think would be cool if someone could build a model that tracks change in social attitudes about climate change over time (and if local weather events have any effect).

Cheers and Happy Holidays (which apparently is the preferred season greeting in New York)


The post “congratulations, your article is published!” Ummm . . . appeared first on Statistical Modeling, Causal Inference, and Social Science.

The following came in the email under the heading, “congratulations, your article is published!”:

I don’t know that I should be congratulated on correcting an error, but sure, whatever.

**P.S.** The above cat is adorably looking out and will notice all of your errors.


The post When do we want evidence-based change? Not “after peer review” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Jonathan Falk sent me the above image in an email with subject line, “If this isn’t the picture for some future blog entry I’ll never forgive you.” This was a credible threat so here’s the post.

But I don’t agree with that placard at all!

Waiting for peer review is a bad idea for two reasons: First because you might be waiting for a really long time (especially if an econ journal is involved), second because all sorts of bad stuff gets through peer review—just search this blog for PPNAS.

Falk replied:

Completely agree… That’s what makes it funny, no?

Making a living outside academia is all about this for me. When asked on the stand whether some analysis of mine has had peer review, my answer is yes, because my colleagues have reviewed it. But if we had to depend on traditional academic peer review, the time lags would put us out of business—but of course the REAL peer review is performed by the experts on the other side. Consulting experts are the replication ideal, no?

I don’t know enough about legal consulting to have an opinion on whether consulting experts are the replication ideal, but I do know that I don’t want to wait around for peer review.

Another way to see this is from a decision-analytic perspective. There are potential costs and benefits to change, potential costs and benefits to remaining with the status quo, and potential costs and benefits to waiting for more information. A decision among these options has to depend to some extent on these costs and benefits, which are considered only obliquely in any peer review process.
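To make that concrete, here’s the decision problem with invented numbers:

```python
# Invented numbers for the three options: act now, keep the status quo,
# or wait for peer review before deciding.
p_helps = 0.6             # chance the proposed change is an improvement
gain, harm = 100.0, 40.0  # payoff if it helps / cost if it backfires
delay_cost = 15.0         # cost of the status quo while waiting
review_value = 10.0       # expected value of the better-informed decision

ev_change = p_helps * gain - (1 - p_helps) * harm
ev_status_quo = 0.0
ev_wait = ev_change - delay_cost + review_value

print(ev_change, ev_status_quo, ev_wait)
```

With these (entirely made-up) numbers, acting now beats waiting: the cost of delay exceeds the value of the review. Change the numbers and the ranking flips, which is exactly the point: the decision depends on quantities that peer review never weighs.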


The post Postdoc in Finland and NY to work on probabilistic inference and Stan! appeared first on Statistical Modeling, Causal Inference, and Social Science.

The funding is part of a joint project with Antti Honkela and Arto Klami at the University of Helsinki, and together we are hiring 3 postdocs. The two other postdocs would work in Helsinki and part of the time at DTU in Denmark (with Ole Winther) or Cambridge, UK (with Zoubin Ghahramani).

The project is about theory and methods for assessing the quality of distributional approximations, and about improving inference accuracy by targeting the approximation towards the eventual application goal and by better utilising the available data, e.g., when working with data under privacy constraints.

More information about the position and how to apply is here.

You can manage very well in Finland with English and you don’t need to learn any Finnish for the job. Helsinki has been selected many times among the world’s top 10 most liveable cities: https://yle.fi/uutiset/osasto/news/helsinki_again_among_worlds_top_10_liveable_cities/9781098


The post Does racquetball save lives? appeared first on Statistical Modeling, Causal Inference, and Social Science.

8e5 people in the study, about half reported exercising, about half not. About 10% died overall. So an overall death-rate difference of 28% is pretty remarkable. It means about 3500 deaths instead of 4500 for a similar sample size.

But when you compare the rate of heart disease risk specifically (about 2% died of heart disease, or around 1000 in each sample), for runners vs. racket sports specifically (less than 10% each) you are really shooting in the dark. Say around 5000 people engaged in each kind of activity and around 100 died of heart disease in each group, sounds like normal variation.

Also they eliminated people who had heart disease at the beginning of the study, not sure why they would do this.

I guess the biggest issue is not controlling for endogeneity of activity, as the people who are frail and sickly are probably not engaging in much sports activity.

Not sure how much is author hype vs. journalist hype.

My reply: This reminds me of the “what does not kill me makes me stronger” principle. The elimination of people with risk at beginning of study, that’s interesting. I can see why it makes sense to do this and I can also see how it can cause bias. I guess the right way to do this is to express results conditional on initial risk.

Meir:

Afterwards I realized the biggest weakness: They do not control for income. If you examine the results you see that the people who engage in the most expensive forms of exercise (e.g. racquetball) have the lowest morbidity. Could be just a proxy for income.

Anyway, a flawed study is better than no study, the next time they can try to control for income.

Yup. Gotta start somewhere. Make all your data and code available, don’t hype your claims, and we can go from there.
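For what it’s worth, Meir’s “normal variation” point checks out on the back of an envelope, using his rough numbers:

```python
from math import sqrt

# Meir's rough numbers: ~5000 people per activity group, ~2% of whom
# died of heart disease, i.e. ~100 deaths per group.
n, rate = 5000, 0.02
sd_count = sqrt(n * rate * (1 - rate))  # binomial sd of one group's count
sd_diff = sqrt(2) * sd_count            # sd of the difference of two groups
print(round(sd_count, 1), round(sd_diff, 1))  # about 10 and 14
```

The difference between two groups’ death counts has a standard deviation of about 14, so even a gap of 25 deaths between runners and racket-sport players, a 25% relative difference in rates, is within two sd’s of pure noise.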

**P.S.** Meir also pointed me to this book, “The Lion in the Living Room: How House Cats Tamed Us and Took Over the World,” by Abigail Tucker. I haven’t read it—really I should spend less time online and more time reading books!—but it has a pretty good title, I’ll grant it that.


The post Halifax, NS, Stan talk and course Thu 19 Oct appeared first on Statistical Modeling, Causal Inference, and Social Science.

I (Bob, not Andrew) am going to be giving a talk on Stan and then Mitzi and I will be teaching a course on Stan after that. The public is invited, though space is limited for the course. Here are details if you happen to be in the Maritime provinces.

**TALK: Stan: A Probabilistic Programming Language for Bayesian Inference**

Date: Thursday October 19, 2017

Time: 10am

Location: Slonim Conference room (#430), Goldberg Computer Science Building, Dalhousie University, 6050 University Avenue, Halifax

Abstract

I’ll describe Stan’s probabilistic programming language, and how it’s used, including

- blocks for data, parameter, and predictive quantities
- transforms of constrained parameters to unconstrained spaces, with automatic Jacobian corrections
- automatic computation of first- and higher-order derivatives
- operator, function, and linear algebra library
- vectorized density functions, cumulative distributions, and random number generators
- user-defined functions
- (stiff) ordinary differential equation solvers

I’ll also provide an overview of the underlying algorithms for full Bayesian inference and for maximum likelihood estimation:

- adaptive Hamiltonian Monte Carlo for MCMC
- L-BFGS optimization and transforms for MLE

I’ll also briefly describe the user-facing interfaces: RStan (R), PyStan (Python), CmdStan (command line), Stan.jl (Julia), and MatlabStan (MATLAB).

I’ll finish with an overview of the what’s on the immediate horizon:

- GPU matrix operations
- MPI multi-core, multi-machine parallelism
- data parallel expectation propagation for approximate Bayes
- marginal Laplace approximations

**TUTORIAL: Introduction to Bayesian Modeling and Inference with RStan**

Instructors:

- Bob Carpenter, Columbia University
- Mitzi Morris, Columbia University

Date: Thursday October 19, 2017

Time: 11:30am-5:30pm (following the seminar on Stan at 10am)

Location: Slonim Conference room (#430), Goldberg Computer Science Building, Dalhousie University, 6050 University Avenue, Halifax

Registration: EventBrite Registration Page

Description:

This short course will provide

- an introduction to Bayesian modeling
- an introduction to Monte Carlo methods for Bayesian inference
- an overview of the probabilistic programming language Stan

Stan provides a language for coding Bayesian models along with state-of-the-art inference algorithms based on gradients. There will be an overview of how Stan works, but the main focus will be on the RStan interface and building applied models.

The afternoon will be devoted to a case study of hierarchical modeling, the workhorse of applied Bayesian statistics. We will show how hierarchical models pool estimates toward the population means based on population variance and how this automatically estimates regularization and adjusts for multiple comparisons. The focus will be on probabilistic inference, and in particular on testing posterior predictive calibration and the sharpness of predictions.
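To give a feel for the pooling computation at the heart of the afternoon session, here it is in miniature (Python rather than Stan, with invented numbers):

```python
def partial_pool(y_j, sigma_j, mu, tau):
    """Posterior mean of one group's effect theta_j in the normal
    hierarchical model y_j ~ N(theta_j, sigma_j^2), theta_j ~ N(mu, tau^2):
    a precision-weighted compromise between the group's own estimate
    and the population mean."""
    w_data = 1.0 / sigma_j ** 2
    w_pop = 1.0 / tau ** 2
    return (w_data * y_j + w_pop * mu) / (w_data + w_pop)

# Invented numbers: population mean 0, population sd 5.
mu, tau = 0.0, 5.0
print(round(partial_pool(20.0, 15.0, mu, tau), 1))  # noisy group: pulled hard toward 0
print(round(partial_pool(20.0, 2.0, mu, tau), 1))   # precise group: barely moves
```

A group estimated at 20 with sd 15 gets pulled almost all the way to the population mean, while one estimated at 20 with sd 2 barely moves: that is the adaptive regularization the course covers, with Stan estimating mu and tau from the data rather than fixing them.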

Installing RStan: Please show up with RStan installed. Instructions are linked from here:

*Warning*: follow the instructions step-by-step; even though installation involves a CRAN package, it’s more complex than just installing from RStudio because a C++ toolchain is required at runtime.

If you run into trouble, please ask for help on our forums—they’re very friendly:

**Full Day Schedule**

10:00-11:00am Open Seminar – Introduction to the “Stan” System

11:00-11:30am Break

11:30am-1:00pm Tutorial part 1

1:00pm -2:00pm Lunch Break

2:00pm -3:30pm Tutorial part 2

3:30pm -3:45pm Break

3:45pm -5:30pm Tutorial part 3


The post Workshop on Interpretable Machine Learning appeared first on Statistical Modeling, Causal Inference, and Social Science.

NIPS 2017 Symposium

Interpretable Machine Learning

Long Beach, California, USA

December 7, 2017

Call for Papers:

We invite researchers to submit their recent work on interpretable machine learning from a wide range of approaches, including (1) methods that are designed to be more interpretable from the start, such as rule-based methods, (2) methods that produce insight into existing ML models, and (3) perspectives either for or against interpretability in general. Topics of interest include:

– Deep learning

– Kernel, tensor, graph, or probabilistic methods

– Automatic scientific discovery

– Safe AI and AI Ethics

– Causality

– Social Science

– Human-computer interaction

– Quantifying or visualizing interpretability

– Symbolic regression

Authors are welcome to submit 2-4 page extended abstracts, in the NIPS style. References and supplementary material are not included in the page limit. Author names do not need to be anonymized. Accepted papers will have the option of inclusion in the proceedings. Certain papers will also be selected to present spotlight talks. Email submissions to interpretML2017@gmail.com.

Key Dates:

Submission Deadline: 20 Oct 2017

Acceptance Notification: 23 Oct 2017

Symposium: 7 Dec 2017

Workshop Overview:

Complex machine learning models, such as deep neural networks, have recently achieved outstanding predictive performance in a wide range of applications, including visual object recognition, speech perception, language modeling, and information retrieval. There has since been an explosion of interest in interpreting the representations learned and decisions made by these models, with profound implications for research into explainable ML, causality, safe AI, social science, automatic scientific discovery, human computer interaction (HCI), crowdsourcing, machine teaching, and AI ethics. This symposium is designed to broadly engage the machine learning community on the intersection of these topics—tying together many threads which are deeply related but often considered in isolation.

For example, we may build a complex model to predict levels of crime. Predictions on their own produce insights, but by interpreting the learned structure of the model, we can gain more important new insights into the processes driving crime, enabling us to develop more effective public policy. Moreover, if we learn that the model is making good predictions by discovering how the geometry of clusters of crime events affect future activity, we can use this knowledge to design even more successful predictive models. Similarly, if we wish to make AI systems deployed on self-driving cars safe, straightforward black-box models will not suffice, as we will need methods of understanding their rare but costly mistakes.

The symposium will feature invited talks and two panel discussions. One of the panels will have a moderated debate format where arguments are presented on each side of key topics chosen prior to the symposium, with the opportunity to follow-up each argument with questions. This format will encourage an interactive, lively, and rigorous discussion, working towards the shared goal of making intellectual progress on foundational questions. During the symposium, we will also feature the launch of a new Explainability in Machine Learning Challenge, involving the creation of new benchmarks for motivating the development of interpretable learning algorithms.

I’m interested in this topic! It relates to some of the ideas we’ve been talking about regarding Stan and Bayesian workflow.


The post Partial pooling with informative priors on the hierarchical variance parameters: The next frontier in multilevel modeling appeared first on Statistical Modeling, Causal Inference, and Social Science.

In the course of tinkering with someone else’s hairy dataset with a great many candidate explanatory variables (some of which are largely orthogonal factors, but the ones of most interest are competing “binning” schemes of the same latent elements), I wondered about the following “model selection” strategy, which you may have alluded to in your multiple comparisons paper:

Include all plausible factors/interactions as random intercept effects (i.e., (1|A), (1|A:B) in lmer parlance [that’s stan_lmer() now — ed.]). Since there are many competing, non-orthogonal binning schemes included all at once, the model would be overdetermined (singular) if they were all included as fixed effects [he means “non-varying coefficients.” Or maybe he means “varying coefficients estimated without regularization.” The “fixed”/”random” terminology is unclear. — ed.]. However, as random effects [“varying coefficients estimated using regularization” — ed.], we can rely on partial pooling and shrinkage to sort out among them, such that variance along factors that are not well supported by the data (or are explained away by other factors) shrinks to zero. This happens quite decisively in lmer, but a “regularizing” exponential prior on the variances in a fully bayesian model would achieve something similar, I think [easy to do in stan or rstanarm — ed.]. (A more sophisticated approach would be to put a common, pooling prior on the variances for all the individual factors…)

This approach seems to yield sensible results, but I am a bit concerned because I have never seen it used by others, so I am probably missing something. It may just be that it is rarely computationally practical to include all candidate factors/binning-schemes in such an “overcomplete” model. Or perhaps there is a compelling reason why explanatory variables of substantive interest should be treated as fixed, rather than random effects? Is there a fundamental problem with this approach that I am not thinking of? Or is this a well-known technique that I have simply never heard of?

My reply (in addition to the editorial comments inserted above):

This sounds fine. I think, though, you may be overestimating what lmer will do (perhaps my fault given that I featured lmer in my multilevel modeling book). The variance estimate from lmer can be noisy. But, sure, yes, partial pooling is the way to go, I think. Once you’ve fit the model I don’t really see the need to shrink small coefficients all the way to zero, but I guess you can if you want. Easiest I think is to fit the model in Stan (or rstanarm if you want to use lmer-style notation).

Also, the whole fixed/random thing is no big deal. You can allow any and all coefficients to vary; it’s just that if you don’t have a lot of data then it can be a good idea to put informative priors on some of the group-level variance parameters to control the amount of partial pooling. Putting in all these priors might sound kinda weird but that’s just cos we don’t have a lot of experience with such models. Once we have more examples of these (and once they’re in the new edition of my book with Jennifer), then it will be no big deal for you and others to follow this plan in your own models.

Informative priors on group-level variances make a lot more sense than making a priori decisions to do no pooling, partial pooling, or complete pooling.
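To make this concrete, here’s a small grid computation on the classic eight-schools data (Rubin 1981), with the population mean integrated out analytically. The half-normal scale of 3 below is an arbitrary choice for illustration, not a recommendation:

```python
from math import exp, log, pi

# Eight-schools data (Rubin 1981):
y =     [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

def log_marginal(tau):
    """log p(y | tau), up to a constant, with the population mean mu
    integrated out under a flat prior."""
    v = [s * s + tau * tau for s in sigma]
    w = [1.0 / vi for vi in v]
    w_tot = sum(w)
    mu_hat = sum(wi * yi for wi, yi in zip(w, y)) / w_tot
    lp = -0.5 * log(w_tot)
    for yi, vi in zip(y, v):
        lp += -0.5 * log(2 * pi * vi) - 0.5 * (yi - mu_hat) ** 2 / vi
    return lp

def posterior_mean_tau(log_prior):
    grid = [0.01 * i for i in range(1, 3001)]   # tau in (0, 30]
    logp = [log_marginal(t) + log_prior(t) for t in grid]
    top = max(logp)
    dens = [exp(lp - top) for lp in logp]
    return sum(t * d for t, d in zip(grid, dens)) / sum(dens)

flat = posterior_mean_tau(lambda t: 0.0)
# A half-normal(0, 3) prior on tau -- scale 3 is an arbitrary choice:
informative = posterior_mean_tau(lambda t: -0.5 * (t / 3.0) ** 2)
print(round(flat, 1), round(informative, 1))
```

Under the flat prior the posterior mean of tau is sizable even though the data are consistent with tau = 0 (the marginal maximum likelihood estimate, the lmer-style point estimate, is famously zero for these data), while the informative prior pulls the posterior down. The prior on the variance parameter is doing real work in determining how much pooling you get.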


The post Splines in Stan; Spatial Models in Stan ! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Two case studies:

Splines in Stan, by Milad Kharratzadeh.

Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data, by Mitzi Morris.

This is great. Thanks, Mitzi! Thanks, Milad!


The post Please contribute to this list of the top 10 do’s and don’ts for doing better science appeared first on Statistical Modeling, Causal Inference, and Social Science.

I was wondering if you had ever considered publishing a top ten ‘do’s/don’ts’ for those of us that are committed to doing better science, but don’t necessarily have the time to devote to all of these issues [of statistics and research methods].

Obviously, there is a lot of nuance in both methods and stats for any particular project. So, I’m not asking you for a ‘one size fits all’, but more of a 5 or 10 factor checklist as a framework for those of us committed to doing better work, but worried we may not have the expertise or time to follow-through on these commitments. Sort of a—whatever you do, at least do x, y, and z.

I looked up Glasford on the internet and found this description of his research:

The focus of much of my work is on three interrelated streams of research questions that are concerned with understanding: how people make decisions about what to do when faced with injustice; what compels people to join and stay involved in political protest that can benefit their own group, as well as groups they do not belong to; and when, why, and what helps individuals from groups of differing power improve relations with one another.

Wow—this sounds important. I should talk with this guy.

In the meantime, do I have a checklist of 10 items? I’ve given advice to psychology researchers from time to time but I don’t have a convenient list of 10 things.

But we should have such a list! Can you make some suggestions in comments? Also, if anyone out there is in contact with any leading social psychologists, maybe we could get their thoughts too? There’s a lot I disagree with in the writings of, say, Susan “terrorists” Fiske or Daniel “shameless little bullies” Gilbert or Mark “Evilicious” Hauser or all the other people you’re sick of hearing about on this blog—but, say what you want about these people, they’ve thought a lot about psychology research and I’d be interested in what *their* top 10 tips would be. Not tips on how to get published in PNAS or wherever, but tips on doing better science.

Unfortunately I don’t expect we’ll hear from the above people (I’d be happy to be surprised on that end, though!), so in the meantime I’d love to hear your thoughts.

OK, I’ll start with items #1, 2, and 3 on the top-10 list, in decreasing order of importance:

1. Learning from data involves three stages of extrapolation: from sample to population, from treatment group to control group, and from measurement to the underlying construct of interest. Worry about all three of these, but especially about measurement, as this tends to be taken for granted in statistics discussions.

2. Variance is as important as bias. To put it another way, take a look at your (estimated) bias and your standard error. Whichever of these is higher, that’s what you should be concerned about.

3. Measurement error and variation are concerns even if your estimate is more than 2 standard errors from zero. Indeed, if variation or measurement error are high, then you learn almost nothing from an estimate even if it happens to be “statistically significant.”
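Item 2 can be put in numbers: total error combines the two components as rmse = sqrt(bias^2 + se^2), so the larger one dominates:

```python
from math import sqrt

def rmse(bias, se):
    # Total error of an estimator: squared bias plus variance.
    return sqrt(bias**2 + se**2)

# When the standard error dwarfs the bias, bias-reduction buys little:
print(round(rmse(0.02, 0.10), 3))  # 0.102
# And vice versa:
print(round(rmse(0.10, 0.02), 3))  # 0.102
```

Either way the total error is driven almost entirely by the larger component, which is why it’s the one to worry about.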

OK, none of the above are so pithy, and I’m open to the idea of other items bumping these down the list.

It’s your turn.

The post Please contribute to this list of the top 10 do’s and don’ts for doing better science appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Tenure-Track or Tenured Prof. in Machine Learning in Aalto, Finland appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>We are looking for a professor to either further strengthen our strong research fields (keywords include statistical machine learning, probabilistic modelling, Bayesian inference, kernel methods, and computational statistics) or complement them with deep learning. Collaboration with other fields is welcome, with local opportunities at both Aalto and the University of Helsinki. A joint appointment with the Helsinki Institute for Information Technology HIIT, a joint research centre with the University of Helsinki, can be negotiated. Aalto is currently investing heavily in fundamental AI research, and Finland in applied AI, and the Aalto Department of Computer Science is the right place to pursue the next generation of machine learning and AI.

With nine professors and ca. 100 PhD students and postdocs working in machine learning, data mining, and probabilistic modelling, Aalto Department of Computer Science is one of Europe’s leading centres of research in the field of the call.

Review of the position will begin on 29 October 2017. The position will remain open until filled.

I can add that Aalto also has the second-largest concentration of Stan developers in the world!

See more at the full call

P.S. You can manage very well in Finland with English: you can teach in English, and you don't need to learn any Finnish for the job. Helsinki has repeatedly been selected among the world's top 10 most liveable cities: https://yle.fi/uutiset/osasto/news/helsinki_again_among_worlds_top_10_liveable_cities/9781098

The post Tenure-Track or Tenured Prof. in Machine Learning in Aalto, Finland appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post The house is stronger than the foundations appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Regarding the whole 'double use of data' issue with posterior predictive checks [see here and, for a longer discussion, here], I just wanted to note that David Cox describes the 'Fisherian reduction' as follows (I've summarised slightly; see p. 24 of 'Principles of Statistical Inference'):

– Find the likelihood function

– Reduce to a sufficient statistic S of the same dimension as theta

– Estimate theta based on the sufficient statistic

– Use the conditional distribution of the data given S=s informally or formally to assess the adequacy of the formulation.

Your conception of posterior predictive checks seems to me to be essentially the same:

– Find a likelihood and prior

– Use Bayes to estimate the parameters.

– The data enters this estimation procedure only via the sufficient statistics (i.e. ‘the likelihood principle as applied within the model’)

– There is thus a ‘leftover’ part of the data y|S(y)

– You can use this to check the adequacy of the formulation

– Do this by conditioning on the sufficient statistics, i.e. using the posterior predictive distribution, which was fit using the sufficient statistics.

Formally, I think, the posterior predictive distribution is essentially p(y|S(y)), since Bayes only uses S(y) rather than the 'full' data.

Thus there is no ‘double use of data’ when checking the parts of the data corresponding to the ‘residual’ y|S(y).

On the other hand the aspects corresponding to the sufficient statistics are essentially ‘automatically fit’ by Bayes (to use your description in the PSA abstract).

You are probably aware of all of this, but it may help some conceptually.

I personally found it useful to make this connection in order to resolve (at least some parts of) the conflict between my intuitive understanding of why PPP are good and some of the formal objections raised.
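Here is a toy simulation of that information split (my sketch, not Maclaren's): a normal model with known scale and flat prior, where the sample mean is sufficient and the check statistic depends only on the residuals y − ȳ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: y ~ N(theta, 1) with a flat prior; the sample mean is sufficient.
n = 50
y = rng.normal(2.0, 1.0, size=n)
ybar = y.mean()

# Check statistic computed from the "leftover" part of the data, y - ybar:
# here, the range of the residuals. It carries none of the information
# used to fit theta.
def resid_stat(x):
    return np.ptp(x - x.mean())

T_obs = resid_stat(y)

# Posterior predictive replications: draw theta | y ~ N(ybar, 1/n),
# then simulate a replicate dataset and recompute the statistic.
S = 4000
theta_draws = rng.normal(ybar, 1 / np.sqrt(n), size=S)
T_rep = np.array([resid_stat(rng.normal(t, 1.0, size=n)) for t in theta_draws])

ppp = np.mean(T_rep >= T_obs)
print(f"posterior predictive p-value: {ppp:.2f}")
```

Because the range of the residuals is ancillary with respect to theta, checking it against the posterior predictive distribution uses only the "leftover" information y|S(y), so there is no double use of the data in this check.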

My reply: When I did my formalization of predictive checks in the 1990s, it was really for non-Bayesian purposes: I had seen problems where I wanted to test a model and summarize that test, but the p-value depended on unknown parameters, so it made sense to integrate them out. Since then, posterior predictive checks have become popular among Bayesians, but I’ve been disappointed that non-Bayesians have not been making use of this tool. The non-Bayesians seem obsessed with the uniform distribution of the p-value, a property that makes no sense to me.

The following papers might be relevant here:

Two simple examples for understanding posterior p-values whose distributions are far from uniform

Section 2.3 of A Bayesian formulation of exploratory data analysis and goodness-of-fit testing

Maclaren responded:

It seems to me that a relevant division of non-Bayesians is into something like

– Fisherians, e.g. David Cox and those who emphasise likelihood, conditioning, ‘information’ and ‘inference’. If they are interested in coverage it is usually conditional coverage with respect to the appropriate situation. Quite similar to your ideas on defining the appropriate ‘replications’ of interest.

– Neymanians, i.e. those with a more ‘pure’ Frequentist bent who emphasise optimality, decisions, coverage (often unconditional) etc.

I think the former are/would be much more sympathetic to your approach. For example, as noted I think Cox basically advocates the same thing in the simple case. Lindsey, Sprott etc also all emphasise the perspective of ‘information division’ which I think addresses at least some concerns with double use of data in simple cases.

With regard to having the ‘residual’ dependent on the parameters: presumably there is some intuitive notion here of a ‘weak’ or ‘local’ dependence on the fitted parameters (or something similar)? Or some kind of ‘inferential separation’? Perhaps an unusual model structure?

I’m trying to think of the ‘logic’ of information separation here.

For example, I can imagine a factorisation something like

P(Y|θ) = P(Y|S,α(λ))P(S|λ)

where

P(S|λ) gives the likelihood for fitting λ

P(Y|S,α(λ)) gives the residual for model checking, now depending on λ but via α(λ).

In this case θ = (λ,α(λ)) seems to provide the needed separation, but the components are not (variation) independent.

So it still makes sense to use your best estimate of λ in model checking to make sure you use a relevant α (i.e. average over λ’s posterior).

Something like a curved exponential model might fit this case.

Just thinking out loud, really.

Me again: Sure, but also there’s all the regularization and machine learning stuff. Take, for example, the Stanford school of statistics: Efron, Hastie, Tibshirani, Donoho, etc. They use what I (and they) would call modern methods which I think of as Bayesian and they think of as regularized likelihood or whatever, but I think we all worship the same god even if we give it different names. When it comes to foundations, I’m pretty sure that the Stanford crew think in a so-called Neyman-Pearson framework with null hypotheses and error rates. There’s no doubt that they’ve had real success, both methodological and applied, with that false discovery rate approach, even though I still find it lacking as to me it’s based on a foundation of null hypotheses that is in my opinion worse than rickety.
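For readers who haven't seen it, the core of that false discovery rate machinery is simple to state; here is a minimal sketch of the Benjamini-Hochberg step-up procedure (my illustration, with made-up p-values):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: boolean mask of discoveries at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Compare the k-th smallest p-value against (k/m) * q
    passing = p[order] <= (np.arange(1, m + 1) / m) * q
    mask = np.zeros(m, dtype=bool)
    if passing.any():
        k = np.max(np.nonzero(passing)[0])  # largest rank meeting its threshold
        mask[order[:k + 1]] = True          # reject everything up to that rank
    return mask

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.74]
print(benjamini_hochberg(pvals))  # only the first two p-values are discoveries
```

Note that the whole procedure operates on p-values computed under point null hypotheses, which is exactly the foundation I find rickety: the machinery is elegant, but it inherits whatever problems the nulls themselves have.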

In any case, I have mixed feelings about the relevance of posterior predictive p-values for these people. I would definitely like them to do some model checks, and I continue to feel that some posterior predictive distribution is the best way to get a reference set to use to compare observed data in a model check. But I think more and more that p-values are a dead end. I guess what I’d really like of non-Bayesian statisticians is for them to make their assumptions more explicit—to express their assumptions in the form of generative models, so that then these models can be checked and improved. Right now things are so indirect: the method is implicitly based on assumps (or, to put it another way, the method will be most effective when averaging over data generating processes that are close to some ideal) but these assumps are not stated clearly or always well understood, which I think makes it difficult to choose among methods or to improve them in light of data.

I’ve been thinking this a long time. I have a discussion of a paper of Donoho et al. from, ummm, 1990 or 1992, making some of the above points (in proto-fashion). But I don’t think I explained myself clearly enough: in their rejoinder, Donoho et al. saw that I was drawing a mathematical equivalence between their estimators and Bayesian priors, but I hadn’t been so clear on the positive virtues of making assumptions that can be rejected, with that rejection providing a direction for improvement.

Maclaren:

There’s a lot here I agree with of course.

And yes, the cultures of statistics, and quantitative modeling generally, are pretty variable and it can be difficult to bridge gaps in perspective.

Now, some more overly long comments from me.

As some context for the different cultures aspect, I’ve bounced around maths and engineering departments while working on biological problems, industrial problems, working with physicists, mathematicians, statisticians, engineers, computer scientists, biologists etc. It has of course been very rewarding but the biggest barriers are usually basic ‘philosophical’ or cultural differences in how people see and formulate the main questions and methods of addressing these. These are much more entrenched than you realise until you try to actually bridge these gaps.

I wouldn’t really describe myself as a Bayesian, Frequentist, Likelihoodist, Machine Learner etc, despite seeing plenty of value in each approach. The more I read on foundations the more I find myself – to my surprise, since I used to view them as old-fashioned – quite sympathetic with Fisher, Barnard, Cox, Sprott, Barndorff-Nielsen etc. In particular on the organisation, reduction, splitting, combining etc of ‘information’ and the geometric perspective on this.

Hence me trying to understand PPC from this point of view. I think the simple point that you can for example base estimation on part of the data and checking on another part, and in simple cases represent this as a factorisation, clears a few things up for me. It also explains some (retrospectively) obvious results I saw when using PPC eg the difference between checks based directly on fitted stats vs those based on ‘residual’ information.

But even then I have plenty of disagreements with the Fisherian school, and would like to see it extended to more complex problems. Bayes in the Jeffreys, Jaynes vein is of course similar to this ‘organisation of information’ perspective, but I find Jaynes in particular tends to often make overly strong claims while ignoring mathematical and philosophical subtleties. Classic physicist style of course!

(I started writing some notes on a reformulation of this sort of perspective in terms of category theory, but I doubt I'll ever finish them, or that anyone would read them if I did! Sander did, surprisingly, offer some encouragement on this – the DAG people are probably more open to general abstract nonsense in diagram form!).

RE: The Stanford school. Yes, they seem a somewhat strange mix of decision theory, optimisation, and function approximation. (Though again, not an unfamiliar mix to me – I spend a fair amount of time around operations research people, and minored in it in undergrad. Everything is rewritten as an optimal decision problem. And yes, the statistical aspect seems to come from Neyman-Pearson origins.)

The models are often implicit while primary focus is given to the ‘fitting procedure’. And to them, likelihood is mainly just an objective function to be maximised to get estimators to evaluate Frequentist style.

(Of course this connects to the two big misconceptions about likelihood analysis from both Bayes and Freq – one that it’s just for getting Frequentist estimators, usually via maximisation. Two, it can’t handle nuisance parameters systematically.)

Bayes of course tends to blend modeling and inference. Both have pros and cons to me – I think there is benefit to separating a model from its analysis (think for example finding weak solutions to differential equations) while there is also benefit in seeing this in turn as a modified model (think for example rewriting a differential equation as an integral equation – again leads to weak solutions, but from a more model-based perspective).

Some people love to think in terms of models, some in terms of procedures. This is a difficult gap to bridge sometimes, particularly between eg stats/comp sci vs scientists. I think the idea of ‘measurement’ is important here. ‘Framework theories’ like quantum mechanics and thermodynamics provide a good guide to me, but of course there is no shortage of arguments over how to think about these subjects either!

In terms of p-values for model checking, I definitely prefer graphical checks. In terms of Frequentist parameter inference I differ from you I think in that I see value in seeing confidence intervals as inverted hypothesis tests. I prefer however to see them as something like an inverse image of the data in parameter space rather than as a measure of uncertainty or even as a measure of error rates.

Me: I think it’s great when people come up with effective methods. What irritates me is when people tie themselves into knots trying to solve problems that in my opinion aren’t real. For example there’s this huge literature on simulation from the distribution of a discrete contingency table with known margins. But I think that’s all a waste of time because the whole point is to compute a p-value with respect to a completely uninteresting model in which margins are fixed, which corresponds to a design that’s just about never used. For another example, Efron etc. wasted who knows how many journal pages and man-years of effort on the problem of bootstrap confidence intervals. But I think the whole confidence intervals thing is a waste of time. (I think uncertainty intervals are great; what I specifically don’t like are those inferences that are supposed to have specified coverage conditional on any value of the unknown parameters, and which are defined by inverting hypothesis tests.)

There’s an expression I sometimes use with this work, which is that the house is stronger than the foundations.

The post The house is stronger than the foundations appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Sudden Money appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I’m not sure if you’re keeping track of published failures to replicate the power posing effect, but this article came out earlier this month:

“Embodied power, testosterone, and overconfidence as a causal pathway to risk-taking”

From the abstract:

We were unable to replicate the findings of the original study and subsequently found no evidence for our extended hypotheses.

Gotta love that last sentence of the abstract:

As our replication attempt was conducted in the Netherlands, we discuss the possibility that cultural differences may play a moderating role in determining the physiological and psychological effects of power posing.

I’d like to stop here but maybe I should explain further. I think the effects of power pose *do* vary by country. They also vary from year to year, they’re different on weekday and weekend, different in work and home environments, they differ by outdoor temperature (the clothing you wear will affect the comfort or awkwardness of the pose), of course they vary by sex, and hormone level, and the time of the month, and your marital/relationship status, and the socioeconomic status of your parents, and the number of older siblings you have, and every other damn factor that’s ever been considered as an interaction in a social psychology study. The effects can also be moderated by subliminal smiley faces and priming with elderly-related words and shark attacks and college football games and whether your age ends in a 9 and gay genes and ESP and . . . hmmm, did I forget anything? I’m too lazy to supply links but you can search this blog for all the above phrases for more.

The point is, in a world where everything’s affecting everything else, the idea of “the effect” of power pose is pretty much meaningless. I mean, sure, it could have a huge and consistent effect. But the experiments that have been conducted don’t find that, and this is no surprise. Trying to come up with explanations for patterns in noise (as in the “Netherlands” comment above), that’s a mug’s game. You might as well just cut out the middleman, go to Vegas, and gamble away your reputation on the craps table. (See item 75 here.) In which case you’ll have to support yourself by writing things like The Book of Virtues, and who wants to do that?

**P.S.** We’re making slow but steady progress going through these Westlake-inspired post titles.

The post Sudden Money appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post I respond to E. J.’s response to our response to his comment on our paper responding to his paper appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Empirical claims often concern the presence of a phenomenon. In such situations, any reasonable skeptic will remain unconvinced when the data fail to discredit the point-null. . . . When your goal is to convince a skeptic, you cannot ignore the point-null, as the point-null is a statistical representation of the skeptic’s opinion. Refusing to discredit the point-null means refusing to take seriously the opinion of a skeptic. In academia, this will not fly.

I don’t know why E. J. is so sure about what will or not fly in academia, given that I’ve published a few zillion applied papers in academic journals while only very occasionally doing significance tests.

But, setting aside claims about things not flying, I agree with the general point that hypothesis tests can be valuable at times. See, for example, page 70 of this paper. Indeed, in our paper, we wrote, “We have no desire to ‘ban’ p-values. . . . in practice, the p-value can be demoted from its threshold screening role and instead be considered as just one among many pieces of evidence.” I think E. J.’s principle about respecting skeptics is consistent with what we wrote, that p-values can be part of a statistical analysis.

**P.S.** Also E. J. promises to blog on chess. Cool. We need more statistician chessbloggers. Maybe Chrissy will start a blog too. After all, there’s lots of great material he could copy. There’d be no need to plagiarize: Chrissy could just read the relevant material, not check it for accuracy, and then rewrite it in his own words, slap his name on it, and be careful not to give credit to the people who went to the trouble to compile the material themselves.

**P.P.S.** I happened to have just come across this relevant passage from Regression and Other Stories:

We have essentially no interest in using hypothesis tests for regression because we almost never encounter problems where it would make sense to think of coefficients as being exactly zero. Thus, rejection of null hypotheses is irrelevant, since this just amounts to rejecting something we never took seriously in the first place. In the real world, with enough data, any hypothesis can be rejected.

That said, uncertainty in estimation is real, and we do respect the deeper issue being addressed by hypothesis testing, which is assessing when an estimate is overwhelmed by noise, so that some particular coefficient or set of coefficients could just as well be zero, as far as the data are concerned. We recommend addressing such issues by looking at standard errors as well as parameter estimates, and by using Bayesian inference when estimates are noisy, as the use of prior information should stabilize estimates and predictions.
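That stabilization can be sketched in the simplest conjugate case (the numbers here are made up; the prior scale of 0.5 encodes an assumption that large effects are implausible):

```python
def normal_posterior(est, se, prior_mean=0.0, prior_sd=0.5):
    """Precision-weighted combination of a noisy estimate with a normal prior."""
    w_data = 1.0 / se**2
    w_prior = 1.0 / prior_sd**2
    post_var = 1.0 / (w_data + w_prior)
    post_mean = post_var * (w_data * est + w_prior * prior_mean)
    return post_mean, post_var**0.5

# A noisy estimate just over 2 standard errors from zero
est, se = 4.2, 2.0
post_mean, post_sd = normal_posterior(est, se)
print(f"raw: {est} +/- {se}, posterior: {post_mean:.2f} +/- {post_sd:.2f}")
```

The "significant" raw estimate of 4.2 +/- 2.0 shrinks to 0.25 +/- 0.49: under this prior, the data were too noisy to tell us much, and the posterior says so.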

The post I respond to E. J.’s response to our response to his comment on our paper responding to his paper appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post “Why bioRxiv can’t be the Central Service” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Anyway, as a fan of preprint servers I appreciate Anaya’s point-by-point discussion of why one particular server, bioRxiv (which I’d never heard of before but I guess is popular in biology), can’t do what some people want it to do.

The whole thing is also one more demonstration of why twitter sucks (except this one time), in that Anaya is responding to some ignorance coming from that platform. On the other hand, one could say that twitter is valuable in this case as having brought a widespread misconception to the surface.

**P.S.** Lots and lots of biology papers get written and cited. Just for example, I was looking up my colleague John Carlin on Google Scholar. He has an h-index of 95! He works in biostatistics. Another friend from grad school, Chris Schmid, has an h-index of 85. The h-index is just one thing, it’s no big deal, it’s just interesting to see how that works. Some fields get lots of citations because people are publishing tons of papers there. In biology there are a zillion postdocs all publishing papers, and every paper has about 30 authors. I imagine there will soon be a similar explosion of citations in computer science—if it hasn’t happened already—because every Ph.D. student and postdoc in CS is submitting multiple papers to all the major conferences. If conference papers are getting indexed, this is gonna blow all the citation counts through the roof. Actually, this sort of hyperinflation might be a net positive in that it would devalue the whole citation-count thing.

**P.P.S.** Anaya’s post has a place for comments but it’s on this site called Medium where if you want to comment, you need to sign in, and then you start getting mail in your inbox from Medium, and if you want to cancel your Medium account, it tells you that if you do so, it will delete all your posted comments. That ain’t cool.

The post “Why bioRxiv can’t be the Central Service” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post Stan Roundup, 6 October 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>**Jonah Gabry** returned from teaching a one-week course for a special EU research institute in Spain.

**Mitzi Morris** has been knocking out bug fixes for the parser and some pull requests to refactor the underlying type inference to clear the way for tuples, sparse matrices, and higher-order functions.

**Michael Betancourt**, with help from **Sean Talts**, spent last week teaching an intro course to physicists about Stan. **Charles Margossian** attended and said it went really well.

**Ben Goodrich**, in addition to handling a slew of RStan issues, has been diving into the math library to define derivatives for Bessel functions.

**Aki Vehtari** has put us in touch with the MxNet developers at Amazon UK and Berlin, and we had our first conference call with them to talk about adding sparse matrix functionality to Stan (Neil Lawrence is working there now). **Aki** is also working on revising the *EP as a way of life* paper and finalizing other Stan-related papers.

**Bob Carpenter** and **Andrew Gelman** have recruited **Advait Rajagopal** to help us with the Coursera specialization we’re going to offer (contingent on coming to an agreement with Columbia). The plan’s to have four courses: Intro to BDA (Andrew), Stan (Bob), MCMC (Bob), and Regression and other stories (Andrew).

**Ben Bales** finished the revised pull request for vectorized RNGs. Turns out these things are much easier to write than they are to test thoroughly. Pesky problems with instantiations by integers and whatnot turn up.

**Daniel Lee** is getting ready for ACoP, which **Bill Gillespie** and **Charles Margossian** will also be presenting at.

**Steven Bronder** and **Rok Češnovar**, with some help from **Daniel Lee**, are going to merge the ViennaCL library for GPU matrix ops, with their own specializations for derivatives in Stan, into the math library. This is getting close to being real for users.

**Sean Talts**, when he wasn’t teaching or learning physics, has been refactoring the Jenkins test facilities. As our tests get bigger and we get more developers, it’s getting harder and harder to maintain stable continuous integration testing.

**Breck Baldwin** is taking over dealing with StanCon. Our goal is to get up to 150 registrations.

**Breck Baldwin** has also been working with **Andrew Gelman** and **Jonathan Auerbach** on non-conventional statistics training (like at Maker Faires); they have the beginnings of a paper. Breck’s highly recommending the math museum in NY to see how this kind of thing’s done.

**Bob Carpenter** published a Wiki page on a Stan 3 model concept, which is probably what we’ll be going with going forward. It’s pretty much like what we have now with better const correctness and some better organized utility functions.

**Imad Ali** went to the New England Sports Stats conference. Expect to see more models of basketball using Stan soon.

**Ben Goodrich** fixed the problem with exception handling in RStan on some platforms (always a pain because it happened on Macs and he’s not a Mac user).

**Advait Rajagopal** has been working with **Imad Ali** on adding ARMA and ARIMA time-series functions to rstanarm.

**Aki Vehtari** is working to enhance the loo package with automated code for K-fold cross validation for (g)lmer models.

**Lizzie Wolkovich** visited us for a meeting (she’s on our NumFOCUS leadership body), where she reported that she and a postdoc have been working on calibrating Stan models for phenology (look it up).

**Krzysztof Sakrejda** has been working on proper standalone function generation for Rcpp. Turns out to be tricky with their namespace requirements, but I think we have it sorted out as of today.

**Michael Andreae** has kicked off his meta-analysis and graphics project at Penn State, with **Jonah Gabry** and **Ben Goodrich** chipping in.

**Ben Goodrich** also fixed the infrastructure for RStan so that multiple models may be supported more easily, which should make it much easier for R package writers to incorporate Stan models.

**Yuling Yao** gave us the rundown on where ADVI testing stands. It may falsely report convergence when it’s not at a maximum, it may converge to a local minimum, or it may converge but the Gaussian approximation may be terrible, either in terms of the posterior means or the variances. He and **Andrew Gelman** are looking at using Pareto smoothed importance sampling (a la the loo package) to try to sort out the quality of the approximation. Yuling thinks convergence is mostly scaling issues, and preconditioning along with natural gradients may solve the problem. It’s nice to see grad students sink their teeth into a problem! It’d be great if we could come up with a more robust ADVI implementation that had diagnostic warnings if the approximation wasn’t reliable.
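To give a flavor of the Pareto smoothed importance sampling diagnostic (a toy stand-in, not Yuling's actual implementation: a known one-dimensional mismatch, and a Hill estimator in place of the generalized Pareto fit):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy version of the diagnostic: draw from an approximating distribution q
# and examine the tail of the importance ratios p/q. Here q = Exponential(rate 1)
# and the "true posterior" p = Exponential(rate 0.2), so the ratios have an
# exact Pareto tail with shape k = 0.8.
x = rng.exponential(1.0, size=20_000)
log_w = (np.log(0.2) - 0.2 * x) - (-x)  # log p(x) - log q(x)
w = np.exp(log_w - log_w.max())

# Estimate the tail shape from the top 20% of ratios. PSIS fits a generalized
# Pareto distribution; the Hill estimator below is a simpler stand-in that
# agrees with it for an exact Pareto tail.
tail = np.sort(w)[-len(w) // 5:]
khat = np.mean(np.log(tail / tail[0]))

# PSIS rule of thumb: khat > 0.7 flags an unreliable approximation.
print(f"khat = {khat:.2f}")
```

Here khat comes out near 0.8, correctly flagging that the approximation's tails are too light for the importance-sampling correction, and hence the fit, to be trusted.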

The post Stan Roundup, 6 October 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post I disagree with Tyler Cowen regarding a so-called lack of Bayesianism in religious belief appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I am frustrated by the lack of Bayesianism in most of the religious belief I observe. I’ve never met a believer who asserted: “I’m really not sure here. But I think Lutheranism is true with p = .018, and the next strongest contender comes in only at .014, so call me Lutheran.” The religious people I’ve known rebel against that manner of framing, even though during times of conversion they may act on such a basis.

I think Cowen’s missing the point here when it comes to Bayesianism. Indeed, as an applied Bayesian statistician, I’m not even “Bayesian” in Cowen’s sense when it comes to statistical inference! Suppose I fit some data using logistic regression (my go-to default when modeling survey data with Mister P). I don’t say “logistic regression is true with p = .018, and the next strongest contender comes in only at .014, so call me logistic.” What I say is that I use logistic regression because it works for the problems I work on, and if it has problems, I’ll change the model. I also might want to try some other models as a robustness check. But Bayesian reasoning doesn’t at all require that I assign probabilities to my models.

Or, in a different direction, we can resolve Cowen’s problem by thinking of religious belief as analogous to nationality. Being an American doesn’t mean that I say that Pr(Americanism is true) = .018 or whatever. It’s just an aspect of who I am. This framing becomes particularly clear if you think of interactions between religion and nationality, such as Irish Catholic or Indian Muslim or whatever. And then there are Episcopalians, which from a doctrinal perspective are very close to Roman Catholics but are just part of a different organization. There’s a lot of overlap between religion and nationality. Another way to put it is that sticking with your own nationality, or your own religion, is the default. You can switch religions if there’s another religion you really like, or because you have some other reason (for example, liking the community at one of the local churches), but that’s not a statement about which religion is “true.”

To loop back to statistics, I suppose someone might talk Bayesianly about the probability that a particular religion is best for him/herself, but that’s not at all the same as the probability that the doctrine is true.

Cowen is frustrated by what he sees as “lack of Bayesianism” in religious beliefs that he observes, but I think that if he had a fuller view of Bayesianism this would all make sense to him. In my recent paper with Hennig we talk about “falsificationist Bayesianism.” The idea is that a falsificationist Bayesian performs inference conditional on a model—that is, treats the model as if it were true—and then uses these inferences to make decisions while keeping an eye out for implications of the model that conflict with data or don’t make sense. From a Bayesian perspective, if a prediction “doesn’t make sense,” this implies that it’s in contradiction with some piece of prior information that may not yet have been included in the model. As we move forward in this way, we continue to update and revise our model, occasionally revamping or even discarding the model entirely if it is continuing to offer predictions that make no sense. This sort of Bayesianism does not seem so far off from many forms of non-fundamentalist religious belief.

P.S. to all the wiseguys who will joke that Bayesianism is a religion: Sure, whatever. The same principle applies to statistical methods and frameworks: we use them to solve problems and then alter or abandon them when they no longer seem to be working for us.

The post I disagree with Tyler Cowen regarding a so-called lack of Bayesianism in religious belief appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>The post What am I missing and what will this paper likely lead researchers to think and do? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>In a previous post Ken Rice brought our attention to a recent paper he had published with Julian Higgins and Thomas Lumley (RHL). After I obtained access and read the paper, I made some critical comments regarding RHL which ended with “Or maybe I missed something.”

This post will try to discern what I might have missed by my recasting some of the arguments I discerned as being given in the paper. I do still think, “It is the avoidance of informative priors [for effect variation] that drives the desperate holy grail quest to make sense of varying effects as fixed”. However, given for argument’s sake that one must for some vague reason avoid informative priors for effect variation at all cost, I will try to discern if RHL’s paper outlined a scientifically profitable approach.

However, I should point out that their implied priors seem to be a point prior of zero on there being any effect variation due to varying study quality, and a point prior of one on the default fixed effect estimate being reasonably generalizable to a population of real scientific interest. In addition, as I think the statistical discipline needs to take more responsibility for the habits of inference it instils in others, I am very concerned about what various research groups will most likely think and do given an accurate reading of RHL.

Succinctly (as it’s a long post), what I mostly don’t like about RHL’s paper is that they seem to suggest their specific weighted averaging to a population estimand – which annihilates the between-study variation – will be of scientific relevance, and that from it one can sensibly generalize to a target population of interest. Furthermore, it is suggested as being widely applicable and as often requiring only the default inverse variance weights. Appropriate situations will exist, but I think they will be very rare. Perhaps most importantly, I believe RHL need to set out how this would be credibly assessed in application. RHL do mention limitations, but these are of the rather vague sort: don’t use these methods when they are not appropriate.

That is, there is seemingly little or no advice on when (or how to check whether) one should use the publication-interesting narrow intervals or the publication-uninteresting wide intervals.

First, I will review work by Don Rubin that I came across when I was trying to figure out how to deal with the varying quality of RCTs that we were trying to meta-analyse in the 1980s. It helps clarify what meta-analyses should ideally be aiming at. He conceptualized meta-analysis as building and extrapolating response surfaces in an attempt to estimate “true effects.” These true effects were defined as the effects that would be obtained in perfect hypothetical studies. I referred to this work in my very first talk on meta-analysis, and RHL also refer to Rubin’s paper on it – “Meta-analysis: Literature synthesis or effect-size surface estimation?”, DB Rubin, Journal of Educational Statistics, 1992 – though in a very limited way. I think I prefer Rubin’s earlier paper “A New Perspective” in this book. I will then apply this perspective of what meta-analyses should ideally be aiming at to critically assess where RHL’s proposed approach would be most promising.

Now, Rubin was building and extrapolating response surfaces out of a concern that “we really do not care scientifically about summarizing this finite population (of published studies we have become aware of)” but rather about “the underlying process that is generating these outcomes that we happen to see – that we, as fallible researchers, are trying to glimpse through the opaque window of imperfect empirical studies”. He argues that to better do this we should model two kinds of factors – scientific factors (I often refer to this as biological variation) and scientifically uninteresting design factors (I often refer to this as study quality variation). Furthermore, as we want to get at the underlying process, we need to extrapolate to the highest quality studies, as these most directly reflect the underlying process.

Using the notation X for biological factors and Z for study quality factors he is “proposing, answers are conditional on those factors that describe the science [biology], X, and an ideal [quality] study Z = Z0. That is where we are making inferences.” If there is a lot of extrapolation uncertainty – then that’s the answer. Not much to learn from these studies, they are not the ones you are looking for, so move on.

Now his work has been primarily conceptual as far as I am aware. Sander Greenland and I tried to see how far we could take modelling and extrapolation in a paper entitled “On the bias produced by quality scores in meta-analysis, and a hierarchical view of proposed solutions.” Unfortunately no one seemed willing to share a suitable data set for us to actually try it out. Ideally, such a data set would have a reasonable number of studies that have been adequately assessed for their quality (which we defined as whatever leads to more valid results – likely of fairly high dimension, with the quality dimensions being highly application-specific and hard to measure from published information). Given such requirements are quite demanding, I don’t think they will be met in published clinical research, where one primarily has access only to the publications. They might be met, though, in a co-operating group of researchers studying a common disease and treatment, prospectively conducting studies with access to all the study protocols, revisions and raw data – or perhaps in a prospective inter-laboratory reliability study. In summary, there is always a need to adequately model an X, Z surface and extrapolate to Z0, this being ignorable only when all studies are close enough to Z0.

So we move on to considering RHL’s method only for consistently high quality studies – these are not the methods to use when quality varies. That is a rare situation, but it can happen. I believe this restriction would exclude most meta-analyses done within, for instance, the Cochrane Collaboration; I am not sure if RHL would agree. Even within this restricted context of uniformly high quality studies, I think it will be helpful to give my sense of the science or reality of multiple high quality studies, as even that can be tricky.

Each study will likely recruit an idiosyncratic sample of patients from the general population – it will be highly selective (non-random) and not that well characterized. This can be seen in the variation of patient characteristics and, for instance, the varying control group outcomes in the trials. There will be restrictions on eligibility criteria, but within those there can still be much variation. So each study will have an idiosyncratic sample of patients recruited from the general population, which makes up some relative percentage of the total studied population. RA Fisher used the term indefinite to refer to this type of situation, where we have no means of reproducing such differing sub-populations at will. The same investigators at a later time or in a different city would be unlikely to recruit the same mix of patients (i.e. recruitment into clinical trials is pretty haphazard in my experience). Because of this I don’t see how the sub-population could even be adequately described, nor how the relative percentage of these sub-populations in a population of interest could ever be determined.

Now, the relative percentages (of the total studied population) of these study sub-populations will likely differ from the percentages of these same idiosyncratic sub-populations in the general population or some targeted population of interest. Sample sizes in conducted trials are largely determined by funds available, the recruitment abilities of the trial staff, other trials competing to enrol subsets of targeted patients at the time, etc. Because of this I don’t see why the relative percentages of the various study-recruited patients (of the total studied population) would be expected to approximately equal the relative percentages of the various patients in a general or target population of interest.

Now, in a very convenient case where the only patient characteristic that drives treatment effect variation is, say, gender – we will know the relative percentages in the study population and likely in any targeted population (with negligible error), and post-stratifying (weighting to match a target population’s gender proportions) will likely be straightforward. RHL provide an appendix which actually carries this out for such an example. But in realistic study settings, I have no idea how the needed weights could be obtained. That is, we will usually just have idiosyncratic samples of patients in studies, without much if any knowledge of their makeup, their relative percentages in targeted populations, or what drives the biological variation (given there is variation).
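In this convenient gender-only case the reweighting is a one-line calculation. Here is a minimal sketch, with stratum effects and proportions that are entirely made up for illustration (RHL’s appendix works out a comparable example):

```python
# Post-stratification in the gender-only case: reweight stratum-specific
# treatment effects from the study mix to a target population's mix.
# All numbers below are hypothetical.

def poststratify(effects, proportions):
    """Average the stratum effects using the given stratum proportions."""
    assert abs(sum(proportions.values()) - 1.0) < 1e-9
    return sum(effects[s] * proportions[s] for s in effects)

effects = {"male": 0.10, "female": 0.30}    # stratum-specific treatment effects
study_mix = {"male": 0.7, "female": 0.3}    # mix among recruited patients
target_mix = {"male": 0.5, "female": 0.5}   # mix in the target population

avg_study = poststratify(effects, study_mix)    # what the pooled studies estimate
avg_target = poststratify(effects, target_mix)  # what we actually want
```

With these made-up numbers the study mix gives an average effect of 0.16 while the target mix gives 0.20, and the gap is exactly the issue: for an observable stratifier like gender the correction is trivial, but for idiosyncratic, poorly characterized study populations the target-mix weights are unknowable.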

Now, it is clear that we do have the study population, and so it should be fairly direct to assess the average treatment effect in the studied population. Here between-study scientific variation (i.e. study population variation) can be taken as fixed and ignored. However, I would argue this question is not one of scientific relevance (the kind RHL are primarily interested in) but rather a practical economy-of-research question – is it worth continuing to study this intervention to get a better sense of what the effect would be in a targeted population, and what studies should we do to get that better sense? This perspective goes way back to Fisher, with his careful discernment of what variation should and should not be ignored for questions of scientific relevance.

As an aside, I have thought a fair amount about these issues, as I once presented a similar weighting argument to get an uninteresting average magnitude effect estimate and an interesting average sign effect estimate – the latter just being positive or negative. I presented the argument to a SAMSI meta-analysis working group in 2008. I recall Ken and Julian being there, but I am not really sure, and they likely do not remember my presentation either. Jim Berger criticized the population that was defined by the inverse variance weighting as being non-existent or imaginary. Now, I had thought the assumption of the treatment effect being monotonic would make that criticism moot, but I was not sure at the time. Richard Peto often insisted he was justified in making such an assumption, and I was taking that as a clue. If the effect can only be positive or only negative, then an average effect in any population, real or counterfactual – no matter how uninteresting – would enable the sign to be pinned down for any other population. I later discussed this with David Cox via email, and he argued that monotonicity was a very questionable assumption. Furthermore, if I actually wanted to make such an assumption, why not assume a treatment variation distribution on the positive line? So I abandoned the idea.

Now some specific excerpts from RHL on which I may benefit from some clarification:

RHL > “we discuss in detail what the inverse-variance-weighted average represents, and how it should be interpreted under these different models” and “[the] summary that is produced should be interpretable, relevant to the scientific question at hand and statistically well calibrated … controversial issue that we aim to clarify is whether ˆβ in equation (1) [inverse-variance-weighted average] estimates a parameter of scientific relevance.”

I definitely like these promises but I don’t see them being explicitly or adequately met in the paper.

RHL > “we restrict ourselves to the situation of a collection of studies with similar aims and designs, free of important flaws in their implementation or analysis. (See Section 6 for further discussion.)” and then in Section 6: “when studies do not provide valid analyses, either because of limitations in the design and conduct of the study, or because, after data collection, post hoc changes are made to the analysis, but reported analyses do not take these steps into account… If in practice these procedures cannot be avoided, accounting for the biases that they induce is known to be difficult…”

The challenge here is that it’s not clear what is meant by important flaws, and when such flaws are present RHL seem to be suggesting not much can be done. For instance, what percentage of the meta-analyses in the Cochrane collection would have such flaws – 10% or 90%? Would one or more entries in the risk-of-bias tool help sort this out?

RHL>”[inverse-variance-weighted average] estimates a population parameter, for the population formed by amalgamating the study populations at hand. … in the overall population that amalgamates all k individual study populations, define ηi as the proportion of the population that comes from study population i.”

I am not sure whether RHL mean to define ηi as the proportion of the total studied population that comes from study i, or as the proportion of a general population (or population of interest) that matches study i’s sub-population. The following two claims suggest it is the first.

RHL>”we see that ni/Σni [the proportion of total study population in study i] is consistent for ηi” and “proportions ηi are known with negligible error”

But the population of scientific relevance is the second, so how do we get to that – just assume the composition of the total studied population roughly equals that of a population of interest? That surely needs some justification.

RHL>”It remains to discuss the scientific relevance of β; the use of this specific weighted average is described … general results given in Section 3.3.”

I very much agree that it does remain and RHL claim it will be in Section 3.3.

But Section 3.3 is just an inverse-variance-weighted-average view, or recasting, of general regression into sub-pieces – which I cannot see as addressing scientific relevance. As if general regression were the definition of scientific relevance!?

Rather, I would (and have) argue(d) it’s just an analogue of various ways to factorize the joint likelihood of the studies, with various choices of (in RHL’s terms) identical versus independent parameter assumptions. For example, in the simplest case of regression on a single x through the origin, one specifies an identical slope parameter for all studies but independent within-study variance parameters (that is, each study gets its own variance parameter). The usual regression involving all the studies’ raw data in one likelihood, Normal(y | beta, sd1, sd2, …, sdn, x), is then rewritten as the product Normal(y1 | beta, sd1, x1) * Normal(y2 | beta, sd2, x2) * … * Normal(yn | beta, sdn, xn). As these individual study likelihoods are exactly quadratic in beta (given RHL’s assumptions), they can be replaced with inverse-variance-weighted individual study beta estimates. So what?
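To make the “so what?” concrete: given per-study estimates and standard errors, the fixed effect combination is just a precision-weighted average. A minimal sketch, with made-up study summaries:

```python
import numpy as np

def fixed_effect(betas, ses):
    """Inverse-variance-weighted average of study estimates, and its standard error."""
    betas = np.asarray(betas, dtype=float)
    w = 1.0 / np.asarray(ses, dtype=float) ** 2   # precision weights
    beta_hat = np.sum(w * betas) / np.sum(w)      # weighted average
    se_hat = np.sqrt(1.0 / np.sum(w))             # se of the weighted average
    return beta_hat, se_hat

# Hypothetical per-study slope estimates and standard errors
beta_hat, se_hat = fixed_effect([0.40, 0.20, 0.35], [0.10, 0.10, 0.20])
```

Under RHL’s exact-quadratic-likelihood assumptions this weighted average coincides with the common-slope estimate from pooling all the raw data in one regression, which is the factorization point above; the question of whether the implied “amalgamated population” is of scientific interest is untouched by the algebra.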

RHL>”the fixed effects meta-analysis estimates a well-defined population parameter, of general relevance in realistic settings. Consequently, assessing the appropriateness of fixed effects analyses by checking homogeneity is without foundation— … Both in theory and in practice, the argument is not tenable and should not be made.”

I think this is a very strong claim given what is and isn’t in the paper. Furthermore, checking homogeneity is just checking some of the assumptions of the data generating model. If one is not making a common parameter assumption but rather an independent parameter assumption, does that make model checking impossible? That would be convenient. The most minimal data-generating-model assumption for meta-analysis is that we are not mixing apples and oranges: something is being taken as common across these multiple studies – is that in error? How would one ever know?

RHL > “For example, if the subjects in the studies contributing to the meta-analysis are representative of an overall population of interest, the fixed effects estimate is directly relevant to that overall population … If, however, the sample sizes across studies vary so greatly that the combined population is unrepresentative of any plausible overall population, then the fixed effects parameter will not be as useful.”

Maybe this explains what I am missing – RHL are only suggesting the method be used when you know the mix of idiosyncratic study populations actually is representative of an overall population of interest. That is, if and only if the proportion of the total study population in study i is consistent for ηi. How would one know that? How would one check that?

But RHL also claimed the estimate is of general relevance in realistic settings – so are they assuming that in most realistic settings the proportion of the total study population in study i is consistent for ηi? So it seems, given this statement.

RHL>”Fixed effects meta-analysis can and often should be used in situations where effects differ between studies”

Now for some things I do fully agree with.

RHL> “However, if the random-effects assumption is motivated through exchangeability alone”

Yup – it is surely one of those very untrue models (a really really wrong model, as the Spice Girls would put it) but one which is useful in bringing in some of the real uncertainty – though admittedly seldom the right amount. That is why I once claimed it was not the least wrong model.

RHL> “Measures of heterogeneity should not be used to determine whether fixed effects analysis is appropriate, but users should instead make this decision by deciding whether fixed effects analysis—or some variant of it—answers a question that is relevant to the scientific situation at hand.”

I fully agree – but again it’s the banning of informative priors altogether that forces a discrete decision to either completely ignore or fully incorporate very noisy between-study variation. Nothing in between! And with seemingly little or no advice to be had on when (or how to check whether) one should use the publication-interesting narrow intervals or the publication-uninteresting wide intervals. This is the real problem.


The post I’m not on twitter appeared first on Statistical Modeling, Causal Inference, and Social Science.

So if there’s anything buggin ya, put it in a blog comment.


The post Should we worry about rigged priors? A long discussion. appeared first on Statistical Modeling, Causal Inference, and Social Science.

Buck shared Cook’s post with Jon Baron, who wrote:

My concern is that if researchers are systematically too optimistic (or even self-deluded) about the prior evidence—which I think is usually the case—then using prior distributions as the basis for their new study can lead to too much statistical confidence in the study’s results. And so could compound the problem.

Stuart Buck asked what I would say to this, and I replied:

My response to Jon is that I think all aspects of a model should be justified. Sometimes I speak of there being a “paper trail” of all modeling and data-analysis decisions. My concern here is not so much about p-hacking etc. but rather that people can get wrong answers because they just use conventional modeling choices. For example, in those papers on beauty and sex ratios, the exciting but wrong claims can be traced to the use of a noninformative uniform prior on the effects, even though there’s a huge literature showing that sex ratios vary by very little. Similarly in that ovulation-and-clothing paper: for the data to have been informative, any real effect would have had to be huge, and this just makes no sense. John Carlin and I discuss this in our 2014 paper.

To address Jon’s concern more directly: Suppose a researcher does an experiment and he says that his prior is that the new treatment will be effective, for example his prior dist on the effect size is normal with mean 0.2 and sd 0.1, even before he has any data. Fine, he can say this, but he needs to justify this choice. Just as, when he supplies a data model, it’s not enough for him just to supply a vector of “data,” he also needs to describe his experiment so we know where his data came from. What’s his empirical reasoning for his prior? Implicitly if he gives a prior such as N(0.2, 0.1), he’s saying that in other studies of this sort, real effects are of this size. That’s a big claim to make, and I see no reason why a journal would accept this or why a policymaker would believe it, if no good evidence is given.

Stuart responded to me:

“Implicitly if he gives a prior such as N(0.2, 0.1), he’s saying that in other studies of this sort, real effects are of this size.”

Aha, I think that’s just the rub – what are “real” effects as opposed to the effects found in prior studies? Due to publication bias, researcher biases, etc., effects found in prior studies may be highly inflated, right? So anyone studying a particular social program (say, an educational intervention, a teen pregnancy program, a drug addiction program, etc.) might be able to point to several prior studies finding huge effects. But does that mean the effects are real? I’d say no. Likely the effects are inflated.

So if the prior effects are inflated, how would that affect a Bayesian analysis of a new study on the same type of program?

I replied: Yes, exactly. Any model has to be justified. For example, in that horrible paper purporting to estimate the effects of air pollution in China (see figure 1 here), the authors should have felt a need to justify that high-degree polynomial—actually, the problem is not so much with a high-degree curve but with the unregularized least-squares fit. It’s not enough just to pick a conventional model and start interpreting coefficients. Picking a prior distribution based on biased point estimates from the published literature is not a good justification. One of the advantages of requiring a paper trail is that then you can see the information that people are using to make their modeling decisions.

Stuart followed up:

Take a simpler question (as my colleague primarily funds RCTs) — a randomized trial of a program intended to raise high school graduation rates. 1,000 kids are randomized to get the program, 1,000 are randomized into the control, and we follow up 3 years later to see which group graduated more often.

The simplest frequentist way to analyze that would be a t-test of the means, right? Or just a simple regression — Y (grad rate) = alpha + Beta * [treatment] + error.

If you analyzed the RCT using Bayesian stats instead, would your ultimate conclusion about the success of the program be affected by your choice of prior, and if so, how much? My colleague has the impression that a researcher who is strongly biased in favor of that program would somehow use Bayesian stats in order to “stack the deck” to show the program really works, but I’m not sure that makes sense.

I replied: The short story is that, yes, the Bayesian analysis depends on assumptions, and so does the classical analysis. I think it’s best for the assumps to be clear.

Let’s start with the classical analysis. A t-test is a t-test, and a regression is a regression, no assumptions required, these are just data operations. **The assumptions come in when you try to interpret the results.** For example, you do the t-test and the result is 2.2 standard errors away from 0, and you take that as evidence that the treatment “works.” That conclusion is based on some big assumptions, as John Carlin and I discuss in our paper. In particular, the leap from “statistical significance” to “the treatment works” is only valid when type M and type S errors are low—and any statement about these errors requires assumptions about effect size.

Let’s take an example that I’ve discussed a few times on the blog. Gertler et al. ran a randomized experiment of an early childhood intervention in Jamaica and found that the treatment raised earnings by 42% (the kids in the study were followed up until they were young adults and then their incomes were compared). The result was statistically significant so for simplicity let’s say the 95% conf interval is [2%, 82%]. Based on the classical analysis, what conclusions are taken from this study? (1) The treatment works and has a positive effect. (2) The estimated treatment effect is 42%. Both these conclusions are iffy: (a) Given the prior literature (see, for example, the Charles Murray quote here), it’s hard to believe the true effect is anything near 42%, which suggests that Type M and Type S errors in this study could be huge, implying that statistical significance doesn’t tell us much; (b) The Gertler et al. paper has forking-path issues so it would not be difficult for them to find a statistically significant comparison even in the absence of any consistent true effect; (c) in any case, the 42% is surely an overestimate: Would the authors or anyone else really be willing to bet that a replication would achieve such a large effect?
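Those Type M (magnitude) and Type S (sign) errors are easy to compute by simulation, along the lines of Gelman and Carlin (2014). A sketch using the scale of this example (a standard error of about 20 percentage points, implied by the [2%, 82%] interval) and a hypothesized small true effect of 5%, a number assumed purely for illustration:

```python
import numpy as np

def retrodesign(true_effect, se, n_sims=1_000_000, seed=0):
    """Simulate power, type S (wrong sign) rate, and exaggeration ratio
    (type M) for a two-sided z-test at the 0.05 level."""
    rng = np.random.default_rng(seed)
    est = rng.normal(true_effect, se, n_sims)   # hypothetical replicated estimates
    sig = np.abs(est) > 1.96 * se               # "statistically significant"?
    power = sig.mean()
    type_s = (np.sign(est[sig]) != np.sign(true_effect)).mean()
    exaggeration = np.abs(est[sig]).mean() / abs(true_effect)
    return power, type_s, exaggeration

# se of 0.20 from the interval above; true effect of 0.05 is an assumption
power, type_s, exaggeration = retrodesign(true_effect=0.05, se=0.20)
```

Under that assumed true effect, power is only about 6%, roughly a quarter of the significant results have the wrong sign, and the significant estimates overstate the true effect roughly ninefold, which is why a statistically significant 42% tells us so little.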

So my point is that the classical inferences—the conclusion that the treatment works and the point estimate of the effect—are strongly based on assumptions which, in conventional reporting, are completely hidden. Indeed I doubt that Gertler et al. themselves are aware of the assumptions underlying their conclusions. They correctly recognize that the mathematical operations they apply to their data—the t-test and the regression—are assumption-free (or, I should say, rely on very few assumptions). But they *don’t* recognize that the implications they draw from their statistical significance depend very strongly on assumptions which, in their example, are difficult to justify. If they *were* required to justify their assumptions (to make a paper trail, as I put it), they might see the problem. They might recognize that the strong claims they draw from their study are only justifiable conditional on already believing the treatment has a very large and positive effect.

OK, now on to the Bayesian analysis. You can start with the flat-prior analysis. Under the flat prior, a statistically significant difference gives a probability of greater than 97.5% that the true effect is in the observed direction. For example in that Gertler et al. study you’d be 97.5%+ sure that the treatment effect is positive, and you’d be willing to bet at even odds that the true effect is bigger or smaller than 42%. Indeed, you’d say that the effect is as likely to be 82% as 2%. That of course is ridiculous: a 2% or even a 0% effect is quite plausible, whereas an 82% effect, even if it might exist in this population for some unlikely historical reason, is not plausible in any larger context. But that’s fine, this tells us that we have prior information that’s not included in our model. A more plausible prior might have a mean of 0 and a standard deviation of 10%, or maybe some longer-tailed distribution such as a t with low degrees of freedom with center 0 and scale 10%. I’m not sure what’s best here, but one could make some prior based on the literature. The point is that it would have to be justified.

Now suppose some wise guy wants to stack the deck by, for example, giving the effect size a prior that’s normal with mean 20% and sd 10%. Well, the first thing is that he’d have to justify that prior, and I think it would be hard to justify. If it did get accepted by the journal reviewers, that’s fine, but then anyone who reads the paper would see this right there in the methods section: “We assumed a normal prior with mean 20% and sd 10%.” Such a statement would be vulnerable to criticism. People know about priors. Even a credulous NPR reporter or a Gladwell would recognize that the prior is important here! The other funny thing is, in this case, such a prior is in some ways an improvement upon the flat prior in that the estimate would be decreased from the 42% that comes from the flat prior.
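The normal-normal updates in the last two paragraphs can be checked directly. A minimal sketch, using the 42% estimate with standard error 20% (implied by the [2%, 82%] interval), the skeptical N(0, 10%) prior, and the deck-stacking N(20%, 10%) prior:

```python
import math

def posterior(prior_mean, prior_sd, estimate, se):
    """Posterior for a normal mean under a normal prior:
    a precision-weighted average of prior mean and data estimate."""
    w_prior, w_data = 1 / prior_sd**2, 1 / se**2
    mean = (w_prior * prior_mean + w_data * estimate) / (w_prior + w_data)
    sd = math.sqrt(1 / (w_prior + w_data))
    return mean, sd

# Estimate 42% with se 20%, as in the Gertler et al. example above
skeptical = posterior(0.00, 0.10, 0.42, 0.20)  # about (0.084, 0.089)
stacked = posterior(0.20, 0.10, 0.42, 0.20)    # about (0.244, 0.089)
```

Even the deck-stacking prior pulls the estimate down from 42% to about 24%, which is the sense in which it improves on the flat prior; and, crucially, both priors sit in plain sight in the methods section where they can be criticized.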

So I think my position here is clear. Sure, people can stack the deck. Any stacking should be done openly, and then readers can judge the evidence for themselves. That would be much preferable to the current situation in which inappropriate inferences are made without recognition of the assumptions that justify them.

At this point Jon Baron jumped back in. First, where I wrote above “Even a credulous NPR reporter or a Gladwell would recognize that the prior is important here!”, Baron wrote:

I’m not sure it wouldn’t fly under the radar just like the other assumptions in Gertler’s study that make its findings unreliable—I think the Heckmans and many other wishful thinkers on early childhood programs would say that the assumption about priors is fully justified.

I replied: Sure, maybe they’d say so, but I’d like to see that claim in black and white in the paper: then I could debate it directly! As it is, the authors can implicitly rely on such a claim and then withdraw it later. That’s the problem I have with these point estimates: the point estimate is used as advertising but then if you question it, the authors retreat to saying it’s just proof of an effect.

That happened with that horrible ovulation-and-clothing paper: my colleague and I asked how anyone could possibly believe that women are 3 times as likely to wear red on certain days of the month, and then the authors and their defenders pretty much completely declined to defend that factor of 3. I have this amazing email exchange with a psych prof who was angry at me for dissing that study: I asked him several times whether he thought that women were actually 3 times more likely to wear red on these days, and he just refused to respond on that point.

So, yeah, I think it would be a big step forward for these sorts of quantitative claims to be out in the open.

Second, Baron followed up my statement that “such a prior [normal with mean 20% and sd 10%] is in some ways an improvement upon the flat prior in that the estimate would be decreased from the 42% that comes from the flat prior,” by asking:

What about the not-unrealistic situation where the wishful thinker says the prior effect size is 30% (based on Perry Preschool and Abecedarian etc.) and his new study comes in with an effect size of, say, 25%. Would the Bayesian approach be more likely to find a statistically significant effect than the classical approach in this situation?

My reply: Changing the prior will change the point estimate and also change the uncertainty interval. In your example, if the wishful thinker says 30% and the new study estimate says 25%, then, yes, the wiseguy will feel confirmed. But it’s the role of the research community to point out that an appropriate analysis of Perry, Abecedarian, etc., does not lead to a 30% estimate!


The post BREAKING . . . . . . . PNAS updates its slogan! appeared first on Statistical Modeling, Causal Inference, and Social Science.

Here’s the story. For a while I’ve been getting annoyed by the junk science papers (for example, here, here, and here) that have been published by the Proceedings of the National Academy of Sciences under the editorship of Susan T. Fiske. I’ve taken to calling it PPNAS (“Prestigious proceedings . . .”) because so many news outlets seem to think the journal is so damn prestigious. Indeed, if PNAS just published those articles and nobody listened, it would be fine. I have a blog where I can publish any old things that I want; Susan T. Fiske has a journal where she can publish articles by her friends and other papers that she personally thinks are interesting and important. The problem is that, to many in the outside world, publication in PPNAS is a signal of quality, and organs such as NPR will report PPNAS articles without appropriate skepticism.

One thing that bugged me about PPNAS was this self-description on their website:

So I contacted someone at the National Academy of Sciences, asking if they could do something about that false statement on its webpage: “PNAS publishes only the highest quality scientific research.”

No journal is perfect, it’s no slam on PNAS to say that they publish some low quality papers. But that statement seems weird in that it puts the National Academy of Sciences in the position of defending some extremely bad papers that the journal has happened to mistakenly publish.

And . . . they fixed it. Here’s the new version:

They *strive* to publish only the highest quality scientific research. That’s exactly right! I’m so glad they fixed that. I’m not being ironic here. I really mean it.

PNAS no longer has a demonstrably false statement on their webpage. Progress happens one step at a time, and I welcome this step. Good on ya, PNAS!

**P.S.** Just to be clear, let me emphasize that the message of this post is positive positive positive. PNAS is a journal that publishes lots of excellent papers. It publishes some duds, but that’s unavoidable if you publish 3000 papers a year. Editors are busy, peer reviewers are unpaid and don’t always know what they’re doing, and it can be hard for everyone involved to keep up with the latest scientific developments. And PNAS sometimes publishes papers that are outside its areas of core competence (PNAS publishing on baseball makes about as much sense as Bill James writing on physics), but, even here, I can see the virtue of stepping out on occasion, living on the edge. Little is lost by such experiments and there’s always the potential for unexpected connections. Striving to publish only the highest quality scientific research is an excellent aim, and I’m glad that’s what the National Academy of Sciences is doing.


The post When considering proposals for redefining or abandoning statistical significance, remember that their effects on science will only be indirect! appeared first on Statistical Modeling, Causal Inference, and Social Science.

– Dan Benjamin, Jim Berger, Magnus Johannesson, Valen Johnson, Brian Nosek, and E. J. Wagenmakers

– Felipe De Brigard

– Kenny Easwaran

– Andrew Gelman and Blake McShane

– Kiley Hamlin

– Edouard Machery

– Deborah Mayo

– “Neuroskeptic”

– Michael Strevens

– Kevin Zollman.

Many of the commenters have interesting things to say, and I recommend you read the entire discussion.

The one point that I think many of the discussants are missing, though, is the importance of design and measurement. For example, Benjamin et al. write, “Compared to using the old 0.05 threshold, maintaining the same level of statistical power requires increasing sample sizes by about 70%.” I’m not disputing the math, but I think that sort of statement paints much too optimistic a picture. Existing junk science such as himmicanes and air rage, or ovulation and voting and clothing, or the various fmri and gay-gene studies that appear regularly in the news, will not be saved by increasing sample size by 70% or 700%. Larger sample size might enable researchers to more easily reach those otherwise elusive low p-values but I don’t see this increasing our reproducible scientific knowledge. Along those lines, Kiley Hamlin recommends going straight to full replications, which would have the advantage of giving researchers a predictive target to aim at. I like the idea of replication, rather than p-values, being a goal. On the other hand, again, p-values are noisy, and none of this is worth anything if measurements are no good.
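As I said, I’m not disputing the math, and the 70% figure is easy to check with a standard two-sided z-test power calculation (this is my own quick check, not necessarily the exact computation Benjamin et al. used): required sample size scales with (z_{alpha/2} + z_beta)^2, so the ratio of sample sizes at fixed power doesn’t depend on the effect size.

```python
# Checking the "about 70%" figure: for a two-sided z-test, required n is
# proportional to (z_{alpha/2} + z_{power})^2, so the ratio for
# alpha = 0.005 vs. 0.05 at fixed power is effect-size-free.
from statistics import NormalDist

def n_scale(alpha, power):
    """Sample size up to a constant factor: (z_{alpha/2} + z_{power})^2."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2

ratio = n_scale(0.005, 0.80) / n_scale(0.05, 0.80)
print(f"sample-size ratio at 80% power: {ratio:.2f}")  # roughly 1.70
```

The arithmetic checks out; my complaint is with what the statement implies, not with the number.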

So one thing I wish more of the discussants had talked about is that, when applied to junk science—and all of this discussion is in large part the result of the cancerous growth of junk science within the scientific enterprise—the effect of new rules on p-values etc. will be *indirect*. Requiring p less than 0.005, or requiring Bayes factors, abandoning statistical significance entirely, or anything in between: none of these policies will turn work such as power pose or beauty-and-sex-ratio or the work of the Cornell University Food and Brand Lab into reproducible science. All it will do is possibly (a) make such work harder to publish as is, and (b) as a consequence of that first point, motivate researchers to better science, to design more targeted studies with better measurements so as to be able to succeed in the future.

It’s good goal to aim for (a) and (b), so I’m glad of all this discussion. But I think it’s important to emphasize that all the statistical analysis and statistical rule-giving in the world can’t transform bad data into good science. So I’m a bit concerned about messages implying that with a mere increase of sample size by a factor of 1.7 or 2, that reproducibility problems will be solved. At some point, good science requires good design and measurement.

There’s an analogy to approaches to education reform that push toward high standards, toward not letting students graduate unless their test scores reach some high threshold. Ending “social promotion” from grade to grade in school might be a good idea in itself, and in the right environment it might motivate students to try harder at learning and schools to try harder at teaching—but, by themselves, standards are just an indirect tool. At some point the learning has to happen. This analogy is not perfect—for one thing, a p-value is not a measure of effect size, and null hypothesis significance testing addresses an uninteresting model of zero effect and zero systematic error, kind of like if an educational test did not even attempt to measure mastery, instead merely trying to demonstrate that the amount learned was not exactly zero—but my point in the present post is to emphasize the essentially indirect nature of any procedural solutions to research problems.

Again we can consider that hypothetical study attempting to measure the speed of light by using a kitchen scale to weigh an object before and after it is burned: it doesn’t matter what p-value is required, this experiment will never allow us to measure the speed of light. The best we can do with rules is to make it more difficult and awkward to claim that such a study can give definitive results, and thus dis-incentivize people from trying to perform, publish, and promote such work. Substitute ESP or power pose or fat arms and voting or himmicanes etc. in the above sentences and you’ll get the picture.

As Blake and I wrote in the conclusion of our contribution to the above-linked discussion:

Looking forward, we think more work is needed in designing experiments and taking measurements that are more precise and more closely tied to theory/constructs, doing within-person comparisons as much as possible, and using models that harness prior information, that feature varying treatment effects, and that are multilevel or meta-analytic in nature, and—of course—tying this to realism in experimental conditions.

See here and here for more on this topic which we are blogging to death. I appreciate the comments we’ve had here from people who disagree with me on these issues: a blog comment thread is a great place to have a discussion back and forth involving multiple viewpoints.


The post Alan Sokal’s comments on “Abandon Statistical Significance” appeared first on Statistical Modeling, Causal Inference, and Social Science.

I just came across your paper “Abandon statistical significance”. I basically agree with your point of view, but I think you could have done more to *distinguish* clearly between several different issues:

1) In most problems in the biomedical and social sciences, the possible hypotheses are parametrized by a *continuous* variable (or vector of variables), or at least one that can be reasonably approximated as continuous. So it is conceptually wrong to discretize or dichotomize the possible hypotheses. [The same goes for the data: usually it is continuous, or at least discrete with a large number of possible values, and it is silly to artificially dichotomize or trichotomize it.]

Now, in such a situation, the sharp point null hypothesis is almost certainly false: as you say, two treatments are *always* different, even if the difference is tiny.

So here the solution should be to report, not the p-value for the sharp point null hypothesis, but the *complete likelihood function* — or, if it can be reasonably approximated by a Gaussian, then the mean and standard deviation (or mean vector and covariance matrix).

2) The difference between the two treatments — especially if it is small — might be due, *not* to an actual difference between the two treatments, but to a systematic error in the experiment (e.g. a small failure of double-blinding, or a correlation between measurement errors and the treatment).

This is not a statistical issue, but rather an experimental and interpretive one: every experimenter must strive to reduce systematic errors to the smallest level possible AND to estimate honestly whatever systematic errors might remain; and an observed effect, even if it is statistically established beyond a reasonable doubt, can be considered “real” only if it is much larger than any plausible systematic error.

3) The likelihood function does not contain the whole story (from a Bayesian point of view), because the prior matters too. After all, even people who are not die-hard Bayesians can understand that “extraordinary claims require extraordinary evidence”. So one must try to understand — at least at the level of orders of magnitude — the prior likelihood of various alternative hypotheses. If only 1 out of 1000 drugs (or social interventions) have an effect anywhere near as large as the likelihood function seems to indicate, then probably the result is a false positive.
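Alan’s point 3 is, at the order-of-magnitude level, a one-line Bayes calculation. Here is a sketch with made-up numbers:

```python
# Sokal's point 3 as arithmetic (illustrative numbers, not from any study):
# posterior odds = prior odds * likelihood ratio. Even data that favor a
# large effect 20-to-1 cannot overcome a 1-in-1000 prior.
prior_odds = 1 / 999          # roughly 1 out of 1000 drugs has a large effect
likelihood_ratio = 20         # data favor "large effect" 20-to-1
post_odds = prior_odds * likelihood_ratio
post_prob = post_odds / (1 + post_odds)
print(f"posterior probability of a real large effect: {post_prob:.3f}")  # about 0.02
```

So with such a prior, a seemingly large estimated effect is still probably a false positive, which is his point.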

4) When practical decisions are involved (e.g. whether or not to approve a drug, whether or not to start or terminate a social program), the loss function matters too. There may be a huge difference in the losses from failing to approve a useful drug and approving a useless or harmful one — and I could imagine that in some cases those huge differences might go one way, and in other cases the other way. So the decision-makers have to analyze explicitly the loss function, and take it into account in the final decision. (But they should also always keep this analysis — which is basically economic — *separate* from the analysis of issues #1, 2, 3, which are basically “scientific”.)

My reply:

I agree with you on most of these points; see for example here.

Regarding your statement about the likelihood function: that’s fine but more generally I like to say that researchers should display all comparisons of interest and not select based on statistical significance. The likelihood function is a summary based on some particular model but in a lot of applied statistics there is no clear model, hence I give the more general recommendation to display all comparisons.

Regarding your point 2: yes on the relevance of systematic error, which is why we refer on page 1 of our paper to the “sharp point null hypothesis of zero effect and zero systematic error”! Along similar lines, see the last paragraph of this post.

Regarding your point 3, I prefer to avoid the term “false positive” in most statistical contexts because of the association of the typically nonsensical model of zero effect and zero systematic error; see here.

Regarding your point 4, yes, as we say in our paper, “For regulatory, policy, and business decisions, cost-benefit calculations seem clearly superior to acontextual statistical thresholds.”

Alan responded:

I think we are basically in agreement. My suggestion was simply to *distinguish* more clearly these 4 different issues (possibly by making them explicit and numbering them), because they are really *very* different in nature.


The post 2 quick calls appeared first on Statistical Modeling, Causal Inference, and Social Science.

Study 1:

Using footage from body-worn cameras, we analyze the respectfulness of police officer language toward white and black community members during routine traffic stops. We develop computational linguistic methods that extract levels of respect automatically from transcripts, informed by a thin-slicing study of participant ratings of officer utterances. We find that officers speak with consistently less respect toward black versus white community members, even after controlling for the race of the officer, the severity of the infraction, the location of the stop, and the outcome of the stop. Such disparities in common, everyday interactions between police and the communities they serve have important implications for procedural justice and the building of police-community trust.

Study 2:

Exposure to parental separation or divorce during childhood has been associated with an increased risk for physical morbidity during adulthood. Here we tested the hypothesis that this association is primarily attributable to separated parents who do not communicate with each other. We also examined whether early exposure to separated parents in conflict is associated with greater viral-induced inflammatory response in adulthood and in turn with increased susceptibility to viral-induced upper respiratory disease. After assessment of their parents’ relationship during their childhood, 201 healthy volunteers, age 18-55 y, were quarantined, experimentally exposed to a virus that causes a common cold, and monitored for 5 d for the development of a respiratory illness. Monitoring included daily assessments of viral-specific infection, objective markers of illness, and local production of proinflammatory cytokines. Adults whose parents lived apart and never spoke during their childhood were more than three times as likely to develop a cold when exposed to the upper respiratory virus than adults from intact families. Conversely, individuals whose parents were separated but communicated with each other showed no increase in risk compared with those from intact families. These differences persisted in analyses adjusted for potentially confounding variables (demographics, current socioeconomic status, body mass index, season, baseline immunity to the challenge virus, affectivity, and childhood socioeconomic status). Mediation analyses were consistent with the hypothesis that greater susceptibility to respiratory infectious illness among the offspring of noncommunicating parents was attributable to a greater local proinflammatory response to infection.

My reply:

1. I’d run this by a computational linguist who doesn’t have a stake in this example. I’m skeptical in any case because this kind of “respect” thing is contextual. I mean, sure, I believe that building of trust is important; I just don’t know if much is gained by the “extract levels of respect automatically” thing.

2. I’ll believe this one after it appears in an independent preregistered replication, not before.


The post “Do statistical methods have an expiration date?” My talk at the University of Texas this Friday 2pm appeared first on Statistical Modeling, Causal Inference, and Social Science.

Do statistical methods have an expiration date?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

There is a statistical crisis in science, particularly in psychology where many celebrated findings have failed to replicate, and where careful analysis has revealed that many celebrated research projects were dead on arrival in the sense of never having sufficiently accurate data to answer the questions they were attempting to resolve. The statistical methods which revolutionized science in the 1930s-1950s no longer seem to work in the 21st century. How can this be? It turns out that when effects are small and highly variable, the classical approach of black-box inference from randomized experiments or observational studies no longer works as advertised. We discuss the conceptual barriers that have allowed researchers to avoid confronting these issues, which arise not just in psychology but also in policy research, public health, and other fields. To do better, we recommend three steps: (a) designing studies based on a perspective of realism rather than gambling or hope, (b) higher quality data collection, and (c) data analysis that combines multiple sources of information.

Some of material in the talk appears in our recent papers, “The failure of null hypothesis significance testing when studying incremental changes, and what to do about it” and “Some natural solutions to the p-value communication problem—and why they won’t work.”

The talk will be in the psychology department but should be of interest to statisticians and quantitative researchers more generally. I was invited to come by the psychology Ph.D. students—that’s so cool!


The post Response to some comments on “Abandon Statistical Significance” appeared first on Statistical Modeling, Causal Inference, and Social Science.

In science publishing and many areas of research, the status quo is a lexicographic decision rule in which any result is first required to have a p-value that surpasses the 0.05 threshold and only then is consideration—often scant—given to such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain. There have been recent proposals to change the p-value threshold, but instead we recommend abandoning the null hypothesis significance testing paradigm entirely, leaving p-values as just one of many pieces of information with no privileged role in scientific publication and decision making. We argue that this radical approach is both practical and sensible.

Since then we’ve received some feedback that we’d like to share and address.

**1.** Sander Greenland commented that maybe we shouldn’t label as “radical” our approach of removing statistical significance from its gatekeeper role, given that prominent statisticians and applied researchers have recommended this approach (abandoning statistical significance as a decision rule) for a long time.

Here are two quotes from David Cox et al. from a 1977 paper, “The role of significance tests”:

Here’s Cox from 1982 implicitly endorsing the idea of type S errors:

And here he is, explaining (a) the selection bias involved in any system in which statistical significance is a decision rule, and (b) the importance of measurement, a crucial issue in statistics that is obscured by statistical significance:

Hey! He even pointed out that the difference between “significant” and “non-significant” is not itself statistically significant:

In this paper, Cox also brings up the crucial point that the “null hypothesis” is not just the assumption of zero effect (which is typically uninteresting) but also the assumption of zero systematic error (which is typically ridiculous).

And he says what we say, that the p-value tells us very little on its own:

There are also more recent papers that say what McShane et al. and I say; for example, Valentin Amrhein, Fränzi Korner-Nievergelt, and Tobias Roth wrote:

The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process. We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values can tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. . . . Data dredging, p-hacking and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. . . . We further discuss potential arguments against removing significance thresholds, such as ‘we need more stringent decision rules’, ‘sample sizes will decrease’ or ‘we need to get rid of p-values’. We conclude that, whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.

Damn! I liked that paper when it came out, but now that I see it again, I realize how similar our points are to theirs.

Also this recent letter by Valentin Amrhein and Sander Greenland, “Remove, rather than redefine, statistical significance” which, again, has a very similar perspective to ours.

**2.** In the park today I ran into a friend who said that he’d read our recent article. He expressed the opinion that our plan might be good in some ideal sense but it can’t work in the real world because it requires more time-consuming and complex analyses than researchers are willing or able to do. If we get rid of p-values, what would we replace them with?

I replied: No, our plan is eminently realistic! First off, we don’t recommend getting rid of p-values; we recommend treating them as one piece of evidence. Yes, it can be useful to see that a given data pattern could or could not plausibly have arisen purely by chance. But, no, we don’t think that publication of a result, or further research in an area, should require a low p-value. Depending on the context, it can be completely reasonable to report and follow up on a result that is interesting and important, even if the data are weak enough that the pattern could’ve been obtained by chance: that just tells us we need better data. Report the p-value and the confidence interval and other summaries; don’t use them to decide what to report. And definitely don’t use them to partition results into “significant” and “non-significant” groups.

I also remarked that it’s not like the current system is so automatic. Statistical significance is, in most cases, a requirement for publication, but journals still have to decide what to do with the zillions of “p less than 0.05” papers that get sent to them every month. So we’re just saying, as a start, that journals can use whatever rules they’re currently using to decide which of these papers to publish.

Then I launched into another argument . . . but at this point my friend gave me a funny look and started to back away. I think he’d just mentioned my article and his reaction as a way to say hi, and he wasn’t really asking for a harangue in the middle of the park on a nice day.

But I’m pretty sure that most of you reading this blog are sitting in your parents’ basement eating Cheetos, with one finger on the TV remote and the other on the Twitter “like” button. So I can feel free to rant away.

**3.** There’s a paper, “Redefine statistical significance,” by Daniel Benjamin et al., who recognize that the p=0.05 threshold has lots of problems (I don’t think they mention air rage, himmicanes, ages ending in 9, fat arms and political attitudes, ovulation and clothing, ovulation and voting, power pose, embodied cognition, and the collected works of Satoshi Kanazawa and Brian Wansink, but they could have) and promote a revised p-value threshold of 0.005. As we wrote in our article (which was in part a response to Benjamin et al.):

We believe this proposal is insufficient to overcome current difficulties with replication . . . In the short term, a more stringent threshold could reduce the flow of low quality work that is currently polluting even top journals. In the medium term, it could motivate researchers to perform higher-quality work that is more likely to crack the 0.005 barrier. On the other hand, a steeper cutoff could lead to even more overconfidence in results that do get published as well as greater exaggeration of the effect sizes associated with such results. It could also lead to the discounting of important findings that happen not to reach it. In sum, we have no idea whether implementation of the proposed 0.005 threshold would improve or degrade the state of science as we can envision both positive and negative outcomes resulting from it. Ultimately, while this question may be interesting if difficult to answer, we view it as outside our purview because we believe that p-value thresholds (as well as those based on other statistical measures) are a bad idea in general.

**4.** And then yet another article, this one by Lakens et al., “Justify your alpha.” Their view is closer to ours in that they do not want to use any fixed p-value threshold, but they still seem to recommend that statistical significance be used for decision rules: “researchers justify their choice for an alpha level before collecting the data, instead of adopting a new uniform standard.” We agree with most of what Lakens et al. write, especially things like, “Single studies, regardless of their p-value, are never enough to conclude that there is strong evidence for a theory” and their call to researchers to provide “justifications of key choices in research design and statistical practice.”

We just don’t see any good reason to make design, analysis, publication, and decision choices based on “alpha” or significance levels. As we write:

Various features of contemporary biomedical and social sciences—small and variable effects, noisy measurements, a publication process that screens for statistical significance, and research practices—make null hypothesis significance testing and in particular the sharp point null hypothesis of zero effect and zero systematic error particularly poorly suited for these domains. . . .

Proposals such as changing the default p-value threshold for statistical significance, employing confidence intervals with a focus on whether or not they contain zero, or employing Bayes factors along with conventional classifications for evaluating the strength of evidence suffer from the same or similar issues as the current use of p-values with the 0.05 threshold. In particular, each implicitly or explicitly categorizes evidence based on thresholds relative to the generally uninteresting and implausible null hypothesis of zero effect and zero systematic error.

**5.** E. J. Wagenmakers, one of the authors of the Benjamin et al. paper that motivated a lot of this recent discussion, wrote a post on his new blog (E. J. has a blog now! Cool. Will he start posting on chess?), along with Quentin Gronau, responding to our recent article.

E. J. and Quentin begin their post with five places where they agree with us. Then, in true blog fashion, they spend most of the post elaborating on three places where they disagree with us. Fair enough.

I’ll go through them one at a time:

**E. J. and Quentin’s disagreement 1.** E. J. says that our general advice (studying and reporting the totality of their data and relevant results) “is eminently sensible, but it is not sufficiently explicit to replace anything. Rightly or wrongly, the p-value offers a concrete and unambiguous guideline for making key claims; the Abandoners [that’s us!] wish to replace it with something that can be summarized as ‘transparency and common sense.'”

I disagree!

First, the p-value does *not* offer “a concrete and unambiguous guideline for making key claims.” Thousands of experiments are performed every month (maybe every day!) with “p less than 0.05” results, but only a very small fraction of these make their way into JPSP, Psych Science, PPNAS, etc. P-value thresholds supply an illusion of rigor, and maybe in some settings that’s a good idea, by analogy to “the consent of the governed” in politics, but there’s nothing concrete or unambiguous about their use.

Second, yes I too support “transparency and common sense,” but that’s *not* all we’re recommending. Not at all! Recall my recent paper, Transparency and honesty are not enough. All the transparency and common sense in the world—even with preregistered replication—won’t get you very far in the absence of accurate and relevant measurement. Hence the last paragraph of this post.

**E. J. and Quentin’s disagreement 2.** I’ll let my coauthor Christian Robert respond to this one. And he did!

**E. J. and Quentin’s disagreement 3.** They write, “One of the Abandoners’ favorite arguments is that the point-null hypothesis is usually neither true nor interesting. So why test it? This echoes the opinion of researchers like Meehl and Cohen. We believe, however, that Meehl and Cohen were overstating their case.”

E. J. and Quentin begin with an example of a hypothetical researcher comparing the efficacies of unblended or blended whisky as a treatment of snake bites. I agree that in this case the point null hypothesis is worth studying. This sort of example has come up in some recent comment threads so I’ll repeat what I said there:

I don’t think that point hypotheses are *never* true; I just don’t find them interesting or appropriate in the problems in social and environmental science that I work on and which we spend a lot of time discussing on this blog.

There are some problems where discrete models make sense. One commenter gave the example of a physical law; other examples are spell checking (where, at least most of the time, a person was intending to write some particular word) and genetics (to some reasonable approximation). In such problems I recommend fitting a Bayesian model for the different possibilities. I still don’t recommend hypothesis testing as a decision rule, in part because in the examples I’ve seen, the null hypothesis also bundles in a bunch of other assumptions about measurement error etc. which are not so sharply defined.

I’m happy to (roughly) discretely divide the world into discrete and continuous problems, and to use discrete methods when studying the effects of snakebites, and ESP, and spell checking, and certain problems in genetics, and various other problems of this sort; and to use continuous methods when studying the effects of educational interventions, and patterns of voting and opinion, and the effects of air pollution on health, and sex ratios and hurricanes and behavior on airplanes and posture and differences between gay and straight people and all sorts of other topics that come up all the time. And I’m also happy to use mixture models with some discrete components; for example, in some settings in drug development I expect it makes sense to allow for the possibility that a particular compound has approximately no effect (I’ve heard this line of research is popular at UC Irvine right now). I don’t want to take a hard line, nothing-is-ever-approximately-zero position. But I do think that comparisons to a null model of absolutely zero effect and zero systematic error are rarely relevant.
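For what a mixture model with a discrete component can look like, here is a minimal spike-and-slab sketch (the particular numbers are mine, purely for illustration): the effect is exactly zero with some prior probability, or drawn from a normal “slab” otherwise, and with a normal likelihood for the estimate the posterior probability of the null is available in closed form.

```python
# Minimal spike-and-slab sketch (illustrative numbers only): effect is
# exactly 0 with prior probability pi, else drawn from Normal(0, slab_sd).
# With a Normal(effect, se) likelihood for the estimate, both marginal
# likelihoods are normal densities, so the posterior null probability is
# closed-form.
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def prob_null(estimate, se, pi=0.5, slab_sd=0.5):
    """Posterior probability that the true effect is exactly zero."""
    m_null = normal_pdf(estimate, 0.0, se)                            # spike
    m_alt = normal_pdf(estimate, 0.0, math.sqrt(se**2 + slab_sd**2))  # slab
    return pi * m_null / (pi * m_null + (1 - pi) * m_alt)

# Weak data: the null remains quite plausible.
print(f"{prob_null(estimate=0.10, se=0.20):.2f}")
```

This is the kind of structure I’d consider in a drug-development setting where “approximately no effect” is a live possibility; for the social-science problems discussed above, I don’t think the spike belongs in the model at all.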

E. J. and Quentin also point out that if an effect is very small compared to measurement/estimation error, then it doesn’t matter, from the standpoint of null hypothesis significance testing, whether the effect is exactly zero. True. But we don’t particularly care about null hypothesis significance testing! For example, consider “embodied cognition.” Embodied cognition is a joke, and it’s been featured in lots of junk science, but I don’t think that masked messages have zero or even necessarily tiny effects. I think that any effects will vary a lot by person and by context. And, more to the point, if someone wants to do research in this topic, I don’t think that a null hypothesis significance test should be a screener for what results are considered worth looking at, and I think that it’s a mistake to use a noisy data summary to select a limited subset of results to report.

**Summary**

We’re in agreement with just about all the people in this discussion on the following key point: We’re unhappy with the current system in which “p less than 0.05” is used as the first step in a lexicographic decision rule in deciding which results in a study should be presented, which studies should be published, and which lines of research should be pursued.

Beyond this, here are the different takes:

Benjamin et al. recommend replacing 0.05 by 0.005, not because they think a significance-testing-based lexicographic decision rule is a good idea, but, as I understand them, because they think that 0.005 is a stringent enough cutoff that it will essentially break the current system. Assuming there is a move to reduce uncorrected researcher degrees of freedom and forking paths, it will become very difficult for researchers to reach the 0.005 threshold with noisy, useless studies. Thus, the new threshold, if applied well, will suddenly cause the stream of easy papers to dry up. Bad news for Ted, NPR, and Susan Fiske, but good news for science, as lots of journals will either have to get a lot thinner or will need to find some interesting papers outside the usual patterns. In the longer term, the stringent threshold (if tied to control of forking paths) could motivate researchers to do higher-quality studies with more serious measurement tied more carefully to theory.

Lakens et al. recommend using p-value thresholds but with different thresholds for different problems. This has the plus of moving away from automatic rules but has the minus of asking people to “justify their alpha.” I’d rather have scientists justifying their substantive conditions by delineating reasonable ranges of effect sizes (see, for example, section 2.1 of this paper) rather than having them justify a scientifically meaningless threshold, and I’d prefer that statisticians and methodologists evaluate frequency properties of type M and type S errors rather than p-values. But, again, we agree with Lakens et al., and with Benjamin et al., on the key point that what we need is better measurement and better science.
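For readers unfamiliar with type M and type S errors, here’s a quick simulation sketch of what they measure. The effect size and standard error below are hypothetical numbers chosen to represent a noisy, underpowered study; they are not taken from any of the papers under discussion.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 2.0    # hypothetical small true effect
se = 8.0             # hypothetical large standard error (a noisy study)
n_sims = 100_000

# Simulate many replications of the study's point estimate:
est = rng.normal(true_effect, se, n_sims)
significant = np.abs(est / se) > 1.96          # the usual p < 0.05 screen

# Conditional on clearing the significance filter:
type_s = np.mean(est[significant] < 0)                    # wrong sign
type_m = np.mean(np.abs(est[significant])) / true_effect  # exaggeration

print(f"power:              {significant.mean():.2f}")
print(f"type S error rate:  {type_s:.2f}")
print(f"exaggeration ratio: {type_m:.1f}x")
```

With a true effect only a quarter of a standard error, the estimates that happen to reach significance are exaggerated several-fold and have an appreciable chance of pointing in the wrong direction, which is the sense in which a significance filter distorts what gets published.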

Finally, our perspective, shared with Amrhein, Korner-Nievergelt, and Roth, as well as Amrhein and Greenland, is that it’s better to just remove null hypothesis significance testing from its gatekeeper role. That is, instead of trying to tinker with the current system (Lakens et al.) or to change the threshold so much that the system will break (Benjamin et al.), let’s just discretize less and display more.

We have some disagreements regarding the relevance of significance tests and null hypotheses but we’re all roughly on the same page as Cox, Meehl, and other predecessors.

The post Response to some comments on “Abandon Statistical Significance” appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post “5 minutes? Really?” appeared first on Statistical Modeling, Causal Inference, and Social Science.

Daniel says this issue

https://github.com/stan-dev/stan/issues/795#issuecomment-26390557117

is an easy 5-minute fix.

In my ongoing role as wet blanket, let’s be realistic. It’s sort of like saying it’s an hour from here to Detroit because that’s how long the plane’s in the air.

Nothing is a 5 minute fix (door to door) for Stan and I really don’t want to give people the impression that it should be. It then just makes them feel bad when it takes longer than 5 minutes, because they feel like they’ve wasted the time this will really take. Or it makes people angry who suggest other “5 minute fixes” that we don’t get around to doing because they’re really more involved.

This can’t be five minutes (certainly not net to the project) when you need to create a branch, fix the issue, run the tests, run cpplint, commit, push, create a pull request, nag someone else to review it (then they have to fill out the code-review form), then you might have to make fixes (and perhaps get another sign off from the reviewer), then Jenkins and Travis may need to be kicked, (then someone has to decide to merge), then we get to do it again in the upstream with changes to the interfaces.

Easy once you’re used to the process, but not 5 minutes!

The mythical man-minute.


The post “From ‘What If?’ To ‘What Next?’ : Causal Inference and Machine Learning for Intelligent Decision Making” appeared first on Statistical Modeling, Causal Inference, and Social Science.

NIPS 2017 Workshop on Causal Inference and Machine Learning (WhatIF2017)

“From ‘What If?’ To ‘What Next?’ : Causal Inference and Machine Learning for Intelligent Decision Making” — December 8th 2017, Long Beach, USA.

Submission deadline for abstracts and papers: October 31, 2017

Acceptance decisions: November 7, 2017

In recent years machine learning and causal inference have both seen important advances, especially through a dramatic expansion of their theoretical and practical domains. This workshop is aimed at facilitating more interactions between researchers in machine learning, causal inference, and application domains that use both for intelligent decision making. To this effect, the 2017 ‘What If?’ To ‘What Next?’ workshop welcomes contributions from a variety of perspectives from machine learning, statistics, economics and social sciences, among others. This includes, but is not limited to, the following topics:

– Combining experimental control and observational data

– Bandit algorithms and reinforcement learning with explicit links to causal inference and counterfactual reasoning

– Interfaces of agent-based systems and causal inference

– Handling selection bias

– Large-scale algorithms

– Applications in online systems (e.g. search, recommendation, ad placement)

– Applications in complex systems (e.g. cell biology, smart cities, computational social sciences)

– Interactive experimental control vs. counterfactual estimation from logged experiments

– Discriminative learning vs. generative modeling in counterfactual settings

We invite contributions both in the form of extended abstracts and full papers. At the discretion of the organizers, some contributions will be assigned slots as short contributed talks and others will be presented as posters.

Submission length: 2-page extended abstracts or up to 8-page full papers. At least one author of each accepted paper must be available to present the paper at the workshop.

I’m pretty sure that, in these settings, there’s not much reason to be interested in the model of zero causal effects and zero systematic error, so I hope people at this conference don’t waste any time on null hypothesis significance testing except when they are talking about how to do better.


The post The “fish MRI” of international relations studies. appeared first on Statistical Modeling, Causal Inference, and Social Science.

This article uses a replication experiment of ninety-four specifications from sixteen different studies to show the severity of the problem of selection on unobservables. Using a variety of approaches, it shows that membership in the General Agreement on Tariffs and Trade/World Trade Organization has a significant effect on a surprisingly high number of dependent variables (34 per cent) that have little or no theoretical relationship to the WTO. To make the exercise even more conservative, the study demonstrates that membership in a low-impact environmental treaty, the Convention on Trade in Endangered Species, yields similarly high false positive rates. The authors advocate theoretically informed sensitivity analysis, showing how prior theoretical knowledge conditions the crucial choice of covariates for sensitivity tests. While the current study focuses on international institutions, the arguments also apply to other subfields and applications.

My reply: I’m not a fan of the “false positive” framework, but the general attitude expressed in the paper makes sense to me and I’m guessing this paper will be a very useful contribution to the literature in its field. It’s the “fish MRI” of international relations studies.


The post Apply for the Earth Institute Postdoc at Columbia and work with us! appeared first on Statistical Modeling, Causal Inference, and Social Science.


The post For mortality rate junkies appeared first on Statistical Modeling, Causal Inference, and Social Science.

For mortality rate junkies, here’s another example [by Steven Martin and Laudan Aron] of bundled stats lending themselves to misinterpretation, in this case not correcting for the black cohort having a slightly younger average age plus a higher percentage of women.

As Martin and Aron point out:

Why were these differences by race not apparent in the CDC figure? Because for adults ages 65 and older, blacks and whites looked very different in 2015. The average age of blacks was 73.7 compared with 74.6 for whites. Also, a higher share of blacks ages 65 and older were women: 61 percent compared with 57 percent for whites. Because younger people have lower death rates than older people and women have lower death rates than men, comparing a younger and more female population of blacks with an older and more male population of whites offsets the underlying race differences in death rates.
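The composition effect Martin and Aron describe is easy to reproduce with made-up numbers. Everything below is invented for illustration, not the actual CDC figures: one group gets uniformly higher stratum-specific death rates but a younger, more female composition, and its crude rate still comes out lower.

```python
# Hypothetical death rates per 1000 by (age band, sex) stratum.
rates_white = {("65-74", "F"): 10, ("65-74", "M"): 15,
               ("75+", "F"): 30, ("75+", "M"): 45}
# Suppose the black cohort's rate is 20% higher in every single stratum...
rates_black = {k: 1.2 * v for k, v in rates_white.items()}

# ...but that cohort is younger and more female (shares sum to 1):
share_white = {("65-74", "F"): 0.25, ("65-74", "M"): 0.20,
               ("75+", "F"): 0.30, ("75+", "M"): 0.25}
share_black = {("65-74", "F"): 0.40, ("65-74", "M"): 0.25,
               ("75+", "F"): 0.25, ("75+", "M"): 0.10}

def weighted_rate(rates, shares):
    """Overall rate = stratum rates weighted by population shares."""
    return sum(rates[k] * shares[k] for k in rates)

crude_w = weighted_rate(rates_white, share_white)
crude_b = weighted_rate(rates_black, share_black)
# Direct standardization: apply both groups' rates to a common composition.
std_w = weighted_rate(rates_white, share_white)
std_b = weighted_rate(rates_black, share_white)

print(f"crude rates:      white {crude_w:.1f}, black {crude_b:.1f}")
print(f"age/sex-adjusted: white {std_w:.1f}, black {std_b:.1f}")
```

The crude comparison reverses the true stratum-level ordering; standardizing both groups to a common age/sex composition recovers it, which is exactly the correction being pointed to here.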


The post Contribute to this pubpeer discussion! appeared first on Statistical Modeling, Causal Inference, and Social Science.

I’d love to get feedback from you and/or the commenters on a behavioral economics / social neuroscience study from my university (Zürich). This would fit perfectly with yesterday’s “how to evaluate a paper” post. In fact, let’s have a little journal club, one with a twist!

The twist is that I’ve already posted a critique of the study on PubPeer and just now got a response by the authors. I’m preparing a response to the response, and here’s where I could use y’all’s input.

I don’t have the energy to read the paper or the discussions but I wanted to post it here as an example of this sort of post-publication review.

Gamma continues:

Here’s the abstract to the paper:

Goal-directed human behaviors are driven by motives. Motives are, however, purely mental constructs that are not directly observable. Here, we show that the brain’s functional network architecture captures information that predicts different motives behind the same altruistic act with high accuracy. In contrast, mere activity in these regions contains no information about motives. Empathy-based altruism is primarily characterized by a positive connectivity from the anterior cingulate cortex (ACC) to the anterior insula (AI), whereas reciprocity-based altruism additionally invokes strong positive connectivity from the AI to the ACC and even stronger positive connectivity from the AI to the ventral striatum. Moreover, predominantly selfish individuals show distinct functional architectures compared to altruists, and they only increase altruistic behavior in response to empathy inductions, but not reciprocity inductions.

The exchange on PubPeer so far is here, the paper is here (from first author’s ResearchGate page), the supplementary material here. (Email me at alex.gamma@uzh.ch if you have trouble with any of these links.)

The basic set-up of the study is:

- fMRI
- N=34 female subjects
- 3 conditions:
  - baseline (N=34)
  - induced motive “empathy” (N=17; between-subject)
  - induced motive “reciprocity” (N=17; between-subject)
- ML prediction / classification of the two motives using SVM
- accuracy ~ 70%, stat. sign.

In my comment, the main criticism was that their presentation of the results was misleading by suggesting that the motives in question had been “read off” the brain directly without already knowing them by other means. I’ve since realized that it is mainly the title that suggests so and thereby creates a context within which one interprets the rest of the paper. Without the title, the paper would be more or less OK in this regard. In any case, to say in the title that brain data “reveals” human motives suggests (clearly, to me) that these motives were not previously known. That they were “hidden” and then uncovered by examining the brain. But obviously, the prediction algorithm had to be trained on prior knowledge of the motives, so that’s not at all what happens. This is one thing I intend to argue in my response.

But there’s more.

In the comment, I’ve also raised issues about the prediction/machine learning aspects and I want to bring up more in my response to their response. These issues concern the purpose of prediction, the relationship between prediction and causal inference, generalizability, overfitting and the scope for forking paths. So lots of interesting stuff! And since I’m not an expert (not a statistician, but with not-too-technical exposure to ML), I’d love to get input from the knowledgeable crowd here on the blog.

Before I separate the issues into chunks, I’ll outline what I gathered they did with their data. As far as neuroimaging studies go, they used quite sophisticated modeling. Below, the dotted lines (—) are loosely used to indicate “is input to” or “leads to” or “produces as output”, or simply “is followed by”.

1. fMRI (“brain activity”) — GLM (empathy vs reciprocity vs baseline) — diff between two motives n.s., but diff betw. motives and baseline stat. sign. in a “network” of 3 brain areas — use of DCM (“Dynamic causal models”) to get “functional connectivity” in this network, separately for the 3 conditions
2. DCM: uses time-series of fMRI activations to infer connectivity in the network of 3 brain areas — start w/ 28 plausible initial models, each a different combination of 7 network components (see Fig. 2A, p.1075, and Fig. S2, p.12 of the supplement) — use Bayesian model averaging to estimate parameters of the 7 components (components = strengths/direction of connections and external inputs) — end up with 14 “DCM-parameters” per subject, 7 per motive condition, 7 per baseline
3. Prediction: compute diff between DCM-parameters of each motive vs baseline (1: emp – base; 2: rec – base) = dDCM parameters — input these into SVM to classify empathy vs reciprocity — LOOCV — classification weights for 7 dDCM params (Fig. 2B, p.1075)
4. “Mechanistic models”: start again with the 28 initial models from step 2 — random-effect Bayesian model selection — average best models for each condition (emp – rec – base; Fig. 3, p.1076)
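If my reconstruction of the prediction step is right, it amounts to leave-one-out cross-validation of a linear classifier on the 7 dDCM parameters per subject. Here’s a minimal sketch of that procedure, with random numbers standing in for the real dDCM features and a nearest-class-mean rule standing in for the paper’s SVM (this is my illustration, not the authors’ code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in data: 17 "empathy" + 17 "reciprocity" subjects, 7 features each.
# These are random numbers; the real inputs would be the motive-minus-baseline
# DCM connectivity estimates.
X = rng.normal(size=(34, 7))
y = np.array([0] * 17 + [1] * 17)

def loocv_accuracy(X, y):
    """Leave-one-out CV with a nearest-class-mean linear classifier."""
    hits = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i              # hold out subject i
        mu0 = X[train & (y == 0)].mean(axis=0)      # class means computed from
        mu1 = X[train & (y == 1)].mean(axis=0)      # the other 33 subjects only
        pred = int(np.linalg.norm(X[i] - mu1) < np.linalg.norm(X[i] - mu0))
        hits += int(pred == y[i])
    return hits / len(y)

print(f"LOOCV accuracy on noise features: {loocv_accuracy(X, y):.2f}")
```

On pure noise this hovers around chance, which is the baseline against which the reported ~70% should be judged; the LOOCV machinery itself guarantees nothing beyond “generalizes to the 34th subject given the other 33.”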
The paper is a mix of talk about the prediction aspect and the mechanistic insight into the neural basis of the two motives that supposedly can be gleaned from the data. There seems to be some confusion on the part of the authors as to how these two aspects are related. Which leads to the first issue.

I. Purpose of prediction

In my comment, I questioned the usefulness of their prediction exercise (I called it a “predictive circus act”). I thought the causal modeling part (DCM) is OK because it could contribute to an understanding of what, and eventually how, brain processes generate mental states. However, I didn’t think the predictive part added anything to that. (And I couldn’t help noticing that the predictive part would allow them to advertise their findings as “the brain revealing motives” instead of just “here’s what’s going on in the brain while we experience some motives”.)

What’s your take? Does the prediction per se have a role to play in such a context?

II. Relationship between prediction and causal modeling/mechanistic insights

The authors claim that the predictive part supports or even furnishes the mechanistic (causal?) insight the data supposedly deliver, although that is not stated as the official purpose of the predictive part. They write:

“We obtain these mechanistic insights because the inputs into the support vector machine are not merely brain activations but small brain models of how relevant brain regions interact with each other (i.e., functional neural architectures)…. And it is these models that deliver the mechanistic insights into brain function…”

The last sentence of the paper then reads:

“Our study, therefore, also demonstrates how “mere prediction” and “insights into the mechanisms” that underlie psychological concepts (such as motives) can be simultaneously achieved if functional neural architectures are the inputs for the prediction.”

But if my outline of their analytic chain is correct, these statements are confused. As a matter of fact, they do *not* derive their mechanistic models (i.e. the specific connectivity parameters of the network of 3 brain areas, see Fig. 3 p.1076) from the predictive model. The mechanistic models are the result of a different analytic path than the predictive model. This can already be seen from the fact that the predictive model is based on *differences* between motive and baseline conditions, while the mechanistic models they discuss at length in the paper exist for each of these conditions separately.

If all this is right, the authors misunderstand their own analysis.

(They also have *this* sentence, which I consider a tautology: “Thus, by correctly predicting the induced motives, we simultaneously determine those mechanistic models of brain interaction that best predict the motives.”)

I would be happy, however, if someone found this interesting enough to check whether my understanding of the modeling procedure is correct.

III. Generalizability

The authors make much of their use of LOOCV:

“We predicted each subject’s induced motive with a classifier whose parameters were not influenced by that subject’s brain data… Instead, the parameters of the classifier were solely informed by other subjects’ brain data. This means that the motive-specific brain connectivity patterns are generalizable across subjects. The distinct and across-subject–generalizable neural representation of the different motives thus provides evidence for a distinct neurophysiological existence of motives.”

They do not address at all, however, the issue of generalizability to new samples (all the more important for a single-sex sample). I thought the emphasis was completely wrong here. My understanding was and is that achieving a decent in-sample classification accuracy is only the smallest part of finding a robust classifier. The real test is the performance in new samples from new populations. Also, I felt that something was wrong with their particular emphasis on how cool it is that LOOCV leads to a classifier that generalizes within the sample.

I wrote that “the authors’ appeal to generalizability is misleading. They emphasize that their predictive analysis is conducted using a particular technique (called leave-one-out cross-validation or LOOCV) to make sure the resulting classifier is “generalizable across subjects”. But that is rather trivial. LOOCV and its congeners are a standard feature of predictive models, and achieving a decent performance within a sample is nothing special.”

In their response, they challenged this:

“Well, if it is so easy to achieve a decent predictive performance, why do the behavioral changes in altruistic behavior induced by the empathy and the reciprocity motive enable only a very poor predictability of the underlying motives? On the basis of the motive-induced behavioral changes the classification accuracy of the support vector machine is only 41%, i.e., worse than chance. And if achieving decent predictive performance is so easy, why is it then impossible to predict better than chance based on brain activity levels for those network nodes for which brain connectivity is highly predictive (motive classification accuracy based on the level of brain activity = 55.2%, P = 0.3). The very fact that we show that brain activity levels are not predictive of the underlying motives means that we show – in our context – the limits of traditional classification analyses which predominantly feed the statistical machine with brain activity data.”

What they say certainly shows that you can’t get a good classifier out of just any features you have in the data. So in that sense my statement would be false. But what I had in mind was more along the lines that to find *some* good predictors among many features is nothing special. But is this true? And is it true for their particular study? This will come up again under forking paths later.

To get back to the bigger issue, was I right to assume that getting a decent classifier in a small sample is not even half the rent if you want to say something general about human beings?

(To be fair to the authors, they state in their response that it would be desirable, even “very exciting”, to be able to predict subjects’ motives out-of-sample.)

IV. Overfitting and forking paths

Finally, what is the scope for overfitting, noise mining and forking paths in this study? I would love to get some expert opinion on that. They had 17 subjects per motive condition. They first searched for stat. sign. differences in brain activity between the 3 conditions. What shows up is a network of 3 brain regions. They attached to it 7 connectivity parameters and tested 28 combinations of them (“models”). Bayesian model averaging yielded averages for the 7 parameters, per condition. Subtract baseline from motive parameters, feed the differences into an SVM.
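For what it’s worth, here is one way to get intuition for the scope of the problem. The pipeline below is not the authors’ (it uses a hypothetical feature-selection step on pure noise), but it shows the generic mechanism: let any data-dependent choice see the full sample before cross-validation, and LOOCV accuracy on pure noise inflates well above chance; make the same choice inside each fold, and it does not.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, keep = 34, 200, 7   # 34 subjects, 200 pure-noise features, keep "best" 7

X = rng.normal(size=(n, p))            # no real signal anywhere
y = np.array([0] * 17 + [1] * 17)

def loocv(X, y, select_inside_fold):
    hits = 0
    for i in range(n):
        train = np.arange(n) != i
        # Score features by class-mean difference, using either the training
        # fold only (honest) or the full sample incl. subject i (double dip).
        sel = train if select_inside_fold else np.ones(n, dtype=bool)
        diffs = np.abs(X[sel & (y == 0)].mean(0) - X[sel & (y == 1)].mean(0))
        feats = np.argsort(diffs)[-keep:]
        mu0 = X[train & (y == 0)][:, feats].mean(0)
        mu1 = X[train & (y == 1)][:, feats].mean(0)
        pred = int(np.linalg.norm(X[i, feats] - mu1)
                   < np.linalg.norm(X[i, feats] - mu0))
        hits += int(pred == y[i])
    return hits / n

print("selection on full sample (double dip):", loocv(X, y, False))
print("selection inside each fold (honest):  ", loocv(X, y, True))
```

This doesn’t say the study did this; it only calibrates how easily a 34-subject pipeline with many upstream modeling choices (28 DCMs, model averaging, baseline differencing) can manufacture above-chance accuracy if any of those choices leaks information about the labels.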

Can you believe anything coming from such an analysis?

I hope and believe that this could also bring the familiar insights and excitement of a journal club that so many of you have professed your love for. And last but not least, maybe the scientific audience of PubPeer could learn something, too.

I have no idea, but, again, I wanted to share this as an example of a post-publication review.


The post Some ideas on using virtual reality for data visualization: I don’t really agree with the details here but it’s all worth discussing appeared first on Statistical Modeling, Causal Inference, and Social Science.

Evan Warfel writes:

Misleading graphs are one thing you consistently write about, and it seems to me that Virtual Reality has the potential to solve part of this problem — there is no point in sharing a “static” VR experience; instead, by allowing users to change the axes and interact with the data, they have access to many more details about the data.

Anyways, if you are interested, I’ve written a guest blog post about how VR can augment traditional data visualization.

I’ve tried to take a non “Pie Charts, but in 3d!” approach. The piece covers everything from the infamous Is Linda a Bank Teller? problem to the disadvantages of realistic Chernoff faces. I’d love to hear your thinking on the piece or the ideas I’ve mentioned above.

My reply: I don’t really agree with the details of what you’re suggesting but it’s all worth thinking about, so I’ll blog it.

**P.S.** Thanks to Steven Johnson for the above image showing virtual reality.


The post I am (somewhat) in agreement with Fritz Strack regarding replications appeared first on Statistical Modeling, Causal Inference, and Social Science.

This is noteworthy because the last time this paper by Strack was discussed on this blog it was in a negative way (“Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories”). So, as Strack correctly points out, it’s a bit rich to see me (appropriately) criticizing null hypothesis testing today, given that a few months ago I was using hypothesis tests to declare that certain findings did not replicate.

Strack is saying that the reports of non-replication of his study are non-rejections of hypothesis tests, and we shouldn’t overinterpret such non-rejections. A non-rejection tells us that a certain pattern in data is in a class of patterns that could be explained by the random number generator that is the null hypothesis—but who cares, given that we have little interest in the null hypothesis (of zero effect and zero systematic error) in the first place. As Strack puts it, and I agree, just because a pattern in data *could be* explained by chance, that doesn’t mean it’s so useful to say the pattern *is* explained by chance.

And, at a rhetorical level, I was endorsing the use of a p-value (in this case, a non-rejection) to make a scientific decision (that Strack’s work could be labeled as non-replicable).

Strack’s point here is similar to Deborah Mayo’s criticism of critics of hypothesis testing, that we are aghast at routine use of p-values in research decisions, but then we seem happy to cite data from hypothesis tests to demonstrate replication problems. Mayo has pointed to instances when critics use hypothesis tests to reject the hypothesis that observed p-values come from some theoretical distribution that would be expected in the absence of selection, but I think her general point holds in this case as well.

I agree with Strack that “non-replication” is a fuzzy idea and that it’s inappropriate to say that non-replication implies that an effect is not “true.” Indeed, even with power pose, which has repeatedly failed to show anticipated effects in replication studies, I don’t say the effect is zero; rather, I think the effects of power pose vary by person and situation. I don’t see any evidence that power pose has a *consistent* effect of the sort claimed in the original papers on the topic (notoriously, this stunner, which I guess was good enough for a Ted talk and a book contract: “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.”)

Getting back to Strack’s experiments: I don’t think the non-statistically-significant results in replications imply that there is no effect. What I do think the totality of the evidence shows (based on my general impression, let me emphasize that I’ve not analyzed any raw data here) is that, again, there’s no good evidence for any consistent effects of Strack’s treatments. And, beyond this, I’m skeptical that Strack’s designs and data collections are sufficient to learn much about how these effects vary, for reasons discussed in section 3 of this paper.

In short: high variation in the effect plus high variation in the measurement makes it difficult to discover anything replicable, where by “replicable” I mean predictions about new data, without any particular reference to statistical significance.

So: for general reasons of statistical measurement and inference I am skeptical of Strack’s substantive claims; I suspect that the data from his experiments don’t provide useful evidence regarding those claims. Similarly, there is a limit to what can be learned from the replication studies if they were conducted in the same way.

What about replication studies? I’ve been skeptical about replication studies for a long time, on the general principle that if the original study is too noisy to work, the replication is likely to have similar problems. That said, I do see value in replication studies—not so much as a strategy for scientific learning but as a motivator for researchers. Or, maybe I should say, as a disincentive for sloppy studies. Again, consider power pose: That work was flawed in a zillion ways, some of which are clear from reading the published article and others of which were explained later in a widely-circulated recantation written by the first author on that paper. But it’s hard for a lot of people to take seriously the claims of outsiders that a research project is flawed. Non-replication is a convincer. Again, not a convincer that the effect is zero, but a convincer that the original study is too noisy to be useful.

So, replication studies can play a sort of “institutional” role in keeping science on track, even if the particular replication studies that get the most attention don’t actually give us much of any useful scientific information at all.

Strack cites Susan Fiske, Dan Schacter, and Shelley Taylor who “point out that a replication failure is not a scientific problem but an opportunity to find limiting conditions and contextual effects.” Maybe. Or, I should say, yes but only if the studies in question—both the original and the replication—are targeted enough and accurate enough to answer real questions. For example, that ovulation-and-clothing study was hopeless—any consistent signal is just tiny compared to the noise level—and replications of the same study will just supply additional noise (which, as noted above, can be valuable in confirming the point that the data are too noisy to be useful, but is not valuable in helping us learn anything at all about psychology or behavior).

So, in some ideal world there could be some sense in that claim of Fiske, Schacter, and Taylor regarding replications being an opportunity to find limiting conditions and contextual effects. But in the real world of real replication studies, I think that’s way too optimistic, and the lesson from a series of unsuccessful replications is, as with power pose, that it’s time to start over and to rethink what exactly we’re trying to learn and how we want to study it.

Finally, I agree with Strack’s point that theory is useful, that psychology, or the human sciences more generally, is not merely “a collection of effects and phenomena.” The place where I think more work is needed is in designing experiments and taking measurements that are more precise and more closely tied to theory, also doing within-person comparisons as much as possible, really taking seriously the idea of directly measuring the constructs of interest, and tying this to realism in experimental conditions.


The post What you value should set out how you act and that how you represent what to possibly act upon: Aesthetics -> Ethics -> Logic. appeared first on Statistical Modeling, Causal Inference, and Social Science.

Peirce’s primary focus in his career was on logic. Until late in his career, he considered ethics and aesthetics to be largely frivolous topics. Then around 1900 he saw them as absolutely necessary to understanding logic. His thinking was that you first need to decide what you value above all, second how you should deliberately act to best obtain what you value, and third how you should best represent what you plan to act upon prior to acting in the world. The “how one should best represent” to profitably advance inquiry being logic. So Aesthetics -> Ethics -> Logic.

Peirce took aesthetics to be the topic of what you should value above all regardless of any ulterior purposes. The ultimate thing to value above all he took to be the grasping of reality as being reasonable. That is, reality being construed in ways seen to be capable of being understood rather than being mysterious, even though we have no direct access to it. He took ethics to be the topic of deliberate controlled acting to achieve desired goals (the ultimate goal being reasonableness). He took logic to be the topic of deliberate controlled thinking or representation (which entails unending re-representation). To think about reality we need to somehow represent it in a way that is not too wrong and re-represent it in equivalent ways – avoiding making it even more wrong.

The latter is the more common concern of logic as truth preservation. But before, and continually as, we act on reality, we need to assess whether our representation is too wrong to allow acting without being frustrated by reality. That is, we delay deliberate acceptance until we are satisfied we can’t do better. The first he thought of as abduction, the second as deduction, and the third as induction: the three topics of his total concept of logic. That is, represent, re-represent faithfully, and critically check what profitably should be doubted. Doubt, according to Peirce, is an art which has to be acquired with difficulty.

If aesthetics is taken as the theory of the objectively admirable without ulterior reasons (ends simply as they present themselves), in statistics it might be the grasping of the real uncertainties of learning, from planned and unplanned observations, about that reality which is beyond us. If ethics is taken as the theory of deliberate, self-controlled conduct (ends as they relate to actions and efforts), in statistics the list of virtues given here does seem a very good start; it would be here that the grasping would or would not be resolved into quantification. If logic is taken as the theory of deliberate thinking aimed at getting purposeful representations (ends in regard to representation), in statistics it might be the discerning, modifying, and tentative keeping of joint probability models as representations of the realities that could have produced the observations we have.

Using this perspective, I attempt here to re-represent purposeful Bayesian inference, hopefully bringing its reasonableness into clearer view.

In purposeful (or, some say, pragmatic) Bayesian approaches to statistics (in contrast to classical subjective or objective Bayesian approaches), there are always three intermingled aspects.

First (speculative inference): (1) choosing a (probabilistic) representation of how unknown quantities were set or came about (aka a prior), (2) choosing a (probabilistic) representation of how the data in hand came about or were generated (aka the data-generating model or likelihood), and (3) specifying how these two representations connect into a joint representation of an empirical happening, to be interpreted more generally.

Second (quantitative inference): (1) revising the first representation (the prior) in light of the data to get the implied representation given by the joint representation _and_ the data (aka the posterior, via Bayes’ theorem), and (2) (conceptually) generating what data could result from such an implied representation (aka the posterior predictive).
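As a minimal sketch of this second aspect (not anything specific to the workflow described here), a conjugate beta-binomial model makes both the posterior and the posterior predictive available in closed form; all numbers below are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(42)

# Prior: Beta(a, b) on an unknown proportion theta (hypothetical values).
a, b = 2.0, 2.0

# Data in hand: 7 successes out of 10 trials (made-up).
successes, trials = 7, 10

# (1) Posterior via Bayes' theorem (conjugacy: Beta prior + binomial likelihood).
a_post = a + successes
b_post = b + trials - successes

# (2) Posterior predictive: what data could such an implied representation generate?
theta_draws = rng.beta(a_post, b_post, size=10_000)
y_rep = rng.binomial(trials, theta_draws)  # simulated replicated datasets

print(round(a_post / (a_post + b_post), 3))  # posterior mean of theta
print(round(y_rep.mean(), 2))                # posterior predictive mean count
```

Conjugacy is used here only to keep the sketch self-contained; in practice the same two steps are usually done by posterior simulation (e.g., in Stan) rather than in closed form.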

Third (evaluative inference): (1) choosing/guessing how the first and second steps might have been wrong and, in light of this, (2a) assessing the reasonableness of the data-generating model to have generated the data in hand (checking for model-data conflict) and then (2b) the reasonableness of the prior to have generated the parameters now most supported in the posterior (aka checking for prior-data conflict), (3a) choosing which aspects of the joint representation might be made less wrong, and how, (3b) working through the implications and how they fit with the data in hand and possibly past experience, and finally (3c) deciding whether to settle for now on the joint representation as is, along with its implications, or to start again at the first step.
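One common way to sketch step (2a), assessing whether the data-generating model could reasonably have produced the data in hand, is a posterior predictive check: simulate replicated data from the fitted joint representation and see where the observed data fall. The model and numbers below are made-up for illustration (a beta-binomial toy example), not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: Beta(2, 2) prior, 7 successes observed in 10 binomial trials.
a, b, y_obs, n = 2.0, 2.0, 7, 10
a_post, b_post = a + y_obs, b + n - y_obs

# Posterior predictive check: simulate replicated data y_rep from the
# fitted joint representation.
theta = rng.beta(a_post, b_post, size=20_000)
y_rep = rng.binomial(n, theta)

# Tail probability that replicated data are at least as extreme as observed.
ppc_tail = (y_rep >= y_obs).mean()

# Values near 0 or 1 would signal model-data conflict; middling values do not.
print(round(ppc_tail, 2))
```

The same logic extends to step (2b): check whether the prior could reasonably have generated the parameter values now most supported in the posterior.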

All in all,

speculative inference -> quantitative inference -> evaluative inference, or

abduction -> deduction -> induction, or

First -> Second -> Third

Over and over again, endlessly.

Cornelis de Waal, Peirce: A Guide for the Perplexed – https://www.amazon.com/Peirce-Guide-Perplexed-Guides/dp/1847065163


]]>The post Automated Inference on Criminality Using High-tech GIGO Analysis appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>You might be interested in this article.

My reply was that this is just a big joke, one more bit of hype over a bunch of correlations. Lots of obvious problems with this paper, and it’s too bad that journalists fell for it. And, as an MIT grad, I’m particularly sad to see Technology Review be among the suckers.

I don’t have the energy to point out all the problems. Conveniently, though, I came across this pretty thorough discussion by Carl Bergstrom and Jevin West.

It’s an interesting example because, for once, we’re not in Type M / Type S error territory. Instead, we have the other big problem in quantitative social science: measurements that are not closely enough tied to the underlying constructs of interest.

tl;dr: No, despite what you might read in Technology Review, it’s not true that a neural network learned to identify criminals by their faces.


]]>The post Abandon Statistical Significance appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>Blake McShane, David Gal, Christian Robert, Jennifer Tackett, and I wrote a short paper arguing for the removal of null hypothesis significance testing from its current gatekeeper role in much of science. We begin:

In science publishing and many areas of research, the status quo is a lexicographic decision rule in which any result is first required to have a p-value that surpasses the 0.05 threshold and only then is consideration—often scant—given to such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain. There have been recent proposals to change the p-value threshold, but instead we recommend abandoning the null hypothesis significance testing paradigm entirely, leaving p-values as just one of many pieces of information with no privileged role in scientific publication and decision making. We argue that this radical approach is both practical and sensible.

Read the whole thing. It feels so liberating to just forget about the whole significance-testing threshold entirely. As we write, “we believe it is entirely acceptable to publish an article featuring a result with, say, a p-value of 0.2 or a 90% confidence interval that includes zero, provided it is relevant to a theory or applied question of interest and the interpretation is sufficiently accurate. It should also be possible to publish a result with, say, a p-value of 0.001 without this being taken to imply the truth of some favored alternative hypothesis.” We also discuss the abandonment of significance-testing thresholds in research and statistical decision making more generally. Decisions are necessary, but a lexicographic rule based on statistical significance, no.
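To illustrate treating the p-value as just one piece of information among many, here is a made-up numerical example (estimate and standard error invented for this sketch) that happens to land in the quoted scenario of a p-value near 0.2 and a 90% interval that includes zero:

```python
import math

# Made-up example: estimated effect and its standard error from some study.
est, se = 1.2, 0.9

z = est / se
# Two-sided p-value under a normal approximation, using the standard
# normal CDF expressed via the error function.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 90% confidence interval (z* = 1.645 for two-sided 90% coverage).
lo, hi = est - 1.645 * se, est + 1.645 * se

# Report everything; no significance threshold decides anything.
print(f"estimate {est}, 90% CI [{lo:.2f}, {hi:.2f}], p = {p_value:.2f}")
```

Under the paper’s view, this result is publishable if it bears on a question of interest and is interpreted accurately, with the interval and p-value reported as evidence rather than passed through a 0.05 gate.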

**P.S.** The adorable cat pictured above sees no need to perform a null hypothesis significance test.


]]>The post “Cheerleading with an agenda: how the press covers science” appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I thought you might be interested in this new (critical) perspective on science journalism: Cheerleading with an agenda: how the press covers science. It’s a topic far less urgent than the election (but related to the broader press failures that have been very visible in politics).

My reply: This is an excellent article. I’d say the problems are related to journalism being eclipsed by public relations (or here, or here, or . . .), except that it’s my impression that science journalism has been cheerleading forever. It’s not like there was any past golden age of critical science journalism. Indeed, I think the golden age of science journalism is now. So I don’t know how this quite fits into your story.

Katz responded:

I agree with you entirely; I think science journalism was born as a cheerleading enterprise (largely with the same attitude as PR). I don’t think there has really been a golden age yet, though in the late ’80s and ’90s people realized that science has to be covered rigorously for society’s sake, and started science journalism programs. Then there was probably some regression, and now there’s a wave of people being very critical about reproducibility and fraud in some areas of science, but I don’t think it qualifies as a golden age yet. Undark magazine presents itself as trying to cover science critically, though it’s marginal and hasn’t yet lived up to its promise (it’s very new; maybe it will).

Ed Yong may not be perfect but I think he and Felix Salmon and Nate Silver and other quantitative journalists have a lot to offer.


]]>The post What happened in the recent German election? appeared first on Statistical Modeling, Causal Inference, and Social Science.

]]>I am not a statistician, but a historian by training. However, I have a lay interest in election analyses, among other topics covered in your blog.

I live in Germany. Yesterday, the federal elections were concluded, as you may have heard. I was wondering if you could share your views on it on your blog?

A bit of background: Many of the results were expected, but some were surprising. Chancellor Merkel’s party was the largest, as expected. On the other hand, particularly unexpected was the extent to which the major parties were weakened and the strong performance of small parties, particularly the radical right-wing Alternative for Germany (AfD), which ended with 12.6%. It is strongest in eastern Germany. This is seen as a reaction to Merkel’s refugee policy, but probably has deeper roots.

Germany’s overall prosperity has also meant that the Left parties have found it difficult to push their message as strongly as they would have liked. According to some graphs they showed on TV yesterday, economic concerns like employment played a smaller role for voters than in previous elections. At the same time, there are many who feel left out. The German electoral landscape has changed, and analysts have offered various explanations for this. Is this speaking too soon? Maybe these elections represent only a blip rather than a rightward shift?

Germany has a proportional representation system, explained here.

German voting behaviour has been analysed in many scholarly works, which suggest both a generational effect and a life-cycle effect (sorry, I could not find English-language resources for this). I think this has something to do with the proportional representation system. German voters change their choices from election to election more often than American voters do, a phenomenon known as “Wählerwanderung”. This article (link here) from the Financial Times has some nice graphs on the German elections.

We see, for instance (based on fairly accurate exit polls), that the AfD has taken votes from all sorts of sources. Mostly it has mobilised non-voters, but has also taken substantial chunks from the CDU/CSU (Conservatives, also Merkel’s faction) and the SPD (Social Democrats). Even the Linke (Left Party) has lost as much as 11% of its voters to the radical right-wing party.

I have no idea; I know nothing about German politics beyond what I read in the newspapers. But it can be a good idea to blog on things I know nothing about. So feel free to comment, everyone!


]]>