<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Statistical Modeling, Causal Inference, and Social Science</title>
	<atom:link href="https://statmodeling.stat.columbia.edu/feed/" rel="self" type="application/rss+xml" />
	<link>https://statmodeling.stat.columbia.edu</link>
	<description></description>
	<lastBuildDate>Sun, 26 Jan 2025 21:29:00 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.4.3</generator>
	<item>
		<title>Reading the referee reports of that retracted paper by the science reformers:  A peek behind the curtain</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/26/reading-the-referee-reports-of-that-retracted-paper-by-the-science-reformers-a-peek-behind-the-curtain/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/26/reading-the-referee-reports-of-that-retracted-paper-by-the-science-reformers-a-peek-behind-the-curtain/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sun, 26 Jan 2025 14:33:40 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Sociology]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51109</guid>

					<description><![CDATA[One interesting sidelight of the story of a much criticized and finally retracted article on replication in psychology is that we get to read some of the referee reports. These aren&#8217;t the reports on the original article&#8212;I guess those reviews &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/26/reading-the-referee-reports-of-that-retracted-paper-by-the-science-reformers-a-peek-behind-the-curtain/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>One interesting sidelight of the story of a <a href="https://statmodeling.stat.columbia.edu/2024/03/27/the-feel-good-open-science-story-versus-the-preregistration-who-do-you-think-wins/">much criticized</a> and <a href="https://statmodeling.stat.columbia.edu/2024/09/24/well-today-we-find-our-heroes-flying-along-smoothly/">finally retracted</a> article on replication in psychology is that we get to read some of the referee reports.</p>
<p>These aren&#8217;t the reports on the original article&#8212;I guess those reviews were positive, otherwise it wouldn&#8217;t have been published in the first place&#8212;but on a proposed replacement paper; Berna Devezer <a href="https://statmodeling.stat.columbia.edu/2024/09/24/well-today-we-find-our-heroes-flying-along-smoothly/#comment-2380149">explains the background</a>.</p>
<p>The reviews are in <a href="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Decline-Effect-Appeal-Letterfinal.pdf">this document</a>, which came from <a href="https://osf.io/rnvxk">this</a> public repository.</p>
<p>The first two reviews (by Tal Yarkoni and Daniel Lakens) are really long, and they are both very thoughtful, really the equivalent of publishable articles (or blog posts) on their own.  I guess they knew this was a high-stakes case, so they put in extra effort.</p>
<p>The third review was more cursory, and its summary was, &#8220;This is an important and novel set of findings&#8221; and &#8220;The approach is valid, and the quality of data and of its presentation are good.&#8221;  After reading reviews 1 and 2, I don&#8217;t think reviewer 3&#8217;s assessment is correct&#8212;but I&#8217;m not gonna get all upset at reviewer 3.  I&#8217;ve reviewed hundreds, maybe thousands of papers.  I do <a href="https://statmodeling.stat.columbia.edu/2006/10/25/advice_for_refe/">a quick read</a>, and sometimes I miss the point!  That&#8217;s why a journal will obtain three reviews&#8212;different reviews have different styles, and any given reviewer can make a mistake.</p>
<p>Anyway, for those of you who have never been involved in editing a scholarly journal, it might be interesting for you to read these reviews, just to get a peek inside a system that is usually hidden from outsiders.</p>
<p>Some of the reviewers&#8217; comments are just so on the mark:</p>
<blockquote><p>The paper&#8217;s central claim&#8212;i.e., that a high rate of replication failures is not inevitable if one uses optimal practices—is obviously true, and requires no empirical support. The only way it could be false is if there were no reliable effects at all in social science&#8212;which is on its face an absurd proposition. The authors write that they &#8220;report the results of a prospective replication examining whether low replicability and declining effects are inevitable when using current optimal practices.&#8221; But how could low replicability and declining effects possibly be &#8220;inevitable&#8221;, either with or without optimal practices?</p></blockquote>
<p>Good point!</p>
<blockquote><p>A charitable reading of the authors&#8217; central claim might be that what they are really trying to do is quantitatively estimate the impact of using best practices on the replicability of previous findings. That is, the question is not really whether or not replicability is *achievable* (of course it is), but *under what conditions a certain degree of replicability is achievable*, and *how much of an impact certain procedures seem to make*. If the paper were explicitly framed this way, I think it would be posing an important question. A serious effort to quantify the impact of implementing specific rigorous methodologies on replicability would be a valuable service to the field. However, framing the paper this way would also make it clear that, at least in its current presentation, the study has a number of major design limitations that in my view preclude it from providing an informative answer to the question.</p></blockquote>
<p>Bang!  The reviewer is not trying to be mean; he&#8217;s just stating some truths.</p>
<blockquote><p>The generalization target is unclear. It is never made clear in the manuscript what population of effects the authors take their conclusions to apply to. . . . absent the authors&#8217; disclosure of the processes that led them to select these particular effects for inclusion, it is impossible to determine—even in a ballpark sense—what population of studies the present results are meant to apply to.</p></blockquote>
<p>That sort of thing is often important. It&#8217;s the kind of thing that people can be sloppy about in the writing of the paper and that journals will often just let through because nobody cares about going through and getting everything right.</p>
<blockquote><p>
Inadequate description of effects.
</p></blockquote>
<p>That one&#8217;s huge.  Here&#8217;s an example:</p>
<blockquote><p>For example, for the &#8220;Ads&#8221; study, Table 1 describes the central result as &#8220;Watching a short ad within a soap-operate episode increases one&#8217;s likelihood to recommend and promote the company in the ad&#8221;. This is incorrect in at least two ways. First, the authors did not measure likelihood of recommendation or promotion of companies; they measured *self-reported* likelihoods of these behaviors. That there is at best a very weak relationship between these things should be obvious, or else McDonald&#8217;s would experience a significant boost in revenue every time it aired a single ad on TV, which is obviously not the case (indeed, there is a cottage industry within marketing research questioning whether TV ad campaigns have *any* meaningful effect on sales). Second, the current wording implies that the effect applies to companies in general, when actually the authors only asked about McDonald&#8217;s, and used only a single ad comparison (McDonald&#8217;s vs. Prudential). This design does not license any general conclusions about &#8220;the company in the ad&#8221;; it licenses conclusions only about McDonald&#8217;s.</p></blockquote>
<p>This reminds me of <a href="http://stat.columbia.edu/~gelman/research/published/healing3.pdf">our interrogation</a> of a psychology paper that kept describing things inaccurately.  It can be so hard for authors to just say exactly what they did!</p>
<blockquote><p>Of the &#8220;Cookies&#8221; study, the authors write: &#8220;People will be seen as greedier when they take three of the same kind of (free) cookie than when they take three different (free) cookies&#8221;. This is an inaccurate description, as no cookie-takers were actually observed; participants were asked to *imagine* how they would feel if they observed people taking cookies. A more accurate description would be: &#8220;Participants directed to imagine a specific hypothetical norm-violation scenario rate it as greedier to take three of the same kind of (free) cookie than to take three different (free) cookies.&#8221; </p></blockquote>
<p>You might see this as picky, but <a href="https://statmodeling.stat.columbia.edu/2019/10/07/are-statistical-nitpickers-e-g-kaiser-fung-and-me-getting-the-way-of-progress-or-even-serving-the-forces-of-evil/">I don&#8217;t</a>.  We&#8217;re supposed to be doing science here!  You say what you actually did.  Even though I know that it&#8217;s absolutely standard practice not to do this (notoriously, &#8220;That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,&#8221; describing a study that had no measures of power, let alone &#8220;instantly becoming more powerful&#8221;).</p>
<blockquote><p>At several points in the paper the authors remark the performed studies with ‘high statistical power’ (e.g., abstract). But this is not true. . . . If this confirmatory study happens to be a study where, as a fluke an effect size was observed that was rather extreme (as should happen, by chance, in 16 studies), and the true effect was smaller (as it seems to be, in the replication studies), then this was not a sufficiently powered study. Instead, it *seemed to be* a sufficiently powered study, based on the *observed* effect size. But if we take the four replications as an estimate of the true effect size, the study had low power. Of course, all of this requires some speculation, as we never know the true effect size, but the point is, the authors cannot argue the studies all had high power, and the absence of a power analysis (let alone a conservative power analysis, such as a safeguard power analysis) should make the authors even more careful about claiming they had ‘high power’. Power is a curve. . . .</p></blockquote>
<p>That&#8217;s a lot of words, but it&#8217;s understandable.  It can take a lot of words to explain what&#8217;s going on to people who have made a mistake.  And it&#8217;s just a referee report; there&#8217;s no reason the reviewer should put in lots of time to write it crisply.  The point is, yeah, the statement in the article was wrong.  And, again, the reviewer isn&#8217;t being mean, he&#8217;s just telling it like it is.</p>
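<p>The reviewer&#8217;s point&#8212;that power computed from a fluke-inflated observed effect size only <em>seems</em> high&#8212;can be illustrated with a quick simulation.  This is my own sketch, not anything from the review or the paper; the effect size, sample size, and two-sided z-test are made-up illustrative choices:</p>

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_for_d(d, n, crit=1.96):
    """Approximate two-sample test power for effect size d (Cohen's d),
    with n subjects per group, using the normal approximation."""
    se = sqrt(2 / n)
    return 1 - phi(crit - d / se) + phi(-crit - d / se)

rng = np.random.default_rng(0)
true_d, n, n_sims = 0.2, 50, 100_000   # illustrative values
se = sqrt(2 / n)

# Each simulated study observes a noisy estimate of the true effect...
obs_d = rng.normal(true_d, se, n_sims)
# ...and only the "significant" results get written up.
sig = np.abs(obs_d / se) > 1.96

true_power = power_for_d(true_d, n)
# Power recomputed from the average published (observed) effect size:
apparent_power = power_for_d(np.abs(obs_d[sig]).mean(), n)

print(f"true power:            {true_power:.2f}")      # about 0.17
print(f"power from observed d: {apparent_power:.2f}")  # about 0.70
```

<p>With a small true effect, only the lucky overestimates cross the significance threshold, so power computed from the observed effect sizes looks respectable even though the true power was low&#8212;exactly the reviewer&#8217;s &#8220;it <em>seemed to be</em> a sufficiently powered study&#8221; scenario.</p>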
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/26/reading-the-referee-reports-of-that-retracted-paper-by-the-science-reformers-a-peek-behind-the-curtain/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
		<item>
		<title>Thou Shalt Not Cheat</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/25/thou-shalt-not-cheat/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/25/thou-shalt-not-cheat/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sat, 25 Jan 2025 14:40:22 +0000</pubDate>
				<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51122</guid>

					<description><![CDATA[Palko points to this post by Remy Levin and writes: JMR, a top 4 marketing journal, has issued a formal “Expression of Concern” for Mazar, Amir, &#038; Ariely (2008), the infamous Ten Commandments study. We discussed some version of that &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/25/thou-shalt-not-cheat/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Palko points to <a href="https://x.com/RemyLevin/status/1838795608968106406">this post</a> by Remy Levin and writes:</p>
<blockquote><p>JMR, a top 4 marketing journal, has issued a formal “Expression of Concern” for Mazar, Amir, &#038; Ariely (2008), the infamous Ten Commandments study.</p></blockquote>
<p><a href="https://statmodeling.stat.columbia.edu/2018/12/13/oh-hate-work-criticized-case-fails-attempted-replications-original-researchers-dont-even-consider-possibility-maybe-original-work-w/">We discussed some version</a> of that experiment a few years ago.  Lots of colorful characters came up. In addition to TV star psychologist Dan Ariely, there was ultra-connected economist Andrei Shleifer, who was credibly accused of something like embezzlement, and of course there was the big irony that there was <a href="https://statmodeling.stat.columbia.edu/2023/06/21/ted-talking-data-fakers-who-write-books-about-lying-and-rule-breaking-whats-up-with-that/">cheating in a study about cheating</a>.</p>
<p>At that point I would&#8217;ve said that the irony meter had gone off the scale, had it not been for the story of Michael Shermer, the professional skeptic who, in apparent sincerity, shared the story of a haunted radio (yes, <a href="https://statmodeling.stat.columbia.edu/2014/12/12/saying-things-place/">really</a>!).</p>
<p>Amazingly&#8212;or not so amazingly&#8212;there&#8217;s a connection between these two denizens of the irony zone, as Shermer <a href="https://statmodeling.stat.columbia.edu/2024/04/07/evilicious-3-face-the-music/">blurbed</a> Ariely&#8217;s recent book.  Ahhhh, celebrities!</p>
<p>And at this point you probably won&#8217;t be surprised to hear that both Ariely and Shermer have been involved with the <a href="https://www.buzzfeednews.com/article/peteraldhous/jeffrey-epstein-john-brockman-edge-foundation">notorious</a> Edge Foundation.  Funky paper <a href="https://statmodeling.stat.columbia.edu/2024/09/15/shreddergate-a-fascinating-investigation-into-possible-dishonesty-in-a-psychology-experiment/">shredders</a>, magic radios, . . . can&#8217;t get much edgier than that.  No <a href="https://statmodeling.stat.columbia.edu/2019/10/02/schoolmarms-and-lightning-bolts-data-faker-meets-edge-foundation-in-an-unintentional-reveal-of-problems-with-the-great-man-model-of-science/">schoolmarms</a> there!</p>
<p>Anyway, the best comment on that Ten Commandments study came from <a href="https://statmodeling.stat.columbia.edu/2018/12/13/oh-hate-work-criticized-case-fails-attempted-replications-original-researchers-dont-even-consider-possibility-maybe-original-work-w/#comment-926306">Jonathan, back in 2018</a>:</p>
<blockquote><p>Where in the 10 Commandments does it say ‘don’t cheat on a matrix test?’ It isn’t murder. It isn’t sex with a married woman (because that’s what it actually says is outlawed – because that confuses parentage and causes violence). It isn’t coveting. It could be lying but was this within the confines of a honor code with meaning, like part of a college commitment, or was it just an agreement to take this test? Does the lying gain someone advantages? The point of the ‘lying’ commandment is bearing false witness, and that would be stretch. Aren’t the 10 Commandments more a lesson that little crap like this doesn’t matter, that you should not kill people, not mistreat parents or children, not steal, and not want someone else’s stuff (including wife) rather than going out and getting your own?</p></blockquote>
<p>Good point.  The Bible says nothing about shredders or psychology experiments.  So, everything&#8217;s cool.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/25/thou-shalt-not-cheat/feed/</wfw:commentRss>
			<slash:comments>17</slash:comments>
		
		
			</item>
		<item>
		<title>The fractal nature of scientific revolutions</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/24/the-fractal-nature-of-scientific-revolutions-3/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/24/the-fractal-nature-of-scientific-revolutions-3/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Fri, 24 Jan 2025 14:27:40 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51591</guid>

					<description><![CDATA[I was contacted by a new website, Daily 27&#8211;they asked me to write a short article of 300-500 words, &#8220;aimed at the general public&#8221; and related to &#8220;statistics and social sciences.&#8221; Considering what I&#8217;d written of this length that would &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/24/the-fractal-nature-of-scientific-revolutions-3/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>I was contacted by a new website, <a href="https://daily27.info">Daily 27</a>&#8211;they asked me to write a short article of 300-500 words, &#8220;aimed at the general public&#8221; and related to &#8220;statistics and social sciences.&#8221;  Considering what I&#8217;d written of this length that would be of general interest, I thought of the fractal nature of scientific revolutions, an idea that came up in <a href="http://stat.columbia.edu/~gelman/research/published/philosophy_chapter.pdf">this article with Cosma Shalizi</a>, Philosophy and the practice of Bayesian statistics in the social sciences, published in 2011 in the Oxford Handbook of the Philosophy of the Social Sciences.</p>
<p>The <a href="https://daily27.info/2025/01/23/the-fractal-nature-of-scientific-revolutions/">Daily 27 version is here</a>.</p>
<p>Oddly enough, they put it in the &#8220;Society and Gender&#8221; category.  I guess it makes sense&#8211;I&#8217;m a dude, and that&#8217;s a gender, so, sure, why not?</p>
<p><strong>P.S.</strong> I did some rooting around on the site and found <a href="https://daily27.info/2024/12/27/self-control-and-willpower-a-key-to-success-in-work-and-social-life/">an article</a> by the notorious Roy Baumeister.  <a href="https://statmodeling.stat.columbia.edu/2016/06/23/it-comes-down-to-reality-and-its-fine-with-me-cause-ive-let-it-slide/">Oh no</a> (see also <a href="https://statmodeling.stat.columbia.edu/2020/05/31/association-for-psychological-science-lays-down-the-law-criminal-justice-reform-and-unrestricted-sociosexuality-for-us-but-not-for-you/">here</a> and <a href="https://statmodeling.stat.columbia.edu/2020/08/03/kinda-like-reeses-pieces-if-you-dont-like-chocolate-and-you-dont-like-peanut-butter/">here</a>).  Oh well.  I&#8217;m not too proud to publish somewhere that also publishes Baumeister.  I also occasionally write for PNAS.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/24/the-fractal-nature-of-scientific-revolutions-3/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
		<item>
		<title>Science and the malleability of the self</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/23/science-and-the-malleability-of-the-self/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/23/science-and-the-malleability-of-the-self/#comments</comments>
		
		<dc:creator><![CDATA[Jessica Hullman]]></dc:creator>
		<pubDate>Thu, 23 Jan 2025 18:00:20 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<category><![CDATA[Sociology]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51589</guid>

					<description><![CDATA[&#8220;We still don’t really have any sure answers. We have made so much progress in some spheres, and yet so very little in others. We ought to feel the sting of our own ignorance; beset and bested on all sides &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/23/science-and-the-malleability-of-the-self/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<blockquote><p><span style="font-weight: 400">&#8220;We still don’t really have any sure answers. We have made so much progress in some spheres, and yet so very little in others. We ought to feel the sting of our own ignorance; beset and bested on all sides by the things we do not know. Why then do so many academics — and applied, business, whatever, “sciency” people — want to act as a colossus, act as if astride the world?&#8221;</span></p></blockquote>
<p><span style="font-weight: 400">This is Jessica. Rachael Meager wrote a </span><a href="https://rottenandgood.substack.com/p/taking-our-chances"><span style="font-weight: 400">really great essay</span></a><span style="font-weight: 400"> last week that ranged over many topics close to my heart (and perhaps the heart of this blog, if it has one), from superficial appeals to rigor and objectivity, heedless acceptance of generalities, disbelief in chance, and a tolerance (or even desire for) breathless predictions with no real mechanism for disincentivizing overconfidence. </span></p>
<p><span style="font-weight: 400">Ultimately it resonated as a meditation on how deep our discomfort runs when it comes to engaging with uncertainty as we construct our little research empires. These are themes that have been on my mind frequently over the last few years. I’d always been interested in limitations of scientific methods but it wasn’t until around the time I got tenure that I felt like I really “dove into the sea of uncertainty and stuck my head in and started blowing bubbles”, to paraphrase a quote from Andrew that Meager also reminded me of. One of the most striking parts of all this was realizing the extent to which people find uncertainty and doubt threatening, even as I was being careful to keep my opinions mostly to myself. </span><span style="font-weight: 400">It’s almost as if questioning the premises or foundations of certain research threatens the very “personhood” of the researchers involved. </span></p>
<p><span style="font-weight: 400">And so, in reading Meager’s post, I found myself thinking about how “avoiding the sting of ignorance” can go hand-in-hand with treating research as a kind of self-affirmation, where scientific investigation gets confused with the unfolding of the scientist’s personal narrative. And counterproductive norms develop to protect the researcher’s right to assert (and preserve) their self through their research.</span></p>
<p><span style="font-weight: 400">Meager writes:</span></p>
<blockquote><p><span style="font-weight: 400">Early on in my work I had to contend with people who thought quantitative evidence aggregation was unnecessary or useless in economics; later on some of these same people thought, well, look at the Cochrane handbook, this problem has already been solved. That’s sort of like looking at a stack of bibles and thinking we know everything about god. Actually people wrote the bible, and a lot of those people were fools.</span></p></blockquote>
<p><span style="font-weight: 400">One of the things I find most frustrating when some specific paper or line of work gets criticized as overconfident or unjustified is the seeming unwillingness on the part of the authors to risk something, to stop the knee jerk reaction to find a rationalization and instead actually sit with the possibility that they failed to consider something important. Instead, the fact that their work fits well into some larger landscape of research, where others have asked similar questions using similar methods gets interpreted as a sort of insurance against failing. I remember as a grad student learning that one technique for rebutting critical reviews on a paper you submitted is to find cases where the venue had published other papers that used the same method. Because if they did it and it was acceptable, why should I be held to a higher standard? And so we proceed as if it’s taken for granted that no criticism could ever be so bad as to motivate rescinding the idea.</span></p>
<blockquote><p><span style="font-weight: 400">I think most people do not really believe in inference, for inference means things are hidden. They only believe their own eyes. If something happens they think it obliterates all alternatives. It never could have been a different way. I think that lots of people think the whole world is completely deterministic.</span></p></blockquote>
<p><span style="font-weight: 400">As a result of this, there is enormous pressure to just go along with things. To say, “Okay, it’s fine” when someone presents you with some half-baked result, even after you just explained why you think it is quite far from fine and you don’t see how there is anything that can be said from the results at all. This is not to say that there is never any room for recognizing that the criticism makes a good point–this certainly happens sometimes, and sometimes the critical papers even win awards. It’s more that there is little precedent for truly taking back an idea. Once it’s out there, it must be right. And so once published, even the very bad ideas enjoy a kind of halo that paints them as still somehow valid in some way, and the authors of those papers are still justified in selling them without any real acknowledgement of the limitations. </span></p>
<p><span style="font-weight: 400">Andrew’s analogizing of publication as a kind of truth mill gets at part of this, but I’m interested in how this arises in part as a result of seeing research as a kind of affirmation of the self. It’s as if we’ve rewritten the definition of scientific learning to be something that stems from ourselves rather than the constraints of the world or of the representational systems that bind us in our particular fields. “Science as a celebration of the self” gets written into how we treat publications and academic milestones like dissertations, even as we claim to be interested only in “objective” truth. Denying that the work is ready gets confused with an attack on one’s right to exist. </span></p>
<blockquote><p><span style="font-weight: 400">People claim to care a lot about rigour, and exactitude, and chance. But people claim to care about a lot of things.</span></p>
<p><span style="font-weight: 400">With chance, in practice, people usually disrespect it. A lot of the time when people say they care about something, they just like the idea of the thing. To respect and to accept something is real is to accept that at times, like all things, it brings challenges. But that is not the way that we normally treat chance.</span></p></blockquote>
<p><span style="font-weight: 400">I was reviewing a paper recently where I was the only reviewer who disagreed with the premise of the work. To me the work was emblematic of how low expectations in this particular field about what constitutes a research contribution have come to validate authors churning out derivative “synthesizing” papers that do some shallow reflecting on trends in some emerging body of literature that was itself ad-hoc and reactive. It was so unambitious, a shot at the “minimum publishable unit”, with no clear generalizable contribution. I tried to get myself to think pragmatically, that maybe this paper was saying something other people would benefit from hearing even if I considered it a waste of time, but I couldn’t find it. And still, among the reviewers, I was the one who appeared to be out of touch. As if the fact that the paper was reasonably well-written and reacting to what had come before meant it was off the table that I could question it in this way. </span></p>
<p><span style="font-weight: 400">This attitude–that those who question the premises or take the holes too seriously must be wrong–is canonized through jokes about R2, the curmudgeonly reviewer who must have some personal affliction if they are trying to shoot down our work. I mean, how can they be justified when the </span><i><span style="font-weight: 400">majority of reviewers</span></i><span style="font-weight: 400"> agreed we were doing fine? Of course we deserve to go to the conferences and have a good time. That’s what we’re entitled to when we submit. If we do happen to get rejected, well, that&#8217;s when we believe in randomness and chance. We must have just gotten unlucky. It can’t be that we were wrong.</span></p>
<p><span style="font-weight: 400">So on some level it’s not that we never acknowlege uncertainty, it’s that we are only willing to accept it when it&#8217;s the wind that blows in the direction of our unfolding personal narrative. The converse of believing that some very hard problem has been “solved” because someone else has worked on it is refusing to accept that there can be a solution or that some real progress has been made, because that would constrain the development of our research identity. I see this a lot in research related to human-computer interfaces, whether it’s designing interactive data visualization systems or interfaces to AI/ML models.  For example, I get easily annoyed when people talk about topics like “appropriate reliance” on a predictive model or “evaluating AI explanations,” where often they can provide no clear definition of what they even mean. Meanwhile, decision theory provides a basis for clarifying much of this, but when you present them with such a perspective the response is like, &#8220;Well, that is one opinion&#8221; and they go right back to letting their intuition drive. </span></p>
<p><span style="font-weight: 400">Or, you point out that a paper already exists that technically solves the problem they care about but they don’t want to acknowledge this, because now that they had the idea that a solution is needed, it must come from them. It’s like we’re very good at pattern matching on the problem, but unwilling to do it on solutions. Once you’ve decided that you want to work on a problem, then it can’t be solved (at least, not by anyone but you). </span></p>
<blockquote><p><span style="font-weight: 400">The problem is that trying to really learn something new is humbling, especially if you come at it from a position of apparent success. The ego gets built up so, so easily, when things are going well, unless you work hard for it not to. It is not just online writers and clout-chasing posters who mistake numbers or success for importance, or for virtue. It is also real for tenured academics</span></p></blockquote>
<p><span style="font-weight: 400">The ego really does get in the way, and yet, we seem to welcome this. In computer science, we basically train for it. Advancing as a researcher coincides with imposing assertions, rather than learning to listen and slow down and study all of the details before speaking up. We’ve created an academic culture where being confused or at a loss for words has no place. Instead we’re often content to accept vibes in the guise of academic rigor. It’s easier to pretend the ego with its various hopes and dreams really is enough to suss out truth than to try to actually find it. </span></p>
<p><span style="font-weight: 400">And so it’s not necessarily surprising that many trends in how we publish and make hiring and promotion decisions point away from doing deep, thoughtful work, and toward pushing out as many assertions as one can. The expectation that everyone be publishing tens of papers a year, even as a Ph.D. student (at least in certain areas of computer science) is hardly likely to foster deep engagement with uncertainty. Ben Recht has called for </span><a href="https://substack.com/home/post/p-139922548"><span style="font-weight: 400">less publishing</span></a><span style="font-weight: 400">; similar proposals (judge papers based on quality, look at impact) seem to arise every few years since I’ve been a grad student, but so far none of these have made so much as a dent in the hyper-productivity. From the perspective that we’ve created a system with built-in safeguards against any serious threats to the researcher’s sense of self, is there really an alternative? If going deep is too risky, then we’re left with the churn. </span></p>
<p><span style="font-weight: 400">The funny thing is, even in the circumstances where there is the highest premium on being in complete control, like interviews or job talks, often it’s still possible to glimpse that there is an honest human beneath all of the storytelling, who does have a sense of what they don’t know. It’s just that we’ve trained researchers to be extremely cautious about admitting that. Better to pretend that you fully understand it all than to acknowledge in front of others that there’s still some confusion. </span></p>
<p><span style="font-weight: 400">As Meager writes, “</span><i><span style="font-weight: 400">there can be a real tradeoff between shooting for (mortal, fleeting) glory and carving out the time and space for good and deep and satisfying work of any kind.</span></i><span style="font-weight: 400">”</span></p>
<p><span style="font-weight: 400">Ultimately, what one works on, and how much uncertainty one chooses to let in, is a personal decision. The hard part of criticizing researchers for using their work as a form of self-affirmation or self-preservation is that of course we should expect research trajectories to be individualized. Everyone progresses in their thinking at their own rate, in directions that depend on their personal attractions to different topics and insecurities they perceive they must overcome. We become accustomed to, even entrenched in, our preferred technical languages or ways of seeing, such that it’s hard to completely blame someone if they refuse to acknowledge a solution in another language, or to relate to doubts raised by someone they see as outside their preferred tradition. Trying to remove the personal entirely from scientific inquiry is unrealistic. </span></p>
<p><span style="font-weight: 400">And yet, I guess part of the sentiment behind this post, which Meager’s excellent essay really got me thinking about, is that as much as “the system” might seem to prevent us from going deep and internalizing doubt, ultimately it’s a personal choice. You can either choose to be the kind of scientist who’s in it for the personal glory or you can let uncertainty in. You can let “the standards” dictate how many papers you write a year and how ambitious each is, or you can try to be honest with yourself about when you’ve learned something worth sharing. There may be no way of omitting the self from science, but there are certainly different ways of using it. Research can be about self-discovery, but if there&#8217;s no risk, how can there be growth?</span></p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/23/science-and-the-malleability-of-the-self/feed/</wfw:commentRss>
			<slash:comments>42</slash:comments>
		
		
			</item>
		<item>
		<title>When estimating a treatment effect with a cluster design, you need to include varying slopes, even if the fit gives warning messages.</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/23/slopes/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/23/slopes/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Thu, 23 Jan 2025 14:50:01 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<category><![CDATA[Multilevel Modeling]]></category>
		<category><![CDATA[Statistical Computing]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51585</guid>

					<description><![CDATA[Nick Brown and I recently published this paper, How statistical challenges and misreadings of the literature combine to produce unreplicable science: An example from psychology, in the journal Advances in Methods and Practices in Psychological Science, about a published article that &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/23/slopes/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Nick Brown and I recently published this paper, <a href="http://stat.columbia.edu/~gelman/research/published/healing3.pdf">How statistical challenges and misreadings of the literature combine to produce unreplicable science: An example from psychology</a>, in the journal Advances in Methods and Practices in Psychological Science, about a published article that claimed to find that healing of bruises could be sped or slowed by manipulating people’s subjective sense of time. The <a href="https://www.nature.com/articles/s41598-023-50009-3">article under discussion</a> had major problems&#8211;including weak theory, a flawed data analysis, and multiple errors in summarizing the literature&#8211;to the extent that we did not find their claims to be supported by their evidence.</p>
<p>That sort of thing happens&#8211;<a href="https://statmodeling.stat.columbia.edu/2016/07/08/29495/">we call it</a> everyday, bread-and-butter pseudoscience. The point of the new paper by Nick and me was not so much to shoot down one particular claim of mind-body healing but rather to explore some general reasons for scientific error and overconfidence and to provide a template for researchers to avoid these problems in the future.</p>
<p>We posted this the other day on the blog (under the heading, <a href="https://statmodeling.stat.columbia.edu/2025/01/17/7steps/">7 steps to junk science that can achieve worldly success</a>), and something came up in comments, something I hadn&#8217;t thought of before and I wanted to share with you.  The commenter, Jean-Paul, <a href="https://statmodeling.stat.columbia.edu/2025/01/17/7steps/#comment-2388194">pointed to</a> a misunderstanding that can arise when fitting multilevel models using some software.</p>
<p><strong>Need for multilevel modeling or some equivalent adjustment for clustering in data</strong></p>
<p>Here was the problem.  The article in question estimated treatment effects with a cluster design: they had three treatments applied to 33 research participants (in psychology jargon, &#8220;subjects&#8221;) with data provided by 25 raters.  The total number of measurements was 2425 (not quite 3*33*25 because of some missing data, which is not the point in this story).</p>
<p>When you want to estimate treatment effects from a clustered design, you need to fit a multilevel model or make some equivalent adjustment.  Otherwise your standard errors will be too small.  In this case you should include effects for participants (some have more prominent bruises than others in this study) and for raters (who will have systematic differences in how they characterize the severity of a bruise using a numerical rating).</p>
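<p>To see concretely why ignoring the clustering understates uncertainty, here is a minimal base-R simulation with made-up numbers (it has nothing to do with the actual study; I just set the between-subject and residual standard deviations both to 1):</p>
<pre>
# Simulated clustered data: 33 subjects, 75 ratings each
set.seed(123)
n_subj = 33
n_rep = 75
subj = rep(1:n_subj, each = n_rep)
subj_effect = rnorm(n_subj, 0, 1)      # between-subject variation
y = 5 + subj_effect[subj] + rnorm(n_subj * n_rep, 0, 1)

# Naive standard error for the mean, treating all 2475 ratings as independent:
naive_se = sd(y) / sqrt(length(y))

# Standard error respecting the clustering (one mean per subject):
cluster_se = sd(tapply(y, subj, mean)) / sqrt(n_subj)

round(c(naive = naive_se, clustered = cluster_se), 3)
</pre>
<p>With these settings the cluster-respecting standard error comes out several times larger than the naive one, which is the same phenomenon, in miniature, as what happens in the analysis below.</p>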
<p>The researchers knew to do this, so they fit a multilevel model.  Fortunately for me, they used the R program lmer, which I&#8217;m familiar with.  Here&#8217;s the output, which for clarity I show using the display function in the arm package:</p>
<pre>
lmer(formula = Healing ~ Condition + (1 | Subject) + (1 | ResponseId), 
    data = DFmodel, REML = FALSE)
            coef.est coef.se
(Intercept) 6.20     0.31   
Condition28 0.23     0.09   
Condition56 1.05     0.09   

Error terms:
 Groups     Name        Std.Dev.
 Subject    (Intercept) 1.07    
 ResponseId (Intercept) 1.22    
 Residual               1.87    
---
number of obs: 2425, groups: Subject, 33; ResponseId, 25
AIC = 10131.4, DIC = 10119.4
deviance = 10119.4 
</pre>
<p>So, fine.  As expected, participant and rater effects are large&#8211;they vary by more than the magnitudes of the estimated treatment effects!&#8211;so it&#8217;s a good thing we adjusted for them.  And the main effects are a stunning 2.4 and 11.1 standard errors away from zero, a big win!</p>
<p><strong>Need for varying slopes, not just varying intercepts</strong></p>
<p>But . . . wait a minute. The point of this analysis is not just to estimate average responses; it&#8217;s to estimate treatment effects.  And, for that, we need to allow the treatment effects to vary by participant and rater&#8211;in multilevel modeling terms, we should be allowing slopes as well as intercepts to vary. This is in accordance with the well-known general principle of the design and analysis of experiments that the error term for any comparison should be at the level of analysis of the comparison.</p>
<p>No problem, lmer can do that, and we do so!  Here&#8217;s the result:</p>
<pre>
lmer(formula = Healing ~ Condition + (1 + Condition | Subject) + 
    (1 + Condition | ResponseId), data = DFmodel, REML = FALSE)
            coef.est coef.se
(Intercept) 6.18     0.39   
Condition28 0.25     0.36   
Condition56 1.09     0.37   

Error terms:
 Groups     Name        Std.Dev. Corr        
 Subject    (Intercept) 1.71                 
            Condition28 1.99     -0.71       
            Condition56 2.03     -0.72  0.65 
 ResponseId (Intercept) 1.24                 
            Condition28 0.07      1.00       
            Condition56 0.13     -1.00 -1.00 
 Residual               1.51                 
---
number of obs: 2425, groups: Subject, 33; ResponseId, 25
AIC = 9317.9, DIC = 9285.9
deviance = 9285.9 
</pre>
<p>The estimated average treatment effects have barely changed, but the standard errors are much bigger.  The fitted model shows that treatment effects vary a lot by participant, not so much by rater.</p>
<p>You might be happy because the t-statistic for one of the treatment comparisons is still 3 standard errors away from zero, but other problems remain, both with multiple comparisons and with the interpretation of the data, as we discuss in section 2.4 of <a href="http://stat.columbia.edu/~gelman/research/published/healing3.pdf">our paper</a>.</p>
<p>There&#8217;s one thing you might notice if you look carefully at the above output, which is that the estimated covariance matrix for the rater effects is degenerate:  the correlations are at 1 and -1.  This sort of thing happens a lot with multilevel models when data are noisy and the number of groups is small.  It&#8217;s an issue we discussed in <a href="http://stat.columbia.edu/~gelman/research/published/chung_cov_matrices.pdf">this paper</a> from 2014.</p>
<p>In practice, it&#8217;s no big deal here:  the degeneracy is a consequence of a very noisy estimate, which itself arises because the noise in the data is much larger than the signal, which itself arises because the variation in treatment effects across raters is indistinguishable from zero (look at those small estimated standard deviations of the varying slopes for rater).  But that&#8217;s all pretty subtle, not something that&#8217;s in any textbook, even <a href="https://stat.columbia.edu/~gelman/arm/">ours</a>!</p>
<p><strong>That scary warning on the computer</strong></p>
<p>Here&#8217;s the problem.  When you fit that varying slopes model, R displays a warning in scary red type:</p>
<p><img decoding="async" src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2025/01/Screenshot-2025-01-23-at-09.14.39-1024x158.png" alt="" width="450" /></p>
<p>It&#8217;s possible that the researchers tried this varying-slope fit, saw the singularity, got scared, and retreated to the simpler varying-intercept model, which had the incidental benefit of giving them an estimated treatment effect that was 11 standard errors from zero.</p>
<p>Just to be clear, I&#8217;m not saying this is a problem with R, or with lme4.  Indeed, if you type help(&#8220;isSingular&#8221;) in the R console, you&#8217;ll get some reasonable advice, none of which is to throw out all the varying slopes.  But the first option suggested there is to &#8220;avoid fitting overly complex models in the first place,&#8221; which could be misunderstood by users.</p>
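<p>For what it&#8217;s worth, lme4 also gives you tools to diagnose a singular fit rather than abandon it.  Here&#8217;s a sketch on toy simulated data (engineered so the true slope variance is zero, which typically triggers the same boundary warning):</p>
<pre>
library(lme4)

# Toy data: 8 groups, a binary predictor x, and no true slope variation
set.seed(1)
d = data.frame(g = factor(rep(1:8, each = 10)), x = rep(0:1, 40))
d$y = rnorm(8)[d$g] + 0.3 * d$x + rnorm(80)

fit = lmer(y ~ x + (1 + x | g), data = d)   # often "boundary (singular) fit"
isSingular(fit)                             # typically TRUE for data like these
VarCorr(fit)                                # see which component hit the boundary
</pre>
<p>The point is that isSingular() and VarCorr() tell you <em>where</em> the degeneracy is&#8211;and, as in the bruise example, it may be in a component that hardly matters for the treatment-effect inference.</p>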
<p>If you fit the model using stan_lmer on its default settings, everything works fine:</p>
<pre>
 family:       gaussian [identity]
 formula:      Healing ~ Condition + (1 + Condition | Subject) + (1 + Condition | 
	   ResponseId)
 observations: 2425
------
            Median MAD_SD
(Intercept) 6.22   0.40  
Condition28 0.26   0.34  
Condition56 1.09   0.37  

Auxiliary parameter(s):
      Median MAD_SD
sigma 1.51   0.02  

Error terms:
 Groups     Name        Std.Dev. Corr       
 Subject    (Intercept) 1.731               
            Condition28 2.015    -0.63      
            Condition56 2.056    -0.66  0.57
 ResponseId (Intercept) 1.225               
            Condition28 0.158     0.33      
            Condition56 0.172    -0.51  0.04
 Residual               1.509               
Num. levels: Subject 33, ResponseId 25 
</pre>
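<p>For reference, the call producing output like the above is just the same formula handed to rstanarm (a sketch; DFmodel stands for the authors&#8217; dataset, which I&#8217;m not including here):</p>
<pre>
library(rstanarm)

fit_bayes = stan_lmer(
  Healing ~ Condition + (1 + Condition | Subject) + (1 + Condition | ResponseId),
  data = DFmodel          # the authors' data, not included here
)
print(fit_bayes, digits = 2)
</pre>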
<p>These estimates average over the posterior distribution, so nothing on the boundary, no problem.  And, no surprise, the estimates and standard errors of the treatment effects are basically unchanged.</p>
<p>That said, when you run stan_lmer, it gives warnings in red too!</p>
<p><img decoding="async" src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2025/01/Screenshot-2025-01-23-at-09.34.33-1024x212.png" alt="" width="650" /></p>
<p>These warning messages are a big problem!  On one hand, yeah, you want to warn people; on the other hand, it would be just horrible if the warnings are so scary that they send users to use bad models that don&#8217;t happen to spit out warnings.</p>
<p>I also tried fitting the model using blmer, which I thought would work fine, but that produced some warning messages that scared even me:</p>
<p><img decoding="async" src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2025/01/Screenshot-2025-01-23-at-09.38.09-1024x169.png" alt="" width="550" /></p>
<p>By default, blme uses a degeneracy-avoiding prior, so this sort of thing shouldn&#8217;t happen at all. We should figure out what&#8217;s going on here!</p>
<p><strong>Summary</strong></p>
<p>1.  If you are estimating treatment effects using clustered data, you should be fitting a multilevel model with varying intercepts and slopes (or do the equivalent adjustment in some other way, for example by bootstrapping according to the cluster structure).  Varying intercepts is not enough.  If you do it wrong, your standard errors can be way off.  This is all consistent with the general principle to use all design information in the analysis.</p>
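<p>Here&#8217;s a sketch of what the cluster bootstrap mentioned in point 1 looks like in base R, again with made-up data; the key is to resample whole subjects, never individual observations:</p>
<pre>
# Simulated data: 30 subjects, within-subject treatment, true effect 0.5
set.seed(42)
n_subj = 30
n_rep = 20
subj = rep(1:n_subj, each = n_rep)
trt = rep(rep(0:1, each = n_rep / 2), n_subj)
y = rnorm(n_subj)[subj] + 0.5 * trt + rnorm(n_subj * n_rep)

# Resample clusters (subjects) with replacement, recompute the effect each time
boot_diff = replicate(2000, {
  ids = sample(unique(subj), replace = TRUE)
  yy = unlist(lapply(ids, function(i) y[subj == i]))
  tt = unlist(lapply(ids, function(i) trt[subj == i]))
  mean(yy[tt == 1]) - mean(yy[tt == 0])
})
sd(boot_diff)   # cluster-respecting standard error of the treatment effect
</pre>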
<p>2.  If you apply a more complicated method on the computer and it gives you a warning message, this does <em>not</em> mean that you should go back to a simpler method.  More complicated models can be harder to fit.  This is something you might have to deal with.  If it really bothers you, then be more careful in your data collection; you can design studies that can be analyzed more simply.  Or you can do some averaging of your data, which might lose some statistical efficiency but will allow you to use simpler statistical methods (for example, see section 2.3, &#8220;Simple paired-comparisons analysis,&#8221; of <a href="http://stat.columbia.edu/~gelman/research/published/healing3.pdf">our paper</a>).</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/23/slopes/feed/</wfw:commentRss>
			<slash:comments>18</slash:comments>
		
		
			</item>
		<item>
		<title>Dan Ariely:  &#8220;Why Louisiana’s Ten Commandments law is a broken moral compass&#8221;</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/22/ariely-why-louisianas-ten-commandments-law-is-a-broken-moral-compass/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/22/ariely-why-louisianas-ten-commandments-law-is-a-broken-moral-compass/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Wed, 22 Jan 2025 14:37:17 +0000</pubDate>
				<category><![CDATA[Decision Analysis]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51044</guid>

					<description><![CDATA[You think I&#8217;m kidding with the above title, but I&#8217;m not. Key quote from the noted social psychologist / business-school professor: The assumption that the Ten Commandments can serve as a universal moral code is increasingly out of touch with &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/22/ariely-why-louisianas-ten-commandments-law-is-a-broken-moral-compass/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>You think I&#8217;m kidding with the above title, <a href="https://thehill.com/opinion/education/4768231-louisiana-ten-commandments-law/">but I&#8217;m not</a>.  Key quote from the noted social psychologist / business-school professor:</p>
<blockquote><p>The assumption that the Ten Commandments can serve as a universal moral code is increasingly out of touch with contemporary American society.</p></blockquote>
<p>I didn&#8217;t check out all the links in this article, but my guess is that Ariely&#8217;s problem with the 10 Commandments is that God <a href="https://www.newyorker.com/magazine/2023/10/09/they-studied-dishonesty-was-their-work-a-lie">signed the tablets</a> at the bottom, not at the top.</p>
<p>Or maybe he didn&#8217;t like that commandment, &#8220;Thou shalt not make unto thee any graven image,&#8221; which, when translated into the language of modern science, might imply, &#8220;Thou shalt not fabricate data.&#8221;  Those damn commandments, always getting in the way!</p>
<p><strong>P.S.</strong>  You might say, why write about this?  Why not just decorously look away.  Well, remember what they say about <a href="https://statmodeling.stat.columbia.edu/2019/05/02/one-should-always-beat-a-dead-horse-because-the-horse-is-never-really-dead/">dead horses</a>.  The Hill is a reputable publication, and they&#8217;re using their space to promote questionable work.  I think this kind of thing is bad for science as a whole.  It&#8217;s just particularly ridiculous when they invite someone with major ethical concerns about his research to be lecturing us on ethics.  I do think mockery is appropriate here.  It&#8217;s nothing personal.  I just think of all the people in social science, working so hard and not pushing the ethical boundaries, working openly and honestly and with a sincere willingness to learn . . . and then this is what gets rewarded with publicity.  I laugh because otherwise I&#8217;d scream.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/22/ariely-why-louisianas-ten-commandments-law-is-a-broken-moral-compass/feed/</wfw:commentRss>
			<slash:comments>24</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;Interrogating Ethnography&#8221;:  The Alice Goffman story</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/21/interrogating-ethnicity-the-alice-goffman-story/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/21/interrogating-ethnicity-the-alice-goffman-story/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Tue, 21 Jan 2025 14:33:07 +0000</pubDate>
				<category><![CDATA[Sociology]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50595</guid>

					<description><![CDATA[I came across this book from 2018, Interrogating Ethnography: Why Evidence Matters, by law professor Steven Lubet. It&#8217;s a crisp (137 pages) and fascinating discussion of the role of evidence in qualitative social science, and I think it should be &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/21/interrogating-ethnicity-the-alice-goffman-story/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>I came across this book from 2018, Interrogating Ethnography:  Why Evidence Matters, by law professor Steven Lubet.  It&#8217;s a crisp (137 pages) and fascinating discussion of the role of evidence in qualitative social science, and I think it should be of interest to many of you, as it parallels so many discussions we&#8217;ve had over the years regarding the role of evidence in quantitative research.</p>
<p>Sometimes I&#8217;ve had negative reactions to writings by law professors on social science, but in this case Lubet&#8217;s expertise is relevant, as so many legal cases turn on evidence.</p>
<p>Lubet discusses several examples, focusing on sociologist Alice Goffman&#8217;s controversial 2015 book On the Run.  As we <a href="https://statmodeling.stat.columbia.edu/2016/01/20/rogue-sociologist-cant-stop-roguin/">discussed a few years ago</a>, it&#8217;s a problem of trust.  Goffman offers no documentation for her extraordinary claims and thus must rely on her readers and colleagues to trust her statements and treat them as fact.  In this case, trust is brittle, and once the trust is gone, not much remains.</p>
<p>One reason Lubet&#8217;s book is interesting is that he gets into the details and presents things very carefully.  Just for example, from page 131:</p>
<blockquote><p>It is unfortunate that ethnographers have so seldom essayed revisits to others&#8217; research sites.  Despite the obvious difficulties, there are cases in which the impediments can be readily overcome.  It would not take long for an ethnographer to interview personnel at the hospitals in West Philadelphia where Alice Goffman claims to have seen police cordons at the entrances.  Moreover, there are only six hospitals in Philadelphia with maternity services, so it would be possible, even now, to fact-check Goffman&#8217;s story of having observed the arrests of three new fathers on the same ward in a single evening.</p></blockquote>
<p>I&#8217;m guessing that this maternity ward falls <a href="https://statmodeling.stat.columbia.edu/2022/01/06/shreddergate-and-an-idea-for-a-museum-of-scholarly-misconduct/">into the same category as</a> Marc Hauser’s monkey tapes, Brian Wansink’s bottomless soup bowl and his 80-pound rock, Diederik Stapel’s survey forms, Mary Rosh’s survey forms, Michael Bellesiles’s probate inventories, Matthew Walker&#8217;s National Geographic video, the Surgisphere dataset, and Dan Ariely&#8217;s paper shredder.  But all things are possible.</p>
<p>The other thing notable about Lubet&#8217;s book is its even tone.  Some of the stories in the book are funny, others are kinda shocking, and Lubet manages to convey all this without himself ever expressing amusement or outrage.  There&#8217;s nothing wrong with expressing amusement or outrage&#8212;I do it all the time!&#8212;it&#8217;s just impressive to me how he wrote this entire book with a straight face.  I recommend it.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/21/interrogating-ethnicity-the-alice-goffman-story/feed/</wfw:commentRss>
			<slash:comments>32</slash:comments>
		
		
			</item>
		<item>
		<title>The &#8220;delay-the-reckoning heuristic&#8221; in pro football?</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/20/the-delay-the-reckoning-heuristic-in-pro-football/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/20/the-delay-the-reckoning-heuristic-in-pro-football/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Mon, 20 Jan 2025 14:03:32 +0000</pubDate>
				<category><![CDATA[Decision Analysis]]></category>
		<category><![CDATA[Sports]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51065</guid>

					<description><![CDATA[Paul Campos tells this story of an NFL coach making a bad decision: Denver kicked a field goal with 1:54 to go to make the score 13-6 Pittsburgh. Denver had one time out remaining. At this point [Denver coach] Payton &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/20/the-delay-the-reckoning-heuristic-in-pro-football/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Paul Campos tells <a href="https://www.lawyersgunsmoneyblog.com/2024/09/statistical-estimates-on-the-fly-the-sean-payton-saga">this story</a> of an NFL coach making a bad decision:</p>
<blockquote><p>Denver kicked a field goal with 1:54 to go to make the score 13-6 Pittsburgh. Denver had one time out remaining. At this point [Denver coach] Payton elected to kick the ball off through the end zone rather than try an onside kick. If you do the math, this meant that the reasonable best case scenario for Denver was that they would stop Pittsburgh from getting a first down after three running plays, and get the ball back after a punt deep in their own territory with about ten seconds left and no time outs. This is in fact what happened. Since the odds of scoring a game-tying TD in this situation are almost zero, the question that naturally arose after the game is why Payton didn’t try an onside kick. . . .</p>
<p>Now the dumbest part of all this is that the downside of a failed onside kick in this situation is trivial. If the kick fails, Pittsburgh gets the ball around the 50 rather than on their own 30 after the kick through the end zone. If Pittsburgh makes a first down in either situation the game is over so that’s irrelevant. But what’s the downside of Pittsburgh punting from the Denver 44 instead of, as they did, from the Pittsburgh 36? This is a 20-yard difference, but, because the end zone serves as a constraint on punters since a punt that goes into the end zone comes out to the 20, the real difference in field position as a practical matter is probably more like ten yards, with the likely outcome being Denver getting the ball on its own 20 rather than its own ten. So Payton passed up a chance to get the ball back at midfield with nearly two minutes and a time out left — a very manageable situation if you need a TD — for a realistic best case scenario that would have required something tantamount to a miracle to result in a TD.</p></blockquote>
<p>Campos&#8217;s framing of the decision, a framing that seems reasonable to me, is that Denver, down by 7 points with 1:54 to go, had two options with their kickoff:</p>
<p>1.  Deep kick, then try to keep Pittsburgh from getting a first down, then try to score a touchdown with the one or two plays remaining after the punt recovery.</p>
<p>2.  Onside kick, then in the unlikely event that Denver recovers (approximately 6% chance, according to <a href="https://operations.nfl.com/gameday/analytics/stats-articles/where-have-all-the-points-gone/">this source</a>), they have more than a minute and a half to try to reach the end zone.</p>
<p>Campos argues that option 2 is better because, even conditional on stopping the first down and getting the ball back, the probability of scoring a touchdown in one or two plays, starting from deep in your own territory, is so much lower than the probability of scoring a touchdown from closer to midfield with a minute and a half to go.  If Denver has a 60% chance of stopping Pittsburgh from getting the first down, then this would imply that Campos thinks that Denver would be much less than 1/10th as likely to get the touchdown in one or two plays than with a minute and a half and one time out.  I don&#8217;t know these probabilities, but I assume a pro football coach would have an assistant who&#8217;d be able to access these numbers instantly.  As Campos says, &#8220;all of this happened immediately after the two minute warning, so he and his staff didn’t have to make snap decision: they had three minutes of beer and ED commercials to figure out this statistical puzzle.&#8221;</p>
<p>Assuming that (a) Campos is correct about the probabilities, and (b) that the decision isn&#8217;t even close, the question then arises, why did the coach make the wrong decision (which, again, we&#8217;re assuming was wrong prospectively, not just retrospectively)?</p>
<p>Campos offers three explanations:</p>
<blockquote><p>(1) When choosing between a course of action that creates a very slim chance of winning — onside kicks are rarely recovered — and one that creates a much slimmer chance, coaches tend to treat the decision in an irrational way, because what’s the difference between a one in 50 chance of winning and a one in 200 . . . when the baseline is low enough it’s like it doesn’t even matter. Now this is very likely true in this one particular instance, but it’s very much NOT true in the long run, when making many similar decisions over time.</p>
<p>(2) Coaches have a strong and strongly irrational preference for delaying the arrival of certain defeat for as long as possible, even at the cost of greatly reducing the odds of actually winning. . . . [This fits] Payton’s mentality throughout this game, including the decision to kick a field goal while down 13-0 with ten minutes to go, and even more so the decision to punt on fourth and eight from the Denver 33 while down by ten with seven and half minutes to go. Such decisions are both more likely to cause the moment of certain defeat to arrive later than it would arrive otherwise, and seriously suboptimal in terms of increasing the chances of actually winning the game.</p>
<p>(3) A third factor in such decisions is that coaches would prefer defeat while pursuing the conventional course of action to defeat while doing something unconventional, since the latter makes them prone to heavier criticism, even if the criticism is wrong. . . .</p></blockquote>
<p>Let&#8217;s get that third factor out of the way first, as it often comes up in this sort of discussion of coaching decisions.  We <a href="https://statmodeling.stat.columbia.edu/2022/11/20/going-for-it-on-4th-down-whats-striking-is-not-so-much-that-we-were-wrong-but-that-we-had-so-little-imagination-that-we-didnt-even-consider-the-possibility-that-we-might-be-wrong-i-wonder-wh/">talked about it</a> a few years ago in the context of fourth-down decisions in the NFL (with <a href="https://statmodeling.stat.columbia.edu/2022/11/22/4th-down-update-and-my-own-cognitive-illusion/">followup here</a>).  The short answer is yeah, most coaches have a motivation to be conservative, as it looks worse if you do something bold and it fails, but in this case the decision seems so clear that I don&#8217;t think the onside kick is particularly controversial.</p>
<p>As for the first factor . . . sure, I guess the point is that when a decision seems unimportant (in this case, raising the probability of tying or winning from near-zero to higher but still very low), then there is less motivation to make the rational decision.  But, I don&#8217;t really buy this.  Think about it the other way.  From the coach&#8217;s perspective, if the probability of winning is tiny, then the game is already basically lost, so at that point the team is playing with the house&#8217;s money.  So why not roll the dice?  From a psychological standpoint, this seems fundamentally different from the case where you&#8217;re almost certain to win and you make a seemingly more conservative play even if it slightly increases your probability of losing, because if you&#8217;re gonna lose, you don&#8217;t want it to happen from what would be considered a weird play.</p>
<p>So then it comes down to Campos&#8217;s second argument, which interests me because it seems related to other decision-analysis paradoxes.  But, like other ideas in the always-confusing heuristics-and-biases literature, it introduces its own challenges.</p>
<p><strong>The delay-the-reckoning heuristic</strong></p>
<p>I&#8217;m gonna label this idea identified by Campos, that &#8220;Coaches have a strong and strongly irrational preference for delaying the arrival of certain defeat for as long as possible, even at the cost of greatly reducing the odds of actually winning,&#8221; as the <em>delay-the-reckoning heuristic</em>.</p>
<p>The general scenario is that you are at a fork in the decision tree, where one branch will give you decisive, or near-decisive, information right away, while the other branch will take you down further steps until the uncertainty is resolved.</p>
<p>Which fork will you take?</p>
<p>Just speaking in general terms, I feel like it could go either way.  Let me break the scenario into two sub-scenarios:  potential good news (as in the football example when you&#8217;re behind but there&#8217;s an outside chance you could get lucky and win) or potential bad news (if you were leading in the football game and there&#8217;s an outside chance you could get unlucky and lose).  Or you could think of medical examples:  you have a serious disease but there&#8217;s a potential miracle drug with a small chance of working, or you&#8217;re healthy and you&#8217;re going to take a blood test that might reveal you have an incurable cancer.</p>
<p>In the potential-good-news scenario, I agree with Campos that it somehow seems more natural (whatever that means) to delay the reckoning.  The idea is that you&#8217;re pretty sure it&#8217;s gonna be bad news, so you&#8217;d like to prolong the period of hope for as long as possible.  From a pure decision-analysis standpoint, it&#8217;s always better to get information sooner, as it can inform later steps in the decision problem, but from an emotional standpoint, I understand the appeal of keeping hope alive for as long as possible.</p>
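<p>That decision-analysis point can be made concrete with a toy two-state, two-action calculation (all the numbers here are invented for illustration): committing to an action before the uncertainty resolves can never beat learning the state first and then acting.</p>

```python
# Toy two-state, two-action decision problem (all numbers invented) showing
# that resolving uncertainty before acting can never lower the optimal
# expected utility.
p_good = 0.1  # prior probability of the long-shot good state

# utility[action][state]
utility = {
    "aggressive": {"good": 10.0, "bad": -5.0},
    "safe":       {"good": 1.0,  "bad": 1.0},
}

def expected_utility(action, p):
    return p * utility[action]["good"] + (1 - p) * utility[action]["bad"]

# Decide now, before the uncertainty resolves: commit to one action.
eu_decide_now = max(expected_utility(a, p_good) for a in utility)

# Learn the state first, then pick the best action in each state.
eu_learn_first = (p_good * max(utility[a]["good"] for a in utility)
                  + (1 - p_good) * max(utility[a]["bad"] for a in utility))

value_of_information = eu_learn_first - eu_decide_now
print(eu_decide_now, eu_learn_first, value_of_information)  # info is worth ~0.9 here
```

<p>The gap is nonnegative in general, because the maximum of expectations can never exceed the expectation of the state-by-state maxima; that&#8217;s the sense in which earlier information is always (weakly) better in pure decision-analysis terms.</p>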
<p>On the other hand, what about the &#8220;Give it to me straight, Doc&#8221; attitude?  If you&#8217;re in bad shape, maybe you want to just know already and move on.  So I&#8217;m not sure.</p>
<p>Also, I kind of understand the emotional logic to delaying the reckoning . . . but this is all happening within two minutes of a football game!  Is it really worth lowering your win probability just to gain an additional minute and a half of hope?  That doesn&#8217;t seem quite right.  I feel like there&#8217;s something more going on here.</p>
<p>What about the potential-bad-news scenario?  There I feel like it&#8217;s natural to want the information as soon as possible so as to rule out the unlikely bad outcome.  Or maybe not.  I feel like I&#8217;m working in the grand tradition of judgment and decision making research, which is to theorize based on personal impressions of hypothetical scenarios.</p>
<p>I sent the above to Dan Goldstein, an expert in judgment and decision making, and he pointed us to the book, <a href="https://direct.mit.edu/books/edited-volume/5010/Deliberate-IgnoranceChoosing-Not-to-Know">Deliberate Ignorance: Choosing Not to Know</a>, edited by Ralph Hertwig and Christoph Engel.  So maybe there&#8217;s something there that&#8217;s relevant to our discussion.</p>
<p>The delay-the-reckoning heuristic interests me for its own sake and also for its connection to other time-related decision analysis issues such as the base-rate fallacy and its opposite, the <a href="https://statmodeling.stat.columbia.edu/2018/05/11/slow-to-update/">slow-to-update problem</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/20/the-delay-the-reckoning-heuristic-in-pro-football/feed/</wfw:commentRss>
			<slash:comments>21</slash:comments>
		
		
			</item>
		<item>
		<title>Problems caused by grade inflation</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/19/problems-caused-by-grade-inflation/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/19/problems-caused-by-grade-inflation/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sun, 19 Jan 2025 14:52:36 +0000</pubDate>
				<category><![CDATA[Decision Analysis]]></category>
		<category><![CDATA[Economics]]></category>
		<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Teaching]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51040</guid>

					<description><![CDATA[Columbia math lecturer Peter Woit writes: There has been significant grade inflation over the years, so having a transcript with a string of As isn’t worth what it once was. This is not good for the unusually talented, who now &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/19/problems-caused-by-grade-inflation/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Columbia math lecturer Peter Woit <a href="https://www.math.columbia.edu/~woit/wordpress/?p=12069&#038;cpage=2#comment-238005">writes</a>:</p>
<blockquote><p>There has been significant grade inflation over the years, so having a transcript with a string of As isn’t worth what it once was. This is not good for the unusually talented, who now need to find other ways to distinguish themselves.</p></blockquote>
<p>That&#8217;s a good point!  I&#8217;ve typically thought of grade inflation in isolation (as in my post asking <a href="https://statmodeling.stat.columbia.edu/2011/07/27/12383/">why weren’t the instructors all giving all A’s already?</a>) with the problem being that inflated grades provide less information to future employers.</p>
<p>Woit&#8217;s point is related but goes further.  Now that A&#8217;s are given out like candy corn at the world&#8217;s worst Halloween party, they don&#8217;t provide much signal, first because, as Woit says, non-unusually-talented students can also get strings of A&#8217;s on their transcripts, and also because if you&#8217;re competing on grades, the occasional slip can be so costly.  Either way, ambitious students have to distinguish themselves in other ways&#8212;for example, by publishing articles in journals and conferences.  This propagation of &#8220;publish or perish&#8221; <a href="https://www.chronicle.com/article/teens-are-doing-ai-research-now-is-that-a-good-thing">down to the high school level</a> just exacerbates the explosion of publications&#8212;apparently, zillions of medical students are kinda required to publish research too, and if publication is a requirement, then the quality is not gonna matter so much, and these papers just get stirred in with whatever remaining legitimate literature is being produced.</p>
<p>So, yeah, if we were to give out more B&#8217;s and C&#8217;s, maybe the world would be a better place.</p>
<p>I&#8217;m not planning to go first, though.  As I wrote <a href="https://statmodeling.stat.columbia.edu/2011/07/27/12383/">a few years ago</a>, the real mystery to me is not, Why is there grade inflation?, but rather, Why is there any room left to inflate: why weren’t the instructors all giving all A’s already?</p>
<p>At that time, I recommended statistician Val Johnson&#8217;s plan to &#8220;make post-hoc adjustments to assigned grades to account for differences in faculty grading policies&#8221;&#8212;basically, fit a multilevel item-response model to estimate students&#8217; latent abilities based on their grades.  As I wrote at the time:</p>
<blockquote><p>The beauty of <a href="https://projecteuclid.org/journals/statistical-science/volume-12/issue-4/An-alternative-to-traditional-GPA-for-evaluating-student-performance/10.1214/ss/1030037959.full">Val’s approach</a> is that it does three things:</p>
<p>1. By statistically correcting for grading practices, Val’s method produces adjusted grades that are more informative measures of student ability.</p>
<p>2. Since students know their grades will be adjusted, they can choose and evaluate their classes based on what they expect to learn and how they expect to perform; they don’t have to worry about the extraneous factor of how easy the grading is.</p>
<p>3. Since instructors know the grades will be adjusted, they can assign grades for accuracy and not have to worry about the average grade. (They can still give all A’s but this will no longer be a benefit to the individual students after the course is over.)</p></blockquote>
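<p>To give the flavor of such an adjustment (this is a crude least-squares sketch on fake data, not Val&#8217;s actual multilevel item-response model), suppose each observed grade is student ability plus course leniency plus noise; estimating the two jointly discounts grades earned in lenient courses:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data, invented for illustration: each observed grade is student
# ability plus course leniency plus noise.
n_students, n_courses = 200, 20
ability = rng.normal(0, 1, n_students)
leniency = rng.normal(0, 1, n_courses)

rows = []
for s in range(n_students):
    for c in rng.choice(n_courses, size=6, replace=False):  # 6 courses each
        rows.append((s, c, ability[s] + leniency[c] + rng.normal(0, 0.5)))
students, courses, grades = (np.array(v) for v in zip(*rows))

# Crude stand-in for the adjustment: jointly estimate student abilities and
# course leniencies by least squares, so an A from an easy grader counts for less.
X = np.zeros((len(rows), n_students + n_courses))
X[np.arange(len(rows)), students] = 1
X[np.arange(len(rows)), n_students + courses] = 1
coef, *_ = np.linalg.lstsq(X, grades, rcond=None)
est_ability = coef[:n_students]

raw_gpa = np.array([grades[students == s].mean() for s in range(n_students)])
adj_corr = np.corrcoef(est_ability, ability)[0, 1]
raw_corr = np.corrcoef(raw_gpa, ability)[0, 1]
print(adj_corr, raw_corr)  # adjusted estimates track true ability more closely
```

<p>Val&#8217;s version is a full Bayesian item-response model with latent abilities; the sketch just shows the basic mechanism, that modeling grader leniency recovers more signal about students than raw GPA does.</p>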
<p>I still like Val&#8217;s idea, but at this point there may be too much grade inflation at some schools for it to work.  At some point there is so little signal left that you can&#8217;t recover the information you want.</p>
<p>OK, at this point you might say, sure, grades are B.S., whatever.  But that puts us in the worse position of implicitly requiring students to have other qualifications.  At best, this sends students to interesting research projects and internships, but many times it just pushes them into trying to hop on projects to get credentials.  Rather than writing some crappy NeurIPS paper and then learning the tricks to get it accepted, I think they&#8217;d be better off taking interesting courses in college, working hard, doing well on exams, and writing good term papers.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/19/problems-caused-by-grade-inflation/feed/</wfw:commentRss>
			<slash:comments>53</slash:comments>
		
		
			</item>
		<item>
		<title>Where should we publish our paper, &#8220;Statistical graphics and comics: Parallel histories of visual storytelling&#8221;?</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/18/where-should-we-publish-our-paper-statistical-graphics-and-comics-parallel-histories-of-visual-storytelling/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/18/where-should-we-publish-our-paper-statistical-graphics-and-comics-parallel-histories-of-visual-storytelling/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sat, 18 Jan 2025 14:21:48 +0000</pubDate>
				<category><![CDATA[Art]]></category>
		<category><![CDATA[Literature]]></category>
		<category><![CDATA[Statistical Graphics]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51573</guid>

					<description><![CDATA[Hey! Susan Kruglinski and I wrote this article I really like, Statistical graphics and comics: Parallel histories of visual storytelling: What do data visualization and cartoons have in common? One of these is used to communicate in science and journalism, &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/18/where-should-we-publish-our-paper-statistical-graphics-and-comics-parallel-histories-of-visual-storytelling/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Hey!  Susan Kruglinski and I wrote this article I really like, <a href="http://stat.columbia.edu/~gelman/research/unpublished/graphics_bd.pdf">Statistical graphics and comics: Parallel histories of visual storytelling</a>:</p>
<blockquote><p>What do data visualization and cartoons have in common? One of these is used to communicate in science and journalism, and the other appears in arts and entertainment, but both convey complex messages in economical, intuitive, and visually appealing ways. And both these graphic forms are relatively new, having made rapid progress only in the past few centuries, despite requiring little in the way of raw material to produce. We connect this history to a combination of abstraction and accessibility that is common to both these forms of visual expression: comic strips and scatterplots both now seem intuitive but represent the development of abstract conventions. We also discuss differences between these two methods of visual storytelling in their goals and in how they are experienced by the reader.</p></blockquote>
<p>Read <a href="http://stat.columbia.edu/~gelman/research/unpublished/graphics_bd.pdf">the whole thing</a>.  It has a message I think is important.</p>
<p>But my message to you is:  Where should we publish this article?  We sent it to the journal American Statistician, which didn&#8217;t seem quite right; in any case they agreed with that assessment and told us it would be better to publish somewhere else.  But we&#8217;re not sure where.</p>
<p>There&#8217;s no need for the paper to appear in a statistics journal, or in a &#8220;journal&#8221; at all&#8211;it&#8217;s not like we&#8217;re getting &#8220;publish or perish&#8221; credit for it! Lots of non-statistician &#8220;civilians&#8221; are interested in dataviz and comics, and I&#8217;d like to reach some audience beyond whoever&#8217;s reading this post right now.</p>
<p>If you have any thoughts on where to publish this article&#8211;or, of course, any thoughts on the substance of the article itself&#8211;you can just let us know right here in the comments section.  Otherwise, just enjoy the article.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/18/where-should-we-publish-our-paper-statistical-graphics-and-comics-parallel-histories-of-visual-storytelling/feed/</wfw:commentRss>
			<slash:comments>22</slash:comments>
		
		
			</item>
		<item>
		<title>How far can exchangeability get us toward agreeing on individual probability?</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/17/how-far-can-exchangeability-get-us-toward-agreeing-on-individual-probability/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/17/how-far-can-exchangeability-get-us-toward-agreeing-on-individual-probability/#comments</comments>
		
		<dc:creator><![CDATA[Jessica Hullman]]></dc:creator>
		<pubDate>Fri, 17 Jan 2025 17:39:56 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<category><![CDATA[Causal Inference]]></category>
		<category><![CDATA[Decision Analysis]]></category>
		<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51571</guid>

					<description><![CDATA[This is Jessica. What’s the common assumption behind the following?  Partial pooling of information over groups in hierarchical Bayesian models  In causal inference of treatment effects, saying that the outcome you would get if you were treated (Y^a) shouldn’t change &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/17/how-far-can-exchangeability-get-us-toward-agreeing-on-individual-probability/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">This is Jessica. What’s the common assumption behind the following? </span></p>
<ul>
<li>Partial pooling of information over groups in hierarchical Bayesian models</li>
<li>In causal inference of treatment effects, saying that the outcome you would get if you were treated (Y^a) shouldn’t change depending on whether you are assigned the treatment (A) or not</li>
<li>Acting as if we believe a probability is the “objective chance” of an event even if we prefer to see probability as an assignment of betting odds or degrees of belief to an event</li>
</ul>
<p><span style="font-weight: 400">The question is rhetorical, because the answer is in the post title. These are all examples where statistical exchangeability is important. Exchangeability says the joint distribution of a set of random variables is unaffected by the order in which they are observed. </span></p>
<p><span style="font-weight: 400">Exchangeability has broad implications. Lately I’ve been thinking about it as it comes up at the ML/stats intersection, where it’s critical to various methods: achieving coverage in </span><a href="https://arxiv.org/abs/2107.07511"><span style="font-weight: 400">conformal prediction</span></a><span style="font-weight: 400">, using counterfactuals in </span><a href="https://dl.acm.org/doi/pdf/10.1145/3351095.3372851"><span style="font-weight: 400">analyzing algorithmic fairness</span></a><span style="font-weight: 400">, </span><a href="https://arxiv.org/pdf/2405.18836"><span style="font-weight: 400">identifying independent causal mechanisms</span></a><span style="font-weight: 400"> in observational data, etc. </span></p>
<p><span style="font-weight: 400">This week it came up in the course I’m teaching on </span><a href="https://statmodeling.stat.columbia.edu/2024/12/06/new-course-prediction-for-individualized-decision-making/"><span style="font-weight: 400">prediction for decision-making</span></a><span style="font-weight: 400">. A student asked whether exchangeability was of interest because often people aren’t comfortable assuming data is IID. I could see how this might seem like the case given how application-oriented papers (like on conformal prediction) sometimes talk about the exchangeability requirement as an advantage over the usual assumption of IID data. But this misses the deeper significance, which is that it provides a kind of practical consensus between different statistical philosophies. This consensus, and the ways in which it’s ultimately limited, is the topic of this post.</span></p>
<p><b>Interpreting the probability of an individual event</b></p>
<p><span style="font-weight: 400">One of the papers I’d assigned was Dawid’s “On Individual Risk,” which, as you might expect, talks about what it means to assign probability to a single event. Dawid distinguishes “groupist” interpretations of probability that depend on identifying some set of events, like the frequentist definition of probability as the limiting frequency over hypothetical replications of the event, from individualist interpretations, like a “personal probability” reflecting the beliefs of some expert about some specific event conditioned on some prior experience. For the purposes of this discussion, we can put Bayesians (subjective, objective, and pragmatic, as Bob describes them </span><a href="https://statmodeling.stat.columbia.edu/2024/07/10/three-cultures-bayes-subjective-objective-pragmatic/"><span style="font-weight: 400">here</span></a><span style="font-weight: 400">) in the latter personalist-individualist category. </span></p>
<p><span style="font-weight: 400">On the surface, the frequentist treatment of probability as an “objective” quantity appears incompatible with the individualist notion of probability as a descriptor of a particular event from the perspective of the particular observer (or expert) ascribing beliefs. If you have a frequentist and a personalist thinking about the next toss of a coin, for example, you would expect the probability the personalist assigns to depend on their joint distribution over possible sequences of outcomes, while the frequentist would be content to know the limiting probability. But de Finetti’s theorem shows that if one believes a sequence of events to be exchangeable, then you can’t distinguish their beliefs about those random variables from conceiving of independent events with some underlying probability. Given a sequence of exchangeable Bernoulli random variables X1, X2, X3, … , you can think of a draw from their joint distribution as sampling </span><i><span style="font-weight: 400">p ~ mu</span></i><span style="font-weight: 400">, then drawing X1, X2, X3, … from </span><i><span style="font-weight: 400">Bernoulli(p)</span></i><span style="font-weight: 400"> (where mu is a distribution on [0,1]). </span><span style="font-weight: 400">So the frequentist and personalist can both agree, under exchangeability, that </span><i><span style="font-weight: 400">p</span></i><span style="font-weight: 400"> is meaningful for decision making. David Spiegelhalter recently published an </span><span style="font-weight: 400"><a href="https://www.nature.com/articles/d41586-024-04096-5">essay on interpreting probability</a> that he ended by commenting on how remarkable this pragmatic consensus is</span><span style="font-weight: 400">.</span></p>
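<p>The mixture representation is easy to simulate (the choice of mu = Uniform(0,1) here is mine, purely for illustration): draw p first, then generate conditionally iid Bernoulli(p) outcomes, and the resulting sequence is exchangeable, e.g., P(X1=1, X2=0) matches P(X1=0, X2=1).</p>

```python
import random

random.seed(1)

def exchangeable_sequence(n, rng=random):
    """One draw from the joint distribution: first p ~ mu (here Uniform(0,1)),
    then n conditionally iid Bernoulli(p) outcomes."""
    p = rng.random()  # p ~ mu
    return [1 if rng.random() < p else 0 for _ in range(n)]

# Exchangeability in action: reordering outcomes doesn't change their joint
# probability, so P(X1=1, X2=0) should match P(X1=0, X2=1)
# (both equal E[p(1-p)] = 1/6 when mu is Uniform(0,1)).
n_sims = 200_000
count_10 = count_01 = 0
for _ in range(n_sims):
    x = exchangeable_sequence(2)
    count_10 += x == [1, 0]
    count_01 += x == [0, 1]

print(count_10 / n_sims, count_01 / n_sims)  # both near 1/6
```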
<p><span style="font-weight: 400">But Dawid’s goal is to point out ways in which the apparent alignment is not as satisfactory as it may seem in resolving the philosophical chasm. It’s more like we’ve thrown a (somewhat flimsy) plank over it. Exchangeability may sometimes get us across by allowing the frequentist and personalist to coordinate in terms of actions, but we have to be careful how much weight we put on this.  </span></p>
<p><b>The reference set depends on the state of information</b></p>
<p><span style="font-weight: 400">One complication is that the personalist’s willingness to assume exchangeability depends on the information they have. Dawid uses the example of trying to </span><span style="font-weight: 400">predict the exam score of some particular student. </span><span style="font-weight: 400">If they have no additional information to distinguish the target student from the rest, </span><span style="font-weight: 400">the personalist might be content to be given an overall limiting relative frequency </span><i><span style="font-weight: 400">p </span></i><span style="font-weight: 400">of failure across a set of students. But as soon as they learn something that makes the individual student unique, </span><i><span style="font-weight: 400">p </span></i><span style="font-weight: 400">is no longer the appropriate reference for the individual student’s probability of passing the exam. </span></p>
<p><span style="font-weight: 400">As an aside, this doesn’t mean that exchangeability is only useful if we think of members of some exchangeable set as identical. There may still be practical benefits of learning from the other students in the context of a statistical model, for example. See, e.g., </span><span style="font-weight: 400">Andrew’s previous post on exchangeability as an </span><a href="https://statmodeling.stat.columbia.edu/2022/11/24/understanding-exchangeability-in-statistical-modeling-a-thanksgiving-themed-post/"><span style="font-weight: 400">assumption in hierarchical models</span></a><span style="font-weight: 400">, where he points out that assuming exchangeability doesn’t necessarily mean that you believe everything is indistinguishable, and if you have additional information distinguishing groups, you can incorporate that in your model as group-level predictors.</span></p>
<p><span style="font-weight: 400">But for the purposes of personalists and frequentists agreeing on a reference for the probability of a specific event, the dependence on information is not ideal. Can we avoid this by making the reference set more specific? What if we’re trying to predict a particular student’s score on a particular exam in a world where that particular student is allowed to attempt the same exam as many times as they’d like? Now that the reference group refers to the particular student and particular exam, would the personalist be content to accept the limiting frequency as the probability of passing the next attempt? </span></p>
<p><span style="font-weight: 400">The answer is, not necessarily. This imaginary world still can’t get us to the generality we’d need for exchangeability to truly reconcile a personalist and frequentist assessment of the probability. </span></p>
<p><b>Example where the limiting frequency is constructed over time</b></p>
<p><span style="font-weight: 400">Dawid illustrates this by introducing a complicating (but not at all unrealistic) assumption: that the student’s performance on their next try on the exam will be affected by their performance on the previous tries. Now we have a situation where the limiting frequency of passing on repeated attempts is constructed over time. </span></p>
<p><span style="font-weight: 400">As an analogy, consider drawing balls from an urn that contains, at the first draw, 1 red ball and 1 green ball. Upon drawing a ball, we immediately return it and add an additional ball of the same color. At each draw, each ball in the urn is equally likely to be drawn, and the sequence of colors is exchangeable. </span></p>
<p><span style="font-weight: 400">Given that </span><i><span style="font-weight: 400">p </span></i><span style="font-weight: 400">is not known, which do you think the personalist would prefer to consider as the probability of a red ball on the first draw: the proportion of red balls currently in the urn, or the limiting frequency of drawing a red ball over the entire sequence? </span></p>
<p><span style="font-weight: 400">Turns out in this example, the distinction doesn’t actually matter: the personalist should just bet 0.5. So why is there still a problem in reconciling the personalist assessment with the limiting frequency?</span></p>
<p><span style="font-weight: 400">The answer is that we now have a situation where knowledge of the dynamic aspect of the process makes it seem contradictory for the personalist to trust the limiting frequency. If they know it’s constructed over time, then on what ground is the personalist supposed to assume the limiting frequency is the right reference for the probability on the first draw? This gets at the awkwardness of using behavior in the limit to think about individual predictions we might make.</span></p>
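<p>This scheme is the classic Pólya urn, and a quick simulation (details mine) shows both halves of the story: the first-draw probability of red is 1/2, while each run&#8217;s long-run frequency of red settles near a different limit, spread out over (0, 1).</p>

```python
import random

random.seed(7)

def polya_run(n_draws, rng=random):
    """Dawid's urn: start with 1 red and 1 green ball; after each draw, return
    the ball and add one more of the same color. Returns (was the first draw
    red?, fraction of red draws over the whole run)."""
    red, green = 1, 1
    reds = 0
    first_is_red = False
    for i in range(n_draws):
        drew_red = rng.random() < red / (red + green)
        if i == 0:
            first_is_red = drew_red
        if drew_red:
            red += 1
            reds += 1
        else:
            green += 1
    return first_is_red, reds / n_draws

runs = [polya_run(1000) for _ in range(1000)]
first_draw_rate = sum(first for first, _ in runs) / len(runs)
long_run_fracs = sorted(frac for _, frac in runs)

# First-draw probability is 1/2, but the per-run long-run frequencies are
# themselves random, scattered across (0, 1).
print(first_draw_rate)
print(long_run_fracs[0], long_run_fracs[-1])
```

<p>So betting 0.5 on the first draw is consistent with the limiting frequency only in expectation; any single sequence&#8217;s limit is itself a random quantity being constructed as the draws come in.</p>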
<p><b>Why this matters in the context of algorithmic decision-making</b></p>
<p><span style="font-weight: 400">This example is related to some of my prior posts on why calibration does not satisfy everyone as a means of ensuring good decisions. The broader point in the context of the course I’m teaching is that when we’re making risk predictions (and subsequent decisions) about people, such as in deciding who to grant a loan or whether to provide some medical treatment, there is inherent ambiguity in the target quantity. Often there are expectations that the decision-maker will do their best to consider the information about that particular person and make the best decision they can. What becomes important is not so much that we can guarantee our predictions behave well as a group (e.g., calibration) but that we understand how we’re limited by the information we have and what assumptions we’re making about the reference group in an individual case. </span></p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/17/how-far-can-exchangeability-get-us-toward-agreeing-on-individual-probability/feed/</wfw:commentRss>
			<slash:comments>14</slash:comments>
		
		
			</item>
		<item>
		<title>7 steps to junk science that can achieve worldly success</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/17/7steps/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/17/7steps/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Fri, 17 Jan 2025 14:22:15 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Sociology]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51566</guid>

					<description><![CDATA[More than a decade after the earthquake that was the replication crisis (for some background, see my article with Simine Vazire, Why did it take so many decades for the behavioral sciences to develop a sense of crisis around methodology &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/17/7steps/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>More than a decade after the earthquake that was the replication crisis (for some background, see my article with Simine Vazire, <a href="http://stat.columbia.edu/~gelman/research/published/jmmss-3062-gelman.pdf">Why did it take so many decades for the behavioral sciences to develop a sense of crisis around methodology and replication?</a>), it is frustrating to see junk science still being published, promoted, and celebrated, even within psychology, the field that was at the epicenter of the crisis.</p>
<p><strong>The crisis continues</strong></p>
<p>An example that I learned about recently was an article out of Harvard, <a href="https://www.nature.com/articles/s41598-023-50009-3">Physical healing as a function of perceived time</a>, published in 2023 and subsequently promoted in the news media, that claimed to demonstrate that healing of bruises could be sped or slowed by manipulating people&#8217;s subjective sense of time.  All things are possible, and never say never, but, yeah, this paper offered no good evidence for its extraordinary claims.  It was standard-issue junk science: a grabby idea, a statistically significant p-value extracted from noisy data, and big claims.</p>
<p>Someone pointed me to this paper, and for some reason that I can no longer remember, Nick Brown and I decided to figure out exactly what went wrong with it.  We published our findings in this article, <a href="http://stat.columbia.edu/~gelman/research/published/healing3.pdf">How statistical challenges and misreadings of the literature combine to produce unreplicable science: An example from psychology</a>, which will appear in the journal Advances in Methods and Practices in Psychological Science.</p>
<p>In short, the published article was flawed in two important ways, first in its statistical analysis (see section 2.4 of our paper, where we write, &#8220;We are skeptical that this study reveals anything about the effect of perceived time on physical healing, for four reasons&#8221;) and second in its interpretation of its cited literature (see section 3 of our paper, where we write, &#8220;Here we discuss three different examples of this sort of misinterpretation of the literature cited in the paper under discussion&#8221;).</p>
<p>I don&#8217;t have any particular interest in purported mind-body healing, but Nick and I went to the trouble to shepherd our article through the publication process, with two goals in mind:<br />
&#8211; Providing an example of how we, as outsiders, could look carefully at a research article and its references and figure out what went wrong. This is important, because it&#8217;s pretty common to see papers that make outlandish claims but seem to be supported by data and the literature.<br />
&#8211; Exploring what exactly goes wrong&#8211;in this case, it was a mis-analysis of a complex data structure, researcher degrees of freedom in decisions of what to report, and multiple inaccurate summaries of the literature.</p>
<p><strong>What does it take for junk science to be successful?</strong></p>
<p>All this got me thinking about what it takes for researchers to put together a successful work of junk science in the modern era, which is the subject of today&#8217;s post.</p>
<p>Before going on, let me emphasize that I have no reason to suspect misconduct on the part of the authors of the paper in question. It&#8217;s a bad paper, and it&#8217;s bad science, but that happens given how people are trained, and given the track record of what gets published in leading journals (Psychological Science, PNAS), what gets rewarded in academia, and what gets publicity from NPR, Ted, Freakonomics, and the like.  As we&#8217;ve discussed many times, you can do bad science without being a bad person and without committing what would usually be called research misconduct. (I actually don&#8217;t think that bad data analysis and inaccurate description of the literature would usually be put in the &#8220;research misconduct&#8221; category.)</p>
<p>This is also why I&#8217;m not mentioning the authors&#8217; names here.  The names are no secret&#8211;just click on the above link and the paper is right there!&#8211;I&#8217;m just not including them in this post, so as to emphasize that I&#8217;m writing here about the process of bad science and its promotion; it&#8217;s not about these particular authors (or any particular authors).</p>
<p><strong>7 steps to junk science</strong></p>
<p>So here they are, 7 things that allow junk science to thrive:</p>
<p><strong>1.  Bad statistical analysis.</strong>  Statistics is hard; there are a lot of ways to make mistakes, and often these mistakes can lead to what appears to be strong evidence.</p>
<p><strong>2.  Researcher degrees of freedom.</strong>  Garden of forking paths.  As always, the problem is not with the forking paths&#8211;there really are a lot of ways to collect, code, and analyze data!&#8211;but rather with selection in what is reported.  As Simmons et al. (2011) unforgettably put it, &#8220;undisclosed flexibility in data collection and analysis allows presenting anything as significant.&#8221;  And, as Loken and I emphasized in our paper on forking paths, &#8220;undisclosed flexibility&#8221; could be undisclosed to the authors themselves:  the problem is with data-dependent analysis choices, even if the data at hand were analyzed only once.</p>
<p><strong>3.  Weak or open-ended substantive theory.</strong>  Theories such as evolutionary psychology, embodied cognition, and mind-body healing are vague enough to explain just about anything.  As Brown and I wrote in our above-linked article, &#8220;The authors refer to &#8216;mind–body unity&#8217; and &#8216;the importance of psychological factors in all aspects of health and wellbeing,&#8217; and we would not want to rule out the possibility of such an effect, but no mechanisms are examined in this study, so the result seems at best speculative, even taking the data summaries at face value. During the half hour of the experimental conditions, the participants were performing various activities on the computer that could affect blood flow, and these activities were different in each condition . . . there are many alternative explanations for the results which we find<br />
just as scientifically plausible as the published claim.&#8221;</p>
<p><strong>4.  Inaccurate summaries of the literature.</strong>  This is a big deal, a huge deal, and something we don’t talk enough about.</p>
<p>It’s a lot to expect the journal editors and reviewers to check citations and literature reviews. <em>It&#8217;s your job as an author</em> to read and understand the work you&#8217;re citing before using those papers to make unsupported claims.  For example, don&#8217;t make the claim, “If a person who does not exercise weighed themselves, checked their blood pressure, took careful body measurements, wrote everything down, maintained their same diet and level of physical activity, and then repeated the same measures a month later, few would expect exercise-like improvements. But in a study involving hotel housekeepers, that is effectively what the researchers found,&#8221; if you&#8217;re citing a study that does not support this claim.</p>
<p><strong>5.  Institutional support.</strong>  Respectable journals are willing to publish articles that make outlandish claims based on weak evidence.  Respected universities give Ph.D.&#8217;s for such work.  Again, I&#8217;m not suggesting malfeasance on the part of the authors; they&#8217;re just playing by the rules that they&#8217;ve learned.</p>
<p><strong>6.  External promotion.</strong>  This work was featured in Freakonomics, Scientific American, and other podcasts and news outlets (see <a href="https://statmodeling.stat.columbia.edu/2024/10/19/carroll-langer-credulous-scientist-as-hero-reporting-from-a-podcaster-who-should-know-better/">here</a> and <a href="https://statmodeling.stat.columbia.edu/2024/10/28/freakonomics-does-it-again-not-in-a-good-way-jeez-these-guys-are-credulous/">here</a>).  This external promotion has three malign effects:<br />
&#8211; Most directly, it spreads the (inaccurate) word about the bad research.<br />
&#8211; The publicity also provides an incentive for people to do more sloppy work that can yield these sorts of strong claims from weak evidence.<br />
&#8211; Also, publicity for sloppy, bad science can <a href="https://statmodeling.stat.columbia.edu/2016/03/31/greshams-law-of-experimental-methods/">crowd out</a> publicity and reduce the incentives to do careful, good science.</p>
<p><strong>7.  Celebrity culture.</strong>  This is a combination of items 5 and 6 above:  many celebrity academic and media figures prop each other up. Some of it&#8217;s from converging interests, as when the Nudgelords <a href="https://statmodeling.stat.columbia.edu/2022/06/04/pizzagate-and-nudge-an-opportunity-lost/">presented</a> the work of Brian Wansink as &#8220;masterpieces,&#8221; but often I think it&#8217;s more just a sense that all these media-friendly scientists and podcasters and journalists feel that they&#8217;re part of some collective project of science promotion, and from that perspective it doesn&#8217;t really matter if the science is good or bad, as long as it&#8217;s science-like, by their standards.</p>
<p>Anyway, this continues to bug the hell out of me, which is why I keep chewing on it and writing about it from different angles.  I&#8217;m glad that Nick and I wrote that paper&#8211;it took some effort to track down all the details and express ourselves both clearly and carefully.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/17/7steps/feed/</wfw:commentRss>
			<slash:comments>22</slash:comments>
		
		
			</item>
		<item>
		<title>Why I like preregistration (and it&#8217;s not about p-hacking).  When done right, it unifies the substance of science with the scientific method.</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/16/why-i-like-preregistration-and-its-not-about-p-hacking-when-done-right-it-unifies-the-substance-of-science-with-the-scientific-method/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/16/why-i-like-preregistration-and-its-not-about-p-hacking-when-done-right-it-unifies-the-substance-of-science-with-the-scientific-method/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Thu, 16 Jan 2025 14:18:59 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Sociology]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51081</guid>

					<description><![CDATA[This came up in comments to Jessica&#8217;s recent post. I like preregistration. It’s not something I used to do, and I still don’t always do it. I’ve worked on hundreds of research projects, and only a few of them had &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/16/why-i-like-preregistration-and-its-not-about-p-hacking-when-done-right-it-unifies-the-substance-of-science-with-the-scientific-method/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>This came up <a href="https://statmodeling.stat.columbia.edu/2024/09/19/getting-a-pass-on-evaluating-ways-to-improve-science/#comment-2379860">in comments</a> to Jessica&#8217;s recent post.</p>
<p>I like preregistration. It’s not something I used to do, and I still don’t always do it. I’ve worked on hundreds of research projects, and only a few of them had had any preregistration at all.</p>
<p>That said, I think preregistration has value, and I’m doing it more and more.</p>
<p>The reason I like preregistration has nothing at all to do with hypothesis tests or p-values or p-hacking or questionable research practices or anything like that.</p>
<p>I like preregistration for two reasons.</p>
<p>1. For me, preregistration implies constructing a hypothetical world&#8211;not a “null hypothesis” of no effect, but a possible world corresponding to what I’m actually aiming to study&#8211;and then simulating fake data and proposing and trying out analysis methods on those simulated data. I find this sort of commitment&#8211;the effort of laying out a complete generative model for the process&#8211;to be helpful. This means thinking about <a href="http://stat.columbia.edu/~gelman/research/unpublished/causal_quartets.pdf">effect sizes and their variation</a> and all sorts of other things, and also seeing whether the proposed analysis can recover the parameters of interest from the simulated data, which is what’s often called power analysis, <a href="http://stat.columbia.edu/~gelman/research/published/retropower_final.pdf">although I prefer</a> the more general term “design analysis.”</p>
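<p>Here&#8217;s a minimal sketch of that workflow (a hypothetical two-group example with an assumed true effect of 0.2 on a unit-scale outcome; all the numbers are mine, not from any study): simulate data from the assumed world, run the planned estimator, and check both recovery of the parameter and the design&#8217;s power.</p>

```python
# Minimal fake-data / design-analysis sketch (assumed true effect = 0.2,
# outcome sd = 1, n = 100 per group; all numbers are illustrative).
import math
import random

random.seed(1)

def one_simulation(true_effect=0.2, sigma=1.0, n=100):
    """Simulate one experiment in the hypothetical world and return the
    estimated effect and its standard error."""
    control = [random.gauss(0.0, sigma) for _ in range(n)]
    treated = [random.gauss(true_effect, sigma) for _ in range(n)]
    est = sum(treated) / n - sum(control) / n
    se = sigma * math.sqrt(2 / n)  # known-sigma standard error, for simplicity
    return est, se

# Repeat to check that the planned analysis recovers the assumed effect,
# and how often it would clear a conventional |estimate| > 2*se threshold.
sims = [one_simulation() for _ in range(2000)]
mean_est = sum(e for e, _ in sims) / len(sims)
power = sum(abs(e) > 2 * s for e, s in sims) / len(sims)
print(f"mean estimate: {mean_est:.3f} (assumed truth: 0.2)")
print(f"estimated power: {power:.2f}")
```

One thing this sort of simulation surfaces immediately: under these assumed numbers the design is badly underpowered, which is exactly what a design analysis is supposed to catch before any data are collected.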
<p>2. When other people preregister, that can be useful because then we can see discrepancies between the original plan and what actually got reported. Two examples are <a href="https://statmodeling.stat.columbia.edu/2024/04/04/its-ariely-time-an-example-of-the-success-of-preregistration-in-revealing-problems-with-a-published-paper/">here</a> and <a href="https://statmodeling.stat.columbia.edu/2018/05/29/exposure-forking-paths-affects-support-publication/">here</a>&#8211;in both those cases, discrepancies between the preregistration and the final paper gave us doubts about the published claims. When these changes happen, it is not a moral failure on anyone’s part&#8211;we can learn from data!&#8211;it’s just relevant for understanding the theories being promulgated in these papers.</p>
<p>I agree that preregistration is not necessary for good science. I still think it can be a useful tool, both in my own workflow of developing scientific hypotheses and gathering data to understand them, and in communicating that workflow to others.</p>
<p>Preregistration has a valuable indirect function of making it more difficult to do bad science. It does not directly turn bad science into good science. That doesn’t make preregistration a bad idea&#8211;recently I’ve been preregistering studies and, more generally, <a href="https://statmodeling.stat.columbia.edu/2019/03/23/yes-i-really-really-really-like-fake-data-simulation-and-i-cant-stop-talking-about-it/">simulating data</a> before gathering any data&#8211;we should just be aware that this sort of procedural step can be only one small part of the story. Ultimately, science is about the substance of science, not just about the scientific method.</p>
<p>There’s something interesting here, though, that links the two perspectives. If you do things right, your preregistration will involve the substance of what you’re studying and will not merely be a procedural step, a form of paperwork that exists to validate the p-values that your study will produce. Rather, doing this preregistration will require simulating fake data, which in turn will require hypothesizing a full model of the underlying process.</p>
<p>I recognize that what I just described is not the usual thing that is meant by “preregistration,” which is more along the lines of: “We will perform this comparison and use a 2-sided test,” etc. But it could be! I think this is a useful connection.</p>
<p><strong>P.S.</strong>  As discussed in <a href="https://statmodeling.stat.columbia.edu/2025/01/16/why-i-like-preregistration-and-its-not-about-p-hacking-when-done-right-it-unifies-the-substance-of-science-with-the-scientific-method/#comment-2387800">comments</a>, a more precise term for what I&#8217;m recommending is <a href="https://statmodeling.stat.columbia.edu/2019/03/23/yes-i-really-really-really-like-fake-data-simulation-and-i-cant-stop-talking-about-it/">fake-data simulation</a> or simulated-data experimentation. I use the term “preregistration” above in order to connect with the many people in the science-reform movement who use that term.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/16/why-i-like-preregistration-and-its-not-about-p-hacking-when-done-right-it-unifies-the-substance-of-science-with-the-scientific-method/feed/</wfw:commentRss>
			<slash:comments>24</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;The terror among academics on the covid origins issue is like nothing we&#8217;ve ever seen before&#8221;</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/15/the-terror-among-academics-on-the-covid-origins-issue-is-like-nothing-weve-ever-seen-before/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/15/the-terror-among-academics-on-the-covid-origins-issue-is-like-nothing-weve-ever-seen-before/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Wed, 15 Jan 2025 14:19:05 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<category><![CDATA[Decision Analysis]]></category>
		<category><![CDATA[Public Health]]></category>
		<category><![CDATA[Sociology]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51564</guid>

					<description><![CDATA[Michael Weissman sends along this article he wrote with a Bayesian evaluation of Covid origins probabilities. He writes: It&#8217;s a peculiar issue to work on. The terror among academics on the covid origins issue is like nothing we&#8217;ve ever seen &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/15/the-terror-among-academics-on-the-covid-origins-issue-is-like-nothing-weve-ever-seen-before/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Michael Weissman sends along <a href="https://michaelweissman.substack.com/p/an-inconvenient-probability-v57">this article</a> he wrote with a Bayesian evaluation of Covid origins probabilities.  He writes:</p>
<blockquote><p>It&#8217;s a peculiar issue to work on. The terror among academics on the covid origins issue is like nothing we&#8217;ve ever seen before.</p></blockquote>
<p>I was surprised he was talking about &#8220;terror&#8221; . . .  People sometimes send me stuff about covid origins and it all seems civil enough. I guess I&#8217;m too far out of the loop to have noticed this!  That said, there have been times that I&#8217;ve been attacked for opposing some aspect of the scientific establishment, so I can believe it.</p>
<p>I asked Weissman to elaborate, and he shared some stories:</p>
<blockquote><p>A couple of multidisciplinary researchers from prestigious institutions were trying to write up a submittable paper. They were leaning heavily zoonotic, at least before we talked. They said they didn&#8217;t publish because they could not get any experts to talk with them. They said they prepared formal legal papers guaranteeing confidentiality but it wasn&#8217;t enough. I guess people thought that their zoo-lean was a ruse.</p>
<p>The extraordinarily distinguished computational biologist Nick Patterson tells me that a prospective collaborator cancelled their collaboration because Patterson had blogged that he thought the evidence pointed to a lab leak. It is not normal for a scientist to drop an opportunity to collaborate with someone like Patterson over a disagreement on an unrelated scientific question. You can imagine the effect of that environment on younger, less established scientists.</p>
<p>Physicist Richard Muller at Berkeley tried asking some bio colleague about an origins-related technical issue. The colleague blew him off. Muller asked if a student or postdoc could help. No way- far too risky, would ruin their career. (see around minute 43 <a href="https://www.youtube.com/watch?v=1jXW7Z3h3nY">here</a>)</p></blockquote>
<p>Come to think about it, I got attacked (or, at least, misrepresented) for some of my <a href="http://stat.columbia.edu/~gelman/research/published/specificity.pdf">covid-related research</a> too; the <a href="https://statmodeling.stat.columbia.edu/2024/12/22/stanford-medical-school-professor-misrepresents-what-i-wrote-but-i-kind-of-understand-where-hes-coming-from/">story is here</a>. Lots of aggressive people out there in the academic research and policy communities.</p>
<p>Also, to put this in the context of the onset of covid in 2020, whatever terror we have been facing by disagreeing with powerful people in academia and government is nothing compared to the terror faced by people who were exposed to this new lethal disease.  Covid is now at the level of a bad flu season, so still pretty bad but much less scary than a few years ago.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/15/the-terror-among-academics-on-the-covid-origins-issue-is-like-nothing-weve-ever-seen-before/feed/</wfw:commentRss>
			<slash:comments>104</slash:comments>
		
		
			</item>
		<item>
		<title>Genre fiction:  Some genres are cumulative and some are not.</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/14/genre-fiction-some-genres-are-cumulative-and-some-are-not/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/14/genre-fiction-some-genres-are-cumulative-and-some-are-not/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Tue, 14 Jan 2025 14:58:10 +0000</pubDate>
				<category><![CDATA[Art]]></category>
		<category><![CDATA[Literature]]></category>
		<category><![CDATA[Sociology]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51047</guid>

					<description><![CDATA[I&#8217;ve been reading some books about the history of twentieth-century mystery novels and science fiction stories, and one thing that struck me was that each of these genres had a sense of continuity. If you used a gimmick in a &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/14/genre-fiction-some-genres-are-cumulative-and-some-are-not/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>I&#8217;ve been reading some books about the history of twentieth-century mystery novels and science fiction stories, and one thing that struck me was that each of these genres had a sense of continuity.  If you used a gimmick in a story, then it was &#8220;yours,&#8221; and future authors were kind of obliged to come up with something new or, if they were going to use your idea, they were supposed to give it a new twist or else refer back to your original story.  And there was a sense of progression, as the mystery puzzles became more elaborate and the science fiction scenarios became more deeply realized.</p>
<p>With an expectation of progression comes a fear of stagnation, and I think that one reason that genre fans talk about a &#8220;golden age&#8221; (that would be the 1930s for mysteries or the 1940s for science fiction) is the idea that you can&#8217;t keep coming up with new tricks.  At some point you need to change the rules, which for mysteries included directions such as incorporating psychology and sociology (Ross Macdonald, etc.), focusing on local color and how the world really works (John D. MacDonald, George V. Higgins, etc.), and for science fiction meant a move away from the poles of horror on one side and techno-optimism on the other.  Even as the genres expanded, though, I think there remained a sense of them as cumulative, with new writers building upon what had been done in the past.  You weren&#8217;t supposed to write a mystery novel or science fiction story where you ripped off some previously-published plots without adding something.</p>
<p>What about other genres?</p>
<p>Let&#8217;s start with &#8220;mundane&#8221; or &#8220;mimetic&#8221; fiction&#8212;that is, non-genre writing that follows conventions of realism.  There, it&#8217;s considered ok to reuse plots, sometimes very openly as with Jane Smiley&#8217;s remake of King Lear, other times just with standard plot structures of happy families and unhappy families and affairs and business reversals and all sorts of other stories.  In mimetic fiction, the plot is not the main focus, and even if the plot is a driver of the story (as with Jonathan Franzen, for example), nobody would really care if it&#8217;s taken from somewhere else.</p>
<p>Other genres commonly mentioned in the same breath as mystery and science fiction are romance, western, porn, and men&#8217;s adventures (war stories, etc.).  I don&#8217;t know much about these genres, so maybe readers can correct me on this, but it&#8217;s my impression that the twentieth-century versions of these genres have not been cumulative in the way of mystery and science fiction.  It was not expected that a romance story or a war story would need a new plot twist or a new idea.  Westerns might be different just because they were so popular for a while that maybe their high profile pushed authors to come up with new twists, I&#8217;m not sure.  I&#8217;m not saying these genres are static&#8212;they will change over time as readers&#8217; expectations change&#8212;but they&#8217;re not expected to offer novelty or innovation in the way of mystery or science fiction.</p>
<p>Why would these other genres not be cumulative?  Perhaps because they are offering different sorts of pleasures.  Traditionally, mystery and science fiction stories have the form of puzzles; you&#8217;re reading them for the pleasure of trying, and often failing, to figure them out, and then maybe <a href="https://statmodeling.stat.columbia.edu/2024/12/21/alzabo/">rereading</a> for the pleasure of understanding how the mechanisms were put together.  In contrast, mimetic fiction and genres such as romance/western/etc. are read more for their emotional impact.  OK, I guess people would also read mimetic fiction as a way to learn about the world&#8212;but to learn, you don&#8217;t need a new twist in the story, you just need a clear presentation.</p>
<p>I&#8217;m not saying that mystery and science fiction are read merely for their puzzle aspects.  Mysteries also offer the thrill of suspense and the twin satisfactions of lawbreaking and justice, science fiction has the sense of wonder, and both genres offer some form of social commentary that is gained by looking at society from a distance. I&#8217;m just saying that the puzzle is a big part of Golden Age mystery and science fiction, and this could help explain their cumulative natures.</p>
<p>What other genres could we consider?</p>
<p>There&#8217;s writing for children (recall Orwell&#8217;s classic essay on boys&#8217; weeklies) and young adults, but with rare exceptions these books are read by a new audience every few years, so there&#8217;s no need for continuity.  It&#8217;s no problem if you steal a plot idea from a twenty-year-old book that today&#8217;s kids are no longer reading.</p>
<p>There&#8217;s also modernist fiction, where the innovation is supposed to come in the form, not the content.  You can steal an old plot but you&#8217;re supposed to present it from some new perspective, and that&#8217;s a puzzle in a different way.</p>
<p>So that&#8217;s how I see things as of 1950 or so (with a few forward references).  What&#8217;s been happening since?</p>
<p>So many more books get published each year than before, and nobody can keep track of them.  In the meantime, it seems that fewer people are interested in reading for the puzzle.  Yes, Agatha Christie remains popular, and I&#8217;m guessing that some classic science fiction continues to sell, but I get the impression that, for a long time now, readers of mystery and science fiction aren&#8217;t looking for clever puzzles anymore, with rare exceptions such as Everyone in My Family Has Killed Someone and The Martian.  Mystery and science fiction <em>novels</em> are now more like mystery and science fiction <em>movies</em>, which, again, with rare exceptions, are mostly about delivering thrills, with a side of philosophical reflection.</p>
<p>So, between the disappearance of the past, the diminishing interest in puzzles, and the sheer impossibility of remaining aware of all the earlier books in the genre, I&#8217;m guessing that mystery and science fiction are no longer cumulative endeavors the way they used to be.  Instead of trying to top what came before, their authors are just writing books.</p>
<p>There&#8217;s also the economics of it all:  50 or 100 years ago, you could make an OK living churning out books of any sort, you could make a good living writing successful books, and you had an outside chance of getting rich (by the standards of the day; I&#8217;m talking &#8220;millionaire,&#8221; not &#8220;billionaire&#8221;) by plugging into the zeitgeist and writing bestsellers.  Nowadays, with very rare exceptions, even successful authors don&#8217;t sell many books; many literary authors are reduced to supporting themselves through academic jobs; and pretty much the only way they&#8217;ll make real money from writing is through movie or TV contracts.</p>
<p>To return to the main topic of this post, the transition from a cumulative to a static literature:  this happens in other fields too.  Music, for example.  This happens in different genres at different times, but it seems that in many genres of music, there is a period where different composers and artists are feeding on each other&#8217;s work and feel the need to do something new (recall Brian Wilson&#8217;s attitude with respect to the Beatles), and a period where there is no longer the sense of cumulative building of a genre.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/14/genre-fiction-some-genres-are-cumulative-and-some-are-not/feed/</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
			</item>
		<item>
		<title>The theory crisis in physics compared to the replication crisis in social science:  Two different opinion-field inversions that differ in some important ways</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/13/the-theory-crisis-in-physics-compared-to-the-replication-crisis-in-social-science-two-different-opinion-field-inversions-that-differ-in-some-important-ways/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/13/the-theory-crisis-in-physics-compared-to-the-replication-crisis-in-social-science-two-different-opinion-field-inversions-that-differ-in-some-important-ways/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Mon, 13 Jan 2025 14:09:15 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Sociology]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51043</guid>

					<description><![CDATA[Yesterday we discussed an &#8220;opinion-field inversion&#8221; in theoretical physics: in the prestige news media and publicity complex (NPR, Ted, etc), string theory reigns supreme; in elite physics departments, string theory is where it&#8217;s at, but then there&#8217;s a middle range &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/13/the-theory-crisis-in-physics-compared-to-the-replication-crisis-in-social-science-two-different-opinion-field-inversions-that-differ-in-some-important-ways/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><a href="https://statmodeling.stat.columbia.edu/2025/01/12/string-theory-wars-an-opinion-field-inversion/">Yesterday we discussed</a> an &#8220;opinion-field inversion&#8221; in theoretical physics:  in the prestige news media and publicity complex (NPR, Ted, etc), string theory reigns supreme; in elite physics departments, string theory is where it&#8217;s at, but then there&#8217;s a middle range of skeptics ranging from my Columbia colleague Peter &#8220;Not Even Wrong&#8221; Woit to xkcd cartoonist Randall Munroe who characterize string theory as an overhyped nothingburger.  I guess that Woit would be cool with some elite physicists studying string theory as part of the overall portfolio of theoretical research, but he and others in this middle ground think that string theory&#8217;s profile, both in academia and in public perceptions of physics, should be much lower.</p>
<p>I referred to this sort of layered difference in public opinion as an opinion-field inversion, by analogy to the phenomenon of temperature inversion that is a precursor to tornadoes.</p>
<p><strong>The theory crisis in physics</strong></p>
<p>Before going on, let me emphasize that I know nothing of theoretical physics.  I find Woit&#8217;s writing on the topic persuasive, but I&#8217;ve not tried to understand any of the debate in physics terms.  I&#8217;m only discussing this from the perspective of sociology of science.</p>
<p>I&#8217;ll call it the theory crisis in physics, by analogy to the replication crisis in social science and medical research.  The string theory thing isn&#8217;t a replication crisis&#8212;indeed, one of the main criticisms of string theory is that it does not make new testable predictions, so there&#8217;s no possibility of replication or falsification&#8212;but it&#8217;s still a crisis.  I think the term &#8220;theory crisis&#8221; is about right.</p>
<p>Arguably the <a href="http://stat.columbia.edu/~gelman/research/published/jmmss-3062-gelman.pdf">replication crisis in psychology</a> and economics is also a theory crisis, in that the work is based on broad theories such as embodied cognition and evolutionary psychology whose major problem is that they can be used to explain any possible result, but it was the failed replications that convinced many people, hence the term &#8220;replication crisis&#8221; rather than &#8220;theory crisis&#8221; or &#8220;methods crisis.&#8221;</p>
<p><strong>The replication crisis as another opinion-field inversion</strong></p>
<p>In any case, another example of an opinion-field inversion in science, at least until recently, was woo-woo psychology such as social priming and walking speed, ovulation and voting, air rage, power pose, himmicanes, ages ending in 9, signing at the top of the form, etc.  The news media and associated institutions (Ted, Freakonomics, etc.) were all-in on these things; informed scientists such as Uri Simonsohn, Anna Dreber, and various other so-called methodological terrorists were very skeptical; and the power centers at Harvard, PNAS, etc., were a mix of head-in-the-sand true believers (claiming the replication rate &#8220;is statistically indistinguishable from 100%&#8221;) and I&#8217;ve-got-mine-don&#8217;t-rock-the-boat nudgelords, who seemed to be more concerned about keeping their Henry Kissinger party invitations and positions on NPR speed-dial than in cleaning up their house.</p>
<p>With junk science, things have changed&#8211;more and more reporters seem to be tired of having their chains yanked by whatever Psychological Science and PNAS happen to be promoting this week&#8211;and I guess that&#8217;s partly a consequence of the opinion inversion.  Shaking up the power centers might be more of a challenge.</p>
<p><strong>Differences between the theory crisis in physics and the replication crisis in psychology and economics</strong></p>
<p>What will happen with string theory, I don&#8217;t know.  One difference is that in psychology and economics, my impression is that the people who do this sort of headline-bait are not taken very seriously at an intellectual level.  They may have institutional power (hi, Robert Sternberg!) and others in their field may enjoy the reflected glow of their TV appearances, but nobody would consider them to be the brightest lights in the chandelier.  In contrast, it seems that many of the physicists who work in string theory are considered to be brilliant, most notably Ed Witten&#8212;I know nothing of his work, it&#8217;s all beyond me, but he&#8217;s always described in superlatives.</p>
<p>So if string theory really is a hyped dead end, it&#8217;s a much sadder story than junk econ and junk psychology, which seem more like the products of ambitious careerists who, for a couple of decades, stumbled upon a way to hack the system of scientific publication and publicity at the nexus of academia and the news media.</p>
<p>The other twist is that even the opponents of string theory still characterize it as having some mathematical interest&#8211;their criticism is not that string theory is being done, so much as that too much is being claimed for it.  That&#8217;s different than research in himmicanes, air rage, beauty-and-sex-ratio, extra-sensory perception, etc., which is pretty much unmitigated crap, whose only contribution to science has been to reveal the rotten core of science as it is often practiced.</p>
<p>This also gives different flavors to the discussions in these two fields.  With psychology and economics, the frustration is mostly external, with observers being bothered that bad work gets so much publicity, that methodological criticisms and unsuccessful replications get less notice than problematic work, etc.  With physics, the frustration seems mostly internal, with people bothered that the top physics students are going into what they perceive as a dead-end world.  In physics, the concern is with the misuse of human resources in the form of brilliant Ph.D. students.  In psychology and econ, the concern is with the bad work giving the general public a misleading view of science and perhaps leading to bad policies (Excel error, anyone?).  Nobody&#8217;s saying it&#8217;s too bad Brian Wansink and Dan Ariely got into this social priming stuff, as if otherwise they could&#8217;ve made major discoveries.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/13/the-theory-crisis-in-physics-compared-to-the-replication-crisis-in-social-science-two-different-opinion-field-inversions-that-differ-in-some-important-ways/feed/</wfw:commentRss>
			<slash:comments>47</slash:comments>
		
		
			</item>
		<item>
		<title>String theory wars:  An opinion-field inversion.</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/12/string-theory-wars-an-opinion-field-inversion/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/12/string-theory-wars-an-opinion-field-inversion/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sun, 12 Jan 2025 14:08:48 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Sociology]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51042</guid>

					<description><![CDATA[From my Columbia math department colleague Peter Woit: Brian Greene’s The Elegant Universe is being reissued today, in a 25th anniversary edition. It’s the same text as the original, with the addition of a 5 page preface and a 36 &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/12/string-theory-wars-an-opinion-field-inversion/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>From my Columbia math department colleague <a href="https://www.math.columbia.edu/~woit/wordpress/?p=14092">Peter Woit</a>:</p>
<blockquote><p>Brian Greene’s The Elegant Universe is being reissued today, in a 25th anniversary edition. It’s the same text as the original, with the addition of a 5 page preface and a 36 page epilogue. . . .</p>
<p>One thing I [Woit] was looking for in the new material was Greene’s response to the detailed criticisms of string theory that have been made by me and others such as Lee Smolin and Sabine Hossenfelder over the last 25 years. It’s there, and here it is, in full:</p>
<blockquote><p>There is a small but vocal group of string theory detractors who, with a straight face, say things like “A long time ago you string theorists promised to have the fundamental laws of quantum gravity all wrapped up, so why aren’t you done?” or “You string theorists are now going in directions you never expected,” to which I respond, in reverse order “Well, yes, the excitement of searching into the unknown is to discover new directions” and “You must be kidding.”</p></blockquote>
<p>As one of the “small but vocal group” I’ll just point out that this is an absurd and highly offensive straw-man argument. The arguments in quotation marks are not ones being made by string theory detractors, and the fact that he makes up this nonsense and refuses to engage with the real arguments speaks volumes.</p></blockquote>
<p>Woit follows up in the comments section:</p>
<blockquote><p>It’s interesting to see that the comments back up an argument I’ve often heard from physicists when I criticize books like this. They tell me “sure, the material in that book about string theory is nonsense and misleading the public, but it’s getting young people excited about physics and interested in becoming physicists. Once they do become physicists they’ll realize this is nonsense and go on to do some real physics.”</p></blockquote>
<p>Woit&#8217;s a persuasive writer, and he, or someone, <a href="https://xkcd.com/171/">seems to have convinced</a> Randall Munroe.  I have no idea, as my physics days are long gone and I never learned any particle physics in my studies.  When I <a href="http://stat.columbia.edu/~gelman/research/published/2877_001.pdf">worked at the</a> Laboratory for Cosmic Ray Physics we dealt with muons and stuff like that, but nothing so theoretical as string theory or its predecessors.  So I offer no comment on the technical side of all this, except to note that debates remain even in much simpler areas of quantum physics, as I am reminded whenever the discussion comes up of joint distributions in the two-slit experiment (as in section 2 of our article on <a href="http://stat.columbia.edu/~gelman/research/published/physics.pdf">holes in Bayesian statistics</a>): some people will pop up and say I&#8217;m completely wrong, and others will agree with me.  Physics is hard, and even in examples where the theory is well understood, it can be a challenge to come up with a definitive experiment, or even a definitive thought experiment, to resolve fundamental disagreements.</p>
<p>But the sociology-of-science part of the string theory story, that I can comment on.  To start with, Woit and Greene are both in the Columbia math department.  They&#8217;re not far apart in age, and they both studied physics at Harvard.  But, from the quotes above, I get the impression that they can&#8217;t stand each other!  Or, at least, they don&#8217;t respect each other.  That&#8217;s fine&#8212;there&#8217;s no reason the mathematics faculty at Columbia should agree on everything, indeed some healthy disagreement is a good thing.  Still, it&#8217;s interesting.</p>
<p>It&#8217;s also interesting that in the court of scientifically-informed public opinion, Woit and the anti-stringers represent the standard or accepted view, whereas Greene and the stringers seem to have locked up the two extremes&#8212;the general public (as represented by PBS, Ted, etc.) and the academic theoretical physics community.</p>
<p>You know how tornados come from &#8220;temperature inversions&#8221;?  I think of this sort of thing as an &#8220;opinion inversion&#8221;:  an unstable pattern in which opinions among more informed people are different from those of the general public.  In this case it&#8217;s a double inversion, with the general public and the institutional power on one side and generally scientifically-informed people in the middle.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/12/string-theory-wars-an-opinion-field-inversion/feed/</wfw:commentRss>
			<slash:comments>82</slash:comments>
		
		
			</item>
		<item>
		<title>Muckraking at the University of Oregon</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/11/muckraking-at-the-university-of-oregon/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/11/muckraking-at-the-university-of-oregon/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sat, 11 Jan 2025 14:35:35 +0000</pubDate>
				<category><![CDATA[Literature]]></category>
		<category><![CDATA[Political Science]]></category>
		<category><![CDATA[Sociology]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51039</guid>

					<description><![CDATA[Check it out. I was amused by these posts: From 2014: Crap-free UO homepage From 2024: Provost Chris Long is paid $540K plus $130K per year start-up &#038; alcohol budget Columbia should have a crap-free homepage too! And the alcohol &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/11/muckraking-at-the-university-of-oregon/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><a href="https://uomatters.com/">Check it out</a>.  I was amused by these posts:</p>
<p>From 2014:  <a href="https://uomatters.com/uo_links">Crap-free UO homepage</a></p>
<p>From 2024:  <a href="https://uomatters.com/2024/06/provost-chris-long-is-paid-540k-per-year-plus-130k-start-up-alcohol-budget.html">Provost Chris Long is paid $540K plus $130K per year start-up &#038; alcohol budget</a></p>
<p>Columbia should have a crap-free homepage too!  And the alcohol budget reminded me of <a href="https://statmodeling.stat.columbia.edu/2012/12/01/241364-83-13000-228364-83/">this story</a> from Columbia a few years ago.</p>
<p>UO Matters is run by economics professor Bill Harbaugh, who seems to be <a href="https://harbaugh.uoregon.edu">a busy person</a>.  Every city, town, and institution should have this sort of news outlet&#8212;or, ideally, more than one.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/11/muckraking-at-the-university-of-oregon/feed/</wfw:commentRss>
			<slash:comments>24</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;The king, sir, is much better!&#8221;</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/10/the-king-sir-is-much-better/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/10/the-king-sir-is-much-better/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Fri, 10 Jan 2025 14:23:43 +0000</pubDate>
				<category><![CDATA[Literature]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50635</guid>

					<description><![CDATA[Algis Budrys wrote this in 1983: Budrys brings this up in the context of how our reading of a book is affected by whatever reviews and publicity materials we&#8217;ve seen&#8211;an interesting point which I became more aware of after going &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/10/the-king-sir-is-much-better/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Algis Budrys <a href="https://books.google.com/books?id=umObBQAAQBAJ&#038;pg=PA28&#038;lpg=PA28&#038;dq=%22the+king,+sir,+is+much+better%22&#038;source=bl&#038;ots=Ps8vPhN-jK&#038;sig=ACfU3U3QkD0u_ndPIp1LUNRerFEH7SJKrA&#038;hl=fr&#038;sa=X&#038;ved=2ahUKEwjl-IaqqKmGAxWUhIkEHeUVAkcQ6AF6BAgIEAM#v=onepage&#038;q&#038;f=false">wrote this in 1983</a>:</p>
<blockquote><p><img fetchpriority="high" decoding="async" src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/05/Screenshot-2024-05-25-at-13.24.13-1024x866.png" alt="" width="584" height="494" class="alignnone size-large wp-image-50636" srcset="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/05/Screenshot-2024-05-25-at-13.24.13-1024x866.png 1024w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/05/Screenshot-2024-05-25-at-13.24.13-300x254.png 300w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/05/Screenshot-2024-05-25-at-13.24.13-768x650.png 768w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/05/Screenshot-2024-05-25-at-13.24.13-355x300.png 355w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/05/Screenshot-2024-05-25-at-13.24.13.png 1518w" sizes="(max-width: 584px) 100vw, 584px" /></p></blockquote>
<p>Budrys brings this up in the context of how our reading of a book is affected by whatever reviews and publicity materials we&#8217;ve seen&#8211;an interesting point which I became more aware of after going to the bookstore in France and picking out books without having a sense of what they&#8217;d be like (see entry under Crédit Illimité in <a href="https://statmodeling.stat.columbia.edu/2023/04/03/48938/">this post</a>).  In any case, the main thing is that I just like Budrys&#8217;s story and how he tells it.  He was a true blogger, which I mean in the best sense of the word.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/10/the-king-sir-is-much-better/feed/</wfw:commentRss>
			<slash:comments>16</slash:comments>
		
		
			</item>
		<item>
		<title>Postdoc, doctoral student, and summer intern positions, Bayesian methods, Aalto</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/10/postdoc-and-doctoral-student-positions-bayesian-methods-aalto/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/10/postdoc-and-doctoral-student-positions-bayesian-methods-aalto/#comments</comments>
		
		<dc:creator><![CDATA[Aki Vehtari]]></dc:creator>
		<pubDate>Fri, 10 Jan 2025 08:30:23 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<category><![CDATA[Jobs]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51537</guid>

					<description><![CDATA[Postdoc and doctoral student positions in developing Bayesian methods at Aalto University, Finland! This job post is by Aki (maybe someday I write an actual blog post). The positions are funded by Finnish Center for Artificial Intelligence FCAI and there &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/10/postdoc-and-doctoral-student-positions-bayesian-methods-aalto/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Postdoc and doctoral student positions in developing Bayesian methods at Aalto University, Finland! This job post is by Aki (maybe someday I&#8217;ll write an actual blog post).</p>
<p>The positions are funded by the <a href="https://fcai.fi/">Finnish Center for Artificial Intelligence FCAI</a> and there are many other topics, but if you specify me as the preferred supervisor then it&#8217;s going to be Bayesian methods, workflow, cross-validation, and diagnostics. See <a href="https://www.youtube.com/watch?v=lKRRyrPxxeU">my video on Bayesian workflow</a>, the <a href="https://arxiv.org/abs/2011.01808">Bayesian workflow paper</a>, <a href="https://users.aalto.fi/~ave/publications.html">my publication list</a>, and <a href="https://users.aalto.fi/~ave/videos.html">my talk list</a>, for more about what I&#8217;m working on. </p>
<p>There are also plenty of other topics and supervisors in</p>
<ol>
<li>Reinforcement learning
<li>Probabilistic methods
<li>Simulation-based inference
<li>Privacy-preserving machine learning
<li>Collaborative AI and human modeling
<li>Machine learning for science
</ol>
<p>Join us, and learn why Finland has been ranked the happiest country in the world for seven years in a row!</p>
<p>See how to apply at <a href="https://fcai.fi/winter-2025-researcher-positions-in-ai-and-machine-learning">fcai.fi/winter-2025-researcher-positions-in-ai-and-machine-learning</a></p>
<p>I might also hire one summer intern (BSc or MSc level) to work on the same topics. Applications go through the <a href="https://www.aalto.fi/en/aalto-science-institute-asci/how-to-apply-for-the-asci-international-summer-research-programme">Aalto Science Institute call</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/10/postdoc-and-doctoral-student-positions-bayesian-methods-aalto/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>I would add three words to this statement by Uri Simonsohn on preregistration</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/09/i-would-add-three-words-to-this-statement-by-uri-simonsohn-on-preregistration/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/09/i-would-add-three-words-to-this-statement-by-uri-simonsohn-on-preregistration/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Thu, 09 Jan 2025 14:53:01 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Sociology]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51037</guid>

					<description><![CDATA[Uri writes: Pre-registrations should only contain information that helps demarcate confirmatory vs exploratory statistical analyses (i.e., that would help a reader identify harking and p-hacking), and should generally avoid other information. I disagree, partly because I think that confirmatory and &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/09/i-would-add-three-words-to-this-statement-by-uri-simonsohn-on-preregistration/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><a href="http://datacolada.org/119">Uri writes</a>:</p>
<blockquote><p>Pre-registrations should only contain information that helps demarcate confirmatory vs exploratory statistical analyses (i.e., that would help a reader identify harking and p-hacking), and should generally avoid other information.</p></blockquote>
<p>I disagree, partly because I think that confirmatory and exploratory statistical analyses are the same thing (<a href="http://stat.columbia.edu/~gelman/research/published/p755.pdf">see here</a>), partly because I very rarely care about p-values anyway, and mostly because preregistration is super useful to me, not because of harking or p-hacking but because I think that the work we put into preregistration improves our research.  See my discussion <a href="https://statmodeling.stat.columbia.edu/2023/05/29/what-exactly-is-a-preregistration/">here</a> and Jessica&#8217;s <a href="https://statmodeling.stat.columbia.edu/2023/12/04/modest-pre-registration/">here</a>.  For still more, I refer you to my post from 2022, <a href="https://statmodeling.stat.columbia.edu/2022/09/12/whats-the-difference-between-derek-jeter-and-preregistration/">What’s the difference between Derek Jeter and preregistration?</a>.</p>
<p>That said, I can well believe that, <em>for Uri</em>, the best preregistration contains only information that helps demarcate etc etc.</p>
<p>So my suggested alteration to Uri&#8217;s above-quoted statement is just to add &#8220;for Uri Simonsohn&#8221; before the word &#8220;should.&#8221;  Problem solved!</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/09/i-would-add-three-words-to-this-statement-by-uri-simonsohn-on-preregistration/feed/</wfw:commentRss>
			<slash:comments>6</slash:comments>
		
		
			</item>
		<item>
		<title>Junk science becomes more professionalized.  Meanwhile, conspiracy theories are being more associated with the center-right and right, politically.  How does all this fit together?  I&#8217;m not sure.</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/08/junk-science-becomes-more-professionalized-meanwhile-conspiracy-theories-are-being-more-associated-with-the-political-right-how-does-all-this-fit-together-im-not-sure/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/08/junk-science-becomes-more-professionalized-meanwhile-conspiracy-theories-are-being-more-associated-with-the-political-right-how-does-all-this-fit-together-im-not-sure/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Wed, 08 Jan 2025 14:38:48 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Political Science]]></category>
		<category><![CDATA[Sociology]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51020</guid>

					<description><![CDATA[A few years ago I wrote a post, Junk Science Then and Now discussing the movement of junk science from the periphery of elite culture to the core: The junk science (by which I mean work that has some of &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/08/junk-science-becomes-more-professionalized-meanwhile-conspiracy-theories-are-being-more-associated-with-the-political-right-how-does-all-this-fit-together-im-not-sure/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>A few years ago I wrote a post, <a href="https://statmodeling.stat.columbia.edu/2020/03/06/junk-science-then-and-now/">Junk Science Then and Now</a> discussing the movement of junk science from the periphery of elite culture to the core:</p>
<blockquote><p>The junk science (by which I mean work that has some of the forms of scientific research but is missing key elements such as valid and reliable measurements, transparency, and openness to criticism) of the mid-twentieth century came from cranks and outsiders, often self-educated people with no academic positions, and even those who were in academia were peripheral figures, for example, the ESP researcher J. B. Rhine at Duke University, who according to Wikipedia was trained as a botanist and was not a central figure in the psychology profession. Immanuel “Worlds in Collision” Velikovsky had lots of scientist friends, but he was an outsider to the scientific community. And those guys from the 1970s who wrote books about ancient astronauts and the Bermuda triangle, I don’t think they even claimed to have any scientific backing. Yes, there were some missteps within academic science from N-rays to cold fusion, but these were minor storms that blew up and went away.</p>
<p>Nowadays, though, the pseudoscientists are well ensconced in the academy, they play power games in the field of psychology, and they get to publish in the Proceedings of the National Academy of Sciences (air rage, himmicanes, ages ending in 9, etc etc) whenever they want. The call is coming from inside the house, as it were. Many of them are still considered by the news media to be the legitimate representatives of the scientific community.  Even absolutely ridiculous ideas like the <a href="https://statmodeling.stat.columbia.edu/2021/09/21/more-on-that-claim-that-scientific-citations-are-worth-100000-each/">$100,000 citations</a>.  There&#8217;s also the related phenomenon of . . . not &#8220;junk science&#8221; exactly, but bad science:  scientific errors that then persist because the scientific community refuses to come to terms with corrections.  An example is the <a href="https://statmodeling.stat.columbia.edu/2024/09/07/christakis-fowler-update-update/">contagion-of-obesity</a> story.</p></blockquote>
<p>When discussing this, I wrote that the above-described shift represents a sort of gentrification of scientific error, mirroring the professionalism that has come into so many other aspects of our intellectual life.  Instead of some wacky guy somewhere claiming to have developed a perpetual motion machine or whatever, you&#8217;ve got a <a href="https://statmodeling.stat.columbia.edu/2023/07/10/here-are-the-data-from-that-cold-showers-study-so-you-haters-can-now-do-your-own-analyses/">Stanford professor</a> promoting junk science on cold showers.</p>
<p>I thought about all this recently when reading <a href="https://www.slowboring.com/p/the-crank-realignment-is-bad-for">a post</a> by political journalist Matthew Yglesias on what he calls &#8220;the crank realignment&#8221;:</p>
<blockquote><p>Robert F. Kennedy Jr.’s transition from semi-prominent Democrat to third party spoiler to Donald Trump endorser is emblematic of a broader, decade-long “crank realignment” in American politics.</p>
<p>Trump himself, of course, used to be a Democrat. He switched parties in a blaze of birther conspiracy theories, and only then came to embrace conservative views on topics like gun control and abortion. And RFK Jr. was into election fraud conspiracy theories long before January 6, but his version was about George W. Bush stealing the 2004 election in Ohio. That wasn’t a mainstream Democratic Party view (there’s a reason there was no Kerry-led insurrection), but it was mainstream enough to be published in Rolling Stone and for Kennedy to continue to be a player in progressive politics.</p>
<p>Twenty years later, that’s no longer the case. Democrats are much more buttoned-up, and the GOP is much more accepting of cranks and know-nothings like Kennedy.</p>
<p>The partisan shifts of both Trump and RFK Jr. are part of a long term cycle in which educated professionals have gravitated toward the Democratic Party coalition and a generic suspicion of institutions and the people who run them has come to be associated with conservative politics.</p></blockquote>
<p>I think Yglesias is on to something here.  I agree with him that from a logical point of view, there&#8217;s no reason why conspiracy theories should be concentrated on the right half of the political spectrum.  Indeed, from a logical perspective you might expect conspiracy theories to be more popular on the left, as this would be consistent with a general leftist anti-powerful-people, anti-big-business take.</p>
<p>One thing I&#8217;ve noticed in the past is that commentators have been wedded to the idea of anti-science leftists even when the data don&#8217;t bear that out.  <a href="https://statmodeling.stat.columbia.edu/2022/05/26/public-opinion-isnt-always-as-polarized-as-you-think-gmo-food-edition/">Here&#8217;s an example</a> from a couple years ago, where political scientist Chris Blattman made the offhanded remark that opposition to genetically modified organisms (GMOs) was &#8220;mostly left,&#8221; even though actually opposition was about the same on the left and right.  It&#8217;s a convenient story to pair anti-vaccine attitudes on the right with anti-GMO attitudes on the left and ask why can&#8217;t we all get along, but that&#8217;s not what public opinion happens to look like.  I agree with Yglesias that this is kinda too bad, as it makes it harder to have a bipartisan push against anti-science.</p>
<p>It&#8217;s still hard for me to put all this together in my head.  One challenge is that I have the impression that most of the prominent purveyors of junk science and bad science in academia are on the left, or the center-left.  OK, not Dr. Oz, and maybe not that cold-shower dude at Stanford.  And not those <a href="https://statmodeling.stat.columbia.edu/2021/01/29/team-stanford/">covid-minimizers</a>.  Or the climate-change denialists being promoted by Freakonomics.  But the mainstream NPR/Ted/PNAS world . . . they&#8217;re mostly on the left, right?  We do hear about right-wing people in science, but they get some attention because they are exceptions.</p>
<p>So the professionalization of bullshit&#8212;as exemplified by Gladwell&#8217;s prominence at the New Yorker, the <a href="https://statmodeling.stat.columbia.edu/2024/06/04/again-on-the-role-of-elite-media-in-spreading-ufos-as-space-aliens-and-other-bad-ideas/">UFO&#8217;s-as-space-aliens</a> theories promoted by elite journalists, the Association for Psychological Science <a href="https://statmodeling.stat.columbia.edu/2021/12/20/not-replicable-but-citable/">promotion of superstition</a>, and various wacky stuff coming out of Harvard, Stanford, etc.&#8212;runs counter to the movement of conspiracy theorizing from the political fringes to the core of the political right.</p>
<p>I don&#8217;t know where this will all lead.  It seems kind of unstable.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/08/junk-science-becomes-more-professionalized-meanwhile-conspiracy-theories-are-being-more-associated-with-the-political-right-how-does-all-this-fit-together-im-not-sure/feed/</wfw:commentRss>
			<slash:comments>36</slash:comments>
		
		
			</item>
		<item>
		<title>Treasure trove of forensic details in arXiv&#8217;s LaTeX source code</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/07/treasure-trove-of-forensic-details-in-arxivs-latex-source-code/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/07/treasure-trove-of-forensic-details-in-arxivs-latex-source-code/#comments</comments>
		
		<dc:creator><![CDATA[Bob Carpenter]]></dc:creator>
		<pubDate>Tue, 07 Jan 2025 20:00:25 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Stan]]></category>
		<category><![CDATA[Statistical Computing]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51520</guid>

					<description><![CDATA[There&#8217;s gold in them thar hills* When you submit a paper to arXiv, you send them a bundle including the LaTeX source, figures, etc. These are all available for download through the arXiv site. This morning, I was downloading the &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/07/treasure-trove-of-forensic-details-in-arxivs-latex-source-code/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><b>There&#8217;s gold in them thar hills<sup>*</sup></b></p>
<p>When you submit a paper to <i>arXiv</i>, you send them a bundle including the LaTeX source, figures, etc.  These are all available for download through the <i>arXiv</i> site.  This morning, I was downloading the source<sup>**</sup> for the original Hoffman and Gelman no-U-turn sampler paper.  If you want to follow along, here&#8217;s the <a href="https://arxiv.org/abs/1111.4246">arXiv link</a>, but you have to click through to the &#8220;TeX Source&#8221; link under the &#8220;Access Paper:&#8221; header on the top right side under the banner.  What I found was a treasure trove of comments that never made it to the paper, some of which I will share below.</p>
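<p>If you&#8217;d rather grep the bundle than read it by hand, the comment lines are easy to pull out mechanically.  Here&#8217;s a minimal sketch; the helper name and the assumption that you&#8217;ve already unpacked the &#8220;TeX Source&#8221; download to a local <code>.tex</code> file are mine, not part of the post, and it won&#8217;t catch comments that start mid-line.</p>

```python
def extract_comments(tex: str) -> list[str]:
    """Return the text of full-line LaTeX comments: lines whose first
    non-space character is '%', with leading percent signs stripped.
    (A hypothetical helper; comments appended after code on the same
    line are deliberately ignored here.)"""
    found = []
    for line in tex.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("%"):
            found.append(stripped.lstrip("%").strip())
    return found

# e.g., on the .tex file from the unpacked source bundle:
# comments = extract_comments(open("nuts.tex").read())
```

<p>Run over the whole file, this surfaces exactly the kind of <code>%%</code> blocks quoted below.</p>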
<p><b>Examples</b></p>
<p>Returning to Hoffman and Gelman&#8217;s <i>arXiv</i> source LaTeX, what struck me was the following comment right after the algorithm itself.</p>
<pre>
%% Algorithm ?? is more efficient than algorithm
%% ??, but the policy of sampling uniformly from
%% $\cC$ leaves something to be desired. We would prefer to select an
%% element of $\cC$ that is farther away from the initial position
%% $\theta^t$, rather than face the possibility of performing many costly
%% gradient evaluations just to wind up choosing an element of $\cC$ that
%% is close to where we started. Algorithm ?? addresses
%% this issue by giving preference to points subtrees that do not include
%% the starting point $\{\theta^t, r^t\}$. [To do: explain why this is
%%   valid. Probably proof by induction is the easiest way to go.]
</pre>
<p>Neither the <i>arXiv</i> preprint nor the final <i>JMLR</i> paper has a clearly delineated inductive proof.  In both versions, we get &#8220;this is equivalent to a Metropolis-Hastings kernel with proposal &#8230;, and it is straightforward to show that it obeys detailed balance&#8221; (second-to-last sentence on page 1604 of the <i>JMLR</i> paper).</p>
<p>Presumably on the principle of minimizing the surface area for reviewers to gripe about, the following useful comment from the abstract didn&#8217;t make the final cut.</p>
<pre>
%% This issue is compounded when the
%% target distribution depends on a set of parameters that cannot be
%% updated by HMC (such as discrete parameters) and are updated
%% independently of the parameters updated by HMC. 
%% In this case, optimal settings of $L$ may change from iteration to
%% iteration.
</pre>
<p>Here&#8217;s another useful comment that wound up on the cutting-room floor.  I&#8217;m not saying these should all be in the paper&#8212;usually there are so many things you can add and qualify that it requires some judgement.  But for the dedicated and interested reader, the paper would have been more useful with the elided comments.</p>
<pre>
%% Even if we assume that there exists some transformation of the
%% parameter space under which all parameters are i.i.d. and that this
%% transformation can be applied cheaply (i.e. in $O(D)$ time, for
%% example using a low-rank transformation matrix to avoid the $O(D^2)$
%% cost of dense matrix multiplication and the $O(D^3)$ cost of dense
%% matrix inversion), the cost of obtaining an effectively independent
%% sample using RWM is still $O(D^2)$ \citep{Creutz:1988}. Gibbs also
%% requires $O(D^2)$ operations per effectively independent sample in
%% this setting, since it must update $D$ parameters and it must perform
%% a transformation costing $O(D)$ operations after each update.
</pre>
<p>There are also useful explanations of figures that never made the final cut, like this one, which expands the diagram of &#8220;naive NUTS&#8221; in the paper&#8217;s figures into what the paper calls &#8220;efficient NUTS.&#8221;</p>
<pre>
%% %% %% Figure ?? illustrates how an iteration of NUTS might
%% %% %% proceed once the slice and initial momentum variables have been
%% %% %% resampled. Initially (a), we have only one node. We double the size of
%% %% %% the tree to two nodes by taking a single step forward (b), and since
%% %% %% the new point is valid tentatively set $w^{t+1}$ to that new point
%% %% %% (with probability $1/1=1$). We then redouble the size of the tree to
%% %% %% four nodes, taking two steps forward (c). Only one of the two new
%% %% %% nodes is valid, so the probability of choosing a node from the new
%% %% %% half-tree is $1/2$ (the ratio of the number of valid new nodes to
%% %% %% valid old nodes). In this example, we randomly choose to stick with
%% %% %% the old value of $w^{t+1}$. Next, we again double the size of the tree
%% %% %% by taking four steps backward from $w^-$ (d). We discover that the new
%% %% %% half-tree satisfies the stopping criterion, and so we cannot select
%% %% %% any points from it. Finally, we double the tree one more time, this
%% %% %% time going forward (e). This half-tree contains some valid points and
%% %% %% does not satisfy the stopping criterion, but a subtree of it does
%% %% %% satisfy the stopping criterion, so we invalidate the points in that
%% %% %% subtree. The number of valid points in this half-tree (3) is the same
%% %% %% as the number of valid points in the old half-tree (3), so we choose a
%% %% %% point uniformly at random from the new half-tree for $w^{t+1}$. At
%% %% %% this point, the end points $w^-$ and $w^+$ satisfy the stopping
%% %% %% criterion, and we return $w^{t+1}$ as the new position-momentum pair.
</pre>
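<p>For readers working through that description, here is a toy sketch of just the half-tree selection rule it walks through (my own illustration; the function name is made up, and the real algorithm builds the tree recursively and checks the stopping criterion, both of which this skips):</p>

```python
import random

def doubling_select(counts, rng=random):
    """counts[0] is the number of valid points in the initial tree;
    each later entry is the number of valid points in a new half-tree.
    After each doubling, the current sample is replaced by one drawn
    from the new half-tree with probability min(1, n_new / n_valid),
    where n_valid counts the valid points seen so far.  Returns the
    index of the half-tree the final sample came from."""
    n_valid = counts[0]
    chosen = 0
    for i, n_new in enumerate(counts[1:], start=1):
        if n_valid > 0 and rng.random() < min(1.0, n_new / n_valid):
            chosen = i
        n_valid += n_new
    return chosen

# The walkthrough's counts: 1 initial node, then half-trees with
# 1, 1, 0, and 3 valid points.  The empty half-tree (d) can never be
# chosen, and the last one (e) is chosen surely, since 3/3 = 1.
assert doubling_select([1, 1, 1, 0, 3]) == 4
```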
<p>The following would have been nice.</p>
<pre>
%% Also, we should probably have a scatterplot showing target versus
%% realized criteria (mean acceptance probability, mean energy change)
%% that shows that the stochastic approximation scheme pretty much works,
%% and maybe a plot showing convergence speed.
</pre>
<p>There&#8217;s more where these came from&#8212;I was just cherry-picking from the algorithm, abstract, intro, and conclusion.</p>
<p><b>Who knew?</b></p>
<p>I&#8217;ve never heard anyone mention diving into the source of papers, so I wonder just what&#8217;s out there to be mined.  I also wonder how many authors realize that comments in their arXiv LaTeX are forever.</p>
<hr />
<p><small><sup>*</sup> An American idiom meaning there&#8217;s value to be found by exploring in a particular place; see <a href="https://en.wiktionary.org/wiki/gold_in_them_thar_hills">Wiktionary</a> for a definition and etymology.</small></p>
<p><small><sup>**</sup> I downloaded the LaTeX source of Hoffman and Gelman&#8217;s paper in order to produce a <a href="https://chatgpt.com/share/6772fadf-ac88-8011-a6d6-7e83539707d3">ChatGPT(o1[plus]) translation</a> of the efficient NUTS algorithm to Python.  I need to code a similar algorithm for a new sampler we&#8217;re exploring and wanted to make sure I had understood the structure of the NUTS algorithm, because it&#8217;s a very subtle recursion.  GPT continues to impress!<br />
</small></p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/07/treasure-trove-of-forensic-details-in-arxivs-latex-source-code/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>Truth is more realistic than fiction, and what this tells us about odious thought experiments</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/07/truth-is-more-realistic-than-fiction-and-what-this-tells-us-about-odious-thought-experiments/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/07/truth-is-more-realistic-than-fiction-and-what-this-tells-us-about-odious-thought-experiments/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Tue, 07 Jan 2025 14:42:37 +0000</pubDate>
				<category><![CDATA[Political Science]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51021</guid>

					<description><![CDATA[In 2010, economist Robin Hanson gained some notoriety by writing about &#8220;gentle silent rape&#8221;: Imagine a woman was drugged into unconsciousness and then gently raped, so that she suffered no noticeable physical harm nor any memory of the event, and &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/07/truth-is-more-realistic-than-fiction-and-what-this-tells-us-about-odious-thought-experiments/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>In 2010, economist Robin Hanson gained some notoriety by <a href="https://www.overcomingbias.com/p/gentlesilentrapehtml">writing about</a> &#8220;gentle silent rape&#8221;:</p>
<blockquote><p>Imagine a woman was drugged into unconsciousness and then gently raped, so that she suffered no noticeable physical harm nor any memory of the event, and the rapist tried to keep the event secret. Now drugging someone against their will is a crime, but the added rape would add greatly to the crime in the eyes of today’s law, and the added punishment for this addition would be far more than for cuckoldry. . . . A colleague of mine suggests this is gender bias, pure and simple; women seem feminist, and men chivalrous, by railing against rape, but no one looks good complaining about cuckoldry. What other explanations you got?</p></blockquote>
<p>Hanson&#8217;s wikipedia entry contains this quote from Nate Silver from 2012:</p>
<blockquote><p>He is clearly not a man afraid to challenge the conventional wisdom. Instead, Hanson writes a blog called Overcoming Bias, in which he presses readers to consider which cultural taboos, ideological beliefs, or misaligned incentives might constrain them from making optimal decisions.</p></blockquote>
<p>Taking these phrases one at a time:</p>
<p>&#8211; &#8220;Cultural taboos&#8221; = attitudes that the many people have but which you don&#8217;t share.<br />
&#8211; &#8220;Ideological beliefs&#8221; = ideologies that many people hold that you don&#8217;t share.<br />
&#8211; &#8220;Misaligned incentives&#8221; = incentives for people to do things that make you unhappy.<br />
&#8211; &#8220;Optimal decision&#8221; = decisions that you approve of.</p>
<p>In any case, the &#8220;gentle silent rape&#8221; thing sounded like a <a href="https://slate.com/business/2018/04/economist-robin-hanson-might-be-americas-creepiest-professor.html">bizarre</a> thought experiment.  But then <a href="https://www.nytimes.com/2024/09/05/world/europe/france-rape-trial-pelicot-testimony.html">a news item recently appeared</a> in which it really happened:  a man had drugged and raped his wife and kept it secret for decades.  The result was neither gentle nor silent, which I guess might lead Hanson to say that this real-world case wasn&#8217;t an example of what he was talking about, but I would take this argument in the opposite direction and say that the real-world horror story demonstrates a problem with the thought experiment, which is that gentle silent rape isn&#8217;t really a thing&#8212;the phrase is a way of minimizing a real crime by giving it impossible modifying qualifiers.</p>
<p>The point of this post is not to have some sort of gotcha on Nate.  Rather, it&#8217;s just a horribly vivid demonstration of a general issue with thought experiments in social science, which is that to work they should be internally coherent and also consistent with reality.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/07/truth-is-more-realistic-than-fiction-and-what-this-tells-us-about-odious-thought-experiments/feed/</wfw:commentRss>
			<slash:comments>49</slash:comments>
		
		
			</item>
		<item>
		<title>Progress in 2024 (Aki)</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/07/progress-in-2024-aki/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/07/progress-in-2024-aki/#respond</comments>
		
		<dc:creator><![CDATA[Aki Vehtari]]></dc:creator>
		<pubDate>Tue, 07 Jan 2025 08:30:53 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<category><![CDATA[Statistical Computing]]></category>
		<category><![CDATA[Teaching]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51530</guid>

					<description><![CDATA[Here&#8217;s my 2024 progress report. There are 5 publications common with Andrew in 2024. Active Statistics book is the biggest in size, but personally getting the Pareto smoothed importance sampling paper published after 9 years from the first submission was &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/07/progress-in-2024-aki/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Here&#8217;s my 2024 progress report. There are <a href="https://statmodeling.stat.columbia.edu/2025/01/01/published-or-accepted-for-publication-in-2024/">5 publications common with Andrew in 2024</a>.</p>
<p>The Active Statistics book is the biggest in size, but personally, getting the Pareto smoothed importance sampling paper published nine years after the first submission was a big event, too. I think I blogged only the 2023 progress report and job ads (I sometimes have blog post ideas, but as I&#8217;m a slow writer, it&#8217;s difficult to find time to turn them into actual posts). I&#8217;m very happy with the progress in 2024, and also excited about what we are going to get done in 2025!</p>
<h2>Book</h2>
<ul>
<li> Andrew Gelman and Aki Vehtari (2024). <a href="https://avehtari.github.io/ActiveStatistics/">Active Statistics</a>.
</ul>
<h2>Papers published or accepted for publication in 2024</h2>
<ul>
<li>
<p>Yann McLatchie, Sölvi Rögnvaldsson, Frank Weber, and Aki Vehtari (2025). Advances in projection predictive inference. <em>Statistical Science</em>, accepted for publication. <a href="https://arxiv.org/abs/2306.15581">arXiv preprint arXiv:2306.15581</a>. Software: <a href="https://mc-stan.org/projpred/">projpred</a>, <a href="https://kulprit.readthedocs.io/">kulprit</a>.</p>
</li>
<li>
<p>Christopher Tosh, Philip Greengard, Ben Goodrich, Andrew Gelman, Aki Vehtari, and Daniel Hsu (2025). The piranha problem: Large effects swimming in a small pond. <em>Notices of the American Mathematical Society</em>, 72(1):15-25. <a href="https://arxiv.org/abs/2105.13445">arXiv preprint arXiv:2105.13445</a>.</p>
</li>
<li>
<p>Kunal Ghosh, Milica Todorović, Aki Vehtari, and Patrick Rinke (2025). Active learning of molecular data for task-specific objectives. <em>The Journal of Chemical Physics</em>, <a href="https://doi.org/10.1063/5.0229834">doi:10.1063/5.0229834</a>.</p>
</li>
<li>
<p>Charles C. Margossian, Matthew D. Hoffman, Pavel Sountsov, Lionel Riou-Durand, Aki Vehtari, and Andrew Gelman (2024). Nested Rhat: Assessing the convergence of Markov chain Monte Carlo when running many short chains. <em>Bayesian Analysis</em>, <a href="https://doi.org/10.1214/24-BA1453">doi:10.1214/24-BA1453</a>. Software: <a href="https://mc-stan.org/posterior/">posterior</a>.</p>
</li>
<li>
<p>Yann McLatchie and Aki Vehtari (2024). Efficient estimation and correction of selection-induced bias with order statistics. <em>Statistics and Computing</em>, 34(132). <a href="https://doi.org/10.1007/s11222-024-10442-4">doi:10.1007/s11222-024-10442-4</a>.</p>
</li>
<li>
<p>Frank Weber, Änne Glass, and Aki Vehtari (2024). Projection predictive variable selection for discrete response families with finite support. <em>Computational Statistics</em>, <a href="https://doi.org/10.1007/s00180-024-01506-0">doi:10.1007/s00180-024-01506-0</a>. Software: <a href="https://mc-stan.org/projpred/">projpred</a>.</p>
</li>
<li>
<p>Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry (2024). Pareto smoothed importance sampling. <em>Journal of Machine Learning Research</em>, 25(72):1-58. <a href="https://jmlr.org/papers/v25/19-556.html">Online</a>. Software: <a href="https://mc-stan.org/loo/">loo</a>, <a href="https://mc-stan.org/posterior/">posterior</a>, <a href="https://python.arviz.org/">ArviZ</a></p>
</li>
<li>
<p>Manushi Welandawe, Michael Riis Andersen, Aki Vehtari, and Jonathan H. Huggins (2024). A framework for improving the reliability of black-box variational inference. <em>Journal of Machine Learning Research</em>, 25(219):1-71. <a href="https://jmlr.org/papers/v25/22-0327.html">Online</a>.</p>
</li>
<li>
<p>Noa Kallioinen, Topi Paananen, Paul-Christian Bürkner, and Aki Vehtari (2024). Detecting and diagnosing prior and likelihood sensitivity with power-scaling. <em>Statistics and Computing</em>, 34(57). <a href="https://doi.org/10.1007/s11222-023-10366-5">Online</a>.<br /> <a href="https://github.com/n-kall/powerscaling-sensitivity">Supplementary code</a>.<br /> Software: <a href="https://github.com/n-kall/priorsense">priorsense</a></p>
</li>
<li>
<p>Erik Štrumbelj, Alexandre Bouchard-Côté, Jukka Corander, Andrew Gelman, Håvard Rue, Lawrence Murray, Henri Pesonen, Martyn Plummer, and Aki Vehtari (2024). Past, present, and future of software for Bayesian inference. <em>Statistical Science</em>, 39(1):46-61. <a href="https://doi.org/10.1214/23-STS907">Online</a>.</p>
</li>
<li>
<p>Alex Cooper, Dan Simpson, Lauren Kennedy, Catherine Forbes, and Aki Vehtari (2024). Cross-validatory model selection for Bayesian autoregressions with exogenous regressors. <em>Bayesian Analysis</em>, <a href="https://doi.org/10.1214/23-BA1409">doi:10.1214/23-BA1409</a>.</p>
</li>
<li>
<p>Marta Kołczyńska, Paul-Christian Bürkner, Lauren Kennedy, and Aki Vehtari (2024). Trust in state institutions in Europe, 1989–2019. <em>Survey Research Methods</em>, 18(1). <a href="https://doi.org/10.18148/srm/2024.v18i1.8119">doi:10.18148/srm/2024.v18i1.8119</a>.</p>
</li>
<li>
<p>Alex Cooper, Aki Vehtari, Catherine Forbes, Lauren Kennedy, and Dan Simpson (2024). Bayesian cross-validation by parallel Markov chain Monte Carlo. <em>Statistics and Computing</em>, 34:119. <a href="https://doi.org/10.1007/s11222-024-10404-w">doi:10.1007/s11222-024-10404-w</a>.</p>
</li>
<li>
<p>Ryoko Noda, Michael Francis Mechenich, Juha Saarinen, Aki Vehtari, and Indrė Žliobaitė (2024). Predicting habitat suitability for Asian elephants in non-analog ecosystems with Bayesian models. <em>Ecological Informatics</em>, 82:102658. <a href="https://doi.org/10.1016/j.ecoinf.2024.102658">doi:10.1016/j.ecoinf.2024.102658</a>.</p>
</li>
<li>
<p>Petrus Mikkola, Osvaldo A. Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas, Henri Pesonen, Jukka Corander, Aki Vehtari, Samuel Kaski, Paul-Christian Bürkner, and Arto Klami (2024). Prior knowledge elicitation: The past, present, and future. <em>Bayesian Analysis</em>, 19(49):1129-1161. <a href="https://doi.org/10.1214/23-BA1381">doi:10.1214/23-BA1381</a>.</p>
</li>
</ul>
<h2>arXived in 2024</h2>
<ul>
<li>
<p>Marvin Schmitt, Chengkun Li, Aki Vehtari, Luigi Acerbi, Paul-Christian Bürkner, and Stefan T. Radev (2024). Amortized Bayesian Workflow (Extended Abstract). <a href="https://arxiv.org/abs/2409.04332">arXiv preprint arXiv:2409.04332</a>.</p>
</li>
<li>
<p>Måns Magnusson, Jakob Torgander, Paul-Christian Bürkner, Lu Zhang, Bob Carpenter, and Aki Vehtari (2024). posteriordb: Testing, benchmarking and developing Bayesian inference algorithms. <a href="https://arxiv.org/abs/2407.04967">arXiv preprint arXiv:2407.04967</a>. Database and software: <a href="https://github.com/stan-dev/posteriordb">posteriordb</a></p>
</li>
<li>
<p>David Kohns, Noa Kallioinen, Yann McLatchie, and Aki Vehtari (2024). The ARR2 prior: flexible predictive prior definition for Bayesian auto-regressions. <a href="https://arxiv.org/abs/2405.19920">arXiv preprint arXiv:2405.19920</a>.</p>
</li>
<li>
<p>Anna Elisabeth Riha, Nikolas Siccha, Antti Oulasvirta, and Aki Vehtari (2024). Supporting Bayesian modelling workflows with iterative filtering for multiverse analysis. <a href="https://arxiv.org/abs/2404.01688">arXiv preprint arXiv:2404.01688</a>.</p>
</li>
<li>
<p>Guangzhao Cheng, Aki Vehtari, and Lu Cheng (2024). Raw signal segmentation for estimating RNA modifications and structures from Nanopore direct RNA sequencing data. <a href="https://doi.org/10.1101/2024.01.11.575207">bioRxiv preprint</a>.</p>
</li>
</ul>
<h2>Software</h2>
<ul>
<li>
<p><a href="https://mc-stan.org/about/team/">Stan development team</a>. <a href="https://mc-stan.org/">Stan</a>.<br />Releases <a href="https://blog.mc-stan.org/2024/01/16/release-of-cmdstan-2-34/">v2.34</a>, <a href="https://blog.mc-stan.org/2024/06/03/release-of-cmdstan-2-35/">v2.35</a>, <a href="https://blog.mc-stan.org/2024/12/10/release-of-cmdstan-2-36/">v2.36</a>.</p>
</li>
<li>
<p>Aki Vehtari, Jonah Gabry, Måns Magnusson, Yuling Yao, Paul-Christian Bürkner, Topi Paananen, and Andrew Gelman (2024). loo: Efficient Leave-One-Out Cross-Validation and WAIC for Bayesian Models. <a href="https://mc-stan.org/loo/">mc-stan.org/loo/</a>. <br /><a href="https://github.com/stan-dev/loo/releases">Releases v2.7.0, v2.8.0</a></p>
</li>
<li>
<p>Paul-Christian Bürkner, Jonah Gabry, Matthew Kay, and Aki Vehtari (2024). posterior: Tools for Working with Posterior Distributions. <a href="https://mc-stan.org/posterior/">mc-stan.org/posterior/</a>.<br />
  <a href="https://github.com/stan-dev/posterior/releases/tag/v1.6.0">Release 1.6.0</a></p>
</li>
<li>
<p>Noa Kallioinen, Paul-Christian Bürkner, Topi Paananen, Frank Weber, and Aki Vehtari (2024). priorsense: Detecting and diagnosing prior and likelihood sensitivity with power-scaling. <a href="https://github.com/n-kall/priorsense">github.com/n-kall/priorsense</a>.<br />
  <a href="https://cran.r-project.org/web/packages/priorsense/news/news.html">Release 1.0.*</a> (first CRAN release!)</p>
</li>
</ul>
<h2>Case studies</h2>
<ul>
<li>
<p>Aki Vehtari (2024). <a href="https://users.aalto.fi/~ave/casestudies/Nabiximols/nabiximols.html">Nabiximols</a>. Model checking and comparison, comparison of continuous and discrete models, LOO-PIT checking, calibration plots, prior sensitivity analysis, model refinement, treatment effect, effect of model mis-specification.</p>
</li>
<li>
<p>Aki Vehtari (2024). <a href="https://users.aalto.fi/~ave/casestudies/Birthdays/birthdays.html">Birthdays</a>. Workflow example for iterative building of a time series model. In 2024, added demonstration of <a href="https://jmlr.org/papers/v23/21-0889.html">Pathfinder</a> for quick initial results and MCMC initialization.</p>
</li>
</ul>
<h2>FAQ</h2>
<ul>
<li>
<p>Aki Vehtari (2024). <a href="https://users.aalto.fi/~ave/CV-FAQ.html">Cross-validation FAQ</a>. Updates.</p>
</li>
</ul>
<h2>Video</h2>
<ul>
<li>
<p>Aki Vehtari (2024). <a href="https://www.youtube.com/watch?v=12OMXQFbW6I">Pareto-k diagnostic and sample size needed for CLT to hold</a> (StanCon 2024)</p>
</li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/07/progress-in-2024-aki/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>This looks like an excellent new business line for Wolfram Research!</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/06/literary-agency-scams/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/06/literary-agency-scams/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Mon, 06 Jan 2025 14:52:25 +0000</pubDate>
				<category><![CDATA[Economics]]></category>
		<category><![CDATA[Literature]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51009</guid>

					<description><![CDATA[Can you catch the trick in the above letter? I was staring and staring and couldn&#8217;t figure out the scam. Yes, I get it that &#8220;Mushens and Churchill&#8221; is a fake literary agency (according to this post from Victoria Strauss, &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/06/literary-agency-scams/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><img decoding="async" src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/image.png" alt="" width="852" height="813" class="alignnone size-full wp-image-51010" srcset="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/image.png 852w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/image-300x286.png 300w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/image-768x733.png 768w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/image-314x300.png 314w" sizes="(max-width: 852px) 100vw, 852px" /></p>
<p>Can you catch the trick in the above letter?  I was staring and staring and couldn&#8217;t figure out the scam.  Yes, I get it that &#8220;Mushens and Churchill&#8221; is a fake literary agency (according to <a href="https://writerbeware.blog/2024/08/16/the-latest-fake-literary-agencies/">this post from Victoria Strauss</a>, which is where I found this story, this particular scammer is taking the name of the legitimate literary agent Juliet Mushens, which is a really horrible thing to do), and they&#8217;re preying on the hopes of authors.  I get that &#8220;we cannot promise the moon and the stars&#8221; is classic soft-sell.  What I couldn&#8217;t figure out is what&#8217;s the motivation for the scammer.  They get someone to send them their unpublished manuscript?  Later they ask for money to publish the book?  But that doesn&#8217;t make sense&#8212;the author is already self-publishing.  (And, yes, there&#8217;s no shame in paying money to publish your own work&#8212;do you think this blog hosting comes for free?)</p>
<p><a href="https://writerbeware.blog/2024/08/16/the-latest-fake-literary-agencies/">Strauss explains</a> how the scam works:</p>
<blockquote><p>Although I haven’t yet heard from anyone who has actually signed up with MCLit, and therefore don’t know what they’re charging, the fifth paragraph of the solicitation above gives away what they’re selling: an “International Literary Registration Seal and Bookstore Access Code”. Both of these are completely bogus items that scammers have invented to enable them to drain writers’ bank accounts.</p></blockquote>
<p>Ha!  I didn&#8217;t catch that at all.</p>
<p>Here&#8217;s another:</p>
<p><img decoding="async" src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/image-1.png" alt="" width="866" height="956" class="alignnone size-full wp-image-51011" srcset="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/image-1.png 866w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/image-1-272x300.png 272w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/image-1-768x848.png 768w" sizes="(max-width: 866px) 100vw, 866px" /></p>
<p>Strauss spells it out for us:</p>
<blockquote><p>Story Arc Literary Groups employs an approach common to many fake literary agency scams: promising to work on commission only, with no other fees due (note especially paragraph 5, which helpfully explains that “a reputable literary agent should not charge upfront fees”).  The aim of such solicitations, however, is always money, and writers who sign up with Story Arc soon discover this. In order for Story Arc to successfully pitch a book to traditional publishers, authors are told they must first “re-license” their book (a requirement that, as I’ve explained in another blog post, is completely fictional). As is typical for this type of scam, they’re referred to a “trusted” company to perform the service–in this case, an outfit called CreativeIP. The price tag: $5,000.</p></blockquote>
<p>Ouch!</p>
<p>Here&#8217;s another:</p>
<blockquote><p>Typical of fake literary agency scams, Zenith Literary is an aggressive solicitor. One writer who responded to this solicitation was told that in order to snag a traditional publisher’s interest, they needed to gather various “action items”, including “ten editorial reviews and endorsements” (hint: reviews and endorsements are nice, but they are absolutely not required by traditional publishers). To obtain these, the writer was referred to Verse Bound Solutions, a company with no apparent existence beyond a Wyoming business registration but active enough to phone the author and offer them ten book reviews for $3,000.</p></blockquote>
<p>And another:</p>
<blockquote><p>The author who was targeted with [a solicitation from &#8220;ImplicitPress Literary Agency&#8221;] was asked to supply a variety of necessary &#8220;documents&#8221;; note #5, which is what this scam is hoping to sell (no publisher requires or cares about a book trailer):</p></blockquote>
<p><img decoding="async" src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/image-2.png" alt="" width="450" /></p>
<p>If you&#8217;re itching for more such stories, <a href="https://writerbeware.blog/category/literary-agent-scams/">just go here</a>:</p>
<p><img loading="lazy" decoding="async" src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-09-07-at-14.12.47-1024x979.png" alt="" width="584" height="558" class="alignnone size-large wp-image-51013" srcset="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-09-07-at-14.12.47-1024x979.png 1024w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-09-07-at-14.12.47-300x287.png 300w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-09-07-at-14.12.47-768x734.png 768w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-09-07-at-14.12.47-1536x1468.png 1536w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-09-07-at-14.12.47-2048x1958.png 2048w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-09-07-at-14.12.47-314x300.png 314w" sizes="(max-width: 584px) 100vw, 584px" /></p>
<p>What a world we live in.</p>
<p><strong>P.S.</strong>  In case you&#8217;re wondering about the title of this post, <a href="https://statmodeling.stat.columbia.edu/2010/05/12/alert_incompete/">see here for the relevant background</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/06/literary-agency-scams/feed/</wfw:commentRss>
			<slash:comments>10</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;I would have had a simple piece of advice: Say nothing.&#8221;</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/05/i-would-have-had-a-simple-piece-of-advice-say-nothing/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/05/i-would-have-had-a-simple-piece-of-advice-say-nothing/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sun, 05 Jan 2025 14:26:36 +0000</pubDate>
				<category><![CDATA[Literature]]></category>
		<category><![CDATA[Sports]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51018</guid>

					<description><![CDATA[Amusing story here from Jonathan Bailey on &#8220;NaNoWriMo’s Massive AI Blunder.&#8221; I&#8217;m reminded of those people who lie about their running times in zero-stakes races. Bailey&#8217;s article is a pleasure to read because it&#8217;s just straight-up common sense.]]></description>
										<content:encoded><![CDATA[<p><a href="https://www.plagiarismtoday.com/2024/09/04/nanowrimos-massive-ai-blunder/">Amusing story here</a> from Jonathan Bailey on &#8220;NaNoWriMo’s Massive AI Blunder.&#8221;  I&#8217;m reminded of those people who lie about their running times in <a href="https://statmodeling.stat.columbia.edu/2012/11/07/that-last-satisfaction-at-the-end-of-the-career/">zero-stakes races</a>.</p>
<p>Bailey&#8217;s article is a pleasure to read because it&#8217;s just straight-up common sense.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/05/i-would-have-had-a-simple-piece-of-advice-say-nothing/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>What are my goals?  What are their goals?  (How to prepare for that meeting.)</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/04/what-are-my-goals-what-are-their-goals-how-to-prepare-for-that-meeting/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/04/what-are-my-goals-what-are-their-goals-how-to-prepare-for-that-meeting/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sat, 04 Jan 2025 14:50:27 +0000</pubDate>
				<category><![CDATA[Decision Analysis]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50565</guid>

					<description><![CDATA[Corresponding with someone who had a difficult meeting coming up, where she was not sure how much to trust the person she was meeting with, I gave the following advice: Proceed under the assumption that they want to do things &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/04/what-are-my-goals-what-are-their-goals-how-to-prepare-for-that-meeting/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Corresponding with someone who had a difficult meeting coming up, where she was not sure how much to trust the person she was meeting with, I gave the following advice:</p>
<blockquote><p>Proceed under the assumption that they want to do things right.  I say this because if they&#8217;re gonna be defensive, then it doesn&#8217;t matter what you say; it&#8217;s not like you&#8217;re gonna sweet-talk them into opening up.  But if they do want to do better, then maybe there is some hope.</p></blockquote>
<p>My correspondent responded that the person she was meeting hadn&#8217;t been helpful up to this point:  &#8220;I always assume (and hope for) good intentions and a desire to do better. But I’ll admit I’m feeling less positive after a few days of not getting an answer.&#8221;</p>
<p>I continued:</p>
<blockquote><p>Many of these sorts of meetings require negotiation, and good negotiation often involves withholding of information or outright deception, and I&#8217;m not good at either of these things, so I don&#8217;t even try.  Instead I try some of the classic &#8220;Getting to Yes&#8221; strategies:<br />
(1) Before the meeting, I ask myself what are my goals:  my short-term goals for the meeting and my medium and long-term goals that I&#8217;m aiming for.<br />
(2) During the meeting, I explicitly ask the other parties what their goals are.</p>
<p>When I think of various counterproductive interactions I&#8217;ve had in the past, often it seems this has come in part because I was not clear on my goals or on the goals of the other parties; as a result we butted heads when we could&#8217;ve found a mutually-beneficial solution.  I&#8217;m including here some interactions with bad actors:  liars, cheats, etc.  Even when working with people you can&#8217;t trust, the general principles can apply.</p>
<p>It does not always make sense to tell the other parties what your goals are!  But, don&#8217;t worry, most people won&#8217;t ever ask, as they will typically be focused on trying to stand firm on some micro-issue or another.  Kinda like how amateur poker players are notorious for looking over and over again at their own hole cards and not looking enough at you.</p>
<p>The above advice may seem silly because you&#8217;re not involved in a negotiation at all!  Even so, if you have a sense of what your goals are and what their goals are, this could be helpful.  And be careful to distinguish goals from decision options.  A goal is &#8220;I would like X to happen&#8221;; a decision option is &#8220;I will do Y.&#8221;  It&#8217;s natural to think in terms of decision options, but I think this is limiting, compared to thinking about goals.</p>
<p>Anyway, that&#8217;s just my take from a mixture of personal experience and reading on decision making; I&#8217;ve done no direct research on the topic.</p></blockquote>
<p>The above techniques are not any sort of magic; they&#8217;re just an attempt to focus on what is important.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/04/what-are-my-goals-what-are-their-goals-how-to-prepare-for-that-meeting/feed/</wfw:commentRss>
			<slash:comments>10</slash:comments>
		
		
			</item>
		<item>
		<title>Echoing Eco:  From the logic of stories to posterior predictive simulation</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/03/echoing-eco-from-the-logic-of-stories-to-posterior-predictive-simulation/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/03/echoing-eco-from-the-logic-of-stories-to-posterior-predictive-simulation/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Fri, 03 Jan 2025 14:33:22 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<category><![CDATA[Literature]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<category><![CDATA[Statistical Computing]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51008</guid>

					<description><![CDATA[&#8220;When I put Jorge in the library I did not yet know he was the murderer. He acted on his own, so to speak. And it must not be thought that this is an &#8216;idealistic&#8217; position, as if I were &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/03/echoing-eco-from-the-logic-of-stories-to-posterior-predictive-simulation/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><em>&#8220;When I put Jorge in the library I did not yet know he was the murderer.  He acted on his own, so to speak.  And it must not be thought that this is an &#8216;idealistic&#8217; position, as if I were saying that the characters have an autonomous life and the author, in a kind of trance, makes them behave as they themselves direct him.  That kind of nonsense belongs in term papers.  The fact is that the characters are obliged to act according to the laws of the world in which they live.  In other words, the narrator is the prisoner of his own premises.&#8221; &#8212; Umberto Eco, Postscript to The Name of the Rose (translated by William Weaver)</em></p>
<p>Perfectly put.  As I wrote less poetically <a href="https://statmodeling.stat.columbia.edu/2017/02/10/storytelling-predictive-model-checking/">a few years ago</a>, the development of a story is a working-out of possibilities, and that’s why it makes sense that authors can be surprised at how their own stories come out.  In statistics jargon, the surprise we see in a story is a form of predictive check, a recognition that a scenario, if carried out logically, can lead to unexpected places.</p>
<p><a href="http://stat.columbia.edu/~gelman/research/published/isr.pdf">In statistics</a>, one reason we make predictions is to do <a href="http://stat.columbia.edu/~gelman/research/published/A6n41.pdf">predictive checks</a>, to elucidate the implications of a model, in particular what it says (probabilistically) regarding observable outcomes, which can then be checked with existing or new data.</p>
<p>To put it in <a href="http://stat.columbia.edu/~gelman/research/published/storytelling.pdf">storytelling terms</a>, if you tell a story and it leads to a nonsensical conclusion, this implies there’s something wrong with your narrative logic or with your initial scenario.</p>
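<p>To make the predictive-check idea concrete, here is a minimal sketch in Python (the toy data and the simple plug-in normal model are my own invented example, not anything from the linked papers): simulate replicated datasets under the fitted model and ask whether a test statistic computed from the observed data looks plausible among the replications.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data: heavy-tailed in truth, but we (wrongly) model it as iid normal.
y = rng.standard_t(df=2, size=100)

# "Fit" the normal model by plugging in the sample mean and sd.
mu_hat, sigma_hat = y.mean(), y.std(ddof=1)

# Simulate replicated datasets under the fitted model and compare a test
# statistic -- here the maximum absolute value -- with the observed one.
T_obs = np.abs(y).max()
T_rep = np.array([np.abs(rng.normal(mu_hat, sigma_hat, size=y.size)).max()
                  for _ in range(1000)])

# Fraction of replications at least as extreme as the observed statistic.
p_value = (T_rep >= T_obs).mean()
print(f"predictive p-value for max|y|: {p_value:.3f}")
```

<p>A tiny tail probability for the observed statistic is the model telling us that its story, carried out logically, does not lead to data like ours.</p>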
<p>Again, I really like how Eco frames the problem, reconciling the agency of the author (who is the one who comes up with premise and the rules of the game and who works out their implications) and the apparent autonomy of the character (which is a consequence of the logic of the story).</p>
<p>This also connects to <a href="https://statmodeling.stat.columbia.edu/2023/11/18/i-disagree-with-geoff-hinton-regarding-glorified-autocomplete/">a discussion we had a year ago</a> about chatbots.  As I wrote at the time, a lot of what I do at work&#8212;or when blogging!&#8212;is a sort of autocomplete, where I start with some idea and work out its implications.  Indeed, an important part of the writing process is to get into a flow state where the words, sentences, and paragraphs come out smoothly, and in that sense there&#8217;s no other way to do this than with some sort of autocomplete.  Autocomplete isn&#8217;t everything&#8212;sometimes I need to stop and think, make plans, do some math&#8212;but it&#8217;s a lot.</p>
<p>Different people do autocomplete in different ways.  Just restricting ourselves to bloggers here, give the same prompt to Jessica, Bob, Phil, and Lizzie, and you&#8217;ll get four very different posts&#8212;and, similarly, Umberto Eco&#8217;s working out of the logic of a murder in a medieval monastery will come out different from yours.  That&#8217;s not even considering the confounding factor that we get to choose the &#8220;prompts&#8221; for our blog posts, and Eco picked his scenario because he thought it would be fruitful ground for a philosophical novel.  Writing has value, in the same way that prior or posterior simulation has value:  we don&#8217;t know how things will come out any more than we can know the millionth digit of pi without doing the damn calculation.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/03/echoing-eco-from-the-logic-of-stories-to-posterior-predictive-simulation/feed/</wfw:commentRss>
			<slash:comments>11</slash:comments>
		
		
			</item>
		<item>
		<title>Progress in 2024 (Jessica)</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/02/progress-in-2024-jessica/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/02/progress-in-2024-jessica/#comments</comments>
		
		<dc:creator><![CDATA[Jessica Hullman]]></dc:creator>
		<pubDate>Thu, 02 Jan 2025 18:08:37 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51526</guid>

					<description><![CDATA[2024 was an enjoyable year. Below are a few things I did. Conference and journal papers published Improving out-of-population prediction: The complementary effects of model assistance and judgmental bootstrapping. International Journal of Forecasting. Hardy, M. D., Zhang, S., Hullman, J., &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/02/progress-in-2024-jessica/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">2024 was an enjoyable year. Below are a few things I did.</span></p>
<p><strong>Conference and journal papers published</strong></p>
<ul>
<li style="font-weight: 400"><a href="https://www.semanticscholar.org/paper/Improving-out-of-population-prediction%3A-The-effects-Hardy-Zhang/0f7903289a7cd6986df75897f6fd126e4c3a517d"><span style="font-weight: 400">Improving out-of-population prediction: The complementary effects of model assistance and judgmental bootstrapping</span></a><span style="font-weight: 400">. International Journal of Forecasting. Hardy, M. D., Zhang, S., Hullman, J., Hofman, J. M., and Goldstein, D. G. </span></li>
<li style="font-weight: 400"><a href="https://journals.sagepub.com/doi/abs/10.1177/23727322241278687"><span style="font-weight: 400">What to Consider When Considering Differential Privacy for Policy</span></a><span style="font-weight: 400">. Policy Insights from the Behavioral and Brain Sciences (PIBBS). Nanayakkara, P. and Hullman, J.</span></li>
<li style="font-weight: 400"><a href="https://dl.acm.org/doi/10.1109/TVCG.2024.3456402"><span style="font-weight: 400">VMC: A Grammar for Visualizing Statistical Model Checks</span></a><span style="font-weight: 400">. IEEE Transactions of Visualization &amp; Computer Graphics (Proceedings of IEEE VIS 2024). Guo, Z., Kale, A., Kay, M., and Hullman, J.</span></li>
<li style="font-weight: 400"><a href="https://www.science.org/doi/10.1126/sciadv.adk3452"><span style="font-weight: 400">REFORMS: Consensus-based Recommendations for Machine-learning-based Science</span></a><span style="font-weight: 400">. Science Advances, 10 (18). Kapoor, S., Cantrell, E. M., Peng, K., Pham, T. H., Bail, C., Gundersen, O. E., Hofman, J., Hullman, J., Lones, M., Malik, M., Nanayakkara, P., Poldrack, R., Raji, I. D., Roberts, M., Salganik, M., Serra-Garcia, M., Stewart, B., Vandewiele, G., and Narayanan, A.</span></li>
<li style="font-weight: 400"><a href="https://ojs.aaai.org/index.php/AIES/article/view/31656/33823"><span style="font-weight: 400">A Conceptual Framework for Ethical Evaluation of Machine Learning Systems</span></a><span style="font-weight: 400">. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. Gupta, N. R., Hullman, J., and Subramonyam, H.</span></li>
<li style="font-weight: 400"><a href="https://www.computer.org/csdl/proceedings-article/sp/2024/313000a231/1WPcYT0BBmw"><span style="font-weight: 400">Measure-Observe-Remeasure: An Interactive Paradigm for Differentially-Private Exploratory Analysis</span></a><span style="font-weight: 400">. IEEE symposium on Privacy &amp; Security (Proceedings of S&amp;P 2024). Nanayakkara, P., Kim, H., Wu, Y., Sarvghad, A., Mahyar, N., Miklau, G., and Hullman, J. </span></li>
<li style="font-weight: 400"><a href="https://dl.acm.org/doi/fullHtml/10.1145/3630106.3658901"><span style="font-weight: 400">A Decision Theoretic Framework for Measuring AI Reliance</span></a><span style="font-weight: 400">. ACM Conference on Fairness, Accountability, &amp; Transparency (Proceedings of FAccT 2024). Guo, Z., Wu, Y., Hartline, J., and Hullman, J.</span></li>
<li style="font-weight: 400"><a href="https://dl.acm.org/doi/10.1145/3613904.3642446"><span style="font-weight: 400">Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling</span></a><span style="font-weight: 400">. ACM Conference on Human Factors in Computing Systems (Proceedings of CHI 2024). Zhang, D., Chatzimparmpas, A., Kamali, N., and Hullman, J.</span></li>
<li style="font-weight: 400"><a href="https://dl.acm.org/doi/10.1145/3613904.3642442"><span style="font-weight: 400">Erie: A Declarative Grammar for Data Sonification</span></a><span style="font-weight: 400">. ACM Conference on Human Factors in Computing Systems (Proceedings of CHI 2024. Kim, H., Kim, Y-S., and Hullman, J.</span></li>
<li style="font-weight: 400"><a href="https://dl.acm.org/doi/10.1145/3613904.3642375"><span style="font-weight: 400">Milliways: Taming Multiverses through Principled Evaluation of Data Analysis Paths</span></a><span style="font-weight: 400">. ACM Conference on Human Factors in Computing Systems (Proceedings of CHI 2024. Sarma, A., Hwang, K., Hullman, J., and Kay, M.</span></li>
<li style="font-weight: 400"><a href="http://stat.columbia.edu/~gelman/research/published/causal_quartets_tas.pdf"><span style="font-weight: 400">Causal quartets: Different ways to attain the same average treatment effect. </span></a><span style="font-weight: 400">American Statistician 78</span><a href="http://stat.columbia.edu/~gelman/research/published/causal_quartets_tas.pdf"><span style="font-weight: 400">.</span></a><span style="font-weight: 400"> Gelman, A., Hullman, J., and Kennedy, L.</span></li>
</ul>
<p><span style="font-weight: 400">Three of my collaborators above were Ph.D. students I advised, who graduated in 2024! Congrats to <a href="https://priyakalot.github.io/">Priyanka Nanayakkara</a> (now a postdoc at Harvard CRCS), <a href="https://hyeok.me/">Hyeok Kim</a> (now a postdoc at University of Washington CS, on the academic job market), and <a href="https://www.dpzhang.com/">Dongping Zhang</a> (now a research scientist at NREL). </span></p>
<p><strong>Talks (that are available online)</strong></p>
<ul>
<li style="font-weight: 400"><a href="https://www.youtube.com/watch?v=GApHJudc4G4"><span style="font-weight: 400">Benchmarking Visualization for Decision-Making</span></a><span style="font-weight: 400">. MIT Code Conference, Oct. 2024.</span></li>
<li style="font-weight: 400"><a href="https://www.youtube.com/watch?v=EJR71K4hnYI&amp;list=PLuD_SqLtxSdWfDA_GJLFqcj1IcU3ujkBQ&amp;index=3"><span style="font-weight: 400">Data analysis and imagination</span></a><span style="font-weight: 400">. </span><span style="font-weight: 400">Alan Turing Institute, June 2024.</span></li>
</ul>
<p><span style="font-weight: 400">I gave a several other talks that I think exist somewhere online, but I can’t find the links. </span></p>
<p><strong>Workshop participation/organization </strong></p>
<p><span style="font-weight: 400">The highlights of the year for me were workshops I attended over the summer. Getting to travel to interesting places to think deeply about topics you find fascinating is truly a privilege. </span></p>
<ul>
<li style="font-weight: 400"><a href="https://www.birs.ca/events/2024/5-day-workshops/24w5283"><span style="font-weight: 400">Bridging Prediction and Intervention Problems in Social Systems</span></a><span style="font-weight: 400">. Organized by Lydia Liu, Inioluwa Deborah Raji, Angela Zhou, and Arvind Narayanan. Banff Canada, June 2024. </span></li>
<li style="font-weight: 400"><a href="https://www.turing.ac.uk/events/tmcf-workshop-navigating-garden-forking-paths"><span style="font-weight: 400">Navigating the garden of forking paths/Theoretical foundations for interactive data analysis in data-driven science</span></a><span style="font-weight: 400">. Organized by Cagatay Turkay and Roger Beecham. London, June 2024.</span></li>
<li style="font-weight: 400"><a href="https://people.eecs.berkeley.edu/~brecht/n_of_1/"><span style="font-weight: 400">Workshop on Individualized Decision-Making</span></a><span style="font-weight: 400">. Organized by Ben Recht. Berkeley, CA, July 2024.</span></li>
<li style="font-weight: 400"><span style="font-weight: 400">Apple Workshop on Human-Centered Machine Learning. Organized by Kareem Bedri, Leah Findlater, Dominik Moritz, and Jeff Nichols. Cupertino, CA, Aug 2024.</span></li>
</ul>
<p><span style="font-weight: 400">I co-organized a few other workshops I enjoyed:</span></p>
<ul>
<li style="font-weight: 400"><a href="https://www.ideal-institute.org/2024/08/21/workshop-on-theoretical-foundations-of-human-ai-complementarity/"><span style="font-weight: 400">Theoretical Foundations of Human-AI Complementarity</span></a><span style="font-weight: 400">. With Jason Hartline. Northwestern University, Sept. 2024.</span></li>
<li style="font-weight: 400"><a href="https://sites.google.com/berkeley.edu/bb-stat/home"><span style="font-weight: 400">Statistical Frontiers in LLMs and Foundation Models</span></a><span style="font-weight: 400"><span style="font-weight: 400">. With Anastasios Angelopoulos, Stephen Bates, Alex D&#8217;Amour, Fanny Yang, Sophia Sun, and Tatsunori Hashimoto. NeurIPS, Vancouver, Dec. 2024.</span></span></li>
</ul>
<p><span style="font-weight: 400">My goal for next year is to do more creative writing. I would be thrilled if I could write even a couple poems I’m happy with. </span></p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/02/progress-in-2024-jessica/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Advice for weighting the results of conjoint analyses/experiments</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/02/advice-for-weighting-the-results-of-conjoint-analyses-experiments/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/02/advice-for-weighting-the-results-of-conjoint-analyses-experiments/#respond</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Thu, 02 Jan 2025 14:22:42 +0000</pubDate>
				<category><![CDATA[Causal Inference]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<category><![CDATA[Political Science]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51005</guid>

					<description><![CDATA[Someone asked me, &#8220;What advice do you have for weighting the results of conjoint analyses/experiments?” I replied that a conjoint experiment is basically a regression analysis. Here we discuss survey weights and regression, and here&#8217;s the relevant bit: So the &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/02/advice-for-weighting-the-results-of-conjoint-analyses-experiments/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Someone asked me, &#8220;What advice do you have for weighting the results of conjoint analyses/experiments?”</p>
<p>I replied that a conjoint experiment is basically a regression analysis.  <a href="http://stat.columbia.edu/~gelman/research/unpublished/weight_regression.pdf">Here we discuss</a> survey weights and regression, and here&#8217;s the relevant bit:</p>
<p><img loading="lazy" decoding="async" src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-08-29-at-20.56.02-1024x906.png" alt="" width="584" height="517" class="alignnone size-large wp-image-51006" srcset="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-08-29-at-20.56.02-1024x906.png 1024w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-08-29-at-20.56.02-300x265.png 300w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-08-29-at-20.56.02-768x679.png 768w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-08-29-at-20.56.02-1536x1358.png 1536w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-08-29-at-20.56.02-339x300.png 339w, https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/09/Screenshot-2024-08-29-at-20.56.02.png 1920w" sizes="(max-width: 584px) 100vw, 584px" /></p>
<p>So the first paper I recommend is Winship, C., and Radbill, L. (1994). Sampling weights and regression analysis. Sociological Methods and Research 23, 230–257.</p>
<p>I also recommend <a href="https://www.cambridge.org/core/journals/political-analysis/article/improving-the-external-validity-of-conjoint-analysis-the-essential-role-of-profile-distribution/B911EF14513292A24ECB4AC4BAA3FA6B">this recent paper</a> by Brandon de la Cuesta, Naoki Egami, Kosuke Imai, Improving the External Validity of Conjoint Analysis: The Essential Role of Profile Distribution, which addresses issues of average predictive comparisons that have come up many times before in this space, for example <a href="https://statmodeling.stat.columbia.edu/2019/12/29/do-we-still-recommend-average-predictive-comparisons-click-here-to-find-the-surprising-answer/">here</a> and <a href="https://statmodeling.stat.columbia.edu/2022/07/12/clearing-up-confusions-about-the-interpretation-of-logistic-regression-coefficients-on-the-probability-scale/">here</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/02/advice-for-weighting-the-results-of-conjoint-analyses-experiments/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Newly published in 2024</title>
		<link>https://statmodeling.stat.columbia.edu/2025/01/01/published-or-accepted-for-publication-in-2024/</link>
					<comments>https://statmodeling.stat.columbia.edu/2025/01/01/published-or-accepted-for-publication-in-2024/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Wed, 01 Jan 2025 14:16:26 +0000</pubDate>
				<category><![CDATA[Decision Analysis]]></category>
		<category><![CDATA[Literature]]></category>
		<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51519</guid>

					<description><![CDATA[Our big item is the book Active Statistics: Stories, Games, Problems, and Hands-on Demonstrations for Applied Regression and Causal Inference (Andrew Gelman and Aki Vehtari). Then there are the recently published research articles: [2025] Hierarchical Bayesian models to mitigate systematic &#8230; <a href="https://statmodeling.stat.columbia.edu/2025/01/01/published-or-accepted-for-publication-in-2024/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Our big item is the book <a href="https://stat.columbia.edu/~gelman/active-statistics/">Active Statistics: Stories, Games, Problems, and Hands-on Demonstrations for Applied Regression and Causal Inference</a> (Andrew Gelman and Aki Vehtari).</p>
<p>Then there are the recently published <a href="http://stat.columbia.edu/~gelman/research/published/">research articles</a>:</p>
<ul>
<li><a href="http://stat.columbia.edu/~gelman/research/published/MeasurementModels.pdf">[2025] Hierarchical Bayesian models to mitigate systematic disparities in prediction with proxy outcomes. Journal of the Royal Statistical Society A.</a>(Jonas Mikhaeil, Andrew Gelman, and Philip Greengard)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/piranha_ams_notices.pdf">[2025] The piranha problem: Large effects swimming in a small pond. Notices of the American Mathematical Society.</a> (Christopher Tosh, Philip Greengard, Ben Goodrich, Andrew Gelman, Aki Vehtari, sand Daniel Hsu)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/2311.02726.pdf">[2025] For how many iterations should we run Markov chain Monte Carlo? In Handbook of Markov Chain Monte Carlo, second edition.</a> (Charles C. Margossian and Andrew Gelman)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/election_forecasting.pdf">[2024] Why forecast an election that’s too close to call? Nature 634, 1019.</a> (Andrew Gelman)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/2024_Election_Forecasting_Review.pdf">[2024] Grappling with uncertainty in forecasting the 2024 U.S. presidential election. Harvard Data Science Review 6 (4).</a> (Andrew Gelman, Ben Goodrich, and Geonhee Han)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/noise_review.pdf">[2024] Review of &#8220;Noise: A Flaw in Human Judgment,&#8221; by Daniel Kahneman, Olivier Sibony, and Cass R. Sunstein. Chance 37 (3), 70-72.</a>(Gaurav Sood and Andrew Gelman)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/healing3.pdf">[2024] How statistical challenges and misreadings of the literature combine to produce unreplicable science: An example from psychology. Advances in Methods and Practices in Psychological Science.</a> (Andrew Gelman and Nicholas J. L. Brown)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/amalgamating7.pdf">[2024] Statistics as a social activity: Attitudes toward amalgamating evidence. Entropy 26 (8), 652.</a> (Andrew Gelman and Keith O’Rourke)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/nestedRhat.pdf">[2024] Nested R-hat: Assessing the convergence of Markov chain Monte Carlo when running many short chains. Bayesian Analysis.</a> (Charles C. Margossian, Matthew D. Hoffman, Pavel Sountsov, Lionel Riou-Durand, Aki Vehtari, and Andrew Gelman)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/2209.01773.pdf">[2024] Using leave-one-out cross-validation (LOO) in a multilevel regression and poststratification (MRP) workflow: A cautionary tale. Statistics in Medicine 43, 953-982.</a> (Swen Kuh, Lauren Kennedy, Qixuan Chen, and Andrew Gelman)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/donoho_hdsr.pdf">[2024] Hopes and limitations of reproducible statistics and machine learning. Harvard Data Science Review 6 (1).</a> (Andrew Gelman)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/psis6_jmlr.pdf">[2024] Pareto smoothed importance sampling. Journal of Machine Learning Research 25 (72).</a> (Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/pval_RCTs_rev3.pdf">[2024] A new look at p-values for randomized clinical trials. NEJM Evidence 3 (1).</a> (Erik van Zwet, Andrew Gelman, Sander Greenland, Guido Imbens, Simon Schwab, and Steven N. Goodman)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/2211.02383.pdf">[2024] Simulation-based calibration checking for Bayesian computation: The choice of test quantities shapes sensitivity. Bayesian Analysis.</a> (Martin Modrák, Angie H. Moon, Shinyoung Kim, Paul Bürkner, Niko Huurre, Kateřina Faltejsková, Andrew Gelman, and Aki Vehtari)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/jcp.pdf">[2024] Before data analysis: Additional recommendations for designing experiments to learn about the world. Journal of Consumer Psychology 34, 190-191.</a> (Andrew Gelman)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/Research_Build-Your-Own-Statistics%20Course.pdf">[2024] In pursuit of campus-wide data literacy: A guide to developing a statistics course for students in non-quantitative fields. Journal of Statistics and Data Science Education 32, 241-252.</a> (Alexis Lerner and Andrew Gelman)</li>
<li><a href="http://stat.columbia.edu/~gelman/research/published/causal_quartets_tas.pdf">[2024] Causal quartets: Different ways to attain the same average treatment effect. American Statistician 78, 267-272.</a> (Andrew Gelman, Jessica Hullman, and Lauren Kennedy)</li>
</ul>
<p>It&#8217;s a good mix:  some technical work, some research methods, some applications, some teaching material, some reviews. Lots of collaborators!</p>
<p>If you want to see what&#8217;s coming next, you can check out the lists of <a href="http://stat.columbia.edu/~gelman/research/unpublished/">unpublished</a> and <a href="http://stat.columbia.edu/~gelman/research/unwritten/">unwritten</a> articles.</p>
<p>Here are our audios and videos from the past year:</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=5BNVBiPpYlY">Bayesian Workflow</a> (Conversation with Charles Margossian for the MetrumRG podcast, 22 Oct 2024)</li>
<li><a href="https://www.youtube.com/watch?v=EqNbiNcQuKo">Tough Choices in Election Forecasting: All the Things That Can Go Wrong</a> (Presented at the Washington Statistical Society, 11 Oct 2024)</li>
<li><a href="https://www.youtube.com/watch?v=7ZYTWrh8YgI">The Political Content of Unreplicable Research</a> (Presented at the Stanford Classical Liberalism Seminar, 3 Oct 2024)</li>
<li><a href="https://www.youtube.com/watch?v=okv48JwsBw4">Fooling Yourself Less: The Art of Statistical Thinking in AI</a> (Conversation for the High Signal podcast, 18 Sep 2024)</li>
<li><a href="https://www.youtube.com/watch?v=eVNiG7AzPwY">Holes in Bayesian Statistics</a> (Presented at the International Society for Bayesian Analysis meeting, 3 Jul 2024)</li>
<li><a href="https://www.youtube.com/watch?v=Y-UySxjiTks&amp;list=PLuD_SqLtxSdWfDA_GJLFqcj1IcU3ujkBQ&amp;index=2">Beyond the Black Box: Toward a New Paradigm of Statistics in Science</a> (Presented at the Alan Turing Institute, 20 Jun 2024)</li>
</ul>
<p>Last but not least, our blog posts of 2024. We had 540 posts with 11,483 total comments:</p>
<p>It’s bezzle time: The Dean of Engineering at the University of Nevada gets paid $372,127 a year and wrote a paper that’s so bad, you can’t believe it. (204 comments)<br />
In some cases academic misconduct doesn’t deserve a public apology (177 comments)<br />
Getting a pass on evaluating ways to improve science (154 comments)<br />
Niall Ferguson, J. D. Vance, George Washington, and Jesus (129 comments)<br />
Stabbers gonna stab — fraud edition (124 comments)<br />
Suspicious data pattern in recent Venezuelan election (111 comments)<br />
This well-known paradox of R-squared is still buggin me. Can you help me out? (105 comments)<br />
Reflections on the recent election (103 comments)<br />
The mainstream press is failing America (UK edition) (99 comments)<br />
The Behavioural Insights Team decided to scare people. (98 comments)<br />
If you want to play women’s tennis at the top level, there’s a huge benefit to being ____. Not just ____, but exceptionally ___, outlier-outlier ___. (And what we can learn about social science from this stylized fact.) (94 comments)<br />
“Is it really ‘the economy, stupid’?” (91 comments)<br />
Bad stuff going down at the American Sociological Association (84 comments)<br />
How to think about the claim by Justin Wolfers that “the income of the average American will double approximately every 39 years”? (84 comments)<br />
Polling averages and political forecasts and what do you really think is gonna happen in November? (83 comments)<br />
Intelligence is whatever machines cannot (yet) do (82 comments)<br />
On the border between credulity and postmodernism: The case of the UFO’s-as-space-aliens media insiders (78 comments)<br />
Understanding p-values: Different interpretations can be thought of not as different “philosophies” but as different forms of averaging. (78 comments)<br />
“Not once in the twentieth century . . . has a single politician, actor, athlete, or surgeon emerged as a first-rate novelist, despite the dismayingly huge breadth of experience each profession affords.” (77 comments)<br />
Abraham Lincoln and confidence intervals (77 comments)<br />
What is the prevalence of bad social science? (76 comments)<br />
Whassup with those economists who predicted a recession that then didn’t happen? (74 comments)<br />
Feedback on the blog—this is your chance! (74 comments)<br />
If school funding doesn’t really matter, why do people want their kid’s school to be well funded? (69 comments)<br />
“It’s a very short jump from believing kale smoothies are a cure for cancer to denying the Holocaust happened.” (69 comments)<br />
On lying politicians and bullshitting scientists (67 comments)<br />
My comments on Nate Silver’s comments on the Fivethirtyeight election forecast (66 comments)<br />
My suggestion for the 2028 Olympics (66 comments)<br />
The statistical controversy over “White Rural Rage: the Threat to American Democracy” (and a comment about post-publication review) (65 comments)<br />
HMC fails when you initialize at the mode (63 comments)<br />
Selection bias leads to confusion about the relative stability of deterministic and stochastic algorithms (63 comments)<br />
Wendy Brown: “Just as nothing is more corrosive to serious intellectual work than being governed by a political programme (whether that of states, corporations, or a revolutionary movement), nothing is more inapt to a political campaign than the unending reflexivity, critique and self-correction required of scholarly inquiry.” (63 comments)<br />
Bayesians are frequentists. (62 comments)<br />
More red meat for you AI skeptics out there (61 comments)<br />
The River, the Village, and the Fort: Nate Silver’s new book, “On the Edge” (61 comments)<br />
“Zombie Ideas” in psychology, from personality profiling to lucky golf balls (61 comments)<br />
Andrew Gelman is not the science police because there is no such thing as the science police (60 comments)<br />
With journals, it’s all about the wedding, never about the marriage. (59 comments)<br />
Stanford medical school professor misrepresents what I wrote (but I kind of understand where he’s coming from) (59 comments)<br />
Hey—let’s collect all the stupid things that researchers say in order to deflect legitimate criticism (58 comments)<br />
Bayesian statistics: the three cultures (58 comments)<br />
A quick simulation to demonstrate the wild variability of p-values (58 comments)<br />
How to think about the effect of the economy on political attitudes and behavior? (57 comments)<br />
Forking paths in LLMs for data analysis (57 comments)<br />
Two kings, a royal, a knight, and three princesses walk into a bar (Nobel prize edition) (57 comments)<br />
Prediction markets in 2024 and poll aggregation in 2008 (57 comments)<br />
The New York Young Republican Club (56 comments)<br />
Holes in Bayesian statistics (my talk tomorrow at the Bayesian conference, based on work with Yuling) (56 comments)<br />
Abortion crime controversy update (55 comments)<br />
Some references and discussions on the foundations of probability—not the math so much as its connection to the real world, including the claim that “Pr(aliens exist on Neptune that can rap battle) = .137” (54 comments)<br />
Props to the liberal anticommunists of the 1930s-1950s (54 comments)<br />
Interpreting recent Iowa election poll using a rough Bayesian partition of error (54 comments)<br />
How would the election turn out if Biden or Trump were replaced by a different candidate? (53 comments)<br />
Uncertainty in games: How to get that balance so that there’s a motivation to play well, but you can still have a chance to come back from behind? (52 comments)<br />
Decisions of parties to run moderate or extreme candidates (52 comments)<br />
Benefit of Stanford: Are there connections between unethical behavior in science promotion and cheating in private life? (51 comments)<br />
When all else fails, add a code comment (50 comments)<br />
The feel-good open science story versus the preregistration (who do you think wins?) (49 comments)<br />
Deadwood (49 comments)<br />
Why are all these school cheating scandals happening? (48 comments)<br />
Is marriage associated with happiness for men or for women? Or both? Or neither? (48 comments)<br />
He took public funds and falsified his data. Are they gonna make him pay back the $19 million? (48 comments)<br />
Arnold Foundation and Vera Institute argue about a study of the effectiveness of college education programs in prison. (46 comments)<br />
How do you interpret standard errors from a regression fit to the entire population? (46 comments)<br />
It’s Harvard time, baby: “Kerfuffle” is what you call it when you completely botched your data but you don’t want to change your conclusions. (46 comments)<br />
Where have all the count words gone? In defense of “fewer” and “among” (45 comments)<br />
Arguing about bitcoin (44 comments)<br />
Polling by asking people about their neighbors: When does this work? Should people be doing more of it? And the connection to that French dude who bet on Trump (44 comments)<br />
Prediction isn’t everything, but everything is prediction (43 comments)<br />
Who wrote the music for In My Life? Three Bayesian analyses (43 comments)<br />
Honesty and transparency are not enough: politics edition (43 comments)<br />
What’s gonna happen between now and November 5? (42 comments)<br />
What’s the story behind that paper by the Center for Open Science team that just got retracted? (42 comments)<br />
Torment executioners in Reno, Nevada, keep tormenting us with their publications. (41 comments)<br />
“You want to gather data to determine which of two students is a better basketball shooter. You plan to have each student take N shots and then compare their shooting percentages. Roughly how large does N have to be for you to have a good chance of distinguishing a 30% shooter from a 40% shooter?” (41 comments)<br />
Why are we making probabilistic election forecasts? (and why don’t we put so much effort into them?) (41 comments)<br />
The appeal of New York Times columnist David Brooks . . . Yeah, I know this all sounds like a nutty “it’s wheels within wheels, man” sort of argument, but I’m serious here! (40 comments)<br />
Sympathy for the Nudgelords: Vermeule endorsing stupid and dangerous election-fraud claims and Levitt promoting climate change denial are like cool dudes in the 60s wearing Che T-shirts and thinking Chairman Mao was cool—we think they’re playing with fire, they think they’re cute contrarians pointing out contradictions in the system. For a certain kind of person, it’s fun to be a rogue. (40 comments)<br />
Freakonomics does it again (not in a good way). Jeez, these guys are credulous: (40 comments)<br />
Mister P and Stan go to Bangladesh . . . (39 comments)<br />
“Things are Getting So Politically Polarized We Can’t Measure How Politically Polarized Things are Getting” (39 comments)<br />
Opposition (38 comments)<br />
“My basic question is do we really need data to be analysed by both methods?” (38 comments)<br />
The most interesting part of the story is that the publisher went through all these steps of reviewing and revising. If they just want to make money by publishing crap, why bother engaging outside reviewers at all? (38 comments)<br />
A new argument for estimating the probability that your vote will be decisive (37 comments)<br />
Kamala Harris gets coveted xkcd endorsement. (37 comments)<br />
Implicitly denying the controversy associated with the Implicit Association Test. (Whassup with the American Association of Arts &amp; Sciences?) (37 comments)<br />
Simulation to understand two kinds of measurement error in regression (36 comments)<br />
I strongly doubt that any human has ever typed the phrase, “torment executioners,” on any keyboard—except, of course, in discussions such as this. (36 comments)<br />
What is the purpose of a methods section? (36 comments)<br />
The four principles of Barnard College: Respect, empathy, kindness . . . and censorship? (35 comments)<br />
Whooping cough! How to respond to fatally-flawed papers? An example, in a setting where the fatal flaw is subtle, involving a confounding of time and cohort effects (35 comments)<br />
The election is coming: What forecasts should we trust? (35 comments)<br />
Prediction markets and the need for “dumb money” as well as “smart money” (35 comments)<br />
“The Exceptions: Nancy Hopkins, MIT, and the Fight for Women in Science” (35 comments)<br />
Shreddergate! A fascinating investigation into possible dishonesty in a psychology experiment (34 comments)<br />
20 years of blogging . . . What have been your favorite posts? (34 comments)<br />
That’s what happens when you try to run the world while excluding 99.8% of the population (34 comments)<br />
The Theory and Practice of Oligarchical Collectivism (34 comments)<br />
What’s the problem, “math snobs” or rich dudes who take themselves too seriously and are enabled in that by the news media? (33 comments)<br />
(Trying to) clear up a misunderstanding about decision analysis and significance testing (33 comments)<br />
“Take a pass”: New contronym just dropped. (33 comments)<br />
Well, today we find our heroes flying along smoothly… (33 comments)<br />
“How a simple math error sparked a panic about black plastic kitchen utensils”: Does it matter when an estimate is off by a factor of 10? (33 comments)<br />
“Why do medical tests always have error rates?” (32 comments)<br />
Toward a unified theory of bad science and bad scholarship (32 comments)<br />
Michael Clayton in NYC (32 comments)<br />
I’ve been mistaken for a chatbot (31 comments)<br />
Here’s my excuse for using obsolete, sub-optimal, or inadequate statistical methods or using a method irresponsibly. (31 comments)<br />
“Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P &lt; 0.001].” (30 comments)<br />
“Don’t feed the trolls” and the troll semi-bluff (30 comments)<br />
Bad parenting in the news, also, yeah, lots of kids don’t believe in Santa Claus (30 comments)<br />
What do the data say about Kamala Harris’s electability? (30 comments)<br />
When is calibration enough? (30 comments)<br />
“I wonder just what it takes to get people to conclude that a research seam has been mined to the point of exhaustion.” (30 comments)<br />
Extinct Champagne grapes? I can be even more disappointed in the news media (29 comments)<br />
Relating t-statistics and the relative width of confidence intervals (29 comments)<br />
Report of average change from an Alzheimer’s drug: I don’t get the criticism here. (29 comments)<br />
Obnoxious receipt from Spirit Airlines (29 comments)<br />
The recent Iranian election: Should we be suspicious that the vote totals are all divisible by 3? (29 comments)<br />
Bill James hangs up his hat. Also some general thoughts about book writing vs. blogging. Also I push back against James’s claim about sabermetrics and statistics. (29 comments)<br />
Unsolicited feedback on your research from the LLMs at large (29 comments)<br />
Defining statistical models in JAX? (29 comments)<br />
She wants to know what are best practices on flagging bad responses and cleaning survey data and detecting bad responses. Any suggestions from the tidyverse or crunch.io? (29 comments)<br />
What to do with age? (including a regression predictor linearly and also in discrete steps) (28 comments)<br />
“Our troops with aching hearts were obliged to fire a part of the town as a punishment.” (28 comments)<br />
The Rider (28 comments)<br />
“Trivia question for you. I kept temperature records for 100 days one year in Boston, starting August 15th (day “0”). What would you guess is the correlation between day# and temp? r=???” (28 comments)<br />
Anti-immigration attitudes: they didn’t want a bunch of Hungarian refugees coming in the 1950s (28 comments)<br />
What genre of writing is AI-generated poetry? (28 comments)<br />
“I work in a biology lab . . . My PI proposed a statistical test that I think is nonsense. . .” (28 comments)<br />
Blog is adapted to laptops or desktops, not to smartphones or pads. (27 comments)<br />
“On the uses and abuses of regression models: a call for reform of statistical practice and teaching”: We’d appreciate your comments . . . (27 comments)<br />
Inspiring story from a chemistry classroom (27 comments)<br />
Which books, papers, and blogs are in the Bayesian canon? (27 comments)<br />
3M misconduct regarding knowledge of “forever chemicals”: As is so often the case, the problem was in open sight for a long time before anything was done (27 comments)<br />
God is in every leaf of every tree—comic book movies edition. (26 comments)<br />
Our new Substack newsletter: The Future of Statistical Modeling! (26 comments)<br />
Pinker was right, I was wrong. (26 comments)<br />
Does this study really show that lesbians and bisexual women die sooner than straight women? Disparities in Mortality by Sexual Orientation in a Large, Prospective JAMA Paper (26 comments)<br />
A cook, a housemaid, a gardener, a chauffeur, a nanny, a philosopher, and his wife . . . (26 comments)<br />
MCMC draws cannot fill the posterior in high dimensions (26 comments)<br />
Beyond junk science: How to go forward (26 comments)<br />
Instability of win probability in election forecasts (with a little bit of R) (26 comments)<br />
Mark Twain on chatbots (26 comments)<br />
Fake data on the honeybee waggle dance, followed by the inevitable “It is important to note that the conclusions of our studies remain firm and sound.” (26 comments)<br />
Make a hypothesis about what you expect to see, every step of the way. A manifesto: (26 comments)<br />
“Of course, this could conceivably be a case of near unbelievable luck: A flawed analysis based on wrong assumptions gave an unusually large causal effect estimate – but the misguided result just happened to be correct. We can imagine how the research team huddled nervously around the computer terminal biting their nails and silently praying as they executed their updated Stata code, only to erupt in joy and celebration as the results appeared on screen and revealed they were right all along. . . .” (26 comments)<br />
A very interesting discussion by Roy Sorensen of the interesting-number paradox (26 comments)<br />
I love this paper but it’s barely been noticed. (25 comments)<br />
Conformal prediction and people (25 comments)<br />
More on the disconnect between who voters support and what they support (25 comments)<br />
Ancestor-worship in academia: Where does it happen? (25 comments)<br />
A welcome rant on betting, knowledge, belief, and the foundations of probability (25 comments)<br />
Oh no Stanford no no no not again please make it stop (25 comments)<br />
Evidence-based Medicine Eats Itself, and How to do Better (my talk at USC this Friday) (25 comments)<br />
This one might possibly be interesting. (25 comments)<br />
4 different meanings of p-value (and how my thinking has changed) (25 comments)<br />
Credit where due to NPR regarding science data fraud, and here’s how they can do even better (25 comments)<br />
Why isn’t Barack Obama out there giving political speeches? (24 comments)<br />
“Exclusive: Embattled dean accused of plagiarism in NSF report” (yup, it’s the torment executioners) (24 comments)<br />
Mindlessness in the interpretation of a study on mindlessness (and why you shouldn’t use the word “whom” in your dating profile) (24 comments)<br />
“Here’s the Unsealed Report Showing How Harvard Concluded That a Dishonesty Expert Committed Misconduct” (24 comments)<br />
Again on the role of elite media in spreading UFOs-as-space-aliens and other bad ideas (24 comments)<br />
The piranha problem: Large effects swimming in a small pond (24 comments)<br />
Crap papers with crude political agendas published in scientific journals: A push-pull problem (24 comments)<br />
log(A + x), not log(1 + x) (24 comments)<br />
B-school prof data sleuth lawsuit fails (24 comments)<br />
Nonsampling error and the anthropic principle in statistics (24 comments)<br />
A 10% swing in win probability corresponds (approximately) to a 0.4% swing in predicted vote (24 comments)<br />
How large is that treatment effect, really? (My talk at the NYU economics seminar, <del>Thurs 7 Mar</del> 18 Apr) (23 comments)<br />
“I was left with an overwhelming feeling that the World Values Survey is simply a vehicle for telling stories about values . . .” (23 comments)<br />
Grappling with uncertainty in forecasting the 2024 U.S. presidential election (23 comments)<br />
Google is violating the First Law of Robotics. (23 comments)<br />
A question for Nate Cohn at the New York Times regarding a claim about adjusting polls using recalled past vote (23 comments)<br />
New Course: Prediction for (Individualized) Decision-making (23 comments)<br />
Bias remaining after adjusting for pre-treatment variables. Also the challenges of learning through experimentation. (23 comments)<br />
“Accounting for Nonresponse in Election Polls: Total Margin of Error” (23 comments)<br />
This post is not really about Aristotle. (22 comments)<br />
The free will to repost (22 comments)<br />
Paper cited by Stanford medical school professor retracted—but even without considering the reasons for retraction, this paper was so bad that it should never have been cited. (22 comments)<br />
“AI” as shorthand for turning off our brains. (This is not an anti-AI post; it’s a discussion of how we think about AI.) (22 comments)<br />
Fewer kids in our future: How historical experience has distorted our sense of demographic norms (22 comments)<br />
More on the oldest famous person ever (just considering those who lived to at least 104) (22 comments)<br />
Freakonomics asks, “Why is there so much fraud in academia,” but without addressing one big incentive for fraud, which is that, if you make grabby enough claims, you can get featured in . . . Freakonomics! (22 comments)<br />
Fake stories in purported nonfiction (22 comments)<br />
Applications of (Bayesian) variational inference? (22 comments)<br />
The immediate victims of the con would rather act as if the con never happened. Instead, they’re mad at the outsiders who showed them that they were being fooled. (21 comments)<br />
The contrapositive of “Politics and the English Language.” One reason writing is hard: (21 comments)<br />
“A passionate group of scientists determined to revolutionize the traditional publishing model in academia” (21 comments)<br />
It’s Ariely time! They had a preregistration but they didn’t follow it. (21 comments)<br />
Do research articles have to be so one-sided? (21 comments)<br />
Adverse Adult Research Outcomes Increased After Increased Willingness of Public Health Journals to Publish Absolute Crap (21 comments)<br />
The NYT sinks to a new low in political coverage (21 comments)<br />
Calibration is sometimes sufficient for trusting predictions. What does this tell us when human experts use model predictions? (21 comments)<br />
Keith O’Rourke’s final published paper: “Statistics as a social activity: Attitudes toward amalgamating evidence” (21 comments)<br />
Close Reading Archive (21 comments)<br />
Ben Shneiderman’s Golden Rules of Interface Design (20 comments)<br />
Clinical trials that are designed to fail (20 comments)<br />
Zotero now features retraction notices (20 comments)<br />
“He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it.” (20 comments)<br />
Evidence, desire, support (20 comments)<br />
Now here’s a tour de force for ya (20 comments)<br />
To what extent is psychology different from other fields regarding fraud and replication problems? (20 comments)<br />
What to do with election forecasts after Biden is replaced on the ticket? Also something on prediction markets. (20 comments)<br />
The Rise and Fall of the Rock Stars (20 comments)<br />
Carroll/Langer: Credulous, scientist-as-hero reporting from a podcaster who should know better (20 comments)<br />
Who understands alignment anyway (19 comments)<br />
“‘Pure Craft’ Is a Lie” and other essays by Matthew Salesses (19 comments)<br />
Myths of American history from the left, right, and center; also a discussion of the “Why everything you thought you knew was wrong” genre of book. (19 comments)<br />
Sorry, NYT, but, yes, “Equidistant Letter Sequences in the Book of Genesis” was junk science (19 comments)<br />
Clarke’s Law, and who’s to blame for bad science reporting (18 comments)<br />
Preregistration is a floor, not a ceiling. (18 comments)<br />
Hey! Here’s a study where all the preregistered analyses yielded null results but it was presented in PNAS as being wholly positive. (18 comments)<br />
“Andrew, you are skeptical of pretty much all causal claims. But wait, causality rules the world around us, right? Plenty have to be true.” (18 comments)<br />
Not eating sweet potatoes: Is that gonna kill me? (18 comments)<br />
Three takes on the protests at Columbia University (18 comments)<br />
Marquand. (18 comments)<br />
5 different reasons why it’s important to include pre-treatment variables when designing and analyzing a randomized experiment (or doing any causal study) (18 comments)<br />
How did the press do on that “black spatula” story? Not so great. (18 comments)<br />
Here’s a sad post for you to start the new year. The Onion (ok, an Onion-affiliate site) is plagiarizing. For reals. (17 comments)<br />
Jonathan Bailey vs. Stephen Wolfram (17 comments)<br />
The data are on a 1-5 scale, the mean is 4.61, and the standard deviation is 1.64 . . . What’s so wrong about that?? (17 comments)<br />
What to make of implicit biases in LLM output? (17 comments)<br />
Statistics Blunder at the Supreme Court (17 comments)<br />
Design analysis is not just about statistical significance and power; it’s relevant for Bayesian inference too. (17 comments)<br />
What can aspiring political moderates learn from the example of Nelson Rockefeller? (17 comments)<br />
Chutzpah is their superpower (Dominic Sandbrook edition) (17 comments)<br />
Stan’s autodiff is 4x faster than JAX on CPU but 5x slower on GPU (in one eval) (17 comments)<br />
“Toward reproducible research: Some technical statistical challenges” and “The political content of unreplicable research” (my talks at Berkeley and Stanford this Wed and Thurs) (17 comments)<br />
Clybourne Park. And a Jamaican beef patty. (But no Gray Davis, no Grover Norquist, no rabbi.) (17 comments)<br />
Columbia Surgery Prof Fake Data Update . . . (yes, he’s still being promoted on the university webpage) (17 comments)<br />
Objects of the class “David Owen” (17 comments)<br />
Why does this guy have 2 gmail accounts? (17 comments)<br />
Why am I willing to bet you $100-1000 there will be a Nobel Prize for Adaptive Experimentation in the next 40 years? (17 comments)<br />
Regarding the use of “common sense” when evaluating research claims (16 comments)<br />
“Whistleblowers always get punished” (16 comments)<br />
Every time Tyler Cowen says, “Median voter theorem still underrated! Hail Anthony Downs!”, I’m gonna point him to this paper . . . (16 comments)<br />
People have needed rituals to turn data into truth for many years. Why would we be surprised if many people now need procedural reforms to work? (16 comments)<br />
No, it’s not “statistically implausible” when results differ between studies, or between different groups within a study. (16 comments)<br />
I just got a strange phone call from two people who claimed to be writing a news story. They were asking me very vague questions and I think it was some sort of scam. I guess this sort of thing is why nobody answers the phone anymore. (16 comments)<br />
“A bizarre failure in the review process at PNAS” (16 comments)<br />
From what body part does the fish rot? (16 comments)<br />
Put multiple graphs on a page: that’s what Nathan Yau says, and I agree. (16 comments)<br />
Remember that paper that reported contagion of obesity? How’s it being cited nowadays? (16 comments)<br />
It’s martingale time, baby! How to evaluate probabilistic forecasts before the event happens? Rajiv Sethi has an idea. (Hint: it involves time series.) (16 comments)<br />
The odd non-spamness of some spam comments (16 comments)<br />
“And while I don’t really want a back-and-forth . . .” (15 comments)<br />
“When will AI be able to do scientific research both cheaper and better than us, thus effectively obsoleting humans?” (15 comments)<br />
Scientific publishers busily thwarting science (again) (15 comments)<br />
A suggestion on how to improve the broader impacts statement requirement for AI/ML papers (15 comments)<br />
Putting a price on vaccine hesitancy (Bayesian analysis of a conjoint experiment) (15 comments)<br />
What is your superpower? (15 comments)<br />
Who is the Stephen Vincent Benet of today? (15 comments)<br />
“The Active Seating Zone (An Educational Experiment)” (15 comments)<br />
The JAMA effect plus news media echo chamber: More misleading publicity on that problematic claim that lesbians and bisexual women die sooner than straight women (15 comments)<br />
Heroes and Villains: The Effects of Identification Strategies on Strong Causal Claims in France (15 comments)<br />
Present Each Other’s Posters: An update after 15 years (15 comments)<br />
Different perspectives on the claims in the paper, The Colonial Origins of Comparative Development (15 comments)<br />
What makes an MCMC sampler GPU-friendly? (15 comments)<br />
Meta-analysis with a single study (15 comments)<br />
Those correction notices, in full. (Yes, it’s possible to directly admit and learn from error.) (15 comments)<br />
Practical issues with calibration for every group and every decision problem (15 comments)<br />
Presidential campaign effects are small. (15 comments)<br />
Bayesian inference (and mathematical reasoning more generally) isn’t just about getting the answer; it’s also about clarifying the mapping from assumptions to inference to decision. (15 comments)<br />
The Lakatos soccer training (14 comments)<br />
“Science as Verified Trust” (14 comments)<br />
Hand-drawn Statistical Workflow at Nelson Mandela (14 comments)<br />
Combining multiply-imputed datasets, never easy (14 comments)<br />
Simulation from a baseline model as a way to better understand your data: This is what “hypothesis testing” should be. (14 comments)<br />
Age gaps between spouses in U.S., U.K., and India (14 comments)<br />
“Alphabetical order of surnames may affect grading” (14 comments)<br />
Piranhas for “omics”? (14 comments)<br />
Pete Rose and gambling addiction: An insight and a question (14 comments)<br />
Update on that politically-loaded paper published in Demography that I characterized as a “hack job”: Further post-publication review (14 comments)<br />
Bad science as genre fiction: I think there’s a lot to be said for this analogy! (14 comments)<br />
If you wanted to be a top tennis player in the late 1930s, there was a huge benefit to being a member of ____. Or to being named ____. (14 comments)<br />
The Village Voice in the 1960s/70s and blogging in the early 2000s (14 comments)<br />
Social penumbras predict political attitudes (my talk at Harvard on Monday Feb 12 at noon) (13 comments)<br />
A new piranha paper (13 comments)<br />
“There is a war between the ones who say there is a war, and the ones who say there isn’t.” (13 comments)<br />
Another opportunity in MLB for Stan users: the Phillies are hiring (13 comments)<br />
Statistical factuality versus practicality versus poetry (13 comments)<br />
Nicholas Carlini on LLMs and AI for research programmers (13 comments)<br />
Movements in the prediction markets, and going beyond a black-box view of markets and prediction models (13 comments)<br />
ChatGPT o1-preview can code Stan (13 comments)<br />
What if the polls are right? (some scatterplots, and some comparisons to vote swings in past decades) (13 comments)<br />
Help teaching short-course that has a healthy dose of data simulation (13 comments)<br />
Answering two questions, one about Bayesian post-selection inference and one about prior and posterior predictive checks (13 comments)<br />
Evaluating samplers with reference draws (12 comments)<br />
Refuted papers continue to be cited more than their failed replications: Can a new search engine be built that will fix this problem? (12 comments)<br />
Their signal-to-noise ratio was low, so they decided to do a specification search, use a one-tailed test, and go with a p-value of 0.1. (12 comments)<br />
You probably don’t have a general algorithm for an MLE of Gaussian mixtures (12 comments)<br />
This is a very disturbing map. (12 comments)<br />
This is not an argument against self-citations. It’s an argument about how they should be counted. Also, a fun formula that expresses the estimated linear regression coefficient as a weighted average of local slopes. (12 comments)<br />
Data issues in that paper that claims that TikTok and Instagram have consumption spillovers that lead to negative utility (12 comments)<br />
Sports media &gt; Prestige media (space aliens edition) (12 comments)<br />
In search of a theory associating honest citation with a higher/deeper level of understanding than (dishonest) plagiarism (12 comments)<br />
Sports gambling addiction epidemic fueled by some combination of psychology, economics, and politics (12 comments)<br />
Awesome online graph guessing game. And scatterplot charades. (12 comments)<br />
“Pitfalls of Demographic Forecasts of US Elections” (12 comments)<br />
A feedback loop can destroy correlation: This idea comes up in many places. (11 comments)<br />
The paradox of replication studies: A good analyst has special data analysis and interpretation skills. But it’s considered a bad or surprising thing that if you give the same data to different analysts, they come to different conclusions. (11 comments)<br />
Hey, here’s some free money for you! Just lend your name to this university and they’ll pay you $1000 for every article you publish! (11 comments)<br />
Mary Rosh! (11 comments)<br />
When Steve Bannon meets the Center for Open Science: Bad science and bad reporting combine to yield another ovulation/voting disaster (11 comments)<br />
Boris and Natasha in America: How often is the wife taller than the husband? (11 comments)<br />
What happens when you’ve had deferential media coverage and then, all of a sudden, you’re treated as a news item rather than as a figure of admiration? (11 comments)<br />
A data science course for high school students (11 comments)<br />
Is there a balance to be struck between simple hierarchical models and more complex hierarchical models that augment the simple frameworks with more modeled interactions when analyzing real data? (11 comments)<br />
Blog was down and is now operating again. (11 comments)<br />
The state of statistics in 1990 (11 comments)<br />
“A Hudson Valley Reckoning: Discovering the Forgotten History of Slaveholding in My Dutch American Family” (11 comments)<br />
Probabilistic numerics and the folk theorem of statistical computing (11 comments)<br />
Specification curve analysis and the multiverse (11 comments)<br />
The true meaning of the alzabo (11 comments)<br />
Resources for teaching and learning survey sampling, from Scott Keeter at Pew Research (10 comments)<br />
Hey! Here’s some R code to make colored maps using circle sizes proportional to county population. (10 comments)<br />
Our new book, Active Statistics, is now available! (10 comments)<br />
“Hot hand”: The controversy that shouldn’t be. And thinking more about what makes something into a controversy: (10 comments)<br />
N=43, “a statistically significant 226% improvement,” . . . what could possibly go wrong?? (10 comments)<br />
If I got a nickel every time . . . (10 comments)<br />
“Nonreplicable” publications are cited more than “replicable” ones? (10 comments)<br />
Implicit assumptions in the Tversky/Kahneman example of the blue and green taxicabs (10 comments)<br />
It’s Stanford time, baby: 8-hour time-restricted press releases linked to a 91% higher risk of hype (10 comments)<br />
“Responsibility for Raw Data”: “Failure to retain data for some reasonable length of time following publication would produce notoriety equal to the notoriety attained by publishing inaccurate results. A possibly more effective means of controlling quality of publication would be to institute a system of quality control whereby random samples of raw data from submitted journal articles would be requested by editors and scrutinized for accuracy and the appropriateness of the analysis performed.” (10 comments)<br />
The interactions paradox in statistics (10 comments)<br />
Here is the Data Sharing Statement, in its entirety, for van Dyck CH, Swanson CJ, Aisen P, et al. Trial of Lecanemab in Early Alzheimer’s Disease. N Engl J Med. DOI: 10.1056/NEJMoa2212948. (10 comments)<br />
Which book should you read first, Active Statistics or Regression and Other Stories? (10 comments)<br />
Stan Playground: Run Stan on the web, play with your program and data at will, and no need to download anything on your computer (10 comments)<br />
How to cheat at Codenames; cheating at board games more generally (10 comments)<br />
Progress in 2023 (9 comments)<br />
What’s up with spring blooming? (9 comments)<br />
Fun with Dååta: Reference librarian edition (9 comments)<br />
Hey, I got tagged by RetractoBot! (9 comments)<br />
Minimum criteria for studies evaluating human decision-making (9 comments)<br />
Here’s some academic advice for you: Never put your name on a paper you haven’t read. (9 comments)<br />
Defining optimal reliance on model predictions in AI-assisted decisions (9 comments)<br />
Philip K. Dick’s character names (9 comments)<br />
Evilicious 3: Face the Music (9 comments)<br />
Two kings, a royal, a knight, and three princesses walk into a bar . . . (Dude from Saudi Arabia accuses the lords of AI of not giving him enough credit.) (9 comments)<br />
Dan Luu asks, “Why do people post on [bad platform] instead of [good platform]?” (9 comments)<br />
When the story becomes the story (9 comments)<br />
1. Why so many non-econ papers by economists? 2. What’s on the math GRE and what does this have to do with stat Ph.D. programs? 3. How does modern research on combinatorics relate to statistics? (9 comments)<br />
Pervasive randomization problems, here with headline experiments (9 comments)<br />
Some solid criticisms of Ariely and Nudge—from 2012! (9 comments)<br />
It’s lumbar time: Wrong inference because of conditioning on a reasonable, but in this case false, assumption. (9 comments)<br />
You can guarantee that the term “statistical guarantee” will irritate me. Here’s why, and let’s go into some details. (9 comments)<br />
It’s $ time! How much should we charge for a link? (9 comments)<br />
“How bad are search results?” Dan Luu has some interesting thoughts: (9 comments)<br />
Some books: The Good Word (1978), The Hitler Conspiracies (2020), In Defense of History (1999), The Book of the Month (1986), Slow Horses (2010), Freedom’s Dominion (2022), A Meaningful Life (1971) (9 comments)<br />
Background on “fail fast” (9 comments)<br />
“Unusual Betting Patterns With Several Temple Games”: It’s martingale time, baby! (9 comments)<br />
“The Stadium” by Frank Guridy (9 comments)<br />
“Very interesting failed attempt at manipulation on Polymarket today” (9 comments)<br />
Supercentenarian superfraud update (9 comments)<br />
Inaccuracy in New York magazine report on election forecasting (9 comments)<br />
The comments section: A request to non-commenters, occasional commenters, and frequent commenters (9 comments)<br />
20-year anniversary of this blog (9 comments)<br />
Supporting Bayesian modeling workflows with iterative filtering for multiverse analysis (9 comments)<br />
That day in 1977 when Jerzy Neyman committed the methodological attribution fallacy. (9 comments)<br />
Plagiarism searches and post-publication review (9 comments)<br />
Physics is like Brazil, Statistics is like Chile (9 comments)<br />
Progress in 2023, Aki’s software edition (8 comments)<br />
Learning from mistakes (my online talk for the American Statistical Association, 2:30pm Tues 30 Jan 2024) (8 comments)<br />
Lefty Driesell and Bobby Knight (8 comments)<br />
There is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case. (8 comments)<br />
How large is that treatment effect, really? (my talk at NYU economics department Thurs 18 Apr 2024, 12:30pm) (8 comments)<br />
Delayed retraction sampling (8 comments)<br />
“Close but no cigar” unit tests and bias in MCMC (8 comments)<br />
Infovis, infographics, and data visualization: My thoughts 12 years later (8 comments)<br />
6 ways to follow this blog (8 comments)<br />
For that price he could’ve had 54 Jamaican beef patties or 1/216 of a conference featuring Gray Davis, Grover Norquist, and a rabbi (8 comments)<br />
Break it to grok it: The best way to understand how a method works is go construct scenarios where it fails (8 comments)<br />
Loving, hating, and sometimes misinterpreting conformal prediction for medical decisions (8 comments)<br />
Here’s a useful response by Christakis to criticisms of the contagion-of-obesity claims (8 comments)<br />
Here is the Data Sharing Statement, in its entirety, for Goodwin GM, Aaronson ST, Alvarez O, et al. Single-Dose Psilocybin for a Treatment-Resistant Episode of Major Depression. N Engl J Med. DOI: 10.1056/NEJMoa2206443. (8 comments)<br />
Progress in 2023, Jessica Edition (7 comments)<br />
Storytelling and Scientific Understanding (my talks with Thomas Basbøll at Johns Hopkins on 26 Apr) (7 comments)<br />
Listen to those residuals (7 comments)<br />
When do we expect conformal prediction sets to be helpful? (7 comments)<br />
Hey! A new (to me) text message scam! Involving a barfing dog! (7 comments)<br />
Michael Lewis. (7 comments)<br />
Free online book by Bruno Nicenboim, Daniel Schad, and Shravan Vasishth on Bayesian inference and hierarchical modeling using brms and Stan (7 comments)<br />
Minor-league Stats Predict Major-league Performance, Sarah Palin, and Some Differences Between Baseball and Politics (7 comments)<br />
How to code and impute income in studies of opinion polls? (7 comments)<br />
How often is there a political candidate such as Vivek Ramaswamy who is so much stronger in online polls than telephone polls? (7 comments)<br />
Banning the use of common sense in data analysis increases cases of research failure: evidence from Sweden (7 comments)<br />
“Bayesian Workflow: Some Progress and Open Questions” and “Causal Inference as Generalization”: my two upcoming talks at CMU (7 comments)<br />
Decorative statistics and historical records (7 comments)<br />
Some fun basketball graphs (7 comments)<br />
“A Columbia Surgeon’s Study Was Pulled. He Kept Publishing Flawed Data.” . . . and it appears that he’s still at Columbia! (7 comments)<br />
He wants to compute “the effect of a predictor” (that is, an average predictive comparison) for a hierarchical mixture model. You can do it in Stan! (7 comments)<br />
Luck vs. skill in poker (7 comments)<br />
Basu’s Bears (Fat Bear Week and survey calibration) (7 comments)<br />
Flatiron Institute hiring: postdocs, joint faculty, and permanent research positions (7 comments)<br />
Violent science teacher makes ridiculously unsupported research claims, gets treated by legislatures/courts/media as expert on the effects of homeschooling (7 comments)<br />
Should pollsters preregister their design, data collection, and analyses? (7 comments)<br />
Calibration for everyone and every decision problem, maybe (7 comments)<br />
Iterative imputation and incoherent Gibbs sampling (7 comments)<br />
Data manipulation in the world of long-distance swimming! (7 comments)<br />
Announcing two new members of our blogging team . . . (7 comments)<br />
Progress in 2023, Aki Edition (6 comments)<br />
Michael Wiebe has several new replications written up on his site. (6 comments)<br />
The importance of measurement, and how you can draw ridiculous conclusions from your statistical analyses if you don’t think carefully about measurement . . . Leamer (1983) got it. (6 comments)<br />
Cherry blossoms—not just another prediction competition (6 comments)<br />
Tutorial on varying-intercept, varying-slope multilevel models in Stan, from Will Hipson (6 comments)<br />
Mitzi’s and my talks in Trieste 3 and 4 June 2024 (yes, they’ll be broadcast) (6 comments)<br />
One way you can understand people is to look at where they prefer to see complexity. (6 comments)<br />
Edward Kennedy on the Facebook/Instagram 2020 election experiments (6 comments)<br />
Last week’s summer school on probabilistic AI (6 comments)<br />
Toward a Shnerbian theory that establishes connections between the complexity (nonlinearity, chaotic dynamics, number of components) of a system and the capacity to infer causality from datasets (6 comments)<br />
The “fail fast” principle in statistical computing (6 comments)<br />
Two job openings, one in New York on data visualization, one near Paris on Bayesian modeling (6 comments)<br />
An apparent paradox regarding hypothesis tests and rejection regions (6 comments)<br />
What’s a generative model? PyMC and Stan edition (6 comments)<br />
“Announcing the 2023 IPUMS Research Award Winners” (6 comments)<br />
Pete Rose (6 comments)<br />
eLife press release: Deterministic thinking led to a nonsensical statement (6 comments)<br />
“Reduce likelihood of a tick bite by 73.6 times”? Forking paths on the Appalachian Trail. (6 comments)<br />
Delicate language for talking about statistical guarantees (6 comments)<br />
It’s about time (5 comments)<br />
A gathering of the literary critics: Louis Menand and Thomas Mallon, meet Jeet Heer (5 comments)<br />
Why we say that honesty and transparency are not enough: (5 comments)<br />
Statistical practice as scientific exploration (5 comments)<br />
Analogy between (a) model checking in Bayesian statistics, and (b) the self-correcting nature of science. (5 comments)<br />
Population forecasting for small areas: an example of learning through a social network (5 comments)<br />
Data challenges with the Local News Initiative mapping project (5 comments)<br />
Lucy is not a nickname. (5 comments)<br />
“The Secret Life of John Le Carré” (5 comments)<br />
Evil scamming fake publishers (5 comments)<br />
The Mets are looking to hire a data scientist (5 comments)<br />
Why art is more forgiving than game design (5 comments)<br />
Salesses: “some writing exercises meant to help students with various elements of craft” (5 comments)<br />
StanCon 2024 Oxford: recorded talks are now released! (5 comments)<br />
Code it! (patterns in data edition) (5 comments)<br />
Softmax is on the log, not the logit scale (5 comments)<br />
“My view is that if I can show that a result was cooked and that doing it correctly does not yield the answer the authors claimed, then the result is discredited. . . . What I hear, instead, is the following . . .” (4 comments)<br />
Lancet-bashing! (4 comments)<br />
Bayesian Analysis with Python (4 comments)<br />
Bayesian inference with informative priors is not inherently “subjective” (4 comments)<br />
Here’s something you should do when beginning a project, and in the middle of a project, and in the end of the project: Clearly specify your goals, and also specify what’s not in your goal set. (4 comments)<br />
“When are Bayesian model probabilities overconfident?” . . . and we’re still trying to get to meta-Bayes (4 comments)<br />
“Often enough, scientists are left with the unenviable task of conducting an orchestra with out-of-tune instruments” (4 comments)<br />
Studying causal inference in the presence of feedback: (4 comments)<br />
They’re trying to get a hold on the jungle of cluster analysis. (4 comments)<br />
“Beyond the black box: Toward a new paradigm of statistics in science” (talks this Thursday in London by Jessica Hullman, Hadley Wickham, and me) (4 comments)<br />
Interactive and Automated Data Analysis: thoughts from Di Cook, Hadley Wickham, Jessica Hullman, and others (4 comments)<br />
(This one’s important:) Looking Beyond the Obvious: Essentialism and abstraction as central to our reasoning and beliefs (4 comments)<br />
“The Waltz of Reason” and a paradox of book reviewing (4 comments)<br />
StanCon 2024… is a wrap! (4 comments)<br />
22 Revision Prompts from Matthew Salesses (4 comments)<br />
Two spans of the bridge of inference (4 comments)<br />
Average predictive comparisons (4 comments)<br />
Gayface Data Replicability Problems (4 comments)<br />
Addressing legitimate counterarguments in a scientific review: The challenge of being an insider (4 comments)<br />
Most popular posts of 2024 (4 comments)<br />
What is the minimum bloggable contribution? (4 comments)<br />
What to trust in the newspaper? Example of “The Simple Nudge That Raised Median Donations by 80%” (3 comments)<br />
Bayesian BATS to advance Bayesian Thinking in STEM (3 comments)<br />
Intro to BridgeStan: The new in-memory interface for Stan (3 comments)<br />
Storytelling and Scientific Understanding (my talks with Thomas Basbøll at Johns Hopkins this Friday) (3 comments)<br />
Evaluating MCMC samplers (3 comments)<br />
Applied modelling in drug development? brms! (3 comments)<br />
GPT today: Buffon’s Needle in Python with plotting (and some jokes) (3 comments)<br />
Comedy and child abuse in literature (3 comments)<br />
Cross validation and pointwise or joint measures of prediction accuracy (3 comments)<br />
A guide to detecting AI-generated images, informed by experiments on people’s ability to detect them (3 comments)<br />
Free Textbook on Applied Regression and Causal Inference (3 comments)<br />
Free Book of Stories, Activities, Computer Demonstrations, and Problems in Applied Regression and Causal Inference (3 comments)<br />
Close reading in literary criticism and statistical analysis (3 comments)<br />
GIST: Now with local step size adaptation for NUTS (3 comments)<br />
Should you always include a varying slope for the lower-level variable involved in a cross-level interaction? (3 comments)<br />
3 levels of fraud: One-time, Linear, and Exponential (3 comments)<br />
Postdoc position at Northwestern on evaluating AI/ML decision support (3 comments)<br />
The marginalization or Jeffreys-Lindley paradox: it’s already been resolved. (3 comments)<br />
They solved the human-statistical reasoning interface back in the 80s (2 comments)<br />
“Replicability &amp; Generalisability”: Applying a discount factor to cost-effectiveness estimates. (2 comments)<br />
Leap Day Special! (2 comments)<br />
Postdoc Opportunity at the HEDCO Institute for Evidence-Based Educational Practice in the College of Education at the University of Oregon (2 comments)<br />
GIST: Gibbs self-tuning for HMC (2 comments)<br />
“Former dean of Temple University convicted of fraud for using fake data to boost its national ranking” (2 comments)<br />
Update on “the hat”: It’s “the spectre,” a single shape that can tile the plane aperiodically but not periodically, and doesn’t require flipping (2 comments)<br />
New online Stan course: 80 videos + hosted live coding environment (2 comments)<br />
In Stan, “~” should be called a “distribution statement,” not a “sampling statement.” (2 comments)<br />
Forking paths and workflow in statistical practice and communication (2 comments)<br />
Doctoral student positions in Bayesian workflow at Aalto, Finland (2 comments)<br />
Election prediction markets: What happens next? (2 comments)<br />
Oregon State Stats Dept. is Hiring (2 comments)<br />
Hey, journalist readers! Does anyone have a contact at NPR? (2 comments)<br />
“My quick answer is that I don’t care much about permutation tests because they are testing a null null hypothesis that is of typically no interest” (1 comment)<br />
“Theoretical statistics is the theory of applied statistics”: A scheduled conference on the topic (1 comment)<br />
Progress in 2023, Charles edition (1 comment)<br />
A question about Lindley’s supra Bayesian method for expert probability assessment (1 comment)<br />
Those annoying people-are-stupid narratives in journalism (1 comment)<br />
ISBA 2024 Satellite Meeting: Lugano, 25–28 June (1 comment)<br />
My NYU econ talk will be Thurs 18 Apr 12:30pm (NOT Thurs 7 Mar) (1 comment)<br />
Is the 2024 New York presidential primary really an “important election”? (1 comment)<br />
Fully funded doctoral student positions in Finland (1 comment)<br />
“Randomization in such studies is arguably a negative, in practice, in that it gives apparently ironclad causal identification (not really, given the ultimate goal of generalization), which just gives researchers and outsiders a greater level of overconfidence in the claims.” (1 comment)<br />
Supporting Bayesian modelling workflows with iterative filtering for multiverse analysis (1 comment)<br />
No, I don’t believe the claim that “Mothers negatively affected by having three daughters and no sons, study shows.” (1 comment)<br />
He has some questions about a career in sports analytics. (1 comment)<br />
Questions and Answers for Applied Statistics and Multilevel Modeling (1 comment)<br />
StanCon 2024: scholarships, sponsors, and other news (1 comment)<br />
19 ways of looking at data science at the singularity, from David Donoho and 17 others (1 comment)<br />
A message to Christian Hesse, mathematician and author of chess books (1 comment)<br />
Online seminar for Monte Carlo Methods++ (1 comment)<br />
Faculty positions at the University of Oregon’s new Data Science department (1 comment)<br />
“Tough choices in election forecasting: All the things that can go wrong” (my webinar this Friday 11am with the Washington Statistical Society) (1 comment)<br />
Call for StanCon 2025+ (1 comment)<br />
What should Yuling include in his course on statistical computing? (1 comment)<br />
Calibration “resolves” epistemic uncertainty by giving predictions that are indistinguishable from the true probabilities. Why is this still unsatisfying? (1 comment)<br />
Since Jeffrey Epstein is in the news again . . . (0 comments)<br />
Postdoc at Washington State University on law-enforcement statistics (0 comments)<br />
Here’s how to subscribe to our new weekly newsletter: (0 comments)<br />
Progress in 2023, Leo edition (0 comments)<br />
Click here to help this researcher gather different takes on making data visualizations for blind people (0 comments)<br />
Using the term “visualization” for non-visual representation of data (0 comments)<br />
Varying slopes and intercepts in Stan: still painful in 2024 (0 comments)<br />
BD corner: I came across this interesting interview with Daniel Clowes on the sources for Monica (0 comments)<br />
Hey, some good news for a change! (Child psychology and Bayes) (0 comments)<br />
A nested helix plot that simultaneously shows events on the scales of centuries, millennia, . . . all the way back to billions of years (0 comments)<br />
Job Ad: Spatial Statistics Group Lead at Oak Ridge National Laboratory (0 comments)<br />
Bayesian Workflow, Causal Generalization, Modeling of Sampling Weights, and Time: My talks at Northwestern University this Friday and the University of Chicago on Monday (0 comments)<br />
Papers on human decision-making under uncertainty in ML venues! We have advice. (0 comments)<br />
New stat podcast just dropped (0 comments)<br />
Faculty and postdoc jobs in computational stats at Newcastle University (UK) (0 comments)<br />
Subscribe to this free newsletter and get a heads-up on our scheduled posts a week early! (0 comments)<br />
“What do we need from a probabilistic programming language to support Bayesian workflow?” (0 comments)<br />
StanCon 2024 is in 32 days! (0 comments)<br />
NeurIPS 2024 workshop on Statistical Frontiers in LLMs and Foundation Models (0 comments)<br />
Faculty positions at the University of California on AI, Inequality, and Society (0 comments)<br />
Modeling Weights to Generalize (my talk this Wed noon at the Columbia University statistics department) (0 comments)<br />
Bayesian social science conference in Amsterdam! Next month! (0 comments)<br />
Postdoc opportunity! to work with me here at Columbia! on Bayesian workflow! for contamination models! With some wonderful collaborators!! (0 comments)<br />
NYT catches up to Statistical Modeling, Causal Inference, and Social Science (0 comments)<br />
Leave-one-out cross validation (LOO) for an astronomy problem (0 comments)<br />
Self-reference and self-reproduction of evidence (0 comments)<br />
The Red Sox are hiring (0 comments)<br />
Faculty positions at Princeton in interdisciplinary data science (0 comments)</p>
<p>Thank you all for your contributions!</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2025/01/01/published-or-accepted-for-publication-in-2024/feed/</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
			</item>
		<item>
		<title>Calibration “resolves” epistemic uncertainty by giving predictions that are indistinguishable from the true probabilities. Why is this still unsatisfying?</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/31/calibration-resolves-epistemic-uncertainty-by-giving-predictions-that-are-indistinguishable-from-the-true-probabilities-why-is-this-still-unsatisfying/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/31/calibration-resolves-epistemic-uncertainty-by-giving-predictions-that-are-indistinguishable-from-the-true-probabilities-why-is-this-still-unsatisfying/#comments</comments>
		
		<dc:creator><![CDATA[Jessica Hullman]]></dc:creator>
		<pubDate>Tue, 31 Dec 2024 17:27:07 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<category><![CDATA[Decision Analysis]]></category>
		<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51522</guid>

					<description><![CDATA[This is Jessica. The last day of the year is like a good time for finishing things up, so I figured it’s time for one last post wrapping up some thoughts on calibration.  As my previous posts got into, calibrated prediction uncertainty &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/31/calibration-resolves-epistemic-uncertainty-by-giving-predictions-that-are-indistinguishable-from-the-true-probabilities-why-is-this-still-unsatisfying/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">This is Jessica. The last day of the year seems like a good time for finishing things up, so I figured it’s time for one last post wrapping up some thoughts on calibration. </span></p>
<p><span style="font-weight: 400">As my </span><a href="https://statmodeling.stat.columbia.edu/2024/08/14/when-is-calibration-enough/"><span style="font-weight: 400">previous</span></a> <a href="https://statmodeling.stat.columbia.edu/2024/11/01/calibration-is-sometimes-sufficient-for-trusting-predictions-what-does-this-tell-us-when-human-experts-use-model-predictions/"><span style="font-weight: 400">posts</span></a> <a href="https://statmodeling.stat.columbia.edu/2024/11/15/calibration-for-everyone-and-every-decision-maybe/"><span style="font-weight: 400">got into</span></a><span style="font-weight: 400">, calibrated prediction uncertainty is the goal of various posthoc calibration algorithms discussed in machine learning research, which use held out data to learn transformations on model predicted probabilities in order to achieve calibration on the held out data. I’ve reflected a bit on what calibration can and can’t give us in terms of assurances for decision-making. Namely, it makes predictions trustworthy for decisions in the restricted sense that a decision-maker who will choose their action purely based on the prediction can’t do better than treating the calibrated predictions as the true probabilities. </span></p>
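<p>To make that pipeline concrete, here is a minimal sketch of one common posthoc method, temperature scaling, on purely synthetic data. Everything below is an assumption made for illustration: the "model" is just a set of deliberately overconfident logits and the held-out labels are simulated, so this is a toy version of the idea rather than any particular paper's algorithm.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic held-out set for a 3-class problem (all of this is illustrative).
n, k = 1000, 3
true_probs = rng.dirichlet(np.ones(k), size=n)
y = np.array([rng.choice(k, p=p) for p in true_probs])

# An "overconfident model": its logits are the true log-probabilities scaled up.
logits = 3.0 * np.log(true_probs + 1e-12)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def heldout_nll(temp):
    """Negative log likelihood of the held-out labels at temperature temp."""
    p = softmax(logits / temp)
    return -np.mean(np.log(p[np.arange(n), y] + 1e-12))

# Temperature scaling: learn a single scalar transformation of the predicted
# probabilities by minimizing NLL on the held-out data (grid search here).
grid = np.linspace(0.2, 10.0, 200)
T_hat = grid[np.argmin([heldout_nll(t) for t in grid])]
```

<p>Because the logits were inflated by a factor of 3, the learned temperature comes out near 3, approximately recovering the true conditional probabilities on the held-out data.</p>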
<p><span style="font-weight: 400">But something I’ve had trouble articulating as clearly as I’d like involves what’s missing (and why) when it comes to what calibration gives us versus a more complete representation of the limits of our knowledge in making some predictions. </span></p>
<p><span style="font-weight: 400">The distinction involves how we express higher order uncertainty. Let’s say we are doing multiclass classification, and fit a model fhat to some labeled data. Our “level 0” prediction fhat(x) contains no uncertainty representation at all; we check it against the ground truth y. Our “level 1” prediction phat(.|x) predicts the conditional distribution over classes; we check it against the empirical distribution that gives a probability p(y|x) for each possible y. Our “level 2” prediction is trying to predict the distribution of the conditional distribution over classes, p(p(.|x)), e.g., a Dirichlet distribution that assigns probability to each distribution p(.|x), which we can distinguish using some parameters theta.</span></p>
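<p>In code, the three levels can be sketched with a Dirichlet as the level 2 object; the particular parameters here are hypothetical, chosen only to make the hierarchy visible:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# "Level 2": a Dirichlet over class distributions for one input x, 3 classes.
alpha = np.array([2.0, 5.0, 3.0])  # hypothetical parameters theta

# Each draw is one candidate "level 1" distribution p(.|x) over the classes.
p_draws = rng.dirichlet(alpha, size=10_000)

# Averaging out the level-2 uncertainty collapses to a single level-1 forecast,
p_mean = p_draws.mean(axis=0)  # close to alpha / alpha.sum() = [0.2, 0.5, 0.3]

# and a "level 0" point prediction is just the most probable class.
y_hat = int(np.argmax(p_mean))
```

<p>The collapse from level 2 to level 1 is exactly the step where the spread across candidate distributions, the epistemic part, disappears from view.</p>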
<p><span style="font-weight: 400">From a Bayesian modeling perspective, it’s natural to think about distributions of distributions. A prior distribution over model parameters implies a distribution over possible data-generating distributions. Upon fitting a model, the posterior predictive distribution summarizes both “aleatoric” uncertainty due to inherent randomness in the generating process and “epistemic” uncertainty stemming from our lack of knowledge of the true parameter values. </span></p>
<p><span style="font-weight: 400">In some sense calibration “resolves” epistemic uncertainty by providing point predictions that are indistinguishable from the true probabilities. But if you’re hoping to get a faithful summary of the current state of knowledge, it can seem like something is still missing. In the Bayesian framework, we can collapse our posterior prediction of the outcome y for any particular input x to a point estimate, but we don’t have to. </span></p>
<p><span style="font-weight: 400">Part of the difficulty is that whenever we evaluate performance as loss over some data-generating distribution, having more than a point estimate is not necessary. This is true even without considering second order uncertainty. If we train a level 0 prediction of the outcome y using the standard loss minimization framework with 0/1 loss, then it will learn to predict the mode. And so to the extent that it’s hard to argue one’s way out of loss minimization as a standard for evaluating decisions, it’s hard to motivate faithful expression of epistemic uncertainty.</span></p>
<p><span style="font-weight: 400">For second order uncertainty, the added complication is there is no ground truth. We might believe there is some intrinsic value in being able to model uncertainty about the best predictor, but how do we formalize this given that there’s no ground truth against which to check our second order predictions? We can’t learn by drawing samples from the distribution that assigns probability to different first order distributions p(.|x) because technically there is no such distribution beyond our conception of it. </span></p>
<p><a href="https://statmodeling.stat.columbia.edu/2024/11/01/calibration-is-sometimes-sufficient-for-trusting-predictions-what-does-this-tell-us-when-human-experts-use-model-predictions/#comment-2382770"><span style="font-weight: 400">Daniel Lakeland previously provided an example</span></a><span style="font-weight: 400"> I found helpful of what it means to put a Bayesian probability on a predicted frequency, where there’s no sense in which we can check the calibration of the second order prediction. </span></p>
<p><span style="font-weight: 400">Related to this, I recently came across a </span><a href="https://arxiv.org/abs/2203.06102"><span style="font-weight: 400">few papers</span></a> <a href="https://arxiv.org/abs/2301.12736"><span style="font-weight: 400">by Viktor Bengs et al</span></a><span style="font-weight: 400"> that formalize some of this in an ML context. Essentially, they show that there is no well-defined loss function that can be used in the typical ML learning pipeline to incentivize the learner to make correct predictions that are also faithful as expressions of the epistemic uncertainty. This can be expressed in terms of trying to find a </span><a href="https://arxiv.org/abs/2301.12736"><span style="font-weight: 400">proper scoring rule</span></a><span style="font-weight: 400">. In the case of first order predictions, as long as we use a proper scoring rule as the loss function, we can expect accurate predictions, because a proper scoring rule is one for which one cannot score higher by deviating from reporting our true beliefs. But there is no loss function that incentivizes a second-order learner to faithfully represent its epistemic uncertainty like a proper scoring rule does for a first order learner. </span></p>
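<p>The first order half of that claim is easy to check numerically. As a minimal sketch, with an assumed true event probability of 0.7 chosen only for illustration, the expected log loss of reporting a probability q is minimized by reporting the truth, which is what makes the log score proper:</p>

```python
import numpy as np

# Assumed true event probability, chosen only for illustration.
p_true = 0.7

# Candidate reported probabilities.
q = np.linspace(0.01, 0.99, 99)

# Expected log loss of reporting q when Y ~ Bernoulli(p_true):
# E[-log q(Y)] = -(p_true * log q + (1 - p_true) * log(1 - q)).
expected_log_loss = -(p_true * np.log(q) + (1.0 - p_true) * np.log(1.0 - q))

# The minimizer over the grid is the honest report q = p_true.
q_best = q[np.argmin(expected_log_loss)]
```

<p>The result of Bengs et al. is that no analogous construction exists at the second order: there is no loss under which faithfully reporting one's epistemic state is guaranteed to be optimal.</p>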
<p><span style="font-weight: 400">This may seem obvious, especially if you’re coming from a Bayesian tradition, considering that there is no ground truth against which to score second order predictions. And yet, various loss functions have been proposed for estimating level 2 predictors in the ML literature, such as minimizing the empirical loss of the level 1 prediction averaged over possible parameter values. These results make clear that one needs to be careful interpreting the predictors they give, because, e.g., they can actually incentivize predictors that appear to be certain about the first order distribution. </span></p>
<p><span style="font-weight: 400">I guess a question that remains is how to talk about incentives for second order uncertainty at all in a context where minimizing loss from predictions is the primary goal. I don’t think the right conclusion is that it doesn’t matter since we can’t integrate it into a loss minimization framework. Having the ability to decompose predictions by different sources of uncertainty and be explicit about what our higher order uncertainty looks like going in (i.e., by defining a prior) has scientific value in less direct ways, like communicating beliefs and debugging when things go wrong. </span></p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/31/calibration-resolves-epistemic-uncertainty-by-giving-predictions-that-are-indistinguishable-from-the-true-probabilities-why-is-this-still-unsatisfying/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
		<item>
		<title>What is the minimum bloggable contribution?</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/31/what-is-the-minimum-bloggable-contribution/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/31/what-is-the-minimum-bloggable-contribution/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Tue, 31 Dec 2024 14:41:28 +0000</pubDate>
				<category><![CDATA[Decision Analysis]]></category>
		<category><![CDATA[Literature]]></category>
		<category><![CDATA[Sociology]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50987</guid>

					<description><![CDATA[The other day someone sent me an email pointing me to an online article, a statistical analysis criticizing an online article that was a reanalysis of data from an article that was a meta-analysis and literature review of a controversial &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/31/what-is-the-minimum-bloggable-contribution/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>The other day someone sent me an email pointing me to an online article, a statistical analysis criticizing an online article that was a reanalysis of data from an article that was a meta-analysis and literature review of a controversial topic that had been written about in some earlier published papers that were themselves literature reviews.</p>
<p>My correspondent was interested in my take, and I replied that the latest article made some good points and also some errors.  I didn&#8217;t think I had anything useful to say on this one so I didn&#8217;t post on the topic.</p>
<p>Given that I&#8217;d already gone to the trouble of reading&#8211;OK, skimming&#8211;all these articles, and I can post here for free, arguably I could contribute in a useful way just with a short post explaining where I agreed with this new article and where I thought it went wrong.  I have no real or perceived beefs with any of the people involved in this one, and I expect that whatever feedback I were to provide would be taken constructively, not <a href="https://statmodeling.stat.columbia.edu/2019/01/18/ladder-responses-criticism-responsible-destructive/">defensively</a>.</p>
<p>So why not post?  The difficulty here would come not directly in my comments on those articles but rather in the necessary scaffolding:  all the bits I&#8217;d need to add to avoid writing something that could be misinterpreted.</p>
<p>One of the challenges of writing, as compared to speaking, is that your words are just out there, interpreted without the benefit of intonation, context, and dialogue.  This isn&#8217;t anything specific to blogging; the same issue can arise when communicating with people by email.  Not that direct face-to-face conversation always works either; it&#8217;s just that something written stands on its own in a way that speech does not.</p>
<p>So it&#8217;s a cost-benefit calculation:  weighing my contributions to this particular discussion against the effort of constructing the scaffolding.  In this case I decided the best option was to not bother.</p>
<p>But then I became interested in the meta-topic of when to post, so I wrote this, which I&#8217;ll schedule to appear in a few months.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/31/what-is-the-minimum-bloggable-contribution/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			</item>
		<item>
		<title>Most popular posts of 2024</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/31/most-popular-posts-of-2024/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/31/most-popular-posts-of-2024/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Tue, 31 Dec 2024 14:12:58 +0000</pubDate>
				<category><![CDATA[Sociology]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51518</guid>

					<description><![CDATA[John Cook writes that he looked on Hacker News to see which of his posts were most popular: &#8220;I [Cook] didn’t look at my server logs, but generally the posts that get the most traffic are posts that someone submits &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/31/most-popular-posts-of-2024/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>John Cook <a href="https://www.johndcook.com/blog/2024/12/24/most-popular-2024/">writes that</a> he looked on Hacker News to see which of his posts were most popular:  &#8220;I [Cook] didn’t look at my server logs, but generally the posts that get the most traffic are posts that someone submits to Hacker News.&#8221;</p>
<p>I don&#8217;t look at my server logs either&#8211;that way lies madness!  My question is, how much should we trust the popularity ranking from Hacker News?</p>
<p>I don&#8217;t know much about Hacker News myself.  I <a href="https://news.ycombinator.com">went there</a> and entered statmodeling in the search box, and by default it lists in order of popularity.  Here are the first two pages of the most popular from 2024:</p>
<blockquote><p>Suspicious data pattern in recent Venezuelan election (https://statmodeling.stat.columbia.edu/2024/07/31/suspicious-data-pattern-in-recent-venezuelan-election/)<br />
903 points|kgwgk|5 months ago|470 comments</p>
<p>Bayesian Statistics: The three cultures (https://statmodeling.stat.columbia.edu/2024/07/10/three-cultures-bayes-subjective-objective-pragmatic/)<br />
309 points|luu|5 months ago|109 comments</p>
<p>The immediate victims of a con would rather act as if the con never happened (https://statmodeling.stat.columbia.edu/2024/01/07/french-bio-lab-research-scandal/)<br />
230 points|Tomte|1 year ago|140 comments</p>
<p>Bad stuff going down at the American Sociological Association (https://statmodeling.stat.columbia.edu/2024/01/17/bad-stuff-going-down-at-the-american-sociological-association/)<br />
129 points|Tomte|1 year ago|116 comments</p>
<p>Those correction notices, in full (https://statmodeling.stat.columbia.edu/2024/11/24/those-correction-notices-in-full-yes-its-possible-to-directly-admit-and-learn-from-error/)<br />
128 points|Tomte|1 month ago|35 comments</p>
<p>Dean of Engineering at University of Nevada wrote a paper that’s bad (https://statmodeling.stat.columbia.edu/2024/02/06/its-bezzle-time-the-dean-of-engineering-at-the-university-of-nevada-gets-paid-372127-a-year-and-wrote-a-paper-thats-so-bad-you-cant-believe-it-i-mean-really-you-have-to-take-a-look-at-t/)<br />
123 points|Tomte|11 months ago|79 comments</p>
<p>Defining Statistical Models in Jax? (https://statmodeling.stat.columbia.edu/2024/10/08/defining-statistical-models-in-jax/#comments)<br />
118 points|hackandthink|3 months ago|18 comments</p>
<p>Critique of Freakonomics interview with psychologist Ellen Langer (https://statmodeling.stat.columbia.edu/2024/10/28/freakonomics-does-it-again-not-in-a-good-way-jeez-these-guys-are-credulous/)<br />
94 points|nabla9|2 months ago|82 comments</p>
<p>Refuted papers continue to be cited more than their failed replications (https://statmodeling.stat.columbia.edu/2024/03/10/refuted-papers-continue-to-be-cited-more-than-their-failed-replications-can-a-new-search-engine-be-built-that-will-fix-this-problem/)<br />
79 points|nabla9|10 months ago|34 comments</p>
<p>Well-known paradox of R-squared is still buggin me (https://statmodeling.stat.columbia.edu/2024/06/17/this-well-known-paradox-of-r-squared-is-still-buggin-me-can-you-help-me-out/)<br />
71 points|luu|6 months ago|103 comments</p>
<p>Levels of fraud: One-time, Linear, and Exponential (https://statmodeling.stat.columbia.edu/2024/09/27/3-levels-of-fraud-one-time-linear-and-exponential/)<br />
4 points|Tomte|3 months ago|0 comments</p>
<p>You can guarantee that the term &#8220;statistical guarantee&#8221; will irritate me (https://statmodeling.stat.columbia.edu/2024/07/05/you-can-guarantee-that-the-term-statistical-guarantee-will-irritate-me-heres-why-and-lets-go-into-some-details/)<br />
4 points|Tomte|6 months ago|0 comments</p>
<p>Is it really &#8220;the economy, stupid&#8221;? (https://statmodeling.stat.columbia.edu/2024/05/11/is-it-really-the-economy-stupid/)<br />
4 points|munichpavel|8 months ago|0 comments</p>
<p>What are the best scientific papers ever written? (https://statmodeling.stat.columbia.edu/2020/06/09/what-are-the-best-scientific-papers-ever-written/)<br />
4 points|reqo|9 months ago|0 comments</p>
<p>Zotero now features retraction notices (https://statmodeling.stat.columbia.edu/2024/03/12/zotero-now-features-retraction-notices/)<br />
4 points|Tomte|10 months ago|0 comments</p>
<p>&#8220;the income of the average American will double approximately every 39 years&#8221;? (https://statmodeling.stat.columbia.edu/2024/05/15/the-income-of-the-average-american-will-double-approximately-every-39-years-who-says-that-sort-of-thing-show-some-respect-for-uncertainty-dude/)<br />
3 points|luu|7 months ago|2 comments</p>
<p>Prediction markets and the need for &#8220;dumb money&#8221; as well as &#8220;smart money&#8221; (https://statmodeling.stat.columbia.edu/2024/10/25/prediction-markets-and-the-need-for-dumb-money-as-well-as-smart-money/)<br />
3 points|nabla9|2 months ago|1 comments</p>
<p>If you want to play tennis at the top level there&#8217;s a benefit to being ____ (https://statmodeling.stat.columbia.edu/2024/09/06/meritocracy-and-womens-tennis/)<br />
3 points|noelwelsh|4 months ago|1 comments</p>
<p>Why bother engaging outside reviewers at all? (https://statmodeling.stat.columbia.edu/2024/12/14/the-most-interesting-part-of-the-story-is-that-the-publisher-went-through-all-these-steps-of-reviewing-and-revising-if-they-just-want-to-make-money-by-publishing-crap-why-bother-engaging-outside-re/)<br />
3 points|Tomte|16 days ago|0 comments</p>
<p>ChatGPT o1-preview can code Stan (https://statmodeling.stat.columbia.edu/2024/10/22/chatgpt-o1-preview-can-code-stan/)<br />
3 points|lr0|2 months ago|0 comments</p>
<p>Sean Carroll/Ellen Langer: Credulous, scientist-as-hero reporting (https://statmodeling.stat.columbia.edu/2024/10/19/carroll-langer-credulous-scientist-as-hero-reporting-from-a-podcaster-who-should-know-better/)<br />
3 points|nabla9|2 months ago|0 comments</p>
<p>Freakonomics asks, &#8220;Why is there so much fraud in academia,&#8221; missing one reason (https://statmodeling.stat.columbia.edu/2024/09/14/freakonomics-asks-why-is-there-so-much-fraud-in-academia-but-without-addressing-one-big-incentive-for-fraud-which-is-that-if-you-make-grabby-enough-claims-you-can-get-features-in-freako/)<br />
3 points|nabla9|4 months ago|0 comments</p>
<p>Google is violating the First Law of Robotics (https://statmodeling.stat.columbia.edu/2024/07/16/google-is-violating-the-first-law-of-robotics/)<br />
3 points|nabla9|6 months ago|0 comments</p>
<p>Best way to understand how a method works? Construct scenarios where it fails (https://statmodeling.stat.columbia.edu/2024/05/25/break-it-to-grok-it-the-best-way-to-understand-how-a-method-works-is-go-construct-scenarios-where-it-fails/)<br />
3 points|rossdavidh|7 months ago|0 comments</p>
<p>Infovis, Infographics, and Data Visualization (https://statmodeling.stat.columbia.edu/2024/04/18/infovis-infographics-and-data-visualization-my-thoughts-12-years-later/)<br />
3 points|Tomte|9 months ago|0 comments</p>
<p>What is the prevalence of bad social science? (https://statmodeling.stat.columbia.edu/2024/04/06/what-is-the-prevalence-of-bad-social-science/)<br />
3 points|Tomte|9 months ago|0 comments</p>
<p>The Contrapositive of &#8220;Politics and the English Language.&#8221; (https://statmodeling.stat.columbia.edu/2024/03/25/the-contrapositive-of-politics-and-the-english-language-one-reason-writing-is-hard/)<br />
3 points|luu|9 months ago|0 comments</p>
<p>The Contrapositive of &#8220;Politics and the English Language.&#8221; (https://statmodeling.stat.columbia.edu/2024/03/25/the-contrapositive-of-politics-and-the-english-language-one-reason-writing-is-hard/)<br />
3 points|Tomte|9 months ago|0 comments</p>
<p>How can a top scientist be so confidently wrong? (2022) (https://statmodeling.stat.columbia.edu/2022/06/08/how-can-a-top-scientist-be-so-confidently-wrong-r-a-fisher-and-smoking-example/)<br />
3 points|EndXA|10 months ago|0 comments</p>
<p>Michael Lewis (https://statmodeling.stat.columbia.edu/2024/02/22/michael-lewis/)<br />
3 points|luu|10 months ago|0 comments</p>
<p>Is Matthew Walker&#8217;s &#8220;Why We Sleep&#8221; Riddled with Scientific/Factual Errors? (2019) (https://statmodeling.stat.columbia.edu/2019/11/18/is-matthew-walkers-why-we-sleep-riddled-with-scientific-and-factual-errors/)<br />
3 points|Tomte|10 months ago|0 comments</p>
<p>Clinical trials that are designed to fail (https://statmodeling.stat.columbia.edu/2024/02/11/clinical-trials-that-are-designed-to-fail/)<br />
3 points|Tomte|11 months ago|0 comments</p>
<p>Number of inline code comments is zero. Nada. Zilch. Nil. Naught (https://statmodeling.stat.columbia.edu/2024/02/07/when-all-else-fails-add-a-code-comment/)<br />
3 points|thaumasiotes|11 months ago|0 comments</p>
<p>The Onion (ok, an Onion-affiliate site) is plagiarizing. For reals (https://statmodeling.stat.columbia.edu/2024/01/01/the-onion-ok-an-onion-affiliate-site-is-plagiarizing/)<br />
3 points|Tomte|1 year ago|0 comments</p>
<p>The rise and fall of Seth Roberts and the Shangri-La diet (2023) (https://statmodeling.stat.columbia.edu/2023/11/20/the-rise-and-fall-of-seth-roberts-and-the-shangri-la-diet/)<br />
2 points|sowbug|1 month ago|1 comments</p>
<p>Prediction markets need dumb money as well as smart money (https://statmodeling.stat.columbia.edu/2024/10/25/prediction-markets-and-the-need-for-dumb-money-as-well-as-smart-money/)<br />
2 points|johndcook|2 months ago|1 comments</p>
<p>Whassup with those economists who predicted a recession that then didn&#8217;t happen? (https://statmodeling.stat.columbia.edu/2024/05/31/whassup-with-those-economists-who-predicted-a-recession-that-then-didnt-happen/)<br />
2 points|_pfco|7 months ago|1 comments</p>
<p>Harvard time baby: completely botched data but you don&#8217;t change your conclusions (https://statmodeling.stat.columbia.edu/2024/12/12/harvard-lexicon-kerfuffle-is-what-you-call-it-when-you-completely-botched-your-data-but-you-dont-want-to-change-your-conclusions/)<br />
2 points|nabla9|17 days ago|0 comments</p>
<p>What it takes to conclude that a research seam has been mined to exhaustion (https://statmodeling.stat.columbia.edu/2024/11/26/i-wonder-just-what-it-takes-to-get-people-to-conclude-that-a-research-seam-has-been-mined-to-the-point-of-exhaustion/)<br />
2 points|domofutu|1 month ago|0 comments</p>
<p>Andrew Gelman is not the science police because there is no science police (https://statmodeling.stat.columbia.edu/2024/11/21/andrew-gelman-is-not-the-science-police-because-there-is-no-such-thing-as-the-science-police/)<br />
2 points|Tomte|1 month ago|0 comments</p></blockquote>
<p>I guess I should be thanking kgwgk, luu, nabla9, Tomte, and the rest of the gang for linking to us!  It&#8217;s interesting how many of the links come from the same few Hacker News participants.</p>
<p>Also, I don&#8217;t know what these &#8220;points&#8221; represent.  2 points doesn&#8217;t sound like a lot, which is kind of a bummer . . . I put so much effort into writing these posts, I&#8217;d always appreciate more readers!</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/31/most-popular-posts-of-2024/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
		<item>
		<title>A very interesting discussion by Roy Sorensen of the interesting-number paradox</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/30/a-very-interesting-discussion-by-roy-sorenson-of-the-interesting-number-paradox/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/30/a-very-interesting-discussion-by-roy-sorenson-of-the-interesting-number-paradox/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Mon, 30 Dec 2024 14:09:00 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51517</guid>

					<description><![CDATA[There&#8217;s this mathematical joke that all numbers&#8211;more precisely, all positive integers&#8211;are interesting. As Roy Sorensen puts it: Mathematicians are fond of Edwin Beckenbach’s (1945) argument: A. If some [positive] integer is not interesting, then there is a least such integer. &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/30/a-very-interesting-discussion-by-roy-sorenson-of-the-interesting-number-paradox/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>There&#8217;s this mathematical joke that all numbers&#8211;more precisely, all positive integers&#8211;are interesting. As Roy Sorensen <a href="http://fitelson.org/few/few_08/sorensen.pdf">puts it</a>:</p>
<blockquote><p>Mathematicians are fond of Edwin Beckenbach’s (1945) argument:</p>
<p>A. If some [positive] integer is not interesting, then there is a least such integer.</p>
<p>B. If some integer is the first uninteresting integer, then that fact makes the integer interesting.</p>
<p>C. Therefore, all integers are interesting.</p>
<p>The sophism has attracted no philosophical commentary because of a trivializing resemblance to Berry’s paradox and the sorites.</p></blockquote>
<p>I&#8217;d not heard of either Berry’s paradox or the sorites&#8211;you can look them up yourself on wikipedia. Indeed, you can find <a href="https://en.wikipedia.org/wiki/List_of_paradoxes#Self–reference">a whole list</a> of paradoxes of self-reference.</p>
<p>Sorensen&#8217;s article is fun and thought-provoking. The full reference is Roy Sorensen (2011). Interestingly dull numbers. Philosophy and Phenomenological Research 82, 655-673.</p>
<p>Here&#8217;s how he resolves the paradox:</p>
<blockquote><p>My main objection is to premise B (`If some integer is the first uninteresting integer, then that fact makes the integer interesting’) of Beckenbach’s sophism. . . .</p>
<p>I [Sorensen] agree with Beckenbach that some numbers are interesting. . . . But I also think that there are infinitely many uninteresting integers. Since the dull integers must start somewhere, there must be a first one &#8212; even if the vagueness of `dull’ makes it impossible to specify which it is. . . .</p>
<p>Beckenbach bases premise B on the fact that any instance of D will imply E:</p>
<p>D. It is interesting that n is the first uninteresting integer.</p>
<p>E. Therefore, n is an interesting integer.</p>
<p>My counter-explanation of the unsoundness is that the inference is invalid. . . .</p></blockquote>
<p>Sorensen&#8217;s key point is that a number can be embedded in an interesting statement without itself being interesting. Suppose, for example, we say that each of the integers from 0 through 20 are interesting, as are 23 and 24, but that 21 and 22 are dull (with the terms &#8220;dull&#8221; and &#8220;interesting&#8221; depending on context; the set of numbers that are interesting to a sports fan could differ from the set of numbers that are interesting to a mathematician; and also conditional on the level of focus, as the harder you look the more likely it is that you can find something interesting; but that doesn&#8217;t affect the paradox, it&#8217;s just a matter of definition). It&#8217;s arguably an interesting statement that 21 is the smallest dull positive integer, but that doesn&#8217;t make 21 itself interesting.</p>
<p>Sorensen gives several examples of that sort of thing:</p>
<blockquote><p>An uninteresting fact can embed an interesting fact. (For instance, it is interesting that the coastline of Norway is longer than the coastline of the United States but it is not interesting that this fact is interesting.) The case for D centers on the dual of this embedding principle: An interesting fact can incorporate a dull fact.</p>
<p>Indeed, it can be interesting that a fact is dull . . . `873 is the difference between the squares of two consecutive integers’ looks interesting. But actually this fact is not interesting; any odd number greater than 1 is the difference between the squares of two consecutive integers. The dullness of `873 is the difference between the squares of two consecutive integers’ is interesting because this dullness is explained by an interesting generality.</p>
<p>Undistinctiveness is just one genre of instructive dullness. The monotony of `The decimal expansion of 1/9 is .111….’ is a sign that it is a non-terminating fraction. The enervating patternlessness of the decimal expansion of π is a sign that it is a transcendental number.</p>
<p>In an early example of computer programming, Alan Turing analyzed chess into a sequence of subtasks. The more menial he made the subprocedures, the more interest he added to the overall effect. Turing’s chess programs breathed new life into homuncular models of psychological processes. . . .</p>
<p>In defense of conceptual analysis, I say the dullness of an identity statement often promotes the interest of the analysis. Consider contested identities. Students at first deny 1 = .999…. The teacher then points out that 1/3 = .333…. and 1/3 x 3 = 1. This conjunction of trivial truths makes most students regard 1 and .999…. as alternate numeric representations of the same number. They stop viewing 1 = .999…. as a near miss and start regarding it as trivially true. In the final analysis, the interesting fact is not that 1 = .999…. Just the opposite! The interesting fact is that `1 = .999….’ is dull.</p></blockquote>
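<p>The arithmetic asides in the quote above are easy to check directly. Here is a quick sketch in Python (my own illustration, not from Sorensen&#8217;s paper), using the identity (n+1)&#178; &#8722; n&#178; = 2n + 1 for the 873 example and exact rational arithmetic for the 1/3 example:</p>

```python
from fractions import Fraction

# Every odd m > 1 is a difference of consecutive squares:
# (n + 1)**2 - n**2 = 2*n + 1, so m corresponds to n = (m - 1) // 2.
assert 437**2 - 436**2 == 873  # the "dull" fact about 873
for m in range(3, 1000, 2):
    n = (m - 1) // 2
    assert (n + 1)**2 - n**2 == m

# The classroom identity behind "1 = .999...":
# 1/3 = .333... and 3 * (1/3) = 1, done exactly with rationals.
assert 3 * Fraction(1, 3) == 1
```

<p>As Sorensen notes, it is the generality 2n + 1 that drains the 873 fact of interest: the loop above succeeds for every odd number, so no particular instance is special.</p>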
<p>Or, for an even simpler example, you can write an article about the color green, without that article itself being green.</p>
<p>The questions become more subtle when we move between mathematics and the social world (including the very use of base 10 as a social convention):</p>
<blockquote><p>A dull number can be denoted by an interesting numeral. In hexadecimal (base 16), 57005 is denoted by DEAD. . . .</p>
<p>Once we become sensitized to the distinction between using an expression and merely mentioning it, we become more discriminating about the means by which a number can inherit interest from a fact. . . .</p>
<p>In mathematics, inheritance is restricted to internal relations. The interest of `92 is the number of different arrangements of 8 non-attacking queens on an 8 x 8 chessboard’ is assigned to chess rather than 92 because chess is an alien relatum.</p>
<p>People relate interestingly to numbers but the interest of these relationships attaches to people. The interest of `The grandmaster Bobby Fisher died at 64, the number of squares on a chessboard’ attaches to Fisher, not 64.</p>
<p>Plutarch remarks that &#8220;The Pythagoreans also have a horror for the number 17, for 17 lies exactly halfway between 16, which is a square, and the number 18, which is the double of a square, these two, 16 and 18, being the only two numbers representing areas for which the perimeter equals the area&#8221;. This is an interesting fact about <em>Pythagoreans</em>. Their horror does not add to the interest of 17 (though 17 may accrue interest from the mathematical relationship that troubled the Pythagoreans).</p>
<p>In 1866, sixteen year old B. Nicolò I. Paganini found the small amicable pair (1184, 1210). . . . Paganini’s amicable pair is interesting in that it partly answers `Which are the amicable pairs?’. But it is more interesting as evidence for the <em>psychological</em> question `How reliable were the great mathematicians?’. Mathematicians are reluctant to credit a pair of numbers with the interest that attaches to <em>contingent</em> facts about it.</p>
<p>Another obstacle to crediting interest to Paganini’s numbers 1184 and 1210 is that they are interesting <em>as a pair</em>. Interest in a pair need not pass down to the individuals comprising the pair (just as a husband and wife can each be bores and yet be interesting as a couple).</p></blockquote>
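<p>Two of the number facts in this passage are easy to verify. A short Python sketch (the <code>aliquot</code> helper is my own, not from the quoted sources):</p>

```python
# DEAD in hexadecimal (base 16) denotes 57005 in decimal.
assert int("DEAD", 16) == 57005

def aliquot(n):
    """Sum of the proper divisors of n (all divisors except n itself)."""
    return sum(d for d in range(1, n) if n % d == 0)

# Paganini's amicable pair: each number is the sum of the
# other's proper divisors.
assert aliquot(1184) == 1210
assert aliquot(1210) == 1184
```

<p>As Sorensen argues, neither check makes the individual numbers interesting: the hex pun belongs to our notation, and the amicability belongs to the pair.</p>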
<p>Also this:</p>
<blockquote><p>The robustness of interesting dullness is manifested by the sheer volume of commentary on boredom. Philosophers marvel at the power of this motivational vacuum. . . . Social scientists agree that just as there can be sober studies of inebriation, there can be interesting studies of tedium, repetition, and apathy. . . .</p>
<p>When astronomers explain away coincidences with an identity hypothesis, loss of wonder is experienced as insight. Demystification is a sign of explanatory progress.</p></blockquote>
<p><strong>The balance of boredom</strong></p>
<p>Sorensen&#8217;s article is full of striking insights, for example:</p>
<blockquote><p>Satiation differs from boredom in that you can exit. The gorged gourmand just leaves the restaurant. But the dishwasher is obliged to stay. World weary, the dishwasher can only escape into daydreams and diversions. Boredom correlates with understanding. So there is some temptation to compress Heidegger into two lines: To understand everything is to be bored by everything. So everything is boring.</p>
<p>But boredom can also be produced by incomprehension or a slight distraction (too small to be recognized as the true cause of one’s inability to focus). The laggard is too far behind to make sense of the lesson. The prodigy is too far ahead to find the lesson stimulating. The interested student lies in between, challenged but not overwhelmed.</p>
<p>Prudent students monitor their boredom to check whether they have deviated from this balance. The vain misconstrue the boredom of incompetence as the boredom of mastery.</p></blockquote>
<p><strong>Relation to the philosophy or sociology of science</strong></p>
<p>The discussion in the second half of Sorensen&#8217;s article reminds me of some things we&#8217;ve talked about regarding bad science:</p>
<p>1. My false theorem. I proved a false theorem once! <a href="http://stat.columbia.edu/~gelman/research/published/GelmanSpeed.pdf">Here&#8217;s the original article</a> (Andrew Gelman and T. P. Speed (1993), Characterizing a joint probability distribution by conditionals, Journal of the Royal Statistical Society B 55, 185-188), and <a href="http://stat.columbia.edu/~gelman/research/published/GelmanSpeedCorrection.pdf">here&#8217;s the correction notice</a>, from 1999. Embarrassingly for us, the falseness of our claim was shown by a one-line counterexample. Here&#8217;s the relevance to the present discussion: I always referred to this as our &#8220;false theorem,&#8221; but then someone pointed out that it&#8217;s not a theorem if it&#8217;s false! So what word to use? &#8220;Conjecture&#8221; doesn&#8217;t seem quite right, because we were offering it as a theorem, not a hypothesis. &#8220;Claim&#8221; is better, but that doesn&#8217;t convey that we were not just saying that a particular statement was true; we were claiming we&#8217;d proved it. I think this difficulty in explaining is &#8220;real,&#8221; not just a matter of the lack of a good word in English for &#8220;claimed proof.&#8221;</p>
<p>2. Evidence vs. truth. We&#8217;ve talked about this many times, for example <a href="http://stat.columbia.edu/~gelman/research/published/AssessingEvidence.pdf">here</a>, <a href="https://statmodeling.stat.columbia.edu/2018/04/06/important-distinction-truth-evidence/">here</a>, and <a href="https://statmodeling.stat.columbia.edu/2024/12/22/stanford-medical-school-professor-misrepresents-what-i-wrote-but-i-kind-of-understand-where-hes-coming-from/">here</a>. I think this is a big issue with problematic science, that researchers will make a claim that might well be true, but they don&#8217;t offer good evidence. When a scientific paper P is published claiming to demonstrate statement X, what the paper is really claiming is not &#8220;X is true&#8221; but rather &#8220;P contains strong evidence in favor of the truth of X.&#8221; When an outsider (such as me!) criticizes the paper&#8217;s &#8220;methods,&#8221; we&#8217;re typically arguing against that second claim, i.e. we&#8217;re saying that P <em>does not</em> contain strong evidence in favor of the truth of X. The original authors of the paper will typically respond with some version of, &#8220;We believe that X is true,&#8221; which might be fine, but I think it impedes the discussion for them to not first accept that P does not contain strong evidence in favor of the truth of X&#8211;or at least to address the outsider&#8217;s criticism on that level.</p>
<p>3. <a href="http://stat.columbia.edu/~gelman/research/published/ethics25.pdf">Big if true</a>. This comes up a lot, for example studies of extra-sensory perception, or the claim that women are three times more likely to wear red or pink clothing during certain times of the month, or claims that subliminal messages can cause huge opinion swings, or claims of a stolen election in 2020, or various other topics we&#8217;ve covered in this space over the years: these claims are implausible on their face, and a careful look at the published evidence offered in their support does not change this assessment. But, if they were true, they&#8217;d be interesting! Thus, as Sorensen discusses, these are statements whose interestingness depends on their truth value.</p>
<p><strong>P.S.</strong> I just wrote this post this morning. The next slot on the schedule is in May, but I bumped today&#8217;s scheduled post and stuck in this one instead, because the topic is so interesting (to me). Really.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/30/a-very-interesting-discussion-by-roy-sorenson-of-the-interesting-number-paradox/feed/</wfw:commentRss>
			<slash:comments>29</slash:comments>
		
		
			</item>
		<item>
		<title>Sorry, NYT, but, yes, &#8220;Equidistant Letter Sequences in the Book of Genesis&#8221; was junk science</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/29/sorry-nyt-but-yes-equidistant-letter-sequences-in-the-book-of-genesis-was-junk-science/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/29/sorry-nyt-but-yes-equidistant-letter-sequences-in-the-book-of-genesis-was-junk-science/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sun, 29 Dec 2024 14:13:30 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50986</guid>

					<description><![CDATA[From the New York Times: It sounds like a headline ripped from a supermarket tabloid: In 1994, three Israeli researchers claimed to have found a secret code embedded in Genesis, the first book of the Old Testament. But this wasn’t &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/29/sorry-nyt-but-yes-equidistant-letter-sequences-in-the-book-of-genesis-was-junk-science/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><a href="https://www.nytimes.com/2024/08/30/science/eliyahu-rips-dead.html">From the New York Times</a>:</p>
<blockquote><p>It sounds like a headline ripped from a supermarket tabloid: In 1994, three Israeli researchers claimed to have found a secret code embedded in Genesis, the first book of the Old Testament.</p>
<p>But this wasn’t junk science.</p></blockquote>
<p>Ahhhh . . . but it was!</p>
<p>The Times continues:</p>
<blockquote><p>The paper in which they revealed their findings appeared in an esteemed, peer-reviewed journal. And the academic reputations of the three authors — Eliyahu Rips, Yoav Rosenberg and Doron Witztum — were unimpeachable, especially that of Dr. Rips.</p></blockquote>
<p>The reporter&#8217;s mistake was to think that, just cos something&#8217;s authored by professors at a legitimate university and published in a reputable journal, it can&#8217;t be junk science.</p>
<p>An understandable mistake to make in 1994, but hard to support nowadays.  <a href="https://statmodeling.stat.columbia.edu/2023/08/31/the-variation-ignoring-junk-science-thats-promoted-by-association-for-psychological-science-and-related-academic-celebrities-its-like-a-poker-player-thinking-okay-if-push-all/">Lucky golf ball</a>, anyone?</p>
<p><strong>P.S.</strong>  I remember talking with Dave Krantz a couple years after that Bible code paper came out.  Some of his colleagues were really bothered by it and went to the trouble of <a href="https://www.math.toronto.edu/~drorbn/Codes/Chance.pdf">figuring out and explaining</a> what was going on.  Dave was kind of irritated that they were wasting any time on it at all, and I just thought the whole thing was a big joke.</p>
<p>In retrospect I think Dave and I should&#8217;ve shown more respect to the debunkers.  The Bible code paper was an early example of junk science leveraging the authority of the academic community to get publicity.</p>
<p>The junk-science-to-journal-to-NPR/Ted pipeline must have seemed like a great deal for everyone at the time:  more publicity for researchers, more credibility for unconventional scientific claims, more fun feature stories for the news media.  And for a while it seemed to be going just fine, turning Malcolm Gladwell into a New Yorker celebrity, powering the Freakonomics franchise, and providing raw material for Jeffrey Epstein&#8217;s Edge Foundation.  The Pizzagate and Shreddergate researchers became wildly successful, and even fringe players such as the beauty-and-sex-ratio guy were able to land book contracts.  Formerly obscure law professors got to mingle with Henry Kissinger!  It was boom time in This Week in Psychological Science.</p>
<p>Eventually the weight of the junk science <a href="https://statmodeling.stat.columbia.edu/2016/09/21/what-has-happened-down-here-is-the-winds-have-changed/">overwhelmed the system</a>, and ultimately I think that the whole thing was a bit of a deal with a devil:  In the short term, academic social science got lots of publicity, and a few well-placed and credulous or unscrupulous professors became media stars.  In the medium term, academic journals justly lost much of their authority.  In the longer term, who knows.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/29/sorry-nyt-but-yes-equidistant-letter-sequences-in-the-book-of-genesis-was-junk-science/feed/</wfw:commentRss>
			<slash:comments>20</slash:comments>
		
		
			</item>
		<item>
		<title>Bayesian inference (and mathematical reasoning more generally) isn&#8217;t just about getting the answer; it&#8217;s also about clarifying the mapping from assumptions to inference to decision.</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/28/bayesian-inference-isnt-just-about-getting-the-answer-its-also-about-clarifying-the-mapping-from-assumptions-to-inference-to-decision/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/28/bayesian-inference-isnt-just-about-getting-the-answer-its-also-about-clarifying-the-mapping-from-assumptions-to-inference-to-decision/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sat, 28 Dec 2024 14:21:39 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<category><![CDATA[Decision Analysis]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50447</guid>

					<description><![CDATA[Palko writes: I&#8217;m just an occasional Bayesian (and never an exo-biologist) so maybe I&#8217;m missing some subtleties here, but I believe educated estimates for the first probability vary widely with some close to 0 and some close to 1 with &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/28/bayesian-inference-isnt-just-about-getting-the-answer-its-also-about-clarifying-the-mapping-from-assumptions-to-inference-to-decision/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Palko writes:</p>
<blockquote><p>I&#8217;m just an occasional Bayesian (and never an exo-biologist) so maybe I&#8217;m missing some subtleties here, but I believe educated estimates for the first probability vary widely with some close to 0 and some close to 1 with no good sense of the distribution. Is there any point in applying the theorem at that point?  From <a href="https://www.wired.com/story/search-for-alien-life-biosignatures-gas-atmosphere/?utm_source=pocket-newtab-en-us">this Wired article</a>:</p>
<blockquote><p>If or when scientists detect a putative biosignature gas on a distant planet, they can use a formula called Bayes’ theorem to calculate the chance of life existing there based on three probabilities. Two have to do with biology. The first is the probability of life emerging on that planet given everything else that’s known about it. The second is the probability that, if there is life, it would create the biosignature we observe. Both factors carry significant uncertainties, according to the astrobiologists Cole Mathis of Arizona State University and Harrison Smith of the Earth-Life Science Institute of the Tokyo Institute of Technology, who explored this kind of reasoning in a paper last fall.</p></blockquote>
</blockquote>
<p>My reply:  I guess it&#8217;s fine to do the calculation, if only to make it clear how dependent it is on <a href="https://statmodeling.stat.columbia.edu/2019/12/16/sumps-and-rigor/">assumps</a>.  Bayesian inference isn&#8217;t just about getting the answer; it&#8217;s also about clarifying the mapping from assumptions to inference to decision.</p>
<p>Come to think about it, that last paragraph remains true if you replace &#8220;Bayesian inference&#8221; with &#8220;Mathematics.&#8221;</p>
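<p>To make the point about assumptions concrete, here&#8217;s a minimal Python sketch of the biosignature calculation.  The prior and the two likelihoods below are made-up illustrative numbers, not estimates from the paper:</p>

```python
# Bayes' theorem for P(life | biosignature detected).
# All numbers below are made-up illustrations, not published estimates.
def posterior_life(prior_life, p_sig_if_life, p_sig_if_no_life):
    numerator = prior_life * p_sig_if_life
    denominator = numerator + (1 - prior_life) * p_sig_if_no_life
    return numerator / denominator

# Hold the likelihoods fixed and sweep the disputed prior from
# near 0 to near 1: the posterior swings almost as widely.
for prior in [0.001, 0.1, 0.5, 0.9, 0.999]:
    print(prior, round(posterior_life(prior, 0.8, 0.1), 3))
```

<p>The posterior moves from under 1% to over 99% as the disputed prior moves, which is exactly the mapping from assumptions to inference that the calculation makes visible.</p>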
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/28/bayesian-inference-isnt-just-about-getting-the-answer-its-also-about-clarifying-the-mapping-from-assumptions-to-inference-to-decision/feed/</wfw:commentRss>
			<slash:comments>16</slash:comments>
		
		
			</item>
		<item>
		<title>Announcing two new members of our blogging team . . .</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/27/announcing-two-new-members-of-our-blogging-team/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/27/announcing-two-new-members-of-our-blogging-team/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Fri, 27 Dec 2024 14:01:25 +0000</pubDate>
				<category><![CDATA[Literature]]></category>
		<category><![CDATA[Sports]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50980</guid>

					<description><![CDATA[. . . Nick Hornby and David Roche! Just kidding. Really, though, these guys are great and should be blogging for us. Nick Hornby you know about. He&#8217;s the author of High Fidelity, the book that Phil once said that &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/27/announcing-two-new-members-of-our-blogging-team/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>. . . Nick Hornby and David Roche!</p>
<p>Just kidding.</p>
<p>Really, though, these guys are great and should be blogging for us.</p>
<p>Nick Hornby you know about.  He&#8217;s the author of High Fidelity, the book that Phil once said no woman should be allowed to read because it&#8217;s a cheat code for understanding guys.  I&#8217;ve been reading an old collection of his book columns for the Believer, and it&#8217;s so bloggy . . . it would fit in so well right here.  OK, he&#8217;s not gonna be blogging here; his monthly blog (I guess that&#8217;s too low a frequency of posts to be a &#8220;blog,&#8221; so let&#8217;s call it a &#8220;column&#8221;) is <a href="https://www.thebeliever.net/type/stuff-ive-been-reading/">here</a> . . . I guess I&#8217;ll put it in the <a href="https://statmodeling.stat.columbia.edu/blogs-i-read/">Blogs We Read</a> page.</p>
<p>Hornby&#8217;s what <a href="https://statmodeling.stat.columbia.edu/2010/07/28/james_clueless/">I&#8217;ve called</a> a writer in the David Owen mode (that is, David Owen the American journalist, not David Owen the British politician): serious, earnest, somewhat intelligent but a bit of a blockhead. Which I mean in a good way. Not clever-clever or even clever, but he wants to get things right.  There&#8217;s kind of a paradox:  Hornby&#8217;s shtick is that he&#8217;s an unpretentious regular guy, writing about unpretentious regular guys&#8212;also, though, he&#8217;s a brilliant and successful author.  It&#8217;s a little different than John Updike, who wrote about his uncultured hero Rabbit, while maintaining the urbane &#8220;Updike&#8221; character for his public image.  Also different from Kingsley Amis, who basically tore himself apart in his attempt to be the regular guy befitting his literary and political ideology.  In his blogging columns, Hornby is pretty upfront about the tension between his literary-intellectual and regular-guy personas.</p>
<p>Our other new blogger (not really) is David Roche, who I&#8217;d never heard of until I read <a href="https://defector.com/how-david-roche-nearly-ran-his-dick-off-and-maybe-changed-ultrarunning-in-the-process">this interview of him</a> by Dennis Young.  Roche is a funny guy, also very analytical, would be a good blogger for us.  Like Hornby, he also has <a href="https://www.trailrunnermag.com/byline/david-roche/">a column</a>, but it&#8217;s in Trail Runner magazine, and it&#8217;s full of articles like &#8220;New Study Shows Strategically Reduced Carbohydrate Intake Does Not Improve Performance,&#8221; and I couldn&#8217;t care less about trail running.  That interview, though, it&#8217;s great.</p>
<p>Here&#8217;s a sample:</p>
<blockquote><p>A few studies have come out with athletes pushing 120 grams of carbs per hour, showing improved fatigue resistance late in events. But my mind was opened to how I could solve the Leadville equation during Stage 18 of the Tour de France. Victor Campenaerts won the stage unexpectedly, and after the stage, his sponsor Precision Fueling &#038; Hydration released his data (rare in cycling, where everyone holds secrets like they’re Gollum). He did 132 g/hr, pushing closer to 150 g/hr at times. I had never heard of anyone trying that and succeeding at such a high level. So I went for it, and I think I showed that it&#8217;s possible in running too.</p>
<p>Cycling is conducting the biggest uncontrolled performance experiment in the world. In ultras, the margins of human performance are not here yet. I was 1.6 percent ahead of Matt Carpenter, and I bet someone is going to be a few percent ahead of me. In cycling, everyone has (very roughly) the same power at baseline. When the margins are that narrow, the best training/fueling wins, and that&#8217;s one reason why doping has such a troubled history in that sport&#8212;it’s a confounding variable that fucks up a true understanding of human biking performance over longer time horizons.</p></blockquote>
<p>All this self-experimentation . . . he&#8217;s kind of like a sane <a href="https://statmodeling.stat.columbia.edu/2014/04/30/seth-roberts/">Seth Roberts</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/27/announcing-two-new-members-of-our-blogging-team/feed/</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
			</item>
		<item>
		<title>Softmax is on the log, not the logit scale</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/26/those-are-unnormalized-log-probabilities-not-logits-in-your-neural-networks-final-layer/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/26/those-are-unnormalized-log-probabilities-not-logits-in-your-neural-networks-final-layer/#comments</comments>
		
		<dc:creator><![CDATA[Bob Carpenter]]></dc:creator>
		<pubDate>Thu, 26 Dec 2024 20:00:52 +0000</pubDate>
				<category><![CDATA[Miscellaneous Statistics]]></category>
		<category><![CDATA[Stan]]></category>
		<category><![CDATA[Statistical Computing]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50149</guid>

					<description><![CDATA[Bad Stan naming I realized recently that we followed the confusing terminological convention of ML in our description of Stan&#8217;s categorical_logit function. In Stan, if there&#8217;s a suffix to a distribution, it describes the scale of one or more of &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/26/those-are-unnormalized-log-probabilities-not-logits-in-your-neural-networks-final-layer/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><b>Bad Stan naming</b></p>
<p>I realized recently that we followed the confusing terminological convention of ML in our description of Stan&#8217;s <code>categorical_logit</code> function.  In Stan, if there&#8217;s a suffix to a distribution, it describes the scale of one or more of the parameters.  For example, </p>
<p><code>poisson_log(y | u) == poisson(y | exp(u))</code>.  </p>
<p>So when we write <code>categorical(y | p)</code> we take <code>p</code> to be a simplex (sequence of finite, non-negative values that sum to 1).  So it would make sense that <code>categorical_logit(y | logit(p))</code> would be equivalent, where <code>logit(p) = log(p / (1 - p))</code>. But that&#8217;s not how it works in Stan.  Instead,</p>
<p><code>categorical_logit(y | u) == categorical(y | softmax(u))</code>. </p>
<p>We made the same mistake everyone in ML makes in their variable naming! We call the <code>u</code> here &#8220;logits&#8221;, when in fact they&#8217;re (unnormalized [see below]) log probabilities.  This is probably due to the fact that if <code>u</code> is specified by a regression, then the resulting system is called &#8220;multinomial logistic regression.&#8221;  </p>
<p><b>Example</b></p>
<p>The softmax function is defined by <code>softmax(u) = exp(u) / sum(exp(u))</code>.  When used like this, the arguments to softmax are log probabilities, not logit probabilities.  Here&#8217;s a little snippet of Python to illustrate (the style sheet is adding the extra space, not me, and I don&#8217;t want to fix it manually in this post with a hack because it&#8217;ll mess up the page if the style sheet is ever fixed).  </p>
<p><code><br />
>>> import numpy as np</p>
<p>>>> import scipy as sp</p>
<p>>>> p = np.asarray([0.2, 0.5, 0.3])</p>
<p>>>> def logit(p): return np.log(p / (1 - p))<br />
... </p>
<p>>>> logit_p = logit(p)</p>
<p>>>> log_p = np.log(p)</p>
<p>>>> sp.special.softmax(logit_p)<br />
array([0.14893617, 0.59574468, 0.25531915])</p>
<p>>>> sp.special.softmax(log_p)<br />
array([0.2, 0.5, 0.3])<br />
</code></p>
<p>This shows that to round-trip probabilities through softmax, the appropriate operation is the natural logarithm, not the logit function.</p>
<p><b>Origin of the confusion</b></p>
<p>So where did this confusion come from?  Let&#8217;s look at a standard binary logistic regression.  There we take</p>
<p><code><br />
p(y | alpha, beta, x) = bernoulli(y | inv_logit(alpha + beta * x))<br />
</code></p>
<p>where </p>
<p><code>inv_logit(v) = exp(v) / (1 + exp(v))</code>.  </p>
<p>Writing inverse logit this way suggests how to write a logistic regression with a categorical distribution and softmax.</p>
<p><code><br />
p(y | alpha, beta, x) = categorical(y | softmax([0, alpha + beta * x]))<br />
</code></p>
<p>That&#8217;s because </p>
<p><code>softmax([0, alpha + beta * x])<br />
    = [exp(0), exp(alpha + beta * x)] / (exp(0) + exp(alpha + beta * x))<br />
    = [1, exp(alpha + beta * x)] / (1 + exp(alpha + beta * x))<br />
    = [1 / (1 + exp(alpha + beta * x)),  exp(alpha + beta * x) / (1 + exp(alpha + beta * x))]<br />
    = [1 - inv_logit(alpha + beta * x),  inv_logit(alpha + beta * x)].<br />
</code></p>
<p>This derivation shows that the probability of the categorical in this formulation returning 1 is <code>inv_logit(alpha + beta * x)</code>.  But this connection falls apart in the multinomial case when there are more than two outcomes.</p>
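<p>The binary identity is easy to check numerically.  Here&#8217;s a quick sketch (my addition, not part of the derivation) using SciPy&#8217;s <code>expit</code>, which is the inverse logit; the random draws stand in for <code>alpha + beta * x</code>:</p>

```python
import numpy as np
from scipy.special import softmax, expit  # expit(v) = 1 / (1 + exp(-v))

# Check that pinning the first entry to 0 makes the two-category
# softmax reproduce inverse logit, as in the derivation above.
rng = np.random.default_rng(0)
for v in rng.normal(size=5):  # v plays the role of alpha + beta * x
    p = softmax([0.0, v])
    assert np.isclose(p[1], expit(v))
    assert np.isclose(p[0], 1.0 - expit(v))
print("two-category softmax with a pinned 0 matches inverse logit")
```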
<p>In traditional frequentist <code>K</code>-outcome multinomial logistic regressions, the first input to softmax is pinned to 0 for identifiability just as in the binary case.  </p>
<p><code>softmax([0, u[2], ..., u[K]])<br />
    = [exp(0), exp(u[2]), ..., exp(u[K])] / (exp(0) + exp(u[2]) + ... + exp(u[K]))<br />
</code></p>
<p>This leads to asymmetry in the regression as we don&#8217;t have a regression for the first element.  What it does do is make softmax and log proper inverses.   If you reduce the choice to just the first category and some other category, then you get a standard binary logistic regression again.  But you still can&#8217;t round trip the multinomial case with logit, because</p>
<p><code><br />
exp(u[2]) / (exp(0) + exp(u[2]) + ... + exp(u[K]))  !=  inv_logit(u[2])<br />
</code></p>
<p>To see that this is still not going to produce logits in the multinomial case, here&#8217;s some more Python.</p>
<p><code><br />
>>> log_p<br />
array([-1.60943791, -0.69314718, -1.2039728 ])</p>
<p>>>> log_p_zero = log_p - log_p[0]</p>
<p>>>> log_p_zero<br />
array([0.        , 0.91629073, 0.40546511])</p>
<p>>>> sp.special.softmax(log_p_zero)<br />
array([0.2, 0.5, 0.3])<br />
</code></p>
<p>So as you can see, softmax isn&#8217;t identified without pinning one of the values&#8212;we can add or subtract a constant from each element of the input and get the same value.  But this still doesn&#8217;t turn the inputs to softmax into logits.</p>
<p><code><br />
>>> def inv_logit(v): return 1 / (1 + np.exp(-v))<br />
&#8230; </p>
<p>>>> inv_logit(log_p_zero)<br />
array([0.5       , 0.71428571, 0.6       ])<br />
</code></p>
<p>So you can see that the input <code>0.91629073</code> is not the logit of the probability even when pinning a value to zero to identify.</p>
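<p>The shift-invariance mentioned above is easy to verify directly.  Here&#8217;s a minimal check (my addition; the shift constants are arbitrary):</p>

```python
import numpy as np
from scipy.special import softmax

# softmax is invariant to adding a constant to every input:
# exp(u + c) / sum(exp(u + c)) = exp(u) / sum(exp(u)).
u = np.log([0.2, 0.5, 0.3])  # the log probabilities from the example above
for c in [0.0, 1.0, -3.5, 100.0]:
    assert np.allclose(softmax(u + c), [0.2, 0.5, 0.3])
print("softmax(u + c) equals softmax(u) for any constant c")
```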
<p>P.S.  I really miss being able to write math on the blog and really hate that all my old posts with math no longer render. Maybe if Andrew reminds us why it went away, someone will have a suggestion on how to fix.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/26/those-are-unnormalized-log-probabilities-not-logits-in-your-neural-networks-final-layer/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
		<item>
		<title>Data manipulation in the world of long-distance swimming!</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/26/data-manipulation-in-the-world-of-long-distance-swimming/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/26/data-manipulation-in-the-world-of-long-distance-swimming/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Thu, 26 Dec 2024 14:04:27 +0000</pubDate>
				<category><![CDATA[Statistical Computing]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50971</guid>

					<description><![CDATA[The story is here. It&#8217;s kind of like the Excel error in that famous economics paper, except that (a) it seems to have been on purpose, not an accident (according to the alleged perp, &#8220;The hacker has hacked into my &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/26/data-manipulation-in-the-world-of-long-distance-swimming/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><a href="https://defector.com/lawsuit-open-water-swimming-expert-rewrote-swimming-history-to-glorify-diana-nyad">The story is here</a>.  It&#8217;s kind of like <a href="https://statmodeling.stat.columbia.edu/2013/04/16/memo-to-reinhart-and-rogoff-i-think-its-best-to-admit-your-errors-and-go-on-from-there/">the Excel error</a> in that famous economics paper, except that (a) it seems to have been on purpose, not an accident (according to the alleged perp, &#8220;The hacker has hacked into my administrative account and is using my accounts to make changes&#8221;), and (b) it did not do any damage to the world economy, it just enabled a fictionalized story to be presented as fact in a Hollywood movie.</p>
<p>So, no biggie except for people involved in long-distance swimming&#8212;for them it&#8217;s a big deal.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/26/data-manipulation-in-the-world-of-long-distance-swimming/feed/</wfw:commentRss>
			<slash:comments>8</slash:comments>
		
		
			</item>
		<item>
		<title>How to cheat at Codenames; cheating at board games more generally</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/25/how-to-cheat-at-codenames-cheating-at-board-games-more-generally/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/25/how-to-cheat-at-codenames-cheating-at-board-games-more-generally/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Wed, 25 Dec 2024 14:36:02 +0000</pubDate>
				<category><![CDATA[Decision Analysis]]></category>
		<category><![CDATA[Sports]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50954</guid>

					<description><![CDATA[This is a good post for Christmas Day, with all of you at home with your families playing board games. Dan Luu has an amusing post explaining how you can win Codenames by just memorizing the configurations of the 40 &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/25/how-to-cheat-at-codenames-cheating-at-board-games-more-generally/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><em>This is a good post for Christmas Day, with all of you at home with your families playing board games.</em></p>
<p>Dan Luu has <a href="https://danluu.com/codenames/">an amusing post</a> explaining how you can win Codenames by just memorizing the configurations of the 40 setup cards.  The basic strategy is to play your best until you can figure out the unique configuration, then you win.  The fun part is that if you&#8217;re playing against a team that hasn&#8217;t learned this memorization trick, then you can win even if you don&#8217;t guess any words yourself&#8212;you just take advantage of the config information that you get from their correct guesses (along with any wrong guesses that come up)!  If both teams have memorized the 40 cards, then you get to a new level of strategy.</p>
<p>As Luu says, no one would want to play Codenames in this way.  The whole point of the game is to guess the words; if you&#8217;re gonna do it by memorizing patterns, why play the game in the first place?  On the other hand, he also points out that once this information is there, you can&#8217;t un-see it.  So it&#8217;s a balance.</p>
<p>This comes up in the rules of Codenames itself:  you&#8217;re not allowed to give clues that suggest the position of the word on the grid, nor are you allowed to make faces or otherwise give clues as people are guessing.  It can be hard to avoid giving this information sometimes!</p>
<p>More generally, most games can be &#8220;cracked&#8221; through a backdoor approach in some way or another.  Here&#8217;s how Luu puts it:</p>
<blockquote><p>Personally, when I run into a side-channel attack in a game or a game that&#8217;s just totally busted if played to win . . . I think it makes sense to try to avoid &#8220;attacking&#8221; the game to the extent possible. I think this is sort of impossible to do perfectly in Codenames because people will form subconscious associations (I&#8217;ve noticed people guessing an extra word on the first turn just to mess around, which works more often than not — assuming they&#8217;re not cheating, and I believe they&#8217;re not cheating, the success rate strongly suggests the use of some kind of side-channel information. That doesn&#8217;t necessarily have to be positional information from the cards, it could be something as simple as subconsciously noticing what the spymasters are intently looking at.)</p>
<p><a href="https://mastodon.social/@danluu/110544419353766175">Dave Sirlin says that anyone who doesn&#8217;t take advantage of any legal possibility to win is a sucker (he derogatorily calls such people &#8220;scrubs&#8221;)</a> (he says that you should use cheats to win, like using maphacks in FPS games, as long as tournament organizers don&#8217;t ban the practice, and that tournaments should explicitly list what&#8217;s banned, avoiding generic &#8220;don&#8217;t do bad stuff&#8221; rules). I think people should play games however they find it fun and should find a group that likes playing games in the same way. If Dave finds it fun to memorize arbitrary info to win all of these games, he should do that. The reason I, as Dave Sirlin would put it, play like a scrub, for the kinds of games discussed here is because the games are generally badly broken if played seriously and I don&#8217;t personally find the ways in which they&#8217;re broken to be fun.</p></blockquote>
<p>It gets tricky sometimes, though.  Consider those goofy words that are in the Scrabble dictionary but aren&#8217;t really words, for example ef (&#8220;the letter F&#8221;) or po (&#8220;a chamber pot&#8221;).  These are not English words!  On the other hand, when you&#8217;re actually playing and you see an opportunity for ef or po or whatever, it&#8217;s hard to deny yourself the opportunity.  In that case, there&#8217;s an easy solution:  the rules allow the players to agree on any dictionary ahead of time, so no need to use the Scrabble dictionary.  On the other hand, this will annoy serious players.</p>
<p>There&#8217;s more of a gray area with collusion, which can &#8220;break&#8221; almost any multiplayer game.  In poker, collusion is a form of cheating.  I don&#8217;t know how casinos or informal games monitor or enforce the rule against collusion, but you&#8217;re not supposed to do it.  You&#8217;re allowed to lie in poker but not to cheat.</p>
<p>But what about a game such as Monopoly or Risk where bargaining is part of the game?  Here&#8217;s a simple strategy in a 3-player game of Monopoly that will up your odds of winning from 1/3 to nearly 1/2:  Before the game begins, pick one of the other players and agree to flip a coin, after which the winner of the flip will devote all their effort to helping the other player win.  That&#8217;s easy enough to do:  just buy whatever property comes up and sell to the other player for $1.  It won&#8217;t guarantee a win, but it&#8217;s gotta take the helped player&#8217;s win probability to very close to 100%.  Similarly with Risk.  Now, nobody&#8217;s gonna play this way because it&#8217;s no fun (except maybe once as a joke).  To put it another way, &#8220;winning a game of Monopoly or Risk&#8221; does not have much positive value in itself; the fun is in winning the game legitimately.  Again, though, there is a gray zone, and other players will rightly get annoyed if they see player A deliberately trying to help player B without there being a good reason in the context of a game.  In Risk, &#8220;I won&#8217;t attack you here if you don&#8217;t attack me there&#8221; is a legitimate strategy, but &#8220;I don&#8217;t attack you because I want to help you win&#8221; is not so cool.</p>
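<p>The arithmetic behind that 1/3-to-nearly-1/2 claim is simple enough to sketch.  The coalition&#8217;s win probability below is a made-up stand-in for &#8220;very close to 100%&#8221;:</p>

```python
# Back-of-envelope check of the collusion claim. Assume the player the
# coalition decides to help goes on to win with probability p_coalition
# (0.95 here is a made-up stand-in for "very close to 100%").
p_coalition = 0.95

# Each colluder wins the pre-game coin flip with probability 1/2;
# the flip decides which of the two the coalition will help.
p_each_colluder = 0.5 * p_coalition

print(f"honest 3-player game: {1/3:.3f}")
print(f"each colluder:        {p_each_colluder:.3f}")  # just under 1/2
```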
<p>A few years ago I was playing a lot of online chess, and one thing I noticed is that some players would set up opening traps:  clearly unsound sequences of moves that would get them a win if their opponents played naively and hadn&#8217;t seen the trick before.  My thought was:  Why do that?  Winning against a stranger using a trap, what&#8217;s the point of that?  Upon reflection, though, I decided to not be so bothered by this.  If you try to spring a trap, then the fun part is when the trap fails and then you have to get out of a bad position of your own devising.  So, all good.</p>
<p>Years ago I read the book Thursday Night Poker by Peter Steiner.  One thing Steiner discusses is that in a casual game you can often do just fine by playing really tight, a strategy that won&#8217;t work against good players but can make you steady money if some of the people at the table are just playing for fun.  As Steiner says, though, most of us are not playing in a friendly poker game with the goal of maximizing our dollars.  We&#8217;re playing poker for fun, and &#8220;action&#8221;&#8212;getting involved in hands, making betting decisions, going up against the other players&#8212;is where the fun is at.  No poker player would be a &#8220;scrub&#8221;&#8212;you&#8217;ll always take advantage of any legal way to win, it&#8217;s not like you&#8217;d ignore relevant information that someone reveals&#8212;but, even in poker, winning is not the only goal.</p>
<p>All of this is kind of obvious, but as Luu discusses, sometimes it needs to be pointed out, to push against naive models of the world.  Also, the bit about the Codenames cards is cool&#8212;I&#8217;d never thought about that!</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/25/how-to-cheat-at-codenames-cheating-at-board-games-more-generally/feed/</wfw:commentRss>
			<slash:comments>11</slash:comments>
		
		
			</item>
		<item>
		<title>Addressing legitimate counterarguments in a scientific review:  The challenge of being an insider</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/24/addressing-legitimate-counterarguments-in-a-scientific-review-the-challenge-of-being-an-insider/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/24/addressing-legitimate-counterarguments-in-a-scientific-review-the-challenge-of-being-an-insider/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Tue, 24 Dec 2024 14:40:07 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Multilevel Modeling]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50963</guid>

					<description><![CDATA[Review articles can be written by outsiders or insiders. From the outside, it&#8217;s easier. You assess the evidence and draw your conclusions. For example, this is what I did when summarizing the research on ballot-order effects and addressing the question &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/24/addressing-legitimate-counterarguments-in-a-scientific-review-the-challenge-of-being-an-insider/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Review articles can be written by outsiders or insiders.</p>
<p>From the outside, it&#8217;s easier.  You assess the evidence and draw your conclusions.  For example, this is <a href="https://statmodeling.stat.columbia.edu/2017/02/27/name-first-doubtful/">what I did</a> when summarizing the research on ballot-order effects and addressing the question of whether Donald Trump won the 2016 election because his name came first on the ballot in key states; see pages 240-242 of <a href="https://stat.columbia.edu/~gelman/active-statistics/">Active Statistics</a> for the full story.  Or when my colleagues and I <a href="http://stat.columbia.edu/~gelman/research/published/PNAS_letter_Nudge-meta-analysis.pdf">wrote about</a> &#8220;nudge&#8221; interventions.  We could be right, we could be wrong, but in any case the job is clear enough.</p>
<p>Writing a review is more difficult from the inside.  For example, consider the meta-analysis of nudge that was written by some nudge researchers.  <a href="https://statmodeling.stat.columbia.edu/2022/01/10/the-real-problem-of-that-nudge-meta-analysis-is-not-that-it-include-12-papers-by-noted-fraudsters-its-the-gigo-of-it-all/">It had big problems</a>:  garbage in, garbage out.  It&#8217;s hard to step back and examine the evidence, if you&#8217;re part of the story.  For another example, <a href="https://statmodeling.stat.columbia.edu/2024/08/20/when-is-it-appropriate-to-give-a-one-sided-perspective-implicit-association-test-example/">we recently discussed</a> a review of the controversial Implicit Association Test that was flawed in only presenting part of the story, not even acknowledging that there was a controversy.</p>
<p>What to do if you&#8217;re an insider and you want to write a review that includes some of your own work?</p>
<p>I don&#8217;t think you should just give up.  As an insider, you have a special perspective, and it makes sense that you&#8217;ll want to review the evidence.  At the same time, you have to avoid the natural inclination to try to present too much of a coherent story.  In real life, the evidence doesn&#8217;t always all go in the same direction!</p>
<p>My recommendation, for the insider writing a review article, is not to try to debate or shoot down such counterarguments but rather to acknowledge the disagreement and fit it into your larger story.</p>
<p>Here&#8217;s an example where we did exactly that.  Our article is called <a href="http://stat.columbia.edu/~gelman/research/published/reconciling_millennium.pdf">Reconciling Evaluations of the Millennium Villages Project</a>, and it begins:</p>
<blockquote><p>The Millennium Villages Project was an integrated rural development program carried out for a decade in 10 clusters of villages in sub-Saharan Africa starting in 2005, and in a few other sites for shorter durations. An evaluation of the 10 main sites compared to retrospectively chosen control sites estimated positive effects on a range of economic, social, and health outcomes (Mitchell et al. 2018). More recently, an outside group performed a prospective controlled (but also nonrandomized) evaluation of one of the shorter-duration sites and reported smaller or null results (Masset et al. 2020). Although these two conclusions seem contradictory, the differences can be explained by the fact that Mitchell et al. studied 10 sites where the project was implemented for 10 years, and Masset et al. studied one site with a program lasting less than 5 years, as well as differences in inference and framing. Insights from both evaluations should be valuable in considering future development efforts of this sort. Both studies are consistent with a larger picture of positive average impacts (compared to untreated villages) across a broad range of outcomes, but with effects varying across sites or requiring an adequate duration for impacts to be manifested.</p></blockquote>
<p>&#8220;Mitchell et al. 2018&#8221; was us!  I worked with the Millennium Villages Project to help conduct a retrospective evaluation, which yielded positive estimated effects.  My take was that the project worked well.  That&#8217;s what I believe, but I&#8217;m an insider.  Masset et al. 2020 was an outside team that did a different analysis on different data and reported null results.  The purpose of our review article was to understand how these two studies of the same program came to such different conclusions.  In writing this paper we had to walk a fine line:  we&#8217;re trying our best to assess the past work objectively, but one of the papers was ours.  The key here is that we presented the disagreement&#8212;we did not pretend the dissenting article did not exist, nor did we dismiss it; rather, we incorporated it into a larger understanding.  I&#8217;m not saying that this new paper of ours was perfect, nor that it was influential&#8212;indeed, according to Google Scholar it has exactly zero citations, really a low payoff given all the work we put into it&#8212;but I still think of it as a model for how to present and assess conflicting evidence from the inside.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/24/addressing-legitimate-counterarguments-in-a-scientific-review-the-challenge-of-being-an-insider/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			</item>
		<item>
		<title>The marginalization or Jeffreys-Lindley paradox:  it&#8217;s already been resolved.</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/23/the-marginalization-or-jeffreys-lindley-paradox-its-already-been-resolved/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/23/the-marginalization-or-jeffreys-lindley-paradox-its-already-been-resolved/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Mon, 23 Dec 2024 14:48:12 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51504</guid>

					<description><![CDATA[Nicole Wang writes: I am a PhD student in statistics starting this year. I read the Dawid et al. (1973) marginalization paradoxes paper. I found several approaches claiming their “resolution” to this paradox. I am very curious whether you think &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/23/the-marginalization-or-jeffreys-lindley-paradox-its-already-been-resolved/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Nicole Wang writes:</p>
<blockquote><p>I am a PhD student in statistics starting this year.  I read the Dawid et al. (1973) marginalization paradoxes paper. I found several approaches claiming their “resolution” to this paradox. I am very curious whether you think there is a complete resolution to this paradox so far. If there is one, I sincerely wonder what it is. If not, which approach (or approaches) do you think comes closest to a resolution?</p></blockquote>
<p>I replied by pointing to these two posts:<br />
<a href="https://statmodeling.stat.columbia.edu/2013/09/17/christian-robert-on-the-jeffreys-lindley-paradox-more-generally-its-good-news-when-philosophical-arguments-can-be-transformed-into-technical-modeling-issues/">Christian Robert on the Jeffreys-Lindley paradox; more generally, it’s good news when philosophical arguments can be transformed into technical modeling issues</a><br />
and<br />
<a href="https://statmodeling.stat.columbia.edu/2013/03/11/my-problem-with-the-lindley-paradox/">My problem with the Lindley paradox</a>.<br />
It&#8217;s not a topic I&#8217;ve thought about much lately because to me it&#8217;s already been resolved.</p>
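<p>For readers who haven&#8217;t seen the Jeffreys-Lindley effect worked out, here is a small numerical sketch (my own toy numbers, not taken from either linked post): hold the z-score fixed near the significance threshold and let the sample size grow, and the p-value keeps rejecting the point null while the Bayes factor increasingly favors it.</p>

```python
import math

def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def lindley_demo(z, n, tau=1.0, sigma=1.0):
    """Two-sided p-value and Bayes factor BF01 for H0: theta = 0 versus
    H1: theta ~ N(0, tau^2), given observed mean ybar = z * sigma / sqrt(n)."""
    ybar = z * sigma / math.sqrt(n)
    # frequentist two-sided p-value for the z statistic
    p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    # marginal densities of ybar under the two hypotheses
    m0 = normal_pdf(ybar, 0.0, sigma / math.sqrt(n))
    m1 = normal_pdf(ybar, 0.0, math.sqrt(tau**2 + sigma**2 / n))
    return p_value, m0 / m1  # BF01 > 1 favors the point null

p, bf01 = lindley_demo(z=2.0, n=100_000)
# p is about 0.046 (significant at the 5% level), yet BF01 is large, favoring H0
```

<p>The point of the sketch: with z fixed at 2, BF01 grows like the square root of n, so the &#8220;same&#8221; evidence that rejects the null in a significance test can strongly support it in this Bayesian comparison.</p>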
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/23/the-marginalization-or-jeffreys-lindley-paradox-its-already-been-resolved/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>Stanford medical school professor misrepresents what I wrote (but I kind of understand where he&#8217;s coming from)</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/22/stanford-medical-school-professor-misrepresents-what-i-wrote-but-i-kind-of-understand-where-hes-coming-from/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/22/stanford-medical-school-professor-misrepresents-what-i-wrote-but-i-kind-of-understand-where-hes-coming-from/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sun, 22 Dec 2024 17:59:12 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<category><![CDATA[Public Health]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51502</guid>

					<description><![CDATA[This story is kinda complicated. It&#8217;s simple, but it&#8217;s complicated. The simple part is the basic story, which goes something like this: &#8211; In 2020, a study was done at Stanford&#8211;a survey of covid exposure&#8211;that I publicly criticized: I wrote &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/22/stanford-medical-school-professor-misrepresents-what-i-wrote-but-i-kind-of-understand-where-hes-coming-from/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>This story is kinda complicated.  It&#8217;s simple, but it&#8217;s complicated.</p>
<p><em>The simple part</em> is the basic story, which goes something like this:</p>
<p>&#8211; In 2020, a study was done at Stanford&#8211;a survey of covid exposure&#8211;that I publicly criticized:  I wrote that this study had statistical problems, that its analysis did not adjust for uncertainty in the false positive rate of the test they were using, and that they did something wrong by not making use of the statistical expertise that is readily available at Stanford.</p>
<p>&#8211; In 2023, one of the people involved in that study wrote a long post about that study and its aftermath. In that post, my comments from 2020 on that study were misrepresented&#8211;hence the title of the present post.</p>
<p>&#8211; Just last week someone pointed me to the 2023 post.  I was unhappy to have been misrepresented, so I emailed the author of the post.  He didn&#8217;t respond&#8211;I don&#8217;t take that personally:  he&#8217;s very busy, the holiday season is coming, the post came out over a year ago (I only happened to hear about it the other day), and my complaint concerns only one small paragraph in a long post&#8211;so, just to correct the record, I&#8217;m posting this here.</p>
<p><em>The more complicated</em>, and interesting, part involves the distinction between evidence and truth.  It&#8217;s something we&#8217;ve talked about before&#8211;indeed, I published <a href="http://stat.columbia.edu/~gelman/research/published/AssessingEvidence.pdf">a short article on the topic</a> back in 2020 using this very example!&#8211;and here it has come up again, so here goes:</p>
<p><img decoding="async" src="https://statmodeling.stat.columbia.edu/wp-content/uploads/2024/12/Screenshot-2024-12-22-at-11.49.15-1024x642.png" alt="" width="350" /></p>
<p>And now, the details:</p>
<p>Stanford professor Jay Bhattacharya <a href="https://www.jospi.org/article/88046-dr-jay-bhattacharya-reveals-stanford-university-s-attempts-to-derail-covid-studies">wrote about</a> a covid study from 2020 that he was involved in, which attracted some skepticism at the time:</p>
<blockquote><p>Some serious statisticians also weighed in with negative reviews. Columbia University’s Andrew Gelman posted a hyperbolic blog that we should apologize for releasing the study. He incorrectly thought we had not accounted for the possibility of false positives. He later recanted that harsh criticism but wanted us to use an alternative method of characterizing the uncertainty around our estimates.</p></blockquote>
<p>On the plus side, I appreciate that he characterizes me as a serious statistician.  He also called my post &#8220;hyperbolic.&#8221;  He doesn&#8217;t actually link to it, so I&#8217;ll <a href="https://statmodeling.stat.columbia.edu/2020/04/19/fatal-flaws-in-stanford-study-of-coronavirus-prevalence/">give the link here</a> so you can make your own judgment.  The title of that post is, &#8220;Concerns with that Stanford study of coronavirus prevalence,&#8221; and I don&#8217;t think it&#8217;s hyperbolic at all!  But that&#8217;s just a matter of opinion on Bhattacharya&#8217;s part so I can&#8217;t say it&#8217;s wrong.</p>
<p>He does have two specific statements there that <em>are</em> wrong, however:</p>
<p><strong>1.</strong>  It&#8217;s not true that I &#8220;incorrectly thought their study had not accounted for the possibility of false positives.&#8221;  In my post, I explicitly recognized that their analysis accounted for the possibility of false positives.  What I wrote is that they were &#8220;focusing on the point estimates of specificity.&#8221;  Specificity = 1 &#8211; false positive rate.  What I wrote is that they <em>didn&#8217;t properly account for uncertainty</em> in the false positive rate.  I did not say they had not accounted for the possibility of false positives.</p>
<p><strong>2.</strong>  I never &#8220;recanted that harsh criticism.&#8221;  What I wrote in my post is that their article &#8220;does not provide strong evidence that the rate of people in Santa Clara county exposed by that date was as high as claimed.&#8221;  But I also wrote, &#8220;I’m not saying that the claims in the above-linked paper are wrong . . . The Bendavid et al. study is problematic if it is taken as strong evidence for those particular estimates, but it’s valuable if it’s considered as one piece of information that’s part of a big picture that remains uncertain.&#8221;  And I clarified, &#8220;When I wrote that the authors of the article owe us all an apology, I didn’t mean they owed us an apology for doing the study, I meant they owed us an apology for avoidable errors in the statistical analysis that led to overconfident claims. But, again, let’s not make the opposite mistake of using uncertainty as a way to affirm a null hypothesis.&#8221;</p>
<p>I do not see this as a &#8220;recanting,&#8221; nor did I recant at any later time, but I&#8217;m fine if Bhattacharya or anyone else wants to quote me directly to make clear that at no time did I ever say that their substantive claims were false; I only said that the data offered in that study did not supply strong evidence.</p>
<p>This bothers me.  I&#8217;d hate for people to think that I&#8217;d incorrectly thought they had not accounted for the possibility of false positives, or that I&#8217;d recanted my criticism.  Again I emphasize that my criticism was statistical and involved specification of uncertainty; it was not a claim on my part that the percentage of people who&#8217;d been exposed to Covid was X, Y, or Z.</p>
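<p>To see why uncertainty in the specificity matters so much here, consider a toy calculation (hypothetical numbers, not the study&#8217;s actual data): when the raw test-positive rate is close to the test&#8217;s false positive rate, a small shift in the assumed specificity can swing the corrected prevalence estimate all the way down to zero.</p>

```python
def corrected_prevalence(raw_positive_rate, sensitivity, specificity):
    """Classical (Rogan-Gladen) correction of a raw test-positive rate
    for imperfect sensitivity and specificity, truncated at zero."""
    false_positive_rate = 1 - specificity
    est = (raw_positive_rate - false_positive_rate) / (sensitivity + specificity - 1)
    return max(est, 0.0)

# Hypothetical survey: 1.5% of tests come back positive, sensitivity 80%.
# A point estimate of 99.5% specificity implies prevalence of roughly 1.3%...
high_spec = corrected_prevalence(0.015, 0.80, 0.995)
# ...but 98.5% specificity, well within plausible uncertainty, implies zero.
low_spec = corrected_prevalence(0.015, 0.80, 0.985)
```

<p>That is the sense in which &#8220;focusing on the point estimates of specificity&#8221; is the problem: the point estimate picks one of these answers, while propagating the uncertainty in the false positive rate keeps the whole range on the table.</p>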
<p>The good news is that this post from Bhattacharya appeared in 2023 and I only heard about it just the other day, so I guess it did not get wide circulation.  Maybe more people will see this correction than the original post!  In any case I&#8217;m glad to have the opportunity to correct the record.</p>
<p><strong>Stanford contagion</strong></p>
<p>I have no problem with Bhattacharya making arguments about covid epidemiology and policy&#8211;two topics he&#8217;s thought a lot about.  It&#8217;s completely reasonable for him and his colleagues to say that their original study was inconclusive but that it was consistent with their larger message.</p>
<p>Bhattacharya also writes:</p>
<blockquote><p>In the end, Stanford’s leadership undermined public and scientific confidence in the results of the Santa Clara study. Given this history, members of the public could be forgiven if they wonder whether any Stanford research can be trusted.</p></blockquote>
<p>He doesn&#8217;t fully follow up on this point, but I think he&#8217;s right.</p>
<p>A few weeks before the above-discussed covid study came out, Stanford got some press when law professor Richard Epstein published something through Stanford&#8217;s Hoover Institution predicting that U.S. covid deaths would max out at 500, a prediction he later updated to 5000 (<a href="https://rexdouglass.github.io/TIGR/Douglass_2020_How_To_Be_Curious_Instead_of_Contrarian_About_Covid19.nb.html">see here</a> for details).  I&#8217;ve never met Epstein or corresponded with him, but he comes off as quite the asshole, having said <a href="https://www.newyorker.com/news/q-and-a/the-contrarian-coronavirus-theory-that-informed-the-trump-administration">this</a> to a magazine interviewer:  &#8220;But, you want to come at me hard, I am going to come back harder at you. And then if I can’t jam my fingers down your throat, then I am not worth it. . . . But a little bit of respect.&#8221;  A couple years later, he followed up with some <a href="https://statmodeling.stat.columbia.edu/2022/01/29/fools-at-the-hoover-institution-the-gift-that-keeps-on-giving/">idiotic</a> statements about the covid vaccine.  Fine&#8211;the guy&#8217;s just a law professor, not a health economist or anyone else with relevant expertise here&#8211;the point is that Stanford appears to be stuck with him.  In Bhattacharya&#8217;s words, &#8220;Given this history, members of the public could be forgiven if they wonder whether any Stanford research can be trusted.&#8221;</p>
<p>This sort of guilt-by-Stanford-association would represent poor reasoning.  Just cos Stanford platforms idiots like Richard Epstein, it doesn&#8217;t mean we shouldn&#8217;t trust the research of serious scholars such as Rob Tibshirani and Jay Bhattacharya.  But I guess that members of the public could be forgiven if they show less trust in the Stanford brand.  Just as my Columbia affiliation is tarnished by my employer&#8217;s association with Mehmet Oz, Robert Hadden, and its willingness to fake its U.S. News numbers.  And Harvard is tarnished by its endorsement of various well-publicized bits of pseudoscience.</p>
<p>Reputation goes both ways.  By publishing Epstein&#8217;s uninformed commentary, Stanford’s leadership undermined public and scientific confidence, and then Bhattacharya had to pay some of the price for this undermined confidence.  That&#8217;s too bad, and it&#8217;s also too bad that he ended up on <a href="https://statmodeling.stat.columbia.edu/2021/01/29/team-stanford/">the advisory board</a> of an organization that gave the following advice:  &#8220;Currently, there is no one for whom the benefit would outweigh the risk of these [covid] vaccines&#8211;even the most vulnerable, elderly nursing home patients.&#8221;</p>
<p>The challenge is that legitimate arguments about policy responses under uncertainty get tangled with ridiculous claims such as that covid would only kill 500 Americans, or ridiculous policies such as removing the basketball hoops in the local park.  On one side, you had people spreading what can only be called denial (for example, the <a href="https://statmodeling.stat.columbia.edu/2021/01/29/team-stanford/">claim</a> that the pandemic was over in the summer of 2020); on the other side were public health authorities <a href="https://statmodeling.stat.columbia.edu/2024/11/19/behavioural-insights-team-for-the-loss/">playing the fear card</a> and keeping everyone inside.</p>
<p>So I can see how Bhattacharya was frustrated by the response to his study.  But he&#8217;s missing the mark when he misrepresents what I wrote, and when, elsewhere in his post, he disparages Stephanie Lee for doing reporting on his study.  We&#8217;re doing our jobs&#8211;I&#8217;m assessing the strength of statistical evidence, Lee is tracking down leads in the story&#8211;just like Bhattacharya is doing his job by making policy recommendations.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/22/stanford-medical-school-professor-misrepresents-what-i-wrote-but-i-kind-of-understand-where-hes-coming-from/feed/</wfw:commentRss>
			<slash:comments>90</slash:comments>
		
		
			</item>
		<item>
		<title>The true meaning of the alzabo</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/21/alzabo/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/21/alzabo/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Sat, 21 Dec 2024 14:17:25 +0000</pubDate>
				<category><![CDATA[Art]]></category>
		<category><![CDATA[Literature]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51500</guid>

					<description><![CDATA[OK, I guess this is obvious, but it hadn&#8217;t occurred to me until recently. Nonexistent books Here&#8217;s the background: The other day we went to the Grolier Book Club, which had an exhibition on lost, incomplete, and imaginary books. The &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/21/alzabo/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>OK, I guess this is obvious, but it hadn&#8217;t occurred to me until recently.</p>
<p><strong>Nonexistent books</strong></p>
<p>Here&#8217;s the background:  The other day we went to the Grolier Book Club, which had an exhibition on lost, incomplete, and imaginary books.  The fun gimmick was that they made mock-ups of physical books representing various books that no longer exist or never were.  Examples included a lost play of Shakespeare, a never-finished novel by Sylvia Plath, the complete version of Kubla Khan, a monograph by Sherlock Holmes, the legendary Necronomicon, The Theory and Practice of Oligarchical Collectivism, by Emmanuel Goldstein, the code of the Bene Gesserit, and that famous original work which is a word-for-word remake of Don Quixote.  I was pleased to see a copy of the &#8220;original&#8221; Garden of Forking Paths but disappointed not to see anything by Nathan Zuckerman, and really disappointed not to see some version of The Book of Gold, but I understand, as there are so many lost, incomplete, and imaginary books that any library of them would overflow beyond the space allotted to the exhibition.</p>
<p>When I was at the Grolier looking at these real objects representing unreal books&#8211;actually, the objects were just the outsides of the books, I&#8217;m pretty sure they weren&#8217;t filled with text on their pages&#8211;I was thinking that there must be a wikipedia entry with a long list of famous books that don&#8217;t exist.  <a href="https://www.bookandsword.com/2024/08/10/wikipedia-culture-journalistic-culture-and-academic-culture/">Wikipedia has its flaws</a> (see also <a href="https://statmodeling.stat.columbia.edu/2011/11/25/bayes-wikipedia-update/">here</a> and <a href="https://statmodeling.stat.columbia.edu/2022/02/02/when-it-comes-to-error-correction-wikipedia-is-the-the-worst-form-of-government-except-for-all-those-other-forms-that-have-been-tried-from-time-to-time/">here</a>), but . . . a list of fake books, this seems like it would be right in Wikipedia&#8217;s wheelhouse, kind of like if you want the plot of every Simpsons episode or a list of all the flavors of Pop Tarts.</p>
<p>Some googling led to the wikipedia page for <a href="https://en.wikipedia.org/wiki/Fictional_book">&#8220;fictional book&#8221;</a>.  It&#8217;s not as comprehensive as I was expecting, but it does include The Grasshopper Lies Heavy&#8212;I&#8217;d forgotten about that one!  Wikipedia also has <a href="https://en.wikipedia.org/wiki/Category:Fictional_books">this list</a>, which is even more disappointing.</p>
<p>What&#8217;s up, Wikipedia?  I was gonna say that the editors are so laser-focused on modern pop culture that they can&#8217;t be bothered to collect a list of old-fashioned <em>books</em> . . . but then I checked out the page on <a href="https://en.wikipedia.org/wiki/Lost_literary_work">&#8220;lost literary works&#8221;</a>, which offers an impressively comprehensive list starting with antiquity and going through the centuries since then, ending with this:</p>
<blockquote><p>Terry Pratchett&#8217;s unfinished works were destroyed in 2017 after his death, fulfilling his last will; his computer hard drive containing his unfinished works was deliberately crushed by a steamroller.</p></blockquote>
<p>The list of lost works includes novels by L. Frank Baum, plays by James Joyce, August Strindberg, and Leon Trotsky (!), treatises by Adam Smith, all sorts of things.  It&#8217;s a list worth reading.  Wikipedia also has an entry on <a href="https://en.wikipedia.org/wiki/Unfinished_creative_work">unfinished creative work</a>, which includes things like Mozart&#8217;s requiem and Schubert&#8217;s eighth symphony as well as various books.</p>
<p><strong>The alzabo</strong></p>
<p>As noted above, when I saw nonexistent books, I thought of The Book of Gold, and that made me think of the alzabo, that fictional animal who, if it eats a person, will ingest his or her personality as well, and then if you kill the alzabo, prepare it in a certain way, and swallow its extract, you will retain the memories of the person who&#8217;d been eaten.</p>
<p>Seeing the exhibition of the books gave me the sudden understanding that the alzabo is, very directly, a metaphor for literature:</p>
<p>&#8211; When, as an author, you pour your memories, ideas, and emotions into a book, that is like feeding yourself to the alzabo.  As Paul Gallico <a href="https://quoteinvestigator.com/2011/09/14/writing-bleed/">said</a>, &#8220;It is only when you open your veins and bleed onto the page a little that you establish contact with your reader.&#8221;</p>
<p>&#8211; When, as a reader, you dive into a book, this is like eating the alzabo.  The characters, themes, and events of the book become part of your memory.</p>
<p>In the Book of the New Sun, there are times when Severian is overwhelmed by the memories he&#8217;s absorbed from Thecla, and there&#8217;s a point where he gets lost among the accumulated memories of all the past autarchs.  When I read that, I recall thinking this was implausible (even in the science-fiction context of a willing suspension of disbelief)&#8211;how would there be room in his head for the memories of thousands of people?&#8211;but then I realized that my head contains memories of many thousands of books that I&#8217;ve read (not to mention TV shows and various other story-delivering mechanisms), so, yeah, that&#8217;s how it works.  The alzabo is a beautifully poetic representation of this two-stage transfer, first from author to book, then from book to reader.</p>
<p>In one of history&#8217;s greatest fish-out-of-water stories, a retired Hollywood actor became president of the United States.  Ronald Reagan was notorious for confusing his memories of movies with reality&#8211;but, at some level, all of us can relate to this, as we all &#8220;know&#8221; fictional characters who seem realer than people we know, we remember fictional stories that are scarier and more tearjerking than real events in our lives, and so forth.</p>
<p><strong>Connection to statistics</strong></p>
<p>And then, writing this, I realize that something similar happens in statistics!  Measurements are performed that abstract complex reality into &#8220;data,&#8221; which are then analyzed by the researcher, who uses these data to reconstruct the world.  Wow.  I&#8217;d never thought of it that way before.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/21/alzabo/feed/</wfw:commentRss>
			<slash:comments>12</slash:comments>
		
		
			</item>
		<item>
		<title>Delicate language for talking about statistical guarantees</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/20/delicate-language-for-talking-about-statistical-guarantees/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/20/delicate-language-for-talking-about-statistical-guarantees/#comments</comments>
		
		<dc:creator><![CDATA[Jessica Hullman]]></dc:creator>
		<pubDate>Fri, 20 Dec 2024 17:15:21 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51489</guid>

					<description><![CDATA[This is Jessica. As I’ve been reading papers on uncertainty quantification in machine learning (related to topics like conformal prediction or calibration), I’ve been reflecting on the language choices that authors make.  It started when I was reading a paper &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/20/delicate-language-for-talking-about-statistical-guarantees/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">This is Jessica. As I’ve been reading papers on uncertainty quantification in machine learning (related to topics like conformal prediction or calibration), I’ve been reflecting on the language choices that authors make. </span></p>
<p><span style="font-weight: 400">It started when I was reading a paper about an approach that uses set-aside data labeled with human reports to learn a threshold on a function that predicts the &#8220;alignment&#8221; (e.g., similarity) of predictions with those reports. They do this to ensure that new predictions are sufficiently aligned. There are many new methods for quantifying prediction uncertainty or controlling prediction risk in ML that have a similar flavor &#8212; they involve learning either thresholds (e.g., conformal prediction) or adjustments (e.g., posthoc calibration algorithms) on held-out calibration data that are then applied to predictions. Some of them come with &#8220;statistical guarantees&#8221;: If the data used for calibration is a good approximation of future data (e.g., is i.i.d. or at least exchangeable), then we can expect calibration in the future.</span></p>
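<p>For readers unfamiliar with this flavor of method, here is a minimal sketch of split conformal prediction, the simplest threshold-on-calibration-data procedure (my own toy example, not from any of the papers discussed):</p>

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split conformal: take a finite-sample-corrected empirical quantile of
    nonconformity scores on held-out calibration data. IF future data are
    exchangeable with the calibration data, prediction sets built from this
    threshold cover the truth with probability at least 1 - alpha."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # corrected quantile rank
    return sorted(calibration_scores)[k - 1]

def prediction_interval(point_prediction, threshold):
    """For absolute-residual scores |y - y_hat|, the set is just an interval."""
    return (point_prediction - threshold, point_prediction + threshold)

# toy calibration residuals |y - y_hat| from a held-out split
scores = [0.1, 0.3, 0.2, 0.5, 0.4, 0.25, 0.15, 0.35, 0.45, 0.05]
t = conformal_threshold(scores, alpha=0.2)
lo, hi = prediction_interval(2.0, t)
```

<p>Note that everything interesting hangs on the capitalized IF: the threshold itself is just an order statistic of the calibration scores.</p>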
<p>Anyway, I paused when I got to this sentence:</p>
<blockquote><p><span style="font-weight: 400">new reports are selected if their predicted alignment scores exceed a data-driven threshold, which is delicately set with Dcal such that the FDR is strictly controlled at the desired level</span></p></blockquote>
<p><span style="font-weight: 400"> because I found the choice of the phrase “delicately set” interesting. It seems like a good way to describe this kind of procedure: naturally there are assumptions on which the expectation about controlling the FDR rate depend, like statistical exchangeability between the calibration data and new instances, that might not hold. By using a term like </span><span style="font-weight: 400">“delicately” the authors signal some fragility. </span></p>
<p><span style="font-weight: 400">It makes me think of all of the other words that could be used to signal tentativeness or epistemic uncertainty when discussing such methods: “tenuously,” “gingerly,” “warily,” or, if you want to sound like you just studied for the GRE, “frangibly.” Maybe a bit of poetic license would not be a bad thing when describing our expectations about such methods. It would certainly make for a more entertaining reading experience.</span></p>
<p><span style="font-weight: 400">However, the fact that the above statement continues </span><span style="font-weight: 400">“&#8230; </span><span style="font-weight: 400">such that the FDR is strictly controlled at the desired level”</span><span style="font-weight: 400"> makes it sound a lot less fallible. This phrasing is more typical of the unconditional way in which such methods are often described.  </span></p>
<p><span style="font-weight: 400">Here’s another example, where the guarantee is about human behavior: </span></p>
<blockquote><p><span style="font-weight: 400">we develop an efficient and near-optimal search method to find the conformal predictor under which the expert is guaranteed to achieve the greatest accuracy with high probability.</span></p></blockquote>
<p><span style="font-weight: 400">But can we ever really guarantee that future human behavior will remain consistent with the past? </span></p>
<p><span style="font-weight: 400">As I was writing this post, I noticed that Andrew has previously blogged on his dislike of the term &#8220;guarantees&#8221; because it hides assumptions. </span><span style="font-weight: 400">I agree that it&#8217;s problematic to talk about guarantees unconditionally, as if they cannot be violated even when the methods are applied in practice. This is why the unqualified <a href="https://statmodeling.stat.columbia.edu/2024/11/01/calibration-is-sometimes-sufficient-for-trusting-predictions-what-does-this-tell-us-when-human-experts-use-model-predictions/">statements about calibration being sufficient for good decisions bug me</a> so much. </span><span style="font-weight: 400">Andrew&#8217;s post concludes by saying that it&#8217;s not so bad to talk about guarantees if you&#8217;re writing to a statistics and ML audience, since they may expect that. </span></p>
<p><span style="font-weight: 400">But the more time I spend reading papers involving uncertainty or calibration guarantees, the more I find myself wishing there was a little more care taken in how their statistical properties are discussed, particularly when talking about practically-oriented methods that assume future data is like past data. Having poked around recent papers in domains outside of CS on topics like calibration and conformal prediction, it seems that the <a href="https://statmodeling.stat.columbia.edu/2024/06/14/loving-hating-and-sometimes-misinterpreting-conformal-prediction-for-medical-decisions/">nuance is not necessarily obvious</a> when people go to apply the methods in practice. So these days I&#8217;m trying to be more careful with my own word choices. For example, beyond trying to avoid flagrant use of the word guarantee, I try to add the word “expected” more often. E.g., “under which the expert is expected to achieve the greatest accuracy…” or “delicately selected so that the FDR is expected to be strictly controlled at the desired level.” It’s minor but I think it better emphasizes that our assumptions may not always hold, because they were unrealistic and/or because in practice we’re dealing with singular events rather than hypothetical replications. </span></p>
<p><span style="font-weight: 400">Another idea would be to deliberately choose weaker synonyms for guarantees, like “this method promises that the expected FDR will be less than&#8230;” Sounds a little odd. But maybe it’s a good thing for descriptions of methods to take the reader back to childhood when their so-called friend “promised” not to tell on them or their parent “promised” not to snoop in their room! </span><span style="font-weight: 400">Or as Paul Alper <a href="https://statmodeling.stat.columbia.edu/2024/07/05/you-can-guarantee-that-the-term-statistical-guarantee-will-irritate-me-heres-why-and-lets-go-into-some-details/#comment-2375093">suggests</a>, say &#8220;statistical oomph&#8221; instead of guarantees, which I think captures the same vibe of excitement and clout but sounds so much sillier.</span></p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/20/delicate-language-for-talking-about-statistical-guarantees/feed/</wfw:commentRss>
			<slash:comments>8</slash:comments>
		
		
			</item>
		<item>
		<title>&#8220;Accounting for Nonresponse in Election Polls: Total Margin of Error&#8221;</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/20/accounting-for-nonresponse-in-election-polls-total-margin-of-error/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/20/accounting-for-nonresponse-in-election-polls-total-margin-of-error/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Fri, 20 Dec 2024 14:13:33 +0000</pubDate>
				<category><![CDATA[Bayesian Statistics]]></category>
		<category><![CDATA[Miscellaneous Statistics]]></category>
		<category><![CDATA[Multilevel Modeling]]></category>
		<category><![CDATA[Political Science]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50948</guid>

					<description><![CDATA[Jeff Dominitz and Chuck Manski write: The potential impact of nonresponse on election polls is well known and frequently acknowledged. Yet measurement and reporting of polling error has focused solely on sampling error, represented by the margin of error of &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/20/accounting-for-nonresponse-in-election-polls-total-margin-of-error/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>Jeff Dominitz and Chuck Manski <a href="https://arxiv.org/abs/2407.19339">write</a>:</p>
<blockquote><p>The potential impact of nonresponse on election polls is well known and frequently acknowledged. Yet measurement and reporting of polling error has focused solely on sampling error, represented by the margin of error of a poll. Survey statisticians have long recommended measurement of the total survey error of a sample estimate by its mean square error (MSE), which jointly measures sampling and non-sampling errors. Extending the conventional language of polling, we think it reasonable to use the square root of maximum MSE to measure the total margin of error. This paper demonstrates how to measure the potential impact of nonresponse using the concept of the total margin of error, which we argue should be a standard feature in the reporting of election poll results. We first show how to jointly measure statistical imprecision and response bias when a pollster lacks any knowledge of the candidate preferences of non-responders. We then extend the analysis to settings where the pollster has partial knowledge that bounds the preferences of non-responders.</p></blockquote>
<p>Good stuff.  This relates to two of my papers on nonsampling errors and differential nonresponse:</p>
<p><a href="http://stat.columbia.edu/~gelman/research/published/polling-errors.pdf">[2018] Disentangling bias and variance in election polls. Journal of the American Statistical Association 113, 607-614.</a> (Houshmand Shirani-Mehr, David Rothschild, Sharad Goel, and Andrew Gelman)</p>
<p><a href="http://stat.columbia.edu/~gelman/research/published/nonresp2.pdf">[1998] Modeling differential nonresponse in sample surveys. Sankhya 60, 101-126.</a> (Thomas C. Little and Andrew Gelman)</p>
<p>It&#8217;s funny about that last paper because Tom Little and I did that project nearly 30 years ago and I forgot about it even while doing <a href="http://stat.columbia.edu/~gelman/research/published/swingers.pdf">applied research</a> on differential nonresponse.  I think there&#8217;s more to be done here, integrating these different perspectives.  I like the idea of modeling the nonsampling error rather than just treating it as another error term.</p>
<p>On the general point of reporting surveys and margins of errors, I recommend <a href="https://www.tandfonline.com/doi/abs/10.1080/09332480.1993.11882474">Ansolabehere and Belin&#8217;s paper</a> from 1993.</p>
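<p>The basic idea is simple enough to sketch in code.  What follows is just my rough back-of-the-envelope version of the concept, not the exact formulation in the Dominitz and Manski paper:  take the responders-only estimate, bound the worst-case bias from nonresponders (whose preferences could be anywhere in [0, 1]), and report the square root of the worst-case MSE rather than sampling error alone.  The function name and inputs here are my own:</p>

```python
import math

def total_margin_of_error(p_resp, n_resp, response_rate):
    """Back-of-the-envelope total margin of error for a poll proportion.

    p_resp: proportion favoring a candidate among responders
    n_resp: number of responders
    response_rate: fraction of the sampled people who responded

    If the true proportion is r*p_resp + (1-r)*p_nonresp with p_nonresp
    unknown in [0, 1], the bias of the responders-only estimate is at most
    (1 - r) * max(p_resp, 1 - p_resp).  Total error is then measured by
    sqrt(max MSE) = sqrt(worst_bias^2 + sampling_variance).
    """
    sampling_var = p_resp * (1 - p_resp) / n_resp
    worst_bias = (1 - response_rate) * max(p_resp, 1 - p_resp)
    return math.sqrt(worst_bias**2 + sampling_var)

# With full response this reduces to the usual standard error:
print(total_margin_of_error(0.52, 1000, 1.0))
# With even 5% nonresponse, the worst-case bias term dominates:
print(total_margin_of_error(0.52, 1000, 0.95))
```

<p>With full response the bias term vanishes and you recover the ordinary sampling-based standard error; with a few percent nonresponse the worst-case bias term already dominates, which is the point of the paper&#8217;s complaint about reporting only sampling error.</p>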
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/20/accounting-for-nonresponse-in-election-polls-total-margin-of-error/feed/</wfw:commentRss>
			<slash:comments>28</slash:comments>
		
		
			</item>
		<item>
		<title>How did the press do on that &#8220;black spatula&#8221; story?  Not so great.</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/19/how-did-the-press-do-on-that-black-spatula-story/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/19/how-did-the-press-do-on-that-black-spatula-story/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Thu, 19 Dec 2024 15:25:27 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=51486</guid>

					<description><![CDATA[Act 1: The journal Chemosphere publishes a research article finding that black plastic kitchen items contain dangerous toxins arising from reprocessing of plastics that contained carcinogenic flame retardants. This result is publicized by the advocacy group that helped conduct the &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/19/how-did-the-press-do-on-that-black-spatula-story/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p><strong>Act 1:</strong>  The journal Chemosphere publishes a research article finding that black plastic kitchen items contain dangerous toxins arising from reprocessing of plastics that contained carcinogenic flame retardants.  This result is publicized by the advocacy group that helped conduct the study, and it is picked up in the major media, with headlines such as, &#8220;Black-colored plastic used for kitchen utensils and toys linked to banned toxic flame retardants,&#8221; as <a href="https://www.cnn.com/2024/10/01/health/flame-retardant-black-plastic-wellness/index.html">CNN reported</a> on 1 Oct.  Various news organizations pick up the story and give recommendations about not reusing plastic containers, etc.</p>
<p><strong>Act 2:</strong>  It turns out that the journal article had a factor-of-10 error, where a given exposure was stated to be 80% of the legal limit for a certain toxin, but it was really only 8%.</p>
<p>We <a href="https://statmodeling.stat.columbia.edu/2024/12/13/how-a-simple-math-error-sparked-a-panic-about-black-plastic-kitchen-utensils/">wrote about this the other day</a>, citing, linking, and quoting from <a href="https://nationalpost.com/news/canada/black-plastic">an article of 11 Dec</a> from Canada&#8217;s National Post newspaper, that reported:</p>
<blockquote><p>Plastics rarely make news like this . . . the media uptake was enthusiastic on a paper published in October in the peer-reviewed journal Chemosphere.</p>
<p>“Your cool black kitchenware could be slowly poisoning you, study says. Here’s what to do,” said the LA Times. “Yes, throw out your black spatula,” said the San Francisco Chronicle. Salon was most blunt: “Your favorite spatula could kill you,” it said.</p>
<p>The study, by researchers at the advocacy group Toxic-Free Future . . . estimated that using contaminated kitchenware could cause a median intake of 34,700 nanograms per day of Decabromodiphenyl ether, known as BDE-209. . . . The trouble is that, in the study’s section on “Health and Exposure Concerns,” the researchers said this number, 34,700, “would approach” the reference dose given by the United States Environmental Protection Agency. . . .</p>
<p>The paper correctly gives the reference dose for BDE-209 as 7,000 nanograms per kilogram of body weight per day, but calculates this into a limit for a 60-kilogram adult of 42,000 nanograms per day. . . . But 60 times 7,000 is not 42,000. It is 420,000. This is what [McGill University’s] Joe Schwarcz noticed. . . .</p></blockquote>
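<p>For what it&#8217;s worth, the arithmetic is trivial to check.  Here&#8217;s a quick sanity check using only the numbers from the passage quoted above (nothing here comes from the paper&#8217;s own calculations):</p>

```python
# Sanity-checking the numbers quoted above from the National Post article.
ref_dose_per_kg = 7_000   # ng per kg body weight per day, EPA reference dose for BDE-209
body_weight = 60          # kg, the adult assumed in the study
median_intake = 34_700    # ng per day, the study's estimated median intake

daily_limit = ref_dose_per_kg * body_weight
print(daily_limit)                                  # 420000, not the paper's 42,000
print(round(100 * median_intake / daily_limit, 1))  # 8.3 -- about 8% of the limit, not 80%
```

<p>So the estimated intake doesn&#8217;t &#8220;approach&#8221; the reference dose at all; it&#8217;s about a tenth of it, which is the factor-of-10 error Schwarcz caught.</p>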
<p>The article goes on to interview Schwarcz about the spatula claim and general issues of summarizing uncertainty.</p>
<p>The starting point of Act 2 of our story seems to have been <a href="https://www.mcgill.ca/oss/article/critical-thinking-health-and-nutrition/are-black-plastic-spatulas-and-serving-spoons-safe-use">this press release</a> by Schwarcz from 6 Dec on the McGill University website, entitled, &#8220;Are Black Plastic Spatulas and Serving Spoons Safe to Use? For me, the risk would not be enough to discard them, but science should find a way to keep flame retardants out of such items.&#8221;</p>
<p>It&#8217;s a complicated problem.  Indeed, only just now I&#8217;m realizing that &#8220;median intake&#8221; is a strange thing to look at when summarizing a public health risk.</p>
<p><strong>Act 3:</strong>  The error was corrected in the journal and in a press release by the advocacy organization, but in a minimizing way:  &#8220;it is important to note that this does not impact our results.&#8221;  This was mentioned in the above-linked National Post article.</p>
<p><strong>Act 4:</strong>  Several news organizations followed up on the correction.</p>
<p>And that&#8217;s the subject of today&#8217;s post.  How did the news organizations do?</p>
<p>I&#8217;d say their record is mixed.<br />
<span id="more-51486"></span><br />
I&#8217;ll go through them in the order they came up in my Google searches: </p>
<p><a href="https://www.sfgate.com/food/article/throw-away-black-spatulas-sf-19986362.php">Here&#8217;s SF Gate</a>, a website that was <a href="https://www.sfgate.com/home/article/about-sfgate-15613713.php">formerly part of</a> the San Francisco Chronicle: </p>
<blockquote><p>Did you read that black spatulas were poisoning you and everyone you love and then throw away multiple kitchen utensils in a blind panic? . . . Publications ranging from the Atlantic (“Throw Out Your Black Plastic Spatula”) to the Los Angeles Times (“Your cool black kitchenware could be slowly poisoning you, study says”) to Salon.com (“Your favorite spatula could kill you”) all spread the panic. . . . But then Joe Schwarcz, director of McGill University’s Office for Science and Society, noticed a mistake in the study. . . .</p></blockquote>
<p>The SF Gate article, which is dated 17 Dec, has some strong similarities to the National Post article&#8211;I&#8217;m not saying it&#8217;s plagiarized, exactly, more that they seem to have read that earlier news article and rewritten it.  The good news is they kept the credit to Schwarcz; the bad news is . . . well, let&#8217;s see if you notice.  Read carefully:</p>
<p>From the National Post:  &#8220;&#8216;Your cool black kitchenware could be slowly poisoning you, study says. Here’s what to do,&#8217; said the LA Times. &#8216;Yes, throw out your black spatula,&#8217; said the San Francisco Chronicle. Salon was most blunt: &#8216;Your favorite spatula could kill you,&#8217; it said.&#8221;</p>
<p>From SF Gate:  &#8220;Publications ranging from the Atlantic (&#8216;Throw Out Your Black Plastic Spatula&#8217;) to the Los Angeles Times (&#8216;Your cool black kitchenware could be slowly poisoning you, study says&#8217;) to Salon.com (&#8216;Your favorite spatula could kill you&#8217;) all spread the panic.&#8221;</p>
<p>SF Gate kept the LA Times and Salon.com, they added the Atlantic, but they removed the San Francisco Chronicle.  Kind of funny to get rid of the only local angle, no?  Especially given that the SF Gate article has the title, &#8220;Turns out everyone in SF didn&#8217;t need to throw away their black spatulas.&#8221;</p>
<p>I can&#8217;t figure out why SF Gate didn&#8217;t mention the San Francisco Chronicle.  Here are two, somewhat contradictory, guesses:</p>
<p>1.  SF Gate and the Chronicle are competitors and so, wherever possible, they will avoid linking to or mentioning each other.</p>
<p>2.  SF Gate and the Chronicle are sister organizations and so SF Gate removed any mention of the Chronicle in their article so as not to cause any embarrassment to that newspaper.</p>
<p>I don&#8217;t know which it is; either way, sure, it&#8217;s not the world&#8217;s biggest scandal but it&#8217;s kind of uncool.</p>
<p>In any case, SF Gate did not credit the National Post at all, which seems really bad given that it looks like they were just rewriting that National Post story from a week earlier.  Again, I&#8217;m not saying it was plagiarism; it&#8217;s just bad practice to not credit an earlier source.</p>
<p>Google also <a href="https://web.archive.org/web/20241217125042/https://www.sfchronicle.com/health/article/black-plastic-study-problem-19984247.php">turns up an article</a>, &#8220;Viral study about black plastic spatulas had a big math problem,&#8221; from the San Francisco Chronicle itself.  This one also cites Schwarcz, but not the National Post, even though, again, it seems to be a recycling (ha ha, geddit?) of that earlier news article.</p>
<p>On the plus side, the Chronicle does explicitly say that they had covered the earlier story, and they link to that story, although without reproducing their incendiary (ha ha, geddit?) headline, &#8220;Yes, throw out your black spatula.&#8221;</p>
<p>Here&#8217;s <a href="https://www.nytimes.com/2024/12/10/well/black-plastic-health.html">the New York Times</a>, my hometown newspaper, with an article entitled, &#8220;Do I Really Need to Throw Out My Black Plastic Spatula? A new study detected dangerous chemicals in a variety of household items. But experts say the health risks aren’t clear-cut.&#8221;  In true Gray Lady fashion, they&#8217;ve worked to make their headline and sub-headline as boring as possible&#8212;and they&#8217;ve pretty much succeeded!</p>
<p>This Times article does the worst job of reporting we&#8217;ve seen so far!  In some ways it&#8217;s good:  it takes a measured tone and includes interviews with multiple experts.  But . . . the article was published 10 Dec and updated 17 Dec, and it doesn&#8217;t mention the factor-of-10 problem at all!  Even though Schwarcz&#8217;s press release appeared on 6 Dec.  OK, fair enough&#8211;the reporter didn&#8217;t go to the trouble of a careful internet search that could&#8217;ve found the press release, and as of 10 Dec, when the NYT article was first published, the National Post article hadn&#8217;t yet appeared.  It was really the National Post that broke the story.  But it says right there on the Times article that it was updated on 17 Dec, indeed that it appeared in the physical newspaper on that day.  By then, even a cursory internet search on some variant of *black plastic study* would&#8217;ve revealed the problem.</p>
<p>Back on 14 Nov, the Times had run <a href="https://www.nytimes.com/wirecutter/reviews/toxic-black-plastic-kitchen-alternatives/">an article</a>, &#8220;Black Plastic Kitchen Tools Might Expose You to Toxic Chemicals. Here’s What to Use Instead.&#8221;  They didn&#8217;t know about the factor-of-10 error, and they pretty much buy the &#8220;Toxic-Free Future&#8221; alarmist pitch straight-up, no skepticism.  On the plus side, unlike all the other sources, this Times article has some useful suggestions, as they discuss metal, silicone, and wood alternatives.</p>
<p>On the positive side, <a href="https://slate.com/technology/2024/11/black-plastic-spatula-chemicals-flame-retardants.html">here&#8217;s Slate</a>, which did very well on this story. Back on 4 Nov, long before any of the scandal had occurred, they ran a story, &#8220;I’m Not Throwing Away My Black Plastic Spatula. Yes, they can have scary chemicals in them. But let’s take a closer look at the research,&#8221; with lots of details going into the science.  And, as a bonus, on 16 Dec they added an update at the very beginning of their story, &#8220;Black plastic is even less worrisome than it initially seemed—one of the papers sparking concern <a href="https://nationalpost.com/news/canada/black-plastic">contained a math error</a>,&#8221; with that last link going back to . . . the National Post story from 11 Dec!</p>
<p>Good job, Slate!  Excellent science reporting.  And I say this not just as an <a href="http://stat.columbia.edu/~gelman/media/">occasional</a> science reporter for Slate.</p>
<p>The Los Angeles Times, to its credit, also got on the bandwagon, with a story dated 19 Dec entitled &#8220;Your black plastic kitchen utensils aren’t so toxic after all. But you should still toss them, group says,&#8221; and continuing:</p>
<blockquote><p>Published in the peer-reviewed journal Chemosphere, experts from the nonprofit Toxic-Free Future said they detected flame retardants and other toxic chemicals in 85% of 203 items made of black plastic including kitchen utensils, take-out containers, children’s toys and hair accessories. . . .</p>
<p>But in an update to the study, the authors say they made an error in their calculations and the real levels were “an order of magnitude lower” than the EPA’s thresholds. The error was discovered by Joe Schwarcz, director of McGill University’s Office for Science and Society in Canada.</p>
<p>In a blog post, Schwarcz explained that the Toxin-Free Future scientists miscalculated the lower end of what the EPA considered a health risk through a multiplication error. Instead of humans being potentially exposed to a dose of toxic chemicals in black plastic utensils near the minimum level that the EPA deems a health risk, it’s actually about one-tenth of that.</p></blockquote>
<p>The L.A. Times didn&#8217;t link to or credit the National Post, but they did credit Schwarcz, which is important.  Oddly enough, though, they did not link to Schwarcz&#8217;s press release (which they label as a blog, which I don&#8217;t think is quite accurate, but maybe it&#8217;s close enough).  The L.A. Times does link to their earlier article, but they do a bad thing by (a) not pointing out that it was their newspaper that fell for that earlier hype (to know it was the L.A. Times, you&#8217;d need to hover over or click on that link) and (b) not pointing out their earlier, alarmist, headline, &#8220;Your cool black kitchenware could be slowly poisoning you, study says. Here’s what to do.&#8221;</p>
<p>Some further internet searching took me to the world of news aggregators&#8211;those sites that manage to show up in your google search but don&#8217;t do any reporting themselves.  The site msn.com offered two articles dated today, <a href="https://www.msn.com/en-za/news/other/potential-cancer-causing-chemicals-have-been-found-in-black-spatulas-and-takeaway-containers-is-it-time-to-throw-them-out/ar-AA1w9Ixy">one from</a> the South African site DW and <a href="https://www.msn.com/en-us/health/medical/maybe-it-s-not-time-to-chuck-black-spatulas/ar-AA1vZdLq">one from</a> the U.S.? site Newser.  The latter article is just a rewrite of the National Post article.  To its credit, it cites and links to the National Post, so in that sense, it&#8217;s the best of the bunch.  On the other hand, the whole thing is kind of sketchy:  Newser describes itself as &#8220;a human + ai experiment,&#8221; and what it&#8217;s doing is paraphrasing an existing article and then putting it on some new platform with a bunch of ads, which is then scraped by msn.com, which adds no value but is then itself included in a google search.  Lots of middlemen in this system, and at no stage is any value added.</p>
<p><a href="https://www.scientificamerican.com/article/study-miscalculation-has-everyone-talking-about-black-plastic-spatulas-again/">Here&#8217;s Scientific American</a>, with a story dated 18 Dec, &#8220;Any Level of Flame Retardants in Black Plastic Spatulas Concerns Scientists. The scientists behind a popular study on the health effects of flame retardants in black plastic cooking utensils and toys made a calculation error but still say their revised findings are alarming.&#8221;</p>
<p>I&#8217;m not so happy about this story, for two reasons.  First, yes, they mention the factor-of-10 error but without citing or linking to Schwarcz at all, indeed not mentioning him at all.  Second, they give lots of quotes from the advocacy group, pretty much just following their spin.  I guess this fits into Scientific American&#8217;s recent reputation as being a politicized anti-industry propaganda sheet.</p>
<p>From the other side, <a href="https://www.plasticstoday.com/medical/the-case-of-the-black-plastic-spatula">here&#8217;s Plastics Today</a>, representing the plastics industry, or some segment of it.  Their article is called, &#8220;The Case of the Black Plastic Spatula. Did you throw yours out? That might have been premature.&#8221;  They don&#8217;t mention Joe Schwarcz, but they correctly do cite and link to the National Post.</p>
<p><a href="https://www.usatoday.com/story/news/nation/2024/12/18/black-plastic-kitchen-utensils-study-risk/77049209007/">USA Today</a> reports the correction, which is fine, but they only cite the advocacy group that published the paper, without crediting Schwarcz.  They link to the advocacy group&#8217;s updated <a href="https://toxicfreefuture.org/update-study-correction-does-not-impact-conclusion/">press release</a>, which minimizes their factor-of-10 mistake (&#8220;The error does not impact the study’s findings, recommendations, or conclusions,&#8221; which makes you wonder why they bothered putting the number in their paper in the first place, and also makes you wonder whether they&#8217;d have said it didn&#8217;t impact anything had the error been a factor of 10 in a direction that did not favor their advocacy) and, yes, they say &#8220;We are grateful to the scientists who brought this forward and we apologize for the error,&#8221; but, no, they do not link to Schwarcz&#8217;s report or credit him in any way.</p>
<p>As you might expect, the tech news site Gizmodo <a href="https://gizmodo.com/embarrassing-math-error-scared-the-pants-off-people-over-toxic-black-kitchenware-2000540158">does a better job</a>, citing Schwarcz and linking to the National Post article.</p>
<p>Bill Bottenberg pointed me to an article dated 16 Dec in the news site Ars Technica, &#8220;Huge math error corrected in black plastic study; authors say it doesn’t matter. Correction issued for black plastic study that had people tossing spatulas.&#8221;  It&#8217;s pretty bad:  zero credit to Schwarcz or the National Post, it just goes over the original published paper and the correction notice from the journal.</p>
<p>Regarding the substance of the matter, Bottenberg writes:</p>
<blockquote><p>Being a natural born physical chemist I noticed the earlier claim about the dangers of black plastic in the kitchen. Of course I examined immediately my kitchen tool drawers and found: narrow spatula with a lotta holes, wide spatula with a flat front edge &#8211; good for flipping omeletes, nice slotted spoon good for removing stuff from a boiling broth for various reasons, and a deep wide spoon good for serving from a soup or stew. These were all as black and plasticky as could be. I looked at them and let them be.</p>
<p>Why would I do this? Well, maybe a feeling that the mass transfer processes in water (with stuff) didn&#8217;t seem likely at the temperature ranges that might be encountered. But then, the mass transfer rates in dang hot oil (with stuff) might be meaningful. These considerations might be for aromatic molecules, maybe for nasty bromine containing organics aromatic or not.</p>
<p>What a puzzle. I closed the drawers and continued using them as I mulled over the possibilities.</p>
<p>This is how we human beings do stuff. Either we throw them all out, keep them because we don&#8217;t care, or just think for a while about the ramifications&#8230;</p>
<p>To the current points you have been making, which I really enjoy and appreciate &#8211; the authors have admitted to the mistake, but not so much as to change their recommendations. It&#8217;s a strange world.</p>
<p>Ha ha, the jokes on me. Now I have to go back and really read the original paper, rethink the Ars Technica article and figure out what&#8217;s next.</p>
<p>I&#8217;m not emptying the drawers of black plastic&#8230;</p></blockquote>
<p>My recommendations to Bottenberg are:</p>
<p>1.  Don&#8217;t &#8220;rethink the Ars Technica article.&#8221;  Just go back to the original sources.  And I&#8217;d recommend spending more time on Gizmodo and less time on Ars Technica (and zero time on Scientific American!).</p>
<p>2.  Use a metal ladle to serve your soup, wooden spoons for stirring, a metal spatula for frying on your usual pans, and a silicone spatula for your nonstick pans.  If you don&#8217;t have any of these items, then, sure, no rush to get them all at once, but you must have some of these in your kitchen already, no?</p>
<p><strong>Summary</strong></p>
<p>The news organizations did a reasonable job covering the correction, except for four things:</p>
<p>1.  Many fewer outlets reported on the error than on the original story.</p>
<p>2.  Even when reporting the error, they mostly relied too strongly on the spin provided by the advocacy organization.</p>
<p>3.  They mostly did not cite the National Post, even though that was the newspaper that broke the story.  Unfortunately, this seems to be standard practice, most notoriously with the New York Times but with other newspapers as well, to not credit the reports from other newspapers that got to a story first.  <a href="https://statmodeling.stat.columbia.edu/2023/08/12/nyt-does-some-rewriting-without-attribution-i-guess-this-is-standard-in-journalism-but-it-seems-unethical-to-me/">Here&#8217;s a particularly bad example</a> from last year of the NYT retelling a story without acknowledging the original source, which was a Canadian newspaper.</p>
<p>4.  Many of the outlets did not credit Joe Schwarcz, the person who discovered the error.  This was a consequence of items 2 and 3 above, as Schwarcz was not credited in the correction notice and press release from the authors of the study.</p>
<p>This is not plagiarism, but a similar issue arises here, something that <a href="http://stat.columbia.edu/~gelman/research/published/GelmanBasbollAmericanScientist.pdf">Thomas Basbøll and I have written about</a>: When a news article obscures its sources, information is lost, and the reader pays the price.</p>
<p>In this case, it&#8217;s clear that several of the followup articles were directly based on the National Post article; by not mentioning or linking to that earlier reporting, they were covering their tracks and making it harder for readers to see beyond the spin being provided by the advocacy organization.</p>
<p>I&#8217;m not saying here that plastic kitchen items are safe (or that they&#8217;re not); what I&#8217;m talking about is the reporting.</p>
<p>Also, don&#8217;t trust Scientific American.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/19/how-did-the-press-do-on-that-black-spatula-story/feed/</wfw:commentRss>
			<slash:comments>18</slash:comments>
		
		
			</item>
		<item>
		<title>He took public funds and falsified his data.  Are they gonna make him pay back the $19 million?</title>
		<link>https://statmodeling.stat.columbia.edu/2024/12/18/are-they-gonna-make-him-pay-back-the-19-million/</link>
					<comments>https://statmodeling.stat.columbia.edu/2024/12/18/are-they-gonna-make-him-pay-back-the-19-million/#comments</comments>
		
		<dc:creator><![CDATA[Andrew]]></dc:creator>
		<pubDate>Wed, 18 Dec 2024 14:33:31 +0000</pubDate>
				<category><![CDATA[Miscellaneous Science]]></category>
		<category><![CDATA[Political Science]]></category>
		<category><![CDATA[Zombies]]></category>
		<guid isPermaLink="false">https://statmodeling.stat.columbia.edu/?p=50928</guid>

					<description><![CDATA[As a former University of Maryland student, I&#8217;m incensed by this story. From Retraction Watch, this story about a professor of Biochemistry and Molecular Biology: The ORI finding stated Eckert “engaged in research misconduct in research supported by” every NIH &#8230; <a href="https://statmodeling.stat.columbia.edu/2024/12/18/are-they-gonna-make-him-pay-back-the-19-million/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
										<content:encoded><![CDATA[<p>As a former University of Maryland student, I&#8217;m incensed by <a href="https://retractionwatch.com/2024/08/13/former-maryland-dept-chair-with-19-million-in-grants-faked-data-in-13-papers-feds-say/">this story</a>.  From Retraction Watch, this story about a professor of Biochemistry and Molecular Biology:</p>
<blockquote><p>The ORI finding stated Eckert “engaged in research misconduct in research supported by” every NIH grant on which he served as principal investigator, totaling more than $19 million. The finding also lists multiple “Center Core Grants” worth hundreds of millions for shared resources and facilities at research centers. . . . </p>
<p>According to ORI’s findings, Eckert erased a band in one of the paper’s figures “to falsely show a favorable result.” </p>
<p>In the 13 papers and two grant applications, Eckert used and reused images “representing unrelated experiments, with or without manipulating them, and falsely relabeling them as data representing different proteins and/or experimental results,” ORI found.</p></blockquote>
<p>I don&#8217;t really know what to think about the Center Core Grants, but the 19 million dollars he should have to pay back!  OK, maybe he doesn&#8217;t have $19 million cash on hand, but there&#8217;s gotta be a system for this, right?  Maybe they start by repoing his car, freezing his bank account, and selling his house, then he&#8217;s allowed some reasonable amount to live on . . . hmmm, it appears that the NIH predoc salary level is 28,224 per year, so they could let him keep that much . . . then he could try to come up with a payment plan.  I dunno.  They throw drug dealers in prison, but that would just cost the taxpayer more money.  Better to just put an ankle monitor on him and take away his internet access. . . .</p>
<p>Ummm, let&#8217;s read on:</p>
<blockquote><p>Eckert agreed to forgo contracting with the federal government or receiving government funding for eight years . . . Eckert also agreed not to serve on any advisory or peer review committees for the U.S. Public Health Service, which includes the NIH, for eight years.</p></blockquote>
<p>OK, maybe after eight years if he shows some evidence of rehabilitation, maybe.  But what about the $19 million?  Paying that back should be the absolute minimum requirement, no?</p>
]]></content:encoded>
					
					<wfw:commentRss>https://statmodeling.stat.columbia.edu/2024/12/18/are-they-gonna-make-him-pay-back-the-19-million/feed/</wfw:commentRss>
			<slash:comments>48</slash:comments>
		
		
			</item>
	</channel>
</rss>
