Statistical Modeling, Causal Inference, and Social Science

The Mets are hiring

Andrew — Sun, 11 Jan 2026 19:55:27 +0000

Sam Saskin writes:

I’m reaching out because we are hiring for a couple of jobs on the Mets analytics team and I was wondering if you’d be willing to share the job postings on your blog. The two positions (posting links below) are Senior Data Scientist, which would be a match for anyone looking for a full-time position, and Data Science Intern, which would be a match for current students (either undergraduate or graduate) who would be interested in spending a summer working with our team. I really appreciate the assistance, as we’ve had a lot of luck finding great candidates through visibility on your blog in the past.

Also if you have a 100 mph fastball they might be able to find a place for you somewhere in the organization.

“The Great Believers” by Rebecca Makkai

Andrew — Sun, 11 Jan 2026 14:51:29 +0000

The author Rebecca Makkai has come up a couple times already in this space, first in a quick mention regarding the character names in her excellent recent novel, I Have Some Questions for You, then later in a discussion of the structure of suspense in literature. Also she has a blog that’s included in the Cultural section of our links page.

This all came up because I just finished another Makkai novel, The Great Believers, which I also liked a lot. I’d say I “enjoyed” it–I think that’s an accurate description of a book that I read a few pages every day at bedtime until one day I just sat on the couch and read the last 30 pages–except that the story is sad, so in that sense the reading experience wasn’t quite enjoyable. In any case, I recommend it.

The book did something with time that really worked. Chapters alternated between 1985 and 2015, where you see some of the characters from the earlier time period reappear. This gives some perspective (as logician and sometime Chicagoan Raymond Smullyan memorably said, “To know the past, one must first know the future”). But it’s not just that. There’s some way in which placing the main action in 1985, but situating the story in the present time, makes the people in 1985 seem more alive than they would feel in a straight-up historical novel. I’m not quite sure why this would be, and I don’t think the parallel-time-line thing always works–indeed, I’ve read novels that use it and it seems more like a gimmick, underlining the artificiality of the story–but in this case I found it to be very moving.

One more thing. From the author description:

Makkai is on the MFA faculties of the University of Nevada, Reno at Lake Tahoe and Northwestern University, and is the artistic director of StoryStudio Chicago. She lives on the campus of the midwestern boarding school where her husband teaches, and in Vermont.

Which leaves me with some questions:

1. If it’s “the University of Nevada, Reno at Lake Tahoe,” why don’t they just call it “the University of Nevada, Lake Tahoe”?

2. I wonder if she knows the story of the torment executioners at the University of Nevada, Reno? I have a feeling that the characters in The Great Believers would be very amused by the idea of a torment executioner.

3. Makkai has 3 jobs in 2 different states, and then also lives in a third state? How does she do it? I’m reminded of this guy:

Niall Ferguson, MA, D.Phil., is the Laurence A. Tisch Professor of History at Harvard University. He is a resident faculty member of the Minda de Gunzburg Center for European Studies. He is also a Senior Research Fellow of Jesus College, Oxford University, and a Senior Fellow of the Hoover Institution, Stanford University.

I guess some people just have a lot of energy. I’m pretty sure that Ferguson would hate Makkai’s book, though, given that he’s famous for a homophobic slur and The Great Believers is a sympathetic portrayal of gay life.

Concerns about the z-curve method

Erik van Zwet — Sat, 10 Jan 2026 18:15:46 +0000

This is Erik: A few weeks ago, Andrew blogged about a paper by Richard Morey and Clint Davis-Stober entitled “On the poor statistical properties of the P-curve meta-analytic procedure”. Andrew quoted Morey:

We make the point that many of these techniques were never vetted by experts, and often are just “verified” by a few simulations. For tests, this is not good enough, but nevertheless these methods can get popular because (in my opinion) they tell people what they want to hear.

I believe that another meta-analytic method called z-curve (Brunner and Schimmack (2020), Bartos and Schimmack (2022), Schimmack and Bartos (2023)) has similar problems.

Recall that the signal-to-noise ratio (SNR) in statistics is the ratio of the true effect to the standard error of its estimator. If we make the “usual assumptions” then the z-statistic (the estimator divided by its standard error) has the normal distribution with mean SNR and standard deviation 1.

If we have a collection of studies, then the distribution of the z-statistics is the convolution (sum) of the distribution of the SNRs of the studies and the standard normal distribution. If we’ve estimated the distribution of the z-statistics, we can get the distribution of the SNRs by deconvolution. Deconvolution is known to be very unstable. That means that we need very many data points (studies) or very strong assumptions – preferably both – to get an accurate result.

The z-curve method is based on the assumption that the absolute values of the SNRs have a discrete distribution supported on 0,1,2,…, 6. Note that SNR=0 corresponds to “null effects”. To circumvent the effects of selection on statistical significance, z-curve uses only the absolute values of the z-statistics which exceed 1.96 in magnitude to estimate the 7 probabilities. Deconvolution is bad enough, but it gets much worse if only such a small part of the data is used. This makes uncertainty quantification especially important.

The z-curve method as implemented in the R package zcurve provides (among other things) estimates and confidence intervals of the expected discovery rate (EDR) and the expected replicability rate (ERR). I believe these are defined in my terminology as

EDR=P(|z|>1.96)
ERR=P(|z_repl| > 1.96 and z_repl × z > 0 | |z|>1.96)
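To make these two definitions concrete, here is a minimal sketch that evaluates them for a known discrete distribution of SNRs. It assumes, as in the setup above, that z given the SNR is normal with mean SNR and standard deviation 1, and that the replication z_repl is an independent draw with the same SNR. The function names are made up for illustration and are not part of the zcurve package.

```python
import math

Z_CRIT = 1.96  # two-sided 5% significance threshold used in the text


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def tail_probs(snr):
    """Return (P(z > 1.96), P(z < -1.96)) when z ~ N(snr, 1)."""
    p_pos = 1.0 - normal_cdf(Z_CRIT - snr)
    p_neg = normal_cdf(-Z_CRIT - snr)
    return p_pos, p_neg


def edr_err(snr_dist):
    """Compute (EDR, ERR) for a discrete SNR distribution {snr: probability}."""
    edr = 0.0
    joint = 0.0  # P(original and replication both significant, same sign)
    for snr, weight in snr_dist.items():
        p_pos, p_neg = tail_probs(snr)
        edr += weight * (p_pos + p_neg)
        # Assumption: given the SNR, z and z_repl are independent N(snr, 1).
        joint += weight * (p_pos ** 2 + p_neg ** 2)
    return edr, joint / edr


if __name__ == "__main__":
    # 25% null effects (SNR = 0) and 75% effects with SNR = 4.
    edr, err = edr_err({0: 0.25, 4: 0.75})
    print(f"EDR = {edr:.3f}, ERR = {err:.3f}")
```

For a population with 25% null effects and 75% effects at SNR=4, this gives an EDR of roughly 0.75 and an ERR of roughly 0.96. These are the true values under the stated assumptions, not z-curve estimates.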

The zcurve package also provides an estimate of “Soric’s FDR” but that is just a simple (monotone) transformation of the EDR.
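As a sketch of what that transformation looks like: Soric's bound on the false discovery rate is usually written as (1/EDR − 1) × α/(1 − α). The function below assumes that form with α=0.05; it is an illustration with a made-up name, not code taken from the zcurve package.

```python
def soric_fdr(edr, alpha=0.05):
    """Soric's upper bound on the false discovery rate, given the EDR.

    Assumed form: (1/EDR - 1) * alpha / (1 - alpha).
    """
    if not 0.0 < edr <= 1.0:
        raise ValueError("edr must be in (0, 1]")
    return (1.0 / edr - 1.0) * alpha / (1.0 - alpha)


if __name__ == "__main__":
    for edr in (0.05, 0.25, 0.50, 0.75):
        print(f"EDR = {edr:.2f} -> Soric FDR bound = {soric_fdr(edr):.3f}")
```

The bound decreases as the EDR increases, and it reaches 1 when the EDR equals α. Any noise in the EDR estimate carries over directly to this quantity.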

It should be clear that z-curve’s estimate of P(SNR=0) (i.e. the proportion of “null effects”) will be especially noisy because studies with SNR=0 contribute relatively little to the significant z-statistics. Consequently, the estimate of the EDR will be very noisy too. To quantify this uncertainty, the authors use the bootstrap. By default, the zcurve function provides “robust” intervals by adding 5 percentage points to the confidence interval of the EDR and 3 percentage points to the confidence interval of the ERR. This approach is “verified” by a few simulations. Unfortunately, even the adjusted intervals do not provide correct coverage.

To illustrate the problem, I’ve done a small simulation. I generate samples of size n=100 from the two-component mixture 0.25×N(0,1) + 0.75×N(4,1). In 40 out of 100 simulations, the null component is missed entirely. In other words, P(SNR=0) is estimated to be zero. The problem is easy to see from a typical example (see the figure below). The null component is essentially “invisible” from the observations that exceed 1.96.
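The following sketch shows why the null component is so hard to see. It draws samples from the same mixture and counts how many significant z-statistics come from each component. It does not run z-curve and does not reproduce the simulation reported here; the number of repetitions, the seed, and the function names are arbitrary choices for illustration.

```python
import random

Z_CRIT = 1.96


def simulate_once(rng, n=100, p_null=0.25, snr_alt=4.0):
    """Draw n z-statistics from p_null*N(0,1) + (1-p_null)*N(snr_alt,1).

    Returns (number of significant z from the null component,
             number of significant z from the non-null component).
    """
    sig_null = 0
    sig_alt = 0
    for _ in range(n):
        is_null = rng.random() < p_null
        z = rng.gauss(0.0 if is_null else snr_alt, 1.0)
        if abs(z) > Z_CRIT:
            if is_null:
                sig_null += 1
            else:
                sig_alt += 1
    return sig_null, sig_alt


def summarize(n_sims=2000, seed=1):
    """Average the counts over n_sims samples of size 100."""
    rng = random.Random(seed)
    results = [simulate_once(rng) for _ in range(n_sims)]
    mean_null = sum(r[0] for r in results) / n_sims
    mean_alt = sum(r[1] for r in results) / n_sims
    frac_no_null = sum(1 for r in results if r[0] == 0) / n_sims
    return mean_null, mean_alt, frac_no_null


if __name__ == "__main__":
    mean_null, mean_alt, frac_no_null = summarize()
    print(f"significant z from null component, on average:     {mean_null:.2f}")
    print(f"significant z from non-null component, on average: {mean_alt:.2f}")
    print(f"share of samples with no significant null z:       {frac_no_null:.2f}")
```

In expectation, only about 1.25 of the 100 studies are both null and significant (100 × 0.25 × 0.05), against roughly 73 significant studies from the N(4,1) component. The probability that a sample contains no significant null study at all is 0.9875^100, or about 0.28.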

The consequence is that across 100 simulations, the coverage of the 95% “robust” confidence intervals is incorrect. In particular,

The coverage of the EDR is 65% (CI: 55%-74%).
The coverage of the ERR is 100% (CI: 96%-100%).
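The intervals in parentheses express the Monte Carlo uncertainty that comes from using only 100 simulations. The post does not say how they were computed. As an assumption, the sketch below uses exact (Clopper-Pearson) binomial intervals, found by bisection on the binomial distribution function.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _solve(f, target, lo=0.0, hi=1.0, iters=60):
    """Bisection for f(p) = target, where f is decreasing in p on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def clopper_pearson(k, n, level=0.95):
    """Exact binomial confidence interval for k successes out of n trials."""
    alpha = 1.0 - level
    # Lower limit: p with P(X >= k | p) = alpha/2,
    # which is the same as P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else _solve(lambda p: binom_cdf(k - 1, n, p), 1.0 - alpha / 2)
    # Upper limit: p with P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else _solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


if __name__ == "__main__":
    for covered in (65, 100):
        lo, hi = clopper_pearson(covered, 100)
        print(f"{covered}/100 covered: interval {lo:.3f} to {hi:.3f}")
```

With 100 out of 100 intervals covering, the lower limit is 0.025^(1/100), about 0.964. Neither interval contains the nominal 95%: the EDR interval lies entirely below it and the ERR interval entirely above it.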

I shared my concerns with the authors Ulrich Schimmack, Jerry Brunner and Frantisek Bartos. Bartos responded that he generally agrees with the simulation, but notes that the coverage does come close to nominal when the sample size is increased from n=100 to n=1000. I responded that the zcurve function accepts as few as 10 significant z-statistics, and that most meta-analyses don’t have 1000 studies. Bartos wrote:

To be fair, I agree that we should’ve been explicit about the recommended sample size in the original article (and probably add a warning to the method if used with less than XXX estimates). I didn’t anticipate that people would apply z-curve to small meta-analyses. In my mind, the purpose of the tool (including our examples) is larger-scale meta-epidemiological projects.

Bartos also noted:

With respect to the simulations – although apparently imperfect – I still think that we did actually a much better job than most published methods. (…) The commonly used alternatives for the same purpose at the time were p-curve (for ERR and EDR) and Jager and Leek’s mixture model (for FDR) which both have much worse properties in my opinion. As such, I view this development as a step forward.

In my opinion, statistical methods should be reliable when their assumptions are met. I don’t think unreliable methods should be used because no better methods are available.

My general advice if you’re stuck on a problem understanding a model you’ve fit to data

Andrew — Sat, 10 Jan 2026 14:30:00 +0000

Someone wrote in with a complicated question about some model he’d fit: he was trying to test for endogeneity in the residuals, and a bunch of other things. It was a short email, just two paragraphs long, but with lots of details, and then he asked me a bunch of questions of what to do.

I replied:

Hi, sorry, I don’t know. I will just give you the general advice to write a small computer program to simulate fake data from your model and then you can apply your statistical procedures to your simulated data and repeat the process to understand the statistical properties of your methods. Best of luck.

A few days later he replied:

Many thanks for your advice, it was very helpful.

That’s satisfying. I’m posting this, partly because it’s a pleasant story and partly out of the possibility that others out there with difficult questions will see this helpful (I hope) advice.
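For readers who have not done this before, here is a minimal sketch of the fake-data recipe. The correspondent's actual model is not described, so this uses a stand-in: a simple linear regression with made-up parameter values. The steps are the ones in the advice above: simulate data from the model, apply the procedure, and repeat to see how the procedure behaves.

```python
import math
import random


def simulate_data(rng, n, a, b, sigma):
    """Fake data from the assumed model y = a + b*x + Normal(0, sigma) noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def fit_slope(xs, ys):
    """Least-squares slope and its estimated standard error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def coverage(n_sims=2000, n=50, a=1.0, b=0.5, sigma=2.0, seed=1):
    """Fraction of simulations whose nominal 95% interval contains the true slope."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        xs, ys = simulate_data(rng, n, a, b, sigma)
        slope, se = fit_slope(xs, ys)
        if abs(slope - b) <= 1.96 * se:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print(f"coverage of nominal 95% interval for the slope: {coverage():.3f}")
```

Because the true slope is known, you can check directly whether the interval covers it about as often as advertised. To study a different procedure, swap in your own model for simulate_data and your own method for fit_slope.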

This one’s for the blimp

Andrew — Sat, 10 Jan 2026 01:34:16 +0000

Marty Supreme was excellent. Similar to Good Time, including in its intense throbbing soundtrack and bright lighting. I’d say that Good Time was more of a tour de force, but Marty Supreme was just as good in its own way. The only thing I didn’t get was why the soundtrack was loaded with songs from the 80s. Was this supposed to represent that the events were being interpreted from the perspective of their child, looking back from that later decade? Also, near the very beginning there was a jarring anachronism when Marty describes something as being in somebody’s DNA. They wouldn’t have said that in 1952. At first I thought this was just a slip-up in the script that didn’t get caught by anyone, but now I’m wondering if it was on purpose, a hint that the entire movie is an imaginative reconstruction from thirty years later.

8 arguments against polling (some are good arguments, some are bad)

Andrew — Fri, 09 Jan 2026 14:34:51 +0000

1. Background

A few months ago we had a post explaining what pre-election polls can and cannot do. I’ll repeat the first part because I think it’s important, and it’s a story that can confuse people:

I think there’s too much political polling. If you want to forecast the election, you get most of the way using the economic and political “fundamentals.” Polls, when analyzed carefully, provide some additional information, but not enough to justify the saturation coverage they receive in the news media and on social media. Even the most dramatic polls, when interpreted in a careful Bayesian way, don’t tell us much.

Despite that, I think I understand why there’s so much reporting of polls on news and social media. Survey organizations make money doing polls for commercial sponsors, and if you’re doing a national or state poll anyway, you might as well throw in some political questions at the beginning so you can get some press for your org. Also, when the election is coming up, a lot of avid newsreaders want the latest information, so news media will commission their own polls to get some traffic. It all makes sense. But, in a world with a zillion polls, you don’t get much more from the zillion-and-first poll; indeed you don’t get much from the next zillion polls; indeed you don’t get much from the first zillion polls.

From our 1993 paper, Why are American Presidential election campaign polls so variable when votes are so predictable? (incidentally, polls aren’t so variable anymore, but that’s another story), here’s a quick summary:

Our claim is not that fundamentals-based forecasts will always be within 0.3% of the national vote—for one thing, there are many different fundamentals-based forecasts out there; for another, they were off by a couple percentage points in 2000—but rather that, as I said above, they get you most of the way there. I think that a world in which the news media focused on fundamentals-based forecasts and then reported on the occasional poll (recognizing the general level of nonsampling error) would be better than the pre-election world we have now. It wouldn’t change who wins the election; it would just provide a saner basis for reporting during the campaign period.

Poll aggregation got huge in 2008, and then it got overhyped, and then this has led to annoyance at polls and their aggregators. In 2024 there was annoyance that forecasts were so close to 50-50, and then because the point forecasts were off.

In the run-up to the election, forecasters were very open about their uncertainty. For example, Elliott Morris of 538 wrote, “Trump and Harris are both a normal polling error away from a blowout. The race is uncertain, but that doesn’t mean the outcome will be close,” and Nate Silver wrote, “One thing that might be counterintuitive is that even a normal-sized polling error — polls are typically off by around 3 points in one direction or the other — could lead to one candidate sweeping all 7 key battleground states. . . . the baseline assumption of the Silver Bulletin model is that while the polls could be wrong again — and in fact, they probably will be wrong to some degree — it’s extremely hard to predict the direction of the error”–but people didn’t always want to hear this.
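The point about a normal-sized polling error producing a sweep can be illustrated with a toy simulation. Everything here is a hypothetical setup, not any forecaster's model: all 7 states are polled as exactly tied, there is a shared polling error with standard deviation 3 points, and each state has its own additional error with standard deviation 2 points. For comparison, the second calculation gives each state an independent error with the same total variance.

```python
import math
import random


def sweep_probability(shared_sd, state_sd, n_states=7, n_sims=20000, seed=1):
    """Estimate P(one candidate wins every state) when all states poll as tied.

    Each state's actual margin = shared polling error + state-specific error.
    """
    rng = random.Random(seed)
    sweeps = 0
    for _ in range(n_sims):
        shared = rng.gauss(0.0, shared_sd) if shared_sd > 0 else 0.0
        margins = [shared + rng.gauss(0.0, state_sd) for _ in range(n_states)]
        if all(m > 0 for m in margins) or all(m < 0 for m in margins):
            sweeps += 1
    return sweeps / n_sims


if __name__ == "__main__":
    correlated = sweep_probability(shared_sd=3.0, state_sd=2.0)
    # Same total variance per state (9 + 4 = 13), but no shared component.
    independent = sweep_probability(shared_sd=0.0, state_sd=math.sqrt(13.0))
    print(f"P(sweep), shared polling error:     {correlated:.2f}")
    print(f"P(sweep), independent state errors: {independent:.2f}")
```

With independent errors, a sweep by either candidate has probability 2 × 0.5^7, under 2%. With a shared error of this size, the same calculation gives something on the order of 40%. A forecast near 50-50 and a substantial chance of a sweep are entirely compatible.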

2. Update

I’m writing this current post in response to a recent post by economist Dan Davies, who thinks opinion polls are really bad: he likens them to dubious financial products, states that “it might be the case that this means that survey research is no longer a viable way to find things out,” and recommends “prohibition” (but “not legal prohibition”) of polls. He also wrote, “I really don’t see how you could see this as communicating that a Trump landslide was a significant probability.” But Trump did not win in a landslide. He won by 2% of the vote. When Reagan won 59% of the two-party vote in 1984, that was a landslide. When Obama won 54% in 2008, that was a decisive victory, but not a landslide. Trump won less than 51% of the two-party vote. It was a close election, and that’s why the Economist, Fivethirtyeight, and Nate Silver repeatedly emphasized that the election could go either way.

OK, that’s fine. People make mistakes. Davies is an expert on finance and I’m an expert on polling. I can have strong opinions on finance (for example, “Bitcoin is a scam”) and I might even be right, but mostly I’m outsourcing my views on such topics to third parties. Similarly, Davies makes strong statements about polling, but he realizes these are just his opinions.

What was more interesting to me was the discussion in the comments to that blog post. When I came across the post, I went to the trouble of responding to many of the comments, and I was struck by how much fury there was at the polls. Lots of people seemed to believe that polls were useless, or destructive, or both, and there were various uninformed sideswipes (for example, Davies referring to “efforts to reweight nonrandom samples by subjective brute force” or a commenter suggesting that pollsters “thought that people were like lab mice”) which indicated a general animus toward survey research. Which, yeah, I kind of understand–see my comment at the very top of this post that I think there’s too much political polling, and I’ve been saying that for a long time.

3. Eight different arguments, all tangled up

I feel like these commenters who are so mad at polls have several different arguments which get mixed up:

1. Forecasters (including us) communicated uncertainty poorly.

2. Forecasters (including us) did this on purpose because we benefited from people thinking our forecasts are more certain than they are.

3. Forecasting and poll aggregation have been oversold.

4. There’s too much polling.

5. If a forecaster gives 50/50 odds, that’s equivalent to giving up.

6. For the purpose of national election forecasting, a forecast that can be off by 2 percentage points is useless.

7. For the purpose of national election forecasting, a forecast that can be off by 2 percentage points is worse than useless if users interpret it deterministically.

8. All of polling is useless because response rates are low and it’s not random sampling.

Some of these arguments make sense and some don’t:

1. I hate to admit it, but maybe they’re right on this, that we should’ve done more to emphasize uncertainty. But see item 7 below.

2. That’s ridiculous in light of the many open statements that we made, emphasizing uncertainty. We certainly weren’t hiding it!

3. Agreed. After 2008 in particular, poll-based forecasting got too good a reputation because nonsampling errors happened to be close to zero that year. None of us tried to oversell what we were doing, but the news media and the public got sucked in by the hype.

4. Agreed. I’ve written this in public many times.

5. False. Not all elections are close. It is informative to learn that an election could go either way.

6. False. Not all national elections are so close (recall 2008), also it’s informative that the election could likely be close.

7. This is possible . . . but I actually don’t think that many readers interpreted the forecasts deterministically! What I see is a lot of commenters saying that other people interpreted the forecasts deterministically. It seems to me that news coverage was pretty clear that the election was a toss-up and that either candidate could sweep the swing states. So I kinda feel that lots of the discourse about polls being wrong etc. is meta. For example, did Dan Davies think, a week before the election, that there was no plausible chance that Harris could lose all the swing states? I’m guessing no, he’s just concerned that other people could make that mistake. But I don’t recall seeing any pundits–or any normies–making that mistake at the time. I guess there must be some cases, and there are some particular cases where people overreacted to polls (as in the pre-election Iowa poll), but in those cases the forecasts were actually a voice of reason.

8. Again, being off by a couple percentage points is not bad. Also, polls have never been random samples, and polling accuracy is as good now as it was decades ago when response rates were higher. To me, saying that polling is useless (or, as some commenters said, that it should be “prohibited” or “banned”) makes about as much sense as shutting down the Bureau of Labor Statistics because their measurements are imperfect and their estimates need to be adjusted, or shutting down the National Weather Service because somebody somewhere might not take an umbrella to work one day when the forecast probability of rain is only 46%.

Again, all this is tricky. It’s easy for me to write about polling uncertainty because it’s a problem I’ve been thinking about for a long time. Ask me to write about finance, and all I can offer you is at worst my hot take and at best some social science analogies.

4. Summary

If you frame this as a debate between one side (Elliott Morris, Nate Silver, me, etc.) who are pro-polling and another side who are polling skeptics, then it makes sense to take the side of the polling skeptics. After all, polls aren’t all that. Poll aggregation has been hyped like evolutionary psychology has been hyped, like bitcoin has been hyped, etc., and it’s natural to want to take the under on polling at this point.

But when you get to the specifics, it’s another story. Items 3 and 4, and arguably item 1, above are legitimate, serious criticisms. The others, not so much. It should be possible to be bothered by hyping of polls, to think they’ve been oversold and even to think they have been a net minus for modern society, without grabbing on to flat-out false arguments such as 2, 5, and 6, and without being so sure about questionable arguments such as 7 and 8.

To put it another way, instead of throwing out 8 anti-poll arguments and then saying how polling is so horrible, how about starting with saying that polls make you uncomfortable, you think they’ve been overhyped, and here are some concerns. Get your position out there first, and then you can evaluate each argument in turn without feeling the need to build a case.

Jessica Hullman’s thoughts

I sent the above to my computer science colleague Jessica Hullman, who added:

Fwiw, as someone who has watched the evolution of uncertainty communication carefully over the last few election cycles, I would say there has been marked improvement between 2016 and 2024 on how forecasters communicated uncertainty. Not to mention, anyone who was old enough to be shocked in 2016 had that experience behind them, and inevitably brought it to the following couple elections. People hate being surprised when it goes against their preferences. So while there may be more room to improve uncertainty communication further, I also think it’s a much smaller gap than it used to be, and much of the audience who was very surprised in 2016 was less naive this time.

And so, I agree with your response to point 7, that these concerns often come up about other people taking polls too seriously, but it’s hard to imagine that those complaining so loudly now were truly that shocked.

Overall I don’t think that the fact that people can misinterpret uncertainty easily is a good reason to withhold information. Lots of people are interested in learning from the forecasts and all the info about the process that comes with them. People who don’t find them helpful can ignore them.

Beyond the knee jerk reaction to suppress uncertainty, I wonder how ambiguity about the prediction target contributes to people getting confused, or people believing that everyone else is confused – i.e., are we predicting what will happen today if the election were held, or what will happen the day it’s actually held? How differently should the reader interpret the forecast in each case? Maybe there should be more clarity around how to think about the target of the forecast and what uncertainty gets baked in because of it.

I like your point that it’s hard to separate this from arguing that the BLS should stop producing estimates, because they aren’t perfect.

People vehemently turning on polling and forecasting does not make a lot of sense to me. But it does seem to be a thing.

The soft bigotry of low expectations

Andrew — Fri, 09 Jan 2026 00:59:47 +0000

The headline of this NYT op-ed says it all: “Kennedy Is Telling Americans How to Eat. It’s Not Crazy Advice.”

That’s funny; the news article about the guidelines says that they “flip the food pyramid on its head, putting steak, cheese and whole milk near the top.” That doesn’t sound like such good advice!

If you actually read the linked op-ed (I don’t recommend you do), you’ll see that it categorizes two of the items on the new dietary guidelines as “good,” one as “totally fine” (meaning that there’s no evidence for it one way or another, so she’s giving Kennedy credit for a recommendation that at least isn’t bad), two as “complex” (which is actually negative, given that she describes one of these as “this advice may be counterproductive” and the other as “unrealistic for nearly everyone”), and two as “weird” (which I guess is the author’s positive spin).

This is just pitiful. There’s an official government publication with 8 recommendations, 2 of which the op-ed writer characterizes as “good,” and her summary is “These guidelines are a very good start for telling people where to go; now the job should be helping them get there.”

This is what we’ve come to? Official guidelines are being praised for being “not crazy”? She doesn’t even make an argument that the advice is net positive.

Step back for a moment. The U.S. government has access to top nutritional experts (also to top economics professors, for that matter). If they’re giving 8 pieces of advice, these should be 8 pieces of good advice. This isn’t like baseball, where .300 is excellent and .500 is impossible.

Look, don’t get me wrong. I’m not naive here. I don’t think the government’s perfect. Experts can be wrong, also food and nutrition policy are notoriously subject to political influence: the milk lobby, the meat lobby, etc. Last I checked, we still have ethanol subsidies!

But that’s the point: if the government is giving bad advice, that’s bad! To praise them for not being uniformly crazy . . . ummmm, that’s like if your boss’s idiot nephew comes into the office to tell everyone how to do their jobs, and after he leaves, you loudly say, “Hey, this new advice is — dare I say it? — overall very sensible. Junior made a good point when he told the sales force to be more customer-focused. And when he told the engineering team to think outside the box, yeah, you have to admit he’s onto something there.”

My political take on this is that the author is a Democrat and suffers from something I’ve noticed in popular history writing, which is a form of reasoning that focuses on the mistakes on “our side” and assumes that whatever “their side” does is pre-ordained. It’s a sort of fundamental attribution error by which our decisions and mistakes are based on context and circumstance, whereas theirs are based on their unchangeable character.

That’s the soft bigotry of low expectations: setting the bar so low that being mostly “not crazy” is enough. We should be holding the government to a higher standard than that!

More and more I’m thinking that it was a national disgrace that Ted Kennedy got away with Chappaquiddick.

What is “workflow” and why is it important?

Andrew — Thu, 08 Jan 2026 14:20:44 +0000

A few years ago we decided to write a book on Bayesian workflow, and we got ready for it by writing this article, which begins as follows:

The idea was that the traditional framing of statistical inference was too narrow, in that it treated the statistical model as given, without capturing the real-world practices of building, evaluating, improving, and understanding models. Methods such as predictive checking, exploratory data analysis, and sensitivity analysis are important steps in data analysis without being part of inference. We use the term “workflow” to represent our larger process of data analysis. An extended workflow would also include pre-data design of data collection and measurement and after-inference decision making, but in our article and book on Bayesian workflow we focus on modeling existing data.

“Workflow” has different meanings in different contexts. We have been influenced by the ideas about workflow in computing that are in the air, including statistical developments such as the tidyverse which are not particularly Bayesian but have a similar feel of experiential learning. Many recent developments in machine learning have a similar plug-and-play feel: they are easy to use, easy to experiment with, and users have the healthy sense that fitting a model is a way of learning something from the data without representing a commitment to some probability model or set of statistical assumptions.

In our paper we supply some background:

We can also connect to general ideas of building, checking, and expanding statistical models, as expressed by Tukey (1977), Box (1980), and Jaynes (1983). Or, from another direction, we can look up “workflow” on wikipedia, which says:

Workflow is a generic term for orchestrated and repeatable patterns of activity, enabled by the systematic organization of resources into processes that transform materials, provide services, or process information. It can be depicted as a sequence of operations, the work of a person or group, the work of an organization of staff, or one or more simple or complex mechanisms. . . . The modern history of workflows can be traced to Frederick Taylor and Henry Gantt, although the term “workflow” was not in use as such during their lifetimes. One of the earliest instances of the term “work flow” was in a railway engineering journal from 1921.

Here’s the relevant bit from the 1921 citation:

We and other statisticians have used the term “workflow” to encompass the various steps of data processing and analysis that we use to make our work scientifically replicable, both in the direct sense of providing all necessary practical details so that the computations can be reproduced, and in the larger sense of fitting models and applying procedures that are appropriate to the data at hand and to the questions being asked.

All of this is important for (at least) three reasons:

1. Practical workflow includes all sorts of things that aren’t written down, or are scattered across various sources. It’s good to document what we are doing.

2. Once these steps are written in some sort of organized way, this can help future researchers be more systematic in their data analysis. We would not expect every workflow tool to be used in every data analysis; the point is to make these tools more available, through explanations, examples, and software.

3. Future research should benefit from a clearer exposition of present best practices. Theoretical statistics is the theory of applied statistics, and so it’s good to know what is being done, to expand the boundaries of theoretical investigation, which eventually should result in improved methods.

Also, data analysis workflow is closely related to ideas in computing workflow such as version control, testing, readability, and maintainability of code. Software is written collaboratively and to be used in a variety of contexts by a range of users; similarly, statistical methods build upon multiple contributors and, once released, get used in the wild in various unexpected ways. So there’s a connection between the workflow involved in developing a method or writing software, and the workflow of the later users of the method or software.

It’s open season on the unabashedly earnest

Jessica Hullman — Wed, 07 Jan 2026 17:20:23 +0000

This is Jessica. In response to my post on slop, Thomas Basbøll shared a 1967 New Yorker essay by Jacob Brackman about the havoc wreaked by the emergence of the “Put-On” in 1960s (and slightly earlier) art and culture. True to its name, the “Put-On” refers to a response that is deliberately outlandish yet ambiguous about intention, confusing the other party and causing them to doubt its sincerity.

The put-on is perhaps best exemplified by Bob Dylan’s smart alecky style of responding to interviewers, in which he alternates between crazy stories about his past, exasperation with the counter-culture of which he’s part, and pointed questions turned back on the interviewer, who is left to wonder: How much of this is real? Is he caricaturing himself, or is this actually his personality? But the put-on also appears in art and culture more broadly – e.g., Is John Cage making an important statement or just putting the audience on with these silence performances? Is Andy Warhol out to make fools of his critics with the Brillo boxes? The put-on is unsettling because you cannot resolve whether meaning is intended or still to come, or you are just wasting your time: “put-ons may disguise the fact that someone has nothing of interest to say—may, indeed, give precisely the opposite impression.”

Today the put-on takes different forms – video shorts of animals doing things that are just beyond the boundary of what seems plausible, enough so that we need to watch a second time to figure out if it’s real. Essays or presentations by our students that exude a little too much confidence given their lack of experience with the topic, but which they deny using generative AI to write. There has always been plenty of bullshit on the internet, and plenty of cheating in classes, but Brackman’s stages of the put-on are especially familiar lately:

You’re sucked in.
You become confused.
You resent (or appreciate) having been tricked.

Patience games

The problem with the put-on–whether orchestrated by musicians or artists in the 60s or today’s language models and image generators–is that the ambiguity is strategic. You don’t know if it is going somewhere. You’re stuck sitting with your uncertainty, reflecting on how far your good naturedness extends.

In teaching, when you think you’re facing undisclosed (over)reliance on generative AI for an assignment, do you take the sincere path of asking the student what they did, and trusting their response? Do you try to catch them in a lie? Or do you decide it’s just not worth your time to sleuth and let the students decide for themselves if they will use the course to learn something versus play the game?

We find ourselves facing games we may not want to play, and for which we have no precedent. This year ICML, one of the big machine learning conferences, is offering authors a choice: opt in to a permissive policy about generative AI use in reviewing, or go the purist route, where your reviewers can’t use it at all, and you can’t use it at all for your own reviews either. The reviewer matching process sounds like it could get messy, and ML conferences are already known for their review randomness. Which option is likely to be less noisy?

Not to mention that as a reviewer, one must increasingly wonder whether the paper they are preparing their comments on is an experiment in automated science. Will the authors even read your feedback? Do they care to improve the work? Or have you been inadvertently reduced to a Turing signal?

It’s not easy for the “unabashedly earnest”, who dislike playing games and want to retain a certain innocence in their encounters with others, but who also want to stay ahead of the curve and not get duped. The put-on depends on the gullibility of its victim, so you face a choice of being ok with continuing as usual but feeling used at times, or becoming more skeptical about people in general. Please don’t make me part of your game is becoming the refrain for a new way of life.

There’s little reason to think it will get better anytime soon. It’s still early and many people are still playing the old game, or still experimenting with how much they can get generative AI to do. We should be preparing for more disruption.

From the sacred to the ~~profane~~ probabilistic

I find myself thinking about what kinds of signals I consider more sacred, i.e., that I would most dread seeing lose their meaning. For example, what do you do about undisclosed use of generative AI in close relationships? What if you suspect the friend or romantic partner you are corresponding with is relying on the AI suggested responses to do the thinking? Do you ask them about it, or let it go and risk the uncertainty undermining your ability to trust them?

I would also distinguish feedback on writing that is more personal. I don’t mind an AI-generated review on my research if it’s guided by a human with the right expertise. But if, for example, I were to learn that comments on my posts here were AI-generated, comments that I took seriously as a reflection of engagement with what I wrote or that just gave me a rewarding feeling of connecting with people outside my usual sphere (which blogging is great for), I would feel dumb, and it would probably affect my desire to blog. But this is already happening on social media, with bot accounts jumping in with random, effusive compliments on what you write.

Another scenario that makes me cringe is the application of generative AI to the kinds of art and literature that I get inspiration from. I can potentially enjoy some AI-generated music or script folded into the mundane background track or sitcom if it’s decent, but I look to art museums for a kind of consolation on what it means to be human, to be vulnerable, to feel forms of loss on a deep level. I don’t doubt that generative AI could occasionally result in experiences that would be hard for me to distinguish from human contemporary art. But I can’t imagine myself ever getting interested in art created by AI the way I’m interested in what other people make, because of the lack of specificity or intention. So if it were to infiltrate that realm, and I could no longer count on there being a human lived experience behind art, it would bother me.

One thing I feel relatively sure of is that I won’t be wanting an AI guru. I wouldn’t be surprised if generative AI could do a pretty good job of mimicking the kind of capriciousness associated with spiritual guides like Zen masters. But similar to art, there’s something important about the person having experiences in the world that feels essential.

I would be curious though to hear counterarguments from people who have thought about AI in art or religion or more intimate personal communication. Part of what I find difficult in all this is that I consider myself generally optimistic about new technology, and open to change from it (I am, after all, a computer scientist). So I would also hate to prematurely “close my ears” like a square in the 50s or 60s walking out on Cage’s experiments in sound. And so I expect my patience to remain unstable, and it to remain hard for me to predict what experiences will give me the urge to ditch versus hit rewind.

Institutional unraveling

Returning to the general theme of new decision points as signals erode in value, things are likely to get worse before they get better. Many of our systems are still mostly functioning at this point, because many people are still figuring out how to use generative AI, and where to draw their own line, or they are avoiding it completely. But the seeds for institutional breakdown are all around us.

According to Zeynep Tufekci’s recent keynote at NeurIPS (which I summarize here), the problem is that society is built on assumptions that certain things will be hard (or “load bearing frictions”), i.e., that only humans can generate outputs with certain properties. LLMs break our ability to conclude there is proof of effort, or of authenticity or sincerity. Gatekeeping is a necessary function, and when the old mechanisms stop working, other measures will step in, like relying on the prestige of the candidate’s institution or their connections to decide who to hire, or what papers to cite or publish. When those things are no longer hard, some mechanism must step in to take their place, and it may not be ideal. The point being if you break something important, you don’t necessarily get something better unless you build something better.

I like how this view focuses attention on outcomes within the realm of our ability to predict, like what kinds of gatekeeping will emerge or are already emerging to fill in the holes. We can then try to identify better alternatives to those, rather than trying to predict when “AGI” will happen or what the most destructive thing AI could do is. Though it doesn’t absolve us of the very humanist discomfort of watching our precious tokens of sincerity wash away, and the personal choices that come with that.

Brackman quotes P. T. Barnum on how “People like to be fooled,” and “There’s a sucker born every minute.” While the put-on has always relied on the victim’s willingness to stay in the conversation, the answer is unlikely to be opting out of dealing with AI output entirely (though there are certainly people in that camp). Some flexibility is warranted while norms are still shifting, and organizations are doing the right thing by experimenting with new policies. But until we have better signals, the burden of the put-on stays where it was: on the person deciding in that moment whether to continue listening.

Uh oh prediction markets

Andrew — Wed, 07 Jan 2026 14:23:49 +0000

Palko points to this post from Molly White, who writes:

When billionaire Bill Ackman suggested on Twitter that Eric Adams could “place a large [Polymarket] bet on Andrew Cuomo and then announce [his] withdrawal” from the New York City mayoral race, he described something that feels profoundly illegal. A politician profiting from non-public knowledge of their own withdrawal from an election surely crosses some line — insider trading? Market manipulation? Election interference? Illegal gambling? Ackman ended his tweet: “There is no insider trading on Polymarket” — not because it doesn’t happen, but because it won’t be charged. He’s right: the Securities and Exchange Commission’s insider trading rules don’t apply here. But that leaves the question: what rules, if any, do?

As Ackman says, prediction markets fall outside the SEC’s jurisdiction, living in a different regulatory world than stock markets where executives get prosecuted for trading on non-public earnings or tipping off friends about upcoming mergers. . . .

White continues:

Prediction markets — platforms where people trade contracts that pay out based on whether specific events happen — have enjoyed a surge in popularity over the last few years as they’ve dramatically expanded their operations in the United States. While they have existed for decades, they were long confined to strictly academic exercises — operating as small-scale non-profits that carefully constrained their operations to avoid running afoul of the CFTC. . . .

In 2020, the US-based Polymarket began allowing customers to use cryptocurrency to trade events contracts, though they made no effort to certify their contracts with the CFTC. In 2021, Kalshi emerged as the first fully regulated prediction market in the US, following a hard-won CFTC approval. . . . The CFTC cracked down on prediction markets in 2022. . . . when a district court ruled in Kalshi’s favor in 2024, the company swiftly reinstated the contested markets. The regulatory landscape shifted further after Trump took office. . . .

Finally:

Though prediction markets aren’t a new phenomenon, their growing accessibility to retail traders is. . . . The CFTC has yet to bring any enforcement actions pertaining to market manipulation on events contracts, and it’s not clear they have much appetite to begin doing so.

Other industries that deal with outcome-based bets, like sports wagering, have evolved robust integrity systems both to protect consumers and to preserve trust in the games themselves. . . . Today, sports betting platforms work to screen out athletes, referees, and sports program employees to ensure they’re not betting on games they could potentially influence, and employ monitoring programs to detect suspicious bets. . . .

While Kalshi imposes strict trading restrictions on its presidential election market — barring politicians, campaign staff, pollsters, election officials, and foreign nationals — many of its other markets lack any such prohibitions. This includes election-related markets identical to the type of bet Ackman suggested Adams could place on Polymarket about his own mayoral campaign withdrawal.

Polymarket, which does not yet serve US customers, does no such screening. The platform merely asks users to self-certify they aren’t US-based, with additional basic geofencing that users regularly circumvent. . . .

White summarizes:

Without much oversight, these markets are ripe for manipulation. The gambling-like nature of many markets, combined with limited addiction prevention programs, likely puts vulnerable users at risk. And election markets create concerning new financial incentives that could further corrupt democratic processes.

This does seem like a serious concern.

At one level, this is a problem that should cure itself, in that a few high-profile cases of manipulation should be enough to drive any serious bettors out of the market, so that it just becomes something more like a game of poker than anything else, and the total dollars in the market would not be enough for anyone to make much money by throwing an election, or a sporting event. On the other hand, lots of bad things could happen in sports and politics on the way to this eventual equilibrium. Also, in the absence of serious regulation, we should not underestimate the abilities of cheaters and scammers to come up with new clever means of corruption. So, yeah, I feel like we should be worried.

Another take

Along similar lines, here’s a related post, Prediction markets and the need for “dumb money” as well as “smart money” where I discuss similar ideas that were presented by people who are more politically conservative–they’re less concerned about market manipulation and more concerned that the prediction markets will fall apart on their own.

It’s interesting to juxtapose these two different, but related, lines of criticism of prediction markets, with one set of criticisms coming from the left and the other coming from the right.

Difference from sports

The post linked just above discusses various differences between news-based prediction markets and sports betting.

Relevant to the NYC mayoral race discussion, one difference is that, in sports, throwing a game to win a bet is one of the main ways that corruption can enter the system. If there is no betting, it’s much harder to have any motivation to throw a game. In contrast, in politics there are lots of ways to benefit from dropping out, especially in today’s political environment where the national government is pretty openly dropping prosecutions of political allies or threatening political opponents. As we discussed in this game-theory-heavy post, there are lots of ways that Adams could corruptly benefit from strategically dropping out, and indeed these are methods that could be both easier and more lucrative than trying to manipulate betting markets. Indeed, if the government doesn’t want to offer Adams some position for which he is unqualified, Bill Ackman himself could just hire Adams for some no-show job in his organization at a salary of a million dollars a year, no?

So, given everything that’s been talked about in the mayoral election so far, it doesn’t seem that prediction markets represent much of a change to the moral, financial, and political calculations of corrupt maneuvering in a political race. It still seems disturbing, though, in the same way that it’s disturbing when politicians decide that various sleazy activities, which in the past they would have tried to hide, are now done in the open. And also disturbing in the potential pollution of the signal offered by prediction markets themselves.

Survey Statistics: 4th helpings of the logit shift

shira — Tue, 06 Jan 2026 21:00:14 +0000

In June 2025 we discussed 2 flavors of calibration, including “the logit shift”. In August 2025 we took 2nd helpings of the logit shift, focusing on multinomial outcomes. In December 2025 we took 3rd helpings, focusing on multivariate outcomes. Maybe folks have had enough (I was the only person to comment on the 3rd helpings post). But G. Elliott Morris linked to the new Will Marble and Josh Clinton multivariate logit-shifting paper, which reminded me to think about it again.

They consider the 2022 midterms in Michigan:

y_1 = governor vote choice
y_2 = abortion proposition vote choice
x = demographics

They estimate E(y_2 | county) and compare it to known truth. They have y_1, y_2, x in a survey, x in the population, and E(y_1 | county). Will and Josh look at a few estimation methods, including “Chained Calibrated MrsP”:

Add y_1 to the population data: fit p(y_1 | x) from the survey, logit-shift to E(y_1 | county).
MRP: fit p(y_2 | y_1, x) from the survey, average over the population distribution of y_1, x
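The logit-shift part of the first step can be sketched in a few lines of R. This is only a rough illustration, not Will and Josh's code: the helper name `logit_shift`, the made-up fitted probabilities `p_hat`, and the target of 0.55 are all assumptions, and a real application would weight by population cell counts rather than take a plain mean.

```r
# Hypothetical helper (not from the paper): find a single intercept shift
# delta on the logit scale so that the shifted probabilities average to a
# known county-level target, then return the shifted probabilities.
logit_shift <- function(p, target) {
  gap <- function(delta) mean(plogis(qlogis(p) + delta)) - target
  delta <- uniroot(gap, interval = c(-20, 20), tol = 1e-10)$root
  plogis(qlogis(p) + delta)
}

set.seed(1)
p_hat <- runif(1000, 0.2, 0.6)   # made-up fitted p(y_1 | x) for one county
shifted <- logit_shift(p_hat, target = 0.55)   # 0.55 stands in for E(y_1 | county)
mean(shifted)                    # should match the target up to solver tolerance
```

The shift is the same for everyone in the county on the logit scale, so it moves the county average to the target without reordering individuals.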

Kuriwaki et al. 2024 do this, as we discussed in June 2025. Our mystery from December 2025 was why this method seems to do poorly in Will and Josh’s paper:

Will suspected it was due to Jensen’s inequality, so I wrote a silly simulation to try to understand that.

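set.seed(2026)   # added so the plot is reproducible; the seed value is arbitrary

n_county <- 50   # number of counties
n_per <- 500     # simulated respondents per county
b <- 8           # coefficient on y1, on the logit scale
a_county <- rnorm(n_county, -3, 2.5)   # county intercepts, on the logit scale
p <- runif(n_county, 0.3, 0.7)         # county-level Pr(y1 = 1)
county <- rep(1:n_county, each = n_per)
y1 <- rbinom(n_county * n_per, 1, p[county])

# Truth: average the inverse logit over individuals within each county
true_E <- tapply(seq_along(y1), county, function(ix)
  mean(plogis(a_county[county[ix[1]]] + b * y1[ix])))

# Shortcut that ignores Jensen's inequality: inverse logit of the county mean
jensen_ignored <- plogis(a_county + b * tapply(y1, county, mean))

plot(true_E, jensen_ignored,
     xlab = "True E[logit^-1(a + b*y1) | county]",
     ylab = "logit^-1(a + b*E[y1|county])",
     pch = 16, xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1, col = "red", lwd = 2)   # points off this line show the Jensen gap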
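To spell out what the simulation compares: because y1 is binary, both quantities have closed forms within a county. The notation below is mine, not from the paper, with \bar y_c denoting the sample mean of y1 in county c:

```latex
\begin{align*}
\text{true}_c &= \frac{1}{n}\sum_{i \in c} \mathrm{logit}^{-1}(a_c + b\,y_{1i})
  = (1-\bar y_c)\,\mathrm{logit}^{-1}(a_c) + \bar y_c\,\mathrm{logit}^{-1}(a_c + b), \\
\text{plug-in}_c &= \mathrm{logit}^{-1}(a_c + b\,\bar y_c).
\end{align*}
```

The true value is a point on the chord between logit^-1(a_c) and logit^-1(a_c + b), while the plug-in is a point on the curve itself. Since the inverse logit is convex for negative arguments and concave for positive ones, Jensen's inequality gives true ≥ plug-in when a_c + b ≤ 0 and true ≤ plug-in when a_c ≥ 0. When the interval from a_c to a_c + b straddles zero, which should be common with b = 8, the sign of the gap depends on the county. The gap vanishes when \bar y_c is 0 or 1.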

Thoughts? Have you had enough of the logit-shift?

The stories behind our published research from last year

Andrew — Tue, 06 Jan 2026 16:36:49 +0000

It’s January so time to look back on what we’ve done in the past year. I thought this time I’d give a little story of background on each of our published papers.

First, here’s the list of recently published papers:

[2026] Adaptive sequential Monte Carlo for structured cross validation in Bayesian hierarchical models. {\em Journal of Computational and Graphical Statistics}. (Geonhee Han and Andrew Gelman)
[2026] Reanalysis of “Competition and innovation: An inverted-U relationship.” {\em Journal of Robustness Reports}. (Andrew Gelman)
[2026] The ladder of abstraction in statistical graphics. {\em American Statistician}. (Andrew Gelman and Kaiser Fung)
[2026] Statistical workflow. {\em Philosophical Transactions of the Royal Society A}. (Andrew Gelman, Aki Vehtari, and Richard McElreath)
[2026] Adjusting for underreporting of child protective services involvement in the Future of Families and Child Wellbeing Study and assessing its empirical implications through illustrative analyses of young adult disconnection. {\em Social Service Review}. (Lawrence M. Berger, Tia Dickerson, Andrew Gelman, Hye-Min Jung, Seonghun Lee, Margaret Thomas, and Jane Waldfogel)
[2025] A multilevel Bayesian approach to climate-fueled migration and conflict. {\em Scientific Reports}. (Claire Palandri, Paulina Concha Larrauri, Andrew Gelman, Michael J. Puma, and Upmanu Lall)
[2025] Artificial intelligence and aesthetic judgment. {\em Sankhya}. (Jessica Hullman, Ari Holtzman, and Andrew Gelman)
[2025] Discussion of “Statistical exploration of the manifold hypothesis,” by N. Whiteley, A. Gray, and P. Rubin-Delanchy. {\em Journal of the Royal Statistical Society B}. (Andrew Gelman)
[2025] Meta-analysis with a single study. {\em Statistical Methods in Medical Research}. (Erik van Zwet, Witold Wiecek, and Andrew Gelman)
[2025] Normative scientific conflict is unavoidable and should be welcomed. {\em Theory and Society}. (Andrew Gelman)
[2025] Russian roulette: The need for stochastic potential outcomes when utilities depend on counterfactuals. {\em Biometrika}. (Andrew Gelman and Jonas Mikhaeil)
[2025] Multilevel regression and poststratification using margins of poststratifiers: Improving inference for HIV health outcomes during the COVID-19 pandemic. {\em Statistics in Medicine}. (Amy J. Pitts, Maiko Yomogida, Angela Aidala, Andrew Gelman, and Qixuan Chen)
[2025] Statistical graphics and comics: Parallel histories of visual storytelling. {\em Nightingale}. (Andrew Gelman and Susan Kruglinski)
[2025] Letter to the editor. {\em Perspectives on Psychological Science}. (Andrew Gelman)
[2025] Rethinking approaches to analysis of global randomised controlled trials. {\em British Medical Journal} {\bf 389}, r1273. (James M. Brophy and Andrew Gelman)
[2025] Simulation-based calibration checking for Bayesian computation: The choice of test quantities shapes sensitivity. {\em Bayesian Analysis} {\bf 20}, 461–488. (Martin Modrák, Angie H. Moon, Shinyoung Kim, Paul Bürkner, Niko Huurre, Kateřina Faltejsková, Andrew Gelman, and Aki Vehtari)
[2025] Visualizing distributions of covariance matrices. {\em Journal of Data Science, Statistics, and Visualisation} {\bf 5}, 7. (Tomoki Tokuda, Ben Goodrich, Iven Van Mechelen, Andrew Gelman, and Francis Tuerlinckx)
[2025] Interrogating the “cargo cult science” metaphor. {\em Theory and Society} {\bf 54}, 197–207. (Andrew Gelman and Megan Higgs)
[2025] A calibrated BISG for inferring race from surname and geolocation. {\em Journal of the Royal Statistical Society A}. (Philip Greengard and Andrew Gelman)
[2025] Hierarchical Bayesian models to mitigate systematic disparities in prediction with proxy outcomes. {\em Journal of the Royal Statistical Society A}. (Jonas Mikhaeil, Andrew Gelman, and Philip Greengard)
[2025] The piranha problem: Large effects swimming in a small pond. {\em Notices of the American Mathematical Society} {\bf 72}, 15–25. (Christopher Tosh, Philip Greengard, Ben Goodrich, Andrew Gelman, and Daniel Hsu)
[2025] For how many iterations should we run Markov chain Monte Carlo? In {\em Handbook of Markov Chain Monte Carlo}, second edition. (Charles C. Margossian and Andrew Gelman)

Also we completed some new work that’s not yet been published:

Power analysis is essential: High-powered tests suggest minimal to no effect of rounded shapes on click-through rates. (Ron Kohavi, Jakub Linowski, Lukas Vermeer, Fabrice Boisseranc, Joachim Furuseth, Andrew Gelman, Guido Imbens, and Ravikiran Rajagopal)
Efficient scenario analysis in real-time Bayesian election forecasting via sequential meta-posterior sampling. (Geonhee Han, Andrew Gelman, and Aki Vehtari)
Continuous adaptive path sampling for efficient multimodal sampling and marginalization. (Yuling Yao, Collin Cademartori, Aki Vehtari, and Andrew Gelman)
Conformal prediction and human decision making. (Jessica Hullman, Yifan Wu, Dawei Xie, Ziyang Guo, and Andrew Gelman)
When fiction is presented as real: The case of the burly boatmen. (Andrew Gelman)

We have a lot on deck for 2026, including two new books (Bayesian Workflow and the second edition of the edited Handbook of Markov Chain Monte Carlo) and a bunch of research articles on different topics in statistical modeling, causal inference, and social science.

And you can expect another 600 or so blog posts.

The stories behind the papers

It’s hard for me to pick my favorites among all the recently published papers, so let me just say something about each of them, in the same order they were listed above (roughly inverse chronological order of publication):

Adaptive sequential Monte Carlo for structured cross validation in Bayesian hierarchical models: GH took a couple of my classes and had ideas for a couple of papers, including this one. This was his idea; I just helped a small amount.
Reanalysis of “Competition and innovation: An inverted-U relationship”: This was originally a blog post. The editor of the Journal of Robustness Reports asked me to submit it to them. It took a couple rounds–the reviewers made some good points!–and a fun thing about this journal is that you can go to the link and see the entire review process.
The ladder of abstraction in statistical graphics: I absolutely love this paper. It originated in a talk I gave to Ron Yurko’s statistical graphics class at CMU. I sent it to the journal and they had some good suggestions for improvement that my friend and colleague Kaiser Fung was able to do.
Statistical workflow: As many of youall know, we’ve been writing a book on Bayesian Workflow–it will appear very soon! I felt that the workflow concept would be useful in non-Bayesian statistics too, so my colleagues and I organized a special issue of a journal, where we solicited a bunch of articles from theoretical and applied researchers, mostly not Bayesian, to get different perspectives on workflow. The journal issue is looking good–I guess it will be out soon–and we wrote this short article to lead off that issue. It’s a short paper and I recommend you take a look!
Adjusting for underreporting of child protective services involvement in the Future of Families and Child Wellbeing Study and assessing its empirical implications through illustrative analyses of young adult disconnection: OK, I don’t have much to say about this one. It’s by my colleagues at the school of social work at Columbia; I was involved in the survey weighting for the study.
A multilevel Bayesian approach to climate-fueled migration and conflict: Hey, I don’t remember much about this at all! But, yeah, multilevel modeling, I guess I did something useful here!
Artificial intelligence and aesthetic judgment: This one’s mostly by Jessica and Ari, but I made some contributions throughout, which might be recognized from earlier appearances of some of these ideas on the blog. It’s published in Sankhya because I think they asked me to submit something for a special issue, and we had this cool paper that we couldn’t figure out what to do with.
Discussion of “Statistical exploration of the manifold hypothesis”: This journal sometimes runs papers with discussions (they did a couple of mine in the past decade), and sometimes I contribute something. Here I saw a good opportunity to remind people of my thoughts on Tibshirani’s “bet on sparsity” principle and where it can go wrong.
Meta-analysis with a single study: What can I say? This paper has an awesome title. Erik, Witold, and I have been meeting weekly and will be coming out with more articles soon on science and meta-science.
Normative scientific conflict is unavoidable and should be welcomed: I can’t remember how, but I came across an announcement of a special issue of the journal Theory and Society on the topic of normative scientific conflict. I had some things to say on the topic, and this seemed like a good outlet. I like this paper! You should read it.
Russian roulette: The need for stochastic potential outcomes when utilities depend on counterfactuals: This paper has a funny story behind it. I was contacted by economist Amanda Kowalski about a paper she and her colleagues had written about causal inference. That paper got me thinking about stochastic potential outcomes and asymmetric utility functions, and I had this idea of demonstrating these ideas in a simple example of Russian roulette. Jonas joined as a collaborator and clarified a bunch of issues that I’d been sloppy with. We asked Amanda if she wanted to join in, but she was too busy on her own stuff. Anyway, the final paper is cool–it’s really clean, and it’s timely because lots of people are interested in going beyond the stable unit treatment value assumption.
Multilevel regression and poststratification using margins of poststratifiers: Improving inference for HIV health outcomes during the COVID-19 pandemic: Qixuan has been taking the lead on a bunch of papers we’ve been doing, generalizing MRP in various ways. I think we’re gradually moving toward a bright future of generalizing from sample to population.
Statistical graphics and comics: Parallel histories of visual storytelling: This is an idea that I’ve had for a while. I mentioned it in class offhandedly one day, and one of the students told me she was interested in the topic too, so we wrote this article. It was a true collaboration. It’s kind of a specialized topic, but I think it should have a potentially wide audience, because lots of people love comics and lots of people love statistical graphics. We focus on the fascinating question of how it is that these two modes of communication have developed only in the past few centuries, even though they could have been invented much earlier. This is a sister paper to the “ladder of abstraction” paper mentioned above.
Letter to the editor: Long story here. Back in 2017, a bigshot professor lied about me in a published article in the journal, Perspectives on Psychological Science. It was the kind of crap article that should never have been accepted, but at the time that journal was run by a corrupt cabal and they were publishing their friends’ articles essentially without peer review. At the time I complained to the journal but only got rude responses from the cabal. But things change. The journal is now run by civilized people and they published my letter. Better 8 years late than never at all. And, no, the people who wrote and published the lies never apologized. Of course not! Apologies are for losers, not for members of the prestigious National Academy of Sciences.
Rethinking approaches to analysis of global randomised controlled trials: Epidemiologist Jay Brophy wrote this one. I had some minor contribution, I can’t remember what.
Simulation-based calibration checking for Bayesian computation: The choice of test quantities shapes sensitivity: This is the latest version of a long series of papers on SBC, starting with Samantha Cook’s Ph.D. thesis, which we turned into a paper that was published twenty years earlier. I continue to be interested in the idea of accompanying inferences with simulations that check the computations.
Visualizing distributions of covariance matrices: This paper is nearly 20 years old! At the time we had difficulty getting it published and we moved on to other things. Then a couple years ago a journal asked me for an article and I sent them this one. Unfortunately it was a so-called predatory journal, and one of my coauthors didn’t want our article appearing there. Fair enough! But then we thought we might as well get it published, so we sent it off. I like the paper, and I also like that it’s on the relatively understudied topic of visualizing models (as opposed to visualizing data).
Interrogating the “cargo cult science” metaphor: This topic had been bugging me for a while, and Megan and I wrote this paper which got rejected by a couple of places. Neither of us really knows how to communicate with researchers in the field of science studies, so it was a hard paper to place, even though it makes a clean point. Then I happened to hear about the journal Theory and Society, which seemed like the perfect place. I don’t know if anyone read our article, but I’d like to think that, in the future, people will think twice before talking about cargo cult science.
A calibrated BISG for inferring race from surname and geolocation: This is Philip’s project. I did help out a bit, but I remain frustrated in that we haven’t been able to frame this in a fully Bayesian or generative way. We’re continuing to work on the problem, and we have a new method, supercaliBISG, which does even better than caliBISG, which is an improvement on BISG, which itself has the word “improved” in its title (and also calls itself Bayesian, but it’s not fully so).
Hierarchical Bayesian models to mitigate systematic disparities in prediction with proxy outcomes: I can’t remember exactly where this paper came from, but it was somehow associated with some conversations we had with Sharad Goel and others on statistical measures of disparity. As is often the case, I think much is gained by framing the problem within a generative model.
The piranha problem: Large effects swimming in a small pond: This one’s important! The basic idea–there are probabilistic or statistical constraints regarding patterns of dependence in high dimensions, and this has implications for our understanding of patterns in complex structures–was mine, but the coauthors did most of the rest, to collect some relevant mathematical results. As I like to say, I think there’s more to be said in this area, maybe some connections to random matrix theory. Also, the paper has an unusual publication story. What happened was that a student from the statistics club at San Diego State University asked me to do a remote meeting with them. I did so–it was a fun conversation–and it turned out that their faculty adviser, Richard Levine, was editor of the Notices of the American Mathematical Society, and was looking for general-interest math papers with applied or statistical relevance. So I sent him the piranha paper. Articles in this journal have a strict limit of no more than 10 pages and no more than 20 references. It was hard for me to keep the references under 20 while demonstrating the applied relevance of the topic, so I cheated and wrote a blog post entitled, “Here are just some of the factors that have been published in the social priming and related literatures as having large effects on behavior,” so that just counted as 1 reference in our paper. Kind of like if the genie gives you 3 wishes and you spend one of them on more wishes.
For how many iterations should we run Markov chain Monte Carlo?: This is an update of my paper with Kenny Shirley for the new edition of the MCMC handbook. Charles took the lead on this chapter.

Last post on the estimated effects of Mississippi school reforms

Andrew — Mon, 05 Jan 2026 14:49:19 +0000

For background:

– How much of “Mississippi’s education miracle” is an artifact of selection bias?

– When the numbers don’t look right, check them! (Mississippi education update)

– More on school reform, this time New Orleans

And now one more, from Noah Spencer, who writes:

I did have a good back-and-forth with Wainer et al., but remain unconvinced by their main critique.

– I [Spencer] address the authors’ main critique – that truncation due to retention mechanically explains the observed effects – in Section 7.2 of my paper. Basically, students who are retained in grade 3 do not just stay there forever. The typical student is retained for one year and then proceeds to grade 4, where they can write the NAEP. Based on the timing of the policy, it just would not have been the case that any NAEP-taking cohort would be artificially missing a mass of weaker students.
– “One hypothesis is that the NAEP test score gains are a mechanical consequence of weaker 3rd-grade students not making it to fourth grade to write the NAEP test. Given the timing of the retention policy however, this purely mechanical explanation does not make sense. The first cohort eligible for retention under the LBPA was the 2014-2015 grade 3 cohort. Thus, the 2014-2015 grade 4 NAEP test-takers were not exposed to the new retention policy. It is true that the 2016-2017 grade 4 NAEP test-taking cohort would not have included students who were retained in grade 3 after the 2015-2016 school year (who would have been in grade 4 in 2016-2017 absent the LBPA). However, the 2016-2017 test-taking cohort would have included students who were retained in grade 3 after the 2014-2015 school year (assuming they were not retained again in 2015-2016).

Thus, the mass of weaker students taking the NAEP would not be eliminated due to the LBPA, but rather replaced by a mass of previously under-achieving students who had been retained and had now passed the necessary grade 3 reading assessment.

Similar logic follows for the 2018-2019 test-taking cohort.”
– Minor note: Being retained multiple times in grade 3 is rare in Mississippi.

– I also test in my paper whether the LBPA changed the composition of NAEP-takers beyond the above truncation concern (see Table B3). I do not find statistically significant effects on the percent of NAEP takers who: are White, are male, are English language learners, have a disability, or have a computer at home.

– The question of whether retention was the key mechanism through which the LBPA’s effects manifest is a good one. Are the average test score gains across Mississippi driven by the scores of retained students? The 2014-2015 treatment effect cannot be due to LBPA-induced retention as Mississippi’s 2014-2015 grade 4 cohort was not exposed to the retention aspect of the policy (which started in 2014-2015). The 2018-2019 treatment effect is unlikely to be substantially influenced by LBPA-induced retention given that the 2016-2017 third-grade retention rate (3.8%) was so similar to the pre-LBPA retention rate (3.3% in 2013-2014). You would have to assume incredible gains in test scores due to retention for such a small segment of students to influence a state’s average so greatly. The 2016-2017 treatment effect is the most likely to be affected by retention given that 8.1% of third-graders in 2014-2015 were retained. In Appendix C, I conduct a decomposition exercise and estimate that only about 22% of the 2016-2017 treatment effect is due to the retention aspect of the LBPA – though I should note that this decomposition exercise does require some strong assumptions.

– With respect to longer-term effects, I show in Appendix B.1 of my paper that effects persist until at least grade 7 on higher-stakes, state-level tests. There is some fadeout, but this is not unusual among educational interventions. I did not analyze effects on grade 8 NAEP reading scores in my paper partially because there was only one pre-COVID grade 8 cohort who was exposed to the LBPA and partially because I wanted to use grade 8 test scores as covariates. For what it’s worth, though, I have run the analysis quickly and find positive effects for grade 8 NAEP reading test-takers (including the 2022 and 2024 cohorts), though I would be hesitant to take much from post-COVID results because there was so much else changing at the time.

– Carefully evaluating effects on longer-term outcomes like high school completion rates, ACT scores, and post-secondary entrance rates is an important topic for future research. Mississippi’s gains on grade 4-8 assessments certainly do not guarantee longer-term effects and, again, it would not be unusual for short/medium-term effects to fade out.

– The claim that “The 2024 NAEP fourth grade mathematics scores rank the state at a tie at 50th!” is incorrect: Mississippi ranked 16th. They are also ranked 35th in 8th grade math, not 50th. I believe the authors have corrected this in an updated version of their article.

– “He improvised by using some prior years’ data as the control group, and instead of random assignment he used various bits of covariate information to equate this year’s students with the previous years…” – This was not what I did (nor what the synthetic difference-in-differences method does). I generated a control group based on a weighted average of states with similarly evolving test scores pre-treatment.

– Mississippi’s results are not entirely unique. Westall and Cummings (2023) assess early literacy policies across the country and find 0.14 SD effects for kids exposed from K-3 in the average “comprehensive policy” state. My 0.23 SD estimated effect for Mississippi is not wholly inconsistent with their national results.
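
The “such a small segment of students” point in Spencer’s note can be made concrete with some back-of-the-envelope R. This is only a sketch under a crude assumption that is not from his paper: retained students each gain some number of standard deviations, nobody else’s score changes, and so the state average moves by (share retained) times (gain per retained student). The 0.23 SD overall estimate is plugged in purely for illustration, since the year-specific effects are not given above.

```r
# Rough arithmetic, not from the paper: how large would the gain per retained
# student have to be for retention alone to produce the whole effect?
effect_sd <- 0.23                              # overall estimated effect, in SDs
share <- c(retained_2016_17 = 0.038,           # third-grade retention rate, 2016-2017
           retained_2014_15 = 0.081,           # third-grade retention rate, 2014-2015
           excess_2016_17   = 0.038 - 0.033)   # increase over the pre-LBPA rate
needed_gain <- effect_sd / share               # SDs per retained student
round(needed_gain, 1)
```

Under this simplified accounting every one of the required per-student gains is well above 2 standard deviations, which is the sense in which the purely retention-driven story needs “incredible gains.”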

When in doubt (in teaching and in research) do a simulation on the computer.

Andrew — Sun, 04 Jan 2026 14:38:17 +0000

The other day I taught my class and it didn’t go so well. Some students had a question about the central limit theorem–the point is that a sum of many small independent terms will be approximately normally distributed, but that won’t be the case if some of the terms have very long tails or if they have strong dependence or if there is one very large contributor to the sum. I gave as an example the distribution of heights: the distribution of heights of adult women or adult men is approximately normally distributed, but the distribution of heights of all adults is not. This is shown in the above figure from page 41 of Regression and Other Stories.

But this wasn’t so clear to the students. Or, to put it another way: Some students in the class already understood the central limit theorem, and my talking through it in this way didn’t add anything to their understanding; other students in the class had only a vague conception of the central limit theorem, and my explanation didn’t help.

This is a not unfamiliar experience for a teacher: you feel like you are talking into a void but it’s not clear what to do next. Keep talking, or just give up.

OK, let me not overstate the problem. The class was not a disaster! This whole episode took only a few minutes, and we moved on. Still, I hate to waste the students’ time, and also I really don’t like to get into this mindless-lecturing vibe. I was just parroting the explanation of the central limit theorem that was already in print, and it would’ve been better for me to have just pointed the students to the relevant page in the book and then moved on.

But there’s another way–as I remembered on the walk back to my office after the class was over.

It’s always an option to do a simulation on the computer. It’s win-win: you’re teaching programming as well as statistics, and also the code you write provides an opening for students to explore further on their own.

Here’s an example. First I set things up to perform a simulation and display multiple graphs:

par(mfrow=c(2,2), mar=c(3,3,1,1), mgp=c(1.5,.5,0))
set.seed(123)

Then I simulate a million instances of a random variable formed by adding 10 little pieces:

N <- 1e6
K <- 10
y_individual <- array(NA, c(N,K))
for (k in 1:K){
  y_individual[,k] <- runif(N, 0, 1)
}
y_sum <- rowSums(y_individual)

This should look a lot like a normal distribution:

hist(y_sum)

I then add a big fat new term so it won't look normal anymore:

y_sum_2 <- y_sum + 10*rbinom(N, 1, 0.5)
hist(y_sum_2)

These are almost too separated to be convincing so I try a smaller separation:

y_sum_2 <- y_sum + 3*rbinom(N, 1, 0.5)
hist(y_sum_2)

To get something more interesting I can move them together and give the two modes unequal weight:

y_sum_2 <- y_sum + 2*rbinom(N, 1, 0.4)
hist(y_sum_2)

Lots more could be done. The point is that I could do the above in a few minutes and it provides the basis for student input. For example, I could ask students to work in pairs and try to come up with other ways to defeat the central limit theorem. This will be much clearer with runnable code than with me scrawling math on the blackboard.
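
As one possible starting point for that exercise, here is a sketch of two of the other failure modes mentioned at the top of the post: long tails and strong dependence. The specific choices (Cauchy terms, a shared uniform draw plus independent noise with sd 0.01, and N cut to 1e5 to keep it fast) are illustrative assumptions, not anything from the class:

```r
set.seed(123)
N <- 1e5
K <- 10

# (a) Long tails: each of the K terms is Cauchy, so the sum is Cauchy too
y_cauchy <- rowSums(matrix(rcauchy(N*K), N, K))

# (b) Strong dependence: the K terms in each row share one uniform draw,
#     plus a little independent noise, so the sum is roughly K times a uniform
shared <- runif(N, 0, 1)
y_dep <- rowSums(matrix(shared, N, K) + matrix(rnorm(N*K, 0, 0.01), N, K))

par(mfrow=c(1,2), mar=c(3,3,1,1), mgp=c(1.5,.5,0))
# Trim the extreme Cauchy draws only so the histogram is readable
hist(y_cauchy[abs(y_cauchy) < 100], breaks=100, main="Cauchy terms")
hist(y_dep, breaks=50, main="Dependent terms")
```

In (a) the histogram has a sharp peak with tails that never die off the way a normal curve does; in (b) it is close to flat, because adding up 10 near-copies of the same number does no averaging at all.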

I've already said how much I like simulated-data experimentation when doing applied or methodological research . . . ummm, here are some old posts on the topic:

- Fake-data simulation as posterior predictive checking: A formalization of the folk theorem of statistical computing!

- Why I like hypothesis testing (it’s another way to say “fake-data simulation”):

- (What’s So Funny ‘Bout) Fake Data Simulation

- Simulated-data experimentation: Why does it work so well?

- Yes, I really really really like fake-data simulation, and I can’t stop talking about it.

And that last post has a bunch more old links at the end.

The combination of originality, ambition, and lack of scruple can take you far in social science.

Andrew — Sat, 03 Jan 2026 14:41:53 +0000

I happened to come across the above line in this post from a few years ago, about the scholar and Ted talk performer who made the ridiculously innumerate claim that “It’s possible to put actual monetary value on each citation a paper receives. We can, in other words calculate exactly how much a single citation is worth. . . . in the United States each citation is worth a whopping $100,000.”

Being an idiot is part of this guy’s success–but only part of it. The nation’s universities are full of intellectually limited tenured professors, and they don’t all get Ted talks. As I put it earlier, I attribute this guy’s success to his ability to come up with big ideas, along with his willingness to act as if his claims were supported by evidence, when they’re not. The big ideas are important–without them, he’s just one more schlub with a Ph.D. and a Rolodex.

Barnard College president promotes free expression, does not comment on recent anti-free-expression policies at the college

Andrew — Fri, 02 Jan 2026 14:08:57 +0000

The president of neighboring Barnard College writes:

Now Is the Time for Colleges to Host Difficult Speakers . . . A commitment to nonviolent disagreement should be an obvious part of the fabric of our campuses, just as it is obvious that students need oxygen to breathe. Colleges and universities need to reconfirm our commitment to nonviolent forms of disagreement — even when we are confronted with voices that disparage or dismiss identities and worldviews. This is also a time to foster more disagreement, not less.

I agree. She continues:

Colleges and universities have long resisted polarization and monolithic thinking by invoking these commitments to open discussion and inquiry, and we must continue to do so. College campuses must remain places where students are able to ask and grapple with hard questions, especially those that are uncomfortable and even hurtful. Higher education’s role is not to erase conflict but to channel it into dialogue, debate and learning. To do so, educators and students must face ideas we find offensive and speakers whose words cause pain.

Again, I agree. But as law professor Paul Campos points out, this “commitment to nonviolent forms of disagreement” on the part of the Barnard administration is new. Until recently Barnard has been pretty aggressive about trying to suppress free expression:

Last year came this policy:

Barnard is mandating that students remove any items affixed to room or suite doors by Feb. 28, after which point the college will begin removing any remaining items, Barnard College Dean Leslie Grinage announced in a Friday email to the Barnard community. . . .

“We know that you have been hearing often lately about our community rules and policies. And we know it may feel like a lot,” Grinage wrote. “The goal is to be as clear as possible about the guardrails, and, meeting the current moment, do what we can to support and foster the respect, empathy and kindness that must guide all of our behavior on campus.”

“Support and foster the respect, empathy and kindness” by not letting people put notices on their doors, huh? This seems like the absolute opposite of “educators and students must face ideas we find offensive and speakers whose words cause pain.” Also, affixing items to your dorm room door is nonviolent! (I’m assuming these items aren’t poison-laden scratch-and-sniff cards.)

Also, notoriously, the Barnard administration attempted to cancel the showing of a controversial film on campus. So, yeah, colleges and universities–including Barnard College, which is a division of Columbia University, where I work–need to reconfirm our commitment to nonviolent forms of disagreement — even when we are confronted with voices that disparage or dismiss identities and worldviews.

In short, I agree with the Barnard president’s op-ed and I think it would’ve been much improved by an acknowledgment that it represents a major change in policy from the recent policies at Barnard College.

If you’re gonna talk about the value of allowing and even promoting nonviolent disagreement, you can at least talk about the difficulty of implementing such recommendations–difficulties that you’ve directly faced at your own institution.

Maybe the Barnard administration could also apologize to the students they hassled regarding the showing of that movie, and they could apologize to the students who they were hassling about messages on their dorm room doors.

Pinning the group-level variance parameters to speed computation for hierarchical models

Andrew — Thu, 01 Jan 2026 14:43:05 +0000

In case you don’t know about the Stan Forums, let me just tell you that it’s a great online space for discussions about applied statistics and computing. All sorts of things come up.

Today we had a discussion about challenges of fitting big hierarchical models, where I wrote:

1. I often recommend pinning the group-level variance parameter or covariance matrix to a pre-chosen value based on subject-matter information. Often the inference isn’t super-sensitive to this group-level variance, as long as it’s not so small that it causes all the estimates to disappear to zero and not so large that the estimates are wildly noisy.

I’ve toyed with the idea of making this a more formal procedure, for example drawing 10 values of the set of variance parameters from a prior, then using these to run 10 fast inferences (could be MCMC or even just plain old optimization and Laplace approx), then averaging over them using stacking. I think this could work, but I’ve never actually tried it, let alone evaluated the idea. It’s a research idea!

2. Sometimes we do use gamma priors for group-level variance parameters. The gamma prior with 1 or more degrees of freedom has the pleasant property of being zero-avoiding, which is especially helpful when doing marginal maximum likelihood, as we discuss in our 2013 paper: https://sites.stat.columbia.edu/gelman/research/published/chung_etal_Pmetrika2013.pdf or, for covariance matrices, using the Wishart (_not_ inverse-Wishart) prior, as in our 2014 paper: https://sites.stat.columbia.edu/gelman/research/published/chung_cov_matrices.pdf

3. Another thing that’s worked well for me is to use Pathfinder to get starting values. It varies, but sometimes Pathfinder runs very fast and then we can jointly estimate all the parameters and not worry so much about the funnel.
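
To make point 1 concrete, here is a minimal sketch of what pinning looks like in the simplest case: a normal-normal hierarchical model, y_j ~ normal(theta_j, sigma_j) and theta_j ~ normal(mu, tau), with known sigma_j, a flat prior on mu, and tau fixed at a chosen value. With tau pinned, the posterior means are available in closed form, so no MCMC is needed. The fake data, the function name pinned_fit, and the grid of tau values are all assumptions made up for illustration:

```r
set.seed(1)
J <- 20
sigma <- runif(J, 5, 15)          # known standard errors (made up)
theta <- rnorm(J, 10, 5)          # true group effects, simulated with tau = 5
y <- rnorm(J, theta, sigma)       # observed group estimates

# Posterior means of mu and theta given a pinned value of tau
pinned_fit <- function(y, sigma, tau) {
  w <- 1 / (sigma^2 + tau^2)              # precision of each y_j around mu
  mu_hat <- sum(w * y) / sum(w)           # posterior mean of mu
  shrink <- tau^2 / (tau^2 + sigma^2)     # weight each group puts on its own data
  theta_hat <- shrink * y + (1 - shrink) * mu_hat
  list(mu_hat = mu_hat, theta_hat = theta_hat)
}

# Compare the estimates across a range of pinned values
taus <- c(0.1, 2, 5, 10, 100)
fits <- sapply(taus, function(tau) pinned_fit(y, sigma, tau)$theta_hat)
colnames(fits) <- paste0("tau=", taus)
round(head(fits), 1)
```

The two extreme columns show the failure modes described above: with tau near zero every estimate collapses to the common mean, and with tau very large the estimates are just the noisy raw y values. The middle columns are where one can check how sensitive the inferences really are.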

I’m sharing this here partly because it might be useful to some of you and partly because it includes a research idea.

What we can remember (the vagaries of literary fame)

Andrew — Wed, 31 Dec 2025 15:59:54 +0000

OK, this one’s just for Campos:

Booth Tarkington is known as an author of popular and slightly saccharine novels about boyhood in Indiana, now surely most famous as the author of The Magnificent Ambersons, which was made by Orson Welles into a wonderful movie that, by all accounts, would be much more wonderful if the studio hadn’t destroyed something like 40 minutes of it while Welles was in Brazil making a left-wing movie which was released posthumously a couple decades ago and itself is not too bad, considering, but nothing like Ambersons. If the missing reels of Ambersons were to ever turn up . . . well, that would be just about the greatest artistic discovery ever, I’d say more so than a missing Leonardo of Joconde quality or a missing Agatha Christie of ABC quality. So, not a bad way for Tarkington to be remembered.

Edna Ferber: OK, I have an Edna Ferber story, kind of. It’s my dad’s best carpooling story. He’s reading a book and somebody else in the car asks who’s it by.
My dad: Thurber.
Other guy: Edna Thurber?
My dad: Edna Thurber? Never heard of her.
Other guy: No, you wouldn’t have.
[Background: James Thurber was a “New Yorker” humorist. Edna Ferber was a popular novelist. Presumably my dad had the reputation as an unintellectual kind of guy.]

Daphne du Maurier: She wrote Rebecca! Most famous as a movie, but the book is considered a classic. We discussed it in our recent post, Reading like it’s 1937, where one of our commenters was surprised to learn that they still assign it to high school students.

Of course you could look all this up–ok, not the story from my dad’s carpool–but the point here is what we can remember, right?

I have a horrible feeling sometimes that heavily promoted crap research on space aliens, cold showers, mind-body healing, schoolyard evolutionary psychology, extra-sensory perception, magic golf balls, air rage, himmicanes, subliminal smiley faces, etc etc etc, has softened the ground so that the seeds of more evil trees could then be planted and take root.

Andrew — Wed, 31 Dec 2025 14:34:21 +0000

Dale Lehman sends an email with subject line “A new low in science”:

I watched part of the painful hearing and here is a story on it.

It is a new low in both statistical studies, policymakers’ reactions, and public reactions. Of course, the study could be correct, but without any review and without any data, it is simply out of line with many other studies. And, in the hearing the explanations amount to a vast conspiracy where thousands of scientists and clinicians are in a grand scheme (mostly out of fear for their jobs – an irony, given how this administration has used fear about employment as a tool against government employees) to prevent the truth from getting out. The hearing itself was a travesty with virtually none of the subcommittee members present – only Ron Johnson (who outdid himself) and Blumenthal who left (I think out of frustration). I don’t know what bothered me the most – the tone of the witnesses, Johnson’s tirade, or the public applauding the conspiracy statements.

I replied that the bit about “the study could be correct” reminds me of the distinction between evidence and truth which we’ve discussed on the blog.

I’m not sure that this is “a new low in science.” Some other recent lows in science have included members of the elite media promoting ridiculous UFO-as-space-aliens speculation, and celebrity academics here and here promoting junk-science claims of mind-body healing.

The vaccine crap is much worse from a policy perspective–space aliens and mind-body healing are mostly just a waste of time–but, as we’ve discussed, I have a horrible feeling sometimes that heavily promoted crap on space aliens, cold showers, mind-body healing, schoolyard evolutionary psychology, extra-sensory perception, magic golf balls, air rage, himmicanes, subliminal smiley faces, etc etc etc, has softened the ground so that the seeds of more evil trees can now be planted and take root.

All that junk science over the past twenty years has been promoted by leading academics. Prominent professors from Harvard, Stanford, Columbia, Chicago, etc. have been promoting magical thinking under the guise of science. Perhaps no surprise that politicians and non-academic hucksters want to get into the game too.

P.S. I asked Lehman whether it was ok for me to quote him, and he said, “Sure. But if it is delayed by 6 months, there will probably be worse examples to come.” I replied that I’ve been doing less blogging lately and the lag is only 4 months now.

P.P.S. Between when I wrote the above post and when it appeared, I posted something else with a similar theme: 25,000 lives saved per ship sunk, $100,000 per citation, a probability of 10^-90 of a decisive vote . . . Is there a through line from B.S. numbers in junk science to B.S. numbers coming from the government?

Part of this may just be me trying to justify my own existence: I write about bad science and I’m concerned about bad policy, so it’s natural for me to see a connection. But I do see something real here. Many of the celebrated leaders of our pop-science establishment–professors with elevated titles at major universities, members of the National Academy of Sciences, PBS and Ted stalwarts, etc.–have at best a tolerance and at worst an active taste for junk science. And many of our political leaders will say the most ridiculous, immediately refutable things. I can’t do much about this, but I can scream. Oh yes, I can do that.

Survey Statistics: more adventures in mismeasured X

shira — Tue, 30 Dec 2025 21:00:41 +0000

Last week we considered this simple example of measurement error in auxiliary data X:

Y = current 2025 support
X = true 2024 vote choice
X* = response for 2024 vote choice

All are binary, = 1 for Democrats and 0 for Republicans. This cartoon example is from politics (not meant to be particularly realistic), but measurement error occurs in almost every survey. When does measurement error in an auxiliary adjustment variable negate the gains from adjusting for it to reduce nonresponse bias ?

Suppose we want E(Y), the current 2025 support in the population. If the true X were enough to handle nonresponse bias, then we could estimate this via poststratification:

P(X=1) E(Y | X = 1, sample) + P(X = 0) E(Y | X = 0, sample)

where we have P(X=1) and P(X=0) from the 2024 election results. But we can’t directly estimate E(Y | X = 1, sample) because we only have X* in the sample.

We considered two choices:

Adjust with mismeasured X*: P(X=1) E(Y | X* = 1, sample) + P(X = 0) E(Y | X* = 0, sample)
No adjustment: E(Y | sample)

Questions:

A) Which is closest to the truth, E(Y) ?

B) Which is closer to the previous election result, E(X) ?

C) Which is higher for Democrats ?

The answers depend on the distribution of Y,X,X* in the population and in the sample. For question A, generally I’d guess adjusting even with a mismeasured X* usually gets us closer to truth, but as we’ll see below, it doesn’t always. For question B, one might think adjusting for a past election always brings us closer to that past election’s results, but as we’ll see below, it doesn’t always. For question C, let’s rewrite the no adjustment estimator:

E(Y | sample) = P(X* = 1 | sample) E(Y | X* = 1, sample) + P(X* = 0 | sample) E(Y | X* = 0, sample)

So adjusting for X* could increase support for Democrats if P(X=1) > P(X* = 1 | sample). In other words, more people voted for Democrats in 2024 than say they did in the sample. This sounds like winner’s bias, but it’s also comparing apples (population) to oranges (sample), so not quite.
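
A quick numerical check of that comparison, as a sketch: both estimators are weighted averages of the same two conditional means, E(Y | X* = 1, sample) and E(Y | X* = 0, sample), so their difference should factor into (difference in weights) times (difference in support by X*). The numbers are the ones from example 1 in the code further down, using the s_ij and p_ij notation defined there:

```r
# s_ij = P(X=i, X*=j | sample);  p_ij = P(Y=1 | X=i, X*=j)
s11 <- 0.48; s01 <- 0.06; s10 <- 0.07; s00 <- 0.39
p11 <- 0.82; p01 <- 0.68; p10 <- 0.42; p00 <- 0.25
PX1 <- 0.49                                   # P(X=1), the true 2024 result

PXs1   <- s11 + s01                           # P(X*=1 | sample)
EY_Xs1 <- (s11*p11 + s01*p01) / PXs1          # E(Y | X*=1, sample)
EY_Xs0 <- (s10*p10 + s00*p00) / (1 - PXs1)    # E(Y | X*=0, sample)

no_adjust    <- PXs1*EY_Xs1 + (1 - PXs1)*EY_Xs0   # weights from the sample
Xstar_adjust <- PX1*EY_Xs1  + (1 - PX1)*EY_Xs0    # weights from the election

# The gap factors into (shift in weights) x (difference in support by X*)
gap <- Xstar_adjust - no_adjust
gap_factored <- (PX1 - PXs1) * (EY_Xs1 - EY_Xs0)
c(gap=gap, gap_factored=gap_factored)
```

In this example P(X=1) = 0.49 is below P(X* = 1 | sample) = 0.54, and support is higher among respondents with X* = 1, so adjusting lowers the Democratic estimate. The condition in the text also needs that second factor to be positive, which it will be in any reasonable scenario.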

So the answers to these three questions really do depend !

Here is some R code to simulate your own worlds. I made 4 examples so far. Do you think they’re realistic ?

# Y = 2025 support
# X = 2024 vote
# X* = 2024 recalled vote
# p_ij = P(Y=1 | X=i, X*=j) is 2025 Democrat support by X and X*
# s_ij = P(X=i, X*=j | sample) is distribution of X and X* in sample
# s01 could come from consistency bias 
# s10 could come from winner's bias
# P(X=1) = 0.49 is the true election result

EY_calc <- function(p11,p01,p10,p00, s11,s01,s10,s00, PX1=0.49){
  PX0 <- 1 - PX1
  PY1_X1 <- (s11/(s11+s10))*p11 + (s10/(s11+s10))*p10 # P(Y=1 | X=1, sample)
  PY1_X0 <- (s01/(s01+s00))*p01 + (s00/(s01+s00))*p00 # P(Y=1 | X=0, sample)
  Truth  <- PX1*PY1_X1 + PX0*PY1_X0

  PY1_Xs1 <- (s11/(s11+s01))*p11 + (s01/(s11+s01))*p01 # P(Y=1 | X*=1, sample)
  PY1_Xs0 <- (s10/(s10+s00))*p10 + (s00/(s10+s00))*p00 # P(Y=1 | X*=0, sample)
  Xstar_adjust <- PX1*PY1_Xs1 + PX0*PY1_Xs0

  no_adjust <- s11*p11 + s01*p01 + s10*p10 + s00*p00 # P(Y=1 | sample)

  # closeness to Truth E(Y)
  closer_truth <- if (abs(Xstar_adjust-Truth) < abs(no_adjust-Truth)) "Xstar_adjust" 
  else "no_adjust"

  # closeness to E(X) (last election result)
  closer_EX <- if (abs(Xstar_adjust-PX1) < abs(no_adjust-PX1)) "Xstar_adjust" 
  else "no_adjust"

  # higher for Democrats
  higher_for_Democrats <- if (Xstar_adjust > no_adjust) "Xstar_adjust" 
  else "no_adjust"

  est <- c(Truth=Truth, Xstar_adjust=Xstar_adjust, no_adjust=no_adjust) 
  list(estimates=signif(est, 3), 
       closer_truth=closer_truth, 
       closer_EX=closer_EX, 
       higher_for_Democrats=higher_for_Democrats) 
}

# 1) Xstar_adjust is closer to Truth E(Y)
EY_calc(
  p11=0.82, p01=0.68, p10=0.42, p00=0.25,
  s11=0.48, s01=0.06, s10=0.07, s00=0.39,
  PX1=0.49
)

# 2) no_adjust is closer to Truth E(Y)
EY_calc(
  p11=0.78, p01=0.66, p10=0.46, p00=0.34,
  s11=0.44, s01=0.10, s10=0.05, s00=0.41,
  PX1=0.49
)

# 3) no_adjust is closer to last election E(X)
EY_calc(
  p11=0.781, p01=0.648, p10=0.550, p00=0.297,
  s11=0.476, s01=0.010, s10=0.095, s00=0.419,
  PX1=0.49
)

# 4) Winner’s bias only: s01 = 0 no consistency bias
EY_calc(
  p11=0.86, p01=0.74, p10=0.40, p00=0.28,
  s11=0.50, s01=0.00, s10=0.10, s00=0.40,
  PX1=0.49
)

A funny mismatch between the level of the course and what the instructor is doing on the blackboard

Andrew — Tue, 30 Dec 2025 14:04:00 +0000

I was doing some web searching and came across this article from 1994 by Mahmoud Sayrafiezadeh, which begins:

The birthday problem asks for the probability that at least two people in a group of k people will have the same birthday. . . . The present work was motivated by the need to provide an approximation formula for the solution of the birthday problem in a liberal arts course on the Nature of Mathematics. The main result enables students who have not yet studied calculus to approximate solutions to birthday-type problems.

So far, so good. But here’s the funny part. The article continues with some formulas:

OK so far, maybe. I suspect that this level of technical detail will already confuse the liberal arts students who have not yet studied calculus, but, who knows, maybe this person was able to teach it in the classroom.

But now things start to get really hairy:

At this point, I can’t believe that those liberal arts students who have not yet studied calculus are following. I don’t mean this as an insult to the students! Mathematics is a foreign language to most people, and it’s not a disparagement to say that non-math-speakers will be challenged by the above expressions, any more than it’s disparaging me to say that if you talk to me fast in Spanish, I’ll quickly get lost.

The article continues:

And a bit later it really gets going:

If this is the elementary version for the non-calculus-based liberal-arts course, I hate to think what this teacher did for the advanced classes. Maybe they’d prove the Riemann hypothesis as a homework problem?

Again, I assume there’s nothing wrong with the content of the paper; it’s just funny to think of someone teaching math to liberal-arts students and not being able to resist all this technical stuff. I guess most of us are like this when we teach!

P.S. I showed the above to someone who was a teaching assistant for math classes, and he said it reminded him of the famous xkcd quartz cartoon.

How the covid vaccine almost killed me

Andrew — Mon, 29 Dec 2025 14:23:04 +0000

So, I was talking on the phone with a friend the other day and she said she just got covid, and I realized that I knew a few other people who’d had covid recently, and this season’s version of the vaccine had come out. I scheduled an appointment at the doctor’s office the next day for covid and flu shots. But when I got there, all they had was the flu shot—the covid shots hadn’t come in yet. The nurse recommended I try doing it through a pharmacy. I kinda forgot about it but then a couple days later I remembered. I went on the CVS website and it was really easy to schedule . . . actually they had an appointment in 20 minutes on West 57 St in midtown. (Amusingly enough, when I typed in my location, it gave the closest locations as some places in New Jersey—I guess they were measuring as-the-crow-flies distance rather than travel time.) 20 minutes doesn’t leave much margin of error so I threw on my shoes, grabbed my bike, zipped over to the subway, went down to 59 St, and biked over to the corner where the CVS was . . . I wasn’t sure which way to go and I couldn’t see any street numbers so I took a guess and turned left, then I saw the street numbers were too low . . . I was in a real hurry now, I didn’t want to get there too late and have them retract my appointment, also I had to return home in about 40 minutes, so I decided to turn around right there in the middle of the block. As I was making that U-turn I slowed down to find a break in the traffic going the other direction and I saw a city bus barreling right at me! Fortunately there was some space in the cars so I could get into the traffic and I didn’t get run over.

Everything else went well. I got the shot and I got home in time for my 4pm meeting. But I almost got run over by a bus (entirely my fault, not the bus driver’s at all). So that’s my story: the vaccine almost killed me.

I’m reminded of the principle that the most dangerous part of a flight is the ride to the airport.

Looking at the Port Huron Statement, 63 years later

Andrew — Sun, 28 Dec 2025 14:14:12 +0000

In preparation for this new class, I was reading the Port Huron Statement, the classic 1962 manifesto from the Students for a Democratic Society. It’s good!

Here’s the stirring beginning:

The statement continues:

It’s interesting to think how things have changed. Social inequality is still a big deal and nuclear war is still a threat, but now the most important issues are economic problems, the government, and immigration. OK, those numbers came from a poll of all American adults, but I think that if you just surveyed young left-wing activists–today’s equivalent of the Students for a Democratic Society–the economy would still be the leading issue, maybe with democracy or political polarization as #2.

It’s funny, though–and I’m far from the first to point this out–that Americans are so much less satisfied with the economy now than they were in 1962, given how much richer the society is: a smaller percentage living in poverty and with the median American having better cars, bigger houses, and all the rest. I get it: for one thing, people measure their status against what they already have, and we take for granted so much that we have now; also, the economy isn’t just about ownership and consumption, it’s also about having a stable job and a sense of good things happening in the future. Just for example, global warming is very low on most people’s list of most important issues, but there is widespread concern that future generations will struggle economically, that there is some form of unsustainability, whether from physical (environmental) constraints or from societal problems that won’t be resolved. I kinda want to say that usual economic theory doesn’t handle this so well, but it’s not quite a problem with economics, which can include the value of future wellbeing; the problem lies more at the intersection of economics, politics, and psychology, in that in the short term voters are influenced by short-term economic conditions that seem to have only a very indirect connection to larger concerns.

In any case, let’s return to the pleasant world of 1962, where the economy was in the middle of a decades-long period of growth and young radicals could focus their energy elsewhere.

There’s lots to chew on. For example:

Listed fourteenth, huh? I couldn’t easily find the relevant Gallup report online, but I was able to access this 1985 article, “The Polls: America’s Most Important Problems Part I: National and International,” by Tom Smith, which reports that Gallup started asking that question back in 1935! Here’s the data summary from 1962:

Wait–“foreign affairs” was listed by 72% of respondents! “International affairs” and “foreign affairs” are the same thing, no? So I guess the writers of the Port Huron Statement were engaging in a bit of poetic license on this one.

Then there’s this:

With this footnote:

I was curious how this has changed since then. Since 1962, America has become much more unequal economically, at least at the high end; that’s been well documented. On the other hand, it’s much more common to own stock than it used to be. Overall I’d say that “the percentage of stock owned by the top X%” is not a good measure of inequality (really you’d want the percentage of total wealth); on the other hand, to the extent that government and elite policies are focused on pumping up the stock market–or, at least, stopping it from crashing–then, yeah, it’s a relevant number to look at.

Anyway, I googled *what percent of stock is owned by the richest 1%*, which turned up a 2024 article from an advocacy organization that begins:

New Federal Reserve analysis of stock markets has found that the concentration of ownership of the public equity stock market has hit an all-time high.

The article also points to this news article entitled “The rich now own a record share of stocks,” that pointed to this op-ed that reported:

While the richest 1 per cent owned just 40 per cent two decades ago, their share stood at 54 per cent in the most recent data from 2022.

The Federal Reserve link gives this graph:

If the Port Huron Statement is correct, the richest 1% back in 1962 owned over 80% of the stock, but now it’s only 52%–so, hardly an “all-time high”! We need a graph that goes back before 1990.

Hey–some googling turns up this paper, “Household Wealth Trends in the United States, 1962 to 2019” . . . sounds like just what we’re looking for! But, no, I don’t see the percentage of stock owned by the richest 1% in 1962.

My guess is that the Port Huron Statement is correct, or nearly correct, in its stock numbers and that the news reports claiming “the concentration of ownership of the public equity stock market has hit an all-time high” are wrong–but it doesn’t really matter because stock ownership was so much less of a thing back in 1962.

At that point you might ask, Why did I just spend a half hour failing to track down a number that doesn’t really matter? The answer is, you don’t always know what matters ahead of time. And it’s good to check things when you can.

“No one could suspect that times were coming . . . when the man who did not gamble would lose all the time, even more surely than he who gambled.”

Andrew — Sat, 27 Dec 2025 14:57:52 +0000

In preparation for this new class, I was reading White Collar, the classic 1951 book by sociologist C. Wright Mills. It’s perfect for week 2 of the course because it begins with a discussion of the changes from an American middle class of freeholders and tradesmen to a society of employees. I can only assume that lots of its claims have been disputed and discredited in the past 75 years, but at the very least it gives a window into the urban version of the frontier thesis in American history.

But what I wanted to talk about now is the quote that I put in the title of this post. It’s the epigraph to White Collar, and Mills attributes it to Charles Péguy. A Google search points us to a long poem from 1912 entitled Le porche du mystère de la deuxième vertu, translated as The Portal of the Mystery of Hope. I found part of a translation online here, but this was published in 1996 so I guess Mills was quoting from some earlier translation, or maybe he translated that brief passage himself from the French?

Searching in French leads to a link to the actual published poem from 1912! I didn’t have the patience to read the whole thing. I skimmed through to see if I could see any passage close to “No one could suspect that times were coming … when the man who did not gamble would lose all the time, even more surely than he who gambled,” but no dice. The document also has a search function (“Estimated OCR rate for this document : 99.69%”), and I searched and searched but couldn’t find anything. I searched on “personne,” “soupçonner,” “temps,” “homme,” “parier” (also “pari,” “pariez,” etc.), “jouer” and its variants, “perdre,” even “sûrement,” but none of these led to anything even close to the quoted passage. Adding to the mystery, the only translation I could see of The Portal of the Mystery of Hope was from 1996–obviously not the version that Mills was quoting from back in 1951.

Can any of you track down this quote? Maybe this is a job for the Quote Investigator. I did a quick search and he’s never covered this one, so who knows.

For reasons that should be obvious, I like the quote a lot, but I’m loath to use it until I know its source. Yes, I could cite as “C. Wright Mills (1951), attributing to Charles Péguy,” but I’d like to do better than that!

How much of an NBA team’s won-loss record is from skill and how much is luck?

Andrew — Fri, 26 Dec 2025 14:44:16 +0000

Paul Campos reminds us that just two years ago the Detroit Pistons were in the middle of a historic 28-game losing streak on the way to a 14-68 record (following up previous records of 17-65, 23-59, 20-50, and 20-46, so it’s not like that was much of an aberration), but now they’re leading the Eastern Conference with a 24-6 record, even though “The core talent group on that historically bad team still makes up the core talent of the present Detroit team, exactly two years later: Cade Cunningham, Jalen Duren, Ausar Thompson, Jaden Ivy, and Isaiah Stewart.”

Campos continues:

How did this happen? The answer is that all these players were extremely young two years ago: Cunningham and Stewart were 22, Ivy and Thompson were 21, and Duren was 20. Each of them has taken a huge leap forward in the subsequent two years . . .

I don’t know enough about basketball, and I haven’t been following the NBA at all lately, so I can’t comment on Campos’s judgment of the Pistons situation.

But in his post he also links to this old post of mine that I’d completely forgotten!, where I did a bunch of analysis to estimate how much information we get from 30 games in a season, compared to the information available from preseason betting odds.

I enjoy these posts where we go into the data and crunch through the R, and I know many of you like them too, so I thought I’d repeat it for you today for your holiday reading.

So here goes, from Christmas 2023:

Paul Campos points us to this discussion of the record of the Detroit professional basketball team:

The Detroit Pistons broke the NBA record for most consecutive losses in a season last night, with their 27th loss in a row. . . . A team’s record is, roughly speaking, a function of two factors:

(1) The team’s quality. By “quality” I mean everything about the team’s performance that isn’t an outcome of random factors, aka luck — the ability of the players, individually and collectively, the quality of the coaching, and the quality of the team’s management, for example.

(2) Random factors, aka luck.

The above-linked post continues:

How do we disentangle the relative importance of these two factors when evaluating a team’s performance to some point in the season? . . . The best predictor ex ante of team performance is the evaluation of people who gamble on that performance. I realize that occasionally gambling odds include significant inefficiencies, in the form of the betting public making sentimental rather than coldly rational wagers, but this is very much the exception rather than the rule. . . . the even money over/under for Detroit’s eventual winning percentage this season was, before the first game was played, a winning percentage of .340. To this point, a little more than third of the way through the season, Detroit’s winning percentage has been .0666. . . .

To the extent that the team has had unusually bad luck, then one would expect the team’s final record to be better. But how much better? Here we can again turn to the savants of Las Vegas et. al., who currently set the even money odds of the team’s final record on the basis of the assumption that it will have a .170 winning percentage in its remaining games.

Part of the confusion here is that we’re dealing with inference for p (the team’s “quality,” as summarized by the probability that they’d win against a randomly-chosen opponent on a random day) and also with predictions of outcomes. For the posterior mean, there’s no difference: under the basic model, the posterior expected proportion of future games won is equal to the posterior mean of p. It gets trickier when we talk about uncertainty in p.
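To make that distinction concrete, here is a minimal sketch in R. It assumes, purely for illustration, a Beta(a, b) posterior for p; the values of a, b, and m are made up and are not estimated from any data in this post. The expected proportion of future wins matches the posterior mean of p, but the predictive sd of that proportion is larger than the posterior sd of p because it adds binomial noise from the games themselves.

```r
set.seed(1)

# Hypothetical Beta(a, b) posterior for a team's win probability p
a <- 6
b <- 30
m <- 52   # games remaining (assumed)

# Posterior mean and sd of p
post_mean_p <- a / (a + b)
post_sd_p <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))

# Beta-binomial predictive sd for the proportion of the m future games won
pred_var_prop <- post_mean_p * (1 - post_mean_p) / m * (a + b + m) / (a + b + 1)
pred_sd_prop <- sqrt(pred_var_prop)

# Simulation check: draw p from the posterior, then simulate m games
p_draws <- rbeta(1e5, a, b)
sim_prop <- rbinom(1e5, m, p_draws) / m

print(c(post_mean_p = post_mean_p, sim_mean_prop = mean(sim_prop)))
print(c(post_sd_p = post_sd_p, pred_sd_prop = pred_sd_prop, sim_sd_prop = sd(sim_prop)))
```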

How, then, could we take the beginning-of-season and current betting lines–which we will, for the purposes of our discussion here, identify as the prior and posterior means of p, ignoring systematic biases of bettors–and extract implied prior and posterior distributions? There’s surely enough information here to do this, if we use information from all 30 teams and calibrate properly.

Exploratory analysis

I started by going to the internet, finding various sources on betting odds, team records, and score differentials, and entering the data into this file. The latest Vegas odds I could find on season records were from 19 Dec; everything else came from 27 Dec.

Next step was to make some graphs. First, I looked at point differential and team records so far:

nba <- read.table("nba2023.txt", header=TRUE, skip=1)
nba$ppg <- nba$avg_points
nba$ppg_a <- nba$avg_points_opponent
nba$ppg_diff <- nba$ppg - nba$ppg_a
nba$record <- nba$win_fraction
nba$start_odds <- nba$over_under_beginning/82
nba$dec_odds <- nba$over_under_as_of_dec/82
nba$sched <- - (nba$schedule_strength - mean(nba$schedule_strength)) # signed so that positive value implies a more difficult schedule so far in season
nba$future_odds <- (82*nba$dec_odds - 30*nba$record)/52

pdf("nba2023_1.pdf", height=3.5, width=10)
par(mfrow=c(1,2), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="s")
rng <- range(nba$ppg_a, nba$ppg)
plot(rng, rng, xlab="Points per game allowed", ylab="Points per game scored", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg_a, nba$ppg, nba$team, col="blue")
#
par(pty="m")
plot(nba$ppg_diff, nba$record, xlab="Point differential", ylab="Won/lost record so far", bty="l", type="n")
text(nba$ppg_diff, nba$record, nba$team, col="blue")
#
mtext("Points per game and won-lost record as of 27 Dec", line=.5, side=3, outer=TRUE)
dev.off()

Here's a question you should always ask yourself: What do you expect to see?

Before performing any statistical analysis it's good practice to anticipate the results. So what do you think these graphs will look like?
- Ppg scored vs. ppg allowed. What do you expect to see? Before making the graph, I could have imagined it going either way: you might expect a negative correlation, with some teams doing the run-and-gun and others the physical game, or you might expect a positive correlation, because some teams are just much better than others. My impression is that team styles don't vary as much as they used to, so I was guessing a positive correlation.
- Won/lost record vs. point differential. What do you expect to see? Before making the graph, I was expecting a high correlation. Indeed, if I could only use one of these two metrics to estimate a team's ability, I'd be inclined to use point differential.

Aaaand, here's what we found:

Hey, my intuition worked on these! It would be interesting to see data from other years to see if I just got lucky with that first one.

Which is a better predictor of won-loss record: ppg scored or ppg allowed?

OK, this is a slight distraction from Campos's question, but now I'm wondering, which is a better predictor of won-loss record: ppg scored or ppg allowed? From basic principles I'm guessing they're about equally good.

Let's do a couple of graphs:

pdf("nba2023_2.pdf", height=3.5, width=10)
par(mfrow=c(1,3), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="m")
rng <- range(nba$ppg_a, nba$ppg)
plot(rng, range(nba$record), xlab="Points per game scored", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg, nba$record, nba$team, col="blue")
#
par(pty="m")
plot(rng, range(nba$record), xlab="Points per game allowed", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg_a, nba$record, nba$team, col="blue")
#
par(pty="m")
plot(range(nba$ppg_diff), range(nba$record), xlab="Avg score differential", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$ppg_diff, nba$record, nba$team, col="blue")
#
mtext("Predicting won-loss record from ppg, ppg allowed, and differential", line=.5, side=3, outer=TRUE)
dev.off()

Which yields:

So, about what we expected. To round it out, let's try some regressions:

library("rstanarm")
print(stan_glm(record ~ ppg, data=nba, refresh=0), digits=3)
print(stan_glm(record ~ ppg_a, data=nba, refresh=0), digits=3)
print(stan_glm(record ~ ppg + ppg_a, data=nba, refresh=0), digits=3)

The results:

            Median MAD_SD
(Intercept) -1.848  0.727
ppg          0.020  0.006

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.162  0.021 
------
            Median MAD_SD
(Intercept)  3.192  0.597
ppg_a       -0.023  0.005

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.146  0.019 
------
            Median MAD_SD
(Intercept)  0.691  0.335
ppg          0.029  0.002
ppg_a       -0.030  0.002

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.061  0.008

So, yeah, points scored and points allowed are about equal as predictors of won-loss record. Given that, it makes sense to recode as ppg differential and total points:

print(stan_glm(record ~ ppg_diff + I(ppg + ppg_a), data=nba, refresh=0), digits=3)

Here's what we get:

               Median MAD_SD
(Intercept)     0.695  0.346
ppg_diff        0.029  0.002
I(ppg + ppg_a) -0.001  0.001

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.062  0.009

Check. Once we include ppg_diff as a predictor, the average total points doesn't do much of anything. Again, it would be good to check with data from other seasons, as 30 games per team does not supply much of a sample.
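The recoding is a change of basis, so the new coefficients can be anticipated from the earlier fit. Here is a small sketch that uses only the rounded medians printed above (0.029 and -0.030). Because those are rounded posterior medians from a separate fit, the match to the recoded regression is only approximate.

```r
# Rounded coefficients from the regression of record on ppg and ppg_a
b_ppg <- 0.029
b_ppg_a <- -0.030

# b_ppg*ppg + b_ppg_a*ppg_a = b_diff*(ppg - ppg_a) + b_total*(ppg + ppg_a)
b_diff <- (b_ppg - b_ppg_a) / 2
b_total <- (b_ppg + b_ppg_a) / 2

print(c(b_diff = b_diff, b_total = b_total))
```

The differential picks up essentially all of the signal and the total-points coefficient is close to zero, which is the pattern in the fitted model.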

Now on to the betting lines

Let's now include the Vegas over-unders in our analysis. First, some graphs:

pdf("nba2023_3.pdf", height=3.5, width=10)
par(mfrow=c(1,3), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="s")
rng <- range(nba$start_odds, nba$record)
plot(rng, rng, xlab="Betting line at start", ylab="Won/lost record so far", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$start_odds, nba$record, nba$team, col="blue")
#
par(pty="s")
rng <- range(nba$record, nba$dec_odds)
plot(rng, rng, xlab="Won/lost record so far", ylab="Betting line in Dec", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$record, nba$dec_odds, nba$team, col="blue")
#
par(pty="s")
rng <- range(nba$start_odds, nba$dec_odds)
plot(rng, rng, xlab="Betting line at start", ylab="Betting line in Dec", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
text(nba$start_odds, nba$dec_odds, nba$team, col="blue")
#
mtext("Won-lost record and over-under at start and in Dec", line=.5, side=3, outer=TRUE)
dev.off()

Which yields:

Oops--I forgot to make some predictions before looking. In any case, the first graph is kinda surprising. You'd expect to see an approximate pattern of E(y|x) = x, and we do see that--but not at the low end. The teams that were predicted to do the worst this year are doing even worse than expected. It would be interesting to see the corresponding graph for earlier years. My guess is that this year is special, not only in the worst teams doing so bad, but in them underperforming their low expectations.

The second graph is as one might anticipate: Bettors are predicting some regression toward the mean. Not much, though! And the third graph doesn't tell us much beyond the first graph.

Upon reflection, I'm finding the second graph difficult to interpret. The trouble is that "Betting line in Dec" is the forecast win percentage for the year, but 30/82 of that is the existing win percentage. (OK, not every team has played exactly 30 games, but close enough.) What I want to do is just look at the forecast for their win percentage for the rest of the season:

pdf("nba2023_4.pdf", height=3.5, width=10)
par(mfrow=c(1,3), oma=c(0,0,2,0))
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="s")
rng <- range(nba$record, nba$future_odds)
plot(rng, rng, xlab="Won/lost record so far", ylab="Betting line of record for rest of season", bty="l", type="n")
abline(0, 1, lwd=.5, col="gray")
fit <- coef(stan_glm(future_odds ~ record, data=nba, refresh=0))
print(fit)
abline(fit, lwd=.5, col="blue")
text(nba$record, nba$future_odds, nba$team, col="blue")
#
dev.off()

Here's the graph:

The fitted regression line has a slope of 0.66:

            Median MAD_SD
(Intercept) 0.17   0.03  
record      0.66   0.05  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.05   0.01
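The conversion from a full-season line to a rest-of-season line can be checked by hand with the Detroit numbers quoted earlier. This sketch assumes exactly 30 games played and takes 2/30 as the record behind the quoted .0666; neither number is read from the data file.

```r
games_played <- 30
games_total <- 82
games_left <- games_total - games_played

record <- 2/30          # assumed record behind the quoted .0666
future_odds <- 0.170    # implied win fraction in remaining games, as quoted from Campos

# Full-season line implied by those two numbers
dec_odds <- (games_played * record + games_left * future_odds) / games_total

# Inverting recovers the rest-of-season line, as in the definition of nba$future_odds
future_back <- (games_total * dec_odds - games_played * record) / games_left

print(c(dec_odds = dec_odds, implied_wins = games_total * dec_odds, future_back = future_back))
```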

Next step is to predict the Vegas prediction for the rest of the season given the initial prediction and the team's record so far:

print(stan_glm(future_odds ~ start_odds + record, data=nba, refresh=0), digits=2)

            Median MAD_SD
(Intercept) -0.02   0.03 
start_odds   0.66   0.10 
record       0.37   0.06 

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.03   0.00

It's funny--everywhere we look, we see this 0.66. And 30 games is 37% of the season!

Now let's add into the regression the points-per-game differential, as this should include additional information beyond what was in the won-loss so far:

print(stan_glm(future_odds ~ start_odds + record + ppg_diff, data=nba, refresh=0), digits=2)

            Median MAD_SD
(Intercept) 0.06   0.06  
start_odds  0.67   0.09  
record      0.20   0.11  
ppg_diff    0.01   0.00  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.03   0.00

Hard to interpret this one, as ppg_diff is on a different scale from the rest. Let's quickly standardize it to be on the same scale as the won-lost record so far:

nba$ppg_diff_std <- nba$ppg_diff * sd(nba$record) / sd(nba$ppg_diff)
print(stan_glm(future_odds ~ start_odds + record + ppg_diff_std, data=nba, refresh=0), digits=2)

             Median MAD_SD
(Intercept)  0.06   0.06  
start_odds   0.67   0.09  
record       0.20   0.11  
ppg_diff_std 0.17   0.10  

Auxiliary parameter(s):
      Median MAD_SD
sigma 0.03   0.00

OK, not enough data to cleanly disentangle won-lost record and point differential as predictors here. My intuition would be that, once you have point differential, the won-lost record tells you very little about what will happen in the future, and the above fitted model is consistent with that intuition, but it's also consistent with the two predictors being equally important, indeed it's consistent with point differential being irrelevant conditional on won-lost record.
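To see what the standardization step does and does not change, here is a sketch on simulated data. The data are made-up stand-ins for the real columns, and lm is swapped in for stan_glm to keep the example light. Rescaling a predictor by a constant divides its coefficient by that constant and leaves the other coefficients and the fit alone.

```r
set.seed(123)
n <- 30

# Simulated stand-ins for the real columns; not the data from this post
record <- runif(n, 0.1, 0.8)
ppg_diff <- 30 * (record - 0.5) + rnorm(n, 0, 2)
y <- 0.2 * record + 0.01 * ppg_diff + rnorm(n, 0, 0.03)

# Put ppg_diff on the same scale (same sd) as record
scale_factor <- sd(record) / sd(ppg_diff)
ppg_diff_std <- ppg_diff * scale_factor

fit_raw <- lm(y ~ record + ppg_diff)
fit_std <- lm(y ~ record + ppg_diff_std)

# The standardized coefficient is the raw coefficient divided by scale_factor
print(coef(fit_raw)["ppg_diff"] / scale_factor)
print(coef(fit_std)["ppg_diff_std"])

# The coefficient on record is unchanged
print(c(coef(fit_raw)["record"], coef(fit_std)["record"]))
```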

What we'd want to do here--and I know I'm repeating myself--is to repeat the analysis using data from previous years.

Interpreting the implied Vegas prediction for the rest of the season as an approximate weighted average of the preseason prediction and the current won-lost record

In any case, the weighting seems clear: approx two-thirds from starting odds and one-third from the record so far, which at least on a naive level seems reasonable, given that the season is about one-third over.

Just for laffs, we can also throw in difficulty of schedule, as that could alter our interpretation of the teams' records so far.

nba$sched_std <- nba$sched * sd(nba$record) / sd(nba$sched)
print(stan_glm(future_odds ~ start_odds + record + ppg_diff_std + sched_std, data=nba, refresh=0), digits=2)

             Median MAD_SD
(Intercept)  0.06   0.06  
start_odds   0.68   0.09  
record       0.21   0.11  
ppg_diff_std 0.17   0.10  
sched_std    0.04   0.03

So, strength of schedule does not supply much information. This makes sense, given that 30 games is enough for the teams' schedules to mostly average out.

The residuals

Now that I've fit the regression, I'm curious about the residuals. Let's look:

fit_5 <- stan_glm(future_odds ~ start_odds + record + ppg_diff_std + sched_std, data=nba, refresh=0)
fitted_5 <- fitted(fit_5)
resid_5 <- resid(fit_5)
#
pdf("nba2023_5.pdf", height=5, width=8)
par(mar=c(3,3,1,1), mgp=c(1.5,.5,0), tck=-.01)
#
par(pty="m")
plot(fitted_5, resid_5, xlab="Vegas prediction of rest-of-season record", ylab="Residual from fitted model", bty="l", type="n")
abline(0, 0, lwd=.5, col="gray")
text(fitted_5, resid_5, nba$team, col="blue")
#
dev.off()

And here's the graph:

The residual for Detroit is negative (-0.05*52 = -2.6, so the Pistons are expected to win about 3 games fewer than their regression prediction based on prior odds and outcome of first 30 games). Cleveland and Boston are also expected to do a bit worse than the model would predict. In the other direction, Vegas is predicting that Memphis will win about 4 games more than predicted from the regression model.

I have no idea whassup with Memphis. The quick generic answer is that the regression model is crude, and bettors have other information not included in the regression.

Reverse engineering an implicit Bayesian prior

OK, now for the Bayesian analysis. As noted above, we aren't given a prior for team j's average win probability, p_j; we're just given a prior point estimate of each p_j.

But we can use the empirical prior-to-posterior transformation, along with the known likelihood function, under the simplifying assumption that the 30 win-loss outcomes for each team j are independent with constant probability p_j for team j. This assumption is obviously wrong, given that teams are playing each other, but let's just go with it here, recognizing that with full data it would be straightforward to extend to an item-response model with an ability parameter for each team (as here).

To continue, the above regression models show that the Vegas "posterior Bayesian" prediction of p_j after 30 games is approximately a weighted average of 0.65*(prior prediction) + 0.35*(data won-loss record). From basic Bayesian algebra (see, for example, chapter 2 of BDA), this tells us that the prior has about 65/35 as much information as data from 30 games. So, informationally, the prior is equivalent to the information from (65/35)*30 = 56 games, about two-thirds of a season worth of information.
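The arithmetic can be sketched under a conjugate beta-binomial setup, which is an assumption made here for illustration. A prior worth n0 games combined with n observed games gives posterior mean (n0*prior + n*ybar)/(n0 + n), so the weight on the prior is n0/(n0 + n). The team used for the check is hypothetical.

```r
n <- 30          # games observed
w_prior <- 0.65  # approximate weight on the preseason line, from the regressions above

# Weight on the prior is n0/(n0 + n), so n0 = n * w/(1 - w)
n0 <- n * w_prior / (1 - w_prior)
print(n0)

# Check with a beta-binomial update for one hypothetical team
prior_mean <- 0.40   # made-up preseason line
wins <- 18           # made-up wins in the first 30 games
a0 <- prior_mean * n0
b0 <- (1 - prior_mean) * n0
post_mean <- (a0 + wins) / (a0 + b0 + n)
weighted <- w_prior * prior_mean + (1 - w_prior) * wins / n

print(c(post_mean = post_mean, weighted = weighted))
```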

Hey--what happened??

But, wait! That approximate 2/3 weighting for the prior and 1/3 weighting of the data from 30 games is the opposite of what Campos reported, which was a 1/3 weighting of the prior and 2/3 of the data. Recall: prior estimated win probability of 0.340, data win rate of 0.067, take (1/3)*0.340 + (2/3)*0.067 and you get 0.158, which isn't far from the implied posterior estimate of 0.170.

What happened here is that the Pistons are an unusual case, partly because the Vegas over-under for their season win record is a few percentage points lower than the linear model predicted, and partly because when the probability is low, a small percentage-point change in the probability corresponds to a big change in the implicit weights.
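To see how sensitive the implicit weights are, one can solve post = w*prior + (1 - w)*data for w using the Detroit numbers quoted above. In this sketch the data rate is taken as 2/30 and the grid of posterior values is hypothetical, chosen only to show the slope.

```r
prior <- 0.340      # preseason line for Detroit
data_rate <- 2/30   # win rate after 30 games (about 0.067)

# Solve post = w*prior + (1 - w)*data_rate for the weight w on the prior
implied_weight <- function(post) (post - data_rate) / (prior - data_rate)

post <- c(0.15, 0.17, 0.19, 0.21)
print(round(data.frame(post = post, w_prior = implied_weight(post)), 2))
```

Because the prior and the data differ by only about 0.27 here, each 2-percentage-point shift in the rest-of-season estimate moves the implied prior weight by roughly 7 percentage points.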

Again, it would be good to check all this with data from other years.

Skill and luck

There's one more loose end, and that's Campos taking the weights assigned to data and prior and characterizing them as "skill" and "luck" in prediction errors. I didn't follow that part of the reasoning at all so I'll just let it go for now. Part of the problem here is that in one place Campos seems to be talking about skill and luck as contributors to the team's record, and in another place he seems to be considering them as contributors to the difference between preseason predictions and actual outcomes.

One way to think about skill and luck in a way that makes sense to me is within an item-response-style model in which the game outcome is a stochastic function of team abilities and predictable factors. For example, in the model,

score differential = ability of home team - ability of away team + home-field advantage + error,

the team abilities are in the "skill" category and the error is in the "luck" category, and, ummm, I guess home-field advantage counts as "skill" too? OK, it's not so clear that the error in the model should all be called "luck." If a team plays better against a specific opponent by devising a specific offensive/defensive plan, that's skill, but it would pop up in the error term above.

In any case, once we've defined what is skill and what is luck, we can partition the variance of the total to assign percentages to each.
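Here is a minimal simulation sketch of that partition for the score-differential model above. Every number in it (spread of abilities, home advantage, noise level, number of games) is made up for illustration, and it takes the simple view that abilities plus home advantage are "skill" and the error term is "luck."

```r
set.seed(2023)
n_teams <- 30
n_games <- 2000     # hypothetical number of simulated games
sd_ability <- 4     # made-up spread of team abilities, in points
home_adv <- 2.5     # made-up home advantage, in points
sd_error <- 12      # made-up game-to-game noise, in points

ability <- rnorm(n_teams, 0, sd_ability)
home <- sample(n_teams, n_games, replace=TRUE)
# Draw an away team that is guaranteed to differ from the home team
away <- (home + sample(n_teams - 1, n_games, replace=TRUE) - 1) %% n_teams + 1

skill_part <- ability[home] - ability[away] + home_adv
luck_part <- rnorm(n_games, 0, sd_error)
score_diff <- skill_part + luck_part

# Shares of total variance; they sum to about 1, up to the sample covariance
share_skill <- var(skill_part) / var(score_diff)
share_luck <- var(luck_part) / var(score_diff)
print(round(c(skill = share_skill, luck = share_luck), 2))
```

Because home_adv is a constant in this sketch, it shifts the mean of score_diff but contributes nothing to its variance, so it drops out of the partition however it is labeled.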

Another way of looking at this is to consider the extreme case of pure luck. If outcomes were determined only by luck, then each game would be a coin flip, and we'd see this in the data because the team win proportions after 30 games would follow a binomial distribution with n=30 and p=0.5. The actual team win proportions have mean 0.5 (of course) and sd 0.18, as compared to the theoretical mean of 0.5 and sd of 0.5/sqrt(30) = 0.09. That simple calculation suggests that skill is (0.18/0.09)^2 = 4 times as important as luck when determining the outcome of 30 games.
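The coin-flip benchmark can be computed directly. The sketch below also shows an alternative accounting, which is an assumption added here and not the calculation in the text: if skill and luck variances add, the luck variance can be subtracted from the total before taking the ratio.

```r
n <- 30
sd_luck <- sqrt(0.5 * 0.5 / n)   # sd of win proportion if every game is a coin flip
sd_obs <- 0.18                   # observed sd of team win proportions, from the text

# Ratio of total variance to pure-luck variance
ratio_total_to_luck <- (sd_obs / sd_luck)^2

# If the two variances add, the skill variance is what is left over
var_skill <- sd_obs^2 - sd_luck^2
ratio_skill_to_luck <- var_skill / sd_luck^2

print(round(c(sd_luck = sd_luck, total_to_luck = ratio_total_to_luck, skill_to_luck = ratio_skill_to_luck), 2))
```

With unrounded inputs the first ratio comes out slightly below 4, and the subtractive version is one unit lower by construction.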

And maybe I'm just getting this all tangled myself. The first shot at any statistical analysis often will have some mix of errors in data, modeling, computing, and general understanding, with that last bit corresponding to the challenge of mapping from substantive concepts to mathematical and statistical models. Some mixture of skill and luck, I guess.

Summary

1. Data are king. In the immortal words of Hal Stern, the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use. I could do more than Campos did, not so much because of my knowledge of Bayesian statistics but because I was using data from all 30 teams.

2. To continue with that point, you can do lots better than me by including data from other years.

3. Transparency is good. All my data and code are above. I might well have made some mistakes in my analyses, and, in any case, many loose ends remain.

4. Basketball isn't so important (hot hand aside). The idea of backing out an effective prior by looking at information updating, that's a more general idea worth studying further. This little example is a good entry point into the potential challenge of such studies.

5. Models can be useful, not just for prediction but also for understanding, as we saw for the problem of partitioning outcomes into skill and luck.

If only Lee Bollinger

Andrew — Thu, 25 Dec 2025 14:16:04 +0000

If only Lee Bollinger, the former president of Columbia University, hadn’t already tarnished his reputation with his weak responses to the medical school sexual assaulter and the fake U.S. News numbers. If it hadn’t been for that, he would’ve been a perfect person to stand up to the government’s attack on universities. He’s an expert on the First Amendment, he’s had a long and distinguished career, and he had the full confidence of the board of trustees.

I see an analogy to Tony Blair, who had so much political authority, all destroyed by his terrible decisions on the Iraq War. As with Bollinger at Columbia, but on a much larger scale, the problem was not just that Blair made a bad call but that he made no effort to get to the bottom of the problem. Bollinger’s associations with the Columbia sexual assaulter and the Columbia fake data are magnified by his apparent lack of interest in making things right, in both these cases.

Other examples are Richard Nixon and the Watergate scandal, and Lyndon Johnson and the expansion of the U.S.-Vietnam War. As with Bollinger and as with Blair, it all seems tragic, in that these presidents squandered immense power, authority, and popularity—and you could easily imagine a world in which these events didn’t happen.

A few months ago, Bollinger gave an interview on universities in the current political climate. He had some reasonable things to say, but he doesn’t have so much moral authority to say them. I’m sure that Tony Blair has some reasonable things to say about foreign policy, too. It’s too bad it had to be this way.

Slop is not distinguishable by its attributes. It is an attitude of production

Jessica Hullman — Wed, 24 Dec 2025 18:54:01 +0000

Since it’s dictionary week here on the blog, why not discuss Merriam-Webster’s word of the year: slop. They define it as:

digital content of low quality that is produced usually in quantity by means of artificial intelligence.

Max Read discusses conventional associations with slop–qualities like “forgettability, predictability, unoriginality, lifelessness” or “cheap, low-effort, convenient, consumable, interchangeable.” He collects several more pointed definitions from the web:

“a low-to-zero marginal-cost substitute for something valued, or something being aggressively positioned to substitute for craft” from Bluesky

“the negative platonic form: not the ideal that particulars aspire toward, but the silhouette left when you subtract everything that would make a specific instance rather than a thing of a type” from Kevin Baker

He also proposes his own definition:

“slop” is that which is “fully optimized” to its domain to the point of texturelessness or characterlessness. “Slop” in this sense is anything designed to be as easy as possible to produce, sell, and consume, but it’s particularly slop at the point where all or most other players in the same space adopt the same strategies, and the material is no longer individual or differentiated from its competitors.

I enjoyed all of these. They paint slop as a kind of mass-produced shell rushing toward you at the speed of modern silicon chips.

But these definitions also all miss a defining feature of slop, the thing that makes me feel vaguely repulsed when I see it despite the superficial harmlessness of what is often just some generic message or image or text. Slop is not merely a genre of media, it is an attitude of production, a cynical operating posture that is offensive not just on a surface level of insulting the consumer’s intelligence, though there is that. It is an ethos of resigned instrumentality that disgusts us with its intentional satisficing and lack of effort the way kitsch disgusted some art critics, a refusal of responsibility to authenticity, situatedness, and the risks associated with individualistic expression. A practical nihilism that threatens to engulf our own more sparse yet genuine attempts at production. From this perspective, the act of denial makes slop more like a spiritual threat than a type of content.

Speaking of kitsch, I think it’s worth distinguishing art from slop. Kevin Baker’s definition of slop as a kind of shell devoid of any individual substance reminds me of certain philosophical arguments about art post-modernity. Various writers have described how after the emergence of the conception of “taste” in art, and the series of events that led up to moments like Duchamp installing a toilet in a gallery and calling it art, great art can no longer have “positive” content. It can only refer to the absence of something. In this sense art is irony. And yet, while I think humans can very much create slop without AI, I don’t think of much art as slop, because whether doomed to be self-referential or not, making art implies belief in something on the part of its creator, a kind of taking of responsibility to interaction. To make art is to anticipate its completion through the viewer.

For example, lately I’ve been thinking about pop art. I was in Pittsburgh and went to the Warhol museum. I was in Copenhagen and went to the Louisiana museum, where they happened to have a Marisol exhibition. Contemporary art has a special place in my heart, but I don’t like pop art. I never really have. However, I don’t think it’s fair to call it slop, even though it would fit many of the definitions above–it’s cheap, low-effort, could be produced in bulk, designed to mimic the predictable and forgettable. I can respect someone like Warhol because at the time, the work expressed a point of view, it contributed to a conversation, and by doing so opened a door to possibility, like all great art aspires to do.

Slop, on the other hand, is talking when you have nothing to say. Slop is a waste of your time as a consumer, but also a waste of time for the author, who pleads for attention while denying themself a chance at discovering meaning. In this way, one could say slop is a matter of life or death, since after all, every moment is bringing us closer to death.

P.S. Merry Christmas to those who celebrate!

Holiday open thread: Correct me! Point out all my mistakes.

Andrew — Wed, 24 Dec 2025 14:01:55 +0000

I’ve made lots of mistakes, and failure has been good for my intellectual development. But this only works if I learn from my errors, which in turn requires me to be aware that I got things wrong in the first place. Four times I’ve had errors that were big enough that I issued formal corrections. And then there was the time that I messed up on a statistical analysis that I’d posted on the blog and someone pointed out my error, causing me to spend a couple months figuring out how to do better.

There’s also this talk I gave on learning from mistakes and this one too. And also this.

And I’ve made various smaller slips over the years that I’ve had to correct (for example here and here). It happens!

And there was the embarrassing error in the second paragraph of this article. I think that particular mistake was introduced by the copy editor, but I’ll still have to take responsibility for not carefully checking and letting it slip through.

An opportunity for all of you

Given all this, it only stands to reason that there must be a bunch of things I’ve done wrong that I haven’t yet realized were in error. Or mistakes I’ve made that I’ve corrected, but the corrections are not prominent enough and people continue to miss the point. Or simply some places where we disagree: things I’ve written where you think I’m wrong, even past my insistence otherwise. Fair enough.

This is your chance! I’m speaking to my friends as well as the haters out there. Let me know what I’ve done wrong. This is your chance to help me learn from my mistakes, to force me to confront disagreements, also a way for you to share your gripes with others. Point out my mistakes right here in the comments, and tens of thousands of readers will be made aware of them.

It’s your opportunity to help me out, to embarrass me, and to get your perspective out there. So go at it. Complain loudly. The squeaky wheel gets the grease.

Survey Statistics: is a mismeasured X better than none at all ?

shira — Tue, 23 Dec 2025 22:42:16 +0000

We’ve talked a lot about the “Representation” side of the Groves et al. figure below (also in last month’s holiday-timed blog). Especially nonresponse error.

The “Measurement” side focuses on the gaps between a construct we want to measure (e.g. someone’s current vote choice) and the response we record for them. This concerns what we’ve been calling the “Y” variable.

But there can be measurement problems also with what we’ve been calling the “X” auxiliary variables. Remember poststratification (e.g. MRP), where we attempt to reduce nonresponse bias (on the “Representation” side). We calibrate our estimates of means E(Y) to population data about another variable X, using E(Ehat(Y | X, sample)). Both X and Y can be mismeasured. Let’s focus today on mismeasured X. Can it still help us ?

Let Y be someone’s current vote choice. It is controversial to adjust for a recalled past vote X* because it might be quite different from actual past vote X. We have population data on X (not X*) from past elections. Why might X and X* differ ? For example, a Harris 2024 voter might say they voted for Trump in 2024 (“winner’s bias”). Or a current (2025) Democrat supporter might say they voted for Harris in 2024, even if they voted for Trump in 2024 (“consistency bias”).

So how might this affect poststratification ? Let’s consider a simple example with only one “X”.

Y = current 2025 support
X = true 2024 vote choice
X* = response for 2024 vote choice

All are binary, = 1 for Democrats and 0 for Republicans.

If by some miracle this one X is enough to handle nonresponse bias, then E(Y | X, sample) = E(Y | X). So we get the E(Y) we want by E(E(Y | X, sample)). Let’s write it in an expanded way:

P(X=1) E(Y | X = 1, sample) + P(X = 0) E(Y | X = 0, sample)

We have P(X=1) and P(X=0) from the 2024 election results. But we can’t directly estimate E(Y | X = 1, sample) because we only have X* in the sample.

Consider two choices:

Use the imperfect X*: P(X=1) E(Y | X* = 1, sample) + P(X = 0) E(Y | X* = 0, sample)
Don’t use it: E(Y | sample)

How would these compare to each other and to the true E(Y) ?

Suppose there is winner’s bias, so some folks voted Democratic (X = 1) but say they didn’t (X* = 0). Folks that still say they voted for Democrats are probably still supportive, so E(Y | X* = 1, sample) might be higher than E(Y | X = 1, sample).

In general, it helps to consider all 4 possible types of folks based on their true X and reported X*. What is the population distribution P(X,X*) ? The distribution in the sample P(X,X* | sample) ? Their Y distribution P(Y = 1 | X,X*) ?

Considering these simple cases helps understand how adjusting for an imperfectly measured past vote could affect results.
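One way to work through the questions above is to write down a made-up population and compute each estimator exactly. The sketch below does that in Python. Every number in it is hypothetical: the cell shares, the support rates, and the response rates are all illustrative assumptions, not estimates. It assumes winner's bias only (some X = 1 voters report X* = 0) and a response rate that depends only on true X, which is the "miracle" case in which adjusting for true X would remove the nonresponse bias.

```python
# Hypothetical population, split by true vote X and reported vote X*.
# All numbers are made up for illustration.
cells = {
    # (X, X*): (population share, P(Y=1 | X, X*), response rate)
    (1, 1): (0.45, 0.95, 0.20),  # voted Democratic and say so
    (1, 0): (0.05, 0.60, 0.20),  # voted Democratic, report Republican (winner's bias)
    (0, 0): (0.50, 0.10, 0.10),  # voted Republican and say so
}

# Known population margins of true X (hypothetical election result).
p_x = {1: 0.50, 0: 0.50}

def sample_mean_y(keep):
    """E(Y | selected cells, sample): cells weighted by share * response rate."""
    num = sum(s * r * py for k, (s, py, r) in cells.items() if keep(k))
    den = sum(s * r for k, (s, py, r) in cells.items() if keep(k))
    return num / den

# Target: the population mean of Y.
truth = sum(s * py for s, py, r in cells.values())

# Choice "don't use it": the raw sample mean.
unadjusted = sample_mean_y(lambda k: True)

# Infeasible benchmark: poststratify on true X (k[0]).
adjust_true_x = sum(p_x[x] * sample_mean_y(lambda k, x=x: k[0] == x) for x in (0, 1))

# Choice "use the imperfect X*": poststratify on reported X* (k[1]).
adjust_reported = sum(p_x[x] * sample_mean_y(lambda k, x=x: k[1] == x) for x in (0, 1))

for name, value in [("truth", truth), ("unadjusted", unadjusted),
                    ("adjust on true X", adjust_true_x),
                    ("adjust on reported X*", adjust_reported)]:
    print(f"{name:22s} {value:.4f}")
```

In this particular made-up scenario, adjusting on X* lands closer to the truth than not adjusting, but it is still biased, and E(Y | X* = 1, sample) is higher than E(Y | X = 1, sample), as suggested above. Other choices of cell shares and response rates may give a different ranking, which is the reason to try several.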

The problems with popular internet heuristics such as “Hanlon’s razor,” “steelmanning,” and “Godwin’s law,” all of which kind of fall apart in the presence of actual malice, actual bad ideas, and actual Nazis.

Andrew — Tue, 23 Dec 2025 14:38:04 +0000

From my review of Dan Davies’s book on business fraud:

Fraud might be an unusual “tail risk” in business, but in science it’s usual. It happens all the time. Just in my own career, I had a colleague who plagiarized; another one who published a report deliberately leaving out data that contradicted the story he wanted to tell; another who lied, cheated, and stole (I can’t be sure about that one as I didn’t see it personally; the story was told to me by someone who I trust); another who smugly tried to break an agreement; and another who was conned by a coauthor who made up data. That’s a lot! It’s two cases that directly affected me and three that involved people I knew personally. There was also Columbia faking its U.S. News ranking data; I don’t know any of the people involved but, as a Columbia employee, I guess that I indirectly benefited from the fraud while it was happening. I’d guess that dishonesty is widespread in business as well.

This led me to a point that’s important enough that it deserves a post of its own (i.e., this one):

This also reminds me of the problems with popular internet heuristics such as “Hanlon’s razor,” “steelmanning,” and “Godwin’s law,” all of which kind of fall apart in the presence of actual malice, actual bad ideas, and actual Nazis. The challenge is to hold the following two ideas in your head at once:

1. In science, bad work does not require cheating; in science, honesty and transparency are not enough; just cos I say you did bad work it doesn’t mean I’m accusing you of fraud; just cos you followed the rules as you were taught and didn’t cheat it doesn’t mean you made the discovery you thought you did.

2. There are a lot of bad guys and cheaters out there. It’s typically a bad idea to assume that someone is cheating, but it’s also often a mistake to assume that they’re not.

A related point from that post:

Davies refers to “the vital element of time” in perpetuating a fraud. A key point here is that uncovering the fraud is never as high a priority to outsiders as perpetuating the fraud is for the fraudsters. Even when money is at stake, the amount of money lost by each individual investor will be less than what is at stake for the perpetuator of the fraud. What this means is that sometimes the fraudster can stay alive by just dragging things out until the people on the other side get tired. That’s a standard strategy of insurance companies, right? To delay, delay, delay until the policyholder just gives up, making the rational calculation that it’s better to just cut your losses.

I’ve seen this sort of thing before, that cheaters take advantage of other people’s rationality. They play a game of chicken, acting a bit (or a lot) crazier than anyone else. It’s the madman theory of diplomacy. We’ve seen some examples recently of researchers who’ve had to deal with the aftermath of cheating collaborators, and it can be tough! When you realize a collaborator is a cheater, you’re dancing with a tiger. Someone who’s willing to lie and cheat and make up data could be willing to do all sorts of things, for example they could be willing to lie about your collaboration. So all of a sudden you have to be very careful.

P.S. I talked about other problems with “steelmanning” here.

Postdoc opportunity at Stanford and Chicago on Bayesian hierarchical modeling and partial pooling for improving the accuracy and equity of property tax assessments

Andrew — Mon, 22 Dec 2025 21:11:05 +0000

Evelyn Smith writes:

I’m reaching out because Stanford RegLab and the Mansueto Institute at the University of Chicago are beginning a search for a postdoctoral scholar, and I wanted to ask whether you might recommend any strong PhD students or recent graduates.

We’re seeking candidates with expertise in Bayesian hierarchical modeling, partial pooling, geostatistics, or related areas for applied work using fine-grained geographic data to improve the accuracy and equity of property tax assessments. The role would be a joint position between Stanford and U Chicago.

Because we hope to hire by early spring (March/April), any referrals you may have would be greatly appreciated. Please feel free to share this note with anyone who might know promising candidates. I’m happy to speak directly (esmith@abfn.org) with anyone who’d like to learn more about the project or role.

This looks interesting, also it’s an important topic. And I’ll add this: if you’re doing statistics or social science, it’s super important to work on real problems–not just “real data” but what I call “live problems” where stakeholders really care about the answers, not just to get publications or whatever but because there are policy implications. This seems to be the case here, also of course it’s a good sign that they’re already interested in Bayesian hierarchical modeling and partial pooling.

From one perspective, “property tax assessments” could sound kinda boring. But it’s a big deal in this country. So go at it!

Who else is in the goddam dictionary?

Andrew — Mon, 22 Dec 2025 14:22:41 +0000

Following up on the recent posts by Jessica and me, I thought I’d look up some other people.

Hey, here’s somebody we know:

But no citations from Satoshi Kanazawa, Edward Wegman, Susan Fiske, Robert Sternberg, Lawrence Summers, or Jeffrey Epstein.

Since we’re on the topic, no citations from Noam Chomsky either (not even for “colorless”? C’mon, Merriam-Webster, you can do this!), but he’s got us one better as he’s an actual dictionary word:

Meanwhile, Dan Ariely’s all over the Webster-verse:

Maybe we can use all these words in a sentence:

OPTIMAL . . . SOCIAL . . . FALLACY . . . LEGAL TENDER . . . SHECKEL . . . GOAD

I’m not quite sure how to do this, but all the above words seem very Ariely-related.

Amusingly enough, I’m mentioned on the page for GOAD–but I’m just being referred to, not quoted.

Here’s the Ariely citation:

I think, from Ariely’s perspective, the unattainable goal was to make an actual discovery in psychology; the attainable goal was to get some LEGAL TENDER by working with people’s SOCIAL expectations, maybe not OPTIMAL from my perspective as I think of belief in such claims as a FALLACY and they GOAD me into spending my time in what are ultimately less than OPTIMAL pursuits.

On the plus side, friend-of-the-blog Jordan Ellenberg has three:

MARKOV CHAIN, INFORMATION THEORY, and HAGGARD . . . not bad!

And, speaking of Jordan, how about the three Michael Jordans? Merriam-Webster doesn’t appear to be using any citations from them, but Michael B Jordan and Michael J Jordan are mentioned in several of the citations. Same thing with Paul Erdos and Kevin Bacon.

“I think there’s an argument to be made that much meta-scientific work is a kind of mirror image of the empirical work it critiques”

Andrew — Sun, 21 Dec 2025 14:15:32 +0000

In the context of our recent discussion of the p-curve paper, Richard Morey wrote, “I think there’s an argument to be made that much meta-scientific work is a kind of mirror image of the empirical work it critiques,” and he shared this chart:

I think Morey is on to something here, but, as someone who does a lot of empirical science and a lot of meta-science, I think there’s one big thing he’s missing, one major asymmetry between empirical science and meta-science, and that is that bad empirical science makes strong claims, and the role of meta-science is to question the evidential support behind these claims, not usually to make a positive claim in itself.

The usual pattern goes like this: empirical scientists collect data D, perform analysis A, and use these to make strong general claim X about the world. The meta-scientist then comes along to assess the evidence. A negative meta-science analysis comes to the conclusion that D + A do not provide good evidence for X. The meta-science analysis does not make the strong claim that X is false, let alone the even stronger claim that some preferred alternative Y is true.

This comes up all the time. Some Cornell psychology professor claims to have strong evidence for extra-sensory perception or influence of food labeling on eating or whatever. The meta-scientist comes along and notes irregularities with the data or analysis and provides an alternative story of how these apparently convincing patterns in data could have come to be. The conclusion of the meta-scientific report is not that ESP or large effects of food labels don’t exist but rather that the published record does not provide good evidence of these extraordinary claims. (And indeed the claims are extraordinary, which is how they got so much publicity in the first place.)

It’s the all-important distinction between truth and evidence. I know that Morey understands this distinction and I’m not saying that anything in his above chart is wrong; I’m just trying to put it in the larger perspective of scientific inquiry.

In discussing the above asymmetry between empirical science and meta-science, I’m not saying that meta-science is better. Meta-science is fundamentally parasitic on empirical science, and, sure, empirical science is associated with bold claims, but it’s through making bold leaps–and being willing to retract those leaps as needed–that we make progress. The problem with bad science is not so much the overconfident conjectures–such steps may be psychologically necessary–as the unwillingness to reflect on contrary evidence, the unwillingness to admit error, and the practice of not confronting past mistakes.

And also the really stupid things that people say and never apologize for.

Hey, I’m in the dictionary (too!)

Jessica Hullman — Sun, 21 Dec 2025 02:57:43 +0000

I guess someone at Merriam-Webster has good taste!

Though it figures that they’ve associated me with all of my least favorite uncertainty visualizations. With my luck someone will probably put an error bar on my tombstone (really, please don’t).

Hey, I’m in the dictionary!

Andrew — Sat, 20 Dec 2025 14:31:37 +0000

The House of Commons, the Supreme Court, Private Eye, and, above, Merriam-Webster. Now I can truly retire.

Validating language models as study participants: How it’s being done, why it fails, and what works instead

Jessica Hullman — Fri, 19 Dec 2025 16:57:47 +0000

This is Jessica. Earlier this year, I started paying attention to proposals to use LLMs to simulate participants in surveys and behavioral experiments. The idea is that LLMs can be prompted with experiment or survey instructions and a participant persona (e.g., demographic description), making it possible to simulate target human samples without the cost and headache of recruiting real people. A number of papers have pointed to promising results, like where LLM results are moderately to highly correlated with human study results, to argue that they could transform behavioral science: by increasing sample sizes, generating missing counterfactuals, allowing us to learn about hard-to-reach populations or ethically fraught situations, etc.

The obvious elephant in the room is validation: how do you establish that conclusions drawn about human behavior from analyses that substitute or augment human data with LLM outputs are valid, in the sense that using LLM outputs doesn’t systematically bias your ability to estimate the target human parameter (mean effects, regression coefficients, etc.)? Many papers on this topic deal with this in a loose, heuristic way. For example, authors will demonstrate partial replication of some human results with LLMs, then go on to argue that LLMs could be used to approximate human behavior more broadly in that domain. Some attempt to codify this kind of heuristic validation.

So we decided to write something specifically about validating LLM study participants: what the landscape of approaches people are taking looks like, and of these, which meet minimum requirements for getting valid downstream parameter estimates. David Broska, Huaman Sun, Aaron Shaw and I write:

A growing literature presents large language model systems (LLMs) as a transformative data source for simulating human behavior in experiments. However, to arrive at credible conclusions when substituting these AI surrogates for human participants requires showing that LLMs can approximate the target human responses or parameters. We characterize approaches to validation in the literature. A heuristic approach argues for generalization based on strong resemblance between humans and AI surrogates on a subset of tasks, often in combination with ex-ante “repair strategies” designed to reduce LLM-contributed biases through prompt engineering or model fine-tuning. However, the lack of accounting for remaining bias precludes the researcher from attaining basic validity guarantees customary for confirmatory research. In contrast, a statistical calibration approach uses auxiliary human data and statistical adjustment to account for discrepancies between observed and predicted responses. Calibration approaches help ensure that use of AI surrogates does not mislead researchers who claim to ultimately target human behavior. They are not, however, a panacea; even when assumptions hold, benefits may be modest as a result of high variability in behavioral targets. Restricting LLM use to predicting effects in discovery-oriented research avoids validation challenges, but requires caution in interpreting effect sizes. We propose ways that LLMs could help researchers address pervasive blind spots in design and analysis if used instead to challenge researchers’ expectations about effect size and analysis.

Heuristic validation

We first characterize ways that authors are using a “validate-then-simulate” pattern, where they first compare human and LLM study results to demonstrate face validity–by showing that direction or significance of effects is preserved, or that LLM results are highly correlated with human results, or that it’s hard to statistically distinguish LLM responses from human ones, etc. It is then implied that because the LLM replicated the important structure in the example cases, it will also replicate the important structure in related cases. The problem is that arguments based on face validity can’t provide the kind of guarantees we typically expect of inferential methods. We don’t expect to be able to present study results as causal estimates if we haven’t attempted to meet basic independence conditions, or to interpret OLS coefficients if we’re not willing to make assumptions about linearity and residuals. Similarly, we should not be content to trust behavioral estimates that involve LLMs unless either a) sufficient conditions for valid inference have been demonstrated, or b) authors have explicitly taken steps to account for bias in downstream analyses.

Why heuristic validation doesn’t work

We summarize what conditions would have to hold for direct substitution, based on recent theoretical treatments. For example, Ludwig et al. (2025) use an econometric model to define two conditions that should hold for directly substituting LLM labels for human labels to serve as a general method. In a nutshell, unless you’re willing to make strong assumptions about how the prompts (e.g., experimental scenarios or survey instruments) that you’re studying are sampled from the space of all relevant prompts, you need to ensure 1) that there is no leakage between the experimental scenarios or survey instruments you’re studying and the model training data, and 2) that the relevant conditions (i.e., moment criteria) that need to hold for your analysis to identify the target human parameters still hold when you plug in LLM responses instead. Essentially, the potential for LLM errors to be correlated with covariates you care about means that even if the LLM shows very small bias in predicting the human responses, your estimates of population means, regression coefficients, or other target parameters could still be pretty far off.
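A toy example may make the last point concrete. The Python sketch below uses made-up responses for two groups defined by a covariate (a treatment indicator, say). The LLM's errors are assumed to be +0.5 in one group and -0.5 in the other, so they cancel on average while being perfectly correlated with the covariate. None of these numbers come from a real study.

```python
from statistics import mean

# Hypothetical human responses in two groups defined by a covariate.
human = {"control": [3.0, 4.0, 5.0], "treated": [4.0, 5.0, 6.0]}

# Hypothetical LLM responses: 0.5 too low for control, 0.5 too high for treated.
llm = {"control": [2.5, 3.5, 4.5], "treated": [4.5, 5.5, 6.5]}

all_human = human["control"] + human["treated"]
all_llm = llm["control"] + llm["treated"]

# Average prediction bias over all responses.
overall_bias = mean(all_llm) - mean(all_human)

# The target parameter here is the difference in group means.
human_effect = mean(human["treated"]) - mean(human["control"])
llm_effect = mean(llm["treated"]) - mean(llm["control"])

print("overall bias:", overall_bias)
print("effect from human data:", human_effect)
print("effect from LLM data:", llm_effect)
```

The overall bias is zero, yet the group difference estimated from the LLM responses is twice the one in the human data, because the errors line up with the covariate.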

Calibration as alternative

Consequently, showing that an LLM adequately replicates target human behavior on some subset of tasks isn’t sufficient evidence for expecting generalization to related tasks. But that’s ok–a number of recently proposed approaches demonstrate calibration approaches, where a small amount of jointly (human and LLM) labeled data is used to adjust an estimator so that LLM responses can be integrated without biasing downstream parameter estimates. For example, previously we discussed my co-author David Broska’s work on Mixed Subjects Design, which uses a prediction-powered inference approach to define a hybrid estimator. The PPI estimator is centered on the same parameter targeted by the human subjects estimator (e.g., a population mean, a regression coefficient, etc.) but when properly tuned, is designed to be at least as precise based on the inclusion of the LLM responses. Other approaches to augmented estimators, like design-based supervised learning, make slightly stronger assumptions about how the human-labeled instances are sampled.
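To give a sense of the mechanics, here is a minimal Python sketch of a prediction-powered estimate of a population mean. It is a generic illustration of the idea, not the implementation from the Mixed Subjects Design work. The function name `ppi_mean`, the tuning weight `lam`, and the tiny data set are all assumptions made for this example.

```python
from statistics import mean

def ppi_mean(y_human, f_labeled, f_unlabeled, lam=1.0):
    """Prediction-powered estimate of a population mean (illustrative sketch).

    y_human:     human responses on the jointly labeled units
    f_labeled:   LLM predictions for those same units
    f_unlabeled: LLM predictions for units that have no human response
    lam:         tuning weight; 0 ignores the LLM, 1 uses it at full weight
    """
    # Start from the human-only mean, then shift it by the difference between
    # the LLM's mean on the large unlabeled set and on the labeled set.
    # If both sets come from the same population, that shift has expectation
    # zero, so the estimate stays centered on the human target.
    return mean(y_human) + lam * (mean(f_unlabeled) - mean(f_labeled))

# Made-up binary responses.
y_human = [1.0, 0.0, 1.0, 0.0]
f_labeled = [1.0, 1.0, 1.0, 0.0]      # the LLM over-predicts on labeled units
f_unlabeled = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]

llm_only = mean(f_unlabeled)
human_only = ppi_mean(y_human, f_labeled, f_unlabeled, lam=0.0)
ppi_full = ppi_mean(y_human, f_labeled, f_unlabeled, lam=1.0)

print(llm_only, human_only, ppi_full)
```

With `lam=1.0` the estimate equals the LLM mean on the unlabeled units plus the average human-minus-LLM gap on the labeled units, which is how the jointly labeled data corrects for LLM bias. Choosing `lam` from the data to minimize variance is what "properly tuned" would involve; that step is left out here.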

Of course, the existence of these approaches doesn’t automatically mean that it’s worth your time to figure out how to work with LLM subjects in your studies. That’s a harder call. There are lots of proposals around what we call “repair strategies”–tips on prompt engineering, model choice, model fine-tuning, etc.–to improve the fidelity of LLM predictions to human responses. But we should also keep in mind that how much more we can learn about humans by incorporating LLM subjects will depend in part on how noisy the target human behavior is. In fields like psychology, the noise ceiling (i.e., maximum predictive performance any model could reach) is low, and so we shouldn’t be too surprised if adding large numbers of LLM responses yields only modest gains. This tracks with what we see in the handful of existing demonstrations of calibration approaches like PPI to human subjects data: gains in effective sample size are only up to about 15%, even when the amount of LLM simulated data is much much larger than the human sample.

Reserving LLMs for discovery of effects, or to act as design adversary

Another proposal that has been circulating is to rely on LLM simulations purely to aid discovery of effects: simulate a bunch of experimental scenarios to figure out where there’s an association, or stress test your study instrument to improve question wording or other components. There’s nothing wrong with this, but we should keep in mind that many behavioral studies are targeting small effects. Again, just a little bit of bias can make effect estimates from LLM simulations misleading if the bias correlates with covariates you care about. And this is not at all implausible: it happens, for example, when LLMs are systematically more accurate for certain types of participants, or for certain scenarios that are closer to what’s in the training data.

Something we propose that I haven’t seen come up is the idea of using the LLM to challenge or interrogate your expectations during design and interpretation of study results. There’s a lot of focus on using LLM subjects to make predictions about what’s probable, but they could also be used simply to explore possibilities, and help you better understand the kinds of implicit commitments you’re making in design or interpretation of results. For example, fake data simulation is a powerful practice for designing better experiments, but I get the sense that many human subjects researchers don’t do it by default. As a natural language programming interface LLMs could make this a lot easier – you ask them to simulate data under different expectations about effect magnitude and variance, to help you reason about sample size and other aspects of study design. Inspired by Causal Quartets, you could employ them to help you think through implications of heterogeneity for the estimate you’re using to power your study or that you observed when you ran the study with humans. Similarly, they could help by generating deviant but plausible data scenarios for stress testing your planned modeling approach. It’s a slight shift in framing to see them as tools for stimulating imagination and interrogating assumptions instead of as soothsayer, but could be a productive one.
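As a rough illustration of what fake data simulation for study design can look like, here is a short Python sketch that simulates a two-group experiment under an assumed effect size and noise level and reports how often the difference would come out statistically significant. The function name `simulated_power`, the normal outcome model, the 1.96 cutoff, and the specific effect sizes and sample sizes are all assumptions chosen for this example.

```python
import random

def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def simulated_power(effect, sd, n_per_group, n_sims=1000, seed=1):
    """Share of simulated experiments in which |z| exceeds 1.96."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Fake data: normal outcomes, with the treated group shifted by `effect`.
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        m_c, v_c = mean_and_var(control)
        m_t, v_t = mean_and_var(treated)
        # Standard error of the difference in means, then the z statistic.
        se = (v_c / n_per_group + v_t / n_per_group) ** 0.5
        z = (m_t - m_c) / se
        hits += abs(z) > 1.96
    return hits / n_sims

# Compare designs under different assumptions about the effect and sample size.
for effect in (0.2, 0.5):
    for n in (20, 100):
        print(effect, n, simulated_power(effect=effect, sd=1.0, n_per_group=n))
```

Changing `effect` and `sd` to reflect different expectations, or replacing the normal draws with something messier such as heterogeneous effects across subgroups, is the kind of variation the paragraph above has in mind.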

What’s your Jordan3 number?

Andrew — Fri, 19 Dec 2025 14:43:03 +0000

In the discussion of our post, Who has the lowest Erdos-Bacon-Epstein number? (the winner appears to be the mathematician Daniel Kleitman, my freshman-year academic adviser at MIT!), an anonymous commenter asks:

Is there anyone with a finite Michael Jordan^3 number (acting with Michael B Jordan, coauthoring with Michael I Jordan, playing on a team with Michael J Jordan)?

Good question! In the earlier post we discussed the rules for what counts as being in the acting network (IMDB and with a legitimate acting credit, not just being interviewed) and the academic authorship network (scholarly journals).

What about playing on a team? What would it take to be in the Michael J Jordan network? It would be too much to restrict to players on NBA teams. I’d allow any college team–but only varsity would count, not intramurals–but even that is pretty darn restrictive, so I think I’d count high school varsity as well.

I guess that lots of guys who’ve played high school varsity basketball have some connection to Jordan. You just need to have one player on your team who played in college, then one guy in that player’s college team who ever made it to the NBA, and then the graph must be complete from there. You could also get there through a different sport–for example, maybe you played football, and someone on your football team played basketball, and someone on their team played in college . . . or maybe someone on your football team played college football for a while, and someone else on that team played basketball in high school, and someone else on his high school team played basketball in college, etc.

I’m guessing that somewhere there are people who (a) have acted in at least one movie, (b) have coauthored at least one academic article, and (c) played on a high school varsity team. And if you have all three of these attributes, you have a shot at having a finite Jordan^3 number.

I can’t do it myself, as I’ve never acted and I’ve never played varsity sports.

I do have a cousin who’s acted on TV, though. This one show he was on has a huge list of famous names, which I guess can happen for a TV show that runs for lots of episodes, but, still, the very long list includes the still famous Billy Dee Williams, along with vaguely-familiar faces such as Dennis Christopher, Max Gail, Stuart Margolin, as well as G. Gordon Liddy (!) and someone named Tony W. Randall (no, not the Odd Couple guy) and someone named Robert Axelrod (no, not the political scientist). My cousin also was in the Olympics, and maybe someone on his team also played serious high school sports, so he could well have a finite Michael J Jordan number too. But his Michael I Jordan number is infinite, because he has no academic publications. Just to check this out, I searched for my cousin’s name on Google Scholar, but all I found were two papers by his dad, and they’re single-authored so that wouldn’t work either. My uncle was no academic; he was a doctor who many years ago was enthusiastic about computer touchpad and voice-recognition technology and wrote a couple articles about a system he was trying to sell for computerized medical records.

And then there’s Michael J Jordan, who by definition has a Michael J Jordan number of 0, and he starred in Space Jam, and that movie has a long cast list, so I’m guessing his Michael B Jordan number is no more than 4. But no scholarly publications (no, this namesake doesn’t count), so his Michael I Jordan number is infinity.

I’m guessing, though, that there are some people out there with that finite Jordan^3 number. Any ideas? Someone you know who’s acted in a legit production, coauthored a scholarly publication, and played on a high school sports team? No Jeffrey Epstein connection required.

Everything I need to know I learned in Little League

Bob Carpenter — Thu, 18 Dec 2025 20:00:05 +0000

This post is by Bob

“Little League” is what we call baseball for kids in the United States. I often tell people that I learned a ton about how to behave and how to approach problems, teamwork, and life in Little League. Turns out I’ve been saying that for a while. My sister just sent me this little poster I made for my dad at some point.

Dad repeated this advice regularly, even decades after my baseball-playing days. I still believe it’s good advice, so I’m sharing.

I put the most important advice first—keep your eye on the ball. That’s really key to just about anything.

I have found that hustle is also really critical in life. Dad and I loved hustling baseball players like Pete Rose. Dad used to drive me from Detroit to Cincinnati in the early 70s to see the Big Red Machine in person, then drive back for work the next day. I find it demoralizing today how players just watch their hits rather than hustling as soon as there’s contact. I really miss “little ball”, which is why Cleveland’s my favorite team (that and it’s Mitzi’s home town).

The staying loose part is also really important and really hard. No editor, so I included keeping your eye on the ball twice. Without the duplication, I could have saved enough space to not cramp the bottom—otherwise, my graphical layout’s pretty good.

For me, sportsmanship is really critical. It also makes me sad that players only shake hands with their own team after the end of the game. We always had to go and shake every other player’s hand and tell them “good game” (even if it wasn’t). And the pros did the same.

I can’t emphasize the teamwork advice enough for the real world—part of that should have said “there’s enough credit to go around.” I should have put that higher up. Listening to how star players respond to interviews is key—it’s usually along the lines of, “I’m just trying to play my role and help the other 8 guys out on the field.”

Getting in front of the ball is also super important not only literally in baseball, but also metaphorically in life. You can do so much by just getting in front of the ball. It might hurt a bit when it hits you if you can’t catch it cleanly, but at least it didn’t get by you! I might rephrase “throw overhand” as “take the straight ahead approach” rather than “trying to get fancy.”

As a bonus, my sister also sent along this photo of our Little League days in Detroit.

That’s dad in the back and me on the far left of the back row. This is 1972 or 1973, so I was 8 or 9 years old (top row, far left) and dad was only 29 or 30. At the time this was taken, he was paying his way through law school photographing sports teams and accident scenes (I tagged along to both and “helped” in the darkroom). I love the attention to detail in the arrangement of gloves on the first row and the classically crossed bats—I also learned photography and design from dad, not to mention printing. Also notice how poor the neighborhood was. One of my teammates, Adam (can’t recall his last name), is wearing dress shoes; you can’t see Tibor’s shoes in the back, but they were mostly duct tape. We couldn’t even afford new baseballs, so I’m not sure how dad managed the spiffy uniforms—probably shilling a pizza joint or auto body repair on the back.

EVERYTHING I NEED TO KNOW I LEARNED IN LITTLE LEAGUE^*

DAD ON THE GAME
keep your eye on the ball

DAD ON HUSTLE
run, don’t walk

DAD ON BATTING
stay loose, keep your shoulder down & keep your eye on the ball

DAD ON SPORTSMANSHIP
don’t be a bad loser & don’t be a bad winner; shake hands

DAD ON TEAMWORK
there are 8 other players to help you

DAD ON FIELDING
get in front of the ball & keep your eye on it

DAD ON THROWING
overhand, it goes straight

^* FROM MY DAD [Mack L. Carpenter]

“Re-examination of the 3/4-law of metabolism” and “Toward a metabolic theory of ecology”

Andrew — Thu, 18 Dec 2025 14:37:37 +0000

The above graphs are from Regression and Other Stories, as a demonstration of the log-log transformation, but there are some questions of where this 0.74 slope comes from. Simple geometry would suggest a slope of 2/3 (if animals are spheres of constant temperature, they will radiate heat in proportion to their surface area), but there’s this idea that larger animals are more sphere-like and run colder, compared to smaller animals.
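To make the log-log idea concrete: if metabolic rate scales as a power of mass, rate = c * mass^b, then log(rate) = log(c) + b * log(mass), so the exponent b shows up as the slope of a straight-line fit on the log scale. Under the spherical-animal argument, surface area goes as mass^(2/3), which is where the 2/3 comes from. The sketch below uses purely simulated data (not the data behind the graphs), with an assumed true exponent of 0.75 and made-up noise, to show a least-squares fit on the log scale recovering the exponent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical body masses (kg) spread over several orders of magnitude.
log_mass = rng.uniform(np.log(0.01), np.log(5000), size=200)

# Assumed power law for the simulation: rate = c * mass^0.75, with noise on the log scale.
true_slope = 0.75
log_rate = 1.0 + true_slope * log_mass + rng.normal(0.0, 0.3, size=200)

# Straight-line fit on the log-log scale; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(log_mass, log_rate, deg=1)
print("estimated exponent:", slope)
```

With masses spanning this wide a range, exponents of 2/3 and 3/4 are easy to tell apart in the simulation; with real data the answer depends on which species are included and how measurement error is handled.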

Dodds, Rothman, and Weitz look into some of this in a paper from 2001, “Re-examination of the 3/4-law of metabolism”:

I don’t like the whole “null hypothesis” thing at all–but the data and models are interesting, and overall I like the paper. It would be an interesting example for someone to go back and analyze the data more directly using hierarchical modeling.

Related is this paper from 2004 by Brown, Gillooly, Allen, Savage, and West:

Metabolism provides a basis for using first principles of physics, chemistry, and biology to link the biology of individual organisms to the ecology of populations, communities, and ecosystems. Metabolic rate, the rate at which organisms take up, transform, and expend energy and materials, is the most fundamental biological rate. We have developed a quantitative theory for how metabolic rate varies with body size and temperature. Metabolic theory predicts how metabolic rate, by setting the rates of resource uptake from the environment and resource allocation to survival, growth, and reproduction, controls ecological processes at all levels of organization from individuals to the biosphere. Examples include: (1) life history attributes, including development rate, mortality rate, age at maturity, life span, and population growth rate; (2) population interactions, including carrying capacity, rates of competition and predation, and patterns of species diversity; and (3) ecosystem processes, including rates of biomass production and respiration and patterns of trophic dynamics.

They talk a lot about that 3/4 power too:

It’s an interesting topic!

We may live in a state of prosecutorial overcorrection, but I think it’s a dialectical response to the fact that the default position for a certain kind of celebrity scientist has usually been ferocious, uncritical defense.

Andrew — Wed, 17 Dec 2025 14:35:26 +0000

In her recent book, Dangerous Fictions, Lyta Gold writes:

[Philip] Roth has long been stereotyped as one of the “lit bro” white male midcentury novelists; this is largely undeserved, especially as Roth wasn’t perceived in his own context as an unmarked white man, facing both antisemitism from the literary world and censure from the Jewish community for his portrayals of neurotic and often grotesque Jewish characters. But the “lit bro” shoe does slightly fit: Roth wrote white male antiheroes who were often based on himself, with intense sexual obsessions and a usually misogynist view of women. Whether this misogyny is better addressed in some novels than others is subject to debate.

What’s not debatable, however, is that even being able to critically describe Roth’s writing as misogynistic is a new development. I remember the ancient days of the 1990s and 2000s, when offering even the gentlest critique of a famous white male writer was usually met with an accusation of failing to understand his genius. We may live in a state of prosecutorial overcorrection, but I think it’s a dialectical response to the fact that the default position for a certain kind of white male writer and their self-absorbed antiheroes has usually been ferocious, uncritical defense.

We have always been in the courtroom, just on the other side of the aisle; the writer–and the literary critic–have normally worked as defense attorneys. The statement “I think this writing is misogynist” was perceived even before the days of social media as a criminal accusation, a presumption that you were putting the writer and his characters on trial, as well as any of his loving readers as codefendants. It’s a very American sequence of ideas, really: to jump straight from a simple statement of critical opinion to the presumption of trials and witch burnings. . . .

That’s a good point, and it reminds me of science criticism!

It’s always been possible to criticize scientific work–indeed, the back-and-forth of criticism has always been recognized as fundamental to science–but, about 10 or 15 years ago, I and other science critics were getting heavy pushback from some of the researchers whose work we had criticized. I think part of this is that in the early 2000s, lots of social scientists were getting lots of positive publicity–this was the era of Freakonomics, Gladwell, TED talks, and tech journalism. (See my paper with Simine Vazire for some background here.)

At the time, some researchers complained that criticism was happening on social media (for example, this blog!) rather than in the peer-reviewed literature, but I think that complaint was bogus, first because they’d taken control over much of the peer-reviewed literature, and second because these researchers seemed to have no problem with positive, often unthinkingly positive, treatment in the media; it was only when the negative comments appeared that they got all uptight about keeping it in the journals.

So I see an analogy between researchers and their fans in the early/mid-2010s complaining about criticism, and authors and their fans in the late 2010s complaining about cancellation. As Gold says, criticism can go too far, but we have to remember that, until relatively recently, there was almost no public criticism at all, and the public criticism that did arise was typically easily deflected.

P.S. Another thing I appreciated about Gold’s book is that she discussed Jonathan Gottschall’s book, “The Story Paradox,” which I liked a lot. One fun thing about nonfiction is you can read one book which points you to another, and another, and another . . . a sort of hyperlinking, if you will.

Survey Statistics: 3rd helpings of the logit shift

shira — Tue, 16 Dec 2025 21:00:40 +0000

In June we discussed 2 flavors of calibration:

Poststratification
“Logit Shift” or “Intercept Correction”: Calibrate our estimates of regressions E(Y|X) to aggregate data about E(Y). (Rosenman et al. 2023, Ghitza and Gelman 2020, Ghitza and Gelman 2013, and Kuriwaki et al. 2024.)
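As a reminder of what the logit shift does mechanically, here is a minimal sketch under made-up numbers. It takes cell-level predicted probabilities from some fitted regression, adds a single constant delta on the logit scale, and picks delta by bisection so that the population-weighted average matches a known aggregate. The function name `logit_shift`, the cell probabilities, the cell sizes, and the target are all hypothetical.

```python
import numpy as np

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit_shift(p, weights, target, tol=1e-10):
    """Find delta so the weighted mean of inv_logit(logit(p) + delta) equals target."""
    z = np.log(p / (1.0 - p))
    lo, hi = -20.0, 20.0
    # The weighted mean is increasing in delta, so bisection is enough.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.average(inv_logit(z + mid), weights=weights) < target:
            lo = mid
        else:
            hi = mid
    delta = 0.5 * (lo + hi)
    return delta, inv_logit(z + delta)

# Hypothetical model estimates of E(Y|X) in four cells, with cell population sizes.
p_cells = np.array([0.30, 0.45, 0.60, 0.75])
n_cells = np.array([4000, 3000, 2000, 1000])

# Hypothetical known aggregate E(Y) to calibrate to.
delta, p_adj = logit_shift(p_cells, n_cells, target=0.52)
print(delta, p_adj, np.average(p_adj, weights=n_cells))
```

The shift moves every cell in the same direction on the logit scale, so cells near 0 or 1 move less on the probability scale than cells near 0.5.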

In August we took 2nd helpings of the logit shift, focusing on multinomial outcomes.

Now we take 3rd helpings, focusing on multivariate outcomes.

I’m inspired by Will Marble and Josh Clinton’s new paper and our discussion in the comments: Calibrate our estimates of p(y_1, y_2 | X) to aggregate data about E(y_1).

As Marble and Clinton write:

our paper builds on methods for calibrating model-based inferences to known population quantities… a so-called “logit-shift”…We extend this approach by estimating models with several correlated outcomes – some with known ground truth and others without…

I proposed:

we could first fit p(y_1 | X) and perform its intercept shift. Then fit p(y_2 | y_1, X).

Marble and Clinton call this “Chained Calibrated MrsP”. They note that my example may be clunky to generalize:

the order of the modeling is likely to be consequential and difficult to choose systematically — especially when there is more than a single outcome with groundtruth margins.

And it doesn’t look as good as their method “Multivariate Calibration”:

What’s going wrong with Chained Calibrated MrsP?

Annals of idiot spam

Andrew — Tue, 16 Dec 2025 14:51:13 +0000

This one came in the mail today:

Hi Andrew,

Your achievements and academic background caught my eye; they have the kind of credibility and impact that could naturally belong on Wikipedia, where recognition becomes part of the public record.

I work with individuals to ensure their stories are told accurately, managing, improving, and publishing Wikipedia pages so they’re well-sourced and built to last.

If your Wikipedia page went live tomorrow, what’s the first thing you’d want people to read about you?

Best regards,
**
Wikipedia Editor | User

What kind of asshole would send this spam without . . . ummmm, checking to see if I already had a Wikipedia page? That’s just stupid.

“Hi Andrew,” indeed.

P.S. In the time between when I wrote the above post and when it appeared on the blog, I received another email from this idiot:

Hello Andrew,

I hope you’re doing well. I just wanted to follow up to confirm if you received my previous email. As mentioned, I can assist you with publishing, maintaining, and editing your Wikipedia page, so you don’t need to worry about the process.

Would you like me to share a detailed plan on how I can best support you with this?

Best regards,
**

Oh, this person hopes I’m doing well! How charming. Maybe he can get a job at Wolfram Research.

Simulating from and checking a model in Stan: It’s so easy in Stan Playground–it just runs on your browser!

Andrew — Mon, 15 Dec 2025 21:34:38 +0000

When building models, it’s helpful to check our understanding by simulating fake data and seeing if the fitted model can recover the true underlying parameters. And it’s so easy to do in Stan!

See above–I just did a simple example in Stan Playground. (Thanks, Brian!) The code’s kind of ugly because I specified the true parameter values in the transformed data block of the Stan code rather than calling them as data; on the upside the whole thing is then in one program.

This is not the same as full simulation-based calibration checking (SBC). In SBC you draw the true parameter values from the prior, repeat the entire process in parallel many times, and then check the average coverage of posterior inferences with respect to the true parameter values, averaging over the prior. Here I’m just running one simulation and setting the true parameter values just once. It’s a kind of quick-and-dirty SBC which can still be useful in catching problems such as nonidentified models or poor mixing.
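For contrast with the single-simulation check, here is a minimal sketch of the repeated version. To keep it self-contained it does not call Stan at all: it assumes a toy conjugate model (mu ~ normal(0, 1), y ~ normal(mu, 1)) where the posterior is known exactly, and it uses the rank of the true value among posterior draws, which is one common way to summarize calibration. All names and settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n_reps, n_obs, n_draws = 1000, 10, 99

ranks = np.empty(n_reps, dtype=int)
for r in range(n_reps):
    mu_true = rng.normal(0.0, 1.0)              # draw the parameter from the prior
    y = rng.normal(mu_true, 1.0, size=n_obs)    # simulate data given that parameter
    # Exact posterior for this toy model: normal(sum(y)/(n+1), sqrt(1/(n+1))).
    post_mean = y.sum() / (n_obs + 1)
    post_sd = np.sqrt(1.0 / (n_obs + 1))
    draws = rng.normal(post_mean, post_sd, size=n_draws)
    ranks[r] = int(np.sum(draws < mu_true))     # rank takes values 0..n_draws

# If the inference is calibrated, ranks are uniform on 0..99,
# so each of these ten bins should hold roughly n_reps / 10 replications.
counts = np.bincount(ranks // 10, minlength=10)
print(counts)
```

In a real check the exact posterior draws would be replaced by draws from the fitted Stan model, and a lopsided or U-shaped histogram of ranks would point to a problem with the model or the computation.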

And it’s so easy to do! Whenever you fit a model, you should be checking it on fake data.

I was doing the above example because I wanted to quickly check for one of the exercises I’m writing for the forthcoming Bayesian Workflow book.

P.S. Here’s the code for the above example if you want to try it yourself:

transformed data {
  int N = 100;
  real a_true = 0.2;
  real b_true = 0.3;
  real sigma_true = 0.2;
  vector[N] x;
  vector[N] y;
  for (n in 1:N) {
    x[n] = uniform_rng(0, 10);
    y[n] = normal_rng(1 / (a_true + b_true * x[n]), sigma_true);
  }
}
parameters {
  real a, b;
  real<lower=0> sigma;  // scale parameter must be positive
}
model {
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  for (n in 1:N) {
    y[n] ~ normal(1 / (a + b * x[n]), sigma);
  }
}

Who is the most famous living person who was born on each continent?

Andrew — Mon, 15 Dec 2025 14:05:51 +0000

Matt Larson writes:

Because of your interest in fame [see here, here, here, and, especially, here — ed.], I would be interested in your thoughts on the following questions: who is the most famous living person who was born on each continent?

For North America and Africa, it seems like the answer is clearly Trump and Musk. For Antarctica it must be Emilio Palma, the only person born there to have a Wikipedia page.

For South America Messi seems like a very strong contender. I suppose Shakira could be another candidate. Until a couple months ago maybe it was Pope Francis.

Asia, Europe, and Oceania seem tricky. I would guess that Kim Jong-Un has higher name recognition than Modi or Xi. Maybe it’s a K-pop star (Jennie?) or something. For Europe, Putin, Paul McCartney, and Ronaldo come to mind. I have no idea about Oceania, maybe it’s Jacinda Ardern or Hugh Jackman, but they’re not very famous.

Maybe Rupert Murdoch is the most famous living person who was born in Oceania? Maybe at one time Mel Gibson but not anymore.

To me, the most interesting thing about this sort of question is not the specific people but rather the measurement issue. How would you define “most famous person”? Perhaps you could consider a hypothetical poll of everyone alive on the planet, where you ask each of them if they’ve heard of person X and you ask them to briefly describe who this person is. (I guess it doesn’t count as fame if someone says they know who you are but they can’t describe anything about you.)

Here are a few other dimensions:

– Changes over time. There might have been a time when Michael Jackson was the most famous living person from North America. Mel Gibson might have been the most famous living Australian at some point.

– Locus of fame: you could be more or less famous in different geographic regions, different age categories, etc.

Combining a high-quality probability sample with data from larger online panels

Andrew — Sun, 14 Dec 2025 14:33:47 +0000

Yajuan Si, James Wagner, and Ron Kessler write:

The traditional use of high-quality probability samples to carry out psychiatric epidemiological surveys of the household population is facing increasing financial and operational challenges. Surveys from nonprobability and probability-based online panels have emerged as cost-effective alternatives with the additional advantage of rapid turnaround time, albeit with biases that can in some cases be substantial.

We recommend a middle ground of integrating surveys from online panels with small parallel high-quality probability samples . . . The key features of such “hybrid designs” are as follows: use of a high-quality probability sample as a population surrogate to provide information about the distributions of otherwise unavailable variables that differentiate participants in online panels from the larger household population, inclusion in both surveys of measures that are both strongly associated with the outcomes of interest and strongly predictive of membership in the online panel, and use of best-practice statistical methods that blend results across the 2 samples.

Such a hybrid design should be the minimally acceptable design for psychiatric epidemiological surveys of the household population given the biases known to exist in online panels. However, we also comment on several other designs that might be used for more rapid and less expensive exploratory analyses.

This is interesting, to think of multi-frame, multi-mode sampling as best practice in itself rather than as an awkward problem to be dealt with only if absolutely necessary.

Yajuan offers some background on the project:

This is my first time writing a paper without any equations or data modeling but having to rely on solid statistical knowledge, understanding the extensive literature, and gathering lots of data. And Ron Kessler is a phenomenal collaborator. I learned a lot from working with him.

Anyway, here is the idea of the paper: We propose a hybrid data collection of large-scale nonprobability samples and small parallel high-quality probability samples as common practice for population-based research. For MRP applications, we often struggle with the availability of population information of X. We propose to estimate the population distribution of X in a small probability sample, after we identify the list of highly predictive covariates X for the outcome Y. We can also collect Y in the probability sample. We propose the sequential weighting adjustment by first weighting the probability sample to the census data (this should be based on a small list of adjustment factors, say only demographics, assuming the probability sample design is well controlled and nonresponse bias is small) and then weighting the nonprobability sample to the initially weighted probability sample (the list of adjustment factors could be large, even including the outcome). After the sequential weighting, the combined samples can give us enough power for small area estimates. I use weighting adjustment here for simplicity, but we can also use MRP for the adjustment if we have an outcome of interest.
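The two-step weighting can be sketched with a toy example. This is only an illustration under strong simplifying assumptions, not the procedure from the paper: there is one demographic factor (age group) with known census shares, one extra covariate X (here a made-up "heavy" vs. "light" internet use variable) that is measured in both samples but not in the census, and simple cell weighting stands in for whatever adjustment would be used in practice. All names and numbers are hypothetical.

```python
import numpy as np

def cell_weights(sample_cells, target_shares):
    """Weight each unit by (target share of its cell) / (sample share of its cell)."""
    sample_cells = np.asarray(sample_cells)
    w = np.empty(len(sample_cells), dtype=float)
    for cell, share in target_shares.items():
        in_cell = sample_cells == cell
        w[in_cell] = share / in_cell.mean()
    return w

def weighted_shares(cells, weights):
    """Weighted distribution of a categorical variable."""
    cells = np.asarray(cells)
    total = weights.sum()
    return {str(c): float(weights[cells == c].sum() / total) for c in np.unique(cells)}

# Step 1: weight the small probability sample to (hypothetical) census age shares.
census_age = {"young": 0.4, "old": 0.6}
prob_age = ["young"] * 30 + ["old"] * 70
prob_x = ["heavy"] * 24 + ["light"] * 6 + ["heavy"] * 21 + ["light"] * 49
w_prob = cell_weights(prob_age, census_age)

# Use the weighted probability sample to estimate the population distribution of X.
x_shares = weighted_shares(prob_x, w_prob)

# Step 2: weight the large online panel to that estimated distribution of X.
panel_x = ["heavy"] * 800 + ["light"] * 200
w_panel = cell_weights(panel_x, x_shares)
print(x_shares, weighted_shares(panel_x, w_panel))
```

With many adjustment factors the cells become sparse, which is one reason to replace the cell weighting in either step with raking or a model-based adjustment such as MRP.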

Basically, I’m trying to push the MRP adjustment from post-collection inference to inform study design and modify data collection adaptively, relieving the burden of strong assumptions on the analysis by improving the study design from the starting point.

This is interesting and potentially important for several reasons:

1. Data quality of survey responses is becoming more and more of an issue, and it makes sense to try to reach potential respondents in more comfortable places than the traditional survey interview.

2. We should be thinking more systematically about how to integrate data from multiple sources.

3. MRP can be adapted to more general data structures.

4. As Yajuan says, we should be aware of all these data collection and analysis issues in the design stage.

The cathedral, the bazaar, and statistical workflow

Andrew — Sat, 13 Dec 2025 14:13:30 +0000

Back in 1997, programmer and open-source activist Eric Raymond wrote an online essay, The Cathedral and the Bazaar, making the case for open-source software development, using Linux as an example:

Linux is subversive. Who would have thought even five years ago (1991) that a world-class operating system could coalesce as if by magic out of part-time hacking by several thousand developers scattered all over the planet, connected only by the tenuous strands of the Internet? . . . Linux overturned much of what I thought I knew. I had been preaching the Unix gospel of small tools, rapid prototyping and evolutionary programming for years. But I also believed there was a certain critical complexity above which a more centralized, a priori approach was required. I believed that the most important software (operating systems and really large tools like the Emacs programming editor) needed to be built like cathedrals, carefully crafted by individual wizards or small bands of mages working in splendid isolation, with no beta to be released before its time.

Linus Torvalds’s style of development–release early and often, delegate everything you can, be open to the point of promiscuity–came as a surprise. No quiet, reverent cathedral-building here–rather, the Linux community seemed to resemble a great babbling bazaar of differing agendas and approaches (aptly symbolized by the Linux archive sites, who’d take submissions from anyone) out of which a coherent and stable system could seemingly emerge only by a succession of miracles.

The fact that this bazaar style seemed to work, and work well, came as a distinct shock. . . .

Lots has happened in software workflow in the past quarter century, and it’s my impression that developers try to get the best of the cathedral and the bazaar. Ultimately, the two forms go together: any “bazaar” is built from individual “stalls” or mini-cathedrals, while any “cathedral” must ultimately compete in a larger “marketplace” or bazaar. So the question isn’t really “the cathedral or the bazaar,” but rather, where should development be more cathedral-like and where should it be bazaar-like. Raymond’s article was helpful in framing software projects in terms of these metaphors, and it connects closely to ideas in economics regarding the roles of firms within the larger economy.

The two software products that I use the most are R and Stan, and I’d say that R has suffered from being too bazaar-like and Stan has suffered from being too cathedral-like.

R code can be a mess because there’s no standard way that things are written, and also because it keeps adding layers and layers of complexity. Back in the day, if you had a matrix, it was a matrix and you could display it and access its entries directly. Nowadays, I often need to do tricks such as as.numeric() or as.matrix() to strip away various structures in order to get at the actual numbers. I’m not saying this is always bad–I’m sure that the bazaar-keepers of R have good reasons for all this–it has just become more and more confusing to access attributes, objects, data, whatever.

When I say that Stan is too cathedral-like, I just mean that it seems to take a long time for improvements in the algorithms to show up in the program. I understand that this is important for quality control, I just find it frustrating to not be able to easily access the latest developments.

For both R and Stan, I’m not saying there’s some easy alternative. Things are what they are for a reason. I’m just using these as examples of how I sometimes think in terms of cathedrals and bazaars.

Statistical workflow

The most important difference between statistical workflow and mere statistical inference is that workflow is all about building trust. Steps such as simulation-based model checking, predictive model evaluation, and comparison of the results of multiple models are ways to stress-test our fitted models, to reveal their weaknesses and, sometimes, to lead to more confidence in the results.

It’s possible to fit models without all the workflow, but then you can end up with wrong answers, such as the ridiculous estimate that losing an election for governor costs 5-10 years of life.

One way to think of this is to consider medieval cathedrals. Still standing after all those centuries–that’s so impressive! But there’s selection. These cathedrals were built without good theory, and lots of them collapsed. As much as possible we would like to use science and engineering workflow to avoid building cathedrals that will immediately crumble. Science is a kind of cathedral–a collective human endeavor with large goals–and we would like to design it better.

But science is also a kind of bazaar . . . and, here’s the point: Bazaars, like cathedrals, are designed. They’re not designed in the same way as cathedrals, but they don’t just happen either. There have to be rules, to keep the bazaar from becoming a bloodbath. It’s fair enough to think of the current system of science, with its millions of scientists and hundreds of thousands of research labs, as being a sort of bazaar, and we’re trying to improve the workflow, which would make the bazaar more effective. I think we can do better than the current system of p-values, failed replications, deterministic claims, and waste of reviewer resources.

If you’re interested in the Box-Cox power transformation . . .

Andrew — Fri, 12 Dec 2025 14:18:43 +0000

You can read this post from Danielle Navarro.

I also recommend Section 7.6 of Bayesian Data Analysis, which extends an example of Rubin where the Box-Cox power transformation fails. We do some cool stuff there, including a posterior predictive check that reveals the problem, and an extension of the model by incorporating a bound on the extreme end of the distribution.

The entire BDA3 book can be downloaded at the above link, but for convenience I’m including Section 7.6 right here for you.

Seven-parameter drift-diffusion pdfs and cdfs now in Stan

Bob Carpenter — Thu, 11 Dec 2025 20:00:46 +0000

This post is from Bob.

Drift-diffusion models

Whew. The cdf function for the seven-parameter drift-diffusion model was just merged. The pdf was merged a few months ago. This is a big deal. These pdfs and cdfs are used in decision-time models in cognitive psychology. There’s a really nice illustration through NLM on nih.gov. The basic idea is that you have a binary task like deciding if an image is red or blue. The data being recorded is time to decision and the decision being made. The underlying generative model is a continuous Wiener diffusion process that has a lag time to get started before drifting with some bias toward opposing decision boundaries. The decision is determined by when it crosses a boundary and which one it crosses (see the illustration). The cdf is important when the task ends before a decision is made, giving you censored observations, which require cdfs or truncated pdfs to implement.
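To make the generative story concrete, here is a rough simulation sketch of a single trial. It is not the Stan implementation and it only covers the basic four-parameter version (drift, boundary separation, relative starting point, non-decision time), using a plain Euler discretization with unit diffusion scale. As I understand it, the seven-parameter model adds trial-to-trial variability in drift, starting point, and non-decision time, which this sketch leaves out. The function name and all parameter values are made up.

```python
import numpy as np

def simulate_ddm(drift, boundary, start_frac, non_decision,
                 max_time=5.0, dt=0.001, rng=None):
    """Simulate one trial; returns (time, choice) with choice 1 = upper boundary,
    0 = lower boundary, None = task ended before a decision (censored)."""
    rng = np.random.default_rng() if rng is None else rng
    x = start_frac * boundary        # starting point between 0 and the upper boundary
    t = 0.0
    while t < max_time - non_decision:
        # Euler step of a Wiener process with drift and unit diffusion scale.
        x += drift * dt + np.sqrt(dt) * rng.normal()
        t += dt
        if x >= boundary:
            return non_decision + t, 1
        if x <= 0.0:
            return non_decision + t, 0
    return max_time, None            # censored observation

rng = np.random.default_rng(3)
trials = [simulate_ddm(1.0, 1.0, 0.5, 0.3, rng=rng) for _ in range(500)]
decided = [(rt, c) for rt, c in trials if c is not None]
frac_upper = np.mean([c for _, c in decided])
print(len(decided), frac_upper)
```

The censored branch is where the cdf comes in: for a trial that times out, the likelihood contribution is the probability of no boundary crossing by `max_time`, not a density at an observed response time.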

The first time I saw this model being applied was by Bruno Nicenboim and Shravan Vasishth (psycholinguists at Potsdam at the time, though Bruno has since moved to Tilburg) about six or eight years ago. At that point, it took Stan a month or so to fit the model (yes, that’s a month, not a typo)—you may know them as two of the three authors of the really wonderful book, Introduction to Bayesian Data Analysis for Cognitive Science (2025, CRC), which, in its final chapter, covers accumulator models of which the drift-diffusion model is one form. Now these models are very fast in Stan with the new built-in functions.

The pull requests and engineering challenge

Hats off to Franziska Henrich, a cognitive psychologist and Stan developer at the University of Freiburg, aka GitHub user Franzi2114, for writing the code and bearing with Steve Bronder’s hundreds of comments and fixes and my final round of a hundred or so change requests. You can see all the gory details in the discussions around the pull requests and in the code itself.

GitHub: pdf pull request
GitHub: cdf pull request

These two functions were perhaps the hardest to get into Stan, for a myriad of reasons. The most challenging obstacle beyond the inherent complication of the functions themselves is that our testing framework for densities can’t handle seven-parameter densities. So all the tests had to be projected into subsets of parameters (and seven choose four or five or whatever it was led to a lot of tests). A further difficulty is that to make the arithmetic stable, the code branches all over the place (see the Hartmann and Klauer article linked below), which also complicates testing.

Some academic background

In addition to Vasishth et al.’s book chapter, there is a vast literature on drift-diffusion models in cognitive psychology and elsewhere. Most relevantly, Franziska wrote an open-access article about the model and the Stan implementation.

Henrich, F., Hartmann, R., Pratz, V., Voss, A., & Klauer, K. C. 2024. The Seven-parameter Diffusion Model: an Implementation in Stan for Bayesian Analyses. Behavior Research Methods.

Luckily, Hartmann and Klauer provided the derivatives in a previous (closed-access) article.

Raphael Hartmann and Karl Christoph Klauer. 2021. Partial derivatives for the first-passage time distribution in Wiener diffusion models. Journal of Mathematical Psychology.

As is often the case, you can find a pdf through Google Scholar. It’s a daunting pile of mathematics that puts the “M” in “mathematical psychology.” Luckily for us, the authors published an R implementation in package WienR on CRAN (the name is because it’s the Wiener diffusion model underlying the process), which Franziska could use for testing.

Coming to the Stan language next release

We just put out a new Stan release, so we have plenty of time to get the language wrappers around the math library functions before the next release of Stan. Ideally, we’ll also have a User’s Guide chapter with examples of how to use them. We’re always open to new User’s Guide chapters about models or methodologies in wide use, and as you can see from this example, we take pull requests, which go down much more easily with the User’s Guide.

A slew of improvements to NUTS

Bob Carpenter — Thu, 11 Dec 2025 20:00:33 +0000

This post is from Bob.

Hold onto your hats, because 2026 promises to bring a whole slew of improvements to MCMC for continuously differentiable densities. The really awesome part is that all of these improvements are orthogonal, so they stack. I’m going to list the ones we know work first, followed by a couple in which we have high hopes.

I’m currently working on implementing all this in a C++ reference implementation, which I plan to roll out with an interface like that of Nutpie. You can follow here:

GitHub repository: flatironinstitute/walnuts

You’ll find the work in progress on branches. We are, of course, always happy to get feedback on the code, and I’m happy to get feedback on these or other ideas for sampling.

Fisher divergence for mass matrix adaptation

Adrian Seyboldt developed much faster, more robust, and better targeted adaptation using Fisher divergence for his sampler Nutpie. I’m helping Adrian and Eliot Carlsen finish up a paper on this, which we hope to release in a week or so. In addition to better diagonal and dense estimators, it contains a really nice low-rank plus diagonal preconditioner that seems very effective (the risk is getting stuck too much in the subspace defined by the low rank structure). It also contains all the mathematical proofs, thanks to Adrian. The basic idea is that Fisher divergence adaptation targets getting the preconditioned target as close to a standard normal in terms of gradients as possible (not as close as possible in terms of density; that’s KL divergence). With some surprising mathematical magic, you can solve the exact optimization of Fisher divergence by taking the geometric mean in the affine-invariant manifold of positive-definite matrices of two quantities: the inverse covariance of the draws and the covariance of the scores (where the score is the gradient of the log density). In a multivariate normal, the covariance of the scores is the inverse covariance of the target density, and the score itself has zero expectation and thus acts like a control variate. Who knew? (That was a rhetorical question. Ben Goodrich, of course, knew; he knows everything.)
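As a rough illustration of the geometric-mean step (and only that step; this is not Nutpie's code), the NumPy sketch below forms the affine-invariant geometric mean of the inverse draw covariance and the score covariance for a bivariate normal. The helper names spd_power and spd_geometric_mean are hypothetical, and the closed form A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2} is the standard formula for the geometric mean of two positive-definite matrices.

```python
import numpy as np


def spd_power(matrix, power):
    """Raise a symmetric positive-definite matrix to a real power."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    return (eigvecs * eigvals**power) @ eigvecs.T


def spd_geometric_mean(a, b):
    """Geometric mean A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    a_half = spd_power(a, 0.5)
    a_neg_half = spd_power(a, -0.5)
    middle = a_neg_half @ b @ a_neg_half
    middle = 0.5 * (middle + middle.T)  # remove floating-point asymmetry
    return a_half @ spd_power(middle, 0.5) @ a_half


rng = np.random.default_rng(0)
sigma = np.array([[4.0, 1.2], [1.2, 1.0]])  # target covariance (made up)
precision = np.linalg.inv(sigma)

draws = rng.multivariate_normal(np.zeros(2), sigma, size=50)
scores = -draws @ precision  # gradient of the log density at each draw

cov_draws = np.cov(draws, rowvar=False)
cov_scores = np.cov(scores, rowvar=False)
estimate = spd_geometric_mean(np.linalg.inv(cov_draws), cov_scores)

print(np.round(estimate - precision, 10))
```

For a normal target, the sampling error in the two covariance estimates cancels, so the result matches the true inverse covariance to floating-point precision even with only 50 draws. That cancellation is the control-variate effect in action.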

Adaptive step size on the fly

This was developed by Nawaf Bou-Rabee, Tore Kleppe, Sifan Liu, and me over the course of a few papers expanding on our Gibbs self tuning (GIST) ideas, culminating in our WALNUTS paper (on arXiv). The basic idea here is at each leapfrog step to try to take a step, and if the Hamiltonian diverges past a tolerance, try a smaller step size (half the original). There’s a bit of adjustment to do for reversibility (very much like in the MALT sampler of Lionel Riou-Durand et al. [the paper has an all star cast]), but the expectation is that the adjustment will be close to unity and that seems to hold in practice. There’s some subtlety here in that the WALNUTS sampler we’ve produced can be slower than NUTS (on a per gradient basis) for densities that NUTS can fit well, but if it’s well tuned in terms of minimal numbers of micro steps per macro step in the WALNUTS algorithm, it will be more efficient on a per-iteration basis. This shows we really are fixing the integrator, even if that doesn’t always give us a better sampler on a per-gradient basis. This same kind of issue comes up with higher-order leapfrog algorithms—they’re more precise, but often not worth the effort. On the plus side, when you have highly varying scales, like in the funnel density, WALNUTS can sample it effectively where NUTS will just fail silently reporting overly optimistic R-hat and ESS values.
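The step-halving idea can be sketched in a few lines of NumPy. This is a deliberately simplified version, not the WALNUTS algorithm: it only compares the energy at the two ends of the macro step, it assumes a unit mass matrix, and it omits the reversibility adjustment entirely. The function names and the tolerance value are assumptions made for the example.

```python
import numpy as np


def leapfrog(theta, rho, grad_logp, step, n_steps):
    """Run n_steps leapfrog steps of size step with a unit mass matrix."""
    theta = np.array(theta, dtype=float)
    rho = np.array(rho, dtype=float) + 0.5 * step * grad_logp(theta)
    for i in range(n_steps):
        theta = theta + step * rho
        scale = step if i < n_steps - 1 else 0.5 * step
        rho = rho + scale * grad_logp(theta)
    return theta, rho


def hamiltonian(theta, rho, logp):
    return -logp(theta) + 0.5 * rho @ rho


def adaptive_macro_step(theta, rho, logp, grad_logp, macro_step, tol,
                        max_halvings=10):
    """Cover one macro step with 1, 2, 4, ... micro steps until the energy
    error at the end of the macro step is within tol."""
    h_start = hamiltonian(theta, rho, logp)
    for k in range(max_halvings + 1):
        n_micro = 2**k
        theta_new, rho_new = leapfrog(theta, rho, grad_logp,
                                      macro_step / n_micro, n_micro)
        if abs(hamiltonian(theta_new, rho_new, logp) - h_start) <= tol:
            return theta_new, rho_new, n_micro
    raise RuntimeError("energy error above tolerance at the smallest step")


def make_normal(scales):
    """Independent zero-mean normal with the given per-dimension scales."""
    scales = np.asarray(scales, dtype=float)
    return (lambda theta: -0.5 * np.sum((theta / scales) ** 2),
            lambda theta: -theta / scales**2)


theta0 = np.array([1.0, 0.01])
rho0 = np.array([0.5, -0.5])
easy_logp, easy_grad = make_normal([1.0, 1.0])
stiff_logp, stiff_grad = make_normal([1.0, 0.01])

_, _, n_easy = adaptive_macro_step(theta0, rho0, easy_logp, easy_grad,
                                   macro_step=0.5, tol=0.1)
_, _, n_stiff = adaptive_macro_step(theta0, rho0, stiff_logp, stiff_grad,
                                    macro_step=0.5, tol=0.1)
print(n_easy, n_stiff)
```

On the well-scaled target one micro step is enough. On the stiff target the leapfrog integrator is unstable until the micro step drops below twice the smallest scale, so the loop keeps halving, which is the extra gradient cost mentioned above.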

Isokinetic sampler

As the name implies, the idea of an isokinetic sampler is that you keep the kinetic energy constant. This is carried out by designing a kinetic energy function that’s different from the usual quadratic model derived from Newtonian physics (decomposed a la Hamilton, of course). This was developed decades ago by Mark Tuckerman at NYU’s Courant Institute for molecular dynamics, written about in Leimkuhler and Matthews’s Molecular Dynamics textbook, and reinvented recently by Jakob Robnik and Uroš Seljak under the name “microcanonical sampling.” Isokinetic sampling has the remarkable property that you can fix a single energy level set and the isokinetic sampler can be restricted to that, yet remain ergodic for the distribution. Although they’re singing the praises of unadjusted methods, Reuben Cohn-Gordon joined Jakob and Uroš and they wrote a joint paper on Metropolis-adjusted methods (on arXiv). Of course, you could extend this to something like jittered HMC, multinomial HMC, or NUTS (we discuss these alternatives for standard HMC in the various GIST papers in detail). Recently my colleagues Tore Kleppe and Nawaf Bou-Rabee (my partners in crime on GIST) have verified the results presented by Robnik et al. and it seems to give a pretty clean factor of 1.5 to 3 over the standard NUTS kinetic energy model.

Unadjusted sampling, at least for warmup

In statistics, we’re used to thinking of unbiased MCMC with the correct ergodicity properties—you run them longer and you get a better answer. But in estimation in stats, we’re often OK with introducing a bit of bias if it gives us a large enough reduction in variance that we get better answers. If you view sampling the same way, then removing the Metropolis step from HMC gives you a biased sampler that can be a lot faster at moving toward the typical set where we want to sample than an adjusted sampler. And while Jakob, Uroš and Reuben are mainly working on unadjusted samplers and trying to prove error bounds for them under pretty strong assumptions, they have also suggested the reasonable tactic of using unadjusted sampling during warmup then switching to adjusted later so that everything works as expected for longer runs.

Smoother adaptation of Nutpie

Nutpie, like NUTS before it, works in blocks. It evaluates a given number of MCMC iterations, then updates its mass matrix estimate. I’ve developed a smooth alternative that simply exponentially discounts the past at a decreasing rate to mimic the block structure of Stan’s warmup. This seems to work very well and doesn’t suffer the problem of Nutpie of never converging due to a finite cap on block size (Adrian’s working on changing the version in Nutpie in some way to get around the initial design of 100-length blocks; Stan works in a sequence of blocks that doubles in size).
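Here is one way such a smooth estimator could look, as a sketch only: a weighted Welford update in which the accumulated weight and sum of squares are multiplied by a discount before each new draw is added. The class name, the scalar setting, and especially the schedule in discount_at are hypothetical choices for illustration, not the ones used in the actual implementation.

```python
import random


class DiscountedMoments:
    """Running mean and variance with exponential discounting of the past."""

    def __init__(self):
        self.weight = 0.0
        self.mean = 0.0
        self.sum_sq = 0.0

    def update(self, x, discount=1.0):
        # Down-weight everything seen so far, then add x with weight 1.
        self.weight = discount * self.weight + 1.0
        delta = x - self.mean
        self.mean += delta / self.weight
        self.sum_sq = discount * self.sum_sq + delta * (x - self.mean)

    def variance(self):
        return self.sum_sq / self.weight


def discount_at(n):
    # Hypothetical schedule: forget quickly early on and more slowly later,
    # so the effective window grows with the iteration number.
    return 1.0 - 1.0 / (10.0 + 0.5 * n)


rng = random.Random(7)
# Early draws come from a much wider distribution, like a poorly adapted start.
draws = ([rng.gauss(0.0, 10.0) for _ in range(200)]
         + [rng.gauss(0.0, 1.0) for _ in range(800)])

smooth = DiscountedMoments()
plain = DiscountedMoments()
for n, x in enumerate(draws):
    smooth.update(x, discount_at(n))
    plain.update(x)  # discount of 1 gives the ordinary running variance

print(round(smooth.variance(), 2), round(plain.variance(), 2))
```

The undiscounted estimate stays contaminated by the early wide draws, while the discounted one moves toward the variance of the recent draws without any block boundaries.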

Adam replacement for dual averaging

Stan uses the dual averaging stochastic gradient descent algorithm to set step size. It’s not spelled out in the short NUTS paper that dual averaging is an SGD algorithm because the NUTS paper never gives you the objective and its gradient. If you work it out backwards from the dual averaging algorithm, you can deduce that Hoffman and Gelman used a normal target (i.e., squared error), the gradient of which is the negative difference between the observation and the target. This gives you simple stochastic gradients you can plug into any SGD optimizer. I only did this this week, but I’ve already found that Adam is much faster to converge, less variable after convergence, and more monotonic getting to convergence in the short tests I’ve done so far. I had to add an update discounting factor to Adam or you get the terrible oscillations rather than convergence for which both dual averaging and Adam are known.
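A minimal sketch of plugging that stochastic gradient into Adam follows. The acceptance model toy_accept_prob is a made-up stand-in for a real sampler, and the 1/sqrt(t) decay on the learning rate is just one possible reading of an "update discounting factor", not necessarily the one used in practice.

```python
import math
import random


def toy_accept_prob(step_size, rng):
    """Hypothetical noisy acceptance rate that falls as step size grows."""
    noisy = math.exp(-step_size**2) + rng.gauss(0.0, 0.05)
    return min(1.0, max(0.0, noisy))


def adapt_step_size(target_accept, n_iter, rng, lr=0.1,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam on the log step size using the gradient described above."""
    log_step = 0.0
    m = 0.0
    v = 0.0
    for t in range(1, n_iter + 1):
        accept = toy_accept_prob(math.exp(log_step), rng)
        grad = -(accept - target_accept)  # negative (observation - target)
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad**2
        m_hat = m / (1.0 - beta1**t)
        v_hat = v / (1.0 - beta2**t)
        # Assumed discounting: shrink the update size like 1 / sqrt(t).
        log_step -= (lr / math.sqrt(t)) * m_hat / (math.sqrt(v_hat) + eps)
    return math.exp(log_step)


step = adapt_step_size(target_accept=0.8, n_iter=1000, rng=random.Random(42))
ideal = math.sqrt(-math.log(0.8))  # where the toy model accepts at 0.8
print(round(step, 3), round(ideal, 3))
```

When acceptance runs above the target the gradient is negative and the log step size increases, which matches the direction of the dual averaging update.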

Concurrent adaptation and convergence monitoring

This is a big one that Andrew’s been asking for for years. I’ve finished the online R-hat monitor (it’s in a branch of that name in the Walnuts repo on flatironinstitute). It uses the C++11 threads library to build an asynchronous, non-blocking monitor. The chains roll along as usual, each within its own thread, but they accumulate their within-chain means and variances via Welford’s algorithm. They publish these in a per-chain atomic using relaxed memory guarantees (hence the lack of blocking). Then there is also a monitor thread running that continually reads the per-chain means and variances and computes R-hat (the original one, not the split or ranked versions that are currently used in posterior (R) and ArviZ (Python)). I am able to run at least 100 R-hat checks per second without slowing down the chains noticeably, so it can detect convergence within a few iterations of when it happens even for very fast models. This all works blazingly fast on my new Mac Studio Server running 16 chains in parallel(*). Next up, I have to monitor adaptation. For that, there’s not a pre-built solution, but I’ll be using something like R-hat to monitor whether the mass matrix and step sizes have converged (on a log scale, which makes the monitoring respect the positive-definite manifold structure and operate scale free). And I’ll have to use a double-buffered store rather than an atomic because the vectors required for monitoring convergence of the mass matrix are D-dimensional, not trivially copyable.
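The arithmetic behind the monitor is small enough to sketch. The Python below is a single-threaded illustration only (no atomics, no monitor thread, scalar parameter, equal chain lengths assumed); the class and function names are invented for the example and are not those in the Walnuts repo.

```python
import math
import random


class Welford:
    """Streaming mean and variance for one chain."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.sum_sq = 0.0

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.sum_sq += delta * (x - self.mean)

    def sample_variance(self):
        return self.sum_sq / (self.count - 1)


def r_hat(accumulators):
    """Original (non-split) R-hat from per-chain means and variances."""
    m = len(accumulators)
    n = accumulators[0].count  # assumes all chains have the same length
    means = [acc.mean for acc in accumulators]
    grand_mean = sum(means) / m
    between_over_n = sum((mu - grand_mean) ** 2 for mu in means) / (m - 1)
    within = sum(acc.sample_variance() for acc in accumulators) / m
    var_plus = (n - 1) / n * within + between_over_n
    return math.sqrt(var_plus / within)


def run_chains(offsets, n_draws, seed):
    """Stand-in for real chains: independent normal draws around each offset."""
    rng = random.Random(seed)
    accumulators = [Welford() for _ in offsets]
    for _ in range(n_draws):
        for acc, offset in zip(accumulators, offsets):
            acc.update(rng.gauss(offset, 1.0))
    return accumulators


mixed = r_hat(run_chains([0.0, 0.0, 0.0, 0.0], n_draws=1000, seed=1))
stuck = r_hat(run_chains([0.0, 0.0, 0.0, 3.0], n_draws=1000, seed=1))
print(round(mixed, 3), round(stuck, 3))
```

Because each chain only needs to expose a count, a mean, and a sum of squares, the published state per chain is tiny, which is what makes a non-blocking monitor practical.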

(*) Mac ARM hardware

Not an algorithmic change, but … if you haven’t gotten the memo yet, the new Mac ARM chips are very well suited to parallel sampling like we do in Stan. Their memory architecture is much more tightly integrated into the CPU and more multiplexed than Intel architecture. This is great for sampling or other tasks where data and parameters are being hit asynchronously in memory in parallel.

I just got a Mac Studio with 20 performance cores on the M3 Ultra chip and 256GB of memory. I would highly recommend this specific machine if you have US$6K to spend (your cost may vary—electronics seem to be way more expensive outside of the U.S.). They also have a US$2K starter model, which should also be great compared to just about anything else. Even the MacBooks are good—the inexpensive Air I got a few years ago crushed my mega-expensive (albeit 8 year old) iMac running Intel Xeon chips. People have told me Stan’s broken on Windows after comparing Windows and Mac, but really, it’s just the memory architecture. I didn’t do any controlled tests, but I swear that tests that were taking 3s now take 1-2s after upgrading to the Tahoe OS (bunch of small UI changes that make pretty much no difference to my experience); Apple says they’re continuing to integrate more ARM goodness into their OS, so maybe that was it? The release notes don’t mention anything about heavy thread performance.

I don’t know how long this is going to be useful. I plan to develop a lot of these algorithms in JAX to run on GPU, which is going to be way faster than anything you can do on the desktop.

TENTATIVE IMPROVEMENTS

These haven’t been well tested by us yet.

Generalized HMC

NUTS is not good for GPUs. Matt Hoffman et al. have a new paper out that’s going into the next edition of the MCMC handbook about MCMC on modern hardware that explains why (on arXiv for now). Hugh Dance et al. just wrote another paper showing how to code something like NUTS on GPU, but it’s still expensive. Something like generalized HMC promises to give much of the advantage of NUTS implicitly without the elaborate recursive doubling structure. This is the kind of sampling that Matt Hoffman and his former colleague Pavel Sountsov developed (Matt left Google and doesn’t work on sampling any more as far as I know). This is also what Gilad Turok, Chirag Modi, and I found with delayed rejection. It also sidesteps the problem of having to chain all the adjustments for local step-size adaptivity, which can add up in a bad way for longer chains. It does add some additional tuning, which is how much to partially refresh momentum, which is a kind of proxy for path length for U-turns. It also has the advantage of being implicitly Rao-Blackwellized compared to HMC (you can average all the steps on an HMC path when computing expectations, but it’s just usually not worth the storage; this may turn out to be the case for G-HMC too).
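For readers who haven't seen generalized HMC, here is a bare-bones sketch on a one-dimensional standard normal target: partially refresh the momentum, take a single leapfrog step, accept or reject, and flip the momentum on rejection. It is the textbook scheme rather than any of the tuned variants mentioned above, and the parameter name persistence is my own label for the partial-refresh tuning knob.

```python
import math
import random


def ghmc_standard_normal(n_iter, step, persistence, rng):
    """Generalized HMC for a 1-D standard normal with unit mass."""
    theta = 0.0
    rho = rng.gauss(0.0, 1.0)
    refresh_scale = math.sqrt(1.0 - persistence**2)
    draws = []
    for _ in range(n_iter):
        # Partial refresh: keeps the momentum marginally standard normal.
        rho = persistence * rho + refresh_scale * rng.gauss(0.0, 1.0)
        # One leapfrog step; the gradient of the log density is -theta.
        rho_half = rho - 0.5 * step * theta
        theta_prop = theta + step * rho_half
        rho_prop = rho_half - 0.5 * step * theta_prop
        h_old = 0.5 * (theta**2 + rho**2)
        h_new = 0.5 * (theta_prop**2 + rho_prop**2)
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            theta, rho = theta_prop, rho_prop  # accept and keep moving
        else:
            rho = -rho  # reject: reverse direction to stay reversible
        draws.append(theta)
    return draws


draws = ghmc_standard_normal(n_iter=50000, step=0.5, persistence=0.9,
                             rng=random.Random(3))
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(mean, 2), round(var, 2))
```

With persistence near 1 the momentum carries over between iterations, so consecutive single steps behave like one long trajectory, and a rejection acts like a turnaround.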

Local mass matrix adaptation

I think of this as the final frontier. If we could locally condition, we wouldn’t need variable step sizes. We don’t have anything that works for this in complex cases. In Stan, you get a choice of a diagonal or dense preconditioner, and in Nutpie you also get low rank plus diagonal, but they’re all global preconditioners, not local.

In very simple cases, we can use GIST to generate a mass matrix by taking an inverse Wishart sample around the negative Hessian. If you set the degrees of freedom correctly to get low variance, this pretty much perfectly preconditions a multivariate normal and should be a sound strategy in any log concave density. It’s just expensive, even with everything Cholesky factored. But that’s not enough for a system like Stan. The problem in going to more general models is that the Hessian’s no longer guaranteed to be positive definite (as it would be in a log concave density). This means we have to use something like Michael Betancourt’s softabs technique to condition the Hessian back to positive definite (e.g., eigendecompose, move negative eigenvalues up to positive, put back together, which is prohibitive because of the cubic cost). As an alternative, Nawaf and Tore have been looking at some explicit Riemannian integrators that only require a few Hessian-vector products, which are cheap with autodiff (linear in dimension rather than quadratic). This avoids the problem with the implicit integrator in Riemannian HMC, which is an additional obstacle beyond the need for positive-definite metrics. We might go back to implicit midpoint, as Arya Pourzanjani explored in his thesis, because we can use GIST to set a step size where we can guarantee stability; even so, the implicit nature of the algorithm is a real challenge for reversibility.
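The eigendecompose-and-repair step can be sketched with NumPy. This uses the SoftAbs map, which sends each eigenvalue λ to λ·coth(αλ), a smooth approximation to |λ| that stays positive near zero; the function name, the cutoff for near-zero eigenvalues, and the value of alpha are assumptions for the example.

```python
import numpy as np


def softabs(hessian, alpha):
    """Map a symmetric matrix to a positive-definite one via SoftAbs."""
    eigvals, eigvecs = np.linalg.eigh(hessian)  # the cubic-cost step
    scaled = alpha * eigvals
    near_zero = np.abs(scaled) < 1e-8
    safe = np.where(near_zero, 1.0, scaled)  # avoid dividing by tanh(0)
    # lambda * coth(alpha * lambda), with its limit 1 / alpha at lambda = 0.
    soft = np.where(near_zero, 1.0 / alpha, eigvals / np.tanh(safe))
    return (eigvecs * soft) @ eigvecs.T


indefinite = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues 3 and -1
metric = softabs(indefinite, alpha=10.0)
print(np.round(metric, 6))
print(np.linalg.eigvalsh(metric))
```

The eigendecomposition is what makes this prohibitive in high dimensions, which is the motivation for the Hessian-vector-product approaches described above.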

“We conclude that apparent effects of growth mindset interventions on academic achievement are likely attributable to inadequate study design, reporting flaws, and bias.”

Andrew — Thu, 11 Dec 2025 14:18:56 +0000

Joshua Brooks points us to this research article by Brooke Macnamara and Alexander Burgoyne, “Do growth mindset interventions impact students’ academic achievement? A systematic review and meta-analysis with recommendations for best practices,” which states:

According to mindset theory, students who believe their personal characteristics can change–that is, those who hold a growth mindset–will achieve more than students who believe their characteristics are fixed. Proponents of the theory have developed interventions to influence students’ mindsets, claiming that these interventions lead to large gains in academic achievement. Despite their popularity, the evidence for growth mindset intervention benefits has not been systematically evaluated considering both the quantity and quality of the evidence. Here, we provide such a review by (a) evaluating empirical studies’ adherence to a set of best practices essential for drawing causal conclusions and (b) conducting three meta-analyses. When examining all studies (63 studies, N = 97,672), we found major shortcomings in study design, analysis, and reporting, and suggestions of researcher and publication bias: Authors with a financial incentive to report positive findings published significantly larger effects than authors without this incentive. Across all studies, we observed a small overall effect . . . which was nonsignificant after correcting for potential publication bias. No theoretically meaningful moderators were significant. When examining only studies demonstrating the intervention influenced students’ mindsets as intended . . . the effect was nonsignificant . . . When examining the highest-quality evidence . . . the effect was nonsignificant . . . We conclude that apparent effects of growth mindset interventions on academic achievement are likely attributable to inadequate study design, reporting flaws, and bias.

I haven’t read the paper, let alone the 63 cited studies, but I thought I’d do my part by getting this into the discussion.

We talked about earlier critical work by Macnamara on growth mindset back in 2018, where I discussed how to think about effect sizes for such interventions.

My main message was that, if mindset interventions work, we’d still expect small average effects, because they won’t work for all students. As I wrote, “it’s a small effect in the context of any student, and of course it’s a small effect. It’s hard to get good grades, and there’s no magic way to get there!”

In one sense, my conclusion is negative on mindset interventions in that I’m saying we shouldn’t expect to see large effects, and any large effects that do show up are likely to be huge overestimates.

In another sense, my conclusion is positive on mindset interventions in that, given that any average effects will be small, the lack of statistically significant average effects in small or even moderately-large studies does not have to imply that mindset interventions don’t work; it just says that they only work in some settings, and individual effects will mostly be small.

Also relevant is this discussion we had a few years ago on mindset interventions with contributions from Russell Warne and David Yeager. Lots to chew on here. This example also helped form my thinking on varying treatment effects, leading to our causal quartets paper and some future lines of research.