**Recommendation 1. Consider measurements that address the underlying construct of interest.** The concepts of validity and reliability of measurement are well known in psychology but are often forgotten in experimental design and analysis. Often we see exposure or treatment measures and outcome measures that connect only indirectly to substantive research goals. This can be seen in the frequent disconnect between the title and abstract of a research paper, on one hand, and the actual experiment, on the other. A notorious example in psychology is a paper that referred in its title to a “long-term experimental study” that in fact was conducted for only three days.

Our recommendation goes in two directions. First, set up your design and data collection to measure what you want to learn about. If you are interested in long-term effects, conduct a long-term study if possible. Second, to the extent that it is not possible to take measurements that align with your inferential goals, be open about this gap and explicit about the theoretical assumptions or external information that you are using to support your more general conclusions.

**Recommendation 2. When designing an experiment, consider realistic effect sizes.** There is a tendency to overestimate effect sizes when designing a study. Part of this is optimism and availability bias: it is natural for researchers who have thought hard about a particular effect to think that it will be important, to envision scenarios where the treatment will have large effects, and not to think so much about cases where it will have no effect or where it will be overwhelmed by other possible influences on the outcome. In addition, past results are much more likely to be published if they have reached a significance threshold, and this results in literature reviews that vastly overestimate effect sizes.

Overestimation of effect sizes leads to overconfidence in design, with researchers being satisfied with small sample sizes and sloppy measurements in the mistaken belief that the underlying effect is so large that it can be detected even with very crude inference. And this causes three problems. First, it is a waste of resources to conduct an experiment that is so noisy that there is essentially no chance of learning anything useful, and this sort of work can crowd out the more careful sorts of studies that would be needed to detect realistic effect sizes. Second, a false expectation of high power creates a cycle of hype and disappointment that can discredit a field of research. Third, the process of overestimation can be self-perpetuating, with a noisy experiment being analyzed until apparently statistically-significant results appear, leading to another overestimate to add to the literature. These problems arise not just in statistical power analysis (where the goal is to design an experiment with a high probability of yielding a statistically significant result) but also in more general design analyses where inferences will be summarized by estimates and uncertainties.
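To make this concrete, here is a minimal simulation of the feedback loop described above, with invented numbers (a true effect of 0.1 standard deviations and 50 participants per group): in an underpowered study, the estimates that happen to reach statistical significance are necessarily large overestimates.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1   # a realistic, small effect (in standard-deviation units)
n = 50              # participants per group
sims = 10_000

ests, sigs = [], []
for _ in range(sims):
    control = rng.normal(0, 1, n)
    treated = rng.normal(true_effect, 1, n)
    est = treated.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / n + treated.var(ddof=1) / n)
    ests.append(est)
    sigs.append(abs(est / se) > 1.96)

ests, sigs = np.array(ests), np.array(sigs)
print("power:", sigs.mean())                      # low: well under 20%
print("mean |estimate| when significant:",
      np.abs(ests[sigs]).mean())                  # several times the true 0.1
```

The significant estimates must clear the threshold of roughly two standard errors, so in a noisy design they exaggerate the effect by a factor of several; each such published estimate then feeds the next optimistic power analysis.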

**Recommendation 3. Simulate your data collection and analysis on the computer first.** In the past, we have designed experiments and gathered data in the hope that the results would lead to insight and possible publication—but then the actual data would end up too noisy, and we would realize in retrospect that our study never really had a chance of answering the questions we wanted to ask. Such an experience is not a complete waste—we learn from our mistakes and can use them to design future studies—but we can often do better by preceding any data collection with a computer simulation.

Simulating a study can be more challenging than conducting a traditional power analysis. The simulation does not require any mathematical calculations; the challenge is the need to specify all aspects of the new study. For example, if the analysis will use regression on pre-treatment predictors, these must be simulated too, and the simulated model for the outcome should include the possibility of interactions.

Beyond the obvious benefit of revealing designs that look to be too noisy to detect main effects or interactions of interest, the construction of the simulation focuses our ideas by forcing us to make hard choices in assuming the structure and sizes of effects. In the simulation we can make assumptions about variation in measurement and in treatment effects, which can facilitate the first two recommendations above.
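A design simulation along the lines described above might look like the following sketch (the coefficients, sample size, and error scale are all invented assumptions for illustration, not recommendations): simulate the pre-treatment predictor along with the outcome, include the interaction, run the planned regression, and see what standard error the design can be expected to deliver.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_and_fit(n=200, b_treat=0.2, b_x=0.5, b_interact=0.1, sigma=1.0):
    """Simulate one hypothetical experiment and fit the planned regression."""
    x = rng.normal(size=n)              # pre-treatment predictor: simulated too
    z = rng.binomial(1, 0.5, size=n)    # randomized treatment indicator
    y = b_treat * z + b_x * x + b_interact * z * x + rng.normal(0, sigma, size=n)
    X = np.column_stack([np.ones(n), z, x, z * x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
    return coef[1], se[1]               # treatment-effect estimate and its s.e.

results = np.array([simulate_and_fit() for _ in range(1000)])
print("typical s.e. of the treatment estimate:", results[:, 1].mean())
```

Comparing that typical standard error against the assumed effect size immediately shows whether the design has any realistic chance of detecting the effects of interest, before any participant is recruited.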

**Recommendation 4. Design in light of analysis.** In his book, The Chess Mysteries of Sherlock Holmes, logician Raymond Smullyan (1979) wrote, “To know the past, one must first know the future.” The application of this principle to statistics is that design and data collection should be aligned with how you plan to analyze your data. As Lee Sechrest put it, “The central issue is the validity of the inferences about the construct rather than the validity of the measure per se.”

One place this arises is in the collection of pre-treatment variables. If there is concern about imbalance between treatment and control groups in an observational study or an experiment with dropout, it is a good idea to think about such problems ahead of time and gather information on the participants to use in post-data adjustments. Along similar lines, it can make sense to recruit a broad range of participants and record information on them to facilitate generalizations from the data to larger populations of interest. A model to address problems of representativeness should include treatment interactions so that effects can vary by characteristics of the person and scenario.

In summary, we can most effectively learn from experiments if we plan the design and data collection ahead of time, which involves: (1) using measurements that relate well to underlying constructs of interest, (2) considering realistic effect sizes and variation, (3) simulating experiments on the computer before collecting any data, and (4) keeping analysis plans in mind in the design stage.

The background on this short paper was that I was asked by Joel Huber from the Journal of Consumer Psychology to comment on an article by Michel Wedel and David Gal, Beyond statistical significance: Five principles for the new era of data analysis and reporting. Their recommendations were: (1) summarize evidence in a continuous way, (2) recognize that rejection of statistical model A should not be taken as evidence in favor of preferred alternative B, (3) use substantive theory to generalize from experimental data to the real world, (4) report all the data rather than choosing a single summary, (5) report all steps of data collection and analysis. (That’s my rephrasing; you can go to the Wedel and Gal article to see their full version.)

Multilevel regression and poststratification (MRP) is a model-based approach for estimating a population parameter of interest, generally from large-scale surveys. It has been shown to be effective in highly selected samples, which is particularly relevant to investigators of large-scale population health and epidemiologic surveys facing increasing difficulties in recruiting representative samples of participants. We aimed to further examine the accuracy and precision of MRP in a context where census data provided reasonable proxies for true population quantities of interest. We considered 2 outcomes from the baseline wave of the Ten to Men study (Australia, 2013–2014) and obtained relevant population data from the 2011 Australian Census. MRP was found to achieve generally superior performance relative to conventional survey weighting methods for the population as a whole and for population subsets of varying sizes. MRP resulted in less variability among estimates across population subsets relative to sample weighting, and there was some evidence of small gains in precision when using MRP, particularly for smaller population subsets. These findings offer further support for MRP as a promising analytical approach for addressing participation bias in the estimation of population descriptive quantities from large-scale health surveys and cohort studies.

This article appeared in 2020 but I just happened to hear about it now.

Here’s the result from the first example considered by Downes and Carlin:

For the dichotomous labor-force outcome, MRP produced very accurate population estimates, particularly at the national level and for the larger states, where the employment rate was estimated within 1% in each case. For the smallest states of ACT and NT, MRP overestimated the employment rate by approximately 5%. Post-hoc analyses revealed that these discrepancies could be explained partly by important interaction terms that were evident in population data but not included in multilevel models due to insufficient data. For example, based on Census data, there was a much higher proportion of Indigenous Australians living in NT (25%) compared with all other states (<5%), but only 3% (n = 2) of the Ten to Men sample recruited from NT identified as Indigenous. There were also differences in the labor-force status of Indigenous Australians by state according to the Census: 90% of Indigenous Australians residing in ACT were employed compared with 79% residing in NT. Given the insufficient data available, it was not possible to obtain a meaningful estimate of this Indigenous status-by-state interaction effect.

And here’s their second example:

For the continuous outcome of hours worked, the performance of MRP was less impressive, with population quantities consistently overestimated by approximately 2 hours at the national level and for the larger states and by up to 4 hours for the smaller states. MRP still, however, outperformed both unweighted and weighted estimation in most cases. The inaccuracy of all 4 estimation methods for this outcome likely reflects that the 2011 Census data for hours worked was not a good proxy for the true population quantities being estimated by the Ten to Men baseline survey conducted in 2013–2014. It is entirely plausible that the number of hours worked in all jobs in a given week could fluctuate considerably due to temporal factors and a wide range of individual-level covariates not included in our multilevel model. This was also evidenced by the large amount of residual variation in the multilevel model for this outcome.

Downes and Carlin summarize what they learned from the examples:

The increased consistency among state-level estimates achieved by MRP can be attributed to the partial pooling of categorical covariate parameter estimates toward their mean in multilevel modeling. This was particularly evident in the estimation of labor-force status for the smaller states of TAS, ACT, and NT, where MRP estimates fell part of the way between the unweighted state estimates and the national MRP estimate, with the degree of shrinkage reflecting the relative amount of information available about the individual state and all the states combined.

We did not observe, in this study, the large gains in precision achieved with MRP seen in our previous case study and simulation study. The multilevel models fitted here were more complex, including a larger number of covariates and multiple interaction effects. While we have sacrificed precision, this increased model complexity appears to have achieved increased accuracy. We did see small gains in precision when using MRP, particularly for the smaller states, and we might expect these gains to be larger for smaller sample sizes where the benefits of partial pooling in multilevel modeling would be greater.

Also:

The employment outcome measures considered in this study are not health outcomes per se; rather, they were chosen in the absence of any health outcomes for which census data were available to provide a comparison in terms of accuracy. We have no reason to expect MRP would behave any differently for outcome measures more commonly under investigation in population health or epidemiologic studies.

MRP can often lead to a very large number of poststratification cells. Our multilevel models generated 60,800 unique poststratification cells. With a total population size of 4,990,304, almost three-fourths of these cells contained no population data. This sparseness is not a problem, however, because the multilevel model smooths estimates across cells and the population cell counts are used simply as weights in poststratification.
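The poststratification step they describe amounts to a population-count-weighted average of the model's cell-level estimates, which is why empty cells drop out harmlessly. A toy sketch (the numbers here are invented, not from the paper):

```python
import numpy as np

# hypothetical model-based estimate of the outcome in each poststratification cell
cell_estimates = np.array([0.82, 0.78, 0.90, 0.71])

# census population count for each cell; zero-count cells simply get zero weight
cell_counts = np.array([1200, 300, 0, 4500])

# poststratification: weight each cell estimate by its population share
mrp_estimate = cell_estimates @ cell_counts / cell_counts.sum()
print(mrp_estimate)  # → 0.7355
```

The multilevel model's job is to produce a stable estimate for every cell, even sparsely sampled ones, by partial pooling; the census counts then do the rest.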

They conclude:

Results of this case-study analysis further support previous findings that MRP provides generally superior performance in both accuracy and precision relative to the use of conventional sample weighting for addressing potential participation bias in the estimation of population descriptive quantities from large-scale health surveys. Future research could involve the application of MRP to more complex problems such as estimating changes in prevalence over time in a longitudinal study or developing some user-friendly software tools to facilitate more widespread usage of this method.

It’s great to see people looking at these questions in detail. Mister P is growing up!

It is your emphasis on “realistic” that is wrong. Paul played a significant role in the random walk model for stock prices and knew a huge amount about both theory and practice, as part of the MIT group, including others such as Bob Solow. He had no shades on his eyes – but he knew that the model fit well enough that betting on indexes was for most people as good as any strategy. But how to convey this in an introductory text? Most people would go to simple random walks – coin tossing. But that was far from realistic. Stock prices jumped irregularly and were at that time limited to something like half shares or quarter shares. It is a clever idea to think of a sort of random draw of actual price changes as a device to teach students what was going on. Much more realistic. I cannot believe he ever said this was exactly how it worked. Textbooks have to have a degree of idealization if they are to make a complex subject understandable. Paul was certainly not a pompous fool. Economics was a no-holds-barred field in those days and all models were actively criticized from both theoretical and empirical sides. His clever instructional idea was indeed more realistic and effective as well. He did get the Soviet economy wrong, but so did every other economist.

Regarding that last point, I still maintain that Samuelson’s problem was not getting the Soviet economy wrong, but rather that he didn’t wrestle with his error in later editions of his book; see third and fourth paragraph of this comment. But, sure, that’s just one thing. Overall I get my correspondent’s point, and I no longer think the main substance of my earlier post was correct.

Marginal Revolution has a nice link showing that data provide almost no power to tell us anything about macroeconomic models. Tabarrok calls this “embarrassing.” I’m not so sure.

In my own field of finance, efficient market theory tells us that stock returns will have low predictability. More generally, optimization by rational economic agents makes it hard to predict their behavior. A consumer or a firm with a quadratic objective function will have a linear first-order condition. So their behavior will be a linear function of the expectation of something, and (by iterated expectations) will be hard to predict.

This seems to occur in science generally. We can’t see inside black holes. We can’t know quantum positions. String theory has no interesting predictions.

Are there good examples in statistics? We struggle to estimate rare events, unit roots, and the effect of exponential growth. Early Covid forecasts were just silly. I think the data provide no information about the incentive effect of the death penalty. Similarly, data have limited power to gauge causal magnitudes of global warming.

I think these results are “embarrassing” in the sense that the data cannot tell us anything, so there is little point in doing statistics.

Heston continues:

Here is an example of poor understanding. Some Harvard biomedical informatics guy ran a regression on unit-root Covid growth without differencing the log data. Naturally, he got an R-square of 99.8%. And the t-statistics were highly significant! He did not realize that his exponential forecast was just assuming that the in-sample growth would continue forever. That growth rate was estimated over only 2 weeks, so it was almost meaningless. This type of erroneous thinking told us for years that we had only two weeks to flatten the curve.

I think that statistics is fun, and there are classic areas where data are plentiful, e.g., predicting stock returns. But when the power is low, I am skeptical about pursuing research in the area.
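Heston's Covid example is easy to reproduce with simulated data. In the sketch below (all numbers invented), log case counts follow a two-week random walk with drift: regressing the levels on time gives a near-perfect R-square, while regressing the daily differences, which is what the extrapolation actually depends on, shows how little information the window contains.

```python
import numpy as np

rng = np.random.default_rng(2)

days = np.arange(14)
# simulated daily log case counts: a random walk with drift (a unit root)
log_cases = np.cumsum(0.2 + rng.normal(0, 0.05, size=14))

def r_squared(t, y):
    """R^2 from a simple linear regression of y on t."""
    slope, intercept = np.polyfit(t, y, 1)
    fit = intercept + slope * t
    return 1 - np.sum((y - fit) ** 2) / np.sum((y - y.mean()) ** 2)

print("R^2, levels on time:     ", r_squared(days, log_cases))  # close to 1
print("R^2, differences on time:", r_squared(days[1:], np.diff(log_cases)))
```

The near-perfect fit in levels reflects the cumulative trend, not any predictive skill; differencing removes the unit root and reveals the noise.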

My reply: statistics lives in the sweet spot between data that are so clear that you don’t need statistics (what Dave Krantz used to call the “intra-ocular traumatic test” or “IOTT”: effects that are so large that they hit you between the eyes) and data that are so noisy that statistical analysis doesn’t help at all.

From this perspective, considering this zone of phase transition, the narrow beach between the continent of clear effects and the sea of confusion, statistics may seem like a specialized and unimportant subject. And maybe it is kind of unimportant. I’m a statistician and have spent my working life teaching, researching, and writing about the topic, but I’d never claim that statistics is more important than physics, or chemistry, or literature, or music, or any number of other pursuits.

On the other hand, this beach or borderland, while small compared to the continents and seas, is not nothing—indeed, lots of important things happen here! What’s known is already known, and we spend lots of time and effort on that knowledge frontier (to mix spatial metaphors a bit). Statistics is important because it’s one of our tools for pushing that frontier back, reclaiming the land, as it were. As long as we recognize its limitations, statistics can be a valuable tool and a helpful way of looking at the world.

I’m a PhD student in psychology with a background in computer science. I have struggled with the morality of the statistical approaches for a while, until I discovered Bayesian statistics. It doesn’t solve everything, but at least I don’t have to bend my mind in so many weird ways.

I would like to ask a question. In the last few years, you seem to embrace preregistration, as can be seen for example in this blogpost. However, I haven’t found a way to convince my co-authors of this. The reason is that my PhD project is part of an outside collaboration. We have automated large parts of the data collection from questionnaires and wearables. This way, we gather lots and lots of data. However, given that we want to steer our data collection procedures as early as possible and don’t have much literature to build our hypotheses on, the project managers push for analyzing the data continuously (exploring the data). To me, this is a big red flag. However, I do see their points as well. As another argument, since we have so much data, we can save a lot of time on being meticulous before doing anything. So, I came up with a research protocol:

1. Explore data and find result

2. Report preliminary result to client

3. Create preregistration

4. Verify the result on new data

5. If the result doesn’t hold anymore, go back to step 1

6. Report results to client

7. Write paper

Do you think this is still worthy of being called a “preregistration”? If not, how would you do it?

My reply: Sure, this sounds reasonable. Preregistration is a set of steps; it’s not anything precise. See for example my preregistered analysis here. It’s good to do this in a way that gives space for data exploration, because Cantor’s corner.

NBC has placed a series order for “The Irrational,” the network announced on Tuesday.

According to the logline, the drama follows Alec Baker, a world-renowned professor of behavioral science, who lends his expertise to an array of high-stakes cases involving governments, law enforcement and corporations with his unique and unexpected approach to understanding human behavior. The series is inspired by Dan Ariely’s novel “Predictably Irrational,” which was published in February 2008 by HarperCollins. Ariely will serve as a consultant.

I wonder if they’ll have an episode with the mysterious paper shredder?

In all seriousness, I absolutely love the above description of the TV show, especially the part where they described Ariely’s book as a novel. How edgy!

Describing his work as fiction . . . more accurate than they realize.

**P.S.** In the movie of me, I’d like to be played by a younger Rodney Dangerfield.

Mathematical convenience has often trumped common sense in financial models. For example, it is often assumed — because the assumption is useful — that changes in stock prices can be modeled as independent draws from a probability distribution. Paul Samuelson offered this analogy:

Write down those 1,800 percentage changes in monthly stock prices on as many slips of paper. Put them in a big hat. Shake vigorously. Then draw at random a new couple of thousand tickets, each time replacing the last draw and shaking vigorously. That way we can generate new realistically representative possible histories of future equity markets.

I [Smith] did Samuelson’s experiment. I put 100 years of monthly returns for the S&P 500 in a computer “hat” and had the computer randomly select monthly returns (with replacement) until I had a possible 25-year history. I repeated the experiment one million times, giving one million “Samuelson simulations.”

I also looked at every possible starting month in the historical data and determined the very worst and very best actual 25-year investment periods. The worst period began in September 1929, at the start of the Great Crash. An investment over the next 25 years would have had an annual return of 5.1%. The best possible starting month was January 1975, after the 1973-1974 crash. The annual rate of return over the next 25 years was 17.3%.

In the one million Samuelson simulations, 9.6% of the simulations gave 25-year returns that were worse than any 25-year period in the historical data and 4.9% of the simulations gave 25-year returns that were better than any actual 25-year historical period. Overall, 14.5% of the Samuelson simulations gave 25-year returns that were too extreme. Over a 50-year horizon, 24.5% of the Samuelson simulations gave 50-year returns that were more extreme than anything that has ever been experienced.
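Smith's resampling experiment is straightforward to replicate in outline. The sketch below draws monthly returns with replacement to build synthetic 25-year histories; since the real S&P 500 series isn't reproduced here, it uses an invented stand-in return distribution, so the specific percentages won't match Smith's.

```python
import numpy as np

rng = np.random.default_rng(3)

# stand-in for 100 years (1,200 months) of monthly returns; the fat-tailed
# distribution and its parameters are invented, clipped to rule out < -100%
monthly = np.clip(rng.standard_t(4, size=1200) * 0.04 + 0.008, -0.95, None)

def resampled_annual_return(returns, months=300):
    """One 'Samuelson simulation': a 25-year history drawn with replacement."""
    draws = rng.choice(returns, size=months, replace=True)
    return np.prod(1 + draws) ** (12 / months) - 1   # annualized return

sims = np.array([resampled_annual_return(monthly) for _ in range(10_000)])

# annualized returns of every actual overlapping 25-year window in the "data"
actual = np.array([np.prod(1 + monthly[i:i + 300]) ** (12 / 300) - 1
                   for i in range(len(monthly) - 300 + 1)])

share_extreme = np.mean((sims < actual.min()) | (sims > actual.max()))
print("share of simulations more extreme than any actual window:", share_extreme)
```

The point of the exercise is the comparison itself: independent draws with replacement wash out any serial dependence (crashes followed by recoveries, say), so the simulated histories are systematically more extreme than anything in the actual record.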

You might say that Smith is being unfair, as Samuelson was only offering a simple mathematical model. But it was Samuelson, not Smith, who characterized his random drawing as “realistically representative possible histories of future equity markets.” Samuelson was the one claiming realism.

My take is that Samuelson wanted it both ways. He wanted to show off his math, but he also wanted relevance, hence his “realistically.”

The prestige of economics comes partly from its mathematical sophistication but mostly because it’s supposed to relate to the real world.

Smith’s example of Samuelson’s error reminded me of this story from David Levy and Sandra Peart about this graph from the legendary textbook. This is from 1961:

Alex Tabarrok pointed out that it’s even worse than it looks: “in subsequent editions Samuelson presented the same analysis again and again except the overtaking time was always pushed further into the future so by 1980 the dates were 2002 to 2012. In subsequent editions, Samuelson provided no acknowledgment of his past failure to predict and little commentary beyond remarks about ‘bad weather’ in the Soviet Union.”

The bit about the bad weather is funny. If you’ve had bad weather in the past, maybe the possibility of future bad weather should be incorporated into the forecast, no?

**Is there a connection?**

Can we connect Samuelson’s two errors?

Again, the error with the Soviet economy forecast is not that he was wrong in the frenzied post-Sputnik year of 1961; the problem is that he kept making this error in his textbook for decades to come. Here’s another bit, from Larry White:

As late as the 1989 edition, [Samuelson and] coauthor William Nordhaus wrote: ‘The Soviet economy is proof that, contrary to what many skeptics had earlier believed, a socialist command economy can function and even thrive.’

I see three similarities between the stock-market error and the command-economy error:

1. Love of simple mathematical models: the random walk in one case and straight trends in the other. The model’s so pretty, it’s too good to check.

2. Disregard of data. Smith did that experiment disproving Samuelson’s claim. Samuelson could’ve done that experiment himself! But he didn’t. That didn’t stop him from making a confident claim about it. As for the Soviet Union, by the time 1980 had come along Samuelson had 20 years of data refuting his original model, but that didn’t stop him from just shifting the damn curve. No sense that, hey, maybe the model has a problem!

3. Technocratic hubris. There’s this whole story about how Samuelson was so brilliant. I have no idea how brilliant he was—maybe standards were lower back then?—but math and reality don’t care how brilliant you are. I see a connection between Samuelson thinking that he could describe the stock market with a simple random walk model, and him thinking that the Soviets could just pull some levers and run a thriving economy. Put the experts in charge, what could go wrong, huh?

**More stories**

Smith writes:

As a student, Samuelson reportedly terrorized his professors with his withering criticisms.

Samuelson is of course the uncle of Larry Summers, another never-admit-a-mistake guy. There is a story about Summers saying something stupid to Samuelson a week before Arthur Okun’s funeral. Samuelson reportedly said to Summers, “In my eulogy for Okun, I’m going to say that I don’t remember him ever saying anything stupid. Well, now I won’t be able to say that about you.”

There was a famous feud between Samuelson and Harry Markowitz about whether investors should think about arithmetic or geometric means. In one Samuelson paper responding to Markowitz, every word (other than the author names) was a single syllable.

I once gave a paper at a festschrift honoring Tobin. Markowitz began his talk by graciously saying to Samuelson, who was sitting with arms crossed in the front row, “In the spirit of this joyous occasion, I would like to say to Paul that ‘Perhaps there is some merit in your argument.’” Samuelson immediately responded, “I wish I could say the same.”

Here’s the words-of-one-syllable paper, and here’s a post that Smith found:

Maybe Samuelson and his coauthors should’ve spent less time on dominance games and “boss moves” and more time actually looking out at the world that they were purportedly describing.

**P.S.** OK, I was wrong.

*The Journal of Visualization and Interaction* (JoVI) is a venue for publishing scholarly work related to the fields of visualization and human-computer interaction. Contributions to the journal include research in:

- how people understand and interact with information and technology,
- innovations in interaction techniques, interactive systems, or tools,
- systematic literature reviews,
- replication studies or reinterpretations of existing work,
- and commentary on existing publications.

One component of their mission is to require materials to be open by default, including exposing all data and reasoning for scrutiny, and making all code reproducible “within a reasonable effort.” Other goals are to emphasize knowledge and discourage rejection based on novelty concerns (a topic that comes up often in computer science research; see, e.g., my thoughts here). They welcome registered reports, and say they will not impose top-down constraints on how many papers can be published that can lead to arbitrary-seeming decisions on papers that hinge on easily fixable mistakes. This last part makes me think they are trying to avoid the kind of constrained decision processes of conference proceeding publications, which are still the most common publication mode in computer science. There are existing journals like Transactions on Visualization and Computer Graphics that give authors more chances to go back and forth with reviewers, and my experience as associate editor there is that papers don’t really get rejected for easily fixable flaws. Part of JoVI’s mission seems to be about changing the kind of attitude that reviewers might bring, away from looking for reasons to reject and toward trying to work with the authors to make the paper as good as possible. If they can do this while also avoiding some of the other CS review-system problems, like lack of attention or insufficient background knowledge of reviewers, perhaps the papers will end up being better than what we currently see in visualization venues.

This part of JoVI’s mission distinguishes it from other visualization journals:

**Open review, comments, and continued conversation.** All submitted work, reviews, and discussions will by default be publicly available for other researchers to use. To encourage accountability, editors’ names are listed on the articles they accept, and reviewers may choose to be named or anonymous. All submissions and their accompanying reviews and discussions remain accessible whether or not an article is accepted. To foster discussions that go beyond the initial reviewer/author exchanges, we welcome post-publication commentaries on articles.

Open review is so helpful for adding context to how papers were received at the time of submission, so I hope it catches on here. Plus I really dislike the attitude that it is somehow unfair to bring up problems with published work, at least outside of the accepted max 5 minutes of public Q&A that happens after the work is presented at a conference. People talk amongst themselves about what they perceive the quality or significance of new contributions to be, but many of the criticisms remain in private circles. It will be interesting to see if JoVI gets some commentaries or discussion on published articles, and what they are like.

This part is also interesting: “On an alternate, optional submission track, we will continually experiment with new article formats (including modern, interactive formats), new review processes, and articles as living documents. This experimentation will be motivated by re-conceptualizing peer review as a humane, constructive process aimed at improving work rather than gatekeeping.”

distill.pub is no longer publishing new stuff, but some of their interactive ML articles were very memorable and probably had more impact than more conventionally published papers on the topic. Even more so, I like the idea of trying to support articles as living documents that can continue to be updated. The current publication practices in visualization seem a long way from encouraging a process where it’s normal to first release working papers. Instead, people spend six months building their interactive system or doing their small study to get a paper-size unit of work, and then they move on. I associate the areas where working papers seem to thrive (e.g., theoretical or behavioral econ) with theorizing or trying to conceptualize something fundamental to behavior, rather than just describing or implementing something. The idea that we should be trying to write visualization papers that really make us think hard over longer periods, and that may not come in easily bite-size chunks, seems kind of foreign to how the research is conceptualized. But any steps toward thinking about papers as incomplete or imperfect, and building more feedback and iteration into the process, are welcome.

The random walk model for polls is a bit like the idea that the hot hand is a fallacy: it’s an appealing argument that has a lot of truth to it (as compared to the alternative model that poll movement or sports performance is easily predictable given past data) but is not quite correct, and the mistake comes when it is elevated from a heuristic to a principle.

This mistake happens a lot, no? It comes up in statistics all the time.

**P.S.** Some discussion in comments on stock market and investing. I know nothing about that topic; the above post is just about the general problem of people elevating a heuristic to a principle.

Will Hobbs pointed me to his (PPNAS!) paper with Anthony Ong that highlights and examines another cause: common method bias.

Common method bias is the well-known (in some corners at least) phenomenon whereby specific common variance in variables measured through the same methods can produce bias. You can come up with many mechanisms for this. Variables measured in the same questionnaire can be correlated because of consistency motivations, the same tendency to give socially desirable responses, similar uses of the same scales, etc.

Many of these biases result in inflated correlations. Hobbs writes:

[U]nreasonable effect size priors is one of my main motivations for this line of work.

A lot of researchers seem to consider effect sizes meaningful only if they’re comparable to the huge observational correlations seen among subjective closed-ended survey items.

But often the quantities we *really* care about — or at least we are planning more ambitious field studies to estimate — are inherently going to be not measured in the same ways. We might assign a treatment and measure a survey outcome. We might measure a survey outcome, use this to target an intervention, and then look at outcomes in administrative data (e.g., income, health insurance data).

Here at this blog, perhaps there’s the most coverage of tiny studies, forking paths, and selection bias as causes of inflated effect size expectations. So this is a good reminder that there are plenty of other causes, even with big samples or pre-registered analysis plans, like common method bias and confounding more generally.
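The mechanics are easy to demonstrate. Here is a minimal simulation (made-up variances, not anything from the Hobbs and Ong paper) in which two traits are truly uncorrelated, but both survey measures share a per-respondent method factor:

```python
import math
import random

def corr(xs, ys):
    # plain Pearson correlation
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

random.seed(0)
n = 100_000
# two traits that are truly unrelated
true_x = [random.gauss(0, 1) for _ in range(n)]
true_y = [random.gauss(0, 1) for _ in range(n)]
# a per-respondent method factor shared by both survey measures
# (acquiescence, social desirability, similar scale use, ...)
method = [random.gauss(0, 1) for _ in range(n)]
obs_x = [t + m for t, m in zip(true_x, method)]
obs_y = [t + m for t, m in zip(true_y, method)]

r_true = corr(true_x, true_y)
r_obs = corr(obs_x, obs_y)
print(r_true)  # near 0
print(r_obs)   # near 0.5, manufactured entirely by the shared method factor
```

With equal variances for trait and method factor, the expected observed correlation is var(method)/(var(trait)+var(method)) = 0.5, which is why survey-to-survey correlations can look so impressive.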

*This post is by Dean Eckles.*


ChatGPT can do deep, involved reasoning. It has the context capacity to do that.

I [Bob] think that human language is what is known as “AI complete”. To be good at language, you have to be intelligent, because language is about the world and context. You can’t do what ChatGPT does while ignorant of the world or unable to plan. . . .

Humans also generally produce output one word at a time in spoken language. In writing we can plan and go back and revise. We can do a little planning on the fly, but not nearly as much. To me, this was the biggest open problem in computational linguistics—it’s what my job talk was about in 1989 and now it’s basically a solved problem from the engineering if not scientific perspective.

I [Bob] am not saying there’s no limitations to using the LLM architecture—it doesn’t have any long- or really medium-term memory. I’m just saying it can’t do what it does now without some kind of “intelligence”. If you try to define intelligence more tightly, you either rule out humans or you somehow say that only human meat can be intelligent.

I told Bob that take on this might be controversial, even among computer scientists, and he replied:

Of course. Everything’s controversial among academics . . .

My [Bob’s] position is hardly novel. It’s the take of everyone I know who understands the tech (of course, that’s a no-true-Scotsman argument), including this paper from Microsoft Research. I do think if you have studied cognitive science, philosophy of language, and philosophy of mind, studied language modeling, studied psycholinguistics, have some inkling of natural language compositional semantics and lexical semantics, and you understand crowdsourcing with human feedback, then you’re much more likely to come to the same conclusion as me. If you’re just shooting from the hip without having thought deeply about meaning and how to frame it or how humans process language a subword component at a time, then of course the behavior seems “impossible”. Everyone seems to have confused it with cutting-and-pasting search results, which is not at all what it’s doing.

I’m not saying it’s equivalent to a human, just that whatever it’s doing is a form of general intelligence. What it’s truly lacking is longer term memory. That means there are things humans can do that it really is incapable of doing in its present form. But that’s not because it’s a “dumb machine”. We’re just “dumb meat” viewed from that perspective (unless you want to get all spiritual and say we have a soul of some kind that matters).

Bob also recommends this paper from Google and this one from OpenAI, and he continues:

There’s a ton of work on scaling laws now and what people are seeing is emergent behavior at certain model sizes. As in like 1% performance for 3B parameters, then 95% performance for 6B parameters kind of thing. But nobody knows why this is happening or where.

The capacity of these models is quite high, including the representation of words, representation of positions, etc. It’s generating one word at a time, but the structure is an incredibly rich time series with literally billions of parameters.

The background here is that I’ve been reading what Thomas Basbøll has been writing on chatbots and the teaching of writing (a topic of interest to me, because I teach writing as part of my Communicating Data and Statistics course), and he recommended a long article by Giulio Alessandrini, Brad Klee, and Stephen Wolfram entitled “What Is ChatGPT Doing . . . and Why Does It Work?”

I really liked Alessandrini et al.’s article. It was at the right level for me, stepping through the following topics:

It’s Just Adding One Word at a Time

Where Do the Probabilities Come From?

What Is a Model?

Models for Human-Like Tasks

Neural Nets

Machine Learning, and the Training of Neural Nets

The Practice and Lore of Neural Net Training

“Surely a Network That’s Big Enough Can Do Anything!”

The Concept of Embeddings

Inside ChatGPT

The Training of ChatGPT

Beyond Basic Training

What Really Lets ChatGPT Work?

Semantic Grammar and the Power of Computational Language

So . . . What Is ChatGPT Doing, and Why Does It Work?
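That first topic — generating text by repeatedly sampling a next word from estimated probabilities — can be sketched with a toy bigram model. This is a drastic simplification of what ChatGPT does (a frequency table instead of a transformer with a long context), but the generation loop has the same shape:

```python
import random
from collections import defaultdict

# tiny "training set"; real models train on trillions of words of text
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat saw the dog .").split()

# next-word frequency table: the probabilities come from counting
counts = defaultdict(lambda: defaultdict(int))
for w, nxt in zip(corpus, corpus[1:]):
    counts[w][nxt] += 1

def sample_next(word, rng):
    # sample the next word in proportion to how often it followed `word`
    nexts = list(counts[word].items())
    return rng.choices([w for w, _ in nexts], weights=[c for _, c in nexts])[0]

def generate(start, n_words, rng):
    # "it's just adding one word at a time"
    out = [start]
    for _ in range(n_words):
        out.append(sample_next(out[-1], rng))
    return " ".join(out)

print(generate("the", 8, random.Random(0)))
```

ChatGPT replaces the frequency table with a neural network conditioned on thousands of tokens of context, but the outer loop is still append-one-word-and-repeat.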

Alessandrini et al.’s article has lots of examples, graphs, and code, and I get the impression that they’re actively trying to figure out what’s going on. They get into some interesting general issues; for example,

One might have thought that for every particular kind of task one would need a different architecture of neural net. But what’s been found is that the same architecture often seems to work even for apparently quite different tasks. At some level this reminds one of the idea of universal computation . . . but I think it’s more a reflection of the fact that the tasks we’re typically trying to get neural nets to do are “human-like” ones—and neural nets can capture quite general “human-like processes”.

In earlier days of neural nets, there tended to be the idea that one should “make the neural net do as little as possible”. For example, in converting speech to text it was thought that one should first analyze the audio of the speech, break it into phonemes, etc. But what was found is that—at least for “human-like tasks”—it’s usually better just to try to train the neural net on the “end-to-end problem”, letting it “discover” the necessary intermediate features, encodings, etc. for itself. . . .

That’s not to say that there are no “structuring ideas” that are relevant for neural nets. Thus, for example, having 2D arrays of neurons with local connections seems at least very useful in the early stages of processing images. And having patterns of connectivity that concentrate on “looking back in sequences” seems useful . . . in dealing with things like human language, for example in ChatGPT.

They also talk about the choices involved in tuning the algorithms—always an important topic in statistics and machine learning—so, all in all, I think a good starting point before getting into the technical articles that Bob pointed us to above. I pointed Bob to the Alessandrini et al. tutorial and his reaction was that it “seriously under-emphasizes the attention model in the transformer and the alignment post-training. It’s the latter that took GPT-3 to ChatGPT, and it’s a huge difference.”

That’s the problem with sending a pop science article to an expert: the expert will latch on to some imperfection. The same thing happens to me when people send me popular articles on Bayesian statistics or American politics or whatever: I can’t help focusing on the flaws. Anyway, I still like the Alessandrini et al. article, I guess more so when supplemented with Bob’s comments.

**P.S.** Also I told Bob I still don’t get how generating one word at a time can tell the program to create a sonnet in the style of whoever. I just don’t get how “Please give me a sonnet . . .” will lead to a completion that has sonnet form. Bob replied:

Turn this around. Do you know how you write sentences without planning them all out word by word ahead of time? Language is hard, but we do all kinds of planning in this same way. Think about how you navigate from home to work. You don’t plan out a route step by step then execute it, you make a very general plan (‘ride my bike’ or ‘take the subway’), take a step toward that goal, then repeat until you get to work. As you get to each part of the task, the steps (unlock the bike, carry the bike outside, put the kickstand up, get on the bike, etc.) are all easily cued by what you did last, so it barely requires any thought at all. ChatGPT does the same thing with language. ChatGPT does a ton of computation on your query before starting to generate answers. It absolutely does a kind of “planning” in advance and as the MS paper shows, you can coach it to do better planning by asking it to share its plans. It does this all with its attention model. And it maintains several rich, parallel representations of how language gets generated.

Do you know how you understand language one subword component at a time? Human brains have *very slow* clock cycles, but very *high bandwidth* associative reasoning. We are very good at guessing what’s going to come next (though not nearly as good as GPT—its ability at this task is far beyond human ability) and very good at piecing together meaning from hints (too good in many ways as we jump to a lot of false associations and bad conclusions). We are terrible at logic and planning compared to “that looks similar to something I’ve seen before”.

I think everyone who’s thought deeply about language realizes it has evolved to make these tasks tractable. People can rap and write epic poems on the fly because there’s a form that we can follow and one step follows the next when you have a simple bigger goal. So the people who know the underlying architectures, but say “oh language is easy, I’m not impressed by ChatGPT” are focusing on this aspect of language. Where ChatGPT falls down is with long chains of logical reasoning. You have to coax it to do that by telling it to. Then it can do it in a limited way with guidance, but its basic architecture doesn’t support good long-term planning for language. If you want GPT to write a book, you can’t prompt it with “write a book”. Instead, you can say “please outline a book for me”, then you can go over the outline and have it instantiate as you go. At least that’s how people are currently using GPT to generate novels.

I asked Aki Vehtari about this, and Aki pointed out that there are a few zillion sonnets on the internet already.

Regarding the general question, “How does the chatbot do X?”, where X is anything other than “put together a long string of words that looks like something that could’ve been written by a human” (so, the question could be, “How does the chatbot write a sonnet” or “How does ChatGPT go from ‘just guessing next word’ to solving computational problems, like calculating weekly menu constrained by number of calories?”), Bob replied:

This is unknown. We’ve basically created human-level or better language ability (though not human or better ability to connect language to the world) and we know the entire architecture down to the bit level and still don’t know exactly why it works. My take and the take of many others is that it has a huge capacity in its representation of words and its representation of context and the behavior is emergent from that. It’s learned to model the world and how it works because it needs that information to be as good as it is at language.

Technically, it’s a huge mixture model of 16 different “attention heads”, each of which is itself a huge neural network and each of which pays attention to a different form of being coherent. Each of these is a contextual model with access to the previous 5K or so words (8K subword tokens).

Part of the story is that the relevant information is in the training set (lots of sonnets, lots of diet plans, etc.); the mysterious other part is how it knows from your query what piece of relevant information to use. I still don’t understand how it can know what to do here, but I guess that for now I’ll just have to accept that the program works but I don’t understand how. Millions of people drive cars without understanding at any level how cars work, right? I basically understand how cars work, but there’d be no way I could build one from scratch.
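For readers who want Bob’s talk of “attention” made concrete, here is a toy sketch of scaled dot-product attention: a single head, no learned projections, nothing like production scale, just the core operation of weighting context positions by their similarity to a query:

```python
import math

def softmax(xs):
    # numerically stable softmax: subtract the max before exponentiating
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attention(query, keys, values):
    # score each context position by query-key similarity,
    # then return the similarity-weighted average of the values
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# a query that closely matches the first key gets (mostly) the first value back
out = attention([10.0, 0.0], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])
print(out)
```

Real models stack many such heads and layers, with learned linear projections producing the queries, keys, and values, which is where the billions of parameters live.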

In order to meet regulatory approval, pharmaceutical companies often must demonstrate that new vaccines reduce the total risk of a post-infection outcome like transmission, symptomatic disease, severe illness, or death in randomized, placebo-controlled trials. Given that infection is a necessary precondition for a post-infection outcome, one can use principal stratification to partition the total causal effect of vaccination into two causal effects: vaccine efficacy against infection, and the principal effect of vaccine efficacy against a post-infection outcome in the patients that would be infected under both placebo and vaccination. Despite the importance of such principal effects to policymakers, these estimands are generally unidentifiable, even under strong assumptions that are rarely satisfied in real-world trials. We develop a novel method to nonparametrically point identify these principal effects while eliminating the monotonicity assumption and allowing for measurement error. Furthermore, our results allow for multiple treatments, and are general enough to be applicable outside of vaccine efficacy. Our method relies on the fact that many vaccine trials are run at geographically disparate health centers, and measure biologically-relevant categorical pretreatment covariates. We show that our method can be applied to a variety of clinical trial settings where vaccine efficacy against infection and a post-infection outcome can be jointly inferred. This can yield new insights from existing vaccine efficacy trial data and will aid researchers in designing new multi-arm clinical trials.

Sounds important. And they use Stan, which always makes me happy.

Could you refer me to the definition of “representative sample” as you use it in your books. I am interested in developing my understanding of the theoretical and philosophical basis to make statements about unobservables using measurements on subsets of populations. I also want to learn more about how a Bayesian approach changes (or not) how one deals with sampling design. Any readings that you can recommend will be appreciated.

My reply: I don’t think there’s any formal definition. We could say that a sample is representative if we could usefully use it to represent the population. As this definition makes clear, “representativeness” depends on the use to which the sample would be put.

Do readers have other thoughts?

A method to find a probability that a given bias of mutations occur naturally is proposed to test whether a newly detected virus is a product of natural evolution or artificial genetic modification. The probability is calculated based on the neutral theory of molecular evolution and binominal distribution of non-synonymous (N) and synonymous (S) mutations. Though most of the conventional analyses, including dN/dS analysis, assume that any kinds of point mutations from a nucleotide to another nucleotide occurs with the same probability, the proposed model takes into account the bias in mutations, where the equilibrium of mutations is considered to estimate the probability of each mutation. The proposed method is applied to evaluate whether the Omicron variant strain of SARS-CoV-2, whose spike protein includes 29 N mutations and only one S mutation, can emerge through natural evolution. The result of binomial test based on the proposed model shows that the bias of N/S mutations in the Omicron spike can occur with a probability of 1.6 x 10^(-3) or less. Even with the conventional model where the probabilities of any kinds of mutations are all equal, the strong N/S mutation bias in the Omicron spike can occur with a probability of 3.7 x 10^(-3), which means that the Omicron variant is highly likely a product of artificial genetic modification.

I don’t know anything about the substance. The above bit makes me suspicious, as it looks like what they’re doing is rejecting null hypothesis A and using this to claim that their favored alternative hypothesis B is true.

Further comments from an actual expert are here.
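Setting aside the substance, the “conventional model” binomial calculation the abstract describes is simple to reproduce in outline. The synonymous fraction below is a rough assumption (random point mutations in a coding region are often around a quarter synonymous), not the value used in the paper:

```python
from math import comb

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Omicron spike: 29 nonsynonymous (N) + 1 synonymous (S) mutation, 30 total.
# p_syn = 0.25 is a rough assumption about the synonymous fraction under
# equal-probability point mutations, not the paper's exact value.
n_mutations, s_observed, p_syn = 30, 1, 0.25
prob = binom_cdf(s_observed, n_mutations, p_syn)  # P(at most 1 S out of 30)
print(prob)  # small, on the order of 10^-3 under these assumptions
```

Note that a small tail probability like this only rejects the equal-probability null; as discussed above, it says nothing by itself about any particular alternative explanation.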

Just finished “Understanding Statistics and Experimental Design. How to Not Lie with Statistics”, by Michael Herzog, Gregory Francis, and Aaron Clarke (Springer, 2019). Near the end (p. 128), I read the following regarding “optional stopping”:

…suppose a scientist notes a marginal (p = 0.07) result in Experiment 1 and decides to run a new Experiment 2 to check on the effect. It may sound like the scientist is doing careful work, however, this is not necessarily true. Suppose Experiment 1 produced a significant effect (p = 0.03), would the scientist still have run Experiment 2 as a second check? If not, then the scientist is essentially performing optional stopping across experiments, and the Type I error rate for any given experiment (or across experiments) is unknown.

Indeed, the problem with optional stopping is not the actual behavior performed by the scientist (e.g., the study with a planned sample size gives p = 0.02) but what he would have done if the result turned out differently (e.g., if the study with a planned sample size gives p = 0.1, he would have added 20 more subjects). More precisely, if you do not know what you would have done under all possible scenarios, then you cannot know the Type I error rate for your analysis.

It is the last statement, “if you do not know what you would have done under all possible scenarios, then you cannot know the Type I error rate for your analysis”, that kept me wondering and prompted me to write to you and ask you for your comments.

My reply:

1. Yes, they’re correct that if you do not know what you would have done under all possible scenarios, then you cannot know the Type I error rate for your analysis. We make this point in section 1.2 of our Garden of Forking Paths paper and this is part of the definition of these error rates; the statement should not be controversial.

2. I’m pretty much not interested in type 1 or type 2 error rates; mostly the only reason I think it’s worth thinking about them is to respond to confused researchers who think that a low p-value represents strong evidence for their preferred theories.

3. I think that stopping the experiment based on the data is just fine in practice and is not cheating, as some people might think. See my discussion here.
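Regarding point 1, the inflation is easy to see in a quick simulation of one version of the rule described above — test 20 subjects, and if the result isn’t significant, add 20 more and test again — with the null hypothesis true throughout (a z-test approximation; the exact rates depend on the rule, which is the whole point):

```python
import math
import random

def pvalue(xs):
    # two-sided z-test of zero mean (normal approximation; crude but fine here)
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return math.erfc(abs(m) * math.sqrt(n) / (sd * math.sqrt(2)))

def rejects(rng, optional_stopping):
    xs = [rng.gauss(0, 1) for _ in range(20)]  # the null is true: no effect
    if pvalue(xs) < 0.05:
        return True
    if optional_stopping:
        xs += [rng.gauss(0, 1) for _ in range(20)]  # "add 20 more subjects"
        return pvalue(xs) < 0.05
    return False

trials = 20_000
rng = random.Random(1)
fixed_rate = sum(rejects(rng, False) for _ in range(trials)) / trials
stop_rate = sum(rejects(rng, True) for _ in range(trials)) / trials
print(fixed_rate, stop_rate)  # the stopping rule inflates the rejection rate
```

Two chances to cross the threshold means more than a 5% chance of crossing it at least once; a researcher who doesn’t know which rule they would have followed can’t compute either number.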

I also sent to Greg Francis, who wrote:

I too don’t think the statement is controversial, even though it is surprising to many practicing scientists. In part I think that is because textbooks on hypothesis testing give examples with very specific settings (always with a fixed sample size). Even though (usually) the textbooks properly describe theorems (e.g., for defining a sampling distribution), they ignore some realities of data collection (a fixed sample size is not the norm) and so do not consider what happens when you deviate from the theorems.

Contrary to Andrew, I think a lot of scientists genuinely do care about Type I and Type II error rates. However, getting control of those error rates is more difficult than many people realize, and I think that difficulty is good motivation to consider Bayesian approaches.

For a bit more discussion about error rates across experiments, you might look at this paper on a “reverse Bonferroni” method. I’m not sure I would actually recommend the method for any practical situation, but it highlights what to consider when you try to control error rates across experiments.

The 2022 election defied conventional wisdom and historical trends. In a typical midterm election year with one-party control of the presidency, House and Senate, the incumbent party would expect major losses. Instead, Democrats re-elected every incumbent senator and expanded their Senate majority by a seat, won the overwhelming majority of heavily contested gubernatorial elections, gained control of 4 state legislative chambers, and only narrowly lost the U.S. House. . . .

Unlike other recent midterm years, our analysis shows a stark contrast between the electorate in areas with one or more highly contested House, Senate or gubernatorial races versus those with less contested races. . . .

Their key findings:

Gen Z and Millennial voters had exceptional levels of turnout, with young voters in heavily contested states exceeding their 2018 turnout by 6% among those who were eligible in both elections. Further, 65% of voters between the ages of 18 and 29 supported Democrats, cementing their role as a key part of a winning coalition for the party. While young voters were historically evenly split between the parties, they are increasingly voting for Democrats.

Extreme “MAGA” Republicans underperformed. . . . Candidates who were outspoken election deniers did 1 to 4 points worse than other Republicans, contributing to their losses in important close races. Of course, election denial is one of many extreme positions associated with “MAGA” Republicans, so this analysis likely reflects relatively extreme stances on other issues, including abortion rights . . .

Women voters pushed Democrats over the top in heavily contested races, where abortion rights were often their top issue. After Republican-appointed justices on the Supreme Court overturned abortion rights, a disproportionate number of women voters registered to vote in states with highly contested elections. At the same time, polls showed Democratic women and men indicating they were more engaged in the election. While relative turnout by gender remained largely stable, Democratic performance improved over 2020 among women in highly contested races, going from 55% to 57% support.

Democrats largely retained their winning 2020 coalition in heavily contested races, with some exceptions. Turnout and support among voters by race, education, gender, and other demographic factors remained relatively stable in heavily contested races. . . . Democratic support among young voters is partly due to the diversity of this group, as America becomes more diverse over time. But that is not the whole story. Democratic support was higher among young voters of color, both nationally (78%) and in highly contested races (also 78%). But support among young white voters rose between 2018 (53% national, 52% highly contested races) and 2022 (58% nationally, 57% highly contested races). This 5-6 point support change is notable, indicating a broad base of Democratic support among young voters across the country. . . .

By any historical standard, 2022 turnout was high. Nationally it did not match 2018’s record-breaking turnout of 118 million votes, but it did reach 111 million ballots cast. . . . However, these national turnout numbers mask important differences at the state and congressional level: namely, that turnout matched or even exceeded 2018 turnout in the most highly contested elections in the country. . . . in these heavily contested races with higher turnout, Democratic candidates generally prevailed. . . . While some of these turnout trends are driven by population increases, it mostly reflects the high turnout environment that has been consistent from the 2018 election onward. In highly contested elections — where voters know the race could be decided by a small number of votes and campaigns invest resources into engaging voters — turnout often matched the historic “Blue Wave” election in 2018. . . .

Campaigns and voter registration groups invest significant resources in identifying, registering and mobilizing new voters as they seek to grow their coalitions. The high turnout era has been marked by millions of new voters entering — and staying in — the electorate. . . .

Lots more numbers and graphs at the above link.

The ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) will be held this year in Chicago from 12-15 June. And at 7pm on Wed 13 June there will be a staged reading of Recursion, a play that Jessica and I wrote.

Recursion is an entertaining and thought-provoking (we hope) play with computer science themes that connect to many of the topics that bring people to FAccT. The play will be directed and acted by a troupe of students from the Northwestern University theater program. We even have some funding thanks to Northwestern’s Engineering school.

If you’ll be coming to FAccT, we hope you can make it to the performance! And if you’re not coming, well, maybe you should reconsider.

*This post is by Lizzie*

A colleague from U-Mass Amherst sent me this image yesterday. He said he had found ‘multiple paper copies’ in an office and then ruminated on how they might have been used. I suggested they might have been for ‘a group of super excited folks at a conference jam session!’

This leads back to a rumination I have had for a long time: how come I cannot find an MCMC version of ‘Jump Around’? It seems many of the lyrics could be improved upon with an MCMC spin (though I would keep: I got more rhymes than there’s cops at a Dunkin’).

My colleague suggested that there is perhaps a need to host a Bayesian song adaptation contest ….

Ultimately the problem is not with p-values but with null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B. Whenever this sort of reasoning is being done, the problems discussed above will arise. Confidence intervals, credible intervals, Bayes factors, cross-validation: you name the method, it can and will be twisted, even if inadvertently, to create the appearance of strong evidence where none exists.

I put much of the blame on statistical education, for two reasons:

First, in our courses and textbooks (my own included), we tend to take the “dataset” and even the statistical model as given, reducing statistics to a mathematical or computational problem of inference and encouraging students and practitioners to think of their data as given. . . .

Second, it seems to me that statistics is often sold as a sort of alchemy that transmutes randomness into certainty, an “uncertainty laundering” that begins with data and concludes with success as measured by statistical significance. Again, I do not exempt my own books from this criticism: we present neatly packaged analyses with clear conclusions. This is what is expected—demanded—of subject-matter journals. . . .

If researchers have been trained with the expectation that they will get statistical significance if they work hard and play by the rules, if granting agencies demand power analyses in which researchers must claim 80% certainty that they will attain statistical significance, and if that threshold is required for publication, it is no surprise that researchers will routinely satisfy this criterion, and publish, and publish, and publish, even in the absence of any real effects, or in the context of effects that are so variable as to be undetectable in the studies that are being conducted.

In summary:

I agree with most of the ASA’s statement on p-values but I feel that the problems are deeper, and that the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

I’m probably the 100th person that has sent this to you: here is the NEJM editorial and here is the study.

The underlying issue, which has been a concern of mine for some time now, is the usual practice of basing analysis on “intention to treat” rather than on “treatment per protocol.” In the present case, randomized assignment into groups “invited” to have a colonoscopy and those not resulted in a low percentage actually following the advice. Based on intention to treat, the benefits of colonoscopy appear to be small (or potentially none). Based on those actually receiving the colonoscopy, the effectiveness appears quite large. While the editorial accurately describes the results, it seems far less clear than it could/should be. The other reasons why this latest study may differ from prior ones are valid (effectiveness of the physicians, long-term follow up, etc.) but pale in importance with the obvious conclusion that when adherence is low, effectiveness is thwarted. As the editorial states, “screening can be effective only if it is performed.” I think that should be the headline and that is what the media reports should have focused on. Instead, the message is mixed at best – leading some headlines to suggest that the new study raises questions about whether or not colonoscopies are effective (or cost-effective).

The correct story does come though if you read all the stories but I think the message is far more ambiguous than it should be. Intention to treat is supposed to reflect real world practice whereas treatment per protocol is more of a best-case analysis. But when the difference (and the adherence rate here, less than 50%) is so low, then the most glaring result of this study should be that increasing adherence is of primary importance (in my opinion). Instead, there is a mixed message. I don’t even think the difference can be ascribed to the difference in audiences. Intention to treat may be appropriate for public health practitioners whereas the treatment per protocol might be viewed as appropriate for individual patients. However, in this case it would seem relatively costless to invite everyone in the target group to have a colonoscopy, even if less than half will do so. Actually, I think the results indicate that much more should be done to improve adherence, but at a minimum I see little justification for not inviting everyone in the target group to get a colonoscopy. I don’t see how this study casts much doubt on those conclusions, yet the NEJM and the media seem intent on mixing the message.

In fact, Dale was not the 100th person who had sent this to me or even the 10th person. He was the only one, and I had not heard about this story. I’m actually not sure how I would’ve heard about it . . .

Anyway, I quickly looked at everything and I agree completely with Dale’s point. For example, the editorial says:

In the intention-to-screen analysis, **colonoscopy** was found to reduce the risk of colorectal cancer over a period of 10 years by 18% (risk ratio, 0.82; 95% confidence interval [CI], 0.70 to 0.93). However, the reduction in the risk of death from colorectal cancer was not significant (risk ratio, 0.90; 95% CI, 0.64 to 1.16).

I added the boldface above. What it should say there is not “colonoscopy” but “encouragement to colonoscopy.” Just two words, but a big deal. There’s nothing wrong with an intent-to-treat analysis, but then let’s be clear: it’s measuring the intent to treat, not the treatment itself.
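The dilution Dale describes is simple arithmetic. Here is a toy calculation with made-up numbers (the 42% adherence figure is from the trial; the baseline risk and treatment effect are hypothetical):

```python
# Toy intention-to-treat dilution. Only the adherence figure comes from the
# trial; the baseline risk and treatment effect below are made up.
adherence = 0.42          # fraction of invitees who actually got a colonoscopy
effect_if_treated = 0.5   # hypothetical risk ratio among those screened
baseline_risk = 0.012     # hypothetical 10-year risk with no screening

invited_risk = (adherence * baseline_risk * effect_if_treated
                + (1 - adherence) * baseline_risk)
itt_risk_ratio = invited_risk / baseline_risk
print(itt_risk_ratio)  # 1 - 0.42*0.5 = 0.79: a 50% benefit shows up as 21%
```

The intent-to-treat risk ratio doesn’t depend on the baseline risk at all, only on adherence and the effect among the treated, which is exactly why low adherence mechanically shrinks the estimated benefit of the invitation.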

**P.S.** Relatedly, I received this email from Gerald Weinstein:

Abuse of the Intention to Treat Principle in RCTs has led to some serious errors in interpreting such studies. The most absurd, and possibly deadly example is a recent colonoscopy study which was widely reported as “Screening Procedure Fails to Prevent Colon Cancer Deaths in a Gold-standard Study,” despite the fact that only 42% of the colonoscopy group actually underwent the procedure. My concern is far too many people will interpret this study as meaning “colonoscopy doesn’t work.”

It seems some things don’t change, as I had addressed this issue in a paper written with your colleague, Bruce Levin, in 1985 (Weinstein GS and Levin B: The coronary artery surgery study (CASS): a critical appraisal. J. Thorac. Cardiovasc. Surg. 1985;90:541-548). I am a retired cardiac surgeon who has had to deal with similar misguided studies during my long career.

The recent NEJM article “Effect of Colonoscopy Screening on Risks of Colorectal Cancer and Related Death” showed only an 18% reduction in death in the colonoscopy group which was not statistically significant and was widely publicized in the popular media with headlines such as “Screening Procedure Fails to Prevent Colon Cancer Deaths in Large Study.”

In fact, the majority of people in the study group did not undergo colonoscopy, but were only *invited* to do so, with only 42% participating. How can colonoscopy possibly prevent cancer in those who don’t undergo it? Publishing such a study is deeply misguided and may discourage colonoscopy, with tragic results.

Consider this: If someone wanted to study attending a wedding as a superspreader event, but included in the denominator all those who were invited, rather than those who attended, the results would be rendered meaningless by so diluting the case incidence as to lead to the wrong conclusion.

My purpose here is not merely to bash this study, but to point out difficulties with the “Intention to Treat” principle, which has long been a problem with randomized controlled studies (RCTs). The usefulness of RCTs lies in the logic of comparing two groups, alike in every way *except* for the treatment under study, so any differences in outcome may be imputed to the treatment. Any violation of this design can invalidate the study, but too often, such studies are assumed to be valid because they have the structure of an RCT.

There are several ways a clinical study can depart from RCT design: patients in the treatment group may not actually undergo the treatment (as in the colonoscopy study) or patients in the control group may cross over into the treatment group, yet still be counted as controls, as happened in the Coronary Artery Surgery Study (CASS) of the 1980s. Some investigators refuse to accept the problematic effects of such crossover and insist they are studying a “policy” of treatment, rather than the treatment itself. This concept, followed to its logical (illogical?) conclusion, leads to highly misleading trials, like the colonoscopy study.

**P.P.S.** I had a colonoscopy a couple years ago and it was no big deal, not much of an inconvenience at all.

Back in the 1950s, when the Gallup poll was almost the only game in town, it was rational to respond to the survey–you’d be one of about 1000 respondents and could have a reasonable chance of (indirectly) affecting policy. Now you’re just one of millions, and so answering a pollster is probably not worth the time (See here for related arguments).

The recent proliferation of polls—whether for marketing or just to sell newspapers—exploits people’s civic-mindedness. Polling and polling and polling until all the potential respondents get tired—it’s like draining the aquifer to grow alfalfa in the desert.

As polling averages shifted towards Republicans in the closing weeks of the 2022 midterms, one interpretation was that Americans were reverting to the usual pattern of favoring out-party candidates. Other observers argued that voter intentions were not changing and that the shift was driven by the release of a disproportionate number of pro-Republican polls – an argument supported by the unexpectedly favorable results for Democratic candidates on Election Day.

They continue:

We are not alleging a conspiracy among Republican pollsters to influence campaign narratives. . . . Even so, our results raise new concerns about the use of polling averages to assess campaign dynamics. A shift from one week to another may reflect changes in underlying voter preferences but can also reflect differences in the types of polls used to construct polling averages. This concern is particularly true for sites that aggregate polls without controlling for house effects (pollster-specific corrections for systematic partisan lean). . . .

Our results are also salient for aggregators who use pollster house effects to adjust raw polling data. In theory, these corrections remove poll-specific partisan biases, allowing polling averages to be compared week-to-week, even given changes in the types of polls being released. However, in most cases, aggregators use black-box models to estimate and incorporate house effects, making it impossible to assess the viability of this strategy. . . .

There’s a statistical point here, too, which is that additive “house effects” can appropriately shift individual polls so that even biased polls can supply information, but “weighting” can never do this. You need to move the numbers, not just reweight them.
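A toy simulation can make this concrete (all numbers are invented; this is not an analysis of the actual 2022 polls). If a subset of pollsters share a common additive bias, downweighting their polls still leaves a biased average, while estimating and subtracting the house effect recovers the truth:

```python
import numpy as np

rng = np.random.default_rng(0)
true_margin = 1.0   # hypothetical true D-R margin, in points
n_polls = 1000

# Half the polls come from pollsters sharing a 3-point pro-R house effect.
biased = rng.random(n_polls) < 0.5
polls = true_margin + np.where(biased, -3.0, 0.0) + rng.normal(0, 1.5, n_polls)

# Reweighting: downweighting the biased polls shrinks, but never removes,
# their contribution to the bias.
downweighted = np.average(polls, weights=np.where(biased, 0.5, 1.0))

# Additive correction: estimate the offset of the biased pollsters relative
# to the others and shift their numbers before averaging.
offset = polls[biased].mean() - polls[~biased].mean()
adjusted = np.where(biased, polls - offset, polls)
corrected = adjusted.mean()

print(f"raw mean:        {polls.mean():.2f}")
print(f"downweighted:    {downweighted:.2f}")
print(f"house-corrected: {corrected:.2f}")
```

In this toy setup the corrected mean lands back near the true margin of 1.0, while any reweighting of the biased polls only moves the average partway. In practice house effects are estimated jointly across many pollsters rather than against a known-unbiased reference group, which is where the black-box concerns above come in.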

I agree with Aleks that these are excellent. The above image is just a screenshot; the links below are all live and interactive:

Covid-19 test calculator: How to interpret test results

Covid-19 lateral flow tests: Calculator for interpreting test results

Great stuff, and a model for risk communication going forward.

Epidemiologic data consistently show strong protection for young children against severe COVID-19 illness. . . . We identified 3,126,427 adults (24% [N = 743,814] with children ≤18, and 8.8% [N = 274,316] with youngest child 0–5 years) to assess whether parents of young children—who have high exposure to non-SARS-CoV-2 coronaviruses—may also benefit from potential cross-immunity. In a large, real-world population, exposure to young children was strongly associated with less severe COVID-19 illness, after balancing known COVID-19 risk factors. . .

My first thought was that parents are more careful than non-parents so they’re avoiding exposure entirely. But it’s not that: non-parents in the matched comparison had a lower rate of *infections* but a higher rate of *severe* cases; see Comparison 3 in Table 2 of the linked article.

One complicating factor is that they didn’t seem to have adjusted for whether the adults were vaccinated—that’s a big deal, right? But maybe not such an issue given that the study ended on 31 Jan 2021, and by then it seems that only 9% of Americans were vaccinated. It’s hard for me to know if this would be enough to explain the difference found in the article—for that it would be helpful to have the raw data, including the dates of these symptoms.

Are the data available? The article says, “This article contains supporting information online at http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2204141119/-/DCSupplemental”, but when I click on that link it just takes me to the main page of the article (https://www.pnas.org/doi/abs/10.1073/pnas.2204141119), so I don’t know whassup with that.

Here’s another thing. Given that the parents in the study were infected at a *higher* rate than the nonparents, it would seem that the results can’t simply be explained by parents being more careful. But could it be a measurement issue? Maybe parents were more likely to get themselves tested.
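Here’s a quick simulation of that measurement story (all rates are hypothetical). If mild infections are only recorded when someone happens to get tested, a group that tests more often will show both a higher detected infection rate and a lower share of severe cases, even when the true rates are identical in both groups:

```python
import numpy as np

n = 100_000          # adults per group (hypothetical)
p_infect = 0.10      # same true infection rate in both groups
p_severe = 0.05      # same true probability an infection is severe

def observed(p_test_mild, seed):
    """Detected infection rate and severe share, when severe cases are
    always detected but mild cases only show up if the person is tested."""
    rng = np.random.default_rng(seed)
    infected = rng.random(n) < p_infect
    severe = infected & (rng.random(n) < p_severe)
    detected = severe | (infected & (rng.random(n) < p_test_mild))
    return detected.mean(), severe.sum() / detected.sum()

parents = observed(p_test_mild=0.8, seed=1)      # tested more often
non_parents = observed(p_test_mild=0.4, seed=2)  # tested less often
print("parents:     rate %.3f, severe share %.3f" % parents)
print("non-parents: rate %.3f, severe share %.3f" % non_parents)
```

So the higher infection rate but lower severity share among parents is at least qualitatively consistent with a pure testing artifact; whether the magnitudes match would require the raw data.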

The article has a one-paragraph section on Limitations, but it does not consider any of the above issues.

I sent the above to Aleks, who added:

My thought is that the population of parents probably lives differently than non-parents: less urban (so less exposure), perhaps biologically healthier. They did match, but just doing matching doesn’t guarantee that enough of the relevant confounders have truly been handled.

This paper is a big deal (1) because it’s used to support herd immunity and mass infection; (2) because it is used to argue against vaccination; (3) because it doesn’t incorporate long COVID-19 which can be caused even by an asymptomatic infection.

For #3, it might be possible to model the impact, based on what we know about the likelihood of long-term issues, e.g. https://www.clinicalmicrobiologyandinfection.com/article/S1198-743X(22)00321-4/fulltext

Your point about the testing bias could be picked up by the number of symptomatic vs. asymptomatic cases, which would reveal a potential bias.

My only response here is that if the study ends on Jan 2021, I can’t see how it can be taken as an argument against vaccination. Even taking the numbers in Table 2 at face value, we’re talking about a risk reduction for severe COVID-19 from having kids of a factor of 1.5. Vaccines are much more effective than that, no? So even if having Grandpa sleep on the couch and be exposed to the grandchildren’s colds is a solution that works for your family, it’s not nearly as effective as getting the shot—and it’s a lot less convenient.

Aleks responds:


Looking at the Israeli age-stratified hospitalization dashboard, the hospitalization rates for unvaccinated 30-39-year-olds are almost 5x greater than for vaccinated & boosted ones. However, the hospitalization rates for the unvaccinated 80+ group are only about 30% higher.

Pierson writes:

In spite of the simplicity and ubiquity of the setup, people were very bad at this quiz—on every question, the majority answer was incorrect, and on the last two questions, performance was worse than random guessing. Of course, Twitter is a weird sample, but I also spoke to a number of genuine experts—professors of statistics, computer science, and economics at top universities—who found these questions unintuitive as well. I thought this might be of interest to your blog because you might be able to provide some useful intuition or diagnose why people’s intuitions lead them astray here.

I responded: I don’t quite understand your questions. From question 1, it looks like “a”, “noise”, and “p” are vectors; I’m picturing length 100 because that’s what I typically do in simulation experiments for class. But then in question 2, it looks like “a” is a scalar, but then the bias would be E(a_hat – a), not a_hat – a. And then the correlation you’re discussing is across multiple studies. In question 3, it again seems like “a” and “p” are vectors. I guess I understand questions 1 and 3 but not question 2. I agree that it’s disturbing that so many people get this wrong. I wonder whether the ambiguous specification is part of the problem. Or maybe I’m missing something?

Pierson responded:

Here’s a simulation of the setup I had in mind (hopefully a correct simulation—I don’t really use R these days, but I think you do):

```r
a = rnorm(500000)
noise = rnorm(500000)
p = a + noise
d = data.frame(a, p)
# question 1
print(lm(a ~ p, data=d))
print(lm(p ~ a, data=d))
# question 2
a_hat = predict(lm(a ~ p, data=d), d)
print(cor(a_hat - a, a))
print(cor(a_hat - a, a_hat))
# question 3
print(cor(p - a, a))
print(cor(p - a, p))
```

Always hard to know what people on Twitter are thinking, but I don’t think (based on my conversations with people) that misunderstandings explain the wrong answers? People seemed to understand the setup pretty clearly, and I also didn’t get any confused questions about it on Twitter.

To which I replied: Oh, you were defining a_hat as the point prediction: I didn’t get that part. I agree that these things confuse people. It reminds me of that other false intuition people have, that if a is correlated with b, and b is correlated with c, then a must be correlated with c. People also seem to have the intuition that a can be correlated with b, even while b is uncorrelated with a. And then there’s a whole literature in cognitive psychology on super-additive probabilities (Pr(A) + Pr(not-A) seems to be greater than 1) or sub-additive probabilities (Pr(A) + Pr(not-A) seems to be less than 1). Lots of confusion around probability. It’s a jungle out there.
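That first false intuition is easy to refute with a small simulated counterexample (mine, not from the exchange): two variables can each be strongly correlated with a third while being essentially uncorrelated with each other.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)

a = x        # a depends only on x
b = x + y    # b depends on both
c = y        # c depends only on y

print(round(np.corrcoef(a, b)[0, 1], 2))  # about 0.71
print(round(np.corrcoef(b, c)[0, 1], 2))  # about 0.71
print(round(np.corrcoef(a, c)[0, 1], 2))  # about 0.0
```

Here cor(a, b) and cor(b, c) are both near 1/sqrt(2), yet cor(a, c) is near zero: correlation is not transitive.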

This is one of the motivations for the work of Gigerenzer etc. on reframing problems in terms of natural frequencies so as to avoid the confusingness of probability.

A majority of Americans continue to support legal abortion in all or most cases. Civiqs has been conducting daily tracking polls on abortion since 2016, and has found consistent support for legal abortion throughout that time.

This support includes many voters who are personally against abortion. In a May 2023 Civiqs poll, 42% of Americans support the right to abortion both personally and as policy. Another 27% say they are personally against abortion but think it is a decision that the government should not be involved in. Just 26% of voters personally believe abortion is wrong and should be illegal.

The Supreme Court recently ruled that the abortion pill mifepristone should remain available. A large majority of Americans (62%) support this decision, with only 25% opposed and 12% unsure. Both women (63%) and men (61%) support the Supreme Court’s decision.

More broadly, three-quarters of Americans (73%) believe that mifepristone should be legal (54%) or legal under certain circumstances (19%) in their state. This includes almost all Democrats (93%) and a substantial majority of Independents (73%). Half of Republicans agree, with 51% saying the pill should be legal (17%) or legal under certain circumstances (34%).

**P.S.** Previous pendulum discussions are here (2021) and here (2022). We’ll have to see if Clarence Thomas is still in the news next November to remind voters who’s in control of that third branch of government. I think the pendulum thing is less of a big deal in presidential election years, though.

I would not have the patience to go even 5 minutes into these models with the coefficients and arrows, as I think they’re close to hopeless even in the best of settings and beyond hopeless for observational data, nor do I want to think too hard about terms such as “two-way correlation,” a phrase which I hope never to see again!

I agree with Weissman on these points:

1. It is good for journals to publish critiques, and I don’t think that critiques should be held to higher standards than the publications they are critiquing.

2. I think that journals are too focused on “novel contributions” and not enough on learning from mistakes.

3. Being charitable toward others is fine, all else equal, but not so fine if this is used as a reason for researchers, or an entire field, to avoid confronting the mistakes they have made or the mistakes they have endorsed. Here’s something I wrote in praise of negativity.

4. Often these disputes are presented as if the most important parties are the authors of the original paper, the journal editor, and the author of the letter or correction note. But that’s too narrow a perspective. The most important parties are not involved in the discussion at all: these are the readers of the articles—those who will take its claims and apply them to policy or to further research—and all the future students who may be affected by these policies. Often it seems that the goal is to minimize any negative career impact on the authors of the original paper and to minimize any inconvenience to the journal editors. I think that’s the wrong utility function, and to ignore the future impacts of uncorrected mistakes is implicitly an insult to the entire field. If the journal editors think the work they publish has value—not just in providing chits that help scholars get promotions and publicity, but in the world outside the authors of these articles—then correcting errors and learning from mistakes should be a central part of their mission.

I hope Weissman’s efforts in this area have some effect in the physics education community.

As a statistics educator, I’ve been very impressed by the innovation shown by physics educators (for example, the ideas of peer instruction and just-in-time teaching, which I use in my classes), so I hope they can do better in this dimension of evaluating evidence of effectiveness.

9:30 Presentation of the program

9:40 Andrew Gelman (Columbia University)

10:40 Sebastian Weber (Novartis) – Supporting phase I dose-escalation trials in Oncology

11:00 Tea/coffee break

11:30 Aki Vehtari (Aalto University) – Bayesian Workflow

12:30 Lunch break

14:00 Maxime Beaulieu (INSERM) – Hierarchical Nonlinear Joint Modelling in Oncology – a simulation study

14:20 Stanislas du Ché (Univ Paris Orsay) – Parallelization for Markov chain Monte Carlo with Heterogeneous Runtimes

14:40 Julie Bertrand (INSERM) – Standard Errors at finite distance in Non linear Mixed Effect models

15:00 Céline Brochot (Certara) – Stan for Bioequivalence studies

15:20 Tea/coffee break

15:50 François Mercier and Daniel Sabanes-Bove (Roche) – jmpost: An R package for Bayesian joint tumor growth inhibition and overall survival models using Stan

16:10 Charles Margossian (Flatiron Institute) – Making Bayesian pharmacometrics modeling simpler (but not too simple) with Torsten

16:30 Day wrap-up

If you’ll be in town, you can sign up!

I’m not sure what I’ll speak on, given that Aki seems to have already taken the Bayesian Workflow topic. But I’ll think of something!

Palko: Sounds like all the unicorns. The venture capital model breeds these things.

Me: Unicorns aren’t real, right?

Palko: Unicorns are mythical beasts and those who invest in them are boobies.

Me: Something something longer than something something stay solvent.

Palko: That’s good advice for short sellers, but it’s good to remember the corollary: the market can become rational faster than you can get out.

Good point.

But journalists . . . not so much. I get it that Walter Cronkite or whoever doesn’t want randos emailing him at all hours of the night, but even more obscure journalists don’t seem to want to make it easy to contact them.

Why is that? I’d think that if you’re a reporter you’d want people to be able to reach you as directly as possible, no?

There are some exceptions such as Stephanie Lee, who gives her direct contact right here, but she’s an unusual sort of journalist who’s interested in afflicting the comfortable (and sometimes the public relations apparatus of the comfortable fights back). What about all the other journalists out there? Why do they keep their emails secret?

I’m not saying this is a conspiracy; I’m sure there’s a simple answer. I just don’t know it.

At the Technical University of Dortmund, Germany, I am currently looking for a PhD Student or PostDoc to work with me on simulation-based Bayesian inference research in the context of our BayesFlow framework.

**BayesFlow** is a Python library for efficient simulation-based Bayesian Inference. It enables users to create specialized neural networks for amortized Bayesian inference, which repays users with rapid statistical inference after a potentially longer simulation-based training phase. A cornerstone idea of amortized Bayesian inference is to employ generative neural networks for parameter estimation, model comparison, and model validation when working with intractable simulators whose behavior as a whole is too complex to be described analytically.

Both the BayesFlow library itself and its community are quickly growing. Our goal is to make it the gold-standard simulation-based inference library within the next couple of years.

For more details about the position, please see Paul Bürkner – Open Positions

I am looking forward to your applications!

Paul

Matthieu Authier writes:

Here is a simulation study using regularized regression with post-stratification to estimate dolphin bycatch from non-representative samples. The Stan code is accessible here.

We’ve also used RRP on a case study with samples from FR, where we know that independent observers are preferentially allowed on boats when dolphin bycatch is low (a report is being written at the moment on that, and it will be the second part of the dead dolphin duology for RRP). RRP is giving more plausible estimates in this case.

For those not familiar with the jargon, “bycatch” is “the non-intentional capture or killing of non-target species in commercial or recreational fisheries.”

It’s great to see MRP and Stan being used for all sorts of real problems.

**P.S.** Authier has an update:

The article has been published and vastly improved thanks to peer review. The simulation study is here and the case study is here.

Stan was instrumental in both cases to be able to fit the models.

The model we developed is now used to update estimates of bycatch in the Bay of Biscay.

I’ve been thinking about the problem of setting an appropriate prior when conducting Bayesian analysis. Do you know if anyone has done any work to quantify how much prior-belief-ness undertaking a study to test a particular hypothesis might represent? Part of me wants to say that given all of the effort to design a study, get IRB approval and collect data, there has to be some lower bound on one’s prior before doing the data analysis… because if it were any lower, presumably one would undertake a different study. At the very least, one imagines that the researcher’s prior is greater than 0. I wonder if cognitive psychological researchers studying hypothesis testing could estimate how much prior belief people need, on average, to take action on their beliefs.

My reply: It’s hard to say. For example, a researcher’s prior belief in the efficacy of a proposed new treatment might be low, but if it could benefit millions of people, then it could be worth studying because of the high potential benefit. Even a treatment with negative expected value can be worth studying: if there is a small probability that it has a consistent positive benefit, then you do the experiment to see whether to proceed further. This is basic decision-analytic reasoning. Conversely, if the prior is that the treatment is probably beneficial, it can still be a good idea to do the experiment, just in case it actually has a negative effect, in which case you’d learn that and know not to proceed.
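This logic is easy to make concrete with a stylized value-of-information calculation (all numbers invented): even a treatment with a low prior probability of working can justify a study, if the payoff is large and the study is cheap.

```python
# Stylized value-of-information calculation; all numbers are invented.
p_works = 0.05          # prior probability the treatment is effective
benefit = 1000.0        # payoff (arbitrary units) if it works and we deploy it
cost_experiment = 10.0  # cost of running the study

# Without the experiment we never deploy: expected value 0.
# With a (perfectly informative, for simplicity) experiment, we deploy
# only if the treatment turns out to work.
ev_experiment = p_works * benefit - cost_experiment
print(f"{ev_experiment:.1f}")  # prints 40.0
```

The same accounting run in reverse covers the opposite case: when the prior favors the treatment, the experiment’s value comes from the small probability of learning not to proceed.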

All that is not even considering issues such as cognitive biases, financial and career incentives, and all sorts of other reasons why we would expect researchers’ priors to be wrong.

Regarding your original question: Sometimes I call this sort of thing “anthropic reasoning” by analogy to the anthropic principle in physics, whereby we can derive some properties of our world, given the information that we exist in it.

Here’s an example from a few years ago where I used anthropic reasoning to answer the question, Should we take measurements at an intermediate design point? I love that paper, and I remain bummed that it’s only been cited 3 times.

There’s no fancy statistical work here, and nothing about causal inference – my coauthor Cecilie Wathne and I were just trying to figure out where some of these oft-cited numbers come from… and we found that they’re often either completely made up, or based on what can be most generously described as gut feelings expressed in quantitative form.

From the paper:

We analysed ten global corruption statistics, attempting to trace each back to its origin and to assess its credibility and reliability. These statistics concern the amount of bribes paid worldwide, the amount of public funds stolen/embezzled, the costs of corruption to the global economy, and the percentage of development aid lost to corruption, among other things. Of the ten statistics we assessed, none could be classified as credible, and only two came close to credibility. Six of the ten statistics are problematic, and the other four appear to be entirely unfounded. . . .

First, using a combination of keyword searches and snowballing, we identified 71 potentially relevant quantitative statistics from a range of sources. . . . we narrowed our original list of 71 statistics to the following ten, which are the focus of our analysis:

1. Approximately US$1 trillion in bribes is paid worldwide every year.

2. Approximately US$2.6 trillion in public funds is stolen/embezzled every year.

3. Corruption costs the global economy approximately US$2.6 trillion, or 5% of global GDP, each year.

4. Corruption, together with tax evasion and illicit financial flows, costs developing countries approximately US$1.26 trillion each year.

5. Approximately 10%–25% of government procurement spending is lost to corruption each year.

6. Approximately 10%–30% of the value of publicly funded infrastructure is lost to corruption each year.

7. Approximately 20%–40% of spending in the water sector is lost to corruption each year.

8. Up to 30% of development aid is lost to fraud and corruption each year.

9. Customs-related corruption costs World Customs Organization members at least US$2 billion per year.

10. Approximately 1.6% of annual deaths of children under 5 years of age (over 140,000 deaths per year) are due in part to corruption.

We attempted to trace each of these statistics back to its original source. . . .

This is amazing. The corruption literature seems to have “Sleep Diplomat”-level problems.

This makes me think it could be useful to have an article collecting a bunch of these made-up or bogus statistics. So far we’ve got:

– The claim that North Korea is more democratic than North Carolina

– The Human Development Index

– The supposed “smallish town” where 75 people a week were supposedly dying because of the lack of information flow between the hospital’s emergency room and the nearby mental health clinic

– The above corrupt corruption statistics.

We could collect a bunch more, no?

Here I’m restricting ourselves to simple numbers, not even getting into bad statistical analyses, causal confusions, selection bias, etc.

**P.S.** Following up on Stephenson’s above-linked report, Ray Fisman and I wrote an article with him for The Atlantic which conveyed the general point. That article was fun to write, and I hope it reached some readers who otherwise wouldn’t have thought about the issue, but I’d still like to compile a broader list, as discussed in the above post.

Lately I have been thinking a bit about how useful it is in practice, like when predictions are available to someone making a decision. E.g., if the decision maker is presented with a prediction set rather than just the single maximum likelihood label, in what ways might this change their decision process? It’s also interesting to think about how you get people to understand the differences between a model-agnostic versus a model-dependent prediction set or uncertainty interval, and how use of them should change.
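For readers who haven’t seen the method, the split conformal recipe behind such prediction sets takes only a few lines. Here’s a toy regression version of my own (not from any of the papers discussed), in which the point predictor is deliberately crude and the marginal coverage guarantee still holds:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    # Toy data: y is a cubic function of x plus noise.
    x = rng.uniform(-2, 2, n)
    y = x**3 + rng.normal(0, 1, n)
    return x, y

def model(x):
    return 3 * x  # a deliberately crude point predictor

# Split conformal: pick the interval half-width from held-out residuals.
alpha = 0.1
x_cal, y_cal = make_data(2000)
scores = np.abs(y_cal - model(x_cal))
level = np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores)
q = np.quantile(scores, level)

# [model(x) - q, model(x) + q] covers a new exchangeable point with
# probability >= 90%, no matter how bad the model is.
x_new, y_new = make_data(5000)
coverage = np.mean(np.abs(y_new - model(x_new)) <= q)
print(f"empirical coverage: {coverage:.3f}")
```

The price of using a bad model is not lost coverage but wide intervals: the guarantee is marginal over the whole distribution, which is exactly why the i.i.d.-domains assumption mentioned above matters so much.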

But beyond the human facing aspect, there are some more direct applications of conformal prediction to improve inference tasks. One uses what is essentially conformal prediction to estimate the transfer performance of an ML model trained on one domain when you apply it to a new domain. It’s a useful idea if you’re ok with assuming that the domains have been drawn i.i.d. from some unknown meta-distribution, which seems hard in practice.

Another recent idea coming from Angelopoulos, Bates, Fannjiang, Jordan, and Zrnic (the first two of whom have created a bunch of useful materials explaining conformal prediction) is in the same spirit as conformal, in that the goal is to use labeled data to “fix” predictions from a model in order to improve upon some classical estimate of uncertainty in an inference.

What they call prediction-powered inference is a variation on semi-supervised learning that starts by assuming that you want to estimate some parameter value theta*, and you have some labeled data of size n, a much larger set of unlabeled data of size N >> n, and access to a predictive model that you can apply to the unlabeled data. The predictive model is arbitrary in that it might be fit to some other data than the labeled and unlabeled data you want to use to do inference. The idea is then to first construct an estimate of the error in the predictions of theta* from the model on the unlabeled data. This is called a rectifier since it rectifies the predicted parameter value you would get if we were to treat the model predictions on the unlabeled data as the true/gold standard values in order to recover theta*. Then, you use the labeled data to construct a confidence set estimating your uncertainty about the rectifier. Finally, you use that confidence set to create a provably valid confidence set for theta* which adjusts for the prediction error.

You can compare this kind of approach to the case where you just construct your confidence set using only the labeled observations, resulting in a wide interval, or where you do inference on the combination of labeled and unlabeled data by assuming the model-predicted labels for the unlabeled data are correct, which gets you tighter uncertainty intervals but which may not contain the true parameter value. To give intuition for how prediction-powered inference differs, the authors start with an example of mean estimation, where your prediction-powered estimate decomposes into your average prediction for the unlabeled data, minus the average error in predictions on the labeled data. If the model is accurate, the second term is 0, so you end up with an estimate on the unlabeled data which has much lower variance than your classical estimate (since N >> n). Relative to existing work on estimation with a combination of labeled and unlabeled data, prediction-powered inference assumes that most of the data is unlabeled, and considers cases where the model is trained on separate data, which allows for generalizing the approach to any estimator which is minimizing some convex objective and avoids making assumptions about the model.
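The mean-estimation case is simple enough to sketch directly. Here’s a toy version of my own (the “model” is just a biased stand-in that fakes predictions from the labels, not a real fitted predictor), showing the naive estimate inheriting the model’s bias while the rectified, prediction-powered estimate does not:

```python
import numpy as np

rng = np.random.default_rng(4)
true_mean = 2.0
n, N = 100, 100_000   # small labeled set, much larger unlabeled set

y_lab = rng.normal(true_mean, 1, n)
y_unlab = rng.normal(true_mean, 1, N)

def predict(y):
    # Stand-in for an external model: accurate up to a constant bias.
    # (A real model maps features to labels; here we fake its output
    # directly so the error structure is easy to see.)
    return y + 0.5 + rng.normal(0, 0.1, len(y))

f_lab, f_unlab = predict(y_lab), predict(y_unlab)

classical = y_lab.mean()            # labeled data only: unbiased but noisy
naive = f_unlab.mean()              # trusts the model: precise but biased
rectifier = (f_lab - y_lab).mean()  # estimated prediction error
ppi = f_unlab.mean() - rectifier    # prediction-powered estimate

print(f"classical {classical:.3f}, naive {naive:.3f}, PPI {ppi:.3f}")
```

Because the model’s errors have low variance here, the rectifier is estimated precisely from only n = 100 labeled points, and the prediction-powered estimate ends up both unbiased and much less variable than the classical one.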

Here’s a figure illustrating this process (which is rather beautiful I think, at least by computer science standards):

They apply the approach to a number of examples to create confidence intervals for e.g., the proportion of people voting for each of two candidates in a San Francisco election (using a computer vision model trained on images of ballots), predicting intrinsically disordered regions of protein structures (using AlphaFold), estimating the effects of age and sex on income from census data, etc.

They also provide an extension to cases where there is distribution shift, in the form of the proportion of classes in the labeled data being different from that in the unlabeled data. I appreciate this, as one of my pet peeves with much of the ML uncertainty estimation work happening these days is how comfortably people seem to be using the term “distribution-free,” rather than something like non-parametric, even though the default assumption is that the (unknown) distribution doesn’t change. Of course the distribution matters; using labels that imply we don’t care at all about it feels kind of like implying that there is in fact the possibility of a free lunch.

]]>In the spring of 1970, some years after our Stanford graduations, we talked one evening outside the statistics department at Stanford and decided to write a paper together. What should it be about? Brad [Efron] suggested, “Let’s work on Stein’s estimator.” Because so few understood it back then, and because we both admired Charles Stein so much for his genius and his humanity, we chose this topic, hoping we could honor him by showing that his estimator could work well with real data.

Stein already had proved remarkable theorems about the dominance of his shrinkage estimators over the sample mean vector, but there also needed to be a really convincing applied example. For that, we chose baseball batting average data because we not only could use the batting averages of the players early in the season, but because we also later could observe how those batters fared for the season’s remainder—a much longer period of time.

What struck me about this quote was that there was such a long delay between the theoretical work and “a really convincing applied example.” Also, the “applied example” was only kind of applied. Yeah, sure, it was real data, and it addressed a real problem—assessing performance based on noisy information—but it was what might best be called a *stylized* data example.
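For readers who haven't seen it, the estimator they were demonstrating can be sketched in a few lines: shrink each player's early-season average toward the grand mean by a data-determined factor. This is a stylized sketch with made-up numbers, not the actual 1975 data, and it skips the variance-stabilizing transformation Efron and Morris actually used:

```python
import numpy as np

def james_stein(y, sigma2):
    """Shrink each of k independent estimates toward their grand mean.

    y: vector of k estimates (k >= 4), each roughly ~ N(theta_i, sigma2)
    sigma2: known common sampling variance
    """
    y = np.asarray(y, dtype=float)
    k = len(y)
    grand_mean = y.mean()
    s = ((y - grand_mean) ** 2).sum()
    shrink = max(0.0, 1.0 - (k - 3) * sigma2 / s)  # positive-part variant
    return grand_mean + shrink * (y - grand_mean)

# Hypothetical early-season batting averages, each over 45 at-bats.
avgs = np.array([0.400, 0.378, 0.356, 0.333, 0.311,
                 0.289, 0.267, 0.244, 0.222, 0.200])
p = avgs.mean()
sigma2 = p * (1 - p) / 45  # binomial sampling variance, roughly constant
shrunk = james_stein(avgs, sigma2)
```

When the observed spread is barely more than sampling noise, the shrinkage factor approaches zero and every player is pulled nearly all the way to the grand mean.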

Don’t get me wrong; I think stylized data examples are great. Here are some other instances of stylized data examples in statistics:

– The 8 schools

– The Minnesota radon survey

– The Bangladesh arsenic survey

– Forecasting the 1992 presidential election

– The speed-of-light measurements.

What do these and many other examples have in common, besides the fact that my colleagues and I used them to demonstrate methods in our books?

They are all real data, they are all related to real applied problems (in education research, environmental hazards, political science, and physics) and real statistical problems (estimating causal effects, small-area estimation, decision making under uncertainty, hierarchical forecasting, model checking), and they’re all kind of artificial, typically using only a small amount of the relevant information for the problem at hand.

Still, I’ve found stylized data examples to be very helpful, perhaps for similar reasons as Efron and Morris:

1. The realness of the problem helps sustain our intuition and also gives a sense of real progress being made by new methods, in a way that is more understandable and convincing than, say, a reduction in mean squared error.

2. The data are real and so we can be surprised sometimes! This is related to the idea of good stories being immutable.

Indeed, sometimes researchers demonstrate their methods with stylized data examples and the result is *not* convincing. Here’s an example from a few years ago, where a colleague and I expressed skepticism about a certain method that had been demonstrated on two social-science examples. I was bothered by both examples, and indeed my problems with these examples gave me more understanding as to why I didn’t like the method. So the stylized data examples were useful here too, even if not the way the original author intended.

In section 2 of this article from 2014 I discussed different “ways of knowing” in statistics:

How do we decide to believe in the effectiveness of a statistical method? Here are a few potential sources of evidence (I leave the list unnumbered so as not to imply any order of priority):

• Mathematical theory (e.g., coherence of inference or convergence)

• Computer simulations (e.g., demonstrating approximate coverage of interval estimates under some range of deviations from an assumed model)

• Solutions to toy problems (e.g., comparing the partial pooling estimate for the eight schools to the no pooling or complete pooling estimates)

• Improved performance on benchmark problems (e.g., getting better predictions for the Boston Housing Data)

• Cross-validation and external validation of predictions

• Success as recognized in a field of application (e.g., our estimates of the incumbency advantage in congressional elections)

• Success in the marketplace (under the theory that if people are willing to pay for something, it is likely to have something to offer)

None of these is enough on its own. Theory and simulations are only as good as their assumptions; results from toy problems and benchmarks don’t necessarily generalize to applications of interest; cross-validation and external validation can work for some sorts of predictions but not others; and subject-matter experts and paying customers can be fooled.

The very imperfections of each of these sorts of evidence gives a clue as to why it makes sense to care about all of them. We can’t know for sure so it makes sense to have many ways of knowing. . . .

For more thoughts on this topic, see this follow-up paper with Keith O’Rourke.

In the above list of bullet points I described the 8 schools as a “toy problem,” but now I’m more inclined to call it a stylized data example. “Toy” isn’t quite right; these are data from real students in real schools!
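Since the 8 schools keep coming up: the comparison mentioned in that bullet point is easy to sketch. Using the published estimates and standard errors (Rubin, 1981), the partial-pooling estimate for each school, conditional on a between-school sd tau, is a precision-weighted compromise between no pooling and complete pooling (the value tau = 8 below is just an illustrative choice, not an estimate):

```python
import numpy as np

# Estimated treatment effects and standard errors for the eight schools (Rubin 1981).
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

def partial_pooling(y, sigma, tau):
    """Conditional posterior means of the school effects, given between-school sd tau."""
    w = 1.0 / (sigma**2 + tau**2)
    mu_hat = (w * y).sum() / w.sum()       # precision-weighted overall mean
    lam = tau**2 / (tau**2 + sigma**2)     # per-school weight on its own data
    return lam * y + (1 - lam) * mu_hat

no_pool = partial_pooling(y, sigma, tau=1e6)   # tau -> inf: each school on its own
complete = partial_pooling(y, sigma, tau=0.0)  # tau = 0: everyone pulled to mu_hat
partial = partial_pooling(y, sigma, tau=8.0)   # an intermediate value of tau
```

As tau runs from 0 to infinity, each school's estimate traces a path from the pooled mean out to its raw value, so School A's famous 28 gets pulled down to something far more plausible at moderate tau.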

Let me also distinguish stylized data examples from numerical illustrations of a method that happen to use real data. Introductory statistics books are full of examples like that. You’re at the chapter on the t test or whatever and they demonstrate it with data from some experiment in the literature. “Real data,” yes, but not really a “real example” in that there’s no engagement with the applied context; the data are just there to show how the method works. In contrast, the intro stat book by Llaudet and Imai uses what I’d call real examples. Still with the edges smoothed, but what I’d call legit stylized data examples.

It’s my impression that the use of stylized data examples has been standard in statistics research and education for a while. Not always, but often enough that it’s not a surprise to see them. The remark by Carl Morris in that interview makes me think that this represents a change, that things were different 50 years ago and before.

And I guess that’s right—there really has been a change. When I think of the first statistics course I took in college, the data were all either completely fake or they were numerical illustrations. And even Tukey’s classic EDA book from 1977 is full of who-cares examples like the monthly temperature series in Yuma, Arizona. At that point, Tukey had decades of experience with real problems and real data in all sorts of application areas—yet when writing one of his most enduring works, he went with the fake? Why? I think because that’s how it was done back then. You have your theory, you have your methods, and the point of the methods research article or book is to show how to do it, full stop. In the tradition of Snedecor and Cochran’s classic book on statistical methods. Different methods but the same general approach. But something changed, and maybe the 1970s was the pivotal period. Maybe the Steve Stiglers of the future can figure this one out.

– From one direction, a data analysis workflow typically involves many models (explicit and implicit).

– From the other direction, even the apparently simple task of fitting just one model can involve an elaborate workflow including experimentation. Statistical estimation as a science problem.

This reminds me of some ideas from psychology regarding brain processes, the executive function, and group dynamics.

When Carl Morris came to our department in 1989, I and my fellow students were so excited. We all took his class. The funny thing is, though, the late 1980s might well have been the worst time to be Carl Morris, from the standpoint of what was being done in statistics at that time—not just at Harvard, but in the field in general. Carl has made great contributions to statistical theory and practice, developing ideas which have become particularly important in statistics in the last two decades. In 1989, though, Carl’s research was not in the mainstream of statistics, or even of Bayesian statistics.

When Carl arrived to teach us at Harvard, he was both a throwback and ahead of his time.

Let me explain. Two central aspects of Carl’s research are the choice of probability distribution for hierarchical models, and frequency evaluations in hierarchical settings where both Bayesian calibration (conditional on inferences) and classical bias and variance (conditional on unknown parameter values) are relevant. In Carl’s terms, these are “NEF-QVF” and “empirical Bayes.” My point is: both of these areas were hot at the beginning of Carl’s career and they are hot now, but somewhere in the 1980s they languished.

In the wake of Charles Stein’s work on admissibility in the late 1950s there was an interest, first theoretical but with clear practical motivations, to produce lower-risk estimates, to get the benefits of partial pooling while maintaining good statistical properties conditional on the true parameter values, to produce the Bayesian omelet without cracking the eggs, so to speak. In this work, the functional form of the hierarchical distribution plays an important role—and in a different way than had been considered in statistics up to that point. In classical distribution theory, distributions are typically motivated by convolution properties (for example, the sum of two gamma distributions with a common shape parameter is itself gamma), or by stable laws such as the central limit theorem, or by some combination or transformation of existing distributions. But in Carl’s work, the choice of distribution for a hierarchical model can be motivated based on the properties of the resulting partially pooled estimates. In this way, Carl’s ideas are truly non-Bayesian because he is considering the distribution of the parameters in a hierarchical model not as a representation of prior belief about the set of unknowns, and not as a model for a population of parameters, but as a device to obtain good estimates.

So, using a Bayesian structure to get good classical estimates. Or, Carl might say, using classical principles to get better Bayesian estimates. I don’t know that they used the term “robust” in the 1950s and 1960s, but that’s how we could think of it now.

The interesting thing is, if we take Carl’s work seriously (and we should), we now have two principles for choosing a hierarchical model. In the absence of prior information about the functional form of the distribution of group-level parameters, and in the absence of prior information about the values of the hyperparameters that would underlie such a model, we should use some form with good statistical properties. On the other hand, if we do have good prior information, we should of course use it—even R. A. Fisher accepted Bayesian methods in those settings where the prior distribution is known.

But, then, what do we do in those cases in between—the sorts of problems that arose in Carl’s applied work in health policy and other areas? I learned from Carl to use our prior information to structure the model, for example to pick regression coefficients, to decide which groups to pool together, to decide which parameters to model as varying, and then use robust hierarchical modeling to handle the remaining, unexplained variation. This general strategy wasn’t always so clear in the theoretical papers on empirical Bayes, but it came through in Carl’s applied work, as well as that of Art Dempster, Don Rubin, and others, much of which flowered in the late 1970s—not coincidentally, a few years after Carl’s classic articles with Brad Efron that put hierarchical modeling on a firm foundation that connected with the edifice of theoretical statistics, gradually transforming these ideas from a parlor trick into a way of life.

In a famous paper, Efron and Morris wrote of “Stein’s paradox in statistics,” but as a wise man once said, once something is understood, it is no longer a paradox. In un-paradoxing shrinkage estimation, Efron and Morris finished the job that Gauss, Laplace, and Galton had begun.

So far, so good. We’ve hit the 1950s, the 1960s, and the 1970s. But what happened next? Why do I say that, as of 1989, Carl’s work was “out of time”? The simplest answer would be that these ideas were a victim of their own success: once understood, no longer mysterious. But it was more than that. Carl’s specific research contribution was not just hierarchical modeling but the particular intricacies involved in the combination of data distribution and group-level model. His advice was not simply “do Bayes” or even “do empirical Bayes” but rather had to do with a subtle examination of this interaction. And, in the late 1980s and early 1990s, there wasn’t so much interest in this in the field of statistics. On one side, the anti-Bayesians were still riding high in their rejection of all things prior, even in some quarters a rejection of probability modeling itself. On the other side, a growing number of Bayesians—inspired by applied successes in fields as diverse as psychometrics, pharmacology, and political science—were content to just fit models and not worry about their statistical properties.

Similarly with empirical Bayes, a term which in the hands of Efron and Morris represented a careful, even precarious, theoretical structure intended to capture classical statistical criteria in a setting where the classical ideas did not quite apply, a setting that mixed estimation and prediction—but which had devolved to typically just be shorthand for “Bayesian inference, plugging in point estimates for the hyperparameters.” In an era where the purveyors of classical theory didn’t care to wrestle with the complexities of empirical Bayes, and where Bayesians had built the modeling and technical infrastructure needed to fit full Bayesian inference, hyperpriors and all, there was not much of a market for Carl’s hybrid ideas.

This is why I say that, at the time Carl Morris came to Harvard, his work was honored and recognized as pathbreaking, but his actual research agenda was outside the mainstream.

As noted above, though, I think things have changed. The first clue—although it was not at all clear to me at the time—was Trevor Hastie and Rob Tibshirani’s lasso regression, which was developed in the early 1990s and which has of course become increasingly popular in statistics, machine learning, and all sorts of applications. Lasso is important to me partly as the place where Bayesian ideas of shrinkage or partial pooling entered what might be called the Stanford school of statistics. But for the present discussion what is most relevant is the centrality of the functional form. The point of lasso is not just partial pooling; it’s partial pooling with a double-exponential (Laplace) prior. As I said, I did not notice the connection with Carl’s work and other Stein-inspired work back when lasso was introduced—at that time, much was made of the shrinkage of certain coefficients all the way to zero, which indeed is important (especially in practical problems with large numbers of predictors), but my point here is that the ideas of the late 1950s and early 1960s again became relevant. It’s not enough just to say you’re partial pooling—it matters _how_ this is being done.

In recent years there’s been a flood of research on prior distributions for hierarchical models, for example the work by Nick Polson and others on the horseshoe distribution, and the issues raised by Carl in his classic work are all returning. I can illustrate with a story from my own work. A few years ago some colleagues and I published a paper on penalized marginal maximum likelihood estimation for hierarchical models using, for the group-level variance, a gamma prior with shape parameter 2, which has the pleasant feature of keeping the point estimate off of zero while allowing it to be arbitrarily close to zero if demanded by the data (a pair of properties that is not satisfied by the uniform, lognormal, or inverse-gamma distributions, all of which had been proposed as classes of priors for this model). I was (and am) proud of this result, and I linked it to the increasingly popular idea of weakly informative priors. After talking with Carl, I learned that these ideas were not new; indeed, they were closely related to the questions that Carl has been wrestling with for decades in his research, as they relate both to the technical issue of the combination of prior and data distributions, and to the larger concerns about default Bayesian (or Bayesian-like) inferences.
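The boundary-avoiding behavior described here can be sketched numerically. In the limiting case, a gamma prior with shape 2 contributes log(tau) to the marginal log-likelihood, which keeps the mode strictly positive. A rough grid-search illustration on the 8-schools data, with the overall mean profiled out rather than integrated (so this is an approximation to, not a reproduction of, the published method):

```python
import numpy as np

# Eight schools: estimated effects and standard errors (Rubin 1981).
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

def profile_loglik(tau):
    """Log-likelihood of the between-group sd tau, with mu profiled out."""
    v = sigma**2 + tau**2
    mu_hat = (y / v).sum() / (1.0 / v).sum()
    return -0.5 * (np.log(v) + (y - mu_hat) ** 2 / v).sum()

taus = np.linspace(1e-3, 30, 3000)
ll = np.array([profile_loglik(t) for t in taus])

tau_mle = taus[np.argmax(ll)]                 # plain maximum likelihood
tau_pen = taus[np.argmax(ll + np.log(taus))]  # gamma(2) penalty adds log(tau)
```

On these data the unpenalized profile likelihood is maximized at the boundary, tau near 0, while the penalized estimate lands at a clearly positive value; and because log(tau) is a gentle penalty, the estimate can still slide toward zero if the data insist.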

In short: in the late 1980s, it was enough to be Bayesian. Or, perhaps I should say, Bayesian data analysis was in its artisanal period, and we tended to be blissfully ignorant about the dependence of our inferences on subtleties of the functional forms of our models. Or, to put a more positive spin on things: when our inferences didn’t make sense, we changed our models, hence the methods we used (in concert with the prior information implicitly encoded in that innocent-sounding phrase, “make sense”) had better statistical properties than one would think based on theoretical analysis alone. Real-world inferences can be superefficient, as Xiao-Li Meng might say, because they make use of tacit knowledge.

In recent years, however, Bayesian methods (or, more generally, regularization, thus including lasso and other methods that are only partly in the Bayesian fold) have become routine, to the extent that we need to think of them as defaults, which means we need to be concerned about . . . their frequency properties. Hence the re-emergence of truly empirical Bayesian ideas such as weakly informative priors, and the re-emergence of research on the systematic properties of inferences based on different classes of priors or regularization. Again, this all represents a big step beyond the traditional classification of distributions: in the robust or empirical Bayesian perspective, the relevant properties of a prior distribution depend crucially on the data model to which it is linked.

So, over 25 years after taking Carl’s class, I’m continuing to see the centrality of his work to modern statistics: ideas from the early 1960s that were in many ways ahead of their time.

Let me conclude with the observation that Carl seemed to us to be a “man out of time” on the personal level as well. In 1989 he seemed ageless to us both physically and in his personal qualities, and indeed I still view him that way. When he came to Harvard he was not young (I suppose he was about the same age as I am now!) but he had, as the saying goes, the enthusiasm of youth, which indeed continues to stay with him. At the same time, he has always been even-tempered, and I expect that, in his youth, people remarked upon his maturity. It has been nearly fifty years since Carl completed his education, and his ideas remain fresh, and I continue to enjoy his warmth, humor, and insights.

Here is a video from his retirement event.

From the proceedings of the December 4-6, 1962, fall joint computer conference, two researchers from General Electric Company’s Missile and Space Division write:

In general, there are two distinct modes of simulation; mathematical and physical. Mathematical simulation utilizes a mathematical model of the physical system under study. . . .

Physical simulation requires the excitation of the system under conditions which are representative of those encountered in actual system operation. This testing can involve anything from an inclined plane to large multi-million dollar ventures like the Space Environmental Simulator located at General Electric’s Valley Forge, Penna., Space Technology Center. These two types of simulation can be combined by mating physical hardware with a mathematical model. The general purpose computers available today are primarily designed for mathematical simulation. . . .

An electronic analog computer is an array of computational building blocks, or modules, each being able to perform a particular mathematical operation on an input voltage signal and provide a specific output response. These building blocks normally provide the functions of summation, integration with respect to time, multiplication by a constant, multiplication and division of variables, function generation, generation of trigonometric functions, and representation of system discontinuities. All quantities are represented on the analog by continuously varying voltages, restricted on almost all analog computers to the range between -100 and +100 volts. . . .

Data are fed into the analog computer in the form of parameter settings, which are usually associated with the coefficients that exist in the mathematical equations. Data are extracted from the computer in the form of voltages, either as steady-state values which can be read out on a voltmeter, or as varying values which can be recorded on a strip chart recorder or a plotting table. Some of the analog characteristics pertinent to our discussion are:

1. The analog is a parallel machine. All the variables are computed simultaneously and continuously. Thus, the speed with which the calculations are made is completely independent of the size or complexity of the problem.

2. The bigger a problem is, the more equipment is needed, as each piece of equipment works on one part of the problem.

3. Numbers on the analog are fixed point. Every variable must be scaled. The scaling will greatly affect the accuracy of the results.

4. The analog is best suited for solving systems of ordinary linear differential equations, although it can handle many other types of problem in a very satisfactory way.

5. There is no such thing as a computational cycle with the analog, because of characteristic No. 1. The analog can be set to calculate at any rate desired, but in practice there is an optimum time base associated with any particular problem, and attempts to run the problem much faster or slower will severely degrade the accuracy. The analog, generally speaking, is faster than the digital.

6. Analog outputs are almost always accurate to within 1%, but seldom better than 0.1%.

7. It is very easy, with most problems, to introduce extensive changes in the simulation in a matter of minutes.

Although the analog computer was designed primarily for the solution of problems in the aircraft field, its area of application has broadened considerably over the years. . . .

Many of these concerns still arise today, albeit in different form: scalability of computation (items 1 and 2), scalability of workflow (item 7), putting parameters on a natural scale (item 3), precision (item 6), and the idea that the method runs at some natural speed (item 5), which comes up with HMC and, before that, efficient Metropolis jumping rules.

They then move on to a discussion of digital computing:

The digital computer works by a counting technique and obeys logic rules exactly. The solutions are at discrete points dependent on the size of the time increment used. The smaller the mesh size, the more we approach the continuous solution. In contrast to the analog computer, which uses continuous variables in the form of voltages, the digital computer uses discrete variables, and operates with numbers as opposed to voltages. The digital computer is essentially a very fast calculating machine. . . .

There are a number of digital computer characteristics that are of particular interest in connection with hybrid simulation. These are:

1. It will deal only with numbers. Any problem must be reduced to a series of numerical operations before it can be handled by the computer. This is not to say that every step must actually be written each time. All sorts of aids to compiling programs are available. A program is nothing more than the entire sequence of instructions given to the computer to solve a problem. In actual practice, the machine itself will write most of its own instructions.

2. It will do exactly what it is told. All changes involve writing new instructions. The easier it is to make a change, the more complicated the original instructions have to be to include the option.

3. The results are exactly repeatable, but their accuracy is dependent on the numerical methods used to solve the problem.

4. The computer will perform only one operation at a time. That is, if the instruction reads, “Move number N from location A to location B,” the machine will, for a given period of time, be doing nothing but that.

5. The computer works with increments. None of the variables are calculated continuously. Generally speaking, the larger the calculation increment of the digital computer, the faster and the less accurate is the computation. There is absolutely no drift with a digital computer.

6. Compared with an analog, the digital is very much better equipped to make decisions. These can be made on the basis of comparison, time, reaching a point in the program, or almost any other criterion chosen by the programmer.

7. The digital can store very much more information than the analog. It can store tables, functions of several variables, whole programs, and many other things.

It is almost impossible to list the areas of application of the computer because of the diversity involved. We can say, however, that the digital computer lays sole claim to those problems which store a lot of information, use much logic, or require extreme accuracy. It will calculate trajectories, solve problems in astronomy, simulate mental processes such as learning and memory, analyze games, do translations, help design new computers, and do untold numbers of other tasks. The major effort to discover new computer applications is devoted to the digital area, with the analog a poor second, and the hybrid far behind.

They were right about that! Digital computers really did take over. Again, I find it interesting how much of the discussion turns on workflow, which we can roughly define as a science-like process of exploration that proceeds by fitting multiple models.
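Item 5 of the digital list above (bigger increments, faster but less accurate) is the kind of thing that's easy to demonstrate today. A minimal sketch, mine rather than the article's, integrating dx/dt = -x by forward Euler at two step sizes:

```python
import math

def euler(step, t_end=5.0):
    """Integrate dx/dt = -x from x(0) = 1 by forward Euler with a fixed step."""
    x = 1.0
    n = int(round(t_end / step))
    for _ in range(n):
        x += step * (-x)  # one discrete increment of the dynamics
    return x

exact = math.exp(-5.0)
coarse = abs(euler(0.5) - exact)    # 10 large increments: fast, inaccurate
fine = abs(euler(0.005) - exact)    # 1000 small increments: slow, accurate
```

A hundred times the arithmetic buys roughly a factor of 70 in accuracy here, exactly the cost-versus-precision tradeoff the authors describe.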

They continue with some thoughts on the precision of computation which remain relevant over sixty years later:

The subject of accuracy is so complicated, and dependent on so many factors, that it just didn’t seem possible to summarize it by a mark in a box. While this is to some extent true of all the other characteristics listed, we believe considerations of accuracy fall into a special case.

On an analog computer, the result is usually within 0.1% and 1% of the value inherent in the equations. Whether this is excellent or poor depends on the nature of the problem. In many engineering investigations, this is much more precise than the data upon which the problem is based. The use to which the answer will be put also affects the accuracy required. Determination of the region of stability of a control system to within a millionth of the control range would be valueless, as the nature of the input could affect it much more than that. On a digital computer, the ultimate limit of accuracy is the number of bits in a word. This accuracy is seldom attained by the output variables of a problem, due to the approximations involved in almost any mathematical model, the idiosyncrasies of programming, and the practical necessity of taking reasonably large computing steps. The question concerning accuracy is more often, “How much cost and effort is needed to obtain the required accuracy?”, than “What accuracy is obtainable?” The answer has to be determined separately for each individual problem.

Next they move on to “hybrid” setups that combine analog and digital computing, sharing their own experiences:

The advantages of a hybrid that we felt to be of most value to the work of the department were in the area of increasing the size and variety of the problems we could solve. The things a hybrid can do to help in that endeavor are:

1. Assign different sections of a problem to each computer. For instance, in simulating a missile, the trajectory calculations can be assigned to the digital, because of the available precision, and the control simulation put on the analog because of its flexibility.

2. Assign different functions to each computer. For instance, all integrations might be assigned to the analog computer, in order to save time and get a continuous output. Or, all function generation might be assigned to the digital computer (where it is known as table look-up).

3. Provide analog plots of digital variables. This is particularly useful in observing the behavior of selected variables while the simulation is in progress. In one case, a stop was put on a 7090 after the first 15 seconds of what would otherwise have been a 10 minute run because it was easy to tell from the behavior of a continuous analog output that a key variable was not behaving quite as desired.

4. Let the digital provide logic for the analog. Things such as switching, scale changing, ending the program, choosing tables to examine, can be readily programmed into the digital and can greatly simplify and possibly even speed up an analog simulation.

5. Allow real hardware to be part of a simulation. Most hardware can readily be connected into the analog, and hybrid operation would allow it to connect to the digital just as easily. Similarly, digital devices can be included in analog operation the same way. Real hardware could also be considered to include people, as part of a control loop.

6. Provide accurate digital printouts of analog variables. Normally, the accuracy with which the analog variables are plotted is less than the accuracy that actually exists in the equipment. Hybrid operation enables selected variables to be converted to digital form and printed out from a digital tape.

The details of this sort of hybrid computing don’t really matter anymore, but the general idea of looking at leaks in the modeling pipeline, that still is important.

I was also struck by the larger framework of simulation. Of course this makes sense: a missile test is expensive so you want to understand as much as you can using simulation before going out and launching something. In addition to being cost- and time-effective, simulation also makes the live test more effective. The real-world launch gives real-world data which you can compare to your expectations. The better your simulations, the better will be your expectations, and the more you will learn from discrepancies in the live data.

I’ve thought about these issues for a while in the context of model checking and exploratory data analysis (see BDA, starting from the first edition in 1995, and my 2003 article, A Bayesian formulation of exploratory data analysis and goodness-of-fit testing), but it was only just now that I realized the connection to workflow and simulated-data experimentation.

If only someone had given me this article to read 40 years ago, back when I was first doing simulations of physical systems. I blame the author of that 1962 article, who easily could have shared it with me at the time. The trouble was that he was too self-effacing.

**P.S.** The diagram at the top of this post comes from this 1963 article, “Corrected inputs: A method for improved hybrid simulation,” which begins:

Makes sense to me, to use some feedback to reduce transmission errors.

They were doing cool stuff back then, 60 years ago. Just regular guys, no Ph.D. or anything. Kinda like Steven Spielberg’s dad. Maybe that’s one reason I liked that movie so much.

Modern AI has moved away from the absolute, deterministic procedures of early machine learning models. Nowadays, probability and randomness are fully embraced and utilized in AI. Some simple examples of this are avoiding overfitting by randomly dropping out neurons (i.e., dropout), and escaping local minima during training thanks to noisy gradient estimates (i.e., stochastic gradient descent). A deeper example is Bayesian neural networks, where the network’s weights are sampled from a probability distribution and Bayesian inference is employed to update the distribution in the presence of data . . .

Another deep example is generative modeling with diffusion models. Diffusion models add noise to data in a forward process, and then reverse the process to generate a new datapoint (see figure illustrating this for generating an image of a leaf). These models have been extremely successful not only in image generation, but also in generating molecules, proteins and chemically stable materials . . .

AI is currently booming with breakthroughs largely because of these modern AI algorithms that are inherently random. At the same time, it is clear that AI is not reaching its full potential, because of a mismatch between software and hardware. For example, sample generation rate can be relatively slow for diffusion models, and Bayesian neural networks require approximations for their posterior distributions to generate samples in reasonable time.

Then comes the punchline:

There’s no inherent reason why digital hardware is well suited for modern AI, and indeed digital hardware is handicapping these exciting algorithms at the moment.

For production AI, Bayesianism in particular has been stifled from evolving beyond a relative niche because of its lack of mesh with digital hardware . . . .the next hardware paradigm should be specifically tailored to the randomness in modern AI. Specifically, we must start viewing stochasticity as a computational resource. In doing so, we could build a hardware that uses the stochastic fluctuations produced by nature.

Coles continues:

The aforementioned building blocks are inherently static. Ideally, the state does not change over time unless it is intentionally acted upon by a gate, in these paradigms.

However, modern AI applications involve accidental time evolution, or in other words, stochasticity. This raises the question of whether we can construct a building block whose state randomly fluctuates over time. This would be useful for naturally simulating the fluctuations in diffusion models, Bayesian inference, and other algorithms.

The key is to introduce a new axis when plotting the state space: time. Let us define a stochastic bit (s-bit) as a bit whose state stochastically evolves over time according to a continuous time Markov chain . . .

Ultimately this involves a shift in perspective. Certain computing paradigms, such as quantum and analog computing, view random noise as a nuisance. Noise is currently the biggest roadblock to realizing ubiquitous commercial impact for quantum computing. On the other hand, Thermodynamic AI views noise as an essential ingredient of its operation. . . .
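To make the s-bit idea concrete, here’s a little simulation of a two-state continuous-time Markov chain (my own sketch, not from Coles’s report; I’m assuming symmetric flip rates): the state just flips back and forth at exponentially distributed times, and in the long run it spends half its time in each state:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_sbit(rate=1.0, t_max=1000.0):
    """Two-state continuous-time Markov chain with a symmetric flip rate.

    Holding times in each state are Exponential(rate); returns the total
    time spent in states 0 and 1 up to t_max."""
    t, state = 0.0, 0
    occupancy = [0.0, 0.0]
    while t < t_max:
        dwell = rng.exponential(1.0 / rate)
        occupancy[state] += min(dwell, t_max - t)  # cap the last dwell at t_max
        t += dwell
        state = 1 - state  # flip the bit
    return occupancy

occ = simulate_sbit()
frac_one = occ[1] / sum(occ)  # long-run occupancy: near 1/2, the stationary distribution
```

With asymmetric rates the stationary distribution would be biased toward the stickier state, which is presumably how such a device would encode a probability.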

I think that when Coles says “AI,” he means what we would call “Bayesian inference.” Or maybe AI represents some particularly challenging applications of Bayesian computation.

**Analog computing**

OK, the above is all background. Coles’s key idea here is to build a computer using new hardware, implementing these stochastic bits so that the continuous computation gets done directly.

This is reminiscent of what in the 1950s and 1960s was called “analog computation” or “hybrid computation.” An analog computer is something you build with a bunch of resistors and capacitors and op-amps to solve a differential equation. You plug it in, turn on the power, and the voltage tells you the solution. Turn some knobs to change the parameters in the model, or set it up in a circuit with a sawtooth input and plug it into an oscilloscope to get the solution as a function of the input, etc. A hybrid computer mixes analog and digital elements. Coles is proposing something different in that he’s interested in the time evolution of the state (which, when marginalized over time, can be mapped to a posterior distribution), whereas in a traditional analog computer, you just look at the end state and you’re not interested in the transient period that it takes to get there.

Here’s the technical report from Coles. I have not read it carefully or tried to evaluate it. That would be hard work! Could be of interest to many of you, though.

Not sure if you’ve written on this, but the orthopedic field is tying itself in knots trying to decide if meniscus repair is useful. The field thought it was, and then decided it wasn’t after some studies a decade ago, and is now mounting a rear-guard action via criticisms of the statistics of intent-to-treat studies.

He points to this article in the journal Arthroscopy, “Can We Trust Knee Meniscus Studies? One-Way Crossover Confounds Intent-to-Treat Statistical Methods,” by James Lubowitz, Ralph D’Agostino Jr., Matthew Provencher, Michael Rossi, and Jefferson Brand.

Hey, I know Ralph from grad school! So I’m inclined to trust this article. But I don’t know anything about the topic. Here’s how the article begins:

Randomized controlled studies have a high level of evidence. However, some patients are not treated in the manner to which they were randomized and actually switch to the alternative treatment (crossover). In such cases, “intent-to-treat” statistical methods require that such a switch be ignored, resulting in bias. Thus, the study conclusions could be wrong. This bias is a common problem in the knee meniscus literature. . . . patients who fail nonsurgical management can cross over and have surgery, but once a patient has surgery, they cannot go back in time and undergo nonoperative management. . . . the typical patient selecting to cross over is a patient who has more severe symptoms, resulting in failure of nonoperative treatment. Patients selecting to cross over are clearly different from the typical patient who does not cross over, because the typical patient who does not cross over is a patient who has less severe symptoms, resulting in good results of nonoperative treatment. Comparing patients with more severe symptoms to patients with less severe symptoms is biased.

Interesting.

That article is from 2016. I wonder what’s been learned since then? What’s the consensus now? Googling *knee meniscus surgery* yields this from the Cleveland Clinic:

Meniscus surgery is a common operation to remove or repair a torn meniscus, a piece of cartilage in the knee. The surgery requires a few small incisions and takes about an hour. Recovery and rehabilitation take a few weeks. The procedure can reduce pain, improve mobility and stability, and get you back to life’s activities. . . .

What follows is lots of discussions about the procedure, who should get it, and its risks and benefits. Nothing about any studies claiming that it doesn’t work. So in this case maybe the “rear-guard action” was correct, or at least successful so far.

In any case, this is a great example for thinking about potential biases in intent-to-treat studies.

I agree with Tierney here. These shooter drills seem ridiculous, and indeed I could well believe that shooter drills could actually increase the rate of shootings by making the whole thing seem so exciting. I guess any such effect would be very small, though, given that kids can already play all those shooter video games.

I have just one more thought, which has to do with the distinction between school shootings as a danger (in a statistical sense) and school shootings as a source of evidence (in a statistical sense).

When I was a kid, there were not school shootings like there are today. Something has changed. More guns, more dangerous guns, more willingness to use them, whatever it is, there’s been a change. I think it makes sense to be concerned about school shootings, even if “the annual odds that an American child will die in a mass shooting at school are nearly 10 million to 1, about the odds of being killed by lightning or of dying in an earthquake,” because it’s something new and scary. Now, you might want to argue that, given the ready availability of high-powered guns, school shootings are inevitable and so there’s no point in worrying about them, or you might even argue that the positive benefits of weapons proliferation outweigh the costs of kids dying in school shootings, but in any case I think it’s legitimate to be concerned that these shootings are happening at all.

Again, though, I agree with Tierney that absolute rates matter. We should be much more concerned about everyday child abuse, which is happening all over the place even while it is trivialized by some entertainers and figures in the news.

I recently read an article by the econometrician William Greene of NYU and others (in a 2005 book). They state the following:

The key difference between Bayesian and classical approaches is that Bayesians treat the nature of the randomness differently. In the classical view, the randomness is part of the model; it is the heterogeneity of the taste parameters, across individuals. In the Bayesian approach, the randomness ‘represents’ the uncertainty in the mind of the analyst (conjugate priors notwithstanding). Therefore, from the classical viewpoint, there is a ‘true’ distribution of the parameters across individuals. From the Bayesian viewpoint, in principle, there could be two analysts with different, both legitimate, but substantially different priors, who therefore could obtain very different, albeit both legitimate, posteriors.

My understanding is that this statement runs counter to the Bernstein-von Mises theorem, which in the wording of Wikipedia “assumes there is some true probabilistic process that generates the observations, as in frequentism” (my emphasis). Their context is comparing individual parameters from a mixture model, which can be taken from the posterior of a Bayesian inference or (in the frequentist case) obtained through simulation. I was particularly struck by their terming randomness as part of the model in the frequentist approach, which to me reads more as a feature of Bayesian approaches that are driven by uncertainty quantification.

My reply: Yes, I disagree with the above-quoted passage. They are exhibiting a common misunderstanding. I’ll respond with two points:

1. From the Bayesian perspective there also is a true parameter; see for example Appendix B of BDA for a review of the standard asymptotic theory. That relates to Hawkins’s point about the Bernstein-von Mises theorem.

2. Greene et al. write, “From the Bayesian viewpoint, in principle, there could be two analysts with different, both legitimate, but substantially different priors, who therefore could obtain very different, albeit both legitimate, posteriors.” The same is true in the classical viewpoint; just replace the word “priors” by “likelihoods” or, more correctly, “data models.” Hire two different econometricians to fit two different models to your data and they can get “very different, albeit both legitimate” inferences.

Hawkins sends another excerpt from the paper:

The Bayesian approach requires the a priori specification of prior distributions for all of the model parameters. In cases where this prior is summarising the results of previous empirical research, specifying the prior distribution is a useful exercise for quantifying previous knowledge (such as the alternative currently chosen). In most circumstances, however, the prior distribution cannot be fully based on previous empirical work. The resulting specification of prior distributions based on the analyst’s subjective beliefs is the most controversial part of Bayesian methodology. Poirier (1988) argues that the subjective Bayesian approach is the only approach consistent with the usual rational actor model to explain individuals’ choices under uncertainty. More importantly, the requirement to specify a prior distribution enforces intellectual rigour on Bayesian practitioners. All empirical work is guided by prior knowledge and the subjective reasons for excluding some variables and observations are usually only implicit in the classical framework. The simplicity of the formula defining the posterior distribution hides some difficult computational problems, explained in Brownstone (2001).

That’s a bit better, but it still doesn’t capture the all-important point that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood.

And this:

Allenby and Rossi (1999) have carried out an extensive Bayesian analysis of discrete brand choice and discussed a number of methodological issues relating to the estimation of individual level preferences. In comparison of the Bayesian and classical methods, they state the simulation based classical methods are likely to be extremely cumbersome and are approximate whereas the Bayesian methods are much simpler and are exact in addition. As to whether the Bayesian estimates are exact while sampling theory estimates are approximate, one must keep in mind what is being characterised by this statement. The two estimators are not competing for measuring the same population quantity with alternative tools. In the Bayesian approach, the ‘exact’ computation is of the analysts posterior belief about the distribution of the parameter (conditioned, one might note on a conjugate prior virtually never formulated based on prior experience), not an exact copy of some now revealed population parameter. The sampling theory ‘estimate’ is of an underlying ‘truth’ also measured with the uncertainty of sampling variability. The virtue of one over the other is not established on any but methodological grounds – no objective, numerical comparison is provided by any of the preceding or the received literature.

Again, I don’t think the framing of Bayesian inference as “belief” is at all helpful. Does the classical statistician or econometrician’s logistic regression model represent his or her “belief”? I don’t think so. It’s not a belief, it’s a model, it’s an assumption.

But I agree with their other point that we should not consider the result of an exact computation to itself be exact. The output depends on the inputs.

We can understand this last point without thinking about statistical inference at all. Just consider a simple problem of measurement, where we estimate the weight of a liquid by weighing an empty jar, then weighing the jar with the liquid in it, then subtracting. Suppose the measured weights are 213 grams and 294 grams, so that the estimated weight of the liquid is 81 grams. The calculation, 294-213=81, is exact, but if the original measurements have error, then that will propagate to the result, so it would not be correct to say that 81 grams is the exact weight.
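To put numbers on the error propagation: suppose each weighing has a standard error of 1 gram (a made-up number for illustration). Independent errors add in quadrature, so the 81-gram result carries roughly a 1.4-gram uncertainty:

```python
import math

jar, jar_plus_liquid = 213.0, 294.0  # measured weights in grams
sigma = 1.0                          # assumed standard error of each weighing

liquid = jar_plus_liquid - jar       # the "exact" calculation: 81 grams
sigma_liquid = math.sqrt(sigma**2 + sigma**2)  # independent errors add in quadrature

# liquid is exactly 81.0, but sigma_liquid is about 1.41:
# the subtraction is exact, yet the result inherits the measurement error.
```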

**The foundation: attention models**

In a nutshell, language modeling is the simple task of predicting the next subword (called a “token”) based on the previous sequence of subwords. The state of the art had stalled for years on n-gram models that use the previous n subwords (usually with n < 5). In 2017, a team of Google researchers released a paper titled “Attention is all you need,” which introduced the current state-of-the-art neural network architecture for language modeling. The breakthrough was in extending the context length into the thousands (GPT-3.5 uses 4K; GPT-4 has 8K and 32K models) with an attention model that figured out which parts of the context to concentrate on. The fundamental bottleneck is that computation is quadratic in context length (though it’s all on GPU, so that’s a massive number of flops for relatively low power).
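To see where the quadratic cost comes from, here’s a bare-bones numpy sketch of single-head scaled dot-product attention (an illustration of the general mechanism, not any particular production model): for a context of n tokens, the score matrix is n × n:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)             # (n, n): quadratic in context length
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row is a distribution over positions
    return weights @ V                              # each position mixes all positions

rng = np.random.default_rng(0)
n, d = 8, 4  # tiny toy dimensions
X = rng.normal(size=(n, d))
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
```

Doubling the context doubles n but quadruples the score matrix, which is why long-context models are expensive.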

The 2017 paper introduced the so-called “transformer” architecture, which combines multiple attention “heads” in parallel. The original application was to translation, but it’s the self-attention component that was extracted for use in LLMs. The “T” in “GPT” is for “transformer” (the “GP” is for “generative pretrained”). What researchers have found is that the heads learn different aspects of prediction, such as different syntactic structures, much like any mixture model.

There’s a beautiful two-hour YouTube tutorial by Andrej Karpathy that builds up the entire transformer architecture piece by piece in a Colab notebook you can also use. Karpathy applies it to building a Shakespearean chatbot. It assumes you know Python, but is otherwise quite gentle, starting with an intro to n-gram language models and softmax.

**Garbage-in, garbage-out**

The current crop of large language models have been trained on vast amounts of human text, primarily collected through the internet. As you might imagine, including sources like Reddit and 4chan and Twitter leads to a broad set of what can most charitably be called “points of view.” Even on technical issues, the web is cluttered with material that should probably not be the basis for serious work—homework exercises for intro data science classes clutter GitHub and StackOverflow, every statistician and their cousin’s experimental code seems to be wrapped up as an R package, scripts from ancient versions of software persist, etc.

**Alignment: from LLMs to chatbots**

After building these powerful, transformer-based large language models (LLMs), people realized that they were really good at generating text. As in they blew away any previous compression record (just like the TV show *Silicon Valley*!). You can convert a language model into a compression scheme using prediction by partial matching (PPM) with arithmetic coding, the reference implementation of which was designed and coded by Radford Neal (with Ian Witten and John Cleary) in 1987. Seriously, they should win an ACM/Turing award just for the quantum leap in text compression.
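The connection to compression is direct: an ideal arithmetic coder spends -log2 p(token | context) bits per token, so a better predictor of the next token gives a shorter code. A toy sketch (the uniform model and the 50,000-word vocabulary are made-up numbers for illustration):

```python
import math

def compressed_bits(tokens, prob):
    """Ideal code length in bits: -log2 of the model's probability for each token."""
    return sum(-math.log2(prob(tokens[:i], tok)) for i, tok in enumerate(tokens))

# A model that knows nothing: uniform over a 50,000-word vocabulary,
# costing log2(50000) ≈ 15.6 bits per token.
uniform = lambda context, token: 1 / 50_000

bits = compressed_bits(["the", "cat", "sat"], uniform)  # 3 * log2(50000) ≈ 46.8 bits
```

An LLM that puts high probability on the actual next token drives those per-token costs toward a couple of bits, which is the compression record the post is alluding to.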

The early LLMs could write computer programs, translate Pascal to Fortran and Swahili to English, and generate new recipes given only a list of ingredients or new episodes of TV shows. But they tend to ramble off topic, tend to “hallucinate” (the term of art for when LLMs make things up; it’s called “dreaming” for diffusion models like Midjourney), and tend to be fluid with the points of view they find in training data. They’re just as happy telling you how to make a bomb in your basement and where to set it off as they are telling you how to make a soufflé in your kitchen and how to serve it. And if you “jailbreak” the current ChatGPT, it’ll still be happy to tell you how to try all sorts of dangerous, illegal, and morally and ethically questionable activities.

OpenAI’s approach to preventing the LLMs from spewing dangerous and/or toxic garbage is to fine-tune the large language models with human feedback (HF) using reinforcement learning (RL; together, RLHF). Their stated goal was to “align” the language models to be (a) helpful, (b) truthful, and (c) harmless. While this may sound like an objective task when presented this way, the notions of truthful and harmless are difficult to pin down and require subjective judgment calls. Even helpfulness is a slippery notion, in that help that’s too verbose or too specific isn’t helpful. What one person takes to be self-evident in these realms can be considered lunacy by others.

OpenAI either implicitly or explicitly chose the point of view of a West-coast American liberal, which is the far left of the mainstream US political spectrum, even though it’s relatively conservative by European standards. They could’ve just as easily decided to give ChatGPT the perspective of the far right of the mainstream US political spectrum and it would’ve had a very different perspective and a different segment of the population would be complaining about its biases.

**Cultural consensus theory**

In 1979, Phil Dawid and Allan Skene introduced a statistical model of crowdsourcing for medical records. The idea is that there’s a latent true value of something like whether a patient smokes, and doctors looking at medical records are going to give you a noisy measurement of that value. The same kind of model can be applied to radiology and doctors classifying medical images for stage of cancer, etc. Or to NeurIPS paper reviews. The model assigns accuracies and biases (too positive or too negative) to the raters and infers the underlying value most consistent with the observed ratings (given the rater accuracies and biases).
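Here’s the core of the idea in a few lines (a simulation sketch only; in the real Dawid and Skene model the accuracies are unknown and estimated jointly with the truth, e.g., by EM or MCMC, but here I take them as given to show the inference for the latent value):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_raters = 200, 5
truth = rng.binomial(1, 0.3, n_items)      # latent true labels (prevalence 0.3)
acc = rng.uniform(0.65, 0.9, n_raters)     # per-rater accuracy, assumed symmetric

# Each rater reports the truth with probability acc[j], else the opposite.
correct = rng.random((n_items, n_raters)) < acc
ratings = np.where(correct, truth[:, None], 1 - truth[:, None])

# Posterior P(truth = 1 | ratings), taking accuracies and prevalence as known:
# each rating contributes a log-likelihood-ratio "vote" weighted by rater quality.
llr = ratings * np.log(acc / (1 - acc)) + (1 - ratings) * np.log((1 - acc) / acc)
logit = np.log(0.3 / 0.7) + llr.sum(axis=1)
post = 1 / (1 + np.exp(-logit))
inferred = (post > 0.5).astype(int)
```

The point is that good raters get louder votes than bad ones, which is what distinguishes this from simple majority voting.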

Dawid and Skene’s model was independently rediscovered by many, including by me with the help of Andrew and Jennifer Hill (it was my gateway model into Bayes, and there’s an example of how to code it in the Stan *User’s Guide*). As Andrew tells me, no matter what model you look at, a psychometrician probably introduced it 50 years ago (e.g., Elo is just a rescaled Bradley-Terry model, which is from 1952).

In 1986, A. Kimball Romney, Susan Weller, and William Batchelder published “Culture as consensus: a theory of culture and informant accuracy”, which introduced cultural consensus theory (CCT). It shouldn’t be surprising that it was published in an anthropology journal, because anthropology is cross-cultural sociology. Batchelder and Romney later published a paper, “Test theory without an answer key” in *Biometrika*; think IRT 0PL model but with unknown true answer, which is the Dawid and Skene model.

The twist that CCT introduced to take it beyond Dawid and Skene’s model was a mixture model for the “truth.” That is, they assumed there might not actually be a single consensus point of view among raters. This would be a good idea for crowdsourcing, too, where the respondents are often a mix of spammers and people making a good-faith effort (it’s really more of a continuum).

I think it would be interesting to apply CCT to ChatGPT. It’s the same kind of thing that folks do in applying ideal point models to voting.

Ethan Steinberg writes:

A while back, you briefly blogged a bit about a very nicely done pre-K RCT on results up to the third grade:

I thought you might be interested that the authors have just published their sixth grade results.

It looks like the negative effects found in the third grade analysis seem to have gotten stronger:

Data through sixth grade from state education records showed that the children randomly assigned to attend pre-K had lower state achievement test scores in third through sixth grades than control children, with the strongest negative effects in sixth grade

I think the really interesting question about this study is: if the study is correct, how were we so wrong previously? Someone posted the above plot from a 2013 article that really shows how much earlier pre-K studies differed from later ones.

Is this a change in types of pre-K studies being done, a change in the environment (maybe something about the internet really changed the effectiveness of pre-K?), or a publication bias issue?

I don’t know! It’s my impression that those old studies were so noisy as to be essentially useless for any quantitative purposes. It’s funny how that could happen. I’m guessing that all these designs included power analyses but with massively overoptimistic hypothesized effect sizes, which can happen if you don’t fully think through the implications of treatment effect heterogeneity. Kinda scary to think of all this money, effort, and statistical analysis that was missing this basic point. To really understand this, we have to go back to the gung-ho Cold War mindset of the 1950s and 60s, the attitude that, with sufficient fortitude, all problems could be solved.
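That last point is easy to quantify with a standard two-sample power calculation (the effect sizes 0.5 and 0.1 below are hypothetical numbers for illustration): a study sized for an optimistic half-standard-deviation effect has almost no power if the true effect is 0.1 standard deviations:

```python
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for standardized effect d."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5  # noncentrality: d / sqrt(2 / n)
    return 1 - NormalDist().cdf(z_crit - ncp)

n = 64                                 # sized for roughly 80% power at d = 0.5
optimistic = power_two_sample(0.5, n)  # about 0.80
realistic = power_two_sample(0.1, n)   # under 0.15: nearly hopeless
```

To get 80% power at d = 0.1 you’d need roughly 25 times the sample size, which is the gap those old power analyses were papering over.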

I’ve got a question that seems like it should be elementary, but I haven’t seen it addressed anywhere (maybe I’m looking in the wrong places?)

When I try to use binned residual plots to evaluate a multilevel logistic regression, I often see a pattern like this (from my student, fit with glmer):

I think the reason is partial pooling: the group-level intercepts are shrunk towards the grand mean.

I was able to replicate the effect (albeit kind of mirror-imaged—the above plot was from a very complex model) with fake data:

```r
library(lme4)  # glmer
library(arm)   # binnedplot

makeData <- function(ngroup = 100, groupSizeMean = 10, reSD = 2) {
  groupInt <- rnorm(ngroup, sd = reSD)
  groupSize <- rpois(ngroup, lambda = groupSizeMean)
  groups <- rep(1:ngroup, times = groupSize)
  n <- sum(groupSize)
  data.frame(group = groups,
             y = rbinom(n, size = 1, prob = plogis(groupInt[groups])))
}

dat <- makeData()
mod <- glmer(y ~ (1 | group), data = dat, family = binomial)
binnedplot(predict(mod, type = "response"), resid(mod, type = "response"))
```

Model estimates (i.e., point estimates of the parameters from a hierarchical model) of extreme group effects are shrunk towards 0 (the grand-mean intercept in this case), except at the very edges, where the 0-1 bound forces the residuals to be small in magnitude (I expect the pattern would be linear on the log-odds scale).

When I re-fit the same model on the same data with rstanarm and looked at the fitted values I got basically the same result.

On the other hand, when looking at 9 random posterior draws the pattern mostly goes away:

Now here come the questions: is this really a general phenomenon, like I think it is? If so, what does it mean for the use of binned residual plots for multilevel logistic regression, or really any time there's shrinkage or partial pooling? Can binned residual plots be helpful for models fit with glmer, or only by plotting individual posterior draws from a Bayesian posterior distribution?

My reply: Yes, the positive slope for resid vs expected value . . . that would never happen in least-squares regression, so, yeah, it has to do with partial pooling. We should think about what's the right practical advice to give here. Residual plots are important.

As you note with your final graph above, the plots should have the right behavior (no slope when the model is correct) when plotting the residuals relative to the simulated parameter values. This is what Xiao-Li, Hal, and I called "realized discrepancies" in our 1996 paper on posterior predictive checking, but then in our 2000 paper on diagnostic checks for discrete-data regression models using posterior predictive simulations, Yuri, Francis, Ivan, and I found that the use of realized discrepancies added lots of noise in residual plots.

What we'd like is an approach that gives us the clean comparisons but without the noise.

So I thought I’d share this new postdoc position that Manish Raghavan and I have here at MIT where it is an important focus. Here’s some of the description of the broad project area, which this researcher would help shape:

This research program is working to understand and advance techniques for sharing and using data while limiting what is revealed about any individual or organization. We are particularly interested in how privacy-preserving technologies interface with recent developments in high-dimensional statistical machine learning (including foundation models), questions about fairness of downstream decisions, and with causal inference. Applications include some in government and public policy (e.g., related to US Census Bureau data products) and increasing use in multiple industries (e.g., tech companies, finance).

While many people with relevant expertise might be coming from CS, we’re also very happy to get interest from statisticians — who have a lot to add here!

*This post is by Dean Eckles.*

I pointed Erik van Zwet to this post, “I’m skeptical of that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”, and wrote:

This example (in particular, the regression analysis at the end of the PPS section) makes me think about your idea of a standard-error-scaled prior, this time for regression coefficients. What do you think?

Erik replied:

Yes, I did propose a default prior for regression coefficients:

Wow, 2019 seems so long ago now! This was before I had the nice Cochrane data, and started focusing on clinical trials. The paper was based on a few hundred z-values of regression coefficients which I collected by hand from Medline. I tried to do that in an honest way as follows:

It is a fairly common practice in the life sciences to build multivariate regression models in two steps. First, the researchers run a number of univariate regressions for all predictors that they believe could have an important effect. Next, those predictors with a p-value below some threshold are selected for the multivariate model. While this approach is statistically unsound, we believe that the univariate regressions should be largely unaffected by selection on significance, simply because that selection is still to be done!

Anyway, using a standard-error-scaled prior really means putting in prior information about the signal-to-noise ratio. The study with the brain activity in babies seems to have a modest sample size relative to the rather noisy outcome. So I would expect regression coefficients with z-values between 1 and 3 to be inflated, and an Edlin factor of 1/2 seems about the right ball park.

I think that type M errors are a big problem, but I also believe that the probability of a type S error tends to be quite small. So, if I see a more or less significant effect in a reasonable study, I would expect the direction of the effect to be correct.

I just want to add one thing here, which is in that example the place where I wanted to apply the Edlin factor was on the control variables in the regression, where I was adjusting for pre-treatment predictors. The main effects in this example show no evidence of being different from what could be expected from pure chance.

This discussion is interesting in revealing two different roles of shrinkage. One role is what Erik is focusing on, which is shrinkage of effects of interest, which as he notes should generally have the effect of making the magnitudes of estimated effects smaller without changing their sign. The other role is shrinkage of coefficients of control variables, which regularizes these adjustments, which indirectly give more reasonable estimates of the effects of interest.
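Erik’s two claims (large type M errors, small type S errors) are easy to check by simulation. Assuming a signal-to-noise ratio of 1, estimates that happen to reach |z| > 1.96 overestimate the true effect by roughly a factor of 2.5, while only about 1% have the wrong sign:

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect, se = 1.0, 1.0                # signal-to-noise ratio of 1 (an assumption)
est = rng.normal(true_effect, se, 200_000)  # many replications of the same noisy study

sig = est[np.abs(est / se) > 1.96]        # estimates that reach statistical significance
exaggeration = sig.mean() / true_effect   # type M: roughly 2.5-fold inflation
type_s = (sig < 0).mean()                 # type S: significant but wrong sign, about 1%
```

An exaggeration factor near 2.5 is exactly why an Edlin factor around 1/2 is in the right ballpark for this regime.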
