For the time being, COS is only recruiting for projects that require local IRB/ethics review and approval. Human Subjects Research (HSR) projects can be funded up to $10,000. However, your institution is required to have a Federalwide Assurance (FWA) in order for you to receive funding for HSR projects. More information about the FWA is available **here**, and you can check whether your institution has an active FWA

If you have any questions, please reach out to Nick and Olivia at scorecoordinator@cos.io.

“In this overview, I provide a summary description of the history and state of reproducibility and replicability in the academic field of economics.”

“The purpose of the overview is not to propose specific solutions, but rather to provide the context for the multiplicity of innovations and approaches that are currently being implemented and developed, both in economics and elsewhere.”

“In this text, we adopt the definitions of reproducibility and replicability articulated, inter alia, by Bollen et al. (2015) and in the report by NASEM (2019).”

“At the most basic level, reproducibility refers to “the ability […] to duplicate the results of a prior study using the same materials and procedures as were used by the original investigator.””

“Replicability, on the other hand, refers to “the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected”…, and generalizability refers to the extension of the scientific findings to other populations, contexts, and time frames.”

“Much of economics was premised on the use of statistics generated by national statistical agencies as they emerged in the late 19th and early 20th century…Economists were requesting access for research purposes to government microdata through various committees at least as far back as 1959 (Kraus, 2013).”

“Whether using private-sector data, school-district data, or government administrative records, from the United States and other countries, the use of these data for innovative research has been increasing in recent years. In 1960, 76% of empirical AER articles used public-use data. By 2010, 60% used administrative data, presumably none of which is public use.”

“In economics, complaints about the inability to properly conduct reproducibility studies, or about the absence of any attempt to do so by editors, referees, and authors, can be traced back to comments and replies in the 1970s.”

“In the early 2000s, as in other sciences (National Research Council, 2003), journals started to implement ‘data’ or ‘data availability’ policies. Typically, they required that data and code be submitted to the journal, for publication as ‘supplementary materials.’”

“Journals in economics that have introduced data deposit policies tend to be higher ranked…None of the journals…request that the data be provided before or during the refereeing process, nor does a review of the data or code enter the editorial decision, in contrast to other domains (Stodden et al., 2013). All make provision of data and code a condition of publication, unless an exemption for data provision is requested.”

“More recently, economics journals have increased the intensity of enforcement of their policies. Historically being mainly focused on basic compliance, associations that publish journals …have appointed staff dedicated to enforcing various aspects of their data and code availability policies…The enforcement varies across journals, and may include editorial monitoring of the contents of the supplementary materials, reexecution of computer code (verification of computational reproducibility), and improved archiving of data.”

“If the announcement and implementation of data deposit policies improve the availability of researchers’ code and data…, what has the impact been on overall reproducibility? Table 2B shows the reproduction rates both conditional on data availability as well as unconditionally, for a number of reproducibility studies.”

“Data that is not provided due to licensing, privacy, or commercial reasons (often incorrectly collectively referred to as ‘proprietary’ data) can still be useful in attempts at reproduction, as long as others can reasonably expect to access the data…Providers will differ in the presence of formal access policies, and this is quite important for reproducibility: only if researchers other than the original author can access the non-public data can an attempt at reproducibility even be made, if at some cost.”

“We made a best effort to classify the access to the confidential data, and the commitment by the author or third parties to provide the data if requested. For instance, a data curator with a well-defined, nonpreferential data access policy would be classified under ‘formal commitment.’…We could identify a formal commitment or process to access the data only for 35% of all nonpublic data sets.”

“One of the more difficult topics to empirically assess is the extent to which reproducibility is taught in economics, and to what extent in turn economic education is helped by reproducible data analyses. The extent of the use of replication exercises in economics classes is anecdotally high, but I am not aware of any study or survey demonstrating this.”

“More recently, explicit training in reproducible methods (Ball & Medeiros, 2012; Berkeley Initiative for Transparency in the Social Sciences, 2015), and participation of economists in data science programs with reproducible methods has increased substantially, but again, no formal and systematic survey has been conducted.”

“Because most reproducibility studies of individual articles ‘only’ confirm existing results, they fail the ‘novelty test’ that most editors apply to submitted articles (Galiani et al., 2017). Berry and coauthors (2017) analyzed all papers in Volume 100 of the AER, identifying how many were referenced as part of replication or cited in follow-on work.”

“While partially confirming earlier findings that strongly cited articles will also be replicated (Hamermesh, 2007), the authors found that 60% of the original articles were referenced in replication or extension work, but only 20% appeared in explicit replications. Of the roughly 1,500 papers that cite the papers in the volume, only about 50 (3.5%) are replications, and of those, only 8 (0.5%) focused explicitly on replicating one paper.”

“Even rarer are studies that conduct replications prior to their publication, of their own volition. Antenucci et al. (2014) predict the unemployment rate from Twitter data. After having written the paper, they continued to update the statistics on their website (“Prediction of Initial Claims for Unemployment Insurance,” 2017), thus effectively replicating their paper’s results on an ongoing basis. Shortly after release of the working paper, the model started to fail. The authors posted a warning on their website in 2015, but continued to publish new data and predictions until 2017, in effect, demonstrating themselves that the originally published model did not generalize.”

“Reproducibility has certainly gained more visibility and traction since Dewald et al.’s (1986) wake-up call…Still, after 30 years, the results of reproducibility studies consistently show problems with about a third of reproduction attempts, and the increasing share of restricted access data in economic research requires new tools, procedures, and methods to enable greater visibility into the reproducibility of such studies. Incorporating consistent training in reproducibility into graduate curricula remains one of the challenges for the (near) future.”

To read the article, **click here**.

**How should one define “replication success”?**

In their seminal article assessing the rate of replication in psychology, **Open Science Collaboration (2015)** employed a variety of definitions of replication success. One of their measures has come to dominate all others: obtaining a statistically significant estimate with the same sign as the original study (“SS-SS”). For example, this is the definition of replication success employed by the massive

The reason for the “SS-SS” definition of replication success is obvious. It can easily be applied across a wide variety of circumstances, allowing a one-size-fits-all measure of success. It melds two aspects of parameter estimation – effect size and statistical significance – into a binary measure of success. However, studies differ in the nature of their contributions. For some studies, statistical significance may be all that matters, say when establishing the prediction of a given theory. For others, the size of the effect may be what’s important, say when one is concerned about the effect of a tax cut on government revenues.

The following example illustrates the problem. Suppose a study reports that a 10% increase in unemployment benefits is estimated to increase unemployment duration by 5%, with a 95% confidence interval of [4%, 6%]. Consider two replication studies. Replication #1 estimates a mean effect of 2% with corresponding confidence interval of [1%, 3%]. Replication #2 estimates a mean effect of 5%, but the effect is insignificant with a corresponding confidence interval of [0%, 10%].

Did either of the two replications “successfully replicate” the original? Did both? Did neither? The answer to this question largely depends on the motivation behind the original analysis. Was the main contribution of the original study to demonstrate that unemployment benefits affect unemployment durations? Or was the motivation primarily budgetary, so that the size of the effect was the important empirical contribution?

There is no general right or wrong answer to these questions. It is study-specific. Maybe even researcher-specific. For this reason, while I understand the desire to develop one-size-fits-all measures of success, it is not clear how to interpret these “success rates”. This is especially true when one recognizes — and as I discussed in the previous instalment to this blog — that “success rates” below 100%, even well below 100%, are totally compatible with well-functioning science.
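To make the disagreement concrete, the sketch below scores the two hypothetical replications from the unemployment-benefits example under two criteria. The first is the “SS-SS” rule described above. The second, `point_within_original_ci`, is a hypothetical effect-size criterion chosen only for illustration; it is not a definition taken from any of the studies cited here. Significance is read off the 95% confidence interval (an interval that excludes zero counts as significant).

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    point: float  # estimated effect on unemployment duration, in percent
    lo: float     # lower bound of the 95% confidence interval
    hi: float     # upper bound of the 95% confidence interval

    def significant(self) -> bool:
        # Significant at the 5% level if the interval excludes zero.
        return self.lo > 0 or self.hi < 0


def same_sign_significant(orig: Estimate, rep: Estimate) -> bool:
    """'SS-SS': both estimates significant and of the same sign."""
    same_sign = (orig.point > 0) == (rep.point > 0)
    return orig.significant() and rep.significant() and same_sign


def point_within_original_ci(orig: Estimate, rep: Estimate) -> bool:
    """Hypothetical effect-size rule: replication estimate lies in the original CI."""
    return orig.lo <= rep.point <= orig.hi


original = Estimate(point=5.0, lo=4.0, hi=6.0)
rep1 = Estimate(point=2.0, lo=1.0, hi=3.0)   # Replication #1
rep2 = Estimate(point=5.0, lo=0.0, hi=10.0)  # Replication #2

for name, rep in [("Replication #1", rep1), ("Replication #2", rep2)]:
    print(name,
          "SS-SS:", same_sign_significant(original, rep),
          "| effect size:", point_within_original_ci(original, rep))
```

Under these assumptions the two criteria give opposite verdicts: Replication #1 succeeds on “SS-SS” but fails the effect-size rule, and Replication #2 does the reverse.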

**How should we interpret the results of a replication?**

The preceding discussion might give the impression that replications are not very useful. While measures of the overall “success rate” of replications may not tell us much, replications can be very insightful in individual cases.

In a blog I wrote for *TRN* entitled “**The Replication Crisis – A Single Replication Can Make a Big Difference**”, I showed how a single replication can substantially impact one’s assessment of a previously published study.

Define “Prior Odds” as the Prob(*Treatment is effective*):Prob(*Treatment is ineffective*). Define the “False Positive Rate” (FPR) as the percent of statistically significant estimates in published studies for which the true underlying effect is zero; i.e., the treatment has no effect. If the prior odds of a treatment being effective are relatively low, Type I error will generate a large number of “false” significant estimates that can overwhelm the significant estimates associated with effective treatments, causing the FPR to be high. TABLE 1 below illustrates this.

The FPR values in the table range from 0.24 to 0.91. For example, given 1:10 odds that a randomly chosen treatment is effective, and assuming studies have Power equal to 0.50, the probability that a statistically significant estimate is a false positive is 50%. Alternatively, if we take a Power value of 0.20, which is approximately equal to the value that **Ioannidis et al. (2017)** report as the median value for empirical research in economics, the FPR rises to 71%.
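The two FPR figures quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes a 5% significance level, which the passage does not state explicitly but which is consistent with the quoted values; the function name and its parameters are illustrative.

```python
def false_positive_rate(prior_odds_effective: float, power: float,
                        alpha: float = 0.05) -> float:
    """Share of significant estimates whose true effect is zero.

    prior_odds_effective: Prob(effective) / Prob(ineffective), e.g. 1/10.
    alpha: significance level (assumed to be 5% here).
    """
    p_effective = prior_odds_effective / (1 + prior_odds_effective)
    false_significant = alpha * (1 - p_effective)  # Type I errors
    true_significant = power * p_effective         # correctly detected effects
    return false_significant / (false_significant + true_significant)


print(round(false_positive_rate(1 / 10, power=0.50), 2))  # 1:10 odds, Power 0.50
print(round(false_positive_rate(1 / 10, power=0.20), 2))  # 1:10 odds, Power 0.20
```

With 1:10 prior odds this gives 0.50 for Power of 0.50 and 0.71 for Power of 0.20, matching the text.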

It needs to be emphasized that these high FPRs have nothing to do with publication bias or file drawer effects. They are the natural outcomes of a world of discovery in which Type I error is combined with a situation where most studied phenomena are non-existent or economically negligible.

TABLE 2 reports what happens when a researcher in this environment replicates a randomly selected significant estimate. The left column reports the researcher’s initial assessment that the finding is a false positive (as per TABLE 1). The table shows how that probability changes as a result of a successful replication.

For example, suppose the researcher thinks there is a 50% chance that a given empirical claim is a false positive (Initial FPR = 50%). The researcher then performs a replication and obtains a significant estimate. If the replication study had 50% Power, the updated FPR would fall from 50% to 9%.
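The update in this example is an application of Bayes’ rule. The sketch below again assumes a 5% significance level for the replication; `updated_fpr` is an illustrative name, not something taken from the original blog.

```python
def updated_fpr(initial_fpr: float, replication_power: float,
                alpha: float = 0.05) -> float:
    """Probability that a finding is a false positive after a significant replication."""
    # A false positive replicates "successfully" only through another Type I error.
    false_and_significant = initial_fpr * alpha
    # A true effect replicates with probability equal to the replication's power.
    true_and_significant = (1 - initial_fpr) * replication_power
    return false_and_significant / (false_and_significant + true_and_significant)


print(round(updated_fpr(initial_fpr=0.50, replication_power=0.50), 2))
```

For an initial FPR of 50% and replication Power of 50%, the updated FPR is 0.025 / 0.275, or about 9%.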

TABLE 2 demonstrates that successful replications produce substantial decreases in false positive rates across a wide range of initial FPRs and Power values. In other words, while discipline-wide measures of “success rates” may not be very informative, replications can have a powerful impact on the confidence that researchers attach to individual estimates in the literature.

**Do replications have a unique role to play in contributing to our understanding of economic phenomena?**

To date, replications have not had much of an effect on how economists do their business. The discipline has made great strides in encouraging transparency by **requiring authors to make their data and code available**. However, this greater transparency has not resulted in a meaningful increase in published replications. While there are no doubt many reasons for this, one reason may be that economists do not appreciate the unique role that replications can play in contributing to our understanding of economic phenomena.

The potential for empirical analysis to inform our understanding of the world is conditioned on the confidence researchers have in the published literature. While economists may differ in their assessment of the severity of false positives, the message of TABLE 2 is that, for virtually all values of FPRs, replications substantially impact that assessment. A successful replication lowers, often dramatically lowers, the probability that a given empirical finding is a false positive.

It is worth emphasizing that replications are uniquely positioned to make this contribution. New studies fall under the cloud of uncertainty that hangs over all original findings; namely, the rational suspicion that reported results are merely a statistical artefact. Replications, because of their focus on individual findings, are able to break through the fog. It is hoped that economists will start to recognize the unique role that replications can play in the process of scientific discovery, and that publishing opportunities for well-done replications, along with appropriate professional rewards for the researchers who do them, will follow.

*Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network and Principal Investigator at ***UCMeta***. He can be contacted at bob.reed@canterbury.ac.nz.*

Before I can get there, however, I need to acknowledge that the assessment above relies on a very specific definition of a replication, and that the sample of replications on which it is based is primarily drawn from one data source: **Replication Wiki**. Is it possible that there are a lot more replications “out there” that are not being counted? More generally, is it even physically possible to know how many replications there are?

**Is it possible to know how many replications there are?**

One of the most comprehensive assessments of the number of replications in economics was done in a study by Frank Mueller-Langer, Benedikt Fecher, Dietmar Harhoff, and Gert Wagner, published in *Research Policy* in 2019 and blogged about **here**. ML et al. reviewed all articles published in the top 50 economics journals between 1974 and 2014. They calculated a “replication rate” of 0.1%. That is, 0.1% of all the articles in the top 50 economics journals during this time period were replication studies.

0.1% is likely an understatement of the overall replication rate in economics, as replications are likely to be underrepresented in the top journals. With 400 mainline economics journals, each publishing an average of approximately 100 articles a year, it is a daunting task to assess the replication rate for the whole discipline.

One possibility is to scrape the internet for economics articles and use machine learning algorithms to identify replications. In unpublished work, colleagues of mine at the University of Canterbury used “convolutional neural networks” to perform this task. They compared the texts of the **replication studies listed at The Replication Network (TRN)** with a random sample of economics articles from

Their final analysis produced a false negative error rate (the rate at which replications are mistakenly classified as non-replications) of 17%. The false positive rate (the rate at which non-replications are mistakenly classified as replications) was 5%.

To give a better feel for what these numbers mean, consider a scenario where the replication rate is 1%. Suppose we have a sample of 10,000 papers, of which 100 are replications. Applying the false negative and positive rates above produces the numbers in TABLE 1.

Given this sample, a researcher would identify 578 replications, of which 83 would be true replications, and 495 would be “false replications”, that is, non-replication studies falsely categorized as replication studies. One would have to get a false positive rate below 1% before even half of the identified “replications” were true replications. Given a relatively low replication rate (here 1%), it is obvious that it is highly unlikely that machine learning will ever be accurate enough to produce reliable estimates of the overall replication rate in the discipline.
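The counts in this scenario follow directly from the two error rates. A minimal sketch of the arithmetic, with illustrative function and variable names:

```python
def classifier_counts(n_papers: int, replication_rate: float,
                      false_negative_rate: float, false_positive_rate: float):
    """Expected counts when a classifier flags papers as replications."""
    n_replications = n_papers * replication_rate
    n_other = n_papers - n_replications
    true_hits = n_replications * (1 - false_negative_rate)  # replications found
    false_hits = n_other * false_positive_rate              # non-replications flagged
    flagged = true_hits + false_hits
    return true_hits, false_hits, flagged, true_hits / flagged


true_hits, false_hits, flagged, share_true = classifier_counts(
    n_papers=10_000, replication_rate=0.01,
    false_negative_rate=0.17, false_positive_rate=0.05)
print(round(true_hits), round(false_hits), round(flagged), round(share_true, 3))

# False positive rate at which exactly half of the flagged papers are true replications
break_even = true_hits / (10_000 - 100)
print(round(break_even, 4))
```

Only about 14% of the 578 flagged papers are true replications. The break-even false positive rate is 83 / 9,900, a little over 0.8%, which is why the rate has to fall below 1% before half of the flagged papers are genuine.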

A final alternative is to follow the procedure of ML et al., but choose a set of 50 journals outside the top economics journals. However, as reported in yesterday’s blog, replications tend to be clustered in a relatively small number of journals. Results of replication rates would likely depend greatly on the particular sample of journals that was used.

Putting the above together, the answer to the question “Is it possible to know how many replications there are” appears to be no.

I now move on to assessing what we have learned from the replications that have been done to date. Specifically, have replications uncovered a reproducibility problem in economics?

**Is there a replication crisis in economics?**

The last decade has seen increasing concern that science has a **reproducibility problem**. So it is fair to ask, is there a replication crisis in economics? Probably the most famous study of replication rates is the study by

The next section will delve a little more into the meaning of “replication success”. For now, let’s first ask, what rate of success should we expect to see if science is performing as it is supposed to? In a blog for TRN (“**The Statistical Fundamentals of (Non-)Replicability**”), Jeff Miller considers the case where a replication is defined to be “successful” when it reproduces a statistically significant estimate reported in a previous study (see FIGURE 1 below).

FIGURE 1 assumes 1000 studies each assess a different treatment. 10% of the treatments are effective. 90% have no effect. Statistical significance is set at 5% and all studies have statistical power of 60%. The latter implies that 60 of the 100 studies with effective treatments produce significant estimates. The Type I error rate implies that 45 of the remaining 900 studies with ineffectual treatments also generate significant estimates. As a result, 105 significant estimates are produced from the initial set of 1000 studies.

If these 105 studies are replicated, one would expect to see approximately 38 significant estimates, leading to a replication “success rate” of 36% (see bottom right of FIGURE 1). Note that there is no publication bias here. No “file drawer effect”. Even when science works as it is supposed to, we should not expect a replication “success rate” of 100%. “Success rates” far less than 100% are perfectly consistent with well-functioning science.
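The numbers in this example can be checked with a short calculation. The sketch assumes, as the figures quoted above imply, that each replication has the same power (60%) and significance level (5%) as the original studies; the function name is illustrative.

```python
def expected_replication_success(n_studies: int, share_effective: float,
                                 power: float, alpha: float):
    """Expected number of significant originals and significant replications."""
    n_effective = n_studies * share_effective
    n_ineffective = n_studies - n_effective
    sig_true = n_effective * power      # effective treatments detected
    sig_false = n_ineffective * alpha   # Type I errors
    original_significant = sig_true + sig_false
    # Replicating only the significant originals, with the same power and alpha:
    replicated_significant = sig_true * power + sig_false * alpha
    return (original_significant, replicated_significant,
            replicated_significant / original_significant)


originals, replicated, rate = expected_replication_success(
    n_studies=1000, share_effective=0.10, power=0.60, alpha=0.05)
print(round(originals), round(replicated, 2), round(rate, 2))
```

Of the 105 significant originals, 36 true effects and about 2 false positives are expected to be significant again, giving roughly 38 of 105, or 36%.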

**Conclusion**

Replications come in many sizes, shapes, and flavors. Even if we could agree on a common definition of a replication, it would be very challenging to make discipline-level conclusions about the number of replications that get published. Given the limitations of machine learning algorithms, there is no substitute for personally assessing each article individually. With approximately 400 mainline economics journals, each publishing approximately 100 articles a year, that is a monumental, seemingly insurmountable, challenge.

Beyond the problem of defining a replication, beyond the problem of defining “replication success”, there is the further problem of interpreting “success rates”. One might think that a 36% replication success rate was an indicator that science was failing miserably. Not necessarily so.

The final instalment of this series will explore these topics further. The goal is to arrive at an overall assessment of the potential for replications to make a substantial contribution to our understanding of economic phenomena (to read the next instalment, **click here**).

In this instalment, I address two issues:

– Are there more replications in economics than there used to be?

– Which journals publish replications?

**Are there more replications in economics than there used to be?**

Before we count replications, we need to know what we are counting. Researchers use different definitions of replications, which produce different numbers. For example, at the time of this writing, **Replication Wiki** reports 670 replications at their website. In contrast,

Why the difference? *TRN* employs a narrower definition of a replication. Specifically, it defines a replication as “any study published in a peer-reviewed journal whose main purpose is to determine the validity of one or more empirical results from a previously published study.”

Replications come in many sizes and shapes. For example, sometimes a researcher will develop a new estimator and want to see how it compares with another estimator. Accordingly, they replicate a previous study using the new estimator. An example is De Chaisemartin & d’Haultfoeuille’s “**Fuzzy differences-in-differences**” (

Replication Wiki counts D&H as a replication. *TRN* does not. The reason is that the main purpose of D&H is not to determine whether Duflo (2001) is correct but to illustrate the difference their estimator makes. This highlights the grey area that separates replications from other studies.

Reasonable people can disagree about the “best” definition of replication. I like *TRN’s* definition because it restricts attention to studies whose main goal is to determine “the truth” of a claim by a previous study. Studies that meet this criterion tend to be more intensive in their analysis of the original study and give it a more thorough empirical treatment. A further benefit is that *TRN* has consistently applied the same definition of replication over time, facilitating time series comparisons.

FIGURE 1 shows the growth in replications in economics over time. The graph is somewhat misleading because 2019 was an exceptional year, driven by special replication issues at the *Journal of Development Studies*, the *Journal of Development Effectiveness*, and, especially, *Energy Economics*. In contrast, 2020 will likely end up having closer to 20 replications. Even ignoring the big blip in 2019, it is clear that there has been a general upwards creep in the number of replications published in economics over time. It is, however, a creep, and not a leap. Given that there are **approximately 40,000 articles published annually in Web of Science economics journals**, the increase over time does not indicate a major shift in how the economics discipline values replications.

**Which journals publish replications?**

TABLE 1 reports the top 10 economics journals in terms of total number of replications published in their journal lifetimes. Over the years, a consistent leader in the publishing of replications has been the *Journal of Applied Econometrics*. In second place is the *American Economic Review*. However, an important distinction between these two journals is that *JAE* publishes both positive and negative replications; that is, replications that both confirm and refute the original studies. In contrast, the *AER* only very rarely publishes a positive replication.

There have been several new initiatives by journals to publish replications. Notably, the **International Journal for Re-Views of Empirical Economics (IREE)** was started in 2017 and is solely dedicated to the publishing of replications. It is an open access journal with no author processing charges (APCs), supported by a consortium of private and public funders. As of January 2021, it had published 10 replication studies.

To place the numbers in TABLE 1 in context, there are approximately 400 mainline economics journals. About one fourth (96) have ever published a replication. Two journals account for approximately 25% of all replications that have ever been published. Nine journals account for over half of all replication studies. Only 25 journals (about 6% of all journals) have ever published more than 5 replications in their lifetimes.

**Conclusion**

While a little late to the party, economists have recently made noises about the importance of replication in their discipline. Notably, the **2017 Papers and Proceedings issue of the American Economic Review** prominently featured 8 articles addressing various aspects of replications in economics. And indeed, there has been an increase in the number of replications over time. However, the growth in replications is best described as an upwards creep rather than a bold leap.

Perhaps the reason replications have not really caught on is that fundamental questions about replications have not been addressed. Is there a replication crisis in economics? How should “replication success” be measured? What is the “success rate” of replications in economics? How should the results of replications be interpreted? Do replications have a unique role to play in contributing to our understanding of economic phenomena? I take these up in subsequent instalments of this blog (to read the next instalment, click **here**).

Ter Schure and Grünwald (2019) detail all the possible ways in which the size of a study series up for meta-analysis, or the timing of the meta-analysis, might be driven by the results within those studies. Any such dependency introduces *accumulation bias*. Unfortunately, it is often impossible to fully characterize the processes at play in retrospective meta-analysis, so the bias cannot be accounted for. In this blog we revisit an example accumulation bias process, which can be one of many influencing a single meta-analysis, and use it to illustrate the following key points:

– Standard meta-analysis does not take into account that researchers decide on new studies based on other study results already available. These decisions introduce accumulation bias because the analysis assumes that the size of the study series is unrelated to the studies within; it essentially conditions on the number of studies available.

– Accumulation bias does not result from questionable research practices, such as publication bias from file-drawering a selection of results. The decision to replicate only some studies instead of all of them biases the sampling distribution of study series, but can be a very efficient approach to set priorities in research and reduce research waste.

– ALL-IN meta-analysis stands for *Anytime*, *Live* and *Leading INterim* meta-analysis. It can handle accumulation bias because it does not require a set number of studies, but performs analysis on a growing series – starting from a single study and accumulating as many studies as needed.

– ALL-IN meta-analysis also allows for continuous monitoring of the evidence as new studies arrive, even as new interim results arrive. Any decision to start, stop or expand studies is possible, while keeping valid inference and type-I error control intact. Such decisions can be strategic: increasing the value of new studies, and reducing research waste.

**Our example: extreme Gold Rush accumulation bias**

We imagine a world in which a series of studies is meta-analyzed as soon as three studies become available. Many topics deserve a first initial study, but the research field is very selective with its replications. Nevertheless, for significant results in the right direction, a replication is warranted. We call this the *Gold Rush* scenario, because after each finding of a positive significant result – the gold in science – some research group rushes into a replication, but as soon as a study disappoints, the research effort is terminated and no-one bothers to ever try again. This scenario was first proposed by Ellis and Stewart (2009) and formulated in detail and under this name by Ter Schure and Grünwald (2019). Here we consider the most extreme version of the *Gold Rush* where finding a significant positive result not only makes a replication more probable, but even inevitable: the dependency of occurring replications on their predecessor’s result is deterministic.

**Biased Gold Rush sampling**

We denote the number of studies available on a certain topic by *t*. This number *t* can also indicate the *timing* of a meta-analysis, such that a meta-analysis can possibly occur at number of studies *t* = 1, 2, 3, . . . up to some maximum number of studies *T*. This notation follows from Ter Schure and Grünwald (2019); the Technical Details at the end of this blog make the notation involved in this blog more explicit.

We summarize the results of individual studies into a single per-study *Z*-score (*z*_{1} for the first study, *z*_{2} for the second, etc.), such that we have the following information on a series of size *t*: *z*_{1}, *z*_{2}, . . . , *z*_{t}. We distinguish between

*Gold Rush world*

Here *A*(*t*) denotes whether we accumulate *and* analyze the *t* studies: It can be that *A*(2) = 0 and *A*(3) = 0 because we are stuck at one study, but also *A*(1) = 0 because we don’t “meta-analyze” that single study. It can only be that *A*(2) = 1 if we accumulate *and* meta-analyze a two-study series and *A*(3) = 1 if we accumulate *and* meta-analyze a three-study series. In our *Gold Rush* world a very specific subset of studies accumulate into a three-study series such that they are meta-analyzed (*A*(3) = 1).

*z*^{(3)} denotes the *Z*-score of a fixed-effects meta-analysis. This meta-analysis *Z*-score is simply a re-normalized average and can, assuming equal sample size and variances in all studies, be obtained from the individual study *Z*-scores as follows: $z^{(3)} = \frac{1}{\sqrt{3}} \sum_{i=1}^{3} z_i$. The effects of accumulation bias are not limited to fixed-effects meta-analysis (see for example Kulinskaya et al. (2016)), but fixed-effects meta-analysis does provide us with a simple illustration for the purposes of this blog.

We observe in our *Gold Rush* world above that the study series that are eventually meta-analyzed into a *Z*-score *z*^{(3)} are a very biased subset of all possible study series. So we expect these *z*^{(3)} scores to be biased as well. In the next section, we simulate the sampling distribution of these *z*^{(3)} scores to illustrate this bias.

**The conditional sampling distribution under extreme Gold Rush accumulation bias**

Assume that we are in the scenario that only true null effects are studied in our *Gold Rush *world, such that any new study builds on a false-positive result. How large would the bias be if the three-study series are simply analyzed by standard meta-analysis? We illustrate this by simulating this *Gold Rush *world using the R code below.

**Theoretical sampling process:** A fixed-effects meta-analysis assumes that if three studies *z*_{1}, *z*_{2}, *z*_{3} are each sampled under the null hypothesis, each has a standard normal sampling distribution with mean zero, and that the standard normal sampling distribution also applies to the combined *z*^{(3)} score. The R code in Figure 1 illustrates this sampling process: First, a large population of possible first (Z1), second (Z2) and third (Z3) studies is simulated from a standard normal distribution. Then in Zmeta3 each index i represents a possible study series, such that c(Z1[i], Z2[i], Z3[i]) samples an unbiased study series and calcZmeta calculates its fixed-effects meta-analysis *Z*-score *z*^{(3)}. So the large number of *Z*-scores in Zmeta3 captures the unbiased sampling distribution that is assumed for fixed-effects meta-analysis *z*^{(3)}-scores.

** Gold Rush sampling process: **In contrast, the code resulting in A3 selects only those study series for which

**Meta-analysis under Gold Rush accumulation bias: **The final lines of code in Figure 1 plot two histograms of

We observe in Figure 2 that the theoretical sampling process, resulting in the pink histogram, gives a distribution for the three-study meta-analysis *z*^{(3)}-scores that is centered around zero. Under the *Gold Rush *sampling process, however, our three-study *z*^{(3)}-scores do not behave like this theoretical distribution at all. The blue histogram has a smaller variance and is shifted to the right – representing the bias.
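The comparison described above can be sketched in a few lines of Python (the original figures use R, which is not reproduced here). Two things in this sketch are assumptions rather than facts from the text: the one-sided significance level `ALPHA = 0.025`, and the selection rule that a three-study series exists only when the first two studies are both significant and positive (taken from the later description of the *Gold Rush* world).

```python
import math
import random
from statistics import NormalDist, mean, stdev

ALPHA = 0.025                              # assumed one-sided level per study
Z_ALPHA = NormalDist().inv_cdf(1 - ALPHA)  # critical value, roughly 1.96

def calc_zmeta(z_scores):
    """Fixed-effects Z-score under equal sample sizes and variances."""
    return sum(z_scores) / math.sqrt(len(z_scores))

def simulate(num_sim=200_000, seed=1):
    """Return meta-analysis Z-scores for all series and for Gold Rush series."""
    rng = random.Random(seed)
    unbiased, gold_rush = [], []
    for _ in range(num_sim):
        # All three studies are sampled under the null hypothesis.
        z1, z2, z3 = (rng.gauss(0, 1) for _ in range(3))
        zmeta = calc_zmeta([z1, z2, z3])
        unbiased.append(zmeta)
        # Assumed Gold Rush rule: a third study is only carried out
        # after two significant positive studies.
        if z1 >= Z_ALPHA and z2 >= Z_ALPHA:
            gold_rush.append(zmeta)
    return unbiased, gold_rush

unbiased, gold_rush = simulate()
print(f"theoretical: n={len(unbiased)}, mean={mean(unbiased):.2f}, sd={stdev(unbiased):.2f}")
print(f"Gold Rush:   n={len(gold_rush)}, mean={mean(gold_rush):.2f}, sd={stdev(gold_rush):.2f}")
```

The printed summary should show the same pattern as the two histograms: the selected series have a mean far to the right of zero and a smaller spread than the unselected ones.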

We conclude that we should not use conventional meta-analysis techniques to analyze our study series under *Gold Rush* accumulation bias: Conventional fixed-effects meta-analysis assumes that any three-study summary statistic *Z*^{(3)} is sampled from the pink distribution in Figure 2 under the null hypothesis, such that the meta-analysis is significant for *Z*^{(3)}-scores larger than *z*_{α} = 1

**Accumulation bias can be efficient**

The steps in the code from Figure 1 that arrive at the biased distribution in Figure 2 illustrate that accumulation bias is in fact a selection bias. Nevertheless, accumulation bias does not result from questionable research practices, such as publication bias from file-drawering a selection of results. The selection to replicate only some studies instead of all of them biases the sampling distribution of study series, but can be a very efficient approach to set priorities in research and reduce research waste.

By inspecting our *Gold Rush* world a bit closer, we observe that a fixed-effects meta-analysis of three studies actually *conditions* on this number of studies (*A*(3) needs to be 1), and that this conditional nature is what is driving the accumulation bias; in technical details subsection A.3 we show this explicitly. In the next section we take the unconditional view.

**The unconditional sampling distribution under extreme Gold Rush accumulation bias**

We first adapt our *Gold Rush* accumulation bias world a bit, and not only meta-analyze three-study series but one-study “series” and two-study series as well. All possible scenarios for study series in this “all-series-size” *Gold Rush* world are illustrated below. We assume that we only meta-analyze series in a terminated state, and therefore first await a replication for significant studies before performing the meta-analysis. So a single-study “meta-analysis” can only consist of a negative or nonsignificant initial study (*z*_{1}^{−}); only in that case are we in a terminated state with *A*(1) = 1 and the series does not grow to two (*A*(2) = 0). In a two-study meta-analysis the series starts with a significant positive initial study and is replicated by a nonsignificant or negative one; only in that case *A*(2) = 1, and the series does not grow to three so *A*(3) = 0. And only three-study series that start with two significant positive studies are meta-analyzed in a three-study synthesis; only in that case *A*(3) = 1.

**Gold Rush world; all-series-size**

The R code in Figure 4 calculates the fixed-effects meta-analysis *z*^{(1)}, *z*^{(2)} and *z*^{(3)} scores, conditional on meta-analyzing a one-study, two-study, or three-study series in this adjusted *Gold Rush* accumulation bias scenario. The histograms of these conditional *z*^{(t)} scores are shown in Figure 5, including the theoretical unbiased *z*^{(3)} histogram that was also shown in Figure 2 and largely overlaps with the “*A*(1) = 1, *A*(2) = 0”-scenario. The difference between these two sampling distributions is only visible in their right tail, with the green histogram excluding values larger than *z*_{α} = 1

Figure 5 clarifies that single studies are hardly biased in this extreme *Gold Rush *scenario, that the bias is problematic for two-study series and most extreme for three-study ones.

However, what this plot does not show us is how often we are in the one-study, two-study and three-study case.

To illustrate the relative frequencies of one-study, two-study and three-study meta-analyses, the code in Figure 6 samples the series in their respective numbers, instead of in equal numbers (which happens in the size = numSim.3series statement in Figure 4, part of creating the data frame). Plotting the total number of sampled *Z*-scores is risky for the single-study *z*^{(1)}-scores, however, since there are so many of them (it can crash RStudio). So before plotting the histogram, a smaller sample (of size = 3*numSim.3series in total) is drawn that keeps the ratios between *z*^{(1)}s, *z*^{(2)}s and *z*^{(3)}s intact.

The histogram in Figure 7 illustrates an unconditional distribution by the raw counts of the *z*^{(t)}-scores: many result from a single study, very few from a two-study series and almost none from a three-study series. In fact, this unconditional sampling distribution is hardly biased, as we will illustrate with our table further below.
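The relative frequencies behind this unconditional view can be approximated with a short Python simulation (a sketch, not the R code of Figure 6). It assumes a one-sided per-study level `ALPHA = 0.025`, which the text does not state, and the stopping rule described above: a series ends at its first nonsignificant or negative study, or after three studies.

```python
import math
import random
from statistics import NormalDist, mean

ALPHA = 0.025                              # assumed one-sided level per study
Z_ALPHA = NormalDist().inv_cdf(1 - ALPHA)

def sample_series(rng, max_t=3):
    """Sample one terminated Gold Rush series of null studies."""
    z_scores = []
    for _ in range(max_t):
        z = rng.gauss(0, 1)
        z_scores.append(z)
        if z < Z_ALPHA:        # nonsignificant or negative: no replication follows
            break
    return z_scores

def simulate(num_sim=200_000, seed=1):
    """Return relative frequencies of series sizes and the overall mean Z-score."""
    rng = random.Random(seed)
    counts = {1: 0, 2: 0, 3: 0}
    zmetas = []
    for _ in range(num_sim):
        z_scores = sample_series(rng)
        counts[len(z_scores)] += 1
        zmetas.append(sum(z_scores) / math.sqrt(len(z_scores)))
    freqs = {t: n / num_sim for t, n in counts.items()}
    return freqs, mean(zmetas)

freqs, overall_mean = simulate()
print(freqs)
print(f"mean of all meta-analysis Z-scores: {overall_mean:.3f}")
```

Under these assumptions almost all series consist of a single study, and the mean over all series sizes stays close to zero, in line with the claim that the unconditional distribution is hardly biased.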

We first introduce an example of an ALL-IN meta-analysis to argue that such an unconditional approach can in fact be very efficient.

**ALL-IN meta-analysis**

Figure 8 shows an example of an ALL-IN meta-analysis. Each of the red/orange/yellow lines represents a study out of the ten separate studies in as many different countries. The blue line indicates the meta-analysis synthesis of the evidence; a live account of the evidence so far in the underlying studies. In fact, *ALL-IN* meta-analysis stands for *Anytime, Live* and *Leading INterim* meta-analysis, in which the *Anytime Live* property assures valid inference under continuous monitoring and the *Leading* property allows the meta-analysis results to inform whether individual studies should be stopped or expanded. It is important to note that such data-driven decisions would invalidate conventional meta-analysis by introducing accumulation bias.

To interpret Figure 8, we observe that initially only the Dutch (NL) study contributes to the meta-analysis and the blue line completely overlaps with the light yellow one. Very quickly, the Australian (AU) study also starts contributing and the blue meta-analysis line captures a synthesis of the evidence in two studies. Later on, the studies in the US, France (FR) and Uruguay (UY) also start contributing and the meta-analysis becomes a three-study, four-study and five-study meta-analysis. How many studies contribute to the analysis, however, does not matter for its evidential value.

Some studies (like the Australian one) are much larger than others, such that under a lucky scenario this study could reach the evidential threshold even before other studies start observing data. This threshold (indicated at 400) controls type-I errors at a rate of *α* = 1/400 = 0.0025 (details in the final section). So in repeated sampling under the null, the combined studies will cross this threshold with a probability smaller than 0.25%. In this repeated sampling the size of the study series is essentially random: we can be lucky and observe very convincing data in the early studies, making more studies superfluous, or we can be unlucky and in need of more studies. The threshold can be reached with a single study, with a two-study meta-analysis, with a three-study meta-analysis, etc., and the repeated sampling properties, like type-I error control, hold on average over all those sampling scenarios (so unconditional on the series size).

ALL-IN meta-analysis allows for meta-analyses with type-I error control, while completely avoiding the effects of accumulation bias and multiple testing. This is possible for two reasons: (1) we do not just perform meta-analyses on study series that have reached a certain size, but continuously monitor study series irrespective of the current number of studies in the series; (2) we use likelihood ratios (and their cousins, e-values (Grünwald et al., 2019)) instead of raw *Z*-scores and *p*-values; we say more on likelihood ratios further below.

**Accumulation bias from ALL-IN meta-analysis vs Gold Rush**

The ALL-IN meta-analysis in Figure 8 illustrates an improved efficiency by not setting the number of studies in advance, but letting it depend on the data and be – just like the data itself – essentially random before the start of the research effort. This introduces dependencies between study results and series size that can be expressed in similar ways as *Gold Rush* accumulation bias. Yet this field of studies might make decisions differently from our *Gold Rush*: a positive nonsignificant result might not terminate the research effort, but encourage extra studies. And instead of always encouraging extra studies, a very convincing series of significant studies might conclude the research effort. If a series of studies is dependent on any such data-driven decisions, the use of conventional statistical methods is inappropriate. These dependencies actually do not have to be extreme at all: Many fields of research might be a bit like the *Gold Rush* scenario in their response to finding significant negative results of harm. A widely known study result that indicated significant harm might make it very unlikely that the series will continue to grow. So large study series will very rarely have a completely symmetric sampling distribution, since initial studies that observe results of significant harm do not grow into large series. Hence this small aspect of accumulation bias will already invalidate conventional meta-analysis, when it assumes such symmetric distributions under the null hypothesis with equal mass on significant effects of harm and benefit.

**Properties averaged over time**

Accumulation bias can already result from simply excluding results of significant harm from replication. This exclusion also takes place under extreme *Gold Rush *accumulation bias, since results of significant harm as well as all nonsignificant results are not replicated. Fortunately, any such scenarios can be handled by taking an unconditional approach to meta-analysis. We will now give an intuition for why this is true in case of our extreme *Gold Rush *scenario: initial studies have bias that balances the bias in larger study series when averaged over series size and analyzed in a certain way.

Table 1 is inspired by Senn (2014) (different question, similar answer) and represents our extreme *Gold Rush* world of study series. It takes the same approach as Figure 7 and indicates the probability to meta-analyze a one-study, two-study or three-study series of each possible form under the null hypothesis. The three-study series are very biased, with two or even three out of three studies showing a positive significant effect. But the P_{0} column shows that the probability of being in this scenario is very small under the null hypothesis, as was also apparent from Figure 7. In fact, most analyses will be of the one-study kind, which hardly have any bias and are even slightly to the left of the theoretical standard null distribution. Exactly this phenomenon balances the biased samples of series of larger size.

A Z-score is marked by a * and colored orange (e.g. z_{1}*) in case the individual study result is significant and positive (z_{1} ≥ z_{α}, one-sided test) and by a ^{−} (e.g. z_{1}^{−}) otherwise. The column *t* indicates the number of studies and the “*” column counts the number of significant studies. The fifth and sixth columns multiply P_{0} with the “*” column and the *t* column to arrive at the expected values E_{0}[*] and E_{0}[t], respectively, in the bottom row.

The bottom row of Table 1 gives the expected value for the number of significant studies per series in the “* × P_{0}” column, and the expected value for the total number of studies per series in the “*t* × P_{0}” column. If we use these expressions to obtain the ratio of the expected number of significant studies to the expected total number of studies, we get the following:

The proportion of expected significant effects to expected series size is still *α *in Table 1 under extreme *Gold Rush *accumulation bias, as it would also be without accumulation bias.
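This ratio can be checked term by term. The derivation below is a sketch that assumes the four rows of Table 1 follow the stopping rule described earlier (a series ends at its first nonsignificant study or after three studies) and that each null study is independently significant and positive with probability α. The rows are then z_{1}^{−} with probability 1 − α, z_{1}* z_{2}^{−} with probability α(1 − α), z_{1}* z_{2}* z_{3}^{−} with probability α²(1 − α), and z_{1}* z_{2}* z_{3}* with probability α³.

```latex
\begin{align*}
E_0[*] &= 0\cdot(1-\alpha) + 1\cdot\alpha(1-\alpha) + 2\cdot\alpha^2(1-\alpha) + 3\cdot\alpha^3
        = \alpha + \alpha^2 + \alpha^3, \\
E_0[t] &= 1\cdot(1-\alpha) + 2\cdot\alpha(1-\alpha) + 3\cdot\alpha^2(1-\alpha) + 3\cdot\alpha^3
        = 1 + \alpha + \alpha^2, \\
\frac{E_0[*]}{E_0[t]} &= \frac{\alpha\,(1 + \alpha + \alpha^2)}{1 + \alpha + \alpha^2} = \alpha .
\end{align*}
```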

This result is driven by the fact that there is a martingale process underlying this table. If a statistic is a martingale process and it has a certain value after *t* studies, the conditional expected value of the statistic after *t* + 1 studies, given all the past data, is equal to the statistic after *t* studies. So if our proportion of significant positive studies is exactly *α* for the first study (*t* = 1), we expect to also observe a proportion *α* if we grow our series with an additional study (*t* = 1 + 1 = 2). Accumulation bias does not affect such statistics when averaged over time if martingales are involved (Doob’s optional stopping theorem for martingales). You can verify this by deleting the last row for z_{1}*, z_{2}*, z_{3}* from our table and adding two rows for *t* = 4 in its place, with z_{1}*, z_{2}*, z_{3}* and either a significant or a nonsignificant fourth study. If you calculate the ratio of expected significant effects to expected series size, you will again arrive at *α*.
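The verification suggested here can also be done numerically. The Python sketch below builds the rows of such a table for any maximum series size; the function names and the row layout `(t, number significant, P0)` are illustrative choices, and the probabilities assume independent null studies that are each significant and positive with probability `alpha`.

```python
def gold_rush_table(alpha, max_t):
    """Rows (t, n_significant, P0) of a Gold Rush table under the null.

    A series stops at its first nonsignificant study, or after max_t studies.
    """
    rows = []
    for t in range(1, max_t + 1):
        # t - 1 significant studies followed by one nonsignificant study
        rows.append((t, t - 1, alpha ** (t - 1) * (1 - alpha)))
    # all max_t studies significant
    rows.append((max_t, max_t, alpha ** max_t))
    return rows

def significant_to_size_ratio(alpha, max_t):
    """Expected number of significant studies divided by expected series size."""
    rows = gold_rush_table(alpha, max_t)
    expected_significant = sum(k * p for _, k, p in rows)
    expected_size = sum(t * p for t, _, p in rows)
    return expected_significant / expected_size

for max_t in (3, 4, 10):
    print(max_t, significant_to_size_ratio(0.025, max_t))
```

With `max_t=3` the rows correspond to the three-study table described above, and `max_t=4` corresponds to the extended table with the two extra rows; the ratio equals `alpha` in each case up to floating-point error.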

Martingale properties drive many approaches to sequential analysis, including the Sequential Probability Ratio Test (SPRT), group-sequential analysis and alpha spending. When applied to meta-analysis, any such inferences essentially average over series size, just like ALL-IN meta-analysis.

**Multiple testing over time**

Just having the expectation of some statistics not affected by stopping rules is not enough to monitor data continuously, as in ALL-IN meta-analysis. We need to account for the multiple testing as well. In that respect, the approaches to sequential analysis differ by either restricting inference to a strict stopping rule (SPRT), or setting a maximum sample size (group-sequential analysis and alpha spending).

ALL-IN meta-analysis takes an approach that is different from its predecessors and is part of an upcoming field of sequential analysis for continuous monitoring with an unlimited horizon. These approaches are called *Safe* for optional stopping and/or continuation (Grünwald et al., 2019) or *any-time valid* (Ramdas et al., 2020). Their methods rely on nonnegative martingales (Ramdas et al., 2020), the most well-known and useful of which is the likelihood ratio. For a meta-analysis *Z*-score, a martingale process of likelihood ratios could look as follows:

The subscript _{10} indicates that the denominator of the likelihood ratio is the likelihood of the *Z*-scores under the null hypothesis of mean zero, and in the numerator is some alternative mean normal likelihood. The likelihood ratio becomes smaller when the data are more likely under the null hypothesis, but the likelihood ratio can never become smaller than 0 (hence the “nonnegative” martingale). This is crucial, because a nonnegative martingale allows us to use Ville’s inequality (Ville, 1939), also called the universal bound by Royall (1997). For likelihood ratios, this means that we can set a threshold that guarantees type-I error control under any accumulation bias process and at any time, as follows:
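The two expressions referred to above can be sketched in LaTeX. The specific form of the likelihood ratio is an assumption: it takes each study *Z*-score to be normal with unit variance, with mean μ₁ under the alternative and mean zero under the null, and multiplies the per-study ratios. Ville's inequality itself is a standard result for nonnegative martingales with expected value 1 under the null.

```latex
% Assumed form: product of per-study normal likelihood ratios
\[
LR_{10}^{(t)} \;=\; \prod_{i=1}^{t} \frac{\phi(z_i - \mu_1)}{\phi(z_i)}
\;=\; \exp\!\Big( \mu_1 \sum_{i=1}^{t} z_i \;-\; \tfrac{1}{2}\, t\, \mu_1^2 \Big),
\qquad \phi = \text{standard normal density}.
\]

% Ville's inequality applied to the likelihood ratio process
\[
P_0\!\Big[\, LR_{10}^{(t)} \ge \tfrac{1}{\alpha} \ \text{for some } t = 1, 2, 3, \ldots \Big] \;\le\; \alpha .
\]
```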

The ALL-IN meta-analysis in Figure 8 in fact is based on likelihood ratios like this, and controls the type-I error by the threshold 400 at level 1*/*400 = 0*.*25%.

The code below illustrates that likelihood ratios can also control type-I error rates under continuous monitoring when extreme *Gold Rush* accumulation bias is at play. Within our previous simulation, we again assume a *Gold Rush* world with only true null studies and very biased two-study and three-study series. The code in Figure 10 calculates likelihood ratios for the growing study series under accumulation bias. Figure 11 illustrates that still very few likelihood ratios ever grow very large.

If we set our type-I error rate *α *to 5%, and compare our likelihood ratios to 1*/α *= 20 we observe that less than 1*/*20 = 5% of the study series *ever* achieves a value of LR_{10} larger than 20 (Figure 12). The simulated type-I error is even much smaller than 5% since in our *Gold Rush *world series stop growing at three studies, yet this procedure controls type-I error also in the case none of these series stops growing at three studies, but all continue to grow forever.
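A rough Python version of this check is sketched below; it is not the R code from the figures. The likelihood ratio is assumed to compare a unit-variance normal with mean 1 (the value the text mentions for calcLR) against a standard normal, and the per-study one-sided significance level `ALPHA_STUDY = 0.025` that drives the *Gold Rush* continuation rule is an assumption as well.

```python
import math
import random
from statistics import NormalDist

ALPHA_STUDY = 0.025                              # assumed per-study one-sided level
Z_ALPHA = NormalDist().inv_cdf(1 - ALPHA_STUDY)
MU_ALT = 1.0                                     # alternative mean, as in the text

def calc_lr(z_scores, mu_alt=MU_ALT):
    """Likelihood ratio of N(mu_alt, 1) versus N(0, 1) for a series of Z-scores."""
    t = len(z_scores)
    return math.exp(mu_alt * sum(z_scores) - t * mu_alt ** 2 / 2)

def max_lr_gold_rush(rng, max_t=3):
    """Largest likelihood ratio seen while one null series grows."""
    z_scores, max_lr = [], 0.0
    for _ in range(max_t):
        z = rng.gauss(0, 1)
        z_scores.append(z)
        max_lr = max(max_lr, calc_lr(z_scores))
        if z < Z_ALPHA:          # Gold Rush: only significant studies are replicated
            break
    return max_lr

def simulate(num_sim=200_000, threshold=20.0, seed=1):
    """Proportion of null series whose likelihood ratio ever reaches the threshold."""
    rng = random.Random(seed)
    hits = sum(max_lr_gold_rush(rng) >= threshold for _ in range(num_sim))
    return hits / num_sim

print(f"proportion of series with LR >= 20 at any time: {simulate():.4f}")
```

Under these assumptions the printed proportion should fall well below 5%, which is the conservatism discussed in the next paragraph.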

The type-I error control is thus conservative, and we pay a small price in terms of power. That price is quite manageable, however, and can be tuned by setting the mean value of the alternative likelihood (arbitrarily set to mean = 1 in the code for calcLR of Figure 10). More on that in Grünwald et al. (2019) and the forthcoming preprint paper on ALL-IN meta-analysis that will appear on **https://projects.cwi.nl/safestats/**.

It is this small conservatism in controlling type-I error that allows for full flexibility: There isn’t a single accumulation bias process that could invalidate the inference. Any data-driven decision is allowed. And data-driven decisions can increase the value of new studies and reduce research waste.

**Conclusion**

In our imaginary world of extreme *Gold Rush* accumulation bias, the sampling distribution of the meta-analysis *Z*-score behaves very differently from the sampling distribution assumed to calculate p-values and confidence intervals. A meta-analysis p-value conditions on the available sample size – on the sample size of the studies and on the number of studies available – and represents the tail area of this conditional sampling distribution under the null based on the observed *Z*-statistic. Analogously, a meta-analysis confidence interval provides coverage under repeated sampling from this conditional distribution. So if this sample size is driven by the data, as in any accumulation bias process, there is a mismatch between the assumed sampling distribution of the meta-analysis *Z*-statistic and the actual sampling distribution.

We believe that some accumulation bias is at play in almost any retrospective meta-analysis, such that p-values and confidence intervals generally do not have their promised type-I error control and coverage. ALL-IN meta-analysis based on likelihood ratios can handle accumulation bias, even if the exact process is unknown. It also allows for continuous monitoring; multiple testing is no problem. Hence taking the ALL-IN perspective on meta-analysis will reduce research waste by allowing efficient data-driven decisions – not letting them invalidate the inference – and incorporating single studies and small study series into meta-analysis inference.

**Postscript**

ALL-IN meta-analysis has been applied during the corona pandemic to analyze an accumulating series of studies while they were still ongoing. Each study investigated the ability of the BCG vaccine to prevent covid-19, but data on covid cases came in only slowly (fortunately). Meta-analyzing interim results and data-driven decisions improved the possibility of finding efficacy earlier in the pandemic. A webinar on the methodology underlying this meta-analysis – the specific likelihood ratios – is available on **https://projects.cwi.nl/safestats/** under the name ALL-IN-META-BCG-CORONA.

*Judith ter Schure is a PhD student in the Department of Machine Learning at Centrum Wiskunde & Informatica in the Netherlands. She can be contacted at Judith.ter.Schure@cwi.nl.*

**Acknowledgements**

My thanks go to Professor Bob Reed for inviting this contribution to his website and his patience with its publication. I also want to acknowledge Professor Peter Grünwald for checking the details. Daniel Lakens provided me with great advice to write this text more blog-like. Muriel Pérez helped me with the details of the martingale underlying the table.

**References**

Iain Chalmers and Paul Glasziou. Avoidable waste in the production and reporting of research evidence. *The Lancet*, 374(9683):86–89, 2009.

Iain Chalmers, Michael B Bracken, Ben Djulbegovic, Silvio Garattini, Jonathan Grant, A Metin Gülmezoglu, David W Howells, John PA Ioannidis, and Sandy Oliver. How to increase value and reduce waste when research priorities are set. *The Lancet*, 383(9912):156–165, 2014.

Hans Lund, Klara Brunnhuber, Carsten Juhl, Karen Robinson, Marlies Leenaars, Bertil F Dorch, Gro Jamtvedt, Monica W Nortvedt, Robin Christensen, and Iain Chalmers. Towards evidence based research. *BMJ*, 355:i5440, 2016.

Judith ter Schure and Peter Grünwald. Accumulation Bias in meta-analysis: the need to consider time in error control [version 1; peer review: 2 approved]. *F1000Research*, 8:962, June 2019. ISSN 2046-1402. doi: 10.12688/f1000research.19375.1. URL https://f1000research.com/articles/8-962/v1.

Steven P Ellis and Jonathan W Stewart. Temporal dependence and bias in meta-analysis. *Communications in Statistics—Theory and Methods*, 38(15):2453–2462, 2009.

Elena Kulinskaya, Richard Huggins, and Samson Henry Dogo. Sequential biases in accumulating evidence. *Research synthesis methods*, 7(3):294–305, 2016.

Peter Grünwald, Rianne de Heide, and Wouter Koolen. Safe testing. *arXiv preprint arXiv:1906.07801*, 2019.

Stephen Senn. A note regarding meta-analysis of sequential trials with stopping for efficacy. *Pharmaceutical Statistics*, 13(6):371–375, 2014.

Aaditya Ramdas, Johannes Ruf, Martin Larsson, and Wouter Koolen. Admissible anytime-valid sequential inference must rely on nonnegative martingales. *arXiv preprint arXiv:2009.03167*, 2020.

Jean Ville. Etude critique de la notion de collectif. *Bull. Amer. Math. Soc*, 45(11):824, 1939.

Richard Royall. *Statistical evidence: a likelihood paradigm*, volume 71. CRC press, 1997.

Judith ter Schure, Alexander Ly, Muriel F. Pérez-Ortiz, and Peter Grünwald. Safestats and all-in meta-analysis project page. https://projects.cwi.nl/safestats/, 2020.

This blog post discusses approaches to meta-analysis that control type-I error averaged over study series size. This is called error control *surviving over time* in Ter Schure and Grünwald (2019), as will become clearer in the technical details.

You can find a link to these four pages of technical details **here**. A link to the file of R code used in this blog can be found

The network previously organised a roundtable discussion entitled “Reproducibility in Experimental Economics: Crisis or Opportunity?” (see **this blog** for a nice summary) as part of their annual event in Osnabrueck, Germany. The discussion acknowledged that the principle of reproducibility is a central tenet of experimental research to inform policy-relevant decisions in the environmental and agricultural spheres. However, recent studies (e.g. Camerer et al., 2018) cast doubt on the replicability of social science experiments, going as far as to say that the social sciences may be experiencing a ‘replication crisis’.

This is the backdrop to the webinar REECAP is offering on October 19, 2020, from 10:30 to 12:30 CET; details of the programme can be found **here**. The webinar aims to discuss replications in agricultural economics, and it is of interest to researchers who wish to learn about the ‘replication crisis’, where it comes from and what has been done to tackle it so far. This event will also provide a forum for those interested in exploring participation in REECAP’s replication project, which aims to coordinate the replication of experiments relevant for shaping agricultural policies.

These replications may then be submitted to a Special Issue in **Applied Economic Perspectives and Policy** – a call is forthcoming. However, this webinar is not just targeting researchers but also practitioners who may have an interest in research methodologies and who are curious to learn more about how best to improve the robustness of research findings, especially when these findings are used to inform policy-relevant decisions in the context of agriculture.

We looked at empirical environmental economics papers published between 2015 and 2018 in four top journals: The American Economic Review (AER), Environmental and Resource Economics (ERE), The Journal of the Association of Environmental and Resource Economics (JAERE), and The Journal of Environmental Economics and Management (JEEM). From 307 publications, we collected more than 21,000 test statistics to construct our dataset. We reported four key findings:

**1. Underpowered Study Designs and Exaggerated Effect Sizes**

As has been observed in other fields, the empirical designs used by environmental and resource economists are statistically underpowered, which implies that the magnitude and sign of the effects reported in their publications are unreliable. The conventional target for adequate statistical power in many fields of science is 80%. We estimated that, in environmental and resource economics, the median power of study designs is 33%, with power less than 80% for nearly two out of every three estimated parameters. When studies are underpowered and when scientific journals are more likely to publish results that pass conventional tests of statistical significance – tests that can only be passed in underpowered designs when the estimated effect is much larger than the true effect size – these journals will tend to publish exaggerated effect sizes. *We estimated that 56% of the reported effect sizes in the environmental and resource economics literature are exaggerated by a factor of two or more; 35% are exaggerated by a factor of four or more.*
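The link between low power and exaggerated estimates can be illustrated with a small Python sketch. It is a generic illustration and not the estimation method used in the paper: it assumes a normally distributed estimate with known standard error, a two-sided test at the 5% level, and hypothetical values for the true effect expressed in standard-error units.

```python
import random
from statistics import NormalDist

def power_two_sided(effect_over_se, alpha=0.05):
    """Power of a two-sided z-test when the true effect is effect_over_se standard errors."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return (1 - nd.cdf(z_crit - effect_over_se)) + nd.cdf(-z_crit - effect_over_se)

def exaggeration_ratio(effect_over_se, alpha=0.05, num_sim=100_000, seed=1):
    """Average size of statistically significant estimates relative to the true effect."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(num_sim):
        estimate = rng.gauss(effect_over_se, 1)   # estimate in standard-error units
        if abs(estimate) >= z_crit:               # only significant results are kept
            significant.append(abs(estimate))
    return (sum(significant) / len(significant)) / effect_over_se

# Hypothetical designs: true effects of 2.8 and 1.52 standard errors.
for effect in (2.8, 1.52):
    print(f"effect/SE={effect}: power={power_two_sided(effect):.2f}, "
          f"exaggeration={exaggeration_ratio(effect):.2f}")
```

In this sketch a true effect of about 2.8 standard errors gives roughly 80% power, while about 1.52 standard errors gives roughly 33% power; the significant estimates overstate the true effect more strongly in the lower-powered design.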

**2. Selective Reporting of Statistical Significance or “p-hacking”**

Researchers face strong professional incentives to report statistically significant results, which may lead them to selectively report results from their analyses. One indicator of selective reporting is an unusual pattern in the distribution of test statistics; specifically, a double-humped distribution around conventionally accepted values of statistical significance. In the figure below, we present the distribution of test statistics for the estimates in our sample, where 1.96 is the conventional value for statistical significance (p<0.05). **The unusual dip just before 1.96 is consistent with selective reporting of results** that are above the conventionally accepted level of statistical significance.

**3. Multiple Comparisons and False Discoveries**

Repeatedly testing the same data set in multiple ways increases the probability of making false (spurious) discoveries, a statistical issue that is often called the “multiple comparisons problem.” To mitigate the probability of false discoveries when testing more than one related hypothesis, researchers can adopt a range of approaches. For example, they can ensure the false discovery rate is no larger than a pre-specified level. These approaches, however, are rare in the environmental and resource economics literature: *63% of the studies in our sample conducted multiple hypothesis tests, but less than 2% of them used an accepted approach to mitigate the multiple comparisons problem.*
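One widely used way to keep the false discovery rate below a pre-specified level is the Benjamini-Hochberg step-up procedure. The Python sketch below is a minimal illustration under the usual assumption of independent (or positively dependent) tests; the text does not say which procedures the surveyed studies used, and the p-values in the example are made up.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the corresponding hypothesis
    is rejected while controlling the false discovery rate at level q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value is at most (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from ten related hypothesis tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, q=0.05))
```

Note that five of these made-up p-values are below 0.05, yet only the two smallest survive the correction, which is the kind of difference the multiple comparisons problem is about.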

**4. Questionable Research Practices (QRPs)**

To better understand empirical research practices in the field of environmental and resource economics, we also conducted a survey of members of the Association of Environmental and Resource Economists (AERE) and the European Association of Environmental and Resource Economists (EAERE). In the survey, we asked respondents to self-report whether they had engaged in research practices that other scholars have labeled “questionable”. These QRPs include selectively reporting only a subset of dependent variables or analyses conducted, hypothesizing after results are known (also called HaRKing), choosing regressors or re-categorizing data after looking at the results, etc. Although one might assume that respondents would be unlikely to self-report engaging in such practices, *92% admitted to engaging in at least one QRP.*

**Recommendations for Averting a Replication Crisis**

To help improve the credibility of the environmental and resource economics literature, we recommended changes to the current incentive structures for researchers.

– Editors, funders, and peer reviewers should emphasize the designs and research questions more than results, abolish conventional statistical significance cut-offs, and encourage the reporting of statistical power for different effect sizes.

– Authors should distinguish between exploratory and confirmatory analyses, and reviewers should avoid punishing authors for exploratory analyses that yield hypotheses that cannot be tested with the available data.

– Authors should be required to be transparent by uploading to publicly-accessible, online repositories the datasets and code files that reproduce the manuscript’s results, as well as results that may have been generated but not reported in the manuscript because of space constraints or other reasons. Authors should be encouraged to report everything, and reviewers should avoid punishing them for transparency.

– To ensure their discipline is self-correcting, environmental and resource economists should foster a culture of open, constructive criticism and commentary. For example, journals should encourage the publication of comments on recent papers. In a flagship field journal, *JAERE*, we could find no published comments in the last five years.

– Journals should encourage and reward pre-registration of hypotheses and methodology, not just for experiments, but also for observational studies for which pre-registrations are rare. We acknowledge in our article that pre-registration is no panacea for eliminating QRPs, but we also note that, in other fields, it has been shown to greatly reduce the frequency of large, statistically significant effect estimates in the “predicted” direction.

– Journals should also encourage and reward replications of influential, innovative, or controversial empirical studies. To incentivize such replications, we recommend that editors agree to review a replication proposal as a pre-registered report and, if satisfactory, agree to publish the final article regardless of whether it confirms, qualifies, or contradicts the original study.

Ultimately, however, we will continue to rely on researchers to self-monitor their decisions concerning data preparation, analysis, and reporting. To make that self-monitoring more effective, greater awareness of good and bad research practices is critical. We hope that our publication contributes to that greater awareness.

*Paul J. Ferraro is the Bloomberg Distinguished Professor of Human Behavior and Public Policy at Johns Hopkins University. Pallavi Shukla is a Postdoctoral Research Fellow at the Department of Environmental Health and Engineering at Johns Hopkins University. Correspondence regarding this blog can be sent to Dr. Shukla at pshukla4@jhu.edu.*