With Little Power Comes Great Responsibility
Abstract
Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community.
Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings.
By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature.
In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered.
Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied.
For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point.
To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
1 Introduction
Despite its importance to empirical evaluation, relatively little attention has been paid to statistical power in NLP. In particular, if it is the case that typical experiments in NLP are underpowered, not only would we expect many meaningful improvements to go undetected, we would also expect many apparently significant differences to be exaggerated (Gelman and Carlin, 2014). In this paper, we build on past work calling for greater rigor in evaluation (McCoy et al., 2019; Azer et al., 2020), including the need for careful hypothesis testing (Koehn, 2004; Berg-Kirkpatrick et al., 2012; Søgaard et al., 2014; Dror et al., 2018), and show why and how power matters to NLP, addressing challenges unique to this domain.
Roughly speaking, power is the probability that a statistical test will successfully detect a true effect. As an illustrative example, imagine comparing two dialog systems (see Figure 1). We want to know if people tend to prefer one system over the other. To test this, we will need multiple people to evaluate the systems. But how many? Once we have collected data, a statistical test will tell us if we can reject the null hypothesis that the systems are equally good. Assuming the systems are not identical, statistical power is the probability that the experiment will return a significant result (or equivalently, it is one minus the probability of failing to detect the difference as significant). Although we don’t know the magnitude of this difference, power analysis helps to estimate how much power an experiment will have under various assumptions.
Power depends on multiple factors, including the statistical test used, the significance threshold, true effect size, variance, and sample size. All else being equal, experiments with larger samples will have greater power than smaller samples, as shown in Figure 1.
Similarly, larger effects and those with less variance are easier to detect, and therefore require fewer samples for equivalent power.
Importantly, note that if we do find a significant difference, this does not imply that the experiment had high power.
Proceeding with a test that is underpowered (i.e., too few subjects or items; often taken to mean less than 80% power; Cohen, 1962) means that one is less likely to be able to draw any useful statistical conclusion from the experiment, and has contributed, in part, to the replication crisis in other fields (Button et al., 2013; Szucs and Ioannidis, 2017; Ioannidis et al., 2017). Routinely running experiments with low statistical power undermines the scientific enterprise. Not only will true effects go undetected; when significant effects are found, they are likely to be noisier and have lower positive predictive value (Button et al., 2013).
Moreover, significant findings from underpowered experiments are more likely to exaggerate or reverse the true effect – so-called Type-M (magnitude) and Type-S (sign) errors, respectively (Gelman and Carlin, 2014). This problem can lead to systematic distortions in the literature if only significant findings are published, especially if these results are based on underpowered experiments (Scargle, 1999). The effect of Type-M error can be seen in Figure 1; significant differences are less likely to be found in smaller samples (right), but among those tests that are significant, the observed difference will tend to exaggerate the true difference (left) by more than in a larger sample (middle). For further discussion of Type-M and Type-S errors, please refer to Appendix B.
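To make the Type-M and Type-S discussion concrete, the following is a minimal simulation sketch, not the procedure used in this paper. It assumes (our assumption, for illustration only) that the estimated effect is normally distributed around the true effect with a known standard error, and that a result is significant when the estimate exceeds 1.96 standard errors in absolute value. The function name `type_m_type_s` and its parameters are hypothetical.

```python
import random


def type_m_type_s(true_effect, std_error, n_sims=20000, z_crit=1.96, seed=0):
    """Simulate repeated runs of one experiment and summarize the significant ones.

    Assumptions (illustrative, not from the paper): the estimate is drawn from
    Normal(true_effect, std_error), true_effect is nonzero, and a result is
    significant when |estimate| > z_crit * std_error.
    Returns (power, Type-S rate, exaggeration ratio).
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > z_crit * std_error:
            significant.append(estimate)

    power = len(significant) / n_sims
    if not significant:
        return power, float("nan"), float("nan")

    # Type-S: share of significant estimates whose sign is wrong.
    wrong_sign = sum(1 for e in significant if e * true_effect < 0)
    type_s_rate = wrong_sign / len(significant)
    # Type-M: how much significant estimates overstate the true magnitude.
    mean_abs = sum(abs(e) for e in significant) / len(significant)
    exaggeration = mean_abs / abs(true_effect)
    return power, type_s_rate, exaggeration


# Same true effect, measured with a noisy design and a precise design.
low_power = type_m_type_s(true_effect=0.1, std_error=0.1)
high_power = type_m_type_s(true_effect=0.1, std_error=0.025)
print("noisy design   (power, type S, exaggeration):", low_power)
print("precise design (power, type S, exaggeration):", high_power)
```

Under these assumptions, the noisy design rarely reaches significance, and when it does, the significant estimates overstate the true effect by a wide margin, whereas the precise design yields significant estimates close to the true value.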
Here, we investigate how these issues affect NLP. Although retrospective analysis of power involves challenges, we present evidence that underpowered experiments are widespread in NLP research. Among human evaluations, we find most experimental designs involve too few items and/or raters to detect small effects (§5). For comparing models in terms of accuracy, we find that some widely used benchmark datasets, including MRPC and SST-2, are now too small to be able to properly measure future progress against top-performing models (§3). We also introduce a novel approach to power analysis for machine translation and characterize power in experiments testing for differences in BLEU (§4). Finally, a survey of recent papers reveals a general lack of statistical evaluation and a dearth of detailed reporting (§5.1).
To improve future practice, we suggest broader adoption of power analyses prior to evaluation, provide guidance on running power analyses in NLP, and release a series of notebooks for this purpose.
2 Power Analysis for NLP
Because most NLP tasks do not take the form of standard experiments in other sciences (Kraemer and Blasey, 2015; Westfall et al., 2014), it is nontrivial to run power analyses for many tasks of interest. While we cannot cover every scenario, we present here a generalizable, simulation-based approach to power analysis, along with three sample applications, which can be extended as necessary. Such an approach is modular, reusable, and transparent, and encourages planning of analyses in advance of data collection.
Every power analysis requires assumptions, and there is not likely to be a single correct approach. Rather, the point is to make one’s assumptions explicit, and include enough detail so as to account for whatever is likely to be observed. By using reasonable assumptions, one can help to ensure that one’s experiment is sufficiently well-powered. In the case of NLP, this means that one recruits enough subjects, collects enough ratings, or uses a large enough test set.
The general procedure we suggest for power analysis is described in detail in Figure 2. At a high level, the idea is to estimate power by running simulations. Recall that power is the probability of detecting a true effect, conditional on the experimental setting (effect size, variance, etc.) and significance threshold. Thus, if one can translate these assumptions into a process for generating simulated data, we can estimate power by generating many simulated datasets using assumed or estimated parameter values, running each sample through a significance test, and reporting the proportion that are found to be significant.
The key to generalizing this approach is to begin with the end in mind. In particular, if one plans to test for a difference between models, one needs to choose the statistical test that will be used. That test will determine the level of detail required in the generative process for simulating data.
To return to the opening example of evaluating dialog systems, we want to test if people prefer one system over the other (Ai et al., 2007). If we ignore the nuances of human preference for now (but see §5 for a more nuanced approach), and simply assume that each person either prefers system A or system B, the only assumption we need to make for a power analysis in this setting is the proportion of people in the population who prefer system B. We can then simulate samples of people (each of whom independently has the same probability of preferring system B) as a draw from a binomial distribution, and repeat this thousands of times.
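The simplified dialog-system example can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions rather than the released notebooks: each simulated person independently prefers system B with probability `p_prefer_b`, and significance is assessed with an exact two-sided binomial test against a null of 0.5. The function names and the choice of the exact binomial test are ours.

```python
import math
import random


def two_sided_p_values(n, p_null=0.5):
    """Exact two-sided binomial p-value for every possible count 0..n."""
    pmf = [
        math.exp(
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p_null) + (n - k) * math.log(1 - p_null)
        )
        for k in range(n + 1)
    ]
    # p-value: total probability of outcomes no more likely than the observed
    # count (with a small relative tolerance for floating-point ties).
    return [
        min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-7)))
        for k in range(n + 1)
    ]


def simulate_power(n_people, p_prefer_b, alpha=0.05, n_sims=2000, seed=0):
    """Estimate power as the share of simulated experiments that are significant."""
    rng = random.Random(seed)
    p_values = two_sided_p_values(n_people)
    n_significant = 0
    for _ in range(n_sims):
        # One simulated experiment: count how many people prefer system B.
        prefers_b = sum(rng.random() < p_prefer_b for _ in range(n_people))
        if p_values[prefers_b] <= alpha:
            n_significant += 1
    return n_significant / n_sims


# Power for a range of sample sizes, assuming 60% of people prefer system B.
for n_people in (50, 100, 200, 400):
    print(n_people, simulate_power(n_people, p_prefer_b=0.6, n_sims=500))
```

Repeating the call over a grid of values for `p_prefer_b` gives power for a range of plausible assumptions, rather than a single guess at the true proportion.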
The most difficult part of power analyses is estimating the relevant quantities, such as the true proportion of people that prefer system B. Note, however, that one can always compute what power would be for a range of possible values, and indeed, this is the recommended procedure. For estimating the relevant parameters within an NLP context, we will primarily rely on data from the literature, measurements on validation data, and estimates from external datasets (see §3.2). However, where appropriate, pilot studies may also be informative.
In the remainder of this paper, we consider three scenarios of interest in depth, and assess the state of power in the NLP literature for each.
3 Comparing Models on Accuracy
It is common in NLP research to look for models which improve over state of the art (SOTA) on various benchmarks. However, an important but rarely asked question is, can these benchmarks support the kinds of comparisons we want to make? Many have emphasized the need for proper significance testing to avoid spurious findings, but if an experiment’s test set is small, the minimum detectable effect (MDE) size may be large: only large improvements will yield sufficiently powered comparisons (i.e., at least 80% power). If an experiment is badly underpowered, it cannot provide useful evidence that one model achieves slightly better performance than another for the underlying data distribution. Reliance on such evidence risks leading to overconfidence about the relative ranking of various models. As we show in §3.3, there is legitimate reason to be concerned about this in the case of certain widely used benchmarks.
3.1 Significance test for comparing classifiers
The standard statistical test for comparing classifiers on paired data is McNemar’s test (Dietterich, 1998; Dror et al., 2018), which uses the numbers of items where the models disagree (i.e., the off-diagonal elements in Table 1).
Table 1: Contingency table of paired outcomes for two models, M1 and M2.
              M1 correct       M1 incorrect
M2 correct    both correct     only M2 correct
M2 incorrect  only M1 correct  both incorrect
Thus, for McNemar’s test, the relevant data generating process for simulations can be specified in terms of the expected difference in accuracy between the models, $\Delta_{acc}$, and $P_a$, the expected proportion of examples for which the models will have the same outcome (i.e., both correct or both incorrect). From these we can compute the expected proportions of examples on which only one model is correct (i.e., the off-diagonals in Table 1), and estimate power via the algorithm in Figure 2.
Figure 3 illustrates how power increases with increased sample size, effect size, and agreement rate.
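As a rough sketch of this simulation (not the released notebooks), the code below derives the two off-diagonal proportions from an assumed accuracy difference and agreement rate, draws simulated test sets, and applies an exact (binomial) form of McNemar’s test to the discordant counts. The function names, the argument names `delta_acc` and `p_agree`, and the choice of the exact rather than the chi-squared variant of the test are our assumptions.

```python
import math
import random


def offdiagonal_probs(delta_acc, p_agree):
    """Expected share of examples where only one model is correct.

    The models disagree on a share 1 - p_agree of examples, and the new model
    is correct on delta_acc more of them than the baseline, so
    only_new - only_base = delta_acc and only_new + only_base = 1 - p_agree.
    """
    p_only_base = (1.0 - p_agree - delta_acc) / 2.0
    p_only_new = (1.0 - p_agree + delta_acc) / 2.0
    if p_only_base < 0 or p_only_new < 0:
        raise ValueError("delta_acc is too large for this agreement rate")
    return p_only_base, p_only_new


def mcnemar_exact_p(n_only_base, n_only_new):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = n_only_base + n_only_new
    if n == 0:
        return 1.0
    k = min(n_only_base, n_only_new)
    # Under the null, each discordant example favors either model with prob 0.5.
    tail = sum(
        math.exp(
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + n * math.log(0.5)
        )
        for i in range(k + 1)
    )
    return min(1.0, 2.0 * tail)


def mcnemar_power(n_test, delta_acc, p_agree, alpha=0.05, n_sims=500, seed=0):
    """Share of simulated test sets on which the difference is significant."""
    rng = random.Random(seed)
    p_only_base, p_only_new = offdiagonal_probs(delta_acc, p_agree)
    n_significant = 0
    for _ in range(n_sims):
        only_base = only_new = 0
        for _ in range(n_test):
            u = rng.random()
            if u < p_only_base:
                only_base += 1
            elif u < p_only_base + p_only_new:
                only_new += 1
        if mcnemar_exact_p(only_base, only_new) <= alpha:
            n_significant += 1
    return n_significant / n_sims


# Illustrative values only: a 1-point improvement with 95% agreement.
print(mcnemar_power(n_test=1000, delta_acc=0.01, p_agree=0.95, n_sims=200))
```

Sweeping `delta_acc` for a fixed test set size and agreement rate, and reading off the smallest value at which the estimate reaches 0.8, gives an approximate minimum detectable effect.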
3.2 Estimating parameters
In order to estimate the required parameters (the accuracy difference $\Delta_{acc}$ and the agreement rate $P_a$), we consider three options: (1) use results on validation (dev) data; (2) fit a regression based on historical data; (3) use middle-of-the-road assumptions when lacking other information. Using these methods, we can then estimate power or calculate the smallest effect that can be detected with 80% power at $\alpha = 0.05$ (or other thresholds). Both to illustrate this process, and to provide guidance for future work, we demonstrate these approaches below using data from two widely used datasets for evaluating NLP models: SQuAD 2.0 (Rajpurkar et al., 2016, 2018) and the GLUE benchmark (Wang et al., 2018).
Using validation results: To the extent that we expect performance on test data to match performance on validation data (i.e., in the absence of domain shift), paired performance on validation data (i.e., difference in accuracy and agreement rate) provides one method for estimating power when comparing against a baseline model.
To illustrate this, from the authors of SQuAD 2.0, we obtain the pairwise agreement rates between all models submitted to the leaderboard on both validation and test data. We find a very strong correlation between validation and test for both pairwise accuracy differences ($\Delta_{acc}$) and agreement rates ($P_a$), as shown in Figure 9 in Appendix D (results on validation data are included in the accompanying online materials). This suggests we can use paired predictions on validation data for power calculations when we have access to the predictions from both models. Note that this approach assumes that the dev and test data have been drawn from the same distribution, and that dev performance has not been artificially inflated (such as by training on validation data directly).
Using historical data: When one does not have access to the baseline model or an informative prior, one can make use of historical trends. That is, we can try to estimate what a typical improvement will look like, given the current state of the art (SOTA). To illustrate this approach, we collect reported results for both SQuAD 2.0 and GLUE, and fit regressions to estimate the accuracy difference ($\Delta_{acc}$) and the agreement rate ($P_a$). Given these parameters, we can assess the likely power and MDE for a typical model improvement against a given baseline accuracy level.
To fit a regression to predict typical improvements to SOTA, we gather data from GLUE papers and manually label 119 accuracy comparisons and 57 claims of improvement (as denoted by bolding of a result and a claim of SOTA in text) across 14 papers (selected as being at or above the BERT score on the GLUE leaderboard with an accompanying paper). Regressing the improvement in accuracy on baseline accuracy and task does not give a perfect fit, but still provides a prior on likely effect size. Similarly, we fit a regression to SOTA improvements on the SQuAD 2.0 leaderboard (selected as being a significant improvement in time-ordered submissions). See Appendix E.2.1 for more details.
To assess power for McNemar’s test, we must also fit a regression predicting the expected overlap between the models (the agreement rate, $P_a$). To fit such a regression, we obtain from GLUE authors the test set predictions on all tasks from a set of 10 high-performing models, which allows us to measure the extent to which their predictions overlap with each other. Using GLUE tasks which measure accuracy, we regress the agreement rate on baseline accuracy and the difference in accuracy.
Typical improvements on popular tasks tend to be small (see mean improvements in Table 2). Except for rare transformative work, such as BERT (Devlin et al., 2019), it is generally difficult to do much better than a previous SOTA and thus improvements are likely to follow a trend, which is why we are able to use historical data as a guide. In cases where such data is not available or cannot be trusted, other methods are necessary.
No prior: If no informative prior is available and the baseline model can’t be used for comparison on a validation set, then we must fall back on middle-of-the-road assumptions. Lachenbruch (1992) provides a suggested default prior, and we find that MDEs using this method are very similar to those found by using the regression-based approach. Appendix E.3 provides more details, and Table 9 in the appendix presents the comparison.
3.3 Assessing power in the literature
Using the regression-based approach of estimating the accuracy difference ($\Delta_{acc}$) and agreement rate ($P_a$) described above, we estimate the MDE for each individual accuracy-based GLUE task in comparison to current SOTA, and report the average effect size of results which claimed improvements. Table 2 summarizes these results, showing for each dataset the size of the test set, the accuracy of the best performing model on each task at the time of writing, the estimated MDE to have 80% power using our regression to predict overlap ($P_a$), and the average reported difference from their respective baselines.
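Table 2: Test set size, SOTA accuracy, estimated MDE for 80% power, and average reported improvement over the respective baselines.
Dataset    Size     SOTA (%)  Est. MDE (%)  Avg. reported improvement (%)
WNLI       147      94.5      +5.26         +1.72
MRPC       1,725    92.0      +1.62         +0.63
SST-2      1,821    97.2      +1.02         +0.57
RTE        3,000    91.7      +1.23         +3.89
QNLI       5,463    97.5      +0.55         +1.31
MNLI-m     9,796    91.6      +0.67         +0.97
MNLI-mm    9,847    91.3      +0.68         +1.29
QQP        390,965  91.0      +0.11         +0.36
SQuAD 2.0  8,862    90.7      +0.56         +2.23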
As can be seen in Table 2, the mean reported effect size (the average reported improvement in accuracy) is well below the estimated MDE for the three smallest test sets – WNLI, MRPC, and SST-2. Because this mean is based on models comparing to even weaker baselines, we would expect most future improvements to be even smaller. Thus, most future experiments involving these three datasets will not have adequate power to test for improvements over the current SOTA in the way that they are routinely used. Moreover, alternative analyses give even more pessimistic estimates of likely improvements relative to MDE, as described in Appendix E.4. If an experiment does show significant improvement on a dataset such as MRPC, the potential for Type-M error should make us skeptical that this improvement will generalize to new data from the same domain.
While the above results are informative about future experiments, we would also ideally like to know about the power of past experiments. Most of the papers from which we collected results did not report a significance test on the test set. Here we estimate the expected power and predicted result of such a test using leave-one-out regressions, where we make a prediction for each reported improvement using all other reported model comparisons. This procedure reveals that only 46% would have predicted adequate power (using estimates for expected improvement and agreement), and approximately 51% would have been significant (based on estimated agreement and reported improvement). Approximately 80% of experiments with at least 80% power would also have been found to be significant (37% of all comparisons).
In part because performance on many of these tasks is now so good, a large expected improvement is required in order for a new experiment to have 80% power, suggesting that larger test set sizes may be necessary to continue making well-powered claims of SOTA improvement on individual tasks.
For any comparisons which are likely to be underpowered, we should refrain from placing much emphasis on obtaining small improvements over the previously reported best model. In extreme cases, such as MRPC and SST-2, it is worth considering whether it is time to retire these datasets as the basis for model comparison.
4 Machine Translation
To show how our approach to power analysis can be applied to a more difficult setting, we consider automated evaluation of machine translation using BLEU scores (Papineni et al., 2002). As with accuracy, we would like to know what scale of improvements can be detected with reasonable power on typical test sets. This setting is more complicated because (1) BLEU is a corpus-level metric, rather than being averaged across instances, and (2) typical models are trained on vast amounts of parallel data, with little data available that has not been used in training, making it difficult to estimate variation in performance.
Significance testing for BLEU: To test for a significant difference between two MT models we use the randomization test, as recommended in Dror et al. (2018): given the paired output translations from both models, swap the outputs for a random subset of test examples and compute the resulting difference in BLEU. Repeating this thousands of times gives us a null distribution, which can be used to test the observed difference between models.
Generative process for simulations: If large amounts of untouched evaluation data were available, we could approach power analysis by simply evaluating BLEU score on many random subsets of sentences, and computing the mean and variance of each system. Unfortunately, because MT depends on parallel text (most of which is used in training), evaluation data tends to be scarce. Instead, we introduce a generative process that can produce the necessary inputs for power analysis.
For intuition, note that if we swap the $i$-th pair of model outputs (as is done in the randomization test), leaving the rest as they are, we change the difference in BLEU between models by a specific amount, $\delta_i$, which we call the effect of making that swap. While these individual effects are not independent of each other due to the corpus-level nature of the metric, in practice, the sum of individual effects closely approximates the net effect of swapping entire subsets (see Figure 15 in Appendix G).
Based on analyzing several models and datasets, we find the typical distribution of these individual effects can be approximated using a mixture of a Delta distribution at zero, and a Laplace distribution (see Appendix G for details).
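Concretely, if we assume $\Delta_B$ is the expected difference in BLEU between two models on a dataset of $n$ examples, and $P_0$ is the expected proportion of examples for which $\delta_i = 0$, we can simulate a dataset $\{\delta_i\}_{i=1}^{n}$ of individual effects using the following process: with probability $P_0$, $\delta_i = 0$. With probability $1 - P_0$, $\delta_i \sim \mathrm{Laplace}(\mu, b)$, where $\mu = \frac{-2 \Delta_B}{n (1 - P_0)}$, $b = b_0 / n$, and $b_0$ is a user-specified parameter that controls the variance, independent of the sample size. By construction, $\mathbb{E}\left[\sum_{i=1}^{n} \delta_i\right] = -2 \Delta_B$.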
Given this generative process, we can then estimate power using the algorithm in Figure 2. On each iteration, draw a simulated dataset from the generative process, compute the observed difference between models as $\hat{\Delta}_B = -\frac{1}{2} \sum_{i} \delta_i$, and test if this is significantly different from zero using a modified randomization test, in which we assume that the net effect of swapping a subset of instances is simply the sum of the individual effects $\delta_i$ in the subset. (Please see online materials for an interactive example.)
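The following standard-library sketch shows one way this simulation could look; it is an illustration under stated assumptions, not the interactive example from the online materials. It assumes our reading of the generative process: each individual effect is zero with probability `p_zero`, and otherwise Laplace-distributed with location $-2 \Delta_B / (n (1 - P_0))$ and scale $b_0 / n$. It also assumes the additive approximation, under which swapping a random subset turns the observed difference into half of a randomly signed sum of the effects. All function and parameter names are hypothetical.

```python
import random


def sample_effects(n, delta_b, p_zero, b0, rng):
    """Draw n individual swap effects from a Delta-Laplace mixture."""
    mu = -2.0 * delta_b / (n * (1.0 - p_zero))  # so that E[sum] = -2 * delta_b
    scale = b0 / n
    effects = []
    for _ in range(n):
        if rng.random() < p_zero:
            effects.append(0.0)
        else:
            # The difference of two Exp(1) draws is a standard Laplace variate.
            noise = rng.expovariate(1.0) - rng.expovariate(1.0)
            effects.append(mu + scale * noise)
    return effects


def randomization_p_value(effects, n_perm, rng):
    """Modified randomization test using the additive approximation."""
    observed = -0.5 * sum(effects)
    nonzero = [d for d in effects if d != 0.0]
    n_extreme = 0
    for _ in range(n_perm):
        # Swapping subset S gives observed + sum_{i in S} delta_i, which
        # equals 0.5 * sum_i s_i * delta_i with s_i = +1 if i in S else -1.
        permuted = 0.5 * sum(d if rng.random() < 0.5 else -d for d in nonzero)
        if abs(permuted) >= abs(observed):
            n_extreme += 1
    return (n_extreme + 1) / (n_perm + 1)


def bleu_power(n, delta_b, p_zero, b0, alpha=0.05, n_sims=50, n_perm=200, seed=0):
    """Share of simulated datasets for which the BLEU difference is significant."""
    rng = random.Random(seed)
    n_significant = 0
    for _ in range(n_sims):
        effects = sample_effects(n, delta_b, p_zero, b0, rng)
        if randomization_p_value(effects, n_perm, rng) <= alpha:
            n_significant += 1
    return n_significant / n_sims


# Small illustrative run; parameter values here are placeholders.
print(bleu_power(n=200, delta_b=3.0, p_zero=0.1, b0=25.0, n_sims=20))
```

This pure-Python version is slow for realistic test sets of a few thousand sentences with thousands of permutations; a vectorized implementation would be the natural choice in practice.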
Empirical estimates:
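Table 3: Estimated parameters for comparisons between models (M1 vs. M2) on WMT English-German test sets.
M1    M2      Test set  $n$  $\Delta_B$  $P_0$  $b_0$
TF19  TF18    2019      2K   4.3         0.19   23.7
TF18  TF16    2018      3K   4.2         0.09   29.4
TF16  Conv17  2017      3K   1.3         0.12   22.5
TF16  Conv14  2016      3K   7.6         0.10   27.6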
In order to estimate reasonable values for the required parameters, we use several pretrained models from the fairseq library (Ott et al., 2019) for the WMT English-German translation task. We evaluate these models on the shared task test sets from 2016-2019 and compute BLEU scores using sacrebleu (Post, 2018). Fitting a Delta-Laplace mixture to the effects of swapping individual output pairs, we estimate values for $P_0$ (the proportion of swaps with zero effect) and $b_0$ (the scale parameter), reported in Table 3. (See also Figure 16 in Appendix G; code for computing estimates is provided in the online materials.)
While far from identical, the four comparisons, each representing different stages of model evolution, all produce similar estimates. Although these estimates are only based on a single language pair, the models and test sets are relatively diverse, and we expect that these estimates will generalize, though better estimates could be obtained by fitting this distribution to a new domain of interest.
Using these estimates, we can now characterize how much power test sets of different sizes ($n$) would have for a range of possible differences in BLEU ($\Delta_B$). Figure 4 shows this for $P_0$ and $b_0$ set to the average of the observed values.
This analysis has served, in part, to show how a simulation-based approach to power analysis can be adapted to virtually any task. Additional work is required to test how well these specific parameter estimates will generalize, but the same process can easily be adapted to new language pairs. More generally, there would be great value in the MT community curating larger held-out test sets, both to validate this analysis, and for better powered future comparison.
5 Likert-Scale Human Evaluations
Tasks such as natural language generation are difficult to evaluate using automated methods; as such, human evaluations are central to NLP. Past work has reported great variation in how human evaluations are done (van der Lee et al., 2019). Therefore, we begin with a meta-analysis of a subset of human evaluation experiments from EMNLP 2019, which we then use as the basis for claims about the power of human evaluations in NLP more generally.
5.1 Meta-analysis
To characterize the state of human evaluation in NLP, we identified papers from the main session of EMNLP 2019 that made use of human evaluations (details in Appendix H.2). To generalize across studies, we restrict our analysis to Likert-scale comparisons, which was the most commonly reported type of evaluation. We extracted all cases where a new model was being compared to the best-performing baseline on one or more metrics (117 comparisons from 41 papers) and normalized all ratings to be on a 0-1 scale.
One takeaway from this meta-analysis is that the reported effect sizes (that is, the difference between the novel model and the best-performing baseline) vary widely (on a [0, 1] scale). The number of items tested is more consistent: 69% used 100 or fewer, and only 18% used over 200. But, as similarly found by van der Lee et al. (2019), many key details were not reported in this sample of experiments. Most commonly missing was the number of ratings per item (34% of all experiments), followed by the total number of workers (28%). For 7% of experiments, we could not determine the number of items tested. 57% of experiments collected 3 annotations per item, which was also the modal number of unique annotators. Thus, it is often difficult to ascertain, for any particular experiment, the details of the experimental setting that are necessary to evaluate the validity of the results.
Because the number of items rated was the most commonly reported, we use that as our proxy for sample size. Figure 5 shows scaled mean difference between models as a function of number of items.
As expected, we see greater variance in effects with smaller samples since, with smaller samples, we expect greater noise. We also observe a slight negative correlation between effect size and sample size. That is, as sample size gets larger (and, thus, as estimates get more precise), the estimated effect size gets smaller. This trend is sometimes used as an indication of publication bias (censoring of null and opposite-direction effects) since, in a sample with no publication bias, the effect size should be independent of the sample size (Begg and Mazumdar, 1994). However, in our case, this correlation is not significant (as measured by Kendall’s $\tau$) and so it is difficult to draw strong conclusions.
5.2 Power analysis for human Likert ratings
What kind of effect sizes can typical human evaluation experimental designs detect? As in previous sections, we can use simulations to explore how many annotators and/or instances should be used to have sufficient power.
Simulating human experiments is conceptually simple (e.g., several raters each rate a set of generated sentences on overall quality), but for realistic simulations, we need to consider variation in items (some generated sentences are better than others), and variation by rater (some raters use higher ratings and/or respond to different aspects of quality), as well as the overall difference in quality between models. A simulation which treated all workers as identical would fail to capture this variation, and hence might overestimate power (Barr et al., 2013).
Unfortunately, details such as worker variance are rarely reported in published papers. To better characterize the typical variation in human evaluations, we rely on a convenience sample of several large datasets to estimate these parameters and use them in our simulations as a proxy for what we might observe in practice. Although focused on different tasks, all use a similar methodology, namely, getting many Likert-scale annotations per instance from many annotators and models (in some cases as many as 20 ratings per item).
In order to extract estimates of these parameters for our simulations, we use hierarchical mixed-effects models, as used in psychology and other behavioral fields (Barr et al., 2013; Gelman and Hill, 2006). Such models incorporate variation in the quality of generated instances, annotator responses, and annotator sensitivity, and are recommended by van der Lee et al. (2019) for analyzing human evaluations. (We provide details in Appendix H.3 and include code for fitting such models as part of the online materials.) Using this approach, we obtain an estimate of the relevant parameters from each of the large datasets. From these, we choose sets of parameters to be representative of experiments with high or low variance, with full results in Appendix H.3 (see Table 16 for parameter estimates).
As before, we then use these estimates to simulate data, assess significance on the simulated data (here using mixed effect regression), and compute power as a function of mean difference and sample size.

Many human evaluation studies are likely underpowered: Using the “high variance” parameters (which are typical of most of the datasets we used), the most common design at EMNLP 2019 (3 workers, 100 items) is underpowered unless the effect size is quite large (0.2 or higher on the [0, 1] scale).

Even with low variance, typical designs are underpowered to detect small effects: Using our estimated parameters for the low variance setting, experiments will be underpowered to detect small effects (0.05 on the [0, 1] scale), unless an unusually large number of ratings per item are collected (10+ for 100 items).

Need for improved reporting: Most human evaluations do not report enough detail to interpret the results. This could be drastically improved through basic power analyses, significance testing using mixed-effects models, and sharing of raw data.
Given our model estimates and simulations, we conclude that, in aggregate, many human evaluations are underpowered and would benefit from larger sample sizes, particularly by using more workers per item. Increased adoption of even approximate power calculations within the NLP community will promote thoughtful consideration of appropriate sample sizes and improve the reliability and replicability of results.
6 Overall Recommendations

Power analyses should be done prior to evaluation when comparing against a baseline. If a comparison is likely to be underpowered, the pros and cons of running that evaluation should be carefully considered. Underpowered experiments do not provide convincing evidence of progress.

For new datasets and shared tasks, the number of instances in the test will determine the minimum detectable effect size, and should be chosen accordingly.

For tasks which no longer have adequate power to detect typical improvements (e.g., MRPC and SST-2), authors should consider expanding the test set or retiring the task.

To facilitate future power calculation and significance tests, model owners should release final finetuned model checkpoints. Alternatively, leaderboard owners may wish to make validation set predictions from all submitted models publicly available.

For human evaluations, (anonymized) raw data should be shared, along with parameters and code to replicate the analysis, including proper significance testing. Prior to collecting human evaluation data, researchers should create an analysis plan and run power analyses to determine an appropriate sample size (likely requiring more workers and items than is currently typical in NLP).
7 Conclusion
Recent progress in NLP has been extraordinarily rapid, sometimes at the cost of experimental rigor. In this paper, we have presented evidence that underpowered experiments are widespread in NLP. For comparisons based on small samples, there is little reason to think that such an evaluation could reliably provide evidence of a significant improvement, and good reason to believe that improvements found to be significant will exaggerate or reverse the true effect. Going forward, a combination of larger test sets, simple power analyses, and wider sharing of code, data, and experimental details will help to build the foundation for a higher standard of experimental methodology in NLP.
Acknowledgments
Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. Thanks to Sam Bowman, Amanpreet Singh, Kevin Clark, Naman Goyal, and Colin Raffel for providing data from submissions to the GLUE leaderboard, as well as Taylor BergKirkpatrick, Sumanth Dathathri, Ari Holtzman, Hannah Rashkin, and Nikita Srivatsan for providing raw human evaluation data, not all of which made it into the paper.
Appendix A Further Discussion of Significance Testing, Power Analysis, and Post-Hoc Analysis
Null hypothesis significance testing: In this paper, we work within the framework of null hypothesis significance testing (NHST). NHST is not free from problems, in that certain systematic processes within the practice of scientific research and publishing can undermine its advantages, many of which have been explored in the literature (Gelman and Loken, 2013; Ioannidis, 2019; McShane et al., 2019). Nevertheless, it would be premature to discard the entire paradigm, and we believe there is still some value in considering power within NHST for several reasons.
First, despite its flaws, NHST remains a commonly used experimental framework in NLP research. Whether implicit or explicit, most experimental comparisons in the NLP literature have the structure of an experiment in the NHST framework, where having equivalent performance to an existing baseline is treated as a null hypothesis and the new model is argued to be significantly better (the typical case) or significantly worse (far rarer). But, whereas many fields that run experiments have standardized procedures for assessing statistical significance, NLP papers vary as to how formally they use a hypothesis testing framework to evaluate their results (Berg-Kirkpatrick et al., 2012; van der Lee et al., 2019; Azer et al., 2020).
Second, when done properly, NHST does provide a convenient way of summarizing results. Improvements in overall methodology, such as sharing code and data, sensitivity analyses, greater interest in null findings, and even pre-registration, can vastly improve the validity of this paradigm, and we are seeing adoption of some of these practices within NLP.
Finally, there is also a great need for additional clarity with respect to precisely what claims are being made by NLP papers. In this work, we are primarily focused on claims made about trained models (i.e. in testing whether one particular instantiation of a model is significantly better than a particular instantiation of another model). It is, of course, also important to consider broader claims that might be made, such as about expected performance or computational budget (Dodge et al., 2019; Schwartz et al., 2019), and everything we have to say can be extended to incorporate such considerations. For the purpose of clarity, however, we restrict ourselves to the simplest sort of statistical claim.
Power and power analyses: The probability that a statistical test will reject the null hypothesis in an experiment is a function of several parameters, some of which are typically known or controllable, such as the sample size and significance threshold, and some of which are unknown, such as the details of exactly how the models differ. Power tells us what this probability would be if we knew the true values of these unknown parameters. Conditional on a particular difference existing (e.g., an expected difference in accuracy between two models for a particular data distribution), along with a statistical test and a significance threshold, power is the probability that the test will reject the null hypothesis and find the observed difference to be significant. In common statistical terminology, power is one minus the probability of a false negative (a type II error), i.e., of failing to reject the null hypothesis when a true effect exists.
While we will not, in general, know what the true power of an experiment is, by making reasonable assumptions, we can try to choose appropriate values for those parameters that we can control. By making assumptions about what we expect to observe, we can obtain estimates of how much power a test is likely to have, which may lead us to modify our experimental design, such as by increasing the sample size.
Importantly, proper experiment design requires specifying these parameters in advance of data collection, or otherwise using a valid stopping rule. One can always obtain a significant result by progressively collecting data until a significant result is found (“sampling to a foregone conclusion”), but this is not a valid procedure (Anscombe, 1954; Wagenmakers, 2007). Similarly, posthoc power analysis, using estimates derived from the experiment itself, provides no additional information beyond a transformation of the observed value, and is thus not recommended (though see below).
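The cost of "sampling to a foregone conclusion" can be illustrated with a small simulation. The sketch below is illustrative only: it assumes a preference study with no true difference (each rater prefers either system with probability 0.5), a two-sided normal-approximation test, and hypothetical choices of batch size and maximum sample size, none of which come from the paper.

```python
import math
import random


def ever_significant(max_n, batch, z_crit, rng):
    """Simulate one study under the null, testing after every `batch` raters."""
    prefers_b = 0
    for n in range(1, max_n + 1):
        prefers_b += rng.random() < 0.5  # no true preference
        if n % batch == 0:
            # Normal approximation to the binomial test of p = 0.5
            z = (prefers_b - 0.5 * n) / math.sqrt(0.25 * n)
            if abs(z) > z_crit:
                return True  # the experimenter would stop here
    return False


def false_positive_rate(n_sims=2000, max_n=200, batch=10, z_crit=1.96, seed=0):
    """Fraction of null studies that ever cross the significance threshold."""
    rng = random.Random(seed)
    hits = sum(ever_significant(max_n, batch, z_crit, rng) for _ in range(n_sims))
    return hits / n_sims


if __name__ == "__main__":
    print("testing after every 10 raters:", false_positive_rate(batch=10))
    print("single test at n = 200:       ", false_positive_rate(batch=200))
```

Under these assumptions, repeated testing should reject a true null far more often than the nominal rate, whereas a single pre-specified test stays close to it.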
Expanding on the algorithm in Figure 2, a simulation-based power analysis involves the following:

Come up with a generative process which could be used to generate data like that which we will collect. In this step, we need to make assumptions about the comparison of interest. Since the binomial test requires only the counts of how many people prefer each system, we need to specify a prior on generating those counts. For example, we might assume that 60% of people will prefer system B, so the generative process will be $c_B \sim \mathrm{Binomial}(n, 0.6)$, where $c_B$ is the number of people who prefer system B and $n$ is the total number of people to be sampled.

Choose a value of $n$ for which we want to calculate power. Repeatedly (e.g., 10,000 times) draw samples from our assumed generative process for that value of $n$.

For each simulated dataset of size $n$, run the chosen statistical test to check whether the difference between the observed counts is significant, and compute the proportion that are found to be significant. This is our estimate of power.
Note that more direct solutions for power analysis do exist for some settings, such as this one (see Appendix E.5 below).
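The three steps above can be sketched as follows. This is a minimal illustration rather than the authors' released notebooks: the function names, the default of 60% preferring system B, and the use of an exact two-sided binomial test against a null of 0.5 are assumptions made for the example.

```python
import math
import random


def binom_two_sided_pvalue(k, n, p=0.5):
    """Exact two-sided binomial test: total probability of all outcomes that
    are no more likely than the observed count k under the null."""
    pmf = [math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def simulate_power(n, p_prefer_b=0.6, alpha=0.05, n_sims=10_000, seed=0):
    """Estimate power by simulating n_sims studies with n raters each."""
    rng = random.Random(seed)
    pvalue_cache = {}  # the p-value depends only on the observed count
    significant = 0
    for _ in range(n_sims):
        # Step 1-2: draw one dataset from the assumed generative process
        count_b = sum(rng.random() < p_prefer_b for _ in range(n))
        # Step 3: run the test on the simulated counts
        if count_b not in pvalue_cache:
            pvalue_cache[count_b] = binom_two_sided_pvalue(count_b, n)
        significant += pvalue_cache[count_b] < alpha
    return significant / n_sims


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, simulate_power(n))
```

Sweeping over `n` in this way gives a rough picture of how many raters would be needed for a given target power under the assumed preference rate. The exact p-value helper is only intended for moderate `n`; very large values would overflow the floating-point conversion.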
Post-Hoc Power Analysis: Post-hoc power analysis is problematic when the true population effect is subject to variance (O’Keefe, 2007; Hoenig and Heisey, 2001; Gelman, 2019). In the case of NLP models, there are several perspectives on the comparison being made, which lead to different views of post-hoc power analysis: (1) if we are comparing one model vs. another on a particular test set, the effect we see is the true population effect, and post-hoc power analysis is acceptable because the outcome is deterministic; (2) if we are comparing one model vs. another on a data distribution from which the test and dev sets are drawn, post-hoc power analysis is not acceptable; (3) if we are comparing one training algorithm vs. another (including variance from both training procedures and test/dev set draws), post-hoc power analysis is still not acceptable. We specifically look at case (2). While (3) is interesting in its own right, it is not (yet) the typical comparison made in NLP research, and thus we do not have enough information on reported training variance to investigate it thoroughly here. Case (1) is also atypical, as the authors of a study typically wish to draw inferences about how well a model does on the true data distribution (hence the use of dev and test sets).
Appendix B Type-M and Type-S errors
Although the most obvious risk of using underpowered experiments is that there is a greater chance of failing to detect a true effect, there is an additional harm of using an underpowered design, which has emerged in light of the replication crisis in science. This can be most easily understood through the idea of Type-M and Type-S error (Gelman and Carlin, 2014).
Type-M error is the extent to which an observed difference exaggerates the true effect, conditional on a finding being significant. Type-S error is the probability that an observed difference has the opposite sign of the true difference, again conditional on a finding being significant. Even in a low-powered experiment, there is some probability of finding an effect to be significant; the lower the power, however, the more likely it is that the observed significant difference has the opposite sign of the true effect, and the larger the degree to which the magnitude of the observed effect will tend to exaggerate the true effect.
Intuitively, if power is low, this means that the sample size is small relative to the effect size. As such, the difference will only be significant if an atypically large effect is observed. Assuming the use of a two-sided test, many of these significant findings will also have the wrong sign, as they will be nearly as likely to fall on either side of zero for a symmetric distribution.
Type-M and Type-S error rates can be estimated using the exact same process for power analysis as described in Figure 2. To do so, we need only augment the algorithm with two additional steps, both restricted to the simulated trials that are found to be significant: compute the proportion of those trials in which the observed difference has the opposite sign of the true difference (Type-S error), and compute the average ratio of the magnitude of the observed difference to that of the true difference (Type-M error).
Figures 7 and 8 show scenarios for comparing classifiers on accuracy, corresponding to Figure 3 in the main text, but showing expected Type-M and Type-S error instead of power. As can be seen, Type-M and Type-S error increase with smaller sample sizes, smaller differences between models, and lower agreement rates, all corresponding to lower power.
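To make the definitions of Type-M and Type-S error concrete, the sketch below estimates both by simulation. It is a simplified stand-in for the paper's procedure: it assumes that the observed difference is normally distributed around the true effect with a known standard error and that significance is assessed with a two-sided z-test. The function name and the example values are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_type_m_s(true_effect, std_error, alpha=0.05, n_sims=20_000, seed=0):
    """Return (power, Type-S rate, Type-M factor) under a normal sampling model."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_sims):
        observed = rng.gauss(true_effect, std_error)
        if abs(observed) / std_error > z_crit:
            significant.append(observed)
    power = len(significant) / n_sims
    if not significant:
        return power, float("nan"), float("nan")
    # Type-S: share of significant results with the wrong sign
    type_s = sum(1 for d in significant if d * true_effect < 0) / len(significant)
    # Type-M: average exaggeration of the magnitude among significant results
    type_m = sum(abs(d) for d in significant) / len(significant) / abs(true_effect)
    return power, type_s, type_m


if __name__ == "__main__":
    # Hypothetical example: a 2-point true difference with two noise levels
    print("noisy estimate:  ", simulate_type_m_s(0.02, 0.014))
    print("precise estimate:", simulate_type_m_s(0.02, 0.005))
```

Comparing the two calls shows the pattern described above: with a larger standard error (lower power), the significant results exaggerate the true effect more.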
Appendix C Numerical Example of a McNemar’s Test Simulation
To provide a concrete example of comparing classifiers on accuracy, imagine that a test set for a benchmark task has 500 instances. Based on prior knowledge (see main paper), we might assume that our proposed model will achieve, at most, an absolute improvement of 2 percentage points over the state of the art, and that the models are likely to agree on 90% of examples. We can convert these assumptions into a distribution over outcomes which will define our generative process. In particular, for a random unseen instance, these assumptions imply that there is a 10% chance of a disagreement. Because the two disagreement probabilities must sum to 10% and differ by 2%, the probability that our model is correct and the old model is incorrect is 6%, and the opposite outcome has a probability of 4% (giving us the assumed net difference of 2%). Note that, because McNemar’s test does not consider the on-diagonal elements, it is not necessary that we explicitly define the baseline accuracy. Thus, a valid probability distribution for use in this simulation could be that shown in Table 4.
M1 correct  M1 incorrect  

M2 correct  
M2 incorrect 
By drawing many samples of size $n = 500$ from this distribution and computing a $p$-value using McNemar’s test for each, we obtain an estimate of the power of this test at the chosen significance threshold, and find the test to be severely underpowered. This would also imply a Type-M error factor of 1.9; we would expect that a typical experiment that found the observed difference between models to be significant would exaggerate the true difference of 0.02 by a factor of 1.9, producing observed significant differences between models on the order of 0.04, on average. (See supplementary notebooks for calculations and an interactive demonstration.) As such, we conclude that this test set is too small to be able to reliably evaluate whether or not our model is significantly different from the state of the art, and should distrust any observed differences that are significant, unless we have poorly estimated the relevant parameters.
By contrast, if the test set contained 2000 examples, we would estimate the test to have nearly 80% power, with a TypeM factor of only 1.1, and would feel comfortable proceeding with and reporting on this evaluation. Similarly, if we had reason to think that our model represented a gamechanging advance, and would achieve an improvement of 4 percentage points, or if we had reason to believe that the models would agree on 97.5% of examples, then we would have the power to evaluate this, even with only 500 examples.
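The numerical example in this appendix can be approximated with a short simulation. The sketch below is not the authors' notebook code: it assumes the exact conditional form of McNemar's test (a two-sided binomial test on the discordant pairs), so its estimates may differ somewhat from those obtained with other variants of the test, and the function names and defaults are chosen for illustration.

```python
import math
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def mcnemar_exact_pvalue(b, c):
    """Exact conditional McNemar test on the two discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def simulate_mcnemar_power(n=500, p_new_only=0.06, p_old_only=0.04,
                           alpha=0.05, n_sims=5000, seed=0):
    """Estimate power when only the new (resp. old) model is correct with
    probability p_new_only (resp. p_old_only) on a random instance."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        b = c = 0
        for _ in range(n):
            u = rng.random()
            if u < p_new_only:
                b += 1
            elif u < p_new_only + p_old_only:
                c += 1
            # otherwise the models agree; McNemar's test ignores these cells
        significant += mcnemar_exact_pvalue(b, c) < alpha
    return significant / n_sims


if __name__ == "__main__":
    print("n = 500: ", simulate_mcnemar_power(n=500))
    print("n = 2000:", simulate_mcnemar_power(n=2000, n_sims=1000))
```

Only the two off-diagonal probabilities are needed, which mirrors the observation above that the baseline accuracy does not have to be specified. The resulting estimates can be compared against the figures quoted in the text for 500 and 2000 examples.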
Appendix D SQuAD 2.0 Analysis and Results
From the authors of SQuAD 2.0, we obtained pairwise agreement statistics on the SQuAD 2.0 development and test sets for all models that were submitted to the SQuAD 2.0 leaderboard and had publicly visible development set predictions on the CodaLab platform. We removed six submissions whose exact match (EM) scores on test data were less than ; EM scores below suggest a bug or misconfiguration of the model for predicting on the test set, as the majority baseline gets roughly accuracy (by always predicting noanswer). We also removed one submission whose development set EM score was more than points higher than its test EM score, as it seemed likely that the model had been trained on the development set. After this filtering, we were left with 144 models.
Figure 9 shows the correlation between validation and test data for both pairwise accuracy differences and agreement rates on the SQuAD 2.0 leaderboard. As can be seen, these correlate well, suggesting that measuring these quantities on validation data can serve as a reasonable guide when doing a power analysis for a new model, though lower agreement rates on dev data do tend to slightly underestimate agreement on test. If the validation results are available for both models, they can be used to estimate the accuracy difference and the agreement rate, which in turn can be used to compute the approximate power of the test set.
To verify that using these estimates provides a reliable guide to power, we make use of the predictions made by SQuAD 2.0 submissions on both validation and test data. In particular, if we assume that each submission is being compared to the previous model to demonstrate a significant and well-powered improvement over the previous baseline, we find that 19 out of 143 submissions showed sufficient improvement on the validation set to have at least 80% power (see Figure 10). Of these, 14 (74%) attain a significant improvement over the baseline on the test data (consistent with the expected value of 80%). Of the remaining 124 submissions, 3 (2.5%) would show a significant improvement over the baseline, but did not have sufficient power based on validation performance. Interestingly, while all other significant improvements were generally well-spaced over time, these three underpowered submissions were each beaten by a new submission within 5 days. As an aside, we also note that the vast majority of submissions are significantly worse than the current SOTA, reinforcing the notion that real improvements are rare, and most improvements will be small.
Caveats: Correlation between the effect size on the validation and test sets may not always be so high. Overconfidence in the power of your experiment may thus occur if the validation performance is greater than the test performance (as would be the case if no regularization was used and extensive hyperparameter tuning caused a model to overfit to the validation set). Alternatively, if comparing to a baseline with inflated performance on validation data (for the same reasons as above), running power analyses based purely on estimates from validation data would underestimate power. As such, combining validation estimates with reasonable priors is recommended.
Appendix E Accuracy
e.1 Data Collection
Model Predictions on Test Set and Model Prediction Agreement
From the authors of the GLUE benchmark – as well as authors of individual models – we obtain the model testset predictions on all tasks from a set of 10 highperforming models, which allows us to measure the extent to which their predictions overlap with each other. We select GLUE tasks which use accuracy as an evaluation metric. The relevant tasks are MNLI (Williams et al., 2018), MRPC (Dolan and Brockett, 2005), RTE (Dagan et al., 2005; BarHaim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), SST2 (Socher et al., 2013), QQP (Iyer et al., 2017), QNLI (Rajpurkar et al., 2016), and WNLI (Levesque et al., 2012). For consideration of other metrics, see Appendix F.
We use model predictions for: ELECTRA (small, base, large, large with tricks) (Clark et al., 2019b), XLNet (large) (Yang et al., 2019), T5 (Raffel et al., 2019), ALBERT (large) (Lan et al., 2020), BAM (large) (Clark et al., 2019a), RoBERTa (large) (Liu et al., 2019), and BERT (Devlin et al., 2019). We only had the model predictions available and extrapolated overlap from them; we did not have access to the models themselves, the ground truth test set labels, or dev set predictions for the models.
Comparisons and Claims
We gather data from GLUE papers regarding the accuracy tasks and manually label 119 comparisons and 57 claims of improvement (as denoted within a work by bolding of a new model’s number and a claim of SOTA in the main text) across 14 papers (selected as being at or above the BERT score on the GLUE leaderboard with an accompanying publication). For each paper we examine if a specific comparison is made against a baseline that isn’t claiming state of the art performance. For example, the STILTs approach (Phang et al., 2018) makes comparisons against nonSOTA baselines, which we add to our labeling scheme but filter out when fitting regressions to likely SOTA improvements. We mark this as SOTA Comparison = N. For claims of SOTA improvement, we examine this as some textual basis for the claim (e.g., “we drive state of the art performance on GLUE”) coupled with bolding of values in a table reporting baselines against the model under test. We mark datapoints as Claim of Improvement = Y if they are an improvement claim. We mark effect size as the improvement from the best previous baseline (the current SOTA) on the test set on a perdataset basis. We note that in several cases, worse results on the new model were bolded. We treated this as no claim of improvement. If results were not bolded but still higher for the new model we also treated this as no claim for improvement.
e.2 Regressionbased approach to modeling power and MDEs
Predicting overlap
There are several versions of McNemar’s test, each with its own method for calculating power, sample size, or minimum effect size. See, for example, discussions in Schlesselman (1982), Duffy (1984), Suissa and Shuster (1991), Connett et al. (1987), Fagerland et al. (2013), and Lachenbruch (1992).
The methods for calculating sample size or power by Connett et al. (1987); Schlesselman (1982); Suissa and Shuster (1991) require making an assumption about the odds ratio as well as an estimate of the fraction of discordant pairs (disagreements between two models).
Fagerland et al. (2013) suggest that the exact unconditional version of the test by Suissa and Shuster (1991) has desirable properties. Thus, we use the implementation of the power calculations for this test from the https://github.com/ekstroem/MESS package.
How do we make an assumption about the odds ratio and fraction of discordant pairs? We first fit an OLS regression to the existing models on the GLUE leaderboard for all binary choice accuracy tasks using the aforementioned predictions provided by the leaderboard creators and individual authors of models,
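(1) $y_i = \beta_0 + \beta_1 \cdot \text{min\_acc}_i + \beta_2 \cdot \text{acc\_diff}_i + \epsilon_i$
for all $i$ that are a pairwise comparison between any two models, where $\text{min\_acc}_i$ is the minimum accuracy between the two models under comparison, $\text{acc\_diff}_i$ is the gap between the two models, and $y_i$ is the fraction of overlapping predictions. We end up with the model shown in Table 5.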
Table 5: OLS regression predicting the fraction of overlapping predictions between pairs of models on GLUE.
Dep. Variable:  y  R-squared:  0.966
Model:  OLS  Adj. R-squared:  0.966
Method:  Least Squares  F-statistic:  3820.
Date:  Thu, 14 May 2020  Prob (F-statistic):  3.62e-197
Time:  07:03:28  Log-Likelihood:  818.14
No. Observations:  270  AIC:  -1630.
Df Residuals:  267  BIC:  -1619.
Df Model:  2

  coef  std err  t  P>|t|  [0.025  0.975]
const  0.4142  0.019  21.694  0.000  0.377  0.452
min_acc  0.5819  0.021  27.999  0.000  0.541  0.623
acc_diff  -0.4662  0.028  -16.625  0.000  -0.521  -0.411

Omnibus:  6.121  Durbin-Watson:  1.040
Prob(Omnibus):  0.047  Jarque-Bera (JB):  8.647
Skew:  0.108  Prob(JB):  0.0133
Kurtosis:  3.850  Cond. No.  71.5
Table 6: OLS regression predicting the fraction of overlapping predictions between pairs of models on SQuAD 2.0.
Dep. Variable:  y  R-squared:  0.944
Model:  OLS  Adj. R-squared:  0.933
Method:  Least Squares  F-statistic:  91.87
Date:  Tue, 26 May 2020  Prob (F-statistic):  1.37e-07
Time:  06:05:23  Log-Likelihood:  36.368
No. Observations:  14  AIC:  -66.74
Df Residuals:  11  BIC:  -64.82
Df Model:  2

  coef  std err  t  P>|t|  [0.025  0.975]
const  0.4339  0.091  4.786  0.001  0.234  0.633
min_acc  0.5932  0.101  5.874  0.000  0.371  0.816
acc_diff  -1.2849  0.588  -2.186  0.051  -2.578  0.009

Omnibus:  0.299  Durbin-Watson:  2.022
Prob(Omnibus):  0.861  Jarque-Bera (JB):  0.163
Skew:  0.214  Prob(JB):  0.922
Kurtosis:  2.691  Cond. No.  140.
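We note that outcomes are biased toward a higher range of accuracy values and may not be a perfect prior. However, this does give us a fairly good linear fit for top-of-the-leaderboard results. We can then predict the expected overlap for a given model as:
(2) $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{min\_acc} + \hat{\beta}_2 \cdot \text{acc\_diff}$
where $\text{min\_acc}$ is the baseline accuracy, $\text{acc\_diff}$ is the assumed improvement, and the coefficients are the fitted values from the regression above. Note that we can now make an assumption about the expected fraction of discordant pairs, $1 - \hat{y}$, and the odds ratio. Since the two discordant cell probabilities sum to $1 - \hat{y}$ and differ by $\text{acc\_diff}$, the latter is their ratio:
(3) $\text{OR} = \dfrac{(1 - \hat{y}) + \text{acc\_diff}}{(1 - \hat{y}) - \text{acc\_diff}}$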
This is all that is necessary for McNemar’s test, and thus we can simply solve for the minimum expected treatment effect for the given sample size of the dataset and a target level of power. Note that for QQP we use the normal approximation rather than the exact unconditional test, as the large sample size makes the exact test intractable; see Duffy (1984).
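As a worked illustration of how the regression feeds into McNemar's test, the sketch below turns a baseline accuracy and an assumed improvement into a predicted agreement rate, a fraction of discordant pairs, and an odds ratio. The coefficients are read off the GLUE regression table, taking the acc_diff coefficient to be negative (an assumption based on its reported confidence interval); the function names and the example inputs are hypothetical, and accuracies are expressed as fractions.

```python
def predicted_agreement(baseline_acc, acc_diff,
                        const=0.4142, coef_min_acc=0.5819, coef_acc_diff=-0.4662):
    """Predicted fraction of overlapping predictions from the linear fit."""
    return const + coef_min_acc * baseline_acc + coef_acc_diff * acc_diff


def mcnemar_inputs(baseline_acc, acc_diff):
    """Return (fraction of discordant pairs, odds ratio) implied by the fit."""
    p_disc = 1.0 - predicted_agreement(baseline_acc, acc_diff)
    if p_disc <= acc_diff:
        raise ValueError("assumed improvement exceeds predicted disagreement")
    # The two discordant cells sum to p_disc and differ by acc_diff
    p_new_only = (p_disc + acc_diff) / 2.0
    p_old_only = (p_disc - acc_diff) / 2.0
    return p_disc, p_new_only / p_old_only


if __name__ == "__main__":
    # Hypothetical scenario: 92% baseline accuracy, 2-point improvement
    print(mcnemar_inputs(0.92, 0.02))
```

These two quantities are the inputs that the sample size and power formulas for McNemar's test cited above require.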
We fit such a regression to GLUE tasks and achieve an $R^2$ of 0.966. Repeating this for SQuAD 2.0, we get an $R^2$ of 0.944, with the fit shown in Table 6. See Figure 11 for a plot indicating the level of agreement plotted against baseline accuracy. See also additional model comparisons for overlap in Appendix I.
Predicting Effect Size
A similar regression can be run to predict the expected effect size given the baseline accuracy: how much do models typically improve given the current SOTA. To fit an OLS regression predicting this value, we gather data from GLUE papers regarding the accuracy tasks and manually label 119 comparisons and 57 claims of improvement (as denoted within a work by bolding of a new model’s number and a claim of SOTA in the main text) across 14 papers (selected as being at or above the BERT score on the GLUE leaderboard with an accompanying publication). We fit the regression:
(4) $\widehat{\text{effect}} = \beta_0 + \beta_1 \cdot \text{baseline} + \boldsymbol{\beta}_2^{\top} \mathbf{task}$
to see how predictable the expected effect size is, where $\widehat{\text{effect}}$ is the predicted effect size, $\text{baseline}$ is the baseline model’s accuracy, and $\mathbf{task}$ is a categorical variable (in the regression this ends up being a set of dummy variables for each category, so we write it as a vector to emphasize this). Note that for SQuAD 2.0, we use a separate regression without the task variable since it is a single-task leaderboard.
We achieve an $R^2$ of 0.690, which is not a perfect fit, but still provides a prior on likely effect size. Similarly, we achieve an $R^2$ of 0.672 when fitting a regression to SOTA improvements on the SQuAD 2.0 leaderboard (selected as being a significant improvement in time-ordered submissions).
Dependent variable:  effect.size
Previous.Best  
TaskMNLImm  
TaskMRPC  
TaskQNLI  
TaskQQP  
TaskRTE  
TaskSST2  
TaskWNLI  
Constant  
Observations  61
R²  0.690
Adjusted R²  0.642
Residual Std. Error  1.309 (df = 52)
F Statistic  14.455 (df = 8; 52)
Note:  *p<0.1; **p<0.05; ***p<0.01
Table 8: OLS regression predicting effect size from baseline accuracy on SQuAD 2.0.
Dep. Variable:  y  R-squared:  0.672
Model:  OLS  Adj. R-squared:  0.644
Method:  Least Squares  F-statistic:  24.55
Date:  Tue, 26 May 2020  Prob (F-statistic):  0.000334
Time:  06:05:23  Log-Likelihood:  45.711
No. Observations:  14  AIC:  -87.42
Df Residuals:  12  BIC:  -86.14
Df Model:  1

  coef  std err  t  P>|t|  [0.025  0.975]
const  0.1331  0.023  5.910  0.000  0.084  0.182
x1  -0.1408  0.028  -4.955  0.000  -0.203  -0.079

Omnibus:  19.911  Durbin-Watson:  2.643
Prob(Omnibus):  0.000  Jarque-Bera (JB):  18.487
Skew:  1.995  Prob(JB):  9.68e-05
Kurtosis:  6.971  Cond. No.  17.3
Caveats for Regressionbased Approach
Fitting a regression to predict overlap between a baseline and a new model has a good linear fit. However, this may not be the case for every dataset. Additionally, predicting effect sizes via a linear fit is not a perfect prior. The measurements of power in this case are meant to simulate estimating power before running evaluation on a test set, as running power analysis using only the observed effect may lead to the issues of posthoc power estimation.
e.3 No Prior Approach (Lachenbruch, 1992)
What do you do if there is no prior data available (as in a new task) and so you cannot make assumptions about discordant pairs or odds ratio? Lachenbruch (1992) discusses this exact problem in the context of clinical trials, and proposes an alternative method based on the work of Connett et al. (1987) which allows you to make assumptions about potential marginal probabilities, providing a midpoint value, as well as an upper and lower bound. We use an implementation of this from: https://rdrr.io/rforge/biostatUZH/man/sampleSizeMcNemar.html and solve for the expected accuracy minimum given a fixed dataset sample size and baseline accuracy for each of the lower bound, midpoint, and upper bound. In practice, we find the Lachenbruch (1992) prior to be very close to the values we obtain from the above regression (see Table 9). Importantly this method requires no assumptions and is meant to give an idea for whether it is worth pursuing a study for the given size of the test set.
e.4 Extended Results
Table 9 contains additional MDE estimates using a two-sample proportion test, as in Appendix E.5, and the Lachenbruch (1992) methodology. We also provide the average effect size for each task (with its standard error and the number of observations $n$), the OLS regression prediction of the next effect size for a new SOTA, and the current difference between the SOTA and the next model on the leaderboard. We note that MDE calculations are roughly similar except for the upper and lower bounds provided in the Lachenbruch (1992) calculation. We also note that predicted SOTA results are far lower than past averages since the average includes early large results like those of Devlin et al. (2019). We can see that in some cases the predicted effect size is even smaller than the lowest bound MDE and we may wish to consider the usefulness of further comparisons on individual datasets in such cases.
Dataset  Size  SOTA  MDE Binomial  MDE (Lachenbruch, 1992)  MDE regression  Predicted effect (OLS)  Avg. effect (std. err., n)  Gap between SOTA and next

WNLI  147  94.5%  +5.38%  +5.42% (5.36%, 5.45%)  +5.26%  1.17%  1.72 (0.917, 4)  0.0%
MRPC  1725  92.0%  +2.40%  +1.91% (0.45%, 2.48%)  +1.62%  +0.03%  +0.625 (0.234, 8)  +0.6%
SST-2  1821  97.2%  +1.34%  +1.10% (0.43%, 1.35%)  +1.02%  +0.18%  +0.571 (0.197, 7)  0.3%
RTE  3000  91.7%  +1.89%  +1.48% (0.26%, 1.96%)  +1.23%  +1.11%  +3.89 (1.23, 10)  +0.8%
QNLI  5463  97.5%  +0.77%  +0.60% (0.14%, 0.78%)  +0.55%  +0.69%  +1.31 (0.552, 9)  +0.9%
MNLI-m  9796  91.6%  +1.08%  +0.82% (0.08%, 1.12%)  +0.67%  +0.12%  +0.97 (0.442, 10)  +0.2%
MNLI-mm  9847  91.3%  +1.09%  +0.84% (0.08%, 1.14%)  +0.68%  +0.34%  +1.29 (0.550, 8)  +0.3%
QQP  390965  91.0%  +0.18%  +0.13% (%, 0.19%)  +0.11%  +0.08%  0.36 (0.121, 5)  +0.1%
SQuAD 2.0  8862  90.724%  +1.18%  +0.91% (0.09%, 1.23%)  +0.556%  +0.528%  +2.23% (0.431, 14)  +0.146%
Statistic  N  Mean  St. Dev.  Min  Pctl(25)  Pctl(75)  Max 

Power  
P  
Statistic  N  Percentage    
% Powered  
% Significant  
% significant and Powered 
e.5 Calculating Power or Sample Size with Binomial Test
If we assume that samples are unpaired – the new model and baseline evaluation samples are drawn from the same data distribution but aren’t necessarily the same samples – we can use a binomial test for significance.
In this case, we assume that we have two models and each draw brings a 1 if the model is correct or 0 if incorrect. We would like to use the two-sample proportion test, and have two binomial distributions with $p_1$ and $p_2$ as the mean probabilities. Our null hypothesis is $H_0: p_1 = p_2$, and our (two-sided) alternative hypothesis is $H_1: p_1 \neq p_2$. Note that in R we can use the function power.prop.test() to calculate the power, the MDE, or the sample size of the test. See also a tutorial here: https://imai.fas.harvard.edu/teaching/files/Handout9.pdf.
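For readers working outside R, the calculation can be sketched with the standard normal-approximation formula for a two-sided, two-sample test of proportions with equal group sizes. This is an illustrative implementation under that assumption, not a port of any particular library; the function names, the bisection search for the MDE, and the example inputs are choices made for this sketch.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def two_proportion_power(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided two-sample proportion test,
    with n samples per model (ignores the negligible opposite tail)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    sd_null = ((p1 + p2) * (1 - (p1 + p2) / 2)) ** 0.5  # pooled, under H0
    sd_alt = (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5     # unpooled, under H1
    z = (n ** 0.5 * abs(p1 - p2) - z_crit * sd_null) / sd_alt
    return _STD_NORMAL.cdf(z)


def minimum_detectable_effect(p1, n, power=0.8, alpha=0.05, tol=1e-7):
    """Smallest improvement over baseline accuracy p1 detectable with the
    requested power, found by bisection on the improvement."""
    lo, hi = 0.0, 1.0 - p1
    if two_proportion_power(p1, p1 + hi, n, alpha) < power:
        raise ValueError("target power is unreachable for this n and baseline")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if two_proportion_power(p1, p1 + mid, n, alpha) < power:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    print(two_proportion_power(0.90, 0.92, n=2000))
    print(minimum_detectable_effect(0.972, n=1821))
```

The second call uses a test set size and baseline accuracy like those listed for SST-2 in Table 9, so its output can be compared against the binomial MDE column there.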
Appendix F Additional Metrics
In this appendix, we provide guidance on how we might apply power analysis to metrics beyond what is covered in the main paper.
Recall, Precision, F1, Matthews correlation: While accuracy is the most commonly used metric in the GLUE benchmark, other tasks make use of other metrics such as F1 and Matthews correlation. F1 is particularly relevant in cases of binary classification where there is strong class imbalance, such that even the baseline of predicting the most common class will achieve high accuracy.
If we have good prior information, we can use an approach akin to that recommended for accuracy, but replacing McNemar’s test with a randomization test (as used for machine translation, see §4 in main paper). In particular, given an evaluation on paired data (as is the case for all benchmark datasets), one can test for a significant difference between models in terms of F1 (or any other metric) using a randomization test. That is, on each iteration, we randomize the assignment of which model each prediction came from for every instance with probability 0.5, and compute the resulting overall difference in F1. Repeating this thousands of times gives us the null distribution, and we can then check to see whether the observed difference in F1 is in the tails of this distribution, which can thereby be converted into a $p$-value (see Dror et al. (2018) for more details).
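The randomization test described above can be sketched as follows for binary F1. This is a minimal illustration: the function names, the number of randomizations, and the add-one smoothing of the $p$-value are assumptions of this sketch rather than details taken from the paper.

```python
import random


def f1_score(labels, preds):
    """Binary F1 for the positive class (labels and predictions are 0 or 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_randomization_test(labels, preds_a, preds_b, n_rand=2000, seed=0):
    """Return (observed F1 difference, two-sided p-value)."""
    rng = random.Random(seed)
    observed = f1_score(labels, preds_a) - f1_score(labels, preds_b)
    extreme = 0
    for _ in range(n_rand):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(preds_a, preds_b):
            if rng.random() < 0.5:  # swap which model the prediction came from
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = f1_score(labels, shuffled_a) - f1_score(labels, shuffled_b)
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive
    return observed, (extreme + 1) / (n_rand + 1)
```

Any other paired metric can be substituted for `f1_score` without changing the rest of the procedure.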
Because F1 (and related metrics) cannot be represented as a simple sum over individual instances, in order to completely specify a hypothetical data generating process, we need to assume values for all cells in the confusion matrix, per class. That is for each class we would need to assume values for the cells as shown in Table 11, where the relevant distribution of predictions are for the instances with the corresponding label, and the values for each class sum to one.
M1 negative  M1 positive  

M2 negative  
M2 positive 
In addition, we need to assume the true distribution of labels in the data distribution of interest, i.e., $p(y)$ for each class $y$. Given these assumptions, we could then simulate an arbitrary number of datasets from this process. For each instance, we would first sample a true label $y$, and then sample the model predictions from the corresponding contingency table. For each simulated dataset, we could then apply the randomization test (using thousands of randomizations). By repeating this process many times, we can directly estimate power for the corresponding assumptions and sample size $n$.
This process is not particularly efficient, but can still be run relatively quickly on a laptop. The more difficult part is choosing good values for the necessary probabilities. However, such an approach can still be used to test for how sensitive power is to variations in assumptions. It is also possible to make simplifying assumptions, such as that the rate of false positives and false negatives will be the same across classes, or to estimate some parameters from training data, such as the underlying distribution of labels. The same technique can easily be extended to other metrics that depend on the contingency table, such as Matthews correlation.
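The data-generating process for this setting can be sketched as below. All numeric values are hypothetical placeholders chosen only to make the example run; in practice they would come from prior data or from the simplifying assumptions mentioned above. Each class has its own table giving the joint probability of the two models' predictions.

```python
import random

# Hypothetical per-class tables: keys are (prediction of M1, prediction of M2),
# values are probabilities that sum to one within each true class.
EXAMPLE_TABLES = {
    1: {(1, 1): 0.80, (1, 0): 0.04, (0, 1): 0.06, (0, 0): 0.10},
    0: {(0, 0): 0.90, (0, 1): 0.02, (1, 0): 0.03, (1, 1): 0.05},
}


def simulate_dataset(n, p_positive, tables, seed=0):
    """Sample n instances: first a true label, then both models' predictions
    from the contingency table for that label."""
    rng = random.Random(seed)
    labels, preds_m1, preds_m2 = [], [], []
    for _ in range(n):
        y = 1 if rng.random() < p_positive else 0
        outcomes = list(tables[y].keys())
        weights = list(tables[y].values())
        m1, m2 = rng.choices(outcomes, weights=weights, k=1)[0]
        labels.append(y)
        preds_m1.append(m1)
        preds_m2.append(m2)
    return labels, preds_m1, preds_m2


if __name__ == "__main__":
    labels, m1, m2 = simulate_dataset(1000, p_positive=0.2, tables=EXAMPLE_TABLES)
    print(sum(labels), sum(m1), sum(m2))
```

Each simulated dataset would then be passed to a paired randomization test on F1, and the proportion of datasets found significant gives the power estimate.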
Appendix G Additional Details for the BLEU Scores Power Analysis
In this section, we provide further details for the machine translation (MT) data generation procedure as well as an analysis of how power varies for a range of values of the two parameters estimated from the empirical observations.
g.1 Data Generation Procedure
Recall that using the randomization test to determine whether two MT systems are statistically different gives rise to the null distribution of differences in BLEU.
In our case, the answer to this question lies in establishing a relationship between individual samples and the permuted set within each trial of the randomization test.
This relationship is as follows: the sum of individual changes to the difference in BLEU, from swapping single samples at a time, closely approximates the net change to the difference in BLEU, from swapping those samples all at once.
This relationship is illustrated in Figure 15: Figure 15(a) shows the difference between two models evaluated on the 2019 test set, and Figure 15(b) shows the difference between a different pair of models evaluated on the 2018 test set. We found that the same relationship holds for the 2017 and 2016 test sets as well.
Now that we have established a relationship that closely approximates the outcome of each randomization trial, all that remains is to define a distribution from which the individual changes to the difference in BLEU can be sampled. This distribution is a mixture of a Delta distribution at zero and a Laplace distribution. The Delta distribution accounts for the proportion of samples for which swapping any one of them individually results in no change to the difference in BLEU, i.e., the effect is zero. For the remaining samples, we fit a Laplace distribution, as shown in Figure 16. This Laplace is parametrized by two parameters: location and scale. By fitting this mixture to the individual effects computed from evaluating BLEU differences on many pairs of models, we find that the scale parameter is inversely proportional to the size of the dataset. Thus, we report an overall base scale value for each dataset, equal to the fitted Laplace scale parameter multiplied by the number of samples in that dataset.
For generating synthetic data, we need to specify the location and scale of the Laplace, as well as the proportion of zero-effect samples. However, because we want the effect of swapping half the nonzero samples from this distribution to equal the difference in BLEU between models, we use the above fits only for the remaining parameters and not for the location. We thus complete the generative process by assuming values for the sample size, the difference in BLEU, the proportion of zero-effect samples, and the base scale, and setting the location such that the average effect of a random subset of instances is consistent with the assumed difference in BLEU. Table 3 in the main paper shows a range of observed values for the zero-effect proportion and the base scale.
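The generative process above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the function and argument names are hypothetical, the number of zero-effect samples is fixed deterministically rather than drawn at random, and the sign convention (all effects summing to about minus twice the BLEU difference, so that swapping every sample reverses the model scores) is our reading of the description.

```python
import random


def sample_individual_effects(n, delta_bleu, p_zero, base_scale, rng):
    """Draw n per-sentence effects from a Delta-at-zero / Laplace mixture.

    All names are hypothetical. Each effect is the change to the BLEU
    difference caused by swapping that one sentence between the two systems.
    """
    n_zero = round(p_zero * n)      # sentences whose swap changes nothing
    n_nonzero = n - n_zero
    if n_nonzero <= 0:
        raise ValueError("need at least one sentence with a nonzero effect")
    scale = base_scale / n          # fitted scale shrinks in proportion to 1/n
    # Assumed sign convention: swapping every sentence reverses the scores,
    # so the effects together sum to about -2 * delta_bleu.
    location = -2.0 * delta_bleu / n_nonzero
    effects = [0.0] * n_zero
    for _ in range(n_nonzero):
        # The difference of two unit exponentials is a standard Laplace draw.
        noise = rng.expovariate(1.0) - rng.expovariate(1.0)
        effects.append(location + scale * noise)
    rng.shuffle(effects)
    return effects
```

For instance, `sample_individual_effects(2000, 1.0, 0.2, base_scale, random.Random(0))` would give 400 zero effects and 1,600 Laplace draws; `base_scale` has to be supplied from a fit such as those reported in Table 3.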
G.2 Variation in Power Estimates for a Range of Parameter Values
Now that we have defined the data generation procedure and have estimates for the two parameters needed to simulate datasets (the zero-effect proportion and the base scale), we can estimate power for a range of sample sizes and differences in BLEU, and see how these estimates vary as the two parameters change. To provide a concrete example, suppose that we have two machine translation models that we expect will differ by 1 BLEU point. For a dataset of 2,000 sentences, we assume that the models will perform equally on 20% of sentences (a zero-effect proportion of 0.2), and we assume a value for the base scale parameter. To compute power, we would follow the process in Algorithm 1, with the following modifications. On each iteration, we would draw individual changes to the difference in BLEU from the distribution specified above, using these values for the sample size, the difference in BLEU, the zero-effect proportion, and the base scale. For each such draw, we would apply the randomization test to compute a null distribution, using the sum of individual amounts as the total effect of flipping a random subset of pairs. Based on the null distribution, we determine whether the difference is significant for this trial. Repeating this many times and observing the proportion of trials that are found to be significant gives us the approximate power.
Figure 17 shows power for a range of values of these parameters. When the zero-effect proportion is low, as is true for the observed data in Table 3, effect sizes and sample sizes need to be larger in order for an experiment to be well-powered. But as this proportion gets higher, a given effect size can be detected with a smaller sample size. On the other hand, as the base scale increases, and consequently the scale parameter for the Laplace grows, even large effect sizes cannot be detected by test sets containing 5,000 samples.
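As a rough illustration of the simulation loop just described, the following Python sketch estimates power by repeatedly drawing individual effects and applying an approximate randomization test to them. It is a sketch under assumptions rather than the authors' code: all names are hypothetical, the test is two-sided, the observed difference is recovered as minus half the sum of the effects (our assumed sign convention), and each trial swaps every sentence independently with probability 0.5.

```python
import random


def draw_effects(n, delta_bleu, p_zero, base_scale, rng):
    """Per-sentence effects: zeros plus Laplace(location, base_scale / n) draws."""
    n_nonzero = n - round(p_zero * n)
    location = -2.0 * delta_bleu / n_nonzero    # assumed sign convention
    scale = base_scale / n
    nonzero = [location + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
               for _ in range(n_nonzero)]
    return nonzero + [0.0] * (n - n_nonzero)


def randomization_p_value(effects, n_perm, rng):
    """Two-sided p-value, summing individual effects to get each trial's change."""
    observed = -sum(effects) / 2.0              # swapping everything gives -observed
    extreme = 0
    for _ in range(n_perm):
        change = sum(e for e in effects if rng.random() < 0.5)
        if abs(observed + change) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


def estimate_power(n, delta_bleu, p_zero, base_scale,
                   n_sims=100, n_perm=200, alpha=0.05, seed=0):
    """Proportion of simulated test sets on which the difference is significant."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        effects = draw_effects(n, delta_bleu, p_zero, base_scale, rng)
        if randomization_p_value(effects, n_perm, rng) <= alpha:
            significant += 1
    return significant / n_sims
```

Calling `estimate_power(2000, 1.0, 0.2, base_scale)` with a fitted `base_scale` would correspond to the worked example in the text. The defaults for `n_sims` and `n_perm` are deliberately small to keep the pure-Python loop fast and should be increased for a real analysis.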
Appendix H Details of Human Evaluation Section
H.1 Meta-analysis of human ratings for EMNLP 2019
To assess the state of statistical power in a typical NLP study using human evaluation, we sampled papers from the main EMNLP 2019 workshop that contained the phrase “human eval”. This first pass returned 117 papers, of which 86 had relevant human evaluations (in which models were compared), with the remainder either referencing human evaluation, or containing some other type of evaluation, such as comparing the agreement between automated metrics and human performance. Because some papers had more than one such evaluation, we had 97 experiments for analysis. Of these, 51 were Likert experiments (as discussed in the main text), 38 were some form of direct model comparison, and 8 were other.
Significance testing was rare: it was reported, in some form, in only 24% of experiments. Bolding or starring the best results in a table was more common, occurring in 63% of human rating experiments in our set. Whether bolding a result implies that the authors are claiming a meaningful difference is not always clear. We found a single case of authors performing a power analysis to estimate sample size among the papers we surveyed (Garbacea et al., 2019). However, because that paper did not involve a comparison of models to a baseline, it was not included in our analysis. In addition, we note that few details were provided, such that we were unable to ascertain precisely how the power analysis was done.
Because we chose to focus on ordinal ratings, we further annotated those in order to record the mean ratings and experimental characteristics (number of annotators, number of items, number of annotators per item), as well as all differences for all metrics between the model being proposed and the best performing baseline evaluated in the paper, as discussed in the main text.
H.2 Human evaluation datasets
For our analyses, we make use of the following datasets:

From Hashimoto et al. (2019) we use the evaluation data for Reddit, language modeling, and summarization. The data is available at https://worksheets.codalab.org/worksheets/0x88644b5ee189402eb19d39d721d1005c

From Dathathri et al. (2020) we use the available ratings. The data is available at https://github.com/uber-research/PPLM

For WMT19 (http://statmt.org/wmt19/translation-task.html), the data is available at https://www.computing.dcu.ie/~ygraham/newstest2019humaneval.tar.gz

For Holtzman et al. (2020), we obtain the human evaluation data directly from the authors.
H.3 Linear Mixed Effect Models
To assess power in the human ratings framework, we used linear mixed effect models with random intercepts and slopes for worker and item, as in Barr et al. (2013). Following best practices, we use the following structure, in which each rating is indexed by a particular worker and a particular item. There are seven parameters, corresponding to the parameters needed for running a power analysis: two fixed effects (the intercept and the model effect), variance parameters for the worker intercept and the item intercept, and their respective slope variance parameters. There is also a variance parameter for the overall error. We transform the Likert ratings to be on a [0, 1] scale and treat them as normally distributed (which we note is an imperfect assumption). We give fit parameters for these values, on a few datasets, in Tables 13, 14, and 15.
(5)  
(6)  
(7)  
(8)  
(9)  
(10) 
For simplicity and to avoid convergence issues, we do not include a correlation parameter in the random effect structure.
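For readers who want the model in symbols, one standard way of writing a linear mixed effect model with uncorrelated random intercepts and slopes for worker and item is given below. The notation is an assumption introduced here for illustration and may differ from the paper's own: $y_{w,i}$ is the rating given by worker $w$ to item $i$, and $x_{w,i} \in \{0, 1\}$ indicates whether the rated output comes from the baseline or from the proposed model.

```latex
\begin{align}
y_{w,i} &= \beta_0 + u_{0,w} + v_{0,i}
           + (\beta_1 + u_{1,w} + v_{1,i})\, x_{w,i} + \epsilon_{w,i} \\
u_{0,w} &\sim \mathcal{N}(0, \sigma^2_{u_0}) \\
u_{1,w} &\sim \mathcal{N}(0, \sigma^2_{u_1}) \\
v_{0,i} &\sim \mathcal{N}(0, \sigma^2_{v_0}) \\
v_{1,i} &\sim \mathcal{N}(0, \sigma^2_{v_1}) \\
\epsilon_{w,i} &\sim \mathcal{N}(0, \sigma^2_{\epsilon})
\end{align}
```

Under this notation the seven parameters are the two fixed effects $\beta_0$ and $\beta_1$, the four random-effect variances, and the residual variance $\sigma^2_{\epsilon}$.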
To assess power, we use two possible variance settings derived from the model fits (“high variance” and “low variance” settings, in the main text) and show these in Table 16. We systematically vary the number of annotators (always assuming each annotator annotates each item, which is not always true in typical experiments), the number of items, and the effect size. We note that simulations can be customized to the planned analysis, including aspects such as how many items will be annotated by each annotator.
To compute power, we use each setting of the parameters to simulate 200 experiments and compute the proportion that detect a significant positive effect. Significant effects in the opposite direction do not count as detections. Code for these model fits and simulations is included with the online materials. However, we note that these should be used as a starting point, rather than being blindly copied, as details may differ in each experimental setting.
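The data-generating half of such a simulation can be sketched as follows. This is an illustrative sketch only: the function and argument names are hypothetical, the spread parameters are passed as standard deviations (take square roots first if your fitted values are variances), every worker rates both outputs for every item, and ratings are left unclipped. A full power analysis would call this once per simulated experiment (200 times per setting in the text), fit the mixed effect model to each dataset with a mixed-model library, and count the fits that show a significant positive model effect.

```python
import random


def simulate_ratings(n_workers, n_items, beta0, beta1,
                     sd_worker, sd_worker_slope, sd_item, sd_item_slope,
                     sd_resid, rng):
    """Simulate one fully crossed experiment.

    Returns a list of (worker, item, x, rating) rows, where x = 0 marks the
    baseline output and x = 1 the proposed model's output. All names here
    are hypothetical.
    """
    worker_int = [rng.gauss(0.0, sd_worker) for _ in range(n_workers)]
    worker_slope = [rng.gauss(0.0, sd_worker_slope) for _ in range(n_workers)]
    item_int = [rng.gauss(0.0, sd_item) for _ in range(n_items)]
    item_slope = [rng.gauss(0.0, sd_item_slope) for _ in range(n_items)]
    rows = []
    for w in range(n_workers):
        for i in range(n_items):
            for x in (0, 1):
                mean = (beta0 + worker_int[w] + item_int[i]
                        + (beta1 + worker_slope[w] + item_slope[i]) * x)
                rows.append((w, i, x, mean + rng.gauss(0.0, sd_resid)))
    return rows


def mean_difference(rows):
    """Raw difference in mean rating between the two conditions (sanity check)."""
    treated = [r for (_, _, x, r) in rows if x == 1]
    control = [r for (_, _, x, r) in rows if x == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)
```

The raw mean difference is included only as a quick check that the simulated effect has the intended size; it ignores the worker and item structure and is not a substitute for fitting the mixed model.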
Dataset  Number of Workers  Number of Items 

Hashimoto et al. (2019) (LM)  124  50 
Hashimoto et al. (2019) (summarization)  96  99 
Hashimoto et al. (2019) (Reddit)  123  99 
WMT19  176  1997 
Dathathri et al. (2020)  15  1358 
Holtzman et al. (2020)  140  1399 
Dataset  

Hashimoto et al. (2019) (LM)  0.55  0.03  0.25  
Hashimoto et al. (2019) (summarization)  0.58  0.06  0.26  
Hashimoto et al. (2019) (Reddit)  0.55  0.05  0.03  0.01  0.23  
WMT19  0.86  0.04  0.12  
Dathathri et al. (2020)  0.62  0.04  0.05  0.03  0.16  
Holtzman et al. (2020)  0.59  0.02  0.04  0.02  0.01  0  0.04  0.16 
Dataset  

Hashimoto et al. (2019) (LM)  0  0.11  0.11  
Hashimoto et al. (2019) (summarization)  0  0.13  0.11  
Hashimoto et al. (2019) (Reddit)  0.11  0.04  0.08  0.06  0.17  
WMT19  0.07  0.04  0.13  
Dathathri et al. (2020)  0  0.04  0.05  0.05  0.05  
Holtzman et al. (2020)  0.09  0.05  0.03  0.04  0.04  0.02  0.04 
Dataset  

Hashimoto et al. (2019) (LM)  0.04  0.14  0.1  
Hashimoto et al. (2019) (summarization)  0.07  0  0.18  
Hashimoto et al. (2019) (Reddit)  0  0.13  0.11  0.14  0.14  
WMT19  0.05  0.03  0.15  
Dathathri et al. (2020)  0  0.16  0.19  0.16  0.16  
Holtzman et al. (2020)  0  0.13  0.1  0.12  0.11  0.13  0.13 
Scenario  

Low variance  0.01  0.04  0.01  0.13  0.16 
High variance  0.01  0.11  0.04  0.14  0.26 
H.4 Head to head human evaluations
Another commonly used form of human evaluation is head to head comparison, where raters are shown a pair of outputs (one from each model), and asked to choose which they prefer, sometimes with “neither” as a third option. Head to head comparisons offer some advantages over ratings-based approaches (Yannakakis and Martínez, 2015; van der Lee et al., 2019), but do not scale as well when comparing many models.
As with ordinal judgements, there are multiple ways of analyzing such data. If we treat annotator judgements as independent and identically distributed (such as if we only collect one judgement from each annotator), we could model this simply in terms of the underlying probabilities that a random annotator will prefer each model (as in the opening example in the main paper). In that case, running a power analysis would be as simple as assuming values for the underlying probabilities of each category (win, lose, draw), as usual based on pilot data or prior assumptions, and simulating many draws from that prior, checking in each sample to see if there is a statistically significant difference between win and lose.
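A minimal sketch of this simple case is shown below, assuming independent judgements. The function names are hypothetical, and the choice of significance test is ours: an exact two-sided binomial (sign) test on wins versus losses with draws excluded, which is one reasonable option rather than the only one.

```python
import math
import random


def sign_test_p_value(wins, losses):
    """Exact two-sided binomial test of wins vs. losses (draws already dropped)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def head_to_head_power(p_win, p_lose, n_judgements,
                       n_sims=1000, alpha=0.05, seed=0):
    """Share of simulated studies with a significant win/lose difference."""
    rng = random.Random(seed)
    significant = 0
    for _sim in range(n_sims):
        wins = losses = 0
        for _judgement in range(n_judgements):
            u = rng.random()
            if u < p_win:
                wins += 1
            elif u < p_win + p_lose:
                losses += 1
            # Otherwise the judgement is a draw and is excluded from the test.
        if sign_test_p_value(wins, losses) <= alpha:
            significant += 1
    return significant / n_sims
```

For example, `head_to_head_power(0.45, 0.35, 200)` would estimate power for 200 judgements when 45% of annotators are assumed to prefer the new model, 35% the baseline, and 20% neither; these probabilities are illustrative placeholders, not values from the paper.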
On the other hand, if multiple judgements will be collected from each annotator and/or for each pair of outputs, then it makes sense to use a richer model to account for all sources of variation, as described above (see §H.3). In particular, the mixed effects framework can be adopted, potentially by modeling the outcome as a logistic model (in the case of win or lose), with ties either excluded or split.
Appendix I Additional Plots of Model Overlap
Footnotes
 https://github.com/dallascard/NLP-power-analysis
 Using the observed outcome from a single experiment to compute power falls into the trap of post-hoc power analysis and is not recommended. For additional background on statistical power, power analysis, null-hypothesis significance testing, and post-hoc analysis, please refer to Appendix A.
 We don’t need to address variance in this scenario, as the variance of a binomial distribution is a function of its mean.
 More direct solutions are available for some settings, including this one (see Appendix E.5), but we describe it using the generic approach from Figure 2 for the purpose of illustration. For all cases examined in this paper, simulations take only minutes on a laptop.
 Unpaired data (i.e., if two models are evaluated on different data drawn from the same distribution) requires a different approach, such as using a binomial test. See Appendix E.5 for extended discussion.
 Corresponding plots showing Type-M and Type-S error (Gelman and Carlin, 2014) are in Appendix B. To walk through a numerical example, see Appendix C. For an interactive example, see the accompanying online notebooks.
 WNLI (Levesque et al., 2012), MRPC (Dolan and Brockett, 2005), SST-2 (Socher et al., 2013), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), QNLI (Rajpurkar et al., 2016), MNLI (Williams et al., 2018), and QQP (Iyer et al., 2017). For consideration of other metrics, see Appendix F.
 It is also worth exploring power with respect to claims of improvement on multiple tasks with a single model (Demšar, 2006), rather than each task individually. We leave consideration of this as an interesting direction for future work.
 Note that swapping all examples would reverse the model scores, equivalent to a net effect of .
 For a sensitivity analysis of how power varies under different assumptions for and , please see Figure 17 in Appendix G.
 We exclude from this analysis two large negative effects with which would exaggerate this correlation.
 We use publicly available or author-provided data from Hashimoto et al. (2019); Dathathri et al. (2020); Holtzman et al. (2020), and WMT19 (links in Appendix H.2).
 These simulations require estimates for 7 parameters: the baseline, the effect size, variance by worker, variance by worker as a function of model, variance by item, variance by item as a function of model, and residual variance.
 The bootstrap is another valid approach to testing for differences between models (Koehn, 2004; Graham et al., 2014; Dror et al., 2018), though note the concerns highlighted by Riezler and Maxwell (2005).
 Note that this does not directly solve the problem of computing BLEU at the sentence level (Chen and Cherry, 2014), as it is still mimicking the process of evaluating BLEU on a corpus.
References
 Comparing spoken dialog corpora collected with recruited subjects versus real users. In Proceedings of SIGdial, External Links: Link Cited by: §2.
 Fixed-sample-size analysis of sequential observations. Biometrics 10, pp. 89–100. External Links: Document Cited by: Appendix A.
 Statistical power in two-level models: a tutorial based on Monte Carlo simulation. Psychological Methods 24 (1), pp. 1–19. External Links: Document Cited by: Table 9.
 Not all claims are created equal: Choosing the right statistical approach to assess hypotheses. In Proceedings of ACL, External Links: Document Cited by: Appendix A, §1.
 The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, External Links: Link Cited by: §E.1.1, footnote 7.
 Random effects structure for confirmatory hypothesis testing: keep it maximal. Journal of Memory and Language 68 (3), pp. 255–278. External Links: Document Cited by: §H.3, §5.2, §5.2.
 Operating characteristics of a rank correlation test for publication bias. Biometrics 50 (4), pp. 1088–1101. External Links: Link Cited by: §5.1.
 The fifth PASCAL recognizing textual entailment challenge. In Proceedings of TAC, Cited by: §E.1.1, footnote 7.
 An empirical investigation of statistical significance in NLP. In Proceedings of EMNLP, External Links: Link Cited by: Appendix A, §1.
 Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14 (5), pp. 365–376. External Links: Document Cited by: §1.
 A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of WMT, External Links: Document Cited by: footnote 15.
 BAM! Born-again multi-task networks for natural language understanding. In Proceedings of ACL, External Links: Document Cited by: §E.1.1.
 ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of ICLR, External Links: Link Cited by: §E.1.1.
 The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology 65 (3), pp. 145–153 (eng). External Links: ISSN 0096-851X, Document Cited by: §1.
 Sample size and power for pair-matched case-control studies. Statistics in Medicine 6 (1), pp. 53–59. External Links: Document Cited by: §E.2.1, §E.2.1, §E.3.
 The PASCAL recognising textual entailment challenge. In Proceedings of the Machine Learning Challenges Workshop, External Links: Document Cited by: §E.1.1, footnote 7.
 Plug and play language models: A simple approach to controlled text generation. In Proceedings of ICLR, External Links: Link Cited by: 2nd item, Table 12, Table 13, Table 14, Table 15, footnote 12.
 Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, pp. 1–30. External Links: Link Cited by: footnote 8.
 BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, External Links: Document Cited by: §E.1.1, §E.4, §3.2.
 Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation 10 (7), pp. 1895–1923. External Links: Document Cited by: §3.1.
 Show your work: Improved reporting of experimental results. In Proceedings of EMNLP, External Links: Document Cited by: Appendix A.
 Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), External Links: Link Cited by: §E.1.1, footnote 7.
 The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of ACL, External Links: Document Cited by: item 1, Appendix F, §1, §3.1, §4, footnote 14.
 Asymptotic and exact power for the McNemar test and its analogue with R controls per case. Biometrics 40, pp. 1005–1015. External Links: Document Cited by: §E.2.1, §E.2.1.
 Understanding back-translation at scale. In Proceedings of EMNLP, External Links: Document Cited by: Table 3.
 The McNemar test for binary matched-pairs data: mid-p and asymptotic are better than exact conditional. BMC Medical Research Methodology 13. External Links: Document Cited by: §E.2.1, §E.2.1.
 Judge the judges: A large-scale evaluation study of neural language models for online review generation. In Proceedings of EMNLP, External Links: Document Cited by: §H.1.
 Convolutional sequence to sequence learning. In Proceedings of ICML, External Links: Link Cited by: Table 3.
 Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science 9 (6), pp. 641–651. External Links: Document Cited by: Appendix B, §1, §1, footnote 6.
 Data analysis using regression and multilevel/hierarchical models. Cambridge University Press. External Links: Document, ISBN 9780511790942 Cited by: §5.2.
 The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. External Links: Link Cited by: Appendix A.
 Don’t calculate post-hoc power using observed estimate of effect size. Annals of Surgery 269 (1), pp. e9–e10. External Links: Document Cited by: Appendix A.
 The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, External Links: Link Cited by: §E.1.1, footnote 7.
 Randomized significance tests in machine translation. In Proceedings of WMT, External Links: Document Cited by: footnote 14.
 Unifying human and statistical evaluation for natural language generation. In Proceedings of NAACL, External Links: Document Cited by: 1st item, Table 12, Table 13, Table 14, Table 15, footnote 12.
 The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician 55 (1), pp. 19–24. External Links: Document, Link, https://doi.org/10.1198/000313001300339897 Cited by: Appendix A.
 The curious case of neural text degeneration. In Proceedings of ICLR, External Links: Link Cited by: 4th item, Table 12, Table 13, Table 14, Table 15, footnote 12.
 The power of bias in economics research. The Economic Journal 127 (605), pp. F236–F265. External Links: Document Cited by: §1.
 What have we (not) learnt from millions of scientific papers with p values?. The American Statistician 73 (sup1), pp. 20–25. External Links: Document Cited by: Appendix A.
 First Quora dataset release: Question pairs. External Links: Link Cited by: §E.1.1, footnote 7.
 Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, External Links: Link Cited by: §1, footnote 14.
 How many subjects?: Statistical power analysis in research. SAGE. External Links: ISBN 9781483319537 Cited by: §2.
 On the sample size for studies based upon McNemar’s test. Statistics in Medicine 11 (11), pp. 1521–1525. External Links: Document Cited by: Figure 14, §E.2.1, §E.3, §E.3, §E.4, Table 9, §3.2.
 ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of ICLR, External Links: Link Cited by: §E.1.1.
 The Winograd schema challenge. In Proceedings of KR, Cited by: §E.1.1, footnote 7.
 RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository arXiv:1907.11692. External Links: Link Cited by: §E.1.1.
 BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. Computing Research Repository arXiv:1911.02969. External Links: Link Cited by: §1.
 Abandon statistical significance. The American Statistician 73 (sup1), pp. 235–245. External Links: Document Cited by: Appendix A.
 Facebook FAIR’s WMT19 news translation task submission. In Proceedings of WMT, External Links: Document Cited by: Table 3.
 Brief report: post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: sorting out appropriate uses of statistical power analyses. Communication Methods and Measures 1 (4), pp. 291–299. External Links: Document, Link, https://doi.org/10.1080/19312450701641375 Cited by: Appendix A.
 FAIRSEQ: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL, External Links: Document Cited by: §4.
 Scaling neural machine translation. In Proceedings of WMT, External Links: Document Cited by: Table 3.
 BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL, External Links: Document Cited by: §4.
 Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. Computing Research Repository arXiv:1811.01088. External Links: Link Cited by: §E.1.2.
 A call for clarity in reporting BLEU scores. In Proceedings of WMT, External Links: Document Cited by: §4.
 Exploring the limits of transfer learning with a unified text-to-text transformer. Computing Research Repository arXiv:1910.10683. External Links: Link Cited by: §E.1.1.
 Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of ACL, External Links: Document Cited by: §3.2.
 SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, External Links: Document Cited by: §E.1.1, §3.2, footnote 7.
 On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, External Links: Link Cited by: footnote 14.
 Publication bias: The “file-drawer” problem in scientific inference. arXiv arXiv:physics/9909033. External Links: Link Cited by: §1.
 Case-control studies: Design, conduct, analysis. Oxford University Press. Cited by: §E.2.1, §E.2.1.
 Green AI. Computing Research Repository arXiv:1907.10597. External Links: Link Cited by: Appendix A.
 Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, External Links: Link Cited by: §E.1.1, footnote 7.
 What’s in a p-value in NLP?. In Proceedings of CoNLL, External Links: Document Cited by: §1.
 The 2 x 2 matched-pairs trial: Exact unconditional design and analysis. Biometrics 47 (2), pp. 361–372. External Links: Document Cited by: §E.2.1, §E.2.1, §E.2.1.
 Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS biology 15 (3). External Links: Document Cited by: §1.
 Best practices for the human evaluation of automatically generated text. In Proceedings of INLG, External Links: Document Cited by: Appendix A, §H.4, §5.1, §5.2, §5.
 A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review 14, pp. 779–804. External Links: Document Cited by: Appendix A.
 GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the Workshop on BlackboxNLP, External Links: Document Cited by: §3.2.
 Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General 143 (5), pp. 2020–2045. External Links: Document Cited by: §2.
 A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL, External Links: Document Cited by: §E.1.1, footnote 7.
 XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of NeurIPS, External Links: Link Cited by: §E.1.1.
 Ratings are overrated!. Frontiers in ICT 2. External Links: Document Cited by: §H.4.