Utility is in the Eye of the User: A Critique of NLP Leaderboards
Benchmarks such as GLUE have helped drive advances in NLP by incentivizing the creation of more accurate models. While this leaderboard paradigm has been remarkably successful, a historical focus on performance-based evaluation has been at the expense of other qualities that the NLP community values in models, such as compactness, fairness, and energy efficiency. In this opinion paper, we study the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomic theory. We frame both the leaderboard and NLP practitioners as consumers and the benefit they get from a model as its utility to them. With this framing, we formalize how leaderboards – in their current form – can be poor proxies for the NLP community at large. For example, a highly inefficient model would provide less utility to practitioners but not to a leaderboard, since it is a cost that only the former must bear. To allow practitioners to better estimate a model’s utility to them, we advocate for more transparency on leaderboards, such as the reporting of statistics that are of practical concern (e.g., model size, energy efficiency, and inference latency).
1 Introduction
The past few years have seen significant progress on a variety of NLP tasks, from question answering to machine translation. These advances have been driven in part by benchmarks such as GLUE (Wang et al., 2018), whose leaderboards rank models by how well they perform on these diverse tasks. Nor is performance-based evaluation on a shared task a recent idea; this sort of shared challenge has been an important driver of progress since MUC (Sundheim, 1995). While this paradigm has been successful at driving the creation of more accurate models, the historical focus on performance-based evaluation has come at the expense of other attributes valued by the NLP community, such as fairness and energy efficiency (Bender and Friedman, 2018; Strubell et al., 2019). For example, a highly inefficient model would have limited use in practical applications, but this would not preclude it from reaching the top of most leaderboards. Similarly, models can reach the top while containing racial and gender biases – and indeed, some have (Bordia and Bowman, 2019; Manzini et al., 2019; Rudinger et al., 2018; Blodgett et al., 2020).
Microeconomics provides a useful lens through which to study the divergence between what is incentivized by leaderboards and what is valued by practitioners. We can frame both the leaderboard and NLP practitioners as consumers of models and the benefit they receive from a model as its utility to them. Although leaderboards are inanimate, this framing allows us to make an apples-to-apples comparison: if the priorities of leaderboards and practitioners are perfectly aligned, their utility functions should be identical; the less aligned they are, the greater the differences. For example, the utility of both groups is monotonic non-decreasing in accuracy, so a more accurate model is no less preferable to a less accurate one, holding all else constant. However, while the utility of practitioners is also sensitive to the size and efficiency of a model, the utility of leaderboards is not. By studying such differences, we formalize some of the limitations in contemporary leaderboard design:
Non-Smooth Utility: For a leaderboard, an improvement in model accuracy on a given task only increases utility when it also increases rank. For practitioners, any improvement in accuracy can increase utility.
Prediction Cost: Leaderboards treat the cost of making predictions (e.g., model size, energy efficiency, latency) as being zero, which does not hold in practice.
Robustness: Practitioners receive higher utility from a model that is more robust to adversarial perturbations, generalizes better to out-of-distribution data, and is equally fair to all demographics. However, these benefits leave leaderboard utility unchanged.
We contextualize these limitations with examples from the ML fairness (Barocas et al., 2017; Hardt et al., 2016), Green AI (Strubell et al., 2019; Schwartz et al., 2019), and robustness literature (Jia and Liang, 2017). These three limitations are not comprehensive – other problems can also arise, which we leave for future work.
What changes can we make to leaderboards so that their utility functions better reflect that of the NLP community at large? Given that each practitioner has their own preferences, there is no way to rank models so that everyone is satisfied. Instead, we suggest that leaderboards demand transparency, requiring the reporting of statistics that are of practical concern (e.g., model size, energy efficiency). This is akin to the use of data statements for mitigating bias in NLP systems (Gebru et al., 2018; Mitchell et al., 2019; Bender and Friedman, 2018). This way, practitioners can determine the utility they receive from a given model with relatively little effort. Dodge et al. (2019) have suggested that model creators take it upon themselves to report these statistics, but without leaderboards requiring it, there is little incentive to do so.
2 Utility Functions
In economics, the utility of a good denotes the benefit that a consumer receives from it (Mankiw, 2020). We specifically discuss the theory of cardinal utility, in which the amount of the good consumed can be mapped to a numerical value that quantifies its utility in utils (Mankiw, 2020). For example, a consumer might assign a value of 10 utils to two apples and 8 utils to one orange; we can infer both the direction and magnitude of the preference.
We use the term leaderboard to refer to any ranking of models or systems using performance-based evaluation on a shared benchmark. In NLP, this includes both longstanding benchmarks such as GLUE (Wang et al., 2018) and one-off challenges such as the annual SemEval STS tasks (Agirre et al., 2013, 2014, 2015). Nor is this a recent idea; this paradigm has been a driver of progress since MUC (Sundheim, 1995). All we assume is that all models are evaluated on the same held-out test data.
In our framework, leaderboards are consumers whose utility is solely derived from the rank of a model. Framing leaderboards as consumers is unorthodox, given that they are inanimate – in fact, it might seem more intuitive to say that leaderboards are another kind of product that is also consumed by practitioners. While that perspective is valid, what we ultimately care about is how good of a proxy leaderboards are for practitioner preferences. Framing both leaderboards and practitioners as consumers permits an apples-to-apples comparison using their utility functions. If a leaderboard were only thought of as a product, it would not have such a function, precluding such a comparison.
Unlike most kinds of consumers, a leaderboard is a consumer whose preferences are perfectly revealed through its rankings: the state-of-the-art (SOTA) model is preferred to all others, the second ranking model is preferred to all those below it, and so on. Put more formally, leaderboard utility is monotonic non-decreasing in rank. Still, because each consumer is unique, we cannot know the exact shape of a leaderboard utility function – only that it possesses this monotonicity.
Practitioners are also consumers, but they derive utility from multiple properties of the model being consumed (e.g., accuracy, energy efficiency, latency). Each input into their utility function is some desideratum, but since each practitioner applies the model differently, the functions can be different. For example, someone may assign higher utility to BERT-Large (Devlin et al., 2019) and its 95% accuracy on some task, while another may assign higher utility to the smaller BERT-Base and its 90% accuracy. As with leaderboards, although the exact shapes of practitioner utility functions are unknown, we can infer that they are monotonic non-decreasing in each desideratum. For example, more compact models are more desirable, so increasing compactness while holding all else constant will never decrease utility.
3 Utilitarian Critiques
Our criticisms apply regardless of the shape taken by practitioner utility functions – a necessity, given that the exact shapes are unknown. However, not every criticism applies to every leaderboard. StereoSet (Nadeem et al., 2020) is a leaderboard that ranks language models by how unbiased they are, so fairness-related criticisms would not apply as much to StereoSet. Similarly, the SNLI leaderboard (Bowman et al., 2015) reports the model size – a cost of making predictions – even if it does not factor this cost into the model ranking. Still, most of our criticisms apply to most leaderboards in NLP, and we provide examples of well-known leaderboards that embody each limitation.
3.1 Non-Smoothness of Utility
Leaderboards only gain utility from an increase in accuracy when it improves the model’s rank. This is because, by definition, the leaderboard’s preferences are perfectly revealed through its ranking – there is no model that can be preferred to another while having a lower rank. Since an increase in accuracy does not necessarily trigger an increase in rank, it does not necessarily trigger an increase in leaderboard utility either.
Put another way, the utility of leaderboards takes the form of a step function, meaning that it is not smooth with respect to accuracy. In contrast, the utility of practitioners is smooth with respect to accuracy. Holding all else constant, an increase in accuracy will yield some increase in utility, however small. Why this difference? The utility of leaderboards is a function of rank – it is only indirectly related to accuracy. On the other hand, the utility of practitioners is a direct function of accuracy, among other desiderata. To illustrate this difference, we contrast possible practitioner and leaderboard utility functions in Figure 1.
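This difference can be made concrete with a toy sketch. The functions below are our own constructions, not ones prescribed by any leaderboard: a rank-based leaderboard utility (any monotonic non-increasing map of rank would do) and a hypothetical linear practitioner utility; the accuracy figures are illustrative.

```python
# Toy illustration: leaderboard utility is a step function of accuracy,
# while practitioner utility is smooth in accuracy.

def leaderboard_utility(model_acc, other_accs):
    """Utility derived solely from rank (rank 1 = SOTA)."""
    rank = 1 + sum(acc > model_acc for acc in other_accs)
    return -rank  # any monotonic non-increasing map of rank works here

def practitioner_utility(model_acc):
    """A hypothetical smooth utility, strictly increasing in accuracy."""
    return 100 * model_acc

others = [0.78, 0.81, 0.92]  # hypothetical competing models

# Raising accuracy from 0.83 to 0.85 does not change rank...
assert leaderboard_utility(0.83, others) == leaderboard_utility(0.85, others)
# ...but any accuracy gain raises practitioner utility.
assert practitioner_utility(0.85) > practitioner_utility(0.83)
# Only surpassing the top model (0.92) moves leaderboard utility.
assert leaderboard_utility(0.93, others) > leaderboard_utility(0.85, others)
```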
This difference means that practitioners who are content with a less-than-SOTA model – as long as it is lightweight, perhaps – are under-served, while practitioners who want competitive-with-SOTA models are over-served. For example, on a given task, say that an n-gram baseline obtains an accuracy of 78%, an LSTM baseline obtains 81%, and a BERT-based SOTA obtains 92%. The leaderboard does not incentivize the creation of lightweight models that are 85% accurate, and as a result, few such models will be created. This is indeed the case with the SNLI leaderboard (Bowman et al., 2015), where most submitted models are highly parameterized and over 85% accurate.
This incentive structure leaves those looking for a lightweight model with limited options. This lack of smaller, more energy-efficient models has been an impediment to the adoption of Green AI (Schwartz et al., 2019; Strubell et al., 2019). Although there are increasingly more lightweight and faster-to-train models – such as ELECTRA, on which accuracy and training time can be easily traded off (Clark et al., 2020) – their creation was not incentivized by a leaderboard, despite there being a demand for such models in the NLP community. A similar problem exists with incentivizing the creation of fair models, though the introduction of leaderboards such as StereoSet (Nadeem et al., 2020) is helping bridge this divide.
3.2 Prediction Cost
Leaderboards of NLP benchmarks rank models by taking the average accuracy, F1 score, or exact match rate (Wang et al., 2018, 2019; McCann et al., 2018). In other words, they rank models purely by the value of their predictions; no consideration is given to the cost of making those predictions. We define ‘cost’ here chiefly as model size, energy-efficiency, and latency – essentially any sacrifice that needs to be made in order to use the model. In reality, no model is costless, yet leaderboards are cost-ignorant.
This means that a SOTA model can simultaneously provide high utility to a leaderboard and zero utility to a practitioner, by virtue of being too impractical to use. For some time, this was true of the 175 billion parameter GPT-3 (Brown et al., 2020), which achieved SOTA on several few-shot tasks, but whose sheer size precludes it from being fully reproduced by researchers. Even today, practitioners can only use GPT-3 through an API, access to which is restricted. The cost-ignorance of leaderboards disproportionately affects practitioners with fewer resources (e.g., independent researchers) (Rogers, 2020), since the resource demands would dwarf any utility from the model.
It should be noted that this limitation of leaderboards has not precluded the creation of cheaper models, given the real-world benefit of lower costs. For example, ELECTRA (Clark et al., 2020) can be trained up to several hundred times faster than traditional BERT-based models while performing comparably on GLUE. Similarly, DistilBERT is a distilled variant of BERT that is 40% smaller and 60% faster while retaining 97% of the language understanding (Sanh et al., 2019). There are many others like it as well (Zadeh and Moshovos, 2020; Hou et al., 2020; Mao et al., 2020). More efficiency and fewer parameters translate to lower costs.
Our point is not that there is no incentive at all to build cheaper models, but rather that this incentive is not baked into leaderboards, which are an important artefact of the NLP community. Because lower prediction costs improve practitioner utility, practitioners build cheaper models despite the lack of incentive from leaderboards. If lower prediction costs also improved leaderboard utility, there would be even more interest in creating such models (Rogers, 2020; Dodge et al., 2019). At the very least, making prediction costs publicly available would allow users to better estimate the utility they will get from a model, given that the leaderboard’s cost-ignorant ranking may be a poor proxy for their preferences.
3.3 Robustness
Leaderboard utility only depends on model rank, which in turn only depends on the model’s performance on the test data. A typical leaderboard would gain no additional utility from a model that was robust to adversarial examples, generalized well to out-of-distribution data, or was fair in a Rawlsian sense (i.e., by maximizing the welfare of the worst-off group) (Rawls, 2001; Hashimoto et al., 2018). In contrast, these are all attributes that NLP practitioners care about, particularly those who deploy NLP systems in real-world applications. Such interest is evidenced by the extensive literature on the lack of robustness in many SOTA models (Jia and Liang, 2017; Zhang et al., 2020).
There are many examples of state-of-the-art NLP models that were found to be brittle or biased. The question-answering dataset SQuAD 2.0 was created in response to the observation that existing systems could not reliably demur when presented with an unanswerable question (Rajpurkar et al., 2016, 2018). The perplexity of language models rises when they are given out-of-domain text (Oren et al., 2019). Many types of bias have also been found in NLP systems, with models performing better on gender-stereotypical inputs (Rudinger et al., 2018; Ethayarajh, 2020) and racial stereotypes being captured in embedding space (Manzini et al., 2019; Ethayarajh et al., 2019; Ethayarajh, 2019). Moreover, repeated resubmissions allow a model’s hyperparameters to be tuned to maximize performance, even on a private test set (Hardt, 2017).
Note that leaderboards do not necessarily incentivize the creation of brittle and biased models; rather, because leaderboard utility is so parochial, these unintended consequences are relatively common. Some recent work has addressed the problem of brittleness by offering certificates of performance against adversarial examples (Raghunathan et al., 2018a,b; Jia et al., 2019). To tackle gender bias, the SuperGLUE leaderboard considers accuracy on the WinoBias task (Wang et al., 2019; Zhao et al., 2018). Other work has proposed changes to prevent over-fitting via multiple resubmissions (Hardt and Blum, 2015; Hardt, 2017), while some have argued that this issue is overblown, given that question-answering systems submitted to the SQuAD leaderboard do not over-fit to the original test set (Miller et al., 2020). A novel approach even proposes using a dynamic benchmark instead of a static one, creating a moving target that is harder for models to overfit to (Nie et al., 2019).
4 The Future of Leaderboards
4.1 A Leaderboard for Every User
Given that each practitioner has their own utility function, models cannot be ranked in a way that satisfies everyone. Drawing inspiration from data statements (Bender and Friedman, 2018; Mitchell et al., 2019; Gebru et al., 2018), we instead recommend that leaderboards demand transparency and require the reporting of metrics that are relevant to practitioners, such as training time, model size, inference latency, and energy efficiency. Dodge et al. (2019) have suggested that model creators submit these statistics of their own accord, but without leaderboards requiring it, there would be no explicit incentive to do so. Although NLP workshops and conferences could also require this, their purview is limited: (1) models are often submitted to leaderboards before conferences; (2) leaderboards make these statistics easily accessible in one place.
Giving practitioners easy access to these statistics would permit them to estimate each model’s utility to them and then re-rank accordingly. This could be made even easier by offering an interface that allows the user to change the weighting on each metric and then using the chosen weights to dynamically re-rank the models. In effect, every user would have their own leaderboard. Ideally, users would even have the option of filtering out models that do not meet their criteria (e.g., those above a certain parameter count).
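As a sketch of what such an interface might compute, the snippet below re-ranks a set of models under user-chosen metric weights and an optional parameter-count filter. The model names, statistics, weights, and threshold are all invented for illustration.

```python
# Hypothetical per-user re-ranking: each user weights the reported
# statistics to taste (costs get negative weights) and can filter
# out models that do not meet their criteria.

models = [
    {"name": "big-sota",  "accuracy": 0.92, "params_m": 340, "latency_ms": 200},
    {"name": "distilled", "accuracy": 0.90, "params_m": 66,  "latency_ms": 60},
    {"name": "tiny",      "accuracy": 0.85, "params_m": 15,  "latency_ms": 20},
]

def rerank(models, weights, max_params_m=None):
    """Score each model by a weighted sum of its metrics, then sort."""
    pool = [m for m in models
            if max_params_m is None or m["params_m"] <= max_params_m]
    score = lambda m: sum(w * m[k] for k, w in weights.items())
    return sorted(pool, key=score, reverse=True)

# A latency-sensitive user penalizes slow models heavily...
fast_first = rerank(models, {"accuracy": 100, "latency_ms": -1.0})
# ...and an accuracy-focused user filters out models above 100M parameters.
small_only = rerank(models, {"accuracy": 100}, max_params_m=100)
```

In effect, each call to `rerank` produces one user's custom leaderboard from the same underlying reported statistics.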
This would have beneficial second-order effects as well. For example, reporting the costs of making predictions would put large institutions and poorly-resourced model creators on more equal footing (Rogers, 2020). This might motivate the creation of simpler methods whose ease-of-use makes up for weaker performance, such as weighted-average sentence embeddings (Arora et al., 2019; Ethayarajh, 2018). Even if a poorly-resourced creator could not afford to train the SOTA model du jour, they could at least compete on the basis of efficiency or create a minimally viable system that meets some desired threshold (Dodge et al., 2019; Dorr, 2011). Reporting the performance on the worst-off group, in the spirit of Rawlsian fairness (Rawls, 2001; Hashimoto et al., 2018), would also incentivize creators to improve worst-case performance.
4.2 A Leaderboard for Every Type of User
While each practitioner may have their own utility function, groups of practitioners – characterized by a shared goal – can be modelled with a single function. For example, programmers working on low latency applications (e.g., search engines, multiplayer games) will place more value on latency than others. In contrast, researchers submitting their work to a conference may place more value on accuracy, given that a potential reviewer may reject a model that is not state-of-the-art. Although there is variance within any group, this approach is tractable when there are many points of consensus.
How might we go about creating a leaderboard for a specific type of user? As proposed in the previous subsection, one option is to offer an interface that allows the user to change the utility function dynamically. If we wanted to create a static leaderboard for a group of users, however, we would need to estimate their utility function. This could be done explicitly or implicitly. The explicit way would be to ask questions that use the dollar value as a proxy for cardinal utility: e.g., Given a 100M parameter sentiment classifier with 200ms latency, how much would you pay per 1000 API calls? One could also try to estimate the derivative of the utility function with questions such as: How much would you pay to improve the latency from 200ms to 100ms, holding all else constant? When there are multiple metrics to consider, some assumptions are needed to tractably estimate the function – for example, conditional independence.
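Under the assumption that willingness-to-pay tracks cardinal utility, the derivative estimate from such questions reduces to a finite difference over the survey answers. The dollar figures below are hypothetical.

```python
# Illustrative finite-difference estimate of a group's marginal utility,
# using dollar willingness-to-pay (WTP) as a proxy for cardinal utility.

wtp_at_200ms = 0.50  # $ per 1000 API calls for the 200ms-latency model
wtp_at_100ms = 0.80  # $ per 1000 API calls after improving latency

# du/d(latency) ~ change in WTP / change in latency, all else held constant
marginal_utility = (wtp_at_100ms - wtp_at_200ms) / (100 - 200)  # $ per ms

assert marginal_utility < 0  # utility falls as latency rises
```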
The implicit alternative to estimating this function is to record which models practitioners actually use and then fit a utility function that maximizes the utility of the observed models. This approach is rooted in revealed preference theory (Samuelson, 1948) – we assume that what practitioners use reveals their latent preferences. Exploiting revealed preferences may be difficult in practice, however, given that usage statistics for models are not often made public and the decision to use a model might not be made with complete information.
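A minimal version of this implicit fitting could assume a linear utility function and use perceptron-style updates, so that every model a practitioner actually chose scores above the alternative they passed over. This is our sketch, not a method from the revealed-preference literature; the feature values (accuracy and a hypothetical compactness score) are invented.

```python
# Revealed-preference sketch: learn weights w of a linear utility
# u(m) = w . features(m) from observed (chosen, rejected) model pairs.

def fit_utility(choices, n_features, epochs=100, lr=0.1):
    """choices: list of (chosen_features, rejected_features) pairs."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in choices:
            margin = sum(wi * (c - r) for wi, c, r in zip(w, chosen, rejected))
            if margin <= 0:  # chosen model not yet ranked above rejected one
                for i in range(n_features):
                    w[i] += lr * (chosen[i] - rejected[i])
    return w

# Features: (accuracy, compactness). These users repeatedly passed over
# a more accurate but much larger model in favor of compact ones.
choices = [((0.90, -4.2), (0.92, -5.8)),
           ((0.85, -2.7), (0.92, -5.8))]
w = fit_utility(choices, n_features=2)

def utility(feats):
    return sum(wi * f for wi, f in zip(w, feats))

# The fitted function now prefers the models that were actually used.
assert utility((0.90, -4.2)) > utility((0.92, -5.8))
```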
5 Conclusion
In this work, we offered several criticisms of leaderboard design in NLP. While the leaderboard paradigm has helped create more accurate models, we argued that this has come at the expense of fairness, efficiency, and robustness, among other desiderata. We were not the first to criticize NLP leaderboards (Rogers, 2020; Crane, 2018), but we were the first to do so under a framework of utility, which we used to study the divergence between what is incentivized by leaderboards and what is valued by practitioners. Given the diversity of NLP practitioners, there is no one-size-fits-all solution; rather, leaderboards should demand transparency, requiring the reporting of statistics that may be of practical concern. Equipped with these statistics, each user could then estimate the utility that each model provides to them and re-rank accordingly, effectively creating a custom leaderboard for everyone.
Acknowledgments
Many thanks to Eugenia Rho, Dallas Card, Robin Jia, Urvashi Khandelwal, Nelson Liu, and Sidd Karamcheti for feedback. KE is supported by an NSERC PGS-D.
References
- SemEval-2015 task 2: semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of SemEval@NAACL-HLT, pp. 252–263.
- SemEval-2014 task 10: multilingual semantic textual similarity. In Proceedings of SemEval@COLING, pp. 81–91.
- *SEM 2013 shared task: semantic textual similarity, including a pilot on typed-similarity. In *SEM 2013: The Second Joint Conference on Lexical and Computational Semantics.
- A simple but tough-to-beat baseline for sentence embeddings. In 5th International Conference on Learning Representations, ICLR 2017.
- Fairness in machine learning. NIPS Tutorial.
- Data statements for natural language processing: toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6, pp. 587–604.
- Language (technology) is power: a critical survey of “bias” in NLP.
- Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 7–15.
- A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- ELECTRA: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
- Questionable answers in question answering research: reproducibility and variability of published results. Transactions of the Association for Computational Linguistics 6, pp. 241–252.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- Show your work: improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2185–2194.
- Part 5: machine translation evaluation. In Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, J. Olive, C. Christianson and J. McCary (Eds.).
- Towards understanding linear word analogies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3253–3262.
- Understanding undesirable word embedding associations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1696–1705.
- Unsupervised random walk sentence embeddings: a strong but simple baseline. In Proceedings of The Third Workshop on Representation Learning for NLP, pp. 91–100.
- Rotate king to get queen: word relationships as orthogonal transformations in embedding space. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3494–3499.
- Is your classifier actually biased? Measuring fairness under uncertainty with Bernstein bounds. arXiv preprint arXiv:2004.12332.
- Datasheets for datasets. arXiv preprint arXiv:1803.09010.
- The ladder: a reliable leaderboard for machine learning competitions. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pp. 1006–1014.
- Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323.
- Climbing a shaky ladder: better adaptive risk estimation. arXiv preprint arXiv:1706.02733.
- Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pp. 1929–1938.
- DynaBERT: dynamic BERT with adaptive width and depth. arXiv preprint arXiv:2004.04037.
- Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2021–2031.
- Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4120–4133.
- Principles of Economics. Cengage Learning.
- Black is to criminal as Caucasian is to police: detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 615–621.
- LadaBERT: lightweight adaptation of BERT through hybrid model compression. arXiv preprint arXiv:2004.04124.
- The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730.
- The effect of natural distribution shift on question answering models. arXiv preprint arXiv:2004.14444.
- Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229.
- StereoSet: measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.
- Adversarial NLI: a new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
- Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4218–4228.
- Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pp. 10877–10887.
- Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344.
- Know what you don’t know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392.
- Justice as Fairness: A Restatement. Harvard University Press.
- How the Transformers broke NLP leaderboards. https://hackingsemantics.xyz/2019/leaderboards/. Accessed: 2020-05-20.
- Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 8–14.
- Consumption theory in terms of revealed preference. Economica 15 (60), pp. 243–253.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Green AI. CoRR abs/1907.10597.
- Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650.
- Overview of results of the MUC-6 evaluation. In MUC-6, Columbia, MD, pp. 13–31.
- SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 3261–3275.
- GLUE: a multi-task benchmark and analysis platform for natural language understanding. EMNLP 2018, pp. 353.
- GOBO: quantizing attention-based NLP models for low latency and energy efficient inference. arXiv preprint arXiv:2005.03842.
- Adversarial attacks on deep-learning models in natural language processing: a survey. ACM Transactions on Intelligent Systems and Technology (TIST) 11 (3), pp. 1–41.
- Gender bias in coreference resolution: evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 15–20.