Pretrained Transformers Improve Out-of-Distribution Robustness
Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for various NLP tasks by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers’ performance declines are substantially smaller. Pretrained Transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.
1 Introduction
The train and test distributions are often not identically distributed. Such train-test mismatches occur because evaluation datasets rarely characterize the entire distribution Torralba and Efros (2011), and the test distribution typically drifts over time Quionero-Candela et al. (2009). Chasing an evolving data distribution is costly, and even if the training data does not become stale, models will still encounter unexpected situations at test time. Accordingly, models must generalize to OOD examples whenever possible, and when OOD examples do not belong to any known class, models must detect them in order to abstain or trigger a conservative fallback policy Emmott et al. (2015).
Most evaluation in natural language processing (NLP) assumes the train and test examples are independent and identically distributed (IID). In the IID setting, large pretrained Transformer models can attain near human-level performance on numerous tasks Wang et al. (2019). However, high IID accuracy does not necessarily translate to OOD robustness for image classifiers Hendrycks and Dietterich (2019), and pretrained Transformers may embody this same fragility. Moreover, pretrained Transformers can rely heavily on spurious cues and annotation artifacts Cai et al. (2017); Gururangan et al. (2018) which out-of-distribution examples are less likely to include, so their OOD robustness remains uncertain.
In this work, we systematically study the OOD robustness of various NLP models, such as word embedding averages, LSTMs, pretrained Transformers, and more. We decompose OOD robustness into a model’s ability to (1) generalize and to (2) detect OOD examples Card et al. (2018).
To measure OOD generalization, we create a new evaluation benchmark that tests robustness to shifts in writing style, topic, and vocabulary, and spans the tasks of sentiment analysis, textual entailment, question answering, and semantic similarity. We create OOD test sets by splitting datasets with their metadata or by pairing similar datasets together (Section 2). Using our OOD generalization benchmark, we show that pretrained Transformers are considerably more robust to OOD examples than traditional NLP models (Section 3). We show that the performance of an LSTM semantic similarity model declines by over 35% on OOD examples, while a RoBERTa model’s performance slightly increases. Moreover, we demonstrate that while pretraining larger models does not seem to improve OOD generalization, pretraining models on diverse data does improve OOD generalization.
To measure OOD detection performance, we turn classifiers into anomaly detectors by using their prediction confidences as anomaly scores Hendrycks and Gimpel (2017). We show that many non-pretrained NLP models are often near or worse than random chance at OOD detection. In contrast, pretrained Transformers are far more capable at OOD detection. Overall, our results highlight that while there is room for future robustness improvements, pretrained Transformers are already moderately robust.
2 How We Test Robustness
2.1 Train and Test Datasets
We evaluate OOD generalization with seven carefully selected datasets. Each dataset either (1) contains metadata which allows us to naturally split the samples or (2) can be paired with a similar dataset from a distinct data generating process. By splitting or grouping our chosen datasets, we can induce a distribution shift and measure OOD generalization.
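The metadata-based splits described above can be produced with only a few lines of code. The sketch below is illustrative, assuming examples are stored as flat dicts with a metadata field (the field and category names are hypothetical, not our actual preprocessing):

```python
def metadata_split(examples, key, train_value):
    """Split a dataset on a metadata field to induce a distribution shift.

    Records whose metadata matches `train_value` form the in-distribution
    set; all remaining records become the OOD test set.
    """
    in_dist = [e for e in examples if e[key] == train_value]
    ood = [e for e in examples if e[key] != train_value]
    return in_dist, ood

# e.g., for food-type splits of restaurant reviews (hypothetical schema):
# train, ood_test = metadata_split(reviews, key="food_type", train_value="American")
```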
We utilize four sentiment analysis datasets:
We use SST-2, which contains pithy expert movie reviews Socher et al. (2013), and IMDb Maas et al. (2011), which contains full-length lay movie reviews. We train on one dataset and evaluate on the other dataset, and vice versa. Models predict a movie review’s binary sentiment, and we report accuracy.
The Yelp Review Dataset contains restaurant reviews with detailed metadata (e.g., user ID, restaurant name). We carve out four groups from the dataset based on food type: American, Chinese, Italian, and Japanese. Models predict a restaurant review’s binary sentiment, and we report accuracy.
The Amazon Review Dataset contains product reviews from Amazon McAuley et al. (2015); He and McAuley (2016). We split the data into five categories of clothing (Clothes, Women Clothing, Men Clothing, Baby Clothing, Shoes) and two categories of entertainment products (Music, Movies). We sample 50,000 reviews for each category. Models predict a review’s 1 to 5 star rating, and we report accuracy.
We also utilize these datasets for semantic similarity, reading comprehension, and textual entailment:
STS-B requires predicting the semantic similarity between pairs of sentences Cer et al. (2017). The dataset contains text of different genres and sources; we use four sources from two genres: MSRpar (news), Headlines (news); MSRvid (captions), Images (captions). The evaluation metric is Pearson’s correlation coefficient.
ReCoRD is a reading comprehension dataset using paragraphs from CNN and Daily Mail news articles and automatically generated questions Zhang et al. (2018). We bifurcate the dataset into CNN and Daily Mail splits and evaluate using exact match.
MNLI is a textual entailment dataset using sentence pairs drawn from different genres of text Williams et al. (2018). We select examples from two genres of transcribed text (Telephone and Face-to-Face) and one genre of written text (Letters), and we report classification accuracy.
2.2 Embedding and Model Types
We evaluate NLP models with different input representations and encoders. We investigate three model categories with a total of thirteen models.
Bag-of-words (BoW) Model.
We use a bag-of-words model Harris (1954), which is high-bias but low-variance, so it may exhibit performance stability. The BoW model is only used for sentiment analysis and STS-B due to its low performance on the other tasks. For STS-B, we use the cosine similarity of the BoW representations from the two input sentences.
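As a minimal sketch, the BoW similarity model for STS-B amounts to token counting followed by cosine similarity (the naive whitespace tokenizer here is an assumption, not our exact preprocessing):

```python
from collections import Counter
import math

def bow_cosine(sentence_a, sentence_b):
    """Cosine similarity between bag-of-words count vectors.

    Each sentence becomes a sparse vector of token counts; similarity is
    the cosine of the angle between the two vectors.
    """
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```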
Word Embedding Models.
We use word2vec Mikolov et al. (2013) and GloVe Pennington et al. (2014) word embeddings. These embeddings are encoded with one of three models: word averages Wieting et al. (2016), LSTMs Hochreiter and Schmidhuber (1997), and Convolutional Neural Networks (ConvNets). For classification tasks, the representation from the encoder is fed into an MLP. For STS-B and MNLI, we use the cosine similarity of the encoded representations from the two input sentences. For reading comprehension, we use the DocQA model Clark and Gardner (2018) with GloVe embeddings. We implement our models in AllenNLP Gardner et al. (2018) and tune the hyperparameters to maximize validation performance on the IID task.
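The word-average encoder in particular is simple to sketch. Assuming a lookup table of pretrained word2vec or GloVe vectors, a sentence representation is just the mean of its in-vocabulary token embeddings:

```python
import numpy as np

def average_embedding(tokens, vectors, dim=300):
    """Word-average sentence encoder: the mean of the tokens' embeddings.

    `vectors` maps token -> np.ndarray. Tokens missing from the embedding
    vocabulary are skipped; a zero vector is returned if nothing matches.
    """
    hits = [vectors[t] for t in tokens if t in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim)
```

The resulting vector is then fed to an MLP (classification) or compared with cosine similarity (STS-B, MNLI), as described above.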
Pretrained Transformers.
We investigate BERT-based models Devlin et al. (2019), which are pretrained bidirectional Transformers Vaswani et al. (2017) with GELU Hendrycks and Gimpel (2016) activations. In addition to using BERT Base and BERT Large, we also use the large version of RoBERTa Liu et al. (2019b), which is pretrained on a larger dataset than BERT. We use ALBERT Lan et al. (2020) and also a distilled version of BERT, DistilBERT Sanh et al. (2019). We follow the standard BERT fine-tuning procedure Devlin et al. (2019) and lightly tune the hyperparameters for our tasks. We perform our experiments using the HuggingFace Transformers library Wolf et al. (2019).
3 Out-of-Distribution Generalization
Pretrained Transformers are More Robust.
In our experiments, pretrained Transformers often have smaller generalization gaps from IID data to OOD data than traditional NLP models. For instance, Figure 1 shows that the LSTM model declines by over 35%, while RoBERTa’s generalization performance in fact increases. For Amazon, MNLI, and Yelp, we find that pretrained Transformers’ accuracy only slightly fluctuates on OOD examples. Partial MNLI results are in Table 1. We present the full results for these three tasks in Appendix A.2. In short, pretrained Transformers can generalize across a variety of distribution shifts.
Table 1: MNLI accuracy for models trained on the Telephone genre (IID) and evaluated on the Letters and Face-to-Face genres (OOD).
Bigger Models Are Not Always Better.
While larger models reduce the IID/OOD generalization gap in computer vision Hendrycks and Dietterich (2019); Xie and Yuille (2020); Hendrycks et al. (2019d), we find the same does not hold in NLP. Figure 3 shows that larger BERT and ALBERT models do not reduce the generalization gap. However, in keeping with results from vision Hendrycks and Dietterich (2019), we find that model distillation reduces robustness, as evident in our DistilBERT results in Figure 2. This highlights that testing model compression methods for BERT Shen et al. (2020); Ganesh et al. (2020); Li et al. (2020) on only in-distribution examples gives a limited account of model generalization, and such narrow evaluation may mask downstream costs.
More Diverse Data Improves Generalization.
Similar to computer vision Orhan (2019); Xie et al. (2020); Hendrycks et al. (2019a), pretraining on larger and more diverse datasets can improve robustness. RoBERTa exhibits greater robustness than BERT Large, where one of the largest differences between these two models is that RoBERTa pretrains on more data. See Figure 2’s results.
4 Out-of-Distribution Detection
Since OOD robustness requires evaluating both OOD generalization and OOD detection, we now turn to the latter. Without access to an outlier dataset Hendrycks et al. (2019b), the state-of-the-art OOD detection technique is to use the model’s prediction confidence to separate in- and out-of-distribution examples Hendrycks and Gimpel (2017). Specifically, we assign each example the anomaly score $-\max_y p(y \mid x)$, the negative prediction confidence, to perform OOD detection.
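This maximum softmax probability baseline is straightforward to compute from a classifier's logits; a minimal sketch:

```python
import numpy as np

def msp_anomaly_scores(logits):
    """Anomaly score = negative maximum softmax probability.

    `logits` is an (n_examples, n_classes) array of classifier outputs.
    Higher scores indicate more anomalous (OOD-looking) inputs.
    """
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Confident predictions -> low anomaly score; uncertain -> high.
    return -probs.max(axis=1)
```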
We train models on SST-2, record the model’s confidence values on SST-2 test examples, and then record the model’s confidence values on OOD examples from five other datasets. For our OOD examples, we use validation examples from 20 Newsgroups (20 NG) Lang (1995), the English source side of English-German WMT16 and English-German Multi30K Elliott et al. (2016), and concatenations of the premise and hypothesis for RTE Dagan et al. (2005) and SNLI Bowman et al. (2015). These examples are purely used to evaluate OOD detection performance and are not seen during training.
For evaluation, we follow past work Hendrycks et al. (2019b) and report the False Alarm Rate at 95% Recall (FAR95). The FAR95 is the probability that an in-distribution example raises a false alarm, assuming that 95% of all out-of-distribution examples are detected. Hence a lower FAR95 is better. Partial results are in Figure 4, and full results are in Appendix A.3.
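Given anomaly scores for in- and out-of-distribution examples, FAR95 can be computed as sketched below (a simple percentile-threshold implementation that ignores tie-breaking details):

```python
import numpy as np

def far95(scores_in, scores_out, recall=0.95):
    """False alarm rate on in-distribution data at 95% OOD recall.

    Pick the anomaly-score threshold that flags `recall` of the OOD
    examples, then measure how many in-distribution examples are
    (falsely) flagged at that threshold. Lower is better.
    """
    scores_out = np.sort(np.asarray(scores_out, dtype=float))  # ascending
    # Threshold at the (1 - recall) quantile of the OOD scores, so that
    # at least `recall` of OOD examples score at or above it.
    cutoff = scores_out[int((1 - recall) * len(scores_out))]
    return float(np.mean(np.asarray(scores_in, dtype=float) >= cutoff))
```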
Previous Models Struggle at OOD Detection.
Models without pretraining (e.g., BoW, LSTM word2vec) are often unable to reliably detect OOD examples. In particular, these models’ FAR95 scores are sometimes worse than chance because they often assign higher probabilities to out-of-distribution examples than to in-distribution examples. The models particularly struggle on 20 Newsgroups (which contains text on diverse topics including computer hardware, motorcycles, and space), as their false alarm rates are approximately 100%.
Pretrained Transformers Are Better Detectors.
In contrast, pretrained Transformer models are better OOD detectors. Their FAR95 scores are always better than chance. Their superior detection performance is not solely because the underlying model is a language model: prior work Hendrycks et al. (2019b) shows that language models are not necessarily adept at OOD detection. Also note that in computer vision, higher accuracy does not reliably improve OOD detection Lee et al. (2018), so pretrained Transformers’ strong OOD detection performance could not be anticipated from their accuracy alone. Despite their relatively low FAR95 scores, pretrained Transformers still do not cleanly separate in- and out-of-distribution examples (Figure 5). OOD detection using pretrained Transformers is still far from perfect, and future work can aim towards creating better methods for OOD detection.
5 Discussion and Related Work
Why Are Pretrained Models More Robust?
An interesting area for future work is to analyze why pretrained Transformers are more robust. A flawed explanation is that pretrained models are simply more accurate. However, this work and past work show that increases in accuracy do not directly translate to reduced IID/OOD generalization gaps Hendrycks and Dietterich (2019); Fried et al. (2019). One partial explanation is that Transformer models are pretrained on diverse data, and in computer vision, dataset diversity can improve OOD generalization Hendrycks et al. (2020) and OOD detection Hendrycks et al. (2019b). Similarly, Transformer models are pretrained with large amounts of data, which may also aid robustness Orhan (2019); Xie et al. (2020); Hendrycks et al. (2019a). However, this is not a complete explanation, as BERT is pretrained on roughly 3 billion tokens, while GloVe is trained on roughly 840 billion tokens. Another partial explanation may lie in self-supervised training itself. Hendrycks et al. (2019c) show that computer vision models trained with self-supervised objectives exhibit better OOD generalization and far better OOD detection performance. Future work could propose new self-supervised objectives that enhance model robustness.
Other research on robustness considers the separate problem of domain adaptation Blitzer et al. (2007); Daumé III (2007), where models must learn representations of a source and target distribution. We focus on testing generalization without adaptation in order to benchmark robustness to unforeseen distribution shifts. Unlike Fisch et al. (2019); Yogatama et al. (2019), we measure OOD generalization by considering simple and natural distribution shifts, and we also evaluate more than question answering.
Adversarial examples can be created for NLP models by inserting phrases Jia and Liang (2017); Wallace et al. (2019), paraphrasing questions Ribeiro et al. (2018), and reducing inputs Feng et al. (2018). However, adversarial examples are often disconnected from real-world performance concerns Gilmer et al. (2018). Thus, we focus on an experimental setting that is more realistic. While previous work shows that adversarial examples exist for all NLP models, we show that not all models are equally fragile. Rather, pretrained Transformers are overall far more robust than previous models.
Counteracting Annotation Artifacts.
Annotators can accidentally leave unintended shortcuts in datasets that allow models to achieve high accuracy by effectively “cheating” Cai et al. (2017); Gururangan et al. (2018); Min et al. (2019). These annotation artifacts are one reason for OOD brittleness: OOD examples are unlikely to contain the same spurious patterns as in-distribution examples. OOD robustness benchmarks like ours can stress test a model’s dependence on artifacts Liu et al. (2019a); Feng et al. (2019); Naik et al. (2018).
6 Conclusion
We created an expansive benchmark across several NLP tasks to evaluate out-of-distribution robustness. To accomplish this, we carefully restructured and matched previous datasets to induce numerous realistic distribution shifts. We first showed that pretrained Transformers generalize to OOD examples far better than previous models, so that the IID/OOD generalization gap is often markedly reduced. We then showed that pretrained Transformers detect OOD examples surprisingly well. Overall, our extensive evaluation shows that while pretrained Transformers are moderately robust, there remains room for future research on robustness.
Acknowledgments
We thank the members of Berkeley NLP, Sona Jeswani, Suchin Gururangan, Nelson Liu, Shi Feng, the anonymous reviewers, and especially Jon Cai. This material is in part based upon work supported by the National Science Foundation Frontier Award 1804794. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Appendix A Additional Experimental Results
A.1 Significant OOD Accuracy Drops
A.2 Minor OOD Accuracy Drops
A.3 OOD Detection
Full FAR95 values are in Table 7. We also report the Area Under the Receiver Operating Characteristic (AUROC) Hendrycks and Gimpel (2017). The AUROC is the probability that an OOD example receives a higher anomaly score than an in-distribution example, viz., $\mathrm{AUROC} = \mathbb{P}\left(s(x_{\text{out}}) > s(x_{\text{in}})\right)$, where $s(\cdot)$ is the anomaly score.
A flawless AUROC is 100%, while 50% is random chance. These results are in Figure 7 and Table 8. Note that the pretrained Transformers have approximately equal AUROC values, save for DistilBERT, which is somewhat worse.
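The AUROC defined above can be estimated directly from the two sets of anomaly scores; a minimal pairwise-comparison sketch (quadratic in the number of examples, which is fine at evaluation scale):

```python
import numpy as np

def auroc(scores_in, scores_out):
    """Probability that a random OOD example receives a higher anomaly
    score than a random in-distribution example (ties count half)."""
    s_in = np.asarray(scores_in, dtype=float)
    s_out = np.asarray(scores_out, dtype=float)
    # Compare every OOD score against every in-distribution score.
    greater = (s_out[:, None] > s_in[None, :]).mean()
    ties = (s_out[:, None] == s_in[None, :]).mean()
    return float(greater + 0.5 * ties)
```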
STS-B (Pearson correlation; parentheses show the change from IID performance):
| Test (OOD) | BoW | Average word2vec | LSTM word2vec | ConvNet word2vec | Average GloVe | LSTM GloVe | ConvNet GloVe | BERT Base | BERT Large | RoBERTa |
|---|---|---|---|---|---|---|---|---|---|
| MSRvid | 4.4 (-35.4) | 11.3 (-50.1) | 38.3 (-37.4) | 62.0 (-19.8) | 6.1 (-55.2) | 43.1 (-36.7) | 57.8 (-24.1) | 89.5 (-2.3) | 90.5 (-2.3) | 94.3 (0.1) |
| Images | 19.3 (-41.4) | 23.7 (-44.9) | 45.6 (-40.2) | 54.3 (-30.7) | 11.1 (-55.7) | 49.0 (-36.6) | 51.9 (-35.4) | 85.8 (-6.6) | 86.8 (-7.1) | 90.4 (-4.6) |
| MSRpar | 10.1 (-16.7) | 19.1 (-39.7) | -1.9 (-68.1) | 9.8 (-57.6) | 25.9 (-27.5) | 25.4 (-44.5) | 10.9 (-58.7) | 69.9 (-17.1) | 63.6 (-24.7) | 75.5 (-15.8) |
| Headlines | -9.7 (-56.7) | 12.7 (-14.4) | 10.3 (-36.5) | 23.7 (-26.1) | 7.0 (-43.9) | 15.6 (-31.1) | 30.6 (-15.6) | 73.0 (-5.8) | 71.7 (-9.9) | 83.9 (-2.9) |
ReCoRD (exact match; parentheses show the change from IID performance):
| Train | Test (OOD) | Document QA | DistilBERT | BERT Base | BERT Large | RoBERTa |
|---|---|---|---|---|---|---|
| CNN | DailyMail | 29.7 (-9.3) | 34.8 (-10.2) | 46.7 (-6.6) | 59.8 (-7.4) | 72.2 (0.7) |
| DailyMail | CNN | 36.9 (6.2) | 43.9 (7.2) | 51.8 (3.6) | 65.5 (4.3) | 73.0 (0.0) |
SST-2/IMDb (accuracy; parentheses show the change from IID performance):
| Train | Test (OOD) | BoW | Average word2vec | LSTM word2vec | ConvNet word2vec | Average GloVe | LSTM GloVe | ConvNet GloVe | BERT Base | BERT Large | RoBERTa |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SST | IMDb | 73.9 (-6.8) | 76.4 (-5.0) | 78.0 (-9.5) | 81.0 (-4.4) | 74.5 (-5.8) | 82.1 (-5.3) | 81.0 (-3.8) | 87.5 (-4.4) | 88.3 (-5.3) | 92.8 (-2.8) |
| IMDb | SST | 78.3 (-7.6) | 68.5 (-16.3) | 63.7 (-26.3) | 83.0 (-8.0) | 77.5 (-6.1) | 79.9 (-11.4) | 80.0 (-10.9) | 87.6 (-4.3) | 88.6 (-4.3) | 91.0 (-3.4) |
MNLI (accuracy; parentheses show the change from IID performance):
| Train | Test (OOD) | DistilBERT | BERT Base | BERT Large | RoBERTa |
|---|---|---|---|---|---|
| Telephone | Letters | 75.6 (-1.9) | 82.3 (0.9) | 85.1 (1.0) | 90.0 (0.4) |
| Telephone | Face-to-face | 76.0 (-1.4) | 80.8 (-0.7) | 83.2 (-0.8) | 89.4 (-0.2) |
Yelp (accuracy; parentheses show the change from IID performance):
| Train | Test (OOD) | BoW | Average word2vec | LSTM word2vec | ConvNet word2vec | Average GloVe | LSTM GloVe | ConvNet GloVe | DistilBERT | BERT Base | BERT Large | RoBERTa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AM | CH | 82.4 (-4.8) | 80.4 (-5.2) | 87.2 (-0.8) | 88.6 (-1.0) | 75.1 (-9.9) | 88.4 (0.4) | 89.6 (-1.6) | 91.8 (1.8) | 91.0 (0.2) | 90.6 (-0.4) | 90.8 (-2.2) |
| AM | IT | 81.8 (-5.4) | 82.6 (-3.0) | 86.4 (-1.6) | 89.4 (-0.2) | 82.0 (-3.0) | 89.2 (1.2) | 89.6 (-1.6) | 92.6 (2.6) | 91.6 (0.8) | 91.2 (0.2) | 91.8 (-1.2) |
| AM | JA | 84.2 (-3.0) | 86.0 (0.4) | 89.6 (1.6) | 89.4 (-0.2) | 79.2 (-5.8) | 87.8 (-0.2) | 89.2 (-2.0) | 92.0 (2.0) | 92.0 (1.2) | 92.2 (1.2) | 93.4 (0.4) |
| CH | AM | 82.2 (0.0) | 85.4 (1.0) | 88.0 (0.4) | 89.2 (0.4) | 83.0 (-1.4) | 85.6 (-3.6) | 90.2 (1.2) | 90.6 (0.4) | 88.8 (-1.6) | 91.8 (1.0) | 92.4 (0.0) |
| CH | IT | 84.6 (2.4) | 82.0 (-2.4) | 88.0 (0.4) | 89.6 (0.8) | 84.6 (0.2) | 88.6 (-0.6) | 90.4 (1.4) | 91.4 (1.2) | 89.0 (-1.4) | 90.2 (-0.6) | 92.6 (0.2) |
| CH | JA | 83.8 (1.6) | 85.8 (1.4) | 88.6 (1.0) | 89.0 (0.2) | 86.8 (2.4) | 88.8 (-0.4) | 89.6 (0.6) | 91.6 (1.4) | 89.4 (-1.0) | 91.6 (0.8) | 92.2 (-0.2) |
| IT | AM | 85.4 (-1.8) | 83.8 (-3.0) | 89.0 (-0.6) | 90.2 (-0.6) | 85.6 (-0.6) | 89.0 (-0.6) | 90.2 (-0.6) | 90.4 (-2.0) | 90.6 (-1.0) | 89.4 (-2.4) | 92.0 (-2.2) |
| IT | CH | 79.6 (-7.6) | 81.6 (-5.2) | 83.8 (-5.8) | 88.4 (-2.4) | 78.0 (-8.2) | 83.2 (-6.4) | 85.8 (-5.0) | 90.4 (-2.0) | 89.6 (-2.0) | 90.0 (-1.8) | 92.4 (-1.8) |
| IT | JA | 82.0 (-5.2) | 84.6 (-2.2) | 87.4 (-2.2) | 88.6 (-2.2) | 85.0 (-1.2) | 86.8 (-2.8) | 89.4 (-1.4) | 91.8 (-0.6) | 91.4 (-0.2) | 91.2 (-0.6) | 92.2 (-2.0) |
| JA | AM | 83.4 (-1.6) | 85.0 (-2.6) | 87.8 (-1.2) | 87.8 (-2.6) | 80.4 (-7.6) | 88.6 (-0.4) | 89.4 (-0.2) | 91.2 (-0.4) | 90.4 (-1.8) | 90.6 (-2.8) | 91.0 (-1.6) |
| JA | CH | 81.6 (-3.4) | 83.6 (-4.0) | 89.0 (0.0) | 89.0 (-1.4) | 80.6 (-7.4) | 87.4 (-1.6) | 89.2 (-0.4) | 92.8 (1.2) | 91.4 (-0.8) | 90.8 (-2.6) | 92.4 (-0.2) |
| JA | IT | 84.0 (-1.0) | 83.6 (-4.0) | 88.2 (-0.8) | 89.4 (-1.0) | 83.6 (-4.4) | 88.0 (-1.0) | 90.6 (1.0) | 92.6 (1.0) | 90.2 (-2.0) | 91.0 (-2.4) | 92.6 (0.0) |
References
- Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In ACL.
- A large annotated corpus for learning natural language inference. In EMNLP.
- Pay attention to the ending: strong neural baselines for the ROC story cloze task. In ACL.
- Deep weighted averaging classifiers. In FAT.
- SemEval-2017 Task 1: semantic textual similarity multilingual and cross-lingual focused evaluation. In SemEval.
- Simple and effective multi-paragraph reading comprehension. In ACL.
- The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop.
- Frustratingly easy domain adaptation. In ACL.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- Multi30K: multilingual English-German image descriptions. In ACL.
- A meta-analysis of the anomaly detection problem.
- Misleading failures of partial-input baselines. In ACL.
- Pathologies of neural models make interpretations difficult. In EMNLP.
- Proceedings of the 2nd workshop on machine reading for question answering. In MRQA Workshop.
- Cross-domain generalization of neural constituency parsers. In ACL.
- Compressing large-scale transformer-based models: a case study on BERT. ArXiv abs/2002.11985.
- AllenNLP: a deep semantic natural language processing platform. In Workshop for NLP Open Source Software.
- Motivating the rules of the game for adversarial example research. ArXiv abs/1807.06732.
- Annotation artifacts in natural language inference data. In NAACL-HLT.
- Distributional structure. Word.
- Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW.
- Benchmarking neural network robustness to common corruptions and perturbations. In ICLR.
- Gaussian error linear units (GELUs). ArXiv abs/1606.08415.
- A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR.
- Using pre-training can improve model robustness and uncertainty. In ICML.
- Deep anomaly detection with outlier exposure. In ICLR.
- Using self-supervised learning can improve model robustness and uncertainty. In NeurIPS.
- AugMix: a simple data processing method to improve robustness and uncertainty. In ICLR.
- Natural adversarial examples. ArXiv abs/1907.07174.
- Long short-term memory. Neural Computation.
- Adversarial examples for evaluating reading comprehension systems. In EMNLP.
- ALBERT: a lite BERT for self-supervised learning of language representations. In ICLR.
- NewsWeeder: learning to filter Netnews. In ICML.
- Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR.
- Train large, then compress: rethinking model size for efficient training and inference of transformers. ArXiv abs/2002.11794.
- Inoculation by fine-tuning: a method for analyzing challenge datasets. In NAACL.
- RoBERTa: a robustly optimized BERT pretraining approach. ArXiv abs/1907.11692.
- Learning word vectors for sentiment analysis. In ACL.
- Image-based recommendations on styles and substitutes. In SIGIR.
- Distributed representations of words and phrases and their compositionality. In NIPS.
- Compositional questions do not necessitate multi-hop reasoning. In ACL.
- Stress test evaluation for natural language inference. In COLING.
- Robustness properties of Facebook’s ResNeXt WSL models. ArXiv abs/1907.07640.
- GloVe: global vectors for word representation. In EMNLP.
- Dataset shift in machine learning.
- Semantically equivalent adversarial rules for debugging NLP models. In ACL.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In NeurIPS EMC Workshop.
- Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI.
- Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
- Unbiased look at dataset bias. In CVPR.
- Attention is all you need. In NIPS.
- Universal adversarial triggers for attacking and analyzing NLP. In EMNLP.
- GLUE: a multitask benchmark and analysis platform for natural language understanding. In ICLR.
- Towards universal paraphrastic sentence embeddings. In ICLR.
- A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT.
- HuggingFace’s Transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771.
- Intriguing properties of adversarial training at scale. In ICLR.
- Self-training with noisy student improves ImageNet classification. In CVPR.
- Learning and evaluating general linguistic intelligence. ArXiv abs/1901.11373.
- ReCoRD: bridging the gap between human and machine commonsense reading comprehension. ArXiv abs/1810.12885.