Simple but Effective Techniques to Reduce Dataset Biases
Abstract
There have been several recent studies showing that strong natural language understanding (NLU) models are prone to relying on unwanted dataset biases without learning the underlying task, resulting in models that fail to generalize to out-of-domain datasets and are likely to perform poorly in real-world scenarios. We propose several learning strategies to train neural models that are more robust to such biases and transfer better to out-of-domain datasets. We introduce an additional lightweight bias-only model which learns dataset biases and uses its predictions to adjust the loss of the base model to reduce the biases. In other words, our methods down-weight the importance of the biased examples and focus training on hard examples, i.e., examples that cannot be correctly classified by relying on biases alone. Our approaches are model-agnostic and simple to implement. We experiment on large-scale natural language inference and fact verification datasets and their out-of-domain counterparts, and show that our debiased models significantly improve robustness in all settings, including gains of 9.76 points on the FEVER symmetric evaluation dataset, 5.45 points on the HANS dataset, and 4.78 points on the SNLI hard set. These datasets are specifically designed to assess the robustness of models in the out-of-domain setting where the typical biases in the training data do not exist in the evaluation set.
1 Introduction
Recent neural models (devlin2018bert; radford2018improving; chen2017enhanced) have achieved high and even near human-level performance on several large-scale natural language understanding benchmarks. However, it has been demonstrated that neural models tend to rely on existing idiosyncratic biases in the datasets and leverage superficial correlations between the label and existing shortcuts in the training dataset to perform surprisingly well, without learning the underlying task (kaushik2018much; gururangan2018annotation; poliak2018hypothesis; schuster2019towards; nivenkao2019probing; mccoyetal2019right). We use the terms biases, heuristic patterns, and shortcuts interchangeably. For instance, natural language inference (NLI) consists of determining whether a hypothesis sentence (There is no teacher in the room) can be inferred from a premise sentence (Kids work at computers with a teacher's help) (dagan2006pascal); in this example the two sentences are in the contradiction relation, so the hypothesis cannot be inferred from the premise. However, recent work has demonstrated that large-scale NLI benchmarks contain annotation artifacts: certain words in the hypothesis are highly indicative of the inference class, allowing models with poor premise grounding to perform unexpectedly well (poliak2018hypothesis; gururangan2018annotation). As an example, in some NLI benchmarks, negation words such as "nobody", "no", and "not" in the hypothesis are often highly correlated with the contradiction label. As a consequence, NLI models do not need to learn the true relationship between the premise and the hypothesis and can instead rely on statistical cues, such as learning to link negation words with the contradiction label.
As a result of the existence of such biases, models exploiting statistical shortcuts during training often perform poorly on out-of-domain datasets, especially if the evaluation sets are carefully designed to limit the spurious cues. To allow proper evaluation, recent studies have tried to create new evaluation datasets that do not contain such biases (gururangan2018annotation; schuster2019towards).
Unfortunately, it is hard to avoid spurious statistical cues in the construction of large-scale benchmarks, and collecting new datasets is costly (sharma2018tackling). It is therefore crucial to develop techniques to reduce the reliance on biases during the training of neural models.
In this paper, we propose several end-to-end debiasing techniques that adjust the cross-entropy loss to reduce the biases learned from datasets, working by down-weighting the biased examples so that the model focuses on learning the hard examples. Figure 1 illustrates an example of applying our strategy to prevent an NLI model from predicting the labels using existing biases in the hypothesis. Our strategy involves adding a bias-only branch on top of the base model during training (in the case of NLI, the bias-only model uses only the hypothesis). We then compute the combination of the two models in a way that motivates the base model to learn strategies different from the ones used by the bias-only branch. At the end of training, we remove the bias-only classifier and use the predictions of the base model.
We propose three main debiasing strategies, detailed in Section 2.2. In our first two methods, the combination is done with an ensemble method that combines the predictions of the base and bias-only models. The training loss of the base model is then computed on the output of this combined model. This has the effect of reducing the loss passed from the combined model to the base model for the examples that the bias-only model classifies correctly. In the third method, the bias-only predictions are used to directly weight the loss of the base model, explicitly modulating the loss depending on the accuracy of the bias-only model. All strategies allow the base model to focus on learning the hard examples by preventing it from learning the biased examples.
Our approaches are simple and highly effective: they only require training a simple classifier on top of the base model. Furthermore, our methods are model-agnostic and general enough to be applicable to common biases seen in several datasets in different domains.
We evaluate our models on challenging benchmarks in textual entailment and fact verification. For entailment, we run extensive experiments on HANS (Heuristic Analysis for NLI Systems) (mccoyetal2019right) and the hard NLI sets of the Stanford Natural Language Inference (SNLI) (bowman2015large) and MultiNLI (MNLI) (williams2018broad) datasets (gururangan2018annotation). We additionally construct hard MNLI datasets from the MNLI development sets to facilitate out-of-domain evaluation on this dataset, removing the need to submit to an online evaluation system for the MNLI hard test sets. Furthermore, we evaluate our fact verification models on the FEVER Symmetric test set (schuster2019towards). The selected datasets are highly challenging and have been carefully designed to be unbiased, allowing proper evaluation of the out-of-domain performance of the models. We show that applying our strategies when training baseline models, including BERT (devlin2018bert), provides a substantial gain in out-of-domain performance in all the experiments.
In summary, we make the following contributions: 1) we propose several debiasing strategies that make neural models more robust to existing biases in the dataset; 2) we provide an empirical evaluation of the proposed methods on two large-scale NLI benchmarks, obtaining substantial gains on their challenging out-of-domain data, including 5.45 points on HANS and 4.78 points on the SNLI hard set; 3) we evaluate our models on fact verification, obtaining a 9.76 point gain on the FEVER symmetric test set, improving the results of prior work by 4.65 points.
To facilitate future work, we release our datasets and code.
2 Reducing biases
Problem formulation We consider a general multi-class classification problem. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ consisting of the input data $x_i$ and labels $y_i$, the goal of the base model is to learn a mapping $f_M$ parameterized by $\theta_M$ which computes the predictions over the label space given the input data, shown as $f_M(x_i)$. Our goal is to optimize the parameters $\theta_M$ such that we build a model that is more resistant to benchmark biases, improving its robustness to domain changes where the typical biases observed in the training data do not exist in the evaluation dataset.
The key idea of our approach, depicted in Figure 1, is first to identify the dataset biases and heuristic patterns which the base model is susceptible to relying on. Then, we use a bias-only branch to capture these biases. We propose several strategies to incorporate the bias-only knowledge into the training of the base model to make a robust version of it. After training, we remove the bias-only model and use the predictions of the base model. In this section, we explain each of these components.
2.1 Biasonly branch
We assume that we do not have access to any data from the out-of-domain dataset, so we need a priori knowledge of the types of shortcut patterns we would like the base model to avoid relying on. Once these shortcut patterns are identified, we train a bias-only model designed to capture the identified biases, which uses only the biased features. For instance, it has been shown that a hypothesis-only model on large-scale NLI datasets can correctly classify the majority of samples using annotation artifacts (poliak2018hypothesis; gururangan2018annotation). Therefore, our bias-only model for NLI uses only hypothesis sentences. Note, however, that the bias-only model can in general have any form and is not limited to models using only a part of the input data. Let $x_i^b$ be the biased features of $x_i$ which are predictive of $y_i$. We then formalize this bias-only model as a mapping $f_B$ parameterized by $\theta_B$, trained using the cross-entropy loss $\mathcal{L}$:
$$\mathcal{L}(\theta_B) = -\frac{1}{N}\sum_{i=1}^{N} a_i \cdot \log\big(\mathrm{softmax}(f_B(x_i^b))\big), \qquad (1)$$
where $a_i$ is the one-hot representation of the true label for the $i$-th example. In the next section, we explain how we use the bias-only model to make a robust version of the base model.
2.2 Proposed debiasing strategies
We propose several strategies to incorporate the bias-only knowledge into the training of the base model and update its parameters $\theta_M$ using the loss $\mathcal{L}_C$ obtained from the combined classifier $f_C$. All these strategies have the form illustrated in Figure 1, where the predictions of the bias-only model are combined with either the predictions of the base model or its error, to down-weight the loss from the biased examples and thereby affect the error backpropagated into the base model.
As also illustrated in Figure 1, it is often convenient for the bias-only model to share parameters with the base model, such as a shared sentence encoder. To prevent the base model from learning the biases, the bias-only loss is not backpropagated to these shared parameters. To accommodate this sharing, the bias-only and base models are trained together. Next, we explain how the loss of the combined classifier, $\mathcal{L}_C$, is computed for each of our debiasing methods.
2.2.1 Method 1: Product of experts
Our first approach is based on the idea of the product-of-experts ensemble method (hinton2002training): "It is possible to combine multiple probabilistic models of the same data by multiplying the probabilities together and then renormalizing." Here, we use this notion to combine the bias-only and base model predictions by computing the element-wise product $\odot$ between their predictions, as $\mathrm{softmax}(f_M(x_i)) \odot \mathrm{softmax}(f_B(x_i^b))$. We compute this combination in the logarithmic space, which works better in practice:
$$f_C(x_i, x_i^b) = \log\big(\mathrm{softmax}(f_M(x_i))\big) + \log\big(\mathrm{softmax}(f_B(x_i^b))\big). \qquad (2)$$
The key intuition behind this model is to combine the probability distributions of the bias-only and base models so that each makes predictions based on different characteristics of the input: the bias-only branch covers predictions based on biases, while the base model focuses on learning the actual task. The base model parameters $\theta_M$ are then trained using the cross-entropy loss $\mathcal{L}_C$ of the combined classifier $f_C$:
$$\mathcal{L}_C(\theta_M) = -\frac{1}{N}\sum_{i=1}^{N} a_i \cdot \log\big(\mathrm{softmax}(f_C(x_i, x_i^b))\big). \qquad (3)$$
When this loss is backpropagated to the base model parameters $\theta_M$, the predictions of the bias-only model reduce the updates for examples that it can accurately predict.
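To make the product-of-experts combination concrete, the following is an illustrative pure-Python sketch of Eqs. 2 and 3 for a single example; actual training operates on batches of logits in a deep learning framework, so this is a reconstruction of the computation rather than our training code:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def poe_loss(base_logits, bias_logits, label):
    """Cross-entropy of the product-of-experts combination:
    f_C = log softmax(f_M) + log softmax(f_B), renormalized by the
    final log-softmax before taking the negative log-likelihood."""
    combined = [p + q for p, q in zip(log_softmax(base_logits),
                                      log_softmax(bias_logits))]
    return -log_softmax(combined)[label]

# A biased example that the bias-only model classifies confidently:
# the combined loss, and hence the update to the base model, shrinks.
biased = poe_loss([1.0, 0.0, 0.0], bias_logits=[4.0, 0.0, 0.0], label=0)

# A hard example that the bias-only model gets wrong: the loss stays large.
hard = poe_loss([1.0, 0.0, 0.0], bias_logits=[0.0, 4.0, 0.0], label=0)
```

With identical base logits, the biased example yields a much smaller loss than the hard one, which is exactly the down-weighting effect described above.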
2.2.2 Method 2: RUBI Variations (cadene2019rubi)
Recently, cadene2019rubi proposed a model called RUBI to alleviate unimodal biases learned by Visual Question Answering (VQA) models. Their study is limited to alleviating biases in VQA benchmarks; we evaluate the effectiveness of their formulation, together with our newly proposed variations, in the natural language understanding context on several challenging NLU datasets.
We first apply a sigmoid function to the bias-only model's predictions to obtain a mask containing an importance weight between 0 and 1 for each possible label. We then compute the element-wise product between the obtained mask and the base model's predictions:
$$f_C(x_i, x_i^b) = f_M(x_i) \odot \sigma\big(f_B(x_i^b)\big). \qquad (4)$$
The main intuition is to dynamically adjust the predictions of the base model to prevent it from leveraging the shortcuts. We note two properties of this loss. (1) When the bias-only model correctly classifies the example, the mask increases the relative score of the correct prediction while decreasing the scores of the other labels. As a result, the loss of biased examples is down-weighted. (2) For the hard examples that cannot be correctly classified using the bias-only model, the obtained mask increases the score of the wrong answer. This, in turn, increases the contribution of hard examples and encourages the base model to learn to correct them. We additionally propose the following new variants of this model:
- Computing the combination in logarithmic space, which we refer to as RUBI + log space:
$$f_C(x_i, x_i^b) = \log\big(\mathrm{softmax}(f_M(x_i))\big) \odot \sigma\big(f_B(x_i^b)\big). \qquad (5)$$
- Normalizing the output of the bias-only model before applying the RUBI mask, which we refer to as RUBI + normalize:
$$f_C(x_i, x_i^b) = f_M(x_i) \odot \sigma\big(\mathrm{softmax}(f_B(x_i^b))\big). \qquad (6)$$
As with our first method, we then update the parameters of the base model by backpropagating the cross-entropy loss of the combined classifier.
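A minimal pure-Python sketch of the RUBI combination and the two variants, for a single example; the exact formulations of the variants follow our reading of Eqs. 4 to 6 above, and real training would apply these to batched logits in a deep learning framework:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def softmax(logits):
    return [math.exp(v) for v in log_softmax(logits)]

def rubi(base_logits, bias_logits):
    """Eq. 4: element-wise product of the base logits and a sigmoid mask."""
    return [l * sigmoid(b) for l, b in zip(base_logits, bias_logits)]

def rubi_log_space(base_logits, bias_logits):
    """Eq. 5 (log-space variant): mask the base model's log-probabilities."""
    return [lp * sigmoid(b)
            for lp, b in zip(log_softmax(base_logits), bias_logits)]

def rubi_normalize(base_logits, bias_logits):
    """Eq. 6 (normalize variant): softmax the bias-only output before masking."""
    return [l * sigmoid(p)
            for l, p in zip(base_logits, softmax(bias_logits))]

# Bias-only model confident in label 0: the mask keeps the base score
# for label 0 close to its value and pulls the other scores towards zero.
combined = rubi([2.0, 1.5, 1.0], [5.0, -5.0, -5.0])
```

The cross-entropy loss is then taken over `combined`, as in the first method.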
2.2.3 Method 3: Debiased Focal Loss
Focal loss was originally proposed by lin2017focal to improve a single classifier by down-weighting well-classified points. We propose a novel variant of this loss in which we leverage the bias-only branch's predictions to reduce the relative importance of the most biased examples and allow the model to focus on learning the hard examples. We define the Debiased Focal Loss as:
$$\mathcal{L}(\theta_M) = -\frac{1}{N}\sum_{i=1}^{N} a_i \cdot \Big(\big(1 - \mathrm{softmax}(f_B(x_i^b))\big)^{\gamma} \odot \log\big(\mathrm{softmax}(f_M(x_i))\big)\Big), \qquad (7)$$
where $\gamma$ is the focusing parameter, which controls the down-weighting rate. When $\gamma$ is set to 0, our Debiased Focal Loss is equivalent to the normal cross-entropy loss. For $\gamma > 0$, as the value of $\gamma$ is increased, the effect of down-weighting is increased. We use a single fixed value of $\gamma$ in all experiments, which works well in practice, and avoid fine-tuning it further. We note the following properties of the Debiased Focal Loss: (1) when the example is unbiased and the bias-only branch does not do well, $\mathrm{softmax}(f_B(x_i^b))_{y_i}$ is small, so the scaling factor is close to 1 and the loss remains unaffected; (2) as the example becomes more biased and $\mathrm{softmax}(f_B(x_i^b))_{y_i}$ gets closer to 1, the modulating factor $\big(1 - \mathrm{softmax}(f_B(x_i^b))_{y_i}\big)^{\gamma}$ approaches 0 and the loss for the most biased examples is down-weighted.
For this debiasing strategy, the Debiased Focal Loss is used directly to update the parameters $\theta_M$ of the base model. Note that this loss has a different form from that used in the first two methods.
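The Debiased Focal Loss for one example can be sketched in pure Python as follows; the value of the focusing parameter used below, `gamma=2.0`, is only an illustrative choice, not the value fixed in our experiments:

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def debiased_focal_loss(base_logits, bias_logits, label, gamma=2.0):
    """Eq. 7 for one example: scale the base model's cross-entropy by
    (1 - p_b)^gamma, where p_b is the bias-only probability of the
    true label. gamma=2.0 is an illustrative default, not our setting."""
    p_bias = softmax(bias_logits)[label]
    p_base = softmax(base_logits)[label]
    return (1.0 - p_bias) ** gamma * -math.log(p_base)

# Biased example: the bias-only model is almost certain, so the
# modulating factor (1 - p_b)^gamma pushes the loss towards zero.
biased = debiased_focal_loss([1.0, 0.0], [6.0, 0.0], label=0)

# Hard example: the bias-only model is wrong, so the loss is nearly
# the plain cross-entropy of the base model.
hard = debiased_focal_loss([1.0, 0.0], [0.0, 6.0], label=0)
```

Setting `gamma=0.0` recovers the standard cross-entropy, matching property (1) above.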
3 Experimental Results
We provide experiments on two large-scale NLI datasets, namely SNLI and MNLI, and on the FEVER dataset for our fact verification experiments, and evaluate the models' performance on their recently proposed challenging unbiased evaluation datasets. In most of our experiments, we consider BERT (https://github.com/huggingface/pytorch-pretrained-BERT) as our baseline, which is known to work well for these tasks; additionally, we include other baselines used in prior work to compare against them. In all the experiments, we keep the default hyperparameters of the baselines. We include low-level details in the appendix.
3.1 Fact Verification
Dataset: The FEVER dataset contains claim-evidence pairs generated from Wikipedia. schuster2019towards collected a new evaluation set for the FEVER dataset to avoid the idiosyncrasies observed in the claims of this benchmark. They made the original claim-evidence pairs of the FEVER evaluation dataset symmetric by augmenting the dataset so that each claim and evidence appears with each label. Since the artifacts are thus balanced, relying on cues from the claim to classify samples is equivalent to a random guess. The collected dataset is challenging, and the performance of models evaluated on it drops significantly.
Base model: We consider BERT as the baseline, which works best on this dataset (schuster2019towards); it predicts the relations based on the concatenation of the claim and the evidence with a delimiter token (see Appendix A).
Bias-only model: The bias-only model predicts the labels using only the claims as input.
Results: Table 1 shows the results. The improvement obtained by our debiasing methods varies between 1.11 and 9.76 absolute points. The Product of experts and Debiased Focal Loss are highly effective, boosting the performance of the baseline model by 9.76 and 7.53 absolute points respectively, significantly surpassing prior work (schuster2019towards).
3.2 Textual Entailment
Datasets: We evaluate on the hard SNLI and MNLI datasets (gururangan2018annotation), which are splits of these datasets on which a hypothesis-only model cannot correctly predict the labels. gururangan2018annotation show that the success of recent textual entailment models is attributed to the biased examples, and the performance of these models is substantially lower on the hard sets.
Base models: We consider InferSent (conneau2017supervised) and BERT as our base models. We choose InferSent to be able to compare against prior work (belinkov2019adversarial).
Bias-only model: The bias-only model uses only the hypothesis to predict the labels (see Appendix B).
Results on SNLI: Table 2 shows the results on SNLI. For the InferSent model, Debiased Focal Loss and Product of experts yield gains of 4.14 and 4.78 points. Similarly, for the BERT model, Debiased Focal Loss and Product of experts improve the results the most, by 2.48 and 1.62 absolute points. Compared to the results of belinkov2019adversarial, our Product of experts model obtains a 7.42 point gain, significantly surpassing prior work.
Table 1: Accuracy on the FEVER dev set and the symmetric test set.

Debiasing method | Dev | Symmetric test set
None | 85.99 | 56.49
RUBI | 86.23 | 57.60
RUBI + log space | 86.59 | 59.27
RUBI + normalize | 86.16 | 60.11
Debiased Focal Loss | 83.07 | 64.02
Product of experts | 86.46 | 66.25
schuster2019towards | 84.6 | 61.6
Table 2: Accuracy on the SNLI test and hard sets.

Debiasing method | BERT Test | BERT Hard | InferSent Test | InferSent Hard
None | 90.53 | 80.53 | 84.24 | 68.91
RUBI + log space | 90.74 | 81.32 | 83.67 | 69.00
RUBI | 90.69 | 80.62 | 83.93 | 69.64
RUBI + normalize | 90.70 | 80.83 | 83.60 | 69.24
Debiased Focal Loss | 89.57 | 83.01 | 73.54 | 73.05
Product of experts | 90.11 | 82.15 | 80.35 | 73.69
AdvCls (belinkov2019adversarial) | n/a | n/a | 83.56 | 66.27
AdvDat (belinkov2019adversarial) | n/a | n/a | 78.30 | 55.60
Results on MNLI: We construct hard sets from the MNLI development set for both MNLI Matched and MNLI Mismatched. Following gururangan2018annotation, we train a fastText classifier (joulin2017bag) to predict the labels using only the hypothesis, and consider the subset of samples on which this hypothesis-only classifier fails as the hard examples. Table 3 shows the results on the development sets and their corresponding hard sets. For the BERT baseline, on the MNLI Matched hard set, Product of experts and RUBI + normalize improve the results the most, by 1.46 and 1.11 points. On the MNLI Mismatched hard set, Debiased Focal Loss and Product of experts obtain gains of 1.37 and 1.68 points respectively. For the InferSent baseline, on the MNLI Matched hard set, Product of experts and RUBI improve the results by 2.34 and 0.94 points. On the MNLI Mismatched hard set, Product of experts and Debiased Focal Loss improve the results by 2.61 and 2.52 points.
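The hard-set construction described above can be sketched as follows; the rule-based classifier here is only a toy stand-in for the trained fastText model, and the examples are invented for illustration:

```python
def build_hard_set(examples, hypothesis_only_classifier):
    """Keep the examples that a hypothesis-only classifier fails on
    (the 'hard' split). `examples` are (premise, hypothesis, label)
    triples; the classifier sees only the hypothesis."""
    return [ex for ex in examples
            if hypothesis_only_classifier(ex[1]) != ex[2]]

# Toy stand-in for the trained fastText model: predict "contradiction"
# whenever the hypothesis contains a negation word, else "entailment".
def toy_classifier(hypothesis):
    negations = {"no", "not", "nobody", "never"}
    return ("contradiction"
            if set(hypothesis.lower().split()) & negations
            else "entailment")

data = [
    ("Kids work at computers.", "There is no teacher.", "contradiction"),
    ("A man eats.", "Nobody is eating.", "contradiction"),
    ("A dog runs.", "An animal is moving.", "entailment"),
    ("A dog runs.", "A cat is sleeping.", "contradiction"),
]
hard = build_hard_set(data, toy_classifier)
# Only the last example survives: its label cannot be guessed from
# hypothesis cues alone, so the hypothesis-only classifier misses it.
```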
To comply with the limited access to the MNLI submission system, we evaluate only the best baseline and our best model on the test sets. Table 4 shows the results on the MNLI test and hard sets. Our Product of experts model improves the performance on the MNLI Matched hard set by 0.93 points and on the MNLI Mismatched hard set by 1.08 points, while maintaining the in-domain accuracy.
Table 3: Accuracy on the MNLI Matched (MNLI) and Mismatched (MNLI-M) dev sets and their hard sets.

Debiasing method | BERT: MNLI Dev | BERT: MNLI Hard | BERT: MNLI-M Dev | BERT: MNLI-M Hard | InferSent: MNLI Dev | InferSent: MNLI Hard | InferSent: MNLI-M Dev | InferSent: MNLI-M Hard
None | 84.41 | 76.56 | 84.53 | 77.55 | 69.97 | 57.03 | 69.99 | 56.53
RUBI + log space | 84.46 | 76.80 | 84.86 | 78.04 | 69.70 | 56.57 | 69.95 | 56.56
RUBI | 84.48 | 77.13 | 85.17 | 78.63 | 70.51 | 57.97 | 70.53 | 58.08
RUBI + normalize | 84.80 | 77.67 | 84.77 | 78.54 | 70.16 | 57.53 | 70.09 | 57.62
Debiased Focal Loss | 83.72 | 77.37 | 84.85 | 78.92 | 60.78 | 57.88 | 61.12 | 59.05
Product of experts | 84.58 | 78.02 | 84.85 | 79.23 | 66.02 | 59.37 | 65.85 | 59.14
Table 4: Accuracy on the MNLI Matched (MNLI) and Mismatched (MNLI-M) test and hard sets.

Debiasing method | MNLI Test | MNLI Hard | MNLI-M Test | MNLI-M Hard
None | 84.11 | 75.88 | 83.51 | 75.75
Product of experts | 84.11 | 76.81 | 83.47 | 76.83
3.3 Syntactic bias
Dataset: mccoyetal2019right show that NLI models can rely on superficial syntactic heuristics to perform the task. They introduce the HANS dataset, which covers several examples on which models employing these syntactic heuristics fail.
Base model: We use BERT as our base model and train it on the MNLI dataset.
Bias-only model: We consider several features for the bias-only model. The first three are based on the syntactic heuristics proposed in mccoyetal2019right: 1) whether all the words in the hypothesis are included in the premise; 2) whether the hypothesis is a contiguous subsequence of the premise; 3) whether the hypothesis is a subtree of the premise's parse tree. We also use: 4) the number of tokens shared between the premise and hypothesis, normalized by the number of tokens in the premise; and some similarity features: 5) the cosine similarity between premise and hypothesis tokens, followed by mean- and max-pooling. We use the same weight for the contradiction and neutral labels in the bias-only loss, to allow the model to distinguish entailment from not-entailment. During evaluation, we map the neutral and contradiction labels to not-entailment.
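Features 1, 2, and 4 above can be computed directly from tokenized sentences; the following is a sketch (the parse-tree and embedding-similarity features are omitted, and the example sentence is invented for illustration):

```python
def hans_bias_features(premise_tokens, hypothesis_tokens):
    """Lexical-overlap features of the bias-only model:
    1) all hypothesis words appear in the premise,
    2) the hypothesis is a contiguous subsequence of the premise,
    4) premise tokens shared with the hypothesis, normalized by
       the premise length."""
    p, h = premise_tokens, hypothesis_tokens
    all_in = all(w in p for w in h)
    contiguous = any(p[i:i + len(h)] == h
                     for i in range(len(p) - len(h) + 1))
    overlap_rate = sum(w in h for w in p) / len(p)
    return {"all_in_premise": all_in,
            "contiguous_subsequence": contiguous,
            "overlap_rate": overlap_rate}

feats = hans_bias_features(
    ["the", "doctor", "near", "the", "actor", "danced"],
    ["the", "actor", "danced"])
```

For this example all hypothesis words appear in the premise and form a contiguous subsequence, the kind of high-overlap case the HANS heuristics target.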
Results: As shown in Table 5, Product of experts and Debiased Focal Loss improve the results the most, by 5.45 and 3.89 points respectively. We provide the accuracy for each label on the HANS dataset in Appendix C.
Table 5: Accuracy on the MNLI dev set and on HANS, overall and per heuristic.

Debiasing method | MNLI | HANS | Constituent | Lexical | Subsequence
None | 83.99 | 61.10 | 61.11 | 68.97 | 53.21
RUBI + log space | 84.56 | 62.53 | 57.77 | 75.18 | 54.65
RUBI | 83.93 | 60.35 | 56.51 | 71.09 | 53.44
RUBI + normalize | 84.15 | 59.71 | 55.74 | 70.01 | 53.37
Debiased Focal Loss | 84.33 | 64.99 | 62.42 | 74.45 | 58.11
Product of experts | 84.04 | 66.55 | 64.29 | 77.61 | 57.75
4 Discussion
Analysis of Debiased Focal Loss To understand the impact of $\gamma$ in the Debiased Focal Loss, we train InferSent models with this loss for different values of $\gamma$ on SNLI and evaluate their performance on SNLI and the SNLI hard set. As illustrated in Figure 2, increasing $\gamma$ focuses the loss on learning the hard examples and reduces the attention on the biased examples. Consequently, the in-domain accuracy on SNLI drops, while the out-of-domain accuracy on the SNLI hard set increases.
Results: Through extensive experiments on different datasets, our methods improve the out-of-domain performance in all settings. Debiased Focal Loss and Product of experts consistently obtain the highest gains. Among the RUBI variations, RUBI + log space outperforms the others on SNLI with the BERT baseline and on the HANS dataset. RUBI + normalize does best on the FEVER experiment and on the MNLI Matched hard set with the BERT baseline. RUBI performs best on the SNLI and MNLI experiments with the InferSent baseline and on the MNLI Mismatched hard set with the BERT baseline.
As expected, improving out-of-domain performance can come at the expense of decreased in-domain performance, since the removed biases are useful for performing the in-domain task. This happens especially with Debiased Focal Loss, in which there is a trade-off between in-domain and out-of-domain performance controlled by the parameter $\gamma$, and when the baseline model is not very powerful, such as InferSent. Our other models with the BERT baseline consistently maintain the in-domain performance.
5 Related Work
Biases in NLU benchmarks and other domains Recent studies have shown that large-scale NLU benchmarks contain biases. poliak2018hypothesis; gururangan2018annotation; mccoyetal2019right demonstrate that textual entailment models can rely on annotation artifacts and heuristic patterns to perform unexpectedly well. On the ROC Stories corpus (mostafazadeh2016corpus), schwartz2017story show that considering only the sample endings, without the story contexts, performs exceedingly well. A similar phenomenon is observed in fact verification (schuster2019towards), argument reasoning comprehension (nivenkao2019probing), and reading comprehension (kaushik2018much). Finally, several studies confirm biases in VQA datasets, leading to accurate question-only models that ignore visual content (goyal2017making; zhang2016yin).
Existing techniques to alleviate biases The most common strategy to date for addressing biases is to augment the datasets by balancing the existing cues (schuster2019towards; nivenkao2019probing). In another line of work, to address a shortcoming of the Stanford Question Answering Dataset (rajpurkar2016squad), jia2017adversarial propose to create an adversarial dataset by inserting adversarial sentences into the input paragraphs. However, collecting new datasets, especially at large scale, is costly, and this remains an unsatisfactory solution. It is therefore crucial to develop strategies that allow training models on the existing biased datasets while improving their out-of-domain performance.
schuster2019towards propose to first compute the n-grams in the claims that are most associated with each label. They then solve an optimization problem to assign a balancing weight to each training sample to alleviate the biases. In contrast, we propose several end-to-end debiasing strategies. Additionally, belinkovetal2019dont propose adversarial techniques to remove, from the sentence encoder, the features that allow a hypothesis-only model to succeed. However, we believe that in general the features used by the hypothesis-only model can include information necessary to perform the NLI task, and removing such information from the sentence representation can hurt the performance of the full model. Their approach consequently degrades performance on the hard SNLI set, which is expected to be less biased. In contrast, we train a bias-only model and use its predictions to dynamically adapt the classification loss, reducing the importance of the most biased examples during training.
Concurrently with our work, clark2019dont; he2019unlearn have also proposed using product of experts models. However, we evaluate on new domains and datasets, and propose several different ensemble-based debiasing techniques.
6 Conclusion
We propose several novel techniques to reduce biases learned by neural models. We introduce a bias-only model designed to capture biases, which leverages the existing shortcuts in the datasets to succeed. Our debiasing strategies then work by adjusting the cross-entropy loss based on the performance of this bias-only model, focusing learning on the hard examples and down-weighting the importance of the biased examples. Our proposed debiasing techniques are model-agnostic, simple, and highly effective. Extensive experiments show that our methods substantially improve model robustness to domain shift, including a 9.76 point gain on the FEVER symmetric test set, 5.45 points on the HANS dataset, and 4.78 points on the SNLI hard set.
Acknowledgments
We would like to thank Tal Schuster, Suraj Srinivas, Andreas Marfurt, and Dhananjay Ram for their helpful comments. We additionally would like to thank Tom McCoy, Corentin Dancette, Rémi Cadène, Devi Parikh, Tal Schuster, and all the authors of schuster2019towards; cadene2019rubi; mccoyetal2019right for their help and support and for assisting us in reproducing their results. This research was supported by the Swiss National Science Foundation under the project Learning Representations of Abstraction for Opinion Summarization (LAOS), grant number "FNS30216".
References
Appendix A Fact Verification
Base model: We fine-tune BERT for 3 epochs and use the default parameters and the default learning rate.
Bias-only model: Our bias-only classifier is a shallow nonlinear classifier with 768, 384, and 192 hidden units and Tanh nonlinearities.
Appendix B Textual entailment
Base model: InferSent uses a separate BiLSTM encoder to learn sentence representations for the premise and hypothesis. It then combines these embeddings following mou2016natural and feeds them to the default nonlinear classifier. For InferSent, we train all models for 20 epochs, as default, without early stopping. We use the default hyperparameters and, following wang2018glue, set the BiLSTM dimension to 512. We use the default nonlinear classifier with 512 and 512 hidden units and Tanh nonlinearity. For the BERT model, we fine-tune for 3 epochs.
Bias-only model: For the BERT model, we use the same shallow nonlinear classifier described in Appendix A; for the InferSent model, we use a shallow linear classifier with 512 and 512 hidden units.
Appendix C Syntactic bias
Base model: We fine-tune all the models for 3 epochs.
Bias-only model: We use a nonlinear classifier with 6 and 6 hidden units and Tanh nonlinearity.
Results: Table 6 shows the performance for each label (entailment and not-entailment) on the HANS dataset and its individual heuristics.
Table 6: Accuracy for each gold label on HANS and its heuristics (Const. = Constituent, Subseq. = Subsequence).

Debiasing method | Ent. HANS | Ent. Const. | Ent. Lexical | Ent. Subseq. | Non-ent. HANS | Non-ent. Const. | Non-ent. Lexical | Non-ent. Subseq.
None | 98.37 | 98.98 | 96.76 | 99.38 | 23.82 | 23.24 | 41.18 | 7.04
RUBI + log space | 97.51 | 98.56 | 95.44 | 98.54 | 27.55 | 16.98 | 54.92 | 10.76
RUBI | 97.27 | 99.18 | 95.26 | 97.38 | 23.42 | 13.84 | 46.92 | 9.50
RUBI + normalize | 97.87 | 98.48 | 96.32 | 98.80 | 21.55 | 13.00 | 43.70 | 7.94
Debiased Focal Loss | 96.41 | 97.66 | 92.92 | 98.66 | 33.57 | 27.18 | 55.98 | 17.56
Product of experts | 96.08 | 98.38 | 93.52 | 96.34 | 37.02 | 30.20 | 61.70 | 19.16