Simple but Effective Techniques to Reduce Biases
Several recent studies have shown that strong natural language inference (NLI) models are prone to relying on unwanted dataset biases, resulting in models that fail to capture the underlying generalization and are likely to perform poorly in real-world scenarios. Biases are identified as statistical cues or superficial heuristics that correlate with certain labels and are effective for the majority of examples, but fail on more challenging hard examples. In this work, we propose several learning strategies to train neural models which are more robust to such biases and transfer better to out-of-domain datasets. We first introduce an additive lightweight model which learns the dataset biases. We then use its predictions to adjust the loss of the base model to reduce the biases. In other words, our methods down-weight the importance of the biased examples and focus training on hard examples which require grounded reasoning to deduce the label. Our approaches are model agnostic and simple to implement. We experiment on large-scale natural language inference and fact verification datasets and show that our debiased models obtain significant gains over the baselines on several challenging out-of-domain datasets.
Recent neural models devlin2018bert; kim2019semantic; chen2017enhanced have achieved high, near-human performance on several large-scale natural language understanding benchmarks. However, recent research has demonstrated that many models tend to rely on idiosyncratic biases in the datasets, leveraging superficial correlations between the answer and patterns or shortcuts in the training data to classify examples correctly. kaushik2018much perform a study on several reading comprehension datasets and show that question-only and passage-only models often perform surprisingly well, sometimes even matching the full model's performance. Similar findings are demonstrated for natural language inference gururangan2018annotation; poliak2018hypothesis, fact verification schuster2019towards, and argument reasoning comprehension niven-kao-2019-probing. Most importantly, dataset biases often also exist in the corresponding test set, and models picking up on these artifacts can still perform well on standard evaluations.
One approach to gauge the biases in a dataset is to train models using only a part of the input data, such as only the questions or passages in the context of reading comprehension kaushik2018much, or only the hypothesis in the context of textual entailment gururangan2018annotation; poliak2018hypothesis. Models exploiting statistical shortcuts during training often perform poorly when the dataset is carefully designed to limit the spurious cues. As an example, gururangan2018annotation find that in large NLI benchmarks, negation words such as "nobody", "no", and "not" in the hypothesis are highly correlated with the contradiction label. As a ramification of such biases, NLI models are not required to learn the proper relation between premise and hypothesis to obtain high accuracy. Instead of learning the true relations between the given sentences, they rely on statistical cues and shortcuts, learning to link the negation words "no", "nothing", and "nobody" in the hypothesis with the most frequent label, contradiction.
In an effort to address this concern, several recent studies have proposed mechanisms to create new evaluation datasets that do not contain such biases. To remove the idiosyncratic biases observed in the FEVER dataset thorne2018fever, schuster2019towards propose a method to augment the dataset so as to turn the test set into a symmetric set, where each claim or evidence can appear with all labels; a decision based solely on statistical cues therefore cannot perform better than a random guess. gururangan2018annotation introduce Hard NLI sets for the SNLI bowman2015large and MultiNLI (MNLI) williams2018broad benchmarks, i.e., subsets of these datasets which a hypothesis-only model fails to predict correctly. Unfortunately, it is hard to avoid spurious statistical cues in the construction of large-scale benchmarks. It is, therefore, crucial to develop techniques to reduce the reliance on biases when training neural models.
In this paper, we propose several techniques to adjust the cross-entropy loss to reduce the biases learned from datasets, which work by down-weighting the biased examples so that models focus on learning the hard examples. As illustrated in Figure 1, our strategy involves the base model, the bias-only model, and their combination. We propose three different categories of debiasing strategies, detailed in Section 2.2. In our first two proposed methods, the combination is done with an ensemble method which combines the predictions of the base model and the bias-only model. The training error is then computed on the output of this combined classifier. This has the effect of reducing the loss flowing from the combined classifier to the base model for the examples which the bias-only model classifies easily. This strategy allows the base model to focus on learning the hard examples, by preventing it from learning the biased examples. For the third method, the bias-only predictions are used to directly weight the loss of the base model, explicitly modulating the loss depending on the accuracy of the bias-only model. This again reduces the contribution of the most biased examples and allows the model to focus on learning the unbiased hard examples.
Our approaches are simple and highly effective. They only require training a simple classifier on top of the base model. Furthermore, our methods are model agnostic and general enough to be applicable for addressing common biases seen in several datasets in different domains. At the end of training, we discard the added bias-only classifier and use only the predictions of the base model.
We evaluate our models on challenging benchmarks in the textual entailment and fact verification domains. For entailment, we run extensive experiments on the Hard NLI datasets introduced in gururangan2018annotation. Additionally, we evaluate our fact verification models on the Symmetric FEVER test set introduced in schuster2019towards. The selected datasets are highly challenging and have been carefully designed to be unbiased to allow proper evaluation of the performance of the models. We show that applying our strategies when training baseline models, including BERT devlin2018bert, provides substantial gains on the challenging unbiased evaluation datasets.
In summary, we make the following contributions in this work:
Proposing several training strategies that make neural models more robust to existing biases in the datasets.
An empirical evaluation of the proposed methods on two large-scale NLI benchmarks, obtaining substantial gains on their challenging hard splits.
Evaluating our models on fact verification, improving the state-of-the-art performance on the FEVER symmetric test set by 4.65%.
To facilitate future work, we release our code.
2 Reducing biases
Problem formulation We consider a general multi-class classification problem. Given a dataset $D=\{(x^i, y^i)\}_{i=1}^{N}$ consisting of the input data $x^i$ and labels $y^i$, the goal of the base model is to learn a mapping $f_B$ parameterized by $\theta_B$ which computes predictions over the label space given the input data, shown as $f_B(x; \theta_B)$. Our goal is to optimize the parameters $\theta_B$ such that we build models which are more resistant to benchmark biases, improving their robustness to domain changes when the typical biases observed in the training benchmark do not exist in the evaluation dataset.
Our approach has two main steps: 1) Identify the dataset biases and heuristic patterns which the base model is susceptible to relying on, and train a bias-only model to capture these biases. 2) Leverage the bias-only model's predictions to compute a combination of the predictions of the base and bias-only models, making a robust version of the base model. In this section, we explain each of these steps.
2.1 Identifying biased examples and training the bias-only model
Identify dataset biases We assume that we do not have access to any data from the out-of-domain datasets, so we need to know a priori the types of shortcut patterns we would like the base model to avoid relying on. Once these shortcut patterns are identified, we train a bias-only model designed to capture them.
Train bias-only model To detect the biased examples, we need to train a bias-only model which aims to classify the samples using only the existing biases and heuristic patterns in the datasets. For instance, in the context of textual entailment, it has been shown that a hypothesis-only model can correctly classify the majority of samples using artifacts from the annotation protocol poliak2018hypothesis; gururangan2018annotation. They show that negation words in the hypothesis are a strong indicator of contradiction. Purpose clauses are indicative of neutral hypotheses. Using generic words or removing gender or number information is a common heuristic for generating an entailed hypothesis. Therefore, we train the bias-only model for NLI datasets using only the hypothesis sentences. Note, however, that the bias-only model can in general have any form, and is not limited to models which use only a part of the input data. We formalize this bias-only model as a mapping $f_H(x_H; \theta_H)$ parameterized by $\theta_H$, where $x_H$ denotes the biased part of the input. The bias-only model is trained using its cross-entropy loss $\mathcal{L}_H$:

$$\mathcal{L}_H(\theta_H) = -\frac{1}{N}\sum_{i=1}^{N} a^i \cdot \log\big(\mathrm{softmax}(f_H(x_H^i; \theta_H))\big),$$

where $a^i$ is the one-hot representation of the true label for the $i$-th example. The bias-only model is trained on biased features or certain heuristics that we would like the base model to not rely on. In the next section, we propose several debiasing strategies that adjust the loss function to prevent neural models from learning existing shortcuts and biases.
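The bias-only cross-entropy loss above can be sketched in a few lines of numpy (a minimal illustration; function and variable names are ours, and a real bias-only model would be, e.g., a trained hypothesis-only classifier producing the logits):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bias_only_loss(bias_logits, labels):
    """Average cross-entropy of the bias-only model f_H.

    bias_logits: (N, C) scores from, e.g., a hypothesis-only classifier.
    labels:      (N,) integer gold labels (the one-hot a^i in the text).
    """
    probs = softmax(bias_logits)
    n = len(labels)
    # -a^i . log(softmax(f_H(x_H^i))) picks out the true-label probability
    return -np.mean(np.log(probs[np.arange(n), labels]))
```

An uninformative bias-only model (uniform logits) yields a loss of log C, the usual upper reference point for a C-way classifier.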
2.2 Proposed debiasing strategies
We propose several different strategies to incorporate the bias-only knowledge into training the base model and to use the resulting combined prediction to compute the loss of the base model. All these strategies have the form illustrated in Figure 1, where the predictions of the bias-only model are combined with either the predictions of the base model or its error to down-weight the loss from the biased examples, and thereby affect the error being backpropagated into the base model.
As also illustrated in Figure 1, it is often convenient for the bias-only model to share parameters with the base model, such as sharing a sentence encoder. To prevent the base model from learning the biases, the bias-only loss is not back-propagated to these shared parameters of the base model. To accommodate this sharing, the bias-only model and the base model are trained together.
2.2.1 Method 1: Product of experts
Our first approach is based on the idea of the product of experts ensemble method hinton2002training: "It is possible to combine multiple probabilistic models of the same data by multiplying the probabilities together and then renormalizing." Here, we use this notion to combine the predictions of the bias-only and base models by computing the element-wise product of their predicted distributions. We compute this combination in logarithmic space, which works better in practice:

$$f_C(x^i, x_H^i) = \log\big(\mathrm{softmax}(f_B(x^i))\big) + \log\big(\mathrm{softmax}(f_H(x_H^i))\big).$$
The key intuition behind this model is to combine the probability distributions of the bias-only and base models to allow them to make predictions based on different characteristics of the input: the bias-only branch covers predictions based on biases, and the base model focuses on learning the actual task. The base model parameters $\theta_B$ are then trained using the cross-entropy loss $\mathcal{L}_C$ of the combined classifier $f_C$, given the bias-only model parameters $\theta_H$:

$$\mathcal{L}_C(\theta_B; \theta_H) = -\frac{1}{N}\sum_{i=1}^{N} a^i \cdot \log\big(\mathrm{softmax}(f_C(x^i, x_H^i))\big).$$
When this loss is backpropagated to the base model parameters $\theta_B$, the predictions of the bias-only model reduce the updates for examples which it can accurately predict.
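The product-of-experts combination and its cross-entropy can be sketched as follows (illustrative names, numpy in place of a deep-learning framework):

```python
import numpy as np

def log_softmax(z):
    # numerically stable log-softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def poe_loss(base_logits, bias_logits, labels):
    """Cross-entropy of the product-of-experts combined classifier f_C.

    Summing the two log-distributions multiplies the probabilities;
    the outer log_softmax renormalizes the product.
    """
    combined = log_softmax(log_softmax(base_logits) + log_softmax(bias_logits))
    n = len(labels)
    return -np.mean(combined[np.arange(n), labels])
```

When the bias-only distribution is uniform (it carries no information), the combination reduces to the base model's ordinary cross-entropy; when it already puts high probability on the true label, the loss and gradient reaching the base model for that example shrink.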
2.2.2 Method 2: RUBI Variations cadene2019rubi
Recently, cadene2019rubi proposed RUBI, a variation of the product-of-experts model, to alleviate unimodal biases learned by Visual Question Answering (VQA) models. Their study is limited to alleviating biases in VQA benchmarks. We, however, evaluate the effectiveness of their formulation, together with our newly proposed variations, in the natural language understanding context on several challenging NLU datasets.
We first apply a sigmoid function to the bias-only model's predictions to obtain a mask containing an importance weight between 0 and 1 for each possible label. We then compute the element-wise product between the obtained mask and the base model's predictions:

$$f_C(x^i, x_H^i) = f_B(x^i) \odot \sigma\big(f_H(x_H^i)\big).$$
The main intuition is to dynamically adjust the predictions of the base model to prevent it from leveraging the shortcuts. We note two properties of this loss. (1) When the example is correctly classified by the bias-only model, the mask increases the value of the correct prediction while decreasing the scores of the other labels. As a result, the loss of biased examples is down-weighted. (2) For hard examples that cannot be classified using the bias-only model alone, the mask computed from the bias-only model increases the score of the wrong answer. This, in turn, increases the loss contribution of the hard examples and encourages the base model to learn the importance of correcting them.
We additionally propose the following variants of this model:
Computing the combination in logarithmic space. We refer to this model as RUBI + log space.
Normalizing the input of the bias-only model, followed by the RUBI model, which we refer to as RUBI + normalize:
As with our first method, we then update the parameters of the base model by backpropagating the cross-entropy loss of the combined classifier, given above in equation 3.
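A minimal numpy sketch of the RUBI-style masking (names are ours; the log-space variant is shown as a one-line alternative, under our reading of the description above):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rubi_combined(base_logits, bias_logits, log_space=False):
    """Element-wise product of the base prediction with a sigmoid mask
    computed from the bias-only prediction (one weight per label)."""
    mask = sigmoid(bias_logits)            # importance weights in (0, 1)
    if log_space:                          # "RUBI + log space" variant
        return np.log(softmax(base_logits)) * mask
    return base_logits * mask              # original RUBI combination

def rubi_loss(base_logits, bias_logits, labels, **kw):
    """Cross-entropy of the masked (combined) prediction."""
    probs = softmax(rubi_combined(base_logits, bias_logits, **kw))
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels]))
```

The mask dynamically rescales each label's score per example, so the error backpropagated into the base model depends on how confidently the bias-only branch handles that example.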
2.2.3 Method 3: Debiased Focal loss
Focal loss was originally proposed in lin2017focal to improve a single classifier by down-weighting well-classified points. We propose a novel variant of this loss, in which we leverage the bias-only branch's predictions to reduce the relative importance of the most biased examples and allow the model to focus on learning the hard examples. We define Debiased Focal Loss (DFL) as:

$$\mathcal{L}_{DFL}(\theta_B; \theta_H) = -\frac{1}{N}\sum_{i=1}^{N} \big(1 - \mathrm{softmax}(f_H(x_H^i))_{y^i}\big)^{\gamma}\; a^i \cdot \log\big(\mathrm{softmax}(f_B(x^i))\big),$$

where $\mathrm{softmax}(f_H(x_H^i))_{y^i}$ is the probability the bias-only model assigns to the true label.
We note the following properties of the Debiased Focal Loss: (1) When the example is hard and the bias-only branch does not do well, the bias-only probability of the correct label is small; the scaling factor is therefore close to 1, and the loss remains unaffected. (2) On the other hand, as the sample becomes more biased and the bias-only probability of the correct label approaches 1, the modulating factor approaches 0 and the loss for the most biased examples is down-weighted.
We call $\gamma$ the focusing parameter, which controls the rate of down-weighting. When $\gamma$ is set to 0, our Debiased Focal Loss is equivalent to the normal cross-entropy loss. For $\gamma > 0$, as the value of $\gamma$ is increased, the effect of down-weighting is increased. We use a fixed value of $\gamma$ for all experiments, which works well in practice, and avoid fine-tuning it further.
For our debiasing strategy, we use this Debiased Focal Loss as the training objective to update the base model parameters $\theta_B$. Note that this loss has a different form from that used for the first two methods: the bias-only predictions enter as a per-example weight on the loss rather than through an ensembled prediction.
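The down-weighting scheme can be sketched as follows (a hedged sketch with names of our own choosing; the default gamma=2.0 below is purely illustrative, not the paper's setting):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def debiased_focal_loss(pred_logits, bias_logits, labels, gamma=2.0):
    """Cross-entropy scaled per example by (1 - p_bias)^gamma, where
    p_bias is the bias-only probability of the true label. Biased
    examples (p_bias near 1) are down-weighted; hard ones are kept."""
    n = len(labels)
    idx = np.arange(n)
    p_bias = softmax(bias_logits)[idx, labels]
    ce = -np.log(softmax(pred_logits)[idx, labels])
    return np.mean((1.0 - p_bias) ** gamma * ce)
```

With gamma set to 0 the weights all become 1 and the loss reduces to ordinary cross-entropy, matching the property noted above.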
3 Experimental Results
We train our proposed models on two large-scale NLI datasets, namely SNLI and MNLI, and on the FEVER dataset for our fact verification experiments, and evaluate the models' performance on their recently proposed, challenging unbiased evaluation sets. For each experiment, we consider the baseline models which perform best on the dataset, as suggested by prior state-of-the-art work. In all experiments, we keep the hyperparameters of the baseline models at their defaults and avoid further fine-tuning.
3.1 Fact Verification
Dataset: schuster2019towards collect a new evaluation set for the FEVER dataset to avoid the idiosyncrasies observed in this benchmark thorne2018fever. They make the original claim-evidence pairs of the FEVER evaluation dataset symmetric by augmenting the data so that each claim and evidence appears with each label. With the artifacts balanced in this way, relying solely on artifacts in the evidence or claim to classify samples is equivalent to a random guess. The collected dataset is challenging, and the performance of models evaluated on it drops significantly.
Base models: We consider BERT devlin2018bert (https://github.com/huggingface/pytorch-pretrained-BERT) as the baseline, which is shown to work best on this dataset schuster2019towards. We finetune BERT for 3 epochs and use the default parameters and default learning rate. We feed the concatenation of the claim and the evidence, with a delimiter token, to the BERT model to predict the relation.
Bias-only model: To evaluate the artifacts existing in the evidence or claim, we experiment with predicting the label solely from the claim or from the evidence as bias-only models. Our bias-only classifier is a shallow nonlinear classifier with h, h/2, and h/4 hidden neurons and Tanh nonlinearity, where h is the default hidden size in BERT.
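The shallow classifier described above (hidden sizes h, h/2, h/4 with Tanh) might look like the following numpy forward pass (an illustrative sketch, names ours; initialization scheme and training loop are assumptions, not from the paper):

```python
import numpy as np

def make_bias_head(h, num_classes, seed=0):
    """Shallow nonlinear head: input of size h, hidden layers of
    h, h/2, and h/4 units with Tanh, and a linear output over labels."""
    rng = np.random.default_rng(seed)
    sizes = [h, h, h // 2, h // 4, num_classes]
    weights = [rng.normal(0.0, 0.02, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]

    def forward(x):
        for i, (w, b) in enumerate(zip(weights, biases)):
            x = x @ w + b
            if i < len(weights) - 1:   # Tanh on all but the output layer
                x = np.tanh(x)
        return x                       # logits over the label space
    return forward
```

For BERT-base, h would be 768, and num_classes would be 3 for FEVER's three verdict labels.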
We compare our models with the previous state-of-the-art schuster2019towards. Their method works by first computing the n-grams in the claims that are most associated with each label. They then solve an optimization problem to assign a balancing weight to each training sample to alleviate the biases. Table 1 shows the performance of our debiasing techniques. Since the existing biases in the claims are more prevalent, we observe that most of our techniques provide more improvement when the claim-only model is used as the bias-only classifier. The Product of experts and Debiased Focal Loss are highly effective, boosting the performance of the baseline model by 7.53% and 9.76% respectively, significantly surpassing the previous state-of-the-art schuster2019towards.
| Debiasing method | dev. | dev. symmetric | dev. | dev. symmetric |
|---|---|---|---|---|
| RUBI + log space | 86.59 | 59.27 | 86.62 | 61.09 |
| RUBI + normalize | 86.16 | 60.11 | 86.38 | 56.4 |
| Product of experts | 86.46 | 66.25 | 87.34 | 60.67 |
| schuster2019towards | 84.6 | 61.6 | - | - |

(The first pair of columns uses the claim-only bias model, the second pair the evidence-only bias model.) Note: a direct comparison between our numbers and those of schuster2019towards may not be entirely fair, since their BERT baseline obtains 58.3 on the dev. symmetric set while our baseline BERT obtains 56.49; the absolute improvements over the baseline are therefore measured from different starting points.
3.2 Textual Entailment
Datasets: We evaluate on the Hard SNLI and MNLI datasets gururangan2018annotation, the splits of these datasets on which a hypothesis-only model cannot correctly predict the labels. gururangan2018annotation show that the success of recent textual entailment models is attributed to the easy examples, and that the performance of these models is substantially lower on the Hard sets.
Base models: We consider InferSent conneau2017supervised and BERT as our base models. We choose InferSent to be able to compare against the most recent work on alleviating biases on the SNLI dataset belinkov2019adversarial, which uses this baseline. InferSent uses a separate BiLSTM encoder to learn sentence representations for the premise and hypothesis; it then combines these embeddings following mou2016natural and feeds them to a nonlinear classifier. For InferSent we train all models for 20 epochs without early stopping, using the default parameters. For the BERT model, we finetune for 3 epochs.
Bias-only model: Our bias-only model uses only the hypothesis to predict the labels. For the BERT model, we use the same shallow nonlinear classifier explained in Section 3.1, and for the InferSent model, we use a shallow nonlinear classifier with 512 and 512 hidden neurons and Tanh nonlinearity.
Table 2 shows the results on the SNLI dataset. For the InferSent model, the Debiased Focal Loss and Product of experts methods result in 4.14 and 4.78 point gains. Similarly, for the BERT model, Debiased Focal Loss and Product of experts improve the results the most, by 2.48 and 1.62 absolute points. We compare against the methods of belinkov2019adversarial, who proposed two adversarial techniques, AdvCls and AdvDat, to remove from the sentence encoder the features which allow a hypothesis-only model to succeed. However, we believe that in general the features used by the hypothesis-only model can include information necessary to perform the NLI task, and removing such information from the sentence representation can hurt the performance of the full model. Comparing the best results of our proposed strategies with those of belinkov2019adversarial, our product of experts model obtains a 7.42 point gain, significantly surpassing their prior work.
| Debiasing method | test | Hard | test | Hard |
|---|---|---|---|---|
| RUBI + log space | 90.74 | 81.32 | 83.84 | 69.0 |
| RUBI + normalize | 90.70 | 80.83 | 84.16 | 69.24 |
| Product of experts | 90.11 | 82.15 | 79.53 | 73.69 |

(SNLI results; the first pair of columns is for the BERT base model, the second pair for InferSent.)
For the MNLI dataset, due to limited access to the submission system, we evaluate our models on the development sets and only submit the best results of the baseline model and our methods to the online submission system for evaluation on the test sets. We additionally construct Hard sets from the MNLI development sets for both MNLI Matched and MNLI Mismatched. Following gururangan2018annotation, we train a fastText classifier joulin2017bag, which uses a bag of words and bigrams, to predict the labels (we release our constructed datasets to facilitate future research). Table 3 shows the results on the development sets and their Hard sets, and Table 4 shows the results on the MNLI test and Hard sets. Our product of experts model improves the performance on the MNLI Matched Hard set by 0.93 points and on the MNLI Mismatched Hard set by 1.08 points, while maintaining the performance on the in-domain test sets.
| Debiasing method | Matched dev | Matched Hard | Mismatched dev | Mismatched Hard | Matched dev | Matched Hard | Mismatched dev | Mismatched Hard |
|---|---|---|---|---|---|---|---|---|
| RUBI + log space | 84.46 | 76.80 | 84.86 | 78.04 | 69.70 | 56.57 | 69.95 | 56.56 |
| RUBI + normalize | 84.80 | 77.67 | 84.77 | 78.54 | 70.16 | 57.53 | 70.09 | 57.62 |
| Product of experts | 84.58 | 78.02 | 84.85 | 79.23 | 66.02 | 59.37 | 65.85 | 59.14 |

(MNLI development results; the first four columns are for the BERT base model, the last four for InferSent.)
| Debiasing method | Matched test | Matched Hard | Mismatched test | Mismatched Hard |
|---|---|---|---|---|
| Product of experts | 84.11 | 76.81 | 83.47 | 76.83 |
4 Related Work
Recent studies have shown that large-scale natural language understanding (NLU) benchmarks contain annotation artifacts mccoy-etal-2019-right; gururangan2018annotation; kaushik2018much; schuster2019towards which are often hard to avoid due to the crowd-sourcing procedure. Crowd workers tend to use heuristics and language patterns to generate examples, which can correlate with labels and introduce artifacts into the benchmarks geva2019are; mccoy-etal-2019-right. Neural models can exploit these biases and shortcuts, achieving high scores on benchmarks without capturing the underlying generalization, resulting in over-estimated performance. In the following, we provide an overview of related work that inspects and alleviates these biases.
Measuring the amount of biases in NLU benchmarks A common practice to gauge the reliance of models on biases is to train them using only a part of the input. poliak2018hypothesis; gururangan2018annotation demonstrate that a simple text classifier using only the hypothesis can perform unexpectedly well on several NLI benchmarks. kaushik2018much find that question-only and passage-only models are particularly strong baselines on several popular reading comprehension datasets weston2015towards; onishi2016did; hill2015goldilocks. On the ROC Stories corpus mostafazadeh2016corpus, schwartz2017story show that considering only the sample endings without the story contexts performs exceedingly well. A similar phenomenon is observed in fact verification, where schuster2019towards show that strong cues in the claims of the FEVER dataset thorne2018fever are highly indicative of the inference class, and claim-only classifiers perform non-trivially well without considering the evidence. In the argument reasoning comprehension task (ARCT) habernal2018semeval, training BERT devlin2018bert on warrants alone achieves high performance without access to the claims and reasons niven-kao-2019-probing, exploiting spurious statistical cues in the dataset. Finally, several studies confirm biases in visual question answering datasets, leading to accurate question-only models that ignore visual content goyal2017making; zhang2016yin.
Existing techniques to alleviate biases The most common strategy to date to address biases in common benchmarks is to augment the datasets by balancing the shortcuts and existing cues. schuster2019towards augment the evaluation set of the FEVER dataset in a symmetric way such that each claim or evidence appears with both labels, thereby balancing the existing statistical cues. Similarly, niven-kao-2019-probing construct an adversarial evaluation dataset for the ARCT benchmark. In another line of work, to address the shortcomings of the Stanford Question Answering Dataset (SQuAD) rajpurkar2016squad, jia2017adversarial propose to create an adversarial dataset in which they insert adversarial sentences into the input paragraphs. However, augmenting datasets is a costly procedure, and in most cases it is done manually; constructing such datasets at the scale needed for training would be especially expensive and require substantial annotation effort. It therefore remains an unsatisfactory solution to the problem of biases. We instead propose to use knowledge of the types of biases the model is likely to exploit to improve performance on out-of-domain datasets, while still training on the existing biased datasets.
schuster2019towards propose to add a balancing weight to each training example to flatten the association of the 'give-away' n-grams (the n-grams appearing most often with each label) in the claim with the labels. Their approach requires first preprocessing the dataset to obtain the most correlated n-grams and then solving an optimization problem to obtain the sample weights. In contrast, we propose several end-to-end debiasing strategies and improve on their results. Additionally, belinkov-etal-2019-dont use an adversarial technique to discourage the hypothesis encoder from capturing artifacts; however, their method causes a general loss of information in the encoder, which consequently degrades the performance on the Hard split of the SNLI dataset gururangan2018annotation, which is expected to be less biased. In contrast to their method, we propose to train a bias-only model and use its predictions to dynamically adapt the classification loss, reducing the importance of the most biased examples during training.
Concurrently to our own work, clark2019dont; he2019unlearn have also proposed to use the product of experts models. However, we have evaluated on new domains and datasets, and have proposed several different ensemble-based debiasing techniques.
5 Conclusion and Future work
We propose several novel techniques to reduce biases learned by neural models. We introduce a bias-only model that is designed to capture biases and leverages the existing shortcuts in the datasets to succeed. Our debiasing strategies then work by adjusting the cross-entropy loss based on the performance of this bias-only model to focus learning on the hard examples and down-weight the importance of the biased examples, for which the model can rely on artifacts to classify the samples. Our proposed debiasing techniques are model agnostic, simple and highly effective. We demonstrate the efficacy of our methods by extensive experimental analysis showing that our techniques substantially improve the performance of the models on the out-of-domain split of evaluation datasets. Currently we are finalizing our results on removing syntactic biases mccoy-etal-2019-right from NLI datasets.
We would like to thank Tal Schuster and all the authors of schuster2019towards for their assistance and for sharing their FEVER training dataset with us. We would additionally like to thank Corentin Dancette, Rémi Cadène, Devi Parikh and all the authors of cadene2019rubi for their help and support, and for assisting us in reproducing their results.
This research was supported by the Swiss National Science Foundation under the project Learning Representations of Abstraction for Opinion Summarisation (LAOS), grant number “FNS-30216”.