FreeLB: Enhanced Adversarial Training for Language Understanding
Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. In this work, we propose a novel adversarial training algorithm, FreeLB, that promotes higher robustness and invariance in the embedding space, by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, we apply it to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the finetuning stage, it is able to improve the overall test scores of the BERT-base model from 78.3 to 79.4, and of the RoBERTa-large model from 88.5 to 88.8. In addition, the proposed approach achieves state-of-the-art test accuracies of 85.39% and 67.32% on ARC-Easy and ARC-Challenge. Experiments on the CommonsenseQA benchmark further demonstrate that FreeLB can be generalized and boost the performance of the RoBERTa-large model on other tasks as well.
Adversarial training is a method for creating robust neural networks. During adversarial training, mini-batches of training samples are contaminated with adversarial perturbations (alterations that are small and yet cause misclassification), and then used to update network parameters until the resulting model learns to resist such attacks. Adversarial training was originally proposed as a means to enhance the security of machine learning systems (43405), especially for safety-critical systems like self-driving cars (Xiao_2018_ECCV) and copyright detection (saadatpanah2019adversarial).
In this paper, we turn our focus away from the security benefits of adversarial training, and instead study its effects on generalization. While adversarial training boosts robustness, it is widely accepted by computer vision researchers that it is at odds with generalization, with classification accuracy on non-corrupted images dropping noticeably on both CIFAR-10 and ImageNet (madry2018towards; Xie_2019_feature). Surprisingly, the opposite has been observed for language models (miyato2017adversarial; cheng2019robust), where adversarial training can improve both generalization and robustness.
We will show that adversarial training significantly improves the performance of state-of-the-art models on many language understanding tasks. In particular, we propose a novel adversarial training algorithm, called FreeLB (Free Large-Batch), which adds adversarial perturbations to word embeddings and minimizes the resultant adversarial loss around input samples. The method leverages recently proposed “free” training strategies (2019arXiv190412843S; zhang2019you) to enlarge the batch size with diversified adversarial samples under different norm constraints, at no extra cost compared with PGD-based (Projected Gradient Descent) adversarial training (madry2018towards). This enables us to perform such diversified adversarial training on large-scale state-of-the-art models. We observe improved robustness and invariance in the embedding space for models trained with FreeLB, which is positively correlated with generalization.
We perform comprehensive experiments to evaluate the performance of a variety of adversarial training algorithms on state-of-the-art language understanding models and tasks. In the comparisons with standard PGD (madry2018towards), FreeAT (2019arXiv190412843S) and YOPO (zhang2019you), FreeLB stands out to be the best for the datasets and models we evaluated. With FreeLB, we achieve state-of-the-art results on several important language understanding benchmarks. On the GLUE benchmark, FreeLB pushes the performance of the BERT-base model from 78.3 to 79.4. The overall score of the RoBERTa-large models on the GLUE benchmark is also lifted from 88.5 to 88.8, achieving best results on most of its sub-tasks. Experiments also show that FreeLB can boost the performance of RoBERTa-large on question answering tasks, such as the ARC and CommonsenseQA benchmarks. We also provide a comprehensive ablation study and analysis to demonstrate the effectiveness of our training process.
2 Related Work
2.1 Adversarial Training
To improve the robustness of neural networks against adversarial examples, many defense strategies and models have been proposed, among which PGD-based adversarial training (madry2018towards) is widely considered to be the most effective, since it largely avoids the obfuscated gradient problem (athalye2018obfuscated). It formulates a class of adversarial training algorithms (kurakin2016adversarial) as solving a minimax problem on the cross-entropy loss, which can be achieved reliably through multiple projected gradient ascent steps followed by an SGD (stochastic gradient descent) step.
Although athalye2018obfuscated showed that PGD-based adversarial training avoids obfuscated gradients, qin2019adversarial show that it still leads to highly convolved and non-linear loss surfaces when the number of ascent steps $K$ is small, which could be readily broken under stronger adversaries. Thus, to be effective, the cost of PGD-based adversarial training is much higher than that of conventional training. To mitigate this cost, 2019arXiv190412843S proposed a “free” adversarial training algorithm that simultaneously updates both model parameters and adversarial perturbations on a single backward pass. Using a similar formulation, zhang2019you effectively reduce the total number of full forward and backward propagations by restricting most of the adversarial updates to the first layer.
2.2 Text Adversaries
Adversarial examples have been explored primarily in the image domain, and have recently received much attention in the text domain. Previous works on text adversaries have focused on heuristics for creating adversarial examples in the black-box setting, or on specific tasks. jia-liang-2017-adversarial propose to add distracting sentences to the input document in order to induce misclassification. Zhao2017GeneratingNA generate text adversaries by projecting the input data to a latent space using GANs, and searching for adversaries close to the original instance. BelinkovB18 manipulate every word in a sentence with synthetic or natural noise in machine translation systems. iyyer-etal-2018-adversarial propose a neural paraphrase model based on back-translated data to produce paraphrases that have different sentence structures. Different from previous work, our goal is not to produce actual adversarial examples, but only to take advantage of adversarial training for natural language understanding.
We are not the first to observe that robust language models may perform better. miyato2017adversarial extend adversarial and virtual adversarial training (MiyatoMKI19) to the text domain to improve the performance of semi-supervised classification tasks. hotflip propose character/word replacements for crafting attacks, and show that employing adversarial examples in training renders the models more robust. ribeiro-etal-2018-semantically show that adversarial attacks can be used as a valuable tool for debugging NLP models. cheng2019robust also find that crafted adversarial examples can help neural machine translation significantly. Notably, these studies have focused on simple models or text generation tasks. Our work explores deeper/larger Transformer-based models, how to adversarially attack them efficiently, and how to improve performance on challenging problems.
3 Adversarial Training for Language Understanding
Pre-trained large-scale language models, such as BERT (devlin2019bert) and RoBERTa (liu2019roberta), have proven to be highly effective for downstream tasks. We aim to further improve the generalization of these pre-trained language models on downstream language understanding tasks by enhancing their robustness in the embedding space during finetuning on these tasks. We achieve this goal by creating “virtual” adversarial examples in the embedding space, and then performing parameter updates on these adversarial embeddings. Creating actual adversarial examples for language is difficult; even with state-of-the-art language models as guidance (e.g., cheng2019robust), it remains unclear how to construct label-preserving adversarial examples via word/character replacement without human evaluations, since the meaning of each word/character depends on the context (ribeiro-etal-2018-semantically). Since we are only interested in the effects of adversarial training, rather than producing actual adversarial examples, we add norm-bounded adversarial perturbations to the embeddings of the input sentences using a gradient-based method. Note that our embedding-based adversary is strictly stronger than a more conventional text-based adversary, as our adversary can make manipulations on word embeddings that are not possible in the text domain.
For models that encode text at the sub-word level using Byte-Pair Encoding (BPE) (sennrich2016neural), our adversaries only modify the concatenated BPE embeddings, leaving other components of the sentence representation (e.g., position encoding) unchanged. Denote the sequence of one-hot representations of the input subwords as $Z = [z_1, z_2, \ldots, z_n]$, the embedding matrix as $V$, and the language model (encoder) as a function $y = f_{\theta}(X)$, where $X = VZ$ is the subword embeddings, $y$ is the output of the model, and $\theta$ denotes all the learnable parameters including the embedding matrix $V$. We add adversarial perturbations $\delta$ to the embeddings such that the prediction becomes $y' = f_{\theta}(X + \delta)$. To preserve the semantics, we constrain the norm of $\delta$ to be small, and assume the model’s prediction does not change after the perturbation. This formulation is analogous to miyato2017adversarial, with the difference that we do not require $X$ to be normalized.
3.1 PGD for Adversarial Training
Standard adversarial training seeks to find a set of parameters $\theta^*$ to minimize the maximum risk for any perturbation $\delta$ within a norm ball as:

$$\min_{\theta} \mathbb{E}_{(Z, y) \sim \mathcal{D}} \left[ \max_{\|\delta\| \le \epsilon} L(f_{\theta}(X + \delta), y) \right], \tag{1}$$

where $\mathcal{D}$ is the data distribution, $y$ is the label, and $L$ is some loss function. We use the Frobenius norm to constrain $\delta$. For neural networks, the outer “min” is non-convex, and the inner “max” is non-concave. Nonetheless, madry2018towards demonstrated that this saddle-point problem can be solved reliably with SGD for the outer minimization and PGD (a standard method for large-scale constrained optimization, see combettes2011proximal and goldstein2014field) for the inner maximization. In particular, for the constraint $\|\delta\|_F \le \epsilon$, with an additional assumption that the loss function is locally linear, PGD takes the following step (with step size $\alpha$) in each iteration:

$$\delta_{t+1} = \Pi_{\|\delta\|_F \le \epsilon} \left( \delta_t + \alpha \, g(\delta_t) / \|g(\delta_t)\|_F \right), \tag{2}$$

where $g(\delta_t) = \nabla_{\delta} L(f_{\theta}(X + \delta_t), y)$ is the gradient of the loss with respect to $\delta$, and $\Pi_{\|\delta\|_F \le \epsilon}$ performs a projection onto the $\epsilon$-ball. To achieve high-level robustness, multi-step adversarial examples are needed during training, which is computationally expensive. The $K$-step PGD ($K$-PGD) requires $K$ forward-backward passes through the network, while the standard SGD update requires only one. As a result, the adversary generation step in adversarial training increases run-time by an order of magnitude, a catastrophic amount when training large state-of-the-art language models.
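The ascent-and-project step described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names `frobenius_norm` and `pgd_step`, the list-of-rows matrix representation, and the argument names `alpha` and `epsilon` are assumptions made for illustration, and the gradient is passed in rather than computed by a real network.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def pgd_step(delta, grad, alpha, epsilon):
    """One projected gradient ascent step on the perturbation.

    Moves delta along the Frobenius-normalized gradient with step size
    alpha, then projects the result back onto the ball of radius epsilon.
    (Hypothetical helper; names are illustrative assumptions.)
    """
    g_norm = frobenius_norm(grad)
    if g_norm == 0.0:
        # No ascent direction available: keep the current perturbation.
        return [row[:] for row in delta]
    stepped = [
        [d + alpha * g / g_norm for d, g in zip(d_row, g_row)]
        for d_row, g_row in zip(delta, grad)
    ]
    norm = frobenius_norm(stepped)
    if norm > epsilon:
        # Projection onto the norm ball is a simple rescaling.
        scale = epsilon / norm
        stepped = [[v * scale for v in row] for row in stepped]
    return stepped


# Two steps with a fixed toy gradient: the second step leaves the ball
# and is scaled back to its surface.
grad = [[3.0, 0.0], [0.0, 4.0]]
delta = [[0.0, 0.0], [0.0, 0.0]]
delta = pgd_step(delta, grad, alpha=0.5, epsilon=0.8)
delta = pgd_step(delta, grad, alpha=0.5, epsilon=0.8)
```

Because the constraint set is a norm ball centered at the origin of the perturbation space, the projection reduces to rescaling, which is why each step costs little beyond the gradient computation itself.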
3.2 Large-Batch Adversarial Training for Free
In the inner ascent steps of PGD, the gradients of the parameters can be obtained with almost no overhead when computing the gradients of the inputs. From this observation, FreeAT (2019arXiv190412843S) and YOPO (zhang2019you) have been proposed to accelerate adversarial training. They achieve robustness and generalization comparable to standard PGD-trained models using only the same or a slightly larger number of forward-backward passes as natural training (i.e., SGD on clean samples). FreeAT takes one descent step on the parameters together with each of the $K$ ascent steps on the perturbation. As a result, when the optimal $\delta$ is highly correlated with $\theta$, $\delta_t$ is sub-optimal for maximizing $L(f_{\theta_t}(X + \delta), y)$, since the ascent step from $\delta_{t-1}$ to $\delta_t$ is based on the gradient $\nabla_{\delta} L(f_{\theta_{t-1}}(X + \delta_{t-1}), y)$. Different from FreeAT, YOPO accumulates the gradient of the parameters from each of the ascent steps, and updates the parameters only once after the $K$ inner ascent steps. YOPO also advocates that after each back-propagation, one should take the gradient of the first hidden layer as a constant and perform several additional updates on the adversary using the product of this constant and the Jacobian of the first layer of the network. Interestingly, the analysis backing the extra update steps assumes a twice continuously differentiable loss, which does not hold for the ReLU-based neural networks they experimented with, and thus the reasons for the success of such an algorithm remain obscure. We give empirical comparisons between YOPO and our approach in Sec. 4.3.
To obtain better solutions for the inner max and avoid fundamental limitations on the function class, we propose FreeLB, which performs multiple PGD iterations to craft adversarial examples, and simultaneously accumulates the “free” parameter gradients in each iteration. After that, it updates the model parameters all at once with the accumulated gradients. The overall procedure is shown in Algorithm 1, in which $X + \delta_t$ is an approximation to the local maximum within the intersection of two balls $\mathcal{I}_t = \mathcal{B}_{X + \delta_0}(\alpha t) \cap \mathcal{B}_X(\epsilon)$. By taking a descent step along the averaged gradients at $X + \delta_0, \ldots, X + \delta_{K-1}$, we approximately optimize the following objective:

$$\min_{\theta} \mathbb{E}_{(Z, y) \sim \mathcal{D}} \left[ \frac{1}{K} \sum_{t=0}^{K-1} \max_{\delta_t \in \mathcal{I}_t} L(f_{\theta}(X + \delta_t), y) \right], \tag{3}$$

which is equivalent to replacing the original batch with a $K$-times larger virtual batch, consisting of samples whose embeddings are $X + \delta_0, \ldots, X + \delta_{K-1}$. Compared with PGD-based adversarial training (Eq. 1), which minimizes the maximum risk at a single estimated point in the vicinity of each training sample, FreeLB minimizes the maximum risk at each ascent step at almost no overhead.
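To make the accumulate-then-update structure concrete, the following is a minimal sketch of one FreeLB-style parameter update on a toy model. It is an illustration under stated assumptions rather than the authors' code: the "network" is a linear model with logistic loss acting on a flat embedding vector, gradients are written out by hand, and the names `loss_and_grads`, `freelb_step`, `alpha`, `epsilon`, `K`, and `lr` are hypothetical.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


def project(v, radius):
    """Project v onto the L2 ball of the given radius (rescale if outside)."""
    n = l2_norm(v)
    return v if n <= radius else [a * radius / n for a in v]


def loss_and_grads(w, x, delta, y):
    """Toy stand-in for the network: logistic loss
    log(1 + exp(-y * w.(x + delta))) with label y in {-1, +1}.
    Returns the loss and its gradients w.r.t. w and delta."""
    z = sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta))
    margin = y * z
    loss = math.log1p(math.exp(-margin))
    s = -y / (1.0 + math.exp(margin))  # d(loss)/dz
    grad_w = [s * (xi + di) for xi, di in zip(x, delta)]
    grad_delta = [s * wi for wi in w]
    return loss, grad_w, grad_delta


def freelb_step(w, x, y, alpha, epsilon, K, lr, rng):
    """One parameter update: K ascent steps on delta, with the parameter
    gradient from every step averaged into a single descent step."""
    n = len(x)
    # Random start inside the epsilon-ball (each entry is at most
    # epsilon / sqrt(n) in magnitude, so the norm is at most epsilon).
    delta = [rng.uniform(-epsilon, epsilon) / math.sqrt(n) for _ in range(n)]
    grad_acc = [0.0] * n
    for _ in range(K):
        _, grad_w, grad_delta = loss_and_grads(w, x, delta, y)
        # The parameter gradient comes "for free" from the same pass.
        grad_acc = [g + gw / K for g, gw in zip(grad_acc, grad_w)]
        g_norm = l2_norm(grad_delta)
        if g_norm > 0.0:
            stepped = [d + alpha * g / g_norm for d, g in zip(delta, grad_delta)]
            delta = project(stepped, epsilon)
    # Single descent step with the averaged gradient.
    return [wi - lr * g for wi, g in zip(w, grad_acc)]


rng = random.Random(0)
w = freelb_step([0.5, -0.5], [1.0, 2.0], 1, alpha=0.1, epsilon=0.3, K=3, lr=0.1, rng=rng)
```

The key difference from a FreeAT-style loop in this sketch is that `w` is held fixed while `delta` is updated, so every ascent step uses a gradient computed at the current parameters.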
Intuitively, FreeLB could be a learning method with lower generalization error than PGD. sokolic2017generalization have proved that the generalization error of a learning method invariant to a set of $T$ transformations may be up to $\sqrt{T}$ smaller than that of a non-invariant learning method. According to their theory, FreeLB could have a more significant improvement over natural training, since FreeLB enforces the invariance to adversaries from a set of up to $K$ different norm constraints (the cardinality of the set is approximately $\min\{K, \lfloor (\epsilon - \|\delta_0\|) / \alpha \rfloor + 1\}$), while PGD only enforces invariance to a single norm constraint $\epsilon$.
Empirically, FreeLB does lead to higher robustness and invariance than PGD in the embedding space, in the sense that the maximum increase of loss in the vicinity of $X$ for models trained with FreeLB is smaller than that with PGD. See Sec. 4 for details. In theory, such improved robustness can lead to better generalization (xu2012robustness), which is consistent with our experiments. qin2019adversarial also demonstrated that the PGD-based method leads to highly convolved and non-linear loss surfaces in the vicinity of input samples when $K$ is small, indicating a lack of robustness.
3.3 When Adversarial Training Meets Dropout
Usually, adversarial training is not used together with dropout (JMLR:v15:srivastava14a). However, for some language models like RoBERTa (liu2019roberta), dropout is used during the finetuning stage. In practice, when dropout is turned on, each ascent step of Algorithm 1 is optimizing for a different network. Specifically, denote the dropout mask as $m$ with each entry $m_i \sim \mathrm{Bernoulli}(p)$. Similar to our analysis for FreeAT, the ascent step from $\delta_{t-1}$ to $\delta_t$ is based on $\nabla_{\delta} L(f_{\theta(m_{t-1})}(X + \delta_{t-1}), y)$, so $\delta_t$ is sub-optimal for $L(f_{\theta(m_t)}(X + \delta), y)$ if the optimal $\delta$ and the mask $m$ are highly correlated. Here $\theta(m)$ is the effective parameters under dropout mask $m$.
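The more plausible solution is to use the same $m$ in each step. When applying dropout to any network, the objective for $\theta$ is to minimize the expectation of loss under different networks determined by the dropout masks, which is achieved by minimizing the Monte Carlo estimation of the expected loss. In our case, the objective becomes:

$$\min_{\theta} \mathbb{E}_{(Z, y) \sim \mathcal{D},\, m \sim \mathcal{M}} \left[ \frac{1}{K} \sum_{t=0}^{K-1} \max_{\delta_t \in \mathcal{I}_t} L(f_{\theta(m)}(X + \delta_t), y) \right], \tag{4}$$

where $\mathcal{I}_t$ denotes the constraint set for the perturbation at ascent step $t$ (Sec. 3.2), and the 1-sample Monte Carlo estimation should be $\frac{1}{K} \sum_{t=0}^{K-1} \max_{\delta_t \in \mathcal{I}_t} L(f_{\theta(m_0)}(X + \delta_t), y)$, with a single sampled mask $m_0$ shared across all ascent steps. This is similar to applying Variational Dropout to RNNs as used in gal2016theoretically.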
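The mask-reuse idea can be sketched independently of any particular model. The snippet below is a hypothetical illustration, not the authors' code: `sample_dropout_mask` and `masks_for_ascent_steps` are made-up helper names, the mask uses the common inverted-dropout scaling (an assumption), and `keep_prob` stands for the probability of keeping a unit.

```python
import random


def sample_dropout_mask(size, keep_prob, rng):
    """Sample an inverted-dropout mask: kept units are scaled by
    1 / keep_prob, dropped units are zeroed."""
    return [(1.0 / keep_prob) if rng.random() < keep_prob else 0.0
            for _ in range(size)]


def masks_for_ascent_steps(size, keep_prob, num_steps, rng, reuse=True):
    """Return one mask per ascent step.

    With reuse=True a single mask is sampled and shared by all steps, so
    every ascent step attacks the same sub-network. With reuse=False a
    fresh mask is drawn at each step, so each step sees a different one.
    """
    if reuse:
        mask = sample_dropout_mask(size, keep_prob, rng)
        return [mask] * num_steps
    return [sample_dropout_mask(size, keep_prob, rng) for _ in range(num_steps)]


masks = masks_for_ascent_steps(size=8, keep_prob=0.9, num_steps=3,
                               rng=random.Random(0))
```

In a deep learning framework the same effect would typically require controlling the random state or caching the masks inside the dropout layers; how that is done is framework-specific and is not shown here.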
In this section, we provide comprehensive analysis on FreeLB through extensive experiments on three NLP benchmarks: GLUE (wang2019glue), ARC (clark2018think) and CommonsenseQA (talmor2019commonsenseqa). We also compare the robustness and generalization of FreeLB with other adversarial training algorithms to demonstrate its strength. Additional experimental details are provided in the Appendix.
GLUE Benchmark. The GLUE benchmark is a collection of 9 natural language understanding tasks, mostly on single-sentence classification or sentence-pair matching. Eight of the tasks are formulated as classification problems and only STS-B is formulated as regression, but FreeLB applies to all of them. For BERT-base, we use the HuggingFace implementation (https://github.com/huggingface/pytorch-transformers), and follow the single-task finetuning procedure as in devlin2019bert. For RoBERTa, we use the fairseq implementation (https://github.com/pytorch/fairseq). As in liu2019roberta, we also use single-task finetuning for all dev set results, and start with MNLI-finetuned models on RTE, MRPC and STS-B for the test submissions.
ARC Benchmark. The ARC dataset (clark2018think) is a collection of multi-choice science questions from grade-school level exams. It is further divided into the ARC-Challenge set with 2,590 question-answer (QA) pairs and the ARC-Easy set with 5,197 QA pairs. Questions in ARC-Challenge are more difficult and cannot be handled by simply using a retrieval and co-occurrence based algorithm (clark2018think). A typical question is:
Which property of a mineral can be determined just by looking at it?
(A) luster [correct] (B) mass (C) weight (D) hardness.
CommonsenseQA Benchmark. The CommonsenseQA dataset (talmor2019commonsenseqa) consists of 12,102 natural language questions that require human commonsense reasoning ability to answer. A typical question is:
Where can I stand on a river to see water falling without getting wet?
(A) waterfall, (B) bridge [correct], (C) valley, (D) stream, (E) bottom.
Each question has five candidate answers from ConceptNet (speer2017conceptnet). To make the question more difficult to solve, most answers have the same relation in ConceptNet to the key concept in the question. As shown in the above example, most answers can be connected to “river” by the “AtLocation” relation in ConceptNet. For a fair comparison with the reported results in papers and on the leaderboard (https://www.tau-nlp.org/csqa-leaderboard), we use the official random split 1.11.
4.2 Experimental Results
|Method||MNLI||QNLI||QQP||RTE||SST-2||MRPC||CoLA||STS-B|
|ReImp||-||-||-||85.61 (1.7)||96.56 (.3)||90.69 (.5)||67.57 (1.3)||92.20 (.2)|
|PGD||90.53 (.2)||94.87 (.2)||92.49 (.07)||87.41 (.9)||96.44 (.1)||90.93 (.2)||69.67 (1.2)||92.43 (7.)|
|FreeAT||90.02 (.2)||94.66 (.2)||92.48 (.08)||86.69 (15.)||96.10 (.2)||90.69 (.4)||68.80 (1.3)||92.40 (.3)|
|FreeLB||90.61 (.1)||94.98 (.2)||92.60 (.03)||88.13 (1.2)||96.79 (.2)||91.42 (.7)||71.12 (.9)||92.67 (.08)|
GLUE We summarize results on the dev sets of GLUE in Table 1, comparing the proposed FreeLB against other adversarial training algorithms (PGD (madry2018towards) and FreeAT (2019arXiv190412843S)). We use the same step size and number of steps for PGD, FreeAT and FreeLB. FreeLB is consistently better than the two baselines. Comparisons and detailed discussions about YOPO (zhang2019you) are provided in Sec. 4.3. We have also submitted our results to the evaluation server; these results are provided in Table 2. FreeLB lifts the overall score of the BERT-base model from 78.3 to 79.4, and that of the RoBERTa-large model from 88.5 to 88.8.
ARC For ARC, a corpus of 14 million related science documents (from ARC Corpus, Wikipedia and other sources) is provided. For each QA pair, we first use a retrieval model to select the top 10 related documents. (We thank the AristoRoBERTa team for providing retrieved documents and the additional Regents Living Environments dataset.) Then, given these retrieved documents, we use the RoBERTa-large model to encode <s> Retrieved Documents </s> Question + Answer </s>, where <s> and </s> are special tokens for the RoBERTa model (equivalent to the [CLS] and [SEP] tokens in BERT). We then apply a fully-connected layer to the representation of the [CLS] token to compute the final logit, and use standard cross-entropy loss for model training.
Results are summarized in Table 3. Following sun2018improving, we first finetune the RoBERTa model on the RACE dataset (lai2017race). The finetuned RoBERTa model achieves 85.70% and 85.24% accuracy on the development and test set of RACE, respectively. Based on this, we further finetune the model on both ARC-Easy and ARC-Challenge datasets with the same hyper-parameter searching strategy (for 5 epochs), which achieves 84.13%/64.44% test accuracy on ARC-Easy/ARC-Challenge. And by adding FreeLB finetuning, we can reach 84.81%/65.36%, a significant boost on ARC benchmark, demonstrating the effectiveness of FreeLB.
To further improve the results, we apply a multi-task learning (MTL) strategy using additional datasets. We first finetune the model on RACE (lai2017race), and then finetune on a joint dataset of ARC-Easy, ARC-Challenge, OpenbookQA (mihaylov2018can) and Regents Living Environment (https://www.nysedregents.org/livingenvironment). Based on this, we further finetune our model on ARC-Easy and ARC-Challenge with FreeLB.
After finetuning, our single model achieves 67.32% test accuracy on ARC-Challenge and 85.39% on ARC-Easy, both outperforming the best submission on the official leaderboard (https://leaderboard.allenai.org/arc/submissions/public and https://leaderboard.allenai.org/arc_easy/).
CommonsenseQA Similar to the training strategy in liu2019roberta, we construct five inputs for each question by concatenating the question and each answer separately, then encode each input with the representation of the [CLS] token. A final score is calculated by applying a fully-connected layer to the representation of [CLS]. Following the fairseq repository (https://github.com/pytorch/fairseq/tree/master/examples/roberta/commonsense_qa), the input is formatted as: “<s> Q: Where can I stand on a river to see water falling without getting wet? </s> A: waterfall </s>”, where ‘Q:’ and ‘A:’ are the prefixes for the question and answer, respectively.
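As an illustration of the input construction just described, here is a small sketch that builds the five question/answer strings and picks the highest-scoring candidate. The helper names `format_csqa_inputs` and `pick_answer` are hypothetical, the special tokens are assumed to be written literally as `<s>` and `</s>`, and the scores are placeholder numbers rather than model outputs. In a real pipeline the tokenizer would normally insert the special tokens itself.

```python
def format_csqa_inputs(question, answers):
    """Build one input string per candidate answer, using the
    'Q:' / 'A:' prefixes and the <s> ... </s> ... </s> layout."""
    return [f"<s> Q: {question} </s> A: {answer} </s>" for answer in answers]


def pick_answer(scores):
    """Return the index of the candidate with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


question = "Where can I stand on a river to see water falling without getting wet?"
answers = ["waterfall", "bridge", "valley", "stream", "bottom"]
inputs = format_csqa_inputs(question, answers)

# Placeholder scores standing in for the per-candidate logits.
predicted = answers[pick_answer([0.1, 0.9, 0.2, 0.0, -0.3])]
```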
Results are summarized in Table 3. We obtained a dev-set accuracy of 77.56% with the RoBERTa-large model. When using FreeLB finetuning, we achieved 78.64%, a 1.08% absolute gain. Compared with the heavy hyper-parameter searching strategy from fairseq repository, which obtains 78.43% accuracy on the dev-set, FreeLB still achieves better performance. Our submission to the CommonsenseQA leaderboard achieves 72.2% test set accuracy, which is slightly higher than RoBERTa (72.1%).
4.3 Ablation Study and Analysis
In this sub-section, we first show the importance of reusing the dropout mask, then conduct a thorough ablation study on FreeLB over the GLUE benchmark to analyze the robustness and generalization strength of different approaches. We observe that, in our case, it is unnecessary to perform shallow-layer updates on the adversary as YOPO does, and that FreeLB results in improved robustness and generalization compared with PGD.
Importance of Reusing Mask
|RTE||85.61 (1.67)||87.14 (1.29)||88.13 (1.21)||87.05 (1.36)||87.05 (0.20)|
|CoLA||67.57 (1.30)||69.31 (1.16)||71.12 (0.90)||70.40 (0.91)||69.91 (1.16)|
|MRPC||90.69 (0.54)||90.93 (0.66)||91.42 (0.72)||90.44 (0.62)||90.69 (0.37)|
Table 4 (columns 2 to 4) compares the results of FreeLB with and without reusing the same dropout mask in each ascent step, as proposed in Sec. 3.3. With reusing, FreeLB can achieve a larger improvement over the naturally trained models. Thus, we enable mask reusing for all experiments involving RoBERTa.
Comparing the Robustness Table 5 provides the comparisons of the maximum increment of loss in the vicinity of each sample, defined as:

$$\Delta L_{\max}(X, \epsilon) = \max_{\|\delta\| \le \epsilon} L(f_{\theta}(X + \delta), y) - L(f_{\theta}(X), y),$$

which reflects the robustness and invariance of the model in the embedding space. In practice, we use PGD steps as in Eq. 2 to find the value of $\Delta L_{\max}(X, \epsilon)$. We found that, for the step size and norm bound we used, the PGD iterations converge to almost the same value, starting from 100 different random initializations of $\delta$ for the RoBERTa models, trained with or without FreeLB. This indicates that PGD reliably finds $\Delta L_{\max}$ for these models. Therefore, we compute $\Delta L_{\max}(X, \epsilon)$ for each $X$ via a 2000-step PGD.
Samples with small margins exist even for models with perfect accuracy, which could give a false sense of vulnerability of the model. To rule out the outlier effect and make $\Delta L_{\max}(X, \epsilon)$ comparable across different samples, we only consider samples that all the evaluated models can correctly classify, and search for an $\epsilon$ for each sample such that the reference model can correctly classify all samples within the $\epsilon$-ball. (For each sample, we start from a value slightly larger than the norm constraint used during training for $\epsilon$, and then decrease $\epsilon$ linearly until the reference model can correctly classify the sample after a 2000-step PGD attack.) The reference model is either trained with FreeLB or PGD. However, such a choice of per-sample $\epsilon$ favors the reference model by design. To make fair comparisons, Table 5 provides the median of $\Delta L_{\max}(X, \epsilon)$ with per-sample $\epsilon$ from models trained by FreeLB (Max Inc) and PGD (Max Inc (R)), respectively.
Across all three datasets and different reference models, FreeLB has the smallest median increment even when starting from a larger natural loss than vanilla models. This demonstrates that FreeLB is more robust and invariant in most cases. Such results are also consistent with the models’ dev set performance (the performances for Vanilla/PGD/FreeLB models on RTE, CoLA and MRPC are 86.69/87.41/89.21, 69.91/70.84/71.40, 91.67/91.17/91.17, respectively).
Comparing with YOPO The original implementation of YOPO (zhang2019you) chooses the first convolutional layer of the ResNets as $f_0$ for updating the adversary in the “s-loop”. As a result, each step of the “s-loop” should be using exactly the same value to update the adversary (except for the first step, where they unexpectedly put the clipping operator on the perturbation into the computation graph in their code), and YOPO-$m$-$n$ degenerates into FreeLB with an $n$-times larger step size. To avoid that, we choose the layers up to the output of the first Transformer block as $f_0$ when implementing YOPO. To make the total amount of update on the adversary equal, we take the hyper-parameters for FreeLB-$m$ and only change the step size $\alpha$ into $\alpha / n$ for YOPO-$m$-$n$. Table 4 shows that FreeLB performs consistently better than YOPO on all three datasets. We leave an exhaustive hyperparameter search for both methods as future work.
In this work, we have developed an adversarial training approach, FreeLB, to improve natural language understanding. The proposed approach adds perturbations to continuous word embeddings using a gradient method, and minimizes the resultant adversarial risk in an efficient way. FreeLB is able to boost Transformer-based models (BERT and RoBERTa) on several datasets and achieve a new state of the art on the GLUE and ARC benchmarks. Our empirical study demonstrates that our method results in both higher robustness in the embedding space than natural training and better generalization ability. This observation appears to be inconsistent with findings in computer vision. Investigating the reason for the discrepancy between the outcomes of adversarial training for images and text is an interesting future direction.
Appendix A Additional Experimental Details
A.1 Problem Formulations
For tasks with ranking loss like ARC, CommonsenseQA, WNLI and QNLI, we add the perturbation to the concatenation of the embeddings of all question/answer pairs.
Additional tricks are required to achieve high performance on WNLI and QNLI for the GLUE benchmark. We use the same tricks as liu2019roberta. For WNLI, we use the same WSC data provided by liu2019roberta for training. For testing, liu2019roberta also provided the test set with span annotations, but the order is different from the GLUE dataset. We re-order their test set by matching. For QNLI, we follow liu2019roberta and formulate the problem as a pairwise ranking problem, which is the same as for CommonsenseQA. We find the matching pairs for both the training set and the testing set by matching the queries in the dev set. We predict “entailment” if the candidate has the higher score, and “not_entailment” otherwise.
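The pairwise decision rule for QNLI can be sketched as follows. This is an illustrative assumption about how the rule might be coded, not the authors' implementation: `predict_qnli_pair` is a hypothetical helper, the scores stand in for the model's ranking outputs for two candidate sentences matched to the same question, and the tie-breaking behavior (ties go to the second candidate) is an arbitrary choice made here.

```python
def predict_qnli_pair(score_a, score_b):
    """Label two candidates paired with the same question.

    The candidate with the strictly higher score is labeled
    'entailment' and the other 'not_entailment'. On a tie the second
    candidate is labeled 'entailment' (an assumption for this sketch).
    """
    if score_a > score_b:
        return ("entailment", "not_entailment")
    return ("not_entailment", "entailment")


labels = predict_qnli_pair(2.0, 1.0)
```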
Like other adversarial training methods, FreeLB introduces three additional hyper-parameters: step size $\alpha$, maximum perturbation $\epsilon$, and number of steps $K$. For all other hyper-parameters such as learning rate and number of iterations, we either search in the same interval as RoBERTa (on CommonsenseQA, ARC, and WNLI), or use exactly the same setting as RoBERTa (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md), except for MRPC, where we find that a different learning rate gives better results. We list the best combinations for and for each of the GLUE tasks in Table 6. For WSC/WNLI, the best combination is . Notice even when , the maximum perturbation could still reach due to the random initialization.
Appendix B Variance of Maximum Increment of Loss
Table 7 provides the complete results for the increment of loss in the interval, with median and standard deviation.