StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
Recently, the pre-trained language model, BERT (and its robustly optimized version RoBERTa), has attracted a lot of attention in natural language understanding (NLU), and achieved state-of-the-art accuracy in various NLU tasks, such as sentiment classification, natural language inference, semantic textual similarity and question answering. Inspired by the linearization exploration work of Elman [elman1990finding], we extend BERT to a new model, StructBERT, by incorporating language structures into pre-training. Specifically, we pre-train StructBERT with two auxiliary tasks to make the most of the sequential order of words and sentences, which leverage language structures at the word and sentence levels, respectively. As a result, the new model is adapted to different levels of language understanding required by downstream tasks.
StructBERT with structural pre-training gives surprisingly good empirical results on a variety of downstream tasks, including pushing the state-of-the-art on the GLUE benchmark to 89.0 (outperforming all published models), the F1 score on SQuAD v1.1 question answering to 93.0, and the accuracy on SNLI to 91.7.
A pre-trained language model (LM) is a key component in many natural language understanding (NLU) tasks such as semantic textual similarity [cer2017semeval], question answering [rajpurkar2016squad] and sentiment classification [socher2013recursive]. In order to obtain reliable language representations, neural language models are designed to define the joint probability function of sequences of words in text with self-supervised learning. Different from traditional word-specific embedding in which each token is assigned a global representation, recent work, such as CoVe [mccann2017learned], ELMo [peters2018deep], GPT [radford2018improving] and BERT [devlin2018bert], derives contextualized word vectors from a language model trained on a large text corpus. These models have been shown to be effective for many downstream NLU tasks.
Among the context-sensitive language models, BERT (and its robustly optimized version RoBERTa [liu2019roberta]) has taken the NLP world by storm. It is designed to pre-train bidirectional representations by jointly conditioning on both left and right context in all layers and model the representations by predicting masked words only through the contexts. However, it does not make the most of underlying language structures.
According to Elman’s study [elman1990finding], recurrent neural networks were shown to be sensitive to regularities in word order in simple sentences. Since language fluency is determined by the ordering of words and sentences, finding the best permutation of a set of words and sentences is an essential problem in many NLP tasks, such as machine translation and NLU [hasler2017comparison]. Recently, word ordering was treated as linearization based solely on language models [schmaltz2016word]. Schmaltz showed that recurrent neural network language models [mikolov2010recurrent] with long short-term memory [hochreiter1997long] cells work effectively for word ordering even without any explicit syntactic information.
In this paper, we introduce a new type of contextual representation, StructBERT, which incorporates language structures into BERT pre-training by proposing two novel linearization strategies. Specifically, in addition to the existing masking strategy, StructBERT extends BERT by leveraging the structural information: word-level ordering and sentence-level ordering. We augment model pre-training with two new structural objectives on the inner-sentence and inter-sentence structures, respectively. In this way, the linguistic aspects [elman1990finding] are explicitly captured during the pre-training procedure. With structural pre-training, StructBERT encodes dependency between words as well as sentences in the contextualized representation, which provides the model with better generalizability and adaptability.
StructBERT significantly advances the state-of-the-art results on a variety of NLU tasks, including the GLUE benchmark [wang2018glue], the SNLI dataset [bowman2015large] and the SQuAD v1.1 question answering task [rajpurkar2016squad]. All of these experimental results clearly demonstrate StructBERT’s exceptional effectiveness and generalization capability in language understanding.
We make the following major contributions:
We propose novel structural pre-training that extends BERT by incorporating the word structural objective and the sentence structural objective to leverage language structures in contextualized representation. This enables the StructBERT to explicitly model language structures by forcing it to reconstruct the right order of words and sentences for correct prediction.
StructBERT significantly outperforms all published state-of-the-art models on a wide range of NLU tasks. This model extends the superiority of BERT, and boosts the performance in many language understanding applications such as semantic textual similarity, sentiment analysis, textual entailment, and question answering.
2 StructBERT Model Pre-training
StructBERT builds upon the BERT architecture, which uses a multi-layer bidirectional Transformer network [vaswani2017attention]. Given a single text sentence or a pair of text sentences, BERT packs them in one token sequence and learns a contextualized vector representation for each token. Every input token is represented based on the word, the position, and the text segment it belongs to. Next, the input vectors are fed into a stack of multi-layer bidirectional Transformer blocks, which use self-attention to compute the text representations by considering the entire input sequence.
The original BERT introduces two unsupervised prediction tasks to pre-train the model: a masked LM task and a next sentence prediction task. Different from the original BERT, our StructBERT amplifies the ability of the masked LM task by shuffling a certain number of tokens after word masking and predicting the right order. Moreover, to better understand the relationship between sentences, StructBERT randomly swaps the sentence order and predicts the next sentence and the previous sentence as a new sentence prediction task. In this way, the new model not only explicitly captures the fine-grained word structure in every sentence, but also properly models the inter-sentence structure in a bidirectional manner. Once the StructBERT language model is pre-trained with these two auxiliary tasks, we can fine-tune it on task-specific data for a wide range of downstream tasks.
2.1 Input Representation
Every input is a sequence of word tokens, which can be either a single sentence or a pair of sentences packed together. The input representation follows that used in BERT [devlin2018bert]. For each input token $t_i$, its vector representation $x_i$ is computed by summing the corresponding token embedding, positional embedding, and segment embedding. We always add a special classification embedding ([CLS]) as the first token of every sequence, and a special end-of-sequence ([SEP]) token to the end of each segment. Texts are tokenized to subword units by WordPiece [wu2016google] and absolute positional embeddings are learned with supported sequence lengths up to 512 tokens. In addition, the segment embeddings are used to differentiate a pair of sentences as in BERT.
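The summation described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the toy vocabulary, the tiny embedding dimension, the random tables standing in for learned weights, and the helper names `make_table`, `build_inputs` and `embed` are all assumptions introduced here.

```python
import random

def make_table(num_rows, dim, rng):
    """Create a toy embedding table (random stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def build_inputs(segment_a, segment_b=None):
    """Pack one or two token lists as [CLS] A [SEP] (B [SEP]) with segment ids."""
    tokens = ["[CLS]"] + segment_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if segment_b is not None:
        tokens += segment_b + ["[SEP]"]
        segment_ids += [1] * (len(segment_b) + 1)
    return tokens, segment_ids

def embed(tokens, segment_ids, vocab, tok_table, pos_table, seg_table):
    """Sum token, position, and segment embeddings for every input position."""
    vectors = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        t = tok_table[vocab[token]]
        p = pos_table[position]
        s = seg_table[segment]
        vectors.append([a + b + c for a, b, c in zip(t, p, s)])
    return vectors

rng = random.Random(0)
DIM, MAX_LEN = 4, 512  # toy hidden size; 512 is the supported sequence length
vocab = {w: i for i, w in enumerate(
    ["[CLS]", "[SEP]", "the", "cat", "sat", "it", "slept"])}  # hypothetical vocabulary
tok_table = make_table(len(vocab), DIM, rng)
pos_table = make_table(MAX_LEN, DIM, rng)
seg_table = make_table(2, DIM, rng)

tokens, segment_ids = build_inputs(["the", "cat", "sat"], ["it", "slept"])
vectors = embed(tokens, segment_ids, vocab, tok_table, pos_table, seg_table)
```

Each entry of `vectors` is the per-token input vector that would be passed to the first Transformer layer.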
2.2 Transformer Encoder
We use a multi-layer bidirectional Transformer encoder [vaswani2017attention] to encode contextual information for input representation. Given the input vectors $X = \{x_i\}_{i=1}^{N}$, an $L$-layer Transformer is used to encode the input as:
$$H^{l} = \mathrm{Transformer}_{l}(H^{l-1})$$
where $l \in [1, L]$, $H^{0} = X$ and $H^{L} = [h_1^{L}, \ldots, h_N^{L}]$. We use the hidden vector $h_i^{L}$ as the contextualized representation of the input token $t_i$.
2.3 Pre-training Objectives
To make full use of the rich inner-sentence and inter-sentence structures in language, we extend the pre-training objectives of original BERT in two ways: 1⃝ word structural objective (mainly for the single-sentence task), and 2⃝ sentence structural objective (mainly for the sentence-pair task). We pre-train these two auxiliary objectives together with the original masked LM objective in a unified model to exploit inherent language structures.
2.3.1 Word Structural Objective
Despite its success in various NLU tasks, original BERT is unable to explicitly model the sequential order and high-order dependency of words in natural language. Given a set of words in random order from a sentence, ideally a good language model should be able to recover this sentence by reconstructing the correct order of these words. To implement this idea in StructBERT, we supplement BERT’s training objectives with a new word structural objective which endows the model with the ability to reconstruct the right order of a certain number of intentionally shuffled word tokens. This new word objective is jointly trained together with the original masked LM objective from BERT.
Figure 1(a) illustrates the procedure of jointly training the new word objective and the masked LM objective. In every input sequence, we first mask 15% of all tokens at random, as done in BERT [devlin2018bert]. The corresponding output vectors of the masked tokens computed by the bidirectional Transformer encoder are fed into a softmax classifier to predict the original tokens.
Next, the new word objective comes into play to take word order into consideration. Given the randomness of token shuffling, the word objective is equivalent to maximizing the likelihood of placing every shuffled token in its correct position. More formally, this objective can be formulated as:
$$\arg\max_{\theta} \sum \log P(pos_1 = t_1, pos_2 = t_2, \ldots, pos_K = t_K \mid t_1, t_2, \ldots, t_K, \theta)$$
where $\theta$ represents the set of trainable parameters of StructBERT, and $K$ indicates the length of every shuffled subsequence. Technically, a larger $K$ would force the model to be able to reconstruct longer sequences while injecting more disturbed input. On the contrary, when $K$ is smaller, the model gets more undisturbed sequences while being less capable of recovering long sequences. We decide to use trigrams (i.e., $K = 3$) for subsequence shuffling to balance language reconstructability and robustness of the model.
Specifically, as shown in Figure 1(a), we randomly choose some percentage of trigrams from unmasked tokens, and shuffle the three words within each of the trigrams (see the example in the figure). The output vectors of the shuffled tokens computed by the bidirectional Transformer encoder are then fed into a softmax classifier to predict the original tokens. The new word objective is jointly learned together with the masked LM objective in a unified pre-trained model with equal weights.
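The masking and trigram-shuffling procedure can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' code: masked positions are always replaced by `[MASK]` (BERT's full replacement scheme is omitted), candidate trigrams are taken as non-overlapping runs of unmasked tokens, at least one trigram is shuffled per sequence, and the function name `apply_word_objectives` and its parameters are hypothetical.

```python
import random

MASK = "[MASK]"

def apply_word_objectives(tokens, rng, mask_rate=0.15, shuffle_rate=0.05, k=3):
    """Return (corrupted_tokens, labels). labels[i] holds the original token at
    every masked or shuffled position and None everywhere else."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)

    # Step 1: mask a random subset of positions (simplified: always [MASK]).
    num_mask = max(1, round(mask_rate * len(tokens)))
    masked = set(rng.sample(range(len(tokens)), num_mask))
    for i in masked:
        corrupted[i] = MASK
        labels[i] = tokens[i]

    # Step 2: collect non-overlapping spans of k consecutive unmasked tokens.
    spans, i = [], 0
    while i + k <= len(tokens):
        if all(j not in masked for j in range(i, i + k)):
            spans.append(i)
            i += k
        else:
            i += 1

    # Step 3: shuffle the tokens inside a random subset of those spans.
    num_shuffle = min(len(spans), max(1, round(shuffle_rate * len(spans))))
    for start in rng.sample(spans, num_shuffle):
        window = corrupted[start:start + k]
        rng.shuffle(window)
        corrupted[start:start + k] = window
        for j in range(start, start + k):
            labels[j] = tokens[j]  # target: the token that belongs at position j
    return corrupted, labels

rng = random.Random(0)
tokens = [f"w{i}" for i in range(40)]  # toy input sequence
corrupted, labels = apply_word_objectives(tokens, rng)
```

Positions with a non-`None` label are the ones whose encoder outputs would be fed to the softmax classifier; all other positions contribute no loss.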
2.3.2 Sentence Structural Objective
The next sentence prediction task is considered easy for the original BERT model (the prediction accuracy of BERT can easily achieve 97%-98% in this task [devlin2018bert]). We, therefore, extend the sentence prediction task by predicting both the next sentence and the previous sentence, to make the pre-trained language model aware of the sequential order of the sentences in a bidirectional manner.
As illustrated in Figure 1(b), given a pair of sentences $(S_1, S_2)$ as input, we predict whether $S_2$ is the next sentence that follows $S_1$, or the previous sentence that precedes $S_1$, or a random sentence from a different document. Specifically, for the sentence $S_1$, $\frac{1}{3}$ of the time we choose the text span that follows $S_1$ as the second sentence $S_2$, $\frac{1}{3}$ of the time the previous sentence ahead of $S_1$ is selected, and $\frac{1}{3}$ of the time a sentence randomly sampled from the other documents is used as $S_2$. The two sentences are concatenated together into an input sequence with the separator token [SEP] in between, as done in BERT. We pool the model output by taking the hidden state corresponding to the first token [CLS], and feed the encoding vector of [CLS] into a softmax classifier to make a three-class prediction.
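A minimal sketch of how such three-class training pairs might be sampled is given below. It assumes the three cases are drawn with equal probability, that the anchor sentence has both a predecessor and a successor, and that at least one other document is available; the label ids and the helper names `make_sentence_pair` and `pack` are hypothetical.

```python
import random

NEXT, PREVIOUS, RANDOM = 0, 1, 2  # hypothetical label ids for the three classes

def make_sentence_pair(docs, doc_idx, sent_idx, rng):
    """Build one (s1, s2, label) example for the anchor docs[doc_idx][sent_idx]."""
    doc = docs[doc_idx]
    s1 = doc[sent_idx]
    label = rng.choice([NEXT, PREVIOUS, RANDOM])  # equal chance each (assumption)
    if label == NEXT:
        s2 = doc[sent_idx + 1]
    elif label == PREVIOUS:
        s2 = doc[sent_idx - 1]
    else:
        other = rng.choice([i for i in range(len(docs)) if i != doc_idx])
        s2 = rng.choice(docs[other])
    return s1, s2, label

def pack(s1, s2):
    """Concatenate two sentences with [CLS] and [SEP] tokens, as in BERT."""
    return ["[CLS]"] + s1.split() + ["[SEP]"] + s2.split() + ["[SEP]"]

docs = [["a b", "c d", "e f"], ["g h", "i j"]]  # toy corpus of two documents
rng = random.Random(1)
examples = [make_sentence_pair(docs, 0, 1, rng) for _ in range(300)]
packed = pack("c d", "e f")
```

The label of each example is the target of the three-class softmax classifier applied to the [CLS] encoding.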
2.4 Pre-training Setup
The training objective function is a linear combination of the word structural objective and the sentence structural objective. For the masked LM objective, we followed the same masking rate and settings as in BERT [devlin2018bert]. 5% of trigrams are selected for random shuffling.
We used documents from English Wikipedia (2,500M words) and BookCorpus [zhu2015aligning] as pre-training data, following the preprocessing and the WordPiece tokenization from [devlin2018bert]. The maximum length of input sequence was set to 512.
We ran Adam with a learning rate of 1e-4, $\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay of 0.01, learning rate warm-up over the first 10% of the total steps, and linear decay of the learning rate. We set a dropout probability of 0.1 for every layer. The gelu activation [hendrycks2016gaussian] was used as done in GPT [radford2018improving].
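The warm-up and decay schedule can be illustrated with a short function. This is a sketch under assumptions: the text does not state the final learning rate, so decay to zero at the last step is assumed here, and the function name `learning_rate` and its parameters are hypothetical.

```python
def learning_rate(step, total_steps, peak_lr=1e-4, warmup_fraction=0.1):
    """Linear warm-up to peak_lr over the first warmup_fraction of the steps,
    followed by linear decay to zero at total_steps (assumed end point)."""
    warmup_steps = max(1, int(warmup_fraction * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Sample the schedule at a few points of a hypothetical 1000-step run.
schedule = [learning_rate(s, 1000) for s in (0, 50, 100, 550, 1000)]
```

With 1000 total steps, the rate climbs linearly during the first 100 steps, peaks at 1e-4, and then falls linearly for the remaining 900 steps.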
We denote the number of Transformer block layers as $L$, the size of hidden vectors as $H$, and the number of self-attention heads as $A$. Following the practice of BERT, we primarily report experimental results on the two model sizes:
StructBERTBase: $L = 12$, $H = 768$, $A = 12$, Number of parameters $= 110$M
StructBERTLarge: $L = 24$, $H = 1024$, $A = 16$, Number of parameters $= 340$M
Pre-training of StructBERT was performed on a distributed computing cluster consisting of 64 Tesla V100 GPU cards. For the StructBERTBase, we ran the pre-training procedure for 40 epochs, which took about 38 hours, and the training of StructBERTLarge took about 7 days to complete.
In this section, we report results of StructBERT on a variety of downstream tasks including General Language Understanding Evaluation (GLUE benchmark), Stanford Natural Language Inference (SNLI corpus) and extractive question answering (SQuAD v1.1).
Following BERT’s practice, during fine-tuning on downstream tasks, we performed a grid search or an exhaustive search (depending on the data size) on the following sets of parameters and chose the model that performed the best on the dev set. All the other parameters remain the same as those in pre-training:
Batch size: 16, 24, 32; Learning rate: 2e-5, 3e-5, 5e-5; Number of epochs: 2, 3; Dropout rate: 0.05, 0.1
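The search over these settings can be sketched with `itertools.product`. The scoring function below is a hypothetical stand-in for "fine-tune, then evaluate on the dev set"; only the value lists come from the text, and the names `grid`, `select_best` and `toy_dev_score` are assumptions made for illustration.

```python
from itertools import product

SEARCH_SPACE = {
    "batch_size": [16, 24, 32],
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "num_epochs": [2, 3],
    "dropout": [0.05, 0.1],
}

def grid(space):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def select_best(space, evaluate):
    """Return (best_config, best_score) under a caller-supplied dev-set scorer."""
    return max(((cfg, evaluate(cfg)) for cfg in grid(space)),
               key=lambda pair: pair[1])

def toy_dev_score(cfg):
    """Hypothetical stand-in for fine-tuning plus dev-set evaluation."""
    return -abs(cfg["learning_rate"] - 3e-5) - abs(cfg["batch_size"] - 24) * 1e-6

configs = list(grid(SEARCH_SPACE))
best_config, best_score = select_best(SEARCH_SPACE, toy_dev_score)
```

The full grid contains 3 × 3 × 2 × 2 = 36 configurations, which is small enough to search exhaustively on the smaller datasets.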
|System||CoLA||SST-2||MRPC||STS-B||QQP||MNLI-m/mm||QNLI||RTE||WNLI||AX||Average|
|BERT on STILTs [phang2018sentence]||62.1||94.3||90.2/86.6||88.7/88.3||71.9/89.4||86.4/85.6||92.7||80.1||65.1||28.3||82.0|
|Snorkel MeTaL [ratner2017snorkel]||63.8||96.2||91.5/88.5||90.1/89.7||73.1/89.9||87.6/87.2||93.9||80.9||65.1||39.9||83.2|
|MT-DNN ensemble [liu2019multi]||65.4||96.5||92.2/89.5||89.6/89.0||73.7/89.9||87.9/87.4||96.0||85.7||65.1||42.8||84.2|
|XLNet ensemble [yang2019xlnet]||67.8||96.8||93.0/90.7||91.6/91.1||74.2/90.3||90.2/89.8||98.6||86.3||90.4||47.5||88.4|
|RoBERTa ensemble [liu2019roberta]||67.8||96.7||92.3/89.8||92.2/91.9||74.3/90.2||90.8/90.2||98.9||88.2||89.0||48.7||88.5|
3.1 General Language Understanding
3.1.1 GLUE benchmark
The General Language Understanding Evaluation (GLUE) benchmark [wang2018glue] is a collection of nine NLU tasks, covering textual entailment (RTE [bentivogli2009fifth] and MNLI [williams2017broad]), question-answer entailment (QNLI [wang2018glue]), paraphrase (MRPC [dolan2005automatically]), question paraphrase (QQP, https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs), textual similarity (STS-B [cer2017semeval]), sentiment (SST-2 [socher2013recursive]), linguistic acceptability (CoLA [warstadt2018neural]), and Winograd Schema (WNLI [levesque2012winograd]).
On the GLUE benchmark, given the similarity of MRPC/RTE/STS-B to MNLI, we fine-tuned StructBERT on MNLI before training on MRPC/RTE/STS-B data for the respective tasks. This follows the two-stage transfer learning STILTs introduced in [phang2018sentence]. For all the other tasks (i.e., QNLI, QQP, SST-2, CoLA and MNLI), we fine-tuned StructBERT for each single task only on its in-domain data.
Table 1 presents the results of published models on the GLUE test set obtained from the official benchmark evaluation server. Our StructBERTLarge ensemble surpassed all published models (excluding RoBERTa ensemble and XLNet ensemble) on the average score, and performed the best among these models in six of the nine tasks. In the most popular MNLI task, our StructBERTLarge single model improved the best result by 0.3%/0.5%. Since we fine-tuned on MNLI using only its in-domain data, this improvement is entirely attributed to our new training objectives. The most significant improvement over BERT was observed on CoLA (4.8%), which may be due to the strong correlation between the word order task and the grammatical error correction task. In the SST-2 task, our model improved over BERT but performed worse than MT-DNN, which indicates that sentiment analysis based on single sentences benefits less from the word structural objective and sentence structural objective.
With pre-training on a large corpus, XLNet ensemble and RoBERTa ensemble outperformed all published models including our StructBERTLarge ensemble. To take advantage of the large data which RoBERTa is trained on, we continued pre-training with our two new objectives from the released RoBERTa model, and named the resulting model StructBERTRoBERTa. At the time of paper submission, our StructBERTRoBERTa ensemble, which was submitted under a different name, ALICE, achieved the best performance among all published models including RoBERTa on the leaderboard, creating a new state-of-the-art result of 89.0% on the average GLUE score. This demonstrates that the proposed objectives are able to improve language models other than BERT.
Natural Language Inference (NLI) is one of the important tasks in natural language understanding. The goal of this task is to test the ability of the model to reason about the semantic relationship between two sentences. In order to perform well on an NLI task, a model needs to capture the semantics of sentences, and thus to infer the relationship between a pair of sentences: entailment, contradiction or neutral.
|System||Dev set||Test set|
We evaluated our model on the most widely used NLI dataset: the Stanford Natural Language Inference (SNLI) Corpus [bowman2015large], which consists of 549,367/9,842/9,824 premise-hypothesis pairs in train/dev/test sets and target labels indicating their relations. We performed a grid search on the sets of parameters, and chose the model that performed best on the dev set.
Table 2 compares the results of our model on the SNLI dataset with those of other published models. StructBERT outperformed all existing systems on SNLI, setting a new state-of-the-art result of 91.7%, which amounts to 0.4% absolute improvement over the previous state-of-the-art model SJRC and 0.9% absolute improvement over BERT. Since the network architecture of our model is identical to that of BERT, this improvement is entirely attributed to the new pre-training objectives, which justifies the effectiveness of the proposed tasks of word prediction and sentence prediction.
3.2 Extractive Question Answering
SQuAD v1.1 is a popular machine reading comprehension dataset consisting of 100,000+ questions created by crowd workers on 536 Wikipedia articles [rajpurkar2016squad]. The goal of the task is to extract the right answer span from the corresponding paragraph given a question.
We fine-tuned our StructBERT language model on the SQuAD dataset for 3 epochs, and compared the result against the state-of-the-art methods on the official leaderboard (https://rajpurkar.github.io/SQuAD-explorer/), as shown in Table 3. We can see that even without any additional data augmentation (DA) techniques, the proposed StructBERT model was superior to all published models except XLNet+DA on the dev set. (We have submitted the model under the name of ALICE to the SQuAD v1.1 CodaLab for evaluation on the test set. However, due to a crash of the CodaLab evaluation server, we had not received our test result at the time of paper submission. We will update the result once it is announced.) With data augmentation and a large corpus used during pre-training, XLNet+DA outperformed our StructBERT, which did not use data augmentation or a large pre-training corpus. Nevertheless, the results demonstrate the effectiveness of the proposed pre-trained StructBERT in modeling the question-paragraph relationship for extractive question answering. Incorporating the word and sentence structures significantly improves the understanding ability in this fine-grained answer extraction task.
3.3 Effect of Different Structural Objectives
We have demonstrated the strong empirical results of the proposed model on a variety of downstream tasks. In the StructBERT pre-training, the two new structural prediction tasks are the most important components. Therefore, we conducted an ablation study by removing one structural objective from pre-training at a time to examine how the two structural objectives influence the performance on various downstream tasks.
Results are presented in Table 4. From the table, we can see that: (1) the two structural objectives were both critical to most of the downstream tasks, except for the word structural objective in the SNLI task. Removing either the word or the sentence objective from pre-training always led to degraded performance in the downstream tasks. The StructBERT model with structural pre-training consistently outperformed the original BERT model, which shows the effectiveness of the proposed structural objectives. (2) For the sentence-pair tasks such as MNLI, SNLI, QQP and SQuAD, incorporating the sentence structural objective significantly improved the performance. This demonstrates the effect of inter-sentence structures learned by pre-training in understanding the relationship between sentences for downstream tasks. (3) For the single-sentence tasks such as CoLA and SST-2, the word structural objective played the most important role. Especially in the CoLA task, which is related to grammatical error correction, the improvement was over 5%. The ability to reconstruct the order of words in pre-training helped the model better judge the acceptability of a single sentence.
We also studied the effect of both structural objectives during self-supervised pre-training. Figure 2 illustrates the loss and accuracy of word and sentence prediction over the number of pre-training steps for StructBERTBase and BERTBase. From the two sub-figures on top, it is observed that compared with BERT, the augmented shuffled token prediction in StructBERT’s word structural objective had little effect on the loss and accuracy of masked token prediction. On the other hand, the integration of the simpler task of shuffled token prediction (lower loss and higher accuracy) provides StructBERT with the capability of word reordering. In contrast, the new sentence structural objective in StructBERT leads to a more challenging prediction task than that in BERT, as shown in the two figures at the bottom. This new pre-training objective enables StructBERT to exploit inter-sentence structures, which benefits sentence-pair downstream tasks.
4 Related Work
4.1 Contextualized Language Representation
A word can have different semantics depending on its context. Contextualized word representation is considered to be an important part of modern NLP research, with various pre-trained language models [mccann2017learned, peters2018deep, radford2018improving, devlin2018bert] emerging recently. ELMo [peters2018deep] learns two unidirectional LMs based on long short-term memory networks (LSTMs). A forward LM reads the text from left to right, and a backward LM encodes the text from right to left. Following an idea similar to that of ELMo, OpenAI GPT [radford2018improving] expands the unsupervised language model to a much larger scale by training on a giant collection of free text corpora. Different from ELMo, it builds upon a multi-layer Transformer [vaswani2017attention] decoder, and uses a left-to-right Transformer to predict a text sequence word-by-word.
In contrast, BERT [devlin2018bert] (as well as its robustly optimized version RoBERTa [liu2019roberta]) employs a bidirectional Transformer encoder to fuse both the left and the right context, and introduces two novel pre-training tasks for better language understanding. We base our LM on the architecture of BERT, and further extend it by introducing word and sentence structures into pre-training tasks for deep language understanding.
4.2 Word & Sentence Ordering
The task of linearization aims to recover the original order of a shuffled sentence [schmaltz2016word]. Part of a larger discussion as to whether LSTMs are capturing syntactic phenomena, linearization is standardized in a recent line of research as a method useful for isolating the performance of text-to-text generation models [zhang2015discriminative]. Recently, Transformers have emerged as a powerful architecture for learning the latent structure of language. For example, Bidirectional Transformers (BERT) has reduced the perplexity for the language modeling task. We revisit Elman’s question by applying BERT to the word-ordering task, without any explicit syntactic approaches, and find that pre-trained language models are effective for various downstream tasks with linearization.
Many important downstream tasks such as STS and NLI [wang2018glue] are based on understanding the relationship between two text sentences, which is not directly captured by language modeling. While BERT [devlin2018bert] pre-trains a binarized next sentence prediction task to understand sentence relationships, we take one step further and treat it as a sentence ordering task. The goal of sentence ordering is to arrange a set of sentences into a coherent text in a clear and consistent manner, which can be viewed as a ranking problem [chen2016neural]. The task is general and yet challenging, and is especially important for natural language generation [reiter1997building]. Text should be organized according to the following properties: rhetorical coherence, topical relevancy, chronological sequence, and cause-effect. In this work, we focus on what is arguably the most basic characteristic of a sequence: its order. Most prior work on sentence ordering was part of the study of downstream tasks, such as multi-document summarization [bollegala2010bottom]. We revisit this problem in the context of language modeling as a new sentence prediction task.
In this paper, we propose novel structural pre-training which incorporates word and sentence structures into BERT pre-training. A word structural objective and a sentence structural objective are introduced as two new pre-training tasks for deep understanding of natural language in different granularities. Experimental results demonstrate that the new StructBERT model can obtain new state-of-the-art results in a variety of downstream tasks, including the popular GLUE benchmark, the SNLI Corpus and the SQuAD v1.1 question answering.