ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding
Recently pre-trained models have achieved state-of-the-art results in various language understanding tasks. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurring information, there exists other valuable lexical, syntactic and semantic information in training corpora, such as named entities, semantic closeness and discourse relations. In order to extract the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named ERNIE 2.0 which incrementally builds pre-training tasks and then learn pre-trained models on these constructed tasks via continual multi-task learning. Based on this framework, we construct several tasks and train the ERNIE 2.0 model to capture lexical, syntactic and semantic aspects of information in the training data. Experimental results demonstrate that ERNIE 2.0 model outperforms BERT and XLNet on 16 tasks including English tasks on GLUE benchmarks and several similar tasks in Chinese. The source codes and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.
Pre-trained language representations such as ELMo[peters2018deep], OpenAI GPT[radford2018improving], BERT [devlin2018bert], ERNIE 1.0 [sun2019ernie]111In order to distinguish ERNIE 2.0 framework and the ERNIE model, the latter is referred to as ERNIE 1.0.[sun2019ernie] and XLNet[yang2019xlnet] have been proven to be effective for improving the performances of various natural language understanding tasks including sentiment classification [socher2013recursive], natural language inference [bowman2015large], named entity recognition [sang2003introduction] and so on.
Generally the pre-training of models often train the model based on the co-occurrence of words and sentences. While in fact, there are other lexical, syntactic and semantic information worth examining in training corpora other than co-occurrence. For example, named entities like person names, location names, and organization names, may contain conceptual information. Information like sentence order and sentence proximity enables the models to learn structure-aware representations. And semantic similarity at the document level or discourse relations among sentences allow the models to learn semantic-aware representations. In order to discover all valuable information in training corpora, be it lexical, syntactic or semantic representations, we propose a continual pre-training framework named ERNIE 2.0 which could incrementally build and train a large variety of pre-training tasks through continual multi-task learning.
Our ERNIE framework supports the introduction of various customized tasks continually, which is realized through continual multi-task learning. When given one or more new tasks, the continual multi-task learning method simultaneously trains the newly-introduced tasks together with the original tasks in an efficient way, without forgetting previously learned knowledge. In this way, our framework can incrementally train the distributed representations based on the previously trained parameters that it grasped. Moreover, in this framework, all the tasks share the same encoding networks, thus making the encoding of lexical, syntactic and semantic information across different tasks possible.
In summary, our contributions are as follows:
We propose a continual pre-training framework ERNIE 2.0, which efficiently supports customized training tasks and continual multi-task learning in an incremental way.
We construct three kinds of unsupervised language processing tasks to verify the effectiveness of the proposed framework. Experimental results demonstrate that ERNIE 2.0 achieves significant improvements over BERT and XLNet on 16 tasks including English GLUE benchmarks and several Chinese tasks.
Our fine-tuning code of ERNIE 2.0 and models pre-trained on English corpora are available at https://github.com/PaddlePaddle/ERNIE.
Unsupervised Learning for Language Representation
It is effective to learn general language representation by pre-training a language model with a large amount of unannotated data. Traditional methods usually focus on context-independent word embedding. Methods such as Word2Vec [mikolov2013efficient] and GloVe [pennington2014glove] learn fixed word embeddings based on word co-occurring information on large corpora.
Recently, several studies centered on contextualized language representations have been proposed and context-dependent language representations have shown state-of-the-art results in various natural language processing tasks. ELMo [peters2018deep] proposes to extract context-sensitive features from a language model. OpenAI GPT [radford2018improving] enhances the context-sensitive embedding by adjusting the Transformer [vaswani2017attention]. BERT [devlin2018bert], however, adopts a masked language model while adding a next sentence prediction task into the pre-training. XLM [lample2019cross] integrates two methods to learn cross-lingual language models, namely the unsupervised method that relies only on monolingual data and the supervised method that leverages parallel bilingual data. MT-DNN [liu2019multi] achieves a better result through learning several supervised tasks in GLUE[wang2018glue] together based on the pre-trained model, which eventually leads to improvements on other supervised tasks that are not learned in the stage of multi-task supervised fine-tuning. XLNet [yang2019xlnet] uses Transformer-XL [dai2019transformer] and proposes a generalized autoregressive pre-training method that learns bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order.
Continual learning[parisi2019continual, chen2018lifelong] aims to train the model with several tasks in sequence so that it remembers the previously-learned tasks when learning the new ones. These methods are inspired by the learning process of humans, as humans are capable of continuously accumulating the information acquired by study or experience to efficiently develop new skills. With continual learning, the model should be able to performs well on new tasks thanks to the knowledge acquired during previous training.
The ERNIE 2.0 Framework
As shown in Figure 1, the ERNIE 2.0 framework is built based on an widely-used architecture of pre-training and fine-tuning. ERNIE 2.0 differs from the previous pre-training ones in that, instead of training with a small number of pre-training objectives, it could constantly introduce a large variety of pre-training tasks to help the model efficiently learn the lexical, syntactic and semantic representations. Based on this, ERNIE 2.0 framework keeps updating the pre-trained model with continual multi-task learning. During fine-tuning, the ERNIE model is first initialized with the pre-trained parameters, and would be later fine-tuned using data from specific tasks.
The process of continual pre-training contains two steps. Firstly, We continually construct unsupervised pre-training tasks with big data and prior knowledge involved. Secondly, We incrementally update the ERNIE model via continual multi-task learning.
Pre-training Tasks Construction
We can construct different kinds of tasks at each time, including word-aware tasks, structure-aware tasks and semantic-aware tasks222For the detailed information of these tasks, please refer to the next section.. All of these pre-training tasks rely on self-supervised or weak-supervised signals that could be obtained from massive data without human annotation. Prior knowledge such as named entities, phrases and discourse relations is used to generate labels from large-scale data.
Continual Multi-task Learning
The ERNIE 2.0 framework aims to learn lexical, syntactic and semantic information from a number of different tasks. Thus there are two main challenges to overcome. The first is how to train the tasks in a continual way without forgetting the knowledge learned before. The second is how to pre-train these tasks in an efficient way. We propose a continual multi-task learning method to tackle with these two problems. Whenever a new task comes, the continual multi-task learning method first uses the previously learned parameters to initialize the model, and then train the newly-introduced task together with the original tasks simultaneously. This will make sure that the learned parameters encodes the previously-learned knowledge. One left problem is how to make it trained more efficiently. We solve this problem by allocating each task N training iterations. Our framework needs to automatically assign these N iterations for each task to different stages of training. In this way, we can guarantee the efficiency of our method without forgetting the previously trained knowledge 333For more details, please refer to Table 7 in the experiment section..
Figure 2 shows the difference among our method, multi-task learning from scratch and previous continual learning. Although multi-task learning from scratch could train multiple tasks at the same time, it is necessary that all customized pre-training tasks are prepared before the training could proceed. So this method takes as much time as continual learning does, if not more. Traditional continual learning method trains the model with only one task at each stage with the demerit that it may forget the previously learned knowledge.
|\diagboxCorpusTask||Token-Level Loss||Sentence-Level Loss|
||Knowledge Masking||Capital Prediction||Token-Document Relation||Sentence Reordering||Sentence Distance||Discourse Relation||IR Relevance|
|IR Relevance Data|
|Discourse Relation Data|
As shown in Figure 4, the architecture of our continual multi-task learning in each stage contains a series of shared text encoding layers to encode contextual information, which can be customized by using recurrent neural networks or a deep Transformer consisting of stacked self-attention layers[vaswani2017attention]. The parameters of the encoder can be updated across all learning tasks. There are two kinds of loss functions in our framework. One is the sentence-level loss and the other one is the token-level loss, which are similar to the loss functions of BERT. Each pre-training task has its own loss function. During pre-training, one sentence-level loss function can be combined with multiple token-level loss functions to continually update the model.
Fine-tuning for Application Tasks
By virtue of fine-tuning with task-specific supervised data, the pre-trained model can be adapted to different language understanding tasks, such as question answering, natural language inference, and semantic similarity. Each downstream task has its own fine-tuned models after being fine-tuned.
ERNIE 2.0 Model
In order to verify the effectiveness of the framework, we construct three different kinds of unsupervised language processing tasks and develop a pre-trained model called ERNIE 2.0 model. In this section we introduce the implementation of the model in the proposed framework.
The model uses a multi-layer Transformer[vaswani2017attention] as the basic encoder like other pre-training models such as GPT[radford2018improving], BERT[devlin2018bert] and XLM[lample2019cross]. The transformer can capture the contextual information for each token in the sequence via self-attention, and generate a sequence of contextual embeddings. Given a sequence, the special classification embedding [CLS] is added to the first place of the sequence. Furthermore, the symbol of [SEP] is added as the separator in the intervals of the segments for the multiple input segment tasks.
The model feeds task embedding to represent the characteristic of different tasks. We represent different tasks with an id ranging from 0 to N. Each task id is assigned to one unique task embedding. The corresponding token, segment, position and task embedding are taken as the input of the model. We can use any task id to initialize our model in the fine-tuning process. The model structure is shown in Figure 3.
We construct three different kinds of tasks to capture different aspects of information in the training corpora. The word-aware tasks enable the model to capture the lexical information, the structure-aware tasks enable the model capture the syntactic information of the corpus and the semantic-aware tasks aims to learn semantic information.
Word-aware Pre-training Tasks
Knowledge Masking Task
ERNIE 1.0[sun2019ernie] proposed an effective strategy to enhance representation through knowledge integration. It introduced phrase masking and named entity masking and predicts the whole masked phrases and named entities to help the model learn the dependency information in both local contexts and global contexts. We use this task to train an initial version of the model.
Capitalization Prediction Task
Capitalized words usually have certain specific semantic information compared to other words in sentences. The cased model has some advantages in tasks like named entity recognition while the uncased model is more suitable for some other tasks. To combine the advantages of both models, we add a task to predict whether the word is capitalized or not.
Token-Document Relation Prediction Task
This task predicts whether the token in a segment appears in other segments of the original document. Empirically, the words that appear in many parts of a document are usually commonly-used words or relevant with the main topics of the document. Therefore, through identifying the frequently-occurring words of a document appearing in the segment, the task can enable the ability of a model to capture the key words of the document to some extent.
Structure-aware Pre-training Tasks
Sentence Reordering Task
This task aims to learn the relationships among sentences. During the pre-training process of this task, a given paragraph is randomly split into 1 to m segments and then all of the combinations are shuffled by a random permuted order. We let the pre-trained model to reorganize these permuted segments, modeled as a k-class classification problem where . Empirically, the sentences reordering task can enable the pre-trained model to learn relationships among sentences in a document.
Sentence Distance Task
We also construct a pre-training task to learn the sentence distance using document-level information. This task is modeled as a 3-class classification problem. ”0” represents that the two sentences are adjacent in the same document, ”1” represent that the two sentences are in the same document, but not adjacent, and ”2” represents that the two sentences are from two different documents.
Semantic-aware Pre-training Tasks
Discourse Relation Task
Beside the distance task mentioned above, we introduce a task to predict the semantic or rhetorical relation between two sentences. We use the data built by Sileo et.al[sileo2019mining] to train a pre-trained model for English tasks. Following the method in Sileo et.al[sileo2019mining], we also automatically construct a Chinese dataset for pre-training.
IR Relevance Task
We build a pre-training task to learn the short text relevance in information retrieval. It is a 3-class classification task which predicts the relationship between a query and a title. We take the query as the first sentence and the title as the second sentence. The search log data from a commercial search engine is used as our pre-training data. There are three kinds of labels in this task. The query and title pairs that are labelled as ” 0” stand for strong relevance, which means that the title is clicked by the users after they input the query. Those labelled as ”1” represent weak relevance, which implies that when the query is input by the users, these titles appear in the search results but failed to be clicked by users. The label ”2” means that the query and title are completely irrelevant and random in terms of semantic information.
We compare the performance of ERNIE 2.0 with the state-of-the-art pre-training models. For English tasks, we compare our results with BERT [devlin2018bert] and XLNet [yang2019xlnet] on GLUE. For Chinese tasks, we compare the results with that of BERT [devlin2018bert] and the previous ERNIE 1.0 [sun2019ernie] model on several Chinese datasets. Moreover, we will compare our method with multi-task learning and traditional continual learning.
|IR Relevance Data||-||4500M|
|Discourse Relation Data||171M||1110M|
Pre-training and Implementation
Similar to that of BERT, some data in the English corpus are crawled from Wikipedia and BookCorpus. Besides we also collect some data from Reddit and use the Discovery data [sileo2019mining] as our discourse relation data. For the Chinese corpus, we collect a variety of data, such as encyclopedia, news, dialogue, information retrieval and discourse relation data from a search engine. The details of the pre-training data are shown in Table 2. The relationship between pre-training task and pre-training dataset is shown in Table 1.
To compare with BERT[devlin2018bert], We use the same model settings of transformer as BERT. The base model contains 12 layers, 12 self-attention heads and 768-dimensional of hidden size while the large model contains 24 layers, 16 self-attention heads and 1024-dimensional of hidden size. The model settings of XLNet [yang2019xlnet] are same as BERT.
|Epoch||Learning Rate||Batch Size||Epoch||Learning Rate||Batch Size|
|Epoch||Learning Rate||Batch Size||Epoch||Learning Rate||Batch Size|
ERNIE 2.0 is trained on 48 NVidia v100 GPU cards for the base model and 64 NVidia v100 GPU cards for the large model in both English and Chinese. The ERNIE 2.0 framework is implemented on PaddlePaddle, which is an end-to-end open source deep learning platform developed by Baidu. We use Adam optimizer that parameters of which are fixed to , , with a batch size of 393216 tokens. The learning rate is set as 5e-5 for English model and 1.28e-4 for Chinese model. It is scheduled by decay scheme noam [vaswani2017attention] with warmup over the first 4,000 steps for every pre-training task. By virtue of float16 operations, we manage to accelerate the training and reduce the memory usage of our models. Each of the pre-training tasks is trained until the metrics of pre-training tasks converge.
|Task(Metrics)||BASE model||LARGE model|
|BERT||ERNIE 2.0||BERT||XLNet||ERNIE 2.0||BERT||ERNIE 2.0|
|CoLA (Matthew Corr.)||52.1||55.2||60.6||63.6||65.4||60.5||63.5|
|STS-B (Pearson Corr./Spearman Corr.)||87.1/85.8||87.6/86.5||90.0/-||91.8/-||92.3/-||87.6/86.5||91.2/90.6|
|Task||Metrics||BERT||ERNIE 1.0||ERNIE 2.0||ERNIE 2.0|
|Pre-training method||Pre-training task||Training iterations (steps)||Fine-tuning result|
|Stage 1||Stage 2||Stage 3||Stage 4||MNLI||SST-2||MRPC|
|Continual Learning||Knowledge Masking||50k||-||-||-||77.3||86.4||82.5|
|Multi-task Learning||Knowledge Masking||50k||78.7||87.5||83.0|
|continual Multi-task Learning||Knowledge Masking||20k||10k||10k||10k||79.0||87.8||84.0|
As a multi-task benchmark and analysis platform for natural language understanding, General Language Understanding Evaluation (GLUE) is usually applied to evaluate the performance of models. We also test the performance of ERNIE 2.0 on GLUE. Specifically, GLUE covers a diverse range of NLP datasets, the details is shown [wang2018glue].
We executed extensive experiments on 9 Chinese NLP tasks, including machine reading comprehension, named entity recognition, natural language inference, semantic similarity, sentiment analysis and question answering. Specifically, the following Chinese datasets are chosen to evaluate the performance of ERNIE 2.0 on Chinese tasks:
Machine Reading Comprehension (MRC): CMRC 2018 [DBLP:journals/corr/abs-1810-07366], DRCD [shao2018drcd], and DuReader [he2017dureader].
Named Entity Recognition (NER): MSRA-NER [levow2006third].
Natural Language Inference (NLI): XNLI [conneau2018xnli].
Sentiment Analysis (SA): ChnSentiCorp 444https://github.com/pengming617/bert_classification.
Semantic Similarity (SS): LCQMC [liu2018lcqmc], and BQ Corpus [chen2018bq].
Question Answering (QA): NLPCC-DBQA 555http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf.
Implementation Details for Fine-tuning
Results on English Tasks
We evaluate the performance of the base models and the large models of each method on GLUE. Considering the fact that only the results of the single model XLNet on the dev set are reported, we also reports the results of each method on the dev set. In order to obtain a fair comparison with BERT and XLNet, we run a single-task and single-model 666which mean the result without additional tricks such as ensemble and multi-task losses. ERNIE 2.0 on the dev set. The detailed results on GLUE are depicted in Table 5.
As shown in the BASE model columns of Table 5, ERNIE 2.0 outperforms BERT on all of the 10 tasks and obtains a score of 80.6. As shown in the dev columns of LARGE model section in Table 5, ERNIE 2.0 consistently outperforms BERT and XLNet on most of the tasks except MNLI-m. Furthermore, as shown in the LARGE model section in Table 5, ERNIE 2.0 outperforms BERT on all of the 10 tasks, which gets a score of 83.6 on the GLUE test set and achieves a 3.1% improvement over the previous SOTA pre-training model BERT.
Results on Chinese Tasks
Table 6 shows the performances on 9 classical Chinese NLP tasks. It can be seen that ERNIE 1.0 outperforms BERT on XNLI, MSRA-NER, ChnSentiCorp, LCQMC and NLPCC-DBQA tasks, yet the performance is less ideal on the rest, which is caused by the difference in pre-training between the two methods. Specifically, the pre-training data of ERNIE 1.0 does not contain instances whose length exceeds 128, but BERT is pre-trained with the instances whose length are 512. From the results, it can be also seen that the proposed ERNIE 2.0 makes further progress, which significantly outperforms BERT on all of the nine tasks. Furthermore, we train a large version of ERNIE 2.0. ERNIE 2.0 achieves the best performance and creates new state-of-the-art results on these Chinese NLP tasks.
Comparison of Different Learning Methods
In order to analyze the effectiveness of the continual multi-task learning strategy adopted in our framework, we compare this method with two other methods as shown in figure 2. Table 7 describes the detailed information. For all the methods, we assume that the training iterations are the same for each task. In our settings, each task can be trained in 50k iterations, with 200k iterations for all of the tasks. As it can be seen, multi-task learning trains all the tasks in one stage, continual pre-training trains the tasks one by one, while our continual multi-task learning method can assign different iterations to each task in different training stages. The experimental result shows that continual multi-task learning obtains the better performance on downstream tasks compared with the other two methods, without sacrificing any efficiency. The result also indicates that our pre-training method can trains the new tasks in a more effective and efficient way. Moreover, the comparison between continual multi-task learning, multi-task learning and traditional continual learning shows that the first two methods outperform the third one, which confirms our intuition that traditional continual learning tends to forget the knowledge it has learnt when there is only one new task involved each time.
We proposed a continual pre-training framework named ERNIE 2.0, in which pre-training tasks can be incrementally built and learned through continual multi-task learning in a continual way. Based on the framework, we constructed several pre-training tasks covering different aspects of language and trained a new model called ERNIE 2.0 model which is more competent in language representation. ERNIE 2.0 was tested on the GLUE benchmarks and various Chinese tasks. It obtained significant improvements over BERT and XLNet. In the future, we will introduce more pre-training tasks to the ERNIE 2.0 framework to further improve the performance of the model. We will also investigate other sophisticated continual learning method in our framework.