Multi-Task Deep Neural Networks for Natural Language Understanding

Xiaodong Liu*, Pengcheng He, Weizhu Chen, Jianfeng Gao
Microsoft Research         Microsoft Dynamics 365 AI
* Equal Contribution.

In this paper, we present a Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks. MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that leads to more general representations to help adapt to new tasks and domains. MT-DNN extends the model proposed in Liu et al. (2015) by incorporating a pre-trained bidirectional transformer language model, known as BERT (Devlin et al., 2018). MT-DNN obtains new state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight out of nine GLUE tasks, pushing the GLUE benchmark to 82.2% (1.8% absolute improvement). We also demonstrate using the SNLI and SciTail datasets that the representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations. Our code and pre-trained models will be made publicly available.

1 Introduction

Learning vector-space representations of text, e.g., words and sentences, is fundamental to many natural language understanding (NLU) tasks. Two popular approaches are multi-task learning and language model pre-training. In this paper we strive to combine the strengths of both approaches by proposing a new Multi-Task Deep Neural Network (MT-DNN).

Multi-Task Learning (MTL) is inspired by human learning activities where people often apply the knowledge learned from previous tasks to help learn a new task (Caruana, 1997; Zhang and Yang, 2017). For example, it is easier for a person who knows how to ski to learn skating than for one who does not. Similarly, it is useful for multiple (related) tasks to be learned jointly so that the knowledge learned in one task can benefit other tasks. Recently, there has been growing interest in applying MTL to representation learning using deep neural networks (DNNs) (Collobert et al., 2011; Liu et al., 2015; Luong et al., 2015; Xu et al., 2018) for two reasons. First, supervised learning of DNNs requires large amounts of task-specific labeled data, which is not always available. MTL provides an effective way of leveraging supervised data from many related tasks. Second, multi-task learning benefits from a regularization effect that alleviates overfitting to a specific task, thus making the learned representations universal across tasks.

In contrast to MTL, language model pre-training has been shown to be effective for learning universal language representations by leveraging large amounts of unlabeled data. A recent survey is included in Gao et al. (2018). Some of the most prominent examples are ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and BERT (Devlin et al., 2018). These are neural network language models trained on text data using unsupervised objectives. For example, BERT is based on a multi-layer bidirectional Transformer, and is trained on plain text for masked word prediction and next sentence prediction tasks. To apply a pre-trained model to specific NLU tasks, we often need to fine-tune, for each task, the model with additional task-specific layers using task-specific training data. For example, Devlin et al. (2018) shows that BERT can be fine-tuned this way to create state-of-the-art models for a range of NLU tasks, such as question answering and natural language inference.

We argue that MTL and language model pre-training are complementary technologies, and can be combined to improve the learning of text representations to boost the performance of various NLU tasks. To this end, we extend the MT-DNN model originally proposed in Liu et al. (2015) by incorporating BERT as its shared text encoding layers. As shown in Figure 1, the lower layers (i.e., text encoding layers) are shared across all tasks, while the top layers are task-specific, combining different types of NLU tasks such as single-sentence classification, pairwise text classification, text similarity, and relevance ranking. Similar to the BERT model, MT-DNN is trained in two stages: pre-training and fine-tuning. Unlike BERT, MT-DNN uses MTL in the fine-tuning stage with multiple task-specific layers in its model architecture.

MT-DNN obtains new state-of-the-art results on eight out of nine NLU tasks used in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), pushing the GLUE benchmark score to 82.2%, amounting to 1.8% absolute improvement over BERT. (The only GLUE task on which MT-DNN does not create a new state-of-the-art result is WNLI. As noted on the GLUE webpage, there are issues in the dataset, and none of the submitted systems has ever outperformed the majority-voting baseline, whose accuracy is 65.1%.) We further extend the superiority of MT-DNN to the SNLI (Bowman et al., 2015a) and SciTail (Khot et al., 2018) tasks. The representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations. For example, our adapted models achieve the accuracy of 91.1% on SNLI and 94.1% on SciTail, outperforming the previous state-of-the-art performance by 1.0% and 5.8%, respectively. Even with only 0.1% or 1.0% of the original training data, the performance of MT-DNN on both SNLI and SciTail datasets is fairly good and much better than many existing models. All of this clearly demonstrates MT-DNN's exceptional generalization capability via multi-task learning.

2 Tasks

The MT-DNN model combines four types of NLU tasks: single-sentence classification, pairwise text classification, text similarity scoring, and relevance ranking. For concreteness, we describe them using the NLU tasks defined in the GLUE benchmark as examples.

Single-Sentence Classification:

Given a sentence (in this study, a sentence can be an arbitrary span of contiguous text or word sequence, rather than a linguistically plausible sentence), the model labels it using one of the pre-defined class labels. For example, the CoLA task is to predict whether an English sentence is grammatically plausible. The SST-2 task is to determine whether the sentiment of a sentence extracted from movie reviews is positive or negative.

Text Similarity:

This is a regression task. Given a pair of sentences, the model predicts a real-valued score indicating the semantic similarity of the two sentences. STS-B is the only example of this task in GLUE.

Pairwise Text Classification:

Given a pair of sentences, the model determines the relationship of the two sentences based on a set of pre-defined labels. For example, both RTE and MNLI are language inference tasks, where the goal is to predict whether a sentence is an entailment, contradiction, or neutral with respect to the other. QQP and MRPC are paraphrase datasets that consist of sentence pairs. The task is to predict whether the sentences in the pair are semantically equivalent.

Relevance Ranking:

Given a query and a list of candidate answers, the model ranks all of the candidates in the order of relevance to the query. QNLI is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016). The task involves assessing whether a sentence contains the correct answer to a given query. Although QNLI is defined as a binary classification task in GLUE, in this study we formulate it as a pairwise ranking task, where the model is expected to rank the candidate that contains the correct answer higher than the candidate that does not. We will show that this formulation leads to a significant improvement in accuracy over binary classification.

3 The Proposed MT-DNN Model

Figure 1: Architecture of the MT-DNN model for representation learning. The lower layers are shared across all tasks while the top layers are task-specific. The input $X$ (either a sentence or a pair of sentences) is first represented as a sequence of embedding vectors, one for each word, in $l_1$. Then the Transformer encoder captures the contextual information for each word and generates the shared contextual embedding vectors in $l_2$. Finally, for each task, additional task-specific layers generate task-specific representations, followed by operations necessary for classification, similarity scoring, or relevance ranking.

The architecture of the MT-DNN model is shown in Figure 1. The lower layers are shared across all tasks, while the top layers represent task-specific outputs. The input $X$, which is a word sequence (either a sentence or a pair of sentences packed together), is first represented as a sequence of embedding vectors, one for each word, in $l_1$. Then the Transformer encoder captures the contextual information for each word via self-attention, and generates a sequence of contextual embeddings in $l_2$. This is the shared semantic representation that is trained by our multi-task objectives. In what follows, we elaborate on the model in detail.

Lexicon Encoder ($l_1$):

The input $X = \{x_1, \dots, x_m\}$ is a sequence of tokens of length $m$. Following Devlin et al. (2018), the first token $x_1$ is always the [CLS] token. If $X$ is packed by a sentence pair $(X_1, X_2)$, we separate the two sentences with a special token [SEP]. The lexicon encoder maps $X$ into a sequence of input embedding vectors, one for each token, constructed by summing the corresponding word, segment, and positional embeddings.

Transformer Encoder ($l_2$):

We use a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017) to map the input representation vectors ($l_1$) into a sequence of contextual embedding vectors $\mathbf{C} \in \mathbb{R}^{d \times m}$. This is the shared representation across different tasks. Unlike the BERT model (Devlin et al., 2018) that learns the representation via pre-training and adapts it to each individual task via fine-tuning, MT-DNN learns the representation using multi-task objectives.

Single-Sentence Classification Output:

Suppose that $\mathbf{x}$ is the contextual embedding ($l_2$) of the token [CLS], which can be viewed as the semantic representation of input sentence $X$. Take the SST-2 task as an example. The probability that $X$ is labeled as class $c$ (i.e., the sentiment) is predicted by a logistic regression with softmax:

$$P_r(c|X) = \text{softmax}(W_{SST}^{\top} \cdot \mathbf{x}) \qquad (1)$$
where $W_{SST}$ is the task-specific parameter matrix.
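As an illustration, this classification head is just an affine map followed by a softmax. The pure-Python sketch below uses a hypothetical 3-dimensional embedding and a 2-class weight matrix (in the real model, the embedding comes from the Transformer and the weights are learned):

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def classify(x, W):
    # x: contextual [CLS] embedding (a list of d floats)
    # W: task-specific parameter matrix, one row of d weights per class
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    return softmax(logits)

# Toy example: a d=3 "embedding" and two classes (e.g., SST-2 negative/positive).
probs = classify([0.2, -0.1, 0.5], [[1.0, 0.0, 0.5], [-1.0, 0.2, 0.0]])
```

The returned list is a probability distribution over the class labels; during fine-tuning the weight matrix is updated by the cross-entropy objective of Equation 6.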

Text Similarity Output:

Take the STS-B task as an example. Suppose that $\mathbf{x}$ is the contextual embedding ($l_2$) of [CLS], which can be viewed as the semantic representation of the input sentence pair $(X_1, X_2)$. We introduce a task-specific parameter vector $\mathbf{w}_{STS}$ to compute the similarity score as:

$$\text{Sim}(X_1, X_2) = g(\mathbf{w}_{STS}^{\top} \cdot \mathbf{x}) \qquad (2)$$
where $g(\cdot)$ is a sigmoid function that maps the score to a real value in the range $[0, 1]$.
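A minimal sketch of this scoring head, with a toy [CLS] embedding and weight vector (both hypothetical stand-ins for the learned quantities):

```python
import math

def sim_score(x, w):
    # x: [CLS] embedding of the packed sentence pair (X1, X2)
    # w: task-specific parameter vector (w_STS in the text)
    s = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))  # sigmoid g(.) squashes the score into (0, 1)
```

A zero dot product maps to 0.5, and larger scores move monotonically toward 1, giving a bounded similarity value that the MSE objective of Equation 7 can regress against.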

Pairwise Text Classification Output:

Take natural language inference (NLI) as an example. The NLI task defined here involves a premise $P = (p_1, \dots, p_m)$ of $m$ words and a hypothesis $H = (h_1, \dots, h_n)$ of $n$ words, and aims to find a logical relationship $R$ between $P$ and $H$. The design of the output module follows the answer module of the stochastic answer network (SAN) (Liu et al., 2018a), a state-of-the-art neural NLI model. SAN's answer module uses multi-step reasoning. Rather than directly predicting the entailment given the input, it maintains a state and iteratively refines its predictions.

The SAN answer module works as follows. We first construct the working memory of premise $P$ by concatenating the contextual embeddings of the words in $P$, which are the output of the transformer encoder, denoted as $M^p \in \mathbb{R}^{d \times m}$, and similarly the working memory of hypothesis $H$, denoted as $M^h \in \mathbb{R}^{d \times n}$. Then, we perform $K$-step reasoning on the memory to output the relation label, where $K$ is a hyperparameter. At the beginning, the initial state $s^0$ is the summary of $M^h$: $s^0 = \sum_j \alpha_j M_j^h$, where $\alpha_j = \frac{\exp(\mathbf{w}_1^{\top} M_j^h)}{\sum_i \exp(\mathbf{w}_1^{\top} M_i^h)}$. At time step $k$ in the range of $\{1, 2, \dots, K-1\}$, the state is defined by $s^k = \text{GRU}(s^{k-1}, x^k)$. Here, $x^k$ is computed from the previous state $s^{k-1}$ and memory $M^p$: $x^k = \sum_j \beta_j M_j^p$ and $\beta_j = \text{softmax}(s^{k-1} W_2^{\top} M^p)$. A one-layer classifier is used to determine the relation at each step $k$:

$$P_r^{k} = \text{softmax}\big(W_3 \, [s^{k}; x^{k}; |s^{k} - x^{k}|; s^{k} \cdot x^{k}]\big) \qquad (3)$$
At last, we utilize all of the $K$ outputs by averaging the scores:

$$P_r = \text{avg}\big([P_r^{0}, P_r^{1}, \dots, P_r^{K-1}]\big) \qquad (4)$$
Each $P_r^{k}$ is a probability distribution over all the relations $R$. During training, we apply stochastic prediction dropout (Liu et al., 2018b) before the above averaging operation. During decoding, we average all outputs to improve robustness.
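The averaging step with stochastic prediction dropout can be sketched as follows. This is an illustrative scalar version (the actual SAN implementation operates on tensors, and the fallback-to-last-step behavior here is an assumption for the degenerate all-dropped case):

```python
import random

def san_predict(step_distributions, drop_prob=0.0, training=False, seed=0):
    """Average the per-step relation distributions P_r^k.

    During training, stochastic prediction dropout may drop a step's output
    before averaging; at decoding time all K outputs are averaged."""
    rng = random.Random(seed)
    kept = [p for p in step_distributions
            if not (training and rng.random() < drop_prob)]
    if not kept:                       # keep at least one step's prediction
        kept = [step_distributions[-1]]
    n_rel = len(kept[0])
    return [sum(p[c] for p in kept) / len(kept) for c in range(n_rel)]
```

Because each $P_r^k$ is a distribution, the average is again a distribution over the relation labels.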

Relevance Ranking Output:

Take QNLI as an example. Suppose that $\mathbf{x}$ is the contextual embedding vector of [CLS], which is the semantic representation of a pair of question $Q$ and its candidate answer $A$. We compute the relevance score as:

$$\text{Rel}(Q, A) = g(\mathbf{w}_{QNLI}^{\top} \cdot \mathbf{x}) \qquad (5)$$
For a given $Q$, we rank all of its candidate answers based on their relevance scores computed using Equation 5.

3.1 The Training Procedure

The training procedure of MT-DNN consists of two stages: pretraining and multi-task fine-tuning. The pretraining stage follows that of the BERT model (Devlin et al., 2018). The parameters of the lexicon encoder and Transformer encoder are learned using two unsupervised prediction tasks: masked language modeling and next sentence prediction. (In this study we use the pre-trained BERT models released by the authors.)

In the multi-task fine-tuning stage, we use mini-batch based stochastic gradient descent (SGD) to learn the parameters of our model (i.e., the parameters of all shared layers and task-specific layers), as shown in Algorithm 1. In each epoch, a mini-batch $b_t$ is selected (e.g., from among all 9 GLUE tasks), and the model is updated according to the task-specific objective for task $t$. This approximately optimizes the sum of all multi-task objectives.

Initialize model parameters $\Theta$ randomly.
Pre-train the shared layers (i.e., the lexicon encoder and the transformer encoder).
Set the max number of epochs: $epoch_{max}$.
// Prepare the data for $T$ tasks.
for $t$ in $1, 2, \dots, T$ do
       Pack the dataset $t$ into mini-batches: $D_t$.
end for
for $epoch$ in $1, 2, \dots, epoch_{max}$ do
       1. Merge all the datasets: $D = D_1 \cup D_2 \cup \dots \cup D_T$
       2. Shuffle $D$
       for $b_t$ in $D$ do
             // $b_t$ is a mini-batch of task $t$.
             3. Compute loss: $L(\Theta)$
                 Eq. 6 for classification
                 Eq. 7 for regression
                 Eq. 8 for ranking
             4. Compute gradient: $\nabla(\Theta)$
             5. Update model: $\Theta = \Theta - \epsilon \nabla(\Theta)$
       end for
end for
Algorithm 1: Training a MT-DNN model.
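The outer loop of Algorithm 1 can be sketched as follows; the task names and the `update_fn` callback are placeholders for the real task-specific loss computation and optimizer step:

```python
import random

def mtl_epochs(task_batches, update_fn, epochs=1, seed=0):
    """Sketch of Algorithm 1's fine-tuning loop.

    task_batches: dict mapping a task name to its list of mini-batches (D_t).
    update_fn(task, batch): stand-in for steps 3-5 (loss, gradient, update)."""
    rng = random.Random(seed)
    for _ in range(epochs):
        # 1. Merge all the datasets: D = D_1 u D_2 ... u D_T
        merged = [(task, batch)
                  for task, batches in task_batches.items()
                  for batch in batches]
        rng.shuffle(merged)            # 2. Shuffle D
        for task, batch in merged:
            update_fn(task, batch)     # 3-5. task-specific loss and update
```

Shuffling the merged pool means consecutive updates typically come from different tasks, which is what makes the procedure approximately optimize the sum of all task objectives.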

For the classification tasks (i.e., single-sentence or pairwise text classification), we use the cross-entropy loss as the objective:

$$-\sum_{c} \mathbb{1}(X, c) \log\big(P_r(c|X)\big) \qquad (6)$$
where $\mathbb{1}(X, c)$ is the binary indicator (0 or 1) of whether class label $c$ is the correct classification for $X$, and $P_r(c|X)$ is defined by, e.g., Equation 1 or 4.

For the text similarity tasks, such as STS-B, where each sentence pair is annotated with a real-valued score $y$, we use the mean squared error as the objective:

$$\big(y - \text{Sim}(X_1, X_2)\big)^2 \qquad (7)$$
where $\text{Sim}(X_1, X_2)$ is defined by Equation 2.

The objective for the relevance ranking tasks follows the pairwise learning-to-rank paradigm (Burges et al., 2005; Huang et al., 2013). Take QNLI as an example. Given a query $Q$, we obtain a list of candidate answers $\mathcal{A}$ which contains a positive example $A^+$ that includes the correct answer, and $|\mathcal{A}| - 1$ negative examples. We then minimize the negative log likelihood of the positive example given queries across the training data:

$$-\sum_{(Q, A^{+})} \log P_r(A^{+}|Q), \quad P_r(A^{+}|Q) = \frac{\exp\big(\gamma\,\text{Rel}(Q, A^{+})\big)}{\sum_{A' \in \mathcal{A}} \exp\big(\gamma\,\text{Rel}(Q, A')\big)} \qquad (8)$$
where $\text{Rel}(Q, A)$ is defined by Equation 5 and $\gamma$ is a tuning factor determined on held-out data. In our experiment, we simply set $\gamma$ to 1.
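For one query, this ranking objective reduces to a softmax over the candidates' relevance scores followed by a negative log likelihood of the positive one. A sketch, assuming the raw scores have already been produced by Equation 5 (the function name is illustrative):

```python
import math

def ranking_nll(scores, positive_idx, gamma=1.0):
    # scores: Rel(Q, A') for each candidate A' in the candidate list
    # positive_idx: index of the candidate A+ containing the correct answer
    # Returns -log P(A+ | Q), with P given by a gamma-scaled softmax.
    exps = [math.exp(gamma * s) for s in scores]
    p_pos = exps[positive_idx] / sum(exps)
    return -math.log(p_pos)
```

Minimizing this loss pushes the positive candidate's score above the negatives', which is exactly the behavior the ranking formulation of QNLI exploits.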

4 Experiments

Corpus Task #Train #Dev #Test #Label Metrics
 Single-Sentence Classification (GLUE)
CoLA Acceptability 8.5k 1k 1k 2 Matthews corr
SST-2 Sentiment 67k 872 1.8k 2 Accuracy
 Pairwise Text Classification (GLUE)
MNLI NLI 393k 20k 20k 3 Accuracy
RTE NLI 2.5k 276 3k 2 Accuracy
WNLI NLI 634 71 146 2 Accuracy
QQP Paraphrase 364k 40k 391k 2 Accuracy/F1
MRPC Paraphrase 3.7k 408 1.7k 2 Accuracy/F1
 Text Similarity (GLUE)
STS-B Similarity 7k 1.5k 1.4k 1 Pearson/Spearman corr
 Relevance Ranking (GLUE)
QNLI QA/NLI 108k 5.7k 5.7k 2 Accuracy
 Pairwise Text Classification
SNLI NLI 549k 9.8k 9.8k 3 Accuracy
SciTail NLI 23.5k 1.3k 2.1k 2 Accuracy

Table 1: Summary of the three benchmarks: GLUE, SNLI and SciTail.

We evaluate the proposed MT-DNN on three popular NLU benchmarks: GLUE (Wang et al., 2018), Stanford Natural Language Inference (SNLI) (Bowman et al., 2015b), and SciTail (Khot et al., 2018). We compare MT-DNN with existing state-of-the-art models including BERT and demonstrate the effectiveness of MTL for model fine-tuning using GLUE and domain adaptation using SNLI and SciTail.

4.1 Datasets

This section briefly describes the GLUE, SNLI, and SciTail datasets, as summarized in Table 1.

The GLUE benchmark is a collection of nine NLU tasks, including question answering, sentiment analysis, and textual entailment; it is considered well-designed for evaluating the generalization and robustness of NLU models. Both SNLI and SciTail are NLI tasks.


CoLA: The Corpus of Linguistic Acceptability task is to predict whether an English sentence is linguistically "acceptable" or not (Warstadt et al., 2018). It uses the Matthews correlation coefficient (Matthews, 1975) as the evaluation metric.


SST-2: The Stanford Sentiment Treebank task is to determine the sentiment of sentences. The sentences are extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013). Accuracy is used as the evaluation metric.


STS-B: The Semantic Textual Similarity Benchmark is a collection of sentence pairs collected from multiple data resources including news headlines, video and image captions, and NLI data (Cer et al., 2017). Each pair is human-annotated with a similarity score from one to five, indicating how similar the two sentences are. The task is evaluated using two metrics: the Pearson and Spearman correlation coefficients.


QNLI: This task is derived from the Stanford Question Answering Dataset (Rajpurkar et al., 2016), which has been converted to a binary classification task in GLUE. A query-candidate-answer tuple is labeled as positive if the candidate contains the correct answer to the query, and negative otherwise. In this study, however, we formulate QNLI as a relevance ranking task, where for a given query, its positive candidate answers are considered more relevant, and thus should be ranked higher than its negative candidates.


QQP: The Quora Question Pairs dataset is a collection of question pairs extracted from the community question-answering website Quora. The task is to predict whether two questions are semantically equivalent (Chen et al., 2018). As the distribution of positive and negative labels is unbalanced, both accuracy and F1 score are used as evaluation metrics.


MRPC: The Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations denoting whether the sentences in a pair are semantically equivalent (Dolan and Brockett, 2005). Similar to QQP, both accuracy and F1 score are used as evaluation metrics.


MNLI: Multi-Genre Natural Language Inference is a large-scale, crowd-sourced entailment classification task (Nangia et al., 2017). Given a pair of sentences (i.e., a premise-hypothesis pair), the goal is to predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise. The test and development sets are split into in-domain (matched) and cross-domain (mismatched) sets. The evaluation metric is accuracy.


RTE: The Recognizing Textual Entailment dataset is collected from a series of annual challenges on textual entailment. The task is similar to MNLI, but uses only two labels: entailment and not_entailment (Wang et al., 2018).


WNLI: The Winograd NLI dataset is a natural language inference dataset derived from the Winograd Schema dataset (Levesque et al., 2012). This is a reading comprehension task. The goal is to select the referent of a pronoun from a list of choices in a given sentence which contains the pronoun.


SNLI: The Stanford Natural Language Inference dataset contains 570k human-annotated sentence pairs, in which the premises are drawn from the captions of the Flickr30 corpus and hypotheses are manually annotated (Bowman et al., 2015b). This is the most widely used entailment dataset for NLI. The dataset is used only for domain adaptation in this study.


SciTail: This is a textual entailment dataset derived from a science question answering (SciQ) dataset (Khot et al., 2018). The task involves assessing whether a given premise entails a given hypothesis. In contrast to the other entailment datasets mentioned previously, the hypotheses in SciTail are created from science questions while the corresponding answer candidates and premises come from relevant web sentences retrieved from a large corpus. As a result, these sentences are linguistically challenging and the lexical similarity of premise and hypothesis is often high, thus making SciTail particularly difficult. The dataset is used only for domain adaptation in this study.

4.2 Implementation details

Our implementation of MT-DNN is based on the PyTorch implementation of BERT. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used, unless stated otherwise. Following Liu et al. (2018a), we set the number of reasoning steps $K$ to 5 with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradient norm to within 1. All the texts were tokenized using wordpieces and were chopped to spans no longer than 512 tokens.
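The learning-rate schedule described above can be sketched as follows. Only the 0.1 warm-up proportion and the 5e-5 peak rate come from the text; the function name and the decay-to-zero endpoint are assumptions of this sketch:

```python
def linear_warmup_decay(step, total_steps, warmup=0.1, peak_lr=5e-5):
    """Linear warm-up over the first `warmup` fraction of training,
    followed by linear decay (assumed here to reach zero at the end)."""
    warm_steps = int(total_steps * warmup)
    if step < warm_steps:
        # ramp up linearly from 0 to peak_lr
        return peak_lr * step / max(1, warm_steps)
    # decay linearly from peak_lr back toward 0
    return peak_lr * (total_steps - step) / max(1, total_steps - warm_steps)
```

Warm-up keeps early updates small while the task-specific layers are still randomly initialized, which is the usual motivation for this schedule in BERT-style fine-tuning.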

Model                            CoLA SST-2 MRPC      STS-B     QQP       MNLI-m/mm QNLI RTE  WNLI AX   Score
#Train                           8.5k 67k   3.7k      7k        364k      393k      108k 2.5k 634  -    -
BiLSTM+ELMo+Attn                 36.0 90.4  84.9/77.9 75.1/73.3 64.8/84.7 76.4/76.1 79.9 56.8 65.1 26.5 70.5
Singletask Pretrain Transformer  45.4 91.3  82.3/75.7 82.0/80.0 70.3/88.5 82.1/81.4 88.1 56.0 53.4 29.8 72.8
GPT on STILTs                    47.2 93.1  87.7/83.7 85.3/84.8 70.1/88.1 80.8/80.6 87.2 69.1 65.1 29.4 76.9
BERT                             60.5 94.9  89.3/85.4 87.6/86.5 72.1/89.3 86.7/85.9 91.1 70.1 65.1 39.6 80.4
MT-DNN                           61.5 95.6  90.0/86.7 88.3/87.7 72.4/89.6 86.7/86.0 98.0 75.5 65.1 40.3 82.2
Table 2: GLUE test set results, which are scored by the GLUE evaluation server. The number below each task denotes the number of training examples. MT-DNN uses BERTLARGE for its shared layers. All the results are obtained from the GLUE leaderboard.
Model    MNLI-m/mm QQP       MRPC      RTE  QNLI SST-2 CoLA STS-B
BERT     84.5/84.4 90.4/87.4 84.5/89.0 65.0 88.4 92.8  55.4 89.6/89.2
ST-DNN   84.7/84.6 91.0/87.9 86.6/89.1 64.6 94.6 -     -    -
MT-DNN   85.3/85.0 91.6/88.6 86.8/89.2 79.1 95.7 93.6  59.5 90.6/90.4
Table 3: GLUE dev set results. The best result on each task is in bold. BERTBASE is the base BERT model released by the authors, and is fine-tuned for each single task. The Single-Task DNN (ST-DNN) uses the same model architecture as MT-DNN. But instead of fine-tuning one model for all tasks using MTL, we create multiple ST-DNNs, one for each task using only in-domain data for fine-tuning. ST-DNNs and MT-DNN use BERTBASE for their shared layers.

4.3 GLUE Results

The test results on GLUE are presented in Table 2. (There is an ongoing discussion on revising the QNLI dataset; we will update the results when the new dataset is available.) MT-DNN outperforms all existing systems on all tasks, except WNLI, creating new state-of-the-art results on eight GLUE tasks and pushing the benchmark to 82.2%, which amounts to a 1.8% absolute improvement over BERTLARGE. Since MT-DNN uses BERTLARGE for its shared layers, the gain is solely attributed to the use of MTL in fine-tuning. MTL is particularly useful for tasks with little in-domain training data. As we observe in the table, on the same type of tasks, the improvements over BERT are much more substantial for the tasks with less in-domain training data, e.g., the two NLI tasks (RTE vs. MNLI) and the two paraphrase tasks (MRPC vs. QQP).

The gain of MT-DNN is also attributed to its flexible modeling framework which allows us to incorporate the task-specific model structures and training methods which have been developed in the single-task setting, effectively leveraging the existing body of research.

Two such examples are the use of the SAN answer module for the pairwise text classification output module, and the use of the pairwise ranking loss for the QNLI task, which by design is a binary classification problem in GLUE. To investigate the relative contributions of these two modeling design choices, we implement different versions of MT-DNN and compare their performance on the development sets. The results are shown in Table 3.

  • BERTBASE is the base BERT model released by the authors, which we used as a baseline. We fine-tuned the model for each single task.

  • MT-DNN is the proposed model described in Section 3 using the pre-trained BERTBASE as its shared layers. We then fine-tuned the model using MTL on all GLUE tasks. Comparing MT-DNN vs. BERTBASE, we see that the results on dev sets are consistent with the GLUE test results in Table 2.

  • ST-DNN, standing for Single-Task DNN, uses the same model architecture as MT-DNN. But instead of fine-tuning one model for all tasks using MTL, we create multiple ST-DNNs, one for each task, using only its in-domain data for fine-tuning. Thus, for pairwise text classification tasks, the only difference between the ST-DNNs and the BERT models is the design of the task-specific output module. The results show that on three out of four tasks (MNLI, QQP and MRPC) ST-DNNs outperform their BERT counterparts, justifying the effectiveness of the SAN answer module. We also compare the results of ST-DNN and BERT on QNLI. While ST-DNN is fine-tuned using the pairwise ranking loss, BERT views QNLI as binary classification and is fine-tuned using the cross-entropy loss. That ST-DNN significantly outperforms BERT clearly demonstrates the importance of problem formulation.

4.4 SNLI and SciTail Results

In Table 4, we compare our adapted models, using all in-domain training samples, against several strong baselines including the best results reported in the leaderboards. We see that MT-DNN generates new state-of-the-art results on both datasets, pushing the benchmarks to 91.1% on SNLI (1.0% absolute improvement) and 94.1% on SciTail (5.8% absolute improvement), respectively.

Model Dev Test

SNLI Dataset (Accuracy%)
GPT (Radford et al., 2018) - 89.9
Kim et al. (2018) - 90.1
BERT 91.0 90.8
MT-DNN 91.4 91.1
SciTail Dataset (Accuracy%)
GPT (Radford et al., 2018) - 88.3
BERT 94.3 92.0
MT-DNN 95.8 94.1
Table 4: Results on the SNLI and SciTail datasets. Previous state-of-the-art results are obtained from the official SNLI leaderboard and the official SciTail leaderboard maintained by AI2. Both MT-DNN and BERT are fine-tuned based on the pre-trained BERT.

4.5 Domain Adaptation Results

Figure 2: Domain adaptation results on the SNLI and SciTail development datasets using the shared embeddings generated by MT-DNN and BERT, respectively. Both MT-DNN and BERT are fine-tuned based on the pre-trained BERT. The X-axis indicates the amount of domain-specific labeled samples used for adaptation.
Model 0.1% 1% 10% 100%
SNLI Dataset (Dev Accuracy%)
 #Training Data 549 5,493 54,936 549,367
 BERT 52.5 78.1 86.7 91.0
 MT-DNN 82.1 85.2 88.4 91.5
SciTail Dataset (Dev Accuracy%)
 #Training Data 23 235 2,359 23,596
 BERT 51.2 82.2 90.5 94.3
 MT-DNN 81.9 88.3 91.1 95.7
Table 5: Domain adaptation results on SNLI and SciTail, as shown in Figure 2.

One of the most important criteria for building practical systems is fast adaptation to new tasks and domains, because it is prohibitively expensive to collect labeled training data for new domains or tasks. Very often, we have only a small amount of training data, or even no training data at all.

To evaluate the models using the above criterion, we perform domain adaptation experiments on two NLI tasks, SNLI and SciTail, using the following procedure:

  1. fine-tune the MT-DNN model on eight GLUE tasks, excluding WNLI;

  2. create for each new task (SNLI or SciTail) a task-specific model, by adapting the trained MT-DNN using task-specific training data;

  3. evaluate the models using task-specific test data.

We denote the two task-specific models as MT-DNN. For comparison, we also apply the same adaptation procedure to the pre-trained BERT model, creating two task-specific BERT models for SNLI and SciTail, respectively, denoted as BERT.

We split the training data of SNLI and SciTail, randomly sampling 0.1%, 1%, 10% and 100% of each training set. As a result, we obtain four sets of training data for SciTail, which include 23, 235, 2.3k and 23.5k training samples, respectively. Similarly, we obtain four sets of training data for SNLI, which include 549, 5.5k, 54.9k and 549.3k training samples, respectively.
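This sampling procedure can be sketched as below (the function name is illustrative; whether the paper's subsets were nested prefixes of one shuffle, as here, is an assumption of the sketch):

```python
import random

def adaptation_splits(train_set, fractions=(0.001, 0.01, 0.1, 1.0), seed=0):
    # Randomly sample increasing fractions (0.1%, 1%, 10%, 100%) of the
    # training data; taking prefixes of one shuffle makes the splits nested.
    rng = random.Random(seed)
    shuffled = train_set[:]
    rng.shuffle(shuffled)
    return {f: shuffled[:max(1, int(len(shuffled) * f))] for f in fractions}
```

With 23,596 SciTail examples, the 0.1% split contains 23 samples, matching the counts reported in Table 5.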

Results on different amounts of training data of SNLI and SciTail are reported in Figure 2 and Table 5. We observe that our model, pre-trained on GLUE via multi-task learning, outperforms the BERT baseline consistently. The fewer the training data used, the larger the improvement MT-DNN demonstrates over BERT. For example, with only 0.1% (549 samples) of the SNLI training data, MT-DNN achieves 82.1% accuracy while BERT's accuracy is 52.5%; with 1% of the training data, the accuracy of our model is 85.2% while BERT's is 78.1%. We observe similar results on SciTail. The results indicate that the representations learned by MT-DNN are more effective for domain adaptation than those of BERT.

5 Conclusion

In this work we proposed a model called MT-DNN to combine multi-task learning and language model pre-training for language representation learning. MT-DNN obtains new state-of-the-art results on ten NLU tasks across three popular benchmarks: SNLI, SciTail, and GLUE. MT-DNN also demonstrates an exceptional generalization capability in domain adaptation experiments.

There are many future areas to explore to improve MT-DNN, including a deeper understanding of model structure sharing in MTL, a more effective training method that leverages relatedness among multiple tasks, and ways of incorporating the linguistic structure of text in a more explicit and controllable manner.

6 Acknowledgements

We would like to thank Jade Huang from Microsoft for her generous help on this work.


References

  • Bowman et al. (2015a) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.
  • Bowman et al. (2015b) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015b. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96. ACM.
  • Caruana (1997) Rich Caruana. 1997. Multitask learning. Machine learning, 28(1):41–75.
  • Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.
  • Chen et al. (2018) Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dolan and Brockett (2005) William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
  • Gao et al. (2018) J. Gao, M. Galley, and L. Li. 2018. Neural approaches to conversational AI. CoRR, abs/1809.08267.
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 2333–2338. ACM.
  • Khot et al. (2018) Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.
  • Kim et al. (2018) Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Levesque et al. (2012) Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
  • Liu et al. (2018a) Xiaodong Liu, Kevin Duh, and Jianfeng Gao. 2018a. Stochastic answer networks for natural language inference. arXiv preprint arXiv:1804.07888.
  • Liu et al. (2015) Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 912–921.
  • Liu et al. (2018b) Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018b. Stochastic answer networks for machine reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
  • Luong et al. (2015) Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
  • Matthews (1975) Brian W Matthews. 1975. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442–451.
  • Nangia et al. (2017) N. Nangia, A. Williams, A. Lazaridou, and S. R. Bowman. 2017. The RepEval 2017 Shared Task: Multi-Genre Natural Language Inference with Sentence Representations. ArXiv e-prints.
  • Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • Warstadt et al. (2018) Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
  • Xu et al. (2018) Yichong Xu, Xiaodong Liu, Yelong Shen, Jingjing Liu, and Jianfeng Gao. 2018. Multi-task learning for machine reading comprehension. arXiv preprint arXiv:1809.06963.
  • Zhang and Yang (2017) Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114.