WaLDORf: Wasteless Language-model Distillation On Reading-comprehension


Transformer based Very Large Language Models (VLLMs) like BERT, XLNet and RoBERTa, have recently shown tremendous performance on a large variety of Natural Language Understanding (NLU) tasks. However, due to their size, these VLLMs are extremely resource intensive and cumbersome to deploy at production time. Several recent publications have looked into various ways to distil knowledge from a transformer based VLLM (most commonly BERT-Base) into a smaller model which can run much faster at inference time. Here, we propose a novel set of techniques which together produce a task-specific hybrid convolutional and transformer model, WaLDORf, that achieves state-of-the-art inference speed while still being more accurate than previous distilled models.

Natural Language Understanding, BERT, Model Distillation, Reading Comprehension

1. Introduction

The recent emergence of VLLMs(Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019; Radford et al., 2019; Shoeybi et al., 2019; Raffel et al., 2019) has made a wide variety of previously extremely difficult NLU tasks feasible. From an academic perspective, the leaderboards of a broad set of NLU benchmarks, such as the General Language Understanding Evaluation (GLUE(Wang et al., 2018) or Super-GLUE(Wang et al., 2019)) and the Stanford Question Answering Dataset (SQuAD(Rajpurkar et al., 2016; Rajpurkar et al., 2018)), are dominated by fine-tuned versions of such VLLMs, generally based on transformers(Vaswani et al., 2017) and the attention mechanism(Bahdanau et al., 2014; Luong et al., 2015; Britz et al., 2017; Graves et al., 2014; Xu et al., 2015; Cheng et al., 2016). In industry, real world problems are now being solved by putting these VLLMs into production. Recently, both Google and Microsoft have announced that they use some (possibly distilled) version of a VLLM in their search algorithms. In academia, a plethora of research has arisen not only towards advancing and augmenting these state-of-the-art VLLMs(Peters et al., 2019; Conneau and Lample, 2019; Lample et al., 2019; Sun et al., 2019b; Dai et al., 2019; Joshi et al., 2019) but also towards studying their inner workings(Wallace et al., 2019a, b; Si et al., 2019; Wallace et al., 2019c; Tenney et al., 2019; Michel et al., 2019; Clark et al., 2019; Arkhangelskaia and Dutta, 2019; Hewitt and Manning, 2019; Niven and Kao, 2019), a field sometimes called "BERTology".

Looking over the entire NLU landscape, it has become increasingly apparent that deploying a (fine-tuned) VLLM into production can generate a lot of value. However, deploying a VLLM into production can be prohibitively expensive: VLLMs, being very large, require both a lot of memory and a lot of computation for inference. Therefore, there has been much recent work on distilling knowledge from a VLLM into a much smaller, faster model(Sun et al., 2019a; Jiao et al., 2019; Tang et al., 2019; Sanh et al., 2019). Our work continues that line of research by examining several novel techniques which, together with previously published techniques, produce a leaner, faster, but still highly accurate distilled model. This distilled model, in turn, can produce massive cost savings for production environments. In concrete terms, while maintaining Exact Match (EM) and F1 scores on SQuAD v2.0 higher than those of TinyBERT, 4-layer Patient Knowledge Distillation of BERT (BERT-PKD), and 4-layer DistilBERT, we have achieved 1.24x, 3.1x, and 3.1x speed ups over those models respectively (4-layer BERT-PKD and 4-layer DistilBERT have the same model architecture and so their inference speed is the same). Compared with BERT-Base and BERT-Large, we achieved 9.13x and 28.63x speed ups respectively.

For this work, we are looking into task-specific distillation for the SQuAD v2.0 task. That is, we do not perform model distillation on the pre-training phase of BERT. We chose to look more deeply into task-specific distillation so that we could explore task-specific data augmentation and its effects on model distillation in more detail. We chose the SQuAD v2.0 task specifically because it is relatively difficult and flexible: recent advances in NLU have proposed framing a wide variety of NLU tasks in the Question-Answering format(Raffel et al., 2019; McCann et al., 2018).

The major contributions (which we will describe in detail in section 3) of our work are:

  • With inspiration from convolutional auto encoders, we use convolutions, max pooling, and up sampling to shrink the sequence length of inputs to the transformer encoder blocks and then expand the outputs of those encoder blocks back to the original sequence length. (Section 3.2)

  • We modify the hidden-state distillation and attention-weight distillation by using average pooling and max pooling respectively to align the sequence dimension of our model with the teacher model. (Section 3.3)

  • We modify the patient knowledge distillation procedure to build layers up from the bottom one layer at a time to reduce the covariate shift experienced by each encoder block. (Section 3.3)

  • We fine-tune a pre-trained sequence-to-sequence VLLM to produce high quality questions from given contexts to perform task specific data augmentation. (Section 3.4).

The rest of our paper is organized as follows: in section 2 we discuss some related work and models from which we drew inspiration; in section 3 we describe in detail our methodology, including model architecture, model distillation procedure, and data augmentation; in section 4 we present all of our results, including results from our ablation study; in section 5 we discuss our findings in detail; and finally in section 6 we present a few concluding remarks and directions for future work. We also present some details to reproduce our results in the appendix, section A.

2. Related Work

Model compression has been explored in the past in several parallel paths. There is a whole field of study devoted to weight quantization, weight pruning, and weight encoding to reduce the size of large neural networks - often achieving impressive results(Han et al., 2015a, b; Lee and Kim, 2018; Zhao et al., 2019; Lin et al., 2016; Cheng et al., 2017; LeCun et al., 1990; Frankle and Carbin, 2018; Liu et al., 2018; Zhou et al., 2019). Recently, there have been efforts to apply weight quantization and pruning to transformer models as well(Voita et al., 2019; Prato et al., 2019; Cheong and Daniel, 2019). For this work, we instead focus on the parallel path(s) of exploration into compressing large transformers using architectural changes or model distillation.

The recently released ALBERT(Lan et al., 2019) approached model compression from the point of view of making architectural and training objective changes to the transformer model. The authors of ALBERT use the following three main techniques: 1) Factorized embedding parameterization to greatly reduce the number of weights in the embedding layer, 2) Weight sharing among all encoder blocks of the transformer, greatly reducing the number of weights in the encoder stack, and 3) Using an inter-sentence coherence loss instead of a next sentence prediction loss during BERT-like pre-training. With these architectural and training modifications, ALBERT produces impressive results and achieved state-of-the-art accuracy, for their xx-large model, on a wide variety of NLU tasks. ALBERT-base also achieves quite high accuracy while reducing the number of parameters to 12 million (compared with 110 million for BERT-Base and 335 million for BERT-Large). However, because data still must be processed through every layer of ALBERT, even though the weights are shared between all the layers, the inference speed improvement achieved by ALBERT is not so striking. ALBERT-Base achieves 21.1x speed up from an extra large version of BERT built by the authors of ALBERT (BERT-XLarge) which is itself 17.7x slower than BERT-Base. This means that ALBERT-Base achieves only roughly 1.2x speed up from BERT-Base itself. For our purposes, we wanted to look at much faster architectures. Therefore, we only incorporated the factorized embedding parameterization idea from ALBERT. Other possibly faster architectures have of course also been explored for language modeling(Merity, 2019).

Model distillation was first applied to distilling ensemble model knowledge into a single model(Buciluǎ et al., 2006) and then expanded to other domains(Hinton et al., 2015). In this vein, we point out three recent efforts at model distillation of BERT: DistilBERT(Sanh et al., 2019), BERT-PKD(Sun et al., 2019a), and TinyBERT(Jiao et al., 2019). DistilBERT took inspiration from model distillation as put forth by Hinton et al. and distilled knowledge from the softmax outputs produced by BERT-Base into a 6 encoder block (also called 6-layer) version of BERT. DistilBERT has half as many encoder blocks as BERT-Base but is otherwise identical to BERT-Base (in terms of hidden dimension size, the number of attention heads, etc.). BERT-PKD goes further and distils the hidden state activations, layer-by-layer, from BERT-Base (and in one experiment, BERT-Large) into 3 and 6 encoder block versions of BERT-Base. The authors of BERT-PKD show that patient knowledge distillation generally outperforms last-layer distillation. Finally, TinyBERT is a 4 encoder block version of BERT-Base that goes even further and includes distillation objectives for the attention scores and the embedding layer. In addition, TinyBERT shrinks the hidden dimension sizes of BERT-Base to produce a much smaller and much faster model than either DistilBERT or BERT-PKD. The accuracy on SQuAD v2.0 attained by TinyBERT also exceeds that attained by DistilBERT and BERT-PKD, for a given number of encoder blocks, so for our work we generally benchmark against TinyBERT. We take inspiration from TinyBERT and perform a version of their layer-by-layer patient knowledge distillation. However, unlike TinyBERT, we also included some architectural changes that necessitated changes to the knowledge distillation procedure. Furthermore, we did not reproduce the language modeling pre-training used by BERT and all these previously mentioned distillations. Instead, we relied purely on data augmentation to achieve state-of-the-art results.

3. Methodology

3.1. Problem Statement

For this work, we are working specifically in the framework of question-answering as posed by the SQuAD v2.0 dataset. To perform question-answering, we are given a question and a context and we must find the concrete answer to the question. More specifically, SQuAD v2.0(Rajpurkar et al., 2018) is an extractive, text-based, question-answering dataset which tasks us to find spans of text within the context which answer the given question or, if the question is unanswerable given the context, to return a null result. Formally, given a tuple $(q, c)$, where $q$ is the question and $c$ is the context consisting of a passage of text, find $a$, which is a span in $c$, that answers the question, or return a null result if $q$ is not answerable given $c$. For training, we are given tuples of training examples $(q, c, a)$, while for evaluation, we are given the tuple $(q, c)$ and a set of ground truth answers $A$ that are variants of the answer for the given question.

We use two primary metrics to evaluate our model for accuracy: the EM and F1 metrics. The EM score is the number of examples for which our model's answer, $\hat{a}$, exactly matched any of the ground truth answers in the set $A$, divided by the total number of evaluation examples. The F1 score averages, over all the evaluation examples, the maximum overlap between $\hat{a}$ and the answers in $A$. These are the same two metrics used by the SQuAD v2.0 dataset itself.
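As a rough illustration of the two metrics, the following sketch computes EM and token-level F1 for a list of predictions. It is a simplification and not the official SQuAD v2.0 evaluation script: the function names are hypothetical, tokens are split on whitespace only, and the answer normalization applied by the official script (lower-casing, stripping punctuation and articles) is omitted. The empty string is assumed to stand for the null answer.

```python
from collections import Counter


def exact_match(prediction, ground_truths):
    """Return 1.0 if the prediction equals any reference answer, else 0.0."""
    return float(any(prediction == gt for gt in ground_truths))


def token_f1(prediction, ground_truth):
    """Token-level F1 between a predicted answer and one reference answer."""
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    # Assumption: the empty string represents the null (unanswerable) result.
    if not pred_tokens or not gt_tokens:
        return float(pred_tokens == gt_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """Average EM and best-reference F1 over all examples, as percentages."""
    n = len(predictions)
    pairs = list(zip(predictions, references))
    em = sum(exact_match(p, refs) for p, refs in pairs) / n
    f1 = sum(max(token_f1(p, r) for r in refs) for p, refs in pairs) / n
    return 100.0 * em, 100.0 * f1
```

Here `references` holds, for each example, the list of acceptable ground truth answers, so the F1 for an example is the best overlap against any of them.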

3.2. Model Architecture

Figure 1. A visualization of WaLDORf’s architecture. There are 8 transformer encoder blocks sandwiched between 1D convolutional layers along the sequence length direction. All convolutional layers use padding to preserve sequence length. Conv_1 and Conv_2 layers are paired with maxpooling to reduce sequence length by a total factor of 4. Conv_3 and Conv_4 layers are paired with up sampling to restore the sequence length to its original size. Skip connections are used after up sampling layers to skip past convolutional layers but before layer normalization.

WaLDORf is a hybrid model that incorporates convolutional layers into the transformer encoder stack architecture of BERT(Devlin et al., 2018); see figure 1 for an illustration. Taking inspiration from ALBERT, we chose the embedding dimension to be smaller than the internal hidden dimension used in the encoder blocks(Lan et al., 2019). We use the same word piece tokenization scheme as BERT (which is itself a sub word tokenization scheme similar to Byte Pair Encoding(Gage, 1994)) in order to allow distillation of the embedding layer as described in section 3.3. Before feeding data into the 8 encoder blocks, we use a time-distributed feed-forward layer to expand the dimensions of the hidden representation to equal that of the encoder blocks. The major architectural insight we bring forth with WaLDORf is to use convolutional layers to shrink and then re-expand the input sequence dimension. The self-attention mechanism employed by the transformer encoder blocks has complexity which is quadratic in the input sequence length(Vaswani et al., 2017). By using two 1D convolutions (along the sequence length dimension) coupled with two max-pooling layers, we shrink the input sequence lengths to the encoder blocks by a factor of 4. In turn, this means each self-attention mechanism needs to do only 1/16th the amount of work. We then use two more 1D convolutions (since there are no transposed convolutions in 1 dimension) coupled with two up sampling layers to re-expand the sequence length dimension back to its original length. By greatly reducing the amount of work carried out by the self-attention mechanism, we are able to make the encoder blocks run much faster so that we can have a deeper model without sacrificing speed. The trade-off is, of course, that the model must learn to encode the input information in a "smeared out" way, with roughly every 4 word-pieces getting 1 internal representation. This trade-off may affect different tasks differently depending on how sensitive a certain task is to specific word-pieces. However, we found that even for the SQuAD v2.0 task, which intuitively should be relatively sensitive to specific words or word-pieces, since the task is specifically about finding spans within the text, we were still able to achieve good accuracy with our model.
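The sequence-length bookkeeping described above can be sketched in a few lines of plain Python. This is only an illustration of the shrink and re-expand steps under simplifying assumptions: the learned convolutions are left out entirely, up sampling is assumed to be nearest-neighbour repetition, and the tiny hidden size of 4 and all function names are hypothetical.

```python
def max_pool_1d(seq, size=2):
    """Non-overlapping max pooling along the sequence axis (stride == size)."""
    return [
        [max(col) for col in zip(*seq[i:i + size])]
        for i in range(0, len(seq) - size + 1, size)
    ]


def up_sample_1d(seq, factor=2):
    """Assumed nearest-neighbour up sampling: repeat each position `factor` times."""
    return [list(vec) for vec in seq for _ in range(factor)]


def attention_score_count(seq_len):
    """Number of query-key pairs a single attention head must score."""
    return seq_len * seq_len


seq_len, hidden = 384, 4  # 384 word-pieces; hidden size is illustrative only
inputs = [[float(t + c) for c in range(hidden)] for t in range(seq_len)]

shrunk = max_pool_1d(max_pool_1d(inputs))      # 384 -> 192 -> 96
restored = up_sample_1d(up_sample_1d(shrunk))  # 96 -> 192 -> 384

print(len(shrunk), len(restored))
print(attention_score_count(len(inputs)) // attention_score_count(len(shrunk)))
```

Two pooling steps of size 2 give the overall factor of 4, and because the number of query-key pairs grows with the square of the sequence length, the last line reports the factor of 16 quoted in the text.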

The 8 encoder blocks used in our model are the standard transformer encoder blocks used by BERT which have been scaled down to increase inference speed. A detailed list of architecture hyperparameters can be seen in table 1. The self-attention mechanisms in our model use 16 heads of attention like BERT-Large instead of 12 heads of attention like BERT-Base because we chose to use BERT-Large as our teacher model. In total, our model has a bit less than a quarter of the number of parameters of BERT-Base.

Architecture Hyperparameters
# Encoder Blocks
# Encoder Convolutional Layers
# Decoder Convolutional Layers
Embedding Size
Hidden Size
Feed-Forward Size
# Attention Heads
# Total Parameters
Table 1. hyperparameters for WaLDORf’s architecture

3.3. Training Objectives and Procedure

To train our model, we used a modified version of the patient knowledge distillation procedure. We have 5 main training losses:

  • The distillation loss over the embedding layer: $\mathcal{L}_{emb}$.

  • The distillation loss over the hidden state representations inside the 8 encoder blocks: $\mathcal{L}_{hid}$.

  • The distillation loss over the attention scores used to calculate the attention weights inside the 8 self-attention mechanisms: $\mathcal{L}_{att}$.

  • The distillation loss over the final softmax layer, which corresponds to the vanilla model distillation loss introduced by Hinton et al.: $\mathcal{L}_{soft}$.

  • The ground truth loss (for data for which we have the ground truth): $\mathcal{L}_{gt}$.

The student model (our model) learns from a teacher model, which we chose to be BERT-Large Whole-Word-Masking (BERT-Large WWM), a variant of BERT-Large trained with a masked token objective in which whole words, rather than individual word pieces as in the original, are masked. The first 4 losses mentioned, $\mathcal{L}_{emb}$, $\mathcal{L}_{hid}$, $\mathcal{L}_{att}$, and $\mathcal{L}_{soft}$, are all losses for the student model with respect to the teacher model. The final loss, $\mathcal{L}_{gt}$, is the loss with respect to the ground truth and would be the only loss available to us if we did not perform model distillation. As we will show in section 4.3, if we use only $\mathcal{L}_{gt}$ to train our model, our results are extremely poor.

The embedding loss, $\mathcal{L}_{emb}$, is a simple mean squared error loss that measures how different the embedding representations of the student model are from those of the teacher model. Given $E_i^S \in \mathbb{R}^{l \times d_e^S}$, where $E_i^S$ are the student model embeddings for the $i$-th text sequence, $l$ is the input sequence length, and $d_e^S$ is the embedding size of the student model, and $E_i^T \in \mathbb{R}^{l \times d_e^T}$, where $E_i^T$ are the teacher model embeddings for the $i$-th text sequence, and $d_e^T$ is the teacher model embedding size, then:

$$\mathcal{L}_{emb} = \frac{1}{B} \sum_{i=1}^{B} \mathrm{MSE}\left(E_i^S W_e,\; E_i^T\right)$$

Where MSE is the mean-squared error, $B$ is the batch size, and $W_e \in \mathbb{R}^{d_e^S \times d_e^T}$ is a learned tensor that projects the student embeddings to the same dimension size as the teacher embeddings.

Both the hidden loss, $\mathcal{L}_{hid}$, and the attention scores loss, $\mathcal{L}_{att}$, are mean squared error losses that try to match the hidden states and attention scores within the 8 encoder blocks of the student model to those of 8 encoder blocks of the teacher model. We choose to follow the PKD-skip scheme(Sun et al., 2019a) and use every third encoder block in the teacher model for distillation. Just like for the embedding loss, we also have to project the student model hidden states to the same dimension size as the teacher model's. In addition, we also use an average pooling over the teacher model's hidden states along the sequence length dimension to shrink the sequence length of the teacher model to match the sequence length inside the student model's encoder blocks. Because WaLDORf uses convolutions to shrink the sequence length used in the encoder blocks, we can't match hidden states, word for word, between the two models. If we think of hidden states analogously to word vectors, then we are essentially asking WaLDORf to approximate the average of 4 word vectors with each of its hidden states. This procedure is perhaps the simplest way to make the dimensions of our student model align with the targets given by the teacher model. Empirically, this procedure gives us good results. Thus, given $H_{i,j}^S \in \mathbb{R}^{(l/4) \times d_h^S}$, where $H_{i,j}^S$ are the hidden state outputs for the $i$-th text sequence of the $j$-th encoder block of the student model, and $d_h^S$ is the hidden dimension size of the student model, and $H_{i,j}^T \in \mathbb{R}^{l \times d_h^T}$, where $H_{i,j}^T$ are the hidden state outputs for the $i$-th text sequence of the teacher model encoder block paired with student block $j$, and $d_h^T$ is the hidden dimension size of the teacher model, then:

$$\mathcal{L}_{hid}^{j} = \frac{1}{B} \sum_{i=1}^{B} \mathrm{MSE}\left(H_{i,j}^S W_h,\; \mathrm{AVG}\left(H_{i,j}^T\right)\right)$$

Where AVG denotes a 1-D average pooling with size 4 and stride 4 along the sequence length dimension:

$$\mathrm{AVG}(H)_{k} = \frac{1}{4} \sum_{m=0}^{3} H_{4k+m}, \qquad k = 0, \ldots, l/4 - 1$$

$\mathcal{L}_{hid}^{j}$ is the hidden loss of the $j$-th encoder block, and $W_h \in \mathbb{R}^{d_h^S \times d_h^T}$ is a learned matrix that projects student model hidden dimension sizes to those of the teacher model. Note that we use the same $W_h$ for every encoder block because we want the distilled learning to be learned by WaLDORf itself and not by these projection matrices, which will be discarded at inference time.
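A minimal plain-Python sketch of the hidden state loss for a single sequence and a single encoder block is given below. It assumes the pooling window and stride are both 4, represents matrices as nested lists, and uses hypothetical function names; a real implementation would use a tensor library and average over the batch.

```python
def avg_pool_seq(hidden, size=4):
    """1-D average pooling over the sequence axis with stride equal to size."""
    return [
        [sum(col) / size for col in zip(*hidden[i:i + size])]
        for i in range(0, len(hidden) - size + 1, size)
    ]


def project(hidden, weight):
    """Multiply an (l x d_in) hidden-state matrix by a (d_in x d_out) matrix."""
    return [
        [sum(h * w for h, w in zip(row, col)) for col in zip(*weight)]
        for row in hidden
    ]


def mse(a, b):
    """Mean squared error between two equally shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def hidden_loss(student_hidden, teacher_hidden, w_h):
    """MSE between projected student states and average-pooled teacher states."""
    return mse(project(student_hidden, w_h), avg_pool_seq(teacher_hidden))
```

For a teacher sequence of length 8 and a student sequence of length 2, `avg_pool_seq` reduces the teacher states to 2 rows, so both arguments to `mse` have the same shape once the student states are projected to the teacher's hidden size.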

For the attention score loss, instead of using average pooling to make sequence dimensions match, we switch to using max pooling. We decided to use max pooling here because we want the student model to learn the strongest attention scores rather than an average of attention scores. Hence, given $A_{i,j}^S \in \mathbb{R}^{h \times (l/4) \times (l/4)}$, where $A_{i,j}^S$ are the student model attention scores for the $i$-th text sequence of the $j$-th encoder block and $h$ is the number of attention heads, and $A_{i,j}^T \in \mathbb{R}^{h \times l \times l}$, where $A_{i,j}^T$ are the teacher model attention scores for the $i$-th text sequence of the teacher encoder block paired with student block $j$, then:

$$\mathcal{L}_{att}^{j} = \frac{1}{B} \sum_{i=1}^{B} \mathrm{MSE}\left(A_{i,j}^S,\; \mathrm{MAX}\left(A_{i,j}^T\right)\right)$$

Where MAX denotes a 2-D max pooling operation with size $4 \times 4$ and stride $4 \times 4$:

$$\mathrm{MAX}(A)_{n,p,q} = \max_{0 \le r, s \le 3} A_{n,\,4p+r,\,4q+s}$$

and $\mathcal{L}_{att}^{j}$ is the attention score loss of the $j$-th encoder block. Note that here we are operating on attention scores and not the attention weights. The attention scores have not had a softmax applied to them and so do not necessarily add up to 1; hence we are free to perform a simple max pooling without destroying any normalization, since the softmax will be applied afterwards. One detail that should be noted is that proper masking needs to be maintained on the attention score loss so that the system does not try to learn minor changes to the masking value.
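The max pooling and masking can be illustrated for a single attention head as follows. This is a sketch under stated assumptions rather than the exact procedure used for WaLDORf: the pooling window is assumed to be 4 by 4, and the masking rule (a pooled position is kept only if its window contains at least one real, non-padding token on both the query and key side) is one plausible way to keep padded entries out of the loss. All names are hypothetical.

```python
def max_pool_2d(scores, size=4):
    """2-D max pooling over a square (l x l) score matrix, stride == size."""
    n = len(scores) // size
    return [
        [
            max(scores[size * p + r][size * q + s]
                for r in range(size) for s in range(size))
            for q in range(n)
        ]
        for p in range(n)
    ]


def pool_mask(is_real, size=4):
    """A pooled position counts as real if any token in its window is real."""
    return [any(is_real[i:i + size])
            for i in range(0, len(is_real) - size + 1, size)]


def attention_loss(student_scores, teacher_scores, is_real, size=4):
    """Masked MSE between student scores and max-pooled teacher scores."""
    target = max_pool_2d(teacher_scores, size)
    keep = pool_mask(is_real, size)
    sq_errors = [
        (student_scores[p][q] - target[p][q]) ** 2
        for p in range(len(target)) for q in range(len(target))
        if keep[p] and keep[q]  # skip windows made up entirely of padding
    ]
    return sum(sq_errors) / max(len(sq_errors), 1)
```

Excluding fully padded windows means the student is never asked to regress towards the large negative masking value that the teacher adds to padded positions before its softmax.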

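The distillation loss over the final softmax layer closely follows the distillation loss introduced in Hinton et al.(Hinton et al., 2015):

$$\mathcal{L}_{soft} = -\frac{\tau^2}{B} \sum_{i=1}^{B} \mathrm{softmax}\left(z_i^T / \tau\right) \cdot \log \mathrm{softmax}\left(z_i^S / \tau\right)$$

Where $\tau$ is a temperature hyperparameter and $z_i^T$, $z_i^S$ are the teacher and student logits for the $i$-th text sequence respectively. We multiply the loss by the factor of $\tau^2$ to ensure that the gradients are of the correct scale.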

Lastly, the ground truth loss is the standard cross entropy loss with the ground truth labels:

$$\mathcal{L}_{gt} = -\frac{1}{B} \sum_{i=1}^{B} y_i \cdot \log \mathrm{softmax}\left(z_i^S\right)$$

Where $y_i$ is the (one-hot) ground truth label for the $i$-th text sequence and $z_i^S$ are the student logits for that sequence. This ground truth loss is only available for data for which we have the ground truth labels. When we perform data augmentation, as discussed in the next section, we will no longer have access to ground truth labels and so this loss will not apply. Note that for both the ground truth loss and the distillation loss over the final softmax, there are two sets of logits (and labels), one for the start position and one for the end position. The softmax functions are taken over the sequence length dimension.
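The two output-layer losses can be sketched for a single sequence and a single set of logits (start or end) as below. The sketch assumes the soft-target loss is the cross entropy between temperature-softened teacher and student distributions, scaled by the square of the temperature as in Hinton et al.; the function names are hypothetical and batching is omitted.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the sequence-length dimension."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross entropy H(target, predicted) for one sequence."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0.0)


def soft_target_loss(teacher_logits, student_logits, temperature):
    """Soft-target cross entropy, multiplied by the temperature squared."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return temperature ** 2 * cross_entropy(teacher, student)


def ground_truth_loss(student_logits, gold_position):
    """Standard cross entropy against a one-hot start (or end) position."""
    return -math.log(softmax(student_logits)[gold_position])
```

In a full implementation each function would be applied once to the start-position logits and once to the end-position logits, with the results combined and averaged over the batch.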

Figure 2 shows our version of the patient knowledge distillation procedure in detail.

Figure 2. The patient knowledge distillation procedure used for WaLDORf. The left-side model is the student and the right-side model is the teacher. Attention scores and hidden states are distilled from every 3rd teacher encoder block.

To train our model, we found it was very beneficial to train each layer from the bottom up (or left to right in figure 1) so that we start with training only using the embedding loss, then combine that loss with the hidden state and attention score losses for layer 0, then for layer 1, etc. After each intermediate layer is trained for some time, we then add the final softmax layer distillation loss and the ground truth loss. This training procedure makes sense intuitively because it should reduce the internal covariate shift experienced by each layer during training (the same goal as for batch normalization(Ioffe and Szegedy, 2015)). Since we have targets available for each layer in the student model, with the exception of the convolutional layers, we can use those intermediate targets to ensure that the inputs to each layer have stabilized somewhat by the time that layer is trained. Thus, our final loss function is:

$$\mathcal{L} = \alpha \mathcal{L}_{emb} + \beta \mathcal{L}_{hid} + \gamma \mathcal{L}_{att} + \Theta(t - 9n)\left(\delta \mathcal{L}_{soft} + \epsilon \mathcal{L}_{gt}\right)$$

Where $\alpha, \beta, \gamma, \delta, \epsilon$ are real number hyperparameters and:

$$\mathcal{L}_{hid} = \sum_{j=0}^{7} \Theta\left(t - (j+1)\,n\right) \mathcal{L}_{hid}^{j}, \qquad \mathcal{L}_{att} = \sum_{j=0}^{7} \Theta\left(t - (j+1)\,n\right) \mathcal{L}_{att}^{j}$$

Where $\Theta$ is the Heaviside step function, $t$ is the current global training step and $n$ is an integer hyperparameter specifying how many training steps to take for each layer. We will show in our ablation study in section 4.3 that this layer-by-layer building up of the loss function significantly improves accuracy.
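The layer-by-layer schedule can be sketched as a set of on/off gates driven by the global training step. The exact step offsets are an assumption made for illustration: the embedding loss is taken to be active from step 0, encoder block j is switched on after (j + 1) times the per-layer step count, and the softmax and ground truth losses are switched on once all 8 blocks are active. All names are hypothetical.

```python
def heaviside(x):
    """Heaviside step function, using the convention H(0) = 1."""
    return 1.0 if x >= 0 else 0.0


def active_losses(step, steps_per_layer, num_blocks=8):
    """Return which loss terms are switched on at a given global step."""
    gates = {"embedding": 1.0}
    for j in range(num_blocks):
        # Block j joins after the embedding phase and the j earlier blocks.
        gates[f"block_{j}"] = heaviside(step - (j + 1) * steps_per_layer)
    # Softmax-distillation and ground-truth losses join last.
    gates["output"] = heaviside(step - (num_blocks + 1) * steps_per_layer)
    return gates


def total_loss(step, steps_per_layer, emb, hid, att, soft, gt,
               weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted, gated sum of all losses; hid and att are per-block lists."""
    alpha, beta, gamma, delta, epsilon = weights
    gates = active_losses(step, steps_per_layer, num_blocks=len(hid))
    loss = alpha * emb
    for j, (h, a) in enumerate(zip(hid, att)):
        loss += gates[f"block_{j}"] * (beta * h + gamma * a)
    loss += gates["output"] * (delta * soft + epsilon * gt)
    return loss
```

With 100 steps per layer, for example, a call at step 250 includes the embedding loss and the losses of blocks 0 and 1 only, while a call at step 900 or later includes every term.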

3.4. Data Augmentation

A large part of why VLLMs perform so well on downstream tasks is the massive amount of data that they pretrain on(Yang et al., 2019; Radford et al., 2019; Raffel et al., 2019). For distilled architectures, such as DistilBERT or BERT-PKD, which keep hidden dimensions all the same size as a teacher model (e.g. BERT-Base) and just use fewer encoder blocks, the student encoder blocks can be initialized with the weights from a subset of encoder blocks of a trained teacher model. For distilled architectures, such as TinyBERT or our model, which change the hidden dimension sizes as well as the number of encoder blocks, there is no obvious way to initialize the weights of the model with weights from the teacher model. Hence, such distilled architectures also often reproduce the pretraining phase of BERT before being fine-tuned (with or without additional distillation) on downstream tasks. Here, we are focused on task-specific distillation and so we do not wish to reproduce the language modeling pretraining phase. However, not reproducing the pretraining phase would put our model at a huge disadvantage simply due to the much smaller coverage of input data that our model would be trained on. Hence, to help offset the imbalance in data volume, we chose to perform extensive data augmentation. It should be noted though that even with data augmentation, the final volume of data that we train on, as measured by total token count, is still much smaller than the pretraining data used to train BERT.

To perform data augmentation, we sourced 500,000 additional paragraphs, from 280,250 English Wikipedia articles which did not correspond to any paragraphs or articles found in the SQuAD v2.0 dataset (training or dev). For each paragraph, we automatically generated a question using a fine-tuned Text-to-Text Transfer Transformer, T5-Large(Raffel et al., 2019). We fine-tuned T5-Large by using question-context pairs from the SQuAD v2.0 training dataset itself. Our empirical experiments will show (in section 4) that this data augmentation increased the end-to-end accuracy by a large margin. For comparison, the SQuAD v2.0 training dataset has roughly 130,000 questions on 20,000 paragraphs sourced from 500 articles.

4. Results

Overall, WaLDORf achieved significant speed up over previous state-of-the-art distilled architectures while maintaining a higher level of accuracy on SQuAD v2.0.

4.1. Inference Speed

Inference Speed Results
Model         Speed up
BERT-Base     1x
BERT-Large    0.3x
BERT-6        2x
BERT-4        2.9x
BERT-2        5.8x
TinyBERT      7.4x
WaLDORf       9.1x
Table 2. Results obtained from testing for inference speed. Speed up is measured relative to BERT-Base, based on the time to perform inference once over the SQuAD v2.0 cross validation set averaged over many trials. See section A.1 for details on how the testing was conducted.

Detailed results of our inference speed testing are presented in table 2. As clearly shown, our model is the fastest model to perform inference over the SQuAD v2.0 cross validation set by a wide margin. Because of our use of convolutional layers, we were able to have a model which is deeper and has larger hidden dimensions while still being faster than TinyBERT. DistilBERT and BERT-PKD would both fall under the BERT-6 umbrella since, in terms of architecture, both are simply 6 encoder block versions of BERT-Base. BERT-PKD also has a 3 encoder block version, but we can see that our model is much faster than even a 2 encoder block version of BERT-Base.

4.2. Accuracy Performance

SQuAD v2.0 Results
Model EM F1
WaLDORf 66.0 70.3
Table 3. WaLDORf’s accuracy performance on SQuAD v2.0 as compared with TinyBERT, BERT-PKD, and DistilBERT. BERT-PKD, DistilBERT and TinyBERT results are taken from TinyBERT paper. TinyBERT-6 has scaled up internal dimension sizes to match those of BERT-Base. BERT-Base and BERT-Large-WWM results are from models that we fine-tuned. We used BERT-Large-WWM as our teacher model.

Our model's performance on SQuAD v2.0 is benchmarked against other models in table 3. The BERT-PKD-4 and DistilBERT-4 models presented have internal dimensions identical to those of BERT-4 from table 2. Thus, our model is 3.1x faster than those models for inference while being significantly more accurate. WaLDORf is also somewhat more accurate than TinyBERT in both EM and F1 while still being 1.24x faster. To obtain EM and F1 scores higher than ours, one would have to go to the 6-layer versions of those models, compared to which our model is 4.5x faster. TinyBERT-6 also has internal dimension sizes scaled to be the same as BERT-Base and so is also 4.5x slower than WaLDORf.

4.3. Ablation Study

Ablation Study
EM F1 +EM +F1
+ Softmax Distil
+ All Layer Distil
+ Slow Build 11.8 12.5
+ Data Augment 66.0 70.3
Table 4. Ablation study for WaLDORf. The baseline model is trained on just ground truth labels. "Softmax Distil" is, in addition, trained on the softmax probabilities provided by the teacher model. "All Layer Distil" adds on the embedding layer, hidden state, and attention score losses as described in section 3.3. "Slow Build" builds each layer of the student model one-by-one. "Data Augment", as described in section 3.4, was the final technique we added and gave us our final results. +EM and +F1 denote improvements over the previous best model.

Results from our ablation study are provided in table 4. Each additional technique shown in the table was added on to the sum of the previously used techniques. As we can clearly see, each technique which we added produced significant improvements. With the addition of each new technique, we also explored a whole set of hyperparameters. Results are presented for the best set of hyperparameters found within a given framework. When all the techniques are combined, we get a very significant absolute improvement in both EM and F1 scores over the baseline.

The biggest incremental improvement in accuracy that we saw came from moving from distilling every layer all at once to slowly building up the layers from the bottom embedding layer up. This is likely because training every layer all at once introduces a significant covariate shift for every layer's inputs during training. Stabilizing the inputs to each layer by training layers in a bottom-up fashion appears to make optimization much easier.

5. Discussion

5.1. Convolutions and Autoencoding

The core architectural change that we bring forth in this work is the use of convolutions, max pooling, and up sampling to reduce and then re-expand the sequence length dimension. We introduced this architectural change with the hope that, analogous to convolutional autoencoders, WaLDORf will find an efficient way to compress the information contained within an example into a shorter sequence. By attaining accuracy performance higher than previous state-of-the-art, we have shown that it is in fact feasible for the model to compress information in this way. Information can be effectively encoded by the input convolutions, processed by the encoder blocks, and then decoded by the output convolutions. In addition, we also showed that by using max pooling and average pooling we are able to use only the signals given by the "important" attention weights and the averaged hidden states from the teacher model to make our student model perform strongly. These properties may be somewhat task dependent, but even for the SQuAD v2.0 task we were able to attain a high level of accuracy.

Due to the quadratic dependence of the complexity of the self-attention mechanism on the sequence length, our architecture will actually see larger relative gains in speed the longer the input sequence length is. We chose a sequence length of 384 for our tests because that is the sequence length used by BERT for the SQuAD data set. Other transformer-based VLLMs may use even longer sequence lengths in order to deal better with longer paragraphs, e.g. XLNet uses a max sequence length of 512 for SQuAD. For circumstances where the max input sequence length is even longer than 384, we expect to gain even larger relative speed ups.
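A back-of-the-envelope cost model makes the dependence on sequence length concrete. The sketch below is an approximation under stated assumptions, not a measurement: it counts only the multiply-adds of one standard encoder block (four attention projections and a feed-forward layer 4 times the hidden size, which scale linearly with sequence length, plus the score and weighted-sum products, which scale quadratically), it ignores the convolutions, embeddings, and memory effects, and the hidden size of 512 in the usage lines is purely illustrative.

```python
def block_cost(seq_len, hidden, ff_mult=4):
    """Approximate multiply-adds for one transformer encoder block."""
    # Q, K, V, and output projections plus the two feed-forward layers.
    linear = seq_len * hidden * hidden * (4 + 2 * ff_mult)
    # Attention scores (Q times K transposed) and the weighted sum over V.
    quadratic = 2 * seq_len * seq_len * hidden
    return linear + quadratic


def relative_gain(seq_len, hidden, shrink=4):
    """Cost ratio of a block at full length versus at seq_len / shrink."""
    return block_cost(seq_len, hidden) / block_cost(seq_len // shrink, hidden)


for length in (128, 384, 512, 2048):
    print(length, round(relative_gain(length, hidden=512), 2))
```

Under this model the ratio starts near 4 for short sequences, where the linear terms dominate, and approaches 16 as the quadratic attention term takes over, which is consistent with the expectation of larger relative speed ups at longer sequence lengths.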

5.2. Distillation

Model distillation is currently a very active area of research. Having access to a teacher model opens up the possibility of using a huge variety of techniques to improve a student model’s performance. The distillation techniques available often depend on the architecture of the student model and how similar it is to the teacher model. A model, like BERT-PKD or DistilBERT, that is essentially a copy of a teacher model with just fewer layers can straight-forwardly distil a subset of the layers in the teacher model. If one scales down the internal dimensions of the model, like in the case of TinyBERT, then one needs to introduce some way to project the student model’s hidden dimensions to those of the teacher model or vice versa. That projection operator introduces new degrees of freedom to the problem and the distillation procedure is no longer as straight forward. In our case, we had to also modify the sequence-length dimension of the teacher model’s signals in order to perform intermediate-layer distillation.

In the case that the student model is entirely different from the teacher model, e.g. when an LSTM-based student model tries to learn from a transformer-based teacher model as in Tang et al. (Tang et al., 2019), then perhaps the only easily distilled layer is the final layer, and no intermediate-layer distillation is feasible. However, we (and others) have shown that there are significant performance improvements to be had by also distilling the intermediate layers and operations (e.g. attention scores) of a VLLM. It is uncertain, at this point, how much of the performance improvement is due to the difference in the raw information contained within the training signals (softmax signals vs. intermediate-layer and operation signals) and how much is due to a difference in how much those signals help the student model to optimize. The concrete difference between these scenarios is the following. If the latter is true, then student models with architectures wildly different from a teacher model may still attain a high level of accuracy if the last-layer distillation is carried out with a very strong optimization scheme. If the former is true, then we may expect that there is no method by which we can optimize a student model using only the signals from the last layer as strongly as we could by using signals from all the layers.
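For reference, last-layer ("softmax signal") distillation is commonly implemented as a cross-entropy between temperature-softened teacher and student distributions. The sketch below shows that generic formulation; the function names and toy logits are illustrative assumptions, and the exact loss used for any given model may differ.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    return scaled - np.log(np.exp(scaled).sum(axis=-1, keepdims=True))

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student under the teacher's softened distribution."""
    teacher_probs = np.exp(log_softmax(teacher_logits, temperature))
    student_log_probs = log_softmax(student_logits, temperature)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

# Toy logits over five positions; values are illustrative only.
teacher_logits = np.array([[2.0, 0.5, -1.0, 0.0, 1.0]])
student_logits = np.array([[0.0, 0.0, 0.0, 0.0, 0.0]])
print(soft_cross_entropy(student_logits, teacher_logits, temperature=2.0))
```

This loss only sees the teacher's output distribution, whereas intermediate-layer distillation adds further targets such as hidden states and attention scores.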

5.3. Data Augmentation

As we showed in section 4.3, data augmentation gives us a significant 8-point boost in EM and F1 scores over distillation using just the SQuAD v2.0 dataset itself. This result shows that even with roughly 130,000 question-context-answer tuples, SQuAD v2.0 is still not a large enough dataset to fully train our student model using pure model distillation. The set of targets that the student model had to hit, given only SQuAD v2.0 inputs, was already enormous. The embedding layer targets are of size , hidden layer targets are of size and the attention score targets are of size , where denotes the total number of examples. For a BERT-Large-WWM teacher model over the SQuAD v2.0 dataset, these tensors would comprise roughly 1.2 Terabytes of data. The sheer size of this data makes for a very rich set of signals given to the student model. However, given the success of our data augmentation experiments, the input space provided by SQuAD v2.0 itself did not appear rich enough for the student model to fully exploit those signals.

A huge benefit of performing model distillation and student-teacher learning is the ease with which "labeled" data can be obtained from unlabeled data. Data augmentation for us was simplified further by the introduction of the powerful T5 model, which we fine-tuned to automatically generate questions given a passage. Even in the absence of such a model, generating a huge amount of unlabeled data in the context of a production environment is often quite feasible. In the case of question-answering, for example, we may be able to mine user-generated questions and run those through a teacher model to provide data for the student model to train on. For task-specific knowledge distillation, we have shown that purely performing data augmentation is a viable alternative to reproducing the language modeling pre-training phase.
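The idea of turning unlabeled inputs into distillation data can be sketched in a few lines. Everything in this example is hypothetical: `build_distillation_set` and `toy_teacher` are illustrative names, and the toy teacher merely stands in for a fine-tuned teacher model that would return its real output signals.

```python
def build_distillation_set(teacher_predict, unlabeled_pairs):
    """Attach teacher outputs to unlabeled (question, context) pairs."""
    records = []
    for question, context in unlabeled_pairs:
        records.append({
            "question": question,
            "context": context,
            "teacher_targets": teacher_predict(question, context),
        })
    return records

def toy_teacher(question, context):
    # Stand-in for a real teacher: one dummy score per whitespace token.
    n_tokens = len(context.split())
    return {"start_logits": [0.0] * n_tokens, "end_logits": [0.0] * n_tokens}

pairs = [("Who wrote the report?", "Alice wrote the report in May")]
dataset = build_distillation_set(toy_teacher, pairs)
print(len(dataset), sorted(dataset[0]["teacher_targets"]))
```

No human annotation appears anywhere in this loop, which is what makes mined or generated questions usable as training data for the student.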

6. Conclusion and Further Work

In this work, we have shown a set of techniques that together successfully advance the state of the art in accuracy and speed of distilled language models for the SQuAD v2.0 task. There are many promising directions for future work within this field. The main considerations would be in pushing the limits of model distillation itself. Perhaps by cleverly using adapter layers of some kind, it will become possible in the future to distil the knowledge contained within intermediate layers of a VLLM to a much smaller model that has a totally different architecture from the teacher model. Or maybe a combination of model distillation techniques with model pruning and weight sharing could bring even greater speed gains without loss in accuracy. Lastly, beyond the practical benefits, model distillation techniques may be used to probe just how much of the size of a VLLM is necessary for performing specific NLU tasks. Model distillation is an extremely rich area of research, well deserving of additional exploration.

The authors would like to thank all the members of the SAP Innovation Center Newport Beach as well as our collaborators at the University of California, Irvine, for helpful discussions and inspiration.

Appendix A Methodology Details

A.1. Inference Speed Testing

To perform inference speed testing, we set up an Nvidia V100 GPU running TensorFlow 1.14 and CUDA 10. We must point out here that inference speed and relative speed-up are quite sensitive to the exact testing conditions. For example, changing the input sequence length or prediction batch size may change both the absolute inference speed and the relative speed-up one architecture gets over another. Inference speed and relative speed-up are therefore task dependent and are not universal. For these reasons, the relative speed-up reported by us may not be entirely consistent with the relative speed-up reported by others. In particular, TinyBERT found a 9.4x speed-up over BERT-Base in their particular testing environment, which used a batch size of 128 on the QNLI training set and a maximum sequence length of 128 on an Nvidia K80 GPU; within our testing environment, TinyBERT showed only a 7.4x speed-up over BERT-Base. For our test environment, we chose to use a maximum sequence length of 384 and a batch size of 32. We ran inference multiple times on the SQuAD v2.0 dev set and averaged the run-times to smooth out any network- or memory-caching-dependent performance variation.
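The averaging procedure described above can be sketched as a small timing harness. This is an illustration, not the script used for the reported numbers: `mean_runtime`, `relative_speedup`, and the stand-in prediction functions are hypothetical, and a real measurement would call the actual models on the SQuAD v2.0 dev set.

```python
import time

def mean_runtime(predict_fn, batches, repeats=3):
    """Average wall-clock time of a full pass over `batches`, across `repeats` runs."""
    run_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        for batch in batches:
            predict_fn(batch)
        run_times.append(time.perf_counter() - start)
    return sum(run_times) / len(run_times)

def relative_speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

# Stand-in workloads; a batch size of 32 mirrors the setting described in the text.
batches = [list(range(32)) for _ in range(4)]
slow_model = lambda batch: sum(x * x for x in batch for _ in range(200))
fast_model = lambda batch: sum(x * x for x in batch)

baseline = mean_runtime(slow_model, batches)
candidate = mean_runtime(fast_model, batches)
print(relative_speedup(baseline, candidate))
```

Keeping the batch size, sequence length, and hardware identical for both models is what makes the resulting ratio meaningful.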

A.2. Training Hyperparameters

Training Hyperparameters
Init Learning Rate
Batch Size
Adam Epsilon
Init Temperature
Table 5. Hyperparameters for training WaLDORf

We present the hyperparameters used to train WaLDORf in Table 5. These are the hyperparameters we found gave us the best results. The initial learning rate was kept constant over the "building up" phase of training, where layers and encoder blocks are slowly being added to the training objective. Once the model reached the final phase of training and all of the loss functions were in play, the learning rate was decayed linearly to 0 by the end of training. The initial temperature was also decayed linearly to 1 for the final phase of training. For our choice of data volume, batch size, and steps per layer (), 35 epochs of training corresponds roughly to 1 million global steps, with 517,500 of those steps being spent during the build-up phase of training.
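The schedule described above can be written out as a short sketch. The step counts come from the text; the initial learning rate and temperature are passed in as parameters because their values are not reproduced here. The text could also mean that the temperature reaches 1 by the start of the final phase, so decaying it during the final phase, in step with the learning rate, is an assumption of this sketch.

```python
TOTAL_STEPS = 1_000_000    # roughly 1 million global steps (from the text)
BUILD_UP_STEPS = 517_500   # steps spent in the build-up phase (from the text)

def final_phase_decay(step, start_value, end_value):
    """Hold start_value during build-up, then decay linearly to end_value."""
    if step <= BUILD_UP_STEPS:
        return start_value
    progress = (step - BUILD_UP_STEPS) / (TOTAL_STEPS - BUILD_UP_STEPS)
    progress = min(progress, 1.0)
    return start_value + progress * (end_value - start_value)

def learning_rate(step, init_lr):
    """Constant during build-up, then linear decay to 0."""
    return final_phase_decay(step, init_lr, 0.0)

def temperature(step, init_temperature):
    """Assumed: constant during build-up, then linear decay to 1."""
    return final_phase_decay(step, init_temperature, 1.0)

# Hypothetical initial values, used only to exercise the functions.
for step in (0, BUILD_UP_STEPS, 758_750, TOTAL_STEPS):
    print(step, learning_rate(step, 1e-3), temperature(step, 2.0))
```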

A.3. Hardware

All model training was performed using a single Google v3 TPU running TensorFlow 1.14. Data parallelization among the 8 TPU cores was handled by the TPUEstimator API. As mentioned in subsection A.1, all inference speed testing was performed using a single Nvidia V100 GPU.



  1. Ekaterina Arkhangelskaia and Sourav Dutta. 2019. Whatcha lookin’at? DeepLIFTing BERT’s Attention in Question Answering. arXiv preprint arXiv:1910.06431 (2019).
  2. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
  3. Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906 (2017).
  4. Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 535–541.
  5. Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733 (2016).
  6. Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017).
  7. Robin Cheong and Robel Daniel. 2019. transformers.zip: Compressing Transformers with Pruning and Quantization. Technical report, Stanford University, Stanford, California. URL https ….
  8. Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What Does BERT Look At? An Analysis of BERT’s Attention. arXiv preprint arXiv:1906.04341 (2019).
  9. Alexis Conneau and Guillaume Lample. 2019. Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems. 7057–7067.
  10. Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860 (2019).
  11. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  12. Jonathan Frankle and Michael Carbin. 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018).
  13. Philip Gage. 1994. A new algorithm for data compression. The C Users Journal 12, 2 (1994), 23–38.
  14. Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401 (2014).
  15. Song Han, Huizi Mao, and William J Dally. 2015a. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
  16. Song Han, Jeff Pool, John Tran, and William Dally. 2015b. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems. 1135–1143.
  17. John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4129–4138.
  18. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
  19. Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
  20. Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351 (2019).
  21. Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2019. Spanbert: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529 (2019).
  22. Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. Large memory layers with product keys. In Advances in Neural Information Processing Systems. 8546–8557.
  23. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
  24. Yann LeCun, John S Denker, and Sara A Solla. 1990. Optimal brain damage. In Advances in neural information processing systems. 598–605.
  25. Dongsoo Lee and Byeongwook Kim. 2018. Retraining-based iterative weight quantization for deep neural networks. arXiv preprint arXiv:1805.11233 (2018).
  26. Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. 2016. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning. 2849–2858.
  27. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  28. Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2018. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270 (2018).
  29. Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
  30. Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730 (2018).
  31. Stephen Merity. 2019. Single Headed Attention RNN: Stop Thinking With Your Head. arXiv preprint arXiv:1911.11423 (2019).
  32. Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One? arXiv preprint arXiv:1905.10650 (2019).
  33. Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. arXiv preprint arXiv:1907.07355 (2019).
  34. Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).
  35. Matthew E Peters, Mark Neumann, Robert L Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164 (2019).
  36. Gabriele Prato, Ella Charlaix, and Mehdi Rezagholizadeh. 2019. Fully Quantized Transformer for Improved Translation. arXiv preprint arXiv:1910.10485 (2019).
  37. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019).
  38. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
  39. Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD. arXiv preprint arXiv:1806.03822 (2018).
  40. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).
  41. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
  42. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053 (2019).
  43. Chenglei Si, Shuohang Wang, Min-Yen Kan, and Jing Jiang. 2019. What does BERT Learn from Multiple-Choice Reading Comprehension Datasets? arXiv preprint arXiv:1910.12391 (2019).
  44. Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019a. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355 (2019).
  45. Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019b. Ernie 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412 (2019).
  46. Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. arXiv preprint arXiv:1903.12136 (2019).
  47. Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950 (2019).
  48. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
  49. Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv preprint arXiv:1905.09418 (2019).
  50. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019a. Universal adversarial triggers for nlp. arXiv preprint arXiv:1908.07125 (2019).
  51. Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, and Sameer Singh. 2019b. AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models. arXiv preprint arXiv:1909.09251 (2019).
  52. Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019c. Do NLP Models Know Numbers? Probing Numeracy in Embeddings. arXiv preprint arXiv:1909.07940 (2019).
  53. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537 (2019).
  54. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
  55. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning. 2048–2057.
  56. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237 (2019).
  57. Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. 2019. Improving Neural Network Quantization without Retraining using Outlier Channel Splitting. In International Conference on Machine Learning. 7543–7552.
  58. Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. 2019. Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067 (2019).