WaLDORf: Wasteless Language-model Distillation On Reading-comprehension
Abstract.
Transformer-based Very Large Language Models (VLLMs) like BERT, XLNet and RoBERTa have recently shown tremendous performance on a large variety of Natural Language Understanding (NLU) tasks. However, due to their size, these VLLMs are extremely resource intensive and cumbersome to deploy at production time. Several recent publications have looked into various ways to distil knowledge from a transformer-based VLLM (most commonly BERT-Base) into a smaller model which can run much faster at inference time. Here, we propose a novel set of techniques which together produce a task-specific hybrid convolutional and transformer model, WaLDORf, that achieves state-of-the-art inference speed while still being more accurate than previous distilled models.
1. Introduction
The recent emergence of VLLMs (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019; Radford et al., 2019; Shoeybi et al., 2019; Raffel et al., 2019) has made a wide variety of previously extremely difficult NLU tasks feasible. From an academic perspective, the leaderboards of a broad set of NLU benchmarks, such as the General Language Understanding Evaluation (GLUE (Wang et al., 2018) or SuperGLUE (Wang et al., 2019)) and the Stanford Question Answering Dataset (SQuAD (Rajpurkar et al., 2016; Rajpurkar et al., 2018)), are dominated by fine-tuned versions of such VLLMs, generally based on transformers (Vaswani et al., 2017) and the attention mechanism (Bahdanau et al., 2014; Luong et al., 2015; Britz et al., 2017; Graves et al., 2014; Xu et al., 2015; Cheng et al., 2016). In industry, real-world problems are now being solved by putting these VLLMs into production. Recently, both Google and Microsoft have announced that they use some (possibly distilled) version of a VLLM in their search algorithms. In academia, a plethora of research has arisen not only towards advancing and augmenting these state-of-the-art VLLMs (Peters et al., 2019; Conneau and Lample, 2019; Lample et al., 2019; Sun et al., 2019b; Dai et al., 2019; Joshi et al., 2019) but also towards studying their inner workings (Wallace et al., 2019a, b; Si et al., 2019; Wallace et al., 2019c; Tenney et al., 2019; Michel et al., 2019; Clark et al., 2019; Arkhangelskaia and Dutta, 2019; Hewitt and Manning, 2019; Niven and Kao, 2019), a field sometimes called "BERTology".
Looking over the entire NLU landscape, it has become increasingly apparent that deploying a (fine-tuned) VLLM into production can generate a lot of value. However, the resource requirements of deploying a VLLM into production can be prohibitively expensive. VLLMs, being very large, require both a lot of memory and a lot of computation for inference. Therefore, there has been much recent work on distilling knowledge from a VLLM into a much smaller, faster model (Sun et al., 2019a; Jiao et al., 2019; Tang et al., 2019; Sanh et al., 2019). Our work continues that line of work by examining several novel techniques which, together with previously published techniques, produce a leaner, faster but still highly accurate distilled model. This distilled model, in turn, can produce massive cost savings for production environments. In concrete terms, while maintaining Exact Match (EM) and F1 scores on SQuAD v2.0 higher than TinyBERT, 4-layer Patient Knowledge Distillation of BERT (BERT-PKD), and 4-layer DistilBERT, we have achieved 1.24x, 3.1x, and 3.1x speed-ups over those models respectively (4-layer BERT-PKD and 4-layer DistilBERT have the same model architecture and so their inference speed is the same). Compared with BERT-Base and BERT-Large, we achieved 9.13x and 28.63x speed-ups respectively.
For this work, we are looking into task-specific distillation for the SQuAD v2.0 task. That is, we do not perform model distillation on the pretraining phase of BERT. We chose to look more deeply into task-specific distillation so that we could explore task-specific data augmentation and its effects on model distillation in more detail. We chose specifically the SQuAD v2.0 task because it is relatively difficult and flexible. Recent advances in NLU have proposed to frame a wide variety of NLU tasks in the question-answering format (Raffel et al., 2019; McCann et al., 2018).
The major contributions (which we will describe in detail in section 3) of our work are:

With inspiration from convolutional autoencoders, we use convolutions, max pooling, and up-sampling to shrink the sequence length of inputs to the transformer encoder blocks and then expand the outputs of those encoder blocks back to the original sequence length. (Section 3.2)

We modify the hidden-state distillation and attention-weight distillation by using average pooling and max pooling respectively to align the sequence dimension of our model with the teacher model. (Section 3.3)

We modify the patient knowledge distillation procedure to build layers up from the bottom one layer at a time to reduce the covariate shift experienced by each encoder block. (Section 3.3)

We fine-tune a pretrained sequence-to-sequence VLLM to produce high-quality questions from given contexts to perform task-specific data augmentation. (Section 3.4)
The rest of our paper is organized as follows: in section 2 we discuss some related work and models from which we drew inspiration; in section 3 we describe in detail our methodology, including model architecture, model distillation procedure, and data augmentation; in section 4 we present all of our results, including results from our ablation study; in section 5 we discuss our findings in detail; and finally in section 6 we present a few concluding remarks and directions for future work. We also present some details to reproduce our results in the appendix, section A.
2. Related Work
Model compression has been explored in the past in several parallel paths. There is a whole field of study devoted to weight quantization, weight pruning, and weight encoding to reduce the size of large neural networks, often achieving impressive results (Han et al., 2015a, b; Lee and Kim, 2018; Zhao et al., 2019; Lin et al., 2016; Cheng et al., 2017; LeCun et al., 1990; Frankle and Carbin, 2018; Liu et al., 2018; Zhou et al., 2019). Recently, there have been efforts to apply weight quantization and pruning to transformer models as well (Voita et al., 2019; Prato et al., 2019; Cheong and Daniel, 2019). For this work, we instead focus on the parallel path(s) of exploration into compressing large transformers using architectural changes or model distillation.
The recently released ALBERT (Lan et al., 2019) approached model compression from the point of view of making architectural and training objective changes to the transformer model. The authors of ALBERT use the following three main techniques: 1) factorized embedding parameterization to greatly reduce the number of weights in the embedding layer, 2) weight sharing among all encoder blocks of the transformer, greatly reducing the number of weights in the encoder stack, and 3) an inter-sentence coherence loss instead of a next sentence prediction loss during BERT-like pretraining. With these architectural and training modifications, ALBERT produces impressive results and achieved state-of-the-art accuracy, for their xxlarge model, on a wide variety of NLU tasks. ALBERT-Base also achieves quite high accuracy while reducing the number of parameters to 12 million (compared with 110 million for BERT-Base and 335 million for BERT-Large). However, because data still must be processed through every layer of ALBERT, even though the weights are shared between all the layers, the inference speed improvement achieved by ALBERT is not so striking. ALBERT-Base achieves a 21.1x speed-up over an extra large version of BERT built by the authors of ALBERT (BERT-XLarge), which is itself 17.7x slower than BERT-Base. This means that ALBERT-Base achieves only roughly a 1.2x speed-up over BERT-Base itself. For our purposes, we wanted to look at much faster architectures. Therefore, we only incorporated the factorized embedding parameterization idea from ALBERT. Other possibly faster architectures have of course also been explored for language modeling (Merity, 2019).
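The "roughly 1.2x" figure follows from dividing the two speed-ups quoted above, since both are measured against the same BERT-XLarge reference. A small sketch of that arithmetic (the variable names are illustrative only):

```python
# Speed-up figures quoted in the text, both relative to BERT-XLarge.
albert_base_vs_xlarge = 21.1  # ALBERT-Base is 21.1x faster than BERT-XLarge
bert_base_vs_xlarge = 17.7    # BERT-Base is 17.7x faster than BERT-XLarge

# Dividing the two cancels the shared BERT-XLarge reference.
albert_base_vs_bert_base = albert_base_vs_xlarge / bert_base_vs_xlarge
print(f"{albert_base_vs_bert_base:.2f}x")
```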
Model distillation was first applied to distilling ensemble model knowledge into a single model (Buciluǎ et al., 2006) and then expanded to other domains (Hinton et al., 2015). In this vein, we point out three recent efforts at model distillation of BERT: DistilBERT (Sanh et al., 2019), BERT-PKD (Sun et al., 2019a), and TinyBERT (Jiao et al., 2019). DistilBERT took inspiration from model distillation as put forth by Hinton et al. and distilled knowledge from the softmax outputs produced by BERT-Base into a 6 encoder block (also called 6-layer) version of BERT. DistilBERT has half the number of encoder blocks of BERT-Base but is otherwise identical to BERT-Base (in terms of hidden dimension size, the number of attention heads, etc.). BERT-PKD goes further and distils the hidden state activations, layer by layer, from BERT-Base (and in one experiment, BERT-Large) into 3 and 6 encoder block versions of BERT-Base. The authors of BERT-PKD show that patient knowledge distillation generally outperforms last-layer distillation. Finally, TinyBERT is a 4 encoder block version of BERT-Base that goes even further and includes distillation objectives for the attention scores and the embedding layer. In addition, TinyBERT shrinks the hidden dimension sizes of BERT-Base to produce a much smaller and much faster model than either DistilBERT or BERT-PKD. The accuracy on SQuAD v2.0 attained by TinyBERT also exceeds that attained by DistilBERT and BERT-PKD, for a given number of encoder blocks, so for our work we generally benchmark against TinyBERT. We take inspiration from TinyBERT and perform a version of their layer-by-layer patient knowledge distillation. However, different from TinyBERT, we also included some architectural changes that necessitated changes to the knowledge distillation procedure. Furthermore, we did not reproduce the language modeling pretraining used by BERT and all these previously mentioned distillations. Instead, we relied purely on data augmentation to achieve state-of-the-art results.
3. Methodology
3.1. Problem Statement
For this work, we are working specifically in the framework of question-answering as posed by the SQuAD v2.0 dataset. To perform question-answering, we are given a question and a context and we must find the concrete answer to the question. More specifically, SQuAD v2.0 (Rajpurkar et al., 2018) is an extractive, text-based, question-answering dataset which tasks us to find spans of text within the context which answer the given question or, if the question is unanswerable given the context, to return a null result. Formally, given a tuple $(Q, C)$, where $Q$ is the question and $C$ is the context consisting of a passage of text, find $A$, which is a span in $C$ that answers the question, or return a null result if $Q$ is not answerable given $C$. For training, we are given tuples of training examples $(Q, C, A)$, while for evaluation, we are given the tuple $(Q, C)$ and a set of ground truth answers $\{A_1, \ldots, A_n\}$ that are variants of the answer for the given question.
We use two primary metrics to evaluate our model for accuracy, the EM and F1 metrics. The EM score is the number of examples for which our model's answer, $\hat{A}$, exactly matched any of the ground truth answers in $\{A_1, \ldots, A_n\}$, divided by the total number of evaluation examples. The F1 score averages the maximum overlap between $\hat{A}$ and the answers in $\{A_1, \ldots, A_n\}$ over all the evaluation examples. These are the same two metrics used by the SQuAD v2.0 dataset itself.
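As a rough illustration of the two metrics, the following sketch computes EM and F1 over a few toy predictions. It is not the official SQuAD evaluation script: the normalization rules (lowercasing, stripping punctuation and English articles) and all function names are assumptions made for this example, and a null answer is represented by the empty string.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truths):
    """1.0 if the prediction equals any ground truth after normalization."""
    return float(any(normalize(prediction) == normalize(t) for t in truths))


def f1(prediction, truth):
    """Token-level F1 between one prediction and one ground truth."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Null answers only score when both sides are empty.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(examples):
    """examples: list of (prediction, list_of_ground_truths) pairs."""
    n = len(examples)
    em = sum(exact_match(p, ts) for p, ts in examples) / n
    # For F1, take the best overlap against any ground truth, then average.
    f1_avg = sum(max(f1(p, t) for t in ts) for p, ts in examples) / n
    return 100.0 * em, 100.0 * f1_avg


examples = [
    ("the Eiffel Tower", ["Eiffel Tower", "The Eiffel Tower in Paris"]),
    ("in 1889", ["1889"]),
    ("", [""]),  # unanswerable question, correctly left unanswered
]
em_score, f1_score = evaluate(examples)
```

In this toy set the second prediction overlaps with its ground truth without matching it exactly, so it contributes to F1 but not to EM.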
3.2. Model Architecture
WaLDORf is a hybrid model that incorporates convolutional layers into the transformer encoder stack architecture of BERT (Devlin et al., 2018); see figure 1 for an illustration. Taking inspiration from ALBERT, we chose the embedding dimension to be smaller than the internal hidden dimension used in the encoder blocks (Lan et al., 2019). We use the same word piece tokenization scheme as BERT (which is itself a sub-word tokenization scheme similar to Byte Pair Encoding (Gage, 1994)) in order to allow distillation of the embedding layer as described in section 3.3. Before feeding data into the 8 encoder blocks, we use a time-distributed feed-forward layer to expand the dimensions of the hidden representation to equal that of the encoder blocks. The major architectural insight we bring forth with WaLDORf is to use convolutional layers to shrink and then re-expand the input sequence dimension. The self-attention mechanism employed by the transformer encoder blocks has complexity which is quadratic in the input sequence length (Vaswani et al., 2017). By using two 1D convolutions (along the sequence length dimension) coupled with two max-pooling layers, we shrink the input sequence lengths to the encoder blocks by a factor of 4. In turn, this means each self-attention mechanism needs to do only 1/16th the amount of work. We then use two more 1D convolutions (since there are no transposed convolutions in 1 dimension) coupled with two up-sampling layers to re-expand the sequence length dimension back to its original length. By greatly reducing the amount of work carried out by the self-attention mechanism, we are able to make the encoder blocks run much faster so that we can have a deeper model without sacrificing speed. The tradeoff is, of course, that the model must learn to encode the input information in a "smeared out" way, with roughly every 4 word pieces getting 1 internal representation. This tradeoff may affect different tasks differently depending on how sensitive a certain task is to specific word pieces. However, we found that even for the SQuAD v2.0 task, which intuitively should be relatively sensitive to specific words or word pieces, since the task is specifically about finding spans within the text, we were still able to achieve good accuracy with our model.
The 8 encoder blocks used in our model are the standard transformer encoder blocks used by BERT, scaled down to increase inference speed. A detailed list of architecture hyperparameters can be seen in the Architecture Hyperparameters table. The self-attention mechanisms in our model use 16 heads of attention like BERT-Large instead of 12 heads of attention like BERT-Base because we chose to use BERT-Large as our teacher model. In total, our model has a bit less than a quarter of the number of parameters of BERT-Base.
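The factor-of-4 shrink and the resulting 1/16 reduction in attention work can be checked with a few lines of plain Python. This is only a shape-level sketch: the learned convolutions are omitted, the pooling window of 2 per stage and the nearest-neighbour up-sampling are assumptions, and the helper names are hypothetical.

```python
def max_pool_1d(seq, size=2):
    """Non-overlapping max pooling along the sequence dimension."""
    return [max(seq[i:i + size]) for i in range(0, len(seq), size)]


def upsample_1d(seq, factor=2):
    """Nearest-neighbour up-sampling: repeat every position `factor` times."""
    return [x for x in seq for _ in range(factor)]


seq_len = 384                  # sequence length used for SQuAD in the text
tokens = list(range(seq_len))  # stand-in for a single feature channel

# Two pooling stages shrink the sequence by 2 * 2 = 4.
shrunk = max_pool_1d(max_pool_1d(tokens))
# Two up-sampling stages restore the original length.
restored = upsample_1d(upsample_1d(shrunk))

# Self-attention compares every position with every other position,
# so the number of attention scores per head grows as length ** 2.
full_scores = seq_len ** 2
shrunk_scores = len(shrunk) ** 2

print(len(shrunk), len(restored), shrunk_scores / full_scores)
```

Because the score count is quadratic in the length, a 4x shorter sequence yields 16x fewer scores per head, which is where the 1/16 figure comes from.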
Architecture Hyperparameters  

# Encoder Blocks  
# Encoder Convolutional Layers  
# Decoder Convolutional Layers  
filters  
filters  
filters  
Embedding Size  
Hidden Size  
FeedForward Size  
# Attention Heads  
# Total Parameters 
3.3. Training Objectives and Procedure
To train our model, we used a modified version of the patient knowledge distillation procedure. We have 5 main training losses:

The distillation loss over the embedding layer: $L_e$.

The distillation loss over the hidden state representations inside the 8 encoder blocks: $L_h$.

The distillation loss over the attention scores used to calculate the attention weights inside the 8 self-attention mechanisms: $L_a$.

The distillation loss over the final softmax layer, which corresponds to the vanilla model distillation loss introduced by Hinton et al.: $L_s$.

The ground truth loss (for data for which we have the ground truth): $L_g$.
The student model (our model) learns from a teacher model which we chose to be BERT-Large Whole-Word-Masking (BERT-Large WWM), a variant of BERT-Large that was trained using a masked token objective where the masked tokens were ensured to cover whole words instead of word pieces as in the original. The first 4 losses mentioned, $L_e$, $L_h$, $L_a$, and $L_s$, are all losses for the student model with respect to the teacher model. The final loss, $L_g$, is the loss with respect to the ground truth and would be the only loss available to us if we did not perform model distillation. As we will show in section 4.3, if we use only $L_g$ to try to train our model, our results are extremely poor.
The embedding loss, $L_e$, is a simple mean squared error loss that measures how different the embedding representations of the student model are from those of the teacher model. Given $E_i^S \in \mathbb{R}^{l \times d_e^S}$, where $E_i^S$ are the student model embeddings for the $i$-th text sequence, $l$ is the input sequence length, and $d_e^S$ is the embedding size of the student model, and $E_i^T \in \mathbb{R}^{l \times d_e^T}$, where $E_i^T$ are the teacher model embeddings for the $i$-th text sequence and $d_e^T$ is the teacher model embedding size, then:
$$L_e = \frac{1}{B} \sum_{i=1}^{B} \mathrm{MSE}\left(E_i^S W_e,\; E_i^T\right) \qquad (1)$$
where MSE is the mean squared error, $B$ is the batch size, and $W_e \in \mathbb{R}^{d_e^S \times d_e^T}$ is a learned tensor that projects the student embeddings to the same dimension size as the teacher embeddings.
Both the hidden loss, $L_h$, and the attention score loss, $L_a$, are mean squared error losses that try to match the hidden states and attention scores within the 8 encoder blocks of the student model to those of 8 encoder blocks of the teacher model. We choose to follow the PKD-skip scheme (Sun et al., 2019a) and use every third encoder block in the teacher model for distillation. Just like for the embedding loss, we also have to project the student model hidden states to the same dimension size as the teacher model's. In addition, we also use an average pooling over the teacher model's hidden states along the sequence length dimension to shrink the sequence length of the teacher model to match the sequence length inside the student model's encoder blocks. Because WaLDORf uses convolutions to shrink the sequence length used in the encoder blocks, we cannot match hidden states, word for word, between the two models. If we think of hidden states analogously to word vectors, then we are essentially asking WaLDORf to approximate the average of 4 word vectors with each of its hidden states. This procedure is perhaps the simplest way to make the dimensions of our student model align with the targets given by the teacher model. Empirically, this procedure gives us good results. Thus, given $H_{i,j}^S \in \mathbb{R}^{(l/4) \times d_h^S}$, where $H_{i,j}^S$ are the hidden state outputs for the $i$-th text sequence of the $j$-th encoder block of the student model and $d_h^S$ is the hidden dimension size of the student model, and $H_{i,j}^T \in \mathbb{R}^{l \times d_h^T}$, where $H_{i,j}^T$ are the hidden state outputs for the $i$-th text sequence of the teacher encoder block paired with student block $j$ and $d_h^T$ is the hidden dimension size of the teacher model, then:
$$L_h^j = \frac{1}{B} \sum_{i=1}^{B} \mathrm{MSE}\left(H_{i,j}^S W_h,\; \mathrm{AVG}\left(H_{i,j}^T\right)\right) \qquad (2)$$
where $l$ is the input sequence length, $B$ is the batch size, and AVG denotes a 1D average pooling with size 4 and stride 4 along the sequence length dimension:
$$\mathrm{AVG}(H)_{k} = \frac{1}{4} \sum_{m=0}^{3} H_{4k+m}, \quad k = 0, \ldots, l/4 - 1 \qquad (3)$$
Here $L_h^j$ is the hidden loss of the $j$-th encoder block, and $W_h \in \mathbb{R}^{d_h^S \times d_h^T}$ is a learned matrix that projects student model hidden dimension sizes to those of the teacher model. Note that we use the same $W_h$ for every encoder block because we want the distilled learning to be learned by WaLDORf itself and not by these projection matrices, which will be discarded at inference time.
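To make the hidden-state alignment concrete, here is a toy sketch in plain Python: the teacher's hidden states are average-pooled over windows of 4 positions, the student's hidden states are projected to the teacher's width, and the two are compared with a mean squared error. The tiny dimensions, the fixed projection matrix `w_h`, and the helper names are all assumptions for illustration; in training the projection would be learned.

```python
def avg_pool_seq(hidden, size=4):
    """Average pooling with window `size` and stride `size` over positions.

    hidden: list of positions, each a list of floats (one hidden vector).
    """
    pooled = []
    for start in range(0, len(hidden), size):
        window = hidden[start:start + size]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return pooled


def project(hidden, weight):
    """Multiply each hidden vector (length d_s) by a d_s x d_t matrix."""
    return [[sum(vec[k] * weight[k][j] for k in range(len(weight)))
             for j in range(len(weight[0]))]
            for vec in hidden]


def mse(a, b):
    """Mean squared error over two equally shaped lists of vectors."""
    diffs = [(x - y) ** 2 for va, vb in zip(a, b) for x, y in zip(va, vb)]
    return sum(diffs) / len(diffs)


# Toy sizes: the teacher sees 8 positions of width 3,
# the student sees 8 / 4 = 2 positions of width 2.
teacher_hidden = [[float(i), float(i) + 1.0, float(i) + 2.0] for i in range(8)]
student_hidden = [[1.0, 0.5], [2.0, 1.0]]
w_h = [[1.0, 0.0, 1.0],   # hypothetical 2 x 3 projection; learned in practice
       [0.0, 1.0, 1.0]]

teacher_target = avg_pool_seq(teacher_hidden, size=4)
hidden_loss = mse(project(student_hidden, w_h), teacher_target)
```

Each student position is thus compared against the mean of the four teacher positions it covers, which is the "average of 4 word vectors" picture described above.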
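For the attention score loss, instead of using average pooling to make sequence dimensions match, we switch to using max pooling. We decided to use max pooling here because we want the student model to learn the strongest attention scores rather than an average of attention scores. Hence, given $A_{i,j}^S \in \mathbb{R}^{h \times (l/4) \times (l/4)}$, where $A_{i,j}^S$ are the student attention scores for the $i$-th text sequence of the $j$-th encoder block, $h$ is the number of attention heads, and $l$ is the input sequence length, and $A_{i,j}^T \in \mathbb{R}^{h \times l \times l}$, where $A_{i,j}^T$ are the teacher attention scores for the $i$-th text sequence of the teacher encoder block paired with student block $j$, then:
$$L_a^j = \frac{1}{B} \sum_{i=1}^{B} \mathrm{MSE}\left(A_{i,j}^S,\; \mathrm{MAX}\left(A_{i,j}^T\right)\right) \qquad (4)$$
where $B$ is the batch size and MAX denotes a 2D max pooling operation with size $4 \times 4$ and stride $4 \times 4$, applied per attention head:
$$\mathrm{MAX}(A)_{p,q} = \max_{0 \le m, n \le 3} A_{4p+m,\, 4q+n} \qquad (5)$$
and $L_a^j$ is the attention score loss of the $j$-th encoder block. Note that here we are operating on attention scores and not the attention weights. The attention scores have not had a softmax applied to them and so do not necessarily add up to 1; hence we are free to perform a simple max pooling without destroying any normalization, since the softmax will be applied afterwards. One detail that should be noted is that proper masking needs to be maintained on the attention score loss so that the system does not try to learn minor changes to the masking value.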
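The max pooling over attention scores, together with the masking caveat, can be sketched as follows for a single attention head. The masking constant, the choice to skip masked targets inside the loss, and the helper names are assumptions made for this toy example rather than details taken from the training code.

```python
MASK_VALUE = -1e4  # hypothetical large negative score used for padded positions


def max_pool_2d(scores, size=4):
    """Non-overlapping size x size max pooling over a square score matrix."""
    n = len(scores)
    return [[max(scores[r][c]
                 for r in range(i, min(i + size, n))
                 for c in range(j, min(j + size, n)))
             for j in range(0, n, size)]
            for i in range(0, n, size)]


def masked_mse(student, teacher, mask_value=MASK_VALUE):
    """MSE over entries whose teacher target is not the masking value."""
    pairs = [(s, t)
             for s_row, t_row in zip(student, teacher)
             for s, t in zip(s_row, t_row)
             if t != mask_value]
    return sum((s - t) ** 2 for s, t in pairs) / len(pairs)


# Teacher scores for one head over 8 positions; the last 4 keys are padding.
teacher_scores = [[float(r + c) if c < 4 else MASK_VALUE for c in range(8)]
                  for r in range(8)]
student_scores = [[5.0, 0.0],
                  [9.0, 0.0]]  # the student works on 8 / 4 = 2 positions

teacher_target = max_pool_2d(teacher_scores, size=4)
attention_loss = masked_mse(student_scores, teacher_target)
```

Pooled cells that consist purely of padding keep the masking value and are excluded from the loss, so the student is never asked to reproduce the masking constant itself.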
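The distillation loss over the final softmax layer closely follows the distillation loss introduced by Hinton et al. (Hinton et al., 2015):
$$L_s = \frac{\tau^2}{B} \sum_{i=1}^{B} \mathrm{CE}\left(\mathrm{softmax}\left(z_i^T / \tau\right),\; \mathrm{softmax}\left(z_i^S / \tau\right)\right) \qquad (6)$$
where CE is the cross entropy, $B$ is the batch size, $\tau$ is a temperature hyperparameter, and $z_i^T$ and $z_i^S$ are the teacher and student logits for the $i$-th text sequence respectively. We multiply the loss by the factor of $\tau^2$ to ensure that the gradients are of the correct scale.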
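A minimal sketch of a temperature-scaled distillation loss in the style of Hinton et al. (2015) is given below for a single set of logits (for example, the start-position logits of one sequence). The temperature of 2.0, the toy logits, and the function names are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross entropy between softened teacher and student distributions,
    scaled by temperature ** 2 to keep gradient magnitudes comparable."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    cross_entropy = -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
    return temperature ** 2 * cross_entropy


teacher_start_logits = [4.0, 1.0, 0.5, -2.0]
good_student = [3.5, 1.2, 0.4, -1.5]   # close to the teacher
poor_student = [-2.0, 0.5, 1.0, 4.0]   # prefers the wrong position

loss_good = distillation_loss(teacher_start_logits, good_student)
loss_poor = distillation_loss(teacher_start_logits, poor_student)
```

A student whose logits track the teacher's receives the smaller loss; raising the temperature flattens both distributions so that the teacher's lower-ranked positions carry more of the signal.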
Lastly, the ground truth loss is the standard cross entropy loss with the ground truth labels:
$$L_g = \frac{1}{B} \sum_{i=1}^{B} \mathrm{CE}\left(y_i,\; \mathrm{softmax}\left(z_i^S\right)\right) \qquad (7)$$
where $y_i$ is the ground truth label for the $i$-th text sequence, $z_i^S$ are the student logits, and $B$ is the batch size. This ground truth loss is only available for data for which we have the ground truth labels. When we perform data augmentation, as discussed in the next section, we will no longer have access to ground truth labels and so this loss will not apply. Note that for both the ground truth loss and the distillation loss over the final softmax, there are two sets of logits (and labels), one for the start position and one for the end position. The softmax functions are taken over the sequence length dimension.
Figure 2 shows our version of the patient knowledge distillation procedure in detail.
To train our model, we found it was very beneficial to train each layer from the bottom up (or left to right in figure 1) so that we start with training only using the embedding loss, then combine that loss with the hidden state and attention score losses for layer 0, then for layer 1, etc. After each intermediate layer is trained for some time, we then add the final softmax layer distillation loss and the ground truth loss. This training procedure makes sense intuitively because it should reduce the internal covariate shift experienced by each layer during training (the same goal as for batch normalization (Ioffe and Szegedy, 2015)). Since we have targets available for each layer in the student model, with the exception of the convolutional layers, we can use those intermediate targets to ensure that the inputs to each layer have stabilized somewhat by the time that layer is trained. Thus, our final loss function, which combines the embedding loss $L_e$, the per-block hidden state and attention score losses $L_h^j$ and $L_a^j$, the softmax distillation loss $L_s$, and the ground truth loss $L_g$, is:
$$L = \alpha L_e + \sum_{j=0}^{7} \Theta_j \left(\beta L_h^j + \gamma L_a^j\right) + \Theta_8 \left(\delta L_s + \epsilon L_g\right) \qquad (8)$$
where $\alpha, \beta, \gamma, \delta, \epsilon$ are real number hyperparameters and:
$$\Theta_j = H\left(t - (j + 1)\, n\right) \qquad (9)$$
where $H$ is the Heaviside step function, $t$ is the current global training step, and $n$ is an integer hyperparameter specifying how many training steps to take for each layer. We will show in our ablation study in section 4.3 that this layer-by-layer building up of the loss function significantly improves accuracy.
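The layer-by-layer build-up can be pictured as a set of step-function gates that switch loss terms on as training progresses. The sketch below is one possible reading of that schedule, not a transcription of the training code: the exact step offsets, the single gate per encoder block (covering both its hidden state and attention score losses), and all names and values are assumptions.

```python
def heaviside(x):
    """Step function: 1 once x is non-negative, 0 before that."""
    return 1.0 if x >= 0 else 0.0


def active_losses(step, steps_per_layer, num_blocks=8):
    """Return the gate (0 or 1) of every loss term at a given global step.

    Assumed schedule: the embedding loss is always on, block i joins after
    (i + 1) * steps_per_layer steps, and the softmax and ground truth
    losses join one stage after the last block.
    """
    gates = {"embedding": 1.0}
    for i in range(num_blocks):
        gates[f"block_{i}"] = heaviside(step - (i + 1) * steps_per_layer)
    final_gate = heaviside(step - (num_blocks + 1) * steps_per_layer)
    gates["softmax"] = final_gate
    gates["ground_truth"] = final_gate
    return gates


def total_loss(losses, gates, weights):
    """Weighted sum of the loss terms that are currently switched on."""
    return sum(weights[name] * gates[name] * losses[name] for name in gates)


steps_per_layer = 1000  # hypothetical value
for step in (0, 1000, 2500, 9000):
    on = [name for name, g in active_losses(step, steps_per_layer).items() if g]
    print(step, on)
```

Under these assumptions only the embedding loss is active at the start, one more encoder block joins every `steps_per_layer` steps, and the output-level losses are added last, so each block starts training on inputs that have already partly stabilized.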
3.4. Data Augmentation
A large part of why VLLMs perform so well on downstream tasks is the massive amount of data that they pretrain on (Yang et al., 2019; Radford et al., 2019; Raffel et al., 2019). For distilled architectures, such as DistilBERT or BERT-PKD, which keep hidden dimensions all the same size as a teacher model (e.g. BERT-Base) and just use fewer encoder blocks, the student encoder blocks can be initialized with the weights from a subset of the encoder blocks of a trained teacher model. For distilled architectures, such as TinyBERT or our model, which change the hidden dimension sizes as well as the number of encoder blocks, there is no obvious way to initialize the weights of the model with weights from the teacher model. Hence, distilled architectures also often reproduce the pretraining phase of BERT before being fine-tuned (with or without additional distillation) on downstream tasks. Here, we are focused on task-specific distillation and so we do not desire to reproduce the pretraining (language modeling) phase. However, not reproducing the pretraining phase would put our model at a huge disadvantage simply due to the much smaller coverage of input data that our model would be trained on. Hence, to help offset the imbalance in data volume, we chose to perform extensive data augmentation. It should be noted, though, that even with data augmentation, the final volume of data that we train on, as measured by total token count, is still much smaller than the pretraining data used to train BERT.
To perform data augmentation, we sourced 500,000 additional paragraphs from 280,250 English Wikipedia articles which did not correspond to any paragraphs or articles found in the SQuAD v2.0 dataset (training or dev). For each paragraph, we automatically generated a question using a fine-tuned Text-to-Text Transfer Transformer, T5-Large (Raffel et al., 2019). We fine-tuned T5-Large by using question-context pairs from the SQuAD v2.0 training dataset itself. Our empirical experiments will show (in section 4) that this data augmentation increased the end-to-end accuracy by a large margin. For comparison, the SQuAD v2.0 training dataset has roughly 130,000 questions on 20,000 paragraphs sourced from 500 articles.
4. Results
Overall, WaLDORf achieved a significant speed-up over previous state-of-the-art distilled architectures while maintaining a higher level of accuracy on SQuAD v2.0.
4.1. Inference Speed
Inference Speed Results
Model  Speed-up (relative to BERT-Base)
BERT-Base  1x
BERT-Large  0.3x
BERT-6  2x
BERT-4  2.9x
BERT-2  5.8x
TinyBERT  7.4x
WaLDORf  9.1x
Detailed results of our inference speed testing are presented in the Inference Speed Results table. As clearly shown, our model is the fastest model to perform inference over the SQuAD v2.0 cross validation set by a wide margin. Because of our use of convolutional layers, we were able to have a model which is deeper and has larger hidden dimensions while still being faster than TinyBERT. DistilBERT and BERT-PKD would both fall under the BERT-6 umbrella since, in terms of architecture, both are simply 6 encoder block versions of BERT-Base. BERT-PKD also has a 3 encoder block version, but we can see that our model is much faster than even a 2 encoder block version of BERT-Base.
4.2. Accuracy Performance
SQuAD v2.0 Results  

Model  EM  F1
BERT-Base  
BERT-Large-WWM  
BERT-PKD4  
DistilBERT4  
TinyBERT  
WaLDORf  66.0  70.3
BERT-PKD6  
DistilBERT6  
TinyBERT6
Our model's performance on SQuAD v2.0 is benchmarked against other models in the SQuAD v2.0 Results table. The BERT-PKD4 and DistilBERT4 models presented have internal dimensions identical to those of BERT-4 from the Inference Speed Results table. Thus, our model would be 3.1x faster than those models for inference while being significantly more accurate. WaLDORf is also somewhat more accurate than TinyBERT in both EM and F1 while still being 1.24x faster. To obtain EM and F1 scores higher than ours, one would have to go to 6-layer versions of those models, compared to which our model would be 4.5x faster. TinyBERT6 also has internal dimension sizes scaled to be the same as BERT-Base and so would also be 4.5x slower than WaLDORf.
4.3. Ablation Study
Ablation Study  

EM  F1  +EM  +F1  
Baseline  
+ Softmax Distil  
+ All Layer Distil  
+ Slow Build  11.8  12.5  
+ Data Augment  66.0  70.3 
Results from our ablation study are provided in the Ablation Study table. Each additional technique shown in the table was added on to the sum of the previously used techniques. As we can clearly see, each technique which we added produced significant improvements. With the addition of each new technique, we also explored a whole set of hyperparameters. Results are presented for the best set of hyperparameters found within a given framework. When all the techniques are combined, we get a very significant absolute improvement in both EM score and F1 score over the baseline.
The biggest incremental improvement in accuracy that we saw was in moving from trying to distill every layer all at once to slowly building up the layers from the bottom embedding layer up. This is likely because training every layer all at once introduces a significant covariate shift for every layer's inputs during training. Stabilizing the inputs to each layer by training layers in a bottom-up fashion appears to make optimization much easier.
5. Discussion
5.1. Convolutions and Autoencoding
The core architectural change that we bring forth in this work is the use of convolutions, max pooling, and up-sampling to reduce and then re-expand the sequence length dimension. We introduced this architectural change with the hope that, analogous to convolutional autoencoders, WaLDORf will find an efficient way to compress the information contained within an example into a shorter sequence. By attaining accuracy higher than the previous state of the art, we have shown that it is in fact feasible for the model to compress information in this way. Information can be effectively encoded by the input convolutions, processed by the encoder blocks, and then decoded by the output convolutions. In addition, we also showed that by using max pooling and average pooling we are able to use only the signals given by the "important" attention weights and the averaged hidden states from the teacher model to make our student model perform strongly. These properties may be somewhat task dependent, but even for the SQuAD v2.0 task we were able to attain a high level of accuracy.
Due to the quadratic dependence of the complexity of the self-attention mechanism on the sequence length, our architecture will actually see larger relative gains in speed the longer the input sequence length is. We chose a sequence length of 384 for our tests because that is the sequence length used by BERT for the SQuAD dataset. Other transformer-based VLLMs may use even longer sequence lengths in order to deal better with longer paragraphs, e.g. XLNet uses a max sequence length of 512 for SQuAD. For circumstances where the max input sequence length is even longer than 384, we expect to gain even larger relative speed-ups.
5.2. Distillation
Model distillation is currently a very active area of research. Having access to a teacher model opens up the possibility of using a huge variety of techniques to improve a student model's performance. The distillation techniques available often depend on the architecture of the student model and how similar it is to the teacher model. A model, like BERT-PKD or DistilBERT, that is essentially a copy of a teacher model with just fewer layers can straightforwardly distil a subset of the layers in the teacher model. If one scales down the internal dimensions of the model, as in the case of TinyBERT, then one needs to introduce some way to project the student model's hidden dimensions to those of the teacher model or vice versa. That projection operator introduces new degrees of freedom to the problem and the distillation procedure is no longer as straightforward. In our case, we also had to modify the sequence-length dimension of the teacher model's signals in order to perform intermediate-layer distillation.
In the case that the student model is entirely different from the teacher model, e.g. when an LSTM-based student model tries to learn from a transformer-based teacher model as in Tang et al. (Tang et al., 2019), then perhaps the only easily distilled layer is the final layer and no intermediate-layer distillation is feasible. However, we (and others) have shown that there are significant performance improvements to be had by also distilling the intermediate layers and operations (e.g. attention scores) of a VLLM. It is uncertain, at this point, how much of the performance improvement is due to the difference in the raw information contained within the training signals (softmax signals vs. intermediate layer and operation signals) and how much is due to a difference in how much those signals help the student model to optimize. The concrete difference between these scenarios is that if the latter is true, then student models with architectures wildly different from a teacher model may still attain a high level of accuracy if the last-layer distillation is carried out with a very strong optimization scheme. If the former scenario is true, then we may expect that there is no method by which we can optimize a student model using only the signals from the last layer as strongly as we could by using signals from all the layers.
5.3. Data Augmentation
As we showed in section 4.3, data augmentation gives us a significant 8-point boost in EM and F1 scores over distillation using just the SQuAD v2.0 dataset itself. This result shows that even with roughly 130,000 question-context-answer tuples, SQuAD v2.0 is still not a large enough dataset to fully train our student model using pure model distillation. The set of targets that the student model had to hit, given only SQuAD v2.0 inputs, was already enormously large. The embedding layer targets, the hidden layer targets, and the attention score targets each grow linearly with the total number of examples. For a BERT-Large-WWM teacher model over the SQuAD v2.0 dataset, these tensors would comprise roughly 1.2 terabytes of data. The sheer size of this data makes for a very rich set of signals given to the student model. However, given the success of our data augmentation experiments, the input space provided by SQuAD v2.0 itself did not appear rich enough for the student model to fully exploit those signals.
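As a back-of-the-envelope check on the order of magnitude, the sketch below adds up the per-example target sizes under one set of assumptions that lands near the quoted figure: 32-bit floats, one example per training question, embedding targets at the full sequence length, and hidden state and attention score targets stored after the 4x pooling for 8 teacher blocks. These assumptions are ours and are not stated in the text.

```python
num_examples = 130_000     # approximate number of SQuAD v2.0 training questions
seq_len = 384              # input sequence length quoted in the text
pooled_len = seq_len // 4  # length after the 4x pooling of teacher signals
teacher_width = 1024       # BERT-Large embedding and hidden size
num_heads = 16             # BERT-Large attention heads
num_layers = 8             # teacher blocks used as distillation targets
bytes_per_value = 4        # assumption: 32-bit floats

# Number of target values per example for each kind of signal.
embedding_values = seq_len * teacher_width
hidden_values = num_layers * pooled_len * teacher_width
attention_values = num_layers * num_heads * pooled_len * pooled_len

per_example_bytes = bytes_per_value * (
    embedding_values + hidden_values + attention_values)
total_terabytes = num_examples * per_example_bytes / 1e12
print(f"{total_terabytes:.2f} TB")
```

Under these assumptions the attention score targets account for about half of the total, with the hidden state targets making up most of the rest.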
A huge benefit of performing model distillation and student-teacher learning is the ease with which "labeled" data can be obtained from unlabeled data. Data augmentation for us was simplified further by the introduction of the powerful T5 model, which we fine-tuned to automatically generate questions given a passage. But, in the absence of such a model, generating a huge amount of unlabeled data in the context of a production environment is often quite feasible. In the case of question answering, for example, we may be able to mine user-generated questions and run those through a teacher model to provide data for the student model to train on. For task-specific knowledge distillation, we have shown that purely performing data augmentation is a viable alternative to reproducing the language modeling pre-training phase.
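The idea of turning unlabeled inputs into "labeled" distillation data can be sketched as follows. This is only an illustration under stated assumptions: `build_distillation_set` and `toy_teacher` are hypothetical names, and the toy teacher simply returns uniform scores over context tokens in place of a real fine-tuned teacher model.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    unlabeled: List[Dict[str, str]],
    teacher: Callable[[str, str], List[float]],
) -> List[dict]:
    """Attach teacher outputs to unlabeled question/context pairs so they
    can serve as training targets for a student model."""
    records = []
    for example in unlabeled:
        targets = teacher(example["question"], example["context"])
        records.append({
            "question": example["question"],
            "context": example["context"],
            "teacher_targets": targets,
        })
    return records


def toy_teacher(question: str, context: str) -> List[float]:
    """Hypothetical stand-in for a teacher model: uniform scores over the
    whitespace-separated tokens of the context."""
    n = max(len(context.split()), 1)  # guard against an empty context
    return [1.0 / n] * n


# Mined or generated questions need no human answer annotation.
unlabeled = [
    {"question": "Who wrote the report?", "context": "The report was written by Ada."},
    {"question": "Where is the office?", "context": "The office is in Newport Beach."},
]
distillation_set = build_distillation_set(unlabeled, toy_teacher)
```

In practice the teacher call would return whatever signals the student is trained against (for example output distributions or intermediate activations), so the same loop applies regardless of how the questions were obtained.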
6. Conclusion and Further Work
In this work, we have shown a set of techniques that together successfully advance the state of the art in accuracy and speed of distilled language models for the SQuAD v2.0 task. There are many promising directions for future work within this field. The main consideration, of course, would be pushing the limits of model distillation itself. Perhaps by cleverly using adapter layers of some kind, it will become possible in the future to distil the knowledge contained within intermediate layers of a VLLM into a much smaller model that has a totally different architecture from the teacher model. Or maybe a combination of model distillation techniques with model pruning and weight sharing could bring even greater speed gains without loss in accuracy. Lastly, beyond the practical benefits, model distillation techniques may be used to probe just how much of the size of a VLLM is necessary for performing specific NLU tasks. Model distillation is an extremely rich area of research, well deserving of additional exploration.
Acknowledgements.
The authors would like to thank all the members of the SAP Innovation Center Newport Beach as well as our collaborators at the University of California, Irvine, for helpful discussions and inspiration.
Appendix A Methodology Details
A.1. Inference Speed Testing
To perform inference speed testing, we set up an Nvidia V100 GPU running TensorFlow 1.14 and CUDA 10. We must point out here that inference speed and relative speedup are quite sensitive to the exact testing conditions. For example, changing the input sequence length or prediction batch size may change both the absolute inference speed and the relative speedup one architecture gets over another. Inference speed and relative speedup are therefore task dependent and are not universal. This is why the relative speedup reported by us may not be entirely consistent with that reported by others. In particular, while TinyBERT found a 9.4x speedup over BERTBase in their particular testing environment, which used a batch size of 128 on the QNLI training set and a maximum sequence length of 128 on an Nvidia K80 GPU, within our testing environment TinyBERT showed only a 7.4x speedup over BERTBase. For our test environment, we chose to use a maximum sequence length of 384 and a batch size of 32. We ran inference multiple times on the SQuAD v2.0 dev set and averaged the runtimes to smooth out any network- or memory-caching-dependent performance issues.
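The averaging procedure described above can be sketched with a small timing harness. This is a hedged illustration rather than the benchmarking code used for the reported numbers: `time_inference`, `relative_speedup`, and the dummy `predict_fn` are hypothetical names, and the number of repeats is an arbitrary choice.

```python
import statistics
import time


def time_inference(predict_fn, batches, repeats=5):
    """Run a full pass over `batches` several times and return the mean
    wall-clock time per pass, smoothing out run-to-run variation."""
    run_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        for batch in batches:
            predict_fn(batch)
        run_times.append(time.perf_counter() - start)
    return statistics.mean(run_times)


def relative_speedup(baseline_seconds, candidate_seconds):
    """Speedup of the candidate model relative to the baseline model."""
    return baseline_seconds / candidate_seconds


# Dummy workload: 4 batches of 32 sequences, each of length 384 tokens.
MAX_SEQ_LENGTH = 384
BATCH_SIZE = 32
batches = [[[0] * MAX_SEQ_LENGTH for _ in range(BATCH_SIZE)] for _ in range(4)]


def predict_fn(batch):
    """Stand-in for a model forward pass."""
    return [sum(sequence) for sequence in batch]


mean_seconds = time_inference(predict_fn, batches, repeats=3)
```

Because both the sequence length and the batch size enter the workload directly, a speedup measured this way only holds for the configuration that was timed.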
A.2. Training Hyperparameters
Table: Training Hyperparameters. Rows: Init Learning Rate, Batch Size, Dropout, Adam Epsilon, Init Temperature, Epochs.
We present the hyperparameters used to train WaLDORf in the Training Hyperparameters table. These are the hyperparameters that gave us the best results. The initial learning rate was kept constant over the "building up" phase of training, in which layers and encoder blocks are slowly added to the training objective. Once the model reached the final phase of training and all of the loss functions were in play, the learning rate was decayed linearly to 0 by the end of training. The initial temperature was also decayed linearly to 1 for the final phase of training. For our choice of data volume, batch size, and steps per layer (), 35 epochs of training corresponds roughly to 1 million global steps, with 517,500 of those steps spent in the build-up phase of training.
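The schedule described above can be sketched as a piecewise function of the global step. Several values here are assumptions made purely for illustration: `INIT_LR` and `INIT_TEMPERATURE` are hypothetical placeholders, `TOTAL_STEPS` rounds the "roughly 1 million" figure, and the temperature is assumed to decay over the same final phase as the learning rate.

```python
BUILDUP_STEPS = 517_500   # steps spent in the build-up phase (from the text)
TOTAL_STEPS = 1_000_000   # approximate total number of global steps
INIT_LR = 1e-4            # hypothetical initial learning rate
INIT_TEMPERATURE = 2.0    # hypothetical initial temperature


def linear_decay(start, end, step, decay_start, total_steps):
    """Hold `start` until `decay_start`, then interpolate linearly so that
    the value reaches `end` at `total_steps`."""
    if step <= decay_start:
        return start
    if step >= total_steps:
        return end
    fraction = (step - decay_start) / (total_steps - decay_start)
    return start + fraction * (end - start)


def learning_rate(step):
    """Constant during build-up, then decayed linearly to 0."""
    return linear_decay(INIT_LR, 0.0, step, BUILDUP_STEPS, TOTAL_STEPS)


def temperature(step):
    """Constant during build-up, then decayed linearly to 1 (assumed timing)."""
    return linear_decay(INIT_TEMPERATURE, 1.0, step, BUILDUP_STEPS, TOTAL_STEPS)
```

Halfway through the final phase, at step 758,750, this sketch yields half the initial learning rate and a temperature midway between its initial value and 1.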
A.3. Hardware
All model training was performed using a single Google v3 TPU running TensorFlow 1.14. Data parallelization among the 8 TPU cores was handled by the TPUEstimator API. As mentioned in subsection A.1, all inference speed testing was performed using a single Nvidia V100 GPU.
References
 Ekaterina Arkhangelskaia and Sourav Dutta. 2019. Whatcha lookin’at? DeepLIFTing BERT’s Attention in Question Answering. arXiv preprint arXiv:1910.06431 (2019).
 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
 Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906 (2017).
 Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 535–541.
 Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733 (2016).
 Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017).
 Robin Cheong and Robel Daniel. 2019. transformers.zip: Compressing Transformers with Pruning and Quantization. Technical Report. Stanford University, Stanford, California. URL https….
 Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What Does BERT Look At? An Analysis of BERT’s Attention. arXiv preprint arXiv:1906.04341 (2019).
 Alexis Conneau and Guillaume Lample. 2019. Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems. 7057–7067.
 Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860 (2019).
 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
 Jonathan Frankle and Michael Carbin. 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018).
 Philip Gage. 1994. A new algorithm for data compression. The C Users Journal 12, 2 (1994), 23–38.
 Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401 (2014).
 Song Han, Huizi Mao, and William J Dally. 2015a. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
 Song Han, Jeff Pool, John Tran, and William Dally. 2015b. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems. 1135–1143.
 John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4129–4138.
 Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
 Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
 Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351 (2019).
 Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2019. Spanbert: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529 (2019).
 Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. Large memory layers with product keys. In Advances in Neural Information Processing Systems. 8546–8557.
 Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
 Yann LeCun, John S Denker, and Sara A Solla. 1990. Optimal brain damage. In Advances in neural information processing systems. 598–605.
 Dongsoo Lee and Byeongwook Kim. 2018. Retraining-based iterative weight quantization for deep neural networks. arXiv preprint arXiv:1805.11233 (2018).
 Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. 2016. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning. 2849–2858.
 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
 Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2018. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270 (2018).
 Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
 Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730 (2018).
 Stephen Merity. 2019. Single Headed Attention RNN: Stop Thinking With Your Head. arXiv preprint arXiv:1911.11423 (2019).
 Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One? arXiv preprint arXiv:1905.10650 (2019).
 Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. arXiv preprint arXiv:1907.07355 (2019).
 Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).
 Matthew E Peters, Mark Neumann, Robert L Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164 (2019).
 Gabriele Prato, Ella Charlaix, and Mehdi Rezagholizadeh. 2019. Fully Quantized Transformer for Improved Translation. arXiv preprint arXiv:1910.10485 (2019).
 Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019).
 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
 Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD. arXiv preprint arXiv:1806.03822 (2018).
 Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).
 Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053 (2019).
 Chenglei Si, Shuohang Wang, Min-Yen Kan, and Jing Jiang. 2019. What does BERT Learn from Multiple-Choice Reading Comprehension Datasets? arXiv preprint arXiv:1910.12391 (2019).
 Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019a. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355 (2019).
 Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019b. Ernie 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412 (2019).
 Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. arXiv preprint arXiv:1903.12136 (2019).
 Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950 (2019).
 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
 Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv preprint arXiv:1905.09418 (2019).
 Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019a. Universal adversarial triggers for nlp. arXiv preprint arXiv:1908.07125 (2019).
 Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, and Sameer Singh. 2019b. AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models. arXiv preprint arXiv:1909.09251 (2019).
 Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019c. Do NLP Models Know Numbers? Probing Numeracy in Embeddings. arXiv preprint arXiv:1909.07940 (2019).
 Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537 (2019).
 Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
 Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning. 2048–2057.
 Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237 (2019).
 Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. 2019. Improving Neural Network Quantization without Retraining using Outlier Channel Splitting. In International Conference on Machine Learning. 7543–7552.
 Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. 2019. Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067 (2019).