Stagewise Knowledge Distillation

Akshay Kulkarni, Navid Panchi and Shital Chiddarwar


The deployment of modern Deep Learning models requires high computational power. However, many applications target embedded devices like smartphones and wearables, which lack such computational capability. This necessitates compact networks which reduce computation while preserving performance. Knowledge Distillation is one of the methods used to achieve this. Traditional Knowledge Distillation methods transfer knowledge from teacher to student in a single stage. We propose progressive stagewise training to improve the transfer of knowledge. We also show that this method works even with a fraction of the data used for training the teacher model, without compromising accuracy. This method can complement other model compression methods and can also be viewed as a generalized model compression technique.

1 Introduction

Deep Learning models are now widely used in fields like Computer Vision, Signal Processing, Robotics and Natural Language Processing. The field started with simple and small networks such as AlexNet[1], LeNet[2], etc., but over time researchers came up with more complex models like ResNets[3], DenseNets[4], Inception Networks[5], etc. Although these models achieved higher accuracies than the simpler models, their computational cost became extremely high. This does not matter when the application runs on the cloud or on PCs with multiple GPUs, but on smaller and more portable devices it is not possible to deploy such complex models and expect real-time inference. Insufficient computational power leads to long inference times, which is a major limitation for real-time applications. For example, very large Transformers[6] are used for speech-to-speech translation, but these models cannot be deployed on embedded devices due to computational as well as memory constraints. Even if they are deployed in some manner, the inference time would render the application unusable. Since more and more of these applications are geared towards mobile and portable devices, the use of more compact or computationally efficient networks becomes necessary. While the use of smaller networks is generally accompanied by a reduction in accuracy, techniques have been developed which try to maintain the accuracy while reducing the computational cost. These come under the research field of Model Compression in Deep Learning.

Model compression techniques[7] can be broadly classified into the following 5 categories:

  1. Parameter Pruning and Sharing : This aims to reduce the redundancy in the parameters of the network and also to eliminate unessential parameters.

  2. Low Rank Factorization techniques : This aims to use tensor/matrix decomposition to determine useful parameters of the network.

  3. Transferred/Compact Convolutional Filters : This aims to use specially designed convolutional filters to reduce computation and storage space.

  4. Knowledge Distillation : This aims to train a compact model (student) using a larger pre-trained model (teacher).

  5. Quantization : This aims to reduce the number of bits representing each weight of the network while preserving the network performance.

In this work, we focus on the knowledge distillation approach. It is analogous to students learning in a classroom. The teacher has gained knowledge while learning and training for the teaching job. Ideally, the teacher should be able to transfer all of their knowledge to the student, but this might not always be the case. Further, not all of the teacher’s knowledge will necessarily be relevant to the student. In the best case, students learn the most important details properly while leaving out unnecessary details which do not affect their performance in a given task. In our approach as well, we consider two models, a teacher model and a student model. The teacher model is generally a larger model or an ensemble of models, whereas the student model is smaller than the teacher (with respect to both memory and computation).

ResNets[3] have given very good results in various Computer Vision tasks[3] while retaining a very simple high-level block-wise structure. Hence, we have chosen ResNet34 as our primary teacher model. This model is also used for setting the accuracy baselines for comparisons. The student model is also based on a ResNet-like architecture, but is smaller in storage size and has lower computational complexity than the teacher model, i.e. the student model has fewer layers than the teacher model. This paper describes the methodologies for training a student model using multiple feature maps of a pre-trained teacher model.

We present a novel approach to train the student model using multiple feature maps from the teacher taken at specific locations. The student model is trained in a stage-wise manner for each feature map and the final classification layers are trained directly on the dataset without the teacher. We show that this method enables the student to learn even on a small subset of the dataset.

The rest of the paper is organized as follows. Section 2 gives an overview of the related work in this field. Section 3 details the proposed approach. Section 4 presents the results and their analysis. Section 5 concludes this paper and suggests possible future work.

2 Related Work

[8] showed that the knowledge from an ensemble can be compressed into a smaller model. Unlabeled data is labelled by the ensemble and then used to train the smaller model, which achieved accuracy similar to that of the ensemble. [9] proposed a knowledge distillation technique that minimizes the MSE loss between the outputs of the teacher and student models. [10] proposed another knowledge distillation technique for neural networks which builds on the technique of [9]. However, they use a combination of softened output from the teacher and the ground truth labels while training the student, while we directly make use of the ground truth labels and additionally the intermediate feature maps. Our approach thus enables the student to learn the correct labels while also learning useful filters from the teacher model. The presented approach is a generalization of the approach by [11], in which the feature map from the middle layer of a large pre-trained teacher model is used to train the smaller student model along with the data. [12] use the feature maps to calculate a flow of solution procedure (FSP) matrix and minimize the difference between the FSP matrices of the student and teacher models. In contrast, our approach directly minimizes the difference between multiple feature maps of the student and the teacher, which is faster since calculating the FSP matrix adds to the computation. [13] use an intermediately sized teaching assistant model between the smaller student model and the larger teacher model, following the method of [10] for the 3 models. Though it improves knowledge transfer, it also increases the complexity (both storage and computational) involved in training the student model. Several other approaches [14][15][16] have used the knowledge distillation technique of [10] along with weight quantization (using 1 to 8-bit integer representations of the weights instead of the conventional 32 or 64-bit floating point representations).
[17] use a recurrent policy network to compress a teacher model while preserving the performance level of the teacher in the student. [10] use a combination of parameter pruning and knowledge distillation (as done by [10]) to achieve a 10x speedup for gaze prediction while preserving the performance level. [18] match the feature vectors before the softmax activation function from the teacher and the student. [19] use a Naive Bayes based teacher model to train a deep network, with true labels as the hard target and teacher output as the soft target, for sentiment classification. [20] explore the use of multiple teacher models whose combined output along with the similarity between one of the intermediate layers, are used to train a student model.

3 Methodology

For this work, deep residual networks[3] are used. Some of the basic terminologies are mentioned below:

  • Basic Block: Each block consists of 2 convolutional layers with filter size 3x3 and stride of 1. Each convolutional layer is followed by a batch normalization [22] layer, and the first convolutional layer also has a Rectified Linear Unit (ReLU) [21] following its batch normalization layer. There is a skip connection which passes the input through the downsample layer (explained below) and adds the result to the output of the last layer of the block. See Figure 1.

  • Downsample Layer: If the dimensions of the input to the BasicBlock match those of the output of the last layer of the BasicBlock, this layer is an identity function. If the dimensions do not match, this layer consists of a convolutional layer with filter size 1x1 and stride of 2, followed by a batch normalization layer. In both cases, the output of the downsample layer is added to the output of the last layer of the BasicBlock. This is shown by the dotted connection in Figure 1.

  • ResNet18 or 34 type models: These models are formed using combinations of BasicBlocks. The first two layers are always the same: a convolutional layer of filter size 7x7 with stride of 2 followed by a max-pooling layer with filter size 3x3 and stride of 2. These are followed by 4 layers each containing multiple (at least 1) BasicBlocks. Some of the ResNets used in this work are detailed in Table 1.
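As a rough illustration of why a stride-2 downsample layer is needed on the skip connection, the sketch below traces the spatial size of a feature map through the first two layers and then through one 1x1, stride-2 downsample layer, using the usual output-size formula for convolution and pooling layers. The 224x224 input resolution and the padding values (3 for the 7x7 convolution, 1 for the max-pooling layer, 0 for the 1x1 convolution) are assumptions borrowed from the common ResNet configuration; they are not stated in this paper.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


input_size = 224  # assumed input resolution, not specified in the paper

# First two layers, shared by every model: 7x7 conv (stride 2), 3x3 max-pool (stride 2).
after_conv1 = conv_out_size(input_size, kernel=7, stride=2, padding=3)
after_pool = conv_out_size(after_conv1, kernel=3, stride=2, padding=1)

# Downsample layer on a skip connection: 1x1 conv with stride 2 (padding assumed 0).
after_downsample = conv_out_size(after_pool, kernel=1, stride=2, padding=0)

print(after_conv1, after_pool, after_downsample)
```

Under these assumptions each stride-2 step halves the spatial resolution, so the skip connection has to be downsampled in the same way before the two tensors can be added.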

Figure 1: Left: Basic Block for the same input and output dimensions, Right: Basic Block where skip connection has a layer that makes input and output dimensions match
Figure 2: Teacher and Student Network Architectures and extraction of intermediate feature maps, Upper: ResNet34 (Teacher), Lower: ResNet10 (Student)

3.1 Teacher Network

The teacher network is generally a standard model like ResNet18 or ResNet34, which makes it easier to obtain pretrained weights for standard datasets. In this work, ResNet34 is used as the teacher model.

layer name   34 layer                   26 layer                   20 layer                   14 layer                   10 layer
conv1        7x7, 64, stride 2 (all models)
conv2_x      3x3 maxpool, stride 2 (all models)
             (3x3, 64; 3x3, 64) x 3     (3x3, 64; 3x3, 64) x 3     (3x3, 64; 3x3, 64) x 2     (3x3, 64; 3x3, 64) x 1     (3x3, 64; 3x3, 64) x 1
conv3_x      (3x3, 128; 3x3, 128) x 4   (3x3, 128; 3x3, 128) x 3   (3x3, 128; 3x3, 128) x 2   (3x3, 128; 3x3, 128) x 1   (3x3, 128; 3x3, 128) x 1
conv4_x      (3x3, 256; 3x3, 256) x 6   (3x3, 256; 3x3, 256) x 3   (3x3, 256; 3x3, 256) x 3   (3x3, 256; 3x3, 256) x 2   (3x3, 256; 3x3, 256) x 1
conv5_x      (3x3, 512; 3x3, 512) x 3   (3x3, 512; 3x3, 512) x 3   (3x3, 512; 3x3, 512) x 2   (3x3, 512; 3x3, 512) x 2   (3x3, 512; 3x3, 512) x 1
FLOPs        3.679G                     2.752G                     2.056G                     1.359G                     896.197M
Parameters   21.550M                    17.712M                    12.622M                    11.072M                    5.171M
Table 1: Models used in this work
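The layer counts in the column headers of Table 1 can be recovered from the number of BasicBlocks in each of the 4 layers. The short sketch below does this arithmetic. It assumes the usual ResNet counting convention (the initial 7x7 convolution, two 3x3 convolutions per BasicBlock, and one final classification layer are counted, while downsample convolutions are not); the dictionary keys other than ResNet34 and ResNet10 are illustrative names.

```python
# BasicBlock counts per layer (conv2_x to conv5_x), read from Table 1.
BLOCKS = {
    "ResNet34": (3, 4, 6, 3),
    "ResNet26": (3, 3, 3, 3),
    "ResNet20": (2, 2, 3, 2),
    "ResNet14": (1, 1, 2, 2),
    "ResNet10": (1, 1, 1, 1),
}


def depth(blocks):
    """Layer count: initial conv + 2 convs per BasicBlock + final classifier layer."""
    return 1 + 2 * sum(blocks) + 1


depths = {name: depth(blocks) for name, blocks in BLOCKS.items()}
for name, value in depths.items():
    print(name, value)
```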

3.2 Student Network

A student model is a smaller yet similar version of a standard teacher model. In this work, the student models are based on the ResNet34 model. They have fewer BasicBlocks than the ResNet34 model (the first two layers, i.e. the 7x7 convolutional layer and the max-pooling layer, remain unchanged). All the models used are detailed in Table 1.
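To give a feel for how much smaller each student is, the sketch below computes the reduction factors in parameters and FLOPs relative to the ResNet34 teacher, using only the figures listed in Table 1 (the 10-layer FLOPs value is converted from 896.197M to G). The function name `reduction` is an illustrative choice.

```python
# Values from Table 1, keyed by the number of layers.
PARAMS_M = {34: 21.550, 26: 17.712, 20: 12.622, 14: 11.072, 10: 5.171}
FLOPS_G = {34: 3.679, 26: 2.752, 20: 2.056, 14: 1.359, 10: 0.896197}


def reduction(table, student, teacher=34):
    """How many times smaller the student is than the teacher for a given quantity."""
    return table[teacher] / table[student]


for layers in (26, 20, 14, 10):
    print(
        f"{layers}-layer student: "
        f"{reduction(PARAMS_M, layers):.2f}x fewer parameters, "
        f"{reduction(FLOPS_G, layers):.2f}x fewer FLOPs"
    )
```

By these figures, the 10-layer student needs roughly a quarter of the teacher's parameters and FLOPs.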

3.3 Datasets

We have used three datasets, Imagenette[23], Imagewoof[23] and CIFAR10[24]. The first two datasets are subsets of the ImageNet[25] dataset. Imagenette is a relatively easier dataset for classification while Imagewoof is relatively difficult. The classes in the datasets are as follows:

  • Imagenette: Tench, cassette player, chain saw, church, garbage truck, gas pump, parachute, English Springer, French horn, golf ball.

  • Imagewoof: Australian terrier, Old English sheepdog, Rhodesian ridgeback, Border terrier, Beagle, English foxhound, Shih-Tzu, Dingo, Golden retriever, Samoyed.

  • CIFAR10: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.

The aim of this work is not to improve the accuracy, but to ensure that the accuracy of the student model is as close as possible to that of the teacher model. For this reason, and also due to computational and memory limitations, we have not validated our approaches directly on the complete ImageNet dataset.

3.4 Proposed Training Method

In our early experiments, we trained multiple feature maps of the student model to mimic the corresponding feature maps of the teacher model, and the output to match the label, simultaneously. To do so, the Mean Squared Error (MSE) between each pair of feature maps is computed and these errors are added together. The cross entropy loss between the softmax output of the model and the label is then added to the sum of the MSE losses, and back-propagation is done using the total loss.
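The total loss of this simultaneous scheme can be written out in a few lines. The following is a minimal sketch in plain Python rather than the PyTorch code used for the experiments; feature maps are represented as flat lists of floats, and the function names are illustrative.

```python
import math


def mse(a, b):
    """Mean squared error between two flattened feature maps of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(logits, label):
    """Cross entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def simultaneous_loss(student_maps, teacher_maps, logits, label):
    """Sum of per-feature-map MSE losses plus the classification loss."""
    feature_loss = sum(mse(s, t) for s, t in zip(student_maps, teacher_maps))
    return feature_loss + cross_entropy(logits, label)


# Two feature-map pairs and a 3-class prediction (toy numbers).
loss = simultaneous_loss(
    student_maps=[[0.0, 1.0], [2.0, 2.0]],
    teacher_maps=[[0.0, 1.0], [1.0, 3.0]],
    logits=[2.0, 0.5, 0.1],
    label=0,
)
print(loss)
```

Every term depends on overlapping sets of student parameters, so a single backward pass has to satisfy all feature-map targets and the label at once.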

These early experiments showed only a marginal improvement of student models trained with the teacher over those trained without it. This can be attributed to the fact that multiple feature maps and the label have to be mimicked at the same time, i.e. very strict conditions are imposed on the optimization. Assigning weights to each MSE loss and to the cross entropy loss would not help either, since the training remains as strict as before. Another reason could be gradient accumulation and vanishing gradients. To make the training less strict, we propose a stagewise training approach.

We propose that the student model be trained stagewise, i.e. one block at a time. The image is given as input to both the teacher and the student model, and the output of the first block is extracted from both models. The Mean Squared Error between these outputs is computed and backpropagated through the student model. After the first block has been trained for 100 epochs, its training is stopped. In the next stage, the input is again passed to the teacher and the student, but now the features from the second block are extracted and the same procedure is followed as in the first stage, i.e. the MSE loss between the outputs of the second blocks of teacher and student is minimized for 100 epochs using backpropagation. This process is repeated for all the blocks. Finally, the classifier part at the end of the student model is trained directly to predict the classes from the dataset, i.e. the image is passed to the student model, which is trained with cross entropy loss for class prediction. In this stage, the teacher model is not used and the rest of the student model, except the classifier part, is frozen. This can be understood using Figure 2.
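The control flow of this procedure can be illustrated with a deliberately tiny toy model. In the sketch below, each "block" is just a scalar multiplication, the teacher has two multiplications per stage and the student one, and plain gradient descent replaces Adam. All weights, inputs, learning rates and epoch counts are hypothetical, and the code is not the PyTorch implementation used for the experiments. It only shows the order of the stages, which parameters are trained in each one, and which loss is used.

```python
import math

# Toy setup (all values hypothetical): each "block" is a scalar multiplication.
# The teacher has two multiplications per stage, the student only one.
TEACHER = [[1.5, 2.0], [0.5, 0.8], [2.0, 1.5]]
INPUTS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
LABELS = [0, 0, 0, 1, 1, 1]  # class 1 when the input is positive


def teacher_feature(x, stage):
    """Teacher feature after the given stage (0-based, inclusive)."""
    for block in TEACHER[: stage + 1]:
        for w in block:
            x *= w
    return x


def student_feature(x, weights, stage):
    """Student feature after the given stage (0-based, inclusive)."""
    for w in weights[: stage + 1]:
        x *= w
    return x


def train_stage(weights, stage, epochs=200, lr=0.02):
    """Train only weights[stage] with an MSE loss; earlier blocks stay frozen."""
    for _ in range(epochs):
        grad = 0.0
        for x in INPUTS:
            frozen_in = student_feature(x, weights, stage - 1)  # output of frozen blocks
            error = frozen_in * weights[stage] - teacher_feature(x, stage)
            grad += 2.0 * error * frozen_in / len(INPUTS)
        weights[stage] -= lr * grad


def train_classifier(weights, epochs=500, lr=0.1):
    """Fit a logistic classifier on the frozen features with cross entropy."""
    a, b = 0.0, 0.0
    last = len(weights) - 1
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(INPUTS, LABELS):
            f = student_feature(x, weights, last)
            p = 1.0 / (1.0 + math.exp(-(a * f + b)))
            grad_a += (p - y) * f / len(INPUTS)
            grad_b += (p - y) / len(INPUTS)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


student = [1.0, 1.0, 1.0]
for stage in range(len(student)):  # one stage per block, in order
    train_stage(student, stage)
a, b = train_classifier(student)   # final stage: labels only, teacher not used
print(student, a, b)
```

Each call to `train_stage` updates a single weight, which mirrors the point that only the parameters of the current block are optimized at any one time.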

Stagewise training has its own advantages, the major one being the limited number of parameters that need to be optimized at a time. Optimizing fewer parameters at once makes the training less strict than optimizing a larger number of parameters at once. The results show that stagewise training performs better than training all the blocks at once.

3.4.1 Less Data Approach

Datasets like ImageNet are so large that stagewise training of the student model using the teacher model takes too much time on limited hardware. It is therefore useful to perform stagewise training using only a subset of the data, provided the accuracy is preserved. To test this, the stagewise training experiments were repeated using one quarter of the original training data with the same pre-trained teacher model. Note that the original training data is the data on which the teacher model was trained. The remaining three quarters of the data are kept as a test set for evaluation.
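A split of this kind can be produced with a few lines of index bookkeeping. The sketch below is one possible way to do it and makes assumptions the paper does not state: the subset is drawn by a uniform random shuffle, the seed is fixed for reproducibility, and class balance is not enforced. The function name `split_quarter` is illustrative.

```python
import random


def split_quarter(num_samples, train_fraction=0.25, seed=0):
    """Return (train_indices, held_out_indices) for a random subset split."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed: an assumption for reproducibility
    cut = int(num_samples * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, held_out_idx = split_quarter(1000)
print(len(train_idx), len(held_out_idx))
```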

3.5 Implementation Details

All experiments were performed on a computer with an Intel Core i7-7700K CPU and one Nvidia GTX1080Ti GPU. PyTorch[26] is used for implementation, training and evaluation of all experiments. All experiments use the Adam optimizer[27] with a learning rate of 1e-4 for 100 epochs for each stage of training (training without the teacher and simultaneous training are also run for 100 epochs). The code is open-source and available at (this link, hidden for anonymity during review).

4 Results and Discussion

The results of simultaneous training, stagewise training and stagewise training with less data are presented in Figures 3 and 4.

Figure 3: From left to right: Graphs of Validation Accuracy vs. Number of Layers in student ResNet for various training approaches on the complete Imagenette, Imagewoof and CIFAR10 datasets
Figure 4: From left to right: Graphs of Validation Accuracy vs. Number of Layers in student ResNet for various training approaches on different amounts of the Imagenette, Imagewoof and CIFAR10 datasets

The graphs show that student models trained on the entire dataset achieve nearly the same accuracy as the teacher. However, in the case of less data, there is a large gap in accuracy between training with and without the teacher. The probable reasons behind these results are discussed in the following paragraphs. It should be noted that the aim of model compression is to decrease the difference between the accuracy of teacher and student, rather than to achieve state of the art accuracy in classification tasks. If a better teacher is used, the accuracy of the student can be expected to improve. Sometimes, the accuracy of the students even surpasses the accuracy of the teacher model.

The results of the experiments can be explained as follows. Since the teacher has seen the complete dataset, it will have learned the features necessary for classification over the whole dataset. When the student is trained using this teacher, the teacher transfers its ‘knowledge’ to the student even when a small dataset is used to train the student. Using less data can also be justified by the number of parameters that have to be trained during a single stage. Since only a small part of the network is trained at a time with the proposed method, the number of parameters to be optimized is much lower than for the complete student network, as mentioned in Section 3.4. This approach gives a substantial improvement in accuracy, i.e. the student network trained on a small dataset without a teacher achieves much lower accuracy than the student trained on the same dataset using the proposed methodology. The major advantage of using less data is, of course, reduced training time. This becomes essential because stagewise training takes n times the time of training without the teacher (since each stage is trained separately for the same number of epochs). Here, n is the number of stages.
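The training-time argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that every stage costs as much as one full training run without the teacher, and that the cost of an epoch scales linearly with the number of training samples; it ignores the teacher's forward pass and any savings from frozen blocks. The stage count of 5 in the example is hypothetical.

```python
def relative_training_cost(num_stages, data_fraction=1.0):
    """Stagewise cost relative to one run without the teacher on the full data.

    Assumes equal cost per stage and per-epoch cost proportional to dataset size.
    """
    return num_stages * data_fraction


full_data = relative_training_cost(num_stages=5)
quarter_data = relative_training_cost(num_stages=5, data_fraction=0.25)
print(full_data, quarter_data)
```

Under these assumptions, training on a quarter of the data brings the stagewise cost close to that of a single run without the teacher on the full dataset.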

Figure 4 shows the results of the experiments performed using less data. It can be seen that students trained independently on less data perform very poorly. On the other hand, if the teacher is used for training the student stagewise on less data, the prediction accuracy increases substantially, especially for the Imagewoof dataset, which is particularly difficult to classify. The results are promising and will be particularly useful when applied to very large datasets like ImageNet.

The simultaneous training results are close to the results of training without the teacher. In particular, simultaneous training is marginally better for the Imagenette and Imagewoof datasets, but marginally worse for CIFAR10. This reiterates that simultaneous training is strict and provides no significant advantage over training without the teacher. However, the stagewise training results are much better than the results of both simultaneous training and training without the teacher. Since the simultaneous training results on the complete datasets were not promising, simultaneous training on less data was not performed.

5 Conclusion and Future Work

This work proposes a novel way to transfer knowledge from one network to another. The proposed method works better than transferring knowledge to the complete network at once, because fewer parameters have to be optimized in each stage. This also enables the student network to learn from less data than the teacher. It can be extremely helpful when training networks on bigger datasets like ImageNet.

Further, this method is flexible and can be used with other model compression techniques and also with other types of models. This method is also not restricted to image classification, but can be utilized for applications like object detection, pixel-level image segmentation, etc. The scope of the proposed technique is broad and it can be viewed as a generalized compression technique.


The authors would like to thank all current and previous members of IvLabs - the AI & Robotics Club of VNIT for their constant support and motivation. We thank Sharath Chandra Raparthy for useful discussions.

