Spherical Knowledge Distillation
Knowledge distillation aims at obtaining a small but effective deep model by transferring knowledge from a much larger one. Previous approaches pursue this goal by simply transferring "logit-supervised" information between the teacher and the student, a process that can be decomposed into the transfer of the normalized logits and of the norm. We argue that the norm of the logits is in fact interference, which damages the efficiency of the transfer process. To address this problem, we propose Spherical Knowledge Distillation (SKD). Specifically, we project the teacher's and the student's logits onto a unit sphere, and then perform knowledge distillation efficiently on the sphere. We verify our argument via theoretical analysis and an ablation study. Extensive experiments demonstrate the superiority and scalability of our method over the SOTAs.
Recently, deep neural networks have achieved remarkable success in most computer vision tasks. However, the increasing network depth of current state-of-the-art models has resulted in a high computational burden and long inference time. One direction for reducing model size is knowledge distillation (KD) Hinton et al. (2015); Romero et al. (2015); Tian et al. (2020); Tung and Mori (2019); Zagoruyko and Komodakis (2017), which distills the knowledge of a large model (the teacher) into a small model (the student). The fundamental question of KD is "what knowledge is captured by the large model that can be transferred to the small model?". Two groundbreaking papers Buciluǎ et al. (2006); Hinton et al. (2015) propose a knowledge-transfer approach that uses the teacher's logits (the output of the last linear layer) as the supervisory information to train the student.
The similarity information between categories contained in the teacher's logits is generally believed to be the reason why distillation is effective Furlanello et al. (2018); Hinton et al. (2015). For instance, an image of a cat is more similar to a dog than to a car. Therefore, given a "cat" image, a trained network will assign a higher probability to the "dog" category than to the "car" category. Knowledge distillation essentially transfers this similarity information in the teacher's logits to the student.
The similarity of logits, as vectors, can be decomposed into the similarity of their norms and the similarity of their directions. We show that the KD process tries to match the norm and the direction between the student and the teacher in a mixed manner. Intuitively, it is easier to match only the direction than to match both. Given the success of cosine similarity in classification Liu et al. (2017); Wang et al. (2018), the direction of the logits is sufficient to support the capacity of the model. In this paper, we show that the direction alone is capable of representing category similarity, and that the norm has a negative impact on knowledge distillation.
Therefore, we propose to project the logits of the teacher and the student into a spherical space by normalization. The distillation is then optimized on a sphere, where the norm is a constant. The optimization constrained to the sphere becomes smoother and simpler, so the model can concentrate on the truly important part of the learning: the normalized logits, which capture the relative sizes of the probability predictions for each category.
In summary, the main contributions of this paper are three-fold:
We reveal that the direction of logits is the key knowledge, and the norm of logits hinders the distilling process of this knowledge. We propose Spherical Knowledge Distillation (SKD) to transform logits to a unit sphere with the constant norm and perform KD on the sphere space.
We conduct extensive experiments to show the effectiveness of SKD method. We achieve better performance than SOTA methods in CIFAR100 and ImageNet with diverse student and teacher architectures.
Our method is simple to implement and can improve the performance of most existing KD methods. Our method is also robust to the temperature parameter and capacity gap between the student and teacher.
2 Related Work
Bucilua et al. Buciluǎ et al. (2006) first proposed compressing a trained cumbersome model into a smaller one by matching the logits between them. Hinton et al. Hinton et al. (2015) advanced this idea into the more widely used framework known as knowledge distillation (KD). Hinton's KD minimizes the KL divergence between the output probabilities generated from the logits through softmax, with a selected temperature applied to the output logits to soften the distribution so that the loss weighs the teacher's lower logits more.
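As a concrete illustration, the temperature-softened KL objective of Hinton's KD can be sketched as follows (a minimal NumPy sketch, not the original implementation; function names are our own, and the conventional tau-squared gradient-scaling factor is omitted):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, tau=4.0):
    """Hinton-style distillation loss: KL divergence between the
    temperature-softened teacher and student distributions."""
    p = softmax(np.asarray(teacher_logits, dtype=float) / tau)  # soft targets
    q = softmax(np.asarray(student_logits, dtype=float) / tau)  # soft predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

A higher `tau` flattens both distributions, so the loss pays relatively more attention to the teacher's smaller logits.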
In later works, the main direction for improving knowledge distillation has been to transfer more knowledge from the teacher to the student, such as intermediate representations Romero et al. (2015); Zagoruyko and Komodakis (2017); Yim et al. (2017) and relations among instances Peng et al. (2019); Liu et al. (2019); Tung and Mori (2019). FitNets Romero et al. (2015) uses the intermediate feature maps of the teacher as additional training hints for the student. The attention transfer method Zagoruyko and Komodakis (2017) makes the activation-based and gradient-based spatial attention maps of the student consistent with those of the teacher. Peng et al. Peng et al. (2019) transfer not only the instance-level knowledge but also the correlation matrix between instances. Similarity-preserving KD Tung and Mori (2019) requires the student to mimic the pairwise similarity map between the instance features of the teacher.
Recent works Cho and Hariharan (2019); Jin et al. (2019) also investigate the capacity mismatch between the teacher and the student: small students fail to mimic much larger teachers. Cho et al. Cho and Hariharan (2019) propose stopping the training of the teacher early to mitigate this problem. Inspired by curriculum learning, RCO Jin et al. (2019) selects several checkpoints generated during the training of the teacher and makes the student gradually mimic those checkpoints.
However, few works study and modify the original knowledge distillation loss Buciluǎ et al. (2006); Hinton et al. (2015). We demonstrate that the training of the student becomes more efficient when the logits are projected onto a unit sphere. We also show that our method alleviates the capacity mismatch problem.
From the insight that the direction, rather than the norm, of the logits encodes the knowledge of class-wise similarity, we propose to remove the influence of the norm during distillation. In this section, we discuss the role of the norm in distillation from a theoretical perspective. Specifically, we decompose the logits into two parts: the norm and the direction, the latter expressed as a unit vector. We prove that the student learns both parts from the teacher during distillation, and that the learning of the direction is disturbed by the learning of the norm. The student can be trained much more efficiently on the unit sphere, where the norm is constant.
3.1 Hinton’s Distillation
Distillation was first proposed by Hinton et al. (2015). Given two neural networks, a teacher network T and a student network S, we denote the output logits of the teacher as $v^t$ and the output logits of the student as $v^s$.
Knowledge distillation is an effective method to transfer knowledge into a small network. Generally, one of two losses is used, MSE or KL divergence:

$$\mathcal{L}_{MSE} = \frac{1}{2}\left\lVert v^s - v^t \right\rVert^2, \qquad (1)$$

$$\mathcal{L}_{KL} = KL\left(\sigma\!\left(v^t/\tau\right) \,\middle\|\, \sigma\!\left(v^s/\tau\right)\right), \qquad (2)$$

where $\sigma$ denotes the softmax function and $\tau$ is the temperature parameter. Hinton et al. point out that when the temperature is high, the KL divergence degenerates to MSE.
3.2 Theoretical Analysis
In this section, we investigate the effect of the KD loss on the L2 norm and on the normalized logits. We decompose the logits of the student as $v^s = \rho^s \hat{v}^s$, where $\rho^s$ is the l2 norm, defined as $\rho^s = \lVert v^s \rVert_2$, and $\hat{v}^s$ is the normalized logits: $\hat{v}^s = v^s / \rho^s$. The teacher's logits are decomposed in the same way. After the decomposition, we find that the KD loss can also be decomposed into two separate parts. For simplicity, we derive this from the MSE loss on the logits.
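This decomposition is straightforward to compute; a small NumPy sketch (names are ours):

```python
import numpy as np

def decompose(logits):
    """Split a logit vector v into (norm, direction) so that
    v = ||v||_2 * v_hat, with v_hat a unit vector on the sphere."""
    v = np.asarray(logits, dtype=float)
    norm = np.linalg.norm(v)   # rho = ||v||_2
    v_hat = v / norm           # normalized logits (direction)
    return norm, v_hat
```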
Proposition 1. During distillation, the impact of the MSE loss (Eq. 1) on the norm and on the direction is respectively equivalent to the effect of an MSE loss between the norms and a norm-weighted MSE loss between the directions:

$$\frac{\partial \mathcal{L}_{MSE}}{\partial \rho^s} \approx \frac{\partial}{\partial \rho^s}\,\frac{1}{2}\left(\rho^s - \rho^t\right)^2, \qquad \frac{\partial \mathcal{L}_{MSE}}{\partial \hat{v}^s} \approx \frac{\partial}{\partial \hat{v}^s}\,\frac{(\rho^t)^2}{2}\left\lVert \hat{v}^s - \hat{v}^t \right\rVert^2.$$
Proof: We derive the gradient with respect to the norm as follows:

$$\frac{\partial \mathcal{L}_{MSE}}{\partial \rho^s} = \left(v^s - v^t\right)\cdot \hat{v}^s = \rho^s - \rho^t\,\hat{v}^t \cdot \hat{v}^s \approx \rho^s - \rho^t,$$

where $\hat{v}^s \cdot \hat{v}^s = 1$ and, during distillation, the student's direction is approximately equal to the teacher's: $\hat{v}^t \cdot \hat{v}^s \approx 1$.
We derive the gradient with respect to the normalized logits as follows:

$$\frac{\partial \mathcal{L}_{MSE}}{\partial \hat{v}^s} = \rho^s\left(v^s - v^t\right) = \rho^s\left(\rho^s \hat{v}^s - \rho^t \hat{v}^t\right) \approx \left(\rho^t\right)^2\left(\hat{v}^s - \hat{v}^t\right),$$

where the student's norm is approximately equal to the teacher's: $\rho^s \approx \rho^t$. We can see that during distillation the student approaches the teacher in both the norm and the normalized logits. It is worth noting that distillation gives greater weight to data with a larger norm. Since the norm of the student keeps changing during training, this may harm the stability of knowledge distillation.
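The proposition can be checked numerically by splitting the full MSE gradient into a radial component (which drives the student's norm toward the teacher's) and a tangential component (which rotates the student's direction). The following sketch is our own illustration, not the paper's code:

```python
import numpy as np

def mse_grad_components(v_s, v_t):
    """Gradient of 0.5 * ||v_s - v_t||^2 w.r.t. the student logits v_s,
    split into a radial part along the student direction v_hat (the
    norm-learning component) and a tangential part on the sphere (the
    direction-learning component)."""
    v_s = np.asarray(v_s, dtype=float)
    v_t = np.asarray(v_t, dtype=float)
    grad = v_s - v_t                      # full MSE gradient
    v_hat = v_s / np.linalg.norm(v_s)     # student direction
    radial = np.dot(grad, v_hat) * v_hat  # norm-learning component
    tangential = grad - radial            # direction-learning component
    return radial, tangential
```

When the two directions already agree, the tangential part vanishes and the radial part has magnitude $|\rho^s - \rho^t|$, matching the first equation of the proof.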
3.3 Analyzing Norm and Direction Loss
We use ResNet18 as the student and ResNet56 as the teacher to demonstrate the above conclusion, namely that the student tries to learn both the norm and the normalized logits of the teacher. We logged the MSE loss of both the norm and the normalized logits between the student and the teacher during the distillation process.
As Figure 1 shows, the norm loss between the student and the teacher changes drastically during training. In contrast, the MSE loss of the normalized logits does not fluctuate nearly as much. When the L2 norm loss stabilizes, the student's accuracy begins to increase rapidly. This indicates that the norm part of the distillation loss is the hard one to optimize.
We conjecture that the norm is difficult for the student to learn because of the high variance of the teacher's norm across data points. As shown in Figure 2 for the binary classification case, the norm of the logits varies between samples, and the student has difficulty mimicking this variance. However, once projected onto the sphere, the logits of the teacher and the student are much easier to match. We validate this assumption in the ablation section of the experiments.
3.4 Spherical Knowledge Distillation
Based on the above analysis and experimental observations, we propose Spherical Knowledge Distillation (SKD). SKD excludes the effects of the norm from distillation in two ways. First, SKD normalizes the logits of the teacher, so the student does not need to learn the teacher's different norm for each training example. Second, SKD also normalizes the logits of the student; with this setting, the norm stays unchanged during distillation, which stabilizes the training process. SKD then minimizes the KL divergence between the student and the teacher after rescaling both normalized logits by the teacher's average norm. Finally, a cross-entropy loss with the ground truth is added.
To sum up, the logits of the teacher and the student are both projected into the same constrained spherical space. SKD converges faster than regular knowledge distillation and achieves much better performance than other methods.
The overall knowledge distillation loss for SKD is:

$$\mathcal{L}_{SKD} = \alpha\, KL\left(\sigma\!\left(\bar{\rho}^t \hat{v}^t / \tau\right) \,\middle\|\, \sigma\!\left(\bar{\rho}^t \hat{v}^s / \tau\right)\right) + (1-\alpha)\,\mathcal{L}_{CE},$$

where $\hat{v}$ is the normalized logits, $\mathcal{L}_{CE}$ is the supervised loss, $\alpha$ is the trade-off parameter, and $\bar{\rho}^t$ is the average norm of the teacher.
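Put together, the SKD objective can be sketched as follows (NumPy, our own reconstruction from the description above; the exact scaling conventions are assumptions):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def skd_loss(student_logits, teacher_logits, label_onehot,
             avg_teacher_norm, tau=4.0, alpha=0.9):
    """Spherical KD sketch: project both logit vectors onto the unit
    sphere, rescale them by the teacher's average norm, then combine a
    temperature-softened KL term with the supervised cross-entropy."""
    v_s = np.asarray(student_logits, dtype=float)
    v_t = np.asarray(teacher_logits, dtype=float)
    y = np.asarray(label_onehot, dtype=float)
    s = avg_teacher_norm * v_s / np.linalg.norm(v_s)  # student on the sphere
    t = avg_teacher_norm * v_t / np.linalg.norm(v_t)  # teacher on the sphere
    p, q = softmax(t / tau), softmax(s / tau)
    kl = np.sum(p * (np.log(p) - np.log(q)))    # distillation term
    ce = -np.sum(y * np.log(softmax(v_s)))      # supervised term
    return float(alpha * kl + (1.0 - alpha) * ce)
```

Because both vectors are normalized before comparison, rescaling the student's logits by a constant leaves the distillation term unchanged.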
In our experiments, the method is robust to the choice of the temperature $\tau$, so less effort is needed to tune this hyperparameter.
We conducted our experiments on two popular datasets, CIFAR100 and ImageNet, proving the excellent performance of SKD across a broad range of teacher-student settings. In addition, we find that SKD can be used as a drop-in module to boost the performance of a range of knowledge distillation methods. Finally, we conduct two ablation experiments that help us understand what happens to the norm of the student during distillation.
Datasets (1) CIFAR100 Krizhevsky and Hinton (2009) is a relatively small dataset that is widely used for testing various deep learning methods. CIFAR100 contains 50,000 training images and 10,000 test images, divided into 100 fine-grained categories. (2) ImageNet Deng et al. (2009) is a much harder task than CIFAR100. ImageNet contains 1.2M training images and 50K validation images, distributed over 1000 classes.
In all the experiments below, we use $\alpha = 0.9$ and $\tau = 4$ unless specifically mentioned.
Table 1: Top-1 accuracy (%) on CIFAR100 for seven teacher-student pairs (the column labels identifying the pairs were lost in extraction):

| Method | | | | | | | |
|---|---|---|---|---|---|---|---|
| KD Hinton et al. (2015) | 74.92 | 73.54 | 70.66 | 70.67 | 73.08 | 73.33 | 72.98 |
| FitNet Romero et al. (2015) | 73.58 | 72.24 | 69.21 | 68.99 | 71.06 | 73.50 | 71.02 |
| AT Zagoruyko and Komodakis (2017) | 74.08 | 72.77 | 70.55 | 70.22 | 72.31 | 73.44 | 71.43 |
| SP Tung and Mori (2019) | 73.83 | 72.43 | 69.67 | 70.04 | 72.69 | 72.94 | 72.68 |
| CC Peng et al. (2019) | 73.56 | 72.21 | 69.63 | 69.48 | 71.48 | 72.97 | 70.71 |
| VID Ahn et al. (2019) | 74.11 | 73.30 | 70.38 | 70.16 | 72.61 | 73.09 | 71.23 |
| RKD Park et al. (2019) | 73.35 | 72.22 | 69.61 | 69.25 | 71.82 | 71.90 | 71.48 |
| PKT Passalis and Tefas (2018) | 74.54 | 73.45 | 70.34 | 70.25 | 72.61 | 73.64 | 72.88 |
| AB Heo et al. (2019) | 72.50 | 72.38 | 69.47 | 69.53 | 70.98 | 73.17 | 70.94 |
| FT Kim et al. (2018) | 73.25 | 71.59 | 69.84 | 70.22 | 72.37 | 72.86 | 70.58 |
| FSP Yim et al. (2017) | 72.91 | 0.00 | 69.95 | 70.11 | 71.89 | 72.62 | 70.23 |
| NST Huang and Wang (2017) | 73.68 | 72.24 | 69.60 | 69.53 | 71.96 | 73.30 | 71.53 |
| CRD Tian et al. (2020) | 75.48 | 74.14 | 71.16 | 71.46 | 73.48 | 75.51 | 73.94 |
(Table 2 header: CE, KD, ES Cho and Hariharan (2019), SP, CC, CRD, AT, and four SKD variants; the data rows were lost in extraction.)

(Table 3 header, teacher ResNet56: KD (0.9, 0.1), KD (1, 1), FitNet, AT, SP, CC, VID; the data rows were lost in extraction.)
4.1 Experiment on CIFAR100 & ImageNet
We achieve new SOTA results in most experimental settings with SKD. To test the effectiveness of our method, we conducted extensive experiments on CIFAR100 and ImageNet and succeeded in most experimental configurations. In particular, we exceeded the previous SOTA on ImageNet by more than one point: we trained ResNet18 He et al. (2016) to an accuracy of 73.01%, close to the 73.3% of ResNet34. To fully examine the performance of SKD, we compare it with 13 other teacher-student framework methods.
SKD on CIFAR100
Experiment settings For the experiments on CIFAR100, we run 240 epochs in total. The learning rate is initialized to 0.05, then decayed by 0.1 every 30 epochs after the first 150 epochs. The temperature is set to 4, and the weights of the SKD (or KD) loss and the cross-entropy loss are 0.9 and 0.1, respectively, for all settings. For the hyperparameters of the other distillation methods, we follow the settings of Tian et al. (2020).
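The stated schedule can be written as a small helper (our own sketch; we assume the decays land at epochs 150, 180, and 210 of the 240-epoch run):

```python
def cifar_lr(epoch, base_lr=0.05):
    """Learning rate for the CIFAR100 runs described above: constant
    for the first 150 epochs, then multiplied by 0.1 every 30 epochs."""
    if epoch < 150:
        return base_lr
    n_decays = (epoch - 150) // 30 + 1  # decays at epochs 150, 180, 210
    return base_lr * (0.1 ** n_decays)
```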
Analysis When the teacher and student share the same architecture, SKD shows a stable improvement over all other methods.
SKD on ImageNet
Experiment settings We use ResNet18 as our backbone on ImageNet, with 4 NVIDIA 1080 Ti GPUs and distributed training to accelerate the experiments. For better performance, some experiments use 100 epochs and others 90. The initial learning rate is set to 0.4 and decayed by 0.1 at epochs 30, 60, and 80, with a total batch size of 1024. Typically, our SKD run with ResNet18 finishes within one day.
Analysis In previous practice, KD-based methods often failed to bring much improvement on ImageNet. The previous SOTA is 71.34% accuracy with ResNet18, compared to the 69.8% baseline obtained with regular training on the ground truth. In contrast, our method achieves an accuracy of 72.2% with 90 epochs, which is over two points higher than using cross-entropy alone. Furthermore, we found that the accuracy of the model was still trending upward at the end of training. To push the limit further, we add another 10 epochs and apply a strategy, inspired by Cho and Hariharan (2019), that gradually reduces the weight of the distillation loss during training. Specifically, we set $\alpha$ to 0.9 before epoch 60, 0.5 before epoch 80, and 0.1 for the remaining epochs.
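The annealing strategy above amounts to a step schedule on the distillation weight (a sketch using the values stated in the text):

```python
def distill_weight(epoch):
    """Distillation-loss weight alpha for the extended ImageNet run:
    0.9 before epoch 60, 0.5 before epoch 80, 0.1 afterwards."""
    if epoch < 60:
        return 0.9
    if epoch < 80:
        return 0.5
    return 0.1
```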
Another point worth noting is that previous work Cho and Hariharan (2019) finds that KD often fails when distilling from a regular ResNet50 or larger models. On the contrary, our SKD successfully distills into ResNet18 from teachers as large as ResNet101.
Finally, we achieve 73.01% accuracy, close to the 73.3% of ResNet34.
4.2 Scalability Experiment
Many of the KD-based methods compared above are commonly used with a combination of three losses: their specially designed loss, the KD loss, and the cross-entropy loss. Because our method is easy to implement, we are curious whether applying it directly to other methods will improve them. We therefore conducted experiments that replace the KD loss with our SKD loss to test the scalability of SKD.
Experiment settings We use ResNet56 as the teacher and ResNet20 as the student for all these experiments. The hyperparameters are consistent with those used for CIFAR100. The total loss in this experiment consists of three parts: the cross-entropy loss, the SKD loss, and each method's own specific loss.
Analysis We find that SKD improves performance on most of the methods. Note that normalizing the logits may change the representational consistency between the output logits and the intermediate features, which can harm some feature-learning-based methods; the integration of SKD with feature-based KD may require further research. Still, SKD improves most of them, which shows the high scalability and easy transferability of SKD.
4.3 Ablation Experiment
To study the role played by the norm during distillation, we conduct two experiments: learning directly from the norm, and excluding the teacher's norm from distillation. Through these two experiments, we demonstrate that the norm of the teacher is hard to learn during the distillation process.
Directly Learn Norm of The Teacher Experiment
This experiment examines whether transferring the norm improves student performance.
Experiment settings We use resnet20 as the student and resnet56 as the teacher in this experiment. The student is trained to minimize the MSE loss between the logit norms of the student and the teacher, together with a cross-entropy loss on the ground truth. The weights of these two losses are set to 0.05 and 1, respectively; a norm-loss weight larger than 0.05 harms the performance of the student even more.
Analysis We found that the decline curve of the norm in this experiment is similar to that in knowledge distillation. In the end, the accuracy is worse than training with only the ground truth. This indicates that directly transferring the norm from teacher to student is harmful to the student.
Learn Normalized Logits From Teacher Experiment
We further test whether performance improves or degrades when we normalize only the teacher.
Experiment settings We conduct a distillation experiment that completely eliminates the teacher's norm by normalizing the teacher's logits and then multiplying them by their average norm. Other settings are consistent with Hinton's knowledge distillation. All the teacher's logits are thus projected onto a sphere, and all information contained in the teacher's norm disappears, but the student still has to learn this now-constant norm as before.
Analysis In this way, the student does not have to learn the complicated distribution of the teacher's norm over the training dataset. This method achieves much better results than regular KD, which demonstrates that the norms of the teacher's logits actually harm the distillation process. However, the student still has to adjust its norm to match the teacher's, even though it is now a constant, so the performance is not as good as SKD's.
The capacity gap and the temperature are two factors that affect the success of distillation. The former is discussed in both Cho and Hariharan (2019) and Jin et al. (2019), which describe the phenomenon that as the teacher becomes larger, the student performance obtained by distillation becomes worse. The latter is used to soften the distribution of the teacher so that the student can learn more from the relatively small logits, which are believed to contain the knowledge of class-wise similarity. We study these two problems by decomposing the loss into the MSE loss of the norm and of the normalized logits, as in the previous sections, and we find that our SKD is robust to both the capacity gap and the temperature.
Robustness to Temperature
The temperature is used to soften the distribution so that the student pays more attention to the small logits. However, the negative logits are not strongly supervised during training and can contain much noise Hinton et al. (2015). In practice, we therefore need to balance these two aspects by selecting an appropriate temperature. We demonstrate that SKD is more robust to the choice of temperature.
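The softening effect is easy to verify numerically: raising the temperature increases the entropy of the softened distribution, shifting relative weight toward the smaller logits (a small NumPy demonstration with made-up logits):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def entropy(p):
    """Shannon entropy of a probability vector (nats)."""
    return float(-np.sum(p * np.log(p)))

logits = np.array([6.0, 2.0, 1.0, -1.0])  # hypothetical teacher logits
sharp = softmax(logits / 1.0)  # tau = 1: peaked, dominated by the top class
soft = softmax(logits / 4.0)   # tau = 4: flatter, low logits matter more
```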
Experiment settings We ran both KD and SKD on the CIFAR100 dataset, with resnet20 as the student and resnet56 as the teacher. We logged the MSE losses of the norm and the normalized logits on the validation set during both training processes. The results are shown in Figure 3. We also plot the error rate to show the correlation.
Analysis With KD, we observed a high correlation between the norm loss and the error rate. When the norm loss is smallest, the error rate is also very low; as the norm loss becomes larger, the error rate also starts to increase. In general, a good temperature balances the norm loss and the normalized-logits loss to obtain good performance.
SKD, on the other hand, is more robust to the temperature, achieving good performance even with the temperature set to 80.
Robustness to Capacity Gap
Due to the limitations of its representational ability, it is difficult for a student to learn from an oversized teacher. We demonstrate that this gap is reflected in both the norm loss and the normalized-logits loss between students and teachers, and that SKD can alleviate this problem, thereby enjoying more of the performance improvements brought by larger teachers.
Experiment settings We conducted distillation with resnet14 as the student and a range of teachers including resnet20, resnet32, resnet44, resnet56, and resnet110. Hyperparameters are set as in the earlier CIFAR100 experiments. We again log the MSE losses of the norm and the normalized logits between the student and the teachers.
Analysis From Figure 3, we can see that both the norm loss and the normalized-logits loss continue to increase as the teachers become larger, a sign that the capacity gap is growing. When the teacher is larger than resnet32, the damage caused by this gap overwhelms the advantage of the bigger teacher's stronger performance, so the accuracy of the student begins to decline.
At the same time, as the size of the teacher increases, the performance of SKD-trained students keeps improving. We attribute this to SKD restricting learning to the sphere, so the student does not need to learn the teacher's complicated norms, which alleviates the capacity gap problem. We observe a similar phenomenon in the ImageNet experiments, where we successfully distilled resnet101 into resnet18. This demonstrates that SKD can effectively learn more knowledge than KD from a larger network: while KD requires selecting the most suitable teacher size, SKD lets us choose a larger and therefore stronger network as the teacher.
We propose Spherical Knowledge Distillation (SKD), a simple yet powerful framework that allows a small network to effectively approximate the performance of a much bigger one, showing broad prospects for network compression. We demonstrate that the norm of the output logits hinders knowledge distillation; therefore, we project the student's and the teacher's logits onto the sphere and distill only the direction of the logits. We conduct extensive experiments to show the effectiveness of SKD, and our method can be easily plugged into most previous works.
This work does not present any foreseeable societal consequences.
- Ahn, S., Hu, S. X., Damianou, A., Lawrence, N. D., and Dai, Z. (2019) Variational information distillation for knowledge transfer. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9155–9163.
- Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006) Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, New York, NY, USA, pp. 535–541.
- Cho, J. H. and Hariharan, B. (2019) On the efficacy of knowledge distillation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4793–4801.
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. (2018) Born again neural networks. In ICML.
- He, K., Zhang, X., Ren, S., and Sun, J. (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- Heo, B., Lee, M., Yun, S., and Choi, J. Y. (2019) Knowledge transfer via distillation of activation boundaries formed by hidden neurons. ArXiv abs/1811.03233.
- Hinton, G., Vinyals, O., and Dean, J. (2015) Distilling the knowledge in a neural network. ArXiv abs/1503.02531.
- Huang, Z. and Wang, N. (2017) Like what you like: knowledge distill via neuron selectivity transfer. ArXiv abs/1707.01219.
- Jin, X., et al. (2019) Knowledge distillation via route constrained optimization. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1345–1354.
- Kim, J., Park, S., and Kwak, N. (2018) Paraphrasing complex network: network compression via factor transfer. In NeurIPS.
- Krizhevsky, A. and Hinton, G. (2009) Learning multiple layers of features from tiny images. Technical report.
- Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. (2017) SphereFace: deep hypersphere embedding for face recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6738–6746.
- Liu, Y., et al. (2019) Knowledge distillation via instance relationship graph. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7089–7097.
- Park, W., Kim, D., Lu, Y., and Cho, M. (2019) Relational knowledge distillation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3962–3971.
- Passalis, N. and Tefas, A. (2018) Learning deep representations with probabilistic knowledge transfer. In ECCV.
- Peng, B., et al. (2019) Correlation congruence for knowledge distillation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5006–5015.
- Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. (2015) FitNets: hints for thin deep nets. CoRR abs/1412.6550.
- Tian, Y., Krishnan, D., and Isola, P. (2020) Contrastive representation distillation. ArXiv abs/1910.10699.
- Tung, F. and Mori, G. (2019) Similarity-preserving knowledge distillation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1365–1374.
- Wang, H., et al. (2018) CosFace: large margin cosine loss for deep face recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5265–5274.
- Yim, J., Joo, D., Bae, J., and Kim, J. (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7130–7138.
- Zagoruyko, S. and Komodakis, N. (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. ArXiv abs/1612.03928.