Efficient Kernel Transfer in Knowledge Distillation

Abstract

Knowledge distillation is an effective approach to model compression in deep learning. Given a large model (i.e., the teacher), it aims to improve the performance of a compact model (i.e., the student) by transferring information from the teacher. An essential challenge in knowledge distillation is to identify the appropriate information to transfer. In early works, only the final output of the teacher model is used as a soft label to help train the student; more recently, information from intermediate layers has also been adopted for better distillation. In this work, we optimize the process of knowledge distillation from the perspective of the kernel matrix. The output of each layer in a neural network can be considered as a new feature space generated by applying a kernel function to the original images. Hence, we propose to transfer the corresponding kernel matrix (i.e., the Gram matrix) from the teacher model to the student model for distillation. However, the size of the whole kernel matrix is quadratic in the number of examples. To improve efficiency, we decompose the original kernel matrix with the Nyström method and then transfer the partial matrix obtained with landmark points, whose size is linear in the number of examples. More importantly, our theoretical analysis shows that the difference between the original kernel matrices of the teacher and student can be well bounded by that of their corresponding partial matrices. Finally, a new strategy for generating appropriate landmark points is proposed for better distillation. The empirical study on benchmark data sets demonstrates the effectiveness of the proposed algorithm. Code will be released.


1 Alibaba Group
2 Center for Data Science, School of Engineering and Technology
University of Washington, Tacoma, USA
{qi.qian, lihao.lh}@alibaba-inc.com, juhuah@uw.edu

Introduction

With the development of deep learning, neural networks have made many computer vision tasks feasible on edge devices. However, edge devices often have limited computation and storage resources, so neural networks with a small number of FLOPs and parameters are preferred. Many efforts have been devoted to improving the performance of neural networks under such resource constraints Courbariaux, Bengio, and David (2015); Hinton, Vinyals, and Dean (2015); Sandler et al. (2018). Among the various strategies developed, knowledge distillation (KD) is a simple yet effective way to help train compact networks Hinton, Vinyals, and Dean (2015).

Figure 1: Illustration of knowledge distillation by transferring kernel matrix. Instead of transferring the whole kernel matrix, we adopt landmark points to obtain the approximated kernel matrix for efficient optimization. (Square and circle denote examples from the teacher and student, respectively. Triangle is the corresponding landmark point. Blue and red indicate examples from different classes.)

Knowledge distillation in deep learning aims to improve the performance of a small network (i.e., the student) with the information from a large network (i.e., the teacher). Given a teacher, various kinds of information can be transferred to regularize the training of the student network. Hinton, Vinyals, and Dean (2015) transfer the label information to smooth the label space of the student network. Romero et al. (2015) and Zagoruyko and Komodakis (2017) propose to transfer the information of intermediate layers to help training. Yim et al. (2017) transfer the flow between layers as hints for student networks. Chen, Wang, and Zhang (2018) improve the performance of metric learning with the rank information from teacher models. Recently, Liu et al. (2019) and Park et al. (2019) consider the similarity of examples and transfer the distances between examples to student networks. All of these methods achieve success in certain applications, but a consistent problem formulation for knowledge distillation across different layers of the neural network is still lacking.

In this work, we study knowledge distillation from the perspective of the kernel matrix. Given two images $\mathbf{x}_i$ and $\mathbf{x}_j$, their similarity can be computed as

$k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$

where $k(\cdot,\cdot)$ is the corresponding kernel function and $\phi(\cdot)$ projects examples from the original space to a space that fits the problem better. In an appropriate space, a simple model (e.g., a linear model) can describe the data well. For example, Williams and Seeger (2000) project the original examples to an infinite-dimensional space where different classes become linearly separable. If we consider a neural network as a kernel function and let $\phi(\mathbf{x})$ be the output from a certain layer of the network, then each layer in the neural network generates a kernel matrix that captures the similarity between examples. Since the space spanned by the teacher model serves the target task better, we propose to transfer the kernel matrix from the teacher model to the student model.

The main challenge in transferring the kernel matrix comes from its size (i.e., $n\times n$, where $n$ is the total number of examples). With a large number of training examples, it becomes intractable to transfer the whole matrix directly, especially when training neural networks, where only a mini-batch of examples is accessible at each iteration. If only the sub-matrix within each mini-batch is transferred, the optimization can be slow and additional loss functions for knowledge distillation have to be designed to achieve the desired performance, as shown in Park et al. (2019). Therefore, we propose to apply the Nyström method Williams and Seeger (2000) to obtain a low-rank approximation of the original kernel matrix with landmark points. Then, we can minimize the difference between the compact kernel matrices that are calculated between examples and landmark points, to transfer the information from the teacher effectively. Fig. 1 illustrates the proposed strategy.

Compared with the whole kernel matrix, whose size is $n\times n$, the transferred one is only $n\times m$ in our method, where $m$ is the number of landmark points. Besides, since the number of landmark points is small, we can keep them in the network, which makes the optimization with mini-batches effective. Considering that the selection of landmark points is important for approximating the original kernel matrix, we propose to apply class centers as landmark points for better distillation. More importantly, our theoretical analysis shows that the difference between the original kernel matrices from the teacher and student can be well bounded by that of the corresponding partial matrices. The empirical study on benchmark data sets and popular neural networks confirms that the proposed method, with a single loss for distillation, can transfer the knowledge from different layers well.

Related Work

Knowledge distillation Knowledge distillation has a long history in ensemble learning and has become popular for training small-sized neural networks Chen, Wang, and Zhang (2018); Hinton, Vinyals, and Dean (2015); Liu et al. (2019); Park et al. (2019); Zagoruyko and Komodakis (2017); Tian, Krishnan, and Isola (2020). Various algorithms have been developed to transfer different information from the teacher model to the student model. Hinton, Vinyals, and Dean (2015) consider the final output of the teacher model as a soft label and regularize the similarity between the label distribution output by the student model and the soft label from the teacher model. Zagoruyko and Komodakis (2017) transfer the attention maps from intermediate layers, which provides a way to explore more information from the teacher model. The algorithms proposed in Liu et al. (2019) and Park et al. (2019) are close to our work, where the Euclidean distances between examples are transferred for knowledge distillation. However, they can only transfer the distance information of pairs within a mini-batch, while we aim to transfer the whole kernel matrix to achieve a better performance. Furthermore, we provide a theoretical analysis to demonstrate the effectiveness of the proposed method. Besides these works for classification, some methods are proposed for other tasks, e.g., detection Chen et al. (2017) and metric learning Chen, Wang, and Zhang (2018). We focus on classification in this work, while the proposed method can be easily extended to metric learning, which aims to optimize the performance of the embedding layer.

Nyström method The Nyström method is an effective algorithm to obtain a low-rank approximation of a kernel matrix Williams and Seeger (2000). Given a whole kernel matrix, it tries to reconstruct the original one from a set of randomly sampled columns. The data points corresponding to the selected columns are denoted as landmark points. The approximation error can be bounded even with randomly sampled landmark points. Later, researchers showed that a delicate sampling strategy can further improve the performance Drineas and Mahoney (2005); Kumar, Mohri, and Talwalkar (2012); Zhang, Tsang, and Kwok (2008). Drineas and Mahoney (2005) propose to sample landmark points with a data-dependent probability distribution rather than the uniform distribution. Kumar, Mohri, and Talwalkar (2012) and Zhang, Tsang, and Kwok (2008) demonstrate that using clustering centers as landmark points provides the best approximation among different strategies. Note that the Nyström method is developed for unsupervised kernel matrix approximation, while we can access the label information in knowledge distillation. In this work, we provide an analysis of the selection criterion of landmark points for kernel matrix transfer and develop a supervised strategy accordingly.

Efficient Kernel Matrix Transfer

Given two images $\mathbf{x}_i$ and $\mathbf{x}_j$, the similarity between them can be measured with a kernel function as

$k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$

where $\phi(\cdot)$ is a projection function that maps examples from the original space to a space better suited for the target task.

In this work, we consider each layer in a neural network as such a projection function. We denote the student network as $\mathcal{M}_s$ and the teacher network as $\mathcal{M}_t$. The features output from a certain layer of $\mathcal{M}_s$ and $\mathcal{M}_t$ are referred to as $\phi_s(\mathbf{x})$ and $\phi_t(\mathbf{x})$, respectively, where the index of the layer is omitted for brevity. Then, the similarity between two images $\mathbf{x}_i$ and $\mathbf{x}_j$ in the kernel matrix can be computed by

$K^s_{i,j} = \phi_s(\mathbf{x}_i)^\top \phi_s(\mathbf{x}_j), \qquad K^t_{i,j} = \phi_t(\mathbf{x}_i)^\top \phi_t(\mathbf{x}_j)$
Let $K^s, K^t \in \mathbb{R}^{n\times n}$ denote the kernel matrices from the student and teacher networks, respectively, where $n$ is the total number of images. We aim to transfer the kernel matrix from the teacher model to the student model. The corresponding loss for knowledge distillation with kernel matrix transfer can be written as

$\min \; \|K^s - K^t\|_F^2 = \|X_s X_s^\top - X_t X_t^\top\|_F^2$    (1)

where $X_s \in \mathbb{R}^{n\times d_s}$ and $X_t \in \mathbb{R}^{n\times d_t}$ denote the representations of the entire data set output from the same layer of the student and teacher networks.
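To make Eqn. 1 concrete, the following sketch computes the full-matrix transfer loss for linear kernels. It is only a naive reference implementation with illustrative names and shapes; forming the $n\times n$ matrices is exactly what the rest of this section tries to avoid.

```python
import torch

def full_kernel_loss(x_s, x_t):
    """Eqn. (1): squared Frobenius distance between the full Gram matrices.

    x_s: (n, d_s) student features of the entire data set
    x_t: (n, d_t) teacher features of the entire data set
    """
    k_s = x_s @ x_s.t()              # (n, n) student kernel matrix
    k_t = x_t @ x_t.t()              # (n, n) teacher kernel matrix
    return ((k_s - k_t) ** 2).sum()  # ||K^s - K^t||_F^2
```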

Minimizing this loss directly is intractable due to the large size of the kernel matrix, especially in the conventional training pipeline of deep learning, where only a mini-batch of examples is accessible at each iteration. A straightforward way is to optimize only the random pairs in each mini-batch, as in Liu et al. (2019); Park et al. (2019). However, this can result in slow optimization of the transfer. Hence, we consider decomposing the kernel matrix and optimizing a low-rank approximation in lieu of the original kernel matrix.

Nyström Approximation

The Nyström method is widely applied to approximate kernel matrices Drineas and Mahoney (2005); Kumar, Mohri, and Talwalkar (2012); Williams and Seeger (2000); Zhang, Tsang, and Kwok (2008). We briefly review it in this subsection.

Given a kernel matrix $K\in\mathbb{R}^{n\times n}$, we can first randomly shuffle its columns (and rows accordingly) and rewrite it as

$K = \begin{bmatrix} W & K_{21}^\top \\ K_{21} & K_{22} \end{bmatrix}$

where $W\in\mathbb{R}^{m\times m}$. Then, a good approximation of $K$ can be obtained as $\tilde{K} = C W^{\dagger} C^\top$, where $C = \begin{bmatrix} W \\ K_{21} \end{bmatrix} \in \mathbb{R}^{n\times m}$ and $W^{\dagger}$ denotes the pseudo-inverse of $W$ Williams and Seeger (2000).

Let $K_k$ denote the best rank-$k$ approximation of the kernel matrix and $k \le m$. The rank-$k$ approximation derived by the Nyström method can be computed as $\tilde{K}_k = C W_k^{\dagger} C^\top$, where $W_k$ denotes the best rank-$k$ approximation of $W$ and $W_k^{\dagger}$ is the corresponding pseudo-inverse. The performance of the approximation is characterized by the following theorem.

Theorem 1.

Kumar, Mohri, and Talwalkar (2012). Let $\tilde{K}_k$ denote the rank-$k$ Nyström approximation built from $m$ columns that are sampled uniformly at random without replacement from $K$. Then $\mathbb{E}\,\|K - \tilde{K}_k\|_F$ can be bounded by the best rank-$k$ approximation error $\|K - K_k\|_F$ plus an additive term that shrinks as the number of sampled columns grows.

The examples corresponding to the selected columns in $C$ are referred to as landmark points. Theorem 1 shows that the approximation is reliable even with randomly sampled landmark points.
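As a concrete reference, below is a minimal PyTorch sketch of this construction for a linear kernel $K = XX^\top$; the function name, the toy sizes, and the uniform sampling are illustrative choices, not taken from the paper.

```python
import torch

def nystrom_approx(X, m):
    """Nystrom approximation of K = X X^T using m randomly sampled landmark points."""
    idx = torch.randperm(X.size(0))[:m]      # uniform sampling without replacement
    L = X[idx]                               # landmark features, (m, d)
    C = X @ L.t()                            # kernel between all examples and landmarks, (n, m)
    W = L @ L.t()                            # kernel among the landmarks, (m, m)
    return C @ torch.linalg.pinv(W) @ C.t()  # C W^+ C^T, a low-rank reconstruction of K

# toy check: relative Frobenius error of the approximation
X = torch.randn(1000, 128)
K = X @ X.t()
print(torch.norm(K - nystrom_approx(X, m=64)) / torch.norm(K))
```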

Kernel Transfer

With the low-rank approximation, the kernel matrices from a certain layer of the student and teacher networks can be approximated as

$K^s \approx C_s W_s^{\dagger} C_s^\top, \qquad K^t \approx C_t W_t^{\dagger} C_t^\top$

Compared with the original kernel matrix, the partial matrix $C$ has a significantly smaller number of terms. Let $E_s \in \mathbb{R}^{m\times d_s}$ and $E_t \in \mathbb{R}^{m\times d_t}$ denote the landmark points for the student and teacher kernel matrices; then we have $C_s = X_s E_s^\top$ and $C_t = X_t E_t^\top$, with $W_s = E_s E_s^\top$ and $W_t = E_t E_t^\top$. We show theoretically that the distance between the original kernel matrices is upper-bounded in terms of the distance between the corresponding partial matrices. The detailed proof can be found in the supplementary.

Corollary 1.

Assume that $\|\phi_s(\mathbf{x})\|_2$ and $\|\phi_t(\mathbf{x})\|_2$ are bounded by a constant $c$ and that the smallest eigenvalues of $W_s$ and $W_t$ are larger than $\gamma > 0$. Then, with the Nyström approximation, we can bound the loss in Eqn. 1 by

In Corollary 1, the partial kernel matrix is regularized with the pseudo-inverse of $W_s$. Computing the pseudo-inverse is expensive, and it can introduce additional noise when the feature space of the student is unstable in the early stage of training.

For efficiency, we aim to bound the original loss in Eqn. 1 solely with $C_s$ and $C_t$, as in the following corollary.

Corollary 2.

With the same assumptions as in Corollary 1, we can bound the loss in Eqn. 1 by

Corollary 2 illustrates that minimizing the difference between the partial kernel matrices defined by the landmark points can effectively transfer the original kernel matrix from the teacher model. Note that the partial matrices have the size of $n\times m$, where $m$ is the number of landmark points. When $m \ll n$, the landmark points $E_s$ and $E_t$ can be kept in GPU memory as parameters of the loss function for optimization, which is much more efficient than transferring the original kernel matrix. Since the selection of landmark points is important for the approximation, we elaborate our strategy in the next subsection.
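The memory argument can be checked directly. The sizes below are assumptions chosen for illustration (e.g., $n=50{,}000$ examples, $m=100$ landmark points, feature dimension $d=512$); with them, the partial matrix holds only 0.2% of the entries of the full kernel matrix.

```python
import torch

n, m, d = 50_000, 100, 512        # illustrative sizes: examples, landmarks, feature dimension
X = torch.randn(n, d)             # features of the whole data set from one layer
E = torch.randn(m, d)             # landmark points, kept in memory as loss parameters

C = X @ E.t()                     # partial kernel matrix between examples and landmarks
print(C.shape)                    # torch.Size([50000, 100])
print(C.numel() / (n * n))        # 0.002, i.e., 0.2% of the full n x n kernel matrix
```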

Landmark Selection

We consider a strategy that obtains a single landmark point for each class, which means that $m$ equals the number of classes in the classification problem. We theoretically demonstrate how to select appropriate landmark points as follows.

Let $\{e^s_j\}_{j=1}^m$ and $\{e^t_j\}_{j=1}^m$ denote the landmark points for a certain layer of the student and teacher network, respectively. First, considering the similarity between an arbitrary pair of examples, we can bound the difference between the similarity computed by the teacher and that computed by the student as follows.

Lemma 1.

Given an arbitrary pair $(\mathbf{x}_i, \mathbf{x}_j)$, let $e^s_{a_i}$ and $e^t_{a_i}$ denote the corresponding landmark points for the $i$-th example in the space of the student and teacher network, respectively. Assume that the norms of the features and landmark points are bounded by $c$. Then we have

Lemma 1 provides the bound on a single pair. The bound for the kernel matrix can be accumulated over all pairs.

Theorem 2.

With the assumptions in Lemma 1, we have

(2)

According to Theorem 2, the transfer loss comes from two aspects. The first term in Eqn. 2 contains the distance from each example to its corresponding landmark point. Since the corresponding landmark point for $\mathbf{x}_i$ can be obtained as the nearest one, i.e., $a_i = \arg\min_j \|\phi(\mathbf{x}_i) - e_j\|_2^2$, minimizing this term can be rewritten as

$\min_{\{e_j\}} \; \sum_{i=1}^{n} \min_{j} \|\phi(\mathbf{x}_i) - e_j\|_2^2$

Apparently, this objective is a standard clustering problem. It inspires us to use cluster centers as the landmark points for both the student and teacher networks. Unlike the conventional Nyström method, which typically operates in an unsupervised setting, we can access the label information in knowledge distillation. When we set the number of clusters to the number of classes, each landmark point becomes the center of a class and can be computed by averaging the examples within that class as

$e_j = \frac{1}{n_j} \sum_{i: y_i = j} \phi(\mathbf{x}_i)$    (3)

where $y_i$ is the class label of the $i$-th example and $n_j$ denotes the number of examples in the $j$-th class.
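A sketch of Eqn. 3 in PyTorch follows; the features are assumed to be those accumulated during the previous epoch, and the helper name is illustrative.

```python
import torch

def class_centers(features, labels, num_classes):
    """Eqn. (3): one landmark point per class, the mean feature of that class.

    features: (n, d) features accumulated over the previous epoch
    labels:   (n,)   integer class labels in [0, num_classes)
    """
    centers = torch.zeros(num_classes, features.size(1), dtype=features.dtype)
    counts = torch.zeros(num_classes, dtype=features.dtype)
    centers.index_add_(0, labels, features)                                      # per-class feature sums
    counts.index_add_(0, labels, torch.ones_like(labels, dtype=features.dtype))  # per-class counts
    return centers / counts.clamp(min=1).unsqueeze(1)                            # class means
```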

The second term in Eqn. 2, in fact, indicates the difference between the student and teacher kernel matrices defined by the landmark points, as in Corollary 2. With the landmark points $E_s$ and $E_t$ obtained by optimizing the first term, we can formulate the problem of Knowledge Distillation by transferring the Approximated kernel matrix (KDA) as

$\min_{\theta_s} \; \ell\big(X_s E_s^\top,\; X_t E_t^\top\big)$    (4)

where $\theta_s$ denotes the parameters of the student network and $\ell(\cdot,\cdot)$ is a smoothed version of the squared Frobenius distance between the partial kernel matrices, adopted for stable optimization.
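A mini-batch sketch of the KDA loss in Eqn. 4 is given below. The exact smoothed loss is not reproduced in this text, so the smooth-L1 (Huber) loss is used as a stand-in assumption; the centers are treated as fixed parameters of the loss, as described above, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def kda_loss(feat_s, feat_t, centers_s, centers_t):
    """Eqn. (4): match the partial kernel matrices defined by the landmark points.

    feat_s:    (b, d_s) student features of the current mini-batch
    feat_t:    (b, d_t) teacher features of the same mini-batch
    centers_s: (m, d_s) class centers (landmarks) in the student space
    centers_t: (m, d_t) class centers (landmarks) in the teacher space
    """
    c_s = feat_s @ centers_s.t()            # kernel between examples and student landmarks
    c_t = feat_t @ centers_t.t()            # kernel between examples and teacher landmarks
    # smooth-L1 is an assumed stand-in for the paper's smoothed loss
    return F.smooth_l1_loss(c_s, c_t.detach())
```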

With the KDA loss, we propose a novel knowledge distillation algorithm that works in an alternating manner. In each epoch, we first compute the landmark points from the features of examples accumulated during the previous epoch by Eqn. 3. Then, the KDA loss defined by the fixed landmark points in Eqn. 4 is optimized along with the standard cross-entropy loss for the student. The proposed algorithm is summarized in Alg. 1. Since at least one epoch has to be spent on collecting features for computing landmark points, we minimize the KDA loss only after the first $T'$ epochs of training, where $T' \ge 1$.

  Input: Data set $\mathcal{D}$, a student model $\mathcal{M}_s$, a teacher model $\mathcal{M}_t$, total epochs $T$, warm-up epochs $T'$
  Initialize the landmark points $E_s$, $E_t$
  for $t = 1$ to $T'$ do
     Optimize $\mathcal{M}_s$ without the KDA loss
     Compute $E_s$, $E_t$ as in Eqn. 3
  end for
  for $t = T'+1$ to $T$ do
     Optimize $\mathcal{M}_s$ with the KDA loss defined on $E_s$, $E_t$
     Compute $E_s$, $E_t$ as in Eqn. 3
  end for
  return $\mathcal{M}_s$
Algorithm 1 Knowledge Distillation by Approximated Kernel Transfer (KDA)
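A condensed sketch of Alg. 1 is shown below, reusing the kda_loss and class_centers helpers defined earlier. The `return_features=True` model interface, the loss weight lam, and the other names are assumptions introduced for illustration only.

```python
import torch
import torch.nn.functional as F

def train_kda(student, teacher, loader, optimizer, epochs, warmup, num_classes, lam=1.0):
    """Alg. 1 sketch: alternate between computing landmark points from the previous
    epoch's features (Eqn. 3) and optimizing the cross-entropy plus KDA losses (Eqn. 4)."""
    teacher.eval()
    centers_s = centers_t = None
    for epoch in range(epochs):
        feats_s, feats_t, labels = [], [], []
        for x, y in loader:
            with torch.no_grad():
                f_t, logits_t = teacher(x, return_features=True)  # assumed model interface
            f_s, logits_s = student(x, return_features=True)
            loss = F.cross_entropy(logits_s, y)
            if epoch >= warmup:                 # KDA loss only after the warm-up epochs
                loss = loss + lam * kda_loss(f_s, f_t, centers_s, centers_t)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            feats_s.append(f_s.detach())
            feats_t.append(f_t)
            labels.append(y)
        # landmark points for the next epoch, Eqn. (3)
        centers_s = class_centers(torch.cat(feats_s), torch.cat(labels), num_classes)
        centers_t = class_centers(torch.cat(feats_t), torch.cat(labels), num_classes)
    return student
```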

Connection to Conventional KD

In the conventional KD method Hinton, Vinyals, and Dean (2015), only the output from the last layer of the teacher model is adopted for the student. By setting an appropriate temperature, Hinton, Vinyals, and Dean (2015) illustrate that the loss function for KD can be written as

$\ell_{KD} = \|Z_s - Z_t\|_F^2$

where $Z_s$ and $Z_t$ denote the logits before the SoftMax operator. With the identity matrix $I$, the equivalent formulation is

$\ell_{KD} = \|Z_s I - Z_t I\|_F^2$

Compared to the KDA loss in Eqn. 4, conventional KD can thus be considered as applying one-hot label vectors (i.e., the columns of $I$) as landmark points to transfer the kernel matrix of the teacher network. However, it lacks the constraint on the similarity between each example and its corresponding landmark point, as illustrated in Theorem 2, which may degrade the performance of knowledge distillation.
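For completeness, a brief sketch of the high-temperature argument from Hinton, Vinyals, and Dean (2015) that underlies the logit-matching form above; here $\tau$ is the temperature, $n_c$ the number of classes, $p_i$ and $q_i$ the softened student and teacher probabilities, and the zero-mean-logit assumption is theirs.

```latex
% Gradient of the soft-target cross entropy w.r.t. a single student logit z_i^s,
% expanded for a large temperature tau and zero-mean logits:
\[
\frac{\partial \ell_{\mathrm{soft}}}{\partial z_i^s}
  = \frac{1}{\tau}\,(p_i - q_i)
  \approx \frac{1}{\tau}\!\left(\frac{1 + z_i^s/\tau}{n_c} - \frac{1 + z_i^t/\tau}{n_c}\right)
  = \frac{1}{n_c \tau^2}\,\bigl(z_i^s - z_i^t\bigr),
\]
% which is the gradient of (1 / (2 n_c tau^2)) * ||z^s - z^t||_2^2,
% i.e., squared logit matching up to a constant factor.
```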

Experiments

We conduct experiments on two benchmark data sets to illustrate the effectiveness of the proposed KDA algorithm. We employ ResNet-34 He et al. (2016) as the teacher network. ResNet-18, ResNet-18-0.5 and ShuffleNetV2 Ma et al. (2018) are adopted as student networks, where ResNet-18-0.5 denotes ResNet-18 with half the number of channels. We apply standard stochastic gradient descent (SGD) with momentum to train the networks. Specifically, we set the size of mini-batch to , momentum to and weight decay to - in all experiments. The student models are trained with epochs. The initial learning rate is and cosine decay is adopted with epochs for warm-up.

Three baseline knowledge distillation methods are included in the comparison

  • KD Hinton, Vinyals, and Dean (2015): a conventional knowledge distillation method that constrains the KL-divergence between the output label distributions of the student and teacher networks.

  • AT Zagoruyko and Komodakis (2017): a method that transfers the information from intermediate layers to accelerate the training of the student network.

  • RKD Park et al. (2019): a recent work that regularizes the similarity matrices between student and teacher networks. Unlike the method proposed in this work, they focus on transferring the similarity matrices within a mini-batch. Note that a similar method is proposed in Liu et al. (2019).

Every algorithm minimizes a combination of its distillation loss and the standard cross-entropy loss for classification. For RKD, we transfer the features before the last fully-connected (FC) layer for comparison. Note that AT transfers the attention map of the teacher, so we adopt the feature before the last pooling layer for its distillation. Besides, we let “Baseline” denote the method that trains the student without information from the teacher. Our method is referred to as “KDA”. We search for the best parameters for all methods in the comparison and keep the same parameters across different experiments.

Ablation Study

We perform the ablation study on CIFAR-100 Krizhevsky (2009). This data set contains 100 classes, where each class has 500 images for training and 100 for test. Each image is a color image with a size of 32x32.

In this subsection, we set ResNet-34 as the teacher and ResNet-18 as the student. During training, each image is first zero-padded and then a 32x32 image is randomly cropped from it. Besides, random horizontal flipping is also adopted for data augmentation.

Effect of Landmark Points

First, we evaluate the strategy for generating landmark points. As illustrated in Corollary 2, even randomly selected landmark points can achieve a good performance. Therefore, we compare KDA with class centers to KDA with random landmark points in Fig. 2. In this experiment, we adopt the features before the last FC layer for transfer.

(a) Overall Comparison (b) Zoom In
Figure 2: Comparison of landmark points selection.

From Fig. 2, we can observe that with landmark points, both variants of KDA perform significantly better than the baseline. It demonstrates that the kernel matrix is informative for training student models and that transferring the kernel matrix from the teacher can help improve the performance of the student. Furthermore, KDA with randomly sampled landmark points surpasses the baseline by a large margin. This is consistent with Corollary 2: even with random landmark points, the Nyström method can guarantee a good approximation of the kernel matrix. Finally, KDA with class centers as the landmark points shows the best performance among the different methods, which confirms the criterion suggested in Theorem 2. We will use class centers as landmark points in the remaining experiments.

Effect of Kernel Transfer

Next, we compare the difference between the kernel matrices of the teacher and its student models. The performance of transferring a kernel matrix is measured by the difference between the student and teacher kernel matrices normalized by the norm of the teacher kernel matrix, which calculates the fraction of information that has not been transferred. We investigate features from two layers in the comparison: the one before the last FC layer and the one after the FC layer. The kernel transfer performance for the two layers is illustrated in Fig. 3 (a) and (b), respectively.
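The exact normalization of this metric is not reproduced in this text; the sketch below assumes the natural reading, namely the squared Frobenius difference relative to the norm of the teacher kernel matrix.

```python
import torch

def transfer_metric(K_s, K_t):
    """Fraction of the teacher kernel matrix that has not been transferred
    (assumed form: relative squared Frobenius difference)."""
    return (torch.norm(K_s - K_t) ** 2 / torch.norm(K_t) ** 2).item()
```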

(a) Layer before FC (b) Layer after FC
Figure 3: Comparison of kernel transfer performance, i.e., the fraction of the teacher kernel matrix that has not been transferred.

Fig. 3 (a) compares the kernel transfer performance of the baseline and of the methods that can transfer information from the layer before the last FC layer. First, it is obvious that both RKD and KDA are better than the baseline (i.e., less information is lost during the transfer). It indicates that minimizing the difference between kernel matrices can effectively transfer appropriate information from the teacher. Second, RKD transfers the similarity matrix defined only by the examples within a mini-batch and shows a larger transfer loss than KDA. Considering the massive number of pairs, optimizing over all of these pairs in RKD is intractable. Note that the number of pairs can be up to $n^2$ while the number of pairs in a mini-batch is only $b^2$, where $b$ is the size of a mini-batch. To visit all pairs even once, at least $n/b$ epochs are required.

On the contrary, the transfer loss of KDA is only a small fraction of that of RKD. KDA optimizes the partial kernel matrix with landmark points, and the total number of pairs is linear in the number of original examples. Due to the small number of landmark points, the partial matrix is much more compact than the original one. For example, there are 50,000 training examples in CIFAR-100. When applying 100 landmark points for distillation, the partial matrix contains only 0.2% of the terms of the original one. Besides, since we keep the class centers in memory as the parameters of the loss function, the whole kernel matrix can be approximated within a single epoch. Therefore, SGD can optimize the KDA loss sufficiently.

Then, we compare the performance of transfer after the last FC layer, as shown in Fig. 3 (b). For KDA, we compute the kernel matrix with the features before the SoftMax operator. From the comparison, we can observe that both KD and KDA have much less transfer loss than the baseline. As illustrated in the discussion of “Connection to Conventional KD”, conventional KD is equivalent to transferring the partial kernel matrix with one-hot landmark points. Therefore, it can reduce the difference between teacher and student effectively. However, the landmark points adopted by KD fail to satisfy the property illustrated in Theorem 2. By equipping class centers as landmark points, KDA further reduces the transfer loss compared with KD, which confirms the effectiveness of transferring the kernel matrix with appropriate landmark points.

Figure 4: Relationship between the difference of the partial kernel matrices (i.e., our estimation) and the difference of the whole kernel matrices (i.e., ground-truth). Figure 5: Smallest eigenvalues of $W_s$ (i.e., student) and $W_t$ (i.e., teacher).

Finally, we demonstrate that the difference between the partial kernel matrices is closely correlated with that between the whole kernel matrices, as suggested by Corollary 2. Features from the layer before the last FC layer are adopted for evaluation. Fig. 4 illustrates how the ground-truth transfer loss and the estimated transfer loss change during training (values are rescaled for better visualization). Evidently, minimizing the difference between the partial kernel matrices can effectively reduce the gap between the original kernel matrices, which is consistent with our theoretical analysis. Note that Corollary 2 assumes that the smallest eigenvalues of $W_s$ and $W_t$ are larger than $\gamma$. Since we adopt class centers as landmark points, $W_s$ and $W_t$ can be full-rank matrices. We show their smallest eigenvalues in Fig. 5. It is obvious that the smallest eigenvalue is larger than 10, which is consistent with our assumption.

Effect of Matrix W

Corollary 1 implies a variant that uses the standard Nyström method, including the matrix $W^{\dagger}$, for transferring, while our proposal ignores $W$ and optimizes only the difference between $C_s$ and $C_t$ for efficiency. We compare our method to the variant with $W$ in Fig. 6, where features before the last FC layer are adopted for transfer. During the experiment, we find that provides better performance. We can observe that our method without $W$ has a performance similar to the one with $W$. It further supports our analysis in Corollary 2.

Figure 6: Comparison of our method and the variant with matrix $W$ as in Corollary 1. Figure 7: Comparison of different numbers of landmark points (i.e., centers) per class.

Effect of Centers per Class

When assigning landmark points, we set the number to be that of classes, which avoids clustering in the implementation. It also constrains that each class has a single landmark point. The number of landmark points for each class can be easily increased by clustering. We compare the variant with two centers per class in Fig. 7, where features before the last FC layer are adopted for comparison.

We can observe that using more centers for each class does not improve the performance significantly. This may be because the feature space is optimized with the cross-entropy loss. As illustrated in Qian et al. (2019), the cross-entropy loss pushes all examples from the same class towards a single center. Therefore, assigning one landmark point to each class is an appropriate setting, which also simplifies the algorithm and improves the efficiency. We use one center per class in the following experiments.

Effect of Different Layers

Now, we illustrate the performance of transferring the kernel matrix from different layers. ResNet consists of a sequence of convolutional layer groups, and we compare the performance of the last three groups (i.e., “conv3_x”, “conv4_x” and “conv5_x”) and the layer after the last FC layer. The definition of the groups can be found in Table 1 of He et al. (2016). For each group, we adopt its last layer for transfer. Before transfer, we add a pooling layer to reduce the dimension of the feature map. Note that after pooling, the last layer of “conv5_x” becomes the layer before the last FC layer.

S      conv3_x  conv4_x  Before FC  After FC  T
77.2   77.7     78.1     79.6       79.4      80.3
Table 1: Comparison of accuracy (%) on CIFAR-100 when transferring the kernel matrix from different layers.
Teacher    Student        S     | Before Last FC                | After FC            | Combo    | T
                                | AT        RKD       KDA       | KD        KDA       | KDA      |
ResNet34   ResNet18       77.2  | 78.1±0.3  78.3±0.2  79.6±0.1  | 78.8±0.2  79.4±0.1  | 79.7±0.1 | 80.3
ResNet34   ResNet18-0.5   73.5  | 75.0±0.1  74.3±0.3  75.6±0.3  | 74.8±0.2  75.3±0.2  | 75.9±0.2 | 80.3
ResNet34   ShuffleNet     71.7  | 73.0±0.1  72.5±0.1  74.0±0.3  | 72.9±0.1  73.6±0.1  | 74.2±0.3 | 80.3
Table 2: Comparison of accuracy (%) on CIFAR-100.
Teacher    Student        S     | Before Last FC                | After FC            | Combo    | T
                                | AT        RKD       KDA       | KD        KDA       | KDA      |
ResNet34   ResNet18       63.4  | 64.4±0.1  63.9±0.2  65.2±0.3  | 64.9±0.2  65.4±0.1  | 65.5±0.1 | 66.6
ResNet34   ResNet18-0.5   60.3  | 61.0±0.1  60.6±0.2  61.7±0.3  | 61.3±0.1  61.9±0.2  | 62.2±0.2 | 66.6
ResNet34   ShuffleNet     60.6  | 61.3±0.1  61.2±0.2  62.0±0.2  | 61.5±0.1  62.3±0.2  | 62.4±0.1 | 66.6
Table 3: Comparison of accuracy (%) on Tiny-ImageNet.

Table 1 shows the performance of transferring information from different layers. First, transferring information from the teacher always improves the performance of the student, which demonstrates the effectiveness of knowledge distillation. Besides, the information from the later layers is more helpful for training the student, because later layers contain more semantic information that is closely related to the target task. We will focus on the layers before and after the FC layer in the remaining experiments.

CIFAR-100

In this subsection, we compare the proposed KDA method to the other methods on CIFAR-100. The results can be found in Table 2. Each method in the comparison is run multiple times and the average results with standard deviation are reported. First, all methods with knowledge distillation outperform training the student network without a teacher, which shows that knowledge distillation can improve the performance of student models significantly.

By transferring the kernel matrix before the last FC layer, KDA surpasses RKD by 1.3% when ResNet-34 is the teacher and ResNet-18 is the student. The observation is consistent with the comparison in the ablation study, and it confirms that the difference between partial kernel matrices is an appropriate objective for similarity matrix transfer. Moreover, KDA shows a significant improvement for different student networks, which implies that the proposed method is applicable to different teacher-student configurations.

Furthermore, when transferring the kernel matrix after the last FC layer, both KD and KDA demonstrate good performance for the student model. This is due to the fact that both methods transfer the kernel matrix with landmark points, which is efficient for optimization. Besides, KDA further improves the performance compared to KD. The superior performance of KDA demonstrates the effectiveness of the proposed strategy for generating landmark points.

Finally, compared to the benchmark methods, KDA can distill the information from different layers with the unified formulation in Eqn. 4. The proposed method provides a systematic perspective for understanding a family of knowledge distillation methods that aim to transfer kernel matrices. When the kernel matrices before and after the FC layer are transferred simultaneously, the performance of KDA can be slightly improved further, as illustrated by “Combo” in Table 2.

Tiny-ImageNet

Then, we compare different methods on the Tiny-ImageNet data set1. There are 200 classes in this data set and each class provides 500 images for training and 50 for validation. We report the performance on the validation set. Since the size of images in Tiny-ImageNet is 64x64, which is larger than that of CIFAR-100, we replace the random crop augmentation with a more aggressive version as in He et al. (2016), and keep the other settings the same.

Table 3 summarizes the comparison. We can observe results similar to those on CIFAR-100. First, all methods with information from a teacher model surpass the student model trained without a teacher by a large margin. Second, compared with the baseline methods that try to transfer the similarity matrix, KDA outperforms them regardless of the layer in which the transfer happens. Finally, KDA with information combined from the two layers can further improve the performance, which implies that the information from different layers can be complementary. Note that CIFAR-100 and Tiny-ImageNet have quite different image resolutions and numbers of classes, which demonstrates the applicability of the proposed algorithm in various real-world applications.

Conclusion

In this work, we investigate the knowledge distillation problem from the perspective of the kernel matrix. Since the kernel matrix is closely related to the performance on the target task, we propose to transfer the kernel matrix from the teacher model to the student model. Considering that the number of terms in the kernel matrix is quadratic in the number of training examples, we adopt the Nyström method and propose a strategy for obtaining the landmark points for efficient optimization. The proposed method not only improves the efficiency of transferring the kernel matrix, but also comes with a theoretical guarantee for its efficacy. Experiments on benchmark data sets verify the effectiveness of the proposed algorithm. Besides the kernel matrix, there are many existing methods that transfer other kinds of information from the teacher to the student. Combining the proposed KDA loss with other knowledge for distillation will be our future work.

Appendix A Theoretical Analysis

Proof of Corollary 1

Proof.

Proof of Corollary 2

Proof.

Then, we want to show that . Note that due to the fact that is a partial matrix from . We can prove instead.

Let

where and . Let , where . Then we have

So

(5)

According to the definition of , we have

Since is a doubly stochastic matrix, we can show that

It can be proved by contradiction. If the optimal solution has a larger result than the R.H.S., we can denote the first column index of the non-zero off-diagonal element as (i.e., ), and the corresponding row index as (i.e., ). Let and we have

It shows that the assignment with the diagonal element can achieve a larger result than the optimal assignment, which contradicts the assumption.

With the optimal results from the assignment of , we obtain that

For each term, it is easy to show that

Finally, we have

Proof of Lemma 1

Proof.

For an arbitrary pair, we have

Footnotes

  1. https://tiny-imagenet.herokuapp.com

References

  1. Chen, G.; Choi, W.; Yu, X.; Han, T. X.; and Chandraker, M. 2017. Learning Efficient Object Detection Models with Knowledge Distillation. In Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.; Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds., NeurIPS, 742–751.
  2. Chen, Y.; Wang, N.; and Zhang, Z. 2018. DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer. In AAAI, 2852–2859.
  3. Courbariaux, M.; Bengio, Y.; and David, J. 2015. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. In NeurIPS, 3123–3131.
  4. Drineas, P.; and Mahoney, M. W. 2005. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. J. Mach. Learn. Res. 6: 2153–2175.
  5. He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In CVPR, 770–778.
  6. Hinton, G. E.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531.
  7. Krizhevsky, A. 2009. Learning Multiple Layers of Features from Tiny Images .
  8. Kumar, S.; Mohri, M.; and Talwalkar, A. 2012. Sampling Methods for the Nyström Method. J. Mach. Learn. Res. 13: 981–1006.
  9. Liu, Y.; Cao, J.; Li, B.; Yuan, C.; Hu, W.; Li, Y.; and Duan, Y. 2019. Knowledge Distillation via Instance Relationship Graph. In CVPR, 7096–7104.
  10. Ma, N.; Zhang, X.; Zheng, H.; and Sun, J. 2018. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Ferrari, V.; Hebert, M.; Sminchisescu, C.; and Weiss, Y., eds., ECCV, volume 11218 of Lecture Notes in Computer Science, 122–138. Springer.
  11. Park, W.; Kim, D.; Lu, Y.; and Cho, M. 2019. Relational Knowledge Distillation. In CVPR, 3967–3976.
  12. Qian, Q.; Shang, L.; Sun, B.; Hu, J.; Li, H.; and Jin, R. 2019. SoftTriple Loss: Deep Metric Learning Without Triplet Sampling. In ICCV, 6449–6457. IEEE.
  13. Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; and Bengio, Y. 2015. FitNets: Hints for Thin Deep Nets. In ICLR.
  14. Sandler, M.; Howard, A. G.; Zhu, M.; Zhmoginov, A.; and Chen, L. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In CVPR, 4510–4520. doi:10.1109/CVPR.2018.00474.
  15. Tian, Y.; Krishnan, D.; and Isola, P. 2020. Contrastive Representation Distillation. In ICLR. OpenReview.net.
  16. Williams, C. K. I.; and Seeger, M. W. 2000. Using the Nyström Method to Speed Up Kernel Machines. In NeurIPS, 682–688.
  17. Yim, J.; Joo, D.; Bae, J.; and Kim, J. 2017. A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning. In CVPR, 7130–7138.
  18. Zagoruyko, S.; and Komodakis, N. 2017. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. In ICLR.
  19. Zhang, K.; Tsang, I. W.; and Kwok, J. T. 2008. Improved Nyström low-rank approximation and error analysis. In ICML, 1232–1239.