Improving Knowledge Distillation with Supporting Adversarial Samples
Many recent works on knowledge distillation have provided ways to transfer the knowledge of a trained network to improve the learning process of a new one, but finding a good technique for knowledge distillation is still an open problem. In this paper, we provide a new perspective based on the decision boundary, which is one of the most important components of a classifier. The generalization performance of a classifier is closely related to the adequacy of its decision boundaries, so a good classifier bears good decision boundaries. Therefore, transferring the boundaries directly is a promising approach to knowledge distillation. To realize this goal, we utilize an adversarial attack to discover samples supporting the decision boundaries. Based on this idea, the proposed algorithm trains a student classifier on the adversarial samples supporting the decision boundaries in order to transfer more accurate information about them. In addition, two metrics are proposed to evaluate the similarity between decision boundaries. Experiments show that the proposed method indeed improves knowledge distillation and produces decision boundaries that are much more similar to those of the teacher classifier.
1.1 Motivations and Objectives
Knowledge distillation is a method to enhance the training of a new network based on an existing, already trained network. In a teacher-student framework, the existing network is considered the teacher and the new network the student. Hinton et al. (2015), pioneers in knowledge distillation, proposed a loss minimizing the cross-entropy between the outputs of the student and the teacher, which is referred to as the knowledge distillation loss (KD loss). With the KD loss, the student network is trained to be a better classifier than a network trained without knowledge distillation. Although the goals of knowledge distillation are diverse, recent studies (Yim et al., 2017; Chen et al., 2017a) focus on improving a small student network using a large network as a teacher. These studies aim to create a network with the speed of a small network and the performance of a large one. This paper also focuses on knowledge distillation for enhancing the performance of a small network using a large network.
Many recent studies focus on manipulating the KD loss for various purposes. Romero et al. (2014) and Zagoruyko & Komodakis (2016) proposed new distillation losses that transfer the hidden layer responses of the network and used them together with the KD loss. Chen et al. (2017a) and Wang & Lan (2017) extended the KD loss to various applications. In contrast to these existing approaches, which concentrate on how to manipulate various parts of a network in order to improve the effect of knowledge distillation, we investigate knowledge distillation from another important perspective for a classifier: the decision boundary. Research on the decision boundary has been one of the most important topics in machine learning because the decision boundary of a trained classifier plays an important role in its generalization performance (Cortes & Vapnik, 1995; Bishop, 2006). Thus, it is crucial to train a classifier with good decision boundaries that are close to the ideal ones.
In this paper, to obtain a student classifier that has decision boundaries similar to those of the teacher in knowledge distillation, we utilize an adversarial attack (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016). An adversarial attack is a technique to tamper with the result of a classifier by adding a small perturbation to an input image. Although an adversarial attack is not particularly aimed at finding a decision boundary, the two are closely related (Cao & Gong, 2017). An adversarial attack tries to find a small modification that can change the answer, i.e., it tries to move the input sample beyond a nearby decision boundary. Inspired by this fact, we propose a method to transfer the decision boundaries of a classifier through adversarial samples.
Although an adversarial sample is related to a decision boundary, we have to be more certain about the relationship to utilize it in knowledge distillation. To get adversarial samples beneficial to knowledge distillation, we modify an attack scheme to search for an adversarial sample supporting a decision boundary. The resulting sample is referred to as a supporting adversarial sample (SAS) in this paper. A new loss function using SASs is suggested for knowledge distillation that transfers decision boundaries to a student classifier. In order to verify whether the proposed method actually transfers the decision boundaries, we also propose two similarity metrics that compare the decision boundaries of two classifiers and use these metrics to examine the decision boundaries of a teacher and a student.
The proposed method is verified by various experiments. First, we analyze the distance from a sample to a decision boundary and verify that an SAS is more advantageous for transferring the decision boundary than the original training sample. After this, we show that the use of SASs can improve the knowledge distillation scheme of Hinton et al. (2015) in an image classification problem, which confirms the efficacy of the proposed decision boundary transfer. Here, we also evaluate the proposed similarity metrics to verify the relationship between the performance improvement and the similarity of decision boundaries. Finally, we perform further experiments to examine the generalization performance of the proposed method. The results indicate that the proposed method generalizes better and, accordingly, can provide good results with fewer training samples.
The rest of the paper is organized as follows. In Section 1.2, we introduce the existing studies on knowledge distillation and adversarial attacks. The advantages of an SAS in knowledge distillation and the adversarial attack to create an SAS are described in Section 2.1. Section 2.2 presents the proposed distillation using SASs, and similarity metrics between two decision boundaries are proposed in Section 2.3. Section 3 provides experimental results and discussions on the effects of SASs in knowledge distillation. Section 4 concludes the paper.
1.2 Related Works
Many studies have been conducted on knowledge distillation since Hinton et al. (2015) proposed the first knowledge distillation method based on class probability. Romero et al. (2014) used the hidden layer response of a teacher network as a hint for a student network to improve knowledge distillation. Zagoruyko & Komodakis (2016) found the area of activated neurons in a teacher network and transferred the activated area to a student network. In the case of Yim et al. (2017), the channel response of a teacher network was transferred for knowledge distillation. Xu et al. (2017) proposed a knowledge distillation method based on the framework of generative adversarial networks (Goodfellow et al., 2014). Some studies (Chen et al., 2017a; Wang & Lan, 2017; Chen et al., 2017b) extended knowledge distillation to computer vision applications. Knowledge distillation has been studied in various directions, but most of the studies focus on manipulating the hidden layer response of a network or changing the loss appropriately for the purpose. As far as we know, the proposed method is the first to improve knowledge distillation by changing the samples used for training.
In the meantime, Szegedy et al. (2014) found that a classifier based on a neural network could be fooled easily by a small noise. This work gave rise to a new research topic in neural networks called the adversarial attack, which is about finding a noise that can deceive a neural network. Moosavi-Dezfooli et al. (2016) proposed a method based on a linear approximation of a classifier to find the closest adversarial example. Goodfellow et al. (2015) proposed adversarial training, which trains a classifier with adversarial samples in order to make the network robust to an adversarial attack. Cao & Gong (2017) found that an adversarial sample is located near the decision boundary and used this property to defend against adversarial attacks. There have also been works that connect adversarial attacks to other research topics: Papernot et al. (2016) found that a network trained by knowledge distillation is robust to adversarial attacks and used distillation as a defense. Our study is closely related to these approaches, except that we take the opposite direction: we use adversarial attacks to find decision boundaries in order to enhance knowledge distillation, which is a novel approach that has not been attempted yet.
2 Knowledge distillation with adversarial samples
2.1 Adversarial attack for knowledge distillation
An adversarial attack changes a sample in one class into an adversarial sample in another class for a given classifier. In our paper, the given sample for the adversarial attack is referred to as the base sample. In this section, we present an idea to utilize the adversarial attack in knowledge distillation. The idea is to use adversarial samples near decision boundaries to transfer the knowledge about the boundaries. In the following sections, we first explain the definition of a supporting adversarial sample (SAS) and its benefits in knowledge distillation, and then we provide an iterative procedure to find one.
2.1.1 Benefits of SASs in knowledge distillation
It is well known that the generalization performance of a classifier highly depends on how well the classifier learns the true decision boundaries between the actual class distributions (Cortes & Vapnik, 1995; Bishop, 2006). This indicates that if a classifier yields good performance, then it probably has decision boundaries that are close to the true ones, and boundaries that are far from them otherwise. We can analyze knowledge distillation in this respect: a classifier network may not yield good performance if it is trained with binary (one-hot) labels and the training samples are not sufficient. Knowledge distillation approaches attempt to resolve this issue with the help of another existing, high-performing network, i.e., a teacher, by transferring its knowledge to the classifier we are to train, i.e., the student (Hinton et al., 2015; He et al., 2015; Szegedy et al., 2015). If we train the student without knowledge distillation, then its performance may not be as good as that of the teacher, which indicates that a decision boundary of the teacher is likely better than that of the student, as shown in Figure 1. On the other hand, knowledge distillation can enhance the performance of the student, which suggests that the decision boundaries of the student are improved by knowledge distillation.
However, existing works do not explicitly address whether the information about the decision boundaries is directly transferred by knowledge distillation. Motivated by this, we instead utilize adversarial samples obtained from the training samples to transfer the decision boundaries more accurately. A supporting adversarial sample (SAS) is defined in this respect: it is an adversarial sample that is positioned just beyond a decision boundary. Since SASs are labeled samples near decision boundaries, as depicted in the second picture of Figure 1, they contain information about the decision boundaries. Hence, using SASs in knowledge distillation can provide a more accurate transfer of decision boundary information. An SAS in our work is obtained by a gradient descent method based on the classification score functions, and it contains information about both the distance and the path direction from the base sample to the decision boundary. In conclusion, SASs can help improve the decision boundaries, and hence the generalization performance, of a student classifier in knowledge distillation. In addition, SASs can be utilized to measure the similarity between two decision boundaries, which is useful in analyzing and evaluating the performance of knowledge distillation.
2.1.2 Iterative Scheme to find SAS
For a given sample, as shown in Figure 2, there exist many SASs over all classes except the base class, i.e., the class containing the base sample. To find an SAS, we define a loss function based on the classification scores produced by a classifier. Then, we search for an SAS in the gradient direction of the loss function, based on the method of Moosavi-Dezfooli et al. (2016) with a modified update rule.
Given a sample vector $x_0$ in a base class $b$, its corresponding adversarial samples are calculated by an iterative procedure. First, a sample is initialized to $x_0$, and then it is iteratively updated toward a target class $k$, $k \neq b$. Here, the adversarial sample after the $i$-th iteration is denoted by $x_i^k$. Assume that the classifier $f$ produces classification scores for all classes, where the class of a sample is determined by the class having the maximum score. Then, let $f_b(x)$ and $f_k(x)$ be the classification scores for the base class and the target class, respectively.
The goal of the adversarial attack is to decrease the score for the base class while increasing the score for the target class. To this end, the loss function for the attack to the target class $k$ is defined by
$$\mathcal{L}_k(x) = f_b(x) - f_k(x). \tag{1}$$
This loss becomes zero at a point on the decision boundary, positive at a point within the base class, and negative at an adversarial point within the target class. To find an adversarial sample, we move the sample in the direction that minimizes the loss by the iterative scheme in (2), until the loss becomes negative:
$$x_{i+1}^k = x_i^k - \eta \, \frac{\mathcal{L}_k(x_i^k) + \epsilon}{\lVert \nabla \mathcal{L}_k(x_i^k) \rVert_2^2} \, \nabla \mathcal{L}_k(x_i^k), \tag{2}$$
where $\nabla \mathcal{L}_k(x)$ refers to the gradient of $\mathcal{L}_k(x)$. The step size of Moosavi-Dezfooli et al. (2016) becomes abnormally large when the gradient is small. To prevent this, we introduce a learning rate $\eta$, which is used together with the loss value to control the step size. Note that the loss is large at the initial point and small near the decision boundary. In addition, $\epsilon$ drives the sample to cross the decision boundary, as shown in the following.
If we derive the first-order Taylor series approximation of $\mathcal{L}_k(x_{i+1}^k)$ at $x_i^k$ and substitute (2) to remove $x_{i+1}^k$, then we have
$$\mathcal{L}_k(x_{i+1}^k) \approx \mathcal{L}_k(x_i^k) + \nabla \mathcal{L}_k(x_i^k)^\top (x_{i+1}^k - x_i^k) = (1 - \eta)\, \mathcal{L}_k(x_i^k) - \eta \epsilon. \tag{3}$$
Let us assume that we have chosen a small enough $\eta$ so that $0 < 1 - \eta < 1$. Then, if the sample approaches a decision boundary, $\mathcal{L}_k(x_i^k)$ becomes small. In this case, without the last term in (3), which exists due to the introduction of $\epsilon$ in (2), the loss converges to zero but does not become negative, which means that the sample does not cross the decision boundary. By introducing $\epsilon$ in (2), the loss can become negative due to the second term in (3).
To lead the adversarial sample to a location near the decision boundary, we establish stop conditions. The iteration stops if one of the following conditions occurs:
(a) $\mathcal{L}_k(x_{i+1}^k) < 0$,
(b) $\arg\max_c f_c(x_{i+1}^k) \notin \{b, k\}$,
(c) $i + 1 \geq I$,
where $I$ is a predefined number of maximum iterations. Condition (a) means that the adversarial sample crosses the decision boundary. If (a) is satisfied, then the attack is successful and the resulting sample is regarded as an SAS. On the other hand, conditions (b) and (c) are failure cases, and we discard the sample if one of them is satisfied. Condition (b) means that the sample has stepped into a class that is not the target. This case occurs when there exists a non-target class between the base class and the target class. Condition (c) happens if the decision boundary is too far from the base sample.
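The iterative search and its three stop conditions can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a linear score function $f(x) = Wx + c$ (so that the gradient of the loss is constant and needs no automatic differentiation), and the function name `find_sas`, its parameter names, and the default values of `eta`, `eps`, and `max_iter` are hypothetical. Checking the wrong-class condition before the success condition is also a choice made here.

```python
import numpy as np

def find_sas(x0, W, c, base, target, eta=0.3, eps=0.5, max_iter=10):
    """Search for a supporting adversarial sample of x0 toward `target`.

    Assumes a linear classifier with scores f(x) = W @ x + c (illustrative
    only). Returns (sample or None, status string, iterations used).
    """
    x = np.asarray(x0, dtype=float).copy()
    # For linear scores, the gradient of L = f_base - f_target is constant.
    grad = W[base] - W[target]
    grad_sq = float(grad @ grad)
    for i in range(max_iter):
        scores = W @ x + c
        loss = scores[base] - scores[target]
        # Update rule: step length scaled by eta * (loss + eps) / ||grad||^2.
        x = x - eta * (loss + eps) / grad_sq * grad
        scores = W @ x + c
        new_loss = scores[base] - scores[target]
        # Failure: the sample stepped into a class that is neither base nor target.
        if int(np.argmax(scores)) not in (base, target):
            return None, "wrong_class", i + 1
        # Success: the loss turned negative, so the boundary was crossed.
        if new_loss < 0:
            return x, "success", i + 1
    # Failure: the boundary was not reached within the iteration budget.
    return None, "max_iter", max_iter
```

For a linear score function the first-order approximation of the loss after one step is exact, so the loss shrinks geometrically and the offset `eps` is what finally pushes it below zero. For a deep network, `grad` would have to be recomputed at every iteration.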
2.2 Knowledge distillation using SAS
In the previous section, we showed that SASs are beneficial for transferring the decision boundaries of a teacher to a student classifier. In this section, we present a method to enhance knowledge distillation by transferring decision boundaries more precisely based on SASs.
2.2.1 Loss function for SAS distillation
From a training batch, our distillation scheme uses a set of base sample pairs $\{(x_n, c_n)\}_{n=1}^{N}$, where $c_n$ denotes the class index of $x_n$. A set of SASs is denoted by $\{x_n^k\}$, where $x_n^k$ is the SAS obtained from $x_n$ for the target class $k$. Let the teacher and the student classifiers be denoted by $f_t$ and $f_s$, respectively. For a sample $x$, the class probability vectors are denoted by $P_t(x) = \sigma(f_t(x))$ and $P_s(x) = \sigma(f_s(x))$, where $\sigma$ refers to the softmax function. The desired class label for $x_n$ is denoted by a one-hot label vector $y_n$, of which each element is either one for the ground-truth class or zero for the other classes. The proposed loss function to train the student classifier combines three losses: a classification loss $\mathcal{L}_{cls}$, the knowledge distillation loss $\mathcal{L}_{KD}$ of Hinton et al. (2015), and an adversarial loss $\mathcal{L}_{adv}$:
$$\mathcal{L}(n) = \mathcal{L}_{cls}(n) + \alpha \, \mathcal{L}_{KD}(n) + \beta \sum_{k} p_n^k \, \mathcal{L}_{adv}(n, k). \tag{4}$$
If we define the entropy function as $J(a, b) = -a^\top \log b$, where $a$ and $b$ are column vectors and $\log$ is the element-wise logarithm, each loss is defined by
$$\mathcal{L}_{cls}(n) = J\big(y_n, P_s(x_n)\big),$$
$$\mathcal{L}_{KD}(n) = J\big(\sigma(f_t(x_n)/T), \sigma(f_s(x_n)/T)\big),$$
$$\mathcal{L}_{adv}(n, k) = J\big(\sigma(f_t(x_n^k)/T), \sigma(f_s(x_n^k)/T)\big).$$
Note that $T$, the 'temperature', is a design parameter to prevent the loss from becoming too large (Hinton et al., 2015). $p_n^k$ in the third term of (4) is the probability of class $k$ being selected as the target class, which is introduced to sample target classes stochastically during training. The definition of $p_n^k$ can be found in (12) of Section 2.2.2. Here, $\alpha$ and $\beta$ are decaying functions of the epoch index, relative to the maximum epoch of the training, that control the relative weights of the losses. The KD term follows Hinton et al. (2015), and the decay has been determined empirically based on a procedure similar to that of Romero et al. (2014).
Note that $\mathcal{L}_{cls}$ transfers direct answers (one-hot labels) for the training samples, whereas $\mathcal{L}_{KD}$ transfers probabilistic labels (Hinton et al., 2015). In contrast, the adversarial loss $\mathcal{L}_{adv}$ is introduced to transfer the information about the decision boundary directly.
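As a rough illustration of how the three terms fit together for one training sample, the sketch below combines a classification loss, a temperature-softened KD loss, and an adversarial loss evaluated at an SAS. It is written under several assumptions: the names `softmax`, `cross_entropy`, and `distillation_loss` are hypothetical, the weights `alpha` and `beta` are passed in as plain numbers because their decay schedules are not specified here, and the sum over target classes is replaced by a single sampled target.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax of logits z at temperature T (numerically stabilised)."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(a, b, tiny=1e-12):
    """J(a, b) = -a^T log b, with a small constant to avoid log(0)."""
    return float(-np.sum(np.asarray(a) * np.log(np.asarray(b) + tiny)))

def distillation_loss(y_onehot, t_logits, s_logits,
                      t_logits_sas, s_logits_sas,
                      alpha, beta, T=3.0):
    """Loss for one sample: classification + alpha * KD + beta * adversarial.

    t_logits / s_logits: teacher / student scores at the base sample.
    t_logits_sas / s_logits_sas: scores at one SAS of that base sample
    (assumption: one sampled target class stands in for the weighted sum).
    """
    l_cls = cross_entropy(y_onehot, softmax(s_logits))
    l_kd = cross_entropy(softmax(t_logits, T), softmax(s_logits, T))
    l_adv = cross_entropy(softmax(t_logits_sas, T), softmax(s_logits_sas, T))
    return l_cls + alpha * l_kd + beta * l_adv
```

In a real training loop these terms would be computed with an automatic differentiation framework on batches; NumPy is used here only to keep the arithmetic visible.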
2.2.2 Miscellaneous issues on using SASs
Base sample selection for adversarial loss. As shown in (4), the adversarial loss can be used for all training samples and all target classes. For the training batch size $N$, this requires $N$ SAS calculations, assuming that we sample a single target class for each sample. However, this can be costly because finding an adversarial sample requires additional iterations. To reduce the computation, we select $N_{adv}$ base samples out of the $N$ training samples according to the rule explained below and apply the adversarial loss only to the selected samples.
The base samples for generating the adversarial samples are selected from the training batch $\{(x_n, c_n)\}_{n=1}^{N}$. A training sample pair $(x_n, c_n)$ is selected as a base sample for an adversarial attack if the class $c_n$ has the highest probability for both the teacher and the student classifiers. That is, considering the teacher and student probability vectors $P_t(x_n)$ and $P_s(x_n)$, the base sample set $\mathcal{C}$ is determined by
$$\mathcal{C} = \big\{ (x_n, c_n) \;\big|\; \arg\max_k P_t^k(x_n) = c_n = \arg\max_k P_s^k(x_n) \big\},$$
where $P^k(\cdot)$ is the $k$-th element of $P(\cdot)$. If the size of $\mathcal{C}$ is smaller than a predefined $N_{adv}$, all the samples in $\mathcal{C}$ are used for the adversarial loss. If the size of $\mathcal{C}$ is larger than $N_{adv}$, we select the $N_{adv}$ samples that have the largest distance $d_n$ between $P_t(x_n)$ and $P_s(x_n)$.
A large $d_n$ means that the probability vector of the teacher and that of the student differ greatly at the base sample position $x_n$. Since the reduction of $d_n$ matches the goal of knowledge distillation, it is reasonable to choose base samples with large $d_n$.
Target class sampling. An SAS can target any class except the base class. In the learning process, one of the classes is sampled as the target class and an adversarial attack is conducted to make an SAS of the selected target class. For the base sample $x_n$, the probability $p_n^k$ of sampling the class $k$ is defined based on the class probability of the teacher, $P_t^k(x_n)$, as follows:
$$p_n^k = \begin{cases} \dfrac{P_t^k(x_n)}{1 - \max_j P_t^j(x_n)} & \text{if } k \neq \arg\max_j P_t^j(x_n), \\ 0 & \text{otherwise.} \end{cases} \tag{12}$$
This is motivated by the observation that a target class with higher probability is more influential to the given base sample.
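The two rules of this section, choosing base samples and sampling a target class, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the distance between the teacher and student probability vectors is taken to be Euclidean (the exact distance is not fixed by the surrounding text), and the target probability is assumed to be the teacher's class probability renormalised over the non-base classes.

```python
import numpy as np

def select_base_samples(p_teacher, p_student, labels, n_adv):
    """Return indices of the base samples used for the adversarial loss."""
    p_teacher = np.asarray(p_teacher, dtype=float)
    p_student = np.asarray(p_student, dtype=float)
    labels = np.asarray(labels)
    # Candidates: both networks rank the labelled class highest.
    agree = (p_teacher.argmax(axis=1) == labels) & (p_student.argmax(axis=1) == labels)
    candidates = np.flatnonzero(agree)
    if len(candidates) <= n_adv:
        return candidates
    # ASSUMPTION: Euclidean distance between the two probability vectors.
    dist = np.linalg.norm(p_teacher[candidates] - p_student[candidates], axis=1)
    # Keep the n_adv candidates where teacher and student disagree most.
    order = np.argsort(-dist, kind="stable")[:n_adv]
    return candidates[order]

def target_class_probs(p_teacher_row, base):
    """Teacher probabilities renormalised over all classes except `base`."""
    p = np.asarray(p_teacher_row, dtype=float).copy()
    p[base] = 0.0
    total = p.sum()
    if total <= 0.0:
        # Degenerate case (teacher puts all mass on the base class):
        # fall back to a uniform choice among the other classes.
        p = np.ones_like(p)
        p[base] = 0.0
        total = p.sum()
    return p / total

def sample_target(p_teacher_row, base, rng):
    """Draw one target class for the attack on a base sample."""
    probs = target_class_probs(p_teacher_row, base)
    return int(rng.choice(len(probs), p=probs))
```

A typical call would pass `rng = np.random.default_rng()` so that a fresh target class is drawn for each selected base sample at every training step.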
2.3 Metrics for similarity of decision boundaries in two networks
To verify whether the proposed method actually transfers decision boundaries in knowledge distillation, we need some metrics. Here, we propose two metrics based on SASs to measure the similarity between the decision boundaries of two classifiers (i.e., teacher and student classifiers in knowledge distillation). These metrics are used to evaluate the performance of knowledge distillation or to analyze the benefits of SASs in knowledge distillation.
Given the $n$-th base sample $x_n$, the perturbation vector to attack the target class $k$ for the teacher classifier is obtained by
$$\bar{x}_t^{n,k} = x_{n,t}^k - x_n,$$
where $x_{n,t}^k$ is the SAS of $x_n$ for the target class $k$ computed on the teacher classifier.
Likewise, $\bar{x}_s^{n,k}$ denotes the perturbation vector for the student classifier. Using a set of $M$ perturbation vector pairs $\{(\bar{x}_t^{n,k}, \bar{x}_s^{n,k})\}$, the similarity between the two decision boundaries is defined by two metrics: the Magnitude Similarity (MagSim) in (14) and the Angle Similarity (AngSim) in (15):
$$\mathrm{MagSim} = \frac{1}{M} \sum_{n,k} \frac{\min\big(\lVert \bar{x}_t^{n,k} \rVert_2, \lVert \bar{x}_s^{n,k} \rVert_2\big)}{\max\big(\lVert \bar{x}_t^{n,k} \rVert_2, \lVert \bar{x}_s^{n,k} \rVert_2\big)}, \tag{14}$$
$$\mathrm{AngSim} = \frac{1}{M} \sum_{n,k} \frac{\langle \bar{x}_t^{n,k}, \bar{x}_s^{n,k} \rangle}{\lVert \bar{x}_t^{n,k} \rVert_2 \, \lVert \bar{x}_s^{n,k} \rVert_2}. \tag{15}$$
These two metrics have values in the range of [0,1], and higher values represent more similar decision boundaries.
Note that MagSim represents the similarity with respect to the distance from the base sample to the decision boundary, and AngSim represents the similarity with respect to the path direction from the base sample to the boundary. Since the path is obtained from the gradient of the classification score function, AngSim reflects the similarity with respect to the surface shape of the class score function, which affects the shape of the decision boundary. Hence, we can say that the decision boundaries of two classifiers have become more similar if either of the metrics increases.
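A small sketch may make the two metrics concrete. It assumes that MagSim is the average ratio of the smaller to the larger perturbation norm and that AngSim is the average cosine similarity between paired perturbations; both readings, as well as the function names `mag_sim` and `ang_sim`, are assumptions of this illustration. Each row of the inputs holds one flattened perturbation vector, and all perturbations are assumed to be non-zero.

```python
import numpy as np

def mag_sim(pert_teacher, pert_student):
    """Average of min(norm) / max(norm) over paired perturbation vectors."""
    nt = np.linalg.norm(np.asarray(pert_teacher, dtype=float), axis=1)
    ns = np.linalg.norm(np.asarray(pert_student, dtype=float), axis=1)
    return float(np.mean(np.minimum(nt, ns) / np.maximum(nt, ns)))

def ang_sim(pert_teacher, pert_student):
    """Average cosine similarity over paired perturbation vectors."""
    pt = np.asarray(pert_teacher, dtype=float)
    ps = np.asarray(pert_student, dtype=float)
    nt = np.linalg.norm(pt, axis=1)
    ns = np.linalg.norm(ps, axis=1)
    cos = np.sum(pt * ps, axis=1) / (nt * ns)
    return float(np.mean(cos))
```

With perturbations that point the same way but differ in length, `ang_sim` stays at 1 while `mag_sim` drops, which matches the distinction drawn above between direction and distance to the boundary.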
Validation of two metrics. The validity of the metrics is verified by evaluating the similarities between teacher and student classifiers for a simple classification problem in a two-dimensional space. We use two-layer neural networks for all the classifiers and use one teacher and three student classifiers. The student classifiers are trained based on the KD loss from the teacher classifier with 4, 32, and 120 training samples. Hence, they have different decision boundaries from each other. Since the dimension is low, the decision boundary is found by a full search. As shown in Figure 3, the metrics show a tendency similar to the true similarity between two decision boundaries. The true similarity is obtained from the true distance, which is calculated directly with 100 equally distributed points on the decision boundaries. Since a decision boundary cannot be easily obtained in a high-dimensional space, the metrics can be effectively used to evaluate distillation performance from the viewpoint of decision boundary transfer.
3 Experiments
Through experiments, we show that the proposed method enhances the performance of knowledge distillation. In addition, using the proposed metrics, the effect of knowledge distillation is also evaluated from the viewpoint of decision boundary transfer. In order to increase the reliability of the experiments, we performed the training several times under the same conditions and report the average and standard deviation of the results. Experiments were performed on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009) using a variety of residual networks (He et al., 2015).
3.1 Relationship between class probability and decision boundary
In the case of a linear classifier, the decision boundary is a hyperplane, and so the class probability of a sample has a one-to-one correspondence to the distance from the sample to the decision boundary (Bishop, 2006). However, in general cases, they do not have such a one-to-one correspondence, as shown in Figure 4. Blue dots in the figure represent the relationship between the magnitude of an adversarial perturbation and the corresponding class probability acquired from a classifier based on ResNet26 trained on the CIFAR-10 dataset. This result indicates that one cannot obtain a satisfactory decision boundary by performing knowledge distillation based on the class probabilities with insufficient training samples. To tackle this problem, we instead utilize SASs obtained from the training samples. Since SASs are labeled samples near decision boundaries, they have a more consistent relationship between the class probability and the distance from the decision boundaries, as depicted by the orange dots in Figure 4. Hence, when using SASs in knowledge distillation, comparing only the class probabilities is sufficient to achieve a more accurate distillation of decision boundary information. In conclusion, SASs can help improve the generalization performance of a student classifier in knowledge distillation.
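The one-to-one correspondence for the linear case can be made explicit with a short worked example. The setting below, a binary logistic classifier with weight vector $w$ and bias $w_0$, is chosen here purely for illustration and is not taken from the experiments.

```latex
% Binary linear classifier (illustrative assumption): class probability and
% signed distance to the hyperplane w^T x + w_0 = 0.
\[
  p(x) = \sigma\!\left(w^\top x + w_0\right), \qquad
  d(x) = \frac{w^\top x + w_0}{\lVert w \rVert_2}
  \quad\Longrightarrow\quad
  p(x) = \sigma\!\left(\lVert w \rVert_2 \, d(x)\right).
\]
% The logistic sigmoid \sigma is strictly increasing, so each distance d(x)
% maps to exactly one probability p(x) and vice versa.
```

For a deep network the score function is no longer affine in $x$, so two samples with the same class probability can lie at different distances from the boundary, which is consistent with the scatter described for Figure 4.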
3.2 Performance on image classification
The performance of the proposed method is verified by image classification on the CIFAR-10 and CIFAR-100 datasets. For CIFAR-10, ResNet26 with 92.55% accuracy is used as the teacher classifier. ResNet32 with 72.55% accuracy is used as the teacher for CIFAR-100. Meanwhile, ResNet8 and ResNet14 are used as student classifiers. We trained the student classifiers in three ways. The first method is denoted as 'original', and it uses only the classification loss for training. The second method is denoted as 'Hinton'. Using the classification loss and the KD loss, Hinton is implemented in the same way as in Hinton et al. (2015). The final method is the proposed method, which is denoted as 'proposed'. For each method, the performance and the proposed similarity metrics of decision boundaries between the teacher and the student were measured. For a fair comparison, all the measures were averaged after repeating the learning process 10 times with different initial network weights.
Figure 5 shows the result. The proposed method shows improved performance compared to the Hinton method and the original method. Here, the similarity of the decision boundaries explains the reason for the performance improvement. The similarity of the decision boundaries is increased by the Hinton method and the proposed method, and the performance is also increased by a similar ratio. This implies that the Hinton and the proposed methods train the student classifier to have decision boundaries similar to those of the teacher, and as a result, the performance improves. In terms of the decision boundary, the result shows that the proposed method has the highest MagSim and AngSim. This confirms that the proposed method effectively transfers the decision boundaries as we intended. When the proposed method is compared with the Hinton method, one can confirm that the improvement rate of AngSim is mostly larger than that of MagSim. This shows that the proposed method transfers both the distances and the directions to decision boundaries effectively, unlike the Hinton method, which mostly concentrates on the distances. As explained in Section 2.3, having similar directions to the decision boundaries implies that the class score functions share similar surface shapes. Therefore, we can conclude that the decision boundaries produced by the proposed method are more similar to those of the teacher than those produced by the Hinton method.
3.3 Generalization of the classifier
The proposed method improves the generalization performance of a student classifier by transferring the decision boundaries of a teacher classifier. Through an experiment, we verified that the generalization performance of the student classifier actually increases with the proposed method. In order to measure the generalization performance, the experiment was repeated while reducing the number of training samples from 100% to 20%. The CIFAR-10 dataset was used in this experiment, and ResNet26 trained on the whole dataset was used as the teacher classifier while ResNet14 was used as the student classifier. Similar to the previous experiment, the original method, the Hinton method, and the proposed method were each trained 10 times and the results were averaged. All methods were trained on the same training data for fairness.
Figure 6 shows the performance improvement over the original method as a function of the size of the dataset. Here, we can see that the improvement of the Hinton method does not change with the size of the data. On the other hand, the proposed method shows a bigger performance improvement for less training data. This indicates that the proposed method improves the generalization performance of the student classifier through the transfer of decision boundaries.
3.4 Ablation study
Table 1: Ablation study with ResNet8 on CIFAR-10 (accuracy, mean ± standard deviation).
|Setting|Accuracy|
|---|---|
|Without cls loss|87.00% ± 0.15%|
|Without KD loss|86.37% ± 0.33%|
|Without adv loss|86.66% ± 0.23%|
|No perturbation|86.80% ± 0.25%|
|Random perturbation|86.73% ± 0.16%|
|All selection|87.14% ± 0.32%|
|Random selection|87.10% ± 0.23%|
|Inverse selection|86.79% ± 0.19%|
We performed an ablation study based on ResNet8 trained on CIFAR-10 to examine the influence of various elements of the proposed method. All the experimental results are shown in Table 1. The first experiment is about the effect of each loss. The proposed method consists of three losses. Hence, we examined the performance of the proposed method without each of the losses. In Table 1, the biggest impact on performance is from the KD loss. The second biggest is from the adversarial loss, and the classification loss has the least impact on performance. However, as can be seen in the table, all three losses are important for the proposed method to be effective. The second experiment verifies the effect of using SASs. 'No perturbation' in Table 1 refers to the case of using the base samples instead of the SASs for the adversarial loss, and 'random perturbation' means that random perturbations are added to the base samples for the adversarial loss. Both show poor performance, which indicates that using SASs is a vital part of the proposed method. The third experiment is about the sample selection described in Section 2.2.2. 'All selection' is when all samples in the candidate set are used. Although this has a larger computational burden, its performance is actually reduced. 'Random selection' is when samples are randomly selected from the candidate set. Similar to the case of all selection, it is worse than the proposed algorithm. 'Inverse selection' is the exact opposite of the proposed method, where samples with smaller values in (11) are selected. This has the worst performance among the selection methods, suggesting that the proposed selection method is appropriate for our purpose.
3.5 Implementation details
All the experiments were performed using residual networks (He et al., 2015). The channel sizes of ResNet were set to 16, 32, and 64 for CIFAR-10, and 32, 64, and 128 for CIFAR-100. We used random crop and random horizontal flip for data augmentation and normalized each input image based on the mean and the variance of the dataset. The temperatures of the KD loss and the adversarial loss were fixed to 3 in all experiments. Training ran for a maximum of 80 epochs with a batch size of 256 and a learning rate that started at 0.1 and decreased to 0.01 at half of the maximum epoch and to 0.001 at 3/4 of the maximum epoch. The momentum was 0.9 and the weight decay was 0.0001. We also performed experiments with 320 epochs, which are presented in the supplementary material due to the lack of space. For the adversarial attack in the proposed method, the maximum number of iterations was set to 10 for knowledge distillation and to 20 for calculating the similarity metrics. For the adversarial loss, the base samples were selected among the 256 batch samples. All experiments were performed 10 times and the results were averaged.
4 Conclusion
In this paper, we have investigated how to transfer decision boundaries using knowledge distillation. The adversarial attack method of Moosavi-Dezfooli et al. (2016) was modified to find an adversarial sample supporting a decision boundary (SAS). Based on the SAS, we proposed a knowledge distillation method to transfer decision boundaries. We also proposed metrics to measure the decision boundary similarity between two classifiers and used the metrics to verify the transfer of decision boundaries in the proposed method. Experiments have shown that the proposed method transfers more accurate decision boundaries, which improves the performance of knowledge distillation. It was also shown that the proposed method has stronger generalization performance and so is more effective in situations with fewer training samples. Designing a knowledge distillation method in terms of decision boundaries is a new direction that has not been attempted in past studies. Utilizing an adversarial attack to find and transfer the decision boundaries is also a new approach. Therefore, this work can be useful for future research on knowledge distillation and on applications of adversarial attacks.
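The step schedule for the learning rate described above is simple enough to write down directly. The sketch below assumes zero-based epoch indices and that each drop takes effect at the start of the epoch where the threshold is reached; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, max_epoch=80):
    """Step schedule: 0.1, then 0.01 from 1/2, then 0.001 from 3/4 of training.

    `epoch` is assumed to be zero-based (0 .. max_epoch - 1).
    """
    if epoch < max_epoch // 2:
        return 0.1
    if epoch < (3 * max_epoch) // 4:
        return 0.01
    return 0.001
```

With the default of 80 epochs, the rate drops at epochs 40 and 60; the same function covers the 320-epoch setting by passing `max_epoch=320`.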
References
- Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
- Cao, X. and Gong, N. Z. Mitigating evasion attacks to deep neural networks via region-based classification. CoRR, abs/1709.05583, 2017.
- Chen, G., Choi, W., Yu, X., Han, T., and Chandraker, M. Learning efficient object detection models with knowledge distillation. In Neural Information Processing Systems (NIPS), 2017a.
- Chen, Y., Wang, N., and Zhang, Z. DarkRank: Accelerating deep metric learning via cross sample similarities transfer. CoRR, abs/1707.01220, 2017b.
- Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, Sep 1995.
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Neural Information Processing Systems (NIPS), pp. 2672–2680, 2014.
- Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
- Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
- Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
- Moosavi-Dezfooli, S., Fawzi, A., and Frossard, P. DeepFool: A simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy (SP), pp. 582–597, 2016.
- Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. FitNets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.
- Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Wang, C. and Lan, X. Model distillation with knowledge transfer in face classification, alignment and verification. CoRR, abs/1709.02929, 2017.
- Xu, Z., Hsu, Y., and Huang, J. Learning loss for knowledge distillation with conditional adversarial networks. CoRR, abs/1709.00513, 2017.
- Yim, J., Joo, D., Bae, J., and Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. CoRR, abs/1612.03928, 2016.