Knowledge Distillation in Generations: More Tolerant Teachers Educate Better Students
This paper studies teacher-student optimization on neural networks, i.e., adopting the supervision from a trained (teacher) network to optimize another (student) network. Conventional approaches force the student to learn from a strict teacher which fits a hard (one-hot) distribution and achieves high recognition accuracy, but we argue that a more tolerant teacher often educates better students.
We start by adding an extra loss term to a patriarch network so that it preserves confidence scores on a primary class (the ground-truth) and several visually similar secondary classes. The patriarch is also known as the first teacher. In each of the following generations, a student learns from the teacher and becomes the new teacher in the next generation. Although the patriarch is less powerful due to this ambiguity, the students enjoy persistent accuracy gains as the fitting targets gradually converge toward one-hot distributions. We investigate standard image classification tasks (CIFAR100 and ILSVRC2012). Experiments with different network architectures verify the superiority of our approach, either using a single model or an ensemble of models.
Department of Computer Science
The Johns Hopkins University
Baltimore, MD 21218, USA
“Indigo comes from blue, but it is bluer than blue.”
“Teachers need not be wiser than students.” —Old Proverbs
1 Introduction
Deep learning, especially the convolutional neural networks, has been widely applied to computer vision problems. Among them, image classification has been considered the fundamental task which sets the backbone of a vision system [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton][Simonyan and Zisserman(2015)][Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich, et al.][He et al.(2016)He, Zhang, Ren, and Sun], and the knowledge or features extracted from these models can be transferred for generic image representation purposes [Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson] or other vision tasks [Long et al.(2015)Long, Shelhamer, and Darrell][Ren et al.(2015)Ren, He, Girshick, and Sun][Xie and Tu(2015)][Newell et al.(2016)Newell, Yang, and Deng].
Therefore, in computer vision, a fundamental task is to optimize deep networks for image classification. Most existing works achieve this goal by fitting one-hot vectors. For each training sample $(\mathbf{x}_n, y_n)$, where $\mathbf{x}_n$ is an image matrix and $y_n$ is the class label out of $C$ classes, the goal is to find network parameters $\boldsymbol{\theta}$ so that $\mathbf{f}(\mathbf{x}_n; \boldsymbol{\theta}) \approx \mathbf{y}_n$, where $\mathbf{y}_n$ is the one-hot vector in which only the $y_n$-th dimension is $1$ and all others are $0$. Despite its effectiveness, we argue that this setting forces the image to be classified as the primary class (i.e., the ground-truth) with the confidence score of the target class being $1$ and all others being $0$. However, this is not necessarily the optimal fitting target, because allowing some secondary classes (i.e., those that are visually similar to the ground-truth) to be preserved may help to alleviate the risk of over-fitting. Previously, some approaches dealt with this issue by learning a class-level similarity matrix [Deng et al.(2010)Deng, Berg, Li, and Fei-Fei][Verma et al.(2012)Verma, Mahajan, Sellamanickam, and Nair][Wu et al.(2017)Wu, Tygert, and LeCun]; but they are unable to capture image-level similarities, e.g., different cat images may be visually similar to different classes.
An alternative solution is to distill knowledge from a trained (teacher) network and guide another (student) network in order to model the visual similarity [Hinton et al.(2015)Hinton, Vinyals, and Dean]. This involves training $M+1$ models in total, denoted by $\mathcal{M}_0, \mathcal{M}_1, \ldots, \mathcal{M}_M$, respectively. $\mathcal{M}_0$ is named the patriarch network, or the first teacher network. In each of the following generations, $\mathcal{M}_m$ plays the role of the student and learns from $\mathcal{M}_{m-1}$. Mathematically, each loss function is composed of two terms, i.e., $\mathcal{L}_m = (1-\lambda)\,\mathcal{L}^{\mathrm{cls}}_m + \lambda\,\mathcal{L}^{\mathrm{sim}}_m$. The first term, $\mathcal{L}^{\mathrm{cls}}_m$, is the classification loss (the standard cross-entropy loss), which requires the network to learn visual features that align with the final goal. The second term, $\mathcal{L}^{\mathrm{sim}}_m$, is the similarity loss facilitating the student to learn from the visual knowledge of its teacher. The patriarch does not have a teacher, so we define a different $\mathcal{L}^{\mathrm{cls}}_0$ accordingly and discard $\mathcal{L}^{\mathrm{sim}}_0$.
This work reveals an interesting phenomenon: a more tolerant teacher (i.e., a model which tends to distribute part of the confidence score outside the primary class) often educates better students, because it leaves larger room for the student to capture image-level visual similarity. To this end, we intentionally train $\mathcal{M}_0$ to fit a softened class distribution, e.g., the primary class has a score smaller than $1$, and the remainder is distributed among a few secondary classes, with the set of secondary classes varying from sample to sample. As the number of generations grows, the trained model gradually converges to fit the original one-hot distribution, but in comparison to standard training, our approach produces stronger models. We perform experiments on the CIFAR100 and ILSVRC2012 datasets. Although our approach requires a longer training stage, the testing complexity remains the same, while the classification accuracy is considerably higher, either for single models or a model ensemble.
2 Related Work
Deep learning has been dominating the field of computer vision. Powered by large-scale image datasets [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] and powerful computational resources, it is possible to train very deep networks for various computer vision tasks. The fundamental idea of deep learning is to design a hierarchical structure containing multiple layers, each of which contains a number of neurons having the same or similar mathematical functions. It is believed that a sufficiently deep network is able to fit very complicated distributions in the feature space. In the fundamental task of image classification, deep neural networks [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] have achieved much higher accuracy than conventional handcrafted features [Perronnin et al.(2010)Perronnin, Sanchez, and Mensink]. It is well acknowledged that deeper network structures lead to better performance [Simonyan and Zisserman(2015)][Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich, et al.][He et al.(2016)He, Zhang, Ren, and Sun][Huang et al.(2017b)Huang, Liu, Weinberger, and van der Maaten][Hu et al.(2018)Hu, Shen, and Sun], but specifically designed techniques are still required to assist optimization, such as the ReLU activation [Nair and Hinton(2010)], Dropout [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov] and batch normalization [Ioffe and Szegedy(2015)]. Automatic ways of designing neural network architectures have also been studied [Xie and Yuille(2017)][Zoph and Le(2017)].
The rapid progress of deep learning has benefited many computer vision tasks. Features extracted from trained classification networks can be transferred to small datasets for image classification [Donahue et al.(2014)Donahue, Jia, Vinyals, Hoffman, Zhang, Tzeng, and Darrell], retrieval [Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson] or object detection [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik]. An even more effective way is to insert specified network modules for these tasks and to initialize these models with part of the weights learned for image classification. This pipeline, often referred to as fine-tuning, works well in a variety of problems, including object detection [Girshick(2015)][Ren et al.(2015)Ren, He, Girshick, and Sun], semantic segmentation [Long et al.(2015)Long, Shelhamer, and Darrell][Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille], edge detection [Xie and Tu(2015)], etc.
Since the trained network is deep (i.e., the mathematical function is very complicated) and the amount of training data is limited, it is often instructive to introduce extra priors to constrain the training process and thus prevent over-fitting. A common prior assumes that some classes are visually or semantically similar [Deng et al.(2010)Deng, Berg, Li, and Fei-Fei], and adds a class-level similarity matrix to the loss function [Verma et al.(2012)Verma, Mahajan, Sellamanickam, and Nair][Wu et al.(2017)Wu, Tygert, and LeCun]; but this is unable to deal with image-level similarity, which is well noted in previous research [Wang et al.(2014)Wang, Leung, Rosenberg, Wang, Philbin, Chen, Wu, et al.][Akata et al.(2016)Akata, Perronnin, Harchaoui, and Schmid][Zhang et al.(2018)Zhang, Cheng, and Tian]. Another prior is named knowledge distillation, which allows a teacher network to guide the optimization of a student network. In [Hinton et al.(2015)Hinton, Vinyals, and Dean], it was verified that the student network can be of a smaller size than the teacher network, yet achieve similar recognition performance with the help of the teacher. This model was later improved so that multiple teachers provide better guidance [Tarvainen and Valpola(2017)]. In a recent work named the born-again network [Furlanello et al.(2017)Furlanello, Lipton, Itti, and Anandkumar], the networks are optimized in generations, in which each new generation is guided by the standard one-hot classification vector as well as the knowledge learned in the previous generation.
3 Our Approach
This section presents our approach. We first introduce a framework organized by generations (Section 3.1). Then, we provide empirical analysis on why this framework trains deep networks better (Section 3.2), based on which we propose to set tolerant teachers to educate better students (Section 3.3).
3.1 Framework: Network Training in Generations
We consider a standard network optimization task. Given a model in the mathematical form $\mathbf{y} = \mathbf{f}(\mathbf{x}; \boldsymbol{\theta})$, where $\mathbf{x}$ and $\mathbf{y}$ are the input and output and $\boldsymbol{\theta}$ denotes the learnable parameters (e.g., convolutional weights), and a training set $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$, the goal is to determine the parameters $\boldsymbol{\theta}$ that best fit these data.
In practice, a popular optimization pipeline starts by initializing all weights with white noise, and then applies a gradient-descent-based algorithm to update them. In this scenario, the complicated network design and the limited number of training samples create a high-dimensional feature space in which only a limited number of data points are observed. As a result, it is likely to find a $\boldsymbol{\theta}$ that achieves high accuracy on the training set (e.g., a deep residual network [He et al.(2016)He, Zhang, Ren, and Sun] reports almost 100% training accuracy on CIFAR100), while the testing accuracy remains unsatisfactory. This phenomenon, often referred to as over-fitting, limits us from generalizing the trained model from training data to unobserved testing data.
We suggest regularizing the training stage with a teacher-student framework (we shall explain why teacher-student training serves as regularization in the next subsection), in which a teacher model $\mathbf{f}^{\mathrm{t}}$ is first trained to capture the data distribution and then used to guide a student model $\mathbf{f}^{\mathrm{s}}$. We follow [Furlanello et al.(2017)Furlanello, Lipton, Itti, and Anandkumar] to design the loss function. For each training sample $(\mathbf{x}_n, \mathbf{y}_n)$, the loss function is composed of two terms, i.e., $\mathcal{L} = (1-\lambda)\,\mathcal{L}^{\mathrm{cls}} + \lambda\,\mathcal{L}^{\mathrm{sim}}$, where $\mathcal{L}^{\mathrm{cls}}$ and $\mathcal{L}^{\mathrm{sim}}$ indicate the classification loss and the similarity loss, respectively. The classification term is simply the standard cross-entropy loss: $\mathcal{L}^{\mathrm{cls}} = -\sum_{c=1}^{C} y_{n,c} \log f^{\mathrm{s}}_{c}(\mathbf{x}_n; \boldsymbol{\theta}). \quad (1)$
Since $\mathbf{y}_n$ is a one-hot vector, only one term in $\mathcal{L}^{\mathrm{cls}}$ is actually computed. The similarity loss measures the difference between the predicted class distribution $\mathbf{f}^{\mathrm{s}}(\mathbf{x}_n)$ and that of the teacher, $\mathbf{f}^{\mathrm{t}}(\mathbf{x}_n)$, by the KL-divergence: $\mathcal{L}^{\mathrm{sim}} = \mathrm{KL}\!\left(\mathbf{f}^{\mathrm{t}} \,\|\, \mathbf{f}^{\mathrm{s}}\right) = \sum_{c=1}^{C} f^{\mathrm{t}}_{c} \log \frac{f^{\mathrm{t}}_{c}}{f^{\mathrm{s}}_{c}}. \quad (2)$
Note that $\mathbf{f}^{\mathrm{s}}$ is normalized, i.e., $\sum_{c=1}^{C} f^{\mathrm{s}}_{c} = 1$. In [Furlanello et al.(2017)Furlanello, Lipton, Itti, and Anandkumar], this model was further extended to multi-generation learning. Let there be $M$ generations in total. The patriarch model, denoted by $\mathcal{M}_0$, is the first one to be trained (no teacher is available). In this case, the similarity loss term is discarded, i.e., $\mathcal{M}_0$ follows a standard training stage with the cross-entropy loss. Then, in each generation $m$ ($1 \leq m \leq M$), the student $\mathcal{M}_m$ learns from its teacher $\mathcal{M}_{m-1}$. Finally, $\mathcal{M}_M$ is more powerful than $\mathcal{M}_0$, and the ensemble of these models also obtains higher accuracy than an ensemble of the same number of individually trained models.
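The two loss terms above can be sketched in a few lines of pure Python (a minimal sketch with hypothetical helper names; it operates on already-normalized score vectors and ignores batching):

```python
import math

def cross_entropy(one_hot, f):
    # L_cls: only the ground-truth entry of the one-hot label contributes.
    return -sum(y * math.log(p) for y, p in zip(one_hot, f) if y > 0)

def kl_divergence(f_teacher, f_student):
    # L_sim: KL(f_t || f_s), how far the student's distribution drifts
    # from the teacher's softened prediction.
    return sum(t * math.log(t / s) for t, s in zip(f_teacher, f_student) if t > 0)

def student_loss(one_hot, f_teacher, f_student, lam):
    # lam balances the teacher signal against the ground-truth supervision.
    return ((1 - lam) * cross_entropy(one_hot, f_student)
            + lam * kl_divergence(f_teacher, f_student))
```

Notice that a student matching a tolerant teacher exactly pays zero similarity loss but a nonzero classification loss; this residual is exactly the ambiguity the framework tolerates.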
3.2 Analysis: Why Does Teacher-Student Optimization Work?
Before continuing, we study why teacher-student optimization works better. The traditional literature refers to this approach as knowledge distillation, which is able to (i) train a student network that is more compact than the teacher yet achieves comparable performance [Hinton et al.(2015)Hinton, Vinyals, and Dean]; or (ii) train a student network that has the same number of parameters but outperforms the teacher [Furlanello et al.(2017)Furlanello, Lipton, Itti, and Anandkumar]. It was believed that the student benefits from the knowledge learned by the teacher; we provide an alternative analysis of what this benefit is, which also inspires our algorithm.
We first note that the difference lies in the term $\mathcal{L}^{\mathrm{sim}}$, where $\mathbf{f}^{\mathrm{t}}$ is the teacher signal, or more precisely, the teacher's prediction on a training sample. If the teacher were perfectly correct, i.e., $\mathbf{f}^{\mathrm{t}}$ were a one-hot vector with $1$ at the ground-truth class, $\mathcal{L}^{\mathrm{sim}}$ would be equivalent to the cross-entropy loss and nothing different would happen. So, we conjecture that the teacher network does not produce perfect predictions even on observed (training) samples.
We verify this with an empirical study on the CIFAR datasets [Krizhevsky and Hinton(2009)]. A deep residual network [He et al.(2016)He, Zhang, Ren, and Sun] is trained and then evaluated directly on the training data. From Figure 1, we observe three important facts. First, training accuracy is close to 100%, with the dominant share of confidence assigned to the correct class on both CIFAR10 and CIFAR100, but a small portion of the score is still assigned to other classes. Second, for each primary class, the secondary classes sharing a considerable amount of confidence are often semantically similar to it, e.g., in CIFAR10, cat is close to dog, automobile is close to truck, etc. Third, although semantic similarity exists at the class level, there are still situations in which an image is visually similar to other classes, e.g., automobile is most often most similar to truck, but for a fraction of the images it is most similar to ship or airplane instead. This is why teacher-student optimization works: the teacher network learns image-level similarity and allows the student network to preserve it. Without it, the student network needs to fit a one-hot vector, which assumes that all classes have zero similarity with each other, and this inevitably results in over-fitting.
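The third observation can be made concrete with a toy sketch (pure Python; the class names and scores below are illustrative, not measured values): tallying which class receives the second-highest confidence per image shows that the dominant secondary class varies across images of the same primary class.

```python
from collections import Counter

def secondary_class(scores):
    # Index of the class receiving the second-highest confidence score.
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return ranked[1]

# Toy softmax outputs for three images of one primary class
# (0 = automobile, 1 = truck, 2 = ship, 3 = airplane -- illustrative only).
outputs = [
    [0.90, 0.07, 0.02, 0.01],  # secondary class: truck
    [0.88, 0.03, 0.08, 0.01],  # secondary class: ship
    [0.91, 0.05, 0.01, 0.03],  # secondary class: truck
]
histogram = Counter(secondary_class(s) for s in outputs)
# The dominant secondary class (truck) is not universal: one image is
# closer to ship, i.e., the similarity is image-level, not class-level.
```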
Despite the usefulness of image-level similarity, existing approaches [Hinton et al.(2015)Hinton, Vinyals, and Dean][Furlanello et al.(2017)Furlanello, Lipton, Itti, and Anandkumar] often optimize the teacher network with the standard cross-entropy loss, in which case only a small amount of similarity is preserved (see Figure 1). Consequently, the benefit of teacher-student optimization is reduced. In the following, we modify the loss function of the teacher model, leading to more “tolerant” teachers which have a greater potential to educate better students.
3.3 More Tolerant Teachers, Better Students
Intuitively, we refer to a model preserving a larger amount of image-level similarity as a model containing higher energy. We make two modifications to the baseline algorithm: (i) we change the loss function of the patriarch model to intentionally create a higher energy level; and (ii) we increase the weight of the teacher signal so that the energy does not decay too fast throughout the generations.
The definition of $\mathcal{L}_0$ starts with a hyper-parameter $K$, the approximate number of classes that are visually similar to each primary class (including itself). For each input $\mathbf{x}_n$, these classes are determined by the classification scores $\mathbf{f}(\mathbf{x}_n; \boldsymbol{\theta})$. We denote the $k$-th largest dimension in $\mathbf{f}$ by $f_{(k)}$. The class attaining $f_{(1)}$ is called the primary class, and those attaining $f_{(2)}, \ldots, f_{(K)}$ are called the secondary classes. The idea is to constrain the difference between the primary score $f_{(1)}$ and the secondary scores $f_{(k)}$, for $k = 2, \ldots, K$. Therefore, we add a so-called score-difference term to the loss function, yielding the updated loss function: $\mathcal{L}_0 = -\sum_{c=1}^{C} y_{n,c} \log f_{c} + \eta \sum_{k=2}^{K} \left( f_{(1)} - f_{(k)} \right). \quad (3)$
Here, $\eta$ is a hyper-parameter controlling the balance between the ground-truth supervision and the score-difference term. Eqn (3) raises a tradeoff between perfect classification ($\mathbf{f}$ equals $\mathbf{y}_n$, i.e., $f_{(1)} = 1$) and sufficient energy ($f_{(1)}$ is not very close to $1$). Mathematically, the optimal $\mathbf{f}^{\star}$ minimizing $\mathcal{L}_0$ places a score $a^{\star}$ on the primary class and $(1 - a^{\star}) / (K - 1)$ on each secondary class, with all other entries being $0$. We can easily derive the optimal $a^{\star}$, which is a monotonically decreasing function with respect to $\eta$.
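Writing $a$ for the primary score, $K$ for the number of preserved classes and $\eta$ for the weight of the score-difference term, the minimization can be carried out in closed form under the assumption just stated: with the remaining $1-a$ split evenly over the $K-1$ secondary classes, the loss reduces to $-\log a + \eta(Ka - 1)$, whose minimizer is $a^{\star} = \min\{1, 1/(\eta K)\}$. The following pure-Python sketch (the helper names are ours, not the paper's notation) encodes both the loss and this closed form:

```python
import math

def patriarch_loss(one_hot, f, eta, K):
    # Cross-entropy on the ground-truth class, plus the score-difference
    # term over the top-K scores (Eqn (3)).
    l_cls = -math.log(f[one_hot.index(1)])
    top = sorted(f, reverse=True)[:K]
    l_diff = sum(top[0] - fk for fk in top[1:])  # gaps to the primary score
    return l_cls + eta * l_diff

def optimal_primary_score(eta, K):
    # Minimizer of -log(a) + eta * (K * a - 1): a* = 1 / (eta * K),
    # clipped at 1; it decreases monotonically as eta grows.
    return min(1.0, 1.0 / (eta * K))
```

For instance, $\eta = 1$ and $K = 5$ push the patriarch toward a primary score of $0.2$, i.e., a maximally tolerant top-5 distribution.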
Two side notes are made on Eqn (3) and the hyper-parameter $K$. First, our formulation does not guarantee that the primary class corresponds to the true class; but as we shall see in experiments, after a sufficient number of training epochs, the training accuracy is always close to 100%. Second, $K$ is often difficult to estimate, and may vary among different primary classes. In practice, we set $K = 5$ for both CIFAR100 and ILSVRC2012, because (i) CIFAR100 contains 20 coarse groups and each of them has 5 finer-level classes; and (ii) an important evaluation metric on ILSVRC2012 is the top-5 accuracy, which is an approximate estimate of $K$. Of course, fixing $K$ for all classes is very rough, but we shall see in experiments that this simple formulation indeed works.
At the $m$-th generation, i.e., when $\mathcal{M}_m$ learns from $\mathcal{M}_{m-1}$, we use the same loss function as described in Section 3.1, namely $\mathcal{L}_m = (1-\lambda)\,\mathcal{L}^{\mathrm{cls}}_m + \lambda\,\mathcal{L}^{\mathrm{sim}}_m, \quad (4)$
where $\lambda$ is another hyper-parameter controlling the balance between the supervision from the ground-truth and the teacher signal. Mathematically, the optimal $\mathbf{f}^{\star}$ minimizing $\mathcal{L}_m$ concentrates its largest score on the primary class and spreads the remainder over the secondary classes, with all other entries being $0$, just like $\mathcal{L}_0$. Similarly, the optimal primary score is monotonically decreasing with respect to $\lambda$. Therefore, the multi-generation optimization process is parameterized by $K$, $\eta$ and $\lambda$, and we denote each process by $(K, \eta, \lambda)$. As examples, training individual models corresponds to $\lambda = 0$ (no teacher signal), and training born-again networks [Furlanello et al.(2017)Furlanello, Lipton, Itti, and Anandkumar] corresponds to $\eta = 0$ (a strict patriarch).
Regardless of $\eta$ and $\lambda$, the impact of the patriarch decays geometrically (like $\lambda^m$) as $m$ grows, and so the optimal $\mathbf{f}^{\star}$ gradually converges to the ground-truth one-hot vector; but as we shall see in experiments, a tolerant patriarch (a large $\eta$, hence a small primary score) combined with a large $\lambda$ boosts the energy during the training process and leads to better models. This phenomenon can also be explained using the simulated annealing theory. Using a smaller primary score is similar to setting a higher temperature in the annealing process (the patriarch model often reports a lower accuracy). A larger $\lambda$ allows the ambiguity to decrease gradually, corresponding to a lower annealing rate, which brings higher stability (the students enjoy a persistent accuracy gain and eventually outperform the baseline).
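The geometric decay of the teacher signal can be illustrated directly from the per-generation optimum: minimizing the weighted combination $(1-\lambda)\cdot\mathrm{CE} + \lambda\cdot\mathrm{KL}$ over a normalized distribution yields the convex combination of the one-hot label and the teacher's output (our reading of the optimum, under the weighting convention used here). A pure-Python sketch, where the uniform patriarch target and $\lambda = 0.7$ are illustrative choices rather than values from the paper:

```python
def next_optimal_target(one_hot, teacher, lam):
    # Per-generation optimum of (1 - lam) * CE + lam * KL: a convex
    # combination of the one-hot label and the teacher's distribution.
    return [(1 - lam) * y + lam * t for y, t in zip(one_hot, teacher)]

one_hot = [1.0, 0.0, 0.0, 0.0, 0.0]
target = [0.2] * 5  # a maximally tolerant patriarch target (illustrative)
for _ in range(10):  # ten generations
    target = next_optimal_target(one_hot, target, lam=0.7)
# The gap to the one-hot vector shrinks like lam**m: after ten
# generations it equals 0.8 * 0.7**10, about 0.023.
```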
4 Experiments
4.1 The CIFAR100 Dataset
Settings and Baselines
We first evaluate our approach on the CIFAR100 dataset [Krizhevsky and Hinton(2009)], which contains 60,000 tiny RGB images at a spatial resolution of 32×32. These images are split into a training set of 50,000 samples and a testing set of 10,000 samples. Both training and testing images are uniformly distributed over 100 classes. Note that we do not experiment on the CIFAR10 dataset because its number of classes is too small to reveal the effectiveness of our gradual optimization strategy.
We start with the deep residual networks (ResNets) [He et al.(2016)He, Zhang, Ren, and Sun] of different depths, i.e., $6n+2$ layers where $n$ is an integer. The first convolutional layer is performed on the input image without changing its spatial resolution; then three stages follow, each of which has $n$ residual blocks (two convolutions plus one identity connection). Batch normalization [Ioffe and Szegedy(2015)] and the ReLU activation [Nair and Hinton(2010)] are added after each convolution. The spatial resolutions in the three stages are 32×32, 16×16 and 8×8, and the numbers of channels are 16, 32 and 64, respectively. Average pooling layers are inserted after the first two stages for down-sampling. The network ends with a fully-connected layer with 100 outputs.
Our approach is also experimented on the densely connected convolutional networks (DenseNets) [Huang et al.(2017b)Huang, Liu, Weinberger, and van der Maaten] with 100 and 190 layers. The overall architecture is similar to that of the ResNets, but the building blocks are densely connected, i.e., each basic unit takes the input feature, convolves it twice, and concatenates the result to the original feature. The 100-layer DenseNet uses a base of 24 channels and a growth rate of 12.
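The channel bookkeeping implied by this dense connectivity can be sketched in pure Python (the base of 24 channels and growth rate of 12 below follow the common DenseNet-BC configuration and are assumptions for illustration, not values quoted from this paper):

```python
def dense_block_channels(base, growth, num_units):
    # Each unit emits `growth` new channels and concatenates them onto
    # everything it received, so the width grows linearly with depth.
    widths = [base]
    for _ in range(num_units):
        widths.append(widths[-1] + growth)
    return widths

# A 4-unit block starting from 24 channels with growth rate 12
# widens the feature map to 72 channels: [24, 36, 48, 60, 72].
```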
We follow the conventions to train these networks from scratch, using standard Stochastic Gradient Descent (SGD) with a weight decay of 0.0001 and a Nesterov momentum of 0.9. For both the ResNets and the DenseNets, the base learning rate is 0.1 and is divided by 10 twice as training proceeds. In the training process, standard data augmentation is used, i.e., each image is padded with a 4-pixel margin on each of the four sides; from the enlarged 40×40 image, a 32×32 subregion is randomly cropped and flipped with a probability of 0.5. No augmentation is used at the testing stage.
We first evaluate the performance with respect to different hyper-parameters, i.e., different parameterized processes $(K, \eta, \lambda)$. We fix $K$ and $\eta$, and diagnose the impact of $\lambda$ on deep residual networks [He et al.(2016)He, Zhang, Ren, and Sun] with different numbers of layers. We also evaluate the born-again networks [Furlanello et al.(2017)Furlanello, Lipton, Itti, and Anandkumar], which correspond to $\eta = 0$.
Results are summarized in Figure 1. We can observe several important properties of our algorithm. First, a strict teacher (i.e., the born-again network [Furlanello et al.(2017)Furlanello, Lipton, Itti, and Anandkumar] with $\eta = 0$) is inferior to a tolerant teacher ($\eta > 0$). Although the latter often starts with a lower accuracy of the patriarch model, it grows gradually and persistently, outperforming the baseline after a few generations. The accuracy may saturate after several generations, because the teacher signal eventually converges to distributions dominated by the primary class (i.e., $f_{(1)} \to 1$), and the teacher becomes strict again. Second, the number of layers impacts the choice of the teaching parameter $\lambda$. In a relatively deep network, a large $\lambda$ works better than a small one, while the opposite holds in a relatively shallow network. This is because knowledge distillation requires a higher temperature (larger energy) for deeper networks [Hinton et al.(2015)Hinton, Vinyals, and Dean]. As a side note, the ensemble of the models in our algorithm is considerably better than an ensemble of the same number of models trained individually or from a strict teacher.
In DenseNets with 100 and 190 layers, we report both single-model and model-ensemble results in Table 2. We evaluate the same parameterized processes and observe the same phenomena as in the ResNet experiments (both single models and model ensembles work favorably). In particular, on DenseNet100, our single-model accuracy is consistently higher, and our model-ensemble accuracy is higher still, even outperforming single DenseNet190 models. Considering that DenseNet190 requires many times the FLOPs of DenseNet100, this is quite an efficient way to achieve high classification accuracy. On DenseNet190, our results are competitive with the state of the art. Note that [Zhang et al.(2017a)Zhang, Cisse, Dauphin, and Lopez-Paz] and [Gastaldi(2017)] applied complicated data augmentation approaches to achieve high accuracy, whereas we take a different route, namely improving the optimization algorithm.
Table 2: Classification accuracy on CIFAR100 across generations (Gen #0 – Gen #5), with two baselines of different depths, compared against [Zhang et al.(2017b)Zhang, Qi, Xiao, and Wang], [Huang et al.(2017a)Huang, Li, Pleiss, Liu, Hopcroft, and Weinberger], [Han et al.(2017)Han, Kim, and Kim], [Zhang et al.(2017a)Zhang, Cisse, Dauphin, and Lopez-Paz], [Gastaldi(2017)] and [Furlanello et al.(2017)Furlanello, Lipton, Itti, and Anandkumar].
We inherit these learned hyper-parameters in the large-scale experiments on ILSVRC2012 (the costly computation prevents us from tuning them there). As ResNet18 (18 layers, not very deep) is chosen as the baseline, we use a relatively small $\lambda$ accordingly.
4.2 The ILSVRC2012 Dataset
Settings and Baselines
With the knowledge and parameters learned from the CIFAR100 experiments, we now investigate the ILSVRC2012 dataset [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, et al.], a popular subset of the ImageNet database [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei]. There are 1,000 classes in total. The training and testing sets contain approximately 1.3M and exactly 50K high-resolution images, with each class having approximately the same number of training images and exactly the same number of testing images.
We set the 18-layer residual network [He et al.(2016)He, Zhang, Ren, and Sun] as our baseline. The input image is passed through a convolutional layer with a stride of 2 and a max-pooling layer. Then four stages follow, each with 2 standard residual blocks (two convolutional layers plus an identity connection). The spatial resolutions of these four stages are 56×56, 28×28, 14×14 and 7×7, and the numbers of channels are 64, 128, 256 and 512, respectively. Three max-pooling layers are inserted between these four stages. The network ends with a fully-connected layer with 1,000 outputs.
All networks are trained from scratch. We follow [Hu et al.(2018)Hu, Shen, and Sun] in configuring the training parameters: standard Stochastic Gradient Descent (SGD) with a weight decay of 0.0001 and a Nesterov momentum of 0.9 is used, and the learning rate starts at 0.1 and is divided by 10 three times during the training process. In the training process, we apply a series of data-augmentation techniques, including rescaling and cropping the image, randomly mirroring and (slightly) rotating it, changing its aspect ratio, and performing pixel jittering. At the testing stage, we use the standard single-center-crop on each image.
Table 3: Top-1 and top-5 classification accuracy on ILSVRC2012 across generations (Gen #0 – Gen #5).
We still set $K = 5$ and the same $\eta$. Following the CIFAR100 experiments, we use a relatively small $\lambda$. Results are summarized in Table 3. One can observe very similar behaviors as in the previous experiments, i.e., a worse patriarch (it is interesting yet expected that the top-1 accuracy of the patriarch is lower than the baseline while its top-5 accuracy is only marginally lower, because setting $K = 5$ hardly impacts top-5 classification), gradual and persistent improvement, and saturation after several generations. Although the performance is not comparable to deeper network structures, the accuracy gain in both top-1 and top-5 is higher than those of two other light-weight modules, namely Squeeze-and-Excitation (SE) [Hu et al.(2018)Hu, Shen, and Sun] and the Second-Order Response Transform (SORT) [Wang et al.(2017)Wang, Xie, Liu, Qiao, Zhang, Zhang, Tian, and Yuille]. Different from them, our approach does not require any additional computation at the testing stage, although the training stage is longer.
5 Conclusions
In this work, we present an approach for optimizing deep networks. Under the framework of teacher-student optimization, our motivation is to set up a tolerant teacher by adding a loss term that measures the difference among the top-$K$ scores. This allows the network to preserve sufficient energy that decays gradually, which better fits the theory of knowledge distillation. Experiments on image classification verify our assumption. With the same network (and thus the same computational costs at the testing stage), our model works consistently better.
Our research supports the view that network optimization is far from perfect at its current status. In the future, we will investigate a more generalized model, including using a variable loss function at each generation and allowing $K$ to vary from case to case. In addition, we will consider a temperature term in Eqn (4) to adjust the KL-divergence. Both are expected to achieve better optimization results.
- [Akata et al.(2016)Akata, Perronnin, Harchaoui, and Schmid] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1425–1438, 2016.
- [Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. In International Conference on Learning Representations, 2016.
- [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
- [Deng et al.(2010)Deng, Berg, Li, and Fei-Fei] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In European Conference on Computer Vision, 2010.
- [Donahue et al.(2014)Donahue, Jia, Vinyals, Hoffman, Zhang, Tzeng, and Darrell] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, 2014.
- [Furlanello et al.(2017)Furlanello, Lipton, Itti, and Anandkumar] T. Furlanello, Z. C. Lipton, L. Itti, and A. Anandkumar. Born again neural networks. In NIPS Workshop on Meta Learning, 2017.
- [Gastaldi(2017)] X. Gastaldi. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
- [Girshick(2015)] R. Girshick. Fast r-cnn. In Computer Vision and Pattern Recognition, 2015.
- [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
- [Han et al.(2017)Han, Kim, and Kim] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In Computer Vision and Pattern Recognition, 2017.
- [He et al.(2016)He, Zhang, Ren, and Sun] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
- [Hinton et al.(2015)Hinton, Vinyals, and Dean] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [Hu et al.(2018)Hu, Shen, and Sun] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Computer Vision and Pattern Recognition, 2018.
- [Huang et al.(2017a)Huang, Li, Pleiss, Liu, Hopcroft, and Weinberger] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger. Snapshot ensembles: Train 1, get m for free. In International Conference on Learning Representations, 2017a.
- [Huang et al.(2017b)Huang, Liu, Weinberger, and van der Maaten] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Computer Vision and Pattern Recognition, 2017b.
- [Ioffe and Szegedy(2015)] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
- [Krizhevsky and Hinton(2009)] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
- [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
- [Long et al.(2015)Long, Shelhamer, and Darrell] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, 2015.
- [Nair and Hinton(2010)] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010.
- [Newell et al.(2016)Newell, Yang, and Deng] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, 2016.
- [Perronnin et al.(2010)Perronnin, Sanchez, and Mensink] F. Perronnin, J. Sanchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. In European Conference on Computer Vision, 2010.
- [Razavian et al.(2014)Razavian, Azizpour, Sullivan, and Carlsson] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition, 2014.
- [Ren et al.(2015)Ren, He, Girshick, and Sun] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
- [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, et al.] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- [Simonyan and Zisserman(2015)] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
- [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich, et al.] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In Computer Vision and Pattern Recognition, 2015.
- [Tarvainen and Valpola(2017)] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, 2017.
- [Verma et al.(2012)Verma, Mahajan, Sellamanickam, and Nair] N. Verma, D. Mahajan, S. Sellamanickam, and V. Nair. Learning hierarchical similarity metrics. In Computer Vision and Pattern Recognition, 2012.
- [Wang et al.(2014)Wang, Leung, Rosenberg, Wang, Philbin, Chen, Wu, et al.] J. Wang, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, et al. Learning fine-grained image similarity with deep ranking. In Computer Vision and Pattern Recognition, 2014.
- [Wang et al.(2017)Wang, Xie, Liu, Qiao, Zhang, Zhang, Tian, and Yuille] Y. Wang, L. Xie, C. Liu, S. Qiao, Y. Zhang, W. Zhang, Q. Tian, and A. Yuille. SORT: Second-order response transform for visual recognition. In International Conference on Computer Vision, 2017.
- [Wu et al.(2017)Wu, Tygert, and LeCun] C. Wu, M. Tygert, and Y. LeCun. Hierarchical loss for classification. arXiv preprint arXiv:1709.01062, 2017.
- [Xie and Tu(2015)] S. Xie and Z. Tu. Holistically-nested edge detection. In International Conference on Computer Vision, 2015.
- [Xie and Yuille(2017)] L. Xie and A. Yuille. Genetic CNN. In International Conference on Computer Vision, 2017.
- [Zhang et al.(2018)Zhang, Cheng, and Tian] C. Zhang, J. Cheng, and Q. Tian. Image-level classification by hierarchical structure learning with visual and semantic similarities. Information Sciences, 422:271–281, 2018.
- [Zhang et al.(2017a)Zhang, Cisse, Dauphin, and Lopez-Paz] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017a.
- [Zhang et al.(2017b)Zhang, Qi, Xiao, and Wang] T. Zhang, G. J. Qi, B. Xiao, and J. Wang. Interleaved group convolutions. In Computer Vision and Pattern Recognition, 2017b.
- [Zoph and Le(2017)] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.