Unifying Distillation and Privileged Information
Abstract
Distillation (Hinton et al., 2015) and privileged information (Vapnik & Izmailov, 2015) are two techniques that enable machines to learn from other machines. This paper unifies the two into generalized distillation, a framework to learn from multiple machines and data representations. We provide theoretical and causal insight into the inner workings of generalized distillation, extend it to unsupervised, semi-supervised, and multitask learning scenarios, and illustrate its efficacy in a variety of numerical simulations on both synthetic and real-world data.
1 Introduction
Humans learn much faster than machines. Vapnik & Izmailov (2015) illustrate this discrepancy with the Japanese proverb
better than a thousand days of diligent study is one day with a great teacher.
Motivated by this insight, the authors incorporate an “intelligent teacher” into machine learning. Their solution is to consider training data formed by a collection of triplets
$$\{(x_1, x_1^\star, y_1), \ldots, (x_n, x_n^\star, y_n)\} \sim P^n(x, x^\star, y).$$
Here, each $(x_i, y_i)$ is a feature-label pair, and the novel element $x_i^\star$ is additional information about the example $(x_i, y_i)$, provided by an intelligent teacher to support the learning process. Unfortunately, the learning machine will not have access to the teacher explanations $x_i^\star$ at test time. Thus, the framework of learning using privileged information (Vapnik & Vashist, 2009; Vapnik & Izmailov, 2015) studies how to leverage these explanations $x_i^\star$ at training time, to build a classifier for test time that outperforms those built on the regular features $x_i$ alone. As an example, $x_i$ could be the image of a biopsy, $x_i^\star$ the medical report of an oncologist when inspecting the image, and $y_i$ a binary label indicating whether the tissue shown in the image is cancerous or healthy.
The previous exposition finds a mathematical justification in VC theory (Vapnik, 1998), which characterizes the speed at which machines learn using two ingredients: the capacity or flexibility of the machine, and the amount of data that we use to train it. Consider a binary classifier $f$ belonging to a function class $\mathcal{F}$ with finite VC-dimension $|\mathcal{F}|_{\mathrm{VC}}$. Then, with probability $1 - \delta$, the expected error $R(f)$ is upper bounded by
$$R(f) \le R_n(f) + O\left(\left(\frac{|\mathcal{F}|_{\mathrm{VC}} - \log \delta}{n}\right)^{\alpha}\right), \tag{1}$$
where $R_n(f)$ is the training error over $n$ data, and $\frac{1}{2} \le \alpha \le 1$. For difficult (non-separable) problems the exponent is $\alpha = \frac{1}{2}$, which translates into machines learning at a slow rate of $O(n^{-1/2})$. On the other hand, for easy (separable) problems, i.e., those on which the machine makes no training errors, the exponent is $\alpha = 1$, which translates into machines learning at a fast rate of $O(n^{-1})$. The difference between these two rates is huge: the $O(n^{-1})$ learning rate potentially requires only $1000$ examples to achieve the accuracy for which the $O(n^{-1/2})$ learning rate needs $10^6$ examples. So, given a student who learns from a fixed amount of data $n$ and a function class $\mathcal{F}$, a good teacher can try to ease the problem at hand by accelerating the learning rate from $O(n^{-1/2})$ to $O(n^{-1})$.
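To make the gap between the slow and fast regimes concrete, the sketch below evaluates the estimation term of the bound for both exponents. It assumes that all constants hidden in the big-O notation equal one and ignores the capacity and confidence terms; `estimation_rate` is a hypothetical helper name used only for illustration.

```python
import math


def estimation_rate(n, alpha):
    """Value of n**(-alpha), the sample-size factor in the bound.

    Assumption: every constant hidden by the big-O notation is 1.
    """
    return n ** (-alpha)


# Fast rate (separable problem) with a thousand examples.
fast = estimation_rate(1_000, alpha=1.0)
# Slow rate (non-separable problem) with a million examples.
slow = estimation_rate(1_000_000, alpha=0.5)

# Both settings reach the same accuracy level under this simplification.
print(fast, slow, math.isclose(fast, slow))
```

Under these simplifying assumptions, squaring the sample size is what the slow rate needs to match the fast rate.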
Vapnik’s learning using privileged information is one example of what we call machines-teaching-machines: the paradigm where machines learn from other machines, in addition to training data. Another seemingly unrelated example is distillation (Hinton et al., 2015),^1 where a simple machine learns a complex task by imitating the solution of a flexible machine. In a wider context, the machines-teaching-machines paradigm is one step toward the definition of machine reasoning of Bottou (2014), “the algebraic manipulation of previously acquired knowledge to answer a new question”. In fact, many recent state-of-the-art systems compose data and supervision from multiple sources, such as object recognizers reusing convolutional neural network features (Oquab et al., 2014), and natural language processing systems operating on vector word representations extracted from unsupervised text corpora (Mikolov et al., 2013).
^1 Distillation relates to model compression (Burges & Schölkopf, 1997; Buciluǎ et al., 2006; Ba & Caruana, 2014). We adopt the term distillation throughout this manuscript.
In the following, we frame Hinton’s distillation and Vapnik’s privileged information as two instances of the same machines-teaching-machines paradigm, termed generalized distillation. The analysis of generalized distillation sheds light on applications in semi-supervised learning, domain adaptation, transfer learning, Universum learning (Weston et al., 2006), reinforcement learning, and curriculum learning (Bengio et al., 2009), some of which we discuss in our numerical simulations.
2 Distillation
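We focus on $c$-class classification, although the same ideas apply to regression. Consider the data
$$\{(x_i, y_i)\}_{i=1}^n \sim P^n(x, y), \quad x_i \in \mathbb{R}^d, \; y_i \in \Delta^c. \tag{2}$$
Here, $\Delta^c$ is the set of $c$-dimensional probability vectors. Using (2), we are interested in learning the representation
$$f_t = \arg\min_{f \in \mathcal{F}_t} \frac{1}{n} \sum_{i=1}^n \ell(y_i, \sigma(f(x_i))) + \Omega(\|f\|), \tag{3}$$
where $\mathcal{F}_t$ is a class of functions from $\mathbb{R}^d$ to $\mathbb{R}^c$, the function $\sigma : \mathbb{R}^c \to \Delta^c$ is the softmax operation
$$\sigma(z)_k = \frac{e^{z_k}}{\sum_{j=1}^c e^{z_j}}$$
for all $1 \le k \le c$, the function $\ell : \Delta^c \times \Delta^c \to \mathbb{R}_+$ is the cross-entropy loss
$$\ell(y, \hat{y}) = -\sum_{k=1}^c y_k \log \hat{y}_k,$$
and $\Omega : \mathbb{R} \to \mathbb{R}$ is an increasing function which serves as a regularizer.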
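The two building blocks of the objective, the softmax operation and the cross-entropy loss, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the max-shift and the small `eps` inside the logarithm are standard numerical safeguards added here as assumptions.

```python
import numpy as np


def softmax(z):
    """Map real-valued scores to a probability vector along the last axis."""
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy between a target distribution y and a prediction y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return -np.sum(y * np.log(y_hat + eps), axis=-1)


logits = np.array([2.0, 1.0, 0.1])        # example scores for c = 3 classes
probs = softmax(logits)                   # lies in the probability simplex
loss = cross_entropy(np.array([1.0, 0.0, 0.0]), probs)
print(probs, loss)
```

For a one-hot target, the loss reduces to the negative log-probability assigned to the true class.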
When learning from real-world data such as high-resolution images, $f_t$ is often an ensemble of large deep convolutional neural networks (LeCun et al., 1998a). The computational cost of predicting new examples at test time using these ensembles is often prohibitive for production systems. For this reason, Hinton et al. (2015) propose to distill the learned representation $f_t \in \mathcal{F}_t$ into
$$f_s = \arg\min_{f \in \mathcal{F}_s} \frac{1}{n} \sum_{i=1}^n \left[ (1 - \lambda)\, \ell(y_i, \sigma(f(x_i))) + \lambda\, \ell(s_i, \sigma(f(x_i))) \right], \tag{4}$$
where
$$s_i = \sigma(f_t(x_i) / T) \in \Delta^c \tag{5}$$
are the soft predictions from $f_t$ about the training data, and $\mathcal{F}_s$ is a function class simpler than $\mathcal{F}_t$ (with $\sigma$ the softmax and $\ell$ the cross-entropy loss, as in (3)). The temperature parameter $T > 0$ controls how much we want to soften or smooth the class-probability predictions from $f_t$, and the imitation parameter $\lambda \in [0, 1]$ balances the importance between imitating the soft predictions $s_i$ and predicting the true hard labels $y_i$. Higher temperatures $T$ lead to softer class-probability predictions $s_i$. In turn, softer class-probability predictions reveal label dependencies which would otherwise be hidden as extremely large or small numbers. After distillation, we can use the simpler $f_s \in \mathcal{F}_s$ for faster prediction at test time.
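The roles of the temperature and the imitation parameter can be illustrated with a short sketch. The function names `soft_labels` and `distillation_loss` are hypothetical, the regularizer is omitted, and the teacher scores are made-up numbers chosen only to show the softening effect.

```python
import numpy as np


def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def soft_labels(teacher_scores, T):
    """Teacher class probabilities softened by temperature T > 0."""
    return softmax(np.asarray(teacher_scores, dtype=float) / T)


def distillation_loss(y, s, student_scores, lam, eps=1e-12):
    """Mean of (1 - lam) * CE(hard labels) + lam * CE(soft labels)."""
    p = softmax(student_scores)
    hard = -np.sum(y * np.log(p + eps), axis=-1)
    soft = -np.sum(s * np.log(p + eps), axis=-1)
    return float(np.mean((1.0 - lam) * hard + lam * soft))


teacher_scores = np.array([[8.0, 2.0, -1.0]])    # illustrative values only
sharp = soft_labels(teacher_scores, T=1.0)       # nearly one-hot
smooth = soft_labels(teacher_scores, T=10.0)     # reveals class similarities
y = np.array([[1.0, 0.0, 0.0]])
student_scores = np.array([[1.0, 0.0, 0.0]])
print(sharp, smooth, distillation_loss(y, smooth, student_scores, lam=0.5))
```

Setting `lam=0` recovers ordinary supervised training on the hard labels, while `lam=1` corresponds to pure imitation of the teacher.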
3 Vapnik’s privileged information
We now turn back to Vapnik’s problem of learning in the company of an intelligent teacher, as introduced in Section 1. The question at hand is: how can we leverage the privileged information $x_i^\star$ to build a better classifier for test time? One naïve way to proceed would be to estimate the privileged representation $x_i^\star$ from the regular representation $x_i$, and then use the union of regular and estimated privileged representations as our test-time feature space. But this may be a cumbersome endeavour: in the example of biopsy images $x_i$ and medical reports $x_i^\star$, it is reasonable to believe that predicting reports from images is more complicated than classifying the images into cancerous or healthy.
Alternatively, we propose to use distillation to extract useful knowledge from privileged information. The proposal is as follows. First, learn a teacher function $f_t \in \mathcal{F}_t$ by solving (3) using the data $\{(x_i^\star, y_i)\}_{i=1}^n$. Second, compute the teacher soft labels $s_i = \sigma(f_t(x_i^\star) / T)$, for all $1 \le i \le n$ and some temperature parameter $T > 0$. Third, distill $f_t \in \mathcal{F}_t$ into $f_s \in \mathcal{F}_s$ by solving (4) using both the hard labeled data $\{(x_i, y_i)\}_{i=1}^n$ and the softly labeled data $\{(x_i, s_i)\}_{i=1}^n$.
3.1 Comparison to prior work
Vapnik & Vashist (2009) and Vapnik & Izmailov (2015) offer two strategies to learn using privileged information: similarity control and knowledge transfer. Let us briefly compare them to our distillation-based proposal.
The motivation behind similarity control is that SVM classification is separable after we correct for the slack values $\xi_i$, which measure the degree of misclassification of training data points (Vapnik & Vashist, 2009). Since separable classification admits fast learning rates, it would be ideal to have a teacher that could supply slack values to us. Unluckily, it seems quixotic to aspire to a teacher able to provide us with abstract floating-point slack values. Perhaps it is more realistic to assume instead that the teacher can provide us with some rich, high-level representation useful to estimate the sought-after slack values. This reasoning crystallizes into the SVM+ objective function from Vapnik & Vashist (2009):
(6) 
where $f_i$ is the decision boundary at $x_i$, and $f_i^\star$ is the teacher correcting function at the same location. The SVM+ objective function matches the objective function of non-separable SVM when we replace the correcting functions $f_i^\star$ with the slacks $\xi_i$. Thus, skilled teachers provide privileged information $x_i^\star$ that is highly informative about the slack values $\xi_i$. Such privileged information allows for simple correcting functions $f_i^\star$, and the easy estimation of these correcting functions is a proxy to fast learning rates. Technically, this amounts to saying that a teacher is helpful whenever the capacity of her correcting functions is much smaller than the capacity of the student decision boundary.
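In knowledge transfer (Vapnik & Izmailov, 2015), the teacher first fits a function $f_t(x^\star) = \sum_{j=1}^m \alpha_j^\star k^\star(u_j^\star, x^\star)$ on the input-output pairs $\{(x_i^\star, y_i)\}_{i=1}^n$ and $f_t \in \mathcal{F}_t$, to find the best reduced set of prototype or basis points $\{u_j^\star\}_{j=1}^m$. Second, the student fits one function $g_j$ per set of input-output pairs $\{(x_i, k^\star(u_j^\star, x_i^\star))\}_{i=1}^n$, for all $1 \le j \le m$. Third, the student fits a new vector of coefficients $\alpha \in \mathbb{R}^m$ to obtain the final student function $f_s(x) = \sum_{j=1}^m \alpha_j g_j(x)$, using the input-output pairs $\{(x_i, y_i)\}_{i=1}^n$ and $f_s \in \mathcal{F}_s$. Since the representation $x_i^\star$ is intelligent, we assume that the function class $\mathcal{F}_t$ has small capacity, and thus allows for accurate estimation under small sample sizes.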
Distillation differs from similarity control in three ways. First, distillation is not restricted to SVMs. Second, while the SVM+ solution contains twice as many parameters as the original SVM, the user can choose a priori the number of parameters in the distilled classifier. Third, SVM+ learns the teacher correcting function and the student decision boundary simultaneously, but distillation proceeds sequentially: first with the teacher, then with the student. On the other hand, knowledge transfer is closer in spirit to distillation, but the two techniques differ: while knowledge transfer relies on a student that purely imitates the hidden representation of a low-rank kernel machine, distillation is a trade-off between imitating soft predictions and hard labels, using arbitrary learning algorithms.
The framework of learning using privileged information enjoys theoretical analysis (Pechyony & Vapnik, 2010) and multiple applications, including ranking (Sharmanska et al., 2013), computer vision (Sharmanska et al., 2014; Lopez-Paz et al., 2014), clustering (Feyereisl & Aickelin, 2012), metric learning (Fouad et al., 2013), Gaussian process classification (Hernández-Lobato et al., 2014), and finance (Ribeiro et al., 2010). Lapin et al. (2014) show that learning using privileged information is a particular instance of importance weighting.
4 Generalized distillation
We now have all the necessary background to describe generalized distillation. To this end, consider the data $\{(x_i, x_i^\star, y_i)\}_{i=1}^n$. Then, the process of generalized distillation is as follows:

1. Learn teacher $f_t \in \mathcal{F}_t$ using the input-output pairs $\{(x_i^\star, y_i)\}_{i=1}^n$ and Eq. 3.

2. Compute teacher soft labels $\{\sigma(f_t(x_i^\star) / T)\}_{i=1}^n$, using temperature parameter $T > 0$.

3. Learn student $f_s \in \mathcal{F}_s$ using the input-output pairs $\{(x_i, y_i)\}_{i=1}^n$, $\{(x_i, s_i)\}_{i=1}^n$, Eq. 4, and imitation parameter $\lambda \in [0, 1]$.^2

^2 Note that these three steps could be combined into a joint end-to-end optimization problem. For simplicity, our numerical simulations take each of these three steps sequentially.
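The three steps above can be sketched with linear softmax models trained by plain gradient descent. This is a minimal sketch under stated assumptions, not the authors' code: the regularizer is dropped, the helper names (`fit_linear`, `generalized_distillation`, `accuracy`) are hypothetical, and the toy data at the bottom is invented for illustration. The student step relies on the fact that cross-entropy is linear in its first argument, so the weighted sum of the hard-label and soft-label losses equals a single cross-entropy against the mixed target.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_linear(X, targets, steps=500, lr=0.5):
    """Gradient descent on the mean cross-entropy of a linear softmax model.

    `targets` holds one probability vector per row (hard or soft labels).
    """
    W = np.zeros((X.shape[1], targets.shape[1]))
    for _ in range(steps):
        grad = X.T @ (softmax(X @ W) - targets) / len(X)
        W -= lr * grad
    return W


def generalized_distillation(X, X_star, Y, T=1.0, lam=0.5):
    W_t = fit_linear(X_star, Y)              # step 1: teacher on (x*, y)
    S = softmax(X_star @ W_t / T)            # step 2: teacher soft labels
    # Step 3: (1 - lam) * CE(y, p) + lam * CE(s, p) = CE((1 - lam) y + lam s, p).
    W_s = fit_linear(X, (1.0 - lam) * Y + lam * S)
    return W_t, W_s


def accuracy(W, X, Y):
    return float(np.mean((X @ W).argmax(axis=1) == Y.argmax(axis=1)))


# Toy data, invented for illustration: privileged features are clean,
# regular features are a noisy copy of them.
rng = np.random.default_rng(0)
X_star = rng.normal(size=(200, 2))
Y = np.eye(2)[(X_star[:, 0] > 0).astype(int)]
X = X_star + 0.5 * rng.normal(size=(200, 2))

W_t, W_s = generalized_distillation(X, X_star, Y, T=2.0, lam=0.5)
print(accuracy(W_t, X_star, Y), accuracy(W_s, X, Y))
```

Only `W_s` is needed at test time, since the student consumes the regular features alone.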
We say that generalized distillation reduces to Hinton’s distillation if $x_i^\star = x_i$ for all $1 \le i \le n$ and $|\mathcal{F}_s|_C \ll |\mathcal{F}_t|_C$, where $|\cdot|_C$ is an appropriate function class capacity measure. Conversely, we say that generalized distillation reduces to Vapnik’s learning using privileged information if $x_i^\star$ is a privileged description of $x_i$, and $|\mathcal{F}_s|_C \gg |\mathcal{F}_t|_C$.
This comparison reveals a subtle difference between Hinton’s distillation and Vapnik’s privileged information. In Hinton’s distillation, $\mathcal{F}_t$ is flexible, so that the teacher can exploit her general-purpose representation $x_i^\star = x_i$ to learn intricate patterns from large amounts of labeled data. In Vapnik’s privileged information, $\mathcal{F}_t$ is simple, so that the teacher can exploit her rich representation $x_i^\star \neq x_i$ to learn intricate patterns from small amounts of labeled data. The space of privileged information is thus a specialized space, one of “metaphoric language”. In our running example of biopsy images, the space of medical reports is much more specialized than the space of pixels, since the space of pixels can also describe buildings, animals, and other unrelated concepts. In any case, the teacher must develop a language that effectively communicates information to help the student come up with better representations. The teacher may do so by incorporating invariances into her predictions, or by biasing them towards being robust with respect to the kind of distribution shifts that the teacher may expect at test time. In general, having a teacher is one opportunity to learn characteristics about the decision boundary which are not contained in the training sample, in analogy to a good Bayesian prior.
4.1 Why does generalized distillation work?
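Recall our three actors: the student function $f_s \in \mathcal{F}_s$, the teacher function $f_t \in \mathcal{F}_t$, and the real target function of interest to both the student and the teacher, $f \in \mathcal{F}$. For simplicity, consider pure distillation (set the imitation parameter to $\lambda = 1$). Furthermore, we make some assumptions about how the student, teacher, and true function interplay when learning from data. First, assume that the student may learn the true function $f$ at a slow rate
$$R(f_s) - R(f) \le O\left(\frac{|\mathcal{F}_s|_C}{\sqrt{n}}\right) + \varepsilon_s,$$
where the $O(\cdot)$ term is the estimation error, $|\cdot|_C$ is a function class capacity measure, and $\varepsilon_s$ is the approximation error of the student function class $\mathcal{F}_s$ with respect to $f \in \mathcal{F}$. Second, assume that the better representation of the teacher allows her to learn at the fast rate
$$R(f_t) - R(f) \le O\left(\frac{|\mathcal{F}_t|_C}{n}\right) + \varepsilon_t,$$
where $\varepsilon_t$ is the approximation error of the teacher function class $\mathcal{F}_t$ with respect to $f \in \mathcal{F}$. Finally, assume that when the student learns from the teacher, she does so at the rate
$$R(f_s) - R(f_t) \le O\left(\frac{|\mathcal{F}_s|_C}{n^{\alpha}}\right) + \varepsilon_l,$$
where $\varepsilon_l$ is the approximation error of the student function class $\mathcal{F}_s$ with respect to $f_t \in \mathcal{F}_t$, and $\frac{1}{2} \le \alpha \le 1$. Then, the rate at which the student learns the true function $f$ admits the alternative expression
$$\begin{aligned}
R(f_s) - R(f) &= R(f_s) - R(f_t) + R(f_t) - R(f) \\
&\le O\left(\frac{|\mathcal{F}_s|_C}{n^{\alpha}}\right) + \varepsilon_l + O\left(\frac{|\mathcal{F}_t|_C}{n}\right) + \varepsilon_t \\
&\le O\left(\frac{|\mathcal{F}_s|_C + |\mathcal{F}_t|_C}{n^{\alpha}}\right) + \varepsilon_l + \varepsilon_t,
\end{aligned}$$
where the last inequality follows because $\alpha \le 1$. Thus, the question at hand is to argue, for a given learning problem, if the inequality
$$O\left(\frac{|\mathcal{F}_s|_C + |\mathcal{F}_t|_C}{n^{\alpha}}\right) + \varepsilon_l + \varepsilon_t \le O\left(\frac{|\mathcal{F}_s|_C}{\sqrt{n}}\right) + \varepsilon_s$$
holds. The inequality highlights that the benefits of learning with a teacher arise due to i) the capacity of the teacher being small, ii) the approximation error of the teacher being smaller than the approximation error of the student, and iii) the coefficient $\alpha$ being greater than $\frac{1}{2}$. Remarkably, these factors embody the assumptions of privileged information from Vapnik & Izmailov (2015). The inequality is also reasonable under the main assumption in Hinton et al. (2015), which is $\varepsilon_s \gg \varepsilon_t + \varepsilon_l$. Moreover, the inequality highlights that the teacher is most helpful in low-data regimes, such as small datasets, Bayesian optimization, domain adaptation, transfer learning, or the initial stages of online and reinforcement learning.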
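As a rough numerical illustration of the comparison above, the sketch below evaluates both sides under strong simplifying assumptions: each bound is taken to have the form "capacity divided by a power of the sample size, plus approximation error", all constants hidden by the big-O notation are set to one, and the capacities and errors are invented numbers. The function names are hypothetical.

```python
def student_alone_bound(n, cap_s, eps_s):
    """Slow-rate bound for a student learning without a teacher."""
    return cap_s / n ** 0.5 + eps_s


def student_with_teacher_bound(n, cap_s, cap_t, alpha, eps_t, eps_l):
    """Bound for a student who imitates a teacher, with 1/2 <= alpha <= 1."""
    return (cap_s + cap_t) / n ** alpha + eps_t + eps_l


def teacher_helps(n, cap_s, cap_t, alpha, eps_s, eps_t, eps_l):
    with_teacher = student_with_teacher_bound(n, cap_s, cap_t, alpha, eps_t, eps_l)
    return with_teacher <= student_alone_bound(n, cap_s, eps_s)


# Invented values: a flexible student, a low-capacity teacher, no approximation error.
settings = dict(n=50, cap_s=100.0, cap_t=5.0, eps_s=0.0, eps_t=0.0, eps_l=0.0)
helps_fast = teacher_helps(alpha=1.0, **settings)   # student imitates at a fast rate
helps_slow = teacher_helps(alpha=0.5, **settings)   # no acceleration from imitation
print(helps_fast, helps_slow)
```

With these numbers the teacher helps only when imitation is faster than the slow rate, matching condition iii) in the text.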
We believe that the “$\alpha > \frac{1}{2}$ case” is a general situation, since soft labels (dense vectors with a real number of information per class) contain more information per example than hard labels (one-hot-encoding vectors with one bit of information per class), and should allow for faster learning. This additional information, also understood as label uncertainty, relates to the acceleration in SVM+ due to the knowledge of slack values. Since a good teacher smooths the decision boundary and instructs the student to fail on difficult examples, the student can focus on the remaining body of data. Although this translates into the unambitious “whatever my teacher could not do, I will not do”, the imitation parameter $\lambda$ in (4) allows us to follow this rule safely, and fall back to regular learning if necessary.
4.2 Extensions
Semisupervised learning
We now extend generalized distillation to the situation where examples lack regular features, privileged features, labels, or a combination of the three. In the following, we denote missing elements by $\square$. For instance, the example $(x_i, \square, y_i)$ has no privileged features, and the example $(x_i, x_i^\star, \square)$ is missing its label. Using this convention, we introduce the clean subset notation
$$c(S) = \{v : v \in S, \; v_i \neq \square \;\; \forall i\}.$$
Then, semi-supervised generalized distillation walks the same three steps as generalized distillation, enumerated at the beginning of Section 4, but uses the appropriate clean subsets instead of the whole data. For example, the semi-supervised extension of distillation allows the teacher to prepare soft labels $s_i$ for all the unlabeled data $c(\{(x_i, x_i^\star)\}_{i=1}^n)$. These soft labels are additional information available to the student to learn the teacher representation $f_t$.
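The clean subset operation amounts to filtering out tuples with missing entries. The sketch below is one possible encoding, assuming that a missing element is represented by Python's `None`; the helper name `clean_subset` and the placeholder strings standing in for feature vectors are hypothetical.

```python
MISSING = None  # assumption: missing elements are encoded as None


def clean_subset(samples):
    """Keep only the tuples in which no element is missing."""
    return [v for v in samples if all(e is not MISSING for e in v)]


# Each triplet is (regular features, privileged features, label).
data = [
    ("x1", "xs1", 1),
    ("x2", MISSING, 0),      # no privileged features
    ("x3", "xs3", MISSING),  # no label
]

# The teacher trains on examples that have both privileged features and a label.
teacher_train = clean_subset([(xs, y) for _, xs, y in data])
# Soft labels can be computed wherever both feature views are present.
feature_pairs = clean_subset([(x, xs) for x, xs, _ in data])
print(teacher_train, feature_pairs)
```

Using an identity check against the sentinel keeps a legitimate label of `0` from being mistaken for a missing value.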
Learning with the Universum
The unlabeled data can belong to one of the classes of interest, or be Universum data (Weston et al., 2006; Chapelle et al., 2007). Universum data may have labels: in this case, one can exploit these additional labels by i) training a teacher that distinguishes amongst all classes (those of interest and those from the Universum), ii) computing soft class-probabilities only for the classes of interest, and iii) distilling these soft probabilities into a student function.
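Step ii) leaves some freedom in how the probabilities are restricted to the classes of interest. One possible reading, sketched below as an assumption rather than the authors' procedure, is to apply the tempered softmax to the teacher scores of the classes of interest only; this coincides with computing the full tempered softmax and renormalizing over those classes. The function name and the example scores are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def soft_labels_of_interest(teacher_scores, interest, T=1.0):
    """Tempered softmax over the columns listed in `interest` only."""
    scores = np.asarray(teacher_scores, dtype=float)[:, interest]
    return softmax(scores / T)


# One example scored over three classes of interest plus one Universum class.
teacher_scores = np.array([[2.0, 0.5, -1.0, 3.0]])
interest = [0, 1, 2]
S = soft_labels_of_interest(teacher_scores, interest, T=2.0)

# Equivalent route: full tempered softmax, then renormalize the kept columns.
full = softmax(teacher_scores / 2.0)[:, interest]
renormalized = full / full.sum(axis=1, keepdims=True)
print(S, renormalized)
```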
Learning from multiple tasks
Generalized distillation applies to some domain adaptation, transfer learning, or multitask learning scenarios. On the one hand, if the multiple tasks share the same labels but differ in their input modalities, the input modalities from the source tasks are privileged information. On the other hand, if the multiple tasks share the same input modalities but differ in their labels, the labels from the source tasks are privileged information. In both cases, the regular student representation is the input modality from the target task.
Curriculum and reinforcement learning
We conjecture that the uncertainty in the teacher soft predictions can be used as a mechanism to rank the difficulty of training examples, and use these ranks for curriculum learning (Bengio et al., 2009). Furthermore, distillation resembles imitation, a technique that learning agents could exploit in reinforcement learning environments.
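One way to turn this conjecture into a procedure, offered here only as an illustrative assumption, is to measure the uncertainty of each soft prediction by its Shannon entropy and to present low-entropy (easy) examples first. The helper names and the example soft labels are hypothetical.

```python
import numpy as np


def prediction_entropy(S, eps=1e-12):
    """Shannon entropy of each row of soft predictions S."""
    S = np.asarray(S, dtype=float)
    return -np.sum(S * np.log(S + eps), axis=1)


def curriculum_order(S):
    """Indices of the examples sorted from most to least confident."""
    return np.argsort(prediction_entropy(S), kind="stable")


# Made-up teacher soft labels for three training examples.
S = np.array([
    [0.50, 0.50],   # maximally uncertain
    [0.99, 0.01],   # very confident
    [0.80, 0.20],   # moderately confident
])
order = curriculum_order(S)
print(order)
```

Other uncertainty measures, such as the margin between the two largest probabilities, would fit the same interface.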
4.3 A causal perspective on generalized distillation
The assumption of independence of cause and mechanisms states that “the probability distribution of a cause is often independent from the process mapping this cause into its effects” (Schölkopf et al., 2012). Under this assumption, for instance, causal learning problems (i.e., those where the features cause the labels) do not benefit from semi-supervised learning, since by the independence assumption, the marginal distribution of the features contains no information about the function mapping features to labels. Conversely, anticausal learning problems (those where the labels cause the features) may benefit from semi-supervised learning.
Causal implications also arise in generalized distillation. First, if the privileged features $x_i^\star$ only add information about the marginal distribution of the regular features $x_i$, the teacher should be able to help only in anticausal learning problems. Second, if the teacher provides additional information about the conditional distribution of the labels $y_i$ given the inputs $x_i$, it should also help in the causal setting. We will confirm this hypothesis in the next section.
5 Numerical simulations
We now present some experiments to illustrate when the distillation of privileged information is effective, and when it is not. The necessary Python code to replicate all the following experiments is available at http://github.com/lopezpaz.
We start with four synthetic experiments, designed to minimize modeling assumptions and to illustrate different prototypical types of privileged information. These are simulations of logistic regression models repeated over random partitions, where we use samples for training, and samples for testing. The dimensionality of the regular features is , and the involved separating hyperplanes follow the distribution . For each experiment, we report the test accuracy when i) using the teacher explanations at both train and test time, ii) using the regular features at both train and test time, and iii) distilling the teacher explanations into the student classifier with .
1. Clean labels as privileged information
We sample triplets from:
Here, each teacher explanation $x_i^\star$ is the exact distance to the decision boundary for each $x_i$, but the data labels $y_i$ are corrupt. This setup aligns with the assumptions about slacks in the similarity control framework of Vapnik & Vashist (2009). We obtained a privileged test classification accuracy of , a regular test classification accuracy of , and a distilled test classification accuracy of . This illustrates that distillation of privileged information is an effective means to detect outliers in label space.
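A data generator in the spirit of this setup can be sketched as follows. Everything specific is an assumption made for illustration: standard normal features and hyperplane, additive Gaussian noise as the label corruption mechanism, and the sample size and dimensionality in the usage lines. The privileged feature is the inner product with the hyperplane, which is the signed distance to the boundary up to the scale of the hyperplane. `sample_clean_label_triplets` is a hypothetical name.

```python
import numpy as np


def sample_clean_label_triplets(n, d, noise_std=1.0, seed=0):
    """Draw (x, x_star, y): x_star is noise-free, y is corrupted by noise."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=d)                  # separating hyperplane (assumed Gaussian)
    x = rng.normal(size=(n, d))                 # regular features
    x_star = x @ alpha                          # privileged: signed distance, up to scale
    noise = noise_std * rng.normal(size=n)      # assumed corruption mechanism
    y = (x_star + noise > 0).astype(int)        # corrupted labels
    return x, x_star, y, alpha


x, x_star, y, alpha = sample_clean_label_triplets(n=200, d=50)
# Fraction of labels that agree with the clean decision rule.
agreement = float(np.mean(y == (x_star > 0).astype(int)))
print(x.shape, x_star.shape, agreement)
```

Labels disagree with the clean rule mostly for points close to the boundary, which is where the privileged feature is most informative.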
2. Clean features as privileged information
We sample triplets from:
In this setup, the teacher explanations $x_i^\star$ are clean versions of the regular features $x_i$ available at test time. We obtained a privileged test classification accuracy of , a regular test classification accuracy of , and a distilled test classification accuracy of . This improvement is not statistically significant, because the intelligent explanations $x_i^\star$ are independent of the noise polluting the regular features $x_i$. Therefore, there exists no additional information transferable from the teacher to the student.
3. Relevant features as privileged information
We sample triplets from:
where the set , with , is a subset of the variable indices chosen at random but common for all samples. In other words, the teacher explanations indicate the values of the variables relevant for classification, which translates into a reduction of the dimensionality of the data that we have to learn from. We obtained a privileged test classification accuracy of , a regular test classification accuracy of , and a distilled test classification accuracy of . This illustrates that distillation of privileged information is an effective tool for feature selection.
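The following sketch generates data of this kind. The Gaussian distributions, the number of relevant variables `k`, and the sizes in the usage lines are assumptions chosen for illustration, and `sample_relevant_feature_triplets` is a hypothetical name.

```python
import numpy as np


def sample_relevant_feature_triplets(n, d, k, seed=0):
    """Draw (x, x_star, y) where x_star holds the k label-relevant variables."""
    rng = np.random.default_rng(seed)
    J = rng.choice(d, size=k, replace=False)   # relevant indices, shared by all samples
    alpha = rng.normal(size=d)                 # hyperplane (assumed Gaussian)
    x = rng.normal(size=(n, d))                # regular features
    x_star = x[:, J]                           # privileged: only the relevant variables
    y = (x_star @ alpha[J] > 0).astype(int)    # labels depend on x_star alone
    return x, x_star, y, J


x, x_star, y, J = sample_relevant_feature_triplets(n=200, d=50, k=3)
print(x.shape, x_star.shape, sorted(J.tolist()))
```

Because the labels depend only on the selected columns, a teacher trained on `x_star` faces a problem of dimension `k` rather than `d`.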
4. Sample-dependent relevant features as privileged information
Sample triplets
where the sets , with for all , are a subset of the variable indices chosen at random for each sample $x_i$. One interpretation of such a model is that of bounding boxes in computer vision: each high-dimensional vector $x_i$ would be an image, and each teacher explanation $x_i^\star$ would be the pixels inside a bounding box locating the concept of interest (Sharmanska et al., 2013). We obtained a privileged test classification accuracy of , a regular test classification accuracy of , and a distilled test classification accuracy of . Note that although the classification is linear in $x_i^\star$, this is not the case in terms of $x_i$. Therefore, although we have misspecified the function class for this problem, the distillation approach did not deteriorate the final performance.
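A per-sample variant of the generator can be sketched with `np.take_along_axis`, which gathers a different set of columns for every row. As before, the Gaussian distributions, the subset size `k`, and the sizes in the usage lines are illustrative assumptions, and the function name is hypothetical.

```python
import numpy as np


def sample_dependent_triplets(n, d, k, seed=0):
    """Draw (x, x_star, y) where each sample has its own relevant indices."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=d)                       # hyperplane (assumed Gaussian)
    x = rng.normal(size=(n, d))                      # regular features
    # One random index set per sample, stacked into an (n, k) integer array.
    J = np.stack([rng.choice(d, size=k, replace=False) for _ in range(n)])
    x_star = np.take_along_axis(x, J, axis=1)        # row i keeps columns J[i]
    y = (np.sum(alpha[J] * x_star, axis=1) > 0).astype(int)
    return x, x_star, y, J


x, x_star, y, J = sample_dependent_triplets(n=200, d=50, k=3)
print(x.shape, x_star.shape, J[0])
```

Since the relevant columns change from row to row, no single linear function of `x` reproduces the labels, which is the misspecification the paragraph above refers to.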
The previous four experiments set up causal learning problems. In the second experiment, the privileged features $x_i^\star$ add no information about the target function mapping the regular features to the labels, so the causal hypothesis from Section 4.3 justifies the lack of improvement. The first and third experiments provide privileged information that adds information about the target function, and it is therefore beneficial to distill this information. The fourth example illustrates that privileged features adding information about the target function is not a sufficient condition for improvement.
5. MNIST handwritten digit image classification
The privileged features are the original 28x28 pixels MNIST handwritten digit images (LeCun et al., 1998b), and the regular features are the same images downscaled to 7x7 pixels. We use or samples to train both the teacher and the student, and test their accuracies at multiple levels of temperature and imitation on the full test set. Both student and teacher are neural networks composed of two hidden layers of rectified linear units and a softmax output layer (the same networks are used in the remaining experiments). Figure 1 summarizes the results of this experiment, where we see a significant improvement in classification accuracy when distilling the privileged information, with respect to using the regular features alone. As expected, the benefits of distillation diminished as we further increased the sample size.
6. Semi-supervised learning
We explore the semi-supervised capabilities of generalized distillation on the CIFAR-10 dataset (Krizhevsky, 2009). Here, the privileged features are the original 32x32 pixels CIFAR-10 color images, and the regular features are the same images when polluted with additive Gaussian noise. We provide labels for images, and unlabeled privileged and regular features for the rest of the training set. Thus, the teacher trains on images, but computes the soft labels for the whole training set of images. The student then learns by distilling the original hard labels and the soft predictions. As seen in Figure 2, the soft labeling of unlabeled data results in a significant improvement with respect to pure student supervised classification. Distillation on the labeled samples did not improve the student performance. This illustrates the importance of semi-supervised distillation on this data. We believe that the drops in performance for some distillation temperatures are due to the lack of a proper weighting between labeled and unlabeled data in (4).
7. Multitask learning
The SARCOS dataset (Vijayakumar, 2000) characterizes the 7 joint torques of a robotic arm given 21 real-valued features. Thus, this is a multitask learning problem, formed by 7 regression tasks. We learn a teacher on samples to predict each of the 7 torques given the other 6, and then distill this knowledge into a student who uses as her regular input space the 21 real-valued features. Figure 2 illustrates the performance improvement in mean squared error when using generalized distillation to address the multitask learning problem. When distilling at the proper temperature, distillation allowed the student to match her teacher's performance.
Acknowledgments
We thank R. Nishihara, R. Izmailov, I. Tolstikhin, and C. J. Simon-Gabriel for discussions.
References
 Ba & Caruana (2014) Ba, Jimmy and Caruana, Rich. Do deep nets really need to be deep? In NIPS, 2014.
 Bengio et al. (2009) Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In ICML, 2009.
 Bottou (2014) Bottou, Léon. From machine learning to machine reasoning. Machine learning, 94(2):133–149, 2014.
 Buciluǎ et al. (2006) Buciluǎ, Cristian, Caruana, Rich, and Niculescu-Mizil, Alexandru. Model compression. In KDD, 2006.
 Burges & Schölkopf (1997) Burges, Christopher and Schölkopf, Bernhard. Improving the accuracy and speed of support vector learning machines. In NIPS, 1997.
 Chapelle et al. (2007) Chapelle, Olivier, Agarwal, Alekh, Sinz, Fabian H, and Schölkopf, Bernhard. An analysis of inference with the Universum. In NIPS, 2007.
 Feyereisl & Aickelin (2012) Feyereisl, Jan and Aickelin, Uwe. Privileged information for data clustering. Information Sciences, 194:4–23, 2012.
 Fouad et al. (2013) Fouad, Shereen, Tino, Peter, Raychaudhury, Somak, and Schneider, Petra. Incorporating privileged information through metric learning. Neural Networks and Learning Systems, 24(7):1086–1098, 2013.
 Hernández-Lobato et al. (2014) Hernández-Lobato, Daniel, Sharmanska, Viktoriia, Kersting, Kristian, Lampert, Christoph H, and Quadrianto, Novi. Mind the nuisance: Gaussian process classification using privileged noise. In NIPS, 2014.
 Hinton et al. (2015) Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. arXiv, 2015.
 Krizhevsky (2009) Krizhevsky, Alex. The CIFAR-10 and CIFAR-100 datasets, 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
 Lapin et al. (2014) Lapin, Maksim, Hein, Matthias, and Schiele, Bernt. Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53:95–108, 2014.
 LeCun et al. (1998a) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a.
 LeCun et al. (1998b) LeCun, Yann, Cortes, Corinna, and Burges, Christopher JC. The MNIST database of handwritten digits, 1998b. URL http://yann.lecun.com/exdb/mnist/.
 Lopez-Paz et al. (2014) Lopez-Paz, David, Sra, Suvrit, Smola, Alex, Ghahramani, Zoubin, and Schölkopf, Bernhard. Randomized nonlinear component analysis. In ICML, 2014.
 Mikolov et al. (2013) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv, 2013.
 Oquab et al. (2014) Oquab, Maxime, Bottou, Léon, Laptev, Ivan, and Sivic, Josef. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, pp. 1717–1724, 2014.
 Pechyony & Vapnik (2010) Pechyony, Dmitry and Vapnik, Vladimir. On the theory of learning with privileged information. In NIPS, 2010.
 Ribeiro et al. (2010) Ribeiro, Bernardete, Silva, Catarina, Vieira, Armando, Gaspar-Cunha, António, and das Neves, João C. Financial distress model prediction using SVM+. In IJCNN. IEEE, 2010.
 Schölkopf et al. (2012) Schölkopf, Bernhard, Janzing, Dominik, Peters, Jonas, Sgouritsa, Eleni, Zhang, Kun, and Mooij, Joris. On causal and anticausal learning. In ICML, 2012.
 Sharmanska et al. (2013) Sharmanska, Viktoriia, Quadrianto, Novi, and Lampert, Christoph H. Learning to rank using privileged information. In ICCV, 2013.
 Sharmanska et al. (2014) Sharmanska, Viktoriia, Quadrianto, Novi, and Lampert, Christoph H. Learning to transfer privileged information. arXiv, 2014.
 Vapnik (1998) Vapnik, Vladimir. Statistical learning theory. Wiley New York, 1998.
 Vapnik & Izmailov (2015) Vapnik, Vladimir and Izmailov, Rauf. Learning using privileged information: Similarity control and knowledge transfer. JMLR, 16:2023–2049, 2015.
 Vapnik & Vashist (2009) Vapnik, Vladimir and Vashist, Akshay. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5):544–557, 2009.
 Vijayakumar (2000) Vijayakumar, Sethu. The SARCOS dataset, 2000. URL http://www.gaussianprocess.org/gpml/data/.
 Weston et al. (2006) Weston, Jason, Collobert, Ronan, Sinz, Fabian, Bottou, Léon, and Vapnik, Vladimir. Inference with the Universum. In ICML, 2006.