Semi-supervised deep learning by metric embedding
Abstract
Deep networks are successfully used as classification models, yielding state-of-the-art results when trained on a large number of labeled samples. These models, however, are usually much less suited for semi-supervised problems because of their tendency to overfit easily when trained on small amounts of data. In this work we explore a new training objective that targets a semi-supervised regime with only a small subset of labeled data. This criterion is based on a deep metric embedding over distance relations within the set of labeled samples, together with constraints over the embeddings of the unlabeled set. The learned representations are discriminative in Euclidean space, and hence can be used for subsequent nearest-neighbor classification using the labeled samples.
Elad Hoffer
Technion - Israel Institute of Technology
Haifa, Israel
ehoffer@tx.technion.ac.il

Nir Ailon
Technion - Israel Institute of Technology
Haifa, Israel
nailon@cs.technion.ac.il
1 Introduction
Deep neural networks have been shown to perform very well on various classification problems, often yielding state-of-the-art results. A key motivation for the use of these models is the assumed hierarchical nature of the underlying problem. This assumption is reflected in the structure of NNs, composed of multiple stacked layers of linear transformations followed by nonlinear activation functions. The final NN layer is usually a softmax-activated linear transformation indicating the likelihood of each class, which can be trained by cross-entropy using the known target of each sample and backpropagated to previous layers. The hierarchical property of NNs has been observed to yield high-quality, discriminative representations of the input in intermediate layers. These representative features, although not explicitly part of the training objective, were shown to be useful in subsequent tasks in the same domain, as demonstrated by Razavian et al. (2014). One serious problem occurring in neural networks is their susceptibility to overfitting on the training data. Because of this, a considerable part of modern neural network research is devoted to regularization techniques and heuristics such as Srivastava et al. (2014); Ioffe & Szegedy (2015); Wan et al. (2013); Szegedy et al. (2015), which allow the networks to generalize to unseen data samples. The tendency to overfit is most apparent in problems with a very small number of training examples per class, which are considered ill-suited for neural network models. For this reason, semi-supervised regimes, in which most data is unlabeled, are considered hard to learn and generalize with NNs.
In this work we consider a new training criterion designed for deep neural networks in semi-supervised regimes over datasets with a small subset of labeled samples. Instead of the usual cross-entropy between the labeled samples and the ground-truth class indicators, we use the labeled examples as targets for a metric embedding. Under this embedding, which is the mapping of a parameterized deep network, the features of labeled examples are grouped together in Euclidean space. In addition, we use these learned embeddings to assign each unlabeled example to one of the distinct clusters formed by the labeled samples. We show that this constraint translates to a minimum entropy criterion over the embedded distances. Finally, because the learned features admit a Euclidean-space interpretation, we are able to use a subsequent nearest-neighbor classifier to achieve state-of-the-art results on problems with a small number of labeled examples.
2 Related work
2.1 Learning metric embedding
Previous works have shown the possible use of neural networks to learn useful metric embeddings. One such framework is the "Siamese network" introduced by Bromley et al. (1993) and later used in the work of Chopra et al. (2005). One use for these methods is when the number of classes is too large or expected to vary over time, as in the case of face verification, where a face contained in an image has to be compared against another image of a face. This problem was recently tackled by Schroff et al. (2015) by training a convolutional network model on triplets of examples. Learning features by metric embedding was also shown by Hoffer & Ailon (2015) to provide classification accuracy competitive with conventional cross-entropy regression. This work is also related to Rippel et al. (2015), who introduced the Magnet loss, a metric embedding approach for fine-grained classification. The Magnet loss is based on learning the distribution of distances for each sample, from clusters assigned to each classified class. It then uses an intermediate k-means clustering to reposition the different assigned clusters. This proved to allow better accuracy than both the margin-based triplet loss and softmax regression. Using metric embedding with neural networks was also specifically shown to provide good results in the semi-supervised learning setting by Weston et al. (2012).
2.2 Semi-supervised learning by adversarial regularization
As stated before, a key approach to generalizing from a small training set is to regularize the learned model. Regularization techniques can often be interpreted as a prior over model parameters or structure, such as regularization over the network weights or activations. More recently, neural-network-specific regularizations that induce noise within the training process, such as Srivastava et al. (2014); Wan et al. (2013); Szegedy et al. (2015), proved highly beneficial for avoiding overfitting. Another recent observation by Goodfellow et al. (2015) is that training on adversarial examples, inputs that were found to be misclassified under small perturbations, can improve generalization. This fact was explored by Feng et al. (2016), and was found to provide notable improvements in the semi-supervised regime by Miyato et al. (2015).
2.3 Semi-supervised learning by auxiliary reconstruction loss
Recently, stacked denoising autoencoder architectures showed promising results in both semi-supervised and unsupervised tasks. The stacked what-where autoencoder by Zhao et al. (2015) computes a set of complementary variables that enable reconstruction whenever a layer implements a many-to-one mapping. Ladder networks by Rasmus et al. (2015) use lateral connections to allow higher levels of an autoencoder to focus on invariant abstract features by applying a layer-wise cost function.
The generative adversarial network (GAN) is a recently introduced model that can be used in an unsupervised fashion (Goodfellow et al., 2014). GANs use a pair of networks: one trained to discriminate between data sampled from the true underlying distribution (e.g., a set of images) and data produced by a separate generative network, which is trained as an adversary trying to confuse the first network. By propagating the gradient through the paired networks, the model learns to generate samples that are distributed similarly to the source data. As shown by Radford et al. (2015), this model can create useful latent representations for subsequent classification tasks. The usage of these models for semi-supervised learning was further developed by Springenberg (2016) and Salimans et al. (2016), by extending the discriminator into a classifier with one output per class plus an additional "fake" class. This proved to allow excellent accuracy with only a small subset of labeled examples.
2.4 Semi-supervised learning by entropy minimization
Another technique for semi-supervised learning, introduced by Grandvalet & Bengio (2004), is concerned with minimizing the entropy of the expected class distribution for unlabeled examples. Regularizing for minimum entropy can be seen as a prior which prefers minimum overlap between observed classes. This can also be seen as a generalization of the "self-training" wrapper method described by Triguero et al. (2015), in which unlabeled examples are reintroduced after being labeled with the model's previous predictions. This is also related to transductive support vector machines (TSVMs) (Vapnik & Vapnik, 1998), which introduce a maximum-margin objective over both labeled and unlabeled examples.
3 Our contribution: Neighbor embedding for semi-supervised learning
In this work we are concerned with a semi-supervised setting, in which learning is done on data of which only a small subset is labeled. Given an observed labeled set of sample-label pairs and an unlabeled set of samples, we wish to learn a classifier that has a minimum expected error on some unseen test data.
We will make the following assumption regarding the given data: the number of labeled examples is small compared to the whole observed set.
Motivated by this setting, we learn a metric embedding that forms clusters such that samples can be classified by their distance to the labeled examples using a nearest-neighbor procedure.
We will now define our learning setting on the semi-labeled data, using a neural network model denoted F(x; θ), where x is the input fed into the network and θ are the optimized parameters (dropped henceforth for convenience). The output of the network for each sample is a vector F(x) of d features, which will be used to represent the input.
The two training objectives by which we aim to train our embedding network are:

(i) Create feature representations that form clusters from the labeled examples, such that two examples sharing the same label will have a smaller embedded distance than any third example with a different label.

(ii) For each unlabeled example, its feature embedding will be close to the embeddings of one specific label occurring in the labeled set:
for every unlabeled example u, there exists a specific class c such that ||F(u) - F(x_c)|| < ||F(u) - F(x_i)||, where x_c is any labeled example of class c and x_i is any labeled example of a different class i.
As the defined objectives create embeddings that target nearest-neighbor classification with regard to the labeled set, we will refer to it as "Neighbor embedding".
4 Learning by distance comparisons
We will define a discrete distribution for the embedded distance between a sample x and labeled examples x_1, ..., x_k, each belonging to a different class. Writing F for the embedding network:

P(x; x_1, ..., x_k)_i = exp(-||F(x) - F(x_i)||^2) / Σ_{j=1}^{k} exp(-||F(x) - F(x_j)||^2)    (1)
This definition assigns a probability for a sample x to be classified into class i under a 1-nn classification rule, when the neighbors x_1, ..., x_k are given. It is similar to the stochastic-nearest-neighbors formulation of Goldberger et al. (2004), and will allow us to state the two underlying objectives as measures over this distribution.
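As a concrete illustration, this distribution can be computed as a softmax over negative squared Euclidean distances. The following sketch is ours (not the paper's released code); the function name is illustrative, and it assumes embeddings are given as NumPy arrays with one labeled anchor per class:

```python
import numpy as np

def distance_distribution(z, anchors):
    """Probability of assigning the embedded sample z to each class,
    as a softmax over negative squared Euclidean distances to one
    labeled anchor embedding per class."""
    sq_dists = np.sum((anchors - z) ** 2, axis=1)   # shape (k,)
    logits = -sq_dists
    logits -= logits.max()                          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# A sample lying near the first anchor receives most of the probability mass.
anchors = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
p = distance_distribution(np.array([0.1, 0.0]), anchors)
```

A smaller embedded distance to an anchor yields a larger probability for that anchor's class, matching the stochastic 1-nn interpretation.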
4.1 Distance ratio criterion
Addressing objective (i), we will use a sample x from the labeled set belonging to class c, along with a set of sampled labeled examples x_1, ..., x_k, one from each class. In this work we sample uniformly over all available samples for each class.
Defining the class indicator as 1_c(i) = 1 if i = c and 0 otherwise, we will minimize the cross-entropy between 1_c and the distance distribution of Eq. 1:
L(x, x_1, ..., x_k) = -Σ_{i=1}^{k} 1_c(i) log P(x; x_1, ..., x_k)_i    (2)
This is in fact a slightly modified version of the distance ratio loss introduced in Hoffer & Ailon (2015):
L(x, x_1, ..., x_k) = -log [ exp(-||F(x) - F(x_c)||^2) / Σ_{j=1}^{k} exp(-||F(x) - F(x_j)||^2) ]    (3)
This loss aims to ensure that samples belonging to the same class are mapped to a small embedded distance compared with samples from different classes.
4.2 Minimum entropy criterion
Another part of the optimized criterion, inspired by Grandvalet & Bengio (2004), is designed to reduce the overlap between the different classes of the unlabeled samples.
We will promote this objective by minimizing the entropy of the underlying distance distribution (Eq. 1) of an unlabeled sample u, again with respect to the labeled samples x_1, ..., x_k:
L_unsup(u, x_1, ..., x_k) = H(P(u; x_1, ..., x_k))    (4)
which is defined as
H(P(u; x_1, ..., x_k)) = -Σ_{i=1}^{k} P(u; x_1, ..., x_k)_i log P(u; x_1, ..., x_k)_i    (5)
We note that this entropy is lower if the distribution of Eq. 1 is sparse (concentrated on a single class), and higher if it is dense (spread over many classes), and this intuition is compatible with our objectives.
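The intuition above can be checked numerically. The sketch below is ours (illustrative, not the authors' code) and assumes the same softmax-over-negative-squared-distances distribution: the entropy is low when an unlabeled embedding sits near a single anchor and highest when it is equidistant from all anchors:

```python
import numpy as np

def distance_entropy(z, anchors):
    """Entropy of the distance distribution of an embedded sample z,
    with respect to one labeled anchor embedding per class."""
    logits = -np.sum((anchors - z) ** 2, axis=1)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return -np.sum(p * np.log(p + 1e-12))  # small epsilon avoids log(0)

anchors = np.array([[0.0, 0.0], [4.0, 0.0]])
h_near = distance_entropy(np.array([0.1, 0.0]), anchors)  # near one anchor
h_mid = distance_entropy(np.array([2.0, 0.0]), anchors)   # equidistant
```

Minimizing this entropy therefore pushes each unlabeled embedding toward one of the labeled clusters.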
Our final objective will use a sampled set of labeled examples x_1, ..., x_k, in which each class is represented, together with additional labeled and unlabeled examples, combining a weighted sum of both Eq. 3 and Eq. 5 to form:
L = α Σ_{labeled x} L(x, x_1, ..., x_k) + β Σ_{unlabeled u} H(P(u; x_1, ..., x_k))    (6)
where α and β are used to determine the weight assigned to each criterion.
This loss is differentiable and hence can be used for gradient-based training of deep models by existing optimization approaches and backpropagation (Rumelhart et al., 1986) through the embedding neural network. The optimization can further be accelerated computationally by using minibatches of both labeled and unlabeled examples.
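A minimal sketch of the combined minibatch criterion follows. This is our illustrative code, not the paper's implementation: the function name and the default weights are assumptions, and the gradient step through the embedding network is omitted (here the inputs are already-embedded vectors):

```python
import numpy as np

def _dist_probs(z, anchors):
    """Softmax over negative squared distances to one anchor per class."""
    logits = -np.sum((anchors - z) ** 2, axis=1)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def neighbor_embedding_loss(labeled_z, labels, unlabeled_z, anchors,
                            alpha=1.0, beta=1.0):
    """Weighted sum of the distance-ratio loss over labeled embeddings
    and the entropy of the distance distribution over unlabeled ones."""
    # Supervised term: cross-entropy against the class indicator.
    sup = sum(-np.log(_dist_probs(z, anchors)[y] + 1e-12)
              for z, y in zip(labeled_z, labels))
    # Unsupervised term: entropy of each unlabeled sample's distribution.
    unsup = 0.0
    for z in unlabeled_z:
        p = _dist_probs(z, anchors)
        unsup -= np.sum(p * np.log(p + 1e-12))
    return alpha * sup + beta * unsup
```

A minibatch whose labeled embeddings lie near their own class anchors and whose unlabeled embeddings lie near some cluster yields a small loss; misplaced or undecided embeddings increase it.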
5 Qualities of neighbor embedding
We will now discuss some observed properties of neighbor embeddings, and their usefulness in semi-supervised regimes using neural network models.
5.1 Reducing overfit
Usually, when using NNs for classification, a cross-entropy loss is minimized using a fixed one-hot indicator (similar to Section 4.1) as the target for each labeled example, thus maximizing the log-likelihood of the correct label. This form of optimization over a fixed target tends to cause overfitting of the neural network, especially on small labeled sets. This was recently discussed and addressed by Szegedy et al. (2015) by adding random noise to the targets, effectively smoothing the cross-entropy target distribution. This regularization technique was shown empirically to yield better generalization by reducing overfitting on the training set.
Training on distance ratio comparisons, as shown in our work, provides a natural alternative. By setting the optimization target to be the embeddings of labeled examples, we create a continuously moving target that depends on the current model parameters. We speculate that this reduces the model's ability to overfit easily on the training data, allowing very small labeled datasets to be exploited.
5.2 Embedding into Euclidean space
By training the model to create feature embeddings that are discriminative with respect to their distance in Euclidean space, we can achieve good classification accuracy using a simple nearest-neighbor classifier. This embedding allows an interpretation of semantic relations in Euclidean space, which can be useful for various tasks such as information retrieval or transfer learning.
5.3 Incorporating prior knowledge
We also note that prior knowledge about the problem at hand can be incorporated into the expected measures with respect to the distance distribution of Eq. 1. E.g., knowledge of relative distances between classes can be used to replace the target distribution in Eq. 3, and knowledge concerning overlap between classes can be used to relax the constraint in Eq. 5.
6 Experiments
All experiments were conducted using the Torch7 framework by Collobert et al. (2011). Code reproducing these results will be available at https://github.com/eladhoffer/SemiSupContrast. For every experiment we chose a small random subset of examples, balanced across classes, as the labeled set. The remaining training images are used without their labels to form the unlabeled set. Finally, we test the final accuracy on a disjoint set of test examples. No data augmentation was applied to the training sets.
In each iteration we uniformly sampled a set of labeled examples, one per class. In addition, batches of uniformly sampled examples were drawn from both the labeled set and the unlabeled set.
A fixed batch size was used for all experiments, the same for both datasets. We used Eq. 6 as the optimization criterion. Optimization was done using the accelerated-gradient method of Nesterov (1983) with an initial learning rate that was decreased by a constant factor at fixed epoch intervals, and both datasets were trained for the same total number of epochs. Final test accuracy was obtained using a k-NN classifier, taking the best result over the considered values of k. These results were averaged over random subsets of labeled data.
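The final classification step described above can be sketched as follows. This is our illustrative code, not the authors' Torch7 implementation; in practice the inputs would be the network embeddings F(x) of the test and labeled samples:

```python
import numpy as np
from collections import Counter

def knn_predict(query_z, labeled_z, labeled_y, k=1):
    """Classify an embedded query by majority vote over the labels of
    its k nearest labeled embeddings (squared Euclidean distance)."""
    dists = np.sum((labeled_z - query_z) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labeled_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two toy clusters of labeled embeddings; a query near the second cluster.
labeled_z = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labeled_y = [0, 0, 1, 1]
pred = knn_predict(np.array([4.8, 5.1]), labeled_z, labeled_y, k=3)
```

Because the training objective shapes the embedding for exactly this rule, no additional classifier has to be trained on top of the network.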
As the embedding model was chosen to be a convolutional network, the spatial properties of the input space are crucial. We thus omit results on permutation-invariant versions of these problems, noting that they usually tend to achieve worse classification accuracy.
6.1 Results on MNIST
The MNIST database of handwritten digits, introduced by LeCun et al. (1998), is one of the most studied benchmark datasets for image classification. The dataset contains 60,000 examples of handwritten digits from 0 to 9 for training and 10,000 additional examples for testing, where each sample is a 28 x 28 pixel gray-level image.
We followed previous works (Weston et al., 2012; Zhao et al., 2015; Rasmus et al., 2015) and used a semi-supervised regime in which only a small subset of samples, balanced per class, was used along with their labels. For the embedding network, we used a convolutional network with 5 convolutional layers, where each layer is followed by a ReLU nonlinearity and a batch-normalization layer (Ioffe & Szegedy, 2015). The full network structure is described in Table 3 in the appendix. Results are displayed in Table 1 and reflect that our approach yields state-of-the-art results in this regime.
We also attempted to visualize the outcome of this method by training an additional model with a final 2-dimensional embedding. Figure 1 shows the final embeddings, where labeled examples are marked in color with their respective class, and unlabeled examples are marked in gray. We can see that, in accordance with our objectives, the labeled examples formed clusters in Euclidean space separated by their labels, while unlabeled examples were largely grouped into one of these clusters.
6.2 Results on CIFAR-10
CIFAR-10, introduced by Krizhevsky & Hinton (2009), is an image classification benchmark dataset containing 50,000 training images and 10,000 test images. The images are 32 x 32 pixels, in color. The classes are airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks.
Following a commonly used regime, we trained on a small set of randomly picked labeled samples, balanced per class. As the convolutional embedding network, we used a network similar to that of Lin et al. (2014), described in Table 3. The test error results are presented in Table 2.
As can be observed, we achieve results competitive with the state of the art in this regime. We also note that the current best results are from generative models such as Springenberg (2016) and Salimans et al. (2016), which follow elaborate and computationally heavy training procedures compared with our approach.
Table 2: Test error on CIFAR-10

Model                                                    Test error %
Spike-and-Slab Sparse Coding (Goodfellow et al., 2012)   31.9
View-Invariant k-means (Hui, 2013)                       27.4 (±0.7)
Exemplar-CNN (Dosovitskiy et al., 2014)                  23.4 (±0.2)
Ladder network (Rasmus et al., 2015)                     20.04 (±0.47)
ConvCatGAN (Springenberg, 2016)                          19.58 (±0.58)
ImprovedGAN (Salimans et al., 2016)                      18.63 (±2.32)
Ours                                                     20.3 (±0.5)
7 Conclusions
In this work we have shown how neural networks can be used to learn in a semi-supervised setting with small sets of labeled data, by replacing the classification objective with a metric embedding one. We introduced an objective for semi-supervised learning formulated as the minimization of entropy over a distance-encoding distribution. This objective is compliant with standard techniques for training deep neural networks and requires no modification of the embedding model. Using this method, we were able to achieve state-of-the-art results on MNIST with only a small number of labeled examples, and competitive results on the CIFAR-10 dataset. We speculate that this form of learning is beneficial to neural network models by decreasing their tendency to overfit on small sets of training data. The objectives formulated here can potentially leverage prior knowledge on the distribution of classes or samples, incorporating this knowledge into the training process. For example, utilizing the learned embedded distances, we speculate that a better sampling strategy can be used instead of a uniform one over the entire set.
Further exploration is needed to apply this method to large-scale problems spanning a large number of available classes, which we leave to future work.
References
 Bromley et al. (1993) Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.
 Chapelle et al. (2009) Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
 Chopra et al. (2005) Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pp. 539–546. IEEE, 2005.
 Collobert et al. (2011) Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
 Dosovitskiy et al. (2014) Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 766–774, 2014.
 Feng et al. (2016) Jiashi Feng, Tom Zahavy, Bingyi Kang, Huan Xu, and Shie Mannor. Ensemble robustness of deep learning algorithms. arXiv preprint arXiv:1602.02389, 2016.
 Goldberger et al. (2004) Jacob Goldberger, Geoffrey E Hinton, Sam T Roweis, and Ruslan Salakhutdinov. Neighbourhood components analysis. In Advances in neural information processing systems, pp. 513–520, 2004.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Goodfellow et al. (2012) Ian J. Goodfellow, Aaron Courville, and Yoshua Bengio. Large-scale feature learning with spike-and-slab sparse coding. 2012.
 Goodfellow et al. (2015) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. 2015.
 Grandvalet & Bengio (2004) Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536, 2004.
 Hoffer & Ailon (2015) Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.
 Hui (2013) Ka Y Hui. Direct modeling of complex invariances for visual object features. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 352–360, 2013.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pp. 448–456, 2015.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lin et al. (2014) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. ICLR2014, 2014.
 Miyato et al. (2015) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing by virtual adversarial examples. stat, 1050:2, 2015.
 Nesterov (1983) Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). 1983.
 Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Rasmus et al. (2015) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3532–3540, 2015.
 Razavian et al. (2014) Ali Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813, 2014.
 Rippel et al. (2015) Oren Rippel, Manohar Paluri, Piotr Dollar, and Lubomir Bourdev. Metric learning with adaptive density discrimination. stat, 1050:18, 2015.
 Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1986.
 Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016.
 Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.
 Springenberg (2016) Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016. URL https://arxiv.org/abs/1511.06390.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Szegedy et al. (2015) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
 Triguero et al. (2015) Isaac Triguero, Salvador García, and Francisco Herrera. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems, 42(2):245–284, 2015.
 Vapnik & Vapnik (1998) Vladimir Naumovich Vapnik and Vladimir Vapnik. Statistical learning theory, volume 1. Wiley, New York, 1998.
 Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1058–1066, 2013.
 Weston et al. (2012) Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. Springer, 2012.
 Zhao et al. (2015) Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann Lecun. Stacked what-where autoencoders. arXiv preprint arXiv:1506.02351, 2015.
Appendix A
Table 3: Embedding network architectures

MNIST                           CIFAR-10
Input: monochrome               Input: RGB
Conv-ReLU-BN (16, 5x5, 1x1)     Conv-ReLU-BN (192, 5x5, 1x1)
MaxPooling (2x2, 2x2)           Conv-ReLU-BN (160, 1x1, 1x1)
Conv-ReLU-BN (32, 3x3, 1x1)     Conv-ReLU-BN (96, 1x1, 1x1)
Conv-ReLU-BN (64, 3x3, 1x1)     MaxPooling (3x3, 2x2)
Conv-ReLU-BN (64, 3x3, 1x1)     Conv-ReLU-BN (96, 5x5, 1x1)
MaxPooling (2x2, 2x2)           Conv-ReLU-BN (192, 1x1, 1x1)
Conv-ReLU-BN (128, 3x3, 1x1)    Conv-ReLU-BN (192, 1x1, 1x1)
AvgPooling (6x6, 1x1)           MaxPooling (3x3, 2x2)
                                Conv-ReLU-BN (192, 3x3, 1x1)
                                Conv-ReLU-BN (192, 1x1, 1x1)
                                AvgPooling (7x7, 1x1)