SoftTarget Regularization
An effective technique to reduce overfitting in Neural Networks
Abstract
Deep neural networks are learning models with a very high capacity and are therefore prone to overfitting. Many regularization techniques, such as Dropout, DropConnect, and weight decay, attempt to solve the problem of overfitting by reducing the capacity of their respective models (Srivastava et al., 2014), (Wan et al., 2013), (Krogh & Hertz, 1992). In this paper we introduce a new form of regularization that guides the learning problem in a way that reduces overfitting without sacrificing the capacity of the model. The mistakes that models make in the early stages of training carry information about the learning problem. By adjusting the labels of the current epoch of training through a weighted average of the real labels and an exponential average of the past soft-targets, we achieved a regularization scheme as powerful as Dropout without necessarily reducing the capacity of the model, and simplified the complexity of the learning problem. SoftTarget regularization proved to be an effective tool in various neural network architectures.
1 Introduction
Many regularization techniques have been created to rectify the problem of overfitting in deep neural networks, but the majority of these methods reduce model capacity to force the model to learn sufficiently general features. For example, Dropout reduces the number of learnable parameters by randomly dropping activations, and DropConnect extends this idea by randomly dropping weights (Srivastava et al., 2014), (Wan et al., 2013). Weight decay regularization reduces the capacity of the model, not by dropping learnable parameters, but by reducing the space of viable solutions (Krogh & Hertz, 1992).
1.1 Motivation
Hinton et al. have shown that soft-labels, or labels predicted by a model, contain more information than binary hard labels because they encode similarity measures between the classes (Hinton et al., 2015). Incorrect labels tagged by the model describe co-label similarities, and these similarities should be evident in future stages of learning, even if the effect is diminished. For example, imagine training a deep neural net on a classification dataset of various dog breeds. In the initial few stages of learning the model will not accurately distinguish between similar dog breeds such as a Belgian Shepherd and a German Shepherd. This same effect, although not so exaggerated, should appear in later stages of training. If, given an image of a German Shepherd, the model predicts the class German Shepherd with high confidence, the next highest predicted dog should still be a Belgian Shepherd, or a similar looking dog. Overfitting starts to occur when the majority of these co-label effects begin to disappear. By forcing the model to retain these effects in the later stages of training, we reduced the amount of overfitting.
1.2 Method
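Consider the standard supervised learning problem. Given a dataset containing inputs and outputs, $X$ and $Y$, a regularization function $R$, and a model prediction function $F$, we attempted to minimize the loss function $\mathcal{L}$ given by:
(1) $\mathcal{L}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} L(F(X_i, W), Y_i) + \lambda R(W)$
where $W$ are the weights in $F$ that are adjusted to minimize the loss function, $L$ is the per-example loss over the $N$ examples, and $\lambda$ controls the effect of the regularization function. For our method to fit into the supervised learning scheme we altered the optimization problem by adding a time dimension $t$ to the loss function:
(2) $\mathcal{L}^{t}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} L(F(X_i, W), Y_i^{t}) + \lambda R(W)$
SoftTarget regularization requires two steps: first, we kept an exponential moving average of past labels $\hat{Y}^{t}$, and second, we updated the current epoch's label $Y_c^{t}$ through a weighted average of the exponential moving average of past labels and of the true hard labels:
(3) $\hat{Y}^{t} = \beta \hat{Y}^{t-1} + (1 - \beta) F(X, W)$
(4) $Y_c^{t} = \gamma \hat{Y}^{t} + (1 - \gamma) Y$
Here, $\beta$ and $\gamma$ are hyperparameters that can be tuned to specific applications. The loss function then becomes:
(5) $\mathcal{L}^{t}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} L(F(X_i, W), Y_{c,i}^{t}) + \lambda R(W)$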
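The two label-update steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in our experiments: the helper name `soft_target_update` and the argument names `beta` (weight on the running average) and `gamma` (weight on the soft label when mixing with the hard label) are chosen for the example, and the numeric values are arbitrary.

```python
import numpy as np

def soft_target_update(ema_labels, predictions, hard_labels, beta, gamma):
    """One SoftTarget label update (hypothetical helper, names assumed).

    ema_labels:  running exponential moving average of past soft labels
    predictions: current model outputs, one row of class probabilities per example
    hard_labels: one-hot ground-truth labels
    """
    # Step 1: fold the latest predictions into the moving average.
    new_ema = beta * ema_labels + (1.0 - beta) * predictions
    # Step 2: mix the moving average with the true hard labels.
    training_labels = gamma * new_ema + (1.0 - gamma) * hard_labels
    return new_ema, training_labels

# Toy example with one sample and three classes (values are arbitrary).
previous_ema = np.array([[0.8, 0.1, 0.1]])
preds = np.array([[0.6, 0.3, 0.1]])
hard = np.array([[1.0, 0.0, 0.0]])
ema, labels = soft_target_update(previous_ema, preds, hard, beta=0.5, gamma=0.5)
```

Because both steps are convex combinations of probability vectors, the resulting training labels still sum to one for every example.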
The algorithm also contains a 'burn-in' period, where no SoftTarget regularization is done and the model is trained freely in order to learn the basic co-label similarities. We will denote the number of epochs trained freely as $n_b$, and the total number of epochs as $n$. Experimentally we also discovered that it is sometimes best to run the network for more than one epoch on a single set of adjusted labels, so we will denote $n_t$ as the number of epochs per timestep. We have provided the pseudocode in Algorithm 1.
Here, the training routine trains the neural network, taking in a model, a dataset, and an integer representing the number of epochs.
A large $n_t$ allows the network to learn a better mapping to the intermediate soft-labels and therefore allows the regularization to be more effective. But increasing $n_t$ has a diminishing effect, because as $n_t$ becomes large the network begins to overfit to those soft-labels, which reduces the effect of the regularization and significantly increases the training time of the network. $n_t$ should be optimized experimentally through standard hyperparameter optimization practices. We found to work best through standard grid hyperparameter optimization (Bergstra & Bengio, 2012). The small $n_t$ ensures that the model does not overfit to the intermediate representation introduced by SoftTarget.
Through hyperparameter optimization the same range of was found to be optimal in the experiments we ran for . A small $n_b$ ensures that the co-label similarities captured by SoftTarget have not been affected by any type of overfitting. This ensures that, as the experiments are run further, the true co-label similarities are propagated correctly. In more complicated learning scenarios, where the number of labels and the amount of data are greater, the chance of corruption in the co-label similarities is reduced, and therefore a larger $n_b$ can be chosen.
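The overall procedure (burn-in, then alternating label updates and short training runs) can be sketched as follows. This is an illustrative stand-in rather than our implementation: a NumPy softmax regression replaces the neural network, and the names `train`, `soft_target_fit`, `n_total`, `n_burn`, `n_step`, `beta`, and `gamma`, the hyperparameter values, and the choice to start the moving average from the burn-in predictions are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(W, X, targets, epochs, lr=0.5):
    """Stand-in training routine: gradient descent on cross-entropy."""
    for _ in range(epochs):
        P = softmax(X @ W)
        # Gradient of mean cross-entropy w.r.t. W for targets that sum to one.
        W = W - lr * X.T @ (P - targets) / len(X)
    return W

def soft_target_fit(X, Y, n_total, n_burn, n_step, beta, gamma):
    W = np.zeros((X.shape[1], Y.shape[1]))
    # Burn-in: train freely on the hard labels.
    W = train(W, X, Y, n_burn)
    # Assumption: the moving average starts from the burn-in predictions.
    ema = softmax(X @ W)
    for _ in range((n_total - n_burn) // n_step):
        # Update the moving average, then mix it with the hard labels.
        ema = beta * ema + (1.0 - beta) * softmax(X @ W)
        targets = gamma * ema + (1.0 - gamma) * Y
        # Train for n_step epochs on the adjusted labels.
        W = train(W, X, targets, n_step)
    return W

# Synthetic three-class problem (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.eye(3)[np.argmax(X[:, :3], axis=1)]
W = soft_target_fit(X, Y, n_total=20, n_burn=2, n_step=2, beta=0.7, gamma=0.5)
accuracy = (softmax(X @ W).argmax(axis=1) == Y.argmax(axis=1)).mean()
```

Swapping `train` for a call that fits a real network for a given number of epochs leaves the outer loop unchanged, which is the property that lets the method wrap an existing training pipeline.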
1.3 Similarities to Other Methods
Several existing methods correspond to special cases of SoftTarget regularization in which the moving-average hyperparameter is set to zero and there is no burn-in period.

Reed et al. study the specific case of the SoftTarget method described above with the moving-average parameter set to zero (Reed et al., 2014). They focus on the robustness of the network to label noise, rather than on the regularization abilities of the method.

Grandvalet and Bengio have proposed minimum entropy regularization in the setting of semi-supervised learning (Grandvalet & Bengio, 2005). This algorithm changes the categorical cross-entropy loss to force the network to make predictions with high degrees of confidence on the unlabeled portion of the dataset. Assuming a cross-entropy loss and SoftTarget regularization with a zero burn-in period and a zero moving-average hyperparameter, our algorithm becomes equivalent to a softmax regression with minimum entropy regularization.
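The correspondence can be checked with a short calculation. The symbols are assumptions introduced for this illustration: $p$ is the softmax output for one example, $y$ its one-hot label, and $\gamma$ the weight placed on the soft label. With no moving average, the training target is $q = \gamma p + (1 - \gamma) y$, and the cross-entropy against $q$ splits into a supervised term and an entropy term:

```latex
\begin{aligned}
-\sum_k q_k \log p_k
  &= -\sum_k \bigl(\gamma p_k + (1-\gamma)\, y_k\bigr) \log p_k \\
  &= (1-\gamma)\Bigl(-\sum_k y_k \log p_k\Bigr)
     + \gamma \Bigl(-\sum_k p_k \log p_k\Bigr) \\
  &= (1-\gamma)\,\mathrm{CE}(y, p) + \gamma\, H(p)
\end{aligned}
```

The second term is the entropy of the prediction, which is the quantity that minimum entropy regularization penalizes. The identity concerns the value of the loss and treats the soft label as equal to the current prediction; in training the soft label comes from an earlier pass and is held fixed during the gradient step, so the match is approximate in that respect.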

An approach similar to minimum entropy regularization is pseudo-labeling. Pseudo-labeling tags unlabeled data with the class predicted highest by a learning model (Lee, 2013). No soft targets are kept; instead the predicted label is binarized, i.e. the highest class is labeled with a value of one, and every other class is labeled with a value of zero. These hard pseudo-labels are then fed to the model as training targets.
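For contrast with soft targets, the binarization step can be sketched as below. The function name `binarize_predictions` and the example probabilities are hypothetical and only illustrate the idea.

```python
import numpy as np

def binarize_predictions(probabilities):
    """Turn soft predictions into hard one-hot pseudo-labels (illustrative helper)."""
    top_class = probabilities.argmax(axis=1)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(probabilities.shape[1])[top_class]

soft = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
pseudo = binarize_predictions(soft)
```

The information about the second most likely class (0.2 and 0.3 in this toy input) is discarded by this step, whereas a soft target keeps it.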

Hinton et al. described the power of soft targets for transferring knowledge from one model to another, usually to a model that contains fewer parameters (Hinton et al., 2015). SoftTarget regularization can be interpreted as weighted distillation where the donor model is the state of the model at some previous time in training, and the weighting targets are the hard targets.
2 Experiments
We conducted experiments in Python using the Theano and Keras libraries (The Theano Development Team, 2016), (Chollet, 2015). All of our code ran on a single Nvidia Titan X GPU, using the CNMeM and cuDNN (5.103) extensions, and we visualized our results using matplotlib (Hunter, 2007). We used the same seed in all our calculations to ensure the starting weights were equivalent in every set of experiments. The only source of randomness stemmed from the nondeterministic behavior of the cuDNN libraries.
2.1 MNIST
We first considered the famous MNIST dataset (LeCun et al., 1998). For each of the experiments discussed below, we performed a random grid search over the hyperparameters of the optimization algorithm, and a very small brute-force grid search over the hyperparameters of SoftTarget regularization. We compared our results to the cases where the hyperparameters resulted in the best performance of the vanilla neural network without SoftTarget regularization. All of our reported values were computed on the standardized test portion of the MNIST dataset, as provided by the Keras library. The networks were trained strictly on the training portion of the dataset. We tested on eight different architectures, with four combinations of every architecture. The four combinations stem from testing each architecture with: no regularization, Dropout, SoftTarget, and Dropout+SoftTarget regularization.
We used a fully connected network with a varying number of hidden layers and a constant number of neurons in each layer. Dropout was not introduced at the input layer, but was introduced at every layer after that. All of the layer activations were rectified linear units (ReLU), except for the final layer, which was a SoftMax. The net was trained using a categorical cross-entropy loss and the ADADELTA optimization method (Zeiler, 2012).
The frozen hyperparameters for the SoftTarget regularization were: . Our results are described in Table 1. We described the nets using the notation 4 256, denoting a neural network with 4 hidden layers, each having 256 units. For each configuration we reported the minimum loss during training, the loss at the 100th epoch, and the maximum accuracy reached.
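To make the notation concrete, the short sketch below expands a specification such as 4 256 into layer widths and counts the trainable parameters of the resulting fully connected network. The helper names are hypothetical, and the sketch assumes flattened 28 by 28 MNIST inputs (784 values), 10 output classes, and a bias term in every layer.

```python
def layer_widths(spec, n_inputs=784, n_classes=10):
    """Expand 'depth width' (e.g. '4 256') into a list of layer sizes."""
    depth, width = (int(v) for v in spec.split())
    return [n_inputs] + [width] * depth + [n_classes]

def parameter_count(widths):
    """Weights plus biases of a fully connected net with the given widths."""
    return sum(a * b + b for a, b in zip(widths, widths[1:]))

widths = layer_widths("4 256")       # [784, 256, 256, 256, 256, 10]
n_params = parameter_count(widths)   # 400906 under the assumptions above
```

Counts like this give a rough sense of how capacity grows across the architectures listed in Table 1.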
Net  Vanilla  SoftTarget  SoftTarget+Dropout (0.2)  SoftTarget+Dropout (0.5)  Dropout (0.2)  Dropout (0.5) 

4 256  
5 512  
6 256  
6 512  
7 256  
7 512  
3 256  
3 1024  
3 2048 
In all our experiments, the best performing regularization for every architecture described above included SoftTarget regularization. Two representative results are plotted in Figure 1 for a shallow (three-layer) and a deep (seven-layer) neural network. We saw that for deep neural networks (more than three layers) SoftTarget regularization outperformed all the other regularization schemes. For shallow (three-layer) neural networks SoftTarget+Dropout was the optimal scheme.
2.2 CIFAR10
We then considered the CIFAR10 dataset (Krizhevsky & Hinton, 2009), comparing various combinations of SoftTarget, Dropout and Batch Normalization (BN) (Ioffe & Szegedy, 2015). Batch Normalization has been shown to have a regularization effect on neural networks due to the noise inherent in the mini-batch statistics. We trained each configuration of the network for sixty epochs over the whole training set. The complete architecture used was:
Input ->
Convolution (64,3,3) -> BN -> ReLU -> Convolution (64,3,3) -> BN -> ReLU -> MaxPooling ((3,3), (2,2)) -> Dropout () ->
Convolution (128,3,3) -> BN -> ReLU -> Convolution (128,3,3) -> BN -> ReLU -> MaxPooling ((3,3), (2,2)) -> Dropout () ->
Convolution (256,3,3) -> BN -> ReLU -> Convolution (256,1,1) -> BN -> ReLU -> Convolution (256,1,1) -> BN -> ReLU -> Dropout () ->
AveragePooling ((6,6)) -> Flatten () ->
Dense (256) -> BN -> ReLU -> Dense (256) -> BN -> ReLU -> Dense (256) -> SoftMax.
where: Convolution (64,3,3) signifies the convolution operator with 64 filters and a kernel size of 3 by 3, MaxPooling ((3,3), (2,2)) represents the max-pooling operation with a kernel size of 3 by 3 and a stride of 2 by 2, AveragePooling ((6,6)) represents the average pooling operator with a kernel size of 6 by 6, Flatten represents a flattening of the tensor into a matrix, and Dense (256) a fully connected layer (Krizhevsky et al., 2012), (Scherer et al., 2010). In our results, when we note that BN or Dropout were not used, we simply omitted those layers from the architecture. We trained the networks using ADADELTA on the cross-entropy loss, using the same SoftTarget hyperparameters we reported for the MNIST dataset. Our results are summarized in Table 2. The first column specifies the amount of Dropout used on the combinations listed in the next columns. As with the MNIST experiments, we reported the minimum loss during training and the loss at the final epoch.
Amount of Dropout  BN  SoftTarget  Just Dropout  SoftTarget+BN 

0  
0.2  
0.4  
0.6  
0.8 
The use of SoftTarget regularization resulted in the lowest loss in four out of the five experiments on this architecture, and resulted in the lowest last-epoch loss and the highest accuracy in all five of the experiments. As the dropout rate is increased, the need for any other type of regularization decreases. However, increasing the dropout rate also increases the resulting loss because of the reduced capacity of the network. SoftTarget regularization allowed a lower dropout rate to be used, and this lowered the test error.
2.3 SVHN
Finally, we considered the Street View House Numbers (SVHN) dataset, consisting of various images mapping to one of ten digits (Netzer et al., 2011). This is similar to the MNIST dataset, but is much more organic in nature, as these images contain much more natural variation, such as lighting conditions and camera orientation. We tested residual networks (He et al., 2015) in four configurations: no regularization, Batch Normalization (BN), SoftTarget, and BN+SoftTarget. Our architecture consisted of the same building blocks as the residual network outlined by He et al., namely identity and convolution blocks (He et al., 2015). Identity blocks are blocks that do not contain a convolution layer at the shortcut, while convolution blocks do. In our notation, I (3,[16,16,32], BN) denotes an identity block with an intermediate square convolution kernel size of 3 and three convolution layers of size 16, 16 and 32. The outer convolutions have kernel sizes of 1. C (3,[16,16,32], BN) contains the same initial architecture as I (3,[16,16,32]) but adds a convolution layer of size 32 at the shortcut connection. All of these blocks used the rectified linear function as their activation, with BN applied prior to the activation. Our final architecture was:
Input -> ZeroPadding (3,3) -> Convolution (64,7,7,subsample = (2,2)) -> BN -> ReLU -> MaxPooling ((3,3), (2,2)) ->
C (3,[16,16,32], BN) -> I (3,[16,16,32], BN) -> I (3,[16,16,32], BN) ->
C (3,[32,32,64], BN) -> I (3,[32,32,64], BN) -> I (3,[32,32,64], BN) ->
AveragePooling ((7,7)) -> Dense (10) -> SoftMax
We used the ADADELTA optimization method with a random grid search for hyperparameter optimization. All configurations of the networks were run for 60 iterations apart from the overfit configuration which was run for 100 iterations.
We reported our results in Table 3 and Figure 2, giving, as before, the minimum test loss and the test loss at the last epoch. SoftTarget regularized configurations (with and without BN) again scored the lowest test loss and highest accuracy, compared to Batch Normalization alone.
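Table 3: SVHN test results for each configuration, reported as best value / value at the last epoch.
Metric         No Regularization  BN             SoftTarget     SoftTarget+BN
Test Loss      0.254 / 0.347      0.298 / 0.404  0.244 / 0.244  0.237 / 0.249
Test Accuracy  0.929 / 0.923      0.921 / 0.915  0.931 / 0.931  0.932 / 0.929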
2.4 Co-label Similarities
We claimed that overfitting begins to occur when co-label similarities that appeared in the initial stages of training are no longer present. To test this hypothesis we compared the covariance matrices of an overfitted network, an early-stopped network, and regularized networks. We tested again on the CIFAR10 dataset, with the same architecture as the previous CIFAR10 experiment, except that the number of filters and dense units was reduced by a factor of two. We compared four configurations: Early (10 epochs), Overfit (100 epochs), Dropout (0.2, 100 epochs) and SoftTarget (, 100 epochs). After training each configuration for its respective number of epochs, we predicted the labels of the training set. We then calculated a covariance matrix scaled to a range of since we are only interested in the relative co-label similarities. We set the diagonal to all zeros to make it easier to see the other relations. The covariance function used is defined below.
(6)  
(7)  
(8) 
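A minimal NumPy sketch of one way to compute such a scaled covariance matrix from predicted class probabilities is given below; it is not necessarily the exact function defined above. The assumptions are labeled in the code: the helper name `colabel_similarity` is hypothetical, the target range is taken to be [0, 1], the scaling uses only the off-diagonal entries, and random Dirichlet samples stand in for real network predictions.

```python
import numpy as np

def colabel_similarity(predictions):
    """Scaled class-by-class covariance of predictions (illustrative helper).

    predictions: array of shape (n_examples, n_classes) with class probabilities.
    """
    cov = np.cov(predictions, rowvar=False)  # columns are the variables (classes)
    off_diag = ~np.eye(cov.shape[0], dtype=bool)
    # Assumption: min-max scale to [0, 1] using the off-diagonal entries only.
    lo, hi = cov[off_diag].min(), cov[off_diag].max()
    scaled = (cov - lo) / (hi - lo)
    # Zero the diagonal so that relations between different classes stand out.
    scaled[~off_diag] = 0.0
    return scaled

# Stand-in predictions: 500 random probability vectors over 4 classes.
rng = np.random.default_rng(0)
fake_predictions = rng.dirichlet(np.ones(4), size=500)
S = colabel_similarity(fake_predictions)
```

With real predictions, a large entry for a pair of classes indicates that the network tends to raise or lower its confidence in both classes together, which is the kind of relation examined in Figure 3.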
We plotted the covariance matrices in Figure 3. For the early stop case, we observed the highest covariance between labels 3 and 5, which correspond to cats and dogs respectively. This intuitively makes sense: during earlier steps of training, the network first learns to detect differences between very different entities, such as frog and airplane, and only later learns to detect subtle differences. It is interesting to note that this is the core principle behind prototype theory in human psychology (Osherson & Smith, 1981), (Duch, 1996), (Rosch, 1978). Some concepts are by nature closer to each other than others. Dog and cat are closer in relation than frog and airplane, and our regularization method mimics this phenomenon. Another interesting thing to note is that the dropout method of regularization produces a covariance matrix that is very similar to that produced by SoftTarget regularization. The phenomenon of co-label similarities being propagated throughout learning is not specific to SoftTarget regularization, but applies to regularization in general. Therefore co-label similarities can be seen as a measure of overfitting.
3 Conclusion and Future Work
In conclusion, we presented a new regularization method based on the observation that co-label similarities apparent in the beginning of training disappear once a network begins to overfit. SoftTarget regularization reduced overfitting as well as Dropout did without adding complexity to the network, therefore reducing computational time, and we provided novel insights into the problem of overfitting.
Future work will focus on methods to reduce the number of hyperparameters introduced by SoftTarget regularization, as well as on providing a formal mathematical framework to understand the phenomenon of co-label similarities.
References
 Bergstra & Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13(1):281–305, February 2012. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2503308.2188395.
 Chollet (2015) François Chollet. Keras Deep Learning Library, 2015. URL https://github.com/fchollet/keras.
 Duch (1996) W. Duch. Categorization, prototype theory and neural dynamics. In T. Yamakawa and G. Matsumoto (eds.), Proceedings of the 4th International Conference on Soft Computing, volume 96, pp. 482–485, 1996.
 Grandvalet & Bengio (2005) Yves Grandvalet and Yoshua Bengio. Semi-supervised Learning by Entropy Minimization. Network, 17(5):529–536, 2005.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. arXiv, pp. 1–12, December 2015. URL http://arxiv.org/abs/1512.03385.
 Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. arXiv, pp. 1–9, 2015. URL http://arxiv.org/abs/1503.02531.
 Hunter (2007) John D Hunter. Matplotlib: A 2D Graphics Environment. Computing in Science and Engineering, 9(3):90–95, May 2007.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv, pp. 1–11, February 2015. URL http://arxiv.org/abs/1502.03167.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, Toronto, ON, 2009.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances In Neural Information Processing Systems, pp. 1097–1105, 2012.
 Krogh & Hertz (1992) A. Krogh and J. A. Hertz. A Simple Weight Decay Can Improve Generalization. Advances in Neural Information Processing Systems, 4:950–957, 1992.
 LeCun et al. (1998) Yann LeCun, Corinna Cortes, and Christopher J C Burges. The MNIST Database, 1998. URL http://yann.lecun.com/exdb/mnist/.
 Lee (2013) Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML 2013 Workshop: Challenges in Representation Learning (WREPL), 2013.
 Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, pp. 1–9, 2011.
 Osherson & Smith (1981) Daniel N. Osherson and Edward E. Smith. On the adequacy of prototype theory as a theory of concepts. Cognition, 9(1):35–58, January 1981.
 Reed et al. (2014) Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training Deep Neural Networks on Noisy Labels with Bootstrapping. arXiv, pp. 1–11, December 2014. URL http://arxiv.org/abs/1412.6596.
 Rosch (1978) Eleanor Rosch. Principles of Categorization. In Eleanor Rosch and Barbara L. Lloyd (eds.), Cognition and categorization, pp. 27–48. Lawrence Erlbaum, Hillsdale, NJ, 1st edition, 1978.
 Scherer et al. (2010) Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition, pp. 92–101. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
 The Theano Development Team (2016) The Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv, pp. 19, May 2016. URL http://arxiv.org/abs/1605.02688.
 Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of Neural Networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning, pp. 109–111, 2013.
 Zeiler (2012) Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv, pp. 1–6, December 2012. URL http://arxiv.org/abs/1212.5701.