Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting
In this work, we show that saturating output activation functions, such as the softmax, impede learning on a number of standard classification tasks. Moreover, we present results showing that the utility of softmax does not stem from the normalization, as some have speculated. In fact, the normalization makes things worse. Rather, the advantage is in the exponentiation of error gradients. This exponential gradient boosting is shown to speed up convergence and improve generalization. To this end, we demonstrate faster convergence and better performance on diverse classification tasks: image classification using CIFAR-10 and ImageNet, and semantic segmentation using PASCAL VOC 2012. In the latter case, using a state-of-the-art neural network architecture, the model converged 33% faster with our method than with the standard softmax activation, while achieving slightly better performance to boot.
Training a deep neural network (NN) is a highly non-convex optimization problem that we usually solve using convex methods. For each extra layer we add to the network, the problem becomes more non-convex, i.e. more curvature is added to the error surface, making the optimization harder. Yet, it is commonplace to add unnecessary curvature at the output layer even though this does not expand the space of functions that the NN can represent. This curvature is then back-propagated through all the previous layers, causing a detrimental increase in the number of ripples in the error surfaces of especially the lower layers, which are already the toughest ones to train. This is done, in part, so that we can all pretend that the outputs are probabilities, even though they really are not. In the following, we show that saturating output activation functions, such as the softmax, impede learning on a number of standard classification tasks. Moreover, we present results showing that the utility of the softmax does not stem from the normalization, as some have speculated [Goodfellow et al. (2016); Keck et al. (2012)]. In fact, the normalization makes things worse. Rather, the advantage is in the exponentiation of error gradients. This exponential gradient boosting is shown to speed up convergence and improve generalization.
1.1 Squashers & Saturation
Historically, output squashers, such as the logistic (aka sigmoid) and tanh functions, have been used as a simple way of reducing the impact of outliers on the learned model. For example, if you fit a model to a small dataset with a fair number of outliers, those outliers can produce very large error gradients that will push the model towards a hypothesis that favors said outliers, leading to poor generalization. Squashing the output will reduce those large error gradients, and thus reduce the negative influence of the outliers on the learned model. However, if you have a small dataset, you should not use a neural network in the first place—other methods are likely to work better. And if you have a large dataset, the impact of any outliers will be minuscule. Therefore, the outlier argument is not very relevant in the context of deep learning. What is relevant, however, is that squashing functions saturate, resulting in very small gradients, appearing in the error surface as vast flat plateaus, that slow down learning, and even cause the optimizer to get stuck [LeCun et al. (2012); Glorot and Bengio (2010)]. This observation was part of the motivation behind applying the now popular ReLU activation (rectified linear units) to convolutional neural nets [Krizhevsky et al. (2012); Nair and Hinton (2010); Jarrett et al. (2009)]. Surely, the massive success of ReLUs (and other related variants) speaks to the importance of avoiding saturating activations. Yet, somehow this knowledge is currently not being applied at the output layer! We contend that, for the very same reason that saturating units in the hidden layers should be avoided, linear output activations are to be preferred.
1.2 The Softmax Function
The most famous culprit among the saturating non-linearities is of course the softmax function [Bridle (1990)], y_i = e^{x_i} / Σ_j e^{x_j}. This is the de facto standard activation used for multi-class classification with one-hot target vectors. Even though it is technically not a squasher, but a normalized exponential, it suffers from the same problem of saturation. That is, when the normalization term (the denominator) grows large, the output goes towards zero. The original motivation behind the softmax function was not dealing with outliers, but rather to treat the outputs of a NN as probabilities conditioned on the inputs. As intriguing as this may sound, we must remember that in most cases the outputs of the softmax would actually not be true probabilities. To claim that the outputs are probabilities, we must accept the assumption of a within-class Gaussian distribution of the data made in the derivation of the function [Bridle (1990)]. In practice, we say that the outputs may be interpreted as probabilities, as they lie in the range (0, 1) and sum to unity [Bishop (1995, 2007)]. However, if these are sufficient criteria for calling the outputs probabilities, then the normalization might just as well be applied after training, which would not make the probabilistic interpretation any less correct. This way, we can avoid the problem of saturation during training, while still pretending that the outputs are probabilities (in case that is relevant to the given application). Another potential drawback of the normalization is that it bounds the function at both ends s.t. 0 < y_i < 1. Consequently, when we apply it at the output layer, we effectively bound the error gradients too: with the cross-entropy loss, the error gradient (or “error delta”) at each output is δ_i = y_i − t_i, which is confined to (−1, 1) since the targets are t_i ∈ {0, 1}. These bounded gradients affect all the previous layers during back-propagation.
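The saturation argument is easy to verify numerically. The sketch below (ours, not from the paper) computes the softmax Jacobian, dy_i/dx_j = y_i(δ_ij − y_j), for a fixed set of logits at growing magnitudes; as one logit comes to dominate the normalization term, the largest gradient entry collapses towards zero:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())              # subtract max for numerical stability
    return z / z.sum()

def softmax_jacobian(y):
    # dy_i/dx_j = y_i * (delta_ij - y_j): vanishes as y approaches one-hot
    return np.diag(y) - np.outer(y, y)

base = np.linspace(-1.0, 1.0, 10)        # ten fixed "logits"
max_grad = {}
for scale in (1, 10, 50):
    y = softmax(base * scale)
    max_grad[scale] = np.abs(softmax_jacobian(y)).max()
    print(f"scale={scale:3d}  max |dy/dx| = {max_grad[scale]:.2e}")
```

At scale 1 the largest Jacobian entry is a healthy fraction of its 0.25 ceiling; at scale 50 it is vanishingly small, so almost no gradient flows back through the softmax.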
1.3 The Main Idea
Simply put, our main idea is to apply a bit of common sense to the aforementioned situation. Namely, that we are solving highly non-convex optimization problems using a convex method: backpropagation [Rumelhart et al. (1986)] with stochastic gradient descent (SGD). Even in saying those words, it appears evident that making the problem more non-convex—for no good reason—has to be a bad idea. Following that simple logic, output activations should always be linear (the identity function) unless there is some advantage in adding non-linearity that somehow outweighs the price that must be paid in added non-convexity. Thus, we take the view that the only things that really matter are the speed of convergence and the final classification accuracy. We do not care about probabilistic interpretations, loss functions, or even, to some extent, mathematical correctness. Training a neural network is the process of iteratively pushing some weights in the right direction, and while doing so, we want to make the most of what we have: our error gradients. This does not entail imposing pointless bounds on them, or allowing them to become very small for no good reason.
Table 1 shows what happened when we first applied this view to real data; the MNIST dataset [LeCun et al. (1998)]. Training a simple three-layer NN (fully connected) with ReLUs in the hidden layers, we compared the median results obtained over twenty trials with sigmoid, tanh, and linear output activations. The learning rate was fixed, and carefully tuned for each setting, and neither dropout [Hinton et al. (2012)], batch normalization [Ioffe and Szegedy (2015)], nor weight decay was used. The NN trained for 100 epochs, and the point of convergence is set to be the epoch where the minimum classification error was observed. This experiment was repeated multiple times with other hidden activations and weight initialization schemes, and they all gave the same result: with linear output activations, the time to convergence is reduced by approximately 25 percent (and some moderate improvements in generalization were observed as well). Note that the softmax is not included in the table for the very simple reason that it gave miserable results on this NN configuration.
2 Gradient Boosting
When we first tried to train a convolutional neural network (CNN) on the CIFAR-10 data [Krizhevsky (2009)] with linear instead of softmax outputs, we expected to see at least a hint of improvement. This was not the case. On the contrary, the softmax clearly won that battle. The reason for this lies in the exponentiation of the outputs. For a moment, stop thinking about the softmax in a probabilistic context, and instead view it as the equivalent of linear outputs, with a mean squared error loss, combined with non-linear boosting of the error deltas. From this perspective, it becomes clear that when the outputs are close to their targets, nothing changes with respect to the one-hot classification, but large errors will be exponentiated. This allows the optimizer to take bigger steps towards a minimum, thus leading to faster convergence. An intuitive interpretation of this would be that when we are confident about an error, we can take an exponentially larger step towards minimizing that error. The idea bears some resemblance to momentum, where we gradually speed things up when the error gradients are consistent.
2.1 Exponential Boosting
If exponentiation of error deltas is good, and saturation is bad, it follows that using an “un-normalized” softmax, so to speak, should yield an improvement. That is, simply use linear outputs, but exponentiate the outputs when computing the error gradients. Alternatively, we can think of it as an exponential output activation with an incorrect gradient formulation imposed on it (this is in fact how we implemented it). As seen in Figure 1(a), this simple change does in fact lead to a consistent boost in performance. The result was obtained on CIFAR-10 with a 5-layer CNN; four convolutional layers followed by an affine output layer with linear outputs and exponential gradient boosting (exp-GB), and batch normalization in all layers. We use a single fixed value for the base of the exponentiation, α, which has worked well in all our experiments; occasionally a slightly different value is marginally better. To further boost the non-linear interaction between the outputs and the targets, we used larger target values than the usual one-hot ones and zeros. As can be seen in the histograms of Figure 2 (from a different experiment), this produces much larger gradients. The deltas span a range that is orders of magnitude wider than the bounded errors of the softmax, which lie in (−1, 1). The idea of picking better target values is not new. To reduce the risk of saturating with logistic units, LeCun et al. (2012) recommend choosing targets at the point of the maximum of the second derivative.
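As a concrete sketch of the “exponential activation with a linear gradient formulation” reading above (ours; the base α = 3 and the up-scaled target are illustrative guesses, since the paper’s exact values are not reproduced in this text):

```python
import numpy as np

ALPHA = 3.0  # base of the exponentiation; an illustrative choice, not the paper's tuned value

def exp_gb_delta(x, t, alpha=ALPHA):
    # Pretend the output activation was alpha**x, but back-propagate the
    # plain (activation - target) delta as if the activation were linear.
    return alpha ** x - t

x = np.array([-4.0, 0.0, 2.0, 4.0])   # linear outputs of the network
t = np.array([ 0.0, 0.0, 0.0, 3.0])   # last node is the (up-scaled) target class
print(exp_gb_delta(x, t))
```

Note how a strongly negative output for a negative class yields a near-zero delta (it is effectively ignored), while large positive outputs produce exponentially boosted deltas.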
Another potential advantage of the exponentiation is that it asymptotically approaches zero as the output goes towards negative infinity. This is especially advantageous with one-hot target vectors, because we do not care about exact output values as long as the correct class has the largest value. Hence, we can mostly ignore any negative errors in the outputs for the negative classes. This can be seen as a relaxation of the optimization problem, where we are essentially trying to solve an inequality for the negative classes instead of an exact equality. With linear activations without exp-GB, the optimizer would always try to push the outputs for the negative classes towards zero. This can lead to situations where an otherwise correct output (i.e. the maximum value belongs to the node representing the target class) for a given input leads to a weight update that renders the output incorrect the next time that input is seen; this in exchange for the mean output for the negative classes being slightly closer to zero than on the previous iteration. That is a bad result, but we avoid this problem when we exponentiate our gradients.
2.2 Cubic Boosting
Although we can often ignore large negative outputs that yield large negative error deltas, we cannot ignore all of them. This raises the question whether we may further boost performance by also allowing for the exponentiation of large negative errors. The answer is: yes we can! An obvious candidate would be a mirrored exponential function, sign(x)·α^{|x|}, where sign(·) is the sign function. However, this function does not have a nice and cozy place for us to put all those gradients that we do not need to worry about, so it does not work well. Instead, we use a simple polynomial that has a conveniently flat area around zero (a scaled cubic); let’s call it pow3-GB. Taking another look at Figure 1(a), we see that this does indeed work better; following exactly the same trend as observed with exp-GB, the error drops significantly faster than with the softmax. We use fixed values for the scaling hyper-parameters, and again use larger-than-one-hot target values.
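A quick numerical look (ours) at why the cubic has the “cozy flat place” that the mirrored exponential lacks; the scale β = 1 and base α = 3 are illustrative stand-ins for the paper’s tuned values:

```python
import numpy as np

def pow3_gb_delta(x, t, beta=1.0):
    # cubic boosting: flat around x = 0, explosive for large |x|
    # beta = 1.0 is an illustrative scale; the paper's value is not given here
    return beta * x ** 3 - t

def mirrored_exp_delta(x, t, alpha=3.0):
    # the rejected alternative: mirrored exponential, no flat region near 0
    return np.sign(x) * alpha ** np.abs(x) - t

xs = np.linspace(-0.5, 0.5, 5)
print(pow3_gb_delta(xs, 0.0))        # tiny deltas near zero
print(mirrored_exp_delta(xs, 0.0))   # already larger than 1 in magnitude just off zero
```

Near zero, the cubic leaves small errors essentially alone, whereas the mirrored exponential immediately produces deltas of magnitude greater than one, giving small, ignorable errors nowhere to rest.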
We now carefully study the performance of gradient boosting (GB) for image-classification on CIFAR-10 [Krizhevsky (2009)] and ImageNet 2012 [Russakovsky et al. (2015)], and the pixel-level task of semantic segmentation on the PASCAL VOC 2012 dataset [Everingham et al. (2010)].
CIFAR-10 Classification. In this experiment, our purpose is not to get state-of-the-art results but rather to learn how increased depth may affect our method. We use an (almost) all convolutional network with ten layers; following the principle presented in [Springenberg et al. (2014)], but with batch normalization, and the average pooling layer replaced by a fully-connected one. The latter was done to make computation more deterministic, so as to allow for better evaluation of the effects of changing various parameters. Note that pooling involves atomic operations on the GPU, which can result in relatively large variance in output. For this experiment, we used a fixed learning rate and carefully tuned it with the purpose of getting the best result within ten epochs. We use the same boosting hyper-parameter values as in our previous experiment, but this time we use different target values; larger targets produced better results for exp-GB. With pow3-GB it seemed a good idea to try negative target values for the negative classes, since the function is not bounded at the lower end; we saw a significant improvement when doing so.
Figure 1(b) shows how the classification error evolved during training. For softmax, we show results from trying three different learning rates to ensure that our chosen rate really is a good one. We note that the overall trend is the same as for the 5-layer CNN; for the first 2-5 epochs, the error rates drop significantly faster with GB than with the softmax. The histograms in Figure 2 show the distribution of the output error deltas for the first batch of epoch 1 and epoch 4. The larger target values used for GB are clearly reflected; resulting in sharper distributions clustered around the negated target values. This is of course most significant on the first iteration, but the trend is still very clear in the fourth (and tenth) epoch. This boosting of the output errors has a very significant effect on the gradient signals received in the hidden layers during backpropagation. Figure 3 shows this effect very nicely via the root mean square (RMS) of the gradients. With exp-GB, the RMS of the hidden layer gradients is an order of magnitude higher than with the softmax; for pow3-GB it is more than two orders of magnitude. Interestingly, the hidden-layer RMS-gradients recorded for pow3-GB grow rapidly from the second epoch and onwards. A similar trend is seen for exp-GB, albeit less dramatically, and for the softmax there is only a slight upwards trend, and only in the top layers. This correlates well with the error rates (see Figure 1(b)); the softmax gets stuck early on, and the linear activations with gradient boosting continue to learn through all ten epochs. All in all, this seems to indicate that gradient boosting may help alleviate the infamous problem of vanishing gradients [Hochreiter (1991); Goodfellow et al. (2016)] in deep neural networks.
ImageNet Classification. The ImageNet 2012 dataset [Russakovsky et al. (2015)] consists of ~1.3 million RGB images that are much larger than the tiny images of CIFAR-10. Training a state-of-the-art model on this data can take weeks. Thus, for this experiment, we will again consider only the first ten epochs of training on a relatively shallow and well-known architecture, AlexNet [Krizhevsky et al. (2012)]. Figure 3 and Table 2 show the median classification errors over five trials. With exp-GB, we get a median reduction in the top-5 error of 4.52 percent, and a 3.74 percent reduction of the top-1 error; i.e. the minimum errors achieved within ten epochs. Thus, the result follows the general trend of our previous experiments. However, there are two important differences. First, pow3-GB did not work well, whereas exp-GB clearly still outperformed the softmax. Secondly, we had to use batch normalization (BN) to get good results.
With respect to the failure of pow3-GB, the explanation is likely found in the 100-fold increase in the number of classes; compared to the ten classes of CIFAR-10. Because such large error gradients are back-propagated from the output layer, the errors in the hidden layers simply blow up when one thousand weighted error terms are summed instead of just ten. In a way, this is the opposite problem of what we could expect to see with the softmax, where the saturation is likely to be worse with more classes (as the normalization term grows), thus producing very small gradients. With respect to why BN becomes more important, again, the reason is that the magnitude of the back-propagated gradients depends on the number of classes. The larger gradients result in bigger and faster changes in the distributions of the activations in the hidden layers; thus magnifying internal covariate shifts, and increasing the need for BN.
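The scaling argument can be illustrated with a toy computation (ours): the error reaching a hidden layer is W^T δ, so with comparable per-output deltas its magnitude grows roughly with the square root of the number of classes:

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_delta_rms(n_classes, delta_scale, n_hidden=256):
    # The error reaching a hidden layer is W^T @ delta: every output delta
    # contributes to the sum, so more classes -> larger back-propagated signal.
    W = rng.normal(scale=1.0 / np.sqrt(n_hidden), size=(n_classes, n_hidden))
    delta = rng.normal(scale=delta_scale, size=n_classes)
    return float(np.sqrt(np.mean((W.T @ delta) ** 2)))

rms = {n: hidden_delta_rms(n, delta_scale=5.0) for n in (10, 1000)}
for n, v in rms.items():
    print(f"classes={n:5d}  hidden-delta RMS = {v:.3f}")
```

With boosted (large) deltas of the same scale per output, moving from 10 to 1000 classes inflates the back-propagated signal by roughly an order of magnitude, which is consistent with pow3-GB blowing up on ImageNet but not on CIFAR-10.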
For this experiment, we used (carefully tuned) fixed learning rates of 0.01 and 0.001 for the softmax and exp-GB, respectively. For exp-GB we used the same exponentiation base as before, together with larger target values.
Semantic Segmentation. We now evaluate our method on the pixel-level classification task of semantic segmentation. The goal in semantic segmentation is to determine class labels for every single pixel in an image. Prior work [Bansal et al. (2017); Hariharan et al. (2015); Long et al. (2015)] in this direction uses a fully convolutional network with the standard softmax and multi-class cross-entropy loss for optimization. In this experiment, we use the PixelNet architecture [Bansal et al. (2017)]. This model uses a VGG-16 [Simonyan and Zisserman (2014)] architecture (pre-trained on ImageNet) followed by a multi-layer perceptron that is used to do per-pixel inference over multi-scale descriptors. It has achieved state-of-the-art performance for various pixel-level tasks such as semantic segmentation, depth/surface normal estimation, and boundary detection.
We evaluate our findings on the heavily benchmarked PASCAL VOC 2012 dataset. Similar to prior work [Bansal et al. (2017); Hariharan et al. (2015); Long et al. (2015)], we make use of additional labels collected on 8498 images by Hariharan et al. (2011). We keep a small set of 100 images for validation to analyze convergence, and use the same settings as used for analysis in [Bansal et al. (2017)]: a single-scale image is used as input to train the model. All the hyper-parameters are kept constant except the initial learning rate, which was tuned separately for softmax, exp-GB, and pow3-GB; lowering the learning rate for softmax deteriorates the performance. We report results on the PASCAL VOC 2012 test set (evaluated on the PASCAL server) using the standard metric of region intersection over union (IoU) averaged over classes (higher is better).
Table 3 shows our results (both per-class and mIoU) for GB and the standard softmax. We observe that the model trained using exp-GB converged after 40 epochs, whereas the softmax model converged after 60 epochs. As seen in Table 3, our method provides 33% faster convergence, while yielding a slightly better performance (E-40 vs. S-60). We see a significant 3% boost in the first 40 epochs with exp-GB (E-40 vs. S-40).
Additionally, recent work [Varol et al. (2017); Wang et al. (2015)] in the computer vision community has recast regression problems, such as depth and surface normal estimation, as classification problems, in the hope of easier optimization and better performance. From our experiments, we infer that it is likely not the softmax with cross-entropy that boosts the performance. Rather, it is the use of one-hot encoding of the target vectors.
Summary. We evaluated our findings on two standard classification settings, i.e. image classification and pixel-level classification, on heavily benchmarked datasets. We observe a consistent trend across these tasks: (1) the softmax impedes learning; (2) exp-GB with a mean-squared-error loss converges 25-35% faster than the standard softmax with cross-entropy, while also achieving slightly better performance (i.e. the speed-up does not come at the cost of accuracy). We believe our results are important not just from a convergence perspective, but also from the point of view of having a general loss function for both classification and regression tasks.
4 Further Analysis
We can take a slightly more theoretical view on gradient boosting, and why it speeds up convergence, by reasoning about second-order properties of the error surfaces induced by exp-GB and pow3-GB. This is typically done with the Hessian matrix, H, of second derivatives, which tells us something about the rate of change in the error for a single step of gradient descent. To keep things simple, we will consider only the case of a single output activation, i.e. a single dimension, so we do not need the full Hessian; the scalar H = ∂²E/∂x² will do, where E is the sum-of-squares error, E = ½ Σ_i (t_i − y_i)². For our purpose we can simply ignore the summation in E, and just analyze a single example, E = ½(t − y)². Let us start by comparing the Hessians for linear, softmax, exponential, and cubic activations.
For a linear activation, y = x, the Hessian is simply H_lin = (y′)² − (t − y)y″ = 1.
Re-writing the softmax activation as y = e^x / (e^x + c), where c is a proxy for the normalization term, we get H_sm = (c·e^x / (e^x + c)²)² − (t − y)·c·e^x(c − e^x) / (e^x + c)³,
and for the exponential and cubic activations we have H_exp = e^{2x} − (t − e^x)·e^x and H_pow3 = 9x⁴ − 6x(t − x³).
If we consider the situation where x is near some local minimum, we know that the error surface will be locally convex around that point. This means that H ≥ 0, and that the first and second term in each of the above Hessians will be approximately equal (i.e. they cancel each other out). Thus, we will ignore the second terms, and simply compare the growth of all the first terms as we move away from that local optimum. Now it becomes immediately evident that (locally) H_exp, H_pow3 > H_lin > H_sm, because as c grows, the first term of H_sm goes towards zero for all x, while the first terms of H_exp and H_pow3 grow rapidly. Unsurprisingly, it all depends on the magnitude of the normalization term of the softmax, c. If c is very small, H_sm will blow up, so we need to assess the probability of that happening. At the onset of training, it is reasonable to assume that the inputs to the softmax will be evenly distributed around zero. Thus, half of the x_i's are positive, guaranteeing that c grows with the number of classes. To see what happens later, we can consider a numerical example for one thousand classes. Even when the model is trained well, such that the x_i's for the 999 negative classes are likely to be negative and contribute very little to c—it still takes only one single large x_i (likely to be the one for the positive class) to make c large. It seems reasonable to claim that this will probably be the case most of the time.
To back up this claim, we take a look at the actual values of c recorded during training of the 10-layer CNN from our CIFAR-10 experiment in the previous section. Figure 5 shows how the normalization term, c, of the softmax actually behaved. It starts out with a value of 2,342 and increases monotonically from there.
However, we need to remember that for GB the Hessians are a little different, as we are just boosting the error gradients. Thus, the second derivatives are simply the derivatives of those boosted deltas, which for exp-GB and pow3-GB grow much faster than the corresponding derivative of the bounded softmax delta (with the multi-class cross-entropy loss)—which only adds to our point that GB can minimize the error faster than the softmax.
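The scalar Hessians compared above follow from H = ∂²E/∂x² = (y′)² − (t − y)y″ for E = ½(t − y)², and are easy to sanity-check with a central finite difference. A small NumPy sketch (ours; the constant c = 10 stands in for the softmax normalization term):

```python
import numpy as np

t, c = 1.0, 10.0   # target, and a fixed stand-in for the softmax normalization term

# (activation, first derivative, second derivative)
activations = {
    "linear":  (lambda x: x,          lambda x: 1.0,          lambda x: 0.0),
    "softmax": (lambda x: np.exp(x) / (np.exp(x) + c),
                lambda x: c * np.exp(x) / (np.exp(x) + c) ** 2,
                lambda x: c * np.exp(x) * (c - np.exp(x)) / (np.exp(x) + c) ** 3),
    "exp":     (np.exp,               np.exp,                 np.exp),
    "cubic":   (lambda x: x ** 3,     lambda x: 3 * x ** 2,   lambda x: 6 * x),
}

def hessian_analytic(f, df, d2f, x):
    # H = (y')^2 - (t - y) * y''   for   E = 0.5 * (t - y)^2
    return df(x) ** 2 - (t - f(x)) * d2f(x)

def hessian_numeric(f, x, h=1e-4):
    E = lambda z: 0.5 * (t - f(z)) ** 2
    return (E(x + h) - 2.0 * E(x) + E(x - h)) / h ** 2   # central finite difference

x0 = 0.7
results = {name: (hessian_analytic(f, df, d2f, x0), hessian_numeric(f, x0))
           for name, (f, df, d2f) in activations.items()}
for name, (a, n) in results.items():
    print(f"{name:8s} analytic={a:+.5f}  numeric={n:+.5f}")
```

The analytic and finite-difference values agree for all four activations, and the softmax Hessian is already far smaller than the linear one at this moderate value of c.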
Our results suggest fundamental changes in deep network training, and to our perception of the omnipresent softmax function. In a way, all that we have done is to apply common sense to the challenge of training deep non-convex models using the convex method of gradient descent. Specifically, do not make the problem more non-convex than it needs to be. Whenever we add curvature to the error surface, we make the optimization harder, and we should always keep this in mind when we make decisions on how we configure our models during training.
Acting on this insight, e.g. by skipping the normalization term of the softmax, we get a significant improvement in our NN training—and at no other cost than a few minutes of coding. The only drawback is the introduction of some new hyper-parameters: the boosting coefficients and the target values. However, these have been easy to choose, and we do not expect that a lot of tedious fine-tuning is required in the general case.
From this perspective, our work—and much of literature—is concerned with treating the symptoms rather than the cause. The cause of our problems is our use of gradient-based optimizers. Perhaps one day we will have a better learning algorithm, but until we do: be careful what you back-propagate!
- Bansal et al. (2017) Bansal, A., Chen, X., Russell, B., Gupta, A., and Ramanan, D. (2017). Pixelnet: Representation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506.
- Bishop (2007) Bishop, C. (2007). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, New York, corrected 2nd printing edition.
- Bishop (1995) Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press, 1 edition.
- Bridle (1990) Bridle, J. S. (1990). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition, pages 227–236. Springer Berlin Heidelberg.
- Everingham et al. (2010) Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). The PASCAL Visual Object Classes (VOC) Challenge. IJCV.
- Glorot and Bengio (2010) Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 9:249–256.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT Press.
- Hariharan et al. (2015) Hariharan, B., Arbeláez, P., Girshick, R., and Malik, J. (2015). Hypercolumns for object segmentation and fine-grained localization. In CVPR.
- Hariharan et al. (2011) Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., and Malik, J. (2011). Semantic contours from inverse detectors. In ICCV.
- Hinton et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
- Hochreiter (1991) Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.
- Ioffe and Szegedy (2015) Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167, pages 1–11.
- Jarrett et al. (2009) Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision, pages 2146–2153. IEEE.
- Keck et al. (2012) Keck, C., Savin, C., Lücke, J., King, A., and Ros, H. (2012). Feedforward Inhibition and Synaptic Scaling – Two Sides of the Same Coin? PLoS Computational Biology, 8(3):e1002432.
- Krizhevsky (2009) Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances In Neural Information Processing Systems, pages 1–9.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient Based Learning Applied to Document Recognition. Proceedings of the IEEE.
- LeCun et al. (2012) LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K. R. (2012). Efficient BackProp. In Hutchison, D., Kanade, T., Kittler, J., Kleinberg, J. M., Kosba, A., Mattern, F., Mitchell, J. C., Naor, M., Nierstrasz, O., Rangan, C. P., Steffen, B., Sudan, M., Terzopoulos, D., Tygar, D., and Weikum, G., editors, Neural Networks: Tricks of the Trade, chapter 1, pages 9–48. Springer, Berlin Heidelberg, 2 edition.
- Long et al. (2015) Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convolutional models for semantic segmentation. In CVPR.
- Nair and Hinton (2010) Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814.
- Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–538.
- Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. IJCV.
- Simonyan and Zisserman (2014) Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
- Springenberg et al. (2014) Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. A. (2014). Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806.
- Varol et al. (2017) Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I., and Schmid, C. (2017). Learning from Synthetic Humans. In CVPR.
- Wang et al. (2015) Wang, X., Fouhey, D., and Gupta, A. (2015). Designing deep networks for surface normal estimation. In CVPR.