Improving Stochastic Gradient Descent with Feedback
In this paper we propose a simple and efficient method for improving stochastic gradient descent methods by using feedback from the objective function. The method tracks the relative changes in the objective function with a running average, and uses it to adaptively tune the learning rate in stochastic gradient descent. We specifically apply this idea to modify Adam, a popular algorithm for training deep neural networks. We conduct experiments to compare the resulting algorithm, which we call Eve, with state of the art methods used for training deep learning models. We train CNNs for image classification, and RNNs for language modeling and question answering. Our experiments show that Eve outperforms all other algorithms on these benchmark tasks. We also analyze the behavior of the feedback mechanism during the training process.
Despite several breakthrough results in the last few years, the training of deep learning models remains a challenging problem. This training is a complex, high-dimensional, non-convex, stochastic optimization problem which is not amenable to many standard methods. Currently, the most common approach is to use some variant of stochastic gradient descent. Many extensions have been proposed to the basic gradient descent algorithm - designed to handle specific issues in the training of deep learning models. We review some of these methods in the next section.
Although variants of simple stochastic gradient descent work quite well in practice, there is still room for improvement. This is easily evidenced by the existence of numerous methods to simplify the optimization problem itself like weight initialization techniques and normalization methods.
In this work, we seek to improve stochastic gradient descent with a simple method that incorporates feedback from the objective function. The relative changes in the objective function indicate progress of the optimization algorithm. Our main hypothesis is that incorporating information about this change into the optimization algorithm can lead to improved performance - quantified in terms of the progress rate. We keep a running average of the relative changes in the objective function and use it to divide the learning rate. When the average relative change is high, the learning rate is reduced. This can improve the progress if, for example, the algorithm is bouncing around the walls of the objective function. Conversely, when the relative change is low, the learning rate is increased. This can help the algorithm accelerate through flat areas in the loss surface. As we discuss in the next section, such “plateaus” pose a significant challenge for first order methods and can create the illusion of local minima.
While our method is general, i.e., independent of any particular optimization algorithm, in this work we specifically apply it to modify Adam, considered to be the state of the art for training deep learning models. We call the resulting algorithm Eve and design experiments to compare it with Adam, as well as other popular methods from the literature.
The paper is organized as follows. In Section 2, we review recent results related to the optimization of deep neural networks. We also discuss some popular algorithms and their motivations. Our general method and the specific algorithm Eve are discussed in Section 3. Then in Section 4, we show that Eve consistently outperforms other methods in training convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We also look in some detail at the behavior of our method in the simple case of convex non-stochastic optimization. Finally, we conclude in Section 5.
There has been considerable effort to understand the challenges in deep learning optimization. Intuitively, it seems that the non-convex optimization is made difficult by the presence of several poor local optima. However, this geometric intuition proves to be inadequate in reasoning about the high-dimensional case that arises with training deep learning models. Various empirical and theoretical results have indicated that the problem in high dimensions arises not from local minima, but rather from saddle points. Moreover, a recent paper proved (for deep linear networks, and under reasonable assumptions, also for deep non-linear networks) that all local minima in fact achieve the same value, and are optimal. The work also showed that all critical points which are not global minima are saddle points. Saddle points can seriously hamper the progress of both first and second order methods. Second order methods like Newton’s method are actually attracted to saddle points and are not suitable for high dimensional non-convex optimization. First order methods can escape from saddle points by following directions of negative curvature. However, such saddle points are usually surrounded by regions of small curvature - plateaus. This makes first order methods very slow near saddle points and can create the illusion of a local minimum.
To tackle the saddle point problem, Dauphin et al. (2014) propose a second order method that fixes the issue with Newton’s method. Their algorithm builds on considering the behavior of Newton’s method near saddle points. Newton’s method rescales gradients in each eigen-direction with the corresponding inverse eigenvalue. However, near a saddle point, negative eigenvalues can cause the method to move towards the saddle point. Based on this observation, the authors propose using the absolute values of the eigenvalues to rescale the gradients. This saddle-free Newton method is backed by theoretical justifications and empirical results; however, due to their computational requirements, second order methods are not very suitable for training large scale models, so we do not compare with such approaches in this work.
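The rescaling idea described above can be illustrated on a toy saddle. The following is a minimal sketch, not the saddle-free Newton method as implemented by its authors: it assumes a two-dimensional quadratic with a diagonal Hessian, so that the eigen-directions coincide with the coordinate axes and no eigendecomposition is needed. The function and its curvatures are hypothetical choices made only for illustration.

```python
# Toy objective (an assumption for illustration): f(x, y) = x^2 - 0.5 * y^2,
# which has a saddle point at the origin.
HESSIAN_EIGENVALUES = (2.0, -1.0)  # diagonal Hessian: positive and negative curvature


def grad(p):
    """Gradient of the toy objective at point p = (x, y)."""
    x, y = p
    return (2.0 * x, -1.0 * y)


def newton_step(p):
    """Plain Newton: rescale each gradient component by 1 / eigenvalue."""
    g = grad(p)
    return tuple(pi - gi / lam for pi, gi, lam in zip(p, g, HESSIAN_EIGENVALUES))


def saddle_free_newton_step(p):
    """Rescale each gradient component by 1 / |eigenvalue| instead."""
    g = grad(p)
    return tuple(pi - gi / abs(lam) for pi, gi, lam in zip(p, g, HESSIAN_EIGENVALUES))


start = (1.0, 0.5)
print(newton_step(start))              # (0.0, 0.0): jumps onto the saddle
print(saddle_free_newton_step(start))  # (0.0, 1.0): moves away along the negative-curvature axis
```

Along the positive-curvature axis both steps agree; along the negative-curvature axis the sign flip is what turns attraction to the saddle into repulsion from it.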
We instead focus on first order methods which only rely on the gradient information. A key issue in training deep learning models is that of sparse gradients. To handle this, Adagrad adaptively changes the learning rate for each parameter, performing larger updates for infrequently updated parameters. However, its update rule causes the learning rate to monotonically decrease, which eventually stalls the algorithm. Adadelta and RMSProp are two extensions that try to fix this issue. Finally, a closely related method, and the base for our algorithm Eve (introduced in the next section), is Adam. Adam incorporates the advantages of both Adagrad and RMSProp - and it has been found to work quite well in practice. Adam uses a running average of the gradient to determine the direction of descent, and scales the learning rate with a running average of the gradient squared. The authors of Adam also propose an extension based on the infinity norm, called Adamax. In our experiments, we compare Eve with both Adam and Adamax.
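For concreteness, the Adam update described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation; the function name and list-based parameter layout are assumptions, while the default hyperparameter values are the ones suggested in the Adam paper.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of parameters.

    theta, grad, m, v are equal-length lists; t is the 1-based step count.
    Returns the updated (theta, m, v).
    """
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1.0 - beta1) * g        # running average of the gradient
        vi = beta2 * vi + (1.0 - beta2) * g * g    # running average of the squared gradient
        m_hat = mi / (1.0 - beta1 ** t)            # bias corrections for the zero initialization
        v_hat = vi / (1.0 - beta2 ** t)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


# On the very first step the update has magnitude close to lr, whatever the gradient scale.
theta, m, v = adam_step([1.0], [4.0], [0.0], [0.0], t=1)
print(theta)
```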
We do need to make an assumption about the objective function: we assume that its minimum value over its domain is known. While this is true for loss functions encountered in machine learning (like mean squared error or cross entropy), it does not hold if the objective function also includes regularization terms. In all our experiments, we used dropout for regularization, which is not affected by this assumption. Finally, to simplify notation we assume that the minimum has been subtracted from the objective function, i.e., the minimum has been made 0.
The main component of our proposed method is a feedback term that captures the relative change in the objective value. Let and denote the values of the objective function at time steps and respectively. Then this change is computed as if , and otherwise. Note that this value is always non-negative but it can be less than or greater than 1 i.e. it captures both relative increase and decrease. We compute a running average using these relative changes to get a smoother estimate. Specifically, we take , and for define as . Here is a decay rate - large values correspond to a slowly changing , and vice versa. This simple expression can, however, blow up and lead to instability. To handle this issue, we use a thresholding scheme. A simple thing to do would be to clip as for some suitable . But we found this to not work very well in practice due to the abrupt nature of the clipping. Instead we indirectly clip by smoothly tracking the objective function. Let be the value of the smoothly tracked objective function at time with . For now, assume . We would like to have which in this case requires . So we compute and set . Finally is . Analogous expressions can also be derived for the case when . This smooth tracking has the additional advantage of making less susceptible to the high variability that comes with training using minibatches.
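The feedback computation described above can be sketched as follows. Since the symbols in this passage are not legible, every name here (`f_hat` for the smoothly tracked objective, `r` for the relative change, `d` for its running average, `k` and `K` for the lower and upper thresholds, `beta3` for the decay rate) is a hypothetical label, and the default values are illustrative assumptions rather than the paper's stated settings. The sketch also assumes a strictly positive tracked objective, consistent with the minimum having been shifted to 0.

```python
def feedback_update(f_curr, f_hat_prev, d_prev, beta3=0.999, k=0.1, K=10.0):
    """One update of the smoothly tracked objective and the feedback coefficient.

    f_curr:     latest objective value (assumed >= 0)
    f_hat_prev: previous smoothly tracked objective value (assumed > 0)
    d_prev:     previous running average of relative changes
    Returns (f_hat_curr, d_curr).
    """
    if f_curr >= f_hat_prev:
        # Objective went up: allow the tracked value to grow by a factor in [k+1, K+1].
        ratio_lo, ratio_hi = k + 1.0, K + 1.0
    else:
        # Objective went down: allow it to shrink by a factor in [1/(K+1), 1/(k+1)].
        ratio_lo, ratio_hi = 1.0 / (K + 1.0), 1.0 / (k + 1.0)
    c = min(max(ratio_lo, f_curr / f_hat_prev), ratio_hi)
    f_hat_curr = c * f_hat_prev
    # Relative change of the tracked objective; by construction it lies in [k, K].
    r = abs(f_hat_curr - f_hat_prev) / min(f_hat_curr, f_hat_prev)
    d_curr = beta3 * d_prev + (1.0 - beta3) * r
    return f_hat_curr, d_curr


# A tiny decrease is rounded up to the lower threshold k; a huge jump is capped at K.
print(feedback_update(0.99, 1.0, 1.0))
print(feedback_update(100.0, 1.0, 1.0))
```

Clipping the ratio of tracked values, rather than the running average itself, is what keeps each relative change inside the thresholds without an abrupt cut on the average.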
Once the feedback coefficient has been computed, it can be used to modify any gradient descent algorithm by dividing the learning rate by it. Large values of the coefficient, caused by large changes in the objective function, will lead to a smaller effective learning rate. Similarly, small values of the coefficient will lead to a larger effective learning rate. Since the coefficient starts at 1, the initial updates will closely follow those of the base algorithm. In the next section, we will look at how the coefficient evolves during the course of an experiment to get a better understanding of how it affects the training.
We note again that our method is independent of any particular gradient descent algorithm. However, for this current work, we specifically focus on applying the method to Adam. This modified algorithm, which we call Eve, is shown in Algorithm ?. We modify the final Adam update by multiplying the denominator with the feedback coefficient. In addition to the hyperparameters in Adam, we introduce three new hyperparameters: the decay rate of the running average, and the lower and upper thresholds used for clipping. In all our experiments we use a single fixed setting of these hyperparameters, which we found to work well in practice.
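Putting the pieces together, the modification to Adam can be sketched for a scalar parameter as below. This is a hedged illustration under stated assumptions, not the authors' reference code: the names `d`, `f_hat`, `beta3`, `k`, and `K` and their default values are hypothetical, the objective is assumed to have minimum 0 and a strictly positive starting value, and the toy problem is chosen only to keep the example self-contained.

```python
import math


def eve_minimize(f, grad_f, x0, steps, lr=1e-3, beta1=0.9, beta2=0.999,
                 beta3=0.999, k=0.1, K=10.0, eps=1e-8):
    """Minimize a scalar function with an Eve-style update; returns (x, d)."""
    x, m, v = x0, 0.0, 0.0
    d, f_hat = 1.0, None  # coefficient starts at 1, so early steps match Adam
    for t in range(1, steps + 1):
        f_val, g = f(x), grad_f(x)
        # Standard Adam moment estimates with bias correction.
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        if f_hat is None:
            f_hat = f_val  # first step: nothing to compare against yet
        else:
            if f_val >= f_hat:
                lo, hi = k + 1.0, K + 1.0
            else:
                lo, hi = 1.0 / (K + 1.0), 1.0 / (k + 1.0)
            c = min(max(lo, f_val / f_hat), hi)
            f_new = c * f_hat
            r = abs(f_new - f_hat) / min(f_new, f_hat)
            d = beta3 * d + (1.0 - beta3) * r
            f_hat = f_new
        # The only change relative to Adam: d multiplies the denominator.
        x -= lr * m_hat / (d * math.sqrt(v_hat) + eps)
    return x, d


x, d = eve_minimize(lambda z: z * z, lambda z: 2.0 * z, x0=5.0, steps=200, lr=0.1)
print(x, d)
```

Because the coefficient is an average of values clipped to the two thresholds, it stays within them, which bounds how far the effective learning rate can drift from the base value.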
Now we evaluate our proposed method by comparing Eve with several state of the art algorithms for optimizing deep learning models.
In the figures, SGD refers to vanilla stochastic gradient descent, and SGD Nesterov refers to stochastic gradient descent with Nesterov momentum, where we set the momentum to 0.9 in all experiments.
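As a reminder of what the SGD Nesterov baseline does, the following sketch uses one common formulation of Nesterov momentum, in which the gradient is evaluated at a look-ahead point. The function name, the toy objective, and the learning rate are assumptions for illustration; only the momentum value of 0.9 comes from the text.

```python
def nesterov_sgd(grad_fn, x0, lr=0.1, momentum=0.9, steps=1):
    """Run gradient descent with Nesterov momentum on a scalar; returns all iterates."""
    x, v = x0, 0.0
    iterates = [x]
    for _ in range(steps):
        lookahead = x + momentum * v               # where the momentum alone would take us
        v = momentum * v - lr * grad_fn(lookahead)  # gradient evaluated at the look-ahead point
        x = x + v
        iterates.append(x)
    return iterates


# Toy problem: f(x) = x^2 with gradient 2x.
print(nesterov_sgd(lambda x: 2.0 * x, x0=1.0, steps=3))
```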
4.1 Convolutional Neural Networks
We first trained a 5 layer convolutional neural network for 10-way classification of images from the CIFAR10 dataset. The model consisted of 2 blocks of 3x3 convolutional layers, each followed by 2x2 max pooling and 0.25 dropout. The first block contained 2 layers with 32 filters each, and the second block contained 2 layers with 64 filters each. The convolutional layers were followed by a fully connected layer with 512 units and a 10-way softmax layer. We trained this model for 500 epochs on the training split using various popular methods for training deep learning models, as well as Eve. For each algorithm, we performed a grid search over learning rates (including the suggested default for algorithms that have one) and learning rate decays, and picked the pair of values that led to the smallest final training loss. The loss curves are shown in Figure 1. Eve quickly surpasses all other methods and achieves the lowest final training loss. In the next section we will look at the behavior of the adaptive coefficient to gain some intuition behind this improved performance.
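To make the size of this model concrete, the sketch below walks through the feature-map shapes and counts trainable parameters. It rests on assumptions the text does not state: convolutions are taken to preserve spatial size ("same" padding) and every layer is taken to have a bias term. The 32x32 RGB input size is the standard CIFAR10 image format. The helper names are hypothetical.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def cifar10_cnn_summary(height=32, width=32, channels=3):
    """Return (layers, total) where layers is a list of (name, output_shape, n_params)."""
    layers, total = [], 0
    for n_filters in (32, 64):          # two blocks
        for _ in range(2):              # two 3x3 conv layers per block
            p = conv_params(3, channels, n_filters)
            channels = n_filters
            total += p
            layers.append(("conv3x3", (height, width, channels), p))
        height, width = height // 2, width // 2   # 2x2 max pooling halves each side
        layers.append(("maxpool2x2 + dropout0.25", (height, width, channels), 0))
    p = dense_params(height * width * channels, 512)
    total += p
    layers.append(("dense512", (512,), p))
    p = dense_params(512, 10)
    total += p
    layers.append(("softmax10", (10,), p))
    return layers, total


layers, total = cifar10_cnn_summary()
for name, shape, n_params in layers:
    print(name, shape, n_params)
print("total parameters:", total)
```

Under these assumptions almost all parameters sit in the fully connected layer, so a different padding choice would change the total noticeably.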
We also trained a larger CNN model using the top-performing algorithms from the previous experiment. This model consisted of 3 blocks of 3x3 convolutional layers (3 layers per block, and 64, 128, 256 filters per layer in the first, second, and third block respectively), each followed by 2x2 max pooling and 0.5 dropout. Then we had 2 fully connected layers with 512 units, each followed by 0.5 dropout, and finally a 100-way softmax. We trained this model on the CIFAR100 dataset for 100 epochs. We again performed a grid search over the same learning rate and decay values as the last experiment. The results are shown in Figure 2, and once again show that our proposed method improves over state of the art methods for training convolutional neural networks.
4.2 Analysis of Tuning Coefficient
Before we consider the next set of experiments on recurrent neural networks, we will first look more closely at the behavior of the tuning coefficient in our algorithm. We will specifically consider the results from the CNN experiment on CIFAR10. Figure 3 shows the progress of the tuning coefficient throughout the training, and also in two smaller windows. A few things are worth noting here. The first is the overall trend: there is an initial acceleration followed by a decay. This initial acceleration allows Eve to rapidly overtake other methods, and makes it proceed at a faster pace for about 100 epochs. This acceleration is not equivalent to simply starting with a larger learning rate - in all our experiments we search over a range of learning rate values. The overall trend for the coefficient can be explained by looking at the minibatch losses at each iteration (as opposed to the loss computed over the entire dataset after each epoch) in Figure ?. Initially, different minibatches achieve similar loss values, which leads to the coefficient decreasing. But as training proceeds, the variance in the minibatch losses increases and eventually the coefficient increases. However, this overall trend does not capture the complete picture - for example, as shown in the bottom right plot of Figure 3, the coefficient can actually be decreasing in some regions of the training, adjusting to local structures in the error surface.
To further study the observed acceleration, and to also motivate the need for clipping, we consider a simpler experiment. We trained a logistic regression model on 1000 images from the MNIST dataset. We used batch gradient descent for training, i.e., all 1000 samples were used for computing the gradient at each step. We trained this model using Eve, Adam, Adamax, and SGD Nesterov for 10000 iterations, searching over a large range of values for the learning rate and decay. The results are shown in Figure 4. Eve again outperforms all other methods and achieves the lowest training loss. Also, since this is a smooth non-stochastic problem, the tuning coefficient continuously decreases - this makes having a thresholding mechanism important since the learning rate would blow up otherwise.
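The non-stochastic setup used here, where every step uses the gradient over the full dataset, can be sketched with a tiny binary logistic regression. This is only an illustration of batch gradient descent: the four-point toy dataset, the binary (rather than 10-way) formulation, the plain gradient descent update, and the learning rate are all assumptions, not the MNIST experiment itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(w, b, xs, ys):
    """Mean cross-entropy loss and its gradient, computed over ALL samples."""
    n = len(xs)
    loss, gw, gb = 0.0, [0.0] * len(w), 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for j, xj in enumerate(x):
            gw[j] += (p - y) * xj
        gb += p - y
    return loss / n, [g / n for g in gw], gb / n


def batch_gradient_descent(xs, ys, lr=0.5, steps=200):
    """Full-batch gradient descent; returns the final weights, bias, and loss history."""
    w, b, losses = [0.0] * len(xs[0]), 0.0, []
    for _ in range(steps):
        loss, gw, gb = loss_and_grad(w, b, xs, ys)
        losses.append(loss)
        w = [wi - lr * g for wi, g in zip(w, gw)]
        b -= lr * gb
    return w, b, losses


# Toy data: the logical AND of two binary inputs.
xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
ys = [0, 0, 0, 1]
w, b, losses = batch_gradient_descent(xs, ys)
print(losses[0], losses[-1])
```

With the full batch there is no minibatch noise, so the loss curve is smooth; this is the regime in which the relative changes keep shrinking and a lower threshold on the coefficient becomes necessary.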
Although in the previous experiment the effect of our method is to increase the learning rate, it is not equivalent to simply starting with a larger learning rate. We will establish this with a couple of simple experiments. First we note that in the previous experiment, the optimal decay rates for both Adam and Eve were 0 - no decay. Since the tuning coefficient converges to 0.1, which amounts to a tenfold increase in Eve's effective learning rate, we trained Adam using no decay and learning rates from 1 to 10 times the optimal learning rate for Eve. The training loss curves are shown in the left plot of Figure 5. While increasing the learning rate with Adam does seem to close the gap with Eve, Eve does remain marginally faster. Moreover, and more importantly, this first plot represents the best-case situation for Adam. With larger learning rates, training becomes increasingly unstable and sensitive to the initial values of the parameters. This is illustrated in the right plot of Figure 5, where we ran Eve and Adam (the latter at a larger learning rate) 10 times with different random initializations. In some cases, Adam fails to converge whereas Eve always converges - even though Eve eventually reaches a comparably large effective learning rate. This is because very early in the training, the model is quite sensitive at higher learning rates due to larger gradients. Depending on the initial values, the algorithm may or may not converge. So it is advantageous to slowly accelerate as the learning stabilizes rather than start with a larger learning rate.
4.3 Recurrent Neural Networks
Finally, we evaluated our method on recurrent neural networks (RNNs). We first trained an RNN for character-level language modeling on the Penn Treebank dataset. Specifically, the model consisted of a 2-layer character-level Gated Recurrent Unit with hidden layers of size 256, 0.5 dropout between layers, and sequences of 100 characters. We used the same initial learning rate for Adam, Eve, and RMSProp. For Adamax, we used its suggested default learning rate. A single fixed value was used for the learning rate decay. We trained this model for 100 epochs using each of the algorithms. The results, plotted in Figure 8, clearly show that our method achieves the best results: Eve optimizes the model to a lower final loss than the other algorithms.
We trained another RNN-based model for the question answering task. Specifically, we chose two question types among 20 types from the bAbI dataset, Q19 and Q14. The dataset consists of pairs of supporting story sentences and a question. Different types of pairs are said to require different reasoning schemes. For our test case, Q19 and Q14 correspond to Path Finding and Time Reasoning respectively. We picked Q19 since it is reported to have the lowest baseline score, and we picked Q14 randomly from the remaining questions. The model consisted of two parts, one for encoding story sentences and another for the query. Both included an embedding layer with 256 hidden units, and 0.3 dropout. Next, query word embeddings were fed into a GRU one token at a time, to compute a sentence representation. Both story and query sequences were truncated to the maximum sequence length of 100. Finally, the sequence of word embeddings from story sentences and the repeated encoded representation of a query were combined together to serve as input for each time step in another GRU, with 0.3 dropout. We searched over a range of values for the learning rate and decay. The results, shown in Figure 6 and Figure 7, show that Eve again improves over all other methods.
We proposed a simple and efficient method for incorporating feedback into stochastic gradient descent algorithms. We used this method to create Eve, a modified version of the Adam algorithm. Experiments with a variety of models showed that the proposed method can help improve the optimization of deep learning models.
For future work, we would look to theoretically analyze our method and its effects. While we have tried to evaluate our algorithm Eve on a variety of tasks, additional experiments on larger scale problems would further highlight the strength of our approach. We are making code for our method and the experiments publicly available to encourage more research on this method.
- Full technical details of the experiments and additional results are available at https://github.com/jayanthkoushik/sgd-feedback.
- Statistics of critical points of Gaussian fields on large-dimensional spaces.
Alan J Bray and David S Dean. Physical review letters
- Empirical evaluation of gated recurrent neural networks on sequence modeling.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. arXiv preprint arXiv:1412.3555
- Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.
Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. In Advances in neural information processing systems, pp. 2933–2941, 2014.
- Adaptive subgradient methods for online learning and stochastic optimization.
John Duchi, Elad Hazan, and Yoram Singer. Journal of Machine Learning Research
- Understanding the difficulty of training deep feedforward neural networks.
Xavier Glorot and Yoshua Bengio. In AISTATS, volume 9, pp. 249–256, 2010.
- Deep learning without poor local minima.
Kenji Kawaguchi. In Advances in Neural Information Processing Systems (NIPS), 2016.
- Adam: A method for stochastic optimization.
Diederik Kingma and Jimmy Ba. arXiv preprint arXiv:1412.6980
- Learning multiple layers of features from tiny images.
Alex Krizhevsky and Geoffrey Hinton. 2009.
- Building a large annotated corpus of English: The Penn Treebank.
Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Computational linguistics
- A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2).
Yurii Nesterov. In Doklady AN SSSR, volume 269, pp. 543–547, 1983.
- Dropout: a simple way to prevent neural networks from overfitting.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Journal of Machine Learning Research
- Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.
Tijmen Tieleman and Geoffrey Hinton. COURSERA: Neural Networks for Machine Learning
- Towards AI-complete question answering: A set of prerequisite toy tasks.
Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. arXiv preprint arXiv:1502.05698
- Adadelta: an adaptive learning rate method.
Matthew D Zeiler. arXiv preprint arXiv:1212.5701