Abstract

The minimization of the loss function is of paramount importance in deep neural networks. On the other hand, many popular optimization algorithms have been shown to correspond to some evolution equation of gradient flow type. Inspired by the numerical schemes used for general evolution equations, we introduce a second order stochastic Runge-Kutta method and show that it yields a consistent procedure for the minimization of the loss function. In addition it can be coupled, in an adaptive framework, with Stochastic Gradient Descent (SGD) to adjust automatically the learning rate of the SGD, without requiring any additional information on the Hessian of the loss functional. The adaptive SGD, called SGD-G2, is successfully tested on standard datasets.

Keywords: Machine Learning, ICML, SGD, stochastic gradient descent, adaptive stochastic gradient, deep learning optimization, neural networks optimization

1 Introduction and related literature

Optimization algorithms are at the heart of neural network design in deep learning. One of the most studied procedures is the fixed step Stochastic Gradient Descent (SGD) [1]; although very robust, SGD may converge too slowly for small learning rates or become unstable if the learning rate is too large. Each problem has its own optimal learning rate, and there is no general recipe to adapt it automatically; to address this issue, several approaches have been put forward, among which the use of momentum [14], Adam [4], RMSprop [16] and so on. On the other hand, recent research efforts have been directed towards finding, heuristically or theoretically, the best learning rate [11, 13, 12], with [17], which uses an estimate of the Lipschitz constant, being a very recent example.

Another interpretation of the minimization procedure is to see it as a time evolution (a flow) in the space of neural network parameters. Denote such a parameter by X and let f be the loss functional; the flow interpretation recognizes that the minimization of f is related to the solution of the following evolution equation

 X′(t) = −∇f(X(t)).   (1)

In this work we consider the flow (1) and apply two numerical schemes to evolve it in time: the Explicit Euler scheme (which will correspond to the SGD algorithm) and a numerical scheme of second order in time, labeled "SH" (as in "stochastic Heun"), belonging to the class of stochastic Runge-Kutta methods; this second scheme allows us to obtain a more precise estimation of the flow, which in turn provides essential information to adapt the learning rate of the SGD.
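The difference between the two schemes can be illustrated on a toy one-dimensional gradient flow (our illustration, not code from the paper; all function names are ours). The sketch below integrates x′(t) = −f′(x(t)) for f(x) = x²/2, whose exact solution is x(t) = x₀e^(−t), and shows that halving the step roughly halves the Euler error but quarters the Heun error:

```python
import numpy as np

# Toy gradient flow x'(t) = -f'(x(t)) with f(x) = x^2/2, exact solution
# x(t) = x0 * exp(-t). We compare the explicit Euler scheme (first order)
# with the Heun scheme (second order) at t = 1 for two step sizes.

def grad_f(x):
    return x                                  # f'(x) for f(x) = x^2/2

def euler(x0, h, n_steps):
    x = x0
    for _ in range(n_steps):
        x = x - h * grad_f(x)                 # explicit Euler step
    return x

def heun(x0, h, n_steps):
    x = x0
    for _ in range(n_steps):
        x_tilde = x - h * grad_f(x)           # predictor (an Euler step)
        x = x - (h / 2.0) * (grad_f(x) + grad_f(x_tilde))  # corrector
    return x

x0, T = 1.0, 1.0
exact = x0 * np.exp(-T)
for n in (100, 200):
    h = T / n
    print(f"h={h:.4f}  Euler err={abs(euler(x0, h, n) - exact):.2e}  "
          f"Heun err={abs(heun(x0, h, n) - exact):.2e}")
```

Doubling the resolution divides the Euler error by about 2 and the Heun error by about 4, the signature of first and second order convergence respectively.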

We prove theoretically in section 3 that the SH scheme is indeed of second order; then we explain how it allows one to choose the optimal learning rate for the SGD and build the SGD-G2 algorithm. Numerical results on standard datasets (MNIST, F-MNIST, CIFAR10) are presented in section 4, followed by a discussion and concluding remarks.

2 Notations

The fit of the neural network is formalized through the introduction of a loss functional depending on the network parameters and the input data presented to it. In full generality the input data ω belongs to some probability space Ω and the optimization aims to find the parameter X minimizing the expected loss X ↦ E_ω[f(ω, X)]. However, for instance for classification purposes, not every ω has a label attached to it, so in practice only a limited number of samples ω_1, …, ω_N can be used. The loss functional thus becomes:

 f(X) := (1/N) ∑_{i=1}^N f(ω_i, X).   (2)

For i = 1, …, N we introduce the functions f_i(X) := f(ω_i, X) to represent the loss due to the training sample ω_i.

To minimize the loss functional one can think of a deterministic procedure (of gradient descent type) which can be written as:

 X_{n+1} = X_n − h ∇f(X_n),   (3)

where h > 0 denotes the learning rate (also called the "step size"). Note that this update rule requires N gradient evaluations per step, which is prohibitively expensive in deep learning applications that involve networks with many parameters (tens of thousands up to billions). To this end, the deterministic procedure is replaced by its stochastic counterpart, the Stochastic Gradient Descent (SGD). Let (γ_n)_{n≥0} be i.i.d. uniform random variables taking values in {1, …, N}. Then, the SGD is defined as:

 X_{n+1} = X_n − h ∇f_{γ_n}(X_n),   X_0 = X(0).   (4)

The advantage of the stochastic algorithm is that the gradient is evaluated only once per iteration, which makes its computational complexity independent of N. This explains why this method is preferred for large data sets.1
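The update (4) can be sketched in a few lines; the per-sample loss below is an illustrative choice of ours (half squared distance to the sample, so that the full-loss minimizer is the sample mean), not one from the paper:

```python
import numpy as np

# SGD sketch for the finite-sum loss f(X) = (1/N) sum_i f(omega_i, X),
# with the illustrative per-sample loss f_i(X) = 0.5 * ||X - omega_i||^2.
# The minimizer of the full loss is the mean of the samples.

rng = np.random.default_rng(0)
N, d = 500, 3
omega = rng.normal(size=(N, d)) + 5.0        # training samples, mean near (5,5,5)

def grad_fi(i, X):
    return X - omega[i]                      # gradient of the i-th sample loss

h = 0.1                                      # fixed learning rate
X = np.zeros(d)
for n in range(5000):
    i = rng.integers(N)                      # gamma_n: uniform random sample index
    X = X - h * grad_fi(i, X)                # SGD update (4)

print(np.linalg.norm(X - omega.mean(axis=0)))  # small: X hovers near the minimizer
```

Only one per-sample gradient is computed per iteration, so the cost per step does not depend on N; the price is that the iterate fluctuates around the minimizer at a noise level set by the fixed learning rate.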

2.1 The construction of the SGD-G2 algorithm: the principle

The state X_n in (4) can also be seen as an approximation of the solution of the flow (1) at "time" t_n = nh: X_n ≃ X(t_n). But there are many ways to obtain approximations of X(t_{n+1}); for instance one can use a scheme of second order in h (such as the so-called Runge-Kutta schemes, to name but a few [9]) and construct another approximation Y_{n+1}, at the price of computing another gradient. If Y_{n+1} is a better approximation, then it is closer to X(t_{n+1}) and thus, at leading order, Y_{n+1} − X_{n+1} is an estimation of the error X(t_{n+1}) − X_{n+1}. With such an approximation one can extrapolate the behavior of the flow near X_n and compute for which values of the learning rate we still have stability (the precise computations are detailed in the next section). We then adapt the learning rate to go towards the optimal value, that is, large enough to advance fast but still stable. This is encoded in the SGD-G2 algorithm we propose. Two questions arise:

- how to design a high order scheme consistent with the equation (1): this is the object of section 3;

- is the numerical procedure performing well in practice: this is the object of section 4.

3 Theoretical results

3.1 Choice of the stochastic Runge-Kutta scheme

First we need to choose a numerical scheme that solves equation (1) using only partial information on the samples, i.e., a stochastic numerical scheme. Among the possible variants we choose the stochastic Heun method described below (see also [10, 18] for related works, although not with the same goal); while the SGD updates the state by relation (4), the stochastic Heun scheme (named "SH" from now on) reads:

 Y_0 = X(0),
 Ỹ_{n+1} = Y_n − h ∇f_{γ_n}(Y_n),
 Y_{n+1} = Y_n − (h/2) [∇f_{γ_n}(Y_n) + ∇f_{γ_n}(Ỹ_{n+1})].   (5)

Note that each step requires two evaluations of the gradient, but this is the price to pay for higher precision. Note also that the same random sample γ_n is used for both gradient computations. In fact, SH can also be seen as a SGD that uses a sample twice to advance and then takes a linear combination of the two gradients thus obtained.
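A single SH step (5) can be written as follows; `grad_sample` stands for any per-sample gradient ∇f_γ(·), and the lambda in the example is an illustrative quadratic of ours:

```python
# One stochastic-Heun (SH) step, eq. (5). The SAME sample index i is used
# for both gradient evaluations, as the scheme requires.

def sh_step(Y, h, grad_sample, i):
    g1 = grad_sample(i, Y)                   # gradient at Y_n
    Y_tilde = Y - h * g1                     # predictor: a plain SGD step
    g2 = grad_sample(i, Y_tilde)             # gradient at predictor, same sample
    return Y - (h / 2.0) * (g1 + g2)         # corrector: average of both gradients

# Example with the per-sample gradient of f_i(Y) = 0.5 * Y**2 (so grad = Y):
grad_sample = lambda i, Y: Y
y1 = sh_step(1.0, 0.1, grad_sample, 0)       # 1 - 0.1 + 0.1**2 / 2 = 0.905
```

On this linear example one SH step multiplies the state by 1 − h + h²/2, the degree-two Taylor truncation of e^(−h), which is exactly what makes the scheme second order.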

In order to prove the relevant properties of the SH scheme, we need to make clear some details concerning the evolution equation (1). Since only one sample is used at a time, the evolution will depend on the order in which samples are chosen. This means that some randomness is involved and we cannot hope to approach exactly the solution of (1). In fact, see [5, 6], it is known that the output of, say, the SGD algorithm is close (in mathematical terms, "weakly converging") when h → 0 to the solution of the following stochastic differential equation:

 dZ_t = b(Z_t) dt + σ(Z_t) dW_t,   Z(0) = X(0),   (6)

with b(z) = −∇f(z) and σ(z) = √h V(z)^{1/2}, where

 V(z) = (1/N) ∑_{k=1}^N (∇f_k(z) − ∇f(z)) (∇f_k(z) − ∇f(z))^T,   (7)

is the covariance matrix of ∇f_γ(z), seen as a random variable of the sample γ (see also [5], equation (4)). Here W_t is a standard Brownian motion. With these provisions we can formally state the following result:

Theorem 1 (Convergence of SGD and SH schemes).

Suppose f and f_i, i = 1, …, N, are Lipschitz functions having at most linear increase at infinity2. The SGD scheme converges at (weak) order 1 (in h) to the solution of (6), while the SH scheme (5) converges at (weak) order 2.

Proof: To recall what weak convergence means we need some notation: we designate by G the set of functions having at most polynomial growth at infinity and by Lip the set of Lipschitz functions. Given a numerical scheme U_n of step h (SGD, SH, etc.) that approximates the solution of the SDE (6) with U_0 = X(0), weak convergence at order p means that, for any t (kept fixed) and any function G ∈ G, we have3:

 | E G(U_{⌊t/h⌋}) − E G(Z_t) | = O(h^p).   (8)

It is known from [5] (Theorem 1, point "i") that SGD is of weak order 1. It remains to prove that SH is of weak order 2; the proof uses Theorem 2 from the same reference (see also [7]), recalled below:

Theorem 2 (Milstein,1986).

Suppose b : R^d → R^d and σ : R^d → R^{d×d} are Lipschitz and have at most linear growth at infinity. Let Δ = Z_h − x denote the one-step increment of the exact solution of (6) started at x, and Δ̄ = U_1 − x the one-step increment of the numerical scheme started at the same point. If, in addition, there exist K_1, K_2 ∈ G such that for any s ≤ 2p + 1 and any indices i_1, …, i_s:

 | E( ∏_{j=1}^s Δ_{i_j} ) − E( ∏_{j=1}^s Δ̄_{i_j} ) | ≤ K_1(x) h^{p+1},   (9)

and

 E( ∏_{j=1}^{2p+1} |Δ̄_{i_j}| ) ≤ K_2(x) h^{p+1},   (10)

then the numerical scheme converges at weak order p.

We return to the proof of Theorem 1. Let L be the operator acting on sufficiently smooth functions ζ by:

 Lζ = −⟨∇f, ∇ζ⟩ + (h/2) ∑_{i,j=1}^d V_{ij} ∂²_{ij} ζ.   (11)

Let Φ ∈ G and suppose it is sufficiently many times differentiable; using a classical result on semi-group expansions (see [8]), we have:

 E(Φ(Z_h)) = Φ(z) + h LΦ(z) + (h²/2) L²Φ(z) + O(h³).   (12)

For ξ ∈ R^d, we define Φ_ξ(x) := e^{i⟨ξ, x−z⟩}. Let M : R^d → C be defined by M(ξ) := E(Φ_ξ(Z_h)). Note that the function Φ_ξ belongs to the class G and, for any indices i_1, …, i_s,

 ∂^s M / (∂ξ_{i_1} ⋯ ∂ξ_{i_s}) (0) = i^s E( ∏_{j=1}^s Δ_{i_j} ).   (13)

To determine the partial derivatives of M at ξ = 0, we use (12) to get:

 M(ξ) = 1 + h LΦ_ξ(z) + (h²/2) L²Φ_ξ(z) + O(h³).   (14)

After computing the explicit expressions of LΦ_ξ and L²Φ_ξ, we obtain:

 M(ξ) = 1 − i h ⟨∇f(z), ξ⟩ + h² [ (i/2) ∑_{k=1}^d ∂_k f(z) ⟨∂_k ∇f(z), ξ⟩ − (1/2) ⟨∇f(z), ξ⟩² − (1/2) ξ^T V ξ ] + O(h³).   (15)

Therefore,

 E(Δ_j) = −h ∂_j f(z) + (h²/2) ∑_{k=1}^d ∂_k f(z) ∂²_{kj} f(z) + O(h³),   (16)
 E(Δ_j Δ_l) = h² [ ∂_j f(z) ∂_l f(z) + V_{jl} ] + O(h³),   (17)
 E( ∏_{j=1}^s Δ_{i_j} ) = O(h³),  for s ≥ 3.   (18)

For the SH scheme:

 E(Δ̄_k) = −h ∂_k f(z) + (h²/2) ⟨∂_k ∇f(z), ∇f(z)⟩ + O(h³),   (19)
 E(Δ̄_k Δ̄_l) = (h²/N) ∑_{ℓ=1}^N ∂_k f_ℓ(z) ∂_l f_ℓ(z) + O(h³),   (20)
 E( ∏_{j=1}^s Δ̄_{i_j} ) = O(h³),  for s ≥ 3.   (21)

Therefore, all hypotheses of Theorem 2 are satisfied for p = 2 (note in particular that, by the definition (7) of V, (1/N) ∑_{ℓ=1}^N ∂_k f_ℓ(z) ∂_l f_ℓ(z) = ∂_k f(z) ∂_l f(z) + V_{kl}(z), so (20) matches (17)), which gives the conclusion. □

Given Theorem 1, we can trust SH to produce higher order approximations of the solution, which in turn will help obtain information on the Hessian and thus calibrate the learning rate automatically.

3.2 Rationale for the adaptive step proposal

In what follows, ⟨·, ·⟩ denotes the usual scalar product in R^d and ‖·‖ the associated Euclidean norm. For a matrix M ∈ R^{d×d}, ‖M‖ denotes the operator norm defined as ‖M‖ = sup_{x≠0} ‖Mx‖ / ‖x‖.

If the loss function f is smooth enough, a Taylor expansion around a current point Y allows us to write:

 f(X) = f(Y) + ⟨∇f(Y), X − Y⟩ + (1/2) ⟨∇²f(Y)(X − Y), X − Y⟩ + O(‖X − Y‖³),   (22)

where ∇f(Y) is the gradient of f computed at Y and ∇²f(Y) is the Hessian matrix of f at the same point.

Note that the Hessian provides detailed information on the behavior of f around the current point, but in practice it is a high-dimensional object that is impossible to handle directly. We will not try to compute it but will exploit its structure.

Neglecting higher order terms, the loss functional can be written as:

 f(X) ≃ (1/2) ⟨AX, X⟩ − ⟨b, X⟩,   (23)

for some symmetric matrix A in R^{d×d} and some vector b in R^d. To explain our method we will consider that in equation (23) we have equality. Then ∇f(X) = AX − b. If X_opt is the minimum of f, then ∇f(X_opt) = 0, i.e., AX_opt = b, and thus

 ∇f(X) = A (X − X_opt).   (24)

We will forget for a moment that the gradient of f is not computed exactly (only an unbiased estimator ∇f_{γ_n} being available in practice). To minimize the function f, the gradient descent (also called Explicit Euler) scheme with learning rate h_n reads:

 X_{n+1} = X_n − h_n ∇f(X_n).   (25)

Then, denoting Id the identity matrix of dimension d × d, we obtain:

 X_{n+1} − X_opt = (Id − h_n A)(X_n − X_opt).   (26)

On the other hand, the Heun scheme (a Runge-Kutta method of second order) reads:

 Y_{n+1} = Y_n − (h_n/2) ( ∇f(Y_n) + ∇f(Y_n − h_n ∇f(Y_n)) ).   (27)

Then,

 Y_{n+1} − X_opt = [ Id − h_n A + (h_n² A²)/2 ] (Y_n − X_opt).   (28)

At step n, suppose that the two schemes start from the same point, i.e., X_n = Y_n. Therefore, using (26) and (28):

 Y_{n+1} − X_{n+1} = (h_n² A²/2)(X_n − X_opt) = (h_n² A/2) ∇f(X_n).   (29)

On the other hand from (25) and (27) we can also write:

 Y_{n+1} − X_{n+1} = (h_n/2) ( ∇f(X_n) − ∇f(X_{n+1}) ).   (30)

Combining (29) and (30) we get:

 (Id − h_n A) ∇f(X_n) = ∇f(X_{n+1}).   (31)
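This identity is easy to check numerically on a synthetic quadratic; the matrix and vectors below are random illustrative data, not from the paper:

```python
import numpy as np

# Numerical check of identity (31): for the quadratic model
# f(X) = 0.5*<AX, X> - <b, X>, one gradient-descent (Euler) step maps the
# gradient by (Id - h A), i.e. grad f(X_{n+1}) = (Id - h A) grad f(X_n).

rng = np.random.default_rng(1)
d = 4
M = rng.normal(size=(d, d))
A = M @ M.T + d * np.eye(d)                  # symmetric positive definite matrix
b = rng.normal(size=d)
grad = lambda X: A @ X - b                   # gradient of the quadratic model

h = 0.05
X = rng.normal(size=d)
X_next = X - h * grad(X)                     # one Euler / gradient-descent step

lhs = (np.eye(d) - h * A) @ grad(X)
print(np.linalg.norm(lhs - grad(X_next)))    # ~0 up to round-off
```

The check holds for any step size and starting point, because for an exactly quadratic loss the Taylor expansion (22) has no remainder.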

From (24) and (26), the stability criterion for the gradient descent (Explicit Euler) scheme is that the iterated products of the matrices Id − h_n A need to remain bounded, which is verified in particular when ‖Id − h_n A‖ < 1. If this condition holds then for each step n:

 ‖(Id − h_n A) ∇f(X_n)‖ / ‖∇f(X_n)‖ < 1.   (32)

Although the reciprocal is false, it is not far from true: if ‖(Id − h_n A) ∇f(X_n)‖ ≥ ‖∇f(X_n)‖ for some n, then this means in particular that ‖Id − h_n A‖ ≥ 1 (because the matrix norm is a supremum) and then, except for degenerate initial conditions X_0, the iterates X_n will diverge (unless h_n is adapted to ensure stability). So to enforce stability we have to request:

 ‖∇f(X_{n+1})‖ ≤ ‖∇f(X_n)‖.   (33)

On the other hand, at every iteration n we attempt to choose the largest possible learning rate that guarantees (33); that is, we accelerate the convergence without breaking the stability criterion.

A natural question arises: what is the maximum value of h_n such that (33) still holds?

Let us denote ξ_n(h) := ‖(Id − hA) ∇f(X_n)‖². With X_n given, this is a second order polynomial in h; let h_n^opt be the maximum value of the learning rate such that (33) still holds. In other words, h_n^opt = max{ h ≥ 0 : ξ_n(h) ≤ ‖∇f(X_n)‖² }.

Note that:

 ξ_n(h) = ‖A∇f(X_n)‖² h² − 2 ⟨A∇f(X_n), ∇f(X_n)⟩ h + ‖∇f(X_n)‖².   (34)

Then, ξ_n(h) ≤ ‖∇f(X_n)‖² implies that h = 0 or h ≤ 2 ⟨A∇f(X_n), ∇f(X_n)⟩ / ‖A∇f(X_n)‖². Since, by (31), h_n A∇f(X_n) = ∇f(X_n) − ∇f(X_{n+1}), we have A∇f(X_n) = (∇f(X_n) − ∇f(X_{n+1})) / h_n. Then, unless ∇f(X_n) = ∇f(X_{n+1}), which would imply that a critical point has already been reached:

 h_n^opt = max( 0, 2 h_n ⟨∇f(X_n) − ∇f(X_{n+1}), ∇f(X_n)⟩ / ‖∇f(X_n) − ∇f(X_{n+1})‖² ).   (35)

Note that, importantly, the matrix A, which is impossible to compute, does not appear in (35); only ∇f(X_n) and ∇f(X_{n+1}) appear, and these are known.4

To conclude: if the current learning rate is h_n, then it should be set to h_n^opt (given in equation (35)) to obtain the best convergence and stability properties.

Note that when A is positive definite, the scalar product ⟨A∇f(X_n), ∇f(X_n)⟩ must be positive. Therefore, if at a given iteration n the quantity p_n := ⟨∇f(X_n) − ∇f(X_{n+1}), ∇f(X_n)⟩ (which equals h_n ⟨A∇f(X_n), ∇f(X_n)⟩) happens to be negative, this means that the second order assumption (23) made on f breaks down around the current point X_n; we cannot trust any computation made above, and thus when p_n ≤ 0 we can set for instance h_n^opt = h_n. Therefore, denoting h_n the learning rate at step n, we define:

 h_n^opt = { 2 h_n p_n / ‖∇f(X_n) − ∇f(X_{n+1})‖²   if p_n > 0;
            h_n   otherwise.   (36)
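Formula (36) uses only the two gradients and the current step. A sketch (the function name is ours; the formula is valid under the quadratic assumption (23)):

```python
import numpy as np

# Adaptive step of eq. (36), computed from the two gradients only.
# p_n <= 0 signals that the local quadratic model cannot be trusted,
# in which case the current learning rate is kept.

def h_opt(h_n, g_n, g_next):
    diff = g_n - g_next                      # grad f(X_n) - grad f(X_{n+1})
    p_n = np.dot(diff, g_n)                  # p_n of section 3.2
    denom = np.dot(diff, diff)
    if p_n > 0 and denom > 0:
        return 2.0 * h_n * p_n / denom       # eq. (36), first branch
    return h_n                               # eq. (36), second branch

# Sanity check on f(x) = a*x^2/2 with a = 4: the gradients at x = 1 after a
# step h_n = 0.1 are 4 and 4*(1 - 0.1*4) = 2.4, and the stability threshold
# 2/a = 0.5 is recovered:
print(h_opt(0.1, np.array([4.0]), np.array([2.4])))   # ~0.5
```

Note that A never has to be formed: on a one-dimensional quadratic the formula returns 2/a, the edge of the stability region of gradient descent, exactly as the derivation predicts.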

3.3 Update policy

At a given iteration n, if h_n < h_n^opt, this means that the gradient descent is progressing too slowly with the current learning rate, and we need to accelerate it by increasing the learning rate. In practice, in order not to break the stability condition, the new learning rate must not be too close to h_n^opt. To avoid this risk, we choose a gradual update with a hyper-parameter β ∈ (0, 1) close to 1:

 h_{n+1} = β h_n + (1 − β) h_n^opt.   (37)

Suppose now that h_n > h_n^opt. This means that the current learning rate breaks the stability criterion. Then we have to decrease it; contrary to the previous policy, here we do not want a slow update because instability has already set in. A drastic measure is required, otherwise the whole optimization may become useless. We propose the following update rule:

 h_{n+1} = (1 − β) h_n^opt.   (38)

This update is not necessarily close to h_n, but is a conservative choice to re-enter the stability region, which in practice gives good results.
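The two rules combine into one small function (the default value of β below is illustrative):

```python
# Learning-rate update policy of eqs. (37)-(38): gradual increase while
# h_n is below the estimated optimum, drastic decrease once it exceeds it.

def update_lr(h_n, h_opt_n, beta=0.9):
    if h_n < h_opt_n:
        return beta * h_n + (1.0 - beta) * h_opt_n   # eq. (37): slow increase
    return (1.0 - beta) * h_opt_n                    # eq. (38): sharp decrease
```

With β close to 1 the increase branch moves h_n only a small fraction of the way towards h_n^opt per step, while the decrease branch jumps well inside the stability region in a single step.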

3.4 The SGD-G2 algorithm

We present in this section the algorithm resulting from the above considerations; due to the stochastic nature of the gradient, for each iteration n, in all formulas of the previous section the full gradient ∇f must be replaced by ∇f_{γ_n}. Then, our suggested adaptive SGD, called "SGD-G2", is described by Algorithm 1, which includes the provision for mini-batch processing.
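Putting everything together, the following end-to-end sketch runs our reading of the SGD-G2 loop on a synthetic finite-sum quadratic (all data and constants are illustrative; this is not the authors' code). For this toy loss with identity Hessian the stability threshold is h = 2, and the learning rate, started very small, climbs towards it while the iterate converges:

```python
import numpy as np

# SGD-G2 sketch: each iteration draws one sample gamma_n, computes the two
# gradients with the SAME sample, adapts the learning rate via (36)-(38),
# then advances with the plain SGD update.

rng = np.random.default_rng(0)
N, d = 200, 5
omega = 3.0 + 0.001 * rng.normal(size=(N, d))   # near-identical samples
grad_i = lambda i, X: X - omega[i]              # per-sample gradient (Hessian = Id)

h, beta = 1e-4, 0.9                             # tiny initial rate (cf. Remark 1.3)
X = np.zeros(d)
for n in range(300):
    i = rng.integers(N)                         # gamma_n
    g = grad_i(i, X)
    X_next = X - h * g                          # candidate SGD step
    g_next = grad_i(i, X_next)                  # second gradient, same sample
    diff = g - g_next
    p = np.dot(diff, g)                         # p_n of eq. (36)
    denom = np.dot(diff, diff)
    h_star = 2.0 * h * p / denom if (p > 0 and denom > 0) else h   # eq. (36)
    h = beta * h + (1 - beta) * h_star if h < h_star else (1 - beta) * h_star
    X = X_next                                  # the SGD update advances the state

print(h, np.linalg.norm(X - omega.mean(axis=0)))  # h near 2, iterate near optimum
```

Note the two design points stressed in the remarks below: the same sample index feeds both gradient evaluations, and the state is advanced with the plain SGD step; the second gradient only serves to calibrate h.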

Remark 1.

Several remarks are in order:

1. The computation of both ∇f_{γ_n}(X_n) and ∇f_{γ_n}(X_{n+1}) allows in principle to construct a more precise (second order in the learning rate) estimate of the next step, of the form (5); this is not what we want here: the precise estimate is only used to calibrate the learning rate, and in the end the SGD update formula is invoked to advance the network parameters to X_{n+1}.

2. It is crucial to use the same randomness γ_n in the computation of ∇f_{γ_n}(X_{n+1}) as the one present in the computation of ∇f_{γ_n}(X_n). This ensures that a consistent approximation is obtained, as detailed in Theorem 1.

3. We recommend taking the initial learning rate h_0 very small, in order to be sure to start in the stability region; numerical experiments seem to be largely insensitive to this value, as detailed in section 4, figure 2.

4 Numerical experiments

4.1 Network architecture

We conducted experiments on three different data sets (MNIST, Fashion-MNIST and CIFAR-10) using neural networks developed for image classification. We follow in this section the specifications in [3, 15] and in [2] and reproduce below the corresponding architectures as given in the references:

MNIST/Fashion-MNIST (28×28 grayscale images): three dense 256-neuron feed-forward ReLU layers followed by a final dense 10-neuron feed-forward layer.

CIFAR-10 (32×32 images with 3 color channels): a convolution layer, a max pooling layer, a second convolution layer, a max pooling layer, a third convolution layer followed by a flatten layer, a dense layer and a final dense layer with 10 neurons. All activations are ReLU except the last one, which is a softmax.

The last layer returns the classification result.

All hyper-parameters are chosen as in the references: the loss was minimized with the SGD / SGD-G2 algorithms, with the hyper-parameters β and h_0 indicated in the figures; the number of epochs and the batch size also follow the cited references.

4.2 Discussion

First, we compare the performance of the adaptive SGD algorithm for different choices of the β parameter. The results do not vary much among databases; we plot in figure 2 those for FMNIST. The variability is similar for MNIST and slightly larger for CIFAR10.

We come next to the heart of the procedure and compare the performance of the adaptive SGD algorithm with the standard SGD algorithm for different initial learning rates. Recall that the goal of the adaptive procedure is not to beat the best possible SGD convergence but to identify the optimal learning rate fast enough. We therefore recommend starting from a very small value of h_0; the algorithm will increase it up to the stability threshold. This is indeed what happens; see figures 1, 3 and 4. As the adaptive algorithm uses two (mini-batch) gradient evaluations per iteration, we take as x-axis in the plots the number of gradient evaluations and not the iteration counter (in order not to favor the adaptive procedure, which is more costly per iteration). We see that in all situations, starting from a tiny value of h_0, the SGD-G2 algorithm quickly reaches the stability region and converges accordingly.

For MNIST and FMNIST the SGD-G2 cannot be surpassed by SGD, even for the optimal SGD learning rate; on the contrary, for CIFAR10 the SGD with the optimal learning rate (which has to be searched through repeated runs) does converge better than the SGD-G2, but this is due to the counting procedure: if instead of the number of gradient evaluations we count the iterations, the two are comparable, as shown in figure 5 (the same considerations apply to the Adam algorithm, as illustrated in figure 6); one can thus imagine switching on the adaptive part only from time to time, which would make its overhead negligible while still reaching the optimal learning rate regime. Such fine tuning remains for future work.

In conclusion, an adaptive SGD algorithm (called SGD-G2) is proposed which is both computationally convenient and provides a stable, optimal learning rate value. The procedure is tested successfully on three image databases (MNIST, FMNIST, CIFAR10).

Footnotes

1. In practice the γ_n are drawn without replacement from {1, …, N} until all values have been seen; this is called an epoch. Then all values re-enter the choice set and a new epoch begins. We will not discuss this refinement here: our procedure is independent of the segmentation in epochs. The same holds for mini-batch processing, which consists in drawing several samples at once; the method proposed in the sequel adapts out-of-the-box to this situation too.
2. The linear growth at infinity is a technical hypothesis. It is trivially satisfied, for instance, when the parameter domain is closed and bounded and the functions are continuous.
3. We use the notation ⌊r⌋ to designate the integer part of a real number r, that is, the largest integer smaller than or equal to r.
4. Recall that we do not discuss here the stochastic part; in practice an unbiased estimator of the gradient is available.

References

1. Léon Bottou. Stochastic Gradient Descent Tricks. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade: Second Edition, Lecture Notes in Computer Science, pages 421–436. Springer, Berlin, Heidelberg, 2012.
2. Feiyang Chen, Nan Chen, Hanyang Mao, and Hanlin Hu. Assessing four neural networks on handwritten digit recognition dataset (mnist), 11 2018. arXiv:1811.08278.
3. Akinori Hidaka and Takio Kurita. Consecutive dimensionality reduction by canonical correlation analysis for visualization of convolutional neural networks. Proceedings of the ISCIE International Symposium on Stochastic Systems Theory and its Applications, 2017:160–167, 12 2017.
4. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. arXiv:1412.6980.
5. Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and adaptive stochastic gradient algorithms. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2101–2110, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
6. Qianxiao Li, Cheng Tai, and Weinan E. Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations. Journal of Machine Learning Research, 20(40):1–47, 2019.
7. G. N. Milâshtein. Weak Approximation of Solutions of Systems of Stochastic Differential Equations. Theory of Probability & Its Applications, 30(4):750–766, December 1986.
8. E. Hille and R. S. Phillips. Functional Analysis and Semi-groups. American Mathematical Society, Providence, revised edition edition, February 1996.
9. W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. ”Numerical Recipes 3rd Edition: The Art of Scientific Computing”. Cambridge University Press, 2007.
10. Lars Ruthotto and Eldad Haber. Deep Neural Networks Motivated by Partial Differential Equations. arXiv:1804.04272 [cs, math, stat], December 2018. arXiv: 1804.04272.
11. Sihyeon Seong, Yekang Lee, Youngwook Kee, Dongyoon Han, and Junmo Kim. Towards Flatter Loss Surface via Nonmonotonic Learning Rate Scheduling. In UAI2018 Conference on Uncertainty in Artificial Intelligence. Association for Uncertainty in Artificial Intelligence (AUAI), 2018.
12. Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.
13. Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. arXiv preprint arXiv:1708.07120, 2017.
14. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
15. TensorFlow. Convolutional Neural Network (CNN) TensorFlow Core, retrieved 2020-02-06. https://www.tensorflow.org/tutorials/images/cnn.
16. Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
17. Rahul Yedida and Snehanshu Saha. Lipschitzlr: Using theoretically computed adaptive learning rates for fast convergence, 2019.
18. Mai Zhu, Bo Chang, and Chong Fu. Convolutional Neural Networks combined with Runge-Kutta Methods. arXiv:1802.08831 [cs], January 2019. arXiv: 1802.08831.