Accelerating SGD with momentum for over-parameterized learning
Nesterov SGD is widely used for training modern neural networks and other machine learning models. Yet, its advantages over SGD have not been theoretically clarified. Indeed, as we show in our paper, both theoretically and empirically, Nesterov SGD with any parameter selection does not in general provide acceleration over ordinary SGD. Furthermore, Nesterov SGD may diverge for step sizes that ensure convergence of ordinary SGD. This is in contrast to the classical results in the deterministic scenario, where the same step size ensures accelerated convergence of Nesterov’s method over optimal gradient descent.
To address the non-acceleration issue, we introduce a compensation term to Nesterov SGD. The resulting algorithm, which we call MaSS, converges for the same step sizes as SGD. We prove that MaSS obtains an accelerated convergence rate over SGD for any mini-batch size in the linear setting. For full batch, the convergence rate of MaSS matches the well-known accelerated rate of Nesterov’s method.
We also analyze the practically important question of the dependence of the convergence rate and optimal hyper-parameters on the mini-batch size, demonstrating three distinct regimes: linear scaling, diminishing returns and saturation.
Experimental evaluation of MaSS for several standard architectures of deep networks, including ResNet and convolutional networks, shows improved performance over SGD, SGD+Nesterov and Adam.
Many modern neural networks and other machine learning models are over-parametrized [canziani2016analysis]. These models are typically trained to have near zero training loss, known as interpolation, and often have strong generalization performance, as indicated by a range of empirical evidence including [zhang2016understanding, belkin2018understand]. Due to a key property of interpolation, automatic variance reduction (discussed in Section 2.1), stochastic gradient descent (SGD) with constant step size is shown to converge to the optimum of a convex loss function for a wide range of step sizes [ma2017power]. Moreover, the optimal choice of step size in that setting can be derived analytically.
The goal of this paper is to take a step toward understanding momentum-based SGD in the interpolating setting. Among momentum-based methods, the stochastic version of Nesterov’s acceleration method (SGD+Nesterov) is arguably the most widely used to train modern machine learning models in practice. The popularity of SGD+Nesterov is tied to the well-known acceleration of the deterministic Nesterov’s method over gradient descent [nesterov2013introductory]. However, it is not theoretically clear whether Nesterov SGD accelerates over SGD.
As we show in this work, both theoretically and empirically, Nesterov SGD with any parameter selection does not in general provide acceleration over ordinary SGD. Furthermore, Nesterov SGD may diverge, even in the linear setting, for step sizes that guarantee convergence of ordinary SGD. Intuitively, the lack of acceleration stems from the fact that, to ensure convergence, the step size of SGD+Nesterov has to be much smaller than the optimal step size for SGD. This is in contrast to the deterministic Nesterov method, which accelerates using the same step size as optimal gradient descent. As we prove rigorously in this paper, the slow-down of convergence caused by the small step size negates the benefit brought by the momentum term. We note that a similar lack of acceleration for the stochastic Heavy Ball method was analyzed in [kidambi2018insufficiency].
To address the non-acceleration of SGD+Nesterov, we introduce an additional compensation term to allow convergence for the same range of step sizes as SGD. The resulting algorithm, MaSS (Momentum-added Stochastic Solver; code: https://github.com/ts66395/MaSS), updates the weights and using the following rules (with the compensation term underlined):
Here, represents the stochastic gradient. The step size , the momentum parameter and the compensation parameter are independent of .
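The update rules themselves are not reproduced in the text above. As a minimal sketch, assume they take the Nesterov-style form w_{t+1} = u_t - eta1 * g(u_t) and u_{t+1} = (1 + gamma) * w_{t+1} - gamma * w_t + eta2 * g(u_t), where g is the stochastic gradient and the last term is the compensation term. Under that assumption, one iteration could be written as follows. The function name `mass_step` and its argument names are illustrative and are not taken from the authors' code.

```python
import numpy as np

def mass_step(w, u, grad, eta1, gamma, eta2):
    """One iteration of the assumed MaSS update; `grad` is the gradient at u."""
    w_next = u - eta1 * grad                     # descent step taken from u
    u_next = (1 + gamma) * w_next - gamma * w    # Nesterov-style momentum step
    u_next = u_next + eta2 * grad                # compensation term (assumed form)
    return w_next, u_next

# Deterministic sanity check on f(w) = 0.5 * w^T H w with the full gradient.
# With eta2 = 0 the step is the classical Nesterov update.
H = np.diag([1.0, 0.1])
L, mu = 1.0, 0.1
kappa = L / mu
gamma = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
w = u = np.array([1.0, 1.0])
for _ in range(200):
    w, u = mass_step(w, u, H @ u, eta1=1.0 / L, gamma=gamma, eta2=0.0)
final_loss = 0.5 * w @ H @ w
```

Setting `eta2 = 0` recovers a plain Nesterov step, so the compensation term is the only difference between the two updates in this sketch.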
We proceed to analyze theoretical convergence properties of MaSS in the interpolated regime. Specifically, we show that in the linear setting MaSS converges exponentially for the same range of step sizes as plain SGD, and the optimal choice of step size for MaSS is exactly which is optimal for SGD. Our key theoretical result shows that MaSS has an accelerated convergence rate over SGD. Furthermore, in the full batch (deterministic) scenario, our analysis selects , thus reducing MaSS to the classical Nesterov’s method [nesterov2013introductory]. In this case our convergence rate also matches the well-known convergence rate for Nesterov’s method [nesterov2013introductory, bubeck2015convex]. This acceleration is illustrated in Figure 1. Note that SGD+Nesterov (as well as Stochastic Heavy Ball) does not converge faster than SGD, in line with our theoretical analysis. We also prove exponential convergence of MaSS in a more general convex setting under additional conditions.
We further analyze the dependence of the convergence rate and optimal hyper-parameters on the mini-batch size . We identify three distinct regimes of dependence defined by two critical values and : linear scaling, diminishing returns and saturation, as illustrated in Figure 2. The convergence speed per iteration , as well as the optimal hyper-parameters, increase linearly as in the linear scaling regime, sub-linearly in the diminishing returns regime, and can only increase by a small constant factor in the saturation regime. The critical values and are derived analytically. We note that the intermediate “diminishing returns” regime is new and is not found in SGD [ma2017power]. To the best of our knowledge, this is the first analysis of mini-batch dependence for accelerated stochastic gradient methods.
We also experimentally evaluate MaSS on deep neural networks, which are non-convex. We show that MaSS outperforms SGD, SGD+Nesterov and Adam [kingma2014adam] both in optimization and generalization, on different architectures of deep neural networks including convolutional networks and ResNet [he2016deep].
The paper is organized as follows: In Section 2, we introduce notations and preliminary results. In Section 3, we discuss the non-acceleration of SGD+Nesterov. In Section 4, we introduce MaSS and analyze its convergence and optimal hyper-parameter selection. In Section 5, we analyze the mini-batch MaSS. In Section 6, we show experimental results.
1.1 Related Work
Over-parameterized models have drawn increasing attention in the literature as many modern machine learning models, especially neural networks, are over-parameterized [canziani2016analysis] and show strong generalization performance [neyshabur2014search, zhang2016understanding, belkin2019reconciling]. Over-parameterized models usually result in nearly perfect fit (or interpolation) of the training data [zhang2016understanding, sagun2017empirical, belkin2018understand]. Exponential convergence of SGD with constant step size under interpolation and its dependence on the batch size is analyzed in [ma2017power].
There are a few works that show or indicate the non-acceleration of existing stochastic momentum methods. First of all, the work [kidambi2018insufficiency] theoretically proves non-acceleration of the stochastic Heavy Ball method (SGD+HB) over SGD on certain synthetic data. Furthermore, these authors provide experimental evidence that SGD+Nesterov also converges at the same rate as SGD on the same data. The work [yuan2016influence] theoretically shows that, for sufficiently small step-sizes, SGD+Nesterov and SGD+HB are equivalent to SGD with a larger step size. However, the results in [yuan2016influence] do not exclude the possibility that acceleration is possible when the step size is larger. The work [liu2018toward] concludes that “momentum hurts the convergence within the neighborhood of global optima”, based on a theoretical analysis of SGD+HB. These results are consistent with our analysis of the standard SGD+Nesterov. However, this conclusion does not apply to all momentum methods. Indeed, we will show that MaSS provably improves convergence over SGD.
There is a large body of work, both practical and theoretical, on SGD with momentum, including [kingma2014adam, jain2017accelerating, allen2017katyusha]. Adam [kingma2014adam], and its variant AMSGrad [reddi2018convergence], are among the most practically used SGD methods with momentum. Unlike our method, Adam adaptively adjusts the step size according to a weight-decayed accumulation of gradient history. In [jain2017accelerating] the authors proposed an accelerated SGD algorithm, which can be written in the form shown on the right hand side in Eq.8, but with different hyper-parameter selection. Their ASGD algorithm also has a tail-averaging step at the final stage. In the interpolated setting (no additive noise) their analysis yields a convergence rate of compared to for our algorithm with batch size 1. We provide some experimental comparisons between their ASGD algorithm and MaSS in Fig. 4.
The work [vaswani2018fast] proposes and analyzes another first-order momentum algorithm and derives convergence rates under a different set of conditions: the strong growth condition for the loss function in addition to convexity. As shown in Appendix F.3, on the example of Gaussian-distributed data, the rates in [vaswani2018fast] can be slower than those for SGD. In contrast, our algorithm is guaranteed to never have a slower convergence rate than SGD. Furthermore, in the same Gaussian setting MaSS matches the optimal accelerated full-gradient Nesterov rate.
Additionally, in our work we consider the practically important dependence of the convergence rate and optimal parameter selection on the mini-batch size, which to the best of our knowledge, has not been analyzed for momentum methods.
2 Notations and Preliminaries
Given dataset , we consider an objective function of the form , where only depends on a single data point . Let denote the exact gradient, and denote the unbiased stochastic gradient evaluated based on a mini-batch of size . For simplicity, we also denote We use the concepts of strong convexity and smoothness of functions, see definitions in Appendix B.1. For loss function with -strong convexity and -smoothness, the condition number is defined as .
In the case of the square loss, , and the Hessian matrix is . Let and be the largest and the smallest non-zero eigenvalues of the Hessian respectively. The condition number is then (note that zero eigenvalues can be ignored in our setting, see Section 4).
Stochastic Condition Numbers. For a quadratic loss function, let denote a mini-batch estimate of . Define to be the smallest positive number such that and denote
Given a mini-batch size , we define the -stochastic condition number as .
Following [jain2017accelerating], we introduce the quantity (called statistical condition number in [jain2017accelerating]), which is the smallest positive real number such that
We note that , since Consequently, according to the definition of in Eq.2, we have
Hence, the quadratic loss function is also -smooth, for all . By the definition of , we also have
It is important to note that , since .
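The displayed definitions in this section are missing from the text, so the following is only a sketch under stated assumptions: L_1 is taken to be the smallest c with E[||x||^2 x x^T] <= c H, the mini-batch smoothness is taken to be L_m = L_1/m + (m-1)/m * L, the statistical condition number kappa_tilde is taken to be the smallest c with E[||x||^2_{H^{-1}} x x^T] <= c H, and kappa_tilde_m = kappa_tilde/m + (m-1)/m. The helper name `stochastic_condition_numbers` is hypothetical, and the sketch assumes an invertible empirical Hessian.

```python
import numpy as np

def stochastic_condition_numbers(X, m):
    """Empirical L, mu, L_m, kappa_m, kappa_tilde_m for data rows X (H assumed invertible)."""
    n = X.shape[0]
    H = X.T @ X / n                                        # Hessian of the square loss
    evals, evecs = np.linalg.eigh(H)                       # eigenvalues in ascending order
    mu, L = evals[0], evals[-1]
    H_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    sq_norm = np.sum(X ** 2, axis=1)                       # ||x||^2
    sq_norm_Hinv = np.sum((X @ H_inv_sqrt) ** 2, axis=1)   # ||x||^2 in the H^{-1} norm
    M1 = (X * sq_norm[:, None]).T @ X / n                  # E[||x||^2 x x^T]
    M2 = (X * sq_norm_Hinv[:, None]).T @ X / n             # E[||x||^2_{H^{-1}} x x^T]
    # The smallest c with M <= c H is the top eigenvalue of H^{-1/2} M H^{-1/2}.
    L1 = np.linalg.eigvalsh(H_inv_sqrt @ M1 @ H_inv_sqrt)[-1]
    kappa_tilde = np.linalg.eigvalsh(H_inv_sqrt @ M2 @ H_inv_sqrt)[-1]
    L_m = L1 / m + (m - 1) / m * L                         # assumed form of Eq.2
    kappa_tilde_m = kappa_tilde / m + (m - 1) / m
    return {"L": L, "mu": mu, "L_m": L_m,
            "kappa_m": L_m / mu, "kappa_tilde_m": kappa_tilde_m}

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([1.0, 0.5, 0.25, 0.1])
single = stochastic_condition_numbers(X, m=1)
batch = stochastic_condition_numbers(X, m=16)
```

Under these assumed definitions, the two inequalities noted in this section (the mini-batch smoothness is at least L, and the statistical condition number is at most the stochastic condition number) can be checked numerically on `single` and `batch`.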
2.1 Convergence of SGD for Over-parametrized Models and Optimal Step Size
We consider over-parametrized models that have zero training loss solutions on the training data (e.g., [zhang2016understanding]). A solution which fits the training data perfectly is known as interpolating. In the linear setting, interpolation implies that the linear system has at least one solution.
A key property of interpolation is Automatic Variance Reduction (AVR), where the variance of the stochastic gradient decreases to zero as the weight approaches the optimal .
For a detailed discussion of AVR see Appendix B.2.
Thanks to AVR, plain SGD with constant step size can be shown to converge exponentially for strongly convex loss functions [moulines2011non, schmidt2013fast, needell2014stochastic, ma2017power]. The set of acceptable step sizes is , where is defined in Eq.2 and is the mini-batch size. Moreover, the optimal step size of SGD that yields the fastest convergence guarantee is proven to be [ma2017power].
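To illustrate the claim that constant-step SGD converges under interpolation, here is a small sketch on noiseless least squares. The data, the dimensions and the conservative step size 1 / max_i ||x_i||^2 are illustrative assumptions and are not the optimal step size from the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star                                 # noiseless labels: w_star interpolates the data

# Conservative constant step (an assumption of this sketch). With this choice
# each single-sample update cannot increase the distance to w_star.
eta = 1.0 / np.max(np.sum(X ** 2, axis=1))

w = np.zeros(d)
errors = [np.linalg.norm(w - w_star)]
for t in range(2000):
    i = rng.integers(n)                        # mini-batch of size 1
    w = w - eta * X[i] * (X[i] @ w - y[i])     # single-sample gradient step
    errors.append(np.linalg.norm(w - w_star))
```

The list `errors` records the distance to the interpolating solution after each step; no step-size decay is used at any point.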
3 Non-acceleration of SGD+Nesterov
In this section we prove that SGD+Nesterov, with any constant hyper-parameter setting, does not generally improve convergence over optimal SGD. Specifically, we demonstrate a setting where SGD+Nesterov can be proved to have a convergence rate of , which is the same (up to a constant factor) as SGD. In contrast, the classical accelerated rate for the deterministic Nesterov’s method is .
We will consider the following two-dimensional data-generating component decoupled model. Fix an arbitrary and randomly sample from . The data points are constructed as follows:
where are canonical basis vectors, . It can be seen that , and (See Appendix F.1). This model is similar to that used to analyze the stochastic Heavy Ball method in [kidambi2018insufficiency].
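The displayed construction is missing from the text. As a hedged sketch, assume each data point has a single non-zero coordinate: x equals sigma1 * z * e1 or sigma2 * z * e2 with probability 1/2 each, where z is Gaussian, and the label is y = <w_star, x>. The variance of z (2 here), the scales and the function name `decoupled_data` are assumptions of this sketch.

```python
import numpy as np

def decoupled_data(n, sigma1, sigma2, w_star, rng):
    """Sample n points from an assumed component decoupled model in two dimensions."""
    z = rng.normal(0.0, np.sqrt(2.0), size=n)    # assumed z ~ N(0, 2)
    coord = rng.integers(0, 2, size=n)           # active basis vector, probability 1/2 each
    scale = np.where(coord == 0, sigma1, sigma2)
    X = np.zeros((n, 2))
    X[np.arange(n), coord] = scale * z           # only one coordinate is non-zero
    y = X @ w_star                               # noiseless labels (interpolation)
    return X, y

rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])
X, y = decoupled_data(10_000, sigma1=1.0, sigma2=0.1, w_star=w_star, rng=rng)
H = X.T @ X / len(X)                             # empirical Hessian, diagonal by construction
```

Because every sample touches only one coordinate, the empirical Hessian is exactly diagonal, so each component of the iterate evolves independently of the other.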
The following theorem gives a lower bound for the convergence of SGD+Nesterov, regarding the linear regression problem on the component decoupled data model. See Appendix C for the proof.
Theorem 1 (Non-acceleration of SGD+Nesterov).
Let be a dataset generated according to the component decoupled model. Consider the optimization problem of minimizing quadratic function . For any step size and momentum parameter of SGD+Nesterov with random initialization, with probability one, there exists a such that ,
where is a constant.
Compared with the convergence rate of SGD [ma2017power], this theorem shows that SGD+Nesterov does not accelerate over SGD. This result is very different from that in the deterministic gradient scenario, where the classical Nesterov’s method has a strictly faster convergence guarantee than gradient descent [nesterov2013introductory].
Intuitively, the key reason for the non-acceleration of SGD+Nesterov is a condition on the step size required for non-divergence of the algorithm. Specifically, when momentum parameter is close to 1, is required to be less than (precise formulation is given in Lemma 1 in Appendix C). The slow-down resulting from the small step size necessary to satisfy that condition cannot be compensated by the benefit of the momentum term. In particular, the condition on the step-size of SGD+Nesterov excludes that achieves fastest convergence for SGD. We show in the following corollary that, with the step size , SGD+Nesterov diverges. This is different from the deterministic scenario, where the Nesterov method accelerates using the same step size as gradient descent.
Corollary 1.
Consider the same optimization problem as in Theorem 1. Let step-size and acceleration parameter , then SGD+Nesterov, with random initialization, diverges in expectation with probability 1.
4 MaSS: Accelerating SGD
In this section, we propose MaSS, which introduces a compensation term (see Eq.1) onto SGD+Nesterov. We show that MaSS can converge exponentially for all the step sizes that result in convergence of SGD, i.e., . Importantly, we derive a convergence rate , where , for MaSS which is faster than the convergence rate for SGD . Moreover, we give an analytical expression for the optimal hyper-parameter setting.
For ease of analysis, we rewrite update rules of MaSS in Eq.1 in the following equivalent form (introducing an additional variable ):
There is a bijection between the hyper-parameters () and (), which is given by:
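Neither form of the update nor the bijection is reproduced in the text, so the sketch below rests on assumptions. It assumes the two-variable form w_{t+1} = u_t - eta1 * g, u_{t+1} = (1 + gamma) * w_{t+1} - gamma * w_t + eta2 * g, and the three-variable form w_{t+1} = u_t - eta * g, v_{t+1} = (1 - alpha) * v_t + alpha * u_t - delta * g, u_{t+1} = alpha/(1 + alpha) * v_{t+1} + 1/(1 + alpha) * w_{t+1}. Under those assumptions, eliminating v gives gamma = (1 - alpha)/(1 + alpha), eta1 = eta and eta2 = (eta - alpha * delta)/(1 + alpha). All function names are illustrative.

```python
import numpy as np

def to_two_variable(eta, alpha, delta):
    """Map (eta, alpha, delta) to (eta1, eta2, gamma) under the assumed forms."""
    return eta, (eta - alpha * delta) / (1 + alpha), (1 - alpha) / (1 + alpha)

def to_three_variable(eta1, eta2, gamma):
    """Inverse map from (eta1, eta2, gamma) to (eta, alpha, delta)."""
    alpha = (1 - gamma) / (1 + gamma)
    return eta1, alpha, (eta1 - (1 + alpha) * eta2) / alpha

def run_two_variable(w0, grad_fn, eta1, eta2, gamma, steps):
    w, u = w0.copy(), w0.copy()                  # w and u start from the same vector
    for _ in range(steps):
        g = grad_fn(u)
        w_next = u - eta1 * g
        u = (1 + gamma) * w_next - gamma * w + eta2 * g
        w = w_next
    return w

def run_three_variable(w0, grad_fn, eta, alpha, delta, steps):
    w, u, v = w0.copy(), w0.copy(), w0.copy()    # all three start from the same vector
    for _ in range(steps):
        g = grad_fn(u)
        w = u - eta * g
        v = (1 - alpha) * v + alpha * u - delta * g
        u = alpha / (1 + alpha) * v + 1 / (1 + alpha) * w
    return w

H = np.diag([1.0, 0.2])
grad_fn = lambda u: H @ u                        # deterministic gradient, so both runs match
w0 = np.array([1.0, -1.0])
eta, alpha, delta = 0.5, 0.3, 1.0
w_three = run_three_variable(w0, grad_fn, eta, alpha, delta, steps=20)
w_two = run_two_variable(w0, grad_fn, *to_two_variable(eta, alpha, delta), steps=20)
```

With a shared gradient sequence, the two runs produce the same iterates, which is a quick numerical check of the assumed parameter map.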
Remark 3 (SGD+Nesterov).
In the literature, Nesterov’s method is sometimes written in a form similar to the R.H.S. of Eq.8. Since SGD+Nesterov has no compensation term, has to be fixed as , which is consistent with the parameter setting in [nesterov2013introductory].
Assumptions. We first assume a square loss function, and later extend the analysis to general convex loss functions under additional conditions. For the square loss function, the solution set is an affine subspace in the parameter space . Given any , we denote its closest solution as and define the error . One should be aware that different may correspond to different , and that and (stochastic) gradients are always perpendicular to . Hence, no actual update happens along . For this reason, we can ignore zero eigenvalues of and restrict our analysis to the span of the eigenvectors of the Hessian with non-zero eigenvalues.
Based on the equivalent form of MaSS in Eq.8, the following theorem shows that, for the square loss function in the interpolation setting, MaSS is guaranteed to have exponential convergence when the hyper-parameters satisfy certain conditions.
Theorem 2 (Convergence of MaSS).
Consider minimizing a quadratic loss function in the interpolation setting. Let be the smallest non-zero eigenvalue of the Hessian matrix . Let be as defined in Eq.2. Denote . In MaSS with mini batch of size , if the positive hyper-parameters satisfy the following two conditions:
then, after iterations,
for some constant which depends on the initialization.
By condition Eq.10, the admissible step size is , exactly the same as for SGD in the interpolated setting [ma2017power].
One can easily check that the hyper-parameter setting of SGD+Nesterov does not satisfy the conditions in Eq.10.
Proof sketch for Theorem 2.
By setting , which maximizes the right hand side of the inequality, we obtain the optimal selection . Note that this setting of and determines a unique by the conditions in Eq.10. In summary,
By Eq.9, the optimal selection of would be:
is usually larger than 1, which implies that the coefficient of the compensation term is non-negative. The non-negative coefficient indicates that the weight is “over-descended” in SGD+Nesterov and needs to be compensated along the gradient direction.
It is important to note that the optimal step size for MaSS as in Eq.13 is exactly the same as the optimal one for SGD [ma2017power]. With such hyper-parameter selection given in Eq.14, we have the following theorem for optimal convergence:
Theorem 3 (Acceleration of MaSS).
With the optimal hyper-parameters in Eq.13, the asymptotic convergence rate of MaSS is
which is faster than the rate of SGD (see [ma2017power]), since .
Remark 7 (MaSS Reduces to the Nesterov’s method for full batch).
In the limit of full batch , we have , the optimal parameter selection in Eq.14 reduces to
It is interesting to observe that, in the full batch (deterministic) scenario, the compensation term vanishes and and are the same as those in Nesterov’s method. Hence MaSS with the optimal hyper-parameter selection reduces to Nesterov’s method in the limit of full batch. Moreover, the convergence rate in Theorem 3 reduces to , which is exactly the well-known convergence rate of Nesterov’s method [nesterov2013introductory, bubeck2015convex].
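The expressions in Eq.13 and Eq.14 are not reproduced in the text. The sketch below assumes the selection eta = 1/L_m, alpha = 1/sqrt(kappa_m * kappa_tilde_m) and delta = eta/(alpha * kappa_tilde_m), and maps it to (eta1, eta2, gamma) with the assumed bijection gamma = (1 - alpha)/(1 + alpha), eta2 = (eta - alpha * delta)/(1 + alpha). The function name and the numerical inputs are hypothetical.

```python
import math

def mass_optimal_hyperparameters(L_m, mu, kappa_tilde_m):
    """Assumed optimal MaSS hyper-parameters, in both parameterizations."""
    kappa_m = L_m / mu
    root = math.sqrt(kappa_m * kappa_tilde_m)
    eta = 1.0 / L_m                               # same step size as optimal SGD
    alpha = 1.0 / root
    delta = eta / (alpha * kappa_tilde_m)
    return {
        "eta1": eta,
        "eta2": (eta - alpha * delta) / (1 + alpha),   # compensation coefficient
        "gamma": (1 - alpha) / (1 + alpha),            # momentum parameter
        "alpha": alpha,
        "delta": delta,
    }

# Illustrative values only: a mini-batch setting and the full-batch limit,
# where L_m equals L and kappa_tilde_m equals 1.
mini = mass_optimal_hyperparameters(L_m=4.0, mu=0.01, kappa_tilde_m=3.0)
full = mass_optimal_hyperparameters(L_m=1.0, mu=0.01, kappa_tilde_m=1.0)
```

In this sketch `full["eta2"]` is zero up to rounding and `full["gamma"]` equals (sqrt(kappa) - 1)/(sqrt(kappa) + 1) with kappa = 100, which matches the reduction to Nesterov’s method described in the remark, while `mini["eta2"]` is positive because kappa_tilde_m exceeds 1.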
Extension to Convex Case.
First, we extend the definition of to convex functions, and keep the definition of the same as Eq.2. It can be shown that these definitions of are consistent with those in the quadratic setting. We assume that the smallest positive eigenvalue of Hessian is lower bounded by , for all .
Suppose there exists a -strongly convex and -smooth non-negative function such that and , for some . In MaSS, if the hyper-parameters are set to be:
then after iterations, there exists a constant such that,
5 Linear Scaling Rule and the Diminishing Returns
Based on our analysis, we discuss the effect of selection of mini-batch size . We show that the domain of mini-batch size can be partitioned into three intervals by two critical points: The three intervals/regimes are depicted in Figure 2, and the detailed analysis is in Appendix G.
Linear Scaling: . In this regime, we have and . The optimal selection of hyper-parameters is approximated by:
and the convergence rate in Eq.16 is approximately . This indicates linear gains in convergence when increases.
In the linear scaling regime, the hyper-parameter selections follow a Linear Scaling Rule (LSR): When the mini-batch size is multiplied by , multiply all hyper-parameters by . This parallels the linear scaling rule for SGD which is an accepted practice for training neural networks [goyal2017accurate].
Moreover, increasing results in linear gains in the convergence speed, i.e., one MaSS iteration with mini-batch size is almost as effective as MaSS iterations with mini-batch size 1.
Diminishing Returns: In this regime, increasing results in sublinear gains in convergence speed. One MaSS iteration with mini-batch size is less effective than MaSS iterations with mini-batch size 1.
Saturation: One MaSS iteration with mini-batch size is nearly as effective (up to a multiplicative factor of 2) as one iteration with full gradient.
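The three regimes can be illustrated numerically. This sketch assumes the per-iteration speed is 1/sqrt(kappa_m * kappa_tilde_m), with L_m = L_1/m + (m-1)/m * L and kappa_tilde_m = kappa_tilde/m + (m-1)/m, and it assumes the critical values are the minimum and maximum of L_1/L and kappa_tilde. None of these formulas appear in the text above, and the constants are made up for illustration.

```python
import math

def per_iteration_speed(m, L1, L, mu, kappa_tilde):
    """Assumed exponent of the MaSS rate for mini-batch size m."""
    L_m = L1 / m + (m - 1) / m * L
    kappa_tilde_m = kappa_tilde / m + (m - 1) / m
    return 1.0 / math.sqrt((L_m / mu) * kappa_tilde_m)

L1, L, mu, kappa_tilde = 400.0, 1.0, 0.01, 50.0              # illustrative constants
m1, m2 = min(L1 / L, kappa_tilde), max(L1 / L, kappa_tilde)  # assumed critical values
full_batch_speed = 1.0 / math.sqrt(L / mu)                   # Nesterov exponent 1/sqrt(kappa)

batch_sizes = (1, 2, 4, 8, 64, 512, 4096)
speeds = {m: per_iteration_speed(m, L1, L, mu, kappa_tilde) for m in batch_sizes}
for m in batch_sizes:
    print(f"m={m:5d}  speed={speeds[m]:.5f}  fraction of full batch={speeds[m] / full_batch_speed:.3f}")
```

For these constants, doubling a small batch roughly doubles the speed, batch sizes between `m1` and `m2` gain less than proportionally, and beyond `m2` the speed stays within a factor of 2 of the full-batch value.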
6 Empirical Evaluation
We empirically verify the non-acceleration of SGD+Nesterov and the fast convergence of MaSS on synthetic data. Specifically, we optimize the quadratic function , where the dataset is generated by the component decoupled model described in Section 3. We compare the convergence behavior of SGD+Nesterov with SGD, as well as our proposed method, MaSS, and several other methods: SGD+HB, ASGD [jain2017accelerating]. We select the best hyper-parameters from dense grid search for SGD+Nesterov (step-size and momentum parameter), SGD+HB (step-size and momentum parameter) and SGD (step-size). For MaSS, we do not tune the hyper-parameters but use the hyper-parameter setting suggested by our theoretical analysis in Section 4; for ASGD, we use the setting provided by [jain2017accelerating].
Fig. 4 shows the convergence behaviors of these algorithms on the setting of . We observe that the fastest convergence of SGD+Nesterov is almost identical to that of SGD, indicating the non-acceleration of SGD+Nesterov. We also observe that our proposed method, MaSS, clearly outperforms the others. In Appendix F.2, we provide additional experiments on more settings of the component decoupled data, and Gaussian distributed data. We also show the divergence of SGD+Nesterov with the same step size as SGD and MaSS in Appendix F.2.
Real data: MNIST and CIFAR-10.
We compare the optimization performance of SGD, SGD+Nesterov and MaSS on the following tasks: classification of MNIST with a fully-connected network (FCN), classification of CIFAR-10 with a convolutional neural network (CNN) and Gaussian kernel regression on MNIST. See detailed description of the architectures in Appendix H.1. In all the tasks and for all the algorithms, we select the best hyper-parameter setting over dense grid search, except that we fix the momentum parameter for both SGD+Nesterov and MaSS, which is typically used in practice. All algorithms are implemented with mini batches of size for neural network training.
Fig. 5 shows the training curves of MaSS, SGD+Nesterov and SGD, which indicate the fast convergence of MaSS on real tasks, including the non-convex optimization problems on neural networks.
Test Performance. We show that the solutions found by MaSS have good generalization performance. We evaluate the classification accuracy of MaSS, and compare with SGD, SGD+Nesterov and Adam, on different modern neural networks: CNN and ResNet [he2016deep]. See description of the architectures in Appendix H.1. In the training processes, we follow the standard protocol of data augmentation and reduction of learning rate, which are typically used to achieve state-of-the-art results in neural networks. In each task, we use the same initial learning rate for MaSS, SGD and SGD+Nesterov, and run the same number of epochs ( epochs for CNN and epochs for ResNet-32). Detailed experimental settings are deferred to Appendix H.2.
Table 1 compares the classification accuracy of these algorithms on the test set of CIFAR-10 (average of 3 independent runs).
We observe that MaSS produces the best test performance. We also note that increasing the initial learning rate may improve the performance of MaSS and SGD, but degrades that of SGD+Nesterov. Moreover, in our experiment, SGD+Nesterov with large step size diverges in out of runs on CNN and out of runs on ResNet-32 (for random initialization), while MaSS and SGD converge on every run.
This research was in part supported by NSF funding and a Google Faculty Research Award. GPUs donated by Nvidia were used for the experiments. We thank Ruoyu Sun for helpful comments concerning convergence rates. We thank Xiao Liu for helping with the empirical evaluation of our proposed method.
Appendix A Pseudocode for MaSS
Note that the proposed algorithm initializes the variables and with the same vector, which could be randomly generated.
As discussed in section 4, MaSS can be equivalently implemented using the following update rules:
In this case, variables , and should be initialized with the same vector.
There is a bijection between the hyper-parameters () and (), which is given by:
Appendix B Additional Preliminaries
B.1 Strong Convexity and Smoothness of Functions
Definition 1 (Strong Convexity).
A differentiable function is -strongly convex (), if
Definition 2 (Smoothness).
A differentiable function is -smooth (), if
B.2 Automatic Variance Reduction
In the interpolation setting, one can write the square loss as
A key property of interpolation is that the variance of the stochastic gradient of decreases to zero as the weight approaches an optimal solution .
Proposition 1 (Automatic Variance Reduction).
For the square loss function in the interpolation setting, the stochastic gradient at an arbitrary point can be written as
Moreover, the variance of the stochastic gradient
Since is independent of , the above proposition unveils a linear dependence of variance of stochastic gradient on the norm square of error . This observation underlies exponential convergence of SGD in certain convex settings [strohmer2009randomized, moulines2011non, schmidt2013fast, needell2014stochastic, ma2017power].
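The quadratic dependence of the gradient variance on the error can be checked numerically. The sketch below assumes the per-sample loss 0.5 * (x_i^T w - y_i)^2, so that the single-sample gradient is x_i (x_i^T w - y_i). The data and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 100                        # over-parameterized: more parameters than samples
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star                        # interpolation: w_star fits every sample exactly

def gradient_variance(w):
    """Trace of the covariance of the single-sample gradients at w."""
    residual = X @ w - y
    grads = X * residual[:, None]     # row i is x_i * (x_i^T w - y_i)
    return np.sum(np.var(grads, axis=0))

direction = rng.normal(size=d)
var_far = gradient_variance(w_star + 1.0 * direction)
var_near = gradient_variance(w_star + 0.1 * direction)
var_opt = gradient_variance(w_star)
```

Shrinking the error by a factor of 10 shrinks the variance by a factor of 100, and the variance vanishes at the interpolating solution, which is the behavior the proposition describes.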
Appendix C Proof of Theorem 1
The key proof technique is to consider the asymptotic behavior of SGD+Nesterov in the decoupled model of data when the condition number becomes large.
Notations and proof setup. Recall that the square loss function based on the component decoupled data , defined in Eq.6, is in the interpolation regime. Then, for SGD+Nesterov, we have the recurrence relation
where . It is important to note that each component of evolves independently, due to the fact that is diagonal.
With , we define for each component that
where is the -th component of vector .
The recurrence relation in Eq.26 can be rewritten as
For the ease of analysis, we define and . Without loss of generality, we assume in this section. In this case, and , where is the condition number.
Elementary analysis gives the eigenvalues of :
Proof idea. For the two-dimensional component decoupled data, we have
By definition of in Eq.27, we can see that the convergence rate is lower bounded by the convergence rates of the sequences . By the relation Eq.28, we have that the convergence rate of the sequence is controlled by the magnitude of the top eigenvalue of , if has non-zero component along the eigenvector of with eigenvalue . Specifically, if , grows at a rate of , indicating the divergence of SGD+Nesterov; if , then converges at a rate of .
In the following, we use the eigen-systems of matrices , especially the top eigenvalue, to analyze the convergence behavior of SGD+Nesterov with any hyper-parameter setting. We show that, for any choice of hyper-parameters (i.e., step-size and momentum parameter), at least one of the following statements must hold:
has an eigenvalue larger than 1.
has an eigenvalue of magnitude .
This is formalized in the following two lemmas.
Lemma 1.
For any , if step size
then, has an eigenvalue larger than 1.
We will analyze the dependence of the eigenvalues on , when is large to obtain
Lemma 2.
For any , if step size
then, has an eigenvalue of magnitude .
Finally, we show that has non-zero component along the eigenvector of with eigenvalue , hence the convergence of SGD+Nesterov is controlled by the eigenvalue of with the largest magnitude.
Lemma 3.
Assume SGD+Nesterov is initialized with such that both components and are non-zero. Then, for all , has a non-zero component in the eigen direction of that corresponds to the eigenvalue with largest magnitude.
When is randomly initialized, the conditions and are satisfied with probability 1, since complementary cases form a lower dimensional manifold which has measure 0.
By combining Lemmas 1, 2 and 3, we have that SGD+Nesterov either diverges or converges at a rate of , and hence, we conclude the non-acceleration of SGD+Nesterov. In addition, Corollary 1 is a special case of Theorem 1 and is proven by combining Lemmas 1 and 3.
At a high level, the proof ideas of Lemmas 1 and 3 are analogous to those of [kidambi2018insufficiency], which proves the non-acceleration of the stochastic Heavy Ball method over SGD. The proof idea of Lemma 2, however, is new to this work.
Proof of Lemma 1.
The characteristic polynomial of , , are:
First note that . In order to show has an eigenvalue larger than 1, it suffices to verify that , because is continuous.
Replacing by and by , we have
Solving for the inequality , we have
for positive step size . ∎
Proof of Lemma 2.
We will show that at least one of the eigenvalues of is , under the condition in Eq.32.
First, we note that , which is . We consider the following cases separately: 1) is ; 2) is and ; 3) is ; 4) is and ; and 5) is , the last of which includes the case where momentum parameter is a constant.
Note that, for cases 1-4, is . In such cases, the step size condition Eq.32 can be written as
It is interesting to note that must be to not diverge, when is . This is very different from SGD, where a constant step size can result in convergence, see [ma2017power].
Case 1: is