# On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, Quanquan Gu

Equal Contribution

• Dongruo Zhou: Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: zhoudongruo@gmail.com
• Yiqi Tang: Department of Computer Science, University of Virginia, Charlottesville, VA 22904, USA; e-mail: yt6ze@virginia.edu
• Ziyan Yang: Department of Computer Science, University of Virginia, Charlottesville, VA 22904, USA; e-mail: zy3cx@virginia.edu
• Yuan Cao: The Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544, USA; e-mail: yuanc@princeton.edu
• Quanquan Gu: Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: qgu@cs.ucla.edu

###### Abstract

Adaptive gradient methods are workhorses in deep learning. However, the convergence guarantees of adaptive gradient methods for nonconvex optimization have not been sufficiently studied. In this paper, we provide a sharp analysis of a recently proposed adaptive gradient method, namely the partially adaptive momentum estimation method (Padam) (Chen and Gu, 2018), which admits many existing adaptive gradient methods such as AdaGrad, RMSProp and AMSGrad as special cases. Our analysis shows that, for smooth nonconvex functions, Padam converges to a first-order stationary point at the rate of $O\big(\big(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\big)^{1/2}/T^{3/4} + d/T\big)$, where $T$ is the number of iterations, $d$ is the dimension, $\mathbf{g}_1,\ldots,\mathbf{g}_T$ are the stochastic gradients, and $\mathbf{g}_{1:T,i} = [g_{1,i}, g_{2,i}, \ldots, g_{T,i}]^\top$. Our theoretical result also suggests that in order to achieve a faster convergence rate, it is necessary to use Padam instead of AMSGrad. This is well-aligned with the empirical results of deep learning reported in Chen and Gu (2018).

## 1 Introduction

Stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants have been widely used in training deep neural networks. Among those variants, adaptive gradient methods (AdaGrad) (Duchi et al., 2011; McMahan and Streeter, 2010), which scale each coordinate of the gradient by a function of past gradients, can achieve better performance than vanilla SGD in practice when the gradients are sparse. An intuitive explanation for the success of AdaGrad is that it automatically adjusts the learning rate for each feature based on the partial gradient, which accelerates the convergence. However, AdaGrad was later found to demonstrate degraded performance especially in cases where the loss function is nonconvex or the gradient is dense, due to rapid decay of learning rate. This problem is especially exacerbated in deep learning due to the huge number of optimization variables. To overcome this issue, RMSProp (Tieleman and Hinton, 2012) was proposed to use exponential moving average rather than the arithmetic average to scale the gradient, which mitigates the rapid decay of the learning rate. Kingma and Ba (2014) proposed an adaptive momentum estimation method (Adam), which incorporates the idea of momentum (Polyak, 1964; Sutskever et al., 2013) into RMSProp. Other related algorithms include AdaDelta (Zeiler, 2012) and Nadam (Dozat, 2016), which combine the idea of exponential moving average of the historical gradients, Polyak’s heavy ball (Polyak, 1964) and Nesterov’s accelerated gradient descent (Nesterov, 2013). Recently, by revisiting the original convergence analysis of Adam, Reddi et al. (2018) found that for some handcrafted simple convex optimization problem, Adam does not even converge to the global minimizer. In order to address this convergence issue of Adam, Reddi et al. (2018) proposed a new variant of the Adam algorithm namely AMSGrad, which has guaranteed convergence in the convex optimization setting. 
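The update rule of AMSGrad is as follows (with slight abuse of notation, we denote by $\sqrt{\hat{\mathbf{v}}_t}$ the element-wise square root of the vector $\hat{\mathbf{v}}_t$, by $\mathbf{m}_t/\sqrt{\hat{\mathbf{v}}_t}$ the element-wise division between $\mathbf{m}_t$ and $\sqrt{\hat{\mathbf{v}}_t}$, and by $\max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t)$ the element-wise maximum between $\hat{\mathbf{v}}_{t-1}$ and $\mathbf{v}_t$):

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t \frac{\mathbf{m}_t}{\sqrt{\hat{\mathbf{v}}_t}}, \quad \text{with } \hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t),$$

where $\alpha_t > 0$ is the step size, $\mathbf{x}_t \in \mathbb{R}^d$ is the iterate in the $t$-th iteration, and $\mathbf{m}_t, \mathbf{v}_t \in \mathbb{R}^d$ are the exponential moving averages of the gradient and the squared gradient at the $t$-th iteration respectively. More specifically, $\mathbf{m}_t$ and $\mathbf{v}_t$ are defined as follows (here we denote by $\mathbf{g}_t^2$ the element-wise square of the vector $\mathbf{g}_t$):

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2,$$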

where $\beta_1, \beta_2 \in [0,1]$ are hyperparameters of the algorithm, and $\mathbf{g}_t$ is the stochastic gradient at the $t$-th iteration. However, Wilson et al. (2017) found that for over-parameterized neural networks, training with Adam or its variants typically generalizes worse than SGD, even when the training performance is better. In particular, they found that carefully-tuned SGD with momentum, weight decay and appropriate learning rate decay strategies can significantly outperform adaptive gradient algorithms in terms of test error. This problem is often referred to as the generalization gap for adaptive gradient methods. In order to close this generalization gap of Adam and AMSGrad, Chen and Gu (2018) proposed a partially adaptive momentum estimation method (Padam). Instead of scaling the gradient by $\sqrt{\hat{\mathbf{v}}_t}$, this method chooses to scale the gradient by $\hat{\mathbf{v}}_t^{p}$, where $p \in [0, 1/2]$ is a hyperparameter. This gives rise to the following update formula (we denote by $\hat{\mathbf{v}}_t^{-p}$ the element-wise $(-p)$-th power of the vector $\hat{\mathbf{v}}_t$, and the product with $\mathbf{m}_t$ is element-wise):

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t\, \mathbf{m}_t\, \hat{\mathbf{v}}_t^{-p}, \quad \text{with } \hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t).$$

Evidently, when $p = 1/2$, Padam reduces to AMSGrad. Padam also reduces to the corrected version of RMSProp (Reddi et al., 2018) when $p = 1/2$ and $\beta_1 = 0$.
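To make the role of the partially adaptive parameter concrete, the following minimal sketch evaluates the per-coordinate effective step size $\alpha \hat{v}^{-p}$ for a few values of $p$. The second-moment values and the step size are hypothetical numbers chosen only for illustration; they do not come from any experiment.

```python
def effective_step(alpha, v_hat, p):
    """Per-coordinate step size alpha * v_hat**(-p) that multiplies the momentum."""
    return [alpha * (v ** (-p)) for v in v_hat]

# Hypothetical second-moment estimates for three coordinates
# (a rarely updated, a moderately updated and a frequently updated one).
v_hat = [1e-4, 1e-2, 1.0]
alpha = 0.1

for p in (0.0, 0.25, 0.5):
    # p = 0 gives the same step on every coordinate (no adaptivity);
    # p = 1/2 gives the fully adaptive, AMSGrad-style scaling.
    print(p, effective_step(alpha, v_hat, p))
```

Larger $p$ spreads the step sizes further apart across coordinates, while $p = 0$ removes the coordinate-wise scaling entirely.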

Despite the successes of adaptive gradient methods for training deep neural networks, the convergence guarantees for these algorithms are mostly restricted to online convex optimization (Duchi et al., 2011; Kingma and Ba, 2014; Reddi et al., 2018; Chen and Gu, 2018). Therefore, there is a huge gap between existing online convex optimization guarantees for adaptive gradient methods and the empirical successes of adaptive gradient methods in nonconvex optimization. In order to bridge this gap, there are a few recent attempts to prove the nonconvex optimization guarantees for adaptive gradient methods. More specifically, Basu et al. (2018) proved the convergence rate of RMSProp and Adam when using deterministic gradient rather than stochastic gradient. Ward et al. (2018) proved the convergence rate of a simplified AdaGrad where the moving average of the norms of the gradient vectors is used to adjust the gradient vector in both deterministic and stochastic settings for smooth nonconvex functions. Nevertheless, the convergence guarantees in Basu et al. (2018); Ward et al. (2018) are still limited to simplified algorithms.

In this paper, we provide a sharp convergence analysis of adaptive gradient methods. In particular, we analyze the state-of-the-art adaptive gradient method, i.e., Padam (Chen and Gu, 2018), and prove its convergence rate for smooth nonconvex objective functions in the stochastic optimization setting. Our results directly imply the convergence rates for AMSGrad (the corrected version of Adam) and the corrected version of RMSProp (Reddi et al., 2018). Our analyses can be extended to other adaptive gradient methods such as AdaGrad, AdaDelta (Zeiler, 2012) and Nadam (Dozat, 2016) mentioned above, but we omit these extensions in this paper for the sake of conciseness. It is worth noting that our convergence analysis places equal emphasis on the dependence on the number of iterations $T$ and on the dimension $d$ in the convergence rate. This is motivated by the fact that modern machine learning methods, especially the training of deep neural networks, usually require solving very high-dimensional nonconvex optimization problems. The order of the dimension $d$ is usually comparable to or even larger than the total number of iterations $T$. Take training the latest convolutional neural network DenseNet-BC (Huang et al., 2017) with depth and growth rate on CIFAR-10 (Krizhevsky, 2009) as an example. According to Huang et al. (2017), the network is trained with in total million iterations, however the number of parameters in the network is million. This example shows that $d$ can indeed be of the same order as $T$ in practice. Therefore, we argue that it is very important to show the precise dependence on both $T$ and $d$ in the convergence analysis of adaptive gradient methods for modern machine learning.

When we were preparing this manuscript, we noticed that there was a paper (Chen et al., 2018) released on arXiv on August 8th, 2018, which analyzes the convergence of a class of Adam-type algorithms including AMSGrad and AdaGrad for nonconvex optimization. Our work is an independent work, and our derived convergence rate for AMSGrad is faster than theirs.

### 1.1 Our Contributions

The main contributions of our work are summarized as follows:

• We prove that the convergence rate of Padam to a stationary point for stochastic nonconvex optimization is

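$$O\bigg(\frac{\big(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\big)^{1/2}}{T^{3/4}} + \frac{d}{T}\bigg), \tag{1.1}$$

where $\mathbf{g}_1,\ldots,\mathbf{g}_T$ are the stochastic gradients and $\mathbf{g}_{1:T,i} = [g_{1,i}, g_{2,i}, \ldots, g_{T,i}]^\top$. When the stochastic gradients are $\ell_2$-bounded, (1.1) matches the convergence rate of vanilla SGD in terms of the rate of $T$.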

• Our result implies the convergence rate for AMSGrad is

$$O\bigg(\sqrt{\frac{d}{T}} + \frac{d}{T}\bigg),$$

which has a better dependence on the dimension $d$ and $T$ than the convergence rate proved in Chen et al. (2018), i.e.,

$$O\bigg(\frac{\log T + d^2}{\sqrt{T}}\bigg).$$

### 1.2 Additional Related Work

Here we briefly review other related work on nonconvex stochastic optimization.

Ghadimi and Lan (2013) proposed a randomized stochastic gradient (RSG) method, and proved its $O(1/\sqrt{T})$ convergence rate to a stationary point. Ghadimi and Lan (2016) proposed a randomized stochastic accelerated gradient (RSAG) method, which achieves an $O(1/T + \sigma^2/\sqrt{T})$ convergence rate, where $\sigma^2$ is an upper bound on the variance of the stochastic gradient. Motivated by the success of stochastic momentum methods in deep learning (Sutskever et al., 2013), Yang et al. (2016) provided a unified convergence analysis for both the stochastic heavy-ball method and the stochastic variant of Nesterov's accelerated gradient method, and proved an $O(1/\sqrt{T})$ convergence rate to a stationary point for smooth nonconvex functions. Reddi et al. (2016); Allen-Zhu and Hazan (2016) proposed variants of the stochastic variance-reduced gradient (SVRG) method (Johnson and Zhang, 2013) that are provably faster than gradient descent in the nonconvex finite-sum setting. Lei et al. (2017) proposed a stochastically controlled stochastic gradient (SCSG) method, which further improves the convergence rate of SVRG for finite-sum smooth nonconvex optimization. Very recently, Zhou et al. (2018) proposed a new algorithm called stochastic nested variance-reduced gradient (SNVRG), which achieves strictly better gradient complexity than both SVRG and SCSG for finite-sum and stochastic smooth nonconvex optimization.

There is another line of research in stochastic smooth nonconvex optimization, which makes use of the $\lambda$-nonconvexity of a nonconvex function $f$ (i.e., $\nabla^2 f \succeq -\lambda \mathbf{I}$). More specifically, Natasha 1 (Allen-Zhu, 2017b) and Natasha 1.5 (Allen-Zhu, 2017a) have been proposed, which solve a modified regularized problem and achieve a faster convergence rate to first-order stationary points than SVRG and SCSG in the finite-sum and stochastic settings respectively. In addition, Allen-Zhu (2018) proposed an SGD4 algorithm, which optimizes a series of regularized problems, and is able to achieve a faster convergence rate than SGD.

### 1.3 Organization and Notation

The remainder of this paper is organized as follows: We present the problem setup and review the algorithms in Section 2. We provide the convergence guarantee of Padam for stochastic smooth nonconvex optimization in Section 3. Finally, we conclude our paper in Section 4.

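Notation. Scalars are denoted by lower case letters, vectors by lower case bold face letters, and matrices by upper case bold face letters. For a vector $\mathbf{x} \in \mathbb{R}^d$, we denote the $\ell_q$ norm ($q > 0$) of $\mathbf{x}$ by $\|\mathbf{x}\|_q = \big(\sum_{i=1}^d |x_i|^q\big)^{1/q}$, and the $\ell_\infty$ norm of $\mathbf{x}$ by $\|\mathbf{x}\|_\infty = \max_{i=1}^d |x_i|$. For a sequence of vectors $\{\mathbf{g}_j\}_{j=1}^t$, we denote by $g_{j,i}$ the $i$-th element in $\mathbf{g}_j$. We also denote $\mathbf{g}_{1:t,i} = [g_{1,i}, g_{2,i}, \ldots, g_{t,i}]^\top$. With slight abuse of notation, for any two vectors $\mathbf{a}$ and $\mathbf{b}$, we denote $\mathbf{a}^2$ as the element-wise square, $\mathbf{a}^p$ as the element-wise power operation, $\mathbf{a}/\mathbf{b}$ as the element-wise division and $\max(\mathbf{a}, \mathbf{b})$ as the element-wise maximum. For a matrix $\mathbf{A} \in \mathbb{R}^{d\times d}$, we define $\|\mathbf{A}\|_{1,1} = \sum_{i,j=1}^d |A_{ij}|$. Given two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ if there exists a constant $0 < C < +\infty$ such that $a_n \le C\, b_n$. We use the notation $\tilde{O}(\cdot)$ to hide logarithmic factors.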

## 2 Problem Setup and Algorithms

In this section, we first introduce the preliminary definitions used in this paper, followed by the problem setup of stochastic nonconvex optimization. Then we review the state-of-the-art adaptive gradient method, i.e., Padam (Chen and Gu, 2018), along with AMSGrad (the corrected version of Adam) (Reddi et al., 2018) and the corrected version of RMSProp (Tieleman and Hinton, 2012; Reddi et al., 2018).

### 2.1 Problem Setup

We study the following stochastic nonconvex optimization problem

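$$\min_{\mathbf{x}\in\mathbb{R}^d} f(\mathbf{x}) := \mathbb{E}_{\xi}\big[f(\mathbf{x};\xi)\big],$$

where $\xi$ is a random variable following a certain distribution, and $f(\mathbf{x};\xi): \mathbb{R}^d \to \mathbb{R}$ is an $L$-smooth nonconvex function. In the stochastic setting, one cannot directly access the full gradient of $f(\mathbf{x})$. Instead, one can only get unbiased estimators of the gradient of $f(\mathbf{x})$, namely $\nabla f(\mathbf{x};\xi)$. This setting has been studied in Ghadimi and Lan (2013, 2016).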
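As an illustration of the unbiasedness requirement, the sketch below builds a small finite-sum objective in which $\xi$ selects one data point uniformly at random. The per-sample loss $\cos(a x) + \tfrac{1}{2} b x^2$, the data and the evaluation point are all assumptions made up for this example.

```python
import math
import random

rng = random.Random(0)
# Hypothetical data: each sample xi = (a, b) defines f(x; xi) = cos(a*x) + 0.5*b*x^2,
# which is smooth and nonconvex in the scalar x.
data = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(50)]

def stochastic_grad(x, sample):
    a, b = sample
    return -a * math.sin(a * x) + b * x      # d/dx f(x; xi)

def full_grad(x):
    # Gradient of f(x) = E_xi[f(x; xi)] under the uniform distribution on data.
    return sum(stochastic_grad(x, s) for s in data) / len(data)

x = 0.7
# Exact expectation of the stochastic gradient over xi: equals the full gradient.
expectation = sum(stochastic_grad(x, s) * (1.0 / len(data)) for s in data)

# Monte Carlo average of sampled stochastic gradients: close to the full gradient.
draws = [stochastic_grad(x, rng.choice(data)) for _ in range(20000)]
mc_estimate = sum(draws) / len(draws)

print(abs(expectation - full_grad(x)), abs(mc_estimate - full_grad(x)))
```

A single draw can be far from the full gradient; only its expectation matches, which is the property the analysis relies on.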

### 2.2 Algorithms

In this section we introduce the algorithms we study in this paper. We mainly consider three algorithms: Padam (Chen and Gu, 2018), AMSGrad (Reddi et al., 2018) and a corrected version of RMSProp (Tieleman and Hinton, 2012; Reddi et al., 2018).

The Padam algorithm is given in Algorithm 1. It was originally proposed by Chen and Gu (2018) to improve the generalization performance of adaptive gradient methods. As is shown in Algorithm 1, the learning rate of Padam is $\alpha_t \hat{\mathbf{V}}_t^{-p}$, where $\hat{\mathbf{V}}_t = \mathrm{diag}(\hat{\mathbf{v}}_t)$ and $p$ is a partially adaptive parameter. With this parameter $p$, Padam unifies AMSGrad and SGD with momentum, and gives a general framework of algorithms with exponential moving average. Padam reduces to the AMSGrad algorithm when $p = 1/2$. If $p = 1/2$ and $\beta_1 = 0$, Padam reduces to a corrected version of the RMSProp algorithm given by Reddi et al. (2018). As important special cases of Padam, we show AMSGrad and the corrected version of RMSProp in Algorithms 2 and 3 respectively.
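The following is a minimal Python sketch of a Padam-style loop, written under stated assumptions rather than as a transcription of Algorithm 1: it assumes the update $\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \hat{\mathbf{V}}_t^{-p}\mathbf{m}_t$ with zero-initialized moments and a constant step size, it adds a small `eps` (not part of the analysis) to avoid raising zero to a negative power, and it returns a uniformly sampled iterate among $\mathbf{x}_2,\ldots,\mathbf{x}_T$ as the output. The function names and the toy objective are hypothetical.

```python
import random

def padam(grad_fn, x1, T, alpha=0.01, beta1=0.9, beta2=0.99, p=0.125,
          eps=1e-8, seed=0):
    """Sketch of Padam with constant step size alpha.

    grad_fn(x, rng) returns a stochastic gradient at x as a list of floats.
    Returns (iterates, x_out): iterates holds x_1, ..., x_{T+1}.
    """
    rng = random.Random(seed)
    d = len(x1)
    x = list(x1)
    m, v, v_hat = [0.0] * d, [0.0] * d, [0.0] * d
    iterates = [list(x)]
    for _ in range(T):
        g = grad_fn(x, rng)
        for i in range(d):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]        # first moment
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2   # second moment
            v_hat[i] = max(v_hat[i], v[i])                  # AMSGrad-style max
            x[i] -= alpha * m[i] / (v_hat[i] + eps) ** p    # partially adaptive step
        iterates.append(list(x))
    # Assumed output rule: uniform choice among x_2, ..., x_T.
    x_out = rng.choice(iterates[1:T]) if T >= 2 else iterates[-1]
    return iterates, x_out

def noisy_grad(x, rng):
    # Toy smooth nonconvex objective sum_i x_i^2 / (1 + x_i^2) plus bounded noise.
    return [2 * xi / (1 + xi * xi) ** 2 + rng.uniform(-0.1, 0.1) for xi in x]

iterates, x_out = padam(noisy_grad, [1.5, -2.0, 0.5], T=2000)
print(iterates[-1], x_out)
```

Setting `p=0.5` gives an AMSGrad-style update, and additionally setting `beta1=0.0` gives the corrected RMSProp-style update described above.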

## 3 Main Theory

In this section we present our main theoretical results. We first introduce the following assumptions.

###### Assumption 3.1 (Bounded Gradient).

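$f(\mathbf{x}) = \mathbb{E}_{\xi} f(\mathbf{x};\xi)$ has $G_\infty$-bounded stochastic gradient. That is, for any $\xi$, we assume that

$$\|\nabla f(\mathbf{x};\xi)\|_\infty \le G_\infty.$$

It is worth mentioning that Assumption 3.1 is slightly weaker than the $\ell_2$-boundedness assumption $\|\nabla f(\mathbf{x};\xi)\|_2 \le G_2$ used in Reddi et al. (2016); Chen et al. (2018). Since $\|\nabla f(\mathbf{x};\xi)\|_\infty \le \|\nabla f(\mathbf{x};\xi)\|_2 \le \sqrt{d}\,\|\nabla f(\mathbf{x};\xi)\|_\infty$, the $\ell_2$-boundedness assumption implies Assumption 3.1 with $G_\infty = G_2$. In fact, $G_2$ is often larger than $G_\infty$ by a factor of $\sqrt{d}$.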
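The gap between the two boundedness assumptions can be seen on two extreme vectors. The sketch below is only an illustration with made-up vectors: a dense sign vector attains the $\sqrt{d}$ factor between the $\ell_2$ and $\ell_\infty$ norms, while a one-hot vector has equal norms.

```python
import math
import random

def linf_norm(x):
    return max(abs(a) for a in x)

def l2_norm(x):
    return math.sqrt(sum(a * a for a in x))

rng = random.Random(0)
d = 1000
dense = [rng.choice([-1.0, 1.0]) for _ in range(d)]   # every coordinate at the bound
one_hot = [1.0] + [0.0] * (d - 1)                      # a single active coordinate

for name, vec in (("dense", dense), ("one_hot", one_hot)):
    # Ratio ||x||_2 / ||x||_inf always lies in [1, sqrt(d)].
    print(name, l2_norm(vec) / linf_norm(vec), math.sqrt(d))
```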

###### Assumption 3.2 (L-smooth).

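$f(\mathbf{x}) = \mathbb{E}_{\xi} f(\mathbf{x};\xi)$ is $L$-smooth: for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, we have

$$\big|f(\mathbf{x}) - f(\mathbf{y}) - \langle \nabla f(\mathbf{y}), \mathbf{x}-\mathbf{y}\rangle\big| \le \frac{L}{2}\|\mathbf{x}-\mathbf{y}\|_2^2.$$

Assumption 3.2 is a standard assumption frequently used in the analysis of gradient-based algorithms. It is equivalent to the $L$-gradient Lipschitz condition, which is often written as $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 \le L\|\mathbf{x}-\mathbf{y}\|_2$.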
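A quick numerical sanity check of both forms of the smoothness condition is sketched below for the toy nonconvex function $f(\mathbf{x}) = \sum_i \sin(x_i)$, for which $L = 1$ is a valid constant. The function and the sampling range are assumptions chosen for illustration, and the first-order remainder is taken in its usual form $f(\mathbf{x}) - f(\mathbf{y}) - \langle\nabla f(\mathbf{y}), \mathbf{x}-\mathbf{y}\rangle$.

```python
import math
import random

def f(x):
    return sum(math.sin(a) for a in x)

def grad_f(x):
    return [math.cos(a) for a in x]

L = 1.0  # valid here: the Hessian is diag(-sin(x_i)), with entries in [-1, 1]

def smoothness_slacks(x, y):
    """Slack in the quadratic bound and in the gradient Lipschitz bound (both >= 0)."""
    diff = [a - b for a, b in zip(x, y)]
    sq_dist = sum(t * t for t in diff)
    remainder = f(x) - f(y) - sum(g * t for g, t in zip(grad_f(y), diff))
    grad_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(grad_f(x), grad_f(y))))
    return 0.5 * L * sq_dist - abs(remainder), L * math.sqrt(sq_dist) - grad_dist

rng = random.Random(0)
pairs = [([rng.uniform(-3, 3) for _ in range(5)],
          [rng.uniform(-3, 3) for _ in range(5)]) for _ in range(1000)]
all_ok = all(min(smoothness_slacks(x, y)) >= -1e-12 for x, y in pairs)
print(all_ok)
```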

We are now ready to present our main result.

###### Theorem 3.3 (Padam).

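In Algorithm 1, suppose that $p \in [0, 1/2]$, $\beta_1 < \beta_2^{2p}$ and $\alpha_t = \alpha$ for $t = 1,\ldots,T$. Then under Assumptions 3.1 and 3.2, for any $q \in [\max\{0, 4p-1\}, 1]$, the output $\mathbf{x}_{\text{out}}$ of Algorithm 1 satisfies that

$$\mathbb{E}\Big[\big\|\nabla f(\mathbf{x}_{\text{out}})\big\|_2^2\Big] \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha d^q M_3}{T^{(1-q)/2}}\,\mathbb{E}\bigg(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\bigg)^{1-q}, \tag{3.1}$$

where

$$\begin{aligned}
M_1 &= 2G_\infty^{2p}\Delta f, \qquad M_2 = \frac{4G_\infty^{2+2p}\,\mathbb{E}\big\|\hat{\mathbf{v}}_1^{-p}\big\|_1}{d(1-\beta_1)} + 4G_\infty^2, \\
M_3 &= \frac{4LG_\infty^{1+q-2p}}{(1-\beta_2)^{2p}} + \frac{8LG_\infty^{1+q-2p}(1-\beta_1)}{(1-\beta_2)^{2p}(1-\beta_1/\beta_2^{2p})}\bigg(\frac{\beta_1}{1-\beta_1}\bigg)^2,
\end{aligned}$$

and $\Delta f = f(\mathbf{x}_1) - \inf_{\mathbf{x}} f(\mathbf{x})$.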
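For readers who want to evaluate the constants of Theorem 3.3 numerically, here is a small helper sketch. The function name, argument names and the sample values are hypothetical; `v1_inv_p_l1` stands for the quantity $\mathbb{E}\|\hat{\mathbf{v}}_1^{-p}\|_1$, and the check that $\beta_1/\beta_2^{2p} < 1$ is included as an assumption so that the denominator in $M_3$ stays positive.

```python
def padam_constants(G_inf, L, delta_f, beta1, beta2, p, q, v1_inv_p_l1, d):
    """Evaluate M1, M2, M3 from the displayed formulas of Theorem 3.3."""
    gamma = beta1 / beta2 ** (2 * p)
    if not gamma < 1:
        raise ValueError("need beta1 / beta2**(2p) < 1")
    M1 = 2 * G_inf ** (2 * p) * delta_f
    M2 = 4 * G_inf ** (2 + 2 * p) * v1_inv_p_l1 / (d * (1 - beta1)) + 4 * G_inf ** 2
    base = L * G_inf ** (1 + q - 2 * p) / (1 - beta2) ** (2 * p)
    M3 = 4 * base + 8 * base * (1 - beta1) / (1 - gamma) * (beta1 / (1 - beta1)) ** 2
    return M1, M2, M3

# Hypothetical values; with p = 1/2, q = 1 and beta1 = 0 the formulas collapse
# to the RMSProp-type constants 2*G*delta_f, 4*G^3*E||v^{-1/2}||_1/d + 4*G^2, 4*L*G/(1-beta2).
print(padam_constants(G_inf=2.0, L=3.0, delta_f=5.0, beta1=0.0, beta2=0.75,
                      p=0.5, q=1.0, v1_inv_p_l1=10.0, d=5))
```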

###### Remark 3.4.

From Theorem 3.3, we can see that $M_1$ and $M_3$ are independent of the number of iterations $T$ and the dimension $d$. In addition, if $\|\hat{\mathbf{v}}_1^{-1}\|_\infty = O(1)$, it is easy to see that $M_2$ also has an upper bound that is independent of $T$ and $d$.

The following corollary is a special case of Theorem 3.3 when $p \in [0, 1/4]$ and $q = 0$.

###### Corollary 3.5.

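Under the same conditions of Theorem 3.3, if $p \in [0, 1/4]$, then the output of Padam satisfies

$$\mathbb{E}\Big[\big\|\nabla f(\mathbf{x}_{\text{out}})\big\|_2^2\Big] \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{M_3'\,\alpha}{\sqrt{T}}\,\mathbb{E}\bigg(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\bigg), \tag{3.2}$$

where $M_1$, $M_2$ and $\Delta f$ are the same as in Theorem 3.3, and $M_3'$ is defined as follows:

$$M_3' = \frac{4LG_\infty^{1-2p}}{(1-\beta_2)^{2p}} + \frac{8LG_\infty^{1-2p}(1-\beta_1)}{(1-\beta_2)^{2p}(1-\beta_1/\beta_2^{2p})}\bigg(\frac{\beta_1}{1-\beta_1}\bigg)^2.$$
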
###### Remark 3.6.

Corollary 3.5 simplifies the result of Theorem 3.3 by choosing $q = 0$ under the condition $p \in [0, 1/4]$. We remark that this choice of $q$ is optimal in an important special case studied in Duchi et al. (2011); Reddi et al. (2018): when the gradient vectors are sparse, we assume that $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 \ll d\sqrt{T}$. Then for $q > 0$, it follows that

$$\frac{\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2}{T} \ll \frac{d^q\big(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\big)^{1-q}}{T^{1-q/2}}. \tag{3.3}$$

(3.3) implies that the upper bound provided by (3.2) is strictly better than (3.1) with $q > 0$. Therefore when the gradient vectors are sparse, Padam achieves faster convergence when $p$ is located in $[0, 1/4]$.
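To see how sparsity affects the quantity $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2$, the sketch below compares it with $d\sqrt{T}$ for two synthetic gradient sequences. Both sequences are assumptions made up for illustration: a dense one with $\pm 1$ entries everywhere, and a sparse one where each step touches a single random coordinate.

```python
import math
import random

def column_norm_sum(grads):
    """sum_i ||g_{1:T,i}||_2 for a list of T gradient vectors of length d."""
    d = len(grads[0])
    return sum(math.sqrt(sum(g[i] ** 2 for g in grads)) for i in range(d))

rng = random.Random(0)
T, d = 400, 50

dense = [[rng.choice([-1.0, 1.0]) for _ in range(d)] for _ in range(T)]

sparse = []
for _ in range(T):
    g = [0.0] * d
    g[rng.randrange(d)] = rng.choice([-1.0, 1.0])   # one active coordinate per step
    sparse.append(g)

print("d*sqrt(T):", d * math.sqrt(T))
print("dense    :", column_norm_sum(dense))
print("sparse   :", column_norm_sum(sparse))
```

In the dense case the sum equals $d\sqrt{T}$, whereas in the one-hot case it is at most $\sqrt{dT}$ by the Cauchy-Schwarz inequality, which is the regime where the sparsity assumption is meaningful.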

###### Remark 3.7.

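We show the convergence rate under different choices of step size $\alpha$. If

$$\alpha = \Theta\bigg(T^{1/4}\Big(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\Big)^{1/2}\bigg)^{-1},$$

then by (3.2), we have

$$\mathbb{E}\Big[\big\|\nabla f(\mathbf{x}_{\text{out}})\big\|_2^2\Big] = O\bigg(\frac{\big(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\big)^{1/2}}{T^{3/4}} + \frac{d}{T}\bigg). \tag{3.4}$$

Note that the convergence rate given by (3.4) is related to the sum of gradient norms $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2$. As is mentioned in Remark 3.6, when the stochastic gradients $\mathbf{g}_t$, $t = 1,\ldots,T$, are sparse, we follow the assumption given by Duchi et al. (2011) that $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 \ll d\sqrt{T}$. More specifically, suppose $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 = O(d\,T^{s})$ for some $0 \le s \le 1/2$. We have

$$\mathbb{E}\Big[\big\|\nabla f(\mathbf{x}_{\text{out}})\big\|_2^2\Big] = O\bigg(\frac{d^{1/2}}{T^{3/4 - s/2}} + \frac{d}{T}\bigg).$$

When $s = 1/2$, we have

$$\mathbb{E}\Big[\big\|\nabla f(\mathbf{x}_{\text{out}})\big\|_2^2\Big] = O\bigg(\sqrt{\frac{d}{T}} + \frac{d}{T}\bigg),$$

which matches the rate $O(1/\sqrt{T})$ achieved by nonconvex SGD (Ghadimi and Lan, 2016), considering the dependence on $T$.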

###### Remark 3.8.

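If we set $\alpha = 1/\sqrt{T}$, which is not related to $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2$, then (3.2) suggests that

$$\mathbb{E}\Big[\big\|\nabla f(\mathbf{x}_{\text{out}})\big\|_2^2\Big] = O\bigg(\frac{1}{\sqrt{T}} + \frac{d}{T} + \frac{1}{T}\,\mathbb{E}\Big(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\Big)\bigg). \tag{3.5}$$

When $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 \ll d\sqrt{T}$ (Duchi et al., 2011; Reddi et al., 2018), by (3.5) we have

$$\mathbb{E}\Big[\big\|\nabla f(\mathbf{x}_{\text{out}})\big\|_2^2\Big] = O\bigg(\frac{1}{\sqrt{T}} + \frac{d}{T} + \frac{d}{\sqrt{T}}\bigg),$$

which matches the convergence result in nonconvex SGD (Ghadimi and Lan, 2016) considering the dependence on $T$.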

Next we show the convergence analysis of two popular algorithms: AMSGrad and RMSProp. Since AMSGrad and RMSProp can be seen as two specific instances of Padam, we can apply Theorem 3.3 with specific parameter choice, and obtain the following two corollaries.

###### Corollary 3.9 (AMSGrad).

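Under the same conditions of Theorem 3.3, for AMSGrad in Algorithm 2, if $\alpha_t = 1/\sqrt{dT}$ for $t = 1,\ldots,T$, then the output $\mathbf{x}_{\text{out}}$ satisfies that

$$\mathbb{E}\Big[\big\|\nabla f(\mathbf{x}_{\text{out}})\big\|_2^2\Big] \le \frac{M_1^A\sqrt{d}}{\sqrt{T}} + \frac{M_2^A d}{T} + \frac{M_3^A\sqrt{d}}{\sqrt{T}}, \tag{3.6}$$

where $\{M_i^A\}_{i=1}^3$ are defined as follows:

$$\begin{aligned}
M_1^A &= 2G_\infty\Delta f, \qquad M_2^A = \frac{4G_\infty^{3}\,\mathbb{E}\big\|\hat{\mathbf{v}}_1^{-1/2}\big\|_1}{d(1-\beta_1)} + 4G_\infty^2, \\
M_3^A &= \frac{4LG_\infty}{(1-\beta_2)} + \frac{8LG_\infty(1-\beta_1)}{(1-\beta_2)(1-\beta_1/\beta_2)}\bigg(\frac{\beta_1}{1-\beta_1}\bigg)^2.
\end{aligned}$$
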
###### Remark 3.10.

As illustrated in Theorem 3.3, $\{M_i^A\}_{i=1}^3$ are independent of $T$ and essentially independent of $d$. Thus, (3.6) implies that AMSGrad achieves an

$$O\bigg(\sqrt{\frac{d}{T}} + \frac{d}{T}\bigg)$$

convergence rate, which matches the convergence rate of nonconvex SGD (Ghadimi and Lan, 2016). Chen et al. (2018) also provided a similar bound for AMSGrad. They showed that

$$\mathbb{E}\Big[\big\|\nabla f(\mathbf{x}_{\text{out}})\big\|_2^2\Big] = O\bigg(\frac{\log T + d^2}{\sqrt{T}}\bigg).$$

It can be seen that the dependence on $d$ in their bound is quadratic, which is worse than the linear dependence implied by (3.6). Moreover, by Corollary 3.5, Corollary 3.9 and (3.3), it is easy to see that Padam with $p \in [0, 1/4]$ is faster than AMSGrad, where $p = 1/2$, which backs up the experimental results in Chen and Gu (2018).
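The difference between the two AMSGrad rates can be made tangible by plugging in numbers. The sketch below reads the rates as $\sqrt{d/T} + d/T$ and $(\log T + d^2)/\sqrt{T}$ and sets all hidden constants to one, which is an assumption made purely for illustration; the chosen values of $d$ and $T$ are likewise arbitrary.

```python
import math

def rate_linear_in_d(d, T):
    """sqrt(d/T) + d/T, with the hidden constant set to 1 (an assumption)."""
    return math.sqrt(d / T) + d / T

def rate_quadratic_in_d(d, T):
    """(log T + d^2) / sqrt(T), with the hidden constant set to 1 (an assumption)."""
    return (math.log(T) + d ** 2) / math.sqrt(T)

for d, T in [(10, 10_000), (1_000, 1_000_000), (100_000, 1_000_000)]:
    print(d, T, rate_linear_in_d(d, T), rate_quadratic_in_d(d, T))
```

Because constants are ignored, the printed values only indicate how the two expressions scale with $d$ and $T$; they are not bounds on any actual run.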

###### Corollary 3.11 (corrected version of RMSProp).

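Under the same conditions of Theorem 3.3, for RMSProp in Algorithm 3, if $\alpha_t = 1/\sqrt{dT}$ for $t = 1,\ldots,T$, then the output $\mathbf{x}_{\text{out}}$ satisfies that

$$\mathbb{E}\Big[\big\|\nabla f(\mathbf{x}_{\text{out}})\big\|_2^2\Big] \le \frac{M_1^R\sqrt{d}}{\sqrt{T}} + \frac{M_2^R d}{T} + \frac{M_3^R\sqrt{d}}{\sqrt{T}}, \tag{3.7}$$

where $\{M_i^R\}_{i=1}^3$ are defined in the following:

$$M_1^R = 2G_\infty\Delta f, \qquad M_2^R = \frac{4G_\infty^3\,\mathbb{E}\big\|\hat{\mathbf{v}}_1^{-1/2}\big\|_1}{d} + 4G_\infty^2, \qquad M_3^R = \frac{4LG_\infty}{(1-\beta_2)}.$$

###### Remark 3.12.

$\{M_i^R\}_{i=1}^3$ are independent of $T$ and essentially independent of $d$. Thus, (3.7) implies that RMSProp achieves an $O\big(\sqrt{d/T} + d/T\big)$ convergence rate, which matches the convergence rate of nonconvex SGD given by Ghadimi and Lan (2016).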

## 4 Conclusions

In this paper, we provided a sharp analysis of the state-of-the-art adaptive gradient method Padam (Chen and Gu, 2018), and proved its convergence rate for smooth nonconvex optimization. Our results directly imply the convergence rates of AMSGrad and the corrected version of RMSProp for smooth nonconvex optimization. In terms of the number of iterations $T$, the derived convergence rates in this paper match the rate achieved by SGD; in terms of the dimension $d$, our results give a better rate than existing work. Our results also offer some insights into the choice of the partially adaptive parameter $p$ in the Padam algorithm: when the gradients are sparse, Padam with $p \in [0, 1/4]$ achieves the fastest convergence rate. This theoretically backs up the experimental results in existing work (Chen and Gu, 2018).

## Acknowledgement

We would like to thank Jinghui Chen for discussion on this work.

## Appendix A Proof of the Main Theory

### A.1 Proof of Theorem 3.3

Let . To prove Theorem 3.3, we need the following lemmas:

###### Lemma A.1.

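Let $\hat{\mathbf{v}}_t$ and $\mathbf{m}_t$ be as defined in Algorithm 1. Then under Assumption 3.1, we have $\|\nabla f(\mathbf{x})\|_\infty \le G_\infty$, $\|\hat{\mathbf{v}}_t\|_\infty \le G_\infty^2$ and $\|\mathbf{m}_t\|_\infty \le G_\infty$.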

###### Lemma A.2.

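Suppose that $f$ has $G_\infty$-bounded stochastic gradient. Let $\beta_1, \beta_2$ be the weight parameters, $\alpha_t$, $t = 1,\ldots,T$, be the step sizes in Algorithm 1 and $q \in [\max\{4p-1, 0\}, 1]$. We denote $\gamma = \beta_1/\beta_2^{2p}$. Suppose that $\alpha_t = \alpha$ and $\gamma < 1$, then under Assumption 3.1, we have the following two results:

$$\sum_{t=1}^T \alpha_t^2\,\mathbb{E}\Big[\big\|\hat{\mathbf{V}}_t^{-p}\mathbf{m}_t\big\|_2^2\Big] \le \frac{T^{(1+q)/2} d^q \alpha^2 (1-\beta_1) G_\infty^{(1+q-4p)}}{(1-\beta_2)^{2p}(1-\gamma)}\,\mathbb{E}\bigg(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\bigg)^{1-q},$$

and

$$\sum_{t=1}^T \alpha_t^2\,\mathbb{E}\Big[\big\|\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2^2\Big] \le \frac{T^{(1+q)/2} d^q \alpha^2 G_\infty^{(1+q-4p)}}{(1-\beta_2)^{2p}}\,\mathbb{E}\bigg(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\bigg)^{1-q}.$$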

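To deal with the stochastic momentum $\mathbf{m}_t$ and the stochastic weight $\hat{\mathbf{V}}_t^{-p}$, following Yang et al. (2016), we define an auxiliary sequence $\mathbf{z}_t$ as follows: let $\mathbf{x}_0 = \mathbf{x}_1$, and for each $t \ge 1$,

$$\mathbf{z}_t = \mathbf{x}_t + \frac{\beta_1}{1-\beta_1}(\mathbf{x}_t - \mathbf{x}_{t-1}) = \frac{1}{1-\beta_1}\mathbf{x}_t - \frac{\beta_1}{1-\beta_1}\mathbf{x}_{t-1}. \tag{A.1}$$

Lemma A.3 shows that $\mathbf{z}_{t+1} - \mathbf{z}_t$ can be represented in two different ways.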

###### Lemma A.3.

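Let $\mathbf{z}_t$ be defined in (A.1). For $t \ge 2$, we have

$$\mathbf{z}_{t+1} - \mathbf{z}_t = \frac{\beta_1}{1-\beta_1}\Big[\mathbf{I} - \big(\alpha_t\hat{\mathbf{V}}_t^{-p}\big)\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\big)^{-1}\Big](\mathbf{x}_{t-1} - \mathbf{x}_t) - \alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t, \tag{A.2}$$

and

$$\mathbf{z}_{t+1} - \mathbf{z}_t = \frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big)\mathbf{m}_{t-1} - \alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t. \tag{A.3}$$

For $t = 1$, we have

$$\mathbf{z}_2 - \mathbf{z}_1 = -\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1. \tag{A.4}$$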
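The identities (A.2) and (A.4) can be checked numerically on a scalar instance. The sketch below assumes the Padam update $x_{t+1} = x_t - \alpha\hat{v}_t^{-p} m_t$ with $m_0 = v_0 = \hat{v}_0 = 0$ and $x_0 = x_1$, and feeds it arbitrary nonzero numbers in place of stochastic gradients; the function name and default parameters are hypothetical.

```python
import random

def lemma_a3_residual(T=50, alpha=0.05, beta1=0.9, beta2=0.99, p=0.25, seed=0):
    """Largest absolute violation of (A.2) and (A.4) along one scalar trajectory."""
    rng = random.Random(seed)
    c = beta1 / (1 - beta1)
    x = {0: 1.0, 1: 1.0}           # x_0 = x_1
    m = v = v_hat = 0.0
    g, s = {}, {}                  # g_t and s_t = alpha * v_hat_t^{-p}
    for t in range(1, T + 1):
        g[t] = rng.choice([-1.0, 1.0]) * rng.uniform(0.1, 1.0)  # stand-in gradient
        m = beta1 * m + (1 - beta1) * g[t]
        v = beta2 * v + (1 - beta2) * g[t] ** 2
        v_hat = max(v_hat, v)
        s[t] = alpha * v_hat ** (-p)
        x[t + 1] = x[t] - s[t] * m
    # Auxiliary sequence from (A.1).
    z = {t: x[t] + c * (x[t] - x[t - 1]) for t in range(1, T + 2)}
    worst = abs((z[2] - z[1]) - (-s[1] * g[1]))                  # (A.4)
    for t in range(2, T + 1):
        rhs = c * (1 - s[t] / s[t - 1]) * (x[t - 1] - x[t]) - s[t] * g[t]  # (A.2)
        worst = max(worst, abs((z[t + 1] - z[t]) - rhs))
    return worst

print(lemma_a3_residual())
```

The returned residual should be at the level of floating-point rounding, since both identities are exact algebraic consequences of the update rule.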

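By Lemma A.3, we connect $\mathbf{z}_{t+1} - \mathbf{z}_t$ with $\mathbf{x}_t - \mathbf{x}_{t-1}$ and $\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t$.

The following two lemmas give bounds on $\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2$ and $\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2$, which play important roles in our proof.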

###### Lemma A.4.

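Let $\mathbf{z}_t$ be defined in (A.1). For $t \ge 2$, we have

$$\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2 \le \big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2.$$

###### Lemma A.5.

Let $\mathbf{z}_t$ be defined in (A.1). For $t \ge 2$, we have

$$\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2 \le L\Big(\frac{\beta_1}{1-\beta_1}\Big)\cdot\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2.$$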

Now we are ready to prove Theorem 3.3.

###### Proof of Theorem 3.3.

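Since $f$ is $L$-smooth, we have:

$$\begin{aligned}
f(\mathbf{z}_{t+1}) &\le f(\mathbf{z}_t) + \nabla f(\mathbf{z}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) + \frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2 \\
&= f(\mathbf{z}_t) + \underbrace{\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t)}_{I_1} + \underbrace{\big(\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\big)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t)}_{I_2} + \underbrace{\frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2}_{I_3}.
\end{aligned} \tag{A.5}$$

In the following, we bound $I_1$, $I_2$ and $I_3$ separately.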

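Bounding term $I_1$: When $t = 1$, we have

$$\nabla f(\mathbf{x}_1)^\top(\mathbf{z}_2 - \mathbf{z}_1) = -\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1. \tag{A.6}$$

For $t \ge 2$, we have

$$\begin{aligned}
\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) &= \nabla f(\mathbf{x}_t)^\top\bigg[\frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big)\mathbf{m}_{t-1} - \alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\bigg] \\
&= \frac{\beta_1}{1-\beta_1}\nabla f(\mathbf{x}_t)^\top\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big)\mathbf{m}_{t-1} - \nabla f(\mathbf{x}_t)^\top\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t,
\end{aligned} \tag{A.7}$$

where the first equality holds due to (A.3) in Lemma A.3. For $\nabla f(\mathbf{x}_t)^\top\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big)\mathbf{m}_{t-1}$ in (A.7), we have

$$\begin{aligned}
\nabla f(\mathbf{x}_t)^\top\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big)\mathbf{m}_{t-1} &\le \|\nabla f(\mathbf{x}_t)\|_\infty\cdot\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big\|_{1,1}\cdot\|\mathbf{m}_{t-1}\|_\infty \\
&\le G_\infty^2\Big[\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\big\|_{1,1} - \big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\big\|_{1,1}\Big] \\
&= G_\infty^2\Big[\big\|\alpha_{t-1}\hat{\mathbf{v}}_{t-1}^{-p}\big\|_1 - \big\|\alpha_t\hat{\mathbf{v}}_t^{-p}\big\|_1\Big].
\end{aligned} \tag{A.8}$$

The first inequality holds because for a positive diagonal matrix $\mathbf{A}$, we have $\mathbf{x}^\top\mathbf{A}\mathbf{y} \le \|\mathbf{x}\|_\infty\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_\infty$. The second inequality holds due to $\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} \succeq \alpha_t\hat{\mathbf{V}}_t^{-p} \succeq 0$ together with Lemma A.1. Next we bound $-\nabla f(\mathbf{x}_t)^\top\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t$. We have

$$\begin{aligned}
-\nabla f(\mathbf{x}_t)^\top\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t &= -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{g}_t + \nabla f(\mathbf{x}_t)^\top\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big)\mathbf{g}_t \\
&\le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{g}_t + \|\nabla f(\mathbf{x}_t)\|_\infty\cdot\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} - \alpha_t\hat{\mathbf{V}}_t^{-p}\big\|_{1,1}\cdot\|\mathbf{g}_t\|_\infty \\
&\le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{g}_t + G_\infty^2\Big(\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\big\|_{1,1} - \big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\big\|_{1,1}\Big) \\
&= -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{g}_t + G_\infty^2\Big(\big\|\alpha_{t-1}\hat{\mathbf{v}}_{t-1}^{-p}\big\|_1 - \big\|\alpha_t\hat{\mathbf{v}}_t^{-p}\big\|_1\Big).
\end{aligned} \tag{A.9}$$

The first inequality holds because for a positive diagonal matrix $\mathbf{A}$, we have $\mathbf{x}^\top\mathbf{A}\mathbf{y} \le \|\mathbf{x}\|_\infty\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_\infty$. The second inequality holds due to $\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p} \succeq \alpha_t\hat{\mathbf{V}}_t^{-p} \succeq 0$. Substituting (A.8) and (A.9) into (A.7), we have

$$\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) \le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{g}_t + \frac{1}{1-\beta_1}G_\infty^2\Big(\big\|\alpha_{t-1}\hat{\mathbf{v}}_{t-1}^{-p}\big\|_1 - \big\|\alpha_t\hat{\mathbf{v}}_t^{-p}\big\|_1\Big). \tag{A.10}$$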
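The diagonal-matrix inequality invoked for (A.8) and (A.9) can be sanity-checked numerically. The sketch below takes it to be $\mathbf{x}^\top\mathbf{A}\mathbf{y} \le \|\mathbf{x}\|_\infty\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_\infty$ for a diagonal $\mathbf{A}$ with nonnegative entries, which is our reading of that step; the random test vectors and the function name are assumptions made for illustration.

```python
import random

def diag_inequality_holds(trials=1000, d=6, seed=0):
    """Check x^T A y <= ||x||_inf * ||A||_{1,1} * ||y||_inf for random diagonal A >= 0."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.uniform(0.0, 2.0) for _ in range(d)]    # diagonal entries of A
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        y = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        lhs = sum(xi * ai * yi for xi, ai, yi in zip(x, a, y))
        # For a diagonal matrix, ||A||_{1,1} is the sum of absolute diagonal entries.
        rhs = max(abs(t) for t in x) * sum(a) * max(abs(t) for t in y)
        if lhs > rhs + 1e-12:
            return False
    return True

print(diag_inequality_holds())
```

The bound follows coordinate by coordinate, since each term $x_i a_i y_i$ is at most $\|\mathbf{x}\|_\infty a_i \|\mathbf{y}\|_\infty$ when $a_i \ge 0$.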

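Bounding term $I_2$: For $t \ge 1$, we have

$$\begin{aligned}
\big(\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\big)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) &\le \|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2\cdot\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2 \\
&\le \Big(\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2\Big)\cdot\frac{\beta_1}{1-\beta_1}\cdot L\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2 \\
&= L\frac{\beta_1}{1-\beta_1}\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2\cdot\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2 + L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2 \\
&\le L\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2,
\end{aligned} \tag{A.11}$$

where the second inequality holds because of Lemma A.4 and Lemma A.5, and the last inequality holds due to Young's inequality.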

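Bounding term $I_3$: For $t \ge 1$, we have

$$\begin{aligned}
\frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2 &\le \frac{L}{2}\Big[\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2\Big]^2 \\
&\le L\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2^2.
\end{aligned} \tag{A.12}$$

The first inequality is obtained by applying Lemma A.4.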
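The second inequality in (A.12) is the elementary bound $(a+b)^2 \le 2a^2 + 2b^2$. Using the shorthand $a$ and $b$ introduced here only for this step, one way to spell it out is:

```latex
\frac{L}{2}(a+b)^2 \;\le\; \frac{L}{2}\,\bigl(2a^2 + 2b^2\bigr) \;=\; La^2 + Lb^2 \;\le\; La^2 + 2Lb^2,
\qquad
a = \bigl\|\alpha_t \hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\bigr\|_2,
\quad
b = \frac{\beta_1}{1-\beta_1}\,\|\mathbf{x}_{t-1}-\mathbf{x}_t\|_2 .
```

The final relaxation from $Lb^2$ to $2Lb^2$ gives (A.12) the same right-hand side as (A.11), so the two bounds add up to $2L a^2 + 4L b^2$ in the next step.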

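For $t = 1$, substituting (A.6), (A.11) and (A.12) into (A.5), taking expectation and rearranging terms, we have

$$\begin{aligned}
\mathbb{E}[f(\mathbf{z}_2) - f(\mathbf{z}_1)] &\le \mathbb{E}\bigg[-\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1 + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_1 - \mathbf{x}_0\|_2^2\bigg] \\
&= \mathbb{E}\Big[-\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1 + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1\big\|_2^2\Big] \\
&\le \mathbb{E}\Big[d\alpha_1 G_\infty^{2-2p} + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1\big\|_2^2\Big],
\end{aligned} \tag{A.13}$$

where the last inequality holds because

$$-\nabla f(\mathbf{x}_1)^\top\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1 \le d\cdot\|\nabla f(\mathbf{x}_1)\|_\infty\cdot\big\|\hat{\mathbf{V}}_1^{-p}\mathbf{g}_1\big\|_\infty \le d\cdot G_\infty\cdot G_\infty^{1-2p} = dG_\infty^{2-2p}.$$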

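For $t \ge 2$, substituting (A.10), (A.11) and (A.12) into (A.5), taking expectation and rearranging terms, we have

$$\begin{aligned}
&\mathbb{E}\bigg[f(\mathbf{z}_{t+1}) + \frac{G_\infty^2\big\|\alpha_t\hat{\mathbf{v}}_t^{-p}\big\|_1}{1-\beta_1} - \bigg(f(\mathbf{z}_t) + \frac{G_\infty^2\big\|\alpha_{t-1}\hat{\mathbf{v}}_{t-1}^{-p}\big\|_1}{1-\beta_1}\bigg)\bigg] \\
&\le \mathbb{E}\bigg[-\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{g}_t + 2L\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2\bigg] \\
&= \mathbb{E}\bigg[-\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\nabla f(\mathbf{x}_t) + 2L\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{m}_{t-1}\big\|_2^2\bigg] \\
&\le \mathbb{E}\bigg[-\alpha_{t-1}\big\|\nabla f(\mathbf{x}_t)\big\|_2^2\big(G_\infty^{2p}\big)^{-1} + 2L\big\|\alpha_t\hat{\mathbf{V}}_t^{-p}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-p}\mathbf{m}_{t-1}\big\|_2^2\bigg],
\end{aligned} \tag{A.14}$$

where the equality holds because $\mathbb{E}[\mathbf{g}_t] = \nabla f(\mathbf{x}_t)$ conditioned on $\nabla f(\mathbf{x}_t)$ and $\hat{\mathbf{V}}_{t-1}^{-p}$, and the second inequality holds because of Lemma A.1. Telescoping (A.14) for $t = 2$ to $T$ and adding with (A.13), we have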

 +2LT∑t=1E∥∥αtˆV−ptgt∥∥22+4L(β11−β1)2T∑t=2E[∥∥αt−1ˆV−pt−1mt−1∥∥22] ≤E[Δf+G2∞∥∥α1ˆv−p1∥∥11−β1+dα1G2−2p∞]+2LT∑t=1E∥∥αtˆV−ptgt∥∥