Weighted AdaGrad with Unified Momentum

Fangyu Zou* (Stony Brook University, fangyu.zou@stonybrook.edu)
Li Shen (Tencent AI Lab, mathshenli@gmail.com)
Zequn Jie (Tencent AI Lab, zequn.nus@gmail.com)
Ju Sun (Stanford University, sunju@stanford.edu)
Wei Liu (Tencent AI Lab, wl2223@columbia.edu)

*Equal Contribution
Abstract

Integrating adaptive learning rates and momentum techniques into SGD leads to a large class of efficiently accelerated adaptive stochastic algorithms, such as Nadam and AccAdaGrad. Despite their effectiveness in practice, their convergence theory remains far from complete, especially in the difficult non-convex stochastic setting. To fill this gap, we propose weighted AdaGrad with unified momentum, dubbed AdaUSM, whose main characteristics are that (1) it incorporates a unified momentum scheme that covers both heavy ball momentum and Nesterov accelerated gradient momentum; and (2) it adopts a novel weighted adaptive learning rate that can unify the learning rates of AdaGrad, AccAdaGrad, Adam, and RMSProp. Moreover, when we take polynomially growing weights in AdaUSM, we obtain its convergence rate in the non-convex stochastic setting. We also show that the adaptive learning rates of Adam and RMSProp correspond to taking exponentially growing weights in AdaUSM, which thereby provides a new perspective for understanding Adam and RMSProp. Lastly, comparative experiments of AdaUSM against SGD with momentum, AdaGrad, AdaEMA, Adam, and AMSGrad on various deep learning models and datasets are also provided.

 


1 Introduction

In this work we consider the following general non-convex stochastic optimization problem:

$\min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\xi}\big[F(x; \xi)\big],$   (1)

where $\mathbb{E}_{\xi}$ denotes the expectation with respect to the random variable $\xi$. We assume that $f$ is bounded from below, i.e., $f^* := \inf_x f(x) > -\infty$, and that its gradient $\nabla f$ is $L$-Lipschitz continuous.

Problem (1) arises in many statistical learning (e.g., logistic regression, AUC maximization) and deep learning models [10, 18]. In general, one only has access to noisy estimates of $\nabla f(x)$, as the expectation in problem (1) often can only be approximated by a finite sum. Hence, one of the most popular algorithms to solve problem (1) is Stochastic Gradient Descent (SGD) [30, 2]:

$x_{t+1} = x_t - \eta_t\, g_t,$   (2)

where $\eta_t$ is the learning rate and $g_t$ is the noisy estimate of the gradient $\nabla f(x_t)$ at the $t$-th iteration. Its convergence rates for both convex and non-convex settings have been established [1, 8].

However, vanilla SGD suffers from slow convergence, and its performance is sensitive to the learning rate, which is tricky to tune. Many techniques have been introduced to improve the convergence speed and robustness of SGD, such as variance reduction [4, 14, 27], adaptive learning rates [6, 15], and momentum acceleration [28, 26, 21]. Among them, adaptive learning rates and momentum acceleration are the most economical, since they require only slightly more computation per iteration. SGD with an adaptive learning rate was first proposed as AdaGrad [23, 6], where the learning rate is adjusted by the cumulative gradient magnitudes:

$x_{t+1} = x_t - \dfrac{\eta}{\sqrt{\epsilon + \sum_{j=1}^{t} g_j \odot g_j}} \odot g_t,$   (3)

where $\eta > 0$ and $\epsilon > 0$ are fixed parameters. On the other hand, Heavy Ball (HB) [28, 7] and Nesterov Accelerated Gradient (NAG) [26, 25] are the two most popular momentum acceleration techniques, which have been extensively studied for stochastic optimization problems [9, 35, 21]:

HB: $x_{t+1} = x_t - \eta_t\, g_t + \mu\,(x_t - x_{t-1});$   NAG: $y_{t+1} = x_t - \eta_t\, g_t,\ \ x_{t+1} = y_{t+1} + \mu\,(y_{t+1} - y_t),$   (4)

where $x_{-1} = x_0$, $y_0 = x_0$, and $\mu \in [0, 1)$ is the momentum factor.
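To make the updates in Eqs. (2)-(4) concrete, here is a minimal NumPy sketch of one SGD step, one AdaGrad step, and a few heavy-ball and Nesterov steps on a toy quadratic; the objective, step size, and momentum factor are illustrative choices, not settings from the paper.

```python
import numpy as np

# Toy problem: f(x) = 0.5 * ||A x||^2, so grad f(x) = A^T A x.
A = np.array([[3.0, 0.0], [0.0, 0.5]])
grad = lambda x: A.T @ A @ x

eta, mu, eps = 0.05, 0.9, 1e-8   # illustrative step size, momentum factor, AdaGrad epsilon

# SGD, Eq. (2): x_{t+1} = x_t - eta * g_t
x_sgd = np.array([1.0, 1.0])
x_sgd = x_sgd - eta * grad(x_sgd)

# AdaGrad, Eq. (3): per-coordinate step scaled by accumulated squared gradients
x_ada, v = np.array([1.0, 1.0]), np.zeros(2)
g = grad(x_ada)
v += g * g
x_ada = x_ada - eta * g / np.sqrt(eps + v)

# Heavy ball (HB), Eq. (4): reuse the previous displacement x_t - x_{t-1}
x_prev = np.array([1.0, 1.0])
x_hb = x_prev.copy()
for _ in range(3):
    x_hb, x_prev = x_hb - eta * grad(x_hb) + mu * (x_hb - x_prev), x_hb

# Nesterov (NAG), Eq. (4): gradient step to y_{t+1}, then extrapolate
y_prev = np.array([1.0, 1.0])
x_nag = y_prev.copy()
for _ in range(3):
    y_new = x_nag - eta * grad(x_nag)
    x_nag, y_prev = y_new + mu * (y_new - y_prev), y_new

print(x_sgd, x_ada, x_hb, x_nag)
```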

Both adaptive learning rates and momentum have been individually investigated and shown to be effective in practice, and are therefore widely applied in tasks such as training deep networks [17, 31, 15, 29]. It is natural to ask: can we effectively incorporate both techniques at the same time, so as to inherit their advantages, and moreover develop a convergence theory for this setting, especially in the more difficult non-convex stochastic case? To the best of our knowledge, Levy et al. [21] first attempted to combine the adaptive learning rate with NAG momentum, which yields the AccAdaGrad algorithm. However, its convergence analysis is limited to the stochastic convex setting. Yan et al. [35] unified SHB and SNAG into a three-step iterate without considering the adaptive learning rate in Eq. (3).

In this work, we revisit the momentum acceleration technique [28, 26] and the adaptive learning rate [6, 15], and propose weighted AdaGrad with unified stochastic momentum, dubbed AdaUSM, to solve the general non-convex stochastic optimization problem (1). Specifically, the proposed AdaUSM has two main features: it develops a novel Unified Stochastic Momentum (USM) scheme that covers SHB and SNAG and is entirely different from the three-step scheme in [35], and it generalizes the adaptive learning rate in Eq. (3) to a more general weighted adaptive learning rate (see Section 3) that unifies the adaptive learning rates of AdaGrad, AccAdaGrad, and Adam into a succinct framework. In contrast to AdaGrad [6], the weighted adaptive learning rate in AdaUSM is estimated via a novel weighted gradient accumulation technique, which puts more weight on the most recent stochastic gradient estimates. Moreover, to make AdaUSM more practical for large-scale problems, a coordinate-wise weighted adaptive learning rate with a low computational cost is used.

We also characterize the convergence rate of AdaUSM in the non-convex stochastic setting when we take polynomially growing weights. When the momentum is NAG and the weights are set to be the same as those in [21], AdaUSM reduces to AccAdaGrad [21]. Consequently, the convergence rate of AccAdaGrad in the non-convex setting is obtained directly as a byproduct. Thus, our work generalizes AccAdaGrad [21] in three aspects: (i) more general weights in estimating the adaptive learning rate; (ii) a new unified momentum including both NAG and HB; (iii) a convergence rate in the more difficult non-convex stochastic setting. To the best of our knowledge, our work is the first to explore the convergence rates of adaptive stochastic algorithms with momentum acceleration in the non-convex stochastic setting. Our contributions are three-fold:

  • We develop a new weighted gradient accumulation technique to estimate the adaptive learning rate, and propose a novel unified stochastic momentum scheme that covers SHB and SNAG. We then integrate the weighted coordinate-wise AdaGrad with the unified momentum mechanism, yielding a novel adaptive stochastic momentum algorithm, dubbed AdaUSM.

  • We establish the non-asymptotic convergence rate of AdaUSM under the general non-convex stochastic setting. Our assumptions are natural and mild.

  • We show that the adaptive learning rates of Adam and RMSProp correspond to taking exponentially growing weights in AdaUSM, which thereby provides a new perspective for understanding Adam and RMSProp.

Related Work. Several works study the convergence rates of adaptive SGD in the non-convex stochastic setting. For example, Li and Orabona [22] first proved the global convergence of a perturbed AdaGrad (in [22], the convergence rate of AdaGrad was established with a perturbed adaptive learning rate; the perturbation factor is unavoidable for their convergence argument); Ward et al. [33] established convergence rates for the original AdaGrad [6] and WNGrad [34]; and convergence rates of Adam/RMSProp [36] and AMSGrad [3] have also been established in the non-convex stochastic setting, respectively.

2 Preliminaries

Notations.

$T$ denotes the maximum number of iterations. The noisy estimate of the gradient $\nabla f(x_t)$ at the $t$-th iteration is denoted by $g_t$ for all $t = 1, \ldots, T$. We use $\mathbb{E}$ to denote expectation as usual, and $\mathbb{E}_t$ the conditional expectation with respect to $g_t$ conditioned on the random variables $g_1, \ldots, g_{t-1}$.

In this paper we allow different learning rates across coordinates, so the learning rate is a vector in $\mathbb{R}^d$. Given a vector $x \in \mathbb{R}^d$, we denote its $i$-th coordinate by $x^i$. The $i$-th coordinate of the gradient $\nabla f(x)$ is denoted by $\partial_i f(x)$. Given two vectors $x, y \in \mathbb{R}^d$, their inner product is denoted by $\langle x, y\rangle$. We also heavily use the coordinate-wise product of $x$ and $y$, denoted $x \odot y$, with $(x \odot y)^i = x^i y^i$. Division by a vector is defined similarly. Given a vector $a \in \mathbb{R}^d$ with positive entries, we define the weighted norm $\|x\|_a^2 := \sum_{i=1}^d a^i (x^i)^2$. A norm without any subscript is the Euclidean norm, and $\|x\|_1$ is defined as $\sum_{i=1}^d |x^i|$.
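A small NumPy illustration of the coordinate-wise notation above (the vectors are arbitrary, and the weighted norm follows the reconstruction $\|x\|_a^2 = \sum_i a^i (x^i)^2$):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.5, 4.0, -1.0])
a = np.array([2.0, 1.0, 0.5])                 # positive weight vector

inner = float(x @ y)                          # inner product <x, y>
hadamard = x * y                              # coordinate-wise product x (.) y
division = x / y                              # coordinate-wise division
weighted_norm_sq = float(np.sum(a * x * x))   # ||x||_a^2 = sum_i a^i (x^i)^2
l1_norm = float(np.sum(np.abs(x)))            # ||x||_1
print(inner, hadamard, division, weighted_norm_sq, l1_norm)
```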

Assumptions.

We assume that the noisy gradient estimates $g_1, g_2, \ldots$ are independent of each other. Moreover,

  • (A1) $\mathbb{E}_t[g_t] = \nabla f(x_t)$, i.e., the stochastic gradient is an unbiased estimator;

  • (A2) $\mathbb{E}\|g_t\|^2 \le \sigma^2$, i.e., the second-order moment of $g_t$ is bounded.

Notice that condition (A2) is slightly weaker than that in [3], which assumes that the stochastic gradient estimate is uniformly bounded, i.e., $\|g_t\| \le \sigma$ for all $t$.

3 Weighted AdaGrad with Unified Momentum

We describe the two main ingredients of AdaUSM: the unified stochastic momentum formulation of SHB and SNAG (see Subsection 3.1), and the weighted adaptive learning rate (see Subsection 3.2).

3.1 Unified Stochastic Momentum (USM)

By introducing $m_t := y_{t+1} - y_t$ with $m_0 = 0$, the iterate of SNAG can be equivalently written as

$m_t = \mu\, m_{t-1} - \eta_t\, g_t, \qquad x_{t+1} = x_t + m_t + \mu\,(m_t - m_{t-1}).$

Comparing SHB with the above form of SNAG, the difference lies in that SNAG puts more weight on the current momentum $m_t$. Hence, we can rewrite SHB and SNAG in the following unified form:

$m_t = \mu\, m_{t-1} - \eta_t\, g_t, \qquad x_{t+1} = x_t + m_t + \lambda\mu\,(m_t - m_{t-1}),$   (5)

where $\lambda \ge 0$ is a constant. When $\lambda = 0$, it is SHB; when $\lambda = 1$, it is SNAG. We call $\lambda$ the interpolation factor. For any $\mu \in [0, 1)$, $\lambda$ can be chosen from a range that contains $[0, 1]$.
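As a sanity check on the unified form in Eq. (5) as reconstructed here, the following sketch runs USM with $\lambda = 0$ and $\lambda = 1$ alongside explicit SHB and SNAG iterations and confirms that the trajectories coincide; the quadratic objective and the constant step size are illustrative assumptions.

```python
import numpy as np

A = np.diag([2.0, 0.5])
grad = lambda x: A @ x
eta, mu, T = 0.1, 0.9, 50
x0 = np.array([1.0, -1.0])

def usm(lam):
    # USM (Eq. 5): m_t = mu*m_{t-1} - eta*g_t;  x_{t+1} = x_t + m_t + lam*mu*(m_t - m_{t-1})
    x, m_prev = x0.copy(), np.zeros(2)
    for _ in range(T):
        m = mu * m_prev - eta * grad(x)
        x, m_prev = x + m + lam * mu * (m - m_prev), m
    return x

def shb():
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(T):
        x, x_prev = x - eta * grad(x) + mu * (x - x_prev), x
    return x

def snag():
    x, y_prev = x0.copy(), x0.copy()
    for _ in range(T):
        y = x - eta * grad(x)
        x, y_prev = y + mu * (y - y_prev), y
    return x

print(np.allclose(usm(0.0), shb()), np.allclose(usm(1.0), snag()))  # both True
```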

Remark 1.

Yan et al. [35] unified SHB and SNAG as the following three-step iterate scheme:

$y_{t+1} = x_t - \eta\, g_t, \qquad y_{t+1}^{s} = x_t - s\,\eta\, g_t, \qquad x_{t+1} = y_{t+1} + \beta\,(y_{t+1}^{s} - y_t^{s}),$   (6)

where $s \ge 0$ is a parameter: $s = 0$ recovers SHB and $s = 1$ recovers SNAG. Its convergence rate has been established for constant learning rates. Notably, USM is slightly simpler than Eq. (6), and the learning rate in USM is determined adaptively.

3.2 Weighted Adaptive Learning Rate

We generalize the learning rate in Eq. (3) by assigning different weights to the accumulated past stochastic gradients. It is defined as follows:

$\eta_t = \dfrac{\alpha_t}{\sqrt{\epsilon + \sum_{j=1}^{t} a_j\, g_j \odot g_j}},$   (7)

for $t \ge 1$, where the weights $a_j \ge 0$ and $\epsilon > 0$. Here, $\alpha_t$ can be understood as the base learning rate. The classical AdaGrad corresponds to taking $a_j = 1$ for all $j$ in Eq. (7), i.e., uniform weights. However, recent gradients tend to carry more information about the local geometry than remote ones. Hence, it is natural to assign more weight to the recent gradients. A typical choice for such weights is $a_j = j^r$ for some $r \ge 0$, which grows at a polynomial rate. For instance, in AccAdaGrad [21] the weights are constant for the first few iterations and then grow linearly in $j$.
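The following sketch computes the weighted adaptive learning rate of Eq. (7) as reconstructed above for uniform weights (classical AdaGrad) and for polynomially growing weights $a_j = j^2$; the random gradients and the exponent are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, eps, alpha = 100, 4, 1e-8, 0.1
grads = rng.normal(size=(T, d))          # stand-ins for stochastic gradients g_1,...,g_T

def weighted_lr(weights):
    """Return eta_T = alpha / sqrt(eps + sum_j a_j * g_j^2), coordinate-wise (Eq. 7)."""
    v = np.zeros(d)
    for a_j, g in zip(weights, grads):
        v += a_j * g * g
    return alpha / np.sqrt(eps + v)

uniform = weighted_lr(np.ones(T))                    # classical AdaGrad: a_j = 1
poly = weighted_lr(np.arange(1, T + 1) ** 2.0)       # polynomial weights: a_j = j^2
print("uniform-weight lr:", uniform)
print("polynomial-weight lr:", poly)
```

Note that the absolute scale of the resulting step sizes also depends on the base learning rate $\alpha_t$; the point of the comparison is only that larger weights emphasize the most recent gradients.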

3.3 AdaUSM: Weighted AdaGrad with USM

In this subsection, we present the AdaUSM algorithm, which effectively integrates the weighted adaptive learning rate in Eq. (7) with the USM technique in Eq. (5), and establish its convergence rate.

1:  Parameters: Choose $x_1 \in \mathbb{R}^d$, base learning rates $\{\alpha_t\}$, momentum factor $\mu \in [0, 1)$, interpolation factor $\lambda \ge 0$, initial accumulator value $\epsilon > 0$, and weights $\{a_t\}$. Set $m_0 = 0$ and $v_0 = 0$.
2:  for $t = 1, 2, \ldots, T$ do
3:     Sample a stochastic gradient $g_t$;
4:     for $i = 1, 2, \ldots, d$ do
5:        $v_t^i = v_{t-1}^i + a_t\,(g_t^i)^2$;
6:        $\eta_t^i = \alpha_t / \sqrt{\epsilon + v_t^i}$;
7:        $m_t^i = \mu\, m_{t-1}^i - \eta_t^i\, g_t^i$;
8:        $x_{t+1}^i = x_t^i + m_t^i + \lambda\mu\,(m_t^i - m_{t-1}^i)$;
9:     end for
10:  end for
Algorithm 1  AdaUSM: Weighted AdaGrad with USM
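Since the listing above is compact, here is a minimal NumPy sketch of one possible AdaUSM implementation obtained by combining the weighted accumulation of Eq. (7) with the unified momentum of Eq. (5); the weight schedule $a_t = t$, the base-learning-rate schedule, the toy objective, and all hyperparameter values are illustrative assumptions rather than the authors' reference code. Setting lam=0 or lam=1 yields the AdaHB and AdaNAG variants discussed next.

```python
import numpy as np

def adausm(grad, x0, T, alpha=0.1, mu=0.9, lam=1.0, eps=1e-8, weight=lambda t: float(t)):
    """Sketch of weighted AdaGrad with unified momentum (Algorithm 1).

    grad(x, t) returns a (possibly stochastic) gradient estimate g_t.
    lam = 0 gives AdaHB (heavy ball), lam = 1 gives AdaNAG (Nesterov).
    """
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)           # weighted accumulator sum_j a_j * g_j^2
    m_prev = np.zeros_like(x)      # momentum buffer m_{t-1}
    for t in range(1, T + 1):
        g = grad(x, t)
        v += weight(t) * g * g                         # weighted AdaGrad accumulation (Eq. 7)
        eta = (alpha / np.sqrt(t)) / np.sqrt(eps + v)  # illustrative base lr alpha_t = alpha/sqrt(t)
        m = mu * m_prev - eta * g                      # unified momentum (Eq. 5)
        x = x + m + lam * mu * (m - m_prev)
        m_prev = m
    return x

# Toy usage: noisy gradients of f(x) = 0.5 * ||x||^2.
rng = np.random.default_rng(0)
noisy_grad = lambda x, t: x + 0.1 * rng.normal(size=x.shape)
print(adausm(noisy_grad, np.ones(5), T=500))
```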

Note that AdaUSM extends AccAdaGrad in [21] by using more general weight parameters and momentum acceleration mechanisms in the non-convex stochastic setting. In addition, when $\lambda = 0$, AdaUSM reduces to weighted AdaGrad with heavy ball momentum, which we denote by AdaHB for short. When $\lambda = 1$, AdaUSM reduces to weighted AdaGrad with Nesterov accelerated gradient momentum, which we denote by AdaNAG for short. The detailed iterate schemes of AdaHB and AdaNAG are given in the supplementary material due to space limits.

Theorem 1.

Let $\{x_t\}$ be the sequence generated by AdaUSM. Assume that the noisy gradients satisfy assumptions (A1)-(A2) and that the sequence of weights $\{a_t\}$ is non-decreasing in $t$. Let $\tau$ be uniformly randomly drawn from $\{1, \ldots, T\}$. Then

(8)

where , and constants and are defined as and , respectively.

Sketch of proof.

Our starting point is the following inequality, which follows from the Lipschitz continuity of $\nabla f$ and the descent lemma in [25]:

$f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t),\, x_{t+1} - x_t\rangle + \dfrac{L}{2}\,\|x_{t+1} - x_t\|^2.$

The key point is to estimate the inner-product term $\langle \nabla f(x_t),\, x_{t+1} - x_t\rangle$, which involves the momentum. However, the increment $x_{t+1} - x_t$ is far from an unbiased estimate of the (scaled) true gradient. This difficulty is resolved in Lemma 3, where we establish an estimate for this term via iteration.

Furthermore, we can bound each momentum term $m_t$ in terms of the scaled gradients $\eta_j \odot g_j$ (by Lemma 2). For the term $\langle \nabla f(x_t),\, \eta_t \odot g_t\rangle$, taking expectation is tricky, as $\eta_t$ is correlated with $g_t$. We follow the idea of [33] and consider instead the term $\langle \nabla f(x_t),\, \tilde\eta_t \odot g_t\rangle$, where $\tilde\eta_t$ is an approximation of $\eta_t$ that is independent of $g_t$. Hence its conditional expectation gives rise to the desired term $\|\nabla f(x_t)\|^2_{\tilde\eta_t}$. With a suitable choice of $\tilde\eta_t$, in Lemma 4 we establish the following estimate

Summarizing the estimates above leads to the estimate in Lemma 7:

With the specific adaptive learning rate in Eq. (7), we can further show that the principal term is suitably bounded (by Lemma 5), while the remaining term is controlled via the Hölder inequality (by Lemma 6). This immediately yields the theorem. ∎

Remark 2.

When we take $a_t = t^r$ for some constant power $r \ge 0$, the weights are non-decreasing and the conditions of Theorem 1 are satisfied. Hence, AdaUSM with such weights has an $\mathcal{O}(\log T/\sqrt{T})$ convergence rate. In fact, AdaUSM remains convergent for an even broader class of weights.

When the interpolation factor $\lambda = 1$ (i.e., NAG momentum) and the weights are chosen as in [21], AdaUSM reduces to coordinate-wise AccAdaGrad [20]. In this case, the constants in Theorem 1 specialize accordingly. Thus, we have the following corollary for the convergence rate of AccAdaGrad in the non-convex stochastic setting.

Corollary 1.

Assume the same setting as in Theorem 1. Let $\tau$ be randomly selected from $\{1, \ldots, T\}$ with equal probability $1/T$. Then

(9)

where , and constants and are defined as and , respectively.

Remark 3.

The non-asymptotic convergence rate measured by the objective value for AccAdaGrad has already been established in [21] in the convex stochastic setting. Corollary 1 provides the convergence rate of coordinate-wise AccAdaGrad measured by the gradient residual, which supplements the results in [21] in the non-convex stochastic setting.

4 Relationships with Adam and RMSProp

In this section, we show that the exponential moving average (EMA) technique used to estimate the adaptive learning rates in Adam [15] and RMSProp [12] is a special case of the weighted adaptive learning rate in Eq. (7): their adaptive learning rates correspond to taking exponentially growing weights in AdaUSM, which thereby provides a new angle for understanding Adam and RMSProp.

4.1 Adam

For better comparison, we first present the $t$-th iterate of Adam [15] as follows:

$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t \odot g_t,$
$\hat m_t = m_t/(1-\beta_1^t), \qquad \hat v_t = v_t/(1-\beta_2^t), \qquad x_{t+1} = x_t - \alpha\, \hat m_t \big/ \big(\sqrt{\hat v_t} + \epsilon\big),$

for $t \ge 1$, where $\beta_1, \beta_2 \in [0, 1)$ are constants and $\epsilon$ is a sufficiently small constant. Absorbing the bias corrections into the step size, the iterations of Adam can be simplified accordingly.

Below, we show that AdaUSM and Adam differ in two aspects: the momentum estimation and the coordinate-wise adaptive learning rate.

Momentum estimation.  The EMA technique is widely used in the momentum estimation step of Adam [15] and AMSGrad [29]. Without loss of generality, we consider the simplified EMA step

$m_t = \beta\, m_{t-1} + (1-\beta)\, g_t, \qquad x_{t+1} = x_t - \eta_t \odot m_t.$   (10)

To show the difference clearly, we compare only the HB momentum with the EMA momentum. Let $d_t := x_{t+1} - x_t$. By the first equality in Eq. (10), we have

$d_t = -\eta_t \odot m_t = \beta\,\dfrac{\eta_t}{\eta_{t-1}} \odot d_{t-1} - (1-\beta)\,\eta_t \odot g_t.$

Comparing with HB, EMA has an extra error term (coming from the ratio $\eta_t/\eta_{t-1}$ of consecutive step sizes), which vanishes if $\eta_t = \eta_{t-1}$ for all $t$, i.e., the step size is taken constant. (Since the learning rates in AdaUSM and Adam are determined adaptively, we do not have $\eta_t = \eta_{t-1}$.) In addition, EMA takes a much smaller effective step on the gradient if the momentum factor $\beta$ is close to $1$, since the gradient is scaled by $1-\beta$. More precisely, if we write the iterate in terms of the stochastic gradients and eliminate $m_t$, we obtain

EMA: $x_{t+1} = x_t - (1-\beta)\,\eta_t \odot \sum_{j=1}^{t} \beta^{\,t-j}\, g_j$;   HB: $x_{t+1} = x_t - \sum_{j=1}^{t} \mu^{\,t-j}\, \eta_j \odot g_j.$

One can see that HB (and hence AdaUSM) uses the past step sizes $\eta_j$, whereas EMA uses only the current one $\eta_t$ in the exponential moving average. Moreover, when the momentum factor $\beta$ is very close to $1$, the update of $x_t$ via EMA could stagnate since $1-\beta \approx 0$. This dilemma does not appear in AdaUSM.
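The scaling difference discussed above can be observed numerically: with a constant step size, EMA-style momentum as in Eq. (10) moves roughly $(1-\beta)$ times as far as heavy ball with the same momentum factor. The gradients below are arbitrary and serve only to illustrate this point.

```python
import numpy as np

rng = np.random.default_rng(1)
T, eta, beta = 20, 0.1, 0.9
grads = rng.normal(size=T)

# Heavy ball: x_{t+1} - x_t = beta*(x_t - x_{t-1}) - eta*g_t
x_hb, disp = 0.0, 0.0
for g in grads:
    disp = beta * disp - eta * g
    x_hb += disp

# EMA momentum (Eq. 10): m_t = beta*m_{t-1} + (1-beta)*g_t;  x_{t+1} = x_t - eta*m_t
x_ema, m = 0.0, 0.0
for g in grads:
    m = beta * m + (1 - beta) * g
    x_ema -= eta * m

print("total HB displacement :", x_hb)
print("total EMA displacement:", x_ema)   # roughly (1 - beta) times the HB displacement
```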

Adaptive learning rate.  Note that $v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t \odot g_t$. We have $v_t = (1-\beta_2)\sum_{j=1}^{t}\beta_2^{\,t-j}\, g_j \odot g_j + \beta_2^{\,t}\, v_0$. Without loss of generality, we set $v_0 = 0$. Hence, it holds that

$\hat v_t = \dfrac{v_t}{1-\beta_2^{\,t}} = \dfrac{(1-\beta_2)\sum_{j=1}^{t}\beta_2^{\,t-j}\, g_j \odot g_j}{1-\beta_2^{\,t}}.$

Then, the adaptive learning rate in Adam can be rewritten as

$\eta_t = \dfrac{\alpha}{\sqrt{\hat v_t} + \epsilon} = \dfrac{\alpha}{\sqrt{(1-\beta_2)\sum_{j=1}^{t}\beta_2^{\,t-j}\, g_j \odot g_j \,/\,(1-\beta_2^{\,t})} + \epsilon}.$   (11)

Let $a_j = \beta_2^{-j}$. Note that $(1-\beta_2)\sum_{k=1}^{t} a_k = \beta_2^{-t}(1-\beta_2^{\,t})$, so that $(1-\beta_2)\,\beta_2^{\,t-j}/(1-\beta_2^{\,t}) = a_j \big/ \sum_{k=1}^{t} a_k$. Hence, Eq. (11) can be further reformulated as

$\eta_t = \dfrac{\alpha\,\sqrt{\sum_{k=1}^{t} a_k}}{\sqrt{\sum_{j=1}^{t} a_j\, g_j \odot g_j} \;+\; \epsilon\,\sqrt{\sum_{k=1}^{t} a_k}}.$   (12)

For comparison, the adaptive learning rates of Adam and AdaUSM are summarized as follows:

Adam: $\eta_t = \dfrac{\alpha\,\sqrt{\sum_{k=1}^{t} a_k}}{\sqrt{\sum_{j=1}^{t} a_j\, g_j \odot g_j} + \epsilon\,\sqrt{\sum_{k=1}^{t} a_k}}$ with $a_j = \beta_2^{-j}$;   AdaUSM: $\eta_t = \dfrac{\alpha_t}{\sqrt{\epsilon + \sum_{j=1}^{t} a_j\, g_j \odot g_j}}.$

Hence, the adaptive learning rate in Adam is essentially equivalent to that in AdaUSM by specifying the exponentially growing weights $a_j = \beta_2^{-j}$ (and absorbing $\sqrt{\sum_{k=1}^{t} a_k}$ into the base learning rate $\alpha_t$), provided that $\epsilon$ is sufficiently small. For the typical parameter setting of Adam, e.g., $\beta_2 = 0.999$, these weights grow exponentially in $j$. Thus, we gain insight into the convergence of Adam from the convergence results of AdaUSM in Theorem 1.
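The correspondence derived above can be checked numerically: Adam's bias-corrected second moment $\hat v_t$ equals the weighted average of squared gradients with weights $a_j = \beta_2^{-j}$. The snippet below verifies this identity on random data (the gradients are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, beta2 = 50, 3, 0.999
grads = rng.normal(size=(T, d))

# Adam's second-moment estimate with bias correction
v = np.zeros(d)
for g in grads:
    v = beta2 * v + (1 - beta2) * g * g
v_hat = v / (1 - beta2 ** T)

# Weighted-AdaGrad view: exponentially growing weights a_j = beta2^(-j)
a = beta2 ** (-np.arange(1, T + 1))
weighted_avg = (a[:, None] * grads ** 2).sum(axis=0) / a.sum()

print(np.allclose(v_hat, weighted_avg))   # True: v_hat is the a_j-weighted average
```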

Remark 4.

Recently, Chen et al. [3] have also proposed AdaGrad with exponential moving average (AdaEMA) by combining the AdaGrad accumulation with EMA momentum and removing the bias-correction steps in Adam. Its convergence rate in the non-convex stochastic setting has been established under a slightly stronger assumption, namely that the stochastic gradient estimate is uniformly bounded. Compared with AdaEMA, AdaUSM not only adopts a general weight sequence in estimating the adaptive learning rate, but also uses a different unified momentum that covers HB and NAG as special instances. The advantage of HB and NAG momentum over EMA has been pointed out in the paragraph on momentum estimation above. In Section 5, we also experimentally demonstrate the effectiveness of AdaUSM against AdaEMA.

4.2 RMSProp

Coordinate-wise RMSProp is another efficient solver for training DNNs [12, 24], which is defined as

$v_t = \beta\, v_{t-1} + (1-\beta)\, g_t \odot g_t, \qquad x_{t+1} = x_t - \dfrac{\alpha}{\sqrt{v_t} + \epsilon} \odot g_t.$

Define $a_j = \beta^{-j}$. The adaptive learning rate of RMSProp, denoted by $\eta_t^{\rm RMSProp}$, can be rewritten as

$\eta_t^{\rm RMSProp} = \dfrac{\alpha}{\sqrt{\beta^{\,t}(1-\beta)\sum_{j=1}^{t} a_j\, g_j \odot g_j} + \epsilon} = \dfrac{\alpha\,\big(\beta^{\,t}(1-\beta)\big)^{-1/2}}{\sqrt{\sum_{j=1}^{t} a_j\, g_j \odot g_j} + \epsilon\,\big(\beta^{\,t}(1-\beta)\big)^{-1/2}}.$

When $\epsilon$ is a sufficiently small constant, $\eta_t^{\rm RMSProp}$ has a structure similar to the weighted adaptive learning rate in Eq. (7) with exponentially growing weights $a_j = \beta^{-j}$, once $t$ is sufficiently large. Based on the above analysis, AdaUSM can be interpreted as a generalized RMSProp with HB and NAG momentums.
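A similar check for RMSProp (which has no bias correction): its accumulator $v_t$ equals, up to the scalar factor $\beta^{t}(1-\beta)$, the weighted sum $\sum_j a_j\, g_j \odot g_j$ with $a_j = \beta^{-j}$. Again, the gradients below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d, beta = 40, 3, 0.99
grads = rng.normal(size=(T, d))

v = np.zeros(d)
for g in grads:                        # RMSProp accumulator: v_t = beta*v_{t-1} + (1-beta)*g_t^2
    v = beta * v + (1 - beta) * g * g

a = beta ** (-np.arange(1, T + 1))     # exponentially growing weights a_j = beta^(-j)
weighted_sum = (a[:, None] * grads ** 2).sum(axis=0)

print(np.allclose(v, beta ** T * (1 - beta) * weighted_sum))   # True
```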

5 Experiments

In this section, we conduct experiments to validate the efficacy and theory of AdaHB (AdaUSM with $\lambda = 0$) and AdaNAG (AdaUSM with $\lambda = 1$) by applying them to train DNNs (code based on https://github.com/kuangliu/pytorch-cifar), including LeNet [19], GoogLeNet [32], ResNet [11], and DenseNet [13], on various datasets including MNIST [19], CIFAR10/100 [16], and Tiny-ImageNet [5]. The efficacy of AdaHB and AdaNAG is evaluated via the training loss, test accuracy, and test loss vs. epochs, respectively. In the experiments, we fix the batch size and the weight decay parameter across all runs.

Optimizers.   We compare AdaHB/AdaNAG with five competing algorithms: SGD with momentum (SGDm) [31], AdaGrad [6, 33], AdaEMA [3], AMSGrad [29, 3], and Adam [15]. The parameter settings of all compared optimizers are summarized in Table 1 (see Section D in the supplementary material) and are consistent with [15, 29, 3]. To match the convergence theory, we take a diminishing base learning rate (e.g., of order $\alpha/\sqrt{t}$) uniformly across all the tested adaptive optimizers. Moreover, as discussed in the momentum estimation paragraph of Section 4.1, the effective learning rates in AdaHB and AdaNAG are roughly $1/(1-\beta_1)$ times greater than those in AdaEMA, AMSGrad, and Adam when they share the same constant parameter. In addition, a base learning rate that is too large or too small would lead to heavy oscillation or stagnation of the training loss, respectively, which would deteriorate the performance of the tested optimizers. Consequently, the base learning rate for each solver is chosen via grid search, and we report the base learning rate of each solver that consistently yields the best performance.
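For concreteness, a minimal sketch of how the base-learning-rate grid search described above could be organized; the candidate grid, the $\alpha/\sqrt{t}$ schedule, and the train_and_eval stub are hypothetical placeholders rather than the paper's actual experimental configuration.

```python
import math

def train_and_eval(optimizer_name, base_lr, epochs=5):
    """Hypothetical stub: would train with base learning rate alpha_t = base_lr / sqrt(t)
    at epoch t and return the final test accuracy. Replace with a real training loop."""
    schedule = [base_lr / math.sqrt(t) for t in range(1, epochs + 1)]  # diminishing base lr
    _ = schedule                                                       # placeholder only
    return 0.0

candidate_lrs = [1e-3, 3e-3, 1e-2, 3e-2, 1e-1]   # illustrative grid, not the paper's
best = {}
for name in ["AdaHB", "AdaNAG", "SGDm", "AdaGrad", "AdaEMA", "AMSGrad", "Adam"]:
    scores = {lr: train_and_eval(name, lr) for lr in candidate_lrs}
    best[name] = max(scores, key=scores.get)
print(best)
```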

Figures 1-4 illustrate the performance profiles of applying AdaHB, AdaNAG, SGDm, AdaGrad, AdaEMA, AMSGrad, and Adam to train LeNet on MNIST, GoogLeNet on CIFAR10, DenseNet on CIFAR100, and ResNet on Tiny-ImageNet, respectively. More specifically, Figure 1 illustrates the performance of LeNet on MNIST, which contains 60,000 training examples and 10,000 test examples. It can be seen that AdaHB decreases the training loss and test loss fastest among the seven tested optimizers, which simultaneously yields a higher test accuracy than the other tested optimizers. The performances of AdaEMA and AdaGrad are worse than AdaHB but better than SGDm, Adam, and AMSGrad. Figure 2 illustrates the performance of training GoogLeNet on CIFAR10, which contains 50,000 training examples and 10,000 test examples. It can be seen that SGDm decreases the training loss slightly faster than the other optimizers, followed by AdaHB and AdaNAG, and that AMSGrad and Adam are the slowest optimizers. The test accuracies of AdaHB, AdaNAG, AdaGrad, and AdaEMA are comparable, all slightly better than SGDm, and outperform Adam and AMSGrad. Figure 3 illustrates the performance of training DenseNet on CIFAR100, which contains 50,000 training examples and 10,000 test examples. It shows that SGDm has the worst training process and test accuracy, followed by Adam and AMSGrad, while AdaGrad, AdaEMA, AdaHB, and AdaNAG decrease the training loss and increase the test accuracy at roughly the same speed. Figure 4 illustrates the performance of training ResNet on Tiny-ImageNet, which contains 100,000 training examples and 10,000 test examples. It can be seen that AdaHB and AdaNAG show the fastest speed in decreasing the training loss and increasing the test accuracy. SGDm is worse than AdaHB and AdaNAG, and better than AdaGrad, AdaEMA, and AMSGrad in terms of the training loss and test accuracy.

Figure 1: Performance profiles of various optimizers for training LeNet on MNIST.
Figure 2: Performance profiles of various optimizers for training GoogLeNet on CIFAR10.
Figure 3: Performance profiles of various optimizers for training DenseNet on CIFAR100.
Figure 4: Performance profiles of various optimizers for training ResNet on Tiny-ImageNet.

In summary, AdaHB and AdaNAG are more efficient and robust than SGDm, AdaGrad, AdaEMA, Adam, and AMSGrad in terms of both training speed and generalization. SGDm is also an efficient optimizer, but it is highly sensitive to the hand-tuned learning rate. Moreover, the benefit of EMA is marginal, and it is not as effective as the heavy ball and Nesterov accelerated gradient momentums, as revealed by the performance curves of AdaEMA, AdaGrad, AdaHB, and AdaNAG.

6 Conclusions

We integrated a novel weighted coordinate-wise AdaGrad with a unified momentum covering both heavy ball and Nesterov accelerated gradient momentums, yielding a new adaptive stochastic algorithm called AdaUSM. Its convergence rate was established in the non-convex stochastic setting. Our work largely extends the convergence analysis of accelerated AdaGrad in [21] to the general non-convex stochastic setting. Moreover, we pointed out that the adaptive learning rates of Adam and RMSProp are essentially special cases of the weighted adaptive learning rate in AdaUSM, which provides a new angle for understanding the convergence of Adam/RMSProp. We also experimentally verified the efficacy of AdaUSM in training deep learning models on several datasets. The promising results show that the proposed AdaUSM is more effective and robust than SGD with momentum, AdaGrad, AdaEMA, AMSGrad, and Adam in terms of training loss and test accuracy vs. epochs.

References

  • [1] L. Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17(9):142, 1998.
  • [2] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
  • [3] X. Chen, S. Liu, R. Sun, and M. Hong. On the convergence of a class of adam-type algorithms for non-convex optimization. In International Conference on Learning Representations, 2019.
  • [4] A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
  • [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
  • [6] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • [7] E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson. Global convergence of the heavy-ball method for convex optimization. In Control Conference (ECC), 2015 European, pages 310–315. IEEE, 2015.
  • [8] S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • [9] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
  • [10] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [12] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent. Lecture slides, 2012.
  • [13] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269. IEEE, 2017.
  • [14] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • [16] A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [18] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. nature, 521(7553):436, 2015.
  • [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [20] K. Levy. Online to offline conversions, universality and adaptive minibatch sizes. In Advances in Neural Information Processing Systems, pages 1613–1622, 2017.
  • [21] Y. K. Levy, A. Yurtsever, and V. Cevher. Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems, pages 6500–6509, 2018.
  • [22] X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In International Conference on Artificial Intelligence and Statistics, pages 983–992, 2019.
  • [23] H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. COLT 2010, page 244, 2010.
  • [24] M. C. Mukkamala and M. Hein. Variants of rmsprop and adagrad with logarithmic regret bounds. In International Conference on Machine Learning, pages 2545–2553, 2017.
  • [25] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
  • [26] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
  • [27] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2613–2621. JMLR. org, 2017.
  • [28] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • [29] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
  • [30] H. Robbins and S. Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109. Springer, 1985.
  • [31] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
  • [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [33] R. Ward, X. Wu, and L. Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
  • [34] X. Wu, R. Ward, and L. Bottou. Wngrad: Learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865, 2018.
  • [35] Y. Yan, T. Yang, Z. Li, Q. Lin, and Y. Yang. A unified analysis of stochastic momentum methods for deep learning. In IJCAI, pages 2955–2961, 2018.
  • [36] F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu. A sufficient condition for convergences of adam and rmsprop. arXiv preprint arXiv:1811.09358, 2018.

 

Supplementary Material for

“Weighted AdaGrad with Unified Momentum”

In this supplementary material we give a complete proof of Theorem 1. The material is organized as follows. In Section A, we present the detailed iteration schemes of AdaHB and AdaNAG for readability. In Section B, we provide preliminary lemmas that will be used to establish Theorem 1. In Section C, we give the detailed proof of Theorem 1. In Section D, we provide the parameter settings of the compared optimizers, including AdaHB, AdaNAG, AdaGrad, SGDm, AdaEMA, AMSGrad, and Adam.

Appendix A AdaHB and AdaNAG

  Parameters: Choose $x_1 \in \mathbb{R}^d$, base learning rates $\{\alpha_t\}$, momentum factor $\mu \in [0, 1)$, initial accumulator value $\epsilon > 0$, and weights $\{a_t\}$. Set $m_0 = 0$ and $v_0 = 0$.
  for $t = 1, 2, \ldots, T$ do
     Sample a stochastic gradient $g_t$;
     for $i = 1, 2, \ldots, d$ do
        $v_t^i = v_{t-1}^i + a_t\,(g_t^i)^2$;
        $\eta_t^i = \alpha_t / \sqrt{\epsilon + v_t^i}$;
        $m_t^i = \mu\, m_{t-1}^i - \eta_t^i\, g_t^i$;
        $x_{t+1}^i = x_t^i + m_t^i$;
     end for
  end for
Algorithm 2  AdaHB: AdaGrad with HB
  Parameters: Choose $x_1 \in \mathbb{R}^d$, base learning rates $\{\alpha_t\}$, momentum factor $\mu \in [0, 1)$, initial accumulator value $\epsilon > 0$, and weights $\{a_t\}$. Set $m_0 = 0$ and $v_0 = 0$.
  for $t = 1, 2, \ldots, T$ do
     Sample a stochastic gradient $g_t$;
     for $i = 1, 2, \ldots, d$ do
        $v_t^i = v_{t-1}^i + a_t\,(g_t^i)^2$;
        $\eta_t^i = \alpha_t / \sqrt{\epsilon + v_t^i}$;
        $m_t^i = \mu\, m_{t-1}^i - \eta_t^i\, g_t^i$;
        $x_{t+1}^i = x_t^i + m_t^i + \mu\,(m_t^i - m_{t-1}^i)$;
     end for
  end for
Algorithm 3  AdaNAG: AdaGrad with NAG

Appendix B Preliminary Lemmas

In this section we provide preliminary lemmas that will be used to prove our main theorem. The reader may skip this part on a first reading and come back whenever the lemmas are needed.

Lemma 1.

Let $B_t = B_0 + \sum_{j=1}^{t} b_j$, where $\{b_j\}$ is a non-negative sequence and $B_0 > 0$. We have

$\sum_{t=1}^{T} \dfrac{b_t}{B_t} \le \log\dfrac{B_T}{B_0}.$

Proof.

The finite sum can be interpreted as a Riemann sum as follows:

$\sum_{t=1}^{T} \dfrac{b_t}{B_t} = \sum_{t=1}^{T} \dfrac{B_t - B_{t-1}}{B_t}.$

Since $1/x$ is decreasing on the interval $[B_{t-1}, B_t]$, we have

$\dfrac{B_t - B_{t-1}}{B_t} \le \int_{B_{t-1}}^{B_t} \dfrac{dx}{x} = \log\dfrac{B_t}{B_{t-1}}, \qquad \text{and hence} \qquad \sum_{t=1}^{T} \dfrac{b_t}{B_t} \le \log\dfrac{B_T}{B_0}.$

The proof is finished. ∎
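A quick numerical check of the reconstructed bound in Lemma 1, $\sum_{t} b_t/B_t \le \log(B_T/B_0)$, on a random non-negative sequence (the sequence is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(4)
b = rng.uniform(0.0, 1.0, size=1000)   # non-negative sequence b_1,...,b_T
B0 = 0.1
B = B0 + np.cumsum(b)                  # B_t = B_0 + sum_{j<=t} b_j

lhs = float(np.sum(b / B))
rhs = float(np.log(B[-1] / B0))
print(lhs <= rhs, lhs, rhs)            # the Riemann-sum bound of Lemma 1 holds
```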

The following lemma is a direct result of the momentum updating rule.

Lemma 2.

Suppose $m_t = \mu\, m_{t-1} - \eta_t \odot g_t$ with $m_0 = 0$ and $\mu \in [0, 1)$. We have the following estimate

$\sum_{t=1}^{T} \|m_t\|^2 \le \dfrac{1}{(1-\mu)^2} \sum_{t=1}^{T} \|\eta_t \odot g_t\|^2.$   (13)
Proof.

First, we have the following inequality due to the convexity of $\|\cdot\|^2$:

$\|m_t\|^2 = \Big\|\mu\, m_{t-1} + (1-\mu)\,\dfrac{-\eta_t \odot g_t}{1-\mu}\Big\|^2 \le \mu\,\|m_{t-1}\|^2 + \dfrac{1}{1-\mu}\,\|\eta_t \odot g_t\|^2.$   (14)

Taking the sum of Eq. (14) from $t = 1$ to $T$ and using $m_0 = 0$, we have that

$\sum_{t=1}^{T} \|m_t\|^2 \le \mu \sum_{t=1}^{T} \|m_t\|^2 + \dfrac{1}{1-\mu} \sum_{t=1}^{T} \|\eta_t \odot g_t\|^2.$   (15)

Hence,

$\sum_{t=1}^{T} \|m_t\|^2 \le \dfrac{1}{(1-\mu)^2} \sum_{t=1}^{T} \|\eta_t \odot g_t\|^2.$   (16)

The proof is finished. ∎
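A numerical sanity check of the estimate in Lemma 2 as reconstructed above, $\sum_t \|m_t\|^2 \le (1-\mu)^{-2} \sum_t \|\eta_t \odot g_t\|^2$, with arbitrary step sizes and gradients.

```python
import numpy as np

rng = np.random.default_rng(5)
T, d, mu = 200, 4, 0.9
etas = rng.uniform(0.01, 0.1, size=(T, d))
grads = rng.normal(size=(T, d))

m = np.zeros(d)
lhs = 0.0
for eta, g in zip(etas, grads):
    m = mu * m - eta * g              # m_t = mu*m_{t-1} - eta_t (.) g_t
    lhs += float(m @ m)

rhs = float(np.sum((etas * grads) ** 2)) / (1 - mu) ** 2
print(lhs <= rhs, lhs, rhs)           # the bound in Eq. (13) holds
```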

The following lemma is a result of the USM formulation for any general adaptive learning rate.

Lemma 3.

Let $\{x_t\}$ and $\{m_t\}$ be sequences generated by the following general SGD with USM momentum: starting from initial values $x_1$ and $m_0 = 0$, and being updated through

$m_t = \mu\, m_{t-1} - \eta_t \odot g_t, \qquad x_{t+1} = x_t + m_t + \lambda\mu\,(m_t - m_{t-1}),$

where the momentum factor $\mu$ and the interpolation factor $\lambda$ satisfy $\mu \in [0, 1)$ and $\lambda \ge 0$, respectively. Suppose that the function $f$ is $L$-smooth. Then for any $t$ we have the following estimate

(17)

In particular, the following estimate holds

(18)
Proof.

Since $x_{t+1} - x_t = m_t + \lambda\mu\,(m_t - m_{t-1})$, we have

(19)

Note that by the $L$-smoothness of the function $f$, we have that

(20)

Hence, by the Cauchy–Schwarz inequality and Eq. (20), we have that