## Abstract


## 1 Introduction

To solve the problems above, we propose "AdaBelief", which can be easily modified from Adam. Denote the observed gradient at step $t$ as $g_t$ and its exponential moving average (EMA) as $m_t$. Denote the EMA of $g_t^2$ and $(g_t - m_t)^2$ as $v_t$ and $s_t$, respectively. $m_t$ is divided by $\sqrt{v_t}$ in Adam, while it is divided by $\sqrt{s_t}$ in AdaBelief. Intuitively, $1/\sqrt{s_t}$ is the "belief" in the observation: viewing $m_t$ as the prediction of the gradient, if $g_t$ deviates much from $m_t$, we have weak belief in $g_t$ and take a small step; if $g_t$ is close to the prediction $m_t$, we have strong belief in $g_t$ and take a large step. We validate the performance of AdaBelief with extensive experiments. Our contributions can be summarized as:

• We propose AdaBelief, which can be easily modified from Adam without extra parameters. AdaBelief has three properties: (1) fast convergence as in adaptive gradient methods, (2) good generalization as in the SGD family, and (3) training stability in complex settings such as GAN.

• We theoretically analyze the convergence property of AdaBelief in both convex optimization and non-convex stochastic optimization.

• We validate the performance of AdaBelief with extensive experiments: AdaBelief achieves fast convergence as Adam and good generalization as SGD in image classification tasks on CIFAR and ImageNet; AdaBelief outperforms other methods in language modeling; in the training of a W-GAN arjovsky2017wasserstein, compared to a well-tuned Adam optimizer, AdaBelief significantly improves the quality of generated images, while several recent adaptive optimizers fail the training.

## 2 Methods

### 2.1 Details of AdaBelief Optimizer

Notations By the convention in kingma2014adam, we use the following notations:

• $f(\theta) \in \mathbb{R},\ \theta \in \mathbb{R}^d$: $f$ is the loss function to minimize, $\theta$ is the parameter in $\mathbb{R}^d$

• $\prod_{\mathcal{F}}(y) = \arg\min_{x \in \mathcal{F}} \lVert x - y \rVert$: projection of $y$ onto a convex feasible set $\mathcal{F}$

• $g_t$: the gradient at step $t$

• $m_t$: exponential moving average (EMA) of $g_t$

• $v_t, s_t$: $v_t$ is the EMA of $g_t^2$, $s_t$ is the EMA of $(g_t - m_t)^2$

• $\alpha, \epsilon$: $\alpha$ is the learning rate, default is $10^{-3}$; $\epsilon$ is a small number, typically set as $10^{-8}$

• $\beta_1, \beta_2$: smoothing parameters, typical values are $\beta_1 = 0.9$, $\beta_2 = 0.999$

• $\beta_{1t}, \beta_{2t}$: the momentum parameters for $m_t$ and $s_t$ respectively at step $t$, typically set as constants (e.g. $\beta_{1t} = \beta_1$, $\beta_{2t} = \beta_2$ for all $t$)

**Algorithm 1: Adam Optimizer**

• Initialize $\theta_0$, $m_0 \leftarrow 0$, $v_0 \leftarrow 0$, $t \leftarrow 0$
• While $\theta_t$ not converged:
  • $t \leftarrow t + 1$, $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$
  • $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
  • $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
  • Update $\theta_t \leftarrow \prod_{\mathcal{F}, \sqrt{v_t}}\big(\theta_{t-1} - \alpha\, m_t / (\sqrt{v_t} + \epsilon)\big)$

**Algorithm 2: AdaBelief Optimizer**

• Initialize $\theta_0$, $m_0 \leftarrow 0$, $s_0 \leftarrow 0$, $t \leftarrow 0$
• While $\theta_t$ not converged:
  • $t \leftarrow t + 1$, $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$
  • $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
  • $s_t \leftarrow \beta_2 s_{t-1} + (1 - \beta_2)\,(g_t - m_t)^2$
  • Update $\theta_t \leftarrow \prod_{\mathcal{F}, \sqrt{s_t}}\big(\theta_{t-1} - \alpha\, m_t / (\sqrt{s_t} + \epsilon)\big)$

Comparison with Adam Adam and AdaBelief are summarized in Algo. 1 and Algo. 2, where all operations are element-wise; the only difference is the second-moment term in the denominator. Note that no extra parameters are introduced in AdaBelief. For simplicity, we omit the bias correction step. A detailed version of AdaBelief is in Appendix A. Specifically, in Adam, the update direction is $m_t / \sqrt{v_t}$, where $v_t$ is the EMA of $g_t^2$; in AdaBelief, the update direction is $m_t / \sqrt{s_t}$, where $s_t$ is the EMA of $(g_t - m_t)^2$. Intuitively, viewing $m_t$ as the prediction of $g_t$, AdaBelief takes a large step when the observation $g_t$ is close to the prediction $m_t$, and a small step when the observation greatly deviates from the prediction.
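To make the update rule concrete, the following is a minimal numpy sketch of a single bias-corrected AdaBelief step. The function name and interface are our own; adding $\epsilon$ inside the $s_t$ update follows the detailed algorithm in Appendix A.

```python
import numpy as np

def adabelief_step(theta, g, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected AdaBelief step, element-wise on numpy arrays.
    Minimal sketch for illustration, not the authors' reference code."""
    m = beta1 * m + (1 - beta1) * g                   # EMA of the gradient
    s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps  # EMA of (g_t - m_t)^2, eps as in Appendix A
    m_hat = m / (1 - beta1 ** t)                      # bias correction
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s
```

Replacing the `(g - m) ** 2` term with `g ** 2` recovers the Adam update, which is exactly the one-line difference between Algo. 1 and Algo. 2.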

### 2.2 Intuitive explanation for benefits of AdaBelief

Note that we name $\alpha$ the "learning rate" and $\alpha / (\sqrt{s_{t,i}} + \epsilon)$ (resp. $\alpha / (\sqrt{v_{t,i}} + \epsilon)$ for Adam) the "stepsize" for the $i$th parameter. With a 1D example in Fig. 1, we demonstrate that AdaBelief uses the curvature of the loss function to improve training, as summarized in Table 1, with a detailed description below:

(1) In region 1 in Fig. 1, the loss function is flat, hence the gradient is close to 0. In this case, an ideal optimizer should take a large stepsize. The stepsize of SGD is proportional to the EMA of the gradient, hence is small in this case; both Adam and AdaBelief take a large stepsize, because the denominator ($\sqrt{v_t}$ for Adam, $\sqrt{s_t}$ for AdaBelief) is small.

(2) In region 2, the algorithm oscillates in a "steep and narrow" valley, hence both $|g_t|$ and $|g_t - g_{t-1}|$ are large. An ideal optimizer should decrease its stepsize, while SGD takes a large step (proportional to $m_t$). Adam and AdaBelief take a small step because the denominator ($\sqrt{v_t}$ and $\sqrt{s_t}$, respectively) is large.

(3) In region 3, we demonstrate AdaBelief's advantage over Adam in the "large gradient, small curvature" case. In this case, $|g_t|$ and $v_t$ are large, but $|g_t - g_{t-1}|$ and $s_t$ are small; this could happen because of a small learning rate $\alpha$. In this case, an ideal optimizer should increase its stepsize. SGD uses a large stepsize (proportional to $m_t$); in Adam, the denominator $\sqrt{v_t}$ is large, hence the stepsize is small; in AdaBelief, the denominator $\sqrt{s_t}$ is small, hence the stepsize is large, as in an ideal optimizer.

To sum up, AdaBelief scales the update direction by the change in gradient, which is related to the Hessian. Therefore, AdaBelief considers curvature information and performs better than Adam.
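The region-3 intuition is easy to check numerically: with a large, nearly constant gradient, Adam's denominator $\sqrt{v_t}$ stays large while AdaBelief's $\sqrt{s_t}$ shrinks as $m_t$ catches up with $g_t$. The short script below is our own construction, not from the paper:

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
m = v = s = 0.0
for g in np.full(200, 10.0):                  # large gradient, zero curvature
    m = beta1 * m + (1 - beta1) * g           # EMA of g
    v = beta2 * v + (1 - beta2) * g ** 2      # Adam: EMA of g^2 stays ~|g|^2
    s = beta2 * s + (1 - beta2) * (g - m) ** 2  # AdaBelief: (g - m)^2 -> 0
adam_step = m / (np.sqrt(v) + eps)            # small: denominator ~ |g|
belief_step = m / (np.sqrt(s) + eps)          # large: denominator ~ 0
```

In this regime AdaBelief's stepsize comes out several times larger than Adam's, matching the "ideal optimizer" behavior described for region 3.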

AdaBelief considers the sign of the gradient in the denominator We show the advantages of AdaBelief with a 2D example in this section, which gives us more intuition for high-dimensional cases. In Fig. 2, we consider the loss function $f(x, y) = |x| + |y|$. Note that in this simple problem, the gradient in each axis can only take values in $\{-1, 1\}$. Suppose the start point is near the $x$ axis. Optimizers will oscillate in the $y$ direction, while making steady progress in the $x$ direction.
Suppose the algorithm runs for a long time ($t$ is large), so the bias of the EMA (relative to the expectation of the quantity it tracks) is small:

$$m_t = \mathrm{EMA}(g_0, g_1, \dots, g_t) \approx \mathbb{E}[g_t], \qquad m_{t,x} \approx \mathbb{E}[g_{t,x}] = 1, \qquad m_{t,y} \approx \mathbb{E}[g_{t,y}] = 0 \tag{2}$$

$$v_t = \mathrm{EMA}(g_0^2, g_1^2, \dots, g_t^2) \approx \mathbb{E}[g_t^2], \qquad v_{t,x} \approx \mathbb{E}[g_{t,x}^2] = 1, \qquad v_{t,y} \approx \mathbb{E}[g_{t,y}^2] = 1. \tag{3}$$
| Step | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $g_x$ | 1 | 1 | 1 | 1 | 1 |
| $g_y$ | -1 | 1 | -1 | 1 | -1 |
| Adam: $v_x$ | 1 | 1 | 1 | 1 | 1 |
| Adam: $v_y$ | 1 | 1 | 1 | 1 | 1 |
| AdaBelief: $s_x$ | 0 | 0 | 0 | 0 | 0 |
| AdaBelief: $s_y$ | 1 | 1 | 1 | 1 | 1 |

In practice, the bias correction step will further reduce the error between the EMA and its expectation if $g_t$ is a stationary process kingma2014adam. Note that:

$$s_t = \mathrm{EMA}\big((g_0 - m_0)^2, \dots, (g_t - m_t)^2\big) \approx \mathbb{E}\big[(g_t - \mathbb{E} g_t)^2\big] = \mathrm{Var}(g_t), \qquad s_{t,x} \approx 0, \qquad s_{t,y} \approx 1 \tag{4}$$

An example of the analysis above is summarized in Fig. 2. From Eq. 3 and Eq. 4, note that in Adam, $v_{t,x} \approx v_{t,y} \approx 1$; this is because the update of $v_t$ only uses the amplitude of $g_t$ and ignores its sign, hence the stepsize for the $x$ and $y$ directions is the same. AdaBelief considers both the magnitude and sign of $g_t$, and $s_{t,x} \approx 0$ while $s_{t,y} \approx 1$, hence it takes a large step in the $x$ direction and a small step in the $y$ direction, which matches the behaviour of an ideal optimizer.
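The limits in Eqs. 2–4 can be verified with a short simulation of the $f(x, y) = |x| + |y|$ setting; the script below is our own illustration, with coordinate order $(x, y)$:

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
T = 2000
m, v, s = np.zeros(2), np.zeros(2), np.zeros(2)
for k in range(T):
    g = np.array([1.0, (-1.0) ** k])  # g_x is always +1; g_y flips sign each step
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2            # Adam's v_t: ~1 in BOTH axes
    s = beta2 * s + (1 - beta2) * (g - m) ** 2      # AdaBelief's s_t: ~0 in x, ~1 in y
m_hat = m / (1 - beta1 ** T)   # bias-corrected EMAs
v_hat = v / (1 - beta2 ** T)
s_hat = s / (1 - beta2 ** T)
```

After the loop, `v_hat` is indistinguishable between the two axes (Adam cannot tell them apart), while `s_hat` is near 0 in $x$ and near 1 in $y$, reproducing the table above.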

Update direction in Adam is close to “sign descent” in low-variance case In this section, we demonstrate that when the gradient has low variance, the update direction in Adam is close to “sign descent”, hence deviates from the gradient. This is also mentioned in balles2017dissecting.

Under the following assumptions: (1) assume $g_t$ is drawn from a stationary distribution, hence after bias correction, $\mathbb{E}[v_t] = (\mathbb{E}[g_t])^2 + \mathrm{Var}(g_t)$; (2) low-noise assumption: assume $\mathrm{Var}(g_t) \ll (\mathbb{E}[g_t])^2$, hence we have $\mathbb{E}[g_t] / \sqrt{\mathbb{E}[v_t]} \approx \mathbb{E}[g_t] / |\mathbb{E}[g_t]| = \mathrm{sign}(\mathbb{E}[g_t])$; (3) low-bias assumption: assume $\beta_1^t$ ($\beta_1$ to the power of $t$) is small, hence $m_t$ as an estimator of $\mathbb{E}[g_t]$ has a small bias. Then

$$\Delta\theta = -\alpha\,\frac{m_t}{\sqrt{v_t} + \epsilon} \approx -\alpha\,\frac{\mathbb{E}[g_t]}{\sqrt{(\mathbb{E}[g_t])^2} + \epsilon} \approx -\alpha\,\mathrm{sign}(\mathbb{E}[g_t]) \tag{5}$$

In this case, Adam behaves like "sign descent"; in 2D cases the update is at $\pm 45^\circ$ to the axes, hence deviates from the true gradient direction. The "sign update" effect might cause the generalization gap between adaptive methods and SGD (e.g. on ImageNet) bernstein2018signsgd; wilson2017marginal. For AdaBelief, when the variance of $g_t$ is the same for all coordinates, the update direction matches the gradient direction; when the variance is not uniform, AdaBelief takes a small (large) step when the variance is large (small).
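The sign-descent effect is easy to reproduce: with noiseless gradients, Adam's bias-corrected update $\hat m_t / (\sqrt{\hat v_t} + \epsilon)$ collapses to $\mathrm{sign}(g)$ regardless of gradient magnitude. A minimal sketch (our own construction):

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
g = np.array([0.01, 10.0])       # noiseless gradients of very different scales
m, v = np.zeros(2), np.zeros(2)
T = 2000
for t in range(1, T + 1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
m_hat = m / (1 - beta1 ** T)     # bias-corrected EMA of g (here exactly g)
v_hat = v / (1 - beta2 ** T)     # bias-corrected EMA of g^2 (here exactly g^2)
adam_dir = m_hat / (np.sqrt(v_hat) + eps)  # ~ sign(g) in both coordinates
```

Both coordinates of `adam_dir` come out approximately 1, even though the underlying gradients differ by three orders of magnitude.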

Numerical experiments In this section, we validate the intuitions in Sec. 2.2. Examples are shown in Fig. 3, and we refer readers to more examples in the supplementary videos for better visualization. In all examples, compared with SGD with momentum and Adam, AdaBelief reaches the optimal point at the fastest speed. The learning rate is the same for all optimizers. For all examples except Fig. 3(d), we set the parameters of AdaBelief to be the same as the defaults in Adam kingma2014adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$), and set the momentum as 0.9 for SGD. For Fig. 3(d), to match the assumptions in Sec. 2.2, we modify the smoothing parameters of Adam and AdaBelief and the momentum of SGD. We summarize these experiments as follows:

1. Consider the loss function $f(x, y) = |x| + |y|$ and a starting point near the $x$ axis. This setting corresponds to Fig. 2. Under the same setting, AdaBelief takes a large step in the $x$ direction, and a small step in the $y$ direction, validating our analysis. More examples are in the supplementary videos.

2. For an inseparable loss (Fig. 3(b)), AdaBelief outperforms other methods under the same setting.

3. For an inseparable smooth loss (Fig. 3(c)), AdaBelief outperforms other methods under the same setting.

4. We use modified smoothing parameters for Adam and AdaBelief, and a modified momentum for SGD. This corresponds to the settings of Eq. 5. For this loss, $\nabla f$ is constant over a large region, hence $\mathrm{Var}(g_t) \approx 0$. As mentioned in kingma2014adam, the EMA bias decays with the smoothing parameter raised to the power $t$, hence a smaller smoothing parameter makes the bias decrease to 0 faster. Adam behaves like sign descent (at $\pm 45^\circ$ to the axes), while AdaBelief and SGD update in the direction of the gradient.

5. Optimization trajectory under default setting for the Beale beale1955minimizing function in 2D and 3D.

6. Optimization trajectory under default setting for the Rosenbrock rosenbrock1960automatic function.

Above cases occur frequently in deep learning Although the above cases are simple, they give hints to the local behavior of optimizers in deep learning, and we expect them to occur frequently; hence we expect AdaBelief to outperform Adam in general cases. Other works in the literature reddi2019convergence; luo2019adaptive claim advantages over Adam, but are typically substantiated with carefully-constructed examples. Note that most deep networks use the ReLU activation glorot2011deep, which behaves like an absolute value function as in Fig. 3(a); considering the interaction between neurons, most networks behave like the case in Fig. 3(b), and are typically ill-conditioned (the weights of some parameters are far larger than others) as in the figure. Considering a smooth loss function such as cross entropy or a smooth activation, this case is similar to Fig. 3(c). The case in Fig. 3(d) requires a low-variance, nearly constant gradient, which typically occurs at the late stages of training, where the learning rate is decayed to a small value and the network reaches a stable region.
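As a concrete miniature of these cases, the toy script below races bias-corrected Adam against AdaBelief on $f(x, y) = |x| + |y|$ and records when each first brings $|x|$ below 0.5. It is our own sketch of the Fig. 3(a)-style setting, not the paper's experiment code:

```python
import numpy as np

def run(opt, steps=600, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Minimize f(x, y) = |x| + |y| from (5, 5); return the first step t
    at which |x| drops below 0.5 (None if never reached)."""
    theta = np.array([5.0, 5.0])
    m, v = np.zeros(2), np.zeros(2)
    hit = None
    for t in range(1, steps + 1):
        g = np.sign(theta)                           # gradient of |x| + |y|
        m = b1 * m + (1 - b1) * g
        if opt == "adam":
            v = b2 * v + (1 - b2) * g ** 2
        else:                                        # "adabelief"
            v = b2 * v + (1 - b2) * (g - m) ** 2 + eps
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        if hit is None and abs(theta[0]) < 0.5:
            hit = t
    return hit
```

Because the gradient is constant along each axis far from the origin, AdaBelief's denominator shrinks and its stepsize grows, so it crosses the threshold in noticeably fewer steps than Adam's fixed sign-descent stride of roughly `lr` per step.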

### 2.3 Convergence analysis in convex and non-convex optimization

Similar to reddi2019convergence; luo2019adaptive; chen2018convergence, for simplicity, we omit the de-biasing step (the analysis is applicable to the de-biased version as well). Proofs of convergence in the convex and non-convex cases are in the appendix.

Optimization problem For deterministic problems, the problem to be optimized is $\min_{\theta} f(\theta)$; for online optimization, the problem is $\min_{\theta} \sum_{t=1}^{T} f_t(\theta)$, where $f_t$ can be interpreted as the loss of the model with the chosen parameters at the $t$-th step.

###### Theorem 2.1.

(Convergence in convex optimization) Let $\{\theta_t\}$ and $\{s_t\}$ be the sequences obtained by AdaBelief, let $0 \le \beta_1, \beta_2 < 1$, $\alpha_t = \alpha/\sqrt{t}$, $\beta_{1t} \le \beta_1,\ \forall t$. Let $\theta \in \mathcal{F}$, where $\mathcal{F}$ is a convex feasible set with bounded diameter $D_\infty$. Assume $f(\theta)$ is a convex function, $\lVert g_t \rVert_\infty \le G_\infty/2$ (hence $\lVert g_t - m_t \rVert_\infty \le G_\infty$), and $s_{t,i} \ge c > 0,\ \forall t \in [T],\ \theta \in \mathcal{F}$. Denote the optimal point as $\theta^*$. For $\{\theta_t\}$ generated with AdaBelief, we have the following bound on the regret (see Theorem .2 in Appendix B):

$$\sum_{t=1}^{T} f_t(\theta_t) - f_t(\theta^*) \le \frac{D_\infty^2 \sqrt{T}}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} s_{T,i}^{1/2} + \frac{(1+\beta_1)\alpha\sqrt{1+\log T}}{2\sqrt{c}\,(1-\beta_1)^3} \sum_{i=1}^{d} \big\lVert g_{1:T,i}^2 \big\rVert_2 + \frac{D_\infty^2}{2(1-\beta_1)} \sum_{t=1}^{T} \sum_{i=1}^{d} \frac{\beta_{1t}\, s_{t,i}^{1/2}}{\alpha_t}$$

###### Corollary 2.1.1.

Suppose $\beta_{1t} = \beta_1 \lambda^{t-1}$, $0 < \lambda < 1$ in Theorem (2.1); then we have:

$$\sum_{t=1}^{T} \big[f_t(\theta_t) - f_t(\theta^*)\big] \le \frac{D_\infty^2 \sqrt{T}}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} s_{T,i}^{1/2} + \frac{(1+\beta_1)\alpha\sqrt{1+\log T}}{2\sqrt{c}\,(1-\beta_1)^3} \sum_{i=1}^{d} \big\lVert g_{1:T,i}^2 \big\rVert_2 + \frac{D_\infty^2 \beta_1 G_\infty}{2(1-\beta_1)(1-\lambda)^2 \alpha}$$

For the convex case, Theorem 2.1 implies the regret of AdaBelief is upper bounded by $O(\sqrt{T})$. The conditions for Corollary 2.1.1 can be relaxed to $\beta_{1t} = \beta_1/t$ as in reddi2019convergence, which still generates $O(\sqrt{T})$ regret. Similar to Theorem 4.1 in kingma2014adam and Corollary 1 in reddi2019convergence, where the term $\sqrt{T}\sum_{i=1}^{d} v_{T,i}^{1/2}$ appears, our bound contains $\sqrt{T}\sum_{i=1}^{d} s_{T,i}^{1/2}$. Without further assumption, $\sqrt{T}\sum_{i=1}^{d} s_{T,i}^{1/2} \le d G_\infty \sqrt{T}$, since $\lVert s_t \rVert_\infty \le G_\infty^2$ as assumed in Theorem 2.1, and $d G_\infty$ is constant. The literature kingma2014adam; reddi2019convergence; duchi2011adaptive exerts the stronger assumption that $\sqrt{T}\sum_{i=1}^{d} v_{T,i}^{1/2} \ll d G_\infty \sqrt{T}$. Our assumption could be similar or weaker, because $\mathbb{E}[s_t] \approx \mathrm{Var}(g_t) \le \mathbb{E}[g_t^2] \approx \mathbb{E}[v_t]$ for a stationary gradient process; then we get a better regret than $O(\sqrt{T})$.

###### Theorem 2.2.

(Convergence for non-convex stochastic optimization) Under the assumptions:

• $f$ is differentiable; $\lVert \nabla f(x) - \nabla f(y) \rVert \le L \lVert x - y \rVert,\ \forall x, y$; $f$ is also lower bounded.

• The noisy gradient is unbiased and has independent noise, i.e. $g_t = \nabla f(\theta_t) + \zeta_t$, $\mathbb{E}[\zeta_t] = 0$, $\zeta_t \perp \zeta_j,\ \forall t \ne j$.

• At step $t$, the algorithm can access a bounded noisy gradient, and the true gradient is also bounded: $\lVert \nabla f(\theta_t) \rVert \le H$, $\lVert g_t \rVert \le H,\ \forall t$.

Assume $\min_i s_{1,i} \ge c > 0$, and the noise in the gradient has bounded variance, $\mathrm{Var}(g_t) = \sigma_t^2 \le \sigma^2,\ \forall t$; then the proposed algorithm satisfies:

$$\mathbb{E}\Big[\sum_{t=1}^{T} \alpha_t^2 \lVert \nabla f(\theta_t) \rVert^2\Big] \le \frac{1}{\frac{1}{H} - \frac{C_1}{c}} \Big[ \frac{C_1 \sigma^2}{c} \sum_{t=1}^{T} \alpha_t^2 + C_2 \frac{d\alpha}{\sqrt{c}} + C_3 \frac{d\alpha^2}{c} + C_4 \Big]$$

where, as in chen2018convergence, $C_1, C_2, C_3$ are constants independent of $d$ and $T$, and $C_4$ is a constant independent of $T$.

###### Corollary 2.2.1.

If $c > C_1 H$, $\alpha_t = \alpha/\sqrt{t}$, and the assumptions for Theorem 2.2 are satisfied, we have:

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[\alpha_t^2 \lVert \nabla f(\theta_t) \rVert^2\big] \le \frac{1}{T} \frac{1}{\frac{1}{H} - \frac{C_1}{c}} \Big[ \frac{C_1 \alpha^2 \sigma^2}{c} (1 + \log T) + C_2 \frac{d\alpha}{\sqrt{c}} + C_3 \frac{d\alpha^2}{c} + C_4 \Big]$$

Theorem 2.2 implies that the convergence rate for AdaBelief in the non-convex case is $O(\log T/\sqrt{T})$, which is similar to Adam-type optimizers reddi2019convergence; chen2018convergence. Note that these bounds are derived in the worst possible case, while empirically AdaBelief outperforms Adam mainly because the cases in Sec. 2.2 occur more frequently. It is possible that the above bounds are loose; we will try to derive a tighter bound in the future.

## 3 Experiments

We performed extensive comparisons with other optimizers, including SGD sutskever2013importance, AdaBound luo2019adaptive, Yogi zaheer2018adaptive, Adam kingma2014adam, MSVAG balles2017dissecting, RAdam liu2019variance, Fromage bernstein2020distance and AdamW loshchilov2017decoupled. The experiments include: (a) image classification on the Cifar dataset krizhevsky2009learning with VGG simonyan2014very, ResNet he2016deep and DenseNet huang2017densely, and image recognition with ResNet on ImageNet deng2009imagenet; (b) language modeling with LSTM ma2015long on the Penn TreeBank dataset marcus1993building; (c) Wasserstein-GAN (WGAN) arjovsky2017wasserstein on the Cifar10 dataset. We emphasize (c) because prior work focuses on convergence and accuracy, yet neglects training stability.

Hyperparameter tuning We performed a careful hyperparameter tuning in experiments. On image classification and language modeling we use the following:

SGD, Fromage:    We set the momentum as 0.9, which is the default for many networks such as ResNet he2016deep and DenseNet huang2017densely. We search for the optimal learning rate over a grid spanning several orders of magnitude.
Adam, Yogi, RAdam, MSVAG, AdaBound:  We search for the optimal momentum hyperparameter over a small grid, search for the learning rate as in SGD, and set other parameters to their own default values from the literature.
AdamW:    We use the same parameter-searching scheme as for Adam. For other optimizers, we set the weight decay to a fixed default value; for AdamW, since the optimal weight decay is typically larger loshchilov2017decoupled, we search for the weight decay over a wider range.
For the training of a GAN, we fix the smoothing parameters of AdaBelief; for other methods, we search for the optimal smoothing parameters over small grids. We set the same learning rate for all methods. Note that the recommended parameters for Adam radford2015unsupervised and for RMSProp salimans2016improved are within the search range.
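The search procedure described above amounts to a simple grid search over hyperparameter combinations. In the sketch below, the grid values and the `train_and_eval` callback are hypothetical placeholders (the exact search ranges are not reproduced here); only the selection logic is illustrated:

```python
import itertools

def grid_search(train_and_eval,
                lrs=(1e-1, 1e-2, 1e-3, 1e-4),          # placeholder grid
                epsilons=(1e-5, 1e-8, 1e-10, 1e-12)):  # placeholder grid
    """Return (score, lr, eps) maximizing a validation metric.
    `train_and_eval(lr=..., eps=...)` stands in for one full training run
    returning a higher-is-better score; it is assumed, not from the paper."""
    best = None
    for lr, eps in itertools.product(lrs, epsilons):
        score = train_and_eval(lr=lr, eps=eps)
        if best is None or score > best[0]:
            best = (score, lr, eps)
    return best
```

In practice each `train_and_eval` call is a full training run, so the grids are kept small and the remaining hyperparameters are held at their literature defaults, as described above.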

CNNs on image classification We experiment with VGG11, ResNet34 and DenseNet121 on the Cifar10 and Cifar100 datasets. We use the official implementation of AdaBound, hence achieve an exact replication of luo2019adaptive. For each optimizer, we search for the optimal hyperparameters, and report the mean and standard deviation of test-set accuracy (under the optimal hyperparameters) for 3 runs with random initialization. As Fig. 4 shows, AdaBelief achieves fast convergence as in adaptive methods such as Adam, while achieving better accuracy than SGD and other methods.

We then train a ResNet18 on ImageNet, and report the accuracy on the validation set in Table 2. Due to the heavy computational burden, we could not perform an extensive hyperparameter search; instead, we report the result of AdaBelief with the default parameters of Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) and decoupled weight decay as in liu2019variance; loshchilov2017decoupled; for other optimizers, we report the best result in the literature. AdaBelief outperforms other adaptive methods and achieves accuracy comparable to SGD (70.08 vs. 70.23), which closes the generalization gap between adaptive methods and SGD. These experiments validate the fast convergence and good generalization performance of AdaBelief.

LSTM on language modeling We experiment with LSTM on the Penn TreeBank dataset marcus1993building, and report the perplexity (lower is better) on the test set in Fig. 5. We report the mean and standard deviation across 3 runs. For both 2-layer and 3-layer LSTM models, AdaBelief achieves the lowest perplexity, validating its fast convergence as in adaptive methods and good accuracy. For the 1-layer model, the performance of AdaBelief is close to other optimizers.

Generative adversarial networks The stability of optimizers is important in practice, e.g. for the training of GANs, yet recently proposed optimizers often lack experimental validation in this respect. The training of a GAN alternates between generator and discriminator in a min-max game, and is typically unstable goodfellow2014generative; SGD often suffers from mode collapse, and adaptive methods such as Adam and RMSProp are recommended in practice goodfellow2016nips; salimans2016improved; gulrajani2017improved. Therefore, the training of GANs is a good test for the stability of optimizers.

We experiment with one of the most widely used models, the Wasserstein-GAN (WGAN) arjovsky2017wasserstein, and its improved version with gradient penalty (WGAN-GP) gulrajani2017improved. Using each optimizer, we train the model for 100 epochs, generate 64,000 fake images from noise, and compute the Fréchet Inception Distance (FID) heusel2017gans between the fake images and the real dataset (60,000 real images). The FID score captures both the quality and diversity of generated images and is widely used to assess generative models (lower FID is better). For each optimizer, under its optimal hyperparameter settings, we perform 5 runs, and report the results in Fig. 6 and Fig. 7. AdaBelief significantly outperforms other optimizers, and achieves the lowest FID score.

## 4 Related works

Besides first-order methods, second-order methods (e.g. Newton's method boyd2004convex, quasi-Newton and Gauss-Newton methods wedderburn1974quasi; schraudolph2002fast, L-BFGS nocedal1980updating, natural gradient amari1998natural; pascanu2013revisiting, conjugate gradient hestenes1952methods) are widely used in conventional optimization. Hessian-free optimization (HFO) martens2010deep uses second-order methods to train neural networks. Second-order methods typically use curvature information and are invariant to scaling battiti1992first, but have a heavy computational burden, and hence are not widely used in deep learning.

## 5 Conclusion

We propose the AdaBelief optimizer, which adaptively scales the stepsize by the difference between the predicted gradient and the observed gradient. To our knowledge, AdaBelief is the first optimizer to achieve three goals simultaneously: fast convergence as in adaptive methods, good generalization as in SGD, and training stability in complex settings such as GANs. Furthermore, AdaBelief has the same parameters as Adam, hence is easy to tune. We validate the benefits of AdaBelief with intuitive examples, theoretical convergence analysis in both convex and non-convex cases, and extensive experiments on real-world datasets.

Optimization is at the core of modern machine learning, and numerous efforts have been put into it. To our knowledge, AdaBelief is the first optimizer to achieve fast speed, good generalization and training stability. AdaBelief can be used for the training of any model whose parameter gradients can be numerically estimated, and hence can boost the development and application of deep learning models; yet this work mainly focuses on the theory, and the social impact is mainly determined by each application rather than by the optimizer.

## A. Detailed Algorithm of AdaBelief

Notations By the convention in kingma2014adam, we use the following notations:

• $f(\theta) \in \mathbb{R},\ \theta \in \mathbb{R}^d$: $f$ is the loss function to minimize, $\theta$ is the parameter in $\mathbb{R}^d$

• $g_t$: the gradient at step $t$

• $\alpha, \epsilon$: $\alpha$ is the learning rate, default is $10^{-3}$; $\epsilon$ is a small number, typically set as $10^{-8}$

• $\beta_1, \beta_2$: smoothing parameters, typical values are $\beta_1 = 0.9$, $\beta_2 = 0.999$

• $m_t$: exponential moving average (EMA) of $g_t$

• $v_t, s_t$: $v_t$ is the EMA of $g_t^2$, $s_t$ is the EMA of $(g_t - m_t)^2$

## B. Convergence analysis in convex online learning case (Theorem 2.1 in main paper)

For ease of notation, we absorb $\epsilon$ into $s_t$; equivalently, $s_{t,i} \ge c > 0,\ \forall t, i$. For simplicity, we omit the debiasing step in the theoretical analysis, as in reddi2019convergence. Our analysis can be applied to the de-biased version as well.

###### Lemma .1.

mcmahan2010adaptive For any $Q \in S_+^d$ (symmetric positive semi-definite) and convex feasible set $\mathcal{F} \subset \mathbb{R}^d$, suppose $u_1 = \prod_{\mathcal{F}, Q}(z_1) = \arg\min_{x \in \mathcal{F}} \lVert Q^{1/2}(x - z_1) \rVert$ and $u_2 = \prod_{\mathcal{F}, Q}(z_2)$; then we have $\lVert Q^{1/2}(u_1 - u_2) \rVert \le \lVert Q^{1/2}(z_1 - z_2) \rVert$.

###### Theorem .2.

Let $\{\theta_t\}$ and $\{s_t\}$ be the sequences obtained by the proposed algorithm, let $0 \le \beta_1, \beta_2 < 1$, $\alpha_t = \alpha/\sqrt{t}$, $\beta_{1t} \le \beta_1,\ \forall t$. Let $\theta \in \mathcal{F}$, where $\mathcal{F}$ is a convex feasible set with bounded diameter $D_\infty$. Assume $f(\theta)$ is a convex function, $\lVert g_t \rVert_\infty \le G_\infty/2$ (hence $\lVert g_t - m_t \rVert_\infty \le G_\infty$), and $s_{t,i} \ge c > 0,\ \forall t \in [T],\ \theta \in \mathcal{F}$. Denote the optimal point as $\theta^*$. For $\{\theta_t\}$ generated with Algorithm 3, we have the following bound on the regret:

$$\sum_{t=1}^{T} f_t(\theta_t) - f_t(\theta^*) \le \frac{D_\infty^2 \sqrt{T}}{2\alpha(1-\beta_1)} \sum_{i=1}^{d} s_{T,i}^{1/2} + \frac{(1+\beta_1)\alpha\sqrt{1+\log T}}{2\sqrt{c}\,(1-\beta_1)^3} \sum_{i=1}^{d} \big\lVert g_{1:T,i}^2 \big\rVert_2 + \frac{D_\infty^2}{2(1-\beta_1)} \sum_{t=1}^{T} \sum_{i=1}^{d} \frac{\beta_{1t}\, s_{t,i}^{1/2}}{\alpha_t}$$

Proof:

$$\theta_{t+1} = \prod_{\mathcal{F}, \sqrt{s_t}}\big(\theta_t - \alpha_t s_t^{-1/2} m_t\big) = \arg\min_{\theta \in \mathcal{F}} \Big\lVert s_t^{1/4}\big[\theta - (\theta_t - \alpha_t s_t^{-1/2} m_t)\big]\Big\rVert$$

Note that $\prod_{\mathcal{F}, \sqrt{s_t}}(\theta^*) = \theta^*$ since $\theta^* \in \mathcal{F}$. Use $\theta_{t,i}$ and $\theta^*_i$ to denote the $i$th dimension of $\theta_t$ and $\theta^*$ respectively. From Lemma (.1), with $z_1 = \theta_t - \alpha_t s_t^{-1/2} m_t$ and $z_2 = \theta^*$, we have:

$$\begin{aligned}
\big\lVert s_t^{1/4}(\theta_{t+1}-\theta^*)\big\rVert^2 &\le \big\lVert s_t^{1/4}(\theta_t - \alpha_t s_t^{-1/2} m_t - \theta^*)\big\rVert^2 \\
&= \big\lVert s_t^{1/4}(\theta_t-\theta^*)\big\rVert^2 + \alpha_t^2 \big\lVert s_t^{-1/4} m_t \big\rVert^2 - 2\alpha_t \langle m_t,\ \theta_t - \theta^* \rangle \\
&= \big\lVert s_t^{1/4}(\theta_t-\theta^*)\big\rVert^2 + \alpha_t^2 \big\lVert s_t^{-1/4} m_t \big\rVert^2 - 2\alpha_t \big\langle \beta_{1t} m_{t-1} + (1-\beta_{1t}) g_t,\ \theta_t - \theta^* \big\rangle
\end{aligned} \tag{1}$$

Note that $m_t = \beta_{1t} m_{t-1} + (1-\beta_{1t}) g_t$ and $\beta_{1t} < 1$; rearranging inequality (1), we have:

$$\begin{aligned}
\langle g_t,\ \theta_t - \theta^* \rangle &\le \frac{1}{2\alpha_t(1-\beta_{1t})}\Big[\big\lVert s_t^{1/4}(\theta_t-\theta^*)\big\rVert^2 - \big\lVert s_t^{1/4}(\theta_{t+1}-\theta^*)\big\rVert^2\Big] + \frac{\alpha_t}{2(1-\beta_{1t})}\big\lVert s_t^{-1/4} m_t \big\rVert^2 - \frac{\beta_{1t}}{1-\beta_{1t}}\langle m_{t-1},\ \theta_t - \theta^* \rangle \\
&\le \frac{1}{2\alpha_t(1-\beta_{1t})}\Big[\big\lVert s_t^{1/4}(\theta_t-\theta^*)\big\rVert^2 - \big\lVert s_t^{1/4}(\theta_{t+1}-\theta^*)\big\rVert^2\Big] + \frac{\alpha_t}{2(1-\beta_{1t})}\big\lVert s_t^{-1/4} m_t \big\rVert^2 \\
&\qquad + \frac{\beta_{1t}\,\alpha_t}{2(1-\beta_{1t})}\big\lVert s_t^{-1/4} m_{t-1} \big\rVert^2 + \frac{\beta_{1t}}{2\alpha_t(1-\beta_{1t})}\big\lVert s_t^{1/4}(\theta_t-\theta^*)\big\rVert^2
\end{aligned} \tag{2}$$

(by Cauchy–Schwarz and Young's inequality: $ab \le \frac{a^2 \epsilon}{2} + \frac{b^2}{2\epsilon},\ \forall \epsilon > 0$)

By convexity of $f_t$, we have:

$$\begin{aligned}
\sum_{t=1}^{T} f_t(\theta_t) - f_t(\theta^*) &\le \sum_{t=1}^{T} \langle g_t,\ \theta_t - \theta^* \rangle \\
&\le \sum_{t=1}^{T} \Big\{ \frac{1}{2\alpha_t(1-\beta_{1t})}\Big[\big\lVert s_t^{1/4}(\theta_t-\theta^*)\big\rVert^2 - \big\lVert s_t^{1/4}(\theta_{t+1}-\theta^*)\big\rVert^2\Big] + \frac{\alpha_t}{2(1-\beta_{1t})}\big\lVert s_t^{-1/4} m_t \big\rVert^2 \\
&\qquad + \frac{\beta_{1t}\,\alpha_t}{2(1-\beta_{1t})}\big\lVert s_t^{-1/4} m_{t-1} \big\rVert^2 + \frac{\beta_{1t}}{2\alpha_t(1-\beta_{1t})}\big\lVert s_t^{1/4}(\theta_t-\theta^*)\big\rVert^2 \Big\} \qquad \text{(by inequality (2))} \\
&\le \frac{1}{2(1-\beta_1)}\frac{\big\lVert s_1^{1/4}(\theta_1-\theta^*)\big\rVert^2}{\alpha_1} + \cdots \qquad (0 \le s_{t-1} \le s_t,\ 0 \le \alpha_t \le \alpha_{t-1})
\end{aligned}$$