AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients
Abstract
Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse compared to SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the “belief” in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves comparable accuracy to SGD. Furthermore, in the training of a GAN on Cifar10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. Code is available at https://github.com/juntangzhuang/AdabeliefOptimizer
1 Introduction
Modern neural networks are typically trained with first-order gradient methods, which can be broadly categorized into two branches: the accelerated stochastic gradient descent (SGD) family robbins1951stochastic, such as Nesterov accelerated gradient (NAG) nesterov27method, SGD with momentum sutskever2013importance and heavy-ball method (HB) polyak1964some; and the adaptive learning rate methods, such as Adagrad duchi2011adaptive, AdaDelta zeiler2012adadelta, RMSProp graves2013generating and Adam kingma2014adam. SGD methods use a global learning rate for all parameters, while adaptive methods compute an individual learning rate for each parameter.
Compared to the SGD family, adaptive methods typically converge fast in the early training phases, but have poor generalization performance wilson2017marginal; lyu2019gradient. Recent progress tries to combine the benefits of both, such as switching from Adam to SGD either with a hard schedule as in SWATS keskar2017improving, or with a smooth transition as in AdaBound luo2019adaptive. Other modifications of Adam have also been proposed: AMSGrad reddi2019convergence fixes the error in convergence analysis of Adam, Yogi zaheer2018adaptive considers the effect of mini-batch size, MSVAG balles2017dissecting dissects Adam as sign update and magnitude scaling, RAdam liu2019variance rectifies the variance of learning rate, Fromage bernstein2020distance controls the distance in the function space, and AdamW loshchilov2017decoupled decouples weight decay from gradient descent. Although these modifications achieve better accuracy compared to Adam, their generalization performance is typically worse than SGD on large-scale datasets such as ImageNet russakovsky2015imagenet; furthermore, compared with Adam, many optimizers are empirically unstable when training generative adversarial networks (GAN) goodfellow2014generative.
To solve the problems above, we propose “AdaBelief”, which can be easily modified from Adam. Denote the observed gradient at step $t$ as $g_t$ and its exponential moving average (EMA) as $m_t$. Denote the EMA of $g_t^2$ and $(g_t - m_t)^2$ as $v_t$ and $s_t$, respectively. $m_t$ is divided by $\sqrt{v_t}$ in Adam, while it is divided by $\sqrt{s_t}$ in AdaBelief. Intuitively, $1/\sqrt{s_t}$ is the “belief” in the observation: viewing $m_t$ as the prediction of the gradient, if $g_t$ deviates much from $m_t$, we have weak belief in $g_t$, and take a small step; if $g_t$ is close to the prediction $m_t$, we have a strong belief in $g_t$, and take a large step. We validate the performance of AdaBelief with extensive experiments. Our contributions can be summarized as:

We propose AdaBelief, which can be easily modified from Adam without extra parameters. AdaBelief has three properties: (1) fast convergence as in adaptive gradient methods, (2) good generalization as in the SGD family, and (3) training stability in complex settings such as GAN.

We theoretically analyze the convergence property of AdaBelief in both convex optimization and nonconvex stochastic optimization.

We validate the performance of AdaBelief with extensive experiments: AdaBelief achieves fast convergence as Adam and good generalization as SGD in image classification tasks on CIFAR and ImageNet; AdaBelief outperforms other methods in language modeling; in the training of a WGAN arjovsky2017wasserstein, compared to a well-tuned Adam optimizer, AdaBelief significantly improves the quality of generated images, while several recent adaptive optimizers fail to train.
2 Methods
2.1 Details of AdaBelief Optimizer
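Notations By the convention in kingma2014adam, we use the following notations:

$f(\theta) \in \mathbb{R}$, $\theta \in \mathbb{R}^d$: $f$ is the loss function to minimize, $\theta$ is the parameter in $\mathbb{R}^d$

$\Pi_{\mathcal{F},M}(y) = \operatorname{argmin}_{x \in \mathcal{F}} \|M^{1/2}(x - y)\|$: projection of $y$ onto a convex feasible set $\mathcal{F}$

$g_t$: the gradient at step $t$

$m_t$: exponential moving average (EMA) of $g_t$

$v_t, s_t$: $v_t$ is the EMA of $g_t^2$, $s_t$ is the EMA of $(g_t - m_t)^2$

$\alpha, \epsilon$: $\alpha$ is the learning rate, default is $10^{-3}$; $\epsilon$ is a small number, typically set as $10^{-8}$

$\beta_1, \beta_2$: smoothing parameters, typical values are $\beta_1 = 0.9$, $\beta_2 = 0.999$

$\beta_{1t}, \beta_{2t}$ are the momentum for $m_t$ and $v_t$ respectively at step $t$, and typically set as constant (e.g. $\beta_{1t} = \beta_1$, $\beta_{2t} = \beta_2$, $\forall t \in \{1, 2, \dots, T\}$)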
Comparison with Adam Adam and AdaBelief are summarized in Algo. 1 and Algo. 2, where all operations are elementwise, with differences marked in blue. Note that no extra parameters are introduced in AdaBelief. For simplicity, we omit the bias correction step. A detailed version of AdaBelief is in Appendix A. Specifically, in Adam, the update direction is $m_t/\sqrt{v_t}$, where $v_t$ is the EMA of $g_t^2$; in AdaBelief, the update direction is $m_t/\sqrt{s_t}$, where $s_t$ is the EMA of $(g_t - m_t)^2$. Intuitively, viewing $m_t$ as the prediction of $g_t$, AdaBelief takes a large step when observation $g_t$ is close to prediction $m_t$, and a small step when the observation greatly deviates from the prediction.
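The comparison above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it handles a single scalar parameter, omits bias correction as in the text, and assumes that $\epsilon$ is added to the square root in the denominator. The helper names (`ema`, `adam_step`, `adabelief_step`, `run`) are hypothetical.

```python
import math


def ema(prev, new, beta):
    """Exponential moving average update."""
    return beta * prev + (1.0 - beta) * new


def adam_step(theta, g, state, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (no bias correction) for a scalar parameter."""
    state["m"] = ema(state["m"], g, beta1)
    state["v"] = ema(state["v"], g * g, beta2)  # EMA of g_t^2
    # Assumption: eps is added outside the square root.
    return theta - alpha * state["m"] / (math.sqrt(state["v"]) + eps)


def adabelief_step(theta, g, state, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief update (no bias correction) for a scalar parameter."""
    state["m"] = ema(state["m"], g, beta1)
    # EMA of (g_t - m_t)^2: the only line that differs from Adam.
    state["s"] = ema(state["s"], (g - state["m"]) ** 2, beta2)
    return theta - alpha * state["m"] / (math.sqrt(state["s"]) + eps)


def run(step_fn, key, grads, theta=0.0):
    """Apply step_fn to a fixed gradient sequence (hypothetical helper)."""
    state = {"m": 0.0, key: 0.0}
    for g in grads:
        theta = step_fn(theta, g, state)
    return theta


# A constant gradient is perfectly predictable, so AdaBelief should move further.
constant_grads = [1.0] * 100
theta_adam = run(adam_step, "v", constant_grads)
theta_adabelief = run(adabelief_step, "s", constant_grads)
print(theta_adam, theta_adabelief)
```

With a constant gradient, $(g_t - m_t)^2 \le g_t^2$ at every step, so $s_t \le v_t$ and the AdaBelief sketch takes the larger step throughout.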
2.2 Intuitive explanation for benefits of AdaBelief
AdaBelief uses curvature information
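Update formulas for SGD, Adam and AdaBelief are:
$$\Delta\theta_t^i = -\alpha\, m_t^i \;\;\text{(SGD)}, \qquad \Delta\theta_t^i = -\alpha\, \frac{m_t^i}{\sqrt{v_t^i}} \;\;\text{(Adam)}, \qquad \Delta\theta_t^i = -\alpha\, \frac{m_t^i}{\sqrt{s_t^i}} \;\;\text{(AdaBelief)} \qquad (1)$$
Note that we name $\alpha$ as the “learning rate” and $|\Delta\theta_t^i|$ as the “stepsize” for the $i$th parameter. With a 1D example in Fig. 1, we demonstrate that AdaBelief uses the curvature of loss functions to improve training as summarized in Table 1, with a detailed description below: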
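Table 1 (S: small, L: large).

| | Case 1 | Case 2 | Case 3 |
|---|---|---|---|
| $\lvert g_t \rvert$, $v_t$ | S | L | L |
| $\lvert g_t - g_{t-1} \rvert$, $s_t$ | S | L | S |
| $\lvert \Delta\theta_t \rvert$, ideal optimizer | L | S | L |
| $\lvert \Delta\theta_t \rvert$, SGD | S | L | L |
| $\lvert \Delta\theta_t \rvert$, Adam | L | S | S |
| $\lvert \Delta\theta_t \rvert$, AdaBelief | L | S | L |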
(1) In region 1 in Fig. 1, the loss function is flat, hence the gradient is close to 0. In this case, an ideal optimizer should take a large stepsize. The stepsize of SGD is proportional to the EMA of the gradient, hence is small in this case; while both Adam and AdaBelief take a large stepsize, because the denominator ($\sqrt{v_t}$ and $\sqrt{s_t}$) is a small value.
(2) In region 2, the algorithm oscillates in a “steep and narrow” valley, hence both $|g_t|$ and $|g_t - g_{t-1}|$ are large. An ideal optimizer should decrease its stepsize, while SGD takes a large step (proportional to $m_t$). Adam and AdaBelief take a small step because the denominator ($\sqrt{v_t}$ and $\sqrt{s_t}$) is large.
(3) In region 3, we demonstrate AdaBelief’s advantage over Adam in the “large gradient, small curvature” case. In this case, $|g_t|$ and $v_t$ are large, but $|g_t - g_{t-1}|$ and $s_t$ are small; this could happen because of a small learning rate $\alpha$. In this case, an ideal optimizer should increase its stepsize. SGD uses a large stepsize ($\sim \alpha |g_t|$); in Adam, the denominator $\sqrt{v_t}$ is large, hence the stepsize is small; in AdaBelief, the denominator $\sqrt{s_t}$ is small, hence the stepsize is large as in an ideal optimizer.
To sum up, AdaBelief scales the update direction by the change in gradient, which is related to the Hessian. Therefore, AdaBelief considers curvature information and performs better than Adam.
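One way to make the link between the change in gradient and the Hessian explicit is a first-order Taylor expansion of the gradient. The sketch below assumes that $f$ is twice differentiable and ignores gradient noise; the symbol $H$ for the Hessian is introduced here for illustration and is not used elsewhere in the text.

```latex
g_t - g_{t-1}
  = \nabla f(\theta_t) - \nabla f(\theta_{t-1})
  \approx H(\theta_{t-1})\,(\theta_t - \theta_{t-1})
```

Under these assumptions, for a step of fixed length a small $|g_t - g_{t-1}|$ indicates small curvature along the step, which is the situation of Case 3 in Table 1.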
AdaBelief considers the sign of gradient in denominator We show the advantages of AdaBelief with a 2D example in this section, which gives us more intuition for high dimensional cases.
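In Fig. 2, we consider the loss function:
$$f(x, y) = |x| + |y|.$$
Note that in this simple problem, the gradient in each axis can only take values in $\{1, -1\}$. Suppose the start point is near the $x$ axis, e.g. $y_0 \approx 0$, $x_0 \ll 0$. Optimizers will oscillate in the $y$ direction, and keep increasing in the $x$ direction.
Suppose the algorithm runs for a long time ($t$ is large), so the bias of EMA ($\beta^t\, \mathbb{E} g_t$) is small:
$$m_t = \mathrm{EMA}(g_0, g_1, \dots, g_t) \approx \mathbb{E}(g_t), \quad |m_{t,x}| \approx |\mathbb{E} g_{t,x}| = 1, \quad m_{t,y} \approx \mathbb{E} g_{t,y} = 0 \qquad (2)$$
$$v_t = \mathrm{EMA}(g_0^2, g_1^2, \dots, g_t^2) \approx \mathbb{E}(g_t^2), \quad v_{t,x} \approx \mathbb{E} g_{t,x}^2 = 1, \quad v_{t,y} \approx \mathbb{E} g_{t,y}^2 = 1 \qquad (3)$$
In practice, the bias correction step will further reduce the error between the EMA and its expectation if $g_t$ is a stationary process kingma2014adam. Note that:

$$s_t = \mathrm{EMA}\big((g_0 - m_0)^2, \dots, (g_t - m_t)^2\big) \approx \mathbb{E}\big[(g_t - \mathbb{E} g_t)^2\big] = \mathrm{Var}\, g_t, \quad s_{t,x} \approx 0, \quad s_{t,y} \approx 1 \qquad (4)$$
An example of the analysis above is summarized in Fig. 2. From Eq. 3 and Eq. 4, note that in Adam, $v_{t,x} = v_{t,y}$; this is because the update of $v_t$ only uses the amplitude of $g_t$ and ignores its sign, hence the stepsize for the $x$ and $y$ direction is the same, $1/\sqrt{v_{t,x}} = 1/\sqrt{v_{t,y}}$. AdaBelief considers both the magnitude and sign of $g_t$, and $1/\sqrt{s_{t,x}} \gg 1/\sqrt{s_{t,y}}$, hence takes a large step in the $x$ direction and a small step in the $y$ direction, which matches the behaviour of an ideal optimizer.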
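The EMA statistics in this 2D example can be checked numerically. The sketch below is an illustration under stated assumptions, not the experiment from the paper: the gradient in $x$ is held at $-1$ (as it would be for $|x|$ with $x < 0$), the gradient in $y$ alternates between $+1$ and $-1$, bias correction is omitted, and the function name `simulate_emas` is hypothetical.

```python
def simulate_emas(steps=5000, beta1=0.9, beta2=0.999):
    """Track m, v, s for an idealized gradient sequence of f(x, y) = |x| + |y|.

    Assumed sequence: g_x = -1 at every step, g_y alternates +1 / -1.
    Index 0 is the x coordinate, index 1 is the y coordinate.
    """
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    s = [0.0, 0.0]
    for t in range(steps):
        g = (-1.0, 1.0 if t % 2 == 0 else -1.0)
        for i in range(2):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2           # Adam
            s[i] = beta2 * s[i] + (1.0 - beta2) * (g[i] - m[i]) ** 2  # AdaBelief
    return m, v, s


m, v, s = simulate_emas()
print("m:", m)
print("v:", v)  # identical in x and y: the sign of g is ignored
print("s:", s)  # near 0 in x, order 1 in y
```

In this sketch `v` is the same in both coordinates, while `s` is close to zero in $x$ and of order one in $y$, so only the AdaBelief denominator separates the two directions.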
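Update direction in Adam is close to “sign descent” in low-variance case In this section, we demonstrate that when the gradient has low variance, the update direction in Adam is close to “sign descent”, hence deviates from the gradient. This is also mentioned in balles2017dissecting.
Under the following assumptions: (1) assume $g_t$ is drawn from a stationary distribution, hence after bias correction, $\mathbb{E} v_t = (\mathbb{E} g_t)^2 + \mathrm{Var}\, g_t$. (2) low-noise assumption, assume $(\mathbb{E} g_t)^2 \gg \mathrm{Var}\, g_t$, hence we have $\mathbb{E} g_t / \sqrt{\mathbb{E} v_t} \approx \mathbb{E} g_t / \sqrt{(\mathbb{E} g_t)^2} = \mathrm{sign}(\mathbb{E} g_t)$. (3) low-bias assumption, assume $\beta_1^t$ ($\beta_1$ to the power of $t$) is small, hence $m_t$ as an estimator of $\mathbb{E} g_t$ has a small bias $\beta_1^t\, \mathbb{E} g_t$. Then

$$\Delta\theta_t^i = -\alpha \frac{m_t^i}{\sqrt{v_t^i} + \epsilon} \approx -\alpha \frac{\mathbb{E} g_t^i}{\sqrt{(\mathbb{E} g_t^i)^2 + \mathrm{Var}\, g_t^i} + \epsilon} \approx -\alpha \frac{\mathbb{E} g_t^i}{|\mathbb{E} g_t^i|} = -\alpha\, \mathrm{sign}(\mathbb{E} g_t^i) \qquad (5)$$
In this case, Adam behaves like a “sign descent”; in 2D cases the update is $\pm 45^\circ$ to the axis, hence deviates from the true gradient direction. The “sign update” effect might cause the generalization gap between adaptive methods and SGD (e.g. on ImageNet) bernstein2018signsgd; wilson2017marginal. For AdaBelief, when the variance of $g_t$ is the same for all coordinates, the update direction matches the gradient direction; when the variance is not uniform, AdaBelief takes a small (large) step when variance is large (small).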
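The sign-descent effect can be illustrated by plugging expected values into the two update directions. This is a sketch under the assumptions of this section (stationary gradients, low noise, low bias), with $\epsilon$ ignored and the same variance in every coordinate; the function names and the example numbers are hypothetical.

```python
import math


def adam_direction(mean, var):
    """E[g] / sqrt(E[v]) per coordinate, using E[v] = (E[g])^2 + Var[g]."""
    return [mu / math.sqrt(mu * mu + var) for mu in mean]


def adabelief_direction(mean, var):
    """E[g] / sqrt(E[s]) per coordinate, using E[s] = Var[g]."""
    return [mu / math.sqrt(var) for mu in mean]


def angle_deg(vec):
    """Angle of a 2D vector relative to the x axis, in degrees."""
    return math.degrees(math.atan2(vec[1], vec[0]))


mean_grad = [3.0, 0.5]  # illustrative expected gradient
variance = 1e-4         # low, uniform variance

print("gradient :", angle_deg(mean_grad))
print("Adam     :", angle_deg(adam_direction(mean_grad, variance)))
print("AdaBelief:", angle_deg(adabelief_direction(mean_grad, variance)))
```

Under these assumptions the Adam direction lies close to $45^\circ$ regardless of the ratio between the two gradient components, while the AdaBelief direction keeps the angle of the expected gradient.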
Numerical experiments In this section, we validate intuitions in Sec. 2.2. Examples are shown in Fig. 3, and we refer readers to more examples in the supplementary videos for better visualization. In all examples, compared with SGD with momentum and Adam, AdaBelief reaches the optimal point at the fastest speed. Learning rate is $\alpha = 10^{-3}$ for all optimizers. For all examples except Fig. 3(d), we set the parameters of AdaBelief to be the same as the default in Adam kingma2014adam, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and set momentum as 0.9 for SGD. For Fig. 3(d), to match the assumption in Sec. 2.2, we set $\beta_1 = \beta_2 = 0.3$ for both Adam and AdaBelief, and set momentum as $0.3$ for SGD. Videos are available at

Fig. 3(a): Consider the loss function $f(x, y) = |x| + |y|$ and a starting point near the $x$ axis. This setting corresponds to Fig. 2. Under the same setting, AdaBelief takes a large step in the $x$ direction, and a small step in the $y$ direction, validating our analysis. More examples are in the supplementary videos.

Fig. 3(b): For an inseparable loss, AdaBelief outperforms other methods under the same setting.

Fig. 3(c): For an inseparable loss, AdaBelief outperforms other methods under the same setting.

Fig. 3(d): We set $\beta_1 = \beta_2 = 0.3$ for Adam and AdaBelief, and set momentum as $0.3$ in SGD. This corresponds to settings of Eq. 5. For this loss, $g_t$ is a constant for a large region, hence $|\mathbb{E} g_t| \gg \mathrm{Var}\, g_t$. As mentioned in kingma2014adam, $\mathbb{E} m_t = (1 - \beta^t)\, \mathbb{E} g_t$, hence a smaller $\beta$ decreases $\|m_t - \mathbb{E} g_t\|$ faster to 0. Adam behaves like a sign descent ($45^\circ$ to the axis), while AdaBelief and SGD update in the direction of the gradient.

Fig. 3(e): Optimization trajectory under default setting for the Beale beale1955minimizing function in 2D and 3D.

Fig. 3(f): Optimization trajectory under default setting for the Rosenbrock rosenbrock1960automatic function.
Above cases occur frequently in deep learning Although the above cases are simple, they hint at the local behavior of optimizers in deep learning. We expect them to occur frequently, and hence expect AdaBelief to outperform Adam in general cases. Other works in the literature reddi2019convergence; luo2019adaptive claim advantages over Adam, but are typically substantiated with carefully-constructed examples. Note that most deep networks use ReLU activation glorot2011deep, which behaves like an absolute value function as in Fig. 3(a); considering the interaction between neurons, most networks behave like the case in Fig. 3(b), and are typically ill-conditioned (the weights of some parameters are far larger than others) as in the figure. Considering a smooth loss function such as cross entropy or a smooth activation, this case is similar to Fig. 3(c). The case with Fig. 3(d) requires $|\mathbb{E} g_t| \gg \mathrm{Var}\, g_t$, and this typically occurs at the late stages of training, where the learning rate $\alpha$ is decayed to a small value, and the network reaches a stable region.
2.3 Convergence analysis in convex and nonconvex optimization
Similar to reddi2019convergence; luo2019adaptive; chen2018convergence, for simplicity, we omit the debiasing step (analysis applicable to debiased version). Proof for convergence in convex and nonconvex cases is in the appendix.
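Optimization problem For deterministic problems, the problem to be optimized is $\min_{\theta \in \mathcal{F}} f(\theta)$; for online optimization, the problem is $\min_{\theta \in \mathcal{F}} \sum_{t=1}^{T} f_t(\theta)$, where $f_t$ can be interpreted as loss of the model with the chosen parameters in the $t$th step.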
Theorem 2.1.
(Convergence in convex optimization) Let and be the sequence obtained by AdaBelief, let , , . Let , where is a convex feasible set with bounded diameter . Assume is a convex function and (hence ) and . Denote the optimal point as . For generated with AdaBelief, we have the following bound on the regret:
Corollary 2.1.1.
Suppose in Theorem (2.1), then we have:

For the convex case, Theorem 2.1 implies the regret of AdaBelief is upper bounded by . Conditions for Corollary 2.1.1 can be relaxed to as in reddi2019convergence, which still generates regret. Similar to Theorem 4.1 in kingma2014adam and corollary 1 in reddi2019convergence, where the term exists, we have . Without further assumption, since as assumed in Theorem 2.1, and is constant. The literature kingma2014adam; reddi2019convergence; duchi2011adaptive exerts a stronger assumption that . Our assumption could be similar or weaker, because , then we get better regret than .
Theorem 2.2.
(Convergence for nonconvex stochastic optimization) Under the assumptions:

is differentiable; ; is also lower bounded.

The noisy gradient is unbiased, and has independent noise, .

At step , the algorithm can access a bounded noisy gradient, and the true gradient is also bounded. .
Assume , noise in gradient has bounded variance, , then the proposed algorithm satisfies:
as in chen2018convergence, are constants independent of and , and is a constant independent of .
Corollary 2.2.1.
If and assumptions for Theorem 2.2 are satisfied, we have:

Theorem 2.2 implies the convergence rate for AdaBelief in the nonconvex case is $O(\log T / \sqrt{T})$, which is similar to Adam-type optimizers reddi2019convergence; chen2018convergence. Note that regret bounds are derived in the worst possible case, while empirically AdaBelief outperforms Adam mainly because the cases in Sec. 2.2 occur more frequently. It is possible that the above bounds are loose; we will try to derive a tighter bound in the future.
3 Experiments
We performed extensive comparisons with other optimizers, including SGD sutskever2013importance, AdaBound luo2019adaptive, Yogi zaheer2018adaptive, Adam kingma2014adam, MSVAG balles2017dissecting, RAdam liu2019variance, Fromage bernstein2020distance and AdamW loshchilov2017decoupled. The experiments include: (a) image classification on Cifar dataset krizhevsky2009learning with VGG simonyan2014very, ResNet he2016deep and DenseNet huang2017densely, and image recognition with ResNet on ImageNet deng2009imagenet; (b) language modeling with LSTM ma2015long on Penn TreeBank dataset marcus1993building; (c) Wasserstein-GAN (WGAN) arjovsky2017wasserstein on Cifar10 dataset. We emphasize (c) because prior work focuses on convergence and accuracy, yet neglects training stability.
Hyperparameter tuning We performed a careful hyperparameter tuning in experiments. On image classification and language modeling we use the following:
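Table 2: Accuracy of ResNet18 on the ImageNet validation set.

| AdaBelief | SGD | AdaBound | Yogi | Adam | MSVAG | RAdam | AdamW |
|---|---|---|---|---|---|---|---|
| 70.08 | 70.23 | 68.13 | 68.23 | 63.79 (66.54) | 65.99 | 67.62 | 67.93 |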
AdaBelief: We use the default parameters of Adam: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\alpha = 10^{-3}$.
SGD, Fromage: We set the momentum as , which is the default for many networks such as ResNet he2016deep and DenseNethuang2017densely. We search learning rate among .
Adam, Yogi, RAdam, MSVAG, AdaBound: We search for optimal among , search for as in SGD, and set other parameters as their own default values in the literature.
AdamW: We use the same parameter searching scheme as Adam. For other optimizers, we set the weight decay as ; for AdamW, since the optimal weight decay is typically larger loshchilov2017decoupled, we search weight decay among .
For the training of a GAN, we set for AdaBelief; for other methods, we search for among , and search for
among .
We set learning rate as for all methods. Note that the recommended parameters for Adam radford2015unsupervised and for RMSProp salimans2016improved are within the search range.


CNNs on image classification We experiment with VGG11, ResNet34 and DenseNet121 on Cifar10 and Cifar100 dataset. We use the official implementation of AdaBound, hence achieved an exact replication of luo2019adaptive. For each optimizer, we search for the optimal hyperparameters, and report the mean and standard deviation of test-set accuracy (under optimal hyperparameters) for 3 runs with random initialization. As Fig. 4 shows, AdaBelief achieves fast convergence as in adaptive methods such as Adam while achieving better accuracy than SGD and other methods.
We then train a ResNet18 on ImageNet, and report the accuracy on the validation set in Table 2. Due to the heavy computational burden, we could not perform an extensive hyperparameter search; instead, we report the result of AdaBelief with the default parameters of Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\alpha = 10^{-3}$) and decoupled weight decay as in liu2019variance; loshchilov2017decoupled; for other optimizers, we report the best result in the literature. AdaBelief outperforms other adaptive methods and achieves comparable accuracy to SGD (70.08 vs. 70.23), which closes the generalization gap between adaptive methods and SGD. Experiments validate the fast convergence and good generalization performance of AdaBelief.
LSTM on language modeling We experiment with LSTM on the Penn TreeBank dataset marcus1993building, and report the perplexity (lower is better) on the test set in Fig. 5. We report the mean and standard deviation across 3 runs. For both 2-layer and 3-layer LSTM models, AdaBelief achieves the lowest perplexity, validating its fast convergence as in adaptive methods and good accuracy. For the 1-layer model, the performance of AdaBelief is close to other optimizers.
Generative adversarial networks Stability of optimizers is important in practice, for example in the training of GANs, yet recently proposed optimizers often lack experimental validation of it. The training of a GAN alternates between generator and discriminator in a minimax game, and is typically unstable goodfellow2014generative; SGD often leads to mode collapse, and adaptive methods such as Adam and RMSProp are recommended in practice goodfellow2016nips; salimans2016improved; gulrajani2017improved. Therefore, training of GANs is a good test for the stability of optimizers.
We experiment with one of the most widely used models, the Wasserstein-GAN (WGAN) arjovsky2017wasserstein and the improved version with gradient penalty (WGAN-GP) salimans2016improved. Using each optimizer, we train the model for 100 epochs, generate 64,000 fake images from noise, and compute the Frechet Inception Distance (FID) heusel2017gans between the fake images and real dataset (60,000 real images). FID score captures both the quality and diversity of generated images and is widely used to assess generative models (lower FID is better). For each optimizer, under its optimal hyperparameter settings, we perform 5 runs of experiments, and report the results in Fig. 6 and Fig. 7. AdaBelief significantly outperforms other optimizers, and achieves the lowest FID score.
Remarks Recent research on optimizers tries to combine the fast convergence of adaptive methods with high accuracy of SGD. AdaBound luo2019adaptive achieves this goal on Cifar, yet its performance on ImageNet is still inferior to SGD chen2018closing. Padam chen2018closing closes this generalization gap on ImageNet; writing the update as $\theta_{t+1} = \theta_t - \alpha\, m_t / v_t^{p}$, SGD sets $p = 0$, Adam sets $p = 0.5$, and Padam searches $p$ between 0 and 0.5 (outside this region Padam diverges chen2018closing; zhou2018convergence). Intuitively, compared to Adam, by using a smaller $p$, Padam sacrifices the adaptivity for better generalization as in SGD; however, without good adaptivity, Padam loses training stability. As in Table 3, compared with Padam, AdaBelief achieves a much lower FID score in the training of GAN, meanwhile achieving slightly higher accuracy on ImageNet classification. Furthermore, AdaBelief has the same number of parameters as Adam, while Padam has one more parameter hence is harder to tune.
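Table 3: Comparison of AdaBelief and Padam.

| | AdaBelief | Padam, p=1/2 (Adam) | Padam, p=2/5 | Padam, p=1/4 | Padam, p=1/5 | Padam, p=1/8 | Padam, p=1/16 | Padam, p=0 (SGD) |
|---|---|---|---|---|---|---|---|---|
| ImageNet Acc | 70.08 | 63.79 | | | | 70.07 | | 70.23 |
| FID (WGAN) | 83.0 ± 4.1 | 96.6 ± 4.5 | 97.5 ± 2.8 | 426.4 ± 49.6 | 401.5 ± 33.2 | 328.1 ± 37.2 | 362.6 ± 43.9 | 469.3 ± 7.9 |
| FID (WGAN-GP) | 61.8 ± 7.7 | 73.5 ± 8.7 | 87.1 ± 6.0 | 155.1 ± 23.8 | 167.3 ± 27.6 | 203.6 ± 18.9 | 228.5 ± 25.8 | 244.3 ± 27.4 |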
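The partially adaptive update discussed in the Remarks, where the exponent $p$ interpolates between SGD with momentum ($p = 0$) and Adam ($p = 0.5$), can be sketched as follows. This is a scalar illustration only: the function name `partially_adaptive_step` and the $\epsilon$ term in the denominator are assumptions, not taken from the Padam implementation.

```python
def partially_adaptive_step(theta, m, v, alpha, p, eps=1e-8):
    """theta - alpha * m / v**p for a scalar parameter.

    p = 0 reduces to a momentum-SGD step and p = 0.5 to an Adam-style step.
    The eps term is an assumption added for numerical safety.
    """
    return theta - alpha * m / (v ** p + eps)


theta, m, v, alpha = 1.0, 0.1, 0.01, 1e-3
for p in (0.0, 0.125, 0.25, 0.5):
    new_theta = partially_adaptive_step(theta, m, v, alpha, p)
    print(p, theta - new_theta)  # stepsize for this exponent
```

In this example $v < 1$, so the stepsize grows with $p$: a smaller exponent makes the update less sensitive to the second-moment estimate.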
4 Related works
This work considers the update step in first-order methods. Other directions include Lookahead zhang2019lookahead which updates “fast” and “slow” weights separately, and is a wrapper that can combine with other optimizers; variance reduction methods reddi2016stochastic; johnson2013accelerating; ma2018quasi which reduce the variance in gradient; and LARS you2017scaling which uses a layer-wise learning rate scaling. AdaBelief can be combined with these methods. Other variants of Adam have been proposed (e.g. NosAdam huang2018nostalgic, Sadam wang2019sadam and Adax li2020adax).
Besides first-order methods, second-order methods (e.g. Newton’s method boyd2004convex, Quasi-Newton method and Gauss-Newton method wedderburn1974quasi; schraudolph2002fast, L-BFGS nocedal1980updating, Natural-Gradient amari1998natural; pascanu2013revisiting, Conjugate-Gradient hestenes1952methods) are widely used in conventional optimization. Hessian-free optimization (HFO) martens2010deep uses second-order methods to train neural networks. Second-order methods typically use curvature information and are invariant to scaling battiti1992first but have heavy computational burden, and hence are not widely used in deep learning.
5 Conclusion
We propose the AdaBelief optimizer, which adaptively scales the stepsize by the difference between predicted gradient and observed gradient. To our knowledge, AdaBelief is the first optimizer to achieve three goals simultaneously: fast convergence as in adaptive methods, good generalization as in SGD, and training stability in complex settings such as GANs. Furthermore, AdaBelief has the same parameters as Adam, hence is easy to tune. We validate the benefits of AdaBelief with intuitive examples, theoretical convergence analysis in both convex and nonconvex cases, and extensive experiments on real-world datasets.
Broader Impact
Optimization is at the core of modern machine learning, and numerous efforts have been put into it. To our knowledge, AdaBelief is the first optimizer to achieve fast speed, good generalization and training stability. AdaBelief can be used to train any model whose parameter gradients can be estimated numerically, and hence can boost the development and application of deep learning models; yet this work mainly focuses on the theory part, and the social impact is mainly determined by each application rather than by the optimizer.
References
Appendix
A. Detailed Algorithm of AdaBelief
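Notations By the convention in kingma2014adam, we use the following notations:

$f(\theta) \in \mathbb{R}$, $\theta \in \mathbb{R}^d$: $f$ is the loss function to minimize, $\theta$ is the parameter in $\mathbb{R}^d$

$g_t$: the gradient at step $t$

$\alpha, \epsilon$: $\alpha$ is the learning rate, default is $10^{-3}$; $\epsilon$ is a small number, typically set as $10^{-8}$

$\beta_1, \beta_2$: smoothing parameters, typical values are $\beta_1 = 0.9$, $\beta_2 = 0.999$

$m_t$: exponential moving average (EMA) of $g_t$

$v_t, s_t$: $v_t$ is the EMA of $g_t^2$, $s_t$ is the EMA of $(g_t - m_t)^2$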
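With the notation above, a version of the update that includes bias correction can be sketched as follows. This is an illustration under stated assumptions, not a transcription of the authors' detailed algorithm: it applies Adam-style bias correction to both $m_t$ and $s_t$, handles one scalar parameter, and adds $\epsilon$ only in the final denominator; the authors' algorithm may differ in such details. The class name `AdaBeliefScalar` is hypothetical.

```python
import math


class AdaBeliefScalar:
    """Sketch of AdaBelief with Adam-style bias correction (scalar parameter)."""

    def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.alpha, self.beta1, self.beta2, self.eps = alpha, beta1, beta2, eps
        self.m = 0.0  # EMA of g_t
        self.s = 0.0  # EMA of (g_t - m_t)^2
        self.t = 0    # step counter

    def step(self, theta, g):
        """Return the updated parameter given the current gradient g."""
        self.t += 1
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * g
        self.s = self.beta2 * self.s + (1.0 - self.beta2) * (g - self.m) ** 2
        # Assumed bias correction, in the same form as in Adam.
        m_hat = self.m / (1.0 - self.beta1 ** self.t)
        s_hat = self.s / (1.0 - self.beta2 ** self.t)
        return theta - self.alpha * m_hat / (math.sqrt(s_hat) + self.eps)


# Usage: a few steps on f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
opt = AdaBeliefScalar()
theta = 0.0
for _ in range(5):
    theta = opt.step(theta, 2.0 * (theta - 3.0))
    print(theta)
```

In this sketch the very first step has magnitude close to $\alpha / \beta_1$ whatever the gradient scale, because $m_1 = (1 - \beta_1) g_1$ gives $\hat{m}_1 = g_1$ and $\hat{s}_1 = (\beta_1 g_1)^2$.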

B. Convergence analysis in convex online learning case (Theorem 2.1 in main paper)
For the ease of notation, we absorb into . Equivalently, . For simplicity, we omit the debiasing step in theoretical analysis as in reddi2019convergence. Our analysis can be applied to the debiased version as well.
Lemma .1.
mcmahan2010adaptive For any and convex feasible set , suppose and , then we have .
Theorem .2.
Let and be the sequence obtained by the proposed algorithm, let , , . Let , where is a convex feasible set with bounded diameter . Assume is a convex function and (hence ) and . Denote the optimal point as . For generated with Algorithm 3, we have the following bound on the regret:
Proof:
Note that since . Use and to denote the th dimension of and respectively. From lemma (.1), using and , we have:
(1) 
Note that and , rearranging inequality (B. Convergence analysis in convex online learning case (Theorem 2.1 in main paper)), we have:
(2) 
By convexity of , we have: