Can speed up the convergence rate of stochastic gradient methods to \mathcal{O}(1/k^{2}) by a gradient averaging strategy?

# Can speed up the convergence rate of stochastic gradient methods to O(1/k2) by a gradient averaging strategy?

## Abstract

In this paper we consider the question of whether it is possible to apply a gradient averaging strategy to improve on the sublinear convergence rates without any increase in storage. Our analysis reveals that a positive answer requires an appropriate averaging strategy and iterations that satisfy the variance dominant condition. As an interesting fact, we show that if the iterative variance we defined is always dominant even a little bit in the stochastic gradient iterations, the proposed gradient averaging strategy can increase the convergence rate to in probability for the strongly convex objectives with Lipschitz gradients. This conclusion suggests how we should control the stochastic gradient iterations to improve the rate of convergence.

Speed up the convergence rate of stochastic gradients?X Xu and X Luo \firstpageno1

\editor{keywords}

Stochastic optimization, Stochastic gradient, Convergence rate, Speed up, Strongly convex

## 1 Introduction

In this paper we consider the question of whether it is possible to apply a gradient averaging strategy to improve on the sublinear convergence rates without any increase in storage for stochastic gradient (SG) methods. The SG method is the popular methodology (Zinkevich, 2003; Zhang, 2004; Bottou and Bousquet, 2007; Nemirovski et al., 2009; Shalev-Shwartz et al., 2011) for solving the following class of stochastic optimization problems:

 x∗=argminx∈RdF(x), (1)

where the real-valued function is defined by

 F(x):=Eξ[f(x,ξ)]=∫Ξf(x,ξ)dP(ξ), (2)

and can be defined as a collection of real-valued functions with a certain probability distribution over the index set .

With an initial point , these methods are characterized by the iteration

 xk+1=xk−αkg(xk,ξk), (3)

where is the stepsize and is the stochastic gradient defined by

 g(xk,ξk)={∇f(xk,ξk),1nk∑nki=1∇f(xk,ξk,i), (4)

which is usually assumed to be an unbiased estimate of the socalled full gradient , i.e., (Shapiro et al., 2009; Bottou et al., 2018). The SG method was originally developed by Robbins and Monro (1951) for smooth stochastic approximation problems. It is guaranteed to achieve the sublinear convergence rate for strongly convex objectives (Nemirovski et al., 2009; Bottou et al., 2018) and this theoretical rate is also supported by practical experience (Bottou et al., 2018). In particular, the practical performance of stochastic gradient methods with momentum has made them popular in the community working on training DNNs (Sutskever et al., 2013); in fact, the approach could be viewed as a gradient averaging strategy. While momentum can lead to improved practical performance, it is still not known to lead to a faster convergence rate. Usually, the gradient averaging strategy and its variants can improve the constants in the convergence rate (Xiao, 2010), it does not improve on the sublinear convergence rates for SG methods. However, owing to the successful practical performance of gradient averaging, it is worth considering whether it is possible to improve the convergence rate, which forms the starting point of this work.

The primary contribution of this work is to show that under the variance dominant condition (Assumption 3), the proposed gradient averaging strategy could improve the convergence rate to in probability without any increase in storage for the strongly convex objectives with Lipschitz gradients. This result also suggests how we should control the stochastic gradient iterations to improve the rate of convergence in practice. In particular, our averaging strategy coordinates the relationship between the mean and variance of the increment of the iteration, so that the growth of expectation can be controlled when the variance is reduced.

### 1.1 Related Work

We briefly review several methods related to the new strategy, mainly including stochastic gradient with momentum (SGM), gradient averaging, stochastic variance reduced gradient (SVRG), SAGA and iterate averaging.

SGM. With an initial point , two scalar sequences and that are either predetermined or set dynamically, and , SGM uses iterations of the form (Tseng, 1998; Bottou et al., 2018)

 xk+1=xk−αkg(xk,ξk)+βk(xk−xk−1).

They are procedures in which each step is chosen as a combination of the stochastic gradient direction and the most recent iterate displacement. It is common to set and as some fixed constants, and in this case we can rewrite the SGM iteration as

 xk+1=xk−αk∑i=1βk−ig(xi,ξi)=xk−α′k∑ki=1βk−ik∑i=1βk−ig(xi,ξi). (5)

So it is clear that SGM is a weighted average of all previous stochastic gradient directions. In deterministic settings, it is referred to as the heavy ball method (Polyak, 1964). While SGM can lead to improved practical performance, it is still not known to lead to a faster convergence rate. Moreover, see Remark 4.3 in Section 4 for a variance analysis of SGM.

 xk+1=xk−αkkk∑i=1g(xi,ξi). (6)

This approach is used in the dual averaging method (Nesterov, 2009). Compared with our new strategy, this method reduces the variance to a similar order without considering the stepsize , however, its expectation is not well controlled, for details see Remark 4.2 in Section 4. So as mentioned above, it can improve the constants in the convergence rate (Xiao, 2010) but does not improve on the sublinear convergence rates.

SVRG. SVRG is designed to minimize the objective function of the form of a finite sum (Johnson and Zhang, 2013), i.e.,

 F(x)=1nn∑i=1fi(x).

The method is able to achieve a linear convergence rate for strongly convex problems, i.e.,

 E[F(xk+1)−F∗]⩽ρE[F(xk)−F∗].

SVRG needs to compute the batch gradients and has two parameters that needs to be set: besides the stepsize , there is an additional parameter , namely the number of iterations per inner loop. In order to guarantee a linear convergence theoretically, one needs to choose and such that

 ρ:=11−2αL(1mαl+2αL)<1,

where and are given in Assumption 1. Without explicit knowledge and , the lengths of the inner loop and the stepsize are typically both chosen by experimentation. This improved rate is achieved by either an increase in computation or an increase in storage. Hence, SVRG usually can not beat SG for very large (Bottou et al., 2018).

SAGA. SAGA has its origins in the stochastic average gradient (SAG) algorithm (Le Roux et al., 2012; Schmidt et al., 2017); moreover, the SAG algorithm is a randomized version of the incremental aggregated gradient (IAG) method proposed in Blatt et al. (2007) and analyzed in Gürbüzbalaban et al. (2017). Compared with SVRG, SAGA is to apply an iteration that is closer in form to SG in that it does not compute batch gradients except possibly at the initial point, and SAGA has a practical advantage that there is only one parameter (the stepsize ) to tune instead of two. Beyond its initialization phase, the per-iteration cost of SAGA is the same as in a SG method; but it has been shown that the method can also achieve a linear rate of convergence for strongly convex problems (Defazio et al., 2014). However, the price paid to reach this rate is the need to store stochastic gradient vectors for general cases except logistic and least squares regression (Bottou et al., 2018), which would be prohibitive in many large-scale applications.

Iterate Averaging. Rather than averaging the gradients, some authors propose to perform the basic SG iteration and try to use an average over iterates as the final estimator (Polyak, 1991; Polyak and Juditsky, 1992). Since SG generates noisy iterate sequences that tend to oscillate around minimizers during the optimization process, the iterate averaging would possess less noisy behavior (Bottou et al., 2018). It is shown that suitable iterate averaging strategies obtain an rate for strongly convex problems even for non-smooth objectives (Hazan and Kale, 2014; Rakhlin et al., 2012). However, none of these methods improve on the sublinear convergence rates (Schmidt et al., 2017).

### 1.2 Paper Organization

The next section introduces the assumptions we used for establishing convergence results, especially, the variance dominant condition ( Assumption 3). Then the new strategy is discussed in detail in Section 3. In Section 4, we show that under the variance dominant condition, the proposed strategy could increase the convergence rate to in probability for the strongly convex objectives with Lipschitz gradients, which suggests how we should control the stochastic gradient iterations to improve the rate of convergence. And we draw some conclusions in Section 5.

## 2 Assumptions

### 2.1 Assumption on the objective

First, let us begin with a basic assumption of smoothness of the objective function. Such an assumption is essential for convergence analyses of most gradient-based methods (Bottou et al., 2018).

###### Assumption  1 (Strongly convex objectives with Lipschitz-continuous gradients)

The objective function is continuously differentiable and there exist such that, for all ,

 ∥∇F(x′)−∇F(x)∥2⩽L∥x′−x∥2  and (7) F(x′)⩾F(x)+∇F(x)T(x′−x)+l2∥x′−x∥22. (8)

The inequality (7) ensures that the gradient of the objective is bounded and does not change arbitrarily quickly with respect to the parameter vector, which implies that

 |F(x′)−F(x)−∇F(x)T(x′−x)|⩽L2∥x′−x∥22  for all  x′,x∈Rd. (9)

This inequality (9) is an important basis for performing so-called mean-variance analyses for stochastic iterative sequences (Bottou et al., 2018; Luo and Xu, 2019). The inequality (8) is called a strong convexity condition, which is often used to ensure a sublinear convergence for the stochastic gradient methods; and the role of strong sonvexity may be essential for such rates of convergence (Nemirovski et al., 2009; Bottou et al., 2018). Under the strong sonvexity assumption, the gap between the value of the objective and the minima can be bounded by the squared -norm of the gradient of the objective:

 2l(F(x)−F∗)⩽∥∇F(x)∥22  for all  x∈Rd. (10)

This is referred to as the Polyak-Łojasiewicz inequality which was originally introduced by Polyak (1963). It is a sufficient condition for gradient descent to achieve a linear convergence rate; and it is also a special case of the Łojasiewicz inequality proposed in the same year (Łojasiewicz, 1963), which gives an upper bound for the distance of a point to the nearest zero of a given real analytic function.

### 2.2 Assumption on the variance

We follow Bottou et al. (2018) to make the following assumption about the variance of stochastic gradients, i.e., . It states that the variance of is restricted in a relatively minor manner.

###### Assumption  2 (Variance limit)

The objective function and the stochastic gradient satisfy there exist scalars and such that, for all ,

 Vξk[g(xk,ξk)]⩽M+MV∥∇F(xk)∥22. (11)

### 2.3 Assumption on the iteration

Now we make the following variance dominant assumption. It states that the iterative variance is always dominant even a little bit in the stochastic gradient iterations. This assumption guarantees that the proposed strategy could achieve the convergence rate in probability for the strongly convex objectives.

###### Assumption  3 (Variance dominant)

The sequence of iterates satisfy for all , there is a fixed (could be arbitrarily small) such that

 ∥E[xj−xi]∥22=O(j−κV[xj−xi]), (12)

where denotes the historical expectation operator defined as and the variance is defined as

 V[xj−xi]:=E[∥xj−xi∥22]−∥E[xj−xi]∥22.

When , then is strictly less than . And obviously, Assumption 3 implies that

 ∥E[xj−xi]∥22=O(j−κE[∥xj−xi∥22]) (13)

and

 V[xj−xi]=O(E[∥xj−xi∥22]).

## 3 Algorithms

### 3.1 Methods

Our accelerated method is procedures in which each step is chosen as a weighted average of all historical stochastic gradients. And specifically, with an initial point , the method is characterized by the iteration

 xk+1←xk+αkmk, (14)

where the weighted average direction

 mk=−1∑ki=1ipk∑j=1jpg(xj,ξj),  p>0. (15)

Here, is the weighted average of past gradients and the values of mean different weighting methods. A larger value of makes us focus on more recent gradients, as shown in Figure 1; and the recommended weighting method is to choose , which uses about the nearest historical gradients.

Moreover, the method (14) can be equivalently rewritten by the iteration

 xk+1←xk+αkvk∑ki=1(ik)p, (16)

where the direction vector is recursively defined as

 vk=(k−1k)pvk−1−g(xk,ξk)=−k∑i=1(ik)pg(xi,ξi),

which could be viewed as the classical stochastic gradient with momentum where the decay factor depends on .

We now define our accelerated method as Algorithm 1. The algorithm presumes that three computational tools exist: (i) a mechanism for generating a realization of random variable (with representing a sequence of jointly independent random variables); (ii) given an iteration number , a mechanism for computing a scalar stepsize ; and (iii) given an iterate and the realization of , a mechanism for computing a stochastic vector and a scalar .

### 3.2 Stepsize Policy

For strongly convex objectives, we consider the stepsize sequence taking the form

 αk=sk+σ  for some  s>4l  and  σ>0  such that  α1⩽1LM(1)G,p; (17)

where the constant will be discussed in Lemma 4.3.

Notice that the accelerated method and the stochastic gradient method are exactly the same in the first iteration. So we assume, without loss of generality, that the first iterations generated by (14) has the sublinear convergence rate under Assumptions 1 and 2, that is, for every , we have

 E[F(xj)]−F∗=O(1/j); (18)

then we shall prove by induction on that the accelerated method maintains the sublinear convergence rate under Assumptions 1 and 2; and furthermore, we shall also prove that this method can achieve a convergence rate under Assumptions 1 to 3.

It follows from (18) and Assumption 1 that

 E[∥xj−x∗∥22]⩽2l−1(E[F(xj)]−F∗)=O(j−1). (19)

Since , it follows from (19) that

 E[∥xj−xk∥22]⩽ E[∥xj−x∗∥2+∥xk−x∗∥2]2 = E[∥xj−x∗∥22+2∥xj−x∗∥2∥xk−x∗∥2+∥xk−x∗∥22] ⩽ 2E[∥xj−x∗∥22+∥xk−x∗∥22]=O(j−1),

and then, we obtain

 ∥E[xj−xk]∥22⩽E[∥xj−xk∥22]=O(j−1). (20)

Together with Assumption 3, we further obtain

 ∥E[xj−xk]∥22=O(j−κE[∥xj−xk∥22])=O(j−1−κ). (21)

And we will finally show that (21) implies actually in Section 4.

On the basis of (20), (21) and the stepsize policy (17), we first prove two Lemmas which are necessary for the following convergence analysis. {lemma} Under the conditions of (18), suppose that the sequence of iterates is generated by (14) with a stepsize sequence taking the form (17). Then, there is such that for any given diagonal matrix with , the inequality

 1∑ki=1ip∥∥∥k∑j=1jpΛ(xj−xk)∥∥∥2⩽√αkLD′p2 (22)

holds in probability. {proof} Note that for any ,

together with (17), we have ; thus, to prove (22), we only need to show that

holds in probability. Using the mean-variance analysis, it follows from and (20) that

 E[k∑j=1jpΛ(xj−xk)]=Λk∑j=1jp E[xj−xk]=O(Λk∑j=1jp−12)=O(kp+12);

meanwhile, according to (20), we also have

 V[k∑j=1jpΛ(xj−xk)]=Λ2k∑j=1j2p V[xj−xk]⩽Λ2k∑j=1j2p−1=O(k2p). (24)

Using Chebyshev’s inequality, there is a such that for ,

 P(∥∥∥k∑j=1jpΛ(xj−xk)−Ckp+12Λ∥∥∥2⩾ϵkpΛ)⩽1ϵ2,

which gives the inequality (22) in probability.

Under Assumption 3, it is clear that (22) could be further strengthened. {lemma} Suppose the conditions of Lemma 3.2 and Assumption 3 hold. Then, there is such that for any given diagonal matrix with , the inequality

 1∑ki=1ip∥∥∥k∑j=1jpΛ(xj−xk)∥∥∥2⩽αskLDp2 (25)

holds in probability, where . {proof} Note that (23), we have ; thus, to prove (25), we only need to show that

holds in probability. First, it follows from and (21) that

 E[k∑j=1jpΛ(xj−xk)]=Λk∑j=1jp E[xj−xk]=O(Λk∑j=1jp−1+κ2)=O(kp+1−κ2);

together with (24) and using Chebyshev’s inequality, there is a such that

 P(∥∥∥k∑j=1jpΛ(xj−xk)−Ckp+1−κ2Λ∥∥∥2⩾ϵVkpΛ)⩽1ϵ2.

It is worth noting that when , the variance will become the principal part; so the proof is complete.

## 4 Convergence Results

### 4.1 Mean-Variance Framework

The mean-variance framework can be described as a fundamental lemma for any iteration based on random steps, which relies only on Assumption 1 and is a slight generalization of Lemma 4.2 in Bottou et al. (2018).

{lemma}

Under Assumption 1, if for every , is any random vector independent of and is a stochastic step depending on , then the iteration

 xk+1=xk+s(xk,ξk)

satisfy the following inequality

 Eξk[F(xk+1)]−F(xk)⩽ ∇F(xk)TEξk[s(xk,ξk)]+L2∥Eξk[s(xk,ξk)]∥22+L2Vξk[s(xk,ξk)],

where the variance of is defined as

 Vξk[s(xk,ξk)]:=Eξk[∥s(xk,ξk)∥22]−∥Eξk[s(xk,ξk)]∥22. (26)
{proof}

According to the inequality (9), the iteration satisfy

 F(xk+1)−F(xk)⩽ ∇F(xk)T(xk+1−xk)+L2∥xk+1−xk∥22 ⩽ ∇F(xk)Ts(xk,ξk)+L2∥s(xk,ξk)∥22.

Noting that is independent of and taking expectations in these inequalities with respect to the distribution of , we obtain

 Eξk[F(xk+1)]−F(xk)⩽∇F(xk)TEξk[s(xk,ξk)]+L2Eξk[∥s(xk,ξk)∥22].

Recalling (26), we finally get the desired bound.

Regardless of the states before , the expected decrease in the objective function yielded by the th stochastic step , say, , could be bounded above by a quantity involving the expectation and variance .

### 4.2 Expectation Analysis

Now we will analyze the mean of to get the bounds of and , where is the historical expectation of . First, according to the definition (15) of the weighted average direction , we have

 E[mk]=−1∑ki=1ipk∑j=1jpEξj[g(xj,ξj)]=−1∑ki=1ipk∑j=1jp∇F(xj).

Further, from Assumption 1, we have

 ∥∇F(xj)−∇F(xk)∥∞⩽∥∇F(xj)−∇F(xk)∥2⩽L∥xj−xk∥2,

then there is a diagonal matrix with such that

 ∇F(xj)=∇F(xk)+Λ(xj−xk). (27)

Therefore, could be written as

 E[mk]=−∇F(xk)+1∑ki=1ipk∑j=1jpΛ(xk−xj), (28)

where is a diagonal matrix with .

Together with Lemma 3.2, we get the following bounds of and . {theorem} Suppose the conditions of Lemma 3.2 hold. Then for every , the following conditions

 ∥E[mk]∥2⩽ ∥∇F(xk)∥2+√αkLD′p2  and (29) ∇F(xk)TE[mk]⩽ −∥∇F(xk)∥22+√αkLD′p2∥∇F(xk)∥2 (30)

hold in probability. {remark} For the gradient averaging method mentioned (6) in Subsection 1.1, it is using the average of all previous gradients,

 xk+1=xk+αkm′k,

where

 m′k=−1kk∑i=1g(xi,ξi).

By (27), the historical expectation of could be written as

 E[m′k]=−1kk∑j=1Eξj[g(xj,ξj)]=−1kk∑j=1∇F(xj)=−∇F(xk)+R′,

where

 R′=1kk∑j=1Λ(xk−xj).

And the bound of is , which decays slower than , i.e., described in Theorem 4.2. {proof} According to (28) and Lemma 3.2, it follows that

 E[mk]=−∇F(xk)+R,

where the vector and . Thus, one obtains (29) and (30) by noting that

 ∥E[mk]∥2⩽∥∇F(xk)∥2+∥R∥2

and

 ∇F(xk)TE[mk]=−∥∇F(xk)∥22+∇F(xk)TR⩽−∥∇F(xk)∥22+∥∇F(xk)∥2∥R∥2,

so the proof is complete.

Under Assumption 3, both (29) and (30) can be further improved. {theorem} Suppose the conditions of Lemma 3.2 hold. Then for every , the following conditions

 ∥E[mk]∥2⩽ ∥∇F(xk)∥2+αskLDp2  and (31) ∇F(xk)TE[mk]⩽ −∥∇F(xk)∥22+αskLDp2∥∇F(xk)∥2 (32)

hold in probability, where . {proof} According to Lemma 3.2, we have , and the desired results could be proved in the same way as the proof of Theorem 4.2.

### 4.3 Variance Analysis

Now we will analyze the variance of to get the bound of . As an important result, we will show that the variance of tends to zero with a rate as grows. {lemma} Under Assumption 2, suppose that the sequence of iterates is generated by (14) with for any . Then (17), then

 V[mk]⩽Cpαk(M+2MV∥∇F(xk)∥22+2MVL2D2), (33)

where is positive real constant. {remark} For the SGM method (5) mentioned in Subsection 1.1, it is using the weighted average of all previous gradients,

 xk+1=xk+αkm′′k,

where

 m′′k=−1∑ki=1βk−ik∑i=1βk−ig(xi,ξi).

Since

 1(∑ki=1βk−i)2k∑j=1β2(k−j)=1−β2k1−β2(1−β)2(1−βk)2=1−β1+β1+βk1−βk,

it follows that

Together with the proof below, one can find that the variance of could be controlled by a by a fixed fraction . {proof} It follows from (27) and that

 ∥∇F(xj)∥2⩽∥∇F(xk)∥2+L∥xj−xk∥2⩽∥∇F(xk)∥2+LD,

together with the Arithmetic Mean Geometric Mean inequality, we have

 ∥∇F(xj)∥22⩽ (∥∇F(xk)∥2+LD)2 ⩽ ∥∇F(xk)∥22+L2D2+2LD∥∇F(xk)∥2 ⩽ 2∥∇F(xk)∥22+2L2D2.

Hence, along with Assumption 2, we obtain

 V[mk]= 1(∑ki=1ip)2k∑j=1j2p Vξj[g(xj,ξj)] ⩽ 1(∑ki=1ip)2k∑j=1j2p(M+MV∥∇F(xj)∥22) ⩽ ∑ki=1j2p(∑ki=1ip)2(M+2MV∥∇F(xk)∥22+2MVL2D2).

According to (23), and ; therefore, we have , and the proof is complete.

Combining Theorem 4.2 and Lemma 4.3, we can obtain a bound for each iteration of the accelerated method. {lemma} Under the conditions of Theorem 4.2 and Lemma 4.3, suppose that the stepsize sequence satisfies . Then, the inequality

 Eξk[F(xk+1)]−F(xk)⩽−3αk4∥∇F(xk)∥22+α2kLM(k)G,p2∥∇F(xk)∥22+α2kLM(k)d,p,12

holds in probability, where and ; further,

 limk→∞M(k)G,p=32  and  limk→∞M(k)d,p,1=5LD′p24.
{proof}

According to (29), together with the Arithmetic Mean Geometric Mean inequality, one obtains

 ∥∥Eξk[mk]∥∥22⩽ (∥∇F(xk)∥2+