From Averaging to Acceleration, There is Only a Step-size
Abstract
We show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly-convex problems may be reformulated as constant parameter second-order difference equation algorithms, where stability of the system is equivalent to convergence at rate $O(1/n^2)$, where $n$ is the number of iterations. We provide a detailed analysis of the eigenvalues of the corresponding linear dynamical system, showing various oscillatory and non-oscillatory behaviors, together with a sharp stability result with explicit constants. We also consider the situation where noisy gradients are available, where we extend our general convergence result, which suggests an alternative algorithm (i.e., with different step-sizes) that exhibits the good aspects of both averaging and acceleration.
1 Introduction
Many problems in machine learning are naturally cast as convex optimization problems over a Euclidean space; for supervised learning this includes least-squares regression, logistic regression, and the support vector machine. Faced with large amounts of data, practitioners often favor first-order techniques based on gradient descent, leading to algorithms with many cheap iterations. For smooth problems, two extensions of gradient descent have had important theoretical and practical impacts: acceleration and averaging.
Acceleration techniques date back to Nesterov (1983) and have their roots in momentum techniques and conjugate gradient (Polyak, 1987). For convex problems, with an appropriately weighted momentum term, which requires storing two iterates, Nesterov (1983) showed that the traditional convergence rate of $O(1/n)$ for the function values after $n$ iterations of gradient descent goes down to $O(1/n^2)$ for accelerated gradient descent, such a rate being optimal among first-order techniques that can access only sequences of gradients (Nesterov, 2004). Like conjugate gradient methods for solving linear systems, these methods are however more sensitive to noise in the gradients; that is, to preserve their improved convergence rates, significantly less noise may be tolerated (d'Aspremont, 2008; Schmidt et al., 2011; Devolder et al., 2014).
Averaging techniques, which consist in replacing the iterates by the average of all iterates, have also been thoroughly considered, either because they sometimes lead to simpler proofs, or because they lead to improved behavior. In the noiseless case where gradients are exactly available, they do not improve the convergence rate in the convex case; worse, for strongly-convex problems, they are not linearly convergent while regular gradient descent is. Their main advantage comes with random unbiased gradients, where it has been shown that they lead to better convergence rates than their unaveraged counterparts, in particular because they allow larger step-sizes (Polyak and Juditsky, 1992; Bach and Moulines, 2011). For example, for least-squares regression with stochastic gradients, they lead to convergence rates of $O(1/n)$, even in the non-strongly-convex case (Bach and Moulines, 2013).
In this paper, we show that for quadratic problems, averaging and acceleration are two instances of the same second-order finite difference equation, with different step-sizes. They may thus be analyzed jointly, together with a non-strongly-convex version of the heavy-ball method (Polyak, 1987, Section 3.2). In the presence of random zero-mean noise on the gradients, this joint analysis allows us to design a novel intermediate algorithm that exhibits the good aspects of both acceleration (quick forgetting of initial conditions) and averaging (robustness to noise).
In this paper, we make the following contributions:

We show in Section 2 that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly-convex problems may be reformulated as constant parameter second-order difference equation algorithms, where stability of the system is equivalent to convergence at rate $O(1/n^2)$.

In Section 3, we provide a detailed analysis of the eigenvalues of the corresponding linear dynamical system, showing various oscillatory and non-oscillatory behaviors, together with a sharp stability result with explicit constants.

In Section 4, we consider the situation where noisy gradients are available, where we extend our general convergence result, which suggests an alternative algorithm (i.e., with different step-sizes) that exhibits the good aspects of both averaging and acceleration.

In Section 5, we illustrate our results with simulations on synthetic examples.
2 Second-Order Iterative Algorithms for Quadratic Functions
Throughout this paper, we consider minimizing a convex quadratic function $f : \mathbb{R}^d \to \mathbb{R}$ defined as:
(1) $f(\theta) = \tfrac{1}{2}\langle \theta, H\theta\rangle - \langle q, \theta\rangle,$
with $H$ a symmetric positive semi-definite matrix and $q \in \mathbb{R}^d$. Without loss of generality, $H$ is assumed invertible (by projecting onto the orthogonal complement of its null space), though its eigenvalues could be arbitrarily small. The solution is known to be $\theta^\ast = H^{-1}q$, but the inverse of the Hessian is often too expensive to compute when $d$ is large. The excess cost function may be simply expressed as $f(\theta) - f(\theta^\ast) = \tfrac{1}{2}\langle \theta - \theta^\ast, H(\theta - \theta^\ast)\rangle$.
2.1 Second-order algorithms
In this paper we study second-order iterative algorithms of the form:
(2) $\theta_n = A_n\theta_{n-1} + B_n\theta_{n-2} + c_n,$
started with $\theta_0 = \theta_1$ in $\mathbb{R}^d$, with $A_n, B_n \in \mathbb{R}^{d\times d}$, and $c_n \in \mathbb{R}^d$ for all $n$. We impose the natural restriction that the optimum $\theta^\ast$ is a stationary point of this recursion, that is, for all $n$:
(stationarity) $\theta^\ast = A_n\theta^\ast + B_n\theta^\ast + c_n.$
By letting $\tilde\theta_n = \theta_n - \theta^\ast$ we then have $\tilde\theta_n = A_n\tilde\theta_{n-1} + B_n\tilde\theta_{n-2}$, started from $\tilde\theta_0 = \tilde\theta_1 = \theta_0 - \theta^\ast$. Thus, we restrict our problem to the study of the convergence of an iterative system to $0$.
In connection with accelerated methods, we are interested in algorithms for which $f(\theta_n) - f(\theta^\ast)$ converges to $0$ at a speed of $O(1/n^2)$. Within this context we impose that $A_n$ and $B_n$ have the form:
(n-scalability) $A_n = \tfrac{n-1}{n}A \quad \text{and} \quad B_n = \tfrac{n-2}{n}B.$
By letting $\eta_n = n(\theta_n - \theta^\ast)$, we can now study the simple iterative system with constant terms $\eta_n = A\eta_{n-1} + B\eta_{n-2}$, started at $\eta_0 = 0$ and $\eta_1 = \theta_1 - \theta^\ast$. Showing that the function values $\tfrac{1}{2}\langle \eta_n, H\eta_n\rangle$ remain bounded, we directly have the convergence of $f(\theta_n)$ to $f(\theta^\ast)$ at the speed $O(1/n^2)$, since $f(\theta_n) - f(\theta^\ast) = \tfrac{1}{2n^2}\langle \eta_n, H\eta_n\rangle$. Thus the n-scalability property allows us to switch from a convergence problem to a stability problem.
For feasibility concerns, the method can only access $H$ through matrix-vector products. Therefore $A_n$ and $B_n$ should be polynomials in $H$, and $c_n$ a polynomial in $H$ times $q$, if possible of low degree. The following theorem clarifies the general form of iterative systems which share these three properties (see proof in Appendix B).
Theorem 1.
Let $A_n$, $B_n$ and $c_n$ be, for all $n$, polynomials in $H$ ($c_n$ being a polynomial in $H$ times $q$). If the iterative algorithm defined by Eq. (2) satisfies the stationarity and n-scalability properties, there are polynomials $\alpha$ and $\beta$ such that:
$A_n = \tfrac{n-1}{n}\big(2I - \alpha(H)H\big), \qquad B_n = -\tfrac{n-2}{n}\big(I - \beta(H)H\big), \qquad c_n = \tfrac{1}{n}\big[(n-1)\alpha(H) - (n-2)\beta(H)\big]q.$
Note that our result prevents $A_n$ and $B_n$ from being zero, thus requiring the algorithm to strictly be of second order. This illustrates the fact that first-order algorithms such as gradient descent do not achieve a convergence rate in $O(1/n^2)$.
We now restrict our class of algorithms to the lowest possible order polynomials, that is, $\alpha(H) = \alpha I$ and $\beta(H) = \beta I$ with $(\alpha, \beta) \in \mathbb{R}^2$, which correspond to the fewest matrix-vector products per iteration, leading to the constant-coefficient recursion for $\eta_n$:
(3) $\eta_n = (2I - \alpha H)\eta_{n-1} - (I - \beta H)\eta_{n-2}.$
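As an illustration (our own sketch, not code from the original experiments; the scalar version of the recursion, the eigenvalue $h$ and the step-sizes below are our choices), the recursion can be iterated in a single eigenspace of $H$, and the excess cost $h\,\eta_n^2/(2n^2)$ read off directly; stability of $\eta_n$ then visibly yields an $O(1/n^2)$ rate:

```python
import numpy as np

# Iterate eta_n = (2 - alpha*h)*eta_{n-1} - (1 - beta*h)*eta_{n-2}
# in one eigenspace with eigenvalue h, from eta_0 = 0, eta_1 = 1.
def run_recursion(h, alpha, beta, eta1=1.0, n_iters=1000):
    etas = [0.0, eta1]
    for _ in range(2, n_iters + 1):
        etas.append((2 - alpha * h) * etas[-1] - (1 - beta * h) * etas[-2])
    return np.array(etas)

h, gamma = 0.1, 1.0
eta = run_recursion(h, alpha=2 * gamma, beta=gamma)  # acceleration-like step-sizes
n = np.arange(1, len(eta))
gaps = h * eta[1:] ** 2 / (2 * n ** 2)               # excess cost h*eta_n^2/(2n^2)

# eta stays bounded, so the excess cost decays at least as O(1/n^2)
print(np.abs(eta).max(), gaps[-1])
```

With these particular step-sizes the two roots are complex with modulus less than one, so $\eta_n$ in fact decays linearly and the excess cost vanishes quickly.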
Expression with gradients of $f$.
The recursion in Eq. (3) may be written with gradients of $f$ in multiple ways. In order to preserve the parallel with accelerated techniques, we rewrite it as:
(4) $\theta_n = \theta_{n-1} + \tfrac{n-2}{n}(\theta_{n-1} - \theta_{n-2}) - \tfrac{(n-1)\alpha - (n-2)\beta}{n}\, f'\Big(\tfrac{(n-1)\alpha\,\theta_{n-1} - (n-2)\beta\,\theta_{n-2}}{(n-1)\alpha - (n-2)\beta}\Big).$
It may be interpreted as a modified gradient recursion with two potentially different affine (i.e., with coefficients that sum to one) combinations of the two past iterates: one for the momentum term, and one for the point where the gradient is evaluated. This reformulation will also be crucial when using noisy gradients. The allowed values for $(\alpha, \beta)$ will be determined in the following sections.
2.2 Examples
Averaged gradient descent.
We consider averaged gradient descent (referred to from now on as "AvGD") (Polyak and Juditsky, 1992) with step-size $\gamma$, defined by:
$\varphi_n = \varphi_{n-1} - \gamma f'(\varphi_{n-1}), \qquad \theta_n = \tfrac{1}{n}\sum_{k=1}^{n}\varphi_k.$
When computing the average online as $\theta_n = \tfrac{n-1}{n}\theta_{n-1} + \tfrac{1}{n}\varphi_n$ and seeing the average $\theta_n$ as the main iterate, the algorithm becomes (see proof in Appendix B.2):
$\theta_n = \theta_{n-1} + \tfrac{n-2}{n}(\theta_{n-1} - \theta_{n-2}) - \tfrac{\gamma}{n} f'\big((n-1)\theta_{n-1} - (n-2)\theta_{n-2}\big).$
This corresponds to Eq. (4) with $\alpha = \gamma$ and $\beta = \gamma$.
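This equivalence is easy to check numerically. The sketch below (our own, with arbitrary values for $h$ and $\gamma$) runs gradient descent with online averaging on the scalar quadratic $f(\theta) = h\theta^2/2$ (so $\theta^\ast = 0$), then replays the second-order form $n\theta_n = (2-\gamma h)(n-1)\theta_{n-1} - (1-\gamma h)(n-2)\theta_{n-2}$ seeded with the first two averages, and compares the two sequences:

```python
h, gamma, N = 0.5, 1.0, 40
phi, run = 1.0, 0.0            # GD iterate phi_0 = 1 and running sum
avg = []                       # theta_n = (1/n) * sum_{k=1}^n phi_k
for n in range(1, N + 1):
    phi -= gamma * h * phi     # exact gradient step: f'(phi) = h * phi
    run += phi
    avg.append(run / n)

# Replay the second-order form, seeded with the first two averages.
second = avg[:2]
for n in range(3, N + 1):
    second.append(((2 - gamma * h) * (n - 1) * second[-1]
                   - (1 - gamma * h) * (n - 2) * second[-2]) / n)

print(max(abs(a - b) for a, b in zip(avg, second)))  # agreement to machine precision
```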
Accelerated gradient descent.
We consider accelerated gradient descent (referred to from now on as "AccGD") (Nesterov, 1983) with step-sizes $(\gamma, \delta_n)$:
$\theta_n = \omega_{n-1} - \gamma f'(\omega_{n-1}), \qquad \omega_n = \theta_n + \delta_n(\theta_n - \theta_{n-1}), \qquad \text{with } \delta_n = \tfrac{n-1}{n+1}.$
For smooth optimization the accelerated literature (Nesterov, 2004; Beck and Teboulle, 2009) uses the step-size $\gamma = 1/L$, and their results are not valid for bigger step-sizes up to $2/L$. However such a step-size is compatible with the framework of Lan (2012) and is more convenient for our set-up. This corresponds to Eq. (4) with $\alpha = 2\gamma$ and $\beta = \gamma$. Note that accelerated techniques are more generally applicable, e.g., to composite optimization with smooth functions (Nesterov, 2013; Beck and Teboulle, 2009).
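As a sanity check (our own sketch; the momentum coefficient $(n-2)/n$ at iteration $n$, equivalent to $\delta_n = (n-1)/(n+1)$, and the numerical values are our assumptions), the Nesterov-style two-sequence iteration and the second-order recursion with $\alpha = 2\gamma$, $\beta = \gamma$ produce identical iterates on a scalar quadratic:

```python
h, gamma, N = 0.5, 0.9, 40
nesterov = [1.0]
th_prev2 = th_prev = 1.0                 # theta_0 = theta_1 = 1, theta* = 0
for n in range(2, N + 1):
    omega = th_prev + (n - 2) / n * (th_prev - th_prev2)  # momentum (n-2)/n
    th = omega - gamma * h * omega       # exact gradient: f'(w) = h * w
    th_prev2, th_prev = th_prev, th
    nesterov.append(th)

# Second-order form: n*th_n = (2-2*g*h)(n-1)th_{n-1} - (1-g*h)(n-2)th_{n-2}
second = [1.0]
tp2 = tp = 1.0
for n in range(2, N + 1):
    t = ((2 - 2 * gamma * h) * (n - 1) * tp
         - (1 - gamma * h) * (n - 2) * tp2) / n
    tp2, tp = tp, t
    second.append(t)

print(max(abs(a - b) for a, b in zip(nesterov, second)))  # ~ 0: same algorithm
```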
Heavy ball.
We consider the heavy-ball method (referred to from now on as "HB") (Polyak, 1964), in a version adapted to non-strongly-convex problems:
$\theta_n = \theta_{n-1} + \tfrac{n-2}{n}(\theta_{n-1} - \theta_{n-2}) - \tfrac{n-1}{n}\gamma f'(\theta_{n-1}).$
This corresponds to Eq. (4) with $\alpha = \gamma$ and $\beta = 0$.
3 Convergence with Noiseless Gradients
We study the convergence of the iterates defined by Eq. (3): $\eta_n = (2I - \alpha H)\eta_{n-1} - (I - \beta H)\eta_{n-2}$. This is a second-order iterative system with constant coefficients, which it is standard to cast in a linear framework (see, e.g., Ortega and Rheinboldt, 2000). We may rewrite it as:
$\Theta_n = F\,\Theta_{n-1}, \qquad \text{with } \Theta_n = \begin{pmatrix}\eta_n\\ \eta_{n-1}\end{pmatrix} \text{ and } F = \begin{pmatrix} 2I - \alpha H & -(I - \beta H)\\ I & 0\end{pmatrix}.$
Thus $\Theta_n = F^{\,n-1}\Theta_1$. Following O'Donoghue and Candes (2013), if we consider an eigenvalue decomposition of $H$, i.e., $H = P\,\mathrm{Diag}(h)\,P^\top$ with $P$ an orthogonal matrix and $(h_i)$ the eigenvalues of $H$, sorted in decreasing order: $h_1 \ge \dots \ge h_d > 0$, then Eq. (3) may be rewritten as:
(5) $P^\top\eta_n = \big(2I - \alpha\,\mathrm{Diag}(h)\big)P^\top\eta_{n-1} - \big(I - \beta\,\mathrm{Diag}(h)\big)P^\top\eta_{n-2}.$
Thus there is no interaction between the different eigenspaces and we may consider, for the analysis only, $d$ different recursions with $\eta_n^i = p_i^\top\eta_n$, $i \in \{1, \dots, d\}$, where $p_i$ is the $i$th column of $P$:
(6) $\eta_n^i = (2 - \alpha h_i)\eta_{n-1}^i - (1 - \beta h_i)\eta_{n-2}^i.$
3.1 Characteristic polynomial and eigenvalues
In this section, we consider a fixed $i$ and study the stability in the corresponding eigenspace. This linear dynamical system may be analyzed by studying the eigenvalues of the $2\times 2$ matrix $F_i = \begin{pmatrix}2 - \alpha h_i & -(1-\beta h_i)\\ 1 & 0\end{pmatrix}$. These eigenvalues are the roots of its characteristic polynomial, which is:
$x^2 - (2 - \alpha h_i)x + (1 - \beta h_i).$
To compute the roots of this second-order polynomial, we compute its reduced discriminant:
$\Delta_i = \big(1 - \tfrac{\alpha h_i}{2}\big)^2 - (1 - \beta h_i) = h_i\big(\tfrac{\alpha^2 h_i}{4} - \alpha + \beta\big).$
Depending on the sign of the discriminant $\Delta_i$, there will be two distinct real eigenvalues ($\Delta_i > 0$), two complex conjugate eigenvalues ($\Delta_i < 0$) or a single real eigenvalue ($\Delta_i = 0$).
We will now study the sign of $\Delta_i$. In each case, we will determine under what conditions on $\alpha$ and $\beta$ the modulus of the eigenvalues is less than one, which means that the iterates $(\eta_n^i)$ remain bounded and the iterates $(\theta_n)$ converge to $\theta^\ast$. We may then compute function values as $f(\theta_n) - f(\theta^\ast) = \tfrac{1}{2n^2}\sum_{i=1}^d h_i(\eta_n^i)^2$.
The various regimes are summarized in Figure 1: there is a triangle of values of $(\alpha h_i, \beta h_i)$ for which the algorithm remains stable (i.e., the iterates do not diverge), with either complex or real eigenvalues. In the following lemmas (see proof in Appendix C), we provide a detailed analysis that leads to Figure 1.
Lemma 1 (Real eigenvalues).
The discriminant $\Delta_i$ is strictly positive and the algorithm is stable if and only if
$\alpha - \tfrac{\alpha^2 h_i}{4} < \beta < \alpha \quad \text{and} \quad (\alpha + \beta)h_i < 4.$
We then have two real roots $r_{i,\pm} = 1 - \tfrac{\alpha h_i}{2} \pm \sqrt{\Delta_i}$, with $-1 < r_{i,-} < r_{i,+} < 1$. Moreover, we have:
(7) $\eta_n^i = \eta_1^i\,\dfrac{r_{i,+}^{\,n} - r_{i,-}^{\,n}}{r_{i,+} - r_{i,-}}.$
Therefore, for real eigenvalues, $\eta_n^i$ will converge to $0$ at a speed of $r_{i,+}^{\,n}$; however the gap $r_{i,+} - r_{i,-}$ may be arbitrarily small (and thus the scaling factor arbitrarily large). Furthermore we have linear convergence if the inequalities in the lemma are strict.
Lemma 2 (Complex eigenvalues).
The discriminant $\Delta_i$ is strictly negative and the algorithm is stable if and only if
$0 < \beta < \alpha - \tfrac{\alpha^2 h_i}{4}.$
We then have two complex conjugate eigenvalues: $r_{i,\pm} = 1 - \tfrac{\alpha h_i}{2} \pm i\sqrt{-\Delta_i}$. Moreover, we have:
(8) $\eta_n^i = \eta_1^i\,\rho_i^{\,n-1}\,\dfrac{\sin(n\omega_i)}{\sin\omega_i},$
with $\rho_i = \sqrt{1 - \beta h_i}$, and $\omega_i \in (0, \pi)$ defined through $\cos\omega_i = \big(1 - \tfrac{\alpha h_i}{2}\big)/\rho_i$ and $\sin\omega_i = \sqrt{-\Delta_i}/\rho_i$.
Therefore, for complex eigenvalues, there is linear convergence if the inequalities in the lemma are strict. Moreover, $\eta_n^i$ oscillates to $0$ at a speed of $\rho_i^{\,n}$ even if $h_i$ is arbitrarily small.
Coalescing eigenvalues.
When the discriminant goes to zero in the explicit formulas of the real and complex cases, both the denominator and numerator of $\eta_n^i$ will go to zero. In the limit case, when the discriminant is equal to zero, we will have a double real eigenvalue. This happens for $\beta = \alpha - \tfrac{\alpha^2 h_i}{4}$. Then the eigenvalue is $r_i = 1 - \tfrac{\alpha h_i}{2}$, and the algorithm is stable for $0 < \alpha h_i < 4$; we then have $\eta_n^i = \eta_1^i\, n\, r_i^{\,n-1}$. This can be obtained by letting $\Delta_i$ go to $0$ in the real and complex cases (see also Appendix C.3).
Summary.
To conclude, the iterate $\eta_n^i$ will be stable for $0 \le \beta \le \alpha$ and $(\alpha + \beta)h_i \le 4$. According to the values of $\alpha$ and $\beta$, this iterate will have different behaviors. In the complex case, the roots are complex conjugate with magnitude $\sqrt{1 - \beta h_i}$. Thus, when $\beta h_i > 0$, $\eta_n^i$ will converge to $0$, oscillating, at rate $(1 - \beta h_i)^{n/2}$. In the real case, the two roots are real and distinct. However, the product of the two roots is equal to $1 - \beta h_i$, thus one of them has a higher magnitude and $\eta_n^i$ converges to $0$ more slowly than in the complex case (as long as $\alpha$ and $\beta$ belong to the interior of the stability region).
Finally, for a given quadratic function $f$, all the iterates $(\eta_n^i)$ should be bounded, therefore we must have $0 \le \beta \le \alpha$ and $(\alpha + \beta)h_1 \le 4$ for the largest eigenvalue $h_1$. Then, depending on the value of $h_i$, some eigenvalues may be complex or real.
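The stability classification above can be explored numerically; the sketch below (our own illustration) computes the two roots of $x^2 - (2 - \alpha h)x + (1 - \beta h)$ for a given eigenvalue $h$ and tests whether both lie in the closed unit disc, with a tolerance for the marginally stable cases where a root has modulus exactly one:

```python
import cmath

def roots(h, alpha, beta):
    """Roots of x^2 - (2 - alpha*h) x + (1 - beta*h)."""
    b, c = 2 - alpha * h, 1 - beta * h
    disc = b * b / 4 - c                      # reduced discriminant
    s = cmath.sqrt(disc)                      # real or imaginary, as needed
    return b / 2 + s, b / 2 - s

def stable(h, alpha, beta, tol=1e-12):
    return all(abs(x) <= 1 + tol for x in roots(h, alpha, beta))

h, gamma = 0.1, 1.0
print(stable(h, gamma, gamma))       # averaging-like: real roots 1 and 1 - gamma*h
print(stable(h, 2 * gamma, gamma))   # acceleration-like: complex, modulus < 1
print(stable(h, gamma, 0.0))         # heavy-ball-like: complex, modulus exactly 1
print(stable(h, 45.0, 0.0))          # (alpha + beta)*h > 4: unstable
```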
3.2 Classical examples
For particular choices of $\alpha$ and $\beta$, displayed in Figure 1, the eigenvalues are either all real or all complex, as shown in the table below.

                AvGD                        AccGD                                      Heavy ball
$\alpha$        $\gamma$                    $2\gamma$                                  $\gamma$
$\beta$         $\gamma$                    $\gamma$                                   $0$
eigenvalues     real: $1$, $1-\gamma h_i$   complex, modulus $\sqrt{1-\gamma h_i}$     complex, modulus $1$
Averaged gradient descent loses linear convergence for strongly-convex problems, because one eigenvalue is equal to $1$ for all eigensubspaces. Similarly, the heavy-ball method is not adaptive to strong convexity because the modulus of its eigenvalues is equal to $1$. However, accelerated gradient descent, although designed for non-strongly-convex problems, is adaptive because the modulus $\sqrt{1 - \gamma h_i}$ depends on $h_i$, while the two others do not. These last two algorithms have an oscillatory behavior which can be observed in practice and has already been studied (Su et al., 2014).
Note that all the classical methods choose step-sizes $\alpha$ and $\beta$ with the eigenvalues either all real or all complex; whereas we will see in Section 4 that it is significant to combine both behaviors in the presence of noise.
3.3 General bound
Even if the exact formulas in Lemmas 1 and 2 are computable, they are not easily interpretable. In particular, when the two roots become close, the denominator $r_{i,+} - r_{i,-}$ will go to zero, which prevents bounding them easily. When we further restrict the domain of $(\alpha, \beta)$, we can always bound the iterate by the general bound (see proof in Appendix D):
Theorem 2.
For $0 \le \beta \le \alpha$ and $(\alpha + \beta)h_i \le 2$, we have, up to universal multiplicative constants:
(9) $(\eta_n^i)^2 \;\lesssim\; (\eta_1^i)^2\,\min\Big\{\, n^2,\ \tfrac{1}{\alpha h_i},\ \tfrac{1}{\beta^2 h_i^2}\,\Big\}.$
These bounds are shown by dividing the set of eigenvalues $(h_i)$ into three regions where we obtain specific bounds. They do not depend on the regime of the eigenvalues (complex or real); this enables us to get the following general bound on the function values, our main result for the deterministic case.
Corollary 1.
For $0 \le \beta \le \alpha$ and $(\alpha + \beta)h_1 \le 2$, up to universal multiplicative constants:
(10) $f(\theta_n) - f(\theta^\ast) \;\lesssim\; \min\Big\{ \tfrac{\|\theta_0 - \theta^\ast\|^2}{\alpha n^2},\ \tfrac{\|\theta_0 - \theta^\ast\|^2}{\beta n},\ f(\theta_0) - f(\theta^\ast) \Big\}.$
We can make the following observations:

The first bound corresponds to the traditional acceleration result, and is only relevant for $\alpha$ large (that is, for Nesterov acceleration and the heavy-ball method, but not for averaging). We recover the traditional $O(1/n^2)$ convergence rate of second-order methods for quadratic functions in the singular case, such as conjugate gradient (Polyak, 1987, Section 6.1).

While the result above focuses on function values, like most results in the non-strongly-convex case, the distance to the optimum typically does not go to zero (although it remains bounded in our situation).

When $\alpha = \beta = \gamma$ (averaged gradient descent), the second bound provides a convergence rate of $O(1/n)$ if no assumption is made regarding the starting point $\theta_0$, while the last bound of Theorem 2 would lead to a bound proportional to $\langle \theta_0 - \theta^\ast, H^{-1}(\theta_0 - \theta^\ast)\rangle/(\gamma^2 n^2)$, that is, a rate of $O(1/n^2)$, only for starting points for which this quantity is small.
4 Quadratic Optimization with Additive Noise
In many practical situations, the gradient of $f$ is not available for the recursion in Eq. (4), but only a noisy version. In this paper, we only consider additive uncorrelated noise with finite variance.
4.1 Stochastic difference equation
We now assume that the true gradient is not available and we rather have access to a noisy oracle for the gradient of $f$. In Eq. (4), we assume that the oracle outputs a noisy gradient $f'(\theta) + \varepsilon_n$. The noise $(\varepsilon_n)$ is assumed to be uncorrelated and zero-mean with bounded covariance, i.e., $\mathbb{E}[\varepsilon_n\varepsilon_m^\top] = 0$ for all $n \ne m$, $\mathbb{E}[\varepsilon_n] = 0$ and $\mathbb{E}[\varepsilon_n\varepsilon_n^\top] \preccurlyeq C$, where $A \preccurlyeq B$ means that $B - A$ is positive semi-definite.
For quadratic functions, for the reduced variable $\eta_n = n(\theta_n - \theta^\ast)$, we get:
(11) $\eta_n = (2I - \alpha H)\eta_{n-1} - (I - \beta H)\eta_{n-2} - \big[(n-1)\alpha - (n-2)\beta\big]\varepsilon_n.$
Note that algorithms with $\alpha \ne \beta$ will have an important level of noise because of the term $[(n-1)\alpha - (n-2)\beta]\varepsilon_n$, which scales as $n(\alpha - \beta)\varepsilon_n$. We denote by $\xi_n = -\big[(n-1)\alpha - (n-2)\beta\big]\binom{\varepsilon_n}{0}$ and we now have the recursion:
(12) $\Theta_n = F\,\Theta_{n-1} + \xi_n,$
which is a standard noisy linear dynamical system (see, e.g., Arnold, 1998) with uncorrelated noise process $(\xi_n)$. We may thus express $\Theta_n$ directly as $\Theta_n = F^{\,n-1}\Theta_1 + \sum_{k=2}^{n} F^{\,n-k}\xi_k$ and its expected second-order moment as $\mathbb{E}[\Theta_n\Theta_n^\top] = F^{\,n-1}\Theta_1\Theta_1^\top (F^{\,n-1})^\top + \sum_{k=2}^{n} F^{\,n-k}\,\mathbb{E}[\xi_k\xi_k^\top]\,(F^{\,n-k})^\top$. In order to obtain the expected excess cost function, we simply need to compute $\mathbb{E}[f(\theta_n)] - f(\theta^\ast) = \tfrac{1}{2n^2}\mathbb{E}\langle \eta_n, H\eta_n\rangle$, which thus decomposes as a term that only depends on initial conditions (exactly the one computed and studied in Section 3.3), and a new term that depends on the noise.
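The second-moment decomposition lends itself to exact numerical evaluation. The sketch below (our own; the injected-noise coefficient $(n-1)\alpha - (n-2)\beta$ follows the reconstruction above, and the values of h, gamma, sigma2 are arbitrary) propagates the covariance of the two-dimensional state started at zero, thereby isolating the variance (noise) term of the excess cost in one eigenspace:

```python
import numpy as np

def noise_gap(h, alpha, beta, sigma2, N):
    """Exact noise term of E[f(theta_N)] - f(theta*) in one eigenspace.

    Propagates C_n = F C_{n-1} F^T + v_n e1 e1^T for the state
    (eta_n, eta_{n-1}) started at zero, with injected variance
    v_n = ((n-1)*alpha - (n-2)*beta)^2 * sigma2,
    then returns h * C[0, 0] / (2 N^2).
    """
    F = np.array([[2 - alpha * h, -(1 - beta * h)], [1.0, 0.0]])
    C = np.zeros((2, 2))
    for n in range(2, N + 1):
        C = F @ C @ F.T
        C[0, 0] += ((n - 1) * alpha - (n - 2) * beta) ** 2 * sigma2
    return h * C[0, 0] / (2 * N ** 2)

h, gamma, sigma2 = 0.1, 1.0, 1.0
print(noise_gap(h, gamma, gamma, sigma2, 1000))      # averaging: decays with N
print(noise_gap(h, 2 * gamma, gamma, sigma2, 1000))  # acceleration: levels off
```

With these values, the averaging step-sizes ($\alpha = \beta$) keep the noise contribution decaying as the horizon grows, while the acceleration step-sizes ($\alpha = 2\gamma$, $\beta = \gamma$) level off at a constant, illustrating the robustness/fragility contrast discussed in this section.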
4.2 Convergence result
For a quadratic function with arbitrarily small eigenvalues and uncorrelated noise with finite covariance, we obtain the following convergence result (see proof in Appendix F); since we will allow the parameters $\alpha$ and $\beta$ to depend on the time we stop the algorithm, we introduce the horizon $N$:
Theorem 3 (Convergence rates with noisy gradients).
Assume $\mathbb{E}[\varepsilon_n\varepsilon_n^\top] \preccurlyeq \sigma^2 I$ for all $n$, with $0 \le \beta \le \alpha$ and $(\alpha + \beta)h_1 \le 2$. Then for any $N$, we have, up to universal multiplicative constants:
(13) $\mathbb{E}[f(\theta_N)] - f(\theta^\ast) \;\lesssim\; \min\Big\{ \tfrac{\|\theta_0 - \theta^\ast\|^2}{\alpha N^2} + \alpha\,\sigma^2 d\, N,\ \ \tfrac{\|\theta_0 - \theta^\ast\|^2}{\beta N} + \beta\,\sigma^2 d \Big\}.$
We can make the following observations:

Although we only provide an upper-bound, the proof technique relies on direct moment computations in each eigensubspace with few inequalities, and we conjecture that the scalings with respect to $N$ are tight.

For $\alpha = \beta = \gamma$ (which corresponds to averaged gradient descent), the second bound leads to $\tfrac{\|\theta_0 - \theta^\ast\|^2}{\gamma N} + \gamma\sigma^2 d$, which is bounded but not converging to zero. We recover a result from Bach and Moulines (2011, Theorem 1).

For $\alpha = 2\gamma$ and $\beta = \gamma$ (which corresponds to Nesterov's acceleration), the first bound leads to $\tfrac{\|\theta_0 - \theta^\ast\|^2}{\gamma N^2} + \gamma\sigma^2 d N$, and our bound suggests that the algorithm diverges, which we have observed in our experiments in Appendix A.

For $\alpha = \beta$ proportional to $1/\sqrt{N}$, the second bound leads to $O(1/\sqrt{N})$, and we recover the traditional rate of $O(1/\sqrt{N})$ for stochastic gradient in the non-strongly-convex case.

When the values of the bias $\|\theta_0 - \theta^\ast\|^2$ and the variance $\sigma^2$ are known, we can choose $\alpha$ and $\beta$ such that the trade-off between the bias and the variance is optimal in our bound, as the following corollary shows. Note that in the bound below, taking a non-zero $\beta$ enables the bias term to be adaptive to hidden strong convexity.
Corollary 2.
For a choice of $(\alpha, \beta)$ optimizing the trade-off in Eq. (13), we have:
$\mathbb{E}[f(\theta_N)] - f(\theta^\ast) \;\lesssim\; \dfrac{\|\theta_0 - \theta^\ast\|\,\sigma\sqrt{d}}{\sqrt{N}}.$
4.3 Structured noise and least-squares regression
When only the total variance of the noise is considered, as shown in Section 4.4, Corollary 2 recovers existing (more general) results. Our framework however leads to an improved result for the structured noise processes frequent in machine learning, in particular in least-squares regression, which we now consider, although the applicability goes beyond this setting (see, e.g., Bach and Moulines, 2013).
Assume we observe independent and identically distributed pairs $(x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$ and we want to minimize the expected loss $f(\theta) = \tfrac{1}{2}\mathbb{E}\big[(y_n - \langle x_n, \theta\rangle)^2\big]$. We denote by $H = \mathbb{E}[x_n x_n^\top]$ the covariance matrix, which is assumed invertible. The global minimum of $f$ is attained at $\theta^\ast$ defined as before, and we denote by $\tau_n = y_n - \langle x_n, \theta^\ast\rangle$ the statistical noise, whose variance we assume bounded by $\sigma^2$. We have $\mathbb{E}[\tau_n x_n] = 0$. In an online setting, we observe the gradient $(\langle x_n, \theta\rangle - y_n)x_n$, whose expectation is the gradient $f'(\theta)$. This corresponds to a noise in the gradient of $\varepsilon_n = (\langle x_n, \theta\rangle - y_n)x_n - f'(\theta)$. Given $\theta$, if the data are almost surely bounded, the covariance matrix of this noise is bounded by a constant times $H$. This suggests to characterize the noise by the smallest $\sigma^2$ such that $\mathbb{E}[\varepsilon_n\varepsilon_n^\top] \preccurlyeq \sigma^2 H$, which is bounded even though $H$ has arbitrarily small eigenvalues.
However, our result will not apply to stochastic gradient descent (SGD) for least-squares, because of the term $x_n x_n^\top\theta$ which depends on $\theta$, but to a "semi-stochastic" recursion where the noisy gradient is $f'(\theta) + \varepsilon_n$, with the noise process $\varepsilon_n = -\tau_n x_n$, which is such that $\mathbb{E}[\varepsilon_n\varepsilon_n^\top] \preccurlyeq \sigma^2 H$, and which has been used by Bach and Moulines (2011) and Dieuleveut and Bach (2014) to prove results on regular stochastic gradient descent. We conjecture that our algorithm (and results) also applies in the regular SGD case, and we provide encouraging experiments in Section 5.
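The key covariance property is easy to verify by simulation; in the sketch below (our own; the dimensions, sample size and distributions are arbitrary choices), the semi-stochastic noise $\varepsilon_n = \tau_n x_n$, with $\tau_n$ independent of $x_n$ and of variance $\sigma^2$, has covariance exactly $\sigma^2 H$, i.e., proportional to the Hessian:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples, sigma = 3, 200_000, 0.5

A = rng.standard_normal((d, d))
H = A @ A.T / d                          # an arbitrary input covariance ("Hessian")
X = rng.multivariate_normal(np.zeros(d), H, size=n_samples)
tau = sigma * rng.standard_normal(n_samples)   # statistical noise, variance sigma^2

eps = tau[:, None] * X                   # semi-stochastic gradient noise
emp = eps.T @ eps / n_samples            # empirical noise covariance
err = np.linalg.norm(emp - sigma ** 2 * H)
print(err)                               # small: covariance is sigma^2 * H
```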
For this particular structured noise, we can take advantage of a large $\alpha$:
Theorem 4 (Convergence rates with structured noisy gradients).
Assume $\mathbb{E}[\varepsilon_n\varepsilon_n^\top] \preccurlyeq \sigma^2 H$ for all $n$, with $0 \le \beta \le \alpha$ and $(\alpha + \beta)h_1 \le 2$. For any $N$, $\mathbb{E}[f(\theta_N)] - f(\theta^\ast)$ is upper-bounded, up to universal multiplicative constants, by:
(14) $\min\Big\{ \tfrac{\|\theta_0 - \theta^\ast\|^2}{\alpha N^2} + \sigma^2 d\Big(\tfrac{\alpha - \beta}{\beta} + \tfrac{1}{N}\Big),\ \ \tfrac{\|\theta_0 - \theta^\ast\|^2}{\beta N} + \tfrac{\sigma^2 d}{N} \Big\}.$
We can make the following observations:

For $\alpha = \beta = \gamma$ (which corresponds to averaged gradient descent), the second bound leads to $\tfrac{\|\theta_0 - \theta^\ast\|^2}{\gamma N} + \tfrac{\sigma^2 d}{N}$. We recover a result from Bach and Moulines (2013, Theorem 1).

For $\alpha = 2\gamma$ and $\beta = \gamma$ (which corresponds to Nesterov's acceleration), the first bound leads to $\tfrac{\|\theta_0 - \theta^\ast\|^2}{\gamma N^2} + \sigma^2 d$, which is bounded but not converging to zero (as opposed to the unstructured noise case, where the algorithm may diverge).

For $\alpha = (1 + N^{\rho - 1})\beta$ with $\rho \in [0, 1]$ and $\beta = \gamma$, the first bound leads to $\tfrac{\|\theta_0 - \theta^\ast\|^2}{\gamma N^2} + \tfrac{\sigma^2 d}{N^{1-\rho}}$. We thus obtain an explicit bias-variance trade-off by changing the value of $\rho$.

When the values of the bias and the variance are known, we can choose $\alpha$ and $\beta$ with an optimized trade-off, as the following corollary shows:
Corollary 3.
For a choice of $(\alpha, \beta)$ optimizing the trade-off in Eq. (14), we have:
(15) $\mathbb{E}[f(\theta_N)] - f(\theta^\ast) \;\lesssim\; \dfrac{\|\theta_0 - \theta^\ast\|^2}{\gamma N^2} + \dfrac{\sigma^2 d}{N}.$
4.4 Related work
Acceleration and noisy gradients.
Several authors (Lan, 2012; Hu et al., 2009; Xiao, 2010) have shown that, using a step-size proportional to $1/\sqrt{N}$, accelerated methods with noisy gradients lead to the same convergence rate of $O(1/\sqrt{N})$ as in Corollary 2, for smooth functions. Thus, for unstructured noise, our analysis provides insights into the behavior of second-order algorithms, without improving bounds. We get significant improvements for structured noise.
Leastsquares regression.
When the noise is structured as in least-squares regression, and more generally in linear supervised learning, Bach and Moulines (2013) have shown that using averaged stochastic gradient descent with constant step-size leads to a convergence rate of $O(1/N)$. It has been highlighted by Défossez and Bach (2014) that the bias term may often be the dominant one in practice. Our result in Corollary 3 leads to an improved bias term in $O(1/N^2)$, at the price of a potentially slightly worse constant in the variance term. However, with optimal constants in Corollary 3, the new algorithm is always an improvement over averaged stochastic gradient descent. If the constants are unknown, we may choose the ratio $\alpha/\beta$ depending on the emphasis we want to put on bias or variance.
Minimax convergence rates.
For noisy quadratic problems, the convergence rate nicely decomposes into two terms, a bias term which corresponds to the noiseless problem, and a variance term which corresponds to a problem started at $\theta^\ast$. For each of these two terms, lower bounds are known. For the bias term, the lower bound is, up to constants, $\tfrac{L\|\theta_0 - \theta^\ast\|^2}{N^2}$ (Nesterov, 2004, Theorem 2.1.7). For the variance term, for the general noisy gradient situation, we show in Appendix H that it is of order $\tfrac{\sigma\sqrt{d}}{\sqrt{N}}$, while for least-squares regression, it is of order $\tfrac{\sigma^2 d}{N}$ (Tsybakov, 2003). Thus, we attain these lower bounds in the unstructured and structured situations respectively. It remains an open problem to achieve the two minimax terms in all situations.
Other algorithms as special cases.
5 Experiments
In this section, we illustrate our theoretical results on synthetic examples. We consider a matrix $H$ that has random eigenvectors and eigenvalues $1/k$, for $k = 1, \dots, d$. We take a random optimum $\theta^\ast$ and a random starting point $\theta_0$ such that $\|\theta_0 - \theta^\ast\| = 1$ (unless otherwise specified). In Appendix A, we illustrate the noiseless results of Section 3, in particular the oscillatory behaviors and the influence of all eigenvalues, as well as unstructured noisy gradients. In this section, we focus on noisy gradients with structured noise (as described in Section 4.3), where our new algorithms show significant improvements.
We compare our algorithm to other stochastic accelerated algorithms, that is, AC-SA (Lan, 2012), SAGE (Hu et al., 2009) and Acc-RDA (Xiao, 2010), which are presented in Appendix G. For all these algorithms (and ours) we take the optimal step-sizes defined in these papers. We show results averaged over 10 replications.
Homoscedastic noise.
We first consider an i.i.d. zero-mean noise whose covariance matrix is proportional to $H$. We also consider a variant of our algorithm with an any-time step-size, a function of $n$ rather than of the horizon $N$ (for which we currently have no proof of convergence). In Figure 3, we consider two different set-ups. In the left plot, the variance dominates the bias (small initial distance, large noise). We see that (a) AccGD does not converge to the optimum but does not diverge either, (b) AvGD and our algorithms achieve the optimal rate of convergence of $O(1/n)$, whereas (c) the other accelerated algorithms only converge at rate $O(1/\sqrt{n})$. In the right plot, the bias dominates the variance (large initial distance, small noise). In this situation our algorithm outperforms all others.
Application to leastsquares regression.
We now see how these algorithms behave for least-squares regression with the regular (non-homoscedastic) stochastic gradients described in Section 4.3. We consider normally distributed inputs. The covariance matrix $H$ is the same as before. The outputs are generated from a linear function with homoscedastic noise, with a fixed signal-to-noise ratio. We show results averaged over 10 replications. In Figure 4, we consider again a situation where the bias dominates (left) and vice versa (right). We see that our algorithm has the same good behavior as in the homoscedastic noise case, and we conjecture that our bounds also hold in this situation.
6 Conclusion
We have provided a joint analysis of averaging and acceleration for non-strongly-convex quadratic functions in a single framework, both with noiseless and noisy gradients. This allows us to define a class of algorithms that can benefit simultaneously from the known improvements of averaging and acceleration: faster forgetting of initial conditions (for acceleration), and better robustness to noise when the noise covariance is proportional to the Hessian (for averaging).
Our current analysis of our class of algorithms in Eq. (4), which considers two different affine combinations of previous iterates (instead of one for traditional acceleration), is limited to quadratic functions; an extension of its analysis to all smooth or self-concordant-like functions would widen its applicability. Similarly, an extension to least-squares regression with the natural heteroscedastic stochastic gradient, as suggested by our simulations, would be an interesting development.
Acknowledgements
This work was partially supported by the MSRInria Joint Centre and a grant by the European Research Council (SIERRA project 239993). The authors would like to thank Aymeric Dieuleveut for helpful discussions.
References
 Agarwal et al. (2012) A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.
 Arnold (1998) L. Arnold. Random dynamical systems. Springer Monographs in Mathematics. SpringerVerlag, 1998.
 Bach and Moulines (2011) F. Bach and E. Moulines. NonAsymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In Advances in Neural Information Processing Systems, 2011.
 Bach and Moulines (2013) F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate $O(1/n)$. In Advances in Neural Information Processing Systems, December 2013.
 Beck and Teboulle (2009) A. Beck and M. Teboulle. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
 d’Aspremont (2008) A. d’Aspremont. Smooth optimization with approximate gradient. SIAM J. Optim., 19(3):1171–1183, 2008.
 Défossez and Bach (2014) A. Défossez and F. Bach. Constant step size leastmeansquare: Biasvariance tradeoffs and optimal sampling distributions. Technical Report 1412.0156, arXiv, 2014.
 Devolder et al. (2014) O. Devolder, F. Glineur, and Y. Nesterov. Firstorder methods of smooth convex optimization with inexact oracle. Math. Program., 146(12, Ser. A):37–75, 2014.
 Dieuleveut and Bach (2014) A. Dieuleveut and F. Bach. Nonparametric Stochastic Approximation with Large Step sizes. Technical Report 1408.0361, arXiv, August 2014.
 Hu et al. (2009) C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, 2009.
 Lan (2012) G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(12, Ser. A):365–397, 2012.
 Nesterov (1983) Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372–376, 1983.
 Nesterov (2004) Y. Nesterov. Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004. A basic course.
 Nesterov (2013) Y. Nesterov. Gradient methods for minimizing composite functions. Math. Program., 140(1, Ser. B):125–161, 2013.
 O’Donoghue and Candes (2013) B. O’Donoghue and E. Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, pages 1–18, 2013.
 Ortega and Rheinboldt (2000) J. M. Ortega and W. C. Rheinboldt. Iterative solution of nonlinear equations in several variables, volume 30 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2000.
 Polyak (1964) B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
 Polyak (1987) B. T. Polyak. Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc., Publications Division, New York, 1987.
 Polyak and Juditsky (1992) B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.
 Schmidt et al. (2011) M. Schmidt, N. Le Roux, and F. Bach. Convergence Rates of Inexact ProximalGradient Methods for Convex Optimization. In Advances in Neural Information Processing Systems, December 2011.
 Su et al. (2014) W. Su, S. Boyd, and E. Candes. A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights. In Advances in Neural Information Processing Systems, 2014.
 Tsybakov (2003) A. B. Tsybakov. Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory, 2003.
 Xiao (2010) L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11:2543–2596, 2010.
Appendix A Additional experimental results
In this appendix, we provide additional experimental results to illustrate our theoretical results.
A.1 Deterministic convergence
Comparison in one dimension.
In Figure 5, we minimize a one-dimensional quadratic function for a fixed step-size $\alpha$ and different step-sizes $\beta$. In the left plot, we compare AccGD, HB and AvGD. We see that HB and AccGD both oscillate and that AccGD leverages strong convexity to converge faster. In the right plot, we compare the behavior of the algorithm for different values of $\beta$. We see that the optimal rate is achieved for the value $\beta^\ast$ for which there is a double coalescing eigenvalue, where the convergence is linear at speed $(1 - \alpha h/2)^n$. When $\beta > \beta^\ast$, we are in the real case, and when $\beta < \beta^\ast$ the algorithm oscillates to the solution.
Comparison between the different eigenspaces.
Figure 6 shows interactions between different eigenspaces. In the left plot, we optimize a quadratic function of dimension two, where the first eigenvalue is large and the second one is small. For AvGD the convergence is of order $O(1/n)$, since the problem is "not" strongly convex (i.e., not appearing as strongly convex since the smallest eigenvalue remains small). The convergence is at the beginning the same for HB and AccGD, with oscillations at speed $O(1/n^2)$, since the small eigenvalue prevents AccGD from having a linear convergence. Then, for large $n$, the convergence becomes linear for AccGD. In the right plot, we optimize a quadratic function in higher dimension with a wide range of eigenvalues. We show the function values of the projections of the iterates on the different eigenspaces. We see that high eigenvalues first dominate, but converge quickly to zero, whereas small ones keep oscillating, and converge more slowly.
Comparison in two dimensions.
In Figure 7, we optimize two-dimensional quadratic functions with different eigenvalues with AvGD, HB and AccGD, for a fixed step-size $\gamma$. In the left plot, both eigenvalues are large, and in the right one, one of them is small. We see that in both cases, AvGD converges at a rate of $O(1/n)$ and HB at a rate of $O(1/n^2)$. For AccGD the convergence is linear when the smallest eigenvalue is large (left plot) and becomes sublinear at a rate of $O(1/n^2)$ when it becomes small (right plot).
A.2 Noisy convergence with unstructured additive noise
We optimize the same quadratic function, but now with noisy gradients. We compare our algorithm to other stochastic accelerated algorithms, that is, AC-SA (Lan, 2012), SAGE (Hu et al., 2009) and Acc-RDA (Xiao, 2010), which are presented in Appendix G. For all these algorithms (and ours) we take the optimal step-sizes defined in these papers. We plot the results averaged over 10 replications.
We consider in Figure 8 an i.i.d. zero-mean noise of variance $\sigma^2$. We see that all the accelerated algorithms achieve the same precision, whereas AvGD with constant step-size does not converge to the optimum and AccGD diverges. However, SAGE and AC-SA are any-time algorithms and are faster at the beginning, since their step-sizes are decreasing functions of $n$ rather than constant functions of the horizon $N$.
Appendix B Proofs of Section 2
B.1 Proof of Theorem 1
Let $A_n$, $B_n$ and $c_n$ be, for all $n$, polynomials in $H$ (with $c_n$ a polynomial in $H$ times $q$). We consider the iterates defined for all $n$ by
$\theta_n = A_n\theta_{n-1} + B_n\theta_{n-2} + c_n,$
started from $\theta_0 = \theta_1$. The stationarity property gives, for all $n$:
$\theta^\ast = A_n\theta^\ast + B_n\theta^\ast + c_n.$
Since $\theta^\ast = H^{-1}q$, we get for all $n$:
$c_n = (I - A_n - B_n)H^{-1}q.$
We then apply this relation to a spanning set of vectors $q$, and we get:
Therefore there are polynomials and for all such that we have for all :
(16) 
The n-scalability property means that there are matrices $A$ and $B$ (polynomials in $H$), independent of $n$, such that:
$A_n = \tfrac{n-1}{n}A \quad \text{and} \quad B_n = \tfrac{n-2}{n}B.$
In connection with Eq. (16), we can rewrite $A$ and $B$ as: