From Averaging to Acceleration, There is Only a Step-size

Nicolas Flammarion and Francis Bach
INRIA - Sierra project-team
Département d’Informatique de l’Ecole Normale Supérieure
Paris, France
nicolas.flammarion@ens.fr, francis.bach@ens.fr
Abstract

We show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly-convex problems may be reformulated as constant parameter second-order difference equation algorithms, where stability of the system is equivalent to convergence at rate O(1/n²), where n is the number of iterations. We provide a detailed analysis of the eigenvalues of the corresponding linear dynamical system, showing various oscillatory and non-oscillatory behaviors, together with a sharp stability result with explicit constants. We also consider the situation where noisy gradients are available, where we extend our general convergence result, which suggests an alternative algorithm (i.e., with different step-sizes) that exhibits the good aspects of both averaging and acceleration.

1 Introduction

Many problems in machine learning are naturally cast as convex optimization problems over a Euclidean space; for supervised learning this includes least-squares regression, logistic regression, and the support vector machine. Faced with large amounts of data, practitioners often favor first-order techniques based on gradient descent, leading to algorithms with many cheap iterations. For smooth problems, two extensions of gradient descent have had important theoretical and practical impacts: acceleration and averaging.

Acceleration techniques date back to Nesterov (1983) and have their roots in momentum techniques and conjugate gradient (Polyak, 1987). For convex problems, with an appropriately weighted momentum term, which requires storing two iterates, Nesterov (1983) showed that the traditional convergence rate of O(1/n) for the function values after n iterations of gradient descent goes down to O(1/n²) for accelerated gradient descent, such a rate being optimal among first-order techniques that can access only sequences of gradients (Nesterov, 2004). Like conjugate gradient methods for solving linear systems, these methods are however more sensitive to noise in the gradients; that is, to preserve their improved convergence rates, significantly less noise may be tolerated (d’Aspremont, 2008; Schmidt et al., 2011; Devolder et al., 2014).

Averaging techniques, which consist in replacing the iterates by the average of all iterates, have also been thoroughly considered, either because they sometimes lead to simpler proofs, or because they lead to improved behavior. In the noiseless case where gradients are exactly available, they do not improve the convergence rate in the convex case; worse, for strongly-convex problems, they are not linearly convergent while regular gradient descent is. Their main advantage comes with random unbiased gradients, where it has been shown that they lead to better convergence rates than their unaveraged counterparts, in particular because they allow larger step-sizes (Polyak and Juditsky, 1992; Bach and Moulines, 2011). For example, for least-squares regression with stochastic gradients, they lead to convergence rates of O(1/n), even in the non-strongly-convex case (Bach and Moulines, 2013).

In this paper, we show that for quadratic problems, averaging and acceleration are two instances of the same second-order finite difference equation, with different step-sizes. They may thus be analyzed jointly, together with a non-strongly-convex version of the heavy-ball method (Polyak, 1987, Section 3.2). In the presence of random zero-mean noise on the gradients, this joint analysis allows us to design a novel intermediate algorithm that exhibits the good aspects of both acceleration (quick forgetting of initial conditions) and averaging (robustness to noise).

In this paper, we make the following contributions:

  • We show in Section 2 that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly-convex problems may be reformulated as constant parameter second-order difference equation algorithms, where stability of the system is equivalent to convergence at rate O(1/n²).

  • In Section 3, we provide a detailed analysis of the eigenvalues of the corresponding linear dynamical system, showing various oscillatory and non-oscillatory behaviors, together with a sharp stability result with explicit constants.

  • In Section 4, we consider the situation where noisy gradients are available, where we extend our general convergence result, which suggests an alternative algorithm (i.e., with different step sizes) that exhibits the good aspects of both averaging and acceleration.

  • In Section 5, we illustrate our results with simulations on synthetic examples.

2 Second-Order Iterative Algorithms for Quadratic Functions

Throughout this paper, we consider minimizing a convex quadratic function f: Rᵈ → R defined as:

f(θ) = ½⟨θ, Hθ⟩ − ⟨q, θ⟩,  (1)

with H ∈ Rᵈˣᵈ a symmetric positive semi-definite matrix and q ∈ Rᵈ. Without loss of generality, H is assumed invertible (by projecting onto the orthogonal of its null space), though its eigenvalues could be arbitrarily small. The solution is known to be θ* = H⁻¹q, but the inverse of the Hessian is often too expensive to compute when d is large. The excess cost function may be simply expressed as f(θ) − f(θ*) = ½⟨θ − θ*, H(θ − θ*)⟩.
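To make the setting concrete, here is a minimal numerical sketch of the quadratic problem above (the eigenvalue profile 1/k and the dimension are illustrative choices mirroring the experiments of Section 5, not requirements of the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
# Random orthogonal eigenvectors and eigenvalues h_k = 1/k: H is invertible
# but its eigenvalues may be arbitrarily small.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
h = 1.0 / np.arange(1, d + 1)
H = U @ np.diag(h) @ U.T
theta_star = rng.standard_normal(d)   # the optimum theta* = H^{-1} q
q = H @ theta_star

def grad(theta):
    # f'(theta) = H theta - q
    return H @ theta - q

def excess(theta):
    # f(theta) - f(theta*) = 1/2 <theta - theta*, H (theta - theta*)>
    eta = theta - theta_star
    return 0.5 * eta @ (H @ eta)
```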

2.1 Second-order algorithms

In this paper we study second-order iterative algorithms of the form:

θ_n = A_n θ_{n−1} + B_n θ_{n−2} + c_n,  (2)

started with θ_1 = θ_0 in Rᵈ, with A_n, B_n ∈ Rᵈˣᵈ and c_n ∈ Rᵈ for all n. We impose the natural restriction that the optimum θ* is a stationary point of this recursion, that is, for all n:

θ* = A_n θ* + B_n θ* + c_n.  (θ*-stationarity)

By letting η_n = θ_n − θ*, we then have η_n = A_n η_{n−1} + B_n η_{n−2}, started from η_1 = η_0 = θ_0 − θ*. Thus, we restrict our problem to the study of the convergence of an iterative system (η_n) to 0.

In connection with accelerated methods, we are interested in algorithms for which f(θ_n) − f(θ*) converges to 0 at a speed of O(1/n²). Within this context we impose that A_n and B_n have the form:

A_n = ((n−1)/n) A  and  B_n = ((n−2)/n) B.  (n-scalability)

By letting ψ_n = n η_n, we can now study the simple iterative system ψ_n = A ψ_{n−1} + B ψ_{n−2} with constant terms A and B, started at ψ_0 = 0 and ψ_1 = θ_0 − θ*. Since f(θ_n) − f(θ*) = (1/(2n²))⟨ψ_n, Hψ_n⟩, showing that the sequence (ψ_n) remains bounded directly gives the convergence of f(θ_n) to f(θ*) at the speed O(1/n²). Thus the n-scalability property allows us to switch from a convergence problem to a stability problem.

For feasibility, the method can only access H through matrix-vector products. Therefore A_n and B_n should be polynomials in H, and c_n a polynomial in H times q, if possible of low degree. The following theorem clarifies the general form of iterative systems which share these three properties (see proof in Appendix B).

Theorem 1.

Let (P_nᴬ, P_nᴮ, P_nᶜ), for all n, be a sequence of polynomials. If the iterative algorithm defined by Eq. (2) with A_n = P_nᴬ(H), B_n = P_nᴮ(H) and c_n = P_nᶜ(H)q satisfies the θ*-stationarity and n-scalability properties, then there are polynomials Pᴬ and Pᴮ such that:

A = 2I − H Pᴬ(H)  and  B = −(I − H Pᴮ(H)).

Note that our result prevents A and B from being zero, thus requiring the algorithm to be strictly of second order. This illustrates the fact that first-order algorithms such as gradient descent cannot achieve the convergence rate in O(1/n²).

We now restrict our class of algorithms to the lowest possible order polynomials, that is, A = 2I − αH and B = −(I − βH) with (α, β) ∈ R², which correspond to the fewest matrix-vector products per iteration, leading to the constant-coefficient recursion for ψ_n:

ψ_n = (2I − αH)ψ_{n−1} − (I − βH)ψ_{n−2}.  (3)
Expression with gradients of f.

The recursion in Eq. (3) may be written with gradients of f in multiple ways. In order to preserve the parallel with accelerated techniques, we rewrite it as:

θ_n = θ_{n−1} + ((n−2)/n)(θ_{n−1} − θ_{n−2}) − γ_n f′(ν_n),  (4)

with γ_n = [(n−1)α − (n−2)β]/n and ν_n = [(n−1)α θ_{n−1} − (n−2)β θ_{n−2}] / [(n−1)α − (n−2)β].

It may be interpreted as a modified gradient recursion with two potentially different affine (i.e., with coefficients that sum to one) combinations of the two past iterates: θ_{n−1} + ((n−2)/n)(θ_{n−1} − θ_{n−2}) and ν_n. This reformulation will also be crucial when using noisy gradients. The allowed values for (α, β) will be determined in the following sections.
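The recursion is straightforward to implement directly on θ_n. The sketch below follows the reconstructed Eq. (4) above (one gradient evaluation per iteration, at the affine combination ν_n); grad is the gradient oracle from the earlier snippet:

```python
def run_two_step(theta0, alpha, beta, grad, n_iter):
    """Iterate Eq. (4): an affine combination of the two past iterates, minus a
    step gamma_n times the gradient at another affine combination nu_n."""
    thetas = [theta0.copy(), theta0.copy()]     # theta_0 = theta_1
    for n in range(2, n_iter + 1):
        prev1, prev2 = thetas[-1], thetas[-2]
        D = (n - 1) * alpha - (n - 2) * beta    # equals n * gamma_n; > 0 when alpha >= beta >= 0, alpha > 0
        nu = ((n - 1) * alpha * prev1 - (n - 2) * beta * prev2) / D
        thetas.append(prev1 + (n - 2) / n * (prev1 - prev2) - (D / n) * grad(nu))
    return thetas
```

For a quadratic f, this is algebraically identical to the constant-coefficient recursion of Eq. (3) with ψ_n = n(θ_n − θ*).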

2.2 Examples

Averaged gradient descent.

We consider averaged gradient descent (referred to from now on as “Av-GD”) (Polyak and Juditsky, 1992) with step-size γ, defined by:

φ_n = φ_{n−1} − γ f′(φ_{n−1}),  θ_n = (1/n) Σ_{k=1}^n φ_k.

When computing the average online as θ_n = ((n−1)θ_{n−1} + φ_n)/n and seeing the average θ_n as the main iterate, the algorithm becomes (see proof in Appendix B.2):

θ_n = θ_{n−1} + ((n−2)/n)(θ_{n−1} − θ_{n−2}) − (γ/n) f′((n−1)θ_{n−1} − (n−2)θ_{n−2}).

This corresponds to Eq. (4) with α = γ and β = γ.
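In condensed form, the computation behind this rewriting (a sketch of the argument deferred to Appendix B.2) is the substitution φ_{n−1} = (n−1)θ_{n−1} − (n−2)θ_{n−2} into the gradient step:

```latex
% phi_n = phi_{n-1} - gamma f'(phi_{n-1}), with phi_{n-1} = (n-1)theta_{n-1} - (n-2)theta_{n-2}:
n\theta_n - (n-1)\theta_{n-1}
   = (n-1)\theta_{n-1} - (n-2)\theta_{n-2}
     - \gamma f'\big((n-1)\theta_{n-1} - (n-2)\theta_{n-2}\big),
\quad\text{hence}\quad
\theta_n = \theta_{n-1} + \tfrac{n-2}{n}\,(\theta_{n-1}-\theta_{n-2})
         - \tfrac{\gamma}{n}\, f'\big((n-1)\theta_{n-1} - (n-2)\theta_{n-2}\big).
```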

Accelerated gradient descent.

We consider accelerated gradient descent (referred to from now on as “Acc-GD”) (Nesterov, 1983) with step-sizes (γ, δ_n):

θ_n = ω_{n−1} − γ f′(ω_{n−1}),
ω_n = θ_n + δ_n (θ_n − θ_{n−1}),  with δ_n = (n−1)/(n+1).

For smooth optimization the accelerated literature (Nesterov, 2004; Beck and Teboulle, 2009) uses the step-size δ_n = (n−1)/(n+2), and their results are not valid for the bigger step-size δ_n = (n−1)/(n+1). However, δ_n = (n−1)/(n+1) is compatible with the framework of Lan (2012) and is more convenient for our set-up. This corresponds to Eq. (4) with α = 2γ and β = γ. Note that accelerated techniques are more generally applicable, e.g., to composite optimization with smooth functions (Nesterov, 2013; Beck and Teboulle, 2009).

Heavy ball.

We consider the heavy-ball algorithm (referred to from now on as “HB”) (Polyak, 1964) with step-sizes (γ_n, δ_n):

θ_n = θ_{n−1} − γ_n f′(θ_{n−1}) + δ_n (θ_{n−1} − θ_{n−2}),

with γ_n = ((n−1)/n)γ and δ_n = (n−2)/n. We note that δ_n is typically constant for strongly-convex problems. This corresponds to Eq. (4) with α = γ and β = 0.
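As a numerical sanity check of these correspondences (assuming the reconstructed Eq. (4) and the helpers defined earlier), the classical iterations and the unified recursion produce the same iterates:

```python
gamma, n_iter = 0.9 / h[0], 60
theta0 = theta_star + rng.standard_normal(d)

# Av-GD: plain gradient descent plus online averaging.
phi, av = theta0.copy(), [theta0.copy(), theta0.copy()]
for n in range(2, n_iter + 1):
    phi = phi - gamma * grad(phi)
    av.append(((n - 1) * av[-1] + phi) / n)

# Acc-GD: Nesterov's method with momentum delta_n = (n - 1)/(n + 1).
acc, omega = [theta0.copy(), theta0.copy()], theta0.copy()
for n in range(2, n_iter + 1):
    new = omega - gamma * grad(omega)
    omega = new + (n - 1) / (n + 1) * (new - acc[-1])
    acc.append(new)

for (a, b), ref in [((gamma, gamma), av), ((2 * gamma, gamma), acc)]:
    assert np.allclose(run_two_step(theta0, a, b, grad, n_iter)[-1], ref[-1])
```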

3 Convergence with Noiseless Gradients

We study the convergence of the iterates defined by ψ_n = (2I − αH)ψ_{n−1} − (I − βH)ψ_{n−2}. This is a second-order iterative system with constant coefficients, which it is standard to cast in a linear framework (see, e.g., Ortega and Rheinboldt, 2000). We may rewrite it as Θ_n = F Θ_{n−1}, with Θ_n = (ψ_n, ψ_{n−1}) and F the block matrix whose first row is (2I − αH, −(I − βH)) and whose second row is (I, 0).

Thus Θ_n = F^{n−1} Θ_1. Following O’Donoghue and Candes (2013), if we consider an eigenvalue decomposition of H, i.e., H = U Diag(h) Uᵀ with U an orthogonal matrix and (h_i) the eigenvalues of H, sorted in decreasing order h_1 ≥ ⋯ ≥ h_d > 0, then Eq. (3) may be rewritten as:

Uᵀψ_n = (2I − α Diag(h)) Uᵀψ_{n−1} − (I − β Diag(h)) Uᵀψ_{n−2}.  (5)

Thus there is no interaction between the different eigenspaces and we may consider, for the analysis only, d different recursions with ψ_nⁱ = u_iᵀψ_n, i ∈ {1, …, d}, where u_i is the i-th column of U:

ψ_nⁱ = (2 − αh_i) ψ_{n−1}ⁱ − (1 − βh_i) ψ_{n−2}ⁱ.  (6)

3.1 Characteristic polynomial and eigenvalues

In this section, we consider a fixed i and study the stability in the corresponding eigenspace. This linear dynamical system may be analyzed by studying the eigenvalues of the 2×2 matrix F_i with rows (2 − αh_i, −(1 − βh_i)) and (1, 0). These eigenvalues are the roots of its characteristic polynomial, which is:

x² − (2 − αh_i) x + (1 − βh_i).

To compute the roots of this second-order polynomial, we compute its reduced discriminant:

Δ_i = (1 − αh_i/2)² − (1 − βh_i) = h_i (α²h_i/4 − α + β).

Depending on the sign of the discriminant Δ_i, there will be two real distinct eigenvalues (Δ_i > 0), two complex conjugate eigenvalues (Δ_i < 0), or a single real eigenvalue (Δ_i = 0).

We will now study the sign of Δ_i. In each case, we will determine under what conditions on α and β the modulus of the eigenvalues is less than one, which means that the iterates (ψ_nⁱ) remain bounded and the iterates (θ_n) converge to θ*. We may then compute the function values as f(θ_n) − f(θ*) = (1/(2n²)) Σ_{i=1}^d h_i (ψ_nⁱ)².

The various regimes are summarized in Figure 1: there is a triangle of values of (αh_i, βh_i) for which the algorithm remains stable (i.e., the iterates do not diverge), with either complex or real eigenvalues. In the following lemmas (see proof in Appendix C), we provide a detailed analysis that leads to Figure 1.

Lemma 1 (Real eigenvalues).

The discriminant Δ_i is strictly positive and the algorithm is stable if and only if

α(1 − αh_i/4) < β ≤ α  and  (α + β)h_i ≤ 4.

We then have two real roots r_i± = 1 − αh_i/2 ± √Δ_i, with −1 ≤ r_i⁻ < r_i⁺ ≤ 1. Moreover, we have:

ψ_nⁱ = [(r_i⁺)ⁿ − (r_i⁻)ⁿ] / (r_i⁺ − r_i⁻) · ψ_1ⁱ.  (7)

Therefore, for real eigenvalues, ψ_nⁱ will converge to 0 at a speed of (r_i⁺)ⁿ; however, the denominator r_i⁺ − r_i⁻ = 2√Δ_i may be arbitrarily small (and thus the scaling factor arbitrarily large). Furthermore, we have linear convergence if the inequalities in the lemma are strict.

Figure 1: Area of stability of the algorithm in the (αh, βh) plane, with the three traditional algorithms (Av-GD, Acc-GD, HB) represented, together with the real and complex regions. In the interior of the triangle, the convergence is linear.
Lemma 2 (Complex eigenvalues).

The discriminant Δ_i is strictly negative and the algorithm is stable if and only if

0 ≤ β < α(1 − αh_i/4).

We then have two complex conjugate eigenvalues r_i± = ρ_i e^{±iω_i}. Moreover, we have:

ψ_nⁱ = ρ_iⁿ⁻¹ [sin(nω_i)/sin(ω_i)] ψ_1ⁱ,  (8)

with ρ_i = √(1 − βh_i), and ω_i ∈ (0, π) defined through ρ_i cos(ω_i) = 1 − αh_i/2 and ρ_i sin(ω_i) = √(−Δ_i).

Therefore, for complex eigenvalues, there is linear convergence if the inequalities in the lemma are strict. Moreover, ψ_nⁱ oscillates towards 0 at a speed of ρ_iⁿ = (1 − βh_i)^{n/2}, even if h_i is arbitrarily small.

Coalescing eigenvalues.

When the discriminant goes to zero in the explicit formulas of the real and complex cases, both the denominator and the numerator of ψ_nⁱ go to zero. In the limit case, when the discriminant is exactly zero, we have a double real eigenvalue. This happens for β = α(1 − αh_i/4). The eigenvalue is then r_i = 1 − αh_i/2; the algorithm is stable for αh_i < 4, and we then have ψ_nⁱ = n r_iⁿ⁻¹ ψ_1ⁱ. This can be obtained by letting Δ_i go to 0 in the real and complex cases (see also Appendix C.3).

Summary.

To conclude, the iterate ψ_nⁱ will be stable for 0 ≤ β ≤ α and (α + β)h_i ≤ 4. Depending on the values of α and β, this iterate will have different behaviors. In the complex case, the roots are complex conjugate with magnitude √(1 − βh_i); thus, when βh_i > 0, ψ_nⁱ will converge to 0, oscillating, at rate (1 − βh_i)^{n/2}. In the real case, the two roots are real and distinct; however, the product of the two roots is equal to 1 − βh_i, thus one of them has magnitude larger than √(1 − βh_i), and ψ_nⁱ converges to 0 at a slower rate than in the complex case (as long as α and β belong to the interior of the stability region).

Finally, for a given quadratic function f, all the iterates (ψ_nⁱ) for i ∈ {1, …, d} should be bounded; therefore we must have 0 ≤ β ≤ α and (α + β)h_1 ≤ 4 for the largest eigenvalue h_1. Then, depending on the value of h_i, some eigenvalues may be complex and others real.
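These regimes are easy to check numerically. The sketch below (again assuming the reconstructed recursion; the (α, β, h) values are illustrative) classifies a pair of step-sizes for a given eigenvalue through the reduced discriminant and the spectral radius of F_i:

```python
def classify(alpha, beta, hi):
    """Regime and stability of psi_n = (2 - alpha*hi) psi_{n-1} - (1 - beta*hi) psi_{n-2}."""
    F = np.array([[2 - alpha * hi, -(1 - beta * hi)],
                  [1.0, 0.0]])
    disc = (1 - alpha * hi / 2) ** 2 - (1 - beta * hi)   # reduced discriminant
    radius = np.abs(np.linalg.eigvals(F)).max()          # stable iff <= 1
    regime = "real" if disc > 0 else ("complex" if disc < 0 else "coalescing")
    in_triangle = 0 <= beta <= alpha and (alpha + beta) * hi <= 4
    return regime, radius, in_triangle

print(classify(1.0, 1.0, 0.5))   # Av-GD-like: real roots 1 and 1 - gamma*h (radius 1)
print(classify(2.0, 1.0, 0.5))   # Acc-GD-like: complex, radius sqrt(1 - gamma*h) < 1
print(classify(1.0, 0.0, 0.5))   # HB-like: complex, radius exactly 1
```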

3.2 Classical examples

For particular choices of α and β, displayed in Figure 1, the eigenvalues are either all real or all complex, as shown in the table below.

              Av-GD           Acc-GD                 Heavy ball
(α, β)        (γ, γ)          (2γ, γ)                (γ, 0)
eigenvalues   real: 1 and     complex: modulus       complex: modulus 1
              1 − γh_i        √(1 − γh_i)

Averaged gradient descent loses linear convergence for strongly-convex problems because one eigenvalue is equal to 1 for all eigensubspaces. Similarly, the heavy-ball method is not adaptive to strong convexity because the modulus of its eigenvalues is 1. However, accelerated gradient descent, although designed for non-strongly-convex problems, is adaptive because its modulus √(1 − γh_i) depends on h_i, while the moduli of the other two methods do not. These last two algorithms have an oscillatory behavior which can be observed in practice and has already been studied (Su et al., 2014).

Note that all the classical methods choose step-sizes α and β such that the eigenvalues are either all real or all complex; we will see in Section 4 that it is significant to combine both behaviors in the presence of noise.

3.3 General bound

Even if the exact formulas in Lemmas 1 and 2 are computable, they are not easily interpretable. In particular, when the two roots become close, the denominator goes to zero, which prevents us from bounding them easily. When we further restrict the domain of (α, β), we can always bound the iterates through the following general bound (see proof in Appendix D):

Theorem 2.

For α and β in (a slight restriction of) the stability region described above, we have, for all i:

(9)

These bounds are shown by dividing the set of values of (αh_i, βh_i) into three regions where we obtain specific bounds. They do not depend on the regime of the eigenvalues (complex or real); this enables us to get the following general bound on the function values, our main result for the deterministic case.

Corollary 1.

For the same α and β as in Theorem 2:

(10)

We can make the following observations:

  • The first bound corresponds to the traditional acceleration result, and is only relevant for α > β (that is, for Nesterov acceleration and the heavy-ball method, but not for averaging). We recover the traditional O(1/n²) convergence rate of second-order methods for quadratic functions in the singular case, such as conjugate gradient (Polyak, 1987, Section 6.1).

  • While the result above focuses on function values, like most results in the non-strongly-convex case, the distance to the optimum typically does not go to zero (although it remains bounded in our situation).

  • When α = β = γ (averaged gradient descent), the second bound provides a convergence rate of O(1/n) if no assumption is made regarding the starting point θ_0, while the last bound of Theorem 2 would lead to a bound involving ⟨θ_0 − θ*, H⁻¹(θ_0 − θ*)⟩, that is, a rate of O(1/n²), but only for some starting points.

  • As shown in Appendix E by exhibiting explicit sequences of quadratic functions, the inverse dependence on α and β in Eq. (10) is not improvable.
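The two main noiseless regimes can be observed on the toy problem defined in Section 2 (orders of magnitude only; gamma, theta0 and the helpers are those of the earlier sketches):

```python
# Forgetting of initial conditions: acceleration (alpha = 2*gamma > beta) vs
# averaging (alpha = beta = gamma). On this finite problem Av-GD behaves as O(1/n)
# until n exceeds 1/(gamma * h_min), and as O(1/n^2) afterwards.
for n_iter in (30, 100, 300, 1000):
    acc_val = excess(run_two_step(theta0, 2 * gamma, gamma, grad, n_iter)[-1])
    av_val = excess(run_two_step(theta0, gamma, gamma, grad, n_iter)[-1])
    print(f"n = {n_iter:4d}   Acc-GD: {acc_val:.2e}   Av-GD: {av_val:.2e}")
```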

4 Quadratic Optimization with Additive Noise

In many practical situations, the gradient of f is not available for the recursion in Eq. (4); only a noisy version is. In this paper, we only consider additive uncorrelated noise with finite variance.

Figure 2: Trade-off between averaged (Av-GD) and accelerated (Acc-GD) methods for noisy gradients, with our algorithms in between.

4.1 Stochastic difference equation

We now assume that the true gradient is not available and that we instead have access to a noisy oracle for the gradient of f. In Eq. (4), we assume that the oracle outputs a noisy gradient f′(ν_n) + ε_n. The noise (ε_n) is assumed to be uncorrelated and zero-mean with bounded covariance, i.e., E[ε_n ⊗ ε_m] = 0 for all n ≠ m and E[ε_n ⊗ ε_n] ≼ C, where A ≼ B means that B − A is positive semi-definite.

For quadratic functions, for the reduced variable ψ_n = n(θ_n − θ*), we get:

ψ_n = (2I − αH)ψ_{n−1} − (I − βH)ψ_{n−2} − [(n−1)α − (n−2)β] ε_n.  (11)

Note that algorithms with α ≠ β will have an important level of noise because of the term [(n−1)α − (n−2)β]ε_n, which scales as n(α − β)ε_n. We denote by ξ_n = −[(n−1)α − (n−2)β]ε_n, and we now have the recursion:

Θ_n = F Θ_{n−1} + Ξ_n,  with Ξ_n = (ξ_n, 0),  (12)

which is a standard noisy linear dynamical system (see, e.g., Arnold, 1998) with uncorrelated noise process (Ξ_n). We may thus express Θ_n directly as Θ_n = F^{n−1}Θ_1 + Σ_{k=2}^n F^{n−k}Ξ_k, and its expected second-order moment as E[Θ_nΘ_nᵀ] = F^{n−1}Θ_1Θ_1ᵀ(Fᵀ)^{n−1} + Σ_{k=2}^n F^{n−k}E[Ξ_kΞ_kᵀ](Fᵀ)^{n−k}. In order to obtain the expected excess cost function, we simply need to compute E⟨ψ_n, Hψ_n⟩/(2n²), which thus decomposes as a term that only depends on initial conditions (which is exactly the one computed and studied in Section 3.3), and a new term that depends on the noise.
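A minimal Monte-Carlo sketch of this noisy recursion (assuming, as in the reconstruction above, that the oracle is queried once per iteration at the affine combination ν_n; the Gaussian noise model is illustrative):

```python
def run_two_step_noisy(theta0, alpha, beta, grad, n_iter, sigma, rng):
    """Eq. (4) with a noisy gradient oracle f'(nu_n) + eps_n, eps_n ~ N(0, sigma^2 I).
    In psi-coordinates the noise enters multiplied by (n-1)*alpha - (n-2)*beta,
    which grows linearly with n when alpha != beta (hence the fragility of
    acceleration with unstructured noise) and stays constant when alpha = beta."""
    prev2 = prev1 = theta0.copy()
    for n in range(2, n_iter + 1):
        D = (n - 1) * alpha - (n - 2) * beta
        nu = ((n - 1) * alpha * prev1 - (n - 2) * beta * prev2) / D
        g = grad(nu) + sigma * rng.standard_normal(nu.shape)
        prev2, prev1 = prev1, prev1 + (n - 2) / n * (prev1 - prev2) - (D / n) * g
    return prev1
```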

4.2 Convergence result

For a quadratic function with arbitrarily small eigenvalues and uncorrelated noise with finite covariance, we obtain the following convergence result (see proof in Appendix F); since we will allow the parameters α and β to depend on the time at which we stop the algorithm, we introduce the horizon N:

Theorem 3 (Convergence rates with noisy gradients).

Assume E[ε_n ⊗ ε_n] ≼ σ²I for all n, and let α and β lie in the stability region. Then for any n ≥ 1, E f(θ_n) − f(θ*) is upper-bounded by:

(13)

We can make the following observations:

  • Although we only provide an upper bound, the proof technique relies on direct moment computations in each eigensubspace with few inequalities, and we conjecture that the scalings with respect to n, α and β are tight.

  • For α = β = γ (which corresponds to averaged gradient descent), the second bound leads to a term which is bounded but not converging to zero. We recover a result from Bach and Moulines (2011, Theorem 1).

  • For α = 2β (which corresponds to Nesterov’s acceleration), the first bound grows with n, and our bound suggests that the algorithm diverges, which we have observed in our experiments in Appendix A.

  • For step-sizes α and β decaying appropriately with the horizon N, the second bound leads to the traditional rate of O(1/√n) for stochastic gradient in the non-strongly-convex case.

  • When the values of the bias and the variance are known, we can choose α and β such that the trade-off between the bias and the variance is optimal in our bound, as the following corollary shows. Note that in the bound below, taking a non-zero β enables the bias term to be adaptive to hidden strong convexity.

Corollary 2.

For α and β optimized as functions of the horizon N and the noise level, we have:

4.3 Structured noise and least-squares regression

When only the total variance of the noise is considered, Corollary 2 recovers existing (more general) results, as shown in Section 4.4. Our framework, however, leads to improved results for the structured noise processes frequent in machine learning, in particular in least-squares regression, which we now consider, although the approach applies beyond this setting (see, e.g., Bach and Moulines, 2013).

Assume we observe independent and identically distributed pairs (x_n, y_n) ∈ Rᵈ × R and we want to minimize the expected loss f(θ) = ½ E[(y_n − ⟨x_n, θ⟩)²]. We denote by H = E[x_n ⊗ x_n] the covariance matrix, which is assumed invertible. The global minimum of f is attained at θ* defined as before, and we denote by ε_n = y_n − ⟨x_n, θ*⟩ the statistical noise, which we assume bounded by σ. We have E[ε_n x_n] = 0. In an online setting, we observe the gradient θ ↦ (⟨x_n, θ⟩ − y_n)x_n, whose expectation is the gradient f′(θ). This corresponds to a noise in the gradient of (⟨x_n, θ⟩ − y_n)x_n − f′(θ). If the data are almost surely bounded, the covariance matrix C of this noise at the optimum is bounded by a constant times H. This suggests to characterize the noise by tr(H⁻¹C), which is bounded even though H has arbitrarily small eigenvalues.

However, our result will not apply to stochastic gradient descent (SGD) for least-squares, because of the term x_n ⊗ x_n, which multiplies the current iterate and thus makes the noise depend on it; it applies instead to a “semi-stochastic” recursion where the noisy gradient is f′(θ) + ξ_n, with a noise process ξ_n = −ε_n x_n which is independent of the iterate. This semi-stochastic recursion has been used by Bach and Moulines (2011) and Dieuleveut and Bach (2014) to prove results on regular stochastic gradient descent. We conjecture that our algorithm (and results) also applies in the regular SGD case, and we provide encouraging experiments in Section 5.
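The structured-noise oracle is easy to emulate; the sketch below builds the “semi-stochastic” gradient f′(θ) + ε_n x_n discussed above (σ and the Gaussian inputs are illustrative assumptions; U, h and grad are from the earlier setup, so E[x xᵀ] = H and the noise covariance is σ²H):

```python
def semi_stochastic_grad(sigma, rng):
    def g(theta):
        x = U @ (np.sqrt(h) * rng.standard_normal(d))   # E[x x^T] = U Diag(h) U^T = H
        eps = sigma * rng.standard_normal()             # statistical noise, E[eps * x] = 0
        return grad(theta) + eps * x
    return g

# Averaging regime (alpha = beta): robust to this structured noise.
g_noisy = semi_stochastic_grad(0.1, rng)
print(excess(run_two_step(theta0, gamma, gamma, g_noisy, 500)[-1]))
```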

For this particular structured noise, we can take advantage of a large β:

Theorem 4 (Convergence rates with structured noisy gradients).

Assume E[ε_n ⊗ ε_n] ≼ σ²H for all n, and let α and β lie in the stability region. For any n ≥ 1, E f(θ_n) − f(θ*) is upper-bounded by:

(14)

We can make the following observations:

  • For α = β = γ (which corresponds to averaged gradient descent), the second bound leads to a variance term of order σ²d/n. We recover a result from Bach and Moulines (2013, Theorem 1).

  • For α = 2β (which corresponds to Nesterov’s acceleration), the first bound leads to a term which is bounded but not converging to zero (as opposed to the unstructured-noise case, where the algorithm may diverge).

  • For intermediate choices of (α, β), the first bound interpolates between these two behaviors. We thus obtain an explicit bias-variance trade-off by changing the value of the ratio between α and β.

  • When the values of the bias and the variance are known, we can choose α and β with an optimized trade-off, as the following corollary shows:

Corollary 3.

For α and β optimized as functions of the horizon N and of the problem constants, we have:

(15)

4.4 Related work

Acceleration and noisy gradients.

Several authors (Lan, 2012; Hu et al., 2009; Xiao, 2010) have shown that, using an appropriately decaying step-size, accelerated methods with noisy gradients lead to the same convergence rate as in Corollary 2 for smooth functions. Thus, for unstructured noise, our analysis provides insights into the behavior of second-order algorithms, without improving the bounds. We do, however, get significant improvements for structured noise.

Least-squares regression.

When the noise is structured as in least-squares regression, and more generally in linear supervised learning, Bach and Moulines (2011) have shown that averaged stochastic gradient descent with constant step-size leads to a convergence rate of O(1/n). It has been highlighted by Défossez and Bach (2014) that the bias term may often be the dominant one in practice. Our result in Corollary 3 leads to an improved bias term of order O(1/n²), at the price of a potentially slightly worse constant in the variance term. However, with optimal constants in Corollary 3, the new algorithm is always an improvement over averaged stochastic gradient descent. If the constants are unknown, we may still choose α and β depending on the emphasis we want to put on bias or variance.

Minimax convergence rates.

For noisy quadratic problems, the convergence rate nicely decomposes into two terms: a bias term, which corresponds to the noiseless problem, and a variance term, which corresponds to a problem started at θ*. For each of these two terms, lower bounds are known. For the bias term, in the regime n ≤ (d−1)/2, the lower bound is, up to constants, h_1‖θ_0 − θ*‖²/n² (Nesterov, 2004, Theorem 2.1.7). For the variance term, in the general noisy-gradient situation, we show in Appendix H a lower bound of order σ/√n, while for least-squares regression it is σ²d/n (Tsybakov, 2003). Thus, for the two situations, we attain the two lower bounds simultaneously in certain regimes. It remains an open problem to achieve the two minimax terms in all situations.

Other algorithms as special cases.

We also note, as shown in Appendix G, that in the special case of quadratic functions, the algorithms of Lan (2012), Hu et al. (2009) and Xiao (2010) can be unified into our framework (although they have significantly different formulations and justifications in the smooth case).

5 Experiments

In this section, we illustrate our theoretical results on synthetic examples. We consider a matrix H that has random eigenvectors and eigenvalues 1/k, for k = 1, …, d. We take a random optimum θ* and a random starting point θ_0 such that ‖θ_0 − θ*‖ = 1 (unless otherwise specified). In Appendix A, we illustrate the noiseless results of Section 3, in particular the oscillatory behaviors and the influence of all eigenvalues, as well as unstructured noisy gradients. In this section, we focus on noisy gradients with structured noise (as described in Section 4.3), where our new algorithms show significant improvements.

We compare our algorithm to other stochastic accelerated algorithms, that is, AC-SA (Lan, 2012), SAGE (Hu et al., 2009) and Acc-RDA (Xiao, 2010), which are presented in Appendix G. For all these algorithms (and ours) we take the optimal step-sizes defined in these papers. We show results averaged over 10 replications.

Homoscedastic noise.

We first consider an i.i.d. zero-mean noise whose covariance matrix is proportional to H. We also consider a variant of our algorithm with an any-time step-size, a function of n rather than of the horizon N (for which we currently have no proof of convergence). In Figure 3, we consider two different set-ups. In the left plot, the variance dominates the bias. We see that (a) Acc-GD does not converge to the optimum but does not diverge either, (b) Av-GD and our algorithms achieve the optimal rate of convergence of O(1/n), whereas (c) the other accelerated algorithms only converge at rate O(1/√n). In the right plot, the bias dominates the variance. In this situation our algorithm outperforms all others.

Figure 3: Quadratic optimization with regression noise. Left: variance-dominated regime. Right: bias-dominated regime.
Application to least-squares regression.

We now see how these algorithms behave for least-squares regression with the regular (non-homoscedastic) stochastic gradients described in Section 4.3. We consider normally distributed inputs; the covariance matrix is the same as before. The outputs are generated from a linear function with homoscedastic noise with a fixed signal-to-noise ratio. We show results averaged over 10 replications. In Figure 4, we consider again a situation where the bias dominates (left) and vice versa (right). We see that our algorithm shows the same good behavior as in the homoscedastic-noise case, and we conjecture that our bounds also hold in this situation.

Figure 4: Least-squares regression. Left: bias-dominated regime. Right: variance-dominated regime.

6 Conclusion

We have provided a joint analysis of averaging and acceleration for non-strongly-convex quadratic functions in a single framework, both with noiseless and noisy gradients. This allows us to define a class of algorithms that can benefit simultaneously from the known improvements of averaging and acceleration: faster forgetting of initial conditions (for acceleration), and better robustness to noise when the noise covariance is proportional to the Hessian (for averaging).

Our current analysis of the class of algorithms in Eq. (4), which considers two different affine combinations of previous iterates (instead of one for traditional acceleration), is limited to quadratic functions; an extension of the analysis to all smooth or self-concordant-like functions would widen its applicability. Similarly, an extension to least-squares regression with its natural heteroscedastic stochastic gradient, as suggested by our simulations, would be an interesting development.

Acknowledgements

This work was partially supported by the MSR-Inria Joint Centre and a grant by the European Research Council (SIERRA project 239993). The authors would like to thank Aymeric Dieuleveut for helpful discussions.

References

  • Agarwal et al. (2012) A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. Information Theory, IEEE Transactions on, 58(5):3235–3249, 2012.
  • Arnold (1998) L. Arnold. Random dynamical systems. Springer Monographs in Mathematics. Springer-Verlag, 1998.
  • Bach and Moulines (2011) F. Bach and E. Moulines. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In Advances in Neural Information Processing Systems, 2011.
  • Bach and Moulines (2013) F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, December 2013.
  • Beck and Teboulle (2009) A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
  • d’Aspremont (2008) A. d’Aspremont. Smooth optimization with approximate gradient. SIAM J. Optim., 19(3):1171–1183, 2008.
  • Défossez and Bach (2014) A. Défossez and F. Bach. Constant step size least-mean-square: Bias-variance trade-offs and optimal sampling distributions. Technical Report 1412.0156, arXiv, 2014.
  • Devolder et al. (2014) O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Math. Program., 146(1-2, Ser. A):37–75, 2014.
  • Dieuleveut and Bach (2014) A. Dieuleveut and F. Bach. Non-parametric Stochastic Approximation with Large Step sizes. Technical Report 1408.0361, arXiv, August 2014.
  • Hu et al. (2009) C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, 2009.
  • Lan (2012) G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1-2, Ser. A):365–397, 2012.
  • Nesterov (1983) Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
  • Nesterov (2004) Y. Nesterov. Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004. A basic course.
  • Nesterov (2013) Y. Nesterov. Gradient methods for minimizing composite functions. Math. Program., 140(1, Ser. B):125–161, 2013.
  • O’Donoghue and Candes (2013) B. O’Donoghue and E. Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, pages 1–18, 2013.
  • Ortega and Rheinboldt (2000) J. M. Ortega and W. C. Rheinboldt. Iterative solution of nonlinear equations in several variables, volume 30 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2000.
  • Polyak (1964) B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
  • Polyak (1987) B. T. Polyak. Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc., Publications Division, New York, 1987.
  • Polyak and Juditsky (1992) B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.
  • Schmidt et al. (2011) M. Schmidt, N. Le Roux, and F. Bach. Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization. In Advances in Neural Information Processing Systems, December 2011.
  • Su et al. (2014) W. Su, S. Boyd, and E. Candes. A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights. In Advances in Neural Information Processing Systems, 2014.
  • Tsybakov (2003) A. B. Tsybakov. Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory, 2003.
  • Xiao (2010) L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11:2543–2596, 2010.

Appendix A Additional experimental results

In this appendix, we provide additional experimental results to illustrate our theoretical results.

A.1 Deterministic convergence

Comparison for d = 1.

In Figure 5, we minimize a one-dimensional quadratic function (Hessian h > 0) for a fixed step-size α and different step-sizes β. In the left plot, we compare Acc-GD, HB and Av-GD. We see that HB and Acc-GD both oscillate and that Acc-GD leverages strong convexity to converge faster. In the right plot, we compare the behavior of the algorithm for different values of β. We see that the optimal rate is achieved for β* = α(1 − αh/4), defined to be the value for which there is a double, coalescent eigenvalue, where the convergence is linear at speed (1 − αh/2)ⁿ. When β > β*, we are in the real case, and when β < β* the algorithm oscillates towards the solution.

Figure 5: Deterministic case for d = 1. Left: classical algorithms. Right: different oscillatory behaviors.
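The coalescing choice can be reproduced with the earlier sketch (a one-dimensional toy run, assuming the reconstructed recursion; h = 1 and α = 1 are illustrative):

```python
h1, alpha1, n_iter = 1.0, 1.0, 50
grad1 = lambda t: h1 * t                       # f(t) = h1 * t^2 / 2, optimum at 0
t0 = np.ones(1)
beta_star = alpha1 * (1 - alpha1 * h1 / 4)     # double eigenvalue at 1 - alpha1*h1/2
for beta in (0.0, beta_star / 2, beta_star, alpha1):   # complex, complex, coalescing, real
    val = 0.5 * h1 * run_two_step(t0, alpha1, beta, grad1, n_iter)[-1][0] ** 2
    print(f"beta = {beta:.3f}   f - f* = {val:.2e}")
```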
Comparison between the different eigenspaces.
Figure 6: Left: deterministic quadratic optimization for d = 2. Right: function values of the projections of the iterate on the different eigenspaces.

Figure 6 shows interactions between different eigenspaces. In the left plot, we optimize a quadratic function of dimension 2, with one large and one small eigenvalue. For Av-GD the convergence is of order O(1/n), since the problem does not appear strongly convex (the smallest eigenvalue remains small). The convergence is the same at the beginning for HB and Acc-GD, with oscillations at speed O(1/n²), since the small eigenvalue prevents Acc-GD from converging linearly. Then, for large n, the convergence becomes linear for Acc-GD, once n is large relative to the inverse of the smallest eigenvalue. In the right plot, we optimize a quadratic function in higher dimension, with eigenvalues ranging from large to arbitrarily small, and show the function values of the projections of the iterates on the different eigenspaces. We see that the large eigenvalues dominate at first but converge quickly to zero, whereas the small ones keep oscillating and converge more slowly.

Comparison for different eigenvalue ranges.
Figure 7: Deterministic case with large eigenvalues (left) and small eigenvalues (right).

In Figure 7, we optimize two quadratic functions with different eigenvalues using Av-GD, HB and Acc-GD, for a fixed step-size γ. In the left plot the eigenvalues are large, and in the right plot they are small. We see that in both cases Av-GD converges at a rate of O(1/n) and HB at a rate of O(1/n²). For Acc-GD the convergence is linear when the eigenvalues are large (left plot) and becomes sublinear, at a rate of O(1/n²), when they become small (right plot).

A.2 Noisy convergence with unstructured additive noise

We optimize the same quadratic function, but now with noisy gradients. We compare our algorithm to other stochastic accelerated algorithms, that is, AC-SA [Lan, 2012], SAGE [Hu et al., 2009] and Acc-RDA [Xiao, 2010], which are presented in Appendix G. For all these algorithms (and ours) we take the optimal step-sizes defined in these papers. We plot the results averaged over 10 replications.

We consider in Figure 8 an i.i.d. zero-mean noise of variance σ². We see that all the accelerated algorithms achieve the same precision, whereas Av-GD with constant step-size does not converge to the optimum and Acc-GD diverges. However, SAGE and AC-SA are anytime algorithms and are faster at the beginning, since their step-sizes are decreasing in n rather than constant (with respect to n) functions of the horizon N.

Figure 8: Quadratic optimization with additive noise.

Appendix B Proofs of Section 2

B.1 Proof of Theorem 1

Let (P_nᴬ, P_nᴮ, P_nᶜ), for all n, be a sequence of polynomials. We consider the iterates defined for all n by

θ_n = P_nᴬ(H)θ_{n−1} + P_nᴮ(H)θ_{n−2} + P_nᶜ(H)q,

started from θ_1 = θ_0. The θ*-stationarity property gives, for all n:

θ* = P_nᴬ(H)θ* + P_nᴮ(H)θ* + P_nᶜ(H)q.

Since q = Hθ*, we get, for all n:

[I − P_nᴬ(H) − P_nᴮ(H) − P_nᶜ(H)H] θ* = 0.

For all n, we apply this relation to all vectors θ* (the properties must hold for all quadratic functions f, hence for all optima θ*):

I − P_nᴬ(H) − P_nᴮ(H) − P_nᶜ(H)H = 0,

and we get P_nᶜ(H)H = I − P_nᴬ(H) − P_nᴮ(H). Therefore there are polynomials Q_nᴬ and Q_nᴮ, for all n, such that we have, for all n:

(16)

The n-scalability property means that there are polynomials (Pᴬ, Pᴮ) independent of n such that:

P_nᴬ = ((n−1)/n) Pᴬ and P_nᴮ = ((n−2)/n) Pᴮ.

And in connection with Eq. (16) we can rewrite A_n and B_n as: