On the Effectiveness of Richardson Extrapolation in Machine Learning
Abstract
Richardson extrapolation is a classical technique from numerical analysis that can improve the approximation error of an estimation method by combining linearly several estimates obtained from different values of one of its hyperparameters, without the need to know in details the inner structure of the original estimation method. The main goal of this paper is to study when Richardson extrapolation can be used within machine learning, beyond the existing applications to stepsize adaptations in stochastic gradient descent. We identify two situations where Richardson interpolation can be useful: (1) when the hyperparameter is the number of iterations of an existing iterative optimization algorithm, with applications to averaged gradient descent and FrankWolfe algorithms (where we obtain asymptotically rates of on polytopes, where is the number of iterations), and (2) when it is a regularization parameter, with applications to Nesterov smoothing techniques for minimizing nonsmooth functions (where we obtain asymptotically rates close to for nonsmooth functions), and ridge regression. In all these cases, we show that extrapolation techniques come with no significant loss in performance, but with sometimes strong gains, and we provide theoretical justifications based on asymptotic developments for such gains, as well as empirical illustrations on classical problems from machine learning.
1 Introduction
Many machine learning methods can be cast as looking for approximations of some ideal quantity which cannot be readily computed from the data at hand: this ideal quantity can be the predictor learned from infinite data, or an iterative algorithm run for infinitely many iterations. Taking theirs roots in optimization and more generally numerical analysis, many accelerations techniques have been developed to tighten these approximations with as few changes as possible to the original method.
While some acceleration techniques add some simple modifications to a known algorithm, such as Nesterov acceleration for the gradient descent method (Nesterov, 1983), extrapolation techniques are techniques that do not need to know the fine inner structure of the method to be accelerated. These methods are only based on the observations of solutions of the original method.
Extrapolation techniques work on the vectorvalued output of the original method that depends on some controllable realvalued quantity , which can be the number of iterations or some regularization parameter, and more generally any parameter that controls both the running time and the approximation error of the algorithm. When tends to (which is typically or ), we will assume that has an asymptotic expansion of the form
where is the desired output, is the asymptotic equivalent of , and . The key question in extrapolation is the following: from the knowledge of for potentially several ’s, how can we better approximate , without the full knowledge of ?
For exponentially converging algorithms, there exist several “nonlinear” schemes that combine linearly several values of with weights that depend nonlinearly on the iterates, such as Aitken’s process (Aitken, 1927) or Anderson acceleration (Anderson, 1965), which has recently been shown to provide significant acceleration to linearly convergent gradientbased algorithms (Scieur et al., 2016). In this paper, we consider dependence in powers of , where Richardson extrapolation excels (see, e.g., Richardson, 1911; Joyce, 1971; Gautschi, 1997).
We thus assume that
and is a power of such that , where is known but is unknown, that is
In all our cases, when and when . Richardson extrapolation is simply combining two iterates with different values of so that the zeroth order term is preserved, while the firstorder term cancels, for example:
See an illustration in Figure 1 for , , and . Note that: (a) the choice of as a multiplicative factor is arbitrary and chosen for its simplicity when , and (b) Richardson extrapolation can be used with iterates to remove the first terms in a asymptotic expansion, where the powers of the expansion are known and not the associated vectorvalued constants (see examples in Section 3).
The main goal of this paper is to study when Richardson extrapolation can be used within machine learning, beyond the existing explicit applications to stepsize adaptations in stochastic gradient descent (Durmus et al., 2016; Dieuleveut et al., 2019), and implicit widespread use in “tailaveraging” (see more details in Section 2.1).
We identify two situations where Richardson interpolation can be useful:

is the number of iterations of an existing iterative optimization algorithm converging to , where then and , and Richardson extrapolation considers, for even, . We consider in Section 2, averaged gradient descent and FrankWolfe algorithms (where we obtain asymptotically rates of on polytopes, where is the number of iterations).

is a regularization parameter, where then and , and Richardson extrapolation considers . We consider in Section 3, Nesterov smoothing techniques for minimizing nonsmooth functions (where we obtain asymptotically rates close to for nonsmooth functions), and ridge regression (where we obtain estimators with lower bias).
As we will show, extrapolation techniques come with no significant loss in performance, but with sometimes strong gains, and the goal of this paper is to provide theoretical justifications for such gains, as well as empirical illustrations on classical problems from machine learning. Note that we aim for the simplest asymptotic results (most can be made nonasymptotic with extra assumptions).
2 Extrapolation on the Number of Iterations
In this section, we consider extrapolation based on the number of iterations , that is, for the simplest case
for optimization algorithms aimed at minimizing on (or potentially a subset thereof) a function , which we will always assume convex, threetimes differentiable with Hessian eigenvalues bounded by , and a unique minimizer such that is positive definite. We will consider as the performance measure. Note that these assumptions, in particular convexity, are made to make the asymptotic analysis simpler, and that nonconvex extensions could be considered.
If is converging to , then so is , and thus also . Given our assumptions on , then even if there are no cancellations, because we have assumed that our performance measure is smooth, the convergence rate is at most a constant times the original rate for (see a detailed argument in Appendix A based on local quadratic approximations for unconstrained minimization).
Therefore, performance is never significantly deteriorated (the risk is essentially to lose half of the iterations). The potential gains depend on the way converges to . The existence of a convergence rate of the form or is not enough, as Richardson extrapolation requires a specific direction of asymptotic convergence. As illustrated in Figure 2, some algorithms are oscillating around their solutions, while some converge with a specific direction. Only the latter ones can be accelerated with Richardson extrapolation, while the former ones are good candidates for Anderson acceleration (Anderson, 1965; Scieur et al., 2016).
We now consider three algorithms: (1) averaged gradient descent, where extrapolation is at its best, as it transforms an convergence rate into an exponential one, (2) accelerated gradient descent, where extrapolation does not bring anything, and (3) FrankWolfe algorithms, where the situation is mixed (sometimes it helps, sometimes it does not).
2.1 Averaged gradient descent
We consider the usual gradient descent algorithm
where is a stepsize, with PolyakRuppert averaging (Polyak and Juditsky, 1992; Ruppert, 1988):
Averaging is key to robustness to potential noise of the gradients (Polyak and Juditsky, 1992; Nemirovski et al., 2009). However it comes with the unintended consequence of losing the exponential forgetting of initial conditions for strongly convex problems (Bach and Moulines, 2011).
A common way to restore exponential convergence (up to the noise level in the stochastic case) is to consider “tailaveraging”, that is, to replace by the average of only the latest iterates (Jain et al., 2018). As shown below for even, this corresponds exactly to Richardson extrapolation:
Asymptotic analysis.
With our assumptions on (see the beginning of Section 2), we can show (see Appendix B for a detailed proof) that
where and depends on the condition number of . Note that: (a) before Richardson extrapolation, the asymptotic convergence rate will be of the order , which is better than the usual upperbound for the rate of gradient descent, but with a stronger assumption that in fact leads to exponential convergence before averaging, (b) while has a simple expression, it cannot be computed in practice, (c) that Richardson extrapolation leads to an exponentially convergent algorithm from an algorithm converging asymptotically in for functions values, and (d) that in the presence of noise in the gradients, the exponential convergence would only be up to the noise level. See Figure 3 (left plot) for an illustration.
2.2 Accelerated gradient descent
In the section above, we considered averaged gradient descent, which is asymptotically converging as and on which Richardson extrapolation could be used with strong gains. It is possible also for the accelerated gradient descent (Nesterov, 1983), which has a (nonasymptotic) convergence rate of for convex functions?
It turns out that the behavior of the iterates of accelerated gradient descent is exactly of the form depicted in the left plot of Figure 2: that is, the iterates oscillate around the optimum (see, e.g., Su et al., 2016; Flammarion and Bach, 2015), and Richardson extrapolation is of no help, but is not degrading performance too much. See Figure 3 (right plot) for an illustration.
2.3 FrankWolfe algorithms
We now consider FrankWolfe algorithms (also known as conditional gradient algorithms) for minimizing a smooth convex function on a compact convex set . These algorithms are dedicated to situations where one can easily minimize linear functions on (see, e.g., Jaggi, 2013, and references therein). The algorithm has the following form:
That is, the first order Taylor expansion of at is minimized, ending up typically in an extreme point or and a convex combination of and is considered. While some form of line search can be used to find , we consider socalled “open loop” schemes where or (Dunn and Harshbarger, 1978; Jaggi, 2013).
In terms of function values, these two variants are known to converge at respective rates and . Moreover, as illustrated in Figure 4, they are known to zigzag towards the optimal point. Avoiding this phenomenon can be done in several ways, for examples through optimizing over all convex combinations of the ’s for (Von Hohenbalken, 1977), or through socalled “away steps” (Guélat and Marcotte, 1986; LacosteJulien and Jaggi, 2015). In this section, we consider Richardson extrapolation and assume for simplicity that is a polytope (which is a typical use case for FrankWolfe algorithms). Note here that we are considering asymptotic convergence rates, and even without extrapolation (but with a local strongconvexity assumption), we can beat the rates for the stepsize .
Assumptions.
We let denote the minimizer of on , which we assume unique, on top of the differentiability properties from the beginning of Section 2 (which include local strong convexity around ). In order to show our asymptotic result, we assume that is “in the middle of a face” of . In mathematical words, if the minimizer is strictly in a dimensional face of which is the convex hull of extreme points , then is attained only by elements of the convex hull of . This assumption is sufficient to show that the corresponding optimal face of is eventually identified, and is often referred to as constraint qualification in optimization (Nocedal and Wright, 2006).
Asymptotic expansion.
As shown in Appendix C, for , then
where is orthogonal to the span of all , , while the remainder term is within this span. This characterizes the zigzagging phenomenon, and implies that
Thus, without extrapolation, we get a convergence rate in . Moreover, as expected, we show that
that is, Richardson extrapolation allows to go from an to an convergence rate. In the left plots of Figure 5 and Figure 6, we can observe the benefits of Richardson extrapolation on two optimization problems with the stepsize . Note that: (a) asymptotically, there is provably no extra logarithmic factor like we have for the nonasymptotic convergence rate, and (b) that the Richardson extrapolated iterate may not be within , but is away from it (in our simulations, we simply make the iterate feasible by rescaling).
For , we show in Appendix C that
and thus we already get an performance of without extrapolation, and Richardson extrapolation does not lead to any acceleration. In the right plots of Figure 5 and Figure 6, we indeed see no benefits (but no strong degradation).
Note that here (and for both stepsizes), higher order Richardson would not lead to further cancellation as within the span of the supporting face, we have an oscillating behavior similar to the left plot of Figure 2. Moreover, although we do not have a proof, the closed loop algorithm exhibits the same behavior as the step , both with and without extrapolation, which is consistent with the analysis of Canon and Cullum (1968). It would also be interesting to consider the benefits for Richardson extrapolation for stronglyconvex sets (Garber and Hazan, 2015).
3 Extrapolation on the Regularization Parameter
In this section, we explore the application of Richardson extrapolation to regularization methods. In a nutshell, regularization allows to make an estimation problem more stable (less subject to variations for statistical problems) or the algorithm faster (for optimization problems). However, regularization adds a bias that needs to be removed. In this section, we apply Richardson extrapolation to the regularization parameter to reduce this bias. We consider two applications where we can provably show some benefits: (a) smoothing for nonsmooth optimization in Section 3.1, and (b) ridge regression in Section 3.2.
Other applications include the classical use within integration methods, where the technique is often called RichardsonRomberg extrapolation (Gautschi, 1997), and for bias removal in constantstepsize stochastic gradient descent for sampling (Durmus et al., 2016) and optimization (Dieuleveut et al., 2019).
3.1 Smoothing nonsmooth problems
We consider the minimization of a convex function of the form , where is smooth and is nonsmooth. These optimization problems are ubiquitous in machine learning and signal processing, where the lack of smoothness can come from (a) nonsmooth losses such as maxmargin losses used in support vector machines and more generally structured output classification (Taskar et al., 2005; Tsochantaridis et al., 2005), and (b) sparsityinducing regularizers (see, e.g., Bach et al., 2012, and references therein). While many algorithms can be used to deal with this nonsmoothness, we consider a classical smoothing technique below.
Nesterov smoothing.
In this section, we consider the smoothing approach of Nesterov (2005) where the nonsmooth term is “smoothed” into , where is a regularization parameter, and accelerated gradient descent is used to minimize .
A typical way of smoothing the function is to add times a strongly convex regularizer to the Fenchel conjugate of (see an example below); as shown by Nesterov (2005), this leads to a function which has a smoothness constant (defined as the maximum of the largest eigenvalues of all Hessians) proportional to , with a uniform error of between and . Given that accelerated gradient descent leads to an iterate with excess function values proportional to after iterations, with the choice of , this leads to an excess in function values proportional to , which improves on the subgradient method which converges in .
Richardson extrapolation.
If we denote by the minimizer of and the global minimizer of , if we can show that , then and we can expand , which is better than the approximation without extrapolation.
Then, with , to balance the two terms and , we get an overall convergence rate for the nonsmooth problem of . We now make this formal for the special (but still quite generic) case of polyhedral functions , and also consider step Richardson extrapolation, which leads to a convergence rate arbitrarily close to .
Polyhedral functions.
We consider a polyhedral function of the form
where and . This form includes traditional regularizers such as the norm of grouped norms.
We assume that the function has a unique minimizer for which is positive definite (with the same regularity assumptions than smooth functions from previous sections, that is, threetimes differentiable with bounded derivatives). The optimality conditions are the existence of in the simplex (defined as the vectors in with nonnegative components summing to one), such that, for the support of (that is, the set of nonzeros),
and
Our only further assumptions are to assume that (a) this maximizer is only attained for , and (2) the submatrix obtained by taking the the rows of indexed by has full rank, another classical form of constraint qualification.
We consider the smoothing of as:
for some strongly convex function , typically, the negative entropy , or . We denote by a minimizer of , and the corresponding dual variable. We show in Appendix D that for the quadratic and entropic penalties:
with for the quadratic penalty, and a similar expression for the entropic penalty.
Multiplestep Richardson extrapolation.
Given that onestep Richardson extrapolation allows to go from a bias of to , a natural extension is to consider step Richardson extrapolation (Gautschi, 1997), that is, a combination of iterates:
where the weights are such that the first powers in the Taylor expansion of cancel out. This can be done by solving the linear system based on the following equations:
(1)  
(2) 
Using the same technique as Pagès (2007, Lemma 3.1), this is a Vandermonde system with a closed form solution (see proof in Appendix E.3):
We show in Appendix D that if is sufficiently smooth, we have:
Thus, within the smoothing technique, if we consider , to balance the terms and , we get an error for the nonsmooth problem of , which can get arbitrarily close to when gets large. The downsides are that (a) the constants in front of the asymptotic equivalent may blow up (a classical problem in highorder expansions), and (b) step extrapolation requires running the algorithm times (this can be down in parallel). In our experiments below, 3step extrapolation already brings in most of the benefits.
Experiments.
We consider the penalized Lasso problem:
where for , for and , and with input data distributed as a standard normal vector. We use either a dual entropic penalty or a dual quadratic penalty for smoothing each component of the norm of . Plots for the quadratic penalty are presented here in Figure 7, while plots for the entropic penalty are presented in Figure 9 in Appendix D with the same conclusions.
In the left plot of Figure 7, we illustrate the dependence of on for Richardson extrapolation with various orders, while in the right plot of Figure 7, we study the effect of extrapolation to solve the nonsmooth problem. For a series of regularization parameters equal to for between and (sampled every ), we run accelerated gradient descent on and we plot the value of for the various estimates, where for each number of iterations, we minimize over the regularization parameter. This is an oracle version of varying as a function of the number of iterations (a detailed evaluation where depends on could also be carried out). In Figure 7, we plot the excess function values as a function of the number of iterations, taking into account that step Richardson extrapolation requires times more iterations. We see that we get a strong improvement approaching .
From nonlinear programming to linear programming.
When we use the entropic penalty, the smoothing framework is generally applicable in most nonlinear programming problems (see, e.g., Cominetti and Dussault, 1994). It is interesting to note that typically when applying the entropic penalty, the deviation to the global optimizer is going to zero exponentially in for some of the components (see a proof for our particular case in Appendix D), but not for the corresponding dual problem (which is our primal problem).
Another classical instance of entropic regularization in machine learning leads to the Sinkhorn algorithm for computing optimal transport plans (Cuturi, 2013). For that problem, the entropic penalty is put directly on the original problem, and the deviation in estimating the optimal transport plan can be shown to be asymptotically exponential in (Cominetti and San Martín, 1994), and thus there Richardson extrapolation is not helpful (unless one wants to estimate the Kantorovich dual potentials).
3.2 Improving bias in ridge regression
We consider the ridge regression problem, that is, is the unique minimizer of
where is a feature vector and a vector of responses (Friedman et al., 2001). The solution may be obtained in closed form by solving the normal equations, as .
The regularization term is added to avoid overfitting and control the variability of due to the randomness in the training data (the higher the , the more control); however, it does create a bias that goes down as goes to zero. Richardson extrapolation can be used to reduce this bias. We thus consider and more generally
with the same weights as defined in Eq. (1) and Eq. (2). In order to compute , either ridge regression problems can be solved, or a closedform spectral formula can be used based on a single singular value decomposition of the kernel matrix (see Appendix E.3 for details).
Theoretical analysis.
Following Bach (2013), we assume for simplicity that is deterministic and that , where has zero mean and covariance matrix . We consider the insample error of , where is the usual kernel matrix, and is the smoothing matrix, which is equal to for very small and equal to zero for very large . We consider the socalled “insample generalization error”, that is, we want to minimize
where
The bias term is increasing in , while the variance term in decreasing in , and there is thus a tradeoff between these two terms. To find the optimal , assumptions need to be made on the problem, regarding the eigenvalues of , and the components of in the eigenbasis of . That is, following the notations of Bach (2013, Section 4.3), we assume that the eigenvalues of are (that is bounded from above and below by constants times ) and the coordinates of in the eigenbasis of are . The precise tradeoff depends on the rates at which and decay to zero.
A classical situation is and , where and (to ensure finite energy). As detailed in Appendix E:

the variance term is equivalent to and does not depend on or ;

the bias term depends on both and : for signals which are not too smooth (i.e., not too fast decay of , and thus small ), that is if , then the bias term is equivalent to and we can thus find the optimal as leading to predictive performance of , which happens to be optimal Caponnetto and De Vito (2007). However, when , a phenomenon called “saturation” occurs, and the bias term is equivalent to (independent of ), and the optimized predictive performance is , which is not optimal anymore.
As shown in Appendix E, by reducing the bias, with step Richardson interpolation, we can show that the variance term is bounded by a constant times the usual one, while the bias term is equivalent to for a wider range of , that is, , which recovers for the nonextrapolated estimate. This leads to optimal statistical performance for a wider range of problems.
Experiments.
As an illustration, we consider a ridge regression problem with data uniformly sampled on the unit sphere in dimension with observations, and generated as a linear function of the input plus some noise. We consider the rotation invariant kernel equal to the expectation , for uniform on the sphere. This is equal to, for (see Cho and Saul, 2009; Bach, 2017):
When the number of observations tends to infinity, the eigenvalues of are known to converge to the eigenvalues of a certain infinitedimensional operator (Koltchinskii and Giné, 2000). As shown by Bach (2017), the corresponding eigenvalues of the kernel matrix decay as . We consider generated as a linear function so that has a finite number of non zero components in the eigenbasis of .
In the left plot of Figure 8, we consider step and step Richardson extrapolation and plot the generalization error (averaged over 10 replications) as a function of the regularization parameter: we can see that as expected, (a) with extrapolation the curves move to the right (we can use a larger for a similar performance, which is advantageous as iterative algorithms are typically faster), and (b) the minimal error is smaller (which is true here because we learn a smooth function). In the right plot of Figure 8, we study the effect of increasing the order of extrapolation, showing that the larger the better with some saturation. With infinite, there will be overfitting as the corresponding spectral filter is nonstable, but this happens very slowly (see Appendix E.3 for details).
4 Conclusion
In this paper, we presented various applications of Richardson extrapolation to machine learning optimization and estimation problems, each time with an asymptotic analysis showing the potential benefits. For example, when using the number of iterations of an iterative algorithm to perform extrapolation, we can accelerate FrankWolfe algorithms to have an asymptotic rates of on polytopes for the stepsize and locally stronglyconvex functions (this is achieved without extrapolation for the stepsize ). When extrapolating based on the regularization parameter, we can accelerate Nesterov smoothing technique to have asymptotic rates close to .
We also highlighted situations where Richardson extrapolation does not bring any benefits (but does not degrade performance much), namely when applied to accelerated gradient descent or the Sinkhorn algorithm for optimal transport.
The analysis in this paper can be extended in a number of ways: (1) while the paper has focused on asymptotic analysis for simplicity, nonasymptotic analysis could be carried out to study more finely when acceleration starts, (2) we have focused on deterministic optimization algorithms, and extensions to stochastic algorithms could be derived, along the lines of the work of Dieuleveut et al. (2019), (3) we have primarily focused on convex optimization algorithms but nonconvex extensions, like done by Scieur et al. (2018) for Anderson acceleration, could also lead to acceleration
Acknowledgements
This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR19P3IA0001 (PRAIRIE 3IA Institute). We also acknowledge support the European Research Council (grant SEQUOIA 724063).
Appendix A Preliminary considerations
We have assumed that is threetimes differentiable, and has a unique constrained minimizer on , such that is invertible and that the third derivatives of are uniformly bounded. Thus, using a Taylor expansion, there exists a constant such that
This implies that locally around the function is stronglyconvex. Moreover, another consequence is that a bound implies that we stay in the region of strongconvexity . Indeed, using the Taylor formula with integral remainder, for (where the lower bound on Hessians above holds), we have , and thus is incompatible with .
Thus, if we have some decreasing rate for , then, asymptotically, the quadratic Taylor expansion of around is asymptotically valid, and thus up to negligible terms:
We thus obtain in the worst case a constant factor approximation when using Richardson extrapolation.
Appendix B Asymptotic expansion for averaged gradient descent
In this particular case of unconstrained gradient descent, (Nesterov, 2013), this implies from Appendix A, that for larger than some , all iterates are such that , and thus we are in the strongly convex case, where , where depend on the lowest eigenvalue of as and .
Thus, with , tends to when (since the series is convergent), and
leading to , and thus, with (which is hard to compute a priori),
Appendix C Asymptotic expansion for FrankWolfe
We consider the stepsizes and , for which the respective convergence rates for are of the form and , for constants depending on the smoothness of and the diameter of the compact set (see, e.g., Jaggi, 2013). When running FrankWolfe (with any of the classical versions with openloop stepsizes), we thus have with , and, following the same reasoning as in Appendix B, has to converge to at rate .
Because of affine invariance of the FrankWolfe algorithm, we can assume without loss of generality that .
The assumption which is made implies that there exists such that the ball of center and radius intersected with the affine hull of is included in the convex hull of , as well such that if , then is attained only by elements of the convex hull of .
Thus, for large enough, that is greater than some (which can be quantified from , and other quantities), all elements of are in the convex hull of , that is, only the correct extreme points are selected. Denoting the orthogonal projection on the span of , we have:
and thus, subtracting :
leading to, using the projections and :
and, because is in the span of ,
We now consider these two terms separately.
Convergence of .
For , we have, in closed form for :
For , we have:
Convergence of .
We now look at the convergence of . We have
Because of the ball assumption (that is the existence of a ball around that is contained in the supporting face of ), we have
which leads to
Moreover, using a Taylor expansion with , we have . Moreover, by optimality of , . Therefore we have, for :