On the Effectiveness of Richardson Extrapolation in Machine Learning
Richardson extrapolation is a classical technique from numerical analysis that can improve the approximation error of an estimation method by combining linearly several estimates obtained from different values of one of its hyperparameters, without the need to know in details the inner structure of the original estimation method. The main goal of this paper is to study when Richardson extrapolation can be used within machine learning, beyond the existing applications to step-size adaptations in stochastic gradient descent. We identify two situations where Richardson interpolation can be useful: (1) when the hyperparameter is the number of iterations of an existing iterative optimization algorithm, with applications to averaged gradient descent and Frank-Wolfe algorithms (where we obtain asymptotically rates of on polytopes, where is the number of iterations), and (2) when it is a regularization parameter, with applications to Nesterov smoothing techniques for minimizing non-smooth functions (where we obtain asymptotically rates close to for non-smooth functions), and ridge regression. In all these cases, we show that extrapolation techniques come with no significant loss in performance, but with sometimes strong gains, and we provide theoretical justifications based on asymptotic developments for such gains, as well as empirical illustrations on classical problems from machine learning.
Many machine learning methods can be cast as looking for approximations of some ideal quantity which cannot be readily computed from the data at hand: this ideal quantity can be the predictor learned from infinite data, or an iterative algorithm run for infinitely many iterations. Taking theirs roots in optimization and more generally numerical analysis, many accelerations techniques have been developed to tighten these approximations with as few changes as possible to the original method.
While some acceleration techniques add some simple modifications to a known algorithm, such as Nesterov acceleration for the gradient descent method (Nesterov, 1983), extrapolation techniques are techniques that do not need to know the fine inner structure of the method to be accelerated. These methods are only based on the observations of solutions of the original method.
Extrapolation techniques work on the vector-valued output of the original method that depends on some controllable real-valued quantity , which can be the number of iterations or some regularization parameter, and more generally any parameter that controls both the running time and the approximation error of the algorithm. When tends to (which is typically or ), we will assume that has an asymptotic expansion of the form
where is the desired output, is the asymptotic equivalent of , and . The key question in extrapolation is the following: from the knowledge of for potentially several ’s, how can we better approximate , without the full knowledge of ?
For exponentially converging algorithms, there exist several “non-linear” schemes that combine linearly several values of with weights that depend non-linearly on the iterates, such as Aitken’s process (Aitken, 1927) or Anderson acceleration (Anderson, 1965), which has recently been shown to provide significant acceleration to linearly convergent gradient-based algorithms (Scieur et al., 2016). In this paper, we consider dependence in powers of , where Richardson extrapolation excels (see, e.g., Richardson, 1911; Joyce, 1971; Gautschi, 1997).
We thus assume that
and is a power of such that , where is known but is unknown, that is
In all our cases, when and when . Richardson extrapolation is simply combining two iterates with different values of so that the zero-th order term is preserved, while the first-order term cancels, for example:
See an illustration in Figure 1 for , , and . Note that: (a) the choice of as a multiplicative factor is arbitrary and chosen for its simplicity when , and (b) Richardson extrapolation can be used with iterates to remove the first terms in a asymptotic expansion, where the powers of the expansion are known and not the associated vector-valued constants (see examples in Section 3).
The main goal of this paper is to study when Richardson extrapolation can be used within machine learning, beyond the existing explicit applications to step-size adaptations in stochastic gradient descent (Durmus et al., 2016; Dieuleveut et al., 2019), and implicit wide-spread use in “tail-averaging” (see more details in Section 2.1).
We identify two situations where Richardson interpolation can be useful:
is the number of iterations of an existing iterative optimization algorithm converging to , where then and , and Richardson extrapolation considers, for even, . We consider in Section 2, averaged gradient descent and Frank-Wolfe algorithms (where we obtain asymptotically rates of on polytopes, where is the number of iterations).
is a regularization parameter, where then and , and Richardson extrapolation considers . We consider in Section 3, Nesterov smoothing techniques for minimizing non-smooth functions (where we obtain asymptotically rates close to for non-smooth functions), and ridge regression (where we obtain estimators with lower bias).
As we will show, extrapolation techniques come with no significant loss in performance, but with sometimes strong gains, and the goal of this paper is to provide theoretical justifications for such gains, as well as empirical illustrations on classical problems from machine learning. Note that we aim for the simplest asymptotic results (most can be made non-asymptotic with extra assumptions).
2 Extrapolation on the Number of Iterations
In this section, we consider extrapolation based on the number of iterations , that is, for the simplest case
for optimization algorithms aimed at minimizing on (or potentially a subset thereof) a function , which we will always assume convex, three-times differentiable with Hessian eigenvalues bounded by , and a unique minimizer such that is positive definite. We will consider as the performance measure. Note that these assumptions, in particular convexity, are made to make the asymptotic analysis simpler, and that non-convex extensions could be considered.
If is converging to , then so is , and thus also . Given our assumptions on , then even if there are no cancellations, because we have assumed that our performance measure is smooth, the convergence rate is at most a constant times the original rate for (see a detailed argument in Appendix A based on local quadratic approximations for unconstrained minimization).
Therefore, performance is never significantly deteriorated (the risk is essentially to lose half of the iterations). The potential gains depend on the way converges to . The existence of a convergence rate of the form or is not enough, as Richardson extrapolation requires a specific direction of asymptotic convergence. As illustrated in Figure 2, some algorithms are oscillating around their solutions, while some converge with a specific direction. Only the latter ones can be accelerated with Richardson extrapolation, while the former ones are good candidates for Anderson acceleration (Anderson, 1965; Scieur et al., 2016).
We now consider three algorithms: (1) averaged gradient descent, where extrapolation is at its best, as it transforms an convergence rate into an exponential one, (2) accelerated gradient descent, where extrapolation does not bring anything, and (3) Frank-Wolfe algorithms, where the situation is mixed (sometimes it helps, sometimes it does not).
2.1 Averaged gradient descent
We consider the usual gradient descent algorithm
Averaging is key to robustness to potential noise of the gradients (Polyak and Juditsky, 1992; Nemirovski et al., 2009). However it comes with the unintended consequence of losing the exponential forgetting of initial conditions for strongly convex problems (Bach and Moulines, 2011).
A common way to restore exponential convergence (up to the noise level in the stochastic case) is to consider “tail-averaging”, that is, to replace by the average of only the latest iterates (Jain et al., 2018). As shown below for even, this corresponds exactly to Richardson extrapolation:
where and depends on the condition number of . Note that: (a) before Richardson extrapolation, the asymptotic convergence rate will be of the order , which is better than the usual upper-bound for the rate of gradient descent, but with a stronger assumption that in fact leads to exponential convergence before averaging, (b) while has a simple expression, it cannot be computed in practice, (c) that Richardson extrapolation leads to an exponentially convergent algorithm from an algorithm converging asymptotically in for functions values, and (d) that in the presence of noise in the gradients, the exponential convergence would only be up to the noise level. See Figure 3 (left plot) for an illustration.
2.2 Accelerated gradient descent
In the section above, we considered averaged gradient descent, which is asymptotically converging as and on which Richardson extrapolation could be used with strong gains. It is possible also for the accelerated gradient descent (Nesterov, 1983), which has a (non-asymptotic) convergence rate of for convex functions?
It turns out that the behavior of the iterates of accelerated gradient descent is exactly of the form depicted in the left plot of Figure 2: that is, the iterates oscillate around the optimum (see, e.g., Su et al., 2016; Flammarion and Bach, 2015), and Richardson extrapolation is of no help, but is not degrading performance too much. See Figure 3 (right plot) for an illustration.
2.3 Frank-Wolfe algorithms
We now consider Frank-Wolfe algorithms (also known as conditional gradient algorithms) for minimizing a smooth convex function on a compact convex set . These algorithms are dedicated to situations where one can easily minimize linear functions on (see, e.g., Jaggi, 2013, and references therein). The algorithm has the following form:
That is, the first order Taylor expansion of at is minimized, ending up typically in an extreme point or and a convex combination of and is considered. While some form of line search can be used to find , we consider so-called “open loop” schemes where or (Dunn and Harshbarger, 1978; Jaggi, 2013).
In terms of function values, these two variants are known to converge at respective rates and . Moreover, as illustrated in Figure 4, they are known to zig-zag towards the optimal point. Avoiding this phenomenon can be done in several ways, for examples through optimizing over all convex combinations of the ’s for (Von Hohenbalken, 1977), or through so-called “away steps” (Guélat and Marcotte, 1986; Lacoste-Julien and Jaggi, 2015). In this section, we consider Richardson extrapolation and assume for simplicity that is a polytope (which is a typical use case for Frank-Wolfe algorithms). Note here that we are considering asymptotic convergence rates, and even without extrapolation (but with a local strong-convexity assumption), we can beat the rates for the step-size .
We let denote the minimizer of on , which we assume unique, on top of the differentiability properties from the beginning of Section 2 (which include local strong convexity around ). In order to show our asymptotic result, we assume that is “in the middle of a face” of . In mathematical words, if the minimizer is strictly in a -dimensional face of which is the convex hull of extreme points , then is attained only by elements of the convex hull of . This assumption is sufficient to show that the corresponding optimal face of is eventually identified, and is often referred to as constraint qualification in optimization (Nocedal and Wright, 2006).
As shown in Appendix C, for , then
where is orthogonal to the span of all , , while the remainder term is within this span. This characterizes the zig-zagging phenomenon, and implies that
Thus, without extrapolation, we get a convergence rate in . Moreover, as expected, we show that
that is, Richardson extrapolation allows to go from an to an convergence rate. In the left plots of Figure 5 and Figure 6, we can observe the benefits of Richardson extrapolation on two optimization problems with the step-size . Note that: (a) asymptotically, there is provably no extra logarithmic factor like we have for the non-asymptotic convergence rate, and (b) that the Richardson extrapolated iterate may not be within , but is away from it (in our simulations, we simply make the iterate feasible by rescaling).
For , we show in Appendix C that
and thus we already get an performance of without extrapolation, and Richardson extrapolation does not lead to any acceleration. In the right plots of Figure 5 and Figure 6, we indeed see no benefits (but no strong degradation).
Note that here (and for both step-sizes), higher order Richardson would not lead to further cancellation as within the span of the supporting face, we have an oscillating behavior similar to the left plot of Figure 2. Moreover, although we do not have a proof, the closed loop algorithm exhibits the same behavior as the step , both with and without extrapolation, which is consistent with the analysis of Canon and Cullum (1968). It would also be interesting to consider the benefits for Richardson extrapolation for strongly-convex sets (Garber and Hazan, 2015).
3 Extrapolation on the Regularization Parameter
In this section, we explore the application of Richardson extrapolation to regularization methods. In a nutshell, regularization allows to make an estimation problem more stable (less subject to variations for statistical problems) or the algorithm faster (for optimization problems). However, regularization adds a bias that needs to be removed. In this section, we apply Richardson extrapolation to the regularization parameter to reduce this bias. We consider two applications where we can provably show some benefits: (a) smoothing for non-smooth optimization in Section 3.1, and (b) ridge regression in Section 3.2.
Other applications include the classical use within integration methods, where the technique is often called Richardson-Romberg extrapolation (Gautschi, 1997), and for bias removal in constant-step-size stochastic gradient descent for sampling (Durmus et al., 2016) and optimization (Dieuleveut et al., 2019).
3.1 Smoothing non-smooth problems
We consider the minimization of a convex function of the form , where is smooth and is non-smooth. These optimization problems are ubiquitous in machine learning and signal processing, where the lack of smoothness can come from (a) non-smooth losses such as max-margin losses used in support vector machines and more generally structured output classification (Taskar et al., 2005; Tsochantaridis et al., 2005), and (b) sparsity-inducing regularizers (see, e.g., Bach et al., 2012, and references therein). While many algorithms can be used to deal with this non-smoothness, we consider a classical smoothing technique below.
In this section, we consider the smoothing approach of Nesterov (2005) where the non-smooth term is “smoothed” into , where is a regularization parameter, and accelerated gradient descent is used to minimize .
A typical way of smoothing the function is to add times a strongly convex regularizer to the Fenchel conjugate of (see an example below); as shown by Nesterov (2005), this leads to a function which has a smoothness constant (defined as the maximum of the largest eigenvalues of all Hessians) proportional to , with a uniform error of between and . Given that accelerated gradient descent leads to an iterate with excess function values proportional to after iterations, with the choice of , this leads to an excess in function values proportional to , which improves on the subgradient method which converges in .
If we denote by the minimizer of and the global minimizer of , if we can show that , then and we can expand , which is better than the approximation without extrapolation.
Then, with , to balance the two terms and , we get an overall convergence rate for the non-smooth problem of . We now make this formal for the special (but still quite generic) case of polyhedral functions , and also consider -step Richardson extrapolation, which leads to a convergence rate arbitrarily close to .
We consider a polyhedral function of the form
where and . This form includes traditional regularizers such as the -norm of grouped --norms.
We assume that the function has a unique minimizer for which is positive definite (with the same regularity assumptions than smooth functions from previous sections, that is, three-times differentiable with bounded derivatives). The optimality conditions are the existence of in the simplex (defined as the vectors in with non-negative components summing to one), such that, for the support of (that is, the set of non-zeros),
Our only further assumptions are to assume that (a) this maximizer is only attained for , and (2) the submatrix obtained by taking the the rows of indexed by has full rank, another classical form of constraint qualification.
We consider the smoothing of as:
for some strongly convex function , typically, the negative entropy , or . We denote by a minimizer of , and the corresponding dual variable. We show in Appendix D that for the quadratic and entropic penalties:
with for the quadratic penalty, and a similar expression for the entropic penalty.
Multiple-step Richardson extrapolation.
Given that one-step Richardson extrapolation allows to go from a bias of to , a natural extension is to consider -step Richardson extrapolation (Gautschi, 1997), that is, a combination of iterates:
where the weights are such that the first powers in the Taylor expansion of cancel out. This can be done by solving the linear system based on the following equations:
We show in Appendix D that if is sufficiently smooth, we have:
Thus, within the smoothing technique, if we consider , to balance the terms and , we get an error for the non-smooth problem of , which can get arbitrarily close to when gets large. The downsides are that (a) the constants in front of the asymptotic equivalent may blow up (a classical problem in high-order expansions), and (b) -step extrapolation requires running the algorithm times (this can be down in parallel). In our experiments below, 3-step extrapolation already brings in most of the benefits.
We consider the penalized Lasso problem:
where for , for and , and with input data distributed as a standard normal vector. We use either a dual entropic penalty or a dual quadratic penalty for smoothing each component of the -norm of . Plots for the quadratic penalty are presented here in Figure 7, while plots for the entropic penalty are presented in Figure 9 in Appendix D with the same conclusions.
In the left plot of Figure 7, we illustrate the dependence of on for Richardson extrapolation with various orders, while in the right plot of Figure 7, we study the effect of extrapolation to solve the non-smooth problem. For a series of regularization parameters equal to for between and (sampled every ), we run accelerated gradient descent on and we plot the value of for the various estimates, where for each number of iterations, we minimize over the regularization parameter. This is an oracle version of varying as a function of the number of iterations (a detailed evaluation where depends on could also be carried out). In Figure 7, we plot the excess function values as a function of the number of iterations, taking into account that -step Richardson extrapolation requires -times more iterations. We see that we get a strong improvement approaching .
From non-linear programming to linear programming.
When we use the entropic penalty, the smoothing framework is generally applicable in most non-linear programming problems (see, e.g., Cominetti and Dussault, 1994). It is interesting to note that typically when applying the entropic penalty, the deviation to the global optimizer is going to zero exponentially in for some of the components (see a proof for our particular case in Appendix D), but not for the corresponding dual problem (which is our primal problem).
Another classical instance of entropic regularization in machine learning leads to the Sinkhorn algorithm for computing optimal transport plans (Cuturi, 2013). For that problem, the entropic penalty is put directly on the original problem, and the deviation in estimating the optimal transport plan can be shown to be asymptotically exponential in (Cominetti and San Martín, 1994), and thus there Richardson extrapolation is not helpful (unless one wants to estimate the Kantorovich dual potentials).
3.2 Improving bias in ridge regression
We consider the ridge regression problem, that is, is the unique minimizer of
where is a feature vector and a vector of responses (Friedman et al., 2001). The solution may be obtained in closed form by solving the normal equations, as .
The regularization term is added to avoid overfitting and control the variability of due to the randomness in the training data (the higher the , the more control); however, it does create a bias that goes down as goes to zero. Richardson extrapolation can be used to reduce this bias. We thus consider and more generally
with the same weights as defined in Eq. (1) and Eq. (2). In order to compute , either ridge regression problems can be solved, or a closed-form spectral formula can be used based on a single singular value decomposition of the kernel matrix (see Appendix E.3 for details).
Following Bach (2013), we assume for simplicity that is deterministic and that , where has zero mean and covariance matrix . We consider the in-sample error of , where is the usual kernel matrix, and is the smoothing matrix, which is equal to for very small and equal to zero for very large . We consider the so-called “in-sample generalization error”, that is, we want to minimize
The bias term is increasing in , while the variance term in decreasing in , and there is thus a trade-off between these two terms. To find the optimal , assumptions need to be made on the problem, regarding the eigenvalues of , and the components of in the eigenbasis of . That is, following the notations of Bach (2013, Section 4.3), we assume that the eigenvalues of are (that is bounded from above and below by constants times ) and the coordinates of in the eigenbasis of are . The precise trade-off depends on the rates at which and decay to zero.
A classical situation is and , where and (to ensure finite energy). As detailed in Appendix E:
the variance term is equivalent to and does not depend on or ;
the bias term depends on both and : for signals which are not too smooth (i.e., not too fast decay of , and thus small ), that is if , then the bias term is equivalent to and we can thus find the optimal as leading to predictive performance of , which happens to be optimal Caponnetto and De Vito (2007). However, when , a phenomenon called “saturation” occurs, and the bias term is equivalent to (independent of ), and the optimized predictive performance is , which is not optimal anymore.
As shown in Appendix E, by reducing the bias, with -step Richardson interpolation, we can show that the variance term is bounded by a constant times the usual one, while the bias term is equivalent to for a wider range of , that is, , which recovers for the non-extrapolated estimate. This leads to optimal statistical performance for a wider range of problems.
As an illustration, we consider a ridge regression problem with data uniformly sampled on the unit sphere in dimension with observations, and generated as a linear function of the input plus some noise. We consider the rotation invariant kernel equal to the expectation , for uniform on the sphere. This is equal to, for (see Cho and Saul, 2009; Bach, 2017):
When the number of observations tends to infinity, the eigenvalues of are known to converge to the eigenvalues of a certain infinite-dimensional operator (Koltchinskii and Giné, 2000). As shown by Bach (2017), the corresponding eigenvalues of the kernel matrix decay as . We consider generated as a linear function so that has a finite number of non zero components in the eigenbasis of .
In the left plot of Figure 8, we consider -step and -step Richardson extrapolation and plot the generalization error (averaged over 10 replications) as a function of the regularization parameter: we can see that as expected, (a) with extrapolation the curves move to the right (we can use a larger for a similar performance, which is advantageous as iterative algorithms are typically faster), and (b) the minimal error is smaller (which is true here because we learn a smooth function). In the right plot of Figure 8, we study the effect of increasing the order of extrapolation, showing that the larger the better with some saturation. With infinite, there will be overfitting as the corresponding spectral filter is non-stable, but this happens very slowly (see Appendix E.3 for details).
In this paper, we presented various applications of Richardson extrapolation to machine learning optimization and estimation problems, each time with an asymptotic analysis showing the potential benefits. For example, when using the number of iterations of an iterative algorithm to perform extrapolation, we can accelerate Frank-Wolfe algorithms to have an asymptotic rates of on polytopes for the step-size and locally strongly-convex functions (this is achieved without extrapolation for the step-size ). When extrapolating based on the regularization parameter, we can accelerate Nesterov smoothing technique to have asymptotic rates close to .
We also highlighted situations where Richardson extrapolation does not bring any benefits (but does not degrade performance much), namely when applied to accelerated gradient descent or the Sinkhorn algorithm for optimal transport.
The analysis in this paper can be extended in a number of ways: (1) while the paper has focused on asymptotic analysis for simplicity, non-asymptotic analysis could be carried out to study more finely when acceleration starts, (2) we have focused on deterministic optimization algorithms, and extensions to stochastic algorithms could be derived, along the lines of the work of Dieuleveut et al. (2019), (3) we have primarily focused on convex optimization algorithms but non-convex extensions, like done by Scieur et al. (2018) for Anderson acceleration, could also lead to acceleration
This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). We also acknowledge support the European Research Council (grant SEQUOIA 724063).
Appendix A Preliminary considerations
We have assumed that is three-times differentiable, and has a unique constrained minimizer on , such that is invertible and that the third derivatives of are uniformly bounded. Thus, using a Taylor expansion, there exists a constant such that
This implies that locally around the function is strongly-convex. Moreover, another consequence is that a bound implies that we stay in the region of strong-convexity . Indeed, using the Taylor formula with integral remainder, for (where the lower bound on Hessians above holds), we have , and thus is incompatible with .
Thus, if we have some decreasing rate for , then, asymptotically, the quadratic Taylor expansion of around is asymptotically valid, and thus up to negligible terms:
We thus obtain in the worst case a constant factor approximation when using Richardson extrapolation.
Appendix B Asymptotic expansion for averaged gradient descent
In this particular case of unconstrained gradient descent, (Nesterov, 2013), this implies from Appendix A, that for larger than some , all iterates are such that , and thus we are in the strongly convex case, where , where depend on the lowest eigenvalue of as and .
Thus, with , tends to when (since the series is convergent), and
leading to , and thus, with (which is hard to compute a priori),
Appendix C Asymptotic expansion for Frank-Wolfe
We consider the step-sizes and , for which the respective convergence rates for are of the form and , for constants depending on the smoothness of and the diameter of the compact set (see, e.g., Jaggi, 2013). When running Frank-Wolfe (with any of the classical versions with open-loop step-sizes), we thus have with , and, following the same reasoning as in Appendix B, has to converge to at rate .
Because of affine invariance of the Frank-Wolfe algorithm, we can assume without loss of generality that .
The assumption which is made implies that there exists such that the ball of center and radius intersected with the affine hull of is included in the convex hull of , as well such that if , then is attained only by elements of the convex hull of .
Thus, for large enough, that is greater than some (which can be quantified from , and other quantities), all elements of are in the convex hull of , that is, only the correct extreme points are selected. Denoting the orthogonal projection on the span of , we have:
and thus, subtracting :
leading to, using the projections and :
and, because is in the span of ,
We now consider these two terms separately.
Convergence of .
For , we have, in closed form for :
For , we have:
Convergence of .
We now look at the convergence of . We have
Because of the ball assumption (that is the existence of a ball around that is contained in the supporting face of ), we have
which leads to
Moreover, using a Taylor expansion with , we have . Moreover, by optimality of , . Therefore we have, for :