Nonlinear Acceleration of Deep Neural Networks
Regularized nonlinear acceleration (RNA) is a generic extrapolation scheme for optimization methods, with marginal computational overhead. It aims to improve convergence using only the iterates of simple iterative algorithms. However, so far its application to optimization was theoretically limited to gradient descent and other single-step algorithms. Here, we adapt RNA to a much broader setting including stochastic gradient with momentum and Nesterov’s fast gradient. We use it to train deep neural networks, and empirically observe that extrapolated networks are more accurate, especially in the early iterations. A straightforward application of our algorithm when training ResNet-152 on ImageNet produces a top-1 test error of 20.88%, improving on the reference classification pipeline. Furthermore, the code runs offline in this case, so it never negatively affects performance.
Stochastic gradient descent is a popular and effective method to train neural networks (moulines2011non; deng2013recent). Considerable effort has been invested in accelerating stochastic algorithms, in particular by deriving methods that adapt to the structure of the problem. Algorithms such as RMSProp (tieleman2012lecture) or Adam (kingma2014adam) are examples of direct modifications of gradient descent, and estimate some statistical momentum during optimization to speed up convergence. Unfortunately, these methods can fail to converge on some simple problems (reddi2018convergence), and may fail to achieve state-of-the-art test accuracy on image classification problems. Other techniques have been developed to improve convergence speed or accuracy, such as adaptive batch-size for distributed SGD (goyal2017accurate), or quasi-second-order methods which improve the rate of convergence of stochastic algorithms (bollapragada2018progressive). However, only a limited number of settings are covered by such techniques, which are not compatible with state-of-the-art architectures.
The approach we propose here is completely different as our method is built on top of existing optimization algorithms. Classical algorithms typically retain only the last iterate or the average (polyak1992acceleration) of iterates as their best estimate of the optimum, throwing away all the information contained in the converging sequence of iterates. Because this is highly wasteful from a statistical perspective, extrapolation schemes instead estimate the optimum of an optimization problem using a weighted average of the last iterates produced by an algorithm, where the weights depend on the iterates.
For example, Aitken’s Δ² or Wynn’s ε-algorithm (a good survey can be found in brezinski2013extrapolation) provide an improved estimate of the limit of a sequence using the last few iterates, and these methods have been extended to the vector case, where they are known as Anderson acceleration (walker2011anderson), minimal polynomial extrapolation (cabay1976polynomial) or reduced rank extrapolation (eddy1979extrapolating).
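As a scalar illustration of what such schemes do, the following minimal sketch applies Aitken’s Δ² formula to three consecutive terms of a linearly converging sequence. The function name and the test sequence are illustrative choices, not taken from the paper.

```python
def aitken_delta_squared(s0, s1, s2):
    """Aitken's delta-squared estimate of the limit from three consecutive terms."""
    denom = (s2 - s1) - (s1 - s0)
    if denom == 0.0:
        return s2  # equal successive differences: nothing to extrapolate
    return s2 - (s2 - s1) ** 2 / denom


# Illustrative sequence s_k = 1 + 0.5**k, which converges linearly to 1.
seq = [1.0 + 0.5 ** k for k in range(3)]   # [2.0, 1.5, 1.25]
estimate = aitken_delta_squared(*seq)
last_error = abs(seq[-1] - 1.0)            # error of the last iterate
extrapolated_error = abs(estimate - 1.0)   # error after extrapolation
```

For an exactly geometric error like this one, the extrapolated value recovers the limit, whereas the last iterate is still 0.25 away from it.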
Network ensembling (zhou2002ensembling) can also be seen as combining several neural networks to improve convergence. However, contrary to our strategy, ensembling consists in training different networks from different starting points and then averaging the predictions or the parameters. Averaging the weights of successive different neural networks from SGD iterations has been studied in (izmailov2018averaging), and our method can also be seen as an extension of this averaging idea for SGD, but with non-uniform adaptive weights.
Recent results by (scieur2016regularized) adapted classical extrapolation techniques such as Aitken’s and minimal polynomial extrapolation to design regularized extrapolation schemes accelerating convergence of basic methods such as gradient descent. They showed in particular that using only iterates from a very basic fixed-step gradient descent, these extrapolation algorithms produced solutions reaching the optimal convergence rate of (nesterov2013introductory), without any modification to the original algorithm. However, these results were limited to single-step algorithms such as gradient descent, thus excluding the much faster momentum-based methods such as SGD with momentum or Nesterov’s algorithm. Our results here seek to accelerate these accelerated methods.
Overall, nonlinear acceleration has marginal computational complexity. On convex problems, the online version (which modifies iterations) is competitive with L-BFGS in our experiments (see Figure 1), and is robust to misspecified strong convexity parameters. On neural network training problems, the offline version improves both the test accuracy of early iterations and the final accuracy (see Figures 2 and 4), with only minor modifications of existing learning pipelines (see Appendix C). It never hurts performance as it runs on top of the algorithm and does not affect iterations. Finally, our scheme produces smoother learning curves. When training neural networks, we observe in our experiments in Figure 3 that the convergence speedup produced by acceleration is much more significant in early iterations, which means it could serve for “rapid prototyping” of network architectures, with significant computational savings. The source code for the numerical experiments can be found on GitHub: https://github.com/windows7lover/RegularizedNonlinearAcceleration
2 Regularized Nonlinear Acceleration
2.1 Vector extrapolation methods
Vector extrapolation methods look at the sequence $\{x_i\}$ produced by an iterative algorithm, and try to find its limit $x^*$. Typically, they assume $x_i$ was produced by a fixed-point iteration with function $g$ as follows,
$$x_{i+1} = g(x_i). \qquad (1)$$
In the case of optimization methods, $x^*$ is the function minimizer and $g$ usually corresponds to a gradient step. In most cases, convergence analysis bounds are based on a Taylor approximation of $g$, and produce only local rates.
Recently, (scieur2016regularized) showed global rates of convergence of regularized versions of Anderson acceleration. These results show in particular that, without regularization, classical extrapolation methods are highly unstable when applied to the iterates produced by optimization algorithms. However, the results hold only for sequences generated by (1) where $g$ has a symmetric Jacobian.
To give a bit of intuition on extrapolation, let $f$ be a (potentially) noisy objective function. We are interested in finding its minimizer $x^*$. To find this point, we typically use an iterative optimization algorithm and after $N$ iterations obtain a sequence of points $\{x_1, \ldots, x_N\}$ converging to the critical point where the gradient is zero. Vector extrapolation algorithms find linear combinations of the iterates with coefficients $c_i$ to minimize the norm of the gradient, i.e.,
$$\min_{c} \Big\| \nabla f\Big( \sum_{i=1}^{N} c_i x_i \Big) \Big\|. \qquad (2)$$
However, this problem is non-linear and hard to solve. The difference between extrapolation algorithms resides in the way they approximate the solution to (2).
2.2 Regularized Nonlinear Acceleration (RNA) Algorithm
In this paper, instead of considering the iterations produced by (1), we will look at a pair of sequences , generated by
where both , converge to . In this section, we will develop an extrapolation scheme for (3).
Intuition. In this section, we show how to design the RNA algorithm for the special case where is the gradient step with fixed step size ,
The RNA algorithm approximates (2) by assuming that the function is approximately quadratic in the neighbourhood of . This is a common assumption in optimization for the design of second-order methods, such as Newton’s method or BFGS, and implies that is approximately linear, so
where the constraint on ensures convergence (scieur2016regularized). Even if we do not have an explicit access to the gradient, we can recover it from the differences between sequences and since
Writing , we can solve (5) explicitly, with
Because the gradients are increasingly collinear as the algorithm converges, the matrix of gradients quickly becomes ill-conditioned. Without regularization, solving (5) is highly unstable. We illustrate the impact of regularization on the conditioning in Figure 6 (in Appendix A), when optimizing a quadratic function. Even for this simple problem, the extrapolation coefficients are large and oscillate between highly negative and positive values, and regularization dampens this behaviour.
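The extrapolation step described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the paper’s Algorithm 1: it assumes the residuals are the differences x_i - y_{i-1}, that the weights solve a Tikhonov-regularized least-squares problem constrained to sum to one, that the regularization is scaled by the spectral norm of RᵀR, and that the output is the weighted combination of the x_i. The names `rna`, `lam`, and the toy quadratic are hypothetical.

```python
import numpy as np


def rna(X, Y, lam=1e-8):
    """Sketch of regularized nonlinear acceleration.

    X : (d, N) array, columns x_1..x_N with x_i = g(y_{i-1}).
    Y : (d, N) array, columns y_0..y_{N-1}.
    Returns the extrapolated point X @ c and the weights c (summing to one).
    """
    R = X - Y                                  # residuals x_i - y_{i-1}
    RtR = R.T @ R
    scale = np.linalg.norm(RtR, 2)             # assumption: lam is relative to ||R^T R||_2
    n = R.shape[1]
    if scale == 0.0:                           # every y_i is already a fixed point
        c = np.zeros(n)
        c[-1] = 1.0
        return X @ c, c
    z = np.linalg.solve(RtR + lam * scale * np.eye(n), np.ones(n))
    c = z / z.sum()                            # enforce sum(c) == 1
    return X @ c, c


# Toy problem (hypothetical): f(x) = 0.5 x^T A x - b^T x with fixed step h = 1/L.
A = np.diag([1.0, 2.0, 5.0, 10.0])
b = np.ones(4)
h = 0.1


def grad(x):
    return A @ x - b


ys, xs = [np.zeros(4)], []
for _ in range(6):
    xs.append(ys[-1] - h * grad(ys[-1]))       # x_i = g(y_{i-1})
    ys.append(xs[-1])                          # plain gradient descent: y_i = x_i

X = np.column_stack(xs)
Y = np.column_stack(ys[:-1])
x_extr, c = rna(X, Y)

# The residual matrix equals -h times the gradients, so no gradient call is needed.
G = np.column_stack([grad(y) for y in ys[:-1]])
recovered = np.allclose(X - Y, -h * G)
```

On this toy quadratic the extrapolated point has a much smaller gradient norm than the last gradient descent iterate, and the only linear algebra involved is an N-by-N solve. Setting `lam` to zero would make the solve singular here, since six residuals in dimension four are linearly dependent.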
Complexity. Roughly speaking, the RNA algorithm assumes that iterates follow a vector auto-regressive process of the form for some matrix (scieur2016regularized), which is true when using gradient descent on quadratic functions, and holds asymptotically otherwise, provided some regularity conditions. The output of Algorithm 1 could be achieved by identifying the matrix , recovering the quadratic function, then computing its minimum explicitly. Of course, the RNA algorithm does not perform these steps explicitly, so its complexity is bounded by where is the dimension of the iterates and the number of iterates used in estimating the optimum. In our experiments, is typically equal to , so we can consider Algorithm 1 to be linear . In practice, computing the extrapolated solution on a CPU is faster than a single forward pass on a mini-batch.
Stochastic gradients. In the stochastic case, the concept of “iteration” is not as straightforward, since one stochastic gradient is usually very noisy and non-informative, while computing a full gradient is not feasible at the scale of these problems. In this paper, we take one iteration to be one pass over the data with the stochastic algorithm, and estimate the gradient on the fly.
Note that the extrapolated point in Algorithm 1 is computed only from the two sequences and . Its computation does not require the function , or access to the data set. Therefore, RNA can be used offline. On the other hand, we will see in the next subsection that it is possible to use RNA “online”: we combine the extrapolation with the original algorithm. This often improves the observed rate of convergence.
Offline versus online acceleration. We now discuss several strategies for implementing RNA. The most basic one uses it offline: we simply generate an auxiliary sequence without interfering with the original algorithm. The main advantage of this strategy is to be at least as good as the vanilla algorithm, since we keep the sequences and unchanged. In addition, we can apply it after the iterations of (3), since we only need the sequences and nothing else.
However, since is a better estimate of the optimum, restarting iterations from the extrapolated point can potentially improve convergence. The experiments in (scieur2016regularized) restart the gradient method after a fixed number of iterations, to produce sequences that converge faster than classical accelerated methods in most cases.
Momentum acceleration. Concretely, we extend the results of (scieur2016regularized) to handle iterative algorithms of the form (3) where $g$ is a nonlinear iteration with a symmetric Jacobian. This allows us to push the approach a bit further, using the extrapolated point online, at each step. This was not possible in the scheme detailed in (scieur2016regularized) because, like many other extrapolation schemes, it requires iterations of the form (1).
In fact, the class of algorithms following (3) contains most common optimization schemes such as gradient descent with line-search or averaging and momentum-based methods. In fact, picking an algorithm is exactly equivalent to choosing values for and .
For example, in the case of Nesterov’s method for smooth and convex functions, we get
while for the gradient method with momentum to train neural networks we obtain
where is the learning rate and the momentum parameter. Running iterations of (3) produces two sequences of iterates and converging to some minimizer at a certain rate. In comparison, the setting of (scieur2016regularized) corresponds to the special case of (3) where , i.e., and all other coefficients are zero. In (6) and (7) we assume constant over time, and a variable learning rate is captured by the and instead.
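To make the two-sequence template concrete, here is a minimal sketch of how momentum-type methods produce a pair of sequences. It assumes the template reads x_i = g(y_{i-1}) followed by a linear combination of earlier points, and it uses the standard heavy-ball and Nesterov recursions, which may differ in notation from (6) and (7). The names `run_scheme`, `momentum`, `nesterov`, and the toy quadratic are hypothetical.

```python
import numpy as np


def run_scheme(grad, y0, h, n_iter, combine):
    """Iterate x_i = g(y_{i-1}), y_i = combine(i, xs, ys), with g a gradient step."""
    xs, ys = [], [np.asarray(y0, dtype=float)]
    for i in range(1, n_iter + 1):
        xs.append(ys[-1] - h * grad(ys[-1]))   # x_i = g(y_{i-1})
        ys.append(combine(i, xs, ys))          # y_i from x_1..x_i and y_0..y_{i-1}
    return xs, ys


def momentum(beta):
    """Heavy-ball form: y_i = x_i + beta * (y_{i-1} - y_{i-2})."""
    def combine(i, xs, ys):
        prev = ys[-2] if len(ys) > 1 else ys[-1]
        return xs[-1] + beta * (ys[-1] - prev)
    return combine


def nesterov(i, xs, ys):
    """Standard Nesterov weights: y_i = x_i + (i - 1) / (i + 2) * (x_i - x_{i-1})."""
    prev = xs[-2] if len(xs) > 1 else xs[-1]
    return xs[-1] + (i - 1) / (i + 2) * (xs[-1] - prev)


# Toy quadratic (hypothetical): f(x) = 0.5 x^T A x - b^T x, step h = 1/L.
A = np.diag([1.0, 2.0, 5.0, 10.0])
b = np.ones(4)
h = 0.1


def grad(x):
    return A @ x - b


xs_m, ys_m = run_scheme(grad, np.zeros(4), h, 60, momentum(0.5))
xs_n, ys_n = run_scheme(grad, np.zeros(4), h, 300, nesterov)
```

In both cases the weights of the linear combination sum to one, and the difference x_i - y_{i-1} is again -h times the gradient at y_{i-1}, so the same residuals are available for extrapolation.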
Because the extrapolation in (2) is a linear combination of previous iterates, it matches exactly the description of in (3), with and RNA can thus be directly injected in the algorithmic scheme in (3) as follows,
This trick potentially improves the previous version of RNA with “restarts”, since we benefit from acceleration more often. We will see that this online version often significantly improves the rate of convergence of gradient and Nesterov methods for convex losses.
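The online variant can be sketched as follows. This is a minimal sketch, assuming that each new y_i is the regularized extrapolation computed from a sliding window of the most recent pairs, that the extrapolation has the closed form used in (scieur2016regularized) with weights summing to one, and that the regularization is scaled by the spectral norm of RᵀR. The function names, the window length, and the toy quadratic are hypothetical.

```python
import numpy as np


def rna(X, Y, lam=1e-8):
    """Regularized extrapolation from columns x_i = g(y_{i-1}) (X) and y_{i-1} (Y)."""
    R = X - Y
    RtR = R.T @ R
    scale = np.linalg.norm(RtR, 2)             # assumption: relative regularization
    n = R.shape[1]
    if scale == 0.0:                           # nothing left to extrapolate
        return X[:, -1].copy()
    z = np.linalg.solve(RtR + lam * scale * np.eye(n), np.ones(n))
    return X @ (z / z.sum())


def online_rna(grad, y0, h, n_iter, window=5, lam=1e-8):
    """Gradient steps whose next starting point is the RNA extrapolation."""
    xs, ys = [], [np.asarray(y0, dtype=float)]
    for _ in range(n_iter):
        xs.append(ys[-1] - h * grad(ys[-1]))   # x_i = g(y_{i-1})
        X = np.column_stack(xs[-window:])      # most recent x's
        Y = np.column_stack(ys[-window:])      # the y's that produced them
        ys.append(rna(X, Y, lam))              # y_i = extrapolated point
    return xs, ys


# Toy quadratic (hypothetical): f(x) = 0.5 x^T A x - b^T x, step h = 1/L.
A = np.diag([1.0, 2.0, 5.0, 10.0])
b = np.ones(4)


def grad(x):
    return A @ x - b


xs, ys = online_rna(grad, np.zeros(4), h=0.1, n_iter=15)
initial_norm = np.linalg.norm(grad(ys[0]))
final_norm = np.linalg.norm(grad(ys[-1]))
```

Because the extrapolated point is fed back into the gradient step, this variant changes the trajectory of the underlying algorithm, unlike the offline version, which only reads the stored sequences.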
3 Optimal Convergence Rate for Linear Mappings
We now analyze the rate of convergence of the extrapolation step produced by Algorithm 1 as a function of the length of the sequences and . We restrict our result to the specific case when is the linear mapping
This holds asymptotically if is smooth enough. The matrix should be symmetric, and its norm is strictly bounded by one, i.e., . Typically, is the rate of convergence of the “vanilla” algorithm, where is usually linked to the condition number of the problem. The assumptions are the same as (scieur2016regularized) for the linear case, except now the linear mapping is coupled with an extra linear combination step in (3) which allows us to handle momentum terms or accelerated methods.
3.1 Convergence Bound
We now prove that the RNA algorithm applied to the iterates in (3) reaches an optimal rate of convergence when is a linear mapping and when and are generated by (3). The theorem is valid for any choice of (up to some mild assumptions), i.e., for any algorithm which can be written as (3).
To prove the convergence, we will bound the residue , which corresponds to the norm of the gradient when is a gradient step. This is a classical convergence bound for algorithms applied to non-convex problems. In particular, the following Theorem shows the rate of convergence to a critical point where .
Because a critical point is not guaranteed to be a local minimum, the extrapolation can converge to a saddle point or a local maximum if (3) is doing so. However, if (3) converges to a minimum, then so does Algorithm 1.
Let be the output of the RNA Algorithm 1 using these sequences. When , the rate of convergence of the residue is optimal and bounded by
In particular, when is a simple gradient step on a quadratic function , this means
For clarity, we remove the superscript in the scope of this proof, and assume without loss of generality. By definition,
Since , we can bound , hence if is constructed as in Algorithm 1
We will now show by induction that all terms are computed using matrix polynomials in of degree exactly satisfying applied to the vector . This holds trivially for . Assume now this is true for with
In this case,
where is a polynomial of degree . In fact, we can show that . Indeed,
because by the induction hypothesis, when . Since by assumption,
We will now show that . Indeed,
where the first equality is obtained by the induction hypothesis, and the second using our assumption on the coefficients . Finally,
which means . Clearly,
which proves the induction. This means that the family of polynomials generates , the subspace of polynomials of degree . We can thus rewrite (11) using polynomials,
The explicit solution involves rescaled Chebyshev polynomials described in (golub1961chebyshev), whose optimal value is exactly (3.1), as stated, e.g., in Proposition 2.2 of (scieur2016regularized).
More concretely, this last result allows us to apply RNA to stochastic gradient algorithms featuring a momentum term. It also allows using a full extrapolation step at each iteration of Nesterov’s method, instead of the simple momentum term in the classical formulation of this method. As we will observe in the numerical section, this yields significant computation gains in both cases.
The previous result also shows that our method is adaptive. Whatever the algorithm we use for optimizing a quadratic function, if the iteration converges, the coefficients and sum to one and we have non-zero, we obtain an optimal rate of convergence. For example, when the strong convexity parameter is unknown, Nesterov’s method (6) has a rate of convergence in on quadratic functions, even if the function is strongly convex. By post-processing the iterates, we transform this method into an optimal algorithm, automatically adapting to the strong convexity constant. Extrapolation thus adaptively recovers the optimal rate of convergence even when a bad momentum parameter is used.
The setting of Theorem 3.1 assumes we store indefinitely the points and . In practice we only keep a constant window, but it still improves the convergence speed by a constant factor (see Section 4 and Figure 1 for more details). The link between the theorem and a windowed version of RNA is similar to the one between BFGS, whose optimal convergence is also proved for quadratics, and its limited memory version L-BFGS used in practice.
Because it applies only to quadratics, our result is essentially asymptotic. However, an argument similar to that detailed in (scieur2017nonlinear) would also give us non-asymptotic, albeit less explicit, bounds. Similarly, the bound looks only at the non-regularized version, but it is possible to extend it using the solution of the regularized Chebyshev polynomial (scieur2017nonlinear, Proposition 3.3). Overall, as for the theoretical bounds on BFGS, these global bounds on RNA performance are highly conservative and do not faithfully reflect its numerical efficiency.
4 Numerical Experiments
The following numerical experiments seek to highlight the benefits of RNA in its offline and online versions when applied to the gradient method (with or without momentum term). Since the complexity grows quadratically with the number of points in the sequences and , we will use RNA with a fixed window size ( for stochastic and for convex problems) in all these experiments. These values are sufficiently large to show a significant improvement in the rate of convergence, but can of course be fine-tuned. For simplicity, we fix .
4.1 Logistic Regression
We solve a classical regression problem on the Madelon-UCI dataset (guyon2003design) using the logistic loss with regularization. The regularization has been set such that the condition number of the function is equal to . We compare to standard algorithms such as the simple gradient scheme, Nesterov’s method for smooth and strongly convex objectives (nesterov2013introductory) and L-BFGS. For the step length parameter, we used a backtracking line-search strategy. We compare these methods with their offline RNA accelerated counterparts, as well as with the online version of RNA described in (8). Results are reported in Figure 1.
In Figure 1, we observe that offline RNA improves the convergence speed of gradient descent and Nesterov’s method. However, the improvement is only a constant factor: the curves are shifted but have the same slope. Meanwhile, the online version greatly improves the rate of convergence, transforming the basic gradient method into an optimal algorithm competitive with line-search L-BFGS.
In contrast to most quasi-Newton methods (such as L-BFGS), RNA does not require a Wolfe line-search to be convergent. This is because the algorithm is stabilized with a Tikhonov regularization. In addition, the regularization controls the impact of the noise in the iterates, making the RNA algorithm suitable for stochastic iterations (scieur2017nonlinear).
4.2 Image Classification
We now describe experiments with CNNs for image classification. Because one stochastic iteration is not informative due to the noise, we refer to as the model parameters (including batch normalization statistics) corresponding to the final iteration of the epoch . In this case, we do not have an explicit access to “”, so we will estimate it during the stochastic steps. Let be the parameters of the network at epoch after stochastic iterations, and be the parameters after one stochastic gradient step. Then, for a data set of size ,
Because the learning curve is highly dependent on the learning rate schedule, we decided to use a linearly decaying learning rate to better illustrate the benefits of acceleration, even if acceleration also works with a constant learning rate schedule (see (scieur2018nonlinear) and Figure 3). In all our experiments, until epoch , the learning rate decreases linearly from an initial value to a final value , with
We then continue the optimization during additional epochs using to stabilize the curve. We summarize the parameters used for the optimization in Table 1.
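Table 1: Parameters used for the optimization.
|Algorithm|Initial learning rate|Final learning rate|Momentum|
|---|---|---|---|
|SGD and Online RNA (8)|1.0|0.01|0|
|SGD + momentum|0.1|0.001|0.9|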
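The learning rate schedule described above can be sketched as a small helper. This is a minimal sketch, assuming a linear interpolation between the initial and final values followed by a constant phase; the function name and the decay horizon of 100 epochs are hypothetical, while the endpoint values 1.0 and 0.01 are those listed for SGD in the table above.

```python
def linear_decay_lr(epoch, lr_init, lr_final, t_decay):
    """Learning rate decaying linearly from lr_init to lr_final, then held constant."""
    if epoch >= t_decay:
        return lr_final                        # stabilization phase
    frac = epoch / t_decay
    return (1.0 - frac) * lr_init + frac * lr_final


# Hypothetical horizon of 100 epochs, with the SGD endpoints from the table above.
schedule = [linear_decay_lr(e, 1.0, 0.01, 100) for e in range(121)]
```

The values in `schedule` start at 1.0, decrease by a constant amount per epoch, and stay at 0.01 from epoch 100 onwards.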
CIFAR-10 is a standard 10-class image dataset comprising 50,000 training samples and 10,000 samples for testing. Except for the linear learning rate schedule above, we follow the standard practice for CIFAR-10. We applied the standard augmentation via padding of pixels. We trained the networks VGG19, ResNet-18 and DenseNet121 during epochs () with a weight decay of .
We observe in Figure 7 in Appendix B that the online version does not perform as well as in the convex case. More surprisingly, it is outperformed by its offline version (Figure 2) which computes the iterates on the side.
In fact, the offline experiments detailed in Figure 2 exhibit much more significant gains. It produces a similar test accuracy, and the offline version converges faster than SGD, especially for early iterations. We reported speedup factors to reach a certain tolerance in Tables 2, 3 and 4. This suggests that the offline version of RNA is a good candidate for training neural networks, as it converges faster while guaranteeing performance at least as good as the reference algorithm. Additional figures of networks VGG19 and DenseNet121 can be found in Appendix B, Figure 8.
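Table 2: Number of epochs needed to reach a given tolerance, with the speedup factor relative to SGD + momentum in parentheses.
|Tolerance|SGD|SGD + momentum|SGD + RNA|SGD + momentum + RNA|
|---|---|---|---|---|
|5.0%|68 (0.87)|59|21 (2.81)|16 (3.69)|
|2.0%|78 (0.99)|77|47 (1.64)|40 (1.93)|
|1.0%|82 (1.00)|82|67 (1.22)|59 (1.39)|
|0.5%|84 (1.02)|86|75 (1.15)|63 (1.37)|
|0.2%|86 (1.13)|97|84 (1.15)|85 (1.14)|

Table 3: Number of epochs needed to reach a given tolerance, with the speedup factor relative to SGD + momentum in parentheses.
|Tolerance|SGD|SGD + momentum|SGD + RNA|SGD + momentum + RNA|
|---|---|---|---|---|
|5.0%|69 ()|60|26 ()|24 ()|
|2.0%|83 ()|82|52 ()|45 ()|
|1.0%|84 ()|86|71 ()|60 ()|
|0.5%|89 ()|87|73 ()|62 ()|
|0.2%|N/A|90|99 ()|63 ()|

Table 4: Number of epochs needed to reach a given tolerance, with the speedup factor relative to SGD + momentum in parentheses.
|Tolerance|SGD|SGD + momentum|SGD + RNA|SGD + momentum + RNA|
|---|---|---|---|---|
|5.0%|65 (0.86)|56|22 (2.55)|13 (4.31)|
|2.0%|80 (0.98)|78|45 (1.73)|38 (2.05)|
|1.0%|83 (1.00)|83|60 (1.38)|56 (1.48)|
|0.5%|87 (0.99)|86|80 (1.08)|66 (1.30)|
|0.2%|92 (1.01)|93|86 (1.08)|75 (1.24)|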
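The parenthesized factors in these tables appear to be the epoch count of the column reported without a factor divided by the epoch count of the method in question. The following sketch, with a hypothetical helper name, checks this reading on the first row of the first speedup table (59 reference epochs versus 68, 21 and 16).

```python
def speedup(reference_epochs, epochs):
    """Speedup factor of a method relative to the reference column."""
    return reference_epochs / epochs


# First row of the first speedup table: tolerance 5.0%, reference needs 59 epochs.
reference = 59
others = [68, 21, 16]
factors = [f"{speedup(reference, e):.2f}" for e in others]
print(factors)  # ['0.87', '2.81', '3.69']
```

A factor above one means the method reaches the tolerance in fewer epochs than the reference, and a factor below one means it needs more.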
Here, we apply the RNA algorithm to the standard ImageNet dataset. We trained the networks during epochs () with a weight decay of . We report the test accuracy in Figure 4 for the networks ResNet-50 and ResNet-152. We only tested the offline version of RNA here, because in previous experiments it gave better results than its online counterpart.
We again observe that the offline version of Algorithm 1 improves the convergence speed of SGD with and without momentum. In addition, we show a substantial improvement of the accuracy over the non-accelerated baseline. The improvement in the accuracy is reported in Figure 5. Interestingly, the resulting training loss is smoother than its non-accelerated counterpart, which indicates a noise reduction.
|Network|Reference|SGD|SGD + momentum|SGD + RNA|SGD + momentum + RNA|
|---|---|---|---|---|---|
|ResNet-50|23.85|23.808|23.346|23.412 (-0.396%)|22.914 (-0.432%)|
We extend the Regularized Nonlinear Acceleration scheme in (scieur2016regularized) to cover algorithms such as stochastic gradient methods with momentum and Nesterov’s method. Like the original scheme, it has optimal complexity on convex quadratic problems, but it is also amenable to non-convex optimization problems such as training deep CNNs.
As an online algorithm, RNA substantially improves the convergence rate of Nesterov’s algorithm on convex problems such as logistic regression. On the other hand, when applied offline to CNN training, it improves both accuracy and convergence speed. This could prove useful for fast prototyping of neural network architectures.
We acknowledge support from the European Union’s Seventh Framework Programme (FP7-PEOPLE-2013-ITN) under grant agreement n.607290 SpaRTaN and from the European Research Council (grant SEQUOIA 724063). Alexandre d’Aspremont was partially supported by the data science joint research initiative with the fonds AXA pour la recherche and Kamet Ventures. Edouard Oyallon was partially supported by a postdoctoral grant from DPEI of Inria (AAR 2017POD057) for the collaboration with CWI.
Appendix A Geometric interpretation of RNA
We can give a geometric interpretation of the RNA algorithm. In view of (5), we can link its output with the center of mass of an object, whose forces are represented with gradients. In Figure 6 we show the trajectory of a gradient + momentum algorithm, where the extrapolated point is shown in red while the real solution is in green.