A Fast Anderson-Chebyshev Acceleration for Nonlinear Optimization

Zhize Li, Jian Li

Abstract

Anderson acceleration (or Anderson mixing) is an efficient acceleration method for fixed-point iterations $x_{t+1} = G(x_t)$; e.g., gradient descent can be viewed as iteratively applying the operation $G(x) = x - \eta\nabla f(x)$. It is known that Anderson acceleration is quite efficient in practice and can be viewed as an extension of Krylov subspace methods for nonlinear problems. In this paper, we show that Anderson acceleration with Chebyshev polynomials can achieve the optimal convergence rate $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$, which improves the previous result $O(\kappa\log\frac{1}{\epsilon})$ provided by (Toth and Kelley, 2015) for quadratic functions. Moreover, we provide a convergence analysis for minimizing general nonlinear problems. Besides, if the hyperparameters (e.g., the Lipschitz smoothness parameter $L$) are not available, we propose a guessing algorithm that estimates them dynamically and also prove a similar convergence rate. Finally, the experimental results demonstrate that the proposed Anderson-Chebyshev acceleration method converges significantly faster than other algorithms, e.g., vanilla gradient descent (GD) and Nesterov's Accelerated GD. Also, these algorithms combined with the proposed guessing algorithm (guessing the hyperparameters dynamically) achieve much better performance.


1 Introduction

Machine learning problems are usually modeled as optimization problems, ranging from convex optimization to highly nonconvex optimization such as deep neural networks, e.g., (Nesterov, 2004; Bubeck, 2015; LeCun et al., 2015; Lei et al., 2017; Li and Li, 2018; Fang et al., 2018; Zhou et al., 2018; Li et al., 2019; Ge et al., 2019; Li, 2019). To solve an optimization problem $\min_x f(x)$, the classical method is gradient descent, i.e., $x_{t+1} = x_t - \eta\nabla f(x_t)$. There exist several techniques to accelerate the standard gradient descent, e.g., momentum (Nesterov, 2004; Allen-Zhu, 2017; Lan and Zhou, 2018; Lan et al., 2019). There are also various vector sequence acceleration methods developed in the numerical analysis literature, e.g., (Brezinski, 2000; Sidi et al., 1986; Smith et al., 1987; Brezinski and Redivo Zaglia, 1991; Brezinski et al., 2018). Roughly speaking, if a vector sequence converges very slowly to its limit, then one may apply such methods to accelerate the convergence of the sequence. Taking gradient descent as an example, the vector sequence $\{x_t\}$ is generated by $x_{t+1} = G(x_t)$ with $G(x) = x - \eta\nabla f(x)$, where the limit $x^*$ is the fixed point (i.e., $x^* = G(x^*)$, equivalently $\nabla f(x^*) = 0$). One notable advantage of such acceleration methods is that they usually do not require knowledge of how the vector sequence is actually generated. Thus the applicability of these methods is very wide.

Recently, Scieur et al. (2016) used the minimal polynomial extrapolation (MPE) method (Smith et al., 1987) for convergence acceleration. This is a nice example of applying sequence acceleration methods to optimization problems. In this paper, we are interested in another classical sequence acceleration method called Anderson acceleration (or Anderson mixing), which was proposed by Anderson in 1965 (Anderson, 1965). The method is known to be quite efficient in a variety of applications (Capehart, 1989; Pratapa et al., 2016; Higham and Strabić, 2016; Loffeld and Woodward, 2016). The idea of Anderson acceleration is to maintain the most recent $m$ iterations for determining the next iteration point, where $m$ is a parameter (typically a very small constant). Thus, it can be viewed as an extension of the existing momentum methods, which usually use only the last and current points to determine the next iteration point. Anderson acceleration with slight modifications is described in Algorithm 1.

1 input: $x_0$, iteration number $T$, history size $m$, mixing parameters $\{\beta_t\}$;
2 Define the residual $r(x) := -\nabla f(x)$;
3 $r_0 = r(x_0)$, $x_1 = x_0 + \beta_0 r_0$;
4 for $t = 1, 2, \ldots, T-1$ do
5       $m_t = \min(m, t)$, and let $\beta_t$ be the mixing parameter for this iteration;
6       $r_t = r(x_t)$;
7       Solve $\min_{(\alpha_0, \ldots, \alpha_{m_t})} \big\| \sum_{i=0}^{m_t} \alpha_i r_{t-i} \big\|$ subject to $\sum_{i=0}^{m_t} \alpha_i = 1$;
8       $x_{t+1} = \sum_{i=0}^{m_t} \alpha_i \big( x_{t-i} + \beta_t r_{t-i} \big)$;
9      
return $x_T$
Algorithm 1 Anderson Acceleration($m$)

Note that the step in Line 7 of Algorithm 1 can be transformed to an equivalent unconstrained least-squares problem:

\[ \min_{(\alpha_1, \ldots, \alpha_{m_t})} \Big\| r_t + \sum_{i=1}^{m_t} \alpha_i \,(r_{t-i} - r_t) \Big\|^2, \tag{1} \]

then let $\alpha_0 = 1 - \sum_{i=1}^{m_t} \alpha_i$. Using QR decomposition, (1) can be solved in $O(m^2 n)$ time, where $n$ is the dimension. Moreover, the QR decomposition of (1) at iteration $t$ can be efficiently obtained from that at iteration $t-1$ in $O(mn)$ time (see, e.g., (Golub and Van Loan, 1996)). The constant $m$ is usually very small; we use small values of $m$ for the numerical experiments in Section 5. Hence, each iteration of Anderson acceleration can be implemented quite efficiently.
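To make the least-squares step concrete, the following Python sketch implements Algorithm 1 using the unconstrained formulation (1); the fixed-point map, default history size, and parameter handling are illustrative assumptions rather than the exact configuration used in the experiments.

import numpy as np

def anderson_acceleration(grad, x0, betas, m=5, T=100):
    # Minimal sketch of Algorithm 1. grad(x) returns the gradient of f,
    # betas is the sequence of mixing parameters beta_t, m is the history size.
    xs = [np.asarray(x0, dtype=float)]
    rs = [-grad(xs[0])]                          # residual r(x) = -grad f(x)
    xs.append(xs[0] + betas[0] * rs[0])          # initial (damped) gradient step
    for t in range(1, T):
        rs.append(-grad(xs[t]))
        m_t = min(m, t)
        # Unconstrained least squares (1): min over a of ||r_t + sum_i a_i (r_{t-i} - r_t)||
        D = np.column_stack([rs[t - i] - rs[t] for i in range(1, m_t + 1)])
        a, *_ = np.linalg.lstsq(D, -rs[t], rcond=None)
        alpha = np.concatenate(([1.0 - a.sum()], a))   # recover alpha_0 from the constraint
        beta = betas[min(t, len(betas) - 1)]
        xs.append(sum(alpha[i] * (xs[t - i] + beta * rs[t - i]) for i in range(m_t + 1)))
    return xs[-1]

Here np.linalg.lstsq is used for clarity; the incremental QR update discussed above would avoid refactorizing the small matrix at every iteration.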

Many studies have shown the relations between Anderson acceleration and other optimization methods. In particular, for the quadratic case (linear problems), Walker and Ni (2011) showed that it is related to the well-known Krylov subspace method GMRES (generalized minimal residual algorithm) (Saad and Schultz, 1986). Furthermore, Potra and Engler (2013) showed that GMRES is equivalent to Anderson acceleration with any nonzero mixing parameters $\beta_t$ and without truncation (i.e., $m = \infty$; see Line 5 of Algorithm 1) for linear problems. Concretely, Toth and Kelley (2015) proved the first linear convergence rate $O(\kappa\log\frac{1}{\epsilon})$ for linear problems with a fixed parameter $\beta$, where $\kappa$ is the condition number. Besides, Eyert (1996), and Fang and Saad (2009) showed that Anderson acceleration is related to the multisecant quasi-Newton methods (more concretely, the generalized Broyden's second method). Despite the above results, the convergence results for this efficient method are still limited (especially for general nonlinear problems and the case where $m$ is small). In this paper, we analyze the convergence for small $m$, which is the typical case in practice, and also provide a convergence analysis for general nonlinear problems.

1.1 Our Contributions

There has been a growing number of applications of the Anderson acceleration method (Pratapa et al., 2016; Higham and Strabić, 2016; Loffeld and Woodward, 2016; Scieur et al., 2018). Towards a better understanding of this efficient method, we make the following technical contributions:

  1. We prove the optimal $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ convergence rate of the proposed Anderson-Chebyshev acceleration (i.e., Anderson acceleration with Chebyshev polynomials) for minimizing quadratic functions (see Theorem 1). Our result improves the previous $O(\kappa\log\frac{1}{\epsilon})$ result given by (Toth and Kelley, 2015) and matches the lower bound $\Omega(\sqrt{\kappa}\log\frac{1}{\epsilon})$ provided by (Nesterov, 2004). Note that for ill-conditioned problems, the condition number $\kappa$ can be very large.

  2. Then, we prove the linear-quadratic convergence of Anderson acceleration for minimizing general nonlinear problems under some standard assumptions (see Theorem 2). Compared with Newton-like methods, it is more attractive since it does not require computing (or approximating) Hessians or Hessian-vector products.

  3. Besides, we propose a guessing algorithm (Algorithm 2) for the case when the hyperparameters (e.g., $\mu$ and $L$) are not available. We prove that it achieves a similar convergence rate (see Theorem 3). This guessing algorithm can also be combined with other algorithms, e.g., Gradient Descent (GD) and Nesterov's Accelerated GD (NAGD). The experimental results (see Section 5.1) show that these algorithms combined with the proposed guessing algorithm achieve much better performance.

  4. Finally, the experimental results on the real-world UCI datasets and synthetic datasets demonstrate that Anderson acceleration methods converge significantly faster than other algorithms (see Section 5). Combined with our theoretical results, the experiments validate that Anderson acceleration methods (especially Anderson-Chebyshev acceleration) are efficient both in theory and practice.

1.2 Related Work

As aforementioned, Anderson acceleration can be viewed as an extension of the momentum methods (e.g., NAGD) and a potential extension of Krylov subspace methods (e.g., GMRES) for nonlinear problems. In particular, GD is the special case of Anderson acceleration with $m = 1$, and to some extent NAGD can be viewed as $m = 2$. We also review the equivalence of GMRES and Anderson acceleration without truncation (i.e., $m = \infty$) in Appendix A. Besides, Eyert (1996), and Fang and Saad (2009) showed that Anderson acceleration is related to the multisecant quasi-Newton methods. Note that Anderson acceleration has an advantage over the Newton-like methods since it does not require the computation of Hessians, approximations of Hessians, or Hessian-vector products.

There are many sequence acceleration methods in the numerical analysis literature. In particular, the well-known Aitken's $\Delta^2$ process (Aitken, 1926) accelerates the convergence of a linearly converging sequence. Shanks generalized the Aitken extrapolation, which is known as the Shanks transformation (Shanks, 1955). Recently, Brezinski et al. (2018) proposed a general framework for Shanks sequence transformations which includes many vector sequence acceleration methods. One fundamental difference between Anderson acceleration and other sequence acceleration methods (such as MPE and RRE (reduced rank extrapolation) (Sidi et al., 1986; Smith et al., 1987)) is that Anderson acceleration is a fully dynamic method (Capehart, 1989). Here dynamic means that all iterations are applied to the same sequence and the procedure does not need to be restarted. It can be seen from Algorithm 1 that all iterations are applied to the same sequence $\{x_t\}$. In fact, in Capehart's PhD thesis (Capehart, 1989), several experiments were conducted to demonstrate the superior performance of Anderson acceleration over semi-dynamic methods such as MPE and RRE (semi-dynamic means that the algorithm maintains more than one sequence or needs to restart several times). More recently, Anderson acceleration with different variants and/or under different assumptions has been widely studied (see, e.g., (Zhang et al., 2018; Evans et al., 2018; Scieur et al., 2019)).

2 The Quadratic Case

In this section, we consider the problem of minimizing a quadratic function (also called least squares, or ridge regression (Boyd and Vandenberghe, 2004; Hoerl and Kennard, 1970)). The formulation of the problem is

\[ \min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^\top A x - b^\top x, \tag{2} \]

where $\mu I \preceq A \preceq L I$. Note that $\mu$ and $L$ are usually called the strongly convex parameter and the Lipschitz continuous gradient parameter, respectively (e.g., (Nesterov, 2004; Allen-Zhu, 2017; Lan et al., 2019)). There are many algorithms for optimizing this type of functions; see, e.g., (Bubeck, 2015) for more details. We analyze the problem of minimizing a more general function in Section 3.

We prove that Anderson acceleration with Chebyshev polynomial parameters achieves the optimal convergence rate, i.e., it obtains an $\epsilon$-approximate solution using $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ iterations. The convergence result is stated in the following Theorem 1.

Theorem 1

The Anderson-Chebyshev acceleration method achieves the optimal convergence rate $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ for obtaining an $\epsilon$-approximate solution of problem (2) for any $m \le c$, where $\kappa = L/\mu$ is the condition number, $c$ is defined in Definition 1, and the method combines Anderson acceleration (Algorithm 1) with the Chebyshev polynomial parameters $\beta_t$, $t = 1, \ldots, T$.

Remark: In this quadratic case, we mention that Toth and Kelley (2015) proved the first $O(\kappa\log\frac{1}{\epsilon})$ convergence rate for a fixed parameter $\beta$. Here we use the Chebyshev polynomials to improve the result to the optimal $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$, which matches the lower bound $\Omega(\sqrt{\kappa}\log\frac{1}{\epsilon})$. Note that for ill-conditioned problems, the condition number $\kappa$ can be very large. Also note that in practice the constant $m$ is usually very small; a small $m$ already achieves remarkable performance in our experimental results (see Figures 2–5 in Section 5).
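For intuition on where the $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ iteration count comes from, recall the classical Chebyshev estimate $1/T_n\big(\frac{\kappa+1}{\kappa-1}\big) \le 2\big(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\big)^n$, the standard bound behind Chebyshev-type accelerations. Requiring the right-hand side to be at most $\epsilon$ gives

\[
2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{n} \le \epsilon
\;\Longleftrightarrow\;
n \ge \frac{\log(2/\epsilon)}{\log\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}},
\qquad
\log\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1} \ge \frac{2}{\sqrt{\kappa}+1},
\]

so $n = \big\lceil \frac{\sqrt{\kappa}+1}{2}\log\frac{2}{\epsilon} \big\rceil = O\big(\sqrt{\kappa}\log\frac{1}{\epsilon}\big)$ iterations suffice.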

Before proving Theorem 1, we first define the quantity $c$ used in the theorem and then briefly review some properties of the Chebyshev polynomials. We refer to (Rivlin, 1974; Olshanskii and Tyrtyshnikov, 2014; Hageman and Young, 2012) for more details of Chebyshev polynomials.

Definition 1

Let ’s be the unit eigenvectors of , where is defined in (2). Consider a unit vector and let , where denotes the projection to the orthogonal complement of the column space of . Define to be the maximum integer such that for any .

Obviously, since due to and .

Now we review the Chebyshev polynomials. The Chebyshev polynomials are polynomials $T_n(x)$ of degree $n = 0, 1, 2, \ldots$, where $x \in [-1, 1]$, defined by the recursive relation:

\[ T_0(x) = 1, \quad T_1(x) = x, \quad T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x). \tag{3} \]

The key property is that the normalized polynomial $T_n(x)/2^{n-1}$ has minimal deviation from $0$ on $[-1, 1]$ among all polynomials with leading coefficient $1$ for the largest degree term $x^n$, i.e.,

\[ \max_{x \in [-1,1]} \frac{|T_n(x)|}{2^{n-1}} = \frac{1}{2^{n-1}} \le \max_{x \in [-1,1]} |p_n(x)| \quad \text{for any monic polynomial } p_n \text{ of degree } n. \tag{4} \]

In particular, for $x \in [-1, 1]$, the Chebyshev polynomials can be written in an equivalent way:

\[ T_n(x) = \cos(n \arccos x). \tag{5} \]

In our proof, we use this equivalent form (5) instead of (3). The equivalence can be verified as follows:

\[ \cos((n+1)\theta) + \cos((n-1)\theta) = 2\cos\theta\,\cos(n\theta) \tag{6} \]
\[ \Longrightarrow\quad T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x), \tag{7} \]

where (6) and (7) use the transformation $x = \cos\theta$, which is valid due to $x \in [-1, 1]$. According to (5), $|T_n(x)| \le 1$ for $x \in [-1, 1]$, and the roots of $T_n$ are as follows:

\[ x_k = \cos\Big(\frac{(2k-1)\pi}{2n}\Big), \qquad k = 1, 2, \ldots, n. \tag{8} \]

To demonstrate this more clearly, we provide an example for $T_4(x)$ (a W-shape curve) in Figure 1. Since $n = 4$ in this polynomial, the first root is $x_1 = \cos(\pi/8)$. The remaining three roots $x_k = \cos\frac{(2k-1)\pi}{8}$ for $k = 2, 3, 4$ can be easily computed too.

Figure 1: The Chebyshev polynomial $T_4(x)$
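As a quick numerical check of (8) (purely illustrative, not part of the method), the roots of $T_4$ can be computed from the closed form and compared with NumPy's Chebyshev routines:

import numpy as np

n = 4
# Closed form (8): x_k = cos((2k - 1) * pi / (2n)), k = 1, ..., n
roots_closed_form = np.cos((2 * np.arange(1, n + 1) - 1) * np.pi / (2 * n))
# Roots of T_4 via NumPy (coefficient vector [0, 0, 0, 0, 1] in the Chebyshev basis)
roots_numpy = np.polynomial.chebyshev.chebroots([0] * n + [1])
print(np.sort(roots_closed_form))   # approximately [-0.924, -0.383, 0.383, 0.924]
print(np.sort(roots_numpy))         # the same values up to floating-point error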

Proof of Theorem 1. For iteration , the residual (let ) can be deduced as follows:

(9)
(10)

where (9) uses .

To bound the residual (i.e., the gradient norm), we first obtain the following lemma by using the Singular Value Decomposition (SVD) to solve the least-squares problem (1) and then applying several transformations. We defer the proof of Lemma 1 to Appendix B.2.

Lemma 1

Let and , then

(11)

where is a degree polynomial.

According to Lemma 1, to bound the residual it is sufficient to bound the right-hand-side (RHS) of (11). So we want to choose the parameters $\beta_t$ in order to make the polynomial factor in the RHS as small as possible. According to (4) (the minimal deviation property of standard Chebyshev polynomials), a natural idea is to choose the $\beta_t$'s such that this polynomial is a kind of modified Chebyshev polynomial. In order to do this, we first transform the interval $[\mu, L]$ into $[-1, 1]$ by an affine change of variable. Also note that the polynomial has (only) one normalization constraint. Thus we choose the $\beta_t$'s such that

(12)

where is the standard Chebyshev polynomials. Now, the RHS of (11) can be bounded as follows:

(13)
(14)

where (13) uses (12), and (14) uses the boundedness property $|T_n(x)| \le 1$ (see (5)). According to (8), it is not hard to see that this polynomial is determined by the mixing parameters $\beta_t$ through its roots. Note that the roots of the standard Chebyshev polynomials (i.e., (8)) can be found in many textbooks, e.g., Section 1.2 of (Rivlin, 1974). Now, we only need to bound the remaining Chebyshev factor. First, we need to transform the form (5) of the Chebyshev polynomials as follows:

Let , we get . So we have

(15)

Now, the RHS of (11) can be bounded as

(16)
(17)

where (16) follows from (14), and (17) follows from (15). Then, according to (11), the gradient norm is bounded by the RHS of (17) (up to the factor given in (11)). Note that if the number of iterations is $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$, then this bound is at most $\epsilon$.

Thus the Anderson-Chebyshev acceleration method achieves the optimal $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ convergence rate for obtaining an $\epsilon$-approximate solution.
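To connect the proof with an implementation, the sketch below generates candidate mixing parameters $\beta_t$ by mapping the roots of $T_N$ from $[-1, 1]$ to the eigenvalue interval $[\mu, L]$ and taking reciprocals. This affine-map-plus-reciprocal recipe is a standard Chebyshev-iteration-style choice and is an assumption here; the precise correspondence between the roots and the $\beta_t$'s is the one fixed in the proof above.

import numpy as np

def chebyshev_mixing_parameters(mu, L, N):
    # Candidate beta_1, ..., beta_N: map the roots of T_N (eq. (8)) affinely
    # from [-1, 1] to [mu, L] and take reciprocals (an assumed, illustrative choice).
    k = np.arange(1, N + 1)
    roots = np.cos((2 * k - 1) * np.pi / (2 * N))
    mapped = (L + mu) / 2 + (L - mu) / 2 * roots
    return 1.0 / mapped

# Example: parameters for mu = 1, L = 100 over N = 20 iterations, which could be
# passed as `betas` to the anderson_acceleration sketch given after Algorithm 1.
betas = chebyshev_mixing_parameters(1.0, 100.0, 20)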

3 The General Case

In this section, we analyze Anderson acceleration (Algorithm 1) in the general nonlinear case:

\[ \min_{x \in \mathbb{R}^n} f(x). \tag{18} \]

We prove that the Anderson acceleration method achieves a linear-quadratic convergence rate under the following standard Assumptions 1 and 2, where $\|\cdot\|$ denotes the Euclidean norm. Let $D_t$ denote the small matrix of the least-squares problem in Line 7 of Algorithm 1 (see problem (1)). Then, we define its condition number $\kappa(D_t) = \sigma_{\max}(D_t)/\sigma_{\min}(D_t)$, where $\sigma_{\min}(D_t)$ denotes the least non-zero singular value of $D_t$.

Assumption 1

The Hessian satisfies $\mu I \preceq \nabla^2 f(x) \preceq L I$ for all $x$, where $0 < \mu \le L$.

Assumption 2

The Hessian is $\rho$-Lipschitz continuous, i.e.,

\[ \|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho \|x - y\|, \quad \forall x, y. \tag{19} \]
Theorem 2

Suppose Assumptions 1 and 2 hold and the step-size $\beta_t$ is chosen appropriately (see the detailed proof in Appendix B.1). Then the convergence rate of Anderson Acceleration($m$) (Algorithm 1) is linear-quadratic for problem (18), i.e.,

(20)

where ,  ,   and .

Remark:

  1. The constant $m$ is usually very small (see Section 5 for the values used in our numerical experiments). Hence the corresponding quantity is very small and also decreases as the algorithm converges.

  2. Besides, one can also use instead of in (20) according to the property of (Assumption 1), i.e., , and .

  3. Note that the first two terms in the RHS of (20) converge quadratically and the last term converges linearly. Due to the fully dynamic property of Anderson acceleration, as discussed in Section 1.2, the exact convergence rate of Anderson acceleration in the general case is not easy to obtain. But we note that the convergence rate is roughly linear, since the first two quadratic terms converge much faster than the last linear term in some neighborhood of the optimum. In particular, if $f$ is a quadratic function, then $\rho = 0$ (Assumption 2) and thus the first two terms in (20) vanish. Only the last linear term remains, so the method converges linearly (see the following corollary).

Corollary 1

If $f$ is a quadratic function, let the step-size $\beta_t$ be a suitable fixed constant. Then the convergence rate of Anderson Acceleration is linear, i.e., $O(\kappa\log\frac{1}{\epsilon})$, where $\kappa = L/\mu$ is the condition number.

Note that this corollary recovers the previous result (i.e., $O(\kappa\log\frac{1}{\epsilon})$) obtained by (Toth and Kelley, 2015), and we use Chebyshev polynomials to improve this result to the optimal $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ convergence rate in Section 2 (see Theorem 1). Concretely, we transfer the weight of the step-size to the parameters $\beta_t$'s and use the Chebyshev polynomial parameters $\beta_t$'s in our Theorem 1 instead of using a fixed parameter $\beta$.

Now, we provide a proof sketch for Theorem 2. The detailed proof can be found in Appendix B.1.

Proof Sketch of Theorem 2. Consider the iteration , we have according to . First, we need to demonstrate several useful forms of as follows:

(21)
(22)

where (21) holds due to the definition , and (22) holds since .

Then, to bound (i.e., ), we deduce as follows:

(23)

where (23) uses the definition . Now, we bound the first two terms of (23) as follows:

(24)

where (24) is obtained by using (22) to replace . To bound (24), we use Assumptions 1, 2, and the equation

After some non-trivial calculations (details can be found in Appendix B.1), we obtain

where denotes the Euclidean norm of . Then, according to the problem (1) and the definition of , we have . Finally, we bound using QR decomposition of problem (1) and recall to finish the proof of Theorem 2.

4 Guessing Algorithm

In this section, we provide a guessing algorithm (described in Algorithm 2) which guesses the parameters (e.g., $\mu$ and $L$) dynamically. Intuitively, we guess the parameter $L$ and the condition number $\kappa$ in a doubling way. Note that in general these parameters are not available, since the time for computing them is almost the same as (or even longer than) the time for solving the original problem. Also note that the condition in Line 14 of Algorithm 2 depends on the algorithm used in Line 12.

1 input:
2 Let ;
3 for  do
4       ;
5       for  do
6             ;
7             do
8                   ;
9                   if  then
10                         break;
11                        
12                  ;
13                   Anderson Acceleration() //can be replaced by other algorithms;
14                   ;
15                  
16            while  ;
17            if  then
18                   ;
19                  
20            
21      
return
Algorithm 2 Guessing Algorithm
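Since the exact control flow of Algorithm 2 is not fully reproduced above, the following rough Python sketch only illustrates the doubling idea: the estimate of $L$ is doubled in an outer loop, the guessed condition number is doubled in an inner loop, and a candidate run is kept only if it makes sufficient progress. The loop bounds, the per-trial iteration budget, and the progress test are all illustrative assumptions, and `run_solver` stands in for, e.g., Anderson acceleration with the Chebyshev parameters sketched earlier.

import numpy as np

def guess_and_run(run_solver, grad, x0, L0=1.0, eps=1e-6, max_doublings=30):
    # Illustrative doubling search for (mu, L); not the paper's exact Algorithm 2.
    # run_solver(grad, x, mu_hat, L_hat, iters) runs a solver for `iters` iterations
    # with the guessed parameters and returns the new iterate.
    x = np.asarray(x0, dtype=float)
    for i in range(max_doublings):                # guess L in a doubling way
        L_hat = L0 * 2.0 ** i
        for j in range(1, max_doublings):         # guess the condition number kappa_hat = 2^j
            mu_hat = L_hat / 2.0 ** j
            iters = int(np.ceil(np.sqrt(L_hat / mu_hat))) + 1   # ~sqrt(kappa_hat) steps per trial
            x_new = run_solver(grad, x, mu_hat, L_hat, iters)
            if np.linalg.norm(grad(x_new)) <= 0.5 * np.linalg.norm(grad(x)):
                x = x_new                         # sufficient progress: accept the trial
            if np.linalg.norm(grad(x)) <= eps:
                return x
    return x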

The convergence result of our Algorithm 2 is stated in the following Theorem 3. The detailed proof is deferred to Appendix B.3. Note that we only prove the quadratic case for Theorem 3, but it is possible to extend it to the general case.

Theorem 3

Without knowing the parameters $\mu$ and $L$, Algorithm 2 achieves a convergence rate of the form $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ plus an additive term that does not depend on the accuracy $\epsilon$, for obtaining an $\epsilon$-approximate solution of problem (2), where $\kappa = L/\mu$, and the initial guess can be any number as long as the eigenvalue spectrum of $A$ belongs to the range searched by the algorithm.

Remark: We provide a simple example to show why this guessing algorithm is useful. Note that algorithms usually need the (exact) parameters $\mu$ and $L$ to set the step size. Without knowing the exact values of $\mu$ and $L$, one needs to approximate these parameters once at the beginning. Let $\hat\mu$ and $\hat L$ denote the approximated values, where $\hat\mu \le \mu \le L \le \hat L$. Without guessing them dynamically, one fixes $\hat\mu$ and $\hat L$ all the time in the algorithm. According to the lower bound $\Omega(\sqrt{\kappa}\log\frac{1}{\epsilon})$, we know that its convergence rate cannot be better than $\sqrt{\hat\kappa}\log\frac{1}{\epsilon}$ (up to constants), where $\hat\kappa = \hat L/\hat\mu \ge \kappa$. However, if one combines it with our Algorithm 2 (guessing the parameters dynamically), the convergence rate can be improved according to our Theorem 3, since the dynamic guesses eventually locate $\mu$ and $L$ (and hence $\kappa$) up to constant factors. Note that there is no $\epsilon$ (accuracy) in the second, additive term of the rate in Theorem 3. Thus the rate approaches the optimal $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ as the accuracy requirement becomes stringent. To achieve an $\epsilon$-approximate solution, our guessing algorithm can thus improve the convergence a lot, especially when the initial estimates are imprecise (i.e., $\hat\kappa$ is much larger than $\kappa$). The corresponding experimental results in Section 5.1 (see Figure 6) indeed validate our theoretical results.

5 Experiments

In this section, we conduct numerical experiments on real-world UCI datasets and synthetic datasets. We compare the performance of these five algorithms: Anderson Acceleration (AA), Anderson-Chebyshev acceleration (AA-Cheby), vanilla Gradient Descent (GD), Nesterov's Accelerated Gradient Descent (NAGD) (Nesterov, 2004), and Regularized Minimal Polynomial Extrapolation using 5 iterates (RMPE5, the same setting as (Scieur et al., 2016)).

Regarding the hyperparameters, we directly set them from their corresponding theoretical results. See Proposition 1 of (Lessard et al., 2016) for GD and NAGD. For RMPE5, we follow the same setting as in (Scieur et al., 2016). For our AA/AA-Cheby, we set them according to our Theorems 1 and 2.

Figure 2 demonstrates the convergence performance of these algorithms in the general nonlinear case, and Figures 3–5 demonstrate the convergence performance in the quadratic case. The last figure, Figure 6, demonstrates the convergence performance of these algorithms combined with our guessing algorithm (Algorithm 2). The values of $\beta$ in the figure captions denote the mixing parameter of the Anderson acceleration algorithms (see Line 5 of Algorithm 1).

Figure 2: Logistic regression on the diabetes and cancer datasets

In Figure 2, we use the negative log-likelihood as the loss function (logistic regression), i.e., $f(w) = \frac{1}{n}\sum_{i=1}^{n}\log\big(1+\exp(-y_i w^\top x_i)\big)$, where $\{(x_i, y_i)\}_{i=1}^{n}$ denote the training samples. We run these five algorithms on the real-world diabetes and cancer datasets, which are standard UCI datasets. The x-axis and y-axis represent the number of iterations and the norm of the gradient of the loss function, respectively.
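As a hedged illustration of the objective used here, the snippet below builds the gradient of the averaged logistic loss with labels in $\{-1, +1\}$ so it can be plugged into the Anderson acceleration sketch above; the paper's exact preprocessing, labeling convention, and any regularization are not specified here and would need to match the actual experimental setup.

import numpy as np

def make_logistic_grad(X, y):
    # Gradient of f(w) = (1/n) * sum_i log(1 + exp(-y_i * x_i^T w)), with y_i in {-1, +1}.
    n = X.shape[0]
    def grad(w):
        margins = y * (X @ w)
        coeffs = -y / (1.0 + np.exp(margins))    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))
        return (X.T @ coeffs) / n
    return grad

# Usage with random stand-in data (placeholder for the UCI diabetes/cancer datasets):
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = np.sign(rng.standard_normal(200))
grad = make_logistic_grad(X, y)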

Figure 3: quadratic case; the left and right plots correspond to different settings of $\kappa$ and $\beta$

Figure 4: quadratic case; the left and right plots correspond to different settings of $\kappa$ and $\beta$

Figure 5: quadratic case; the left and right plots correspond to different settings of $\kappa$ and $\beta$

Figures 3–5 demonstrate the convergence performance for the quadratic case, where $f(x) = \frac{1}{2}x^\top A x - b^\top x$. Concretely, we compare the convergence performance of these algorithms when the condition number $\kappa$ and the mixing parameter $\beta$ are varied; e.g., the left figure in Figure 3 corresponds to one particular setting of $\kappa$ and $\beta$. Recall that $\beta$ is the mixing parameter for the Anderson acceleration algorithms (see Line 5 of Algorithm 1). We run these five algorithms on synthetic datasets in which we randomly generate $A$ and $b$ for the loss function. Note that to randomly generate $A$ satisfying $\mu I \preceq A \preceq L I$, we randomly generate a matrix $C$ instead and let $A = C^\top C$.
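One possible way to generate such synthetic quadratics with a prescribed spectrum is sketched below; the eigenvalue rescaling is an illustrative assumption, since only the $A = C^\top C$ construction is indicated above.

import numpy as np

def random_quadratic(n, mu, L, seed=0):
    # Random quadratic f(x) = 0.5 * x^T A x - b^T x with spectrum(A) in [mu, L].
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((n, n))
    A = C.T @ C                                   # symmetric positive semidefinite
    eigvals, eigvecs = np.linalg.eigh(A)
    scaled = mu + (L - mu) * (eigvals - eigvals.min()) / (eigvals.max() - eigvals.min())
    A = eigvecs @ np.diag(scaled) @ eigvecs.T     # rescale eigenvalues to [mu, L] exactly
    b = rng.standard_normal(n)
    def grad(x):
        return A @ x - b                          # gradient of the quadratic objective (2)
    return A, b, grad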

In conclusion, Anderson acceleration methods converge the fastest in all of our experiments, regardless of whether the objective is a quadratic or a general nonlinear function. The efficient Anderson acceleration methods can be viewed as an extension of momentum methods (e.g., NAGD), since GD is the special case of Anderson acceleration with $m = 1$, and to some extent NAGD can be viewed as $m = 2$. Combined with our theoretical results (i.e., the optimal convergence rate in the quadratic case and linear-quadratic convergence in the general case), the experimental results validate that Anderson acceleration methods are efficient both in theory and practice.

5.1 Experiments for Guessing Algorithm

In this section, we conduct the experiments for guessing the hyperparameters (i.e., $\mu$ and $L$) dynamically using our Algorithm 2.

Figure 6: Algorithms with/without the guessing algorithm: (a) Gradient Descent, (b) Nesterov's AGD, (c) Anderson Acceleration, (d) Anderson-Chebyshev

In Figure 6, we separately consider these algorithms. For each of them, we compare its convergence performance between its original version and the one combined with our guessing algorithm (Algorithm 2). The experimental results show that all these four algorithms combined with our guessing algorithm achieve much better performance than their original versions. Thus it validates our theoretical results (see Theorem 3 and its following Remark).

6 Conclusion

In this paper, we prove that Anderson acceleration with Chebyshev polynomials can achieve the optimal convergence rate $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$, which improves the previous result $O(\kappa\log\frac{1}{\epsilon})$ provided by (Toth and Kelley, 2015). Thus it can deal with ill-conditioned problems (where the condition number $\kappa$ is large) more efficiently. Furthermore, we also prove the linear-quadratic convergence of Anderson acceleration for minimizing general nonlinear problems. Besides, if the hyperparameters (e.g., the Lipschitz smoothness parameter $L$) are not available, we propose a guessing algorithm that guesses them dynamically and also prove a similar convergence rate. Finally, the experimental results demonstrate that the efficient Anderson acceleration methods converge significantly faster than the other algorithms. This validates that Anderson-Chebyshev acceleration is efficient both in theory and practice.

Acknowledgements

Zhize was supported by the Office of Sponsored Research of KAUST, through the Baseline Research Fund of Prof. Peter Richtárik. Jian was supported in part by the National Natural Science Foundation of China Grant 61822203, 61772297, 61632016, 61761146003, and the Zhongguancun Haihua Institute for Frontier Information Technology and Turing AI Institute of Nanjing. The authors also would like to thank Francis Bach, Claude Brezinski, Rong Ge, Damien Scieur, Le Zhang and anonymous reviewers for useful discussions and suggestions.

Appendix A GMRES vs. Anderson Acceleration ($m = \infty$)

In this appendix, in order to better understand the efficient Anderson acceleration method, we review the equivalence between the well-known Krylov subspace method GMRES (Saad and Schultz, 1986) and Anderson acceleration without truncation (i.e., $m = \infty$ or $m$ large enough; see Line 5 of Algorithm 1) in the linear case. We emphasize that in this paper we focus on the harder and more general cases where $m$ is small (since $m$ is usually finite and not very large in practice) and on the general nonlinear case.

Consider the problem of solving the linear system $Ax = b$ with a nonsingular matrix $A$. This is equivalent to solving the fixed-point problem $x = G(x)$, where $G(x) = (I - A)x + b$. Let $r(x)$ denote the residual at the point $x$, i.e., $r(x) = b - Ax = G(x) - x$. The GMRES method is an effective iterative method for linear systems which has the property of minimizing the norm of the residual vector over a Krylov subspace at every step:

\[ x_t = \operatorname*{arg\,min}_{x \in x_0 + \mathcal{K}_t} \|b - Ax\|, \qquad \mathcal{K}_t = \mathrm{span}\{r_0, A r_0, \ldots, A^{t-1} r_0\}. \tag{25} \]

Note that the Krylov subspace $\mathcal{K}_t$ is the linear span of the first $t$ residuals (gradients), and $\mathcal{K}_n$ can span the whole space $\mathbb{R}^n$. Hence the method arrives at the exact solution after at most $n$ iterations. It is also theoretically equivalent to the Generalized Conjugate Residual method (GCR).
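For a concrete GMRES baseline, SciPy's implementation can be used directly; the system below is a random well-conditioned stand-in, purely for illustration.

import numpy as np
from scipy.sparse.linalg import gmres

rng = np.random.default_rng(0)
n = 50
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # nonsingular, well-conditioned
b = rng.standard_normal(n)

# GMRES minimizes the residual norm ||b - A x|| over the growing Krylov subspace (25).
x, info = gmres(A, b, restart=n, maxiter=n)
print(info, np.linalg.norm(b - A @ x))              # info == 0 indicates convergence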

Now we show a deduction that indicates the equivalence, under the inductive assumption that the two iterate sequences coincide up to iteration $t$. Let $x_t^{\mathrm{GMRES}}$ and $x_t^{\mathrm{AA}}$ denote the $t$-th GMRES iterate and the $t$-th Anderson acceleration iterate, respectively, and let the mixing parameters $\beta_t$ be fixed for all $t$. Then, we deduce $x_{t+1}^{\mathrm{AA}}$ as follows:

(26)
(27)
(28)

Note that the second term in (28) is the same as the quantity we minimize in Line 7 of Algorithm 1. This step can also be transformed to an unconstrained version as follows:

(29)

The equals to . Note that