A proof that Anderson acceleration increases the convergence rate in linearly converging fixed point methods (but not in quadratically converging ones)

# A proof that Anderson acceleration increases the convergence rate in linearly converging fixed point methods (but not in quadratically converging ones)

## Abstract

This paper provides the first proof that Anderson acceleration (AA) increases the convergence rate of general fixed point iterations. AA has been used for decades to speed up nonlinear solvers in many applications, however a rigorous mathematical justification of the improved convergence rate remained lacking. The key ideas of the proof are relating errors with residuals, using results arising from the optimization, and explicitly defining the gain in the optimization stage to be the ratio of improvement over a step of the unaccelerated fixed point iteration. The main result we prove is that AA improves a the convergence rate of a fixed point iteration to first order by a factor of the gain at each step. In addition to improving the convergence rate, our results also show that AA increases the radius of convergence (even beyond a set where the fixed point operator is contractive). Lastly, our estimate shows that while the linear convergence rate is improved, additional quadratic terms arise in the estimate, which shows why AA does not typically improve convergence in quadratically converging fixed point iterations. Results of several numerical tests are given which illustrate the theory.

## 1 Introduction

We study an acceleration technique for fixed point problems called Anderson acceleration, in which a history of search-directions is used to increase the rate of convergence of fixed-point iterations. The method was originally introduced by D.G. Anderson in 1965 in the context of integral equations [2]. It has recently been used in many applications, including multisecant methods for fixed-point iterations in electronic structure computations [4], geometry optimization problems [11], various types of flow problems [10, 12], radiation diffusion and nuclear physics [1, 15], molecular interaction [13], machine learning [5], improving the alternating projections method for computing nearest correlation matrices [6], and on a wide range of nonlinear problems in the context of generalized minimal residual (GMRES) methods in [16]. We further refer readers to [7, 9, 10, 16] and references therein for detailed discussions on both practical implementation and a history of the method and its applications.

Despite a long history of use and a strong recent interest, the first mathematical convergence results for Anderson acceleration (for both linear and nonlinear problems) appear in 2015 in [14], under the usual local assumptions for convergence of Newton iterations. However, this theory does not prove that Anderson acceleration improves the convergence of a fixed point iteration, or in other words accelerates convergence in the sense of [3]. Rather, it proves that Anderson accelerated fixed point iterations will converge in the neighborhood of a fixed point; and, an upper bound on the convergence rate is shown to approach from above the convergence rate of the underlying fixed point iteration. While an important stage in the developing theory, this does not explain the efficacy of the method, which has gained popularity as practitioners have continued to observe a dramatic speedup and increase in robustness from Anderson acceleration over a wide range of problems.

The purpose of this paper is to address this gap in the theory by proving a rigorous estimate for Anderson acceleration that shows a guaranteed increase in the convergence rate for fixed point iterations (for general functions) that converge linearly (with rate ). By explicitly defining the gain of the optimization stage at iteration to be the ratio of the optimized objective function compared to that of the usual fixed point method, we prove the new convergence rate is at step , where is a damping parameter and produces the undamped iteration. The key ideas to the proof are an expansion of the residual errors, developing expressions relating the errors and residuals, and explicitly factoring in the gain from the optimization stage. A somewhat similar approach is used by the authors to prove that Anderson acceleration speeds up Picard iteration convergence for finite element discretizations of the steady Navier-Stokes equations in [12], (without the assumption on the fixed-point operator) and herein we extend these ideas to general fixed point iterations.

In addition to the improved linear convergence rate, our analysis also indicates that Anderson acceleration introduces quadratic error terms, which is consistent with known results that Anderson acceleration does not accelerate quadratically converging fixed point methods (see the numerical experiments section below), establishing a barrier which theoretically prevents establishing an improved convergence rate for general fixed-point iterations. A third important result we show is that both Anderson acceleration and the use of damping can extend the radius of convergence for the method, i.e. Anderson acceleration can allow the iteration to converge even when outside the domain where the fixed point function is contractive.

This paper is arranged as follows. In Section 2, we review Anderson acceleration, describe the problem setting, and give some basic definitions and notation. Section 3 gives several important technical results, to make the later analysis cleaner and simpler. Section 4 gives the main result of the paper, proving that the linear convergence rate is reduced by Anderson acceleration, but additional quadratic error terms arise. Section 5 gives results from numerical tests, with the intent of illustrating the theory proven herein. Conclusions are given in the final section.

## 2 Anderson acceleration

In what follows, we will consider a fixed-point operator where is a Hilbert space with norm and inner-product . The Anderson acceleration algorithm with depth applied to the fixed-point problem reads as follows.

###### Algorithm 2.1 (Anderson iteration).

The Anderson-acceleration with depth and damping factors reads:
Step 0: Choose
Step 1: Find such that . Set .
Step : For Set
[a.] Find .
[b.] Solve the minimization problem for

 min∑kj=k−mkαk+1j=1∥∥ ∥∥k∑j=k−mkαk+1j(~xj+1−xj)∥∥ ∥∥. (2.1)

[c.] For damping factor , set

 xk+1=(1−βk)k∑j=k−mkαk+1jxj+βkk∑j=k−mkαk+1j~xj+1. (2.2)

We will use throughout this work the stage- residual and error terms

 ek\coloneqqxk−xk−1,~ek\coloneqq~xk−~xk−1,wk\coloneqq~xk−xk−1. (2.3)

Define the following averages given by the solution to the optimization problem (2.1):

 xαk=k∑j=k−mkαk+1jxj,~xαk+1=k∑j=k−mkαk+1j~xj+1,wαk+1=k∑j=k−mkαk+1j(g(xj)−xj). (2.4)

Then the update (2.2) can be written in terms of the averages and ,

 xk+1=(1−βk)xαk+βk~xαk+1, (2.5)

and the stage- gain can be defined by

 ∥∥wαk+1∥∥=θk∥wk+1∥. (2.6)

The key to showing the acceleration technique defined by taking a linear combination of a history of steps corresponding to the coefficients of the optimization problem (2.1) is connecting the gain given by (2.6) to the error and residual terms in (2.4). As such, the success (or failure) of the algorithm to reduce the residual is coupled to the success of the optimization problem at each stage of the algorithm. As is an admissible solution to (2.1), it follows immediately that . As discussed in the remainder, the improvement in the contraction rate of the fixed-point iteration is characterized by .

The two main components of the proof of residual convergence at an accelerated rate are the expansion of the residual into and error terms ; and, control of the ’s in terms of the corresponding ’s. In the next section, the first of these is established for general , and the second for the particular cases of depth and , with the result then extrapolated for general .

## 3 Technical preliminaries

There are several important technical results that our theory utilizes. We choose to separate them out and into this section, to allow for cleaner proofs of the main results in Section 4.

In [12] the operator was assumed Lipschitz continuous and contractive. Here, we make the stronger assumption to allow for Taylor expansions of the error terms.

###### Assumption 3.1.

Assume has a fixed point , and there are positive constants and with

1. .

2. for each .

3. for each .

###### Remark 3.2.

Reducing the assumptions from global to local is possible, but will make the technical analysis below significantly more technical.

Based on Assumption 3.1 the error of (2.4) has a Taylor expansion

 ~ek\coloneqqg(xk−1)−g(xk)=g′(ck−1)(xk−xk−1)=g′(ck)ek, (3.1)

where is on the line segment joining and . A second application of Taylor’s Theorem provides

 g′(ck−1)=g′(ck)+g′′(^ck)(ck−1−ck), (3.2)

with on the line segment connecting with .

### 3.1 Expansion of the residual

Based on (3.1)-(3.2) we next derive an expansion of the residual in terms of the errors . We start with the definition of the residual by (2.4) and the expansion of iterate by the update (2.5).

 wk+1=g(xk)−xk=(1−βk−1)(g(xk)−xαk−1)+βk−1(g(xk)−~xαk). (3.3)

Expanding the first term on the right hand side of (3.3) yields

 g(xk)−xαk−1 =k−1∑j=k−mk−1−1αkj(g(xk)−xj) =k−1∑j=k−mk−1−1αkj(g(xj)−xj)+k∑j=k−mk−1⎛⎝j−1∑n=k−mk−1−1αkn⎞⎠(g(xj)−g(xj−1)) =wαk+k∑j=k−mk−1γj~ej+1, (3.4)

where

 γj\coloneqqj−1∑n=k−mk−1−1αkn. (3.5)

It is worth noting that . Expanding the second term on the right hand side of (3.3), we get

 g(xk)−~xαk=k−1∑j=k−mk−1−1αkj(g(xk)−g(xj))=k∑j=k−mk−1γj~ej+1. (3.6)

Reassembling (3.3) with (3.1) and (3.6) followed by (3.1), we have

 wk+1=(1−βk−1)wαk+k∑j=k−mk−1γj~ej+1=(1−βk−1)wαk+k∑j=k−mk−1γjg′(cj)ej. (3.7)

We now take a closer look at the last term of (3.7). For each , applying (3.2) times yields

 g′(cj)ej=g′(ck)ek+(k−1∑n=jg′′(^cn+1)(cn−cn+1))ej.

Then

 k∑j=k−mk−1γjg′(cj)ej=g′(ck)k∑j=k−mk−1γjej+k−1∑j=k−mk−1γj(k−1∑n=jg′′(^cn+1)(cn−cn+1))ej. (3.8)

The next calculation shows that the sum multiplying is equal to . First observe that and . Separating the first term of the sum and using ,

 k∑j=k−mk−1γjej =xk−xk−1+k−1∑j=k−mk−1γj(xj−xj−1) =xk−xk−1+γk−1xk−1−k−2∑j=k−mk−1−1αkjxj =xk−αkk−1xk−1−k−2∑j=k−mk−1−1αkjxj=xk−xαk−1. (3.9)

From (3.1) and the decomposition of in terms of update (2.2), we have that

 k∑j=k−mk−1γjej=xk−xαk−1=(1−βk−1)xαk−1+βk−1~xαk−xαk−1=βk−1(~xαk−xαk−1)=βk−1wαk. (3.10)

Putting (3.10) together with (3.8) and (3.7) then yields

 wk+1 =((1−βk−1)I+βk−1g′(ck))wαk+k−1∑j=k−mk−1(k−1∑n=jg′′(^cn+1)(cn−cn+1))γjej. (3.11)

Based on the expansion of by (3.11) we now proceed to bound the higher order terms in the particular cases and to establish convergence of Algorithm 2.1 at an accelerated rate.

### 3.2 Relating errors to residuals

We now derive estimates to bound (in norm) the ’s from the right hand side of (3.11) by the corresponding ’s. For the bounds in this section, we require to be a contractive operator. In fact, the estimates of this section hold for contractive but not necessarily operators, see [12]. However the application of these results in the current setting requires to be , so we state the next assumption as follows.

###### Assumption 3.3.

Let be a Hilbert space and . Assume for each .

Under Assumption 3.3 it holds that as in (3.1); and, we have the inequality

 (3.12)

The next lemma establishes a bound for in terms of and in the case of depth . The subsequent lemma generalizes the same idea for general .

###### Lemma 3.1.

Under the conditions of Assumption 3.3, the following bounds hold true:

 |αjj−1|∥∥ej−1∥∥ ≤11−κ∥∥wj−1∥∥, (3.13) |αjj−2|∥∥ej−1∥∥ ≤11−κ∥∥wj∥∥. (3.14)
###### Proof.

Begin by rewriting the optimization problem (2.1) in the equivalent form

 η=argmin∥∥wj−1+η(wj−wj−1)∥∥2,

where and . The critical point then satisfies Applying Cauchy-Schwarz and triangular inequalities yields Applying (3.12) with yields the result (3.13).

Next, rewrite the optimization problem (2.1) in another equivalent form,

 γ=argmin∥∥wj−γ(wj−wj−1)∥∥2, (3.15)

where the equivalence follows with and . Following the same procedure as above yields . Applying (3.12) at level then yields the second result (3.14). ∎

The use of as the second parameter of in the proof above is not purely coincidental, as this agrees with the used in Section 3.1. The same essential technique yields the necessary bounds for . The estimate for general is given in the lemma below, with the particular estimate for given as a proposition.

As in the case above, two forms of the optimization problem are used. The -formulation is used to bound the terms that appear from the expansion (3.11); whereas, the -formulation is used to bound the terms that appear in the numerator without leading optimization coefficients. It is then of particular importance that estimates of the form have the property that is bounded away from zero. This is a reasonable assumption on the leading coefficient for each , as some nonvanishing component in the latest search direction is necessary for progress. It is also a reasonable assumption on , meaning the coefficient of the earliest search direction considered is bounded away from unity. Presumably, is a reasonable assumption to make, although this is not explicitly required (cf., [12]).

###### Lemma 3.2.

Under the conditions of Assumption 3.3, the following bounds hold true:

 |αjj−1∥∥ej−1∥∥ ≤11−κ⎛⎝|ηj−1|∥∥wj−1∥∥+j−2∑n=j−mj−1−1|αjn−1|∥wn∥⎞⎠ (3.16) |1−αjj−mj−1−1|∥∥ej−mj−1∥∥ ≤11−κ(j∑n=j−m+2|αjn−1|∥wn∥+|ηj−m+2|∥∥wj−m+1∥∥+∥∥wj−m∥∥) (3.17) |γp−1|∥ep−1∥ ≤11−κ⎛⎝p−2∑n=j−mj−1|αjn−1|∥wn∥+|γp−2|∥wp−1∥+|γp|∥wp∥ +j∑n=p+1|αjn−1|∥wn∥). (3.18)

with as in (3.19), and given below by (3.20).

###### Proof.

The optimization problem (2.1) at level is to minimize

 ∥∥ ∥∥j−1∑n=j−mj−1−1αjnwn+1∥∥ ∥∥  % subject to j−1∑n=j−mj−1−1αjn=1.

Differencing from the left and right respectively, this can be posed as the following unconstrained optimization problems:

 minimize ∥∥ ∥∥wj−mj−1+j∑n=j−mj−1+1ηn(wn−wn−1)∥∥ ∥∥2, ηn =j−1∑i=n−1αji. (3.19) minimize ∥∥ ∥∥wj−j∑n=j−mj−1+1γn−1(wn−wn−1)∥∥ ∥∥2, γn =n−1∑i=j−mj−1−1αji. (3.20)

Note that (3.20) coincides with (3.5) which agrees with the unconstrained form of the optimization problem in, for instance [4]. To help reduce notation, denote for the remainder of the proof.

Starting with estimate (3.16) we are concerned with bounding in norm the leading term difference term . Expanding the norm squared (3.19) as an inner-product and seeking the critical point for yields

 ηj∥∥wj−wj−1∥∥2+(wj−wj−1,wj−m)+j−1∑n=j−m+1ηn(wj−wj−1,wn−wn−1)=0.

Recombining the terms inside the sum, noting , and obtain

 αjj−1∥∥wj−wj−1∥∥2=(αjj−1+αjj−2)(wj−wj−1,wj−1)+j−2∑n=j−mαjn−1(wj−wj−1,wn).

Applying Cauchy-Schwarz and triangle inequalities then yields

 |αjj−1|∥∥wj−wj−1∥∥≤|αjj−1+αjj−2|∥∥wj−1∥∥+j−2∑n=j−mαjn−1∥wn∥.

Applying (3.12), the result (3.16) follows.

Following the same idea for estimate (3.17), we are now concerned with bounding in norm the final difference term . Again expanding (3.19) as an inner-product and seeking the critical point this time for yields

 ηj∥∥wj−m+1−wj−m∥∥2+(wj−m+1−wj−m,wj−m)+j∑n=j−m+2ηn(wj−m+1−wj−m,wn−wn−1)=0.

Recombining terms noting

 (1−αjj−m−1)∥∥wj−m+1−wj−m∥∥2 =j∑n=j−m+2αjn−1(wj−m+1−wj−m,wn) −(wj−m+1−wj−m,ηj−m+2wj−m+1+wj−m).

Applying Cauchy-Schwarz and triangle inequalities then yields

 |1−αjj−m−1|∥∥wj−m+1−wj−m∥∥≤(j∑n=j−m+2|αjn−1|∥wn∥)+|ηj−m+2|∥∥wj−m+1∥∥+∥∥wj−m∥∥.

The result (3.17) follows by (3.12).

Similarly for (3.2), expanding the norm of (3.20) as an inner product and seeking the critical point for each yields

 γp−1∥wp−wp−1∥2=(wp−wp−1,wj)−j∑n=j−m+1,n≠pγn−1(wp−wp−1,wn−wn−1).

Recombining the terms inside the sum using , and , we obtain

 γp−1∥wp−wp−1∥2 =p−2∑n=j−mαjn−1(wp−wp−1,wn)−γp−2(wp−wp−1,wp−1)+γp(wp−wp−1,wp) +j∑n=p+1αjn−1(wp−wp−1,wn).

Applying now Cauchy-Schwarz and triangle inequalities,

 |γp−1|∥wp−wp−1∥ =p−2∑n=j−m|αjn−1|∥wn∥+|γp−2|∥wp−1∥+|γp|∥wp∥+j∑n=p+1|αjn−1|∥wn∥.

Applying (3.12), the result (3.2) follows. ∎

For the convenience of subsequent calculations, the bounds (3.19) and (3.20) used to bound for the case of depth are summarized in the following proposition.

###### Proposition 3.4 (Depth m=2).

With depth the estimates (3.16) and (3.2) reduce to

 |αjj−1|∥∥ej−1∥∥ ≤11−κ(|αjj−1+αjj−2|∥∥wj−1∥∥+|αjj−3|∥∥wj−2∥∥) (3.21) |1−αjj−2|∥∥ej−2∥∥ ≤11−κ(|αjj−1|∥∥wj∥∥+|αjj−1|∥∥wj−1∥∥+∥∥wj−2∥∥) (3.22) |γj−1|∥∥ej−1∥∥ ≤11−κ(∥∥wj∥∥+|αjj−3|∥∥wj−1∥∥+|αjj−3|∥∥wj−2∥∥) (3.23) |γj−2|∥∥ej−2∥∥ (3.24)

The approach taken in [12] is to reduce the right hand side of (3.22) and (3.23) to two terms each by relating their expansion to that of (3.21) and (3.24), respectively. Here the terms are left as they are to emphasize the direct generality to greater depth .

### 3.3 Explicit computation of the optimization gain

The stage- gain has a simple description assuming the optimization is performed over a norm induced by an inner product , in other words in a Hilbert space setting.

Consider the unconstrained -form of the optimization problem (3.20) at iteration with depth : Find that minimize

 ∥∥ ∥∥wk+1−k∑n=k−m+1γn(wn+1−wn)∥∥ ∥∥2=∥∥wk+1−Fkγk∥∥2, (3.25)

Where is the matrix with columns , and is the corresponding vector of coefficients . Indeed, (3.25) (or equivalently reindexed) is the preferred way to state the optimization problem [16], particularly in the case where is the norm and a fast algorithm can be used.

This is also the preferred statement of the problem to understand the gain from (2.6), which satisfies Define the unique decomposition with and . Then is the least-squares residual satisfying meaning

 θk= ⎷1−∥wR∥2∥wk+1∥2, (3.26)

and, has the interpretation of the direction-sine between and the subspace spanned by . This is particularly clear in the case where by solving for the critical point of (3.15) yields

 γ=(wk+1,wk+1−wk)∥wk+1−wk∥2.

Expanding and using the particular value of above yields

 1−θ2k=(wk+1,wk+1−wk)2∥wk+1−wk∥2∥wk+1∥2,

with the clear interpretation that is the direction cosine between and , hence is the direction-sine.

If indeed an (economy) algorithm is used to solve the optimization problem then , which can be used to predict whether an accelerated step would be (sufficiently) beneficial. This explicit computation of is used in Section 5.3 to propose an adaptive damping strategy based on the gain at each step. Finally, it is noted that the improvement in the gain as is increased depends on sufficient linear indepence or small direction cosines between the columns of , as information from ealier in the history is added. This is discussed in some greater depth in [16].

## 4 Convergence rates for depths m=1 and m=2

First we put the expansion (3.11) together with the bounds (3.13)-(3.14) for a convergence proof for the simplest case of .

###### Theorem 4.1 (Convergence of the residual with depth m=1).

On satisfaction of Assumptions 3.1 and 3.3, if the coefficients remain bounded and bounded away from zero, the following bound holds for the residual from Algorithm 2.1 with depth :

###### Proof.

In this case the expansion found for in (3.11) reduces to

 wk+1=((1−βk−1)I+βk−1g′(ck))wαk+g′′(^ck)(ck−1−ck)γk−1ek−1. (4.1)

Taking norms of both sides and applying (2.6) and the triangle inequality,

 ∥wk+1∥≤θk((1−βk−1)+κβk−1)∥wk∥+^κ(∥ek∥+∥ek−1∥)γk−1∥ek−1∥. (4.2)

Notably (4.2) holds regardless of whether is a contractive operator, hence for error terms and small enough, convergence can be observed for , particularly if a damping factor is applied, and if the gain is sufficiently less than one. This justifies the observation that Anderson acceleration can enlarge the effective domain of convergence of a fixed point iteration.

For the remainder of the calculation, we consider the case of a contractive operator, meaning Assumption 3.3 is satisfied. Applying (3.14) with to the , recalling by (3.5) we have ; and, applying (3.13) with and respectively to the remaining and allows

 ∥wk+1∥ ≤θk((1−βk−1)+κβk−1|∥wk∥+^κ(1−κ)2(∥wk∥αk+1k+∥wk−1∥αkk−1)∥wk∥ =θk((1−βk−1)+κβk−1|∥wk∥+O(∥wk∥2)+O(∥wk−1∥2). (4.3)

As discussed in Section 3.2, and are each the leading coefficients in their respective optimization problems, multiplying the most recent iterate. As such, these coefficients may be reasonably considered bounded away from zero.

The case of follows similarly, combining (3.11) with (3.21)-(3.24).

###### Theorem 4.2 (Convergence of the residual with depth m=2).

On satisfaction of Assumptions 3.1 and 3.3, if the coefficients remain bounded, and and remain bounded away from zero, the following bound holds for the residual from Algorithm 2.1 with depth .

###### Proof.

For depth the residual expansion (3.11) reduces to

 wk+1 =((1−βk−1)I+βk−1g′(ck))wαk+g′′(^ck)(ck−1−ck)γk−1ek−1 +(g′