# Global Convergence Rate of Proximal Incremental Aggregated Gradient Methods

N. D. Vanli, M. Gürbüzbalaban, A. Ozdaglar
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Email: {denizcan, mertg, asuman}@mit.edu
June 29, 2019
###### Abstract

We focus on the problem of minimizing the sum of smooth component functions (where the sum is strongly convex) and a non-smooth convex function, which arises in regularized empirical risk minimization in machine learning and distributed constrained optimization in wireless sensor networks and smart grids. We consider solving this problem using the proximal incremental aggregated gradient (PIAG) method, which at each iteration moves along an aggregated gradient (formed by incrementally updating gradients of component functions according to a deterministic order) and takes a proximal step with respect to the non-smooth function. While the convergence properties of this method with randomized orders (in updating gradients of component functions) have been investigated, this paper, to the best of our knowledge, is the first study that establishes the convergence rate properties of the PIAG method for any deterministic order. In particular, we show that the PIAG algorithm is globally convergent with a linear rate provided that the step size is sufficiently small. We explicitly identify the rate of convergence and the corresponding step size to achieve this convergence rate. Our results improve upon the best known condition number dependence of the convergence rate of the incremental aggregated gradient methods used for minimizing a sum of smooth functions.

## 1 Introduction

We focus on composite additive cost optimization problems, where the objective function is given by the sum of component functions $f_i$ and a possibly non-smooth regularization function $r$:

$$\min_{x \in \mathbb{R}^n} F(x) \triangleq f(x) + r(x), \tag{1.1}$$

where $f(x) \triangleq \frac{1}{m}\sum_{i=1}^{m} f_i(x)$. We assume each component function $f_i$ is convex and continuously differentiable, while the regularization function $r$ is proper, closed, and convex but not necessarily differentiable. This formulation arises in many problems in machine learning, distributed optimization, and signal processing. Notable examples include constrained and regularized least squares problems that arise in various machine learning applications [7, 21], distributed optimization problems that arise in wireless sensor network as well as smart grid applications [19, 11], and constrained optimization of separable problems [1]. An important feature of this formulation is that the number of component functions $m$ is large, hence solving this problem using a standard gradient method that involves evaluating the full gradient of $f$, i.e., $\nabla f(x) = \frac{1}{m}\sum_{i=1}^{m} \nabla f_i(x)$, is costly. This motivates using incremental methods that exploit the additive structure of the problem and update the decision vector using one component function at a time.
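To make the setup concrete, the sketch below builds an assumed toy instance of problem (1.1): $\ell_1$-regularized least squares, with component functions $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$ and $r(x) = \lambda\|x\|_1$ (all data, names, and sizes here are illustrative choices, not from the paper).

```python
import numpy as np

# Hypothetical instance of problem (1.1): l1-regularized least squares,
# f_i(x) = (1/2)(a_i^T x - b_i)^2 and r(x) = lam * ||x||_1,
# with f the average of the f_i (matching the 1/m averaging used later for g_k).
rng = np.random.default_rng(0)
m, n = 50, 10
A = rng.standard_normal((m, n))   # rows a_i
b = rng.standard_normal(m)
lam = 0.1

def grad_f_i(x, i):
    """Gradient of the i-th smooth component function."""
    return (A[i] @ x - b[i]) * A[i]

def F(x):
    """Composite objective F(x) = f(x) + r(x)."""
    return 0.5 * np.mean((A @ x - b) ** 2) + lam * np.abs(x).sum()
```

An incremental method would evaluate only one `grad_f_i` per iteration instead of the full gradient, which is the point of the formulation above.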

While these recent advances suggest IAG as a promising approach with fast convergence rate guarantees for solving additive cost problems, in many of the applications listed above the objective function takes a composite form and includes a non-smooth regularization function (to avoid overfitting or to induce a sparse representation). Another important case of interest is smooth constrained optimization problems, which can be represented in the composite form (1.1) when the function $r$ is the indicator function of a nonempty closed convex set.

Our analysis uses function values to track the evolution of the iterates generated by the PIAG algorithm. This is in contrast with the recent analysis of the IAG algorithm provided in [12], which used distances of the iterates to the optimal solution as a Lyapunov function and relied on the smoothness of the problem to bound the gradient errors with distances. This approach does not extend to the non-smooth composite case, which motivates a new analysis using function values and the properties of the proximal operator. Since we work directly with function values, this approach also allows us to obtain iteration complexity results to achieve an $\epsilon$-optimal solution.

In terms of the algorithmic structure, our paper is related to [7], where the authors introduce the SAGA method, which extends the SAG method to the composite case and provides a linear convergence rate result with an analysis that relies on the stochastic nature of the algorithm and does not extend to the deterministic case. In particular, the SAGA method samples the component functions randomly and independently at each iteration (in contrast with the PIAG method, where the component functions are processed deterministically). However, such random sampling may not be possible for applications such as decentralized information processing in wireless sensor networks (where agents are subject to communication constraints imposed by the network topology and all agents are not necessarily connected to every other agent via a low-cost link [19]), motivating the study of the deterministic PIAG method. In [7], the authors prove that to achieve a point in the $\epsilon$-neighborhood of the optimal solution, SAGA requires $\mathcal{O}\big((m + \frac{L}{\mu})\log\frac{1}{\epsilon}\big)$ iterations. (Here, letting $x^*$ denote an optimal solution of problem (1.1), a vector $x$ is said to be in the $\epsilon$-neighborhood of an optimal solution if $\|x - x^*\| \le \epsilon$.) However, note that this result does not translate into a guarantee on the function suboptimality of the resulting point because of the lack of smoothness. Furthermore, the choice of Lyapunov function in [7] requires each $f_i$ to be convex (to satisfy the non-negativity condition), whereas we do not need this assumption in our analysis.

Our work is also related to [26], where the authors propose a related linearly convergent incrementally updated gradient method for solving the composite additive cost problem in (1.1) under a local Lipschitzian error condition (a condition satisfied by locally strongly convex functions around an optimal solution). The PIAG algorithm is different from the algorithm proposed in [26]. Specifically, for constrained optimization problems (i.e., when the regularization function $r$ is the indicator function of a nonempty closed convex set), the iterates generated by the algorithm in [26] stay in the interior of the set since that algorithm searches for a feasible update direction. The PIAG algorithm, on the other hand, applies the proximal map to the intermediate iterate obtained by moving in the opposite direction of the aggregated gradient, which operates as a projected gradient method and allows the iterates to lie on the boundary of the set. Aside from these algorithmic differences, [26] does not provide explicit rate estimates (even though the exact rate can be calculated after an elaborate analysis, the dependence on the condition number and on the window length of the outdated gradients is significantly worse than the one presented in this paper). Furthermore, the results in [26] provide a multi-step linear convergence guarantee, whereas the linear convergence results in our paper hold uniformly at each step.

Other than the papers mentioned above, our paper is also related to [4], which studies an alternative incremental aggregated proximal method and shows linear convergence when each $f_i$ and $r$ are continuously differentiable. This method forms a linear approximation to the aggregated component functions and processes the remaining component with a proximal iteration, whereas our method processes the component functions based on a gradient step. Furthermore, our linear convergence results do not require the differentiability of the objective function, in contrast to the analysis in [4].

Several recent papers in the machine learning literature (e.g., [7, 14, 8, 17, 16] and references therein) are also weakly related to our paper. In all these papers, the authors propose randomized order algorithms similar to the SAG algorithm [23] and analyze their convergence rates in expectation. In particular, in [8], the authors propose an algorithm, called Finito, which is closely related to the SAG algorithm but achieves a faster convergence rate. These ideas are then extended to composite optimization problems with non-smooth objective functions (as in (1.1)) in [17, 7]. In particular, in [17], a majorization-minimization algorithm, called MISO, is proposed to solve smooth optimization problems and its global linear convergence is shown in expectation. In [16], the ideas in [17] are extended to non-smooth optimization problems using the proximal operator. Similarly, in [14], a variance reduction technique is applied to the SGD algorithm for smooth problems and its global linear convergence in expectation is proven.

The rest of the paper is organized as follows. In Section 2, we introduce the PIAG algorithm. In Section 3, we first state our assumptions on the objective function and then prove the global linear convergence of the proposed algorithm under these assumptions. We conclude the paper in Section 4 with a summary of our results.

## 2 The PIAG Algorithm

Similar to the IAG method, at each iteration $k \ge 0$, we form an aggregated gradient, which we denote as follows:

$$g_k \triangleq \frac{1}{m}\sum_{i=1}^{m} \nabla f_i\big(x_{\tau_{i,k}}\big),$$

where $\nabla f_i(x_{\tau_{i,k}})$ represents the gradient of the $i$th component function sampled at time $\tau_{i,k}$. We assume that each component function is sampled at least once in the past $K$ iterations, i.e., we have

$$k - K \le \tau_{i,k} \le k, \qquad \forall i \in \{1,\dots,m\}.$$

This condition is typically satisfied in practical implementations of deterministic incremental methods. For instance, if the functions are processed in a cyclic order, each stored gradient is at most $m$ iterations old [26, 13]. On the other hand, $K = 0$ corresponds to the case where we have the full gradient of the function at each iteration (i.e., $g_k = \nabla f(x_k)$), and a small $K > 0$ may represent a setting in which the gradients of the component functions are sent to a processor with some delay upper bounded by $K$.
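As an illustration of this bounded-staleness condition, the following sketch simulates an assumed single-processor cyclic schedule and tracks how old the stored gradient information gets; after the first full pass, each stored gradient is at most $m-1$ iterations old.

```python
# Simulate cyclic processing of m components: at iteration k the gradient of
# component i_k = k mod m is refreshed; tau[i] records when each component's
# gradient was last evaluated, so k - min(tau) is the worst-case staleness.
m = 5
tau = [0] * m
max_staleness = 0
for k in range(100):
    tau[k % m] = k              # refresh one component gradient per iteration
    if k >= m:                  # after the first full pass over the components
        max_staleness = max(max_staleness, k - min(tau))

# Under cyclic order, staleness settles at exactly m - 1 iterations,
# so the condition k - K <= tau_{i,k} <= k holds with K = m - 1 (or any K >= m - 1).
assert max_staleness == m - 1
```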

Since the regularization function $r$ is not necessarily differentiable, we propose to solve (1.1) with the proximal incremental aggregated gradient (PIAG) method, which applies the proximal operator with respect to the regularization function to the intermediate iterate obtained using the aggregated gradient. In particular, the PIAG algorithm, at each iteration $k$, performs the update

$$x_{k+1} = \mathrm{prox}_{\eta r}\big(x_k - \eta g_k\big), \tag{2.1}$$

where $\eta > 0$ is a constant step size and the proximal mapping is defined as follows:

$$\mathrm{prox}_{\eta r}(y) \triangleq \operatorname*{arg\,min}_{x \in \mathbb{R}^n}\Big\{\frac{1}{2}\|x - y\|^2 + \eta\, r(x)\Big\}. \tag{2.2}$$
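For concreteness, when $r(x) = \lambda\|x\|_1$ the minimization in (2.2) separates across coordinates and has the well-known closed-form soft-thresholding solution; a minimal sketch (the vector `y` below is an arbitrary example):

```python
import numpy as np

def prox_l1(y, eta, lam):
    """prox_{eta*r}(y) for r(x) = lam * ||x||_1: soft thresholding at eta*lam."""
    return np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)

y = np.array([1.5, -0.2, 0.7])
x = prox_l1(y, eta=1.0, lam=0.5)
# each coordinate is shrunk toward zero by eta*lam = 0.5 and clipped at zero
```

Coordinates whose magnitude falls below $\eta\lambda$ are set exactly to zero, which is how the proximal step induces sparsity.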

Here, we define $d_k \triangleq \frac{1}{\eta}(x_{k+1} - x_k)$ and let $\partial r(x)$ denote the set of subgradients of the function $r$ at $x$. Then, it follows from the optimality conditions [3] of the problem in (2.2) that $x_k - \eta g_k - x_{k+1} \in \eta\, \partial r(x_{k+1})$. This yields $d_k = -(g_k + h_{k+1})$, for some $h_{k+1} \in \partial r(x_{k+1})$. Hence, we can compactly represent our update rule as

$$x_{k+1} = x_k + \eta d_k, \tag{2.3}$$

where $d_k = -(g_k + h_{k+1})$ is the direction of the update at time $k$.
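Putting (2.1) together with the incremental gradient table, the following is a minimal sketch of PIAG for the hypothetical case $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$, $r(x) = \lambda\|x\|_1$, with a cyclic processing order (the function name, data layout, and order are illustrative assumptions, not prescribed by the paper):

```python
import numpy as np

def piag(A, b, lam, eta, iters):
    """PIAG sketch for f_i(x) = (1/2)(a_i^T x - b_i)^2, r(x) = lam*||x||_1."""
    m, n = A.shape
    x = np.zeros(n)
    grads = (A @ x - b)[:, None] * A          # stored gradients, grads[i] = grad f_i(x_0)
    g = grads.mean(axis=0)                    # aggregated gradient g_k
    for k in range(iters):
        i = k % m                             # deterministic cyclic order
        new_grad = (A[i] @ x - b[i]) * A[i]   # refresh one component gradient
        g += (new_grad - grads[i]) / m        # incremental update of g_k
        grads[i] = new_grad
        y = x - eta * g                       # step along the aggregated gradient
        x = np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)  # prox step (2.1)
    return x
```

Note that the aggregated gradient is maintained in $O(n)$ time per iteration by subtracting the stale component gradient and adding the fresh one, rather than re-summing all $m$ gradients.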

## 3 Convergence Analysis

### 3.1 Assumptions

Throughout the paper, we make the following standard assumptions.

###### Assumption 3.1

(Lipschitz gradients) Each $f_i$ has Lipschitz continuous gradients on $\mathbb{R}^n$ with some constant $L_i > 0$, i.e.,

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i\, \|x - y\|,$$

for any $x, y \in \mathbb{R}^n$. (If a function has Lipschitz continuous gradients with some constant $L$, then it is called $L$-smooth; we use these terms interchangeably.)

Defining $L \triangleq \frac{1}{m}\sum_{i=1}^{m} L_i$, we observe that Assumption 3.1 and the triangle inequality yield

$$\|\nabla f(x) - \nabla f(y)\| \le \frac{1}{m}\sum_{i=1}^{m}\big\|\nabla f_i(x) - \nabla f_i(y)\big\| \le L\, \|x - y\|,$$

for any $x, y \in \mathbb{R}^n$, i.e., the function $f$ is $L$-smooth.

###### Assumption 3.2

(Strong Convexity) The sum function $f$ is $\mu$-strongly convex on $\mathbb{R}^n$ for some $\mu > 0$, i.e., the function $f(x) - \frac{\mu}{2}\|x\|^2$ is convex, and the regularization function $r$ is convex on $\mathbb{R}^n$.

A consequence of Assumption 3.2 is that $F$ is strongly convex, hence there exists a unique optimal solution of problem (1.1) [20, Lemma 6], which we denote by $x^*$.

We emphasize that these assumptions hold for a variety of cost functions including regularized squared error loss, hinge loss, and logistic loss [6] and similar assumptions are widely used to analyze the convergence properties of incremental gradient methods in the literature [7, 4, 2, 23, 12]. Note that in contrast with many of these analyses, we do not assume that the component functions are convex.

### 3.2 Rate of Convergence

In this section, we show that the PIAG algorithm attains a global linear convergence rate with a constant step size provided that the step size is sufficiently small. We define

$$F_k \triangleq F(x_k) - F(x^*), \tag{3.1}$$

which is the suboptimality in the objective value at iteration $k$. In our analysis, we will use $F_k$ as a Lyapunov function to prove global linear convergence. Before providing the main theorems of the paper, we first introduce three lemmas that contain key relations used in proving these theorems.

The first lemma investigates how the suboptimality in the objective value evolves over the iterations. In particular, it shows that the change in suboptimality can be bounded by a sum of two terms: the first term is negative and has a linear dependence on the step size $\eta$, whereas the second term is positive and has a quadratic dependence on $\eta$. This suggests that if the step size is small enough, the term that is linear in $\eta$ will be dominant, guaranteeing a descent in suboptimality.

###### Lemma 3.3

Suppose that Assumptions 3.1 and 3.2 hold. Then, the PIAG algorithm in (2.3) yields the following guarantee:

$$F_{k+1} \le F_k - \frac{\eta}{2}\|d_k\|^2 + \frac{\eta^2 L}{2}\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2,$$

for any step size $\eta \le \frac{1}{(K+1)L}$, where $(x)^+ \triangleq \max\{x, 0\}$.

Proof. We first consider the difference of the errors in consecutive time instances and write

$$F(x_{k+1}) - F(x_k) = f(x_{k+1}) - f(x_k) + r(x_{k+1}) - r(x_k) \le \big\langle \nabla f(x_k),\, x_{k+1} - x_k \big\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 + r(x_{k+1}) - r(x_k),$$

where the inequality follows from the $L$-smoothness of $f$ (Assumption 3.1 together with the triangle inequality), which bounds $f$ by its first-order Taylor expansion plus the quadratic term $\frac{L}{2}\|x_{k+1} - x_k\|^2$. Using the update rule (2.3) in this inequality, we obtain

$$\begin{aligned}
F(x_{k+1}) - F(x_k) &= \eta\big\langle \nabla f(x_k) - g_k,\, d_k\big\rangle + \frac{\eta^2 L}{2}\|d_k\|^2 + \eta\langle g_k, d_k\rangle + r(x_{k+1}) - r(x_k) \\
&\le \eta\|\nabla f(x_k) - g_k\|\,\|d_k\| + \frac{\eta^2 L}{2}\|d_k\|^2 - \eta\|d_k\|^2 - \eta\langle h_{k+1}, d_k\rangle + r(x_{k+1}) - r(x_k) \\
&= \eta\|\nabla f(x_k) - g_k\|\,\|d_k\| + \eta\Big(\frac{\eta L}{2} - 1\Big)\|d_k\|^2 + \langle h_{k+1},\, x_k - x_{k+1}\rangle + r(x_{k+1}) - r(x_k) \\
&\le \eta\|\nabla f(x_k) - g_k\|\,\|d_k\| + \eta\Big(\frac{\eta L}{2} - 1\Big)\|d_k\|^2,
\end{aligned} \tag{3.2}$$

where the first inequality follows from the Cauchy–Schwarz inequality and the relation $g_k = -d_k - h_{k+1}$, and the last inequality follows from the convexity of $r$ (which implies $r(x_{k+1}) - r(x_k) \le \langle h_{k+1},\, x_{k+1} - x_k\rangle$).

The gradient error term in (3.2), i.e., $\|\nabla f(x_k) - g_k\|$, can be upper bounded as follows:

$$\begin{aligned}
\|\nabla f(x_k) - g_k\| &\le \frac{1}{m}\sum_{i=1}^{m}\big\|\nabla f_i(x_k) - \nabla f_i(x_{\tau_{i,k}})\big\| \le \frac{1}{m}\sum_{i=1}^{m} L_i\,\big\|x_k - x_{\tau_{i,k}}\big\| \\
&\le \frac{1}{m}\sum_{i=1}^{m} L_i \sum_{j=\tau_{i,k}}^{k-1}\eta\,\|d_j\| \le \eta L \sum_{j=(k-K)^+}^{k-1}\|d_j\|,
\end{aligned} \tag{3.3}$$

where the first and third inequalities follow by the triangle inequality, the second inequality follows since each $f_i$ is $L_i$-smooth, and the last inequality follows since $\tau_{i,k} \ge k - K$ and $L = \frac{1}{m}\sum_{i=1}^{m} L_i$. Using (3.3), we can upper bound (3.2) as follows:

$$\begin{aligned}
F(x_{k+1}) - F(x_k) &\le \eta\Big(\frac{\eta L}{2} - 1\Big)\|d_k\|^2 + \eta^2 L \sum_{j=(k-K)^+}^{k-1}\|d_j\|\,\|d_k\| \\
&\le \eta\Big(\frac{\eta L(K+1)}{2} - 1\Big)\|d_k\|^2 + \frac{\eta^2 L}{2}\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2 \\
&\le -\frac{\eta}{2}\|d_k\|^2 + \frac{\eta^2 L}{2}\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2,
\end{aligned} \tag{3.4}$$

where the second inequality follows from the arithmetic–geometric mean inequality, i.e., $\|d_j\|\,\|d_k\| \le \frac{1}{2}\big(\|d_j\|^2 + \|d_k\|^2\big)$, applied to each of the at most $K$ summands, and the last inequality follows since $\eta \le \frac{1}{(K+1)L}$. This concludes the proof of Lemma 3.3.
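The descent inequality of Lemma 3.3 can also be checked numerically. The sketch below runs PIAG on an assumed small least-squares instance with $r = 0$ (so the prox step is the identity) under the step-size condition $\eta \le \frac{1}{(K+1)L}$, and asserts the bound along the trajectory; the instance is an illustrative choice, not from the paper.

```python
import numpy as np

# Check F(x_{k+1}) <= F(x_k) - (eta/2)||d_k||^2 + (eta^2 L/2) * sum of last K ||d_j||^2
# (equivalent to Lemma 3.3, since F(x*) cancels on both sides).
rng = np.random.default_rng(2)
m, n = 10, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
L = np.mean(np.sum(A ** 2, axis=1))  # L = (1/m) sum_i L_i, with L_i = ||a_i||^2
K = m                                # cyclic order: staleness <= m - 1 <= K
eta = 1.0 / (L * (K + 1))

def F(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

x = np.zeros(n)
grads = (A @ x - b)[:, None] * A
g = grads.mean(axis=0)
d_hist = []
for k in range(200):
    i = k % m
    new = (A[i] @ x - b[i]) * A[i]
    g += (new - grads[i]) / m
    grads[i] = new
    x_next = x - eta * g             # prox of r = 0 is the identity
    d = (x_next - x) / eta
    lhs = F(x_next)
    tail = sum(np.dot(dj, dj) for dj in d_hist[-K:])
    rhs = F(x) - 0.5 * eta * np.dot(d, d) + 0.5 * eta ** 2 * L * tail
    assert lhs <= rhs + 1e-12        # Lemma 3.3 holds along the trajectory
    d_hist.append(d)
    x = x_next
```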

We next introduce the following lemma, which can be viewed as an extension of [25, Theorem 4] to our framework with aggregated gradients. We provide a simplified proof compared to [25] with a tighter upper bound. This lemma can be interpreted as follows. When the regularization function is zero (i.e., $r(x) = 0$ for all $x$) and we have access to full gradients (i.e., $K = 0$), this lemma simply follows from the strong convexity of the sum function $f$, since $d_k = -\nabla f(x_k)$ and $\nabla f(x^*) = 0$ due to the optimality condition of the problem. The following lemma indicates that even though we do not have such control over the subgradients of the regularization function (as the regularization function is neither strongly convex nor smooth), the properties of the proximal step yield a similar relation at the expense of a constant of $\frac{2}{\mu}$ (instead of $\frac{1}{\mu}$ in the $r = 0$ case) and certain history-dependent terms (that arise due to the incremental nature of the PIAG algorithm) that have a linear dependence on the step size $\eta$. This lemma will be a key step in the proof of Lemma 3.5, where we illustrate how the descent term in Lemma 3.3 relates to our Lyapunov function.
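For the smooth, unregularized special case mentioned above ($r \equiv 0$ and $K = 0$), the one-line argument can be spelled out explicitly:

```latex
% r = 0 and K = 0 imply d_k = -g_k = -\nabla f(x_k), while \nabla f(x^*) = 0.
% Strong convexity and Cauchy--Schwarz then give the bound with constant 1/\mu:
\mu \|x_k - x^*\|^2
  \le \langle x_k - x^*,\, \nabla f(x_k) - \nabla f(x^*) \rangle
  \le \|x_k - x^*\| \, \|\nabla f(x_k)\|
\quad \Longrightarrow \quad
\|x_k - x^*\| \le \frac{1}{\mu}\,\|d_k\|.
```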

###### Lemma 3.4

Suppose that Assumptions 3.1 and 3.2 hold, and let $Q \triangleq \frac{L}{\mu}$ denote the condition number of the problem. Then, the distance of the iterates from the optimal solution is upper bounded as

$$\|x_k - x^*\| \le \frac{2}{\mu}\|d_k\| + 2\eta Q \sum_{j=(k-K)^+}^{k-1}\|d_j\|,$$

for any $k \ge 0$ and any step size $\eta \le \frac{1}{L}$.

Proof. Define

$$d_k' \triangleq \frac{1}{\eta}\Big(\mathrm{prox}_{\eta r}\big(x_k - \eta \nabla f(x_k)\big) - x_k\Big)$$

as the direction of update with the full gradient. The firm non-expansiveness property of the proximal map implies

$$\big\|\mathrm{prox}_{\eta r}(x) - \mathrm{prox}_{\eta r}(y)\big\|^2 \le \big\langle \mathrm{prox}_{\eta r}(x) - \mathrm{prox}_{\eta r}(y),\; x - y \big\rangle.$$

Putting $x = x_k - \eta\nabla f(x_k)$ and $y = x^* - \eta\nabla f(x^*)$ in the above inequality, and using the fact that $x^* = \mathrm{prox}_{\eta r}\big(x^* - \eta\nabla f(x^*)\big)$ (as $x^*$ is a fixed point of the proximal gradient update), we obtain

$$\begin{aligned}
\|x_k + \eta d_k' - x^*\|^2 &\le \big\langle x_k + \eta d_k' - x^*,\; x_k - \eta\nabla f(x_k) - x^* + \eta\nabla f(x^*) \big\rangle \\
&= \big\langle x_k + \eta d_k' - x^*,\; x_k + \eta d_k' - x^* \big\rangle + \big\langle x_k + \eta d_k' - x^*,\; -\eta d_k' + \eta\nabla f(x^*) - \eta\nabla f(x_k) \big\rangle,
\end{aligned}$$

which implies

$$0 \le \big\langle x_k + \eta d_k' - x^*,\; -d_k' + \nabla f(x^*) - \nabla f(x_k) \big\rangle.$$

This inequality can be rewritten as follows

$$\begin{aligned}
\big\langle x_k - x^*,\; \nabla f(x_k) - \nabla f(x^*) \big\rangle &\le \big\langle x_k - x^*,\; -d_k' \big\rangle - \eta\|d_k'\|^2 + \eta\big\langle d_k',\; \nabla f(x^*) - \nabla f(x_k) \big\rangle \\
&\le \big\langle x_k - x^*,\; -d_k' \big\rangle + \eta\big\langle d_k',\; \nabla f(x^*) - \nabla f(x_k) \big\rangle \\
&\le \|d_k'\|\,\big(\|x_k - x^*\| + \eta\|\nabla f(x^*) - \nabla f(x_k)\|\big) \\
&\le \|d_k'\|\,\big(\|x_k - x^*\| + \eta L\|x_k - x^*\|\big) \\
&\le 2\,\|d_k'\|\,\|x_k - x^*\|,
\end{aligned} \tag{3.5}$$

where the second inequality follows since $-\eta\|d_k'\|^2 \le 0$, the third inequality follows by the Cauchy–Schwarz inequality, the fourth inequality follows from the $L$-smoothness of $f$, and the last inequality follows since $\eta L \le 1$. Since the $\mu$-strong convexity of $f$ implies

$$\mu\|x_k - x^*\|^2 \le \big\langle x_k - x^*,\; \nabla f(x_k) - \nabla f(x^*) \big\rangle, \tag{3.6}$$

then combining (3.5) and (3.6), we obtain

$$\|x_k - x^*\| \le \frac{2}{\mu}\|d_k'\|. \tag{3.7}$$

In order to relate $d_k'$ to the original direction of update $d_k$, we use the triangle inequality and write

$$\begin{aligned}
\|d_k'\| &\le \|d_k\| + \|d_k' - d_k\| = \|d_k\| + \frac{1}{\eta}\big\|x_k + \eta d_k' - x_k - \eta d_k\big\| \\
&= \|d_k\| + \frac{1}{\eta}\Big\|\mathrm{prox}_{\eta r}\big(x_k - \eta\nabla f(x_k)\big) - \mathrm{prox}_{\eta r}\big(x_k - \eta g_k\big)\Big\| \\
&\le \|d_k\| + \|g_k - \nabla f(x_k)\| \le \|d_k\| + \eta L \sum_{j=(k-K)^+}^{k-1}\|d_j\|,
\end{aligned} \tag{3.8}$$

where the second-to-last inequality follows from the non-expansiveness of the proximal map, and the last line follows by equation (3.3). Putting (3.8) back into (3.7) concludes the proof of Lemma 3.4.

In the following lemma, we relate the direction of update $d_k$ to the suboptimality in the objective value at a given iteration $k$. In particular, we show that the descent term appearing in Lemma 3.3 (i.e., $-\|d_k\|^2$) can be upper bounded by the negative of the suboptimality in the objective value at the next iteration (i.e., $-F_{k+1}$) plus additional history-dependent terms that arise due to the incremental nature of the PIAG algorithm.

###### Lemma 3.5

Suppose that Assumptions 3.1 and 3.2 hold. Then, for any step size $\eta \le \frac{1}{(K+1)L}$, the PIAG algorithm in (2.3) yields the following guarantee:

$$-\|d_k\|^2 \le -\frac{\mu}{4}F_{k+1} + \eta L \sum_{j=(k-K)^+}^{k-1}\|d_j\|^2.$$

Proof. In order to prove this lemma, we use Lemma 3.4, which (after multiplying both sides by $\frac{\mu}{2}$ and using $\mu Q = L$) can be rewritten as follows:

$$-\|d_k\| \le -\frac{\mu}{2}\|x_k - x^*\| + \eta L \sum_{j=(k-K)^+}^{k-1}\|d_j\|.$$

Then, we can upper bound $-\|d_k\|^2$ as

$$\begin{aligned}
-\|d_k\|^2 &\le -\frac{\mu}{2}\|d_k\|\,\|x_k - x^*\| + \eta L\sum_{j=(k-K)^+}^{k-1}\|d_k\|\,\|d_j\| \\
&\le -\frac{\mu}{2}\big\langle d_k,\; x^* - x_k \big\rangle + \frac{\eta K L}{2}\|d_k\|^2 + \frac{\eta L}{2}\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2,
\end{aligned} \tag{3.9}$$

where the last line follows by the Cauchy–Schwarz inequality and the arithmetic–geometric mean inequality. We can upper bound the inner product term in (3.9) as

$$\begin{aligned}
-\big\langle d_k,\, x^* - x_k \big\rangle &= \big\langle g_k + h_{k+1},\; x^* - x_k \big\rangle \\
&= \big\langle \nabla f(x_k),\, x^* - x_k \big\rangle + \big\langle h_{k+1},\, x^* - x_k \big\rangle + \big\langle g_k - \nabla f(x_k),\, x^* - x_k \big\rangle \\
&\le f(x^*) - f(x_k) + \big\langle h_{k+1},\, x^* - x_{k+1} \big\rangle + \eta\big\langle h_{k+1},\, d_k \big\rangle + \big\langle g_k - \nabla f(x_k),\, x^* - x_k \big\rangle \\
&\le f(x^*) - f(x_k) + r(x^*) - r(x_{k+1}) + \eta\big\langle h_{k+1},\, d_k \big\rangle + \|g_k - \nabla f(x_k)\|\,\|x^* - x_k\|,
\end{aligned} \tag{3.10}$$

where the first inequality follows from the convexity of $f$, and the second inequality follows from the convexity of $r$ and the Cauchy–Schwarz inequality. The inner product term in (3.10) can be upper bounded as

$$\begin{aligned}
\eta\big\langle h_{k+1},\, d_k \big\rangle &= -\eta\|d_k\|^2 - \big\langle g_k,\, \eta d_k \big\rangle \\
&= -\eta\|d_k\|^2 + \big\langle \nabla f(x_k),\, -\eta d_k \big\rangle + \big\langle g_k - \nabla f(x_k),\, -\eta d_k \big\rangle \\
&\le -\eta\|d_k\|^2 + \big\langle \nabla f(x_k),\, x_k - x_{k+1} \big\rangle + \eta\|d_k\|\,\|g_k - \nabla f(x_k)\| \\
&\le -\eta\|d_k\|^2 + f(x_k) - f(x_{k+1}) + \frac{\eta^2 L}{2}\|d_k\|^2 + \eta\|d_k\|\,\|g_k - \nabla f(x_k)\|,
\end{aligned} \tag{3.11}$$

where the first inequality follows by the Cauchy–Schwarz inequality and the second inequality follows from the $L$-smoothness of $f$. Putting (3.11) back in (3.10), we obtain

$$-\big\langle d_k,\, x^* - x_k \big\rangle \le -F_{k+1} + \eta\Big(\frac{\eta L}{2} - 1\Big)\|d_k\|^2 + \|g_k - \nabla f(x_k)\|\Big(\|x^* - x_k\| + \eta\|d_k\|\Big). \tag{3.12}$$

The final term in (3.12) can be upper bounded as follows:

$$\begin{aligned}
\|g_k - \nabla f(x_k)\|\Big(\|x^* - x_k\| + \eta\|d_k\|\Big) &\le \eta L\Bigg(\sum_{j=(k-K)^+}^{k-1}\|d_j\|\Bigg)\Big(\|x^* - x_k\| + \eta\|d_k\|\Big) \\
&\le \eta L\Bigg(\sum_{j=(k-K)^+}^{k-1}\|d_j\|\Bigg)\Bigg[\Big(\eta + \frac{2}{\mu}\Big)\|d_k\| + 2\eta Q\sum_{j=(k-K)^+}^{k-1}\|d_j\|\Bigg],
\end{aligned}$$

where the first line follows by equation (3.3) and the last line follows by Lemma 3.4. Using the arithmetic–geometric mean inequality in the above inequality, we obtain

$$\|g_k - \nabla f(x_k)\|\Big(\|x^* - x_k\| + \eta\|d_k\|\Big) \le \frac{\eta K L}{2}\Big(\eta + \frac{2}{\mu}\Big)\|d_k\|^2 + \eta\Big[\frac{\eta L}{2} + Q + 2\eta K Q L\Big]\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2. \tag{3.13}$$

Putting (3.13) back in (3.12) yields

$$\begin{aligned}
-\big\langle d_k,\, x^* - x_k \big\rangle &\le -F_{k+1} + \eta\Big(\frac{\eta L}{2} - 1 + \frac{K L}{2}\Big(\eta + \frac{2}{\mu}\Big)\Big)\|d_k\|^2 + \eta\Big[\frac{\eta L}{2} + Q + 2\eta K Q L\Big]\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2 \\
&= -F_{k+1} + \eta\Big(\frac{\eta(K+1)L}{2} - 1 + KQ\Big)\|d_k\|^2 + \eta\Big[\frac{\eta L}{2} + Q + 2\eta K Q L\Big]\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2 \\
&\le -F_{k+1} + \eta\Big(KQ - \frac{1}{2}\Big)\|d_k\|^2 + \eta\Big(\frac{\eta L}{2} + Q + 2\eta K Q L\Big)\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2,
\end{aligned} \tag{3.14}$$

where the last line follows since $\eta \le \frac{1}{(K+1)L}$. Finally, using (3.14) in our original inequality in (3.9), we obtain

$$\begin{aligned}
-\|d_k\|^2 &\le -\frac{\mu}{2}F_{k+1} + \eta\Big(\frac{KL}{2} - \frac{\mu}{4} + \frac{KL}{2}\Big)\|d_k\|^2 + \eta\Big(\frac{\eta\mu L}{4} + \frac{L}{2} + \eta K L^2 + \frac{L}{2}\Big)\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2 \\
&\le -\frac{\mu}{2}F_{k+1} + \eta K L\,\|d_k\|^2 + \eta L\Big(\frac{\eta\mu}{4} + \eta K L + 1\Big)\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2 \\
&\le -\frac{\mu}{2}F_{k+1} + \eta K L\,\|d_k\|^2 + \eta L\big(\eta(K+1)L + 1\big)\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2 \\
&\le -\frac{\mu}{2}F_{k+1} + \|d_k\|^2 + 2\eta L\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2,
\end{aligned} \tag{3.15}$$

where the second inequality follows since $\mu > 0$ (dropping the $-\frac{\eta\mu}{4}\|d_k\|^2$ term), the third inequality follows since $\mu \le L$, and the last inequality follows since $\eta \le \frac{1}{(K+1)L}$ (so that $\eta K L \le 1$ and $\eta(K+1)L + 1 \le 2$). Rearranging the terms in (3.15), we obtain

$$-\|d_k\|^2 \le -\frac{\mu}{4}F_{k+1} + \eta L\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2, \tag{3.16}$$

which completes the proof of Lemma 3.5.

In the following theorem, we derive a recursive inequality to upper bound the suboptimality at iteration $k+1$ in terms of the suboptimality in the previous $aK$ iterations (where $a$ is a positive integer that measures the length of the history considered) and an additive remainder term. We will later show in Corollary 3.7 that this remainder term can also be upper bounded in terms of the suboptimality observed in previous iterations.

###### Theorem 3.6

Suppose that Assumptions 3.1 and 3.2 hold. Then, the PIAG algorithm with step size $\eta \le \frac{1}{(K+1)L}$ yields the following recursion:

$$\Big(1 + \frac{\eta\mu}{8}\Big)F_{k+1} \le \Big(1 - \sum_{i=1}^{a-1}\epsilon_i\Big)F_k + \sum_{i=1}^{a-1}\epsilon_i F_{(k-iK)^+} + \frac{\eta^2 K L}{2}\,\epsilon_{a-1}\sum_{j=(k-aK)^+}^{k-1}\|d_j\|^2, \tag{3.17}$$

for any $k \ge 0$ and any integer $a \ge 2$, where $\epsilon_i \triangleq 2\eta L\,(\eta K L)^{i-1}$.

Proof. We use induction on the constant $a$ to prove this theorem. For $a = 2$, the recursion can be obtained as follows. Using Lemma 3.5 in Lemma 3.3 and rearranging terms, we get

$$\Big(1 + \frac{\eta\mu}{8}\Big)F_{k+1} \le F_k + \eta^2 L\sum_{j=(k-K)^+}^{k-1}\|d_j\|^2. \tag{3.18}$$

Rearranging terms in Lemma 3.3, we obtain

$$\|d_k\|^2 \le \frac{2}{\eta}\big(F_k - F_{k+1}\big) + \eta L \sum_{j=(k-K)^+}^{k-1}\|d_j\|^2. \tag{3.19}$$

Putting (3.19) back in (3.18), we get

$$\begin{aligned}
\Big(1 + \frac{\eta\mu}{8}\Big)F_{k+1} &\le F_k + \eta^2 L\sum_{j=(k-K)^+}^{k-1}\Bigg(\frac{2}{\eta}\big(F_j - F_{j+1}\big) + \eta L\sum_{i=(j-K)^+}^{j-1}\|d_i\|^2\Bigg) \\
&\le F_k + 2\eta L\big(F_{(k-K)^+} - F_k\big) + \eta^3 L^2\sum_{j=(k-K)^+}^{k-1}\,\sum_{i=(j-K)^+}^{j-1}\|d_i\|^2 \\
&\le F_k + 2\eta L\big(F_{(k-K)^+} - F_k\big) + \eta^3 K L^2\sum_{j=(k-2K)^+}^{k-1}\|d_j\|^2. \tag{3.20}
\end{aligned}$$

Since $\epsilon_1 = 2\eta L$, (3.20) can be rewritten as follows:

$$\Big(1 + \frac{\eta\mu}{8}\Big)F_{k+1} \le (1 - \epsilon_1)F_k + \epsilon_1 F_{(k-K)^+} + \frac{\eta^2 K L}{2}\epsilon_1\sum_{j=(k-2K)^+}^{k-1}\|d_j\|^2, \tag{3.21}$$

showing (3.17) for $a = 2$. As a part of the induction procedure, we then assume that (3.17) holds for some arbitrary $a \ge 2$, which amounts to

$$\Big(1 + \frac{\eta\mu}{8}\Big)F_{k+1} \le \Big(1 - \sum_{i=1}^{a-1}\epsilon_i\Big)F_k + \sum_{i=1}^{a-1}\epsilon_i F_{(k-iK)^+} + \frac{\eta^2 K L}{2}\,\epsilon_{a-1}\sum_{j=(k-aK)^+}^{k-1}\|d_j\|^2.$$

Using (3.19) in the above inequality, we obtain

$$\begin{aligned}
\Big(1 + \frac{\eta\mu}{8}\Big)F_{k+1} &\le \Big(1 - \sum_{i=1}^{a-1}\epsilon_i\Big)F_k + \sum_{i=1}^{a-1}\epsilon_i F_{(k-iK)^+} + \frac{\eta^2 K L}{2}\,\epsilon_{a-1}\sum_{j=(k-aK)^+}^{k-1}\Bigg(\frac{2}{\eta}\big(F_j - F_{j+1}\big) + \eta L\sum_{\ell=(j-K)^+}^{j-1}\|d_\ell\|^2\Bigg) \\
&= \Big(1 - \sum_{i=1}^{a-1}\epsilon_i\Big)F_k + \sum_{i=1}^{a-1}\epsilon_i F_{(k-iK)^+} + \epsilon_a\big(F_{(k-aK)^+} - F_k\big) + \frac{\eta^2 L}{2}\,\epsilon_a\sum_{j=(k-aK)^+}^{k-1}\,\sum_{\ell=(j-K)^+}^{j-1}\|d_\ell\|^2 \\
&\le \Big(1 - \sum_{i=1}^{a}\epsilon_i\Big)F_k + \sum_{i=1}^{a}\epsilon_i F_{(k-iK)^+} + \frac{\eta^2 K L}{2}\,\epsilon_a\sum_{j=(k-(a+1)K)^+}^{k-1}\|d_j\|^2.
\end{aligned}$$

Therefore, (3.17) holds for $a + 1$ as well, which concludes the proof of Theorem 3.6.
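As a numerical illustration of the linear convergence behavior that these recursions lead to, the sketch below runs PIAG on an assumed small strongly convex $\ell_1$-regularized least-squares instance (the data, sizes, and step size are illustrative choices) and checks that the objective values keep decreasing under a step size on the order of $\frac{1}{(K+1)L}$.

```python
import numpy as np

# Toy convergence check (not from the paper): PIAG on l1-regularized least
# squares with cyclic order; objective values should decrease toward F(x*).
rng = np.random.default_rng(0)
m, n = 20, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam = 0.05
L = np.mean(np.sum(A ** 2, axis=1))   # L = (1/m) sum_i L_i, L_i = ||a_i||^2
eta = 1.0 / ((m + 1) * L)             # step with K = m (cyclic staleness <= m - 1)

def F(x):
    return 0.5 * np.mean((A @ x - b) ** 2) + lam * np.abs(x).sum()

x = np.zeros(n)
grads = (A @ x - b)[:, None] * A      # stored component gradients
g = grads.mean(axis=0)                # aggregated gradient
vals = []
for k in range(5000):
    i = k % m
    new = (A[i] @ x - b[i]) * A[i]
    g += (new - grads[i]) / m
    grads[i] = new
    y = x - eta * g
    x = np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)   # prox step
    vals.append(F(x))

assert vals[-1] <= vals[0]            # overall decrease of the objective
assert vals[-1] <= min(vals[:100])    # still below all early iterates
```

In practice the observed decay is typically much faster than the worst-case rate certified by the analysis.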

###### Corollary 3.7

Suppose that Assumptions 3.1 and 3.2 hold. Then, the PIAG algorithm with step size $\eta \le \frac{1}{(K+1)L}$ yields the following recursion:

 (1+ημ8)Fk+1≤(1−a−1∑