Global Convergence Rate of Proximal Incremental Aggregated Gradient Methods
We focus on the problem of minimizing the sum of smooth component functions (where the sum is strongly convex) and a non-smooth convex function, which arises in regularized empirical risk minimization in machine learning and distributed constrained optimization in wireless sensor networks and smart grids. We consider solving this problem using the proximal incremental aggregated gradient (PIAG) method, which at each iteration moves along an aggregated gradient (formed by incrementally updating gradients of component functions according to a deterministic order) and taking a proximal step with respect to the non-smooth function. While the convergence properties of this method with randomized orders (in updating gradients of component functions) have been investigated, this paper, to the best of our knowledge, is the first study that establishes the convergence rate properties of the PIAG method for any deterministic order. In particular, we show that the PIAG algorithm is globally convergent with a linear rate provided that the step size is sufficiently small. We explicitly identify the rate of convergence and the corresponding step size to achieve this convergence rate. Our results improve upon the best known condition number dependence of the convergence rate of the incremental aggregated gradient methods used for minimizing a sum of smooth functions.
We focus on composite additive cost optimization problems, where the objective function is given by the sum of component functions and a possibly non-smooth regularization function :
and . We assume each component function is convex and continuously differentiable while the regularization function is proper, closed, and convex but not necessarily differentiable. This formulation arises in many problems in machine learning, distributed optimization, and signal processing. Notable examples include constrained and regularized least squares problems that arise in various machine learning applications [7, 21], distributed optimization problems that arise in wireless sensor network as well as smart grid applications [19, 11] and constrained optimization of separable problems . An important feature of this formulation is that the number of component functions is large, hence solving this problem using a standard gradient method that involves evaluating the full gradient of , i.e., , is costly. This motivates using incremental methods that exploit the additive structure of the problem and update the decision vector using one component function at a time.
When is continuously differentiable, one widely studied approach is the incremental gradient (IG) method [18, 24, 1]. The IG method processes the component functions one at a time by taking steps along the gradient of each individual function in a sequential manner, following a cyclic order [26, 27] or a randomized order [22, 13, 27]. A particular randomized order, which at each iteration independently picks a component function uniformly at random from all component functions leads to the popular stochastic gradient descent (SGD) method. While SGD is the method of choice in practice for many machine learning applications due to its superior empirical performance and convergence rate estimates that does not depend on the number of component functions , its convergence rate is sublinear, i.e., an -optimal solution can be computed within iterations.111Let be an optimal solution of the problem (LABEL:eq:goal). A vector is an -optimal solution if . In a seminal paper, Blatt et al.  proposed the incremental aggregated gradient (IAG) method, which maintains the savings associated with incrementally accessing the component functions, but keeps the most recent component gradients in memory to approximate the full gradient and updates the iterate using this aggregated gradient. Blatt et al. showed that under some assumptions, for a sufficiently small constant step size, the IAG method is globally convergent and when the component functions are quadratics, it achieves a linear rate. Two recent papers,  and , investigated the convergence rate of this method for general component functions that are convex and smooth (i.e., with Lipschitz gradients), where the sum of the component functions is strongly convex: In , the authors focused on a randomized version, called stochastic average gradient (SAG) method (which samples the component functions independently similar to SGD), and showed that it achieves a linear rate using a proof that relies on the stochastic nature of the algorithm. In a more recent work , the authors focused on deterministic IAG (i.e., component functions processed using an arbitrary deterministic order) and provided a simple analysis that uses a delayed dynamical system approach to study the evolution of the iterates generated by this algorithm.
While these recent advances suggest IAG as a promising approach with fast convergence rate guarantees for solving additive cost problems, in many applications listed above, the objective function takes a composite form and includes a non-smooth regularization function (to avoid overfitting or to induce a sparse representation). Another important case of interest is smooth constrained optimization problems which can be represented in the composite form (LABEL:eq:goal) where the function is the indicator function of a nonempty closed convex set.
In this paper, we study the proximal incremental aggregated gradient (PIAG) method for solving composite additive cost optimization problems. Our method computes an aggregated gradient for the function (with component gradients evaluated in a deterministic manner at outdated iterates over a finite window , similar to IAG) and uses a proximal operator with respect to the regularization function at the intermediate iterate obtained by moving along the aggregated gradient. Under the assumptions that is strongly convex and each is smooth with Lipschitz gradients, we show the first linear convergence rate result for the deterministic PIAG and provide explicit convergence rate estimates that highlight the dependence on the condition number of the problem (which we denote by ) and the size of the window over which outdated component gradients are evaluated. In particular, we show that in order to achieve an -optimal solution, the PIAG algorithm requires iterations, or equivalently iterations, where the tilde is used to hide the logarithmic terms in and . This result improves upon the condition number dependence of the deterministic IAG for smooth problems , where the authors proved that to achieve an -optimal solution, the IAG algorithm requires iterations. We also note that two recent independent papers [15, 9] have analyzed the convergence rate of the prox-gradient algorithm (which is a special case of our algorithm with , i.e., where we have access to a full gradient at each iteration instead of an aggregated gradient) under strong convexity type assumptions and provided linear rate estimates. Our rate estimates for the PIAG algorithm with matches the condition number dependence of the prox-gradient algorithm provided in these papers [15, 9] up to logarithmic factors. Furthermore, for the case (i.e., for the prox-gradient algorithm), the rate estimates obtained using our analysis technique can be shown to have the same condition number dependence as the ones presented in [15, 9].
Our analysis uses function values to track the evolution of the iterates generated by the PIAG algorithm. This is in contrast with the recent analysis of the IAG algorithm provided in , which used distances of the iterates to the optimal solution as a Lyapunov function and relied on the smoothness of the problem to bound the gradient errors with distances. This approach does not extend to the non-smooth composite case, which motivates a new analysis using function values and the properties of the proximal operator. Since we work directly with function values, this approach also allows us to obtain iteration complexity results to achieve an -optimal solution.
In terms of the algorithmic structure, our paper is related to , where the authors introduce the SAGA method, which extends the SAG method to the composite case and provides a linear convergence rate result with an analysis that relies on the stochastic nature of the algorithm and does not extend to the deterministic case. Particularly, the SAGA method samples the component functions randomly and independently at each iteration without replacement (in contrast with the PIAG method, where the component functions are processed deterministically). However, such random sampling may not be possible for applications such as decentralized information processing in wireless sensor networks (where agents are subject to communication constraints imposed by the network topology and all agents are not necessarily connected to every other agent via a low-cost link ), motivating the study of the deterministic PIAG method. In , the authors prove that to achieve a point in the -neighborhood of the optimal solution, SAGA requires iterations.222Let be an optimal solution of the problem (LABEL:eq:goal). A vector is in the -neighborhood of an optimal solution if . However, note that this result does not translate into a guarantee in the function suboptimality of the resulting point because of lack of smoothness. Furthermore, the choice of Lyapunov function in  requires each to be convex (to satisfy the non-negativity condition), whereas we do not need this assumption in our analysis.
Our work is also related to , where the authors propose a related linearly convergent incrementally updated gradient method for solving the composite additive cost problem in (LABEL:eq:goal) under a local Lipschitzian error condition (a condition satisfied by locally strongly convex functions around an optimal solution). The PIAG algorithm is different from the algorithm proposed in . Specifically, for constrained optimization problems (i.e., when the regularization function is the indicator function of a nonempty closed convex set), the iterates generated by the algorithm in  stay in the interior of the set since the algorithm in  searches for a feasible update direction. On the other hand, the PIAG algorithm uses the proximal map on the intermediate iterate obtained by moving in the opposite direction of the aggregated gradient, which operates as a projected gradient method and allows the iterates to be on the boundary of the set. Aside from algorithmic differences,  does not provide explicit rate estimates (even though the exact rate can be calculated after an elaborate analysis, the dependence on the condition number and the window length of the outdated gradients is significantly worse than the one presented in this paper). Furthermore, the results in  provides a -step linear convergence, whereas the linear convergence results in our paper hold uniformly for each step.
Other than the papers mentioned above, our paper is also related to , which studies an alternative incremental aggregated proximal method and shows linear convergence when each and are continuously differentiable. This method forms a linear approximation to and processes the component functions with a proximal iteration, whereas our method processes based on a gradient step. Furthermore, our linear convergence results do not require the differentiability of the objective function in contrast to the analysis in .
Several recent papers in the machine learning literature (e.g., [7, 14, 8, 17, 16] and references therein) are also weakly related to our paper. In all these papers, the authors propose randomized order algorithms similar to the SAG algorithm  and analyze their convergence rates in expectation. In particular, in , the authors propose an algorithm, called Finito, which is closely related to the SAG algorithm but achieves a faster convergence rate than the SAG algorithm. These ideas are then extended to composite optimization problems with non-smooth objective functions (as in (LABEL:eq:goal)) in [17, 7]. In particular, in , a majorization-minimization algorithm, called MISO, is proposed to solve smooth optimization problems and its global linear convergence is shown in expectation. In , the ideas in  are then extended for non-smooth optimization problems using proximal operator. Similarly, in , a variance reduction technique is applied to the SGD algorithm for smooth problems and its global linear convergence in expectation is proven.
The rest of the paper is organized as follows. In Section LABEL:sec:piag, we introduce the PIAG algorithm. In Section LABEL:sec:main, we first provide the assumptions on the objective functions and then prove the global linear convergence of the proposed algorithm under these assumptions. We conclude the paper in Section LABEL:sec:conclusion with a summary of our results.
2 The PIAG Algorithm
Similar to the IAG method, at each iteration , we form an aggregated gradient, which we denote as follows
where represents the gradient of the th component function sampled at time . We assume that each component function is sampled at least once in the past iterations, i.e., we have
This condition is typically satisfied in practical implementations of the deterministic incremental methods. For instance, if the functions are processed in a cyclic order, we have [26, 13]. On the other hand, corresponds to the case where we have the full gradient of the function at each iteration (i.e., ) and small may represent a setting in which the gradients of the component functions are sent to a processor with some delay upper bounded by .
Since the regularization function is not necessarily differentiable, we propose to solve (LABEL:eq:goal) with the proximal incremental aggregated gradient (PIAG) method, which uses the proximal operator with respect to the regularization function at the intermediate iterate obtained using the aggregated gradient. In particular, the PIAG algorithm, at each iteration , updates as
where is a constant step size and the proximal mapping is defined as follows
Here, we define and let denote the set of subgradients of the function at . Then, it follows from the optimality conditions  of the problem in (LABEL:eq:prox) that . This yields , for some . Hence, we can compactly represent our update rule as
where is the direction of the update at time .
3 Convergence Analysis
Throughout the paper, we make the following standard assumptions.
(Lipschitz gradients) Each has Lipschitz continuous gradients on with some constant , i.e.,
for any .333If a function has Lipschitz continuous gradients with some constant , then is called -smooth. We use these terms interchangeably.
Defining , we observe that Assumption LABEL:asmp:lips and the triangle inequality yield
for any , i.e., the function is -smooth.
(Strong Convexity) The sum function is -strongly convex on for some , i.e., the function is convex, and the regularization function is convex on .
A consequence of Assumption LABEL:asmp:conv is that is strongly convex, hence there exists a unique optimal solution of problem (LABEL:eq:goal) [20, Lemma 6], which we denote by .
We emphasize that these assumptions hold for a variety of cost functions including regularized squared error loss, hinge loss, and logistic loss  and similar assumptions are widely used to analyze the convergence properties of incremental gradient methods in the literature [7, 4, 2, 23, 12]. Note that in contrast with many of these analyses, we do not assume that the component functions are convex.
3.2 Rate of Convergence
In this section, we show that the PIAG algorithm attains a global linear convergence rate with a constant step size provided that the step size is sufficiently small. We define
which is the suboptimality in the objective value at iteration . In our analysis, we will use as a Lyapunov function to prove global linear convergence. Before providing the main theorems of the paper, we first introduce three lemmas that contain key relations in proving these theorems.
The first lemma investigates how the suboptimality in the objective value evolves over the iterations. In particular, it shows that the change in suboptimality can be bounded as a sum of two terms: The first term is negative and has a linear dependence in the step size , whereas the second term is positive and has a quadratic dependence in . This suggests that if the step size is small enough, the linear term in will be dominant guaranteeing a descent in suboptimality.
Suppose that Assumptions 1 and 2 hold. Then, the PIAG algorithm in (LABEL:eq:updaterule) yields the following guarantee
for any step size .
Proof. We first consider the difference of the errors in consecutive time instances and write
where the inequality follows from the Taylor series expansion of around and since the Hessian of at any point is upper bounded by by Assumption LABEL:asmp:lips. Using the update rule in this inequality, we obtain
where the first inequality follows by the triangle inequality and the last inequality follows from the convexity of .
The gradient error term in (LABEL:eq:c1), i.e., , can be upper bounded as follows
where the first and third inequalities follow by the triangle inequality, the second inequality follows since each is -smooth, and the last inequality follows since . Using (LABEL:eq:lem1) we can upper bound (LABEL:eq:c1) as follows
where the second inequality follows from the arithmetic-geometric mean inequality, i.e., and the last inequality follows since . This concludes the proof of Lemma LABEL:prop1.
We next introduce the following lemma, which can be viewed as an extension of [25, Theorem 4] into our framework with aggregated gradients. We provide a simplified proof compared to  with a tighter upper bound. This lemma can be interpreted as follows. When the regularization function is zero (i.e., for all ) and we have access to full gradients (i.e., ), this lemma simply follows from the strong convexity of the sum function since and due to the optimality condition of the problem. The following lemma indicates that even though we do not have such control over the subgradients of the regularization function (as the regularization function is neither strongly convex nor smooth), the properties of the proximal step yields a similar relation at the expense of a constant of (instead of compared to the case) and certain history dependent terms (that arise due to the incremental nature of the PIAG algorithm) that has a linear dependence in step size . This lemma will be a key step in the proof of Lemma LABEL:prop2, where we illustrate how the descent term in Lemma LABEL:prop1 relates to our Lyapunov function.
Suppose that Assumptions 1 and 2 hold and let denote the condition number of the problem. Then, the distance of the iterates from the optimal solution is upper bounded as
for any and .
as the direction of update with the full gradient. Non-expansiveness property of the proximal map implies
Putting and in the above inequality, we obtain
This inequality can be rewritten as follows
where the second inequality follows since , the third inequality follows by the Cauchy-Schwarz inequality, the fourth inequality follows from the -smoothness of , and the last inequality follows since . Since -strong convexity of implies
then combining (LABEL:eq:app1) and (LABEL:eq:app2), we obtain
In order to relate to the original direction of update , we use the triangle inequality and write
where the last line follows by equation (LABEL:eq:lem1). Putting (LABEL:eq:app4) back into (LABEL:eq:app3) concludes the proof of Lemma LABEL:lem.
In the following lemma, we relate the direction of update to the suboptimality in the objective value at a given iteration . In particular, we show that the descent term presented in Lemma LABEL:prop1 (i.e., ) can be upper bounded by the negative of the suboptimality in the objective value of the next iteration (i.e., ) and additional history dependent terms that arise due to the incremental nature of the PIAG algorithm.
Suppose that Assumptions 1 and 2 hold. Then, for any , the PIAG algorithm in (LABEL:eq:updaterule) yields the following guarantee
Proof. In order to prove this lemma, we use Lemma LABEL:lem, which can be rewritten as follows
Then, we can upper bound as
where the last line follows by the Cauchy-Schwarz inequality and the arithmetic-geometric mean inequality. We can upper bound the inner product term in (LABEL:eq:pop1) as
where the first inequality follows from the convexity of and the second inequality follows from the convexity of and the triangle inequality. The inner product term in (LABEL:eq:pop2) can be upper bounded as
where the first inequality follows by the triangle inequality and the second inequality follows from the -smoothness of . Putting (LABEL:eq:pop3) back in (LABEL:eq:pop2), we obtain
The final term in (LABEL:eq:pop4) can be upper bounded as follows
where the first line follows by equation (LABEL:eq:lem1) and the last line follows by Lemma LABEL:lem. Using arithmetic-geometric mean inequality in the above inequality, we obtain
Putting (LABEL:eq:pop5) back in (LABEL:eq:pop4) yields
where the last line follows since . Finally, using (LABEL:eq:pop6) in our original inequality in (LABEL:eq:pop1), we obtain
where the second inequality follows since , the third inequality follows since , and the last inequality follows since . Rearranging the terms in (LABEL:eq:pop7), we obtain
which completes the proof of Lemma LABEL:prop2.
In the following theorem, we derive a recursive inequality to upper bound the suboptimality at iteration in terms of the suboptimality in the previous iterations (where is a positive integer that measures the length of history considered) and an additive remainder term. We will later show in Corollary LABEL:corr1 that this remainder term can also be upper bounded in terms of the suboptimality observed in previous iterations.
Suppose that Assumptions 1 and 2 hold. Then, the PIAG algorithm with step size yields the following recursion
for any , where is an arbitrary constant and .
Proof. We use induction on the constant to prove this theorem. For , the recursion can be obtained as follows. Using Lemma LABEL:prop2 in Lemma LABEL:prop1 and rearranging terms, we get
Rearranging terms in Lemma LABEL:prop1, we obtain
Putting (LABEL:eq:t2) back in (LABEL:eq:t1), we get
Since , then (LABEL:eq:t3) can be rewritten as follows
showing (LABEL:eq:rec) for . As a part of the induction procedure, we then assume that (LABEL:eq:rec) holds for some arbitrary , which amounts to
Using (LABEL:eq:t2) in the above inequality, we obtain
Therefore, (LABEL:eq:rec) holds for as well, which concludes the proof of Theorem LABEL:thm1.
Suppose that Assumptions 1 and 2 hold. Then, the PIAG algorithm with step size yields the following recursion