Adaptive restart of accelerated gradient methods under local quadratic growth condition
By analyzing accelerated proximal gradient methods under a local quadratic growth condition, we show that restarting these algorithms at any frequency gives a globally linearly convergent algorithm. This result was previously known only for restarting periods that are long enough.
Then, as the rate of convergence depends on the match between the frequency and the quadratic error bound, we design a scheme to automatically adapt the frequency of restart from the observed decrease of the norm of the gradient mapping. Our algorithm has a better theoretical bound than previously proposed methods for the adaptation to the quadratic error bound of the objective.
We illustrate the efficiency of the algorithm on a Lasso problem and on a regularized logistic regression problem.
The proximal gradient method aims at minimizing composite convex functions of the form
$$F(x) = f(x) + \psi(x),$$
where $f$ is differentiable with Lipschitz-continuous gradient and $\psi$ may be nonsmooth but has an easily computable proximal operator. For a mild additional computational cost, accelerated gradient methods transform the proximal gradient method, for which the optimality gap $F(x_k) - F^*$ decreases as $O(1/k)$, into an algorithm with “optimal” $O(1/k^2)$ complexity [Nes83]. Accelerated variants include the dual accelerated proximal gradient [Nes05, Nes13], the accelerated proximal gradient method (APG) [Tse08] and FISTA [BT09]. Gradient-type methods, also called first-order methods, are often used to solve large-scale problems because of their good scalability and ease of implementation, which facilitates parallel and distributed computations.
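For concreteness, this kind of accelerated proximal gradient iteration can be sketched as follows. This is a minimal, generic FISTA-style loop; the function names, the momentum rule and the fixed step size $1/L$ are standard textbook choices, not this paper's exact scheme:

```python
import numpy as np

def fista(grad_f, prox_psi, L, x0, n_iters=100):
    """Minimal FISTA sketch for minimizing f(x) + psi(x), with f L-smooth.

    grad_f   -- gradient of the smooth part f
    prox_psi -- (v, step) -> prox_{step * psi}(v)
    L        -- Lipschitz constant of grad_f (step size 1/L)
    """
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iters):
        x_next = prox_psi(y - grad_f(y) / L, 1.0 / L)      # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0  # momentum sequence
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # extrapolation
        x, t = x_next, t_next
    return x
```

With $\psi = 0$ the proximal operator is the identity and the loop reduces to an accelerated gradient descent.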
When solving a convex problem whose objective function satisfies a local quadratic error bound (a generalization of strong convexity), classical (non-accelerated) gradient and coordinate descent methods automatically have a linear rate of convergence, i.e. $F(x_k) - F^* = O\big((1-\rho)^k\big)$ for a problem-dependent $\rho \in (0,1)$ [NNG12, DL16]. In contrast, one needs to know the strong convexity parameter explicitly in order to set accelerated gradient and accelerated coordinate descent methods to have a linear rate of convergence; see for instance [LS13, LMH15, LLX14, Nes12, Nes13]. Setting the algorithm with an incorrect parameter may result in a slower algorithm, sometimes even slower than if we had not tried to use an acceleration scheme [OC12]. This is a major drawback of the method because, in general, the strong convexity parameter is difficult to estimate.
In the context of accelerated gradient methods with unknown strong convexity parameter, Nesterov [Nes13] proposed a restarting scheme which adaptively approximates the strong convexity parameter. The same idea was exploited by Lin and Xiao [LX15] for sparse optimization. Nesterov [Nes13] also showed that, instead of deriving a new method designed to work better for strongly convex functions, one can restart the accelerated gradient method and get a linear convergence rate. However, the restarting frequency he proposed still depends explicitly on the strong convexity of the function, and so O’Donoghue and Candès [OC12] introduced heuristics to adaptively restart the algorithm, which obtain good results in practice.
In this paper, we show that, if the objective function is convex and satisfies a local quadratic error bound, we can restart accelerated gradient methods at any frequency and get a linearly convergent algorithm. The rate depends on an estimate of the quadratic error bound and we show that for a wide range of this parameter, one obtains a faster rate than without acceleration. In particular, we do not require this estimate to be smaller than the actual value. In that way, our result supports and explains the practical success of arbitrary periodic restart for accelerated gradient methods.
Then, as the rate of convergence depends on the match between the frequency and the quadratic error bound, we design a scheme to automatically adapt the frequency of restart from the observed decrease of the norm of the gradient mapping. The approach follows the lines of [Nes13, LX15, LY13]. We prove that, if our current estimate of the local error bound were correct, the norm of the gradient mapping would decrease at a prescribed rate. We simply check this decrease; when the test fails, we have a certificate that the estimate was too large.
Our algorithm has a better theoretical bound than previously proposed methods for the adaptation to the quadratic error bound of the objective. In particular, we exploit the fact that the norm of the gradient mapping is guaranteed to decrease even when our estimate of the local error bound is wrong.
In Section 2 we recall the main convergence results for accelerated gradient methods and show that a fixed restart leads to a linear convergence rate. In Section 3, we present our adaptive restarting rule. Finally, we present numerical experiments on the Lasso and logistic regression problems in Section 4.
2 Accelerated gradient schemes
2.1 Problem and assumptions
We consider the following optimization problem:
where $f$ is a differentiable convex function and $\psi$ is a proper, closed and convex function. We denote by $F^*$ the optimal value of (1) and assume that the optimal solution set $X^*$ is nonempty. Throughout the paper, $\|\cdot\|$ denotes the Euclidean norm. For any positive vector $v$, we denote by $\|\cdot\|_v$ the weighted Euclidean norm:
$$\|x\|_v^2 = \sum_i v_i x_i^2,$$
and by $\mathrm{dist}_v(x, X)$ the distance of $x$ to the closed convex set $X$ with respect to the norm $\|\cdot\|_v$. In addition, we assume that $\psi$ is simple, in the sense that the proximal operator, defined as
$$\operatorname{prox}^v_{\psi}(x) = \arg\min_{y} \Big\{ \psi(y) + \frac{1}{2}\|y - x\|_v^2 \Big\},$$
is easy to compute for any positive vector $v$. We also make the following smoothness and local quadratic error bound assumptions.
There is a positive vector $v$ such that
$$f(x + h) \le f(x) + \langle \nabla f(x), h \rangle + \frac{1}{2}\|h\|_v^2, \qquad \forall x, h.$$
For any $c \ge F^*$, there is $\mu(c) > 0$ such that
$$F(x) - F^* \ge \frac{\mu(c)}{2}\, \mathrm{dist}_v(x, X^*)^2, \qquad \forall x \in [F \le c],$$
where $[F \le c]$ denotes the set of all $x$ such that $F(x) \le c$.
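As a concrete instance of a simple $\psi$: for the $\ell_1$-norm used in the Lasso experiment of Section 4, the (unweighted) proximal operator is componentwise soft-thresholding. A minimal sketch:

```python
import numpy as np

def prox_l1(x, step):
    """Proximal operator of step * ||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - step, 0.0)
```

Each component is shrunk toward zero by `step` and set exactly to zero if its magnitude is below `step`, which is what makes the operator cheap to evaluate.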
2.2 Accelerated gradient schemes
We first recall in Algorithms 1 and 2 two classical accelerated proximal gradient schemes. For identification purposes, we refer to them respectively as FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) [BT09] and APG (Accelerated Proximal Gradient) [Tse08]. As pointed out in [Tse08], the accelerated schemes were first proposed by Nesterov [Nes04].
We have written the algorithms in a unified framework to emphasize their similarities. Practical implementations usually consider only two variables: for FISTA and for APG .
The simplest restarted accelerated gradient method has a fixed restarting frequency: Algorithm 3 periodically restarts Algorithm 1 or Algorithm 2, with the restarting period fixed to some integer. In Section 3 we will also consider an adaptive restarting frequency with a varying restarting period.
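The fixed-restart scheme can be sketched as follows. The inner loop is a generic FISTA-style block, and the momentum reset at each restart is the only ingredient specific to the restarted method; names and the step size $1/L$ are illustrative, not this paper's exact Algorithm 3:

```python
import numpy as np

def restarted_apg(grad_f, prox_psi, L, x0, period, n_restarts):
    """Fixed-frequency restart sketch: run an accelerated block for `period`
    iterations, then restart the momentum from the last iterate."""
    x = x0.copy()
    for _ in range(n_restarts):
        y, t = x.copy(), 1.0                 # reset extrapolation state
        for _ in range(period):
            x_next = prox_psi(y - grad_f(y) / L, 1.0 / L)
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            y = x_next + ((t - 1.0) / t_next) * (x_next - x)
            x, t = x_next, t_next
    return x
```

Restarting discards the accumulated momentum, which is exactly what prevents the oscillations of non-restarted accelerated methods on quadratically growing objectives.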
2.3 Convergence results for accelerated gradient methods
2.3.1 Basic results
In this section we gather a list of known results, shared by FISTA and APG, which will be used later to build restarted methods. Although all the results presented in this subsection have been proved elsewhere or can be derived easily from existing results, for completeness all the proofs are given in the Appendix. We first recall the following properties of the sequence.
The sequence defined by and satisfies
We shall also need the following relation between the sequences.
It is known that the sequence of objective values generated by accelerated schemes, in contrast to that generated by proximal gradient schemes, does not decrease monotonically. However, this sequence is always upper bounded by the initial value $F(x_0)$. Following [LX15], we refer to this result as the non-blowout property of accelerated schemes.
The non-blowout property of accelerated schemes can be found in many papers, see for example [LX15, WCP17]. It will be repeatedly used in this paper to derive the linear convergence rate of restarted methods. Finally recall the following fundamental property for accelerated schemes.
All the results presented above hold without Assumption 2.
2.3.2 Conditional linear convergence
Although Proposition 3 provides a guarantee of linear convergence when , it does not give any information in the case when is too small to satisfy .
2.3.3 Unconditional linear convergence
We now prove a contraction result on the distance to the optimal solution set.
For all let us denote the projection of onto . For all and , we have
We proceed as
Denote . Remark that
so that we may prove that by induction. Let us assume that . Then using (2.3.3)
Using (6) one can then easily check that
By Corollary 3, the number of proximal mappings needed to reach an -accuracy on the distance is bounded by
In particular, if we choose , then we get an iteration complexity bound
In [WCP17], the local linear convergence of the sequence generated by FISTA with arbitrary (fixed or adaptive) restarting frequency was proved. Our Theorem 1 not only yields the global linear convergence of such a sequence, but also gives an explicit bound on the convergence rate. Also, note that although an asymptotic linear convergence rate can be derived from the proof of Lemma 3.6 in [WCP17], it can be checked that the asymptotic rate in [WCP17] is not as good as ours. In fact, an easy calculation shows that even when restarting with the optimal period, their asymptotic rate leads to a worse complexity bound than ours. Moreover, our restarting scheme is more flexible, because the internal block in Algorithm 3 can be replaced by any scheme satisfying the properties presented in Section 2.3.1.
3 Adaptive restarting of accelerated gradient schemes
Although Theorem 1 guarantees linear convergence of the restarted method (Algorithm 3), it requires knowledge of the quadratic error bound to attain the complexity bound (19). In this section, we show how to combine Corollary 3 with Nesterov’s adaptive restart method, first proposed in [Nes07], in order to obtain a complexity bound close to (19) that does not depend on a guess of this parameter.
3.1 Bounds on gradient mapping norm
We first show the following inequalities that generalize similar ones in [Nes07]. Hereinafter we define the proximal mapping:
3.2 Adaptively restarted algorithm
It is essential that both sides of (27) are computable, so that we can check this inequality for each estimate. If (27) does not hold, then we know that the estimate of the error bound was too large. The idea was originally proposed by Nesterov in [Nes07] and later generalized in [LX15], where, instead of restarting, the estimate is incorporated into the update of the iterates. As a result, their complexity analysis only works for strongly convex objective functions and does not seem to hold under Assumption 2.
At the end of each restarting period, we test condition (27), the opposite of which is given by the first inequality at Line 13. If it holds, then we continue with the same estimate, and thus the same restarting period; otherwise we decrease the estimate by one half and repeat. Our stopping criterion is based on the norm of the proximal gradient, as in the related work [LX15, LY13].
Although Line 13 of Algorithm 4 requires the computation of the proximal gradient mapping, one should remark that this is in fact given by the first iteration of Algorithm 1 or Algorithm 2. Hence, except for the initial one, the computation of the gradient mapping does not incur additional computational cost. Therefore, the output of Algorithm 4 records the total number of proximal gradient mappings needed to reach the target accuracy.
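Putting the pieces together, the adaptive scheme can be sketched as below. The restart period derived from the current estimate `mu`, the decrease factor `1/e` in the test, and all names are illustrative stand-ins for the quantities of Algorithm 4 and condition (27), not the paper's exact constants:

```python
import numpy as np

def adaptive_restart(grad_f, prox_psi, L, x0, mu0, tol=1e-8, max_prox=10000):
    """Adaptive restart sketch: run a FISTA-style block for a period derived
    from the current error-bound estimate mu, then check that the norm of the
    gradient mapping decreased as predicted; if not, halve mu (the estimate
    was too large) and continue."""
    def grad_map_norm(x):
        # norm of the gradient mapping L * (x - prox(x - grad_f(x)/L))
        return L * np.linalg.norm(x - prox_psi(x - grad_f(x) / L, 1.0 / L))

    x, mu, n_prox = x0.copy(), mu0, 0
    g_prev = grad_map_norm(x)
    while g_prev > tol and n_prox < max_prox:
        period = max(1, int(np.ceil(2.0 * np.sqrt(L / mu))))  # K ~ sqrt(L/mu)
        y, t = x.copy(), 1.0                                   # restart momentum
        for _ in range(period):
            x_next = prox_psi(y - grad_f(y) / L, 1.0 / L)
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            y = x_next + ((t - 1.0) / t_next) * (x_next - x)
            x, t = x_next, t_next
            n_prox += 1
        g = grad_map_norm(x)
        if g > g_prev / np.e:   # predicted decrease failed: mu was too large
            mu /= 2.0
        g_prev = g
    return x
```

Note that the test only ever decreases `mu`, so a wrong initial guess is corrected after finitely many halvings, which is what drives the adaptive complexity bound.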
We first show the following non-blowout property for Algorithm 4.
For any and we have
For any , if , then
Consider Algorithm 4. If for any we have , then and