# Adaptive Accelerated Gradient Converging Methods under Hölderian Error Bound Condition

Mingrui Liu mingrui-liu@uiowa.edu
Tianbao Yang tianbao-yang@uiowa.edu
Department of Computer Science
The University of Iowa, Iowa City, IA 52242

First version: November 22, 2016
## 1 Introduction

We consider the following smooth optimization problem:

 \min_{x\in\mathbb{R}^d} f(x), \qquad (1)

where f is a continuously differentiable convex function, whose gradient is L-Lipschitz continuous. More generally, we also tackle the following composite optimization:

 \min_{x\in\mathbb{R}^d} F(x) \triangleq f(x) + g(x), \qquad (2)

where g is a proper lower semi-continuous convex function and f is a continuously differentiable convex function, whose gradient is L-Lipschitz continuous. The above problem has been studied extensively in the literature and many algorithms have been developed with convergence guarantees. In particular, by employing the proximal mapping associated with g, i.e.,

 P_{\eta g}(u) = \arg\min_{x\in\mathbb{R}^d} \frac{1}{2}\|x - u\|_2^2 + \eta g(x), \qquad (3)

proximal gradient (PG) and accelerated proximal gradient (APG) methods have been developed for solving (2) with O(1/ϵ) and O(1/√ϵ) iteration complexities for finding an ϵ-optimal solution (for the moment, we neglect the constant factor). When either f or g is strongly convex, both PG and APG enjoy a linear convergence, i.e., the iteration complexity is improved to O(log(1/ϵ)).
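For concreteness, the proximal mapping (3) has a closed form for many common regularizers. A minimal sketch, assuming g(x) = λ‖x‖₁ (for which (3) reduces to elementwise soft-thresholding; the function name is illustrative):

```python
import numpy as np

def prox_l1(u, eta, lam=1.0):
    """Proximal mapping (3) for g(x) = lam*||x||_1:
    argmin_x 0.5*||x - u||_2^2 + eta*lam*||x||_1,
    which is the elementwise soft-thresholding operator."""
    return np.sign(u) * np.maximum(np.abs(u) - eta * lam, 0.0)

# Entries with magnitude <= eta*lam are zeroed; others shrink toward 0.
z = prox_l1(np.array([3.0, -0.5, 1.0]), eta=1.0)
```
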

Recently, a wave of studies generalizes the linear convergence to problems without strong convexity but under certain structured conditions of the objective function or, more generally, a quadratic growth condition (Hou et al., 2013; Zhou et al., 2015; So, 2013; Wang and Lin, 2014a; Gong and Ye, 2014; Zhou and So, 2015; Bolte et al., 2015; Necoara et al., 2015; Karimi et al., 2016; Zhang, 2016a; Drusvyatskiy and Lewis, 2016). Earlier work along this line dates back to (Luo and Tseng, 1992a, b, 1993). An example of the structured condition is F(x) = h(Ax) + g(x), where h is a strongly convex function whose gradient is Lipschitz continuous on any compact set, and g is a polyhedral function. Under such a structured condition, a local error bound condition can be established (Luo and Tseng, 1992a, b, 1993), which renders an asymptotic (local) linear convergence for the proximal gradient method. A quadratic growth condition (QGC) prescribes that the objective function satisfies, for any x (it can be relaxed to a fixed domain as done in this work), F(x) − F_* ≥ (α/2)‖x − x_*‖₂², where x_* denotes a closest point to x in the optimal set. Under such a quadratic growth condition, several recent studies have established the linear convergence of PG, APG and many other algorithms (e.g., coordinate descent methods) (Bolte et al., 2015; Necoara et al., 2015; Drusvyatskiy and Lewis, 2016; Karimi et al., 2016; Zhang, 2016a). A notable result is that PG enjoys an iteration complexity of O((L/α)log(1/ϵ)) without knowing the value of α, while a restarting version of APG studied in Necoara et al. (2015) enjoys an improved iteration complexity of O(√(L/α)log(1/ϵ)), hinging on the value of α to appropriately restart APG periodically. Other equivalent or more restrictive conditions are also considered in several studies to show the linear convergence of the (proximal) gradient method and other methods (Karimi et al., 2016; Necoara et al., 2015; Zhang, 2016a, b).

In this paper, we extend this line of work to a more general error bound condition, i.e., the Hölderian error bound (HEB) condition on a compact sublevel set S_ξ: there exist θ ∈ (0, 1] and c > 0 such that

 \|x - x_*\|_2 \le c\,(F(x) - F(x_*))^{\theta}, \quad \forall x \in S_\xi. \qquad (4)

Note that when θ = 1/2 and c = √(2/α), the HEB reduces to the QGC. In the sequel, we will refer to c²L as the condition number of the problem. It is worth mentioning that Bolte et al. (2015) considered the same condition, or an equivalent Kurdyka-Łojasiewicz inequality, but they only focused on descent methods that bear a sufficient decrease condition for each update, consequently excluding APG. In addition, they do not provide explicit iteration complexity under the general HEB condition.

As a warm-up and motivation, we will first present a straightforward analysis to show that PG is automatically adaptive and APG can be made adaptive to the HEB by restarting. In particular, if F satisfies a HEB condition on the initial sublevel set (when θ > 1/2, all algorithms can converge in a finite number of steps), PG has an iteration complexity of O(max{c²L log(ϵ₀/ϵ), c²L/ϵ^{1−2θ}}), and restarting APG enjoys an iteration complexity of O(max{c√L log(ϵ₀/ϵ), c√L/ϵ^{1/2−θ}}) for the convergence of the objective value, where c²L is the condition number. These two results resemble but generalize recent works that establish linear convergence of PG and restarting APG under the QGC, a special case of HEB. Although enjoying faster convergence, restarting APG has some caveats: (i) it requires the knowledge of the constant c in HEB to restart APG, which is usually difficult to compute or estimate; (ii) there lacks an appropriate machinery to terminate the algorithm. In this paper, we make nontrivial contributions to obtain faster convergence of the proximal gradient’s norm under the HEB condition by developing an adaptive accelerated gradient converging method.

The main results of this paper are summarized in Table 1. In summary the contributions of this paper are:

• We extend the analysis of PG and restarting APG under the quadratic growth condition to more general HEB condition, and establish the adaptive iteration complexities of both algorithms.

• To enjoy faster convergence of restarting APG and to eliminate the algorithmic dependence on the unknown parameter c, we propose and analyze an adaptive accelerated gradient converging (adaAGC) method.

The developed algorithms and theory have important implications and applications in machine learning. Firstly, if the considered objective function is also coercive and semi-algebraic (e.g., a norm-regularized problem in machine learning with a semi-algebraic loss function), then PG’s convergence speed is essentially O(1/t^{1/(1−2θ)}) instead of O(1/t), where t is the total number of iterations. Secondly, for solving ℓ₁, ℓ∞, or ℓ_{1,∞} regularized smooth loss minimization problems, including the least-squares loss, squared hinge loss and Huber loss, the proposed adaAGC method enjoys a linear convergence and a square-root dependence on the “condition” number. In contrast to previous work, the proposed algorithm is parameter free and does not rely on any restricted conditions (e.g., restricted eigenvalue conditions).

## 2 Related Work

At first, we review some related work for solving problems (1) and (2). In Nesterov’s seminal work (Nesterov, 1983, 2007), the accelerated (proximal) gradient (APG) methods were proposed for (composite) smooth optimization problems, enjoying an O(1/√ϵ) iteration complexity for achieving an ϵ-optimal solution. When the objective is also strongly convex, APG can converge to the optimal solution linearly with an appropriate step size depending on the strong convexity modulus, which enjoys an O(√(L/α)log(1/ϵ)) iteration complexity.

To address the issue of the unknown strong convexity modulus for some problems, several restarting schemes were developed. Nesterov (2007) proposed a restarting scheme for the APG method to approximate the unknown strong convexity parameter and achieved a linear convergence rate. Lin and Xiao (2014) proposed an adaptive APG method which employs restart and line-search techniques to automatically estimate the strong convexity parameter. O'Donoghue and Candès (2015) proposed a heuristic approach to adaptively restart accelerated gradient schemes and showed good experimental results. Nevertheless, they provide no theoretical guarantee for their proposed heuristic approach. In contrast to these works, we do not assume any strong convexity or restricted strong convexity for sparse learning. It was brought to our attention that a recent work (Fercoq and Qu, 2016) considered QGC and proposed restarted accelerated gradient and coordinate descent methods, including APG, FISTA and the accelerated proximal coordinate descent method (APPROX). The difference between their restarting scheme for APG and the restarting schemes in (Nesterov, 2007; Lin and Xiao, 2014; O'Donoghue and Candès, 2015) and the present work is that their restart does not involve evaluation of the gradient or the objective value, but rather depends on a restarting frequency parameter and a convex combination parameter for computing the restarting solution, which can be set based on a rough estimate of the strong convexity parameter. As a result, their linear convergence (established for the distance of solutions to the optimal set) heavily depends on the rough estimate of the strong convexity parameter.

Leveraging error bound conditions dates back to (Luo and Tseng, 1992a, b, 1993), which employed an error bound condition to establish the asymptotic (local) linear convergence of feasible descent methods. Luo & Tseng's condition bounds the distance of a local solution to the optimal set by the norm of the proximal gradient. Several recent works (Hou et al., 2013; Zhou et al., 2015; So, 2013) have considered Luo & Tseng's error bound condition for more problems in machine learning and established local linear convergence for proximal gradient methods. Wang and Lin (2014b) established a global error bound version of Luo & Tseng's condition for a family of problems in machine learning (e.g., the dual formulation of SVM), and provided global linear convergence for a series of algorithms, including cyclic coordinate descent methods for solving the dual support vector machine problem. Note that the Hölderian error bound (Bolte et al., 2015) used in our analysis is different from Luo & Tseng's condition, and is actually more general. Bolte et al. (2015) established the equivalence of HEB and the Kurdyka-Łojasiewicz (KL) inequality and showed how to derive lower computational complexity by employing the KL inequality. As a special case of the Hölderian error bound condition, the quadratic growth condition (QGC) has been considered in several recent works for deriving linear convergence. Gong and Ye (2014) established linear convergence of the proximal variance-reduced gradient (Prox-SVRG) algorithm under QGC. Necoara et al. (2015) showed that QGC is one of the relaxations of the strong convexity condition that can still guarantee linear convergence for several first-order methods, including projected gradient, fast gradient and feasible descent methods. Drusvyatskiy and Lewis (2016) also showed that the proximal gradient algorithm achieves linear convergence under QGC. There also exist other conditions (stronger than or equivalent to QGC) that can yield a linear convergence rate. For example, Karimi et al. (2016) showed that the Polyak-Łojasiewicz (PL) inequality suffices to guarantee a global linear convergence for (proximal) gradient descent methods. Zhang (2016a) summarized different sufficient conditions capable of delivering linear convergence, and discussed their relationships.

## 3 Notations and Preliminaries

In this section, we present some notations and preliminaries. In the sequel, we let ‖·‖_p (p ≥ 1) denote the p-norm of a vector. A function f is a proper function if f(x) < +∞ for at least one x and f(x) > −∞ for all x. f is lower semi-continuous at a point x₀ if lim inf_{x→x₀} f(x) ≥ f(x₀). A function f is coercive if and only if f(x) → +∞ as ‖x‖₂ → +∞.

A subset S of R^d is a real semi-algebraic set if there exists a finite number of real polynomial functions g_{ij}, h_{ij} such that

 S = \bigcup_{j=1}^{p}\bigcap_{i=1}^{q}\{u\in\mathbb{R}^d : g_{ij}(u)=0 \text{ and } h_{ij}(u)\le 0\}.

A function is semi-algebraic if its graph is a semi-algebraic set.

Denote by N the set of all positive integers. A function f is a real polynomial if there exists r ∈ N such that f(x) = Σ_{0≤|α|≤r} λ_α x^α, where λ_α ∈ R, x^α = x₁^{α₁}···x_d^{α_d} and |α| = Σ_{j=1}^d α_j, and deg(f) ≜ max{|α| : λ_α ≠ 0} is referred to as the degree of f. A continuous function F is said to be a piecewise convex polynomial if there exist finitely many polyhedra P₁, …, P_m with ∪_{j=1}^m P_j = R^d such that the restriction of F on each P_j is a convex polynomial. Let F_j be the restriction of F on P_j. The degree of a piecewise convex polynomial function F, denoted by deg(F), is the maximum of the degrees of the F_j. If deg(F) = 2, the function is referred to as a piecewise convex quadratic function. Note that a piecewise convex polynomial function is not necessarily a convex function (Li, 2013).

A function f is L-smooth w.r.t. ‖·‖₂ if it is differentiable and has a Lipschitz continuous gradient with Lipschitz constant L, i.e., ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂. Let ∂g(x) denote the subdifferential of g at x, i.e.,

 \partial g(x) = \{u\in\mathbb{R}^d : g(y) \ge g(x) + u^{\top}(y - x), \forall y\}.

Denote by ‖∂g(x)‖₂ = min_{u∈∂g(x)} ‖u‖₂. A function g is α-strongly convex w.r.t. ‖·‖₂ if it satisfies g(y) ≥ g(x) + u^⊤(y − x) + (α/2)‖y − x‖₂² for any x, y and any u ∈ ∂g(x).

Denote by η a positive scalar, and let P_{ηg} be the proximal mapping associated with ηg as defined in (3). Given an objective function F(x) = f(x) + g(x), where f is L-smooth and g is a simple non-smooth function, define a proximal gradient G_η(x) as:

 G_\eta(x) = \frac{1}{\eta}\big(x - x_\eta^{+}\big), \quad \text{where } x_\eta^{+} = P_{\eta g}(x - \eta\nabla f(x)).

When g(x) ≡ 0, we have G_η(x) = ∇f(x), i.e., the proximal gradient reduces to the gradient. It is known that x is an optimal solution iff G_η(x) = 0. If η = 1/L, for simplicity we denote G(x) = G_{1/L}(x) and x⁺ = x⁺_{1/L}. Below, we give several technical propositions related to G_η(x) and the proximal gradient update.
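The definition above translates directly into code. A sketch, assuming the ℓ₁ regularizer g(x) = λ‖x‖₁ so that the prox is soft-thresholding (`soft_threshold` and `prox_grad_map` are illustrative names, not from the paper):

```python
import numpy as np

def soft_threshold(u, t):
    # prox of t*||.||_1 (elementwise soft-thresholding)
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def prox_grad_map(x, grad_f, eta, lam):
    """Proximal gradient G_eta(x) = (x - x_eta^+)/eta, where
    x_eta^+ = P_{eta g}(x - eta*grad f(x)) and g = lam*||.||_1."""
    x_plus = soft_threshold(x - eta * grad_f(x), eta * lam)
    return (x - x_plus) / eta

# With g == 0 (lam = 0), G_eta(x) coincides with grad f(x):
grad = lambda x: 2.0 * x            # gradient of f(x) = ||x||_2^2
x = np.array([1.0, -2.0])
G = prox_grad_map(x, grad, eta=0.1, lam=0.0)
```

This also makes the optimality test concrete: x is optimal iff `prox_grad_map` returns the zero vector.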

###### Proposition 1

(Nesterov, 2007) Given x, ‖G_η(x)‖₂ is a monotonically decreasing function of η.

###### Proposition 2

(Beck and Teboulle, 2009) Let η ≤ 1/L and assume f is L-smooth. For any x and y, we have

 F(y_\eta^{+}) \le F(x) + G_\eta(y)^{\top}(y - x) - \frac{\eta}{2}\|G_\eta(y)\|_2^2. \qquad (5)

The following corollary is useful for our analysis.

###### Corollary 1

Let η ≤ 1/L and assume f is L-smooth. For any y, we have

 \frac{\eta}{2}\|G_\eta(y)\|_2^2 \le F(y) - F(y_\eta^{+}) \le F(y) - \min_x F(x). \qquad (6)

Remark: The proof of Corollary 1 is immediate by employing the convexity of F and Proposition 2.

Let F_* denote the optimal objective value of (2) and Ω_* denote the optimal set. Denote by S_ξ = {x : F(x) ≤ F_* + ξ} the ξ-sublevel set of F. Let D(x, Ω_*) = min_{v∈Ω_*} ‖x − v‖₂.

The proximal gradient (PG) method solves the problem (2) by the update

 x_{t+1} = P_{\eta g}(x_t - \eta\nabla f(x_t)), \qquad (7)

with η ≤ 1/L, starting from some initial solution x₁. It can be shown that PG has an iteration complexity of O(1/ϵ). The convergence guarantee of PG is presented in the following proposition.
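The update (7) is a one-line loop in code. A minimal sketch, assuming an illustrative least-squares f(x) = ½‖Ax − b‖² with ℓ₁ regularizer (all names and data are hypothetical, chosen only to make the example runnable):

```python
import numpy as np

def soft_threshold(u, t):
    # prox of t*||.||_1 (elementwise soft-thresholding)
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proximal_gradient(A, b, lam, T):
    """PG via update (7) for F(x) = 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f
    eta = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(T):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - eta * grad, eta * lam)   # update (7)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
x_hat = proximal_gradient(A, A @ x_true, lam=0.0, T=2000)
```

With lam = 0 the update degenerates to plain gradient descent, and on this noiseless instance the iterates approach the least-squares solution x_true.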

###### Proposition 3

(Nesterov, 2004) Let (7) run for T iterations with η ≤ 1/L. Then we have

 F(x_{T+1}) - F_* \le \frac{D(x_1, \Omega_*)^2}{2\eta T}.

Based on the above proposition, one can deduce that PG has an iteration complexity of O(1/ϵ). Nevertheless, accelerated proximal gradient (APG) converges faster than PG. There are many variants of APG in the literature (Tseng, 2008). The simplest variant adopts the following update:

 y_t = x_t + \beta_t(x_t - x_{t-1}), \qquad x_{t+1} = P_{\eta g}(y_t - \eta\nabla f(y_t)), \qquad (8)

where β_t = (t − 1)/(t + 2) and x₀ = x₁. APG enjoys an iteration complexity of O(1/√ϵ) (Tseng, 2008). The convergence guarantee of APG is presented in the following proposition.

###### Proposition 4

(Tseng, 2008) Let (8) run for T iterations with η ≤ 1/L and β_t = (t − 1)/(t + 2). Then we have

 F(x_{T+1}) - F_* \le \frac{2D(x_1, \Omega_*)^2}{\eta(T+1)^2}.

Based on the above proposition, one can deduce that APG has an iteration complexity of O(1/√ϵ).
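In code, the momentum scheme (8) differs from PG only by the extrapolation step. A sketch on the same illustrative least-squares-plus-ℓ₁ objective as before (with the common choice β_t = (t − 1)/(t + 2), an assumption on our part):

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def apg(A, b, lam, T):
    """APG, update (8): y_t = x_t + beta_t*(x_t - x_{t-1}),
    x_{t+1} = P_{eta g}(y_t - eta*grad f(y_t))."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    x_prev = x = np.zeros(A.shape[1])
    for t in range(1, T + 1):
        beta = (t - 1.0) / (t + 2.0)
        y = x + beta * (x - x_prev)               # extrapolation
        grad = A.T @ (A @ y - b)
        x_prev, x = x, soft_threshold(y - eta * grad, eta * lam)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
x_apg = apg(A, A @ x_true, lam=0.0, T=3000)
```

By Proposition 4, after T iterations the objective gap is O(1/T²), versus O(1/T) for the plain PG loop.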

Furthermore, if f is both L-smooth and α-strongly convex, one can set β_t = (√L − √α)/(√L + √α) and deduce a linear convergence (Lin and Xiao, 2014) with a better dependence on the condition number than that of PG.

###### Proposition 5

(Lin and Xiao, 2014) Assume f is L-smooth and α-strongly convex. Let (8) run for T iterations with η = 1/L, β_t = (√L − √α)/(√L + √α), and x₀ = x₁. Then for any x we have

 F(x_{T+1}) - F(x) \le \Big(1 - \sqrt{\tfrac{\alpha}{L}}\Big)^{T}\Big[F(x_0) - F(x) + \frac{\alpha}{2}\|x_0 - x\|_2^2\Big].

If g is α-strongly convex and f is L-smooth, Nesterov (2007) proposed a different variant based on dual averaging, which is referred to as the accelerated dual gradient (ADG) method and will be useful for our development. The key steps are presented in Algorithm 1. The convergence guarantee of ADG is given in the following proposition.

###### Proposition 6

(Nesterov, 2007) Assume f is L-smooth and g is α-strongly convex. Let Algorithm 1 run for T iterations. Then for any x we have

 F(x_{T+1}) - F(x) \le \frac{L}{2}\|x_0 - x\|_2^2\Big(\frac{1}{1 + \sqrt{\alpha/(2L)}}\Big)^{2T}.

### A Hölderian error bound (HEB) condition

###### Definition 1 (Hölderian error bound (HEB))

A function F(x) is said to satisfy a HEB condition on the ξ-sublevel set S_ξ if there exist θ ∈ (0, 1] and c > 0 such that for any x ∈ S_ξ

 \mathrm{dist}(x, \Omega_*) \le c\,(F(x) - F_*)^{\theta}, \qquad (9)

where Ω_* denotes the optimal set of (2).

The HEB condition is closely related to the Łojasiewicz inequality or, more generally, the Kurdyka-Łojasiewicz (KL) inequality in real algebraic geometry. It has been shown that when functions are semi-algebraic and continuous, the above inequality holds on any compact set (Bolte et al., 2015). We refer the readers to (Bolte et al., 2015) for more discussions on the HEB and KL inequalities.

In the remainder of this section, we review some previous results to demonstrate that HEB is a generic condition that holds for a broad family of problems of interest. The following proposition states that any proper, coercive, convex, lower-semicontinuous and semi-algebraic function satisfies the HEB condition.

###### Proposition 7

(Bolte et al., 2015) Let F be a proper, coercive, convex, lower semicontinuous and semi-algebraic function. Then there exist θ ∈ (0, 1] and c > 0 such that F satisfies the HEB on any ξ-sublevel set S_ξ.

Example: Most optimization problems in machine learning with an objective that consists of an empirical loss that is semi-algebraic (e.g., hinge loss, squared hinge loss, absolute loss, square loss) and a p-norm regularization (where p ≥ 1 is a rational) or a norm constraint are proper, coercive, lower semicontinuous and semi-algebraic functions.

The next two propositions exhibit the value of θ for piecewise convex quadratic functions and piecewise convex polynomial functions.

###### Proposition 8

(Li, 2013) Let F(x) be a piecewise convex quadratic function on R^d. Suppose F(x) is convex. Then for any ξ > 0, there exists c > 0 such that

 D(x, \Omega_*) \le c\,(F(x) - F_*)^{1/2}, \quad \forall x \in S_\xi.

Many problems in machine learning are piecewise convex quadratic functions, which will be discussed more in Section 7.

###### Proposition 9

(Li, 2013) Let F(x) be a piecewise convex polynomial function on R^d. Suppose F(x) is convex. Then for any ξ > 0, there exists c > 0 such that

 D(x, \Omega_*) \le c\,(F(x) - F_*)^{\frac{1}{(\deg(F)-1)^d + 1}}, \quad \forall x \in S_\xi.

Indeed, for a polyhedral constrained convex polynomial, we can have a tighter result, as shown below.

###### Proposition 10

(Yang, 2009) Let F(x) be a convex polynomial function on R^d with degree m. If Ω is a polyhedral set, then the problem min_{x∈Ω} F(x) admits a global error bound: there exists c > 0 such that

 D(x, \Omega_*) \le c\big[(F(x) - F_*) + (F(x) - F_*)^{\frac{1}{m}}\big], \qquad (10)

From the global error bound (10), one can easily derive the Hölderian error bound condition (4). As an example, we can consider an ℓ₁-constrained ℓ_p norm regression (Nyquist, 1983):

 \min_{\|x\|_1 \le s} F(x) \triangleq \frac{1}{n}\sum_{i=1}^{n}\big(a_i^{\top}x - b_i\big)^{p}, \quad p \in 2\mathbb{N}, \qquad (11)

which satisfies the HEB condition (4) with θ = 1/p.
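As a quick one-dimensional sanity check (an illustrative instance of (11) with n = 1, a = 1, b = 0, and the constraint inactive), F(x) = x^p has Ω_* = {0} and dist(x, Ω_*) = |x| = (F(x) − F_*)^{1/p}, so the HEB holds with θ = 1/p and c = 1:

```python
p = 4                                  # F(x) = x^4, p in 2N
F = lambda x: x ** p
theta, c = 1.0 / p, 1.0
for x in [0.5, 0.1, 0.01]:
    dist = abs(x)                      # distance to the optimal set {0}
    assert dist <= c * (F(x) - 0.0) ** theta + 1e-12
```

Note the exponent θ = 1/4 < 1/2 here: near the optimum the objective grows much flatter than a quadratic, which is exactly the regime where the QGC fails but the HEB still applies.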

Many previous papers have considered a family of structured smooth composite problems:

 \min_{x\in\mathbb{R}^d} F(x) = h(Ax) + g(x), \qquad (12)

where g is a polyhedral function and h is a smooth and strongly convex function on any compact set. Suppose the optimal set of the above problem is non-empty and compact (e.g., when the function F is coercive), so is the sublevel set S_ξ; it can be shown that such a function satisfies HEB with θ = 1/2 on any sublevel set S_ξ. Examples of h include the logistic loss h(z) = Σ_i log(1 + exp(−z_i)).

###### Proposition 11

(Necoara et al., 2015, Theorem 4.3) Suppose the optimal set of (12) is non-empty and compact, g is a polyhedral function and h is a smooth and strongly convex function on any compact set. Then F satisfies the HEB on any sublevel set S_ξ with θ = 1/2 for any ξ > 0.

Finally, we note that there exist problems that admit HEB with θ > 1/2. A trivial example is given by F(x) = ‖x‖₂, which satisfies HEB with θ = 1. An interesting non-trivial family of problems is that in which F is a piecewise linear function, which also admits θ = 1 according to Proposition 9. PG or APG applied to such a family of problems is closely related to the proximal point algorithm (Rockafellar, 1976). Exploration of such algorithmic connections is not the focus of this paper.

## 4 PG and restarting APG under HEB

As a warm-up and motivation of the major contribution presented in the next section, we present a convergence result of PG and a restarting APG under the HEB condition. We first present a result for PG, as shown in Algorithm 2.

###### Theorem 1

Suppose F(x₁) − F_* ≤ ϵ₀ and F satisfies HEB on S_{ϵ₀}. The iteration complexity of PG (with option I) for achieving F(x_K) − F_* ≤ ϵ is O(c²L log(ϵ₀/ϵ)) if θ = 1/2, and is O(c²L/ϵ^{1−2θ}) if θ < 1/2.

Proof  Divide the whole FOR loop of Algorithm 2 into K stages, denote by t_k the number of iterations in the k-th stage, and denote by x_k the updated solution at the end of the k-th stage, where k = 1, …, K. Define ϵ_k = ϵ₀/2^k.

Choose ϵ₀ ≥ F(x₁) − F_*, and we will prove F(x_k) − F_* ≤ ϵ_k by induction. Suppose F(x_{k−1}) − F_* ≤ ϵ_{k−1}; we then have x_{k−1} ∈ S_{ϵ₀}. According to Proposition 3, at the k-th stage, we have

 F(x_k) - F_* \le \frac{L\|x_{k-1} - x_{k-1}^*\|_2^2}{2t_k},

where x_{k−1}^* = arg min_{v∈Ω_*} ‖x_{k−1} − v‖₂, the closest point to x_{k−1} in the optimal set. By the HEB condition, we have

 F(x_k) - F_* \le \frac{c^2 L\,\epsilon_{k-1}^{2\theta}}{2t_k}.

Since t_k ≥ c²Lϵ_{k−1}^{2θ−1}, we have F(x_k) − F_* ≤ ϵ_{k−1}/2 = ϵ_k. The total number of iterations is

 \sum_{k=1}^{K} t_k \le O\Big(c^2 L\sum_{k=1}^{K}\epsilon_{k-1}^{2\theta-1}\Big).

From the above analysis, we see that after each stage the optimality gap decreases by half, so taking K = ⌈log₂(ϵ₀/ϵ)⌉ guarantees F(x_K) − F_* ≤ ϵ.

If θ = 1/2, the iteration complexity is O(c²L log(ϵ₀/ϵ)). If θ > 1/2, the geometric sum converges and the iteration complexity is finite, namely O(log(ϵ₀/ϵ) + c²Lϵ₀^{2θ−1}). If θ < 1/2, the iteration complexity is

 \sum_{k=1}^{K} t_k \le O\Big(c^2 L\sum_{k=1}^{K}\Big(\frac{\epsilon_0}{2^{k-1}}\Big)^{2\theta-1}\Big) = O\big(c^2 L/\epsilon^{1-2\theta}\big).

Next, we show that APG can be made adaptive to HEB by periodically restarting it, given c and θ. This is similar to (Necoara et al., 2015) under the QGC. The steps of restarting APG (rAPG) are presented in Algorithm 3, where we employ the simplest variant of APG.
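The restarting scheme is easy to sketch in code: an outer loop halves the target gap ϵ_k each stage and runs APG for a stage budget t_k driven by the HEB parameters. This is a sketch only, assuming the smooth unconstrained case (g ≡ 0), the momentum β_t = (t − 1)/(t + 2), and the stage budget t_k = ⌈2c√L ϵ_{k−1}^{θ−1/2}⌉ suggested by the analysis; `apg_stage` and `rapg` are illustrative names, not the paper's Algorithm 3:

```python
import math
import numpy as np

def apg_stage(grad, L, x0, t_k):
    """One stage of APG (update (8)) on a smooth f, given its gradient."""
    eta = 1.0 / L
    x_prev = x = x0.copy()
    for t in range(1, t_k + 1):
        beta = (t - 1.0) / (t + 2.0)
        y = x + beta * (x - x_prev)
        x_prev, x = x, y - eta * grad(y)
    return x

def rapg(grad, L, x0, c, theta, eps0, eps):
    """Restarting APG: restart from each stage's output while
    halving the target optimality gap eps_k."""
    K = max(1, math.ceil(math.log2(eps0 / eps)))
    x, eps_k = x0, eps0
    for _ in range(K):
        t_k = math.ceil(2.0 * c * math.sqrt(L) * eps_k ** (theta - 0.5))
        x = apg_stage(grad, L, x, t_k)
        eps_k /= 2.0
    return x

# Example: f(x) = 0.5*||x||^2 (L = 1, QGC holds, so theta = 1/2).
out = rapg(lambda x: x, 1.0, np.ones(3), c=1.5, theta=0.5, eps0=1.0, eps=1e-6)
```

Note that both c and θ enter the stage budget, which is exactly the practical drawback discussed below Theorem 2.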

###### Theorem 2

Suppose F(x₁) − F_* ≤ ϵ₀ and F satisfies HEB on S_{ϵ₀}. By running Algorithm 3 with t_k = ⌈2c√L ϵ_{k−1}^{θ−1/2}⌉ and K = ⌈log₂(ϵ₀/ϵ)⌉, we have F(x_K) − F_* ≤ ϵ. The iteration complexity of rAPG is O(c√L log(ϵ₀/ϵ)) if θ = 1/2, and if θ < 1/2 it is O(c√L/ϵ^{1/2−θ}).

Proof  Similar to the proof of Theorem 1, we will prove by induction that F(x_k) − F_* ≤ ϵ_k. Assume that F(x_{k−1}) − F_* ≤ ϵ_{k−1}. Hence, x_{k−1} ∈ S_{ϵ₀}. Then according to Proposition 4 and the HEB condition, we have

 F(x_k) - F_* \le \frac{2c^2 L\,\epsilon_{k-1}^{2\theta}}{(t_k + 1)^2}.

Since t_k + 1 ≥ 2c√L ϵ_{k−1}^{θ−1/2}, we have

 F(x_k) - F_* \le \frac{\epsilon_{k-1}}{2} = \epsilon_k.

After K = ⌈log₂(ϵ₀/ϵ)⌉ stages, we have F(x_K) − F_* ≤ ϵ. The total number of iterations is

 T_K = \sum_{k=1}^{K} t_k \le O\Big(c\sqrt{L}\sum_{k=1}^{K}\epsilon_{k-1}^{\theta - 1/2}\Big).

When θ = 1/2, we have T_K ≤ O(c√L log(ϵ₀/ϵ)). When θ < 1/2, we have

 T_K \le O\big(\max\{c\sqrt{L}\log(\epsilon_0/\epsilon),\ c\sqrt{L}/\epsilon^{1/2-\theta}\}\big).

From Algorithm 3, we can see that rAPG requires the knowledge of c besides θ to restart APG. However, for many problems of interest, the value of c is unknown, which makes rAPG impractical. To address this issue, we propose to use the magnitude of the proximal gradient as a measure for restart and termination. Previous work (Nesterov, 2004) has considered strongly convex optimization problems where the strong convexity parameter is unknown, also using the magnitude of the (proximal) gradient as a measure for restart and termination. However, in order to achieve faster convergence under the HEB condition without strong convexity, we have to introduce a novel technique of adaptive regularization that adapts to the HEB. With a novel synthesis of the adaptive regularization and a conditional restarting that searches for the value of c, we are able to develop practical adaptive accelerated gradient methods.

Before diving into the details of the proposed algorithm, we first present a variant of PG as a baseline for comparison, motivated by (Nesterov, 2012) for smooth problems, which enjoys faster convergence than the vanilla PG in terms of the proximal gradient's norm. The idea is to return a solution that achieves the minimum magnitude of the proximal gradient (option II in Algorithm 2). The convergence of this variant under HEB is presented in the following theorem.
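Option II only changes what the algorithm returns: alongside the usual PG iterates it tracks the iterate with the smallest (proximal) gradient norm seen so far. A sketch for the smooth case (g ≡ 0, so G(x) = ∇f(x); `min_pg` is an illustrative name for what the text calls minPG):

```python
import numpy as np

def min_pg(grad, L, x0, T):
    """PG with option II: run update (7) with g == 0 and return the
    iterate with the minimum gradient norm over T steps."""
    eta = 1.0 / L
    x = x0.copy()
    best_x, best_norm = x.copy(), np.linalg.norm(grad(x))
    for _ in range(T):
        g = grad(x)
        if np.linalg.norm(g) < best_norm:
            best_x, best_norm = x.copy(), np.linalg.norm(g)
        x = x - eta * g                 # plain PG step
    return best_x, best_norm

# Example: f(x) = 0.5*||x||^2, so grad(x) = x and L = 1.
_, bn = min_pg(lambda x: x, 1.0, np.ones(2), T=5)
```

The returned quantity min_τ ‖G(x_τ)‖₂ is exactly what Theorem 3 controls, and it also serves as a computable termination certificate, unlike the objective gap.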

###### Theorem 3

Suppose F(x₁) − F_* ≤ ϵ₀ and F satisfies HEB on S_{ϵ₀}. The iteration complexity of PG (with option II) for achieving min_τ ‖G(x_τ)‖₂ ≤ ϵ is O(c²L log(ϵ₀/ϵ)) if θ = 1/2, and is O(c²L/ϵ^{(1−2θ)/(1−θ)}) if θ < 1/2.

Proof  By the update of Algorithm 2 with option II (η = 1/L) and Corollary 1, we have

 F(x_\tau) - F(x_{\tau+1}) \ge \frac{1}{2L}\|G(x_\tau)\|_2^2.

Let 1 ≤ j ≤ t. Summing over τ = j, …, t gives

 F(x_j) - F(x_{t+1}) \ge \frac{1}{2L}\sum_{\tau=j}^{t}\|G(x_\tau)\|_2^2.

Since F(x_{t+1}) ≥ F_* and t − j + 1 ≥ j (taking t ≥ 2j − 1), we have

 \frac{j}{2L}\min_{1\le\tau\le t}\|G(x_\tau)\|_2^2 \le F(x_j) - F_*.

Hence,

 \min_{1\le\tau\le t}\|G(x_\tau)\|_2^2 \le \frac{2L}{j}\big(F(x_j) - F_*\big). \qquad (13)

We consider three scenarios for the value of θ.

(I). If θ > 1/2, according to Theorem 1, we know that F(x_t) − F_* converges to 0 in a finite number of steps, so min_τ ‖G(x_τ)‖₂ converges to 0 in a finite number of steps.

(II). If θ = 1/2, let t = 2j and let j be large enough that F(x_j) − F_* ≤ ϵ̂ ≜ jϵ²/(2L); by Theorem 1, j = O(c²L log(ϵ₀/ϵ)) suffices, where the constant a hidden in the big-O notation is accounted for below. According to Theorem 1, we have

 F(x_j) - F_* \le \hat{\epsilon}, \qquad (14)

then the inequalities (13), (14) and the choice of ϵ̂ yield

 \min_{1\le\tau\le t}\|G(x_\tau)\|_2^2 \le \frac{2L}{j}\big(F(x_j) - F_*\big) \le \epsilon^2,

so we know that min_{1≤τ≤t} ‖G(x_τ)‖₂ ≤ ϵ.

(III). If θ < 1/2, let j be an index such that F(x_j) − F_* ≤ ϵ′, which by Theorem 1 holds with j = 2ac²L/ϵ′^{1−2θ}, where a is the constant hidden in the big-O notation. We can set t = 2j, and hence 2L/j = ϵ′^{1−2θ}/(ac²), and have

 \min_{1\le\tau\le t}\|G(x_\tau)\|_2^2 \le \frac{2L}{j}\big(F(x_j) - F_*\big) \le \epsilon'\cdot\frac{\epsilon'^{\,1-2\theta}}{ac^2} = \frac{\epsilon'^{\,2-2\theta}}{ac^2}.

Letting ϵ′^{2−2θ}/(ac²) = ϵ², we have ϵ′ = (ac²ϵ²)^{1/(2−2θ)}. We can conclude min_{1≤τ≤t} ‖G(x_τ)‖₂ ≤ ϵ.

By combining the three scenarios, we can complete the proof.

The final theorem in this section summarizes a convergence result of PG for minimizing a proper, coercive, convex, lower semicontinuous and semi-algebraic function, which could be of independent interest.

###### Theorem 4

Let F be a proper, coercive, convex, lower semicontinuous and semi-algebraic function. Then PG (with option I and option II) converges at a speed of O(1/t^{1/(1−2θ)}) for F(x_t) − F_* and O(1/t^{(1−θ)/(1−2θ)}) for min_τ ‖G(x_τ)‖₂, respectively, where t is the total number of iterations and θ is the exponent in the HEB condition guaranteed by Proposition 7.

Remark: This can be easily proved by combining Proposition 7 and Theorems 1 and 3.

In the following two sections, we will present adaptive accelerated gradient converging methods that are faster than minPG for the convergence of the (proximal) gradient's norm. For simplicity, we first consider the following unconstrained optimization problem:

 \min_{x\in\mathbb{R}^d} f(x),

where f is an L-smooth function. We abuse Ω_* to denote the optimal set of the above problem. The lemma below bounds the distance of a point to the optimal set by a function of the gradient's norm.

###### Lemma 1

Suppose f satisfies the HEB on S_ξ, i.e., there exist θ and c > 0 such that for any x ∈ S_ξ we have

 D(x, \Omega_*) \le c\,(f(x) - f_*)^{\theta}.

If x ∈ S_ξ, then

 D(x, \Omega_*) \le c^{\frac{1}{1-\theta}}\|\partial f(x)\|_2^{\frac{\theta}{1-\theta}}.

If x ∉ S_ξ, then

 D(x, \Omega_*) \le \frac{c^2}{\xi}\|\partial f(x)\|_2.

The proof of this lemma is included in the Appendix.

Note that for a smooth function f, we can restrict our discussion of the HEB condition to θ ∈ (0, 1/2]. Since f(x) − f_* ≤ (L/2)D(x, Ω_*)² by smoothness, where D(x, Ω_*) is attained at the closest optimal point, plugging this inequality into the HEB we can see that θ has to be no greater than 1/2 if c remains a constant. In order to derive faster convergence than minPG, we employ the technique of regularization, i.e., adding a strongly convex regularizer to the objective. To this end, we define the following problem:

 f_\delta(x) = f(x) + \frac{\delta}{2}\|x - x_0\|_2^2,

where x₀ is the initial solution. It is clear that f_δ(x) is an (L + δ)-smooth and δ-strongly convex function. The proposed adaAGC algorithm will run in multiple stages. At the k-th stage, we construct a problem as above using a value δ_k and an initial solution x_{k−1}, and employ APG for smooth and strongly convex minimization to solve the constructed problem until the gradient's norm is decreased by a constant factor. The initial solution for each stage is the output solution of the previous stage, and the value of δ_k will be adaptively decreased based on θ in the HEB condition. Specifically, the choice of δ_k can be set in the following way:

 \delta_k = \frac{\epsilon_{k-1}^{\frac{1-2\theta}{1-\theta}}}{6c^{\frac{1}{1-\theta}}e}. \qquad (15)

We also embed a search procedure for the value of c into the algorithm in order to leverage the HEB condition. The detailed steps of adaAGC for solving (1) are presented in Algorithm 4, assuming f satisfies a HEB condition.
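A schematic of one stage may help fix ideas. This is a sketch only: δ_k follows (15), the APG solver for the δ-strongly-convex surrogate is abbreviated as plain gradient steps, and the conditional restarting / search over c is omitted; all names are illustrative and this is not the paper's Algorithm 4:

```python
import math
import numpy as np

def grad_f_delta(grad_f, x, x0, delta):
    # gradient of f_delta(x) = f(x) + (delta/2)*||x - x0||^2
    return grad_f(x) + delta * (x - x0)

def ada_agc_stage(grad_f, L, x0, eps_prev, theta, c):
    """One adaAGC stage (sketch): minimize the delta-strongly-convex
    surrogate f_delta until the gradient norm of f is halved."""
    delta = eps_prev ** ((1 - 2 * theta) / (1 - theta)) / (
        6 * c ** (1 / (1 - theta)) * math.e)        # choice (15)
    eta = 1.0 / (L + delta)                         # f_delta is (L+delta)-smooth
    x, target = x0.copy(), 0.5 * np.linalg.norm(grad_f(x0))
    for _ in range(10000):                          # budget in lieu of theory
        x = x - eta * grad_f_delta(grad_f, x, x0, delta)
        if np.linalg.norm(grad_f(x)) <= target:
            break
    return x

# Example: f(x) = 0.5*||x||^2 (L = 1, theta = 1/2, so delta_k is constant).
xs = ada_agc_stage(lambda x: x, 1.0, np.ones(2), eps_prev=1.0, theta=0.5, c=1.0)
```

The key point visible even in this toy version: the surrogate is δ-strongly convex by construction, so the fast strongly-convex machinery (Propositions 5 and 6) applies within each stage without f itself being strongly convex.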

Below, we first present the analysis for each stage to pave the way for the proof of our main theorem.

###### Theorem 5

Suppose f is L-smooth. By running the update in (8) for solving f_δ with