# Proximal Gradient Method with Extrapolation and Line Search for a Class of Nonconvex and Nonsmooth Problems

Lei Yang (Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, P.R. China; lei.yang@connect.polyu.hk)

###### Abstract

In this paper, we consider a class of possibly nonconvex, nonsmooth and non-Lipschitz optimization problems arising in many contemporary applications such as machine learning, variable selection and image processing. To solve this class of problems, we propose a proximal gradient method with extrapolation and line search (PGels). This method is developed based on a special potential function and incorporates both extrapolation and non-monotone line search, two simple and efficient acceleration techniques for the proximal gradient method. Thanks to the line search, the method allows more flexibility in choosing the extrapolation parameters and updates them adaptively at each iteration whenever a certain line search criterion is not satisfied. Moreover, with proper choices of parameters, PGels reduces to many existing algorithms. We also show that, under some mild conditions, our line search criterion is well defined and any cluster point of the sequence generated by PGels is a stationary point of our problem. In addition, under an assumption on the Kurdyka-Łojasiewicz exponent of the objective in our problem, we further analyze the local convergence rate of two special cases of PGels, including the widely used non-monotone proximal gradient method as one case. Finally, we conduct numerical experiments on the regularized logistic regression problem and the regularized least squares problem. Our numerical results illustrate the efficiency of PGels and show the potential advantage of combining the two acceleration techniques.

Keywords:  Proximal gradient method; extrapolation; non-monotone; line search; stationary point.

## 1 Introduction

In this paper, we consider the following composite optimization problem:

$$\min_{x\in\mathbb{R}^n}\ F(x):=f(x)+P(x), \tag{1.1}$$

where $f$ and $P$ satisfy Assumption 1.1. We also assume that the proximal mapping of $\nu P$ is easy to compute for all $\nu>0$ (see the next section for notation and definitions).

###### Assumption 1.1.
• (i) $f$ is a continuously differentiable (possibly nonconvex) function with Lipschitz continuous gradient, i.e., there exists some Lipschitz constant $L_f$ such that

$$\|\nabla f(x)-\nabla f(y)\|\le L_f\|x-y\|,\quad\forall\,x,y\in\mathbb{R}^n.$$

• (ii) $P$ is a proper closed (possibly nonconvex, nonsmooth and non-Lipschitz) function; it is bounded below and continuous on its domain.

• (iii) $F$ is level-bounded.

Problem (1.1) arises in many contemporary applications such as machine learning [14, 34], variable selection [16, 22, 23, 36, 47] and image processing [10, 30]. In general, $f$ is a loss or fitting function used for measuring the deviation of a solution from the observations. Two commonly used loss functions are

• least squares loss function: $f(x)=\frac{1}{2}\|Ax-b\|^2$,

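• logistic loss function: $f(x)=\sum_{i=1}^{m}\log\big(1+\exp(-b_i\,a_i^\top x)\big)$, with $a_i^\top$ denoting the $i$-th row of $A$,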

where $A$ is a data matrix and $b$ is an observed vector. One can verify that these two loss functions satisfy Assumption 1.1(i). On the other hand, the function $P$ is usually a regularizer used for inducing certain structure in the solution. For example, $P$ can be the indicator function of a set such as the nonnegative orthant or a simplex; the former choice restricts the elements of the solution to be nonnegative and the latter choice restricts the solution to a simplex. We can also choose $P$ to be a sparsity-inducing regularizer, such as those studied in [22, 23, 36], [30] and [46], each scaled by a regularization parameter. Note that all the aforementioned examples of $P$, as well as many other widely used regularizers (see [1, 11] and references therein), satisfy Assumption 1.1(ii). Finally, we point out that Assumption 1.1(iii) is also satisfied by many choices of $f$ and $P$ in practice; see, for example, (5.1) and (5.8). More examples of (1.1) can be found in [10, 14, 34] and references therein.

Due to the importance and popularity of (1.1), various attempts have been made to solve it efficiently, especially when the problem involves a large number of variables. One popular class of methods for solving (1.1) is first-order methods, owing to their cheap iteration cost and good convergence properties. Among them, the proximal gradient (PG) method [17, 25] (also known as the forward-backward splitting algorithm [13] or the iterative shrinkage-thresholding algorithm [5]) is arguably the most fundamental one, whose basic iteration is

$$x^{k+1}\in\operatorname*{Argmin}_{x}\left\{\langle\nabla f(x^k),\,x\rangle+\frac{\mu}{2}\|x-x^k\|^2+P(x)\right\}, \tag{1.2}$$

where $\mu>0$ is a constant depending on the Lipschitz constant of $\nabla f$. However, PG can be slow in practice; see, for example, [5, 32, 39]. Therefore, a large amount of research has been conducted to accelerate PG for solving (1.1). One simple and widely studied strategy is to perform extrapolation in the spirit of Nesterov’s extrapolation techniques [27, 28], whose basic idea is to make use of historical information at each iteration. A typical scheme of the proximal gradient method with extrapolation (PGe) for solving (1.1) is

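$$\begin{cases}y^k=x^k+\beta_k(x^k-x^{k-1}),\\[2pt] x^{k+1}\in\operatorname*{Argmin}_{x}\left\{\langle\nabla f(y^k),\,x\rangle+\frac{\mu}{2}\|x-y^k\|^2+P(x)\right\},\end{cases} \tag{1.3}$$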

where $\beta_k$ is the extrapolation parameter satisfying certain conditions and $\mu$ is a constant depending on $L_f$. One representative algorithm that takes the form of (1.3) with proper choices of $\{\beta_k\}$ is the fast iterative shrinkage-thresholding algorithm (FISTA) [5], which is also known as the accelerated proximal gradient method (APG) independently proposed and studied by Nesterov [29]. This algorithm (FISTA or APG) is designed for solving (1.1) with $f$ and $P$ being convex and exhibits a faster convergence rate ($O(1/k^2)$) in terms of objective values (see [5, 29] for more details). This motivates the study of PGe and its variants for solving (1.1) under different scenarios; see, for example, [6, 18, 31, 32, 37, 38, 39, 40, 42, 43, 44]. It is worth noting that, when $f$ and $P$ are convex, many existing PGe variants [5, 6, 29, 32, 37, 38] choose the extrapolation parameters $\{\beta_k\}$ (explicitly or implicitly) based on the following updating scheme

$$\begin{cases}\beta_k=(t_{k-1}-1)/t_k,\\[2pt] t_{k+1}=\dfrac{1+\sqrt{1+4t_k^2}}{2},\end{cases}\quad\text{with}\ \ t_{-1}=t_0=1, \tag{1.4}$$

which originated from Nesterov’s work [27, 28] and was shown to be “optimal” [28]. However, for the nonconvex case, the “optimal” choices of $\beta_k$ are still not clear. Although the convergence of PGe and its variants can be guaranteed in theory for some classes of nonconvex problems under certain conditions on $\beta_k$ (see, for example, [18, 31, 39, 40, 42, 43, 44]), the choices of $\beta_k$ are relatively restrictive and hence may not work well for acceleration.

Another efficient strategy for accelerating PG is to apply a non-monotone line search to adaptively find a proper $\mu$ in (1.2) at each iteration. The non-monotone line search technique dates back to the non-monotone Newton’s method proposed by Grippo et al. [21] and has been applied to many algorithms with good empirical performance; see, for example, [7, 15, 45, 48]. Based on this technique, Wright et al. [41] recently proposed an efficient method (called SpaRSA) to solve (1.1), whose iteration is roughly given as follows: Choose $\tau>1$, $c>0$ and an integer $N\ge0$. Then, at the $k$-th iteration, choose $\mu_k^0>0$ and find the smallest nonnegative integer $j_k$ such that

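$$\begin{cases}u\in\operatorname*{Argmin}_{x}\left\{\langle\nabla f(x^k),\,x\rangle+\dfrac{\tau^{j_k}\mu_k^0}{2}\|x-x^k\|^2+P(x)\right\},\\[6pt] F(u)-\max\limits_{[k-N]_+\le i\le k}F(x^i)\le-\dfrac{c}{2}\|u-x^k\|^2.\end{cases}$$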

This method is essentially the non-monotone proximal gradient (NPG) method, namely, the proximal gradient method with a non-monotone line search. Later, NPG was extended for solving (1.1) under more general conditions and has been shown to have promising numerical performances in many applications (see, for example, [12, 20, 26]). In view of the above, it is natural to raise a question:

Can we derive an efficient method for solving (1.1), which takes advantage of both extrapolation and non-monotone line search?

In this paper, we propose such a method for solving (1.1) that incorporates both extrapolation and non-monotone line search and allows more flexibility in choosing the extrapolation parameters $\{\beta_k\}$. We call our method the proximal gradient method with extrapolation and line search (PGels). This method is developed based on the following potential function (specifically constructed for $F$ in (1.1)):

$$H_\delta(u,v,\mu):=F(u)+\frac{\delta\mu}{4}\|u-v\|^2,\quad\forall\,u,v\in\mathbb{R}^n,\ \mu>0, \tag{1.5}$$

where $\delta$ is a given nonnegative constant. Clearly, $H_\delta(u,v,\mu)\equiv F(u)$ if $\delta=0$. We will see in Section 3 that this potential function is used to establish a new non-monotone line search criterion (3.3) when the extrapolation technique is applied. This allows more choices of $\beta_k$ at each iteration, and the method adaptively updates $\mu_k$ and $\beta_k$ at the same time if the line search criterion is not satisfied (see Algorithm 1 for more details). The convergence analysis of PGels is also presented in Section 3. Specifically, under Assumption 1.1, we show that our line search criterion (3.3) is well defined and any cluster point of the sequence generated by PGels is a stationary point of (1.1). Moreover, since PGels reduces to PG, PGe or NPG with proper choices of parameters (see Remark 3.1), we obtain a unified convergence analysis for PG, PGe and NPG as a byproduct. In addition, in Section 4, we further study the local convergence rate in terms of objective values for two special cases of PGels (including NPG as one case) under an additional assumption on the Kurdyka-Łojasiewicz exponent of the objective in (1.1). To the best of our knowledge, this is the first local convergence rate analysis of NPG for solving (1.1). Finally, we conduct some numerical experiments in Section 5 to evaluate the performance of our method for solving the regularized logistic regression problem and the regularized least squares problem. Our computational results illustrate the efficiency of our method and show the potential advantage of combining extrapolation and non-monotone line search.

The rest of this paper is organized as follows. In Section 2, we present notation and preliminaries used in this paper. In Section 3, we describe PGels for solving (1.1) and study its global subsequential convergence. The local convergence rate of two special cases of PGels is analyzed in Section 4 and some numerical results are reported in Section 5. Finally, some concluding remarks are given in Section 6.
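To make the schemes (1.2), (1.3) and (1.4) recalled above concrete, the following minimal sketch runs PG and PGe on an $\ell_1$-regularized least squares problem. The test problem, the choice $\mu=L_f=\|A\|^2$, and the names `soft_threshold` and `pge_lasso` are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal mapping of t*||.||_1 (componentwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pge_lasso(A, b, lam, iters=300, extrapolate=True):
    """PG (1.2) or PGe (1.3)-(1.4) for 0.5*||Ax-b||^2 + lam*||x||_1, with mu = L_f."""
    mu = np.linalg.norm(A, 2) ** 2           # L_f = squared spectral norm of A
    x_prev = x = np.zeros(A.shape[1])
    t_prev = t = 1.0                         # t_{-1} = t_0 = 1
    for _ in range(iters):
        # beta_k = (t_{k-1} - 1) / t_k; beta_k = 0 recovers plain PG
        beta = (t_prev - 1.0) / t if extrapolate else 0.0
        y = x + beta * (x - x_prev)          # extrapolated point y^k
        grad_y = A.T @ (A @ y - b)
        x_prev, x = x, soft_threshold(y - grad_y / mu, lam / mu)
        t_prev, t = t, (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    return x

rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((40, 20)), rng.standard_normal(40), 0.1
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
print(F(pge_lasso(A, b, lam, extrapolate=False)), F(pge_lasso(A, b, lam)))
```

Here the subproblem in (1.2) and (1.3) has the closed-form solution given by soft-thresholding because $P=\lambda\|\cdot\|_1$; for other regularizers only `soft_threshold` would need to change.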

## 2 Notation and preliminaries

In this paper, we present scalars, vectors and matrices in lower case letters, bold lower case letters and upper case letters, respectively. We also use $\mathbb{R}$, $\mathbb{R}^n$, $\mathbb{R}^n_+$ and $\mathbb{R}^{m\times n}$ to denote the set of real numbers, $n$-dimensional real vectors, $n$-dimensional real vectors with nonnegative entries and $m\times n$ real matrices, respectively. For a vector $x\in\mathbb{R}^n$, $x_i$ denotes its $i$-th entry, $\|x\|$ denotes its Euclidean norm, $\|x\|_1$ denotes its $\ell_1$ norm defined by $\|x\|_1:=\sum_{i=1}^n|x_i|$, $\|x\|_p$ denotes its $\ell_p$-quasi-norm ($0<p<1$) defined by $\|x\|_p:=\big(\sum_{i=1}^n|x_i|^p\big)^{1/p}$ and $\|x\|_\infty$ denotes its $\ell_\infty$ norm given by the largest entry in magnitude. For a matrix $A$, its spectral norm is denoted by $\|A\|$, which is the largest singular value of $A$.

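For an extended-real-valued function $h:\mathbb{R}^n\to[-\infty,\infty]$, we say that it is proper if $h(x)>-\infty$ for all $x\in\mathbb{R}^n$ and its domain $\mathrm{dom}\,h:=\{x\in\mathbb{R}^n:h(x)<\infty\}$ is nonempty. A proper function is said to be closed if it is lower semicontinuous. We also use the notation $y\xrightarrow{h}x$ to denote $y\to x$ and $h(y)\to h(x)$. The basic subdifferential (see [33, Definition 8.3]) of $h$ at $x\in\mathrm{dom}\,h$ used in this paper is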

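$$\partial h(x):=\left\{d\in\mathbb{R}^n:\ \exists\,x^k\xrightarrow{h}x,\ d^k\to d\ \ \text{with}\ \liminf_{y\to x^k,\,y\neq x^k}\frac{h(y)-h(x^k)-\langle d^k,\,y-x^k\rangle}{\|y-x^k\|}\ge0\ \ \forall\,k\right\}.$$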

It can be observed from the above definition that

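$$\left\{d\in\mathbb{R}^n:\ \exists\,x^k\xrightarrow{h}x,\ d^k\to d\ \ \text{with}\ d^k\in\partial h(x^k)\ \text{for each}\ k\right\}\subseteq\partial h(x). \tag{2.1}$$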

When $h$ is continuously differentiable or convex, the above subdifferential coincides with the classical concept of derivative or convex subdifferential of $h$; see, for example, [33, Exercise 8.8] and [33, Proposition 8.12]. In addition, if $h$ has several groups of variables, we use $\partial_{x_i}h$ (resp., $\nabla_{x_i}h$) to denote the partial subdifferential (resp., gradient) of $h$ with respect to the group of variables $x_i$.

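For a proper closed function $h$ and $\nu>0$, the proximal mapping of $\nu h$ at $y\in\mathbb{R}^n$ is defined by

$$\mathrm{Prox}_{\nu h}(y):=\operatorname*{Argmin}_{x\in\mathbb{R}^n}\left\{h(x)+\frac{1}{2\nu}\|x-y\|^2\right\}.$$

Note that this operator is well defined for any $\nu>0$ if $h$ is bounded below in $\mathbb{R}^n$. For a closed set $\mathcal{X}\subseteq\mathbb{R}^n$, its indicator function $\delta_{\mathcal{X}}$ is defined by

$$\delta_{\mathcal{X}}(x)=\begin{cases}0,&\text{if}\ x\in\mathcal{X},\\ +\infty,&\text{otherwise}.\end{cases}$$

We also use $\mathrm{dist}(x,\mathcal{X})$ to denote the distance from $x$ to $\mathcal{X}$, i.e., $\mathrm{dist}(x,\mathcal{X}):=\inf_{y\in\mathcal{X}}\|x-y\|$.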
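As a small illustration of the proximal mapping defined above, the sketch below evaluates it for two standard choices of $h$: the $\ell_1$ norm (which gives componentwise soft-thresholding) and the indicator function of the nonnegative orthant (which gives the Euclidean projection). The function names are hypothetical and chosen only for this example.

```python
import numpy as np

def prox_l1(y, nu):
    """Prox of nu*||.||_1 at y: shrink each entry toward zero by nu."""
    return np.sign(y) * np.maximum(np.abs(y) - nu, 0.0)

def prox_nonneg(y, nu):
    """Prox of nu*(indicator of the nonnegative orthant) at y.

    The indicator only takes the values 0 and +inf, so the result is the
    Euclidean projection onto the set and does not depend on nu.
    """
    return np.maximum(y, 0.0)

def prox_objective_l1(x, y, nu):
    """Objective ||x||_1 + ||x - y||^2 / (2*nu) minimized by prox_l1(y, nu)."""
    return np.sum(np.abs(x)) + np.sum((x - y) ** 2) / (2.0 * nu)

y = np.array([1.5, -0.2, 0.7, -3.0])
p = prox_l1(y, 0.5)
# Sanity check: no random perturbation of p achieves a smaller objective value.
rng = np.random.default_rng(0)
trials = [p + 0.1 * rng.standard_normal(y.size) for _ in range(200)]
print(all(prox_objective_l1(p, y, 0.5) <= prox_objective_l1(t, y, 0.5) for t in trials))
```

Both mappings cost $O(n)$ operations, which is the sense in which the proximal mapping of such regularizers is "easy to compute".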

For any local minimizer $\bar{x}$ of (1.1), it is known from [33, Theorem 10.1] and [33, Exercise 8.8(c)] that the following first-order necessary condition holds:

$$0\in\nabla f(\bar{x})+\partial P(\bar{x}), \tag{2.2}$$

where $\nabla f$ denotes the gradient of $f$. In this paper, we say that $x^*$ is a stationary point of (1.1) if $x^*$ satisfies (2.2) in place of $\bar{x}$.

We next recall the Kurdyka-Łojasiewicz (KL) property (see [2, 3, 4, 8, 9] for more details), which plays an important role in our analysis of the local convergence rate in Section 4. For notational simplicity, let $\Xi_\nu$ ($\nu>0$) denote the class of concave functions $\varphi:[0,\nu)\to\mathbb{R}_+$ satisfying: (i) $\varphi(0)=0$; (ii) $\varphi$ is continuously differentiable on $(0,\nu)$ and continuous at $0$; (iii) $\varphi'(t)>0$ for all $t\in(0,\nu)$. Then, the KL property can be described as follows.

###### Definition 2.1 (KL property and KL function).

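Let $h$ be a proper closed function.

• For $\tilde{x}\in\mathrm{dom}\,\partial h$, if there exist a $\nu\in(0,+\infty]$, a neighborhood $\mathcal{V}$ of $\tilde{x}$ and a function $\varphi\in\Xi_\nu$ such that for all $x\in\mathcal{V}\cap\{x\in\mathbb{R}^n:h(\tilde{x})<h(x)<h(\tilde{x})+\nu\}$, it holds that

$$\varphi'\big(h(x)-h(\tilde{x})\big)\,\mathrm{dist}\big(0,\partial h(x)\big)\ge1,$$

then $h$ is said to have the Kurdyka-Łojasiewicz (KL) property at $\tilde{x}$.

• If $h$ satisfies the KL property at each point of $\mathrm{dom}\,\partial h$, then $h$ is called a KL function.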

A large number of functions such as proper closed semialgebraic functions satisfy the KL property [3, 4]. Based on the above definition, we then introduce the KL exponent [3, 24].

###### Definition 2.2 (KL exponent).

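Suppose that $h$ is a proper closed function satisfying the KL property at $\tilde{x}\in\mathrm{dom}\,\partial h$ with $\varphi(t)=\tilde{a}\,t^{1-\theta}$ for some $\tilde{a}>0$ and $\theta\in[0,1)$, i.e., there exist $a,\varepsilon,\nu>0$ such that

$$\mathrm{dist}\big(0,\partial h(x)\big)\ge a\,\big(h(x)-h(\tilde{x})\big)^{\theta}$$

whenever $x\in\mathrm{dom}\,\partial h$, $\|x-\tilde{x}\|\le\varepsilon$ and $h(\tilde{x})<h(x)<h(\tilde{x})+\nu$. Then, $h$ is said to have the KL property at $\tilde{x}$ with an exponent $\theta$. If $h$ is a KL function and has the same exponent $\theta$ at any $\tilde{x}\in\mathrm{dom}\,\partial h$, then $h$ is said to be a KL function with an exponent $\theta$.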
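As a standard illustration of this definition (not taken from the paper), consider $h(x)=\frac{1}{2}\|x\|^2$ on $\mathbb{R}^n$ with reference point $\tilde{x}=0$. The short computation below suggests that $h$ has the KL property at $0$ with exponent $\theta=\frac{1}{2}$ and constant $a=\sqrt{2}$:

```latex
% Worked example: h(x) = (1/2)||x||^2 on R^n, reference point \tilde{x} = 0.
\begin{aligned}
\partial h(x) &= \{x\}, \qquad \operatorname{dist}\big(0,\partial h(x)\big) = \|x\|,\\
h(x)-h(0) &= \tfrac{1}{2}\|x\|^2 \;\Longrightarrow\; \|x\| = \sqrt{2}\,\big(h(x)-h(0)\big)^{1/2},\\
\operatorname{dist}\big(0,\partial h(x)\big) &\ge a\,\big(h(x)-h(\tilde{x})\big)^{\theta}
\quad\text{with}\ a=\sqrt{2},\ \theta=\tfrac{1}{2}.
\end{aligned}
```

The inequality holds here with equality for every $x\neq0$, so no restriction on $\varepsilon$ or $\nu$ is needed in this particular case.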

We also recall the following uniformized KL property, which was established in [9, Lemma 6].

###### Proposition 2.1 (Uniformized KL property).

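Suppose that $h$ is a proper closed function and $\Gamma$ is a compact set. If $h\equiv\zeta$ on $\Gamma$ for some constant $\zeta$ and $h$ satisfies the KL property at each point of $\Gamma$, then there exist $\varepsilon>0$, $\nu>0$ and $\varphi\in\Xi_\nu$ such that

$$\varphi'\big(h(x)-\zeta\big)\,\mathrm{dist}\big(0,\partial h(x)\big)\ge1$$

for all $x\in\{x\in\mathbb{R}^n:\mathrm{dist}(x,\Gamma)<\varepsilon\}\cap\{x\in\mathbb{R}^n:\zeta<h(x)<\zeta+\nu\}$.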

Finally, we recall two useful lemmas, which can be found in [24].

###### Lemma 2.1 ([24, Lemma 2.2]).

Let . Then, for any , there exist such that .

###### Lemma 2.2 ([24, Lemma 3.1]).

Let . Then, for any , there exist and such that .

## 3 Proximal gradient method with extrapolation and line search and its convergence analysis

In this section, we present a proximal gradient method with extrapolation and line search (PGels) for solving (1.1). This method is developed based on a specially constructed potential function, which is defined in (1.5). The complete PGels for solving (1.1) is presented as Algorithm 1.

###### Remark 3.1 (Comments on special cases of PGels).

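In Algorithm 1, if $\delta=0$, then we have $\beta_k^0=0$ and hence $y^k=x^k$ for all $k\ge0$. In this case, our line search criterion (3.3) reduces to

$$F(u)-\max_{[k-N]_+\le i\le k}F(x^i)\le-\frac{c}{2}\|u-x^k\|^2.$$

Thus, our PGels reduces to NPG for solving (1.1) (see, for example, [12, 20, 41]). On the other hand, for any $\delta\in(0,1)$, if

$$\mu_k^0=\mu_{\max}\ge\frac{L_f+2c}{1-\delta}\quad\text{and}\quad\beta_k^0\le\sqrt{\frac{\delta(\mu_{\max}-L_f)\mu_{\max}}{4(\mu_{\max}+L_f)^2}}\quad\text{for all}\ k\ge0,$$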

then it follows from Lemma 3.1 that the line search criterion (3.3) holds trivially for all . Thus, in this case, we do not need to perform the line search loop and hence our PGels reduces to a PGe for solving (1.1). Finally, if and , then our PGels obviously reduces to PG for solving (1.1).

###### Remark 3.2 (Comments on the extrapolation parameters in PGels).

Unlike most existing PGe variants (mentioned in Section 1), which must choose the extrapolation parameters under certain schemes or conditions, our PGels can choose any $\beta_k^0\in[0,\delta\beta_{\max}]$ as an initial guess at each iteration and then updates it, together with $\mu_k$, adaptively if the line search criterion is not satisfied. This strategy allows more flexibility in choosing the extrapolation parameters and works well in our computational experiments in Section 5.
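Since Algorithm 1 is not reproduced in this text, the following is only a rough sketch of a PGels-type loop, reconstructed from the way steps (3.1) to (3.3) and (1d) are used in the proofs of this section. Several details are assumptions made for illustration: the update `mu <- min(tau*mu, mu_max)`, `beta <- eta*beta` after a failed line search, the fixed initial guesses for $\mu_k^0$ and $\beta_k^0$, the default for `mu_max`, the safeguards `max_inner` and `tol`, the $\ell_1$-regularized least squares test problem, and the name `pgels_l1_ls`.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pgels_l1_ls(A, b, lam, delta=0.5, c=1e-4, tau=2.0, eta=0.5, N=2,
                mu_init=1.0, mu_max=None, beta_max=1.0,
                iters=200, max_inner=60, tol=1e-10):
    """Sketch of a PGels-type method for 0.5*||Ax-b||^2 + lam*||x||_1."""
    Lf = np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of grad f
    if mu_max is None:
        mu_max = (Lf + 2.0 * c) / (1.0 - delta)         # assumed: large enough for Lemma 3.1
    F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
    grad = lambda x: A.T @ (A @ x - b)
    x_prev = x = np.zeros(A.shape[1])                   # x^{-1} = x^0
    H_hist = [F(x)]                                     # H_delta(x^0, x^{-1}, .) = F(x^0)
    for _ in range(iters):
        mu, beta = min(mu_init, mu_max), delta * beta_max   # assumed initial guesses
        H_ref = max(H_hist[-(N + 1):])                  # max over the last N+1 potential values
        for _ in range(max_inner):                      # inner line search loop
            y = x + beta * (x - x_prev)                 # extrapolation step (3.1)
            u = soft_threshold(y - grad(y) / mu, lam / mu)  # proximal subproblem (3.2)
            d2 = np.sum((u - x) ** 2)
            H_u = F(u) + delta * mu / 4.0 * d2          # potential function (1.5)
            if H_u - H_ref <= -c / 2.0 * d2:            # line search criterion (3.3)
                break
            mu, beta = min(tau * mu, mu_max), eta * beta    # assumed form of (1d)
        x_prev, x = x, u
        H_hist.append(H_u)
        if np.sqrt(d2) <= tol * max(1.0, np.linalg.norm(x)):
            break
    return x, H_hist

rng = np.random.default_rng(1)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
x, H = pgels_l1_ls(A, b, lam=0.1)
print(H[0], H[-1])
```

Setting `delta=0` forces `beta = 0` and turns the potential into $F$, so the same loop then behaves like a non-monotone proximal gradient method, in line with Remark 3.1.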

In the following, we will study the convergence properties of PGels. Before proceeding, we present the first-order optimality condition for the subproblem (3.2) in (1b) of Algorithm 1 as follows:

$$0\in\nabla f(y^k)+\mu_k(u-y^k)+\partial P(u). \tag{3.4}$$

We now start our convergence analysis by proving the following supporting lemma, which characterizes the descent property of our potential function.

###### Lemma 3.1 (Sufficient descent of $H_\delta$).

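Suppose that Assumption 1.1 holds and $\delta$ is a nonnegative constant. Let $\{x^k\}$ and $\{\bar{\mu}_k\}$ be the sequences generated by Algorithm 1, and let $u$ be the candidate generated by step (1b) at the $k$-th iteration. For any $k\ge0$, if

$$\mu_k>L_f\quad\text{and}\quad\beta_k\le\sqrt{\frac{\delta(\mu_k-L_f)\bar{\mu}_{k-1}}{4(\mu_k+L_f)^2}},$$

then we have

$$H_\delta(u,x^k,\mu_k)-H_\delta(x^k,x^{k-1},\bar{\mu}_{k-1})\le-\frac{(1-\delta)\mu_k-L_f}{4}\|u-x^k\|^2, \tag{3.5}$$

where $H_\delta$ is the potential function defined in (1.5).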

Proof. First, from (3.2), we have

$$\langle\nabla f(y^k),\,u-y^k\rangle+\frac{\mu_k}{2}\|u-y^k\|^2+P(u)\le\langle\nabla f(y^k),\,x^k-y^k\rangle+\frac{\mu_k}{2}\|x^k-y^k\|^2+P(x^k),$$

which implies that

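$$\begin{aligned}P(u)&\le P(x^k)+\langle\nabla f(y^k),\,x^k-u\rangle+\frac{\mu_k}{2}\|x^k-y^k\|^2-\frac{\mu_k}{2}\|u-y^k\|^2\\&=P(x^k)+\langle\nabla f(y^k),\,x^k-u\rangle+\frac{\mu_k}{2}\|x^k-y^k\|^2-\frac{\mu_k}{2}\|(u-x^k)+(x^k-y^k)\|^2\\&=P(x^k)+\langle\nabla f(y^k),\,x^k-u\rangle-\frac{\mu_k}{2}\|u-x^k\|^2+\mu_k\langle x^k-u,\,x^k-y^k\rangle.\end{aligned} \tag{3.6}$$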

On the other hand, using the fact that $\nabla f$ is Lipschitz continuous with Lipschitz constant $L_f$ (Assumption 1.1(i)), we see from [28, Lemma 1.2.3] that

$$f(u)\le f(x^k)+\langle\nabla f(x^k),\,u-x^k\rangle+\frac{L_f}{2}\|u-x^k\|^2. \tag{3.7}$$

Summing (3.6) and (3.7), we obtain that

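$$\begin{aligned}&f(u)+P(u)-f(x^k)-P(x^k)\\&\le-\frac{\mu_k-L_f}{2}\|u-x^k\|^2+\mu_k\langle x^k-u,\,x^k-y^k\rangle+\langle\nabla f(x^k)-\nabla f(y^k),\,u-x^k\rangle\\&\le-\frac{\mu_k-L_f}{2}\|u-x^k\|^2+\big(\mu_k\|x^k-y^k\|+\|\nabla f(x^k)-\nabla f(y^k)\|\big)\|u-x^k\|\\&\le-\frac{\mu_k-L_f}{2}\|u-x^k\|^2+(\mu_k+L_f)\|x^k-y^k\|\,\|u-x^k\|\\&\le-\frac{\mu_k-L_f}{2}\|u-x^k\|^2+\frac{\mu_k-L_f}{4}\|u-x^k\|^2+\frac{(\mu_k+L_f)^2}{\mu_k-L_f}\|x^k-y^k\|^2\\&=-\frac{\mu_k-L_f}{4}\|u-x^k\|^2+\frac{(\mu_k+L_f)^2}{\mu_k-L_f}\beta_k^2\|x^k-x^{k-1}\|^2\\&\le-\frac{\mu_k-L_f}{4}\|u-x^k\|^2+\frac{\delta\bar{\mu}_{k-1}}{4}\|x^k-x^{k-1}\|^2\\&=-\frac{(1-\delta)\mu_k-L_f}{4}\|u-x^k\|^2-\frac{\delta\mu_k}{4}\|u-x^k\|^2+\frac{\delta\bar{\mu}_{k-1}}{4}\|x^k-x^{k-1}\|^2,\end{aligned}$$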

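where the second inequality follows from the Cauchy-Schwarz inequality; the third inequality follows from the Lipschitz continuity of $\nabla f$; the fourth inequality follows from the relation $ab\le\frac{s}{4}a^2+\frac{1}{s}b^2$ with $a=\|u-x^k\|$, $b=(\mu_k+L_f)\|x^k-y^k\|$ and $s=\mu_k-L_f>0$; the first equality follows from (3.1); the last inequality follows from $\beta_k\le\sqrt{\frac{\delta(\mu_k-L_f)\bar{\mu}_{k-1}}{4(\mu_k+L_f)^2}}$. Then, rearranging terms in the above relation and recalling the definition of $H_\delta$ in (1.5), we obtain (3.5).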
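The proof above only uses (3.1), (3.2) and the Lipschitz continuity of $\nabla f$, so inequality (3.5) can be checked numerically at arbitrary points $x^k$, $x^{k-1}$. The sketch below does this on a randomly generated $\ell_1$-regularized least squares instance, with $\beta_k$ set to its upper bound from Lemma 3.1. The test problem, the sampling ranges and the name `descent_gap` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam, delta = 30, 12, 0.1, 0.5
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
Lf = np.linalg.norm(A, 2) ** 2                    # Lipschitz constant of grad f

F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
grad = lambda x: A.T @ (A @ x - b)
H = lambda u, v, mu: F(u) + delta * mu / 4.0 * np.sum((u - v) ** 2)   # potential (1.5)

def descent_gap(x, x_prev, mu, mu_bar_prev):
    """Right-hand side minus left-hand side of (3.5); nonnegative when (3.5) holds."""
    beta = np.sqrt(delta * (mu - Lf) * mu_bar_prev / (4.0 * (mu + Lf) ** 2))
    y = x + beta * (x - x_prev)                                # (3.1)
    z = y - grad(y) / mu
    u = np.sign(z) * np.maximum(np.abs(z) - lam / mu, 0.0)     # solves (3.2) for the l1 norm
    lhs = H(u, x, mu) - H(x, x_prev, mu_bar_prev)
    rhs = -((1.0 - delta) * mu - Lf) / 4.0 * np.sum((u - x) ** 2)
    return rhs - lhs

gaps = [descent_gap(rng.standard_normal(n), rng.standard_normal(n),
                    mu=Lf * (1.0 + rng.uniform(0.1, 3.0)),     # any mu > L_f
                    mu_bar_prev=Lf * rng.uniform(0.5, 5.0))    # any positive value
        for _ in range(1000)]
print(min(gaps) >= -1e-9)   # small tolerance for floating-point round-off
```

The check samples $\mu_k$ both below and above $L_f/(1-\delta)$, so it also covers cases where the right-hand side of (3.5) is positive and the lemma gives no actual decrease.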

###### Remark 3.3 (Comments on Lemma 3.1).

Note that the descent property in Lemma 3.1 is established for $H_\delta$ without requiring $f$ or $P$ to be a convex or difference-of-convex function. In fact, with additional assumptions (e.g., convexity) on $f$ or $P$, one can establish a similar descent property for another suitably constructed potential function; see, for example, [39, Lemma 3.1]. Then, one can perform the line search criterion (3.3) with that potential function in place of $H_\delta$ in Algorithm 1, and the convergence analysis can follow in a similar way as presented in this paper. Thus, one can choose a suitable potential function in PGels to fit different scenarios. In this paper, we only focus on $H_\delta$ under Assumption 1.1.

It can be observed from Lemma 3.1 that the sufficient descent of $H_\delta$ can be guaranteed as long as $\mu_k$ is sufficiently large and $\beta_k$ is sufficiently small. Thus, based on this lemma, we can show in the following proposition that the line search criterion (3.3) in Algorithm 1 is well defined.

###### Proposition 3.1 (Well-definedness of the line search criterion).

Suppose that Assumption 1.1 holds and $\delta$ is a nonnegative constant. Let $\{x^k\}$ and $\{\bar{\mu}_k\}$ be the sequences generated by Algorithm 1. Then, for each $k\ge0$, the line search criterion (3.3) is satisfied after finitely many inner iterations.

Proof. We prove this proposition by contradiction. Assume that there exists a such that the line search criterion (3.3) cannot be satisfied after finitely many inner iterations. Since due to (1d) in Algorithm 1, then must be satisfied after finitely many inner iterations. Let denote the number of inner iterations when is satisfied for the first time. If , then ; otherwise, we have

$$\mu_{\min}\,\tau^{n_k-1}\le\mu_k^0\,\tau^{n_k-1}<\mu_{\max},$$

which implies that

$$n_k\le\left\lfloor\frac{\log(\mu_{\max})-\log(\mu_{\min})}{\log\tau}+1\right\rfloor.$$

Now, we let for simplicity. Then, from (1d) in Algorithm 1, we see that is decreasing in the inner loop and hence must be satisfied after finitely many inner iterations. Similarly, let denote the number of inner iterations when is satisfied for the first time. Note that if , we have and hence . For , if , then ; otherwise, we have

$$\bar{\beta}_k<\beta_k^0\,\eta^{\hat{n}_k-1}\le\delta\beta_{\max}\,\eta^{\hat{n}_k-1},$$

which implies that

$$\hat{n}_k\le\left\lfloor\frac{\log(\delta\beta_{\max})-\log(\bar{\beta}_k)}{-\log\eta}+1\right\rfloor.$$

Thus, after at most inner iterations, we must have and . Since and , one can see that and . Then, using these facts and Lemma 3.1, we have

$$H_\delta(u,x^k,\mu_{\max})-H_\delta(x^k,x^{k-1},\bar{\mu}_{k-1})\le-\frac{(1-\delta)\mu_{\max}-L_f}{4}\|u-x^k\|^2\le-\frac{c}{2}\|u-x^k\|^2,$$

which, together with

$$H_\delta(x^k,x^{k-1},\bar{\mu}_{k-1})\le\max_{[k-N]_+\le i\le k}H_\delta(x^i,x^{i-1},\bar{\mu}_{i-1}),$$

implies that (3.3) must be satisfied after at most inner iterations. This leads to a contradiction.

We are now ready to show our first convergence result in the following theorem, which characterizes the cluster points of the sequence generated by PGels. Our proof is similar to that of [41, Lemma 4]. However, the arguments involved rely on our potential function (1.5), which contains multiple blocks of variables. This makes our proof more intricate. For notational simplicity, from now on, let

$$\ell(k)\in\operatorname*{Argmax}_{i}\left\{H_\delta(x^i,x^{i-1},\bar{\mu}_{i-1}):\ i=[k-N]_+,\cdots,k\right\}. \tag{3.8}$$

###### Theorem 3.1.

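Suppose that Assumption 1.1 holds and $\delta$ is a nonnegative constant. Let $\{x^k\}$ and $\{\bar{\mu}_k\}$ be the sequences generated by Algorithm 1. Then,

• (i) (boundedness of sequence) the sequence $\{x^k\}$ is bounded;

• (ii) (non-increase of subsequence of $H_\delta$) the sequence $\{H_\delta(x^{\ell(k)},x^{\ell(k)-1},\bar{\mu}_{\ell(k)-1})\}$ is non-increasing;

• (iii) (existence of limit) $\zeta:=\lim_{k\to\infty}H_\delta(x^{\ell(k)},x^{\ell(k)-1},\bar{\mu}_{\ell(k)-1})$ exists;

• (iv) (diminishing successive changes) $\lim_{k\to\infty}\|x^{k+1}-x^k\|=0$;

• (v) (global subsequential convergence) any cluster point of $\{x^k\}$ is a stationary point of (1.1).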

Proof. Statement (i). We first prove by induction that

$$H_\delta(x^k,x^{k-1},\bar{\mu}_{k-1})\le F(x^0) \tag{3.9}$$

for all . Indeed, for , it follows from Proposition 3.1 that

$$H_\delta(x^1,x^0,\bar{\mu}_0)-H_\delta(x^0,x^{-1},\bar{\mu}_{-1})\le-\frac{c}{2}\|x^1-x^0\|^2\le0$$

is satisfied after finitely many inner iterations. This, together with , implies that

$$H_\delta(x^1,x^0,\bar{\mu}_0)\le H_\delta(x^0,x^{-1},\bar{\mu}_{-1})=F(x^0).$$

Hence, (3.9) holds for . We now suppose that (3.9) holds for all for some integer . Next, we show that (3.9) also holds for . Indeed, for , we have

$$H_\delta(x^{K+1},x^K,\bar{\mu}_K)-F(x^0)\le H_\delta(x^{K+1},x^K,\bar{\mu}_K)-\max_{[K-N]_+\le i\le K}H_\delta(x^i,x^{i-1},\bar{\mu}_{i-1})\le-\frac{c}{2}\|x^{K+1}-x^K\|^2\le0,$$

where the first inequality follows from the induction hypothesis and the second inequality follows from (3.3). Hence, (3.9) holds for . This completes the induction. Then, from (3.9), we have that for any ,

$$F(x^0)\ge H_\delta(x^k,x^{k-1},\bar{\mu}_{k-1})\ge F(x^k),$$

which, together with Assumption 1.1(iii), implies that $\{x^k\}$ is bounded. This proves statement (i).

Statement (ii). Recall the definition of $\ell(k)$ in (3.8) and let $\Delta x^k:=x^{k+1}-x^k$ for simplicity. Then, from the line search criterion (3.3), we have

$$H_\delta(x^{k+1},x^k,\bar{\mu}_k)-H_\delta(x^{\ell(k)},x^{\ell(k)-1},\bar{\mu}_{\ell(k)-1})\le-\frac{c}{2}\|\Delta x^k\|^2\le0. \tag{3.10}$$

Observe that

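$$\begin{aligned}&H_\delta(x^{\ell(k+1)},x^{\ell(k+1)-1},\bar{\mu}_{\ell(k+1)-1})\\&=\max_{[k+1-N]_+\le i\le k+1}H_\delta(x^i,x^{i-1},\bar{\mu}_{i-1})\\&=\max\Big\{H_\delta(x^{k+1},x^k,\bar{\mu}_k),\ \max_{[k+1-N]_+\le i\le k}H_\delta(x^i,x^{i-1},\bar{\mu}_{i-1})\Big\}\\&\le\max\Big\{H_\delta(x^{\ell(k)},x^{\ell(k)-1},\bar{\mu}_{\ell(k)-1}),\ \max_{[k+1-N]_+\le i\le k}H_\delta(x^i,x^{i-1},\bar{\mu}_{i-1})\Big\}\\&\le\max\Big\{H_\delta(x^{\ell(k)},x^{\ell(k)-1},\bar{\mu}_{\ell(k)-1}),\ \max_{[k-N]_+\le i\le k}H_\delta(x^i,x^{i-1},\bar{\mu}_{i-1})\Big\}\\&=\max\Big\{H_\delta(x^{\ell(k)},x^{\ell(k)-1},\bar{\mu}_{\ell(k)-1}),\ H_\delta(x^{\ell(k)},x^{\ell(k)-1},\bar{\mu}_{\ell(k)-1})\Big\}\\&=H_\delta(x^{\ell(k)},x^{\ell(k)-1},\bar{\mu}_{\ell(k)-1}),\end{aligned}$$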

where the first inequality follows from (3.10) and the second last equality follows from (3.8). This proves statement (ii).

Statement (iii). It follows from Assumption 1.1(iii) and the definition of $H_\delta$ in (1.5) that $H_\delta$ is bounded below. This together with statement (ii) proves that there exists a number $\zeta$ such that

$$\lim_{k\to\infty}H_\delta(x^{\ell(k)},x^{\ell(k)-1},\bar{\mu}_{\ell(k)-1})=\zeta. \tag{3.11}$$

Statement (iv). We next prove statement (iv). To this end, we first show by induction that for all , it holds that

 limk