Thresholding gradient methods in Hilbert spaces: support identification and linear convergence

Guillaume Garrigos
CNRS, École Normale Supérieure (DMA), 75005 Paris, France
guillaume.garrigos@ens.fr
Lorenzo Rosasco
LCSL, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Università degli Studi di Genova (DIBRIS), 16146 Genova, Italy
lrosasco@mit.edu
Silvia Villa
Politecnico di Milano (Dipartimento di Matematica), 20133 Milano, Italy
silvia.villa@polimi.it
Abstract.

We study the $\ell^1$ regularized least squares optimization problem in a separable Hilbert space. We show that the iterative soft-thresholding algorithm (ISTA) converges linearly, without making any assumption on the linear operator involved or on the problem. The result is obtained by combining two key concepts: the notion of extended support, a finite set containing the support, and the notion of conditioning over finite-dimensional sets. We prove that ISTA identifies the extended support of the solution after a finite number of iterations, and we derive linear convergence from the conditioning property, which is always satisfied for $\ell^1$ regularized least squares problems. Our analysis extends to the entire class of thresholding gradient algorithms, for which we provide a conceptually new proof of strong convergence, as well as convergence rates.

Keywords. Forward-Backward method, support identification, conditioning, convergence rates.

MSC. 49K40, 49M29, 65J10, 65J15, 65J20, 65J22, 65K15, 90C25, 90C46.

1. Introduction

Recent works show that, for many problems of interest, favorable geometry can greatly improve theoretical results with respect to a more general, worst-case perspective [1, 16, 5, 20]. In this paper, we follow this perspective to analyze the convergence properties of thresholding gradient methods in separable Hilbert spaces. Our starting point is the now classic iterative soft-thresholding algorithm (ISTA) to minimize the function

 (1) $f(x) = \|x\|_1 + \frac{1}{2}\|Ax - y\|^2,$

defined by an operator $A$ on $\ell^2(\mathbb{N})$, where $\|\cdot\|_1$ is the $\ell^1$ norm.

From the seminal work [11], it is known that ISTA converges strongly in $\ell^2(\mathbb{N})$. This result is generalized in [9] to a wider class of algorithms, the so-called thresholding gradient methods, noting that these are special instances of the Forward-Backward algorithm, where the proximal step reduces to a thresholding step onto an orthonormal basis (Section 2). Typically, strong convergence in Hilbert spaces is the consequence of a particular structure of the considered problem. Classic examples are even functions, functions for which the set of minimizers has a nonempty interior, and strongly convex functions [30]. Further examples are uniformly convex functions, or functions presenting a favorable geometry around their minimizers, like conditioned functions or Łojasiewicz functions, see e.g. [4, 20]. Whether the properties of ISTA, and more generally of thresholding gradient methods, can be explained from this perspective is not apparent from the analysis in [11, 9].

Our first contribution is to revisit these results and provide such an explanation: for these algorithms, the whole sequence of iterates is fully contained in a specific finite-dimensional subspace, which automatically ensures strong convergence. The key argument in our analysis is that, after a finite number of iterations, the iterates identify the so-called extended support of their limit. This set coincides with the active constraints at the solution of the dual problem, and reduces to the support if some qualification condition is satisfied.
Going further, we tackle the question of convergence rates, providing a unifying treatment of finite and infinite dimensional settings. In finite dimensions, it is clear that if $A$ is injective, then $f$ becomes a strongly convex function, which guarantees a linear convergence rate. In [22], it is shown, still in a finite dimensional setting, that the linear rates hold just assuming $A$ to be injective on the extended support of the problem. This result is generalized in [8] to a Hilbert space setting, assuming $A$ to be injective on any subspace of finite support. Linear convergence is also obtained by assuming the limit solution to satisfy some nondegeneracy condition [8, 26]. In fact, it was shown recently in [6] that, in finite dimension, no assumption at all is needed to guarantee linear rates. Using a key result in [25], the function $f$ was shown to be $2$-conditioned on its sublevel sets, and $2$-conditioning is sufficient for linear rates [2]. Our identification result, mentioned above, allows us to easily bridge the gap between the finite and infinite dimensional settings. Indeed, we show that in any separable Hilbert space, linear rates of convergence always hold for the soft-thresholding gradient algorithm under no further assumptions. Once again, the key argument to obtain linear rates is the fact that the iterates generated by the algorithm identify, in finite time, a set on which we know the function to have a favorable geometry.

The paper is organized as follows. In Section 2 we describe our setting and introduce the thresholding gradient method. We introduce the notion of extended support in Section 3, in which we show that the thresholding gradient algorithm identifies this extended support after a finite number of iterations (Theorem 3.9). In Section 4 we present some consequences of this result on the convergence of the algorithm. We first derive in Section 4.1 the strong convergence of the iterates, together with a general framework to guarantee rates. We then specialize our analysis to the function (1) in Section 4.2, and show the linear convergence of ISTA (Theorem 4.8). We also consider in Section 4.3 an elastic-net modification of (1), obtained by adding an $\ell^p$ regularization term, and provide rates as well, depending on the value of $p$.

Notation

We introduce some notation we will use throughout this paper. is a subset of . Throughout the paper, is a separable Hilbert space endowed with the scalar product , and is an orthonormal basis of . Given , we set . The support of is . Analogously, given , . Given , the subspace supported by is denoted by and the subset of finitely supported vectors . Given a collection of intervals of the real line, with a slight abuse of notation, we define, for every ,

 $B_{\infty,I} = \bigoplus_{k \in N} I_k = \{x \in X : x = \sum_{k \in N} t_k e_k, \text{ with } t_k \in I_k \text{ for every } k \in N\}.$

Note that is a subspace of . Therefore, the components of each element of must be square summable. The closed ball of center and radius is denoted by . Let be a closed convex set. Its indicator and support functions are denoted and , respectively, and the projection onto is . Moreover, , , , and will denote respectively the interior, the boundary, the relative interior, and the quasi relative interior of [4, Section 6.2]. The set of proper convex lower semi-continuous functions from to is denoted by . Let and let . The sublevel set of is . The proximity operator of is defined as

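 $(\forall \lambda \in\, ]0,+\infty[)\quad \operatorname{prox}_{\lambda f}(x) = \operatorname{argmin}_{y \in X} \left\{ f(y) + \frac{1}{2\lambda}\|y - x\|^2 \right\}.$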

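Let $I \subset \mathbb{R}$ be a closed interval. Then $\operatorname{prox}_{\sigma_I} = \operatorname{soft}_I$, where

 $(\forall t \in \mathbb{R})\quad \operatorname{soft}_I(t) = \begin{cases} t - \inf I & \text{if } t < \inf I, \\ 0 & \text{if } t \in I, \\ t - \sup I & \text{if } t > \sup I, \end{cases}$

is the soft-thresholder corresponding to $I$.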
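
As an illustration, the soft-thresholder can be sketched in a few lines of Python. This is only a minimal sketch: the function names `soft` and `proj`, and the representation of a closed interval by its two endpoints `lo` and `hi`, are our own conventions for the example and do not appear elsewhere in the paper.

```python
import math


def soft(t: float, lo: float, hi: float) -> float:
    """Soft-thresholder for the closed interval I = [lo, hi], with lo <= hi.

    The endpoints may be -math.inf or math.inf for unbounded intervals.
    """
    if t < lo:
        return t - lo
    if t > hi:
        return t - hi
    return 0.0


def proj(t: float, lo: float, hi: float) -> float:
    """Projection of t onto the interval I = [lo, hi]."""
    return min(max(t, lo), hi)


# Classical case I = [-1, 1]: points of I are sent to zero, points outside I
# are moved towards zero by their distance to I.
samples = [-3.0, -0.5, 0.0, 2.5]
thresholded = [soft(t, -1.0, 1.0) for t in samples]
print(thresholded)  # [-2.0, 0.0, 0.0, 1.5]

# One-sided interval I = ]-inf, 1]: negative values are all thresholded.
one_sided = soft(-5.0, -math.inf, 1.0)
print(one_sided)  # 0.0

# The soft-thresholder is the complement of the projection onto I.
identity_holds = all(soft(t, -1.0, 1.0) == t - proj(t, -1.0, 1.0) for t in samples)
```

The last check reflects the identity $\operatorname{soft}_I = \mathrm{Id} - \operatorname{proj}_I$, which is used in the proof of Theorem 3.9.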

Problem and main hypotheses

We consider the general optimization problem

 (P) $\min_{x \in X} f(x), \quad f = g + h,$

where typically $h$ will play the role of a smooth data fidelity term, and $g$ will be a nonsmooth sparsity promoting regularizer. More precisely, we will make the following assumption:

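 (H)
 ∙ $h \in \Gamma_0(X)$ is bounded from below, $h$ is differentiable, and $\nabla h$ is $L$-Lipschitz continuous on $X$, with $L \in\, ]0,+\infty[$,
 ∙ $g = \sum_{k \in N} g_k(\langle \cdot, e_k \rangle)$, with $g_k = \psi_k + \sigma_{I_k}$, where:
   ∙ for all $k \in N$, $I_k$ is a proper closed interval of $\mathbb{R}$, and $I = \{I_k\}_{k \in N}$,
   ∙ there exists $\omega > 0$ such that $[-\omega, \omega] \subset I_k$ for all $k \in N$,
   ∙ for all $k \in N$, $\psi_k \in \Gamma_0(\mathbb{R})$ is differentiable at $0$ with $\psi_k(0) = 0$ and $\psi_k'(0) = 0$.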

As stated in the above assumption, in this paper we focus on a specific class of functions $g$. They are given by the sum of a weighted $\ell^1$ norm and a positive smooth function minimized at the origin, namely:

 $\|x\|_{1,I} = \sum_{k \in N} \sigma_{I_k}(x_k), \quad \psi(x) = \sum_{k \in N} \psi_k(x_k).$

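In [9] the following characterization has been proved: the proximity operators of such functions are the monotone operators $T \colon X \to X$ such that for all $x \in X$, $T(x) = \sum_{k \in N} T_k(x_k) e_k$, for some $T_k \colon \mathbb{R} \to \mathbb{R}$ which satisfies

 $(\forall k \in N)\quad T_k(x_k) = 0 \iff x_k \in I_k.$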

A few examples of such so-called thresholding operators are shown in Figure 1, and a more in-depth analysis can be found in [9].

A well-known approach to approximate solutions of (P) is the Forward-Backward algorithm [4]

 (FB) $x^0 \in X, \quad \lambda \in\, ]0, 2L^{-1}[, \quad x^{n+1} = \operatorname{prox}_{\lambda g}(x^n - \lambda \nabla h(x^n)).$

In our setting, (FB) is well-defined and specializes to a thresholding gradient method. The Proposition below gathers some basic properties of and following from assumption (H).

Proposition 2.1.

The following hold.

1. is the support function of ,

2. ,

3. and it is coercive,

4. is bounded from below and is nonempty,

5. the dual problem

 (D) $\min_{u \in X}\ g^*(u) + h^*(-u),$

admits a unique solution , and for all , .

6. for all and all , the proximal operator of can be expressed as

 $\operatorname{prox}_{\lambda g}(x) = \sum_{k \in N} \operatorname{prox}_{\lambda \psi_k}(\operatorname{soft}_{\lambda I_k}(x_k))\, e_k.$
Proof.

i: see Proposition A.51.

ii: see Proposition A.52.

iii: see Proposition A.51.

iv: it is a consequence of the coercivity of and the fact that both and are bounded from below.

v: the smoothness of implies the strong convexity of , and the existence and uniqueness of , see [4, Theorems 15.13 and 18.15]. The equality follows from [4, Proposition 26.1(iv)(b)].

vi: it follows from A.5iv together with [9, Proposition 3.6]. ∎
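
To make the componentwise formula of Proposition 2.1vi and the iteration (FB) concrete, here is a minimal finite-dimensional sketch in Python. The choices below are assumptions made only for the illustration: $I_k = [-1,1]$, $\psi_k(t) = \frac{\alpha}{2}t^2$ (so that $\operatorname{prox}_{\lambda\psi_k}(s) = s/(1+\lambda\alpha)$), and $h = \frac{1}{2}\|\cdot - y\|^2$ with $L = 1$. The names `prox_g` and `forward_backward` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def soft(t: float, lo: float, hi: float) -> float:
    """Soft-thresholder for the closed interval [lo, hi]."""
    if t < lo:
        return t - lo
    if t > hi:
        return t - hi
    return 0.0


def prox_g(x: Vector, lam: float, alpha: float = 1.0) -> Vector:
    """Componentwise prox of lam*g, assuming I_k = [-1, 1] and
    psi_k(t) = (alpha/2) t^2: first threshold, then apply prox of lam*psi_k."""
    return [soft(xk, -lam, lam) / (1.0 + lam * alpha) for xk in x]


def forward_backward(
    x0: Vector,
    grad_h: Callable[[Vector], Vector],
    prox: Callable[[Vector, float], Vector],
    lam: float,
    n_iter: int,
) -> Vector:
    """Run n_iter steps of x <- prox_{lam g}(x - lam * grad_h(x))."""
    x = list(x0)
    for _ in range(n_iter):
        grad = grad_h(x)
        x = prox([xk - lam * gk for xk, gk in zip(x, grad)], lam)
    return x


# Toy data fidelity h(x) = 0.5 * ||x - y||^2, whose gradient is 1-Lipschitz.
y = [3.0, 0.5, -2.0]


def grad_h(x: Vector) -> Vector:
    return [xk - yk for xk, yk in zip(x, y)]


# The stepsize lam = 1 lies in ]0, 2/L[ = ]0, 2[.
x_final = forward_backward([0.0, 0.0, 0.0], grad_h, prox_g, lam=1.0, n_iter=5)
print(x_final)  # [1.0, 0.0, -0.5]
```

In this toy case the fixed point is reached after one iteration, since with $\lambda = 1$ the gradient step maps every point to $y$. The second component is thresholded to zero because $y_2 = 0.5$ lies in $I_2 = [-1,1]$.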

3. Extended support and finite identification

3.1. Definition and basic properties

We introduce the notion of extended support of a vector and prove some basic properties of the support of solutions of problem (P).

Definition 3.1.

Let $x \in X$. The extended support of $x$ is

 $\operatorname{esupp}(x) = \operatorname{supp}(x) \cup \{k \in N \mid -\nabla h(x)_k \in \operatorname{bd} I_k\}.$
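
The definition can be sketched numerically as follows. This is an illustrative sketch only: the helper names `supp` and `esupp`, the tolerance `tol` used to test membership in $\operatorname{bd} I_k$ in floating point, and the toy problem $h = \frac{1}{2}\|\cdot - y\|^2$ with $I_k = [-1,1]$ are all assumptions of the example.

```python
from typing import List, Set, Tuple

Interval = Tuple[float, float]  # closed interval [lo, hi]


def supp(x: List[float]) -> Set[int]:
    """Indices of the nonzero components of x."""
    return {k for k, xk in enumerate(x) if xk != 0.0}


def esupp(
    x: List[float],
    grad_h_x: List[float],
    intervals: List[Interval],
    tol: float = 1e-12,
) -> Set[int]:
    """Extended support: supp(x) together with the indices k for which
    -grad_h(x)_k lies on the boundary of I_k (up to the tolerance tol)."""
    active = set()
    for k, (gk, (lo, hi)) in enumerate(zip(grad_h_x, intervals)):
        u = -gk
        if min(abs(u - lo), abs(u - hi)) <= tol:
            active.add(k)
    return supp(x) | active


# Toy problem: f(x) = ||x||_1 + 0.5 * ||x - y||^2, so I_k = [-1, 1] and the
# unique minimizer is the soft-thresholding of y at level 1.
y = [3.0, 1.0, 0.5]
x_bar = [2.0, 0.0, 0.0]
grad = [xk - yk for xk, yk in zip(x_bar, y)]
intervals = [(-1.0, 1.0)] * 3

print(sorted(supp(x_bar)), sorted(esupp(x_bar, grad, intervals)))  # [0] [0, 1]
```

Here the index $k = 1$ belongs to the extended support but not to the support, because $-\nabla h(\bar x)_1 = 1$ sits exactly on the boundary of $I_1$.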

It is worth noting that the notion of extended support depends on the problem (P), since its definition involves (see Remark 3.4 for more details). It appears without a name in [22], and also in [14, 15, 17] for regularized least squares problems. Below we gather some results about the support and the extended support.

Proposition 3.2.

Let , then and are finite.

Proof.

Let , and let and let us start by verifying that is finite. Let , and let . Proposition 2.1vi implies that for all , . Lemma A.4 and the definition of imply that , and in particular that for all . Then we derive that

 $|\operatorname{supp}(x)| = \omega^{-2} \sum_{k \in \operatorname{supp}(x)} \omega^2 \leq \omega^{-2} \sum_{k \in \operatorname{supp}(x)} |y_k|^2 \leq \omega^{-2} \|y\|^2 < +\infty.$

Next, we have to verify that is finite, where . If is finite, this is trivial. Otherwise, we observe that , which both implies that tends to when in . Since , we deduce that must be finite. ∎

The following proposition clarifies the relationship between the support and the extended support for minimizers.

Proposition 3.3.

Let .

1. If then .

Assume that is differentiable on , for all . Then

1. .

Assume moreover that . Then

1. .

2. There exists such that for every .

3. .

Proof of Proposition 3.3.

Since , it follows from Proposition 3.2 that is finite. Moreau-Rockafellar’s sum rule [29, Theorem 3.30], Proposition A.5iii, Proposition A.1i then yield

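 (2) $\partial f(\bar x)_k = \nabla h(\bar x)_k + \begin{cases} \partial \psi_k(\bar x_k) + \partial \sigma_{I_k}(\bar x_k) & \text{if } k \in \operatorname{supp}(\bar x), \\ I_k & \text{if } k \notin \operatorname{supp}(\bar x). \end{cases}$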

Since is finite and is a closed interval of , Proposition A.3 and Proposition A.11 imply

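 (3) $(\forall k \in N)\quad (\operatorname{qri} \partial f(\bar x))_k = \nabla h(\bar x)_k + \begin{cases} \operatorname{ri}(\partial \psi_k(\bar x_k) + \partial \sigma_{I_k}(\bar x_k)) & \text{if } k \in \operatorname{supp}(\bar x), \\ \operatorname{int} I_k & \text{if } k \notin \operatorname{supp}(\bar x). \end{cases}$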

i: observe that

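 $0 \in \operatorname{qri} \partial f(\bar x)$
 $\Rightarrow\ (\forall k \notin \operatorname{supp}(\bar x))\ -\nabla h(\bar x)_k \in \operatorname{int} I_k$
 $\Leftrightarrow\ \{k \in N \mid \bar x_k = 0 \text{ and } -\nabla h(\bar x)_k \in \operatorname{bd} I_k\} = \emptyset$
 $\Leftrightarrow\ \operatorname{esupp}(\bar x) = \operatorname{supp}(\bar x).$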

1: note that from and (2), we have for all . But both and are differentiable at , so for all , holds. So we deduce from (3.1) that item 1 holds.

1: observe that, via (2) and Proposition A.1ii, for all , , meaning that indeed .

2: it follows from the uniqueness of , see Proposition 2.1v.

3: if there is some such that , we derive from 1 and 2 that . So, the inclusion holds. The reverse inclusion comes directly from the definition of and 2. For the reverse inclusion, assume that holds, and use the fact that is finite to apply Lemma A.9, and obtain some such that . We then conclude that using 2 and 1. ∎

Remark 3.4 (Extended support and active constraints).

Assume that . Since is the indicator function of , in this case, the dual problem (D) introduced in Proposition 2.1v can be rewritten as

 (D’) $\min_{u \in X,\ (\forall k \in N)\, u_k \in I_k}\ h^*(-u).$

This problem admits a unique solution , and the set of active constraints at is

 $\{k \in N \mid \bar u_k \in \operatorname{bd} I_k\}.$

Since for any by Proposition 2.1v, Proposition 3.31 implies that the extended support for the solutions of (P) is in that case nothing but the set of active constraints for the solution of (D’).

Remark 3.5 (Maximal support and interior solution).

If and the following (weak) qualification condition holds

 (w-CQ) $(\exists x \in \operatorname{argmin} f)\quad 0 \in \operatorname{qri} \partial f(x),$

then, thanks to Lemma A.9 the extended support is the maximal support to be found among the solutions. If for instance is the least squares loss on a finite dimensional space, it can be shown that the solutions having a maximal support are the ones belonging to the relative interior of the solution set [3, Theorem 2]. However, there are problems for which (w-CQ) does not hold. In such a case Proposition 3.3 implies that the extended support will be strictly larger than the maximal support (see Example 3.7). The gap between the maximal support and the extended support is equivalent to the lack of duality between (P) and (D).

Example 3.6.

Let and . In this case, , where and , as can be seen in Figure 2. The solutions are the ones having the maximal support, since , and also satisfy . Instead, on the relative boundary of we have and for . This is an example for which the extended support is the maximal support among the solutions.

Example 3.7.

Let and . Then , with , as can be seen in Figure 2. The support of is empty, and . In this case, condition (w-CQ) does not hold. This can also be seen from the dual problem , whose unique constraint is active at the solution , meaning that .

3.2. Finite identification

A sparse solution of problem (P) is usually approximated by means of an iterative procedure . To obtain an interpretable approximation, a crucial property is that, after a finite number of iterations, the support of stabilizes and is included in the support of . In that case, we say that the sequence identifies . The support identification property has been the subject of active research in the last years [22, 15, 26, 18, 17], and roughly speaking, in finite dimension it is known that support identification holds whenever satisfies the qualification condition . But this assumption is often not satisfied in practice, in particular for noisy inverse problems (see e.g. [18]). In [22, 14], the case is studied in finite dimension and it is shown that the extended support of is identified even if the qualification condition does not hold. Thus, the qualification condition is only used to ensure that the extended support coincides with the support (see Proposition 3.3).

In this section we extend these ideas to the setting of thresholding gradient methods in separable Hilbert spaces, and we show in Theorem 3.9 that indeed the extended support is always identified after a finite number of iterations. For this, we need to introduce a quantity, which measures the stability of the dual problem (D).

Definition 3.8.

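We define the function $\rho$ on $X$ as follows:

 (5) $(\forall u \in X)\quad \rho(u) = \inf_{u_k \in \operatorname{int} I_k} \operatorname{dist}(u_k, \operatorname{bd} I_k).$

Also, given any $\bar x \in \operatorname{argmin} f$, we define $\rho_{\mathrm{sol}} = \rho(-\nabla h(\bar x))$.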

It can be verified that for all (this is left in the Annex, see Proposition A.2). Moreover, is uniquely defined, thanks to Proposition 2.1v.
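
For a vector with finitely many components, the quantity in (5) can be sketched as below. This is only an illustration under our own assumptions: the name `rho`, the encoding of each interval by its endpoints, and the convention that the infimum over an empty index set equals $+\infty$ are choices made for the example.

```python
import math
from typing import List, Tuple

Interval = Tuple[float, float]  # closed interval [lo, hi]


def rho(u: List[float], intervals: List[Interval]) -> float:
    """Smallest distance to the boundary of I_k among the components u_k
    lying in the interior of I_k (math.inf if there is no such component)."""
    gaps = [
        min(uk - lo, hi - uk)
        for uk, (lo, hi) in zip(u, intervals)
        if lo < uk < hi
    ]
    return min(gaps, default=math.inf)


# With I_k = [-1, 1]: the first two components are on the boundary and are
# ignored, the third one is at distance 0.5 from the boundary.
value = rho([1.0, 1.0, 0.5], [(-1.0, 1.0)] * 3)
print(value)  # 0.5

# No component in the interior: the infimum is taken over an empty set.
empty_case = rho([1.0, -1.0], [(-1.0, 1.0)] * 2)
print(empty_case)  # inf
```

Components sitting on the boundary do not enter the infimum, which is why $\rho$ can be positive even when some constraints of the dual problem are active.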

Theorem 3.9 (Finite identification of the extended support).

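Let $(x^n)_{n \in \mathbb{N}}$ be generated by the Forward-Backward algorithm (FB), and let $\bar x$ be any minimizer of $f$. Then, the number of iterations for which the support of $x^n$ is not included in $\operatorname{esupp}(\bar x)$ is finite, and cannot exceed $\|x^0 - \bar x\|^2 \lambda^{-2} \rho_{\mathrm{sol}}^{-2}$.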

Remark 3.10 (Optimality of the identification result).

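Theorem 3.9 implies that after some iterations the inclusion $\operatorname{supp}(x^n) \subset \operatorname{esupp}(\bar x)$ holds. Let us verify that it is impossible to improve the result, i.e. that in general we cannot identify a set smaller than $\operatorname{esupp}(\bar x)$. In other words, is it true that

 (6) $(\exists x^0 \in X)(\exists \bar x \in \operatorname{argmin} f)(\forall n \in \mathbb{N})\quad \operatorname{supp}(x^n) = \operatorname{esupp}(\bar x)?$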

If (w-CQ) holds, the answer is yes. Indeed, if there is such that , we derive from Proposition 3.3i that . So by taking , and using the fact that it is a fixed point for the Forward-Backward iterations, we conclude that . If (w-CQ) does not hold, then this argument cannot be used, and it is not clear in general if there always exists an initialization which produces a sequence verifying (6). Consider for instance the function in Example 3.7. Taking and a stepsize , the iterates are defined by , meaning that for all , , which is exactly . So in that case (6) holds true.

Proof of Theorem 3.9.

Let , and let be the finite dimensional subspace of supported by . First define the “gradient step” operator

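 $T_{\lambda h} = \mathrm{Id} - \lambda \nabla h,$

so that the Forward-Backward iteration can be rewritten as $x^{n+1} = \operatorname{prox}_{\lambda g}(T_{\lambda h}(x^n))$. Proposition 2.1vi implies that for all $n \in \mathbb{N}^*$ and all $k \in N$,

 (7) $x^n_k = \operatorname{prox}_{\lambda \psi_k} \circ \operatorname{soft}_{\lambda I_k}(T_{\lambda h}(x^{n-1})_k).$

Since $\bar x$ is a fixed point for the forward-backward iteration [4, Proposition 26.1(iv)], we also have

 (8) $\bar x_k = \operatorname{prox}_{\lambda \psi_k} \circ \operatorname{soft}_{\lambda I_k}(T_{\lambda h}(\bar x)_k).$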

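Using the fact that $\operatorname{prox}_{\lambda \psi_k}$ is nonexpansive, and that $\operatorname{soft}_{\lambda I_k}$ is firmly non-expansive [4, Proposition 12.28], we derive

 $\|x^n - \bar x\|^2 = \sum_{k \in N} |x^n_k - \bar x_k|^2$
 $\leq \sum_{k \in N} |\operatorname{soft}_{\lambda I_k}(T_{\lambda h}(x^{n-1})_k) - \operatorname{soft}_{\lambda I_k}(T_{\lambda h}(\bar x)_k)|^2$
 $\leq \sum_{k \in N} \left( |T_{\lambda h}(x^{n-1})_k - T_{\lambda h}(\bar x)_k|^2 - |(\mathrm{Id} - \operatorname{soft}_{\lambda I_k})(T_{\lambda h}(x^{n-1})_k) - (\mathrm{Id} - \operatorname{soft}_{\lambda I_k})(T_{\lambda h}(\bar x)_k)|^2 \right)$
 $\leq \|T_{\lambda h}(x^{n-1}) - T_{\lambda h}(\bar x)\|^2 - \sigma_{n,k}^2,$

where

 $\sigma_{n,k} = |(\mathrm{Id} - \operatorname{soft}_{\lambda I_k})(T_{\lambda h}(x^{n-1})_k) - (\mathrm{Id} - \operatorname{soft}_{\lambda I_k})(T_{\lambda h}(\bar x)_k)|.$

Moreover, the gradient step operator $T_{\lambda h}$ is non-expansive since $\lambda \in\, ]0, 2L^{-1}[$ (see e.g. [24, Lemma 3.2]), so we end up with

 (9) $(\forall n \in \mathbb{N}^*)(\forall k \in N)\quad \|x^n - \bar x\|^2 \leq \|x^{n-1} - \bar x\|^2 - \sigma_{n,k}^2.$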

The key point of the proof is to get a nonnegative lower bound for which is independent of , when .

Assume that there is some such that . This means that there exists such that . Also, since , we must have , meaning that . We deduce from (7), (8), and Lemma A.4, that

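 (10) $T_{\lambda h}(x^{n-1})_k \notin \lambda I_k \quad \text{and} \quad T_{\lambda h}(\bar x)_k \in \operatorname{int} \lambda I_k.$

Since $\mathrm{Id} - \operatorname{soft}_{\lambda I_k}$ is the projection on $\lambda I_k$, we derive from (10) that

 $\sigma_{n,k} = |\operatorname{proj}_{\lambda I_k}(T_{\lambda h}(x^{n-1})_k) - T_{\lambda h}(\bar x)_k|.$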

Moreover , therefore by Definition 3.8 and (10), we obtain that

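 $\sigma_{n,k} \geq \lambda \operatorname{dist}(\lambda^{-1} T_{\lambda h}(\bar x)_k, \operatorname{bd} I_k) \geq \lambda \rho(\lambda^{-1} T_{\lambda h}(\bar x)_k) = \lambda \rho(-\nabla h(\bar x)_k) = \lambda \rho_{\mathrm{sol}}.$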

Plugging this into (9), we obtain

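 (11) $\forall n \in \mathbb{N}^*, \quad x^n \notin E \ \Rightarrow\ \|x^n - \bar x\|^2 \leq \|x^{n-1} - \bar x\|^2 - \rho_{\mathrm{sol}}^2 \lambda^2.$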

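Next, note that the sequence $(x^n)_{n \in \mathbb{N}}$ is Fejér monotone with respect to the minimizers of $f$ (see e.g. [20, Theorem 2.2]), meaning that $(\|x^n - \bar x\|)_{n \in \mathbb{N}}$ is a decreasing sequence. Therefore the inequality (11) cannot hold an infinite number of times. More precisely, since each occurrence decreases $\|x^n - \bar x\|^2$ by at least $\rho_{\mathrm{sol}}^2 \lambda^2$, the condition $x^n \notin E$ can hold for at most $\|x^0 - \bar x\|^2 \lambda^{-2} \rho_{\mathrm{sol}}^{-2}$ iterations. ∎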
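
The counting argument of the proof can be observed on a small example. The sketch below is an illustration under explicit assumptions, not part of the analysis: we take a diagonal operator, $h(x) = \frac{1}{2}\sum_k (a_k x_k - y_k)^2$ with $I_k = [-1,1]$, so that the minimizer is available in closed form, and we compare the number of iterations at which $\operatorname{supp}(x^n) \not\subset \operatorname{esupp}(\bar x)$ with the quantity $\|x^0 - \bar x\|^2 \lambda^{-2}\rho_{\mathrm{sol}}^{-2}$. All names and numerical values are hypothetical.

```python
import math
from typing import List, Set


def soft(t: float, thr: float) -> float:
    """Soft-thresholder for the symmetric interval [-thr, thr]."""
    if t > thr:
        return t - thr
    if t < -thr:
        return t + thr
    return 0.0


# Diagonal toy problem: h(x) = 0.5 * sum_k (a_k x_k - y_k)^2, I_k = [-1, 1].
a = [1.0, 0.5, 2.0, 1.0]
y = [3.0, 1.0, 0.2, -0.5]
lam = 0.4  # stepsize in ]0, 2/L[, where L = max_k a_k^2 = 4


def grad_h(x: List[float]) -> List[float]:
    return [ak * (ak * xk - yk) for ak, xk, yk in zip(a, x, y)]


def fb_step(x: List[float]) -> List[float]:
    return [soft(xk - lam * gk, lam) for xk, gk in zip(x, grad_h(x))]


def supp(x: List[float]) -> Set[int]:
    return {k for k, xk in enumerate(x) if xk != 0.0}


# Closed-form minimizer, valid because the problem is separable.
x_bar = [soft(ak * yk, 1.0) / ak ** 2 for ak, yk in zip(a, y)]
u_bar = [-gk for gk in grad_h(x_bar)]

tol = 1e-12  # numerical tolerance for "u_k lies on the boundary of I_k"
E = supp(x_bar) | {k for k, uk in enumerate(u_bar) if abs(abs(uk) - 1.0) <= tol}
rho_sol = min(
    (1.0 - abs(uk) for uk in u_bar if abs(uk) < 1.0 - tol), default=math.inf
)

x = [5.0, 5.0, 5.0, 5.0]
bound = sum((xk - bk) ** 2 for xk, bk in zip(x, x_bar)) / (lam * rho_sol) ** 2

misses = 0  # iterations whose support is not contained in the extended support
for _ in range(3000):
    x = fb_step(x)
    if not supp(x) <= E:
        misses += 1

print(sorted(E), rho_sol, misses, bound)
```

On this example the extended support is $\{0\}$ and $\rho_{\mathrm{sol}} = 0.5$. The observed number of non-identifying iterations is far below the theoretical bound, which is expected since the bound only uses the worst-case decrease in (11).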

4. Strong convergence and rates

4.1. General results for thresholding gradient methods

Strong convergence of the iterates for the thresholding gradient algorithm was first stated in [11, Section 3.2] for , and then generalized to general thresholding gradient methods in [9, Theorem 4.5]. We provide a new and simple proof for this result, exploiting the "finite-dimensionality" provided by the identification result in Theorem 3.9.

Corollary 4.1 (Finite dimensionality for thresholding gradient methods).

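Let $(x^n)_{n \in \mathbb{N}}$ be generated by a thresholding gradient algorithm. Then:

1. There exists a finite set $J \subset N$ such that $x^n \in X_J$ for all $n \in \mathbb{N}$.

2. $(x^n)_{n \in \mathbb{N}}$ converges strongly to some $\bar x \in \operatorname{argmin} f$.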

Proof.

i: let and let

 $J = \operatorname{esupp}(x) \cup \bigcup \{\operatorname{supp}(x^n) \mid n \in \mathbb{N}^*,\ x^n \notin X_{\operatorname{esupp}(x)}\},$

and observe that it is finite, as a finite union of finite sets (see Proposition 3.2 and Theorem 3.9).

ii: it is well known that implies that converges weakly towards some (see e.g. [20, Theorem 2.2]). In particular, is a bounded sequence in . Moreover, i implies that belongs to , which is finite dimensional. These two facts imply that is contained in a compact set of with respect to the strong topology, and thus converges strongly. ∎

Next we discuss the rate of convergence for the thresholding gradient methods. Beforehand, we briefly recall how the geometry of a function around its minimizers is related to the rates of convergence of the Forward-Backward algorithm.

Definition 4.2.

Let and . We say that is -conditioned on if

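 $(\exists \gamma_{\phi,\Omega} > 0)(\forall x \in \Omega)\quad \frac{\gamma_{\phi,\Omega}}{p} \operatorname{dist}(x, \operatorname{argmin} \phi)^p \leq \phi(x) - \inf \phi.$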

A $p$-conditioned function is a function which somehow behaves like $\operatorname{dist}(\cdot, \operatorname{argmin} \phi)^p$ on a set. For instance, strongly convex functions are $2$-conditioned on the whole space, and the constant is nothing but the constant of strong convexity. But the notion of $p$-conditioning is more general and also describes the geometry of functions having more than one minimizer. For instance in finite dimension, any positive quadratic function is $2$-conditioned on the whole space, in which case the constant is the smallest nonzero eigenvalue of the Hessian. This notion is interesting since it allows one to get precise convergence rates for some algorithms (including the Forward-Backward one) [2]:

• sublinear rates if $p > 2$,

• linear rates if $p = 2$.

For more examples, related notions and references, we refer the interested reader to [16, 5, 20].
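
The quadratic example mentioned above can be checked numerically. The following sketch is only illustrative and relies on our own assumptions: a diagonal quadratic $\phi(x) = \frac{1}{2}\sum_k d_k x_k^2$ with one zero eigenvalue, so that $\operatorname{argmin}\phi$ is a line and $\inf\phi = 0$, and hypothetical helper names. It tests the inequality of Definition 4.2 with $p = 2$ on sampled points, for the smallest nonzero eigenvalue and for a larger constant.

```python
import random
from typing import List

# Eigenvalues of the Hessian of phi(x) = 0.5 * sum_k d_k * x_k^2.
# The zero eigenvalue means that phi is convex but not strongly convex.
d = [0.0, 1.0, 4.0]


def phi(x: List[float]) -> float:
    return 0.5 * sum(dk * xk * xk for dk, xk in zip(d, x))


def dist_to_argmin(x: List[float]) -> float:
    """argmin phi = {x : x_k = 0 whenever d_k > 0}; inf phi = 0."""
    return sum(xk * xk for dk, xk in zip(d, x) if dk > 0.0) ** 0.5


def is_2_conditioned(gamma: float, points: List[List[float]], tol: float = 1e-12) -> bool:
    """Check (gamma / 2) * dist(x, argmin phi)^2 <= phi(x) - inf phi on the sample."""
    return all(0.5 * gamma * dist_to_argmin(x) ** 2 <= phi(x) + tol for x in points)


random.seed(0)
points = [[random.uniform(-5.0, 5.0) for _ in d] for _ in range(1000)]
points.append([0.0, 1.0, 0.0])  # direction of the smallest nonzero eigenvalue

gamma_min = min(dk for dk in d if dk > 0.0)
ok_with_gamma_min = is_2_conditioned(gamma_min, points)
ok_with_larger_gamma = is_2_conditioned(1.5 * gamma_min, points)
print(ok_with_gamma_min, ok_with_larger_gamma)  # True False
```

The inequality holds with the smallest nonzero eigenvalue and fails for the larger constant at the point aligned with the corresponding eigenvector, in line with the statement that this eigenvalue is the conditioning constant.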

Corollary 4.1 highlights the fact that the behavior of the thresholding gradient method essentially depends on the conditioning of on finitely supported subspaces. It is then natural to introduce the following notion of finite uniform conditioning.

Definition 4.3.

Let . We say that a function satisfies the finite uniform conditioning property of order if, for every finite , , , is -conditioned on .

Remark 4.4.

In this definition, we only need information about over supports satisfying . Indeed, if , then is -conditioned on for any and for all according to [20, Proposition 3.4].

In the following theorem, we illustrate how finite uniform conditioning guarantees global rates of convergence for the thresholding gradient methods: linear rates if $p = 2$, and sublinear rates for $p > 2$. Note that these sublinear rates are better than the rate guaranteed in the worst case.

Theorem 4.5 (Convergence rates for thresholding gradient methods).

Let be generated by the Forward-Backward algorithm (FB), and let be its (weak) limit. Then the following hold.

1. If satisfies the finite uniform conditioning property of order , then there exist and , depending on , such that

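 $(\forall n \geq 1)\quad f(x^n) - \inf f \leq \varepsilon^n (f(x^0) - \inf f) \quad \text{and} \quad \|x^{n+1} - \bar x\| \leq C \sqrt{\varepsilon^n}.$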
2. If satisfies the finite uniform conditioning property of order , then there exist , depending on , such that

 $(\forall n \geq 1)\quad f(x^n) - \inf f \leq C_1 n^{-\frac{p}{p-2}} \quad \text{and} \quad \|x^{n+1} - x^\infty\| \leq C_2 n^{-\frac{1}{p-2}}.$
Proof.

According to Corollary 4.1, there exists a finite set such that for all , , and converges strongly to . Also, the decreasing and Fejér properties of the Forward-Backward algorithm (see e.g. [20, Theorem 2.2]) tell us that for all , , by taking and . Therefore, thanks to the finite uniform conditioning assumption, we can apply [20, Theorem 4.2] to the sequence and conclude. ∎
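
The linear regime of Theorem 4.5 can be observed on the function (1) even when the operator is not injective. The sketch below is a toy illustration under our own assumptions: a $2 \times 3$ matrix `A` with two identical columns (so that $f$ is not strongly convex and has a segment of minimizers), $I_k = [-1,1]$, a stepsize $\lambda = 0.5$, and the value $\inf f = 2.625$ computed by hand for this specific data. All names are hypothetical.

```python
from typing import List

Vector = List[float]

# Non-injective operator: the first two columns coincide.
A = [[1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y = [3.0, 0.5]
lam = 0.5  # stepsize in ]0, 2/L[, where L = ||A^T A|| = 2


def soft(t: float, thr: float) -> float:
    """Soft-thresholder for the symmetric interval [-thr, thr]."""
    if t > thr:
        return t - thr
    if t < -thr:
        return t + thr
    return 0.0


def residual(x: Vector) -> Vector:
    """Compute A x - y."""
    return [sum(aij * xj for aij, xj in zip(row, x)) - yi for row, yi in zip(A, y)]


def f(x: Vector) -> float:
    """f(x) = ||x||_1 + 0.5 * ||A x - y||^2."""
    return sum(abs(xk) for xk in x) + 0.5 * sum(r * r for r in residual(x))


def ista_step(x: Vector) -> Vector:
    r = residual(x)
    # Gradient of the least squares term: A^T (A x - y).
    grad = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]
    return [soft(xk - lam * gk, lam) for xk, gk in zip(x, grad)]


f_min = 2.625  # inf f for this data, obtained by a direct computation
x = [4.0, -1.0, 2.0]
gaps = [f(x) - f_min]
for _ in range(60):
    x = ista_step(x)
    gaps.append(f(x) - f_min)

print(gaps[:6])
```

On this example, once the support has stabilized (after three iterations), the gap $f(x^n) - \inf f$ is divided by $4$ at every iteration until machine precision is reached, although $A$ is not injective. This is of course a single toy instance and not a substitute for the analysis.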

4.2. ℓ1 regularized least squares

Let $A$ be a linear operator from $X$ to a separable Hilbert space $Y$, and let $y \in Y$. In this section, we discuss the particular case when $h = \frac{1}{2}\|A \cdot - y\|_Y^2$ and $g = \|\cdot\|_{1,I}$. The function in (P) then becomes

 $X \ni x \mapsto f(x) = \|x\|_{1,I} + \frac{1}{2}\|Ax - y\|_Y^2,$

and the Forward-Backward algorithm specializes to the iterative soft-thresholding algorithm (ISTA). In this special case, linear convergence rates have been studied under additional assumptions on the operator $A$. A common one is injectivity of $A$ or, more generally, the so-called Finite Basis Injectivity property (FBI) [8]. The FBI requires $A$ to be injective once restricted to $X_J$, for any finite $J \subset N$. It is clear that the FBI property implies that $f$ is a strongly convex function once restricted to each $X_J$, meaning that the finite uniform conditioning of order $2$ holds. So, the linear rates obtained in [8, Theorem 1] under the FBI assumption can be directly derived from Theorem 4.5. However, as can be seen in Theorem 4.5, strong convexity is not necessary to get linear rates, and the finite uniform $2$-conditioning is a sufficient condition (and it is actually necessary, see [20, Proposition 4.18]). By using Li’s Theorem on convex piecewise polynomials [25, Corollary 3.6], we show in Proposition 4.7 below that $f$ satisfies a finite uniform conditioning of order $2$ on finitely supported subsets, without making any assumption on the problem. First, we need a technical lemma which establishes the link between the conditioning of a function on a finitely supported space and the conditioning of its restriction to this space.

Lemma 4.6.

Let , let and let . Suppose that . Let . Assume that, for every ,

 $\phi_J = \phi \circ \Xi \in \Gamma_0(\mathbb{R}^m)$ is $p$-conditioned on $B_{\mathbb{R}^m}(\Xi^{-1}(\bar x), \delta) \cap S_{\phi_J}(r).$

Then is -conditioned on .

Proof.

Assume without loss of generality that . Also, observe that implies that is well-defined. By definition, , and

 $\inf \phi = \phi(\bar x) = \phi \circ \Xi(\bar u) = \phi_J(\bar u) \geq \inf \phi_J,$

which implies . Also, we have

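 $x \in \Xi(\operatorname{argmin} \phi_J) \ \Leftrightarrow\ x = \Xi(u) \text{ and } \phi_J(u) = \inf \phi_J \ \Leftrightarrow\ x \in X_J \text{ and } \phi(x) = \inf \phi,$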

meaning that Let , and let . Since is -conditioned on there exists such that

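 (12) $(\forall u \in B_{\mathbb{R}^m}(\bar u, \delta) \cap S_{\phi_J}(r))\quad \frac{\gamma}{p} \operatorname{dist}(u, \operatorname{argmin} \phi_J)^p \leq \phi_J(u) - \inf \phi_J.$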

Let in (12). Since