A Tight Bound of Hard Thresholding

# A Tight Bound of Hard Thresholding

Jie Shen
Department of Computer Science
School of Arts and Sciences
Rutgers University
Piscataway, NJ 08854, USA
js2007@rutgers.edu
Ping Li
Department of Statistics & Biostatistics
Department of Computer Science
Rutgers University
Piscataway, NJ 08854, USA
pingli@stat.rutgers.edu
###### Abstract

This paper is concerned with the hard thresholding operator which sets all but the largest absolute elements of a vector to zero. We establish a tight bound to quantitatively characterize the deviation of the thresholded solution from a given signal. Our theoretical result is universal in the sense that it holds for all choices of parameters, and the underlying analysis only depends on fundamental arguments in mathematical optimization. We discuss the implications for two domains:

Compressed Sensing.  On account of the crucial estimate, we bridge the connection between restricted isometry property (RIP) and the sparsity parameter for a vast volume of hard thresholding based algorithms, which renders an improvement on the RIP condition especially when the true sparsity is unknown. This suggests that in essence, many more kinds of sensing matrices or fewer measurements are admissible for the data acquisition procedure.

Machine Learning.  In terms of large-scale machine learning, a significant yet challenging problem is learning accurate sparse models in an efficient manner. In stark contrast to prior works that attempted the relaxation for promoting sparsity, we present a novel stochastic algorithm which performs hard thresholding in each iteration, hence ensuring such parsimonious solutions. Equipped with the developed bound, we prove global linear convergence for a number of prevalent statistical models under mild assumptions, even though the problem turns out to be non-convex.

Keywords: Sparsity, Hard Thresholding, Compressed Sensing, RIP, Stochastic Optimization

## 1 Introduction

Over the last two decades, pursuing sparse representations has emerged as a fundamental technique throughout bioinformatics [OF97], statistics [Tib96, EHJT04], signal processing [CDS98, Don06, CW08] and mathematical science [CRPW12], to name just a few. In order to obtain a sparse solution, a plethora of practical algorithms have been presented, among which two prominent examples are greedy pursuit and convex relaxation [TW10]. For instance, as one of the earliest pursuit algorithms, orthogonal matching pursuit (OMP) [PRK93] repeatedly picks a coordinate as the potential support of a solution. While OMP may fail for some deterministic sensing matrices, [Tro04, TG07] showed that OMP recovers the true signal with high probability when using random matrices such as Gaussian. Inspired by the success of OMP, the two concurrent work of compressive sampling matching pursuit (CoSaMP) [NT09] and subspace pursuit (SP) [DM09] made improvement by selecting multiple coordinates followed by a pruning step in each iteration, and the recovery condition was framed under the restricted isometry property (RIP) [CT05]. Interestingly, the more careful selection strategy of CoSaMP and SP leads to an optimal sample complexity. The iterative hard thresholding (IHT) algorithm [DDM04, BD08, BD09] gradually refines the iterates by gradient descent along with truncation for sparsity. [Fou11] developed a concise algorithm termed hard thresholding pursuit (HTP), which combined the idea of CoSaMP and IHT, and showed that HTP is superior in terms of the RIP condition. [JTD11] proposed an interesting variant of the HTP algorithm and obtained a sharper RIP result. Recently, [BRB13] and [YLZ13] respectively extended the CoSaMP and HTP algorithm to general objective functions, for which a global convergence of popular learning models, e.g., logistic regression, was established.

Since the sparsity constraint counts the number of non-zero components which renders the problem non-convex, the norm was suggested as a convex relaxation dating back to Lasso [Tib96] and basis pursuit [CDS98]. The difference is that Lasso looks for an norm constrained solution that minimizes the residual while the principle of basis pursuit is to find a signal with minimal norm that fits the observation data. In [CT05], a detailed analysis on the sparse recovery performance of basis pursuit was carried out. Another popular sparse estimator in the high-dimensional statistics is the Dantzig selector [CT07] which, instead of constraining the residual of the linear model, penalizes the maximum magnitude of the gradient. From a computational perspective, both basis pursuit and Dantzig selector can be solved by linear programming, while Lasso is formulated as a quadratic problem. Interestingly, under the RIP condition or the uniform uncertainty assumption [CRT06], a series of work showed that exact sparse recovery by minimization is possible as soon as the observation noise vanishes [CT05, Can08, Wai09, CWX10, Fou12, CRPW12].

In this paper, we are interested in the hard thresholding (HT) operator underlying a large body of the developed algorithms in compressed sensing (e.g., IHT, CoSaMP, SP), machine learning (e.g., sparse principal component analysis [Ma13, YZ13]) and statistics (e.g., sparse precision matrix estimation [YLZ13]). Our motivation is two-fold: From a high level, compared to the convex programs, these HT-based algorithms are always orders of magnitude computationally more efficient, hence more practical for large-scale problems [TW10]. Nevertheless, they usually require a more stringent condition to guarantee the success of the algorithms. This naturally raises an interesting question whether we can derive milder or sharp conditions for HT-based algorithms to achieve the best of the two worlds. For practitioners, to address the huge volume of data, a popular strategy in machine learning is devising stochastic algorithms that sequentially update the solution. However, as many researchers observed [LLZ09, DS09, Xia10], it is hard for the based stochastic algorithms to preserve the sparse structure of the solution as the batch solvers do. This immediately poses the question whether we are able to apply the principal idea of hard thresholding to stochastic algorithms while still ensuring fast convergence of the algorithms.

To elaborate the problem more precisely, let us first turn to some basic properties of hard thresholding along with simple yet illustrative cases. For a general vector , the hard thresholded signal is formed by setting all but the largest (in magnitude) elements of to zero. Ties are broken lexicographically. Hence, the hard thresholded signal is always -sparse, i.e., the number of non-zero components does not exceed . Moreover, the resultant signal is a best -sparse approximation to in terms of any norm (). That is, we are guaranteed that

 ∥Hk(b)−b∥p≤∥a−b∥p. (1.1)

In view of the above inequality, a broadly used bound in the literature for the deviation of the thresholded signal in terms of metric is as follows:

 ∥Hk(b)−a∥2≤2∥b−a∥2. (1.2)

To gain intuition on the utility of (1.2) and to spell out the importance of offering a tight bound for it, let us consider the compressed sensing problem as an example for which we aim to recover a true sparse signal from its linear measurements. Here, is a proxy obtained by, e.g., full gradient descent. Recall that the IHT algorithm uses as the iterate and proves its convergence to . Then (1.2) justifies that after hard thresholding, the distance of the iterate to the true signal is bounded by a multiple of to the one before. For comparison, it is worth mentioning that in order to obtain a sparse solution, based algorithms usually utilize the soft thresholding operator which enjoys the non-expansiveness property [DBL14], i.e., the iterate becomes closer to the optimum. This salient feature might partially attribute to the wide range of applications of the regularized formulations. Hence, in order to derive comparable performance guarantee, tightening the bound (1.2) is crucial in that it controls how much deviation the hard thresholding operator induces. This turns out to be more demanding for stochastic gradient method, where the proxy itself is affected by the randomness of sample realization. In other words, since does not minimize the objective function (it only optimizes the objective in expectation), the deviation (1.2) makes it more challenging to analyze the convergence behavior. As an example, [NNW14] proposed a stochastic solver for general sparsity constrained programs but suffered a non-vanishing optimization error due to randomness. This indicates that to mitigate the randomness barrier, we have to seek a better bound to control the precision of the thresholded solution and to apply randomness (variance) reduction technique.

### 1.1 Summary of Contributions

In this work, we make three contributions:

1. We examine the tightness of (1.2) that has been used for a decade in the literature and show that the equality therein will never be attained. We then improve this bound and quantitatively characterize that the deviation is inversely proportional to the value of . Our bound is tight, in the sense that the equality we build can be attained for specific signals, hence it cannot be improved if no additional information is available. Our bound is universal in that it holds for all choices of -sparse signals and for general signals .

2. Owing to the tight estimate, we demonstrate how the RIP (or RIP-like) condition required by a wide range of hard thresholding based algorithms can be relaxed. In the context of compressed sensing, this means in essence, many more kinds of sensing matrices or fewer measurements can be utilized for data acquisition. For machine learning, it suggests that existing algorithms are capable of handling more difficult statistical models.

3. Finally, we present a novel algorithm which applies hard thresholding in large-scale setting and we prove its convergence to a global optimum with a geometric rate. Returning to (1.2), our analysis shows that only when the deviation is controlled below the multiple of can such an algorithm succeed. This immediately implies that the conventional bound (1.2) is not applicable in this challenging scenario.

### 1.2 Notation

Before delivering the algorithm and main theoretical results, let us instate several pieces of notation that are involved throughout the paper. We use bold lowercase letters, e.g., to denote a vector and its th element is denoted by . The norm of a vector is denoted by . The support set of is denoted by whose cardinality is denoted by or . We write bold capital letters such as for matrices and its -th entry is denoted by . The capital upright letter and its subscript variants (e.g., ) are reserved for some absolute constants that may change from appearance to appearance.

For an integer , suppose that is a subset of . Then for a general vector , we define as the orthogonal projection onto the support set which retains elements contained in and sets others to zero. That is,

 PΩ(v)={vi,i∈Ω,0,otherwise. (1.3)

In particular, let be the support set indexing the largest absolute components of . In this way, the hard thresholding operator is given by

 Hk(v)=PΓ(v). (1.4)

We will also use the orthogonal projection of a vector onto an ball with radius . That is,

 (1.5)

We present the key tight bound for hard thresholding in Section 2, along with a justification why the conventional bound (1.2) is not tight. We then discuss the implications of the developed tight bound to compressed sensing and machine learning in Section 3, which shows that the RIP or RIP-like condition can be improved for a number of popular algorithms. Thanks to our new estimation, Section 4 develops a novel stochastic algorithm which applies hard thresholding to large-scale problems and establishes the global linear convergence. A comprehensive empirical study on the tasks of sparse recovery and binary classification is carried out in Section 5. Finally, We conclude the paper in Section 6 and all the proofs are deferred to the appendix.

## 2 The Key Bound

We argue that the conventional bound (1.2) is not tight, in the sense that the equality therein can hardly be attained. To see this, recall how the bound was derived for a -sparse signal and a general one  (see, e.g., [NT09]):

 (2.1)

where the last inequality holds because is a best -sparse approximation to . The major issue occurs in . Though it is the well-known triangle inequality and the equality could be attained if there is no restriction on the signals and , we remind here that the signal does have a specific structure – it is -sparse. Note that in order to fulfill the equality in , we must have for some , that is,

 Hk(b)=(t+1)b−ta. (2.2)

One may verify that the above equality holds if and only if

 a=b=Hk(b). (2.3)

To see this, let be the support set of and be the complement. Let and . Likewise, we define and as the components of supported on and respectively. Hence, (2.2) indicates and where we assume since immediately implies and hence the equality of (1.2) does not hold. If , then we have since contains the largest absolute elements of . Otherwise, the fact that and implies , and hence . Therefore, we obtain (2.3).

When (2.3) happens, however, we in reality have . In other words, the factor of in (1.2) can essentially be replaced with an arbitrary constant! In this sense, we conclude that the bound (1.2) is not tight. Our new estimate for hard thresholding is as follows:

###### Theorem 1 (Tight Bound for Hard Thresholding).

Let be a general vector and be any -sparse signal111Even if is not exactly -sparse, we still can bound the error by . Thus, without loss of generality, we assume that is -sparse.. Choose . Then, we have the following universal bound:

 ∥Hk(b)−a∥2≤√ν∥b−a∥2,ν=1+ρ+√(4+ρ)ρ2,ρ=min{K,d−k}k−K+min{K,d−k}.

In particular, our bound is tight in the sense that there exist specific vectors of and such that the equality holds, hence the bound cannot be improved without further information about and .

###### Remark 2 (Maximum ν).

In contrast to the constant bound (1.2), our result asserts that the deviation resulting from hard thresholding is inversely proportional to  (when ) in a universal manner. When tends to , is given by which is still decreasing with respect to . Thus, the maximum value of equals one. Even in this case, we find that .

###### Remark 3.

It is also worth mentioning that, although for some batch algorithms such as IHT and CoSaMP, the constant bound (1.2) suffices to establish the convergence due to specific conditions, we show in Section 4 that that bound cannot ensure global optimality for stochastic algorithms with a generic objective function.

###### Proof.

(Sketch) Our bound follows from fully exploring the sparsity pattern of the signals and fundamental arguments in mathematical optimization. Let be the support set of and let be its complement. We immediately have . Let be the support set of . Define

 b1=PΩ∖Ω′(b),b2=PΩ∩Ω′(b),b3=P¯¯¯Ω∖Ω′(b),b4=P¯¯¯Ω∩Ω′(b).

Likewise, we define and for . Due to construction, we have . Our goal is to estimate the maximum value of . It is easy to show that when attaining the maximum, must be zero. Denote

 (2.4)

Note that the the variables here only involve and . Arranging the equation we obtain

 (t−1)∥b2−a2∥22+t∥b4−a4∥22−∥a4∥22+(t−1)∥b1∥22=0. (2.5)

It is evident that for specific choices of and , we have . Since we are interested in the maximum of , we assume below. Fixing , we can view the left-hand side of the above equation as a function of . One may verify that the function has a positive definite Hessian matrix and thus it attains the minimum at stationary point given by

 a∗2=b2,a∗4=tt−1b4. (2.6)

On the other hand, (2.5) implies the minimum values should not be greater than zero. Plugging the stationary point back gives

Solving the above inequality with respect to , we obtain

 t≤1+(2∥b1∥22)−1(∥b4∥22+√(4∥b1∥22+∥b4∥22)∥b4∥22). (2.7)

To derive an upper bound that is uniform over the choice of , we recall that contains the largest absolute elements of while has smaller values. In particular, we have by [JTK14, Lemma 1] that

Note that . Hence, combining with the fact that and optimizing over in the above inequality gives

 ∥b4∥22≤min{K,d−k}k−K+min{K,d−k}∥b1∥22. (2.8)

Finally, we arrive at a uniform upper bound for :

 t≤1+ρ+√(4+ρ)ρ2,ρ=min{K,d−k}k−K+min{K,d−k}.

See Appendix B for the full proof. ∎

###### Remark 4 (Tightness).

We construct proper vectors of and to establish the tightness of our bound by a backward induction. Note that equals if and only if . Hence, we pick

 ∥b4∥22=ρ∥b1∥22,a2=b2,a4=νν−1b4, (2.9)

where and are actually chosen as the stationary point as in (2.6). We note that the quantity of only depends on , and , not on the components of or . Plugging the above back to (2.4) justifies .

It remains to show that our choices in (2.9) do not violate the definition of ’s, i.e., we need to ensure that elements in or should be no less than those in or . Note that there is no such constraint for the -sparse vector . Let us consider the case and , so that and . Thus, the first equality of (2.9) holds as soon as all the entries of have same magnitude. The fact also implies is a subset of due to the definition of and the sparsity of , hence we have . Finally, picking as we did in (2.9) completes the reasoning since it does not violate the sparsity constraint on .

As we pointed out and just verified, the bound given by Theorem 1 is tight. However, it is worth noting that if there is additional information on the signals, a better bound can be established. For instance, let us further assume that the signal is -sparse. If , then is a zero vector and (2.7) reads as . Otherwise, we have and (2.8) is improved to

 ∥b4∥22≤min{K,s−k}k−K+min{K,s−k}∥b1∥22.

Finally, we can show that the parameter is given by

 ρ=min{K,s−k}k−K+min{K,s−k}.

Note that the fact implies that the above is a tighter bound than the one in Theorem 1.

We would also like to mention that in the appealing work of [JTK14], a closely related bound was established (see Lemma 1 therein):

 ∥Hk(b)−b∥2≤√d−kd−K∥b−a∥2. (2.10)

One may use this nice result to show that

 (2.11)

It is not hard to see that this bound also improves on (1.2) provided . However, one shortcoming of the above inequality is that the factor depends on the dimension. For comparison, we recall that in the regime , our bound is free of the dimension. This turns out to be a salient feature to integrate hard thresholding with stochastic methods, and we will comment on it more in Section 4.

## 3 Implications to Compressed Sensing and Beyond

In this section, we investigate the implications of Theorem 1 for compressed sensing and machine learning. Since most of the popular HT-based algorithms (e.g., IHT, CoSaMP, SP, GraSP [BRB13] and GraHTP [YLZ13]) utilize the deviation bound (1.2) to derive the convergence condition, they can be improved by our new bound. Due to space limit, we only exemplify the power of our theorem on two popular algorithms in compressed sensing: IHT [BD08, BD09] and CoSaMP [NT09], and one interesting work GraSP [BRB13] which extends CoSaMP to general machine learning models, e.g., sparse logistic regression. To be clear, the purpose of this section is not dedicated to improving the best RIP condition for which recovery is possible by any methods (either convex or non-convex). Rather, we focus on three broadly used greedy algorithms and illustrate how our bound improves on previous results.

We proceed with a brief review of the problem setting in compressed sensing. Compressed sensing algorithms aim to recover the true sparse signal from a set of (perhaps noisy) measurements

 y=Ax∗+ε, (3.1)

where is some observation noise and is a given sensing matrix with , hence the name compressive sampling. In general, the problem is under-determined since the number of unknowns is much larger than that of the equations on hand. Yet, the prior knowledge that is sparse radically changes the premise. That is, if the geometry of the sparse signal is preserved under the action of the sampling matrix for a restricted set of directions, then it is possible to invert the sampling process. Such a novel idea was quantified as the th restricted isometry property (RIP) of by [CT05], which requires that there exists a (non-negative) constant , such that for all -sparse signals , the following inequalities hold:

 (3.2)

The th restricted isometry constant (RIC) is then defined as the smallest one that satisfies the above inequalities. Note that is the minimum requirement for distinguishing all -sparse signals from the measurements. This is because for two arbitrary -sparse vectors and and their respective measurements and , the RIP condition reads as

for which guarantees that implies . To date, there are three quintessential examples known to exhibit profound restricted isometry behavior as long as the number of measurements is proportional to the product of the sparsity and a polynomial logarithm of the dimension: Gaussian matrices (optimal RIP, i.e., very small ), partial Fourier matrices (fast computation) and Bernoulli ensembles (low memory footprint).

Equipped with the (standard) RIP condition, a bunch of efficient algorithms were developed and exact sparse recovery was established. A partial list includes norm based convex programs, IHT, CoSaMP, SP, stagewise OMP [DTDS12] and regularized OMP [NV10], along with much interesting work devoted to improving or sharpening the RIP condition, e.g., [WS12, MS12, CZ13, Dan13, Mo15]. To see why relaxing RIP is of central interest, note that standard result (see, e.g., [BDDW08]) asserts that the RIP condition holds with high probability provided that

 n≥C0δ−2rlog(d/r). (3.3)

Hence, a slight relaxation of the condition may dramatically decrease the measurements. That being said, since the constant above is unknown, in general one cannot tell the precise sample size for greedy sparse recovery algorithms to succeed. Estimating the constant is actually the theme of phase transition [DT10]. While precise phase transition for -based convex programs has been well understood, an analogous result for greedy algorithms remains an open problem. Notably, in the recent work of [BT15], phase transition for IHT/CoSaMP was derived using the constant bound (1.2). We believe that our tight bound shall sharpen these results and we leave it as our future work. In the present paper, we focus on the ubiquitous RIP condition. In the language of RIP, we establish improved results in the following.

Iterative Hard Thresholding.  The IHT algorithm recovers the underlying -sparse signal by iteratively performing a full gradient descent on the least-squares loss followed by a hard thresholding step. That is, at the -th iteration, given the previous iterate , IHT updates the new solution as follows:

 xt=Hk(xt−1+A⊤(y−Axt−1)). (3.4)

Note that [BD09] used the parameter . However, in practice one may only have access to an upper bound on the true sparsity . Thus, we consider the projection sparsity as a parameter that depends on . To establish global convergence with a geometric rate of , [BD09] applied the bound (1.2) and assumed the RIP condition

 δ2k+K≤0.18. (3.5)

As we have shown, the bound (1.2) is actually not tight and hence, their results, especially the requirement of restricted isometry constant can be improved by Theorem 1.

###### Theorem 5.

Let be the true sparse signal and denote . Assume the measurements are collected by (3.1) for a given sensing matrix . Pick . Then, under the RIP condition of , the sequence of the iterates produced by the IHT algorithm (3.4) converges (up to the energy of the noise) to the true signal with a geometric rate of .

Let us first study the vanilla case . [BD09] required whereas our analysis shows suffices. Note that even a little relaxation on RIP is challenging and may require several pages of mathematical induction [Can08, CWX10, Fou12]. In contrast, our improvement comes from a direct application of Theorem 1 which only modifies several lines of the original proof in [BD09]. See Appendix C for details. In view of (3.3), we find that the necessary number of measurements for IHT is dramatically reduced with a factor of by our new theorem in that the minimum requirement of is inversely proportional to the square of .

Another important consequence is a characterization on the RIP condition and the quantity of sparsity parameter for IHT, which, to the best of our knowledge, has not been studied in the literation. In [BD09], when gradually tuning larger than , it always requires . Note that due to the monotonicity of RIC, i.e., if , the condition turns out to be more and more stringent. Compared to their result, since is inversely proportional to , Theorem 5 is powerful especially when becomes larger. For example, suppose . In this case, Theorem 5 justifies that IHT admits the linear convergence as soon as whereas [BD09] requires . Such a property is appealing in practice, in that among various real-world applications, the true sparsity is indeed unknown and we would like to estimate a conservative upper bound on it.

On the other hand, for a given sensing matrix, there does exist a fundamental limit for the maximum choice of . To be more precise, the condition in Theorem 5 together with the probabilistic argument (3.3) require

 1/√8ν≥δ2k+K,C1ν(2k+K)log(d/(2k+K))≤n. (3.6)

Although it could be very interesting to derive a quantitative characterization for the maximum of , we argue that it is perhaps intractable owing to two aspects: First, it is known that one has to enumerate all the combinations of the columns of so as to compute the restricted isometry constant  [BT10, BT14]. This suggests that it is NP-hard to estimate the largest admissible value of . Also, there is no analytic solution of the stationary point for the left-hand side of the second inequality.

Compressive Sampling Matching Pursuit.  The CoSaMP algorithm proposed by [NT09] is one of the most efficient algorithms for sparse recovery. Compared to IHT which performs hard thresholding after gradient update, CoSaMP prunes the gradient at the beginning of each iteration, followed by solving a least-squares program restricted on a small support set. In particular, in the last step, CoSaMP applies hard thresholding to form a -sparse iterate for future update. The analysis on CoSaMP consists of bounding the estimation error in each step. Owing to Theorem 1, we advance the theoretical result of CoSaMP by improving the error bound for its last step, and hence the RIP condition.

###### Theorem 6.

Let be the sparse signal of interest and denote . Assume the measurements are collected by (3.1) for a given sensing matrix . Pick . Then, under the RIP condition that

 δ3k+K≤(√32ν+49−9)1/24√ν−1,

the sequence of the iterates produced by CoSaMP converges (up to the energy of the noise) to with a geometric rate of .

Roughly speaking, the bound is still inversely proportional to . Hence, it is monotonically increasing with respect to , indicating our theorem is more effective for a large quantity of . In fact, for the CoSaMP algorithm, our bound above is superior to the best known result even when . To see this, we have the RIP condition . In comparison, [NT09] derived a bound and [FR13, Theorem 6.27] improved to for a geometric rate of . Thus, our bound here is the best known bound. It is worth mentioning that in [JTK14], they provided a unified analysis of two-stage sparse recovery algorithms and obtained the RIP bound for CoSaMP. Nevertheless, their bound holds only for binary sparse vectors.

Gradient Support Pursuit.  The GraSP algorithm generalizes CoSaMP in the sense that GraSP supports a wide class of loss functions rather than the least-squares. To characterize the problem structure, a property called stable restricted Hessian (SRH) was introduced in [BRB13] which is quantitatively equal to the condition number of the Hessian matrix when restricted to sparse directions. In special cases, SRH reduces to RIP. The hard thresholding operation also emerges in their last step, and hence the SRH requirement can be relaxed like we have done for CoSaMP, implying that GraSP can in reality cope with more difficult statistical models.

###### Theorem 7.

Suppose the objective function that GraSP minimizes is a twice continuously differentiable function that has -SRH and the smallest eigenvalue of the Hessian matrix restricted on -sparse vectors bounded away from zero. Then GraSP converges with a geometric rate of provided that

 μ3k+K≤1+(√ν+1)1/2−1√ν.

## 4 Hard Thresholding in Large-Scale Setting

Now we move on to the machine learning setting where our focus is pursuing an optimal sparse solution that minimizes a given objective function based on a set of training samples . Different from compressed sensing, we usually have sufficient samples which means can be very large. Formally, we are interested in optimizing the following program:

 minx∈Rd F(x;Zn1)=1nn∑i=1f(x;Zi),s.t. ∥x∥0≤K, ∥x∥2≤ω. (4.1)

The global optimum of the above problem is denoted by . To ease notation, we will often write as and as for . We note that the objective function is presumed to be decomposable with respect to the samples. This is quite a mild condition and most of the popular machine learning models fulfill it. Typical examples include (but not limited to) the sparse linear regression and sparse logistic regression:

• Sparse Linear Regression: For all , we have and the generative model is for a -sparse signal of interest and some observation noise . The loss function is the least-squares and can be explained by .

• Sparse Logistic Regression: For all , we have and the negative log-likelihood is penalized, i.e., for which .

It is worth mentioning that the objective function is allowed to be non-convex. Hence, in order to ensure the existence of a global optimum, a natural option is imposing an -norm () constraint [LW12, LW15]. Here we choose the -norm constraint owing to its fast computation. Previous work, e.g., [ANW12] prefers the computationally less efficient -ball to promote sparsity and to guarantee the existence of optimum. In our problem, yet, we already have imposed the hard sparsity constraint so the -norm constraint is a better fit.

The major contribution of this section is a novel algorithm termed HT-SVRG to optimize (4.1), tackling one of the most important problems in large-scale machine learning: producing sparse solutions by stochastic methods. We emphasize that the formulation (4.1) is in stark contrast to the regularized programs considered by previous stochastic solvers such as Prox-SVRG [XZ14] and SAGA [DBL14]. We target here a stochastic algorithm for the non-convex problem that is less exploited in the literature. From a theoretical perspective, (4.1) is much harder to analyze but it always produces -sparse solutions in practice, whereas performance guarantees for convex programs are easier to carry out but one cannot characterize the sparsity of the obtained solution (usually the solution is approximately sparse). When we look at stochastic algorithms, the -norm formulation becomes much less effective in terms of sparsity even with a proximal operation, naturally owing to the randomness. See [LLZ09, Xia10, DS09] for more discussion on the issue. We also remark that existing work such as [YLZ13, BRB13, JTK14] investigated the sparsity constrained problem (4.1) in a batch scenario, which is not practical for large-scale learning problems. The perhaps most related work to our new algorithm is [NNW14]. Nonetheless, the optimization error of their algorithm does not vanish.

Our main result shows global linear convergence to the empirical risk minimizer with vanishing optimization error. We illustrate such an appealing behavior by two prevalent statistical models, hence bridging the gap of the lack in theoretical analysis and the practical elegance of stochastic sparsity constrained solvers. We find that the global convergence is attributed to both the tight bound and the variance reduction technique to be introduced below, and examining the necessity of them is an interesting future work. We would also like to remind that readers should distinguish the optimal solution and the true parameter . For instance, suppose that the response of sparse linear regression model is generated by the true parameter , but minimizing (4.1) does not amount to recovering . In fact, the convergence to of the iterates is only guaranteed to an accuracy reflected by the statistical precision of the problem, i.e., , which is the best one can hope for any statistical model [ANW12]. There is numerous insightful work devoted to the minimax rate of the statistical error [Zha10, RWY11], whereas the focus of the paper is establishing convergence to a global optimum .

### 4.1 Algorithm

Overview.  Our algorithm (Algorithm 1) applies the framework of SVRG [JZ13], where the main idea is to leveraging past gradients for the current gradient update for the sake of variance reduction. To guarantee each iterate is -sparse, it then invokes the hard thresholding operation. Note that the orthogonal projection for will not change the support set, and hence is still -sparse. Also note that our sparsity constraint reads as . What we show below is that when the parameter is properly chosen (which depends on ), we obtain a globally convergent sequence of iterates. This actually coincides with the merit of [JTK14]. In fact, from a high level, our algorithm can be seen as a stochastic counterpart of [JTK14] though with a different proof technique.

Challenge.  The most challenging part on establishing global convergence comes from the hard thresholding operation . Note that it is that reduces the objective value in expectation. If is not -sparse (usually it is dense), is not equal to so it does not decrease the objective function. In addition, compared with the convex proximal operator [DBL14] which guarantees the non-expansiveness of the distance to the optimum, the hard thresholding step can enlarge the distance up to a multiple of (if using the bound (1.2)). What makes it a more serious problem is that such inaccurate iterates will be used for the next gradient update, and hence the error might be progressively propagated at an exponential rate.

Key Idea.  We first bound the curvature from below and above to establish RIP-like condition, which, combined with Theorem 1, downscales the deviation resulting from hard thresholding. Note that is always greater than one, hence the curvature bound is necessary. Due to the variance reduction technique, we show that the optimization error vanishes when restricted on a small set of directions as soon as we have sufficient samples.

### 4.2 Assumptions

Our analysis depends on two assumptions on the curvature of the objective function which have been standard in the literature (see, e.g., [RWY11, ANW12, JTK14]):

1. satisfies the restricted strong convexity (RSC) condition with parameter .

2. For all , satisfies the restricted smoothness (RSS) condition with parameter .

Here, we recall that was first introduced in (4.1) and the parameter was used in our algorithm. The RSC and RSS conditions are defined as follows:

###### Definition 8 (Restricted Strong Convexity).

A differentiable function is said to satisfy the property of restricted strong convexity with positive parameter , if for all vectors , with , it holds that

 f(x′)−f(x)−⟨∇f(x),x′−x⟩≥αr2∥∥x′−x∥∥22.
###### Definition 9 (Restricted Smoothness).

A differentiable function is said to satisfy the property of restricted smoothness with parameter , if for all vectors , with , it holds that

One may have noticed that if we assume is twice continuously differentiable (which holds for various loss functions such as least-squares, logistic loss), the RSC and RSS conditions amount to bounding the curvature of from below and above when restricted to directions with sparse support, respectively. Compared to the convex algorithms such as SAG [RSB12], SVRG [JZ13] and SAGA [DBL14] that assume strong convexity and smoothness everywhere, we only assume these in a restricted sense. This is more practical especially in the high dimensional regime where the Hessian matrix could be degenerate [ANW12].

### 4.3 Performance Guarantee

For simplicity, let us denote

 L= L3k+K,α=αk+K,c=L/α, (4.2) T= (4.3) κ= 4νη2T(2Lω+T)m+2Tω/α, (4.4)

where we recall that is the expansiveness factor given by Theorem 1, is the learning rate of the algorithm and is a universal constant that upper bounds the norm of the global optimum. Note that the values of and do not depend on the iterates produced by HT-SVRG. Virtually, with an appropriate parameter setting, scales as which will be clear below. Also note that the quantity is the condition number of the problem when restricted on sparse directions. However, we stress that it is not the global condition number, in that the RSS condition is imposed on each rather than . For a particular stage , we denote , i.e., the samples randomly chosen for updating the solution.

###### Theorem 10 (Linear Convergence of HT-SVRG).

Consider Algorithm 1. Assume 1 and 2. Pick the step size . If , then for sufficiently large , the sequence of the iterates converges to a global optimum of (4.1) in the sense that

 E[F(˜xs)−F(ˆx)]≤βs[F(˜x0)−F(ˆx)]+τ, (4.5)

where the expectation is taken over . Above, is the convergence coefficient and is referred to as the optimization error, both of which depend on the range of . In particular, for , we have

 β1= 1(2νηα−2νη2αL−ν+1)m+2νη2αL2νηα−2νη2αL−ν+1, (4.6) τ1= ακ2(2νηα−2νη2αL−ν+1)(1−β1)m. (4.7)

For , we have

 β2=1νηα(1−2ηL)m+2ηL1−2ηL,τ2=κ2νηα(1−2ηL)(1−β2)m. (4.8)
###### Remark 11.

From the theorem, we know that in order to guarantee the convergence, the quantity of should be less than due to . Hence, the conventional bound (1.2) is not applicable. In contrast, Theorem 1 asserts that this condition can be fulfilled by tuning slightly larger than .

For the sake of success of our algorithm, we require that is smaller than , which implies . Recall that is given in Theorem 1. In general, we are interested in the regime . Hence, we have and the minimum requirement for the sparsity parameter is . We note that to our knowledge, the idea of relaxed sparsity was first given in [JTK14] to show linear convergence of projected gradient descent. However, the sparsity parameter in our work emerges in a different way. We also emphasize that in that work, the smoothness parameter is computed on , which implies that the condition number therein is smaller than ours, hence a milder condition on . However, as we will demonstrate in Section 5, HT-SVRG does not degrade too much when compared to [JTK14]. So we conjecture that the local smoothness parameter can be refined.

We also contrast our tight bound to the inequality (2.11) that is obtained by combining the triangle inequality and Lemma 1 of [JTK14]. By a simple calculation, (2.11) results in a dependency

 k≥(1−(√4c(4c−1)−1−1)2)d+(√4c(4c−1)−1−1)2K

which grows with the dimension , whereas using Theorem 1 the sparsity parameter depends only on the desired sparsity . In light of this, we conclude that our bound is vital especially in the high dimensional setting.

Another component of the algorithm is the update frequency . Intuitively, HT-SVRG performs number of stochastic gradient update followed by a full gradient evaluation, in order to mitigate the variance. In this light, should not be too small. Otherwise, the algorithm reduces to the full gradient method which is not computationally efficient. On the other spectrum, a large leads to a slow convergence since it amounts to applying a stochastic gradient method. The dependency of (or ) on the parameter makes this more apparent. To quantitatively analyze how should be selected, let us consider the second case, i.e., for example. The first case follows in a similar way. In order to ensure , we must have . Picking , we find that the update frequency has to scale as , which is of the same order as in the convex case [JZ13].

Tradeoff between and .  As we have fully characterized the convergence behavior, it is interesting to analyze the tradeoff between the sparsity parameter and the convergence coefficient . Note that due to , is monotonically increasing with respect to . By Theorem 1, we know that is decreasing with respect to . Thus, we conclude that a larger quantity of results in a smaller value of , and hence a faster rate. Interestingly, for the second case, we obtain a contrary result: the smaller the tends, the faster the algorithm converges. Combining these two cases, we have the following consequence:

###### Proposition 12.

Fix and . Then the optimal choice of in Theorem 10 is in the sense that the convergence coefficient attains the minimum quantity. In this case, we have which is proportional to if we pick .

###### Remark 13.

Note that the result regarding the optimal choice of is exclusively heuristic. This is because we are minimizing an upper bound which may not be sharp.

Optimization Error.  The last ingredient of our theorem is the optimization error which measures the distance of the iterates to the global optimum. Intuitively, the magnitude of indicates how close the solution of (4.1) is to that without the sparsity constraint. In particular, consider that the gradient of vanishes at . By specifying admissible values to , and , we obtain the following corollary for Theorem 10:

###### Corollary 14.

Assume 12 and . Pick . Then, the sequence of the -sparse solutions produced by Algorithm 1 with and converges to the global optimum of (4.1) with a geometric rate of , i.e., .

We isolate the corollary because it implies that if one is solving a convex problem without the sparsity constraint but the optimal solution happens to be sparse, it is safe to perform hard thresholding without loss of optimality. We exemplify this with another convex algorithm SAGA [DBL14] in Appendix E. In the noiseless compressed sensing setting where , the corollary guarantees that HT-SVRG exactly recovers the underlying true signal when is chosen as the least-squares loss. This is simply due to the fact that we have and .

Now we proceed with a more involved analysis to bound the quantity . We point out that the optimization error in Theorem 10 is dominated by the quantity . To see this, we consider the choice for some constant and as an example. Note that such a choice only simplifies the constants. One may derive similar results for other as long as the step size is inversely proportional to , which is actually a common practice in the literature [Nes04]. As we have discussed, the update frequency has to be large enough, i.e.,

 m>c−η′η′(1−4η′). (4.9)

Let us pick such that

 1−4η′−c−η′η′m=ωαmin{1,L}. (4.10)

In this way and picking larger than , we have

 τ=[(1−4η′)min{1,L}−1α]T+4η′min{1,L}⋅T+2η′min{1,L}LωT2

Thus, it is evident that the optimization error is determined by the quantity of , i.e., the maximum norm of the gradient evaluated at the optimum when restricted on a sparse direction. In particular, when , .

Based on the above discussion, we are in the position to characterize the optimization error . For the purpose of a concrete result, we study the sparse linear regression and sparse logistic regression problems. These are two of the most popular statistical model in the literature and have found a variety of applications in machine learning and statistics [RWY11, ANW12].

For sparse linear regression, we presume that there is a true sparse parameter that generates the observations:

 y=Ax∗+ε, (4.12)

where is the design matrix and is some observation noise. Then, one may optimize an regularized least-squares function in order to (approximately) estimate the parameter:

 minx F(x)\lx@stackreldef=12nn∑i=1∥yi−ai⋅x∥22+γ2∥x∥22,s.t. ∥x∥0≤K, ∥x∥2≤ω. (4.13)

For sparse logistic regression, it assumes that the label ( or ) is determined by a conditional probability as follows:

 P(yi∣ai; x∗)=exp(2yia⊤ix∗)1+exp(2yia⊤ix∗), ∀ 1≤i≤n. (4.14)

It then learns the parameter by minimizing regularized negative log-likelihood:

 minx F(x)\lx@stackreldef=1nn∑i=1log(1+exp(−2yia⊤ix))+γ2∥x∥22,s.t. ∥x∥0≤k, ∥x∥2≤ω. (4.15)

For both problems, we focus on the sub-Gaussian design matrix . There are several ways to define a sub-Gaussian random variable. Below we provide two of them:

###### Definition 15 (Sub-Gaussian Random Variable).

A centered random variable is called -sub-Gaussian if for all ,

 Eθ[exp(θz)]≤exp(σ2z22), (4.16)

or equivalently

 P(|θ|>z)≤2exp(−z22σ2). (4.17)

Classical examples of sub-Gaussian random variables are Gaussian, Bernoulli and all bounded ones. Note that the parameter is usually interpreted as the variance [Ver10]. In the sections to follow, we may need the following assumption:

1. The entries of the design matrix are i.i.d. -sub-Gaussian with zero mean.

Armed with the sub-Gaussian condition, we establish the following that shows a vanishing optimization error for both problems.

###### Lemma 16.

Consider the sparse linear regression model (4.12). Assume 1. Further assume that the elements of the random noise are i.i.d. -sub-Gaussian that is independent of . Then as soon as for some absolute constant , with probability at least ,

 T≤σ2∥ˆx−x∗∥2√C0(3k+K)logdn+σ0σ√C0(3k+K)logdn+γ∥ˆx∥2.
###### Proposition 17.

Consider the sparse linear regression model (4.12). Assume 12 and 1. Then with probability at least , the sequence of the iterates produced by HT-SVRG converges linearly to the global optimum with vanishing optimization error that scales as , provided that for some absolute constant and the regularization parameter is not greater than .

###### Proposition 18.

Consider the sparse logistic regression model (4.15). Assume 12 and 1. Then with probability at least , the sequence of the iterates produced by HT-SVRG converges linearly to the global optimum with vanishing optimization error , provided that and is not larger than .

Computational Complexity.  Now we compare the time complexity of HT-SVRG to that of projected gradient descent (PGD) [JTK14], which is a batch counterpart to HT-SVRG. First, we remark that the analysis of PGD is based on the smoothness parameter of at sparsity level . We denote it as and accordingly, we write . To achieve accuracy , PGD requires iterations, hence the total computational complexity is given by . For HT-SVRG, we consider with for some constant . Furthermore, we pick . In this way, we obtain a constant convergence coefficient . Hence, HT-SVRG needs iterations. In each stage, HT-SVRG computes a full gradient followed by times stochastic updates. Therefore, the total complexity of HT-SVRG is given by by noting the fact .

It remains to compare the quantity of PGD to of HT-SVRG. It turns out that if the covariance matrix is the identity, , as shown below.

###### Proposition 19.

Assume 1. Then there exists some absolute constant , such that with probability at least , the condition number of sparse linear regression is bounded by the constant provided that .

###### Proposition 20.

Assume 1. Then there exist absolute constant and , such that the condition number of sparse logistic regression is upper bounded by 8.16 with probability at least , provided that .

For the condition number , we find that it scales as , which is a consequence of Lemma 2.3 of [PV13]. Therefore, in this regime, HT-SVRG performs comparably to PGD in terms of computational efficiency. However, if we consider a general covariance matrix, such that or , then one quickly observes that HT-SVRG is superior to PGD.

### 4.4 A Concurrent Work

After we posted the first version [SL16] on arXiv, Li et al. made their work [LZA16] public where a similar algorithm to HT-SVRG was presented. However, their theoretical analysis applies only to convex objective functions while we allow the function to be non-convex. We also note that some theoretical results in [LZA16] are partially correct. To show the global convergence, [LZA16] requires the step size to be lower bounded away from zero. In Theorem 10, we prove that any step size smaller than can be used. We also fully characterize the convergence behavior of the algorithm by showing the tradeoff between the sparsity parameter and the convergence coefficient . Such a theoretical understanding is not covered in [LZA16].

## 5 Experiments

### 5.1 Sparse Recovery

To understand the practical behavior of our algorithm as well as to justify the theoretical analysis, we perform numerical experiments for sparse recovery.

Data Generation.  We follow the setting in [TG07], i.e., the data dimension is fixed as and we generate an Gaussian random sensing matrix whose entries are independently and identically distributed with zero mean and variance . Then -sparse signals are independently generated, where the support of each signal is uniformly distributed over the index set . That is, we run our algorithm and baselines for trials. The measurements for each signal is obtained by , i.e., noise free. In this way, we are able to study the convergence rate by plotting the logarithm of the objective value since the optimum is known to be zero.

Baselines.  We mainly compare with two closely related algorithms: IHT [BD09] and projected gradient descent (PGD) [JTK14]. Both of them compute the full gradient of the least-squares loss followed by hard thresholding. Yet, PGD is more general, in the sense that it allows the sparsity parameter to be larger than the true sparsity ( for IHT) and a flexible step size ( for IHT). Hence, PGD can be viewed as a batch counterpart of our method HT-SVRG.

Evaluation Metric.  We say a signal is successfully recovered by if

 (5.1)

In this way, we can compute the percentage of success over the trials for each algorithm.

Parameter Setting.  If not specified, we set the update frequency , , for HT-SVRG. We use the heuristic step size for HT-SVRG and PGD, where returns the largest singular value of the matrix . Since for each stage, HT-SVRG computes the full gradient for times, we run the IHT and PGD for iterations for fair comparison, i.e., they have the same number of full gradient evaluations.

Our first simulation aims at offering a big picture on the recovery performance. To this end, we vary the number of measurements from to , roughly with a step size . We also study the performance with respect to the true sparsity parameter , which ranges from to , roughly with step size . The results are illustrated in Figure 1, where a brighter block means a higher percentage of success and the brightest ones indicate exact sparse recovery. It is apparent that PGD and HT-SVRG require fewer measurements for an accurate recovery than IHT, possibly due to the flexibility in sparsity parameter and step size. We also observe that as a stochastic algorithm, HT-SVRG performs comparably to PGD. This suggests that HT-SVRG is an appealing solution to large-scale sparse learning problems for which the condition number may grow with the sample size. In Figure 2, we exemplify some of the results obtained from HT-SVRG by plotting a number of curves, so that one can examine the quantity of the percentage of success.

Based on the results in Figure 1, we have an estimation on the minimum requirement of the sample size which ensures accurate (or exact) recovery. Now we are to investigate how many measurements are needed to guarantee a success percentage of and . To this end, for each sparsity parameter , we look for the number of measurements from Figure 1 where percents of success is achieved. Then we carefully enlarge with step size and run the algorithms. The empirical results are recorded in Figure 3, where the circle markers represent the empirical results with different colors indicating different algorithms, e.g., red circle for empirical observation of HT-SVRG. Then we fit these empirical results by linear regression, which are plotted as solid lines, e.g., the green line is a fitted model for IHT. We find that is almost linear with . Especially, the curve of HT-SVRG is nearly on top of that of PGD, which again verifies HT-SVRG is an attractive alternative to the batch method.