Phase Transitions for Greedy Sparse Approximation Algorithms
A major enterprise in compressed sensing and sparse approximation is the design and analysis of computationally tractable algorithms for recovering sparse, exact or approximate, solutions of underdetermined linear systems of equations. Many such algorithms have now been proven to have optimal-order uniform recovery guarantees using the ubiquitous Restricted Isometry Property (RIP) CTdecoding (). However, it is unclear when the RIP-based sufficient conditions on the algorithm are satisfied. We present a framework in which this task can be achieved; translating these conditions for Gaussian measurement matrices into requirements on the signal’s sparsity level, length, and number of measurements. We illustrate this approach on three of the state-of-the-art greedy algorithms: CoSaMP NeTr09_cosamp (), Subspace Pursuit (SP) SubspacePursuit () and Iterative Hard Thresholding (IHT) BlDa08_iht (). Designed to allow a direct comparison of existing theory, our framework implies that, according to the best known bounds, IHT requires the fewest number of compressed sensing measurements and has the lowest per iteration computational cost of the three algorithms compared here.
keywords:Compressed sensing, greedy algorithms, sparse solutions to underdetermined systems, restricted isometry property, phase transitions, Gaussian matrices.
In compressed sensing CompressiveSampling (); CTdecoding (); CompressedSensing (), one works under the sparse approximation assumption, namely, that signals/vectors of interest can be well approximated by few components of a known basis. This assumption is often satisfied due to constraints imposed by the system which generates the signal. In this setting, it has been proven (originally in CTdecoding (); CompressedSensing () and by many others since) that the number of linear observations of the signal, required to guarantee recovery, need only be proportional to the sparsity of the signal’s approximation. This is in stark contrast to the standard Shannon-Nyquist Sampling paradigm shannon () where worst-case sampling requirements are imposed.
In the simplest setting, consider measuring a vector which either has exactly nonzero entries, or which has entries whose magnitudes are dominant. Let be an matrix with which we use to measure ; the inner products with are the entries in . From knowledge of and one seeks to recover the vector , or a suitable approximation thereof, CS_SIREV (). Let denote the family of at most -sparse vectors in , where counts the number of nonzero entries. From and , the optimal -sparse signal is the solution of
where denotes the Euclidean norm.
However, solving (1) via a naive exhaustive search is combinatorial in nature and NP-hard NPhard (). A major aspect of compressed sensing theory is the study of alternative methods to solving (1). Since the system is underdetermined, any successful recovery of will require some form of nonlinear reconstruction. Under certain conditions, various algorithms have been shown to successfully reduce (1) to a tractable problem, one with a computational cost which is a low degree polynomial of the problem dimensions, rather than the exponential cost associated with a direct combinatorial search for the solution of (1). While there are numerous reconstruction algorithms, they each generally fall into one of three categories: greedy methods, regularizations, or combinatorial group testing. For an indepth discussion of compressed sensing recovery algorithms, see NeTr09_cosamp () and references therein.
The first uniform guarantees for exact reconstruction of every , for a fixed , came from -regularization. In this case, (1) is relaxed to solving the problem
for some known noise level, or decreasing, . -regularization has been extensively studied, see the pioneering works CTdecoding (); CompressedSensing (); also, see Do05_signal (); DoTa08_JAMS (); RIconstants () for results analogous to those presented here. In this paper, we focus on three illustrative greedy algorithms, Compressed Sensing Matching Pursuit (CoSaMP) NeTr09_cosamp (), Subspace Pursuit (SP) SubspacePursuit (), and Iterative Hard Thresholding (IHT) BlDa08_iht (), which boast similar uniform guarantees of successful recovery of sparse signals when the measurement matrix satisfies the now ubiquitous Restricted Isometry Property (RIP) CTdecoding (); RIconstants (). The three algorithms are deeply connected and each have some advantage over the other. These algorithms are essentially support set recovery algorithms which use hard thresholding to iteratively update the approximate support set; their differences lie in the magnitude of the application of hard thresholding and the vectors to which the thresholding is applied, DoMa09 (); TrWr09 (). The algorithms are restated in the next section. Other greedy methods with similar guarantees are available, see for example RON (); IHTother (); several other greedy techniques have been developed (DTDS08 (); NeVe06_UUP (); OMP_wakin (), etc.), but their theoretical analyses do not currently subscribe to the above uniform framework.
As briefly mentioned earlier, the intriguing aspect of compressed sensing is its ability to recover -sparse signals when the number of measurements required is proportional to the sparsity, , as the problem size grows, . Each of the algorithms discussed here exhibit a phase transition property, where there exists a such that for any , as , the algorithm successfully recovers all -sparse vectors provided and does not recover all -sparse vectors if . For a description of phase transitions in the context of compressed sensing, see PUT (), while for numerical average-case phase transitions for greedy algorithms, see DoMa09 (). We consider the asymptotic setting where and grow proportionally with , namely, with the ratios as for ; also, we assume the matrix is drawn i.i.d. from , the normal distribution with mean and variance . In this framework, we develop lower bounds on the phase transition for exact recovery of all -sparse signals. These bounds provide curves in the unit square, , below which there is an exponentially high probability on the draw the Gaussian matrix , that will satisfy the sufficient RIP conditions and therefore solve (1). We utilize a more general, asymmetric version of the RIP, see Definition 1, to compute as precise a lower bound on the phase transitions as possible. This phase transition framework allows a direct comparison of the provable recovery regions of different algorithms in terms of the problem instance . We then compare the guaranteed recovery capabilities of these algorithms to the guarantees of -regularization proven via RIP analysis. For -regularization, this phase transition framework has already been applied using the RIP BlCaTa09_spars (); RIconstants (), using the theory of convex polytopes Do05_signal () and geometric functional analysis RV07_gaussian ().
The aforementioned lower bounds on the algorithmic exact sparse recovery phase transitions are presented in Theorems 10, 11, and 12. The curves are defined by functions (SP; the magenta curve in Fig.1(a)), (CoSaMP; the black curve in Fig.1(a)), (IHT; the red curve in Fig.1(a)). For comparison, the analogous lower bound on the phase transition for (-regularization) is displayed as the blue curve in Fig.1(a). From Fig. 1, we are able to directly compare the provable recovery results of the three greedy algorithms as well as -regularization. For a given problem instance with the entries of drawn i.i.d. from , if falls in the region below the curve associated to a specific algorithm, then with probability approaching 1 exponentially in , the algorithm will exactly recover the -sparse vector no matter which was measured by . These lower bounds on the phase transition can also be interpreted as the minimum number of measurements known to guarantee recovery through the constant of proportionality: . Fig. 1(b) portrays the inverse of the lower bounds on the phase transition. This gives a minimum possible value for . For example, from the blue curve, for a Gaussian random matrix used in -regularization, the minimum number of measurements proven (using RIP) to be sufficient to ensure recovery of all -sparse vectors is . By contrast, for greedy algorithms, the minimum number of measurements shown to be sufficient is significantly larger: for IHT, for SP, and for CoSaMP.
More precisely, the main contributions of this article is the derivation of theorems and corollaries of the following form for each of the CoSaMP, SP, and IHT algorithms.
Given a matrix with entries drawn i.i.d. from , for any , let for some (unknown) noise vector . For any , as with and , there exists and , the unique solution to . If , there is an exponentially high probability on the draw of that the output of the algorithm at the iteration, , approximates within the bound
for some and .
Given a matrix with entries drawn i.i.d. from , for any , let . For any , with and as , there is an exponentially high probability on the draw of that the algorithm exactly recovers from and in a finite number of iterations not to exceed
with and , the smallest integer greater than or equal to .
Corollary 2 implies that delineates the region in which the algorithm can be guaranteed to converge provided there exists an such that . However, if no such exists, as approaches the guarantees on the number of iteraties required and stability factors become unbounded. Further bounds on the convergence factor and the stability factor result in yet lower curves for a specified ; recall that corresponds to the bound .
In the next section, we recall the three algorithms and introduce necessary notation. Then we present the asymmetric restricted isometry property and formulate weaker restricted isometry conditions on a matrix that ensure the respective algorithm will successfully recover all -sparse signals. In addition to exact recovery, we study the known bounds on the behavior of the algorithms in the presence of noisy measurements. In order to make quantitative comparisons of these results, we must select a matrix ensemble for analysis. In Section 3, we present the lower bounds on the phase transition for each algorithm when the measurement matrix is a Gaussian random matrix. Phase transitions are developed in the case of exact sparse signals while bounds on the multiplicative stability constants are also compared through associated level curves. Section 4 is a discussion of our interpretation of these results and how to use this phase transition framework for comparison of other algorithms.
For an index set , let denote the restriction of a vector to the set , i.e., for and for . Also, let denote the submatrix of obtained by selecting the columns indexed by . is the conjugate transpose of while is the pseudoinverse of . In each of the algorithms, thresholding is applied by selecting entries of a vector with largest magnitude; we refer to this as hard thresholding of magnitude .
2 Greedy Algorithms and the Asymmetric Restricted Isometry Property
The CoSaMP recovery algorithm is a support recovery algorithm which applies hard thresholding by selecting the largest entries of a vector obtained by applying a pseudoinverse to the measurement . In CoSaMP, the columns of selected for the pseudoinverse are obtained by applying hard thresholding of magnitude to applied to the residual from the previous iteration and adding these indices to the approximate support set from the previous iteration. This larger pseudoinverse matrix of size imposes the most stringent aRIP condition of the three algorithms. However, CoSaMP uses one fewer pseudoinverse per iteration than SP as the residual vector is computed with a direct matrix-vector multiply of size rather than with an additional pseudoinverse. Furthermore, when computing the output vector , CoSaMP does not need to apply another pseudoinverse as does SP. See Algorithm 1.
2.2 Subspace Pursuit
The Subspace Pursuit algorithm is also a support recovery algorithm which applies hard thresholding of magnitude to a vector obtained by applying a pseudoinverse to the measurements . The submatrix chosen for the pseudoinverse has its columns selected by applying to the residual vector from the previous iteration, hard thresholding of magnitude , and adding the indices of the terms to the previous approximate support set. Compared to the other two algorithms, a computational disadvantage of SP is that the aforementioned residual vector is also computed via a pseudoinverse, this time selecting the columns from by again applying a hard threshold of magnitude . The computation of the approximation to the target signal also requires the application of a pseudoinverse for a matrix of size . See Algorithm 2.
2.3 Iterative Hard Thresholding
Iterative Hard Thresholding (IHT) is also a support recovery algorithm. However, IHT applies hard thresholding to an approximation of the target signal, rather than to the residuals. This completely eliminates the use of a pseudoinverse, reducing the computational cost per iteration. In particular, hard thresholding of magnitude is applied to an updated approximation of the target signal, , obtained by matrix-vector multiplies of size that represent a move by a fixed stepsize along the steepest descent direction from the current iterate for the residual . See Algorthm 3.
(Stopping criteria for greedy methods) In the case of corrupted measurements, where for some noise vector , the stopping criteria listed in Algorithms 1-3 may never be achieved. Therefore, a suitable alternative stopping criteria must be employed. For our analysis on bounding the error of approximation in the noisy case, we bound the approximation error if the algorithm terminates after iterations. For example, we could change the algorithm to require a maximum number of iterations as an input and then terminate the algorithm if our stopping criteria is not met in fewer iterations. In practice, the user would be better served to stop the algorithm when the residual is no longer improving. For a more thorough discussion of suitable stopping criteria for each algorithm in the noisy case, see the original announcement of the algorithms BlDa08_iht (); SubspacePursuit (); NeTr09_cosamp ().
2.4 The Asymmetric Restricted Isometry Property
In this section we relax the sufficient conditions originally placed on Algorithms 1-3 by employing a more general notion of a restricted isometry. As discussed in RIconstants (), the singular values of the submatrices of an arbitrary measurement matrix do not, in general, deviate from unity symmetrically. The standard notion of the restricted isometry property (RIP) CTdecoding () has an inherent symmetry which is unneccessarily restrictive. Hence, seeking the best possible conditions for the measurement matrix under which Algorithms 1-3 will provably recovery every sparse vector, we reformulate the sufficient conditions in terms of the asymmetric restricted isometry property (aRIP) RIconstants ().
For an matrix , the asymmetric RIP constants and are defined as:
The more common, symmetric definition of the RIP constants is recovered by defining . In this case, a matrix of size has the RIP constant if
Observe that for any and therefore the constants , , and are nondecreasing in CTdecoding ().
For all expressions involving it is understood, without explicit statement, that the first argument is limited to the range where . Beyond this range of sparsity, there exist vectors which are mapped to zero, and are unrecoverable.
Using the aRIP, we analyze the three algorithms in the case of a general measurement matrix of size . For each algorithm, the application of Definition 1 results in a relaxation of the conditions imposed on to provably guarantee recovery of all . We first present a stability result for each algorithm in terms of bounding the approximation error of the output after iterations. The bounds show a multiplicative stability constant in terms of aRIP contants that amplifies the total energy of the noise. As a corollary, we obtain a sufficient condition on in terms of the aRIP for exact recovery of all -sparse vectors. The proofs of these results are found in the Appendix. These theorems and corollaries take the same form, differing for each algorithm only by the formulae for various factors. We state the general form of the theorems and corollaries, analogous to Theorem 1 and Corollary 2, and then state the formulae for each of the algorithms CoSaMP, SP, and IHT.
Given a matrix of size with aRIP constants and , for any , let , for some (unknown) noise vector . Then there exists such that if , the output of algorithm “alg” at the iteration approximates within the bound
for some and .
Given a matrix of size with aRIP constants and , for any , let . Then there exists such that if , the algorithm “alg” exactly recovers from and in a finite number of iterations not to exceed
where defined as in (5).
We begin with Algorithm 1, the Compressive Sampling Matching Pursuit recovery algorithm of Needell and Tropp NeTr09_cosamp (). We relax the sufficient recovery condition in NeTr09_cosamp () via the aRIP.
Theorem 5 (CoSaMP)
Next, we apply the aRIP to Algorithm 2, Dai and Milenkovic’s Subspace Pursuit SubspacePursuit (). Again, the aRIP provides a sufficient condition that admits a wider range of measurement matrices than admitted by the symmetric RIP condition derived in SubspacePursuit ().
Theorem 6 (Sp)
Finally, we apply the aRIP analysis to Algorithm 3, Iterative Hard Thresholding for Compressed Sensing introduced by Blumensath and Davies BlDa08_iht (). Theorem 7 employs the aRIP to provide a weaker sufficient condition than derived in BlDa08_iht ().
Theorem 7 (Iht)
Each of Theorems 5, 6 and 7 are derived following the same recipe as in NeTr09_cosamp (), SubspacePursuit () and BlDa08_iht (), respectively, using the aRIP rather than the RIP and taking care to maintain the least restrictive bounds at each step (for details, see the Appendix). For Gaussian matrices, the aRIP improves the lower bound on the phase transitions by nearly a multiple of 2 when compared to similar statements using the classical RIP. For IHT, the aRIP is simply a scaling of the matrix so that its RIP bounds are minimal. This is possible for IHT as the factors in involve and for only one value of , here . No such scaling interpretation is possible for CoSaMP and SP.
At this point, we digress to mention that the first greedy algorithm shown to have guaranteed exact recovery capability is Needell and Vershynin’s ROMP (Regularized Orthogonal Matching Pursuit) NeVe06_UUP (). We omit the algorithm and a rigorous discussion of the result, but state an aRIP condition that will guarantee sparse recovery. ROMP chooses additions to the approximate support sets at each iteration with a regularization step requiring comparability between the added terms. This comparability requires a proof of partitioning a vector of length into subsets with comparable coordinates, namely the magnitudes of the elements of the subset differ by no more than a factor of 2. The proof that such a partition exists, with each partition having a nonzero energy, forces a pessimistic bound that decays with the problem size.
Theorem 8 (Regularized Orthogonal Matching Pursuit)
Let be a matrix of size with aRIP constants and . Define
If , then ROMP is guaranteed to exactly recover any from the measurements in a finite number of iterations.
Unfortunately, this dependence of the bound on the size of the problem instance forces the result to be inadequate for large problem instances. In fact, this result is inferior to the results for the three algorithms stated above which are all independent of problem size and therefore applicable to the most interesting cases of compressed sensing, when and . It is possible that this dependence on the problem size is an artifact of the technique of proof; without removing this dependence, large problem instances will require the measurement matrix to be a true isometry and the phase transition framework of the next section does not apply.
3 Phase Transitions for Greedy Algorithms with Gaussian Matrices
The quantities and in Theorems 5, 6, and 7 dictate the current theoretical convergence bounds for CoSaMP, SP, and IHT. Although some comparisons can be made between the forms of and for different algorithms, it is not possible to quantitatively state for what range of the algorithm will satisfy bounds on and for a specific value of and . To establish quantitative interpretations of the conditions in Theorems 5, 6 and 7, it is necessary to have quantitative bounds on the behaviour of the aRIP constants and for the matrix in question, BCT09_DecayRIP (); RIconstants (). Currently, there is no known matrix for which it has been proven that and remain bounded above and away from one, respectively, as grows, for and proportional to . However, it is known that for some random matrix ensembles, with exponentially high probability on the draw of , and do remain bounded as grows, for and proportional to . The ensemble with the best known bounds on the growth rates of and in this setting is the Gaussian ensemble. In this section, we consider large problem sizes as , with and for . We study the implications of the sufficient conditions from Section 2 for matrices with Gaussian i.i.d. entries, namely, entries drawn i.i.d. from the normal distribution with mean and variance , .
Gaussian random matrices are well studied and much is known about the behavior of their eigenvalues. Edelman EdelmanEigenvalues88 () derived bounds on the probability distribution functions of the largest and smallest eigenvalues of the Wishart matrices derived from a matrix with Gaussian i.i.d. entries. Select a subset of columns indexed by with cardinality and form the submatrix , and the associated Wishart matrix derived from is the matrix . The distribution of the most extreme eigenvalues of all Wishart matrices derived from with Gaussian i.i.d. entries is only of recent interest and the exact probability distribution functions are not known. Recently, using Edelman’s bounds EdelmanEigenvalues88 (), the first three authors RIconstants () derived upper bounds on the probability distribution functions for the most extreme eigenvalues of all Wishart matrices derived from . These bounds enabled them to formulate upper bounds on the aRIP constants, and , for matrices of size with Gaussian i.i.d. entries.
Theorem 9 (Blanchard, Cartis, and Tanner RIconstants ())
The details of the proof of Theorem 9 are found in RIconstants (). The bounds are derived using a simple union bound over all of the Wishart matrices that can be formed from columns of . Bounds on the tail behavior of the probability distribution function for the largest and smallest eigenvalues of can be expressed in the form with defined in (18) and (19) and a polynomial. Following standard practices in large deviation analysis, the tails of the probability distribution functionals are balanced against the exponentially large number of Wishart matrices (20) and (21) to define upper and lower bounds on the largest and smallest eigenvalues of all Wishart matrices, with bounds and , respectively. Overestimation of the union bound over the combinatorial number of Wishart matrices causes the bound to not be strictly increasing in for large; to utilize the best available bound on the extreme of the largest eigenvalue, we note that any bound for is also a valid bound for submatrices of size . The asymptotic bounds of the aRIP constants, and , follow directly. See Figure 3 for level curves of the bounds.
With Theorem 9, we are able to formulate quantitative statements about the matrices with Gaussian i.i.d. entries which satisfy the sufficient aRIP conditions from Section 2. A naive replacement of each and in Theorems 5-7 with the asymptotic aRIP bounds in Theorem 9 is valid in these cases. The properties necessary for this replacement are detailed in Lemma 16, stated in the Appendix. For each algorithm (CoSaMP, SP and IHT) the recovery conditions can be stated in the same format as Theorem 1 and Corollary 2, with only the expressions for , and differing. These recovery factors are stated in Theorems 10-12.
An analysis similar to that presented here for the greedy algorithms CoSaMP, SP, and IHT was previously carried out in RIconstants () for the -regularization problem (2). The form of the results differs from those of Theorem 1 and Corollary 2 in that no algorithm was specified for how (2) is solved. For this reason, no results are stated for the convergence rate or number of iterations. However, (2) can be reformulated as a convex quadratic or second-order cone programming problem — and its noiseless variant as a linear programming — which have polynomial complexity when solved using interior point methods NN (). Moreover, convergence and complexity of other alternative algorithms for solving (2) such as gradient projection have long been studied by the optimization community for more general problems Bertsekas (); nesterovbook (); NW (), and recently, more specifically for (2) GPSR (); Nesterov () and many more. For completeness, we include the recovery conditions for -regularization derived in RIconstants (); these results follow from the original -regularization bound derived by Foucart and Lai FoucartLai08 () for general .
Theorem 13 (Blanchard, Cartis, and Tanner RIconstants ())
Given a matrix with entries drawn i.i.d. from , for any , let for some (unknown) noise vector . Define
with and defined as in Theorem 9. Let be the unique solution to . For any , as with and , there is an exponentially high probability on the draw of that
approximates within the bound
Corollary 14 (Blanchard, Cartis, and Tanner RIconstants ())
Given a matrix with entries drawn i.i.d. from , for any , let . For any , with and as , there is an exponentially high probability on the draw of that
exactly recovers from and .
4 Discussion and Conclusions
We have presented a framework in which recoverability results for sparse approximation algorithms derived using the ubiquitous RIP can be easily compared. This phase transition framework, Do05_signal (); DoTa05_signal (); DoTa08_JAMS (); RIconstants (), translates the generic RIP-based conditions of Theorem 3 into specific sparsity levels and problem sizes and for which the algorithm is guaranteed to satisfy the sufficient RIP conditions with high probability on the draw of the measurement matrix; see Theorem 1. Deriving (bounds on) the phase transitions requires bounds on the behaviour of the measurement matrix’ RIP constants BCT09_DecayRIP (). To achieve the most favorable quantitative bounds on the phase transitions, we used the less restrictive aRIP constants; moreover, we employed the best known bounds on aRIP constants, those provided for Gaussian matrices RIconstants (), see Theorem 9.
Computational Cost of CoSaMP, SP and IHT
The major computational cost per iteration in these algorithms is the application of one or more pseudoinverses. SP uses two pseudoinverses of dimensions per iteration and another to compute the output vector ; see Algorithm 2. CoSaMP uses only one pseudoinverse per iteration but of dimensions ; see Algorithm 1. Consequently, CoSaMP and SP have identical computational cost per iteration, of order , if the pseudoinverse is solved using an exact factorization. IHT avoids computing a pseudoinverse altogether in internal iterations, but is aided by one pseudoinverse of dimensions on the final support set. Thus IHT has a substantially lower computational cost per iteration than CoSaMP and SP. Note that pseudoinverses may be computed approximately by an iterative method such as conjugate gradients NeTr09_cosamp (). As such, the exact application of a pseudoinverse could be entirely avoided, improving the implementation costs of these algorithms, especially of CoSaMP and SP.
Globally, all three algorithms converge linearly; in fact, they converge in a finite number of iterations provided there exists a -sparse solution to and a sufficient aRIP condition is satisfied, see Corollary 2. For each algorithm, the upper bound on the required number of iterations grows unbounded as the function . Hence, according to the bounds presented here, to ensure rapid convergence, it is advantageous to have a matrix that satisfies a more strict condition, such as . Similarly, the factor controlling stability to additive noise, namely the vector in Theorem 1, blows up as the function . Again, according to the bounds presented here, in order to guarantee stability with small amplification of the additive noise, it is necessary to restrict the range of . A phase transition function analogous to the functions can be easily computed in these settings as well, resulting in curves lower than those presented in Figure 1(a). This is the standard trade-off of compressed sensing, where one must determine the appropriate balance between computational efficiency, stability, and minimizing the number of measurements.
Comparison of Phase Transitions and Constants of Proportionality
From Figure 1(a), we see that the best known lower bounds on the phase transitions for the three greedy algorithms satisfy the ordering