Tuning Free Orthogonal Matching Pursuit


Sreejith Kallummil, Sheetal Kalyani
Department of Electrical Engineering
Indian Institute of Technology Madras
Chennai, India 600036
{ee12d032,skalyani}@ee.iitm.ac.in
Abstract

Orthogonal matching pursuit (OMP) is a widely used compressive sensing (CS) algorithm for recovering sparse signals in noisy linear regression models. The performance of OMP depends on its stopping criteria (SC). SC for OMP discussed in the literature typically assume knowledge of either the sparsity of the signal to be estimated or the noise variance, both of which are unavailable in many practical applications. In this article we develop a modified version of OMP called tuning free OMP or TF-OMP which does not require an SC. TF-OMP is proved to accomplish successful sparse recovery under the usual assumptions on the restricted isometry constants (RIC) and the mutual coherence of the design matrix. TF-OMP is numerically shown to deliver a highly competitive performance in comparison with OMP having a priori knowledge of the sparsity level or the noise variance. Greedy algorithm for robust de-noising (GARD) is an OMP like algorithm proposed for efficient estimation in classical overdetermined linear regression models corrupted by sparse outliers. However, GARD requires knowledge of the inlier noise variance, which is difficult to estimate. We also develop a tuning free algorithm (TF-GARD) for efficient estimation in the presence of sparse outliers by extending the operating principle of TF-OMP to GARD. TF-GARD is numerically shown to achieve a performance comparable to that of the existing implementation of GARD.

I Introduction

Consider the linear regression model \(y = X\beta + w\), where \(X \in \mathbb{R}^{n \times p}\) is a known design matrix, \(w\) is the noise vector and \(y \in \mathbb{R}^n\) is the observation vector. The design matrix is rank deficient in the sense that \(n < p\). Further, the columns of \(X\) are normalised to have unit Euclidean norm. The vector \(\beta \in \mathbb{R}^p\) is sparse, i.e., the support of \(\beta\) given by \(\mathcal{S} = \{k : \beta_k \neq 0\}\) has cardinality \(k_0 = |\mathcal{S}| \ll p\). The noise vector \(w\) is assumed to have a Gaussian distribution with zero mean and covariance \(\sigma^2 I_n\), i.e., \(w \sim \mathcal{N}(0, \sigma^2 I_n)\). The signal to noise ratio (SNR) in this regression model is defined as

\(\text{SNR} = \dfrac{\mathbb{E}\|X\beta\|_2^2}{\mathbb{E}\|w\|_2^2} = \dfrac{\|X\beta\|_2^2}{n\sigma^2}\)   (1)

Throughout this paper, \(\mathbb{E}(\cdot)\) represents the expectation operator and \(\|x\|_2\) represents the Euclidean (\(l_2\)) norm of a vector \(x\). In this article we consider the following two problems in the context of recovering sparse vectors in underdetermined linear regression models, both of which are of wide interest.

P1). Estimate \(\beta\) with the objective of minimizing the mean squared error (MSE) \(\mathbb{E}\|\hat{\beta} - \beta\|_2^2\).

P2). Estimate the support \(\mathcal{S}\) of \(\beta\) with the objective of minimizing the probability of support recovery error \(PE = \mathbb{P}(\hat{\mathcal{S}} \neq \mathcal{S})\), where \(\hat{\mathcal{S}}\) denotes the estimated support (a small numerical sketch of both metrics is given below).
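For concreteness, the two metrics above can be evaluated as in the short NumPy sketch below. This is only an illustration; the function names and the support-detection tolerance are our own assumptions and not notation from the paper.

import numpy as np

def squared_error(beta_hat, beta):
    # Averaging this quantity over many noise realizations approximates
    # the MSE objective in P1).
    return float(np.sum((beta_hat - beta) ** 2))

def support_error(beta_hat, beta, tol=1e-8):
    # Indicator of a support recovery error; its average over many noise
    # realizations approximates the probability PE in P2).
    S_hat = set(np.flatnonzero(np.abs(beta_hat) > tol))
    S = set(np.flatnonzero(np.abs(beta) > tol))
    return float(S_hat != S)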

These problems are common in signal processing applications like sparse channel estimation[1], direction of arrival estimation[2], multi user detection[3] etc. Typical machine learning applications include sparse subspace clustering[4], sparse representation classification[5] etc. In the signal processing community these problems are discussed under the compressive sensing (CS) paradigm[6]. A number of algorithms like least absolute shrinkage and selection operator (LASSO)[7, 8], Dantzig selector (DS)[9], subspace pursuit (SP)[10], compressive sampling matching pursuit (CoSaMP)[11], sparse Bayesian learning (SBL)[12], orthogonal matching pursuit (OMP)[13, 14, 15, 16, 17, 18, 19] etc. have been proposed to solve the above mentioned problems. However, for the optimal performance of these algorithms, a number of tuning parameters (also called hyper parameters) need to be fixed. For example, the value of the regularization parameter \(\lambda\) in the LASSO estimate

\(\hat{\beta}_{\text{LASSO}} = \underset{\mathbf{b} \in \mathbb{R}^p}{\arg\min}\ \|y - X\mathbf{b}\|_2^2 + \lambda\|\mathbf{b}\|_1\)   (2)

has to be tuned appropriately. Indeed, when the noise is Gaussian, a value of \(\lambda\) proportional to \(\sigma\sqrt{\log p}\) is known to be optimal in terms of MSE performance[8]. Likewise, a choice of \(\lambda\) that scales suitably with \(\sigma\) is known to deliver \(PE \rightarrow 0\) as \(\sigma^2 \rightarrow 0\) under some regularity conditions[20]. Similarly, for the optimal performance of DS, one needs to have knowledge of \(\sigma^2\)[9]. However, unlike the case of overdetermined linear regression models where one can readily estimate \(\sigma^2\) using the maximum likelihood (ML) estimator, estimating \(\sigma^2\) in underdetermined linear regression models is extremely difficult[21]. This means that the optimal performance using LASSO and DS in many practical applications involving Gaussian noise is not possible. (Footnote 1: Apart from the Gaussian noise model considered in this paper, two other noise models popular in the literature are \(l_2\) bounded noise and \(l_\infty\) bounded noise. The optimal performance of the aforementioned algorithms in these models requires knowledge of the corresponding noise bounds, which are also difficult to estimate.) Even if the noise variance is known, an amount of subjectivity is involved in fixing the tuning parameters. SBL, on the other hand, involves a non convex optimization problem which is solved using the expectation maximization (EM) algorithm, and hence the solution depends on the initialization of the EM algorithm. Likewise, algorithms like CoSaMP, SP etc. require a priori knowledge of the sparsity level \(k_0\), which is rarely available. OMP, which is the focus of this article, requires knowledge of either \(k_0\) or \(\sigma^2\) for optimal performance. Hence, in many practical applications, the statistician is forced to choose ad hoc values of tuning parameters for which no performance guarantees are available. A popular alternative is based on techniques like cross validation which can deliver reasonably good performance at the expense of significantly high computational complexity[22, 23, 21]. Further, cross validation is also known to be ineffective for support recovery problems[23].

I-A Tuning parameter free sparse recovery.

The literature on tuning parameter free sparse recovery procedures is recent in comparison with the literature on sparse recovery algorithms like OMP, LASSO, DS etc. A seminal contribution in this field is the square root LASSO[24] algorithm which estimates \(\beta\) by

\(\hat{\beta}_{\text{SQRT-LASSO}} = \underset{\mathbf{b} \in \mathbb{R}^p}{\arg\min}\ \|y - X\mathbf{b}\|_2 + \lambda\|\mathbf{b}\|_1\)   (3)

For optimal MSE performance, the regularization parameter \(\lambda\) in (3) can be set independent of \(\sigma^2\), thereby overcoming a major drawback of LASSO. However, the choice of \(\lambda\) is still subjective, with few guidelines available. The high SNR behaviour of PE for square root LASSO is not reported in the literature. Another interesting development in this area is sparse iterative covariance-based estimation, popularly called SPICE[25]. SPICE is a convex optimization based algorithm that is completely devoid of any hyper parameters. The relationships between SPICE and techniques like LAD-LASSO, square root LASSO and LASSO are derived in [26, 27]. Another tuning parameter free algorithm called LIKES, which is closely related to SPICE, is proposed in [28]. Another interesting contribution in this area is the derivation of analytical properties of the non negative least squares (NNLS) estimator

\(\hat{\beta}_{\text{NNLS}} = \underset{\mathbf{b} \geq \mathbf{0}_p}{\arg\min}\ \|y - X\mathbf{b}\|_2^2\)   (4)

in [29], which points to the superior performance of NNLS in terms of MSE. However, the NNLS estimate is applicable only to cases where the sign pattern of \(\beta\) is known a priori. The existing literature on tuning free sparse recovery has many disadvantages. In particular, all these techniques are computationally complex in comparison with simple algorithms like OMP, CoSaMP etc. Notwithstanding the connections established between algorithms like SPICE and LASSO, the performance guarantees of SPICE are not well established.

I-B Robust regression in the presence of sparse outliers.

In addition to the recovery of sparse signals in underdetermined linear regression models (which is the main focus of this article), we also consider a regression model widely popular in robust statistics called the sparse outlier model. Here we consider the regression model

\(y = X\beta + g + w\)   (5)

where \(X \in \mathbb{R}^{n \times p}\) is a full rank design matrix with \(n > p\) or \(n = p\), the regression vector \(\beta\) may or may not be sparse and the inlier noise \(w \sim \mathcal{N}(0, \sigma^2 I_n)\). The outlier noise \(g\) represents the large errors in the regression equation that are not modelled by the inlier noise distribution. In many cases of practical interest, \(g\) is modelled as sparse, i.e., the number of non zero entries in \(g\) is much smaller than \(n\). However, the non zero entries in \(g\) can take large values, i.e., \(\|g\|_2\) can be potentially high. Algorithms from robust statistics like Huber's M-estimate[30] were used to solve this problem. Recently, a number of algorithms that utilize the sparse nature of \(g\), like the convex optimization based approaches[31, 32], the SBL based approach[33], the OMP based greedy algorithm for robust de-noising (GARD)[34] etc., are shown to estimate \(\beta\) more efficiently than the robust statistics based techniques. Just like the case of sparse regression, algorithms proposed for robust estimation in the presence of sparse outliers also require tuning parameters that are subjective and dependent on the inlier noise variance \(\sigma^2\) (which is difficult to estimate).
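To make the sparse outlier model (5) concrete, the following NumPy sketch generates one realization of it. All dimensions, the outlier count and the outlier scale are illustrative assumptions.

import numpy as np

def generate_outlier_data(n=200, p=20, n_outliers=10, sigma=0.1,
                          outlier_scale=10.0, seed=None):
    # Overdetermined design, dense regression vector, Gaussian inlier noise
    # and a few large outliers at randomly chosen observation indices.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p)) / np.sqrt(n)
    beta = rng.standard_normal(p)
    w = sigma * rng.standard_normal(n)                 # inlier noise
    g = np.zeros(n)                                    # sparse outlier vector
    idx = rng.choice(n, size=n_outliers, replace=False)
    g[idx] = outlier_scale * rng.standard_normal(n_outliers)
    y = X @ beta + g + w
    return X, y, beta, g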

I-C Contribution of this article.

This article makes the following contributions to the CS literature. We propose a novel way of using the popular OMP, called tuning free OMP (TF-OMP), which does not require a priori knowledge of the sparsity level \(k_0\) or the noise variance \(\sigma^2\) and is completely devoid of any tuning parameters. We analytically establish that TF-OMP can recover the true support in bounded noise (\(\|w\|_2 \leq \epsilon\)) if the matrix \(X\) satisfies either the exact recovery condition (ERC)[13], the mutual incoherence condition (MIC)[14] or the restricted isometry condition in [35], and the minimum non zero magnitude \(\beta_{\min} = \min_{k \in \mathcal{S}} |\beta_k|\) is large enough. It is important to note that the conditions imposed on the design matrix for successful support recovery using TF-OMP are no more stringent than the results available[14, 15, 35] in the literature for OMP with a priori knowledge of \(k_0\) or the noise variance \(\sigma^2\). Under the same set of conditions on the matrix \(X\), TF-OMP is shown to achieve high SNR consistency[36, 37, 20] in Gaussian noise, i.e., \(PE \rightarrow 0\) as \(\sigma^2 \rightarrow 0\). This is the first time a tuning free CS algorithm is shown to achieve high SNR consistency. As mentioned before, GARD for estimation in the presence of sparse outliers is closely related to OMP. We extend the operating principle behind TF-OMP to GARD and develop a modified version of GARD called TF-GARD, which is devoid of tuning parameters and does not require knowledge of the inlier noise variance \(\sigma^2\). Both proposed algorithms, viz. TF-OMP and TF-GARD, are numerically shown to achieve highly competitive performance in comparison with a broad class of existing algorithms over a number of experiments.

I-D Notations used.

\(\text{span}(X)\) denotes the column space of \(X\). \(X^T\) is the transpose and \(X^{\dagger} = (X^TX)^{-1}X^T\) is the Moore-Penrose pseudo inverse of \(X\) (if \(X\) has full column rank). \(P_X = XX^{\dagger}\) is the projection matrix onto \(\text{span}(X)\). \(X_{\mathcal{J}}\) denotes the sub-matrix of \(X\) formed using the columns indexed by \(\mathcal{J}\). \(X_{i,j}\) is the \((i,j)^{th}\) entry of \(X\). If \(\mathcal{J}\) is clear from the context, we use the shorthand \(P_{\mathcal{J}}\) for \(P_{X_{\mathcal{J}}}\). Both \(\beta_{\mathcal{J}}\) and \(\beta(\mathcal{J})\) denote the entries of \(\beta\) indexed by \(\mathcal{J}\). \(\chi^2_m\) is a central chi square distribution with \(m\) degrees of freedom (d.o.f). A complex Gaussian R.V with mean \(\mu\) and covariance matrix \(\Sigma\) is denoted by \(\mathcal{CN}(\mu, \Sigma)\). \(a \sim b\) implies that \(a\) and \(b\) are identically distributed. \(\|X\|_2\) is the induced matrix (spectral) norm. \([p]\) denotes the set \(\{1, 2, \ldots, p\}\). \(\lfloor \cdot \rfloor\) denotes the floor function. \(\emptyset\) represents the null set. For any two index sets \(\mathcal{J}_1\) and \(\mathcal{J}_2\), the set difference \(\mathcal{J}_1 \setminus \mathcal{J}_2 = \{j \in \mathcal{J}_1 : j \notin \mathcal{J}_2\}\). For any index set \(\mathcal{J}\), \(\mathcal{J}^c\) denotes the complement of \(\mathcal{J}\) with respect to \([p]\). \(f(n) = O(g(n))\) iff \(\limsup_{n \rightarrow \infty} |f(n)/g(n)| < \infty\).

I-E Organization of this article:-

Section II discusses the existing literature on OMP. Section III presents TF-OMP. Section IV presents the performance guarantees for TF-OMP. Section V discusses the TF-GARD algorithm. Section VI presents the numerical simulations.

II OMP: Prior art

The proposed tuning parameter free sparse recovery algorithm is based on OMP. OMP is a greedy procedure to perform sparsity constrained least squares minimization. OMP starts with a null model and, in each iteration, adds to the current support the column that is most correlated with the current residual. An algorithmic description of OMP is given in TABLE I. The performance of OMP is determined by the properties of the measurement matrix \(X\), the ambient SNR, the sparsity \(k_0\) of \(\beta\) and the stopping condition (SC). We first describe the properties of \(X\) that are conducive for sparse recovery using OMP.

Step 1:- Initialize the residual \(r^{(0)} = y\),
           Support estimate \(\mathcal{S}_0 = \emptyset\), iteration counter \(k = 1\);
Step 2:- Find the column most correlated with the current
            residual \(r^{(k-1)}\), i.e., \(t_k = \underset{j \in [p]}{\arg\max}\ |X_j^T r^{(k-1)}|\).
Step 3:- Update support estimate: \(\mathcal{S}_k = \mathcal{S}_{k-1} \cup \{t_k\}\).
Step 4:- Estimate \(\beta\) using current support: \(\hat{\beta}(\mathcal{S}_k) = X_{\mathcal{S}_k}^{\dagger} y\).
Step 5:- Update residual: \(r^{(k)} = y - X_{\mathcal{S}_k}\hat{\beta}(\mathcal{S}_k) = (I_n - P_{\mathcal{S}_k})y\).
Step 6:- Increment \(k \leftarrow k + 1\).
Step 7:- Repeat Steps 2-6, until stopping condition (SC) is met.
Output:- Support estimate \(\hat{\mathcal{S}}\) and signal estimate \(\hat{\beta}\).
TABLE I: Orthogonal Matching Pursuit
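For readers who prefer code, the iterations of TABLE I can be sketched in Python/NumPy as follows. This is a minimal illustration (the function interface and variable names are our own), with the choice of SC deferred to Section II-B.

import numpy as np

def omp(X, y, k_max):
    # Run k_max OMP iterations and return the support estimate and residual
    # after every iteration (so that different SCs can be applied post hoc).
    support = []                     # current support estimate (Step 1)
    residual = y.copy()              # initial residual r = y (Step 1)
    supports, residuals = [], []
    for _ in range(k_max):
        # Step 2: column most correlated with the current residual
        j = int(np.argmax(np.abs(X.T @ residual)))
        # Step 3: update the support estimate
        support.append(j)
        # Step 4: least squares estimate restricted to the current support
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        # Step 5: update the residual
        residual = y - X[:, support] @ coef
        supports.append(list(support))
        residuals.append(residual.copy())
    return supports, residuals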

II-A Qualifiers for the design matrix.

When \(n < p\), the linear equation \(y = X\beta\) has infinitely many possible solutions. Hence the support recovery problem is ill-posed even in the noiseless case. To uniquely recover the \(k_0\)-sparse vector \(\beta\), the measurement matrix \(X\) has to satisfy certain well known regularity conditions. A plethora of sufficient conditions, including the restricted isometry property (RIP)[6, 15], mutual incoherence condition (MIC)[7, 14], exact recovery condition (ERC)[13, 7] etc., are discussed in the literature. We first describe the ERC.

Definition 1:- A matrix \(X\) and a vector \(\beta\) with support \(\mathcal{S}\) satisfy ERC if the exact recovery coefficient \(\text{erc}(X, \mathcal{S}) = \max_{j \notin \mathcal{S}} \|X_{\mathcal{S}}^{\dagger} X_j\|_1\) satisfies \(\text{erc}(X, \mathcal{S}) < 1\).

It is known that ERC is a sufficient and worst case necessary condition for accurately recovering \(\mathcal{S}\) from noiseless measurements using OMP[13]. The same condition, together with appropriately scaled non zero entries of \(\beta\), is sufficient for recovery in regression models with noise[14]. Since ERC involves the unknown support \(\mathcal{S}\), it is impossible to check ERC in practice. Another important metric used for qualifying \(X\) is the restricted isometry constant (RIC). The RIC of order \(k\), denoted by \(\delta_k\), is defined as the smallest value of \(\delta\) that satisfies

\((1 - \delta)\|\mathbf{b}\|_2^2 \leq \|X\mathbf{b}\|_2^2 \leq (1 + \delta)\|\mathbf{b}\|_2^2\)   (6)

for all \(k\)-sparse vectors \(\mathbf{b}\). OMP can recover a \(k_0\)-sparse signal in the first \(k_0\) iterations if \(\delta_{k_0+1} < 1/\sqrt{k_0+1}\)[15, 38, 35]. In the absence of noise, OMP can also recover a \(k_0\)-sparse \(\beta\) by running more than \(k_0\) iterations under considerably weaker RIC conditions[18]. Likewise, similar guarantees based on running more than \(k_0\) iterations are available for the noisy case[18]. It is well known that the computation of the RIC is NP-hard. Hence, mutual coherence, a quantity that can be computed easily, is widely popular. For a matrix \(X\) with unit norm columns, the mutual coherence \(\mu_X\) is defined as the maximum pairwise column correlation, i.e.,

\(\mu_X = \max_{j \neq k} |X_j^T X_k|\)   (7)

If \(\mu_X < \frac{1}{2k_0 - 1}\), then for all \(k_0\)-sparse vectors \(\beta\), the exact recovery coefficient can be bounded as \(\text{erc}(X, \mathcal{S}) \leq \frac{k_0\mu_X}{1 - (k_0 - 1)\mu_X} < 1\)[13]. Hence, \(\mu_X < \frac{1}{2k_0 - 1}\) is a sufficient condition for both noiseless and noisy sparse recovery using OMP. It is also shown that \(\mu_X < \frac{1}{2k_0 - 1}\) is a worst case necessary condition for sparse recovery.
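Unlike the RIC, the mutual coherence is directly computable. The short sketch below (function names are ours) evaluates \(\mu_X\) and the largest sparsity covered by the above coherence condition.

import numpy as np

def mutual_coherence(X):
    # Maximum pairwise column correlation of a matrix with unit-norm columns.
    Xn = X / np.linalg.norm(X, axis=0)
    G = np.abs(Xn.T @ Xn)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

def max_coherence_sparsity(X):
    # Largest k0 allowed by the sufficient condition mu < 1/(2*k0 - 1),
    # i.e., k0 < (1 + 1/mu)/2 (up to the strict-inequality boundary case).
    mu = mutual_coherence(X)
    return int(np.floor((1.0 + 1.0 / mu) / 2.0))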

II-B Stopping conditions for OMP

Most of the theoretical properties of OMP are derived assuming either the absence of noise[13, 16, 18] or a priori knowledge of \(k_0\)[15]. In this case, OMP iterations are terminated once \(k = k_0\) iterations are performed or the residual becomes zero. When \(k_0\) is not available, which is typically the case, one has to rely on stopping conditions based on the properties of the residual \(r^{(k)}\). For example, OMP can be stopped once \(\|r^{(k)}\|_2\) falls below a threshold that depends on \(\sigma\)[14, 35] or once \(\|X^T r^{(k)}\|_\infty\) falls below a \(\sigma\)-dependent threshold[14]. Likewise, [39] suggested an SC based on the difference between successive residuals. The necessary and sufficient conditions for the high SNR consistency of OMP with residual based SC are derived in [20]. A generalized likelihood ratio based stopping rule is developed in [40]. In addition to the subjectivity involved in the choice of the SC, all the above mentioned SC require the knowledge of \(\sigma^2\). As explained before, estimating \(\sigma^2\) in underdetermined regression models is extremely difficult. In the following, we use the shorthand OMP(\(k_0\)) for OMP with a priori knowledge of \(k_0\) and OMP(\(\sigma\)) for OMP with an SC based on a priori knowledge of \(\sigma^2\). In the next section, we develop TF-OMP, an OMP based procedure which does not require the knowledge of either \(k_0\) or \(\sigma^2\) for good performance.
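Reusing the omp() sketch given after TABLE I, the two classical variants OMP(\(k_0\)) and OMP(\(\sigma\)) can be expressed as below. The particular \(\sigma\)-dependent threshold shown is one common choice and is an assumption used only for illustration.

import numpy as np

def omp_k0(X, y, k0):
    # OMP(k0): stop after exactly k0 iterations.
    supports, _ = omp(X, y, k_max=k0)
    return supports[-1]

def omp_sigma(X, y, sigma, k_max):
    # OMP(sigma): stop once the residual norm falls below a sigma-dependent
    # threshold (illustrative choice; other thresholds appear in [14, 35]).
    n = X.shape[0]
    eps = sigma * np.sqrt(n + 2.0 * np.sqrt(n * np.log(n)))
    supports, residuals = omp(X, y, k_max)
    for S, r in zip(supports, residuals):
        if np.linalg.norm(r) <= eps:
            return S
    return supports[-1]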

III Tuning Free Orthogonal Matching Pursuit.

In this section, we present the proposed TF-OMP algorithm. This algorithm is based on the residual ratio statistic \(RR(k) = \frac{\|r^{(k)}\|_2}{\|r^{(k-1)}\|_2}\), where \(r^{(k)}\) is the residual in the \(k^{th}\) iteration of OMP. Using the properties of projection matrices[36], we have \((I_n - P_{\mathcal{S}_k})P_{\mathcal{S}_{k-1}} = O_n\), where \(O_n\) is the \(n \times n\) zero matrix. This implies that \(r^{(k)} = (I_n - P_{\mathcal{S}_k})y = (I_n - P_{\mathcal{S}_k})r^{(k-1)}\). Hence, \(RR(k)\) can be rewritten as

\(RR(k) = \dfrac{\|r^{(k)}\|_2}{\|r^{(k-1)}\|_2} = \dfrac{\|(I_n - P_{\mathcal{S}_k})r^{(k-1)}\|_2}{\|r^{(k-1)}\|_2}\)   (8)

Since the residual norms are non increasing, i.e., \(\|r^{(k)}\|_2 \leq \|r^{(k-1)}\|_2\), we always have \(0 \leq RR(k) \leq 1\). This statistic exhibits an interesting behaviour which is the core of our proposed technique, i.e., TF-OMP. Consider running OMP for \(k_{\max}\) iterations such that neither the matrices \(X_{\mathcal{S}_k}\) are rank deficient nor the residuals \(r^{(k)}\) are zero. Then \(RR(k)\) varies in the following manner for \(1 \leq k \leq k_{\max}\).

Case 1):- When \(k < k_0\): Both \(r^{(k)}\) and \(r^{(k-1)}\) contain contributions from the signal \(X\beta\) and the noise \(w\). Since both the numerator and the denominator contain noise and signal terms, it is less likely that \(RR(k)\) takes very low values.
Case 2):- When \(\mathcal{S} \subseteq \mathcal{S}_k\) for the first time: In this case the numerator \(\|r^{(k)}\|_2\) has a contribution only from the noise \(w\), whereas the denominator \(\|r^{(k-1)}\|_2\) has contributions from both noise and signal. Hence, if the signal strength is sufficiently high or the noise level is low, \(RR(k)\) will take very low values.
Case 3):- When \(k\) exceeds the iteration at which \(\mathcal{S} \subseteq \mathcal{S}_k\) first occurred: In this case both \(r^{(k)}\) and \(r^{(k-1)}\) are functions of the noise \(w\) alone. This means that both the numerator and the denominator consist only of noise terms, and hence the ratio will not take a very small value even if the noise variance is very low.

To summarize, as the SNR improves, the minimal value of \(RR(k)\) for \(1 \leq k \leq k_{\max}\) will correspond, with a very high probability, to the value of \(k\) at which \(\mathcal{S} \subseteq \mathcal{S}_k\) holds for the first time. This point is illustrated in Fig. 1, where a typical realization of \(RR(k)\) is plotted for a matrix signal pair satisfying ERC. The signal has \(k_0\) non zero values. At both SNR = 10 dB and SNR = 30 dB, the minimum value is attained at \(k = k_0\), which is also the first time \(\mathcal{S} \subseteq \mathcal{S}_k\). Further, the dip in the value of \(RR(k)\) at \(k = k_0\) becomes more and more pronounced as the SNR increases. This motivates the TF-OMP algorithm given in TABLE II, which tries to estimate \(k_0\) by utilizing the sudden dip in \(RR(k)\).

Input:- Observation \(y\), design matrix \(X\).
Step 1:- Run OMP for \(k_{\max}\) iterations.
Step 2:- Estimate \(\hat{k}_0 = \underset{1 \leq k \leq k_{\max}}{\arg\min}\ RR(k)\).
Step 3:- Estimate the support as \(\hat{\mathcal{S}} = \mathcal{S}_{\hat{k}_0}\).
            Estimate \(\beta\) as \(\hat{\beta}(\hat{\mathcal{S}}) = X_{\hat{\mathcal{S}}}^{\dagger}y\) and \(\hat{\beta}(\hat{\mathcal{S}}^c) = 0\).
Output:- Support estimate \(\hat{\mathcal{S}}\) and signal estimate \(\hat{\beta}\).
TABLE II: Tuning free orthogonal matching pursuit
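A NumPy sketch of TABLE II, reusing the omp() routine given after TABLE I (names are illustrative; the rank-deficiency and zero-residual safeguards discussed in Remark 3 are omitted for brevity):

import numpy as np

def tf_omp(X, y, k_max=None):
    n, p = X.shape
    if k_max is None:
        k_max = (n + 1) // 2                     # default choice, see Remark 3
    # Step 1: run OMP for k_max iterations
    supports, residuals = omp(X, y, k_max)
    # Step 2: residual ratio statistic and its minimizer
    norms = np.array([np.linalg.norm(y)] + [np.linalg.norm(r) for r in residuals])
    rr = norms[1:] / norms[:-1]                  # RR(k), k = 1, ..., k_max
    k_hat = int(np.argmin(rr)) + 1
    # Step 3: support and signal estimates from the selected iteration
    S_hat = supports[k_hat - 1]
    beta_hat = np.zeros(p)
    coef, *_ = np.linalg.lstsq(X[:, S_hat], y, rcond=None)
    beta_hat[S_hat] = coef
    return S_hat, beta_hat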

We now make the following observations about TF-OMP.

Fig. 1: Variation of \(RR(k)\) vs \(k\) for the matrix described in Section VI when SNR = 10 dB and SNR = 30 dB.
Remark 1.

It is important to note that TF-OMP is designed to post facto estimate \(k_{\min}\), the first \(k\) such that \(\mathcal{S} \subseteq \mathcal{S}_k\), from the sequence of \(RR(k)\) values produced by OMP. Note that \(k_{\min}\) will correspond to \(k_0\) only when the first \(k_0\) iterations are accurate, i.e., at each of the first \(k_0\) iterations, indices belonging to \(\mathcal{S}\) are selected by OMP. Only in that situation will the objective of exact support recovery match the objective of TF-OMP. When conditions like RIC, MIC, ERC etc. are satisfied and the SNR is high, it is established in [14, 15, 35] that the first \(k_0\) iterations are correct with a high probability. Under such circumstances, TF-OMP is trying to estimate \(\mathcal{S}\) directly.

Remark 2.

Next consider the situation where \(k_{\min} > k_0\), i.e., the first \(k_0\) iterations are not accurate. This situation happens in coherent design matrices at all SNR and in incoherent dictionaries at low SNR. In this situation, all versions of OMP including OMP(\(k_0\)) fail to deliver accurate support recovery. Indeed, OMP(\(k_0\)) results in missed discoveries (i.e., failure to include non zero entries of \(\beta\) in \(\hat{\mathcal{S}}\)), which cause flooring of the MSE as the SNR improves. TF-OMP has a qualitatively different behaviour. Since TF-OMP is trying to estimate \(k_{\min}\), it will produce a support estimate that satisfies \(\mathcal{S} \subseteq \hat{\mathcal{S}}\), provided that \(k_{\min} \leq k_{\max}\). Such delayed recovery happens quite often in coherent dictionaries[18]. In other words, TF-OMP has a lesser tendency to have missed discoveries; rather, it suffers from false discoveries (the inclusion of non significant indices in \(\hat{\mathcal{S}}\)). This tendency can result in a degraded MSE performance for TF-OMP at low SNR. However, as the SNR improves, the effect of false discoveries on the MSE decreases, whereas the effect of missed discoveries becomes more predominant. Consequently, TF-OMP suffers less from MSE floors in such situations than OMP(\(k_0\)). To summarise, when there is no congruency between \(k_{\min}\) and \(k_0\), TF-OMP can potentially deliver better MSE performance than OMP(\(k_0\)), at least at high SNR.

Remark 3.

The only user defined parameter in TF-OMP is \(k_{\max}\). This can be set independent of the signal \(\beta\). The only requirement for efficient operation of TF-OMP is that the matrices \(X_{\mathcal{S}_k}\) for \(1 \leq k \leq k_{\max}\) are full rank and the residuals \(r^{(k)}\) are not zero. It is impossible to ascertain a priori when the matrices become rank deficient or the residuals become zero. Hence, one can initially set \(k_{\max} = n\) (since \(X_{\mathcal{S}_k}\) is necessarily rank deficient once \(k > n\)) and terminate the iterations when any of the aforementioned contingencies happen. However, the maximum value of \(k_0\) for a fixed \(n\) that can be recovered using any sparse recovery technique is approximately \(n/2\). This follows from the fact that the spark of \(X\) satisfies \(\text{spark}(X) \leq n + 1\) (equality for equiangular tight frames) and \(k_0 < \text{spark}(X)/2\) is a necessary condition for sparse recovery using any algorithm[41]. Hence, instead of \(k_{\max} = n\), it is sufficient to set \(k_{\max} = \lfloor (n+1)/2 \rfloor\), and this is the value of \(k_{\max}\) used in our simulations. Needless to say, if one has a priori knowledge of the maximum value of \(k_0\) (not the exact value of \(k_0\)), \(k_{\max}\) can be set to that value also.

III-A Computational complexity of TF-OMP

The computational complexity of TF-OMP with \(k_{\max} = \lfloor (n+1)/2 \rfloor\) is \(O(n^2p)\), which is higher than the \(O(k_0np)\) complexity of OMP(\(k_0\)). This is the cost one has to pay for not knowing \(k_0\) or \(\sigma^2\) a priori. However, TF-OMP is computationally much more efficient than either the second order conic programming (SOCP) or the cyclic algorithm based implementation of the popular tuning free SPICE algorithm[28]. Even the cyclic algorithm based implementation of SPICE, which is claimed to be computationally efficient (in comparison with SOCP) in small and medium sized problems, involves multiple iterations, and each iteration requires the inversion of an \(n \times n\) matrix (\(O(n^3)\) complexity) and a matrix matrix multiplication of complexity \(O(n^2p)\). It is possible to reduce the complexity of TF-OMP by producing upper bounds on \(k_0\) that are lower than the \(k_{\max} = \lfloor (n+1)/2 \rfloor\) used in TF-OMP. Assuming a priori knowledge of an upper bound on \(k_0\) is a significantly weaker assumption than having exact a priori knowledge of \(k_0\). If one can produce an upper bound \(k_{up}\) satisfying \(k_0 \leq k_{up} = O(k_0)\), then setting \(k_{\max} = k_{up}\) in TF-OMP gives the OMP(\(k_0\)) complexity of \(O(k_0np)\).

For situations where the statistician is completely oblivious to \(k_0\), we propose two low complexity versions of TF-OMP, viz., QTF-OMP1 (quasi tuning free OMP) and QTF-OMP2, that use a value of \(k_{\max}\) lower than the \(\lfloor (n+1)/2 \rfloor\) used in TF-OMP. QTF-OMP1 is motivated by the fact that the best coherence based guarantee for OMP extends up to \(k_0 < \frac{1}{2}(1 + 1/\mu_X)\) and, for any matrix, \(\mu_X\) satisfies the Welch bound \(\mu_X \geq \sqrt{\frac{p-n}{n(p-1)}}\)[41]. Hence, QTF-OMP1 uses a value of \(k_{\max}\) which is two times higher than the maximum value of \(k_0\) that can be covered by the coherence based guarantees available for OMP. Likewise, the best known asymptotic guarantee for OMP states that OMP can recover any \(k_0\)-sparse signal when the number of measurements scales (up to an arbitrarily small factor) as \(n \approx 2k_0\log p\)[19]. Hence, for given \(n\) and \(p\), the highest value of \(k_0\) one can reliably detect using OMP asymptotically is approximately \(n/(2\log p)\). The value of \(k_{\max}\) used in QTF-OMP1 and QTF-OMP2 is twice the aforementioned maximum detectable value of \(k_0\), to add sufficient robustness. The complexities of QTF-OMP1 and QTF-OMP2 are correspondingly lower and are significantly below the \(O(n^2p)\) complexity of TF-OMP. Unlike TF-OMP, which is completely tuning free, QTF-OMP1 and QTF-OMP2 involve a subjective choice of \(k_{\max}\) (though motivated by theoretical properties). The rest of this article considers TF-OMP only, and in Section VI we demonstrate that the performances of TF-OMP, QTF-OMP1 and QTF-OMP2 are similar across multiple experiments.
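The different choices of \(k_{\max}\) discussed above can be summarised as in the sketch below. The QTF-OMP expressions are our reconstruction of the rules described in this paragraph (the exact constants are assumptions on our part), and mutual_coherence() is the routine from Section II-A.

import numpy as np

def kmax_tf(n):
    # TF-OMP: floor((n + 1) / 2), see Remark 3.
    return (n + 1) // 2

def kmax_qtf1(X):
    # QTF-OMP1 (reconstruction): twice the largest sparsity covered by the
    # coherence guarantee k0 < (1 + 1/mu)/2, i.e., roughly 1 + 1/mu.
    mu = mutual_coherence(X)
    return int(np.floor(1.0 + 1.0 / mu))

def kmax_qtf2(n, p):
    # QTF-OMP2 (reconstruction): twice the asymptotic detection limit of
    # roughly n / (2 log p), i.e., about n / log p.
    return max(1, int(np.floor(n / np.log(p))))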

IV Analysis of TF-OMP

In this section we mathematically analyse various factors that influence the performance of TF-OMP. In particular, we discuss the conditions for the successful recovery of a \(k_0\)-sparse vector \(\beta\) in bounded noise, i.e., when \(\|w\|_2 \leq \epsilon\). Note that the Gaussian vector \(w \sim \mathcal{N}(0, \sigma^2I_n)\) is essentially bounded in the sense that \(\|w\|_2\) is \(O(\sigma\sqrt{n})\) with high probability. Hence, with \(\epsilon\) chosen accordingly, this analysis is applicable to Gaussian noise too. For bounded noise, we define the SNR as \(\|X\beta\|_2^2/\epsilon^2\). We next state and prove a theorem regarding successful support recovery by TF-OMP in bounded noise. Note that accurate support recovery automatically translates to an MSE performance equivalent to that of an oracle with a priori knowledge of the support \(\mathcal{S}\). Throughout this section, we use SNR to denote the ratio \(\|X\beta\|_2^2/\epsilon^2\) instead of \(\|X\beta\|_2^2/(n\sigma^2)\).

Theorem 1.

For any matrix signal pair satisfying ERC, MIC or the RIC condition \(\delta_{k_0+1} < 1/\sqrt{k_0+1}\), TF-OMP with \(k_{\max} \geq k_0\) will recover the correct support \(\mathcal{S}\) in bounded noise if the SNR (\(\|X\beta\|_2^2/\epsilon^2\)) is sufficiently high.

Proof:- The analysis of TF-OMP is based on the fundamental results developed in [14] and [35], stated next.

IV-A A brief review of relevant results from [14] and [35].

Let \(\lambda_{\min}\) and \(\lambda_{\max}\) denote the minimum and maximum eigenvalues of \(X_{\mathcal{S}}^TX_{\mathcal{S}}\) respectively.

Lemma 1.

If ERC is satisfied and \(\|w\|_2 \leq \epsilon\), then the following statements hold true[14].
A1):- The residual norms \(\|r^{(k)}\|_2\) for \(k \leq k_0\) can be bounded from above and below in terms of \(\epsilon\) and the entries of \(\beta\) that are not yet selected. Here, the unselected set denotes the indices in \(\mathcal{S}\) that are not selected after the \(k^{th}\) iteration.
A2):- A sufficiently large minimum non zero magnitude \(\beta_{\min}\) (relative to \(\epsilon\)) implies that the first \(k_0\) iterations are correct, i.e., \(\mathcal{S}_{k_0} = \mathcal{S}\).

A1) shows how to bound the residual norms used in \(RR(k)\) based on \(\epsilon\) and the unselected entries of \(\beta\). A2) implies that the first \(k_0\) iterations of OMP will be correct if \(\beta_{\min}\) is sufficiently high and ERC is satisfied. We now state conditions similar to A1)-A2) in terms of MIC and RIC.

Lemma 2.

The MIC \(\mu_X < \frac{1}{2k_0-1}\) implies bounds on the exact recovery coefficient and on the eigenvalues of \(X_{\mathcal{S}}^TX_{\mathcal{S}}\) in terms of \(\mu_X\) and \(k_0\)[13]. Substituting these bounds in A1) and A2) gives
B1):- bounds on the residual norms analogous to A1), now expressed in terms of \(\mu_X\) and \(k_0\).
B2):- A sufficiently large \(\beta_{\min}\) implies that the first \(k_0\) iterations are correct, i.e., \(\mathcal{S}_{k_0} = \mathcal{S}\).

Lemma 3.

The RIC \(\delta_{k_0+1}\) implies bounds on the eigenvalues of \(X_{\mathcal{S}}^TX_{\mathcal{S}}\)[35]. Substituting this in A1) gives
C1):- bounds on the residual norms analogous to A1), now expressed in terms of \(\delta_{k_0+1}\).
The next statement follows from Theorem 1 of [35].
C2):- If \(\delta_{k_0+1} < 1/\sqrt{k_0+1}\), then a sufficiently large \(\beta_{\min}\) implies that the first \(k_0\) iterations are correct.

Since the analysis based on the exact recovery coefficient and the eigenvalue bounds is more general than that based on MIC or RIC, we explain TF-OMP using A1)-A2). However, as outlined in B1)-B2) and C1)-C2), this analysis can easily be rephrased in terms of MIC or RIC.

IV-B Sufficient conditions for sparse recovery using TF-OMP

The successful recovery of the support of \(\beta\) using TF-OMP requires the simultaneous occurrence of the events E1)-E3) given below.
E1). The first \(k_0\) iterations are correct, i.e., \(\mathcal{S}_{k_0} = \mathcal{S}\).
E2). \(RR(k_0) < RR(k)\) for all \(k < k_0\).
E3). \(RR(k_0) < RR(k)\) for all \(k_0 < k \leq k_{\max}\).
E1) implies that OMP with a priori knowledge of \(k_0\), i.e., OMP(\(k_0\)), can perform exact sparse recovery, whereas E2) and E3) imply that TF-OMP will be free from missed and false discoveries respectively. Note that the condition A2) implies that the event E1) occurs as long as ERC is satisfied and \(\|w\|_2\) is below a particular level \(\epsilon_1\) given by

(9)

Next we consider the events E2) and E3) assuming that the noise satisfies \(\|w\|_2 \leq \epsilon_1\), i.e., E1) is true. To establish \(RR(k_0) < RR(k)\) for \(k < k_0\), we produce an upper bound on \(RR(k_0)\) and lower bounds on \(RR(k)\) for \(k < k_0\), and show that the upper bound on \(RR(k_0)\) is lower than the lower bound on \(RR(k)\) for \(k < k_0\) at high SNR. We first consider the event E2). Since E1) holds, all entries in \(\mathcal{S}\) are selected in the first \(k_0\) iterations and hence \(r^{(k_0)} = (I_n - P_{\mathcal{S}_{k_0}})w\). Likewise, only one entry in \(\mathcal{S}\) is left out after \(k_0 - 1\) iterations. Substituting these facts in A1) of Lemma 1 gives an upper bound on \(\|r^{(k_0)}\|_2\) and a lower bound on \(\|r^{(k_0-1)}\|_2\). Hence, \(RR(k_0)\) is bounded by

(10)

Next we lower bound \(RR(k)\) for \(k < k_0\). Note that the unselected portion of \(\beta\) can be treated as a vector in \(\mathbb{R}^p\) after appending enough zeros in appropriate locations. Further, \(r^{(k)}\) can be decomposed into a signal component due to the unselected entries of \(\beta\) and a noise component. Applying the triangle inequality to \(\|r^{(k)}\|_2\) gives the bound

(11)

Applying (11) in the expression for \(RR(k)\) gives

(12)

for \(k < k_0\). The R.H.S of (12) can be rewritten as

(13)

From (13) it is clear that the R.H.S of (12) decreases with decreasing magnitude of the unselected entries of \(\beta\). Note that the minimum value of these magnitudes is \(\beta_{\min}\) itself. This leads to an even smaller lower bound on \(RR(k)\) for \(k < k_0\) given by

(14)

For E2) to happen, it is sufficient that the lower bound on \(RR(k)\) for \(k < k_0\) is larger than the upper bound on \(RR(k_0)\), i.e.,

(15)

This will happen if \(\epsilon \leq \epsilon_2\), where

(16)

In words, whenever \(\epsilon \leq \epsilon_2\), TF-OMP will not have any missed discoveries.

Next we consider the event E3) and assume again that \(\|w\|_2 \leq \epsilon_1\), i.e., E1) is true. Since the first \(k_0\) iterations are correct, \(r^{(k)} = (I_n - P_{\mathcal{S}_k})w\) for \(k \geq k_0\). Note that the ratio \(RR(k)\) for \(k > k_0\) is then independent of the scaling factor of the noise. Hence, define the quantity

(17)

where the ordered set in (17) represents the indices selected by OMP. By the definition of \(RR(k)\), the quantity in (17) lower bounds \(RR(k)\) for \(k > k_0\). It is a random variable depending on the selected indices, which in turn depend on the noise vector \(w\). However, \(w\) influences this quantity only through the selected indices. Since the selected indices depend on \(w\), it is difficult to characterize this quantity directly. TF-OMP stops before \(k_{\max}\) iterations deterministically, and hence the lower bound holds for each possible realization of the selected index sequence. Further, the set of all possible index sequences is large, but finite. This implies that the minimum of (17) over all such realizations is strictly positive. Consequently, \(RR(k)\) is bounded away from zero with probability one for all \(k_0 < k \leq k_{\max}\), independently of \(\epsilon\). At the same time, the bound (10) on \(RR(k_0)\) decreases to zero with decreasing \(\epsilon\). Hence, there exists an \(\epsilon_3 > 0\) given by

(18)

such that \(RR(k_0) < RR(k)\) for all \(k_0 < k \leq k_{\max}\) whenever \(\epsilon \leq \epsilon_3\). In words, TF-OMP will not make false discoveries whenever \(\epsilon \leq \epsilon_3\). Combining all the required conditions, we can see that TF-OMP will recover the correct support whenever \(\epsilon \leq \min(\epsilon_1, \epsilon_2, \epsilon_3)\). In words, for any matrix support pair satisfying ERC, there exists a positive noise level \(\min(\epsilon_1, \epsilon_2, \epsilon_3)\) such that TF-OMP will recover \(\mathcal{S}\) whenever \(\|w\|_2\) is below it, i.e., whenever the SNR is sufficiently high. Hence proved.
We now make some remarks about the performance of TF-OMP.

Remark 4.

The conditions on the matrix support pair for successful support recovery using TF-OMP are exactly the same as the MIC, ERC and RIC based conditions outlined for OMP(\(k_0\)) and OMP(\(\sigma\)). From the expressions of \(\epsilon_2\) in (16) and \(\epsilon_3\) in (18), it is difficult to ascertain whether the noise levels tolerated by TF-OMP are smaller. In other words, it is difficult to state whether the required SNR for successful recovery using TF-OMP is higher than that required for OMP(\(k_0\)) or OMP(\(\sigma\)). However, extensive numerical simulations indicate that, except in the very low SNR regime, TF-OMP performs very closely to OMP(\(k_0\)). This comparatively poor performance at low SNR can be directly attributed to the lack of knowledge of \(k_0\) or \(\sigma^2\). Note that the analysis in this article is worst case and qualitative in nature. Deriving exact conditions on the SNR for successful recovery will be more difficult and is not pursued in this article.

Remark 5.

The bound (16) involves the dynamic range of the non zero entries of \(\beta\), i.e., the ratio \(\beta_{\max}/\beta_{\min}\), in the denominator. In particular, (16) implies that the noise level that allows for successful recovery, i.e., \(\epsilon_2\), decreases with increasing \(\beta_{\max}/\beta_{\min}\). This is the main qualitative difference between TF-OMP and the results in [14] and [38] available for OMP(\(k_0\)) and OMP(\(\sigma\)). This term can be attributed to the sudden fall in the residual power when a "very significant" entry in \(\beta\) is covered by OMP at an intermediate iteration, which mimics the fall in residual power when the "last" entry in \(\mathcal{S}\) is selected in the \(k_0^{th}\) iteration. Note that the latter fall in residual power is what TF-OMP is trying to detect. The main implication of this result is that TF-OMP will be less effective while recovering signals \(\beta\) with significant variations (high \(\beta_{\max}/\beta_{\min}\) ratio) than in recovering signals with lesser variations.

IV-C High SNR consistency of TF-OMP in Gaussian noise.

The high SNR consistency of variable selection techniques in Gaussian noise has received considerable attention in the signal processing community recently[42, 36, 37, 20]. High SNR consistency is formally defined as follows.
Definition:- A support recovery technique is high SNR consistent iff the probability of support recovery error satisfies \(PE \rightarrow 0\) as \(\sigma^2 \rightarrow 0\) (equivalently, as SNR \(\rightarrow \infty\)).
The following lemma, stated and proved in [20], establishes the necessary and sufficient conditions for the high SNR consistency of OMP and LASSO.

Lemma 4.