StructureBased Bayesian Sparse Reconstruction
Abstract
Sparse signal reconstruction algorithms have attracted research attention due to their wide applications in various fields. In this paper, we present a simple Bayesian approach that utilizes the sparsity constraint and a priori statistical information (Gaussian or otherwise) to obtain near optimal estimates. In addition, we make use of the rich structure of the sensing matrix encountered in many signal processing applications to develop a fast sparse recovery algorithm. The computational complexity of the proposed algorithm is relatively low compared with the widely used convex relaxation methods as well as greedy matching pursuit techniques, especially at a low sparsity rate.^{1}^{1}1This work was partially supported by SABIC through an internally funded project from DSR, KFUPM (Project No. SB101006) and partially by King Abdulaziz City for Science and Technology (KACST) through the Science & Technology Unit at KFUPM (Project No. 09ELE76304) as part of the National Science, Technology and Innovation Plan. The work of Tareq Y. AlNaffouri was also supported by the Fullbright Scholar Program. Part of this work was presented at the Allerton Conference on Communications, Control and Computing, USA.
I Introduction
Compressive Sensing/Compressed Sampling (CS) is a fairly new field of research that is finding many applications in statistics and signal processing [1]. As its name suggests, CS attempts to acquire a signal (inherently sparse in some subspace) at a compressed rate by randomly projecting it onto a subspace that is much smaller than the dimension of the signal itself. Provided that the sensing matrix satisfies a few conditions, the sparsity pattern of such a signal can be recovered noncombinatorially with high probability. This is in direct contrast to the traditional approach of sampling signals according to the Nyquist theorem and then discarding the insignificant samples. Generally, most naturally occurring signals are sparse in some basis/domain and CS can therefore be utilized for their reconstruction. CS has been used successfully in, for example (but not limited to), peaktoaverage power ratio reduction in orthogonal frequency division multiplexing (OFDM) [2], image processing (onepixel camera [4]), impulse noise estimation and cancellation in powerline communication and digital subscriber lines (DSL) [5], magnetic resonance imaging (MRI) [6], channel estimation in communications systems [7], ultrawideband (UWB) channel estimation [8], directionofarrival (DOA) estimation [9], and radar design [10], to name a few.
The CS problem can be set up as follows. Let be a sparse signal (i.e., a signal that consists of nonzero coefficients in an dimensional space with ) in some domain and let be the observation vector with given by
(1) 
where is an measurement/sensing matrix that is assumed to be incoherent with the domain in which is sparse and is complex additive white Gaussian noise, . As , this is an illposed problem as there is an infinite number of solutions for satisfying (1). Now if it is known a priori that is sparse, the theoretical way to reconstruct the signal is to solve an norm minimization problem using only measurements when the signal and measurements are free of noise [11]
(2) 
Unfortunately, solving the norm minimization problem is NPhard [11] [12] and is therefore not practical. Thus, different suboptimal approaches, categorized as compressive sensing, have been presented in the literature to solve this problem. In [12] and [13], it has been shown that can be reconstructed with high probability in polynomial time by using convex relaxation approaches at the cost of an increase in the required number of measurements. This is done by solving a relaxed norm minimization problem using linear programming instead of norm minimization [12], [13]
(3) 
where . For norm minimization to reconstruct the sparse signal accurately, the sensing matrix, , should be sufficiently incoherent. In other words, the coherence, defined as , should be as small as possible (with depicting the worst case) [12]. In [14], it has been shown that these convex relaxation approaches have a Bayesian rendition and may be viewed as maximizing the maximum a posteriori estimate of , given that has a Laplacian distribution. Although convex relaxation approaches are able to recover sparse signals by solving underdetermined systems of equations, they also suffer from a number of drawbacks (some of which are common to other sparse recovery algorithms including [16][19]) that we discuss below.
Ia Drawbacks of Convex Relaxation Approaches
IA1 Complexity
Convex relaxation relies on linear programming to solve the convex norm minimization problem, which is computationally relatively complex (its complexity is of the order when interior point methods are used [37]). This approach can therefore not be used in problems with very large dimensions. To overcome this drawback, many greedy algorithms have been proposed that recover the sparse signal iteratively. These include Orthogonal Matching Pursuit (OMP) [15], [16], Regularized Orthogonal Matching Pursuit (ROMP) [17], Stagewise Orthogonal Matching Pursuit (StOMP) [18], and Compressive Sampling Matching Pursuit (CoSamp) [19]. These greedy approaches are relatively faster than their convex relaxation counterparts (approximately where is the number of iterations).
IA2 The need for randomness in the sensing matrix
Convex relaxation methods cannot make use of the structure exhibited by the sensing matrix (e.g., a structure that comes from a Toeplitz sensing matrix or that of a partial discrete Fourier transform (DFT) matrix). In fact, if anything, this structure is harmful to these methods as the best results are obtained when the sensing matrix is close to random. This comes in contrast to current digital signal processing architectures that only deal with uniform sampling. We would thus like to employ more feasible and standard subsampling approaches.
IA3 Inability to harness a priori statistical information
Convex relaxation methods are not able to take account of any a priori statistical information (apart from sparsity information) about the signal support and additive noise. Any a priori statistical information can be used on the result obtained from the convex relaxation method to refine both the signal support obtained and the resulting estimate through a hypothesis testing approach [20]. However, this is only useful if these approaches are indeed able to recover the signal support. In other words, performance is bottlenecked by the support recovering capability of these approaches. We note here that the use of a priori statistical information for sparse signal recovery has been studied in a Bayesian context in [14] and in algorithms based on belief propagation [21], [22]. Both [23] and [24] use a priori statistical information (assuming to be mixed BernoulliGaussian); only [23] uses this information in a recursive manner to obtain a fast sparse signal recovery algorithm. However, it is not clear how these approaches can be extended to the nonGaussian case.
IA4 Evaluating performance in statistically familiar terms
It is difficult to quantify the performance of convex relaxation estimates analytically in terms of the mean squared error (MSE) or bias or to relate these estimates to those obtained through more conventional approaches, e.g., maximum a posteriori probability (MAP), minimum meansquare error (MMSE), or maximum likelihood (ML).^{2}^{2}2It is worth noting that convex relaxation approaches have their merit in that they are agnostic to the signal distribution and thus can be quite useful when worstcase analysis is desired as opposed to averagecase analysis.
IA5 Trading performance for computational complexity
In general, convex relaxation approaches do not exhibit the customary tradeoff between increased computational complexity and improved recovery as is the case for, say, iterative decoding or joint channel and data detection. Rather, they solve some problem using (secondorder cone programming) with a set complexity. A number of works have attempted to derive sharp thresholds for support recovery [25], [26]. In other words, the only degree of freedom available for the designer to improve performance is to increase the number of measurements. Several iterative implementations [27], [28] of convex relaxation approaches provide some sort of flexibility by trading performance for complexity.
IB Motivation and Paper Organization
In this paper, we present a Bayesian approach to sparse signal recovery that has low complexity and makes a collective use of 1) a priori statistical properties of the signal and noise, 2) sparsity information, and 3) the rich structure of the sensing matrix, . Although there have been some works that use the structure of the sensing matrix (e.g., [29]), it has not yet been rigorously exploited to aid in algorithm development and complexity reduction. We also show how our approach is able to deal with both Gaussian and nonGaussian (or unknown) priors, and how we can compute performance measures of our estimates. In essence, we demonstrate how our technique enables us to tackle all the drawbacks of convex relaxation approaches.
This remainder of this paper is organized as follows. We start by describing the signal model in the next section. In Section III, we derive the MMSE/MAP estimates and introduce the various terms that need to be evaluated. In Section IV, we demonstrate how the structure of the sensing matrix can be used to recover the sparse signal in a divideandconquer manner. Section V details the proposed sparse reconstruction algorithm that we call Orthogonal Clustering. Section VI presents the different structural properties of the sensing matrix that are exploited by the proposed algorithm to reduce the computational complexity. The performance of the proposed algorithm is compared with various sparse reconstruction algorithms presented in the literature by numerical simulations in Section VII, which is followed by our conclusions in Section VIII.
IC Notation
We denote scalars with lowercase letters (e.g., ), vectors with lowercase boldfaced letters (e.g., ), matrices with uppercase, boldfaced letters (e.g., ), and sets with script notation (e.g. ). We use to denote the column of matrix , to denote the entry of vector , and to denote a subset of a set . We also use to denote the submatrix formed by the columns , indexed by the set . Finally, we use , , , and to respectively denote the estimate, conjugate, transpose, and conjugate transpose of a vector .
Ii Signal Model
We adopt the signal model in (1). Here, the vector is modelled as , where denotes the Hadamard (elementbyelement) multiplication. The entries of are independent and identically distributed (i.i.d) Bernoulli random variables and the entries of are drawn identically and independently from some zero mean distribution.^{3}^{3}3Most of the results presented in this paper also apply to the case when the entries are independent but not necessarily identically distributed. In other words, we assume that s are Bernoulli with success probability and similarly that the s are i.i.d variables with marginal probability distribution function . The noise is assumed to be complex circularly symmetric Gaussian, i.e., . When the support set of is known, we can equivalently write (1) as
(4) 
Iii Optimum Estimation of
Our task is to obtain the optimum estimate of given the observation . We can pursue either an MMSE or a MAP approach to achieve this goal. In the following, we elaborate on how we can obtain these two estimates.
Iiia MMSE Estimation of
The MMSE estimate of given the observation can be expressed as
(5) 
where the sum is over all the possible support sets of . The likelihood and expectation involved in (5) are evaluated below.
IiiA1 Evaluation of
Recall that the relationship between and is linear (see (1)). Thus, in the case when conditioned on its support is Gaussian, is nothing but the linear MMSE estimate of given (and ), i.e.,
(6) 
where
(7) 
When is nonGaussian or when its statistics are unknown, the expectation is difficult or even impossible to calculate. Thus, we replace it by the best linear unbiased estimate (BLUE), i.e.,
(8) 
IiiA2 Evaluation of
Using Bayes’ rule, we can rewrite as
(9) 
As the denominator is common to all posterior likelihoods, , it is a normalizing constant that can be ignored. To evaluate , note that the elements of are active according to a Bernoulli process with success probability . Thus, is given by
(10) 
It remains to evaluate . Here, we distinguish between the cases of whether or not is Gaussian.
1. is Gaussian
When is Gaussian, is Gaussian too with zero mean and covariance and we can write the likelihood function as^{4}^{4}4
(11) 
up to an irrelevant constant multiplicative factor, ().
2. is nonGaussian or unknown
Alternatively, we can treat as a random vector of unknown (nonGaussian) distribution, with support . Therefore, given the support , all we can say about is that it is formed by a vector in the subspace spanned by the columns of , plus a white Gaussian noise vector, . It is difficult to quantify the distribution of even if we know the distribution of (the nonGaussian) . One way around this is to annihilate the nonGaussian component and retain the Gaussian one. We do so by projecting onto the orthogonal complement of the span of the columns of , i.e., multiplying by This leaves us with , which is zero mean and with covariance Thus, the conditional density of given is approximately given by
(12) 
IiiB MAP Estimation of
To obtain the MAP estimate of , we first determine the MAP estimate of , which is given by
(13) 
The prior likelihood , is given by (11) when is Gaussian and by (12) when is nonGaussian or unknown, whereas is evaluated using (10). The maximization is performed over all possible support sets. The corresponding MAP estimate of is given by
(14) 
One can easily see that the MAP estimate is a special case of the MMSE estimate in which the sum (5) is reduced to one term. As a result, we confine the discussion in the rest of the paper to MMSE estimation.
IiiC Evaluation over
Having evaluated the posterior probability and expectation, it remains to evaluate this over possible supports (see (5) and (13)) which is a computationally daunting task. This is compounded by the fact that the calculations required for each support set in are relatively expensive, requiring some form of matrix multiplication/inversion as can be seen from (6)(12). One way around this exhaustive approach is somehow to guess at a superset consisting of the most probable support and limit the sum in (5) to the superset and its subsets, reducing the evaluation space to points. There are two techniques that help us guess at such a set .
1. Convex Relaxation
Starting from (1), we can use the standard convex relaxation tools [12], [13] to find the most probable support set, , of the sparse vector . This is done by solving (3) and retaining some largest nonzero values where is selected such that P is very small.^{5}^{5}5As is a binomial distribution B, it can be approximated by a Gaussian distribution , when (the DeMoivreLaplace approximation [30]). In this case, P.
2. Fast Bayesian Matching Pursuit (FBMP)
A fast Bayesian recursive algorithm is presented in [23] that determines the dominant support and the corresponding MMSE estimate of the sparse vector.^{6}^{6}6FBMP applies to the Bernoulli Gaussian case only. It uses a greedy tree search over all combinations in pursuit of the dominant supports. The algorithm starts with zero active element support set. At each step, an active element is added that maximizes the Gaussian loglikelihood function similar to (11). This procedure is repeated until we reach active elements in a branch. The procedure creates such branches, which represent a tradeoff between performance and complexity.^{7}^{7}7Though other greedy algorithms [16][19] can also be used, we focus here on FBMP as it utilizes a priori statistical information along with sparsity information.
The discussion in this section applies irrespective of the type of the sensing matrix, . However, in many applications in signal processing and communications, the sensing matrix is highly structured. This fact, which has been largely overlooked in the CS literature, is utilized in the following to evaluate the MMSE (MAP) estimate at a much lower complexity than is currently available.
Iv A StructureBased Bayesian Recovery Approach
Whereas in most CS literature, the sensing matrix, , is assumed to be drawn from a random constellation [12], [13], in many signal processing and communications applications, this matrix is highly structured. Thus, could be a partial DFT matrix [5] or a Toeplitz matrix (encountered in many convolution applications [7]). Table I lists various possibilities of structured .
Matrix  Application 

Partial DFT  OFDM applications including peaktoaverage power ratio 
reduction [2], narrowband interference cancelation [3],  
and impulsive noise estimation and mitigation in DSL [5]  
Toeplitz  Channel estimation [7], UWB [8], and DOA estimation [9] 
Hankel  Wideband spectrum sensing [31] 
DCT  Image compression [32] 
Structured Binary  Multiuser detection and contention resolution [33], [34] and 
feedback reduction [35], [36] 
Since is a fat matrix , its columns are not orthogonal (in fact not even linearly independent). However, in the aforementioned applications, one can usually find an orthogonal subset of the columns of that span the column space of . We can collect these columns into a square matrix, . The remaining columns of group around these orthogonal columns to form semiorthogonal clusters. In general, the columns of can be rearranged such that the farther two columns are from each other, the lower their correlation is. In this section, we demonstrate how semiorthogonality helps to evaluate the MMSE estimate in a divideandconquer manner. Before we do so, we present two sensing matrices that exhibit semiorthogonality.
Iva Examples of Sensing Matrices with SemiOrthogonality
IvA1 DFT Matrices
We focus here on the case when the sensing matrix is a partial DFT matrix, i.e., , where denotes the unitary DFT matrix, with and is an selection matrix consisting of zeros with exactly one entry equal to per row. To enforce the desired semiorthogonal structure, the matrix usually takes the form , for some integer . In other words, the sensing matrix consists of a continuous band of sensing frequencies. This is not unusual since in many OFDM problems, the band of interest (or the one free of transmission) is continuous. In this case, the correlation between two columns can be shown to be
(15) 
which is a function of the difference, . It thus suffices to consider the correlation of one column with the remaining ones. Figure 1 illustrates this correlation for and . It is worth noting that the matrix exhibits other structural properties (e.g., the fact that it is a Vandermonde matrix), which helps us reduce the complexity of the MMSE estimation (see Section VI for further details).
IvA2 Toeplitz/Hankel Matrices
We focus here on the Toeplitz case. The discussion can be easily extended to the Hankel case. A subsampled convolutional linear system can be written in the following matrix form, , where is a vector of length , is a vector of length and is the block Toeplitz/diagonal matrix
where the size of depends on the subsampling ratio. Here, for , and thus the columns of can easily be grouped into truly orthogonal clusters. Note also that the individual columns of are related to each other by a shift property, which we explore for further reduction in complexity in Section VI.
IvB Using Orthogonality for MMSE Estimation
Let be a possible support of . The columns of in (4) can be grouped into a maximum of semiorthogonal clusters, i.e., , where is the support set corresponding to the cluster (with ).^{8}^{8}8Here, we denote the maximum number of clusters formed by to distinguish it from , that refers to the estimate of the number of active supports as in [23] (see footnote 5). In our approach, is random and depends on a threshold. This threshold is obtained using the a priori statistical information of the noise signal, . The procedure of forming semiorthogonal clusters is presented in Section V. Based on this fact, (4) can be written as
(16) 
Columns indexed by these sets should be semiorthogonal, i.e., ; otherwise, and are merged into a bigger superset. Now, the MMSE estimate of simplifies to^{9}^{9}9In writing an expression like the one in (17), it is understood that estimates of elements of that do not belong to are identically zero.
(17) 
In the following, we show that can be evaluated in a divideandconquer manner by treating each cluster independently. To do so, we present in the following how orthogonality manifests itself in the calculation of the expectation and likelihood.
IvB1 The effect of orthogonality on the likelihood calculation
Recall that up to a constant factor, the likelihood can be written as . Now,
(18)  
where the equality in (18) is true up to some constant factor. Now, to evaluate , we distinguish between the Gaussian and nonGaussian cases. For brevity, we focus here on the Gaussian case and extrapolate the results to the nonGaussian case. Recall that
(19) 
with . Here, where Using the matrix inversion lemma, we can write as
(20)  
where . As and are almost orthogonal (i.e., ), (20) becomes
(21)  
Continuing in the same manner, it is easy to show that
(22) 
As such, we can write
(23) 
where Using a similar procedure, we can decompose as
(24)  
(25)  
(26) 
where in going from (24) to (25), we used the fact that and are almost orthogonal. Continuing in the same way, we can show that
(27) 
Combining (23) and (27), we obtain (up to an irrelevant multiplicative factor)
(28) 
Orthogonality allows us to reach the same conclusion (28) for the nonGaussian case. Now, combining (18) and (28), we can finally write
(29) 
which applies equally to the Gaussian and nonGaussian cases.
IvB2 The effect of orthogonality on the expectation calculation
In evaluating the expectation, we again distinguish between the Gaussian and nonGaussian cases. We focus here on the nonGaussian case for which Using the decomposition into semiorthogonal clusters , we can write
(30) 
Orthogonality allows us to write an identical expression to (30) in the Gaussian case.
IvB3 The effect of orthogonality on the MMSE estimation
We are now ready to show how (semi)orthogonality helps with the MMSE evaluation. To do this, we substitute the decomposed expressions (29) and (30) into (17) to get
(31)  
where the last line follows from the fact that . Thus, the semiorthogonality of the columns in the sensing matrix allows us to obtain the MMSE estimate of in a divideandconquer manner by estimating the nonoverlapping sections of independently from each other. Other structural properties of can be utilized to reduce further the complexity of the MMSE estimation. For example, the orthogonal clusters exhibit some form of similarity and the columns within a particular cluster are also related to each other. We explore these properties for complexity reduction in Section VI. However, before doing so, we devote the following section to a full description of our Bayesian orthogonal clustering algorithm.
V An Orthogonal Clustering (OC) Algorithm for Sparse Reconstruction
In this section, we present our sparse reconstruction algorithm, which is based on orthogonal clustering. The main steps of the algorithm are detailed in the following and summarized in Figure 3.
Va Determine dominant positions
Consider the model given in (1) reproduced here for convenience, By correlating the observation vector, , with the columns of the sensing matrix, , and by retaining correlations that exceed a certain threshold, we can determine the dominant positions/regions where the support of the sparse vector, , is located. The performance of our orthogonal clustering algorithm is dependent on this initial correlationbased guess.^{10}^{10}10We can also apply a convex relaxation approach, retain the largest values, and form clusters around them. This allows us to incorporate a priori statistical information and obtain MMSE estimates but the algorithm in this case is bottlenecked by the performance of the convex relaxation approach and also loses the appeal of low complexity.
VB Form semiorthogonal clusters
Define a threshold such that is very small.^{11}^{11}11As , the threshold can be easily evaluated as, . The previous correlation step creates a vector of correlations. From these correlations, obtain the indices with the correlation greater than the threshold, . Let denote the index with the largest correlation above and form a cluster of size centered around .^{12}^{12}12Given a fat sensing matrix, we consider two columns to be orthogonal (or semi orthogonal) when their correlation is below some value, . The cluster size is thus the minimum separation between two columns that makes these two columns semiorthogonal. Obviously, the distance ,, is a function of the correlation tolerance, . The lower the tolerance, , the larger the cluster size, . Now, let denote the corresponding index of the second largest correlation above and form another cluster of size around . If the two clusters thus formed are overlapping, merge them into one big cluster. Continue this procedure until all the correlations greater than are exhausted.
VC Find the dominant supports and their likelihoods
Let be the length of cluster and let denote the maximum possible support size in a cluster.^{13}^{13}13 is calculated in a way similar to as the support in a cluster is also a Binomial distribution B(). Thus, we set (see footnote 5). Let be the total number of semiorthogonal clusters formed in the previous step. For each of them, find the most probable support of size, , by calculating the likelihoods for all supports of size (using either (11) or (12)). Each cluster is processed independently by capitalizing on the semiorthogonality between the clusters. The expected value of the sparse vector given and the most probable support for each size can also be evaluated using either (6) or (8) depending on the a priori statistical information.
VD Evaluate the estimate of
Once we have the dominant supports for each cluster, their likelihoods, the expected value of given and the dominant supports, the MMSE (or MAP) estimates of can be evaluated as discussed in Section IV (see (31)). Note that these estimates are approximate as they are evaluated using only the dominant supports instead of using all supports.
Vi Reducing the Computational Complexity
In this paper, we explore three structures of the sensing matrix that help us to reduce the complexity of MMSE estimation.

Orthogonality (independence) of clusters: In Section IV, the orthogonality of clusters allowed us to calculate the MMSE estimate independently over clusters in a divideandconquer manner.

Similarity of clusters: While the columns of the clusters are (semi)orthogonal, allowing us to treat them independently, these columns could exhibit some form of similarity making some MMSE calculations invariant over these clusters. For example, the columns of a DFT matrix can be obtained from each other through a modulation operation while those of the Toeplitz matrix can be obtained through a shift operation. The correlation calculations that repeatedly appear in the MMSE estimation are invariant to the modulation and shift operations.

Order within a cluster: MMSE estimation in a cluster involves calculating the likelihoods and expectations for all supports of size . Several quantities involved in these evaluations can be obtained in an orderrecursive manner, incrementally moving from calculations for supports of size to similar calculations for supports of size .
We explore the last two properties in the following subsections.
Via Similarity of Clusters
As evident from the previous sections, calculating the likelihood can be done in a divideandconquer manner by calculating the likelihood for each cluster independently. This is a direct consequence of the semiorthogonality structure of the columns of the sensing matrix. Moreover, due to the rich structure of the sensing matrix, the clusters formed are quite similar. In the following subsections, we use the structure present in DFT and Toeplitz sensing matrices to show that the likelihood and expectation expressions in each cluster (for both the Gaussian and nonGaussian cases) are strongly related, allowing many calculations across clusters to be shared.
ViA1 Discrete Fourier Transform (DFT) Matrices
Let denote the sensing columns associated with the first cluster. Then, it is easy to see that the corresponding columns for the cluster of equal length that are columns away are, , where is some constant vector that depends on the sensing columns.^{14}^{14}14For example, if we use the last rows of the DFT matrix to construct the sensing matrix, then . Assume that we evaluate the likelihood, , and expectation, , for a set of columns, , in the first cluster. For this set, we make the assumption that
(32) 
Now, let denote the same set of columns chosen from the cluster that is columns away (in other words ). For this set, we assume that
(33) 
Now (Hadamard) multiply both sides of the above equation by to get
(34) 
Note that (32) and (34) have the same sensing matrix and the same noise statistics ( is a white circularly symmetric Gaussian and hence is invariant to multiplication by ). The only difference is that is modulated by the vector in moving from the first to the cluster. This allows us to write
(35) 
which is valid for both the Gaussian and nonGaussian cases. In other words, if is obtained from by a constant shift, then any independent calculations remain the same while any calculations involving are obtained by modulating by the vector as shown in Figure 3. For example, the likelihood in the Gaussian case reads
(36) 
and, in the nonGaussian case, it reads
(37) 
We observe similar behavior in calculating the expectation. Thus, in the Gaussian case, we have
(38) 
and in the nonGaussian case, we have
(39) 
ViA2 Toeplitz/Hankel Matrices
In the Toeplitz or block Toeplitz case, the sensing matrix reads . Now, the clusters can be modified to make sure that they are identical (by stretching their end points if necessary) such that In other words, the s are simply shifted versions of each other. We now calculate the quantities , , and for a set of columns of the first cluster. We then choose an identical set of columns, , in the cluster. Then, it is intuitively clear that