Recovery from Linear Measurements with ComplexityMatching Universal Signal Estimation
Abstract
We study the compressed sensing (CS) signal estimation problem where an input signal is measured via a linear matrix multiplication under additive noise. While this setup usually assumes sparsity or compressibility in the input signal during recovery, the signal structure that can be leveraged is often not known a priori. In this paper, we consider universal CS recovery, where the statistics of a stationary ergodic signal source are estimated simultaneously with the signal itself. Inspired by Kolmogorov complexity and minimum description length, we focus on a maximum a posteriori (MAP) estimation framework that leverages universal priors to match the complexity of the source. Our framework can also be applied to general linear inverse problems where more measurements than in CS might be needed. We provide theoretical results that support the algorithmic feasibility of universal MAP estimation using a Markov chain Monte Carlo implementation, which is computationally challenging. We incorporate some techniques to accelerate the algorithm while providing comparable and in many cases better reconstruction quality than existing algorithms. Experimental results show the promise of universality in CS, particularly for lowcomplexity sources that do not exhibit standard sparsity or compressibility.
I Introduction
Since many systems in science and engineering are approximately linear, linear inverse problems have attracted great attention in the signal processing community. An input signal is recorded via a linear operator under additive noise:
(1) 
where is an matrix and denotes the noise. The goal is to estimate from the measurements given knowledge of and a model for the noise . When , the setup is known as compressed sensing (CS) and the estimation problem is commonly referred to as recovery or reconstruction; by posing a sparsity or compressibility^{1}^{1}1We use the term compressibility in this paper as defined by Candès et al. [4] to refer to signals whose sparse approximation error decays sufficiently quickly. requirement on the signal and using this requirement as a prior during recovery, it is indeed possible to accurately estimate from [4, 5]. On the other hand, we might need more measurements than the signal length when the signal is dense or the noise is substantial.
Wu and Verdú [6] have shown that independent and identically distributed (i.i.d.) Gaussian sensing matrices achieve the same phasetransition threshold as the optimal (potentially nonlinear) measurement operator, for any i.i.d. signals following the discrete/continuous mixture distribution , where is the probability for to take a continuous distribution and is an arbitrary discrete distribution. For noni.i.d. signals, Gaussian matrices also work well [7, 8, 9]. Hence, in CS the acquisition can be designed independently of the particular signal prior through the use of randomized Gaussian matrices . Nevertheless, the majority of (if not all) existing recovery algorithms require knowledge of the sparsity structure of , i.e., the choice of a sparsifying transform that renders a sparse coefficient vector for the signal.
The large majority of recovery algorithms pose a sparsity prior on the signal or the coefficient vector , e.g., [4, 5, 10]. A second, separate class of Bayesian CS recovery algorithms poses a probabilistic prior for the coefficients of in a known transform domain [11, 12, 13, 14, 15]. Given a probabilistic model, some related message passing approaches learn the parameters of the signal model and achieve the minimum mean squared error (MMSE) in some settings; examples include EMGMAMPMOS [16], turboGAMP [17], and AMPMixD [18]. As a third alternative, complexitypenalized least square methods [19, 20, 21, 22, 23] can use arbitrary prior information on the signal model and provide analytical guarantees, but are only computationally efficient for specific signal models, such as the independententry Laplacian model [21]. For example, Donoho et al. [20] relies on Kolmogorov complexity, which cannot be computed [24, 25]. As a fourth alternative, there exist algorithms that can formulate dictionaries that yield sparse representations for the signals of interest when a large amount of training data is available [23, 26, 27, 28]. When the signal is noni.i.d., existing algorithms require either prior knowledge of the probabilistic model [17] or the use of training data [29].
In certain cases, one might not be certain about the structure or statistics of the source prior to recovery. Uncertainty about such structure may result in a suboptimal choice of the sparsifying transform , yielding a coefficient vector that requires more measurements to achieve reasonable estimation quality; uncertainty about the statistics of the source will make it difficult to select a prior or model for Bayesian algorithms. Thus, it would be desirable to formulate algorithms to estimate that are more agnostic to the particular statistics of the signal. Therefore, we shift our focus from the standard sparsity or compressibility priors to universal priors [30, 31, 32]. Such concepts have been previously leveraged in the Kolmogorov sampler universal denoising algorithm [33], which minimizes Kolmogorov complexity [34, 35, 36, 25, 37, 38, 3, 2]. Related approaches based on minimum description length (MDL) [39, 40, 41, 42] minimize the complexity of the estimated signal with respect to (w.r.t.) some class of sources.
Approaches for nonparametric sources based on Kolmogorov complexity are not computable in practice [24, 25]. To address this computational problem, we confine our attention to the class of stationary ergodic sources and develop an algorithmic framework for universal signal estimation in CS systems that will approach the MMSE as closely as possible for the class of stationary ergodic sources. Our framework can be applied to general linear inverse problems where more measurements might be needed. Our framework leverages the fact that for stationary ergodic sources, both the persymbol empirical entropy and Kolmogorov complexity converge asymptotically almost surely to the entropy rate of the source [24]. We aim to minimize the empirical entropy; our minimization is regularized by introducing a log likelihood for the noise model, which is equivalent to the standard least squares under additive white Gaussian noise. Other noise distributions are readily supported.
We make the following contributions toward our universal CS framework.

We apply a specific quantization grid to a maximum a posteriori (MAP) estimator driven by a universal prior, providing a finitecomputation universal estimation scheme; our scheme can also be applied to general linear inverse problems where more measurements might be needed.

We propose a recovery algorithm based on Markov chain Monte Carlo (MCMC) [43] to approximate this estimation procedure.

We prove that for a sufficiently large number of iterations the output of our MCMC recovery algorithm converges to the correct MAP estimate.

We identify computational bottlenecks in the implementation of our MCMC estimator and show approaches to reduce their complexity.

We develop an adaptive quantization scheme that tailors a set of reproduction levels to minimize the quantization error within the MCMC iterations and that provides an accelerated implementation.

We propose a framework that adaptively adjusts the cardinality (size) of the adaptive quantizer to match the complexity of the input signal, in order to further reduce the quantization error and computation.

We note in passing that averaging over the outputs of different runs of the same signal with the same measurements will yield lower mean squared error (MSE) for our proposed algorithm.
This paper is organized as follows. Section II provides background content. Section III overviews MAP estimation, quantization, and introduces universal MAP estimation. Section IV formulates an initial MCMC algorithm for universal MAP estimation, Section V describes several improvements to this initial algorithm, and Section VI presents experimental results. We conclude in Section VII. The proof of our main theoretical result appears in the appendix.
Ii Background and related work
Iia Compressed sensing
Consider the noisy measurement setup via a linear operator (1). The input signal is generated by a stationary ergodic source , and must be estimated from and . Note that the stationary ergodicity assumption enables us to model the potential memory in the source. The distribution that generates is unknown. The matrix has i.i.d. Gaussian entries, .^{2}^{2}2In contrast to our analytical and numerical results, the algorithm presented in Section IV is not dependent on a particular choice for the matrix . These moments ensure that the columns of the matrix have unit norm on average. For concrete analysis, we assume that the noise is i.i.d. Gaussian, with mean zero and known^{3}^{3}3We assume that the noise variance is known or can be estimated [11, 18]. variance for simplicity.
We focus on the setting where and the aspect ratio is positive:
(2) 
Similar settings have been discussed in the literature [44, 45]. When , this setup is known as CS; otherwise, it is a general linear inverse problem setting. Since is generated by an unknown source, we must search for an estimation mechanism that is agnostic to the specific distribution .
IiB Related work
For a scalar channel with a discretevalued signal , e.g., is an identity matrix and , Donoho proposed the Kolmogorov sampler (KS) for denoising [33],
(3) 
where denotes the Kolmogorov complexity of , defined as the length of the shortest input to a Turing machine [46] that generates the output and then halts,^{4}^{4}4For realvalued , Kolmogorov complexity (KC) can be approximated using a fine quantizer. Note that the algorithm developed in this paper uses a coarse quantizer and does not rely on KC due to the absence of a feasible method for its computation [24, 25] (cf. Section V). and controls for the presence of noise. It can be shown that asymptotically captures the statistics of the stationary ergodic source , and the persymbol complexity achieves the entropy rate , i.e., almost surely [24, p. 154, Theorem 7.3.1]. Noting that universal lossless compression algorithms [30, 31] achieve the entropy rate for any discretevalued finite state machine source , we see that these algorithms achieve the persymbol Kolmogorov complexity almost surely.
Donoho et al. expanded KS to the linear CS measurement setting but did not consider measurement noise [20]. Recent papers by Jalali and coauthors [37, 38], which appeared simultaneously with our work [3, 2], provide an analysis of a modified KS suitable for measurements corrupted by noise of bounded magnitude. Inspired by Donoho et al. [20], we estimate from noisy measurements using the empirical entropy as a proxy for the Kolmogorov complexity (cf. Section IVA).
Separate notions of complexitypenalized least squares have also been shown to be well suited for denoising and CS recovery [19, 20, 39, 40, 41, 21, 22, 23]. For example, minimum description length (MDL) [39, 40, 41, 23] provides a framework composed of classes of models for which the signal complexity can be defined sharply. In general, complexitypenalized least square approaches can yield MDLflavored CS recovery algorithms that are adaptive to parametric classes of sources [20, 19, 21, 22]. An alternative universal denoising approach computes the universal conditional expectation of the signal [3, 18].
Iii Universal MAP estimation and discretization
This section briefly reviews MAP estimation and then applies it over a quantization grid, where a universal prior is used for the signal. Additionally, we provide a conjecture for the MSE achieved by our universal MAP scheme.
Iiia Discrete MAP estimation
In this subsection, we assume for exposition purposes that we know the signal distribution . Given the measurements , the MAP estimator for has the form
(4) 
Because is i.i.d. Gaussian with mean zero and known variance ,
where and are constants, and denotes the Euclidean norm.^{5}^{5}5Other noise distributions are readily supported, e.g., for i.i.d. Laplacian noise, we need to change the norm to an norm and adjust and accordingly. Plugging into (4) and taking log likelihoods, we obtain , where denotes the objective function (risk)
our ideal risk would be .
Instead of performing continuousvalued MAP estimation, we optimize for the MAP in the discretized domain , with being defined as follows. Adapting the approach of Baron and Weissman [47], we define the set of dataindependent reproduction levels for quantizing as
(5) 
where . As increases, will quantize to a greater resolution. These reproduction levels simplify the estimation problem from continuous to discrete.
Having discussed our reproduction levels in the set , we provide a technical condition on boundedness of the signal.
Condition 1
We require that the probability density has bounded support, i.e., there exists such that for .
A limitation of the dataindependent reproduction level set (5) is that has infinite cardinality (or size for short). Thanks to Condition 1, for each value of there exists a constant such that a finite set of reproduction levels
(6) 
will quantize the range of values to the same accuracy as that of (5). We call the reproduction alphabet, and each element in it a (reproduction) level. This finite quantizer reduces the complexity of the estimation problem from infinite to combinatorial. In fact, under Condition 1. Therefore, for all and sufficiently large , this set of levels will cover the range . The resulting reduction in complexity is due to the structure in and independent of the particular statistics of the source .
Now that we have set up a quantization grid for , we convert the distribution to a probability mass function (PMF) over . Let , and define a PMF as . Then
gives the MAP estimate of over . Note that we use the PMF formulation above, instead of the more common bin integration formulation, in order to simplify our presentation and analysis. Luckily, as increases, will approximate more closely under (6).
IiiB Universal MAP estimation
We now describe a universal estimator for CS over a quantized grid. Consider a prior that might involve Kolmogorov complexity [34, 35, 36], e.g., , or MDL complexity w.r.t. some class of parametric sources [39, 40, 41]. We call a universal prior if it has the fortuitous property that for every stationary ergodic source and fixed , there exists some minimum such that
for all and [30, 31]. We optimize over an objective function that incorporates and the presence of additive white Gaussian noise in the measurements:
(7) 
resulting in^{6}^{6}6This formulation of corresponds to a Lagrangian relaxation of the approach studied in [37, 38]. . Our universal MAP estimator does not require , and can be used in general linear inverse problems.
IiiC Conjectured MSE performance
Donoho [33] showed for the scalar channel that: () the Kolmogorov sampler (3) is drawn from the posterior distribution ; and () the MSE of this estimate is no greater than twice the MMSE. Based on this result, which requires a large reproduction alphabet, we now present a conjecture on the quality of the estimation . Our conjecture is based on observing that (i) in the setting (1), Kolmogorov sampling achieves optimal rate–distortion performance; (ii) the Bayesian posterior distribution is the solution to the ratedistortion problem; and (iii) sampling from the Bayesian posterior yields a squared error that is no greater than twice the MMSE. Hence, behaves as if we sample from the Bayesian posterior distribution and yields no greater than twice the MMSE; some experimental evidence to assess this conjecture is presented in Figs. 2 and 4.
Conjecture 1
Assume that is an i.i.d. Gaussian measurement matrix where each entry has mean zero and variance . Suppose that Condition 1 holds, the aspect ratio in (2), and the noise is i.i.d. zeromean Gaussian with finite variance. Then for all , the mean squared error of the universal MAP estimator satisfies
for sufficiently large .
Iv Fixed reproduction alphabet algorithm
Although the results of the previous section are theoretically appealing, a brute force optimization of is computationally intractable. Instead, we propose an algorithmic approach based on MCMC methods [43]. Our approach is reminiscent of the framework for lossy data compression in [48, 49, 47, 50].
Iva Universal compressor
We propose a universal lossless compression formulation following the conventions of Weissman and coauthors [48, 49, 47]. We refer to the estimate as in our algorithm. Our goal is to characterize , cf. (7). Although we are inspired by the Kolmogorov sampler approach [33], KC cannot be computed [24, 25], and we instead use empirical entropy. For stationary ergodic sources, the empirical entropy converges to the persymbol entropy rate almost surely [24].
To define the empirical entropy, we first define the empirical symbol counts:
(8) 
where is the context depth [31, 51], , , is the symbol of , and is the string comprising symbols through within . We now define the order conditional empirical probability for the context as
(9) 
and the order conditional empirical entropy,
(10) 
where the sum is only over nonzero counts and probabilities.
Allowing the context depth to grow slowly with , various universal compression algorithms can achieve the empirical entropy asymptotically [31, 51, 30]. On the other hand, no compressor can outperform the entropy rate. Additionally, for large , the empirical symbol counts with context depth provide a sufficiently precise characterization of the source statistics. Therefore, provides a concise approximation to the persymbol coding length of a universal compressor.
IvB Markov chain Monte Carlo
Having approximated the coding length, we now describe how to optimize our objective function. We define the energy in an analogous manner to (7), using as our universal coding length:
(11) 
where . The minimization of this energy is analogous to minimizing .
Ideally, our goal is to compute the globally minimum energy solution . We use a stochastic MCMC relaxation [43] to achieve the globally minimum solution in the limit of infinite computation. To assist the reader in appreciating how MCMC is used to compute , we include pseudocode for our approach in Algorithm 1. The algorithm, called basic MCMC (BMCMC), will be used as a building block for our latter Algorithms 2 and 3 in Section V. The initial estimate is obtained by quantizing the initial point to . The initial point could be the output of any signal reconstruction algorithm, and because is a preliminary estimate of the signal that does not require high fidelity, we let for simplicity, where denotes transpose. We refer to the processing of a single entry of as an iteration and group the processing of all entries of , randomly permuted, into superiterations.
The Boltzmann PMF is defined as
(12) 
where is inversely related to the temperature in simulated annealing and is a normalization constant. MCMC samples from the Boltzmann PMF (12) using a Gibbs sampler: in each iteration, a single element is generated while the rest of , , remains unchanged. We denote by the concatenation of the initial portion of the output vector , the symbol , and the latter portion of the output . The Gibbs sampler updates by resampling from the PMF:
where
is the change in empirical entropy (10) when is replaced by , and
(14) 
is the change in when is replaced by . The maximum change in the energy within an iteration of Algorithm 1 is then bounded by
(15) 
Note that is assumed bounded (cf. Condition 1) so that (14–15) are bounded as well.
In MCMC, the space is analogous to a statistical mechanical system, and at low temperatures the system tends toward low energies. Therefore, during the execution of the algorithm, we set a sequence of decreasing temperatures that takes into account the maximum change given in (15):
(16) 
where is a temperature offset. At low temperatures, i.e., large , a small difference in energy drives a big difference in probability, cf. (12). Therefore, we begin at a high temperature where the Gibbs sampler can freely move around . As the temperature is reduced, the PMF becomes more sensitive to changes in energy (12), and the trend toward with lower energy grows stronger. In each iteration, the Gibbs sampler modifies in a random manner that resembles heat bath concepts in statistical mechanics. Although MCMC could sink into a local minimum, Geman and Geman [43] proved that if we decrease the temperature according to (16), then the randomness of Gibbs sampling will eventually drive MCMC out of the locally minimum energy and it will converge to the globally optimal energy w.r.t. . Note that Geman and Geman proved that MCMC will converge, although the proof states that it will take infinitely long to do so. In order to help BMCMC approach the global minimum with reasonable runtime, we will refine BMCMC in Section V.
The following theorem is proven in Appendix A, following the framework established by Jalali and Weissman [48, 49].
Theorem 1
Theorem 1 shows that Algorithm 1 matches the bestpossible performance of the universal MAP estimator as measured by the objective function , which should yield an MSE that is twice the MMSE (cf. Conjecture 1). We want to remind the reader that Theorem 1 is based on the stationarity and ergodicity of the source, which could have memory. To gain some insight about the convergence process of MCMC, we focus on a fixed arbitrary suboptimal sequence . Suppose that at superiteration the energy for the algorithm’s output has converged to the steady state (see Appendix A for details on convergence). We can then focus on the probability ratio ; because is the global minimum and has the largest Boltzmann probability over all , whereas is suboptimal. We then consider the same sequence at superiteration ; the inverse temperature is and the corresponding ratio at superiteration is (cf. (12))
That is, between superiterations and the probability ratio is also squared, and the Gibbs sampler is less likely to generate samples whose energy differs significantly from the minimum energy w.r.t. . We infer from this argument that the probability concentration of our algorithm around the globally optimal energy w.r.t. is linear in the number of superiterations.
IvC Computational challenges
Studying the pseudocode of Algorithm 1, we recognize that Lines 9–11 must be implemented efficiently, as they run times. Lines 9 and 10 are especially challenging.
For Line 9, a naive update of has complexity , cf. (10). To address this problem, Jalali and Weissman [48, 49] recompute the empirical conditional entropy in time only for the contexts whose corresponding counts are modified [48, 49]. The same approach can be used in Line 13, again reducing computation from to . Some straightforward algebra allows us to convert Line 10 to a form that requires aggregate runtime of . Combined with the computation for Line 9, and since (because , and ) in practice, the entire runtime of our algorithm is .
The practical value of Algorithm 1 may be reduced due to its high computational cost, dictated by the number of superiterations required for convergence to and the large size of the reproduction alphabet. Nonetheless, Algorithm 1 provides a starting point toward further performance gains of more practical algorithms for computing , which are presented in Section V. Furthermore, our experiments in Section VI will show that the performance of the algorithm of Section V is comparable to and in many cases better than existing algorithms.
V Adaptive reproduction alphabet
While Algorithm 1 is a first step toward universal signal estimation in CS, must be large enough to ensure that quantizes a broad enough range of values of finely enough to represent the estimate well. For large , the estimation performance using the reproduction alphabet (6) could suffer from high computational complexity. On the other hand, for small the number of reproduction levels employed is insufficient to obtain acceptable performance. Nevertheless, using an excessive number of levels will slow down the convergence. Therefore, in this section, we explore techniques that tailor the reproduction alphabet adaptively to the signal being observed.
Va Adaptivity in reproduction levels
To estimate better with finite , we utilize reproduction levels that are adaptive instead of the fixed levels in . To do so, instead of , we optimize over a sequence , where and denotes the size. The new reproduction alphabet does not directly correspond to real numbers. Instead, there is an adaptive mapping , and the reproduction levels are . Therefore, we call the adaptive reproduction alphabet. Since the mapping is onetoone, we also refer to as reproduction levels. Considering the energy function (11), we now compute the empirical symbol counts , order conditional empirical probabilities , and order conditional empirical entropy using , , and , cf. (8), (9), and (10). Similarly, we use instead of , where is the straightforward vector extension of . These modifications yield an adaptive energy function .
We choose to optimize for minimum squared error,
where denotes the entry of the vector . The optimal mapping depends entirely on , , and . From a coding perspective, describing requires bits for and bits for to match the resolution of the nonadaptive , with an arbitrary constant [47]. The resulting coding length defines our universal prior.
Optimization of reproduction levels: We now describe the optimization procedure for , which must be computationally efficient. Write
where is the entry of at row and column . For to be minimum, we need zerovalued derivatives in (17), where is the indicator function for event .
(17) 
Define the location sets for each , and rewrite the derivatives of ,
(18) 
Let the percharacter sum column values be
(19) 
for each and . We desire the derivatives to be zero, cf. (18):
Thus, the system of equations must be satisfied,
(20) 
for each . Consider now the right hand side,
for each . The system of equations can be described in matrix form in (21).
(21) 
Note that by writing as a matrix with entries indexed by row and column given by (19), we can write as a Gram matrix, , and we also have , cf. (20). The optimal can be computed as a vector if is invertible. We note in passing that numerical stability can be improved by regularizing . Note also that
(22) 
which can be computed in time instead of .
Computational complexity: Pseudocode for leveladaptive MCMC (LMCMC) appears in Algorithm 2, which resembles Algorithm 1. The initial mapping is inherited from a quantization of the initial point , ( takes different values in Section VB), and other minor differences between BMCMC and LMCMC appear in lines marked by asterisks.
We discuss computational requirements for each line of the pseudocode that is run within the inner loop.

Line 11 updates for in time.

Line 12 updates . Because we only need to update columns and rows, each such column and row contains entries, and each entry is a sum over terms, we need time.

Line 13 requires inverting in time.

Line 15 requires time.
In practice we typically have , and so the aggregate complexity is , which is greater than the computational complexity of Algorithm 1 by a factor of .
VB Adaptivity in reproduction alphabet size
While Algorithm 2 adaptively maps to , the signal estimation quality heavily depends on . Denote the true alphabet of the signal by ; if the signal is continuousvalued, then is infinite. Ideally we want to employ as many levels as the runtime allows for continuousvalued signals, whereas for discretevalued signals we want . Inspired by this observation, we propose to begin with some initial , and then adaptively adjust hoping to match . Hence, we propose the size and leveladaptive MCMC algorithm (Algorithm 3), which invokes LMCMC (Algorithm 2) several times.
Three basic procedures: In order to describe the size and leveladaptive MCMC (SLAMCMC) algorithm in detail, we introduce three alphabet adaptation procedures as follows.

MERGE: First, find the closest adjacent levels . Create a new level and add it to . Let . Replace by whenever . Next, remove and from .

ADDout: Define the range , and . Add a lower level and/or upper level to with
Note that , i.e., the new levels are empty.

ADDin: First, find the most distant adjacent levels, and . Then, add a level to with . For s.t. , replace by with probability
where is given in (12); for s.t. , replace by with probability
Note that is typically nonzero, i.e., tends not to be empty.
We call the process of running one of these procedures followed by running LMCMC a round.
Size and leveladaptive MCMC: SLAMCMC is conceptually illustrated in the flowchart in Fig. 1. It has four stages, and in each stage we will run LMCMC for several superiterations; we denote the execution of LMCMC for superiterations by L(). The parameters , and are the number of superiterations used in Stages 1 through 4, respectively. The choice of these parameters reflects a tradeoff between runtime and estimation quality.
In Stage 1, SLAMCMC uses a fixedsize adaptive reproduction alphabet to tentatively estimate the signal. The initial point of Stage 1 is obtained in the same way as LMCMC. After Stage 1, the initial point and temperature offset for each instance of LMCMC correspond to the respective outputs of the previous instance of LMCMC. If the source is discretevalued and in Stage 1, then multiple levels in the output of Stage 1 may correspond to a single level in . To alleviate this problem, in Stage 2 we merge levels closer than , where is a parameter.
However, might still be larger than needed; hence in Stage 3 we tentatively merge the closest adjacent levels. The criterion evaluates whether the current objective function is lower (better) than in the previous round; we do not leave Stage 3 until is violated. Note that if (this always holds for continuousvalued signals), then ideally SLAMCMC should not merge any levels in Stage 3, because the objective function would increase if we merge any levels.
Define the outlier set . Under Condition 1, might be small or even empty. When is small, LMCMC might not assign levels to represent the entries of . To make SLAMCMC more robust to outliers, in Stage 4a we add empty levels outside the range and then allow LMCMC to change entries of to the new levels during Gibbs sampling; we call this populating the new levels. If a newly added outside level is not populated, then we remove it from . Seeing that the optimal mapping in LMCMC tends not to map symbols to levels with low population, we consider a criterion where we will will add an outside upper (lower) level if the population of the current upper (lower) level is smaller than , where is a parameter. That is, the criterion is violated if both populations of the current upper and lower levels are sufficient (at least ); in this case we do not need to add outside levels because will map some of the current levels to represent the entries in . The criterion is violated if all levels added outside are not populated by the end of the round. SLAMCMC keeps adding levels outside until it is wide enough to cover most of the entries of .
Next, SLAMCMC considers adding levels inside (Stage 4b). If the signal is discretevalued, this stage should stop when . Else, for continuousvalued signals SLAMCMC can add levels until the runtime expires.
In practice, SLAMCMC runs LMCMC at most a constant number of times, and the computational complexity is in the same order of LMCMC, i.e., . On the other hand, SLAMCMC allows varying , which often improves the estimation quality.
VC Mixing
Donoho proved for the scalar channel setting that is sampled from the posterior [33]. Seeing that the Gibbs sampler used by MCMC (cf. Section IVB) generates random samples, and the outputs of our algorithm will be different if its random number generator is initialized with different random seeds, we speculate that running SLAMCMC several times will also yield independent samples from the posterior, where we note that the runtime grows linearly in the number of times that we run SLAMCMC. By mixing (averaging over) several outputs of SLAMCMC, we obtain , which may have lower squared error w.r.t. the true than the average squared error obtained by a single SLAMCMC output. Numerical results suggest that mixing indeed reduces the MSE (cf. Fig. 8); this observation suggests that mixing the outputs of multiple algorithms, including running a random reconstruction algorithm several times, may reduce the squared error.
Vi Numerical results
In this section, we demonstrate that SLAMCMC is comparable and in many cases better than existing algorithms in reconstruction quality, and that SLAMCMC is applicable when . Additionally, some numerical evidence is provided to justify Conjecture 1 in Section IIIC. Then, the advantage of SLAMCMC in estimating lowcomplexity signals is demonstrated. Finally, we compare BMCMC, LMCMC, and SLAMCMC performance.
We implemented SLAMCMC in Matlab^{7}^{7}7A toolbox that runs the simulations in this paper is available at http://people.engr.ncsu.edu/dzbaron/software/UCS_BaronDuarte/ and tested it using several stationary ergodic sources. Except when noted, for each source, signals of length were generated. Each such was multiplied by a Gaussian random matrix with normalized columns and corrupted by i.i.d. Gaussian measurement noise . Except when noted, the number of measurements varied between 2000 and 7000. The noise variance was selected to ensure that the signaltonoise ratio (SNR) was or dB; SNR was defined as . According to Section IVA, the context depth , where the base of the logarithm is the alphabet size; using typical values such as and , we have and set . While larger will slow down the algorithm, it might be necessary to increase when is larger. The number of superiterations in different stages of SLAMCMC and , the maximum total number of superiterations to be , the initial number of levels , and the tuning parameter from Section VB ; these parameters seem to work well on an extensive set of numerical experiments. SLAMCMC was not given the true alphabet for any of the sources presented in this paper; our expectation is that it should adaptively adjust to match . The final estimate of each signal was obtained by averaging over the outputs of runs of SLAMCMC, where in each run we initialized the random number generator with another random seed, cf. Section VC. These choices of parameters seemed to provide a reasonable compromise between runtime and estimation quality.
We chose our performance metric as the mean signaltodistortion ratio (MSDR) defined as . For each and SNR, the MSE was obtained after averaging over the squared errors of for 50 draws of , , and . We compared the performance of SLAMCMC to that of (i) compressive sensing matching pursuit (CoSaMP) [52], a greedy method; (ii) gradient projection for sparse reconstruction (GPSR) [10], an optimizationbased method; (iii) message passing approaches (for each source, we chose bestmatched algorithms between EMGMAMPMOS (EGAM for short)&