Recovery from Linear Measurements with Complexity-Matching Universal Signal Estimation

# Recovery from Linear Measurements with Complexity-Matching Universal Signal Estimation

Junan Zhu,  Dror Baron,
and Marco F. Duarte,
This paper was presented in part at the IEEE Workshop on Statistical Signal Processing, Gold Coast, Australia, June 2014 [1], the Allerton Conference on Communications, Control, and Computing, Monticello, IL, September 2011 [2], and the Workshop on Information Theoretic Methods in Science and Engineering, Helsinki, Finland, Aug. 2011 [3].J. Zhu and D. Baron were partially supported in part by the National Science Foundation under Grant CCF-1217749 and in part by the U.S. Army Research Office under Grants W911NF-04-D-0003 and W911NF-14-1-0314. M. F. Duarte was partially supported by NSF Supplemental Funding DMS-0439872 to UCLA-IPAM, PI R. Caflisch.J. Zhu and D. Baron are with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695. E-mail: {jzhu9,barondror}@ncsu.eduM. F. Duarte is with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003. E-mail:       mduarte@ecs.umass.edu
###### Abstract

We study the compressed sensing (CS) signal estimation problem where an input signal is measured via a linear matrix multiplication under additive noise. While this setup usually assumes sparsity or compressibility in the input signal during recovery, the signal structure that can be leveraged is often not known a priori. In this paper, we consider universal CS recovery, where the statistics of a stationary ergodic signal source are estimated simultaneously with the signal itself. Inspired by Kolmogorov complexity and minimum description length, we focus on a maximum a posteriori (MAP) estimation framework that leverages universal priors to match the complexity of the source. Our framework can also be applied to general linear inverse problems where more measurements than in CS might be needed. We provide theoretical results that support the algorithmic feasibility of universal MAP estimation using a Markov chain Monte Carlo implementation, which is computationally challenging. We incorporate some techniques to accelerate the algorithm while providing comparable and in many cases better reconstruction quality than existing algorithms. Experimental results show the promise of universality in CS, particularly for low-complexity sources that do not exhibit standard sparsity or compressibility.

Compressed sensing, MAP estimation, Markov chain Monte Carlo, universal algorithms.

## I Introduction

Since many systems in science and engineering are approximately linear, linear inverse problems have attracted great attention in the signal processing community. An input signal is recorded via a linear operator under additive noise:

 y=Φx+z, (1)

where is an matrix and denotes the noise. The goal is to estimate from the measurements given knowledge of and a model for the noise . When , the setup is known as compressed sensing (CS) and the estimation problem is commonly referred to as recovery or reconstruction; by posing a sparsity or compressibility111We use the term compressibility in this paper as defined by Candès et al. [4] to refer to signals whose sparse approximation error decays sufficiently quickly. requirement on the signal and using this requirement as a prior during recovery, it is indeed possible to accurately estimate from  [4, 5]. On the other hand, we might need more measurements than the signal length when the signal is dense or the noise is substantial.

Wu and Verdú [6] have shown that independent and identically distributed (i.i.d.) Gaussian sensing matrices achieve the same phase-transition threshold as the optimal (potentially nonlinear) measurement operator, for any i.i.d. signals following the discrete/continuous mixture distribution , where is the probability for to take a continuous distribution and is an arbitrary discrete distribution. For non-i.i.d. signals, Gaussian matrices also work well [7, 8, 9]. Hence, in CS the acquisition can be designed independently of the particular signal prior through the use of randomized Gaussian matrices . Nevertheless, the majority of (if not all) existing recovery algorithms require knowledge of the sparsity structure of , i.e., the choice of a sparsifying transform that renders a sparse coefficient vector for the signal.

The large majority of recovery algorithms pose a sparsity prior on the signal or the coefficient vector , e.g., [4, 5, 10]. A second, separate class of Bayesian CS recovery algorithms poses a probabilistic prior for the coefficients of in a known transform domain [11, 12, 13, 14, 15]. Given a probabilistic model, some related message passing approaches learn the parameters of the signal model and achieve the minimum mean squared error (MMSE) in some settings; examples include EM-GM-AMP-MOS [16], turboGAMP [17], and AMP-MixD [18]. As a third alternative, complexity-penalized least square methods [19, 20, 21, 22, 23] can use arbitrary prior information on the signal model and provide analytical guarantees, but are only computationally efficient for specific signal models, such as the independent-entry Laplacian model [21]. For example, Donoho et al. [20] relies on Kolmogorov complexity, which cannot be computed [24, 25]. As a fourth alternative, there exist algorithms that can formulate dictionaries that yield sparse representations for the signals of interest when a large amount of training data is available [23, 26, 27, 28]. When the signal is non-i.i.d., existing algorithms require either prior knowledge of the probabilistic model [17] or the use of training data [29].

In certain cases, one might not be certain about the structure or statistics of the source prior to recovery. Uncertainty about such structure may result in a sub-optimal choice of the sparsifying transform , yielding a coefficient vector that requires more measurements to achieve reasonable estimation quality; uncertainty about the statistics of the source will make it difficult to select a prior or model for Bayesian algorithms. Thus, it would be desirable to formulate algorithms to estimate that are more agnostic to the particular statistics of the signal. Therefore, we shift our focus from the standard sparsity or compressibility priors to universal priors [30, 31, 32]. Such concepts have been previously leveraged in the Kolmogorov sampler universal denoising algorithm [33], which minimizes Kolmogorov complexity [34, 35, 36, 25, 37, 38, 3, 2]. Related approaches based on minimum description length (MDL) [39, 40, 41, 42] minimize the complexity of the estimated signal with respect to (w.r.t.) some class of sources.

Approaches for non-parametric sources based on Kolmogorov complexity are not computable in practice [24, 25]. To address this computational problem, we confine our attention to the class of stationary ergodic sources and develop an algorithmic framework for universal signal estimation in CS systems that will approach the MMSE as closely as possible for the class of stationary ergodic sources. Our framework can be applied to general linear inverse problems where more measurements might be needed. Our framework leverages the fact that for stationary ergodic sources, both the per-symbol empirical entropy and Kolmogorov complexity converge asymptotically almost surely to the entropy rate of the source [24]. We aim to minimize the empirical entropy; our minimization is regularized by introducing a log likelihood for the noise model, which is equivalent to the standard least squares under additive white Gaussian noise. Other noise distributions are readily supported.

We make the following contributions toward our universal CS framework.

• We apply a specific quantization grid to a maximum a posteriori (MAP) estimator driven by a universal prior, providing a finite-computation universal estimation scheme; our scheme can also be applied to general linear inverse problems where more measurements might be needed.

• We propose a recovery algorithm based on Markov chain Monte Carlo (MCMC) [43] to approximate this estimation procedure.

• We prove that for a sufficiently large number of iterations the output of our MCMC recovery algorithm converges to the correct MAP estimate.

• We identify computational bottlenecks in the implementation of our MCMC estimator and show approaches to reduce their complexity.

• We develop an adaptive quantization scheme that tailors a set of reproduction levels to minimize the quantization error within the MCMC iterations and that provides an accelerated implementation.

• We propose a framework that adaptively adjusts the cardinality (size) of the adaptive quantizer to match the complexity of the input signal, in order to further reduce the quantization error and computation.

• We note in passing that averaging over the outputs of different runs of the same signal with the same measurements will yield lower mean squared error (MSE) for our proposed algorithm.

This paper is organized as follows. Section II provides background content. Section III overviews MAP estimation, quantization, and introduces universal MAP estimation. Section IV formulates an initial MCMC algorithm for universal MAP estimation, Section V describes several improvements to this initial algorithm, and Section VI presents experimental results. We conclude in Section VII. The proof of our main theoretical result appears in the appendix.

## Ii Background and related work

### Ii-a Compressed sensing

Consider the noisy measurement setup via a linear operator (1). The input signal is generated by a stationary ergodic source , and must be estimated from and . Note that the stationary ergodicity assumption enables us to model the potential memory in the source. The distribution that generates is unknown. The matrix has i.i.d. Gaussian entries, .222In contrast to our analytical and numerical results, the algorithm presented in Section IV is not dependent on a particular choice for the matrix . These moments ensure that the columns of the matrix have unit norm on average. For concrete analysis, we assume that the noise is i.i.d. Gaussian, with mean zero and known333We assume that the noise variance is known or can be estimated [11, 18]. variance for simplicity.

We focus on the setting where and the aspect ratio is positive:

 R≜limN→∞MN>0. (2)

Similar settings have been discussed in the literature [44, 45]. When , this setup is known as CS; otherwise, it is a general linear inverse problem setting. Since is generated by an unknown source, we must search for an estimation mechanism that is agnostic to the specific distribution .

### Ii-B Related work

For a scalar channel with a discrete-valued signal , e.g., is an identity matrix and , Donoho proposed the Kolmogorov sampler (KS) for denoising [33],

 xKS≜argminwK(w) s.t. ∥w−y∥2<τ, (3)

where denotes the Kolmogorov complexity of , defined as the length of the shortest input to a Turing machine [46] that generates the output and then halts,444For real-valued , Kolmogorov complexity (KC) can be approximated using a fine quantizer. Note that the algorithm developed in this paper uses a coarse quantizer and does not rely on KC due to the absence of a feasible method for its computation [24, 25] (cf. Section V). and controls for the presence of noise. It can be shown that asymptotically captures the statistics of the stationary ergodic source , and the per-symbol complexity achieves the entropy rate , i.e., almost surely [24, p. 154, Theorem 7.3.1]. Noting that universal lossless compression algorithms [30, 31] achieve the entropy rate for any discrete-valued finite state machine source , we see that these algorithms achieve the per-symbol Kolmogorov complexity almost surely.

Donoho et al. expanded KS to the linear CS measurement setting but did not consider measurement noise [20]. Recent papers by Jalali and coauthors [37, 38], which appeared simultaneously with our work [3, 2], provide an analysis of a modified KS suitable for measurements corrupted by noise of bounded magnitude. Inspired by Donoho et al. [20], we estimate from noisy measurements using the empirical entropy as a proxy for the Kolmogorov complexity (cf. Section IV-A).

Separate notions of complexity-penalized least squares have also been shown to be well suited for denoising and CS recovery [19, 20, 39, 40, 41, 21, 22, 23]. For example, minimum description length (MDL) [39, 40, 41, 23] provides a framework composed of classes of models for which the signal complexity can be defined sharply. In general, complexity-penalized least square approaches can yield MDL-flavored CS recovery algorithms that are adaptive to parametric classes of sources [20, 19, 21, 22]. An alternative universal denoising approach computes the universal conditional expectation of the signal [3, 18].

## Iii Universal MAP estimation and discretization

This section briefly reviews MAP estimation and then applies it over a quantization grid, where a universal prior is used for the signal. Additionally, we provide a conjecture for the MSE achieved by our universal MAP scheme.

### Iii-a Discrete MAP estimation

In this subsection, we assume for exposition purposes that we know the signal distribution . Given the measurements , the MAP estimator for has the form

 xMAP≜argmaxwfX(w)fY|X(y|w). (4)

Because is i.i.d. Gaussian with mean zero and known variance ,

 fY|X(y|w)=c1e−c2∥y−Φw∥2,

where and are constants, and denotes the Euclidean norm.555Other noise distributions are readily supported, e.g., for i.i.d. Laplacian noise, we need to change the norm to an norm and adjust and accordingly. Plugging into (4) and taking log likelihoods, we obtain , where denotes the objective function (risk)

 ΨX(w)≜−ln(fX(w))+c2∥y−Φw∥2;

our ideal risk would be .

Instead of performing continuous-valued MAP estimation, we optimize for the MAP in the discretized domain , with being defined as follows. Adapting the approach of Baron and Weissman [47], we define the set of data-independent reproduction levels for quantizing as

 R≜{…,−1γ,0,1γ,…}, (5)

where . As increases, will quantize to a greater resolution. These reproduction levels simplify the estimation problem from continuous to discrete.

Having discussed our reproduction levels in the set , we provide a technical condition on boundedness of the signal.

###### Condition 1

We require that the probability density has bounded support, i.e., there exists such that for .

A limitation of the data-independent reproduction level set (5) is that has infinite cardinality (or size for short). Thanks to Condition 1, for each value of there exists a constant such that a finite set of reproduction levels

 RF≜{−c3γ2γ,−c3γ2−1γ,…,c3γ2γ} (6)

will quantize the range of values to the same accuracy as that of (5). We call the reproduction alphabet, and each element in it a (reproduction) level. This finite quantizer reduces the complexity of the estimation problem from infinite to combinatorial. In fact, under Condition 1. Therefore, for all and sufficiently large , this set of levels will cover the range . The resulting reduction in complexity is due to the structure in and independent of the particular statistics of the source .

Now that we have set up a quantization grid for , we convert the distribution to a probability mass function (PMF) over . Let , and define a PMF as . Then

 xMAP(RF)≜argminw∈(RF)N(−ln(PX(w))+c2∥y−Φw∥2)

gives the MAP estimate of over . Note that we use the PMF formulation above, instead of the more common bin integration formulation, in order to simplify our presentation and analysis. Luckily, as increases, will approximate more closely under (6).

### Iii-B Universal MAP estimation

We now describe a universal estimator for CS over a quantized grid. Consider a prior that might involve Kolmogorov complexity [34, 35, 36], e.g., , or MDL complexity w.r.t. some class of parametric sources [39, 40, 41]. We call a universal prior if it has the fortuitous property that for every stationary ergodic source and fixed , there exists some minimum such that

 −ln(PU(w))N<−ln(PX(w))N+ϵ

for all and  [30, 31]. We optimize over an objective function that incorporates and the presence of additive white Gaussian noise in the measurements:

 ΨU(w)≜−ln(PU(w))+c2∥y−Φw∥2, (7)

resulting in666This formulation of corresponds to a Lagrangian relaxation of the approach studied in [37, 38]. . Our universal MAP estimator does not require , and can be used in general linear inverse problems.

### Iii-C Conjectured MSE performance

Donoho [33] showed for the scalar channel that: () the Kolmogorov sampler (3) is drawn from the posterior distribution ; and () the MSE of this estimate is no greater than twice the MMSE. Based on this result, which requires a large reproduction alphabet, we now present a conjecture on the quality of the estimation . Our conjecture is based on observing that (i) in the setting (1), Kolmogorov sampling achieves optimal rate–distortion performance; (ii) the Bayesian posterior distribution is the solution to the rate-distortion problem; and (iii) sampling from the Bayesian posterior yields a squared error that is no greater than twice the MMSE. Hence, behaves as if we sample from the Bayesian posterior distribution and yields no greater than twice the MMSE; some experimental evidence to assess this conjecture is presented in Figs. 2 and 4.

###### Conjecture 1

Assume that is an i.i.d. Gaussian measurement matrix where each entry has mean zero and variance . Suppose that Condition 1 holds, the aspect ratio in (2), and the noise is i.i.d. zero-mean Gaussian with finite variance. Then for all , the mean squared error of the universal MAP estimator satisfies

 EX,Z,Φ[∥x−xUMAP∥2]N<2EX,Z,Φ[∥x−EX[x|y,Φ]∥2]N+ϵ

for sufficiently large .

## Iv Fixed reproduction alphabet algorithm

Although the results of the previous section are theoretically appealing, a brute force optimization of is computationally intractable. Instead, we propose an algorithmic approach based on MCMC methods [43]. Our approach is reminiscent of the framework for lossy data compression in [48, 49, 47, 50].

### Iv-a Universal compressor

We propose a universal lossless compression formulation following the conventions of Weissman and coauthors [48, 49, 47]. We refer to the estimate as in our algorithm. Our goal is to characterize , cf. (7). Although we are inspired by the Kolmogorov sampler approach [33], KC cannot be computed [24, 25], and we instead use empirical entropy. For stationary ergodic sources, the empirical entropy converges to the per-symbol entropy rate almost surely [24].

To define the empirical entropy, we first define the empirical symbol counts:

 nq(w,α)[β]≜|{i∈[q+1,N]:wi−1i−q=α,wi=β}|, (8)

where is the context depth [31, 51], , , is the symbol of , and is the string comprising symbols through within . We now define the order conditional empirical probability for the context as

 Pq(w,α)[β]≜nq(w,α)[β]∑β′∈RFnq(w,α)[β′], (9)

and the order conditional empirical entropy,

 Hq(w)≜−1N∑α∈(RF)q,β∈RFnq(w,α)[β]log2(Pq(w,α)[β]), (10)

where the sum is only over non-zero counts and probabilities.

Allowing the context depth to grow slowly with , various universal compression algorithms can achieve the empirical entropy asymptotically [31, 51, 30]. On the other hand, no compressor can outperform the entropy rate. Additionally, for large , the empirical symbol counts with context depth provide a sufficiently precise characterization of the source statistics. Therefore, provides a concise approximation to the per-symbol coding length of a universal compressor.

### Iv-B Markov chain Monte Carlo

Having approximated the coding length, we now describe how to optimize our objective function. We define the energy in an analogous manner to (7), using as our universal coding length:

 ΨHq(w)≜NHq(w)+c4∥y−Φw∥2, (11)

where . The minimization of this energy is analogous to minimizing .

Ideally, our goal is to compute the globally minimum energy solution . We use a stochastic MCMC relaxation [43] to achieve the globally minimum solution in the limit of infinite computation. To assist the reader in appreciating how MCMC is used to compute , we include pseudocode for our approach in Algorithm 1. The algorithm, called basic MCMC (B-MCMC), will be used as a building block for our latter Algorithms 2 and 3 in Section V. The initial estimate is obtained by quantizing the initial point to . The initial point could be the output of any signal reconstruction algorithm, and because is a preliminary estimate of the signal that does not require high fidelity, we let for simplicity, where denotes transpose. We refer to the processing of a single entry of as an iteration and group the processing of all entries of , randomly permuted, into super-iterations.

The Boltzmann PMF is defined as

 Ps(w)≜1ζsexp(−sΨHq(w)), (12)

where is inversely related to the temperature in simulated annealing and is a normalization constant. MCMC samples from the Boltzmann PMF (12) using a Gibbs sampler: in each iteration, a single element is generated while the rest of , , remains unchanged. We denote by the concatenation of the initial portion of the output vector , the symbol , and the latter portion of the output . The Gibbs sampler updates by resampling from the PMF:

 Ps(wn=a|w∖n) = exp(−sΨHq(wn−11awNn+1))∑b∈RFexp(−sΨHq(wn−11bwNn+1)) = 1∑b∈RFexp(−s[NΔHq(w,n,b,a)+c4Δd(w,n,b,a)]),

where

 ΔHq(w,n,b,a)≜Hq(wn−11bwNn+1)−Hq(wn−11awNn+1)

is the change in empirical entropy (10) when is replaced by , and

 Δd(w,n,b,a)≜∥y−Φ(wn−11bwNn+1)∥2−∥y−Φ(wn−11awNn+1)∥2 (14)

is the change in when is replaced by . The maximum change in the energy within an iteration of Algorithm 1 is then bounded by

 Δq=max1≤n≤Nmaxw∈(RF)Nmaxa,b∈RF|NΔHq(w,n,b,a)+c4Δd(w,n,b,a)|. (15)

Note that is assumed bounded (cf. Condition 1) so that (1415) are bounded as well.

In MCMC, the space is analogous to a statistical mechanical system, and at low temperatures the system tends toward low energies. Therefore, during the execution of the algorithm, we set a sequence of decreasing temperatures that takes into account the maximum change given in (15):

 st≜ln(t+r0)/(cNΔq) for some c>1, (16)

where is a temperature offset. At low temperatures, i.e., large , a small difference in energy drives a big difference in probability, cf. (12). Therefore, we begin at a high temperature where the Gibbs sampler can freely move around . As the temperature is reduced, the PMF becomes more sensitive to changes in energy (12), and the trend toward with lower energy grows stronger. In each iteration, the Gibbs sampler modifies in a random manner that resembles heat bath concepts in statistical mechanics. Although MCMC could sink into a local minimum, Geman and Geman [43] proved that if we decrease the temperature according to (16), then the randomness of Gibbs sampling will eventually drive MCMC out of the locally minimum energy and it will converge to the globally optimal energy w.r.t. . Note that Geman and Geman proved that MCMC will converge, although the proof states that it will take infinitely long to do so. In order to help B-MCMC approach the global minimum with reasonable runtime, we will refine B-MCMC in Section V.

The following theorem is proven in Appendix A, following the framework established by Jalali and Weissman [48, 49].

###### Theorem 1

Let be a stationary ergodic source that obeys Condition 1. Then the outcome of Algorithm 1 in the limit of an infinite number of super-iterations obeys

 limr→∞ΨHq(wr)=min˜w∈(RF)NΨHq(˜w)=ΨHq(xHqMAP).

Theorem 1 shows that Algorithm 1 matches the best-possible performance of the universal MAP estimator as measured by the objective function , which should yield an MSE that is twice the MMSE (cf. Conjecture 1). We want to remind the reader that Theorem 1 is based on the stationarity and ergodicity of the source, which could have memory. To gain some insight about the convergence process of MCMC, we focus on a fixed arbitrary sub-optimal sequence . Suppose that at super-iteration the energy for the algorithm’s output has converged to the steady state (see Appendix A for details on convergence). We can then focus on the probability ratio ; because is the global minimum and has the largest Boltzmann probability over all , whereas is sub-optimal. We then consider the same sequence at super-iteration ; the inverse temperature is and the corresponding ratio at super-iteration is (cf. (12))

 P2st(w)P2st(xHqMAP)=exp(−2stΨHq(w))exp(−2stΨHq(xHqMAP))=⎛⎝Pst(w)Pst(xHqMAP)⎞⎠2.

That is, between super-iterations and the probability ratio is also squared, and the Gibbs sampler is less likely to generate samples whose energy differs significantly from the minimum energy w.r.t. . We infer from this argument that the probability concentration of our algorithm around the globally optimal energy w.r.t.  is linear in the number of super-iterations.

### Iv-C Computational challenges

Studying the pseudocode of Algorithm 1, we recognize that Lines 911 must be implemented efficiently, as they run times. Lines 9 and 10 are especially challenging.

For Line 9, a naive update of has complexity , cf. (10). To address this problem, Jalali and Weissman [48, 49] recompute the empirical conditional entropy in time only for the contexts whose corresponding counts are modified [48, 49]. The same approach can be used in Line 13, again reducing computation from to . Some straightforward algebra allows us to convert Line 10 to a form that requires aggregate runtime of . Combined with the computation for Line 9, and since (because , and ) in practice, the entire runtime of our algorithm is .

The practical value of Algorithm 1 may be reduced due to its high computational cost, dictated by the number of super-iterations required for convergence to and the large size of the reproduction alphabet. Nonetheless, Algorithm 1 provides a starting point toward further performance gains of more practical algorithms for computing , which are presented in Section V. Furthermore, our experiments in Section VI will show that the performance of the algorithm of Section V is comparable to and in many cases better than existing algorithms.

While Algorithm 1 is a first step toward universal signal estimation in CS, must be large enough to ensure that quantizes a broad enough range of values of finely enough to represent the estimate well. For large , the estimation performance using the reproduction alphabet (6) could suffer from high computational complexity. On the other hand, for small the number of reproduction levels employed is insufficient to obtain acceptable performance. Nevertheless, using an excessive number of levels will slow down the convergence. Therefore, in this section, we explore techniques that tailor the reproduction alphabet adaptively to the signal being observed.

### V-a Adaptivity in reproduction levels

To estimate better with finite , we utilize reproduction levels that are adaptive instead of the fixed levels in . To do so, instead of , we optimize over a sequence , where and denotes the size. The new reproduction alphabet does not directly correspond to real numbers. Instead, there is an adaptive mapping , and the reproduction levels are . Therefore, we call the adaptive reproduction alphabet. Since the mapping is one-to-one, we also refer to as reproduction levels. Considering the energy function (11), we now compute the empirical symbol counts , order conditional empirical probabilities , and order conditional empirical entropy using , , and , cf. (8), (9), and (10). Similarly, we use instead of , where is the straightforward vector extension of . These modifications yield an adaptive energy function .

We choose to optimize for minimum squared error,

 Aopt ≜argminA∥y−ΦA(u)∥2 =argminA[M∑m=1(ym−[ΦA(u)]m)2],

where denotes the entry of the vector . The optimal mapping depends entirely on , , and . From a coding perspective, describing requires bits for and bits for to match the resolution of the non-adaptive , with an arbitrary constant [47]. The resulting coding length defines our universal prior.

Optimization of reproduction levels: We now describe the optimization procedure for , which must be computationally efficient. Write

 Υ(A)≜∥y−ΦA(u)∥2=M∑m=1(ym−N∑n=1ΦmnA(un))2,

where is the entry of at row and column . For to be minimum, we need zero-valued derivatives in (17), where is the indicator function for event .

Define the location sets for each , and rewrite the derivatives of ,

 dΥ(A)dA(β)=−2M∑m=1⎛⎝ym−∑λ∈Z∑n∈LλΦmnA(λ)⎞⎠⎛⎜⎝∑n∈LβΦmn⎞⎟⎠. (18)

Let the per-character sum column values be

 μmβ≜∑n∈LβΦmn, (19)

for each and . We desire the derivatives to be zero, cf. (18):

Thus, the system of equations must be satisfied,

 M∑m=1ymμmβ=M∑m=1(∑λ∈ZA(λ)μmλ)μmβ (20)

for each . Consider now the right hand side,

 M∑m=1(∑λ∈ZA(λ)μmλ)μmβ=∑λ∈ZA(λ)M∑m=1μmλμmβ,

for each . The system of equations can be described in matrix form in (21).

Note that by writing as a matrix with entries indexed by row and column given by (19), we can write as a Gram matrix, , and we also have , cf. (20). The optimal can be computed as a vector if is invertible. We note in passing that numerical stability can be improved by regularizing . Note also that

 ∥y−ΦA(u)∥2=M∑m=1⎛⎝ym−∑β∈ZμmβAopt(β)⎞⎠2, (22)

which can be computed in time instead of .

Computational complexity: Pseudocode for level-adaptive MCMC (L-MCMC) appears in Algorithm 2, which resembles Algorithm 1. The initial mapping is inherited from a quantization of the initial point , ( takes different values in Section V-B), and other minor differences between B-MCMC and L-MCMC appear in lines marked by asterisks.

We discuss computational requirements for each line of the pseudocode that is run within the inner loop.

• Line 10 can be computed in time (see discussion of Line 9 of B-MCMC in Section IV-C).

• Line 11 updates for in time.

• Line 12 updates . Because we only need to update columns and rows, each such column and row contains entries, and each entry is a sum over terms, we need time.

• Line 13 requires inverting in time.

• Line 14 requires time, cf. (22).

• Line 15 requires time.

In practice we typically have , and so the aggregate complexity is , which is greater than the computational complexity of Algorithm 1 by a factor of .

### V-B Adaptivity in reproduction alphabet size

While Algorithm 2 adaptively maps to , the signal estimation quality heavily depends on . Denote the true alphabet of the signal by ; if the signal is continuous-valued, then is infinite. Ideally we want to employ as many levels as the runtime allows for continuous-valued signals, whereas for discrete-valued signals we want . Inspired by this observation, we propose to begin with some initial , and then adaptively adjust hoping to match . Hence, we propose the size- and level-adaptive MCMC algorithm (Algorithm 3), which invokes L-MCMC (Algorithm 2) several times.

Three basic procedures: In order to describe the size- and level-adaptive MCMC (SLA-MCMC) algorithm in detail, we introduce three alphabet adaptation procedures as follows.

• MERGE: First, find the closest adjacent levels . Create a new level and add it to . Let . Replace by whenever . Next, remove and from .

• ADD-out: Define the range , and . Add a lower level and/or upper level to with

 A(β3)=minA(Z)−IRA|Z|−1, A(β4)=maxA(Z)+IRA|Z|−1.

Note that , i.e., the new levels are empty.

• ADD-in: First, find the most distant adjacent levels, and . Then, add a level to with . For s.t. , replace by with probability

where is given in (12); for s.t. , replace by with probability

Note that is typically non-zero, i.e., tends not to be empty.

We call the process of running one of these procedures followed by running L-MCMC a round.

Size- and level-adaptive MCMC: SLA-MCMC is conceptually illustrated in the flowchart in Fig. 1. It has four stages, and in each stage we will run L-MCMC for several super-iterations; we denote the execution of L-MCMC for super-iterations by L(). The parameters , and are the number of super-iterations used in Stages 1 through 4, respectively. The choice of these parameters reflects a trade-off between runtime and estimation quality.

In Stage 1, SLA-MCMC uses a fixed-size adaptive reproduction alphabet to tentatively estimate the signal. The initial point of Stage 1 is obtained in the same way as L-MCMC. After Stage 1, the initial point and temperature offset for each instance of L-MCMC correspond to the respective outputs of the previous instance of L-MCMC. If the source is discrete-valued and in Stage 1, then multiple levels in the output of Stage 1 may correspond to a single level in . To alleviate this problem, in Stage 2 we merge levels closer than , where is a parameter.

However, might still be larger than needed; hence in Stage 3 we tentatively merge the closest adjacent levels. The criterion evaluates whether the current objective function is lower (better) than in the previous round; we do not leave Stage 3 until is violated. Note that if (this always holds for continuous-valued signals), then ideally SLA-MCMC should not merge any levels in Stage 3, because the objective function would increase if we merge any levels.

Define the outlier set . Under Condition 1, might be small or even empty. When is small, L-MCMC might not assign levels to represent the entries of . To make SLA-MCMC more robust to outliers, in Stage 4a we add empty levels outside the range and then allow L-MCMC to change entries of to the new levels during Gibbs sampling; we call this populating the new levels. If a newly added outside level is not populated, then we remove it from . Seeing that the optimal mapping in L-MCMC tends not to map symbols to levels with low population, we consider a criterion where we will will add an outside upper (lower) level if the population of the current upper (lower) level is smaller than , where is a parameter. That is, the criterion is violated if both populations of the current upper and lower levels are sufficient (at least ); in this case we do not need to add outside levels because will map some of the current levels to represent the entries in . The criterion is violated if all levels added outside are not populated by the end of the round. SLA-MCMC keeps adding levels outside until it is wide enough to cover most of the entries of .

Next, SLA-MCMC considers adding levels inside (Stage 4b). If the signal is discrete-valued, this stage should stop when . Else, for continuous-valued signals SLA-MCMC can add levels until the runtime expires.

In practice, SLA-MCMC runs L-MCMC at most a constant number of times, and the computational complexity is in the same order of L-MCMC, i.e., . On the other hand, SLA-MCMC allows varying , which often improves the estimation quality.

### V-C Mixing

Donoho proved for the scalar channel setting that is sampled from the posterior  [33]. Seeing that the Gibbs sampler used by MCMC (cf. Section IV-B) generates random samples, and the outputs of our algorithm will be different if its random number generator is initialized with different random seeds, we speculate that running SLA-MCMC several times will also yield independent samples from the posterior, where we note that the runtime grows linearly in the number of times that we run SLA-MCMC. By mixing (averaging over) several outputs of SLA-MCMC, we obtain , which may have lower squared error w.r.t. the true than the average squared error obtained by a single SLA-MCMC output. Numerical results suggest that mixing indeed reduces the MSE (cf. Fig. 8); this observation suggests that mixing the outputs of multiple algorithms, including running a random reconstruction algorithm several times, may reduce the squared error.

## Vi Numerical results

In this section, we demonstrate that SLA-MCMC is comparable and in many cases better than existing algorithms in reconstruction quality, and that SLA-MCMC is applicable when . Additionally, some numerical evidence is provided to justify Conjecture 1 in Section III-C. Then, the advantage of SLA-MCMC in estimating low-complexity signals is demonstrated. Finally, we compare B-MCMC, L-MCMC, and SLA-MCMC performance.

We implemented SLA-MCMC in Matlab777A toolbox that runs the simulations in this paper is available at http://people.engr.ncsu.edu/dzbaron/software/UCS_BaronDuarte/ and tested it using several stationary ergodic sources. Except when noted, for each source, signals of length were generated. Each such was multiplied by a Gaussian random matrix with normalized columns and corrupted by i.i.d. Gaussian measurement noise . Except when noted, the number of measurements varied between 2000 and 7000. The noise variance was selected to ensure that the signal-to-noise ratio (SNR) was or  dB; SNR was defined as . According to Section IV-A, the context depth , where the base of the logarithm is the alphabet size; using typical values such as and , we have and set . While larger will slow down the algorithm, it might be necessary to increase when is larger. The number of super-iterations in different stages of SLA-MCMC and , the maximum total number of super-iterations to be , the initial number of levels , and the tuning parameter from Section V-B ; these parameters seem to work well on an extensive set of numerical experiments. SLA-MCMC was not given the true alphabet for any of the sources presented in this paper; our expectation is that it should adaptively adjust to match . The final estimate of each signal was obtained by averaging over the outputs of runs of SLA-MCMC, where in each run we initialized the random number generator with another random seed, cf. Section V-C. These choices of parameters seemed to provide a reasonable compromise between runtime and estimation quality.

We chose our performance metric as the mean signal-to-distortion ratio (MSDR) defined as . For each and SNR, the MSE was obtained after averaging over the squared errors of for 50 draws of , , and . We compared the performance of SLA-MCMC to that of (i) compressive sensing matching pursuit (CoSaMP) [52], a greedy method; (ii) gradient projection for sparse reconstruction (GPSR) [10], an optimization-based method; (iii) message passing approaches (for each source, we chose best-matched algorithms between EM-GM-AMP-MOS (EGAM for short)&