# Improving Pursuit Algorithms Using Stochastic Resonance

Dror Simon, Jeremias Sulam, Yaniv Romano, Yue M. Lu and Michael Elad

D. Simon, J. Sulam and M. Elad are with the Department of Computer Science, Technion, Israel. Y. Romano is with the Department of Electrical Engineering, Technion, Israel. Y. M. Lu is with the School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA.
###### Abstract

Sparse Representation Theory is a sub-field of signal processing that has led to cutting edge results in many applications such as denoising, deblurring, super resolution and many other inverse problems. Broadly speaking, this field puts forward a model that assumes that signals originate from a sparse representation in terms of an over-complete dictionary. Thus, when a corrupted measurement is given, we seek to estimate its original, clean form by finding the best-matched sparse representation of the given signal in the dictionary domain. This process is essentially a non-linear estimation solved by a pursuit or a sparse coding algorithm.

The concept of Stochastic Resonance (SR) refers to the counter-intuitive idea of improving algorithms’ performance by a deliberate noise contamination. In this work we develop novel techniques that apply SR for enhancement of the performance of known pursuit algorithms. We show that these methods provide an effective MMSE approximation and are capable of doing so for high-dimensional problems, for which no alternative exists.

Index Terms: Sparseland, Stochastic Resonance, Basis Pursuit, Orthogonal Matching Pursuit, noise-enhanced pursuit.

## I Introduction

In signal processing, we often have access to a corrupted signal and we wish to retrieve its clean version. This setting includes a wide variety of problems, such as denoising, where we wish to remove noise from a noisy signal; deblurring, where we look to sharpen an image that has been blurred or was taken out of focus; and inpainting, in which we fill in missing data that has been removed from the image. All the aforementioned tasks, and many others, include a linear degradation operator and a stochastic corruption, and as such they can be described by the relation $y = Hx + v$, where $x$ is the ideal signal, $H$ is the linear degradation operator, $v$ is the additive noise, and $y$ stands for the noisy measurements.

In order to successfully restore $x$, the Bayesian approach relies on some statistical properties of the corruption and a prior knowledge on the signal. When using a prior model assumption, we essentially estimate the original signal by obtaining a solution that is constrained by the model, while also being similar to the corrupted data. In image processing many such priors were developed over the years, among which one can mention Total-Variation, exploiting self-similarity, relying on sparsity, and many others [1, 2, 3]. The literature in image processing shows an evolution of models, all seeking to improve the performance on the inverse problems described above. In this work we focus our attention on the sparse model prior, as described next.

The Sparseland model assumes that a signal $x \in \mathbb{R}^n$ originates from an over-complete dictionary $D \in \mathbb{R}^{n \times m}$, where $m > n$, multiplying it by a representation vector $\alpha \in \mathbb{R}^m$, i.e. $x = D\alpha$. The vector $\alpha$ is sparse, meaning that the number of non-zeros in it is very small compared to the data dimension (the $\ell_0$ pseudo-norm $\|\alpha\|_0$ stands for a count of the number of non-zeros in the vector). This implies that $x$ is a linear combination of a small number of the columns from the dictionary $D$, called atoms. One of the most fundamental problems in Sparseland is termed "Atom-Decomposition": given $x$, our goal is to find the sparsest explanation $\alpha$ for it. Essentially this calls for solving the optimization problem:

 $(P_0):\quad \hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad D\alpha = x.$ (2)

In typical situations, we do not get access to the clean signal $x$, but rather to a version of it corrupted by some noise $v$ with bounded energy $\|v\|_2 \le \epsilon$. The measurements are modeled as $y = x + v$ and therefore the above problem is modified in order to take the noise into account, leading to the following task:

 $(P_0^{\epsilon}):\quad \hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad \|D\alpha - y\|_2 \le \epsilon.$ (3)

In the more general case, as described above, the degradation might include a corruption operator $H$. In these cases, the resulting measurements are modeled as $y = HD\alpha + v$, leading to a similar optimization problem in which the constraint is replaced by $\|HD\alpha - y\|_2 \le \epsilon$. Once the problem is solved, we can estimate the original signal simply by $\hat{x} = D\hat{\alpha}$.

The optimization above is non-convex, posing a hard problem to tackle. Indeed, in general this problem is NP-hard [4]. Nevertheless, approximation algorithms have been developed in order to manage this task effectively. One of these approximations, known as Basis Pursuit (BP) [5], is a relaxation method where the $\ell_0$ norm is replaced by an $\ell_1$. An alternative approach adopts a greedy strategy, such as in the case of the Orthogonal Matching Pursuit (OMP) [6], where the non-zeros in $\alpha$ are found one-by-one by minimizing the residual energy at each step.
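To make the greedy strategy concrete, here is a minimal OMP sketch in Python (NumPy). The dictionary size, the active atoms and the stopping rule (a fixed number of atoms) are illustrative assumptions, not the exact setup used in the paper's experiments.

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal Matching Pursuit sketch: greedily pick the atom most
    correlated with the residual, then least-squares fit on the support."""
    support = []
    residual = y.copy()
    alpha = np.zeros(D.shape[1])
    for _ in range(k):
        i = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if i not in support:
            support.append(i)
        # re-fit all coefficients on the current support
        coeffs = np.linalg.lstsq(D[:, support], y, rcond=None)[0]
        residual = y - D[:, support] @ coeffs
    alpha[support] = coeffs
    return alpha

# noiseless (P0)-style example: two active atoms are recovered exactly
rng = np.random.default_rng(0)
D = rng.standard_normal((100, 200))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
alpha_true = np.zeros(200)
alpha_true[[3, 17]] = [1.5, -2.0]
alpha_hat = omp(D, D @ alpha_true, k=2)
```

In this benign random setting the greedy selection finds the true support, after which the least-squares step recovers the coefficients exactly.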

These approximation algorithms have been accompanied by theoretical guarantees for finding a sparse representation leading to a bounded error [7]. These results rely on the cardinality of $\alpha$, the range of the non-zero values and properties of the dictionary $D$. In practice, under non-adversarial assumptions, these algorithms succeed with high probability even when the theoretical conditions of the worst case analysis are not met [8, 9].

Adopting a probabilistic point of view, as in [10], the fundamental purpose of these pursuit algorithms is to approximate the most likely support of the original signal, and therefore they can be interpreted as approximations of the Maximum a Posteriori (MAP) estimator under the sparsity prior. Clearly, the MAP is inferior to the Minimum Mean Square Error (MMSE) estimator in terms of mean square error, and therefore it is not the optimal choice for some problems such as denoising. When applying a Bayesian approach, the MMSE is composed of an impractical sum over all the possible supports

 $\hat{\alpha}_{MMSE} = E[\alpha|y] = \sum_{S \in \Omega} E[\alpha|y,S]\, P(S|y)$ (4)

and therefore it is usually avoided. As surprising as it might seem, the MMSE estimate is actually a dense vector.

In a previous work [11] an MMSE estimator approximation under the Sparseland model was suggested. The proposed Random-OMP (RandOMP) algorithm has been proven to coincide with the MMSE estimator in the case where the dictionary $D$ is unitary, and in the case where $D$ is over-complete and $\|\alpha\|_0 = 1$. RandOMP also improves the MSE results in general cases where the MMSE cannot be practically computed. In [12] a pursuit based on a Bayesian approach was suggested: the Fast Bayesian Matching Pursuit (FBMP) proposed a method to recover the most probable supports and approximate their posterior probabilities in order to achieve an MMSE estimator approximation. Both algorithms, RandOMP and FBMP, rely on a greedy search where the support is updated one coefficient at a time. For this reason, these algorithms are restricted to low dimensional problems. In this work we propose methods to approximate the practically unattainable MMSE estimator regardless of the pursuit used, therefore being applicable to large dimensions as well.

The term noise has the natural connotation of unwanted disturbance in signal processing. Usually, this is indeed the case, and the scientific community often tries to diminish its influence in almost every signal processing application that involves estimation [13]. Without invalidating the previous statement, noise has also been shown to be of great constructive value. Stochastic algorithms such as simulated annealing, genetic algorithms and image dithering rely on the properties of noise in order to succeed [14, 15, 5]. Stochastic Resonance (SR) is known as a phenomenon in which adding noise to a weak sub-threshold periodic signal increases its output Signal-to-Noise Ratio (SNR) when going through a threshold quantizer. This field has been further developed and has shown the ability to improve the performance of sub-optimal detectors [16], non-linear parametric estimators [1] and some image processing algorithms [17]. A well-known application that uses noise in order to improve a system's response is dithering.

SR has been used in the past in order to improve sub-optimal non-linear systems’ performance. As we saw throughout this section, pursuit algorithms form MAP estimator approximations. Could we consider a pursuit algorithm as a sub-optimal non-linear system whose performance can be improved by SR? In this work we intend to establish a novel MMSE approximation for the Sparseland model by integrating SR with known pursuit algorithms. More specifically, in our work we will seek answers to the following questions:

1. Can noise be employed to improve the performance of sparse coding algorithms?

2. How significant can this improvement be? Can we use noise enhancement to achieve MMSE approximation in the special unitary dictionary case?

3. Can we apply noise enhancement to approximate the MMSE of the denoising problem in a more general case?

In this paper we address these and closely related questions, showing how SR could be of great benefit for pursuit algorithms, and indeed provide an MMSE approximation. Section II reviews Bayesian estimation in Sparseland and gives an initial intuition to SR in Sparseland. In Section III we introduce SR pursuits in the special case where the dictionary is unitary, and Section IV refers to the general dictionary case. Then, in Section V we discuss two extensions regarding SR and the sparse model: one regarding the noise that should be used and another linking SR to Monte Carlo methods. Section VI demonstrates a practical usage of our proposed algorithm for image reconstruction and image denoising and finally, in Section VII we conclude this work.

## II Bayesian Estimation in Sparseland

Before we dive into the estimators themselves, we first introduce the generative model assumed. In this work we lean on the model introduced in [11], described as follows. $D \in \mathbb{R}^{n \times m}$ is an over-complete dictionary and $\alpha$ a sparse vector with either a known number of non-zeros $k$ or some prior probability $P_i$ for each atom to be chosen (we will alternate between the two along this work). The non-zeros themselves, noted as $\alpha_S$, are drawn from a Gaussian distribution $\mathcal{N}(0, \sigma_\alpha^2 I)$. Using the dictionary and sparse representation, we create a signal $x = D\alpha$. The received samples are its noisy measurements $y = x + v$, where $v$ is a random Gaussian noise, i.e. $v \sim \mathcal{N}(0, \sigma_\nu^2 I)$. Unless stated otherwise, throughout the following sections we assume the described model, and our goal is to acquire an estimation $\hat{x}$ of the original signal $x$. We shall now describe the estimators under the described model as developed in [10].

The first estimator introduced is the Oracle estimator, which seeks to estimate the representation given the (oracle) information of its support. The task of retrieving the true support of the original sparse representation is the essence of the problem. If the support $S$ is known, then the MMSE estimator is simply the conditional expectation $E[\alpha|y,S]$. From [10], this expectation has the following form:

 $\hat{\alpha}^{Oracle}_{S,y} = \frac{1}{\sigma_\nu^2} Q_S^{-1} D_S^T y,$ (5)

where $D_S$ is a sub-dictionary containing only the columns (atoms) in the given support $S$, and $Q_S$ is:

 $Q_S = \frac{1}{\sigma_\alpha^2} I_{|S|} + \frac{1}{\sigma_\nu^2} D_S^T D_S,$ (6)

originating from the Wiener filter solution. We refer to this estimator as the Oracle as there is no possible way of knowing the true support beforehand.

The second estimator is the MAP. In this case we look for the most probable support given our measurements and use it in order to estimate the signal (strictly speaking, this is the MAP of the support; it is used in order to avoid the very probable case where the recovered signal is the zero vector, as described in [10]):

 $\hat{S} = \arg\max_S P(S|y) = \arg\max_S P(y|S)P(S) = \arg\max_S \frac{1}{2}\left\|\frac{1}{\sigma_\nu^2} Q_S^{-\frac{1}{2}} D_S^T y\right\|_2^2 - \frac{1}{2}\log(\det(C_S)) + \sum_{i \in S}\log(P_i) + \sum_{j \notin S}\log(1-P_j),$ (7)

where $C_S = \sigma_\nu^2 I + \sigma_\alpha^2 D_S D_S^T$ is the covariance matrix of $y$ given the support $S$. In the case where the number of non-zeros is known to be a constant $k$ and all supports are equally likely, we can omit the last two sums, since they indicate the prior of the support, which is uniformly distributed. As described, the final estimator is:

 $\hat{\alpha}^{MAP}_y = \hat{\alpha}^{Oracle}_{\hat{S}_{MAP},\, y}.$ (8)

The last estimator is the MMSE. A well known result from estimation theory is that the MMSE estimator is given by the conditional expectation:

 $\hat{\alpha}^{MMSE}_y = E[\alpha|y] = \sum_S P(S|y)\, E[\alpha|y,S] = \sum_S P(S|y)\, \hat{\alpha}^{Oracle}_{S,y}.$ (9)

This is a weighted sum over all the possible supports. The probability $P(S|y)$ is given in [10]:

 $P(S|y) = \frac{t_S}{t}, \quad t \triangleq \sum_S t_S, \quad t_S \triangleq \frac{1}{\sqrt{\det(C_S)}} \exp\left\{-\frac{1}{2} y^T C_S^{-1} y\right\} \prod_{i \in S} P_i \prod_{i \notin S} (1 - P_i).$

Note that both estimators, the MMSE and the MAP, are generally NP-hard, since they require either to sum over all the possible supports or to compute all the posterior probabilities and pick the highest one. Also, from the formulation given above, the essence of both estimators is the posterior probability $P(S|y)$. The better an algorithm estimates and leverages these probabilities, the better the original signal's estimation will be. All of the described estimators are for the representation vector $\alpha$. In order to achieve estimations for the signal $x$ we simply multiply by $D$, due to their linear relation. Given these estimators and the model, we now begin our journey for pursuit improvement, starting with the unitary case.

As described in the Introduction, in this paper we use Stochastic Resonance in order to achieve an MMSE approximation using many MAP approximations. The procedure is described in Algorithm 1. This algorithm simply adds a small amount of noise to the already noisy signal, applies any given pursuit (which is simply a MAP approximation), and finally averages all the evaluations into a single final estimate. Note that two forms of estimation are given. The first, the "non-subtractive", uses the extra-noisy evaluations as the basic estimators and averages them. The second, the "subtractive", uses only the supports recovered by the noisy evaluations in order to achieve an estimation that is not affected by the added noise, and finally averages these out to achieve a final single estimate.
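Algorithm 1 can be sketched as follows. Here `hard_threshold_pursuit` stands in for any MAP approximation, the least-squares re-fit in the subtractive branch is a simplification of the oracle formula of Section II, and all parameter values are illustrative assumptions.

```python
import numpy as np

def hard_threshold_pursuit(D, y, lam=0.5):
    """A toy MAP approximation: keep projected entries above a threshold."""
    beta = D.T @ y
    return np.where(np.abs(beta) >= lam, beta, 0.0)

def sr_pursuit(y, D, pursuit, K=100, sigma_n=0.2, subtractive=True, seed=0):
    """Algorithm 1 sketch: perturb y with K noise realizations, run the
    given pursuit on each noisy copy, and average the K estimates."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(D.shape[1])
    for _ in range(K):
        y_k = y + sigma_n * rng.standard_normal(y.shape)  # deliberate SR noise
        alpha_k = pursuit(D, y_k)                         # one MAP approximation
        if subtractive:
            # subtractive variant: keep only the recovered support and
            # re-estimate the coefficients from the original y
            S = np.flatnonzero(alpha_k)
            alpha_k = np.zeros(D.shape[1])
            if S.size:
                alpha_k[S] = np.linalg.lstsq(D[:, S], y, rcond=None)[0]
        acc += alpha_k
    return acc / K
```

Averaging the per-realization sparse estimates produces a dense vector, mimicking the structure of the MMSE in (4).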

Before we dive into the mathematical justifications given in the following sections, we first introduce the results of this algorithm using a simple synthetic example. In this experiment we generated random signals of length 50 with one non-zero (to make the MAP and MMSE attainable). Each non-zero is a Gaussian random variable, and the measurements are contaminated with additive white Gaussian noise. We used OMP as a basic MAP approximation, and the reader can see the MSE results in Figure 1(a). The improvement in performance is clearly seen. We repeated this experiment for a varying amount of added noise and tested the optimal SR MSE vs. the MMSE; the results can be seen in Figure 1(b). Finally, in Figure 1(c) we show the optimal amount of SR added noise for different input noise energy values.

## III Improving Pursuits in the Unitary Case

### III-A The Unitary Sparse Estimators

Continuing the Bayesian analysis of the sparse prior, we can specialize and simplify the expressions associated with the Oracle, MMSE and MAP estimators. In [10] the authors mention the special case where the dictionary is a unitary matrix. In this case the dictionary is no longer over-complete and, due to its special properties, the described estimators can be simplified. The MAP estimator reduces to the elementwise Hard Thresholding operator applied on the projected measurements $\beta = D^T y$, given by:

 $\hat{\alpha}^{MAP}(\beta) = H_{\lambda_{MAP}}(\beta) = \begin{cases} c^2 \beta & \text{if } |\beta| \ge \lambda_{MAP}, \\ 0 & \text{else,} \end{cases}$ (10)

where we denote $c^2 = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\nu^2}$, and $\beta_i$ and $\alpha_i$ are the elements of the vectors $\beta$ and $\alpha$.

The MMSE estimator in the unitary case is a simple elementwise shrinkage operator of the following form:

 $\hat{\alpha}_i^{MMSE} = \psi(\beta_i) = \frac{\exp\left(\frac{c^2}{2\sigma_\nu^2}\beta_i^2\right)\frac{p_i}{1-p_i}\sqrt{1-c^2}}{1 + \exp\left(\frac{c^2}{2\sigma_\nu^2}\beta_i^2\right)\frac{p_i}{1-p_i}\sqrt{1-c^2}}\, c^2 \beta_i.$ (11)

Note that this shrinkage operator does not enforce a sparse vector, just as in the general case. The above scalar operators are also extended to act on vectors in an entry-wise manner.
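A small sketch of the two unitary-case operators. Computing the MMSE weight through a numerically stable sigmoid of the log-ratio is an implementation choice, not part of the derivation, and all parameter values are illustrative.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic sigmoid

def map_shrink(beta, c2, lam):
    """Unitary-case MAP (10): elementwise hard thresholding plus scaling."""
    return np.where(np.abs(beta) >= lam, c2 * beta, 0.0)

def mmse_shrink(beta, c2, sigma_nu, p):
    """Unitary-case MMSE (11): smooth shrinkage. The factor expit(log r)
    equals r/(1+r), the posterior probability that the entry is active."""
    log_r = (np.log(p / (1.0 - p)) + 0.5 * np.log(1.0 - c2)
             + c2 * beta**2 / (2.0 * sigma_nu**2))
    return expit(log_r) * c2 * beta

beta = np.array([-3.0, -0.2, 0.0, 0.2, 3.0])
c2, lam, sigma_nu, p = 0.8, 1.0, 0.5, 0.1
hard = map_shrink(beta, c2, lam)             # sparse output
smooth = mmse_shrink(beta, c2, sigma_nu, p)  # dense output
```

For large projections both operators agree (the posterior weight saturates at 1), while for small projections the MMSE shrinks smoothly instead of snapping to zero.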

### III-B The Unitary SR Estimators

In the previous section we saw that the MMSE is a weighted sum of all the probable solutions, where the weights are the posterior probabilities of each support. Similarly, we propose to approximate the MMSE by summing many probable solutions, by the following procedure. First, we add white zero mean Gaussian noise $n_k \sim \mathcal{N}(0, \sigma_n^2 I)$ to the signal $y$. Note that since $D$ is unitary, it does not matter whether the noise is added to $y$ or to its projection $\beta = D^T y$. Then, we pass it through the MAP estimator, resulting in an estimation of the original signal. This process is repeated many times, each time with a different noise realization. The final step is an arithmetic mean over all the estimations (notice that the written equation operates on the vectors element-wise):

 $\hat{\alpha}_{stochastic} = \frac{1}{K}\sum_{k=1}^{K} \hat{\alpha}_k = \frac{1}{K}\sum_{k=1}^{K} H_{\lambda}(\beta + n_k).$ (12)

The described process is an empirical arithmetic average approximating the expected value:

 $E_n[H_{\lambda}(\beta+n)] = \int_{-\infty}^{\infty} H_{\lambda}(\beta+n)\, p(n)\, dn = c^2\left(\beta\left[Q\!\left(\frac{\lambda-\beta}{\sigma_n}\right) + Q\!\left(\frac{\lambda+\beta}{\sigma_n}\right)\right] + \sigma_n\left[\phi\!\left(\frac{\lambda-\beta}{\sigma_n}\right) - \phi\!\left(\frac{\lambda+\beta}{\sigma_n}\right)\right]\right),$ (13)

where we have denoted $Q$ as the tail probability of the standard normal distribution, i.e. $Q(x) \triangleq \int_x^{\infty} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt$, and $\phi$ as its density. The full derivation can be found in Appendix A. We shall define the above estimator as the non-subtractive SR estimator.

The term non-subtractive is used here to note that each estimation is still contaminated with the noise that was added in the process. Conversely, one might consider the subtractive estimator, in which we remove the projection of the added noise, resulting in the following shrinkage operator:

 $\hat{\alpha}_k(\beta, n_k) = H^{-}(\beta, n_k), \quad H^{-}(\beta, n_k) = \begin{cases} c^2(\beta+n_k) - c^2 n_k & \text{if } |\beta+n_k| \ge \lambda_{MAP}, \\ 0 & \text{else} \end{cases} = \begin{cases} c^2\beta & \text{if } |\beta+n_k| \ge \lambda_{MAP}, \\ 0 & \text{else.} \end{cases}$

Using this shrinkage operator and following the same process described above, we end up with the following estimator:

 $E_n[H^{-}(\beta, n)] = \int_{-\infty}^{\infty} H^{-}(\beta, n)\, p(n)\, dn = c^2\beta\left[Q\!\left(\frac{\lambda+\beta}{\sigma_n}\right) + Q\!\left(\frac{\lambda-\beta}{\sigma_n}\right)\right].$

Again, the full derivation can be found in Appendix A.
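The closed-form expectation of the subtractive estimator is easy to verify by simulation; the following sketch compares it against a Monte Carlo average (all parameter values are illustrative assumptions).

```python
import numpy as np
from scipy.stats import norm

def subtractive_expectation(beta, c2, lam, sigma_n):
    """Closed form: c^2 * beta * [Q((lam+beta)/s) + Q((lam-beta)/s)]."""
    Q = norm.sf  # tail probability of the standard normal
    return c2 * beta * (Q((lam + beta) / sigma_n) + Q((lam - beta) / sigma_n))

# Monte Carlo average of the subtractive estimator itself:
# it outputs c^2 * beta whenever the perturbed projection crosses lambda
rng = np.random.default_rng(0)
beta, c2, lam, sigma_n = 0.8, 0.9, 1.0, 0.5
n = sigma_n * rng.standard_normal(2_000_000)
mc = np.mean(np.where(np.abs(beta + n) >= lam, c2 * beta, 0.0))
```

With two million realizations the empirical average matches the closed form to well within sampling error.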

Notice that the described estimators have two parameters yet to be set: $\sigma_n$ and $\lambda$. The former tunes the magnitude of the added noise, while the latter controls the value of the thresholding operation. Note that the original MAP threshold might be sub-optimal due to the added noise and therefore we leave $\lambda$ as a parameter to be set. We will explore how to set these parameters later in this section.

### III-C Unitary SR Estimation Results

In order to demonstrate the similarity of the proposed estimators to the MMSE, we can simply compare their shrinkage curves, as seen in Figure 2. One can see that, while the curves do not overlap completely, for the right choice of parameters ($\sigma_n$ and $\lambda$) the curves are indeed close to each other. In terms of MSE, Figure 3 shows the results of these methods as a function of $\sigma_n$ ($\lambda$ is fixed to the optimal value). It seems that the subtractive method is slightly closer. We now discuss how to set the parameters in order to reach these optimal results.

### III-D Finding the Optimal Parameters for the Unitary Case

In the cases where the dictionary is known but other parameters such as $\sigma_\alpha$ or $P_i$ are not, the MMSE estimator cannot be computed. In such cases, we may try to estimate the MMSE by using SR. Since our added noise is Gaussian, we can use Stein's Unbiased Risk Estimate (SURE) [18], which measures an estimator's MSE up to a constant, in order to optimize the free parameters $\sigma_n$ and $\lambda$. The SURE formulation is:

 $\mu(H(\beta,\lambda,\sigma_n), \beta) = \|H(\beta,\lambda,\sigma_n)\|_2^2 - 2 H(\beta,\lambda,\sigma_n)^T \beta + 2\sigma_\nu^2\, \nabla \cdot H(\beta,\lambda,\sigma_n).$ (14)

In the unitary case this is further simplified to an element-wise sum:

 $\mu(H(\beta,\lambda,\sigma_n), \beta) = \sum_i \mu(H(\beta_i,\lambda,\sigma_n), \beta_i) = \sum_i H(\beta_i,\lambda,\sigma_n)^2 - 2 H(\beta_i,\lambda,\sigma_n)\beta_i + 2\sigma_\nu^2 \frac{d}{d\beta_i} H(\beta_i,\lambda,\sigma_n),$ (15)

and we wish to optimize $\sigma_n$ and $\lambda$:

 $\sigma_n, \lambda = \arg\min_{\sigma_n, \lambda}\ \mu(H(\beta,\lambda,\sigma_n), \beta).$ (16)

Plugging in the subtractive estimator results in a closed form expression, as can be seen in Appendix B. Also in Appendix B we show the SURE surface for a specific experiment. Interestingly, we note that empirically the obtained optimal $\lambda$ is quite close to the $\lambda_{MAP}$ suggested by the MAP estimator.
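A possible implementation of this parameter search. It uses the expected subtractive shrinkage and its derivative inside the SURE expression of (15); the grid ranges and the assumption $\sigma_\alpha = 1$ are illustrative, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

def sr_shrink(beta, c2, lam, sigma_n):
    """Expected subtractive SR shrinkage (closed form, unitary case)."""
    Qp = norm.sf((lam + beta) / sigma_n)
    Qm = norm.sf((lam - beta) / sigma_n)
    return c2 * beta * (Qp + Qm)

def sr_shrink_deriv(beta, c2, lam, sigma_n):
    """Derivative of sr_shrink w.r.t. beta, needed for the SURE term."""
    Qp = norm.sf((lam + beta) / sigma_n)
    Qm = norm.sf((lam - beta) / sigma_n)
    phip = norm.pdf((lam + beta) / sigma_n)
    phim = norm.pdf((lam - beta) / sigma_n)
    return c2 * (Qp + Qm) + c2 * beta * (phim - phip) / sigma_n

def sure(beta, sigma_nu, c2, lam, sigma_n):
    """SURE (15), up to a constant independent of the parameters."""
    h = sr_shrink(beta, c2, lam, sigma_n)
    return np.sum(h**2 - 2.0 * h * beta
                  + 2.0 * sigma_nu**2 * sr_shrink_deriv(beta, c2, lam, sigma_n))

def pick_parameters(beta, sigma_nu, c2, lams, sigma_ns):
    """Grid search for the (sigma_n, lambda) pair minimizing SURE."""
    grid = [(s, l) for s in sigma_ns for l in lams]
    scores = [sure(beta, sigma_nu, c2, l, s) for s, l in grid]
    return grid[int(np.argmin(scores))]

# toy data from the unitary model (sigma_alpha assumed to be 1)
rng = np.random.default_rng(0)
alpha = np.where(rng.random(500) < 0.1, rng.standard_normal(500), 0.0)
sigma_nu = 0.4
beta = alpha + sigma_nu * rng.standard_normal(500)
c2 = 1.0 / (1.0 + sigma_nu**2)
best = pick_parameters(beta, sigma_nu, c2,
                       lams=np.linspace(0.5, 2.0, 7),
                       sigma_ns=np.linspace(0.05, 1.0, 7))
```

Because SURE only requires the observed $\beta$ and $\sigma_\nu$, this search runs without knowing the true $\alpha$.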

## IV Improving Pursuits in the General Case

The unitary case is a good testing environment for new ideas, since it is relatively simple and easy to analyze. That being said, SR in the unitary case is of limited practical importance, since the MMSE there has a closed form solution in the form of a shrinkage curve, and therefore an approximation made of many sparse-coded estimators is not needed. Also, most applications of sparse representation theory use over-complete dictionaries.

In order to provide an analysis for the general case some assumptions should be made. We start by analyzing the single-atom case.

### IV-A Single-Atom Analysis

In this section we analyze the over-complete case under the assumption that the cardinality of the sparse representation is known to be 1, i.e. $\|\alpha\|_0 = 1$. From [11] we have that in this case, the MAP estimator described in the Introduction boils down to the following form:

 $\hat{x}(y) = c^2 y_S, \quad y_S = d_S d_S^T y, \quad c^2 = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\nu^2},$ (17)

where the chosen atom is:

 $d_S = \arg\min_{d_S} \|y_S - y\|_2^2 = \arg\min_{d_S} \|d_S d_S^T y - y\|_2^2 = \arg\max_{d_S} y^T d_S d_S^T y.$ (18)

Following the subtractive concept we proposed in the unitary case, we introduce the following SR estimator:

 $\hat{x}(y) = c^2 y_S, \quad y_S = d_S d_S^T y,$ (19)

where this time the choice of the atom is affected by an additional additive SR noise:

 $d_S = \arg\min_{d_S} \|y_S(n) - y(n)\|_2^2 = \arg\max_{d_S} (y+n)^T d_S d_S^T (y+n).$ (20)

Employing this as a pursuit to be used in Algorithm 1 we now analyze the proposed asymptotic estimator:

 $E_n[\hat{x}(y,n)] = E_n[c^2 y_S(n)] = E_S\left[E_{n|S}\left[c^2 y_S \mid S\right]\right] = c^2 \sum_{i=1}^{m} E_{n|S}\left[d_i d_i^T y\right] P(\hat{S}=i) = c^2 \sum_{i=1}^{m} d_i d_i^T y\, P(\hat{S}=i).$ (21)

Similar to the MMSE in the general case (9), we have the sum of the solutions under all the possible supports (all supports with a single atom in this case), and each support is weighted by its probability of being chosen. We now turn to analyze this probability. As stated in Equation (18), the chosen atom is the one most correlated with the input signal:

 $P(\hat{S}=i) = P\left(|d_i^T(y+n)| > \max_{j \ne i} |d_j^T(y+n)|\right) = P\left(|\tilde{n}_i| > \max_{j \ne i} |\tilde{n}_j|\right),$ (22)

where we defined $\tilde{n}$ as a random Gaussian vector such that:

 $\tilde{n} = \left[\tilde{n}_1, \tilde{n}_2, \ldots, \tilde{n}_m\right]^T \sim \mathcal{N}\left(D^T y,\ \sigma_n^2 D^T D\right).$ (23)

We have that the probability of choosing the $i$-th atom is distributed as the probability of the corresponding entry attaining the maximum absolute value of a random Gaussian vector with correlated variables. The vector's variables are correlated since, in the non-unitary case, $D^T D$ is not a diagonal matrix.
Facing this dilemma, we can tackle it in several directions:

1. Instead of adding the Gaussian noise to the signal $y$, we can add it to the projected signal $D^T y$, thus avoiding the correlation between the variables.

2. We can add some assumptions on the prior properties of the dictionary $D$, thus deriving average case conclusions.

3. We can change the pursuit to use a constant threshold for the choice of the support, instead of comparing the correlated variables. This means that when applying the algorithm, a cardinality of 0 or larger than 1 might enter the averaging process. This is clearly a sub-optimal choice and we leave the study of this option for future work.

We shall now analyze the first two proposed alternatives.

#### IV-A1 Adding Noise to the Representation

Under this assumption, we continue from Equation (22), only this time the added noise is white and $\tilde{n}$ has the following properties:

 $\tilde{n} \sim \mathcal{N}\left(D^T y,\ \sigma_n^2 I_{m \times m}\right).$ (24)

Plugging this into (22):

 $P(\hat{S}=i) = P\left(|d_i^T y + n_i| > \max_{j \ne i} |d_j^T y + n_j|\right) = P\left(|\tilde{n}_i| > \max_{j \ne i} |\tilde{n}_j|\right) = \int_0^{\infty} P\left(\max_{j \ne i} |\tilde{n}_j| < t\right) P\left(|\tilde{n}_i| = t\right) dt.$

Starting with the first term in the above product, since these are independent variables, we have:

 $P\left(\max_{j \ne i} |\tilde{n}_j| < t\right) = \prod_{j \ne i} P(|\tilde{n}_j| < t) = \prod_{j \ne i}\left[1 - \left(Q\!\left(\frac{t+\beta_j}{\sigma_n}\right) + Q\!\left(\frac{t-\beta_j}{\sigma_n}\right)\right)\right],$ (25)

where the last equality follows similar steps as in Appendix A, with mean $\beta_j = d_j^T y$ and standard deviation $\sigma_n$. The second term in the integrand is simply the density of the absolute value of a Gaussian, therefore:

 $P(|\tilde{n}_i| = t) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\left(e^{-\frac{(t-\beta_i)^2}{2\sigma_n^2}} + e^{-\frac{(t+\beta_i)^2}{2\sigma_n^2}}\right).$

Putting the two terms back into the integral:

 $P(\hat{S}=i) = \int_0^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_n}\left[e^{-\frac{(t+\beta_i)^2}{2\sigma_n^2}} + e^{-\frac{(t-\beta_i)^2}{2\sigma_n^2}}\right] \cdot \prod_{j \ne i}\left[1 - \left(Q\!\left(\frac{t-\beta_j}{\sigma_n}\right) + Q\!\left(\frac{t+\beta_j}{\sigma_n}\right)\right)\right] dt.$ (26)

The obtained expression cannot be solved analytically but can be numerically calculated. In Figure 4 we present a simulation for the single atom case where the noise is added to the projection $D^T y$. In Figure 4(a) we can observe that indeed, even in the non-unitary case, SR with the optimal choice of $\sigma_n$ applied on the representation approximates the MMSE quite well.
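The numerical evaluation of (26) can be sketched as follows, together with a Monte Carlo check against an actual argmax "pursuit"; the vector of projections `beta` and the noise level are toy values, not the paper's experimental setup.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

def support_probability(i, beta, sigma_n, num=4000):
    """Numerically evaluate P(S_hat = i): the chance that entry i of
    n_tilde ~ N(beta, sigma_n^2 I) has the largest absolute value."""
    t = np.linspace(0.0, np.max(np.abs(beta)) + 8 * sigma_n, num)
    # density of |n_tilde_i| at t
    dens = (norm.pdf((t - beta[i]) / sigma_n)
            + norm.pdf((t + beta[i]) / sigma_n)) / sigma_n
    # P(max_{j != i} |n_tilde_j| < t): product over independent entries
    cdf = np.ones_like(t)
    for j in range(len(beta)):
        if j != i:
            cdf *= (norm.cdf((t - beta[j]) / sigma_n)
                    - norm.cdf((-t - beta[j]) / sigma_n))
    return trapezoid(dens * cdf, t)

beta = np.array([2.0, 0.5, -0.3, 0.1])   # toy projections d_j^T y
sigma_n = 0.7
ana = np.array([support_probability(i, beta, sigma_n)
                for i in range(beta.size)])

# Monte Carlo "pursuit": pick the entry with maximal absolute value
rng = np.random.default_rng(0)
samples = beta + sigma_n * rng.standard_normal((200_000, beta.size))
emp = np.bincount(np.argmax(np.abs(samples), axis=1),
                  minlength=beta.size) / 200_000
```

The analytic probabilities sum to one and match the empirical selection frequencies, mirroring the agreement between the derived expression and the pursuit simulations reported in the text.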

In Figure 4(b) we show the probability of recovering the true support as a function of $\sigma_n$, both from (26) and from actual pursuit simulations. We also compare it to the MMSE weight from (9), and mark the optimal MSE as a black line. We can see that the derived expression agrees with the simulations. In addition, we notice that the optimal $\sigma_n$ in terms of MSE also matches the probability of the true support to the weight given in the MMSE solution. In other words, the optimal $\sigma_n$ is the one that brings the weight of the support closest to the weight given by the MMSE expression. The trend in (26) shows, unsurprisingly, that as we add noise, the probability of successfully recovering the true support decreases. In the limit $\sigma_n \to \infty$, the signal will be dominated by the noise and the success probability will be uniform among all the atoms, i.e. equal to $1/m$.

Due to the fact that the MSE is almost optimal when the probability of the true support matches its MMSE weight, one might expect a similar behavior for all the other possible supports. To examine this, we conduct the following experiment. We draw a random index and generate many signals with the non-zero in that same location. Then, we plot the histogram of the average empirical probability (obtained by pursuit) of each element in the vector to be non-zero, and compare these probabilities to those of the MMSE. We run this experiment for different values of $\sigma_n$, each time comparing the entire support histogram. We expect the two histograms (SR and MMSE) to fit for the right choice of added noise $\sigma_n$. In Figure 5 we see the results of the described experiment.

Analyzing the results of this experiment, we notice that when no noise is added (this is actually the average case of the MAP estimator), most of the elements (apart from the true support element) have a much lower weight than the MMSE. As noise is added, the true support's probability decreases and its weight is divided among the other elements. At some point the two histograms almost match each other; this is the point where the SR MSE almost equals that of the MMSE. As we add more noise, the true support's probability keeps decreasing, the other elements keep increasing, and the histograms drift farther apart from each other. As $\sigma_n \to \infty$ we obtain a uniform probability over all the supports.

In Figure 6 we show on the left axis the distance (Kullback-Leibler divergence) between the two histograms, and on the right axis the MSE. We see, as expected, that when the histograms are close, the MSE is minimal.

We have shown empirically that SR approximates the MMSE well in the one-atom case, also when the dictionary is not unitary. We now need to find a way of estimating a proper $\sigma_n$. Using SURE is possible but seems impractical due to the complexity of the estimator.

#### IV-A2 Prior Assumptions on the Dictionary D

In this section we will try to simplify the expression in (22) by adding assumptions regarding the dictionary $D$. We will now show that if we assume that the columns of the dictionary, i.e. the atoms, are statistically uncorrelated, then adding noise in the signal domain is the same as adding it to the representation. Formally, our assumption is that the atoms are drawn from some random distribution that obeys the following properties:

 $E\left[d_i^T d_j\right] = 0, \quad \forall i \ne j,\ 1 \le i,j \le m,$ (27)

and of course that the atoms are normalized:

 $\|d_i\|_2 = 1, \quad \forall i,\ 1 \le i \le m.$ (28)

We now look to analyze the properties of the random vector $\tilde{n}$ in (22):

 $\tilde{n} = D^T(y+n).$ (29)

First we observe that given the dictionary $D$, each of the elements in this vector is asymptotically a Gaussian variable:

 $\tilde{n}_i | d_i = d_i^T(y+n) = \sum_{k=1}^{n} d_{i,k}(y_k + n_k) = \sum_{k=1}^{n} d_{i,k} y_k + \sum_{k=1}^{n} d_{i,k} n_k = \mu_i + \sum_{k=1}^{n} d_{i,k} n_k.$ (30)

Given the measurements and the dictionary, the first sum is a constant $\mu_i = d_i^T y$. The second term in the expression is a weighted sum of iid random variables $n_k \sim \mathcal{N}(0, \sigma_n^2)$. Therefore, using the Central Limit Theorem, and the fact that $\|d_i\|_2 = 1$, asymptotically for a large dimension $n$ this is a Gaussian variable. It is easy to see that its mean value is 0 and its standard deviation is $\sigma_n \|d_i\|_2 = \sigma_n$, hence $\tilde{n}_i | d_i \sim \mathcal{N}(\mu_i, \sigma_n^2)$ for $n \to \infty$.

Now we turn to analyze the properties of the entire vector $\tilde{n}$. From the previous analysis we know that given the dictionary $D$, it is a random Gaussian vector with the mean vector $D^T y$. Using the properties of the noise $n$, the auto-correlation matrix of $\tilde{n}$ is by definition:

 $\Sigma | D = E\left[D^T n n^T D \mid D\right] = D^T E\left[n n^T\right] D = \sigma_n^2 D^T D.$ (31)

Analyzing the average case, the mean vector is of the form:

 $\mu_{\tilde{n}} = E_D\left[D^T y\right],$ (32)

and the auto-correlation matrix is simply diagonal:

 $\Sigma = E_D\left[\Sigma | D\right] = E\left[\sigma_n^2 D^T D\right] = \sigma_n^2 I,$

where we used the assumptions in (27) and (28). This means that the uncorrelated-atoms assumption leads $\tilde{n}$ to have the same properties as seen in the previous section, and therefore the analysis is the same.

To show that practically the two are the same, we propose the following experiment. We sample a random dictionary $D$ and many random sparse representations with a cardinality of 1, as the generative model described earlier suggests. Using the generated vectors and dictionary, we create signals simply by multiplying $x = D\alpha$ and adding noise $v$. To denoise the signals, we run the stochastic resonance algorithm once with noise added to the signal vectors $y$, and once with noise added in the representation domain $D^T y$. Observe that due to the cardinality of the sparse representation, a simple Hard Thresholding is the MAP estimator, and thus we use it as our non-linear estimator. In Figure 7 we see that the MSE of the two cases results in an almost identical curve. Small differences might exist due to the finite dimensions used in the experiment.
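A sketch of this experiment follows. The dimensions, noise levels and the plain projection estimator (without the $c^2$ Wiener factor) are simplifying assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim, m = 64, 128
D = rng.standard_normal((n_dim, m))
D /= np.linalg.norm(D, axis=0)           # unit-norm, roughly uncorrelated atoms

def sr_one_atom(y, D, sigma_n, K, in_signal_domain, rng):
    """One-atom subtractive SR: choose the atom maximizing the perturbed
    correlation, project the original y on it, and average over K trials."""
    est = np.zeros(D.shape[0])
    for _ in range(K):
        if in_signal_domain:
            beta = D.T @ (y + sigma_n * rng.standard_normal(D.shape[0]))
        else:
            beta = D.T @ y + sigma_n * rng.standard_normal(D.shape[1])
        i = int(np.argmax(np.abs(beta)))
        est += D[:, i] * (D[:, i] @ y)   # projection onto the chosen atom
    return est / K

sigma_nu, sigma_n, K, trials = 0.3, 0.3, 50, 200
mse = np.zeros(2)                        # [representation domain, signal domain]
for _ in range(trials):
    i = rng.integers(m)
    x = rng.standard_normal() * D[:, i]  # one-atom signal
    y = x + sigma_nu * rng.standard_normal(n_dim)
    for mode in (0, 1):
        xh = sr_one_atom(y, D, sigma_n, K, in_signal_domain=bool(mode), rng=rng)
        mse[mode] += np.sum((xh - x) ** 2)
mse /= trials
```

As the analysis predicts, the two noise-injection strategies yield essentially the same denoising MSE in this setting.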

Note that the total noise energy added in the representation domain is much larger than that of the noise added to the signal ($m\sigma_n^2$ versus $n\sigma_n^2$), but the results remain the same due to the unit norm of the dictionary atoms and, of course, the uncorrelated-atoms assumption.

To conclude this section: under the statistically uncorrelated atoms assumption, we have shown that adding noise in the signal domain $y$ asymptotically converges to the analysis addressed in the previous section, in which the noise was added in the representation domain $D^T y$. Therefore, under this assumption the previous section's analysis holds, and its results and conclusions can be inferred in this case as well.

### IV-B Multiple Atoms

We shall now show that the application of Algorithm 1 in the general case of an over-complete dictionary and many non-zeros results in superior denoising performance over the classic pursuit. We also show that this algorithm can be used not only with $\ell_0$ pseudo-norm algorithms (OMP), but with $\ell_1$-norm algorithms (Basis Pursuit) as well. Unlike previous MMSE approximations under the Sparseland model, such as the Random-OMP, this is the first time an approximation using the $\ell_1$ norm is addressed. We use a random Gaussian dictionary and generate random Gaussian coefficients with a random number of non-zeros at random locations, with an average sparsity of 5%. As in the previous experiments, we add Gaussian noise in the signal domain and use BP and OMP to denoise the signals. Since the number of non-zeros is unknown, we use the bounded-noise formulation of the pursuit algorithms, i.e.

 $(\text{OMP}):\ \min_{\alpha} \|\alpha\|_0 \ \text{s.t.}\ \|y - D\alpha\|_2 \le \epsilon, \qquad (\text{BP}):\ \min_{\alpha} \|\alpha\|_1 \ \text{s.t.}\ \|y - D\alpha\|_2 \le \epsilon.$

The results of the described experiment can be seen in Figure 8. We acknowledge that although BP is in general inferior to OMP in terms of MSE, the SR method improves both algorithms' performance, indicating the indifference of this concept to the pursuit algorithm used. We should mention that even though OMP achieves better MSE performance, BP converges much faster as the number of non-zeros increases.

## V SR Extensions

In this section we present some extensions to the basic concept introduced in Algorithm 1. The first is a link between SR and Monte Carlo Importance Sampling, and the second discusses whether Gaussian noise is the optimal choice.

### V-A SR With Monte Carlo Methods

In the case where the generative model is fully known, the MMSE can, in principle, be computed. The problem is that, due to the huge number of possible supports, doing so is impractical. To overcome this, we can turn to Monte Carlo simulations for an approximation. As assumed previously in this work, the non-zeros come from a Gaussian distribution, and the additive noise is also Gaussian. Hence, the posterior probabilities have an exponential nature, as shown in the general MMSE expression. This means that even though the true MMSE is a weighted sum over all possible supports:

 ^α(y) = E[α|y] = ∑_{S∈Ω} P(S|y) ^α(y,S), (33)

the sum is practically dominated by only a small number of elements:

 ^α(y) = ∑_{S∈Ω} P(S|y) ^α(y,S) ≈ ∑_{S∈ω, ω⊂Ω} P(S|y) ^α(y,S), (34)

for a proper choice of the subset ω ⊂ Ω. Can we somehow find the significant elements, weight them accordingly, and use them as an approximation of the MMSE?

#### V-A1 Importance Sampling

As we have seen, the Bayesian approach requires a sum over all the possible supports (integration in the general Bayesian case) in order to achieve the MMSE estimator. In the literature, there are numerous ways of approximating non-analytic integrals [19]. Specifically for the posterior expectation case, the Monte Carlo Importance Sampling approach [20] is known to work well. Generally the integral we wish to approximate is:

 E_x[h(x)] = ∫_X h(x) f(x) dx. (35)

Importance Sampling essentially calculates the above integral by using an additional sampling distribution g(x), via what is known as the importance sampling fundamental identity:

 E_x[h(x)] = ∫_X h(x) f(x) dx = ∫_X h(x) (f(x)/g(x)) g(x) dx,  {g(x) ≠ 0 ∀ x ∈ X s.t. f(x) ≠ 0}. (36)

Note that the condition given in brackets means that the support of g must cover that of f. The above integral can be approximated by drawing m samples X_j from the distribution g, i.e. X_j ∼ g, and then averaging:

 E_x[h(x)] ≈ (1/m) ∑_j (f(X_j)/g(X_j)) h(X_j). (37)

This sum asymptotically converges to the integral in (36) by the Strong Law of Large Numbers. A well-known alternative to the above estimator is the self-normalized form:

 E_x[h(x)] ≈ ∑_j (f(X_j)/g(X_j)) h(X_j) / ∑_j (f(X_j)/g(X_j)). (38)

This formulation addresses some stability issues, present in (37), regarding the tails of f and g, and is therefore commonly used. Like the previous sum, it also converges to (36) by the Strong Law of Large Numbers.
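To make (37) and (38) concrete, here is a small self-contained sketch that estimates E_f[x²] = 1 for a target f = N(0,1) using samples drawn from a wider proposal g = N(0,2²); the specific distributions are illustrative assumptions:

```python
import numpy as np

def normal_pdf(x, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def importance_sampling(h, f_pdf, g_pdf, samples):
    """Plain (37) and self-normalized (38) importance-sampling estimates
    of E_f[h(x)] from samples drawn under the proposal g."""
    w = f_pdf(samples) / g_pdf(samples)          # importance ratios f/g
    plain = np.mean(w * h(samples))              # eq. (37)
    self_norm = np.sum(w * h(samples)) / np.sum(w)  # eq. (38)
    return plain, self_norm

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 2.0, size=200_000)          # samples from g = N(0, 2^2)
f = lambda x: normal_pdf(x, 1.0)                 # target f = N(0, 1)
g = lambda x: normal_pdf(x, 2.0)
plain, self_norm = importance_sampling(lambda x: x ** 2, f, g, xs)
# both estimates approximate E_f[x^2] = 1
```

The self-normalized variant (38) cancels any unknown normalization constant in f, which is exactly the property exploited in the next subsection.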

#### V-A2 Importance Sampling using Stochastic Resonance

We propose to use stochastic resonance in order to retrieve the potentially important supports, those carrying the dominant weights in the MMSE, and to weight them accordingly using Importance Sampling. Formally, we use SR as a support-generating PDF g(S|y), and use the oracle estimators ^α^Oracle_{S,y} together with their un-normalized MMSE weights f(S|y). Plugging these into (38) we get:

 ^α(y) = E[α|y] ≈ ∑_j (f(S_j|y)/g(S_j|y)) ^α^Oracle_{S_j,y} / ∑_j (f(S_j|y)/g(S_j|y)),  S_j|y ∼ g(S|y). (39)

We now write the explicit expressions for each of the components described above:

• ^α^Oracle_{S,y} – the oracle estimator, as stated in (5).

• f(S|y) – the un-normalized probability of a support S given the noisy measurements vector y. Again, given the described generative model and the general MMSE expression,

 P(S|y) ∝ f(S|y) = t_S. (40)
• g(S|y) – the support-generating probability function given the noisy measurements y. As previously mentioned, we would like a way of generating probable supports. Using the SR concept, we create likely supports simply by adding SR noise to the measurements and running a pursuit. Clearly, as we add more noise, each iteration's pursuit may retrieve a different support. If we add too much noise, the recovered supports are not necessarily likely, and we might miss the “preferred supports” carrying the highest MMSE weights. This means that another parameter of this generating function is the amount of noise to be added.

In order to quantify g(S|y) we simply use the empirical distribution. In other words, if we run N iterations and a specific support occurs k times, its empirical probability is simply k/N.

Using the described components, we can now approximate the NP-hard MMSE calculation simply by recovering the likely supports with any pursuit applied to the SR-noisy measurements, and using (38). Note that this approximation asymptotically converges to the MMSE with probability 1.
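Putting the pieces together, the following is a hedged sketch of the SR-based Importance Sampling estimator of (39). The support weight is computed from the Gaussian marginal y|S ∼ N(0, σ_α² D_S D_Sᵀ + σ_n² I) under an assumed uniform prior over equal-size supports, and a fixed-cardinality greedy OMP stands in for the pursuit; the function names and this particular weight formula are illustrative, not the paper's exact expressions:

```python
import numpy as np

def omp_k(D, y, k):
    """Greedy OMP with a fixed number of non-zeros k."""
    support, residual = [], y.copy()
    for _ in range(k):
        corr = np.abs(D.T @ residual)
        corr[support] = -1.0  # do not reselect chosen atoms
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    return tuple(sorted(support))

def log_weight(D, y, S, sigma_a, sigma_n):
    """log f(S|y) up to a constant: y|S ~ N(0, sigma_a^2 D_S D_S^T + sigma_n^2 I),
    assuming a uniform prior over equal-size supports."""
    Ds = D[:, list(S)]
    C = sigma_a ** 2 * (Ds @ Ds.T) + sigma_n ** 2 * np.eye(len(y))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y))

def oracle(D, y, S, sigma_a, sigma_n):
    """Posterior-mean (oracle, ridge-type) signal estimate for a known support."""
    Ds = D[:, list(S)]
    k = Ds.shape[1]
    coef = np.linalg.solve(Ds.T @ Ds + (sigma_n / sigma_a) ** 2 * np.eye(k), Ds.T @ y)
    return Ds @ coef

def sr_is_mmse(D, y, k, sigma_a, sigma_n, sigma_sr, iters, seed=0):
    """MMSE approximation per (39): SR-noisy pursuits generate supports,
    empirical frequencies serve as g(S|y), and (38) re-weights the oracles."""
    rng = np.random.default_rng(seed)
    counts = {}
    for _ in range(iters):
        S = omp_k(D, y + sigma_sr * rng.standard_normal(y.shape), k)
        counts[S] = counts.get(S, 0) + 1
    supports = list(counts)
    logw = np.array([log_weight(D, y, S, sigma_a, sigma_n)
                     - np.log(counts[S] / iters) for S in supports])
    w = np.exp(logw - logw.max())  # stabilized f/g ratios
    est = sum(wi * oracle(D, y, S, sigma_a, sigma_n)
              for wi, S in zip(w, supports))
    return est / w.sum()
```

Note that only distinct supports are stored, so each oracle estimate is computed once and weighted by its (stabilized) f/g ratio.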

#### V-A3 Results

Since this method is asymptotically guaranteed to converge to the MMSE, the question that arises is how fast the convergence is relative to the number of possible supports. We start by repeating the previously described experiments for comparison. Note that in this experiment there are only 100 possibilities for the support, so when running 100 iterations we are not surprised to have successfully converged. These results can be seen in Figure 8(a). To show the efficiency of this technique we compare it with the same settings, only this time with a larger cardinality, yielding a vastly larger number of possible supports, all a priori equally likely! These results can be seen in Figure 8(b). Note that the MAP and the MMSE are missing from this figure due to the impractical amount of computation required. Compared to the previously described SR methods, this method is much less sensitive to the amount of added noise. As the noise increases we see only minor degradation, since the most likely supports are no longer recovered. That being said, it seems that as the number of possible supports increases, it takes more iterations to actually converge, or at least to beat the previous SR method. In Figure 8(c) we see that after 200 iterations the two perform roughly the same, but Importance Sampling is much less sensitive to the SR noise level. Note that with 200 iterations we recover at most 200 different supports, which is only a tiny fraction of all the possible supports.

To summarize, as we add more iterations we are guaranteed to asymptotically converge to the MMSE, but a major improvement can already be achieved with a small number of iterations. The biggest advantage of this method is its robustness to the energy of the SR noise.

### V-B What Noise Should be Used?

In the previous sections we used Gaussian noise by default. In this section we revisit this choice and ask whether noise models with different distributions can be used, and whether doing so affects the performance of the stochastic resonance estimator.

As mentioned briefly in IV-A, the additive noise multiplied by the dictionary is Gaussian under mild conditions. Denoting ñ = Dᵀn, each element is:

 ñ_i = d_iᵀ n = ∑_{j=1}^{m} d_{i,j} n_j. (41)

Without loss of generality, assuming normalized atoms ||d_i||_2 = 1, this expression is a weighted sum of iid variables n_j. If the noise has bounded mean and variance and, of course, iid elements, we can invoke the Central Limit Theorem. In this case, for large enough signal dimensions, ñ_i is asymptotically Gaussian regardless of the distribution of the original additive noise n.

Following the previous statement, we experiment with a different distribution for the random noise vector. We employ element-wise iid zero-mean uniform noise, with its range chosen so that its standard deviation matches that of the Gaussian noise, thus assuring a fair comparison between the two cases. Repeating the experiment described in Figure 0(a), in Figure 9(a) we also use a uniform noise distribution as described above. By using uniform noise with zero mean and the same standard deviation as the Gaussian noise, we see a perfect match between the two curves. Due to the Central Limit Theorem, the choice of the noise's distribution has practically no effect, as long as the described conditions hold and the signal's dimension is large enough.
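The CLT argument above is easy to verify numerically. The sketch below (with arbitrary illustrative dimensions) projects both uniform and Gaussian noise of matched standard deviation onto a single normalized atom, as in (41), and shows the two projections are statistically indistinguishable:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials, sigma = 400, 20000, 0.5
d = rng.standard_normal(m)
d /= np.linalg.norm(d)                        # one normalized atom
delta = sigma * np.sqrt(3.0)                  # Unif[-delta, delta] has std sigma
n_unif = rng.uniform(-delta, delta, size=(trials, m))
n_gauss = sigma * rng.standard_normal((trials, m))
proj_u = n_unif @ d                           # tilde{n} = d^T n, as in (41)
proj_g = n_gauss @ d
# Both projections have standard deviation ~sigma, and the uniform one
# is approximately Gaussian (near-zero skewness) by the Central Limit Theorem.
```

The same check works for any noise with bounded mean and variance, which is exactly the condition stated above.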

Now that we know that the choice of the noise's distribution does not affect the performance, is there a distribution from which we can benefit more than others? To explore this question, consider the following. Given the signal y, we define the subsampling noise as:

 n_i(y_i) = { 0 w.p. p,  −y_i w.p. 1−p }, (42)

so the SR-noisy samples now follow the distribution:

 y_i + n_i(y_i) = { y_i w.p. p,  0 w.p. 1−p }. (43)

This means that each element of the SR-noisy measurements is zeroed with probability 1−p, so that on average only a p-fraction of the measurements remain in the signal. This distribution is interesting for the following reason. When zeroing out an element of the vector y, the matching row in the dictionary is always multiplied by that zero element when calculating the correlations, as done in most pursuits. This multiplication obviously contributes nothing to the inner product, so we might as well omit the zero elements from y and the corresponding rows from D, remaining with only a subsampled version of the signal and the dictionary. In other words, in each of the SR iterations we simply subsample a random portion of the signal y and the matching portion of rows from the dictionary D, and apply a pursuit to the reduced problem. Recall that once the pursuit is done, its result contains a noisy estimation of the signal due to the added SR noise. Just as in the previous cases, we should use the pursuit's result only as a support estimator in order to calculate the subtractive SR estimator. To do so, once the pursuit is done, we return to the full-sized signal and dictionary and calculate the oracle estimator using the support recovered by the subsampled pursuit.

The described process can be equivalently formulated by sampling a random mask M, a random subsample of rows from the identity matrix I. In this formulation, we create many base estimators by applying a pursuit to the sub-sampled signal My, using the sub-sampled dictionary MD. Note that when applying this method, each pursuit has a computational benefit over the previous methods due to the decreased signal dimension. In Figure 9(b) we show the results of the same experiment described in Figure 9(a), this time using the subsampling approach. In this figure the x-axis represents the probability p, the fraction of elements kept in each pursuit. We see that the two methods achieve the same MSE performance for the optimal choice of their parameters, but this method is faster due to the smaller size of each pursuit.
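A minimal sketch of this masked variant might look as follows, with an assumed greedy fixed-cardinality OMP as the black-box pursuit; the boolean row mask plays the role of M:

```python
import numpy as np

def omp_k(D, y, k):
    """Greedy OMP with a fixed number of non-zeros k."""
    support, residual = [], y.copy()
    for _ in range(k):
        corr = np.abs(D.T @ residual)
        corr[support] = -1.0  # do not reselect chosen atoms
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    return support

def subsample_sr(D, y, k, p, iters, seed=0):
    """Subsampling SR: each iteration keeps a random fraction p of the
    measurements (rows), runs the pursuit on the reduced problem, then
    refits the recovered support on the full-size y and D (the oracle step)."""
    rng = np.random.default_rng(seed)
    est = np.zeros_like(y)
    for _ in range(iters):
        keep = rng.random(len(y)) < p            # random row mask M
        S = omp_k(D[keep], y[keep], k)           # pursuit on the subsampled pair
        coef, *_ = np.linalg.lstsq(D[:, S], y, rcond=None)  # oracle on full data
        est += D[:, S] @ coef
    return est / iters
```

Each inner pursuit works on roughly p·m rows instead of m, which is the source of the speed-up noted above.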

## VI Image Denoising

In this section we show the benefit of using SR for facial image denoising. In this experiment we use the Trainlets [21] dictionary, trained on facial images from the Chinese passport dataset described in [22]. Each image in the dataset contains a gray-scale aligned face. The application we demonstrate is denoising, that is, approximating the solution of the following optimization problem:

 ^α = argmin_α ||Dα − y||_2 s.t. ||α||_0 = L. (44)

The approximation is obtained using the Subspace Pursuit (SP) algorithm [23], which provides fast convergence for a fixed number of non-zeros L. The number of non-zeros was tuned empirically so that the denoising performance would be optimal. For Stochastic Resonance we used Algorithm 1 in its subtractive form with 200 iterations and the same SP settings. Note that we do not seek optimal denoising results, but rather to show that SR can improve real image processing tasks.
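For reference, the following is a compact sketch of the SP iteration for a fixed cardinality L, following the standard description of the algorithm rather than the authors' exact implementation:

```python
import numpy as np

def subspace_pursuit(D, y, L, max_iter=20):
    """Subspace Pursuit sketch: expand the support by the L atoms most
    correlated with the residual, solve least squares on the union,
    then prune back to the L largest coefficients."""
    S = np.argsort(np.abs(D.T @ y))[-L:]                     # initial support
    coef, *_ = np.linalg.lstsq(D[:, S], y, rcond=None)
    residual = y - D[:, S] @ coef
    for _ in range(max_iter):
        cand = np.union1d(S, np.argsort(np.abs(D.T @ residual))[-L:])
        b, *_ = np.linalg.lstsq(D[:, cand], y, rcond=None)   # fit on the union
        S_new = cand[np.argsort(np.abs(b))[-L:]]             # prune to L atoms
        coef, *_ = np.linalg.lstsq(D[:, S_new], y, rcond=None)
        r_new = y - D[:, S_new] @ coef
        if np.linalg.norm(r_new) >= np.linalg.norm(residual):
            break  # stop once the residual no longer shrinks
        S, residual = S_new, r_new
    return np.sort(S), residual
```

In the SR scheme above, this routine would simply replace the pursuit call in Algorithm 1, with the SR noise added to y before each invocation.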

### Vi-a Experiment Description

We corrupt an image from the dataset with Additive White Gaussian Noise (AWGN), each time with a different standard deviation. For each noise level we applied SP using the Trainlets dictionary, choosing L such that the denoised result would be optimal under the PSNR measure. We then took the noisy images and used Algorithm 1 to denoise them, using the same L with varying SR noise standard deviations.

### VI-B Results

Figure 11 presents the denoised results for one of the noise levels. Unsurprisingly, the SR result is clearer, with far fewer artifacts. Figure 12 presents the effectiveness of SR under varying SR noise levels; we see that, with a properly chosen SR noise level, a gain of almost 2 dB over the regular pursuit is achieved. Figure 13 presents a comparison of SP vs. SR for varying values of the corrupting noise's standard deviation. In all of the described experiments, SR improved the denoising results. Generally, we observe that the improvement becomes more significant as the noise increases.

## VII Conclusion

In this work we introduced Stochastic Resonance, a phenomenon in which noise improves the performance of a non-linear system. We suggested algorithms leveraging Stochastic Resonance in the context of sparse representation pursuit algorithms. We analyzed their theoretical properties under the Sparseland model and showed that an MMSE approximation can be accomplished by repeatedly applying pursuits with different SR noise realizations, thus obtaining many representation hypotheses and averaging them all into a final dense estimator. We demonstrated Stochastic Resonance as a practical and effective MMSE approximation that can use any pursuit algorithm as a “black box”, thus opening the door to MMSE approximations in high dimensions.

## Appendix A Unitary SR Estimator Expectation

For the non-subtractive case:

 = ∫_{|β+n|≥λ} c² (β+n) p(n) dn
 = c² [∫_{−∞}^{−λ−β} (β+n) p(n) dn + ∫_{λ−β}^{∞} (β+n) p(n) dn]
 = c² [∫_{−∞}^{−λ−β} (β+n) (1/√(2πσ_n²)) e^{−n²/(2σ_n²)} dn] + c² [∫_{λ−β}^{∞} (β+n) (1/√(2πσ_n²)) e^{−n²/(2σ_n²)} dn]