Analysis of Approximate Message Passingwith Non-Separable Denoisers and Markov Random Field PriorsPortions of the work appeared at the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, June 2017 [1].

# Analysis of Approximate Message Passing with Non-Separable Denoisers and Markov Random Field Priors††thanks: Portions of the work appeared at the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, June 2017 [1].

Yanting Ma
North Carolina State University
yma7@ncsu.edu
Cynthia Rush
Columbia University
cynthia.rush@columbia.edu
Dror Baron
North Carolina State University
barondror@ncsu.edu
###### Abstract

Approximate message passing (AMP) is a class of low-complexity, scalable algorithms for solving high-dimensional linear regression tasks where one wishes to recover an unknown signal from noisy, linear measurements. AMP is an iterative algorithm that performs estimation by updating an estimate of the unknown signal at each iteration and the performance of AMP (quantified, for example, by the mean squared error of its estimates) depends on the choice of a “denoiser” function that is used to produce these signal estimates at each iteration.

An attractive feature of AMP is that its performance can be tracked by a scalar recursion referred to as state evolution. Previous theoretical analysis of the accuracy of the state evolution predictions has been limited to the use of only separable denoisers or block-separable denoisers, a class of denoisers that underperform when sophisticated dependencies exist between signal entries. Since signals with entrywise dependencies are common in image/video-processing applications, in this work we study the high-dimensional linear regression task when the dependence structure of the input signal is modeled by a Markov random field prior distribution. We provide a rigorous analysis of the performance of AMP, demonstrating the accuracy of the state evolution predictions, when a class of non-separable sliding-window denoisers is applied. Moreover, we provide numerical examples where AMP with sliding-window denoisers can successfully capture local dependencies in images.

## 1 Introduction

In this work, we study the problem of estimating an unknown signal from noisy, linear measurements as in the following model:

 y=AV(β)+w, (1)

where for some integer , is an index set with cardinality , is the output, is a known measurement matrix, is zero-mean noise with finite variance , and (script stands for “vectorization”) is an invertible operator that rearranges elements of an array into a vector, hence is a length- vector. We assume that the ratio of the dimensions of the measurement matrix is a constant value, , with .

Approximate message passing (AMP) [2, 3, 4, 5, 6] is a class of low-complexity, scalable algorithms studied to solve the high-dimensional regression task of (1). The performance of AMP depends on a sequence of functions used to generate a sequence of estimates from effective observations computed in every iteration of the algorithm. A nice property of AMP is that under some technical conditions these observations can be approximated as the input signal plus independent and identically distributed (i.i.d.) Gaussian noise. For this reason, the functions are referred to as “denoisers.”

Previous analysis of the performance of AMP only considers denoisers that act coordinate-wise when applied to a vector; such denoisers are referred to as separable. If the unknown signal has a prior distribution with i.i.d. entries, restricting consideration to only separable denoisers causes no loss in performance. However, in many real-world applications, the unknown signal contains dependencies between entries, and therefore a coordinate-wise independence structure does not approximate the prior for well. Instead of using a separable denoiser, non-separable denoisers can improve reconstruction quality for signals with such dependencies among entries. For example, when the signals are images [7, 8] or sound clips [9], non-separable denoisers outperform reconstruction techniques based on over-simplified i.i.d. models. In such cases, a more appropriate model might be a finite memory model, well-approximated with a Markov random field (MRF) prior. In this paper, we extend the previous performance guarantees for AMP to a class of non-separable sliding-window denoisers when the unknown signal is a realization of an MRF. Sliding-window schemes have been studied for denoising signals with dependencies among entries by, for example, Sivaramakrishnan and Weissman [10, 11]. MRFs are appropriate models for many types of images, especially texture images, which have an inherently random component [12, 13].

When the measurement matrix has i.i.d. Gaussian entries and the empirical distribution function111For a vector , the empirical distribution function is defined as , where denotes the indicator function. The empirical distribution function is said to converge to some distribution function if for all such that is continuous, we have . of the unknown signal converges to some distribution function on , Bayati and Montanari [4] proved that at each iteration the performance of AMP can be accurately predicted by a simple, scalar iteration referred to as state evolution in the large system limit ( such that is a constant). For example, if is the estimate produced by AMP at iteration , the result by Bayati and Montanari [4] implies that the normalized squared error, , and other performance measures converge to deterministic values predicted by state evolution, which is a deterministic recursion calculated using the prior distribution of .222Throughout the paper, denotes the sum of squares of all the entries in , where could be, for example, in , , or . Rush and Venkataramanan [14] provided a concentration version of the asymptotic result when the prior distribution of is i.i.d. sub-Gaussian. The result in Rush and Venkataramanan [14] implies that the probability of -deviation between various performance measures and their limiting constant values decay exponentially in .

Extensions of AMP performance guarantees beyond separable denoisers have been considered in special cases [15, 16] for certain classes of block-separable denoisers that allow dependencies within blocks of the signal with independence across blocks. A preliminary version [1] of this work has analyzed the performance of AMP with sliding-window denoisers applied to the setting where the unknown signal has a Markov chain prior. In this paper, we generalize the previous result with the applications of compressive imaging [7, 8] and compressive hyperspectral imaging [17] in mind. We consider 2D/3D MRF priors for the input signal , and provide performance guarantees for AMP with 2D/3D sliding-window denoisers under some technical conditions.

While we were concluding this manuscript, we became aware of recent work of Berthier et al. [18]. The authors prove that the loss of the estimates generated by AMP (for a class of loss functions) with general non-separable denoisers converges to the state evolution predictions asymptotically. Our work differs from [18] in the following three aspects: (i) our work provides finite sample analysis, whereas the result in [18] is asymptotic; (ii) we adjust the state evolution sequence for the specific class of non-separable sliding-window denoisers to account for the “edge” issue that occurs in the finite sample regime (this point will become clear in later sections); (iii) we consider the setting where the unknown signal is a realization of an MRF and the expectation in the definition of the state evolution sequence is with respect to (w.r.t.) the signal , the matrix , and the noise , whereas in [18], the signal is deterministic and unknown, hence the expectation is only w.r.t. the matrix and the noise .

### 1.1 Sliding-Window Denoisers and AMP Algorithm

Notation: Before introducing the algorithm, we provide some notation that is used to define the sliding window in the sliding-window denoiser. Without loss of generality, we let the index set , on which the input signal in (1) is defined, be

 Γ=⎧⎪⎨⎪⎩[N],if p=1,×[N],if p=2,×[N]×[N],if p=3, (2)

where for an integer , the notation represents the set of integers , hence, . Similarly, let be a -dimensional cube in with length in each dimension, namely,

 Λ:=⎧⎪⎨⎪⎩[2k+1],if p=1,×[2k+1],if p=2,×[2k+1]×[2k+1],if p=3, (3)

where . We call the half-window size.

AMP with sliding-window denoisers: The AMP algorithm for estimating from and in (1) generates a sequence of estimates , where , is the iteration index, and the initialization is an all-zero array with the same dimension as the input signal . For , the algorithm proceeds as follows:

 zt =y−AV(βt)+zt−1n∑i∈Γη′t−1([V−1(A∗zt−1)+βt−1]Λi), (4) βt+1i =ηt([V−1(A∗zt)+βt]Λi), for all i∈Γ, (5)

where the function is a sequence of denoisers, is the partial derivative w.r.t. the center coordinate of the argument, is the transpose of , and for each is the -dimensional cube translated to be centered at location . The translated -dimensional cubes are referred to as “sliding windows,” which will be used to subset elements of a -dimensional array. The effective observation at iteration is , which can be approximated as the true signal plus i.i.d. Gaussian noise (in a sense that will be made clear in the statement of our main result, Theorem 1). Note that the sliding-windows and the sliding-window denoiser are defined on multidimensional signals, hence we use the inverse of the vectorization operator, , to rearrange elements of vectors into arrays before applying the sliding-window denoiser . It should also be noted that the denoiser may only process part of the signal elements in . For example, in the 2D case, if is defined as a window, then may only process the center and the four adjacent pixel values in the window (see Figure 1) and ignore the four corners. To simplify notation, we will write throughout the paper, and interpret this notation to mean that any processing of neighboring signal values is allowed, including the possibility of ignoring some of their values.

Edge cases: Notice that when the center coordinate is near the edge of , some of the elements in may fall outside , meaning that where is the complement of w.r.t. . In the definition of the AMP algorithm with sliding-window denoisers and the subsequent analysis, these “edge cases” must be handled carefully. The following definitions provide a framework for the special treatment of the edge cases.

Based on whether has elements outside , we partition the index set into two sets and defined as:

 Γmid:={i∈Γ|Λi∩Γc=∅}andΓedge:={i∈Γ|Λi∩Γc≠∅}. (6)

That is, for , all elements in lie inside , whereas for , some of the elements in fall outside . The size of each set will depend on the half-window size and the dimension .

For any , let be a subset of the elements of with indices in and for any , let be the index of so that returns a single element of . Notice that for , all entries of are well-defined. However, for , the subset has undefined entries, namely, for all such that , the entry is undefined. We now define the value of those “missing” entries to be the average of the entries of having indices in . Formally,

 v[Λi]j:=1|Λi∩Γ|∑ℓ∈Λi∩Γvℓ, for all j∈Λ such that [Λi]j∈Γc. (7)

It may improve signal recovery quality to use other schemes for “missing” entries, like interpolation. We leave the study of these improved schemes for future work, but the effect of improved processing around edges should become minor as increases. Notice that for all are now defined using only the entries in the original . It will be useful to emphasize this point in the proof of our main result, so we define a set of functions with as

 Ti(vΛi∩Γ):=vΛi, for % all i∈Γ, (8)

where follows our definition above. That is, is identity for , whereas for , extends a smaller array to a larger one with the extended entries defined by (7).

Examples for defining “missing” entries: To illustrate the notations defined above, we present an example for the case (hence is a vector). As defined above in (2) and (3), we have , , and for each . Moreover, and as defined in (6). Therefore, for ,

 vΛi=Ti(vi−k,vi−k+1,…,vi+k):=(vi−k,vi−k+1,…,vi+k)∈R2k+1.

For , the vector is still length-, and we set the values of the non-positive indices, i.e., , or indices above , i.e., , to be the average of values in the vector with indices in . For example, let and giving so that for we have . Following (7), define

 ¯v=188∑j=1vj, and set v[Λ3]1=v[Λ3]2=v[Λ3]3=¯v,

and so . An example for the case (hence is a matrix), is shown in Figure 2.

### 1.2 Contributions and Outline

Our main result proves concentration for (order-2) pseudo-Lipschitz (PL(2)) loss functions333A function is (order-2) pseudo-Lipschitz if there exists a constant such that for all , . acting on the AMP estimate given in (5) at any iteration of the algorithm to constant values predicted by the state evolution equations that will be introduced in the following. This work covers the case where the unknown signal has an MRF prior on . For example, when , can be thought of as an image, whereas when , can be thought of as a hyperspectral image cube. Moreover we use numerical examples to demonstrate the effectiveness of AMP with sliding-window denoisers when used to reconstruct images from noisy linear measurements.

The rest of the paper is organized as follows. Section 2 provides model assumptions, state evolution formulas, the main performance guarantee, and numerical examples illustrating the effectiveness of the algorithm for compressive image reconstruction. Our main performance guarantee (Theorem 1) is a concentration result for PL loss functions acting on the AMP outputs from (4)-(5) to the state evolution predictions. Section 3 provides the proof of Theorem 1. The proof is based on a technical lemma, Lemma 4, and the proof of Lemma 4 is provided in Section 4.

## 2 Main Results

### 2.1 Definitions and Assumptions

First we include some definitions relating to MRFs that will be used to state our assumptions on the unknown signal . These definitions can be found in standard textbooks such as [19]; we include them here for convenience.

Definitions: Let be a probability space. A random field is a collection of random variables defined on having spatial dependencies, where for some measurable state space and is a non-empty, finite subset of the infinite lattice . Note that , hence . One can consider as a collection of spatial locations. Denote the -order neighborhood of location by , that is, is a collection of location indices at a distance less than or equal to from but not including . Formally,

 Nqi={j∈Γ∖{i}|∥i−j∥2≤q}.

Following these definitions, is said to be a -order MRF if, for all and for all measurable subsets , we have

 P(Xi∈B|Xj,j∈Γ∖{i})=P(Xi∈B|Xj,j∈Nqi),

and for all we have . The positivity condition ensures that the joint distribution of an MRF is a Gibbs distribution by the Hammersley-Clifford theorem [20].

Let denote the distribution measure of , namely for all , we have , and let be the distribution measure of for . For any , define the set . Then the random field is said to be stationary if for all such that , it is true that .

Next we introduce the Dobrushin uniqueness condition, under which the random field admits a unique stationary distribution. Define the Dobrushin interdependence matrix for the measure of the random field to be

 Ci,j:=supξ,ξ′∈EΓξjc=ξ′jc∥μi(⋅|ξ)−μi(⋅|ξ′)∥tv. (9)

In the above, the index set and the total variation distance between two probability measures and on is defined as

Note that if is countable, then

 ∥ρ1(⋅)−ρ2(⋅)∥tv=12∑x∈E|ρ1(x)−ρ2(x)|. (10)

The measure is said to satisfy the Dobrushin uniqueness condition if

 c:=supi∈Γ∑j∈ΓCi,j<1.

The Dobrushin contraction coefficient, , is a quantity that estimates the magnitude of change of the single site conditional expectations, as they appear in (9), when the field values at the other sites vary. Similarly, we define the transposed Dobrushin contraction condition as

 c∗:=supj∈Γ∑i∈ΓCi,j<1.

Assumptions: We can now state our assumptions on the signal , the matrix , and the noise in the linear system (1), as well as the denoiser function used in the algorithm (4) and (5).

Signal: Let be a bounded state space (countable or uncountable). Let be a stationary MRF with Gibbs distribution measure on , where is a finite and nonempty rectangular lattice. We assume that satisfies the Dobrushin uniqueness condition and the transposed Dobrushin uniqueness condition. These two conditions together are needed for the results in Lemma C.1 and Lemma C.2, which demonstrate concentration of sums of pseudo-Lipschitz functions when the input to the functions are MRFs with distribution measure . Roughly, the conditions ensure that the dependencies between the terms in the sums are sufficiently weak for the desired concentration to hold. The class of finite state space stationary MRFs, which is widely used for image analysis [21], is one example that satisfies our assumption.

Denoiser functions: The denoiser functions used in (5) are assumed to be Lipschitz444A function is Lipschitz if there exists a constant such that for all , . for each and are, therefore, also weakly differentiable with bounded (weak) partial derivatives. We further assume that the partial derivative w.r.t. the center coordinate of , which is denoted by , is itself differentiable with bounded partial derivatives. Note that this implies is Lipschitz. (It is possible to weaken this condition to allow to have a finite number of discontinuities, if needed, as in [14].)

Matrix: The entries of the matrix are i.i.d. .

Noise: The entries of the measurement noise vector are i.i.d. according to some sub-Gaussian distribution with mean 0 and finite variance . The sub-Gaussian assumption implies [22] that for all and for some constants ,

 P(∣∣1n∥w∥2−σ2∣∣≥ϵ)≤Ke−κnϵ2.

### 2.2 Performance Guarantee

As noted in Section 1, the behavior of the AMP algorithm is predicted by a deterministic scalar recursion referred to as state evolution, which we now introduce. More specifically, the state evolution sequences and defined below in (11) will be used in Theorem 1 to characterize the estimation error of the estimates produced by AMP. Let the joint distribution define the (stationary) prior distribution for the unknown signal in (1). Following our assumption of stationarity, for all and for all with defined in (6), where and denote the one-dimensional marginal and -dimensional marginal of , respectively. Define , and . Iteratively define the state evolution sequences and as follows:

 τ2t=σ2+σ2t and σ2t=1δ|Γ|∑i∈ΓE[(ηt−1([β+τt−1Z]Λi)−βi)2], (11)

where is the sliding-window denoiser and has i.i.d. standard normal entries, independent of , which implies that is independent of and for all . Let and define with entries that are i.i.d. . We notice that for all , we have and . Therefore, for all , the expectations in (11) satisfy

 E[(ηt−1([β+τt−1Z]Λi)−βi)2]=E[(ηt−1(β′+τt−1Z′)−β′c)2],

where is the center coordinate of . For with defined in (6), it is not necessarily true that since, by definition (7), some entries of are defined as the average of other entries.

The explicit expression for the definition of in (11) is different when considering for different values, as the size of the set depends on the dimension. In the following, we provide explicit expressions for for the cases , but in the proof we will use the general expression given in (11) for brevity. We emphasize that the definition of the state evolution sequence in (11) only uses the marginal distribution (or ) instead of the joint distribution (or ), as demonstrated in the two examples below in (12) and (13).

Examples for explicit expressions for : Let be the center coordinate of and the window translated with center . Recall that is the -dimensional cube with length in each of the dimensions. Then we have and when we consider shifts for we, analogous to the definition in (7), define “missing” entries to be replaced by the average of the existing entries. (Note that since is exactly of size , thus for any , there will be “missing” entries.) For example, when ,

 β′Λc=(β′1,β′2,…,β′2k+1), while β′Λc−2=(avg,avg,β′1,β′2,…,β′2k−1),

where . Generalizing, we have with

 β′Λc+ℓ=⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩(12k+1+ℓ∑2k+1+ℓi=1β′i,…,12k+1+ℓ∑2k+1+ℓi=1β′i,β′1,β′2,…,β′2k+1+ℓ) if ℓ<0,(β′1,β′2,…,β′2k+1) if ℓ=0,(β′1+ℓ,β′2+ℓ,…,β′2k+1,12k+1−ℓ∑2k+1i=1+ℓβ′i,…,12k+1−ℓ∑2k+1i=1+ℓβ′i) if ℓ>0.

The same idea can be extended when .

For the case , we note that and , hence and . Therefore, we have

 σ2t=(N−2k)δNE[(ηt−1(β′+τt−1Z′)−β′c)2]+1δN∑ℓ∈{−k,…,k}∖{0}E[(ηt−1([β′+τt−1Z′]Λc+ℓ)−β′c+ℓ)2], (12)

where . In the above the first term corresponds to the middle indices, while the second term sums over terms corresponding to all the possible edge cases.

For the case , we note that , hence . Here we note . Therefore,

 σ2t =(N−2k)2δN2E[(ηt−1(β′+τt−1Z′)−β′c)2]+1δN2∑ℓ1,ℓ2∈{−k,…,k}∖{0}E[(ηt−1([β′+τt−1Z′]Λc+ℓ)−β′c+ℓ)2] +(N−2k)δN2∑ℓ1∈{−k,…,k}∖{0}ℓ2=0E[(ηt−1([β′+τt−1Z′]Λc+ℓ)−β′c+ℓ)2] +(N−2k)δN2∑ℓ2∈{−k,…,k}∖{0}ℓ1=0E[(ηt−1([β′+τt−1Z′]Λc+ℓ)−β′c+l)2], (13)

where we notice that there are terms in the second summand, terms in the third and fourth summands, and . Again, in the above the first term sums over all the middle indices. In this case, the second term corresponds to the corner edge cases, while the third and fourth terms correspond to the edge cases in one dimension only. We note that is a function of , but do not explicitly represent this relationship to simplify the notation. Moreover, for fixed , the terms , , and vanish as goes to infinity. Therefore, we have .

Similar to [14], our performance guarantee, Theorem 1, is a concentration inequality for PL(2) loss functions at any fixed iteration , where is the first iteration when either or defined in (36) is smaller than a predefined quantity . The precise definition of and is deferred to Section 3.2. For now, we can understand (respectively, ) as a number that quantifies (in a probabilistic sense) how close an estimate (respectively, a residual ) is to the subspace spanned by the previous estimates (respectively, the previous residuals ). In the special case where are Bayes-optimal conditional expectation denoisers, it can be shown that small implies that the difference between and is small [14].

###### Theorem 1.

Under the assumptions stated in Section 2.1, and for fixed half window-size , then for any (order-) pseudo-Lipschitz function , , and ,

 P(∣∣1|Γ|∑i∈Γ(ϕ(βt+1i,βi)−E[ϕ(ηt([β+τtZ]Λi),βi)])∣∣≥ϵ)≤Kk,te−κk,tnϵ2, (14)

where , has i.i.d. standard normal entries and is independent of , and the deterministic quantity is defined in (11). The constants do not depend on or , but do depend on and . Their values are not explicitly specified.

###### Proof.

See Section 3. ∎

Remarks:

(1) The probability in (14) is w.r.t. the product measure on the space of the matrix , signal , and noise .

(2) By choosing the following PL(2) loss function, , Theorem 1 gives the following concentration result for the mean squared error of the estimates. For all ,

 P(∣∣1|Γ|∥βt+1−β∥2−δσ2t+1∣∣≥ϵ)≤Kk,te−κk,tnϵ2,

with defined in (11).

### 2.3 Numerical Examples

Before moving to the proof of Theorem 1, we first demonstrate the effectiveness of the AMP algorithm with sliding-window denoisers when used to reconstruct an image from its linear measurements acquired according to (1). We verify that state evolution accurately tracks the normalized estimation error of AMP, as is guaranteed by Theorem 1. While we use squared error as the error metric in our examples, which corresponds to the case where the PL(2) loss function in Theorem 1 is defined as , we remind the reader that Theorem 1 also supports other PL(2) loss functions. Moreover, we apply AMP with sliding-window denoisers to reconstruct texture images, which are known to be well-modeled by MRFs in many cases [12, 13].

#### 2.3.1 Verification of state evolution

We consider a class of stationary MRFs on whose neighborhood is defined as the eight-nearest neighbors, meaning this is a -order MRF per the definition in Section 2.1. The joint distribution of such an MRF on any finite rectangular lattice in has the following expression [23]:

 μ(x)=P(β=x)=∏M−1m=1∏N−1n=1[xm,nxm,n+1xm+1,nxm+1,n+1]∏M−1m=2∏N−1n=2[xm,n]∏M−1m=2∏N−1n=1[xm,nxm,n+1]∏M−1m=1∏N−1n=2[xm,nxm+1,n], (15)

where we follow the notation in [23] for the generic measure defined as

 [xm,nxm,n+1xm+1,nxm+1,n+1]:=P(βm,n=xm,n,βm,n+1=xm,n+1,βm+1,n=xm+1,n,βm+1,n+1=xm+1,n+1),

and the conditional distribution of the element in the box given the element(s) not in the box:

 [xm,nxm,n+1xm+1,n\framebox$xm+1,n+1$]:=P(βm+1,n+1=xm+1,n+1|βm,n=xm,n,βm+1,n=xm+1,n,βm,n+1=xm,n+1).

The generic measure needs to satisfy some consistency conditions to ensure the Markovian property and stationarity of the MRF on a finite grid; details can be found in [23]. For convenience, in simulations we use a Binary MRF as defined in [23, Definition 7], for which the generic measure is conveniently parameterized by four parameters, namely,

 [1\framebox[1.5pt]0]=p,[0\framebox[1.5pt]1]=q,[00\framebox[1.5pt]10]=r,[111\framebox[1.5pt]0]=s. (16)

In the simulations, we set . Using (9) and (10), it can be checked that the distribution measure of this MRF satisfies the Dobrushin uniqueness condition.

As mentioned previously, an attractive property of AMP, which is formally stated in Theorem 1, is the following: for large and and for , the observation vector used as an input to the estimation function in (5) is approximately distributed as , where , has i.i.d. standard normal entries, independent of , and is defined in (11). With this property in mind, a natural choice of denoiser functions are those that calculate the conditional expectation of the signal given the value of the input argument, which we refer to as Bayesian sliding-window denoisers. Let and , then for each we define

 ηt(v) :=E[β′c|Vt=v]=P(β′c=1|Vt=v)=P(Vt=v,β′c=1)P(Vt=v) (17) =∑xΛ∖c∈{0,1}Λ∖c,xc=1fVt|β′(v|x)μ(x)∑x∈{0,1}ΛfVt|β′(v|x)μ(x)

where denotes the center coordinate of , denotes all coordinates in except the center, since coordinates of are i.i.d. normal, and is computed according to (15) with by using (16) and the property of Binary MRF given in [23, Definition 7]. Figure