AMP-Inspired Deep Networks for Sparse Linear Inverse Problems
Abstract
Deep learning has gained great popularity due to its widespread success on many inference problems. We consider the application of deep learning to the sparse linear inverse problem, where one seeks to recover a sparse signal from a few noisy linear measurements. In this paper, we propose two novel neural-network architectures that decouple prediction errors across layers in the same way that the approximate message passing (AMP) algorithms decouple them across iterations: through Onsager correction. First, we propose a “learned AMP” network that significantly improves upon Gregor and LeCun’s “learned ISTA.” Second, inspired by the recently proposed “vector AMP” (VAMP) algorithm, we propose a “learned VAMP” network that offers increased robustness to deviations in the measurement matrix from i.i.d. Gaussian. In both cases, we jointly learn the linear transforms and scalar nonlinearities of the network. Interestingly, with i.i.d. signals, the linear transforms and scalar nonlinearities prescribed by the VAMP algorithm coincide with the values learned through backpropagation, leading to an intuitive interpretation of learned VAMP. Finally, we apply our methods to two problems from 5G wireless communications: compressive random access and massive-MIMO channel estimation.
I Introduction
We consider the problem of recovering a signal $\mathbf{s}^0 \in \mathbb{R}^N$ from a noisy linear measurement $\mathbf{y} \in \mathbb{R}^M$ of the form¹

$$\mathbf{y} = \mathbf{\Phi}\mathbf{s}^0 + \mathbf{w}, \qquad (1)$$

where $\mathbf{\Phi} \in \mathbb{R}^{M \times N}$ represents a linear operator and $\mathbf{w} \in \mathbb{R}^M$ is additive white Gaussian noise (AWGN). In many cases of interest, $M \ll N$. We will assume that the signal vector $\mathbf{s}^0$ has an (approximately) sparse² representation in a known orthonormal basis $\mathbf{\Psi} \in \mathbb{R}^{N \times N}$, i.e., that $\mathbf{s}^0 = \mathbf{\Psi}\mathbf{x}^0$ for some (approximately) sparse vector $\mathbf{x}^0 \in \mathbb{R}^N$. Thus we define $\mathbf{A} \triangleq \mathbf{\Phi}\mathbf{\Psi} \in \mathbb{R}^{M \times N}$, write (1) as

$$\mathbf{y} = \mathbf{A}\mathbf{x}^0 + \mathbf{w}, \qquad (2)$$

and seek to recover a sparse $\mathbf{x}^0$ from $\mathbf{y}$. In the sequel, we will refer to this problem as the “sparse linear inverse” problem. The resulting estimate $\hat{\mathbf{x}}$ of $\mathbf{x}^0$ can then be converted into an estimate $\hat{\mathbf{s}} = \mathbf{\Psi}\hat{\mathbf{x}}$ of $\mathbf{s}^0$.

¹Although we focus on real-valued quantities for ease of illustration, the methods in this paper could be easily extended to the complex-valued case.

²Although we focus on sparse signals, the methods in this paper can be applied to other signals, such as the finite-alphabet signals used in digital communications.
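To make the setup concrete, the following NumPy sketch generates one problem instance of the form (2). All specific values below (the dimensions, the sparsity rate, and the SNR) are illustrative placeholders, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 250, 500          # illustrative dimensions with M < N
gamma = 0.1              # illustrative sparsity rate

# i.i.d. Gaussian measurement matrix with variance-1/M entries
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))

# Bernoulli-Gaussian signal: N(0,1) with probability gamma, else exactly 0
x0 = rng.normal(size=N) * (rng.random(N) < gamma)

# AWGN with variance chosen to meet an illustrative 40 dB SNR
snr_db = 40.0
sigma_w = np.sqrt(np.mean((A @ x0) ** 2) * 10.0 ** (-snr_db / 10.0))
w = sigma_w * rng.normal(size=M)

y = A @ x0 + w           # the noisy measurement of (2)
```

A reconstruction algorithm then receives only the pair $(\mathbf{y}, \mathbf{A})$ and attempts to recover $\mathbf{x}^0$.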
The sparse linear inverse problem has received enormous attention over the last few years, in large part because it is central to compressive sensing [2]. Many methods have been developed to solve this problem. Most of the existing methods involve a reconstruction algorithm that inputs a pair $(\mathbf{y}, \mathbf{A})$ and produces a sparse estimate $\hat{\mathbf{x}}$. A myriad of such algorithms have been proposed, including both sequential (e.g., greedy) and iterative varieties. Some relevant algorithms will be reviewed in Section II-A.
Recently, a different approach to solving this problem has emerged along the lines of “deep learning” [3], whereby a many-layer neural network is optimized to minimize reconstruction mean-squared error (MSE) on a large set of training examples³ $\{(\mathbf{x}^0_d, \mathbf{y}_d)\}_{d=1}^{D}$. Once trained, the network can be used to predict the sparse $\mathbf{x}^0$ that corresponds to a new input $\mathbf{y}$. Although the operator $\mathbf{A}$ and the signal/noise statistics are not explicitly used when training, the learned network will be implicitly dependent on those parameters. Previous work (e.g., [4, 5, 6, 7, 8]) has shown that the deep-learning approach to solving sparse linear inverse problems has the potential to offer significant improvements, in both accuracy and complexity, over traditional algorithms like ISTA [9] and FISTA [10]. A short review of relevant concepts from deep learning will be provided in Section II-B.

³Since $\mathbf{\Psi}$ orthonormal implies $\mathbf{x}^0 = \mathbf{\Psi}^{\sf T}\mathbf{s}^0$, training examples of the form $\{(\mathbf{s}^0_d, \mathbf{y}_d)\}$ can be converted to $\{(\mathbf{x}^0_d, \mathbf{y}_d)\}$ via $\mathbf{x}^0_d = \mathbf{\Psi}^{\sf T}\mathbf{s}^0_d$.
In this paper, we show how recent advances in iterative reconstruction algorithms suggest modifications to traditional neuralnetwork architectures that yield improved accuracy and complexity when solving sparse linear inverse problems. In particular, we show how “Onsager correction,” which lies at the heart of the approximate message passing (AMP) [11] and vector AMP (VAMP) [12] algorithms, can be employed to construct deep networks that i) require fewer layers to reach a given level of accuracy and ii) yield greater accuracy overall. To our knowledge, the use of Onsager correction in deep networks is novel.
The contributions of our work are as follows. First, in Section III, we show how the soft-thresholding-based AMP algorithm from [11] can be “unfolded” to form a feedforward neural network whose MSE-optimal parameters can be learned using a variant of backpropagation. The structure of the resulting “learned AMP” (LAMP) network is similar to that of learned ISTA (LISTA) [4] but contains additional “bypass” paths whose gains are set in a particular way. While bypass paths can also be found in recently proposed “residual networks” [13, 14] and “highway networks” [15], the bypass paths in LAMP have a different topology and a different gain-control mechanism. We show numerically that LAMP’s outputs are more accurate than those of LISTA at each iteration, in some cases by more than a factor of . To isolate the effects of LAMP’s change in network topology, the aforementioned experiments restrict the shrinkage function to classical soft-thresholding.
Next, in Section IV, we show that the accuracy of LAMP can be significantly improved by jointly learning MSE-optimal shrinkage functions and linear transforms. In particular, we consider several families of shrinkage functions, each controlled by a small number of learnable parameters: piecewise linear functions, exponential shrinkage functions, cubic B-splines, and Bernoulli-Gaussian denoisers. Our work in this section is inspired by [6], which learned cubic B-splines for ISTA, but goes further in that it i) considers shrinkage families beyond splines, ii) jointly learns the shrinkage functions and linear transforms, and iii) includes Onsager correction.
Then, in Section V, we show how the VAMP algorithm from [12] can be unfolded to form a feedforward neural network whose MSE-optimal linear transforms and shrinkage functions can be jointly learned using a variant of backpropagation. Interestingly, we find that the learned LVAMP parameters are nearly identical to the prescribed matched-VAMP parameters (i.e., VAMP under statistically matched prior and likelihood) when the signal is i.i.d. In this sense, matched VAMP “predicts” the parameters learned by backpropagation. Furthermore, since the parameters prescribed by VAMP have an intuitive interpretation based on MMSE estimation principles, VAMP “explains” the parameters learned by backpropagation.
Finally, in Section VII, we apply the proposed networks to two problems arising in fifth-generation (5G) wireless communications: the compressive random access problem and the massive-MIMO channel-estimation problem.
An early version of this work appeared in [1]. There, we proposed the LAMP algorithm and compared it to LISTA. In this work, we go beyond [1] by i) providing justification for our LAMP parameterization (in Appendix A), ii) jointly optimizing the shrinkage functions and the linear stages of LAMP, iii) proposing the LVAMP method, and iv) detailing two applications to 5G communications.
Notation
We use capital boldface letters like $\mathbf{A}$ for matrices, small boldface letters like $\mathbf{a}$ for vectors, $(\cdot)^{\sf T}$ for transposition, and $a_j = [\mathbf{a}]_j$ to denote the $j$th element of $\mathbf{a}$. Also, we use $\|\mathbf{A}\|_2$ for the spectral norm of $\mathbf{A}$, $\|\mathbf{a}\|_p = \big(\sum_j |a_j|^p\big)^{1/p}$ for the $\ell_p$ norm of $\mathbf{a}$ when $p \geq 1$, and $\|\mathbf{a}\|_0$ for the $\ell_0$ or “counting” pseudo-norm of $\mathbf{a}$. Likewise, we use $\mathrm{Diag}(\mathbf{a})$ for the diagonal matrix created from vector $\mathbf{a}$, $\mathbf{I}_N$ for the $N \times N$ identity matrix, and $\mathbf{0}$ for the zero vector. For a random vector $\mathbf{x}$, we denote its probability density function (pdf) by $p(\mathbf{x})$ and its expectation by $\mathrm{E}[\mathbf{x}]$. For a random variable $x$, we denote its variance by $\mathrm{var}[x]$. Similarly, we use $p(x|y)$, $\mathrm{E}[x|y]$, and $\mathrm{var}[x|y]$ for the pdf, expectation, and variance (respectively) conditioned on $y$. We refer to the Dirac delta pdf using $\delta(\cdot)$ and to the pdf of a Gaussian random vector $\mathbf{x}$ with mean $\mathbf{a}$ and covariance $\mathbf{C}$ using $\mathcal{N}(\mathbf{x}; \mathbf{a}, \mathbf{C})$. Finally, we use $\mathrm{sgn}(\cdot)$ to denote the signum function, where $\mathrm{sgn}(x) = 1$ when $x \geq 0$ and $\mathrm{sgn}(x) = -1$ when $x < 0$.
II Iterative Algorithms and Deep Learning
II-A Iterative Algorithms
One of the best known algorithmic approaches to solving the sparse linear inverse problem is through solving the convex optimization problem [16, 17]

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1, \qquad (3)$$

where $\lambda > 0$ is a tunable parameter that controls the tradeoff between sparsity and measurement fidelity in $\hat{\mathbf{x}}$. The convexity of (3) leads to provably convergent algorithms and bounds on the performance of the estimate $\hat{\mathbf{x}}$ (see, e.g., [18]). In the sequel, we will refer to (3) as the “$\ell_1$” problem.
II-A1 ISTA
One of the simplest approaches to solving (3) is the iterative shrinkage/thresholding algorithm (ISTA) [9], which iterates the steps (for $t = 0, 1, 2, \dots$ and $\hat{\mathbf{x}}_0 = \mathbf{0}$)

$$\mathbf{v}_t = \mathbf{y} - \mathbf{A}\hat{\mathbf{x}}_t \qquad (4a)$$
$$\hat{\mathbf{x}}_{t+1} = \eta_{\rm st}(\hat{\mathbf{x}}_t + \beta\mathbf{A}^{\sf T}\mathbf{v}_t; \lambda), \qquad (4b)$$

where $\beta$ is a stepsize, $\mathbf{v}_t$ is the iteration-$t$ residual measurement error, and $\eta_{\rm st}(\cdot; \lambda)$ is the “soft thresholding” shrinkage function, defined componentwise as

$$[\eta_{\rm st}(\mathbf{r}; \lambda)]_j = \mathrm{sgn}(r_j)\max\{|r_j| - \lambda, 0\}. \qquad (5)$$
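For concreteness, the iteration (4) with the soft-thresholding function (5) can be sketched in NumPy as follows. This is a minimal reference implementation, not the paper's code, and the default stepsize heuristic is an assumption:

```python
import numpy as np

def soft_threshold(r, lam):
    """Componentwise soft thresholding, as in (5)."""
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def ista(y, A, lam, beta=None, T=100):
    """ISTA iterations (4a)-(4b) for the l1 problem (3)."""
    if beta is None:
        # a standard stable stepsize choice, 1 / ||A||_2^2 (an assumption here)
        beta = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(T):
        v = y - A @ x                                       # (4a): residual
        x = soft_threshold(x + beta * A.T @ v, beta * lam)  # (4b)
    return x
```

Here the threshold is scaled by the stepsize so that fixed points minimize (3); a statement of (4b) may equivalently absorb this scaling into the threshold itself.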
II-A2 FISTA
II-A3 AMP
Recently, an approximate message passing (AMP) algorithm [11, 20] was proposed for the $\ell_1$ problem. The resulting algorithm, which we call AMP-$\ell_1$, manifests as

$$\mathbf{v}_t = \mathbf{y} - \mathbf{A}\hat{\mathbf{x}}_t + b_t\mathbf{v}_{t-1} \qquad (7a)$$
$$\hat{\mathbf{x}}_{t+1} = \eta_{\rm st}(\hat{\mathbf{x}}_t + \mathbf{A}^{\sf T}\mathbf{v}_t; \lambda_t) \qquad (7b)$$

for $t = 0, 1, 2, \dots$, where $\hat{\mathbf{x}}_0 = \mathbf{0}$, $\mathbf{v}_{-1} = \mathbf{0}$, $b_0 = 0$, and

$$b_t = \tfrac{1}{M}\|\hat{\mathbf{x}}_t\|_0 \qquad (8)$$
$$\lambda_t = \tfrac{\alpha}{\sqrt{M}}\|\mathbf{v}_t\|_2. \qquad (9)$$
In (9), $\alpha$ is a tuning parameter that has a one-to-one correspondence with $\lambda$ in (3) [20]. Comparing AMP-$\ell_1$ to ISTA, we see two major differences: i) AMP’s residual $\mathbf{v}_t$ in (7a) includes the “Onsager correction” term $b_t\mathbf{v}_{t-1}$, and ii) AMP’s shrinkage threshold $\lambda_t$ in (7b) takes the prescribed, $\mathbf{v}_t$-dependent value (9). In the sequel, we explain the rationale behind these differences.
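Both differences are visible in a short NumPy sketch of (7)-(9). The instance sizes and the value of `alpha` in the usage below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def soft_threshold(r, lam):
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def amp_l1(y, A, alpha, T=30):
    """AMP iterations (7a)-(7b) with threshold rule (9) and Onsager gain (8)."""
    M, N = A.shape
    x = np.zeros(N)
    v = np.zeros(M)
    b = 0.0                                   # b_0 = 0 since x_0 = 0
    for _ in range(T):
        v = y - A @ x + b * v                 # (7a): Onsager-corrected residual
        lam = alpha * np.linalg.norm(v) / np.sqrt(M)   # (9)
        x = soft_threshold(x + A.T @ v, lam)  # (7b)
        b = np.count_nonzero(x) / M           # (8): ||x||_0 / M, for next pass
    return x
```

Deleting the `b * v` term and fixing `lam` recovers ISTA with unit stepsize, which makes the two differences easy to isolate experimentally.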
AMP can in fact be used with any Lipschitz-continuous shrinkage function $\eta(\cdot)$. For this, we write the general AMP algorithm as

$$\mathbf{v}_t = \mathbf{y} - \mathbf{A}\hat{\mathbf{x}}_t + b_t\mathbf{v}_{t-1} \qquad (10a)$$
$$\hat{\mathbf{x}}_{t+1} = \eta(\hat{\mathbf{x}}_t + \mathbf{A}^{\sf T}\mathbf{v}_t; \sigma_t, \boldsymbol{\theta}_t) \qquad (10b)$$

for $t = 0, 1, 2, \dots$, where $\hat{\mathbf{x}}_0 = \mathbf{0}$, $\mathbf{v}_{-1} = \mathbf{0}$, $b_0 = 0$, and

$$b_t = \frac{1}{M}\sum_{j=1}^{N}\frac{\partial[\eta(\mathbf{r}_{t-1}; \sigma_{t-1}, \boldsymbol{\theta}_{t-1})]_j}{\partial r_{t-1,j}} \qquad (11)$$
$$\sigma_t^2 = \frac{1}{M}\|\mathbf{v}_t\|_2^2, \qquad (12)$$

with $\mathbf{r}_t$ as defined in (13).
In writing (10b), we assume that the shrinkage function accepts the noise-standard-deviation estimate $\sigma_t$ as an argument. Although this is not a required feature of AMP, we find it useful in the sequel. It is straightforward to show that AMP in (10)-(12) reduces to AMP-$\ell_1$ from (7)-(9) when the shrinkage is soft thresholding with threshold $\alpha\sigma_t$, since then the threshold equals $\tfrac{\alpha}{\sqrt{M}}\|\mathbf{v}_t\|_2 = \lambda_t$ and the divergence (11) equals $\tfrac{1}{M}\|\hat{\mathbf{x}}_t\|_0 = b_t$.
When $\mathbf{A}$ is a typical realization of a large i.i.d. sub-Gaussian random matrix with variance-$1/M$ entries and the shrinkage function $\eta(\cdot)$ has identical scalar components, the Onsager correction decouples the AMP iterations in the sense that the input to the shrinkage function,

$$\mathbf{r}_t \triangleq \hat{\mathbf{x}}_t + \mathbf{A}^{\sf T}\mathbf{v}_t, \qquad (13)$$

can be accurately modeled as⁴

$$\mathbf{r}_t = \mathbf{x}^0 + \mathcal{N}(\mathbf{0}, \sigma_t^2\mathbf{I}_N), \qquad (14)$$

with $\sigma_t^2$ from (12). In other words, the Onsager correction ensures that the shrinkage input $\mathbf{r}_t$ is an AWGN-corrupted version of the true signal $\mathbf{x}^0$ with known variance $\sigma_t^2$. (See Fig. 5(b) for numerical evidence.) The resulting “denoising” problem, that of estimating $\mathbf{x}^0$ from $\mathbf{r}_t$, is well understood.

⁴The AMP model (14), with $\sigma_t$ from (12), is provably accurate in the large-system limit (i.e., $M, N \to \infty$ with $M/N$ converging to a positive constant) [21, 22].
For example, when the elements of $\mathbf{x}^0$ are statistically independent with known prior $p(x)$, the MSE-optimal denoiser⁵ is simply the posterior mean estimator (i.e., $\hat{x}_{t+1,j} = \mathrm{E}[x_j \,|\, r_{t,j}; \sigma_t]$), which can be computed in closed form for many distributions $p(x)$. In the case that $p(x)$ is unknown, we may be more interested in the minimax denoiser, i.e., the minimizer of the maximum MSE over an assumed family of priors. Remarkably, for generic sparse priors, i.e., those of the form $p(x) = (1-\gamma)\delta(x) + \gamma\tilde{p}(x)$ with $\gamma \in (0,1)$ and arbitrary unknown $\tilde{p}(\cdot)$, soft-thresholding (5) with a threshold proportional to the AWGN standard deviation (i.e., $\lambda_t = \alpha\sigma_t$, recalling (12)) is nearly minimax optimal [20]. Thus, we can interpret the AMP-$\ell_1$ algorithm (7) as a nearly minimax approach to the sparse linear inverse problem under an unknown sparse prior.

⁵AMP with MSE-optimal denoising was first described in [23].
II-A4 Vector AMP
Very recently, the VAMP algorithm (see Algorithm 1) was proposed in [12] to address AMP’s fragility with respect to the matrix $\mathbf{A}$. The VAMP algorithm retains all the desirable properties of the original AMP (i.e., low per-iteration complexity, very few iterations to convergence, and shrinkage inputs that obey the AWGN model (14)), but over a much larger class of matrices $\mathbf{A}$: those that are large and right-rotationally invariant.
A right-rotationally invariant matrix $\mathbf{A}$ is a random matrix whose distribution remains the same after right multiplication by any fixed orthogonal matrix. An intuitive understanding of such matrices arises from their singular value decomposition (SVD). Suppose that

$$\mathbf{A} = \mathbf{U}\,\mathrm{Diag}(\mathbf{s})\,\mathbf{V}^{\sf T} \qquad (15)$$

is the economy-sized⁶ SVD of $\mathbf{A}$. For right-rotationally invariant $\mathbf{A}$, the matrix $\mathbf{V}$ will contain the first $M$ columns of a matrix that is uniformly distributed on the group of $N \times N$ orthogonal matrices. Note that i.i.d. Gaussian matrices are a special case of right-rotationally invariant, one where $\mathbf{U}$ is random orthogonal and $\mathbf{s}$ has a particular distribution. Importantly, VAMP behaves well under any orthogonal $\mathbf{U}$ and any singular values $\mathbf{s}$, as long as the dimensions are large enough [12].

⁶By “economy-sized,” we mean that, when $M \leq N$, the vector $\mathbf{s} \in \mathbb{R}^M$ contains the positive singular values of $\mathbf{A}$, $\mathbf{U} \in \mathbb{R}^{M \times M}$, and $\mathbf{V} \in \mathbb{R}^{N \times M}$.
The VAMP algorithm is defined in Algorithm 1. The algorithm can be seen to consist of two stages, each comprising the same four steps: estimation (lines 4 and 9), divergence computation (lines 5 and 10), Onsager correction (lines 6 and 11), and variance computation (lines 7 and 12). The only difference between the two stages is their choice of estimator. The first stage uses
(16)  
which depends on the measurements and the parameters
(17) 
while the second stage performs componentwise nonlinear shrinkage via $\eta(\cdot; \sigma_t, \boldsymbol{\theta}_t)$, just as in step (10b) of the AMP algorithm.
Lines 5 and 10 in Algorithm 1 compute the average of the diagonal entries of the Jacobian of the first-stage and second-stage estimators, respectively. That is, for an estimator $\mathbf{f}: \mathbb{R}^N \to \mathbb{R}^N$ with input $\mathbf{r}$,

$$\langle \mathbf{f}'(\mathbf{r}) \rangle \triangleq \frac{1}{N}\sum_{j=1}^{N}\frac{\partial[\mathbf{f}(\mathbf{r})]_j}{\partial r_j}. \qquad (18)$$
From (16), we see that the Jacobian of the first-stage estimator is
(19) 
and so the average of its diagonal (i.e., $\tfrac{1}{N}$ times its trace) is
(20) 
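The average-of-the-Jacobian-diagonal quantity in (18) is easy to check numerically. The sketch below compares a finite-difference estimate against the known closed form for soft thresholding, whose diagonal Jacobian entries are 1 where the input magnitude exceeds the threshold and 0 elsewhere. This is an illustration of (18), not a step of the algorithm:

```python
import numpy as np

def soft_threshold(r, lam):
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def avg_jacobian_diag(f, r, eps=1e-6):
    """Finite-difference estimate of (1/N) * sum_j d[f(r)]_j / dr_j, cf. (18)."""
    N = r.size
    total = 0.0
    for j in range(N):
        e = np.zeros(N)
        e[j] = eps
        total += (f(r + e)[j] - f(r - e)[j]) / (2.0 * eps)
    return total / N

rng = np.random.default_rng(0)
r = rng.normal(size=200)
lam = 0.5
fd = avg_jacobian_diag(lambda u: soft_threshold(u, lam), r)
closed_form = np.mean(np.abs(r) > lam)  # fraction of "active" coordinates
```

The same finite-difference estimator applies to any Lipschitz estimator for which no closed-form divergence is available.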
The first-stage estimator in (16) can be interpreted as computing the MMSE estimate of $\mathbf{x}$ under the likelihood function

$$\ell(\mathbf{x}; \mathbf{y}) = \mathcal{N}(\mathbf{y}; \mathbf{A}\mathbf{x}, \sigma_w^2\mathbf{I}_M), \qquad (21)$$

which follows from (2) under the assumption $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2\mathbf{I}_M)$, and the pseudo-prior

$$\mathbf{x} \sim \mathcal{N}(\tilde{\mathbf{r}}_t, \tilde{\sigma}_t^2\mathbf{I}_N). \qquad (22)$$

We refer to (22) as a “pseudo” prior because it is constructed internally by VAMP at each iteration $t$. The MMSE estimate of $\mathbf{x}$ is then given by the conditional mean $\mathrm{E}[\mathbf{x} \,|\, \tilde{\mathbf{r}}_t, \mathbf{y}]$, which in the case of (21)-(22) is

$$\big(\sigma_w^{-2}\mathbf{A}^{\sf T}\mathbf{A} + \tilde{\sigma}_t^{-2}\mathbf{I}_N\big)^{-1}\big(\sigma_w^{-2}\mathbf{A}^{\sf T}\mathbf{y} + \tilde{\sigma}_t^{-2}\tilde{\mathbf{r}}_t\big). \qquad (23)$$

Replacing $\mathbf{A}$ in (23) with its SVD from (15) yields the expression in (16). Since the estimate is linear in $\tilde{\mathbf{r}}_t$ and $\mathbf{y}$, we refer to the first stage as the “linear MMSE” (LMMSE) stage.
The second-stage estimator $\eta(\cdot)$, in line 9 of Algorithm 1, essentially denoises the pseudo-measurement

$$\mathbf{r}_t = \mathbf{x}^0 + \mathcal{N}(\mathbf{0}, \sigma_t^2\mathbf{I}_N). \qquad (24)$$

The AWGN-corruption model (24) holds under large, right-rotationally invariant $\mathbf{A}$ and shrinkage $\eta(\cdot)$ with identical components, as proven in [12]. If the prior $p(x)$ on $\mathbf{x}^0$ was known,⁷ then it would be appropriate to choose the MMSE denoiser for $\eta(\cdot)$:

$$[\eta(\mathbf{r}_t; \sigma_t)]_j = \mathrm{E}[x_j \,|\, r_{t,j}; \sigma_t], \qquad (25)$$

where the expectation is taken over the posterior implied by the prior $p(x_j)$ and the AWGN model (24).

⁷Although the prior and noise variance are often unknown in practice, they can be learned online using the EM-VAMP approach from [25].
With an i.i.d. signal and MMSE denoiser, VAMP produces a sequence $\{\hat{\mathbf{x}}_t\}$ whose fixed points have MSE consistent with the replica prediction of MMSE from [26]. In the sequel, we shall refer to VAMP with MMSE i.i.d.-signal denoising and known noise variance $\sigma_w^2$ as “matched VAMP.”
In summary, VAMP alternates between i) MMSE inference of $\mathbf{x}$ under the likelihood (21) and the pseudo-prior (22), and ii) MMSE inference of $\mathbf{x}$ under the pseudo-measurement model (24) and the prior $p(\mathbf{x})$. The intermediate quantities passed between the two stages are updated in each stage of VAMP using Onsager correction terms built from the divergences⁸ associated with the two estimators. Essentially, the Onsager correction acts to decouple the two stages (and iterations) of VAMP from each other so that local MSE optimization at each stage leads to global MSE optimization of the algorithm.

⁸Notice that the Onsager correction term in AMP step (10a) also involves a (scaled) divergence, $b_t$, defined in (11).
II-A5 Comparison of ISTA, FISTA, AMP, and VAMP
For illustration, we now compare the average per-iteration behavior of ISTA, FISTA, AMP, and VAMP in two scenarios: i) for an $\mathbf{A}$ drawn i.i.d. Gaussian, and ii) when the singular values of the same $\mathbf{A}$ are replaced by a geometric series, $s_i = s_1\rho^{i-1}$, with $\rho$ set to achieve a large condition number $s_1/s_M$ and $s_1$ set to preserve the norm of $\mathbf{A}$. In both cases, the elements of $\mathbf{x}^0$ were drawn i.i.d. Gaussian with a fixed activation probability and were otherwise set to zero (i.e., $\mathbf{x}^0$ was Bernoulli-Gaussian), and the noise $\mathbf{w}$ was i.i.d. Gaussian with variance set to yield a fixed signal-to-noise ratio (SNR). Recall that ISTA, FISTA, AMP, and VAMP all estimate $\mathbf{x}^0$ by iteratively minimizing (3) for a chosen value of $\lambda$ (selected via $\alpha$ in the case of AMP and VAMP). We chose the minimax optimal value of $\alpha$ for AMP and VAMP [20], and we used the corresponding $\lambda$ for ISTA and FISTA.
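A matrix with geometrically decaying singular values, as in scenario ii), can be constructed as follows. The condition number of 100 and the Frobenius-norm normalization are illustrative assumptions, since the exact experimental values are not reproduced in this excerpt:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, kappa = 250, 500, 100.0   # kappa: illustrative condition number

# take the singular vectors of an i.i.d. Gaussian matrix ...
A0 = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))
U, _, Vt = np.linalg.svd(A0, full_matrices=False)

# ... and replace the singular values by a geometric series s_i = s_1 rho^(i-1)
rho = kappa ** (-1.0 / (M - 1))      # chosen so that s_1 / s_M = kappa
s = rho ** np.arange(M)
s *= np.sqrt(N / np.sum(s ** 2))     # normalize ||A||_F^2 = N (an assumption)

A = U @ np.diag(s) @ Vt
```

Since only the singular values change, scenarios i) and ii) share the same singular-vector statistics, isolating the effect of conditioning.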
Figure 1 shows the average normalized MSE (NMSE) versus iteration $t$, where NMSE $\triangleq \|\hat{\mathbf{x}}_t - \mathbf{x}^0\|_2^2 / \|\mathbf{x}^0\|_2^2$ and the average was taken over many independent realizations of $(\mathbf{x}^0, \mathbf{w})$. In Fig. 1(a), we see that AMP required an order-of-magnitude fewer iterations than FISTA, which required an order-of-magnitude fewer iterations than ISTA. Meanwhile, we see that VAMP required about half the iterations of AMP. In Fig. 1(b), AMP is not shown because it diverged. But VAMP required an order-of-magnitude fewer iterations than FISTA, which required an order-of-magnitude fewer iterations than ISTA.
II-B Deep Learning
In deep learning [3], training data composed of (feature, label) pairs are used to train the parameters of a deep neural network, with the goal of accurately predicting the unknown label associated with a newly observed feature. The deep network accepts the feature and subjects it to many layers of processing, where each layer usually consists of a linear transformation followed by a simple, componentwise nonlinearity.
Typically, the label space is discrete (e.g., the feature is an image and the label is its class in {cat, dog, …, tree}). In our sparse linear inverse problem, however, the “labels” $\mathbf{x}^0$ are continuous and high-dimensional. Remarkably, Gregor and LeCun demonstrated in [4] that a well-constructed deep network can accurately predict even labels such as these.
The neural network architecture proposed in [4] is closely related to the ISTA algorithm discussed in Section II-A1. To understand the relation, we rewrite the ISTA iteration (4) as

$$\hat{\mathbf{x}}_{t+1} = \eta_{\rm st}(\mathbf{S}\hat{\mathbf{x}}_t + \mathbf{B}\mathbf{y}; \lambda) \quad\text{with}\quad \mathbf{B} \triangleq \beta\mathbf{A}^{\sf T},\ \ \mathbf{S} \triangleq \mathbf{I}_N - \mathbf{B}\mathbf{A}, \qquad (26)$$

and “unfold” the iterations $t = 1, \dots, T$, resulting in the $T$-layer feedforward neural network shown in Fig. 2.
Whereas ISTA uses the values of $\mathbf{B}$ and $\mathbf{S}$ prescribed in (26) and a common value of $\lambda$ at all layers, Gregor and LeCun [4] proposed to use layer-dependent thresholds $\{\lambda_t\}_{t=0}^{T-1}$ and “learn” both the thresholds and the matrices from the training data by minimizing the quadratic loss

$$\mathcal{L}_T(\boldsymbol{\Theta}) = \frac{1}{D}\sum_{d=1}^{D}\big\|\hat{\mathbf{x}}_T(\mathbf{y}_d; \boldsymbol{\Theta}) - \mathbf{x}^0_d\big\|_2^2. \qquad (27)$$

Here, $\boldsymbol{\Theta}$ denotes the set of learnable parameters and $\hat{\mathbf{x}}_T(\mathbf{y}_d; \boldsymbol{\Theta})$ the output of the $T$-layer network with input $\mathbf{y}_d$ and parameters $\boldsymbol{\Theta}$. The resulting approach was coined “learned ISTA” (LISTA).
The LISTA network generated estimates of comparable MSE with significantly fewer matrix-vector multiplications than existing algorithms for the $\ell_1$ problem (3) with optimally tuned regularization parameters (e.g., $\lambda$ or $\alpha$). As an example, for the i.i.d. Gaussian version of the problem described in Section II-A5, LISTA took far fewer layers to reach a given NMSE than AMP took iterations,⁹ with FISTA and ISTA requiring more still. (More details will be given in Section VI-A.)

⁹The computational complexity of one layer of LISTA is essentially equal to one iteration of ISTA, FISTA, or AMP.
Other authors have also applied ideas from deep learning to the sparse linear inverse problem. For example, [5] extended the LISTA approach [4] to handle structured sparsity and dictionary learning (i.e., the case where $\mathbf{A}$ is unknown and must be learned from the training data). More recently, [7, 8] extended LISTA from the $\ell_1$-regularized objective of (3) to other sparsity-promoting objectives, and [6] proposed to learn the MSE-optimal scalar shrinkage functions of ISTA by learning the parameters of a B-spline. It has also been proposed to recover signals using deep networks other than the “unfolded” type. For example, convolutional neural networks and stacked denoising autoencoders have been applied to speech enhancement [27], image denoising [28], image deblurring [29, 30], image super-resolution [31], 3D imaging [32], compressive imaging [33, 34, 35], and video compressive sensing [36].
III Learned AMP
As described earlier, LISTA learns the value of the linear transform $\mathbf{S}$ that minimizes MSE on the training data. As noted in [4], however, LISTA’s performance does not degrade after imposing the structure

$$\mathbf{S} = \mathbf{I}_N - \mathbf{B}\mathbf{A}, \qquad (28)$$

where $\mathbf{B} \in \mathbb{R}^{N \times M}$ is learned, as suggested by (26). Since the form of $\mathbf{S}$ in (28) involves only the $NM$ free parameters in $\mathbf{B}$, it is advantageous (in memory and training) over an unstructured $\mathbf{S} \in \mathbb{R}^{N \times N}$ when $M < N$, which is often the case in compressive sensing. The structured $\mathbf{S}$ from (28) leads to network layers of the form shown in Fig. 3, with first-layer inputs $\hat{\mathbf{x}}_0 = \mathbf{0}$ and $\mathbf{y}$.
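A forward pass through such a network takes only a few lines of NumPy. The sketch below uses the structure (28) for the initialization suggested by (26); in LISTA, the matrix and the per-layer thresholds would subsequently be adjusted by training, which is omitted here:

```python
import numpy as np

def soft_threshold(r, lam):
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def lista_init(A, beta):
    """ISTA-suggested starting point B = beta*A^T, S = I - B A, cf. (26)/(28)."""
    B = beta * A.T
    S = np.eye(A.shape[1]) - B @ A
    return B, S

def lista_forward(y, B, S, thresholds):
    """T-layer forward pass: x_{t+1} = eta_st(S x_t + B y; lambda_t)."""
    x = np.zeros(S.shape[0])
    for lam in thresholds:       # one (learnable) threshold per layer
        x = soft_threshold(S @ x + B @ y, lam)
    return x
```

Note that `B @ y` can be computed once and reused by every layer, which is part of why a trained network of few layers is computationally attractive.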
Although not considered in [4], the network in Fig. 3 allows both the matrix and the threshold to vary with the layer $t$, allowing for a modest performance improvement (as will be demonstrated in Section VI-A) at the expense of a $T$-fold increase in memory and training complexity. We will refer to networks that use a fixed matrix over all layers as “tied,” and those that allow a layer-dependent matrix as “untied.”
III-A The LAMP Network
We propose to construct a neural network by unfolding the iterations of AMP-$\ell_1$ from (7). We then propose to learn the MSE-optimal values of the network parameters from the training data $\{(\mathbf{x}^0_d, \mathbf{y}_d)\}_{d=1}^{D}$. We will refer to this approach as “learned AMP” (LAMP). The hope is that LAMP will require fewer layers than LISTA to yield an accurate reconstruction, just as AMP requires many fewer iterations than ISTA to do the same (when $\mathbf{A}$ is drawn i.i.d. Gaussian).
Figure 4 shows one layer of the LAMP network. Comparing LAMP to LISTA, we see two main differences:

1) LAMP includes a “bypass” path that carries the residual $\mathbf{v}_t$ forward to the next layer, which is not present in LISTA. This path implements an “Onsager correction” whose goal is to decouple the layers of the network, just as it decoupled the iterations of the AMP algorithm (recall Section II-A3).

2) LAMP’s layer-$t$ shrinkage threshold $\lambda_t$ varies with the realization of $\mathbf{y}$ (through $\|\mathbf{v}_t\|_2$), whereas LISTA’s does not.
III-B Parameterizing LAMP
It is important to realize that LAMP implements a generalization of the AMP-$\ell_1$ algorithm (7), wherein the matrices $(\mathbf{A}, \mathbf{A}^{\sf T})$ manifest as learnable $(\mathbf{A}_t, \mathbf{B}_t)$ at iteration $t$. In other words, the AMP algorithm enforces $\mathbf{A}_t = \mathbf{A}$ and $\mathbf{B}_t = \mathbf{A}^{\sf T}$, whereas the LAMP network does not. An important question is whether this generalization preserves the independent-Gaussian nature (14) of the shrinkage input error, which is the most important feature of AMP. We will show, numerically, that the desired behavior does seem to occur when

$$\mathbf{A}_t = \beta_t\mathbf{A} \qquad (29)$$

with $\beta_t > 0$, at least when $\mathbf{A}$ is i.i.d. Gaussian.
Note that, in (29), “$\mathbf{A}$” refers to the true measurement matrix from (2). If $\mathbf{A}$ was unknown, we could instead use an estimate of $\mathbf{A}$ computed from the training data, as described in Section III-C. But, in many applications of the sparse linear inverse problem, $\mathbf{A}$ is known. Furthermore, if matrix-vector multiplication with $\mathbf{A}$ was known to have a fast implementation (e.g., FFT), then it could be exploited in (29).
In Appendix A, we show that, under the parameterization (29) and some redefinitions of variables, the $t$th layer of the LAMP network can be summarized as

$$\hat{\mathbf{x}}_{t+1} = \beta_t\,\eta_{\rm st}\big(\hat{\mathbf{x}}_t + \mathbf{B}_t\mathbf{v}_t;\ \tfrac{\alpha_t}{\sqrt{M}}\|\mathbf{v}_t\|_2\big) \qquad (30a)$$
$$\mathbf{v}_{t+1} = \mathbf{y} - \mathbf{A}\hat{\mathbf{x}}_{t+1} + \tfrac{\beta_t}{M}\|\hat{\mathbf{x}}_{t+1}\|_0\,\mathbf{v}_t \qquad (30b)$$

with first-layer inputs $\hat{\mathbf{x}}_0 = \mathbf{0}$ and $\mathbf{v}_0 = \mathbf{y}$. The LAMP parameters are then $\{\mathbf{B}, \{\alpha_t, \beta_t\}_{t=0}^{T-1}\}$ in the tied case, or $\{\mathbf{B}_t, \alpha_t, \beta_t\}_{t=0}^{T-1}$ in the untied case.
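One layer of this network can be sketched as below, following the form (30). Treat it as an illustration of the data flow, with the learned quantities for the layer passed in as arguments, rather than as the paper's implementation:

```python
import numpy as np

def soft_threshold(r, lam):
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def lamp_layer(x, v, y, A, B, alpha, beta):
    """One LAMP layer, cf. (30), with learnable B, alpha, beta for this layer."""
    M = y.size
    lam = alpha * np.linalg.norm(v) / np.sqrt(M)   # data-dependent threshold
    x_new = beta * soft_threshold(x + B @ v, lam)  # (30a)
    onsager_gain = (beta / M) * np.count_nonzero(x_new)
    v_new = y - A @ x_new + onsager_gain * v       # (30b): the bypass path
    return x_new, v_new
```

Setting `B = A.T`, `alpha` to the minimax value, and `beta = 1` recovers one iteration of AMP from (7), which is the initialization a learned network would start from.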
Figure 5(c) shows a quantile-quantile (QQ) plot for the error in the input to untied-LAMP’s shrinkage function, $\hat{\mathbf{x}}_t + \mathbf{B}_t\mathbf{v}_t - \mathbf{x}^0$, at a middle layer, using the data from Fig. 1(a). Also shown are the shrinkage inputs for ISTA and AMP. The figure shows that the quantiles of AMP and LAMP fall on the dashed diagonal line, confirming that they are Gaussian distributed. In contrast, the quantiles of ISTA are heavy-tailed.
III-C Learning the LAMP Parameters
For the “tied” case of LAMP, we aim to learn the parameters $\{\mathbf{B}, \{\alpha_t, \beta_t\}_{t=0}^{T-1}\}$ that minimize the MSE on the training data, i.e., the loss (27). In a first attempt to do this, we tried the standard backpropagation approach, where all parameters were jointly optimized from a common initialization, but we found that the parameters converged to a bad local minimum. We conjecture that this failure was a result of overfitting, since the parameterization had many degrees of freedom in our experiments ($\mathbf{B}$ alone contributes $NM$ free parameters). Thus we propose a hybrid of “layer-wise” and “global” optimization that appears to avoid this problem.
Roughly speaking, our approach is to learn the parameters of the first layer, then those of the first two layers, and so on, until all $T$ layers have been learned; at each stage, the optimized set comprises not only the newest layer’s parameters but those of all layers up to and including it. The details of our approach are specified in Algorithm 2. There, line 5 performs layer-wise learning (of layer $t$) and line 6 performs global learning (of all layers up to and including $t$). Note that, in line 2, we do not learn one of the scale parameters but instead leave it at its initial value. The reason is that the parameterization is redundant: since $\eta_{\rm st}(c\mathbf{r}; c\lambda) = c\,\eta_{\rm st}(\mathbf{r}; \lambda)$ for any $c > 0$, different parameter combinations give the same layer output. To avoid this over-parameterization, we fix one scale.
For the untied case of LAMP, we aim to learn the parameters $\{\mathbf{B}_t, \alpha_t, \beta_t\}_{t=0}^{T-1}$. Here we found that extra care was needed to avoid bad local minima. To this end, we implemented a bootstrapping method based on the following rationale: a network that can choose a different $\mathbf{B}_t$ for each layer $t$ should perform at least as well as one that is constrained to use the same $\mathbf{B}$ for all layers $t$. In particular, our bootstrapping method checks performance against tied LAMP at each layer and reinitializes using the tied parameters when appropriate. The details are given in Algorithm 3.
As described in Section III-B, our LAMP parameterization (29) assumes that $\mathbf{A}$ is known. If $\mathbf{A}$ is unknown, it could be estimated using a least-squares (LS) fit¹⁰ to the training data and further optimized along with the other parameters to minimize the loss from (27). Empirically, we find (in experiments not detailed here) that there is essentially no difference between the final test MSEs of LAMP networks trained with known or LS-estimated $\mathbf{A}$.

¹⁰For the least-squares learning of $\mathbf{A}$, one could either use the one-shot approach $\hat{\mathbf{A}} = \mathbf{Y}\mathbf{X}^{+}$, where $\mathbf{Y} \triangleq [\mathbf{y}_1, \dots, \mathbf{y}_D]$, $\mathbf{X} \triangleq [\mathbf{x}^0_1, \dots, \mathbf{x}^0_D]$, and $\mathbf{X}^{+}$ is the pseudo-inverse of $\mathbf{X}$, or one could use backpropagation to minimize the loss $\sum_{d=1}^{D}\|\mathbf{y}_d - \mathbf{A}\mathbf{x}^0_d\|_2^2$.
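The one-shot least-squares fit amounts to a single pseudo-inverse. A small self-contained check, with made-up sizes and noiseless data for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, D = 20, 40, 2000   # illustrative sizes; D training pairs

A_true = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))
X = rng.normal(size=(N, D)) * (rng.random((N, D)) < 0.2)  # sparse signals
Y = A_true @ X                                            # noiseless for clarity

# one-shot LS fit: A_hat = Y X^+, with X^+ the pseudo-inverse of X
A_hat = Y @ np.linalg.pinv(X)
```

When the stacked signal matrix has full row rank, this recovers the operator exactly in the noiseless case; with noisy training data it returns the least-squares fit instead.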
III-D Discussion
In this section, we proposed a LAMP network whose nonlinear stages were constrained to the soft-thresholding shrinkage from (5). Under this constraint, the resulting LAMP network differs from LISTA only in the presence of Onsager correction, allowing us to study the effect of Onsager correction in deep networks. The numerical experiments in Section VI-A show that, as expected, the LAMP network outperforms the LISTA network at every layer for the numerical data used to create Fig. 1.
IV Learned AMP
We now consider the use of generic shrinkage functions within LAMP, with the goal of improving its performance over that of the soft-threshold-based LAMP from Section III. In particular, we aim to learn the jointly MSE-optimal shrinkage functions and linear transforms across all layers of the LAMP network. To make this optimization tractable, we consider several families of shrinkage functions, where each family is parameterized by a finite-dimensional vector $\boldsymbol{\theta}_t$ at layer $t$. We then use backpropagation to learn the jointly MSE-optimal values of $\{\boldsymbol{\theta}_t\}$ and the linear-transform parameters.
IV-A The LAMP Network
For LAMP, we unfold the generic AMP algorithm (10) into a network. As before, we relax the linear transform pair $(\mathbf{A}, \mathbf{A}^{\sf T})$ to the layer-dependent learnable pair $(\mathbf{A}_t, \mathbf{B}_t)$, and then place restrictions on $\mathbf{A}_t$ to facilitate Onsager correction. Previously, the restrictions came in the form of (29), where $\beta_t$ emerged as a tunable parameter. It was then shown, in Appendix A, that $\beta_t$ acted to scale the output of the soft-thresholding function. Since the shrinkage functions that we use in this section have their own scaling mechanisms, it now suffices to use (29) with $\beta_t = 1$. Under this parameterization, the $t$th layer of (general) LAMP becomes

$$\hat{\mathbf{x}}_{t+1} = \eta\big(\hat{\mathbf{x}}_t + \mathbf{B}_t\mathbf{v}_t;\ \sigma_t, \boldsymbol{\theta}_t\big), \qquad \sigma_t = \tfrac{1}{\sqrt{M}}\|\mathbf{v}_t\|_2 \qquad (31a)$$
$$\mathbf{v}_{t+1} = \mathbf{y} - \mathbf{A}\hat{\mathbf{x}}_{t+1} + b_{t+1}\mathbf{v}_t, \qquad b_{t+1} = \frac{1}{M}\sum_{j=1}^{N}\frac{\partial[\eta(\hat{\mathbf{x}}_t + \mathbf{B}_t\mathbf{v}_t; \sigma_t, \boldsymbol{\theta}_t)]_j}{\partial r_j} \qquad (31b)$$

with learnable parameters $\mathbf{B}_t$ and $\boldsymbol{\theta}_t$. See Fig. 6 for an illustration.
IV-B Parameterizing the Shrinkage Functions
In the sequel, we consider families of shrinkage functions that are both separable and odd symmetric. By separable, we mean that $[\eta(\mathbf{r}; \sigma, \boldsymbol{\theta})]_j = \eta(r_j; \sigma, \boldsymbol{\theta})$ for some scalar function $\eta$, and by odd symmetric we mean that $\eta(-r; \sigma, \boldsymbol{\theta}) = -\eta(r; \sigma, \boldsymbol{\theta})$ for all $r$. Several such shrinkage families are detailed below.
IV-B1 Scaled Soft-Threshold
IV-B2 Piecewise Linear
Next we consider (odd symmetric) piecewise linear functions with five segments:

(33)

Here, the shrinkage-family parameters $\boldsymbol{\theta} \in \mathbb{R}^5$ determine the abscissae of the four vertices where the line segments meet and the slopes of the five segments. The shrinkage in (33) can be considered as a generalization of (32) from three to five segments with a possibly nonzero slope on the middle segment. It is inspired by the design from [37, Eq. (13)-(15)] but has a different parameterization and includes a dependence on the estimated noise level $\sigma_t$.
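One plausible realization of such a five-segment shrinkage is sketched below. The specific parameter layout (two vertex abscissae scaled by the noise level, plus three slopes) is our assumption for illustration; the paper's exact parameterization in (33) may differ:

```python
import numpy as np

def eta_pwlin(r, sigma, theta):
    """Odd-symmetric, five-segment piecewise-linear shrinkage (a sketch).

    theta = (t1, t2, s_in, s_mid, s_out): vertices at +/- t1*sigma and
    +/- t2*sigma; slopes s_in / s_mid / s_out on the inner / middle / outer
    segments. Continuity at the vertices is built in.
    """
    t1, t2, s_in, s_mid, s_out = theta
    a = np.abs(r) / sigma
    mag = np.where(
        a <= t1, s_in * a,
        np.where(a <= t2,
                 s_in * t1 + s_mid * (a - t1),
                 s_in * t1 + s_mid * (t2 - t1) + s_out * (a - t2)))
    return np.sign(r) * sigma * mag
```

With `s_in = 0`, `s_mid` and `s_out` positive, this reduces to a soft-threshold-like shape; a nonzero `s_in` gives the middle segment the nonzero slope described above.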
IV-B3 Exponential
We now consider the exponential shrinkage family

(34)

The parameters $\boldsymbol{\theta} \in \mathbb{R}^3$ control the asymptotic slope, the slope at the origin, and the rate of transition between those two slopes (where a larger rate parameter gives a slower transition). The shrinkage in (34) is inspired by the design from [37, Eq. (19)-(20)] but includes a dependence on the estimated noise level $\sigma_t$.
IV-B4 Spline
Next we consider the spline shrinkage family

(35)

where the kernel is the cubic B-spline [38]

(36)

Similar to (34), the parameters $\boldsymbol{\theta} \in \mathbb{R}^3$ in (35) control the asymptotic slope, the slope at the origin, and the rate of transition between those two slopes (where a larger rate parameter gives a slower transition). The shrinkage in (35) is inspired by that used in [6], but is parameterized differently. The shrinkage in [6] was constructed using shifts of the cubic B-spline spread uniformly over the dynamic range of the signal, each scaled by an adjustable weight. By contrast, the shrinkage in (35) has only three adjustable parameters but includes a dependence on the noise level $\sigma_t$. Furthermore, [6] used identical shrinkage parameters at all layers of the ISTA network, whereas we allow the shrinkage parameters to vary across the layers of the LAMP network.
IV-B5 Bernoulli-Gaussian
Finally, we consider shrinkage functions that correspond to MSE-optimal denoisers under zero-mean Bernoulli-Gaussian (BG) priors. That is, $\eta(r; \sigma, \boldsymbol{\theta}) = \mathrm{E}[x \,|\, r]$, where $x$ has the BG prior

$$p(x; \gamma, \phi) = (1 - \gamma)\,\delta(x) + \gamma\,\mathcal{N}(x; 0, \phi) \qquad (37)$$

(with $\gamma \in (0, 1)$ and $\phi > 0$) and $r$ is an AWGN-corrupted measurement of $x$:

$$r = x + e, \qquad e \sim \mathcal{N}(0, \sigma^2). \qquad (38)$$

The MSE-optimal denoiser is then (see, e.g., [39])

$$\mathrm{E}[x \,|\, r] = \frac{r\,\frac{\phi}{\phi + \sigma^2}}{1 + \frac{1-\gamma}{\gamma}\sqrt{\frac{\phi + \sigma^2}{\sigma^2}}\exp\!\left(-\frac{r^2}{2}\left[\frac{1}{\sigma^2} - \frac{1}{\phi + \sigma^2}\right]\right)}. \qquad (39)$$

To turn (39) into a learnable shrinkage function, we treat $\gamma$ and $\phi$ as the learnable parameters in $\boldsymbol{\theta}$ and then simplify, giving

(40)
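The conditional mean for the model (37)-(38) is the Wiener-style gain $\phi/(\phi+\sigma^2)$ applied to $r$, weighted by the posterior probability that $x$ is nonzero. A direct NumPy transcription of (39):

```python
import numpy as np

def eta_bg(r, sigma2, gamma, phi):
    """MMSE denoiser for the Bernoulli-Gaussian prior (37) under AWGN (38):
    x ~ (1-gamma)*delta(x) + gamma*N(0, phi), observed as r = x + N(0, sigma2).
    """
    v = phi + sigma2
    # posterior odds that x = 0 versus x nonzero, given r
    odds = (1.0 - gamma) / gamma * np.sqrt(v / sigma2) \
           * np.exp(-0.5 * r ** 2 * (1.0 / sigma2 - 1.0 / v))
    pi_nonzero = 1.0 / (1.0 + odds)
    # conditional mean given "nonzero" is the Wiener estimate phi/(phi+sigma2)*r
    return pi_nonzero * (phi / v) * r
```

For small $|r|$ the nonzero-probability factor shrinks the output well below the Wiener estimate, while for large $|r|$ the denoiser approaches the Wiener gain, giving the familiar S-shaped shrinkage curve.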
IV-C Learning the LAMP Parameters
As before, we consider two cases of LAMP: the “tied” case, where the same linear transform is used at all layers of the network, and the “untied” case, where a different linear transform is allowed in each layer. Thus, the parameters for tied LAMP are $\{\mathbf{B}, \{\boldsymbol{\theta}_t\}_{t=0}^{T-1}\}$ and those for untied LAMP are $\{\mathbf{B}_t, \boldsymbol{\theta}_t\}_{t=0}^{T-1}$. These parameters are then learned using the method described in Section III-C, now with $\{\alpha_t, \beta_t\}$ replaced by $\boldsymbol{\theta}_t$.
IV-D Discussion
In this section, we constructed a generalized LAMP deep network by unfolding the AMP algorithm from [11], parameterizing its linear and nonlinear stages in novel ways, and learning the parameters using a hybrid of layer-wise and global optimization. The numerical experiments in Section VI suggest that this LAMP performs quite well with i.i.d. Gaussian $\mathbf{A}$: after a modest number of layers, untied LAMP’s NMSE comes close to the support-oracle bound and substantially improves on that of the (tied) LISTA approach from [4].
For non-i.i.d.-Gaussian $\mathbf{A}$, and especially ill-conditioned $\mathbf{A}$, however, the performance of LAMP suffers. Also, it is not clear how to interpret the parameters learned by LAMP, even in the case of i.i.d. Gaussian $\mathbf{A}$. Both problems stem from the fact that LAMP can be viewed as a generalization of AMP that uses the learned matrices $(\mathbf{A}_t, \mathbf{B}_t)$ in place of $(\mathbf{A}, \mathbf{A}^{\sf T})$ at the $t$th iteration. We aim to resolve these issues using the method presented in the next section.
V Learned Vector-AMP
As described in Section II-A, the behavior of AMP is well understood when $\mathbf{A}$ is i.i.d. sub-Gaussian, but even small deviations from this model can lead AMP to diverge or at least behave in ways that are not well understood. Very recently, however, the VAMP algorithm has been proposed as a partial solution to this problem. That is, VAMP enjoys the same benefits as AMP but works with a much larger class of matrices $\mathbf{A}$: those that are right-rotationally invariant. Perhaps, by building a deep network around the VAMP algorithm, we can circumvent the problems with LAMP that arise with non-i.i.d.-Gaussian matrices.
V-A The LVAMP Network
We propose to unfold the VAMP algorithm into a network and learn the MSE-optimal values of its parameters. The $t$th layer of the learned VAMP (LVAMP) network is illustrated in Fig. 7. Essentially, it consists of four operations: 1) LMMSE estimation, 2) decoupling, 3) shrinkage, and 4) decoupling, where the two decoupling stages are identical.
With an i.i.d. signal, the LMMSE estimator takes the form (23). Plugging the SVD (15) into (23) yields (16). Thus, since VAMP assumes an i.i.d. signal, its LMMSE stage is parameterized by the quantities in (17) for all iterations. For generality, we allow LVAMP to vary these parameters with the layer $t$.
With noni.i.d. (e.g., correlated) signals, the LMMSE estimator also depends on the signal covariance matrix, which may not be explicitly known. In this case, it makes more sense to parameterize