Plug-and-Play Methods for Magnetic Resonance Imaging (long version)

# Plug-and-Play Methods for Magnetic Resonance Imaging (long version)

Rizwan Ahmad, Charles A. Bouman, Gregery T. Buzzard, Stanley Chan,
Sizhuo Liu, Edward T. Reehorst, and Philip Schniter
R. Ahmad and S. Liu are with the Department of Biomedical Engineering, The Ohio State University, Columbus OH, 43210, USA, e-mail: ahmad.46@osu.edu and liu.6221@osu.edu.C. Bouman and S. Chan are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, 47907, USA, e-mail: bouman@purdue.edu and stanchan@purdue.edu.G. Buzzard is with the Department of Mathematics, Purdue University, West Lafayette, IN, 47907, USA, e-mail: buzzard@purdue.edu.E. Reehorst and P. Schniter are with the Department of Electrical and Computer Engineering, The Ohio State University, Columbus OH, 43210, USA, e-mail: reehorst.3@osu.edu and schniter.1@osu.edu.
###### Abstract

Magnetic Resonance Imaging (MRI) is a non-invasive diagnostic tool that provides excellent soft-tissue contrast without the use of ionizing radiation. Compared to other clinical imaging modalities (e.g., CT or ultrasound), however, the data acquisition process for MRI is inherently slow, which motivates undersampling and thus drives the need for accurate, efficient reconstruction methods from undersampled datasets. In this article, we describe the use of “plug-and-play” (PnP) algorithms for MRI image recovery. We first describe the linearly approximated inverse problem encountered in MRI. Then we review several PnP methods, where the unifying commonality is to iteratively call a denoising subroutine as one step of a larger optimization-inspired algorithm. Next, we describe how the result of the PnP method can be interpreted as a solution to an equilibrium equation, allowing convergence analysis from the equilibrium perspective. Finally, we present illustrative examples of PnP methods applied to MRI image recovery.

## I Introduction

Magnetic Resonance Imaging (MRI) uses radiofrequency waves to non-invasively evaluate the structure, function, and morphology of soft tissues. MRI has become an indispensable imaging tool for diagnosing and evaluating a host of conditions and diseases. Compared to other clinical imaging modalities (e.g., CT or ultrasound), however, MRI suffers from slow data acquisition. A typical clinical MRI exam consists of multiple scans and can take more than an hour to complete. For each scan, the patient may be asked to stay still for several minutes, with slight motion potentially resulting in image artifacts. Furthermore, dynamic applications demand collecting a series of images in quick succession. Due to the limited time window in many dynamic applications (e.g., contrast enhanced MR angiography), it is not feasible to collect fully sampled datasets. For these reasons, MRI data is often undersampled. Consequently, computationally efficient methods to recover high-quality images from undersampled MRI data have been actively researched for the last two decades.

The combination of parallel (i.e., multi-coil) imaging and compressive sensing (CS) has been shown to benefit a wide range of MRI applications [1][2], including dynamic applications, and has been included in the default image processing frameworks offered by several major MRI vendors. More recently, learning-based techniques (e.g., [3, 4][5, 6, 7, 8, 9][10, 11, 12, 13, 14]) have been shown to outperform CS methods. Some of these techniques learn the entire end-to-end mapping from undersampled k-space or aliased images to recovered images (e.g.,  [4][7]). Considering that the forward model in MRI changes from one dataset to the next, such methods have to be either trained over a large and diverse data corpus or limited to a specific application. Other methods train scan-specific convolutional neural networks (CNN) on a fully-sampled region of k-space and then use it to interpolate missing k-space samples [8]. These methods do not require separate training data but demand a fully sampled k-space region. Due to the large number of unknowns in CNN, such methods require a fully sampled region that is larger than that typically acquired in parallel imaging, limiting the acceleration that can be achieved. Other supervised learning methods are inspired by classic variational optimization methods and iterate between data-consistency enforcement and a trained CNN, which acts as a regularizer [6][11]. Such methods require a large number of fully sampled, multi-coil k-space datasets, which may be difficult to obtain in many applications. Also, since CNN training occurs in the presence of dataset-specific forward models, generalization from training to test scenarios remains an open question [9]. Other learning-based methods have been proposed based on bi-level optimization (e.g., [10]), adversarially learned priors (e.g., [12]), and autoregressive priors (e.g., [13]). Consequently, the integration of learning-based methods into physical inverse problems remains a fertile area of research. There are many directions for improvement, including recovery fidelity, computational and memory efficiency, robustness, interpretability, and ease-of-use.

This article focuses on “plug-and-play” (PnP) algorithms [15], which alternate image denoising with forward-model based signal recovery. PnP algorithms facilitate the use of state-of-the-art image models through their manifestations as image denoisers, whether patch-based (e.g., [16]) or deep neural network (DNN) based (e.g., [17][18]). The fact that PnP algorithms decouple image modeling from forward modeling has advantages in compressive MRI, where the forward model can change significantly among different scans due to variations in the coil sensitivity maps, sampling patterns, and image resolution. Furthermore, fully sampled k-space MRI data is not needed for PnP; the image denoiser can be learned from MRI image patches, or possibly even magnitude-only patches. The objective of this article is two-fold: i) to review recent advances in plug-and-play methods, and ii) to discuss their application to compressive MRI image reconstruction.

The remainder of the paper is organized as follows. We first detail the inverse problem encountered in MRI reconstruction. We then review several PnP methods, where the unifying commonality is to iteratively call a denoising subroutine as one step of a larger optimization-inspired algorithm. Next, we describe how the result of the PnP method can be interpreted as a solution to an equilibrium equation, allowing convergence analysis from the equilibrium perspective. Finally, we present illustrative examples of PnP methods applied to MRI image recovery.

## Ii Image recovery in compressive MRI

In this section, we describe the standard linear inverse problem formulation in MRI. We acknowledge that more sophisticated formulations exist (see, e.g., [19] for a more careful modeling of physics effects). Briefly, the measurements are samples of the Fourier transform of the image, where the Fourier domain is often referred to as “k-space.” The transform can be taken across two or three spatial dimensions and includes an additional temporal dimension in dynamic applications. Furthermore, measurements are often collected in parallel from receiver coils. In dynamic parallel MRI with Cartesian sampling, the time- k-space measurements from the coil take the form

 y(t)i=P(t)FSix(t)+w(t)i, (1)

where is the vectorized 2D or 3D image at discrete time , is a diagonal matrix containing the sensitivity map for the coil, is the 2D or 3D discrete Fourier transform (DFT), the sampling matrix contains rows of the identity matrix, and is additive white Gaussian noise (AWGN). Often the sampling pattern changes across frames . The MRI literature often refers to as the “acceleration rate.” The AWGN assumption, which does not hold for the measured parallel MRI data, is commonly enforced using noise pre-whitening filters, which yields the model (1) but with diagonal “virtual” coil maps [20][21]. Additional justification of the AWGN model can be found in [22].

MRI measurements are acquired using a sequence of measurement trajectories through k-space. These trajectories can be Cartesian or non-Cartesian in nature. Cartesian trajectories are essentially lines through k-space. In the Cartesian case, one k-space dimension (i.e., the frequency encoding) is fully sampled, while the other one or two dimensions (i.e., the phase encodings) are undersampled to reduce acquisition time. Typically, one line, or “readout,” is collected after every RF pulse, and the process is repeated several times to collect adequate samples of k-space. Non-Cartesian trajectories include radial or spiral curves, which have the effect of distributing the samples among all dimensions of k-space. Compared to Cartesian sampling, non-Cartesian sampling provides more efficient coverage of k-space and yields an “incoherent” forward operator that is more conducive to compressed-sensing reconstruction [23]. But Cartesian sampling remains the method of choice in clinical practice, due to its higher tolerance to system imperfections and an extensive record of success.

Since the sensitivity map, , is patient-specific and varies with the location of the coil with respect to the imaging plane, both and are unknown in practice. Although calibration-free methods have been proposed to estimate (e.g., [24]) or to jointly estimate and (e.g., [25]), it is more common to first estimate through a calibration procedure and then treat as known in (1). Stacking , , and into vectors , , and , and packing into a known block-diagonal matrix , we obtain the linear inverse problem of recovering from

 y=Ax+w,w∼N(0,σ2I), (2)

where denotes a circularly symmetric complex-Gaussian random vector.

## Iii Signal Recovery and Denoising

The maximum likelihood (ML) estimate of from in (2) is , where , the probability density of conditioned on , is known as the “likelihood function.” The ML estimate is often written in the equivalent form . In the case of -variance AWGN , we have that , and so , which can be recognized as least-squares estimation. Although least-squares estimation can give reasonable performance when is tall and well conditioned, this is rarely the case under moderate to high acceleration (i.e., ). With acceleration, it is critically important to exploit prior knowledge of signal structure.

The traditional approach to exploiting such prior knowledge is to formulate and solve an optimization problem of the form

 ˆx=argminx{12σ2∥y−Ax∥22+ϕ(x)}, (3)

where the regularization term encodes prior knowledge of . In fact, in (3) can be recognized as the maximum a posteriori (MAP) estimate of under the prior density model . To see why, recall that the MAP estimate maximizes the posterior distribution . That is, . Since Bayes’ rule implies that , we have

 ˆx\sf map=argminx{−lnp(y|x)−lnp(x)}. (4)

Recalling that the first term in (3) (i.e., the “loss” term) was observed to be (plus a constant) under AWGN noise, the second term in (3) must obey . We will find this MAP interpretation useful in the sequel.

It is not easy to design good regularizers for use in (3). They must not only mimic the negative log signal-prior, but also enable tractable optimization. One common approach is to use with a tight frame (e.g., a wavelet transform) and a tuning parameter [26]. Such regularizers are convex, and the norm rewards sparsity in the transform outputs when used with the quadratic loss. One could go further and use the composite penalty . Due to the richness of data structure in MRI, especially for dynamic applications, utilizing multiple () linear sparsifying transforms has been shown to improve recovery quality [27], but tuning multiple regularization weights is a non-trivial problem [28].

Particular insight comes from considering the special case of , where the measurement vector in (2) reduces to an AWGN-corrupted version of the image , i.e.,

 z=x+w,w∼N(0,σ2I). (5)

The problem of recovering from noisy , known as “denoising,” has been intensely researched for decades. While it is possible to perform denoising by solving a regularized optimization problem of the form (3) with , today’s state-of-the-art approaches are either algorithmic (e.g., [16]) or DNN-based (e.g., [17][18]). This begs an important question: can these state-of-the-art denoisers be leveraged for MRI signal reconstruction, by exploiting the connections between the denoising problem and (3)? As we shall see, this is precisely what the PnP methods do.

## Iv Plug-and-Play Methods

In this section, we review several approaches to PnP signal reconstruction. What these approaches have in common is that they recover from measurements of the form (2) by iteratively calling a sophisticated denoiser within a larger optimization or inference algorithm.

### Iv-a Prox-based PnP

To start, let us imagine how the optimization in (3) might be solved. Through what is known as “variable splitting,” we could introduce a new variable, , to decouple the regularizer from the data fidelity term . The variables and could then be equated using an external constraint, leading to the constrained minimization problem

 ˆx=argminx∈CNminv∈CN{12σ2∥y−Ax∥22+ϕ(v)}subject tox=v. (6)

Equation (6) suggests an algorithmic solution that alternates between separately estimating and estimating , with an additional mechanism to asymptotically enforce the constraint .

The original PnP method [15] is based on the alternating direction method of multipliers (ADMM) [29]. For ADMM, (6) is first reformulated as the “augmented Lagrangian”

 minx,vmaxλ{12σ2∥y−Ax∥22+ϕ(v)+Re{λ{H}(x−v)}+12ν∥x−v∥2}, (7)

where are Lagrange multipliers and is a penalty parameter that affects the convergence speed of the algorithm, but not the final solution. With , (7) can be rewritten as

 minx,vmaxu{12σ2∥y−Ax∥22+ϕ(v)+12ν∥x−v+u∥2−12ν∥u∥2}. (8)

ADMM solves (8) by alternating the optimization of and with gradient ascent of , i.e.,

 xk =h(vk−1−uk−1;ν) (9a) vk =proxϕ(xk+uk−1;ν) (9b) uk =uk−1+(xk−vk), (9c)

where and , known as “proximal maps” (see the tutorial [30]) are defined as

 proxϕ(z;ν) ≜argminx{ϕ(x)+12ν∥x−z∥2} (10) h(z;ν) ≜argminx{12σ2∥y−Ax∥2+12ν∥x−z∥2}=prox∥y−Ax∥2/(2σ2)(z;ν) (11) =(A{H}A+σ2νI)−1(A{H}y+σ2νz). (12)

Under some weak technical constraints, it can be proven [29] that when is convex, the ADMM iteration (9) converges to , the global minimum of (3) and (6).

From the discussion in Section III, we immediately recognize in (10) as the MAP denoiser of under AWGN variance and signal prior . The key idea behind the original PnP work [15] was, in the ADMM recursion (9), to “plug in” a powerful image denoising algorithm like “block-matching and 3D filtering” (BM3D) [16] in place of the proximal denoiser from (10). If the plug-in denoiser is denoted by “,” then the PnP ADMM algorithm becomes

 xk =h(vk−1−uk−1;ν) (13a) vk =f(xk+uk−1) (13b) uk =uk−1+(xk−vk). (13c)

A wide variety of empirical results (see, e.g., [15, 31, 32][33]) have demonstrated that, when is a powerful denoising algorithm like BM3D, the PnP algorithm (13) produces far better recoveries than the regularization-based approach (9). For parallel MRI, the advantages of PnP ADMM were demonstrated in [34].Although the value of does not change the fixed point of the standard ADMM algorithm (9), it affects the fixed point of the PnP ADMM algorithm (13) through the ratio in (12).

The success of PnP methods raises important theoretical questions. Since is not in general the proximal map of any regularizer , the iterations (13) may not minimize a cost function of the form in (3), and (13) may not be an implementation of ADMM. It is then unclear if the iterations (13) will converge. And if they do converge, it is unclear what they converge to. The consensus equilibrium framework, which we discuss in Section V, aims to provide answers to these questions.

The use of a generic denoiser in place of a proximal denoiser can be translated to non-ADMM algorithms, such as FISTA [35], primal-dual splitting (PDS)[36], and others, as in [37, 38, 39]. Instead of optimizing as in (13), PnP FISTA [37] uses the iterative update

 zk =sk−1−νσ2A%H(Ask−1−y) (14a) xk =f(zk) (14b) sk =xk+qk−1−1qk(xk−xk−1), (14c)

where (14a) is a gradient descent (GD) step on the negative log-likelihood at with step-size , (14b) is the plug-in replacement of the usual proximal denoising step in FISTA, and (14c) is an acceleration step, where it is typical to use and .

PnP PDS [38] uses the iterative update

 xk =f(xk−1−νσ2A{H}vk−1) (15a) vk =γvk−1+(1−γ)(A(2xk−xk−1)−y) (15b)

where is a stepsize, is a relaxation parameter chosen such that , and in (15a) is the plug-in replacement of the usual proximal denoiser .

Comparing PnP ADMM (13) to PnP FISTA (14) and PnP PDS (15), one can see that they differ in how the data fidelity term is handled: PnP ADMM uses the proximal update (12), while PnP FISTA and PnP PDS use the GD step (14a). In most cases, solving the proximal update (12) is much more computationally costly than taking a GD step (14a). Thus, with ADMM, it is common to approximate the proximal update (12) using, e.g., several iterations of the conjugate gradient (CG) algorithm or GD, which should reduce the per-iteration complexity of (13) but may increase the number of iterations. But even with these approximations of (12), PnP ADMM is usually close to “convergence” after 10-50 iterations (e.g., see Figure 4).

An important difference between the aforementioned flavors of PnP is that the stepsize is constrained in FISTA but not in ADMM or PDS. Thus, PnP FISTA restricts the range of reachable fixed points relative to PnP ADMM and PnP PDS.

### Iv-B The balanced FISTA approach

In Section III, when discussing the optimization problem (3), the regularizer was mentioned as a popular option, where is often a wavelet transform. The resulting optimization problem,

 ˆx=argminx{12σ2∥y−Ax∥22+λ∥Ψx∥1}, (16)

is said to be stated in “analysis” form [2]. The proximal denoiser associated with (16) has the form

 proxϕ(z;ν)=argminx{λ∥Ψx∥1+12ν∥x−z∥2}. (17)

When is orthogonal, it is well known that , where

 f\sf tdt(z;τ)≜Ψ{H}soft-thresh(Ψz;τ) (18)

is the “transform-domain thresholding” denoiser with . The denoiser (18) is very efficient to implement, since it amounts to little more than computing forward and reverse transforms.

In practice, (16) yields much better results with non-orthogonal , such as when is a tight frame (see, e.g., the references in [40]). In the latter case, with tall . But, for general tight frames , the proximal denoiser (17) has no closed-form solution. What if we simply plugged the transform-domain thresholding denoiser (18) into an algorithm like ADMM or FISTA? How can we interpret the resulting approach? Interestingly, as we describe below, if (18) is used in PnP FISTA, then it does solve a convex optimization problem, although one with a different form than (3). This approach was independently proposed in [26] and [40], where in the latter it was referred to as “balanced FISTA” (bFISTA) and applied to parallel cardiac MRI. Notably, bFISTA was proposed before the advent of PnP FISTA. More details are provided below.

The optimization problem (16) can be stated in constrained “synthesis” form as

 ˆx=Ψ{H}ˆαforˆα=argminα∈range(Ψ){12σ2∥y−AΨ% {H}α∥22+λ∥α∥1}, (19)

where are transform coefficients. Then, as , (19) can be expressed in the unconstrained form

 ˆx=Ψ{H}ˆαforˆα=argminα{12σ2∥y−AΨ{H}α∥22+β2∥P⊥Ψα∥22+λ∥α∥1} (20)

with projection matrix . In practice, it is not possible to take and, for finite values of , the problems (19) and (20) are not equivalent. However, problem (20) under finite is interesting to consider in its own right, and it is sometimes referred to as the “balanced” approach [41]. If we solve (20) using FISTA with step-size (recall (14a)) and choose the particular value then, remarkably, the resulting algorithm takes the form of PnP FISTA (14) with . This particular choice of is motivated by computational efficiency (since it leads to the use of ) rather than recovery performance. Still, as we demonstrate in Section VI, it performs relatively well.

### Iv-C Regularization by denoising

Another PnP approach, proposed by Romano, Elad, and Milanfar in [42], recovers from measurements in (2) by finding the that solves the optimality condition111We begin our discussion of RED by focusing on the real-valued case, as in [42] and [43], but later we extend the RED algorithms to the complex-valued case of interest in MRI.

 0=1σ2A{T}(Aˆx−y)+1ν(ˆx−f(ˆx)), (21)

where is an arbitrary (i.e., “plug in”) image denoiser and is a tuning parameter. In [42], several algorithms were proposed to solve (21). Numerical experiments in [42] suggest that, when is a sophisticated denoiser (like BM3D) and is well tuned, the solutions to (21) are state-of-the-art, similar to those of PnP ADMM.

The approach (21) was coined “regularization by denoising” (RED) in [42] because, under certain conditions, the that solve (21) are the solutions to the regularized least-squares problem

 ˆx=argminx{12σ2∥y−Ax∥2+ϕ\sf red(x)}~{}with~{}ϕ\sf red(x)≜12νx{T}(x−f(x)), (22)

where the regularizer is explicitly constructed from the plug-in denoiser . But what are these conditions? Assuming that is differentiable almost everywhere, it was shown in [43] that the solutions of (21) correspond to those of (22) when i) is locally homogeneous222Locally homogeneous means that for all and sufficiently small nonzero . and ii) has a symmetric Jacobian matrix (i.e., ). But it was demonstrated in [43] that these properties are not satisfied by popular image denoisers, such as the median filter, transform-domain thresholding, NLM, BM3D, TNRD, and DnCNN. Furthermore, it was proven in [43] that if the Jacobian of is non-symmetric, then there does not exist any regularizer under which the solutions of (21) minimize a regularized loss of the form in (3).

One may then wonder how to justify (21). In [43], Reehorst and Schniter proposed an explanation for (21) based on “score matching” [44], which we now summarize. Suppose we are given a large corpus of training images , from which we could build the empirical prior model

 ˆp\sf x(x) ≜1TT∑t=1δ(x−xt),

where denotes the Dirac delta. Since images are known to exist outside , it is possible to build an improved prior model using kernel density estimation (KDE), i.e.,

 ˜p\sf x(x;ν) ≜1TT∑t=1N(x;xt,νI), (23)

where is a tuning parameter. If we adopt as the prior model for , then the MAP estimate of (recall (4)) becomes

 ˆx=argminx{12σ2∥y−Ax∥2−ln˜p\sf x% (x;ν)}. (24)

Because is differentiable, the solutions to (24) must obey

 0=1σ2A{T}(Aˆx−y)−∇ln˜p\sf x(ˆx;ν). (25)

A classical result known as “Tweedie’s formula” [45, 46] says that

 ∇ln˜p\sf x(z;ν) =1ν(f\sf mmse(z;ν)−z), (26)

where is the minimum mean-squared error (MMSE) denoiser under the prior and -variance AWGN. That is, , where and . Applying (26) to (25), the MAP estimate under the KDE prior obeys

 0=1σ2A{T}(Aˆx−y)+1ν(ˆx−f\sf mmse(ˆx;ν)), (27)

which matches the RED condition (21) when . Thus, if we could implement the MMSE denoiser for a given training corpus , then RED provides a way to compute the MAP estimate of under the KDE prior .

Although the MMSE denoiser can be expressed in closed form (see [43, eqn. (67)]), it is not practical to implement for large . Thus the question remains: Can the RED approach (21) also be justified for non-MMSE denoisers , especially those that are not locally homogeneous or Jacobian-symmetric? As shown in [43], the answer is yes. Consider a practical denoiser parameterized by tunable weights (e.g., a DNN). A typical strategy is to choose to minimize the mean-squared error on , i.e., set , where the expectation is taken over and . By the MMSE orthogonality principle, we have

 (28)

and so we can write

 ˆθ =argminθE{∥f\sf mmse(z;ν)−fθ(z)∥2} (29) =argminθE{∥∥∥∇ln˜p\sf x(z;ν)−1ν(z−fθ(z))∥∥∥2}, (30)

where (30) follows from (26). Equation (30) says that choosing to minimize the MSE is equivalent to choosing so that best matches the “score” . This is an instance of the “score matching” framework, as described in [44].

In summary, the RED approach (21) approximates the KDE-MAP approach (25)-(27) by using a plug-in denoiser to approximate the MMSE denoiser . When , RED exactly implements MAP-KDE, but with a practical , RED implements a score-matching approximation of MAP-KDE. Thus, a more appropriate title for RED might be “score matching by denoising.”

Comparing the RED approach from this section to the prox-based PnP approach from Section IV-A, we see that RED starts with the KDE-based MAP estimation problem (24) and replaces the -based MMSE denoiser with a plug-in denoiser , while PnP ADMM starts with the -based MAP estimation problem (3) and replaces the -based MAP denoiser from (10) with a plug-in denoiser . It has recently been demonstrated that, when the prior is constructed from image examples, MAP recovery often leads to sharper, more natural looking image recoveries than MMSE recovery[47][48, 49, 50]. Thus it is interesting that RED offers an approach to MAP-based recovery using MMSE denoising, which is much easier to implement than MAP denoising[47][48, 49, 50].

Further insight into the difference between RED and prox-based PnP can be obtained by considering the case of symmetric linear denoisers, i.e., with , where we will also assume that is invertible. Although such denoisers are far from state-of-the-art, they are useful for interpretation. It is easy to show [51] that is the proximal map of , i.e., that , recalling (10). With this proximal denoiser, we know that the prox-based PnP algorithm solves the optimization problem

 ˆx\sf pnp=\definecolor[named]pgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0argminx{12σ2∥y−Ax∥2+12νx{T}(W−1−I)x}. (31)

Meanwhile, since is both locally homogeneous and Jacobian-symmetric, we know from (22) that the RED under this solves the optimization problem

 ˆx\sf red=\definecolor[named]pgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0argminx{12σ2∥y−Ax∥2+12νx{T}(I−W)x}. (32)

By comparing (31) and (32), we see a clear difference between RED and prox-based PnP. Using , it can be seen that we can rewrite (32) as

 ˆx\sf red =W−1/2ˆu~{}~{}for~{}~% {}ˆu=\definecolor[named]pgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0argminu{12σ2∥y−AW−1/2u∥2+12νu{T}(W−1−I)u}, (33)

which has the same regularizer as (31) but a different loss and additional post-processing. Section V-B compares RED to prox-based PnP from yet another perspective: consensus equilibrium.

So far, we have described RED as solving for in (21). But how exactly is this accomplished? In the original RED paper [42], three algorithms were proposed to solve (21): GD, inexact ADMM, and a “fixed point” heuristic that was later recognized [43] as a special case of the proximal gradient (PG) algorithm [52, 53]. Generalizations of PG RED were proposed in [43]. The fastest among them is the accelerated-PG RED algorithm, which uses the iterative update333The RED equations (34) and (35) may be used with complex-valued quantities.

 xk =h(vk−1;ν/L) (34a) zk =xk+qk−1−1qk(xk−xk−1) (34b) vk =1Lf(zk)+(1−1L)zk, (34c)

where was defined in (12), line (34b) uses the same acceleration as PnP FISTA (14b), and is a design parameter that can be related to the Lipschitz constant of from (22) (see [43, Sec.V-C]). When and , (34) reduces to the “fixed point” heuristic from [42]. To reduce the implementation complexity of , one could replace (34a) with the GD step

 xk =xk−1−νLσ2A{H}(Avk−1−y), (35)

which achieves a similar complexity reduction as when going from PnP ADMM to PnP FISTA (as discussed in Section IV-A). The result would be an “accelerated GD” [54] form of RED. Convergence of the RED algorithms will be discussed in Section V-B.

### Iv-D Amp

The approximate message passing (AMP) algorithm [55] estimates the signal using the iteration

 vk =√N∥A∥F(y−Axk−1)+1Mtr{Jfk−1(zk−1)}vk−1 (36a) zk =xk−1+A{H}vk (36b) xk =fk(zk) (36c)

starting with and some . In (36), and refer to the dimensions of , denotes the Jacobian operator, and the denoiser may change with iteration . In practice, the trace of the Jacobian in (36a) can be approximated using [56]

 tr{Jfk(zk)}≈p{H}[fk(zk+ϵp)−fk(zk)]/ϵ (37)

using small and random . The original AMP paper [55] considered only real-valued quantities and proximal denoising (i.e., for some ); the complex-valued case was first described in [57] and the PnP version of AMP shown in (36) was proposed in [58][59] under the name “denoising AMP.” This approach was applied to MRI in [60].

The rightmost term in (36a) is known as the “Onsager correction term,” with roots in statistical physics [61]. If we remove this term from (36a), the resulting recursion is no more than unaccelerated FISTA (also known as ISTA [62]), i.e., (14) with and a particular stepsize . The Onsager term is what gives AMP the special properties discussed below.

When is i.i.d. Gaussian and the denoiser is Lipschitz, the PnP AMP iteration (36) has remarkable properties in the large-system limit (i.e., with fixed ), as proven in [63] for the real-valued case. First,

 zk =x0+e~{}~{}for~{}~{}e∼N(0,νkI)~{}~{% }with~{}~{}νk=limM→∞∥vk∥2/M. (38)

That is, the denoiser input behaves like an AWGN corrupted version of the true image . Moreover, as , the mean-squared of AMP’s estimate is exactly predicted by the “state evolution”

 E(νk) (39a) νk+1 =σ2+NME(νk). (39b)

Furthermore, if is the MMSE denoiser (i.e., ) and the state-evolution (39) has a unique fixed-point, then converges to the minimum MSE. Or, if is the proximal denoiser and the state-evolution (39) has a unique fixed-point, then AMP’s converges to the MAP solution, even if is non-convex.

The remarkable properties of AMP are only guaranteed to manifest with large i.i.d. Gaussian . For this reason, a “vector AMP” (VAMP) algorithm was proposed [64] with a rigorous state-evolution and denoiser-input property (38) that hold under the larger class of right-rotationally invariant (RRI)444Matrix is said to be RRI if its singular-value decomposition has Haar , i.e., distributed uniformly over the group of orthogonal matrices for any and . Note that i.i.d. Gaussian is a special case of RRI where is also Haar and the singular values in have a particular distribution. matrices . Later, a PnP version of VAMP was proposed [65] and rigorously analyzed under Lipschitz [66]. VAMP can be understood as the symmetric variant [67] of ADMM with adaptive penalty [68].

## V Understanding PnP through Consensus Equilibrium

The success of the PnP methods in Section IV raises important theoretical questions. For example, in the case of PnP ADMM, if the plug-in denoiser is not the proximal map of any regularizer , then it is not clear what cost function is being minimized (if any) or whether the algorithm will even converge. Similarly, in the case of RED, if the plug-in denoiser is not the MMSE denoiser , then RED no longer solves the MAP-KDE problem, and it is not clear what RED does solve, or whether a given RED algorithm will even converge. In this section, we show that many of these questions can be answered through the consensus equilibrium (CE) framework [69, 51, 39, 43][70]. We start by discussing CE for the PnP approaches from Section IV-A and follow with a discussion of CE for the RED approaches from Section IV-C.

### V-a Consensus equilibrium for prox-based PnP

Let us start by considering the PnP ADMM algorithm (13). Rather than viewing (13) as minimizing some cost function, we can view it as seeking a solution to

 ˆx\sf pnp = h(ˆx\sf pnp−ˆu\sf pnp;ν) (40a) ˆx\sf pnp = f(ˆx\sf pnp+ˆu\sf pnp), (40b)

which, by inspection, must hold when (13) is at a fixed point. Not surprisingly, by setting in the PnP FISTA algorithm (14), it is straightforward to show that it too seeks a solution to (40). It is easy to show that the PnP PDS algorithm [38] seeks the same solution. With (40), the goal of the prox-based PnP algorithms becomes well defined! The pair (40) reaches a consensus in that the denoiser and the data fitting operator agree on the output . The equilibrium comes from the opposing signs on the correction term : the data-fitting subtracts it while the denoiser adds it.

Applying the expression from (12) to (40a), we find that

 ˆu\sf pnp=νσ2A{H}(y−Aˆx\sf pnp% ), (41)

where is the k-space measurement error and is its projection back into the image domain. We now see that provides a correction on for any components that are inconsistent with the measurements, but it provides no correction for errors in that lie outside the row-space of . Plugging (41) back into (40b), we obtain the image-domain fixed-point equation

 ˆx\sf pnp=f((I−νσ2A{H}A)ˆx\sf pnp+νσ2A{H}y). (42)

If , as in (2), and we define the PnP error as , then (42) implies

 ˆx\sf pnp=f(ˆx\sf pnp+νσ2A{% H}(w−Aˆe\sf pnp% )), (43)

which says that the error combines with the k-space measurement noise in such a way that the corresponding image-space correction is canceled by the denoiser .

By viewing the goal of prox-based PnP as solving the equilibrium problem (40), it becomes clear that other solvers beyond ADMM, FISTA, and PDS can be used. For example, it was shown in [69] that the PnP CE condition (40) can be achieved by finding a fixed-point of the system555The paper [69] actually considers the consensus equilibrium among agents, whereas here we consider the simple case of agents.

 z–– =(2G−I)(2F−I)z–– (44) z–– =[z1z2],F(z––)=[h(z1;ν)f(z2)],andG(z––)=[12(z1+z2)12(z1+z2)]. (45)

There exist many algorithms to solve (44). For example, one could use the Mann iteration [30]

 z––(k+1)=(1−γ)z––k+γ(2G−I)(2F−I)z––(k),with γ∈(0,1), (46)

when is nonexpansive. The paper [69] also shows that this fixed point is equivalent to the solution of , in which case Newton’s method or other root-finding methods could be applied.

The CE viewpoint also provides a path to proving the convergence of the PnP ADMM algorithm. Sreehari et al. [31] used a classical result from convex analysis to show that a sufficient condition for convergence is that i) is non-expansive, i.e., for any and , and ii) is a sub-gradient of some convex function, i.e., there exists such that . If these two conditions are met, then PnP ADMM (13) will converge to a global solution. Similar observations were made in other recent studies, e.g., [51][71]. That said, Chan et al. [32] showed that many practical denoisers do not satisfy these conditions, and so they designed a variant of PnP ADMM in which is decreased at every iteration. Under appropriate conditions on and the rate of decrease, this latter method also guarantees convergence, although not exactly to a fixed point of (40) since is no longer fixed.

Similar techniques can be used to prove the convergence of other prox-based PnP algorithms. For example, under certain technical conditions, including non-expansiveness of , it was established [39] that PnP FISTA converges to the same fixed-point as PnP ADMM.

### V-B Consensus equilibrium for RED

Just as the prox-based PnP algorithms in Section IV-A can be viewed as seeking the consensus equilibrium of (40), it was shown in [43] that the proximal-gradient and ADMM-based RED algorithms seek the consensus equilibrium of

 ˆx\sf red =h(ˆx\sf red−ˆu\sf red;ν) (47a) ˆx\sf red =((1+1L)I−1Lf)−1(ˆx\sf red+ˆu\sf red) (47b)

where was defined in (12) and is the algorithmic parameter that appears in (34).666The parameter also manifests in ADMM RED, as discussed in [43]. Since (47) takes the same form as (40), we can directly compare the CE conditions of RED and prox-based PnP.

Perhaps a more intuitive way to compare the CE conditions of RED and prox-based PnP follows from rewriting (47b) as , after which the RED CE condition becomes

 ˆx\sf red =h(ˆx\sf red−ˆu\sf red;ν) (48a) ˆx\sf red =f(ˆx\sf red)+Lˆu\sf red, (48b)

which involves no inverse operations. In the typical case of , we see that (48) matches (40), except that the correction is added after denoising in (48b) and before denoising in (40b).

Yet another way to compare the CE conditions of RED and prox-based PnP is to eliminate the variable. Solving (48b) for gives

 ˆu\sf red=νσ2A{H}(y−Aˆx\sf red% ), (49)

which mirrors the expression for in (41). Then plugging back into (48b) and rearranging, we obtain the fixed-point equation

 ˆx\sf red=f(ˆx\sf red)+Lνσ2A{H}(y−Aˆx\sf red), (50)

or equivalently

 Lνσ2A{H}(Aˆx\sf red−y)=f(ˆx\sf red)−ˆx\sf red, (51)

which says that the data-fitting correction (i.e., the left side of (51)) must balance the denoiser correction (i.e., the right side of (51)).

The CE framework also facilitates the convergence analysis of RED algorithms. For example, using the Mann iteration [30], it was proven in [43] that when is nonexpansive and , the PG RED algorithm converges to a fixed point.

## Vi Demonstration of PnP in MRI

### Vi-a Parallel cardiac MRI

We now demonstrate the application of PnP methods to parallel cardiac MRI. Because the signal is a cine (i.e., a video) rather than a still image, there are relatively few options available for sophisticated denoisers. Although algorithmic denoisers like BM4D [72] have been proposed, they tend to run very slowly, especially relative to the linear operators and . For this reason, we first trained an application specific CNN denoiser for use in the PnP framework. The architecture of the CNN denoiser, implemented and trained in PyTorch, is shown in Figure 1.

For training, we acquired 50 fully sampled, high-SNR cine datasets from eight healthy volunteers. Thirty three of those were collected on a 3 T scanner777The 3 T scanner was a Magnetom Prisma Fit from Siemens Healthineers in Erlangen, Germany and the 1.5 T scanner was a Magnetom Avanto from Siemens Healthineers in Erlangen, Germany. and the remaining 17 were collected on a 1.5 T scanner. Out of the 50 datasets, 28, 7, 7, and 8 were collected in the short-axis, two-chamber, three-chamber, and four-chamber view, respectively. The spatial and temporal resolutions of the images ranged from 1.8 mm to 2.5 mm and from 34 ms to 52 ms, respectively. The images size ranged from to and the number of frames ranged from to . For each of the 50 datasets, the reference image series was estimated as the least-squares solution to (1), with the sensitivity maps estimated from the time-averaged data using ESPIRiT [73]. We added zero-mean, complex-valued i.i.d. Gaussian noise to these “noise-free” reference images to simulate noisy images with SNR of 24 dB. Using a fixed stride of , we decomposed the images into patches of size . The noise-free and corresponding noisy patches were assigned as output and input to the CNN denoiser, with the real and imaginary parts processed as two separate channels. All 3D convolutions were performed using kernels. There were 64 filters of size in the first layer, 64 filters of size in the second through fourth layers, and 2 filters of size in the last layer. We set the minibatch size to four and used the ADAM optimizer with a learning rate of over 400 epochs. The training process was completed in 12 hours on a workstation equipped with a single NVIDIA GPU (GeForce RTX 2080 Ti).

For testing, we acquired four fully sampled cine datasets from two different healthy volunteers, with two image series in the short-axis view, one image series in the two-chamber view, and one image series in the four-chamber view. The spatial and temporal resolutions of the images ranged from 1.9 mm to 2 mm and from 37 ms to 45 ms, respectively. For the four datasets, the space-time signal vector, , in (2) had dimensions of , ,