Regularization by Denoising: Clarifications and New Interpretations
Abstract
Regularization by Denoising (RED), as recently proposed by Romano, Elad, and Milanfar, is a powerful new image-recovery framework that aims to construct an explicit regularization objective from a plug-in image-denoising function. Evidence suggests that the RED algorithms are, indeed, state-of-the-art. However, a closer inspection suggests that explicit regularization may not explain the workings of these algorithms. In this paper, we clarify that the expressions in Romano et al. hold only when the denoiser has a symmetric Jacobian, and we demonstrate that such symmetry does not occur with practical denoisers such as non-local means, BM3D, TNRD, and DnCNN. Going further, we prove that non-symmetric denoising functions cannot be modeled by any explicit regularizer. In light of these results, there remains the question of how to justify the good-performing RED algorithms for practical denoisers. In response, we propose a new framework called "score-matching by denoising" (SMD), which aims to match the score (i.e., the gradient of the log-prior) instead of designing the regularizer itself. We also show tight connections between SMD, kernel density estimation, and constrained minimum mean-squared error denoising. Finally, we provide new interpretations of the RED algorithms proposed by Romano et al., and we propose novel RED algorithms with faster convergence.
I Introduction
Consider the problem of recovering a (vectorized) image $x \in \mathbb{R}^N$ from noisy linear measurements $y \in \mathbb{R}^M$ of the form
(1) $y = \Phi x + e$,
where $\Phi \in \mathbb{R}^{M \times N}$ is a known linear transformation and $e$ is noise. This problem is of great importance in many applications and has been intensely studied for several decades. (See, e.g., the discussion in [1].)
One of the most popular approaches to image recovery is the “variational” approach, where one poses and solves an optimization problem of the form
(2) $\hat{x} = \arg\min_{x} \left\{ \ell(x; y) + \lambda \rho(x) \right\}$.
In (2), $\ell(x;y)$ is a loss function that penalizes mismatch to the measurements, $\rho(x)$ is a regularization term that penalizes mismatch to the image class of interest, and $\lambda > 0$ is a design parameter that trades between loss and regularization. A prime advantage of the variational approach is that efficient, off-the-shelf optimization methods can be readily applied to (2).
A key question is: how should one choose the loss and regularization in (2)? As discussed in the sequel, the MAP-Bayesian interpretation suggests that they should be chosen in proportion to the negative log-likelihood and negative log-prior, respectively. The trouble is that efficient and accurate prior models of images are lacking.
Recently, a breakthrough was made by Romano, Elad, and Milanfar in [1]. Leveraging the relative maturity of image denoising algorithms (see, e.g., [2, 3]), they proposed the regularization by denoising (RED) framework, where an explicit regularizer $\rho(x)$ is constructed from an image denoiser $f: \mathbb{R}^N \to \mathbb{R}^N$ using the simple and elegant rule
(3) $\rho_{\mathrm{RED}}(x) := \tfrac{1}{2} x^\top \left[ x - f(x) \right]$,
Based on this framework, Romano et al. proposed several recovery algorithms (based on steepest descent, ADMM, and fixed-point methods, respectively) that yield state-of-the-art performance in deblurring and super-resolution tasks.
In this paper, we provide some clarifications and new interpretations of the excellent RED algorithms from [1]. Our work was motivated by an interesting empirical observation: with practical denoisers $f$, the RED algorithms do not minimize the RED variational objective "$\ell(x;y) + \lambda \rho_{\mathrm{RED}}(x)$." As we establish in the sequel, the RED regularization (3) is justified only for denoisers with symmetric Jacobians, which unfortunately does not cover many state-of-the-art methods such as non-local means (NLM) [4], BM3D [5], TNRD [6], and DnCNN [7]. In fact, we are able to establish a stronger result: For non-symmetric denoisers, there exists no regularization that explains the RED algorithms from [1].
In light of these (negative) results, there remains the question of how to explain/understand the RED algorithms from [1] when used with non-symmetric denoisers. In response, we propose a framework called "score-matching by denoising" (SMD), which aims to match the score (i.e., the gradient of the log-prior) rather than to design any explicit regularizer. We then show tight connections between SMD, kernel density estimation [8], and constrained minimum mean-squared error (MMSE) denoising. Finally, we provide new interpretations of the RED-ADMM and RED-FP algorithms proposed in [1], and we propose novel RED algorithms with faster convergence.
The remainder of the paper is organized as follows. In Section II we provide more background on RED and related algorithms such as plug-and-play ADMM [9]. In Section III, we discuss the impact of Jacobian symmetry on RED and whether this property holds in practice. In Section IV, we propose the framework of "score-matching by denoising." In Section V, we present new interpretations of the RED-ADMM and RED-FP algorithms from [1], and we propose new algorithms based on Peaceman-Rachford splitting, expectation-consistent approximation, and majorization-minimization. Finally, in Section VI, we conclude.
II Background
II-A The MAP-Bayesian Interpretation
Because it will be useful in the sequel, we now provide a short discussion of Bayesian maximum a posteriori (MAP) estimation [10]. The MAP estimate of $x$ from $y$ is defined as
(4) $\hat{x}_{\mathrm{MAP}} := \arg\max_x p(x|y)$,
where $p(x|y)$ denotes the probability density of $x$ given $y$. Notice that, from Bayes' rule and the monotonically increasing nature of $\ln(\cdot)$, we can write
(5) $\hat{x}_{\mathrm{MAP}} = \arg\min_x \left\{ -\ln p(y|x) - \ln p(x) \right\}$.
MAP estimation (5) has a direct connection to variational optimization (2): the log-likelihood term $-\ln p(y|x)$ corresponds to the loss $\ell(x;y)$ and the log-prior term $-\ln p(x)$ corresponds to the regularization $\lambda\rho(x)$. For example, with additive white Gaussian noise (AWGN) $e \sim \mathcal{N}(0, \sigma^2 I)$, the log-likelihood implies a quadratic loss:
(6) $\ell(x;y) = \frac{1}{2\sigma^2}\|y - \Phi x\|^2$.
Equivalently, the normalized loss $\ell(x;y) = \frac{1}{2}\|y - \Phi x\|^2$ could be used if $\sigma^2$ was absorbed into $\lambda$.
II-B ADMM
A popular approach to solving (2) is through ADMM [11], which we now review. Using variable splitting, (2) becomes
(7) $\hat{x} = \arg\min_{x, v} \left\{ \ell(x;y) + \lambda\rho(v) \right\}$ subject to $x = v$,
which can be converted to the unconstrained problem
(8) $\hat{x} = \arg\min_{x,v} \max_{p} \left\{ \ell(x;y) + \lambda\rho(v) + p^\top(x - v) + \tfrac{\beta}{2}\|x - v\|^2 \right\}$
using Lagrange multipliers (or "dual" variables) $p$ and a design parameter $\beta > 0$. Using $u := p/\beta$, (8) can be simplified to
(9) $\hat{x} = \arg\min_{x,v} \max_{u} \left\{ \ell(x;y) + \lambda\rho(v) + \tfrac{\beta}{2}\|x - v + u\|^2 - \tfrac{\beta}{2}\|u\|^2 \right\}$.
The ADMM algorithm solves (9) by alternating the optimization of $x$, $v$, and $u$, as specified in Algorithm 1. ADMM is known to converge under convex $\ell$ and $\rho$ as well as other mild conditions (see [11]).
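The alternation in Algorithm 1 can be sketched in a few lines. Below is a minimal numpy sketch, assuming the quadratic loss $\ell(x;y) = \frac{1}{2}\|y - \Phi x\|^2$ (i.e., $\sigma^2 = 1$) and a user-supplied proximal/denoising step `prox_rho`; the function name and default parameter values are illustrative, not taken from [1].

```python
import numpy as np

def admm(y, Phi, prox_rho, lam=1.0, beta=1.0, iters=50):
    """Sketch of ADMM (Algorithm 1) for min_x 0.5*||y - Phi x||^2 + lam*rho(x).
    prox_rho(z, nu) solves min_v rho(v) + ||v - z||^2/(2*nu), i.e., it acts as
    a variational denoiser of z with noise variance nu."""
    N = Phi.shape[1]
    x, v, u = np.zeros(N), np.zeros(N), np.zeros(N)
    A = Phi.T @ Phi + beta * np.eye(N)   # x-update normal equations
    Pty = Phi.T @ y
    for _ in range(iters):
        x = np.linalg.solve(A, Pty + beta * (v - u))  # x-update (quadratic)
        v = prox_rho(x + u, lam / beta)               # v-update = denoising of x+u
        u = u + x - v                                 # dual (u) update
    return x
```

For example, with the toy regularizer $\rho(v) = \frac{1}{2}\|v\|^2$, the proximal step is $z/(1+\nu)$, and the iterates converge to the corresponding ridge-regularized solution.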
IiC PlugandPlay ADMM
Importantly, line 3 of Algorithm 1 can be recognized as variational denoising of $z_k := x_k + u_{k-1}$ using regularization $\rho$ and quadratic loss $\frac{1}{2\nu}\|z_k - v\|^2$, where $\nu := \lambda/\beta$ at iteration $k$. Here "denoising" means recovering $x$ from noisy measurements of the form
(10) $z = x + e$, with $e \sim \mathcal{N}(0, \nu I)$,
for some variance $\nu > 0$.
Image denoising has been studied for decades (see, e.g., the overviews [2, 3]), with the result that high-performance methods are now readily available. Today's state-of-the-art denoisers include those based on image-dependent filtering algorithms (e.g., BM3D [5]) or deep neural networks (e.g., TNRD [6], DnCNN [7]). Importantly, many denoisers are not variational in nature, i.e., they do not declare an explicit regularizer $\rho$.
Leveraging the denoising interpretation of ADMM, Venkatakrishnan, Bouman, and Wohlberg [9] proposed to replace line 3 of Algorithm 1 with a call to a sophisticated image denoiser, such as BM3D, and dubbed their approach Plug-and-Play (PnP) ADMM. Numerical experiments show that PnP-ADMM works very well in most cases. However, when the denoiser used in PnP-ADMM comes with no explicit regularization $\rho$, it is not clear what objective PnP-ADMM is minimizing, and PnP-ADMM convergence is difficult to characterize.
Shortly thereafter, a related approach was proposed based on the approximate message passing (AMP) algorithm [12], called Denoising-based AMP (D-AMP) [13]. D-AMP exploits the fact that, for large i.i.d. random $\Phi$, some internal AMP variables naturally obey (10) and are thus amenable to denoising. An extension called D-VAMP that covers the wider class of "right rotationally invariant" random matrices was then proposed in [14]. Rigorous analyses of D-AMP and D-VAMP were recently published in [15] and [16], respectively.
II-D Regularization by Denoising (RED)
As briefly discussed in the Introduction, Romano, Elad, and Milanfar [1] proposed a radically new way to exploit an image denoiser, which they call regularization by denoising (RED). Given an arbitrary image denoiser $f: \mathbb{R}^N \to \mathbb{R}^N$, they proposed to construct an explicit regularizer of the form
(11) $\rho_{\mathrm{RED}}(x) := \tfrac{1}{2} x^\top \left[ x - f(x) \right]$
to use within the variational framework (2). The advantage of using an explicit regularizer is that a wide variety of optimization algorithms can be used to solve (2) and their convergence can be tractably analyzed.
In [1], numerical evidence is presented to show that image denoisers $f$ are locally homogeneous (LH), i.e.,
(12) $f\big((1+\epsilon)x\big) = (1+\epsilon) f(x)$
for sufficiently small $\epsilon > 0$. For such denoisers, Romano et al. claim [1, Eq. (28)] that $\rho_{\mathrm{RED}}$ obeys the gradient rule
(13) $\nabla \rho_{\mathrm{RED}}(x) = x - f(x)$.
If (13) holds, then any minimizer $\hat{x}$ of the variational objective under quadratic loss,
(14) $C_{\mathrm{RED}}(x) := \frac{1}{2\sigma^2}\|y - \Phi x\|^2 + \lambda \rho_{\mathrm{RED}}(x)$,
must yield $\nabla C_{\mathrm{RED}}(\hat{x}) = 0$, i.e., must obey
(15) $0 = \frac{1}{\sigma^2}\Phi^\top(\Phi\hat{x} - y) + \lambda\big(\hat{x} - f(\hat{x})\big)$.
Based on this line of reasoning, Romano et al. proposed several iterative algorithms that find an $\hat{x}$ satisfying the fixed-point condition (15), which we will refer to as "RED algorithms."
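As an illustration of the fixed-point condition (15), the numpy sketch below repeatedly solves the linear system obtained by holding $f(x)$ fixed. It uses $\Phi = I$, $\sigma^2 = 1$, and a toy linear-shrinkage "denoiser" purely for demonstration; it is not the exact implementation of any algorithm from [1].

```python
import numpy as np

def red_fp(y, Phi, f, lam=1.0, sigma2=1.0, iters=100):
    """Sketch of a RED fixed-point iteration: repeatedly solve
    (Phi^T Phi / sigma2 + lam*I) x = Phi^T y / sigma2 + lam*f(x).
    Any fixed point satisfies (15): grad l(x;y) + lam*(x - f(x)) = 0."""
    N = Phi.shape[1]
    A = Phi.T @ Phi / sigma2 + lam * np.eye(N)
    b0 = Phi.T @ y / sigma2
    x = np.zeros(N)
    for _ in range(iters):
        x = np.linalg.solve(A, b0 + lam * f(x))  # hold f(x) fixed, re-solve
    return x
```

With the shrinkage denoiser $f(x) = 0.5x$, the fixed point of (15) can be computed by hand ($\hat{x} = \frac{2}{3}y$ here), which makes the sketch easy to sanity-check.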
III Clarifications on RED
In this section, we first show that the gradient expression (13) holds if and only if the denoiser $f$ is LH and has Jacobian symmetry (JS). We then establish that many popular denoisers lack JS, such as the median filter (MF) [17], non-local means (NLM) [4], BM3D [5], TNRD [6], and DnCNN [7]. For such denoisers, the RED algorithms cannot be explained by $\rho_{\mathrm{RED}}$ in (11). We also show a more general result: when a denoiser lacks JS, there exists no regularizer whose gradient matches (13). Thus, the problem is not the specific form of $\rho_{\mathrm{RED}}$ in (11) but rather the broader pursuit of explicit regularization.
III-A Preliminaries
We first state some definitions and assumptions. In the sequel, we denote the $n$th component of $f(x)$ by $f_n(x)$, the gradient of $\rho$ at $x$ by
(16) $\nabla\rho(x) := \left[ \frac{\partial \rho(x)}{\partial x_1}, \dots, \frac{\partial \rho(x)}{\partial x_N} \right]^\top$,
and the Jacobian of $f$ at $x$ by
(17) $Jf(x) := \left[ \frac{\partial f_n(x)}{\partial x_m} \right]_{n,m=1}^{N}$, i.e., $[Jf(x)]_{nm} = \frac{\partial f_n(x)}{\partial x_m}$.
Without loss of generality, we take $\mathbb{R}^N$ to be the set of possible images. A given denoiser $f$ may involve decision boundaries at which its behavior changes suddenly. We assume that these boundaries form a closed set of measure zero, and work instead with the open set $\mathcal{X} \subset \mathbb{R}^N$ that excludes them, which contains almost all images.
III-B The RED Gradient
We first recall a result that was established in [1].
Lemma 1 (Local homogeneity [1]).
Suppose that the denoiser $f$ is locally homogeneous. Then $f(x) = Jf(x)\, x$.
Proof. Differentiating both sides of (12) with respect to $\epsilon$ and evaluating at $\epsilon = 0$ gives $Jf(x)\, x = f(x)$. ∎
We now state one of the main results of this section.
Lemma 2 (RED gradient).
For $\rho_{\mathrm{RED}}$ defined in (11),
(22) $\nabla\rho_{\mathrm{RED}}(x) = x - \tfrac{1}{2}\left[Jf(x)\right]^\top x - \tfrac{1}{2} f(x)$.
Proof. Writing $\rho_{\mathrm{RED}}(x) = \frac{1}{2}x^\top x - \frac{1}{2}x^\top f(x)$ and applying the product rule to the second term gives $\nabla\rho_{\mathrm{RED}}(x) = x - \frac{1}{2}f(x) - \frac{1}{2}[Jf(x)]^\top x$. ∎
III-C Impossibility of Explicit Regularization
For denoisers that lack Jacobian symmetry (JS), Lemma 3 establishes that the gradient expression (13) does not hold. Yet (13) leads to the fixed-point condition (15) on which all RED algorithms in [1] are based. The fact that these algorithms work well in practice suggests that "$\nabla\rho(x) = x - f(x)$" is a desirable property for a regularizer $\rho$ to have. But the regularization $\rho_{\mathrm{RED}}$ in (11) does not lead to this property when $f$ lacks JS. Thus an important question is:
Does there exist some other regularization $\rho$ for which $\nabla\rho(x) = x - f(x)$ when $f$ is non-JS?
The following theorem provides the answer.
Theorem 1 (Impossibility).
Suppose that the denoiser $f$ has a non-symmetric Jacobian. Then there exists no regularization $\rho$ for which $\nabla\rho(x) = x - f(x)$ for all $x$.
Proof.
To prove the theorem, we view $x \mapsto x - f(x)$ as a vector field. Theorem 4.3.8 in [19] says that a vector field $g$ is conservative if and only if there exists a continuously differentiable potential $\phi$ for which $g = \nabla\phi$. Furthermore, Theorem 4.3.10 in [19] says that if $g$ is conservative, then its Jacobian $Jg$ is symmetric. Thus, by the contrapositive, if the Jacobian is not symmetric, then no such potential exists.
To apply this result to our problem, we define
(28) $g(x) := x - f(x)$
and notice that the desired property is precisely
(29) $\nabla\rho(x) = g(x)$.
Thus, if $Jf(x)$ is non-symmetric, then $Jg(x) = I - Jf(x)$ is non-symmetric, which means that there exists no $\rho$ for which (29) holds. ∎
Thus, the problem is not the specific form of $\rho_{\mathrm{RED}}$ in (11) but rather the broader pursuit of "regularization by denoising."
III-D Analysis of Jacobian Symmetry
The previous sections motivate an important question: Do commonly used image denoisers have sufficient JS?
For some denoisers, JS can be studied analytically. For example, consider the "transform-domain thresholding" (TDT) denoisers of the form
(30) $f(z) = W^\top g(Wz)$,
where $g(\cdot)$ performs componentwise (e.g., soft or hard) thresholding and $W$ is some transform, as occurs in the context of wavelet shrinkage [20], with or without cycle-spinning [21]. Using $g'$ to denote the derivative of $g$, we have
(31) $Jf(z) = W^\top \mathrm{Diag}\big(g'(Wz)\big)\, W$,
and so the Jacobian of $f$ is perfectly symmetric.
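This symmetry is easy to confirm numerically. The sketch below builds a tiny TDT denoiser from a $2\times 2$ orthogonal Haar-like transform with soft-thresholding (the threshold value and test point are illustrative) and checks that a finite-difference Jacobian is symmetric, provided the evaluation point stays away from the threshold kinks:

```python
import numpy as np

def tdt(z, W, tau=0.5):
    """Transform-domain thresholding denoiser f(z) = W^T g(Wz),
    with g = componentwise soft-thresholding at level tau."""
    c = W @ z
    g = np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)
    return W.T @ g

def num_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian: column n = (f(x + eps*e_n) - f(x)) / eps."""
    N = x.size
    J = np.zeros((N, N))
    fx = f(x)
    for n in range(N):
        e = np.zeros(N); e[n] = eps
        J[:, n] = (f(x + e) - fx) / eps
    return J
```

For an orthogonal $W$ and coefficients above threshold, the Jacobian reduces to $W^\top W = I$, so the symmetry check is exact up to floating-point error.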
Another class of denoisers with perfectly symmetric Jacobians are those that are MAP or MMSE optimal under some assumed prior $\tilde{p}$. In the MAP case, $f(z)$ minimizes (over $x$) the cost $\frac{1}{2\nu}\|z - x\|^2 + \tilde\rho(x)$ for noisy input $z$, where $\tilde\rho := -\ln\tilde{p}$. If we define $\tilde\rho_\nu(z) := \min_x \{ \frac{1}{2\nu}\|z-x\|^2 + \tilde\rho(x) \}$, known as the Moreau-Yosida envelope of $\tilde\rho$, then $f(z) = z - \nu\nabla\tilde\rho_\nu(z)$, as discussed in [22]. The elements of the Jacobian are therefore $[Jf(z)]_{nm} = \delta_{nm} - \nu\,\partial^2\tilde\rho_\nu(z)/\partial z_n \partial z_m$, and so the Jacobian matrix is symmetric. In the MMSE case, we have that $f(z) = z - \nabla\rho_{\mathrm{RLL}}(z)$ for $\rho_{\mathrm{RLL}}$ defined in (51) (see Lemma 4), and so $[Jf(z)]_{nm} = \delta_{nm} - \partial^2\rho_{\mathrm{RLL}}(z)/\partial z_n \partial z_m$, again implying that the Jacobian is symmetric. But it is difficult to say anything about the Jacobian symmetry of approximate MAP or MMSE denoisers.
Now let us consider the more general class of denoisers
(32) $f(z) = W(z)\, z$,
sometimes called "pseudo-linear" [3]. For simplicity, we assume that $W(\cdot)$ is differentiable on $\mathcal{X}$. In this case, using the chain rule, we have
(33) $[Jf(z)]_{nm} = W_{nm}(z) + \sum_k \frac{\partial W_{nk}(z)}{\partial z_m}\, z_k$,
and so the Jacobian is symmetric for all $z \in \mathcal{X}$ whenever
1) $W(z)$ is symmetric for all $z \in \mathcal{X}$, and
2) the matrix with entries $\sum_k [\partial W_{nk}(z)/\partial z_m]\, z_k$ is symmetric for all $z \in \mathcal{X}$.
When $W(\cdot)$ is $z$-invariant (i.e., $f$ is linear) and symmetric, both of these conditions are satisfied. This latter case was exploited for RED in [23]. The case of nonlinear $W(\cdot)$ is more complicated. Although $W(z)$ can be symmetrized (see [24, 25]), it is not clear whether the second condition will be satisfied.
III-E Jacobian Symmetry Experiments
For denoisers that do not admit a tractable analysis, we can still evaluate the Jacobian of $f$ at $x$ numerically via
(34) $[Jf(x)]_{:,n} \approx \frac{1}{\epsilon}\big( f(x + \epsilon e_n) - f(x) \big)$,
where $e_n$ denotes the $n$th column of the identity matrix $I_N$ and $\epsilon > 0$ is small. For the purpose of quantifying JS, we define the normalized error metric
(35) $\epsilon_{\mathrm{JS}} := \dfrac{\big\| Jf(x) - [Jf(x)]^\top \big\|_F^2}{\big\| Jf(x) \big\|_F^2}$,
which should be nearly zero for a symmetric Jacobian.
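The sketch below implements (34)-(35) in numpy and applies them to a toy 1-D median filter, a small-scale stand-in for the MF denoiser tested below. For inputs with distinct values, each output of a median filter locally equals one input coordinate, so the Jacobian rows are unit vectors selecting the local median, which is generally far from symmetric. All sizes and values here are illustrative.

```python
import numpy as np

def num_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian, as in (34)."""
    N = x.size
    J = np.zeros((N, N))
    fx = f(x)
    for n in range(N):
        e = np.zeros(N); e[n] = eps
        J[:, n] = (f(x + e) - fx) / eps
    return J

def js_error(J):
    """Normalized Jacobian-symmetry error (35): ||J - J^T||_F^2 / ||J||_F^2."""
    return np.linalg.norm(J - J.T, 'fro')**2 / np.linalg.norm(J, 'fro')**2

def median_filter_1d(x, w=3):
    """Toy 1-D sliding median filter with edge replication."""
    pad = w // 2
    xp = np.pad(x, pad, mode='edge')
    return np.array([np.median(xp[i:i + w]) for i in range(x.size)])
```

Running `js_error(num_jacobian(median_filter_1d, x))` on a generic input yields a value far from zero, while any symmetric Jacobian (e.g., the identity) gives exactly zero.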
Table I shows the average value of $\epsilon_{\mathrm{JS}}$ for different image patches¹ using denoisers that assumed a common noise variance. The denoisers tested were the TDT from (30) with the 2D Haar wavelet transform and soft-thresholding, the median filter (MF) [17] with a sliding window, non-local means (NLM) [4], BM3D [5], TNRD [6], and DnCNN [7]. Table I shows that the Jacobians of all but the TDT denoiser are far from symmetric. (¹We used the center patches of the standard Barbara, Bike, Boats, Butterfly, Cameraman, Flower, Hat, House, Leaves, Lena, Parrots, Parthenon, Peppers, Plants, Raccoon, and Starfish test images.)
Table I: Average Jacobian-symmetry error $\epsilon_{\mathrm{JS}}$

            TDT        MF     NLM    BM3D    TNRD    DnCNN
ε_JS:   4.11e-21     1.35   0.118   0.186  0.0151    0.194
Jacobian symmetry is a somewhat abstract construct; what we really care about is the accuracy of the RED gradient expressions (13) and (22). To assess gradient accuracy, we numerically evaluated the gradient of $\rho_{\mathrm{RED}}$ at $x$ using
(36) $[\widehat{\nabla}\rho_{\mathrm{RED}}(x)]_n := \frac{1}{\epsilon}\big( \rho_{\mathrm{RED}}(x + \epsilon e_n) - \rho_{\mathrm{RED}}(x) \big)$
and compared the result to the analytical expressions (13) and (22). Table II reports the normalized gradient error
(37) $\dfrac{\big\| \widehat{\nabla}\rho_{\mathrm{RED}}(x) - \nabla\rho_{\mathrm{RED}}(x) \big\|^2}{\big\| \widehat{\nabla}\rho_{\mathrm{RED}}(x) \big\|^2}$
for the same $\epsilon$, images, and denoisers used in Table I. The results in Table II show that, for all tested denoisers, the numerical gradient $\widehat{\nabla}\rho_{\mathrm{RED}}$ closely matches the analytical expression for $\nabla\rho_{\mathrm{RED}}$ from (22), but not that from (13). The mismatch between $\widehat{\nabla}\rho_{\mathrm{RED}}$ and (13) is partly due to insufficient JS and partly due to insufficient LH, as we establish below.
III-F Local Homogeneity Experiments
Recall that the TDT denoiser has a symmetric Jacobian, both theoretically and empirically. Yet Table II reports a disagreement between the expressions (13) and (22) for TDT. We now show that this disagreement is due to insufficient local homogeneity (LH).
To do this, we introduce yet another RED gradient expression,
(38) $\nabla\rho_{\mathrm{RED}}(x) \stackrel{\mathrm{LH}}{=} x - \tfrac{1}{2}\left[Jf(x)\right]^\top x - \tfrac{1}{2} Jf(x)\, x$,
which results from combining (22) with Lemma 1. Here, $\stackrel{\mathrm{LH}}{=}$ indicates that (38) holds under LH. In contrast, the gradient expression (13) holds under both LH and Jacobian symmetry, while the gradient expression (22) holds in general (i.e., even in the absence of LH and/or Jacobian symmetry). We also introduce two normalized error metrics for LH,
(39) $\epsilon_{\mathrm{LH1}} := \dfrac{\big\| f((1+\epsilon)x) - (1+\epsilon)f(x) \big\|^2}{\big\| (1+\epsilon) f(x) \big\|^2}$
(40) $\epsilon_{\mathrm{LH2}} := \dfrac{\big\| Jf(x)\, x - f(x) \big\|^2}{\big\| f(x) \big\|^2}$,
which should both be nearly zero for LH $f$. Note that $\epsilon_{\mathrm{LH1}}$ quantifies LH according to definition (12) and closely matches the numerical analysis of LH in [1]. Meanwhile, $\epsilon_{\mathrm{LH2}}$ quantifies LH according to Lemma 1 and to how LH is used in the gradient expressions (13) and (38).
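A minimal numpy sketch of the two metrics, using a finite-difference Jacobian; the helper name and step sizes are illustrative. A linear map is exactly homogeneous under both metrics, while scalar soft-thresholding (the componentwise core of TDT) exhibits a tiny $\epsilon_{\mathrm{LH1}}$ but a visibly nonzero $\epsilon_{\mathrm{LH2}}$:

```python
import numpy as np

def lh_errors(f, x, eps=1e-3):
    """Local-homogeneity error metrics (39)-(40):
    eps_LH1 compares f((1+eps)x) with (1+eps)*f(x);
    eps_LH2 compares Jf(x) @ x with f(x) (cf. Lemma 1)."""
    fx = f(x)
    lh1 = np.linalg.norm(f((1 + eps) * x) - (1 + eps) * fx)**2 \
          / np.linalg.norm((1 + eps) * fx)**2
    # finite-difference Jacobian, as in (34)
    N = x.size
    J = np.zeros((N, N))
    d = 1e-6
    for n in range(N):
        e = np.zeros(N); e[n] = d
        J[:, n] = (f(x + e) - fx) / d
    lh2 = np.linalg.norm(J @ x - fx)**2 / np.linalg.norm(fx)**2
    return lh1, lh2
```

For soft-thresholding with $|x_n|$ above the threshold, $Jf(x)\,x - f(x)$ equals the threshold times the sign pattern, so $\epsilon_{\mathrm{LH2}}$ stays bounded away from zero even as $\epsilon_{\mathrm{LH1}}$ shrinks, mirroring the TDT column of Table III.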
Table III: Average local-homogeneity errors

              TDT        MF       NLM      BM3D      TNRD     DnCNN
ε_LH1:   7.99e-10         0   5.60e-9  1.52e-13  5.09e-10   2.06e-9
ε_LH2:    4.10e-4  2.14e-15   5.63e-3     0.214   2.60e-4   8.02e-3
The middle row of Table II reports the average gradient error of the gradient expression (38), and Table III reports the average LH error for the metrics $\epsilon_{\mathrm{LH1}}$ and $\epsilon_{\mathrm{LH2}}$. There we see that the average $\epsilon_{\mathrm{LH1}}$ error is small for all denoisers, consistent with the experiments in [1]. But the average $\epsilon_{\mathrm{LH2}}$ error is several orders of magnitude larger (for all but the MF denoiser). As discussed below, these seemingly small imperfections in LH have a significant effect on the RED gradient expressions (13) and (38).
Starting with the TDT denoiser, Table II shows that the gradient error of (38) is large, which can only be caused by insufficient LH. The insufficient LH is confirmed in Table III, which shows that the value of $\epsilon_{\mathrm{LH2}}$ for TDT is non-negligible, especially in comparison to the value for MF.
Continuing with the MF denoiser, Table I indicates that its Jacobian is far from symmetric, while Table III indicates that it is LH. The gradient results in Table II are consistent with these behaviors: the expression (38) is accurate on account of LH being satisfied, but the expression (13) is inaccurate on account of the lack of JS.
The results for the remaining denoisers NLM, BM3D, TNRD, and DnCNN show a common trend: they have non-trivial levels of both JS error (see Table I) and LH error (see Table III). As a result, the gradient expressions (13) and (38) are both inaccurate (see Table II).
In conclusion, the experiments in this section show that the RED gradient expressions (13) and (38) are very sensitive to small imperfections in LH. Although the experiments in [1] suggested that many popular image denoisers are approximately LH, our experiments suggest that their levels of LH are insufficient to maintain the accuracy of the RED gradient expressions (13) and (38).
III-G Example RED-SD Trajectory
We now provide an example of how the RED algorithms from [1] do not necessarily minimize the variational objective $C_{\mathrm{RED}}$ from (14).
Figure 1 plots the RED cost $C_{\mathrm{RED}}(x_k)$ from (14) and the error on the fixed-point condition (15), i.e., $\big\| \frac{1}{\sigma^2}\Phi^\top(\Phi x_k - y) + \lambda(x_k - f(x_k)) \big\|$,
versus iteration $k$ for a trajectory produced by the steepest-descent (SD) RED algorithm from [1]. For this experiment, we used the median filter for $f$, the Starfish image, and noisy measurements $y = x + e$ (i.e., $\Phi = I$ in (1)).
Figure 1 shows that, although the RED-SD algorithm asymptotically satisfies the fixed-point condition (15), the RED cost $C_{\mathrm{RED}}(x_k)$ does not decrease with $k$, as would be expected if the RED algorithms truly minimized the RED cost. This behavior implies that any optimization algorithm that monitors the objective value for, say, backtracking line-search (e.g., the FASTA algorithm [26]), cannot be applied in the context of RED.
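The qualitative behavior of the fixed-point residual is easy to reproduce at toy scale. The sketch below runs a RED-style gradient iteration with $\Phi = I$, $\sigma^2 = 1$, and a non-symmetric linear "denoiser" $f(x) = Bx$ standing in for the median filter; the residual of (15) is driven to zero even though, by Theorem 1, no explicit regularizer underlies the iteration. The step size, $\lambda$, and $B$ are all illustrative.

```python
import numpy as np

def red_sd(y, f, lam=0.5, mu=0.2, iters=200):
    """Sketch of a RED steepest-descent-style iteration:
    x <- x - mu * [ (x - y) + lam*(x - f(x)) ]   (Phi = I, sigma^2 = 1).
    Returns the final iterate and the history of fixed-point errors (15)."""
    x = np.array(y, dtype=float)
    errs = []
    for _ in range(iters):
        g = (x - y) + lam * (x - f(x))   # residual of condition (15)
        errs.append(np.linalg.norm(g))
        x = x - mu * g
    return x, errs
```

For a contraction-inducing step size, `errs` decays to numerical zero, matching the asymptotic satisfaction of (15) seen in Figure 1.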
III-H Hessian and Convexity
From (22), the $(n,m)$th element of the Hessian of $\rho_{\mathrm{RED}}$ equals
(41) $[\nabla^2\rho_{\mathrm{RED}}(x)]_{nm} = \frac{\partial}{\partial x_m}\left[ x - \tfrac{1}{2}[Jf(x)]^\top x - \tfrac{1}{2}f(x) \right]_n$
(42) $= \delta_{nm} - \tfrac{1}{2}\frac{\partial f_m(x)}{\partial x_n} - \tfrac{1}{2}\frac{\partial f_n(x)}{\partial x_m} - \tfrac{1}{2}\sum_{k} x_k \frac{\partial^2 f_k(x)}{\partial x_n \partial x_m}$
(43) $= \left[ I - \tfrac{1}{2}Jf(x) - \tfrac{1}{2}[Jf(x)]^\top - \tfrac{1}{2}\sum_k x_k \nabla^2 f_k(x) \right]_{nm}$,
where $\delta_{nm} = 1$ if $n = m$ and $\delta_{nm} = 0$ otherwise. Thus, the Hessian of $\rho_{\mathrm{RED}}$ at $x$ equals
(44) $\nabla^2\rho_{\mathrm{RED}}(x) = I - \tfrac{1}{2}Jf(x) - \tfrac{1}{2}[Jf(x)]^\top - \tfrac{1}{2}\sum_{k=1}^{N} x_k \nabla^2 f_k(x)$.
This expression can be contrasted with the Hessian expression from [1, (60)], which reads
(45) $\nabla^2\rho_{\mathrm{RED}}(x) = I - Jf(x)$.
Interestingly, (44) differs from (45) even when the denoiser has a symmetric Jacobian $Jf(x)$. One implication is that, even if the eigenvalues of $Jf(x)$ are limited to the interval $[0,1]$, the Hessian $\nabla^2\rho_{\mathrm{RED}}(x)$ may not be positive semidefinite due to the last term in (44), with negative implications for the convexity of $\rho_{\mathrm{RED}}$. That said, the RED algorithms do not actually minimize the variational objective $C_{\mathrm{RED}}$ for common denoisers (as established in Section III-G), and so the convexity of $\rho_{\mathrm{RED}}$ may not matter in practice.
IV Score-Matching by Denoising
As discussed in Section II-D, the RED algorithms proposed in [1] are explicitly based on the gradient rule
(46) $\nabla\rho(x) = x - f(x)$.
This rule appears to be useful, since these algorithms work very well in practice. But Section III established that $\rho_{\mathrm{RED}}$ from (11) does not usually satisfy (46). We are thus motivated to seek an alternative explanation for the RED algorithms. In this section, we explain them through a framework that we call "score-matching by denoising" (SMD).
IV-A Regularization by Log-Likelihood
As a precursor to the SMD framework, we first propose a technique called regularization by log-likelihood.
Recall the measurement model (10) used to define the "denoising" problem, repeated in (47) for convenience:
(47) $z = x + e$, with $e \sim \mathcal{N}(0, \nu I)$.
To avoid confusion, we will refer to $z$ as "pseudo-measurements" and $y$ as "measurements." From (47), the likelihood function for the pseudo-measurements is $p(z|x) = \mathcal{N}(z; x, \nu I)$.
Now, suppose that we model the true image as a realization of a random vector with prior pdf $\tilde{p}$. We write "$\tilde{p}$" to emphasize that the model distribution may differ from the true distribution $p$ (from which the image is actually drawn). Under this prior model, the MMSE denoiser of $x$ from $z$ is
(48) $\tilde{f}_{\mathrm{mmse}}(z) := \mathrm{E}[x \,|\, z] = \int x\, \tilde{p}(x|z)\, dx$,
and the likelihood of observing $z$ is
(49) $\tilde{p}_z(z) = \int p(z|x)\, \tilde{p}(x)\, dx$
(50) $= \int \mathcal{N}(z; x, \nu I)\, \tilde{p}(x)\, dx$.
In what we will refer to as regularization by log-likelihood (RLL), an explicit regularizer is constructed with the form
(51) $\rho_{\mathrm{RLL}}(x) := -\nu \ln \tilde{p}_z(x)$.
As we now show, $\rho_{\mathrm{RLL}}$ has the desired property (46):
(52) $\nabla\rho_{\mathrm{RLL}}(x) = x - \tilde{f}_{\mathrm{mmse}}(x)$.
Proof. A classical result for Gaussian-smoothed densities (often attributed to Tweedie) states that $\nabla \ln \tilde{p}_z(z) = \frac{1}{\nu}\big( \tilde{f}_{\mathrm{mmse}}(z) - z \big)$. Thus $\nabla\rho_{\mathrm{RLL}}(x) = -\nu \nabla\ln\tilde{p}_z(x) = x - \tilde{f}_{\mathrm{mmse}}(x)$. ∎
Thus, if the RLL regularizer $\rho_{\mathrm{RLL}}$ is used in the optimization problem (14), then the solution must satisfy the fixed-point condition (15) associated with the RED algorithms from [1], albeit with an MMSE-type denoiser. This restriction will be removed using the SMD framework in Section IV-C.
It is interesting to note that the gradient property (52) holds even for non-homogeneous $\tilde{f}_{\mathrm{mmse}}$. This generality is important in applications where $\tilde{f}_{\mathrm{mmse}}$ is known to lack LH. For example, with a binary image whose pixels are modeled as i.i.d. equiprobable on $\{0,1\}$, the MMSE denoiser takes the componentwise sigmoidal form $\tilde{f}_{\mathrm{mmse},n}(z) = \big(1 + \exp((1 - 2z_n)/(2\nu))\big)^{-1}$, which is not LH.
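For this binary-prior example, both the sigmoid-shaped MMSE denoiser and the gradient property (52) can be checked numerically in one dimension, since $\tilde{p}_z$ reduces to a two-component Gaussian mixture. The sketch below (variance and test-point values illustrative) compares a finite-difference gradient of $\rho_{\mathrm{RLL}}$ from (51) with $z - \tilde{f}_{\mathrm{mmse}}(z)$:

```python
import numpy as np

def pz(z, nu):
    """Marginal density of z = x + N(0, nu) for equiprobable binary x in {0,1}."""
    g = lambda m: np.exp(-(z - m)**2 / (2 * nu)) / np.sqrt(2 * np.pi * nu)
    return 0.5 * (g(0.0) + g(1.0))

def f_mmse(z, nu):
    """MMSE denoiser E[x|z] for the binary prior: a sigmoid, hence not LH."""
    return 1.0 / (1.0 + np.exp((1.0 - 2.0 * z) / (2.0 * nu)))

def grad_rho_rll(z, nu, eps=1e-6):
    """Central-difference gradient of rho_RLL(z) = -nu * ln pz(z), cf. (51)."""
    rho = lambda t: -nu * np.log(pz(t, nu))
    return (rho(z + eps) - rho(z - eps)) / (2 * eps)
```

The numerical gradient agrees with $z - \tilde{f}_{\mathrm{mmse}}(z)$ to finite-difference accuracy, while the homogeneity test $f((1+\epsilon)z) \approx (1+\epsilon)f(z)$ visibly fails.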
IV-B RLL as Kernel Density Estimation
We now show that regularization by log-likelihood (RLL) arises naturally in the data-driven, nonparametric context through kernel-density estimation (KDE) [8].
Recall that, in most imaging applications, the true prior $p$ is unknown, as is the true MMSE denoiser $f_{\mathrm{mmse}}$. There are several ways to proceed. One way is to design "by hand" an approximate prior $\tilde{p}$ that leads to a computationally efficient denoiser $\tilde{f}_{\mathrm{mmse}}$. But, because this denoiser is not MMSE for $p$, the performance of the resulting estimates will suffer relative to $f_{\mathrm{mmse}}$.
Another way to proceed is to approximate the prior using a large corpus of training data $\{x_t\}_{t=1}^{T}$. To this end, an approximate prior could be formed using the empirical estimate
(59) $\hat{p}(x) := \frac{1}{T}\sum_{t=1}^{T} \delta(x - x_t)$,
but a much better (and smoother) match to the true prior can be obtained using
(60) $\tilde{p}(x) := \frac{1}{T}\sum_{t=1}^{T} \mathcal{N}(x; x_t, \nu I)$
with appropriately chosen $\nu$, a technique known as kernel density estimation (KDE) or Parzen windowing [8]. Note that if $\tilde{p}$ is used as a surrogate for $p$, then the MAP optimization problem becomes
(61) $\hat{x}_{\mathrm{MAP}} = \arg\min_x \left\{ \ell(x;y) - \ln\tilde{p}(x) \right\}$
(62) $= \arg\min_x \left\{ \ell(x;y) + \tfrac{1}{\nu}\rho_{\mathrm{RLL}}(x) \right\}$,
with $\rho_{\mathrm{RLL}}$ constructed using the prior $\hat{p}$ from (59). In summary, RLL arises naturally in the data-driven approach to image recovery when KDE is used to smooth the empirical prior.
IV-C Score-Matching by Denoising
A limitation of the above RLL framework is that it results in denoisers with symmetric Jacobians. (Recall the discussion of MMSE denoisers in Section III-D.) To justify the use of RED algorithms with denoisers having non-symmetric Jacobians, we introduce the score-matching by denoising (SMD) framework in this section.
Let us continue with the KDE-based MAP estimation problem (61). Note that $\hat{x}_{\mathrm{MAP}}$ from (61) zeros the gradient of the MAP optimization objective and thus obeys the fixed-point equation
(63) $0 = \nabla\ell(\hat{x}_{\mathrm{MAP}}; y) - \nabla\ln\tilde{p}(\hat{x}_{\mathrm{MAP}})$.
In principle, $\hat{x}_{\mathrm{MAP}}$ in (63) could be found using gradient descent or similar techniques. However, computation of the gradient
(64) $\nabla\ln\tilde{p}(x) = \dfrac{\sum_{t=1}^{T} \nabla_x \mathcal{N}(x; x_t, \nu I)}{\sum_{t=1}^{T} \mathcal{N}(x; x_t, \nu I)}$
is too expensive for the values of $T$ typically needed to generate a good image prior $\tilde{p}$.
A tractable alternative is suggested by the fact that
(65) $\nabla\ln\tilde{p}(z) = \frac{1}{\nu}\big( \hat{f}_{\mathrm{mmse}}(z) - z \big)$
(66) $\hat{f}_{\mathrm{mmse}}(z) = \dfrac{\sum_{t=1}^{T} x_t\, \mathcal{N}(z; x_t, \nu I)}{\sum_{t=1}^{T} \mathcal{N}(z; x_t, \nu I)}$,
where $\hat{f}_{\mathrm{mmse}}$ is the MMSE estimator of $x \sim \hat{p}$ from $z$. In particular, if we can construct a good approximation to $\hat{f}_{\mathrm{mmse}}$ using a denoiser $f$ in a computationally efficient function class $\mathcal{F}$, then we can efficiently approximate the MAP problem (61).
This approach can be formalized using the framework of score matching [29], which aims to approximate the "score" (i.e., the gradient of the log-prior) rather than the prior itself. For example, suppose that we want to approximate the score $\nabla\ln\tilde{p}(z)$. For this, Hyvärinen [29] suggested to first find the best mean-square fit among a set $\mathcal{B}$ of computationally efficient functions, i.e., find
(67) $\hat{b} = \arg\min_{b \in \mathcal{B}} \mathrm{E}\big\| b(z) - \nabla\ln\tilde{p}(z) \big\|^2$,
and then to approximate the score by $\hat{b}$. Later, in the context of denoising autoencoders, Vincent [30] showed that if we choose
(68) $b(z) = \frac{1}{\nu}\big( f(z) - z \big)$
for some function $f \in \mathcal{F}$, then $\hat{b}$ from (67) can be equivalently written as $\hat{b}(z) = \frac{1}{\nu}(\hat{f}(z) - z)$ with
(69) $\hat{f} = \arg\min_{f \in \mathcal{F}} \mathrm{E}\big\| f(z) - x \big\|^2$.
In this case, $\hat{f}$ is the MSE-optimal denoiser, averaged over the training data and the noise, and constrained to the function class $\mathcal{F}$.
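Vincent's equivalence can be checked on a scalar Gaussian toy problem, where both the MMSE denoiser and the score of the smoothed density are known in closed form. The sketch below fits the best scalar-linear denoiser $f(z) = az$ by least squares, as in (69), and confirms that the implied score model (68) recovers the true score slope $-1/(s^2+\nu)$ of $p(z) = \mathcal{N}(0, s^2+\nu)$; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
s2, nu, T = 4.0, 1.0, 200000
x = rng.normal(0.0, np.sqrt(s2), T)       # "clean" samples x ~ N(0, s2)
z = x + rng.normal(0.0, np.sqrt(nu), T)   # pseudo-measurements z = x + e

# Denoising fit (69): best scalar-linear denoiser f(z) = a*z in mean square,
# obtained in closed form from the least-squares normal equation.
a_hat = (z @ x) / (z @ z)

# Implied score model (68): b(z) = (f(z) - z)/nu = ((a - 1)/nu) * z.
# For this toy, the true score of p(z) = N(0, s2+nu) has slope -1/(s2+nu).
score_slope = (a_hat - 1.0) / nu
```

The fitted coefficient approaches the MMSE shrinkage factor $s^2/(s^2+\nu)$ as $T$ grows, so the denoising objective and the score-matching objective pick out the same function, as (67)-(69) assert.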