Regularization by Denoising: Clarifications and New Interpretations


Edward T. Reehorst and Philip Schniter, E. T. Reehorst (email: reehorst.3@osu.edu) and P. Schniter (email: schniter.1@osu.edu) are with the Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, 43210. Their work is supported in part by the National Science Foundation through grants CCF-1527162 and CCF-1716388.
Abstract

Regularization by Denoising (RED), as recently proposed by Romano, Elad, and Milanfar, is a powerful new image-recovery framework that aims to construct an explicit regularization objective from a plug-in image-denoising function. Evidence suggests that the RED algorithms are, indeed, state-of-the-art. However, a closer inspection suggests that explicit regularization may not explain the workings of these algorithms. In this paper, we clarify that the expressions in Romano et al. hold only when the denoiser has a symmetric Jacobian, and we demonstrate that such symmetry does not occur with practical denoisers such as non-local means, BM3D, TNRD, and DnCNN. Going further, we prove that non-symmetric denoising functions cannot be modeled by any explicit regularizer. In light of these results, there remains the question of how to justify the well-performing RED algorithms for practical denoisers. In response, we propose a new framework called “score-matching by denoising” (SMD), which aims to match the score (i.e., the gradient of the log-prior) instead of designing the regularizer itself. We also show tight connections between SMD, kernel density estimation, and constrained minimum mean-squared error denoising. Finally, we provide new interpretations of the RED algorithms proposed by Romano et al., and we propose novel RED algorithms with faster convergence.

I Introduction

Consider the problem of recovering a (vectorized) image $x_0 \in \mathbb{R}^N$ from noisy linear measurements $y \in \mathbb{R}^M$ of the form

$y = A x_0 + e,$   (1)

where $A \in \mathbb{R}^{M \times N}$ is a known linear transformation and $e$ is noise. This problem is of great importance in many applications and has been intensely studied for several decades. (See, e.g., the discussion in [1].)

One of the most popular approaches to image recovery is the “variational” approach, where one poses and solves an optimization problem of the form

$\hat{x} = \arg\min_x \big\{ \ell(x; y) + \lambda \rho(x) \big\}.$   (2)

In (2), $\ell(x;y)$ is a loss function that penalizes mismatch to the measurements, $\rho(x)$ is a regularization term that penalizes mismatch to the image class of interest, and $\lambda > 0$ is a design parameter that trades between loss and regularization. A prime advantage of the variational approach is that efficient, off-the-shelf optimization methods can be readily applied to (2).

A key question is: how should one choose the loss and regularization in (2)? As discussed in the sequel, the MAP-Bayesian interpretation suggests that they should be chosen in proportion to the negative log-likelihood and negative log-prior, respectively. The trouble is that efficient and accurate prior models of images are lacking.

Recently, a breakthrough was made by Romano, Elad, and Milanfar in [1]. Leveraging the relative maturity of image denoising algorithms (see, e.g., [2, 3]), they proposed the regularization by denoising (RED) framework, where an explicit regularizer $\rho(\cdot)$ is constructed from an image denoiser $f : \mathbb{R}^N \to \mathbb{R}^N$ using the simple and elegant rule

$\rho_{\mathrm{RED}}(x) := \tfrac{1}{2}\, x^\top \big( x - f(x) \big).$   (3)

Based on this framework, Romano et al. proposed several recovery algorithms (based on steepest descent, ADMM, and fixed-point methods, respectively) that yield state-of-the-art performance in deblurring and super-resolution tasks.

In this paper, we provide some clarifications and new interpretations of the excellent RED algorithms from [1]. Our work was motivated by an interesting empirical observation: with practical denoisers $f(\cdot)$, the RED algorithms do not minimize the RED variational objective “$\ell(x;y) + \lambda\rho_{\mathrm{RED}}(x)$.” As we establish in the sequel, the RED regularization (3) is justified only for denoisers with symmetric Jacobians, which unfortunately does not cover many state-of-the-art methods such as non-local means (NLM) [4], BM3D [5], TNRD [6], and DnCNN [7]. In fact, we are able to establish a stronger result: for non-symmetric denoisers, there exists no regularization that explains the RED algorithms from [1].

In light of these (negative) results, there remains the question of how to explain/understand the RED algorithms from [1] when used with non-symmetric denoisers. In response, we propose a framework called “score-matching by denoising” (SMD), which aims to match the score (i.e., the gradient of the log-prior) rather than to design any explicit regularizer. We then show tight connections between SMD, kernel density estimation [8], and constrained minimum mean-squared error (MMSE) denoising. Finally, we provide new interpretations of the RED-ADMM and RED-FP algorithms proposed in [1], and we propose novel RED algorithms with faster convergence.

The remainder of the paper is organized as follows. In Section II we provide more background on RED and related algorithms such as plug-and-play ADMM [9]. In Section III, we discuss the impact of Jacobian symmetry on RED and whether this property holds in practice. In Section IV, we propose the framework of “score-matching by denoising.” In Section V, we present new interpretations of the RED-ADMM and RED-FP algorithms from [1], and we propose new algorithms based on Peaceman-Rachford splitting, expectation-consistent approximation, and majorization minimization. Finally, in Section VI, we conclude.

II Background

II-A The MAP-Bayesian Interpretation

Because it will be useful in the sequel, we now provide a short discussion of Bayesian maximum a posteriori (MAP) estimation [10]. The MAP estimate of $x_0$ from $y$ is defined as

$\hat{x}_{\mathrm{MAP}} := \arg\max_x p(x|y),$   (4)

where $p(x|y)$ denotes the probability density of $x$ given $y$. Notice that, from Bayes rule and the monotonically increasing nature of $\ln(\cdot)$, we can write

$\hat{x}_{\mathrm{MAP}} = \arg\min_x \big\{ -\ln p(y|x) - \ln p(x) \big\}.$   (5)

MAP estimation (5) has a direct connection to variational optimization (2): the negative log-likelihood term $-\ln p(y|x)$ corresponds to the loss $\ell(x;y)$ and the negative log-prior term $-\ln p(x)$ corresponds to the regularization $\lambda\rho(x)$. For example, with additive white Gaussian noise (AWGN) $e \sim \mathcal{N}(0, \sigma^2 I)$, the log-likelihood implies a quadratic loss:

$\ell(x; y) = \tfrac{1}{2\sigma^2} \| y - A x \|_2^2.$   (6)

Equivalently, the normalized loss $\ell(x;y) = \tfrac{1}{2}\|y - Ax\|_2^2$ could be used if $\sigma^2$ was absorbed into $\lambda$.

II-B ADMM

A popular approach to solving (2) is through ADMM [11], which we now review. Using variable splitting, (2) becomes

$\{\hat{x},\hat{v}\} = \arg\min_{x,v} \big\{ \ell(x;y) + \lambda\rho(v) \big\}$ subject to $x = v$,   (7)

which can be converted to the unconstrained problem

$\{\hat{x},\hat{v}\} = \arg\min_{x,v} \max_{p} \big\{ \ell(x;y) + \lambda\rho(v) + p^\top(x - v) + \tfrac{\beta}{2}\|x - v\|_2^2 \big\}$   (8)

using Lagrange multipliers (or “dual” variables) $p$ and a design parameter $\beta > 0$. Using $u := p/\beta$, (8) can be simplified to

$\{\hat{x},\hat{v}\} = \arg\min_{x,v} \max_{u} \big\{ \ell(x;y) + \lambda\rho(v) + \tfrac{\beta}{2}\|x - v + u\|_2^2 - \tfrac{\beta}{2}\|u\|_2^2 \big\}.$   (9)

The ADMM algorithm solves (9) by alternating the optimization of $x$, $v$, and $u$, as specified in Algorithm 1. ADMM is known to converge under convex $\ell$ and $\rho$ as well as other mild conditions (see [11]).

0:  $\ell(\cdot;y)$, $\lambda\rho(\cdot)$, $\beta > 0$, $v_0$, and $u_0$
1:  for $k = 1, 2, 3, \dots$ do
2:     $x_k = \arg\min_x \ell(x;y) + \tfrac{\beta}{2}\|x - v_{k-1} + u_{k-1}\|_2^2$
3:     $v_k = \arg\min_v \lambda\rho(v) + \tfrac{\beta}{2}\|x_k - v + u_{k-1}\|_2^2$
4:     $u_k = u_{k-1} + x_k - v_k$
5:  end for
6:  Return $v_k$
Algorithm 1 ADMM [11]
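To make Algorithm 1 concrete, the following sketch implements its three updates in NumPy for the quadratic loss (6), in which case line 2 reduces to a regularized least-squares solve. The proximal step on line 3 is left abstract as a caller-supplied `prox_rho`; that callable, and all names below, are our own illustrative choices rather than part of [11].

```python
import numpy as np

def admm_quadratic_loss(y, A, prox_rho, beta, sigma2, num_iters=100):
    """Sketch of Algorithm 1 for the quadratic loss (6).

    prox_rho(r, beta) should return argmin_v lam*rho(v) + (beta/2)*||v - r||^2,
    i.e., the variational denoising step on line 3 of Algorithm 1.
    """
    M, N = A.shape
    v = np.zeros(N)                            # splitting variable
    u = np.zeros(N)                            # scaled dual variable
    H = A.T @ A / sigma2 + beta * np.eye(N)    # x-update system matrix
    for _ in range(num_iters):
        # Line 2: x-update = regularized least squares
        x = np.linalg.solve(H, A.T @ y / sigma2 + beta * (v - u))
        # Line 3: v-update = variational denoising of x + u
        v = prox_rho(x + u, beta)
        # Line 4: dual update
        u = u + x - v
    return v
```

For example, with $\rho(v) = \|v\|_1$, the supplied `prox_rho` would be componentwise soft-thresholding with threshold $\lambda/\beta$.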

II-C Plug-and-Play ADMM

Importantly, line 3 of Algorithm 1 can be recognized as variational denoising of $z_k := x_k + u_{k-1}$ using regularization $\lambda\rho(\cdot)$ and quadratic loss $\tfrac{\beta}{2}\|z_k - v\|_2^2$ at iteration $k$. Here “denoising” means recovering $x_0$ from noisy measurements of the form

$z = x_0 + e, \quad e \sim \mathcal{N}(0, \nu I),$   (10)

for some variance $\nu > 0$.

Image denoising has been studied for decades (see, e.g., the overviews [2, 3]), with the result that high performance methods are now readily available. Today’s state-of-the-art denoisers include those based on image-dependent filtering algorithms (e.g., BM3D [5]) or deep neural networks (e.g., TNRD [6], DnCNN [7]). Importantly, many denoisers are not variational in nature, i.e., they do not declare an explicit regularizer .

Leveraging the denoising interpretation of ADMM, Venkatakrishnan, Bouman, and Wohlberg [9] proposed to replace line 3 of Algorithm 1 with a call to a sophisticated image denoiser, such as BM3D, and dubbed their approach Plug-and-Play (PnP) ADMM. Numerical experiments show that PnP-ADMM works very well in most cases. However, when the denoiser used in PnP-ADMM comes with no explicit regularization $\rho(\cdot)$, it is not clear what objective PnP-ADMM is minimizing, and PnP-ADMM convergence is difficult to characterize.
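In code, PnP-ADMM is a one-line change to the previous sketch: the proximal step is replaced by a call to an arbitrary plug-in denoiser. The `denoiser` callable below (e.g., a wrapper around BM3D or a CNN) is a placeholder of ours, not an interface defined in [9].

```python
import numpy as np

def pnp_admm(y, A, denoiser, beta, sigma2, num_iters=100):
    """PnP-ADMM sketch: line 3 of Algorithm 1 becomes a plug-in denoising call."""
    M, N = A.shape
    v = np.zeros(N)
    u = np.zeros(N)
    H = A.T @ A / sigma2 + beta * np.eye(N)
    for _ in range(num_iters):
        x = np.linalg.solve(H, A.T @ y / sigma2 + beta * (v - u))
        v = denoiser(x + u)        # plug-in denoiser replaces the proximal step
        u = u + x - v
    return v
```

Because `denoiser` need not be the proximal operator of any $\rho(\cdot)$, the objective being minimized (if any) is unclear, which is exactly the difficulty noted above.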

Shortly thereafter, a related approach was proposed based on the approximate message passing (AMP) algorithm [12], called Denoising-based AMP (D-AMP) [13]. D-AMP exploits the fact that, for large i.i.d. random , some internal AMP variables naturally obey (10) and are thus amenable to denoising. An extension called D-VAMP that covers the wider class of “right rotationally invariant” random matrices was then proposed in [14]. Rigorous analyses of D-AMP and D-VAMP were recently published in [15] and [16], respectively.

II-D Regularization by Denoising (RED)

As briefly discussed in the Introduction, Romano, Elad, and Milanfar [1] proposed a radically new way to exploit an image denoiser, which they call regularization by denoising (RED). Given an arbitrary image denoiser $f : \mathbb{R}^N \to \mathbb{R}^N$, they proposed to construct an explicit regularizer of the form

$\rho_{\mathrm{RED}}(x) := \tfrac{1}{2}\, x^\top \big( x - f(x) \big)$   (11)

to use within the variational framework (2). The advantage of using an explicit regularizer is that a wide variety of optimization algorithms can be used to solve (2) and their convergence can be tractably analyzed.
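The regularizer (11) and the corresponding variational objective with quadratic loss are trivial to evaluate once a denoiser is available; the short helpers below, with our own naming, make this concrete.

```python
import numpy as np

def rho_red(x, f):
    """RED regularizer (11): 0.5 * x^T (x - f(x))."""
    return 0.5 * x @ (x - f(x))

def red_objective(x, y, A, f, lam, sigma2):
    """Variational objective: quadratic loss (6) plus lam * rho_RED(x)."""
    return np.sum((y - A @ x) ** 2) / (2 * sigma2) + lam * rho_red(x, f)
```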

In [1], numerical evidence is presented to show that image denoisers are locally homogeneous (LH), i.e.,

$f\big((1+\varepsilon) x\big) = (1+\varepsilon)\, f(x)$   (12)

for sufficiently small $\varepsilon$. For such denoisers, Romano et al. claim [1, Eq.(28)] that $\rho_{\mathrm{RED}}$ obeys the gradient rule

$\nabla\rho_{\mathrm{RED}}(x) = x - f(x).$   (13)

If $\nabla\rho_{\mathrm{RED}}(x) = x - f(x)$, then any minimizer $\hat{x}$ of the variational objective under quadratic loss,

$C_{\mathrm{RED}}(x) := \tfrac{1}{2\sigma^2}\|y - Ax\|_2^2 + \lambda\,\rho_{\mathrm{RED}}(x),$   (14)

must yield $\nabla C_{\mathrm{RED}}(\hat{x}) = 0$, i.e., must obey

$\tfrac{1}{\sigma^2} A^\top (y - A\hat{x}) = \lambda\big(\hat{x} - f(\hat{x})\big).$   (15)

Based on this line of reasoning, Romano et al. proposed several iterative algorithms that find an $\hat{x}$ satisfying the fixed-point condition (15), which we will refer to as “RED algorithms.”
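For intuition on how such algorithms operate, the sketch below seeks a solution of (15) by repeatedly freezing $f(x)$ and solving the resulting linear system. It is a simplified rendering in the spirit of the fixed-point RED algorithm of [1], not a transcription of it; the initialization, iteration count, and direct solve are our own choices.

```python
import numpy as np

def red_fixed_point(y, A, f, lam, sigma2, num_iters=50):
    """Seek x satisfying (15): (1/sigma2) A^T (y - A x) = lam * (x - f(x))."""
    M, N = A.shape
    x = A.T @ y                                  # simple initialization
    H = A.T @ A / sigma2 + lam * np.eye(N)       # left-hand-side matrix
    for _ in range(num_iters):
        # With f(x) frozen at its current value, (15) is linear in x
        x = np.linalg.solve(H, A.T @ y / sigma2 + lam * f(x))
    return x
```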

III Clarifications on RED

In this section, we first show that the gradient expression (13) holds if and only if the denoiser is LH and has Jacobian symmetry (JS). We then establish that many popular denoisers lack JS, such as the median filter (MF) [17], non-local means (NLM) [4], BM3D [5], TNRD [6], and DnCNN [7]. For such denoisers, the RED algorithms cannot be explained by $\rho_{\mathrm{RED}}$ in (11). We also show a more general result: when a denoiser lacks JS, there exists no regularizer $\rho$ whose gradient matches (13). Thus, the problem is not the specific form of $\rho_{\mathrm{RED}}$ in (11) but rather the broader pursuit of explicit regularization.

III-A Preliminaries

We first state some definitions and assumptions. In the sequel, we denote the $n$th component of $f(x)$ by $f_n(x)$, the gradient of $\rho : \mathbb{R}^N \to \mathbb{R}$ at $x$ by

$\nabla\rho(x) := \big[\tfrac{\partial\rho(x)}{\partial x_1}, \dots, \tfrac{\partial\rho(x)}{\partial x_N}\big]^\top,$   (16)

and the Jacobian of $f : \mathbb{R}^N \to \mathbb{R}^N$ at $x$ by

$\nabla f(x) := \big[\tfrac{\partial f_n(x)}{\partial x_m}\big]_{n,m=1}^N \in \mathbb{R}^{N \times N}.$   (17)

Without loss of generality, we take $\mathcal{X} \subseteq \mathbb{R}^N$ to be the set of possible images. A given denoiser $f(\cdot)$ may involve decision boundaries at which its behavior changes suddenly. We assume that these boundaries form a closed set of measure zero and work instead with the open set $\mathcal{X}^\circ \subseteq \mathcal{X}$ that excludes them, which contains almost all images.

We furthermore assume that $f$ is differentiable on $\mathcal{X}^\circ$, which means [18, p.212] that, for any $x \in \mathcal{X}^\circ$, there exists a matrix $J \in \mathbb{R}^{N \times N}$ for which

$\lim_{h \to 0} \dfrac{\|f(x+h) - f(x) - Jh\|_2}{\|h\|_2} = 0.$   (18)

When $\nabla f(x)$ exists, it can be shown [18, p.216] that $J = \nabla f(x)$.

III-B The RED Gradient

We first recall a result that was established in [1].

Lemma 1 (Local homogeneity [1]).

Suppose that the denoiser $f(\cdot)$ is locally homogeneous. Then $\nabla f(x)\, x = f(x)$.

Proof.

Our proof is based on differentiability and avoids the need to define a directional derivative. From (18) with $h = \varepsilon x$, we have

$f\big((1+\varepsilon)x\big) = f(x) + \varepsilon\,\nabla f(x)\, x + o(\varepsilon\|x\|_2)$   (19)
$(1+\varepsilon)\, f(x) = f(x) + \varepsilon\,\nabla f(x)\, x + o(\varepsilon\|x\|_2)$   (20)
$f(x) = \nabla f(x)\, x + o(\varepsilon\|x\|_2)/\varepsilon,$   (21)

where (20) follows from local homogeneity (12). Equation (21) implies that $\nabla f(x)\, x = f(x)$, since its left side does not depend on $\varepsilon$ while $o(\varepsilon\|x\|_2)/\varepsilon \to 0$ as $\varepsilon \to 0$. ∎

We now state one of the main results of this section.

Lemma 2 (RED gradient).

For $\rho_{\mathrm{RED}}$ defined in (11),

$\nabla\rho_{\mathrm{RED}}(x) = x - \tfrac{1}{2} f(x) - \tfrac{1}{2}[\nabla f(x)]^\top x.$   (22)

Proof.

For any $n \in \{1,\dots,N\}$ and $x \in \mathcal{X}^\circ$,

$\dfrac{\partial\rho_{\mathrm{RED}}(x)}{\partial x_n} = \dfrac{1}{2}\dfrac{\partial}{\partial x_n}\sum_{i=1}^N x_i\big(x_i - f_i(x)\big)$   (23)
$= \dfrac{1}{2}\dfrac{\partial}{\partial x_n}\Big(\sum_i x_i^2 - \sum_i x_i f_i(x)\Big)$   (24)
$= x_n - \dfrac{1}{2} f_n(x) - \dfrac{1}{2}\sum_i x_i \dfrac{\partial f_i(x)}{\partial x_n}$   (25)
$= x_n - \dfrac{1}{2} f_n(x) - \dfrac{1}{2}\sum_i x_i [\nabla f(x)]_{i,n}$   (26)
$= x_n - \dfrac{1}{2} f_n(x) - \dfrac{1}{2}\big[[\nabla f(x)]^\top x\big]_n,$   (27)

using the definition of $\nabla f(x)$ from (17). Collecting these partial derivatives into the gradient vector (16) yields (22). ∎

Note that the gradient expression (22) differs from (13).

Lemma 3 (Clarification on (13)).

Suppose that the denoiser $f(\cdot)$ is locally homogeneous. Then the RED gradient expression (13) holds if and only if $[\nabla f(x)]^\top x = \nabla f(x)\, x$.

Proof.

If $[\nabla f(x)]^\top x = \nabla f(x)\, x$, then the last term in (22) becomes $\tfrac{1}{2}\nabla f(x)\, x$, which equals $\tfrac{1}{2} f(x)$ by Lemma 1, in which case (22) agrees with (13). But if $[\nabla f(x)]^\top x \neq \nabla f(x)\, x$, then (22) differs from (13). ∎

III-C Impossibility of Explicit Regularization

For denoisers that lack Jacobian symmetry (JS), Lemma 3 establishes that the gradient expression (13) does not hold. Yet (13) leads to the fixed-point condition (15) on which all RED algorithms in [1] are based. The fact that these algorithms work well in practice suggests that “$\nabla\rho(x) = x - f(x)$” is a desirable property for a regularizer to have. But the regularization $\rho_{\mathrm{RED}}$ in (11) does not lead to this property when $f$ lacks JS. Thus an important question is:

Does there exist some other regularization $\rho$ for which $\nabla\rho(x) = x - f(x)$ when $f$ is non-JS?

The following theorem provides the answer.

Theorem 1 (Impossibility).

Suppose that the denoiser $f(\cdot)$ has a non-symmetric Jacobian. Then there exists no regularization $\rho$ for which $\nabla\rho(x) = x - f(x)$ for all $x \in \mathcal{X}^\circ$.

Proof.

To prove the theorem, we view $x - f(x)$ as a vector field. Theorem 4.3.8 in [19] says that a vector field $g(\cdot)$ is conservative if and only if there exists a continuously differentiable potential $\rho$ for which $g(x) = \nabla\rho(x)$. Furthermore, Theorem 4.3.10 in [19] says that if $g$ is conservative, then the Jacobian $\nabla g(x)$ is symmetric. Thus, by the contrapositive, if the Jacobian is not symmetric, then no such potential exists.

To apply this result to our problem, we define

$g(x) := x - f(x)$   (28)

and notice that

$\nabla g(x) = I - \nabla f(x).$   (29)

Thus, if $\nabla f(x)$ is non-symmetric, then $\nabla g(x)$ is non-symmetric, which means that there exists no $\rho$ for which $\nabla\rho(x) = g(x) = x - f(x)$ holds. ∎

Thus, the problem is not the specific form of in (11) but rather the broader pursuit of “regularization by denoising.”

III-D Analysis of Jacobian Symmetry

The previous sections motivate an important question: Do commonly-used image denoisers have sufficient JS?

For some denoisers, JS can be studied analytically. For example, consider the “transform domain thresholding” (TDT) denoisers of the form

$f(x) = W^\top \eta(Wx),$   (30)

where $\eta(\cdot)$ performs componentwise (e.g., soft or hard) thresholding and $W$ is some transform, as occurs in the context of wavelet shrinkage [20], with or without cycle-spinning [21]. Using $\eta'(\cdot)$ to denote the derivative of $\eta(\cdot)$, we have

$\nabla f(x) = W^\top \mathrm{Diag}\big(\eta'(Wx)\big)\, W,$   (31)

and so the Jacobian of $f$ is perfectly symmetric.
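The symmetry of (31) is easy to confirm numerically. The sketch below builds a toy TDT denoiser from a random orthogonal matrix standing in for a wavelet transform, forms the Jacobian (31) in closed form, and checks that it is symmetric; the transform, threshold, and dimensions are arbitrary illustrative choices.

```python
import numpy as np

def make_tdt_denoiser(W, tau):
    """TDT denoiser (30): f(x) = W^T eta(W x) with soft-thresholding eta."""
    soft = lambda z: np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
    return lambda x: W.T @ soft(W @ x)

rng = np.random.default_rng(0)
N, tau = 16, 0.1
W, _ = np.linalg.qr(rng.standard_normal((N, N)))   # stand-in orthogonal transform
f = make_tdt_denoiser(W, tau)
x = rng.standard_normal(N)

# Jacobian (31): W^T Diag(eta'(Wx)) W, with eta'(z) = 1 where |z| > tau, else 0
eta_prime = (np.abs(W @ x) > tau).astype(float)
J = W.T @ np.diag(eta_prime) @ W
print(np.allclose(J, J.T))                          # True: symmetric Jacobian
```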

Another class of denoisers with perfectly symmetric Jacobians are those that produce the MAP or MMSE optimal estimate under some assumed prior $\tilde{p}$. In the MAP case, $f(z)$ minimizes (over $x$) the cost $\tfrac{1}{2\nu}\|z - x\|_2^2 - \ln\tilde{p}(x)$ for noisy input $z$. If we define $\phi(x) := -\ln\tilde{p}(x)$ and $\phi_\nu(z) := \min_x \big\{\phi(x) + \tfrac{1}{2\nu}\|z - x\|_2^2\big\}$, known as the Moreau-Yosida envelope of $\phi$, then $f(z) = z - \nu\nabla\phi_\nu(z)$, as discussed in [22]. The elements in the Jacobian are therefore $[\nabla f(z)]_{n,m} = \delta_{n,m} - \nu\,\partial^2\phi_\nu(z)/\partial z_n\partial z_m$ (with $\delta_{n,m}$ the Kronecker delta), and so the Jacobian matrix is symmetric. In the MMSE case, we have that $f(z) = z - \nabla\rho_{\mathrm{RLL}}(z)$ for $\rho_{\mathrm{RLL}}$ defined in (51) (see Lemma 4), and so $\nabla f(z) = I - \nabla^2\rho_{\mathrm{RLL}}(z)$, again implying that the Jacobian is symmetric. But it is difficult to say anything about the Jacobian symmetry of approximate MAP or MMSE denoisers.

Now let us consider the more general class of denoisers

$f(x) = W(x)\, x,$   (32)

sometimes called “pseudo-linear” [3], where $W(x) \in \mathbb{R}^{N \times N}$ is an image-dependent weight matrix. For simplicity, we assume that $W(\cdot)$ is differentiable on $\mathcal{X}^\circ$. In this case, using the chain rule, we have

$[\nabla f(x)]_{n,m} = W_{n,m}(x) + \sum_{i=1}^N \dfrac{\partial W_{n,i}(x)}{\partial x_m}\, x_i,$   (33)

and so the Jacobian is symmetric for all $x$ whenever

  1. $W(x)$ is symmetric for all $x$,

  2. $\sum_i \tfrac{\partial W_{n,i}(x)}{\partial x_m}\, x_i = \sum_i \tfrac{\partial W_{m,i}(x)}{\partial x_n}\, x_i$ for all $n$, $m$, and $x$.

When $W(\cdot)$ is $x$-invariant (i.e., $f$ is linear) and symmetric, both of these conditions are satisfied. This latter case was exploited for RED in [23]. The case of non-linear $f(\cdot)$ is more complicated. Although $W(x)$ can be symmetrized (see [24, 25]), it is not clear whether the second condition will be satisfied.

III-E Jacobian Symmetry Experiments

For denoisers that do not admit a tractable analysis, we can still evaluate the Jacobian of $f(\cdot)$ at $x$ numerically via

$[\nabla f(x)]_{:,n} \approx \dfrac{f(x + \delta e_n) - f(x)}{\delta},$   (34)

where $e_n$ denotes the $n$th column of the identity matrix $I_N$ and $\delta > 0$ is small. For the purpose of quantifying JS, we define the normalized error metric

$e_{\mathrm{JS}}(x) := \dfrac{\|\nabla f(x) - [\nabla f(x)]^\top\|_F}{\|\nabla f(x)\|_F},$   (35)

which should be nearly zero for a symmetric Jacobian.
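A minimal implementation of the finite-difference probe (34) and the symmetry metric (35) follows; the denoiser is treated as an arbitrary callable on flattened patches, and the step size is left to the caller rather than fixed to the value used in our experiments.

```python
import numpy as np

def numerical_jacobian(f, x, delta):
    """Approximate the Jacobian of f at x column by column, as in (34)."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for n in range(x.size):
        e_n = np.zeros(x.size)
        e_n[n] = delta
        J[:, n] = (f(x + e_n) - fx) / delta
    return J

def jacobian_symmetry_error(J):
    """Normalized Jacobian-symmetry error (35); near zero when J is symmetric."""
    return np.linalg.norm(J - J.T, 'fro') / np.linalg.norm(J, 'fro')
```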

Table I shows the average value of $e_{\mathrm{JS}}$ for different 16×16 image patches, using denoisers that assumed a fixed noise variance. (We used the center 16×16 patches of the standard Barbara, Bike, Boats, Butterfly, Cameraman, Flower, Hat, House, Leaves, Lena, Parrots, Parthenon, Peppers, Plants, Raccoon, and Starfish test images.) The denoisers tested were the TDT from (30) with the 2D Haar wavelet transform and soft-thresholding, the median filter (MF) [17], non-local means (NLM) [4], BM3D [5], TNRD [6], and DnCNN [7]. Table I shows that the Jacobians of all but the TDT denoiser are far from symmetric.

            TDT        MF      NLM     BM3D    TNRD     DnCNN
  e_JS      4.11e-21   1.35    0.118   0.186   0.0151   0.194
TABLE I: Average Jacobian-symmetry error $e_{\mathrm{JS}}$ on 16×16 image patches

Jacobian symmetry is a somewhat abstract construct; what we really care about is the accuracy of the RED gradient expressions (13) and (22). To assess gradient accuracy, we numerically evaluated the gradient of $\rho_{\mathrm{RED}}$ at $x$ using

$[\widehat{\nabla}\rho_{\mathrm{RED}}(x)]_n := \dfrac{\rho_{\mathrm{RED}}(x + \delta e_n) - \rho_{\mathrm{RED}}(x)}{\delta},$   (36)

and compared the result to the analytical expressions (13) and (22). Table II reports the normalized gradient error

$e_{\nabla}(x) := \dfrac{\|\widehat{\nabla}\rho_{\mathrm{RED}}(x) - \nabla\rho_{\mathrm{RED}}(x)\|_2}{\|\widehat{\nabla}\rho_{\mathrm{RED}}(x)\|_2}$   (37)

for the same $\delta$, images, and denoisers used in Table I, where $\nabla\rho_{\mathrm{RED}}(x)$ is computed from one of the analytical expressions. The results in Table II show that, for all tested denoisers, the numerical gradient closely matches the analytical expression for $\nabla\rho_{\mathrm{RED}}(x)$ from (22), but not that from (13). The mismatch between $\widehat{\nabla}\rho_{\mathrm{RED}}(x)$ and the expression from (13) is partly due to insufficient JS and partly due to insufficient LH, as we establish below.
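The numerical gradient (36) and the error metric (37) can be coded in the same way; the fragment below also lists the two analytical candidates being compared, using a Jacobian J obtained, e.g., from the previous sketch.

```python
import numpy as np

def numerical_gradient(rho, x, delta):
    """One-sided finite-difference gradient of the scalar function rho, as in (36)."""
    g = np.zeros(x.size)
    rho_x = rho(x)
    for n in range(x.size):
        e_n = np.zeros(x.size)
        e_n[n] = delta
        g[n] = (rho(x + e_n) - rho_x) / delta
    return g

def gradient_error(g_num, g_analytical):
    """Normalized gradient error (37)."""
    return np.linalg.norm(g_num - g_analytical) / np.linalg.norm(g_num)

def red_gradients(x, fx, J):
    """Analytical candidates: expression (13) and expression (22)."""
    g13 = x - fx
    g22 = x - 0.5 * fx - 0.5 * J.T @ x
    return g13, g22
```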

                   TDT        MF         NLM        BM3D       TNRD       DnCNN
  e_∇ from (13)    0.565      0.966      0.913      1.00       0.957      0.852
  e_∇ from (38)    0.565      6.09e-15   0.0699     0.344      0.139      1.20
  e_∇ from (22)    3.39e-19   2.65e-15   6.17e-21   2.14e-13   5.42e-17   1.02e-12
TABLE II: Average gradient error $e_{\nabla}$ on 16×16 image patches

III-F Local Homogeneity Experiments

Recall that the TDT denoiser has a symmetric Jacobian, both theoretically and empirically. Yet Table II reports a disagreement between the expressions (13) and (22) for TDT. We now show that this disagreement is due to insufficient local homogeneity (LH).

To do this, we introduce yet another RED gradient expression,

$\nabla\rho_{\mathrm{RED}}(x) \doteq x - \tfrac{1}{2}\nabla f(x)\, x - \tfrac{1}{2}[\nabla f(x)]^\top x,$   (38)

which results from combining (22) with Lemma 1. Here, “$\doteq$” indicates that (38) holds under LH. In contrast, the gradient expression (13) holds under both LH and Jacobian symmetry, while the gradient expression (22) holds in general (i.e., even in the absence of LH and/or Jacobian symmetry). We also introduce two normalized error metrics for LH,

$e_{\mathrm{LH1}}(x) := \dfrac{\|f\big((1+\varepsilon)x\big) - (1+\varepsilon)f(x)\|_2}{\|f(x)\|_2}$   (39)
$e_{\mathrm{LH2}}(x) := \dfrac{\|\nabla f(x)\, x - f(x)\|_2}{\|f(x)\|_2},$   (40)

which should both be nearly zero for LH $f(\cdot)$. Note that $e_{\mathrm{LH1}}$ quantifies LH according to definition (12) and closely matches the numerical analysis of LH in [1]. Meanwhile, $e_{\mathrm{LH2}}$ quantifies LH according to Lemma 1 and to how LH is used in the gradient expressions (13) and (38).
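Both LH metrics are simple to evaluate given the denoiser and its (numerical) Jacobian; a sketch follows, with the perturbation size ε left as a caller-chosen small constant.

```python
import numpy as np

def lh_error_definition(f, x, eps):
    """e_LH1 (39): directly tests the local-homogeneity definition (12)."""
    fx = f(x)
    return np.linalg.norm(f((1 + eps) * x) - (1 + eps) * fx) / np.linalg.norm(fx)

def lh_error_lemma1(f, J, x):
    """e_LH2 (40): tests Lemma 1, i.e., whether J x = f(x)."""
    fx = f(x)
    return np.linalg.norm(J @ x - fx) / np.linalg.norm(fx)
```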

            TDT        MF         NLM       BM3D       TNRD       DnCNN
  e_LH1     7.99e-10   0          5.60e-9   1.52e-13   5.09e-10   2.06e-9
  e_LH2     4.10e-4    2.14e-15   5.63e-3   0.214      2.60e-4    8.02e-3
TABLE III: Average local-homogeneity errors $e_{\mathrm{LH1}}$ and $e_{\mathrm{LH2}}$ on 16×16 image patches

The middle row of Table II reports the average error of the gradient expression (38), and Table III reports the average LH error for the metrics $e_{\mathrm{LH1}}$ and $e_{\mathrm{LH2}}$. There we see that the average $e_{\mathrm{LH1}}$ error is small for all denoisers, consistent with the experiments in [1]. But the average $e_{\mathrm{LH2}}$ error is several orders of magnitude larger (for all but the MF denoiser). As discussed below, these seemingly small imperfections in LH have a significant effect on the RED gradient expressions (13) and (38).

Starting with the TDT denoiser, Table II shows that the gradient error of (38) is large, which can only be caused by insufficient LH. The insufficient LH is confirmed in Table III, which shows that the value of $e_{\mathrm{LH2}}$ for TDT is non-negligible, especially in comparison to the value for MF.

Continuing with the MF denoiser, Table I indicates that its Jacobian is far from symmetric, while Table III indicates that it is LH. The gradient results in Table II are consistent with these behaviors: the expression (38) is accurate on account of LH being satisfied, but the expression (13) is inaccurate on account of the lack of JS.

The results for the remaining denoisers NLM, BM3D, TNRD, and DnCNN show a common trend: they have non-trivial levels of both JS error (see Table I) and LH error (see Table III). As a result, the gradient expressions (13) and (38) are both inaccurate (see Table II).

In conclusion, the experiments in this section show that the RED gradient expressions (13) and (38) are very sensitive to small imperfections in LH. Although the experiments in [1] suggested that many popular image denoisers are approximately LH, our experiments suggest that their levels of LH are insufficient to maintain the accuracy of the RED gradient expressions (13) and (38).

III-G Example RED-SD Trajectory

We now provide an example of how the RED algorithms from [1] do not necessarily minimize the variational objective $C_{\mathrm{RED}}$ from (14).

Figure 1 plots the RED cost $C_{\mathrm{RED}}(x_k)$ from (14) and the error on the fixed-point condition (15), i.e.,

$\big\|\tfrac{1}{\sigma^2} A^\top (y - A x_k) - \lambda\big(x_k - f(x_k)\big)\big\|_2,$

versus iteration $k$ for a trajectory $\{x_k\}$ produced by the steepest-descent (SD) RED algorithm from [1]. For this experiment, we used the median filter for $f(\cdot)$, the Starfish image, and noisy measurements $y = x_0 + e$ (i.e., $A = I$ in (1)).

Fig. 1: RED cost and fixed-point error versus iteration $k$ for $x_k$ produced by the RED-SD algorithm from [1]. Although the fixed-point condition is asymptotically satisfied, the RED cost does not decrease with $k$.

Figure 1 shows that, although the RED-SD algorithm asymptotically satisfies the fixed-point condition (15), the RED cost function $C_{\mathrm{RED}}(x_k)$ does not decrease with $k$, as would be expected if the RED algorithms truly minimized the RED cost $C_{\mathrm{RED}}$. This behavior implies that any optimization algorithm that monitors the objective value for, say, backtracking line-search (e.g., the FASTA algorithm [26]), cannot be applied in the context of RED.
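The experiment behind Figure 1 is straightforward to reproduce in outline: run steepest-descent updates driven by the presumed gradient $x - f(x)$ and log both the RED cost (14) and the fixed-point residual (15) at every iteration. The sketch below does this for the denoising case $A = I$; the step size, iteration count, and initialization are arbitrary illustrative choices, not the settings used for Figure 1.

```python
import numpy as np

def red_sd_trajectory(y, f, lam, sigma2, mu=0.1, num_iters=200):
    """RED-SD sketch for A = I, logging the RED cost (14) and fixed-point error (15)."""
    x = y.copy()
    costs, fp_errors = [], []
    for _ in range(num_iters):
        fx = f(x)
        grad = (x - y) / sigma2 + lam * (x - fx)     # presumed gradient of (14)
        x = x - mu * grad
        fx = f(x)
        cost = np.sum((y - x) ** 2) / (2 * sigma2) + lam * 0.5 * x @ (x - fx)
        fp_err = np.linalg.norm((y - x) / sigma2 - lam * (x - fx))
        costs.append(cost)
        fp_errors.append(fp_err)
    return x, np.array(costs), np.array(fp_errors)
```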

III-H Hessian and Convexity

From (26), the $(n,m)$th element of the Hessian of $\rho_{\mathrm{RED}}$ equals

$[\nabla^2\rho_{\mathrm{RED}}(x)]_{n,m} = \dfrac{\partial}{\partial x_m}\Big(x_n - \dfrac{1}{2}f_n(x) - \dfrac{1}{2}\sum_i x_i\dfrac{\partial f_i(x)}{\partial x_n}\Big)$   (41)
$= \delta_{n,m} - \dfrac{1}{2}\dfrac{\partial f_n(x)}{\partial x_m} - \dfrac{1}{2}\dfrac{\partial}{\partial x_m}\sum_i x_i\dfrac{\partial f_i(x)}{\partial x_n}$   (42)
$= \delta_{n,m} - \dfrac{1}{2}\dfrac{\partial f_n(x)}{\partial x_m} - \dfrac{1}{2}\dfrac{\partial f_m(x)}{\partial x_n} - \dfrac{1}{2}\sum_i x_i\dfrac{\partial^2 f_i(x)}{\partial x_m\partial x_n},$   (43)

where $\delta_{n,m} = 1$ if $n = m$ and otherwise $\delta_{n,m} = 0$. Thus, the Hessian of $\rho_{\mathrm{RED}}$ at $x$ equals

$\nabla^2\rho_{\mathrm{RED}}(x) = I - \dfrac{1}{2}\big(\nabla f(x) + [\nabla f(x)]^\top\big) - \dfrac{1}{2}\sum_{i=1}^N x_i\,\nabla^2 f_i(x).$   (44)

This expression can be contrasted with the Hessian expression from [1, (60)], which reads

$\nabla^2\rho_{\mathrm{RED}}(x) = I - \nabla f(x).$   (45)

Interestingly, (44) differs from (45) even when the denoiser has a symmetric Jacobian $\nabla f(x)$. One implication is that, even if the eigenvalues of $\nabla f(x)$ are limited to the interval $[0,1]$, the Hessian (44) may not be positive semi-definite due to its last term, with negative implications on the convexity of $\rho_{\mathrm{RED}}$. That said, the RED algorithms do not actually minimize the variational objective $C_{\mathrm{RED}}$ for common denoisers (as established in Section III-G), and so the convexity of $\rho_{\mathrm{RED}}$ may not matter in practice.

IV Score-Matching by Denoising

As discussed in Section II-D, the RED algorithms proposed in [1] are explicitly based on the gradient rule

$\nabla\rho(x) = x - f(x).$   (46)

This rule appears to be useful, since these algorithms work very well in practice. But Section III established that $\rho_{\mathrm{RED}}$ from (11) does not usually satisfy (46). We are thus motivated to seek an alternative explanation for the RED algorithms. In this section, we explain them through a framework that we call “score-matching by denoising” (SMD).

IV-A Regularization by Log-Likelihood

As a precursor to the SMD framework, we first propose a technique called regularization by log-likelihood.

Recall the measurement model (10) used to define the “denoising” problem, repeated in (47) for convenience:

$z = x_0 + e, \quad e \sim \mathcal{N}(0, \nu I).$   (47)

To avoid confusion, we will refer to $z$ as “pseudo-measurements” and $y$ as “measurements.” From (47), the likelihood function for the pseudo-measurements is $p(z|x_0 = x) = \mathcal{N}(z; x, \nu I)$.

Now, suppose that we model the true image $x_0$ as a realization of a random vector with prior pdf $\tilde{p}$. We write “$\tilde{p}$” to emphasize that the model distribution may differ from the true distribution (from which the image is actually drawn). Under this prior model, the MMSE denoiser of $x_0$ from $z$ is

$\tilde{f}(z) := \mathbb{E}[x | z] = \int t\,\tilde{p}(t|z)\,\mathrm{d}t,$   (48)

and the likelihood of observing $z$ is

$\tilde{p}(z) = \int p(z|t)\,\tilde{p}(t)\,\mathrm{d}t$   (49)
$= \int \mathcal{N}(z; t, \nu I)\,\tilde{p}(t)\,\mathrm{d}t.$   (50)

In what we will refer to as regularization by log-likelihood (RLL), an explicit regularizer is constructed with the form

$\rho_{\mathrm{RLL}}(x) := -\nu\,\ln\tilde{p}(x).$   (51)

As we now show, $\rho_{\mathrm{RLL}}$ has the desired property (46).

Lemma 4.

(RLL) For $\rho_{\mathrm{RLL}}$ defined in (51),

$\nabla\rho_{\mathrm{RLL}}(x) = x - \tilde{f}(x),$   (52)

where $\tilde{f}(\cdot)$ is the MMSE denoiser from (48).

Proof.

Equation (52) is a direct consequence of a classical result known as Tweedie’s formula [27, 28]. A short proof, from first principles, is now given for completeness.

$\dfrac{\partial\rho_{\mathrm{RLL}}(x)}{\partial x_n} = -\nu\dfrac{\partial}{\partial x_n}\ln\tilde{p}(x)$   (53)
$= -\dfrac{\nu}{\tilde{p}(x)}\dfrac{\partial\tilde{p}(x)}{\partial x_n}$   (54)
$= -\dfrac{\nu}{\tilde{p}(x)}\int\dfrac{t_n - x_n}{\nu}\,\mathcal{N}(x; t, \nu I)\,\tilde{p}(t)\,\mathrm{d}t$   (55)
$= -\int (t_n - x_n)\,\tilde{p}(t|x)\,\mathrm{d}t$   (56)
$= x_n - \int t_n\,\tilde{p}(t|x)\,\mathrm{d}t$   (57)
$= x_n - \tilde{f}_n(x),$   (58)

where (55) used $\tfrac{\partial}{\partial x_n}\mathcal{N}(x; t, \nu I) = \tfrac{t_n - x_n}{\nu}\mathcal{N}(x; t, \nu I)$ and (56) used Bayes rule $\tilde{p}(t|x) = \mathcal{N}(x; t, \nu I)\tilde{p}(t)/\tilde{p}(x)$. Stacking (58) for $n = 1,\dots,N$ in a vector yields (52). ∎

Thus, if the RLL regularizer $\rho_{\mathrm{RLL}}$ is used in place of $\rho_{\mathrm{RED}}$ in the optimization problem (14), then the solution must satisfy the fixed-point condition (15) associated with the RED algorithms from [1], albeit with an MMSE-type denoiser $\tilde{f}(\cdot)$. This restriction will be removed using the SMD framework in Section IV-C.

It is interesting to note that the gradient property (52) holds even for non-homogeneous $\tilde{f}(\cdot)$. This generality is important in applications under which $\tilde{f}(\cdot)$ is known to lack LH. For example, with a binary image modeled by an i.i.d. equiprobable prior on $\{0,1\}^N$, the MMSE denoiser takes the componentwise sigmoidal form $\tilde{f}_n(z) = 1/\big(1 + \exp\big((1 - 2z_n)/(2\nu)\big)\big)$, which is not LH.
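Lemma 4 is easy to verify numerically for simple scalar priors. The sketch below does so for the equiprobable binary prior just mentioned: it forms the closed-form MMSE denoiser, differentiates $\ln\tilde{p}$ by central differences, and confirms that $x - \tilde{f}(x) = -\nu\,\mathrm{d}\ln\tilde{p}(x)/\mathrm{d}x$. This is our own sanity check, with an arbitrary choice of $\nu$, not an experiment from the paper.

```python
import numpy as np

nu = 0.5                                   # pseudo-measurement noise variance
z = np.linspace(-2.0, 3.0, 200)

def p_tilde(z):
    """Marginal density of z when x is 0 or 1 with equal probability."""
    gauss = lambda m: np.exp(-(z - m) ** 2 / (2 * nu)) / np.sqrt(2 * np.pi * nu)
    return 0.5 * (gauss(0.0) + gauss(1.0))

def f_mmse(z):
    """MMSE denoiser E[x|z]: a sigmoid, hence not locally homogeneous."""
    return 1.0 / (1.0 + np.exp((1.0 - 2.0 * z) / (2.0 * nu)))

dz = 1e-5
score = (np.log(p_tilde(z + dz)) - np.log(p_tilde(z - dz))) / (2 * dz)
print(np.max(np.abs((z - f_mmse(z)) + nu * score)))    # ~0, confirming (52)
```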

IV-B RLL as Kernel Density Estimation

We now show that regularization by log-likelihood (RLL) arises naturally in the data-driven, non-parametric context through kernel-density estimation (KDE) [8].

Recall that, in most imaging applications, the true prior on $x_0$ is unknown, as is the corresponding true MMSE denoiser. There are several ways to proceed. One way is to design “by hand” an approximate prior $\tilde{p}$ that leads to a computationally efficient denoiser $\tilde{f}(\cdot)$. But, because this denoiser is not MMSE for the true prior, the performance of the resulting estimates will suffer relative to the true MMSE denoiser.

Another way to proceed is to approximate the prior using a large corpus of training data $\{x_t\}_{t=1}^T$. To this end, an approximate prior could be formed using the empirical estimate

$\hat{p}(x) := \dfrac{1}{T}\sum_{t=1}^T \delta(x - x_t),$   (59)

but a much better (and smoother) match to the true prior can be obtained using

$\tilde{p}(x; \nu) := \dfrac{1}{T}\sum_{t=1}^T \mathcal{N}(x; x_t, \nu I),$   (60)

with appropriately chosen $\nu > 0$, a technique known as kernel density estimation (KDE) or Parzen windowing [8]. Note that if $\tilde{p}(\cdot; \nu)$ is used as a surrogate for the true prior, then the MAP optimization problem becomes

$\hat{x} = \arg\min_x \big\{ -\ln p(y|x) - \ln\tilde{p}(x; \nu) \big\}$   (61)
$= \arg\min_x \big\{ \ell(x; y) + \tfrac{1}{\nu}\rho_{\mathrm{RLL}}(x) \big\},$   (62)

with $\rho_{\mathrm{RLL}}$ constructed using the empirical prior $\hat{p}$ from (59), so that the pseudo-measurement likelihood (50) coincides with the KDE prior (60). In summary, RLL arises naturally in the data-driven approach to image recovery when KDE is used to smooth the empirical prior.
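Under the KDE prior (60), the MMSE denoiser of Lemma 4 has a closed form: a softmax-weighted average of the training samples. The sketch below implements it directly; it is exact but costs $O(TN)$ per call, which is the bottleneck that motivates the score-matching approximation of the next subsection. The array shapes and names are our own.

```python
import numpy as np

def kde_mmse_denoiser(z, train, nu):
    """E[x|z] under the KDE prior (60); train holds T training images of length N."""
    # log N(z; x_t, nu*I) for each training sample, up to a common constant
    log_w = -np.sum((z[None, :] - train) ** 2, axis=1) / (2 * nu)
    log_w -= log_w.max()              # stabilize the softmax
    w = np.exp(log_w)
    w /= w.sum()                      # posterior probabilities p(x_t | z)
    return w @ train                  # posterior mean = weighted training average
```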

IV-C Score-Matching by Denoising

A limitation of the above RLL framework is that it results in denoisers with symmetric Jacobians. (Recall the discussion of MMSE denoisers in Section III-D.) To justify the use of RED algorithms with non-symmetric Jacobians, we introduce the score-matching by denoising (SMD) framework in this section.

Let us continue with the KDE-based MAP estimation problem (61). Note that $\hat{x}$ from (61) zeros the gradient of the MAP optimization objective and thus obeys the fixed-point equation

$\nabla\ell(\hat{x}; y) - \nabla_x\ln\tilde{p}(\hat{x}; \nu) = 0.$   (63)

In principle, $\hat{x}$ in (63) could be found using gradient descent or similar techniques. However, computation of the gradient

$\nabla_x\ln\tilde{p}(x; \nu) = \dfrac{\sum_{t=1}^T \nabla_x\mathcal{N}(x; x_t, \nu I)}{\sum_{t'=1}^T \mathcal{N}(x; x_{t'}, \nu I)}$   (64)

is too expensive for the values of $T$ typically needed to generate a good image prior $\tilde{p}(\cdot; \nu)$.

A tractable alternative is suggested by the fact that

$-\nabla_x\ln\tilde{p}(x; \nu) = \tfrac{1}{\nu}\big(x - \mathbb{E}[x_0 | z = x]\big)$   (65)
$= \tfrac{1}{\nu}\big(x - \tilde{f}(x)\big),$   (66)

where $\tilde{f}(z) := \mathbb{E}[x_0|z]$ is the MMSE estimator of $x_0 \sim \hat{p}$ from $z = x_0 + e$, $e \sim \mathcal{N}(0, \nu I)$. In particular, if we can construct a good approximation to $\tilde{f}$ using a denoiser in a computationally efficient function class $\mathcal{F}$, then we can efficiently approximate the MAP problem (61).

This approach can be formalized using the framework of score matching [29], which aims to approximate the “score” (i.e., the gradient of the log-prior) rather than the prior itself. For example, suppose that we want to approximate the score $\nabla_x\ln\tilde{p}(x; \nu)$. For this, Hyvärinen [29] suggested to first find the best mean-square fit among a set $\Psi$ of computationally efficient functions, i.e., find

$\hat{\psi} = \arg\min_{\psi\in\Psi}\;\mathbb{E}\big\|\psi(z) - \nabla_z\ln\tilde{p}(z; \nu)\big\|_2^2,$   (67)

and then to approximate the score by $\hat{\psi}$. Later, in the context of denoising autoencoders, Vincent [30] showed that if we choose

$\psi(z) = \tfrac{1}{\nu}\big(f(z) - z\big) \text{ for some function } f \in \mathcal{F},$   (68)

then $\hat{\psi}$ from (67) can be equivalently written as $\hat{\psi}(z) = \tfrac{1}{\nu}\big(\hat{f}(z) - z\big)$ with

$\hat{f} = \arg\min_{f\in\mathcal{F}}\;\mathbb{E}\big\|f(z) - x_0\big\|_2^2.$   (69)

In this case, $\hat{f}$ is the MSE-optimal denoiser, averaged over the distribution of $(x_0, z)$ and constrained to the function class $\mathcal{F}$.