Discriminative Transfer Learning for General Image Restoration
Abstract
Recently, several discriminative learning approaches have been proposed for effective image restoration, achieving a convincing tradeoff between image quality and computational efficiency. However, these methods require separate training for each restoration task (e.g., denoising, deblurring, demosaicing) and problem condition (e.g., noise level of input images). This makes it time-consuming and difficult to encompass all tasks and conditions during training. In this paper, we propose a discriminative transfer learning method that incorporates formal proximal optimization and discriminative learning for general image restoration. The method requires a single-pass training and allows for reuse across various problems and conditions while achieving an efficiency comparable to previous discriminative approaches. Furthermore, after being trained, our model can easily be transferred to new likelihood terms to solve untrained tasks, or be combined with existing priors to further improve image restoration quality.
1 Introduction
Low-level vision problems, such as denoising, deconvolution and demosaicing, have to be addressed as part of most imaging and vision systems. Although a large body of work covers these classical problems, low-level vision is still a very active area. The reason is that, from a Bayesian perspective, solving them as statistical estimation problems relies not only on models for the likelihood (i.e. the reconstruction task), but also on natural image priors as a key component.
A variety of models for natural image statistics have been explored in the past. Traditionally, models for gradient statistics [27, 17], including total variation, have been a popular choice. Another line of work explores patch-based image statistics, either as per-patch sparse models [11, 35] or by modeling nonlocal similarity between patches [9, 10, 13]. These prior models are general in the sense that they can be applied for various likelihoods, with the image formation and noise setting as parameters. However, the resulting optimization problems are prohibitively expensive, rendering them impractical for many real-time tasks, especially on mobile platforms.
Recently, a number of works [29, 8] have addressed this issue by truncating the iterative optimization and learning discriminative image priors, tailored to a specific reconstruction task (likelihood) and optimization approach. While these methods allow trading off quality against the computational budget for a given application, the learned models are highly specialized to the image formation model and noise parameters, in contrast to optimization-based approaches. Since each individual problem instantiation requires costly learning and storing of the model coefficients, current proposals for learned models are impractical for vision applications with dynamically changing (often continuous) parameters. This is a common scenario in most real-world vision settings, as well as in applications in engineering and scientific imaging that rely on the ability to rapidly prototype methods.
In this paper, we combine discriminative learning techniques with formal proximal optimization methods to learn generic models that can be truly transferred across problem domains while achieving efficiency comparable to previous discriminative approaches. Using proximal optimization methods [12, 23, 3] allows us to decouple the likelihood and prior, which is key to learning such shared models. It also means that we can rely on well-researched, physically motivated models for the likelihood, while learning priors from example data. We verify our technique using the same model for a variety of diverse low-level image reconstruction tasks and problem conditions, demonstrating the effectiveness and versatility of our approach. After training, our approach benefits from the proximal splitting techniques, and can naturally be transferred to new likelihood terms for untrained restoration tasks, or it can be combined with existing state-of-the-art priors to further improve the reconstruction quality. This is impossible with previous discriminative methods. In particular, we make the following contributions:

- We propose a discriminative transfer learning technique for general image restoration. It requires a single-pass training and transfers across different restoration tasks and problem conditions.
- We show that our approach is general by demonstrating its robustness for diverse low-level problems, such as denoising, deconvolution and inpainting, and for varying noise settings.
- We show that, while being general, our method achieves computational efficiency comparable to previous discriminative approaches, making it suitable for processing high-resolution images on mobile imaging systems.
- We show that our method can naturally be combined with existing likelihood terms and priors after being trained. This allows our method to process untrained restoration tasks and to take advantage of previous successful work on image priors (e.g., color and nonlocal similarity priors).
2 Related work
Image restoration aims at computationally enhancing the quality of images by undoing the adverse effects of image degradation such as noise and blur. As a key area of image and signal processing, it is an extremely well-studied problem and a plethora of methods exist; see for example [22] for a recent survey. Through the successful application of machine learning and data-driven approaches, image restoration has seen revived interest and much progress in recent years. Broadly speaking, recently proposed methods can be grouped into three classes: classical approaches that make no explicit use of machine learning, generative approaches that aim at probabilistic models of undegraded natural images, and discriminative approaches that try to learn a direct mapping from degraded to clean images. Unlike classical methods, methods belonging to the latter two classes depend on the availability of training data.
Classical models focus on local image statistics and aim at maintaining edges. Examples include total variation [27], bilateral filtering [32] and anisotropic diffusion models [34]. More recent methods exploit the nonlocal statistics of images [1, 9, 21, 10, 13, 31]. In particular the highly successful BM3D method [9] searches for similar patches within the same image and combines them through a collaborative filtering step.
Generative learning models seek to learn probabilistic models of undegraded natural images. A simple, yet powerful subclass includes models that approximate the sparse gradient distribution of natural images [19, 17, 18]. More expressive generative models include the fields of experts (FoE) model [26], KSVD [11] and the EPLL model [35]. While both FoE and KSVD learn a set of filters whose responses are assumed to be sparse, EPLL models natural images through Gaussian mixture models. All of these models have in common that they are agnostic to the image restoration task, i.e. they are transferable to any image degradation and can be combined in a modular fashion with any likelihood and additional priors at test time.
Discriminative learning models have recently become increasingly popular for image restoration due to their attractive tradeoff between high image restoration quality and efficiency at test time. Methods include trainable random field models such as cascaded shrinkage fields (CSF) [29], regression tree fields (RTF) [16], trainable nonlinear reaction diffusion (TRD) models [8], as well as deep convolutional networks [15] and other multilayer perceptrons [4].
Discriminative approaches owe their computational efficiency at runtime to a particular feedforward structure whose trainable parameters are optimized for a particular task during training. Those learned parameters are then kept fixed at test time, resulting in a fixed computational cost. On the downside, discriminative models do not generalize across tasks and typically necessitate separate feedforward architectures and separate training for each restoration task (denoising, demosaicing, deblurring, etc.) as well as every possible image degradation (noise level, Bayer pattern, blur kernel, etc.).
In this work, we propose a discriminative transfer learning technique that is able to combine the strengths of both generative and discriminative models: it maintains the flexibility of generative models, but at the same time enjoys the computational efficiency of discriminative models. While in spirit our approach is akin to the recently proposed method of Rosenbaum and Weiss [25], who equipped the successful EPLL model with a discriminative prediction step, the key idea in our approach is to use proximal optimization techniques [12, 23, 3] that allow the decoupling of likelihood and prior, and thereby retain the full advantages of a Bayesian generative modeling approach.
                      FoE    EPLL   BM3D   TRD    ours
Runtime efficiency    –      –      ✓      ✓      ✓
Easy to parallelize   –      –      –      ✓      ✓
Transferable          ✓      ✓      ✓      –      ✓
Modular               ✓      ✓      ✓      –      ✓
Table 1 summarizes the properties of the most prominent stateoftheart methods and puts our own proposed approach into perspective.
3 Proposed method
3.1 Diversity of data likelihood
The seminal work of fields-of-experts (FoE) [26] generalizes the form of filter-response-based regularizers in the objective function given in Eq. 1. The vectors $b$ and $x$ represent the observed and latent (desired) image respectively, the matrix $A$ is the sensing operator, $f_i \otimes x$ represents 2D convolution with filter $f_i$, and $\rho_i$ represents the penalty function on the corresponding filter responses $f_i \otimes x$. The positive scalar $\lambda$ controls the relative weight between the data fidelity (likelihood) term and the regularization term.
$\min_x \; \frac{\lambda}{2}\|b - Ax\|_2^2 + \sum_{i} \rho_i(f_i \otimes x)$  (1)
The well-known anisotropic total-variation regularizer can be viewed as a special case of the FoE model where the $f_i$ are the horizontal and vertical derivative filters $\nabla_x, \nabla_y$, and $\rho_i$ is the $\ell_1$ norm.
While there are various types of restoration tasks (e.g., denoising, deblurring, demosaicing) and problem parameters (e.g., noise level of input images), each problem has its own sensing matrix $A$ and optimal fidelity weight $\lambda$. For example, $A$ is an identity matrix for denoising, a convolution operator for deblurring, a binary diagonal matrix for demosaicing, and a random matrix for compressive sensing [5]. The optimal $\lambda$ depends on both the task and its parameters in order to produce the best quality results.
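To make this task-dependence concrete, the sensing operators listed above can be sketched as linear maps in NumPy. This is an illustrative sketch rather than the paper's implementation; the function names and the periodic boundary handling are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def A_denoise(x):
    # Denoising: the sensing operator A is the identity.
    return x

def A_deblur(x, kernel):
    # Deblurring: A is 2D convolution with a known blur kernel
    # (periodic boundaries assumed here for simplicity).
    return convolve2d(x, kernel, mode="same", boundary="wrap")

def A_mask(x, mask):
    # Demosaicing / inpainting: A is a binary diagonal matrix,
    # i.e. element-wise masking of the measured pixels.
    return mask * x
```

Only this operator (and the fidelity weight) changes from task to task; the prior stays the same.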
The state-of-the-art discriminative learning methods (CSF [29], TRD [8]) derive an end-to-end feedforward model from Eq. 1 for each specific restoration task, and train this model to map the degraded input images directly to the output. These methods have demonstrated a great tradeoff between image quality and time efficiency; however, as an inherent problem of the discriminative learning procedure, they require separate training for each restoration task and problem condition. Given the diversity of data likelihoods in image restoration, this fundamental drawback of discriminative models makes it time-consuming and difficult to encompass all tasks and conditions during training.
3.2 Decoupling likelihood and prior
It is difficult to directly minimize Eq. 1 when the penalty function $\rho_i$ is nonlinear and/or non-smooth (e.g., a sparsity-inducing $\ell_1$ or $\ell_0$ norm). Proximal algorithms [3, 12, 6] instead relax Eq. 1 and split the original problem into several easier subproblems that are solved alternately until convergence.
In this paper we employ the half-quadratic splitting (HQS) algorithm [12] to relax Eq. 1, as it typically requires far fewer iterations to converge compared with other proximal methods such as ADMM [3] and PD [6]. The relaxed objective function is given in Eq. 2:
$\min_{x,z} \; \frac{\lambda}{2}\|b - Ax\|_2^2 + \sum_{i} \rho_i(f_i \otimes z) + \frac{\beta}{2}\|z - x\|_2^2$  (2)
where a slack variable $z$ is introduced to approximate $x$, and $\beta$ is a positive scalar.
With the HQS algorithm, Eq. 2 is iteratively minimized by solving for the slack variable $z$ and the latent image $x$ alternately, as in Eq. 3 and 4 ($t = 1, \dots, T$ denotes the iteration index).
Prior proximal operator:
$z^{t} = \arg\min_z \; \frac{\beta^t}{2}\|z - x^{t-1}\|_2^2 + \sum_{i} \rho_i(f_i \otimes z)$  (3)
Data proximal operator:
$x^{t} = \arg\min_x \; \frac{\lambda}{2}\|b - Ax\|_2^2 + \frac{\beta^t}{2}\|z^{t} - x\|_2^2$  (4)
where $\beta^t$ increases as the iterations continue. This forces $z$ to become an increasingly good approximation of $x$, thus making Eq. 2 an increasingly good proxy for Eq. 1.
Note that, while most related approaches including CSF [29] relax Eq. 1 by splitting on the filter responses $f_i \otimes x$, we split on the image $x$ instead. This is critical for deriving our approach. With this new splitting strategy, the prior term and the data likelihood term in the original objective Eq. 1 are now separated into two subproblems that we call the "prior proximal operator" (Eq. 3) and the "data proximal operator" (Eq. 4), respectively.
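The alternation between Eq. 3 and Eq. 4 can be summarized in a few lines. This is a minimal sketch assuming the two proximal operators are supplied as callables; all names are hypothetical.

```python
import numpy as np

def hqs(b, prox_prior, prox_data, betas):
    """Generic HQS skeleton: alternate the prior proximal operator
    (Eq. 3, task-agnostic) and the data proximal operator (Eq. 4,
    task-specific) while the coupling weight beta increases."""
    x = b.copy()                      # initialize with the observation
    for beta in betas:                # beta grows with the iteration t
        z = prox_prior(x, beta)       # Eq. 3: shared across tasks
        x = prox_data(z, beta)        # Eq. 4: depends on A and lambda
    return x
```

Swapping in a different `prox_data` (denoising, deblurring, inpainting) while keeping `prox_prior` fixed is exactly the transfer mechanism described in the text.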
3.3 Discriminative transfer learning
We observed that, while the data proximal operator in Eq. 4 is task-dependent because both the sensing matrix $A$ and the fidelity weight $\lambda$ are problem-specific as explained in Sec. 3.1, the prior proximal operator (i.e., the $z$-update step in Eq. 3) is independent of the original restoration task and problem condition.
This leads to our main insight: discriminatively learned models can be made transferable by using them in place of the prior proximal operator, embedded in a proximal optimization algorithm. This allows us to generalize a single discriminatively learned model to a very large class of problems, i.e. any linear inverse imaging problem, while simultaneously overcoming the need for problem-specific retraining. Moreover, it enables learning the task-dependent parameter $\lambda$ in the data proximal operator for each problem in a single training pass, eliminating tedious hand-tuning at test time.
We also observed that, benefiting from our new splitting strategy, the prior proximal operator in Eq. 3 can be interpreted as a Gaussian denoiser applied to the intermediate image $x^{t-1}$, since the least-squares consensus term $\frac{\beta^t}{2}\|z - x^{t-1}\|_2^2$ is equivalent to a Gaussian denoising data term. This inspires us to utilize existing discriminative models that have been successfully used for denoising (e.g. CSF, TRD).
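This denoising reading can be checked on a toy prior: for a purely quadratic penalty applied to $z$ itself (an illustrative stand-in for the learned filter penalties, not the paper's model), the prior proximal operator has a closed form and acts as a shrinkage denoiser whose strength is governed by $\beta$.

```python
import numpy as np

def prox_quadratic(x, beta, alpha):
    # argmin_z (beta/2)*||z - x||^2 + (alpha/2)*||z||^2
    # Setting the gradient beta*(z - x) + alpha*z to zero gives the
    # closed-form shrinkage below: larger beta trusts the noisy input
    # more, exactly like a Gaussian denoiser with variance ~ 1/beta.
    return beta * x / (beta + alpha)
```

As $\beta$ grows over the HQS iterations, the operator denoises less and less aggressively, consistent with $z \to x$.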
For convenience, we denote the prior proximal operator as $\mathrm{prox}_{\Theta}(\cdot)$, i.e.
$z^{t} = \mathrm{prox}_{\Theta}(x^{t-1}) = \arg\min_z \; \frac{\beta^t}{2}\|z - x^{t-1}\|_2^2 + \sum_{i} \rho_i(f_i \otimes z)$  (5)
where the model parameters $\Theta$ include a number of filters $f_i$ and corresponding penalty functions $\rho_i$. Inspired by the state-of-the-art discriminative methods [29, 8], we propose to learn the model $\Theta$, and the fidelity weight scalars $\lambda$, from training data. Recall that with our new splitting strategy introduced in Sec. 3.2, the image prior and data-fidelity term of the original objective (Eq. 1) are contained in two separate subproblems (Eq. 3 and 4). This makes it possible to train together an ensemble of diverse tasks (e.g., denoising, deblurring, or with different noise levels), each of which has its own data proximal operator, while learning a single prior proximal operator that is shared across tasks. This is in contrast to state-of-the-art discriminative methods such as CSF [29] and TRD [8], which train separate models for each task.
For clarity, in Fig. 1 we visualize the architecture of our method. The input images may represent various restoration tasks and problem conditions. At each HQS iteration, each image from problem class $c$ is updated by its own data proximal operator in Eq. 4, which contains a separate trainable fidelity weight $\lambda_c$ and a predefined sensing matrix $A_c$; then each slack image is updated by the same, shared prior proximal operator implemented by a learned, discriminative model.
Recurrent network. Note that in Fig. 1 each HQS iteration uses exactly the same model parameters, forming a recurrent network. This is in contrast to previous discriminative learning methods including CSF and TRD, which form feedforward networks. Our recurrent network architecture maintains the convergence property of the proximal optimization algorithm (HQS), and is critical for our method to transfer between various tasks and problem conditions.
Shared prior proximal operator. While any discriminative Gaussian denoising model could be used as $\mathrm{prox}_{\Theta}$ in our framework, we specifically propose to use a multi-stage nonlinear diffusion process modified from the TRD [8] model, for its efficiency. The model is given in Eq. 6.
$z_k = z_{k-1} - \sum_{i} \bar{f}_i^{\,k} \ast \phi_i^k\!\left(f_i^k \otimes z_{k-1}\right), \quad k = 1, \dots, K$  (6)

($\bar{f}_i^{\,k}$ denotes the filter $f_i^k$ rotated by 180 degrees.)
where $k$ is the stage index, the filters $f_i^k$ and functions $\phi_i^k$ are trainable model parameters at each stage, and $z_0 = x^{t-1}$ is the initial value of $z$. Note that, different from TRD, our model does not contain the reaction term, which would be $\beta^t(z_{k-1} - x^{t-1})$ multiplied by a step size. The main reasons for this modification are:

- The data constraint is contained in the $x$-update in Eq. 4;
- More importantly, by dropping the reaction term our model gets rid of the weight $\beta^t$, which changes at each HQS iteration. Our proximal operator is therefore simplified to:
$\mathrm{prox}_{\Theta}(x^{t-1}) = z_K, \quad \text{with } z_0 = x^{t-1}$  (7)
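The $K$-stage diffusion of Eq. 6 and 7 can be sketched as follows. This is a minimal sketch assuming periodic boundaries; the filters and pointwise nonlinearities are supplied as placeholders standing in for the learned parameters, not the trained model.

```python
import numpy as np
from scipy.signal import convolve2d

def prox_diffusion(x_prev, filters, phis):
    """Sketch of the reaction-free diffusion prox: K stages, each
    applying filters f_i^k, a pointwise nonlinearity phi_i^k, and the
    180-degree-rotated filters. `filters[k]` is a list of 2D kernels
    and `phis[k]` a matching list of callables."""
    z = x_prev.copy()                       # z_0 = x^{t-1}
    for fs, ps in zip(filters, phis):       # stages k = 1..K
        update = np.zeros_like(z)
        for f, phi in zip(fs, ps):
            r = convolve2d(z, f, mode="same", boundary="wrap")
            # phi acts pointwise on the filter responses; the rotated
            # filter plays the role of the transposed convolution
            update += convolve2d(phi(r), f[::-1, ::-1],
                                 mode="same", boundary="wrap")
        z = z - update
    return z
```

Because no reaction term appears, the function depends only on its input image and the stage parameters, which is what allows one trained operator to be reused at every HQS iteration.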
The parameters to learn in our method include $\lambda_c$ for each problem class $c$ (restoration task and problem condition), and $\Theta$ in the prior proximal operator shared across different classes, i.e. $\Omega = \{\{\lambda_c\}, \Theta\}$. Even though the scalar parameters $\lambda_c$ are trained, our method allows users to override them at test time to handle non-trained problem classes or specific inputs, as we will show in Sec. 4. This contrasts with previous discriminative approaches, whose model parameters are all fixed at test time. The subscript $c$ indicating the problem class in $\lambda_c$ is omitted below for convenience. The values of $\beta^t$ are preselected and increase at each HQS iteration (Sec. 3.2).
3.4 Training
We consider denoising and deconvolution tasks at training, where the sensing operator $A$ is an identity matrix, or a block-circulant matrix with circulant blocks that represents 2D convolution with randomly drawn blur kernels, respectively. In denoising tasks, the $x$-update in Eq. 4 has a closed-form solution:
$x^{t} = \dfrac{\lambda\, b + \beta^t z^{t}}{\lambda + \beta^t}$  (8)
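Eq. 8 is a pixel-wise weighted average of the observation and the denoised slack image; a one-line sketch:

```python
import numpy as np

def x_update_denoise(b, z, lam, beta):
    # Closed-form minimizer of (lam/2)*||b - x||^2 + (beta/2)*||z - x||^2:
    # a convex combination of observation b and slack image z, weighted
    # by the fidelity weight lam and the coupling weight beta.
    return (lam * b + beta * z) / (lam + beta)
```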
In deconvolution tasks, the $x$-update in Eq. 4 has a closed-form solution in the Fourier domain:
$x^{t} = \mathcal{F}^{-1}\!\left(\dfrac{\lambda\,\overline{\mathcal{F}(a)} \cdot \mathcal{F}(b) + \beta^t\,\mathcal{F}(z^{t})}{\lambda\,\overline{\mathcal{F}(a)} \cdot \mathcal{F}(a) + \beta^t}\right)$  (9)

($a$ denotes the blur kernel.)
where $\mathcal{F}$ and $\mathcal{F}^{-1}$ represent the Fourier and inverse Fourier transforms, respectively. Note that, compared to CSF [29], our method does not require FFT computations for denoising tasks. We use the L-BFGS solver [28] with analytic gradient computation for training. The training loss function $L$ is defined as the negative average Peak Signal-to-Noise Ratio (PSNR) of the reconstructed images. The gradient of $L$ w.r.t. the model parameters $\Omega$ is computed by accumulating gradients over all HQS iterations, i.e.
$\dfrac{\partial L}{\partial \Omega} = \sum_{t=1}^{T} \dfrac{\partial L}{\partial x^{t}} \cdot \dfrac{\partial x^{t}}{\partial \Omega}$  (10)
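The Fourier-domain update of Eq. 9 can be sketched as follows, assuming periodic boundary conditions and that the blur kernel's transfer function (its FFT zero-padded to the image size) has been precomputed; the names are illustrative.

```python
import numpy as np

def x_update_deconv(b, z, kernel_otf, lam, beta):
    """Sketch of the closed-form deconvolution x-update: convolution
    diagonalizes under the FFT, so the quadratic problem is solved
    pixel-wise in the frequency domain. `kernel_otf` = FFT of the blur
    kernel padded to the image size (an assumption of this sketch)."""
    F = np.fft.fft2
    num = lam * np.conj(kernel_otf) * F(b) + beta * F(z)
    den = lam * np.abs(kernel_otf) ** 2 + beta
    return np.real(np.fft.ifft2(num / den))
```

With an all-ones transfer function (i.e. no blur), this reduces exactly to the weighted average of Eq. 8, which is a convenient sanity check.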
The 1D functions $\phi_i^k$ in Eq. 6 are parameterized as linear combinations of equidistantly positioned Gaussian kernels whose weights are trainable.
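A sketch of this radial-basis parameterization; the kernel width and positions below are illustrative assumptions, and only the weights would be learned.

```python
import numpy as np

def make_phi(weights, means, gamma=1.0):
    """Build a 1D influence function phi as a linear combination of
    Gaussian kernels centered at equidistant positions `means`.
    Only `weights` is trainable; `gamma` sets the kernel width."""
    def phi(r):
        # r: array of filter responses; broadcast against the centers
        basis = np.exp(-gamma * (r[..., None] - means) ** 2)
        return basis @ weights
    return phi
```

Because the basis is fixed, the gradient of the loss w.r.t. the weights is linear in the basis activations, which keeps the analytic gradient computation for L-BFGS simple.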
Progressive training. A progressive scheme is proposed to make the training more effective. First, we set the number of HQS iterations to 1, and train $\lambda$ and the model of each stage in $\mathrm{prox}_{\Theta}$ in a greedy fashion. Then, we gradually increase the number of HQS iterations from 1 to $T$, where at each step the model is refined from the result of the previous step. The L-BFGS iterations are set to 200 for the greedy training steps, and 100 for the refining steps. Fig. 2 shows examples of learned filters in $\mathrm{prox}_{\Theta}$.
4 Results
Denoising and generality analysis. We compare the proposed discriminative transfer learning (DTL) method with state-of-the-art image denoising techniques, including KSVD [11], FoE [26], BM3D [9], LSSC [21], WNNM [13], EPLL [35], optMRF [7], ARF [2], CSF [29] and TRD [8]. The subscript in CSF and TRD indicates the number of cascaded stages (each stage has different model parameters). The subscript and superscript in our method DTL indicate the number of diffusion stages ($K$ in Algorithm 1) in our proximal operator $\mathrm{prox}_{\Theta}$, and the number of HQS iterations ($T$ in Algorithm 1), respectively. Note that the complexity (size) of our model is linear in $K$, but independent of $T$. CSF, TRD and DTL use 24 filters of size 5×5 pixels at all stages in this section.
The compared discriminative methods, CSF and TRD, are both trained at a single noise level that is the same as that of the test images. In contrast, our model is trained on 400 images (100×100 pixels) cropped from [26] with random, discrete noise levels (standard deviation $\sigma$) varying between 5 and 25. Images with the same noise level share the same data fidelity weight $\lambda$ during training.
To verify the generality of our method for varying noise levels, we test our model DTL (trained with varying noise levels in a single pass) and two TRD models (trained at the specific noise levels 15 and 25) on 3 sets of 68 images with different noise levels. The average PSNR values are shown in Fig. 3. Although performing slightly below the TRD model trained for the exact noise level used at test time, our method is more generic and works robustly across noise levels. The performance of the discriminative TRD method drops quickly as the problem condition (i.e. noise level) at test time differs from its training data. In sharp contrast to discriminative methods (CSF, TRD, etc.), which are inherently specialized for a given problem setting, the proposed approach transfers across different problem settings. More analysis can be found in the supplementary material.
All compared methods are evaluated on the 68 test images from [26] and the average PSNR values are reported in Table 2. The compared discriminative methods (CSF, TRD, etc.) were trained for exactly the same noise level as the test images (i.e. the best case for them), while our model was trained with mixed noise levels and works robustly for arbitrary noise levels. Our results are comparable to generic methods such as KSVD, FoE and BM3D, and very close to discriminative methods such as CSF, while at the same time being much more time-efficient.
KSVD     FoE      BM3D     LSSC     WNNM     EPLL
30.87    30.99    31.08    31.27    31.37    31.19

optMRF   ARF      CSF      TRD      DTL      DTL
31.18    30.70    31.14    31.30    30.91    31.00
Method         Image size (smallest to largest)
WNNM           157.73    657.75    2759.79    –        –
EPLL           29.21     111.52    463.71     –        –
BM3D           0.78      3.45      15.24      62.81    275.39
CSF            1.23      2.22      7.35       27.08    93.66
TRD            0.39      0.71      2.01       7.57     29.09
DTL            0.60      1.19      3.45       12.97    56.19
DTL (Halide)   0.11      0.26      1.60       5.61     20.85
Runtime comparison. In Table 3 we compare the runtime of our method with state-of-the-art methods. The experiments were performed on a laptop computer with an Intel i7-4720HQ CPU and 16 GB RAM. WNNM and EPLL ran out of memory for images over 4 megapixels in our experiments. CSF, TRD and DTL all use the "parfor" setting in Matlab. DTL is significantly faster than all compared generic methods (WNNM, EPLL, BM3D) and even the discriminative method CSF. The runtime of DTL is about 1.5 times that of TRD, which is expected as they use 5 versus 9 diffusion steps in total, respectively. In addition, we implemented our method in the Halide language [24], which has recently become popular for high-performance image processing applications, and report the runtime on the same CPU as mentioned above.
Deconvolution. In this experiment, we train a model on an ensemble of denoising and deconvolution tasks using 400 images (100×100 pixels) cropped from [26], in which 250 images are generated for denoising tasks with random noise levels varying between 5 and 25, and the other 150 images are generated by blurring the images with random 25×25 kernels (PSFs) and then adding Gaussian noise with $\sigma$ ranging between 1 and 5. All images are quantized to 8 bits.
We compare our method with state-of-the-art non-blind deconvolution methods, including Levin et al. [19], Schmidt et al. [30] and CSF [29]. Note that TRD [8] does not support non-blind deconvolution. We test the methods on the benchmark dataset from [20], which contains 32 real captured images, and report the average PSNR values in Table 4. The results of the compared methods are quoted from [29].
As discussed in Sec. 3.3, while the scalar weight $\lambda$ is trained, our method allows users to override it at test time for untrained problem classes or specific inputs. Fig. 4 shows our results with different values of $\lambda$ for the experiments compared in Table 4. Within a fairly wide range of $\lambda$, our method outperforms all previous methods.
We further test the above model, trained on ensemble tasks, on the denoising experiment of Table 2. The resulting average PSNR is 30.98 dB, which is comparable to the result of the model trained only on the denoising task.
Input    Levin [19]    Schmidt [30]    CSF      DTL
22.86    32.73         33.97           33.48    34.34
Modularity with existing priors. As shown above, even though the fidelity weight $\lambda$ is trainable, our method allows users to override its value at test time. This property also makes it possible to combine our model (after being trained) with existing state-of-the-art priors at test time, in which case $\lambda$ typically needs to be adjusted. This allows our method to take advantage of previous successful work on image priors. Again, this is not possible with previous discriminative methods (CSF, TRD).
In Fig. 5 we show an example of incorporating a nonlocal patch-similarity prior (BM3D [9]) into our method to further improve denoising quality. BM3D performs well in removing noise, especially in smooth regions, but usually over-smoothes edges and textures. Our original model (DTL) preserves sharp edges well but sometimes introduces artifacts in smooth regions when the input noise level is high. By combining the two methods, which is easy within our HQS framework, the result is improved both visually and quantitatively.
We give the derivation of the proposed hybrid method below. Let $g(\cdot)$ represent the nonlocal patch-similarity prior, with weight $\mu$. The objective function is:
$\min_x \; \frac{\lambda}{2}\|b - Ax\|_2^2 + \sum_{i} \rho_i(f_i \otimes x) + \mu\, g(x)$  (11)
Applying the HQS technique described in Sec. 3, we relax the objective to be:
$\min_{x,z,u} \; \frac{\lambda}{2}\|b - Ax\|_2^2 + \sum_{i} \rho_i(f_i \otimes z) + \frac{\beta}{2}\|z - x\|_2^2 + \mu\, g(u) + \frac{\gamma}{2}\|u - x\|_2^2$  (12)
Then we minimize Eq. 12 by alternately solving the following 3 subproblems:
$z^{t} = \mathrm{prox}_{\Theta}(x^{t-1})$
$u^{t} = \arg\min_u \; \mu\, g(u) + \frac{\gamma^t}{2}\|u - x^{t-1}\|_2^2$
$x^{t} = \arg\min_x \; \frac{\lambda}{2}\|b - Ax\|_2^2 + \frac{\beta^t}{2}\|z^{t} - x\|_2^2 + \frac{\gamma^t}{2}\|u^{t} - x\|_2^2$  (13)
where $\mathrm{prox}_{\Theta}$ is from our previous training, and the $u$-subproblem is approximated by running the BM3D software on $x^{t-1}$ with noise parameter $\sqrt{\mu/\gamma^t}$, following [33, 14].
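For denoising ($A = I$), the $x$-subproblem of Eq. 13 also has a pixel-wise closed form, sketched below; the variable names mirror the equation and are otherwise illustrative.

```python
import numpy as np

def x_update_hybrid(b, z, u, lam, beta, gamma):
    # Closed-form minimizer of the x-subproblem in Eq. 13 for A = I:
    # a weighted average of the observation b, the slack image z from
    # the learned prior, and the slack image u from the BM3D-style prior.
    return (lam * b + beta * z + gamma * u) / (lam + beta + gamma)
```

Adding further priors only appends more quadratic consensus terms, so each extra prior contributes one more weighted term to this average; this is what makes the framework modular.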
Similarly, our method can incorporate color image priors (e.g., the cross-channel edge-concurrence prior of [14]) to improve test results on color images, despite our model being trained on grayscale images. An example is shown in Fig. 6. The hybrid method shares the advantages of our original model, which effectively preserves edges and textures, and of the cross-channel prior, which reduces color artifacts.
Transferability to unseen tasks. Our method allows for new data-fidelity terms that are not contained in training, with no need for retraining. We demonstrate this flexibility with an experiment on the joint denoising and inpainting task shown in Fig. 9. In this experiment, 60% of the pixels of the input image are missing, and the measured 40% of the pixels are corrupted with Gaussian noise. Let the vector $m$ be the binary mask for measured pixels. The sensing matrix $A$ in Eq. 1, assumed to be known, is the binary diagonal matrix $\mathrm{diag}(m)$ (hence $A^{\top}A = A$). To reuse our model trained on denoising/deconvolution tasks, we only need to specify $A$ and $\lambda$. The subproblems of our HQS framework are given in Eq. 14.
$z^{t} = \mathrm{prox}_{\Theta}(x^{t-1}), \qquad x^{t} = \dfrac{\lambda\, m \odot b + \beta^t z^{t}}{\lambda\, m + \beta^t}$ (element-wise)  (14)
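A sketch of the element-wise $x$-update in Eq. 14; the mask, image, and weights below are illustrative inputs.

```python
import numpy as np

def x_update_inpaint(b, z, mask, lam, beta):
    # With A = diag(m) binary, the x-update decouples per pixel:
    # measured pixels (m_j = 1) blend observation and prior, while
    # unmeasured pixels (m_j = 0) are filled purely from the prior's
    # slack image z.
    return (lam * mask * b + beta * z) / (lam * mask + beta)
```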
Analysis of convergence and model complexity. To better understand the convergence of our method, in Fig. 7 and 8 we show the results at each HQS iteration of our method on denoising and non-blind deconvolution.
To understand the effect of model complexity and the number of HQS iterations on the results, in Table 5 we report test results of our method using models trained with different numbers of HQS iterations ($T$ in Algorithm 1) and different numbers of stages in $\mathrm{prox}_{\Theta}$ ($K$ in Algorithm 1).
                # HQS iterations
# stages        1                3                5
1               29.80 / 26.81    30.89 / 28.12    30.96 / 28.28
3               30.54 / 27.82    30.91 / 28.19    31.00 / 28.42
5               30.54 / 27.83    30.92 / 28.18    –
5 Conclusion
In this paper, we proposed the discriminative transfer learning framework for general image restoration. By combining advanced proximal optimization algorithms and discriminative learning techniques, a single training pass yields a transferable model useful for a variety of image restoration tasks and problem conditions. Furthermore, our method is flexible and can be combined with existing priors and likelihood terms after being trained, allowing us to improve image quality for the task at hand. In spite of this generality, our method achieves runtime efficiency comparable to previous discriminative approaches, making it suitable for high-resolution image restoration and mobile vision applications.
We believe that in future work, our framework incorporating advanced optimization with discriminative learning techniques can be extended to deep learning, for training more compact and shareable models, and to solving high-level vision problems.
References
 [1] A. Buades, B. Coll, and J. M. Morel. A review of image denoising algorithms, with a new one. Multiscale Modeling and Simulation, 4(2):490–530, 2005.
 [2] A. Barbu. Training an active random field for realtime image denoising. IEEE Transactions on Image Processing, 18(11):2451–2462, 2009.
 [3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
 [4] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In CVPR 2012.
 [5] E. J. Candès and M. B. Wakin. An introduction to compressive sampling. IEEE signal processing magazine, 25(2):21–30, 2008.
 [6] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
 [7] Y. Chen, T. Pock, R. Ranftl, and H. Bischof. Revisiting loss-specific training of filter-based MRFs for image restoration. In German Conference on Pattern Recognition 2013.
 [8] Y. Chen, W. Yu, and T. Pock. On learning optimized reaction diffusion processes for effective image restoration. In CVPR 2015.
 [9] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
 [10] W. Dong, L. Zhang, G. Shi, and X. Li. Nonlocally centralized sparse representation for image restoration. IEEE Transactions on Image Processing, 22(4):1620–1630, 2013.
 [11] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
 [12] D. Geman and C. Yang. Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing, 4(7):932–946, 1995.
 [13] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In CVPR 2014.
 [14] F. Heide, M. Steinberger, Y.T. Tsai, M. Rouf, D. Pajak, D. Reddy, O. Gallo, J. Liu, W. Heidrich, K. Egiazarian, et al. Flexisp: a flexible camera image processing framework. ACM Transactions on Graphics (TOG), 33(6):231, 2014.
 [15] V. Jain and H. Seung. Natural image denoising with convolutional networks. In NIPS 2008.
 [16] J. Jancsary, S. Nowozin, T. Sharp, and C. Rother. Regression tree fields - an efficient, non-parametric approach to image labeling problems. In CVPR 2012.
 [17] D. Krishnan and R. Fergus. Fast image deconvolution using hyper-Laplacian priors. In NIPS 2009.
 [18] D. Krishnan, T. Tay, and R. Fergus. Blind deconvolution using a normalized sparsity measure. In CVPR 2011.
 [19] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth from a conventional camera with a coded aperture. ACM transactions on graphics (TOG), 26(3):70, 2007.
 [20] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Efficient marginal likelihood optimization in blind deconvolution. In CVPR 2011.
 [21] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Nonlocal sparse models for image restoration. In ICCV 2009.
 [22] P. Milanfar. A tour of modern image filtering: New insights and methods, both practical and theoretical. IEEE Signal Processing Magazine, 30(1):106–128, 2013.
 [23] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.
 [24] J. RaganKelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013.
 [25] D. Rosenbaum and Y. Weiss. The return of the gating network: Combining generative models and discriminative training in natural image priors. In NIPS 2015.
 [26] S. Roth and M. Black. Fields of experts. International Journal of Computer Vision, 82(2):205–229, 2009.
 [27] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4):259–268, 1992.
 [28] M. Schmidt. minFunc: unconstrained differentiable multivariate optimization in Matlab. http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html.
 [29] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. In CVPR 2014.
 [30] U. Schmidt, C. Rother, S. Nowozin, J. Jancsary, and S. Roth. Discriminative non-blind deblurring. In CVPR 2013.
 [31] H. Talebi and P. Milanfar. Global image denoising. IEEE Transactions on Image Processing, 23(2):755–768, 2014.
 [32] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In ICCV 1998.
 [33] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg. Plug-and-play priors for model based reconstruction. In GlobalSIP 2013.
 [34] J. Weickert. Anisotropic diffusion in image processing. ECMI Series, Teubner-Verlag, Stuttgart, Germany, 1998.
 [35] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In ICCV 2011.