xUnit: Learning a Spatial Activation Function for Efficient Image Restoration
Abstract
In recent years, deep neural networks (DNNs) achieved unprecedented performance in many low-level vision tasks. However, state-of-the-art results are typically achieved by very deep networks, which can reach tens of layers with tens of millions of parameters. To make DNNs implementable on platforms with limited resources, it is necessary to weaken the tradeoff between performance and efficiency. In this paper, we propose a new activation unit, which is particularly suitable for image restoration problems. In contrast to the widespread per-pixel activation units, like ReLUs and sigmoids, our unit implements a learnable nonlinear function with spatial connections. This enables the net to capture much more complex features, thus requiring a significantly smaller number of layers in order to reach the same performance. We illustrate the effectiveness of our units through experiments with state-of-the-art nets for denoising, deraining, and super-resolution, which are already considered to be very small. With our approach, we are able to further reduce these models by nearly 50% without incurring any degradation in performance.
1 Introduction
Deep convolutional neural networks (CNNs) have revolutionized computer vision, achieving unprecedented performance in high-level vision tasks such as classification [44, 14, 17], segmentation [37, 1] and face recognition [31, 42], as well as in low-level vision tasks like denoising [19, 47, 3, 34], deblurring [29], super-resolution [9, 22, 20] and dehazing [36]. Today, the performance of CNNs is still being constantly improved, mainly by means of increasing the net's depth. Indeed, identity skip connections [15] and residual learning [14, 47], used within ResNets [14] and DenseNets [17], now overcome some of the difficulties associated with very deep nets, and have even made it possible to cross the 1000-layer barrier [15].
The strong link between performance and depth has major implications on the computational resources and running times required to obtain state-of-the-art results. In particular, it implies that many real-time, low-power, and limited-resource applications (e.g. on mobile devices) cannot currently exploit the full potential of CNNs.
In this paper, we propose a different mechanism for improving CNN performance (see Fig. 1). Rather than increasing depth, we focus on making the nonlinear activations more effective. Most popular architectures use per-pixel activation units, e.g. rectified linear units (ReLUs) [11], exponential linear units (ELUs) [5], sigmoids [30], etc. Here, we propose to replace those units by the xUnit, a layer with spatial and learnable connections. The xUnit computes a continuous-valued weight map, serving as a soft gate to its input. As we show, although it is more computationally demanding and memory consuming than a per-pixel unit, xUnit activations have a dramatic effect on the net's performance. Therefore, they allow using far fewer layers to match the performance of a CNN with ReLU activations. Overall, this results in a significantly improved tradeoff between performance and efficiency, as illustrated in Fig. 1.
The xUnit has a set of learnable parameters. Therefore, to conform to a total budget of parameters, xUnits must come at the expense of some of the convolutional layers in the net, or some of the channels within those convolutional layers. This raises the question: what is the optimal percentage of parameters to invest in the activation units? Today, most CNN architectures are at one extreme of the spectrum, with 0% of the parameters invested in the activations. Here, we show experimentally that the optimal percentage is much larger than zero. This suggests that the representational power gained by using spatial activations can be far more substantial than that offered by convolutional layers with per-pixel activations.
We illustrate the effectiveness of our approach in several image restoration tasks. Specifically, we take state-of-the-art CNN models for image denoising [47], super-resolution [9], and deraining [10], which are already considered to be very lightweight, and replace their ReLUs with xUnits. We show that this allows us to further reduce the number of parameters (by discarding layers or channels) without incurring any degradation in performance. In fact, we show that for small models, we can save nearly 50% of the parameters while achieving the same performance or even better. As we show, this often allows using three orders of magnitude fewer training examples.
2 Related Work
The quest for accurate image enhancement algorithms has attracted significant research efforts over the past several decades. Until 2012, the vast majority of algorithms relied on generative image models, usually through maximum a-posteriori (MAP) estimation. Models were typically either hand-crafted or learned from training images, and included e.g. priors on derivatives [40], wavelet coefficients [32], filter responses [38], and image patches [7, 49]. In recent years, generative approaches are gradually being pushed aside by discriminative methods, mostly based on CNNs. These architectures typically learn a direct mapping from a degraded image to a restored one, and were shown to exhibit excellent performance in many restoration tasks, including e.g. denoising [3, 47, 19], deblurring [29], super-resolution [8, 9, 22, 20], dehazing [4, 24], and deraining [10].
A popular strategy for improving the performance of CNN models is to increase their depth. Various works suggested ways to overcome some of the difficulties in training very deep nets. These opened the door to a line of algorithms using ever larger nets. Specifically, the residual net (ResNet) architecture [14] was demonstrated to achieve exceptional classification performance compared to a plain network. Dense convolutional networks (DenseNets) [17] took the skip-connections idea one step further, by connecting each layer to every other layer in a feed-forward fashion. This allowed achieving excellent performance with very deep nets.
These ideas were also adopted by the low-level vision community. In the context of denoising, Zhang et al. [47] were the first to train a very deep CNN for denoising, yielding state-of-the-art results. To train their net, which has 555K parameters, they utilized residual learning and batch normalization [47], alleviating the vanishing gradients problem. In [19], a model twice as large was proposed, which is based on formatting the residual image to contain structured information, instead of learning the difference between clean and noisy images. Similar ideas were also proposed in [33, 35, 48], leading to models with large numbers of parameters. Recently, a very deep network based on residual learning was proposed in [2], which contains more than 60 layers and millions of parameters.
In the context of super-resolution, the progress was similar. In the recent past, state-of-the-art methods used only a few tens of thousands of parameters. For example, the SRCNN model [9] contains only three convolution layers, with only 57K parameters. The very deep super-resolution model (VDSR) [20] already used 20 layers with hundreds of thousands of parameters. Nowadays, much more complex models are in use. For example, the well-known SRResNet [22] uses well over a million parameters, and the EDSR network [26] (winner of the NTIRE 2017 super-resolution challenge [45]) has tens of millions of parameters.
The trend of making CNNs as deep as possible poses significant challenges for running those models on platforms with low power and limited computation and memory resources. One approach towards diminishing memory consumption and access times is to use binarized neural networks [6]. These architectures, which were shown beneficial in classification tasks, constrain the weights and activations to be binary. Another approach is to replace the multi-channel convolutional layers by depthwise convolutions [16]. This offers a significant reduction in size and latency, while allowing reasonable classification accuracy. In [43], it was suggested to reduce network complexity and memory consumption for super-resolution by introducing a sub-pixel convolutional layer that learns upscaling filters. In [23], an architecture which exploits non-local self-similarities in images was shown to yield good results with reduced models. Finally, learning the optimal slope of leaky-ReLU-type activations has also been shown to lead to more efficient models [13].
3 xUnit
Although a variety of CNN architectures exist, their building blocks are quite similar, mainly comprising convolutional layers and per-pixel activation units. Mathematically, the features $x_{k+1}$ at layer $k+1$ are commonly calculated as

$$x_{k+1} = f(z_k), \qquad z_k = W_k \, x_k + b_k, \tag{1}$$

where $x_0$ is the input to the net, $W_k$ performs a convolution operation, $b_k$ is a bias term, $z_k$ is the output of the convolutional layer, and $f(\cdot)$ is some nonlinear activation function which operates element-wise on its argument. Popular activation functions include the ReLU [11], leaky ReLU [27], ELU [5], tanh and sigmoid.
Note that there is a clear dichotomy in (1): the convolutional layers are responsible for the spatial processing, and the activation units for the nonlinearities. One may wonder if this is the most efficient way to realize the complex functions needed in low-level vision. In particular, is there any reason not to allow spatial processing also within the activation functions?
Element-wise activations can be thought of as nonlinear gating functions. Specifically, assuming that $f(0) = 0$, as is the case for most popular activations, (1) can be written as

$$x_{k+1} = z_k \odot g_k, \tag{2}$$

where $\odot$ denotes the (element-wise) Hadamard product, and $g_k$ is a (multi-channel) weight map that depends on $z_k$ element-wise, as

$$[g_k]_i = \frac{f([z_k]_i)}{[z_k]_i}. \tag{3}$$

Here $0/0$ should be interpreted as $0$. For example, the weight map associated with the ReLU function $f(z) = \max(z, 0)$ is a binary map which is a thresholded version of $z_k$,

$$[g_k]_i = \begin{cases} 1 & [z_k]_i > 0, \\ 0 & [z_k]_i \le 0. \end{cases} \tag{4}$$
This interpretation is visualized in Fig. 2(a) (bias not shown).
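The gating interpretation above is easy to check numerically. The following is a minimal sketch (numpy assumed; the input values are illustrative):

```python
import numpy as np

# Per-pixel activations as gating: f(z) = z * g(z).
# For the ReLU, the weight map g is simply a binary threshold of z, eq. (4).
z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

relu = np.maximum(z, 0.0)         # f(z)
g = (z > 0).astype(z.dtype)       # binary weight map

# The gated product reproduces the ReLU output exactly.
assert np.allclose(relu, z * g)
```

The same decomposition holds for any activation with $f(0) = 0$; only the weight map $g$ stops being binary.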
Since the nonlinear activations are what grants CNNs their ability to implement complex functions, here we propose to use learnable spatial activations. That is, instead of the element-wise relation (3), we propose to allow each element in $g_k$ to depend also on the spatial environment of the corresponding element in $z_k$. Specifically, we introduce the xUnit, in which
$$x_{k+1} = z_k \odot g_k, \tag{5}$$

and

$$g_k = \exp\!\left(-d_k^2\right), \qquad d_k = H_k \ast_d \mathrm{ReLU}(z_k), \tag{6}$$
with $\ast_d$ denoting depthwise convolution [16], and the squaring and exponentiation in (6) operating element-wise. The depthwise convolution applies a single filter to each input channel, and is significantly more efficient in terms of memory and computations than the multi-channel convolution popularly used in CNNs. The filters $H_k$ capture spatial relations, and have to be learned during training. To make the training stable, we also add batch-normalization layers [18] before the ReLU and before the exponentiation. This is illustrated in Fig. 2(b).
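As a concrete illustration, the gating path of Eqs. (5)-(6) can be sketched in a few lines of numpy. The batch-normalization layers are omitted for clarity, and the filter values below are random placeholders rather than trained weights:

```python
import numpy as np

def depthwise_conv(x, filters):
    """Depthwise convolution: one k x k filter per channel, 'same' padding.
    x: (C, H, W), filters: (C, k, k)."""
    c, h, w = x.shape
    k = filters.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = np.sum(xp[ch, i:i + k, j:j + k] * filters[ch])
    return out

def xunit(z, filters):
    """Sketch of the xUnit gating path (batch norm omitted):
    ReLU -> depthwise conv -> Gaussian squashing -> multiplicative gate."""
    d = depthwise_conv(np.maximum(z, 0.0), filters)  # spatial mixing, eq. (6)
    g = np.exp(-d ** 2)                              # weight map in (0, 1]
    return z * g                                     # gated features, eq. (5)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8, 8))       # 4-channel feature map
filters = 0.1 * rng.standard_normal((4, 3, 3))
out = xunit(z, filters)
assert out.shape == z.shape
assert np.all(np.abs(out) <= np.abs(z) + 1e-12)  # the gate only attenuates
```

Because the weight map lies in $(0, 1]$, the xUnit acts as a soft, spatially informed version of the hard ReLU gate.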
Merely replacing ReLUs with xUnits clearly increases memory consumption and running times at test stage. This is mostly due to their convolutional operations (the exponent can be implemented using a lookup table). Specifically, an xUnit operating on $c$ feature channels with $k \times k$ depthwise filters introduces an overhead of $ck^2 + 4c$ parameters ($ck^2$ for the depthwise filters, and $2c$ for each batch-normalization layer). However, first, note that this overhead is relatively mild compared to the parameters of each convolutional layer. Second, in return for that overhead, xUnits provide a performance boost. This means that the same performance can be attained with fewer layers or with fewer channels per layer. Therefore, the important question is whether xUnits improve the overall tradeoff between performance and number of parameters.
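The parameter bookkeeping above can be made concrete with a short calculation; the sizes used here (64 channels, 3×3 convolutions, 9×9 depthwise filters) are illustrative:

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard multi-channel k x k convolution (no bias)."""
    return c_in * c_out * k * k

def xunit_overhead(c, k):
    """Extra parameters of one xUnit on a c-channel feature map:
    c depthwise k x k filters plus two batch-norm layers (scale + shift)."""
    return c * k * k + 2 * (2 * c)

conv = conv_params(64, 64, 3)   # 36,864 weights in one conv layer
xu = xunit_overhead(64, 9)      # 5,184 + 256 = 5,440 extra parameters
print(conv, xu, xu / conv)      # the overhead is ~15% of one conv layer
```

Even with large 9×9 depthwise filters, the overhead per xUnit stays well below the cost of a single 64-to-64 convolutional layer.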
Figure 3 shows the effect of using xUnits in a denoising task. Here, we trained two simple net architectures to remove additive Gaussian noise of a fixed standard deviation from noisy images, using a varying number of layers. The first net is a traditional ConvNet architecture comprising a sequence of Conv+BN+ReLU layers. The second net, which we coin xNet, comprises a sequence of Conv+xUnit layers. In both nets, the regular convolutional layers (not the ones within the xUnits) comprise 64-channel 3×3 filters. For the xUnits, we varied the size of the depthwise filters from 1×1 to 9×9. We trained both nets on 400 images from the BSD dataset [28] using residual learning (i.e. learning to predict the noise and subtracting the noise estimate from the noisy image at test time). This has been shown to be advantageous for denoising in [47]. As can be seen in the figure, when the xUnit filters are 1×1, the peak signal-to-noise ratio (PSNR) attained by xNet exceeds that of ConvNet by only a minor gap. In this case, the xUnits are not spatial. However, as the xUnits' filters become larger, xNet's performance begins to improve, for any given total number of parameters. Note, for example, that a 3-layer xNet with 9×9 activations outperforms a 9-layer ConvNet, although having only a fraction of the number of parameters.
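Residual learning, as used in this experiment, amounts to a one-line change at test time. A minimal sketch, where `predict_noise` is a stand-in for a trained network:

```python
import numpy as np

def denoise_residual(noisy, predict_noise):
    """Residual learning: the network predicts the noise map,
    and the restored image is the noisy input minus that estimate."""
    return noisy - predict_noise(noisy)

# Toy stand-in for a trained residual net: an oracle that returns the
# exact noise (a real net would only approximate it).
rng = np.random.default_rng(0)
clean = np.ones((8, 8))
noise = 0.1 * rng.standard_normal((8, 8))
noisy = clean + noise

restored = denoise_residual(noisy, lambda x: noise)  # oracle predictor
assert np.allclose(restored, clean)
```

At train time the same net is fitted so that its output matches the noise realization added to each training image.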
To further understand the performance-computation tradeoff when using spatial activations, Fig. 4 shows a vertical cross section of the graph in Fig. 3 at a fixed overall number of parameters. Here, the PSNR is plotted against the percentage of parameters invested in the xUnit activations. In a traditional ConvNet, 0% of the parameters are invested in the activations. As can be seen in the graph, this is clearly suboptimal. In particular, the optimal percentage can be seen to be far from zero, as the performance of xNet keeps improving until reaching a plateau. In fact, beyond the percentage corresponding to 9×9 activation filters, the benefit of further increasing the filters' supports becomes relatively minor.
To gain intuition into the mechanism that allows xNet to achieve better results with fewer parameters, we depict in Fig. 5 the layer-4 feature maps $z_k$, weight (activation) maps $g_k$, and their products $x_{k+1}$, for a ConvNet and an xNet operating on the same noisy input image. Interestingly, we see that many more xNet activations are close to 1 (white) than ConvNet activations. Thus, it seems that in xNet, more channels take part in the denoising effort. Moreover, it can be seen that the xNet weight maps are quite complex functions of the features, as opposed to the simple binarization function of the ReLUs in ConvNet.
In terms of training stability, xNets behave similarly to ConvNets with ReLUs, and even allow slightly faster convergence. This can be seen in Fig. 6, which compares the convergence of the loss at train time of a 4-layer ConvNet and a 4-layer xNet with 9×9 activations. Although the latter has more parameters, it converges slightly faster. Due to the two-branch structure of the xUnits, the xNet does not seem to suffer from the vanishing gradients problem. Furthermore, it should be noted that xUnits can also be used as building blocks within ResNets and DenseNets, which alleviate the vanishing gradients problem in very deep nets.
4 Experiments and Applications
Table 1: Average PSNR (dB) on the BSD68 dataset.

Method           BM3D   WNNM   EPLL   MLP    DnCNN-S  xDnCNN
# of parameters  -      -      -      -      555K     303K
σ = 25           28.56  28.82  28.68  28.95  29.22    29.20
σ = 50           25.62  25.87  25.67  26.01  26.23    26.26
Our goal is to show that many small-scale and medium-scale state-of-the-art CNNs can be made almost 50% smaller with xUnits, without incurring any degradation in performance.
We implemented the proposed architecture in PyTorch. We ran all experiments on a desktop computer with an Intel i5-6500 CPU and an Nvidia 1080 Ti GPU. We used the Adam [21] optimizer with its default settings for training the nets. We started from an initial learning rate which we gradually decreased during training. We kept the minibatch size fixed at 64. In all applications, we used 9×9 depthwise convolutions in the xUnits, and minimized the mean square error (MSE) over the training set.
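For reference, the Adam update with the default hyper-parameters of [21] can be sketched as follows; the toy scalar problem and the enlarged learning rate in the demo are for illustration only:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with the default hyper-parameters of [21]."""
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize the MSE of a single scalar weight toward a target value.
target, w = 3.0, 0.0
m = v = 0.0
for t in range(1, 2001):
    grad = 2 * (w - target)                 # d/dw (w - target)^2
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)  # larger lr for the toy
assert abs(w - target) < 0.5
```

In the actual training, the same update is applied to every convolutional and xUnit parameter, with the gradients of the MSE loss computed by backpropagation.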
4.1 Image Denoising
We begin by illustrating the effectiveness of xUnits in image denoising. As a baseline architecture, we take the state-of-the-art DnCNN denoising network [47]. We replace all ReLU layers with xUnit layers and reduce the number of convolutional layers from 17 to 9. We keep all convolutional layers with 64-channel 3×3 filters, as in the original architecture. Our net, which we coin xDnCNN, has only about 55% of the number of parameters of DnCNN (303K for xDnCNN vs. 555K for DnCNN).
As in [47], we train our net on 400 images. We use images from the Berkeley segmentation dataset [28], enriched by random flipping and random cropping. The noisy images are generated by adding Gaussian noise to the training images (a different realization for each image). We examine the performance of our net at noise levels σ = 25 and σ = 50. Table 1 compares the average PSNR attained by our xDnCNN to that attained by the original DnCNN (variant 'S'), as well as to the state-of-the-art non-CNN denoising methods BM3D [7], WNNM [12], EPLL [49], and MLP [3]. The evaluation is performed on the BSD68 dataset [39], a subset of 68 images from the BSD dataset, which is not included in the training set. As can be seen, our xDnCNN outperforms all non-CNN denoising methods and achieves results that are on par with DnCNN. This is despite the fact that xDnCNN is nearly half the size of DnCNN in terms of number of parameters. The superiority of our method becomes more significant as the noise level increases: at σ = 50, xDnCNN attains the highest average PSNR of all methods.
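The PSNR values reported in Table 1 follow the standard definition; a minimal sketch (the MSE value of 78.0 is illustrative):

```python
import math

def psnr(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)

# A denoiser with a per-pixel MSE of 78.0 on 8-bit images scores about:
print(round(psnr(78.0), 2))  # → 29.21, the regime of Table 1 at sigma = 25
```

Halving the MSE raises the PSNR by about 3 dB, so even the few-tenths-of-a-dB gaps in Table 1 correspond to noticeable differences in reconstruction error.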
Figure 7 shows an example denoising result obtained with xDnCNN, compared with BM3D, EPLL, MLP and DnCNN-S, for a noise level of σ = 50. As can be seen, our xDnCNN best reconstructs the fine details and barely introduces any distracting artifacts. In contrast, all the other methods (including DnCNN) introduce unpleasant distortions.
4.2 Single image rain removal
Next, we use the same architecture for the task of removing rain streaks from a single image. We introduce only one modification to our denoising xDnCNN, which is to work on three-channel (RGB) input images and to output three-channel images. This results in a network with 306K parameters. We compare our results to DerainNet [10], a network comprising three convolutional layers. Similarly to denoising, we learn the residual mapping between a rainy image and a clean image. Training is performed on the dataset of DerainNet [10], which contains 4900 pairs of clean and synthetically generated rainy images. However, we evaluate our net on the Rain12 dataset [25], which contains 12 artificially generated images. Although the training data is quite different from the testing data, our xDnCNN performs significantly better than DerainNet, attaining a substantially higher average PSNR on Rain12. This behavior is also seen when deraining real images. As can be seen in Fig. 8, xDnCNN performs significantly better in cleaning actual rain streaks. We thus conclude that xDnCNN is far more robust to different rain appearances, while maintaining its efficiency. Note that our xDnCNN deraining net has only a fraction of the number of parameters of DerainNet.
4.3 Single image super resolution
Our xUnit activations can also be applied to single image super-resolution. As a baseline implementation, we take the SRCNN [9] model. In this architecture, the input image is first interpolated to the desired size of the output image and then fed through three convolutional layers, with 9×9, 5×5 and 5×5 filters, respectively. This is illustrated in Fig. 9(a). Here, we study two different modifications of SRCNN, where we replace the two ReLU layers with xUnit layers. In the first modification, we reduce the size of the filters in the middle layer (Fig. 9(b)). This variant, which we coin xSRCNN-f, has considerably fewer parameters than SRCNN. In the second modification, we reduce the number of channels in the second layer (Fig. 9(c)). This variant, which we coin xSRCNN-c, has only about 77% of the number of parameters of SRCNN (44K for xSRCNN-c vs. 57K for SRCNN).
The SRCNN model was trained on hundreds of thousands of images from ImageNet [41]. Here, we train our nets on datasets that are three orders of magnitude smaller. Specifically, we use only a few hundred images, taken from [46] and from the BSD dataset, as our training set. We augment the data by random flipping and random cropping.
Table 2 reports the results attained by all the models, tested on Set5, Set14, and BSD100. As can be seen, both our models attain results that are on par with the original SRCNN, despite being much smaller and trained on a significantly smaller number of images. Note that our xSRCNN-f has fewer parameters than xSRCNN-c. This suggests that a better way to discard parameters in xNets is by reducing filter sizes, rather than reducing channels. A possible explanation is that the filters within the xUnits can partially compensate for the small support of the filters in the convolutional layers. However, the fact that discarding channels can also provide a significant reduction in parameters at the same performance indicates that the channels in an xNet are more effective than those in ConvNets with per-pixel activations.
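The two ways of discarding parameters can be compared with a simple count. The 9-5-5, 1→64→32→1 baseline below mirrors SRCNN; the two reduced variants are illustrative rather than the exact xSRCNN configurations:

```python
def srcnn_like_params(channels, kernels):
    """Parameter count (weights only) of a chain of conv layers.
    channels: [c0, c1, ..., cL]; kernels: [k1, ..., kL]."""
    return sum(channels[i] * channels[i + 1] * k * k
               for i, k in enumerate(kernels))

# Baseline akin to SRCNN: three layers, 9x9 / 5x5 / 5x5 filters.
base = srcnn_like_params([1, 64, 32, 1], [9, 5, 5])          # 57,184
# Option A: shrink the middle filters (5x5 -> 3x3).
small_filters = srcnn_like_params([1, 64, 32, 1], [9, 3, 5])  # 24,416
# Option B: shrink the middle channels (32 -> 16).
small_channels = srcnn_like_params([1, 64, 16, 1], [9, 5, 5]) # 31,184
print(base, small_filters, small_channels)
```

Because the middle layer dominates the count, shrinking its filter support removes more parameters than halving its channels, which matches the comparison between xSRCNN-f and xSRCNN-c above.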
Figure 10 shows the layer-2 feature maps, activation maps, and their products for both SRCNN and our xSRCNN-f. As in the case of denoising, we can see that in xSRCNN, many more feature maps participate in the reconstruction effort compared to SRCNN. This provides a possible explanation for its ability to perform well with smaller filters (or with fewer channels).
Table 2: Average PSNR (dB) of SRCNN, xSRCNN-f, and xSRCNN-c on Set5, Set14, and BSD100, for several scale factors.
5 Conclusion
Popular CNN architectures use simple nonlinear activation units (e.g. ReLUs), which operate pixel-wise on the feature maps. In this paper, we demonstrated that CNNs can greatly benefit from incorporating learnable spatial connections within the activation units. While these spatial connections introduce additional parameters to the net, they significantly improve its performance. Overall, the tradeoff between performance and number of parameters is substantially improved. We illustrated how our approach can reduce the size of several state-of-the-art CNN models for denoising, deraining and super-resolution, which are already considered to be very small, by almost 50%, without incurring any degradation in performance.
References
 [1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
 [2] W. Bae, J. Yoo, and J. C. Ye. Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification. arXiv preprint arXiv:1611.06345, 2016.
 [3] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2392–2399. IEEE, 2012.
 [4] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao. DehazeNet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing, 25(11):5187–5198, 2016.
 [5] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
 [6] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [7] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
 [8] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199. Springer, 2014.
 [9] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
 [10] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley. Clearing the skies: A deep network architecture for single-image rain removal. IEEE Transactions on Image Processing, 26(6):2944–2956, 2017.
 [11] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
 [12] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
 [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
 [18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 [19] J. Jiao, W.-C. Tu, S. He, and R. W. Lau. FormResNet: Formatted residual learning for image restoration. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1034–1042. IEEE, 2017.
 [20] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
 [21] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
 [23] S. Lefkimmiatis. Nonlocal color image denoising with convolutional neural networks. arXiv preprint arXiv:1611.06757, 2016.
 [24] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng. An all-in-one network for dehazing and beyond. arXiv preprint arXiv:1707.06543, 2017.
 [25] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown. Rain streak removal using layer priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2736–2744, 2016.
 [26] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
 [27] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.
 [28] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
 [29] M. Noroozi, P. Chandramouli, and P. Favaro. Motion deblurring in the wild. arXiv preprint arXiv:1701.01486, 2017.
 [30] G. B. Orr and K.-R. Müller. Neural networks: tricks of the trade. Springer, 2003.
 [31] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
 [32] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of gaussians in the wavelet domain. IEEE Transactions on Image processing, 12(11):1338–1351, 2003.
 [33] T. Remez, O. Litany, R. Giryes, and A. M. Bronstein. Deep class-aware denoising. arXiv preprint arXiv:1701.01698, 2017.
 [34] T. Remez, O. Litany, R. Giryes, and A. M. Bronstein. Deep class-aware image denoising. In Sampling Theory and Applications (SampTA), 2017 International Conference on, pages 138–142. IEEE, 2017.
 [35] T. Remez, O. Litany, R. Giryes, and A. M. Bronstein. Deep convolutional denoising of low-light images. arXiv preprint arXiv:1701.01687, 2017.
 [36] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang. Single image dehazing via multi-scale convolutional neural networks. In European Conference on Computer Vision, pages 154–169. Springer, 2016.
 [37] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
 [38] S. Roth and M. J. Black. Fields of experts: A framework for learning image priors. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 860–867. IEEE, 2005.
 [39] S. Roth and M. J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205–229, 2009.
 [40] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
 [41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [42] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
 [43] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
 [44] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [45] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, L. Zhang, B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1110–1121. IEEE, 2017.
 [46] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
 [47] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 2017.
 [48] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep CNN denoiser prior for image restoration. arXiv preprint arXiv:1704.03264, 2017.
 [49] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 479–486. IEEE, 2011.