xUnit: Learning a Spatial Activation Function for Efficient Image Restoration
In recent years, deep neural networks (DNNs) have achieved unprecedented performance in many low-level vision tasks. However, state-of-the-art results are typically achieved by very deep networks, which can reach tens of layers with tens of millions of parameters. To make DNNs implementable on platforms with limited resources, it is necessary to weaken the tradeoff between performance and efficiency. In this paper, we propose a new activation unit, which is particularly suitable for image restoration problems. In contrast to the widespread per-pixel activation units, like ReLUs and sigmoids, our unit implements a learnable nonlinear function with spatial connections. This enables the net to capture much more complex features, thus requiring a significantly smaller number of layers in order to reach the same performance. We illustrate the effectiveness of our units through experiments with state-of-the-art nets for denoising, de-raining, and super resolution, which are already considered to be very small. With our approach, we are able to further reduce these models by nearly 50% without incurring any degradation in performance.
Deep convolutional neural networks (CNNs) have revolutionized computer vision, achieving unprecedented performance in high-level vision tasks such as classification [44, 14, 17], segmentation [37, 1] and face recognition [31, 42], as well as in low-level vision tasks like denoising [19, 47, 3, 34], deblurring, super resolution [9, 22, 20] and dehazing. Today, the performance of CNNs is still being constantly improved, mainly by means of increasing the net’s depth. Indeed, identity skip connections and residual learning [14, 47], used within ResNets and DenseNets, now overcome some of the difficulties associated with very deep nets, and have even made it possible to cross the 1000-layer barrier.
The strong link between performance and depth has major implications for the computational resources and running times required to obtain state-of-the-art results. In particular, it implies that many real-time, low-power, and limited-resource applications (e.g. on mobile devices) cannot currently exploit the full potential of CNNs.
In this paper, we propose a different mechanism for improving CNN performance (see Fig. 1). Rather than increasing depth, we focus on making the nonlinear activations more effective. Most popular architectures use per-pixel activation units, e.g. rectified linear units (ReLUs), exponential linear units (ELUs), sigmoids, etc. Here, we propose to replace those units with xUnit, a layer with spatial and learnable connections. The xUnit computes a continuous-valued weight map, serving as a soft gate to its input. As we show, although it is more computationally demanding and memory consuming than a per-pixel unit, the xUnit activation has a dramatic effect on the net’s performance. Therefore, it allows using far fewer layers to match the performance of a CNN with ReLU activations. Overall, this results in a significantly improved tradeoff between performance and efficiency, as illustrated in Fig. 1.
The xUnit has a set of learnable parameters. Therefore, to conform to a total budget of parameters, xUnits must come at the expense of some of the convolutional layers in the net or some of the channels within those convolutional layers. This raises the question: What is the optimal percentage of parameters to invest in the activation units? Today, most CNN architectures are at one extreme of the spectrum, with 0% of the parameters invested in the activations. Here, we show experimentally that the optimal percentage is much larger than zero. This suggests that the representational power gained by using spatial activations can be far more substantial than that offered by convolutional layers with per-pixel activations.
We illustrate the effectiveness of our approach in several image restoration tasks. Specifically, we take state-of-the-art CNN models for image denoising, super-resolution, and de-raining, which are already considered to be very light-weight, and replace their ReLUs with xUnits. We show that this allows us to further reduce the number of parameters (by discarding layers or channels) without incurring any degradation in performance. In fact, we show that for small models, we can save nearly 50% of the parameters while achieving the same performance or even better. As we show, this often allows using three orders of magnitude fewer training examples.
2 Related Work
The quest for accurate image enhancement algorithms has attracted significant research efforts over the past several decades. Until 2012, the vast majority of algorithms relied on generative image models, usually through maximum a posteriori (MAP) estimation. Models were typically either hand-crafted or learned from training images, and included e.g. priors on derivatives, wavelet coefficients, filter responses, image patches [7, 49], etc. In recent years, generative approaches have gradually been pushed aside by discriminative methods, mostly based on CNNs. These architectures typically learn a direct mapping from a degraded image to a restored one, and were shown to exhibit excellent performance in many restoration tasks, including e.g. denoising [3, 47, 19], deblurring, super-resolution [8, 9, 22, 20], dehazing [4, 24], and de-raining.
A popular strategy for improving the performance of CNN models is to increase their depth. Various works suggested ways to overcome some of the difficulties in training very deep nets. These opened the door to a line of algorithms using ever larger nets. Specifically, the residual net (ResNet) architecture was demonstrated to achieve exceptional classification performance compared to a plain network. Dense convolutional networks (DenseNets) took the “skip-connections” idea one step further, by connecting each layer to every other layer in a feed-forward fashion. This made it possible to achieve excellent performance in very deep nets.
These ideas were also adopted by the low-level vision community. In the context of denoising, Zhang et al.  were the first to train a very deep CNN for denoising, yielding state-of-the-art results. To train their net, which has M parameters, they utilized residual learning and batch normalization , alleviating the vanishing gradients problem. In , a twice larger model was proposed, which is based on formatting the residual image to contain structured information instead of learning the difference between clean and noisy images. Similar ideas were also proposed in [33, 35, 48], leading to models with large numbers of parameters. Recently, a very deep network based on residual learning was proposed in , which contains more than 60 layers, and M parameters.
In the context of super-resolution, the progress was similar. In the near past, state-of-the-art methods used only a few tens of thousands of parameters. For example, the SRCNN model contains only three convolution layers, with only K parameters. The very deep super-resolution model (VDSR) already used layers with K parameters. Nowadays, much more complex models are in use. For example, the well-known SRResNet uses more than M parameters, and the EDSR network (winner of the NTIRE2017 super resolution challenge) has M parameters.
The trend of making CNNs as deep as possible poses significant challenges in terms of running those models on platforms with low power and limited computation and memory resources. One approach towards diminishing memory consumption and access times is to use binarized neural networks. These architectures, which were shown to be beneficial in classification tasks, constrain the weights and activations to be binary. Another approach is to replace the multi-channel convolutional layers by depth-wise convolutions. This offers a significant reduction in size and latency, while allowing reasonable classification accuracy. It has also been suggested to reduce network complexity and memory consumption for super resolution by introducing a sub-pixel convolutional layer that learns upscaling filters. An architecture which exploits non-local self-similarities in images was also shown to yield good results with reduced models. Finally, learning the optimal slope of leaky ReLU type activations has also been shown to lead to more efficient models.
Although a variety of CNN architectures exist, their building blocks are quite similar, mainly comprising convolutional layers and per-pixel activation units. Mathematically, the features $x_{k+1}$ at layer $k+1$ are commonly calculated as
$$z_k = W_k x_k + b_k, \qquad x_{k+1} = f(z_k), \tag{3}$$
where $x_0$ is the input to the net, $W_k$ performs a convolution operation, $b_k$ is a bias term, $z_k$ is the output of the convolutional layer, and $f(\cdot)$ is some nonlinear activation function which operates element-wise on its argument. Popular activation functions include the ReLU, leaky ReLU, ELU, tanh and sigmoid.
Note that there is a clear dichotomy in (3): The convolutional layers are responsible for the spatial processing, and the activation units for the nonlinearities. One may wonder if this is the most efficient way to realize the complex functions needed in low-level vision. In particular, is there any reason not to allow spatial processing also within the activation functions?
Element-wise activations can be thought of as nonlinear gating functions. Specifically, assuming that $f(0) = 0$, as is the case for most popular activations (e.g. ReLU, leaky ReLU, ELU and tanh), (3) can be written as
$$z_k = W_k x_k + b_k, \qquad x_{k+1} = z_k \odot g_k,$$
where $\odot$ denotes the (element-wise) Hadamard product, and $g_k$ is a (multi-channel) weight map that depends on $z_k$ element-wise, as
$$g_k = \frac{f(z_k)}{z_k}.$$
Here $0/0$ should be interpreted as $0$. For example, the weight map $g_k$ associated with the ReLU function $f(\cdot) = \max\{\cdot, 0\}$ is a binary map which is a thresholded version of $z_k$,
$$g_k = \begin{cases} 1, & z_k > 0, \\ 0, & z_k \le 0. \end{cases}$$
This interpretation is visualized in Fig. 2(a) (bias not shown).
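The gating view of an element-wise activation can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: it forms the weight map as the activation output divided by its input, taking 0/0 as 0, and confirms that multiplying the input by this map reproduces the activation. The helper name `gate_from_activation` and the sample values are illustrative assumptions.

```python
import numpy as np


def relu(z):
    """Element-wise rectified linear unit."""
    return np.maximum(z, 0.0)


def gate_from_activation(f, z):
    """Weight map f(z) / z, computed element-wise, with 0/0 taken as 0."""
    g = np.zeros_like(z, dtype=float)
    nonzero = z != 0
    g[nonzero] = f(z[nonzero]) / z[nonzero]
    return g


# Hypothetical pre-activation values (one small feature map).
z = np.array([[-2.0, -0.5, 0.0], [0.5, 1.0, 3.0]])

g_relu = gate_from_activation(relu, z)
# Multiplying the input by the gate reproduces the activation output.
assert np.allclose(z * g_relu, relu(z))
# For the ReLU, the gate is a binary map: a thresholded version of z.
assert np.array_equal(g_relu, (z > 0).astype(float))

# For tanh (which also satisfies f(0) = 0), the gate is continuous-valued.
g_tanh = gate_from_activation(np.tanh, z)
assert np.allclose(z * g_tanh, np.tanh(z))
```

In both cases the gate at a given position depends only on the value at that position, which is the property that a spatial activation relaxes.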
Since the nonlinear activations are what grant CNNs their ability to implement complex functions, here we propose to use learnable spatial activations. That is, instead of the element-wise relation (3), we propose to allow each element in the weight map $g_k$ to depend also on the spatial environment of the corresponding element in the convolutional-layer output $z_k$. Specifically, we introduce xUnit, in which
$$g_k = \exp\{-d_k^2\}, \qquad d_k = H_k \, \mathrm{ReLU}(z_k),$$
with $H_k$ denoting depth-wise convolution, and with the squaring and exponentiation applied element-wise. The depth-wise convolution applies a single filter to each input channel, and is significantly more efficient in terms of memory and computations than the multi-channel convolution popularly used in CNNs. The filters capture spatial relations, and have to be learned during training. To make the training stable, we also add batch-normalization layers before the ReLU and before the exponentiation. This is illustrated in Fig. 2(b).
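To make the data flow of a spatial gate concrete, here is a minimal NumPy sketch of a forward pass. It is an illustration under stated assumptions rather than the authors' implementation: the weight map is assumed to be a Gaussian, `exp(-d**2)`, of the depth-wise-filtered ReLU output, the batch-normalization layers are omitted, zero padding is used at the borders, and the filters are random instead of learned. The function names `depthwise_conv2d` and `xunit` are hypothetical.

```python
import numpy as np


def depthwise_conv2d(x, filters):
    """Depth-wise 'same' filtering with zero padding (cross-correlation).

    x:       array of shape (channels, height, width)
    filters: array of shape (channels, r, r) with odd r, one filter per channel
    """
    c, h, w = x.shape
    r = filters.shape[-1]
    p = r // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))  # zero padding on spatial axes
    out = np.zeros((c, h, w), dtype=float)
    for i in range(r):
        for j in range(r):
            # Each channel is weighted by its own filter tap only.
            out += filters[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out


def xunit(z, filters):
    """Soft spatial gate: returns (z * g, g) with g = exp(-(H * relu(z))**2).

    Batch normalization is omitted in this sketch.
    """
    d = depthwise_conv2d(np.maximum(z, 0.0), filters)
    g = np.exp(-d ** 2)  # continuous-valued weight map in (0, 1]
    return z * g, g


rng = np.random.default_rng(0)
z = rng.standard_normal((4, 16, 16))            # hypothetical conv output
filters = 0.1 * rng.standard_normal((4, 9, 9))  # stand-in for learned filters
x_next, g = xunit(z, filters)
```

Unlike a per-pixel gate, each entry of `g` here depends on an `r x r` neighborhood of the corresponding entry of `z`; with `1 x 1` filters the unit reduces to an element-wise activation.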
Merely replacing ReLUs with xUnits clearly increases memory consumption and running times at test stage. This is mostly due to their convolutional operations (the exponent can be implemented using a look-up table). Specifically, an xUnit with a $d$-channel input and a $d$-channel output involving $r \times r$ filters introduces an overhead of $r^2 d + 4d$ parameters ($r^2 d$ for the depth-wise filters, and $2d$ for each batch-normalization layer). However, first, note that this overhead is relatively mild compared to the parameters of each convolutional layer. Second, in return for that overhead, xUnits provide a performance boost. This means that the same performance can be attained with fewer layers or with fewer channels per layer. Therefore, the important question is whether xUnits improve the overall tradeoff between performance and number of parameters.
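The size of this overhead relative to a convolutional layer can be worked out directly. The sketch below counts one depth-wise filter per channel plus two batch-normalization layers (a scale and a shift per channel each) and compares the result with the weights of a multi-channel convolution. The channel count of 64 and the 3 x 3 convolution size are assumptions chosen for illustration, and biases are not counted.

```python
def xunit_params(d, r):
    """Depth-wise r x r filters (r*r*d) plus two batch-norm layers (2*d each)."""
    return r * r * d + 2 * (2 * d)


def conv_params(c_in, c_out, k):
    """Weights of a multi-channel k x k convolution (biases not counted)."""
    return c_in * c_out * k * k


def activation_share(d, k, r):
    """Fraction of a Conv + xUnit layer's parameters that sit in the xUnit."""
    unit = xunit_params(d, r)
    return unit / (conv_params(d, d, k) + unit)


d, k = 64, 3  # assumed channel count and convolution filter size
for r in (1, 3, 5, 7, 9):
    print(f"r={r}: xUnit {xunit_params(d, r):5d} params, "
          f"conv {conv_params(d, d, k)} params, "
          f"share {100 * activation_share(d, k, r):.1f}%")
```

Under these assumptions, even a 9 x 9 xUnit holds 5,440 parameters against 36,864 for the convolution it follows, roughly 13% of the layer.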
Figure 3 shows the effect of using xUnits in a denoising task. Here, we trained two simple net architectures to remove additive Gaussian noise of standard deviation from noisy images, using a varying number of layers. The first net is a traditional ConvNet architecture comprising a sequence of Conv+BN+ReLU layers. The second net, which we coin xNet, comprises a sequence of Conv+xUnit layers. In both nets, the regular convolutional layers (not the ones within the xUnits) comprise channel filters. For the xUnits, we varied the size of the depth-wise filters from to . We trained both nets on 400 images from the BSD dataset using residual learning (i.e. learning to predict the noise and subtracting the noise estimate from the noisy image at test time). This has been shown to be advantageous for denoising. As can be seen in the figure, when the xUnit filters are $1 \times 1$, the peak signal to noise ratio (PSNR) attained by xNet exceeds that of ConvNet by only a minor gap. In this case, the xUnits are not spatial. However, as the xUnits’ filters become larger, xNet’s performance begins to improve, for any given total number of parameters. Note, for example, that a 3-layer xNet with activations outperforms a 9-layer ConvNet, although having less than the number of parameters.
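The residual-learning setup mentioned above (predict the noise, then subtract the estimate from the noisy input) can be sketched in a few lines of NumPy. This is an illustration only: the noise level, the random "image", and the oracle predictor standing in for a trained net are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 25.0 / 255.0              # assumed noise level on a [0, 1] intensity scale
clean = rng.random((32, 32))      # stand-in for a clean training image
noise = sigma * rng.standard_normal(clean.shape)
noisy = clean + noise


def residual_target(noisy, clean):
    """Training target under residual learning: the noise itself."""
    return noisy - clean


def restore(noisy, predict_noise):
    """Test-time restoration: subtract the predicted noise from the input."""
    return noisy - predict_noise(noisy)


target = residual_target(noisy, clean)
assert np.allclose(target, noise)


def oracle(y):
    """Hypothetical perfect predictor, used only to check the bookkeeping."""
    return target


restored = restore(noisy, oracle)
assert np.allclose(restored, clean)
```

A trained net would take the place of `oracle`, and would be fit by minimizing the squared error between its output and `target`.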
To further understand the performance-computation tradeoff when using spatial activations, Fig. 4 shows a vertical cross section of the graph in Fig. 3 at an overall of parameters. Here, the PSNR is plotted against the percentage of parameters invested in the xUnit activations. In a traditional ConvNet, 0% of the parameters are invested in the activations. As can be seen in the graph, this is clearly sub-optimal. In particular, the optimal percentage can be seen to be at least , where the performance of xNet reaches a plateau. In fact, after around (corresponding to activation filters), the benefit in further increasing the filters’ supports becomes relatively minor.
To gain intuition into the mechanism that allows xNet to achieve better results with fewer parameters, we depict in Fig. 5 the layer 4 feature maps, weight (activation) maps, and their products, for a ConvNet and an xNet operating on the same noisy input image. Interestingly, we see that many more xNet activations are close to 1 (white) than ConvNet activations. Thus, it seems that in xNet, more channels take part in the denoising effort. Moreover, it can be seen that the xNet weight maps are quite complex functions of the features, as opposed to the simple binarization function of the ReLUs in ConvNet.
In terms of training stability, xNets behave similarly to ConvNets with ReLUs, and even allow slightly faster convergence. This can be seen in Fig. 6, which compares the convergence of the loss at train time of a 4-layer ConvNet and a 4-layer xNet with activations. Although the latter has more parameters, it converges slightly faster. Due to the two-branch structure of the xUnits, the xNet does not seem to suffer from the gradient vanishing problem. Furthermore, it should be noted that xUnits can be used as building blocks also within ResNets and DenseNets, which alleviate the gradient vanishing problem in very deep nets.
4 Experiments and Applications
| | BM3D | WNNM | EPLL | MLP | DnCNN-s | xDnCNN |
| # of parameters | - | - | - | - | 555K | 303K |
Our goal is to show that many small-scale and medium-scale state-of-the-art CNNs can be made almost 50% smaller with xUnits, without incurring any degradation in performance.
We implemented the proposed architecture in PyTorch. We ran all experiments on a desktop computer with an Intel i5-6500 CPU and an Nvidia 1080Ti GPU. We used the Adam optimizer with its default settings for training the nets. We initialized the learning rate to and gradually decreased it to during training. We kept the mini-batch size fixed at 64. In all applications, we used depth-wise convolutions in the xUnits, and minimized the mean square error (MSE) over the training set.
4.1 Image Denoising
We begin by illustrating the effectiveness of xUnits in image denoising. As a baseline architecture, we take the state-of-the-art DnCNN denoising network. We replace all ReLU layers with xUnit layers and reduce the number of convolutional layers from 17 to 9. We keep all convolutional layers with 64 channel filters, as in the original architecture. Our net, which we coin xDnCNN, has only about 55% of the number of parameters of DnCNN (303K for xDnCNN and 555K for DnCNN).
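The two parameter totals can be approximately reproduced by a back-of-the-envelope count. The sketch below is a plausibility check under assumptions that the text above does not state: 3 x 3 convolutions, 9 x 9 depth-wise filters in the xUnits, a single-channel (grayscale) input and output, batch normalization after every intermediate convolution in DnCNN, an xUnit after every convolution except the last in xDnCNN, and no bias terms. The resulting totals land close to, but not exactly on, the reported 555K and 303K.

```python
def conv_weights(c_in, c_out, k=3):
    """Weights of a k x k multi-channel convolution (no bias counted)."""
    return c_in * c_out * k * k


def bn_params(c):
    """Batch normalization: one scale and one shift per channel."""
    return 2 * c


def xunit_params(c, r=9):
    """Depth-wise r x r filters plus two batch-norm layers."""
    return r * r * c + 2 * bn_params(c)


def dncnn_params(depth=17, width=64, channels=1):
    """Conv+ReLU, then (depth - 2) x Conv+BN+ReLU, then a final Conv."""
    total = conv_weights(channels, width)
    total += (depth - 2) * (conv_weights(width, width) + bn_params(width))
    total += conv_weights(width, channels)
    return total


def xdncnn_params(depth=9, width=64, channels=1, r=9):
    """Conv+xUnit repeated (depth - 1) times, then a final Conv."""
    total = conv_weights(channels, width) + xunit_params(width, r)
    total += (depth - 2) * (conv_weights(width, width) + xunit_params(width, r))
    total += conv_weights(width, channels)
    return total


baseline = dncnn_params()
compact = xdncnn_params()
print(baseline, compact, round(compact / baseline, 3))
# prints: 556032 302720 0.544
```

The ratio of about 0.54 agrees with the statement that the compact net is nearly half the size of the baseline.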
As in , we train our net on 400 images. We use images from the Berkeley segmentation dataset , enriched by random flipping and random cropping (). The noisy images are generated by adding Gaussian noise to the training images (different realization to each image). We examine the performance of our net at noise levels . Table 1 compares the average PSNR attained by our xDnCNN to that attained by the original DnCNN (variant ‘s’), as well as to the state-of-the-art non-CNN denoising methods BM3D , WNNM , EPLL , and MLP . The evaluation is performed on the BSD68 dataset , a subset of 68 images from the BSD dataset, which is not included in the training set. As can be seen, our xDnCNN outperforms all non-CNN denoising methods and achieves results that are on par with DnCNN. This is despite the fact that xDnCNN is nearly half the size of DnCNN in terms of number of parameters. The superiority of our method becomes more significant as the noise level increases. At a noise level of , our method achieves the highest PSNR values on out of the images in the dataset.
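The data preparation and the evaluation metric described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' pipeline: the patch size, the noise level, the random stand-in image, and the helper names are assumptions; the PSNR definition is the standard one for a given peak intensity.

```python
import numpy as np


def random_flip_and_crop(image, size, rng):
    """Random horizontal flip followed by a random size x size crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1]
    h, w = image.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]


def add_gaussian_noise(image, sigma, rng):
    """Add a fresh realization of white Gaussian noise to the image."""
    return image + sigma * rng.standard_normal(image.shape)


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)


rng = np.random.default_rng(0)
image = rng.random((64, 96))                         # stand-in for a training image
patch = random_flip_and_crop(image, size=40, rng=rng)  # assumed patch size
noisy = add_gaussian_noise(patch, sigma=25.0 / 255.0, rng=rng)  # assumed sigma
print(f"PSNR of the noisy patch: {psnr(patch, noisy):.2f} dB")
```

Drawing the noise inside the data loader, as here, gives each training image a different noise realization every time it is sampled.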
Figure 7 shows an example denoising result obtained with xDnCNN, compared with BM3D, EPLL, MLP and DnCNN-s, for a noise level of . As can be seen, our xDnCNN best reconstructs the fine details and barely introduces any distracting artifacts. In contrast, all the other methods (including DnCNN) introduce unpleasant distortions.
4.2 Single image rain removal
Next, we use the same architecture in the task of removing rain streaks from a single image. We only introduce one modification to our denoising xDnCNN, which is to work on three-channel (RGB) input images and to output three-channel images. This results in a network with 306K parameters. We compare our results to DerainNet, a network with K parameters, which comprises three convolutional layers: , and , respectively. Similarly to denoising, we learn the residual mapping between a rainy image and a clean image. Training is performed on the dataset of DerainNet, which contains 4900 pairs of clean and synthetically generated rainy images. However, we evaluate our net on the Rain12 dataset, which contains 12 artificially generated images. Although the training data is quite different from the testing data, our xDnCNN performs significantly better than DerainNet. Specifically, xDnCNN achieves dB, whereas DerainNet achieves dB. This behavior is also seen when de-raining real images. As can be seen in Fig. 8, xDnCNN performs significantly better in cleaning actual rain streaks. We thus conclude that xDnCNN is far more robust to different rain appearances, while maintaining its efficiency. Note that our xDnCNN deraining net has only the number of parameters of DerainNet.
4.3 Single image super resolution
Our xUnit activations can also be applied in single image super resolution. As a baseline implementation, we take the SRCNN model. In this architecture, the input image is first interpolated to the desired size of the output image and then fed through three convolutional layers: , and . This is illustrated in Fig. 9(a). Here, we study two different modifications to SRCNN, where we replace the two ReLU layers with xUnit layers. In the first modification, we reduce the size of the filters in the middle layer from to (Fig. 9(b)). This variant, which we coin xSRCNN-f, has only the number of parameters of SRCNN (K for xSRCNN-f and K for SRCNN). In the second modification, we reduce the number of channels in the second layer from to (Fig. 9(c)). This variant, which we coin xSRCNN-c, has only about 77% of the number of parameters of SRCNN (44K for xSRCNN-c and 57K for SRCNN).
The SRCNN model was trained on images from ImageNet . Here, we train our nets on datasets that are three orders of magnitude smaller. Specifically, we use only images from  and images from the BSD dataset, as our training set. We augment the data by random flipping and random cropping.
Table 2 reports the results attained by all the models, tested on Set5, Set14, and BSD100. As can be seen, both our models attain results that are on par with the original SRCNN, despite being much smaller and trained on a significantly smaller number of images. Note that our xSRCNN-f has fewer parameters than xSRCNN-c. This suggests that a better way to discard parameters in xNets is by reducing filter sizes, rather than reducing channels. A possible explanation is that the filters within the xUnits can partially compensate for the small support of the filters in the convolutional layers. However, the fact that discarding channels can also provide a significant reduction in parameters at the same performance indicates that the channels in an xNet are more effective than those in ConvNets with per-pixel activations.
Figure 10 shows the layer 2 feature maps, activation maps, and their products for both SRCNN and our xSRCNN-f. As in the case of denoising, we can see that in xSRCNN, many more feature maps participate in the reconstruction effort compared to SRCNN. This provides a possible explanation for its ability to perform well with smaller filters (or with fewer channels).
Popular CNN architectures use simple nonlinear activation units (e.g. ReLUs), which operate pixel-wise on the feature maps. In this paper, we demonstrated that CNNs can greatly benefit from incorporating learnable spatial connections within the activation units. While these spatial connections introduce additional parameters to the net, they significantly improve its performance. Overall, the tradeoff between performance and number of parameters is substantially improved. We illustrated how our approach can reduce the size of several state-of-the-art CNN models for denoising, de-raining and super-resolution, which are already considered to be very small, by almost 50%. This is without incurring any degradation in performance.
- [1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
- [2] W. Bae, J. Yoo, and J. C. Ye. Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification. arXiv preprint arXiv:1611.06345, 2016.
- [3] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2392–2399. IEEE, 2012.
- [4] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao. DehazeNet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing, 25(11):5187–5198, 2016.
- [5] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
- [6] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
- [7] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
- [8] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199. Springer, 2014.
- [9] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
- [10] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley. Clearing the skies: A deep network architecture for single-image rain removal. IEEE Transactions on Image Processing, 26(6):2944–2956, 2017.
- [11] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
- [12] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
- [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
- [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- [15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
- [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
- [18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
- [19] J. Jiao, W.-C. Tu, S. He, and R. W. Lau. FormResNet: Formatted residual learning for image restoration. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1034–1042. IEEE, 2017.
- [20] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
- [21] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
- [23] S. Lefkimmiatis. Non-local color image denoising with convolutional neural networks. arXiv preprint arXiv:1611.06757, 2016.
- [24] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng. An all-in-one network for dehazing and beyond. arXiv preprint arXiv:1707.06543, 2017.
- [25] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown. Rain streak removal using layer priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2736–2744, 2016.
- [26] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
- [27] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.
- [28] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
- [29] M. Noroozi, P. Chandramouli, and P. Favaro. Motion deblurring in the wild. arXiv preprint arXiv:1701.01486, 2017.
- [30] G. B. Orr and K.-R. Müller. Neural networks: tricks of the trade. Springer, 2003.
- [31] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
- [32] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338–1351, 2003.
- [33] T. Remez, O. Litany, R. Giryes, and A. M. Bronstein. Deep class aware denoising. arXiv preprint arXiv:1701.01698, 2017.
- [34] T. Remez, O. Litany, R. Giryes, and A. M. Bronstein. Deep class-aware image denoising. In Sampling Theory and Applications (SampTA), 2017 International Conference on, pages 138–142. IEEE, 2017.
- [35] T. Remez, O. Litany, R. Giryes, and A. M. Bronstein. Deep convolutional denoising of low-light images. arXiv preprint arXiv:1701.01687, 2017.
- [36] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang. Single image dehazing via multi-scale convolutional neural networks. In European Conference on Computer Vision, pages 154–169. Springer, 2016.
- [37] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- [38] S. Roth and M. J. Black. Fields of experts: A framework for learning image priors. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 860–867. IEEE, 2005.
- [39] S. Roth and M. J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205–229, 2009.
- [40] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
- [41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- [42] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
- [43] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
- [44] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [45] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, L. Zhang, B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1110–1121. IEEE, 2017.
- [46] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
- [47] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 2017.
- [48] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep CNN denoiser prior for image restoration. arXiv preprint arXiv:1704.03264, 2017.
- [49] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 479–486. IEEE, 2011.