Deep Likelihood Network for Image Restoration with Multiple Degradation Levels
Convolutional neural networks have proven very effective in a variety of image restoration tasks. Most state-of-the-art solutions, however, are trained using images with a single particular degradation level, and can deteriorate drastically when applied to other degradation settings. In this paper, we propose a novel method dubbed deep likelihood network (DL-Net), which aims at generalizing off-the-shelf image restoration networks to succeed over a spectrum of degradation settings while keeping their original learning objectives and core architectures. In particular, we slightly modify the original restoration networks by appending a simple yet effective recursive module, which is derived from a fidelity term for disentangling the effect of degradations. Extensive experimental results on image inpainting, interpolation, and super-resolution demonstrate the effectiveness of our DL-Net.
Over the past few years, deep convolutional neural networks (CNNs) have advanced the state of the art on a variety of image restoration tasks, including single image super-resolution (SISR), inpainting, denoising, colorization, etc. Despite the impressive quantitative metrics and perceptual quality that CNNs have achieved, most state-of-the-art solutions are trained with pairs of manually degraded images and their anticipated restoration outcomes, based on implicit assumptions about the input.
In general, image degradations are restricted to a presumed level throughout datasets, e.g., a pre-defined shape, size, and even location for inpainting regions [6, 7], or a designated downsampling strategy from high-resolution images [1, 8, 9]. With the robustness of deep networks having been criticized for years [10, 11], such specification of the input domain can entail severe over-fitting in the obtained CNN models [5, 12]. That is, they succeed when the assumptions are fulfilled and test degradations are limited to the same particular setting as in training, but are problematic in practical applications in which multiple degradations and more flexible restorations are required.
A few endeavors have been made to address this issue. One straightforward solution is to jointly learn to restore images degraded through different settings. Unfortunately, such a naïve approach mitigates the issue only to a certain extent, and may still fail if the degradations vary within a substantial range of difficulties, as demonstrated in previous literature [5, 12]. Deep networks are believed to learn much better if the training samples and loss weights corresponding to possible degradation levels are allocated in a more reasonable way. Curriculum learning and self-paced learning approaches have hence been proposed to control the sampling strategies and allocate more training samples to more challenging degradation levels with relatively lower PSNRs. Multi-task learning can similarly be incorporated to adjust loss weightings. Though enlightening, such methods suffer from several drawbacks in practice. First, the difficulty level of degradations has to be appropriately discretized, but there is no clear guideline for doing so when the difficulties are inevident per se. Second, it is non-trivial to generalize them to CNNs whose objectives comprise perceptual and adversarial losses, since PSNR turns out not to be a reliable criterion for them [16, 18]. Third, an extra validation set is inevitably required in every training epoch.
There are also several recent works that propose to tackle this issue by introducing customized architectures [12, 19], with for instance modified convolution operations for image inpainting. Such architectures are specifically designed and probably limited to certain tasks. In this paper, we aim at directly generalizing off-the-shelf image restoration networks, which may originally have been presented for restoring images at only one particular level, to succeed over a spectrum of degradation settings with the learning objectives and core network architectures retained/reused. To achieve this, we propose deep likelihood network (DL-Net): a novel method which inherently disentangles the effects of possible degradations and enforces high likelihood overall (with our method, input images with different settings of degradations are processed distinctively owing to their different degradation kernels and other hyper-parameters, which is what we mean by "disentangle"). Its computation procedure is cast into a recursive module and can be readily incorporated into any network, making it highly general and scalable. Another benefit of our method, in comparison with some previous ones [5, 15], comes from the degradation itself, whose information may facilitate the image restoration process as well.
We primarily focus on three image restoration tasks, i.e. inpainting, interpolation, and SISR in which image blurring is also introduced. Our main contributions are:
We propose a novel and general method to generalize off-the-shelf image restoration CNNs to succeed over a spectrum of image degradations.
By encouraging high likelihood in the architecture, our method utilizes information from different degradations to facilitate the restoration process.
Our method is computationally efficient (as the introduced overhead is small), easy to implement, and can be readily incorporated into different networks.
2 Related Works
Image restoration. Typical restoration tasks include SISR, inpainting, denoising, and deblurring, just to name a few. We mainly focus on inpainting (also known as image completion or hole-filling), interpolation, and SISR. Since many denoising CNNs have already demonstrated reliable behavior in multiple settings [3, 25], we opt to cover other critical tasks where the problem arises.
Image inpainting, the task of making visual predictions for missing regions, is required when human users attempt to erase certain regions in an image. Early solutions predict information inside the regions by exploiting the isophote direction field [26, 27] and texture synthesis technologies [28, 29, 30]. Deep CNNs were later introduced to learn semantic contents in an end-to-end manner. Although initial efforts showed decent results only on small regions, encoder-decoder-based architectures and adversarial learning have been leveraged to make them work reasonably well on very large holes. For better perceptual quality, sophisticated loss terms in favor of content and texture consistencies have also been developed [31, 7, 32, 33, 34]. Despite all these impressive improvements, popular methods analyze inpainting mainly with pre-defined region size, shape, and even location, yet a deteriorating effect has been reported when multiple degradations exist. Bringing in more holistic and local loss terms may mitigate the problem, but does not resolve it. Recently, Ren et al. and Liu et al. proposed Shepard convolution and partial convolution for inpainting with irregular holes, respectively. These works mostly focus on convolution design and are orthogonal to ours.
Image interpolation  is a similar task and sometimes also referred to as another type of inpainting using Bernoulli masks. Sparse coding networks are usually adopted to cope with it [36, 37]. Some implicit prior captured by the CNN architecture itself is also explored .
SISR is the task of generating a high-resolution image from a given low-resolution one. Based on neighbor embedding and sparse coding technologies, many traditional methods try to preserve (example-based) neighborhood structures [39, 40, 22, 41] or (dictionary-based) perceptual representations [42, 43, 44, 45, 41] (often sparse ones with over-complete dictionaries, which can also be viewed as sparse priors) in different image subspaces. However, most of them cannot be computed efficiently, and furthermore their expressive power is limited. To this end, researchers started pursuing approximate but explicit and powerful mappings from input to target subspaces [1, 47], for which CNNs are appropriate candidates. Being able to extract contextual information, networks with deeper architectures show even higher PSNRs [8, 48, 9, 49, 50]. However, it has also been reported that current state-of-the-art SISR networks struggle to generalize to scenarios with possible blurring.
Likelihood and image priors. A key ingredient of image restoration is to estimate the probability that one outcome is capable of generating the given input through some sort of degradation, i.e. , the likelihood. In principle, such tasks are heavily ill-posed. Solving them would sometimes require prior knowledge on the input low- and output high-quality image subspaces, and learning-based methods are typically used to leverage them effectively. Generally, the likelihood and priors can be integrated together using the maximum a posteriori (MAP) or Bayesian frameworks . This line of research used to provide state-of-the-art results before the unprecedented success of deep learning.
Early works advocate the Gaussian process prior, the Huber prior, the sampled texture prior, etc. The edge prior and some other types of natural image priors have been explored to achieve smoother results [53, 54]. Recently, image priors derived from generative adversarial networks have gained great success [56, 49]. Our DL-Net exploits the likelihood, and also priors if necessary. Instead of optimizing them directly in the learning objective, we reformulate the procedure as a recursive module such that it can be directly incorporated into any given architecture. Our method is closely related to deep image prior, in the sense that both schemes encourage outputs that are able to reproduce the corresponding inputs and can be applied to various CNNs (see Appendix E for discussions). It also lies in the category of MAP-inference-guided discriminative learning.
3 Deep Likelihood Network
Image restoration tasks have been intensively studied for decades. Before the advent of deep learning, conventional MAP-based methods hinging on likelihood and image prior modeling had been applied to a variety of image restoration tasks and achieved state-of-the-art results. In this work, we advocate the MAP-based formulation and introduce its fidelity term to off-the-shelf restoration CNNs for handling multiple settings of degradations. In the following two subsections, we will compare different formulations for image restoration, and show that simply adding the fidelity term to a discriminative learning objective cannot enhance a restoration network under multiple degradation scenarios. Then, in Section 3.3, we shall present our method, which explicitly incorporates an MAP-inspired module with degradation information into the network architectures.
3.1 Problem Formulations: MAP & Deep Learning
In this work, we consider a group of image restoration tasks whose degradation process can be formulated as

y = D(x) + n,

where x and y denote the clean and degraded images, respectively, n denotes additive white Gaussian noise with standard deviation σ, and D(·) stands for the degradation operator, which may include downsampling ↓s with an integer factor s. For image inpainting and interpolation, we have s = 1, σ = 0, and D(x) = M ⊙ x, where ⊙ denotes the entry-wise multiplication operator and M denotes a binary mask. For SISR, we have D(x) = (x ⊗ k)↓s, where ⊗ denotes convolution and k denotes the degradation (e.g., blur) kernel. In terms of multiple degradations (or particularly multiple degradation levels in this paper), we aim to train a single model to handle the concerned restoration tasks over a spectrum of degradation settings.
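The two degradation operators above can be sketched as follows (a minimal NumPy sketch with our own helper names; FFT-based circular convolution is used for brevity, whereas real pipelines handle image boundaries more carefully):

```python
import numpy as np

def degrade_sisr(x, k, s, sigma, rng=None):
    """Blur x with kernel k (circular convolution via FFT), downsample by an
    integer factor s, then add white Gaussian noise of standard deviation sigma."""
    kpad = np.zeros_like(x)
    kh, kw = k.shape
    kpad[:kh, :kw] = k
    kpad = np.roll(kpad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    blurred = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(kpad)))
    y = blurred[::s, ::s]
    if sigma > 0:
        rng = rng or np.random.default_rng(0)
        y = y + rng.normal(0.0, sigma, y.shape)
    return y

def degrade_mask(x, M):
    """Inpainting / interpolation: s = 1, no noise, entry-wise binary mask M."""
    return M * x
```

With a delta kernel, s = 1, and σ = 0, `degrade_sisr` reduces to the identity, which is a quick sanity check for the implementation.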
The MAP-based formulation of image restoration generally involves a fidelity term for modeling the likelihood of degradation and a regularization term for modeling image priors. Given the degraded observation y and the degradation setting (i.e., M, k, s, σ), the fidelity term can be defined as

F(x) = (1/2) ||D(x) − y||².
By further trading off against the regularization term Ω(x), the problem can be formulated as

min_x (1/2) ||D(x) − y||² + λ Ω(x),

where λ is the regularization parameter.
Given the regularizer and an optimizer, the MAP framework is naturally flexible in handling multiple degradations. However, existing regularizers are generally hand-crafted, non-convex, and insufficient for characterizing image priors. Furthermore, due to the non-convexity, the optimization algorithm often cannot find a satisfactory solution of Eqn. (3). All of this has led recent works to resort to deep CNNs for improving restoration performance.
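For intuition, a toy MAP solver for the inpainting case can be written in a few lines (a hedged sketch: a simple quadratic smoothness regularizer stands in for the hand-crafted priors discussed above, and plain gradient descent serves as the optimizer):

```python
import numpy as np

def map_inpaint(y, M, lam=0.1, lr=0.2, iters=2000):
    """Gradient descent on ||M*x - y||^2 + lam * ||grad x||^2, i.e. Eqn. (3)
    with a quadratic smoothness prior standing in for a hand-crafted one."""
    x = y.copy()
    for _ in range(iters):
        g_fid = 2.0 * M * (M * x - y)            # gradient of the fidelity term
        lap = (-4.0 * x                          # circular Laplacian of x
               + np.roll(x, 1, 0) + np.roll(x, -1, 0)
               + np.roll(x, 1, 1) + np.roll(x, -1, 1))
        x = x - lr * (g_fid - 2.0 * lam * lap)   # prior gradient is -2*lam*lap
    return x
```

Note how the same routine handles any mask M without retraining, which is exactly the flexibility of the MAP framework referred to in the text.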
Given one specific degradation setting with fixed M, k, s, and σ, a direct input-output mapping f is learned from a training set of degraded-clean image pairs to generate the restoration f(y). In spite of its success [1, 2, 3], the learned CNN model is tailored to the specific degradation setting, and generalizes poorly when applied to another setting in which any of these factors differs. As such, we aim at developing a deep image restoration method for handling multiple degradations with both the flexibility of MAP and the (accuracy and efficiency) merits of off-the-shelf deep networks. One seemingly plausible method is to add the fidelity term to the learning objective of CNNs. Yet, as will be shown in Section 3.2, it does not succeed.
3.2 Likelihood in Learning Objective?
It has been discussed in prior works [5, 12] that rigid joint training supported by data augmentation provides limited assistance in generalizing across degradation settings. The empirical results in our Section 4.1 echo this claim. In this subsection, we further demonstrate that network performance cannot really be improved by adding the fidelity term to the learning objective either. Taking multiple degradations into account, an augmented training set is formed in which each sample is allowed to have its own degradation setting, so different samples may carry different masks, kernels, scales, and noise levels if desired in practice. Using the augmented set, we can train an autoencoder network f for restoration by minimizing the reconstruction loss between f(y) and x. The method is named Autoencoder (Joint), and detailed explanations of our experimental settings will be given in Section 4.1.
To incorporate MAP into the deep learning solutions, we modify the original fidelity term as in Eqn. (2) into F(f(y)) = (1/2) ||D(f(y)) − y||², which may serve as a form of "self-supervision" and be combined with the reconstruction loss for image restoration. The most direct combination is to define the sum of the reconstruction loss and a γ-weighted fidelity term as the learning objective, and one may expect such a scheme, dubbed Naïve Likelihood in this paper, to endow the obtained model with some more ability to handle multiple degradations.
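The combined objective for the inpainting case can be written down directly (a sketch with our own function names; in training, x_hat would be the network output f(y)):

```python
import numpy as np

def naive_likelihood_loss(x_hat, x, y, M, gamma):
    """Reconstruction loss plus a gamma-weighted fidelity ("self-supervision")
    term, for the inpainting case where the degradation is a binary mask M."""
    rec = np.mean((x_hat - x) ** 2)        # reconstruction against ground truth
    fid = np.mean((M * x_hat - y) ** 2)    # fidelity of the degraded prediction
    return rec + gamma * fid
```

Setting gamma = 0 recovers the plain Autoencoder (Joint) objective, which is the baseline the Naïve Likelihood rows in Table I are compared against.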
|Method||ℓ1 / ℓ2 / PSNR (dB)||ℓ1 / ℓ2 / PSNR (dB)|
|Autoencoder (Joint) ||0.0305 / 0.0040 / 30.87||0.0295 / 0.0035 / 31.50|
|Naïve Likelihood ()||0.0304 / 0.0040 / 30.87||0.0295 / 0.0036 / 31.48|
|Naïve Likelihood ()||0.0296 / 0.0040 / 30.96||0.0287 / 0.0035 / 31.61|
|Naïve Likelihood ()||0.0258 / 0.0038 / 31.20||0.0248 / 0.0033 / 31.94|
|Naïve Likelihood ()||0.0248 / 0.0044 / 30.45||0.0235 / 0.0037 / 31.30|
To illustrate the effect of the fidelity term in the objective, we present some quantitative results in Table I. Training curves (for the centered setting) are further provided in Figure 1. The experiment is performed on image inpainting with varying region sizes, and no adversarial loss is used here to ensure reliable PSNRs for evaluating restoration performance. From Table I and Figure 1, we can easily observe that, for most values of the weight γ, the "Naïve Likelihood" method performs similarly to "Autoencoder (Joint)", which minimizes only the reconstruction loss. Though a large γ lowers the fidelity loss, we observe that this is only because more attention is paid to the unmasked region, and it fails on the hardest setting. It only achieves 26.04dB after retraining and testing, which is even worse than Autoencoder (Joint): 26.13dB. As discussed above, the major reason the fidelity term is less helpful in handling multiple degradations can be ascribed to the lack of degradation information when processing images in deep networks.
3.3 Likelihood Assured by A Recursive Module
In this subsection, we present our solution for generalizing current image restoration networks to handle a spectrum of degradation settings. As depicted in Figure 2(a), given an off-the-shelf image restoration network f (i.e., the reference), the output of one middle layer is denoted by z, and we write f = g2 ∘ g1, which means the reference network is regarded as the composition of two functions, or in other words two sub-networks.
Partially inspired by deep image prior, we substitute g2(z) for x in the fidelity term and eliminate the regularization term, since some implicit priors can be characterized by the sub-network g2. In contrast to deep image prior, we assume there exists a unified g2 suitable for all possible inputs, but for each of them a specific z should be learned with an optimization algorithm. In this regard, the formulation in Eqn. (3) can be rewritten as,

min_z (1/2) ||D(g2(z)) − y||².
Since the reference network is differentiable, the objective function in (4) is differentiable w.r.t. z. Therefore, we can simply use a stochastic gradient descent algorithm to pursue the optimal z. We choose ADAM for this task and keep all its hyper-parameters except the learning rate at their defaults; see (5) for more details. Such a computation procedure can be cast into a recursive module and incorporated directly into any network. At the t-th iteration, an update Δz_t is calculated and added to the current estimate, that is, z_{t+1} = z_t + Δz_t. In theory, as long as at least one update step is taken, such learning dynamics target a richer architecture than the reference f. Once the final estimate z_T is obtained as described, an overall loss can be minimized as usual, by evaluating the difference between g2(z_T) and x.
The proposed method is dubbed deep likelihood network (DL-Net), and Figure 2(b) illustrates its main steps. Considering that Δz_t, i.e., a function of the gradient of (4) w.r.t. the current estimate of z, is stacked into the computation graph, higher-order gradients are required in the training process; automatic differentiation in current learning frameworks deals with this efficiently. A set of estimates {z_t} are sequentially calculated, but only the last one is taken as output. In comparison with the reference network, all our modifications in Figure 2(b) occur on the basis of g2. A feedback connection is established for estimating z, and the computations for deriving Δz_t from the gradient g_t follow the standard ADAM update, i.e., m_t = β1·m_{t−1} + (1−β1)·g_t, v_t = β2·v_{t−1} + (1−β2)·g_t², and Δz_t = −η·m̂_t/(√v̂_t + ε), where m̂_t and v̂_t are the bias-corrected moments.
The crux of DL-Net is to incorporate an MAP-inspired module that assists in disentangling the effects of different degradations. It enforces the outputs to achieve a lower fidelity loss iteratively, in a similar vein to AffGAN. Our method may as well be employed in classical learning-based methods where deep networks are not involved. Moreover, it is natural to further incorporate other desirable priors into (4). One extra benefit that such an MAP-inspired method should have, as emphasized in prior work, comes along with the insightful knowledge extracted from the degradation process, which might be critical for facilitating the learning. For instance, when the location, size, and shape of inpainting regions are known, inpainting networks are likely to learn automatically to pay more attention to missing regions or to take a more discounted loss on them.
Since a gradient-descent-derived module is utilized, two more hyper-parameters are introduced: the total number of gradient descent steps T and the learning rate for (4). We fix them unless otherwise clarified in the paper. Obviously, the depth of g2 hence has a major impact on the computational cost of our DL-Net. We found that a light g2 with a single convolution layer (along with a nonlinear activation picked up from its preceding layer) works well enough in practice, so the extra running time is negligible in comparison with g1, which is much deeper. That is, z is chosen as the last hidden representation (before ReLU) of the reference network, and thus the increase in computational cost is small. This setting is adopted across all the experiments in this work.
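The recursive module described above can be sketched as follows (a toy with a linear stand-in for the sub-network g2 and plain gradient descent in place of ADAM; the names are ours, not the authors'):

```python
import numpy as np

def dlnet_refine(z0, W, y, M, lr=0.1, T=3):
    """Toy recursive likelihood module: the sub-network g2 is a linear map
    W @ z and the degradation is a binary mask M.  Each of the T steps takes
    one gradient step on ||M * (W @ z) - y||^2 (plain gradient descent here,
    where the paper uses ADAM), then the refined output g2(z_T) is returned."""
    z = z0.copy()
    for _ in range(T):
        r = M * (W @ z) - y    # masked residual of the fidelity term
        g = W.T @ (M * r)      # gradient w.r.t. z (constant factor folded into lr)
        z = z - lr * g
    return W @ z
```

In a real network this loop is unrolled inside the forward pass, so backpropagating the overall loss through it requires the higher-order gradients mentioned above; the unrolled loop adds no learnable parameters, which matches the parameter counts reported in Table VIII.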
4 Experimental Results
4.1 Image Inpainting and Interpolation
We will analyze the performance of different models on image inpainting and interpolation together in this subsection.
Network architectures. Following prior works, we first introduce inpainting networks for hallucinating visual contents within square blocks whose size and location vary randomly in given images, and a similar pipeline will then also be applied to image interpolation. As a baseline model, the popular encoder-decoder-based architecture [2, 5, 33] (a.k.a., autoencoder) is chosen. We directly adapt the open-source Torch implementation from Gao and Grauman. See Appendix C for more details.
|Method||ℓ1 / ℓ2 / PSNR (dB)||ℓ1 / ℓ2 / PSNR (dB)||ℓ1 / ℓ2 / PSNR (dB)||ℓ1 / ℓ2 / PSNR (dB)|
|Autoencoder (Default)||0.2556 / 0.1269 / 15.49||0.2482 / 0.1299 / 15.38||0.2600 / 0.1733 / 14.17||0.2470 / 0.1226 / 15.62|
|Autoencoder (, centered)||0.2116 / 0.0951 / 17.16||0.2258 / 0.1402 / 15.67||0.2616 / 0.2321 / 13.19||0.0246 / 0.0031 / 32.27|
|Autoencoder (Joint) ||0.0227 / 0.0015 / 35.00||0.0305 / 0.0040 / 30.87||0.0494 / 0.0118 / 26.13||0.0295 / 0.0035 / 31.50|
|On-Demand Learning ||0.0230 / 0.0015 / 34.93||0.0307 / 0.0040 / 30.85||0.0496 / 0.0118 / 26.12||0.0299 / 0.0036 / 31.43|
|Multi-Tasks Learning ||0.0178 / 0.0010 / 36.65||0.0263 / 0.0036 / 31.37||0.0470 / 0.0118 / 26.11||0.0253 / 0.0032 / 32.06|
|DL-Net (ours)||0.0129 / 0.0008 / 38.47||0.0214 / 0.0033 / 32.03||0.0413 / 0.0110 / 26.54||0.0205 / 0.0028 / 32.82|
|Autoencoder (Default)||0.0924 / 0.0159 / 24.64||0.0527 / 0.0062 / 28.59||0.0565 / 0.0095 / 26.78||0.0705 / 0.0112 / 26.67|
|Autoencoder (Joint) ||0.0256 / 0.0017 / 34.36||0.0347 / 0.0036 / 31.07||0.0597 / 0.0104 / 26.32||0.0345 / 0.0037 / 31.68|
|On-Demand Learning ||0.0251 / 0.0016 / 34.50||0.0343 / 0.0035 / 31.15||0.0589 / 0.0102 / 26.41||0.0340 / 0.0036 / 31.78|
|Multi-Tasks Learning ||0.0216 / 0.0013 / 35.58||0.0335 / 0.0034 / 31.24||0.0619 / 0.0110 / 26.05||0.0327 / 0.0035 / 32.22|
|DL-Net (ours)||0.0138 / 0.0007 / 38.35||0.0240 / 0.0023 / 33.09||0.0488 / 0.0083 / 27.42||0.0233 / 0.0024 / 34.51|
|Autoencoder (Default)||0.2407 / 0.1098 / 16.19||0.2297 / 0.1088 / 16.19||0.2626 / 0.1709 / 14.27||0.2303 / 0.1033 / 16.38|
|Autoencoder (, centered)||0.1926 / 0.0835 / 17.73||0.1845 / 0.1104 / 16.79||0.2400 / 0.2141 / 13.47||0.0485 / 0.0132 / 25.84|
|Autoencoder (Joint) ||0.0299 / 0.0031 / 32.18||0.0438 / 0.0098 / 27.16||0.0702 / 0.0235 / 23.20||0.0444 / 0.0102 / 26.95|
|On-Demand Learning ||0.0311 / 0.0032 / 31.94||0.0450 / 0.0101 / 27.04||0.0711 / 0.0238 / 23.15||0.0456 / 0.0105 / 26.83|
|Multi-Tasks Learning ||0.0245 / 0.0025 / 33.20||0.0389 / 0.0093 / 27.48||0.0662 / 0.0231 / 23.31||0.0395 / 0.0097 / 27.24|
|DL-Net (ours)||0.0175 / 0.0019 / 34.71||0.0329 / 0.0090 / 27.73||0.0608 / 0.0229 / 23.39||0.0337 / 0.0094 / 27.44|
|Autoencoder (Default)||0.1575 / 0.0390 / 20.34||0.0717 / 0.0119 / 25.87||0.0856 / 0.0199 / 23.71||0.1213 / 0.0351 / 23.10|
|Autoencoder (Joint) ||0.0348 / 0.0032 / 31.83||0.0488 / 0.0071 / 28.28||0.0820 / 0.0187 / 23.96||0.0480 / 0.0072 / 29.06|
|On-Demand Learning ||0.0342 / 0.0031 / 31.98||0.0480 / 0.0069 / 28.39||0.0809 / 0.0184 / 24.05||0.0472 / 0.0070 / 29.17|
|Multi-Tasks Learning ||0.0292 / 0.0025 / 32.98||0.0465 / 0.0068 / 28.46||0.0845 / 0.0195 / 23.74||0.0451 / 0.0069 / 29.58|
|DL-Net (ours)||0.0179 / 0.0014 / 35.28||0.0368 / 0.0054 / 29.53||0.0722 / 0.0166 / 24.56||0.0343 / 0.0054 / 31.38|
Training and test samples. Following Gao and Grauman, we evaluate inpainting and interpolation models on two different datasets: CelebA and SUN397. For CelebA, the first 100,000 images are used to form the training set. Another 2000 images are randomly chosen and split into a validation set and a test set, each with 1000 images. For SUN397, we similarly have 100,000 images for training, and 1000 images each for validation and testing. Input images are uniformly rescaled so that the encoder outputs fixed-size feature maps. Extra randomness is introduced into the test samples owing to pixel removal at possibly any location, so a small test set can be highly biased. To address this issue, we utilize more samples by degenerating each original test image with feasible degradations, and save all these combinations (of corrupted images and ground-truth) locally such that our models are tested on the same sample pairs.
Training process. We initialize the channel-wise fully-connected layer from a random Gaussian distribution and all convolutional layers with the "MSRA" method. In our experiments, weight decay and ADAM with a small initial learning rate are adopted to optimize the reconstruction loss for reference models. The training batch size is set to 100. To keep in line with prior works, we report the ℓ1 loss, the ℓ2 loss, and the PSNR index on our test set for evaluating performance. Images are numerically converted into floating-point format and rescaled before calculating these metrics. We cut the learning rate every 100 epochs once no improvement is observed on the validation set. The reference inpainting model is trained to fill square holes up to a fixed size. After training for 250 epochs, the references reach a plateau, and we achieve PSNRs of 24.45dB and 26.78dB on CelebA, for inpainting at the trained hole size and interpolation when 75% of the pixels are removed, respectively.
Main results. Despite the decent results on a presumed fixed degradation setting, reference models fail when evaluated on other inpainting sizes or interpolation percentages that are not specifically trained for, even though these might be technically easier. For example, a PSNR of only 15.62dB is obtained when we moderately adjust the inpainting size. If the location constraint is further relaxed, it degrades to 15.38dB, which is by no means satisfactory in practice. Detailed quantitative results can be found in Tables II and III (for results on SUN397, please refer to our appendix). This problem has been comprehensively studied in prior work, so we follow their experimental settings and let the inpainting size and removal percentage vary over wide ranges, respectively. We let the inpainting regions shift randomly within a range of pixels around the centroid of images, where most meaningful content exists. (Note that, as one of the few effective methods, our DL-Net generalizes also to irregular holes; we choose the above settings mostly to make fair comparisons with previous works [5, 15].)
Typically, a rigid joint training strategy [2, 5] can mitigate the problem and preserve the restoration performance to some extent. In a nutshell, it attempts to minimize the reconstruction loss over different settings of degradation simultaneously. By learning from more flexible restoration scenarios, network models demonstrate an average PSNR of 32.40dB for inpainting and 31.68dB for interpolation, over all degradation settings on our test sets. Though the learning problem seems more complex, it converges reasonably fast through fine-tuning from the references instead of training from scratch. The training process takes 500 epochs for both tasks, and the learning rate is cut every 200 epochs. We evaluate the ℓ1 loss, ℓ2 loss, and PSNR in specific degradation settings and summarize all the results on CelebA in Tables II and III. See Tables IV and V for results on SUN397. Models are trained with 100,000 images randomly sampled from the dataset and tested on another 1000 images, as in prior work.
We then fine-tune with our DL-Net similarly and test the obtained models under exactly the same circumstances. As compared in Tables II and III, our DL-Net models achieve significantly better results on all test cases. We also compare our method with state-of-the-art solutions dealing with the same problem in the literature, including multi-task learning and on-demand learning. Aware of a very recent progress on multi-task learning using uncertainty to weigh losses, we adapt it to our image restoration tasks, for which each degradation level is treated as a subtask and a weight in the learning objective is accordingly introduced (for inpainting, each region size is regarded as a particular level, while for image interpolation each interval [2.5(l−1)%, 2.5l%) with a positive integer l is regarded as one level; we also tried other possibilities but never got better results). To be more specific, we have 30 extra weights to learn for both multi-task image inpainting and interpolation. The on-demand learning method is configured exactly as suggested in its paper. As expected, the two methods work well on multiple degradations, outperforming the aforementioned joint training strategy in most cases. However, they never surpass our DL-Net method in terms of restoration performance and PSNRs. For qualitative comparisons, see Figure 3. It is further straightforward to absorb an adversarial loss into our DL-Net, either by introducing an adversarial term separately or adding it directly to the overall loss; here we adopt the latter and illustrate its results in Figure 3 as well.
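The uncertainty-based weighting we adapt can be sketched as follows (following the Kendall et al. formulation; the exact parameterization used in our experiments may differ):

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_vars):
    """Uncertainty-based multi-task weighting in the style of Kendall et al.:
    each degradation level i contributes exp(-s_i) * L_i + s_i, where
    s_i = log(sigma_i^2) is a learnable scalar (30 of them in our setting)."""
    task_losses = np.asarray(task_losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * task_losses + log_vars))
```

The additive s_i term prevents the trivial solution of driving every weight exp(-s_i) to zero; at s_i = 0 for all i, the scheme reduces to the unweighted joint objective.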
Ablation studies. Our experimental settings in the prequel, including configurations of the training and test sets, CNN architectures, and loss terms for image inpainting and interpolation, mostly comply with those in prior work. However, we notice that some different settings may also be popular. For example, one may prefer compositing the network prediction with the known pixels of the input rather than using the raw prediction directly. We compare our DL-Nets with competitive models in this scenario, and the restoration results are evaluated below in Tables VI and VII. Apparently, our method still outperforms its competitors on both tasks.
|Autoencoder (Joint) ||0.0117 / 0.0031 / 32.37||0.0332 / 0.0109 / 26.52|
|On-Demand Learning ||0.0117 / 0.0031 / 32.37||0.0331 / 0.0110 / 26.52|
|Multi-Tasks Learning ||0.0117 / 0.0031 / 32.33||0.0341 / 0.0113 / 26.35|
|DL-Net (ours)||0.0117 / 0.0031 / 32.42||0.0328 / 0.0108 / 26.62|
|Autoencoder (Joint) ||0.0204 / 0.0027 / 32.31||0.0512 / 0.0098 / 26.61|
|On-Demand Learning ||0.0203 / 0.0027 / 32.35||0.0505 / 0.0096 / 26.70|
|Multi-Tasks Learning ||0.0204 / 0.0027 / 32.27||0.0532 / 0.0103 / 26.34|
|DL-Net (ours)||0.0171 / 0.0021 / 33.52||0.0439 / 0.0080 / 27.54|
We are aware that there may exist different formulations of the reconstruction loss for training inpainting networks, and it is worth mentioning that our method generalizes to those cases. In addition to the one derived in Eqn. (6), researchers [2, 34] also propose to enlarge inpainting blocks by 7 pixels and penalize the boundary regions more heavily to encourage perceptual consistency. As shown in Table X in our appendix, the superiority of our DL-Net holds.
One may observe, and consider it a bit surprising, that our method even achieves better results than specifically trained references on certain degradation levels. For instance, "Autoencoder (, centered)" is slightly worse than the DL-Net model even when tested with centered regions. Also, the "Autoencoder (Default)" interpolation model does not show superior performance to our DL-Net when 75% of the pixels are indeed removed. We believe this is partially because insightful knowledge extracted from the size and location information of the blank regions facilitates the image restoration process. Two other plausible explanations are that training benefits from more flexible degradations, and that network capacity is probably enlarged owing to the recursive module.
We design deeper references to test whether it is the capacity increase that helps. Stacking four extra convolutional layers that output 64 feature maps, interlaced with ReLUs, at the end (before the output layer), or placing them at the beginning, or allocating two on each side, we obtain three networks that can be much more powerful. We train them in the same way as our "Autoencoder (, centered)" model and test them with centered regions. Their average PSNRs are 32.21dB, 32.44dB, and 32.35dB, respectively, showing that the increased capacity helps only to a limited extent. Experiments are also conducted to demonstrate how the performance of our method varies with the number of gradient steps T. Noting that rigid joint training can be considered a special case of DL-Net with T = 0, we illustrate its training curve together with those of DL-Nets with larger T in Figure 4. Apparently, a larger T indicates faster convergence and suggests better restorations.
Overhead. We evaluate the CPU/GPU runtime of these DL-Nets with various T values and summarize the results in Table VIII. It can be noticed that the introduced computational overhead is very small (at most a 1.4× slow-down on GPU). The number of learnable parameters is also reported. Since our method utilizes a recursive module, it never brings in more learnable parameters. All models discussed above are tested with the open-source TensorFlow framework on an Intel Xeon E5 CPU and an NVIDIA Titan X GPU.
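A minimal timing harness for such measurements can look like the following (our own sketch, CPU-only; GPU timing additionally requires synchronizing the device before reading the clock, since kernels launch asynchronously):

```python
import time
import numpy as np

def time_forward(fn, x, warmup=2, runs=5):
    """Average wall-clock seconds per forward pass of fn on input x.
    A few warm-up calls are made first so that one-time setup costs
    (memory allocation, JIT compilation, caches) are excluded."""
    for _ in range(warmup):
        fn(x)
    start = time.perf_counter()
    for _ in range(runs):
        fn(x)
    return (time.perf_counter() - start) / runs
```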
|Method||CPU Run-time (s)||GPU Run-time (s)||#Parameters|
|Autoencoder (Joint) ||3.824||1.330|
4.2 Single Image Super-Resolution
|Bicubic||33.66 / 30.39 / 28.42||29.02 / 26.57 / 24.78||26.13 / 25.32 / 24.33||-|
|VDSR||37.62 / 33.89 / 31.56||30.65 / 30.36 / 29.88||26.39 / 26.34 / 26.33|
|VDSR (DL-Net, ours)||36.47 / 33.28 / 31.09||36.25 / 33.28 / 31.17||34.97 / 32.89 / 30.99|
|DRRN||37.82 / 34.23 / 31.86||30.65 / 30.30 / 29.83||26.39 / 26.35 / 26.31||91+400|
|DRRN (DL-Net, ours)||37.32 / 33.76 / 31.34||37.39 / 33.83 / 31.41||36.76 / 33.53 / 31.41|
|IDN||37.73 / 34.07 / 31.76||30.64 / 30.32 / 29.88||26.39 / 26.35 / 26.33|
|IDN (DL-Net, ours)||37.07 / 33.49 / 31.24||37.05 / 33.58 / 31.27||36.50 / 33.32 / 31.14|
|SRMDNF ||37.79 / 34.12 / 31.96||37.45 / 34.16 / 31.99||34.12 / 33.02 / 31.77||400+800+4744|
Network architectures. We choose popular SISR CNNs as backbones in the following experiments, including VDSR (2016) , DRRN (2017) , and IDN (2018) . These networks are structurally so different that we can validate whether our method cooperates well with off-the-shelf CNNs that profit from customized architecture designs. See Appendix C for schematic sketches and more details of these backbone networks.
Training and test samples. To be consistent with prior works, we use the well-known dataset of 91 images  for training, along with 400 images from the Berkeley segmentation dataset (BSD-500) , which is also widely used. Our test datasets include Set-5 , Set-14  and the official test set of BSD-500, which consist of 5, 14, and 100 images, respectively. Following Timofte et al. , Dong et al. , etc., we feed only the luminance channel into our networks and simply upscale the two chrominance channels using bicubic interpolation. To take full advantage of the self-similarity of natural images, data augmentation strategies including rotation, rescaling, and flipping are also applied to the training set, as in other works [63, 8, 64, 9, 50]. Likewise, all training images are downsampled and cropped to obtain square-shaped training pairs. For upscaling factors of 2, 3, and 4, these patch pairs have widths (and heights) of 17 / 34, 17 / 51, and 17 / 68, respectively. The above settings mostly follow existing SISR works and help us train the references successfully. For training DL-Nets, we believe larger image patches are more suitable and therefore adjust their widths to 40 / 80, 40 / 120, and 40 / 160.
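The patch-pair preparation above can be sketched roughly as follows. Average pooling stands in for bicubic downsampling purely to keep the sketch dependency-free, and all names and defaults are illustrative:

```python
import numpy as np

def make_patch_pairs(hr_image, scale=2, lr_size=17, stride=17):
    """Crop aligned (LR, HR) training patch pairs from one HR image.
    The HR patch has width lr_size * scale; average pooling stands in
    for bicubic downsampling in this sketch."""
    hr_size = lr_size * scale
    h, w = hr_image.shape[:2]
    pairs = []
    for top in range(0, h - hr_size + 1, stride * scale):
        for left in range(0, w - hr_size + 1, stride * scale):
            hr = hr_image[top:top + hr_size, left:left + hr_size]
            # Block-average each scale x scale cell to form the LR patch.
            lr = hr.reshape(lr_size, scale, lr_size, scale).mean(axis=(1, 3))
            pairs.append((lr, hr))
    return pairs
```

For the ×2 reference setting this yields 17 / 34 pairs as described; training DL-Nets would simply call it with a larger `lr_size`.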
Training process. As before, we adopt the “MSRA” method  to initialize the weights in (up-)convolutional layers. To train the networks reasonably fast, we also use the ADAM algorithm  for stochastic optimization. As suggested, the base learning rate is always set to 0.001 and the two momentum hyper-parameters are set to 0.9 and 0.999. In line with other research works, we calculate the average PSNR between ground-truth images and the generated high-resolution images on the luminance channel (in the YCrCb color space). Since an upscaling layer is introduced, one model is trained for each of the ×2, ×3, and ×4 scenarios. We cut the learning rate every 80 epochs, such that the model performance finally saturates . After training for 300 epochs, our VDSR and DRRN models achieve average PSNRs of 33.89dB and 34.23dB on Set-5, respectively.
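For reference, the luminance-channel PSNR metric used throughout the evaluation can be sketched as below; the BT.601 conversion coefficients are the common convention, and the function names are ours:

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 luminance (the Y of YCbCr) from an RGB image in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.738 * r + 129.057 * g + 25.064 * b) / 256.0

def psnr(x, y, peak=255.0):
    """PSNR in dB between two images of the same shape."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

In the evaluation pipeline, both the ground truth and the network output would be converted with `rgb_to_y` before `psnr` is averaged over a test set.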
Main results. The above training process follows that of existing deep SISR CNNs, in which all input low-resolution images are assumed to be generated directly from high-resolution images through bicubic interpolation. As discussed, such an assumption fails in real-world applications where blurring (or other types of degradation) may also exist. Here we first evaluate the obtained references on low-resolution images downscaled from (possibly) blurry images. We choose the same blur kernels as in , whose widths vary in . Our results (in Table IX) demonstrate that the performance of state-of-the-art SISR models degrades considerably when some subtle distortion is introduced.
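A minimal sketch of this blur-then-downsample degradation is given below. The exact kernel widths are elided in the text above, so the width here is an illustrative value, and simple decimation stands in for the downsampling step:

```python
import numpy as np

def gaussian_kernel(width, size=15):
    """Normalized isotropic Gaussian blur kernel with std `width`."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * width ** 2))
    return k / k.sum()

def degrade(hr, width, scale):
    """Blur a grayscale HR image with a Gaussian kernel, then downsample
    by decimation; a stand-in for the test-time degradations."""
    k = gaussian_kernel(width)
    pad = k.shape[0] // 2
    padded = np.pad(hr, pad, mode="edge")
    blurred = np.zeros_like(hr, dtype=np.float64)
    h, w = hr.shape
    for i in range(h):          # naive convolution, clear over fast
        for j in range(w):
            blurred[i, j] = np.sum(padded[i:i + 2 * pad + 1,
                                          j:j + 2 * pad + 1] * k)
    return blurred[::scale, ::scale]
```

Feeding such `degrade(hr, width, scale)` outputs to a network trained only on bicubic-downsampled inputs reproduces the mismatch studied in Table IX.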
Fortunately, the proposed DL-Net strategy helps to resolve this problem and improves the restoration performance to a remarkable extent. As demonstrated in Table IX, our DL-Nets with VDSR, DRRN, and IDN as backends achieve prominent results for both and , while in the optimal setting their PSNR indices drop only slightly. Results on Set-14 and BSD-500 can be found in Appendix B. We compare our method with a very recent method dedicated to addressing the fixation problem for SISR . Although trained with more images and substantially more parameters (M, while our DRRN: M), its performance still diminishes in the less challenging and scenarios with . Compared with SRMDNF and the references, our DL-Net shows more stable performance across all degradation settings.
While impressive results have been obtained, state-of-the-art image restoration networks are usually trained with self-degenerated images in very restricted degradation settings, normally at one particular level. Such a limitation may hinder restoration networks from being applied to some real-world applications, and there is as of yet no general solution for this. In this paper we propose DL-Net, towards generalizing existing networks to succeed over a spectrum of degradation levels with their training objectives and core architectures retained. The core of our method is to append a subnet designed for disentangling the effects of possible degradations and for minimizing . Experimental results on image inpainting, interpolation, and SISR verify the effectiveness of our method. Future work shall include explorations of more degradation types.
Appendix A More About Image Inpainting
We are aware that there might exist different choices for the formulation of when training inpainting networks. In addition to the one derived in Eq. (5), researchers have also proposed to enlarge the inpainting blocks by 7 pixels and to penalize the boundary regions more heavily to encourage perceptual consistency . Here we also report experimental results on CelebA in this setting. As shown in Table X, the superiority of our method holds consistently.
|Autoencoder (Default)||0.0109 / 0.0083 / 28.37||0.0464 / 0.0375 / 21.41||0.1240 / 0.1159 / 16.46||0.0382 / 0.0236 / 22.78|
|Autoencoder (Joint) ||0.0025 / 0.0005 / 40.33||0.0122 / 0.0034 / 31.94||0.0344 / 0.0116 / 26.26||0.0112 / 0.0028 / 32.90|
|On-Demand Learning ||0.0025 / 0.0005 / 40.28||0.0122 / 0.0034 / 31.94||0.0343 / 0.0115 / 26.26||0.0112 / 0.0028 / 32.87|
|Multi-Tasks Learning ||0.0025 / 0.0005 / 40.36||0.0125 / 0.0035 / 31.77||0.0360 / 0.0123 / 25.94||0.0114 / 0.0029 / 32.74|
|DL-Net (ours)||0.0023 / 0.0005 / 40.97||0.0116 / 0.0031 / 32.36||0.0329 / 0.0108 / 26.59||0.0106 / 0.0026 / 33.33|
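The boundary-weighted loss described above can be sketched as follows; the 10x up-weighting factor and the morphological construction of the boundary band are our own illustrative assumptions, not the exact formulation of the cited work:

```python
import numpy as np

def dilate(mask, r):
    """Binary dilation of a {0,1} mask with a (2r+1)x(2r+1) square."""
    h, w = mask.shape
    padded = np.pad(mask, r)
    out = np.zeros_like(mask)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + 2 * r + 1, j:j + 2 * r + 1].max()
    return out

def boundary_weights(hole_mask, border=7, w_border=10.0):
    """Weight map: 1 inside the hole, raised to `w_border` within
    `border` pixels of the hole boundary (assumed illustrative factor)."""
    erosion = 1 - dilate(1 - hole_mask, border)   # hole interior
    band = hole_mask - erosion                    # hole pixels near boundary
    return hole_mask + (w_border - 1.0) * band

def weighted_inpainting_loss(pred, target, hole_mask, border=7, w_border=10.0):
    """Mean squared error over the hole with boundary up-weighting."""
    w = boundary_weights(hole_mask, border, w_border)
    return float(np.sum(w * (pred - target) ** 2) / max(np.sum(hole_mask), 1))
```

Intuitively, the up-weighted band pushes the network to match the surrounding context exactly where the fill-in meets known pixels, which is where seams are most visible.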
Appendix B SISR on Set-14 and BSD-500
We report SISR results on Set-14 and BSD-500 (of which only the 100 test images are used in the evaluation) in this section. All training and test policies are kept the same as on Set-5, and the results are summarized in Tables XI and XII. Clearly, better results can be obtained using our DL-Net strategy when some additional blurring is inevitable. It is also worth noting that, with our method, similar PSNR performance can be obtained under various levels of SISR degradation. We do not have results for SRMDNF on these two popular datasets, so only the baselines are taken for comparison.
|VDSR||33.12 / 29.88 / 28.15||27.93 / 27.58 / 27.09||24.61 / 24.58 / 24.55|
|VDSR (DL-Net, ours)||32.49 / 29.53 / 27.86||32.28 / 29.58 / 27.93||30.89 / 29.30 / 27.74|
|DRRN||33.39 / 30.03 / 28.32||27.93 / 27.53 / 27.08||24.61 / 24.59 / 24.54||91+400|
|DRRN (DL-Net, ours)||33.10 / 29.85 / 28.06||33.18 / 29.97 / 28.11||32.19 / 29.77 / 28.01|
|IDN||33.28 / 29.97 / 28.21||27.93 / 27.54 / 27.12||24.61 / 24.59 / 24.56|
|IDN (DL-Net, ours)||32.83 / 29.71 / 27.99||32.89 / 29.73 / 28.02||32.14 / 29.52 / 27.92|
|VDSR||31.96 / 28.87 / 27.34||27.66 / 27.24 / 26.67||24.97 / 24.94 / 24.88|
|VDSR (DL-Net, ours)||31.49 / 28.67 / 27.20||31.44 / 28.71 / 27.25||30.09 / 28.57 / 27.20|
|DRRN||32.13 / 29.02 / 27.44||27.66 / 27.21 / 26.65||24.97 / 24.94 / 24.87||91+400|
|DRRN (DL-Net, ours)||31.98 / 28.88 / 27.32||32.10 / 28.94 / 27.36||31.35 / 28.88 / 27.34|
|IDN||32.05 / 28.94 / 27.40||27.66 / 27.22 / 26.67||24.97 / 24.94 / 24.88|
|IDN (DL-Net, ours)||31.83 / 28.82 / 27.26||31.93 / 28.89 / 27.30||31.20 / 28.83 / 27.30|
Appendix C Network Architectures
In this section, we elaborate on the network architectures introduced in the main body of our paper. We first describe our inpainting and interpolation autoencoder. The encoder and decoder parts are each comprised of four (up-)convolutional layers, all followed by batch normalization  and nonlinear activations. A channel-wise fully-connected layer that propagates information only within feature maps is used to connect the two parts. While leaky ReLUs are used in the encoder, plain ReLUs are adopted in the decoder for nonlinearity, just like in the context encoder network .
To be more specific, its first half (i.e., the encoder part) consists of: conv1 (kernel: , stride: 2), conv2 (kernel: , stride: 2), conv3 (kernel: , stride: 2) and conv4 (kernel: , stride: 2). The channel-wise fully-connected layer consists of 512 filters. The second half (i.e., the decoder) consists of four convolutional layers with stride 1/2 (or equivalently, deconvolutions with stride 2), structurally symmetric to the encoder. Batch normalization is applied before the (leaky) ReLU, as prescribed in the papers [5, 2]. A graphical demonstration of the network architecture can be found in .
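Since the channel-wise fully-connected layer is less common than ordinary dense layers, a small sketch may help. Each of the C feature maps gets its own (H·W × H·W) weight matrix, so information propagates spatially within a channel but never across channels; the class name and initialization are our own illustrative choices:

```python
import numpy as np

class ChannelWiseFC:
    """Fully-connected layer applied independently per channel, as used
    to connect the encoder and decoder (sketch; names are illustrative)."""
    def __init__(self, c, h, w, rng=None):
        rng = np.random.default_rng(rng)
        # One (h*w x h*w) weight matrix per channel, scaled for stability.
        self.w = rng.standard_normal((c, h * w, h * w)) * (h * w) ** -0.5
        self.shape = (c, h, w)

    def __call__(self, x):
        c, h, w = self.shape
        flat = x.reshape(c, h * w)
        # Per-channel matrix-vector product: no cross-channel mixing.
        out = np.einsum("cij,cj->ci", self.w, flat)
        return out.reshape(c, h, w)
```

Compared with a full dense layer over all C·H·W activations, this reduces the parameter count by a factor of C² while still letting, e.g., the left half of a feature map inform predictions in its right half.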
For the three SISR networks adopted in our paper, we provide schematic sketches of their architectures in Figure 5. VDSR is a 20-layer CNN that takes advantage of residual learning  and, for the first time, builds a (global) skip connection from its input to the output. DRRN aims to further deepen the network and adopts residual learning in both global and local manners. To make the computation more efficient, we adapt them to manipulate image features at a low-resolution level  by modifying the global skip connection to a bicubic upscaling layer, as in IDN . The widths of the convolutional layers in VDSR and DRRN are uniformly set to 64 and 128, respectively, as suggested in the papers. Specifically, for DRRN, we adopt “DRRN_B1_U9” and further introduce a concat layer that merges sequential results from the recursive block to boost its performance. The concatenation is performed every three recursive steps, which means useful knowledge can be distilled from feature maps in total. The conv4 layer of our DRRN conducts convolutions and also outputs 64 feature maps. In addition, inspired by IDN , we use leaky ReLU with a negative slope of 0.05 instead of the original ReLU in DRRN to prevent the so-called “dying ReLU” problem. For IDN, our configurations mostly follow those suggested in the paper . Its final performance with might be slightly lower than that reported in its paper , partially because we only use the loss for simplicity of training.
Appendix D More Discussions and Comparisons
Our DL-Net is related to some previous works, which we briefly introduced in Section 2; here we give more detailed discussions. It has long been known that deep CNNs extract contextual information from ground-truth high-quality images, and natural image priors are introduced to encourage smooth textures. However, only recently have we become aware of the prior knowledge brought in by the deep architecture itself . Considering the implicit priors captured by the network, Ulyanov et al. propose to directly minimize a task-dependent likelihood in pursuit of decent image restoration performance.
Our DL-Net encourages outputs that are able to reproduce the corresponding inputs (i.e., minimize the likelihood), which might seem similar to Ulyanov et al.'s deep image prior. In fact, the superiority of our method also rests on rich supervision from numerous real images and insightful knowledge extracted from the degradations. Benefiting from external data and the degradation information, our DL-Net outperforms some other supervised methods and its computational complexity is relatively low. Although Ulyanov et al.'s method also applies to restoration with multiple degradations, its performance is only comparable with the supervised state-of-the-art, and it requires thousands of iterations to run on a single test image.
Our method is also related to AffGAN , in which the amortized MAP inference is explored for SISR and a projection layer is introduced to guarantee that its likelihood-based constraints are explicitly satisfied. Such a projection layer advocates outcomes strictly fulfilling its implicit assumptions, and a low likelihood loss is naturally obtained. However, AffGAN focuses on simple SISR problems whose inputs are down-sampled only through a presumed bicubic interpolation. We stress that it does not apply to our task where multiple blurring levels exist, mainly because a single or even several operations can no longer guarantee the constraints in our setting with ().
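A projection layer of this kind can be sketched for a known linear downsampling operator. The sketch below uses average pooling as the downsampler and nearest-neighbor upsampling as its right inverse; with those choices the projected output reproduces the low-resolution input exactly, which is precisely what breaks once the blur is unknown:

```python
import numpy as np

def downsample(y, s):
    """Average-pool downsampling by factor s (our stand-in for the
    assumed degradation operator)."""
    h, w = y.shape
    return y.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def upsample(x, s):
    """Nearest-neighbor upsampling; a right inverse of average pooling:
    downsample(upsample(x, s), s) == x."""
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def project(y, x, s):
    """Affine projection of a candidate HR image y onto the set
    {y : downsample(y) == x}, by adding back the upsampled residual
    (a sketch of an AffGAN-style projection layer)."""
    return y + upsample(x - downsample(y, s), s)
```

When the true degradation also contains an unknown blur, no single fixed `project` satisfies the constraint for every input, which motivates the learned recursive refinement in DL-Net instead.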
Appendix E Qualitative Results for SISR
We also provide some qualitative results for SISR under multiple degradation settings; see Figure 6 for more details. We illustrate the luminance channel only, to enable comparison between direct outputs of different network models. It can be seen that our model shows perceptually similar results under different degradation settings, while the performance of the original DRRN diminishes significantly under and .
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in ECCV, 2014.
-  D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in CVPR, 2016.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing (TIP), vol. 26, no. 7, pp. 3142–3155, 2017.
-  R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in ECCV, 2016.
-  R. Gao and K. Grauman, “On-demand learning for deep image restoration,” in ICCV, 2017.
-  R. Köhler, C. Schuler, B. Schölkopf, and S. Harmeling, “Mask-specific inpainting with deep neural networks,” in German Conference on Pattern Recognition, 2014.
-  C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, “High-resolution image inpainting using multi-scale neural patch synthesis,” in CVPR, 2017.
-  J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in CVPR, 2016.
-  Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in CVPR, 2017.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in ICLR, 2014.
-  C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in ICLR, 2017.
-  K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in CVPR, 2018.
-  Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in ICML, 2009.
-  L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann, “Self-paced learning with diversity,” in NIPS, 2014.
-  A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in CVPR, 2018.
-  J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017.
-  Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in CVPR, 2018.
-  G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro, “Image inpainting for irregular holes using partial convolutions,” in ECCV, 2018.
-  Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in ICCV, 2015.
-  J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, “Sun database: Exploring a large collection of scene categories,” International Journal of Computer Vision (IJCV), vol. 119, no. 1, pp. 3–22, 2016.
-  M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in BMVC, 2012.
-  R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International conference on curves and surfaces, 2010.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in ICCV, 2001.
-  S. Lefkimmiatis, “Universal denoising networks: A novel cnn architecture for image denoising,” in CVPR, 2018.
-  M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in SIGGRAPH, 2000.
-  M. Bertalmio, L. Vese, G. Sapiro, and S. Osher, “Simultaneous structure and texture image inpainting,” IEEE Transactions on Image Processing (TIP), vol. 12, no. 8, pp. 882–889, 2003.
-  I. Drori, D. Cohen-Or, and H. Yeshurun, “Fragment-based image completion,” in ACM Transactions on graphics (TOG), vol. 22, no. 3, 2003, pp. 303–312.
-  N. Komodakis, “Image completion using global optimization,” in CVPR, 2006.
-  C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, “Patchmatch: A randomized correspondence algorithm for structural image editing,” ACM Transactions on Graphics (TOG), vol. 28, no. 3, p. 24, 2009.
-  S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Globally and locally consistent image completion,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 107, 2017.
-  R. A. Yeh, C. Chen, T.-Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do, “Semantic image inpainting with deep generative models.” in CVPR, 2017.
-  Y. Li, S. Liu, J. Yang, and M.-H. Yang, “Generative face completion,” in CVPR, 2017.
-  J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting with contextual attention,” in CVPR, 2018.
-  J. S. Ren, L. Xu, Q. Yan, and W. Sun, “Shepard convolutional neural networks,” in NIPS, 2015.
-  F. Heide, W. Heidrich, and G. Wetzstein, “Fast and flexible convolutional sparse coding,” in CVPR, 2015.
-  V. Papyan, Y. Romano, M. Elad, and J. Sulam, “Convolutional dictionary learning via local processing.” in ICCV, 2017.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in CVPR, 2018.
-  H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in CVPR, 2004.
-  W. Fan and D.-Y. Yeung, “Image hallucination using neighbor embedding over visual primitive manifolds,” in CVPR, 2007.
-  R. Timofte, V. De Smet, and L. Van Gool, “Anchored neighborhood regression for fast example-based super-resolution,” in ICCV, 2013.
-  J. Yang, J. Wright, T. Huang, and Y. Ma, “Image super-resolution as sparse representation of raw image patches,” in CVPR, 2008.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Transactions on Image Processing (TIP), vol. 19, no. 11, pp. 2861–2873, 2010.
-  W. Dong, L. Zhang, G. Shi, and X. Wu, “Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization,” IEEE Transactions on Image Processing (TIP), vol. 20, no. 7, pp. 1838–1857, 2011.
-  K. Zhang, X. Gao, D. Tao, and X. Li, “Multi-scale dictionary for single image super-resolution,” in CVPR, 2012.
-  C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in ECCV, 2014.
-  S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang, “Convolutional sparse coding for image super-resolution,” in ICCV, 2015.
-  X. Mao, C. Shen, and Y.-B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” in NIPS, 2016.
-  C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár, “Amortised map inference for image super-resolution,” in ICLR, 2017.
-  Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in CVPR, 2018.
-  W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” International Journal of Computer Vision (IJCV), vol. 40, no. 1, pp. 25–47, 2000.
-  L. C. Pickup, S. J. Roberts, and A. Zisserman, “A sampled texture prior for image super-resolution.” in NIPS, 2004.
-  K. I. Kim and Y. Kwon, “Single-image super-resolution using sparse regression and natural image prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 32, no. 6, pp. 1127–1133, 2010.
-  Y.-W. Tai, S. Liu, M. S. Brown, and S. Lin, “Super resolution using edge prior and single image detail synthesis,” in CVPR, 2010.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
-  J. Pan, Y. Liu, J. Dong, J. Zhang, J. Ren, J. Tang, Y.-W. Tai, and M.-H. Yang, “Physics-based generative adversarial models for image restoration and beyond,” arXiv preprint arXiv:1808.00605, 2018.
-  K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for image restoration,” in CVPR, 2017.
-  U. Schmidt and S. Roth, “Shrinkage fields for effective image restoration,” in CVPR, 2014.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
-  R. T. d. Combes, M. Pezeshki, S. Shabanian, A. Courville, and Y. Bengio, “On the learning dynamics of deep neural networks,” arXiv preprint arXiv:1809.06848, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in ICCV, 2015.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in OSDI, 2016.
-  Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, “Deep networks for image super-resolution with sparse prior,” in ICCV, 2015.
-  C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in ECCV, 2016.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.