Deep Likelihood Network for Image Restoration with Multiple Degradations

Yiwen Guo1, 2   Wangmeng Zuo3   Changshui Zhang2  Yurong Chen1
1 Intel Labs China, 2 Tsinghua University, 3 Harbin Institute of Technology
{yiwen.guo, yurong.chen}@intel.com  wmzuo@hit.edu.cn  zcs@mail.tsinghua.edu.cn
Abstract

Convolutional neural networks have been proven very effective in a variety of image restoration tasks. Most state-of-the-art solutions, however, are trained using images with a single particular degradation level, and can deteriorate drastically when being applied to some other degradation settings. In this paper, we propose a novel method dubbed deep likelihood network (DL-Net), aiming at generalizing off-the-shelf image restoration networks to succeed over a spectrum of degradation settings while keeping their original learning objectives and core architectures. In particular, we slightly modify the original restoration networks by appending a simple yet effective recursive module, which is derived from a fidelity term for disentangling the effect of degradations. Extensive experimental results on image inpainting, interpolation and super-resolution demonstrate the effectiveness of our DL-Net.

1 Introduction

Over the past few years, deep convolutional neural networks (CNNs) have advanced the state-of-the-art of a variety of image restoration tasks including single image super-resolution (SISR) [10], inpainting [40], denoising [61], colorization [64], etc. Despite the impressive quantitative metrics and perceptual quality that CNNs have achieved, most state-of-the-art solutions are trained with pairs of manually (de)generated images and their anticipated restoration outcomes based on implicit assumptions about the input.

In general, image degradations are restricted to a presumed level throughout the datasets [16], e.g., a pre-defined shape, size and even location for inpainting regions [29, 52], or a designated downsampling strategy from high-resolution images [10, 26, 46]. With the robustness of deep networks having been criticized for years [45, 59], such specification of the input domain entails severe over-fitting in the obtained CNN models [16, 63]. That is, they can succeed when the assumptions are fulfilled and test degradations are limited to the same particular setting as in training, but become problematic in practical applications in which multiple degradations and more flexible restorations are required.

A few endeavors have been made to address this issue. One straightforward solution is to jointly learn to restore images degenerated through different settings of degradations. Unfortunately, such a naïve approach mitigates the issue only to a certain extent, and may still fail if the degradations vary over a substantial range of difficulties, as demonstrated in previous literature [16, 63]. Deep networks are believed to learn much better if the training samples and loss weights corresponding to possible degradation levels are allocated in a more reasonable manner. Curriculum learning [3] and self-paced learning [23] approaches are hence proposed to control the sampling strategies and allocate more training samples to more challenging degradation levels with relatively lower PSNRs [16]. Multi-task learning can be incorporated to adjust loss weightings similarly [25]. Though enlightening, such methods suffer from several drawbacks in practice. First, the "level" of degradations has to be appropriately discretized, but the guideline for doing so is unclear when the relative difficulties are not evident per se. Second, it is non-trivial to generalize them to CNNs whose objectives comprise perceptual [24] and adversarial losses [31], since PSNR is not a reliable criterion for them [24, 7]. Third, an extra validation set is required in every training epoch in [16].

There are also several recent works that propose to tackle this issue by introducing customized architectures [63, 34], with for instance modified convolution operations for image inpainting [34]. These architectures are specifically designed and probably limited to certain tasks. In this paper, we aim at directly generalizing off-the-shelf image restoration networks, which might originally have been presented to restore images at only one particular level, to succeed over a spectrum of degradation settings with their learning objectives and core architectures retained. In order to achieve this, we propose deep likelihood network (DL-Net): a novel method which inherently disentangles¹ the effects of possible degradations and enforces high likelihood overall. Its computation procedure is cast into a recursive module and can be readily incorporated into any network, making it highly general and scalable. Another benefit of our method, in comparison with some previous ones [16, 25], comes from the degradation itself, whose information may facilitate the image restoration process as well [63].

¹ With our method, input images with different settings of degradations are processed distinctively due to their different degradation kernels and other hyper-parameters, which is what we mean by "disentangle".

We primarily focus on three image restoration tasks, i.e., inpainting, interpolation, and SISR, in which image blurring is also introduced. Our main contributions are:

We propose a novel and general method to generalize off-the-shelf image restoration CNNs to succeed over a spectrum of image degradations.

By encouraging high likelihood in the architecture, our method utilizes information from different degradations to facilitate the restoration process.

Our method is computationally efficient (as the introduced overhead is small), easy to implement, and can be readily incorporated into different networks.

The efficacy of our DL-Net is verified on several benchmark datasets: CelebA [35], SUN397 [51], Set-5 [6], Set-14 [58] and BSD-500 [37]. It outperforms previous state-of-the-art methods in various test cases.

2 Related Works

Image restoration.

Typical restoration tasks include SISR, inpainting, denoising and deblurring, just to name a few. We mainly focus on inpainting (also known as image completion or hole-filling), interpolation and SISR. Since many denoising CNNs have already demonstrated their effectiveness under multiple noise settings [61, 32], we opt to cover other critical tasks where the problem arises.

Image inpainting [4], the task of making visual predictions for missing regions, is required when human users attempt to erase certain regions of an image. Early solutions predict information inside the regions by exploiting the isophote direction field [4, 5] and texture synthesis technologies [13, 30, 2]. Deep CNNs were later introduced to learn semantic contents in an end-to-end manner. Although initial efforts show decent results only on small regions [29], encoder-decoder-based architectures and adversarial learning are leveraged to make them work reasonably well on very large holes [40]. For better perceptual quality, more sophisticated loss terms in favor of content and texture consistencies are also developed [22, 52, 56, 33, 57]. Despite all these impressive improvements, popular methods analyze inpainting mainly with pre-defined region size, shape and even location, and a deteriorating effect has been reported when multiple degradations exist [16]. Bringing in more holistic and local loss terms may mitigate the problem, but does not resolve it. Recently, Ren et al. [42] and Liu et al. [34] proposed Shepard convolution and partial convolution for inpainting with irregular holes, respectively. These works mostly focus on convolution design and are orthogonal to ours.

Image interpolation [16] is a similar task and is sometimes also referred to as another type of inpainting with Bernoulli masks. Sparse coding networks are usually adopted to cope with it [20, 39]. Implicit priors captured by the CNN architecture itself have also been explored [49].

SISR is the task of generating a high-resolution digital image from a given low-resolution one. Based on neighbor embedding and sparse coding technologies, many traditional methods try to preserve (example-based) neighborhood structures [8, 14, 6, 48] or (dictionary-based) perceptual representations [54, 55, 12, 60, 48]² in different image subspaces. However, the majority of them cannot be computed efficiently [53], and furthermore, their expressive power is rather limited. To this end, researchers started pursuing approximate but explicit and powerful mappings from the input to the target subspace [10, 18], for which CNNs are appropriate candidates. Being able to extract contextual information, networks with deeper architectures show even higher PSNRs [26, 36, 46, 44, 21]. However, it has also been reported that current state-of-the-art SISR networks fail to generalize to scenarios with possible blurring [63].

² They are often sparse representations with over-complete dictionaries, and can also be viewed as sparse priors.

Likelihood and image priors.

A key ingredient of image restoration is to estimate the probability that one outcome is capable of generating the given input through some sort of degradation, i.e., the likelihood. In principle, such tasks are heavily ill-posed. Solving them sometimes requires prior knowledge of the low-quality input and high-quality output image subspaces, and learning-based methods are typically used to leverage such knowledge effectively. Generally, the likelihood and priors can be integrated using the maximum a posteriori (MAP) or Bayesian frameworks [15]. This line of research used to provide state-of-the-art results before the unprecedented success of deep learning.

Early works advocate the Gaussian process prior [15], Huber prior, sampled texture prior [41], etc. The edge prior and some other types of natural image priors have been explored to achieve smoother results [27, 47]. Recently, image priors derived from generative adversarial networks [17] have achieved great success [38, 44]. Our DL-Net exploits the likelihood and also priors if necessary. Instead of optimizing them directly in the learning objective, we reformulate the procedure as a recursive module such that it can be directly incorporated into any given architecture. Our method is closely related to deep image prior [49], in the sense that both schematically encourage outputs that are able to reproduce the corresponding inputs and both can be applied to various CNNs (see Section E in our supp. for discussions). It also lies in the category of MAP-inference-guided discriminative learning [62].

3 Deep Likelihood Network

Image restoration tasks have been intensively studied for decades. Before the advent of deep learning, conventional MAP-based methods hinging on likelihood and image prior modeling had been adopted for a variety of image restoration tasks and achieved state-of-the-art results. In this work, we advocate the MAP-based formulation and manage to introduce its fidelity term into off-the-shelf restoration CNNs for handling multiple settings of degradations. In the following two subsections, we compare different formulations for image restoration, and show that simply adding the fidelity term to a discriminative learning objective cannot enhance a restoration network under multiple degradation scenarios. Then, in Section 3.3, we present our method which explicitly incorporates the MAP-inspired module with degradation information into the network architectures.

3.1 Problem Formulations: MAP & Deep Learning

In this work, we consider a group of image restoration tasks whose degradation process can be formulated as

$$y = \big(D(x)\big)\!\downarrow_s +\, n, \qquad (1)$$

where $x$ and $y$ denote the clean and degraded images, respectively, $\downarrow_s$ indicates the downsampling operator with an integer factor $s$, $n$ denotes the additive white Gaussian noise with standard deviation $\sigma$, and $D(\cdot)$ stands for the degradation operator. For image inpainting and interpolation, we have $s = 1$, $\sigma = 0$, and $D(x) = m \odot x$, where $\odot$ denotes the entry-wise multiplication operator and $m$ denotes a binary mask. For SISR, we have $D(x) = k \otimes x$, where $\otimes$ denotes the convolution and $k$ denotes the degradation (e.g., blur) kernel. In terms of multiple degradations, we aim to train a single model to handle the concerned restoration tasks over a spectrum of degradation settings.
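To make the formulation concrete, below is a minimal NumPy sketch of the two degradation operators described above. The function names, the use of plain decimation for $\downarrow_s$, and the `nearest` boundary handling are illustrative assumptions rather than the exact pipeline used in the experiments.

```python
import numpy as np
from scipy.ndimage import convolve

def degrade_inpainting(x, mask):
    """Inpainting / interpolation: y = m (.) x, with s = 1 and sigma = 0."""
    return mask * x

def degrade_sisr(x, kernel, s, sigma=0.0, rng=None):
    """SISR: y = (k (x) x) downsampled by factor s, plus AWGN of std sigma."""
    blurred = convolve(x, kernel, mode='nearest')   # k (x) x
    y = blurred[::s, ::s]                           # plain decimation as a stand-in for the downsampler
    if sigma > 0:
        rng = rng or np.random.default_rng(0)
        y = y + rng.normal(0.0, sigma, size=y.shape)
    return y
```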

The MAP-based formulation of image restoration generally involves a fidelity term for modeling the likelihood of degradation and a regularization term for modeling the image prior. Given the degraded observation $y$ and the degradation setting (i.e., $D$, $s$, $\sigma$), the fidelity term can be defined as

$$\mathcal{F}(x;\, y) = \frac{1}{2}\,\big\| \big(D(x)\big)\!\downarrow_s -\, y \big\|_2^2. \qquad (2)$$

By further trading it off against the regularization term $\Omega(x)$, the problem can be formulated as

$$\min_x \ \mathcal{F}(x;\, y) + \lambda\,\Omega(x), \qquad (3)$$

where $\lambda$ is the regularization parameter.

Given the regularizer $\Omega$ and an optimizer, the MAP framework is naturally flexible in handling multiple degradations. However, existing regularizers are generally hand-crafted, non-convex, and insufficient for characterizing image priors. Furthermore, due to the non-convexity, the optimization algorithm often cannot find a satisfactory solution of Eqn. (3) [43]. All these make recent works resort to deep CNNs for improving restoration performance.

Given one specific degradation setting with fixed $D$, $s$, and $\sigma$, a direct input-output mapping $f$ is learned from a training set of degraded-clean image pairs $\{(y_i, x_i)\}$ to generate $\hat{x} = f(y)$. In spite of its success [10, 40, 61], the learned CNN model is tailored to the specific degradation setting, and generalizes poorly when applied to another degradation with a different $D$, $s$, or $\sigma$. As such, we aim at developing a deep image restoration method for handling multiple degradations with both the flexibility of MAP and the (accuracy and efficiency) merits of off-the-shelf deep networks. One specious option seems to be adding the fidelity term to the learning objective of CNNs. Yet, as will be shown in Section 3.2, it does not succeed.

3.2 Likelihood in Learning Objective?

It has been discussed in prior works [16, 63] that rigid joint training supported by data augmentation provides limited assistance in generalizing over degradation settings. The results in Section 4.1 echo this claim. In this subsection, we further show that the network performance cannot even be improved by adding the fidelity term to the learning objective. Taking multiple degradations into account, an augmented training set can be represented as $\{(y_i, x_i, D_i, s_i, \sigma_i)\}$, where each sample is allowed to have its own degradation setting; therefore we might have $D_i \neq D_j$ if required. Using the augmented set, we can train an autoencoder network to generate $\hat{x}_i = f(y_i)$ for inpainting, by minimizing the reconstruction loss $\mathcal{L}(\hat{x}_i, x_i)$. The method is named Autoencoder (Joint) and detailed explanations of our experimental settings will be given in Section 4.1.

To incorporate MAP into the deep learning solutions, we modify the original fidelity term in Eqn. (2) into $\mathcal{F}(\hat{x}_i;\, y_i)$, which may serve the role of "self-supervision" and be combined with $\mathcal{L}$ for image restoration. The most direct combination is to define $\mathcal{L} + \eta\,\mathcal{F}$, with a trade-off weight $\eta$, as the learning objective, and one may expect such a scheme, dubbed Naïve Likelihood in this paper, to endow the learned model with some ability of handling multiple degradations.
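For clarity, a minimal sketch of this "Naïve Likelihood" objective is given below; `model` and `degrade` are hypothetical callables (the restoration CNN and the known per-sample degradation), and mean-squared losses are assumed for both terms.

```python
import torch

def naive_likelihood_loss(model, degrade, y, x, weight=1.0):
    """L(x_hat, x) + weight * F(x_hat; y): reconstruction loss plus the
    fidelity ("self-supervision") term added directly to the objective."""
    x_hat = model(y)
    recon = torch.mean((x_hat - x) ** 2)              # reconstruction loss L
    fidelity = torch.mean((degrade(x_hat) - y) ** 2)  # fidelity term F
    return recon + weight * fidelity
```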

Method | ℓ1 loss / ℓ2 loss / PSNR | ℓ1 loss / ℓ2 loss / PSNR (centered)
Autoencoder (Joint) [16] | 0.0303 / 0.0041 / 30.82 | 0.0293 / 0.0036 / 31.53
Naïve Likelihood () | 0.0305 / 0.0042 / 30.81 | 0.0294 / 0.0036 / 31.51
Naïve Likelihood () | 0.0306 / 0.0042 / 30.81 | 0.0296 / 0.0037 / 31.50
Naïve Likelihood () | 0.0305 / 0.0042 / 30.81 | 0.0294 / 0.0036 / 31.52

Table 1: Image inpainting results of the autoencoder trained with $\mathcal{L}$ alone and with the fidelity term added. It can be seen that directly adding $\mathcal{F}$ to the learning objective (i.e., "Naïve Likelihood" with various trade-off weights) does not work in practice.

To illustrate the effect of the fidelity term in the objective, we present quantitative results in Table 1. The training curves are further provided in Figure 1. The experiment is performed on image inpainting, and no adversarial loss is used here to ensure reliable PSNRs for evaluating the restoration performance [7]. From Table 1 and Figure 1, we can observe that, for any trade-off weight, the "Naïve Likelihood" method performs similarly to "Autoencoder (Joint)", which minimizes only $\mathcal{L}$ during training. As discussed in [63], the major reason that the fidelity term is less helpful in handling multiple degradations might be the lack of degradation information in training deep networks.

Figure 1: The training curves of the "Autoencoder (Joint)" and "Naïve Likelihood" methods with various trade-off weights. Directly adding the fidelity term to the learning objective does not help in practice.

3.3 Likelihood Assured by A Recursive Module

In this subsection, we present our method for generalizing current image restoration networks to handle a spectrum of degradation settings. As depicted in Figure 2(a), given an off-the-shelf image restoration network (i.e., the reference) $f$, the output of one of its middle layers is denoted by $z = g(y)$ and the final output by $\hat{x} = h(z)$, which means the reference network is regarded as the composition $f = h \circ g$ of two functions, or in other words two sub-networks.

Partially inspired by deep image prior [49], we substitute $x$ with $h(z)$ in the fidelity term and eliminate the regularization term, since some implicit prior can be characterized by the sub-network $h$. In contrast to [49], we assume there exists a unified $h$ suitable for all possible inputs, but for each input a specific $z$ should be learned with an optimization algorithm. In this regard, the formulation in Eqn. (3) can be rewritten as

$$\min_z \ \mathcal{F}\big(h(z);\, y\big) = \min_z \ \frac{1}{2}\,\big\| \big(D(h(z))\big)\!\downarrow_s -\, y \big\|_2^2. \qquad (4)$$

Since the reference network is differentiable, the objective function in (4) is also differentiable w.r.t. $z$. Therefore, we can simply use a stochastic gradient descent algorithm to pursue the optimal $z$. We choose ADAM [28] for this task and keep all of its hyper-parameters except for the learning rate as default; see (5) for more details. Such a computation procedure can be cast into a recursive module and incorporated directly into any network. At the $k$-th iteration, an increment $\Delta z_k$ is calculated and added to the current estimate for approaching the optimum, that is, $z_k = z_{k-1} + \Delta z_k$. In theory, as long as $z_0 = g(y)$ is set, such learning dynamics target architectures superior to the reference [9]. Once an expected $z_K$ is gained as described, an overall loss can be minimized as usual, by evaluating the difference between $h(z_K)$ and $x$.

The proposed method is dubbed deep likelihood network (DL-Net), and Figure 2(b) illustrates its main steps. Considering that the gradient of (4) w.r.t. the current estimate of $z$ is stacked into the computation graph, higher-order gradients are required in the training process, and automatic differentiation in current learning frameworks handles them efficiently. A set of estimates $z_1, \ldots, z_K$ are sequentially calculated, but only the last one will be taken as output. In comparison with the reference network, all our modifications in Figure 2(b) occur on the basis of $z$. A feedback connection is established for estimating $z$ with $h$, and the computations involved for deriving $\Delta z_k$ from the gradient are:

$$m_k = \beta_1 m_{k-1} + (1-\beta_1)\,\nabla_z\mathcal{F}\big(h(z_{k-1});\, y\big), \qquad v_k = \beta_2 v_{k-1} + (1-\beta_2)\,\big(\nabla_z\mathcal{F}\big(h(z_{k-1});\, y\big)\big)^2,$$
$$\Delta z_k = -\,\alpha\,\frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}, \quad \text{with } \hat{m}_k = \frac{m_k}{1-\beta_1^k},\ \hat{v}_k = \frac{v_k}{1-\beta_2^k}, \qquad (5)$$

where $\alpha$ is the learning rate and $(\beta_1, \beta_2, \epsilon)$ are kept as the ADAM defaults.
Figure 2: Illustration of (a) the reference network and (b) our DL-Net. In each recursive step of DL-Net, a tensor $\Delta z_k$ is calculated and added for approaching the final $z$. The restoration result can be iteratively estimated, but only the one at iteration $K$ will be taken as the final output.

The crux of DL-Net is to incorporate an MAP-inspired module that is able to assist the network in disentangling the effects of different degradations. It enforces the outputs to iteratively achieve a lower fidelity loss $\mathcal{F}$, in a similar vein to AffGAN [44]. Our method may as well be employed in classical learning-based methods where deep networks are absent. Moreover, it is natural to further incorporate other desirable priors into (4). One extra benefit that such an MAP-inspired method should have, as emphasized in [63], comes along with insightful knowledge extracted from the degradation process, which might be critical for facilitating the learning. For instance, when the location, size and shape of inpainting regions are known, inpainting networks are likely to learn automatically to pay more attention to missing regions or to take a more discounted loss on them [57].

Since a gradient-descent-derived module is utilized, two more hyper-parameters are introduced: the total number of gradient descent steps $K$ and the learning rate $\alpha$ for (4). We fix both unless otherwise clarified in the paper. Obviously, the depth of $h$ has a major impact on the computational cost of our DL-Net. We found that a light $h$ with one single convolution layer (along with the nonlinear activation picked up from its preceding layer) works well enough in practice, so the extra running time is negligible in comparison with $g$, which is much deeper. That is, $z$ is chosen as the last hidden representation (before ReLU) of the reference network and thus the increase in computational cost is small. This setting is adopted across all the experiments in this work.
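To summarize the recursive module, a minimal PyTorch-style sketch of one DL-Net forward pass is given below. The sub-networks `g` and `h`, the `degrade` callable and the default ADAM constants are our own illustrative assumptions; `create_graph=True` reflects the need for higher-order gradients during training mentioned above.

```python
import torch

def dl_net_forward(g, h, degrade, y, K=3, lr=0.01,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One DL-Net forward pass: f = h o g, with K recursive ADAM-style
    updates of the hidden estimate z lowering the fidelity term F(h(z); y).
    Assumes training mode, so z = g(y) is part of the autograd graph."""
    z = g(y)                                  # initial hidden estimate z_0
    m = torch.zeros_like(z)                   # ADAM first moment
    v = torch.zeros_like(z)                   # ADAM second moment
    for k in range(1, K + 1):
        fidelity = 0.5 * ((degrade(h(z)) - y) ** 2).sum()
        # keep the graph so higher-order gradients flow at training time
        grad, = torch.autograd.grad(fidelity, z, create_graph=True)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** k)
        v_hat = v / (1 - beta2 ** k)
        z = z - lr * m_hat / (v_hat.sqrt() + eps)   # z_k = z_{k-1} + delta_z_k
    return h(z)                               # only the last estimate is output
```

Training then proceeds as usual by back-propagating the reconstruction loss between the returned estimate and the ground truth through both sub-networks.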

4 Experimental Results

4.1 Image Inpainting and Interpolation

We will analyze the performance of different models on image inpainting and interpolation together in this section.

Network architectures.

Following previous works [16], we first introduce inpainting networks for hallucinating visual contents within square blocks whose size and location vary randomly in given images, and the same pipeline will be applied also to image interpolation. As a baseline model, the popular encoder-decoder-based architecture [40, 16, 33] (a.k.a., autoencoder) is chosen. We directly adapt the open-source Torch implementation from Gao and Grauman [16]. See Section D in our supp. for more details.

Method | ℓ1 loss / ℓ2 loss / PSNR | ℓ1 loss / ℓ2 loss / PSNR | ℓ1 loss / ℓ2 loss / PSNR | ℓ1 loss / ℓ2 loss / PSNR (centered)
Autoencoder (Default) | 0.2314 / 0.1115 / 16.05 | 0.2294 / 0.1152 / 15.97 | 0.2554 / 0.1638 / 14.74 | 0.2448 / 0.1112 / 16.09
Autoencoder (, centered) | 0.2094 / 0.0949 / 17.16 | 0.2259 / 0.1409 / 15.57 | 0.2608 / 0.2323 / 13.18 | 0.0246 / 0.0031 / 32.27
Autoencoder (Joint) [16] | 0.0224 / 0.0015 / 35.08 | 0.0303 / 0.0041 / 30.82 | 0.0493 / 0.0118 / 26.12 | 0.0293 / 0.0036 / 31.53
On-Demand Learning [16] | 0.0228 / 0.0016 / 34.97 | 0.0307 / 0.0043 / 30.78 | 0.0495 / 0.0120 / 26.12 | 0.0297 / 0.0037 / 31.47
Multi-Tasks Learning [25] | 0.0187 / 0.0011 / 36.35 | 0.0271 / 0.0038 / 31.24 | 0.0473 / 0.0118 / 26.13 | 0.0261 / 0.0033 / 31.98
DL-Net (ours) | 0.0122 / 0.0008 / 38.69 | 0.0210 / 0.0034 / 32.02 | 0.0410 / 0.0110 / 26.56 | 0.0201 / 0.0029 / 32.87

Table 2: Image inpainting with multiple degradations on CelebA: our method compared with the baseline and competitive methods. Our DL-Net consistently outperforms the others in all test cases.
Method | ℓ1 loss / ℓ2 loss / PSNR | ℓ1 loss / ℓ2 loss / PSNR | ℓ1 loss / ℓ2 loss / PSNR | ℓ1 loss / ℓ2 loss / PSNR (average)
Autoencoder (Default) | 0.0924 / 0.0159 / 24.64 | 0.0527 / 0.0062 / 28.59 | 0.0565 / 0.0095 / 26.78 | 0.0705 / 0.0112 / 26.67
Autoencoder (Joint) [16] | 0.0256 / 0.0017 / 34.36 | 0.0347 / 0.0036 / 31.07 | 0.0597 / 0.0104 / 26.32 | 0.0345 / 0.0037 / 31.68
On-Demand Learning [16] | 0.0251 / 0.0016 / 34.50 | 0.0343 / 0.0035 / 31.15 | 0.0589 / 0.0102 / 26.41 | 0.0340 / 0.0036 / 31.78
Multi-Tasks Learning [25] | 0.0216 / 0.0013 / 35.58 | 0.0335 / 0.0034 / 31.24 | 0.0619 / 0.0110 / 26.05 | 0.0327 / 0.0035 / 32.22
DL-Net (ours) | 0.0138 / 0.0007 / 38.35 | 0.0240 / 0.0023 / 33.09 | 0.0488 / 0.0083 / 27.42 | 0.0233 / 0.0024 / 34.51

Table 3: Image interpolation with multiple degradations on CelebA: our method compared with the baseline and competitive methods. Our DL-Net consistently outperforms the others in all test cases.

Training and test samples.

Same as Gao and Grauman [16], we evaluate inpainting and interpolation models on two different datasets: CelebA [35] and SUN397 [51]. For CelebA, the first 100,000 images are used to form the training set. Another 2,000 images are randomly chosen and split into a validation set and a test set, each with 1,000 images. For SUN397, we similarly have 100,000 images for training and 1,000 images each for validation and testing. Input images are uniformly rescaled so that the encoder outputs fixed-size feature maps. Extra randomness is introduced into the test samples owing to pixel removal at possibly any location, so a single test set can be highly biased. In order to address this issue, we utilize more samples by degenerating each original test image with feasible degradations, and save all these combinations (of corrupted images and ground truth) locally such that all models are tested on the same sample pairs.

Training process.

We initialize the channel-wise fully-connected layer with a random Gaussian distribution and all convolutional layers with the "MSRA" method [19]. In our experiments, weight decay and ADAM [28] with a fixed initial learning rate are adopted to optimize the objective

$$\min_f \ \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(f(y_i),\, x_i\big) \qquad (6)$$

for the reference models. The training batch size is set to 100. To keep in line with prior works, we report the ℓ1 loss, ℓ2 loss and the PSNR index on our test set for evaluating performance. Images are numerically converted into floating-point format and rescaled before calculating these metrics. We cut the learning rate every 100 epochs when no improvement is observed on the validation set. The reference inpainting model is trained to fill square holes of one fixed large size. After training for 250 epochs, the references reach a plateau and we achieve PSNRs of 24.45dB and 26.78dB on CelebA, for inpainting at that size and for interpolation when 75% of the pixels are removed, respectively.
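As a reference for how the reported metrics are computed, below is a minimal sketch of the ℓ1, ℓ2 and PSNR evaluation; it assumes images have already been converted to floating point and rescaled to [0, 1], which is our reading of the description above.

```python
import numpy as np

def evaluate(x_hat, x):
    """l1 loss, l2 loss and PSNR for a pair of images scaled to [0, 1]."""
    x_hat = np.asarray(x_hat, dtype=np.float64)
    x = np.asarray(x, dtype=np.float64)
    l1 = np.mean(np.abs(x_hat - x))
    l2 = np.mean((x_hat - x) ** 2)                      # MSE
    psnr = float('inf') if l2 == 0 else 10.0 * np.log10(1.0 / l2)
    return l1, l2, psnr
```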

Main results.

Despite the decent results on a presumed fixed degradation setting, the reference models fail when evaluated on other inpainting sizes or interpolation percentages that they are not specifically trained on, even though those might be technically easier. For example, a PSNR of only 16.09dB is obtained when we moderately adjust the inpainting size. If the location constraint is further relaxed, it degrades to 15.97dB, which is by no means satisfactory in practice. Detailed quantitative results can be found in Tables 2 and 3.³ This problem has been comprehensively studied in [16], so we follow their experimental settings and let the inpainting size and the interpolation percentage vary within the same prescribed ranges. We also let the inpainting regions shift randomly within a range of pixels around the centroid of the images, where most meaningful content exists.⁴

³ For results on SUN397, please refer to our supplementary material.
⁴ Note that, as one of the few effective methods, our DL-Net also generalizes to irregular holes; we choose the above settings mostly to make fair comparisons with previous works [16, 25].

Typically, a rigid joint training strategy [40, 16] can mitigate the problem and save the image restoration performance to some extent. In a nutshell, it attempts to minimize the reconstruction loss over different settings of degradation simultaneously. By learning from more flexible restoration scenarios, network models demonstrate an average PSNR of 32.47dB for inpainting and 31.68dB for interpolation, over all degradation settings on our test sets. Though the learning problem appears more complex, it converges reasonably fast through fine-tuning from the references instead of training from scratch. The training process takes 500 epochs for both tasks, and the learning rate is cut every 200 epochs. We evaluate the ℓ1 loss, ℓ2 loss and PSNR in specific degradation settings and summarize all the results on CelebA in Tables 2 and 3. See Tables 8 and 9 in the supp. for results on SUN397.

Figure 3: Qualitative comparisons between our method and competitors. See, for example, the eyebrow and the mouth of the woman for their differences. The inpainting region is slightly biased to the bottom left of the images. These images are not cherry-picked and are better viewed zoomed in for detailed comparisons.

We then fine-tune with our DL-Net similarly and test the obtained models under exactly the same circumstances. As compared in Tables 2 and 3, our DL-Net models achieve significantly better results in all test cases. We also compare our method with state-of-the-art solutions dealing with the same problem in the literature, including multi-task learning and on-demand learning [16]. Being aware of a very recent progress on multi-task learning using uncertainty to weigh losses [25], we adapt it to our image restoration tasks, for which each degradation level⁵ is treated as a subtask and a weight in the learning objective is accordingly introduced. To be more specific, we have 30 extra weights to learn for both multi-task image inpainting and interpolation. The on-demand learning method is configured exactly as suggested in its paper. As expected, the two methods work well on multiple degradations, outperforming the aforementioned joint training strategy in most cases. However, they never surpass our DL-Net in terms of restoration performance and PSNRs. For qualitative comparisons, see Figure 3. It is further straightforward to absorb an adversarial loss in our DL-Net, either by introducing it as a separate term or by adding it directly to the overall loss; here we adopt the latter and illustrate its results in Figure 3 as well.

⁵ For inpainting, each size is regarded as a particular level, while for image interpolation, each interval [2.5(n−1)%, 2.5n%) with a positive integer n is regarded as one level. We also tried other possibilities but never obtained better results.

Ablation studies.

Our experimental settings in the prequel, including the configurations of the training and test sets, CNN architectures and loss terms for image inpainting and interpolation, mostly comply with those in [16]. However, we notice that some different settings are also popular. For example, one may prefer compositing the network prediction with the known pixels of the input rather than using the raw network output directly. We compare our DL-Nets with competitive models in such a scenario and report the restoration results in Tables 4 and 5. Apparently, our method still outperforms its competitors on both tasks.

Method | ℓ1 loss / ℓ2 loss / PSNR | ℓ1 loss / ℓ2 loss / PSNR
Autoencoder (Joint) [16] | 0.0117 / 0.0031 / 32.29 | 0.0333 / 0.0110 / 26.50
On-Demand Learning [16] | 0.0118 / 0.0032 / 32.27 | 0.0333 / 0.0111 / 26.51
Multi-Tasks Learning [25] | 0.0117 / 0.0031 / 32.31 | 0.0338 / 0.0112 / 26.40
DL-Net (ours) | 0.0117 / 0.0031 / 32.36 | 0.0329 / 0.0107 / 26.62

Table 4: Image inpainting with multiple degradations on CelebA: our DL-Net compared with competitors using the composited outputs.
Method | ℓ1 loss / ℓ2 loss / PSNR | ℓ1 loss / ℓ2 loss / PSNR
Autoencoder (Joint) [16] | 0.0204 / 0.0027 / 32.31 | 0.0512 / 0.0098 / 26.61
On-Demand Learning [16] | 0.0203 / 0.0027 / 32.35 | 0.0505 / 0.0096 / 26.70
Multi-Tasks Learning [25] | 0.0204 / 0.0027 / 32.27 | 0.0532 / 0.0103 / 26.34
DL-Net (ours) | 0.0171 / 0.0021 / 33.52 | 0.0439 / 0.0080 / 27.54

Table 5: Image interpolation with multiple degradations on CelebA: DL-Net compared with competitors using the composited outputs.

We are aware that there might exist different choices for the formulation of the reconstruction loss when training inpainting networks, and it is worth mentioning that our method generalizes to those cases. In addition to the one in Eqn. (6), researchers [40, 57] also propose to enlarge the inpainting blocks by 7 pixels and penalize the boundary regions more heavily to encourage perceptual consistency. As shown in Table 10 in our supp., the superiority of our DL-Net holds.

One may observe, and consider it a bit surprising, that our method even achieves better results than specifically trained references on certain degradation levels. For instance, "Autoencoder (, centered)" is slightly worse than the DL-Net model even when tested with centered regions. Also, the "Autoencoder (Default)" interpolation model does not show superior performance to our DL-Net even when 75% of the pixels are indeed removed. We believe this is partially because insightful knowledge extracted from the size and location information of the blank regions facilitates the image restoration process [63]. Two other plausible explanations are the training samples benefiting from more flexible degradations, and the probably enlarged network capacity owing to the recursive use of $h$.

Figure 4: The training curves of a rigid joint learning method and our DL-Net with various $K$. Note that the rigid joint training can be considered as a special case of our DL-Net without recursive updates.

We design deeper references to test whether it is the capacity increase that helps. Stacking four extra convolutional layers that output 64 feature maps with ReLU at the end (before the output layer), or putting them at the beginning, or allocating two on each side, we obtain three networks that should be more powerful. We train them in the same way as our "Autoencoder (, centered)" model and test them with centered regions. Their average PSNRs are 32.21dB, 32.44dB and 32.35dB, respectively, showing that the increased capacity helps only to a limited extent. Experiments are also conducted to demonstrate how the performance of our method varies with $K$. Since the rigid joint training can be considered as a special case of DL-Net without recursive updates, we illustrate its training curve together with those of DL-Nets with various $K$ in Figure 4. Apparently, a larger $K$ here indicates much faster convergence and suggests better restoration.

Overhead.

We evaluate the CPU/GPU runtime of these DL-Nets with various $K$ values and summarize the results in Table 6. It can be noticed that the introduced computational overhead is very small (at most 1.4× on GPU, for the largest $K$ tested). The number of learnable parameters is also reported. Since our method utilizes a recursive module, it never brings in more learnable parameters. All models discussed above are tested with the open-source TensorFlow framework [1] on an Intel Xeon E5 CPU and an NVIDIA Titan X GPU.

Method | CPU Run-time (s) | GPU Run-time (s)
Autoencoder (Joint) [16] | 3.824 | 1.330
DL-Net (K=2) | 4.346 | 1.595
DL-Net (K=3) | 4.747 | 1.630
DL-Net (K=4) | 5.255 | 1.747
DL-Net (K=5) | 5.884 | 1.876

Table 6: Comparison of CPU/GPU runtime for processing the whole test set. The number of learnable parameters is identical for all models listed, since the recursive module introduces no extra parameters.

4.2 Single Image Super-Resolution

Method | PSNR (×2 / ×3 / ×4) | PSNR (×2 / ×3 / ×4) | PSNR (×2 / ×3 / ×4) | #Train. Images
Bicubic | 33.66 / 30.39 / 28.42 | 29.02 / 26.57 / 24.78 | 26.13 / 25.32 / 24.33 | -
VDSR | 37.62 / 33.89 / 31.56 | 30.65 / 30.36 / 29.88 | 26.39 / 26.34 / 26.33 |
VDSR (DL-Net, ours) | 36.47 / 33.28 / 31.09 | 36.25 / 33.28 / 31.17 | 34.97 / 32.89 / 30.99 |
DRRN | 37.82 / 34.23 / 31.86 | 30.65 / 30.30 / 29.83 | 26.39 / 26.35 / 26.31 | 91+400
DRRN (DL-Net, ours) | 37.32 / 33.76 / 31.34 | 37.39 / 33.83 / 31.41 | 36.76 / 33.53 / 31.41 |
IDN | 37.73 / 34.07 / 31.76 | 30.64 / 30.32 / 29.88 | 26.39 / 26.35 / 26.33 |
IDN (DL-Net, ours) | 37.07 / 33.49 / 31.24 | 37.05 / 33.58 / 31.27 | 36.50 / 33.32 / 31.14 |
SRMDNF [63] | 37.79 / 34.12 / 31.96 | 37.45 / 34.16 / 31.99 | 34.12 / 33.02 / 31.77 | 400+800+4744

Table 7: SISR with multiple degradations: models trained using our DL-Net flavored strategy compared with those trained as originally suggested. Each group of PSNRs corresponds to a different blur-kernel setting. Test results of SRMDNF (which is trained exploiting more images and substantially more parameters) are cited from its paper.

Network architectures.

We choose popular SISR CNNs as backbones in our following experiments, including VDSR (2016) [26], DRRN (2017) [46] and IDN (2018) [21]. The networks are structurally so different that we are able to validate whether our method cooperates well with off-the-shelf CNNs profiting from customized architecture designs. See Section D in our supp. for schematic sketches and more details of these backbone networks.

Training and test samples.

In order to be consistent with prior works, we use the well-known dataset of 91 images [55] for training, along with 400 images from the Berkeley segmentation dataset (BSD-500) [37], which is also widely used. Our test datasets include Set-5 [6], Set-14 [58] and the official test set of BSD-500, which consist of 5, 14 and 100 images, respectively. Following Timofte et al. [48], Dong et al. [10], etc., we feed only the luminance channel into our networks and simply upscale the two chrominance channels using bicubic interpolation. To take full advantage of the self-similarity of natural images, data augmentation strategies including rotation, rescaling and flipping are also applied to the training set, as in other works [50, 26, 11, 46, 21]. Likewise, all training images are downsampled and cropped to obtain square-shaped training pairs. For upscaling factors of 2, 3, and 4, these patch pairs have widths (and heights) of 17 / 34, 17 / 51 and 17 / 68, respectively. The above settings mostly follow previous SISR works and help us train references successfully. For training DL-Nets, we believe larger image patches are more suitable and therefore adjust their widths to 40 / 80, 40 / 120 and 40 / 160 [63].

Training process.

Similarly, we still adopt the "MSRA" method [19] to initialize the weights in (up-)convolutional layers. In order to train the networks reasonably fast, we also take advantage of the ADAM algorithm [28] for stochastic optimization. As suggested, the base learning rate is always set to 0.001 and the two momentum hyper-parameters are set to 0.9 and 0.999. In correspondence with other research works, we calculate the average PSNR between the ground-truth images and the generated high-resolution images on the luminance channel (in the YCrCb color space). Since an upscaling layer is introduced, one model is trained for each of the ×2, ×3 and ×4 scenarios. We cut the learning rate every 80 epochs, such that the model performance finally saturates [26]. After training for 300 epochs, our VDSR and DRRN models achieve average PSNRs of 33.89dB and 34.23dB on Set-5, respectively.

Main results.

The above training process follows that of previous deep SISR CNNs, in which all input low-resolution images are assumed to be generated directly from high-resolution images through bicubic downsampling. As has been discussed, such an assumption fails in real-world applications where blurring (or other types of degradations) may also exist. Here we first evaluate the obtained references on low-resolution images downscaled from (possibly) blurry images. We choose the same blur kernels as in [63], whose widths vary within a prescribed range. Our results (in Table 7) demonstrate that the performance of state-of-the-art SISR models diminishes considerably when even subtle distortion is introduced.

Fortunately, the proposed DL-Net strategy helps to resolve this problem and improves the restoration performance to a remarkable extent. As demonstrated in Table 7, our DL-Nets with VDSR, DRRN and IDN as backbones achieve prominent results in both blurry settings, while in the optimal setting their PSNR indices drop only a little. Results on Set-14 and BSD-500 can be found in Section C of the supp. We also compare our method with a very recent method dedicated to addressing the fixation problem for SISR [63]. Though trained exploiting more images and substantially more parameters than our DRRN-based model, its performance still diminishes in the less challenging scenarios when the blur kernel becomes wide. Compared with SRMDNF and the references, our DL-Net shows more stable performance across all degradation settings.

5 Conclusions

While impressive results have been achieved, state-of-the-art image restoration networks are usually trained with images self-degenerated through very restricted degradations. Such a limitation may hinder restoration networks from being applied to some real-world applications, and there is as of yet no general solution for this. In this paper we propose DL-Net, towards generalizing existing networks to succeed over a spectrum of degradations with their training objectives and architectures retained. The pivot of our method is a recursive module that assists a sub-network in disentangling the effects of possible degradations by iteratively minimizing the fidelity term. Experimental results on image inpainting, interpolation and SISR verify the effectiveness of our method. Future work shall include explorations of more degradation types.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In OSDI, 2016.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 28(3):24, 2009.
  • [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
  • [4] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In SIGGRAPH, 2000.
  • [5] M. Bertalmio, L. Vese, G. Sapiro, and S. Osher. Simultaneous structure and texture image inpainting. IEEE Transactions on Image Processing (TIP), 12(8):882–889, 2003.
  • [6] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.
  • [7] Y. Blau and T. Michaeli. The perception-distortion tradeoff. In CVPR, 2018.
  • [8] H. Chang, D.-Y. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. In CVPR, 2004.
  • [9] R. T. d. Combes, M. Pezeshki, S. Shabanian, A. Courville, and Y. Bengio. On the learning dynamics of deep neural networks. arXiv preprint arXiv:1809.06848, 2018.
  • [10] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In ECCV, 2014.
  • [11] C. Dong, C. C. Loy, and X. Tang. Accelerating the super-resolution convolutional neural network. In ECCV, 2016.
  • [12] W. Dong, L. Zhang, G. Shi, and X. Wu. Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transactions on Image Processing (TIP), 20(7):1838–1857, 2011.
  • [13] I. Drori, D. Cohen-Or, and H. Yeshurun. Fragment-based image completion. In ACM Transactions on graphics (TOG), volume 22, pages 303–312, 2003.
  • [14] W. Fan and D.-Y. Yeung. Image hallucination using neighbor embedding over visual primitive manifolds. In CVPR, 2007.
  • [15] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. International Journal of Computer Vision (IJCV), 40(1):25–47, 2000.
  • [16] R. Gao and K. Grauman. On-demand learning for deep image restoration. In ICCV, 2017.
  • [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [18] S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang. Convolutional sparse coding for image super-resolution. In ICCV, 2015.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
  • [20] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutional sparse coding. In CVPR, 2015.
  • [21] Z. Hui, X. Wang, and X. Gao. Fast and accurate single image super-resolution via information distillation network. In CVPR, 2018.
  • [22] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.
  • [23] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann. Self-paced learning with diversity. In NIPS, 2014.
  • [24] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [25] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018.
  • [26] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
  • [27] K. I. Kim and Y. Kwon. Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(6):1127–1133, 2010.
  • [28] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [29] R. Köhler, C. Schuler, B. Schölkopf, and S. Harmeling. Mask-specific inpainting with deep neural networks. In German Conference on Pattern Recognition, 2014.
  • [30] N. Komodakis. Image completion using global optimization. In CVPR, 2006.
  • [31] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [32] S. Lefkimmiatis. Universal denoising networks: A novel cnn architecture for image denoising. In CVPR, 2018.
  • [33] Y. Li, S. Liu, J. Yang, and M.-H. Yang. Generative face completion. In CVPR, 2017.
  • [34] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. In ECCV, 2018.
  • [35] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • [36] X. Mao, C. Shen, and Y.-B. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In NIPS, 2016.
  • [37] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
  • [38] J. Pan, Y. Liu, J. Dong, J. Zhang, J. Ren, J. Tang, Y.-W. Tai, and M.-H. Yang. Physics-based generative adversarial models for image restoration and beyond. arXiv preprint arXiv:1808.00605, 2018.
  • [39] V. Papyan, Y. Romano, M. Elad, and J. Sulam. Convolutional dictionary learning via local processing. In ICCV, 2017.
  • [40] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • [41] L. C. Pickup, S. J. Roberts, and A. Zisserman. A sampled texture prior for image super-resolution. In NIPS, 2004.
  • [42] J. S. Ren, L. Xu, Q. Yan, and W. Sun. Shepard convolutional neural networks. In NIPS, 2015.
  • [43] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. In CVPR, 2014.
  • [44] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised map inference for image super-resolution. In ICLR, 2017.
  • [45] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR, 2014.
  • [46] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In CVPR, 2017.
  • [47] Y.-W. Tai, S. Liu, M. S. Brown, and S. Lin. Super resolution using edge prior and single image detail synthesis. In CVPR, 2010.
  • [48] R. Timofte, V. De Smet, and L. Van Gool. Anchored neighborhood regression for fast example-based super-resolution. In ICCV, 2013.
  • [49] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In CVPR, 2018.
  • [50] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang. Deep networks for image super-resolution with sparse prior. In ICCV, 2015.
  • [51] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva. Sun database: Exploring a large collection of scene categories. International Journal of Computer Vision (IJCV), 119(1):3–22, 2016.
  • [52] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In CVPR, 2017.
  • [53] C.-Y. Yang, C. Ma, and M.-H. Yang. Single-image super-resolution: A benchmark. In ECCV, 2014.
  • [54] J. Yang, J. Wright, T. Huang, and Y. Ma. Image super-resolution as sparse representation of raw image patches. In CVPR, 2008.
  • [55] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing (TIP), 19(11):2861–2873, 2010.
  • [56] R. A. Yeh, C. Chen, T.-Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In CVPR, 2017.
  • [57] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In CVPR, 2018.
  • [58] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International conference on curves and surfaces, 2010.
  • [59] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
  • [60] K. Zhang, X. Gao, D. Tao, and X. Li. Multi-scale dictionary for single image super-resolution. In CVPR, 2012.
  • [61] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing (TIP), 26(7):3142–3155, 2017.
  • [62] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep cnn denoiser prior for image restoration. In CVPR, 2017.
  • [63] K. Zhang, W. Zuo, and L. Zhang. Learning a single convolutional super-resolution network for multiple degradations. In CVPR, 2018.
  • [64] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.