Unsupervised Image Super-Resolution with an Indirect Supervised Path
The task of single image super-resolution (SISR) aims at reconstructing a high-resolution (HR) image from a low-resolution (LR) image. Although deep learning models have made significant progress, they are trained on synthetic paired data in a supervised way and do not perform well on real data. Several attempts directly apply unsupervised image translation models to address this problem. However, unsupervised low-level vision problems pose a greater challenge to translation accuracy. In this work, we propose a novel framework composed of two stages: 1) unsupervised image translation between real LR images and synthetic LR images; 2) supervised super-resolution from approximated real LR images to HR images. It takes the synthetic LR images as a bridge and creates an indirect supervised path from real LR images to HR images. Any existing deep learning based image super-resolution model can be integrated into the second stage of the proposed framework for further improvement. In addition, it shows great flexibility in balancing distortion and perceptual quality under the unsupervised setting. The proposed method is evaluated on both NTIRE 2017 and 2018 challenge datasets and achieves favorable performance against supervised methods.
The task of single image super-resolution (SISR) aims at reconstructing a high-resolution (HR) image from a low-resolution (LR) image. It has been widely used in computer vision applications such as image enhancement, surveillance and medical imaging. In the SISR task, an LR image $I^{LR}$ is modeled as the output of applying the following degradation process to an HR image $I^{HR}$,
$$I^{LR} = (I^{HR} \otimes k)\downarrow_s + n,$$
where $k$ denotes a blur kernel, $\downarrow_s$ denotes a downsampling operation with scaling factor $s$, and $n$ denotes noise, usually defined as Gaussian noise with standard deviation $\sigma$. SISR is an ill-posed problem since multiple HR solutions can be reconstructed from a given LR image.
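The degradation process described above (blur, downsampling, additive Gaussian noise) can be sketched as follows. This is a minimal illustration and not the authors' data pipeline: the function names, the Gaussian shape of the blur kernel, and the use of plain strided decimation for the downsampling step are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalized 2-D Gaussian blur kernel (size must be odd)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def degrade(hr, kernel, scale=4, noise_std=0.0, rng=None):
    """Blur a 2-D image, decimate it by `scale`, then add Gaussian noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    pad = kernel.shape[0] // 2
    padded = np.pad(hr, pad, mode="edge")
    blurred = np.zeros_like(hr, dtype=np.float64)
    # Direct correlation; identical to convolution for a symmetric kernel.
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            blurred += kernel[i, j] * padded[i:i + hr.shape[0], j:j + hr.shape[1]]
    lr = blurred[::scale, ::scale]  # assumed decimation, stands in for the downsampling operator
    return lr + rng.normal(0.0, noise_std, lr.shape)  # additive noise term
```

With `noise_std=0.0` and a normalized kernel, a constant image stays constant after degradation, which is a quick sanity check on the blur and padding.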
Recently, data-driven methods, especially deep learning based methods [Dong-tpami16, Shi-cvpr16, Kim-cvpr16-vdsr, Kim-cvpr16-drcn, Tai-cvpr17a, Tai-cvpr17b, Lai-cvpr17, Zhang-cvpr18, Zhang-eccv18], have achieved great performance on low-level vision tasks. Although significant progress has been made on SISR, the trained deep learning models do not perform well on real data. That is because most of them are trained on synthetic paired data in a supervised way. Most synthetic low-resolution images are generated using simple and uniform degradation, such as directly downsampling a high-resolution image with bicubic interpolation. Therefore, the trained models produce high-quality images only on synthetic low-resolution images and generalize poorly to unseen, realistic low-resolution images, which suffer from other degradation factors such as blur and noise.
Several works attempt to address this issue. On one hand, some researchers try to improve the quality of training data by collecting datasets of real-world data [Cai-arxiv19, Zhang-cvpr19]; on the other hand, some researchers work on blind super-resolution [Glasner-iccv09, Huang-cvpr15, Michaeli-iccv13, Shocher-cvpr18]. The latter methods work in an unsupervised fashion by exploiting self-similarity across scales of an LR image to hallucinate an HR image. Very recently, [Yuan-cvprw18] applied unsupervised image translation to image super-resolution with unpaired LR-HR data. LR images and HR images are considered as two domains and SISR is treated as an LR-to-HR image translation problem. Several recently proposed unsupervised image translation methods such as CycleGAN [Zhu-iccv17] and DualGAN [Yi-iccv17] can be employed to super-resolve an image when training data is not paired, showing great potential in addressing super-resolution of unseen, realistic low-resolution images.
Our method shares a similar philosophy in using cycle consistency to train an SR model for unpaired LR and HR images. However, different from the common unsupervised image translation task, image super-resolution places a strict requirement on the result. Not only must the style of a translated image be correct, but the super-resolved result must also keep the content and contain few artifacts. That is, it poses a greater challenge to the accuracy of translation.
The proposed framework is designed particularly for the unsupervised low-level vision problem. Instead of directly applying an unsupervised image translation model to bicubic upsampled images or taking a cycle-in-cycle approach as in [Yuan-cvprw18], we take bicubic interpolated synthetic LR images as an intermediate bridge in the unsupervised training. We first build a cycle between real LR space and synthetic LR space. After feeding a synthetic LR image to the cycle model, an LR image with similar degradation to a real LR image can be estimated. By adding a path from these approximated real LR images to HR images on top of the path from synthetic LR images to real LR images, we obtain an indirect supervised path from real LR images to HR images, i.e., synthetic LR images → approximated real LR images → HR images. This indirect supervised path allows the proposed method to enjoy the advantages of any recently proposed SISR model. In addition, compared to previous unsupervised image translation approaches, our method is flexible enough that the proposed real LR-to-HR SR model can be trained with various losses. Unsupervised translation methods trained with cycle-consistency and adversarial losses are prone to produce artifacts because those losses are weak in controlling generation quality. In the proposed approach, however, either a fidelity loss such as L1 or L2, or a perceptual quality oriented loss such as an adversarial loss or a texture loss, can be used for training. Hence, it shows great flexibility in balancing distortion and perceptual quality. The proposed method is evaluated on both NTIRE 2017 and 2018 challenge datasets and achieves comparable performance to supervised methods.
Deep learning based image super-resolution.
Recently, many works have been proposed to address the task of SISR based on Convolutional Neural Networks (CNNs). The first one was proposed by Dong et al. [Dong-tpami16], which implicitly learns a mapping between LR and HR images using a shallow fully-convolutional network. In [Kim-cvpr16-vdsr, Kim-cvpr16-drcn], Kim et al. borrowed the idea of residual connections from ResNet [He-cvpr16-resnet] and designed a very deep network to improve SISR performance. More recently, SRResNet [Caballero-cvpr17], EDSR [Lim-cvprws17], DRRN [Tai-cvpr17a], RDN [Zhang-cvpr18] and ESRGAN [wang-eccv18ws] proposed to use not only a residual connection in the last layer, but also local residual connections and dense connections in the intermediate layers, as in the ResNet [He-cvpr16-resnet] and DenseNet [Huang-cvpr17-densenet] architectures, together with deeper network architectures to further improve performance. In addition, several works [Caballero-cvpr17, Sajjadi-iccv17, wang-eccv18ws] improve the perceptual quality of SISR results by combining a fidelity loss with an adversarial loss [Goodfellow-nips14] and a perceptual loss [Johnson-eccv16]. However, they are all trained on synthetic pairs of LR and HR images. LR images in the real world suffer from multiple degradations, such as complex blur and noise, that are difficult to formulate. It is difficult and expensive to obtain paired data for real LR and HR images. [Cai-arxiv19, Zhang-cvpr19] proposed to capture LR-HR image pairs in a more realistic setting, that is, by tuning the focal length of DSLR cameras to collect images of different resolutions. However, models trained with such data may not generalize well to LR images captured by other devices such as smartphones, which may contain different levels of noise and blur. In this work, we propose an approach to address unpaired data, which allows any aforementioned supervised SR method designed for synthetic pairs to be trained with unpaired data.
Unsupervised image translation.
Image super-resolution can be considered as a special image translation task, i.e., translating images from the LR domain to the HR domain. There have been several approaches to unsupervised image translation. Zhu et al. [Zhu-iccv17] proposed CycleGAN by adding a cycle consistency constraint on top of pix2pix [Isola-cvpr17]. Cycle consistency enforces that each image is correctly reconstructed after being translated from a source domain to a target domain and then translated back to the source domain. Similar approaches are also proposed in DiscoGAN [Kim-icml17] and [Yi-iccv17]. Another family of approaches assumes that images from the source domain and the target domain share a common latent space. Once an image is projected to the shared latent space, a decoder can be used either to reconstruct the image in the source domain or to produce an image in the target domain. Huang et al. [Huang-eccv18] and Lee et al. [Lee-eccv18] further proposed to decompose an image into a content-related space and a style-related space to achieve many-to-many image translation.
Inspired by the success of these methods, several works propose to use unsupervised image translation to model the unpaired mapping between real LR and HR images. [Yuan-cvprw18] introduces two cycles, one between real LR and synthetic LR images, and another between real LR and HR images. [Bulat-eccv18] proposes to first learn a unidirectional GAN-based degradation model for real LR images, and then train an SR model with approximate real LR-HR pairs. However, these methods attempt to directly translate between real LR and HR images, which may fail due to the significant domain gap between them. In addition, different from common image translation tasks such as horse-to-zebra translation and style transfer, users are more sensitive to the results of SISR. Models for common image translation do not have strong control over the color and content of translation results. In this work, we take bicubic interpolated synthetic LR images as an intermediate bridge in the cycled training and construct an indirect supervised path which enables more control over the super-resolution results.
As mentioned above, there is a strict requirement on the super-resolved result in unsupervised image super-resolution. Not only must the style of a translated image be correct, but the super-resolved result must also keep the original content and contain few artifacts. Therefore, it poses a greater challenge to the accuracy of translation. The proposed framework is designed specifically for unsupervised low-level vision problems and is composed of two stages: 1) unsupervised image translation between real LR images and synthetic LR images; 2) supervised super-resolution from approximated real LR images to HR images. Given a real LR image $x$ and an unpaired HR image $y$ during training, we first generate a synthetic LR image $y_{\downarrow}$ with simple synthetic degradation. Then we have LR images $x$ and $y_{\downarrow}$ composing the training set for unsupervised translation, and synthetic LR-HR pairs $y_{\downarrow}$ and $y$ composing the training set for supervised image super-resolution. Our goal is to train a model from real LR images $x$ to HR images $y$. Assume that a decent unsupervised translation model can successfully transfer a synthetic LR image $y_{\downarrow}$ to an approximated real LR one $\tilde{x}$. Then the generated LR images $\tilde{x}$ and the HR images $y$ corresponding to $y_{\downarrow}$ implicitly compose supervised training pairs for the task of super-resolving real LR images. An arbitrary SR network proposed for supervised learning can be applied here and trained with various losses to balance distortion and perceptual quality. The whole framework is shown in Figure 1.
Unsupervised translation between differently degraded LR images
In the proposed framework we first use the unsupervised image translation model CycleGAN [Zhu-iccv17] for mapping between synthetic LR images $y_{\downarrow}$ and real LR images $x$. Synthetic LR images $y_{\downarrow}$ are generated by simply applying a bicubic downsampling operation to HR images $y$ without adding any noise or blur. In the CycleGAN model, there are a generator $G_1$ trained to produce samples similar to real LR images and a discriminator $D_x$ used to distinguish the translated outputs from real LR images. Similarly, there are a generator $G_2$ trained to learn the mapping for the backward direction and a discriminator $D_{y_{\downarrow}}$ to detect true synthetic LR images. Cycle consistency, i.e., $G_2(G_1(y_{\downarrow})) \approx y_{\downarrow}$, is added to enable unsupervised learning from synthetic LR images to real LR images. The loss for training such an unsupervised image translation model is defined as below.
$$\mathcal{L}_{CycleGAN} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{cyc}\mathcal{L}_{cyc} + \lambda_{idt}\mathcal{L}_{idt},$$
where $\mathcal{L}_{adv}$ is the adversarial loss on both the real LR domain and the synthetic LR domain and contains a term for the generators and one for the discriminators. In this work, we follow CycleGAN [Zhu-iccv17] and use the losses in LSGAN [Mao-iccv17]; $\mathcal{L}_{cyc}$ contains cycle consistency losses for both forward and backward directions, and $\mathcal{L}_{idt}$ includes identity losses for keeping color consistency as proposed in the original CycleGAN. With a properly trained CycleGAN model, we can obtain an approximate real LR image from a synthetic LR image. The CycleGAN model here does not need to be perfect, because it is jointly trained with the SR network at the second stage. Compared to [Bulat-eccv18], which uses a unidirectional GAN-based High-to-Low network to directly translate HR images to LR ones, our method first obtains bicubic downsampled LR images and then uses a two-way cycle-consistency based network for translation, which yields more robust translation results.
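The generator-side objective described above (least-squares adversarial terms, cycle consistency in both directions, and identity terms) can be sketched as below. This is an illustrative sketch under stated assumptions: generators and discriminators are passed in as plain callables, the L1 distance is assumed for the cycle and identity terms, and the default weights `w_adv`, `w_cyc` and `w_idt` are placeholder values rather than the settings used in this work.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: real scores toward 1, fake toward 0."""
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push scores of fakes toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

def l1(a, b):
    return np.mean(np.abs(a - b))

def cyclegan_generator_loss(lr_real, lr_syn, g_syn2real, g_real2syn,
                            d_real, d_syn, w_adv=1.0, w_cyc=10.0, w_idt=5.0):
    """Weighted sum of adversarial, cycle and identity terms (placeholder weights)."""
    fake_real = g_syn2real(lr_syn)    # synthetic LR -> approximated real LR
    fake_syn = g_real2syn(lr_real)    # real LR -> approximated synthetic LR
    adv = lsgan_g_loss(d_real(fake_real)) + lsgan_g_loss(d_syn(fake_syn))
    # Forward and backward cycles must return to their starting images.
    cyc = l1(g_real2syn(fake_real), lr_syn) + l1(g_syn2real(fake_syn), lr_real)
    # Identity terms: a generator fed an image of its target domain should keep it.
    idt = l1(g_syn2real(lr_real), lr_real) + l1(g_real2syn(lr_syn), lr_syn)
    return w_adv * adv + w_cyc * cyc + w_idt * idt
```

The discriminators are updated separately with `lsgan_d_loss`, once per domain, on real samples and on the translated outputs.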
Indirect supervised learning for super-resolution
Given unpaired real LR images and HR images, there is no pairwise supervision to directly train an SR network from real LR images to HR images. However, with the approximated real LR images generated by the CycleGAN from synthetic LR images, we are able to create an indirect supervised path, i.e., synthetic LR images → approximated real LR images → HR images. A synthetic LR image is bicubically downsampled from an HR image and can be used to produce an approximated real LR image. The approximated real LR image has the same content as the synthetic LR image and the HR image, with degradation similar to that of real LR images. Therefore, we can now indirectly train a supervised SR model with approximated real LR-HR pairs, which is able to perform well on real LR images.
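The construction of an indirect training pair can be illustrated with a short sketch. The names below are hypothetical, the translation model is passed in as an arbitrary callable, and block averaging is used only as a simple stand-in for the bicubic downsampling described in the text.

```python
import numpy as np

def box_downsample(hr, scale=4):
    """Average non-overlapping scale x scale blocks (stand-in for bicubic)."""
    h = hr.shape[0] // scale * scale
    w = hr.shape[1] // scale * scale
    blocks = hr[:h, :w].reshape(h // scale, scale, w // scale, scale)
    return blocks.mean(axis=(1, 3))

def make_indirect_pair(hr, translate, scale=4):
    """Return (approximated real LR, HR): the pair used to supervise the SR network."""
    lr_syn = box_downsample(hr, scale)     # synthetic LR, same content as hr
    lr_approx_real = translate(lr_syn)     # translation adds real-like degradation
    return lr_approx_real, hr
```

Because the approximated real LR image is derived from the HR image itself, the pair is aligned in content even though no real LR-HR pair was ever observed.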
Note that in this work, we do not explicitly separate the training process of these two stages but combine them within one framework. In this way the SR model is able to adapt to approximated real LR images with slight quality differences. These varied results act as a form of data augmentation, improving the SR model's robustness. The super-resolution performance of a two-stage model trained separately would depend heavily on the result of unsupervised image translation, which is still far from perfect. In addition, balancing these two stages within one framework further constrains the space of possible mapping functions in the CycleGAN. The generated real LR images are required not only to map back to the synthetic LR space but also to be super-resolved into the corresponding HR images. Different from [Yuan-cvprw18] and [Bulat-eccv18], where GAN losses play the major role in recovering noisy LR images, our method shares a similar philosophy with supervised SR models and takes distortion-based losses as the main component. Any loss function that works in supervised SR training can be used for the indirect supervised path. For example, it can be an MSE loss, a perceptual loss or an adversarial loss. In contrast, a vanilla CycleGAN cannot be trained with an MSE loss or a perceptual loss since there are no paired output-target data. The total loss for the proposed framework can be formulated as below.
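$$\mathcal{L}_{total} = \mathcal{L}_{CycleGAN} + \mathcal{L}_{SR},$$
where $\mathcal{L}_{CycleGAN}$ represents the loss of CycleGAN defined above and $\mathcal{L}_{SR}$ is the SR loss, which can be defined as any loss function used to train a supervised SR model. In this work we take a combination of MSE loss, perceptual loss and adversarial loss as our final loss function,
$$\mathcal{L}_{SR} = \lambda_{mse}\mathcal{L}_{mse}(S(\tilde{x}), y) + \lambda_{per}\mathcal{L}_{per}(S(\tilde{x}), y) + \lambda_{adv}^{HR}\mathcal{L}_{adv}^{HR},$$
where $S$ is the SR network, $\tilde{x}$ is the approximated real LR image from the first stage and $y$ is the corresponding HR image, $\mathcal{L}_{mse}$ and $\mathcal{L}_{per}$ are respectively the reconstruction loss and the perceptual loss between super-resolved results and HR images, and $\mathcal{L}_{adv}^{HR}$ is the adversarial loss on HR space, with a term for the generator and a term for the discriminator. We follow ESRGAN [wang-eccv18ws] and use the losses in Relativistic GAN [RaGAN-iclr19].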
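A sketch of the SR loss as a weighted combination of MSE, perceptual and relativistic adversarial terms is given below. It rests on several assumptions: `features` and `critic` are hypothetical callables standing in for a pretrained feature extractor and the HR discriminator (returning raw logits), the relativistic average formulation is written without the factor of one half that some implementations use, and the weights are left as arguments rather than fixed to the values used in this work.

```python
import numpy as np

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

def ragan_d_loss(c_real, c_fake):
    """Relativistic average discriminator loss on raw critic logits."""
    return (np.mean(softplus(-(c_real - np.mean(c_fake))))
            + np.mean(softplus(c_fake - np.mean(c_real))))

def ragan_g_loss(c_real, c_fake):
    """Relativistic average generator loss: roles of real and fake are swapped."""
    return (np.mean(softplus(c_real - np.mean(c_fake)))
            + np.mean(softplus(-(c_fake - np.mean(c_real)))))

def sr_generator_loss(sr, hr, features, critic, w_mse, w_per, w_adv):
    """Weighted sum of reconstruction, perceptual and adversarial terms."""
    rec = np.mean((sr - hr) ** 2)                       # fidelity (MSE)
    per = np.mean((features(sr) - features(hr)) ** 2)   # distance in feature space
    adv = ragan_g_loss(critic(hr), critic(sr))          # adversarial term on HR space
    return w_mse * rec + w_per * per + w_adv * adv
```

Setting `w_per` and `w_adv` to zero reduces this to the MSE-only variant; the total training objective adds this value to the CycleGAN loss of the first stage.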
Moreover, another advantage of the proposed framework over the vanilla CycleGAN is that the cycle training in our framework is conducted on LR space. During the cycled training, there are two generators and two discriminators to be trained, so a large amount of GPU memory would be required if the training were conducted on HR space. Besides, other two-stage methods like CinCGAN [Yuan-cvprw18] keep a two-stage pipeline during inference as well, whereas the proposed approach is two-stage only during training and one-stage during inference. Therefore, the proposed approach takes up much less memory and runs faster during inference than other SR methods with an unsupervised translation model.
In this section, we first introduce the datasets for evaluation under the unpaired image super-resolution setting. Note that there is no unpaired dataset with real LR and HR images. In this work, we treat LR images with advanced degradation as real LR images and explain below how the unpaired training data is formed. Implementation details for training the proposed models are described afterwards. Finally, we compare the proposed approach with several baselines, and investigate the influence of loss weights at the second stage on the super-resolution performance.
DIV2K [NTIRE17, NTIRE18] is a popular SISR dataset which contains 1,000 images of different scenes and is split into 800 images for training, 100 for validation and 100 for testing. It was collected for the NTIRE 2017 and NTIRE 2018 Super-Resolution Challenges in order to encourage research on image super-resolution with more realistic degradation. This dataset contains LR images with different types of degradations. Apart from the standard bicubic downsampling, several types of degradations are considered in synthesizing LR images for different tracks of the challenges. Since this work aims at dealing with a more realistic setting, we choose track 2 of NTIRE 2017, which contains LR images with unknown x4 downscaling, and track 2 and track 4 of NTIRE 2018, which respectively correspond to realistic mild and realistic wild adverse conditions. More specifically, LR images under the realistic mild x4 setting suffer from motion blur, Poisson noise and pixel shifting, while degradations under the realistic wild x4 setting are further extended to be of different levels from image to image, which is more realistic and challenging. Our models are trained on the 800 training images. In our experiment, since we focus on dealing with LR images with unpaired HR images, we take the first 400 HR images in the training set as training data for the HR image domain, and the LR images corresponding to the other half for the LR image domain. Models are evaluated on the whole validation set of 100 images. Due to the motion blur in the degraded images, input LR images and ground truth HR images are not well aligned. Following [NTIRE18], we ignore border pixels and report the most favorable score among images shifted from 0 to 40 pixels. For the two tracks in NTIRE 2018, a centered image patch is cropped to compute scores.
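The shift-tolerant scoring described above can be sketched as a search over integer offsets. This is an assumed reading of the protocol and not the official NTIRE scoring code: the sketch ignores a border as wide as the maximum shift, searches offsets in both directions along each axis, and operates on single-channel arrays.

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized arrays."""
    err = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if err == 0 else 10.0 * np.log10(peak ** 2 / err)

def best_shifted_psnr(sr, hr, max_shift=40, peak=255.0):
    """Best PSNR over integer shifts of sr, ignoring a border of max_shift pixels."""
    m = max_shift
    h, w = hr.shape[:2]
    assert h > 2 * m and w > 2 * m, "image too small for this shift range"
    ref = hr[m:h - m, m:w - m]              # central region of the ground truth
    best = -np.inf
    for dy in range(-m, m + 1):
        for dx in range(-m, m + 1):
            cand = sr[m + dy:h - m + dy, m + dx:w - m + dx]
            best = max(best, psnr(cand, ref, peak))
    return best
```

With `max_shift=40` the search evaluates 81 x 81 candidate alignments per image, which is affordable for a 100-image validation set.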
Network Architecture. Our framework is composed of two stages, i.e., unsupervised image translation and supervised super-resolution. For the first stage, we take the network of CycleGAN [Zhu-iccv17] for translation between real LR images and synthetic LR images; for the super-resolution network, we use different network architectures for different tracks. For track 2 of NTIRE 2017, we adopt a modified version of VDSR [Kim-cvpr16-vdsr] by removing the bicubic upsampling operation at the beginning, adding a pixel shuffling layer [Shi-cvpr16] at the end, and replacing each of the 20 convolutional layers with a BN-ReLU-Conv block. For tracks 2 and 4 of NTIRE 2018, we adopt SRResNet [Caballero-cvpr17] instead. The reason is that the two tracks of NTIRE 2018 suffer from severe motion blur with up to 40 pixels of shift and from more severe noise. Therefore, a deeper network with a larger receptive field is required.
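The pixel shuffling layer placed at the end of the modified VDSR rearranges channels into spatial positions, so that all convolutions can run at LR resolution. A minimal NumPy sketch of the rearrangement follows; the channel-first layout and channel ordering are assumptions chosen to match the common convention, and the function name is illustrative.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r)."""
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, row offset, col offset)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, row offset, w, col offset)
    return x.reshape(c, h * r, w * r)
```

For x4 super-resolution of a single-channel image, the last convolution would output 16 channels at LR resolution, which this operation turns into one channel at four times the height and width.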
Training details. We crop patches from each LR/HR training image and obtain around 33,000 patches in total. They are flipped and rotated on the fly for further data augmentation. In this work bicubic downsampling is used to generate synthetic LR images, and it can be replaced by other downsampling methods such as nearest-neighbor and bilinear downsampling. The mini-batch size is set to 32. The weights for loss terms in the CycleGAN are set to for all datasets. As for the weights for SR loss terms, and are respectively set to 1 and 0.05 for all datasets, while is dependent on degradation of the dataset, with 1e3 for Track 2 of NTIRE 2017 and 1e2 for the two tracks of NTIRE 2018. We use the Adam [Kingma-iclr15-adam] optimizer with , . The learning rate is set to and for the CycleGAN and SR network. After training for 100 epochs, the learning rate starts to decay linearly and reaches zero after another 100 epochs. To stabilize the adversarial training, we clip the gradient norm to at most 50. In addition, during training we first pretrain a CycleGAN on unpaired LR images and an SR network on synthetic LR-HR pairs, each for 5 epochs, and then jointly train the whole framework until 200 epochs. Pretraining provides a good initialization for the whole framework and helps stabilize the training.
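The learning rate schedule and the gradient norm restriction described above can be sketched in plain Python. The function names are illustrative, `base_lr` is a free parameter, epochs are assumed to be counted from zero, and the gradient is represented as a flat list of floats for simplicity.

```python
import math

def learning_rate(epoch, base_lr, constant_epochs=100, decay_epochs=100):
    """Constant for the first phase, then linear decay to zero."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(0.0, remaining / decay_epochs)

def clip_grad_norm(grads, max_norm=50.0):
    """Rescale a flat gradient so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradient and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads), total
    scale = max_norm / total
    return [g * scale for g in grads], total
```

For example, `learning_rate(150, base_lr)` returns half of `base_lr`, and a gradient `[60.0, 80.0]` with norm 100 is rescaled to `[30.0, 40.0]`.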
In this section, we compare the proposed method with several baseline methods on three datasets. Both quantitative and qualitative results are shown to demonstrate its effectiveness. We denote the proposed method trained with only MSE as CycleSR and the one trained with both MSE and perceptual quality oriented losses as CycleSRGAN.
Bicubic: bicubic interpolation works without supervision and can be applied to any case.
SR_syn: we train the same SR network on synthetic LR-HR image pairs and evaluate it on the datasets corresponding to the three tracks.
SR_paired: we also train the same SR network on real LR-HR image pairs, with an alignment preprocessing step for the shifted pixels caused by severe blur in the case of Track 2 and Track 4 of NTIRE 2018.
CycleGAN: we first resize an LR image to the target size and directly apply CycleGAN to the unpaired data.
Cycle+SR: we take the same network architecture as the proposed method but train the two stages separately.
We present PSNR and SSIM scores for each compared method in Table 1. As shown in Table 1, the method trained on synthetic pairs performs even worse than simple bicubic upsampling on all three datasets, which underlines the importance of studying image super-resolution with unpaired data. Among all methods, the best results on all three datasets are obtained by either the supervised method or ours. The supervised method outperforms ours on Track 2 of NTIRE 2017, while ours is better on both tracks of NTIRE 2018. That is, in the case of severe blur and heavy noise, the proposed CycleSR performs even better than the supervised method trained on pre-aligned input-target pairs. The supervised methods which rank high on the leaderboard of NTIRE 2018 also take image registration as a preprocessing step and use deeper network architectures. The best three methods for track 2 are nmhkahn and iim_lab with , and , and those for track 4 are xixihaha and yyuan13 with , and . Without any such preprocessing, the proposed CycleSR still achieves favorable performance against those top ranked methods. The method with the same network architecture but with the two stages trained separately also obtains reasonable performance, but performs worse than the one trained jointly. In addition, due to the flexibility of the proposed framework, it can be easily trained with perceptual quality oriented losses. The proposed method not only obtains high scores but also produces images of good perceptual quality, as discussed below. Overall, the proposed unsupervised method is better than the other unsupervised baselines and is competitive even with supervised ones.
As shown in Figure 2, the SR network trained on synthetic pairs only learns to sharpen textures but cannot deal with motion blur or noise, while the same network trained on pre-aligned real LR-HR pairs achieves much better visual results in terms of sharpening and denoising. The proposed CycleSR obtains performance comparable to SR_paired. However, SR_paired and similar supervised methods rely heavily on aligned real LR-HR pairs. In other words, the result of SR_paired is determined by the pre-alignment step, while our CycleSR does not need any preprocessing and still achieves favorable results. Compared to CycleGAN, the proposed CycleSRGAN is able to produce high quality images with fewer artifacts and less distortion. For all three examples in Figure 2, CycleGAN changes the color of the translated results and adds many nonexistent texture patterns. From the second example in the figure, we can see that CycleSRGAN produces an image even sharper than the ground truth HR image. That is because the HR image is taken with focus on the centered mushroom, and the pine cone in the image belongs to an out-of-focus region. Our method does not take depth and focus into consideration and tries to recover all blurred parts.
Influence of indirect supervised path:
We also investigate the influence of loss weights at the second stage on the training of the proposed method, especially for the CycleSR model. The weight of MSE loss at the second stage is chosen from . Figure 3 shows approximated LR images , i.e., intermediate output of the first stage, trained with different weights of MSE loss at the second stage. Since range of pixel values in varies with different , we normalize in terms of the mean and variance of to get for better visualization. When is small with , CycleGAN weighs more in the training of the whole framework. In this case, has a similar style to real images , but suffers from color drift like CycleGAN. The trained model only learns to reconstruct the correct HR image from a color-drifted LR image and would fail during the inference phase. As increases to , the color drift problem becomes less severe. The MSE loss at the second stage works as a regularizer to help translation. When it is too big with , the MSE loss at the second stage dominates the whole optimization process. Instead of generating an image similar to a real LR image, the generator at the first stage tries to generate an image from which the SR network at the second stage can more easily obtain a good reconstruction. The intermediate output looks like synthetic LR images rather than real LR images. Hence it would not work to reconstruct an HR image given a real LR image. We also evaluate these variants on NTIRE 2017 Track 2. The results for are respectively 26.262/0.735, 27.021/0.77, 18.337/0.599, which is consistent with the analysis above. It is important to balance the losses of the two stages of the proposed method for good performance.
In this work, we present a general framework for unsupervised image super-resolution, a setting which is closer to real scenarios. Instead of directly applying unsupervised image translation to address this task, we propose a novel approach which integrates cycled training and supervised training into one framework. Synthetic LR images are taken as a bridge and create an indirect supervised path from real LR images to HR images. We show that the proposed approach learns to super-resolve a real LR image without any corresponding HR images in the training dataset. It is flexible enough to integrate any existing deep learning based super-resolution model, including those trained with either fidelity losses or perceptual quality oriented losses. It is evaluated on image super-resolution challenge datasets and achieves favorable performance against supervised methods.