Edge-Informed Single Image Super-Resolution
The increasingly widespread use of digital imaging technologies has brought with it a growing demand for higher-resolution images. We develop a novel “edge-informed” approach to single image super-resolution (SISR). The SISR problem is reformulated as an image inpainting task. We use a two-stage inpainting model as a baseline for super-resolution and show its effectiveness for different scale factors ($2\times$, $4\times$, $8\times$) compared to basic interpolation schemes. This model is trained using a joint optimization of image contents (texture and color) and structures (edges). We include quantitative and qualitative comparisons with current state-of-the-art techniques. We show that our method of decoupling structure and texture reconstruction improves the quality of the final reconstructed high-resolution image.
Super-Resolution (SR) is the task of inferring a high-resolution (HR) image from one or more given low-resolution (LR) images. SR plays an important role in various image processing tasks, with direct applications in medical imaging, face recognition, satellite imaging, and surveillance [farsiu2004advances]. Many existing SR methods reconstruct the HR image by fusing multiple instances of an LR image captured from different perspectives. These are called Multi-Frame Super-Resolution methods [farsiu2004fast]. However, in most applications, only a single instance of the LR image is available from which the missing HR information must be recovered. Single-Image Super-Resolution (SISR) is a challenging ill-posed inverse problem [ebrahimi2007solving] that normally requires prior information to restrict the solution space [shi2016real].
We take inspiration from a recent image inpainting technique introduced by Nazeri \etal[nazeri2019edgeconnect] to propose a novel approach to Single-Image Super-Resolution that reformulates the problem as an in-between pixels inpainting task. Increasing the resolution of a given LR image requires recovering the pixel intensities between every two adjacent pixels. These missing intensities can be treated as the missing regions of an image inpainting problem. Our inpainting task is modelled as a two-stage process that separates structural inpainting from textural inpainting to ensure high-frequency information is preserved in the recovered HR image. The pipeline first creates a mask for every extra row and column that needs to be filled in the reconstruction of the HR image. The edge generation stage then focuses on “hallucinating” edges in the missing regions, and the image completion stage uses the hallucinated edges as prior information to estimate pixel intensities in the missing regions.
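As an illustration of this reformulation, the inpainting mask over the HR grid follows directly from the scale factor (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def inbetween_mask(h_lr, w_lr, scale):
    """Binary mask over the HR grid: 1 marks a missing in-between pixel
    to be inpainted, 0 marks a known pixel carried over from the LR image."""
    mask = np.ones((h_lr * scale, w_lr * scale), dtype=np.uint8)
    mask[::scale, ::scale] = 0  # known LR samples fall on every scale-th site
    return mask
```

For a $2\times$ upscale, three out of every four HR pixels are masked as missing; for $4\times$, fifteen out of sixteen.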
2 Related Work
Many approaches to SISR have been presented in the literature. These methods have been extensively organized by type according to their image priors in a study by Yang \etal[yang2014single]. Prediction models generate HR images through predefined mathematical functions; examples include bilinear and bicubic interpolation [de1978practical] and Lanczos resampling [duchon1979lanczos]. Edge-based methods learn priors from features such as the width of an edge [fattal2007image] or the parameters of a gradient profile [sun2008image] to reconstruct the HR image. Statistical methods exploit image properties such as the gradient distribution [shan2008fast] to predict HR images. Patch-based methods use exemplar patches from external datasets [chang2004super, freeman2002example] or from the image itself [irani2009super, freedman2011image] to learn mapping functions from LR to HR.
Deep learning-based methods have achieved great performance on SISR using deep convolutional neural networks (CNNs) with a per-pixel Euclidean loss [shi2016real, dong2014learning, kim2016accurate]. Euclidean loss, however, is less effective at reconstructing high-frequency structures such as edges and textures. Recently, Johnson \etal[johnson2016perceptual] proposed a feed-forward CNN trained with a perceptual loss. In particular, they used a pre-trained VGG network [simonyan2014very] to extract high-level features from an image, effectively separating content and style. Their model was trained with a joint optimization of feature reconstruction loss and style reconstruction loss and achieved state-of-the-art results on SISR for challenging magnification factors. To encourage spatial smoothness and mitigate the checkerboard artifact [odena2016deconvolution] associated with the feature reconstruction loss, they added total variation regularization [rudin1992nonlinear] to their model objective. Sajjadi \etal[sajjadi2017enhancenet] proposed using style loss in a patch-wise fashion to reduce the checkerboard artifact and enforce locally similar textures between the HR and ground truth images. They also used an adversarial loss to produce sharp results and further improve SISR quality. Adversarial loss has also been shown to be very effective at producing realistically synthesized high-frequency textures for SISR [ledig2017photo, haris2018deep, park2018srfeat]; however, the results of these GAN-based approaches tend to include meaningless high-frequency noise around the edges that is unrelated to the input image [park2018srfeat]. Our work is inspired by the model proposed by Liu \etal[Liu_2018_ECCV], which extended an image inpainting framework to image super-resolution by offsetting pixels and inserting holes. We present a SISR model that simultaneously improves structure, texture, and color to generate a photo-realistic high-resolution image.
We propose a Single Image Super-Resolution framework based on a two-stage adversarial model [goodfellow2014generative] consisting of an edge enhancement step and an image completion step. Each step has its own generator/discriminator pair, which decouples SISR into two separate problems, i.e., structure and texture. Let $G_1$ and $D_1$ be the generator and discriminator for the edge enhancement step, and $G_2$ and $D_2$ be the generator and discriminator for the image completion step. Both generators are built from encoders that downsample twice, followed by eight residual blocks [he2016deep], and decoders that upsample back to the original input size. We use dilated convolutions in our residual layers. Our generators follow an architecture similar to the method proposed by Johnson \etal[johnson2016perceptual], shown to achieve superior results for super-resolution [sajjadi2017enhancenet, gondal2018unreasonable], image-to-image translation [zhu2017unpaired], and style transfer. Our discriminators follow the architecture of a PatchGAN [isola2017image, zhu2017unpaired], which classifies overlapping image patches as real or fake. We use instance normalization [ulyanov2017improved] across all layers of the network; it normalizes across the spatial dimensions and yields qualitatively superior images during training and at test time.
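A PyTorch sketch of this generator architecture follows. Only the overall structure is taken from the text (downsample twice, eight dilated residual blocks, upsample back, with instance normalization); the channel widths, kernel sizes, and dilation rate are our assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Dilated residual block used in the middle of both generators.
    def __init__(self, ch, dilation=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
            nn.InstanceNorm2d(ch, affine=True),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.InstanceNorm2d(ch, affine=True),
        )

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    # Encoder (downsamples twice) -> 8 residual blocks -> decoder
    # (upsamples back to the original input size).
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 7, padding=3),
                  nn.InstanceNorm2d(base, affine=True), nn.ReLU(True)]
        ch = base
        for _ in range(2):  # two stride-2 downsampling convolutions
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2, affine=True), nn.ReLU(True)]
            ch *= 2
        layers += [ResidualBlock(ch) for _ in range(8)]
        for _ in range(2):  # two stride-2 upsampling (transpose) convolutions
            layers += [nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(ch // 2, affine=True), nn.ReLU(True)]
            ch //= 2
        layers += [nn.Conv2d(ch, out_ch, 7, padding=3)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

Because all strides cancel, the output has the same spatial size as the input, which is what lets the same network serve different zooming factors once the input has been resized.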
3.1 Edge Enhancement
Our edge enhancement stage boosts the edges obtained from a low-resolution image to yield a high-resolution edge map. Let $\mathbf{I}_{lr}$ and $\mathbf{I}_{hr}$ be the low-resolution and high-resolution images. Their corresponding edge maps are denoted $\mathbf{E}_{lr}$ and $\mathbf{E}_{hr}$ respectively, and $\mathbf{I}_{gray}$ is a grayscale counterpart of the low-resolution image. We add a nearest-neighbor interpolation module at the beginning of the network to resize the low-resolution image and its Canny edge map to the same size as the HR image; we denote the resized versions $\tilde{\mathbf{I}}_{gray}$ and $\tilde{\mathbf{E}}_{lr}$. The edge enhancement network predicts the high-resolution edge map
$$\mathbf{E}_{pred} = G_1\big(\tilde{\mathbf{I}}_{gray}, \tilde{\mathbf{E}}_{lr}\big),$$
where $\tilde{\mathbf{I}}_{gray}$ and $\tilde{\mathbf{E}}_{lr}$ are the inputs to the network. The hinge variant [miyato2018spectral] of the adversarial loss objective over the generator and discriminator is defined as
$$\mathcal{L}_{D_1} = -\mathbb{E}\big[\min\big(0, -1 + D_1(\mathbf{E}_{hr}, \tilde{\mathbf{I}}_{gray})\big)\big] - \mathbb{E}\big[\min\big(0, -1 - D_1(\mathbf{E}_{pred}, \tilde{\mathbf{I}}_{gray})\big)\big],$$
$$\mathcal{L}_{adv,1} = -\mathbb{E}\big[D_1(\mathbf{E}_{pred}, \tilde{\mathbf{I}}_{gray})\big].$$
We also include a feature matching loss objective [wang2018high] for our edge enhancement generator, which compares activation maps in the intermediate layers of the discriminator. This stabilizes the training process by forcing the generator to produce results with representations similar to those of real images. Perceptual loss [johnson2016perceptual, gatys2016image, gatys2015texture] is known to accomplish the same goal using a pretrained VGG network; however, since the VGG network is not trained to produce edge information, it fails to capture the edge structure we seek in this initial stage. The feature matching loss is defined as
$$\mathcal{L}_{FM} = \mathbb{E}\Big[\sum_{i} \frac{1}{N_i} \big\| D_1^{(i)}(\mathbf{E}_{hr}) - D_1^{(i)}(\mathbf{E}_{pred}) \big\|_1\Big],$$
where $N_i$ is the number of elements in the $i$'th activation layer, and $D_1^{(i)}$ is the activation of the $i$'th layer of the discriminator. Spectral normalization (SN) [miyato2018spectral] further stabilizes training by scaling down weight matrices by their respective largest singular values, effectively restricting the Lipschitz constant of the network to one. Although SN was originally proposed for the discriminator only, recent works [zhang2018self, odena2018generator] suggest that the generator also benefits from SN, which suppresses sudden changes in parameter and gradient values. We apply SN to both the generator and the discriminator. The final joint loss objective for $G_1$, with regularization parameters $\lambda_{adv,1}$ and $\lambda_{FM}$, thus becomes
$$\mathcal{L}_{G_1} = \lambda_{adv,1}\,\mathcal{L}_{adv,1} + \lambda_{FM}\,\mathcal{L}_{FM},$$
where we choose $\lambda_{adv,1} = 1$ and keep $\lambda_{FM}$ fixed for all experiments.
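The feature matching term and spectral normalization can be sketched in PyTorch as follows (a minimal sketch; we use the default mean reduction per layer in place of the explicit $1/N_i$ normalization, and the layer shown is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

def feature_matching_loss(real_feats, fake_feats):
    # L1 distance between discriminator activations for real and
    # generated inputs, averaged over the selected layers.
    total = sum(F.l1_loss(ff, fr.detach())
                for fr, ff in zip(real_feats, fake_feats))
    return total / len(real_feats)

# Spectral normalization divides a layer's weight by its largest singular
# value, keeping the layer approximately 1-Lipschitz; here it is applied
# to a single (hypothetical) convolution of the kind used in PatchGAN.
sn_conv = spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1))
```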
3.2 Image Completion
The image completion stage upscales the LR image to an incomplete HR image, used as input to $G_2$, by means of a fixed fractionally strided convolution kernel. This has the effect of adding empty rows and columns in between pixels. To offset the pixels and increase the size of an image by a factor of $s$, we use a fixed convolution kernel with a fractional stride of $1/s$. Let $\mathbf{k}$ denote this fixed kernel and $\mathbf{I}^{+}_{lr}$ represent the high-resolution image being constructed by offsetting the pixels of the LR image.
The HR image is then generated using $G_2$:
$$\mathbf{I}_{pred} = G_2\big(\mathbf{I}^{+}_{lr}, \mathbf{E}_{pred}\big).$$
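The fixed pixel-offsetting step is equivalent to zero insertion on a strided grid, which can be sketched without an explicit convolution (a NumPy sketch; the helper name is ours):

```python
import numpy as np

def offset_pixels(lr, scale):
    """Place LR pixels on an empty HR grid, leaving zero-valued rows and
    columns in between -- the regions the inpainting generator must fill.
    Equivalent to a fixed transpose convolution with a one-hot kernel."""
    h, w = lr.shape
    hr = np.zeros((h * scale, w * scale), dtype=lr.dtype)
    hr[::scale, ::scale] = lr
    return hr
```

The zero-valued sites coincide exactly with the inpainting mask described earlier, which is what makes the SISR-as-inpainting formulation consistent across scales.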
We include style loss and perceptual loss [gatys2016image, johnson2016perceptual] in our joint loss objective to further supplement training. Perceptual loss minimizes the Manhattan ($\ell_1$) distance between feature maps generated from intermediate layers of VGG-19 trained on the ImageNet dataset [russakovsky2015imagenet], encouraging predictions that are perceptually similar to the ground truth labels. Perceptual loss is defined as
$$\mathcal{L}_{perc} = \mathbb{E}\Big[\sum_{i} \frac{1}{N_i} \big\| \phi_i(\mathbf{I}_{hr}) - \phi_i(\mathbf{I}_{pred}) \big\|_1\Big],$$
where $N_i$ is the number of elements in the $i$'th activation $\phi_i$ of VGG-19. While perceptual loss encourages perceptual similarity between ground truth images and predictions, style loss encourages texture similarity by minimizing the Manhattan distance between the Gram matrices of the intermediate feature maps. The Gram matrix of feature map $\phi_i$ is denoted $G^{\phi}_i$ [gatys2016image] and captures texture, shape, and style information by correlating features across spatial locations. Style loss is defined as
$$\mathcal{L}_{style} = \mathbb{E}\Big[\sum_{i} \big\| G^{\phi}_i(\mathbf{I}_{hr}) - G^{\phi}_i(\mathbf{I}_{pred}) \big\|_1\Big].$$
Style loss was shown by Sajjadi \etal[sajjadi2017enhancenet] to successfully mitigate the “checkerboard” artifact caused by transpose convolutions [odena2016deconvolution]. For both style and perceptual loss we extract feature maps from four intermediate activation layers of VGG-19. We do not use feature matching loss in the image completion stage: while feature matching loss acts as a regularizer for the adversarial loss in the edge generator, the perceptual loss used in this stage has the same effect and has been shown to be more effective for image generation tasks [nazeri2019edgeconnect, sajjadi2017enhancenet, johnson2016perceptual]. Thus the complete joint loss objective is
$$\mathcal{L}_{G_2} = \lambda_{\ell_1}\,\mathcal{L}_{\ell_1} + \lambda_{adv,2}\,\mathcal{L}_{adv,2} + \lambda_{perc}\,\mathcal{L}_{perc} + \lambda_{style}\,\mathcal{L}_{style}.$$
In all of our experiments we train with fixed regularization parameters $\lambda_{\ell_1}$, $\lambda_{adv,2}$, $\lambda_{perc}$, and $\lambda_{style}$, chosen to jointly minimize the reconstruction, adversarial, perceptual, and style losses and generate a photo-realistic high-resolution image.
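Given activation maps from a pretrained VGG-19 (not shown here), the perceptual and style terms reduce to L1 distances on the raw features and on their Gram matrices, respectively (a minimal PyTorch sketch using our own Gram normalization convention):

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (B, C, H, W) activation map; returns (B, C, C) Gram matrices
    # of channel-wise inner products, normalized by C*H*W.
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def perceptual_and_style(feats_pred, feats_gt):
    """L1 distances between raw activations (perceptual loss) and between
    their Gram matrices (style loss), summed over the chosen VGG layers."""
    l_perc = sum(F.l1_loss(p, g) for p, g in zip(feats_pred, feats_gt))
    l_style = sum(F.l1_loss(gram_matrix(p), gram_matrix(g))
                  for p, g in zip(feats_pred, feats_gt))
    return l_perc, l_style
```

Because the Gram matrix sums over all spatial positions, the style term compares texture statistics rather than pixel locations, which is why it enforces locally similar textures without penalizing small misalignments.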
4.1 Training Setup
To train $G_1$, we generate edge maps using the Canny edge detector [canny1986computational]. We can control the level of detail in the LR edge map by changing the Gaussian filter smoothing parameter $\sigma$; for our purposes, we found a single fixed value of $\sigma$ yields the best results. All of our experiments are implemented in PyTorch, with the HR images fixed at $256 \times 256$ and the LR input scaled accordingly based on the zooming factor. We use a batch size of eight during training. The models of both stages were optimized using the Adam optimizer [kingma2014adam]. In our experiments, we did not find any improvement from jointly optimizing $G_1$ and $G_2$; joint optimization also limits us to a smaller batch size due to its larger memory footprint. Hence the generators of the two stages are trained separately. We train $G_1$ with Canny edges until the loss plateaus, then lower the learning rate and continue training until convergence. We then freeze the weights of $G_1$ and train $G_2$ with the same learning-rate schedule.
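The LR edge maps used to train $G_1$ can be produced with an off-the-shelf Canny implementation whose `sigma` argument plays the role of the smoothing parameter above (a sketch using scikit-image; the $\sigma$ value shown is purely illustrative, not the value used in the paper):

```python
import numpy as np
from skimage.feature import canny

def lr_edge_map(gray_lr, sigma=1.0):
    # Boolean Canny edge map of a grayscale LR image; larger sigma
    # smooths more aggressively and yields fewer, coarser edges.
    return canny(gray_lr, sigma=sigma)
```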
Our proposed models are evaluated on the following publicly available datasets.
Celeb-HQ [karras2018progressive]. High-quality version of the CelebA dataset with 30K images.
Places2 [zhou2017places]. More than 10 million images comprising 400+ unique scene categories.
Set5, Set14, BSDS100, Urban100 [huang2015single]. Standard SISR evaluation datasets.
Results are compared against the current state-of-the-art methods both qualitatively and quantitatively.
4.3 Qualitative Evaluation
Figures 5 and 6 show results of the proposed SISR method for scale factors of $2\times$ and $4\times$ respectively. For visualization purposes, the LR image is resized using nearest-neighbor interpolation. All HR images are cropped at $256 \times 256$, which means the LR images are $128 \times 128$ and $64 \times 64$ for scale factors of $2\times$ and $4\times$ respectively. We obtain the LR images by blurring the HR image with a Gaussian kernel and then downsampling by the corresponding zooming factor. The results are compared against bicubic interpolation and, as a baseline, our proposed model without the edge generation network. Despite achieving nearly identical PSNR/SSIM, the baseline model produces blurry results around the edges, while our full model (with edge maps) remains faithful to the high-frequency edge data and produces sharp, photorealistic images.
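The degradation used to synthesize LR inputs, Gaussian blur followed by decimation, can be sketched as follows (a SciPy/NumPy sketch; the kernel width `sigma` is illustrative, as the paper's exact value is not reproduced here):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, scale, sigma=1.0):
    """Synthesize an LR image: Gaussian blur (anti-aliasing), then
    decimate by taking every `scale`-th pixel along each axis."""
    blurred = gaussian_filter(hr, sigma=sigma)
    return blurred[::scale, ::scale]
```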
4.4 Quantitative Evaluation
We evaluate our model using PSNR and SSIM for $2\times$, $4\times$, and $8\times$ SISR scale factors. Table 1 shows the performance of our model against bicubic interpolation and current state-of-the-art SISR models on Set5, Set14, BSDS100, and Celeb-HQ. Statistics for competing models at $2\times$ and $4\times$ SR were obtained from their respective papers where available. Results for the challenging $8\times$ case are compared against bicubic interpolation only. Note that the PSNR of our results is lower than that of competing models. In particular, EDSR by Lim \etal[lim2017enhanced] achieves the best PSNR on every dataset. However, their model is trained only with a per-pixel loss and fails to reconstruct sharp edges despite its higher PSNR. Similar findings in recent research [johnson2016perceptual, sajjadi2017enhancenet] show that PSNR favors smooth or blurry results.
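PSNR's preference for smooth results follows directly from its definition as a monotone function of the per-pixel MSE:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    # Peak signal-to-noise ratio in dB. Because it depends only on the
    # mean squared error, it rewards low average error, which smooth or
    # blurry reconstructions achieve even without sharp edges.
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```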
4.5 Accuracy of Edge Generator
Table 2 shows the accuracy of our edge enhancer on the Celeb-HQ and Places2 datasets for the Single Image Super-Resolution task. We measure precision and recall for various SISR scale factors. In all experiments, the width $\sigma$ of the Gaussian smoothing filter for Canny edge detection is held fixed at the value used during training.
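Precision and recall over binary edge maps reduce to simple pixel counts (a NumPy sketch; the helper name is ours):

```python
import numpy as np

def edge_precision_recall(pred, gt):
    """Precision: fraction of predicted edge pixels that are true edges.
    Recall: fraction of ground-truth edge pixels that are predicted."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)  # guard against empty predictions
    recall = tp / max(gt.sum(), 1)
    return precision, recall
```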
Figure 7 shows results of the edge prediction stage. HR images are cropped at $256 \times 256$ and, for visualization purposes, the LR image and its edge map are resized using nearest-neighbor interpolation.
5 Discussion and Future Work
We propose a new structure-driven deep learning model for Single Image Super-Resolution (SISR) by recasting the problem as an in-between pixels inpainting task. One benefit of this approach over most deep-learning-based SISR models is that a single unified model can be used for different SISR zooming scales. Most deep-learning-based SISR models take the LR image as input and generate the HR image through in-network upsampling layers for a given zooming factor; for each zooming factor, a different network architecture and training procedure are required. Our model, on the other hand, takes the LR image and adds empty space between pixels before using it as input to the network. It learns to fill in the missing pixels by relying on available edge information to create the high-resolution image, effectively sharing parameters across different SISR scales. Quantitative results show the effectiveness of the structure-guided inpainting model for the SISR problem, where it achieves competitive results on standard benchmarks.
One shortcoming of the proposed inpainting-based SISR model is that it requires two disjoint optimization procedures. A better approach would be to incorporate the edge generation stage into the inpainting model's objective. Such a model could be trained with a joint optimization of image contents and structures, and could potentially be more computationally efficient than the disjoint two-stage algorithm while preserving sharp details of the image.
Our method points to an interesting direction and raises the question of what other information could be learned from the original dataset to aid the super-resolution process. Our source code is available at:
This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.