Edge-Informed Single Image Super-Resolution


Kamyar Nazeri, Harrish Thasarathan,  and  Mehran Ebrahimi
University of Ontario Institute of Technology, Canada
kamyar.nazeri@uoit.ca    harrish.thasarathan@uoit.net     mehran.ebrahimi@uoit.ca
http://www.ImagingLab.ca
Abstract

The growing use of digital imaging technologies has brought with it a demand for higher-resolution images. We develop a novel “edge-informed” approach to single image super-resolution (SISR). The SISR problem is reformulated as an image inpainting task. We use a two-stage inpainting model as a baseline for super-resolution and show its effectiveness for different scale factors (×2, ×4, ×8) compared to basic interpolation schemes. This model is trained using a joint optimization of image contents (texture and color) and structures (edges). Quantitative and qualitative comparisons are included and the proposed model is compared with current state-of-the-art techniques. We show that our method of decoupling structure and texture reconstruction improves the quality of the final reconstructed high-resolution image.

1 Introduction

Super-Resolution (SR) is the task of inferring a high-resolution (HR) image from one or more given low-resolution (LR) images. SR plays an important role in various image processing tasks with direct applications in medical imaging, face recognition, satellite imaging, and surveillance [farsiu2004advances]. Many existing SR methods reconstruct the HR image by fusing multiple instances of an LR image taken from different perspectives. These are called Multi-Frame Super-Resolution methods [farsiu2004fast]. However, in most applications, only a single instance of the LR image is available from which missing HR information needs to be recovered. Single-Image Super-Resolution (SISR) is a challenging ill-posed inverse problem [ebrahimi2007solving] that normally requires prior information to restrict the solution space of the problem [shi2016real].

We take inspiration from a recent image inpainting technique introduced by Nazeri \etal[nazeri2019edgeconnect] to propose a novel approach to Single-Image Super-Resolution by reformulating the problem as an in-between pixels inpainting task. Increasing the resolution of a given LR image requires recovery of pixel intensities in between every two adjacent pixels. The missing pixel intensities can be considered as missing regions of an image inpainting problem. Our inpainting task is modelled as a two-stage process that separates structural inpainting and textural inpainting to ensure high-frequency information is preserved in the recovered HR image. The pipeline involves first creating a mask for every extra row and column that needs to be filled in the reconstruction of the HR image. The edge generation stage then focuses on “hallucinating” edges in the missing regions, and the image completion stage uses the hallucinated edges as prior information to estimate pixel intensities in the missing regions.

Figure 1: Schematic illustration of the super-resolution problem. (a) The ground-truth image. (b) The image downsampled by a factor of two; each four-pixel block on the left turns into one pixel in the middle, so the structure and orientation of edges can no longer be distinguished and the problem is ill-posed. (c) The reconstruction of a high-resolution image from the one-pixel blocks using bilinear interpolation; most distinctive features of the original image are lost and the result is blurry around the edges.
Figure 2: An illustration of the proposed inpainting-based method for SISR. (a) The original LR image. (b) Upsampling by a factor of two corresponds to interpolating one pixel between every two adjacent pixels; we add an extra empty row and column for every row and column of the ground-truth image (shown in gray), which we fill by an inpainting process. (c) Upsampling by a factor of four corresponds to interpolating three pixels between every two adjacent pixels, where three extra empty rows and columns are added for every row and column of the ground-truth image and then inpainted.

2 Related Work

Many approaches to SISR have been presented in the literature. These methods have been extensively organized by type according to their image priors in a study by Yang \etal[yang2014single]. Prediction models generate HR images through predefined mathematical functions. Examples include bilinear and bicubic interpolation [de1978practical], and Lanczos resampling [duchon1979lanczos]. Edge-based methods learn priors from features such as the width of an edge [fattal2007image] or the parameters of a gradient profile [sun2008image] to reconstruct the HR image. Statistical methods exploit different image properties such as the gradient distribution [shan2008fast] to predict HR images. Patch-based methods use exemplar patches from external datasets [chang2004super, freeman2002example] or the image itself [irani2009super, freedman2011image] to learn mapping functions from LR to HR.

Deep learning-based methods have achieved great performance on SISR using deep convolutional neural networks (CNNs) with a per-pixel Euclidean loss [shi2016real, dong2014learning, kim2016accurate]. Euclidean loss, however, is less effective at reconstructing high-frequency structures such as edges and textures. Recently, Johnson \etal[johnson2016perceptual] proposed a feed-forward CNN trained with a perceptual loss. In particular, they used a pre-trained VGG network [simonyan2014very] to extract high-level features from an image, effectively separating content and style. Their model was trained with a joint optimization of a feature reconstruction loss and a style reconstruction loss and achieved state-of-the-art results on SISR for a challenging magnification factor. To encourage spatial smoothness and mitigate the checkerboard artifact [odena2016deconvolution] of the feature reconstruction loss, they introduced total variation regularization [rudin1992nonlinear] into their model objective. Sajjadi \etal[sajjadi2017enhancenet] proposed to use the style loss in a patch-wise fashion to reduce the checkerboard artifact and enforce locally similar textures between the HR and ground-truth images. They also used an adversarial loss to produce sharp results and further improve SISR quality. Adversarial loss has also been shown to be very effective in producing realistically synthesized high-frequency textures for SISR [ledig2017photo, haris2018deep, park2018srfeat]; however, the results of these GAN-based approaches tend to include less meaningful high-frequency noise around the edges that is unrelated to the input image [park2018srfeat]. Our work herein is inspired by the model proposed by Liu \etal[Liu_2018_ECCV], which extended an image inpainting framework to image super-resolution by offsetting pixels and inserting holes. We present a SISR model that simultaneously improves structure, texture, and color to generate a photo-realistic high-resolution image.

3 Model

We propose a single image super-resolution framework based on a two-stage adversarial model [goodfellow2014generative] consisting of an edge enhancement step and an image completion step. Each step has its own generator/discriminator pair, decoupling SISR into two separate problems, i.e., structure and texture. Let G_1 and D_1 be the generator and discriminator for the edge enhancement step, and G_2 and D_2 be the generator and discriminator for the image completion step. Our edge enhancement and image completion generators are built from encoders that downsample twice, followed by eight residual blocks [he2016deep], and decoders that upsample back to the original input size. We use dilated convolutions in our residual layers. Our generators follow architectures similar to the method proposed by Johnson \etal[johnson2016perceptual], shown to achieve superior results for super-resolution [sajjadi2017enhancenet, gondal2018unreasonable], image-to-image translation [zhu2017unpaired], and style transfer. Our discriminators follow the architecture of a PatchGAN [isola2017image, zhu2017unpaired], which classifies overlapping image patches as real or fake. We use instance normalization [ulyanov2017improved] across all layers of the network, which normalizes across the spatial dimensions to generate qualitatively superior images during training and at test time.
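To make the layout concrete, the following PyTorch sketch shows a generator with two downsampling stages, eight dilated residual blocks, and two upsampling stages, plus a spectrally normalized PatchGAN discriminator. Channel widths, kernel sizes, reflection padding, and the sigmoid output are illustrative assumptions rather than the authors' exact configuration; for the edge model the generator input would be the 2-channel (grayscale + edge) stack, and for the image completion model the 4-channel (RGB + edge) stack.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Dilated residual block used in the middle of the generator."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(dilation),
            nn.Conv2d(channels, channels, 3, dilation=dilation),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, 3),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)


class Generator(nn.Module):
    """Encoder (downsamples twice) -> 8 dilated residual blocks -> decoder."""
    def __init__(self, in_channels, out_channels, base=64, n_residual=8):
        super().__init__()
        layers = [
            nn.ReflectionPad2d(3),
            nn.Conv2d(in_channels, base, 7),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4), nn.ReLU(inplace=True),
        ]
        layers += [ResidualBlock(base * 4) for _ in range(n_residual)]
        layers += [
            nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(3),
            nn.Conv2d(base, out_channels, 7),
        ]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return torch.sigmoid(self.model(x))


class PatchDiscriminator(nn.Module):
    """PatchGAN: classifies overlapping patches as real/fake via a logit map.
    Spectral normalization is applied to every convolution (Section 3.1)."""
    def __init__(self, in_channels, base=64):
        super().__init__()
        def block(cin, cout, stride):
            return [nn.utils.spectral_norm(nn.Conv2d(cin, cout, 4, stride, 1)),
                    nn.LeakyReLU(0.2, inplace=True)]
        self.model = nn.Sequential(
            *block(in_channels, base, 2),
            *block(base, base * 2, 2),
            *block(base * 2, base * 4, 2),
            *block(base * 4, base * 8, 1),
            nn.Conv2d(base * 8, 1, 4, padding=1),  # per-patch real/fake logits
        )

    def forward(self, x):
        return self.model(x)
```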

Figure 3: Summary of our proposed method. G_1 takes a low-resolution grayscale image and its corresponding low-resolution edge map, both interpolated to the desired high-resolution image size, and outputs a high-resolution edge map. G_2 takes the high-resolution edge map generated by G_1 as well as an incomplete HR image created by offsetting the pixels of the ground-truth LR image using a fixed fractionally strided convolution kernel. The output is the high-resolution image.

3.1 Edge Enhancement

Our edge enhancement stage boosts the edges obtained from a low-resolution image to yield a high-resolution edge map. Let I_l and I_h be the low-resolution and high-resolution images. Their corresponding edge maps are denoted E_l and E_h respectively, and I_{l,gray} is a grayscale counterpart of the low-resolution image. We add a nearest-neighbor interpolation module at the beginning of the network to resize the low-resolution image and its Canny edge map to the same size as the HR image; interpolated quantities are denoted with a tilde. The edge enhancement network predicts the high-resolution edge map

E_{pred} = G_1(\tilde{I}_{l,gray}, \tilde{E}_l)   (1)

where \tilde{I}_{l,gray} and \tilde{E}_l, the interpolated grayscale LR image and its edge map, are the inputs to the network. The hinge variant [miyato2018spectral] of the adversarial objectives over the generator and discriminator is defined as

L_{adv,1} = -\mathbb{E}\big[D_1(E_{pred})\big]   (2)
L_{D_1} = \mathbb{E}\big[\max(0, 1 - D_1(E_h))\big] + \mathbb{E}\big[\max(0, 1 + D_1(E_{pred}))\big]   (3)
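As a concrete reference, the hinge objectives of Equations 2 and 3 can be written in a few lines of PyTorch. The PatchGAN discriminator outputs a map of per-patch logits, and the losses simply average over it; this is a generic sketch of the hinge formulation, not code from the authors' implementation.

```python
import torch.nn.functional as F

def discriminator_hinge_loss(real_logits, fake_logits):
    # Eq. (3): push real patch logits above +1 and fake patch logits below -1.
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()

def generator_hinge_loss(fake_logits):
    # Eq. (2): the generator maximizes the discriminator's score on its outputs.
    return -fake_logits.mean()
```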

We also add a feature matching loss [wang2018high] to our edge enhancement generator's objective, which compares activation maps in the intermediate layers of the discriminator. This stabilizes the training process by forcing the generator to produce results with representations similar to those of real images. Perceptual loss [johnson2016perceptual, gatys2016image, gatys2015texture] is known to accomplish the same task using a pretrained VGG network. However, since the VGG network is not trained to produce edge information, it fails to capture the result that we seek in this initial stage. The feature matching loss is defined as

L_{FM} = \mathbb{E}\Big[\sum_{i=1}^{L} \frac{1}{N_i} \big\| D_1^{(i)}(E_h) - D_1^{(i)}(E_{pred}) \big\|_1 \Big]   (4)

where N_i is the number of elements in the i'th activation layer, L is the number of layers used, and D_1^{(i)} is the activation map of the i'th layer of the discriminator. Spectral normalization (SN) [miyato2018spectral] further stabilizes training by scaling down weight matrices by their respective largest singular values, effectively restricting the Lipschitz constant of the network to one. Although SN was originally proposed only for the discriminator, recent works [zhang2018self, odena2018generator] suggest that the generator can also benefit from it by suppressing sudden changes in parameter and gradient values. We apply SN to both the generator and the discriminator. The final joint loss objective for G_1, with regularization parameters \lambda_{adv} and \lambda_{FM}, thus becomes

L_{G_1} = \lambda_{adv} L_{adv,1} + \lambda_{FM} L_{FM}   (5)

where we choose \lambda_{adv} = 1 and keep \lambda_{FM} fixed for all experiments.
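A minimal sketch of the feature-matching term and the joint objective of Equation 5 is given below. The `return_features` flag on the discriminator is a hypothetical interface (the PatchGAN sketch above would need to expose its intermediate activations), and `lambda_adv` / `lambda_fm` stand for the regularization weights.

```python
import torch.nn.functional as F

def feature_matching_loss(real_feats, fake_feats):
    """Eq. (4): mean L1 distance between discriminator activations computed
    on real and generated edge maps, averaged over the selected layers."""
    loss = 0.0
    for real, fake in zip(real_feats, fake_feats):
        loss = loss + F.l1_loss(fake, real.detach())
    return loss / len(real_feats)

# Hypothetical usage, assuming D1 can return its intermediate activations:
#   fake_logits, fake_feats = d1(e_pred, return_features=True)
#   real_logits, real_feats = d1(e_hr,  return_features=True)
#   g1_loss = lambda_adv * generator_hinge_loss(fake_logits) \
#           + lambda_fm * feature_matching_loss(real_feats, fake_feats)   # Eq. (5)
```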

3.2 Image Completion

The image completion stage upscales the LR image into an incomplete HR image, used as input to G_2, by applying a fixed fractionally strided convolution kernel. This has the effect of adding empty rows and columns in between the pixels. To offset the pixels and increase the size of an image by a factor of s, we use an s × s convolution kernel with a stride of 1/s. Let K denote this fixed strided convolution kernel and \tilde{I}_h represent the incomplete high-resolution image constructed by offsetting the pixels of the LR image.

Figure 4: Fixed fractionally strided convolution kernels used to offset the pixels of the LR image and create an incomplete HR image, shown for two SISR scale factors.
\tilde{I}_h = I_l \ast_{\uparrow s} K   (6)

where \ast_{\uparrow s} denotes the fractionally strided (transposed) convolution that upsamples by a factor of s.
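This fixed kernel can be realized as a transposed convolution with frozen weights. The PyTorch sketch below is our own illustration of the operation in Equation 6, not the authors' code: each LR pixel is copied to one HR grid location and the in-between rows and columns are left as zeros to be inpainted.

```python
import torch
import torch.nn.functional as F

def offset_pixels(lr, scale):
    """Spread LR pixels onto an HR grid via a fixed transposed convolution,
    leaving zero-valued gaps between them (cf. Eq. 6)."""
    channels = lr.shape[1]
    # s x s kernel with a single 1: one HR location per LR pixel, rest zeros.
    kernel = torch.zeros(channels, 1, scale, scale, dtype=lr.dtype, device=lr.device)
    kernel[:, 0, 0, 0] = 1.0
    return F.conv_transpose2d(lr, kernel, stride=scale, groups=channels)

# Example: a (1, 3, 64, 64) LR tensor becomes a (1, 3, 128, 128) incomplete HR
# tensor for scale=2, with zeros in every other row and column.
lr = torch.rand(1, 3, 64, 64)
hr_incomplete = offset_pixels(lr, scale=2)
```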

The HR image is then generated using G_2:

I_{pred} = G_2(\tilde{I}_h, E_{pred})   (7)

We proceed to train G_2 with another joint loss consisting of an \ell_1 loss, a hinge adversarial loss, a perceptual loss, and a style loss. The hinge variant of the adversarial loss follows Equations 2 and 3:

L_{adv,2} = -\mathbb{E}\big[D_2(I_{pred})\big]   (8)
L_{D_2} = \mathbb{E}\big[\max(0, 1 - D_2(I_h))\big] + \mathbb{E}\big[\max(0, 1 + D_2(I_{pred}))\big]   (9)

We include style loss and perceptual loss [gatys2016image, johnson2016perceptual] in our joint loss objective to further supplement training. The perceptual loss minimizes the Manhattan (\ell_1) distance between feature maps generated from intermediate layers of a VGG-19 network trained on the ImageNet dataset [russakovsky2015imagenet]. This encourages predictions that are perceptually similar to the ground truth. The perceptual loss is defined as

L_{perc} = \mathbb{E}\Big[\sum_{j} \frac{1}{N_j} \big\| \phi_j(I_h) - \phi_j(I_{pred}) \big\|_1 \Big]   (10)

where \phi_j is the feature map of the j'th selected activation layer of VGG-19 and N_j is its number of elements. While the perceptual loss encourages perceptual similarity between ground-truth images and predictions, the style loss encourages texture similarity by minimizing the Manhattan distance between the Gram matrices of the intermediate feature maps. The Gram matrix of feature map \phi_j is denoted G_j^{\phi} [gatys2016image] and captures the spatial statistics of texture, shape, and style. The style loss is defined as

L_{style} = \mathbb{E}_j\Big[\big\| G_j^{\phi}(I_{pred}) - G_j^{\phi}(I_h) \big\|_1 \Big]   (11)

Style loss was shown by Sajjadi \etal[sajjadi2017enhancenet] to successfully mitigate the “checkerboard” artifact caused by transposed convolutions [odena2016deconvolution]. For both the style and perceptual losses we extract feature maps from several intermediate ReLU activation layers of VGG-19. We do not use the feature matching loss in the image completion stage: while the feature matching loss acts as a regularizer on the adversarial loss in the edge generator, the perceptual loss used in this stage plays the same role and has been shown to be a more effective loss for image generation tasks [nazeri2019edgeconnect, sajjadi2017enhancenet, johnson2016perceptual]. Thus the complete joint loss objective is

L_{G_2} = \lambda_{\ell_1} L_{\ell_1} + \lambda_{adv} L_{adv,2} + \lambda_{p} L_{perc} + \lambda_{s} L_{style}   (12)

In all of our experiments we train G_2 with fixed weights \lambda_{\ell_1}, \lambda_{adv}, \lambda_{p}, and \lambda_{s}, chosen to balance the reconstruction, adversarial, perceptual, and style terms and to generate a photo-realistic high-resolution image.
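For reference, a compact sketch of the perceptual and style terms (Equations 10 to 12) using torchvision's VGG-19 is shown below. The layer indices are an illustrative choice (the specific ReLU layers used in the paper are not reproduced here), the `lam_*` weights are placeholders, and the `weights="IMAGENET1K_V1"` argument assumes a recent torchvision.

```python
import torch
import torch.nn.functional as F
from torchvision import models


class VGGFeatures(torch.nn.Module):
    """Returns activations from a few intermediate VGG-19 layers.
    The layer indices below are illustrative, not the paper's exact choice."""
    def __init__(self, layer_ids=(3, 8, 17, 26)):
        super().__init__()
        vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def forward(self, x):
        # x: (N, 3, H, W), expected to be ImageNet-normalized beforehand.
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats


def gram_matrix(f):
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)


def perceptual_and_style_loss(extractor, pred, target):
    """Eq. (10) and Eq. (11): L1 on feature maps and on their Gram matrices."""
    l_perc, l_style = 0.0, 0.0
    for fp, ft in zip(extractor(pred), extractor(target)):
        l_perc = l_perc + F.l1_loss(fp, ft)
        l_style = l_style + F.l1_loss(gram_matrix(fp), gram_matrix(ft))
    return l_perc, l_style


# Eq. (12), with placeholder weights lam_*:
#   total = lam_l1 * F.l1_loss(pred, target) + lam_adv * adv_loss \
#         + lam_p * l_perc + lam_s * l_style
```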

4 Experiments

4.1 Training Setup

To train G_1, we generate edge maps using the Canny edge detector [canny1986computational]. We can control the level of detail in the LR edge map by changing the Gaussian smoothing parameter σ of the edge detector; we select the value of σ that empirically yields the best results. All of our experiments are implemented in PyTorch, with the HR images fixed at a constant size and the LR input scaled accordingly based on the zooming factor. We use a batch size of eight during training. The models of both stages are optimized using the Adam optimizer [kingma2014adam]. In our experiments, we did not find any improvement from jointly optimizing G_1 and G_2, and the large memory footprint of joint optimization would force a smaller batch size; hence the generators of the two stages are trained separately. We train G_1 on Canny edges with a fixed initial learning rate until the loss plateaus, then lower the learning rate and continue training until convergence. We then freeze the weights of G_1 and train G_2 with the same learning-rate schedule.
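As a concrete example of the input preparation, the sketch below builds the G_1 inputs from an LR RGB image using scikit-image's Canny detector and nearest-neighbor upsampling. The `sigma` value shown is a placeholder (it is tuned empirically, as noted above), and the function names are our own.

```python
import numpy as np
import torch
import torch.nn.functional as F
from skimage.color import rgb2gray
from skimage.feature import canny

def make_g1_inputs(lr_rgb, scale, sigma=2.0):
    """Build the edge-generator inputs from an LR RGB array (H, W, 3) in [0, 1]:
    a grayscale image and its Canny edge map, both nearest-neighbor upsampled
    to the HR size. sigma is a placeholder value, tuned empirically in practice."""
    gray = rgb2gray(lr_rgb)                               # (H, W), float
    edges = canny(gray, sigma=sigma).astype(np.float32)   # binary edge map
    to_tensor = lambda a: torch.from_numpy(a).float()[None, None]  # (1, 1, H, W)
    upsample = lambda t: F.interpolate(t, scale_factor=scale, mode="nearest")
    return upsample(to_tensor(gray)), upsample(to_tensor(edges))
```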

4.2 Datasets

Our proposed models are evaluated on the following publicly available datasets: Set5, Set14, BSD100, Celeb-HQ, and Places2.

Results are compared against the current state-of-the-art methods both qualitatively and quantitatively.

4.3 Qualitative Evaluation

Figures 5 and 6 show results of the proposed SISR method for two different scale factors. For visualization purposes, the LR image is resized using nearest-neighbor interpolation. All HR images are cropped to a fixed size, and the LR images are correspondingly smaller by the respective scale factor. We obtain the LR images by blurring the HR image with a Gaussian kernel followed by downsampling by the corresponding scale factor. The results are compared against bicubic interpolation and against our proposed model without the edge generation network as a baseline. Despite achieving comparable PSNR/SSIM scores, the baseline model produces blurry results around the edges, while our full model (with edge maps) remains faithful to the high-frequency edge data and produces sharp, photorealistic images.
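The degradation used to synthesize the LR inputs can be sketched as below. This is our own illustration: the exact Gaussian kernel width is not reproduced here, and the `channel_axis` argument assumes scikit-image 0.19 or newer.

```python
from skimage.filters import gaussian

def make_lr(hr, scale, blur_sigma=1.0):
    """Synthesize an LR image from an HR array (H, W, 3) in [0, 1]:
    Gaussian blur followed by subsampling by the scale factor.
    blur_sigma is a placeholder, not the paper's exact kernel width."""
    blurred = gaussian(hr, sigma=blur_sigma, channel_axis=-1)
    return blurred[::scale, ::scale]
```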

Figure 5: Comparison of qualitative SISR results at the smaller scale factor. Left to right: ground-truth HR, LR image upscaled using nearest-neighbor interpolation, SISR using bicubic interpolation, baseline (no edge data), ours (full model).
Figure 6: Comparison of qualitative SISR results at the larger scale factor. Left to right: ground-truth HR, LR image upscaled using nearest-neighbor interpolation, SISR using bicubic interpolation, baseline (no edge data), ours (full model).
PSNR
Scale  Dataset    Bicubic  ENet   EDSR   Baseline  Ours
×2     Set5       33.66    33.89  38.20  27.32     33.60
×2     Set14      30.24    30.45  34.02  24.86     29.24
×2     BSD100     29.56    28.30  32.37  23.97     28.12
×2     Celeb-HQ   33.25    -      -      31.33     32.12
×4     Set5       28.42    28.56  32.62  24.22     28.59
×4     Set14      25.99    25.77  28.94  21.56     25.19
×4     BSD100     25.96    24.93  27.79  20.78     24.25
×4     Celeb-HQ   29.59    -      -      27.94     28.23
×8     Set5       23.80    -      -      19.32     23.73
×8     Set14      22.37    -      -      18.47     21.44
×8     BSD100     22.11    -      -      18.65     21.63
×8     Celeb-HQ   26.66    -      -      25.46     25.56

SSIM
Scale  Dataset    Bicubic  ENet   EDSR   Baseline  Ours
×2     Set5       0.930    0.928  0.961  0.974     0.985
×2     Set14      0.869    0.862  0.920  0.930     0.954
×2     BSD100     0.843    0.873  0.902  0.909     0.932
×2     Celeb-HQ   0.967    -      -      0.957     0.968
×4     Set5       0.810    0.809  0.898  0.929     0.965
×4     Set14      0.703    0.678  0.790  0.832     0.894
×4     BSD100     0.668    0.627  0.744  0.773     0.851
×4     Celeb-HQ   0.834    -      -      0.910     0.912
×8     Set5       0.646    -      -      0.801     0.904
×8     Set14      0.552    -      -      0.708     0.793
×8     BSD100     0.532    -      -      0.663     0.752
×8     Celeb-HQ   0.782    -      -      0.841     0.857

Table 1: Comparison of PSNR and SSIM for ×2, ×4, and ×8 factor SISR over the Set5, Set14, BSD100, and Celeb-HQ datasets with bicubic interpolation, ENet [sajjadi2017enhancenet], EDSR [lim2017enhanced], and our baseline (without edge data). The best result of each row is boldfaced.
Figure 7: Comparison of edge prediction results for SISR. Left to right: ground-truth HR image, ground-truth HR edge map, LR image upscaled using nearest-neighbor interpolation, LR edge map upscaled using nearest-neighbor interpolation, SISR result, predicted SISR edge map.
Dataset    Scale  Precision  Recall
Celeb-HQ   ×2     74.27      73.21
Celeb-HQ   ×4     45.14      43.04
Celeb-HQ   ×8     23.23      19.09
Places2    ×2     79.18      80.24
Places2    ×4     60.80      58.19
Places2    ×8     31.06      23.93

Table 2: Quantitative performance of the edge enhancer for single image super-resolution, trained on Canny edges. Precision and recall are calculated over the standard test sets of each dataset.

4.4 Quantitative Evaluation

We evaluate our model using PSNR and SSIM for ×2, ×4, and ×8 SISR scale factors. Table 1 shows the performance of our model against bicubic interpolation and current state-of-the-art SISR models over the Set5, Set14, BSD100, and Celeb-HQ datasets. Statistics for competing models at ×2 and ×4 were obtained from their respective papers where available. Results for the challenging ×8 case are only compared against bicubic interpolation. Note that the PSNR of our results is lower than that of competing models. In particular, EDSR by Lim \etal[lim2017enhanced] achieves the best PSNR for every dataset. However, their model is trained only with a per-pixel loss and fails to reconstruct sharp edges despite the higher PSNR. Similar observations in recent research [johnson2016perceptual, sajjadi2017enhancenet] show that PSNR favors smooth/blurry results.
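The reported metrics can be reproduced with scikit-image's reference implementations. The sketch below assumes float images in [0, 1] and scikit-image 0.19 or newer for the `channel_axis` argument; note that papers differ on whether PSNR/SSIM are computed on RGB or only on the luminance channel, so this is a generic example rather than the paper's exact protocol.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(hr, pred):
    """PSNR / SSIM between ground-truth and predicted HR images,
    given as float arrays of shape (H, W, 3) in [0, 1]."""
    psnr = peak_signal_noise_ratio(hr, pred, data_range=1.0)
    ssim = structural_similarity(hr, pred, data_range=1.0, channel_axis=-1)
    return psnr, ssim
```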

4.5 Accuracy of Edge Generator

Table 2 shows the accuracy of our edge enhancer on the Celeb-HQ and Places2 datasets for the single image super-resolution task. We measure precision and recall for various SISR scale factors. In all experiments, the width σ of the Gaussian smoothing filter for Canny edge detection is kept fixed.
Figure 7 shows results of the edge prediction stage; for visualization purposes, the LR image and its edge map are resized using nearest-neighbor interpolation.
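Precision and recall of this kind can be computed per pixel by comparing the predicted HR edge map against the Canny edge map of the ground-truth HR image. The sketch below is our own assumption of the protocol (the matching rule and threshold are not spelled out in the text), expressed as percentages to match Table 2.

```python
import numpy as np

def edge_precision_recall(pred_edges, true_edges, threshold=0.5):
    """Per-pixel precision and recall (in percent) of a predicted edge map
    against the ground-truth HR Canny edge map; both maps lie in [0, 1]."""
    pred = pred_edges >= threshold
    true = true_edges >= threshold
    tp = np.logical_and(pred, true).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(true.sum(), 1)
    return 100.0 * precision, 100.0 * recall
```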

5 Discussion and Future Work

We propose a new structure-driven deep learning model for Single Image Super-Resolution (SISR) by recasting the problem as an in-between pixels inpainting task. One benefit of this approach over most deep learning-based SISR models is that a single unified model can be used for different SISR zooming scales. Most deep learning-based SISR models take the LR image as input and generate the HR image through in-network upsampling layers for a given zooming factor; each zooming factor requires a different network architecture and separate training. In contrast, our model takes the LR image and adds empty space between pixels before using it as input to the network. The proposed model learns to fill in the missing pixels by relying on available edge information to create the high-resolution image, effectively sharing parameters across different SISR scales. Quantitative results show the effectiveness of the structure-guided inpainting model for the SISR problem, where it achieves state-of-the-art results on standard benchmarks.

One shortcoming of the proposed inpainting-based SISR model is that it requires two disjoint optimization procedures. A better approach would be to incorporate the edge generation stage into the inpainting model's objective. Such a model could be trained using a joint optimization of image contents and structures and could potentially outperform the disjoint two-stage optimization computationally while preserving sharp details of the image.

Our method points to an interesting direction, raising the question of what other information could be learned from the original data to aid the super-resolution process. Our source code is available at:
https://github.com/knazeri/edge-informed-sisr

Acknowledgments

This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.

References
