Benefiting from Multitask Learning to Improve Single Image Super-Resolution

Mohammad Saeed Rad (saeed.rad@epfl.ch), Behzad Bozorgtabar, Claudiu Musat, Urs-Viktor Marti, Max Basler, Hazım Kemal Ekenel, Jean-Philippe Thiran
Signal Processing Laboratory 5, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland; AI Lab, Swisscom AG, Lausanne, Switzerland; Istanbul Technical University, Istanbul, Turkey
Abstract

Despite significant progress toward super-resolving more realistic images with deeper convolutional neural networks (CNNs), reconstructing fine and natural textures remains a challenging problem. Recent works on single image super-resolution (SISR) are mostly based on optimizing pixel- and content-wise similarity between the recovered and high-resolution (HR) images and do not benefit from the recognizability of semantic classes. In this paper, we introduce a novel approach that uses categorical information to tackle the SISR problem; we present a decoder architecture able to extract and use semantic information to super-resolve a given image through multitask learning, performed simultaneously for image super-resolution and semantic segmentation. To explore categorical information during training, the proposed decoder only employs one shared deep network with two task-specific output layers. At run-time, only the layers producing the HR image are used and no segmentation label is required. Extensive perceptual experiments and a user study on images randomly selected from the COCO-Stuff dataset demonstrate the effectiveness of our proposed method, which outperforms state-of-the-art methods.

Keywords: Single Image Super-Resolution, Multitask Learning, Recovering Realistic Textures, Semantic Segmentation, Generative Adversarial Network
Journal: Neurocomputing

1 Introduction

Single image super-resolution (SISR), which aims at recovering a high-resolution (HR) image from its low-resolution (LR) counterpart, typically by learning from a set of paired LR-HR examples, has many practical computer vision applications Zou and Yuen (2012); Shi et al. (2013); Park et al. (2013); Yang et al. (2007). Although many SISR methods have been proposed in the past decade, recovering high-frequency details and realistic textures in a plausible manner remains challenging. Moreover, the problem is ill-posed: each LR image may correspond to many HR images, and the space of plausible HR images scales up quadratically with the image magnification factor.

Figure 1: The proposed single image super-resolution using multitask learning. This network architecture enables reconstructing SR images in a content-aware manner; during training (blue arrows), an additional objective function for semantic segmentation is used to force the SR network to learn categorical information. At run-time, we only reconstruct the SR image (orange arrows). In this work, we show that learning the semantic segmentation task in parallel with the SR task can improve the reconstruction quality of the SR decoder. Results from left to right: bicubic interpolation, SRResNet, SRGAN Ledig et al. (2017), and SRSEG (this work). Best viewed in color.

To tackle such an ill-posed problem, numerous deep learning methods have been proposed to learn mappings between LR and HR image pairs Dong et al. (2014); Kim et al. (2016a, b); Shi et al. (2016). These approaches use various objective functions in a supervised manner to reach the current state-of-the-art. The conventional pixel-wise Mean Squared Error (MSE) is the most commonly used loss for minimizing the pixel-wise difference between the recovered HR image and the ground truth in image space. However, Sajjadi et al. (2017); Ledig et al. (2017) show that lower MSE does not necessarily reflect a perceptually better SR result. Therefore, Johnson et al. (2016) proposed the perceptual loss to optimize an SR model in a feature space instead of the pixel space. Significant progress has recently been achieved in SISR by applying Generative Adversarial Networks (GANs) Ledig et al. (2017); Wu et al. (2017); Wang et al. (2018). GANs are known for their ability to generate more appealing and realistic images and have been used in various image synthesis applications Isola et al. (2017); Bozorgtabar et al. (2019); Mahapatra et al. (2018); Bozorgtabar et al. (2019).

1.1 Does semantic information help?

Despite significant progress toward learning deep models to super-resolve realistic images, the proposed approaches still cannot fully reconstruct realistic textures; intuitively, one would expect better reconstruction quality for common and known types of textures, e.g., ground soil and sea waves, but experiments show that the reconstruction quality is almost the same for a known and an unknown type of texture, e.g., a fabric with a random pattern. Although the loss functions used in image super-resolution, e.g., perceptual and adversarial losses, generate appealing super-resolved images, they try to match the global-level statistics of images without retaining the semantic details of the content. Wang et al. (2018) show that a variety of different HR image patches can have very similar LR counterparts and, as a consequence, similar SR images are reconstructed for categorically different textures by current state-of-the-art methods. They also show that more realistic textures can be recovered by using an additional network to obtain prior knowledge, which is afterwards fed as a secondary input to the SR decoder.

In this work, we show that a single SR decoder is capable of learning this categorical knowledge through multitask learning. As Caruana (1997) emphasizes, multitask learning improves generalization by using the domain information contained in the training signals of related tasks. This improvement is the result of learning tasks in parallel while using a shared representation; in our case, what is learned for the semantic segmentation task can help improve the quality of the SR task and vice versa.

1.2 Our contribution

In this paper, we propose a novel architecture that reconstructs SR images in a content-aware manner without requiring an additional network to predict the categorical knowledge. We show that this can be achieved through multitask learning, performed simultaneously for the SR and semantic segmentation tasks. An overview of our proposed method is shown in Figure 1. We add an additional segmentation output such that the same SR decoder learns to segment the input image and to generate a recovered image. We also introduce a novel boundary mask to filter out segmentation losses caused by imprecise segmentation labels. The semantic segmentation task forces the network to learn categorical knowledge. These categorical priors learned by the network characterize the semantic classes of different regions in an image and are the key to recovering more realistic textures. Our approach outperforms state-of-the-art algorithms in the quality of recovered textures, in both qualitative comparisons and a user study.

Our contributions can be summarized as follows:

  • We propose a framework that uses segmentation labels during training to learn a CNN-based SR model in a content-aware manner.

  • We introduce a novel boundary mask that provides additional spatial control over the categorical information within training examples and their segmentation labels, filtering out information that is irrelevant to the SR task.

  • Unlike existing approaches for content-aware SR, the proposed method does not require any semantic information at test time. Therefore, neither a segmentation label nor additional computation is required at test time, while still benefiting from categorical information.

  • Our method is trained end-to-end and is easily reproducible.

  • Our experimental results, including an extensive user study, prove the effectiveness of using multitask learning for SISR and semantic segmentation and show that SISR of high perceptual quality can be achieved by using our proposed objective function.

In the remainder of this paper, we first review the related literature in Section 2. Then, in Section 3, we give a detailed explanation of our design, including the dataset used and our training parameters. In Section 4, we present experimental results and computational time, and discuss the effectiveness of our proposed approach. Finally, we conclude the paper in Section 5 and mention future research directions.

2 Related Work

2.1 Single image super-resolution

SISR has been widely studied for decades and many different approaches have been proposed, from simple methods such as bicubic interpolation and Lanczos resampling Duchon (1979) to dictionary learning Yang et al. (2012) and self-similarity Yang et al. (2011); Huang et al. (2015) approaches. With the advances of deep CNNs, state-of-the-art SISR methods are now built on end-to-end deep neural networks and achieve significantly superior performance; thus, we only review relevant recent CNN-based approaches.

An end-to-end CNN-based approach was proposed by Dong et al. (2016) to learn the mapping from LR to HR images. The concepts of residual blocks and skip connections He et al. (2016); Kim et al. (2016b) were used by Ledig et al. (2017) to facilitate the training of CNN-based decoders. A Laplacian pyramid network was presented in Lai et al. (2017) to progressively reconstruct the sub-band residuals of high-resolution images. The choice of the objective function plays a crucial role in the performance of optimization-based methods. These works used various loss functions; the most common term is the pixel-wise distance between the super-resolved and ground-truth HR images Sajjadi et al. (2017); Dong et al. (2016); Kim et al. (2016a); Zhang et al. (2018). However, using such functions as the only optimization target leads to blurry super-resolved images due to the pixel-wise averaging of possible solutions in the pixel space.

A remarkable improvement in terms of visual quality in SISR is the so-called perceptual loss Johnson et al. (2016). This loss function builds on the idea of perceptual similarity Bruna et al. (2015) and seeks to minimize the distance between feature maps extracted from a pre-trained network, e.g., VGG Simonyan and Zisserman (2014). In a similar vein, Mechrez et al. (2018) propose the contextual loss to generate images with natural image statistics, which focuses on the feature distribution rather than merely comparing appearances.

More recently, the concept of generative adversarial networks (GANs) Goodfellow et al. (2014) has been used for the image super-resolution task, achieving state-of-the-art results on various benchmarks in terms of reconstructing more appealing and realistic images Sajjadi et al. (2017); Ledig et al. (2017); Wang et al. (2018). The intuition behind this excellent performance is that the GAN drives the image reconstruction toward the natural image manifold, producing perceptually more convincing solutions. In addition, a discriminator is used to distinguish between the generated and the original HR images, which is found to produce more photo-realistic results.

2.2 Super-resolution faithful to semantic classes

Semantic information has been used in several studies for various tasks; Ren et al. (2017) proposed a method that benefits from semantic segmentation for video deblurring. For image generation, Chen and Koltun (2017) used semantic labels to produce images with a photographic appearance. Isola et al. (2017) used the same idea to perform image-to-image translation. The SISR method proposed by Wang et al. (2018) is the most relevant to our work. They use an additional segmentation network to estimate probability maps as prior knowledge and feed them into existing super-resolution networks. Their segmentation network is pre-trained on the COCO dataset Lin et al. (2014) and then fine-tuned on the ADE dataset Zhou et al. (2017). They show that it is possible to recover textures faithful to categorical priors estimated by the pre-trained segmentation network, which generates intermediate conditions from the prior and broadcasts these conditions to the super-resolution network.

In this paper, by contrast, we do not use an additional segmentation network; instead, our super-resolution method is built on a multitask end-to-end deep network with shared feature extraction parameters to learn semantic information. The intuition behind this design is that the model can exploit features for both tasks: during training, it is forced to explore categorical information while super-resolving the image. Therefore, segmentation labels are used only during the training phase, and no segmentation label is required as an input at run-time.

3 Multitask Learning for Image Super-Resolution

Our ultimate goal is to train a SISR model in a multitask manner, simultaneously for image super-resolution and semantic segmentation. Our proposed SR decoder only employs one shared deep network and keeps two task-specific output layers during training to force the network to learn semantic information. If the network converges for both tasks, we can be confident that the parameters of the shared feature extractor have explored categorical information while super-resolving the image. In this section, we present our proposed architecture and the objective function used for training. We also introduce a novel boundary mask used to simplify the segmentation task.

Figure 2: Architecture of the decoder. We train the SR decoder (upper part) in a multitask manner by introducing a segmentation extension (lower part). The feature extractor is shared between the super-resolution and segmentation tasks. The segmentation extension is only used during the training process and no segmentation label is used at run-time. In this schema, the layer annotations denote, respectively, the kernel size, the number of feature maps, and the stride of each convolutional layer.

3.1 Architecture

Figure 2 shows the multitask architecture used during training; the upper part (first row) shows the SR generator, from the LR to the HR image, while the lower part (second row) is the extension used to predict segmentation class probabilities. The role of the segmentation extension layers in our design is to force the feature extractor parameters to learn categorical information. These non-shared layers, which generate segmentation probabilities, are not used at run-time. Each part is presented in more detail as follows:

  • SR generator The generator network is a feed-forward CNN; the input image is passed through a convolution block followed by a LeakyReLU activation layer. The output is subsequently passed through 16 residual blocks with skip connections. Each block has two convolutional layers, each followed by batch normalization and a LeakyReLU activation. The output of the final residual block, concatenated with the features of the first convolutional layer, is passed through two upsampling stages, each of which doubles the spatial resolution. Finally, the result is passed through a convolution stage to obtain the super-resolved image. In this study, we only investigate a scale factor of 4, but the number of upsampling stages can be changed depending on the desired scaling.

  • Segmentation extension The segmentation extension uses the output of the feature extractor part of the SR generator, just before the first upsampling stage, and converts it into segmentation class probabilities by passing it through two convolutional layers. The computational complexity of this extension is kept as low as possible, so that the layers shared with the SR generator, rather than the extension itself, learn the categorical information. A minimal sketch of the resulting two-headed decoder is given after this list.
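The following is a minimal TensorFlow/Keras sketch of such a two-headed generator. The 16 residual blocks, the concatenation with the first-layer features, the two ×2 upsampling stages, and the six-class segmentation head follow the description above; the choice of 64 feature maps, the nearest-neighbor upsampling, and the tanh output range are illustrative assumptions, not the exact configuration of our implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Conv -> BN -> LeakyReLU -> Conv -> BN, plus an identity skip connection.
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(0.2)(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.Add()([x, y])

def upsample_x2(x, filters=64):
    # One x2 upsampling stage (nearest-neighbor + convolution is assumed here).
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    return layers.LeakyReLU(0.2)(x)

def build_generator(num_classes=6, filters=64, num_blocks=16):
    lr = layers.Input(shape=(None, None, 3))
    first = layers.Conv2D(filters, 3, padding="same")(lr)
    first = layers.LeakyReLU(0.2)(first)

    x = first
    for _ in range(num_blocks):
        x = residual_block(x, filters)
    # Shared features: output of the last residual block concatenated with
    # the features of the first convolutional layer.
    shared = layers.Concatenate()([x, first])

    # SR head: two x2 upsampling stages for a total scale factor of 4.
    sr = upsample_x2(shared, filters)
    sr = upsample_x2(sr, filters)
    sr = layers.Conv2D(3, 3, padding="same", activation="tanh", name="sr")(sr)

    # Segmentation head (training only): two light convolutional layers that
    # map the shared features to per-pixel class probabilities.
    seg = layers.Conv2D(filters, 3, padding="same", activation="relu")(shared)
    seg = layers.Conv2D(num_classes, 1, padding="same",
                        activation="softmax", name="seg")(seg)

    return tf.keras.Model(lr, [sr, seg], name="srseg_generator")
```

At run-time, only the sr output is evaluated; the segmentation head is simply dropped from the graph.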

The parameters of the generator, for both the segmentation and SR tasks, are obtained by minimizing the loss function presented in Section 3.3. This loss function also includes a GAN-based adversarial loss Goodfellow et al. (2014), which requires a discriminator network that distinguishes real HR images from generated SR samples. We define our discriminator architecture similarly to Ledig et al. (2017); it consists of multiple convolutional layers, with the number of kernels increasing by a factor of 2 from 64 to 512. We use LeakyReLU activations and strided convolutions to reduce the spatial dimension while doubling the number of features. The resulting feature maps are followed by two dense layers, and the image is finally classified as real or fake by a sigmoid activation function.
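For completeness, a corresponding discriminator sketch is given below; the 96-pixel patch size and the 1024-unit dense layer are assumptions borrowed from the SRGAN design of Ledig et al. (2017) rather than values stated above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(patch_size=96):
    """SRGAN-style discriminator sketch; patch size and dense width are assumptions."""
    def conv_block(x, filters, strides):
        x = layers.Conv2D(filters, 3, strides=strides, padding="same")(x)
        x = layers.BatchNormalization()(x)
        return layers.LeakyReLU(0.2)(x)

    img = layers.Input(shape=(patch_size, patch_size, 3))
    x = layers.Conv2D(64, 3, padding="same")(img)
    x = layers.LeakyReLU(0.2)(x)
    # Strided convolutions halve the spatial resolution while the number of
    # feature maps grows by a factor of 2 from 64 up to 512.
    for filters, strides in [(64, 2), (128, 1), (128, 2), (256, 1),
                             (256, 2), (512, 1), (512, 2)]:
        x = conv_block(x, filters, strides)

    x = layers.Flatten()(x)
    x = layers.Dense(1024)(x)
    x = layers.LeakyReLU(0.2)(x)
    real_prob = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(img, real_prob, name="discriminator")
```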

Figure 3: An example showing the accuracy and resolution of a pixel-wise semantic segmentation label (b) of a low-resolution image (a). As the segmentation and super-resolution tasks share layers, inaccurate segmentation labels result in inaccurate edges in the super-resolved images.

3.2 Boundary mask

Although the segmentation labels of available datasets, e.g., COCO-Stuff Caesar et al. (2018), are created through an expensive labeling effort, they still lack precision close to the boundaries between different classes, as can be seen in Figure 3. Our experiments show that, since shared features are used to generate both the SR image and the segmentation probabilities, this lack of boundary precision in the segmentation labels affects the edges in the SR image as well. Therefore, we use a novel boundary mask (M_B) to filter out segmentation losses in areas close to object boundaries in the training images.

In order to generate such a boundary mask, we first compute the derivative of the segmentation label to obtain the boundaries between the different classes in the low-resolution image. Then, we dilate the result with a disk-shaped structuring element to create a thicker strip around the edges of each class. An example of converting a segmentation label into a boundary mask is shown in Figure 4, and a minimal sketch of the procedure is given after the figure. The effectiveness of using such boundary masks is shown in Section 4.

Figure 4: The boundary mask generation. The black pixels of the result represent areas close to the edges, while white pixels can belong to either background or foreground.
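As a concrete illustration, the following sketch builds such a mask with NumPy and scikit-image; the disk radius shown is an illustrative assumption rather than the value used in our experiments.

```python
import numpy as np
from skimage.morphology import binary_dilation, disk

def boundary_mask(label_map, radius=3):
    """Build a boundary mask M_B from an integer segmentation label map.

    Returns 1.0 for pixels far from class boundaries and 0.0 in a strip
    around them, so masked pixels contribute nothing to the segmentation
    loss. The disk radius is an illustrative assumption.
    """
    # A pixel lies on a boundary if any 4-neighbour has a different label.
    diff_x = np.zeros_like(label_map, dtype=bool)
    diff_y = np.zeros_like(label_map, dtype=bool)
    diff_x[:, 1:] = label_map[:, 1:] != label_map[:, :-1]
    diff_y[1:, :] = label_map[1:, :] != label_map[:-1, :]
    edges = diff_x | diff_y

    # Dilate the thin edge map with a disk to get a thicker strip.
    thick_edges = binary_dilation(edges, disk(radius))
    return (~thick_edges).astype(np.float32)
```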

3.3 Loss function

We define the overall objective L_SR as a combination of a pixel-wise loss L_MSE, a perceptual loss L_VGG, an adversarial loss L_adv, and a segmentation loss L_seg filtered by the boundary mask M_B introduced in Section 3.2. The overall loss function is given by:

L_SR = α · L_MSE + β · L_VGG + γ · L_adv + δ · (M_B ⊙ L_seg)    (1)

where α, β, γ, and δ are the weights of the corresponding loss terms used to train our network. In the following, we present each term in detail; a minimal code sketch combining these terms is given after the list:

  • Pixel-wise loss The most common loss in SR is the pixel-wise Mean Squared Error (MSE) between the original and super-resolved images in image space Sajjadi et al. (2017); Ledig et al. (2017); Dong et al. (2014). However, using it alone mostly results in finding pixel-wise averages of plausible solutions, which appear over-smoothed, with poor perceptual quality and a lack of high-frequency details such as textures Dosovitskiy and Brox (2016); Mathieu et al. (2015); Bruna et al. (2015).

  • Perceptual loss Sajjadi et al. (2017) and Ledig et al. (2017) used the idea of measuring perceptual similarity by computing the distance between the images in a feature space. First, both the HR and SR images are mapped into a feature space by a pre-trained model, VGG-16 Simonyan and Zisserman (2014) in our case. Then, the perceptual loss is calculated as the distance between all feature maps of the ReLU 4-1 layer of VGG-16 for the two images.

  • Adversarial loss Inspired by Ledig et al. (2017), we add the discriminator component of the aforementioned GAN architecture to our design. This encourages our SR decoder to favor solutions that look more realistic and natural by trying to fool the discriminator network. It also results in solutions that are perceptually superior to those obtained by minimizing only the pixel-wise MSE and perceptual losses.

  • Segmentation loss While using segmentation for the SR application is new to the community, semantic segmentation as a stand-alone task has been investigated for years. The most commonly used loss function for image segmentation is the pixel-wise cross-entropy loss (or log loss) Badrinarayanan et al. (2017); Long et al. (2014); Kampffmeyer et al. (2016). In this work, we also use the cross-entropy loss to examine each pixel individually and compare the class predictions (depth-wise pixel vector) to the one-hot encoded label; it measures the performance of a pixel-wise classification model whose output is a probability value between zero and one for each pixel and category.
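As a concrete illustration, the following is a minimal TensorFlow sketch of how the four terms of Eq. 1 can be combined. It assumes images scaled to [-1, 1], uses the Keras layer name block4_conv1 for the ReLU 4-1 activation of VGG-16, and fills in placeholder weight values; it is a sketch of the objective, not the exact training code.

```python
import tensorflow as tf

# VGG-16 feature extractor up to the ReLU 4-1 activation ("block4_conv1" in
# Keras naming), used for the perceptual loss.
_vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
_vgg_features = tf.keras.Model(_vgg.input, _vgg.get_layer("block4_conv1").output)
_vgg_features.trainable = False

_bce = tf.keras.losses.BinaryCrossentropy()
# Integer class labels per pixel (equivalent to one-hot cross entropy).
_pixel_ce = tf.keras.losses.SparseCategoricalCrossentropy(reduction="none")

def total_loss(hr, sr, seg_labels, seg_probs, boundary_mask, d_fake,
               alpha=1.0, beta=1.0, gamma=1e-3, delta=1.0):
    """Eq. 1: weighted sum of MSE, perceptual, adversarial and masked
    segmentation losses. The weight values here are placeholders."""
    # Pixel-wise MSE between the HR ground truth and the SR output.
    l_mse = tf.reduce_mean(tf.square(hr - sr))

    # Perceptual loss: distance between VGG-16 feature maps of both images
    # (inputs are rescaled from [-1, 1] to [0, 255] before preprocessing).
    prep = tf.keras.applications.vgg16.preprocess_input
    l_vgg = tf.reduce_mean(tf.square(
        _vgg_features(prep((hr + 1.0) * 127.5)) -
        _vgg_features(prep((sr + 1.0) * 127.5))))

    # Adversarial loss: the generator is rewarded when the discriminator
    # classifies SR images as real.
    l_adv = _bce(tf.ones_like(d_fake), d_fake)

    # Segmentation loss: per-pixel cross entropy, zeroed out near class
    # boundaries by the boundary mask M_B.
    ce_map = _pixel_ce(seg_labels, seg_probs)          # shape [batch, h, w]
    l_seg = tf.reduce_mean(ce_map * boundary_mask)

    return alpha * l_mse + beta * l_vgg + gamma * l_adv + delta * l_seg
```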

3.4 Dataset

Training the proposed network in a supervised manner requires a considerable number of training examples with ground-truths for both the semantic segmentation and super-resolution tasks. Therefore, the choice of datasets is limited to those with available segmentation labels. We use a random sample of 60 thousand images from the COCO-Stuff database Caesar et al. (2018), which provides semantic labels for 91 stuff classes. To focus on texture quality and prove the concept, we only choose images from five main background classes: sky, ground, buildings, plants, and water. Each of them contains multiple subclasses in the COCO-Stuff dataset, e.g., water covers seas, lakes, rivers, etc., and plants covers trees, bushes, leaves, etc., but in this work we treat each of them as a single class. Any other object or background in an image is labeled as "others" (the sixth class). More than 12 thousand images from each category were used to train our network. We obtained the LR images for the SR task by downsampling the HR images of the same database with the MATLAB imresize function using the bicubic kernel and a downsampling factor of 4 (all experiments were performed with a scaling factor of ×4). For each image, we crop a random HR sub-image for training.
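A minimal sketch of this preprocessing is shown below; PIL's bicubic filter is used here as a stand-in for MATLAB imresize (the two implementations differ slightly), and the crop size is an illustrative assumption.

```python
import numpy as np
from PIL import Image

def make_training_pair(hr_image, crop_size=96, scale=4, rng=np.random):
    """Crop a random HR patch and produce its bicubic-downsampled LR
    counterpart. Crop size is an assumption; PIL bicubic approximates
    MATLAB imresize."""
    w, h = hr_image.size
    x = rng.randint(0, w - crop_size + 1)
    y = rng.randint(0, h - crop_size + 1)
    hr_patch = hr_image.crop((x, y, x + crop_size, y + crop_size))
    lr_patch = hr_patch.resize((crop_size // scale, crop_size // scale),
                               Image.BICUBIC)
    return np.asarray(lr_patch), np.asarray(hr_patch)
```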

3.5 Training and parameters

In order to converge to parameters compatible with both the SR and segmentation tasks, training was done in several steps. First, the generator was trained for 25 epochs with only the pixel-wise mean squared error as the loss function. Then, the segmentation loss was added and training continued for 25 more epochs. Finally, the full loss function presented in Section 3.3 (including the adversarial and perceptual losses) was used for 55 more epochs. The weights of the terms in Eq. 1 were chosen as follows: α, β, and γ were fixed to the values proposed by Ledig et al. (2017), and δ was tuned empirically and then fixed. The Adam optimizer Kingma and Ba (2014) was used for all steps, and the learning rate was decayed by a factor of 10 every 20 epochs. We also alternately optimized the discriminator with the settings proposed by Ledig et al. (2017).
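The following sketch outlines this staged schedule; the stage boundaries (25/25/55 epochs) and the decay of the learning rate by a factor of 10 every 20 epochs follow the text, while the initial learning rate and the run_training_step helper are assumptions.

```python
import tensorflow as tf

def make_optimizer(initial_lr, steps_per_epoch):
    # Decay the learning rate by a factor of 10 every 20 epochs.
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=initial_lr,
        decay_steps=20 * steps_per_epoch,
        decay_rate=0.1,
        staircase=True)
    return tf.keras.optimizers.Adam(schedule)

def train(generator, discriminator, dataset, steps_per_epoch, initial_lr=1e-4):
    # initial_lr is an assumption; the stage boundaries follow the text.
    opt_g = make_optimizer(initial_lr, steps_per_epoch)
    opt_d = make_optimizer(initial_lr, steps_per_epoch)
    for epoch in range(105):
        use_seg = epoch >= 25   # stage 2: add the masked segmentation loss
        use_gan = epoch >= 50   # stage 3: full objective incl. adversarial loss
        for batch in dataset:
            # run_training_step is a hypothetical helper that computes the
            # active loss terms and applies gradients with opt_g (and opt_d
            # alternately once the adversarial loss is enabled).
            run_training_step(generator, discriminator, opt_g, opt_d,
                              batch, use_seg=use_seg, use_gan=use_gan)
```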

As explained previously, in order not to penalize segmentation prediction errors close to object/background boundaries, the segmentation loss is filtered by the boundary mask introduced in Section 3.2. Figure 5 shows the segmentation predictions for two training images; the artifacts close to boundaries (imprecise edges and black strips around them) are the result of applying the boundary mask. Because of this mask, the network does not consider the class probabilities around boundaries and therefore makes essentially random predictions in those areas.

Figure 5: Two examples of segmentation prediction results. The artifacts close to boundaries (imprecise edges and black strips around them) are the result of applying the boundary mask; the generator does not focus on class probabilities around boundaries and makes essentially random predictions in those areas.

4 Results and Discussion

In this section, we first investigate the effectiveness of using the presented boundary mask in the proposed approach. Then, we evaluate and discuss the benefits of introducing multitask learning for SR task by performing qualitative experiments, an extensive user study, and an ablation study. Finally, we discuss the computational time of the proposed approach.

4.1 Effectiveness of boundary masks

As explained in Section 3.2, we use a novel boundary mask (M_B) to filter out all segmentation losses in areas close to object boundaries during training. The goal of this masking is to avoid forcing the SR network to learn the imprecise boundaries present in the segmentation labels. Figure 6 shows SR results comparing the effect of the segmentation mask; comparing Figure 6.c to 6.d shows the improvement in reconstructing sharper edges when segmentation is used with the mask rather than without it. In this example, both Figures 6.c and 6.d have textures closer to the ground-truth than Figure 6.b; however, the object in the image super-resolved without segmentation information has the sharpest edges. This can be explained by the fact that we only considered background categories ("sky", "plant", "buildings", "ground", and "water") because of their specific appearance and to prove the concept. All types of objects, e.g., the giraffe in this example, are included in the "others" category; therefore, no specific pattern is expected to be learned for this category. As future work, more object categories can be added to the training examples.

Figure 6: (a) Ground-truth, (b) SRGAN, (c) SRSEG, (d) Masked-SRSEG. While SRGAN still has the most accurate edges in this example, both the masked and unmasked SRSEG networks reconstruct more realistic background textures that are closer to the ground-truth. All images are cropped from Figure 3.a and zoomed in by a factor of 6.

4.2 Qualitative results

Standard benchmarks such as Set5 Bevilacqua et al. (2012), Set14 Zeyde et al. (2012), and BSD100 Martin et al. (2001) mostly do not contain the background categories studied in this work; therefore, we first evaluate our method on a test set of randomly selected images from the COCO-Stuff dataset Caesar et al. (2018).

Figure 7: Qualitative results on the COCO-Stuff dataset Caesar et al. (2018), focusing on object/background textures. The test images include the same categories as those used during training (water, plant, building, sky, and ground). Cropped regions are zoomed in with a factor of 5 to 10. Images from left to right: high-resolution image, bicubic interpolation, SRResNet Ledig et al. (2017), RCAN Zhang et al. (2018), SFT-GAN Wang et al. (2018), SRGAN Ledig et al. (2017), and SRSEG (this work). Zoom in for the best view.

Figure 7 contains visual examples comparing different models. For a fair comparison, we re-trained the SRResNet Ledig et al. (2017), SFT-GAN Wang et al. (2018), and SRGAN Ledig et al. (2017) methods on the same dataset and with the same parameters as ours. The generator and discriminator networks used in SRGAN and in our method are very similar (only the layers producing the segmentation probability output differ), which makes SRGAN a natural baseline for assessing the effectiveness of our approach. For RCAN, we used the pre-trained models from Zhang et al. (2018). The MATLAB imresize function with a bicubic kernel is used to produce the LR images.

The qualitative comparison shows that our method generates more realistic and natural textures by benefiting from categorical information. Our experiments show that the model trained for both the segmentation and SR tasks generalizes in a way that it reconstructs more realistic backgrounds than approaches using the same configuration without the segmentation objective.

As mentioned previously, to prove the concept, most of the test images contain the specific background categories; nevertheless, our method still reconstructs competitive results for objects that had no labels during the training phase, e.g., the man with a tie in Figure 7. In some cases, we also observed that our method can produce less precise boundaries, as shown in Figure 8.

Figure 8: An example of a poor reconstruction of boundaries compared to the SRGAN Ledig et al. (2017) method; this effect can be seen in some cases, especially for objects/backgrounds that were not among the training classes.

4.3 User experience

As Ledig et al. (2017); Sajjadi et al. (2017); Wang et al. (2018) mention, commonly used quantitative measures for SR methods, such as SSIM and PSNR, are not directly correlated with perceptual quality; their experiments show that GAN-based methods have lower PSNR and SSIM values than PSNR-oriented approaches, yet they easily outperform them in producing images that are more appealing and closer to the HR images. Therefore, we did not use these evaluation metrics in this work.

To better investigate the effectiveness of multitask learning for semantic segmentation and SR, we perform a user study comparing the SRGAN Ledig et al. (2017) method and our approach, which is an extended version of SRGAN with an additional segmentation output. We design our experiment in two stages: the first stage quantifies the ability of our approach to reconstruct perceptually convincing images, while the second focuses specifically on the quality of texture reconstruction with respect to the ground-truth (real HR image).

During the first stage, users were asked to vote for the more appealing image in each SRGAN/SRSEG output pair. To avoid random guesses in cases of similar quality, a third choice, "similar", was also offered for each image. 22 people participated in this experiment, and 25 random images from COCO-Stuff Caesar et al. (2018) were presented to each person in a randomized order. The pie chart in Figure 9.a shows that the images reconstructed by our approach are more appealing to the users.

In the second stage, we focused only on enlarged texture patches, zoomed in with a factor of 8 to 10 and mostly taken from background regions belonging to the training classes. The enlarged images showed only a reconstructed texture and contained no objects. The ground-truth was also shown, and each person was asked to pick the texture closer to the ground-truth. 25 pairs of textures, together with their ground-truth, were shown to the same 22 participants. The results of this stage are shown in Figure 9.b. These results confirm that our approach reconstructs perceptually more convincing images for the users in terms of both the overall quality and the texture quality of the resolved images. However, comparing the results of the first and second stages shows that the texture reconstruction quality of our proposed approach improves by a considerably larger margin than its object reconstruction quality. As future work, adding more object categories to the training examples for both the segmentation and SR tasks could improve the reconstruction quality of the class "others" by a similar margin.

Figure 9: The evaluation results of our user studies, comparing SRSEG (our method) with SRGAN Ledig et al. (2017); (a) focusing on the visual quality of the resolved images, (b) focusing only on enlarged textures. Both the textures and the overall quality of images resolved by our method are improved; users prefer textures reconstructed by our proposed approach by a large margin.

4.4 Ablation study

Intuitively, by introducing the additional segmentation task, our SR decoder extracts features that are informative for both image reconstruction and semantic segmentation. To investigate the usefulness of these features and the effectiveness of our approach for image SR, we perform an ablation study by qualitatively comparing the reconstruction quality of our decoder with and without the segmentation extension. In Figure 10, we divide the results into the categories present during training (sky, ground, buildings, plants, and water), as well as categories undefined in our dataset. The network trained with the segmentation extension generates more photo-realistic textures for the available segmentation categories, while producing competitive results for the other objects.

Figure 10: Ablation study on different types of objects/backgrounds, comparing the reconstruction quality of our decoder: (a) with the segmentation extension during training, (b) without the segmentation extension. Zoom in for the best view.

4.5 Results on standard benchmarks

During training, our approach optimizes the decoder by using an additional segmentation extension and loss term for recognizing specific categories, namely sky, ground, buildings, plants, and water. Even though many object and background categories are absent during the training phase, our experiments show that the model generalizes in a way that it reconstructs either more realistic or competitive results for undefined objects/backgrounds as well. In this section, we evaluate the reconstruction quality for unknown objects using the Set5 Bevilacqua et al. (2012) and Set14 Zeyde et al. (2012) standard benchmarks, in which, unlike our training set, most images do not contain outdoor background scenes. Figure 11 compares the results of our SR model on the "baby" and "baboon" images with recent state-of-the-art methods, including bicubic interpolation, SRCNN Dong et al. (2016), RCAN Zhang et al. (2018), SFT-GAN Wang et al. (2018), and SRGAN Ledig et al. (2017). For both images, despite the fact that their categories were not present during training, we generate more photo-realistic images than SRCNN and RCAN, while being competitive with SFT-GAN and SRGAN. Their results were obtained from their online supplementary materials.

Figure 11: Sample results on the "baby" (top) and "baboon" (bottom) images from the Set5 Bevilacqua et al. (2012) and Set14 Zeyde et al. (2012) datasets, respectively. From left to right: HR image, bicubic, SRCNN Dong et al. (2016), RCAN Zhang et al. (2018), SFT-GAN Wang et al. (2018), SRGAN Ledig et al. (2017), and SRSEG (ours). Zoom in for the best view.

4.6 Computational time

Our proposed method has a running time similar to other CNN-based SISR methods and is faster than methods such as Wang et al. (2018), which use a second network to predict segmentation probabilities. Since the segmentation extension presented in this work is removed at run-time and no segmentation label is required as an input, the running time is not affected by our approach. However, using the segmentation extension during the training phase increases our training time compared to SRGAN.

In particular, our TensorFlow implementation runs at 20.24 FPS on a GeForce GTX 1080 Ti graphics card when reconstructing HD images from their low-resolution counterparts with a scale factor of 4.

5 Conclusion and Future Work

In this work, we presented a novel approach that uses categorical information to tackle the SR problem. We introduced an SR decoder that benefits from a single shared deep network to learn image super-resolution and semantic segmentation simultaneously, by keeping two task-specific output layers during training. We also introduced a novel boundary mask to filter out unreliable segmentation losses caused by imprecise segmentation labels. We conducted perceptual experiments, including a user study, on images from the COCO-Stuff dataset and demonstrated that multitask learning enables a single network to benefit from semantic information and improves the recovery quality. As future work, additional object/background categories can be introduced during training to explore how they affect the reconstruction quality.

Acknowledgement This work was funded and supported by Swisscom Digital Lab in Lausanne, Switzerland.

References

  • Zou and Yuen (2012) W. Zou, P. C. Yuen, Very low resolution face recognition problem, IEEE Transactions on Image Processing 21 (2012) 327–340.
  • Shi et al. (2013) W. Shi, J. Caballero, C. Ledig, X. Zhuang, W. Bai, K. Bhatia, A. M. S. M. de Marvao, T. Dawes, D. O’Regan, D. Rueckert, Cardiac image super-resolution with global correspondence using multi-atlas patchmatch, in: K. Mori, I. Sakuma, Y. Sato, C. Barillot, N. Navab (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 9–16.
  • Park et al. (2013) J. Park, B. Nam, H. Yoo, A high-throughput 16× super resolution processor for real-time object recognition soc, in: 2013 Proceedings of the ESSCIRC (ESSCIRC), pp. 259–262.
  • Yang et al. (2007) Q. Yang, R. Yang, J. Davis, D. Nistér, Spatial-depth super resolution for range images, 2007 IEEE Conference on Computer Vision and Pattern Recognition (2007) 1–8.
  • Ledig et al. (2017) C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, W. Shi, Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 105–114.
  • Dong et al. (2014) C. Dong, C. C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), Computer Vision – ECCV 2014, Springer International Publishing, Cham, 2014, pp. 184–199.
  • Kim et al. (2016a) J. Kim, J. Kwon Lee, K. Mu Lee, Accurate image super-resolution using very deep convolutional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1646–1654.
  • Kim et al. (2016b) J. Kim, J. K. Lee, K. M. Lee, Deeply-recursive convolutional network for image super-resolution, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016b) 1637–1645.
  • Shi et al. (2016) W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 1874–1883.
  • Sajjadi et al. (2017) M. S. M. Sajjadi, B. Schölkopf, M. Hirsch, Enhancenet: Single image super-resolution through automated texture synthesis, 2017 IEEE International Conference on Computer Vision (ICCV) (2017) 4501–4510.
  • Johnson et al. (2016) J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: ECCV.
  • Wu et al. (2017) B. Wu, H. Duan, Z. Liu, G. Sun, SRPGAN: perceptual generative adversarial network for single image super resolution, CoRR abs/1712.05927 (2017).
  • Wang et al. (2018) X. Wang, K. Yu, C. Dong, C. C. Loy, Recovering realistic texture in image super-resolution by deep spatial feature transform, CoRR abs/1804.02815 (2018).
  • Isola et al. (2017) P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.
  • Bozorgtabar et al. (2019) B. Bozorgtabar, M. S. Rad, H. Kemal Ekenel, J. Thiran, Using photorealistic face synthesis and domain adaptation to improve facial expression analysis, in: 2019 14th IEEE International Conference on Automatic Face Gesture Recognition (FG 2019), pp. 1–8.
  • Mahapatra et al. (2018) D. Mahapatra, B. Bozorgtabar, J.-P. Thiran, M. Reyes, Efficient active learning for image classification and segmentation using a sample selection and conditional generative adversarial network, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 580–588.
  • Bozorgtabar et al. (2019) B. Bozorgtabar, M. S. Rad, H. K. Ekenel, J.-P. Thiran, Learn to synthesize and synthesize to learn, Computer Vision and Image Understanding (2019).
  • Caruana (1997) R. Caruana, Multitask learning, Machine Learning 28 (1997) 41–75.
  • Duchon (1979) C. E. Duchon, Lanczos filtering in one and two dimensions, Journal of Applied Meteorology 18 (1979) 1016–1022.
  • Yang et al. (2012) J. Yang, Z. Wang, Z. Lin, S. Cohen, T. Huang, Coupled dictionary training for image super-resolution, Trans. Img. Proc. 21 (2012) 3467–3478.
  • Yang et al. (2011) C.-Y. Yang, J.-B. Huang, M.-H. Yang, Exploiting self-similarities for single frame super-resolution, in: R. Kimmel, R. Klette, A. Sugimoto (Eds.), Computer Vision – ACCV 2010, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 497–510.
  • Huang et al. (2015) J.-B. Huang, A. Singh, N. Ahuja, Single image super-resolution from transformed self-exemplars, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • Dong et al. (2016) C. Dong, C. C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE transactions on pattern analysis and machine intelligence 38 (2016) 295–307.
  • He et al. (2016) K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, in: ECCV.
  • Lai et al. (2017) W.-S. Lai, J.-B. Huang, N. Ahuja, M.-H. Yang, Deep laplacian pyramid networks for fast and accurate superresolution, in: IEEE Conference on Computer Vision and Pattern Recognition, volume 2, p. 5.
  • Zhang et al. (2018) Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual channel attention networks, CoRR abs/1807.02758 (2018).
  • Bruna et al. (2015) J. Bruna, P. Sprechmann, Y. LeCun, Super-resolution with deep convolutional sufficient statistics, CoRR abs/1511.05666 (2015).
  • Simonyan and Zisserman (2014) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556 (2014).
  • Mechrez et al. (2018) R. Mechrez, I. Talmi, F. Shama, L. Zelnik-Manor, Learning to maintain natural image statistics, arXiv preprint arXiv:1803.04626 (2018).
  • Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27, Curran Associates, Inc., 2014, pp. 2672–2680.
  • Ren et al. (2017) W. Ren, J. Pan, X. Cao, M. Yang, Video deblurring via semantic segmentation and pixel-wise non-linear kernel, CoRR abs/1708.03423 (2017).
  • Chen and Koltun (2017) Q. Chen, V. Koltun, Photographic image synthesis with cascaded refinement networks, CoRR abs/1707.09405 (2017).
  • Isola et al. (2017) P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation with conditional adversarial networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 5967–5976.
  • Lin et al. (2014) T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, pp. 740–755.
  • Zhou et al. (2017) B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, A. Torralba, Scene parsing through ade20k dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, IEEE, p. 4.
  • Caesar et al. (2018) H. Caesar, J. R. R. Uijlings, V. Ferrari, Coco-stuff: Thing and stuff classes in context, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018) 1209–1218.
  • Dosovitskiy and Brox (2016) A. Dosovitskiy, T. Brox, Generating images with perceptual similarity metrics based on deep networks, in: D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems 29, Curran Associates, Inc., 2016, pp. 658–666.
  • Mathieu et al. (2015) M. Mathieu, C. Couprie, Y. LeCun, Deep multi-scale video prediction beyond mean square error, CoRR abs/1511.05440 (2015).
  • Badrinarayanan et al. (2017) V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: A deep convolutional encoder-decoder architecture for image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017) 2481–2495.
  • Long et al. (2014) J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, CoRR abs/1411.4038 (2014).
  • Kampffmeyer et al. (2016) M. Kampffmeyer, A. Salberg, R. Jenssen, Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 680–688.
  • Kingma and Ba (2014) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980 (2014).
  • Bevilacqua et al. (2012) M. Bevilacqua, A. Roumy, C. Guillemot, M.-L. Alberi-Morel, Low-complexity single-image super-resolution based on nonnegative neighbor embedding, in: Proceedings of the British Machine Vision Conference, BMVA Press, 2012, pp. 135.1–135.10.
  • Zeyde et al. (2012) R. Zeyde, M. Elad, M. Protter, On single image scale-up using sparse-representations, in: J.-D. Boissonnat, P. Chenin, A. Cohen, C. Gout, T. Lyche, M.-L. Mazure, L. Schumaker (Eds.), Curves and Surfaces, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 711–730.
  • Martin et al. (2001) D. Martin, C. Fowlkes, D. Tal, J. Malik, A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, in: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pp. 416–423 vol.2.