Deep Back-Projection Networks for Single Image Super-resolution
Previous feed-forward architectures of recently proposed deep super-resolution networks learn the features of low-resolution inputs and the non-linear mapping from those to a high-resolution output. However, this approach does not fully address the mutual dependencies of low- and high-resolution images. We propose Deep Back-Projection Networks (DBPN), the winner of two image super-resolution challenges (NTIRE2018 and PIRM2018), that exploit iterative up- and down-sampling layers. These layers are formed as a unit providing an error feedback mechanism for projection errors. We construct mutually-connected up- and down-sampling units, each of which represents different types of image degradation and high-resolution components. We also show that this idea can be extended to several variants that apply the latest deep network trends, such as recurrent networks, dense connections, and residual learning, to improve the performance. The experimental results are superior and, in particular, establish new state-of-the-art results across multiple data sets, especially for large scaling factors such as .
Single image SR (SISR) is an ill-posed inverse problem where the aim is to recover a high-resolution (HR) image from a low-resolution (LR) image. A currently typical approach is to construct an HR image by learning non-linear LR-to-HR mapping, implemented as a deep neural network [16, 17, 18, 13, 12, 19, 14]. These networks compute a sequence of feature maps from the LR image, culminating with one or more upsampling layers to increase resolution and finally construct the HR image. In contrast to this purely feed-forward approach, the human visual system is believed to use a feedback connection to simply guide the task for the relevant results [20, 21, 22]. Perhaps hampered by lack of such feedback, the current SR networks with only feed-forward connections have difficulty in representing the LR-to-HR relation, especially for large scaling factors.
On the other hand, feedback connections were used effectively by one of the early SR algorithms, the iterative back-projection . It iteratively computes the reconstruction error, then uses it to refine the HR image. Although it has been proven to improve the image quality, the results still suffer from ringing and chessboard artifacts . Moreover, this method is sensitive to choices of parameters such as the number of iterations and the blur operator, leading to variability in results.
Inspired by , we construct an end-to-end trainable architecture based on the idea of iterative up- and down-sampling layers: Deep Back-Projection Networks (DBPN). Our networks are not only able to remove the ringing and chessboard effects but also successfully handle large scaling factors, as shown in Fig. 1. Furthermore, DBPN has been validated by winning SISR challenges. On NTIRE2018 , DBPN is the winner on track 8 Bicubic downscaling. On PIRM2018 , DBPN got on Region 2, on Region 1, and on Region 3.
Our work provides the following contributions:
(1) Iterative up- and down-sampling units. Feed-forward architectures, which are considered as a one-way mapping, only map rich representations of the input to the output space. This approach struggles to map LR to HR images, especially at large scaling factors, due to the limited features available in the LR space. Our networks focus not only on generating variants of the HR features using the up-sampling unit but also on projecting them back to the LR space using the down-sampling unit. As shown in Fig. 2 (d), the network alternates between up- (blue box) and down-sampling (gold box) units, which represent the mutual relation of LR and HR features. This procedure can also be considered as feature augmentation to represent various image degradations and HR components. A detailed explanation is given in Section 3.2.
(2) Error feedback. We propose an iterative error-correcting feedback mechanism for SR, which calculates both up- and down-projection errors to guide the reconstruction for obtaining better results. Here, the projection errors are used to refine the initial features in early layers. The detailed explanation can be seen in Section 3.1.
(3) Deep concatenation. Our networks represent different types of image degradation and HR components produced by each up- and down-sampling unit. This ability enables the networks to reconstruct the HR image using concatenation of the HR feature maps from all of the up-sampling units. Our reconstruction can directly utilize different types of HR feature maps from different depths without propagating them through the other layers as shown by the red arrows in Fig. 2 (d).
2 Related Work
2.1 Image super-resolution using deep networks
Deep Networks SR can be primarily divided into four types as shown in Fig. 2.
(a) Predefined upsampling commonly uses interpolation as the upsampling operator to produce a middle resolution (MR) image. This scheme was first proposed by SRCNN  to learn an MR-to-HR non-linear mapping with simple convolutional layers. Later, improved networks exploited residual learning [12, 14] and recursive layers . However, this approach has a higher computational cost because the input is the MR image, which has the same size as the HR image.
(b) Single upsampling offers a simple way to increase the resolution. This approach was first proposed by FSRCNN  and ESPCN . These methods have been proven effective at increasing the resolution and replacing predefined operators. Further improvements include residual network , dense connection , and channel attention . However, they fail to learn the complicated LR-to-HR mapping, especially at large scaling factors, due to the limited feature maps from the LR image. This problem motivates modeling the mutual relation between LR and HR images, which can preserve HR components better.
(c) Progressive upsampling was recently proposed in LapSRN . It progressively reconstructs multiple SR images at different scales in one feed-forward network. For the sake of simplification, we can say that this network is a stack of single upsampling networks, which relies only on limited LR feature maps. Due to this fact, LapSRN is outperformed even by our shallow networks, especially for large scaling factors such as in experimental results.
(d) Iterative up- and down-sampling is proposed by our networks . We focus on increasing the sampling rate of HR feature maps in different depths from iterative up- and down-sampling layers, then, distribute the tasks to calculate the reconstruction error on each unit. This scheme enables the networks to preserve the HR components by learning various up- and down-sampling operators while generating deeper features.
2.2 Feedback networks
Rather than learning a non-linear mapping of input-to-target space in one step, the feedback networks decompose the prediction process into multiple steps, which allows the model to have a self-correcting procedure. Feedback procedures have been implemented in various computing tasks [30, 31, 32, 33, 34, 35, 36].
In the context of human pose estimation, Carreira et al.  proposed an iterative error feedback by iteratively estimating and applying a correction to the current estimation. PredNet  is an unsupervised recurrent network to predictively code the future frames by recursively feeding the predictions back into the model. For image segmentation, Li et al.  learn implicit shape priors and use them to improve the prediction. However, to our knowledge, feedback procedures have not been applied to SR.
2.3 Adversarial training
Adversarial training, such as with Generative Adversarial Networks (GANs) , has been applied to various image reconstruction problems [38, 39, 6, 3, 8]. For the SR task, Johnson et al.  introduced perceptual losses based on high-level features extracted from pre-trained networks. Ledig et al.  proposed SRGAN, which is considered a single upsampling method. It proposed the natural image manifold that is able to create photo-realistic images by specifically formulating a loss function based on the Euclidean distance between feature maps extracted from VGG19 . Our networks can be extended with adversarial training. A detailed explanation is available in Section 6.
Back-projection  is an efficient iterative procedure to minimize the reconstruction error. Previous studies have proven the effectiveness of back-projection [41, 42, 43, 44]. Originally, back-projection in SR was designed for the case with multiple LR inputs. However, given only one LR input image, the reconstruction procedure can be obtained by upsampling the LR image using multiple upsampling operators and calculating the reconstruction error iteratively . Timofte et al.  mentioned that back-projection could improve the quality of the SR images. Zhao et al.  proposed a method to refine high-frequency texture details with an iterative projection process. However, the initialization which leads to an optimal solution remains unknown. Most of the previous studies involve constant and unlearned predefined parameters such as the blur operator and the number of iterations.
To extend this algorithm, we develop an end-to-end trainable architecture that guides the SR task using mutually connected up- and down-sampling units to learn the non-linear mutual relation between LR and HR images. This mutual relation is constructed by creating iterative up- and down-projection units, where the up-projection unit generates HR feature maps and the down-projection unit then projects them back to the LR space, as shown in Fig. 2 (d). This enables the networks to preserve the HR components by learning various up- and down-sampling operators and to generate deeper features to construct numerous LR and HR feature maps.
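The single-image back-projection loop described above can be sketched as follows. This is a minimal 1-D illustration under stated assumptions, not the formulation of the cited works: the box-average `downsample`, the sample-repeating `upsample`, the `step` size, and the zero initialization are all hypothetical choices made to keep the example self-contained.

```python
def downsample(signal, scale):
    """Box-average every `scale` consecutive samples (assumed degradation model)."""
    return [sum(signal[i:i + scale]) / scale
            for i in range(0, len(signal), scale)]


def upsample(signal, scale):
    """Repeat each sample `scale` times (assumed back-projection operator)."""
    return [value for value in signal for _ in range(scale)]


def iterative_back_projection(lr, scale, init, n_iter, step=1.0):
    """Refine an HR estimate so that its downsampled version approaches `lr`."""
    hr = list(init)
    for _ in range(n_iter):
        # Reconstruction error, measured in the LR space.
        error = [obs - sim for obs, sim in zip(lr, downsample(hr, scale))]
        # Project the error back to the HR space and correct the estimate.
        correction = upsample(error, scale)
        hr = [h + step * c for h, c in zip(hr, correction)]
    return hr


lr = [1.0, 3.0]
hr = iterative_back_projection(lr, scale=2, init=[0.0] * 4, n_iter=1)
print(hr)                 # HR estimate after one iteration
print(downsample(hr, 2))  # re-degraded estimate, to compare against lr
```

With hand-picked operators like these, the outcome depends on `step`, `n_iter`, and the operators themselves, which illustrates the parameter sensitivity noted above.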
3 Deep Back-Projection Networks
Let and be HR and LR image with and , respectively, where and . The main building block of our proposed DBPN architecture is the projection unit, which is trained (as part of the end-to-end training of the SR system) to map either an LR feature map to an HR map (up-projection), or an HR map to an LR map (down-projection).
3.1 Projection units
The up-projection unit is defined as follows:
|scale residual up:||(4)|
|output feature map:||(5)|
where * is the spatial convolution operator, and are, respectively, the up- and down-sampling operator with scaling factor , and are (de)convolutional layers at stage .
The up-projection unit, illustrated in the upper part of Fig. 3, takes the previously computed LR feature map as input, and maps it to an (intermediate) HR map ; then it attempts to map it back to LR map (“back-project”). The residual (difference) between the observed LR map and the reconstructed is mapped to HR again, producing a new intermediate (residual) map ; the final output of the unit, the HR map , is obtained by summing the two intermediate HR maps.
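The verbal description above can be written compactly as follows. The symbols are assumed notation introduced here for illustration: $L^{t-1}$ is the input LR map, $H_0^t$ and $H_1^t$ are the two intermediate HR maps, $L_0^t$ is the back-projected LR map, $e_t^l$ is the LR-space residual, $p_t$, $g_t$, $q_t$ are the (de)convolutional layers at stage $t$, and $\uparrow_s$, $\downarrow_s$ denote up- and down-sampling by the scaling factor $s$. The sign convention of the residual is likewise an assumption.

```latex
\begin{align*}
\text{scale up:}           \quad & H_0^t = (L^{t-1} * p_t)\uparrow_s \\
\text{scale down:}         \quad & L_0^t = (H_0^t * g_t)\downarrow_s \\
\text{residual:}           \quad & e_t^l = L_0^t - L^{t-1} \\
\text{scale residual up:}  \quad & H_1^t = (e_t^l * q_t)\uparrow_s \\
\text{output feature map:} \quad & H^t = H_0^t + H_1^t
\end{align*}
```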
The down-projection unit, illustrated in the lower part of Fig. 3, is defined very similarly, but now its job is to map its input HR map to the LR map .
|scale residual down:||(9)|
|output feature map:||(10)|
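By symmetry with the up-projection unit, the down-projection steps can be sketched as follows. The notation is assumed for illustration: $H^t$ is the input HR map, $L_0^t$ and $L_1^t$ are the two intermediate LR maps, $H_0^t$ is the HR map reconstructed from $L_0^t$, $e_t^h$ is the HR-space residual, and $g'_t$, $p'_t$ are the (de)convolutional layers of this unit. The sign of the residual and the reuse of $g'_t$ in the fourth step are assumptions.

```latex
\begin{align*}
\text{scale down:}          \quad & L_0^t = (H^t * g'_t)\downarrow_s \\
\text{scale up:}            \quad & H_0^t = (L_0^t * p'_t)\uparrow_s \\
\text{residual:}            \quad & e_t^h = H_0^t - H^t \\
\text{scale residual down:} \quad & L_1^t = (e_t^h * g'_t)\downarrow_s \\
\text{output feature map:}  \quad & L^t = L_0^t + L_1^t
\end{align*}
```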
We organize projection units in a series of stages, alternating between and . These projection units can be understood as a self-correcting procedure which feeds a projection error to the sampling layer and iteratively changes the solution by feeding back the projection error.
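The self-correcting behaviour of the two units can be illustrated with a toy 1-D sketch. It assumes fixed, hand-written operators (`repeat2`, `average2`, `biased_up`, `correcting_up`, all hypothetical) in place of the learned (de)convolutional layers, and it assumes the residual is taken as "reconstructed minus observed"; it is not the trained network.

```python
def up_projection(lr, scale_up, scale_down, residual_up):
    """Map an LR feature map to an HR map using error feedback."""
    h0 = scale_up(lr)                       # intermediate HR map
    l0 = scale_down(h0)                     # back-projected LR map
    err = [a - b for a, b in zip(l0, lr)]   # projection error in LR space
    h1 = residual_up(err)                   # error mapped to HR space
    return [a + b for a, b in zip(h0, h1)]  # sum of the two HR maps


def down_projection(hr, scale_down, scale_up, residual_down):
    """Map an HR feature map to an LR map using error feedback."""
    l0 = scale_down(hr)                     # intermediate LR map
    h0 = scale_up(l0)                       # re-projected HR map
    err = [a - b for a, b in zip(h0, hr)]   # projection error in HR space
    l1 = residual_down(err)                 # error mapped to LR space
    return [a + b for a, b in zip(l0, l1)]  # sum of the two LR maps


# Hypothetical fixed operators standing in for learned layers.
def repeat2(x):
    return [v for v in x for _ in range(2)]


def average2(x):
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def biased_up(x):
    # An imperfect up-sampler that overshoots by 10 percent.
    return [1.1 * v for v in repeat2(x)]


def correcting_up(err):
    # Residual branch that subtracts the up-sampled error.
    return [-v for v in repeat2(err)]


lr_map = [2.0, 4.0]
hr_map = up_projection(lr_map, biased_up, average2, correcting_up)
lr_back = down_projection(repeat2(lr_map), average2, repeat2, average2)
print(hr_map)   # close to repeat2(lr_map): the overshoot has been corrected
print(lr_back)  # LR map recovered from the HR map
```

In this toy setting the residual branch cancels the 10 percent overshoot of the first up-sampler, which is the role the projection error plays in the unit.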
The projection unit uses large filters such as and . Previous approaches avoid large filters because they can slow down convergence and might produce sub-optimal results. However, the iterative up- and down-sampling units exploit the mutual relation between LR and HR and benefit from large receptive fields, which improves performance especially at large scaling factors, where a significant number of pixels is needed.
3.2 Network architecture
The proposed DBPN is illustrated in Fig. 4. It can be divided into three parts: initial feature extraction, projection, and reconstruction, as described below. Here, let conv be a convolutional layer, where is the filter size and is the number of filters.
Initial feature extraction. We construct initial LR feature-maps from the input using conv. Then conv is used to reduce the dimension from to before entering projection step where is the number of filters used in the initial LR features extraction and is the number of filters used in each projection unit.
Back-projection stages. Following initial feature extraction is a sequence of projection units, alternating between construction of LR and HR feature maps ( and ). Later, it further improves by dense connection where each unit has access to the outputs of all previous units (Section 4.1).
Reconstruction. Finally, the target HR image is reconstructed as where uses conv as reconstruction and refers to the concatenation of the feature maps produced in each up-projection unit, which we call deep concatenation.
Due to the definitions of these building blocks, our network architecture is modular. We can easily define and train networks with different numbers of stages, controlling the depth. For a network with stages, we have the initial extraction stage (2 layers), and then up-projection units and down-projection units, each with 3 layers, followed by the reconstruction (one more layer). However, for the dense projection unit, we add conv in each projection unit, except the first three units as mentioned in Section 4.1.
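The layer bookkeeping above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming that a network with T stages has T up-projection units and T-1 down-projection units (the unit counts are not legible in the text above) and ignoring the extra layers of the dense variant. Under that assumption, 2 and 4 stages give 12 and 24 layers, the depths reported for DBPN-S and DBPN-M in Section 5.2, if those networks use 2 and 4 stages as the Table I row suggests.

```python
def dbpn_layer_count(num_stages):
    """Count convolutional layers of a non-dense DBPN with `num_stages` stages.

    Assumption: `num_stages` up-projection units and `num_stages - 1`
    down-projection units, each with 3 layers.
    """
    initial_extraction = 2
    up_units = num_stages
    down_units = num_stages - 1
    layers_per_unit = 3
    reconstruction = 1
    return (initial_extraction
            + layers_per_unit * (up_units + down_units)
            + reconstruction)


for stages in (2, 4, 6):
    print(stages, dbpn_layer_count(stages))
```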
4 The Variants of DBPN
In this section, we show how DBPN can be modified to apply the latest deep learning trends.
4.1 Dense projection units
The dense inter-layer connectivity pattern in DenseNets  has been shown to alleviate the vanishing-gradient problem, produce improved features, and encourage feature reuse. Inspired by this, we propose to improve DBPN by introducing dense connections in the projection units, yielding Dense DBPN.
Unlike the original DenseNets, we avoid dropout and batch norm, which are not suitable for SR, because they remove the range flexibility of the features . Instead, we use convolution layer as the bottleneck layer for feature pooling and dimensional reduction [45, 11] before entering the projection unit.
In Dense DBPN, the input for each unit is the concatenation of the outputs from all previous units. Let the and be the input for dense up- and down-projection unit, respectively. They are generated using conv which is used to merge all previous outputs from each unit as shown in Fig. 5. This improvement enables us to generate the feature maps effectively, as shown in the experimental results.
4.2 Recurrent DBPN
Here, we propose recurrent DBPN which is able to reduce the number of parameters and widen the receptive field without increasing the model capacity. In SISR, DRCN  proposed recursive layers without introducing new parameters for additional convolutions in the networks. Then, DRRN  improves residual networks by introducing both global and local residual learning using a very deep CNN model (up to 52-layers). DBPN can also be treated as a recurrent network by sharing the projection units across the stages. We divided recurrent DBPN into two variants as mentioned below.
(a) Single pair of projection unit (DBPN-R) utilizes only one up-projection unit and one down-projection unit which is shared across all stages and does not utilize dense connection as shown in Fig. 6.
(b) Multiple pairs of projection units (DBPN-MR) utilizes multiple up- and down-projection units as shown in Fig. 7. However, instead of taking the output from each up-projection unit, DBPN-MR takes the HR features only from the last up-projection unit and then concatenates the HR features from each iteration. Here, the output from the last down-projection unit is the input for the next iteration. Then, the last up-projection unit receives the output of all previous down-projection units in the corresponding iteration.
4.3 Residual DBPN
Residual learning helps the network converge faster and gives it the easier job of producing only the difference between the HR image and the interpolated LR image. Residual learning was first applied in SR by VDSR . Residual DBPN takes the LR image as input to reduce the computational time. First, the LR image is interpolated using Bicubic interpolation; then, at the last stage, the interpolated image is added to the reconstructed image to produce the final SR image.
5 Experimental Results
5.1 Implementation and training details
In the proposed networks, the filter size in the projection unit varies with the scaling factor. For , we use kernel with stride = 2 and pad by 2 pixels. Then, use kernel with stride = 4 and pad by 2 pixels. Finally, the use kernel with stride = 8 and pad by 2 pixels. We found these settings to work well based on general intuition and preliminary experiments.
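The stride and padding values above constrain the kernel size if each up-sampling layer is to enlarge its input by exactly the scaling factor. The following sketch uses the standard output-size formulas for a convolution and a transposed convolution (dilation 1, no output padding); that the layers are meant to scale exactly, and the input size of 32, are assumptions made for illustration. For stride 4 the derived kernel size is 8, which agrees with the 8×8 filter discussed in the filter size analysis of Section 5.2.

```python
def deconv_output_size(in_size, kernel, stride, pad):
    """Output length of a transposed convolution (dilation 1, no output padding)."""
    return (in_size - 1) * stride - 2 * pad + kernel


def conv_output_size(in_size, kernel, stride, pad):
    """Output length of a strided convolution (dilation 1)."""
    return (in_size + 2 * pad - kernel) // stride + 1


def kernel_for_exact_scaling(stride, pad):
    """Kernel size for which the transposed convolution outputs stride * in_size.

    Solving (n - 1) * s - 2 * p + k = s * n for k gives k = s + 2 * p.
    """
    return stride + 2 * pad


for scale in (2, 4, 8):
    kernel = kernel_for_exact_scaling(stride=scale, pad=2)
    up = deconv_output_size(32, kernel, scale, 2)    # LR -> HR
    down = conv_output_size(up, kernel, scale, 2)    # HR -> LR
    print(scale, kernel, up, down)
```

The same kernel, stride, and padding also bring the HR map back to the original LR size, so the up- and down-sampling layers of a projection unit can share one setting per scale.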
We initialize the weights based on . Here, standard deviation (std) is computed by where , is the filter size, and is the number of filters. For example, with and , the std is . All convolutional and deconvolutional layers are followed by parametric rectified linear units (PReLUs), except the final reconstruction layer.
We trained all networks using images from DIV2K  with augmentation (scaling, rotating, flipping, and random cropping). To produce LR images, we downscale the HR images on particular scaling factors using Bicubic. We use batch size of 16 with size for LR image, while HR image size corresponds to the scaling factors. The learning rate is initialized to for all layers and decreased by a factor of 10 for every iterations for total iterations. We used Adam with momentum to and trained with L1 Loss. All experiments were conducted using PyTorch 0.3.1 and Python 3.5 on NVIDIA TITAN X GPUs. The code is available online.
|Network||DBPN-SS||DBPN-S||DBPN-M||DBPN-L||D-DBPN-L||D-DBPN||DBPN|
|No. of stages ()||2||2||4||6||6||7||10|
5.2 Model analysis
There are seven types of DBPN used for model analysis: DBPN-SS, DBPN-S, DBPN-M, DBPN-L, D-DBPN-L, D-DBPN, and DBPN. The detailed architectures of those networks are shown in Table I.
Depth analysis. To demonstrate the capability of our projection unit, we construct multiple networks: DBPN-S (), DBPN-M (), and DBPN-L (). In the feature extraction, we use and . Then, we use for the reconstruction. The input and output image are luminance only.
The results on enlargement are shown in Fig. 8. DBPN outperforms the state-of-the-art methods. Starting from our shallow network, DBPN-S gives a higher PSNR than VDSR, DRCN, and LapSRN. DBPN-S uses only 12 convolutional layers with a smaller number of filters than VDSR, DRCN, and LapSRN. At its best performance, DBPN-S achieves dB, which is dB, dB, and dB better than VDSR, DRCN, and LapSRN, respectively. DBPN-M improves on all four existing state-of-the-art methods (VDSR, DRCN, LapSRN, and DRRN). At its best performance, DBPN-M achieves dB, which is dB, dB, dB, and dB better than VDSR, DRCN, LapSRN, and DRRN, respectively. In total, DBPN-M uses 24 convolutional layers, which is the same depth as LapSRN. Compared to DRRN (up to 52 convolutional layers), DBPN-M clearly shows the effectiveness of our projection unit. Finally, DBPN-L outperforms all methods with dB, which is dB, dB, dB, and dB better than VDSR, DRCN, LapSRN, and DRRN, respectively.
The results of enlargement are shown in Fig. 9. Our networks outperform the current state-of-the-art for enlargement, which clearly shows the effectiveness of our proposed networks on large scaling factors. However, we found that there is no significant performance gain between the proposed networks, especially between DBPN-L and DBPN-M, where the difference is only dB.
For the sake of low computation for real-time processing, we construct DBPN-SS which is the lighter version of DBPN-S, . We use and . However, the results outperform SRCNN, FSRCNN, and VDSR on both and enlargement. Moreover, DBPN-SS performs better than VDSR with and fewer parameters on and enlargement, respectively.
DBPN-S has about fewer parameters and higher PSNR than LapSRN on enlargement. Finally, D-DBPN has about fewer parameters, and approximately the same PSNR, compared to EDSR on enlargement. On the enlargement, D-DBPN has about fewer parameters with better PSNR compared to EDSR. This evidence shows that our networks have the best trade-off between performance and number of parameters.
Deep concatenation. Each projection unit is used to distribute the reconstruction step by constructing features which represent different details of the HR components. Deep concatenation is also closely related to the number of (back-projection stages): more detailed features generated from the projection units also increase the quality of the results. Fig. 12 shows that each stage successfully generates diverse features to reconstruct the SR image.
Error Feedback. As stated before, error feedback (EF) is used to guide the reconstruction in the early layer. Here, we analyze how error feedback can help for better reconstruction. We conduct experiments to see the effectiveness of error feedback procedure. On the scenario without EF, we replace up- and down-projection unit with single up- (deconvolution) and down-sampling (convolution) layer.
We show the PSNR of DBPN-S with and without EF in Table II. The result with EF is 0.53 dB and 0.26 dB better than without EF on Set5 and Set14, respectively. In Fig. 13, we visually show how error feedback can construct a better and sharper HR image, especially in the white stripe pattern of the wing.
Moreover, the performance of DBPN-S without EF is interestingly 0.57 dB and 0.35 dB better than previous approaches such as SRCNN  and FSRCNN , respectively, on Set5. The results show the effectiveness of iterative up- and downsampling layers to demonstrate the LR-to-HR mutual dependency.
Filter Size. We analyze the size of the filters used in the back-projection stage. As stated before, the choice of filter size in the back-projection stage is based on preliminary results. For 4× enlargement, we show that an 8×8 filter is 0.08 dB and 0.09 dB better than 6×6 and 10×10 filters, respectively, as shown in Table III.
Luminance vs RGB. In D-DBPN, we change the input/output from luminance to RGB color channels. There is no significant improvement in the quality of the result, as shown in Table IV. However, for running time efficiency, constructing all channels simultaneously is faster than processing them separately.
5.3 Comparison of each DBPN variant
Dense connection. We implement D-DBPN-L, the densely connected version of the network, to show how dense connection can improve the network’s performance in all cases, as shown in Table V. On enlargement, the dense network, D-DBPN-L, is dB and dB higher than DBPN-L on Set5 and Set14, respectively. On , the gaps are even larger: D-DBPN-L is dB and dB higher than DBPN-L on Set5 and Set14, respectively.
Comparison across the variants. We compare six DBPN variants: DBPN-R64-10, DBPN-R128-5, DBPN-MR64-3, DBPN-RES-MR64-3, DBPN-RES, and DBPN. First, DBPN, which was the winner of NTIRE2018  and PIRM2018 , uses , , and for the back-projection stages, and dense connection between projection units. In the reconstruction, we use . DBPN-R64-10 uses with 10 iterations to produce 640 HR features as input of the reconstruction layer. DBPN-R128-5 uses with 5 iterations and produces 640 HR features. DBPN-MR64-3 has the same architecture as D-DBPN, but the projection units are treated as a recurrent network. DBPN-RES-MR64-3 is DBPN-MR64-3 with residual learning. Last, DBPN-RES is DBPN with residual learning. All variants are trained with the same training setup.
The results are shown in Table VI. All variants perform better than D-DBPN . DBPN-R64-10 has the fewest parameters among the variants, which makes it suitable for mobile/real-time applications. It can reduce the number of parameters compared to DBPN while maintaining good performance. Increasing can improve the performance of DBPN-R, as shown by DBPN-R128-5 compared to DBPN-R64-10. However, better results are obtained by DBPN-MR64-3, especially on the Urban100 and Manga109 test sets. Residual learning also slightly improves the performance of DBPN. It is therefore natural to combine multiple-stage recurrence and residual learning, called DBPN-RES-MR64-3, which gives the best results and has fewer parameters than DBPN.
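The 640 HR features quoted for both recurrent variants follow from concatenating the HR output of every iteration along the channel axis. The sketch below assumes that each iteration contributes 64 or 128 HR feature channels (read off the variant names, since the filter counts are not legible above) and uses hypothetical channel labels in place of real feature maps.

```python
def concat_hr_features(per_iteration_features):
    """Concatenate per-iteration HR feature channels along the channel axis."""
    merged = []
    for features in per_iteration_features:
        merged.extend(features)
    return merged


def recurrent_hr_channels(filters_per_unit, iterations):
    """Number of channels entering the reconstruction layer (assumed layout)."""
    outputs = [[f"iter{i}_ch{c}" for c in range(filters_per_unit)]
               for i in range(iterations)]
    return len(concat_hr_features(outputs))


print(recurrent_hr_channels(64, 10))   # DBPN-R64-10 under the stated assumption
print(recurrent_hr_channels(128, 5))   # DBPN-R128-5 under the stated assumption
```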
|Method||# Parameters ()||PSNR||SSIM||PSNR||SSIM||PSNR||SSIM||PSNR||SSIM||PSNR||SSIM|
5.4 Comparison with the state of the art on SR
To confirm the ability of the proposed network, we performed several experiments and analysis. We compare our network with ten state-of-the-art SR algorithms: A+ , SRCNN , FSRCNN , VDSR , DRCN , DRRN , LapSRN , D-DBPN , EDSR , and RCAN . We carry out extensive experiments using 5 datasets: Set5 , Set14 , BSDS100 , Urban100  and Manga109 . Each dataset has different characteristics. Set5, Set14 and BSDS100 consist of natural scenes; Urban100 contains urban scenes with details in different frequency bands; and Manga109 is a dataset of Japanese manga.
Our final network, DBPN-RES-MR64-3, combines dense connection, recurrent network and residual learning to boost the performance of DBPN. It uses , , and with 3 iterations. In the reconstruction, we use . RGB color channels are used for input and output image. It takes around 14 days to train.
PSNR and structural similarity (SSIM)  were used to quantitatively evaluate the proposed method. Note that higher PSNR and SSIM values indicate better quality. As used by existing networks, all measurements used only the luminance channel (Y). For SR by factor , we crop pixels near image boundary before evaluation as in [25, 17]. Some of the existing networks such as SRCNN, FSRCNN, VDSR, and EDSR did not perform enlargement. To this end, we retrained the existing networks by using author’s code with the recommended parameters.
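The evaluation protocol above (PSNR on the luminance channel after cropping the image boundary) can be sketched as follows. This is a minimal illustration on nested lists; the `border` width and the peak value of 255 are assumptions, since the exact crop size is not legible above, and the conversion from RGB to luminance is left out.

```python
import math


def crop_border(image, border):
    """Remove `border` pixels from every side of a 2-D image (list of rows)."""
    if border == 0:
        return [row[:] for row in image]
    return [row[border:-border] for row in image[border:-border]]


def psnr(reference, estimate, border=0, max_value=255.0):
    """PSNR in dB between two same-sized luminance images after border cropping."""
    ref = crop_border(reference, border)
    est = crop_border(estimate, border)
    squared_error = 0.0
    count = 0
    for ref_row, est_row in zip(ref, est):
        for r, e in zip(ref_row, est_row):
            squared_error += (r - e) ** 2
            count += 1
    if count == 0:
        raise ValueError("border is too large for this image")
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4x4 example: the interior differs by 1, the border differs by a lot.
reference = [[100.0] * 4 for _ in range(4)]
estimate = [[0.0] * 4,
            [0.0, 101.0, 101.0, 0.0],
            [0.0, 101.0, 101.0, 0.0],
            [0.0] * 4]
print(psnr(reference, estimate, border=1))  # only the interior is scored
```

Cropping keeps boundary effects of the upsampling layers out of the score, which is why the border pixels in the toy example do not affect the result.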
Figure 14 shows that EDSR tends to generate stronger edges than the ground truth, which leads to misleading information in several cases. In the result of EDSR, the eyelashes were interpreted as a stripe pattern. Our result generates softer patterns, which is subjectively closer to the ground truth. On the butterfly image, EDSR separates the white pattern and tends to construct regular patterns such as circles and stripes, while D-DBPN constructs the same pattern as the ground truth.
We show the quantitative results in Table VII. Our network outperforms the existing methods by a large margin in all scales except RCAN. For , EDSR is dB higher than D-DBPN but is outperformed by DBPN-RES-MR64-3 with a dB margin on Urban100. The recent state-of-the-art, RCAN , performs better than our network on . However, on , our network is dB higher than RCAN on Urban100. The biggest gap is shown on Manga109, where our network is dB higher than RCAN.
Our network shows its effectiveness on enlargement, where it outperforms all of the existing methods by a large margin. Interesting results are shown on the Manga109 dataset, where D-DBPN obtains dB, which is dB better than EDSR, while on the Urban100 dataset D-DBPN achieves 23.25, which is only dB better than EDSR. Our final network, DBPN-RES-MR64-3, outperforms all previous networks. DBPN-RES-MR64-3 is roughly dB better than RCAN  across multiple datasets. The biggest gap is on Manga109, where DBPN-RES-MR64-3 is dB better than RCAN . The overall results show that our networks perform better on images with fine structures, especially manga characters, even though we do not use any animation images in the training.
The results of enlargement are visually shown in Fig. 15. Qualitatively, our network is able to preserve the HR components better than other networks. For image “img.png”, all of the previous methods fail to recover the correct direction of the image textures, while ours produces results more faithful to the ground truth. For image “Hamlet.png”, other methods suffer from heavy blurring artifacts and fail to recover the details, while our network successfully recovers the fine detail and produces the closest result to the ground truth. This shows that our networks can not only extract features but also create contextual information from the LR input to generate HR components in the case of large scaling factors, such as enlargement.
5.5 Runtime Evaluation
We present the runtime comparisons between our networks and three state-of-the-art networks: VDSR , DRRN , and EDSR . The comparison must be done in fair settings. The runtime is calculated using the Python function timeit, encapsulating only the forward function. For EDSR, we use the original author code based on Torch and use the timer function to obtain the runtime.
We evaluate each network using NVIDIA TITAN X GPU (12G Memory). The input image size is 64×64, which is then upscaled to 128×128 (2×), 256×256 (4×), and 512×512 (8×). The results are the average of 10 trials.
Table VIII shows the runtime comparisons on 2×, 4×, and 8× enlargement. DBPN-SS and DBPN-S obtain the best and second best performance on 4× and 8× enlargement. On 2× enlargement, we did not train the variants of our proposed network except for D-DBPN. Therefore, we cannot report the runtime for the DBPN-SS, DBPN-S, DBPN-M, and DBPN-L networks. Compared to EDSR, D-DBPN shows its effectiveness by having faster runtime with comparable quality on 2× and 4× enlargement. On 8× enlargement, the gap is bigger: D-DBPN has better results with lower runtime than EDSR.
Note that the input for VDSR and DRRN is only the luminance channel and needs preprocessing to create the middle-resolution image. Therefore, the cost of the interpolation in preprocessing should be added to their runtime.
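A timing harness in the spirit of this protocol can be sketched with the standard-library `timeit` module. The function `dummy_forward` is a hypothetical stand-in for a network's forward pass (a nearest-neighbor 2× upscaler on nested lists), and `average_forward_time` is an illustrative helper name; neither comes from the released code.

```python
import timeit


def average_forward_time(forward, inputs, trials=10):
    """Average wall-clock time in seconds of `forward(inputs)` over `trials` calls."""
    # Each trial times exactly one call, so only the forward pass is measured.
    timings = timeit.repeat(lambda: forward(inputs), repeat=trials, number=1)
    return sum(timings) / len(timings)


def dummy_forward(image):
    """Hypothetical stand-in for a network: 2x nearest-neighbor upscaling."""
    return [[value for value in row for _ in range(2)]
            for row in image for _ in range(2)]


lr_image = [[0.0] * 64 for _ in range(64)]   # 64x64 input, as in the text
seconds = average_forward_time(dummy_forward, lr_image, trials=10)
print(f"average forward time: {seconds:.6f} s")
```

When the forward pass runs on a GPU, kernels execute asynchronously, so a real measurement would also need to synchronize the device before the timer is read; that step is omitted here.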
6 Perceptually optimized DBPN
We also can extend DBPN to produce HR outputs that appear to be better under human perception. Despite many attempts, it remains unclear how to accurately model perceptual quality. Instead, we incorporate the perceptual quality into the generator by using adversarial loss, as introduced elsewhere [38, 39]. In the adversarial settings, there are two building blocks: a generator () and a discriminator (). In the context of SR, the generator produces HR images (from LR inputs). The discriminator works to differentiate between real HR images and generated HR images (the product of SR network ). In our experiments, the generator is a DBPN network, and the discriminator is a network with five hidden layers with batch norm, followed by the last, fully connected layer.
The generator loss in this experiment is composed of four loss terms, following : MSE, VGG, Style, and Adversarial loss.
MSE loss is a pixel-wise loss calculated in the image space .
Adversarial loss. , where is the probability assigned by to being a real HR image.
Style loss is used to generate high quality textures. This loss was originally proposed by . Style loss uses the same differentiable function as in VGG loss. where Gram matrix .
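The loss terms listed above can be sketched in plain Python as follows. Several details are assumptions because the formulas are not legible in the text: the adversarial term is taken as the negative log of the discriminator's probability, the Gram matrix is left unnormalized, the style loss is the mean squared difference of Gram entries, and the term weights in `generator_loss` are hypothetical. Feature maps are represented as lists of flattened channels rather than VGG activations.

```python
import math


def mse_loss(estimate, target):
    """Mean squared error between two flat lists of values."""
    diffs = [(e - t) ** 2 for e, t in zip(estimate, target)]
    return sum(diffs) / len(diffs)


def gram_matrix(features):
    """Channel-by-channel inner products; `features` is a list of flat channels."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


def style_loss(est_features, target_features):
    """Mean squared difference between the two Gram matrices (assumed form)."""
    flat_est = [v for row in gram_matrix(est_features) for v in row]
    flat_tgt = [v for row in gram_matrix(target_features) for v in row]
    return mse_loss(flat_est, flat_tgt)


def adversarial_loss(prob_real):
    """Generator-side loss, assumed to be -log D(G(x))."""
    return -math.log(prob_real)


def generator_loss(terms, weights):
    """Weighted sum of the loss terms; the weights here are hypothetical."""
    return sum(weights[name] * terms[name] for name in terms)


features = [[1.0, 2.0], [3.0, 4.0]]
print(gram_matrix(features))
terms = {"mse": mse_loss([1.0, 2.0], [1.0, 4.0]),
         "style": style_loss(features, features),
         "adv": adversarial_loss(0.5)}
print(generator_loss(terms, {"mse": 1.0, "style": 1.0, "adv": 0.1}))
```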
The training objective for is
As is common in training adversarial networks, we alternate between stages of training and training . We use a pre-trained DBPN model which is optimized by MSE loss only, and then fine-tune it with the perceptual loss. We use batch size of 4 with size for LR image, while HR image size is . The learning rate is initialized to for all layers for iteration using Adam with momentum to .
This method was included in the challenge associated with PIRM2018 , in conjunction with ECCV 2018. In the challenge, evaluation was conducted in three disjoint regimes defined by thresholds on the RMSE; the intuition behind this is the natural tradeoff between RMSE and perceptual quality of the reconstruction. The latter is measured by combining the quality measures of Ma  and NIQE  as below,
The three regimes correspond to Region 1: RMSE , Region 2: RMSE , and Region 3: RMSE . We select optimal parameter settings for each regime. This process yields
Region 1 ()
Region 2 ()
Region 3 ()
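The evaluation protocol described above, a perceptual score combined with RMSE-defined regimes, can be sketched as below. The perceptual index formula and the RMSE bounds (11.5, 12.5, 16) are stated here as assumptions recalled from the public PIRM 2018 challenge description and should be checked against the challenge report; the function names are hypothetical.

```python
def perceptual_index(ma, niqe):
    """Perceptual index as assumed from the PIRM 2018 description.

    Lower is better: a high Ma score and a low NIQE score both reduce it.
    """
    return 0.5 * ((10.0 - ma) + niqe)


def pirm_region(rmse, bounds=(11.5, 12.5, 16.0)):
    """Map an RMSE value to regime 1, 2, or 3 (None if above the last bound).

    The default bounds are assumptions, not values taken from this paper.
    """
    for region, upper in enumerate(bounds, start=1):
        if rmse <= upper:
            return region
    return None


if __name__ == "__main__":
    print(perceptual_index(ma=8.0, niqe=3.0))
    print(pirm_region(12.1))
```

Within each regime, submissions are then compared by perceptual score, so a method is tuned separately per regime to trade distortion against perceptual quality.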
Our method achieved place on Region 2, place on Region 1, and place on Region 3. In Region 3, the results remain very competitive; notably, our method has the lowest RMSE among the top 5 performers, which means its output has less distortion or hallucination with respect to the original image.
Qualitative results from our method are shown in Fig. 16. High-quality textures improve significantly in each region compared to the MSE-optimized SR image.
We have proposed Deep Back-Projection Networks for single image super-resolution, the winner of two single image SR challenges (NTIRE2018 and PIRM2018). Unlike previous methods, which predict the SR image in a feed-forward manner, our proposed networks directly increase the SR features using multiple up- and down-sampling stages and feed the error predictions at each depth of the network back to revise the sampling results; they then accumulate the self-correcting features from each upsampling stage to create the SR image. We use error feedback from the up- and down-scaling steps to guide the network to a better result. The results show the effectiveness of the proposed network compared to other state-of-the-art methods. Moreover, our proposed network successfully outperforms other state-of-the-art methods on large scaling factors such as enlargement. We also show that DBPN can be modified into several variants that follow the latest deep learning trends to improve its performance.
This work was partly supported by FCRAL, and by AFOSR Center of Excellence in Efficient and Robust Machine Learning, Award FA9550-18-1-0166.
- [1] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
- [3] E. L. Denton, S. Chintala, R. Fergus et al., “Deep generative image models using a laplacian pyramid of adversarial networks,” in Advances in neural information processing systems, 2015, pp. 1486–1494.
- [4] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [5] G. Larsson, M. Maire, and G. Shakhnarovich, “Fractalnet: Ultra-deep neural networks without residuals,” arXiv preprint arXiv:1605.07648, 2016.
- [6] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
- [7] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
- [8] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
- [9] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia, “Detail-revealing deep video super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 22–29.
- [10] M. S. Sajjadi, R. Vemulapalli, and M. Brown, “Frame-recurrent video super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6626–6634.
- [11] M. Haris, M. R. Widyanto, and H. Nobuhara, “Inception learning super-resolution,” Appl. Opt., vol. 56, no. 22, pp. 6043–6048, Aug 2017.
- [12] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 1646–1654.
- [13] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [14] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [15] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 286–301.
- [16] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 2, pp. 295–307, 2016.
- [17] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in European Conference on Computer Vision. Springer, 2016, pp. 391–407.
- [18] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
- [19] J. Kim, J. Kwon Lee, and K. Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1637–1645.
- [20] D. J. Felleman and D. C. Van Essen, “Distributed hierarchical processing in the primate cerebral cortex,” Cerebral cortex (New York, NY: 1991), vol. 1, no. 1, pp. 1–47, 1991.
- [21] D. J. Kravitz, K. S. Saleem, C. I. Baker, L. G. Ungerleider, and M. Mishkin, “The ventral visual pathway: an expanded neural framework for the processing of object quality,” Trends in cognitive sciences, vol. 17, no. 1, pp. 26–49, 2013.
- [22] V. A. Lamme and P. R. Roelfsema, “The distinct modes of vision offered by feedforward and recurrent processing,” Trends in neurosciences, vol. 23, no. 11, pp. 571–579, 2000.
- [23] M. Irani and S. Peleg, “Improving resolution by image registration,” CVGIP: Graphical models and image processing, vol. 53, no. 3, pp. 231–239, 1991.
- [24] S. Dai, M. Han, Y. Wu, and Y. Gong, “Bilateral back-projection for single image super resolution,” in Multimedia and Expo, 2007 IEEE International Conference on. IEEE, 2007, pp. 1039–1042.
- [25] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
- [26] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- [27] R. Timofte, S. Gu, J. Wu, L. Van Gool, L. Zhang, M.-H. Yang, M. Haris, G. Shakhnarovich, N. Ukita et al., “Ntire 2018 challenge on single image super-resolution: Methods and results,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2018 IEEE Conference on, 2018.
- [28] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “2018 pirm challenge on perceptual image super-resolution,” arXiv preprint arXiv:1809.07517, 2018.
- [29] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [30] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, “Human pose estimation with iterative error feedback,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4733–4742.
- [31] S. Ross, D. Munoz, M. Hebert, and J. A. Bagnell, “Learning message-passing inference machines for structured prediction,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 2737–2744.
- [32] Z. Tu and X. Bai, “Auto-context and its application to high-level vision tasks and 3d brain image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1744–1757, 2010.
- [33] K. Li, B. Hariharan, and J. Malik, “Iterative instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3659–3667.
- [34] A. R. Zamir, T.-L. Wu, L. Sun, W. Shen, J. Malik, and S. Savarese, “Feedback networks,” arXiv preprint arXiv:1612.09508, 2016.
- [35] A. Shrivastava and A. Gupta, “Contextual priming and feedback for faster r-cnn,” in European Conference on Computer Vision. Springer, 2016, pp. 330–348.
- [36] W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” arXiv preprint arXiv:1605.08104, 2016.
- [37] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
- [38] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
- [39] M. S. Sajjadi, B. Schölkopf, and M. Hirsch, “Enhancenet: Single image super-resolution through automated texture synthesis,” arXiv preprint arXiv:1612.07919, 2016.
- [40] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” ICLR, 2015.
- [41] Y. Zhao, R.-G. Wang, W. Jia, W.-M. Wang, and W. Gao, “Iterative projection reconstruction for fast and efficient image upsampling,” Neurocomputing, vol. 226, pp. 200–211, 2017.
- [42] M. Haris, M. R. Widyanto, and H. Nobuhara, “First-order derivative-based super-resolution,” Signal, Image and Video Processing, vol. 11, no. 1, pp. 1–8, 2017.
- [43] W. Dong, L. Zhang, G. Shi, and X. Wu, “Nonlocal back-projection for adaptive image enlargement,” in Image Processing (ICIP), 2009 16th IEEE International Conference on. IEEE, 2009, pp. 349–352.
- [44] R. Timofte, R. Rothe, and L. Van Gool, “Seven ways to improve example-based single image super resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1865–1873.
- [45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
- [46] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
- [47] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
- [48] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Asian Conference on Computer Vision. Springer, 2014, pp. 111–126.
- [49] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. A. Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in British Machine Vision Conference (BMVC), 2012.
- [50] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Curves and Surfaces. Springer, 2012, pp. 711–730.
- [51] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 5, pp. 898–916, 2011.
- [52] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5197–5206.
- [53] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa, “Sketch-based manga retrieval using manga109 dataset,” Multimedia Tools and Applications, pp. 1–28, 2016.
- [54] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” Image Processing, IEEE Transactions on, vol. 13, no. 4, pp. 600–612, 2004.
- [55] A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity metrics based on deep networks,” in Advances in Neural Information Processing Systems, 2016, pp. 658–666.
- [56] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
- [57] C. Ma, C.-Y. Yang, X. Yang, and M.-H. Yang, “Learning a no-reference quality metric for single-image super-resolution,” Computer Vision and Image Understanding, vol. 158, pp. 1–16, 2017.
- [58] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a ‘completely blind’ image quality analyzer,” IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, 2013.
Muhammad Haris received the S. Kom (Bachelor of Computer Science) from the Faculty of Computer Science, University of Indonesia, Depok, Indonesia, in 2009. He then received the M. Eng and Dr. Eng degrees from the Department of Intelligent Interaction Technologies, University of Tsukuba, Japan, in 2014 and 2017, respectively, under the supervision of Dr. Hajime Nobuhara. Currently, he is working as a postdoctoral fellow in the Intelligent Information Media Laboratory, Toyota Technological Institute, with Prof. Norimichi Ukita. His main research interests are low-level vision and image/video processing.
Greg Shakhnarovich has been a faculty member at TTI-Chicago since 2008. He received his BSc degree in Computer Science and Mathematics from the Hebrew University in Jerusalem, Israel, in 1994, and an MSc degree in Computer Science from the Technion, Israel, in 2000. Prior to joining TTIC, Greg was a Postdoctoral Research Associate at Brown University, collaborating with researchers at the Computer Science Department and the Brain Sciences program there. Greg’s research interests lie broadly in computer vision and machine learning.
Norimichi Ukita is a professor at the graduate school of engineering, Toyota Technological Institute, Japan (TTI-J). He received the B.E. and M.E. degrees in information engineering from Okayama University, Japan, in 1996 and 1998, respectively, and the Ph.D. degree in Informatics from Kyoto University, Japan, in 2001. After working for five years as an assistant professor at NAIST, he became an associate professor in 2007 and moved to TTI-J in 2016. He was a research scientist of Precursory Research for Embryonic Science and Technology, Japan Science and Technology Agency (JST), during 2002-2006. He was a visiting research scientist at Carnegie Mellon University during 2007-2009. He also currently works at the Cybermedia Center of Osaka University as a guest professor. His main research interests are object detection/tracking and human pose/shape estimation. He is a member of the IEEE.