Adaptive Densely Connected Single Image Super-Resolution
For better performance in single image super-resolution (SISR), we present an image super-resolution algorithm based on adaptive dense connections (ADCSR). The algorithm is divided into two parts: BODY and SKIP. BODY improves the utilization of convolution features through adaptive dense connections, and we develop an adaptive feature sub-pixel reconstruction layer (AFSL) to reconstruct the features output by BODY. We pre-train SKIP so that BODY can focus on learning high-frequency features. Comparisons of PSNR, SSIM, and visual quality verify the superiority of our method over state-of-the-art algorithms.
Single image super-resolution aims at reconstructing an accurate high-resolution image from a low-resolution image. Since deep learning made great progress in computer vision, many SISR algorithms based on deep Convolutional Neural Networks (CNNs) have been proposed in recent years. The powerful feature representation and end-to-end training of CNNs have led to major breakthroughs in SISR.
Dong et al. first proposed SRCNN, introducing a three-layer CNN for image SR. Kim et al. increased the number of layers to 20 in VDSR and DRCN, making notable improvements over SRCNN. In general, the deeper a network is, the more powerful its representation. However, as network depth grows, vanishing and exploding gradients become the main obstacles to performance. This problem was alleviated when He et al. proposed residual networks (ResNet) and Huang et al. proposed densely connected networks (DenseNet). Many large-scale networks were then introduced in SISR, such as SRResNet, EDSR, SRDenseNet, and RDN. These methods aim at building deeper networks to increase performance. Other methods such as RCAN and SAN try to learn the correlations of features in the middle layers.
WDSR achieves better network performance with less computational effort. AWSRN applies an adaptive weighted network: weight adaptation is achieved by multiplying the residual convolution and the residual skip connection by trainable coefficients. Since dense connections perform better than residual ones, we develop an adaptive dense connection method to enhance the efficiency of feature learning. WDSR and AWSRN contain a similar global SKIP, a single sub-pixel convolution. Although the SKIP is intended to recover low-frequency content, there is no practical measure to constrain its training. We present an adaptive densely connected super-resolution reconstruction algorithm (ADCSR). The algorithm is divided into two parts: BODY and SKIP; BODY is focused on high-frequency information reconstruction through pre-training of SKIP. ADCSR obtains the best SISR performance under bicubic degradation. Our main contributions are threefold:
(1) WDSR is optimized using adaptive dense connections. Experiments were carried out by initializing the adaptive parameters and optimizing the models; on this basis, the performance of the network is greatly improved;
(2) We propose the AFSL module to perform image SR through adaptive sub-pixel convolution;
(3) We develop a training method that pre-trains SKIP first and then trains the entire network. Thus, BODY focuses on reconstructing high-frequency details, improving network performance.
2 Related Works
SISR has important applications in many fields, such as security and surveillance imaging, medical imaging, and image generation. The simplest methods are interpolation, such as linear and bicubic interpolation. These methods take the average of known pixels in the LR image as the missing pixels of the HR image. Interpolation works well in smooth regions of the image but poorly at edges, causing ringing and blurring. More complex learning-based and reconstruction-based methods include sparse coding, neighborhood embedding regression, random forests, etc.
Dong et al. first proposed a Convolutional Neural Network (CNN)-based super-resolution reconstruction network (SRCNN), whose performance surpassed the most advanced algorithms of the time. Later, Shi et al. proposed a sub-pixel convolution super-resolution reconstruction network. The network contains several convolutional layers to learn LR image features, and reconstruction is performed with the proposed sub-pixel convolutional layer, so the image can be reconstructed directly from the convolutional features of the deep network. Lim et al. proposed an enhanced deep residual network (EDSR), which achieved significant gains through a deeper network. Other deep networks, like RDN and MemNet, are based on dense blocks. Some networks focus on feature correlations in the channel dimension, such as RCAN and SAN.
The WDSR proposed by Yu et al. draws two conclusions. First, with the same number of parameters and computations, a model with more features before the activation function performs better. Second, weight normalization (WN) can improve the accuracy of the network. In WDSR, there is a wider channel before the activation function of each residual block. Wang et al. proposed an adaptive weighted super-resolution network (AWSRN) based on WDSR. It designs a local fusion block for more efficient residual learning. Besides, an adaptive weighted multi-scale (AWMS) module is developed to reconstruct features, and it shows superior performance among methods with roughly equal parameters.
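The wide-activation idea above can be sketched as a small PyTorch module (a minimal sketch, not the authors' code; the channel count and expansion factor are illustrative):

```python
import torch
import torch.nn as nn

class WideActivationBlock(nn.Module):
    """WDSR-style residual block: the channel count is expanded before
    the activation and reduced back afterwards, and weight normalization
    is applied to each convolution."""

    def __init__(self, channels=64, expansion=4):
        super().__init__()
        wn = nn.utils.weight_norm
        self.body = nn.Sequential(
            wn(nn.Conv2d(channels, channels * expansion, 3, padding=1)),
            nn.ReLU(inplace=True),
            wn(nn.Conv2d(channels * expansion, channels, 3, padding=1)),
        )

    def forward(self, x):
        # residual connection around the wide-activation body
        return x + self.body(x)
```

With the same parameter budget, widening before the activation (rather than stacking more narrow layers) is what WDSR reports as beneficial.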
Cao et al. proposed an improved Deep Residual Network (IDRN). It makes simple and effective modifications to the structure of residual blocks and skip connections. Besides, a new energy-aware training loss, EA-Loss, was proposed, and lightweight networks are employed to achieve fast and accurate results. The SR feedback network (SRFBN) proposed by Li et al. applies an RNN with constraints to process feedback information and performs feature reuse.
The Deep Plug and Play SR Network (DPSR)  proposed by Zhang et al. can process LR images with arbitrary fuzzy kernels. Zhang et al.  obtained real sensor data by optical zoom for model training. Xu et al. generated training data by simulating the digital camera imaging process. Their experiments have shown that SR using raw data helps to restore fine detail and clear structure.
3 Our Model
3.1 Network Architecture
As shown in Figure 1, our ADCSR mainly consists of two parts: SKIP and BODY. The SKIP just uses a sub-pixel convolution. The BODY includes multiple ADRUs (adaptive dense residual units), a GFF (global feature fusion layer), and an AFSL (adaptive feature sub-pixel reconstruction layer). The model takes RGB patches from the LR image as input. On the one hand, the HR image is reconstructed by SKIP from the low-frequency information of the LR image; on the other hand, it is reconstructed by BODY from the high-frequency information. We obtain the final reconstructed HR image by combining the results of SKIP and BODY.
SKIP consists of a single sub-pixel convolution (or several of them) with a convolution kernel size of 5. We have:

$$I_{SKIP} = f_{sub}(I_{LR}),$$

where $I_{SKIP}$ represents the output of the SKIP part, $I_{LR}$ denotes the LR input image, and $f_{sub}$ represents the sub-pixel convolution, whose kernel size is 5.
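The SKIP branch can be sketched in PyTorch using `nn.PixelShuffle` (a minimal sketch; module and argument names are ours, not from the paper):

```python
import torch
import torch.nn as nn

class Skip(nn.Module):
    """SKIP branch: a single sub-pixel convolution with a 5x5 kernel.
    A conv maps 3 input channels to 3 * r^2, then PixelShuffle
    rearranges them into an image upscaled by factor r."""

    def __init__(self, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(3, 3 * scale * scale, kernel_size=5, padding=2)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))
```

Because this branch is so shallow, it can only capture low-frequency structure, which is exactly the role the paper assigns to SKIP.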
In the BODY, we first use a convolution layer to extract shallow features from the LR image:

$$F_0 = f_{ext}(I_{LR}),$$

where $f_{ext}$ represents the feature extraction convolution, whose kernel size is 3.
Second, we use several ADRUs to extract deep features. There are four ADRBs (adaptive dense residual blocks) joined through adaptive dense connections in each ADRU. The features are merged by the LFF (local feature fusion layer) and combined with a skip connection as the output of the ADRU. Each ADRB combines four convolution units through the same adaptive dense connection structure as the ADRU. The convolution units adopt a structure similar to WDSR, consisting of two wide-activation convolution layers and one LeakyReLU layer. After that, we fuse features with the LFF, combined with a skip connection, as the output of the ADRB. The GFF fuses the outputs of multiple ADRUs by concatenation and convolution.
$$I_d = \sum_{i=0}^{d-1} a_{d,i} F_i, \qquad F_d = f_{ADRU_d}(I_d),$$

where $I_d$ denotes the input feature map of the $d$-th ADRU, $f_{ADRU_d}$ means the function of the $d$-th ADRU, and the $a_{d,i}$ are the adaptive connection weights. $F_d$ means the output of the $d$-th ADRU, and $F_D$ represents the output of the last ADRU, which includes the skip connection.
The third part of BODY uses the GFF to combine all the outputs of the ADRUs, fusing features by two convolution layers:

$$F_{GFF} = f_{GFF}([F_1, F_2, \dots, F_D]),$$

where $f_{GFF}$ means feature fusion and $[\cdot]$ denotes concatenation.
Finally, the image is upsampled via the AFSL. The AFSL consists of four sub-pixel convolution branches with convolution kernel sizes of 3, 5, 7, and 9, respectively. The output is obtained through the concatenation layer and a single convolution layer.
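The four-branch AFSL described above might look as follows in PyTorch (a sketch under our own assumptions: the per-branch trainable scalars and the channel sizes are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class AFSL(nn.Module):
    """Adaptive feature sub-pixel reconstruction layer: four sub-pixel
    convolution branches with kernel sizes 3, 5, 7, 9, each weighted by
    a trainable scalar, concatenated and fused by a single convolution."""

    def __init__(self, in_channels=128, scale=2):
        super().__init__()
        r2 = scale * scale
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, 3 * r2, k, padding=k // 2),
                nn.PixelShuffle(scale),
            )
            for k in (3, 5, 7, 9)
        ])
        # one trainable scalar per branch (the "adaptive" part, assumed)
        self.weights = nn.Parameter(torch.ones(4))
        self.fuse = nn.Conv2d(4 * 3, 3, kernel_size=3, padding=1)

    def forward(self, x):
        outs = [w * b(x) for w, b in zip(self.weights, self.branches)]
        return self.fuse(torch.cat(outs, dim=1))
```

The different kernel sizes give each branch a different receptive field, and the learned scalars let training decide how much each contributes to the reconstruction.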
In the second stage of BODY, the feature amplification layer is also implemented by a single convolution layer. The whole BODY is:

$$I_{BODY} = f_{AFSL}(F_{GFF} + F_0),$$

where $I_{BODY}$ represents the output of BODY. The whole network can then be expressed as:

$$I_{SR} = I_{BODY} + I_{SKIP}.$$
3.2 ADRB and ADRU
We will demonstrate the superiority of the adaptive dense connection structure in Section 4. To use as much of the adaptive residual structure as possible, we split the ADRU into ADRBs using adaptive dense connections, and split each ADRB into densely connected convolution units. At the same time, to get better results with a smaller parameter count, we use the residual block from WDSR as our convolution unit. As shown in Figure 1, ADRB and ADRU have a similar connection structure. ADRB contains four convolution units, each of which can be represented as:
$$y = f_{conv2}(\sigma(f_{conv1}(x))),$$

where $x$ means the input of the convolution unit and $\sigma$ denotes the LeakyReLU activation. Both $f_{conv1}$ and $f_{conv2}$ have kernel size $3 \times 3$; $f_{conv1}$ expands the $C$ input channels before the activation and $f_{conv2}$ reduces them back to $C$, where $C$ is the number of input channels of the convolution unit.
The whole ADRB can be expressed as:

$$x_j = \sum_{i=0}^{j-1} b_{j,i}\, y_i, \qquad y_j = f_{CU_j}(x_j),$$

where $f_{CU_j}$ means the $j$-th convolution unit, $y_0$ denotes the input of the ADRB, the $b_{j,i}$ are the adaptive connection weights, $x_j$ denotes the input of the $j$-th convolution unit, and $y_j$ represents the output of the $j$-th convolution unit.
The whole ADRU can be formulated in the same way, with the four ADRBs taking the place of the convolution units and the LFF fusing their outputs.
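Since ADRB and ADRU share the same connection pattern, both can be built from one generic module (a sketch under our assumptions: one trainable scalar per dense connection, initialized to one, and a 1×1 LFF; the class and argument names are ours):

```python
import torch
import torch.nn as nn

class AdaptiveDenseGroup(nn.Module):
    """Adaptive dense connection shared by ADRB and ADRU: the input of
    each unit is a weighted sum of the group input and all previous unit
    outputs, with one trainable scalar per connection; an LFF (1x1 conv)
    fuses the concatenated outputs before the local skip connection."""

    def __init__(self, make_unit, n_units=4, channels=128):
        super().__init__()
        self.units = nn.ModuleList([make_unit() for _ in range(n_units)])
        # weights[j][i] multiplies feature i when feeding unit j
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.ones(j + 1)) for j in range(n_units)]
        )
        self.lff = nn.Conv2d(n_units * channels, channels, kernel_size=1)

    def forward(self, x):
        feats = [x]  # feats[0] is the group input
        for j, unit in enumerate(self.units):
            inp = sum(w * f for w, f in zip(self.weights[j], feats))
            feats.append(unit(inp))
        fused = self.lff(torch.cat(feats[1:], dim=1))
        return fused + x  # local skip connection
```

An ADRB would pass a WDSR-style convolution unit as `make_unit`, and an ADRU would pass the ADRB itself, nesting the same structure one level up.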
In this section, we give specific implementation details. In SKIP, the convolution kernel size of the sub-pixel convolutional layer is set to 5. The convolution kernel size of the LFF in BODY is 1, and the two convolution kernel sizes of GFF are 1 and 3, respectively. In AFSL, the convolution kernels are 3, 5, 7, and 9. All other convolution kernel sizes are set to 3. There are 4 ADRUs in BODY. The number of output channels in the feature extraction layer, convolution units, LFF, and GFF is 128, while the 4 sub-pixel convolutions and the final output of AFSL have 3 output channels. The stride is 1 throughout the network, and LeakyReLU is used as the activation function.
4.1 Adaptive dense connections
We propose a structure for adaptive dense connections, exemplified by the ADRU, and verify its performance experimentally. In the experiment, we designed three models with the same number of parameters and roughly equal computation. The structure of the models is similar to an ADCSR consisting of a single ADRU. The three models are:
a. Add LFF  on WDSR  (to obtain the same model depth);
b. Add a dense connection based on a;
c. Add parameter adaptation based on b.
The three models use the same training parameters. We train our models on the DIV2K dataset and compare performance on the standard benchmark dataset B100. Training runs for 200 epochs, and the learning rate is halved every 100 epochs. As shown in Figure 2, the network with both dense connections and parameter adaptation achieves the highest performance under the same conditions.
4.2 Adaptive sub-pixel reconstruction layer (AFSL)
We test the reconstruction layer in BODY. To verify the performance of our new reconstruction module, AFSL, we designed a straightforward model for comparison experiments, consisting only of a feature extraction layer and a reconstruction layer. As shown in Figure 3, the compared reconstruction layers are sub-pixel convolution, AWMS, and AFSL. We performed the task on a single scale. The feature extraction layers and experimental parameters of the models are the same. We tested the models on B100 and Urban100, and also analyzed the differences in FLOPs and parameter counts. The results are shown in Table 1. AWMS and AFSL require more computation and more parameters than plain sub-pixel convolution, but their performance is better. With the same settings and computational cost, AFSL performs slightly better than AWMS.
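The parameter-count side of such a comparison is easy to reproduce. Below is a small utility plus two illustrative heads (our own stand-ins, not the paper's exact modules): a plain ×2 sub-pixel head with a 3×3 kernel versus the 9×9 branch that a multi-kernel layer like AFSL adds.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Plain sub-pixel head for x2: 64 features -> 3 * 2^2 channels, 3x3 kernel.
head_3x3 = nn.Conv2d(64, 12, kernel_size=3, padding=1)
# The largest AFSL-style branch uses a 9x9 kernel for the same mapping.
head_9x9 = nn.Conv2d(64, 12, kernel_size=9, padding=4)
```

The 9×9 branch alone costs roughly nine times the parameters of the 3×3 head (81 vs. 9 weights per input/output channel pair), which is why Table 1 shows AFSL and AWMS as heavier than a single sub-pixel convolution.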
4.3 Pre-training SKIP
We have explored a training method that first pre-trains SKIP separately and then trains the entire model. This training method makes SKIP focus on the reconstruction of low-frequency information, while BODY focuses on high-frequency information reconstruction.
We employ the same model, that is, the ADCSR containing a single ADRU with the same training parameters. But we train the model in different ways:
a. Train the entire network directly;
b. First pre-train SKIP, then train the whole network at the same time;
c. First pre-train SKIP, then set SKIP to be untrainable when training the entire network.
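The three training modes above differ only in which sub-network receives gradient updates in each phase, which can be sketched with a small helper (the model here is an illustrative stand-in, not the actual ADCSR):

```python
import torch
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for a sub-network."""
    for p in module.parameters():
        p.requires_grad = flag

# Stand-in for the two-branch model (names are illustrative).
model = nn.ModuleDict({
    "skip": nn.Conv2d(3, 12, 5, padding=2),
    "body": nn.Conv2d(3, 12, 3, padding=1),
})

# Phase 1 (modes b and c): pre-train SKIP with BODY frozen.
set_trainable(model["body"], False)
# ... optimize only model["skip"] here ...

# Phase 2, mode b: unfreeze BODY and train the whole network jointly.
set_trainable(model["body"], True)

# Phase 2, mode c (alternative): freeze the pre-trained SKIP instead:
# set_trainable(model["skip"], False)
```

Mode a simply skips phase 1 and trains everything from the start.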
Figure 4 compares the output images and their spectra for the SKIP and BODY branches of models a and b. Comparing the output images, it can be seen that the BODY of the pre-trained-SKIP model focuses on learning the texture and edge details of the image. Comparing the output spectra of the BODY parts, the spectrum of the pre-trained-SKIP model is darker near the center and brighter toward the edges, which shows that the proposed method makes BODY use more high-frequency and less low-frequency information.
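The spectrum comparison in Figure 4 can be reproduced with a centered log-magnitude FFT (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def log_spectrum(img: np.ndarray) -> np.ndarray:
    """Centered log-magnitude spectrum of a grayscale image.
    After fftshift, low frequencies sit at the center, so a BODY output
    that is 'darker near the center and brighter around' carries mostly
    high-frequency content."""
    f = np.fft.fftshift(np.fft.fft2(img))
    return np.log1p(np.abs(f))
```

For a constant (purely low-frequency) image, all energy collapses to the single center bin; a texture-rich BODY output spreads energy toward the borders of the shifted spectrum.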
Figure 5 compares the test curves of the model on B100 under the different training modes. Networks whose SKIP was pre-trained achieve higher performance, and the performance of modes b and c is similar.
4.4 Training settings
We train our network with the DIV2K and Flickr2K datasets. The training set has a total of 3,450 images without data augmentation: DIV2K is composed of 800 training images (with 100 images each for testing and validation), and Flickr2K has 2,650 training images. SKIP is trained separately first, and then the entire network is trained. The learning rate is decayed during training, and training stops once it reaches its minimum value. We adopt the L1 loss to optimize our model. We first train the smallest-scale network; when training the larger-scale networks, the BODY parameters of the smaller-scale model are loaded (excluding the parameters of the AFSL). We train the model on an NVIDIA RTX 2080 Ti, with PyTorch 1.1.0 + CUDA 10.0 + cuDNN 7.5.0 as the deep learning environment.
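Loading the smaller-scale BODY while skipping the scale-dependent AFSL can be done by filtering the state dict (a sketch; `TinyNet` is an illustrative stand-in and we assume AFSL parameter names start with `"afsl"`):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for ADCSR: a shared 'body' and a scale-dependent 'afsl'."""
    def __init__(self, scale=2):
        super().__init__()
        self.body = nn.Conv2d(3, 3, 3, padding=1)
        self.afsl = nn.Conv2d(3, 3 * scale * scale, 3, padding=1)

def load_body_except_afsl(model: nn.Module, pretrained_state: dict) -> nn.Module:
    """Copy smaller-scale weights into a larger-scale model, skipping the
    AFSL parameters, whose shapes depend on the upscaling factor."""
    filtered = {k: v for k, v in pretrained_state.items()
                if not k.startswith("afsl")}
    # strict=False tolerates the intentionally missing AFSL keys
    model.load_state_dict(filtered, strict=False)
    return model
```

Only the reconstruction head changes between scales, so everything learned by BODY transfers directly.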
Table 2. PSNR/SSIM comparison by method and scale on Set5, Set14, B100, Urban100, and Manga109.
4.5 Results with Bicubic Degradation
To verify the validity of the model, we compare performance on five standard benchmark datasets: Set5, Set14, B100, Urban100, and Manga109. In terms of PSNR, SSIM, and visual quality, we compare our models with state-of-the-art methods including Bicubic, SRCNN, VDSR, LapSRN, MemNet, EDSR, RDN, RCAN, and SAN. We also adopt the self-ensemble strategy to further improve our ADCSR, and denote the self-ensembled ADCSR as ADCSR+. The results are shown in Table 2. As can be seen from the table, the PSNR and SSIM of our algorithm exceed the current state of the art at the evaluated scales.
Figure 6 shows the qualitative comparison of our models with Bicubic, SRCNN, VDSR, LapSRN, MSLapSRN, EDSR, RCAN, and SAN. The images for SRCNN, EDSR, and RCAN are derived from the authors' open-source models and code; test images for VDSR, LapSRN, MSLapSRN, and SAN are provided by their respective authors. In the comparison of img044 in Figure 6, the image reconstructed by our algorithm is clear and close to the original. In img004, our algorithm also gives a better visual effect.
5 AIM2019: Extreme Super-Resolution Challenge
This work was initially proposed for participation in the AIM2019 Extreme Super-Resolution Challenge. The goal of the contest is to super-resolve an input image at an extreme magnification factor, hence the name extreme super-resolution.
Our model is an improved ADCSR, a two-stage adaptive dense connection super-resolution reconstruction network (DSSR). As shown in Figure 7, the DSSR consists of two parts, SKIP and BODY. The SKIP is a simple sub-pixel convolution. The BODY is divided into two stages. The first stage includes a feature extraction layer, multiple ADRUs (adaptive dense residual units), a GFF (global feature fusion layer), and an AFSL (adaptive feature sub-pixel reconstruction layer). The second stage includes a feature amplification layer, an ADRB (adaptive dense residual block), and an AFSL.
During the training of DSSR, the network converges slowly because of its size, so we divide the training into two parts to speed up convergence. We first train the SKIP. The ADCSR network provides the pre-training parameters when training the entire network; during this stage, the first-stage feature extraction layer and each ADRU are set to be untrainable, while the GFF, the AFSL, and the second-stage network parameters are trained at normal learning rates. Finally, we train the entire network with a small learning rate. We train DSSR with the DIV8K dataset; other training settings are the same as for ADCSR. Our final result on the full-resolution DIV8K test images is PSNR = 26.79, SSIM = 0.7289.
We propose an adaptive densely connected super-resolution reconstruction algorithm (ADCSR). The algorithm is divided into two parts: BODY and SKIP. BODY improves the utilization of convolution features through adaptive dense connections. We also explore an adaptive feature sub-pixel reconstruction layer (AFSL) to reconstruct the features output by BODY. We pre-train SKIP in advance so that BODY focuses on high-frequency feature learning. Several comparative experiments demonstrate the effectiveness of the proposed improvements. On the standard datasets, comparisons of PSNR, SSIM, and visual quality show that the proposed algorithm is superior to the state-of-the-art algorithms.
- (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. Cited by: §4.5, Table 2.
- (2019) Fast and accurate single image super-resolution via an energy-aware improved deep residual network. Signal Processing 162, pp. 115–125. Cited by: §2.
- (2004) Super-resolution through neighbor embedding. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., Vol. 1, pp. I–I. Cited by: §2.
- (2019) Second-order attention network for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11065–11074. Cited by: §1, §1, §2, §4.1, §4.5, §4.5, Table 2.
- (2014) Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pp. 184–199. Cited by: §1, §2, §4.5, §4.5, Table 2.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
- (2017-07) Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
- (2015) Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206. Cited by: §4.2, §4.5, Table 2.
- (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §2.
- (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1646–1654. Cited by: §1, §4.5, §4.5, Table 2.
- (2016) Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1637–1645. Cited by: §1.
- (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632. Cited by: §4.5, §4.5, Table 2.
- (2018) Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE transactions on pattern analysis and machine intelligence. Cited by: §4.5.
- (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §1.
- (2019) Feedback network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3867–3876. Cited by: §2.
- (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: §1, §1, §2, §4.1, §4.4, §4.5, §4.5, Table 2.
- (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Cited by: §4.1, §4.2, §4.5, Table 2.
- (2017) Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications 76 (20), pp. 21811–21838. Cited by: §4.5, Table 2.
- (2015) Fast and accurate image upscaling with super-resolution forests. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3791–3799. Cited by: §2.
- (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1874–1883. Cited by: §2, §3.1, §4.2, §5.
- (2013) Cardiac image super-resolution with global correspondence using multi-atlas patchmatch. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 9–16. Cited by: §2.
- (2017) Memnet: a persistent memory network for image restoration. In Proceedings of the IEEE international conference on computer vision, pp. 4539–4547. Cited by: §2, §4.5, Table 2.
- (2013) Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of the IEEE international conference on computer vision, pp. 1920–1927. Cited by: §2.
- (2017) Image super-resolution using dense skip connections. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4799–4807. Cited by: §1.
- (2019) Lightweight image super-resolution with adaptive weighted learning network. arXiv preprint arXiv:1904.02358. Cited by: §1, §4.2.
- (2019) Towards real scene super-resolution with raw images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1723–1731. Cited by: §2.
- (2010) Image super-resolution via sparse representation. IEEE transactions on image processing 19 (11), pp. 2861–2873. Cited by: §2, §4.5, Table 2.
- (2018) Wide activation for efficient and accurate image super-resolution. arXiv preprint arXiv:1808.08718. Cited by: §1, §2, §3.1.
- (2019) Deep plug-and-play super-resolution for arbitrary blur kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1671–1681. Cited by: §2.
- (2019) Zoom to learn, learn to zoom. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3762–3770. Cited by: §2.
- (2018) Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 286–301. Cited by: §1, §2, §4.5, §4.5, Table 2.
- (2018) Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2472–2481. Cited by: §1, §1, §2, §3.1, §4.1, §4.5, Table 2, §5.
- (2011) Very low resolution face recognition problem. IEEE Transactions on image processing 21 (1), pp. 327–340. Cited by: §2.