Lightweight Image Super-Resolution with Information Multi-distillation Network
In recent years, single image super-resolution (SISR) methods using deep convolutional neural networks (CNNs) have achieved impressive results. Thanks to the powerful representation capabilities of deep networks, numerous previous methods can learn the complex non-linear mapping between low-resolution (LR) image patches and their high-resolution (HR) versions. However, excessive convolutions limit the application of super-resolution technology on devices with low computing power. Besides, super-resolution at an arbitrary scale factor is a critical issue in practical applications that has not been well solved by previous approaches. To address these issues, we propose a lightweight information multi-distillation network (IMDN) built from cascaded information multi-distillation blocks (IMDBs), each of which contains distillation and selective fusion parts. Specifically, the distillation module extracts hierarchical features step by step, and the fusion module aggregates them according to the importance of candidate features, which is evaluated by the proposed contrast-aware channel attention mechanism. To process real images of any size, we develop an adaptive cropping strategy (ACS) to super-resolve block-wise image patches using the same well-trained model. Extensive experiments suggest that the proposed method performs favorably against state-of-the-art SR algorithms in terms of visual quality, memory footprint, and inference time. Code is available at https://github.com/Zheng222/IMDN.
1. Introduction
Single image super-resolution (SISR) aims at reconstructing a high-resolution (HR) image from its low-resolution (LR) observation, which is inherently ill-posed because many HR images can be downsampled to an identical LR image. To address this problem, numerous image SR methods (Kim et al., 2016a; Tai et al., 2017b; Tong et al., 2017; Zhang et al., 2018c; Hui et al., 2018; Zhang et al., 2018b) based on deep neural architectures (Simonyan and Zisserman, 2015; He et al., 2016; Huang et al., 2017) have been proposed and have shown prominent performance.
Dong et al. (Dong et al., 2014, 2016b) first developed a three-layer network (SRCNN) to establish a direct mapping between LR and HR images. Then, Wang et al. (Wang et al., 2015) proposed a neural network based on the conventional sparse coding framework and further designed a progressive upsampling scheme to produce better SR results at large scale factors. Inspired by the VGG model (Simonyan and Zisserman, 2015) used for ImageNet classification, Kim et al. (Kim et al., 2016a, b) first pushed the depth of the SR network to 20 layers, and their model outperformed SRCNN by a large margin. This indicates that a deeper model helps to enhance the quality of generated images. To accelerate the training of the deep network, the authors introduced global residual learning with a high initial learning rate. At the same time, they also presented a deeply-recursive convolutional network (DRCN), which applied recursive learning to the SR problem and thereby significantly reduced the number of model parameters. Similarly, Tai et al. proposed two novel networks: a deep recursive residual network (DRRN) (Tai et al., 2017a) and a persistent memory network (MemNet) (Tai et al., 2017b). The former mainly utilized recursive learning to economize parameters. The latter tackled the long-term dependency problem of previous CNN architectures with several memory blocks stacked in a densely connected structure (Huang et al., 2017). However, these two algorithms required a long time and huge graphics memory consumption in both the training and testing phases. The primary reason is that the inputs sent to these two models are interpolated versions of LR images and the networks do not adopt any downsampling operations, which brings about a huge computational cost. To shorten the testing time, Shi et al. (Shi et al., 2016) first performed most of the mappings in low-dimensional space and designed an efficient sub-pixel convolution to upsample the resolution of feature maps at the end of SR models.
To the same end, Dong et al. proposed fast SRCNN (FSRCNN) (Dong et al., 2016a), which employed a learnable upsampling layer (transposed convolution) to accomplish post-upsampling SR. Afterward, Lai et al. presented the Laplacian pyramid super-resolution network (LapSRN) (Lai et al., 2017) to progressively reconstruct higher-resolution images. Other works such as MS-LapSRN (Lai et al., 2018) and progressive SR (ProSR) (Wang et al., 2018b) also adopted this progressive upsampling SR framework and achieved relatively high performance. EDSR (Lim et al., 2017), which won the NTIRE 2017 competition (Agustsson and Timofte, 2017; Timofte et al., 2017), made a significant breakthrough in terms of SR performance. Its authors removed some unnecessary modules (e.g., Batch Normalization) of SRResNet (Ledig et al., 2017) to obtain better results. Based on EDSR, Zhang et al. incorporated the densely connected block (Huang et al., 2017; Tong et al., 2017) into the residual block (He et al., 2016) to construct a residual dense network (RDN) (Zhang et al., 2018c). Soon afterward, they exploited the residual-in-residual architecture for a very deep model and introduced the channel attention mechanism (Hu et al., 2018) to form the very deep residual channel attention network (RCAN) (Zhang et al., 2018b). More recently, Zhang et al. also introduced spatial attention (the non-local module) into the residual block and constructed the residual non-local attention network (RNAN) (Zhang et al., 2019) for various image restoration tasks.
The major trend of these algorithms is to add more convolution layers to improve performance as measured by PSNR and SSIM (Wang et al., 2004). As a result, most of them suffer from large numbers of parameters, huge memory footprints, and slow training and testing speeds. For instance, EDSR (Lim et al., 2017) has about 43M parameters and 69 layers, and RDN (Zhang et al., 2018c), which achieved comparable performance, has about 22M parameters and about 150 layers. Another typical network is RCAN (Zhang et al., 2018b), whose depth reaches 415 while its parameters number about 16M. These methods are therefore still not suitable for resource-constrained equipment. For mobile devices, the desired practice is to pursue SR performance that is as high as possible while the available memory and inference time are constrained to a certain range. Many scenarios, such as video applications, edge devices, and smartphones, require not only high performance but also high execution speed. Accordingly, it is important to devise a lightweight but efficient model to meet such demands.
Concerning the reduction of parameters, many approaches adopted a recursive manner or a parameter sharing strategy, such as (Kim et al., 2016b; Tai et al., 2017a, b). Although these methods did reduce the size of the model, they increased the depth or the width of the network to make up for the performance loss caused by the recursive module, which leads to a great deal of computation time when performing SR. To address this issue, a better way is to design lightweight and efficient network structures that avoid the recursive paradigm. Ahn et al. developed CARN-M (Ahn et al., 2018) for mobile scenarios through a cascading network architecture, but at the cost of a substantial reduction in PSNR. Hui et al. (Hui et al., 2018) proposed an information distillation network (IDN) that explicitly divided the preceding extracted features into two parts, one of which was retained while the other was further processed. In this way, IDN achieved good performance at a moderate size, but there is still room for improvement in terms of performance.
Another factor that affects the inference speed is the depth of the network. In the testing phase, consecutive layers have dependencies: the computation of the current layer must wait until the previous layer has finished. In contrast, the multiple convolutional operations within each layer can be processed in parallel. Therefore, the depth of the model architecture is an essential factor affecting time performance. This point will be verified in Section 4.
As for solving the SR problem at different scale factors (×2, ×3, ×4) using a single model, previous solutions first rescaled an image to the desired size and then applied a fully convolutional network without any downsampling operations. This inevitably leads to a substantial increase in the amount of computation.
To address the above issues, we propose a lightweight information multi-distillation network (IMDN) that better balances performance against applicability. Unlike most previous models with few parameters, which use a recursive structure, we carefully design an information multi-distillation block (IMDB) inspired by (Hui et al., 2018). The proposed IMDB extracts features at a granular level: at each step (layer) it retains part of the information and further processes the remaining features, as illustrated in Figure 2. To aggregate the features distilled by all steps, we devise a contrast-aware channel attention layer, specifically suited to low-level vision tasks, to enhance the collected refined information. Concretely, we exploit more useful features (edges, corners, textures, etc.) for image restoration. To handle SR at an arbitrary scale factor with a single model, we scale the input image to the target size and then employ the proposed adaptive cropping strategy (see Figure 4) to obtain image patches of appropriate size for a lightweight SR model with downsampling layers.
The contributions of this paper can be summarized as follows:
- We propose a lightweight information multi-distillation network (IMDN) for fast and accurate image super-resolution. Thanks to our information multi-distillation block (IMDB) with its contrast-aware channel attention (CCA) layer, we achieve competitive results with a modest number of parameters (refer to Figure 6).
- We propose the adaptive cropping strategy (ACS), which allows networks that include downsampling operations (e.g., a convolution layer with a stride of 2) to process images of arbitrary size. By adopting this scheme, the computational cost, memory occupation, and inference time can be dramatically reduced when treating SR with an indefinite magnification.
- We explore the factors affecting actual inference time through experiments and find that the depth of the network is related to the execution speed. This can serve as a guideline for lightweight network design. Our model achieves an excellent balance among visual quality, inference speed, and memory occupation.
2. Related Work
2.1. Single image super-resolution
With the rapid development of deep learning, numerous methods based on convolutional neural networks (CNNs) have become the mainstream in SISR. The pioneering work in SR, named SRCNN, was proposed by Dong et al. (Dong et al., 2014, 2016b). SRCNN upscaled the LR image with bicubic interpolation before feeding it into the network, which caused substantial unnecessary computational cost. To address this issue, the authors removed this pre-processing and upscaled the image at the end of the network to reduce the computation in (Dong et al., 2016a). Lim et al. (Lim et al., 2017) modified SRResNet (Ledig et al., 2017) to construct a deeper and wider residual network denoted as EDSR. With its smart topology and a significantly large number of learnable parameters, EDSR dramatically advanced SR performance. Zhang et al. (Zhang et al., 2018b) introduced channel attention (Hu et al., 2018) into the residual block to further boost a very deep network (more than 400 layers without considering the depth of the channel attention modules). Liu et al. (Liu et al., 2018) explored the effectiveness of the non-local module applied to image restoration. Similarly, Zhang et al. (Zhang et al., 2019) utilized non-local attention to better guide feature extraction in their trunk branch for better performance. Very recently, Li et al. (Li et al., 2019) exploited a feedback mechanism that enhances low-level representations with high-level ones.
For lightweight networks, Hui et al. (Hui et al., 2018) developed the information distillation network to better exploit hierarchical features by separately processing the current feature maps. Ahn et al. (Ahn et al., 2018) designed an architecture that implemented a cascading mechanism on a residual network to boost performance.
2.2. Attention model
Attention model, aiming at concentrating on more useful information in features, has been widely used in various computer vision tasks. Hu et al. (Hu et al., 2018) introduced squeeze-and-excitation (SE) block that models channel-wise relationships in a computationally efficient manner and enhances the representational ability of the network, showing its effectiveness on image classification. CBAM (Woo et al., 2018) modified the SE block to exploit both spatial and channel-wise attention. Wang et al. (Wang et al., 2018a) proposed the non-local module to generate the wide attention map by calculating the correlation matrix between each spatial point in the feature map, then the attention map guided dense contextual information aggregation.
3. Proposed Method
3.1. Framework
In this section, we describe the proposed information multi-distillation network (IMDN) in detail; its graphical depiction is shown in Figure 1(a). The upsampler (see Figure 1(b)) includes one convolution with $3 \times s^2$ output channels, where $s$ is the scale factor, and a sub-pixel convolution. Given an input LR image $I^{LR}$ and its corresponding target HR image $I^{HR}$, the super-resolved image $I^{SR}$ can be generated by
$$I^{SR} = H_{IMDN}\left(I^{LR}\right),$$
where $H_{IMDN}(\cdot)$ is our IMDN. Following most previous works (Lim et al., 2017; Hui et al., 2018; Zhang et al., 2018c, b; Ahn et al., 2018), it is optimized with the mean absolute error (MAE) loss. Given a training set $\left\{I_i^{LR}, I_i^{HR}\right\}_{i=1}^{N}$ that has $N$ LR-HR pairs, the loss function of our IMDN can be expressed by
$$\mathcal{L}(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| H_{IMDN}\left(I_i^{LR}\right) - I_i^{HR} \right\|_1,$$
where $\Theta$ indicates the updateable parameters of our model and $\|\cdot\|_1$ is the $l_1$ norm. We now give more details about the entire framework.
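As a small illustration of the MAE objective described above, the following sketch averages the per-image $l_1$ residual over a batch. It is not the authors' training code: the function name `mae_loss` and the representation of images as flat lists of pixel values are assumptions made for this example. Many frameworks additionally average over pixels, which only rescales the loss.

```python
def mae_loss(sr_batch, hr_batch):
    """Average over the batch of the l1 norm of (SR image - HR image).

    Assumption: each image is a flat list of pixel values, and the two
    batches are aligned (sr_batch[i] corresponds to hr_batch[i]).
    """
    total = 0.0
    for sr, hr in zip(sr_batch, hr_batch):
        # l1 norm of the residual for one LR-HR pair
        total += sum(abs(s - h) for s, h in zip(sr, hr))
    return total / len(sr_batch)


# Two tiny "images" of two pixels each: residual norms are 1 and 2.
print(mae_loss([[1, 2], [3, 4]], [[1, 1], [5, 4]]))  # 1.5
```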
We first conduct LR feature extraction, implemented by one convolution layer. Then, the key component of our network stacks multiple information multi-distillation blocks (IMDBs) and assembles all intermediate features, which are fused by a convolution layer. This scheme, intermediate information collection (IIC), helps to guarantee the integrity of the collected information and can further boost SR performance while adding very few parameters. The final upsampler consists of only one learnable layer and a non-parametric operation (sub-pixel convolution) to save as many parameters as possible.
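The non-parametric sub-pixel convolution mentioned above is a pure rearrangement of channels into space. The sketch below shows one way to write it on nested Python lists. The channel ordering follows the convention commonly used by deep learning frameworks and is an assumption here, as is the helper name `pixel_shuffle`.

```python
def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Assumption: input channel c*r*r + i*r + j feeds output channel c at
    sub-position (i, j) inside each r x r output cell.
    """
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for x in range(w * r):
                # The position inside the r x r cell selects the input channel.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = feat[src][y // r][x // r]
    return out


# Four 1x1 channels become one 2x2 channel for r = 2.
print(pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2))  # [[[0, 1], [2, 3]]]
```

With an RGB output and scale factor r, the preceding convolution therefore has to produce 3 * r * r channels.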
3.2. Information multi-distillation block
As depicted in Figure 2, our information multi-distillation block (IMDB) is constructed from a progressive refinement module, a contrast-aware channel attention (CCA) layer, and a convolution that is used to reduce the number of feature channels. The whole block adopts a residual connection. The main idea of this block is to extract useful features little by little, like DenseNet (Huang et al., 2017). We now describe these modules in more detail.
3.2.1. Progressive refinement module
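As labeled with the gray box in Figure 2, the progressive refinement module (PRM) first adopts a convolution layer to extract input features for multiple subsequent distillation (refinement) steps. For each step, we employ a channel split operation on the preceding features, which produces two parts: one is preserved and the other is fed into the next calculation unit. The retained part can be regarded as the refined features. Given the input features $F_{in}^{n}$, this procedure in the $n$-th IMDB can be described as
$$\begin{aligned}
F_{refined\_1}^{n}, F_{coarse\_1}^{n} &= Split_1^{n}\left(CL_1^{n}\left(F_{in}^{n}\right)\right), \\
F_{refined\_2}^{n}, F_{coarse\_2}^{n} &= Split_2^{n}\left(CL_2^{n}\left(F_{coarse\_1}^{n}\right)\right), \\
F_{refined\_3}^{n}, F_{coarse\_3}^{n} &= Split_3^{n}\left(CL_3^{n}\left(F_{coarse\_2}^{n}\right)\right), \\
F_{refined\_4}^{n} &= CL_4^{n}\left(F_{coarse\_3}^{n}\right),
\end{aligned}$$
where $CL_j^{n}$ denotes the $j$-th convolution layer (including Leaky ReLU) of the $n$-th IMDB, $Split_j^{n}$ indicates the $j$-th channel split layer of the $n$-th IMDB, $F_{refined\_j}^{n}$ represents the $j$-th refined features (preserved), and $F_{coarse\_j}^{n}$ is the $j$-th coarse features to be further processed. The hyperparameters of the PRM architecture are shown in Table 1. The following stage concatenates the refined features from each step. It can be expressed by
$$F_{distilled}^{n} = Concat\left(F_{refined\_1}^{n}, F_{refined\_2}^{n}, F_{refined\_3}^{n}, F_{refined\_4}^{n}\right),$$
where $Concat$ denotes the concatenation operation along the channel dimension.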
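The split-and-concatenate flow of the PRM can be traced with a toy sketch in which each "channel" is a single number and each convolution is replaced by a stand-in function. Everything here is illustrative: `make_toy_layer` is a hypothetical placeholder for a convolution with Leaky ReLU, and the channel widths (64 in, 16 kept per step, 16 out of the last layer) are assumptions for the example rather than values stated in this text.

```python
def make_toy_layer(out_channels):
    """Hypothetical stand-in for conv + Leaky ReLU: maps a channel list to
    `out_channels` scalar channels. Only the channel bookkeeping matters."""
    def layer(channels):
        total = sum(channels)
        return [total + k for k in range(out_channels)]
    return layer


def progressive_refinement(f_in, layers, keep):
    """Apply each layer, keep `keep` channels, pass the rest to the next layer."""
    refined_parts = []
    coarse = f_in
    for layer in layers[:-1]:
        out = layer(coarse)
        refined_parts.append(out[:keep])  # preserved (refined) features
        coarse = out[keep:]               # coarse features, processed further
    # The last step has no split: its whole output is kept.
    refined_parts.append(layers[-1](coarse))
    # Concatenate along the channel dimension.
    return [ch for part in refined_parts for ch in part]


layers = [make_toy_layer(64), make_toy_layer(64), make_toy_layer(64), make_toy_layer(16)]
distilled = progressive_refinement([1.0] * 64, layers, keep=16)
print(len(distilled))  # 64
```

Under these assumed widths, the concatenated output has as many channels as the block input, so the block can use a residual connection without any extra projection.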
3.2.2. Contrast-aware channel attention layer
Channel attention was initially employed in the image classification task and is well known as the squeeze-and-excitation (SE) module. In high-level vision, the importance of a feature map depends on its activated high-value areas, since these regions favor classification or detection. Accordingly, global average/maximum pooling is utilized to capture the global information in these high-level or mid-level vision tasks. Although average pooling can indeed improve the PSNR value, it lacks the information about structures, textures, and edges that helps to enhance image details (related to SSIM). As depicted in Figure 3, the contrast-aware channel attention module is specific to low-level vision, e.g., image super-resolution and enhancement. Specifically, we replace global average pooling with the summation of the standard deviation and the mean (evaluating the contrast degree of a feature map). Let us denote $X = [x_1, \ldots, x_c, \ldots, x_C]$ as the input, which has $C$ feature maps with a spatial size of $H \times W$. The contrast information value can then be calculated by
$$z_c = H_{GC}(x_c) = \sqrt{\frac{1}{HW} \sum_{(i,j) \in x_c} \left( x_c^{i,j} - \frac{1}{HW} \sum_{(i,j) \in x_c} x_c^{i,j} \right)^2} + \frac{1}{HW} \sum_{(i,j) \in x_c} x_c^{i,j},$$
where $z_c$ is the $c$-th element of the output and $H_{GC}(\cdot)$ indicates the global contrast (GC) information evaluation function. With the assistance of the CCA module, our network can steadily improve the accuracy of SISR.
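The channel descriptor described above (standard deviation plus mean of each feature map) can be sketched in a few lines. This covers only the pooling replacement, not the rest of the attention layer, and the function names are hypothetical. The standard deviation is taken over all H x W positions (population form), which is an assumption consistent with the description.

```python
import math


def global_contrast(feature_map):
    """Standard deviation plus mean of one H x W feature map (nested list)."""
    values = [v for row in feature_map for v in row]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(variance) + mean


def contrast_descriptor(x):
    """One contrast value per channel of a (C, H, W) nested list."""
    return [global_contrast(channel) for channel in x]


flat = [[2.0, 2.0], [2.0, 2.0]]    # no contrast: std 0, mean 2
edges = [[0.0, 4.0], [0.0, 4.0]]   # same mean, std 2
print(contrast_descriptor([flat, edges]))  # [2.0, 4.0]
```

The two channels have the same mean, so plain average pooling could not tell them apart, while the contrast descriptor scores the channel with edges higher.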
3.3. Adaptive cropping strategy
The adaptive cropping strategy (ACS) is designed for super-resolving images of arbitrary size. It can also deal with the SR problem at any scale factor with a single model (see Figure 5). We slightly modify the original IMDN by introducing two downsampling layers and construct IMDN_AS (IMDN for any scales). Here, the LR and HR images have the same spatial size (height $H$ and width $W$). To handle images whose height and width are not divisible by 4, we first cut the entire image into 4 parts and then feed them into our IMDN_AS. As illustrated in Figure 4, we obtain 4 overlapped image patches through ACS. Taking the first patch in the upper-left corner as an example, this image patch must satisfy
$$\left(\frac{H}{2} + \Delta l_H\right) \,\%\, 4 = 0, \qquad \left(\frac{W}{2} + \Delta l_W\right) \,\%\, 4 = 0,$$
where $\Delta l_H$, $\Delta l_W$ are the extra increments of height and width, respectively. They can be computed by
$$\Delta l_H = padding_H - \left(\frac{H}{2} + padding_H\right) \,\%\, 4, \qquad \Delta l_W = padding_W - \left(\frac{W}{2} + padding_W\right) \,\%\, 4,$$
where $padding_H$, $padding_W$ are preset additional lengths. In general, their values are set by
$$padding_H = padding_W = 4k, \quad k \geq 1.$$
Here, $k$ is an integer greater than or equal to 1. These four patches can be processed in parallel (they have the same size), after which the outputs are pasted back to their original locations and the extra increments ($\Delta l_H$ and $\Delta l_W$) are discarded.
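The patch-size arithmetic of ACS can be checked with a short sketch. It assumes that half the image side is taken with integer division, that the required multiple is 4 (two stride-2 downsampling layers), and that the four patches are anchored at the four image corners. The function names and the corner layout are assumptions made for illustration, not the authors' implementation.

```python
def acs_increment(length, padding=4, multiple=4):
    """Extra increment so that length // 2 + increment is a multiple of `multiple`.

    Assumption: `padding` is itself a multiple of `multiple` (e.g. 4k, k >= 1).
    """
    half = length // 2
    return padding - (half + padding) % multiple


def acs_patches(height, width, padding=4, multiple=4):
    """Boxes (top, left, bottom, right) of four equal-size overlapping patches."""
    ph = height // 2 + acs_increment(height, padding, multiple)
    pw = width // 2 + acs_increment(width, padding, multiple)
    return [
        (0, 0, ph, pw),                                # upper left
        (0, width - pw, ph, width),                    # upper right
        (height - ph, 0, height, pw),                  # lower left
        (height - ph, width - pw, height, width),      # lower right
    ]


for top, left, bottom, right in acs_patches(101, 203):
    print(bottom - top, right - left)  # 52 104 for every patch
```

For a 101 x 203 image the increments are 2 and 3, so every patch is 52 x 104, both divisible by 4, and neighbouring patches overlap by a few pixels that are discarded after pasting.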
4. Experiments
4.1. Datasets and metrics
In our experiments, we use the DIV2K dataset (Agustsson and Timofte, 2017), which contains 800 high-quality RGB training images and is widely used in image restoration tasks (Lim et al., 2017; Zhang et al., 2018c, b, 2019). For evaluation, we use five widely used benchmark datasets: Set5 (Bevilacqua et al., 2012), Set14 (Zeyde et al., 2010), BSD100 (Martin et al., 2001), Urban100 (Huang et al., 2015), and Manga109 (Matsui et al., 2017). We evaluate the performance of the super-resolved images using two metrics: peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) (Wang et al., 2004). As in existing works (Kim et al., 2016a; Tai et al., 2017a; Lim et al., 2017; Zhang et al., 2018c, b; Hui et al., 2018; Ahn et al., 2018), we calculate the values on the luminance channel (i.e., the Y channel of the YCbCr channels converted from the RGB channels).
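For readers unfamiliar with evaluation on the luminance channel, the following sketch converts 8-bit RGB pixels to Y and computes PSNR. The conversion coefficients are the ITU-R BT.601 studio-range values, which is an assumption about the exact conversion used; border cropping and other protocol details are not covered.

```python
import math


def rgb_to_y(r, g, b):
    """Luma of one 8-bit RGB pixel (assumed ITU-R BT.601, range 16 to 235)."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr(ref, test, peak=255.0):
    """PSNR in dB between two equal-length lists of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


hr = [rgb_to_y(*p) for p in [(255, 255, 255), (0, 0, 0), (10, 200, 30)]]
sr = [rgb_to_y(*p) for p in [(250, 255, 255), (0, 4, 0), (10, 198, 30)]]
print(round(psnr(hr, sr), 2))
```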
Additionally, for the any/unknown scale factor experiments, we use the RealSR dataset from the NTIRE2019 Real Super-Resolution Challenge (http://www.vision.ee.ethz.ch/ntire19/). It is a novel dataset of real low- and high-resolution paired images. The training data consists of 60 real LR-HR image pairs, and the validation data contains 20 LR-HR pairs. It is noteworthy that the LR and HR images have the same size.
4.2. Implementation details
To obtain LR DIV2K training images, we downscale HR images with the scaling factors (, , and ) using bicubic interpolation in MATLAB R2017a. The HR image patches with a size of are randomly cropped from HR images as the input of our model, and the mini-batch size is set to . For data augmentation, we perform randomly horizontal flip and degree rotation. Our model is trained by ADAM optimizer with the momentum parameter . The initial learning rate is set to and halved at every iterations. We set the number of IMDB to in our IMDN and IMDN_AS. We apply PyTorch framework to implement the proposed network on the desktop computer with 4.2GHz Intel i7-7700K CPU, 64G RAM, and NVIDIA TITAN Xp GPU (12G memory).
4.3. Model analysis
In this subsection, we investigate model parameters, the effectiveness of IMDB, the intermediate information collection scheme, and adaptive cropping strategy.
4.3.1. Model parameters
To construct a lightweight SR model, the number of network parameters is vital. From Table 5, we can observe that our IMDN achieves comparable or better performance than other state-of-the-art methods with more parameters, such as EDSR-baseline (CVPRW’17), SRMDNF (CVPR’18), and CARN (ECCV’18), while also outperforming the smaller IDN (CVPR’18). We also visualize the trade-off between performance and model size in Figure 6, which shows that our IMDN achieves a better trade-off between the two.
4.3.2. Ablation studies of CCA module and IIC scheme
Table 2. Ablation of the PRM, CCA module, and IIC scheme (PSNR / SSIM).
|PRM||CCA||IIC||Params||Set5||Set14||BSD100||Urban100||Manga109|
|✗||✗||✗||510K||31.86 / 0.8901||28.43 / 0.7775||27.45 / 0.7320||25.63 / 0.7711||29.92 / 0.9003|
|✓||✗||✗||480K||32.01 / 0.8927||28.49 / 0.7792||27.50 / 0.7338||25.81 / 0.7773||30.16 / 0.9038|
|✓||✓||✗||482K||32.10 / 0.8934||28.51 / 0.7794||27.52 / 0.7341||25.89 / 0.7793||30.25 / 0.9050|
|✓||✓||✓||499K||32.11 / 0.8934||28.52 / 0.7797||27.53 / 0.7342||25.90 / 0.7797||30.28 / 0.9054|
Table 3. Comparison of the CA and CCA modules (PSNR).
|Method||Set5||Set14||BSD100||Urban100|
|IMDN_basic_B4 + CA||32.0821||28.5086||27.5124||25.8829|
|IMDN_basic_B4 + CCA||32.0964||28.5118||27.5185||25.8916|
To quickly validate the effectiveness of the contrast-aware channel attention (CCA) module and the intermediate information collection (IIC) scheme, we adopt 4 IMDBs to conduct the following ablation study; this model is named IMDN_B4. When the CCA module and IIC scheme are removed, IMDN_B4 becomes IMDN_basic_B4, as illustrated in Figure 7. From Table 2, we find that the CCA module leads to a performance improvement (PSNR: +0.09dB, SSIM: +0.0012 on Manga109) while adding only 2K parameters (an increase of about 0.4%). The results compared with the CA module are given in Table 3. To study the efficiency of the PRM in the IMDB, we replace it with three cascaded convolution layers (64 channels) and remove the final convolution (used for fusion). The compared results are given in Table 2. Although this network has more parameters (510K), its performance is much lower than that of our IMDN_basic_B4 (480K), especially on the Urban100 and Manga109 datasets.
4.3.3. Investigation of ACS
|Method||PSNR||SSIM||LPIPS (Zhang et al., 2018a)||Time||Memory|
|VDSR (Kim et al., 2016a)||28.75||0.8439||0.2417||0.0290||7,855M|
To verify the efficiency of the proposed adaptive cropping strategy (ACS), we use the RealSR training images to train VDSR (Kim et al., 2016a) and our IMDN_AS. The results, evaluated on the RealSR RGB validation dataset, are given in Table 4. We can easily observe that the presented IMDN_AS achieves better performance in terms of image quality, execution speed, and memory footprint. Accordingly, this suggests that the proposed ACS is effective for addressing the SR problem at any scale.
4.4. Comparison with state-of-the-arts
We compare our IMDN with 11 state-of-the-art methods: SRCNN (Dong et al., 2014, 2016b), FSRCNN (Dong et al., 2016a), VDSR (Kim et al., 2016a), DRCN (Kim et al., 2016b), LapSRN (Lai et al., 2017), DRRN (Tai et al., 2017a), MemNet (Tai et al., 2017b), IDN (Hui et al., 2018), EDSR-baseline (Lim et al., 2017), SRMDNF (Zhang et al., [n. d.]), and CARN (Ahn et al., 2018). Table 5 shows quantitative comparisons for ×2, ×3, and ×4 SR. We can see that our IMDN performs favorably against the other compared approaches on most datasets, especially at the scaling factor of ×2.
Figure 8 shows ×2, ×3 and ×4 visual comparisons on the Set5 and Urban100 datasets. For the “img_67” image from Urban100, we can see that our method recovers the grid structure better than the others, which also demonstrates the effectiveness of our IMDN.
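Table 5. Average PSNR / SSIM on the benchmark datasets. The three blocks of rows, each starting with Bicubic, correspond to ×2, ×3, and ×4 SR, respectively.
|Method||Params||Set5||Set14||BSD100||Urban100||Manga109|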
|Bicubic||-||33.66 / 0.9299||30.24 / 0.8688||29.56 / 0.8431||26.88 / 0.8403||30.80 / 0.9339|
|SRCNN (Dong et al., 2014)||8K||36.66 / 0.9542||32.45 / 0.9067||31.36 / 0.8879||29.50 / 0.8946||35.60 / 0.9663|
|FSRCNN (Dong et al., 2016a)||13K||37.00 / 0.9558||32.63 / 0.9088||31.53 / 0.8920||29.88 / 0.9020||36.67 / 0.9710|
|VDSR (Kim et al., 2016a)||666K||37.53 / 0.9587||33.03 / 0.9124||31.90 / 0.8960||30.76 / 0.9140||37.22 / 0.9750|
|DRCN (Kim et al., 2016b)||1,774K||37.63 / 0.9588||33.04 / 0.9118||31.85 / 0.8942||30.75 / 0.9133||37.55 / 0.9732|
|LapSRN (Lai et al., 2017)||251K||37.52 / 0.9591||32.99 / 0.9124||31.80 / 0.8952||30.41 / 0.9103||37.27 / 0.9740|
|DRRN (Tai et al., 2017a)||298K||37.74 / 0.9591||33.23 / 0.9136||32.05 / 0.8973||31.23 / 0.9188||37.88 / 0.9749|
|MemNet (Tai et al., 2017b)||678K||37.78 / 0.9597||33.28 / 0.9142||32.08 / 0.8978||31.31 / 0.9195||37.72 / 0.9740|
|IDN (Hui et al., 2018)||553K||37.83 / 0.9600||33.30 / 0.9148||32.08 / 0.8985||31.27 / 0.9196||38.01 / 0.9749|
|EDSR-baseline (Lim et al., 2017)||1,370K||37.99 / 0.9604||33.57 / 0.9175||32.16 / 0.8994||31.98 / 0.9272||38.54 / 0.9769|
|SRMDNF (Zhang et al., [n. d.])||1,511K||37.79 / 0.9601||33.32 / 0.9159||32.05 / 0.8985||31.33 / 0.9204||38.07 / 0.9761|
|CARN (Ahn et al., 2018)||1,592K||37.76 / 0.9590||33.52 / 0.9166||32.09 / 0.8978||31.92 / 0.9256||38.36 / 0.9765|
|IMDN (Ours)||694K||38.00 / 0.9605||33.63 / 0.9177||32.19 / 0.8996||32.17 / 0.9283||38.88 / 0.9774|
|Bicubic||-||30.39 / 0.8682||27.55 / 0.7742||27.21 / 0.7385||24.46 / 0.7349||26.95 / 0.8556|
|SRCNN (Dong et al., 2014)||8K||32.75 / 0.9090||29.30 / 0.8215||28.41 / 0.7863||26.24 / 0.7989||30.48 / 0.9117|
|FSRCNN (Dong et al., 2016a)||13K||33.18 / 0.9140||29.37 / 0.8240||28.53 / 0.7910||26.43 / 0.8080||31.10 / 0.9210|
|VDSR (Kim et al., 2016a)||666K||33.66 / 0.9213||29.77 / 0.8314||28.82 / 0.7976||27.14 / 0.8279||32.01 / 0.9340|
|DRCN (Kim et al., 2016b)||1,774K||33.82 / 0.9226||29.76 / 0.8311||28.80 / 0.7963||27.15 / 0.8276||32.24 / 0.9343|
|LapSRN (Lai et al., 2017)||502K||33.81 / 0.9220||29.79 / 0.8325||28.82 / 0.7980||27.07 / 0.8275||32.21 / 0.9350|
|DRRN (Tai et al., 2017a)||298K||34.03 / 0.9244||29.96 / 0.8349||28.95 / 0.8004||27.53 / 0.8378||32.71 / 0.9379|
|MemNet (Tai et al., 2017b)||678K||34.09 / 0.9248||30.00 / 0.8350||28.96 / 0.8001||27.56 / 0.8376||32.51 / 0.9369|
|IDN (Hui et al., 2018)||553K||34.11 / 0.9253||29.99 / 0.8354||28.95 / 0.8013||27.42 / 0.8359||32.71 / 0.9381|
|EDSR-baseline (Lim et al., 2017)||1,555K||34.37 / 0.9270||30.28 / 0.8417||29.09 / 0.8052||28.15 / 0.8527||33.45 / 0.9439|
|SRMDNF (Zhang et al., [n. d.])||1,528K||34.12 / 0.9254||30.04 / 0.8382||28.97 / 0.8025||27.57 / 0.8398||33.00 / 0.9403|
|CARN (Ahn et al., 2018)||1,592K||34.29 / 0.9255||30.29 / 0.8407||29.06 / 0.8034||28.06 / 0.8493||33.50 / 0.9440|
|IMDN (Ours)||703K||34.36 / 0.9270||30.32 / 0.8417||29.09 / 0.8046||28.17 / 0.8519||33.61 / 0.9445|
|Bicubic||-||28.42 / 0.8104||26.00 / 0.7027||25.96 / 0.6675||23.14 / 0.6577||24.89 / 0.7866|
|SRCNN (Dong et al., 2014)||8K||30.48 / 0.8628||27.50 / 0.7513||26.90 / 0.7101||24.52 / 0.7221||27.58 / 0.8555|
|FSRCNN (Dong et al., 2016a)||13K||30.72 / 0.8660||27.61 / 0.7550||26.98 / 0.7150||24.62 / 0.7280||27.90 / 0.8610|
|VDSR (Kim et al., 2016a)||666K||31.35 / 0.8838||28.01 / 0.7674||27.29 / 0.7251||25.18 / 0.7524||28.83 / 0.8870|
|DRCN (Kim et al., 2016b)||1,774K||31.53 / 0.8854||28.02 / 0.7670||27.23 / 0.7233||25.14 / 0.7510||28.93 / 0.8854|
|LapSRN (Lai et al., 2017)||502K||31.54 / 0.8852||28.09 / 0.7700||27.32 / 0.7275||25.21 / 0.7562||29.09 / 0.8900|
|DRRN (Tai et al., 2017a)||298K||31.68 / 0.8888||28.21 / 0.7720||27.38 / 0.7284||25.44 / 0.7638||29.45 / 0.8946|
|MemNet (Tai et al., 2017b)||678K||31.74 / 0.8893||28.26 / 0.7723||27.40 / 0.7281||25.50 / 0.7630||29.42 / 0.8942|
|IDN (Hui et al., 2018)||553K||31.82 / 0.8903||28.25 / 0.7730||27.41 / 0.7297||25.41 / 0.7632||29.41 / 0.8942|
|EDSR-baseline (Lim et al., 2017)||1,518K||32.09 / 0.8938||28.58 / 0.7813||27.57 / 0.7357||26.04 / 0.7849||30.35 / 0.9067|
|SRMDNF (Zhang et al., [n. d.])||1,552K||31.96 / 0.8925||28.35 / 0.7787||27.49 / 0.7337||25.68 / 0.7731||30.09 / 0.9024|
|CARN (Ahn et al., 2018)||1,592K||32.13 / 0.8937||28.60 / 0.7806||27.58 / 0.7349||26.07 / 0.7837||30.47 / 0.9084|
|IMDN (Ours)||715K||32.21 / 0.8948||28.58 / 0.7811||27.56 / 0.7353||26.04 / 0.7838||30.45 / 0.9075|
Table 6. Memory consumption and inference time of the compared methods.
|Method||Params||Depth||Memory / Time||Memory / Time||Memory / Time|
|EDSR-baseline (Lim et al., 2017)||1.6M||37||665 / 0.00295||2,511 / 0.00242||1,219 / 0.00232|
|EDSR (Lim et al., 2017)||43M||69||1,531 / 0.00580||8,863 / 0.00416||3,703 / 0.00380|
|RDN (Zhang et al., 2018c)||22M||150||1,123 / 0.01626||3,335 / 0.01325||2,257 / 0.01300|
|RCAN (Zhang et al., 2018b)||16M||415||777 / 0.09174||2,631 / 0.55280||1,343 / 0.72250|
|CARN (Ahn et al., 2018)||1.6M||34||945 / 0.00278||3,761 / 0.00305||2,803 / 0.00383|
|IMDN (Ours)||0.7M||34||671 / 0.00285||1,155 / 0.00284||895 / 0.00279|
4.5. Running time
|Scale||LapSRN (Lai et al., 2017)||IDN (Hui et al., 2018)||EDSR-b (Lim et al., 2017)||CARN (Ahn et al., 2018)||IMDN|
4.5.1. Complexity analysis
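As the proposed IMDN mainly consists of convolutions, the total number of parameters can be computed as
$$Params = \sum_{l=1}^{L} \left( n_{l-1} \cdot n_l \cdot f_l^2 + n_l \right),$$
where $l$ is the layer index, $L$ denotes the total number of layers, and $f_l$ represents the spatial size of the filters. The number of convolutional kernels belonging to the $l$-th layer is $n_l$, and its number of input channels is $n_{l-1}$; the first term counts the convolution weights and the second the biases. Supposing that the spatial size of the output feature maps is $m_l \times m_l$, the time complexity can be roughly calculated by
$$O\left( \sum_{l=1}^{L} n_{l-1} \cdot n_l \cdot f_l^2 \cdot m_l^2 \right).$$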
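The parameter and complexity counts described above are simple sums over layers, sketched below for a plain stack of convolutions. The helper names, the optional bias term, and the toy three-layer configuration are assumptions for illustration; they are not the IMDN configuration.

```python
def conv_params(channels, kernel_sizes, bias=True):
    """Parameter count of a conv stack.

    channels[l - 1] and channels[l] are the input and output channel counts
    of layer l; kernel_sizes[l - 1] is its (square) filter size.
    """
    total = 0
    for l, f in enumerate(kernel_sizes, start=1):
        n_in, n_out = channels[l - 1], channels[l]
        total += n_in * n_out * f * f + (n_out if bias else 0)
    return total


def conv_mult_adds(channels, kernel_sizes, out_sizes):
    """Rough multiply-add count, with out_sizes[l - 1] the side m_l of the
    (square) output feature map of layer l."""
    total = 0
    for l, (f, m) in enumerate(zip(kernel_sizes, out_sizes), start=1):
        total += channels[l - 1] * channels[l] * f * f * m * m
    return total


# Toy stack: 3 -> 64 -> 64 channels with 3x3 filters on 8x8 feature maps.
print(conv_params([3, 64, 64], [3, 3]))             # 38720
print(conv_mult_adds([3, 64, 64], [3, 3], [8, 8]))  # 2469888
```

The parameter count is independent of the image size, whereas the multiply-add count grows with the area of the feature maps, which is why processing features at low resolution keeps inference cheap.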
4.5.2. Running Time
We use the official code of the compared methods to test their running time in a feed-forward process. From Table 6, we can see that the actual execution time is related to the depth of the networks. Although EDSR has a large number of parameters (43M), it runs very fast; its only drawback is that it takes up more graphics memory. The main reason is that the convolution computations within each layer run in parallel. In contrast, RCAN has only 16M parameters, but its depth reaches 415, which results in very slow inference. Compared with CARN (Ahn et al., 2018) and EDSR-baseline (Lim et al., 2017), our IMDN achieves better performance in terms of memory usage and time consumption.
For a more intuitive comparison with other approaches, we provide the trade-off between running time and performance for SR on the Set5 dataset in Figure 9. It shows that our IMDN attains comparable execution time and the best PSNR value.
5. Conclusion
In this paper, we propose an information multi-distillation network for lightweight and accurate single image super-resolution. We construct a progressive refinement module to extract hierarchical features step by step. In cooperation with the proposed contrast-aware channel attention module, the SR performance is significantly and steadily improved. Additionally, we present the adaptive cropping strategy to solve the SR problem at an arbitrary scale factor, which is critical for the application of SR algorithms in real scenes. Numerous experiments have shown that the proposed method achieves a commendable balance between the factors affecting practical use, including visual quality, execution speed, and memory consumption. In the future, this approach will be explored to facilitate other image restoration tasks such as image denoising and enhancement.
Acknowledgements. This work was supported in part by the National Natural Science Foundation of China under Grants 61432014, 61772402, U1605252, 61671339 and 61871308, in part by the National Key Research and Development Program of China under Grant 2016QY01W0200, and in part by the National High-Level Talents Special Support Program of China under Grant CS31117200001.
- Agustsson and Timofte (2017) Eirikur Agustsson and Radu Timofte. 2017. NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study. In IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW). 126–135.
- Ahn et al. (2018) Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. 2018. Fast, Accurate, and Lightweight Super-Resolution with Cascading Residual Network. In European Conference on Computer Vision (ECCV). 252–268.
- Bevilacqua et al. (2012) Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. 2012. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In British Machine Vision Conference (BMVC).
- Dong et al. (2014) Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. 2014. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV). 184–199.
- Dong et al. (2016b) Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. 2016b. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 2 (2016), 295–307.
- Dong et al. (2016a) Chao Dong, Chen Change Loy, and Xiaoou Tang. 2016a. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision (ECCV). 391–407.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
- Hu et al. (2018) Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-Excitation Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7132–7141.
- Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4700–4708.
- Huang et al. (2015) Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. 2015. Single image super-resolution from transformed self-exemplars. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5197–5206.
- Hui et al. (2018) Zheng Hui, Xiumei Wang, and Xinbo Gao. 2018. Fast and Accurate Single Image Super-Resolution via Information Distillation Network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 723–731.
- Kim et al. (2016a) Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. 2016a. Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1646–1654.
- Kim et al. (2016b) Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. 2016b. Deeply-recursive convolutional network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1637–1645.
- Lai et al. (2017) Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. 2017. Deep laplacian pyramid networks for fast and accurate super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 624–632.
- Lai et al. (2018) Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. 2018. Fast and Accurate Image Super-Resolution with Deep Laplacian Pyramid Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
- Ledig et al. (2017) Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, and Andrew Cunningham. 2017. Photo-Realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4681–4690.
- Li et al. (2019) Zhen Li, Jinglei Yang, Zheng Liu, Xiaoming Yang, Gwanggil Jeon, and Wei Wu. 2019. Feedback Network for Image Super-Resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Lim et al. (2017) Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. 2017. Enhanced Deep Residual Networks for Single Image Super-Resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW). 136–144.
- Liu et al. (2018) Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. 2018. Non-Local Recurrent Network for Image Restoration. In Advances in Neural Information Processing Systems (NeurIPS). 1680–1689.
- Martin et al. (2001) David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. 2001. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In International Conference on Computer Vision (ICCV). 416–423.
- Matsui et al. (2017) Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2017. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications 76, 20 (2017), 21811–21838.
- Shi et al. (2016) Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1874–1883.
- Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations (ICLR).
- Tai et al. (2017a) Ying Tai, Jian Yang, and Xiaoming Liu. 2017a. Image super-resolution via deep recursive residual network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3147–3155.
- Tai et al. (2017b) Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. 2017b. MemNet: A Persistent Memory Network for Image Restoration. In IEEE International Conference on Computer Vision (ICCV). 4539–4547.
- Timofte et al. (2017) Radu Timofte, Shuhang Gu, Jiqing Wu, Luc Van Gool, Lei Zhang, et al. 2017. NTIRE 2018 Challenge on Single Image Super-Resolution: Methods and Results. In IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW). 965–976.
- Tong et al. (2017) Tong Tong, Gen Li, Xiejie Liu, and Qinquan Gao. 2017. Image Super-Resolution Using Dense Skip Connections. In IEEE International Conference on Computer Vision (ICCV). 4799–4807.
- Wang et al. (2018a) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018a. Non-local Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7794–7803.
- Wang et al. (2018b) Yifan Wang, Federico Perazzi, Brian McWilliams, Alexander Sorkine-Hornung, Olga Sorkine-Hornung, and Christopher Schroers. 2018b. A Fully Progressive Approach to Single-Image Super-Resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW). 977–986.
- Wang et al. (2004) Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.
- Wang et al. (2015) Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang. 2015. Deep networks for image super-resolution with sparse prior. In IEEE International Conference on Computer Vision (ICCV). 370–378.
- Woo et al. (2018) Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional Block Attention Module. In European Conference on Computer Vision (ECCV). 3–19.
- Zeyde et al. (2010) Roman Zeyde, Michael Elad, and Matan Protter. 2010. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces (ICCS). 711–730.
- Zhang et al. ([n. d.]) Kai Zhang, Wangmeng Zuo, and Lei Zhang. [n. d.]. Learning a Single Convolutional Super-Resolution Network for Multiple Degradations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3262–3271.
- Zhang et al. (2018a) Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018a. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 586–595.
- Zhang et al. (2018b) Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. 2018b. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In European Conference on Computer Vision (ECCV). 286–301.
- Zhang et al. (2019) Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. 2019. Residual Non-local Attention Networks for Image Restoration. In International Conference on Learning Representations (ICLR).
- Zhang et al. (2018c) Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. 2018c. Residual Dense Network for Image Super-Resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2472–2481.