Pyramidal Edge-maps and Attention based Guided Thermal Super-resolution

Guided super-resolution (GSR) of thermal images using visible-range images is challenging because of the spectral-range difference between the images. This difference leads to significant texture-mismatch between the images, which manifests as blur and ghosting artifacts in the super-resolved thermal image. To tackle this, we propose a novel GSR algorithm based on pyramidal edge-maps extracted from the visible image. Our proposed network has two sub-networks. The first super-resolves the low-resolution thermal image, while the second extracts edge-maps from the visible image at increasing perceptual scales and integrates them into the super-resolution sub-network through attention-based fusion. Extracting and integrating multi-level edges allows the super-resolution network to process texture-to-object-level information progressively, enabling more straightforward identification of overlapping edges between the input images. Extensive experiments show that our model outperforms state-of-the-art GSR methods, both quantitatively and qualitatively.

guided super-resolution, thermal image, hierarchical edge-maps, attention based fusion, convolutional neural network

1 Introduction

Thermal imaging has many advantages over traditional visible-range imaging, as it works well in extreme visibility conditions. It has found applications in various fields such as firefighting [2], gas leakage detection [39], and automation [23, 24, 5], but the high cost of thermal sensors has considerably restricted its consumer adoption. Super-resolution (SR) techniques can increase its applicability by estimating accurate high-resolution thermal images from measurements captured by considerably less expensive low-resolution thermal cameras.

Efficient methods have been proposed to perform super-resolution directly from the low-resolution thermal measurements. These single image SR methods [38, 22, 1, 7, 60] either take an iterative approach [60] or use a convolutional neural network (CNN) to learn the upsampling transformation that maps the low-resolution input to its high-resolution counterpart. However, if the dimensions of the input thermal image are very small, e.g., the thermal images from a low-end thermal camera such as the FLIR AX8, then single image super-resolution becomes very challenging, as the problem is highly ill-posed.

Figure 1: (a) Texture difference between thermal and visible images. (b) Multi-level edge-maps extracted from the visible image. Finer high-frequency details are present in Level-1 edges, but this level also contains unwanted edges, such as the text on the back of the truck, which are absent in Level-5. This opposing variation of high-frequency information across levels motivates the use of multi-level edge-maps. (c) Due to texture-mismatch, some existing methods such as MSG-Net [20] can produce blurred images. With the help of pyramidal edge-maps and adaptive fusion, our method produces better high-frequency details.

Since many low-resolution thermal cameras are accompanied by a high-resolution visible-range camera, a practical solution for obtaining better super-resolved thermal images is to use Guided Super-Resolution (GSR) techniques. A crucial part of super-resolution is correctly predicting the high-frequency details, which are present as edge information in the two images. Since edges are shared across modalities, estimating the overlapping edges between the two modalities becomes crucial for reconstructing better high-frequency details. Most existing guided thermal SR techniques [31, 16, 45, 6, 37] estimate high-frequency details implicitly using CNNs and end-to-end learning. However, the RGB guide image contains fine-texture details that are confined to the visible spectrum. For example, the text on the back of the truck in Fig. 1(a) is present only in the visible-range image. Such non-overlapping texture details can cause artifacts when used for guided super-resolution of thermal images. To address this drawback of a single RGB guide image, we propose to use hierarchical edge-maps as guide input to our thermal super-resolution network.

We propose a GSR model that takes multi-level edge-maps extracted from the visible image as input instead of a single RGB guide image. Edge-maps at different perceptual scales contain fine-texture to object-level information distinctly, as shown in Fig. 1(b). This property can enhance GSR performance, as edge-maps at different scales allow a more straightforward estimation of the overlapping edge information. Furthermore, to let the network adaptively select the appropriate edge information from the multiple input edge-maps, we propose a spatial-attention based fusion module. This module adaptively selects the high-frequency information from the edge-maps before integrating it into the SR network at different depths, i.e., receptive-field sizes. Extensive experiments show that using such a hierarchical form of guidance input, followed by adaptive attention-based integration, helps reconstruct better high-frequency details and enhances SR performance. Our experiments also indicate that hierarchical edge-maps and adaptive fusion provide some robustness towards small geometric misalignments between the input images. In summary, the main contributions of this paper are:

  • We propose a novel guided super-resolution method that consists of two sub-networks: one for thermal image super-resolution and the other for feature extraction and integration of multi-level edge-maps obtained from the visible image.

  • We use hierarchical edge-maps as guidance input and propose a novel fusion module that adaptively selects information from these multi-level edge-maps and integrates them into our SR network with the help of spatial-attention modules.

  • We compare our model with existing state-of-the-art GSR methods and show that our method reconstructs more high-frequency details and performs significantly better, both quantitatively and perceptually.

2 Related Works

The high cost of thermal cameras has inspired a large body of work on thermal image super-resolution. Among the single thermal image super-resolution methods [7, 38, 22, 1], Choi et al. [7] suggested a shallow three-layer convolutional neural network (CNN). Zhang et al. [60] thereafter combined compressive sensing and deep learning techniques to perform infrared super-resolution. Apart from thermal super-resolution methods, several methods have been proposed for near-infrared image super-resolution [16, 34, 46, 55, 45]. However, the drawback of these methods is that they do not target super-resolution for very low-resolution inputs, which is the case for low-cost thermal cameras. The works of Choi et al. [7] and Lee et al. [31] suggest that a visible-range super-resolution method or pre-trained model should perform well on thermal images too. Many deep CNN based single image super-resolution methods, such as [47, 25, 51, 11, 10, 57, 33, 48, 41, 30, 62, 61, 8], have shown great performance on visible images. Most recent methods, such as RCAN [61] and SAN [8], use self-attention mechanisms to produce better reconstructions. A common concern with single image methods, however, is that reconstructing HR images solely from low-resolution noisy sensor images is challenging, and a guided approach might perform better.

Among the guided thermal super-resolution methods, Lee et al. [31] used brightness information from the visible-range images. Han et al. [16] proposed a CNN-based guided super-resolution method that extracts features from infrared and visible images and combines them using convolutional layers. Ni et al. [37] proposed a method that utilizes an edge-map to perform GSR. Almasri et al. [1] performed a detailed study of different CNN architectures and up-sampling methods and proposed a network for guided thermal super-resolution. Interestingly, many recent methods for guided super-resolution of depth [40, 64, 14, 52, 27, 12, 63, 32, 18, 56, 54] or hyperspectral [29, 42, 28, 43] images have a backbone similar to that of the thermal GSR methods. They all use some variation of the Siamese network [4] to simultaneously extract information from both images and merge it to reconstruct the super-resolved image. However, these methods tackle texture-mismatch through implicit, end-to-end learning, which can perform sub-optimally and lead to blurred reconstructions, as shown in Fig. 1(c).

3 Pyramidal Edge-maps and Attention based Guided Super-Resolution (PAG-SR)

The guide image has a higher resolution than the input thermal image and contains useful high-frequency details that can be fused with the low-resolution thermal image to achieve better super-resolution. However, these high-frequency details should be extracted and integrated adaptively according to the input low-resolution thermal image. Sub-optimal feature extraction can propagate the texture-mismatch, which can in turn cause artifacts in the reconstructed image. At first glance, extracting the object-level edges seems an ideal solution, as they are shared across the multispectral images; but as one can observe from Fig. 1(b), the edge-maps at lower levels, i.e., levels 1 and 3, contain high-frequency details that are equally useful. To resolve this conundrum, we use edge-maps extracted at pyramidal levels from the visible image and integrate them using an adaptive fusion module based on a self-attention mechanism. This way, the network can leverage high-frequency information in a hierarchical fashion and adaptively select appropriate features according to the input low-resolution thermal image.

Figure 2: Our method utilizes hierarchical edge information extracted from the guide visible image and integrates this information into our thermal super-resolution network at different network-depths with the help of feature extraction and self-attention mechanism. We progressively merge information obtained from the guide image and allow the network to adaptively deal with high-frequency information mismatch.

Figure 2 shows the architecture of our proposed method. We denote the low-resolution thermal, the high-resolution visible and ground-truth thermal images as , and , respectively. Our proposed network consists of two sub-networks: one for thermal image super-resolution, denoted as and one for feature extraction and integration of multi-level edge-maps obtained from the visible image, denoted as . Many existing edge-detection methods for single images [53, 35] extract multi-level edges and merge them to obtain object-level edge-map. They extract edges at different perceptual scales by taking output at different layers of the VGG [44] network. Consequently, to obtain edges having visible-range information at different perceptual scales, we used one of these existing methods [35], which provides edge-maps at pyramidal levels. We denote these edge-maps as .
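The edge detector of [35] is a learned VGG-based network, which we do not reproduce here. As a rough, self-contained illustration of the idea behind pyramidal edge-maps — edges computed at successively coarser scales, so that fine texture fades while object-level boundaries persist — a gradient-magnitude sketch can stand in. All function names below are illustrative stand-ins, not part of the paper's implementation:

```python
import math

def grad_magnitude(img):
    """Simple gradient-magnitude edge map (a crude stand-in for a learned detector)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            out[y][x] = math.hypot(gx, gy)
    return out

def downsample2(img):
    """2x2 average pooling: each coarser scale suppresses more fine texture."""
    return [[(img[2*y][2*x] + img[2*y][2*x+1]
              + img[2*y+1][2*x] + img[2*y+1][2*x+1]) / 4.0
             for x in range(len(img[0]) // 2)]
            for y in range(len(img) // 2)]

def pyramidal_edges(img, levels=3):
    """Edge maps at successively coarser perceptual scales (level 1 = finest)."""
    maps, cur = [], img
    for _ in range(levels):
        maps.append(grad_magnitude(cur))
        cur = downsample2(cur)
    return maps
```

In the learned detector, the levels correspond to side-outputs of different VGG stages rather than explicit downsampling, but the qualitative effect — texture-level edges at fine levels, object-level edges at coarse levels — is the same.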

To extract guidance information and fuse it into the super-resolution network, we propose a fusion network. As shown in Fig. 2, it first contains a convolution layer that takes a concatenation of the multi-level edge-maps as input and extracts features from these edge-maps collectively. We call these extracted features edge-features; they contain the multi-level high-frequency guidance information from the visible-range image. These edge-features are then passed through an average pooling layer to reach the spatial resolution of the thermal features. We perform the fusion at the low-resolution scale because our experiments showed that downsampling reduces the edge-mismatch between the input images and leads to better performance than performing the fusion in the spatially high-resolution feature-space.

The next part of the fusion network contains a set of dense [19] and spatial-attention blocks [50], which we collectively call the fusion sub-block. For the multiple edge-maps, we have a corresponding set of fusion sub-blocks that lead to connections into the thermal super-resolution network. Each fusion sub-block contains a dense-block with convolutional layers for extracting the relevant features for that particular connection. Each dense-block is followed by a spatial-attention block [50] that adaptively transforms the guidance information and outputs weighted features based on the spatial correlation of different channels in the features. Our spatial-attention block is similar to the one proposed in [50] and its architecture is shown in Figure 3. The mathematical description of the module can be found in the supplementary paper. We tried different variations of the fusion network, the details of which are given in Section 4.4.2.

Figure 3: Architecture of the spatial attention block used in our fusion module. The module adaptively transforms the features according to spatial correlation inside each feature-map and outputs re-scaled features such that the relevant information has a higher activation.
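The essence of spatial attention is a per-pixel weight in (0, 1) that re-scales the features so that relevant guidance information receives higher activation. In the actual block the weight comes from learned convolutions; the sketch below derives it from the activation itself via a sigmoid, purely to illustrate the re-scaling mechanism (the names are illustrative, not the paper's):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def spatial_attention(feature_map):
    """Re-scale a 2-D feature map by a per-pixel attention weight in (0, 1).
    A learned block would compute the weight from convolutions over the
    features; here the weight is a sigmoid of the activation itself, so
    strong (relevant) activations pass almost unchanged while weak ones
    are suppressed further."""
    return [[sigmoid(v) * v for v in row] for row in feature_map]
```

For example, an activation of 0 is fully suppressed, while a large activation is passed through nearly unchanged — mimicking how the block lets strong overlapping-edge responses dominate the fused guidance.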

Our thermal image super-resolution sub-network consists of two parts, as shown in Fig. 2. The first part extracts information from the low-resolution thermal image and is merged with the fusion network to receive the guidance information. It contains convolutional layers followed by dense-blocks [19] of two convolutional layers each. For the guide edge-maps, it contains a corresponding set of dense-blocks. The fusion operation can be summarised as:


where the feature-maps and dense-blocks belong to the thermal super-resolution branch, and the fusion sub-block is as described in the previous paragraph.
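The paper's fusion equation (elided above) describes a progressive merging loop: after each dense-block in the thermal branch, the output of the matching fusion sub-block is merged in. A minimal sketch of this loop, assuming additive merging and using plain callables as stand-ins for the learned dense and fusion blocks (all names are illustrative):

```python
def progressive_fusion(thermal_feat, edge_feats, dense_blocks, fusion_blocks):
    """Sketch of the merging loop: each dense block transforms the thermal
    features, then the matching fusion sub-block's output (computed from the
    shared edge-features) is added in. dense_blocks and fusion_blocks are
    equal-length lists of callables standing in for learned layers."""
    feat = thermal_feat
    for dense, fuse in zip(dense_blocks, fusion_blocks):
        feat = [d + f for d, f in zip(dense(feat), fuse(edge_feats))]
    return feat
```

The key structural point is that the same edge-features feed every fusion sub-block, while each sub-block independently selects the edge information appropriate for its depth in the thermal branch.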

The features from the first part are fed into the second, which contains convolutional and upsampling layers; the number of deconvolution layers depends on the super-resolution factor. The final deconvolution layer is followed by a convolutional layer and a skip connection from the input image for residual learning. The output can be defined as:


Loss functions. To learn the network parameters, our optimization function contains two loss terms. The first is a reconstruction loss for supervised training. The second is a gradient loss that explicitly penalizes the loss of high-frequency details; here, the Laplacian operator calculates both horizontal and vertical gradients. Hence, our overall loss function is


We found the optimal values for and to be and , respectively.
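A minimal sketch of this two-term objective, assuming an L1 reconstruction loss and the standard 4-neighbour discrete Laplacian (the weights below are placeholders, not the paper's tuned values, and all names are illustrative):

```python
def laplacian(img):
    """4-neighbour discrete Laplacian; captures both horizontal and vertical
    gradients in a single response (borders left at zero)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (img[y][x-1] + img[y][x+1] + img[y-1][x] + img[y+1][x]
                         - 4.0 * img[y][x])
    return out

def l1(a, b):
    """Mean absolute difference between two equal-sized 2-D arrays."""
    n = len(a) * len(a[0])
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n

def total_loss(sr, gt, lam_rec=1.0, lam_grad=0.1):
    """Reconstruction term + gradient term. lam_rec / lam_grad are
    placeholder weights; the paper's tuned values are not reproduced here."""
    return lam_rec * l1(sr, gt) + lam_grad * l1(laplacian(sr), laplacian(gt))
```

Note that a constant intensity offset between prediction and ground truth is penalised only by the reconstruction term: the Laplacian of a shifted image is unchanged, so the gradient term isolates high-frequency (edge) errors.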

4 Experiments

4.1 Datasets and Setup

We perform experiments on three datasets: FLIR-ADAS [13], CATS [49] and KAIST [21]. FLIR-ADAS contains unrectified stereo thermal and visible-range image pairs. Since the dataset does not contain any calibration images, we rectified one image pair manually by identifying correspondences and estimating the relative transformation matrices. We then used these estimated transformations to rectify the rest of the images in the dataset. After rectification, the thermal and visible-range images have the same resolution. The CATS dataset contains rectified thermal and visible images of equal dimensions. This dataset also contains ground-truth disparity maps between the two images, but we observed that the disparity-maps are not accurate and result in artifacts when either image is warped using the disparity. Therefore, we used the rectified yet unaligned image pairs for our experiments, similar to the FLIR-ADAS dataset. Hence, both datasets have rectified image pairs (i.e., the epipolar lines are horizontal) that are not pixel-wise aligned. In contrast, the third dataset, KAIST, contains aligned thermal and visible images. This dataset was captured using a beam-splitter based setup and hence bears less practical similarity to low-resolution thermal cameras. We therefore use this dataset with a smaller set of GSR methods, to present a baseline comparison on aligned thermal and visible images.

To create the low-resolution dataset, we used the blur-downscale degradation model proposed in [58] to simulate the low-resolution images. For the training set, we down-sample images using Gaussian blur kernels of varying widths. For the FLIR-ADAS dataset, our training set contains 43830 image pairs and the test set contains 1257 pairs. In the CATS dataset, our training set contains 944 image pairs and the test set contains 50 pairs. Since the CATS training set is quite small, we used it to fine-tune the models pre-trained on the FLIR-ADAS training set and then tested them on the CATS test set. The KAIST training and test sets were prepared in the same manner. We perform experiments for two upsampling factors. For all datasets, the input thermal resolution is close to that of low-cost thermal cameras like the FLIR AX8, and the guide and ground-truth images are correspondingly larger by the upsampling factor. For network optimization, we use the ADAM optimizer [26]. The experiments were performed on an Nvidia 2080Ti GPU.
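The blur-downscale degradation of [58] amounts to convolving with a Gaussian kernel and then subsampling by the scale factor. A self-contained sketch, assuming a fixed 5x5 kernel and zero padding at the borders (kernel size, sigma, and padding choice are illustrative assumptions, not the paper's settings):

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    """Normalised 2-D Gaussian kernel (sums to 1)."""
    half = size // 2
    k = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
          for x in range(-half, half + 1)] for y in range(-half, half + 1)]
    s = sum(sum(row) for row in k)
    return [[v / s for v in row] for row in k]

def blur_downscale(img, scale=4, sigma=1.0):
    """Blur with a 5x5 Gaussian, then subsample every `scale`-th pixel.
    Zero padding at borders; production code would use reflective padding."""
    k = gaussian_kernel(5, sigma)
    h, w, half = len(img), len(img[0]), 2
    blurred = [[sum(k[dy + half][dx + half] * img[y + dy][x + dx]
                    for dy in range(-half, half + 1)
                    for dx in range(-half, half + 1)
                    if 0 <= y + dy < h and 0 <= x + dx < w)
                for x in range(w)] for y in range(h)]
    return [row[::scale] for row in blurred[::scale]]
```

Blurring before subsampling is what distinguishes this model from naive decimation: it removes frequencies above the new Nyquist limit, which is a more realistic model of a low-resolution sensor.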

Method     G/S Scale PSNR SSIM MSE LPIPS
Bicubic Single 28.37 0.890 0.001652 0.405
RDN [62] Single 29.28 0.906 0.001441 0.282
RCAN [61] Single 29.18 0.908 0.001483 0.228
SAN [8] Single 26.47 0.859 0.002567 0.229
TGV2-L2 [12] Guided 28.77 0.892 0.001601 0.422
FBS [3] Guided 25.48 0.787 0.003152 0.387
Joint-BU [27] Guided 27.77 0.874 0.001855 0.284
Infrared SR [16] Guided 28.21 0.889 0.001692 0.405
SDF [15] Guided 28.70 0.875 0.001488 0.321
MSF-SR [1] Guided 29.21 0.901 0.001447 0.200
MSG-Net [20] Guided 29.46 0.897 0.001341 0.184
PixTransform [36] Guided 24.84 0.787 0.003679 0.329
Deep-ISTA [9] Guided 25.86 0.828 0.028939 0.529
PAG-SR (Ours) Guided 29.56 0.912 0.001309 0.147
RDN [62] Single 26.80 0.833 0.002314 0.389
RCAN [61] Single 22.35 0.758 0.006771 0.414
SAN [8] Single 25.38 0.811 0.003251 0.536
TGV2-L2 [12] Guided 26.42 0.821 0.002526 0.399
FBS [3] Guided 25.03 0.770 0.003451 0.476
Joint-BU [27] Guided 25.61 0.803 0.003006 0.406
Infrared SR [16] Guided 26.03 0.817 0.002782 0.521
SDF [15] Guided 26.72 0.819 0.002379 0.363
MSF-SR [1] Guided 27.92 0.835 0.002350 0.249
MSG-Net [20] Guided 27.29 0.827 0.002263 0.296
PixTransform [36] Guided 23.31 0.836 0.005224 0.371
Deep-ISTA [9] Guided 25.56 0.778 0.030982 0.598
PAG-SR (Ours) Guided 28.77 0.919 0.001581 0.214
Table 1: Comparison of existing methods on FLIR-ADAS for and SR cases.

Comparison details.

We compare our method with the existing GSR methods TGV2-L2 [12], FBS [3], Joint-BU [27], Infrared-SR [17], SDF [15], MSF-STI-SR [1], MSG-Net [20], Pix-Transform [36] and Deep-ISTA [9]. We also include comparisons with a few recent single image SR methods: RDN [62], RCAN [61] and SAN [8]. We used the publicly available code for the existing methods and trained the CNN based single and guided SR methods on the corresponding thermal datasets. For the CATS dataset, the models pre-trained on the FLIR-ADAS dataset were used for fine-tuning. We kept the default settings for most of the methods, except for a few filtering-based methods such as FBS and Joint-BU, where the weights had to be adjusted to reconstruct better texture-details in the super-resolved images.


We use four metrics to quantitatively assess the reconstructions: PSNR, SSIM, mean-squared error (MSE) and perceptual distance (LPIPS) [59]. Among these, PSNR and SSIM are distortion-based metrics and hence can be biased towards smooth or blurred images. Therefore, we also use LPIPS, a perceptual metric that computes the perceptual distance between the reconstructed and ground-truth images. Note that since the reconstructed images are thermal measurements, a lower MSE is also an important factor when comparing the methods.
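PSNR and MSE are directly related: PSNR is the peak signal power over the MSE, on a log scale. A minimal sketch for images normalised to [0, 1] (SSIM and LPIPS require windowed statistics and a trained network, respectively, and are omitted):

```python
import math

def mse(a, b):
    """Mean-squared error between two equal-sized 2-D images."""
    n = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n

def psnr(a, b, peak=1.0):
    """PSNR in dB for images with intensities in [0, peak]; higher is better.
    Identical images give infinite PSNR."""
    e = mse(a, b)
    return float('inf') if e == 0 else 10.0 * math.log10(peak * peak / e)
```

This relationship explains why the PSNR and MSE columns in Tables 1-3 track each other: both measure the same pixel-wise distortion, whereas LPIPS responds to perceptual differences such as blur that PSNR can reward.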

Method     G/S Scale PSNR SSIM MSE LPIPS
Bicubic Single 32.19 0.959 0.000744 0.395
RDN [62] Single 29.41 0.914 0.004811 0.357
RCAN [61] Single 31.89 0.966 0.000796 0.159
SAN [8] Single 33.41 0.960 0.000507 0.141
TGV2-L2 [12] Guided 32.17 0.938 0.000741 0.225
FBS [3] Guided 29.12 0.825 0.035104 0.450
Joint-BU [27] Guided 31.23 0.953 0.000914 0.233
Infrared SR [16] Guided 28.27 0.901 0.031501 0.348
SDF [15] Guided 32.56 0.941 0.000686 0.246
MSF-SR [1] Guided 29.37 0.830 0.022598 0.415
MSG-Net [20] Guided 31.56 0.964 0.000789 0.177
PixTransform [36] Guided 28.48 0.792 0.185427 0.442
Deep-ISTA [9] Guided 33.72 0.956 0.000488 0.178
PAG-SR (Ours) Guided 34.97 0.968 0.000461 0.161
Bicubic Single 31.45 0.958 0.000868 0.413
RDN [62] Single 33.31 0.956 0.000631 0.392
RCAN [61] Single 27.63 0.931 0.002296 0.332
SAN [8] Single 32.17 0.953 0.000615 0.278
TGV2-L2 [12] Guided 31.55 0.951 0.000846 0.303
FBS [3] Guided 29.03 0.855 0.035654 0.495
Joint-BU [27] Guided 30.21 0.950 0.001131 0.314
Infrared SR [16] Guided 25.23 0.904 0.029725 0.409
SDF [15] Guided 31.91 0.948 0.000778 0.319
MSF-SR [1] Guided 27.97 0.811 0.032487 0.418
MSG-Net [20] Guided 32.83 0.957 0.000622 0.270
PixTransform [36] Guided 27.79 0.783 0.121195 0.584
Deep-ISTA [9] Guided 32.51 0.949 0.000595 0.290
PAG-SR (Ours) Guided 33.18 0.963 0.000537 0.246
Table 2: Comparison of guided super-resolution methods on CATS dataset for and upsampling cases. Higher PSNR, SSIM and lower MSE, LPIPS are better.

4.2 Quantitative comparison

Tables 1 and 2 show the results for both upsampling factors on the FLIR-ADAS and CATS datasets, and Table 3 shows the results on the KAIST dataset. A general trend among the existing methods is that they perform quite well in terms of distortion metrics but poorly in terms of the perceptual metric on the FLIR and CATS datasets. For these datasets, the MSG-Net [20] and MSF-SR [1] results are the closest to ours. PixTransform [36] and Deep-ISTA [9] are the most recent methods, yet they perform poorly compared to the others on the FLIR and CATS datasets, mostly due to edge-mismatch and an inability to accommodate the texture difference. For the FLIR dataset, we observe a small variance in the metric values, but for the CATS dataset the variance is much higher. The reason is the higher disparity range, i.e., higher misalignment, between the input images of the CATS dataset. In contrast, the KAIST dataset has aligned images, and hence the results show much less variance than on the other two datasets. For this dataset, Deep-ISTA is the closest to our method in terms of performance.

Our method outperforms the existing methods in terms of both distortion and perceptual metrics on all three datasets. We observe a significant margin between our performance and that of the existing methods, especially on the FLIR-ADAS dataset. In Table 2, we observe a similar pattern in the metric values. An interesting observation, however, is that on the CATS dataset, the single image SR methods perform better than many GSR methods. We believe this could be due to the edge-mismatch caused by the misalignment. For the KAIST dataset, our results are better than those of the existing methods, which indicates that our method works better for both aligned and misaligned inputs.

Method     G/S Scale PSNR SSIM MSE LPIPS
SDF [15] Guided 24.14 0.830 0.005647 0.242
MSF-SR [1] Guided 25.52 0.855 0.004319 0.274
MSG-Net [20] Guided 25.61 0.868 0.004031 0.242
PixTransform [36] Guided 24.85 0.828 0.007692 0.325
Deep-ISTA [9] Guided 26.12 0.871 0.003846 0.273
PAG-SR (Ours) Guided 27.98 0.856 0.002399 0.219
Table 3: Comparison of GSR methods on KAIST dataset for upsampling case.

(a) Visual comparison for guided super-resolution

(b) Visual comparison for guided super-resolution

Figure 4: Visual comparison on sample images from the FLIR-ADAS dataset. Our method reconstructs high-frequency details more accurately and has fewer artifacts due to mismatched edges than existing GSR methods, e.g., the ghosting effect of trees in MSF-STI's reconstruction or the blurred edges in most of the other methods. We also achieve higher SSIM and lower perceptual distance (LPIPS) values.

Robustness towards texture-mismatch and misalignment. The CATS dataset has higher misalignment, which is visualized in the disparity-maps in Fig. 5. Misalignment results in higher texture-mismatch, which consequently reduces the performance of many existing methods. We speculate that the use of edge-maps and the attention module in our method provides some robustness towards misalignment and is the cause of our better performance. To validate this, we experimented with a variant of our network that takes an RGB guide input and has no attention block inside the fusion sub-blocks. This model performs worse than our proposed method with pyramidal edge-maps and attention based fusion: it achieved an SSIM of 0.901 and an LPIPS of 0.173, compared to our proposed model's SSIM of 0.912 and LPIPS of 0.147, for SR on the FLIR dataset. This indicates that the edge-maps and attention module contribute to the performance improvement and hence tackle texture-mismatch, and most probably misalignment as well, to a certain extent.

Figure 5: Visual comparison on images from CATS dataset for SR.

4.3 Qualitative comparison

We show the qualitative comparison of our method for both upsampling rates on the FLIR-ADAS dataset in Fig. 4. Most of the existing methods produce blurred edges, especially at the larger upsampling factor, either due to the very low resolution of the input thermal images or due to texture-mismatch and improper propagation of guidance information. Among the existing methods, the MSF-STI, MSG-Net and Joint-BU results are considerably good, yet our method reconstructs high-frequency details more faithfully and has much sharper edges. For example, in Fig. 4(a), the branches in the red inset are clearly sharpest in our result. Moreover, ghosting artifacts can be found in the results of a few existing methods, such as MSF-STI's reconstruction in Fig. 4(b). In contrast, our method reconstructs the object structure better for the same image and does not exhibit such artifacts.

Fig. 5 shows super-resolution results for a few samples from the CATS dataset, which has higher misalignment than the FLIR-ADAS dataset. The existing guided super-resolution methods show more blur than on FLIR-ADAS, mostly due to the increased texture-mismatch caused by the higher misalignment. Our results, however, are overall much sharper, which indicates a comparatively higher robustness towards misalignment, as mentioned in Section 4.2.

4.4 Ablation studies

Usefulness of edge-maps.

Ideally, the RGB image could be used to extract features/edges in an end-to-end manner. However, estimating the optimal edge-map is very challenging because the SR network would require object-level awareness while extracting features. Using multi-level edge-maps simplifies this task by providing the object-edge information explicitly. To validate this hypothesis, we performed experiments where we replaced the edge-maps with the visible RGB image for a few variants of our model. The experiment was performed for SR on the FLIR-ADAS dataset. Our proposed PAG-SR fuses guide information into the thermal super-resolution network at several positions; since the edge-features contain information from all the edge-maps, the number of these positions can be reduced or expanded. Table 4 summarizes the results for several such variations. When the guide information is fed only at position 1, i.e., after the first dense-block, RGB performs slightly better than the edge-maps. However, with fusion at all positions, the edge-maps perform better than RGB, RGB combined with edge-maps, and all other variants. Thus, we conclude that using edge-maps as guidance information and fusing them at multiple positions is overall the better strategy. We also performed a study to analyze the contribution of the different edge-map levels towards performance; the results can be found in Table 1 of the supplementary paper.

Integration positions Input type PSNR SSIM LPIPS
    1 2 3 4 5 RGB Edge-maps
28.05 0.916 0.249
28.34 0.909 0.222
28.19 0.916 0.243
28.85 0.916 0.215
28.14 0.916 0.258
28.77 0.919 0.214
28.83 0.915 0.221
Table 4: Performance variation with respect to guidance input type: RGB vs edge-maps for super-resolution on FLIR-ADAS dataset.
Edge-features() Dense block Attention module PSNR SSIM LPIPS
27.95 0.837 0.213
28.15 0.904 0.223
28.39 0.910 0.214
28.87 0.907 0.221
28.77 0.919 0.214
Table 5: Performance of different variations of our guidance fusion module.

Contribution of different components of the fusion network

To analyze the contribution of different components of our fusion network, we computed the performance of several variants of the fusion network while keeping the SR network constant, for SR on the FLIR-ADAS dataset. The results are summarized in Table 5. The simplest variant is our proposed method without the fusion module, i.e., with the edge-maps added directly to the SR network. The other variants have either the dense-blocks, the attention module, neither, or both in each fusion sub-block. The results show that having both dense-blocks and the attention module helps in achieving better reconstructions. Moreover, Fig. 6 shows a visual comparison of the results from the model with and without the fusion block. The model without the fusion module produces many artifacts induced by the edge-mismatch between the input images. Most of these artifacts are eliminated by our fusion network through appropriate selection of edge information.

Figure 6: Comparison of results from our method with and without the fusion module.

5 Conclusion

We proposed a hierarchical edge-maps based guided super-resolution algorithm that tackles edge-mismatch due to spectral-difference between the input low-resolution thermal and high-resolution visible-range images in a systematic and holistic manner. Our method robustly combines multi-level edge information extracted from the visible-range image into our tailored thermal super-resolution network with the help of attention based guidance propagation and consequently produces better high-frequency details. We showed that our results are significantly better both perceptually and quantitatively than the existing state-of-the-art guided super-resolution methods.




  1. Almasri, F., Debeir, O.: Multimodal sensor fusion in single thermal image super-resolution. In: Asian Conference on Computer Vision. pp. 418–433. Springer (2018)
  2. Arrue, B.C., Ollero, A., De Dios, J.M.: An intelligent system for false alarm reduction in infrared forest-fire detection. IEEE Intelligent Systems and Their Applications 15(3), 64–73 (2000)
  3. Barron, J.T., Poole, B.: The fast bilateral solver. ECCV (2016)
  4. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.: Fully-convolutional siamese networks for object tracking. In: European conference on computer vision. pp. 850–865. Springer (2016)
  5. Borges, P.V.K., Vidas, S.: Practical infrared visual odometry. IEEE Transactions on Intelligent Transportation Systems 17(8), 2205–2213 (2016)
  6. Chen, X., Zhai, G., Wang, J., Hu, C., Chen, Y.: Color guided thermal image super resolution. In: 2016 Visual Communications and Image Processing (VCIP). pp. 1–4. IEEE (2016)
  7. Choi, Y., Kim, N., Hwang, S., Kweon, I.S.: Thermal image enhancement using convolutional neural network. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 223–230. IEEE (2016)
  8. Dai, T., Cai, J., Zhang, Y., Xia, S.T., Zhang, L.: Second-order attention network for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11065–11074 (2019)
  9. Deng, X., Dragotti, P.L.: Deep coupled ista network for multi-modal image super-resolution. IEEE Transactions on Image Processing 29, 1683–1698 (2019)
  10. Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38(2), 295–307 (2016)
  11. Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: European conference on computer vision. pp. 391–407. Springer (2016)
  12. Ferstl, D., Reinbacher, C., Ranftl, R., Rüther, M., Bischof, H.: Image guided depth upsampling using anisotropic total generalized variation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 993–1000 (2013)
  13. FLIR: Advanced driver assistance systems dataset (2018)
  14. Guo, C., Li, C., Guo, J., Cong, R., Fu, H., Han, P.: Hierarchical features driven residual learning for depth map super-resolution. IEEE Transactions on Image Processing (2018)
  15. Ham, B., Cho, M., Ponce, J.: Robust image filtering using joint static and dynamic guidance. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2015)
  16. Han, T.Y., Kim, Y.J., Song, B.C.: Convolutional neural network-based infrared image super resolution under low light environment. In: 2017 25th European Signal Processing Conference (EUSIPCO). pp. 803–807 (Aug 2017).
  17. Han, T.Y., Kim, Y.J., Song, B.C.: Convolutional neural network-based infrared image super resolution under low light environment. In: 2017 25th European Signal Processing Conference (EUSIPCO). pp. 803–807. IEEE (2017)
  18. Hayat, K.: Multimedia super-resolution via deep learning: A survey. Digital Signal Processing 81, 198–217 (2018)
  19. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)
  20. Hui, T.W., Loy, C.C., Tang, X.: Depth map super-resolution by deep multi-scale guidance. In: Proceedings of European Conference on Computer Vision (ECCV). pp. 353–369 (2016)
  21. Hwang, S., Park, J., Kim, N., Choi, Y., Kweon, I.S.: Multispectral pedestrian detection: Benchmark dataset and baselines. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  22. Jones, H., Sirault, X.: Scaling of thermal images at different spatial resolution: the mixed pixel problem. Agronomy 4(3), 380–396 (2014)
  23. Khattak, S., Papachristos, C., Alexis, K.: Marker based thermal-inertial localization for aerial robots in obscurant filled environments. In: International Symposium on Visual Computing. pp. 565–575. Springer (2018)
  24. Khattak, S., Papachristos, C., Alexis, K.: Keyframe-based direct thermal-inertial odometry. arXiv preprint arXiv:1903.00798 (2019)
  25. Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1646–1654 (2016)
  26. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  27. Kopf, J., Cohen, M.F., Lischinski, D., Uyttendaele, M.: Joint bilateral upsampling. In: ACM Transactions on Graphics (ToG). vol. 26, p. 96. ACM (2007)
  28. Kwon, H., Tai, Y.W.: Rgb-guided hyperspectral image upsampling. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 307–315 (2015)
  29. Lahoud, F., Zhou, R., Süsstrunk, S.: Multi-modal spectral image super-resolution. In: European Conference on Computer Vision. pp. 35–50. Springer (2018)
  30. Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE transactions on pattern analysis and machine intelligence (2018)
  31. Lee, K., Lee, J., Lee, J., Hwang, S., Lee, S.: Brightness-based convolutional neural network for thermal image enhancement. IEEE Access 5, 26867–26879 (2017)
  32. Li, Y., Sun, J., Wang, B., Zhao, Y.: Depth super-resolution using joint adaptive weighted least squares and patching gradient. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1458–1462. IEEE (2018)
  33. Lim, B., Son, S., Kim, H., Nah, S., Mu Lee, K.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 136–144 (2017)
  34. Liu, F., Han, P., Wang, Y., Li, X., Bai, L., Shao, X.: Super resolution reconstruction of infrared images based on classified dictionary learning. Infrared Physics & Technology 90, 146–155 (2018)
  35. Liu, Y., Cheng, M.M., Hu, X., Bian, J.W., Zhang, L., Bai, X., Tang, J.: Richer convolutional features for edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1939 – 1946 (2019).
  36. Lutio, R.d., D’Aronco, S., Wegner, J.D., Schindler, K.: Guided super-resolution as pixel-to-pixel transformation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 8829–8837 (2019)
  37. Ni, M., Lei, J., Cong, R., Zheng, K., Peng, B., Fan, X.: Color-guided depth map super resolution using convolutional neural network. IEEE Access 5, 26666–26672 (2017).
  38. Panagiotopoulou, A., Anastassopoulos, V.: Super-resolution reconstruction of thermal infrared images. In: Proceedings of the 4th WSEAS International Conference on REMOTE SENSING (2008)
  39. Prata, A., Bernardo, C.: Retrieval of volcanic ash particle size, mass and optical depth from a ground-based thermal infrared camera. Journal of Volcanology and Geothermal Research 186(1-2), 91–107 (2009)
  40. Riegler, G., Ferstl, D., Rüther, M., Bischof, H.: A deep primal-dual network for guided depth super-resolution. arXiv preprint arXiv:1607.08569 (2016)
  41. Sajjadi, M.S., Scholkopf, B., Hirsch, M.: Enhancenet: Single image super-resolution through automated texture synthesis. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4491–4500 (2017)
  42. Shi, Z., Chen, C., Xiong, Z., Liu, D., Zha, Z.J., Wu, F.: Deep residual attention network for spectral image super-resolution. In: European Conference on Computer Vision. pp. 214–229. Springer (2018)
  43. Shoeiby, M., Robles-Kelly, A., Timofte, R., Zhou, R., Lahoud, F., Susstrunk, S., Xiong, Z., Shi, Z., Chen, C., Liu, D., et al.: Pirm2018 challenge on spectral image super-resolution: methods and results. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 0–0 (2018)
  44. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  45. Song, P., Deng, X., Mota, J.F., Deligiannis, N., Dragotti, P.L., Rodrigues, M.R.: Multimodal image super-resolution via joint sparse representations induced by coupled dictionaries. arXiv preprint arXiv:1709.08680 (2017)
  46. Sun, C., Lv, J., Li, J., Qiu, R.: A rapid and accurate infrared image super-resolution method based on zoom mechanism. Infrared Physics & Technology 88, 228–238 (2018)
  47. Tai, Y., Yang, J., Liu, X.: Image super-resolution via deep recursive residual network. In: Proceedings of the IEEE Conference on Computer vision and Pattern Recognition. pp. 3147–3155 (2017)
  48. Tai, Y., Yang, J., Liu, X., Xu, C.: Memnet: A persistent memory network for image restoration. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4539–4547 (2017)
  49. Treible, W., Saponaro, P., Sorensen, S., Kolagunda, A., O’Neal, M., Phelan, B., Sherbondy, K., Kambhamettu, C.: Cats: A color and thermal stereo benchmark. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2961–2969 (2017)
  50. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7794–7803 (2018)
  51. Wang, Z., Liu, D., Yang, J., Han, W., Huang, T.: Deep networks for image super-resolution with sparse prior. In: Proceedings of the IEEE international conference on computer vision. pp. 370–378 (2015)
  52. Xie, J., Feris, R., Sun, M.T.: Edge-guided single depth image super resolution. IEEE Transactions on Image Processing 25(1), 428–438 (Jan 2016)
  53. Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of IEEE International Conference on Computer Vision (2015)
  54. Ye, J., Gao, M., Yang, Y., Cao, Q., Yu, Z.: Super-resolution reconstruction of depth image based on edge-selected deep residual network. In: 2019 IEEE 16th International Conference on Networking, Sensing and Control (ICNSC). pp. 121–125. IEEE (2019)
  55. Yokoya, N.: Texture-guided multisensor superresolution for remotely sensed images. Remote Sensing 9(4),  316 (2017)
  56. Yu, S., Lan, H., Jung, C.: Intensity guided depth upsampling using edge sparsity and super-weighted gradient minimization. IEEE Access 7, 140553–140565 (2019)
  57. Zhang, K., Zuo, W., Zhang, L.: Learning a single convolutional super-resolution network for multiple degradations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3262–3271 (2018)
  58. Zhang, K., Zuo, W., Zhang, L.: Learning a single convolutional super-resolution network for multiple degradations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3262–3271 (2018)
  59. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
  60. Zhang, X., Li, C., Meng, Q., Liu, S., Zhang, Y., Wang, J.: Infrared image super resolution by combining compressive sensing and deep learning. Sensors 18(8),  2587 (2018)
  61. Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: ECCV (2018)
  62. Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image super-resolution. In: CVPR (2018)
  63. Zhou, D., Wang, R., Yang, X., Zhang, Q., Wei, X.: Depth image super-resolution reconstruction based on a modified joint trilateral filter. Royal Society open science 6(1), 181074 (2019)
  64. Zhou, W., Li, X., Reynolds, D.: Guided deep network for depth map super-resolution: How much can color help? In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1457–1461. IEEE (2017)