# Pixel-aware Deep Function-mixture Network for Spectral Super-Resolution

###### Abstract

Spectral super-resolution (SSR) aims at generating a hyperspectral image (HSI) from a given RGB image. Recently, a promising direction for SSR is to learn a complicated mapping function from the RGB image to the HSI counterpart using a deep convolutional neural network. This essentially involves mapping the RGB context within a size-specific receptive field centered at each pixel to its spectrum in the HSI. The focus thereon is to appropriately determine the receptive field size and establish the mapping function from RGB context to the corresponding spectrum. Due to their differences in category or spatial position, pixels in HSIs often require different-sized receptive fields and distinct mapping functions. However, few efforts have been invested to explicitly exploit this prior.

To address this problem, we propose a pixel-aware deep function-mixture network for SSR, which is composed of a new class of modules, termed function-mixture (FM) blocks. Each FM block is equipped with several basis functions, i.e., parallel subnets with different-sized receptive fields. In addition, it incorporates an extra subnet as a mixing function to generate pixel-wise weights, and then linearly mixes the outputs of all basis functions with those weights. This enables us to determine the receptive field size and the mapping function pixel-wise. Moreover, we stack several such FM blocks to further increase the flexibility of the network in learning the pixel-wise mapping. To encourage feature reuse, intermediate features generated by the FM blocks are fused at a late stage, which proves to be effective for boosting the SSR performance. Experimental results on three benchmark HSI datasets demonstrate the superiority of the proposed method.

## 1 Introduction

Hyperspectral imaging is a technique that captures the reflectance of scenes with extremely high spectral resolution (e.g., 10 nm) [7]. The captured hyperspectral image (HSI) often contains hundreds or thousands of spectral bands, and each pixel has a spectrum [7, 41]. Benefiting from the abundant spectral information, HSIs have been widely applied to various tasks, e.g., classification [3], detection [26] and tracking [35]. However, the cost of obtaining such spectral information is an increased pixel size on the sensor, which inevitably limits the spatial resolution of HSIs [25]. Thus, it is crucial to investigate how to generate high-spatial-resolution (HSR) HSIs.

Different from conventional HSI super-resolution [27, 40], which directly improves the spatial resolution of a given HSI, spectral super-resolution (SSR) [5, 37] adopts an alternative way and attempts to produce an HSR HSI by increasing the spectral resolution of a given RGB image with satisfactory spatial resolution. Early SSR methods [5, 1, 20] often formulate SSR as a linear inverse problem, and exploit the inherent low-level statistics of HSR HSIs as priors. However, due to the limited expressive capacity of their handcrafted prior models, these methods fail to generalize well to challenging cases. Recently, witnessing the great success of deep convolutional neural networks (DCNNs) in a wide range of tasks [33, 17, 16], increasing efforts have been invested to learn a DCNN based mapping function to directly transform the RGB image into an HSI [4, 6, 32, 13]. These methods essentially involve mapping the RGB context within a size-specific receptive field centered at each pixel to its spectrum in the HSI, as shown in Figure 1. The focus thereon is to appropriately determine the receptive field size and establish the mapping function from RGB context to the corresponding spectrum. Due to the difference in category or spatial position, pixels in HSIs often necessitate collecting different RGB information and adopting various recovery schemes for SSR. Therefore, to obtain an accurate DCNN based SSR approach, it is crucial to adaptively determine the receptive field size and the RGB-to-spectrum mapping function for each pixel. However, most existing DCNN based SSR methods treat all pixels in HSIs equally and learn a universal mapping function with a fixed-sized receptive field, as shown in Figure 1.

In this study, we present a pixel-aware deep function-mixture network for SSR, which is able to determine the receptive field size and the mapping function pixel-wise. Specifically, we first develop a new module, termed the function-mixture (FM) block. Each FM block consists of several parallel DCNN based subnets, among which one is termed the mixing function and the remaining ones are termed basis functions. The basis functions take different-sized receptive fields and learn distinct mapping schemes, while the mixing function generates pixel-wise weights to linearly mix the outputs of the basis functions. In this way, the pixel-wise weights can determine a specific information flow for each pixel and consequently help the network choose the appropriate RGB context as well as the mapping function for spectrum recovery. Then, we stack several such FM blocks to further improve the flexibility of the network in learning the pixel-wise mapping. Furthermore, to encourage feature reuse, the intermediate features generated by the FM blocks are fused at a late stage, which proves to be effective for boosting the SSR performance. Experimental evaluation on three benchmark HSI datasets shows the superiority of the proposed approach for SSR.

In summary, we mainly contribute in three aspects. i) We present an effective pixel-aware deep function-mixture network for SSR, which is flexible in learning the pixel-wise RGB-to-spectrum mapping. To the best of our knowledge, this is the first attempt to explore this in SSR. ii) We design a new FM module, which can be flexibly plugged into any modern DCNN architecture. iii) We demonstrate new state-of-the-art performance on three benchmark SSR datasets.

## 2 Related Work

We first review the existing approaches for SSR and then introduce some techniques related to this work.

#### Spectral Super-resolution

Early methods mainly focus on exploiting appropriate image priors to regularize the linear inverse SSR problem. For example, Arad et al. and Aeschbacher et al. [5, 1] investigated the sparsity of the latent HSI on a pre-trained over-complete spectral dictionary. Jia et al. [20] considered the manifold structure of HSIs in a low-dimensional space. Recently, most methods have turned to learning a deep mapping function from the RGB image to an HSI. For example, Alvarez-Gila et al. [4] implemented the mapping function using a U-Net architecture [29] and trained it based on both the mean-square-error (MSE) loss and the adversarial loss [14]. Shi et al. [32] developed a deep residual network consisting of residual blocks to learn the mapping function. Despite obtaining impressive performance for SSR, these methods are limited by learning a universal RGB-to-spectrum mapping function for all pixels in HSIs. This leaves room for learning a more flexible and adaptive mapping function.

#### Receptive Field in DCNNs

The receptive field is an important concept in DCNNs, which determines the sensing space of a convolutional neuron. Many efforts have been dedicated to adjusting the size or shape of the receptive field [39, 36, 10] to meet the requirements of specific tasks at hand. Among them, dilated convolution [39] and kernel separation [31] are often utilized to enlarge the receptive field. Recently, Wei et al. [36] changed the receptive field by inflating or shrinking the feature maps using two affine transformation layers. Dai et al. [10] proposed to adaptively determine the context within the receptive field by estimating the offsets of pixels to the central pixel using an additional convolution layer. In contrast, we take a totally different direction and learn the pixel-wise receptive field size by mixing several basis functions with different receptive field sizes.

#### Multi-column Network

The multi-column network [8] is a special type of network that feeds the input into several parallel DCNNs (i.e., columns), and then aggregates their outputs for the final prediction. With the ability to use more context information, the multi-column network (MCNet) often shows better generalization capacity than a network with only a single column in various tasks, e.g., classification [8], image processing [2] and counting [43]. Although we also adopt a similar multi-column structure in our module design, the proposed network is obviously different from these existing multi-column networks [8, 43, 2]. First, MCNet employs a separation-and-aggregation architecture which processes the input with parallel columns and then aggregates the outputs of all columns for output. In contrast, we adopt a recursive separation-and-aggregation architecture by stacking multiple FM modules, each of which can be viewed as an enhanced multi-column module, as shown in Figures 1 and 3. Second, when applied to SSR, MCNet still learns a universal mapping function and fails to flexibly handle each pixel in an explicit way. In contrast, the proposed FM block incorporates a mixing function to generate pixel-wise weights and mix the outputs of all basis functions. This makes it possible to flexibly customize the pixel-wise mapping function. In addition, we fuse the intermediate features generated by the FM blocks in the network for feature reuse.

## 3 Proposed Network

In this section, we present the technical details of the proposed pixel-aware deep function-mixture network, as shown in Figure 2. The proposed network adopts a global residual architecture as in [22]. Its backbone is constructed by stacking multiple FM blocks and fusing the intermediate features generated by the previous FM blocks with skip connections. In the following, we will first introduce the basic FM block. Then, we will introduce how to incorporate multiple FM blocks and the intermediate feature fusion into the proposed network for performance enhancement.

### 3.1 Function-mixture Block

The proposed network essentially establishes an end-to-end mapping function from an RGB image to the HSI counterpart, and thus each FM block plays the role of a mapping subfunction. In this study, we attempt to utilize the FM block to adaptively determine the receptive field size and the mapping function for each pixel, i.e., to obtain a pixel-dependent mapping subfunction. To this end, a direct solution is to introduce an additional hypernetwork [15, 19] to adaptively generate the subfunction parameters for each pixel. However, this will greatly increase the computational complexity as well as the training difficulty [15]. To avoid this problem, we borrow the idea from function approximation [9] and assume that all pixel-dependent subfunctions can be accurately approximated by mixing some basis functions with pixel-wise weights. Since they are shared by all subfunctions, these basis functions can be learned by DCNNs, while the pixel-wise mixing weights can be viewed as pixel-wise channel attention [30], which can also be directly generated by a DCNN.

Following this idea, we construct the FM block with a separation-and-aggregation structure, as shown in Figure 3. First, a convolutional block, i.e., a convolutional layer followed by a Rectified Linear Unit (ReLU) [28], is utilized for initial feature representation. Then, the obtained features are fed into multiple parallel subnets. Among them, one subnet is utilized to generate the pixel-wise mixing weights; for simplicity, we term it the mixing function. The remaining subnets represent the basis functions. Finally, the outputs of all basis functions are linearly mixed based on the generated pixel-wise weights. Let $\mathbf{x}_k$ denote the input for the $k$-th FM block $\mathcal{F}_k$ and $n$ denote the number of basis functions in $\mathcal{F}_k$. The output $\mathbf{u}_k$ of $\mathcal{F}_k$ can be formulated as

$$
\begin{aligned}
\mathbf{u}_k = \mathcal{F}_k(\mathbf{x}_k) &= \sum_{j=1}^{n} f_k^j(\mathbf{z}_k; \theta_k^j) \odot g_k^j(\mathbf{z}_k; \phi_k), \\
\text{s.t.}\quad & g_k^j(\mathbf{z}_k; \phi_k) \geq 0, \quad \sum_{j=1}^{n} g_k^j(\mathbf{z}_k; \phi_k) = \mathbf{1}, \quad \mathbf{z}_k = \mathrm{ReLU}(\omega_k * \mathbf{x}_k),
\end{aligned}
\tag{1}
$$

where $f_k^j(\cdot\,; \theta_k^j)$ denotes the $j$-th basis function parameterized by $\theta_k^j$ and $g_k(\cdot\,; \phi_k)$ represents the mixing function parameterized by $\phi_k$. When $\mathbf{x}_k$ is of size $c \times h \times w$ (i.e., channel $\times$ height $\times$ width), $g_k^j(\mathbf{z}_k; \phi_k)$ represents the mixing weights generated for all $h \times w$ pixels corresponding to the $j$-th basis function. $\odot$ denotes the point-wise product. $\mathbf{z}_k$ denotes the features output by the convolutional block in $\mathcal{F}_k$, and $\omega_k$ represents the convolutional filters. Inspired by [12], we also require the mixing weights to be non-negative and their summation across all basis functions to be equal to 1, as shown in Eq. (1).

In this study, we implement the basis functions as well as the mixing function by stacking consecutive convolutional blocks, as shown in Figure 3. Moreover, we equip these basis functions with different-sized convolutional filters to ensure they take different-sized receptive fields and learn distinct mapping schemes. For the mixing function, we introduce a Softmax unit at the end to comply with the constraints in Eq. (1). Apparently, profiting from such a pixel-wise mixture architecture, the proposed FM block is able to determine the appropriate receptive field size and the mapping function for each pixel.
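The pixel-wise mixture described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `softmax` and `mix_pixelwise` are hypothetical, the feature maps are reduced to a single channel, and the basis outputs and mixing scores are toy numbers standing in for the outputs of the learned subnets. The sketch only shows that a per-pixel Softmax yields non-negative weights that sum to 1, and that those weights select a different blend of basis outputs at each pixel.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_pixelwise(basis_outputs, mixing_scores):
    """Linearly mix n basis outputs with per-pixel softmax weights.

    basis_outputs: list of n single-channel maps (h x w nested lists).
    mixing_scores: list of n raw score maps with the same h x w shape.
    Returns the mixed h x w map and the per-pixel weight vectors.
    """
    n = len(basis_outputs)
    h, w = len(basis_outputs[0]), len(basis_outputs[0][0])
    mixed = [[0.0] * w for _ in range(h)]
    weights = [[None] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            wts = softmax([mixing_scores[j][r][c] for j in range(n)])
            weights[r][c] = wts
            mixed[r][c] = sum(wts[j] * basis_outputs[j][r][c] for j in range(n))
    return mixed, weights


# Toy example (assumed values): 3 basis outputs on a 1 x 2 "image".
basis = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]]]
scores = [[[0.0, 10.0]], [[0.0, 0.0]], [[0.0, -10.0]]]
mixed, weights = mix_pixelwise(basis, scores)
```

In this toy case the first pixel receives equal weights and therefore the average of the three basis outputs, while the second pixel is dominated by the first basis function.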

### 3.2 Multiple FM Blocks

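As shown in Figure 2, in the proposed network, we first introduce an individual convolutional block, and then stack multiple FM blocks for the intermediate feature representation and the ultimate output. For an input RGB image $\mathbf{x}$, the output of the network with $p$ FM blocks can be given as

$$
\mathbf{y} = \mathbf{x} + \mathcal{F}_p\big(\mathcal{F}_{p-1}(\cdots \mathcal{F}_1(\mathbf{x}_0) \cdots)\big), \quad \text{s.t.}\ \ \mathbf{x}_0 = \mathrm{ReLU}(\omega_0 * \mathbf{x}), \tag{2}
$$

where $\mathbf{y}$ denotes the generated HSI and $\mathbf{x}_0$ represents the output of the first convolutional block parameterized by $\omega_0$. It is worth noting that in this study we increase the spectral resolution of $\mathbf{x}$ to the same as that of $\mathbf{y}$ by bilinear interpolation. In addition, $\mathcal{F}_1, \ldots, \mathcal{F}_{p-1}$ share the same architecture, while the output of $\mathcal{F}_p$ is adjusted according to the number of spectral bands in $\mathbf{y}$.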

It has been shown that the layers in a DCNN from bottom to top take increasingly larger receptive fields and extract different levels of features from the input signal [44]. Therefore, by stacking multiple FM blocks, we can further increase the flexibility of the proposed network in learning the pixel-wise mapping, viz., adjust the receptive field size and the mapping function for each pixel at multiple levels. In addition, considering that each FM block defines the mapping subfunction for each pixel, the ultimate mapping function obtained by stacking $p$ FM blocks can be viewed as a composition function of $p$ subfunctions. Since each subfunction is approximated by the mixture of $n$ basis functions, the ultimate mapping function can be viewed as the mixture of $n^p$ basis functions, which shows much larger expressive capacity than a single FM block in fitting an appropriate mapping function for each pixel.
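The growth in the number of composed basis functions can be checked with a short counting sketch. It treats each block as a hard choice of one basis function per pixel, which is a simplification: the network itself mixes the basis outputs softly. The helper name `count_paths` is hypothetical, and the block and basis counts are passed in as plain integers.

```python
from itertools import product


def count_paths(num_basis, num_blocks):
    """Count the distinct sequences of basis functions through stacked blocks.

    Each path picks one of `num_basis` basis functions in each of
    `num_blocks` stacked FM blocks.
    """
    return sum(1 for _ in product(range(num_basis), repeat=num_blocks))


paths_single = count_paths(3, 1)   # one block with three basis functions
paths_stacked = count_paths(3, 3)  # three stacked blocks
```

With three basis functions, one block offers 3 paths while three stacked blocks offer 27, which is the exponential growth the composition argument relies on.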

### 3.3 Intermediate Features Fusion

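As previously mentioned, the FM blocks in the proposed network extract different levels of features from the input. Inspired by [23, 42], to reuse these intermediate features for performance enhancement, we employ skip connections to aggregate the intermediate features generated by each FM block before the ultimate output block with a concatenation operation, as shown in Figure 2. To better utilize all of these features for pixel-wise representation, we introduce an extra FM block to fuse the concatenation result. With such an intermediate feature fusion operation, the output of the proposed network can be reformulated as

$$
\mathbf{y} = \mathbf{x} + \mathcal{F}_p\big(\mathcal{F}_f([\mathbf{u}_1, \ldots, \mathbf{u}_{p-1}])\big), \quad \text{s.t.}\ \ \mathbf{u}_k = \mathcal{F}_k(\mathbf{u}_{k-1}),\ \ \mathbf{u}_0 = \mathbf{x}_0, \tag{3}
$$

where $\mathbf{u}_k$ denotes the output of the $k$-th FM block $\mathcal{F}_k$, $\mathbf{x}_0$ denotes the output of the first convolutional block, $[\cdot]$ denotes the concatenation operation, $\mathcal{F}_f$ denotes the extra FM block for feature fusion and $\mathcal{F}_p$ denotes the ultimate output block.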
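The data flow of the fused network can be sketched with toy callables. This is only an illustration of the order of operations (intermediate blocks, skip connections, fusion, output block, global residual); the function name `forward` and all the stand-in blocks are hypothetical, features are flat lists of floats, and an element-wise sum stands in for the concatenation followed by the extra FM block.

```python
def forward(x_up, x0, fm_blocks, fusion_block, output_block):
    """Toy data flow of the network with intermediate feature fusion.

    x_up: input already interpolated to the output band count (residual branch).
    x0: features from the first convolutional block.
    fm_blocks: the intermediate FM blocks, applied in sequence.
    fusion_block: stand-in for concatenation plus the extra fusion FM block.
    output_block: stand-in for the ultimate output FM block.
    """
    features = []
    u = x0
    for block in fm_blocks:
        u = block(u)
        features.append(u)  # kept for the skip connections
    fused = fusion_block(features)
    residual = output_block(fused)
    return [a + b for a, b in zip(x_up, residual)]


# Toy stand-ins (assumptions, not learned blocks).
double = lambda v: [2.0 * t for t in v]
add_one = lambda v: [t + 1.0 for t in v]
sum_features = lambda feats: [sum(col) for col in zip(*feats)]
identity = lambda v: list(v)

y = forward([1.0, 1.0], [1.0, 2.0], [double, add_one], sum_features, identity)
```

Here the two intermediate outputs `[2, 4]` and `[3, 5]` are both passed to the fusion stand-in, so the earlier feature is reused rather than discarded.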

Table 1: Numerical results of different methods on three benchmark SSR datasets.

| Methods | NTIRE2018 | | | | CAVE | | | | Harvard | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | RMSE | PSNR | SAM | SSIM | RMSE | PSNR | SAM | SSIM | RMSE | PSNR | SAM | SSIM |
| BI [18] | 15.41 | 25.73 | 15.30 | 0.8397 | 26.60 | 21.49 | 34.38 | 0.7382 | 30.86 | 19.44 | 39.04 | 0.5887 |
| Arad [5] | 4.46 | 35.63 | 5.90 | 0.9082 | 10.09 | 28.96 | 19.54 | 0.8695 | 7.85 | 31.30 | 8.32 | 0.8490 |
| Aitor [4] | 1.97 | 43.30 | 1.80 | 0.9907 | 6.80 | 32.53 | 17.50 | 0.8768 | 3.29 | 39.21 | 4.93 | 0.9671 |
| HSCNN+ [37] | 1.55 | 45.38 | 1.63 | 0.9931 | 4.97 | 35.66 | 8.73 | 0.9529 | 2.87 | 41.05 | 4.28 | 0.9741 |
| DCNN | 1.23 | 47.40 | 1.30 | 0.9939 | 5.77 | 34.09 | 11.35 | 0.9275 | 2.88 | 40.83 | 4.24 | 0.9724 |
| MCNet | 1.11 | 48.43 | 1.13 | 0.9951 | 4.84 | 35.92 | 8.98 | 0.9555 | 2.83 | 40.70 | 4.26 | 0.9689 |
| Ours | 1.03 | 49.29 | 1.05 | 0.9955 | 4.54 | 36.33 | 7.07 | 0.9611 | 2.54 | 41.54 | 3.76 | 0.9796 |

## 4 Experiment

In this section, we will conduct extensive comparison experiments and carry out an ablation study to demonstrate the effectiveness of the proposed method in SSR.

### 4.1 Experimental Setting

#### Datasets

In this study, we adopt three benchmark HSI datasets, including NTIRE2018 [34], CAVE [38] and Harvard [7]. The NTIRE2018 dataset is the benchmark for the SSR challenge in NTIRE2018. It contains 255 paired HSIs and RGB images which have the same spatial resolution, e.g., 1392 × 1300. Each HSI consists of 31 successive spectral bands ranging from 400 nm to 700 nm with a 10 nm interval. The CAVE dataset contains 32 HSIs of indoor objects. Similar to NTIRE2018, each HSI contains 31 spectral bands ranging from 400 nm to 700 nm with a 10 nm interval, but with a spatial resolution of 512 × 512. The Harvard dataset is another common benchmark for HSIs. It consists of 50 HSIs with a spatial resolution of 1392 × 1040. Each image contains 31 spectral bands captured from 420 nm to 720 nm with a 10 nm interval. For the CAVE and Harvard datasets, inspired by [11, 40], we adopt the spectral response function of the Nikon D700 camera [11] to generate the corresponding RGB image for each HSI. In the following experiments, we randomly select 200 paired images from the NTIRE2018 dataset as the training set and use the remaining 55 paired images for testing. For the CAVE dataset, we randomly choose 22 paired images for training and use the remaining 10 paired images for testing. For the Harvard dataset, 30 paired images are randomly chosen as the training set and the remaining 20 paired images are utilized for testing.
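Generating an RGB image from an HSI with a spectral response function amounts to a weighted sum over bands at every pixel. The sketch below shows this for one pixel. The helper names `band_centers` and `spectrum_to_rgb` are hypothetical, and the box-shaped response curves are made-up placeholders, not the measured Nikon D700 curves used in the paper.

```python
def band_centers(start_nm, stop_nm, step_nm):
    """List the band centre wavelengths, inclusive of both ends."""
    return list(range(start_nm, stop_nm + 1, step_nm))


def spectrum_to_rgb(spectrum, response):
    """Project one pixel spectrum onto three response curves.

    spectrum: per-band intensities of one pixel.
    response: dict mapping "R", "G", "B" to per-band sensitivities.
    """
    return tuple(
        sum(s * r for s, r in zip(spectrum, response[channel]))
        for channel in ("R", "G", "B")
    )


bands = band_centers(400, 700, 10)

# Placeholder box responses (assumption): each curve is normalised to sum to 1.
response = {
    "B": [1.0 / 10 if wl < 500 else 0.0 for wl in bands],
    "G": [1.0 / 10 if 500 <= wl < 600 else 0.0 for wl in bands],
    "R": [1.0 / 11 if wl >= 600 else 0.0 for wl in bands],
}

flat_spectrum = [0.5] * len(bands)
rgb = spectrum_to_rgb(flat_spectrum, response)
```

A 400 to 700 range sampled every 10 units gives 31 bands, and with normalised curves a flat spectrum maps to equal R, G and B values.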

#### Comparison Methods

In this study, we compare the proposed method with 6 existing methods, including bilinear interpolation (BI) [18], Arad [5], Aitor [4], HSCNN+ [37], a deep convolutional neural network (DCNN) and the multi-column network (MCNet). Among them, BI utilizes bilinear interpolation to increase the spectral resolution of the input RGB image. Arad is a sparsity induced conventional SSR method. Aitor and HSCNN+ are two recent DCNN based state-of-the-art SSR methods. DCNN and MCNet are two baselines for the proposed method. DCNN is a variant of the proposed method that is implemented by replacing each FM block in the proposed method with a convolutional block. For MCNet, we implement it following the basic architecture in [8, 43] with convolutional blocks. Moreover, the column number is set as $n$ and the convolutions in the columns are equipped with $n$ kinds of different-sized filters, which is similar to the proposed method. We further control the depth of each column to make sure the model complexity of MCNet is comparable to that of the proposed method. By doing this, the only difference between MCNet and the proposed network is the network architecture. For fair comparison, all these DCNN based competitors and the spectral dictionary in Arad [5] are retrained on the training set utilized in the experiments.

#### Evaluation Metrics

To objectively evaluate the SSR performance of each method, we employ four commonly utilized metrics, including the root-mean-square error (RMSE), peak signal-to-noise ratio (PSNR), spectral angle mapper (SAM) and structural similarity index (SSIM). The RMSE and PSNR measure the numerical difference between the reconstructed image and the reference image. The SAM computes the average spectral angle between two spectra from the reconstructed image and the reference image at the same spatial position to indicate the reconstruction accuracy of the spectrum. The SSIM is often utilized to measure the spatial structure similarity between two images. In general, a larger PSNR or SSIM and a smaller RMSE or SAM indicate better performance.
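For reference, the three simpler metrics can be written out in a few lines of plain Python. This is a sketch using the textbook definitions, not the evaluation code used in the experiments: the peak value of 255 for PSNR and the use of degrees for SAM are assumptions, since the text does not state the intensity range or the angle unit. SSIM is omitted because it needs windowed statistics.

```python
import math


def rmse(recon, ref):
    """Root-mean-square error between two equally long value lists."""
    squared = [(a - b) ** 2 for a, b in zip(recon, ref)]
    return math.sqrt(sum(squared) / len(squared))


def psnr(recon, ref, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` is an assumed intensity range."""
    err = rmse(recon, ref)
    return float("inf") if err == 0 else 20.0 * math.log10(peak / err)


def spectral_angle_deg(spec_a, spec_b):
    """Angle in degrees between two spectra viewed as vectors."""
    dot = sum(a * b for a, b in zip(spec_a, spec_b))
    norm_a = math.sqrt(sum(a * a for a in spec_a))
    norm_b = math.sqrt(sum(b * b for b in spec_b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard rounding
    return math.degrees(math.acos(cosine))


def sam(recon_pixels, ref_pixels):
    """Average spectral angle over pixels at the same spatial positions."""
    angles = [spectral_angle_deg(a, b) for a, b in zip(recon_pixels, ref_pixels)]
    return sum(angles) / len(angles)
```

Note that SAM is insensitive to a per-pixel brightness scaling: a spectrum and its scaled copy have a zero angle, whereas RMSE and PSNR penalise the scale difference.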

#### Implementation Details

In the proposed method, we adopt 4 FM blocks (i.e., including $\mathcal{F}_f$ for feature fusion and $p=3$), and each block contains $n=3$ basis functions. The basis functions and the mixing functions consist of $m=2$ convolutional blocks. Each convolutional block contains 64 filters. In each FM block, the three basis functions are equipped with three different-sized filters for convolution, i.e., $3 \times 3$, $7 \times 7$ and $11 \times 11$, while the filter size in all other convolutional blocks is fixed as $3 \times 3$.
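To make the effect of the three filter sizes concrete, the receptive field of a chain of convolutions can be computed with the standard rule for stride-1, undilated layers: each layer with kernel size k adds k - 1 to the field. The sketch assumes that rule applies here (stride, dilation and padding are not stated in the text), assumes two convolutional blocks per basis function preceded by one 3 × 3 entry block, and uses the hypothetical helper names `receptive_field` and `fm_block_path`.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)


def fm_block_path(basis_kernel, depth=2, entry_kernel=3):
    """Kernel sizes seen along one basis-function path of a single block.

    entry_kernel: the block's initial convolutional block (assumed 3 x 3).
    depth: number of convolutional blocks in the basis function (assumed 2).
    """
    return [entry_kernel] + [basis_kernel] * depth


per_block = {k: receptive_field(fm_block_path(k)) for k in (3, 7, 11)}
```

Under these assumptions a single block already spans receptive fields of 7, 15 and 23 pixels depending on which basis function dominates the mixture at a pixel.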

In this study, we implement the proposed method on the PyTorch platform [21] and train the network using the following model

$$
\min_{\theta} \sum_{i=1}^{N} \big\| \mathbf{y}_i - \mathcal{F}(\mathbf{x}_i; \theta) \big\|, \tag{4}
$$

where $(\mathbf{y}_i, \mathbf{x}_i)$ denotes the $i$-th paired HSI and RGB image, respectively. $N$ denotes the number of training pairs. $\mathcal{F}$ denotes the ultimate mapping function defined by the proposed network and $\theta$ represents all involved parameters. $\|\cdot\|$ represents the norm based loss. In the training stage, we employ the Adam optimizer [24] with a weight decay of 1e-6. The learning rate is initially set to 1e-4 and halved every 20 epochs. The batch size is 128. We terminate the optimization at the -th epoch.
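The learning-rate schedule stated above (initial value 1e-4, halved every 20 epochs) can be written as a one-line function. The sketch assumes zero-based epoch indices and that the halving takes effect at epochs 20, 40 and so on; the name `learning_rate` is hypothetical. A step scheduler such as PyTorch's `StepLR` with `step_size=20` and `gamma=0.5` would give a comparable schedule, though the text does not say which mechanism was used.

```python
def learning_rate(epoch, base_lr=1e-4, halve_every=20):
    """Step schedule: halve the learning rate every `halve_every` epochs.

    epoch: zero-based epoch index (an assumption about the counting).
    """
    return base_lr * 0.5 ** (epoch // halve_every)


schedule = [learning_rate(e) for e in (0, 19, 20, 45)]
```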

### 4.2 Performance Evaluation

#### Performance comparison

Under the same experimental settings, we evaluate all those methods on the testing set from each benchmark dataset. Their numerical results are reported in Table 1. It can be seen that the DCNN based comparison methods often produce more accurate results than the interpolation or the sparsity induced SSR method. For example, on the NTIRE2018 dataset, the RMSEs of Aitor and HSCNN+ are less than 2.0 while those of BI and Arad are higher than 4.0. Nevertheless, the proposed method obviously outperforms these DCNN based competitors. For example, compared with the state-of-the-art HSCNN+, the proposed method reduces the RMSE by 0.43 and improves the PSNR by 0.67 dB on the CAVE dataset. On the NTIRE2018 dataset, the decrease in RMSE is up to 0.52 and the improvement in PSNR is up to 3.91 dB. This benefits from the ability of the proposed method to adaptively determine the receptive field size and the mapping function for each pixel. With such an ability, the proposed method is able to handle each pixel more flexibly. Moreover, since various mapping functions can be approximated by the mixture of the learned basis functions, the proposed method can better generalize to unknown pixels.

In addition, as shown in Table 1, the proposed method also performs better than the two baselines, i.e., DCNN and MCNet. For example, on the NTIRE2018 dataset, the PSNR obtained by the proposed method is higher than that of DCNN by 1.89 dB and higher than that of MCNet by 0.86 dB. Since the only difference between the proposed method and DCNN is the discrepancy between the convolutional block and the proposed FM block, the superiority of the proposed method demonstrates that the proposed FM block is much more powerful than the convolutional block for SSR. Similarly, the advantage of the proposed method over MCNet indicates that the proposed network architecture is more effective than the multi-column architecture in SSR.

To further clarify the above conclusions, we plot some visual super-resolution results of different methods on three datasets in Figure 4, Figure 5 and Figure 6. As can be seen, the super-resolution results of the proposed method have more details and show less reconstruction error than other competitors. In addition, we also sketch the recovered spectrum curves of the proposed method in Figure 7. It can be seen that the spectra produced by the proposed method are very close to the ground truth.

#### Pixel-wise mixing weights

In this study, we mix the outputs of the basis functions with pixel-wise weights to adaptively learn the pixel-wise mapping. To validate that the proposed method can effectively produce the pixel-wise weights as expected, we choose an example image from the NTIRE2018 dataset and visualize the produced pixel-wise weights in each FM block, as shown in Figure 8. We find that: i) Pixels from different categories or spatial positions are often given different weights. For example, in the second weight map generated by an FM block, the weights for the pixels from 'road' are obviously smaller than those for the pixels from 'tree'. ii) Pixels from the same category are prone to be given similar weights. For example, pixels from 'road' are given similar weights in each weight map in Figure 8 (a)(b). To further clarify these two observations, we visualize the weight maps of some other images generated by the FM block in Figure 9, where a similar phenomenon can be observed. iii) In the intermediate FM blocks (i.e., $\mathcal{F}_1$ and $\mathcal{F}_2$ in Figure 8), the high-level block (e.g., $\mathcal{F}_2$) can distinguish finer differences between pixels than the low-level block (e.g., $\mathcal{F}_1$), viz., only highly similar pixels will be assigned similar weights. iv) Due to being forced to match the output, in the weight maps generated by the ultimate output block $\mathcal{F}_3$, the weight difference between pixels from various categories is not as obvious as that in the previous FM blocks (e.g., $\mathcal{F}_1$ and $\mathcal{F}_2$), as shown in Figure 8(a)(b)(d).

According to the above observations, we can conclude that the proposed network can effectively generate the pixel-wise mixing weights and thus is able to pixel-wisely determine receptive field size and mapping function.

### 4.3 Ablation study

In this part, we carry out an ablation study on the NTIRE2018 dataset to demonstrate the effect of the different ingredients, the number of basis functions and the number of FM blocks on the proposed network.

Table 2: Effect of the different ingredients on the NTIRE2018 dataset.

| Methods | RMSE | PSNR | SAM | SSIM |
|---|---|---|---|---|
| Ours w/o mix | 1.10 | 48.44 | 1.16 | 0.9950 |
| Ours w/o fusion | 1.05 | 48.97 | 1.09 | 0.9953 |
| Ours | 1.03 | 49.29 | 1.05 | 0.9955 |

Table 3: Effect of the number of basis functions $n$ on the NTIRE2018 dataset.

| Methods | RMSE | PSNR | SAM | SSIM |
|---|---|---|---|---|
| Ours ($n=1$) | 1.47 | 45.82 | 1.57 | 0.9913 |
| Ours ($n=2$) | 1.08 | 48.76 | 1.10 | 0.9952 |
| Ours ($n=3$) | 1.03 | 49.29 | 1.05 | 0.9955 |
| Ours ($n=5$) | 0.98 | 49.87 | 1.00 | 0.9958 |

Table 4: Effect of the number of FM blocks $p$ on the NTIRE2018 dataset.

| Methods | RMSE | PSNR | SAM | SSIM |
|---|---|---|---|---|
| Ours ($p=2$) | 1.05 | 48.95 | 1.09 | 0.9954 |
| Ours ($p=3$) | 1.03 | 49.29 | 1.05 | 0.9955 |
| Ours ($p=4$) | 1.05 | 49.42 | 1.05 | 0.9954 |
| Ours ($p=6$) | 1.00 | 49.59 | 1.02 | 0.9956 |

#### Effect of Different Ingredients

In the proposed FM network, there are two important ingredients, namely the pixel-wise mixture and the intermediate feature fusion. To demonstrate the effect of these two ingredients, we compare the proposed method with two of its variants. One (i.e., 'Ours w/o mix') disables the pixel-wise mixture in the proposed network, which implies mixing the outputs of the basis functions with equal weights, while the other (i.e., 'Ours w/o fusion') disables the intermediate feature fusion, i.e., removes the skip connections as well as the FM block $\mathcal{F}_f$. We plot the training loss curves and the testing PSNR curves of these three methods in Figure 10. It can be seen that the proposed method obtains the smallest training loss and the highest testing PSNR. More numerical results are reported in Table 2. It can be seen that the proposed method still obviously outperforms these two variants. This demonstrates that both the pixel-wise mixture and the intermediate feature fusion are crucial for the proposed network.

#### Effect of the Number of Basis Functions

In the above experiments, we fix the number of basis functions as $n=3$ in each FM block. Intuitively, increasing $n$ will enlarge the expressive capacity of the basis functions and thus lead to better performance, and vice versa. To validate this, we evaluate the proposed method on the NTIRE2018 dataset using different $n$, i.e., 1, 2, 3 and 5. The obtained numerical results are provided in Table 3. As can be seen, the reconstruction accuracy gradually increases as the number of basis functions increases. When $n=1$, the proposed method degenerates to the convolutional block based network, which shows the lowest reconstruction accuracy in Table 3. When $n$ increases to 5, the obtained RMSE is even lower than 1.0 and the PSNR is close to 50 dB. However, there is no free lunch in our case either, and a larger $n$ often results in higher computational complexity. Therefore, the accuracy and the efficiency can be balanced by tuning $n$. This makes it possible to customize the proposed network for a specific device.
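The cost side of this trade-off can be illustrated by counting the parameters of the convolutional blocks that make up one basis function. The sketch below uses the standard count for a 2-D convolution with bias and assumes 64 input and 64 output channels and two convolutional blocks per basis function; the helper names are hypothetical, and only the three filter sizes named in the implementation details are evaluated.

```python
def conv_params(kernel, in_channels=64, out_channels=64, bias=True):
    """Parameter count of one 2-D convolution with a square kernel."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def basis_function_params(kernel, depth=2):
    """Parameters of one basis function built from `depth` equal conv blocks."""
    return depth * conv_params(kernel)


costs = {k: basis_function_params(k) for k in (3, 7, 11)}
```

Under these assumptions a basis function with 11 × 11 filters holds roughly 13 times as many parameters as one with 3 × 3 filters, so each added basis function with a larger filter is noticeably more expensive than the previous one.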

#### Effect of the Number of FM Blocks

In addition to the number of basis functions, the model complexity of the proposed method also depends on the number of FM blocks. To demonstrate the effect of $p$ on the proposed method, we evaluate the proposed method on the NTIRE2018 dataset using different numbers of FM blocks, i.e., $p = 2, 3, 4$ and 6. The obtained numerical results are reported in Table 4. Similar to the case of $n$, the performance of the proposed method can be gradually improved as the number of FM blocks increases. We also find it interesting that increasing $n$ may be more effective than increasing $p$ in terms of boosting the performance of the proposed method.

## 5 Conclusion

In this study, to flexibly handle the pixels from different categories or spatial positions in HSIs and consequently improve the performance, we present a pixel-aware deep function-mixture network for SSR, which is composed of multiple FM blocks. Each FM block consists of one mixing function and several basis functions, which are implemented as parallel DCNN based subnets. Among them, the basis functions take different-sized receptive fields and learn distinct mapping schemes, while the mixing function generates the pixel-wise weights to linearly mix the outputs of all these basis functions. This makes it possible to determine the receptive field size and the mapping function pixel-wise. Moreover, we stack several such FM blocks in the network to further increase its flexibility in learning the pixel-wise mapping. To boost the SSR performance, we also fuse the intermediate features generated by the FM blocks for feature reuse. With extensive experiments on three benchmark SSR datasets, the proposed method shows superior performance over several existing state-of-the-art competitors.

It is worth noting that this study employs the linear mixture to approximate the pixel-wise mapping function. In the future, it would be interesting to exploit a non-linear mixture. In addition, it is promising to generalize the idea in this study to other tasks requiring pixel-wise modelling, e.g., semantic segmentation and colorization.

## References

- [1] J. Aeschbacher, J. Wu, and R. Timofte. In defense of shallow learned spectral reconstruction from rgb images. In Proceedings of the IEEE International Conference on Computer Vision, pages 471–479, 2017.
- [2] F. Agostinelli, M. R. Anderson, and H. Lee. Adaptive multi-column deep neural networks with application to robust image denoising. In Advances in Neural Information Processing Systems, pages 1493–1501, 2013.
- [3] N. Akhtar and A. Mian. Nonparametric coupled bayesian dictionary and classifier learning for hyperspectral classification. IEEE transactions on neural networks and learning systems, 29(9):4038–4050, 2018.
- [4] A. Alvarez-Gila, J. Van De Weijer, and E. Garrote. Adversarial networks for spatial context-aware spectral image reconstruction from rgb. In Proceedings of the IEEE International Conference on Computer Vision, pages 480–490, 2017.
- [5] B. Arad and O. Ben-Shahar. Sparse recovery of hyperspectral signal from natural rgb images. In European Conference on Computer Vision, pages 19–34. Springer, 2016.
- [6] B. Arad and O. Ben-Shahar. Filter selection for hyperspectral estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3153–3161, 2017.
- [7] A. Chakrabarti and T. Zickler. Statistics of real-world hyperspectral images. In CVPR 2011, pages 193–200. IEEE, 2011.
- [8] D. Cireşan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745, 2012.
- [9] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- [10] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
- [11] W. Dong, F. Fu, G. Shi, X. Cao, J. Wu, G. Li, and X. Li. Hyperspectral image super-resolution via non-negative structured sparse representation. IEEE Transactions on Image Processing, 25(5):2337–2352, 2016.
- [12] B. S. Everitt. Finite mixture distributions. Encyclopedia of Statistics in Behavioral Science, 2005.
- [13] Y. Fu, T. Zhang, Y. Zheng, D. Zhang, and H. Huang. Joint camera spectral sensitivity selection and hyperspectral image recovery. In Proceedings of the European Conference on Computer Vision (ECCV), pages 788–804, 2018.
- [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- [15] D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
- [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
- [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [18] H. Hou and H. Andrews. Cubic splines for image interpolation and digital filtering. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(6):508–517, 1978.
- [19] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems, pages 667–675, 2016.
- [20] Y. Jia, Y. Zheng, L. Gu, A. Subpa-Asa, A. Lam, Y. Sato, and I. Sato. From RGB to spectrum for natural scenes via manifold-based mapping. In Proceedings of the IEEE International Conference on Computer Vision, pages 4705–4713, 2017.
- [21] N. Ketkar. Introduction to PyTorch. In Deep Learning with Python, pages 195–208. Springer, 2017.
- [22] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
- [23] J. Kim, J. Kwon Lee, and K. Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1645, 2016.
- [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [25] C. Lanaras, E. Baltsavias, and K. Schindler. Hyperspectral super-resolution by coupled spectral unmixing. In Proceedings of the IEEE International Conference on Computer Vision, pages 3586–3594, 2015.
- [26] D. Manolakis and G. Shaw. Detection algorithms for hyperspectral imaging applications. IEEE Signal Processing Magazine, 19(1):29–43, 2002.
- [27] S. Mei, X. Yuan, J. Ji, Y. Zhang, S. Wan, and Q. Du. Hyperspectral image spatial super-resolution via 3D full convolutional neural network. Remote Sensing, 9(11):1139, 2017.
- [28] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
- [29] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- [30] K. Sato and R. D. Lauro. Deep networks with internal selective attention through feedback connections. In International Conference on Neural Information Processing Systems, 2014.
- [31] G. Seif and D. Androutsos. Large receptive field networks for high-scale image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 763–772, 2018.
- [32] Z. Shi, C. Chen, Z. Xiong, D. Liu, and F. Wu. HSCNN+: Advanced CNN-based hyperspectral recovery from RGB images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 939–947, 2018.
- [33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [34] R. Timofte, S. Gu, J. Wu, and L. Van Gool. NTIRE 2018 challenge on single image super-resolution: methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 852–863, 2018.
- [35] H. Van Nguyen, A. Banerjee, and R. Chellappa. Tracking via object reflectance using a hyperspectral video camera. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pages 44–51. IEEE, 2010.
- [36] Z. Wei, Y. Sun, J. Wang, H. Lai, and S. Liu. Learning adaptive receptive fields for deep image parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2434–2442, 2017.
- [37] Z. Xiong, Z. Shi, H. Li, L. Wang, D. Liu, and F. Wu. HSCNN: CNN-based hyperspectral image recovery from spectrally undersampled projections. In Proceedings of the IEEE International Conference on Computer Vision, pages 518–525, 2017.
- [38] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar. Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE Transactions on Image Processing, 19(9):2241–2253, 2010.
- [39] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
- [40] L. Zhang, W. Wei, C. Bai, Y. Gao, and Y. Zhang. Exploiting clustering manifold structure for hyperspectral imagery super-resolution. IEEE Transactions on Image Processing, 27(12):5969–5982, 2018.
- [41] L. Zhang, W. Wei, Y. Zhang, C. Shen, A. van den Hengel, and Q. Shi. Cluster sparsity field: An internal hyperspectral imagery prior for reconstruction. International Journal of Computer Vision, 126(8):797–821, 2018.
- [42] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.
- [43] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 589–597, 2016.
- [44] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.