Multispectral and Hyperspectral Image Fusion by MS/HS Fusion Net
Abstract
Hyperspectral imaging can help better understand the characteristics of different materials, compared with traditional imaging systems. In practice, however, only high-resolution multispectral (HrMS) and low-resolution hyperspectral (LrHS) images can generally be captured at video rate. In this paper, we propose a model-based deep learning approach for merging an HrMS and an LrHS image to generate a high-resolution hyperspectral (HrHS) image. Specifically, we construct a novel MS/HS fusion model which takes into consideration the observation models of the low-resolution images and the low-rankness knowledge along the spectral mode of the HrHS image. We then design an iterative algorithm to solve the model by exploiting the proximal gradient method. By unfolding the designed algorithm, we construct a deep network, called MS/HS Fusion Net, which learns the proximal operators and model parameters by convolutional neural networks. Experimental results on simulated and real data substantiate the superiority of our method both visually and quantitatively as compared with state-of-the-art methods along this line of research.
1 Introduction
A hyperspectral (HS) image consists of various bands of images of a real scene captured by sensors under different spectra, which can facilitate a finer delivery of more faithful knowledge of real scenes, as compared to traditional images with only one or a few bands. The rich spectra of HS images tend to significantly benefit the characterization of the imaged scene and greatly enhance performance in different computer vision tasks, including object recognition, classification, tracking and segmentation [10, 37, 35, 36].
In real cases, however, due to the limited amount of incident energy, there are critical trade-offs between spatial and spectral resolution. Specifically, an optical system usually can only provide data either with high spatial resolution but a small number of spectral bands (e.g., the standard RGB image) or with a large number of spectral bands but reduced spatial resolution [23]. Therefore, the research issue of merging a high-resolution multispectral (HrMS) image and a low-resolution hyperspectral (LrHS) image to generate a high-resolution hyperspectral (HrHS) image, known as MS/HS fusion, has attracted great attention [47].
The observation models for the HrMS and LrHS images are often written as follows [12, 24, 25]:

$\mathbf{Y} = \mathbf{X}\mathbf{R} + \mathbf{N}_y, \quad (1)$

$\mathbf{Z} = \mathbf{C}\mathbf{X} + \mathbf{N}_z, \quad (2)$

where $\mathbf{X} \in \mathbb{R}^{HW \times S}$ is the target HrHS image (the target HS image can also be written as a tensor $\mathcal{X} \in \mathbb{R}^{H \times W \times S}$; we denote the folding operator from matrix to tensor as $\mathrm{fold}(\mathbf{X}) = \mathcal{X}$), with $H$, $W$ and $S$ as its height, width and band number, respectively; $\mathbf{Y} \in \mathbb{R}^{HW \times s}$ is the HrMS image with $s$ as its band number ($s < S$); $\mathbf{Z} \in \mathbb{R}^{hw \times S}$ is the LrHS image with $h$, $w$ and $S$ as its height, width and band number ($h < H$, $w < W$); $\mathbf{R} \in \mathbb{R}^{S \times s}$ is the spectral response of the multispectral sensor as shown in Fig. 1 (a); $\mathbf{C} \in \mathbb{R}^{hw \times HW}$ is a linear operator which is often assumed to be the composition of a cyclic convolution operator and a down-sampling matrix as shown in Fig. 1 (b); $\mathbf{N}_y$ and $\mathbf{N}_z$ are the noises contained in the HrMS and LrHS images, respectively. Many methods have been designed based on (1) and (2), and have achieved good performance [40, 14, 24, 25].
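As a concrete illustration, the two observation models can be simulated in a few lines of NumPy. The sizes, the random spectral response, and the block-averaging realization of $\mathbf{C}$ below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, S = 16, 16, 31   # HrHS spatial size and band number
s, ratio = 3, 4        # HrMS band number and spatial down-sampling factor
h, w = H // ratio, W // ratio

X = rng.random((H * W, S))   # target HrHS image, matrix form (HW x S)
R = rng.random((S, s))       # spectral response of the MS sensor (S x s)

# Model (1): HrMS image Y = X R + N_y
Y = X @ R + 0.01 * rng.standard_normal((H * W, s))

# Model (2): Z = C X + N_z, where C = down-sampling composed with blur.
# Here C is realized by averaging each ratio x ratio block, a simple
# special case of blur-then-downsample.
Xt = X.reshape(H, W, S)
Zt = Xt.reshape(h, ratio, w, ratio, S).mean(axis=(1, 3))
Z = Zt.reshape(h * w, S) + 0.01 * rng.standard_normal((h * w, S))
```

Running the sketch yields an HrMS matrix `Y` of shape `(HW, s)` and an LrHS matrix `Z` of shape `(hw, S)`, matching the dimensions stated above.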
Since directly recovering the HrHS image $\mathbf{X}$ is an ill-posed inverse problem, many techniques have been exploited to recover it by assuming certain priors on it. For example, [54, 2, 11] utilize the prior knowledge that the spatial information of the HrHS image can be sparsely represented under a dictionary trained from the HrMS image. Besides, [27] assumes a local spatial smoothness prior on the HrHS image and uses total variation regularization to encode it in the optimization model. Instead of exploring spatial priors, [52] and [26] assume a more intrinsic spectral correlation prior on the HrHS image, and use low-rank techniques to encode such a prior along the spectrum to reduce spectral distortions. Albeit effective for some applications, the rationality of these techniques relies on the subjective prior assumptions imposed on the unknown HrHS image to be recovered. An HrHS image collected from a real scene, however, could possess highly diverse configurations both along space and across the spectrum. Such conventional learning regimes thus cannot always flexibly adapt to different HS image structures and still have room for performance improvement.
Methods based on deep learning (DL) have outperformed traditional approaches in many computer vision tasks [34] in the past decade, and have been introduced to the MS/HS fusion problem very recently [28, 30]. As compared with conventional methods, these DL-based ones are superior in that they need fewer assumptions on the prior knowledge of the to-be-recovered HrHS image, and can be directly trained on a set of paired training data simulating the network inputs (LrHS and HrMS images) and outputs (HrHS images). The most commonly employed network structures include CNNs [7], 3D CNNs [28] and residual nets [30]. As in other image restoration tasks where DL has been successfully applied, these DL-based methods have also achieved good performance on the MS/HS fusion task.
However, the current DL-based MS/HS fusion methods still have evident drawbacks. The most critical one is that these methods use general frameworks designed for other tasks, which are not specifically tailored to MS/HS fusion. This makes them lack interpretability specific to the problem. In particular, they totally neglect the observation models (1) and (2) [28, 30], especially the operators $\mathbf{C}$ and $\mathbf{R}$, which facilitate an understanding of how the LrHS and HrMS images are generated from the HrHS image. Such understanding should be useful for recovering HrHS images. Besides this generalization issue, current DL methods also neglect the general prior structures of HS images, such as spectral low-rankness. Such priors are intrinsically possessed by all meaningful HS images, and neglecting them implies that DL-based methods still have room for further enhancement.
In this paper, we propose a novel deep-learning-based method that integrates the observation models and image prior learning into a single network architecture. This work mainly contains the following three-fold contributions:
Firstly, we propose a novel MS/HS fusion model, which not only takes the observation models (1) and (2) into consideration but also exploits the approximate low-rankness prior structure along the spectral mode of the HrHS image to reduce spectral distortions [52, 26]. Specifically, we prove that the observation model (1) can be satisfied if and only if the matrix $\mathbf{X}$ of the HrHS image can be linearly represented by the columns of the HrMS matrix $\mathbf{Y}$ and a to-be-estimated matrix $\bar{\mathbf{Y}}$, i.e., $\mathbf{X} = \mathbf{Y}\mathbf{A} + \bar{\mathbf{Y}}\mathbf{B}$ with coefficient matrices $\mathbf{A}$ and $\mathbf{B}$. See Fig. 1 (d) for easy understanding. We then construct a concise model by combining the observation model (2) with this linear representation of $\mathbf{X}$. We also exploit the proximal gradient method [3] to design an iterative algorithm to solve the proposed model.
Secondly, we unfold this iterative algorithm into a deep network architecture, called MS/HS Fusion Net or MHF-net, to implicitly learn the to-be-estimated $\bar{\mathbf{Y}}$, as shown in Fig. 1 (c). After obtaining $\bar{\mathbf{Y}}$, we can then easily achieve $\mathbf{X}$ with $\mathbf{Y}$, $\mathbf{A}$ and $\mathbf{B}$. To the best of our knowledge, this is the first deep-learning-based MS/HS fusion method that fully considers the intrinsic mechanism of the MS/HS fusion problem. Moreover, all the parameters involved in the model can be automatically learned from training data in an end-to-end manner. This means that the spatial and spectral responses ($\mathbf{C}$ and $\mathbf{R}$) no longer need to be estimated beforehand as most traditional non-DL methods do, nor to be fully neglected as current DL methods do.
Thirdly, we have collected or implemented current state-of-the-art algorithms for the investigated MS/HS fusion task, and compared their performance on a series of synthetic and real problems. The experimental results comprehensively substantiate the superiority of the proposed method, both quantitatively and visually.
In this paper, we denote scalars, vectors, matrices and tensors in non-bold, bold lower-case, bold upper-case and calligraphic upper-case letters, respectively.
2 Related work
2.1 Traditional methods
The pan-sharpening technique in remote sensing is closely related to the investigated MS/HS fusion problem. This task aims to obtain a high-spatial-resolution MS image by fusing an MS image with a wide-band panchromatic image. A heuristic approach to MS/HS fusion is to treat it as a number of pan-sharpening sub-problems, where each band of the HrMS image plays the role of a panchromatic image. There are mainly two categories of pan-sharpening methods: component substitution (CS) [5, 17, 1] and multiresolution analysis (MRA) [20, 21, 4, 33, 6]. These methods often suffer from high spectral distortion, since a single panchromatic image contains little spectral information as compared with the expected HS image.
In the last few years, machine-learning-based methods have gained much attention on the MS/HS fusion problem [54, 2, 11, 14, 52, 48, 26, 40]. Some of these methods use the sparse coding technique to learn a dictionary on patches across an HrMS image, which delivers spatial knowledge of the HrHS image to a certain extent, and then learn a coefficient matrix from the LrHS image to represent the HrHS image [54, 2, 11, 40]. Other methods, such as [14], use sparse matrix factorization to learn a spectral dictionary from LrHS images and then construct HrHS images by exploiting both the spectral dictionary and the HrMS images. The low-rankness of HS images can also be exploited with non-negative matrix factorization, which helps reduce spectral distortions and enhances MS/HS fusion performance [52, 48, 26]. The main drawback of these methods is that they are mainly designed based on human observations and strong prior assumptions, which may not be very accurate and do not always hold for diverse real-world images.
2.2 Deep learning based methods
Recently, a number of DL-based pan-sharpening methods have been proposed by exploiting different network structures [15, 22, 42, 43, 29, 30, 32]. These methods can be easily adapted to the MS/HS fusion problem. For example, very recently, [28] proposed a 3D-CNN-based MS/HS fusion method, using PCA to reduce the computational cost. Such a method is usually trained on prepared training data: the network inputs are set as the combination of HrMS/panchromatic images and LrHS/multispectral images (the latter usually interpolated to the same spatial size as the former in advance), and the outputs are the corresponding HrHS images. Current DL-based methods have been verified to attain good performance. They, however, just employ networks assembled from off-the-shelf components in current deep learning toolkits, which are not specifically designed for the investigated problem. Thus the main drawback of this technique is the lack of interpretability for this particular MS/HS fusion task. Specifically, both the intrinsic observation models (1) and (2) and the evident prior structures possessed by HS images, like the spectral correlation property, have been neglected by such "black-box" deep models.
3 MS/HS fusion model
In this section, we present the proposed MS/HS fusion model in detail.
3.1 Model formulation
We first introduce an equivalent formulation of observation model (1). Specifically, we have the following theorem (all proofs are presented in the supplementary material).
Theorem 1.

For any $\mathbf{Y} \in \mathbb{R}^{HW \times s}$ and $\mathbf{R} \in \mathbb{R}^{S \times s}$, if $\mathrm{rank}(\mathbf{Y}) = s$ and $\mathrm{rank}(\mathbf{R}) = s$, then the following two statements are equivalent to each other:

(a) There exists an $\mathbf{X} \in \mathbb{R}^{HW \times S}$, subject to

$\mathbf{Y} = \mathbf{X}\mathbf{R}. \quad (3)$

(b) There exist $\mathbf{A} \in \mathbb{R}^{s \times S}$, $\mathbf{B} \in \mathbb{R}^{(S-s) \times S}$ and $\bar{\mathbf{Y}} \in \mathbb{R}^{HW \times (S-s)}$, subject to

$\mathbf{X} = \mathbf{Y}\mathbf{A} + \bar{\mathbf{Y}}\mathbf{B}. \quad (4)$
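The equivalence stated in Theorem 1 can be checked numerically: given a full-rank $\mathbf{Y} = \mathbf{X}\mathbf{R}$, one can always construct $\mathbf{A}$, $\mathbf{B}$ and a basis matrix $\bar{\mathbf{Y}}$ such that $\mathbf{X} = \mathbf{Y}\mathbf{A} + \bar{\mathbf{Y}}\mathbf{B}$. The construction below (least squares for $\mathbf{A}$, an SVD of the residual for $\bar{\mathbf{Y}}$ and $\mathbf{B}$) is one possible realization for illustration, not the proof itself:

```python
import numpy as np

rng = np.random.default_rng(1)
HW, S, s = 200, 31, 3

X = rng.random((HW, S))
R = rng.random((S, s))   # rank(R) = s almost surely for a random matrix
Y = X @ R                # statement (a): Y = X R, so rank(Y) = s

# Statement (b): build A, B and Ybar with X = Y A + Ybar B.
# A maps Y onto X as well as possible (least squares); the residual,
# the part of X not explained by Y, has rank at most S - s.
A, *_ = np.linalg.lstsq(Y, X, rcond=None)
residual = X - Y @ A
U, sv, Vt = np.linalg.svd(residual, full_matrices=False)
k = S - s
Ybar = U[:, :k] * sv[:k]   # HW x (S - s) basis matrix
B = Vt[:k]                 # (S - s) x S coefficient matrix
```

After the sketch runs, `X` and `Y @ A + Ybar @ B` agree to numerical precision, illustrating the linear representation that the proposed model builds on.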
In reality, the band number $s$ of an HrMS image is usually not large, which makes it full rank along the spectral mode. For example, the most commonly used HrMS images, RGB images, contain three bands, and their rank along the spectral mode is usually also three. Thus, by letting $\hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{N}_y$, where $\mathbf{Y}$ is the observed HrMS image in (1), it is easy to find that $\hat{\mathbf{Y}}$ and $\mathbf{R}$ satisfy the conditions in Theorem 1. Then the observation model (1) is equivalent to

$\mathbf{X} = \mathbf{Y}\mathbf{A} + \bar{\mathbf{Y}}\mathbf{B} + \bar{\mathbf{N}}, \quad (5)$

where $\bar{\mathbf{N}}$ is caused by the noise contained in the HrMS image. In (5), $\mathbf{Y}$ and $\bar{\mathbf{Y}}$ can be viewed as bases that represent the columns of $\mathbf{X}$ with coefficient matrices $\mathbf{A}$ and $\mathbf{B}$, where only the bases in $\bar{\mathbf{Y}}$ are unknown. In addition, we can derive the following corollary:
Corollary 1.

For any $\mathbf{Y} \in \mathbb{R}^{HW \times s}$, $\mathbf{Z} \in \mathbb{R}^{hw \times S}$, $\mathbf{R} \in \mathbb{R}^{S \times s}$ and $\mathbf{C} \in \mathbb{R}^{hw \times HW}$, if $\mathrm{rank}(\mathbf{Y}) = s$ and $\mathrm{rank}(\mathbf{R}) = s$, then the following two statements are equivalent to each other:

(a) There exists an $\mathbf{X} \in \mathbb{R}^{HW \times S}$, subject to

$\mathbf{Y} = \mathbf{X}\mathbf{R}, \quad \mathbf{Z} = \mathbf{C}\mathbf{X}. \quad (6)$

(b) There exist $\mathbf{A} \in \mathbb{R}^{s \times S}$, $\mathbf{B} \in \mathbb{R}^{(S-s) \times S}$ and $\bar{\mathbf{Y}} \in \mathbb{R}^{HW \times (S-s)}$, subject to

$\mathbf{Z} = \mathbf{C}\mathbf{Y}\mathbf{A} + \mathbf{C}\bar{\mathbf{Y}}\mathbf{B}. \quad (7)$
By Corollary 1, it is easy to find that, when viewed as equations of the to-be-estimated $\mathbf{A}$, $\mathbf{B}$ and $\bar{\mathbf{Y}}$, the observation models (1) and (2) are equivalent to the following equation:

$\mathbf{Z} = \mathbf{C}\mathbf{Y}\mathbf{A} + \mathbf{C}\bar{\mathbf{Y}}\mathbf{B} + \hat{\mathbf{N}}, \quad (8)$

where $\hat{\mathbf{N}}$ denotes the noise contained in the HrMS and LrHS images.
By (8), we design the following MS/HS fusion model:

$\min_{\bar{\mathbf{Y}}} \ \big\|\mathbf{C}\mathbf{Y}\mathbf{A} + \mathbf{C}\bar{\mathbf{Y}}\mathbf{B} - \mathbf{Z}\big\|_F^2 + \lambda f\big(\bar{\mathbf{Y}}\big), \quad (9)$

where $\lambda$ is a trade-off parameter and $f(\cdot)$ is a regularization function. We adopt regularization on the to-be-estimated bases in $\bar{\mathbf{Y}}$, rather than on $\mathbf{X}$ as in traditional methods. This helps alleviate the destruction of the spatial detail information in the known $\mathbf{Y}$ (many regularization terms, such as the total variation norm, lead to loss of details like sharp edges, lines and highlight points in the image) when representing $\mathbf{X}$ with it.
It should be noted that, for the same data set, the matrices $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ are fixed. This means that these matrices can be learned from the training data. In the later sections we will show how to learn them with a deep network.
3.2 Model optimization
We now solve (9) using the proximal gradient algorithm [3], which iteratively updates $\bar{\mathbf{Y}}$ by calculating

$\bar{\mathbf{Y}}^{(k+1)} = \arg\min_{\bar{\mathbf{Y}}} \ Q\big(\bar{\mathbf{Y}}, \bar{\mathbf{Y}}^{(k)}\big), \quad (10)$

where $\bar{\mathbf{Y}}^{(k)}$ is the updating result after $k$ iterations, and $Q(\bar{\mathbf{Y}}, \bar{\mathbf{Y}}^{(k)})$ is a quadratic approximation [3] defined as:

$Q\big(\bar{\mathbf{Y}}, \bar{\mathbf{Y}}^{(k)}\big) = g\big(\bar{\mathbf{Y}}^{(k)}\big) + \big\langle \nabla g\big(\bar{\mathbf{Y}}^{(k)}\big), \bar{\mathbf{Y}} - \bar{\mathbf{Y}}^{(k)} \big\rangle + \frac{1}{2\eta}\big\|\bar{\mathbf{Y}} - \bar{\mathbf{Y}}^{(k)}\big\|_F^2 + \lambda f\big(\bar{\mathbf{Y}}\big), \quad (11)$

where $g(\bar{\mathbf{Y}}) = \|\mathbf{C}\mathbf{Y}\mathbf{A} + \mathbf{C}\bar{\mathbf{Y}}\mathbf{B} - \mathbf{Z}\|_F^2$ and $\eta$ plays the role of step size.
It is easy to prove that problem (10) is equivalent to:

$\bar{\mathbf{Y}}^{(k+1)} = \arg\min_{\bar{\mathbf{Y}}} \ \frac{1}{2}\Big\|\bar{\mathbf{Y}} - \Big(\bar{\mathbf{Y}}^{(k)} - \eta \nabla g\big(\bar{\mathbf{Y}}^{(k)}\big)\Big)\Big\|_F^2 + \lambda\eta f\big(\bar{\mathbf{Y}}\big). \quad (12)$

For many kinds of regularization terms, the solution of Eq. (12) has a closed form [8], written as:

$\bar{\mathbf{Y}}^{(k+1)} = \mathrm{prox}_{\lambda\eta}\Big(\bar{\mathbf{Y}}^{(k)} - \eta \nabla g\big(\bar{\mathbf{Y}}^{(k)}\big)\Big). \quad (13)$

Since $\nabla g(\bar{\mathbf{Y}}^{(k)}) = 2\mathbf{C}^T\big(\mathbf{C}\mathbf{Y}\mathbf{A} + \mathbf{C}\bar{\mathbf{Y}}^{(k)}\mathbf{B} - \mathbf{Z}\big)\mathbf{B}^T$, we can obtain the final updating rule for $\bar{\mathbf{Y}}$ (with the constant 2 absorbed into the step size $\eta$):

$\bar{\mathbf{Y}}^{(k+1)} = \mathrm{prox}_{\lambda\eta}\Big(\bar{\mathbf{Y}}^{(k)} - \eta\,\mathbf{C}^T\big(\mathbf{C}\mathbf{Y}\mathbf{A} + \mathbf{C}\bar{\mathbf{Y}}^{(k)}\mathbf{B} - \mathbf{Z}\big)\mathbf{B}^T\Big). \quad (14)$
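Under a concrete choice of $f$, e.g. the $\ell_1$ norm, whose proximal operator is soft-thresholding, the updating rule (14) can be sketched as follows. The block-averaging $\mathbf{C}$, the random $\mathbf{A}$, $\mathbf{B}$ and all sizes are illustrative assumptions; the step size is set to the reciprocal of the gradient's Lipschitz constant so that the objective cannot increase:

```python
import numpy as np

rng = np.random.default_rng(2)
HW, hw, S, s = 64, 16, 8, 3   # illustrative sizes, hw < HW and s < S

Y = rng.random((HW, s))
C = np.kron(np.eye(hw), np.ones((1, HW // hw)) / (HW // hw))  # block-average C (hw x HW)
A = rng.random((s, S))
B = rng.random((S - s, S))
Z = rng.random((hw, S))

lam = 0.01
eta = 1.0 / (2 * np.linalg.norm(C, 2) ** 2 * np.linalg.norm(B, 2) ** 2)  # 1 / Lipschitz

def prox_l1(V, tau):
    """Soft-thresholding: proximal operator of tau * ||.||_1."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def objective(Ybar):
    return np.linalg.norm(C @ Y @ A + C @ Ybar @ B - Z) ** 2 + lam * np.abs(Ybar).sum()

Ybar = np.zeros((HW, S - s))
for _ in range(200):
    grad = 2 * C.T @ (C @ Y @ A + C @ Ybar @ B - Z) @ B.T   # gradient of g
    Ybar = prox_l1(Ybar - eta * grad, lam * eta)             # update rule (14)
```

After the loop, `objective(Ybar)` is no larger than the objective at the zero initialization, which is the basic descent property the network's unfolded stages inherit.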
In the later section, we will unfold this algorithm into a deep network.
4 MS/HS fusion net
Based on the above algorithm, we build a deep neural network for MS/HS fusion by unfolding all steps of the algorithm as network layers. This technique has been widely utilized in various computer vision tasks and has been substantiated to be effective in compressed sensing, dehazing, deconvolution, etc. [44, 45, 53]. The proposed network consists of $K$ stages, implementing the $K$ iterations of the iterative algorithm for solving Eq. (9), as shown in Fig. 3 (a) and (b). Each stage takes the HrMS image $\mathcal{Y}$, the LrHS image $\mathcal{Z}$ and the output of the previous stage, $\bar{\mathcal{Y}}^{(k)}$, as inputs, and outputs an updated $\bar{\mathcal{Y}}^{(k+1)}$ to be the input of the next stage.
4.1 Network design
Algorithm unfolding. We first decompose the updating rule (14) into the following four sequential parts:

$\mathbf{X}^{(k)} = \mathbf{Y}\mathbf{A} + \bar{\mathbf{Y}}^{(k)}\mathbf{B}, \quad (15)$

$\mathbf{E}^{(k)} = \mathbf{C}\mathbf{X}^{(k)} - \mathbf{Z}, \quad (16)$

$\mathbf{G}^{(k)} = \mathbf{C}^T\mathbf{E}^{(k)}\mathbf{B}^T, \quad (17)$

$\bar{\mathbf{Y}}^{(k+1)} = \mathrm{prox}_{\lambda\eta}\big(\bar{\mathbf{Y}}^{(k)} - \eta\,\mathbf{G}^{(k)}\big). \quad (18)$
In the network framework, we use the images in their tensor formulations ($\mathcal{Y}$, $\mathcal{Z}$ and $\bar{\mathcal{Y}}^{(k)}$) instead of their matrix forms, to protect their original structural knowledge and make the network structure (in tensor form) easy to design. We then design a network to approximately perform the above operations in the tensor version. Refer to Fig. 2 for easy understanding.
In the tensor version, Eq. (15) can be easily performed by two multiplications between a tensor and a matrix along the third (spectral) mode of the tensor. Specifically, in the TensorFlow (https://tensorflow.google.cn/) framework, multiplying $\mathcal{Y}$ with a matrix along the channel mode can be easily performed by using the 2D convolution function with a $1 \times 1$ kernel tensor. $\bar{\mathcal{Y}}^{(k)}$ and $\mathbf{B}$ can be multiplied similarly. In summary, we can perform the tensor version of (15) by:

$\mathcal{X}^{(k)} = \mathcal{Y} \times_3 \mathbf{A}^T + \bar{\mathcal{Y}}^{(k)} \times_3 \mathbf{B}^T, \quad (19)$

where $\times_3$ denotes the mode-3 multiplication for tensors (for a tensor $\mathcal{U} \in \mathbb{R}^{H \times W \times S}$ with elements $u_{ijk}$ and a matrix $\mathbf{M} \in \mathbb{R}^{T \times S}$ with elements $m_{tk}$, $\mathcal{V} = \mathcal{U} \times_3 \mathbf{M} \in \mathbb{R}^{H \times W \times T}$ has elements $v_{ijt} = \sum_{k=1}^{S} u_{ijk} m_{tk}$; besides, $\mathrm{fold}(\mathbf{X}\mathbf{M}^T) = \mathrm{fold}(\mathbf{X}) \times_3 \mathbf{M}$).
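The mode-3 multiplication used in (19) is simply a matrix product applied to every spectral fiber, which is exactly what a $1 \times 1$ 2D convolution computes. A NumPy sketch of this equivalence (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, s, S = 8, 8, 3, 31

Yt = rng.random((H, W, s))   # HrMS image in tensor form
A = rng.random((s, S))       # coefficient matrix

# Mode-3 multiplication with A^T: every spectral fiber y in R^s is
# mapped to y @ A in R^S (equivalently, a 1x1 convolution with an
# s-input / S-output kernel).
out_mode3 = np.einsum('hwc,cd->hwd', Yt, A)

# The same operation in matrix form: unfold to (HW x s), multiply, fold back.
out_matrix = (Yt.reshape(H * W, s) @ A).reshape(H, W, S)
```

The two results coincide, which is why the network can realize (19) with ordinary convolution layers.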
In Eq. (16), the matrix $\mathbf{C}$ represents the spatial down-sampling operator, which can be decomposed into 2D convolutions and down-sampling operators [12, 24, 25]. Thus, we perform the tensor version of (16) by:

$\mathcal{E}^{(k)} = \mathrm{downSample}_{\theta^{(k)}_{\downarrow}}\big(\mathcal{X}^{(k)}\big) - \mathcal{Z}, \quad (20)$

where $\mathcal{E}^{(k)}$ is an $h \times w \times S$ tensor, $\mathrm{downSample}_{\theta^{(k)}_{\downarrow}}(\cdot)$ is the down-sampling network consisting of 2D channel-wise convolutions and average pooling operators, and $\theta^{(k)}_{\downarrow}$ denotes the filters involved in the operator at the $k$-th stage of the network.
In Eq. (17), the transposed matrix $\mathbf{C}^T$ represents a spatial up-sampling operator. This operator can be easily performed by exploiting the 2D transposed convolution [9], which is the transposition of the combination of convolution and down-sampling operators. By exploiting the 2D transposed convolution with filters of the same size as those used in (20), we can approach (17) in the network by:

$\mathcal{G}^{(k)} = \mathrm{upSample}_{\theta^{(k)}_{\uparrow}}\big(\mathcal{E}^{(k)}\big) \times_3 \mathbf{B}, \quad (21)$

where $\mathcal{G}^{(k)} \in \mathbb{R}^{H \times W \times (S-s)}$, $\mathrm{upSample}_{\theta^{(k)}_{\uparrow}}(\cdot)$ is the spatial up-sampling network consisting of transposed convolutions, and $\theta^{(k)}_{\uparrow}$ denotes the corresponding filters in the $k$-th stage.
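A minimal sketch of the channel-wise down-sampling in (20) and its transposed up-sampling in (21), using plain NumPy in place of TensorFlow's `conv2d` / `conv2d_transpose`. Average pooling and its adjoint (uniform unpooling) are used as the simplest instances of the learned operators, so the adjoint relation between $\mathbf{C}$ and $\mathbf{C}^T$ can be checked directly:

```python
import numpy as np

def down_sample(Xt, ratio):
    """Channel-wise average pooling: the simplest realization of C."""
    H, W, S = Xt.shape
    return Xt.reshape(H // ratio, ratio, W // ratio, ratio, S).mean(axis=(1, 3))

def up_sample(Et, ratio):
    """Adjoint of average pooling: the simplest realization of C^T."""
    return np.repeat(np.repeat(Et, ratio, axis=0), ratio, axis=1) / ratio ** 2

rng = np.random.default_rng(4)
Xt = rng.random((16, 16, 31))
Et = rng.random((4, 4, 31))
ratio = 4

# Adjointness: <C X, E> == <X, C^T E>
lhs = np.sum(down_sample(Xt, ratio) * Et)
rhs = np.sum(Xt * up_sample(Et, ratio))
```

The two inner products agree, which is the defining property that the learned `upSample` network is meant to approximate for the learned `downSample` network.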
In Eq. (18), $\mathrm{prox}_{\lambda\eta}(\cdot)$ is a to-be-decided proximal operator. We adopt a deep residual network (ResNet) [13] to learn this operator. We then represent (18) in our network as:

$\bar{\mathcal{Y}}^{(k+1)} = \mathrm{proxNet}_{\theta^{(k)}}\big(\bar{\mathcal{Y}}^{(k)} - \eta\,\mathcal{G}^{(k)}\big), \quad (22)$

where $\mathrm{proxNet}_{\theta^{(k)}}(\cdot)$ is a ResNet which represents the proximal operator in our algorithm, and $\theta^{(k)}$ denotes the parameters of the ResNet at the $k$-th stage.
With Eqs. (19)-(22), we can now construct the $K$ stages of the proposed network. Fig. 3 (b) shows the flowchart of a single stage of the proposed network.
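Putting (19)-(22) together, one stage of the network can be sketched as below. Purely for illustration, the learned ResNet proximal operator is replaced by soft-thresholding and the learned down-/up-sampling by fixed average pooling and its adjoint; all sizes and the step size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
H, W, s, S, ratio = 16, 16, 3, 8, 4
h, w = H // ratio, W // ratio

Yt = rng.random((H, W, s))   # HrMS input
Zt = rng.random((h, w, S))   # LrHS input
A = rng.random((s, S))
B = rng.random((S - s, S))
eta = 0.05

def mode3(T, M):
    """Apply M to every spectral fiber (i.e., T x_3 M^T in the text's notation)."""
    return np.einsum('hwc,cd->hwd', T, M)

def stage(Ybar):
    """One unfolded stage: Eqs. (19)-(22) with simplified operators."""
    Xk = mode3(Yt, A) + mode3(Ybar, B)                               # (19)
    Ek = Xk.reshape(h, ratio, w, ratio, S).mean(axis=(1, 3)) - Zt    # (20)
    up = np.repeat(np.repeat(Ek, ratio, 0), ratio, 1) / ratio ** 2   # C^T E
    Gk = mode3(up, B.T)                                              # (21)
    V = Ybar - eta * Gk
    return np.sign(V) * np.maximum(np.abs(V) - 1e-3, 0.0)           # (22), prox as threshold

Ybar = np.zeros((H, W, S - s))
for _ in range(3):   # three stacked stages
    Ybar = stage(Ybar)
```

Stacking `stage` repeatedly, with the fixed operators replaced by learnable convolutions and a ResNet, is exactly the unfolding that MHF-net performs.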
Normal stage. In the first stage, we simply set $\bar{\mathcal{Y}}^{(0)} = \mathbf{0}$. By exploiting (19)-(22), we can obtain the first network stage as shown in Fig. 3 (c). Fig. 3 (d) shows the $k$-th stage ($k > 1$) of the network obtained by utilizing (19)-(22).
Final stage. As shown in Fig. 3 (e), in the final stage, we can approximately generate the HrHS image $\mathcal{X}^{(K)}$ by (19). Note that $\mathbf{X}^{(K)}$ (the unfolding matrix of $\mathcal{X}^{(K)}$) has been intrinsically encoded with the low-rank structure. Moreover, according to Theorem 1, there exists an $\mathbf{R}$, s.t. $\mathbf{Y} = \mathbf{X}^{(K)}\mathbf{R}$, which satisfies the observation model (1).
However, HrMS images are usually corrupted with slight noise in reality, and there is a small gap between the low-rank assumption and the real situation. This implies that $\mathcal{X}^{(K)}$ is not exactly equivalent to the to-be-estimated HrHS image. Therefore, as shown in Fig. 3 (e), in the final stage of the network, we add a ResNet on $\mathcal{X}^{(K)}$ to compensate for the gap between the to-be-estimated HrHS image and $\mathcal{X}^{(K)}$:

$\hat{\mathcal{X}} = \mathrm{resNet}_{\theta^{(K)}}\big(\mathcal{X}^{(K)}\big). \quad (23)$
In this way, we design an end-to-end training architecture, dubbed MS/HS Fusion Net. We denote the entire network as $\mathrm{MHFnet}(\mathcal{Y}, \mathcal{Z}; \Theta)$, where $\Theta$ represents all the parameters involved in the network, including $\mathbf{A}$, $\mathbf{B}$, $\eta$ and the per-stage filter and ResNet parameters $\theta^{(k)}_{\downarrow}$, $\theta^{(k)}_{\uparrow}$ and $\theta^{(k)}$. Please refer to the supplementary material for more details of the network design.
4.2 Network training
Training loss. As shown in Fig. 3 (e), the training loss for each training image is defined as follows:

$L = \big\|\hat{\mathcal{X}} - \mathcal{X}\big\|_F^2 + \alpha \sum_{k=1}^{K} \big\|\mathcal{X}^{(k)} - \mathcal{X}\big\|_F^2 + \beta \,\big\|\mathrm{downSample}_{\theta_{\downarrow}}\big(\hat{\mathcal{X}}\big) - \mathcal{Z}\big\|_F^2, \quad (24)$

where $\hat{\mathcal{X}}$ and $\mathcal{X}^{(k)}$ are the final and per-stage outputs of the proposed network, and $\alpha$ and $\beta$ are two trade-off parameters (set to small values in all experiments, to make the first term play the dominant role). The first term is the pixel-wise distance between the output of the proposed network and the ground truth $\mathcal{X}$, which is the main component of our loss function. The second term is the pixel-wise distance between the output and the ground truth in each stage. This term helps find the correct parameters in each stage, since an appropriate $\bar{\mathcal{Y}}^{(k)}$ would lead to $\mathcal{X}^{(k)} \approx \mathcal{X}$. The final term is the pixel-wise distance of the residual of observation model (2) for the final stage of the network.
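The three-term structure of the loss can be sketched directly; the helper name `mhf_loss`, the block-averaging `down` operator and the weight values below are our own placeholders, not the paper's settings:

```python
import numpy as np

def mhf_loss(X_hat, X_stages, X_gt, Z, down, alpha=0.1, beta=0.1):
    """Loss in the spirit of (24): final fidelity + per-stage fidelity
    + observation-model consistency. alpha/beta are placeholder weights."""
    main = np.sum((X_hat - X_gt) ** 2)                              # final output vs ground truth
    stagewise = sum(np.sum((Xk - X_gt) ** 2) for Xk in X_stages)    # per-stage outputs
    observation = np.sum((down(X_hat) - Z) ** 2)                    # residual of model (2)
    return main + alpha * stagewise + beta * observation

rng = np.random.default_rng(6)
X_gt = rng.random((16, 16, 8))
X_hat = X_gt + 0.01 * rng.standard_normal(X_gt.shape)
down = lambda T: T.reshape(4, 4, 4, 4, 8).mean(axis=(1, 3))
Z = down(X_gt)
loss = mhf_loss(X_hat, [X_hat], X_gt, Z, down)
```

Each term is a sum of squared pixel-wise distances, so the total loss is non-negative and vanishes only for a perfect reconstruction.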
Training data. For simulated data and real data with available ground-truth HrHS images, we can easily use the paired training data to learn the parameters in the proposed MHF-net. Unfortunately, for real data, HrHS images are sometimes unavailable. In this case, we use the method proposed in [30] to address this problem, where Wald's protocol [50] is used to create the training data as shown in Fig. 4. We down-sample both the HrMS and LrHS images, so that the original LrHS images can be taken as references for the down-sampled data. Please refer to the supplementary material for more details.
Implementation details. We implement and train our network using the TensorFlow framework. We use the Adam optimizer to train the network for 50,000 iterations with a batch size of 10 and a learning rate of 0.0001. The initialization of the parameters and other implementation details are listed in the supplementary material.
5 Experimental results
We first conduct simulated experiments to verify the mechanism of MHFnet quantitatively. Then, experimental results on simulated and real data sets are demonstrated to evaluate the performance of MHFnet.
Evaluation measures. Five quantitative picture quality indices (PQIs) are employed for performance evaluation: peak signal-to-noise ratio (PSNR), spectral angle mapper (SAM) [49], erreur relative globale adimensionnelle de synthèse (ERGAS) [38], structure similarity (SSIM) [39] and feature similarity (FSIM) [51]. SAM calculates the average angle between the spectrum vectors of the target image and the reference one across all spatial positions, and ERGAS measures the fidelity of the restored image based on the weighted sum of the MSE in each band. PSNR, SSIM and FSIM are conventional PQIs that evaluate the similarity between the target and the reference images based on MSE, structural consistency and perceptual consistency, respectively. The smaller ERGAS and SAM are, and the larger PSNR, SSIM and FSIM are, the better the fusion result is.
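For reference, three of the PQIs can be computed directly. The formulas below follow their common textbook definitions (PSNR with a given peak value, SAM averaged in degrees, ERGAS with the stated down-sampling ratio); the helper names are our own:

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((ref - est) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def sam(ref, est, eps=1e-12):
    """Average spectral angle (degrees) over all spatial positions."""
    ref2 = ref.reshape(-1, ref.shape[-1])
    est2 = est.reshape(-1, est.shape[-1])
    cos = np.sum(ref2 * est2, axis=1) / (
        np.linalg.norm(ref2, axis=1) * np.linalg.norm(est2, axis=1) + eps)
    return np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))

def ergas(ref, est, ratio):
    """Relative global error: scaled mean of band-wise relative RMSE."""
    bands_mse = np.mean((ref - est) ** 2, axis=(0, 1))
    means = np.mean(ref, axis=(0, 1))
    return 100.0 / ratio * np.sqrt(np.mean(bands_mse / means ** 2))

rng = np.random.default_rng(7)
ref = rng.random((8, 8, 31)) + 0.5
est = ref.copy()
```

With `est` equal to `ref`, SAM and ERGAS are (numerically) zero, and PSNR grows as the perturbation shrinks, matching the "smaller SAM/ERGAS, larger PSNR is better" convention above.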
5.1 Model verification with CAVE data
To verify the efficiency of the proposed MHF-net, we first compare the performance of MHF-net under different settings on the CAVE Multispectral Image Database [46] (http://www.cs.columbia.edu/CAVE/databases/). The database consists of 32 scenes with a spatial size of 512 × 512, including full spectral resolution reflectance data from 400 nm to 700 nm in 10 nm steps (31 bands in total). We generate the HrMS images (RGB images) by integrating all the ground-truth HrHS bands with the same simulated spectral response $\mathbf{R}$, and generate the LrHS images by down-sampling the ground truth with a factor of 32, implemented by averaging over disjoint 32 × 32 pixel blocks as in [2, 16].
To prepare samples for training, we randomly select 20 HS images from the CAVE database and extract overlapped patches from them as reference HrHS images for training. The utilized HrHS, HrMS and LrHS patches are then of size 96 × 96 × 31, 96 × 96 × 3 and 3 × 3 × 31, respectively. The remaining 12 HS images of the database are used for testing, where the original images are treated as ground-truth HrHS images, and the HrMS and LrHS images are generated in the same way as the training samples.
We compare the performance of the proposed MHF-net under different stage numbers $K$. In order to make the comparison fair, we adjust the level number of the ResNet used in $\mathrm{proxNet}_{\theta^{(k)}}$ for each setting, so that the total level numbers of the networks under all settings are similar to each other. Moreover, to better verify the efficiency of the proposed network, we implement another network for comparison, which only uses the ResNet in (22) and (23) without the other structures in MHF-net. This method is simply denoted as "ResNet". In this method, we set the input as the concatenation of $\mathcal{Y}$ and $\mathcal{Z}_{\uparrow}$, where $\mathcal{Z}_{\uparrow}$ is obtained by interpolating the LrHS image $\mathcal{Z}$ (using a bicubic filter) to the spatial dimension of $\mathcal{Y}$, as [28] did. We set the level number of this ResNet to be 30.
        | ResNet  | MHF-net with increasing stage number K
--------+---------+---------------------------------------
PSNR    | 32.25   | 36.15   36.61   36.85   37.23
SAM     | 19.093  | 9.206   8.636   7.587   7.298
ERGAS   | 141.28  | 92.94   88.56   86.53   81.87
SSIM    | 0.865   | 0.948   0.955   0.960   0.962
FSIM    | 0.966   | 0.974   0.975   0.975   0.976
Table 1 shows the average results over the 12 testing HS images of the two DL methods under different settings. We can observe that MHF-net with more stages, even with fewer net levels in total, leads to significantly better performance. We can also observe that MHF-net achieves notably better results than ResNet (about 5 dB in PSNR), while the main difference between MHF-net and ResNet is our proposed stage structure in the network. These results show that the proposed stage structure in MHF-net, which introduces interpretability specific to the problem, can indeed help enhance MS/HS fusion performance.
5.2 Experiments with simulated data
We then evaluate MHF-net on simulated data in comparison with state-of-the-art methods.
Comparison methods. The comparison methods include: FUSE [41] (http://wei.perso.enseeiht.fr/publications.html), ICCV15 [18] (https://github.com/lanha/SupResPALM), GLP-HS [31], SFIM-HS [19] and GSA [1] (all three from http://openremotesensing.net/knowledgebase/hyperspectralandmultispectraldatafusion/), CNMF [48] (http://naotoyokoya.com/Download.html), M-FUSE [40] (https://github.com/qw245/BlindFuse) and SASFM [14] (implemented by ourselves), representing the state-of-the-art traditional methods. We also compare the proposed MHF-net with the implemented ResNet method.
Method   | PSNR  | SAM   | ERGAS  | SSIM  | FSIM
---------+-------+-------+--------+-------+------
FUSE     | 30.95 | 13.07 | 188.72 | 0.842 | 0.933
ICCV15   | 32.94 | 10.18 | 131.94 | 0.919 | 0.961
GLP-HS   | 33.07 | 11.58 | 126.04 | 0.891 | 0.942
SFIM-HS  | 31.86 | 7.63  | 147.41 | 0.914 | 0.932
GSA      | 33.78 | 11.56 | 122.50 | 0.884 | 0.959
CNMF     | 33.59 | 8.22  | 122.12 | 0.929 | 0.964
M-FUSE   | 32.11 | 8.82  | 151.97 | 0.914 | 0.947
SASFM    | 26.59 | 11.25 | 362.70 | 0.799 | 0.916
ResNet   | 32.25 | 16.14 | 141.28 | 0.865 | 0.966
MHF-net  | 37.23 | 7.30  | 81.87  | 0.962 | 0.976
Performance comparison with CAVE data. With the same experimental setting as in the previous section, we compare the performance of all competing methods on the 12 testing HS images (using the MHF-net setting with the largest stage number in Table 1). Table 2 lists the average performance over all testing images of all comparison methods. From the table, it is seen that the proposed MHF-net significantly outperforms the other competing methods with respect to all evaluation measures. Fig. 5 shows the band at 490 nm of the HS image chart and stuffed toy obtained by the competing methods. It is easy to observe that the proposed method performs better than the other competing ones, in better recovery of both finer-grained textures and coarser-grained structures. More results are depicted in the supplementary material.
Performance comparison with Chikusei data. The Chikusei data set [47] (http://naotoyokoya.com/Download.html) is an airborne HS image taken over Chikusei, Ibaraki, Japan, on 29 July 2014. The data set is of size 2517 × 2335 × 128, with a spectral range from 0.36 to 1.018 μm. We view the original data as the HrHS image and simulate the HrMS (RGB) image and the LrHS image (with a down-sampling factor of 32) in a similar way as in the previous section.
We select an image from the top area of the original data for training and extract overlapped patches from the training data as reference HrHS images. Besides, from the remaining part of the original image, we extract 16 non-overlapping images as testing data. More details about the experimental setting are introduced in the supplementary material.
Table 3 shows the average performance over the 16 testing images of all competing methods. It is easy to observe that the proposed method significantly outperforms the other methods with respect to all evaluation measures. Fig. 6 shows the composite images of a test sample obtained by the competing methods, with bands 70, 100 and 36 as R, G and B. It is seen that the composite image obtained by MHF-net is closest to the ground truth, while the results of the other methods usually contain obvious structural errors or spectral distortion. More results are listed in the supplementary material.
Method   | PSNR  | SAM  | ERGAS  | SSIM  | FSIM
---------+-------+------+--------+-------+------
FUSE     | 26.59 | 7.92 | 272.43 | 0.718 | 0.860
ICCV15   | 27.77 | 3.98 | 178.14 | 0.779 | 0.870
GLP-HS   | 28.85 | 4.17 | 163.60 | 0.796 | 0.903
SFIM-HS  | 28.50 | 4.22 | 167.85 | 0.793 | 0.900
GSA      | 27.08 | 5.39 | 238.63 | 0.673 | 0.835
CNMF     | 28.78 | 3.84 | 173.41 | 0.780 | 0.898
M-FUSE   | 24.85 | 6.62 | 282.02 | 0.642 | 0.849
SASFM    | 24.93 | 7.95 | 369.35 | 0.636 | 0.845
ResNet   | 29.35 | 3.69 | 144.12 | 0.866 | 0.930
MHF-net  | 32.26 | 3.02 | 109.55 | 0.890 | 0.946
5.3 Experiments with real data
In this section, sample images of the Roman Colosseum acquired by WorldView-2 (WV-2) are used in our experiments (https://www.harrisgeospatial.com/DataImagery/SatelliteImagery/HighResolution/WorldView2.aspx). This data set contains an HrMS (RGB) image and an LrHS image, while the HrHS image is not available. We select the top half of the HrMS and LrHS images to train MHF-net, and exploit the remaining parts of the data set as testing data. We first extract the training data into overlapped HrMS patches and corresponding overlapped LrHS patches, and then generate the training samples by the method shown in Fig. 4.
Fig. 6 shows a portion of the fusion result on the testing data (the bottom-left area of the original image). Visual inspection evidently shows that the proposed method yields the best visual effect. Comparing with the results of ResNet, we can find that the results of both methods are clear, but the color and brightness of the result of the proposed method are much closer to those of the LrHS image.
6 Conclusion
In this paper, we have proposed a new MS/HS fusion network. The network takes advantage of deep learning in that all parameters can be learned from training data with few prior assumptions on the data, and it furthermore takes into account the generation mechanism underlying the MS/HS fusion data. This is achieved by constructing a new MS/HS fusion model based on the observation models, and unfolding the corresponding algorithm into an optimization-inspired deep network. The network is thus specifically interpretable for the task, and can help discover the spatial and spectral response operators in a purely end-to-end manner. Experiments on simulated and real MS/HS fusion cases have substantiated the superiority of the proposed MHF-net over the state-of-the-art methods.
References
 [1] B. Aiazzi, S. Baronti, and M. Selva. Improving component substitution pansharpening through multivariate regression of MS + Pan data. IEEE Transactions on Geoscience and Remote Sensing, 45(10):3230–3239, 2007.
 [2] N. Akhtar, F. Shafait, and A. Mian. Sparse spatiospectral representation for hyperspectral image superresolution. In European Conference on Computer Vision, pages 63–78. Springer, 2014.
 [3] A. Beck and M. Teboulle. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
 [4] P. J. Burt and E. H. Adelson. The laplacian pyramid as a compact image code. In Readings in Computer Vision, pages 671–679. Elsevier, 1987.
 [5] P. Chavez, S. C. Sides, J. A. Anderson, et al. Comparison of three different methods to merge multiresolution and multispectral data: Landsat TM and SPOT panchromatic. Photogrammetric Engineering and Remote Sensing, 57(3):295–303, 1991.
 [6] M. N. Do and M. Vetterli. The contourlet transform: an efficient directional multiresolution image representation. IEEE Transactions on image processing, 14(12):2091–2106, 2005.
 [7] C. Dong, C. C. Loy, K. He, and X. Tang. Image superresolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2016.
 [8] D. L. Donoho. Denoising by softthresholding. IEEE transactions on information theory, 41(3):613–627, 1995.
 [9] V. Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.
 [10] M. Fauvel, Y. Tarabalka, J. A. Benediktsson, J. Chanussot, and J. C. Tilton. Advances in spectralspatial classification of hyperspectral images. Proceedings of the IEEE, 101(3):652–675, 2013.
 [11] C. Grohnfeldt, X. Zhu, and R. Bamler. Jointly sparse fusion of hyperspectral and multispectral imagery. In IGARSS, pages 4090–4093, 2013.
 [12] R. C. Hardie, M. T. Eismann, and G. L. Wilson. Map estimation for hyperspectral image resolution enhancement using an auxiliary sensor. IEEE Transactions on Image Processing, 13(9):1174–1184, 2004.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [14] B. Huang, H. Song, H. Cui, J. Peng, and Z. Xu. Spatial and spectral image fusion using sparse matrix factorization. IEEE Transactions on Geoscience and Remote Sensing, 52(3):1693–1704, 2014.
[15] W. Huang, L. Xiao, Z. Wei, H. Liu, and S. Tang. A new pan-sharpening method with deep neural networks. IEEE Geoscience and Remote Sensing Letters, 12(5):1037–1041, 2015.
[16] R. Kawakami, Y. Matsushita, J. Wright, M. Ben-Ezra, Y.-W. Tai, and K. Ikeuchi. High-resolution hyperspectral imaging via matrix factorization. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2329–2336. IEEE, 2011.
[17] C. A. Laben and B. V. Brower. Process for enhancing the spatial resolution of multispectral imagery using pan-sharpening, Jan. 4, 2000. US Patent 6,011,875.
[18] C. Lanaras, E. Baltsavias, and K. Schindler. Hyperspectral super-resolution by coupled spectral unmixing. In Proceedings of the IEEE International Conference on Computer Vision, pages 3586–3594, 2015.
[19] J. Liu. Smoothing filter-based intensity modulation: A spectral preserve image fusion technique for improving spatial details. International Journal of Remote Sensing, 21(18):3461–3472, 2000.
[20] L. Loncan, L. B. Almeida, J. M. Bioucas-Dias, X. Briottet, J. Chanussot, N. Dobigeon, S. Fabre, W. Liao, G. A. Licciardi, M. Simoes, et al. Hyperspectral pansharpening: A review. arXiv preprint arXiv:1504.04531, 2015.
 [21] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE transactions on pattern analysis and machine intelligence, 11(7):674–693, 1989.
 [22] G. Masi, D. Cozzolino, L. Verdoliva, and G. Scarpa. Pansharpening by convolutional neural networks. Remote Sensing, 8(7):594, 2016.
[23] S. Michel, M.-J. Lefèvre-Fonollosa, and S. Hosford. HYPXIM – a hyperspectral satellite defined for science, security and defence users. PAN, 400(800):400, 2011.
 [24] R. Molina, A. K. Katsaggelos, and J. Mateos. Bayesian and regularization methods for hyperparameter estimation in image restoration. IEEE Transactions on Image Processing, 8(2):231–246, 1999.
[25] R. Molina, M. Vega, J. Mateos, and A. K. Katsaggelos. Variational posterior distribution approximation in Bayesian super resolution reconstruction of multispectral images. Applied and Computational Harmonic Analysis, 24(2):251–267, 2008.
 [26] Z. H. Nezhad, A. Karami, R. Heylen, and P. Scheunders. Fusion of hyperspectral and multispectral images using spectral unmixing and sparse coding. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(6):2377–2389, 2016.
 [27] F. Palsson, J. R. Sveinsson, and M. O. Ulfarsson. A new pansharpening algorithm based on total variation. IEEE Geoscience and Remote Sensing Letters, 11(1):318–322, 2014.
[28] F. Palsson, J. R. Sveinsson, and M. O. Ulfarsson. Multispectral and hyperspectral image fusion using a 3-D convolutional neural network. IEEE Geoscience and Remote Sensing Letters, 14(5):639–643, 2017.
[29] Y. Rao, L. He, and J. Zhu. A residual convolutional neural network for pan-sharpening. In Remote Sensing with Intelligent Processing (RSIP), 2017 International Workshop on, pages 1–4. IEEE, 2017.
[30] G. Scarpa, S. Vitale, and D. Cozzolino. Target-adaptive CNN-based pansharpening. IEEE Transactions on Geoscience and Remote Sensing, (99):1–15, 2018.
[31] M. Selva, B. Aiazzi, F. Butera, L. Chiarantini, and S. Baronti. Hyper-sharpening: A first approach on SIM-GA data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6):3008–3024, 2015.
 [32] Z. Shao and J. Cai. Remote sensing image fusion with deep convolutional neural network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(5):1656–1669, 2018.
[33] J.-L. Starck, J. Fadili, and F. Murtagh. The undecimated wavelet decomposition and its reconstruction. IEEE Transactions on Image Processing, 16(2):297–309, 2007.
 [34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [35] Y. Tarabalka, J. Chanussot, and J. A. Benediktsson. Segmentation and classification of hyperspectral images using minimum spanning forest grown from automatically selected markers. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 40(5):1267–1279, 2010.
[36] M. Uzair, A. Mahmood, and A. S. Mian. Hyperspectral face recognition using 3D-DCT and partial least squares. In BMVC, 2013.
 [37] H. Van Nguyen, A. Banerjee, and R. Chellappa. Tracking via object reflectance using a hyperspectral video camera. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 44–51. IEEE, 2010.
[38] L. Wald. Data Fusion: Definitions and Architectures: Fusion of Images of Different Spatial Resolutions. Presses de l'École des Mines, 2002.
[39] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[40] Q. Wei, J. Bioucas-Dias, N. Dobigeon, J.-Y. Tourneret, and S. Godsill. Blind model-based fusion of multi-band and panchromatic images. In Multisensor Fusion and Integration for Intelligent Systems (MFI), 2016 IEEE International Conference on, pages 21–25. IEEE, 2016.
[41] Q. Wei, N. Dobigeon, and J.-Y. Tourneret. Fast fusion of multi-band images based on solving a Sylvester equation. IEEE Transactions on Image Processing, 24(11):4109–4121, 2015.
 [42] Y. Wei and Q. Yuan. Deep residual learning for remote sensed imagery pansharpening. In Remote Sensing with Intelligent Processing (RSIP), 2017 International Workshop on, pages 1–4. IEEE, 2017.
[43] Y. Wei, Q. Yuan, H. Shen, and L. Zhang. Boosting the accuracy of multispectral image pansharpening by learning a deep residual network. IEEE Geoscience and Remote Sensing Letters, 14(10):1795–1799, 2017.
[44] D. Yang and J. Sun. Proximal Dehaze-Net: A prior learning-based deep network for single image dehazing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 702–717, 2018.
[45] Y. Yang, J. Sun, H. Li, and Z. Xu. ADMM-Net: A deep learning approach for compressive sensing MRI. arXiv preprint arXiv:1705.06869, 2017.
[46] F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar. Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE Transactions on Image Processing, 19(9):2241–2253, 2010.
 [47] N. Yokoya, C. Grohnfeldt, and J. Chanussot. Hyperspectral and multispectral data fusion: A comparative review of the recent literature. IEEE Geoscience and Remote Sensing Magazine, 5(2):29–56, 2017.
 [48] N. Yokoya, T. Yairi, and A. Iwasaki. Coupled nonnegative matrix factorization (CNMF) for hyperspectral and multispectral data fusion: Application to pasture classification. In Geoscience and Remote Sensing Symposium (IGARSS), 2011 IEEE International, pages 1779–1782. IEEE, 2011.
[49] R. H. Yuhas, J. W. Boardman, and A. F. Goetz. Determination of semi-arid landscape endmembers and seasonal trends using convex geometry spectral unmixing techniques. 1993.
 [50] Y. Zeng, W. Huang, M. Liu, H. Zhang, and B. Zou. Fusion of satellite images in urban area: Assessing the quality of resulting images. In Geoinformatics, 2010 18th International Conference on, pages 1–4. IEEE, 2010.
[51] L. Zhang, L. Zhang, X. Mou, and D. Zhang. FSIM: a feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20(8):2378–2386, 2011.
 [52] Y. Zhang, Y. Wang, Y. Liu, C. Zhang, M. He, and S. Mei. Hyperspectral and multispectral image fusion using CNMF with minimum endmember simplex volume and abundance sparsity constraints. In Geoscience and Remote Sensing Symposium (IGARSS), 2015 IEEE International, pages 1929–1932. IEEE, 2015.
[53] J. Zhang, J. Pan, W.-S. Lai, R. W. Lau, and M.-H. Yang. Learning fully convolutional networks for iterative non-blind deconvolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[54] Y. Zhao, J. Yang, Q. Zhang, L. Song, Y. Cheng, and Q. Pan. Hyperspectral imagery super-resolution by sparse representation and spectral regularization. EURASIP Journal on Advances in Signal Processing, 2011(1):87, 2011.