Deep Convolutional Neural Network for Multi-modal Image Restoration and Fusion
In this paper, we propose a novel deep convolutional neural network to solve the general multi-modal image restoration (MIR) and multi-modal image fusion (MIF) problems. Different from other deep learning based methods, our network architecture is designed by drawing inspiration from a newly proposed multi-modal convolutional sparse coding (MCSC) model. The key feature of the proposed network is that it can automatically split the common information shared among different modalities from the unique information that belongs to each single modality; it is therefore called CU-Net, i.e., Common and Unique information splitting network. Specifically, the CU-Net is composed of three modules, i.e., the unique feature extraction module (UFEM), common feature preservation module (CFPM), and image reconstruction module (IRM). The architecture of each module is derived from the corresponding part in the MCSC model, which consists of several learned convolutional sparse coding (LCSC) blocks. Extensive numerical results verify the effectiveness of our method on a variety of MIR and MIF tasks, including RGB guided depth image super-resolution, flash guided non-flash image denoising, multi-focus and multi-exposure image fusion.
Multi-modal image processing has been attracting increasing interest from the computer vision community, due to a variety of intriguing applications, e.g., image style transfer [1, 2], image fusion [3, 4], RGB guided depth image super-resolution [5, 6], and image denoising. Based on the reconstruction target, these applications can be roughly classified into two categories: multi-modal image restoration (MIR) and multi-modal image fusion (MIF). Given an input image and a second image of the same scene, the MIR problem aims to recover a better version of the input image with the guidance of the second image, while the MIF task aims to fuse the two images into a new image which combines the advantages of both. Different from uni-modal image processing tasks, e.g., single image super-resolution, multi-modal image processing usually requires proper modelling of the dependencies across different modalities.
Sparse coding and dictionary learning have been used widely to relate different modalities in multi-modal image processing. One approach learns one dictionary for each modality and models the dependencies across modalities by assuming that all the dictionaries share the same sparse representations. Another approach, a coupled dictionary learning algorithm, learns a pair of common and unique dictionaries for each modality, where only the common dictionaries share the same sparse representations. These sparse coding based methods provide explicit modelling, but the calculation of sparse codes is often time-consuming. Recent studies show that deep neural networks can also be used to correlate different modalities. Li et al. proposed a deep joint filtering network and Kim et al. proposed a deformable kernel network. Both approaches use a two-stream convolutional neural network (CNN) architecture to extract the features from two different modalities and then combine them through another CNN to achieve the final reconstruction. However, the dependencies they model are only implicit.
To address the aforementioned issues, the intuitive solution is to incorporate sparse coding modelling into the neural network. Recently, a deep neural network was proposed which unfolds the iterative shrinkage and thresholding algorithm (ISTA) into a two-branch architecture in order to solve the multi-modal image super-resolution problem. However, since traditional sparse coding is performed at patch level, it may not be as effective as a CNN, which exploits both neighbourhood information and global features in the image. Very recently, traditional patch-based sparse coding was extended to convolutional sparse coding and then unfolded into a deep convolutional neural network. However, that work only targets the multi-modal image super-resolution problem.
In this paper, we aim to solve the general multi-modal image restoration and fusion problems, by proposing a deep convolutional neural network named the Common and Unique information splitting network (CU-Net). To the best of our knowledge, this is the first time a universal framework is proposed to solve both the MIR and MIF problems. Compared with other empirically designed networks, the proposed CU-Net is derived from a new multi-modal convolutional sparse coding (MCSC) model, and thus each part of the network has good interpretability. The MCSC model is built upon a recent coupled dictionary learning algorithm. This algorithm has been shown to model the dependencies across modalities well, but it is performed at patch level, which means it only models the patch dependencies. In contrast, our MCSC model is defined at image level, and thus we can model the image dependencies across modalities. The main contributions of this paper can be summarized as follows:
We propose a novel multi-modal convolutional sparse coding (MCSC) model, which has two variations to solve the general multi-modal image restoration (MIR) and multi-modal image fusion (MIF) problems, respectively. In this model, we represent each modality with two convolutional dictionaries: one for the common feature representation and the other for the unique feature representation.
Based on the MCSC model, we design a deep convolutional neural network, i.e., CU-Net, to solve the MIR and MIF problems. Our network is able to automatically split the common features from the unique features between the target and guidance modalities, which is beneficial to both MIR and MIF tasks.
We test the performance of the proposed method on various MIR and MIF tasks, such as RGB guided depth image super-resolution, flash guided non-flash image denoising, multi-focus and multi-exposure image fusion. The numerical results validate the effectiveness, flexibility, and universality of our method.
The remainder of this paper is organized as follows. In Section 2, we review the related work on MIR and MIF. In Section 3, we introduce two variations of the multi-modal convolutional sparse coding model to solve the MIR and MIF problems, respectively. Based on this model, the proposed Common and Unique information splitting network (CU-Net) is introduced in Section 4. Finally, Section 5 shows the numerical results and Section 6 concludes this paper.
2 Related Work
2.1 Multi-modal image restoration
For the MIR task, we first review traditional methods based on guided image filtering, and then introduce more recent methods based on deep learning. The MIR methods based on guided image filtering aim to transfer the salient structures in the guidance image to the target image, and they can be further classified into two categories: the statically guided [14, 15, 16, 17, 18] and dynamically guided methods [19, 20, 21, 22, 23, 24]. The statically guided methods keep the guidance image fixed across iterations, and thus they are only effective in scenarios where the guidance image provides sufficient reliable details. Representative algorithms include the joint bilateral filter (JBF), guided filter (GF), and weighted median filter (WMF). They may struggle when there are structural inconsistencies between the two images, e.g., RGB and depth images. The dynamically guided methods consider both the original guidance image and the filtered target image as guidance, which makes them more resilient to inconsistency between the two images. Specifically, Zhang et al. proposed a rolling guided filter (RGF), which iteratively uses the filtered image as guidance for small structure and edge recovery. Ham et al. proposed a static/dynamic (SD) filter in which the static filter convolves the input image with a weight function calculated from only the static guidance image, and the dynamic filter uses a weight function repeatedly calculated from the filtered output image. To avoid the transfer of inconsistent structures between guidance and target images, Shen et al. defined three different types of structures: mutual structure, inconsistent structure and smooth regions. They further defined a normalized cross-correlation (NCC) metric to find the mutual structures between two images and used them to guide the filtering.
However, the NCC metric is defined at patch level, and thus it can only measure patch similarity and sometimes leads to halo artefacts. To solve this problem, Guo et al. recently proposed a mutually guided image filter (muGIF) which defines a relative structure to measure the similarity between two images at image level.
Recently, some works have proposed to use deep neural networks to solve the MIR problem. The related methods can be classified into two categories: methods aiming to solve the general MIR problem and methods that are tailored to solve the MIR problem for specific pairs of modalities (e.g., RGB and depth). For the general MIR task, Li et al. proposed a deep joint filtering (DJF) network, which uses two independent convolutional neural networks (CNNs) to extract the structural information from the guidance and input images, respectively, and then uses a third CNN to predict the target image. Later, this work was further improved by adding a skip connection between the input and target images, which leads to DJFR. Kim et al. proposed a deformable kernel network (DKN) for joint image filtering, which employs a network architecture similar to DJF, but with a newly added weight and offset learning module to learn the neighbourhood system for each pixel. Wu et al. turned the traditional guided filtering (GF) algorithm into a differentiable block, namely the guided filtering layer, and then plugged it into a convolutional neural network, so that it can be trained end-to-end. Recently, Pan et al. proposed a spatially variant linear representation model, in which the target image is linearly represented by the guidance and input images. They then use a deep CNN to estimate the representation coefficients to restore the target image. For the specific MIR task, methods [28, 29, 30, 31] aim to upscale the depth image with guidance from the RGB image using deep neural networks. Methods [32, 33, 34] aim to improve the resolution of the multi-spectral image with the assistance of either RGB or panchromatic images. Note that the methods aiming for a specific MIR task may not perform well on other tasks, because either the filters or the network architectures have been tailored to the specific characteristics of the image modalities under consideration.
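To make the patch-level nature of NCC concrete, the following minimal sketch computes the textbook normalized cross-correlation between two equally sized patches. It is an illustration only: the function name `ncc` and the stabilizing constant `eps` are assumptions, and the exact definition used by Shen et al. may differ in detail.

```python
import numpy as np

def ncc(patch_a, patch_b, eps=1e-12):
    """Textbook normalized cross-correlation of two equally sized patches.

    Returns a value in [-1, 1]; `eps` (an assumed stabilizer) keeps the
    ratio finite when one patch is flat.
    """
    a = np.asarray(patch_a, dtype=np.float64).ravel()
    b = np.asarray(patch_b, dtype=np.float64).ravel()
    a = a - a.mean()          # remove the patch mean
    b = b - b.mean()
    return float(a @ b / (np.sqrt((a @ a) * (b @ b)) + eps))

rng = np.random.default_rng(0)
p = rng.random((7, 7))
same = ncc(p, 3.0 * p + 1.0)   # identical structure up to gain and offset
inverted = ncc(p, -p)          # same edges with reversed contrast
flat = ncc(np.ones((7, 7)), p) # no structure in one patch
```

Because the score depends only on the pixels inside one patch, two patches with matching local structure score close to 1 regardless of what happens elsewhere in the image, which is the limitation that image-level measures aim to address.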
2.2 Multi-modal image fusion
Multi-modal image fusion aims to integrate the complementary information contained in different source images to generate a fused image. The fused image should provide more comprehensive information about the scene, which is more helpful for human or machine perception. The source images are usually captured by different sensors, e.g., the MRI and CT medical images, but sometimes they are obtained using the same sensor but with different imaging parameters, e.g., multi-focus and multi-exposure images.
Most traditional image fusion methods follow a three-step fusion procedure. Firstly, the source images are mapped into a specific transform domain, e.g., the wavelet domain. Then, the transform coefficients are fused based on a fusion rule, and finally the fused coefficients are transformed back into the image domain to obtain the final fused image. Two elements play a critical role in the fusion performance: the selection of the transform domain and the fusion rule. Many works have studied the fusion performance in different transform domains, including for example the discrete wavelet transform (DWT), discrete cosine transform (DCT), and non-subsampled contourlet transform. There are also some papers based on sparse coding [39, 40, 3], which fuse the sparse representations in the sparse domain. The most widely used fusion rules are choose-max and weighted average. Usually, in order to avoid coefficient inconsistency, a neighbourhood morphological processing step is used after the choose-max strategy. However, the transform domain is manually selected and the fusion rule is hand-crafted, which may limit the quality of the fusion results.
Recently, some authors have proposed to use deep learning to solve the MIF problems. Specifically, Liu et al. proposed a simple CNN to predict the decision map for multi-focus image fusion. Prabhakar et al. proposed a CNN based unsupervised image fusion method to fuse one under-exposed image with an over-exposed one. Li et al. proposed a CNN with a dense block structure to solve the infrared and visible image fusion problem. To improve the perceptual quality of the fused image, Ma et al. proposed a generative adversarial network (GAN), called FusionGAN, for infrared and visible image fusion.
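The three-step procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than any cited method: it uses a hand-written one-level Haar transform as the transform domain, the choose-max rule for the detail bands, a plain average for the approximation band, and it omits the morphological consistency step. All function names are hypothetical.

```python
import numpy as np

S = np.sqrt(2.0)

def haar2d(img):
    """Step 1: one-level orthonormal 2-D Haar transform (even-sized input)."""
    lo, hi = (img[0::2] + img[1::2]) / S, (img[0::2] - img[1::2]) / S
    split = lambda m: ((m[:, 0::2] + m[:, 1::2]) / S, (m[:, 0::2] - m[:, 1::2]) / S)
    ll, lh = split(lo)
    hl, hh = split(hi)
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Step 3: inverse of haar2d."""
    def merge_cols(s, d):
        m = np.empty((s.shape[0], 2 * s.shape[1]))
        m[:, 0::2], m[:, 1::2] = (s + d) / S, (s - d) / S
        return m
    lo, hi = merge_cols(ll, lh), merge_cols(hl, hh)
    img = np.empty((2 * lo.shape[0], lo.shape[1]))
    img[0::2], img[1::2] = (lo + hi) / S, (lo - hi) / S
    return img

def fuse(img1, img2):
    """Step 2: fuse coefficients, then transform back."""
    c1, c2 = haar2d(img1), haar2d(img2)
    ll = 0.5 * (c1[0] + c2[0])                       # weighted average (equal weights)
    details = [np.where(np.abs(a) >= np.abs(b), a, b)  # choose-max on magnitude
               for a, b in zip(c1[1:], c2[1:])]
    return ihaar2d(ll, *details)

rng = np.random.default_rng(0)
img_a, img_b = rng.random((8, 8)), rng.random((8, 8))
fused = fuse(img_a, img_b)
```

Both the transform and the fusion rule are fixed by hand in this sketch, which is exactly the design choice that learning-based fusion methods try to avoid.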
The MIR and MIF tasks are usually treated as two independent research problems, as shown in the literature review above. One main contribution of this paper is that we propose a novel multi-modal convolutional sparse coding (MCSC) model with two variations for these two tasks. Based on this model, we further propose a universal deep learning framework to achieve both the MIR and MIF tasks, as shown in Fig. 2. To the best of our knowledge, this is the first time a general deep learning framework is proposed that can address either the MIR or the MIF problem.
3 Multi-modal Convolutional Sparse Coding (MCSC)
In multi-modal image processing, one important issue is to model the dependencies among different modalities. In this section, we first introduce the multi-modal convolutional sparse coding (MCSC) model for the MIR and MIF tasks, and then introduce the optimization formulations, together with the solutions for these two tasks. To make the notations clear, we list the main symbols used in this section in Table I.
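Table I: Main symbols used in this section.
|the input image|
|the guidance image|
|the target image|
|the k-th common filter of the input image|
|the k-th unique filter of the input image|
|the k-th common filter of the guidance image|
|the k-th unique filter of the guidance image|
|the k-th common filter of the target image|
|the k-th unique filter of the target image related with the input image|
|the k-th unique filter of the target image related with the guidance image|
|the k-th common feature response|
|the k-th unique feature response of the input image|
|the k-th unique feature response of the guidance image|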
3.1 The MCSC model
For the MIR task, the problem can be formulated as follows: given a distorted input image and a guidance image, we aim to reconstruct a target image which is a high-quality version of the input image. Here, we assume that the image is square. Since the input and guidance images capture the same scene, they should share some common features, but they also have unique features. Take the depth and RGB images as an example: discontinuities in the depth image are clearly related to edges in the RGB image. However, the RGB image contains texture information, which has no relation to the depth image. Based on this observation, we model the relationships among different modalities as follows:
Here, the symbol * denotes the convolution operation. The input, guidance and target images each have their own set of common filters; the input and guidance images each have their own set of unique filters; and the target image has a set of unique filters related to the input image. The common filters share the same feature responses, while the unique filters have their own feature responses. In summary, the input and guidance images share the same common feature responses, but also have different unique feature responses, as indicated in Eqs. (1) and (2). Since the unique information in the guidance image is not useful for the reconstruction of the target image, the model of the target image is composed of two parts: the features which are common to the input and guidance images, and the unique features of the input image, as in Eq. (3). The advantage of this model is that only the useful information is preserved while the information that may interfere with the estimation of the target image is discarded.
For the MIF task, we aim to fuse the input image with the guidance image to obtain a new image which combines the advantages of both. Different from the MIR task, in which only the common information in the guidance image is helpful, in the MIF task all the information contained in the guidance image might be useful for the reconstruction of the target image. In other words, the fused image should be composed of three parts: common features shared by the two source images, unique features of the input image and unique features of the guidance image. Thus, in the MIF task, the models for the input and guidance images are the same as for the MIR task, but the model for the target image should be re-formulated as follows:
where the target image has two sets of unique filters, related with the input image and the guidance image, respectively.
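As a sketch of Eqs. (1) to (3), the relations described in this subsection can be written out in a notation that is assumed here for illustration: $x$, $y$, $z$ for the input, guidance and target images; $d_k^{c}, d_k^{u}$, $h_k^{c}, h_k^{u}$ and $g_k^{c}, g_k^{u}$ for the common and unique filters of $x$, $y$ and $z$; $c_k$, $u_k$, $v_k$ for the common responses and the unique responses of $x$ and $y$; and $K$ filters per set. These symbol names, and the use of the same $K$ for every filter set, are assumptions and are not taken from the surrounding text.

```latex
\begin{align}
x &= \sum_{k=1}^{K} d_k^{c} * c_k + \sum_{k=1}^{K} d_k^{u} * u_k, \tag{1}\\
y &= \sum_{k=1}^{K} h_k^{c} * c_k + \sum_{k=1}^{K} h_k^{u} * v_k, \tag{2}\\
z &= \sum_{k=1}^{K} g_k^{c} * c_k + \sum_{k=1}^{K} g_k^{u} * u_k. \tag{3}
\end{align}
```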
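A sketch of Eq. (4), in a notation assumed here for illustration: $z$ is the target image, $c_k$ the common feature responses, $u_k$ and $v_k$ the unique feature responses of the input and guidance images, $g_k^{c}$ the common filters of $z$, and $g_k^{u}$, $g_k^{v}$ its two sets of unique filters. The symbol names and the shared filter count $K$ are assumptions, not taken from the surrounding text.

```latex
\begin{equation}
z = \sum_{k=1}^{K} g_k^{c} * c_k + \sum_{k=1}^{K} g_k^{u} * u_k + \sum_{k=1}^{K} g_k^{v} * v_k. \tag{4}
\end{equation}
```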
3.2 Convolutional Sparse Coding
In the synthesis phase, given the input image and the guidance image, we need to infer the target image. To this end, assuming all dictionaries are known, we need to calculate the common and unique feature responses by solving the following optimization problem:
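A plausible form of problem (5), written as a standard $\ell_1$-regularized convolutional sparse coding objective, is sketched here. The notation is assumed for illustration: $x$, $y$ are the input and guidance images, $d_k^{c}, d_k^{u}$ and $h_k^{c}, h_k^{u}$ their common and unique filters, $c_k$, $u_k$, $v_k$ the common and unique feature responses, and $\lambda$ a sparsity weight. The $\ell_1$ penalty is consistent with the soft-thresholding operation used later in the paper, but the exact weighting of the terms in the original formulation may differ.

```latex
\begin{equation}
\min_{\{c_k, u_k, v_k\}}\;
\frac{1}{2}\Big\| x - \sum_{k=1}^{K} d_k^{c} * c_k - \sum_{k=1}^{K} d_k^{u} * u_k \Big\|_2^2
+ \frac{1}{2}\Big\| y - \sum_{k=1}^{K} h_k^{c} * c_k - \sum_{k=1}^{K} h_k^{u} * v_k \Big\|_2^2
+ \lambda \sum_{k=1}^{K} \big( \|c_k\|_1 + \|u_k\|_1 + \|v_k\|_1 \big). \tag{5}
\end{equation}
```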
To solve this problem, our strategy is to alternately update each variable with other variables fixed. This leads to the following three steps:
Step 1: fix the common feature responses and the unique responses of the guidance image, and update the unique responses of the input image.
Step 2: fix the common feature responses and the unique responses of the input image, and update the unique responses of the guidance image.
Step 3: fix the unique responses of both images, and update the common feature responses.
These three steps should be repeated until convergence. After solving (5), the target image can be reconstructed by either Eq. (3) or Eq. (4), depending on the task. However, solving (5) is very time-consuming, since the solution requires several iterations. In this paper, we aim to solve (5) using a deep learning approach, by turning each step above into a deep network module with tunable parameters and adjustable number of layers. The next section introduces in detail how we turn these three steps into a deep network.
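The three alternating steps can be sketched with a conventional iterative solver, which also shows why the iterative route is slow. The sketch below is an illustration under several assumptions that are not taken from the paper: an $\ell_1$-regularized objective with weight `lam`, circular convolution implemented with the FFT, a single ISTA (proximal gradient) update per step instead of a full inner solve, and random filters. All function and variable names are hypothetical.

```python
import numpy as np

def synth(F, R):
    """Sum_k filter_k (*) response_k; F holds filter spectra, shape (K, n, n)."""
    return np.real(np.fft.ifft2(np.sum(F * np.fft.fft2(R), axis=0)))

def adjoint(F, r):
    """Adjoint of synth: correlate the residual r with every filter."""
    return np.real(np.fft.ifft2(np.conj(F) * np.fft.fft2(r)))

def soft(a, t):
    """Soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def objective(x, y, Dc, Du, Hc, Hu, c, u, v, lam):
    rx = x - synth(Dc, c) - synth(Du, u)
    ry = y - synth(Hc, c) - synth(Hu, v)
    l1 = np.abs(c).sum() + np.abs(u).sum() + np.abs(v).sum()
    return 0.5 * (rx ** 2).sum() + 0.5 * (ry ** 2).sum() + lam * l1

def alternate(x, y, Dc, Du, Hc, Hu, lam=0.05, n_iter=20):
    c = np.zeros(Dc.shape)
    u, v = np.zeros_like(c), np.zeros_like(c)
    # Step sizes 1/L, with L an upper bound on each block's Lipschitz constant.
    su = 1.0 / np.max(np.sum(np.abs(Du) ** 2, axis=0))
    sv = 1.0 / np.max(np.sum(np.abs(Hu) ** 2, axis=0))
    sc = 1.0 / np.max(np.sum(np.abs(Dc) ** 2 + np.abs(Hc) ** 2, axis=0))
    history = []
    for _ in range(n_iter):
        # Step 1: update the unique responses of the input image.
        u = soft(u + su * adjoint(Du, x - synth(Dc, c) - synth(Du, u)), su * lam)
        # Step 2: update the unique responses of the guidance image.
        v = soft(v + sv * adjoint(Hu, y - synth(Hc, c) - synth(Hu, v)), sv * lam)
        # Step 3: update the common responses using both residuals.
        rx = x - synth(Dc, c) - synth(Du, u)
        ry = y - synth(Hc, c) - synth(Hu, v)
        c = soft(c + sc * (adjoint(Dc, rx) + adjoint(Hc, ry)), sc * lam)
        history.append(objective(x, y, Dc, Du, Hc, Hu, c, u, v, lam))
    return c, u, v, history

rng = np.random.default_rng(0)
n, K = 16, 2  # toy image size and filter count (placeholders)
Dc, Du, Hc, Hu = (np.fft.fft2(rng.standard_normal((K, 3, 3)), s=(n, n), axes=(-2, -1))
                  for _ in range(4))
x, y = rng.random((n, n)), rng.random((n, n))
c, u, v, history = alternate(x, y, Dc, Du, Hc, Hu)
```

With these step sizes the objective does not increase from one sweep to the next, but many sweeps are typically needed, which motivates replacing each step with a small number of learned blocks.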
4 Common and Unique Information Splitting Network (CU-Net)
The architecture of the proposed CU-Net is shown in Fig. 2. We can see that the network is composed of three parts: the unique feature extraction module (UFEM), common feature preservation module (CFPM), and the image reconstruction module (IRM). Next, we will introduce these three modules in detail and explain how we design the architecture of each of them from the MCSC model.
4.1 Unique feature extraction module (UFEM)
The UFEM aims to extract the unique features from each source image. Since we have two source images, the input image and the guidance image, we have two UFEMs, i.e., one prediction network for the unique features of each image, as shown in Fig. 2. As discussed in Section 3.2, the first step is to fix the common feature responses and the unique responses of the guidance image in order to update the unique responses of the input image, which turns (5) into the following:
Since the common feature response is fixed, we can further simplify (6) by subtracting the common component from the input image and denoting the resulting residual by a new variable, which leads to the following expression:
Eq. (7) is a standard convolutional sparse coding problem, which can be solved using traditional methods. We instead solve this problem using the learned convolutional sparse coding (LCSC) algorithm, which gives the solution for the unique feature responses of the input image as follows:
Here, the unique feature responses of the input image are stacked into a single variable, which is updated at each iteration, and each iteration involves two learnable convolutional layers related to the unique filters of the input image. For details about how to derive (8) by solving (7), please refer to Appendix A. The nonlinearity in (8) is the soft-thresholding operation with a specified threshold. Following the LCSC approach, we can unfold the iterations in (8) into a neural network, as shown in the UFEM module (the upper figure) in Fig. 3. Theoretically, since each LCSC block in the UFEM module corresponds to one iteration in (8), the UFEM module can be extended to any number of LCSC blocks, which makes the network architecture very flexible.
The architecture of the prediction network for the unique features of the guidance image is obtained in a way similar to that for the input image, and it is derived from the second step in solving Eq. (5). When updating the unique responses of the guidance image, we need to fix the common responses and the unique responses of the input image, so that Eq. (5) becomes:
Here, the unique feature responses of the guidance image are stacked into a single variable, which is updated at each iteration, and each iteration involves two learnable convolutional layers related to the unique filters of the guidance image. The iterations in Eq. (10) can also be unfolded into a neural network, as shown in the middle figure in Fig. 3.
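To illustrate how such iterations unfold into a fixed number of blocks, the sketch below implements soft-thresholding and a generic LCSC-style update of the form z ← S_θ(z + encode(r − decode(z))), where r is the residual image. This form, with one encoder-like and one decoder-like layer, is assumed here for illustration and may not match Eq. (8) term by term. The layers are passed in as plain callables (`encode`, `decode`, both hypothetical names) and are shared across blocks, whereas in a trained network they would be learned convolutions.

```python
import numpy as np

def soft_threshold(a, theta):
    """Element-wise soft-thresholding with threshold theta."""
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

def lcsc_module(r, encode, decode, theta, n_blocks=4):
    """Unfold n_blocks iterations of z <- S_theta(z + encode(r - decode(z))).

    The first block starts from z = 0, so it reduces to S_theta(encode(r)).
    """
    z = soft_threshold(encode(r), theta)
    for _ in range(n_blocks - 1):
        z = soft_threshold(z + encode(r - decode(z)), theta)
    return z

# Toy usage with matrix "layers": choosing encode = step * D^T and decode = D
# turns the unfolded module into plain ISTA for the dictionary D.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 30))
step = 1.0 / np.linalg.norm(D, 2) ** 2          # 1 / largest eigenvalue of D^T D
r = rng.standard_normal(20)
z = lcsc_module(r, lambda t: step * D.T @ t, lambda s: D @ s, theta=0.1 * step)

# Sanity values for the thresholding itself.
demo = soft_threshold(np.array([-2.0, -0.5, 0.3, 1.5]), 1.0)
```

Each additional block corresponds to one more iteration, which is why the depth of the module can be chosen freely.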
4.2 Common feature preservation module (CFPM)
Since different modalities capture the same image scene, there exist some consistent features among them. The CFPM aims to preserve these common features. The network architecture of the CFPM is derived from the third step in solving Eq. (5), which aims to predict the common features. Thus, the CFPM is also called the common feature prediction network in this paper. When fixing the unique responses of the input and guidance images, Eq. (5) can be re-written as follows:
where the two data terms involve the residuals of the input and guidance images after subtracting their respective unique components. The first two terms in Eq. (11) can be further combined and this yields the following optimization problem:
where the common feature responses are stacked into a single variable, which is updated at each iteration, and each iteration involves two learnable convolutional layers related to the common filters. The iterations in Eq. (13) can be unfolded and this leads to the CFPM architecture shown in the bottom figure in Fig. 3.
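The combination of the two data terms can be sketched as a stacking identity. The notation is assumed here for illustration and is not taken from the surrounding text: $\tilde{x}$ and $\tilde{y}$ are the residuals of the input and guidance images after removing their unique components, $d_k^{c}$ and $h_k^{c}$ are the corresponding common filters, and $c_k$ are the common feature responses.

```latex
\begin{equation}
\frac{1}{2}\Big\| \tilde{x} - \sum_{k=1}^{K} d_k^{c} * c_k \Big\|_2^2
+ \frac{1}{2}\Big\| \tilde{y} - \sum_{k=1}^{K} h_k^{c} * c_k \Big\|_2^2
= \frac{1}{2}\left\|
\begin{bmatrix} \tilde{x} \\ \tilde{y} \end{bmatrix}
- \sum_{k=1}^{K}
\begin{bmatrix} d_k^{c} \\ h_k^{c} \end{bmatrix} * c_k
\right\|_2^2 .
\end{equation}
```

Under this reading, the stacked residual plays the role of a single two-channel input and the stacked filters form a single two-channel dictionary, so the common responses can be estimated with the same kind of unfolded sparse coding iterations as in the UFEM.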
|Method||RMSE||SSIM||RMSE||SSIM||RMSE||SSIM||RMSE||SSIM||RMSE||SSIM||RMSE||SSIM||RMSE||SSIM||RMSE||SSIM|
|Xie et al.||8.79||0.9438||9.14||0.9221||12.21||0.8869||3.79||0.9758||1.63||0.9917||1.33||0.9910||2.77||0.9887||5.67||0.9571|
|Park et al.||6.03||0.9678||7.13||0.9379||9.45||0.9067||3.76||0.9752||1.66||0.9912||1.42||0.9911||2.79||0.9864||4.61||0.9638|
|Ferstl et al.||5.99||0.9701||6.40||0.9563||8.01||0.9298||3.73||0.9771||1.65||0.9915||1.43||0.9909||2.90||0.9859||4.30||0.9717|
|Lu et al.||5.53||0.9712||6.10||0.9610||8.31||0.9266||4.10||0.9747||2.18||0.9896||1.56||0.9896||3.24||0.9867||4.43||0.9713|
|Gu et al.||6.04||0.9766||6.15||0.9613||8.10||0.9470||3.52||0.9779||1.57||0.9923||1.23||0.9930||2.66||0.9883||4.18||0.9766|
4.3 Image reconstruction module (IRM)
After we obtain the common and unique feature responses, the next step is to reconstruct the target image using either Eq. (3) for MIR related tasks or Eq. (4) for MIF related tasks. Specifically, for MIR related tasks, the image reconstruction module (IRM) only contains two sets of filters: the common filters of the target image and its unique filters related to the input image. As shown in Fig. 2, the unique filters are directly connected to the output of the prediction network for the unique features of the input image, which corresponds to the unique term in Eq. (3). The common filters are connected to the output of the common feature prediction network, which corresponds to the common term in Eq. (3). Then, these two terms are added to obtain the target image. For the MIF related tasks, we use Eq. (4) to create the IRM, and it contains three sets of filters. The only difference from the MIR related tasks is the additional set of unique filters related to the guidance image, which is connected to the output of the prediction network for the unique features of the guidance image, as shown in Fig. 2.
4.4 Discussion about the CU-Net Architecture
It is of interest to note that, although the CU-Net is derived from a theoretical MCSC model, it contains both skip connections and residual blocks. These two elements have both been demonstrated to improve the reconstruction performance of CNN architectures for different tasks. The concept of skip connection was first proposed by He et al. for image recognition, and later successfully used in the fields of image super-resolution [58, 59] and image denoising. In our network, as shown in Fig. 2, the target is obtained by combining three parts, two of which use skip connections. In addition, as we can see in Fig. 3, there are two residual lines in our UFEM and CFPM, which are similar to previously proposed residual blocks. It has been verified that the residual block can make full use of the hierarchical features and thus significantly improve the image super-resolution performance. In this paper, we also demonstrate the effectiveness of the residual UFEM and CFPM architecture in Section 5.4.
5 Numerical Results
In this section, we verify the effectiveness of our method on various applications, which can be classified into two categories: MIR and MIF related tasks. The MIR related tasks include RGB guided depth image super-resolution, RGB guided multi-spectral image super-resolution, and flash guided non-flash image denoising. The MIF related tasks include multi-exposure image fusion, multi-focus image fusion and medical image fusion.
Training details. For each task, we train the network using around 150,000 image patches. The Adam optimizer is used to train the network, and the learning rate is decayed by a factor of 0.9 every 50 epochs. The number of iterated LCSC blocks in the UFEM and CFPM modules is set to 4. For each convolutional layer, the number of filters is 64. We choose a mini-batch size of 64 and a total of 200 epochs. Note that for applications which aim to restore an image with multiple channels, e.g., multi-exposure image fusion, instead of processing each channel independently, we adjust the dimensions of the corresponding convolutional layers to accept multi-channel inputs, so that their channel dimensions match the number of image channels.
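The step decay described in the training details can be sketched as a small helper. This is an illustration only: it assumes that "decayed by 0.9 every 50 epochs" means the learning rate is multiplied by 0.9 after each block of 50 epochs, and the base learning rate used in the example is a hypothetical placeholder, not the value used in the paper.

```python
def learning_rate(epoch, base_lr, decay=0.9, step=50):
    """Step schedule: multiply base_lr by `decay` once per `step` epochs."""
    return base_lr * decay ** (epoch // step)

BASE_LR = 1e-4  # hypothetical placeholder value
schedule = [learning_rate(e, BASE_LR) for e in range(200)]  # 200 epochs in total
```

With 200 epochs the rate is reduced three times, so the final epochs run at 0.9 cubed, i.e. about 73% of the base rate.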
Next, in Section 5.1, we show the simulation results of our method on the MIR related tasks, together with the comparisons with other state-of-the-art approaches. In Section 5.2, we show the simulation results on the MIF related tasks. In order to further verify the effectiveness of our deep network, we visualize in Section 5.3 the features generated by different parts of the network, and in Section 5.4 we present comprehensive ablation study results. In Section 5.5, the computational cost is discussed.
|RGB/MS||Chart toy||Egyptian||Feathers||Glass tiles||Jelly beans||Oil painting||Paints||Average|
5.1 MIR related tasks
5.1.1 RGB guided depth image SR
We train the network on a dataset composed of 1000 synthetic RGB/depth image pairs. The testing images are from the Middlebury dataset and the Sintel dataset. The upscaling factor is 4. We generate the low-resolution depth image by downsampling the high-resolution depth image by the upscaling factor, and then upsample it using bicubic interpolation. The high-resolution RGB image is converted into the YCbCr format and only the Y channel is used as guidance.
We compare our method with several state-of-the-art methods, which include methods developed for single image super-resolution, e.g., SCN, EDSR, SRFBN, methods for single depth image super-resolution and RGB guided depth image super-resolution, e.g., Xie et al., Park et al., Ferstl et al., Gu et al., RCGD, RADAR, and methods for solving the general MIR problem, e.g., DGF, DJFR and CoISTA. The results of these approaches are all obtained by running the software codes provided by the authors.
Table II presents the quantitative comparison results between ours and the other state-of-the-art methods, in terms of root mean square error (RMSE) and SSIM. As can be seen from this table, our CU-Net achieves the best results among all the methods. Fig. 4 visualizes the reconstructed depth images using different methods for ×4 upscaling. From this figure, we can see that the depth images recovered by our method are quite close to the ground-truth, with clearer and sharper edges.
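Extracting the Y channel used as guidance can be sketched as below. The weights are the ITU-R BT.601 luma coefficients in their full-range form; which YCbCr variant was actually used (for example a studio-range version with offset and scaling) is not stated, so this choice is an assumption, as is the function name.

```python
import numpy as np

BT601_LUMA = np.array([0.299, 0.587, 0.114])  # R, G, B weights

def rgb_to_y(rgb):
    """Luma (Y) channel of an (H, W, 3) RGB image, full-range BT.601 weights."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return rgb @ BT601_LUMA

rng = np.random.default_rng(0)
rgb_image = rng.random((4, 6, 3))   # toy RGB image with values in [0, 1]
guidance = rgb_to_y(rgb_image)      # single-channel guidance, shape (4, 6)
```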
5.1.2 RGB guided multi-spectral image SR
For the task of RGB guided multi-spectral image SR, both the training and testing images are from the Columbia multi-spectral database. We randomly select 7 images for testing and use the remaining images for training. To verify the effectiveness of our method, we compare it with the following approaches: JBF, GF, JFSM, SDF, EDSR, SRFBN, MMSR, DGF, DJFR and CoISTA. Table III shows the comparison results in terms of PSNR and SSIM for ×4 upscaling. As we can see, our method can recover the HR multi-spectral images with high accuracy. Specifically, the average PSNR value of our method is 1.55 dB higher than that of the second best method, DJFR. Fig. 5 shows examples of the upsampling results using different methods.
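For reference, the RMSE and PSNR metrics used in the comparisons can be sketched as follows. The peak value of 255 (8-bit intensity range) is an assumption, since the scale on which the reported values were computed is not stated, and the function names are illustrative.

```python
import numpy as np

def rmse(ref, est):
    """Root mean square error between two images of the same shape."""
    diff = np.asarray(ref, dtype=np.float64) - np.asarray(est, dtype=np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(ref, est, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` is the assumed maximum intensity."""
    err = rmse(ref, est)
    return float("inf") if err == 0 else float(20.0 * np.log10(peak / err))

ref = np.zeros((8, 8))
est = np.full((8, 8), 5.0)
print(rmse(ref, est), psnr(ref, est))

# A PSNR gap of g dB corresponds to a mean-squared-error ratio of 10 ** (g / 10).
mse_ratio = 10 ** (1.55 / 10)
```

Under this definition, the 1.55 dB gap quoted above corresponds to the competing method having a mean squared error roughly 1.43 times larger.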
5.1.3 Flash guided non-flash image denoising
We use a recent flash/non-flash image dataset for both training and testing. We randomly select 400 flash/non-flash image pairs from the dataset for training. For testing, to make the results more convincing, we select 12 images from three different categories, i.e., toy, plant and object. Note that the testing images are different from the training images. The noisy non-flash images are obtained by adding white Gaussian noise to the clean images, with three different noise levels.
To verify the effectiveness of our method, we compare it with four other state-of-the-art methods, including CBM3D and DnCNN, which are specifically aimed at the image denoising task, and DJFR and MuGIF, which are general MIR methods. Table IV shows the comparison results in terms of PSNR, in which the first four images are from the toy category, the middle four images are from the object category and the last four images are from the plant category. As we can see from this table, our method outperforms the other methods at all noise levels. This is further verified in Fig. 6, in which we visualize the denoised non-flash images using different methods for one of the noise levels. At this noise level, the non-flash image is severely affected by noise, with most of the details removed, and thus the guidance of the flash image becomes very important. As we can see in Fig. 6, our method is able to remove the noise effectively, and at the same time recover the fine details and sharp edges, while the other methods either have blurred edges [67, 60] or unclear details. Take the image book as an example: our method clearly recovers the text, which is hard to read in the noisy image, while the other methods struggle to do so. This confirms that our method is able to make full use of the information in the guidance image, even when most of the information in the noisy non-flash image has been lost.
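The noisy inputs described above can be simulated with a short helper. This is a sketch under stated assumptions: the noise standard deviation `sigma` in the example is a hypothetical placeholder (the levels used in the experiments are not reproduced here), the image is assumed to be on a 0 to 255 scale, and the result is left unclipped.

```python
import numpy as np

def add_gaussian_noise(img, sigma, seed=None):
    """Add zero-mean white Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    img = np.asarray(img, dtype=np.float64)
    return img + rng.normal(0.0, sigma, size=img.shape)

clean = np.full((256, 256), 128.0)                    # toy constant image
noisy = add_gaussian_noise(clean, sigma=25.0, seed=0)  # sigma is a placeholder
measured_sigma = float(np.std(noisy - clean))
```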
5.2 MIF related tasks
5.2.1 Multi-exposure image fusion
For the multi-exposure image fusion task, we aim to combine one under-exposed image with an over-exposed one to obtain a photo-realistic natural image. The dataset used for training and testing provides images with seven exposure levels, and we choose the first and sixth levels as the under-exposed and over-exposed images, respectively. Some examples of the images are shown in Fig. 7. As we can see from this figure, the under-exposed image is extremely dark, and the over-exposed image is extremely bright, so that many details are lost in both of them. Our method is able to detect the useful information in these two images and re-assemble it into a new image which contains all the relevant information and looks visually pleasing. We compare our method with two other state-of-the-art approaches for multi-exposure image fusion: SPD-MEF and MEF-OPT. As we can see in Fig. 7, the SPD-MEF method suffers from severe non-uniform brightness artifacts and contrast loss, and thus it loses many important details. For example, in the third picture of the first row, there should be a person standing between the two pillars, but the SPD-MEF method fails to recover it. The MEF-OPT method performs better than SPD-MEF, but the brightness is still not uniform across the whole image. In addition, the combined image has halo effects around the edges (the fourth figure in the third row). In contrast, the images generated by our method have consistent and uniform brightness across the whole image, and do not show contrast loss or halo effects.
5.2.2 Multi-focus image fusion
Due to the finite depth-of-field of a camera, it is difficult to capture an image with all the objects in focus. The multi-focus image fusion task aims to fuse two or more images focused on different depth planes to obtain an all-in-focus image. In this paper, we fuse one near-focus image with one far-focus image to obtain an all-in-focus image. Examples of those images are shown in Fig. 8. The training data are generated from the General 100 dataset. The near-focus and far-focus image pairs are generated by applying Gaussian blurring to the randomly chosen foreground and background of an image. The testing images are from the Lytro multi-focus image dataset. We compare our method with the following three state-of-the-art approaches: CSR, the Deepfuse network and the Densefuse network. The visual comparison results are shown in Fig. 8. As we can see from this figure, the combined images generated by our method are more visually pleasing than those of the other approaches, with every part in focus, and with clear and sharp boundary edges. In contrast, the comparison methods, CSR, the Deepfuse network and the Densefuse network, all lead to different levels of blurring artefacts across the boundary areas, as shown in the close-ups of the toy dog and the fence.
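The synthetic near-focus and far-focus pairs described above can be sketched as follows. This is an illustration under assumptions: the foreground mask is a hypothetical rectangular region (how foreground and background were chosen is not specified beyond being random), the blur strength `sigma` is a placeholder, and the separable Gaussian blur is written with NumPy only.

```python
import numpy as np

def gaussian_blur(img, sigma=2.0):
    """Separable Gaussian blur with reflect padding; output has the input shape."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    k /= k.sum()                                   # kernel sums to one
    padded = np.pad(img, radius, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def make_focus_pair(img, foreground_mask, sigma=2.0):
    """Return (near_focus, far_focus) images built from one sharp image."""
    blurred = gaussian_blur(img, sigma)
    near = np.where(foreground_mask, img, blurred)  # foreground sharp, background blurred
    far = np.where(foreground_mask, blurred, img)   # background sharp, foreground blurred
    return near, far

rng = np.random.default_rng(0)
sharp = rng.random((32, 32))
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True                             # hypothetical foreground region
near, far = make_focus_pair(sharp, mask)
```

In this construction the sharp source image serves as the all-in-focus ground truth for training.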
5.2.3 Medical image fusion
Medical image fusion plays an important role in clinical disease diagnosis. Since different medical images usually capture different features of the organs or tissues, a good fused image can deliver more useful information, which helps the effective diagnosis of some diseases. In this paper, we choose to fuse two different magnetic resonance (MR) images, i.e., the T1 weighted MR image and the T2 weighted MR image. The testing dataset is from the Whole Brain Atlas  from Harvard Medical School. For deep learning based medical image processing, one main problem is the lack of large training datasets. Here, we use the model trained on the multi-focus image fusion task and then fine-tune it using a small medical dataset composed of 30 image pairs from the Whole Brain Atlas . Fig. 9 shows the fused images using different methods, including CSR , Deepfuse network , Densefuse network  and our CU-Net. As we can see, CSR  and Deepfuse network  both lose image energy and have low contrast, which leads to low brightness in the soft-tissue regions. Densefuse  produces over-bright images, which makes many details difficult to see. Our method preserves the structures and details of the organs and tissues well, without losing image energy or introducing brightness distortions.
5.3 Common and Unique Reconstruction Visualization
As shown in Fig. 2, the final reconstructed image is composed of one common reconstruction and one or two unique reconstructions (one for MIR related tasks and two for MIF related tasks). To better verify the effectiveness of our method, we visualize the common and unique components which contribute to the final reconstruction, i.e., Points 1, 2, 3, and 4 in Fig. 2. For the MIR tasks, we take RGB guided depth SR as an example. Fig. 10 shows the common and unique reconstructed images for this task. As we can see, the common reconstruction only contains the high-resolution edge information, which is consistent with the fact that the only common element between the RGB and depth images is the edge details. The depth information is entirely provided by the unique reconstruction, which is reasonable because the RGB image does not contain any depth information. However, in the unique reconstruction, depth discontinuities are still blurred. In the final reconstructed image, the blurring effect is successfully eliminated due to the addition of the common reconstruction, which has sharp high-resolution edges.
For the MIF related tasks, Fig. 11 visualizes the three components of the final fused image for the multi-exposure fusion task. We can clearly see from this figure that the common reconstruction contains the common parts shared between the under-exposed and over-exposed images, while the two unique reconstructions preserve the unique features in each image, e.g., see the yellow and red marks in the images. By adding these three components together, we can obtain the final fused image, which contains both the common feature between the two extremely exposed images and the unique feature of each single image.
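The additive composition described in this subsection can be written down directly. The sketch below only illustrates the final summation step, with images as nested lists; the function name `combine` and its argument layout are hypothetical and do not come from the paper.

```python
def combine(common, *uniques):
    """Sum one common reconstruction with one (MIR) or two (MIF) unique ones.

    All inputs are 2-D lists of floats with identical shapes.
    """
    if not 1 <= len(uniques) <= 2:
        raise ValueError("expected one (MIR) or two (MIF) unique reconstructions")
    fused = [row[:] for row in common]  # copy so the input is left untouched
    for unique in uniques:
        for r, row in enumerate(unique):
            for c, value in enumerate(row):
                fused[r][c] += value
    return fused
```

Called with one unique component it mimics the MIR case (e.g., depth SR); called with two it mimics the MIF case (e.g., multi-exposure fusion).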
5.4 Ablation Study
Residual architecture. As shown in Fig. 3, the UFEM and CFPM modules have residual structures. In the first ablation study, we explore the importance of this residual architecture for the reconstruction performance. To do this, we first remove the two residual arrows in the modules, which makes UFEM and CFPM pure feed-forward CNN modules. We then re-train the network using the same training strategy. Table V shows the results on the RGB guided depth image SR task with and without the residual structures. As we can see, the residual architecture indeed helps increase the reconstruction accuracy.
Filter size. We further analyse the effects of filter size on the reconstruction performance. Table VI shows the results on the RGB guided depth SR task when we gradually increase the filter size from 3 to 9. We can see that the reconstruction performance improves with larger filters. However, when the filter is too large, e.g., , the reconstruction accuracy decreases. A possible reason for this phenomenon is that a very large filter support may overlook local details, which leads to less accurate reconstructions. In this paper, we choose to use a filter size of .
Network depth. The network depth also plays an important role in the reconstruction performance. Here, we use the number of LCSC blocks to indicate the network depth. Table VII presents the results with different numbers of LCSC blocks. The reconstruction accuracy improves with the depth of the network; however, a deeper network also has a larger model size and higher training complexity. In this paper, we use 4 LCSC blocks in each module, which represents a good trade-off between reconstruction accuracy and training complexity.
| Model size (KB) | 66 | 99 | 132 | 165 | 198 | 231 |
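The model sizes in the row above grow by a constant step, which is what one would expect if each additional LCSC block adds the same number of parameters. The short check below makes this explicit. It assumes that the row belongs to the depth ablation and that consecutive columns differ by a fixed number of blocks; the column labels themselves are not recoverable here.

```python
# Model sizes in KB, copied from the table row above.
model_size_kb = [66, 99, 132, 165, 198, 231]

def size_steps(sizes):
    """Differences between consecutive entries."""
    return [b - a for a, b in zip(sizes, sizes[1:])]

def grows_linearly(sizes):
    """True when every consecutive difference is identical."""
    return len(set(size_steps(sizes))) == 1

steps = size_steps(model_size_kb)
print(steps, grows_linearly(model_size_kb))
```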
| Task | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 |
5.5 Running Speed
For real-time applications, the computational complexity of a method is an important factor. Our method is a deep learning based method using a feed-forward network architecture, which has the advantage of fast running speed. In Table VIII, we present the running time of our method for the six applications presented in this paper. Task 1 to Task 6 denote RGB guided depth image SR, RGB guided MS image SR, flash guided non-flash image denoising, multi-exposure image fusion, multi-focus image fusion, and medical image fusion, respectively. The running time is measured on a PC with an NVIDIA GeForce GTX 1080 Ti GPU.
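Per-image running time of a feed-forward model is typically measured as an average over repeated calls after a few warm-up runs. The helper below is a generic sketch of such a measurement and is not the timing protocol used for Table VIII; `average_runtime`, `warmup` and `repeats` are hypothetical names. On a GPU, where kernels are launched asynchronously, the callable would additionally have to wait for the device to finish before returning.

```python
import time

def average_runtime(fn, *args, warmup=2, repeats=10):
    """Average wall-clock seconds per call of fn(*args).

    The warm-up calls are excluded from the measurement so that one-off
    costs (memory allocation, caching) do not distort the average.
    """
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats
```

For example, `average_runtime(model, image)` would time a hypothetical `model` callable on a single input.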
6 Conclusion
In this paper, we address the general multi-modal image restoration and image fusion problems by proposing a novel and flexible CNN architecture, named Common and Unique information splitting network (CU-Net). Different from other approaches, our network architecture is derived from a newly proposed multi-modal convolutional sparse coding (MCSC) model, which makes each part of our network interpretable. To verify the effectiveness of our CU-Net, we conduct extensive experiments on six different multi-modal image restoration and fusion tasks. We also conduct a comprehensive ablation study to explore the contribution of each component in the network and the effects of some critical network hyper-parameters, such as filter size and network depth. The experimental results show that our method outperforms other state-of-the-art methods in all the tasks considered, with a small model size and fast running speed.
Appendix A Learned Convolutional Sparse Coding 
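The typical convolutional sparse coding problem with an $\ell_1$ constraint is formulated as follows:
$$\min_{\{z_i\}} \frac{1}{2} \Big\| x - \sum_{i=1}^{k} d_i * z_i \Big\|_2^2 + \lambda \sum_{i=1}^{k} \| z_i \|_1, \tag{14}$$
where $x$ is the input image, $\{d_i\}_{i=1}^{k}$ is a set of known filters, and $z_i$ is the filter response to be computed. Since the convolutional operation is linear, we can construct a Toeplitz matrix $\mathbf{D}_i$ to make $\mathbf{D}_i \mathbf{z}_i = d_i * z_i$, where $\mathbf{z}_i$ is the vectorized $z_i$. Then, we can turn (14) to the following:
$$\min_{\{\mathbf{z}_i\}} \frac{1}{2} \Big\| \mathbf{x} - \sum_{i=1}^{k} \mathbf{D}_i \mathbf{z}_i \Big\|_2^2 + \lambda \sum_{i=1}^{k} \| \mathbf{z}_i \|_1, \tag{15}$$
where $\mathbf{x}$ is the vectorized $x$. By concatenating all the $\mathbf{D}_i$, we can have $\mathbf{D} = [\mathbf{D}_1, \dots, \mathbf{D}_k]$. By putting all the $\mathbf{z}_i$ in a column, we can have $\mathbf{z} = [\mathbf{z}_1^T, \dots, \mathbf{z}_k^T]^T$. Then, Eq. (15) can be turned to a traditional sparse coding problem:
$$\min_{\mathbf{z}} \frac{1}{2} \| \mathbf{x} - \mathbf{D} \mathbf{z} \|_2^2 + \lambda \| \mathbf{z} \|_1. \tag{16}$$
This problem can be solved with the iterative shrinkage-thresholding algorithm (ISTA), whose update at iteration $j$ is
$$\mathbf{z}^{j+1} = S_{\theta}\Big( \mathbf{z}^{j} + \frac{1}{L} \mathbf{D}^T ( \mathbf{x} - \mathbf{D} \mathbf{z}^{j} ) \Big), \quad \theta = \lambda / L, \tag{17}$$
where $S_{\theta}$ is the soft-thresholding operator with threshold $\theta$ and $L$ is an upper bound on the largest eigenvalue of $\mathbf{D}^T \mathbf{D}$.
By replacing the matrix multiplication in Eq. (17) with convolutional operations, we can have the following:
$$Z^{j+1} = S_{\theta}\big( Z^{j} + W_e * ( x - W_d * Z^{j} ) \big), \tag{18}$$
where $W_e$ and $W_d$ are the two sets of filters which lead to the Toeplitz matrices $\frac{1}{L} \mathbf{D}^T$ and $\mathbf{D}$, respectively, and $Z$ is the stack of $z_i$ to be learned in Eq. (14). In the learned convolutional sparse coding (LCSC), the $W_e$ and $W_d$ are learnable as convolutional layers in a deep network.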
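To make the iteration behind LCSC concrete, the following sketch runs a convolutional iterative soft-thresholding loop on a 1-D signal with fixed filters. It is an illustration under stated assumptions rather than the LCSC block itself: the filters are not learned, the analysis step simply uses the adjoint (correlation) of the synthesis filters scaled by 1/L, all filters are assumed to have the same length, and every function name is hypothetical.

```python
import math

def conv_full(d, z):
    """Full 1-D convolution d * z (output length len(z) + len(d) - 1)."""
    out = [0.0] * (len(z) + len(d) - 1)
    for t, zt in enumerate(z):
        for s, ds in enumerate(d):
            out[t + s] += ds * zt
    return out

def corr_valid(d, r, n):
    """Adjoint of conv_full with respect to z: correlate r with d (length n)."""
    return [sum(ds * r[t + s] for s, ds in enumerate(d)) for t in range(n)]

def soft_threshold(v, theta):
    """Element-wise soft-thresholding with threshold theta."""
    return [math.copysign(max(abs(a) - theta, 0.0), a) for a in v]

def synthesize(filters, codes):
    """Sum of d_i * z_i over all filters (equal filter lengths assumed)."""
    x = [0.0] * (len(codes[0]) + len(filters[0]) - 1)
    for d, z in zip(filters, codes):
        for t, v in enumerate(conv_full(d, z)):
            x[t] += v
    return x

def objective(x, filters, codes, lam):
    """0.5 * ||x - sum_i d_i * z_i||^2 + lam * sum_i ||z_i||_1."""
    residual = [a - b for a, b in zip(x, synthesize(filters, codes))]
    data_term = 0.5 * sum(v * v for v in residual)
    return data_term + lam * sum(abs(v) for z in codes for v in z)

def ista(x, filters, n, lam, iters):
    """Convolutional ISTA; x must have length n + len(filters[0]) - 1."""
    # Upper bound on the Lipschitz constant: sum_i ||d_i||_1^2.
    L = sum(sum(abs(v) for v in d) ** 2 for d in filters)
    codes = [[0.0] * n for _ in filters]
    for _ in range(iters):
        residual = [a - b for a, b in zip(x, synthesize(filters, codes))]
        codes = [soft_threshold([zt + g / L for zt, g in
                                 zip(z, corr_valid(d, residual, n))], lam / L)
                 for d, z in zip(filters, codes)]
    return codes
```

In LCSC, the two convolutions inside the loop become learnable layers and the loop is unrolled for a fixed number of iterations, so the step size and threshold are trained instead of being derived from L.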
-  L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2414–2423.
-  J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European conference on computer vision. Springer, 2016, pp. 694–711.
-  Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, “Image fusion with convolutional sparse representation,” IEEE Signal Processing Letters, vol. 23, no. 12, pp. 1882–1886, 2016.
-  H. Li and X.-J. Wu, “Densefuse: A fusion approach to infrared and visible images,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2614–2623, 2019.
-  S. Gu, W. Zuo, S. Guo, Y. Chen, C. Chen, and L. Zhang, “Learning dynamic guidance for depth image enhancement,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3769–3778.
-  W. Liu, X. Chen, J. Yang, and Q. Wu, “Robust color guided depth map restoration,” IEEE Transactions on Image Processing, vol. 26, no. 1, pp. 315–327, 2017.
-  Q. Yan, X. Shen, L. Xu, S. Zhuo, X. Zhang, L. Shen, and J. Jia, “Cross-field joint image restoration via scale map,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1537–1544.
-  H. Kwon, Y.-W. Tai, and S. Lin, “Data-driven depth map refinement via multi-scale sparse representation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 159–167.
-  P. Song, X. Deng, J. F. Mota, N. Deligiannis, P.-L. Dragotti, and M. Rodrigues, “Multimodal image super-resolution via joint sparse representations induced by coupled dictionaries,” IEEE Transactions on Computational Imaging, 2019.
-  Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Joint image filtering with deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1909–1923, 2019.
-  B. Kim, J. Ponce, and B. Ham, “Deformable kernel networks for joint image filtering,” 2018.
-  X. Deng and P. L. Dragotti, “Deep coupled ISTA network for multi-modal image super-resolution,” IEEE Transactions on Image Processing, 2019, to appear.
-  I. Marivani, E. Tsiligianni, B. Cornelis, and N. Deligiannis, “Learned multimodal convolutional sparse coding for guided image super-resolution,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 2891–2895.
-  J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, “Joint bilateral upsampling,” ACM Transactions on Graphics (ToG), vol. 26, no. 3, p. 96, 2007.
-  K. He, J. Sun, and X. Tang, “Guided image filtering,” in European Conference on Computer Vision. Springer, 2010, pp. 1–14.
-  Z. Ma, K. He, Y. Wei, J. Sun, and E. Wu, “Constant time weighted median filtering for stereo matching and beyond,” in International Conference on Computer Vision (ICCV), 2013, pp. 49–56.
-  F. Kou, W. Chen, C. Wen, and Z. Li, “Gradient domain guided image filtering,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4528–4539, 2015.
-  R. J. Jevnisek and S. Avidan, “Co-occurrence filter,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3184–3192.
-  T. Brox, O. Kleinschmidt, and D. Cremers, “Efficient nonlocal means for denoising of textural patterns,” IEEE Transactions on Image Processing, vol. 17, no. 7, pp. 1083–1092, 2008.
-  Q. Zhang, X. Shen, L. Xu, and J. Jia, “Rolling guidance filter,” in European conference on computer vision. Springer, 2014, pp. 815–830.
-  X. Shen, C. Zhou, L. Xu, and J. Jia, “Mutual-structure for joint filtering,” in International Conference on Computer Vision (ICCV), 2015, pp. 3406–3414.
-  X. Shen, Q. Yan, L. Xu, L. Ma, and J. Jia, “Multispectral joint image restoration via optimizing a scale map,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2518–2530, 2015.
-  B. Ham, M. Cho, and J. Ponce, “Robust guided image filtering using nonconvex potentials,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, pp. 192–207, 2018.
-  X. Guo, Y. Li, J. Ma, and H. Ling, “Mutually guided image filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep joint image filtering,” in European Conference on Computer Vision. Springer, 2016, pp. 154–169.
-  H. Wu, S. Zheng, J. Zhang, and K. Huang, “Fast end-to-end trainable guided filter,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1838–1847.
-  J. Pan, J. Dong, J. S. Ren, L. Lin, J. Tang, and M.-H. Yang, “Spatially variant linear representation models for joint filtering,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1702–1711.
-  X. Song, Y. Dai, and X. Qin, “Deep depth super-resolution: Learning depth super-resolution using deep convolutional neural network,” in Asian Conference on Computer Vision. Springer, 2016, pp. 360–376.
-  C. Guo, C. Li, J. Guo, R. Cong, H. Fu, and P. Han, “Hierarchical features driven residual learning for depth map super-resolution,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2545–2557, 2018.
-  X. Song, Y. Dai, and X. Qin, “Deeply supervised depth map super-resolution as novel view synthesis,” IEEE Transactions on Circuits and Systems for Video Technology, 2018.
-  Y. Wen, B. Sheng, P. Li, W. Lin, and D. D. Feng, “Deep color guided coarse-to-fine convolutional network cascade for depth image super-resolution,” IEEE Transactions on Image Processing, vol. 28, no. 2, pp. 994–1006, 2019.
-  F. Lahoud, R. Zhou, and S. Susstrunk, “Multi-modal spectral image super-resolution,” in European Conference on Computer Vision (ECCV), 2018.
-  Z. Shi, C. Chen, Z. Xiong, D. Liu, Z.-J. Zha, and F. Wu, “Deep residual attention network for spectral image super-resolution,” in European Conference on Computer Vision (ECCV), 2018.
-  S. Lohit, D. Liu, H. Mansour, and P. T. Boufounos, “Unrolled projected gradient descent for multi-spectral image fusion,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7725–7729.
-  S. Li, X. Kang, L. Fang, J. Hu, and H. Yin, “Pixel-level image fusion: A survey of the state of the art,” Information Fusion, vol. 33, pp. 100–112, 2017.
-  G. Pajares and J. M. De La Cruz, “A wavelet-based image fusion tutorial,” Pattern recognition, vol. 37, no. 9, pp. 1855–1872, 2004.
-  L. Cao, L. Jin, H. Tao, G. Li, Z. Zhuang, and Y. Zhang, “Multi-focus image fusion based on spatial frequency in discrete cosine transform domain,” IEEE Signal Processing Letters, vol. 22, no. 2, pp. 220–224, 2014.
-  Q. Zhang and B.-l. Guo, “Multifocus image fusion using the nonsubsampled contourlet transform,” Signal processing, vol. 89, no. 7, pp. 1334–1346, 2009.
-  B. Yang and S. Li, “Pixel-level image fusion with simultaneous orthogonal matching pursuit,” Information fusion, vol. 13, no. 1, pp. 10–19, 2012.
-  Q. Wei, J. Bioucas-Dias, N. Dobigeon, and J.-Y. Tourneret, “Hyperspectral and multispectral image fusion based on a sparse representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 7, pp. 3658–3668, 2015.
-  J. Hu and S. Li, “The multiscale directional bilateral filter and its application to multisensor image fusion,” Information Fusion, vol. 13, no. 3, pp. 196–206, 2012.
-  J. H. Jang, Y. Bae, and J. B. Ra, “Contrast-enhanced fusion of multisensor images using subband-decomposed multiscale retinex,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3479–3490, 2012.
-  Y. Liu, X. Chen, H. Peng, and Z. Wang, “Multi-focus image fusion with a deep convolutional neural network,” Information Fusion, vol. 36, pp. 191–207, 2017.
-  K. R. Prabhakar, V. S. Srikar, and R. V. Babu, “Deepfuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs,” in International Conference on Computer Vision (ICCV), 2017, pp. 4724–4732.
-  H. Li and X.-J. Wu, “Densefuse: A fusion approach to infrared and visible images,” IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2614–2623, 2019.
-  J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang, “Fusiongan: A generative adversarial network for infrared and visible image fusion,” Information Fusion, vol. 48, pp. 11–26, 2019.
-  B. Wohlberg, “Efficient algorithms for convolutional sparse representations,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 301–315, 2015.
-  H. Sreter and R. Giryes, “Learned convolutional sparse coding,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2191–2195.
-  J. Xie, R. S. Feris, and M.-T. Sun, “Edge-guided single depth image super resolution,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 428–438, 2016.
-  J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, “High quality depth map upsampling for 3D-TOF cameras,” in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 1623–1630.
-  D. Ferstl, M. Ruther, and H. Bischof, “Variational depth superresolution using example-based edge representations,” in International Conference on Computer Vision (ICCV), 2015, pp. 513–521.
-  J. Lu and D. Forsyth, “Sparse depth super resolution,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2245–2253.
-  D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang, “Robust single image super-resolution via deep networks with sparse prior,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3194–3207, 2016.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Conference on Computer Vision and Pattern Recognition workshops (CVPRW), 2017, pp. 136–144.
-  Z. Li, J. Yang, Z. Liu, X. Yang, G. Jeon, and W. Wu, “Feedback network for image super-resolution,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3867–3876.
-  X. Deng, P. Song, M. R. Rodrigues, and P. L. Dragotti, “RADAR: Robust algorithm for depth image super resolution based on FRI theory and multimodal dictionary learning,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
-  J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1646–1654.
-  W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep Laplacian pyramid networks for fast and accurate super-resolution,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 624–632.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
-  G. Riegler, D. Ferstl, M. Rüther, and H. Bischof, “A deep primal-dual network for guided depth super-resolution,” arXiv preprint arXiv:1607.08569, 2016.
-  H. Hirschmuller and D. Scharstein, “Evaluation of cost functions for stereo matching,” in Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2007, pp. 1–8.
-  D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in European Conference on Computer Vision (ECCV). Springer, 2012, pp. 611–625.
-  Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
-  F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, “Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum,” IEEE Transactions on Image Processing, vol. 19, no. 9, pp. 2241–2253, 2010.
-  Y. Aksoy, C. Kim, P. Kellnhofer, S. Paris, M. Elgharib, M. Pollefeys, and W. Matusik, “A dataset of flash and ambient illumination pairs from the crowd,” in European Conference on Computer Vision (ECCV), 2018, pp. 634–649.
-  K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image restoration by sparse 3D transform-domain collaborative filtering,” in Image Processing: Algorithms and Systems VI, vol. 6812. International Society for Optics and Photonics, 2008, p. 681207.
-  J. Cai, S. Gu, and L. Zhang, “Learning a deep single image contrast enhancer from multi-exposure images,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 2049–2062, 2018.
-  K. Ma, H. Li, H. Yong, Z. Wang, D. Meng, and L. Zhang, “Robust multi-exposure image fusion: a structural patch decomposition approach,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2519–2532, 2017.
-  K. Ma, Z. Duanmu, H. Yeganeh, and Z. Wang, “Multi-exposure image fusion by optimizing a structural similarity index,” IEEE Transactions on Computational Imaging, vol. 4, no. 1, pp. 60–72, 2018.
-  C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in European conference on computer vision. Springer, 2016, pp. 391–407.
-  M. Nejati, S. Samavi, and S. Shirani, “Multi-focus image fusion using dictionary-based sparse representation,” Information Fusion, vol. 25, pp. 72–84, 2015.
-  D. Summers, “Harvard whole brain atlas: www.med.harvard.edu/aanlib/home.html,” Journal of Neurology, Neurosurgery & Psychiatry, vol. 74, no. 3, pp. 288–288, 2003.
-  I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.