A Cross-Modal Image Fusion Theory Guided by Human Visual Characteristics
The human visual perception system is highly robust in image fusion. This robustness rests on two characteristics of the system: feature selection and the nonlinear fusion of different features. To simulate the human visual perception mechanism in image fusion tasks, we propose a multi-source image fusion framework that combines illuminance factors and attention mechanisms. The framework effectively combines traditional image features with modern deep learning features. First, we perform multi-scale decomposition of the multi-source images. Then, the visual saliency map and the deep feature map are combined with the illuminance fusion factor to perform nonlinear fusion of the high- and low-frequency components. Next, the fused high- and low-frequency features are selected by a channel attention network to obtain the final fused image. By simulating the nonlinear and selective characteristics of the human visual perception system, the fused image agrees better with the human visual perception mechanism. Finally, we validate our fusion framework on public datasets of infrared and visible images, medical images, and multi-focus images. The experimental results demonstrate the superiority of our fusion framework over state-of-the-art methods in visual quality, objective fusion metrics, and robustness.
Keywords: image fusion, auxiliary learning, non-linear fusion characteristics, feature selection characteristics, deep learning.
In the aircraft combined vision system, fusing the enhanced vision image with the synthetic vision image can effectively improve pilots’ situational awareness in complex flight environments. One important reason why scene image fusion has not been widely adopted in aviation is the limited robustness of existing fusion techniques, whereas the human brain is highly robust in image fusion tasks. According to cognitive psychology Treisman1980A (), information processing in the human brain is characterized by feature selection, nonlinear combination, and collaborative perception with auxiliary tasks. As a highly complex nonlinear system, the human brain filters the features of a perceived target according to subjective intention, ignores uncertain signals, and fuses the non-mutually-exclusive features that meet its subjective needs according to prior knowledge Treisman1980A (). Drawing on stored prior knowledge Treisman1980A (), the brain also reasons about and explores a new target task in order to perceive it. These three characteristics make the human brain more robust and more general than existing algorithms in image fusion and in other tasks.
In recent years, deep learning methods inspired by neurobiology have achieved remarkable results in image fusion. They outperform traditional methods, but a large gap remains between their results and the fusion performed by the human brain. To bring fusion results closer to the human brain’s fusion mechanism and narrow this gap, we propose an end-to-end unsupervised learning framework for combined vision image fusion based on the theory of human cognitive psychology Treisman1980A (). Our framework uses multiple image fusion loss functions and multiple auxiliary image tasks to optimize the learning of the network weights. It combines an attention mechanism with a nonlinear neural network to simulate the feature selection and nonlinear combination characteristics of human brain image fusion, which effectively improves the robustness of image fusion. The proposed framework is not a simple combination of existing deep learning methods: it fully exploits the strong feature representation and nonlinear fitting abilities of CNNs, together with the human brain image fusion mechanism, in the combined vision image fusion task.
To demonstrate the advantage of our framework over existing mainstream algorithms, Figure 1 gives a representative qualitative comparison on enhanced and synthetic vision images from the scene fusion data set. The two images on the left are the enhanced vision image and the synthetic vision image, the third image is the fusion result of a traditional operator, and the fourth image is the fusion result of our algorithm. The fused images show that our algorithm is subjectively superior to the mainstream image fusion algorithm.
The main contributions of our work include the following five points:
First, we analyze the impact of common objective evaluation indexes on image quality, and the impact of different fusion criteria (maximum, sum, weighted average, and our proposed nonlinear fusion criterion) on image quality.
Second, we analyze the feature selection and nonlinear combination characteristics of human brain information processing and, on this basis, apply feature selection and nonlinear fusion operations to the features of the image fusion task. The experimental results show the effectiveness of our method.
Third, we extend the unsupervised learning framework into an unsupervised network with multiple losses. We introduce a mechanism modeled on auxiliary learning in the human brain, based on the tasks of image reconstruction and multi-focus image fusion, and study the influence of this auxiliary learning mechanism on the image fusion task.
Fourth, based on the above theoretical research, we propose Alan, a robust combined vision image fusion framework in which multi-task loss functions are optimized collaboratively. Verification on combined vision image data, infrared and visible image data, and multi-focus image data shows that our unsupervised learning framework is more general and robust than existing algorithm frameworks.
Fifth, data sets and code. We have produced and released data sets of enhanced and synthetic vision images. In addition, to support other researchers in image fusion, we will collect on our GitHub homepage the code of more than 20 recent image fusion algorithms, some data sets, and the code of 6 image fusion algorithms that are not compared in this paper.
2 Related work
In this section, we define the combined vision image, review the neurobiology-inspired convolutional neural network methods for image fusion proposed in recent years, analyze the problems of existing image fusion algorithms, and propose a robust unsupervised fusion framework optimized by multi-task losses and auxiliary learning.
2.1 Image fusion
Combined scene image fusion refers to the effective fusion of enhanced and synthetic scene images. The enhanced vision image mainly contains real-time dynamic information about the airport runway and is used to cope with the impact of an unstructured environment on blind landing. The synthetic vision image mainly combines the static structured environment of the airport runway with the aircraft’s position and attitude, which provides prior information for recognizing and locating the runway. Fusing the enhanced vision image with the synthetic vision image can effectively improve pilots’ situational awareness of the landing environment under complex conditions. However, current mainstream fusion algorithms are mainly developed on infrared and visible images, multi-exposure images, multi-focus images, and similar data, while image fusion algorithms for aviation have received less attention.
There are three main approaches to image fusion based on deep learning. The first combines image transformation with deep learning features and uses a pre-trained convolutional neural network only to provide deep feature maps Lahoud2019FastZERO (); Li_2018DL (); Liu2017InfraredCNN (); Zhang_2018RCAN (). The second is the end-to-end convolutional neural network based on a Siamese (twin) network, which is trained by iteratively optimizing an objective function Li2018DenseFuse (); PrabhakarK.Ram2017DADU (); YanXiang2018UDMI (); MaFusionGAN (). The third also builds an end-to-end deep convolutional neural network but, unlike the second, casts image fusion as image classification; it is better suited to multi-focus image fusion Liu2017MultiCNN (); MaBoyuan2019SAUD (). In terms of fusion criteria, both traditional algorithms and deep learning methods mainly adopt maximum fusion, sum fusion, and weighted average fusion Ma2018Infrared (); Lahoud2019FastZERO (); Li_2018DL (); Liu2017InfraredCNN (); Zhang_2018RCAN (); PrabhakarK.Ram2017DADU (). To address the universality of image fusion, Zhang ZhangYu2020IAgi () proposed the IFCNN network framework based on the MEF framework. The method is trained by supervised learning on a multi-focus data set, and the trained weights are then applied to different image fusion tasks with different fusion rules. It focuses more on the generality of the fusion framework than on the robustness of the fusion algorithm. However, we have found no deep learning method designed for scene image fusion.
No new theoretical breakthrough has appeared in the field of image fusion. Reviewing the deep learning algorithms applied to image fusion in recent years, we identify the following problems:
First, effective features are not screened a second time, and nonlinear fusion weights between pixels are not learned. After extracting features, most algorithms directly combine them by maximum, summation, or weighted averaging, so the fusion rule is fixed and simple. Only the strong feature representation ability of CNNs is used, and the non-exclusive relationships between feature maps are not considered.
Second, in image fusion, and especially for cross-modal image data, ground-truth labels are difficult to obtain. The lack of ground-truth labels makes end-to-end unsupervised deep learning methods harder to train and less accurate than supervised learning methods.
Third, the loss functions are simple. End-to-end fusion networks often use a single loss function, such as SSIM or MSE. Such an objective performs well in supervised learning, but for cross-modal unsupervised learning the resulting fusion quality is far from human subjective evaluation.
Finally, existing image fusion algorithms focus on the image fusion task itself and do not consider the influence of related tasks on the main fusion task.
In summary, based on the feature selection and nonlinear fusion characteristics of the human brain’s image fusion mechanism, we learn the feature selection weights and feature fusion weights by combining an attention mechanism with a nonlinear neural network. To bring the fusion process closer to the human brain’s fusion mechanism, we explore the brain’s auxiliary learning mechanism and propose a method in which auxiliary tasks collaboratively optimize the main task. We make full use of the strong feature representation and nonlinear fitting abilities of CNNs and propose an unsupervised auxiliary learning network framework based on the attention mechanism.
The rest of this paper is organized as follows. Section 2 discusses the development and problems of deep learning based image fusion. In Section 3, we analyze the impact of multiple losses on image quality and build a robust unsupervised learning framework with collaborative multi-task loss optimization. Section 4 presents experiments and result analysis for different algorithms on different public datasets. The results are qualitatively analyzed and discussed in Section 5, and the conclusions are summarized in Section 6.
2.2 General framework
Figure 2 shows our proposed unsupervised learning framework, Alan. The framework consists of a main network and two subnetworks. The main network performs the combined scene image fusion task. The two subnetworks are an image reconstruction network and a multi-focus image fusion network. The combined scene fusion network and the multi-focus image fusion network share a common backbone network and its weights, in order to learn the features common to the different data. After the shared backbone, each task follows its own branch network to learn the characteristics specific to its data. In the branch networks, the characteristics of the different tasks act as regularization terms. On the one hand, this prevents the overfitting caused by minimizing the empirical loss alone; on the other hand, the constraints of the different task loss functions accelerate convergence.
As shown in Figure 2, our unsupervised image fusion algorithm is developed in four steps. First, we analyze the evaluation indexes that affect image quality. Second, we propose the theory of nonlinear fusion and feature selection. Third, building on the unsupervised learning network, we study the influence of the auxiliary task learning mechanism on the main image fusion task, following the way the human brain learns new knowledge. Finally, based on the characteristics of human brain image fusion and the auxiliary task learning mechanism, we construct the unsupervised combined vision image fusion framework.
3.1 Image quality evaluation relationship
The first image fusion approach described in Section 2.1 mainly uses the strong feature representation ability of CNNs and ignores their strong ability to fit nonlinear relationships. End-to-end convolutional neural networks make better use of both abilities, so they are more general and robust than the first approach. End-to-end deep convolutional neural networks can be trained by supervised or unsupervised learning. In image fusion, the main end-to-end unsupervised methods are those proposed by Prabhakar and Yan PrabhakarK.Ram2017DADU (); YanXiang2018UDMI (). These methods use only a combination of SSIM and variance as the loss function of the unsupervised network. However, a single image quality parameter cannot evaluate image quality effectively. Figure 3 shows an example from the FLIR data set: in this qualitative analysis of a severely degraded image, the image with the high subjective score has a very low SSIM value because the original image is seriously degraded.
According to cognitive psychology, this is mainly due to the visual masking and brightness contrast characteristics of the human visual perception system. When image quality is seriously degraded, SSIM and human subjective evaluation differ widely Treisman1980A (). MSE is also a mainstream objective function in image fusion and image reconstruction. It cannot effectively capture the perceptual difference between the predicted image and the real image, so the reconstructed image lacks high-frequency information and is overly smooth LedigChristian2017PSIS (). To solve this problem, Johnson JohsonJustin2016PLfR () proposed the perceptual loss, which makes full use of high-level global information and low-level detail information and effectively overcomes the blur caused by MSE. Although SSIM 1284395 () matches human visual characteristics better than MSE or PSNR because it considers the brightness, contrast, and structure of the image, it still performs poorly on images degraded by strong highlights or serious blur. Therefore, in our network architecture, we combine the perceptual loss (3), the MSE loss (2), the structural similarity loss (1), and the PSNR loss (4) to optimize the network.
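The idea of combining several quality terms into one objective can be illustrated with a minimal sketch. It is not the authors' implementation: it works on small grayscale images stored as nested lists, uses a simplified single-window SSIM, and leaves out the perceptual loss because that term needs a pretrained feature network. The function names, the weights, and the capped form of the PSNR term are assumptions made only for illustration.

```python
import math

def flatten(img):
    """Flatten a 2-D grayscale image (list of rows) into a list of floats."""
    return [float(p) for row in img for p in row]

def mse(x, y):
    """Mean squared error between two images of equal size."""
    a, b = flatten(x), flatten(y)
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(x, y)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)

def global_ssim(x, y, peak=1.0):
    """SSIM over the whole image as one window (a simplification: the usual
    SSIM averages this statistic over local windows)."""
    a, b = flatten(x), flatten(y)
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((p - mu_a) ** 2 for p in a) / n
    var_b = sum((q - mu_b) ** 2 for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def combined_loss(pred, target, w_mse=1.0, w_ssim=1.0, w_psnr=0.01, psnr_cap=50.0):
    """Weighted sum of MSE, (1 - SSIM) and a capped PSNR penalty.
    All weights and the cap are hypothetical values, not taken from the paper."""
    capped = min(psnr(pred, target), psnr_cap)
    return (w_mse * mse(pred, target)
            + w_ssim * (1.0 - global_ssim(pred, target))
            + w_psnr * (psnr_cap - capped))

target = [[0.0, 0.5], [0.5, 1.0]]
blurred = [[0.25, 0.5], [0.5, 0.75]]   # same mean, reduced contrast
print(combined_loss(target, target))   # identical images: every term vanishes
print(combined_loss(blurred, target))  # degraded prediction: positive loss
```

Because each term reacts to a different kind of degradation, a prediction has to score well on all of them at once, which is the motivation given above for not relying on a single index.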
3.2 Non-linear and feature-selection image fusion theory
For channel feature selection, Hu hu2017squeezeandexcitation () proposed the SENet channel attention network for image classification. Its core idea is to let the loss function guide the network to learn how useful each feature channel is, giving more weight to effective channels and less weight to ineffective ones. Building on the SENet channel attention network, Zhang Zhang_2018RCAN () proposed the residual channel attention module RCAN for image super-resolution. The experimental results show that the channel attention mechanism is effective. All of these methods are derived from the characteristics of the human visual perception system. The intrinsic inference mechanism of human vision indicates that the visual system infers content from prior knowledge in the brain and discards uncertain information Treisman1980A (). Inspired by this, we combine an attention mechanism with a nonlinear convolutional neural network to simulate the feature selection and nonlinear combination characteristics of the human brain’s image fusion mechanism.
Feature selection characteristic
Suppose that the residual convolution after the previous fusion step yields feature maps of a given height, width, and number of channels. As shown in formula 5, a global average pooling (GP) operation is applied to each feature map to obtain its global receptive field, so that the network can set aside spatial relationships and focus on learning the nonlinear relationships between feature channels.
Here the pooled quantity is the pixel value at a given coordinate of the kth channel. As shown in Equation 5, after the global average pooling layer, the output of the attention module is obtained through a convolution, a ReLU activation, a second convolution, a Sigmoid activation, and a product with the input features.
In the channel attention module, the two activation functions are Sigmoid and ReLU, each of the two convolutions has its own weights, and the input to the first convolution is the output of the feature maps after the GP operation.
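The sequence of operations described above (GP, convolution, ReLU, convolution, Sigmoid, product) can be sketched in a few lines. This is an illustrative squeeze-and-excitation style module under stated assumptions, not the authors' network: the names `w_down` and `w_up`, the channel reduction from 2 to 1, and all weight values are hypothetical, and the final product is taken to be a per-channel rescaling as in SENet.

```python
import math

def global_average_pool(features):
    """features: list of C channels, each an H x W list of rows.
    Returns one scalar descriptor per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]

def dense(weights, vector):
    """A 1x1 convolution applied to a C x 1 x 1 descriptor is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def channel_attention(features, w_down, w_up):
    """GP -> conv -> ReLU -> conv -> Sigmoid -> channel-wise rescaling."""
    z = global_average_pool(features)
    scale = sigmoid(dense(w_up, relu(dense(w_down, z))))
    out = [[[s * p for p in row] for row in ch] for s, ch in zip(scale, features)]
    return out, scale

# Two 2x2 channels; the weights below are arbitrary illustrative values.
features = [[[1.0, 1.0], [1.0, 1.0]],
            [[0.0, 2.0], [2.0, 4.0]]]
w_down = [[0.5, 0.5]]      # reduce 2 channels to 1
w_up = [[1.0], [-1.0]]     # expand 1 channel back to 2
rescaled, scale = channel_attention(features, w_down, w_up)
print(scale)               # one weight in (0, 1) per channel
```

Each channel receives a single weight between 0 and 1, so the module can only emphasize or suppress whole feature maps, which matches the "feature selection" role assigned to it in the text.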
Non-linear fusion characteristic
In image fusion, the commonly used fusion criteria are weighted average, maximum, and principal component analysis Ma2018Infrared (), while nonlinear fusion theory has received less study. As a highly complex nonlinear system, however, the human brain must handle very complex logical relations across tasks, which cannot be expressed by a simple weighted average, maximum, or principal component. At the same time, fixed fusion criteria seriously reduce the generality of an image fusion algorithm. To address this problem, and drawing on the strong nonlinear fitting ability of deep convolutional neural networks, we construct a deep convolutional neural network with feature selection to fit the nonlinear weights of image fusion. Our nonlinear fitting network is shown in Figure 2, and the nonlinear fusion weight is indicated in (3).
From formula 6, maximum fusion, weighted average fusion, and summation fusion can all be regarded as special cases of nonlinear fusion weights. Taking the fusion of two images as an example, maximum fusion corresponds to weights of 1 or 0, summation corresponds to both weights being 1, and weighted averaging corresponds to both weights being 0.5. Existing algorithms adopt the maximum, weighted average, or sum because these rules introduce a certain amount of prior knowledge. Such hand-designed fusion criteria are robust only for specific tasks and, to some extent, limit the self-learning ability of the network model. Our nonlinear fusion method instead lets the network learn the fusion weights automatically from the common properties of the training data and the characteristics of each image task. Existing image fusion methods rarely explore this kind of self-learning and instead specify fusion criteria to improve accuracy on a specific task. We therefore combine the human brain’s auxiliary learning mechanism with a nonlinear convolutional neural network for further study. Our proposed framework can learn not only the characteristics common to different data distributions but also the characteristics of specific data sets.
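The claim that the fixed rules are special cases of per-pixel fusion weights can be checked with a small sketch. It assumes a fusion of the form F = w_A · A + w_B · B with one weight per pixel; the helper names `fuse`, `constant`, and `max_weights` are hypothetical, and in the proposed framework the weight maps would come from the network rather than from hand-written rules.

```python
def fuse(a, b, wa, wb):
    """Pixel-wise fusion F = wa * A + wb * B with per-pixel weight maps."""
    return [[wa_ij * a_ij + wb_ij * b_ij
             for a_ij, b_ij, wa_ij, wb_ij in zip(ra, rb, rwa, rwb)]
            for ra, rb, rwa, rwb in zip(a, b, wa, wb)]

def constant(value, like):
    """Weight map filled with one value, shaped like the given image."""
    return [[value for _ in row] for row in like]

def max_weights(a, b):
    """Binary weight maps that select the larger pixel (ties go to A)."""
    wa = [[1.0 if p >= q else 0.0 for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]
    wb = [[1.0 - w for w in row] for row in wa]
    return wa, wb

a = [[0.2, 0.8], [0.5, 0.1]]
b = [[0.6, 0.4], [0.5, 0.9]]

fused_max = fuse(a, b, *max_weights(a, b))                   # weights are 0 or 1
fused_sum = fuse(a, b, constant(1.0, a), constant(1.0, a))   # both weights are 1
fused_avg = fuse(a, b, constant(0.5, a), constant(0.5, a))   # both weights are 0.5
print(fused_max)
```

Any other weight maps, including ones that vary smoothly across the image, fit the same `fuse` call, which is why a learned weight map is strictly more expressive than the three fixed rules.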
3.3 Auxiliary Learning mechanism
In image fusion, the biggest difference between unsupervised and supervised networks is the lack of ground-truth label data, so unsupervised networks generally lag behind supervised networks in training accuracy and are harder to train. In combined vision image fusion and in cross-modal fusion such as infrared and visible light, we face not only the lack of ground-truth labels but also the absence of a complete and effective measure of image quality. Related work in computer vision addresses this problem, for example image fusion with generative adversarial networks MaFusionGAN () and deep learning based image quality assessment (IQA), but these methods face the same problems when training the network, and for cross-modal image fusion in particular no new theoretical breakthrough has been found. Compared with training on a single loss function, our joint training with multiple loss functions greatly improves accuracy, but the quality of fusion across data sets is still no better than that of supervised learning methods and some traditional fusion operators. When the human brain learns to perceive a new task, it uses knowledge of tasks it has already mastered to assist in learning the characteristics of the new task. Therefore, building on the unsupervised learning framework proposed in this section, we introduce an image reconstruction task and a multi-focus image fusion task to assist the learning of the combined vision image fusion task. Learning from different auxiliary tasks makes it possible to mine hidden features that the unsupervised fusion network cannot learn on its own LiuShikun2018EMLw (). Taking the image reconstruction task and the multi-focus image fusion task as auxiliary tasks, and building on the unsupervised combined scene image fusion framework Ulan, we extend our study to the auxiliary task losses and propose the auxiliary learning attention network Alan.
Figure 5 is a schematic diagram of how the subtasks work together: (a) shows the subtasks assisting the main task, and (b) shows the subtasks and the main task being optimized collaboratively at the same time. With this network design, the main network can retain the distinctive characteristics of the two subtasks while extracting the characteristics of its own data, which improves the universality and robustness of the network model. To express the structure of the network more clearly, we build a mathematical model, shown in formula 7.
In the formula, the symbols denote the following: the representation of tasks 0 to l at layer i; R, the nonlinear activation layer; the convolution weight of task l at the given network layer; the input of task l to the network; and the sum over the tasks at layer i-1 of the convolutional neural network.
The image reconstruction and multi-focus image fusion tasks are closely related to the combined vision image fusion task. By introducing them for auxiliary learning, the end-to-end unsupervised learning framework can in effect be turned into a supervised one. Because many labeled public data sets are available for image reconstruction and multi-focus image fusion, we build the deep convolutional neural networks for these tasks with end-to-end supervised learning. In the auxiliary subtasks, we jointly consider structure, contrast, and brightness information, as well as deep feature information from different network depths. Our loss function consists mainly of a content loss, a structural similarity loss, an MSE loss, and a peak signal-to-noise ratio loss.
In the formula, the symbols denote, in order, the output image of the network, the input image of the network, the pixel loss, the structural loss between the input and the output, and the overall loss function of the network.
3.4 Unsupervised Attention network
Our proposed unsupervised learning network framework is shown in Figure 2. It consists of one main task and two subtasks. The main task, combined vision image fusion, is an end-to-end unsupervised learning network to which we add an attention module. By combining the attention mechanism with a nonlinear convolutional neural network, we can effectively simulate the feature selection and nonlinear combination characteristics of human brain image fusion. The main task network is composed of multi-scale convolution and attention modules in series and is combined with the two subtask models to form a dual-input, single-output architecture. To train the main task, we used 4000 original combined vision samples at a resolution of 1280x1024 each, expanded to 20000 by data augmentation. To increase the diversity of the data, we added 2999 pre-registered infrared and visible light samples at a resolution of 320x256 each, also expanded to 20000 by data augmentation. In main task training, all inputs are grayscale images of size 80x64.
Subtask 1 is the image reconstruction task, which uses an end-to-end supervised learning network. Some existing image reconstruction methods, such as super-resolution reconstruction LedigChristian2017PSIS () and image restoration, only downsample the image or only consider one specific type of noise, so they perform well on specific data sets but adapt poorly to complex real environments. Our image reconstruction network uses a stacked autoencoder to encode and decode the image. It differs from existing autoencoder networks in that we add a dense connection module to the autoencoder and apply special preprocessing to the training data. Starting from the COCO 2014 data set, we jointly and randomly adjust brightness, blur, and Gaussian noise within a certain range, so that the distribution of the training data is as close as possible to that of real-environment data. To train the image reconstruction subtask, we used more than 70000 training images and 10000 validation images. Because of limited GPU memory, we resize the preprocessed images to 256x256.
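The joint random degradation of the training images can be sketched as follows. The paper does not state the parameter ranges or the blur kernel, so everything specific here is an assumption: the brightness gain range, the use of a repeated 3x3 box blur in place of whatever blur the authors applied, the noise level, and the function names `box_blur` and `degrade`.

```python
import random

def box_blur(img):
    """3x3 mean filter with edge replication (a stand-in for the unspecified blur)."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), h - 1)
                    jj = min(max(j + dj, 0), w - 1)
                    acc += img[ii][jj]
            row.append(acc / 9.0)
        out.append(row)
    return out

def degrade(img, rng, brightness=(0.5, 1.5), max_blur_passes=2, max_sigma=0.05):
    """Jointly perturb brightness, blur and Gaussian noise.
    All ranges are hypothetical; pixel values are assumed to lie in [0, 1]."""
    gain = rng.uniform(*brightness)
    passes = rng.randint(0, max_blur_passes)
    sigma = rng.uniform(0.0, max_sigma)
    out = [[p * gain for p in row] for row in img]
    for _ in range(passes):
        out = box_blur(out)
    # Add noise, then clip back to the valid intensity range.
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in out]

rng = random.Random(0)                 # fixed seed so the sketch is reproducible
clean = [[(i + j) / 6.0 for j in range(4)] for i in range(4)]
degraded = degrade(clean, rng)         # network input; `clean` is the target
```

In a supervised reconstruction setup of this kind, the degraded image would serve as the network input and the clean image as the label, which is how the subtask obtains ground truth without manual annotation.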
Subtask 2 is multi-focus image fusion. It uses an end-to-end supervised learning network whose framework is essentially the same as that of the main task. For training, we used the publicly released Lytro data and expanded it to 20000 images at a resolution of 80x64 each.
To prevent the main task loss function from affecting the convolution weights of the subtasks, we train the subtasks separately and then fix their convolution weights. The fixed subtask weights are combined with the main task as part of its basic nodes, and the objective function of the main task is used to optimize only the weights of the main task nodes. The overall loss function of our auxiliary learning network is shown in formula 9.
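The training scheme described above, in which pre-trained subtask weights stay fixed while only the main task weights are updated, can be illustrated with a deliberately tiny sketch. Scalars stand in for convolution weight tensors, and the parameter names, the gradient values, and the plain gradient-descent update are assumptions chosen for illustration rather than details from the paper.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """One gradient-descent step that leaves frozen parameters untouched."""
    return {name: value if name in frozen else value - lr * grads[name]
            for name, value in params.items()}

# Hypothetical parameter names; each scalar stands in for a weight tensor.
params = {"recon.conv": 0.8, "focus.conv": -0.3, "main.conv": 0.5}
frozen = {"recon.conv", "focus.conv"}   # subtask weights trained beforehand
grads = {"recon.conv": 1.0, "focus.conv": 1.0, "main.conv": 1.0}

updated = sgd_step(params, grads, frozen)
print(updated)   # only "main.conv" has moved
```

Even though the main task loss produces gradients for every parameter it touches, the frozen set guarantees that the subtask features keep the behavior they learned from their own labeled data.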
4.1 Experimental Setup
To evaluate the robustness and generality of our algorithm, we run experiments on data sets from different image fusion tasks. First, we compare our image fusion framework with existing frameworks on the combined scene image data set, the infrared and visible image data set, and the multi-focus image data set. Second, we compare three fusion criteria: maximum, sum, and weighted average. Third, we compare the effects of different loss functions on the main image fusion task. Finally, we compare frameworks designed for a single image fusion task with our framework.
In the first experiment, we first test on the combined vision image data set, which has 4000 pairs of original images. Second, we take infrared and visible images of natural scenes from the TNO TNO () dataset, which includes 21 pairs of infrared and visible images. Finally, we take multi-focus images from the Lytro Nejati2015Multi () dataset, which includes 20 pairs of commonly used multi-focus images. In all experiments, we convert all images to grayscale for subsequent fusion training. In addition, in all subsequent experiments our algorithm is not manually tuned for any particular data set. We compare 20 mainstream algorithms: fast zero-learning (FEZ) Lahoud2019FastZERO (), convolutional sparse representation (CSR) Liu2016ImageCSR (), deep learning (DL) Li_2018DL (), dense fuse (DENSE) Li2018DenseFuse (), generative adversarial network for image fusion (FusionGAN) Ma2018Infrared (), Laplacian pyramid (LP) Burt1987TheLP (), dual-tree complex wavelet transform (DTCWT) Liu2015MultiDSIFT (), latent low-rank representation (LTLRR) Li2018InfraredLTLRR (), multi-scale transform and sparse representation (LP-SR) Liu2015ALPSR (), dense SIFT (DSIFT) Liu2015MultiDSIFT (), convolutional neural network (CNN) Liu2017InfraredJSR-SD (), curvelet transform (CVT) Nencini2007RemoteCVT (), cross bilateral filter fusion (CBF) Shreyamsha2015ImageCBF (), joint sparse representation (JSR) Zhang2013Dictionary (), joint sparse representation with saliency detection (JSRSD) Liu2017InfraredJSR-SD (), gradient transfer fusion (GTF) Ma2016InfraredGTF (), weighted least square optimization (WLS) Ma2017InfraredWLS (), ratio of low-pass pyramid (RP) Toet1989ImageRP (), wavelet Chipman1995Wavelets (), and our nonlinear and selection method (OURS+).
Some of the compared algorithms change between experiments, and these changes are explained in the corresponding experimental sections. The code of these algorithms is publicly available, and their parameters follow the settings in the published papers. The code and data for our paper will be published on GitHub. For our proposed algorithm, we also run a comparative experiment with and without the channel attention module. Our experimental platform is a desktop with a 3.0 GHz i5-8500 CPU, an RTX 2070 GPU, and 16 GB of memory.
4.2 Image fusion experiment of different data sets
Combined vision system image fusion experiment
On the combined vision dataset, we use the image fusion operator shown in Section 4.1 to analyze the enhanced and synthetic scene images qualitatively.
As shown in Figure 6, because of low light in the synthetic vision image, many texture details in the dark regions are almost imperceptible to the naked eye, and existing image fusion algorithms cannot recover these details well during fusion. The RP and CNN algorithms recover some details, but they also introduce some non-image information. Compared with the other algorithms, our algorithm produces very clear edge details in the dark regions.
Infrared and visible image fusion experiment
From Figure 7, we can see that in infrared and visible image fusion our algorithm recovers more image detail than the other algorithms while keeping noise lower. On this dataset, the IFCNN algorithm achieves higher contrast than our algorithm, but its fused images also contain a lot of noise, which degrades fusion quality. There are three main reasons. First, IFCNN is trained with supervised learning and, being purely data-driven, learns only the data distribution of multi-focus datasets. Second, IFCNN builds in human prior knowledge by adopting the maximum fusion criterion for cross-modal infrared and visible images. Finally, image quality evaluation is imperfect. Our algorithm adopts an unsupervised learning framework, which automatically learns nonlinear fusion weights for the images rather than relying on specified fusion rules. In objective index testing, we find that for many algorithms the fused image quality is far from what human subjective evaluation would accept, even though the corresponding objective indexes, such as gradient and SSIM, are very high. The main reason is that these algorithms, such as CBF, CSR and IFCNN, introduce a lot of noise and edge oscillation artifacts during image fusion.
Multi-focus image fusion experiment
We also evaluated a variety of image fusion algorithms on the multi-focus image dataset. The experimental data show that our algorithm achieves higher entropy and gradient values than the other algorithms on multi-focus images, which indicates that the fused image carries more information and has better clarity. In particular, under low illumination our algorithm still recovers the texture details of the image better, which is more in line with human visual perception characteristics.
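The entropy value mentioned above is usually computed as the Shannon entropy of the gray-level histogram. The following is a minimal sketch under that assumption; the function name `image_entropy` and the 8-bit, 256-level setting are illustrative choices, not details taken from the paper.

```python
import numpy as np

def image_entropy(img, levels=256):
    """Shannon entropy (in bits) of the gray-level histogram of an 8-bit image.

    Assumption: pixel values are integers in [0, levels - 1].
    """
    pixels = np.asarray(img, dtype=np.uint8).ravel()
    hist = np.bincount(pixels, minlength=levels)
    p = hist / hist.sum()          # empirical gray-level probabilities
    p = p[p > 0]                   # skip empty bins, since 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())
```

A constant image yields zero entropy, while an image that uses all 256 gray levels equally often reaches the maximum of 8 bits, so a higher value indicates a richer gray-level distribution in the fused image.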
4.3 Fusion metrics
To quantitatively evaluate the performance of different algorithms, we mainly use four objective image evaluation indexes: cumulative probability of blur detection (CPBD) NarvekarN.D2011ANIB (), just noticeable blur based on human vision (JNB) FerzliR2009ANOI (), visual information fidelity (VIF) Han2013A (), and average gradient (AG) Cui2015Detail (). We carried out quantitative experiments on the combined vision dataset, the infrared and visible dataset, and the multi-focus dataset; the results are shown in Figure 4.
We can see that the Alan image fusion framework we propose achieves better subjective scores than the other existing algorithms across the different datasets. Among the other algorithms, IFCNN has a better subjective score on the multi-focus image dataset, mainly because IFCNN is trained on multi-focus data with supervised learning. To improve the generality of the network, IFCNN directly replaces the fusion criterion with maximum, weighted average, or sum. However, precisely because it relies on supervised learning, it cannot transfer to image data without labels, which limits the robustness and generality of the algorithm to a certain extent. We also find that, compared with the traditional CVT, GTF and RP methods, the learning-based Alan, IFCNN and CNN methods have better accuracy and robustness on multiple datasets.
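Of the four indexes, AG is the simplest to state. One common formulation averages, over all pixels, the root mean square of the horizontal and vertical forward differences. The sketch below follows that formulation as an assumption; it may differ in detail from the definition used in Cui2015Detail (), and the function name `average_gradient` is illustrative.

```python
import numpy as np

def average_gradient(img):
    """Average gradient (AG) of a gray-scale image.

    Assumed definition: mean of sqrt((dx^2 + dy^2) / 2) over the
    (H - 1) x (W - 1) grid where both forward differences exist.
    """
    img = np.asarray(img, dtype=np.float64)
    dx = img[:-1, 1:] - img[:-1, :-1]   # horizontal forward difference
    dy = img[1:, :-1] - img[:-1, :-1]   # vertical forward difference
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))
```

A flat image gives an AG of zero, and sharper edges or finer texture raise the value. Noise raises it as well, which is why a high AG alone does not guarantee good perceived quality.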
4.4 Comparative experiment of different fusion criteria
In this experiment, we compare the nonlinear fusion criteria, maximum fusion criteria, sum fusion criteria and weighted average fusion criteria based on our proposed network framework.
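The three baseline criteria named above can be sketched as element-wise operations on two feature maps of the same shape. This is a minimal illustration only: the function names and the weight `w` are hypothetical, and the learned nonlinear criterion of our framework is not reproduced here because its weights come from the trained network.

```python
import numpy as np

def fuse_max(a, b):
    """Maximum criterion: keep the larger response at every position."""
    return np.maximum(a, b)

def fuse_sum(a, b):
    """Sum criterion: add the two responses."""
    return a + b

def fuse_weighted_average(a, b, w=0.5):
    """Weighted-average criterion with a fixed scalar weight w in [0, 1] (assumed)."""
    return w * a + (1.0 - w) * b

# Toy feature maps standing in for features of the two source images.
feat_a = np.array([[0.2, 0.9], [0.5, 0.1]])
feat_b = np.array([[0.6, 0.3], [0.5, 0.7]])
fused = {
    "max": fuse_max(feat_a, feat_b),
    "sum": fuse_sum(feat_a, feat_b),
    "avg": fuse_weighted_average(feat_a, feat_b),
}
```

All three rules apply the same fixed operation at every position, whereas a learned criterion can vary its weights with the image content.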
From Figure 10, we can clearly see that on the combined vision dataset our nonlinear fusion criterion produces fusion results very similar to those of the sum and weighted average fusion criteria. These three criteria fuse texture details well and are generally better than the maximum fusion criterion. On the infrared and visible dataset, our method is generally superior to the other three fusion criteria. The sum fusion criterion behaves very similarly to the weighted average fusion criterion, while the maximum fusion criterion captures a large number of visible-light features yet ignores the effective features of visible light, so its performance is poor. On the multi-focus dataset, the maximum, sum and weighted average fusion criteria are better at recovering dark areas, but our algorithm has the advantage in overall clarity.
4.5 Comparative experiment of the main-task fusion algorithm: single fusion task versus sub-task collaboration
From Figure 11, we can clearly see that a network trained on a single image fusion task performs well on its own training task but poorly across datasets. Combining a single image fusion task with the image enhancement task improves image clarity to a certain extent. Compared with the single unsupervised learning network, the network trained on multi-focus image fusion performs worse on the combined vision dataset, whereas on the infrared and visible dataset the opposite holds. On the infrared and visible dataset, the single unsupervised learning network combined with the image enhancement sub-task performs much better than the single unsupervised learning method alone. Auxiliary learning with the image enhancement task and the multi-focus image fusion task effectively improves the universality and robustness of the image fusion algorithm across datasets. This is because our method extracts the features common to multiple data distributions while retaining some dataset-specific characteristics.
The extensive experiments in Section 4 verify that our proposed unsupervised learning combined vision image fusion framework outperforms existing ones. We think there are four main reasons. First, the construction of a multivariate loss function. Compared with traditional algorithms, deep learning algorithms have very strong feature representation and relationship fitting abilities, but whether a deep network can learn human subjective intention in a data-driven way depends heavily on whether its loss function is reasonably constructed. Although there are many methods for evaluating image quality, existing objective functions typically rely on a single measure Ma2018Infrared (). A single image quality measure cannot effectively represent image quality, so it is important to study the influence of multiple loss functions on image quality. Second, the auxiliary learning mechanism. When a single-task network is trained, it is often affected by data noise, insufficient training data, cross-modal inputs and an improper loss function, so some hidden features of the data cannot be learned. Auxiliary task learning effectively improves the learning ability of the main task. In addition, through multi-task assisted learning the network can learn how to learn and reduce subjective interference. Solely through the collaborative optimization of different tasks, the network automatically learns both the characteristics common to different data and the characteristics specific to each dataset, which plays an important role in improving the robustness and generality of the network architecture.
Third, the nonlinearity and feature selection of the human brain image fusion mechanism. By effectively combining the attention mechanism with a nonlinear convolutional neural network, we can learn non-mutually-exclusive nonlinear fusion weights between multimodal images. Experiments show that this method is in line with the human brain image fusion mechanism. Finally, unsupervised learning. At present, in the field of image fusion, it is very difficult to obtain supervised learning labels for infrared and visible images or for combined vision image data. For the main task of image fusion, we do not need labels. By introducing the auxiliary learning strategy, we effectively transform the unsupervised learning network into supervised learning and improve the robustness and universality of image fusion.
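The feature selection role of channel attention discussed above can be illustrated with a squeeze-and-excitation style block in the spirit of Hu et al. The sketch below is an assumption-laden illustration, not the network used in this paper: the function name, the omission of bias terms, and the weight shapes are all hypothetical, and the weights would normally be learned rather than supplied by hand.

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Squeeze-and-excitation style channel re-weighting (illustrative).

    features: array of shape (C, H, W)
    w1: reduction weights of shape (C // r, C); w2: expansion weights of shape (C, C // r)
    Returns the re-weighted features and the per-channel weights in (0, 1).
    """
    squeezed = features.mean(axis=(1, 2))              # squeeze: global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)            # excitation, step 1: linear + ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # excitation, step 2: linear + sigmoid
    return features * weights[:, None, None], weights  # scale each channel by its weight

# Random placeholders standing in for trained weights and fused features.
rng = np.random.default_rng(0)
C, r = 8, 4
feats = rng.standard_normal((C, 16, 16))
out, ch_weights = channel_attention(
    feats,
    rng.standard_normal((C // r, C)),
    rng.standard_normal((C, C // r)),
)
```

Channels whose weight is close to 1 pass through almost unchanged, while channels with a weight close to 0 are suppressed, which is the selection behaviour the discussion refers to.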
Based on the three characteristics of the human brain image fusion mechanism, we propose a robust unsupervised learning combined vision image fusion framework with multi-task assisted cooperative optimization. Our framework differs from current mainstream frameworks in three main ways. First, our image fusion network adopts multiple loss functions, which represent image quality better than the current single loss function approach. Second, we effectively combine the attention mechanism with a deep convolutional neural network to simulate the feature selection and nonlinear combination characteristics of the human brain image fusion mechanism. Third, we introduce the auxiliary learning mechanism into the image fusion task, so that the main task of image fusion is effectively optimized through multiple sub-tasks. As a result, our unsupervised learning combined vision image fusion framework is more robust and universal than existing algorithms. In addition to combined vision fusion, it can also be applied to infrared and visible image fusion and multi-focus image fusion. Extensive experiments show that our framework is more robust than existing mainstream frameworks. Although our framework does not fully simulate the mechanism of human brain image fusion, the characteristics we do simulate are consistent with that mechanism. Although our algorithm achieves relatively good robustness and generality compared with existing algorithms, two directions remain to be explored. First, our network has so far been tested only on the combined vision, infrared and visible, and multi-focus fusion tasks; in the future we will extend it to multi-exposure fusion and medical image fusion.
Second, our network currently extends the auxiliary learning mechanism only to the tasks of image reconstruction and fusion. Later, we will combine different high-level semantic tasks to optimize the image fusion task, so that the fusion results are more consistent with the characteristics of human visual perception.
This work was supported by the National Natural Science Foundation of China under Grant no. 61871326 and the Shanxi Natural Science Basic Research Program under Grant no. 2018JM6116.
- A. M. Treisman, G. Gelade, A feature-integration theory of attention, Cognitive Psychology 12 (1) (1980) 97–136.
- F. Lahoud, S. Süsstrunk, Fast and efficient zero-learning image fusion, Information Fusion. arXiv:1905.03590.
- H. Li, X.-J. Wu, J. Kittler, Infrared and visible image fusion using a deep learning framework, 2018 24th International Conference on Pattern Recognition (ICPR). doi:10.1109/icpr.2018.8546006.
- Y. Liu, X. Chen, J. Cheng, H. Peng, Z. Wang, Infrared and visible image fusion with convolutional neural networks, International Journal of Wavelets Multiresolution & Information Processing 16 (3) (2017) 1–20.
- Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual channel attention networks, Lecture Notes in Computer Science (2018) 294–310. doi:10.1007/978-3-030-01234-2_18.
- H. Li, X.-J. Wu, DenseFuse: A fusion approach to infrared and visible images, IEEE Transactions on Image Processing 28 (5) (2018) 2614–2623.
- K. R. Prabhakar, V. S. Srikar, R. V. Babu, DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs, in: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, 2017, pp. 4724–4732.
- X. Yan, S. Gilani, A. Mian, multi-focus image fusion, arXiv.org.
- J. Ma, W. Yu, P. Liang, C. Li, J. Jiang, FusionGAN: A generative adversarial network for infrared and visible image fusion, Information Fusion 48 (2019) 11–26. doi:10.1016/j.inffus.2018.09.004.
- Y. Liu, X. Chen, H. Peng, Z. Wang, Multi-focus image fusion with a deep convolutional neural network, Information Fusion 36 (2017) 191–207.
- B. Ma, X. Ban, H. Huang, Y. Zhu, unsupervised deep model for multi-focus image fusion, arXiv.org.
- J. Ma, Y. Ma, C. Li, Infrared and visible image fusion methods and applications: A survey, Information Fusion 45 (2019) 153–178. doi:10.1016/j.inffus.2018.02.004.
- Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, L. Zhang, IFCNN: A general image fusion framework based on convolutional neural network, Information Fusion 54 (2020) 99–118.
- C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, W. Shi, image super-resolution using a generative adversarial network, arXiv.org.
- J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, Vol. 9906, Springer, 2016.
- Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13 (4) (2004) 600–612. doi:10.1109/TIP.2003.819861.
- J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, Squeeze-and-excitation networks (2017). arXiv:1709.01507.
- S. Liu, E. Johns, A. J. Davison, End-to-end multi-task learning with attention.
- A. Rusu, N. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell,
- A. Toet, TNO dataset, https://figshare.com/articles/TNO_Image_Fusion_Dataset/1008029 (2018).
- M. Nejati, S. Samavi, S. Shirani, Multi-focus image fusion using dictionary-based sparse representation, Information Fusion 25 (2015) 72–84.
- Y. Liu, X. Chen, R. Ward, Z. J. Wang, Image fusion with convolutional sparse representation, IEEE Signal Processing Letters 23 (12) (2016) 1882–1886.
- P. J. Burt, E. H. Adelson, The Laplacian pyramid as a compact image code, Readings in Computer Vision 31 (4) (1987) 671–679.
- Y. Liu, S. Liu, Z. Wang, Multi-focus image fusion with dense SIFT, Information Fusion 23 (C) (2015) 139–155.
- H. Li, X.-J. Wu, Infrared and visible image fusion using a novel deep decomposition method, IEEE Transactions on Image Processing.
- Y. Liu, S. Liu, Z. Wang, A general framework for image fusion based on multi-scale transform and sparse representation, Information Fusion 24 (2015) 147–164.
- C. Liu, Y. Qi, W. Ding, Infrared and visible image fusion method based on saliency detection in sparse domain, Infrared Physics & Technology 83 (2017) 94–102.
- F. Nencini, A. Garzelli, S. Baronti, L. Alparone, Remote sensing image fusion using the curvelet transform, Information Fusion 8 (2) (2007) 143–156.
- B. K. Shreyamsha Kumar, Image fusion based on pixel significance using cross bilateral filter, Signal Image & Video Processing 9 (5) (2015) 1193–1204.
- Q. Zhang, Y. Fu, H. Li, J. Zou, Dictionary learning method for joint sparse representation-based image fusion, Optical Engineering 52 (5) (2013) 7006.
- J. Ma, C. Chen, C. Li, J. Huang, Infrared and visible image fusion via gradient transfer and total variation minimization, Information Fusion 31 (C) (2016) 100–109.
- J. Ma, Z. Zhou, B. Wang, H. Zong, Infrared and visible image fusion based on visual saliency map and weighted least square optimization, Infrared Physics & Technology 82 (2017) 8–17.
- A. Toet, Image fusion by a ratio of low-pass pyramid, Pattern Recognition Letters 9 (4) (1989) 245–253.
- L. J. Chipman, T. M. Orr, L. N. Graham, Wavelets and image fusion, in: International Conference on Image Processing, 1995.
- G. Liu, S. Yan, Latent low-rank representation for subspace segmentation and feature extraction, in: International Conference on Computer Vision, 2011.
- V. P. S. Naidu, Image fusion technique using multi-resolution singular value decomposition, Defence Science Journal 61 (5) (2011) 479–484.
- N. D. Narvekar, L. J. Karam, A no-reference image blur metric based on the cumulative probability of blur detection (CPBD), IEEE Transactions on Image Processing 20 (9) (2011) 2678–2683.
- R. Ferzli, L. Karam, A no-reference objective image sharpness metric based on the notion of just noticeable blur (JNB), IEEE Transactions on Image Processing 18 (4) (2009) 717–728.
- Y. Han, Y. Cai, Y. Cao, X. Xu, A new image fusion performance metric based on visual information fidelity, Information Fusion 14 (2) (2013) 127–135.
- G. Cui, H. Feng, Z. Xu, Q. Li, Y. Chen, Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition, Optics Communications 341 (341) (2015) 199–209.