Deep Siamese Multi-scale Convolutional Network for Change Detection in Multi-temporal VHR Images
Very high resolution (VHR) images provide abundant ground details and spatial distribution information. Change detection in multi-temporal VHR images plays a significant role in analyzing urban expansion and changes within an area. Nevertheless, traditional change detection methods can neither take full advantage of spatial context information nor cope with the complex internal heterogeneity of VHR images. In this paper, we propose a multi-scale feature convolution unit (MFCU) for change detection in VHR images. The proposed unit extracts multi-scale features within a single layer. Based on this unit, two novel deep Siamese convolutional networks, the deep Siamese multi-scale convolutional network (DSMS-CN) and the deep Siamese multi-scale fully convolutional network (DSMS-FCN), are designed for unsupervised and supervised change detection, respectively, in multi-temporal VHR images. For unsupervised change detection, we implement automatic pre-classification to obtain training patch samples, and the DSMS-CN fits the statistical distribution of changed and unchanged areas from these patch samples through its multi-scale feature extraction module and deep Siamese architecture. For supervised change detection, the end-to-end deep fully convolutional network DSMS-FCN can be trained on multi-temporal VHR images of any size and directly outputs the binary change map. In addition, to address inaccurate localization, a fully connected conditional random field (FC-CRF) is combined with DSMS-FCN to refine the results. Experimental results on challenging data sets confirm that the two proposed architectures perform better than state-of-the-art methods.
Change detection (CD) is the process of identifying differences in the state of an object or phenomenon by observing it at different times. It plays a vital role in land-use and land-cover (LULC) change analysis, forest and vegetation change monitoring, ecosystem monitoring, urban expansion research, resource management, and damage assessment [2, 3, 4, 5, 6, 7].
Multi-spectral images are the most commonly used data source for change detection, and a number of methods have been proposed for them. Change vector analysis (CVA) generates a difference image (DI) and clusters the DI to obtain the change result. Principal component analysis (PCA), a dimension reduction method, transforms the images into a new feature space and selects a subset of the new bands for change detection. Multivariate alteration detection (MAD) and its iterative version, IRMAD, extract the changed objects by maximizing the difference between the projected features. Based on the slow feature analysis (SFA) algorithm, Wu et al. [12, 13] proposed an SFA change detection method, which aims to find the most invariant component in multi-temporal images so that the changed objects are highlighted.
Nowadays, with the development of satellite sensors, very-high-resolution (VHR) images are increasingly available from various sensors, such as QuickBird, IKONOS, SPOT and WorldView. VHR images provide abundant ground details and spatial distribution information, which are crucial for urban change analysis and building detection. Nevertheless, because of the internal complexity caused by their extremely high resolution, conventional multi-spectral change detection methods may not be applicable to VHR images. Recently, deep learning (DL) has achieved significant performance in many domains, including remote sensing image interpretation. The convolutional neural network (CNN), a classical and powerful DL architecture, excels at extracting multi-level information and is therefore suitable for extracting spectral and spatial context information from VHR images. A deep symmetric network has been proposed for change detection in heterogeneous VHR images, in which a convolutional layer plays an important role in feature extraction. Yang et al. introduced a deep Siamese convolutional network for aerial image change detection, which extracts features with two weight-sharing convolutional branches and generates a binary change map from the feature difference of the last layer. Three fully convolutional Siamese networks, trained end-to-end on a change detection data set, have also been proposed and achieve good performance. Beyond CNN architectures, a change detection method based on a conditional generative adversarial network (cGAN) has been designed, in which convolutional layers are responsible for extracting features from VHR image patches. All of these methods adopt the 3x3 convolution kernel as the feature extraction module. Although the 3x3 convolution kernel can extract spectral and spatial context features to some extent, it is still inadequate for the complex ground conditions in VHR images. To the best of our knowledge, no research has attempted to use other kernel sizes, or multiple kernel sizes, for change detection in VHR images.
Inspired by the "network in network" structure and the Inception network, we propose a multi-scale feature convolution unit (MFCU) that extracts multi-scale spectral features and spatial context features in the same layer, which makes it suitable for VHR images. Adopting the MFCU as the basic feature extraction module, two deep Siamese convolutional networks are designed for unsupervised and supervised change detection. For supervised change detection, a probabilistic graphical model, the fully connected conditional random field (FC-CRF), is adopted to refine the change detection results. Finally, based on the two deep Siamese multi-scale convolutional network architectures and the FC-CRF, the specific unsupervised and supervised algorithms are developed.
The contributions of this paper can be summarized as follows:
This paper presents a multi-scale feature convolution unit that copes with the internal complexity caused by the extremely high resolution of VHR images by extracting multi-scale features from them. To the authors' best knowledge, this is the first time that such a multi-scale feature convolution unit has been exploited for change detection.
Based on the multi-scale feature convolution unit, two novel deep Siamese convolutional networks are designed for unsupervised and supervised change detection, respectively. The network used for supervised change detection can process images of any size and does not require a sliding patch window, so both the accuracy and the speed of inference can be significantly improved.
To address the inaccurate localization caused by the deep convolutional architecture, a fully connected conditional random field is adopted to refine the results obtained by the deep Siamese fully convolutional network.
The rest of this paper is organized as follows. Section II describes the multi-scale feature convolution unit and the two change detection network architectures in detail. Based on Section II, Section III presents the unsupervised and supervised change detection algorithms built on the proposed network architectures. To evaluate our methods, Sections IV and V contain quantitative and qualitative comparisons with state-of-the-art change detection methods. Finally, Section VI draws the conclusions of this paper.
In this section, we propose a multi-scale feature convolution unit, which is suitable for VHR images and can extract features at multiple scales, to replace the conventional single convolution unit. Based on the proposed unit, two deep Siamese convolutional network architectures are designed for unsupervised and supervised change detection of VHR images. To address the inaccurate localization of deep convolutional neural networks (DCNN), we adopt the fully connected conditional random field to refine the results.
II-A Multi-scale Feature Convolution Unit
VHR images provide abundant ground details and spatial distribution information. The 3x3 convolution kernel, a commonly used convolution unit, can extract spectral features and spatial context information from VHR images. However, it has two obvious drawbacks when processing VHR images. First, the 3x3 convolution kernel has a limited receptive field and can only extract single-scale features in a given layer, whereas VHR images contain, in addition to normal-scale features, some large-scale continuous features and a few very small salient features. Second, as a weighted summation operation, convolution has a smoothing effect, so some changes present in multi-temporal images may be erased. Therefore, the 3x3 convolution kernel, that is, the conventional single convolution unit, is not fully adequate for the complex multi-scale ground conditions in VHR images.
To better extract multi-scale features from VHR images, the multi-scale feature convolution unit (MFCU) is proposed. As shown in Figure 1, the MFCU is a "network in network" structure that extracts multi-scale features in parallel through four branches: a 1x1 convolution kernel, a 3x3 convolution kernel, a 5x5 convolution kernel, and 3x3 max pooling. The 1x1 convolution kernel focuses on the features of each pixel itself. The 3x3 convolution kernel extracts features in a neighborhood. The 5x5 convolution kernel extracts features over a larger range, which suits large-scale continuous objects in VHR images. Max pooling extracts the most salient features and avoids the smoothing effect of the convolution operation. Finally, the four types of features are fused to obtain higher-dimensional multi-scale features. Note that the 1x1 convolution kernel placed before the 3x3 and 5x5 convolution kernels is a bottleneck design, which reduces the number of network parameters and makes the network easier to train.
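The four parallel branches described above can be sketched in NumPy as follows. This is only an illustrative forward pass, not the authors' implementation: the branch widths, the ReLU after each bottleneck, the omission of biases, and the helper names (`conv2d_same`, `maxpool3x3_same`, `random_mfcu_params`, `mfcu`) are all assumptions made for the example, and the fusion step is assumed to be channel concatenation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def relu(x):
    return np.maximum(x, 0.0)


def conv2d_same(x, w):
    """Zero-padded, stride-1 convolution (cross-correlation, bias omitted).

    x: (H, W, C_in) float feature map; w: (k, k, C_in, C_out) kernel, odd k.
    """
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    windows = sliding_window_view(xp, (k, k), axis=(0, 1))  # (H, W, C_in, k, k)
    return np.einsum("hwcij,ijco->hwo", windows, w)


def maxpool3x3_same(x):
    """3x3 max pooling with stride 1, so the spatial size is preserved."""
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    windows = sliding_window_view(xp, (3, 3), axis=(0, 1))  # (H, W, C, 3, 3)
    return windows.max(axis=(-2, -1))


def random_mfcu_params(c_in, c1, c3_reduce, c3, c5_reduce, c5, rng):
    """Random weights with hypothetical branch widths (illustration only)."""
    def init(k, ci, co):
        return rng.normal(scale=np.sqrt(2.0 / (k * k * ci)), size=(k, k, ci, co))
    return {
        "w1": init(1, c_in, c1),
        "r3": init(1, c_in, c3_reduce), "w3": init(3, c3_reduce, c3),
        "r5": init(1, c_in, c5_reduce), "w5": init(5, c5_reduce, c5),
    }


def mfcu(x, p):
    """Multi-scale unit: four parallel branches fused by concatenation."""
    b1 = relu(conv2d_same(x, p["w1"]))                                # per-pixel
    b3 = relu(conv2d_same(relu(conv2d_same(x, p["r3"])), p["w3"]))    # 1x1 -> 3x3
    b5 = relu(conv2d_same(relu(conv2d_same(x, p["r5"])), p["w5"]))    # 1x1 -> 5x5
    bp = maxpool3x3_same(x)                                           # salient
    return np.concatenate([b1, b3, b5, bp], axis=-1)
```

With an input of shape `(H, W, C_in)`, the output keeps the spatial size and has `c1 + c3 + c5 + C_in` channels, since the pooling branch passes all input channels through.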
Compared with the conventional single convolution unit, the MFCU extracts multi-scale features, which improves the feature abstraction ability of the network without significantly increasing the number of network parameters. Using the MFCU as the multi-scale feature extraction module, we design two novel deep Siamese multi-scale convolutional network architectures for unsupervised and supervised change detection in multi-temporal VHR images, respectively.
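The effect of the bottleneck on the parameter count can be checked with simple arithmetic. The sketch below counts weights and biases for hypothetical channel widths; the widths are assumptions chosen for illustration and are not taken from the paper.

```python
def conv_param_count(k, c_in, c_out, bias=True):
    """Number of parameters of one k x k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def mfcu_param_count(c_in, c1, c3_reduce, c3, c5_reduce, c5):
    """Parameters of an MFCU whose pooling branch has no weights (assumption)."""
    return (
        conv_param_count(1, c_in, c1)
        + conv_param_count(1, c_in, c3_reduce) + conv_param_count(3, c3_reduce, c3)
        + conv_param_count(1, c_in, c5_reduce) + conv_param_count(5, c5_reduce, c5)
    )


# Hypothetical widths: 64 input channels, 16 + 32 + 16 + 64 (pooling) = 128 outputs.
mfcu_total = mfcu_param_count(64, c1=16, c3_reduce=16, c3=32, c5_reduce=8, c5=16)
plain_3x3 = conv_param_count(3, 64, 128)            # single 3x3 layer, 128 outputs
direct_5x5 = conv_param_count(5, 64, 16)            # 5x5 branch without bottleneck
bottleneck_5x5 = conv_param_count(1, 64, 8) + conv_param_count(5, 8, 16)
```

For these assumed widths, the bottlenecked 5x5 branch needs 3,736 parameters instead of 25,616, and the whole unit needs 10,456 parameters compared with 73,856 for a plain 3x3 layer with the same number of output channels.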
II-B Deep Siamese Multi-scale Convolutional Network
The first proposed network architecture is the deep Siamese multi-scale convolutional network (DSMS-CN). The DSMS-CN (Figure 2) has two components: a feature extraction network (FEN) and a change judging network (CJN).
The FEN is a Siamese network whose two branches extract features from the two multi-temporal image patches in exactly the same way because of weight sharing. The first two conventional convolutional modules in each branch transform the spatial and spectral information into relatively high-dimensional features, and the last two MFCU modules extract abundant multi-scale features from these high-dimensional features. By calculating the absolute differences of the features from the two branches, the change information is highlighted and the unchanged information is suppressed. Then the absolute feature differences from multiple layers are fused and fed into the CJN. In the CJN, an MFCU layer extracts multi-scale difference features, and a global average pooling (GAP) layer replaces the fully connected layer to generate the feature vector, which makes the network more robust and reduces overfitting. Finally, the probability of change at the center pixel of the image patches is obtained by a fully connected layer.
In DSMS-CN, both the conventional and the multi-scale convolutional layers adopt the rectified linear unit (ReLU) as the activation function, and the last fully connected layer adopts the sigmoid function to predict the probability of change. Because it extracts features from image patches rather than whole images, DSMS-CN does not contain any max-pooling layer for dimension reduction.
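The last steps of this pipeline (absolute difference of the two branches, global average pooling, and a sigmoid output) can be sketched as below. This is a deliberately reduced example: it omits the multi-layer fusion and the MFCU layer of the CJN, and the weight vector `w` and bias `b` of the final fully connected layer are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def change_probability(feat_t1, feat_t2, w, b):
    """Probability of change for one patch pair.

    feat_t1, feat_t2: (H, W, C) features from the two weight-sharing branches.
    w: (C,) weights and b: scalar bias of the final fully connected layer
    (hypothetical values, for illustration only).
    """
    diff = np.abs(feat_t1 - feat_t2)   # identical content cancels to zero
    vec = diff.mean(axis=(0, 1))       # global average pooling -> (C,)
    return sigmoid(vec @ w + b)        # sigmoid output in (0, 1)
```

Because the two branches share weights, identical input patches give identical features, so the difference vector is zero and the output depends only on the bias.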
II-C Deep Siamese Multi-scale Fully Convolutional Network
The second proposed network architecture is the deep Siamese multi-scale fully convolutional network (DSMS-FCN). Like the general FCN, the DSMS-FCN (Figure 3) consists of two parts: an encoder network on the left and a decoder network on the right.
The encoder network has two identical weight-sharing branches, so the features of the multi-temporal images are extracted in the same way. Each branch has four subsampling layers, where each subsampling layer consists of a convolution module and a 2x2 max-pooling layer. The first two convolution modules consist of conventional 3x3 single convolution units and the last two modules consist of MFCUs. Following the skip-connection concept of U-Net, the features of the subsampling layer and the upsampling layer at the same scale are concatenated during the upsampling phase, which recovers localization information and produces a more accurate binary change map with precise boundaries. The motivation for concatenating the absolute value of the difference between the two branches with the features of the upsampling layer is that change detection aims to detect the differences between multi-temporal images. Furthermore, it is worth pointing out that only one branch's features enter the decoder network. This is because the two branches share weights and changes in multi-temporal VHR images are in the minority, so most of the features extracted by the two branches are the same. Meanwhile, the feature difference is delivered to the decoder network through the skip-connection structure. Consequently, using the features of only one branch avoids feature redundancy and overfitting.
In DSMS-FCN, all convolutional layers and transpose convolutional layers adopt the rectified linear unit (ReLU) as the activation function, except the last convolutional layer, which adopts the sigmoid function to predict the change probability of the whole image. Owing to its fully convolutional architecture, DSMS-FCN can process multi-temporal VHR images of any size.
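The difference-based skip connection can be illustrated with a small shape-level sketch. Nearest-neighbor upsampling stands in for the transpose convolution here, which is an assumption made to keep the example free of learned weights; the function names are hypothetical.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbor 2x upsampling (stand-in for a transpose convolution)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def skip_concat(decoder_feat, enc_t1, enc_t2):
    """Concatenate upsampled decoder features with |enc_t1 - enc_t2|.

    decoder_feat: (H/2, W/2, C_d) features from the coarser decoder level.
    enc_t1, enc_t2: (H, W, C_e) encoder features of the two dates at this scale.
    """
    up = upsample2x(decoder_feat)
    diff = np.abs(enc_t1 - enc_t2)     # carries the change information
    if up.shape[:2] != diff.shape[:2]:
        raise ValueError("encoder and decoder features must share a spatial size")
    return np.concatenate([up, diff], axis=-1)
```

With four 2x2 max-pooling layers, the spatial sizes match exactly at every level only when the input height and width are multiples of 16; how other sizes are handled is not specified in the text.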
II-D Fully Connected Conditional Random Field
Although DSMS-FCN adopts the skip-connection structure to deliver localization information, it still cannot fully solve the inaccurate localization caused by the invariance of features and the large receptive field. The fully connected conditional random field (FC-CRF), a variant of the conditional random field (CRF), is remarkably successful at handling this localization problem. Compared with the CRF, the FC-CRF considers short-range and long-range information simultaneously, so it can better recover the local structure. Therefore, the FC-CRF is adopted to refine the localization of the results obtained by DSMS-FCN.
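FC-CRF is a conditional probability distribution model that outputs another set of random variables given a set of input random variables. The FC-CRF is designed as follows:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\Big(-\sum_{c \in C_G} \phi_c(Y_c \mid X)\Big)$$

It is characterized by a Gibbs distribution, where $G$ is a complete graph on $Y$ and $C_G$ is the set of all unary and pairwise cliques, each of which induces a potential $\phi_c$. The corresponding Gibbs energy of FC-CRF is

$$E(Y \mid X) = \sum_{i} \psi_u(y_i) + \sum_{i<j} \psi_p(y_i, y_j)$$

where $i$ and $j$ range from 1 to $N$. In change detection, $X$ is the observed image acquired by the difference of multi-temporal images and $Y$ is the binary change map.

The domain of each $y_i$ is $\{0, 1\}$ (unchanged or changed). The unary potential is

$$\psi_u(y_i) = -\log P(y_i)$$

where $P(y_i)$ is the probability of change at pixel $i$, which is computed by DSMS-FCN. In the FC-CRF model, each pixel pair has a corresponding pairwise term no matter how far apart the two pixels are. The pairwise potential has the form:

$$\psi_p(y_i, y_j) = \mu(y_i, y_j) \sum_{m=1}^{M} w^{(m)} k^{(m)}(f_i, f_j)$$

where $\mu(y_i, y_j)$ is a penalty, with $\mu(y_i, y_j) = 1$ if $y_i \neq y_j$ and zero otherwise. $k^{(m)}$ is a Gaussian kernel and is weighted by $w^{(m)}$. $f_i$ and $f_j$ are feature vectors for pixels $i$ and $j$ in an arbitrary feature space.

In the change detection problem, the kernels are

$$w^{(1)} \exp\Big(-\frac{\lVert c_i - c_j \rVert^2}{2\theta_\alpha^2} - \frac{\lVert X_i - X_j \rVert^2}{2\theta_\beta^2}\Big) + w^{(2)} \exp\Big(-\frac{\lVert c_i - c_j \rVert^2}{2\theta_\gamma^2}\Big)$$

where the first kernel depends on both pixel coordinates (denoted as $c$) and spectral difference intensities (denoted as $X$). The second, smoothness kernel depends only on pixel coordinates.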
The inference of FC-CRF adopts the mean field approximation (MFA) algorithm, which approximates the distribution P by a product of independent per-pixel marginal distributions. The message passing step of MFA can be significantly accelerated by a high-dimensional filtering algorithm, which reduces the complexity of message passing from quadratic to linear in the number of pixels, resulting in an efficient approximate inference algorithm for FC-CRF. The specific combination of FC-CRF and DSMS-FCN is presented in Section III-B.
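For intuition, mean field inference for a binary fully connected CRF with an appearance kernel and a smoothness kernel can be written naively in NumPy. This sketch builds the full N x N kernel matrix, so it has quadratic cost and is only usable on tiny images; it does not use the high-dimensional filtering mentioned above. The kernel weights and bandwidths are hypothetical defaults, and the Potts penalty (1 for differing labels, 0 otherwise) is assumed.

```python
import numpy as np


def dense_crf_mean_field(prob_change, diff_img, w1=3.0, w2=1.0,
                         theta_alpha=3.0, theta_beta=0.2, theta_gamma=1.0,
                         n_iters=5):
    """Refine a change-probability map with naive mean field updates.

    prob_change: (H, W) change probabilities (e.g. a network output).
    diff_img: (H, W) or (H, W, B) difference image used by the appearance kernel.
    Returns the refined (H, W) probability of the "changed" label.
    """
    H, W = prob_change.shape
    p = np.clip(prob_change.ravel(), 1e-6, 1 - 1e-6)
    unary = -np.log(np.stack([1 - p, p], axis=1))       # (N, 2): unchanged, changed

    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    feats = diff_img.reshape(H * W, -1).astype(float)
    d_pos = ((coords[:, None] - coords[None]) ** 2).sum(-1)   # squared distances
    d_feat = ((feats[:, None] - feats[None]) ** 2).sum(-1)

    kernel = (w1 * np.exp(-d_pos / (2 * theta_alpha ** 2)
                          - d_feat / (2 * theta_beta ** 2))
              + w2 * np.exp(-d_pos / (2 * theta_gamma ** 2)))
    np.fill_diagonal(kernel, 0.0)                        # no self-messages

    q = np.exp(-unary)
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        msg = kernel @ q                                 # (N, 2) message passing
        pairwise = msg[:, ::-1]                          # Potts: cost from other label
        logits = -unary - pairwise
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)
    return q[:, 1].reshape(H, W)
```

Thresholding the returned map at 0.5 gives the refined binary change map. An isolated pixel whose probability disagrees with neighbors that look similar in the difference image is pulled toward their label.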
III Employment of Change Detection Algorithms
Based on the two aforementioned deep Siamese multi-scale convolutional network architectures and the FC-CRF, this section presents the detailed implementation of the unsupervised and supervised change detection algorithms.
III-A Unsupervised Change Detection
In our proposed unsupervised change detection method, the DSMS-CN is adopted to predict the change result of multi-temporal VHR images. The flowchart of the unsupervised change detection algorithm is shown in Figure 4.
The first step is image pre-processing, consisting of co-registration and radiometric correction. Image registration is the process of aligning two or more images of the same scene obtained at different times. By collecting matched point pairs, establishing a transform model, and transforming the images, the two given multi-temporal VHR images are geometrically aligned. Then relative radiometric normalization is adopted to eliminate the radiometric differences between the multi-temporal VHR images caused by different sun angles, light intensities, and atmospheric conditions. Specifically, each of the two images is normalized to zero mean and unit variance.
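The normalization step can be sketched as follows. Computing the statistics separately for each band is an assumption; the text only states that each image is normalized to zero mean and unit variance.

```python
import numpy as np


def normalize_image(img, eps=1e-8):
    """Normalize an (H, W, B) image to zero mean and unit variance per band."""
    img = np.asarray(img, dtype=float)
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True)
    return (img - mean) / (std + eps)   # eps guards against constant bands
```

Each date is normalized independently, for example `t1 = normalize_image(img_t1)` and `t2 = normalize_image(img_t2)`, before any difference is computed.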
After pre-processing, suitable training samples are chosen by automatic pre-classification. The main purpose of this step is to find pixels that have extremely high probabilities of being changed or unchanged. CVA is first adopted to generate the difference map (DM) of the multi-temporal images. Then fuzzy c-means clustering (FCM), based on the memberships of pixels, is implemented to partition the DM into three clusters: changed, unchanged, and uncertain. The pixels belonging to the changed and unchanged clusters are reliable pixels that have high probabilities of being changed or unchanged, whereas the pixels in the uncertain cluster still need to be classified. Finally, the neighborhoods of the pixels in the changed and unchanged clusters are chosen as training image patches.
Eventually, the DSMS-CN is trained on the chosen training image patches. After training is completed, the pixels in the uncertain cluster are categorized by the DSMS-CN and the whole change map is generated. Because the fully connected layer of DSMS-CN uses a sigmoid function, the threshold segmentation step is simplified: 0.5 is adopted as the threshold to obtain the binary change map. The whole process is therefore completely unsupervised.
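A minimal sketch of the pre-classification step, assuming the CVA difference map is the per-pixel Euclidean norm of the spectral difference and that FCM is run on that scalar magnitude with fuzzifier m = 2. The deterministic initialization of the cluster centers and the label coding (0 unchanged, 1 uncertain, 2 changed) are choices made for this example.

```python
import numpy as np


def cva_magnitude(img1, img2):
    """Change vector magnitude of two co-registered (H, W, B) images."""
    return np.sqrt(((img1 - img2) ** 2).sum(axis=-1))


def _memberships(v, centers, m):
    dist = np.abs(v[:, None] - centers[None, :]) + 1e-12   # (N, c)
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def fuzzy_c_means_1d(values, n_clusters=3, m=2.0, n_iters=50):
    """Cluster scalar values with FCM; labels are ordered by cluster center."""
    v = np.asarray(values, dtype=float).ravel()
    centers = np.linspace(v.min(), v.max(), n_clusters)     # assumed init
    for _ in range(n_iters):
        um = _memberships(v, centers, m) ** m
        centers = (um * v[:, None]).sum(axis=0) / um.sum(axis=0)
    labels = _memberships(v, centers, m).argmax(axis=1)
    rank = np.empty(n_clusters, dtype=int)
    rank[np.argsort(centers)] = np.arange(n_clusters)       # 0 = smallest center
    return rank[labels].reshape(np.shape(values)), np.sort(centers)


def pre_classify(img1, img2):
    """Return a map with 0 = unchanged, 1 = uncertain, 2 = changed."""
    labels, _ = fuzzy_c_means_1d(cva_magnitude(img1, img2), n_clusters=3)
    return labels
```

Pixels labeled 0 and 2 provide training samples, and pixels labeled 1 are the ones left for the network to classify.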
III-B Supervised Change Detection
The flowchart of the proposed supervised change detection algorithm is shown in Figure 5. In the supervised algorithm, the fully convolutional architecture DSMS-FCN is adopted to predict the change maps of multi-temporal VHR images. As in the unsupervised algorithm, the first step is pre-processing, which eliminates the geometric and radiometric differences between the multi-temporal VHR images. Then the DSMS-FCN is trained end-to-end directly on the change detection data set without any pre-training or transfer learning. The inputs of the DSMS-FCN are two whole multi-temporal VHR images, and the output is a change map. Unlike our unsupervised algorithm and the majority of recent patch-based approaches, the DSMS-FCN can process images of any size and does not require a sliding patch window, so both the accuracy and the speed of inference can be significantly improved.
As mentioned before, the skip-connection structure of DSMS-FCN cannot fully solve the problem of inaccurate localization, so we use the FC-CRF to refine the results obtained by DSMS-FCN. The parameters of the FC-CRF are trained on the training set or validation set, and the training processes of the FC-CRF and DSMS-FCN are decoupled. First, the DM is computed by CVA and the change probability map of the multi-temporal VHR images is inferred by DSMS-FCN. Then the unary potential of the FC-CRF is generated from the change probability map and the pairwise potential of the FC-CRF is computed from the DM. Thanks to its fully connected structure, the FC-CRF can extract accurate localization information by considering short-range and long-range information simultaneously. Thus the FC-CRF further refines the results obtained by DSMS-FCN and yields a better binary change map with more accurate boundaries. Because a high-dimensional filtering algorithm is adopted to accelerate MFA, the inference of the FC-CRF is very fast in practice. Therefore, although the entire algorithm may take some time during the training phase, inference remains fast.
IV Unsupervised change detection experiments
IV-A Data set
In the unsupervised change detection experiments, the first VHR data set was acquired by the GaoFen-2 (GF-2) sensor on April 4, 2016 and September 1, 2016, covering the city of Wuhan, China, and is denoted as WH. The image size is 1000 x 1000 pixels with four bands: red, green, blue, and near infrared. The spatial resolution is 4 m. Figure 6 shows the pseudo-color images and the ground truth of changed and unchanged areas. (a) and (b) are the pseudo-color images acquired on April 4, 2016 (denoted as WH-1) and September 1, 2016 (denoted as WH-2), respectively. (c) is the ground truth. The changed area (red) contains 20026 pixels, and the unchanged area (green) contains 484143 pixels. The remaining pixels are undefined.
The second VHR data set, denoted as HY, was also acquired by GF-2 with a resolution of 4 m. The images cover the Hanyang area of Wuhan city. The size of the multi-temporal images is 1000 x 1000 pixels with four bands. Figure 7 shows the pseudo-color images and the ground truth. The changed area (red) contains 59051 pixels, and the unchanged area (green) contains 416404 pixels. Owing to the rapid development of Hanyang, the study area shows obvious land-cover changes.
In both data sets, the changed area occupies only a small part of the image, so there is a heavy class imbalance between the changed and unchanged classes, which makes change detection more challenging. In addition, some buildings in the images are "overexposed", which breaks the linear relationship between the radiometric intensities of unchanged regions in the multi-temporal images and cannot be eliminated by radiometric normalization. Hence, the "overexposed" problem makes accurate change detection more difficult.
IV-B Experiment settings
In our unsupervised architecture, the weights of DSMS-CN are initialized with He normal initialization. To overcome the skewed-class problem, the DSMS-CN uses a weighted binary cross-entropy loss, defined as

$$L = -\frac{1}{n} \sum_{i=1}^{n} \big[ w_c \, y_i \log \hat{y}_i + w_u \, (1 - y_i) \log (1 - \hat{y}_i) \big]$$

where $y_i$ is the label of the $i$-th training sample, $\hat{y}_i$ is its predicted change probability, and $w_c$ and $w_u$ are the reciprocals of the proportions of the changed and unchanged classes in the training samples. By setting a larger weight for the changed class, the changed samples play a more important role in the training phase. The Adam optimizer is adopted to optimize the loss function (the learning rate is set to 1e-4). Dropout and weight decay are used to avoid overfitting during the training phase. As discussed in [35, 36], the generally best VHR image patch size is 5, so our experiments also use 5 as the image patch size.
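A weighted binary cross-entropy of this kind can be sketched as below, assuming one weight per class equal to the reciprocal of that class's proportion in the batch of training samples. The exact weighting used by the authors may differ; the function name and the clipping constant are choices made for the example.

```python
import numpy as np


def weighted_bce(y_true, y_prob, eps=1e-7):
    """Class-weighted binary cross-entropy.

    y_true: array of 0/1 labels (1 = changed); both classes must be present.
    y_prob: predicted change probabilities in [0, 1].
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    frac_changed = y_true.mean()
    w_c = 1.0 / frac_changed               # rarer class -> larger weight
    w_u = 1.0 / (1.0 - frac_changed)
    loss = -(w_c * y_true * np.log(y_prob)
             + w_u * (1.0 - y_true) * np.log(1.0 - y_prob))
    return loss.mean()
```

With one changed sample among five, the changed class gets weight 5 and the unchanged class gets weight 1.25, so both classes contribute equally to the total loss when the predictions are uninformative.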
To evaluate our method, seven state-of-the-art methods are used for comparison, namely CVA, MAD, IRMAD, SFA, ISFA, PCA-Kmeans, and DSCN. IRMAD and ISFA are the iterative versions of MAD and SFA. All of these methods except DSCN are unsupervised. DSCN is a deep Siamese convolutional network that performs well on aerial images; we train it in exactly the same way as DSMS-CN. In our experiments, the hyper-parameters of all methods use the optimal values recommended in the original references. OTSU, K-means, and FCM are adopted as segmentation methods, and the best result of each comparison method is chosen as its final result. As evaluation criteria, recall rate, precision rate, overall accuracy (OA), F1 score, and kappa coefficient are adopted for quantitative analysis. The following parts discuss the experimental results on the two data sets.
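The five evaluation criteria can be computed from the binary confusion matrix using their standard definitions. The sketch below assumes both maps are boolean arrays restricted to labeled pixels and that each class occurs at least once; the function name is hypothetical.

```python
import numpy as np


def change_metrics(pred, truth):
    """Precision, recall, OA, F1 and Cohen's kappa for binary change maps."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = float(np.sum(pred & truth))       # changed, detected as changed
    fp = float(np.sum(pred & ~truth))      # unchanged, detected as changed
    fn = float(np.sum(~pred & truth))      # changed, missed
    tn = float(np.sum(~pred & ~truth))     # unchanged, kept unchanged
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    oa = (tp + tn) / n
    f1 = 2 * precision * recall / (precision + recall)
    # Expected agreement by chance, from the row and column marginals.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return {"precision": precision, "recall": recall, "oa": oa,
            "f1": f1, "kappa": kappa}
```

Under heavy class imbalance, OA alone can be high even for poor detectors, which is why F1 and kappa are reported alongside it.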
IV-C Experiments on WH data set
Following our unsupervised algorithm, the WH data set is first pre-processed. Then CVA fuses the information of the four bands and computes the DM. As shown in Fig. 8(a), areas with warmer tones have a greater probability of change. Based on the DM, FCM is implemented to generate the coarse initial change map (CICM). In Fig. 8(b), the white pixels belong to the changed class, the dark pixels belong to the unchanged class, and the gray pixels are candidates to be classified by the proposed method. All pixels in the changed class are chosen as training samples. However, since the spectral difference between the multi-temporal images is relatively constant for pixels in the unchanged class, we randomly select only a part of the black area as training samples, namely four times the number of changed pixels.
Figure 9 shows the binary change maps obtained by the proposed method and the comparison methods. The results obtained by MAD and SFA are unsatisfactory: a large number of unchanged pixels are classified as changed. Through iterative weighting, the results obtained by IRMAD and ISFA are visually better. Nevertheless, many building roofs are misclassified into the changed class, which means that ISFA and IRMAD cannot solve the "overexposed" problem. Besides, some slightly changed pixels are classified as unchanged. Compared with IRMAD and ISFA, CVA, which directly calculates a difference image and performs clustering, achieves a relatively good result. Nevertheless, changes in a few buildings and small objects, such as roads, are not recognized, and many building margins are misclassified as changed. By transforming spatial blocks with PCA, PCA-Kmeans fuses spatial context information and correctly classifies small objects, but some building margins are still misclassified as changed. In the results obtained by DSCN, thanks to the convolution operation, building margins are correctly classified and the effects of "overexposure" are ignored, but many building changes are not recognized. Like DSCN, the DSMS-CN is largely immune to the "overexposed" problem. Despite a few omissions and residual errors, the binary change map obtained by DSMS-CN achieves the best qualitative result.
Table I reports the quantitative results based on the five evaluation criteria described in Section IV-B. As observed in Figure 9, IRMAD has low accuracy, and its F1 and kappa are only 0.3651 and 0.3289. ISFA performs slightly better than IRMAD, with F1 and kappa of 0.3888 and 0.3530. However, ISFA still has shortcomings similar to those of IRMAD. This is because both MAD and SFA are based on the central limit theorem, whereas the Gaussianity of VHR images is not obvious. Therefore, they are not suitable for change detection in VHR images covering a small region, even though they can achieve outstanding results on low- and medium-resolution images. Although the training samples are selected by CVA and FCM, the DSMS-CN achieves the best result, with an OA of 0.9809, an F1 of 0.7210, and a kappa coefficient of 0.7114. This means that the proposed DSMS-CN can effectively fit the distribution of ground changes in VHR images from the pre-classification samples, based on the powerful extraction ability of the MFCU and the deep Siamese convolutional structure. By contrast, the DSCN, which processes the features directly, cannot compete with our approach.
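The sample selection described above can be sketched as follows, given a pre-classification map. The label coding (0 unchanged, 1 uncertain, 2 changed), the fixed random seed, and the reflect padding used for patches at the image border are assumptions made for this example.

```python
import numpy as np


def select_training_pixels(cluster_map, ratio=4, changed_label=2,
                           unchanged_label=0, seed=0):
    """Keep all changed pixels and a random subset of unchanged pixels.

    Returns two (n, 2) arrays of (row, col) positions. The unchanged subset
    has at most `ratio` times as many pixels as the changed set.
    """
    rng = np.random.default_rng(seed)
    changed = np.argwhere(cluster_map == changed_label)
    unchanged = np.argwhere(cluster_map == unchanged_label)
    n_pick = min(len(unchanged), ratio * len(changed))
    picked = unchanged[rng.choice(len(unchanged), size=n_pick, replace=False)]
    return changed, picked


def extract_patch(img, row, col, size=5):
    """Return the size x size neighborhood centered on (row, col)."""
    r = size // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + size, col:col + size]
```

For each selected position, one patch is taken from each of the two dates and the pair is labeled with the cluster of its center pixel.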
IV-D Experiments on HY data set
The DM and CICM are shown in Figure 10. As for the WH data set, all pixels in the changed class are chosen as training samples, and four times as many unchanged pixels are selected as samples.
The qualitative results obtained by the proposed method and the comparison methods are shown in Figure 11. The residual errors and noise in the results obtained by MAD and SFA are obvious, with many unchanged pixels misclassified as changed. Because of the complexity of VHR images, iterative weighting makes the results obtained by IRMAD and ISFA worse, with more misclassifications. The binary change maps obtained by CVA and PCA-Kmeans are visually better. However, CVA misclassifies some building margins as changed, and PCA-Kmeans misses some obvious changes. The result obtained by DSCN is similar to that obtained by CVA, but more building margins are misclassified as changed. The change detection result obtained by applying DSMS-CN to HY is shown in Fig. 11(h). Although a few details in the changed regions are missed, most unchanged regions are well preserved. It can be observed that DSMS-CN outperforms the other methods and achieves the best qualitative result.
Table II reports the quantitative results. As observed in the qualitative results, because of the complexity of VHR images, the accuracy of IRMAD and ISFA is unsatisfactory, and MAD and SFA are slightly better than IRMAD and ISFA. The results obtained by CVA and PCA-Kmeans are relatively better, with kappa coefficients of 0.7174 and 0.6794, respectively. Because it misclassifies more building margins, DSCN is less accurate than CVA, with an F1 score of 0.7074 and a kappa coefficient of 0.6796. By contrast, the DSMS-CN achieves the best result, with a precision rate of 0.8982, an OA of 0.9503, an F1 score of 0.7720, and a kappa coefficient of 0.7448, which again demonstrates the powerful extraction ability of the MFCU and the deep Siamese convolutional structure.
V Supervised change detection experiments
V-A ACD data set
To train the proposed network and evaluate our method, the SZTAKI AirChange Benchmark set (ACD) is employed [38, 39]. The data set contains three sets of registered multi-temporal RGB aerial image pairs acquired in different seasonal conditions, together with the associated ground truth. The size of each image is 952 x 640 pixels and the spatial resolution is 1.5 m. The main differences between the image pairs are new built-up regions, fresh plough-land, and groundwork before building. As a public data set, ACD has already been used in [18, 19, 38, 39].
V-B Experiment settings
First, to augment the available training data, all possible flips and rotations by multiples of 90 degrees are used. For the data set split, we follow the protocol of previous work: the top-left 784x448 corners of Szada-1 and Tiszadob-3 are cropped for testing, and the rest of the images are used for training. The test image pairs are shown in Figure 12. Szada and Tiszadob are treated separately as two different data sets, and the set named "Archieve" is ignored, since it contains only one image pair.
In our supervised algorithm, the weights of DSMS-FCN are initialized with He normal initialization. To overcome the skewed-class problem, the DSMS-FCN uses a weighted binary cross-entropy loss. The Adam optimizer is adopted to optimize the loss function (the learning rate is set to 2e-4). Dropout is used to avoid overfitting during the training phase. The weights and parameters of the Gaussian kernels of the FC-CRF are determined by grid search on the training set, and the penalty of the pairwise potential is trained by the L-BFGS algorithm.
The methods used for comparison are DSCN, CXM, SCCN, and three fully convolutional networks, using the values reported in the corresponding references. The three fully convolutional networks used for comparison are FC-EF, FC-Siam-Conc, and FC-Siam-Diff. To better reflect the role of the FC-CRF, we separately evaluate the results obtained by DSMS-FCN alone and by the combination of DSMS-FCN and FC-CRF. In addition, to verify the superiority of the MFCU over the conventional single convolution unit, we modify the fully convolutional architecture FC-EF by replacing its 3x3 convolution kernels with our MFCU. The modified network is denoted as MSFC-EF. As in the unsupervised experiments, recall rate, precision rate, overall accuracy, F1 score, and kappa coefficient are adopted as evaluation criteria.
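The flips and 90-degree rotations form the eight symmetries of a rectangle grid, which can be enumerated with NumPy as sketched below. Applying the same transform to both dates and to the ground-truth mask is an assumption about how the pairs are augmented; the function names are hypothetical.

```python
import numpy as np


def dihedral_augmentations(img):
    """Return the 8 flip/rotation variants of an (H, W, ...) array."""
    variants = []
    for k in range(4):
        rotated = np.rot90(img, k, axes=(0, 1))      # rotate by k * 90 degrees
        variants.append(rotated)
        variants.append(np.flip(rotated, axis=1))    # plus its mirror image
    return variants


def augment_pair(img_t1, img_t2, mask):
    """Apply each transform identically to both dates and the change mask."""
    return list(zip(dihedral_augmentations(img_t1),
                    dihedral_augmentations(img_t2),
                    dihedral_augmentations(mask)))
```

Each training pair therefore yields eight geometrically consistent samples, including the original.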
V-C Experiment on ACD data set
Figure 13 and Figure 14 illustrate our results on ACD. Both our DSMS-FCN and the improved FC-EF network, namely MSFC-EF, achieve satisfactory qualitative results. When they are combined with FC-CRF, more low-level information is taken into account, so the results obtained by the deep fully convolutional networks are refined.
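To make the role of the low-level information concrete, the sketch below shows the two ingredients of a fully connected CRF in the standard formulation of Krähenbühl and Koltun [23]: a unary energy taken from the network's per-pixel change probability, and a pairwise kernel made of an appearance term and a smoothness term. All function names, parameter names and default values are illustrative placeholders rather than the settings used in the experiments, and treating `feat_i`/`feat_j` as the colour vectors of the two pixels is an assumption.

```python
import math


def unary_energy(p_change, eps=1e-7):
    """Negative log-probabilities for the labels (unchanged, changed)."""
    p = min(max(p_change, eps), 1.0 - eps)  # clip to keep log() finite
    return (-math.log(1.0 - p), -math.log(p))


def pairwise_kernel(pos_i, pos_j, feat_i, feat_j,
                    w_app=1.0, w_smooth=1.0,
                    theta_alpha=80.0, theta_beta=13.0, theta_gamma=3.0):
    """Affinity of two pixels: appearance kernel plus smoothness kernel.

    pos_*  : (row, col) pixel coordinates
    feat_* : colour vectors of the pixels (assumed RGB here)
    """
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    d_feat = sum((a - b) ** 2 for a, b in zip(feat_i, feat_j))
    appearance = w_app * math.exp(-d_pos / (2 * theta_alpha ** 2)
                                  - d_feat / (2 * theta_beta ** 2))
    smoothness = w_smooth * math.exp(-d_pos / (2 * theta_gamma ** 2))
    return appearance + smoothness


if __name__ == "__main__":
    print(unary_energy(0.9))
    print(pairwise_kernel((0, 0), (1, 0), (120, 80, 60), (118, 82, 61)))
```

Nearby pixels with similar colours receive a large kernel value, so assigning them different labels is penalized; this is how boundaries in the refined change map are pulled toward image edges.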
Table III and Table IV report the quantitative results on the two test image pairs. The results obtained on Szada-1 show the superiority of the DSMS-FCN, which clearly outperforms all the other comparison methods with a recall of 0.5133, an overall accuracy of 0.9435 and an F1 score of 0.5522. With the MFCU, MSFC-EF is better than FC-EF on every metric. Combined with FC-CRF, the overall accuracy, F1 score and kappa coefficient of MSFC-EF and DSMS-FCN are further improved. The DSMS-FCN-FCCRF achieves the best recall, overall accuracy, F1 score and kappa coefficient.
On Tiszadob-3, though the DSMS-FCN is not the best, it is still superior to DSCN, CXM, SCCN and the other two fully convolutional networks. The FC-EF network achieves a very high F1 score of 0.9340, but the MSFC-EF with the MFCU still improves on it: MSFC-EF is better than FC-EF on every metric. With FC-CRF, the overall accuracy, F1 score and kappa coefficient of MSFC-EF and DSMS-FCN are further improved. The MSFC-EF-FCCRF achieves the best performance.
The experimental results show that the MSFC-EF outperforms the original FC-EF on both Szada-1 and Tiszadob-3, which illustrates that the proposed MFCU is able to improve the performance of change detection with deep networks. With FC-CRF, the results obtained by the deep fully convolutional networks can be further improved. It is worth noting that on both test images, the performance of the fully convolutional networks is clearly better than that of the patch-based method DSCN and the other architectures.
Furthermore, the inference time of DSMS-FCN is below 0.1s per image and the inference time of FC-CRF is under 1s per image, which means our method can efficiently process multi-temporal VHR images in real time.
In this paper, a powerful multi-scale feature convolution unit is presented. Unlike the conventional single convolution unit, which extracts only single-scale features in one layer, the proposed unit is able to extract multi-scale features in the same layer through a “network in network” structure. Based on the unit, two novel deep Siamese convolutional network architectures are designed for unsupervised and supervised change detection in VHR images. The DSMS-CN, used for unsupervised change detection, is trained on pre-classification samples automatically generated by CVA and FCM. The DSMS-FCN is responsible for supervised change detection. As a fully convolutional network, the DSMS-FCN is able to process images of any size and does not require a sliding patch window, so the accuracy and speed of inference can be significantly improved. To overcome the inaccurate localization problem, the FC-CRF is adopted to refine the results obtained by DSMS-FCN. The two are combined by using the output of DSMS-FCN as the unary potential of the FC-CRF.
In the unsupervised change detection experiments with the challenging WH and HY data sets, the qualitative and quantitative results indicate that the DSMS-CN outperforms seven other state-of-the-art methods with better overall accuracy, F1 score and kappa coefficient, which confirms the powerful feature extraction ability of the MFCU and the outstanding fitting capability of the DSMS-CN. In the supervised change detection experiments with the ACD data set, the DSMS-FCN delivers better performance than three fully convolutional networks, a patch-based deep network method and other state-of-the-art methods; the MFCU also exhibits powerful feature extraction capability, and FC-CRF does make the results obtained by deep fully convolutional networks more accurate. Besides, the inference time of our supervised architecture is below 1s per image, so it can predict change maps of multi-temporal VHR images in real time.
Our future work includes, but is not limited to, applying the deep Siamese architecture to change detection in heterogeneous VHR images and utilizing FC-CRF to improve the performance of traditional change detection methods on VHR images.
This work was supported in part by the National Natural Science Foundation of China under Grants 61601333, 61822113 and 41801285.
-  A. Singh, “Review article digital change detection techniques using remotely-sensed data,” International journal of remote sensing, vol. 10, no. 6, pp. 989–1003, 1989.
-  G. Xian, C. Homer, and J. Fry, “Updating the 2001 national land cover database land cover classification to 2006 by using landsat imagery change detection methods,” Remote Sensing of Environment, vol. 113, no. 6, pp. 1133–1147, 2009.
-  P. Coppin, I. Jonckheere, K. Nackaerts, B. Muys, and E. Lambin, “Review article digital change detection methods in ecosystem monitoring: a review,” International journal of remote sensing, vol. 25, no. 9, pp. 1565–1596, 2004.
-  H. Luo, C. Liu, C. Wu, and X. Guo, “Urban change detection based on dempster–shafer theory for multitemporal very high-resolution imagery,” Remote Sensing, vol. 10, no. 7, p. 980, 2018.
-  D. Lu, P. Mausel, E. Brondizio, and E. Moran, “Change detection techniques,” International journal of remote sensing, vol. 25, no. 12, pp. 2365–2401, 2004.
-  D. Brunner, G. Lemoine, and L. Bruzzone, “Earthquake Damage Assessment of Buildings Using VHR Optical and SAR Imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 5, pp. 2403–2420, 2010.
-  M. E. Zelinski, J. Henderson, and M. Smith, “Use of Landsat 5 for Change Detection at 1998 Indian and Pakistani Nuclear Test Sites,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 8, pp. 3453–3460, 2014.
-  L. Bruzzone and D. F. Prieto, “Automatic analysis of the difference image for unsupervised change detection,” IEEE Transactions on Geoscience and Remote sensing, vol. 38, no. 3, pp. 1171–1182, 2000.
-  J. Deng, K. Wang, Y. Deng, and G. Qi, “Pca-based land-use change detection and analysis using multitemporal and multisensor satellite data,” International Journal of Remote Sensing, vol. 29, no. 16, pp. 4823–4838, 2008.
-  A. A. Nielsen, K. Conradsen, and J. J. Simpson, “Multivariate alteration detection (MAD) and maf postprocessing in multispectral, bitemporal image data: New approaches to change detection studies,” Remote Sensing of Environment, vol. 64, no. 1, pp. 1–19, 1998.
-  A. A. Nielsen, “The regularized iteratively reweighted mad method for change detection in multi-and hyperspectral data,” IEEE Transactions on Image processing, vol. 16, no. 2, pp. 463–478, 2007.
-  C. Wu, B. Du, and L. Zhang, “Slow feature analysis for change detection in multispectral imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 5, pp. 2858–2874, 2014.
-  C. Wu, L. Zhang, and B. Du, “Kernel slow feature analysis for scene change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 4, pp. 2367–2384, 2017.
-  Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  L. Zhang, L. Zhang, and B. Du, “Deep Learning for Remote Sensing Data: A Technical Tutorial on the State of the Art,” IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.
-  Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  J. Liu, M. Gong, K. Qin, and P. Zhang, “A Deep Convolutional Coupling Network for Change Detection Based on Heterogeneous Optical and Radar Images,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 3, pp. 545–559, 2018.
-  Y. Zhan, K. Fu, M. Yan, X. Sun, H. Wang, and X. Qiu, “Change Detection Based on Deep Siamese Convolutional Network for Optical Aerial Images,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 10, pp. 1845–1849, 2017.
-  R. Caye Daudt, B. Le Saux, and A. Boulch, “Fully Convolutional Siamese Networks for Change Detection,” 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 4063–4067.
-  X. Niu, M. Gong, T. Zhan, and Y. Yang, “A Conditional Adversarial Network for Change Detection in Heterogeneous Images,” IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 1, pp. 45–49, 2019.
-  M. Lin, S. Yan and Q. Chen. (Dec. 2013). “Network in network.” [Online]. Available: https://arxiv.org/abs/1312.4400
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
-  P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” Advances in neural information processing systems 24: 25th annual conference on neural information processing systems 2011, 2011, pp. 109–117.
-  K. Simonyan, and A. Zisserman. (Sep. 2014). “Very Deep Convolutional Networks for Large-Scale Image Recognition.” [Online]. Available: https://arxiv.org/abs/1409.1556
-  J. Bromley, I. Guyon, Y. Lecun, E. Säckinger, and R. Shah, “Signature Verification Using a Siamese Time Delay Neural Network,” Advances in neural information processing systems, vol. 7, no. 4, pp. 737–744, 1993.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, 2015, pp. 234–241.
-  L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
-  B. Zitová and J. Flusser, “Image registration methods: a survey,” Image and Vision Computing, vol. 21, no. 11, pp. 977–1000, 2003.
-  K. He, X. Zhang, S. Ren and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
-  D. P. Kingma and J. Ba. (Dec. 2014). “Adam: A Method for Stochastic Optimization.” [Online]. Available: https://arxiv.org/abs/1412.6980
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  A. Krogh, and J. A. Hertz, “A simple weight decay can improve generalization,” Proceedings of the 4th International Conference on Neural Information Processing Systems, 1991, pp. 950–957.
-  M. Gong, J. Zhao, J. Liu, Q. Miao, and L. Jiao, “Change Detection in Synthetic Aperture Radar Images Based on Deep Neural Networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 1, pp. 125–138, 2016.
-  M. Gong, X. Niu, P. Zhang, and Z. Li, “Generative Adversarial Networks for Change Detection in Multispectral Imagery,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 12, pp. 2310–2314, 2017.
-  T. Celik, “Unsupervised Change Detection in Satellite Images Using Principal Component Analysis and K-means Clustering,” IEEE Geoscience and Remote Sensing Letters, vol. 6, no. 4, pp. 772–776, 2009.
-  C. Benedek and T. Szirányi, “Change Detection in Optical Aerial Images by a Multi-Layer Conditional Mixed Markov Model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 47, no. 10, pp. 3416–3430, 2009.
-  C. Benedek and T. Szirányi, “A Mixed Markov Model for Change Detection in Aerial Photos with Large Time Differences,” International Conference on Pattern Recognition (ICPR), 2008.