Unsupervised Change Detection in Multitemporal VHR Images Based on Deep Kernel PCA Convolutional Mapping Network
Abstract
With the development of Earth observation technology, very-high-resolution (VHR) images have become an important data source for change detection. Deep learning methods have recently achieved conspicuous performance in the change detection of VHR images. Nonetheless, most existing deep-learning-based change detection models require annotated training samples. In this paper, a novel unsupervised model called kernel principal component analysis (KPCA) convolution is proposed for extracting representative features from multitemporal VHR images. Based on the KPCA convolution, an unsupervised deep siamese KPCA convolutional mapping network (KPCAMNet) is designed for binary and multiclass change detection. In KPCAMNet, high-level spatial-spectral feature maps are extracted by a deep siamese network consisting of weight-shared KPCA convolutional layers. Then, the change information in the feature difference map is mapped into a 2D polar domain. Finally, the change detection results are generated by threshold segmentation and clustering algorithms. No procedure of KPCAMNet requires labeled data. The theoretical analysis and experimental results demonstrate the validity, robustness, and potential of the proposed method on two binary change detection data sets and one multiclass change detection data set.
I Introduction
The Earth's surface is constantly evolving. Real-time and accurate access to surface changes is of great importance for a better understanding of human activities, ecosystems, and their interactions [Lu2004]. Multitemporal remote sensing images can reveal the dynamic changes of the surface; therefore, change detection based on multitemporal images has become more and more significant. Change detection (CD) is the process of identifying differences in the state of an object or phenomenon by observing it at different times [Singh1989]. It has played an important role in a large number of applications, including land-use and land-cover change analysis, resource management, ecosystem monitoring, and disaster assessment [Xian2009, Coppin2004, Luo2018, Sim2006, Zelinski2014, 2014Li, Brunner2010, Wu2017b, Doroodgar2014]. Recently, owing to the development of Earth observation technology, very-high-resolution (VHR) images have become available from more and more satellite sensors (e.g., SPOT, IKONOS, QuickBird, and GaoFen). VHR images provide abundant surface details and spatial distribution information, which makes it possible to detect subtler changes. CD in multitemporal VHR images has therefore attracted increasing attention [Bovolo2009, Somasundaram2010, Chen2012, Lei2014, Huo2016, Tan2016, Zhan2017, Lv2018a, Daudt2018, Security2018, Wu2017a].
The pixel-based change detection (PBCD) model is the earliest CD method and has been deeply studied and widely used [Singh1989, Sharma2007, Bovolo2012, Thonfeld2016, Deng2008, Nielsen1997, Nielsen2007, Wu2014]. Change vector analysis (CVA) [Sharma2007] is a classic CD model designed to calculate change intensity and change direction for binary and multiclass CD. Based on CVA, many extension models have been proposed, such as compressed CVA (C2VA) [Bovolo2012] and robust CVA (RCVA) [Thonfeld2016]. Principal component analysis (PCA) is a transformation method that transforms the difference images or stacked images into a new feature space and selects a subset of principal components for CD [Deng2008]. In [Nielsen1997] and [Nielsen2007], Nielsen et al. propose multivariate alteration detection (MAD), which aims to maximize the variance of the transformed variables and is invariant to affine transformations. Based on the theory of slow feature analysis (SFA), Wu et al. [Wu2014] propose a novel CD model that extracts the most invariant components from multitemporal images and transforms the original images into a new feature space in which changed pixels are highlighted and unchanged ones are suppressed. Though these PBCD models are effective in their respective application scenarios, they only utilize the spectral information of multitemporal images due to the limitations of the models themselves. However, the increase in the spatial resolution of VHR images limits the spectral resolution and brings spectral variability, which makes the CD results of PBCD models suffer from salt-and-pepper noise and internal fragmentation [Huo2016, Lv2018a, Li2017b]. Therefore, it is of great importance to employ both the spatial context information and the spectral information in VHR image CD.
For the purpose of exploring the spatial context information in VHR images, two major conventional approaches have been developed: the object-based change detection (OBCD) model and spatial feature extraction methods. In OBCD, the image object, consisting of pixels with similar spectral signatures, becomes the basic unit of CD [Hussain2013, GilYepes2016, Desclee2006, Somasundaram2010]. To obtain the image objects, the first step of OBCD is image segmentation, which divides the images into multiple homogeneous regions. After obtaining the image objects, some representative features of the objects, such as the shape index, texture index, and the mean and standard deviation of each band, are extracted for change detection. The second approach comprises the spatial feature extraction methods, in which spatial context information is extracted from the local area of each pixel [Lu2017, Lei2014, Hoberg2015, Zhou2016]. In [Lei2014], a CD method based on texton forests is proposed to capture spatial context information in VHR images. Integrating macro- and micro-texture features with a random forest and a fuzzy set model, Li et al. [Li2017c] propose a multi-texture CD method to extract spatial features for CD. Hoberg et al. [Hoberg2015] introduce the conditional random field (CRF) model into CD of multitemporal VHR images to model their spatial information, and much research has been conducted along this line [Zhou2016, Lv2016, Cao2016, Lv2018a]. Nonetheless, in both of the aforementioned approaches, only low-level features are utilized and many of them are hand-crafted, which is insufficient for representing the key information of the original data and coping with the complex ground situations in VHR images. In addition, the performance of OBCD depends on the quality of segmentation, but it is sometimes difficult to select appropriate segmentation parameters.
In the last few years, deep learning (DL) has achieved spectacular performance in the fields of computer vision and remote sensing image interpretation [Lecun2015, Zhang2016b, Zhu2017a]. Different from conventional methods, DL models have the capability to extract representative high-level features from VHR images. Therefore, a variety of DL-based models have been proposed for CD in multitemporal VHR images. In [Gong2016], a CD method based on a deep neural network (DNN) is proposed for CD in SAR images. In [Zhan2017], Zhan et al. specifically design a deep siamese convolutional network for multitemporal aerial images, which extracts spatial-spectral features by two weight-shared branches. Lyu et al. [Lyu2016] adopt a recurrent neural network (RNN) to tackle the temporal connection between multitemporal images. Going one step further, in [Mou2019], a CD architecture based on a recurrent convolutional neural network is proposed to extract unified features for binary and multiclass CD. Combining a pre-trained deep convolutional neural network with CVA, Saha et al. [Saha2019] design a CD method called deep CVA for VHR image CD. In [CayeDaudt2018], Daudt et al. first introduce the fully convolutional network (FCN) into CD and propose two siamese extensions of FCN, which achieve good performance on two open VHR image CD datasets. Though these DL-based methods achieve good performance in CD, the training process of DL models follows a supervised learning fashion with annotated data, and it is undeniable that the manual selection of annotated samples is labor-intensive, especially for remote sensing data.
Therefore, several unsupervised feature extraction models, including restricted Boltzmann machines (RBMs) [hinton2006] and autoencoders (AEs) [Bengio2007], have been adopted to solve this problem. However, since these models are equipped with fully connected layers, they flatten the image patches into vectors, ignoring the properties of images in the spatial domain [Mou2019]. Another way is to combine deep learning models with unsupervised pre-detection algorithms. Gao et al. [Gao2016] present a CD method based on PCANet [Chan2015] for multitemporal SAR images, in which a pre-detection algorithm based on Gabor wavelets and fuzzy c-means is utilized to select pixels of interest with a high probability of being changed or unchanged. The network is then trained on the samples selected by the automatic pre-detection algorithm. Learning nonlinear features with a DNN and highlighting changes via SFA, Du et al. [Du2019a] propose an unsupervised deep slow feature analysis (DSFA) model for CD, where a pre-detection method based on CVA is utilized to select samples for training the DNN of DSFA. In [Li2019a], an unsupervised spatial fuzzy clustering is adopted to produce pseudo-labels for training the DCNN. This approach solves the sample problem of DL models to a certain degree. However, if the pre-detection algorithm does not perform well on a data set, the performance of the DL model is also damaged. What's more, most of these existing DL-based methods focus merely on binary CD, and there are currently only a few methods [Saha2019, Zhang2019] that can be used for unsupervised multiclass CD.
Considering the above issues comprehensively, in this paper we utilize an unsupervised subspace learning algorithm, kernel principal component analysis (KPCA) [Bernhard1998], to develop a novel feature extraction model called KPCA convolution, which extracts representative spatial-spectral features from VHR images in a totally unsupervised manner. Based on the KPCA convolution, a powerful and general network called KPCAMNet is designed for unsupervised binary and multiclass CD. First, high-level spatial-spectral feature maps are extracted by KPCAMNet with a deep siamese network architecture. Then, pixel-wise subtraction is implemented to obtain the feature difference map. To efficiently utilize the change information in the feature difference map, KPCAMNet maps it into a 2D polar domain. Finally, unsupervised threshold segmentation methods or clustering techniques are performed to obtain the desired CD results.
The rest of this paper is organized as follows. In Section II, the background of KPCA and CNNs is introduced. Section III elaborates the proposed KPCA convolution and KPCAMNet. Section IV provides the experimental settings, experimental results, and discussion. In Section V, the experiment on multiclass CD is carried out. Finally, Section VI draws the conclusion of our work.
II Background
II-A KPCA
PCA is a powerful technique for feature extraction that uses an orthogonal transformation to convert original data into a new feature space [Wold1987]. In the new feature space, possibly correlated data becomes linearly uncorrelated.
However, in some cases the principal components in the input space, which are linearly related to the input variables, cannot extract representative features from the observed data well. PCA can be generalized to KPCA, which maps the original data into a nonlinear high-dimensional space, to better extract the essential high-level representation of the original observed data.
Consider a possibly nonlinear map $\Phi: \mathbb{R}^{d} \rightarrow \mathcal{F}$, where $\mathcal{F}$ is a feature space that could have an arbitrarily large, even infinite, dimensionality. The original data $X = \{x_1, \ldots, x_M\}$ can be mapped into the new nonlinear feature space $\mathcal{F}$. The covariance matrix of the mapped data is expressed as

$C = \frac{1}{M}\sum_{i=1}^{M}\Phi(x_i)\Phi(x_i)^{T}$   (1)

where $\Phi(x_i)$ is the centralized mapped data. The basic idea of KPCA is to solve the PCA problem in the new nonlinear feature space $\mathcal{F}$. This can be achieved by solving the following eigenvalue problem:

$\lambda v = Cv$   (2)
Since the eigenvectors $v$ can be expressed as a linear combination of the mapped data [Bernhard1998]

$v = \sum_{i=1}^{M}\alpha_i\Phi(x_i)$   (3)

and considering

$\lambda(\Phi(x_k)\cdot v) = (\Phi(x_k)\cdot Cv), \quad k = 1, \ldots, M$   (4)

Eq. (2) can be reformulated as

$M\lambda\sum_{i=1}^{M}\alpha_i(\Phi(x_k)\cdot\Phi(x_i)) = \sum_{i=1}^{M}\alpha_i\Big(\Phi(x_k)\cdot\sum_{j=1}^{M}\Phi(x_j)(\Phi(x_j)\cdot\Phi(x_i))\Big), \quad k = 1, \ldots, M$   (5)
Define the $M\times M$ kernel matrix $K$ as

$K_{ij} = (\Phi(x_i)\cdot\Phi(x_j))$   (6)

Then the eigenvalue problem of KPCA can be rewritten as

$M\lambda K\alpha = K^{2}\alpha$   (7)
Since $K$ is a symmetric matrix, it has a set of eigenvectors that spans the whole space [Bernhard1998]; thus

$M\lambda\alpha = K\alpha$   (8)

gives all solutions $\alpha$ of Eq. (7). Further, we can normalize $\alpha^{k}$ to make the corresponding eigenvectors $v^{k}$ in $\mathcal{F}$ normalized:

$\lambda_{k}(\alpha^{k}\cdot\alpha^{k}) = 1$   (9)
For the purpose of feature extraction, the projections of the original data onto the eigenvectors $v^{k}$ in $\mathcal{F}$ should be computed. Let $x$ be a test sample whose image in $\mathcal{F}$ is $\Phi(x)$; the $k$th feature component can be calculated as

$(v^{k}\cdot\Phi(x)) = \sum_{i=1}^{M}\alpha_i^{k}(\Phi(x_i)\cdot\Phi(x))$   (10)
In the above procedure, although we assume that $\Phi(x_i)$ is centralized, it is actually difficult to compute the mean of the mapped data directly in $\mathcal{F}$. Therefore, in KPCA, the centralized kernel matrix $\tilde{K}$ is computed from the original kernel matrix [Bernhard1998]

$\tilde{K} = K - 1_{M}K - K1_{M} + 1_{M}K1_{M}$   (11)

where $1_{M}$ is the $M\times M$ matrix whose elements are all $1/M$, and $K$ is the original kernel matrix calculated directly from the nonlinearly mapped data $\Phi(x_i)$.
Note that in the procedure of KPCA, we only need to compute the dot products of the nonlinearly mapped data, namely $(\Phi(x_i)\cdot\Phi(x_j))$. Therefore, a kernel function $k(x_i, x_j)$ can be used to compute these dot products in $\mathcal{F}$ without explicitly calculating the nonlinear mapping [boser1992]. There are several commonly used kernel functions in KPCA, including the polynomial kernel, radial basis function (RBF), sigmoid kernel, and cosine kernel.
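The KPCA procedure above can be sketched in NumPy. This is a minimal illustration of Eqs. (8)–(11), not the authors' implementation; the RBF kernel and its `gamma` value are arbitrary choices for the example:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    # Pairwise RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def kpca_fit(X, n_components, gamma=0.1):
    # Fit KPCA on training samples X (M x d): build the kernel matrix,
    # centre it as in Eq. (11), and solve the eigenproblem of Eq. (8).
    M = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((M, M), 1.0 / M)          # the 1_M matrix of Eq. (11)
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)   # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[idx], eigvecs[:, idx]
    alpha = alpha / np.sqrt(lam)            # Eq. (9): normalise so ||v^k|| = 1
    return {"X": X, "K": K, "alpha": alpha, "gamma": gamma}

def kpca_transform(model, Xtest):
    # Eq. (10): project test data onto the eigenvectors via kernel values,
    # centring the test kernel consistently with the training kernel.
    X, K, alpha, gamma = model["X"], model["K"], model["alpha"], model["gamma"]
    M = X.shape[0]
    Kt = rbf_kernel(Xtest, X, gamma)
    one_t = np.full((Kt.shape[0], M), 1.0 / M)
    one = np.full((M, M), 1.0 / M)
    Kt_c = Kt - one_t @ K - Kt @ one + one_t @ K @ one
    return Kt_c @ alpha
```

Only kernel evaluations between samples appear in the code; the mapping $\Phi$ itself is never constructed, mirroring the kernel trick described above.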
II-B CNN
The convolutional neural network (CNN) is a specialized kind of neural network for processing data that has a known, grid-like topology [Lecun2015], such as image data. CNNs have been tremendously successful in a variety of computer vision tasks [Lecun2015, He2016, Han2018, Chen2018, Li2019c] and have become one of the most popular network architectures.
Generally, a CNN is composed of several convolutional layers, followed by activation functions and pooling layers. Given an image or other grid-like data, a set of feature maps is extracted by multiple trainable convolution kernels in a convolutional layer, i.e.,

$F_k = f(I * W_k + b_k)$   (12)

where $F_k$ is the $k$th feature map obtained with the $k$th convolution kernel $W_k$ and bias $b_k$, $*$ denotes the convolution operation, and $f(\cdot)$ is the activation function following the convolutional layer, which introduces nonlinearity into the network. The pooling layer then subsamples each feature map to remove redundant features and helps make the features approximately invariant to spatial translations. After being processed by several alternating convolutional and pooling layers, the high-level features of the input data are finally generated in an automatic manner instead of being hand-crafted.
The training (or optimization) process of a CNN follows a supervised learning fashion with annotated data. In the training process, stochastic gradient descent with the back-propagation (BP) algorithm [Y1998] is usually utilized to optimize the parameters of the CNN. As the training process goes on, the features generated by the network become more and more representative.
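As a concrete illustration of Eq. (12), a naive loop-based forward pass of one convolutional layer might look like the following sketch; the tanh activation and valid-mode correlation are illustrative assumptions, not choices made by the paper:

```python
import numpy as np

def conv_layer(image, kernels, biases, activation=np.tanh):
    # Eq. (12): F_k = f(I * W_k + b_k), computed as a valid-mode
    # correlation between the image and each kernel, plus a bias,
    # followed by a pointwise nonlinearity.
    H, W, B = image.shape
    n, p, _, _ = kernels.shape          # n kernels of size p x p x B
    out = np.zeros((H - p + 1, W - p + 1, n))
    for k in range(n):
        for y in range(H - p + 1):
            for x in range(W - p + 1):
                out[y, x, k] = (image[y:y+p, x:x+p] * kernels[k]).sum() + biases[k]
    return activation(out)
```

Real frameworks implement this with optimized primitives; the loops are only meant to make the indexing of Eq. (12) explicit.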
III Methodology
III-A KPCA Convolution for VHR Images
Compared with low- and medium-resolution images, VHR images provide not only spectral information but also abundant ground details, texture features, and spatial distribution information. Therefore, spatial-spectral features are of great importance for change detection in VHR images. The convolutional layer of a CNN is a suitable structure for simultaneously extracting spatial-context features and spectral features in a 2D manner. However, as mentioned in Section II-B, the training process of a CNN follows a supervised learning fashion with annotated data, and in the field of remote sensing, datasets with ground-truth labels are not always available in practical applications.
In this subsection, a novel convolution operation based on KPCA is proposed to extract the representative spatialspectral features from VHR images in an unsupervised manner. Let us consider the case of linear PCA first.
PCA Convolution
Given a VHR image $I \in \mathbb{R}^{H\times W\times B}$, a corresponding $s\times s\times B$ image patch is taken around each pixel. For the entire image, all image patches are vectorized and collected, i.e., $X = \{x_1, \ldots, x_N\}$. Then, $M$ vectorized image patches are randomly selected as training samples, denoted $X_{s} = \{x_1, \ldots, x_M\}$.
The covariance matrix of the training samples can be calculated as

$C = \frac{1}{M}\sum_{i=1}^{M}(x_i - \bar{x})(x_i - \bar{x})^{T}$   (13)

The eigenvalues $\lambda_l$ and corresponding eigenvectors $v^{l}$ are obtained by solving the PCA problem. Utilizing the first $L$ components to transform the image patches into a new feature space, the representative spatial-spectral features can be extracted as

$f_j = V^{T}x_j, \quad j = 1, \ldots, N$   (14)

where $V = [v^{1}, \ldots, v^{L}]$ and $f_j \in \mathbb{R}^{L}$ is the spatial-spectral feature of the $j$th pixel.
Note that the above procedure is equivalent to a convolution operation

$F^{l} = I * W^{l}, \quad l = 1, \ldots, L$   (15)

where the $l$th convolution kernel $W^{l}$ is the reshaped eigenvector $v^{l}$, and $F^{l}$ is the $l$th channel of the spatial-spectral feature map $F$.
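The procedure above can be sketched in NumPy: sample vectorized patches, solve the PCA problem of Eq. (13), and reshape the leading eigenvectors into convolution kernels as in Eq. (15). The random sampling scheme and valid-mode correlation are simplifying assumptions, not the paper's exact implementation:

```python
import numpy as np

def pca_conv_kernels(image, patch, n_kernels, n_train=500, seed=0):
    # image: (H, W, B). Collect vectorised patches, solve PCA on their
    # covariance (Eq. (13)), and reshape the leading eigenvectors into
    # (patch, patch, B) convolution kernels (Eq. (15)).
    H, W, B = image.shape
    r = patch // 2
    rng = np.random.default_rng(seed)
    ys = rng.integers(r, H - r, n_train)
    xs = rng.integers(r, W - r, n_train)
    samples = np.stack([image[y - r:y + r + 1, x - r:x + r + 1].ravel()
                        for y, x in zip(ys, xs)])
    samples = samples - samples.mean(axis=0)        # centre the samples
    cov = samples.T @ samples / n_train
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_kernels]   # leading components
    return eigvecs[:, order].T.reshape(n_kernels, patch, patch, B)

def pca_convolve(image, kernels):
    # Valid-mode correlation of each PCA kernel with the image,
    # producing one feature-map channel per kernel.
    n, p, _, _ = kernels.shape
    H, W, _ = image.shape
    out = np.zeros((H - p + 1, W - p + 1, n))
    for l in range(n):
        for y in range(H - p + 1):
            for x in range(W - p + 1):
                out[y, x, l] = (image[y:y+p, x:x+p] * kernels[l]).sum()
    return out
```

Because the kernels are reshaped orthonormal eigenvectors, applying them at every location is exactly the patch-wise projection of Eq. (14).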
Kernel PCA Convolution
As seen from Eq. (15), PCA convolution is a linear transformation, which limits the representational ability of the extracted features. To improve the representativeness and discriminability of the extracted features and the subsequent CD performance, nonlinearity is introduced into the spatial-spectral features by mapping the original images into a new nonlinear feature space. To do this, PCA convolution is generalized to KPCA convolution.
Considering a nonlinear map $\Phi$, the vectorized image patches taken from the VHR image $I$ can be mapped into a high-dimensional nonlinear feature space $\mathcal{F}$:

$x_j \mapsto \Phi(x_j), \quad j = 1, \ldots, N$   (16)
The covariance matrix of the training samples is expressed as

$C = \frac{1}{M}\sum_{i=1}^{M}\Phi(x_i)\Phi(x_i)^{T}$   (17)
The nonlinear eigenvectors (principal components) can be obtained by solving the KPCA problem

$\lambda v = Cv$   (18)
As elaborated in Section II-A, the eigenvalue problem of KPCA can be rewritten as

$M\lambda\alpha = \tilde{K}\alpha$   (19)
By solving this equation to obtain all solutions $\alpha^{l}$, according to Eq. (9) and Eq. (10), the nonlinear spatial-spectral features can be extracted with the first $L$ nonlinear principal components as

$f_j^{l} = (v^{l}\cdot\Phi(x_j)) = \sum_{i=1}^{M}\alpha_i^{l}(\Phi(x_i)\cdot\Phi(x_j))$   (20)

where $f^{l}$ is the $l$th channel of the vectorized nonlinear spatial-spectral feature map. As seen from Eq. (20), in the nonlinear feature extraction process, only the dot products of the nonlinearly mapped data need to be calculated. Therefore, a suitable kernel function is utilized to compute the dot products and extract nonlinear spatial-spectral features without directly employing the nonlinear mapping $\Phi$:

$k(x_i, x_j) = (\Phi(x_i)\cdot\Phi(x_j))$   (21)
The above procedure is also equivalent to a convolution operation, though the convolution kernels are not constructed explicitly:

$F^{l} = g(I) * W^{l}, \quad l = 1, \ldots, L$   (22)

where $F^{l}$ is the $l$th channel of the nonlinear spatial-spectral feature map $F$, $W^{l}$ is the $l$th implicit linear convolution kernel with the same size as the image patches, and $g(\cdot)$ is an implicit nonlinear function. This procedure is called KPCA convolution. Note that when $g(\cdot)$ is the identity map, Eq. (22) degenerates to

$F^{l} = I * W^{l}$   (23)

which is the PCA convolution; thus, linear PCA convolution is a special case of KPCA convolution.
As shown in Fig. 1, similar to the convolutional layer of a CNN, the KPCA convolutional layer is able to extract representative spatial-spectral features from VHR images in a 2D manner, but the learning process of its parameters is totally unsupervised and does not require annotated data. The whole process of KPCA convolution is summarized in Algorithm 1.
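Assuming scikit-learn's `KernelPCA` as a stand-in for the eigenproblem machinery of Eqs. (19)–(21), a single KPCA convolutional layer might be sketched as follows; border handling is simplified to valid mode, and the sampling size is arbitrary:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def kpca_conv_layer(image, patch=3, n_maps=4, n_train=300, seed=0):
    # One KPCA convolutional layer: fit KPCA on randomly sampled
    # vectorised patches, then project EVERY patch of the image --
    # the implicit convolution of Eq. (22).
    H, W, B = image.shape
    r = patch // 2
    patches = np.stack([image[y - r:y + r + 1, x - r:x + r + 1].ravel()
                        for y in range(r, H - r) for x in range(r, W - r)])
    rng = np.random.default_rng(seed)
    train = patches[rng.choice(len(patches),
                               size=min(n_train, len(patches)),
                               replace=False)]
    kpca = KernelPCA(n_components=n_maps, kernel="rbf").fit(train)
    feats = kpca.transform(patches)                 # Eq. (20) projections
    return feats.reshape(H - 2 * r, W - 2 * r, n_maps)
```

No explicit kernels are ever formed: the "convolution" is realized purely through kernel evaluations between each patch and the training patches, as the text describes.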
III-B Deep Siamese KPCA Convolutional Mapping Network
Based on the proposed KPCA convolution, the KPCAMNet is designed for CD in multitemporal VHR images. The architecture of the proposed KPCAMNet is shown in Fig. 2. First, the multitemporal VHR images are preprocessed. Then, representative spatialspectral feature maps are extracted from multitemporal VHR images by several weightshared KPCA convolutional layers. Next, by means of pixelwise subtraction, the feature difference map is generated. Finally, the highlevel difference features are mapped into a 2D polar domain to employ change information efficiently and the final binary and multiclass CD results are generated by threshold segmentation methods and clustering algorithms.
Data Preprocessing
Before utilizing the proposed KPCAMNet to detect changes, the first step is data preprocessing, including image registration and radiometric correction. Image registration is the process of aligning two or more images covering the same scene obtained at different times [Zitova2003]. Given two multitemporal images, only when the same pixels correspond to the same geographical location, can the CD between the corresponding areas be meaningful. There are three main steps in image registration: collecting matched pointpairs, establishing a transformation model, and transforming images. By these three steps, the two given multitemporal images are geometrically aligned.
Next, radiometric correction is performed to eliminate the radiometric difference between the multitemporal images caused by different imaging conditions, including sun angle, light intensity, atmospheric conditions, and so on. The specific method is radiometric relative normalization based on the z-score, which standardizes the images to zero mean and unit variance. Consider a multitemporal spectral vector pair, denoted by $x_{jk}^{1}$ and $x_{jk}^{2}$, where $j = 1, \ldots, N$ indicates the pixel number and $k = 1, \ldots, B$ indexes the spectral band. The radiometric relative normalization can be expressed as

$\hat{x}_{jk}^{i} = \frac{x_{jk}^{i} - \mu_{k}^{i}}{\sigma_{k}^{i}}, \quad i = 1, 2$   (24)

where $\mu_{k}^{i}$ is the mean and $\sigma_{k}^{i}$ is the standard deviation of band $k$ of image $I_{i}$. Through radiometric correction, the radiometric difference between the multitemporal VHR images caused by different conditions is suppressed.
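The z-score normalization of Eq. (24) is straightforward to express per band; a minimal sketch:

```python
import numpy as np

def zscore_normalize(image):
    # Eq. (24): per-band z-score normalization of an (H, W, B) image,
    # giving each band zero mean and unit variance (bands are assumed
    # non-constant, so the standard deviation is non-zero).
    mu = image.mean(axis=(0, 1), keepdims=True)
    sigma = image.std(axis=(0, 1), keepdims=True)
    return (image - mu) / sigma
```

Applied independently to each epoch, this suppresses global radiometric offsets and gain differences between the two acquisitions.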
Deep KPCA Convolution
After data preprocessing, we can train a KPCA convolutional layer, denoted KC, to extract nonlinear features from a VHR image:

$F = KC(I)$   (25)
Though a single KPCA convolutional layer can extract representative spatial-spectral features to some extent, it still has some inadequacies. First, a single KPCA convolutional layer has a fixed and limited receptive field and can only extract single-scale features, whereas features at several scales exist in VHR images. Besides, a single layer may be unable to extract high-level features from VHR images.
Therefore, for the purpose of better extracting features from VHR images, several KPCA convolutional layers are stacked to construct a deep KPCA convolutional network that extracts more representative features. Considering that CD involves two images and that the VHR images processed in this paper are homogeneous, two siamese subnetworks $S_1$ and $S_2$ are designed to extract spatial-spectral features from the two VHR images in parallel. Given two multitemporal VHR images $I_1$ and $I_2$, without loss of generality, the outputs of the two subnetworks are

$F_1 = S_1(I_1) = KC_n(\cdots KC_1(I_1)), \quad F_2 = S_2(I_2) = KC_n(\cdots KC_1(I_2))$   (26)

where $F_1$ and $F_2$ are the outputs, $KC_i$ is the $i$th KPCA convolutional layer, and $n$ is the number of KPCA convolutional layers. Since $S_1 = S_2$, the convolution kernel sizes, kernel parameters, and number of KPCA convolutional layers of $S_1$ and $S_2$ are all the same. Therefore, the two subnetworks extract features from the multitemporal VHR images in exactly the same way.
To train each KPCA convolutional layer, an unsupervised layer-wise training scheme is proposed. For the first layer, multiple training patch pairs are randomly selected to train the weight-shared KPCA convolutional layer; each training patch pair consists of two multitemporal image patches of the same size covering the same area. Next, multiple training patch pairs are selected from the two feature maps extracted by the first layer to train the second layer. By repeating this training-extracting process, each KPCA convolutional layer of KPCAMNet can be trained, and two representative spatial-spectral feature maps are generated.
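The layer-wise training scheme can be sketched as follows, with a linear PCA projection (via SVD) standing in for the KPCA solver to keep the example short; the sampling size, depth, and patch size are illustrative assumptions:

```python
import numpy as np

def fit_shared_layer(f1, f2, patch, n_maps, n_train, rng):
    # Weight sharing: training patches are sampled from BOTH epochs, so
    # the layer's projection basis is common to the two branches.
    def sample(f):
        H, W, _ = f.shape
        r = patch // 2
        ys = rng.integers(r, H - r, n_train)
        xs = rng.integers(r, W - r, n_train)
        return np.stack([f[y - r:y + r + 1, x - r:x + r + 1].ravel()
                         for y, x in zip(ys, xs)])
    X = np.concatenate([sample(f1), sample(f2)])
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_maps]                       # shared projection basis

def apply_layer(f, basis, patch):
    # Project every (valid) patch of f onto the shared basis.
    H, W, _ = f.shape
    r = patch // 2
    return np.array([[basis @ f[y - r:y + r + 1, x - r:x + r + 1].ravel()
                      for x in range(r, W - r)] for y in range(r, H - r)])

def deep_siamese_features(I1, I2, depth=2, patch=3, n_maps=4, seed=0):
    # Layer-wise training: fit layer k on the outputs of layer k-1,
    # then push both epochs through the same trained layer.
    rng = np.random.default_rng(seed)
    f1, f2 = I1, I2
    for _ in range(depth):
        basis = fit_shared_layer(f1, f2, patch, n_maps, 200, rng)
        f1, f2 = apply_layer(f1, basis, patch), apply_layer(f2, basis, patch)
    return f1, f2
```

The key point the sketch captures is that each layer is fitted once on patches from both epochs and then applied identically to both, so the two branches remain exact copies of each other.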
Feature Mapping and Change Detection
After the two high-level spatial-spectral feature maps are extracted by the subnetworks, a pixel-wise subtraction operation is implemented to obtain the feature difference map $D = F_1 - F_2$, which contains abundant change information. To comprehensively utilize this change information, the feature magnitude (FM) is calculated as

$FM_j = \sqrt{\sum_{l=1}^{L}(d_j^{l})^{2}}$   (27)

where $L$ is the number of channels and $d_j^{l}$ is the $l$th channel of the feature difference map at pixel $j$. $FM$ carries information about whether a change exists: the larger $FM_j$ is, the higher the change probability of pixel $j$. Through a threshold segmentation method, such as OTSU [otsu1979] or EM [moon1996], the binary CD result can be generated.
At present, most unsupervised CD methods stop here, namely at obtaining binary change maps. Nevertheless, the proposed method generates representative high-level spatial-spectral feature maps in which different objects are more discriminative. This means that the feature difference map contains abundant information about different kinds of change, which makes unsupervised multiclass CD possible.
An intuitive idea is to directly perform a clustering algorithm on the feature difference map to obtain the multiclass change map. However, plenty of the information in the feature difference map is redundant, and the amount of information contained in different channels of the feature difference map also differs. Thus, performing clustering algorithms in the original difference space is often not effective for obtaining the right change class [lee2007, Saha2019] (this conclusion is also verified in Section V-D). Inspired by the change vector direction measure proposed in [kruse1993, Bovolo2012], a weighting feature direction (WFD) is developed that maps the high-level change information to a 1D variable

$WFD_j = \arccos\left(\frac{\sum_{l=1}^{L}\lambda_{l}d_j^{l}}{\sqrt{\sum_{l=1}^{L}\lambda_{l}^{2}}\sqrt{\sum_{l=1}^{L}(d_j^{l})^{2}}}\right)$   (28)

where $\lambda_{l}$ is the eigenvalue corresponding to the $l$th convolution kernel of the last KPCA convolutional layer in KPCAMNet. Since a larger eigenvalue means more information, in Eq. (28) the channel of the feature difference map corresponding to a larger eigenvalue plays a more important role in calculating $WFD_j$. Despite some loss of information, the different kinds of change information are effectively utilized and mapped into $WFD$. By performing a clustering algorithm on $WFD$, the multiclass CD results can be generated.
Eventually, as shown in Fig. 3, the high-level difference features are mapped into a 2D polar domain based on $FM$ and $WFD$. If only the binary change map is required, a threshold segmentation method is adopted to obtain the result according to $FM$. If the aim is a multiclass CD map, a threshold segmentation method is first implemented to distinguish changed pixels from unchanged ones; after that, a clustering algorithm is performed on $WFD$ to classify each changed pixel into a specific kind of change.
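The polar mapping and threshold step can be sketched as follows. Note that the WFD expression here is an eigenvalue-weighted angular measure reconstructed from the description above, so its exact form is an assumption, and the Otsu routine is a generic re-implementation rather than the paper's segmentation method:

```python
import numpy as np

def polar_change_representation(F1, F2, eigvals):
    # Map the feature difference D = F1 - F2 into polar coordinates:
    # FM (Eq. (27)) and an eigenvalue-weighted direction, so that
    # high-eigenvalue channels dominate the direction measure.
    d = F1 - F2
    fm = np.sqrt((d ** 2).sum(axis=-1))
    w = eigvals / np.linalg.norm(eigvals)
    cos = (d * w).sum(axis=-1) / np.maximum(np.linalg.norm(d, axis=-1), 1e-12)
    wfd = np.arccos(np.clip(cos, -1.0, 1.0))
    return fm, wfd

def binary_map_otsu(fm, nbins=256):
    # Generic Otsu thresholding on the feature magnitude: choose the
    # threshold maximizing between-class variance of the FM histogram.
    hist, edges = np.histogram(fm, bins=nbins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)                      # class-0 probability
    m = np.cumsum(p * centers)             # cumulative mean
    mt = m[-1]                             # global mean
    between = (mt * w0 - m) ** 2 / np.maximum(w0 * (1.0 - w0), 1e-12)
    return fm > centers[np.argmax(between)]
```

For multiclass CD, a clustering routine (e.g., k-means on the `wfd` values of the pixels flagged by `binary_map_otsu`) would then separate the kinds of change.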
Summarizing all of the aforementioned contents, the whole detailed process of KPCAMNet is elaborated in Algorithm 2.
IV Experiment of Binary Change Detection
IV-A Data Description
In the experiment of binary change detection, two VHR data sets are adopted to evaluate the proposed method. The first data set was acquired by GF-2 on April 4, 2016 and September 1, 2016, covering the city of Wuhan, China, and is denoted WH. The multitemporal VHR image pairs in this data set consist of 4 spectral bands (red, green, blue, and near-infrared) with 1000 × 1000 pixels and a spatial resolution of 4 m. The pseudo-color VHR images and the corresponding ground truth of changes and non-changes are shown in Fig. 4. In the ground truth, red indicates changed regions, green indicates unchanged regions, and the remaining pixels are undefined. In the WH data set, a small part of the buildings suffers from an "overexposed" problem, which breaks the linear relationship of the digital numbers of unchanged regions between the multitemporal images and cannot be eliminated by radiometric normalization. This "overexposed" problem therefore makes accurate CD more difficult.
The second VHR data set was gathered by QuickBird in 2002 and 2005 over Wuhan University and is denoted QU. The multitemporal VHR image pairs in the QU data set consist of 4 spectral bands with 358 × 280 pixels and a spatial resolution of 2.4 m. The pseudo-color VHR images and the ground truth are shown in Fig. 5. In the ground truth, red indicates changed regions, green indicates unchanged regions, and the remaining pixels are undefined.
IV-B Experiment Settings
Several hyperparameters influence the performance of the proposed KPCAMNet model, including the choice of kernel function, the number of training patches $M$, the number $L$ and size $s$ of the KPCA convolution kernels, and the network depth $n$. For simplicity, the hyperparameters of the different KPCA convolutional layers use the same values. The choices of these parameters in the experiment are listed in Table I, and the specific influence of these parameters at different values is discussed in Section IV-D.
To validate the performance of the proposed KPCAMNet, it is compared with the most widely used CD methods, including CVA [Sharma2007], MAD [Nielsen1997], IRMAD [Nielsen2007], USFA [Wu2014], ISFA [Wu2014], PCA-Kmeans [Celik2009], OBCD [Desclee2006], SVM, DSCN [Zhan2017], RNNCD [Lyu2016], and SAE [Bengio2007]. MAD is a CD method based on canonical correlation analysis (CCA), first proposed in [Nielsen1997]. Based on the SFA theory, USFA computes a projection matrix to suppress pseudo-change pixels and highlight changed pixels. IRMAD and ISFA are iterative versions of MAD and USFA. DSCN is a deep siamese convolutional network that performs well on aerial images. RNNCD is an RNN-based CD network proposed in [Lyu2016], which has shown good performance in CD. SAE is an unsupervised deep learning method constructed by stacking several autoencoders; by layer-wise greedy training, SAE can extract the essential features of the input data. Among these methods, SVM, DSCN, and RNNCD are supervised methods. For the sake of fairness, a common practice is to select the annotated training samples with an unsupervised automatic pre-detection method [Gao2016, Li2019a, Du2019a]. Therefore, though these three methods are trained in a supervised learning fashion, the whole process is totally unsupervised without any prior knowledge. In OBCD, the mean and standard deviation of each band are utilized to generate objects. The RBF is chosen as the kernel function of the SVM, and the optimal hyperparameters are chosen by grid search. The hyperparameters of the remaining unsupervised methods use the optimal values recommended in their original references. K-means, OTSU, and FCM are adopted as threshold segmentation methods, and the best results are chosen as the final results.
The binary CD problem can be regarded as a binary classification problem of change versus non-change. In the binary change map, white pixels denote changed areas and black pixels denote unchanged areas. To comprehensively evaluate the performance of the proposed model and the comparison methods, the following evaluation criteria are utilized in the accuracy assessment.

False Positive (FP). FP is the number of unchanged pixels which are falsely detected as changed ones.

False Negative (FN). FN is the number of pixels which are detected as unchanged but are actually changed.

Overall Error (OE). OE is the total number of pixels that are misclassified. OE can be calculated by OE=FP+FN.

Overall Accuracy (OA). OA is the number of pixels that are detected correctly, divided by the total number of pixels: OA = 1 − OE / N, where N is the total number of pixels.

Kappa Coefficient (KP). KP is a similarity measure between the final CD result and the ground truth. In binary CD, KP is defined as KP = (OA − PE) / (1 − PE), where PE = ((TP + FP)(TP + FN) + (TN + FN)(FP + TN)) / N².
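The criteria above can be computed directly from the confusion counts; a minimal sketch:

```python
import numpy as np

def cd_metrics(pred, gt):
    # pred, gt: boolean change maps (True = changed pixel).
    tp = int(np.sum(pred & gt))
    tn = int(np.sum(~pred & ~gt))
    fp = int(np.sum(pred & ~gt))            # false positives
    fn = int(np.sum(~pred & gt))            # false negatives
    n = pred.size
    oe = fp + fn                            # overall error
    oa = 1.0 - oe / n                       # overall accuracy
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (fp + tn)) / n ** 2
    kp = (oa - pe) / (1.0 - pe)             # kappa coefficient
    return {"FP": fp, "FN": fn, "OE": oe, "OA": oa, "KP": kp}
```

In practice, undefined pixels in the ground truth would be masked out before the counts are accumulated.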
IV-C Experimental Results and Analysis
We first consider the experimental results on the WH data set. The binary change maps obtained by the proposed KPCAMNet and all comparison methods are illustrated in Fig. 6. As shown in Fig. 6(a) and Fig. 6(c), the performance of MAD and USFA is unsatisfactory: considerable salt-and-pepper noise exists in their results. Employing iterative reweighting, the results acquired by IRMAD and ISFA are visually cleaner. However, plenty of changed regions are missed. Besides, some unchanged buildings are misclassified as changed, which means IRMAD and ISFA are confused by the "overexposed" problem. Compared with IRMAD and ISFA, the binary change map obtained by CVA is better. Nonetheless, since the pixels are processed in isolation and only spectral information is considered, some small objects, such as roads, are not classified correctly, and many building margins are falsely detected as changed. Adopting objects instead of pixels, OBCD takes the object as the basic unit of CD; therefore, in Fig. 6(f), there is almost no salt-and-pepper noise in the result of OBCD. But because only low-level features are utilized and the method is limited by the performance of the segmentation algorithm, many building changes are not detected. By transforming spatial blocks, PCA-Kmeans utilizes spatial context information to detect changes; thus, the majority of building changes are detected and the result is not affected by the "overexposed" problem. Nevertheless, some building margins are also falsely detected.
Data set  Kernel  Training patches  Kernels  Kernel size  Depth

WH  RBF  200  8  5  3
QU  RBF  200  8  11  3
Limited by the relatively weak fitting ability of the SVM model, in Fig. 6(h), a small part of the changed regions is broken inside and some groundworks before building-over are not detected. Utilizing a deep siamese convolution structure to extract spatial-spectral features and adopting a threshold segmentation algorithm to get the results, DSCN correctly classifies building margins in its CD results. However, many changes in built-up regions are not detected. Adopting the RNN architecture to learn the universal rules of change, RNNCD captures the change information between multitemporal VHR images, and the acquired binary change map is more accurate than the ones acquired by SVM and DSCN. But RNNCD cannot extract spatial features efficiently, which results in the misdetection of building margins and internal fragmentation of changed regions. Learning the essential features from image patches with several AEs, SAE generates a good result, in which almost all changed pixels are classified correctly. The downside is that the fixed receptive field and fully-connected structure of SAE damage the feature extraction procedure to some degree. Hence, in Fig. 6(k), there exists some noise and a part of the unchanged pixels is falsely detected as changed ones. Besides, the results of SAE are affected by the "overexposed" problem.
Method  FP  FN  OE  OA  KC 

MAD  79298  2066  78364  0.8446  0.2662 
IRMAD  31154  8596  39750  0.9212  0.3289 
USFA  86977  2271  89248  0.8230  0.2335 
ISFA  33155  7194  40349  0.9200  0.3530 
CVA  3432  8654  12086  0.9760  0.6409 
OBCD  617  10201  10818  0.9785  0.6350 
PCAKmeans  10435  5401  15836  0.9686  0.6325 
SVM  3556  8791  12347  0.9755  0.6330 
DSCN  2457  11046  13503  0.9732  0.5564 
RNNCD  5426  7359  12785  0.9746  0.6515 
SAE  14869  1599  16468  0.9673  0.6750 
KPCAMNet  2525  6677  9202  0.9819  0.7316 
Utilizing a deep siamese KPCA convolutional architecture to extract representative high-level spatial-spectral features and adopting the feature magnitude to merge the change information in these features, the proposed KPCAMNet obtains the best binary change map, with less noise and more complete changed regions. What's more, by using spatial context information efficiently, KPCAMNet is immune to the "overexposed" problem.
To better evaluate these methods, the quantitative results are listed in Table II. Similar to the conclusion derived from Fig. 6(a) and Fig. 6(c), MAD and USFA have low KCs of 0.2662 and 0.2335, and their iterative versions are slightly better, with a smaller OE and a higher KC. In fact, SFA and MAD are based on the central limit theorem, but the Gaussianity of VHR images is not obvious, so it is difficult for these two methods to perform well on VHR images. Because objects become the basic units in OBCD, its FP is the best: only 617 unchanged pixels are misclassified as changed ones. Trained on the samples selected by unsupervised automatic predetection, the accuracies of the three supervised methods are relatively good. Owing to its good capacity for feature extraction, SAE achieves the best FN. Finally, the proposed method outperforms the other compared methods with the best OE, OA, and KC.
Method  FP  FN  OE  OA  KC 

MAD  3146  3611  6757  0.8089  0.5640 
IRMAD  3100  3520  6620  0.8128  0.5733 
USFA  3393  3453  6846  0.8064  0.5621 
ISFA  2021  4744  6765  0.8087  0.5409 
CVA  2595  4552  7147  0.7979  0.5233 
OBCD  709  7009  7718  0.7817  0.4293 
PCAKmeans  1371  6003  7374  0.7914  0.4766 
SVM  2042  3932  5974  0.8310  0.6022 
DSCN  1361  6363  7727  0.7815  0.4468 
RNNCD  2929  4344  7273  0.7943  0.5208 
SAE  3670  3733  7403  0.7906  0.5265 
KPCAMNet  1673  1795  3468  0.9019  0.7778 
The CD results obtained by the proposed method and the comparison methods on the QU data set are displayed in Fig. 7. In the result acquired by MAD, part of the changed areas is falsely detected, and there exists plenty of salt-and-pepper noise. Induced by the complexity of VHR images, the problems in the binary change map of MAD are not solved by the iterative reweighting operation, so they also exist in the result of IRMAD. In Fig. 7(c), the CD result of SFA is similar to that of MAD. Slightly different from MAD, after iterative reweighting, ISFA acquires a result with less noise. In the binary change maps of CVA and PCAKmeans (see Fig. 7(e) and Fig. 7(g)), more changed pixels are misclassified. Besides, some unchanged buildings are falsely detected as changed ones. Similar to what we observed on the WH data set, there is almost no noise in the result generated by OBCD. However, lots of obvious changed regions are not recognized, such as the changes of the central playground.
Compared with all the methods mentioned before, SVM achieves a better binary change map, even though a few pixels are misclassified. The result obtained by DSCN is similar to that obtained by PCAKmeans, yet more changed regions are falsely detected as unchanged ones. In Fig. 7(j), the main changed regions are correctly classified by RNNCD. But some details in changed regions are missed, and plenty of unchanged buildings and roads are falsely detected. In the result of SAE, a majority of changed pixels is correctly detected. Nonetheless, some unchanged land covers, such as vegetation, are incorrectly classified as change. Finally, it is undeniable that the proposed KPCAMNet generates the best qualitative result, with less noise and more complete changed regions.
Table III reports the values of the evaluation criteria on the QU data set. Compared with CVA and PCAKmeans, MAD and SFA perform better on the QU data set. By adopting objects as the basic units of CD to utilize spatial context information, OBCD again achieves the best FP. But because only low-level features are used, numerous changed pixels are incorrectly detected as unchanged ones, which makes the OE, OA, and KP of OBCD unsatisfactory. Finally, it can be observed that KPCAMNet outperforms the other compared methods with an FN of 1795, OE of 3468, OA of 0.9019, and KC of 0.7778, which proves the effectiveness of KPCAMNet for binary CD in VHR images once again.
Furthermore, it is worth noting that, on the WH data set, trained on the samples selected by the unsupervised predetection method, RNNCD achieves a relatively good result. Nonetheless, on the QU data set, the performance of RNNCD is even worse than MAD and SFA. This is because the automatic selection of training samples is unsupervised, so some false-alarm samples may be included in the training phase. Therefore, if the automatic predetection cannot generate relatively clean and representative training samples on a data set, it is difficult for a predetection-based deep learning method to perform well on that data set. Besides, the accuracy of DSCN on both data sets is not high, which implies that this predetection-based approach does not apply to all deep learning models. In contrast, the training phase of KPCAMNet does not require annotated data, and the performance of KPCAMNet is better and more stable.
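The predetection strategy discussed above can be sketched as follows. This is only an illustrative scheme, not the exact procedure used by RNNCD or DSCN: pixels whose unsupervised change magnitude falls in the extreme low or high percentiles (the percentile values here are assumptions) are taken as confident unchanged or changed pseudo-labels, and the ambiguous middle is left unlabeled.

```python
import numpy as np

def predetect_pseudo_labels(magnitude, low_pct=5.0, high_pct=95.0):
    """Illustrative predetection: treat the lowest-magnitude pixels as
    confident unchanged (0), the highest as confident changed (1), and
    leave everything in between unlabeled (-1)."""
    lo = np.percentile(magnitude, low_pct)
    hi = np.percentile(magnitude, high_pct)
    labels = np.full(magnitude.shape, -1, dtype=int)
    labels[magnitude <= lo] = 0   # confident unchanged samples
    labels[magnitude >= hi] = 1   # confident changed samples
    return labels

rng = np.random.default_rng(0)
labels = predetect_pseudo_labels(rng.random((100, 100)))
```

If the magnitude map itself is noisy, these pseudo-labels inherit its false alarms, which matches the failure mode observed for RNNCD on the QU data set.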
IV-D Discussion
On both data sets, KPCAMNet achieves good performance and obtains accurate binary change maps. Several parameters of KPCAMNet play an important role, including the choice of kernel function, the number of training patches, the number and size of the KPCA convolution kernels, and the network depth. For simplicity, we employ the same parameters for every layer to reduce the vast number of possible choices.
The network depth and the convolution kernel size are first discussed. As shown in Fig. 8, a deeper network shows better performance on both data sets. But considering that KPCAMNet is trained in a layer-wise manner, an overly deep architecture (deeper than 4) would increase the computational burden without improving accuracy much. Therefore, a network depth of 2 to 4 is a good choice.
For the kernel size w, a larger w increases the receptive field of the network and lets the network utilize more spatial context information. However, if w is too large, some unrelated information is considered by the network, which would damage CD performance. Therefore, on the WH data set with a spatial resolution of 4 m, the KP increases greatly as w increases from 3 to 5. But when w is larger than 5, the accuracy decreases gradually as w gets larger. For the QU data set, where the spatial resolution of the images is 2.4 m, the performance of the network improves as w gets larger. However, when w is greater than 9, the performance improvement is no longer obvious. Specifically, depending on the spatial resolution of different VHR images, 20 m to 30 m is a suitable range for the real (on-ground) size of the KPCA convolution kernel.
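The 20 m to 30 m guideline follows from the on-ground footprint of a single w × w kernel, i.e. w times the spatial resolution. A quick check against the settings used here (the helper function is our own illustration):

```python
def real_kernel_size(w, resolution_m):
    """On-ground footprint (meters) of a w x w convolution kernel
    applied to imagery with the given spatial resolution."""
    return w * resolution_m

# WH data set: 4 m resolution, w = 5        -> 20 m
# QU data set: 2.4 m resolution, w = 9..11  -> 21.6 m .. 26.4 m
sizes = [real_kernel_size(5, 4.0),
         real_kernel_size(9, 2.4),
         real_kernel_size(11, 2.4)]
```

All three settings fall inside the suggested 20 m to 30 m range, which is consistent with the accuracy curves in Fig. 8.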
The number of training samples affects the feature extraction ability of the KPCA convolution: the more training samples there are, the more representative the spatial-spectral features extracted by the KPCA convolution. However, if the number of training samples is too large, the constructed kernel matrix becomes too large, which increases the computational cost and can even make the KPCA problem difficult to solve. As shown in Fig. 9, setting the number of training patches to 200 is a good choice.
For the KPCA convolution, the selection of the kernel function is significant, because different kernel functions have diverse mapping abilities. In Fig. 10, the performance of KPCAMNet with five commonly used kernel functions is compared on both data sets. RBF is the best choice for the KPCA convolution, making the network extract more representative nonlinear features, and the sigmoid kernel also shows good performance. Compared with linear PCA convolution, selecting RBF or sigmoid as the kernel function significantly improves the CD performance, which proves the necessity of generalizing linear PCA convolution to KPCA convolution.
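As a concrete sketch of how one RBF-kernel KPCA "convolution" layer could be realized, the snippet below uses scikit-learn's KernelPCA as a stand-in: KPCA is fitted on randomly sampled training patches, and the layer output is obtained by projecting every sliding-window patch onto the leading components. The function name, sampling scheme, and valid padding are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def kpca_conv_layer(image, n_train=200, w=5, k=8, seed=0):
    """Sketch of one KPCA 'convolution' layer: fit RBF-kernel PCA on
    n_train random w x w patches, then project every sliding patch
    onto the k leading components, yielding a k-channel feature map."""
    h, wid, c = image.shape
    rng = np.random.default_rng(seed)

    # sample training patches and flatten them to vectors
    ys = rng.integers(0, h - w + 1, n_train)
    xs = rng.integers(0, wid - w + 1, n_train)
    train = np.stack([image[y:y + w, x:x + w].ravel()
                      for y, x in zip(ys, xs)])

    kpca = KernelPCA(n_components=k, kernel="rbf").fit(train)

    # "convolve": project every sliding-window patch (valid padding)
    out_h, out_w = h - w + 1, wid - w + 1
    patches = np.stack([image[y:y + w, x:x + w].ravel()
                        for y in range(out_h) for x in range(out_w)])
    feats = kpca.transform(patches)            # (out_h * out_w, k)
    return feats.reshape(out_h, out_w, k)

fmap = kpca_conv_layer(np.random.default_rng(1).random((20, 20, 3)))
```

Stacking such layers in a weight-shared siamese pair, as the paper describes, would reuse the same fitted KPCA model for both temporal images.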
In addition, the number of KPCA convolution kernels is also related to the CD performance: more convolution kernels can extract more features from the multitemporal images for CD. As a trade-off between accuracy and computational expense, we empirically set the number of kernels to 8.
V Experiment of Multiclass Change Detection
V-A Data Description
Not only can KPCAMNet be used for binary CD, it can also detect multiclass changes between multitemporal VHR images. In the multiclass CD experiment, the HY data set is adopted to evaluate the proposed method. The HY data set was acquired by the GF-2 sensor over the city of Hanyang, China, in April 2016 and September 2016. The two images both consist of 1000 × 1000 pixels with a spatial resolution of 4 m. Owing to the rapid development of Hanyang, the study area shows obvious land-cover changes, and the changes between the two images mainly involve city change (buildings or roads to other artificial facilities), water change (water regions to bare soils or vegetation), and soil change (soils to artificial facilities or water regions). The pseudo-color VHR images and the corresponding ground truth of the three multiclass change types and non-changes are shown in Fig. 11. In the ground truth, red indicates city change, blue indicates water change, yellow indicates soil change, green indicates unchanged area, and the remaining pixels are undefined. What's more, the HY data set also suffers from the "overexposed" problem.
V-B Experiment Setting
According to the conclusions derived from Section IV-D, in the multiclass CD experiment, RBF is chosen as the kernel function, the number of training patches is set to 200, the size of the KPCA convolution kernel is set to 5, the number of kernels is set to 8, and two weight-shared KPCA convolutional layers are stacked to extract spatial-spectral features.
For the purpose of demonstrating the superiority of KPCAMNet in unsupervised multiclass CD, seven representative methods are adopted for comparison: PCC, 2DCVA [Sharma2007], VA [Bovolo2012], PCA, IRMAD [Nielsen1997], SAE [Bengio2007], and RandNet. Specifically, PCC is a widely used framework for multiclass CD; in the experiment, the unsupervised clustering algorithm ISODATA is utilized to obtain the multitemporal classification maps. In 2DCVA, two bands are selected to calculate the magnitudes and directions of spectral change vectors so as to detect diverse kinds of change; the optimal band combination can be selected by trial-and-test or some prior information. VA compresses all the bands into the magnitudes and directions of spectral change vectors and performs threshold segmentation and clustering to get the multiclass CD result. In the PCA-based approach, the magnitudes and directions of the first two principal components are computed to generate the multiclass change map. IRMAD is similar to 2DCVA and PCA, except that the last two MAD variates are used to calculate the magnitude and direction, since they contain the most change information. In SAE, spatial-spectral features are first extracted from the VHR images, and then the way proposed in VA is adopted to utilize these features for CD. To evaluate the effectiveness of the proposed model, a natural idea is to replace the KPCA convolution kernels in KPCAMNet with random convolution kernels; thus a variant of KPCAMNet, RandNet, is introduced, which shares the structure of KPCAMNet but uses random filters following the standard Gaussian distribution. What's more, to prove the necessity of generalizing linear PCA convolution to KPCA convolution, a variant of the proposed method entitled LinearPCAMNet is developed by replacing the KPCA convolution layers with linear PCA convolution layers.
For the accuracy assessment, we first measure the per-class accuracy, and then OA and KP are used to measure the overall performance.
V-C Experimental Results and Analysis
Fig. 12 shows the multiclass CD results acquired by the proposed method and the comparison methods on the HY data set. The change map generated by PCC is visually poor, because the clustering techniques are powerless to classify the land cover accurately, and the accumulated false detections result in a rough change map. In the result of 2DCVA, the major changed and unchanged regions are correctly detected. However, because change information from only two bands is used, water changes are misclassified as soil changes and some soil changes are not detected. Through the magnitude and direction of the spectral change vector, change information from all bands is compressed into a polar representation. Therefore, the multiclass change map obtained by VA is better than that obtained by 2DCVA. But because some information is lost in the compression procedure, part of the soil changes is falsely detected as water changes. Since only spectral information is utilized by VA, all kinds of changed regions suffer internal fragmentation and unchanged regions exhibit salt-and-pepper noise. Besides, caused by the "overexposed" problem, a few unchanged regions are incorrectly classified as soil changes and city changes. For PCA, because the main change information is concentrated in the first two principal components, its CD result is similar to that of VA. In Fig. 12(e), the main soil changes and water changes are correctly classified by IRMAD, but most of the city changes are not detected. What's more, IRMAD is strongly puzzled by the "overexposed" problem.
Compared with VA, SAE utilizes spatial context information to detect multiclass changes; thus, most of the unchanged regions are classified correctly and there is only a little internal fragmentation and salt-and-pepper noise in the result. Nonetheless, the learning process of SAE only ensures that the essential features of the VHR images are extracted; it cannot guarantee that the extracted features are sufficiently discriminative for distinguishing different kinds of changes. Therefore, some soil changes and plenty of city changes are not detected by SAE. As shown in Fig. 12(g), though sharing the same architecture with KPCAMNet, RandNet does not perform well on the HY data set. Part of the unchanged regions is misclassified as change and the three kinds of changes are mixed together, which indicates that random convolution kernels cannot extract valid features for multiclass CD.
The multiclass change map detected by LinearPCAMNet is good. However, the features extracted by linear PCA convolution are not representative enough, which results in misclassification between soil changes and water changes. By contrast, in the result of KPCAMNet, most of the changed and unchanged regions are detected accurately and the three kinds of changes are efficiently distinguished from each other, which implies that the KPCA convolutions do extract representative spatial-spectral features from VHR images and that the different kinds of changes are efficiently discriminable in the new feature space represented by FM and WFD. The visual comparisons between the multiclass change maps fully demonstrate the effectiveness of the proposed KPCAMNet.
Table IV lists the quantitative statistics of the change maps generated by the different methods, and the confusion matrices of the CD results are shown in Fig. 14. Except for the water-change accuracy, each metric of PCC is low. The accuracy achieved by IRMAD and RandNet is also not good; the KPs of these two methods are 0.6360 and 0.6062, respectively. 2DCVA achieves a relatively high city-change accuracy, but a major part of the water changes is classified incorrectly, which leads to a low water-change accuracy of 0.2551. By utilizing the change information in every band, VA performs better than 2DCVA, with a higher KP of 0.7634. SAE achieves high accuracy, with an OA of 0.9576 and a KP of 0.7712, but its city-change accuracy is only 0.4797. Compared with these methods, KPCAMNet exhibits prominent superiority in all evaluation criteria, which demonstrates that the proposed method is applicable to unsupervised multiclass CD in multitemporal VHR images. What's more, though LinearPCAMNet shows good performance in multiclass CD with a KP of 0.7745, it still cannot compete with KPCAMNet, especially in the accuracy of each change class, which proves that the KPCA convolution has a more powerful feature extraction ability than the linear PCA convolution.
Method  Soil  Water  City  Non-change  OA  KC

PCC  0.3530  0.9052  0.5043  0.6633  0.6375  0.1313 
2DCVA  0.7138  0.2551  0.6330  0.9796  0.9428  0.7089 
VA  0.7546  0.8784  0.6027  0.9841  0.9538  0.7634 
PCA  0.7320  0.8926  0.5990  0.9848  0.9528  0.7546 
IRMAD  0.7200  0.8389  0.2230  0.9744  0.9294  0.6360 
SAE  0.7646  0.7443  0.4797  0.9936  0.9576  0.7712 
RandNet  0.6556  0.7825  0.3834  0.9544  0.9120  0.6062 
LinearPCAMNet  0.6890  0.7393  0.5689  0.9975  0.9586  0.7745 
KPCAMNet  0.8008  0.8862  0.6438  0.9952  0.9687  0.8334 
V-D Discussion
The 2D polar representations of the change information extracted by VA, SAE, RandNet, and KPCAMNet are shown in Fig. 13, and the confusion matrices of the three comparison methods and KPCAMNet are illustrated in Fig. 14. As shown in Fig. 13(a), in the polar domain of VA, the three kinds of changes are separable from each other, but because only original spectral information is used, some unchanged pixels are mixed with pixels belonging to city changes and soil changes. SAE utilizes spatial context information to obtain spatial-spectral features; compared with VA, the unchanged pixels are detected well, as shown in Fig. 14(b). But the city changes are not highlighted in the magnitude, which leads more than half of the city-change pixels to be misclassified as unchanged. As for RandNet, the unchanged and changed pixels are separable to a certain degree. However, since the random convolution kernels cannot extract discriminative features from VHR images, the three kinds of changes are mixed together and it is difficult to separate them from each other. Finally, in the 2D polar domain of KPCAMNet, it can be observed that the unchanged pixels are suppressed, the changed pixels are highlighted, and the three kinds of changes are compact and discriminable from each other in the WFD, which makes it easier to identify the changes and distinguish the different change types. Compared with the three methods, each kind of change and non-change is detected better by KPCAMNet according to Fig. 14, which demonstrates that the spatial-spectral features extracted by KPCAMNet are discriminative and more suitable for unsupervised multiclass CD.
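Based on the description above, the 2D polar mapping can be sketched as follows: FM is the per-pixel magnitude of the feature difference map, and WFD is an eigenvalue-weighted direction. The angle definition and weighting below are assumptions patterned on the CVA-style direction measure, not the authors' exact formula.

```python
import numpy as np

def polar_mapping(diff, eigvals):
    """Map an (H, W, C) feature difference map into a 2D polar domain.
    FM  = per-pixel L2 magnitude of the difference vector.
    WFD = assumed eigenvalue-weighted direction: the angle between the
    absolute difference vector and the channel-weight vector."""
    fm = np.linalg.norm(diff, axis=-1)             # feature magnitude
    wgt = eigvals / eigvals.sum()                  # channel weights
    num = (np.abs(diff) * wgt).sum(axis=-1)
    den = fm * np.linalg.norm(wgt) + 1e-12         # avoid divide-by-zero
    wfd = np.arccos(np.clip(num / den, -1.0, 1.0)) # angle in [0, pi/2]
    return fm, wfd

fm, wfd = polar_mapping(np.random.default_rng(2).standard_normal((4, 4, 3)),
                        np.array([3.0, 2.0, 1.0]))
```

Thresholding FM then separates change from non-change, and clustering WFD over the changed pixels yields the multiclass map, matching the pipeline the paper describes.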
To verify the superiority of the WFD in multiclass CD, we compare it with four other representative methods. The first method, denoted as M1, performs clustering algorithms on the original feature difference map. The second method (M2) also performs clustering on the feature difference map, but with the channels weighted by the eigenvalues. The third way, denoted as M3, is similar to 2DCVA: the two channels with the largest eigenvalues are selected from the feature difference map and adopted to calculate the direction. The last method, entitled M4, adopts the direction measurement of VA, which is similar to the proposed WFD except that the channels of the feature difference map are not weighted by the eigenvalues. The performance of these four methods and the proposed WFD is shown in Table V.
Method  OA  KC

M1  0.8136  0.4015 
M2  0.9052  0.5867 
M3  0.9630  0.8028 
M4  0.9447  0.7103 
Proposed  0.9687  0.8334 
As mentioned in Section III-B, performing clustering algorithms in the original difference space is not effective; the KP of M1 is only 0.4015. After weighting by the eigenvalues, the KP of M2 increases by 46%. However, the performance of M2 is still unsatisfactory. By selecting the two channels containing the most change information to calculate the feature direction, M3 performs relatively well, with an OA of 0.9630 and a KP of 0.8028. Though using all the channels, M4 cannot compete with M3. This is because the different channels of the feature difference map contain different amounts of change information, but M4 treats these channels equally. By contrast, in the proposed WFD, the change information in each channel is considered, and the channels corresponding to more change information play a more important role. Therefore, the proposed mapping achieves the best performance, which proves the superiority of the WFD in multiclass CD.
VI Conclusion
In this paper, a novel convolution operation based on kernel PCA, called KPCA convolution, is proposed for CD in multitemporal VHR images. Compared with the conventional convolutional layer in a CNN, the KPCA convolution can also extract nonlinear spatial-spectral features from VHR images, but the training process of the KPCA convolutional layer follows an unsupervised learning fashion and does not require any annotated data. Based on the proposed KPCA convolution, a powerful and general network called KPCAMNet is designed for unsupervised binary and multiclass CD in multitemporal VHR images. In KPCAMNet, a deep siamese network architecture consisting of several weight-shared KPCA convolutional layers extracts representative features from the multitemporal VHR images and generates two high-level spatial-spectral feature maps. To train the network, an unsupervised layer-wise training scheme is developed. Through a pixel-wise subtraction operation, the feature difference map is generated. For the purpose of efficiently utilizing the change information contained in the feature difference map, the FM and WFD are proposed: the FM carries information about whether a change exists, and the WFD indicates what kind of change it is. The feature difference map is then mapped into a 2D polar domain based on the FM and WFD. For binary CD, a threshold segmentation method is applied to the FM to obtain the result. For multiclass CD, a threshold segmentation method is first utilized to separate change from non-change, and then a clustering algorithm is implemented on the WFD to get the multiclass change map.
In the binary CD experiments on the WH and QU data sets, the qualitative and quantitative results demonstrate that the proposed architecture outperforms conventional CD models as well as deep learning-based methods. The experimental results of multiclass CD on the challenging HY data set confirm that KPCAMNet achieves better performance than several unsupervised multiclass CD methods. In addition, the proposed WFD utilizes the multiclass change information in the feature difference map more efficiently than other commonly used ways. Both sets of experimental results reveal the effectiveness of the proposed KPCA convolution and the good versatility of KPCAMNet in binary and multiclass CD.
Our future work includes but is not limited to applying KPCA convolution and KPCAMNet for heterogeneous image CD and hyperspectral image classification.