Learning Spectral-Spatial-Temporal Features via a Recurrent Convolutional Neural Network for Change Detection in Multispectral Imagery
Abstract
Change detection is one of the central problems in earth observation and has been extensively investigated over recent decades. In this paper, we propose a novel recurrent convolutional neural network (ReCNN) architecture, which is trained to learn a joint spectral-spatial-temporal feature representation in a unified framework for change detection in multispectral images. To this end, we bring together a convolutional neural network (CNN) and a recurrent neural network (RNN) into one end-to-end network. The former is able to generate rich spectral-spatial feature representations, while the latter effectively analyzes temporal dependency in bitemporal images. In comparison with previous approaches to change detection, the proposed network architecture possesses three distinctive properties: 1) it is end-to-end trainable, in contrast to most existing methods whose components are separately trained or computed; 2) it naturally harnesses spatial information that has been proven to be beneficial to the change detection task; 3) it is capable of adaptively learning the temporal dependency between multitemporal images, unlike most algorithms that use fairly simple operations like image differencing or stacking. As far as we know, this is the first time that a recurrent convolutional network architecture has been proposed for multitemporal remote sensing image analysis. The proposed network is validated on real multispectral data sets. Both visual and quantitative analyses of experimental results demonstrate the competitive performance of the proposed model.
Change detection, multitemporal image analysis, recurrent convolutional neural network (ReCNN), long short-term memory (LSTM).
1 Introduction
With the development of remote sensing technology, massive amounts of remotely sensed data are produced every day from a large number of spaceborne and airborne sensors; e.g., the Landsat 8 satellite is capable of imaging the entire Earth every 16 days in an eight-day offset from Landsat 7, and every 10 days the Sentinel-2 mission can provide a global coverage of the Earth’s land surface. For the Sentinel-2 mission alone, to date about 3.4 petabytes of data have been acquired. Triggered by these exciting existing and future observation capabilities, methodological research on multitemporal data analysis is of great importance [[1], [2]]. Change detection is crucial in the field of multitemporal image analysis, as it is able to identify land use or land cover differences in the same geographical area across a period of time and can be used in a large number of applications, such as urban expansion, disaster assessment, resource management, and monitoring dynamics of land use [[3], [4], [5]].
In the literature, many methods have been proposed to better identify land cover changes [[1]]. Among them, a widely used model is based on image algebra approaches. A classic one is change vector analysis (CVA), proposed by Malila in 1980 [[6]]. CVA is designed to analyze possible multiple changes in pairs of multispectral pixels of bitemporal images. Bovolo and Bruzzone [[7]] propose a formal definition and a theoretical study of CVA in the polar domain. Later, some extensions of the CVA model have been proposed, e.g., compressed CVA (C²VA) [[8]]. CVA is used together with unsupervised threshold selection techniques based on different possible models of the data distribution. For example, the Rayleigh-Rice mixture density model [[9]] has been recently used in the framework of the Expectation-Maximization (EM) algorithm.
In addition, some image transformation-based models have been proposed in change detection to improve detection performance. These approaches mainly aim at learning a new, transformed feature representation from the original spectral domain, in order to suppress unchanged regions and highlight the presence of changes in the new feature space. For example, principal component analysis (PCA), Gram-Schmidt transformation, multivariate alteration detection (MAD), slow feature analysis (SFA), sparse learning, and deep belief network (DBN) use transformation algorithms in change detection methods [[10], [11], [12], [13], [14], [15]]. PCA is one of the best known subspace learning algorithms and can be used on both difference images and stacked images [[10], [16]]. The goal of Gram-Schmidt transformation is to reduce data correlation. MAD attempts to maximize the variance of independently transformed variables [[12]] and is invariant to linear scaling of the input data. SFA [[13]] is able to extract the most temporally invariant component from multitemporal images to transform data into a new feature space and, in this space, differences in unchanged pixels are suppressed so that changed regions can be better separated. In [[14]], the authors apply sparse learning on stacked multitemporal images and expect that resulting sparse solutions do not vary greatly between the multitemporal data. In [[15]], the authors learn feature representations of two images with DBNs. Feature vectors issued from the two image acquisitions are stacked and used to learn a representation, where changes stand out more clearly. Using such a feature representation, changes are more easily detected by image differencing.
Another important branch of change detection methods is based on classification approaches. For example, Bruzzone and Serpico [[17]] propose a supervised nonparametric model, based on the compound classification rule for minimum error, to detect land cover transitions between two remote sensing images acquired at different times. The main idea of this approach is to consider the temporal correlation between images in the classification without requiring complex training data. In [[18]], the authors use the Bayes rule for minimum error in the compound classification framework for detecting land cover transitions between pairs of multisource images gathered at two different dates. In [[19]], the authors propose a multiclassifier architecture, which is composed of an ensemble of partially unsupervised classifiers, to detect changes or update land cover maps. Later, Bruzzone et al. [[20]] develop an effective system that employs an ensemble of nonparametric multitemporal classifiers to address the problem of detecting land cover transitions in multitemporal images. All these techniques consider different trade-offs between modeling the temporal correlation in the training of the system and requiring complex training data.
One crucial issue in change detection is modeling the temporal correlation between bitemporal images. Various atmospheric scattering conditions, complicated light scattering mechanisms, and intra-class variability lead to change detection being inherently nonlinear. Thus sophisticated, task-driven, learning-based methods are desirable.
Deep neural networks have recently been shown to be very successful on a variety of computer vision and remote sensing tasks [[21], [22]]. They can also provide the opportunity for change detection, where one would like to extract joint spectral-temporal features from a bitemporal image sequence in an end-to-end manner. In this respect, as an important branch of the deep learning family, a recurrent neural network (RNN) is a natural candidate to tackle the temporal connection between multitemporal sequence data in change detection tasks. Recently, Lyu et al. [[23]] make use of an end-to-end RNN to solve the multi/hyperspectral image change detection task, since RNN is well known to be good at processing sequential data. In their framework, a long short-term memory (LSTM)-based RNN is employed to learn a joint spectral-temporal feature representation from a bitemporal image sequence. In addition, the authors also show the versatility of their network by applying it to detect multiclass changes and pointing out a good transferability for change detection in an “unseen” scene without fine-tuning. The authors of [[24]] follow a similar idea, where an RNN based on LSTM units is used to extract dynamic spectral-temporal features but, in contrast to the change detection scenario, their goal is to address land cover classification of multitemporal image sequences.
In this paper, we would like to learn joint spectral-spatial-temporal features using an end-to-end network for change detection, which we name recurrent convolutional neural network (ReCNN), since it combines a convolutional neural network (CNN) and an RNN. Although both CNN [[25], [26], [27], [28], [29], [30], [31], [32], [33]] and RNN [[23], [34], [24]] are well-established techniques for remote sensing applications, to the best of our knowledge, we are the first to combine them for multitemporal data analysis in the remote sensing community. Note that integrating CNN and RNN in an end-to-end manner has also been explored in hyperspectral image classification [[35]], where the network is only used for extracting spectral information to build a spectral classifier for the classification purpose. In our work, the CNN part transforms the input, a pair of 3D multispectral patches, to an abstract spectral-spatial feature representation, whereas the RNN part is not only employed for modeling temporal dependency, but is also used for predicting the final label (i.e., changed, unchanged, or change type). In other words, the features from the proposed ReCNN encapsulate information related to spectral, spatial, and temporal components in bitemporal images, making them useful for a holistic change detection task. For multitemporal image analysis, the proposed ReCNN contributes to the literature in three major aspects:

It is able to extract a spectral-spatial-temporal feature representation of multitemporal data through learning with a structured deep architecture.

It has the same property as the 2D CNN used for multi/hyperspectral data classification, learning informative spectral-spatial feature representations directly from multispectral data and requiring neither handcrafted visual features nor preprocessing steps.

It has the same characteristic as an RNN, being capable of modeling the temporal correlation between bitemporal images using a sophisticated and task-driven approach to the extraction of temporal features in an end-to-end architecture, and finally producing labels for the image sequence.
The remainder of this paper is organized as follows. After the introductory Section 1 detailing change detection, Section 2 is dedicated to the details of the proposed recurrent convolutional network. Section 3 then provides data set information, network setup, experimental results, and discussion. Finally, Section 4 concludes the paper.
2 Methodology
2.1 Network Architecture
The architecture of the proposed ReCNN, as shown in Fig. 1, is made up of three components, including a convolutional subnetwork, a recurrent subnetwork, and fully connected layers, from bottom to top.
To acquire a joint spectral-spatial-temporal feature representation for change detection, at the bottom of our network, convolutional layers automatically extract feature maps from each input. On top of the convolutional subnetwork, a recurrent subnetwork takes the feature representations produced by convolutional layers as inputs to exploit the temporal dependency in the bitemporal images. The third part consists of two fully connected layers, which are widely used in classification problems. Although ReCNN is composed of different kinds of network architectures (i.e., CNN, RNN, and fully connected network), it can be trained end-to-end by backpropagation with one loss function, because all of these components are differentiable.
Let $I^{T_1}$ and $I^{T_2}$ represent a pair of multispectral images acquired over the same geographical area at two different times $T_1$ and $T_2$, respectively. Let $X^{T_1}$ and $X^{T_2}$ be two patches taken from the exact same location in the two images. $y$ is a label that indicates the category (i.e., changed, unchanged, or change type) that the pair of patches belongs to. The flowchart of the proposed ReCNN can be summarized as follows:

First, the 3D multispectral patch $X^{T_1}$ is fed into branch $T_1$ of the convolutional subnetwork, which transforms it to an abstract feature vector $x_1$.

Then, the recurrent subnetwork receives $x_1$ and calculates the hidden state information for the current input; it also stores that information in the meantime.

Subsequently, $X^{T_2}$ is input to branch $T_2$ for extracting the spectral-spatial feature $x_2$, which is fed into the recurrent layer simultaneously with the state information of $x_1$, and the activation at time $T_2$ is computed by a linear interpolation between the existing value and the activation of the previous time $T_1$.

Finally, fully connected layers of the ReCNN predict the label of the input bitemporal multispectral patches by looping through the entire sequence.
The entire change detection map can be obtained by applying the network to all pixels in the image.
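The pixel-wise application of the network described above can be sketched as follows. This is a minimal NumPy illustration rather than the implementation used in the experiments: the function names, the patch size of 5, the reflect padding at image borders, and the `toy_model` stand-in for a trained ReCNN are all assumptions made for the example.

```python
import numpy as np

def predict_change_map(img_t1, img_t2, model, patch_size=5):
    """Label every pixel by classifying the pair of patches centred on it.

    img_t1, img_t2: arrays of shape (H, W, B) covering the same area.
    model: hypothetical callable that maps two (patch_size, patch_size, B)
           patches to an integer label.
    """
    assert img_t1.shape == img_t2.shape
    half = patch_size // 2
    # Assumption: reflect-pad so that border pixels also get a full patch.
    pad = ((half, half), (half, half), (0, 0))
    p1 = np.pad(img_t1, pad, mode="reflect")
    p2 = np.pad(img_t2, pad, mode="reflect")
    h, w, _ = img_t1.shape
    change_map = np.zeros((h, w), dtype=np.int64)
    for r in range(h):
        for c in range(w):
            x1 = p1[r:r + patch_size, c:c + patch_size]
            x2 = p2[r:r + patch_size, c:c + patch_size]
            change_map[r, c] = model(x1, x2)
    return change_map

def toy_model(x1, x2, threshold=0.5):
    """Stand-in for a trained network: thresholds the centre-pixel difference."""
    centre = x1.shape[0] // 2
    diff = np.linalg.norm(x2[centre, centre] - x1[centre, centre])
    return int(diff > threshold)

# Usage: a 4 x 4 scene with six bands in which a single pixel changes.
t1 = np.zeros((4, 4, 6))
t2 = t1.copy()
t2[1, 2, :] = 1.0
print(predict_change_map(t1, t2, toy_model))
```

In practice, the double loop would be replaced by batched inference, but the loop makes the one-label-per-patch-pair logic explicit.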
2.2 Spectral-Spatial Feature Extraction via the Convolutional Sub-Network
As we have mentioned, spectral-spatial information is of great importance for change detection. Some of the previous widely used unsupervised image algebra-based and image transformation-based methods cannot fully capture task-specific features that may be discriminative for a specific change detection task. Features directly learned from data and driven by tasks are supposed to be better [[22]]. This advantage leads to our usage of a trainable feature generator.
Though trainable, early and fairly simple 1D neural network models, such as DBN [[15]] and multilayer perceptron (MLP), suffer from a huge number of learnable parameters, since those architectures consist entirely of fully connected layers, which is undesirable given that available annotated training samples for change detection are often very limited. Moreover, another disadvantage of such networks is that they treat the multispectral data as vectors, ignoring the 2D property of imagery in the spatial domain.
CNNs, which are a significant branch of deep learning, have been attracting attention, due to the fact that they are capable of automatically discovering relevant contextual 2D spatial features as well as spectral features for multi/hyperspectral data. In addition, a CNN makes use of local connections to deal with spatial dependencies via sharing weights, and thus can significantly reduce the number of parameters of the network in comparison with the conventional 1D fully connected neural networks, e.g., DBN and MLP. Recently, CNNs used for hyperspectral image classification have proven their effectiveness in extracting useful spectral-spatial features [[36], [29]]. Triggered by this, adopting a CNN in our architecture is natural.
However, a direct use of CNNs commonly used in typical recognition tasks, e.g., AlexNet [[37]], VGG Nets [[38]], and GoogLeNet [[39]], is not suitable for our task, as we believe that a simpler network architecture is more appropriate for our problem due to the following reasons. First, change detection aims to distinguish only several classes (two for binary change detection), which requires much less model complexity than general visual recognition problems in computer vision, such as ImageNet classification with 1,000 categories. Second, since the spatial resolution of multispectral imagery is limited, it is desirable to make the input size small, which naturally reduces the depth of the network. Third, a smaller network is obviously more efficient in change detection problems, where testing may be performed over a large-scale area. Finally, the above-mentioned networks are not suitable to be used on multispectral images with a large number of spectral channels.
The convolutional subnetwork receives a sequence of multispectral patches as the input and has two separate, yet identical convolutional branches (i.e., branch $T_1$ and branch $T_2$ [cf. Fig. 1]), which process the patches $X^{T_1}$ and $X^{T_2}$ of the two acquisition times in parallel. The learned features are fed into the following recurrent subnetwork. Using this two-branch architecture, the convolutional recurrent neural network is constrained to first learn meaningful spectral-spatial representations of input patches, and to combine them on a higher level for modeling temporal dependency. More specifically, we make use of convolutional filters with a very small receptive field of , rather than using larger ones such as . Moreover, we do not adopt max-pooling after convolution or spatial padding for convolutional layers. The depth of the convolutional subnetwork is such that the output size of the last layer is .
Regarding convolution, we make use of dilated convolution to construct convolutional layers in the network because, for our task, it offers a slightly better performance than a traditional convolution operation. The dilated convolution [[40]] was originally designed for the efficient computation of the undecimated wavelet transform in the “algorithme à trous” scheme [[41]]. This algorithm makes it possible to calculate responses of any layer at any desirable resolution and can be applied post hoc, once a network has been trained. Let $F: \mathbb{Z}^2 \rightarrow \mathbb{R}$ be a discrete function. Let $\Omega_r = [-r, r]^2 \cap \mathbb{Z}^2$ and let $k: \Omega_r \rightarrow \mathbb{R}$ be a discrete filter of size $(2r+1)^2$. The traditional discrete convolution operation $*$ can be defined as follows:
$$(F * k)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t}) \tag{1}$$
This operation can be generalized. Let $l$ be a dilation rate and let $*_l$ be defined as
$$(F *_l k)(\mathbf{p}) = \sum_{\mathbf{s} + l\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t}) \tag{2}$$
We will refer to $*_l$ as a dilated convolution or an $l$-dilated convolution. Fig. 2 shows differences between the conventional convolution and the dilated convolution.
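To make the definition of Eq. (2) concrete, the following NumPy sketch evaluates a single-channel dilated convolution directly from the sum over filter offsets, without padding. It is an illustration under stated assumptions (a square filter of odd size, one input channel, the hypothetical function name `dilated_conv2d`), not the layer implementation used in the network.

```python
import numpy as np

def dilated_conv2d(F, k, l=1):
    """Valid-mode dilated convolution: out(p) = sum_t F(p - l*t) * k(t).

    F: 2D input array; k: square filter of odd size (2r + 1); l: dilation rate.
    With l = 1 this reduces to the traditional discrete convolution.
    """
    F = np.asarray(F, dtype=float)
    k = np.asarray(k, dtype=float)
    r = k.shape[0] // 2          # filter offsets t lie in [-r, r]^2
    reach = l * r                # farthest input offset touched by the filter
    H, W = F.shape
    out = np.zeros((H - 2 * reach, W - 2 * reach))
    for ti in range(-r, r + 1):
        for tj in range(-r, r + 1):
            # Input samples F(p - l*t) for all valid output positions p.
            rows = slice(reach - l * ti, reach - l * ti + out.shape[0])
            cols = slice(reach - l * tj, reach - l * tj + out.shape[1])
            out += k[ti + r, tj + r] * F[rows, cols]
    return out

# Usage: a 3 x 3 filter with dilation rate 2 covers a 5 x 5 input window.
F = np.arange(49, dtype=float).reshape(7, 7)
print(dilated_conv2d(F, np.ones((3, 3)), l=2).shape)  # (3, 3)
```

The filter still has nine weights, but each output value now depends on inputs spread over a 5 × 5 window, which is the property exploited in the network.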
The usage of dilated convolution in our network allows us to exponentially enlarge the field of view with a linearly increasing number of parameters, providing a significant parameter reduction while increasing the effective field of view. Note that a very recent study [[42]] found that a large field of view actually plays an important role. This can be easily understood by analogy with the fact that humans usually confirm the category of a pixel by referring to its surrounding context region.
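The growth of the field of view can be checked with a short calculation. For a stack of convolutional layers without pooling or striding, each layer with filter size $k$ and dilation rate $d$ enlarges the receptive field by $(k-1)d$. The sketch below assumes 3 × 3 filters and dilation rates that double from layer to layer; these rates are chosen for illustration only and are not taken from the network configuration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (side length) of stacked stride-1 convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Three 3x3 layers: same number of weights, different fields of view.
print(receptive_field(3, [1, 1, 1]))  # conventional convolutions
print(receptive_field(3, [1, 2, 4]))  # dilation rate doubled per layer
```

With doubling dilation rates the side length grows as 3, 7, 15, i.e., roughly exponentially in the number of layers, while the number of filter weights grows only linearly.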
2.3 Modeling Temporal Dependency by the Recurrent Sub-Network
The impressive success of recent deep learning systems has been predominantly achieved by feedforward neural network architectures like CNN. In such networks, we implicitly assume that all inputs are independent of each other. However, for tasks that involve processing time sequences (e.g., change detection), that is not a good assumption. RNNs are a kind of neural network that extends the conventional feedforward neural network with loops in connections. Unlike a feedforward network, an RNN is capable of dealing with dependent, sequential inputs by having a recurrent hidden state whose activation at each time step depends on that of the previous time. By doing so, the network can exhibit dynamic temporal behavior, which is in line with our purpose; i.e., modeling temporal dependency between the $T_1$ and $T_2$ data. To this end, three types of RNN architectures, namely, fully connected RNN, LSTM, and gated recurrent unit (GRU), are used to construct the recurrent subnetwork in our ReCNN.
Fully Connected RNN. Given feature vectors $x_1$ and $x_2$ learned from the convolutional subnetwork for the two acquisition times, a fully connected RNN updates its recurrent hidden state $h_t$ by
$$h_t = \begin{cases} 0, & \text{if } t = 0 \\ \varphi(h_{t-1}, x_t), & \text{otherwise} \end{cases} \tag{3}$$
where $\varphi$ is a nonlinear activation function, such as a hyperbolic tangent function or logistic sigmoid function. The recurrent layer will output a sequence $h = (h_1, h_2)$. For our task, we only need the last one, $h_2$, as input to the fully connected layers for predicting the label.
In the fully connected RNN model, the update of the recurrent hidden state in Eq. (3) is implemented as
$$h_t = \varphi(W x_t + U h_{t-1}) \tag{4}$$
where $U$ and $W$ are the coefficient matrices for the activation of recurrent hidden units at the previous time step and for the input at the present time, respectively.
Fully connected RNN is the most concise RNN model, and it reflects the essence of RNNs; i.e., an RNN is capable of modeling a probability distribution over the next element of the sequence data, given its present state $h_t$, by capturing a distribution over sequence data. Let $p(x_1, x_2, \ldots, x_T)$ be the sequence probability, which can be decomposed into
$$p(x_1, x_2, \ldots, x_T) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_T \mid x_1, \ldots, x_{T-1}) \tag{5}$$
Then, the conditional probability distribution can be modeled with an RNN:
$$p(x_t \mid x_1, \ldots, x_{t-1}) = \varphi(h_t) \tag{6}$$
where $h_t$ is obtained from Eq. (3). Our motivation in this work is apparent here: bitemporal images act as true sequential data instead of a simple difference image or stacked image and, therefore, an RNN can be used to model the temporal dependency.
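For a bitemporal pair, the fully connected recurrent update amounts to two applications of the same weights. The sketch below is a minimal NumPy illustration in which the hidden state starts at zero, the activation is the hyperbolic tangent, `W` acts on the current input, and `U` acts on the previous hidden state; the function name, the dimensions, and the random weights are assumptions for the example.

```python
import numpy as np

def rnn_forward(features, W, U):
    """Run h_t = tanh(W x_t + U h_{t-1}) over a sequence, with h_0 = 0."""
    h = np.zeros(U.shape[0])
    states = []
    for x in features:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

# Usage: two 8-dimensional feature vectors (one per date), 4 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 8))
U = rng.normal(scale=0.1, size=(4, 4))
x_first, x_second = rng.normal(size=8), rng.normal(size=8)
h_first, h_second = rnn_forward([x_first, x_second], W, U)
print(h_second.shape)  # only the last state is passed on for label prediction
```

The last state depends on the first input only through the previous hidden state, which is how the temporal dependency between the two dates enters the prediction.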
LSTM. LSTM is a special type of recurrent hidden unit and was initially proposed by Hochreiter and Schmidhuber [[43]]. Since then, a couple of minor modifications to the original version have been made. In this work, we follow the implementation of LSTM as used in [[44]]. As shown in Eq. (3), recurrent hidden units in a fully connected RNN simply compute a weighted sum of inputs and then apply a nonlinear function. In contrast, an LSTM-based recurrent layer maintains a series of memory cells $c_t$ at time step $t$. The activation of LSTM units can be calculated by
$$h_t = o_t \tanh(c_t) \tag{7}$$
where $\tanh(\cdot)$ is the hyperbolic tangent function and $o_t$ is the output gates that control the amount of memory content exposure. The output gates are updated by
$$o_t = \sigma(W_{oi} x_t + W_{oh} h_{t-1} + W_{oc} c_t) \tag{8}$$
where $\sigma(\cdot)$ is the logistic sigmoid function and the $W$ terms represent coefficient matrices; e.g., $W_{oi}$ and $W_{oc}$ are the input-output weight matrix and memory-output weight matrix, respectively.
The memory cells $c_t$ are updated by partially discarding the present memory contents and adding new contents of the memory cells $\tilde{c}_t$:
$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1} \tag{9}$$
where $\odot$ is an elementwise multiplication. The new memory contents are
$$\tilde{c}_t = \tanh(W_{ci} x_t + W_{ch} h_{t-1}) \tag{10}$$
where $W_{ci}$ is the input-memory weight matrix and $W_{ch}$ represents the hidden-memory coefficient matrix.
The $i_t$ and $f_t$ are input gates and forget gates, respectively. The former modulates the extent to which the new memory information is added to the memory cell, whereas the latter controls the degree to which contents of the existing memory cells are forgotten. Specifically, gates are computed as follows:
$$i_t = \sigma(W_{ii} x_t + W_{ih} h_{t-1} + W_{ic} c_{t-1}) \tag{11}$$
$$f_t = \sigma(W_{fi} x_t + W_{fh} h_{t-1} + W_{fc} c_{t-1}) \tag{12}$$
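One LSTM time step, with input, forget, and output gates and a memory cell as described above, can be sketched in NumPy as follows. This is an illustration and not the code used in the experiments: the dictionary keys naming the weight matrices, the full (rather than diagonal) memory-to-gate matrices, and the omission of bias terms are assumptions made here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm_params(input_dim, hidden_dim, rng=None):
    """Small random weights; key 'W_ab' maps source b (i, h, c) to target a."""
    rng = np.random.default_rng(0) if rng is None else rng
    shapes = {"i": input_dim, "h": hidden_dim, "c": hidden_dim}
    names = ["W_ii", "W_ih", "W_ic", "W_fi", "W_fh", "W_fc",
             "W_ci", "W_ch", "W_oi", "W_oh", "W_oc"]
    return {n: rng.normal(scale=0.1, size=(hidden_dim, shapes[n[-1]]))
            for n in names}

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM update: returns the new activation h and memory cell c."""
    i = sigmoid(P["W_ii"] @ x + P["W_ih"] @ h_prev + P["W_ic"] @ c_prev)  # input gate
    f = sigmoid(P["W_fi"] @ x + P["W_fh"] @ h_prev + P["W_fc"] @ c_prev)  # forget gate
    c_new = np.tanh(P["W_ci"] @ x + P["W_ch"] @ h_prev)                   # new contents
    c = i * c_new + f * c_prev                                            # memory update
    o = sigmoid(P["W_oi"] @ x + P["W_oh"] @ h_prev + P["W_oc"] @ c)       # output gate
    h = o * np.tanh(c)                                                    # exposed activation
    return h, c

# Usage: process the two dates of a bitemporal feature sequence.
P = init_lstm_params(input_dim=8, hidden_dim=4)
h, c = np.zeros(4), np.zeros(4)
for x in np.random.default_rng(1).normal(size=(2, 8)):
    h, c = lstm_step(x, h, c, P)
print(h.shape)
```

Note that the output gate reads the updated memory cell, whereas the input and forget gates read the previous one, which is why the memory update is computed before the output gate.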
GRU. Similarly to LSTM, a GRU makes use of a linear sum between the existing state and the newly computed state. It, however, directly exposes whole state values at each time step, instead of controlling what part of the state information will be exposed.
The activation $h_t$ of GRUs at time step $t$ is a linear interpolation between the previous activation $h_{t-1}$ and the candidate activation $\tilde{h}_t$:
$$h_t = (1 - u_t) \odot h_{t-1} + u_t \odot \tilde{h}_t \tag{13}$$
where the update gates $u_t$ determine how much GRUs update their activations or contents. Update gates can be computed by
$$u_t = \sigma(W_{ui} x_t + W_{uh} h_{t-1}) \tag{14}$$
where $W_{ui}$ and $W_{uh}$ are the input-update coefficient matrix and hidden-update weight matrix, respectively.
The candidate activation $\tilde{h}_t$ is computed similarly to that of the fully connected RNN (cf. Eq. (3)) as follows:
$$\tilde{h}_t = \tanh(W_{hi} x_t + W_{hh}(r_t \odot h_{t-1})) \tag{15}$$
where $r_t$ is the set of reset gates. When reset gates are totally off (i.e., $r_t$ is $0$), GRUs will completely forget the activation of the recurrent layer at the previous time and only receive the existing input. When open, reset gates will partially keep the information of the previously computed state. Reset gates are calculated similarly to update gates:
$$r_t = \sigma(W_{ri} x_t + W_{rh} h_{t-1}) \tag{16}$$
where $W_{ri}$ is the input-reset weight matrix and $W_{rh}$ represents the hidden-reset coefficient matrix.
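The GRU update described above, i.e., a linear interpolation between the previous activation and a candidate activation controlled by update and reset gates, can be sketched in NumPy as follows. The dictionary keys for the weight matrices, the omission of bias terms, and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_gru_params(input_dim, hidden_dim, rng=None):
    """Small random weights; '*i' matrices act on the input, '*h' on the state."""
    rng = np.random.default_rng(0) if rng is None else rng
    dims = {"i": input_dim, "h": hidden_dim}
    names = ["W_ui", "W_uh", "W_ri", "W_rh", "W_hi", "W_hh"]
    return {n: rng.normal(scale=0.1, size=(hidden_dim, dims[n[-1]]))
            for n in names}

def gru_step(x, h_prev, P):
    """One GRU update: interpolate between h_prev and the candidate state."""
    u = sigmoid(P["W_ui"] @ x + P["W_uh"] @ h_prev)             # update gate
    r = sigmoid(P["W_ri"] @ x + P["W_rh"] @ h_prev)             # reset gate
    h_cand = np.tanh(P["W_hi"] @ x + P["W_hh"] @ (r * h_prev))  # candidate activation
    return (1.0 - u) * h_prev + u * h_cand

# Usage: process the two dates of a bitemporal feature sequence.
P = init_gru_params(input_dim=8, hidden_dim=4)
h = np.zeros(4)
for x in np.random.default_rng(1).normal(size=(2, 8)):
    h = gru_step(x, h, P)
print(h.shape)
```

Compared with the LSTM step, there is no separate memory cell and no output gate, so the whole state is exposed at every time step.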
Fig. 3 shows graphic models of fully connected RNN, LSTM, and GRU through time.
2.4 Network Training
The network training is based on the TensorFlow framework. We chose Nesterov Adam [[45], [46]] as the optimizer to train the network since, for this task, it shows much faster convergence than standard stochastic gradient descent (SGD) with momentum [[47]] or Adam [[48]]. We fixed almost all parameters of Nesterov Adam as recommended in [[45]]: , , , and a schedule decay of 0.004, making use of a fairly small learning rate of . All network weights are initialized with a Glorot uniform initializer [[49]] that draws samples from a uniform distribution. We utilize sigmoid and softmax as activation functions of the last fully connected layer for the binary and multiclass change detections, respectively. Finally, we train our network on a single NVIDIA GeForce GTX TITAN with 12 GB of GPU memory.
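For reference, the Glorot uniform scheme draws each weight from a uniform distribution on $[-a, a]$ with $a = \sqrt{6 / (n_{in} + n_{out})}$, where $n_{in}$ and $n_{out}$ are the numbers of input and output units of the layer. The following NumPy sketch illustrates this rule for a dense weight matrix; the function name and the layer sizes are hypothetical, and the framework's built-in initializer would be used in practice.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Sample a (fan_in, fan_out) weight matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Usage: weights of a dense layer with 64 inputs and 32 outputs.
W = glorot_uniform(64, 32, np.random.default_rng(0))
print(W.shape, float(np.abs(W).max()))
```

The bound shrinks as layers get wider, which keeps the variance of activations and gradients roughly constant across layers at the start of training.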
3 Experimental Results and Discussion
3.1 Data Description
The performance of the proposed network is evaluated on two data sets, which were acquired by the Landsat Enhanced Thematic Mapper Plus (ETM+) sensor with six bands and a spatial resolution of 30 m. Before feeding data into models, digital numbers (DNs) of the original data were converted into absolute radiance, and all of the data sets used in the experiments were normalized into a range of [0,1].
Taizhou Data
This data set consists of two images covering the city of Taizhou, China, in March 2000 and February 2003, with a WGS84 projection and a coordinate range of 31°14′56″N–31°27′39″N, 120°02′24″E–121°07′45″E. These two images both consist of pixels, and changes between them mainly involve city expansion. The available manually annotated samples of this data set for multiclass change detection cover four classes of interest (cf. Fig. 4); i.e., unchanged area, city expansion (bare soils, grasslands, or cultivated fields to buildings or roads), soil change (cultivated field to bare soil), and water change (nonwater regions to water regions). Table 1 provides information about different classes and their corresponding training and test samples.
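Table 1: Number of training and test samples for the Taizhou data set.

| Task | Class Name | Training | Test |
| --- | --- | --- | --- |
| Binary | Changed region | 500 | 4055 |
| Binary | Unchanged region | 500 | 16961 |
| Binary | TOTAL | 1000 | 21016 |
| Multiple | Unchanged region | 500 | 16961 |
| Multiple | City expansion | 500 | 2875 |
| Multiple | Soil change | 500 | 104 |
| Multiple | Water change | 500 | 75 |
| Multiple | TOTAL | 2000 | 20015 |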
Eppalock Lake
The second data set was acquired over the Eppalock lake, Victoria, Australia, in February 1991 and March 2009, with a WGS84 projection and a coordinate range of 36°49′10″S–37°00′52″S, 144°27′52″E–144°37′35″E. Both images in this data set are pixels. Similar to the Taizhou data, four classes are considered in the Eppalock lake scene for multiclass change detection: unchanged region, city expansion (bare soils, grasslands, or cultivated fields to buildings or roads), water loss (water regions to bare soils), and soil change (vegetative covers or artificial buildings to bare soils). Fig. 5 shows two true-color composite images and their corresponding reference samples. The number of training and test samples is displayed in Table 2.
3.2 General Information
To evaluate the performance of different change detection algorithms, we utilize the following evaluation criteria:

Overall accuracy (OA): This index shows the number of bitemporal pixels that are classified correctly, divided by the number of test samples.

Kappa coefficient: This metric is a statistical measurement of agreement between the final change detection map and the ground truth map. It is the percentage agreement corrected by the level of agreement that could be expected due to chance alone. In general, it is thought to be a more robust measure than a simple percent agreement computation, as it takes into account the agreement occurring by chance.
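Both criteria can be computed from the confusion matrix of the test samples. The sketch below uses the standard definitions: OA is the trace of the confusion matrix divided by the number of samples, and Kappa is $(p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance from the row and column totals. The function names and the small example matrix are illustrative and do not correspond to any result reported here.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of reference class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def overall_accuracy(cm):
    """Fraction of correctly classified samples."""
    return np.trace(cm) / cm.sum()

def kappa_coefficient(cm):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = cm.sum()
    p_o = np.trace(cm) / n
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    return (p_o - p_e) / (1.0 - p_e)

# Usage with a made-up binary confusion matrix (unchanged, changed).
cm = np.array([[40, 10],
               [10, 40]])
print(overall_accuracy(cm), kappa_coefficient(cm))
```

For this balanced example the chance agreement is 0.5, so an OA of 0.8 corresponds to a Kappa of 0.6; with strongly imbalanced classes the gap between the two measures becomes larger.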
To validate the effectiveness of the proposed ReCNN model, it is compared with the most widely used change detection methods. These methods are summarized as follows:

CVA [[7]], which is an effective unsupervised approach for multispectral image change detection tasks.

PCA [[10]], which is simple in computation and can be applied to real-time applications.

MAD [[12]], which is a classical image transformation-based unsupervised algorithm for bitemporal multispectral image change detection.

Iteratively reweighted multivariate alteration detection (IRMAD) [[50]], which is an extension to MAD by introducing an iterative scheme.

Decision tree (DT), which is a nonparametric supervised learning method used for classification and regression. Its goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from data features.

Support vector machine (SVM), which works by mapping data to a kernel-induced high-dimensional feature space and seeking an optimal decision hyperplane that can best separate data samples when data points are not linearly separable. Here, we use an SVM with an RBF kernel. The optimal hyperplane parameters $C$ (parameter that controls the amount of penalty during the SVM optimization) and $\gamma$ (spread of the RBF kernel) have been traced in the range of and using five-fold cross-validation.

RNN [[23]], which has recently shown promising performance in classification and change detection.

ReCNN-FC, which uses a fully connected RNN as the recurrent subnetwork in the ReCNN model.

ReCNN-GRU, which uses the GRU architecture in the recurrent subnetwork.

ReCNN-LSTM, which is the ReCNN model with LSTM as the recurrent component.
Among these methods, CVA, PCA, MAD, IRMAD, and RNN are used in binary change detection experiments, and DT, SVM, and RNN are compared to the proposed network in multiclass change detection experiments. Moreover, the k-means algorithm is used to automatically select the threshold for unsupervised methods in the binary change detection task.
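As an illustration of how an unsupervised baseline produces a binary map, the sketch below computes a CVA-style change magnitude (the Euclidean norm of the spectral difference vector at each pixel) and splits the magnitudes into two clusters with a one-dimensional k-means, taking the midpoint between the two cluster centres as the threshold. The exact thresholding setup used in the experiments is not specified in the text, so the initialization at the minimum and maximum value, the function names, and the toy data are assumptions.

```python
import numpy as np

def change_magnitude(img_t1, img_t2):
    """Per-pixel Euclidean norm of the spectral change vector; inputs are (H, W, B)."""
    diff = np.asarray(img_t2, dtype=float) - np.asarray(img_t1, dtype=float)
    return np.linalg.norm(diff, axis=-1)

def two_means_threshold(values, n_iter=100):
    """1D k-means with k = 2; returns the midpoint between the two centres."""
    v = np.asarray(values, dtype=float).ravel()
    lo, hi = v.min(), v.max()            # assumed initial cluster centres
    for _ in range(n_iter):
        thr = 0.5 * (lo + hi)            # nearest-centre assignment boundary
        low, high = v[v <= thr], v[v > thr]
        if low.size == 0 or high.size == 0:
            break
        new_lo, new_hi = low.mean(), high.mean()
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return 0.5 * (lo + hi)

# Usage: a toy scene in which the lower-right quadrant changes.
rng = np.random.default_rng(0)
t1 = rng.uniform(0.0, 0.05, size=(8, 8, 6))
t2 = t1.copy()
t2[4:, 4:, :] += 0.5
mag = change_magnitude(t1, t2)
binary_map = mag > two_means_threshold(mag)
print(int(binary_map.sum()))
```

In one dimension, assigning each value to the nearest of two centres is equivalent to thresholding at their midpoint, which is why the clustering directly yields a threshold.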
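Table 2: Number of training and test samples for the Eppalock lake data set.

| Task | Class Name | Training | Test |
| --- | --- | --- | --- |
| Binary | Changed region | 500 | 3380 |
| Binary | Unchanged region | 500 | 4515 |
| Binary | TOTAL | 1000 | 7895 |
| Multiple | Unchanged region | 300 | 4715 |
| Multiple | Water loss | 300 | 2817 |
| Multiple | Soil change | 300 | 341 |
| Multiple | City expansion | 50 | 72 |
| Multiple | TOTAL | 950 | 7945 |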
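Table 3: Binary change detection results on the Taizhou City and Eppalock Lake data sets (OA and class accuracies in %).

| Method | Taizhou OA | Taizhou Kappa | Taizhou Unchanged | Taizhou Changed | Eppalock OA | Eppalock Kappa | Eppalock Unchanged | Eppalock Changed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CVA [[7]] | 83.82 | 0.3202 | 97.38 | 27.10 | 81.28 | 0.6353 | 69.24 | 97.37 |
| PCA [[10]] | 94.63 | 0.8181 | 99.79 | 74.51 | 74.68 | 0.5044 | 64.98 | 87.63 |
| MAD [[12]] | 94.62 | 0.8168 | 98.47 | 78.52 | 91.10 | 0.8138 | 99.14 | 80.36 |
| IRMAD [[50]] | 95.14 | 0.8313 | 99.35 | 77.53 | 91.27 | 0.8174 | 99.49 | 80.30 |
| RNN [[23]] | 96.50 | 0.8884 | 97.58 | 91.96 | 95.21 | 0.9018 | 97.03 | 92.78 |
| ReCNN-FC | 98.35 | 0.9470 | 98.94 | 95.86 | 98.40 | 0.9674 | 98.56 | 98.20 |
| ReCNN-GRU | 98.67 | 0.9571 | 99.23 | 96.30 | 98.64 | 0.9723 | 99.22 | 97.87 |
| ReCNN-LSTM | 98.73 | 0.9592 | 99.20 | 96.77 | 98.67 | 0.9728 | 98.83 | 98.46 |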
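Table 4: Multiclass change detection results on the Taizhou City and Eppalock Lake data sets (OA and class accuracies in %).

| Data set | Method | OA | Kappa | Unchanged | City expansion | Soil change | Water change/loss |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Taizhou City | Decision Tree | 85.19 | 0.5846 | 84.64 | 88.49 | 82.69 | 86.67 |
| Taizhou City | SVM | 93.90 | 0.7927 | 94.69 | 89.32 | 92.31 | 93.33 |
| Taizhou City | RNN [[23]] | 95.48 | 0.8374 | 97.04 | 86.92 | 85.58 | 85.33 |
| Taizhou City | ReCNN-FC | 97.37 | 0.9039 | 97.95 | 94.12 | 95.19 | 92.00 |
| Taizhou City | ReCNN-GRU | 97.52 | 0.9097 | 98.05 | 94.54 | 95.19 | 96.00 |
| Taizhou City | ReCNN-LSTM | 98.04 | 0.9279 | 98.36 | 96.31 | 94.23 | 97.33 |
| Eppalock Lake | Decision Tree | 87.56 | 0.7811 | 81.31 | 41.67 | 89.15 | 99.01 |
| Eppalock Lake | SVM | 95.86 | 0.9228 | 94.46 | 72.22 | 97.65 | 98.58 |
| Eppalock Lake | RNN [[23]] | 96.34 | 0.9392 | 95.55 | 41.67 | 96.48 | 99.04 |
| Eppalock Lake | ReCNN-FC | 98.45 | 0.9705 | 98.01 | 80.56 | 100 | 99.47 |
| Eppalock Lake | ReCNN-GRU | 98.49 | 0.9712 | 98.24 | 79.17 | 100 | 99.22 |
| Eppalock Lake | ReCNN-LSTM | 98.70 | 0.9752 | 98.49 | 84.72 | 100 | 99.25 |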
3.3 Analysis of Recurrent Subnetwork: Comparisons between Fully Connected RNN, LSTM, and GRU
The most prominent trait shared between fully connected RNN, LSTM, and GRU is that there exists an additive loop in their update from one time step to the next, which is lacking in conventional feedforward neural networks such as CNNs. In contrast to the fully connected RNN in Eq. (4), both LSTM and GRU keep the current content and add the new content on top of it (cf. Eq. (9) and Eq. (13)). These two RNN architectures, however, have a number of differences as well. LSTM makes use of three gates and a cell, namely, an input gate, forget gate, output gate, and memory cell, to control the exposure of memory content, whereas GRU only utilizes two gates to control the information flow. Therefore, the total number of parameters in GRU is reduced by about 25% compared to that in LSTM. Fig. 6 shows the number of total trainable parameters in different RNN architectures.
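The 25% figure can be verified with a simple count. In the common formulation with bias vectors and without memory-to-gate (peephole) connections, each gate or candidate block has one input matrix, one hidden matrix, and one bias vector; a fully connected RNN has one such block, a GRU three, and an LSTM four. The sketch below uses this assumed formulation and hypothetical layer sizes, so the absolute numbers are illustrative only.

```python
def recurrent_param_count(input_dim, hidden_dim, n_blocks):
    """Parameters of n_blocks weight blocks, each with W (input), U (hidden), and a bias."""
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return n_blocks * per_block

# Hypothetical sizes: 128-dimensional input features, 128 hidden units.
input_dim, hidden_dim = 128, 128
fc = recurrent_param_count(input_dim, hidden_dim, n_blocks=1)    # fully connected RNN
gru = recurrent_param_count(input_dim, hidden_dim, n_blocks=3)   # update, reset, candidate
lstm = recurrent_param_count(input_dim, hidden_dim, n_blocks=4)  # input, forget, output, cell
print(fc, gru, lstm, 1.0 - gru / lstm)
```

Because the block size is identical across architectures, the GRU-to-LSTM ratio is 3/4 regardless of the chosen dimensions.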
Table 3 and Table 4 list binary and multiclass change detection results obtained in our experiments, respectively. For both data sets, ReCNNLSTM outperforms ReCNNFC and ReCNNGRU on all indexes (i.e., OA and Kappa coefficient). For example, in the binary change detection, ReCNNLSTM increases the accuracy by 0.38% of OA and 0.0122 of Kappa on the Taizhou data set, in comparison with ReCNNFC; by 0.06% of OA and 0.0021 of Kappa on the same data set, compared to ReCNNGRU. However, we can see that on these data sets, all three variations of the proposed ReCNN perform closely to each other. On the other hand, the proposed networks with gating RNN architectures as the recurrent subnetwork (ReCNNLSTM and ReCNNGRU) slightly outperforms the more traditional ReCNNFC on both of data sets and change detection tasks.
3.4 Analysis of Spatial Component: RNN vs. ReCNN-LSTM
In the case of spectral-spatial-temporal change detection, the proposed recurrent convolutional network significantly improves on the spectral-temporal RNN model. As shown in Table 3, compared to RNN, ReCNN-LSTM increases the accuracy of binary change detection by 2.23% in OA and 0.0708 in Kappa coefficient on the Taizhou data set. For the Eppalock lake scene, the increments in OA and Kappa coefficient are 3.46% and 0.071, respectively. Table 4 compares the performance of RNN and ReCNN-LSTM on the multiclass change detection task. The latter improves on the former by 2.56% in OA and 0.0905 in Kappa coefficient on the Taizhou scene, and by 2.36% in OA and 0.036 in Kappa on the Eppalock lake data. These results indicate that exploiting spatial cues in our model yields a more powerful spectral-spatial-temporal change detector.
Furthermore, as shown in Fig. 8, the spectral-temporal change detection method (RNN) results in noisy scattered points in the change detection maps. Our spectral-spatial-temporal model ReCNN-LSTM addresses this problem by eliminating these wrongly detected scattered points.
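For readers who wish to reproduce comparisons of this kind, the following minimal sketch shows how OA and the Kappa coefficient are commonly derived from a confusion matrix. The function names and the example matrix are hypothetical and serve only as an illustration; they are not taken from our experiments.

```python
def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def kappa_coefficient(confusion):
    """Cohen's Kappa: agreement beyond what is expected by chance."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = overall_accuracy(confusion)
    row_sums = [sum(confusion[i]) for i in range(n)]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    # Chance agreement: product of the marginals, summed over all classes.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical binary example: rows are reference labels, columns are predictions.
    example = [[50, 10],
               [5, 35]]
    print("OA:", overall_accuracy(example))
    print("Kappa:", kappa_coefficient(example))
```

Since Kappa discounts chance agreement, it can separate two detectors more clearly than OA when one class (typically the unchanged class) dominates the scene.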
3.5 Comparison with Other Approaches
The OAs and Kappa coefficients of all competitors and the proposed networks on the binary change detection task can be found in Table 3. The classical change detection algorithms, CVA, PCA, MAD, and IRMAD, all achieve good performance, with IRMAD performing best among them. Compared to IRMAD, ReCNN-LSTM improves OA and Kappa coefficient by 3.59% and 0.1279, respectively, on the Taizhou data set, and by 7.4% and 0.1554, respectively, on the Eppalock lake scene. The cost of these accuracy improvements, however, is that some training data have to be labeled manually for supervised learning.
Table 4 presents the accuracy indexes on the multiclass change detection task. The detection accuracies indicate that SVM with an RBF kernel outperforms DT, mainly because the kernel SVM generally handles nonlinear inputs more efficiently than DT. The proposed recurrent convolutional network ReCNN-LSTM outperforms SVM and RNN in terms of OA and Kappa coefficient on both the Taizhou and Eppalock lake data. Compared to SVM and RNN, ReCNN-LSTM increases OA by 4.14% and 2.56%, respectively, on the Taizhou data set, and by 2.84% and 2.36%, respectively, on the Eppalock lake data.
Fig. 7 shows the change detection results for Taizhou city and Eppalock lake obtained by our model.
4 Conclusion
In this paper, we have proposed a novel neural network architecture, called recurrent convolutional neural network (ReCNN), which integrates the merits of both the convolutional neural network (CNN) and the recurrent neural network (RNN). ReCNN is capable of extracting joint spectral-spatial-temporal features from bitemporal multispectral images and predicting change types. Moreover, it is end-to-end trainable. All these properties make ReCNN a well-suited approach for multitemporal remote sensing data analysis.
The experiments on real multispectral images demonstrate that ReCNN achieves competitive performance compared with conventional change detection models as well as the spectral-temporal RNN algorithm. This confirms the advantages of the proposed recurrent convolutional network. In addition, ReCNN is a general framework; therefore, it can be applied to other domains and problems that involve sequence prediction in remote sensing data (such as multitemporal hyper/multispectral data classification).
Future work will focus on new architectures based on ReCNN, for example, a semi-supervised ReCNN that can also use arbitrary amounts of unlabeled data for training, typically a small amount of labeled data together with a large amount of unlabeled data.
Acknowledgement
The authors would like to express their appreciation to Dr. Chen Wu for providing the Taizhou data set.
References
[1] F. Bovolo and L. Bruzzone, “The time variable in data fusion: A change detection perspective,” IEEE Geoscience and Remote Sensing Magazine, vol. 3, no. 3, pp. 8–26, 2015.
[2] N. Yokoya, X. X. Zhu, and A. Plaza, “Multisensor coupled spectral unmixing for time-series analysis,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 5, pp. 2842–2857, 2017.
[3] J. Yang, P. J. Weisberg, and N. A. Bristow, “Landsat remote sensing approaches for monitoring long-term tree cover dynamics in semi-arid woodlands: Comparison of vegetation indices and spectral mixture analysis,” Remote Sensing of Environment, vol. 119, pp. 62–71, 2012.
[4] G. Xian and C. Homer, “Updating the 2001 national land cover database impervious surface products to 2006 using Landsat imagery change detection methods,” Remote Sensing of Environment, vol. 114, no. 8, pp. 1676–1686, 2010.
[5] B. Liang and Q. Weng, “Assessing urban environmental quality change of Indianapolis, United States, by the remote sensing and GIS integration,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 4, no. 1, pp. 43–55, 2011.
[6] W. A. Malila, “Change vector analysis: An approach for detecting forest changes with Landsat,” in Machine Processing of Remotely Sensed Data Symposium, 1980.
[7] F. Bovolo and L. Bruzzone, “A theoretical framework for unsupervised change detection based on change vector analysis in the polar domain,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 1, pp. 218–236, 2007.
[8] F. Bovolo, S. Marchesi, and L. Bruzzone, “A framework for automatic and unsupervised detection of multiple changes in multitemporal images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 6, pp. 2196–2212, 2012.
[9] M. Zanetti, F. Bovolo, and L. Bruzzone, “Rayleigh-Rice mixture parameter estimation via EM algorithm for change detection in multispectral images,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5004–5016, 2015.
[10] J. S. Deng, K. Wang, Y. H. Deng, and G. J. Qi, “PCA-based land-use change detection and analysis using multitemporal and multisensor satellite data,” International Journal of Remote Sensing, vol. 29, no. 16, pp. 4823–4838, 2008.
[11] J. B. Collins and C. E. Woodcock, “An assessment of several linear change detection techniques for mapping forest mortality using multitemporal Landsat TM data,” Remote Sensing of Environment, vol. 56, no. 1, pp. 66–77, 1996.
[12] A. A. Nielsen, K. Conradsen, and J. J. Simpson, “Multivariate alteration detection (MAD) and MAF postprocessing in multispectral, bitemporal image data: New approaches to change detection studies,” Remote Sensing of Environment, vol. 64, no. 1, pp. 1–19, 1998.
[13] C. Wu, B. Du, and L. Zhang, “Slow feature analysis for change detection in multispectral imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 5, pp. 2858–2874, 2014.
[14] A. Ertürk, M.-D. Iordache, and A. Plaza, “Sparse unmixing-based change detection for multitemporal hyperspectral images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 2, pp. 708–719, 2016.
[15] M. Gong, T. Zhan, P. Zhang, and Q. Miao, “Superpixel-based difference representation learning for change detection in multispectral remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 5, pp. 2658–2673, 2017.
[16] X. Li and A. G. O. Yeh, “Principal component analysis of stacked multitemporal images for the monitoring of rapid urban expansion in the Pearl River Delta,” International Journal of Remote Sensing, vol. 19, no. 8, pp. 1501–1518, 1998.
[17] L. Bruzzone and S. B. Serpico, “An iterative technique for the detection of land-cover transitions in multitemporal remote-sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 4, pp. 858–867, 1997.
[18] L. Bruzzone, D. F. Prieto, and S. B. Serpico, “A neural-statistical approach to multitemporal and multisource remote-sensing image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 37, no. 3, pp. 1350–1359, 1999.
[19] L. Bruzzone and R. Cossu, “A multiple cascade-classifier system for a robust and partially unsupervised updating of land-cover maps,” IEEE Transactions on Geoscience and Remote Sensing, vol. 40, no. 9, pp. 1984–1996, 2002.
[20] L. Bruzzone, R. Cossu, and G. Vernazza, “Detection of land-cover transitions by combining multidate classifiers,” Pattern Recognition Letters, vol. 25, no. 13, pp. 1491–1500, 2004.
[21] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, 2017.
[22] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: A review,” arXiv:1710.03959, 2017.
[23] H. Lyu, H. Lu, and L. Mou, “Learning a transferable change rule from a recurrent neural network for land cover change detection,” Remote Sensing, vol. 8, no. 6, p. 506, 2016.
[24] M. Russwurm and M. Körner, “Temporal vegetation modelling using long short-term memory networks for crop identification from medium-resolution multispectral satellite images,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) workshop, 2017.
[25] L. Mou and X. X. Zhu, “IM2HEIGHT: Height estimation from single monocular imagery via fully residual convolutional-deconvolutional network,” arXiv:1802.10249, 2018.
[26] J. Hu, L. Mou, A. Schmitt, and X. X. Zhu, “FusioNet: A two-stream convolutional neural network for urban scene classification using PolSAR and hyperspectral data,” in Joint Urban Remote Sensing Event (JURSE), 2017.
[27] L. Mou, M. Schmitt, Y. Wang, and X. X. Zhu, “A CNN for the identification of corresponding patches in SAR and optical imagery of urban scenes,” in Joint Urban Remote Sensing Event (JURSE), 2017.
[28] M. Volpi and D. Tuia, “Dense semantic labeling of subdecimeter resolution images with convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 2, pp. 881–893, 2017.
[29] L. Mou, P. Ghamisi, and X. Zhu, “Unsupervised spectral-spatial feature learning via deep residual conv-deconv network for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 1, pp. 391–406, 2018.
[30] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, “Convolutional neural networks for large-scale remote-sensing image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 2, pp. 645–657, 2017.
[31] L. Mou and X. X. Zhu, “Spatiotemporal scene interpretation of space videos via deep neural network and tracklet analysis,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2016.
[32] L. Hughes, M. Schmitt, L. Mou, Y. Wang, and X. X. Zhu, “Identifying corresponding patches in SAR and optical images with a pseudo-Siamese CNN,” arXiv:1801.08467, 2018.
[33] L. Mou, X. X. Zhu, M. Vakalopoulou, K. Karantzalos, N. Paragios, B. L. Saux, G. Moser, and D. Tuia, “Multitemporal very high resolution from space: Outcome of the 2016 IEEE GRSS Data Fusion Contest,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 8, pp. 3435–3447, 2017.
[34] L. Mou, P. Ghamisi, and X. Zhu, “Deep recurrent neural networks for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3639–3655, 2017.
[35] H. Wu and S. Prasad, “Convolutional recurrent neural networks for hyperspectral data classification,” Remote Sensing, vol. 9, no. 3, p. 298, 2017.
[36] Y. Li, H. Zhang, and Q. Shen, “Spectral-spatial classification of hyperspectral imagery with 3D convolutional neural network,” Remote Sensing, vol. 18, no. 7, pp. 1527–1554, 2016.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012.
[38] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in IEEE International Conference on Learning Representations (ICLR), 2015.
[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[40] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” arXiv:1606.00915, 2016.
[41] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian, “A real-time algorithm for signal analysis with the help of the wavelet transform,” in Wavelets: Time-Frequency Methods and Phase Space, 1989.
[42] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters – improve semantic segmentation by global convolutional network,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[43] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[44] A. Graves, “Generating sequences with recurrent neural networks,” arXiv:1308.0850, 2013.
[45] T. Dozat, “Incorporating Nesterov momentum into Adam,” http://cs229.stanford.edu/proj2015/054_report.pdf, online.
[46] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in IEEE International Conference on Machine Learning (ICML), 2013.
[47] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
[48] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in IEEE International Conference on Learning Representations (ICLR), 2015.
[49] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[50] A. A. Nielsen, “The regularized iteratively reweighted MAD method for change detection in multi- and hyperspectral data,” IEEE Transactions on Image Processing, vol. 16, no. 2, pp. 463–478, 2007.