Land Cover Classification via Multi-temporal Spatial Data by Recurrent Neural Networks
Modern Earth observation programs produce huge volumes of satellite image time series (SITS) that can be used to monitor geographical areas through time. How to analyze this kind of information efficiently is still an open question in the remote sensing field. Recently, deep learning methods have proved suitable for remote sensing data, mainly for scene classification (e.g. Convolutional Neural Networks, CNNs, on single images), while only very few studies apply temporal deep learning approaches (e.g. Recurrent Neural Networks, RNNs) to remote sensing time series.
In this letter we evaluate the ability of Recurrent Neural Networks, in particular the Long Short-Term Memory (LSTM) model, to perform land cover classification on multi-temporal spatial data derived from a time series of satellite images. We carried out experiments on two different datasets, considering both pixel-based and object-based classification. The results show that Recurrent Neural Networks are competitive with state-of-the-art classifiers and may outperform classical approaches in the presence of poorly represented and/or highly mixed classes. We also show that the alternative feature representation generated by the LSTM can improve the performance of standard classifiers.
Earth observation programs produce huge volumes of remotely sensed data every day. Such information can be organized into time series of satellite images that are useful to monitor geographical zones through time. Efficiently managing and analyzing remote sensing time series is still an open challenge in the remote sensing field.
In the context of land cover classification, exploiting a time series of satellite images, instead of one single image, helps to distinguish among classes because they exhibit different temporal profiles. Despite the usefulness of the temporal trends that can be derived from remote sensing time series, most of the proposed strategies directly apply standard machine learning approaches (e.g. Random Forest, SVM) to the stacked images. Since these approaches do not model temporal correlations, they treat features independently of each other, ignoring any temporal dependency the data may exhibit. Recently, the deep learning revolution has shown that neural network models are well-adapted tools to manage and automatically classify remote sensing data. While standard CNN techniques are well suited to deal with spatial autocorrelation, they are not designed to manage long and complex temporal dependencies. A family of deep learning methods especially tailored to cope with temporal correlations is that of Recurrent Neural Networks and, in particular, Long Short-Term Memory (LSTM) networks. Such models explicitly capture temporal correlations by recursion, and they have already proved effective in different domains such as speech recognition, natural language processing and image completion. Only recently, in the remote sensing field, a preliminary study applied the LSTM model to a (small) time series composed of only two dates to perform supervised change detection. The task was modeled as a binary classification problem (change vs. no-change). To the best of our knowledge, RNNs (i.e. LSTM) have not yet been considered for land cover classification of deeper time series. Like any other deep learning model, LSTM can be used as a classifier itself or employed to extract new discriminative features (or representation). In the latter case, the extracted features are subsequently used to feed a standard learning algorithm that does not consider temporal dependencies (e.g. Random Forest, Naive Bayes, KNN, SVMs).
In this letter we evaluate the quality of RNN models, namely Long Short-Term Memory, for land cover classification on multi-temporal spatial data derived from SITS. In more detail, we perform experiments on two study areas: i) the THAU basin, a site located in the south of France, from which we obtain a time series of 3 dates, and ii) REUNION ISLAND, a region of France located in the Indian Ocean (east of Madagascar), from which we derive a 23-date time series. On the first area we conduct an object-oriented classification, while on the second we perform a pixel-based prediction, showing the general applicability of RNN models to both object- and pixel-level analysis. We also assess the RNN model (i.e. LSTM) as a feature extractor, evaluating the quality of the newly generated features when they feed the same baseline classifiers we use as reference methods.
The rest of the paper is organized as follows: Section 2 introduces the LSTM unit and specifies the network architecture we propose; Section 3 describes the two datasets, the time series characteristics and the preprocessing performed on the source data. The experimental setting and results are discussed in Section 4. Conclusions are drawn in Section 5.
We propose a neural architecture based on an LSTM unit for land cover classification on multi-temporal spatial data. We also assess the representation learned by the RNN model with standard classification strategies commonly used for prediction in the remote sensing field.
2.1 Long Short-Term Memory
Recurrent Neural Networks are well-established machine learning techniques that have demonstrated their quality in different domains such as speech recognition, signal processing, natural language processing and image completion. Unlike standard feed-forward networks (e.g. CNNs), RNNs explicitly manage temporal data dependencies, since the output of the neuron at time t-1 is used, together with the next input, to feed the neuron itself at time t. A sketch of a typical RNN neuron is depicted in Figure ?.
The best-known type of RNN is the Long Short-Term Memory (LSTM) model. There are many variants of the LSTM network, but here we refer to the architecture proposed in . LSTM models were mainly introduced to learn long-term dependencies, since previous RNN models failed at this task due to the problem of vanishing and exploding gradients. Equations (1) to (6) formally describe the LSTM neuron, while Figure ? graphically depicts the LSTM unit. The symbol $\odot$ indicates element-wise multiplication, while $\sigma$ and $\tanh$ represent the sigmoid and hyperbolic tangent functions respectively. The input of the LSTM is a sequence of variables ($x_1$, ..., $x_N$), where a generic element $x_t$ is a feature vector and $t$ refers to the corresponding timestamp. RNN models are able to manage variable-length data sequences.
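For reference, the standard LSTM formulation with a forget gate, which matches the six equations referenced above and the description of the unit in this section, can be written as below. The weight and bias names ($W_{ix}$, $W_{ih}$, $b_i$, and so on) and the notation $\tilde{c}_t$ for the temporary cell state are notational assumptions; only the structure of the unit is implied by the text.

```latex
\begin{align}
i_t &= \sigma\left(W_{ix}\, x_t + W_{ih}\, h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{fx}\, x_t + W_{fh}\, h_{t-1} + b_f\right) \\
\tilde{c}_t &= \tanh\left(W_{cx}\, x_t + W_{ch}\, h_{t-1} + b_c\right) \\
c_t &= i_t \odot \tilde{c}_t + f_t \odot c_{t-1} \\
o_t &= \sigma\left(W_{ox}\, x_t + W_{oh}\, h_{t-1} + b_o\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{align}
```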
The LSTM unit is composed of two cell states, the memory $c_t$ and the hidden state $h_t$, and three different gates, the input gate $i_t$, the forget gate $f_t$ and the output gate $o_t$, which are employed to control the flow of information. All three gates combine the current input $x_t$ with the hidden state $h_{t-1}$ coming from the previous timestamp. The gates have two important functions: i) they regulate how much information has to be forgotten or remembered during the process; ii) they deal with the problem of vanishing/exploding gradients. The gates are implemented by a sigmoid function, which returns values between 0 and 1 where, in this context, 0 indicates that the information is completely forgotten and 1 means that it is completely retained.
The LSTM unit also uses a temporary cell state $\tilde{c}_t$ that rescales the current input, always taking into account the previous hidden state. This temporary cell is implemented by a hyperbolic tangent function, which returns values between -1 and 1. Both the sigmoid and the hyperbolic tangent are applied element-wise.
The input gate regulates how much of the current information needs to be maintained ($i_t \odot \tilde{c}_t$), while the forget gate indicates how much of the previous memory needs to be retained at the current step ($f_t \odot c_{t-1}$). Finally, the output gate acts on the new hidden state by deciding how much information of the current memory is output to the next step. The different weight matrices and bias coefficients are the parameters learned during the training of the model. Both the memory and the hidden state are forwarded to the next time step.
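The gate mechanics described above can be made concrete with a minimal NumPy sketch of one LSTM time step, iterated over a sequence to obtain the last hidden state. This is an illustration of the standard LSTM formulation, not the authors' Keras implementation: the function names (`lstm_step`, `init_params`, `last_hidden_state`), the random initialization and the toy sizes (3 dates, 10 features, 8 hidden units) are all assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function, used by the three gates."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to (W_x, W_h, b)."""
    def affine(name):
        W_x, W_h, b = params[name]
        return W_x @ x_t + W_h @ h_prev + b

    i_t = sigmoid(affine("i"))          # input gate, values in (0, 1)
    f_t = sigmoid(affine("f"))          # forget gate, values in (0, 1)
    o_t = sigmoid(affine("o"))          # output gate, values in (0, 1)
    c_tilde = np.tanh(affine("c"))      # temporary cell state, values in (-1, 1)
    c_t = i_t * c_tilde + f_t * c_prev  # new memory
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t


def init_params(n_features, n_hidden, seed=0):
    """Hypothetical small random initialization (illustration only)."""
    rng = np.random.default_rng(seed)
    return {
        name: (
            rng.normal(scale=0.1, size=(n_hidden, n_features)),
            rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
            np.zeros(n_hidden),
        )
        for name in ("i", "f", "o", "c")
    }


def last_hidden_state(sequence, params, n_hidden):
    """Run the unit over a (n_dates, n_features) sequence of any length."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
    return h


# Toy sequence: 3 dates with 10 features each.
params = init_params(n_features=10, n_hidden=8)
sequence = np.random.default_rng(1).normal(size=(3, 10))
h_last = last_hidden_state(sequence, params, n_hidden=8)
```

Because the same parameters are reused at every step, the loop accepts sequences of any length, and the vector returned at the end is the last hidden state that the following sections connect to the SoftMax layer or reuse as a feature vector.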
2.2 LSTM-Based Time Series Classification
The LSTM neuron learns an internal representation of the input sequences (in our case, object or pixel time series), but it does not make any prediction by itself. To perform the classification task, we stack a SoftMax layer on top of the LSTM neuron to produce the final multi-class prediction. The SoftMax layer has as many neurons as there are classes to predict. We choose SoftMax instead of a Sigmoid function because the output of the SoftMax layer can be seen as a probability distribution over the classes that sums to 1, whereas each Sigmoid neuron independently outputs a value between 0 and 1. This is because the SoftMax values are normalized across the layer, while no normalization is performed in a Sigmoid layer. Since each of our samples belongs to exactly one class, SoftMax is the natural choice for our multi-class setting. From an architectural point of view, the last hidden state vector produced by the LSTM unit is fully connected to the SoftMax neurons.
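The difference between the two output layers can be checked on a small example. The sketch below uses only the Python standard library; the three scores are made-up values standing in for the output of the fully connected layer, not results from the paper.

```python
import math


def softmax(scores):
    """Normalize scores across the layer so that they sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Squash each score independently into (0, 1), with no normalization."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


scores = [2.0, 1.0, 0.1]   # hypothetical pre-activation values for 3 classes
probs = softmax(scores)    # a distribution over the classes
indep = sigmoid(scores)    # three unrelated values in (0, 1)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

With SoftMax the three outputs compete for a total mass of 1, which suits samples that belong to exactly one class; the Sigmoid outputs here add up to more than 1 and cannot be read as a single distribution.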
2.3 Representation Learning with LSTM for time series data
Standard deep learning approaches can also be seen as a way to produce a new, more discriminative representation of the original data. Another way to assess the quality of the LSTM unit for our land cover classification task is therefore to use the features learned by the LSTM layer to feed a standard classifier. In more detail, we propose to employ the last hidden state vector produced by the LSTM unit as the new data representation and, subsequently, to train standard machine learning classifiers on this new set of features.
To demonstrate the generality of our proposal, we tested it on two different remote sensing datasets. The first is a collection of spatial objects described by a set of regional statistics extracted from very high spatial resolution (VHSR) imagery, but with limited temporal depth. The second is a pixel-based dataset, noisier but richer in both spectral and temporal resolution. Detailed descriptions are provided in the following subsections.
The first dataset has been generated using a time series of Pléiades VHSR images (2 m) acquired in the context of the Airbus DS/Spot Image distribution (July and September 2012, March 2013, CNES). The study site is the THAU Basin, located in the South of France, close to Montpellier. It covers an area of 42 000 ha, 70% of which is land. The north is mainly composed of agricultural fields (e.g. vineyards) and natural spaces, while the south is dominated by urban and industrial zones. For each date, two orthorectified, atmospherically corrected scenes are mosaicked.
Using the multi-temporal stack, a segmentation was performed to extract a consistent multi-temporal object layer, using the Multiresolution Segmentation technique available in the eCognition Developer software. Each object was then described by the mean and standard deviation of the four native bands (blue, green, red and near-infrared) and of the NDVI. A total of 10 features are thus computed per object and per date (5 means and 5 standard deviations).
The resulting segments were subsequently filtered and labeled into 11 different classes by visual inspection. A total of 15 196 objects were retained. The set of classes with their cardinality is reported in Table ?.
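As an illustration of how the 10 per-date features could be derived for one segmented object, the sketch below computes the NDVI with its usual definition, (NIR - Red) / (NIR + Red), and then takes the mean and standard deviation of the five channels. The function names, the band ordering (blue, green, red, near-infrared), the use of the population standard deviation and the random reflectance values are assumptions for the example; the source does not specify these details.

```python
import numpy as np


def object_features(pixels):
    """Features of one object at one date.

    pixels: array of shape (n_pixels, 4), columns assumed to be
    blue, green, red, near-infrared. Returns 5 means then 5 standard deviations.
    """
    pixels = np.asarray(pixels, dtype=float)
    red, nir = pixels[:, 2], pixels[:, 3]
    denom = nir + red
    # NDVI per pixel; pixels with a zero denominator are set to 0.
    ndvi = np.divide(nir - red, denom, out=np.zeros_like(denom), where=denom != 0)
    channels = np.column_stack([pixels, ndvi])  # shape (n_pixels, 5)
    return np.concatenate([channels.mean(axis=0), channels.std(axis=0)])


def object_time_series(pixels_per_date):
    """Stack the per-date feature vectors into a (n_dates, 10) sequence."""
    return np.stack([object_features(p) for p in pixels_per_date])


# Toy object: 50 pixels observed at 3 dates with random reflectances.
rng = np.random.default_rng(0)
dates = [rng.uniform(0.05, 0.6, size=(50, 4)) for _ in range(3)]
series = object_time_series(dates)
```

Each object then yields a short sequence with one 10-dimensional vector per date, which is the kind of input a recurrent model consumes step by step.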
|ID||Land Cover Class||N. of Objects|
|(2)||Forests and woods||2 445|
|(7)||Sclerophyll vegetation||2 457|
3.2 REUNION ISLAND dataset
The second dataset has been generated from an annual time series of 23 Landsat 8 images acquired in 2014 above the Reunion Island (2866 × 2633 pixels at 30 m spatial resolution), provided at level 2A.
Reference land cover data have been built using two publicly available datasets, namely the 2012 Corine Land Cover (CLC) map and the 2014 farmers’ graphical land parcel registration (Registre Parcellaire Graphique, RPG). The most significant classes for the study area have been retained, and spatial processing (aided by photo-interpretation) has also been performed to ensure consistency with image geometry. Finally, a pixel-based random sampling of this dataset has been applied to provide an almost balanced ground truth. The final reference dataset consists of a total of 37 900 pixels distributed over 9 classes as reported in Table 1.
|ID||Land Cover Class||N. of Pixels|
|(1)||Urban areas||10 000|
|(2)||Other built-up surfaces||1 500|
|(4)||Sparse Vegetation||5 095|
|(5)||Rocks and bare soil||3 729|
|(7)||Sugarcane crops||2 832|
|(8)||Other crops||1 500|
In this section we report the experimental settings and we discuss the results we obtained on the two SITS datasets we presented in Section 3.
We compare the LSTM-based Time Series Classification model to standard machine learning approaches commonly employed for land cover classification from multi-temporal spatial data. We also assess the value of the representation learned by the proposed model, following the idea described in Section 2.3.
For this purpose, we use Random Forest (RF) and Support Vector Machine (SVM) as standard classification strategies. For the RF model, we set the number of generated trees to 400 and allow a maximum tree depth of 10. For the SVM model, we use an RBF kernel with complexity parameter and gamma equal to 100 and 0.01 respectively. For Random Forest we use the Python implementation supplied by the Scikit-learn library, while for SVM we use the LibSVM implementation. The same RF and SVM settings are used for both the original data and the new representation learned by the LSTM-Based Time Series Classification model. For the latter, we set the number of hidden dimensions equal to 512, an initial learning rate equal to and a decay of . We implement the model with the Keras Python library, with Theano as back end. As optimization method we use the RMSprop strategy, which is commonly employed to train LSTM units. The model is trained for 200 epochs with a batch size of 20. We name RF(LSTM) (resp. SVM(LSTM)) the Random Forest (resp. SVM) learned over the new feature space induced by our RNN model. In more detail, each training and test instance is transformed into a 512-dimensional feature vector (the dimension of the hidden state of the LSTM neuron) and, subsequently, the classifiers are learned from this new representation instead of the original data.
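The baseline settings quoted above map directly onto scikit-learn estimators, as sketched below. Two points are assumptions rather than the authors' setup: the SVM is instantiated through scikit-learn's `SVC` (which is built on LIBSVM) instead of calling LibSVM directly, and the helper names, the random seed and the synthetic data (60 samples, 3 dates, 10 features, 3 classes) are invented for the example. The per-date features are flattened into one vector per sample, which is how a classifier without a notion of time sees the stacked series.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


def make_baselines(seed=0):
    """Baselines with the hyperparameters reported in the text."""
    rf = RandomForestClassifier(n_estimators=400, max_depth=10, random_state=seed)
    svm = SVC(kernel="rbf", C=100, gamma=0.01)  # C is the complexity parameter
    return {"RF": rf, "SVM": svm}


def stack_dates(series):
    """(n_samples, n_dates, n_features) -> (n_samples, n_dates * n_features)."""
    series = np.asarray(series)
    return series.reshape(series.shape[0], -1)


# Synthetic stand-in for a labeled time series dataset.
rng = np.random.default_rng(0)
X_series = rng.normal(size=(60, 3, 10))
y = rng.integers(0, 3, size=60)

X = stack_dates(X_series)
models = make_baselines()
for name, model in models.items():
    model.fit(X, y)
```

To reproduce the RF(LSTM) and SVM(LSTM) variants, the same estimators would be fitted on the 512-dimensional hidden-state vectors instead of `X`.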
To validate the different methods, we perform a 5-fold cross validation. Due to the unbalanced nature of the two time series datasets, we assess classification performance not only with the Global Accuracy and Kappa measures, but also with the average and per-class F-Measure.
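The three measures can all be derived from a confusion matrix, as in the following NumPy sketch. The function names and the toy labels are illustrative; the unweighted mean of the per-class scores is one possible reading of "average F-Measure", since the text does not state how the average is taken.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are reference classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def accuracy(cm):
    """Global accuracy: fraction of correctly classified samples."""
    return np.trace(cm) / cm.sum()


def cohen_kappa(cm):
    """Agreement corrected for the agreement expected by chance."""
    n = cm.sum()
    p_observed = np.trace(cm) / n
    p_expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    return (p_observed - p_expected) / (1 - p_expected)


def per_class_f_measure(cm):
    """F-Measure per class: 2*TP / (predicted count + reference count)."""
    tp = np.diag(cm).astype(float)
    denom = cm.sum(axis=0) + cm.sum(axis=1)
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)


# Toy example with 3 classes and 8 samples.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
f_scores = per_class_f_measure(cm)
mean_f = f_scores.mean()  # unweighted (macro) average, an assumption
```

On unbalanced data the per-class scores expose weak classes that a high global accuracy can hide, which is why they are reported alongside Accuracy and Kappa.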
4.2 Results and Discussions
Tables ? and ? and Figures ? and ? summarize the results obtained on the two SITS datasets.
Considering the THAU dataset, Table ? reports the average values of Accuracy, F-Measure and Kappa for the different methods. We can observe that the LSTM-based classifier outperforms both the RF and SVM approaches on all three metrics. The largest gain is obtained on the average F-Measure, where the LSTM-based classifier scores 74.63% while the second best method (SVM(LSTM)) attains 73.31%. Interestingly, for the THAU dataset the classifiers trained on the features (representation) learned by the Recurrent Neural Network, RF(LSTM) and SVM(LSTM), perform better than the same classifiers trained on the original time series data when the F-Measure is considered. In terms of Accuracy, SVM and SVM(LSTM) behave comparably, while the RF model clearly benefits from the new data representation.
A more detailed assessment is provided in Figure ?, where the per-class F-Measure is reported. The first point we can highlight is that, for classes with few reference samples (namely (1), (4), (8), (9) and (10)), the LSTM network clearly outperforms standard approaches, which in some cases (classes (1) and (4)) completely miss the classes, while it obtains similar or slightly better results than RF and SVM for well-represented classes. The second point concerns the comparison between standard approaches trained on the original data and the same methods trained on the features learned by our proposal. We can see that the use of the new learned representation improves the performance of both RF and SVM. Again, this is particularly evident on critical classes like (1), (4), (8) (for SVM) and (10) (for RF).
Table ? summarizes the results on the REUNION ISLAND time series dataset. As in the previous case, the LSTM-based classifier performs better than the standard machine learning methods in terms of Accuracy and F-Measure; in agreement with the previous results, the classifiers trained on the new features obtained by the proposed network also outperform their counterparts trained on the original feature space. Unlike in the previous experiment, here the highest F-Measure is reached by SVM(LSTM), but the LSTM-based classifier still achieves competitive results. Figure ? reports the per-class F-Measure on the REUNION ISLAND dataset. In this case too, the LSTM-based approach works well for poorly represented and difficult classes (e.g. (8), “Other Crops”) and remains competitive on all the other classes.
As expected, the joint optimization of feature representation and classification, common to all deep learning approaches, provides better support here to discriminate among the different classes. In addition, all these results indicate that the LSTM model is well suited to capture both long- and short-term temporal dependencies, as opposed to common classification approaches where all the information is managed at the same level, ignoring temporal correlations. This is particularly evident on poorly represented and highly mixed classes: Tree Crops, Summer crops and Truck Farming (resp. Other Crops) for the THAU (resp. REUNION ISLAND) dataset. All these classes are related to agricultural activities whose temporal patterns vary strongly due to the heterogeneity of practices, which makes the corresponding classes detectable only by considering short portions of the time series in which crop conditions are comparable.
We recall that our proposal uses only one LSTM layer, whereas more layers can be stacked together to build more complex architectures. This is beyond our scope and we leave it for future research, since our main objective is to highlight the quality and suitability of RNN methods to manage and analyze SITS data.
In this letter we assess the benefit of using a Recurrent Neural Network (LSTM) to perform land cover classification on multi-temporal spatial data. We have validated the proposed model on two different SITS-based datasets, showing that the proposed framework deals efficiently with both pixel-based and object-based classification.
The proposed framework proved competitive with, and in several cases superior to, classical approaches, with the remarkable advantage of improving the quality of the predictions on “weak” classes from unbalanced datasets. We also highlight that the proposed LSTM-based classification model can be used as a feature extractor to learn a new data representation that positively affects the performance of standard classification approaches on SITS data.
The authors also acknowledge the National Research Agency in the framework of the program “Investissements d’Avenir” for the GEOSUD project (ANR-10-EQPX-20) for the distribution of the Pléiades satellite images.
- The source data are provided by the French Pôle Thématique Surfaces Continentales THEIA (www.theia-land.fr) and preprocessed by the Multi-sensor Atmospheric Correction and Cloud Screening (MACCS) level 2A processor  developed at the French National Space Agency (CNES) to provide accurate atmospheric, environmental and geometric corrections as well as precise cloud masks.
- Comparative analysis of MODIS time-series classification using support vector machines and methods based upon distance and similarity measures in the Brazilian Cerrado-Caatinga boundary.
N. A. Abade, O. Abílio de Carvalho Júnior, R. Fontes Guimarães, and S. N. de Oliveira. Remote Sensing, 7(9):12160–12191, 2015.
- Multiresolution segmentation: an optimization approach for high quality multi-scale image segmentation.
M. Baatz and A. Schäpe. Angewandte Geographische Informationsverarbeitung XII, 58:12–23, 2000.
- Representation learning: A review and new perspectives.
Y. Bengio, A. C. Courville, and P. Vincent. IEEE TPAMI, 35(8):1798–1828, 2013.
- LIBSVM: A library for support vector machines.
C.-C. Chang and C.-J. Lin. ACM TIST, 2:27:1–27:27, 2011.
- Keras.
F. Chollet. https://github.com/fchollet/keras, 2015.
- RMSProp and equilibrated adaptive learning rates for non-convex optimization.
Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. CoRR, abs/1502.04390, 2015.
- Analysis of multitemporal classification techniques for forecasting image time series.
R. Flamary, M. Fauvel, M. Dalla Mura, and S. Valero. IEEE Geosci. Remote Sensing Lett., 12(5):953–957, 2015.
- Learning to forget: Continual prediction with LSTM.
F. A. Gers, J. Schmidhuber, and F. A. Cummins. Neural Comp., 12(10):2451–2471, 2000.
- Speech recognition with deep recurrent neural networks.
A. Graves, A.-r. Mohamed, and G. E. Hinton. In ICASSP, pages 6645–6649, 2013.
- LSTM: A search space odyssey.
K. Greff, R. Kumar Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. CoRR, abs/1503.04069, 2015.
- A multi-temporal and multi-spectral method to estimate aerosol optical thickness over land, for the atmospheric correction of Formosat-2, Landsat, VENμS and Sentinel-2 images.
O. Hagolle, M. Huc, D. Villa Pascual, and G. Dedieu. Remote Sensing, 7(3):2668–2691, 2015.
- Classification and monitoring of reed belts using dual-polarimetric TerraSAR-X time series.
I. Heine, T. Jagdhuber, and S. Itzerott. Remote Sensing, 8(7), 2016.
- Monitoring land-cover changes: A machine-learning perspective.
A. Karpatne, Z. Jiang, R. R. Vatsavai, S. Shekhar, and V. Kumar. IEEE Geoscience and Remote Sensing Magazine, 4:8–21, 2016.
- Assessing the ability of LSTMs to learn syntax-sensitive dependencies.
T. Linzen, E. Dupoux, and Y. Goldberg. TACL, 4:521–535, 2016.
- Learning a transferable change rule from a recurrent neural network for land cover change detection.
H. Lyu, H. Lu, and L. Mou. Remote Sensing, 8(6), 2016.
- Scikit-learn: Machine learning in Python.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Simultaneous multichannel signal transfers via chaos in a recurrent neural network.
K. Soma, R. Mori, R. Sato, N. Furumai, and S. Nara. Neural Computation, 27(5):1083–1101, 2015.
- Conditional image generation with PixelCNN decoders.
A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves. In NIPS, pages 4790–4798, 2016.
- Deep learning for remote sensing data: A technical tutorial on the state of the art.
L. Zhang, L. Zhang, and B. Du. IEEE Geoscience and Remote Sensing Magazine, 4:22–40, 2016.