STConvS2S: Spatiotemporal Convolutional Sequence to Sequence Network for Weather Forecasting


Applying machine learning models to meteorological data brings many opportunities to the Geosciences field, such as predicting future weather conditions more accurately. In recent years, modeling meteorological data with deep neural networks has become a relevant area of investigation. These works apply either recurrent neural networks (RNNs) or some hybrid approach mixing RNNs and convolutional neural networks (CNNs). In this work, we propose STConvS2S (short for Spatiotemporal Convolutional Sequence to Sequence Network), a new deep learning architecture built for learning both spatial and temporal data dependencies in weather data, using only convolutional layers. Computational experiments using observations of air temperature and rainfall show that our architecture captures spatiotemporal context and outperforms baseline models and the state-of-the-art architecture for the weather forecasting task.


Spatiotemporal data analysis Sequence-to-Sequence models Convolutional Neural Networks Weather Forecasting

1 Introduction

Weather forecasting plays an essential role in resource planning in cases of severe natural phenomena such as heat waves (extreme temperatures), droughts, and hurricanes. It also influences decision making in agriculture, aviation, retail market, and other sectors, since unfavorable weather negatively impacts corporate revenues (Štulec et al., 2019). Over the years, with technological development, predictions of meteorological variables are becoming more accurate. However, due to the stochastic behavior of the Earth system, which is governed by physical laws, traditional forecasting requires complex, physics-based models to predict the weather (Karpatne et al., 2018).

In recent years, a large volume of data about the Earth system has become available. The remote sensing data collected by satellites provide meteorological data from the entire globe at specific time intervals (e.g., 6h or daily) and with a regular spatial resolution (e.g., 1km or 5km). The availability of historical data fosters researchers to design deep learning models that can make more accurate predictions about the weather (Reichstein et al., 2019).

Even though meteorological data exhibit both spatial and temporal structures, weather forecasting can be modeled as a sequence problem. In sequence models, an input sequence is encoded to map the representation of the sequence output, which may have a different length than the input. In Shi et al. (2015), the authors proposed the ConvLSTM architecture to solve the sequence prediction problem using the radar echo dataset. They combine a convolutional neural network (CNN) and a recurrent neural network (RNN) to simultaneously learn the spatial and temporal context of input data to predict the future sequence.

Although the ConvLSTM architecture has achieved the state-of-the-art result for rainfall forecasting on a spatiotemporal dataset and is now considered a reference approach to geoscience data prediction (Reichstein et al., 2019), new opportunities have emerged from recent advances in deep learning for sequence modeling adopting 1D CNNs (Gehring et al., 2017) and spatiotemporal representation using 3D CNNs with kernel decomposition (Tran et al., 2018). However, a limitation of CNN models when applied to forecasting tasks is the lack of a causal constraint, which allows future information to leak into temporal reasoning (Singh and Cuzzolin, 2019). Another limitation when using convolutional layers in sequence modeling tasks is that the length of the output sequence must be the same size as or shorter than the input sequence (Bai et al., 2018).

To tackle these limitations, we introduce STConvS2S (short for Spatiotemporal Convolutional Sequence to Sequence Network), a spatiotemporal predictive model for weather forecasting. STConvS2S combines the encoder-decoder architecture (Gehring et al., 2017) and the decomposition of convolution operation (Tran et al., 2018) to exploit spatial and temporal features in meteorological data. The main contributions of this work are as follows:

  • We introduce an architecture for sequence modeling using only 3D convolutional layers. Our model uses an encoder-decoder network, where the encoder applies spatial convolutions, followed by the decoder network, which learns temporal features from data using temporal convolutions.

  • We add a causal convolution in some 3D convolutional layers of the decoder to ensure that no future values are used to capture temporal information of the current state in the sequence. This is a key constraint in spatiotemporal data forecasting.

  • We also add a transposed convolutional layer and use it to generate an output sequence whose length may be longer than the length of the input sequence. Thus, we remove this limitation of CNN models in sequence modeling tasks.

  • We evaluate our approach using air temperature and rainfall observations from the CFSR (Saha et al., 2014) and CHIRPS (Funk et al., 2015) datasets, respectively. Experiments cover the South American region and our results outperform the state-of-the-art model for weather forecasting with lower error and training time. In particular, STConvS2S is 20% better than the state-of-the-art model in 5-step forecasting, and 6% better in 15-step forecasting, using the CFSR dataset.

The rest of this paper is organized into six sections. Section 2 presents an overview of the main concepts related to convolutional layers and sequence modeling. Section 3 formally describes the spatiotemporal data forecasting problem. Section 4 describes our proposed deep learning architecture. Section 5 discusses works related both to weather forecasting and spatiotemporal architectures. Section 6 presents our experiments and results. Section 7 provides the conclusions of the paper.

2 Background

2.1 Convolutional Neural Networks

Convolutional neural networks (CNN) are an efficient method for capturing spatial context and have attained state-of-the-art results for image classification using a 2D kernel (Krizhevsky et al., 2012). In recent years, researchers have extended the use of CNNs to natural language processing, such as machine translation (Gehring et al., 2017). This novel architecture is built on a CNN with a 1D kernel, useful for capturing temporal patterns in a sequence of words to be translated. A CNN with a 3D kernel is used to predict the future in visual representations, as in action recognition (Tran et al., 2018). In this domain, the CNN performs 3D convolution operations over both the time and space dimensions of the video.

CNNs were studied in detail in LeCun and Bengio (1995) for image, speech, and time series tasks, where the architecture was designed to process data with grid-like topology. Inspired by the visual cortex, the artificial neurons in this model use the convolution operation to scan the input data and extract features located in a small local neighborhood, called the receptive field. The neighborhood coverage (receptive field) is defined by the kernel size, and the stride parameter defines the position at which the convolution operation must begin for each element. In the end, the output of each neuron after the convolution forms the feature map. For the feature map to preserve the dimensions of the input data, a padding technique can be applied. This technique surrounds each slice of the input volume with cells containing zeros.
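As a small illustration of the arithmetic behind these hyperparameters, the sketch below computes the length of one output axis from the input size, kernel size, stride, and padding; the helper name is ours, not from the paper:

```python
def conv_output_size(i, k, s=1, p=0):
    """Size of one output axis of a convolution:
    floor((i + 2p - k) / s) + 1, for input size i, kernel size k,
    stride s and padding p."""
    return (i + 2 * p - k) // s + 1

# A 3-wide kernel with stride 1 shrinks a 32-cell axis to 30 cells,
# while padding each border with one zero ("same" padding) preserves it.
print(conv_output_size(32, 3))       # 30
print(conv_output_size(32, 3, p=1))  # 32
```

The same relation applies per axis in 2D and 3D convolutions.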

2.2 Causal convolutions

When a deep learning model satisfies the causal constraint, it means that the model ensures that, at step $t$, no future information from step $t+1$ onward is used by the learning process. The domain of the sequence modeling task determines the usage of this constraint. For example, in text summarization, the correct interpretation of the current word may depend on words from previous and next steps due to language dependencies (Goodfellow et al., 2016). Therefore, in this domain it is not necessary to follow the causal constraint. On the other hand, for forecasting tasks, the model must be causal, otherwise it may exploit information from a future time step to learn the current representation, which makes it an unrealistic model.

To incorporate the ability to respect the causal constraint in the temporal learning of a 1D CNN, causal convolutions can be used (van den Oord et al., 2016). This technique can be implemented as follows: pad the input by $k-1$ elements, where $k$ is the kernel size, and then remove $k-1$ elements from the end of the feature map. Figure 1 shows the causal convolution operation in detail.

Figure 1: Causal convolution operation in a 1D convolutional layer with kernel size $k$. The input is padded by $k-1$ elements on both sides, and $k-1$ elements are removed from the end of the convolutional layer output (feature map).
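A minimal NumPy sketch of this pad-then-trim recipe (stride 1, zero padding; our own helper, not the paper's code) makes the causality visible — output position t depends only on inputs up to t:

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1D convolution: pad by k-1 on both sides, slide the kernel,
    then drop the last k-1 outputs so step t never sees steps after t."""
    k = len(w)
    padded = np.pad(x, (k - 1, k - 1))
    full = np.array([np.dot(padded[i:i + k], w)
                     for i in range(len(x) + k - 1)])
    return full[:len(x)]  # trim the k-1 future-looking outputs

x = np.array([1.0, 2.0, 3.0, 4.0])
print(causal_conv1d(x, np.array([0.0, 0.0, 1.0])))  # [1. 2. 3. 4.] identity
print(causal_conv1d(x, np.array([1.0, 0.0, 0.0])))  # [0. 0. 1. 2.] 2-step delay
```

The kernel [1, 0, 0] weights only the input two steps in the past, so the output is a delayed copy of the input: no future value ever contributes.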

2.3 Transposed convolutional layer

The transposed convolutional layer is widely used for semantic segmentation (Noh et al., 2015) and object detection (Racah et al., 2017) to reconstruct the shape of the input image after applying convolutional layers. The convolution operation can be interpreted as a many-to-one relationship, i.e., it associates multiple elements of the input feature map with a single element of the generated feature map. In contrast, in a transposed convolutional layer, the transposed convolution operation forms a one-to-many relationship, thus it can be used to generate an upsampled output (Noh et al., 2015), as shown in Figure 2. Formally, the size of the output can be defined as $o = s(i - 1) + k - 2p$, for input size $i$, stride $s$, kernel size $k$ and padding $p$ (Dumoulin and Visin, 2016). See Section 2.1 for stride, kernel size and padding details.

Figure 2: Example of a 1D transposed convolution operation where the input feature map is upsampled to a larger output feature map. The stride, kernel size, and padding of this 1D transposed convolutional layer are shown in the figure.
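The output-size relation above can be checked with a small helper (ours, for illustration only):

```python
def transposed_conv_output_size(i, s, k, p=0):
    """o = s*(i - 1) + k - 2p for input size i, stride s,
    kernel size k and padding p (Dumoulin and Visin, 2016)."""
    return s * (i - 1) + k - 2 * p

# Upsampling: 3 inputs with stride 2 and kernel 3 yield 7 outputs.
print(transposed_conv_output_size(3, s=2, k=3))  # 7
```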

2.4 Sequence modeling

Sequence modeling (or sequence-to-sequence learning) can be defined as a way of generating a model that maps an input sequence $X = (x_1, \dots, x_n)$ of $n$ elements to an output sequence $Y = (y_1, \dots, y_m)$, where the sizes $n$ and $m$ of the sequences may be different. A sequence modeling architecture is a two-phase architecture in which an encoder reads the input and generates a numerical representation of it, while a decoder writes the output sequence after processing the encoder output. The encoder-decoder architecture was first proposed by Sutskever et al. (2014) for machine translation tasks using long short-term memory (LSTM), a type of recurrent neural network (RNN).

LSTM has a chain-like structure, where the output of one step is passed to the next step and so on, which makes it follow the causal constraint and be suitable for sequential processing. A drawback of this dependency on the previous step is that LSTM does not allow parallel computation, leading to a slow training phase. Gehring et al. (2017) propose a new encoder-decoder architecture using only 1D CNNs. The architecture, designed with causal convolutions in the decoder, is able to capture temporal dependencies in sequences successfully and, compared to LSTM models, its computations can be completely parallelized during training.

3 Problem Statement

Spatiotemporal data forecasting can be modeled as a sequence-to-sequence problem. Thus, the observations of spatiotemporal data (e.g., meteorological variables) measured in a specific geographic region over a period of time serve as the input sequence to the forecasting task. More formally, we define a spatiotemporal dataset as $\mathcal{D} = \{X^{(1)}, X^{(2)}, \dots, X^{(m)}\}$ with $m$ samples. Each training example is a tensor $X^{(i)} \in \mathbb{R}^{C \times T \times H \times W}$, that is, a sequence of observations containing $T$ historical measurements. Each observation $X^{(i)}_t$, for $1 \le t \le T$ (i.e., $T$ is the length of the input sequence), consists of a grid map that determines the spatial location of the measurements, where $H$ and $W$ represent the size of latitude and longitude, respectively. In the observations, $C$ represents how many meteorological variables (e.g., temperature, humidity) are used simultaneously in the model. This structure is analogous to 2D images, where $C$ would indicate the number of color components (RGB or grayscale).
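To make the tensor layout concrete, here is a hedged NumPy sketch with illustrative sizes (C = 1 variable, T = 5 time steps, a 32 × 32 grid, matching the temperature subset described later; the axis order is one reasonable convention, not necessarily the one used in the paper's implementation):

```python
import numpy as np

C, T, H, W = 1, 5, 32, 32      # variables, time steps, latitude, longitude
x = np.zeros((C, T, H, W))     # one training example X^(i)
obs_t = x[:, 0]                # the grid map observed at one time step
assert obs_t.shape == (C, H, W)
print(x.shape)  # (1, 5, 32, 32)
```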

Modeled as a sequence-to-sequence problem in Equation 1, the goal of spatiotemporal data forecasting is to apply a function $f$ that maps an input sequence of $T$ past observations, satisfying the causal constraint at each time step $t$, in order to predict a target sequence $\hat{X}_{T+1}, \dots, \hat{X}_{T+T'}$, where the length $T'$ of the output sequence may differ from the length $T$ of the input sequence.

$$\hat{X}_{T+1}, \dots, \hat{X}_{T+T'} = f(X_{1}, X_{2}, \dots, X_{T}) \tag{1}$$
4 STConvS2S architecture

In this section, we describe our proposed architecture, called Spatiotemporal Convolutional Sequence to Sequence Network (STConvS2S). STConvS2S is a deep learning architecture designed for short-term weather forecasting, as illustrated in Figure 3. We use an encoder-decoder architecture, typically used to model sequence tasks. However, in our model, the 1D convolutional layers used for time series are replaced by 3D ones. This is a crucial feature of our model, since it enables the learning of patterns in data with a spatiotemporal structure, which is typical in geoscience data.

Figure 3: An illustration of the STConvS2S architecture, which comprises two components: the encoder and decoder networks. The encoder learns a spatial representation of the input sequence using a spatial kernel. This representation is used as input to the decoder network, which uses a temporal kernel to learn temporal features and make predictions. Causal convolutions are used in the convolutional layers of the decoder.

Moreover, instead of adopting a conventional kernel for 3D convolutional layers, we use a factorized 3D kernel adapted from the R(2+1)D network, proposed in Tran et al. (2018). In their work, the factorized kernel splits the convolution operation of one layer into two separate and successive operations, a 2D spatial convolution and a 1D temporal convolution. In our new architecture, we take a different approach: the operations are not successive inside each convolutional layer. We configure the encoder to learn spatial dependencies by applying a spatial kernel (2D spatial convolution) and the decoder to encapsulate temporal dependencies using a temporal kernel (1D temporal convolution). Figure 4 schematically illustrates the difference between both approaches.

Figure 4: Comparison of factorized 3D kernel usage. (a) Proposed in Tran et al. (2018) as successive operations in a unique block called (2+1)D. The spatial kernel is defined as $1 \times d \times d$, where $d$ is the size of the spatial kernel, and the temporal kernel as $t \times 1 \times 1$, where $t$ is the size of the temporal kernel. (b) Our proposal for the factorized 3D kernel. The encoder only performs convolutions using the spatial kernel. The decoder uses the encoder output as input and applies the temporal kernel in its convolutions.

STConvS2S is a stack of 3D convolutional layers. Each layer receives a 4D tensor with dimensions $F \times T \times H \times W$ as input, where $F$ is the number of filters used in the previous layer, $T$ is the sequence length (time dimension), and $H$ and $W$ represent the size of the spatial coverage for latitude and longitude, respectively. In detail, the encoder is formed by convolutional blocks with batch normalization and a rectified linear unit (ReLU) as nonlinearity. The decoder is similar to the encoder, except that a causal convolution (Section 2.2) is used in its first layers to ensure that only previous observations are considered in the forecast, which is an essential constraint for weather forecasting.

Kernel decomposition allows us to analyze the spatial and temporal contexts separately. Thus, in the encoder layers, feature maps must have a fixed length in the spatial dimensions $H \times W$, which means the size of the feature maps must match the input size in these dimensions. Otherwise, for some time series, the temporal correlation would not be learned by the decoder due to compression in the spatial dimension. To ensure this fixed length, the input for the encoder is padded by $(k_s - 1)/2$ on each spatial border, where $k_s$ is the size of the spatial kernel. For the decoder, we pad the input by $k_t - 1$ along the time dimension, because of the causal convolution, where $k_t$ is the size of the temporal kernel.
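The two padding rules can be written down directly (helper names are ours; an odd spatial kernel and stride 1 are assumed):

```python
def spatial_padding(ks):
    """Encoder: pad each spatial border by (ks - 1) / 2 so H and W are kept
    (odd spatial kernel assumed)."""
    return (ks - 1) // 2

def temporal_padding(kt):
    """Decoder: pad the time axis by kt - 1 for the causal convolution."""
    return kt - 1

# With a 3-wide spatial kernel and stride 1, a 32-cell axis stays 32 cells:
H, ks = 32, 3
assert H + 2 * spatial_padding(ks) - ks + 1 == H
print(spatial_padding(3), temporal_padding(3))  # 1 2
```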

Besides adopting causal convolution in 3D convolutional layers, another contribution of our work is the possibility of generating an output sequence whose length differs from the length of the input sequence. When CNNs are used for sequence-to-sequence learning, such as forecasting tasks, the length of the output sequence must be the same size as or shorter than the input sequence (Gehring et al., 2017; Bai et al., 2018). This is not only a limitation of CNN architectures but also of ConvLSTM ones (Shi et al., 2015; Kim et al., 2019). In Shi et al. (2015), all the sequences are 20 frames long, split into 5 frames for the input and 15 for the prediction. Kim et al. (2019) define an input sequence of 5 time steps and predict the next 5 time steps.

To tackle this limitation, we add a 3D transposed convolutional layer (Section 2.3) before the last convolutional layer and use it to generate an output sequence whose length may be longer than the length of the input sequence. This implementation is tested in the task where we use the previous 5 grids as the input sequence to predict the next 15 grids.
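Using the transposed-convolution size formula from Section 2.3, one hyperparameter choice that stretches 5 input steps to 15 output steps along the time axis is stride 3 with kernel size 3 and no padding; these particular values are illustrative assumptions, since the paper does not state them here:

```python
def transposed_out(i, s, k, p=0):
    # o = s*(i - 1) + k - 2p (Dumoulin and Visin, 2016)
    return s * (i - 1) + k - 2 * p

# Five time steps upsampled to fifteen under the assumed hyperparameters.
print(transposed_out(5, s=3, k=3, p=0))  # 15
```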

5 Related work

Statistical methods and machine learning techniques use historical data of temperature, precipitation, and other variables to predict weather conditions. Auto-regressive integrated moving average (ARIMA) models are traditional statistical methods for time series analysis (Babu and Reddy, 2012). Studies also apply artificial neural networks (ANN) to time series prediction in weather data, such as temperature measurements (Corchado and Fyfe, 1999; Baboo and Shereef, 2010; Mehdizadeh, 2018). Recently, some authors have been developing new approaches based on deep learning to improve time series forecasting results, in particular using LSTM networks. Traffic flow analysis (Yang et al., 2019), displacement prediction of landslides (Xu and Niu, 2018), petroleum production (Sagheer and Kotb, 2019), and sea surface temperature forecasting (Zhang et al., 2017) are some applications that successfully use LSTM architectures. In Zaytar and Amrani (2016), the authors build a model with stacked LSTM layers to map sequences of weather values (temperature, humidity, and wind speed) of the same length for 9 cities in Morocco and show that their results are competitive with traditional methods. However, these time series approaches are unable to capture the spatial dependencies in the observations.

Spatiotemporal deep learning models deal with spatial and temporal contexts simultaneously. In Shi et al. (2015), the authors formulate weather forecasting as a sequence-to-sequence problem, where the input and output are 2D radar map sequences. Besides, they introduce the convolutional LSTM (ConvLSTM) architecture to build an end-to-end trainable model for precipitation nowcasting. The proposed model includes the convolution operation into LSTM network to capture spatial patterns. Kim et al. (2019) also define their problem as a sequence task and adopt ConvLSTM for extreme climate event forecasting. Their model uses hurricane density map sequences as spatiotemporal data. The work proposed in Souto et al. (2018) implements a spatiotemporal aware ensemble approach adopting ConvLSTM architecture. The authors combine different meteorological models as channels in the convolutional layer to predict the next expected rainfall values for each location. Although related to the use of deep learning for climate/weather data, our model adopts only CNN rather than a hybrid approach that combines CNN and LSTM.

Some studies have applied spatiotemporal convolutions (Yuan et al., 2018; Tran et al., 2018) for video analysis and action recognition. In Tran et al. (2018), the authors compare several spatiotemporal architectures using only 3D CNN and show that factorizing the 3D convolutional kernel into separate spatial and temporal components produces gains in accuracy. Their architecture focuses on layer factorization, i.e., factorizing each convolution into a block of a spatial convolution and a temporal convolution. Moreover, in comparison to the full 3D convolution, they indicate advantages: an increase in the complexity of the functions that can be represented, and a facility in the optimization of spatial or temporal components. Inspired by Tran et al. (2018), we also adopt a factorized 3D CNN, but with a different implementation. Figure 4 highlights this difference.

A limitation of both the 3D CNN and the factorized 3D CNN (Tran et al., 2018) is the lack of a causal constraint, allowing future information in temporal learning. Singh and Cuzzolin (2019) also factorize the 3D convolution using the same spatial convolution as Tran et al. (2018), but propose a recurrent convolution unit based on the RNN approach to address the causal constraint in temporal learning for the action recognition task. In contrast, we use an entirely CNN-based approach, adopting a causal convolution to tackle this limitation.

Following the success of 2D CNN in capturing spatial correlation in images, Xu et al. (2019) propose a model to predict vehicle pollution emissions using 2D CNN to capture temporal and spatial correlation separately. However, unlike our work, they also do not satisfy the causal constraint when adopting 2D CNN in temporal learning. Racah et al. (2017) use a 3D CNN in an encoder-decoder architecture, where they concatenate time axis as the third dimension of the input for extreme climate event detection. Their encoder and decoder use convolutional and deconvolutional (transposed convolutional) layers, respectively, to learn the spatiotemporal representation simultaneously in each layer. Our approach is similar to Racah et al. (2017) in using encoder-decoder architecture based on CNN, but we adopt a factorized 3D CNN instead of a 3D CNN and specialize our encoder to learn only spatial context and the decoder, temporal context.

Other deep learning approaches devised to explore spatiotemporal patterns differ in the grid-structured data we use as input. Wang and Song (2018) present an ensemble approach for air quality forecasting combining statistical hypothesis and deep learning. They explore spatial correlation by applying Granger causality between two time series and, for temporal learning, use LSTM networks. Yu et al. (2018) and Li and Moura (2019) use graph-structured data as input and propose a deep learning network to tackle a sequence-to-sequence problem using spatiotemporal data. Yu et al. (2018) build the architecture for traffic forecasting using convolutional structures composed with two temporal layers that are 1D CNN with a causal convolution and one spatial layer in between used to extract spatial features in graphs. Li and Moura (2019) adopt an encoder-decoder architecture based in Transformer model (Vaswani et al., 2017) for taxi ride-hailing prediction.

To sum up, our proposed STConvS2S architecture departs from the previous approaches, either in the manipulation of spatial and temporal dependencies or in the use of different deep learning layers to learn features from the data or in the adoption of a grid structure rather than a graph to model the input data.

6 Experiments

We perform experiments on two publicly available meteorological datasets containing air temperature and precipitation values to validate our proposed architecture. The deep learning experiments were conducted on a server with a single Nvidia GeForce GTX1080 GPU with 8GB memory. The baseline model was executed on 8 Intel i7 CPUs with 4 cores and 66GB RAM. We begin by explaining the datasets (Section 6.1) and evaluation metrics (Section 6.2). After that, we describe the results and a corresponding analysis (Section 6.3).

6.1 Datasets

The CFSR is a reanalysis product that contains high-resolution global land and ocean data (Saha et al., 2014). The data have spatial coordinates (latitude and longitude), a spatial resolution of 0.5 degrees (i.e., the area covered by each grid cell), and a frequency of 6 hours for some meteorological variables, such as air temperature and wind speed.

In the experiments, we use a subset of CFSR with the air temperature observations from January 1979 to December 2015 covering the space in 8N-54S and 80W-25W, as shown in Figure 5 (a). As data preprocessing, we scale down the grid to 32 × 32 in the latitude and longitude dimensions to fit the data in GPU memory. The other dataset, CHIRPS, incorporates satellite imagery and in-situ station data to create gridded rainfall time series with daily frequency and a spatial resolution of 0.05 degrees (Funk et al., 2015). We use a subset with observations from January 1981 to March 2019 and also apply interpolation to reduce the grid size to 50 × 50. Figure 5 (b) illustrates the coverage space 10N-39S and 84W-35W adopted in our experiments.

(a) CFSR-temperature dataset
(b) CHIRPS-rainfall dataset
Figure 5: Spatial coverage of the datasets used in all experiments. (a) It shows the selected grid on January 1, 1979 with air temperature values. (b) It shows the selected grid of the sequence on March 31, 2019 with rainfall values.

Similar to Shi et al. (2015), we define the input sequence length as 5, which indicates that the previous 5 grids are used to predict the next grids. Thus, the data shapes used as input to the deep learning architectures are 5 × 32 × 32 × 1 for the CFSR dataset and 5 × 50 × 50 × 1 for the CHIRPS dataset, where 1 in both datasets indicates the single channel (in this aspect similar to a grayscale image), 5 is the size of the sequence considered in the forecasting task, and 32 and 50 represent the numbers of latitudes and longitudes used to build the spatial grid in each dataset.

From the temperature dataset, we create 54,041 grid sequences and from the rainfall dataset, 13,960 grid sequences. Finally, we divide both datasets into non-overlapping training, validation, and test set following 60%, 20%, and 20% ratio, in this order.
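A hedged sketch of the chronological 60/20/20 split (the exact per-split counts are not stated in the paper; this shows one straightforward way to compute them, with the remainder assigned to the test set):

```python
def chrono_split_sizes(n, train=0.6, val=0.2):
    """Non-overlapping train/val/test sizes for n samples in time order;
    the remainder after train and val goes to test."""
    n_train = int(n * train)
    n_val = int(n * val)
    return n_train, n_val, n - n_train - n_val

print(chrono_split_sizes(54041))  # temperature dataset grid sequences
print(chrono_split_sizes(13960))  # rainfall dataset grid sequences
```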

6.2 Evaluation metrics

In order to evaluate the proposed architecture, we compare our results against ARIMA models, a traditional statistical approach for time series forecasting, and the ConvLSTM architecture proposed in Shi et al. (2015), which is considered the state-of-the-art for spatiotemporal forecasting. To accomplish this, we use the two evaluation metrics presented in Equations 2 and 3.

RMSE is based on the MSE metric, which is the average of squared differences between real observations and predictions. The MSE square root gives the results in the original unit of the output, and is expressed at a specific grid point $(i, j)$ as:

$$\mathrm{RMSE}_{ij} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left(y_{ijt} - \hat{y}_{ijt}\right)^2} \tag{2}$$

where $N$ is the number of test samples, and $y_{ijt}$ and $\hat{y}_{ijt}$ are the real and predicted values at the location $(i, j)$ and at time $t$, respectively.

MAE is the average of absolute differences between real observations and predictions, which measures the magnitude of the errors in prediction. MAE also provides the result in the original unit of the output, and is expressed at a specific grid point $(i, j)$ as:

$$\mathrm{MAE}_{ij} = \frac{1}{N} \sum_{t=1}^{N} \left|y_{ijt} - \hat{y}_{ijt}\right| \tag{3}$$

where $N$, $y_{ijt}$, and $\hat{y}_{ijt}$ are defined as shown in Equation 2.
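A minimal NumPy version of the two metrics (our sketch; y and y_hat hold the N test grids stacked along the first axis, so the result is one error value per grid point):

```python
import numpy as np

def rmse(y, y_hat):
    """Pointwise RMSE: squared errors averaged over the sample axis."""
    return np.sqrt(np.mean((y - y_hat) ** 2, axis=0))

def mae(y, y_hat):
    """Pointwise MAE: absolute errors averaged over the sample axis."""
    return np.mean(np.abs(y - y_hat), axis=0)

y = np.array([[[2.0]], [[4.0]]])      # two test samples on a 1x1 grid
y_hat = np.array([[[1.0]], [[5.0]]])
print(rmse(y, y_hat))  # [[1.]]
print(mae(y, y_hat))   # [[1.]]
```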

6.3 Results and Analysis

First, we conduct experiments with distinct numbers of layers, filters, and kernel sizes to investigate the best hyperparameters for the deep learning models. As a starting point, we set version 1 based on the settings described in Shi et al. (2015), with 2 layers, each one containing 64 filters and a kernel size of 3. In the training phase, we perform mini-batch learning with 50 epochs for both the STConvS2S and ConvLSTM models using the temperature and rainfall datasets, and the RMSprop optimizer with a fixed learning rate.

Besides, we apply dropout, a regularization technique, during the training phase on the rainfall dataset to avoid overfitting and make more accurate predictions for unseen data (test set). To choose the best dropout rate, we run several experiments varying the rate among 0.2, 0.4, 0.6, and 0.8. STConvS2S models adopt a 0.2 rate and ConvLSTM models, 0.8. As a sequence-to-sequence task, we use the previous 5 grids, as established in Section 6.1, to predict the next 5 grids (5-step forecasting).

Table 1 provides the models considered in our investigation with four different settings, the values of the RMSE metric on the test set, and the training time for each dataset. The results show the superiority of the version 4 models, which achieve the lowest RMSE for both ConvLSTM and STConvS2S. Another aspect to note is that increasing the number of filters has a more significant impact on training time than increasing the number of layers or the kernel size (versions 2 and 3 illustrate this).

Dataset Version Settings ConvLSTM (RMSE / Training time) STConvS2S (RMSE / Training time)
CFSR (temperature) 1 L=2, K=3, F=64 2.0986 1:50:42 1.7548 0:42:37
2 L=3, K=3, F=32 2.0364 1:07:13 1.6921 0:29:24
3 L=3, K=3, F=64 1.9683 3:00:44 1.6489 1:03:47
4 L=3, K=5, F=32 1.8695 1:52:36 1.4920 0:44:13
CHIRPS (rainfall) 1 L=2, K=3, F=64 6.4327 1:11:16 6.4067 0:35:53
2 L=3, K=3, F=32 6.4356 0:42:52 6.3905 0:25:29
3 L=3, K=3, F=64 6.4108 1:52:50 6.3785 0:52:01
4 L=3, K=5, F=32 6.3794 1:12:17 6.3215 0:36:33
Table 1: Evaluation of different settings for STConvS2S and ConvLSTM, where the best version has the lowest RMSE value.

Figures 6 and 7 highlight the differences in RMSE performance and training time, respectively, between the STConvS2S and ConvLSTM models. As shown in Figure 6, STConvS2S models outperform the ConvLSTM models on both datasets, which indicates that our architecture can simultaneously capture spatial and temporal correlations. In addition, STConvS2S models achieved the most efficient training, with the smallest time across all versions and datasets (Figure 7). These results reinforce that CNNs have fewer parameters to optimize than RNNs and, as they do not depend on the computations of the previous time step, can be completely parallelized, speeding up the learning process.

(a) Temperature dataset
(b) Rainfall dataset
Figure 6: Comparison between RMSE results for STConvS2S and ConvLSTM model versions.
(a) Temperature dataset
(b) Rainfall dataset
Figure 7: Comparison between training time (in hours) for STConvS2S and ConvLSTM model versions.
(a) Temperature dataset
(b) Rainfall dataset
Figure 8: Error analysis (RMSE) in each epoch during training phase for version 4.

Specifically in version 4 on the temperature dataset, our model significantly outperforms the state-of-the-art architecture for spatiotemporal forecasting: it was 2.5x faster and achieved a 20% improvement in RMSE over ConvLSTM. Comparing the same model version on the rainfall dataset, our model was 2x faster and slightly improved on the ConvLSTM result, with a 1% improvement in RMSE. Furthermore, Figure 8 illustrates that STConvS2S has a lower training error compared to ConvLSTM over 50 epochs. In this comparison, both models use version 4.

To further evaluate our model, we chose the most efficient version (version 4) to perform new experiments. The chosen ConvLSTM and STConvS2S models, with 3 layers, 32 filters, and a kernel size of 5, were compared with ARIMA models. Since ARIMA models are a traditional approach to time series forecasting, they served as a baseline for our proposed deep learning models. The experiment for the baseline takes into account the same temporal pattern and spatial coverage. Thus, predictions were performed throughout all the 1,024 time series (temperature dataset) and 2,500 time series (rainfall dataset), considering in each analysis the previous 5 values in the sequence.

To avoid overfitting during training of the deep learning models, we apply the early stopping technique, with the patience hyperparameter set to 16, on the validation dataset. We train and evaluate each model 10 times, and the mean and standard deviation of the RMSE and MAE metrics were calculated on the test set. This time, we evaluate the baseline and deep learning models over two horizons: 5 and 15 steps ahead. These experiments are relevant to test the capability of our model to predict a long sequence.
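The early stopping rule can be sketched as follows; evaluate is a hypothetical callback returning the validation error for an epoch, and the error curve below is a toy example, not real training data:

```python
def train_with_early_stopping(evaluate, max_epochs=50, patience=16):
    """Stop when the validation error has not improved for `patience`
    consecutive epochs; return the best error and its epoch."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        err = evaluate(epoch)
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # patience exhausted
    return best, best_epoch

# Toy curve: improves until epoch 10, then plateaus at a worse value.
errors = [1.0 / (e + 1) if e <= 10 else 0.2 for e in range(50)]
best, best_epoch = train_with_early_stopping(lambda e: errors[e])
print(best_epoch)  # 10
```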

As shown in Table 2, STConvS2S outperforms all baseline and state-of-the-art results on both metrics. STConvS2S performs much better than the baseline, indicating the importance of spatial dependence in geoscience data, since ARIMA models only analyze temporal relationships. It also outperforms the state-of-the-art model (Shi et al., 2015), with lower errors in temperature forecasting than in rainfall forecasting.

Dataset            | Horizon | Metric | ARIMA  | ConvLSTM        | STConvS2S
CFSR (temperature) | 5       | RMSE   | 2.1880 | 1.8406 ± 0.0318 | 1.4835 ± 0.0131
CFSR (temperature) | 5       | MAE    | 1.9005 | 1.2672 ± 0.0188 | 1.0157 ± 0.0098
CFSR (temperature) | 15      | RMSE   | 2.2481 | 2.2170 ± 0.0209 | 2.0821 ± 0.0179
CFSR (temperature) | 15      | MAE    | 1.9077 | 1.5399 ± 0.0171 | 1.4508 ± 0.0125
CHIRPS (rainfall)  | 5       | RMSE   | 7.4377 | 6.3825 ± 0.0031 | 6.3222 ± 0.0028
CHIRPS (rainfall)  | 5       | MAE    | 6.1694 | 2.3620 ± 0.0016 | 2.3400 ± 0.0009
CHIRPS (rainfall)  | 15      | RMSE   | 7.9460 | 6.3930 ± 0.0030 | 6.3693 ± 0.0024
CHIRPS (rainfall)  | 15      | MAE    | 5.9379 | 2.3673 ± 0.0012 | 2.3626 ± 0.0008
Table 2: Performance results on temperature and rainfall forecasting. Mean and standard deviation of the RMSE and MAE metrics for the baseline models (ARIMA), the state-of-the-art model (ConvLSTM), and the proposed architecture (STConvS2S).
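For reference, the two evaluation metrics reported in Table 2 are computed as below, a straightforward numpy sketch that averages over every grid cell and time step of the predicted sequence:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all grid cells and time steps."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error over all grid cells and time steps."""
    return float(np.mean(np.abs(y_true - y_pred)))
```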

To provide an overview, Figure 9 and Figure 10 illustrate the cumulative error over both horizons (5 and 15 steps ahead). The former shows the superiority of our proposed architecture over ARIMA and ConvLSTM in temperature forecasting, while the latter shows competitive results against ConvLSTM in rainfall forecasting. The experiments reveal that STConvS2S can effectively learn spatiotemporal representations.

Figure 9: Cumulative error over both horizons (5 and 15 steps ahead) using the temperature dataset. Evaluations on the RMSE and MAE metrics.
Figure 10: Cumulative error over both horizons (5 and 15 steps ahead) using the rainfall dataset. Evaluations on the RMSE and MAE metrics.

7 Conclusion

This paper presented STConvS2S, a new deep learning architecture that uses only convolutional layers for spatiotemporal data forecasting. In our architecture, the spatial features learned in the first layers (encoder) are used as input to the final layers (decoder), which are responsible for learning temporal features and predicting the output sequence.

A limitation of CNN models in sequence modeling tasks is generating an output sequence longer than the input sequence. This limitation holds both for 1D CNNs, which capture temporal context, and for 3D CNNs and the hybrid 2D CNN + LSTM approach (ConvLSTM), which capture spatiotemporal context. Our work removes this limitation by adding a transposed convolutional layer before the final convolutional layer.
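The effect of the transposed convolutional layer is easiest to see in one dimension. The sketch below is a minimal numpy illustration of the generic operation (not our PyTorch implementation): with stride s and kernel size k, an input of length T yields an output of length (T - 1) * s + k, which is how the decoder can emit a sequence longer than its input.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Minimal 1-D transposed convolution along the time axis.
    Output length is (len(x) - 1) * stride + len(kernel)."""
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    # Each input step scatters a scaled copy of the kernel into the output
    for t, v in enumerate(x):
        out[t * stride : t * stride + k] += v * kernel
    return out
```

For example, a length-3 input with stride 2 and a kernel of size 2 produces a length-6 output, doubling the temporal resolution.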

Causal convolution is typically used in temporal architectures (1D CNNs). In our work, it was added to the 3D CNN, ensuring that the STConvS2S model does not violate the temporal order (the causal constraint). This implementation was essential for a fair comparison with ConvLSTM (the state-of-the-art model), which is a causal model due to the chain-like structure of its LSTM layers.
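The idea behind causal convolution can again be sketched in one dimension (a minimal numpy illustration; STConvS2S applies the same left-padding trick along the temporal axis of its 3D convolutions): pad the input with k - 1 zeros on the left so that the output at time t depends only on inputs at times up to t.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D convolution with left-only (causal) padding: output[t]
    depends only on x[0..t], never on future steps."""
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), np.asarray(x, dtype=float)])
    # Slide over windows ending at t; flip the kernel for true convolution
    return np.array([np.dot(padded[t:t + k], kernel[::-1])
                     for t in range(len(x))])
```

With kernel [1, 1], each output step is x[t-1] + x[t], with the first step seeing only zero padding on its left, so no future information leaks in.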

Experiments indicate that our model analyzes both spatial and temporal dependencies of the data better than the state-of-the-art model, since it achieved superior performance in temperature forecasting and competitive results in rainfall forecasting. Thus, STConvS2S could be a natural choice for sequence modeling tasks, such as weather forecasting, on spatiotemporal data. Future work will search for ways to decrease the error on the rainfall dataset; directions include applying preprocessing techniques to sparse data and adding data from other geographic regions. Besides, we will investigate further architectures for spatiotemporal data forecasting.

Computer Code Availability

The models described in this paper were developed in the Python 3.6 programming language, and the deep learning models (STConvS2S and ConvLSTM) were implemented using PyTorch 1.0, an open-source framework. Our source code is publicly available at

Data Availability

In this paper, spatiotemporal datasets in NetCDF format were used; they can be downloaded at, an open-source online data repository.


  1. Also termed deconvolution in previous works in the literature.
  3. A scientific method used to produce best estimates (analyses) of how the weather changes over time (Fujiwara et al., 2017).
  5. Kernel for ConvLSTM; temporal kernel and spatial kernel for STConvS2S.


References

  1. An efficient weather forecasting system using artificial neural network. International Journal of Environmental Science and Development 1 (4), pp. 321–326.
  2. Predictive data mining on Average Global Temperature using variants of ARIMA models. IEEE International Conference on Advances in Engineering, Science and Management (ICAESM-2012), pp. 256–260.
  3. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling.
  4. Unsupervised neural method for temperature forecasting. Artificial Intelligence in Engineering 13 (4), pp. 351–357.
  5. A guide to convolution arithmetic for deep learning.
  6. Introduction to the SPARC Reanalysis Intercomparison Project (S-RIP) and overview of the reanalysis systems. Atmospheric Chemistry and Physics 17 (2), pp. 1417–1452.
  7. The climate hazards infrared precipitation with stations: a new environmental record for monitoring extremes. Scientific Data 2.
  8. Convolutional sequence to sequence learning. In International Conference on Machine Learning, Vol. 3, pp. 2029–2042.
  9. Deep learning. MIT Press.
  10. Machine Learning for the Geosciences: Challenges and Opportunities. IEEE Transactions on Knowledge and Data Engineering.
  11. Deep-Hurricane-Tracker: Tracking and forecasting extreme climate events. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV 2019), pp. 1761–1769.
  12. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NeurIPS), pp. 1097–1105.
  13. Convolutional Networks for Images, Speech, and Time Series. In The Handbook of Brain Theory and Neural Networks, M. A. Arbib (Ed.), pp. 1–14.
  14. Forecaster: A Graph Transformer for Forecasting Spatial and Time-Dependent Data.
  15. Assessing the potential of data-driven models for estimation of long-term monthly temperatures. Computers and Electronics in Agriculture 144, pp. 114–125.
  16. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1520–1528.
  17. ExtremeWeather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS), pp. 3402–3413.
  18. Deep learning and process understanding for data-driven Earth system science. Nature 566 (7743), pp. 195–204.
  19. Time series forecasting of petroleum production using deep LSTM recurrent networks. Neurocomputing 323, pp. 203–213.
  20. The NCEP Climate Forecast System Version 2. Journal of Climate 27 (6), pp. 2185–2208.
  21. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems (NeurIPS), pp. 802–810.
  22. Recurrent Convolutions for Causal 3D CNNs. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, pp. 1–10.
  23. A Spatiotemporal Ensemble Approach to Rainfall Forecasting. In Proceedings of the International Joint Conference on Neural Networks, pp. 574–581.
  24. Weather impact on retail sales: How can weather derivatives help with adverse weather deviations? Journal of Retailing and Consumer Services 49, pp. 1–10.
  25. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NeurIPS), pp. 3104–3112.
  26. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459.
  27. WaveNet: A generative model for raw audio.
  28. Attention Is All You Need. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS), pp. 5998–6008.
  29. A Deep Spatial-Temporal Ensemble Model for Air Quality Prediction. Neurocomputing 314, pp. 198–206.
  30. Displacement prediction of Baijiabao landslide based on empirical mode decomposition and long short-term memory neural network in Three Gorges area, China. Computers and Geosciences 111, pp. 87–96.
  31. Deep spatiotemporal residual early-late fusion network for city region vehicle emission pollution prediction. Neurocomputing 355, pp. 183–199.
  32. Traffic flow prediction using LSTM with feature enhancement. Neurocomputing 332, pp. 320–327.
  33. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pp. 3634–3640.
  34. Action recognition using spatial-optical data organization and sequential learning framework. Neurocomputing 315, pp. 221–233.
  35. Sequence to Sequence Weather Forecasting with Long Short-Term Memory Recurrent Neural Networks. International Journal of Computer Applications 143 (11), pp. 7–11.
  36. Prediction of Sea Surface Temperature Using Long Short-Term Memory. IEEE Geoscience and Remote Sensing Letters 14 (10), pp. 1745–1749.