STConvS2S: Spatiotemporal Convolutional Sequence to Sequence Network for Weather Forecasting
Abstract
Applying machine learning models to meteorological data brings many opportunities to the Geosciences field, such as predicting future weather conditions more accurately. In recent years, modeling meteorological data with deep neural networks has become a relevant area of investigation. These works apply either recurrent neural networks (RNNs) or some hybrid approach mixing RNNs and convolutional neural networks (CNNs). In this work, we propose STConvS2S (short for Spatiotemporal Convolutional Sequence to Sequence Network), a new deep learning architecture built for learning both spatial and temporal data dependencies in weather data, using only convolutional layers. Computational experiments using observations of air temperature and rainfall show that our architecture captures spatiotemporal context and outperforms baseline models and the state-of-the-art architecture for the weather forecasting task.
Keywords: Spatiotemporal data analysis, Sequence-to-Sequence models, Convolutional Neural Networks, Weather Forecasting
1 Introduction
Weather forecasting plays an essential role in resource planning in cases of severe natural phenomena such as heat waves (extreme temperatures), droughts, and hurricanes. It also influences decision making in agriculture, aviation, the retail market, and other sectors, since unfavorable weather negatively impacts corporate revenues (Štulec et al., 2019). Over the years, with technological development, predictions of meteorological variables have become more accurate. However, due to the stochastic behavior of the Earth system, which is governed by physical laws, traditional forecasting requires complex, physics-based models to predict the weather (Karpatne et al., 2018).
In recent years, a large volume of data about the Earth system has become available. The remote sensing data collected by satellites provide meteorological data from the entire globe at specific time intervals (e.g., 6h or daily) and with a regular spatial resolution (e.g., 1km or 5km). The availability of historical data encourages researchers to design deep learning models that can make more accurate predictions about the weather (Reichstein et al., 2019).
Even though meteorological data exhibit both spatial and temporal structures, weather forecasting can be modeled as a sequence problem. In sequence models, an input sequence is encoded into a representation that is then mapped to an output sequence, which may have a different length than the input. In Shi et al. (2015), the authors proposed the ConvLSTM architecture to solve the sequence prediction problem using a radar echo dataset. They combine a convolutional neural network (CNN) and a recurrent neural network (RNN) to simultaneously learn the spatial and temporal context of input data to predict the future sequence.
Although the ConvLSTM architecture has achieved the state-of-the-art result for rainfall forecasting on a spatiotemporal dataset and is now considered the potential approach to geoscience data prediction (Reichstein et al., 2019), new opportunities have emerged from recent advances in deep learning for sequence modeling adopting 1D CNN (Gehring et al., 2017) and spatiotemporal representation using 3D CNN with kernel decomposition (Tran et al., 2018). However, a limitation of CNN models when applied to forecasting tasks is the lack of a causal constraint, which allows future information in temporal reasoning (Singh and Cuzzolin, 2019). Another limitation when using convolutional layers in sequence modeling tasks is that the length of the output sequence must be the same size or shorter than the input sequence (Bai et al., 2018).
To tackle these limitations, we introduce STConvS2S (short for Spatiotemporal Convolutional Sequence to Sequence Network), a spatiotemporal predictive model for weather forecasting. STConvS2S combines the encoder-decoder architecture (Gehring et al., 2017) and the decomposition of the convolution operation (Tran et al., 2018) to exploit spatial and temporal features in meteorological data. The main contributions of this work are as follows:

We introduce an architecture for sequence modeling using only 3D convolutional layers. Our model uses encoder-decoder networks, where the encoder uses spatial convolution followed by the decoder network, which learns temporal features from data using a temporal convolution.

We add a causal convolution in some 3D convolutional layers of the decoder to ensure that no future values are used to capture temporal information of the current state in the sequence. This is a key constraint in spatiotemporal data forecasting.

We also add a transposed convolutional layer and use it to generate an output sequence whose length may be longer than the length of the input sequence. Thus, we remove this limitation of CNN models in sequence modeling tasks.

We evaluate our approach using the air temperature and rainfall from CFSR (Saha et al., 2014) and CHIRPS (Funk et al., 2015) datasets, respectively. Experiments cover the South American region and our results outperform the state-of-the-art model for weather forecasting with lower error and training time. In particular, STConvS2S is 20% better than the state-of-the-art model in the 5-step forecasting, and 6% in the 15-step forecasting, using the CFSR dataset.
The rest of this paper is organized into six sections. Section 2 presents an overview of the main concepts related to convolutional layers and sequence modeling. Section 3 formally describes the spatiotemporal data forecasting problem. Section 4 describes our proposed deep learning architecture. Section 5 discusses works related both to weather forecasting and spatiotemporal architectures. Section 6 presents our experiments and results. Section 7 provides the conclusions of the paper.
2 Background
2.1 Convolutional Neural Networks
Convolutional neural networks (CNNs) are an efficient method for capturing spatial context and have attained state-of-the-art results for image classification using a 2D kernel (Krizhevsky et al., 2012). In recent years, researchers have extended the application of CNNs to natural language processing, such as machine translation (Gehring et al., 2017). This novel architecture is built on a CNN with a 1D kernel, useful to capture temporal patterns in a sequence of words to be translated. A CNN with a 3D kernel is used to predict the future in visual representation, as in action recognition (Tran et al., 2018). In this domain, the CNN performs 3D convolution operations over both time and space dimensions of the video.
CNNs were studied in detail in LeCun and Bengio (1995) for image, speech, and time series tasks, where the architecture was designed to process data with grid-like topology. Inspired by the visual cortex, the artificial neurons in this model use the convolution operation to scan the input data and extract features located in a small local neighborhood, called the receptive field. The neighborhood coverage (receptive field) is defined by the kernel size, and the stride parameter defines the position at which the convolution operation must begin for each element. In the end, the output of each neuron after the convolution forms the feature map. For the feature map to preserve the dimensions of the input data, a padding technique can be applied. This technique surrounds each slice of the input volume with cells containing zeros.
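The roles of kernel size, stride, and zero padding can be illustrated with a small one-dimensional sketch. The function name `conv1d` and the example values are illustrative assumptions and do not come from the paper. Like most deep learning frameworks, the sketch computes a cross-correlation (the kernel is not flipped).

```python
def conv1d(signal, kernel, stride=1, padding=0):
    """Slide `kernel` over a zero-padded `signal` and return the feature map."""
    # Surround the input with `padding` zeros on each side.
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    k = len(kernel)
    # Number of positions at which the kernel fits entirely inside the input.
    out_len = (len(padded) - k) // stride + 1
    return [
        sum(padded[i * stride + j] * kernel[j] for j in range(k))
        for i in range(out_len)
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]

no_pad = conv1d(signal, kernel)                        # 3 values: the map shrinks
same = conv1d(signal, kernel, padding=1)               # 5 values: size preserved
strided = conv1d(signal, kernel, stride=2, padding=1)  # 3 values: every 2nd position
```

With a kernel of size 3 and unit stride, one zero on each side is enough to keep the feature map the same length as the input, whereas a stride of 2 roughly halves it.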
2.2 Causal convolutions
When a deep learning model satisfies the causal constraint, it means that the model ensures that, at step $t$, no future information from step $t+1$ onward is used by the learning process. The domain of the sequence modeling task determines the usage of this constraint. For example, in text summarization, the correct interpretation of the current word may depend on words from previous and next steps due to language dependencies (Goodfellow et al., 2016). Therefore, in this domain it is not necessary to follow the causal constraint. On the other hand, for forecasting tasks, the model must be causal. Otherwise, it may exploit information from a future time step to learn the current representation, which makes it an unrealistic model.
To incorporate the ability to respect the causal constraint in temporal learning of a 1D CNN, causal convolutions can be used (van den Oord et al., 2016). This technique can be implemented as follows: pad the input by $k-1$ elements, where $k$ is the kernel size, and then remove $k-1$ elements from the end of the feature map. Figure 1 shows the causal convolution operation in detail.
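The pad-and-trim procedure can be sketched in plain Python for the 1D case. This is a minimal illustration that assumes unit stride and a padding of k - 1 zeros on each side for a kernel of size k; the function name `causal_conv1d` is hypothetical and not taken from the authors' code.

```python
def causal_conv1d(sequence, kernel):
    """1D causal convolution: output[t] depends only on sequence[0..t]."""
    k = len(kernel)
    pad = k - 1
    # Step 1: pad the input with k - 1 zeros on both sides.
    padded = [0.0] * pad + list(sequence) + [0.0] * pad
    # Step 2: ordinary convolution with unit stride (no kernel flip).
    full = [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(padded) - k + 1)
    ]
    # Step 3: drop the last k - 1 outputs, which would look at future steps.
    return full[: len(full) - pad] if pad else full


kernel = [1.0, 1.0, 1.0]
original = causal_conv1d([1.0, 2.0, 3.0, 4.0, 5.0], kernel)
# Changing only the last (future) value must not affect earlier outputs.
modified = causal_conv1d([1.0, 2.0, 3.0, 4.0, 99.0], kernel)
```

The output has the same length as the input, and each output element is a weighted sum of the current and the k - 1 preceding inputs only.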
2.3 Transposed convolutional layer
Transposed convolutional layer
2.4 Sequence modeling
Sequence modeling (or sequence-to-sequence learning) can be defined as a way of generating a model that maps an input sequence of elements to an output sequence, where the sizes of the two sequences may be different. A sequence modeling architecture is a two-phase architecture in which an encoder reads the input and generates a numerical representation of it, while a decoder writes the output sequence after processing the encoder output. The encoder-decoder architecture was first proposed by Sutskever et al. (2014) for machine translation tasks using long short-term memory (LSTM), a type of recurrent neural network (RNN).
LSTM has a chain-like structure, where the output of one step is passed to the next step and so on, which makes it follow the causal constraint and be suitable for sequential processing. A drawback of this dependency on the previous step is that LSTM does not allow parallel computation, leading to a slow training phase. Gehring et al. (2017) propose a new encoder-decoder architecture using only 1D CNNs. The architecture, designed with causal convolutions in the decoder, is able to capture temporal dependencies in sequences successfully and, compared to LSTM models, computations can be completely parallelized during training.
3 Problem Statement
Spatiotemporal data forecasting can be modeled as a sequence-to-sequence problem. Thus, the observations of spatiotemporal data (e.g., meteorological variables) measured in a specific geographic region over a period of time serve as the input sequence to the forecasting task. More formally, we define a spatiotemporal dataset as a collection of samples, where each training example is a tensor holding a sequence of observations containing historical measurements. The number of observations in a training example is the length of the input sequence. Each observation consists of a grid map that determines the spatial location of the measurements, whose two dimensions represent the size of latitude and longitude, respectively. Each observation also has a channel dimension, which represents how many meteorological variables (e.g., temperature, humidity) are used simultaneously in the model. This structure is analogous to 2D images, where the channel dimension would indicate the number of color components (RGB or grayscale).
Modeled as a sequence-to-sequence problem in Equation 1, the goal of spatiotemporal data forecasting is to apply a function that maps an input sequence of past observations to a predicted target sequence while satisfying the causal constraint at each time step. The length of the output sequence may differ from the length of the input sequence.
(1) 
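One way to turn a chronologically ordered series of grids into input and target sequences is a sliding window, sketched below. How the authors actually construct their samples is not stated in this section, so the function `make_sequences`, its parameters, and the integer stand-ins for grids are assumptions made for illustration.

```python
def make_sequences(series, input_len, horizon):
    """Build (input, target) pairs from a chronologically ordered series.

    Each input holds `input_len` consecutive observations and each target
    holds the `horizon` observations that immediately follow it.
    """
    samples = []
    last_start = len(series) - input_len - horizon
    for start in range(last_start + 1):
        split = start + input_len
        samples.append((series[start:split], series[split:split + horizon]))
    return samples


# Integers stand in for 25 consecutive grid maps.
grids = list(range(25))
pairs = make_sequences(grids, input_len=5, horizon=15)
first_input, first_target = pairs[0]
```

Here the target is three times longer than the input, which is the setting in which the output sequence length differs from the input sequence length.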
4 STConvS2S architecture
In this section, we describe our proposed architecture, called Spatiotemporal Convolutional Sequence to Sequence Network (STConvS2S). STConvS2S is a deep learning architecture designed for short-term weather forecasting, as illustrated in Figure 3. We use an encoder-decoder architecture, typically used to model sequence tasks. However, in our model, the 1D convolutional layers used for time series are replaced by 3D ones. This is a crucial feature of our model, since it enables the learning of patterns in data with a spatiotemporal structure, which is typical in geoscience data.
Moreover, instead of adopting a conventional kernel for 3D convolutional layers, we use a factorized 3D kernel adapted from the R(2+1)D network, proposed in Tran et al. (2018). In their work, the factorized kernel splits the convolution operation of one layer into two separate and successive operations, a 2D spatial convolution and a 1D temporal convolution. In our new architecture, we take a different approach: operations are not successive inside each convolutional layer. We configure the encoder to learn spatial dependencies by applying a spatial kernel (2D spatial convolution) and the decoder to encapsulate temporal dependencies using a temporal kernel (1D temporal convolution). Figure 4 schematically illustrates the difference between both approaches.
STConvS2S is a stack of 3D convolutional layers. Each layer receives a 4D tensor as input, whose dimensions are the number of filters used in the previous layer, the sequence length (time dimension), and the sizes of the spatial coverage for latitude and longitude. In detail, the encoder is formed by convolutional blocks with batch normalization and a rectified linear unit (ReLU) as nonlinearity. The decoder is similar to the encoder, except that a causal convolution (Section 2.2) is used in its first layers to ensure that only previous observations are considered in the forecast, which is an essential constraint for weather forecasting.
Kernel decomposition allows us to analyze the spatial and temporal contexts separately. Thus, in the encoder layers, feature maps must have a fixed length in the spatial dimensions, which means that the size of the feature maps must match the input size in these dimensions. Otherwise, for some time series, temporal correlation would not be learned by the decoder due to compression in the spatial dimensions. To ensure a fixed length, the input for the encoder is padded according to the size of the spatial kernel. For the decoder, because of the causal convolution, the input is instead padded according to the size of the temporal kernel.
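The padding arithmetic that keeps feature maps at a fixed length can be checked with a short sketch. It assumes unit stride, odd kernel sizes, and the standard output-size formula for convolutions; the helper names and the example sizes (a 32-cell grid axis, 5 time steps, kernel size 5) are chosen for illustration and are not a statement of the authors' exact padding formulas. Padding only before the first time step is equivalent to padding both sides and trimming the end of the feature map.

```python
def output_size(size, kernel, total_padding, stride=1):
    """Length of one axis after a convolution with the given total padding."""
    return (size + total_padding - kernel) // stride + 1


def same_padding(kernel):
    """Zeros per side that preserve the axis length (odd kernel, stride 1)."""
    if kernel % 2 == 0:
        raise ValueError("this sketch only handles odd kernel sizes")
    return (kernel - 1) // 2


def causal_padding(kernel):
    """Zeros placed before the first time step of a causal convolution."""
    return kernel - 1


# Spatial axis of the encoder: pad both sides, size is preserved.
spatial = output_size(32, kernel=5, total_padding=2 * same_padding(5))

# Temporal axis of the decoder: pad the past side only, length is preserved.
temporal = output_size(5, kernel=5, total_padding=causal_padding(5))
```

In both cases the axis length is unchanged, so stacking several such layers does not compress the spatial grid or the sequence.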
Besides adopting causal convolution in 3D convolutional layers, another contribution of our work is the possibility of generating an output sequence whose length differs from the length of the input sequence. When CNNs are used for sequence-to-sequence learning, such as forecasting tasks, the length of the output sequence must be the same size or shorter than the input sequence (Gehring et al., 2017; Bai et al., 2018). This is not only a limitation of CNN architectures but also of ConvLSTM ones (Shi et al., 2015; Kim et al., 2019). In Shi et al. (2015), all the sequences are 20 frames long and are split into 5 frames for the input and 15 for the prediction. Kim et al. (2019) define an input sequence of 5 time steps and predict the next 5 time steps.
To tackle this limitation, we add a 3D transposed convolutional layer (Section 2.3) before the last convolutional layer and use it to generate an output sequence whose length may be longer than the length of the input sequence. This implementation is tested in the task where we use the previous 5 grids as input sequence to predict the next 15 grids.
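The way a transposed convolution lengthens the time axis can be seen from its output-length formula, which is the one documented for transposed convolution layers in common frameworks such as PyTorch. The kernel sizes and strides below are illustrative assumptions; the values used in the authors' implementation are not given in this section.

```python
def transposed_conv_length(length_in, kernel, stride=1, padding=0,
                           output_padding=0, dilation=1):
    """Output length of one axis after a transposed convolution."""
    return ((length_in - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)


# Two hypothetical ways to expand 5 time steps into 15.
by_stride = transposed_conv_length(5, kernel=3, stride=3)
by_kernel = transposed_conv_length(5, kernel=11, stride=1)
```

Either a larger stride or a larger temporal kernel yields an output sequence longer than the input, which a regular convolution cannot do.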
5 Related work
Statistical methods and machine learning techniques use historical data of temperature, precipitation, and other variables to predict the weather conditions. Autoregressive integrated moving average (ARIMA) models are traditional statistical methods for time series analysis (Babu and Reddy, 2012). Studies also apply artificial neural networks (ANN) to time series prediction in weather data, such as temperature measurements (Corchado and Fyfe, 1999; Baboo and Shereef, 2010; Mehdizadeh, 2018). Recently, some authors have been developing new approaches based on deep learning to improve time series forecasting results, in particular, using LSTM networks. Traffic flow analysis (Yang et al., 2019), displacement prediction of landslides (Xu and Niu, 2018), petroleum production (Sagheer and Kotb, 2019), and sea surface temperature forecasting (Zhang et al., 2017) are some applications that successfully use LSTM architectures. In Zaytar and Amrani (2016), the authors build a model with stacked LSTM layers to map sequences of weather values (temperature, humidity, and wind speed) of the same length for 9 cities in Morocco and show that their results are competitive with traditional methods. However, these time series approaches are unable to capture the spatial dependencies in the observations.
Spatiotemporal deep learning models deal with spatial and temporal contexts simultaneously. In Shi et al. (2015), the authors formulate weather forecasting as a sequence-to-sequence problem, where the input and output are 2D radar map sequences. In addition, they introduce the convolutional LSTM (ConvLSTM) architecture to build an end-to-end trainable model for precipitation nowcasting. The proposed model includes the convolution operation in the LSTM network to capture spatial patterns. Kim et al. (2019) also define their problem as a sequence task and adopt ConvLSTM for extreme climate event forecasting. Their model uses hurricane density map sequences as spatiotemporal data. The work proposed in Souto et al. (2018) implements a spatiotemporal-aware ensemble approach adopting the ConvLSTM architecture. The authors combine different meteorological models as channels in the convolutional layer to predict the next expected rainfall values for each location. Although related to the use of deep learning for climate/weather data, our model adopts only CNN rather than a hybrid approach that combines CNN and LSTM.
Some studies have applied spatiotemporal convolutions (Yuan et al., 2018; Tran et al., 2018) for video analysis and action recognition. In Tran et al. (2018), the authors compare several spatiotemporal architectures using only 3D CNN and show that factorizing the 3D convolutional kernel into separate spatial and temporal components produces gains in accuracy. Their architecture focuses on layer factorization, i.e., factorizing each convolution into a block of a spatial convolution and a temporal convolution. Moreover, in comparison to the full 3D convolution, they indicate two advantages: an increase in the complexity of the functions that can be represented, and easier optimization of the spatial or temporal components. Inspired by Tran et al. (2018), we also adopt a factorized 3D CNN, but with a different implementation. Figure 4 highlights this difference.
A limitation of both 3D CNN and factorized 3D CNN (Tran et al., 2018) is the lack of a causal constraint, which allows future information in temporal learning. Singh and Cuzzolin (2019) also factorize the 3D convolution using the same spatial convolution as Tran et al. (2018) but propose a recurrent convolution unit based on an RNN approach to address the causal constraint in temporal learning for the action recognition task. In contrast, we use an entirely CNN-based approach, adopting a causal convolution to tackle this limitation.
Following the success of 2D CNN in capturing spatial correlation in images, Xu et al. (2019) propose a model to predict vehicle pollution emissions using 2D CNN to capture temporal and spatial correlation separately. However, unlike our work, they do not satisfy the causal constraint when adopting 2D CNN in temporal learning. Racah et al. (2017) use a 3D CNN in an encoder-decoder architecture, where they concatenate the time axis as the third dimension of the input for extreme climate event detection. Their encoder and decoder use convolutional and deconvolutional (transposed convolutional) layers, respectively, to learn the spatiotemporal representation simultaneously in each layer. Our approach is similar to Racah et al. (2017) in using an encoder-decoder architecture based on CNN, but we adopt a factorized 3D CNN instead of a 3D CNN and specialize our encoder to learn only spatial context and the decoder to learn temporal context.
Other deep learning approaches devised to explore spatiotemporal patterns differ from the grid-structured data we use as input. Wang and Song (2018) present an ensemble approach for air quality forecasting combining statistical hypothesis and deep learning. They explore spatial correlation by applying Granger causality between two time series and, for temporal learning, use LSTM networks. Yu et al. (2018) and Li and Moura (2019) use graph-structured data as input and propose a deep learning network to tackle a sequence-to-sequence problem using spatiotemporal data. Yu et al. (2018) build the architecture for traffic forecasting using convolutional structures composed of two temporal layers, which are 1D CNNs with a causal convolution, and one spatial layer in between, used to extract spatial features in graphs. Li and Moura (2019) adopt an encoder-decoder architecture based on the Transformer model (Vaswani et al., 2017) for taxi ride-hailing prediction.
To sum up, our proposed STConvS2S architecture departs from the previous approaches, either in the manipulation of spatial and temporal dependencies or in the use of different deep learning layers to learn features from the data or in the adoption of a grid structure rather than a graph to model the input data.
6 Experiments
We perform experiments on two publicly available meteorological datasets containing air temperature and precipitation values to validate our proposed architecture. The deep learning experiments were conducted on a server with a single Nvidia GeForce GTX1080 GPU with 8GB memory. The baseline model was executed on 8 Intel i7 CPUs with 4 cores and 66GB RAM. We begin by explaining the datasets (Section 6.1) and evaluation metrics (Section 6.2). After that, we describe the results and a corresponding analysis (Section 6.3).
6.1 Datasets
The CFSR
In the experiments, we use a subset of CFSR with the air temperature observations from January 1979 to December 2015 covering the region from 8°N to 54°S and from 80°W to 25°W, as shown in Figure 5 (a). As data preprocessing, we scale down the grid to 32 × 32 in the latitude and longitude dimensions to fit the data in GPU memory. The other dataset, CHIRPS
Similar to Shi et al. (2015), we define the input sequence length as 5, which indicates that the previous 5 grids are used to predict the next grids. Thus, each input to the deep learning architectures has 1 channel, 5 time steps, and a 32 × 32 grid for the CFSR dataset, and 1 channel, 5 time steps, and a 50 × 50 grid for the CHIRPS dataset. Here, 1 indicates the single channel (in this aspect similar to a grayscale image), 5 is the size of the sequence considered in the forecasting task, and 32 and 50 represent the numbers of latitudes and longitudes used to build the spatial grid in each dataset.
From the temperature dataset, we create 54,041 grid sequences and from the rainfall dataset, 13,960 grid sequences. Finally, we divide both datasets into non-overlapping training, validation, and test sets following a 60%, 20%, and 20% ratio, in this order.
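A non-overlapping 60/20/20 split of the sequence indices can be sketched as follows. The contiguous ordering and the rounding rule (floor for the training and validation sets, remainder to the test set) are assumptions made for this sketch; the section does not state how boundary cases are handled.

```python
def split_indices(n_samples, percentages=(60, 20, 20)):
    """Return contiguous, non-overlapping index ranges for train/val/test."""
    train_pct, val_pct, _ = percentages
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n_samples)  # remainder goes to the test set
    return train, val, test


# Sequence counts reported for the two datasets.
cfsr_train, cfsr_val, cfsr_test = split_indices(54041)
chirps_train, chirps_val, chirps_test = split_indices(13960)
```

Because the three ranges are contiguous and disjoint, no sequence index appears in more than one set.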
6.2 Evaluation metrics
In order to evaluate the proposed architecture, we compare our results against ARIMA models, traditional statistical approaches for time series forecasting, and the ConvLSTM architecture proposed in Shi et al. (2015), which is considered the state of the art for spatiotemporal forecasting. To accomplish this, we use the two evaluation metrics presented in Equations 2 and 3.
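RMSE is based on the MSE metric, which is the average of squared differences between real observation and prediction. The square root of the MSE gives the results in the original unit of the output, and is expressed at a specific point as:
$$\mathrm{RMSE}(i, j, t) = \sqrt{\frac{1}{n} \sum_{s=1}^{n} \left[ x_s(i, j, t) - \hat{x}_s(i, j, t) \right]^2} \qquad (2)$$
where $n$ is the number of test samples, and $x_s(i, j, t)$ and $\hat{x}_s(i, j, t)$ are the real and predicted values of test sample $s$ at the location $(i, j)$ and at time $t$, respectively.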
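MAE is the average of absolute differences between real observation and prediction, which measures the magnitude of the errors in prediction. MAE also provides the result in the original unit of the output, and is expressed at a specific point as:
$$\mathrm{MAE}(i, j, t) = \frac{1}{n} \sum_{s=1}^{n} \left| x_s(i, j, t) - \hat{x}_s(i, j, t) \right| \qquad (3)$$
where $n$, $x_s(i, j, t)$, $\hat{x}_s(i, j, t)$, $(i, j)$, and $t$ are defined as in Equation 2.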
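Both metrics can be computed for one location and time step with a few lines of Python. The sketch below follows the verbal definitions above (square root of the mean squared difference, and mean absolute difference); the function names and the sample temperature values are made up for illustration.

```python
import math


def rmse(real, predicted):
    """Root mean squared error over paired real and predicted values."""
    n = len(real)
    return math.sqrt(sum((x - p) ** 2 for x, p in zip(real, predicted)) / n)


def mae(real, predicted):
    """Mean absolute error over paired real and predicted values."""
    n = len(real)
    return sum(abs(x - p) for x, p in zip(real, predicted)) / n


# Hypothetical values at one grid cell and time step across 4 test samples.
real = [20.0, 22.0, 25.0, 19.0]
predicted = [21.0, 22.0, 23.0, 19.0]

rmse_value = rmse(real, predicted)  # sqrt(5 / 4)
mae_value = mae(real, predicted)    # 3 / 4
```

Because RMSE squares each difference before averaging, the single error of 2 degrees weighs more heavily in it than in MAE.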
6.3 Results and Analysis
First, we conduct experiments with distinct numbers of layers, filters, and kernel sizes to investigate the best hyperparameters to fit the deep learning models. As a starting point, we set version 1 based on the settings described in Shi et al. (2015) with 2 layers, each one containing 64 filters and a kernel size of 3.
In addition, we apply dropout, a regularization technique, during the training phase on the rainfall dataset to avoid overfitting and to make more accurate predictions for unseen data (test set). To find the best dropout rate, we execute several experiments with dropout rates of 0.2, 0.4, 0.6, and 0.8. STConvS2S models adopt a rate of 0.2 and ConvLSTM models, 0.8. As a sequence-to-sequence task, we use the previous 5 grids, as we established before in Section 6.1, to predict the next 5 grids (5-step forecasting).
Table 1 provides the models considered in our investigation with four different settings, the values of the RMSE metric on the test set, and the training time for each dataset. The results show the superiority of the version 4 models, which achieve the lowest RMSE for both ConvLSTM and STConvS2S. Another aspect to note is that increasing the number of filters has a greater impact on training time than increasing the number of layers or the kernel size (compare versions 2 and 3, which differ only in the number of filters).
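| Dataset | Version | Settings | ConvLSTM RMSE | ConvLSTM training time (h:mm:ss) | STConvS2S RMSE | STConvS2S training time (h:mm:ss) |
|---|---|---|---|---|---|---|
| CFSR (temperature) | 1 | L=2, K=3, F=64 | 2.0986 | 1:50:42 | 1.7548 | 0:42:37 |
| CFSR (temperature) | 2 | L=3, K=3, F=32 | 2.0364 | 1:07:13 | 1.6921 | 0:29:24 |
| CFSR (temperature) | 3 | L=3, K=3, F=64 | 1.9683 | 3:00:44 | 1.6489 | 1:03:47 |
| CFSR (temperature) | 4 | L=3, K=5, F=32 | 1.8695 | 1:52:36 | 1.4920 | 0:44:13 |
| CHIRPS (rainfall) | 1 | L=2, K=3, F=64 | 6.4327 | 1:11:16 | 6.4067 | 0:35:53 |
| CHIRPS (rainfall) | 2 | L=3, K=3, F=32 | 6.4356 | 0:42:52 | 6.3905 | 0:25:29 |
| CHIRPS (rainfall) | 3 | L=3, K=3, F=64 | 6.4108 | 1:52:50 | 6.3785 | 0:52:01 |
| CHIRPS (rainfall) | 4 | L=3, K=5, F=32 | 6.3794 | 1:12:17 | 6.3215 | 0:36:33 |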
Figures 6 and 7 highlight the differences in RMSE and training time, respectively, between STConvS2S and ConvLSTM models. As shown in Figure 6, STConvS2S models outperform the ConvLSTM models for both datasets, which indicates that our architecture can simultaneously capture spatial and temporal correlations. In addition, STConvS2S models resulted in the most efficient training, with the shortest time in all versions and datasets (Figure 7). These results reinforce that CNN has fewer parameters to optimize than RNN and, as it does not depend on the computations of the previous time step, can be completely parallelized, speeding up the learning process.
Specifically, in version 4 for the temperature dataset, our model significantly outperforms the state-of-the-art architecture for spatiotemporal forecasting. It was 2.5x faster and achieved a 20% improvement in RMSE over ConvLSTM. Comparing the same model version for the rainfall dataset, our model was 2x faster and slightly improved the ConvLSTM result, with a 1% improvement in RMSE. Furthermore, Figure 8 illustrates that STConvS2S has a lower training error compared to ConvLSTM over 50 epochs. In this comparison, both models use version 4.
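The reported speedups and RMSE improvements can be reproduced from the version 4 rows of Table 1. The helper names below are illustrative; the numbers are copied from the table.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' training time into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def relative_improvement(baseline_rmse, model_rmse):
    """Fraction by which the model lowers the baseline RMSE."""
    return (baseline_rmse - model_rmse) / baseline_rmse


def speedup(baseline_time, model_time):
    """How many times faster the model trains than the baseline."""
    return to_seconds(baseline_time) / to_seconds(model_time)


# Version 4 (L=3, K=5, F=32): ConvLSTM is the baseline, STConvS2S the model.
temp_gain = relative_improvement(1.8695, 1.4920)
temp_speedup = speedup("1:52:36", "0:44:13")
rain_gain = relative_improvement(6.3794, 6.3215)
rain_speedup = speedup("1:12:17", "0:36:33")
```

Rounded, these values give roughly 20% and 2.5x for temperature, and roughly 1% and 2x for rainfall.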
To further evaluate our model, we chose the most efficient version (version 4) to perform new experiments. The chosen ConvLSTM and STConvS2S models with 3 layers, 32 filters, and a kernel size of 5 were compared with ARIMA models. Since ARIMA models are a traditional approach to time series forecasting, they served as a baseline for our proposed deep learning models. The experiment for the baseline takes into account the same temporal pattern and spatial coverage. Thus, predictions were performed throughout all the 1,024 time series (temperature dataset) and 2,500 time series (rainfall dataset), considering in each analysis the previous 5 values in the sequence.
To avoid overfitting during training of the deep learning models, we apply the early stopping technique with the patience hyperparameter set to 16 on the validation dataset. We train and evaluate each of them 10 times, and the mean and the standard deviation of the RMSE and MAE metrics are calculated on the test set. This time, we evaluate the baseline and deep learning models in two horizons: 5 and 15 steps ahead. These experiments are relevant to test the capability of our model to predict a long sequence.
As shown in Table 2, STConvS2S outperforms the results of all baseline and state-of-the-art experiments on both metrics. STConvS2S performs much better than the baseline, indicating the importance of spatial dependence in geoscience data, since ARIMA models only analyze temporal relationships. It also outperforms the state-of-the-art model (Shi et al., 2015), with lower errors in temperature forecasting than rainfall forecasting.
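Early stopping with a patience value can be sketched as a small function over a list of validation losses. The stopping rule shown here (stop after `patience` consecutive epochs without a new best validation loss) is a common formulation and an assumption of this sketch; the exact criterion in the authors' code is not described in this section, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops.

    Training stops once `patience` consecutive epochs pass without the
    validation loss improving on the best value seen so far.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)  # patience never exhausted


# Hypothetical run: the loss improves for 3 epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
stop_epoch = early_stopping_epoch(losses, patience=16)
```

With the best loss at epoch 3 and a patience of 16, training halts at epoch 19 instead of running all 23 epochs.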
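| Dataset | Horizon (steps) | Metric | ARIMA | ConvLSTM | STConvS2S |
|---|---|---|---|---|---|
| CFSR (temperature) | 5 | RMSE | 2.1880 | 1.8406 ± 0.0318 | 1.4835 ± 0.0131 |
| CFSR (temperature) | 5 | MAE | 1.9005 | 1.2672 ± 0.0188 | 1.0157 ± 0.0098 |
| CFSR (temperature) | 15 | RMSE | 2.2481 | 2.2170 ± 0.0209 | 2.0821 ± 0.0179 |
| CFSR (temperature) | 15 | MAE | 1.9077 | 1.5399 ± 0.0171 | 1.4508 ± 0.0125 |
| CHIRPS (rainfall) | 5 | RMSE | 7.4377 | 6.3825 ± 0.0031 | 6.3222 ± 0.0028 |
| CHIRPS (rainfall) | 5 | MAE | 6.1694 | 2.3620 ± 0.0016 | 2.3400 ± 0.0009 |
| CHIRPS (rainfall) | 15 | RMSE | 7.9460 | 6.3930 ± 0.0030 | 6.3693 ± 0.0024 |
| CHIRPS (rainfall) | 15 | MAE | 5.9379 | 2.3673 ± 0.0012 | 2.3626 ± 0.0008 |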
To provide an overview, Figure 9 and Figure 10 illustrate the cumulative error based on both horizons (5 and 15 steps). The former shows the superiority of our proposed architecture over ARIMA and ConvLSTM in temperature forecasting, while the latter shows our competitive results compared to ConvLSTM in rainfall forecasting. The experiments reveal that STConvS2S can effectively learn spatiotemporal representations.
7 Conclusion
This paper presented a new deep learning architecture using only convolutional layers to deal with spatiotemporal data forecasting, termed as STConvS2S. In our architecture, the spatial features learned in the first layers (encoder) are used as input to the final layers (decoder) responsible for learning temporal features and predicting the output sequence.
A limitation of CNN models in sequence modeling tasks is that they cannot generate an output sequence whose length is greater than the length of the input sequence. This occurs in 1D CNN for capturing temporal context, and in 3D CNN and in the hybrid approach using 2D CNN + LSTM, called ConvLSTM, for spatiotemporal context. Our work removes this limitation by adding a transposed convolutional layer before the final convolutional layer.
Causal convolution is typically used in temporal architectures (1D CNN). In our work, it was added to 3D CNN, allowing the STConvS2S model to respect the temporal order (causal constraint). This implementation was essential for a fair comparison with ConvLSTM (the state-of-the-art model), which is a causal model due to the chain-like structure of LSTM layers.
Experiments indicate that our model analyzes both spatial and temporal dependencies of the data better than the state-of-the-art model, since it has achieved superior performance in temperature forecasting and competitive results in rainfall forecasting. Thus, STConvS2S could be a natural choice for sequence modeling tasks, such as weather forecasting, when using spatiotemporal data. Future work will search for ways to decrease the error on the rainfall dataset, and directions may include applying preprocessing techniques to sparse data and adding data from other geographic regions. In addition, we will investigate more architectures for spatiotemporal data forecasting.
Computer Code Availability
Models described in this paper were developed with the Python 3.6 programming language, and the deep learning models (STConvS2S and ConvLSTM) were implemented using PyTorch 1.0, an open-source framework. Our source code is publicly available at https://github.com/MLRGCEFETRJ/stconvs2s
Data Availability
In this paper, spatiotemporal datasets in NetCDF format were used and can be downloaded at http://doi.org/10.5281/zenodo.3558773, an open-source online data repository.
Footnotes
 Also termed as deconvolution in previous works of the literature.
 https://climatedataguide.ucar.edu/climatedata/climateforecastsystemreanalysiscfsr
 Scientific method which is used to produce best estimates (analyses) of how the weather is changing over time (Fujiwara et al., 2017).
 https://chc.ucsb.edu/data/chirps
 kernel for ConvLSTM. temporal kernel and spatial kernel for STConvS2S.
References
Baboo and Shereef (2010). An efficient weather forecasting system using artificial neural network. International Journal of Environmental Science and Development 1 (4), pp. 321–326.
Babu and Reddy (2012). Predictive data mining on Average Global Temperature using variants of ARIMA models. IEEE-International Conference on Advances in Engineering, Science and Management, ICAESM-2012, pp. 256–260.
Bai et al. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv:1803.01271v2, https://arxiv.org/abs/1803.01271v2
Corchado and Fyfe (1999). Unsupervised neural method for temperature forecasting. Artificial Intelligence in Engineering 13 (4), pp. 351–357.
Dumoulin and Visin (2016). A guide to convolution arithmetic for deep learning. arXiv:1603.07285, https://arxiv.org/abs/1603.07285
Fujiwara et al. (2017). Introduction to the SPARC Reanalysis Intercomparison Project (S-RIP) and overview of the reanalysis systems. Atmospheric Chemistry and Physics 17 (2), pp. 1417–1452.
Funk et al. (2015). The climate hazards infrared precipitation with stations - A new environmental record for monitoring extremes. Scientific Data 2. ISSN 2052-4463.
Gehring et al. (2017). Convolutional sequence to sequence learning. In International Conference on Machine Learning, Vol. 3, pp. 2029–2042.
Goodfellow et al. (2016). Deep learning. MIT Press. http://www.deeplearningbook.org
Karpatne et al. (2018). Machine Learning for the Geosciences: Challenges and Opportunities. IEEE Transactions on Knowledge and Data Engineering PP (c), pp. 1.
Kim et al. (2019). Deep-hurricane-tracker: Tracking and forecasting extreme climate events. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, WACV 2019, pp. 1761–1769.
Krizhevsky et al. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NeurIPS), pp. 1097–1105.
LeCun and Bengio (1995). Convolutional Networks for Images, Speech, and Time Series. In The Handbook of Brain Theory and Neural Networks, M. A. Arbib (Ed.), pp. 1–14.
 Forecaster: A Graph Transformer for Forecasting Spatial and TimeDependent Data. Note: \hrefhttps://arxiv.org/abs/1909.04019v3arXiv:1909.04019v3 Cited by: §5.
 Assessing the potential of datadriven models for estimation of longterm monthly temperatures. Computers and Electronics in Agriculture 144, pp. 114–125. External Links: Document Cited by: §5.
 Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1520–1528. External Links: Document Cited by: §2.3.
 ExtremeWeather: A largescale climate dataset for semisupervised detection, localization, and understanding of extreme weather events. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS), pp. 3402–3413. Cited by: §2.3, §5.
 Deep learning and process understanding for datadriven Earth system science. Nature 566 (7743), pp. 195–204. External Links: Document Cited by: §1, §1.
 Time series forecasting of petroleum production using deep lstm recurrent networks. Neurocomputing 323, pp. 203 – 213. External Links: Document Cited by: §5.
 The NCEP Climate Forecast System Version 2. Journal of Climate 27 (6), pp. 2185–2208. External Links: Document Cited by: 4th item, §6.1.
 Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems (NeurIPS), pp. 802–810. Cited by: §1, §4, §5, §6.1, §6.2, §6.3, §6.3.
 Recurrent Convolutions for Causal 3D CNNs. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, pp. 1–10. Cited by: §1, §5.
 A Spatiotemporal Ensemble Approach to Rainfall Forecasting. In Proceedings of the International Joint Conference on Neural Networks, pp. 574–581. External Links: Document Cited by: §5.
 Weather impact on retail sales: How can weather derivatives help with adverse weather deviations?. Journal of Retailing and Consumer Services 49 (February), pp. 1–10. External Links: Document Cited by: §1.
 Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NeurIPS), pp. 3104–3112. Cited by: §2.4.
 A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6450–6459. External Links: Document Cited by: §1, §1, §2.1, Figure 4, §4, §5, §5.
 WaveNet: a generative model for raw audio. Note: \hrefhttps://arxiv.org/abs/1609.03499arXiv:1609.03499 Cited by: §2.2.
 Attention is All you Need. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS), pp. 5998–6008. Cited by: §5.
 A Deep SpatialTemporal Ensemble Model for Air Quality Prediction. Neurocomputing 314, pp. 198–206. External Links: Document Cited by: §5.
 Displacement prediction of Baijiabao landslide based on empirical mode decomposition and long shortterm memory neural network in Three Gorges area, China. Computers and Geosciences 111, pp. 87–96. External Links: Document Cited by: §5.
 Deep spatiotemporal residual earlylate fusion network for city region vehicle emission pollution prediction. Neurocomputing 355, pp. 183–199. External Links: Document Cited by: §5.
 Traffic flow prediction using LSTM with feature enhancement. Neurocomputing 332, pp. 320–327. External Links: ISSN 09252312, Document Cited by: §5.
 Spatiotemporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the TwentySeventh International Joint Conference on Artificial Intelligence, IJCAI18, pp. 3634–3640. External Links: Document Cited by: §5.
 Action recognition using spatialoptical data organization and sequential learning framework. Neurocomputing 315, pp. 221–233. External Links: Document Cited by: §5.
 Sequence to Sequence Weather Forecasting with Long ShortTerm Memory Recurrent Neural Networks. International Journal of Computer Applications 143 (11), pp. 7–11. External Links: Document Cited by: §5.
 Prediction of Sea Surface Temperature Using Long ShortTerm Memory. IEEE Geoscience and Remote Sensing Letters 14 (10), pp. 1745–1749. External Links: Document Cited by: §5.