Spatiotemporal Graph Convolutional Neural Network: A Deep Learning Framework for Traffic Forecasting
Abstract
The goal of traffic forecasting is to predict the future vital indicators (such as speed, volume and density) of the local traffic network within a reasonable response time. Due to the dynamics and complexity of traffic network flow, typical simulation experiments and classic statistical methods cannot satisfy the requirements of mid-and-long term forecasting. In this work, we propose a novel deep learning framework, the Spatio-Temporal Graph Convolutional Neural Network (STGCNN), to tackle this spatiotemporal sequence forecasting task. Instead of applying recurrent models to sequence learning, we build our model entirely on convolutional neural networks (CNNs) with gated linear units (GLU) and highway networks. The proposed architecture fully employs the graph structure of the road networks and enables faster training. Experiments show that our STGCNN model captures comprehensive spatiotemporal correlations throughout a complex traffic network and consistently outperforms state-of-the-art baseline algorithms on several real-world traffic datasets.
Bing Yu,∗1 Haoteng Yin,∗2,3 Zhanxing Zhu†3,4
1 School of Mathematical Sciences, Peking University, Beijing, China
2 Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
3 Center for Data Science, Peking University, Beijing, China
4 Beijing Institute of Big Data Research (BIBDR), Beijing, China
∗ Equal contributions. † Corresponding author.
{byu, htyin, zhanxing.zhu}@pku.edu.cn
Introduction
Traffic forecasting is one of the most challenging tasks in Intelligent Transportation Systems (ITS). Accurate and timely forecasting of multi-scale traffic conditions is of paramount importance for road users, management agencies and the private sector. Widely used transportation services provided by ITS, such as dynamic traffic control, route planning and navigation, also rely on a high-quality assessment of future traffic network conditions at reasonable cost.
Indicators such as speed, volume and density gathered by various sensors reflect the general status of road traffic conditions. Thus, those measurements are typically chosen as the target of traffic prediction. Based on the length of prediction, traffic forecasting can be divided into three scales: short-term (5–30 min), medium-term (30–60 min) and long-term (over an hour). Most prevalent approaches perform well on short forecasting intervals; however, because of the inherent uncertainty and complexity of traffic flow, those methods are less satisfactory for long-term time-series prediction.
Previous studies on traffic prediction can be roughly divided into two categories: traditional simulation approaches and data-driven methods. For the simulation approaches, making traffic flow predictions requires comprehensive and meticulous system modeling based on physical theories and prior knowledge (?). Even so, such analog systems and simulation tools still consume massive computational power and require skillful parameter tuning to reach a steady state. Nowadays, with the rapid development of real-time traffic data collection methods, researchers are shifting their attention to data-driven methods that exploit the enormous historical traffic records gathered by advanced ITS.
Classic statistical models and machine learning models are the two major representative categories of data-driven methods. In time-series analysis, the autoregressive integrated moving average (ARIMA) model is one of the most consolidated approaches. It has been applied in various fields and was first introduced to traffic forecasting as early as the 1970s (Ahmed and Cook 1979). The ARIMA model can be applied to non-stationary data, which require an integrated term to make the time series stationary. Extensive variants of the ARIMA model have been proposed to improve its pattern-capturing ability and prediction accuracy, such as seasonal ARIMA (SARIMA) (Williams and Hoel 2003) and ARIMA with the Kalman filter (Lippi, Bertini, and Frasconi 2013). However, the models mentioned above rely heavily on the stationarity assumption of the time series and ignore the spatial correlation across the traffic network. Therefore, time-series models have limited ability to represent highly dynamic and inconstant traffic flow.
Recently, machine learning methods have shown promising development in traffic studies. Higher prediction accuracy can be obtained by these non-parametric methods, including the k-nearest neighbors algorithm (KNN), support vector machines (SVM), and neural network (NN) models (also referred to as deep learning models).
Deep Learning Approaches
Nowadays, deep learning techniques, deep architectures in particular, have drawn much academic and industrial attention. Deep learning methods have been widely and successfully employed in various tasks such as classification, pattern recognition and object detection. In traffic prediction research, the deep belief network (DBN) has been shown to be capable of capturing the stochastic features and characteristics of traffic flow without hand-engineered feature selection (?; ?). (?) proposed a stacked autoencoder (SAE) model to discover latent short-term traffic flow features. (?) developed a stacked denoising autoencoder to learn hierarchical representations of urban traffic flow. The approaches mentioned above can learn effective features for short-term traffic prediction. However, it is difficult for fully-connected neural networks to extract representative spatial and temporal features from a large amount of long-term traffic flow data concurrently. Moreover, topological locality and historical memory among the spatiotemporal traffic variables are neglected in those deep learning models, which hinders their predictive power.
Recurrent neural networks (RNN) and their variants (e.g. the long short-term memory network (LSTM) and the gated recurrent unit (GRU)) show tremendous potential for traffic prediction with short and long temporal dependencies. In spite of their efficient use of temporal dependency, the spatial part is not fully utilized in previous studies. To fill this gap, some researchers use the convolutional neural network (CNN) (LeCun et al. 1998) to extract the topological locality of the traffic network. A CNN model with customized kernels offers a robust algorithm for exploring the local relationships between neighboring variables. By combining LSTM and 1-D CNN, (Wu and Tan 2016) designed a feature-level fused architecture, CLTFP, for short-term traffic flow forecasting. Even though it adopts a straightforward combined strategy, CLTFP still offers an insightful perspective on jointly exploiting the spatial and temporal domains of traffic variables.
Traffic network variables are typical structured data with spatiotemporal features. How to effectively model temporal dynamics and topological locality from those high-dimensional variables is the key to the forecasting problem. (Xingjian et al. 2015) proposed a convolutional LSTM (ConvLSTM) model, an extended fully-connected LSTM (FC-LSTM) with embedded convolutional structures. The ConvLSTM imposes a convolution operation on the state-transition procedure of video frames. However, such standard CNNs are restricted to processing regular grid structures (e.g. images, videos, and speech) rather than general domains, so structured traffic variables may not be applicable. Recent advances in modeling irregular or non-Euclidean domains provide useful insights into how to further study the structured data problem. (Bruna et al. 2013) made a primary exploration of generalizing the signal domain of CNNs to arbitrarily structured graphs (e.g. social networks, traffic networks). Several follow-up studies (?; ?) inspired researchers to develop novel combinational methods to reveal hidden features of structured datasets. (Seo et al. 2016) introduced the graph convolutional recurrent network (GCRN) to simultaneously identify the spatial domain and the dynamic variation of spatiotemporal sequences. The key challenge of the aforementioned studies is to determine the best possible collaboration between recurrent models (e.g. RNN, LSTM or GRU) and graph CNNs (Defferrard, Bresson, and Vandergheynst 2016) for a specific dataset. Based on the above principles, (Li et al. 2017) successfully employed GRU with graph convolutional layers to predict complex traffic flow. It is noteworthy that recurrent models normally process and learn input sequences step by step. As the iterations increase, errors gradually accumulate, which leads to drifting predictions. The serialized learning process also limits parallelization of training.
Motivated by graph CNNs and convolutional sequence learning, we propose a novel deep learning architecture, the spatiotemporal graph convolutional neural network (STGCNN), for long-term traffic forecasting tasks. Our contributions are:

- To the best of our knowledge, this is the first work to apply purely convolutional structures to extract spatiotemporal features from graph-structured traffic data in both the spatial and temporal domains simultaneously.
- We propose a novel deep learning architecture that combines graph convolution with convolutional sequence learning. Thanks to its purely convolutional architecture, it trains much faster than RNN/LSTM-based models, with roughly a tenfold acceleration of training speed.
- The proposed traffic forecasting framework outperforms all the methods we implemented on two real-world traffic datasets in multiple speed prediction experiments.
- Beyond its strong performance in the traffic prediction domain, our STGCNN model is also a general deep learning framework for modeling graph-structured time-series data. It can be applied in other scenarios, such as social network analysis.
Methodology
Problem Formulation
The purpose of the traffic prediction task is to use previously observed road speed records to forecast the future status of a specified region over a certain period. Historical traffic data measured at n sensor stations over M previous time steps can be regarded as a matrix of size M × n.
In order to describe the relationship between neighboring sensor stations in the traffic network, we introduce an undirected graph G = (V, E, W), where V is a set of n vertices, i.e. sensor stations; E represents a set of edges, indicating the connectivity between those sensors in the network; and W ∈ R^{n×n} denotes the adjacency matrix of G. If the topology of the vertices in V can be obtained from the raw data, we calculate the values of W based on the connectedness. Otherwise, the adjacency matrix is constructed according to the distance between each pair of sensor stations. Therefore, historical traffic data can be defined on G, consisting of M graph-structured data frames, as Figure 1 shows. Denoting by v_t ∈ R^n the observation frame at time step t, we can formulate our spatiotemporal traffic prediction problem as
\hat{v}_{t+1}, …, \hat{v}_{t+H} = argmax_{v_{t+1}, …, v_{t+H}} log P(v_{t+1}, …, v_{t+H} | v_{t−M+1}, …, v_t),  (1)
where H is the length of the prediction horizon.
Network Architecture
In this section, we describe the general architecture of our spatiotemporal graph CNN (STGCNN) for traffic speed prediction. See Figure 2 for a graphical illustration of the proposed model. The STGCNN is composed of several spatiotemporal convolutional blocks and a fully-connected layer. Each spatiotemporal conv-block is constructed from a highway graph CNN layer and a gated linear temporal CNN layer. We elaborate on the model details in the following sections.
Graph CNN for Extracting Spatial Features
Convolutional neural networks have been successfully applied to extracting highly meaningful patterns and features from large-scale, high-dimensional datasets. Traffic variables, which contain hidden local properties, are well suited for retrieving correlations with their locations and neighbors through CNNs. However, the standard CNN cannot handle the complex urban road network. In (Defferrard, Bresson, and Vandergheynst 2016), the authors defined convolutional neural networks on graphs (GCNN) in the context of spectral graph theory. The proposed GCNN model can handle any graph-structured dataset while achieving the same linear computational complexity and constant learning complexity as classical CNNs. Therefore, we employ the graph CNN to handle the structured urban traffic data. The input of the graph CNN is converted from the data matrix into a 3D tensor of size M × n × C_i, where C_i is the number of input channels.
Graph Convolution
Graph convolution can extract the spatial information efficiently on sparse graphs with only a few trainable parameters. Information among neighboring nodes is grouped and distributed by the graph convolution since the operator can be regarded as applying strictly localized filters to traverse the graph.
Given an undirected graph G with n vertices and a vector x of size n on G, the graph convolution is defined in the spectral domain of G. By computing the graph Laplacian L = I_n − D^{−1/2} W D^{−1/2} of G and its eigendecomposition L = U Λ U^T (where D is the diagonal degree matrix with D_{ii} = Σ_j W_{ij}, and U is an orthogonal matrix of eigenvectors), the Fourier transform for x is defined as \hat{x} = U^T x. Hence, the definition of the graph convolution of x and y (Shuman et al. 2013) is
x ∗_G y = U((U^T x) ⊙ (U^T y)),
where ⊙ is the element-wise Hadamard product. Further, we can define the graph convolution on a vector x by a filter Θ, which is a diagonal matrix in the spectral domain (Shuman et al. 2013):
Θ ∗_G x = U Θ U^T x.
In fact, the above equation is equivalent to computing the graph convolution of Uθ and the vector x, where θ ∈ R^n collects the diagonal entries of Θ, as the following equations show:
Θ ∗_G x = U(θ ⊙ (U^T x)) = U((U^T (Uθ)) ⊙ (U^T x)) = (Uθ) ∗_G x.
Therefore, we can regard Θ ∗_G x as the graph convolution as well. In order to reduce the number of parameters and localize the filter, Θ can be restricted to a polynomial of Λ: Θ(Λ) = Σ_{k=0}^{K−1} θ_k Λ^k, where K is the kernel size of the graph convolution. Then, Θ ∗_G x can be expanded as
Θ ∗_G x = U Θ(Λ) U^T x = Σ_{k=0}^{K−1} θ_k L^k x.
A signal on a graph of n nodes can be described as a matrix X ∈ R^{n×C_i} consisting of C_i vectors of size n. Consequently, for a signal X, a convolution operation with a kernel tensor Θ of size K × C_i × C_o on G is
(Θ ∗_G X)_{:,j} = Σ_{i=1}^{C_i} Σ_{k=0}^{K−1} Θ_{k,i,j} L^k X_{:,i},  j = 1, …, C_o,
where C_i and C_o indicate the number of channels of the input and the output, respectively. For more details about the graph convolution, please refer to (Defferrard, Bresson, and Vandergheynst 2016; Shuman et al. 2013).
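As a concrete illustration of the polynomial graph convolution above, the following NumPy sketch (our own illustrative code, not the authors' implementation) computes Σ_k L^k X Θ_k; the dense L^k accumulation is written for clarity, whereas a practical implementation would use sparse matrices or a Chebyshev recurrence:

```python
import numpy as np

def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2} for a symmetric weight matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def graph_conv(X, L, theta):
    """Polynomial graph convolution.

    X:     (n, C_i) signal on the graph
    L:     (n, n) graph Laplacian
    theta: (K, C_i, C_o) kernel; the output is sum_k L^k X theta[k]
    """
    out = np.zeros((X.shape[0], theta.shape[2]))
    Lk_X = X.copy()                    # L^0 X
    for k in range(theta.shape[0]):
        out += Lk_X @ theta[k]         # accumulate L^k X theta_k
        Lk_X = L @ Lk_X                # advance to L^{k+1} X
    return out
```

With K = 1 this reduces to an ordinary per-node linear layer X @ theta[0], which is a useful sanity check.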
Highway Network
The information from sensor stations in the traffic network may contribute unequally to the speed prediction problem. For example, of two geographically adjacent sensors, one may have a higher impact than the other because it is located on a traffic artery rather than in an alley. To account for this issue, we introduce a data-dependent gate to control the mixing ratio of the input and the output of the spatial graph convolution. A highway network can achieve such a gate mechanism (Srivastava, Greff, and Schmidhuber 2015) by computing an extra gate function for each node. Inspired by this idea, we use a highway graph convolution layer to control the information flow through the graph. Concretely, given a signal X ∈ R^{n×C_i} on G, we define the graph convolution with two trainable kernels Θ_1, Θ_2 of size K × C_i × C_o:
P = Θ_1 ∗_G X,  T = σ(Θ_2 ∗_G X),  (2)
where σ is the sigmoid function and T serves as the gate. Hence, the final output of the highway graph convolution layer is
Y = ReLU(P) ⊙ T + X ⊙ (1 − T),  (3)
where ReLU is the rectified linear unit function. If X and Y do not share the same size, we pad X with a zero tensor or apply a linear transformation to it first. Accordingly, we denote the whole highway graph convolutional layer as
Y = HW_{Θ_1, Θ_2}(X).  (4)
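The gate mechanism of Eqs. (2)–(3) can be sketched as follows. This is our reading of the highway combination (convolved candidate features gated against the identity path) and assumes matching input and output channel counts, so the identity path needs no projection:

```python
import numpy as np

def graph_conv(X, L, theta):
    """Polynomial graph convolution: sum_k L^k X theta[k]."""
    out = np.zeros((X.shape[0], theta.shape[2]))
    Lk_X = X.copy()
    for k in range(theta.shape[0]):
        out += Lk_X @ theta[k]
        Lk_X = L @ Lk_X
    return out

def highway_graph_conv(X, L, theta1, theta2):
    """Highway-gated graph convolution (illustrative sketch, not the
    authors' code): the gate T decides, per node and channel, how much
    of the convolved signal passes through versus the raw input X."""
    P = graph_conv(X, L, theta1)                          # candidate features
    T = 1.0 / (1.0 + np.exp(-graph_conv(X, L, theta2)))   # sigmoid gate
    return np.maximum(P, 0.0) * T + X * (1.0 - T)         # ReLU(P)*T + X*(1-T)
```

With all-zero kernels the gate sits at 0.5 and the layer outputs half the input, a simple way to check the mixing behavior.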
Gated CNN for Extracting Temporal Features
In traffic prediction studies, many models are based on recurrent architectures, such as FC-LSTM (Sutskever, Vinyals, and Le 2014) and ConvLSTM (Xingjian et al. 2015). However, RNN models must compute different time steps successively and cannot process a time sequence in parallel. In addition, recurrent models tend to use a sequence-to-sequence scheme for long-term predictions, iteratively feeding the forecast of the last time step back into the network to predict the next step. This mechanism accumulates error step by step.
To overcome the aforementioned disadvantages, we employ CNNs instead of RNNs along the time axis to capture temporal features. Recently, researchers from Facebook released a convolutional architecture for sequence modeling with gating and attention mechanisms (Gehring et al. 2017). This technique avoids the ordered computation that limits the efficiency of traditional recurrent models and enables a parallelized, customizable training procedure. Motivated by this idea, we use gated linear units (GLU) to build the temporal CNN layers. The input of the temporal layer is the 3D tensor Y obtained from Eq. (4), of size M × n × C_i, standing for time steps, nodes and channels, respectively. Furthermore, we design two convolution kernels Γ_1, Γ_2 of size K_t × C_i × C_o that are applied exclusively along the time axis. Inevitably, the convolution operations change the number of channels from C_i to C_o and the length of the time axis from M to M − K_t + 1. If we want to maintain the original length of the time axis, a zero tensor of size (K_t − 1) × n × C_i is concatenated on the left of Y, denoted Y_pad. Consequently, the two convolution operations applied on the input are determined by the separate kernels as
P = Γ_1 ∗ Y_pad + b_1,  Q = Γ_2 ∗ Y_pad + b_2,  (5)
where b_1 and b_2 are trainable bias variables; P is the output of the convolution operation and Q is the input of a gate to control P. As a result, the output of the temporal gated CNN is the element-wise product of the convolution output and the gate, denoted Z:
Z = P ⊙ σ(Q).  (6)
Specifically, we do not have to maintain the length of the time axis in the last temporal CNN layer. Instead, we reduce it from M to 1 to make predictions. Thus, we directly employ a convolution operation with a kernel Γ_o of size M × C_o × C_o and a ReLU activation on the input without padding, that is, Z_o = ReLU(Γ_o ∗ Z + b_o). Therefore, the final output of the last temporal CNN layer is a tensor of size 1 × n × C_o. Eventually, we reshape this output to a matrix Z′ of size n × C_o and calculate the speed prediction for the n nodes by applying a linear transformation across channels as \hat{v} = Z′ w + b, where w ∈ R^{C_o} is a weight vector and b is a bias.
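A minimal sketch of the gated temporal convolution of Eqs. (5)–(6), written as an explicit sliding window over the time axis (an illustrative reimplementation, not the authors' code; biases are omitted for brevity):

```python
import numpy as np

def temporal_glu_conv(Y, gamma1, gamma2, pad=True):
    """Gated temporal convolution (GLU) along the time axis.

    Y:      (M, n, C_i) input sequence
    gamma*: (K_t, C_i, C_o) 1-D kernels applied per node along time
    Returns (M, n, C_o) if pad else (M - K_t + 1, n, C_o).
    """
    M, n, C_i = Y.shape
    K_t, _, C_o = gamma1.shape
    if pad:  # left-pad with zeros so the output keeps length M
        Y = np.concatenate([np.zeros((K_t - 1, n, C_i)), Y], axis=0)
    T_out = Y.shape[0] - K_t + 1
    P = np.zeros((T_out, n, C_o))
    Q = np.zeros((T_out, n, C_o))
    for t in range(T_out):
        window = Y[t:t + K_t]                         # (K_t, n, C_i)
        P[t] = np.einsum('knc,kcd->nd', window, gamma1)
        Q[t] = np.einsum('knc,kcd->nd', window, gamma2)
    return P * (1.0 / (1.0 + np.exp(-Q)))             # GLU: P ⊙ sigmoid(Q)
```

Setting pad=False with K_t = M reproduces the final layer's reduction of the time axis to a single step.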
Combining Graph CNN and Temporal CNN
In order to fuse the spatial and temporal features, we propose a novel model, the spatiotemporal graph CNN (STGCNN), combining the temporal GLU CNN with the spatial highway graph CNN. The model is stacked from several spatiotemporal conv-blocks and a linear output layer. Each block comprises one highway spatial graph convolution layer followed by one temporal GLU convolution layer. The input and the output of the blocks are all 3D tensors of size M × n × C. For the input Y^{l−1} of block l, the output can be computed by
Y^l = TemporalConv_l(HW_l(Y^{l−1})),
where HW_l and TemporalConv_l denote the highway graph convolution of Eq. (4) and the gated temporal convolution of Eq. (6) in block l, respectively. After stacking the spatiotemporal conv-blocks, we append an output spatiotemporal conv-block and a fully-connected layer for each node of the sensor graph, as shown in Figure 2.
The loss function of our model for predicting the next time step can be written as
L(\hat{v}; W_θ) = Σ_t ||\hat{v}(v_{t−M+1}, …, v_t; W_θ) − v_{t+1}||²,  (7)
where W_θ denotes all trainable variables in our STGCNN model, v_{t+1} is the ground truth and \hat{v}(·) denotes the model's prediction.
Our STGCNN model is a universal framework for processing structured time series: it is not only able to tackle large-scale urban traffic network modeling and prediction, but can also be applied to more general spatiotemporal sequence forecasting challenges. The graph convolution and the gates in the highway layers extract useful spatial features while suppressing useless information, and the temporal convolutions combined with gates select the most important temporal features. Our model is entirely composed of convolutional layers and therefore parallelizes well. Furthermore, the STGCNN framework is not based on sequence-to-sequence learning, so the model can obtain a much more accurate estimation without accumulating error step by step.
Experiments
Dataset Description
We use two different traffic datasets, collected and processed by the Beijing University of Technology and the California Department of Transportation, respectively. Each dataset contains key indicators and geographic information with corresponding timestamps, as detailed below.
Beijing East Ring No.4 Road (BJER4)
The BJER4 dataset was gathered from a certain area of the East Ring No.4 routes in Beijing by double-loop detectors. There are 12 roads (as Figure 3 shows; R207 and R208 were discarded due to overlap) selected for our experiment. The traffic data are aggregated every 5 minutes. The particular time period used in this study is from 1st July to 31st August 2014, excluding weekends. We select the first month of historical speed records as the training set, and the rest serves as the validation and test sets, respectively.
PeMS District 7 (PeMSD7)
The PeMSD7 dataset was collected from the Caltrans Performance Measurement System (PeMS) in real time by over 39,000 individual sensor stations deployed across all major metropolitan areas of the California state highway system (Chen et al. 2001). The dataset that we use in the numerical experiments is likewise aggregated into 5-minute intervals from the individual 30-second data samples of each sensor station. We randomly select 228 stations (shown in Figure 4) as the data source within District 7 of California. The time range of the PeMSD7 dataset covers the weekdays of May and June 2012. We split the training and test sets following the same principle.
Data Preprocessing
The traffic variables of the two datasets are all aggregated into 5-minute intervals. Thus, each single node in the traffic graph contains 288 data points per day. After the cleaning procedure, we apply a linear interpolation method, combined with time features, to fill in the missing values. In addition, the input data of the neural networks are uniformly normalized by the Z-score method.
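The two preprocessing steps can be sketched as below. `fill_missing` is a simplified stand-in for the interpolation combined with time features described above, and the Z-score statistics should be computed on the training split only to avoid leakage:

```python
import numpy as np

def fill_missing(series):
    """Linearly interpolate NaN gaps in a 1-D speed series
    (a simplification of the paper's interpolation with time features)."""
    idx = np.arange(len(series))
    mask = np.isnan(series)
    out = series.astype(float).copy()
    out[mask] = np.interp(idx[mask], idx[~mask], series[~mask])
    return out

def zscore(x, mean, std):
    """Z-score normalization using training-split statistics."""
    return (x - mean) / std
```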
In the BJER4 dataset, the graph topology of the traffic network in the Beijing East No.4 ring route system is constructed from the geographical metadata in the original sensor records. By collating the affiliation, direction and origin-destination points of each route, the ring route system can be generalized as Figure 3 shows.
In the PeMSD7 dataset, the weight matrix of the sensor graph is computed based on the relative positions of stations in the network. In this way, we can define the adjacency matrix as follows,
w_{ij} = exp(−d_{ij}² / σ²) if i ≠ j and exp(−d_{ij}² / σ²) ≥ ε, and w_{ij} = 0 otherwise,  (8)
where w_{ij} is the edge weight, which is decided by d_{ij} (the distance between stations i and j); σ² and ε are thresholds that control the distribution and sparsity of the weight matrix, and their values are assigned for each dataset individually. The numerical visualization of W is presented in Figure 5.
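Eq. (8) can be implemented directly from a pairwise distance matrix. The function below is an illustrative sketch, with sigma2 and eps left as free parameters since their exact values are dataset-specific:

```python
import numpy as np

def build_weight_matrix(dist, sigma2, eps):
    """Thresholded Gaussian kernel weights from pairwise distances.

    w_ij = exp(-d_ij^2 / sigma2) if i != j and the value >= eps, else 0.
    sigma2 controls how fast weights decay with distance; eps sparsifies W.
    """
    W = np.exp(-dist ** 2 / sigma2)
    W[W < eps] = 0.0           # prune weak connections for sparsity
    np.fill_diagonal(W, 0.0)   # no self-loops
    return W
```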
Experimental Settings
All experiments are compiled and tested on a CentOS server (CPU: Intel(R) Xeon(R) E5-2650 v4 @ 2.20GHz, Memory: 132GB, GPU: NVIDIA Tesla K80). We conduct a grid search to locate the best parameters, i.e. those producing the highest score on the validation sets. All tests use 1 hour as the uniform historical time window, i.e. applying 12 observed data points to forecast the traffic conditions in the next 15, 30, and 60 minutes.
Baselines
We compare our STGCNN framework with the following baselines: 1) Historical Average (HA); 2) Linear Support Vector Regression (LSVR) (Pedregosa et al. 2011); 3) Auto-Regressive Integrated Moving Average (ARIMA); 4) Feed-Forward Neural Network (FNN); 5) Fully-Connected LSTM (FC-LSTM) (Sutskever, Vinyals, and Le 2014); 6) Graph Convolutional LSTM (GC-LSTM) (Seo et al. 2016). For detailed parameter settings of the baseline algorithms, please refer to the appendix.
STGCNN Model
For the BJER4 dataset, we stack three spatiotemporal blocks of 64, 64 and 128 channels plus an output layer, while four spatiotemporal blocks of 64, 64, 128 and 128 channels plus an output layer are employed on the PeMSD7 dataset. The graph convolution kernel size K and the temporal convolution kernel size K_t are set to the same value for both datasets. We train our model by minimizing the mean square error using ADAM (Kingma and Ba 2014) for 50 epochs with a batch size of 25. The initial learning rate decays by a factor of 0.7 after every 5 epochs.
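The learning-rate schedule described above (a 0.7 multiplicative decay after every 5 epochs) amounts to a simple step-decay rule; `lr_at_epoch` is an illustrative helper, with the initial rate left as a parameter:

```python
def lr_at_epoch(initial_lr, epoch, decay=0.7, step=5):
    """Step decay: multiply the rate by `decay` after every `step` epochs."""
    return initial_lr * decay ** (epoch // step)
```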
Evaluation
In our study, three metrics are adopted to evaluate the quality of the prediction against the ground truth: the mean absolute error (MAE), the mean absolute percentage error (MAPE) and the root mean square error (RMSE).
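The three metrics reported in Table 1 can be computed as follows (a straightforward sketch; MAPE is expressed in percent and assumes nonzero ground-truth values):

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((y - y_hat) / y)) * 100.0

def rmse(y, y_hat):
    """Root mean square error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))
```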
Experiment Results
Table 1 demonstrates the results of STGCNN and baseline algorithms on the two datasets respectively. Our proposed model achieves the best performance in all three evaluation metrics.
Model   |        BJER4         |        PeMSD7
        | MAE   MAPE    RMSE   | MAE   MAPE    RMSE
15 min
HA      | 5.21  14.67%  7.65   | 4.01  10.61%   7.20
LSVR    | 4.11   9.84%  5.71   | 2.52   5.87%   4.55
ARIMA   | 6.40  16.50%  9.55   | 5.72  14.15%  10.78
FNN     | 4.20  10.34%  5.71   | 2.84   6.99%   4.66
FC-LSTM | 4.30  10.92%  5.77   | 3.54   8.83%   6.25
GC-LSTM | 3.95   9.36%  5.39   | 3.87   8.87%   6.80
STGCNN  | 3.75   9.01%  5.11   | 2.35   5.44%   4.15
30 min
HA      | 5.21  14.67%  7.65   | 4.01  10.61%   7.20
LSVR    | 5.07  12.31%  7.10   | 3.62   8.90%   6.67
ARIMA   | 6.27  16.53%  9.04   | 5.58  14.16%  10.04
FNN     | 5.13  13.10%  7.12   | 4.04   9.84%   6.51
FC-LSTM | 4.80  12.15%  6.66   | 3.74   9.43%   6.74
GC-LSTM | 4.59  11.73%  6.41   | 3.98   9.35%   7.14
STGCNN  | 4.41  10.65%  6.06   | 3.16   7.59%   5.52
60 min
HA      | 5.21  14.67%  7.65   | 4.01  10.61%   7.20
LSVR    | 6.56  16.35%  9.44   | 5.30  13.68%   9.44
ARIMA   | 6.57  17.91%  9.24   | 5.92  15.48%   9.88
FNN     | 6.88  17.87%  9.28   | 5.34  14.61%   8.89
FC-LSTM | 5.63  14.91%  8.13   | 4.04  10.71%   7.35
GC-LSTM | 5.83  13.96%  8.75   | 4.49  10.72%   7.83
STGCNN  | 5.46  13.71%  7.66   | 3.95   9.72%   6.95
We can observe that traditional statistical and machine learning methods may perform well for short-term forecasting, but their long-term predictions are inaccurate because of error accumulation. The ARIMA model performs the worst due to its inability to handle complex spatiotemporal data. Deep learning approaches generally achieve better prediction results than traditional machine learning models. It is worth noticing that, except for our STGCNN model and HA, the remaining baseline models have relatively poor long-term forecasting performance on the BJER4 dataset. This is partially due to the simple topology and confined region of the BJER4 traffic network, which provide insufficient spatial information for long-term speed prediction. In addition, the traffic data of BJER4 are quite noisy and highly fluctuant along the time axis, which causes severe error accumulation.
Benefits of Spatial Topology
Previous methods do not incorporate spatial topology and model the time series in a coarse-grained way. In contrast, by modeling the spatial topology of the sensors, our STGCNN model achieves a significant improvement in short-, medium- and long-term forecasting. The advantage of STGCNN is more obvious on the PeMSD7 dataset than on BJER4, since the sensor network of PeMS is more complicated (as illustrated in Figure 5) and our model can fully utilize the spatial structure to make more accurate predictions. To compare the three methods based on neural networks (FC-LSTM, GC-LSTM and STGCNN), we show their predictions during the morning peak and evening rush hours in Figures 6 and 7. It is easy to observe that our STGCNN captures the trend of the rush hours more accurately than the other methods, and detects the end of the rush hours earlier than the others.
Generalization of Deep Learning Models
In order to investigate the performance of the compared deep learning models in more detail, we plot the RMSE on the validation set of the PeMSD7 dataset during the training process; see Figure 8. The models based on recurrent networks (i.e. FC-LSTM and GC-LSTM) fit the training set well, since the error of a single-step prediction is small enough that error accumulation is not significant on the training set. However, on the validation set the single-step error is much larger than on the training set, and consequently the long-term predictions of RNN/LSTM-type models are ruined by error accumulation. In this sense, RNN/LSTM-type models tend to overfit the training data and generalize poorly. Our STGCNN model directly predicts the speed of a specific time step, avoiding the issue of error accumulation. It performs best on the validation set and achieves the smallest gap between training and validation, since its predictions do not rely on iteratively generating a sequence.
Training Efficiency
To see the benefits of the convolution along the time axis in our STGCNN, we compare the training time of STGCNN and GC-LSTM with the same number of layers and hidden units. On the BJER4 dataset, with both models using 3 hidden layers and 64 units in each layer, our STGCNN only consumes 69 seconds, while the RNN/LSTM-type model GC-LSTM spends 676 seconds. This roughly tenfold acceleration of training speed is due to the temporal convolution replacing recurrent structures.
Conclusion and Future Work
In this paper, we have proposed a novel deep learning framework for traffic prediction, the spatiotemporal graph convolutional neural network (STGCNN), combining graph and temporal CNNs. The experimental results show that our model outperforms other state-of-the-art methods on real-world datasets, indicating its great potential for exploring spatiotemporal structures in data. In the future, we will further optimize the network structure and parameters to obtain better results on traffic prediction problems. Moreover, the proposed framework can be applied to more general spatiotemporal structured sequence forecasting scenarios, such as the evolution of social networks and preference prediction in recommender systems.
References
 [Ahmed and Cook 1979] Ahmed, M. S., and Cook, A. R. 1979. Analysis of freeway traffic time-series data by using Box-Jenkins techniques. Number 722.
 [Bruna et al. 2013] Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
 [Chen et al. 2001] Chen, C.; Petty, K.; Skabardonis, A.; Varaiya, P.; and Jia, Z. 2001. Freeway performance measurement system: mining loop detector data. Transportation Research Record: Journal of the Transportation Research Board (1748):96–102.
 [Chen et al. 2016] Chen, Q.; Song, X.; Yamada, H.; and Shibasaki, R. 2016. Learning deep representation from big and heterogeneous data for traffic accident inference. In AAAI, 338–344.
 [Defferrard, Bresson, and Vandergheynst 2016] Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, 3844–3852.
 [Gehring et al. 2017] Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.
 [Henaff, Bruna, and LeCun 2015] Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
 [Huang et al. 2014] Huang, W.; Song, G.; Hong, H.; and Xie, K. 2014. Deep architecture for traffic flow prediction: deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems 15(5):2191–2201.
 [Jia, Wu, and Du 2016] Jia, Y.; Wu, J.; and Du, Y. 2016. Traffic speed prediction using deep learning method. In Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on, 1217–1222. IEEE.
 [Kingma and Ba 2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 [Li et al. 2017] Li, Y.; Yu, R.; Shahabi, C.; and Liu, Y. 2017. Graph convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.
 [Lippi, Bertini, and Frasconi 2013] Lippi, M.; Bertini, M.; and Frasconi, P. 2013. Short-term traffic flow forecasting: An experimental comparison of time-series analysis and supervised learning. IEEE Transactions on Intelligent Transportation Systems 14(2):871–882.
 [Lv et al. 2015] Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; and Wang, F.Y. 2015. Traffic flow prediction with big data: a deep learning approach. IEEE Transactions on Intelligent Transportation Systems 16(2):865–873.
 [Niepert, Ahmed, and Kutzkov 2016] Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, 2014–2023.
 [Pedregosa et al. 2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
 [Seo et al. 2016] Seo, Y.; Defferrard, M.; Vandergheynst, P.; and Bresson, X. 2016. Structured sequence modeling with graph convolutional recurrent networks. arXiv preprint arXiv:1612.07659.
 [Shuman et al. 2013] Shuman, D. I.; Narang, S. K.; Frossard, P.; Ortega, A.; and Vandergheynst, P. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30(3):83–98.
 [Srivastava, Greff, and Schmidhuber 2015] Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015. Highway networks. arXiv preprint arXiv:1505.00387.
 [Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104–3112.
 [Vlahogianni 2015] Vlahogianni, E. I. 2015. Computational intelligence and optimization for transportation big data: challenges and opportunities. In Engineering and Applied Sciences Optimization. Springer. 107–128.
 [Williams and Hoel 2003] Williams, B. M., and Hoel, L. A. 2003. Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results. Journal of Transportation Engineering 129(6):664–672.
 [Wu and Tan 2016] Wu, Y., and Tan, H. 2016. Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework. arXiv preprint arXiv:1612.01022.
 [Xingjian et al. 2015] Xingjian, S.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; and Woo, W.-c. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, 802–810.
Appendix
Baseline Settings

- Historical Average (HA): the mean value of the whole historical training data at the current timestamp is used as the prediction.
- Linear Support Vector Regression (LSVR): a support vector machine with a linear kernel; the method is implemented with the scikit-learn package, with parameters optimized by grid search.
- Auto-Regressive Integrated Moving Average (ARIMA): the best order of the ARIMA(p, d, q) settings is determined from the training data, and the corresponding model is then used to predict the test time series iteratively.
- Feed-Forward Neural Network (FNN): for both the BJER4 and PeMSD7 datasets, the FNN has two hidden layers of size 64 and an output layer; the activation function is ReLU.
- Fully-Connected LSTM (FC-LSTM): for the BJER4 dataset, FC-LSTM stacks two LSTM cells; for the PeMSD7 dataset, it stacks three LSTM cells.
- Graph Convolutional LSTM (GC-LSTM): a 2-layer GC-LSTM network with 64 and 64 channels and an encoder-decoder mechanism is applied on the BJER4 dataset; for the PeMS dataset, we use a 3-layer GC-LSTM network with 64, 128 and 128 channels.