3D Graph Convolutional Networks with Temporal Graphs: A Spatial Information Free Framework For Traffic Forecasting
Abstract
Spatiotemporal prediction plays an important role in many application areas, especially in the traffic domain. However, due to complicated spatiotemporal dependencies and highly nonlinear dynamics in road networks, traffic prediction remains challenging. Existing works either incur heavy training cost or fail to capture spatiotemporal patterns accurately, and they also ignore the correlation between distant roads that share similar patterns. In this paper, we propose a novel deep learning framework to overcome these issues: 3D Temporal Graph Convolutional Networks (3DTGCN). Our model introduces two novel components. (1) Instead of constructing the road graph from spatial information, we learn it by comparing the similarity between the time series of each road, thus providing a spatial information free framework. (2) We propose an original 3D graph convolution model to capture the spatiotemporal data more accurately. Empirical results show that 3DTGCN outperforms state-of-the-art baselines.
Bing Yu*,
Mengzhang Li,
Jiyong Zhang,
Zhanxing Zhu†
School of Mathematical Sciences, Peking University, Beijing, China
Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
Center for Data Science, Peking University, Beijing, China
Beijing Institute of Big Data Research (BIBDR), Beijing, China
School of Automation, Hangzhou Dianzi University, Hangzhou, China
{byu,mcmong,zhanxing.zhu}@pku.edu.cn
1 Introduction
Traffic speed prediction is a crucial task for many key purposes in intelligent traffic systems and urban planning. For example, it is useful not only for explicit tasks, such as calculating how many lanes a road should have or monitoring whether a traffic jam has formed somewhere, but also as a reflection of road conditions for downstream traffic problems, e.g., as an important feature for estimating time of arrival, route planning and traffic light control.
In traffic forecasting problems, we typically choose density [?], speed [?] and volume [?] as indicators to characterize current traffic conditions. The traffic forecasting problem can be categorized along three dimensions: the length of prediction, i.e., short-term (less than 30 min) [?] and long-term (30 to 60 min) [?]; the data source, i.e., fixed sensors on several roads [?] and moving GPS trajectories processed with a map-matching algorithm [?]; and the road type, i.e., urban road [?] and highway [?]. All of these prediction types are challenging due to the complexity of spatiotemporal dependencies and, in particular, the uncertainty of long-term forecasting.
Before data-driven approaches emerged, researchers usually applied mathematical tools such as differential equations, together with traditional traffic knowledge, to model traffic behaviour by numerical simulation [?]. This approach makes strong assumptions, such as identical driver behaviour and the absence of sudden accidents. Over the past several decades, many statistical and machine learning methods, such as AutoRegressive Integrated Moving Average (ARIMA) models [?; ?] and support vector regression (SVR) [?], were proposed. However, these methods rely on a stationarity assumption on the time series, which makes it hard to model highly nonlinear traffic flow, and they ignore the correlation between different roads. Meanwhile, some works consider the spatial structure of the input data, applying a convolutional neural network (CNN) to capture the correlation between adjacent locations and a recurrent neural network (RNN) or long short-term memory (LSTM) network along the time axis [?; ?; ?]. However, ordinary convolution operates on grid structures such as images and videos and is not suitable for traffic networks, and training RNN and LSTM networks is time-consuming and difficult.
To model temporal patterns and spatial dependencies effectively, recent works introduce graph convolutional networks (GCN) to learn on traffic networks [?; ?]. DCRNN [?] uses bidirectional random walks on the traffic graph to model spatial information and captures temporal dynamics with gated recurrent units (GRU). This sequence-to-sequence model performs well, at the cost of very expensive computation during training. STGCN [?] relies on graph convolution in the spatial domain and 1D convolution along the time axis. Although STGCN significantly reduces training time thanks to its purely convolutional operations, it processes graph information and time series separately, and may therefore fail to model the interaction between spatial and temporal dynamics accurately.
On the other hand, existing graph-based prediction approaches model the relationship between roads through a graph constructed from spatial distance (e.g., GPS distance) or road connectivity. However, in some practical scenarios the spatial adjacency matrix is difficult to generate: for freely editable maps such as OpenStreetMap [?], acquiring up-to-date and accurate spatial topology information is hard, while commercial map services are expensive and their APIs limit the number of queries available for distance calculation. (For example, Baidu Map, one of the biggest commercial map apps in the world, provides individual developers with at most 30,000 queries per day, and its full basic service costs 10 thousand dollars per month; see http://lbsyun.baidu.com/apiconsole/auth/privilege.) More importantly, we argue that this way of constructing the graph ignores the correlation between distant roads that share a similar temporal pattern. For instance, at rush hours, most roads near office buildings have similar traffic patterns and will encounter traffic jams in the same period. Both kinds of influence can be extracted from the time series themselves.
To overcome the drawbacks above, we propose a novel methodology that improves traffic prediction in both model design and graph construction. To extract better spatiotemporal dependencies, we propose a 3D graph convolution network in which 3D convolution learns the spatial and temporal patterns simultaneously. Furthermore, we offer a spatial information free approach for constructing the graph of the traffic network, relying purely on the similarity of the time series of each road. This proposal can capture more effective patterns between different roads than the spatial graph, which leads to better prediction performance. The contributions of this work can be summarized as follows.

We create a 3D GCN model to jointly learn the static road graph and temporal dynamics. This new network structure strikes a better balance between training efficiency and effectiveness of feature learning.

Instead of using spatial information, we construct the adjacency matrix between nodes solely from time series similarity, computed by the dynamic time warping (DTW) algorithm. The difference between the two types of graph construction is presented in Figure 1. This removes the difficulty of acquiring geographic information. We show empirically that the temporal graph performs much better than the spatial graph. To the best of our knowledge, this is the first work to put aside the spatial adjacency matrix and construct the spatiotemporal graph by a data-driven method that extracts effective features from the time series of the road network itself.

We conduct extensive experiments on two open large-scale real-world datasets. Results show that both the 3D GCN model and our spatial information free graph obtain significant improvements over state-of-the-art baseline methods.
2 Preliminary
2.1 Traffic Forecasting Problem
We can represent the road network as a graph $G = (V, E, A)$, where $V$ is a finite set of $N$ nodes, corresponding to the observations of $N$ sensors or roads; $E$ is a set of edges; and $A \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix representing node proximity (e.g., spatial distance or temporal similarity). Denote the observed graph signal by $X$, each element of which is the observed traffic flow of a sensor. Let $X^{(t)}$ represent the graph signal at time step $t$. The aim of traffic forecasting is to learn a function $h(\cdot)$ that maps the $T'$ previous speed observations to the traffic speed $T$ steps ahead for the $N$ correlated sensors on the road network:
$$[X^{(t-T'+1)}, \dots, X^{(t)}; G] \xrightarrow{h(\cdot)} X^{(t+T)} \tag{1}$$
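As an illustration of this formulation, the pairing of past observations with a future target can be sketched as a sliding window over the sequence of graph signals. The function name `make_windows` and its parameters `n_hist` and `horizon` are hypothetical and not part of the original work; this is only a minimal sketch of how training samples could be formed.

```python
from typing import List, Tuple

Signal = List[float]  # one graph signal: one reading per sensor


def make_windows(series: List[Signal], n_hist: int, horizon: int
                 ) -> List[Tuple[List[Signal], Signal]]:
    """Pair each window of n_hist past signals with the signal `horizon` steps ahead.

    Hypothetical helper: `series[t]` is the graph signal at time step t.
    """
    samples = []
    last_start = len(series) - n_hist - horizon
    for start in range(last_start + 1):
        history = series[start:start + n_hist]          # steps start .. start+n_hist-1
        target = series[start + n_hist + horizon - 1]   # `horizon` steps after the window
        samples.append((history, target))
    return samples


# Toy data: 20 time steps, 2 sensors.
series = [[float(t), float(t) + 0.5] for t in range(20)]
samples = make_windows(series, n_hist=12, horizon=3)
```

With 5-minute aggregation, `n_hist=12` corresponds to one hour of history and `horizon=3` to a 15-minute-ahead target.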
2.2 Convolution on graphs
Unlike ordinary convolution, which processes regular grids such as images or videos, graph convolution mainly comes in two types. One is based on the spectrum of the graph Laplacian, namely, extending convolutions to graphs in the spectral domain by finding the corresponding Fourier basis [?]. The other generalizes spatial neighbourhoods by rearranging the neighbours of vertices in a graph so that regular convolution can be applied [?].
Spectral graph convolution can extract local features with different receptive fields from non-Euclidean structures [?]. It is defined over a graph $G = (V, A)$, where $V$ is the set of all vertices in the graph and $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix whose entries represent a certain distance between vertices. Let the normalized graph Laplacian be $L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^{T}$, where $I_N$ is the identity matrix and $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. $U$ is the Fourier basis, composed of the eigenvectors of the Laplacian $L$. The graph signal $x$ is filtered by a diagonal kernel $\Theta$ through multiplication between $\Theta$ and the graph Fourier transform $U^{T} x$:
$$y = \Theta *_{G} x = U \Theta U^{T} x \tag{2}$$
where the kernel $\Theta$ is a set of trainable parameters and $y$ denotes the output of this GCN layer.
To reduce the number of parameters and obtain a kernel with better spatial localization, the kernel can be redesigned as an expansion in Chebyshev polynomials $T_k(\cdot)$. It has a truncated order $K$ and uses the largest eigenvalue $\lambda_{max}$ of $L$ to rescale $\Lambda$: $\tilde{\Lambda} = 2\Lambda / \lambda_{max} - I_N$ [?].
Then we can reformulate Equation 2 as:
$$y = \Theta *_{G} x \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) x \tag{3}$$
where $\tilde{L} = 2L / \lambda_{max} - I_N$ is the scaled Laplacian and $\theta_k$ are parameters trained by backpropagation.
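The Chebyshev form avoids an explicit eigendecomposition because the polynomial terms obey the recurrence $T_k(z) = 2 z T_{k-1}(z) - T_{k-2}(z)$ with $T_0 = 1$ and $T_1 = z$. The following is a minimal pure-Python sketch of that recurrence applied to a single-channel graph signal. The names `mat_vec` and `cheb_graph_conv` are hypothetical, the scaled Laplacian is assumed to be precomputed, and the number of terms is simply taken from the length of the coefficient list.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def mat_vec(m: Matrix, v: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def cheb_graph_conv(l_scaled: Matrix, x: Vector, theta: List[float]) -> Vector:
    """Return sum_k theta[k] * T_k(l_scaled) x via the Chebyshev recurrence."""
    t_prev = list(x)                                  # T_0(L~) x = x
    out = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return out
    t_curr = mat_vec(l_scaled, x)                     # T_1(L~) x = L~ x
    out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        lt = mat_vec(l_scaled, t_curr)
        t_next = [2.0 * a - b for a, b in zip(lt, t_prev)]  # T_k = 2 L~ T_{k-1} - T_{k-2}
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out


# Two connected nodes: L = [[1, -1], [-1, 1]], lambda_max = 2, so L~ = L - I.
l_scaled = [[0.0, -1.0], [-1.0, 0.0]]
y = cheb_graph_conv(l_scaled, [1.0, 2.0], theta=[0.5, 1.0, 2.0])
```

Each additional term costs one more matrix-vector product, so a filter with a few terms touches only a few hops of neighbours around each node.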
2.3 Similarity of Temporal Sequences
Generally speaking, methods for measuring the similarity between time series can be divided into three categories: (1) time-step-based, such as Euclidean distance, which reflects point-wise temporal similarity; (2) shape-based, such as Dynamic Time Warping [?], which follows the appearance of the trend; (3) change-based, such as the Gaussian Mixture Model (GMM) [?], which reflects the similarity of the data generation process.
In this work, we use Dynamic Time Warping to measure the similarity, i.e., the shape of the time series, between different roads in order to predict future time series. Given two time series $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_m)$ of lengths $n$ and $m$, we first introduce a series distance matrix $M_{n \times m}$ whose entry $M_{i,j} = |x_i - y_j|$ is the Euclidean distance between two series points. Then we can define the cost matrix (accumulated distance matrix) $M_c$:
$$M_c(i, j) = M_{i,j} + \min\big(M_c(i-1, j),\ M_c(i, j-1),\ M_c(i-1, j-1)\big) \tag{4}$$
After iterating over $i$ and $j$ (i.e., $i$ increases from 1 to $n$ and $j$ from 1 to $m$), $M_c(n, m)$ is the final distance between $X$ and $Y$ under the best alignment, which represents the similarity between the two time series.
From Equation 4 we can tell that Dynamic Time Warping is an algorithm based on dynamic programming whose core is solving for the warping curve, i.e., the match-up of series points $x_i$ and $y_j$. In other words, the "warping path" $\Omega = (\omega_1, \omega_2, \dots, \omega_\lambda, \dots)$
is generated through the iterations of Equation 4. Its element $\omega_\lambda = (i, j)$ means the match-up of $x_i$ and $y_j$. The warping path starts from $(1, 1)$ and ends with $(n, m)$, so every point of $X$ and $Y$ must appear in $\Omega$. Moreover, $i$ and $j$ in $\Omega$ must increase monotonically to avoid crossover of match-ups. For instance, given $\omega_\lambda = (i, j)$ and $\omega_{\lambda+1} = (i', j')$, then $i \le i'$ and $j \le j'$.
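To make the dynamic programming concrete, the following is a minimal sketch of the accumulated cost recursion and of recovering the warping path by backtracking. It assumes scalar series points with absolute difference as the point-wise distance. The function name `dtw` is hypothetical, and this plain quadratic version is not necessarily the implementation used in the experiments.

```python
from typing import List, Tuple


def dtw(x: List[float], y: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the DTW distance and the warping path as 1-based (i, j) pairs."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j]: accumulated distance aligning x[:i] with y[:j]; row/column 0 is padding.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])              # point-wise distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from (n, m) to (1, 1) along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i, j))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


distance, path = dtw([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
# distance is 0.0: the repeated 2.0 in the second series is matched to the same point.
# path is [(1, 1), (2, 2), (2, 3), (3, 4)]: it starts at (1, 1), ends at (n, m),
# and both indices never decrease.
```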
3 Proposed Model: 3DTGCN
In this section, we explicitly formalize the spatiotemporal traffic prediction problem and describe our 3D Temporal Graph Convolutional Networks.
3.1 Graph Generation
Unlike previously proposed models, which require a spatial adjacency matrix, 3DTGCN learns the roads' intrinsic temporal patterns by calculating the distance between their corresponding time series. This way of constructing the graph is completely data-driven and helps to capture more effective information than a spatial graph given a priori. For instance, if traffic data are aggregated every 5 minutes, then each road has 288 time steps in one day. Given the time series of one road and the time series of another, we can use the Dynamic Time Warping algorithm to find the optimal match and calculate the distance between the two series.
As shown in Figure 2, given two roads' time series of length 288, we can obtain their warping path. The distance between the two time series can be calculated by Equation 4 (i.e., the final entry of the cost matrix, at position (288, 288) in this case). From the figure we can tell that the warping path extends along the diagonal because the trends of the two time series are similar; consequently, the two matched time indices in each element of the warping path are close to each other.
Then we generate the topology of the network as an adjacency matrix $A$. For each road $i$, we pick its top $k$ most similar roads $j$ and let $A_{ij} = 1$, while all other entries are $0$. Moreover, it is possible that $A_{ij} = 1$ while $A_{ji} = 0$; in that case we reassign $A_{ji} = 1$, so that $A$ is symmetric. After this treatment, the constructed $A$ can be applied in our 3DTGCN model, described in the following.
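A minimal sketch of this selection step is given below, under the assumption that the reassignment makes the matrix symmetric (an edge is kept whenever either road selects the other). The name `topk_adjacency` and the precomputed pairwise distance matrix `dist` are hypothetical; smaller distance means higher similarity.

```python
from typing import List


def topk_adjacency(dist: List[List[float]], k: int) -> List[List[int]]:
    """Build a binary adjacency matrix from pairwise time-series distances."""
    n = len(dist)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Rank the other roads by distance to road i and keep the k closest.
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:k]:
            adj[i][j] = 1
    # Assumed symmetrisation: keep an edge if either endpoint selected the other.
    for i in range(n):
        for j in range(n):
            if adj[i][j] == 1:
                adj[j][i] = 1
    return adj


dist = [[0.0, 1.0, 5.0, 9.0],
        [1.0, 0.0, 2.0, 8.0],
        [5.0, 2.0, 0.0, 3.0],
        [9.0, 8.0, 3.0, 0.0]]
adj = topk_adjacency(dist, k=1)
```

Note that after symmetrisation a road may have more than `k` neighbours, as road 1 does in this toy example.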
3.2 3D Graph Convolution Networks
3D Graph Convolutional Layer
Many existing approaches deal with spatial and temporal dependencies separately: they apply graph convolution to spatial dependencies and use a 1D CNN [?] or RNN-based models [?] to extract the temporal dependency along the time axis. For instance, if a 1D CNN is deployed in the temporal direction, the output of each 1D convolution can be written as
(5) 
where $K_t$ is the size of the convolutional kernel on the time axis at time step $t$.
We now propose a 3D graph convolutional operation on all dimensions, including graph topology and temporal direction.
For the input () with channels, it can be extended to multidimensional arrays . The 3D graph convolutional layer integrates all dimensions together:
(6)  
where and are the size of input and output of this 3D graph convolutional layer, respectively and is the parameter to be trained in each output channel of this layer. From Equation 6, the graph convolution operator of each layer could be denoted as ”” with .
The 3D graph convolutional layer scans $K_t$ neighbours on the time axis without padding and the $K$-order neighbourhood of the temporal graph at the same time. This shortens the length of the sequence by $K_t - 1$ each time. It is followed by a gated linear unit (GLU) whose input is $[P\ Q]$, i.e., the layer output split in half along the channel dimension. As a result, the final output of the 3D graph convolutional layer is $P \odot \sigma(Q)$, where $\odot$ denotes the Hadamard product and $\sigma$ denotes the sigmoid function.
This integrated design of 3D graph convolution allows us to learn graph structure and temporal dynamics jointly. It also makes it easy to build multilayer 3D graph convolutional structures.
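The gating and the shrinking sequence length can be illustrated with a short sketch. It assumes the standard GLU form (first half of the channels gated by the sigmoid of the second half) and a temporal convolution without padding. The helper names are hypothetical, and the layer count used in the example is inferred from the architecture description rather than stated as a formula in the text.

```python
import math
from typing import List


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def glu(channels: List[float]) -> List[float]:
    """Split the channel vector in half and gate the first half by the second."""
    half = len(channels) // 2
    p, q = channels[:half], channels[half:]
    return [a * sigmoid(b) for a, b in zip(p, q)]   # Hadamard product P * sigmoid(Q)


def valid_conv_length(seq_len: int, kernel: int, layers: int = 1) -> int:
    """Sequence length left after `layers` unpadded convolutions of size `kernel`."""
    return seq_len - layers * (kernel - 1)


gated = glu([2.0, -1.0, 0.0, 0.0])                  # sigmoid(0) = 0.5 halves each input
remaining = valid_conv_length(12, kernel=2, layers=8)
```

For example, with 12 input time steps, a temporal kernel of size 2 and eight stacked layers (four blocks of two layers each), 4 time steps would remain for the output block.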
3.3 The Entire Architecture of 3DTGCN Network
Figure 3 sketches the overall architecture of our proposed 3DTGCN model. It consists of four 3D graph convolutional blocks (3DConv blocks) and one output block. Each 3DConv block contains two 3D graph convolutional layers and a layer normalization layer to prevent overfitting. The output block consists of several 3D graph convolutional layers or 1D temporal convolutional layers and a weight-sharing fully-connected output layer that produces the prediction.
The $L_1$ loss and $L_2$ loss are used together to train our model, and the loss function of the 3DTGCN model can be formulated as below:
(7) 
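As a rough illustration of training with both error terms, the sketch below sums a mean absolute error and a mean squared error. The function name `l1_l2_loss` is hypothetical, and the equal default weights are an assumption: the relative weighting of the two terms is not recoverable from the text.

```python
from typing import List


def l1_l2_loss(pred: List[float], target: List[float],
               l1_weight: float = 1.0, l2_weight: float = 1.0) -> float:
    """Weighted sum of mean absolute error and mean squared error.

    The weights are assumed values for illustration only.
    """
    n = len(pred)
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / n
    l2 = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    return l1_weight * l1 + l2_weight * l2


loss = l1_l2_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # MAE 2/3 plus MSE 4/3
```

The absolute term is less sensitive to occasional large deviations, while the squared term penalises them more strongly, which is one common motivation for combining the two.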
In summary, our 3DTGCN model has several advantages:

3DTGCN does not require a spatial adjacency matrix; instead, it constructs a temporal adjacency matrix to learn the temporal patterns of different roads in a purely data-driven way.

The 3D graph convolution integrates all dimensions (i.e., the time axis on each road and the correlation between different roads) into one graph convolutional network. This design offers a better balance between training efficiency and effectiveness of feature learning on complex spatiotemporal graphs, compared with STGCN and DCRNN.

3DTGCN can be applied to many other tasks with spatiotemporal features. Its general framework can learn spatiotemporal dependencies between participants. By calculating the similarity between time series, 3DTGCN can extract important temporal patterns from participants that might appear uncorrelated and make accurate predictions.
4 Experiments
Table 1: Performance comparison on PeMSD7(M) and PeMSD7(L) (15/ 30/ 60 min).

| Model | PeMSD7(M) MAE | PeMSD7(M) MAPE (%) | PeMSD7(M) RMSE | PeMSD7(L) MAE | PeMSD7(L) MAPE (%) | PeMSD7(L) RMSE |
|---|---|---|---|---|---|---|
| HA | 4.01 | 10.61 | 7.20 | 4.60 | 12.50 | 8.05 |
| LSVR | 2.49/ 3.46/ 4.94 | 5.91/ 8.42/ 12.41 | 4.55/ 6.44/ 9.08 | 2.69/ 3.85/ 4.79 | 6.27/ 9.48/ 12.42 | 4.88/ 7.10/ 8.72 |
| FNN | 2.53/ 3.73/ 5.28 | 6.05/ 9.48/ 13.73 | 4.46/ 6.46/ 8.75 | 2.61/ 3.71/ 5.36 | 6.11/ 9.20/ 14.68 | 4.74/ 6.76/ 9.09 |
| FCLSTM | 3.57/ 3.92/ 4.16 | 8.60/ 9.55/ 10.10 | 6.20/ 7.03/ 7.51 | 4.36/ 4.51/ 4.66 | 11.10/ 11.41/ 11.69 | 7.68/ 7.94/ 8.20 |
| STGCN | 2.24/ 3.02/ 4.01 | 5.20/ 7.27/ 9.77 | 4.07/ 5.70/ 7.55 | 2.37/ 3.27/ 4.35 | 5.56/ 7.98/ 11.17 | 4.32/ 6.21/ 8.27 |
| DCRNN | 2.25/ 2.98/ 3.83 | 5.30/ 7.39/ 9.85 | 4.04/ 5.58/ 7.19 | 2.36/ 3.24/ 4.34 | 5.51/ 8.18/ 11.91 | 4.45/ 6.31/ 8.33 |
| 3DTGCN | 2.23/ 2.97/ 3.65 | 5.13/ 7.08/ 8.79 | 3.93/ 5.31/ 6.66 | 2.27/ 3.16/ 3.79 | 5.31/ 7.85/ 9.76 | 4.18/ 5.71/ 7.13 |
4.1 Datasets
Our model is evaluated on two real-world traffic datasets, which are used by two related state-of-the-art models: STGCN [?] and DCRNN [?].
PeMSD7
has a medium-scale and a large-scale version, PeMSD7 (M) and PeMSD7 (L), containing 228 and 1,026 sensors respectively in District 7 of California. The data cover the weekdays of May and June 2012.
PemsBay
has 325 sensors in the Bay Area and was collected over 6 months, from January 2017 to June 2017.
These datasets are collected in real time from the California Transportation Agencies (Caltrans) Performance Measurement System (PeMS) by over 39,000 sensor stations, which are deployed in the major metropolitan areas of the California highway system [?]. The data are aggregated into 5-minute intervals (288 time steps per day). To compare strictly with the state-of-the-art models, we follow all data preprocessing methods in each paper, such as (1) the proportion and content of the training, validation and test sets, and (2) using the Gaussian kernel [?] to construct the spatial adjacency matrix.
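For reference, a thresholded Gaussian kernel of the kind commonly used for spatial adjacency in this line of work can be sketched as follows. The exact form, the bandwidth `sigma2` and the threshold `epsilon` are assumptions for illustration, since the text does not specify them; `gaussian_kernel_adjacency` is a hypothetical name.

```python
import math
from typing import List


def gaussian_kernel_adjacency(dist: List[List[float]], sigma2: float,
                              epsilon: float) -> List[List[float]]:
    """w_ij = exp(-d_ij^2 / sigma2) for i != j when that value >= epsilon, else 0."""
    n = len(dist)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                         # no self-loops
            value = math.exp(-dist[i][j] ** 2 / sigma2)
            if value >= epsilon:                 # drop weak links to keep the graph sparse
                w[i][j] = value
    return w


road_dist = [[0.0, 1.0, 3.0],
             [1.0, 0.0, 2.0],
             [3.0, 2.0, 0.0]]
w = gaussian_kernel_adjacency(road_dist, sigma2=1.0, epsilon=0.1)
```

In this toy example only the closest pair of sensors stays connected, because the kernel value decays quickly with distance.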
4.2 Experimental Settings and Baselines
All experiments are compiled and tested on a Linux cluster (CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.20GHz, GPU: Tesla P40). All model parameters are fine-tuned by grid search based on performance on the validation set. Each prediction task uses the past 60 minutes (i.e., a time window of 12 time steps) to forecast traffic conditions in the next 15, 30 and 60 minutes (i.e., 3, 6 and 12 time steps ahead).
Evaluation Metric
Several criteria are introduced to evaluate 3DTGCN, including the Mean Absolute Percentage Errors (MAPE), the Mean Absolute Errors (MAE) and the Root Mean Squared Errors (RMSE). All of them are used widely in traffic prediction tasks.
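The three criteria follow their standard definitions, sketched below in plain Python. The function names are hypothetical, and the MAPE sketch assumes that no ground-truth value is zero.

```python
import math
from typing import List


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute percentage error, in percent (assumes non-zero y_true)."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


speeds_true = [50.0, 50.0]
speeds_pred = [45.0, 55.0]
scores = (mae(speeds_true, speeds_pred),
          mape(speeds_true, speeds_pred),
          rmse(speeds_true, speeds_pred))
```

For these toy speeds, each prediction is off by 5 units, i.e., 10% of the true value, so MAE and RMSE are both 5 and MAPE is 10%.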
3DTGCN model The channels of each 3D graph convolutional layer in 3DConv block is 64. Receptive field of temporal graph is set to 3 and is set to 2. We use GLU as activation function in 3dConv block and sigmoid in output block. The learning rate is set to with a decay rate of after epochs. We train our models by minimizing the mean square error and mean absolute error using Adam for epochs with batch size as .
Baselines
We compare our model with several baselines as follows:

HA Historical Average (HA), which treats the traffic speed as a seasonal process and uses the weighted average of the past several seasons as the prediction.

SVR Support Vector Regression (SVR), which uses linear support vector machine for regression tasks.

FNN Feed-Forward Neural Network (FNN), which is a classical neural network architecture with two hidden layers, trained with RMSE as the loss function.

FCLSTM Fully-Connected LSTM [?], which is a recurrent neural network with fully connected LSTM hidden units.

DCRNN Diffusion Convolutional Recurrent Neural Network (DCRNN) [?], which models spatiotemporal dependencies by integrating graph convolution into gated recurrent units.

STGCN Spatio-Temporal Graph Convolutional Networks (STGCN) [?], which models spatiotemporal dependencies by integrating graph convolution into convolutional structures.
All neural network based models are implemented in TensorFlow [?].
4.3 Experiment Results
In this section, we compare our model with the baselines on the two datasets, as shown in Tables 1 and 4. Although all methods perform well in short-term prediction, their performance varies greatly in long-term prediction. Deep learning models generally achieve better performance than traditional machine learning models. In particular, STGCN and DCRNN achieve significant improvements over the other deep learning approaches since they extract additional information from the spatial topology graph. 3DTGCN achieves state-of-the-art performance, especially when it is combined only with the temporal graph, demonstrating the importance of our proposed graph construction.
Accumulated Error of SequencetoSequence Prediction
RNN-based models and CNN models differ especially in the form of their output: while RNN-based models produce the next few time steps recursively, GCNs can either predict the next few time steps recursively or directly predict the target time step. Generally, RNN-based models perform better in time series tasks since strategies such as scheduled sampling [?], which reduce accumulated error, can be adopted in the sequence-to-sequence architecture. To compare these two types of output, we check the performance of 3DTGCN when (1) directly predicting the target time step and (2) predicting the values of the next time steps recursively. As Table 2 shows, 3DTGCN is better suited to single-step prediction: when predicting recursively it performs worse because of accumulated error, and its performance is then close to that of DCRNN. However, 3DTGCN achieves better training efficiency, since convolution-type models have fewer parameters than RNN-based models and STGCN.
Table 2: Recursive versus direct prediction on PeMSD7(M) (15/ 30/ 60 min).

| Model | MAE | MAPE (%) | RMSE |
|---|---|---|---|
| DCRNN | 2.25/ 2.98/ 3.83 | 5.30/ 7.39/ 9.85 | 4.04/ 5.58/ 7.19 |
| STGCN | 2.24/ 3.02/ 4.01 | 5.20/ 7.27/ 9.77 | 4.07/ 5.70/ 7.55 |
| 3DTGCN (recursive) | 2.25/ 2.97/ 3.77 | 5.17/ 7.10/ 9.05 | 4.06/ 5.59/ 7.19 |
| 3DTGCN (direct) | 2.23/ 2.97/ 3.65 | 5.13/ 7.08/ 8.79 | 3.93/ 5.31/ 6.66 |

Table 3: Spatial versus temporal graph construction on PeMSD7(M) (15/ 30/ 60 min).

| Model | MAE | MAPE (%) | RMSE |
|---|---|---|---|
| STGCN (spatial) | 2.24/ 3.02/ 4.01 | 5.20/ 7.27/ 9.77 | 4.07/ 5.70/ 7.55 |
| STGCN (temporal) | 2.24/ 3.02/ 3.92 | 5.19/ 7.13/ 9.29 | 4.06/ 5.61/ 7.15 |
| DCRNN (spatial) | 2.25/ 2.98/ 3.83 | 5.30/ 7.39/ 9.85 | 4.04/ 5.58/ 7.19 |
| DCRNN (temporal) | 2.26/ 2.98/ 3.66 | 5.33/ 7.33/ 9.27 | 4.04/ 5.50/ 6.73 |
| 3DTGCN (spatial) | 2.24/ 3.00/ 3.76 | 5.21/ 7.12/ 8.96 | 3.96/ 5.37/ 6.64 |
| 3DTGCN (temporal) | 2.23/ 2.97/ 3.65 | 5.13/ 7.08/ 8.79 | 3.93/ 5.31/ 6.66 |

Table 4: Performance comparison on PEMSBAY (15/ 30/ 60 min).

| Model | MAE | MAPE (%) | RMSE |
|---|---|---|---|
| HA | 2.88 | 6.80 | 5.59 |
| SVR | 1.85/ 2.48/ 3.28 | 3.80/ 5.50/ 8.00 | 3.59/ 5.18/ 7.08 |
| FNN | 1.49/ 2.04/ 2.88 | 3.09/ 4.59/ 7.11 | 3.25/ 4.45/ 5.99 |
| FCLSTM | 2.20/ 2.34/ 2.55 | 4.85/ 5.30/ 5.84 | 4.28/ 4.74/ 5.31 |
| STGCN | 1.41/ 1.84/ 2.37 | 3.02/ 4.19/ 5.39 | 3.02/ 4.19/ 5.27 |
| DCRNN | 1.38/ 1.74/ 2.07 | 2.9/ 3.9/ 4.9 | 2.95/ 3.97/ 4.74 |
| 3DTGCN | 1.34/ 1.69/ 2.07 | 2.78/ 3.76/ 4.76 | 2.79/ 3.71/ 4.56 |

Temporal v.s. Spatial Pattern
Previous works focus on incorporating the spatial topology of roads into time series prediction. In contrast, our model relies on dependencies between temporal patterns and achieves the best performance in both short-term and long-term forecasting. The results for the two types of graph construction are shown in Table 3. 3DTGCN performs extremely well on PeMSD7 because the road network of PeMSD7 is more complicated and systematic.
3DTGCN does not require prior knowledge of the spatial topology. Instead, it builds graphs on temporal dependency. In Figure 1, the left panel is the spatial graph and the right panel is the temporal graph; both have a sparsity of 5%. The reason why the temporal graph tends to be better than the spatial one is intuitive: (1) real data is full of noise, so similar temporal dependency between different roads (possibly far apart) is much more important than spatial causality between neighbours; (2) traffic prediction is a time-series prediction task, so learning temporal patterns is more directly meaningful.
A natural concern about dynamic time warping is its computational complexity. Although the $O(n^2)$ cost of comparing two series of length $n$ is somewhat high, it is acceptable in the traffic prediction problem since the length of each time series is 288 when the time step is 5 min. PeMSD7(L), with 1026 roads, is one of the biggest datasets in academic traffic speed research, and even there the scalable version of the DTW algorithm remains acceptable.
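A back-of-the-envelope count makes the cost concrete. The sketch below assumes the plain quadratic DTW, applied once to every unordered pair of roads with one day of 5-minute data per road. The function name `dtw_cell_updates` is hypothetical, and a windowed or otherwise accelerated DTW would need fewer updates.

```python
def dtw_cell_updates(num_roads: int, series_len: int) -> int:
    """Count cost-matrix cells filled when every unordered road pair is compared."""
    pairs = num_roads * (num_roads - 1) // 2      # each pair is compared once
    return pairs * series_len * series_len        # one full cost matrix per pair


medium = dtw_cell_updates(228, 288)    # PeMSD7(M): 2146424832 cell updates
large = dtw_cell_updates(1026, 288)    # PeMSD7(L): 43614028800 cell updates
```

The graph is built once before training, so this is a one-off preprocessing cost rather than a per-epoch cost.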
5 Conclusion and Future works
In this paper, we propose an original and effective deep learning framework, 3DTGCN, for traffic prediction. It learns the relations between roads by comparing the temporal similarity of the roads' time series, and merges spatial and temporal information simultaneously in the 3D graph convolutional layers. Numerical experiments show that our model outperforms existing state-of-the-art models on two real-world datasets. Notably, our model does not require spatial topology. 3DTGCN also achieves faster training and better convergence. Our new way of graph generation opens a promising direction for future graph-based learning approaches, since it removes the need for an adjacency matrix based on spatial information, which in many cases is difficult to generate or obtain.
References
 [Abadi et al., 2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
 [Bengio et al., 2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
 [Berndt and Clifford, 1994] Donald J Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, pages 359–370. Seattle, WA, 1994.
 [Bruna et al., 2013] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
 [Castro et al., 2012] Pablo Samuel Castro, Daqing Zhang, and Shijian Li. Urban traffic modelling and prediction using large scale taxi gps traces. In International Conference on Pervasive Computing, pages 57–72. Springer, 2012.
 [Chen et al., 2001] Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. Freeway performance measurement system: mining loop detector data. Transportation Research Record, 1748(1):96–102, 2001.
 [Defferrard et al., 2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
 [Fitzpatrick et al., 2000] Kay Fitzpatrick, Lily Elefteriadou, Douglas W Harwood, JM Collins, J McFadden, Ingrid B Anderson, Raymond A Krammes, Nelson Irizarry, Kelly D Parma, Karin M Bauer, et al. Speed prediction for two-lane rural highways. Technical report, 2000.
 [Haklay and Weber, 2008] Mordechai Haklay and Patrick Weber. Openstreetmap: Usergenerated street maps. IEEE Pervasive Computing, 7(4):12–18, 2008.
 [Hammond et al., 2011] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
 [Hong, 2011] Wei-Chiang Hong. Traffic flow forecasting by seasonal svr with chaotic simulated annealing algorithm. Neurocomputing, 74(12-13):2096–2107, 2011.
 [Kriegel et al., 2008] Hans-Peter Kriegel, Matthias Renz, Matthias Schubert, and Andreas Zuefle. Statistical density prediction in traffic networks. In Proceedings of the 2008 SIAM International Conference on Data Mining, pages 692–703. SIAM, 2008.
 [Li et al., 2018] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. 2018.
 [Ma et al., 2015] Xiaolei Ma, Zhimin Tao, Yinhai Wang, Haiyang Yu, and Yunpeng Wang. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies, 54:187–197, 2015.
 [Ma et al., 2017] Xiaolei Ma, Zhuang Dai, Zhengbing He, Jihui Ma, Yong Wang, and Yunpeng Wang. Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors, 17(4):818, 2017.
 [Niepert et al., 2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023, 2016.
 [Okutani and Stephanedes, 1984] Iwao Okutani and Yorgos J Stephanedes. Dynamic prediction of traffic volume through kalman filtering theory. Transportation Research Part B: Methodological, 18(1):1–11, 1984.
 [Ostring and Sirisena, 2001] Sven AM Ostring and Harsha Sirisena. The influence of long-range dependence on traffic prediction. In ICC 2001. IEEE International Conference on Communications. Conference Record (Cat. No. 01CH37240), volume 4, pages 1000–1005. IEEE, 2001.
 [Povinelli et al., 2004] Richard J Povinelli, Michael T Johnson, Andrew C Lindgren, and Jinjin Ye. Time series classification using gaussian mixture models of reconstructed phase spaces. IEEE Transactions on Knowledge and Data Engineering, 16(6):779–783, 2004.
 [Shuman et al., 2012] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. arXiv preprint arXiv:1211.0053, 2012.
 [Stathopoulos and Karlaftis, 2003] Anthony Stathopoulos and Matthew G Karlaftis. A multivariate state space approach for urban traffic flow modeling and prediction. Transportation Research Part C: Emerging Technologies, 11(2):121–135, 2003.
 [Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 [Vlahogianni et al., 2005] Eleni I Vlahogianni, Matthew G Karlaftis, and John C Golias. Optimized and meta-optimized neural networks for short-term traffic flow prediction: A genetic approach. Transportation Research Part C: Emerging Technologies, 13(3):211–234, 2005.
 [Vlahogianni, 2015] Eleni I Vlahogianni. Computational intelligence and optimization for transportation big data: Challenges and opportunities. In Engineering and Applied Sciences Optimization, pages 107–128. Springer, 2015.
 [Williams and Hoel, 2003] Billy M Williams and Lester A Hoel. Modeling and forecasting vehicular traffic flow as a seasonal arima process: Theoretical basis and empirical results. Journal of transportation engineering, 129(6):664–672, 2003.
 [Wu and Tan, 2016] Yuankai Wu and Huachun Tan. Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework. arXiv preprint arXiv:1612.01022, 2016.
 [Yu et al., ] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting.
 [Yu et al., 2004] Guoqiang Yu, Changshui Zhang, et al. Switching arima model based forecasting for traffic flow. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages ii–429. IEEE, 2004.
 [Zhao et al., 2017] Zheng Zhao, Weihai Chen, Xingming Wu, Peter CY Chen, and Jingmeng Liu. Lstm network: a deep learning approach for short-term traffic forecast. IET Intelligent Transport Systems, 11(2):68–75, 2017.