# Spatio-temporal Graph Convolutional Neural Network: A Deep Learning Framework for Traffic Forecasting

## Abstract

The goal of traffic forecasting is to predict future vital indicators (such as speed, volume and density) of the local traffic network within a reasonable response time. Due to the dynamics and complexity of traffic network flow, typical simulation experiments and classic statistical methods cannot satisfy the requirements of medium- and long-term forecasting. In this work, we propose a novel deep learning framework, the Spatio-Temporal Graph Convolutional Neural Network (ST-GCNN), to tackle this spatio-temporal sequence forecasting task. Instead of applying recurrent models to sequence learning, we build our model entirely on convolutional neural networks (CNNs) with gated linear units (GLU) and highway networks. The proposed architecture fully employs the graph structure of the road networks and enables faster training. Experiments show that ST-GCNN captures comprehensive spatio-temporal correlations throughout complex traffic networks and consistently outperforms state-of-the-art baseline algorithms on several real-world traffic datasets.

## 1 Introduction

Traffic forecasting is one of the most challenging problems in Intelligent Transportation Systems (ITS). Accurate and timely forecasting of multi-scale traffic conditions is of paramount importance for road users, management agencies and the private sector. Widely used transportation services provided by ITS, such as dynamic traffic control, route planning and navigation, also rely on a high-quality assessment of future traffic network conditions at a reasonable cost.

Indicators such as speed, volume and density gathered by various sensors reflect the general status of road traffic conditions. Thus, those measurements are typically chosen as the targets of traffic prediction. Based on the length of prediction, traffic forecasting can be divided into three scales: short-term (5 to 30 min), medium-term (30 to 60 min) and long-term (over an hour). Most prevalent approaches perform well on short forecasting intervals. However, because of the uncertainty and complexity of traffic flow, those methods are unsatisfactory for long-term time-series prediction.

Previous studies on traffic prediction can be roughly divided into two categories, namely, traditional simulation approaches and data-driven methods. For the simulation approaches, traffic flow prediction requires comprehensive and meticulous system modeling based on physical theories and prior knowledge [21]. Even so, the simulation systems and tools still consume massive computational power and require skillful parameter settings to reach a steady state. Nowadays, with the rapid development of real-time traffic data collection, researchers are turning their attention to data-driven methods that exploit the enormous historical traffic records gathered by advanced ITS.

Classic statistical models and machine learning models are the two major categories of data-driven methods. In time-series analysis, the autoregressive integrated moving average (ARIMA) model is one of the most consolidated approaches. It has been applied in various fields and was first introduced into traffic forecasting as early as the 1970s [1]. The ARIMA model can be applied to non-stationary data, for which an integrated (differencing) term is required to make the time series stationary. Many variants of the ARIMA model have been proposed to improve pattern capturing and prediction accuracy, such as seasonal ARIMA (SARIMA) [22] and ARIMA with the Kalman filter [13]. However, the models mentioned above rely heavily on the stationarity assumption of the time series and ignore the spatial correlation within the traffic network. Therefore, time-series models have limited ability to represent highly dynamic and variable traffic flow.

Recently, machine learning methods have shown promising development in traffic studies. Higher prediction accuracy can be achieved by these non-parametric methods, including the $k$-nearest neighbors algorithm (KNN), support vector machine (SVM), and neural network (NN) models (also referred to as deep learning models).

**Deep Learning Approaches** Nowadays, deep learning techniques, deep architectures in particular, have drawn considerable academic and industrial interest. Deep learning methods have been widely and successfully employed in various tasks such as classification, pattern recognition and object detection. In traffic prediction research, the deep belief network (DBN) has been shown to capture the stochastic features and characteristics of traffic flow without hand-engineered feature selection [9]. [14] proposed a stacked autoencoder (SAE) model to discover latent short-term traffic flow features. [4] developed a stacked denoising autoencoder to learn hierarchical representations of urban traffic flow. The approaches mentioned above can learn effective features for short-term traffic prediction. However, it is difficult for a fully-connected neural network to extract representative spatial and temporal features concurrently from a large amount of long-term traffic flow data. Moreover, topological locality and historical memory among the spatio-temporal traffic variables are neglected in those deep learning models, which hinders their predictive power.

Recurrent neural network (RNN) and its variations (e.g. long short-term memory neural network (LSTM), gated recurrent unit (GRU)) show tremendous potential for the traffic prediction with short and long temporal dependency. In spite of the efficient use of temporal dependency, the spatial part is not fully utilized in previous studies. To fill this gap, some researchers use the convolutional neural network (CNN) [11] to extract topological locality of traffic network. CNN model with customized kernels offers a robust algorithm to explore the local relationships between neighboring variants. By combining LSTM and 1-D CNN, [23] designed a feature-level fused architecture CLTFP for short-term traffic flow forecasting. Even simply adopting a straightforward combined strategy, CLTFP still creates an insightful perspective to jointly excavate the spatial and temporal domains of traffic variables.

Traffic network variables are typical structured data with spatio-temporal features. How to effectively model temporal dynamics and topological locality from those high-dimensional variables is the key to resolving the forecasting problem. [24] proposed a convolutional LSTM (ConvLSTM) model, which is an extended fully-connected LSTM (FC-LSTM) with embedded convolutional structures. The ConvLSTM imposes convolution operations on the state-transition procedure of video frames. However, these standard CNNs are restricted to processing regular grid structures (e.g. images, videos, and speech) rather than general domains, so they may not be applicable to structured traffic variables. Recent advances in irregular or non-Euclidean domain modeling provide some useful insights on how to further study the structured data problem. [2] made a primary exploration on generalizing the signal domain of CNNs to arbitrarily structured graphs (e.g. social networks, traffic networks). Several follow-up studies [7] inspired researchers to develop novel combinational methods to reveal hidden features of structured datasets. [17] introduced the graph convolutional recurrent network (GCRN) to simultaneously identify spatial structure and dynamic variation from spatio-temporal sequences. The key challenge of the aforesaid study is to determine the best possible collaboration between recurrent models (e.g. RNN, LSTM or GRU) and graph CNN [5] for the specific dataset. Based on the above principles, [12] successfully employed GRU with graph convolutional layers to predict complex traffic flow. It is noteworthy that recurrent models normally require processing and learning input sequences step by step. As the number of iterations increases, errors gradually accumulate, which leads to drifting convergence. The serialized learning process also limits parallelization of training.

Motivated by graph CNN and convolutional sequence learning, we propose a novel deep learning architecture, the spatio-temporal graph convolutional neural network (ST-GCNN), for long-term traffic forecasting tasks. Our contributions are:

- To the best of our knowledge, this is the first time that purely convolutional structures are applied to extract spatio-temporal features of graph-structured traffic datasets on both the space and time domains simultaneously.
- We propose a novel deep learning architecture that combines graph convolution with a convolutional sequence learning network. Thanks to its purely convolutional architecture, it trains much faster than RNN/LSTM-based models, with an almost 10 times acceleration of training speed.
- The proposed traffic forecasting framework outperforms all the other methods we implemented on both real-world traffic datasets in multiple speed prediction experiments.
- Besides exhibiting strong performance in the traffic prediction domain, our ST-GCNN model is also a general deep learning framework for modeling graph-structured time-series data. It can be applied in other scenarios, such as social network analysis.

## 2 Methodology

### 2.1 Problem Formulation

The purpose of the traffic prediction task is to use previously observed road speed records to forecast the future status of a specified region over a certain period. Historical traffic data measured from $n$ sensor stations over the previous $M$ time steps can be regarded as a matrix of size $M \times n$.

In order to describe the relationship between neighboring sensor stations in the traffic network, we introduce an undirected graph $G = (V, E, W)$, where $V$ is a set of vertices, i.e. sensor stations; $E$ represents a set of edges, indicating the connectivity between those sensors in the network; and $W \in \mathbb{R}^{n \times n}$ denotes the adjacency matrix of $G$. If the topology of the vertices in $G$ can be obtained from raw data, we calculate the value of $W$ based on the connectedness. Otherwise, the adjacency matrix is constructed according to the distance between each pair of sensor stations. Therefore, historical traffic data can be defined on $G$, consisting of graph-structured data frames as Figure ? shows. Now we can formulate our spatio-temporal traffic prediction problem as

$$\hat{v}_{t+1}, \ldots, \hat{v}_{t+H} = \arg\max_{v_{t+1}, \ldots, v_{t+H}} \log P\left(v_{t+1}, \ldots, v_{t+H} \mid v_{t-M+1}, \ldots, v_t; G\right),$$

where $v_t \in \mathbb{R}^n$ is the vector of speeds observed at the $n$ stations at time step $t$ and $H$ is the length of prediction.

### 2.2 Network Architecture

In this section, we describe the general architecture of our spatio-temporal graph CNN (ST-GCNN) for traffic speed prediction. See Figure ? for the graphical illustration of our proposed model. The ST-GCNN is composed of several spatio-temporal convolutional blocks and a fully-connected layer. Each spatio-temporal conv-block is constructed by a highway graph CNN layer and a gated linear temporal CNN layer. We will elaborate the model details in the following sections.

### 2.3 Graph CNN for Extracting Spatial Features

Convolutional neural networks have been successfully used to extract highly meaningful patterns and features from large-scale and high-dimensional datasets. Traffic variables contain hidden local properties, so CNNs are well suited to retrieving their correlation with location and neighbors. However, the standard CNN is not able to tackle the complex urban road network problem. In [5], the authors defined convolutional neural networks on graphs (GCNN) in the context of spectral graph theory. The proposed GCNN model can handle any graph-structured dataset while achieving the same linear computational complexity and constant learning complexity as classical CNNs. Therefore, we employ graph CNN to handle the structured urban traffic data. The input of the graph CNN is converted from the data matrix into a 3-D tensor whose axes are time steps, nodes and channels.

**Graph Convolution** Graph convolution can extract the spatial information efficiently on sparse graphs with only a few trainable parameters. Information among neighboring nodes is grouped and distributed by the graph convolution since the operator can be regarded as applying strictly localized filters to traverse the graph.

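Given an undirected graph $G$ with $n$ vertices and a vector $x$ of size $n$ on $G$, the graph convolution is defined in the spectral domain of $G$. By computing the graph Laplacian $L = D - W$ and the eigendecomposition $L = U \Lambda U^T$ ($D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$; $U$ is an orthogonal matrix), the Fourier transform of $x$ is defined as $\hat{x} = U^T x$. Hence, the graph convolution of $x$ and $y$ [18] is defined as

$$x *_G y = U\left((U^T x) \odot (U^T y)\right),$$

where $\odot$ is the element-wise Hadamard product. Further, we can define the graph convolution on a vector $x$ by the filter $g_\theta = \mathrm{diag}(\theta)$, which is also a diagonal matrix [5]:

$$y = U g_\theta U^T x.$$

In fact, the above equation is equivalent to computing the graph convolution of $x$ and the vector $U\theta$, as the following equations show:

$$x *_G (U\theta) = U\left((U^T x) \odot (U^T U \theta)\right) = U\left((U^T x) \odot \theta\right) = U\,\mathrm{diag}(\theta)\,U^T x = y.$$

Therefore, we can regard $y$ as the graph convolution as well. In order to reduce the number of parameters and localize the filter, $g_\theta$ can be restricted to a polynomial of $\Lambda$: $g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$, where $K$ is the kernel size of the graph convolution and $\theta \in \mathbb{R}^K$ now holds the polynomial coefficients. Then, $y$ can be expanded as

$$y = U g_\theta(\Lambda) U^T x = \sum_{k=0}^{K-1} \theta_k U \Lambda^k U^T x = \sum_{k=0}^{K-1} \theta_k L^k x.$$

A signal on a graph $G$ of $n$ nodes can be described as a matrix $X$ consisting of $c$ vectors of size $n$. Consequently, for a signal $X \in \mathbb{R}^{n \times c_{in}}$, a convolution operation with a kernel tensor $\Theta$ of size $(K \times c_{in} \times c_{out})$ on $X$ is

$$y_j = \sum_{i=1}^{c_{in}} \sum_{k=0}^{K-1} \theta_{k,i,j} L^k x_i, \qquad 1 \le j \le c_{out},$$

where $x_i$ and $y_j$ are the $i$-th input channel and the $j$-th output channel, and $c_{in}$ and $c_{out}$ indicate the number of channels of the input and the output respectively. For more details about the graph convolution, please refer to [2].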
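As an illustration of the polynomial filter described above, the following minimal NumPy sketch applies a filter that is a polynomial of the graph Laplacian and checks that it agrees with the same filter evaluated in the spectral domain. The function names, the toy path graph and the coefficient values are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def graph_laplacian(W):
    """Combinatorial Laplacian L = D - W for a symmetric adjacency matrix W."""
    D = np.diag(W.sum(axis=1))
    return D - W

def poly_graph_conv(L, x, theta):
    """Compute y = sum_k theta[k] * L^k x without any eigendecomposition."""
    y = np.zeros_like(x, dtype=float)
    Lk_x = x.astype(float)            # starts as L^0 x
    for t in theta:
        y += t * Lk_x
        Lk_x = L @ Lk_x               # advance to L^(k+1) x
    return y

def spectral_graph_conv(L, x, theta):
    """Same filter evaluated in the spectral domain: U g(Lambda) U^T x."""
    lam, U = np.linalg.eigh(L)        # L is symmetric, so eigh applies
    g = sum(t * lam**k for k, t in enumerate(theta))
    return U @ (g * (U.T @ x))

# Toy path graph 0-1-2-3 (hypothetical data, not from the paper).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = graph_laplacian(W)
x = np.array([1.0, 0.0, 0.0, 0.0])    # impulse on node 0
theta = [0.5, -0.2, 0.1]              # kernel size K = 3

y_poly = poly_graph_conv(L, x, theta)
y_spec = spectral_graph_conv(L, x, theta)
```

With a kernel of size 3 the polynomial has degree 2, so the impulse on node 0 cannot reach node 3, which is three hops away. This is the localization property that motivates the polynomial restriction.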

**Highway Network** The information from sensor stations on the traffic network may contribute unequally to the speed prediction problem. For example, of two geographically adjacent sensors, one may have a higher impact than the other because it is located on a traffic artery rather than in an alley. To account for this issue, we introduce a data-dependent gate to control the mixing ratio of the input and the output of the spatial graph convolution. A highway network achieves such a gate mechanism [19] by computing an extra gate function for each node. Inspired by this idea, we use a highway graph convolution layer to control information flow through the graph. Concretely, we define the graph convolution on $X$ by two trainable kernels $\Theta^{h}$, $\Theta^{t}$ of size $(K \times c_{in} \times c_{out})$, where $X$ is an $(n \times c_{in})$ signal on $G$, $\Theta *_G X$ denotes the graph convolution of $X$ with kernel $\Theta$, and $\sigma$ is the sigmoid function:

$$H = \Theta^{h} *_G X, \qquad T = \sigma\left(\Theta^{t} *_G X\right).$$

Hence, the final output of the highway graph convolution layer is

$$Y = T \odot \mathrm{ReLU}(H) + (1 - T) \odot X,$$

where $\mathrm{ReLU}$ is the rectified linear unit function. If $X$ and $H$ do not share the same size, we should pad $X$ with a zero tensor or apply a linear transformation to it first. Accordingly, we denote the highway graph convolutional layer as $\mathcal{G}$:

$$\mathcal{G}(X) = T \odot \mathrm{ReLU}(H) + (1 - T) \odot X.$$
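The gating idea in this paragraph can be sketched in a few lines of NumPy. The sketch below assumes a sigmoid gate and a ReLU on the convolution branch, in the style of highway networks [19]. The function names, the kernel layout `(K, c_in, c_out)` and the toy graph are assumptions for illustration, and the sketch only covers the case where input and output have the same number of channels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def graph_conv(L, X, Theta):
    """Multi-channel polynomial graph convolution.

    L: (n, n) Laplacian, X: (n, c_in) signal, Theta: (K, c_in, c_out) kernel.
    Returns the (n, c_out) array sum_k (L^k X) @ Theta[k].
    """
    out = np.zeros((X.shape[0], Theta.shape[2]))
    Lk_X = X
    for Theta_k in Theta:
        out += Lk_X @ Theta_k
        Lk_X = L @ Lk_X
    return out

def highway_graph_conv(L, X, Theta_h, Theta_t):
    """Mix the convolution output with the input through a per-node gate."""
    H = np.maximum(graph_conv(L, X, Theta_h), 0.0)   # ReLU branch
    T = sigmoid(graph_conv(L, X, Theta_t))           # gate values in (0, 1)
    return T * H + (1.0 - T) * X                     # assumes c_in == c_out

# Toy usage on a 4-node path graph (hypothetical data).
rng = np.random.default_rng(0)
n, c, K = 4, 2, 2
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
X = rng.normal(size=(n, c))
out = highway_graph_conv(L, X,
                         rng.normal(size=(K, c, c)),
                         rng.normal(size=(K, c, c)))
```

When the gate is close to 0 for a node, that node's input passes through almost unchanged. When it is close to 1, the convolved features dominate.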

### 2.4 Gated CNN for Extracting Temporal Features

In traffic prediction studies, many models are based on recurrent models, such as FC-LSTM [20] and ConvLSTM [24]. However, RNN models require computing different time steps successively and cannot process a time sequence in parallel. In addition, recurrent models tend to use a sequence-to-sequence method for long-term predictions by iteratively pumping forecasting results of the last time step into the network to predict the next step status. This mechanism introduces error accumulation step by step.

To overcome the aforementioned disadvantages, we employ CNNs instead of RNNs along the time axis to capture temporal features. Recently, researchers from Facebook released a convolutional architecture for sequence modeling with gates and an attention mechanism [6]. This technique avoids the ordered computation that limits the efficiency of traditional recurrent models and allows parallelized and customized training procedures. Motivated by this intuition, we use gated linear units (GLU) to build the temporal CNN layers. The input of the temporal layer is a 3-D tensor $X$ obtained from the highway graph convolution layer, of size $(M \times n \times c_{in})$, standing for time steps, nodes and channels respectively. Furthermore, we design two convolution kernels $W^{a}$, $W^{b}$ of size $(K_t \times c_{in} \times c_{out})$ that are applied exclusively along the time axis. Inevitably, the convolution operations change the number of channels from $c_{in}$ to $c_{out}$ and the size of the time axis from $M$ to $M - K_t + 1$. If we want to maintain the original size of the time axis, a zero tensor of size $((K_t - 1) \times n \times c_{in})$ is concatenated on the left of $X$, denoted as $\hat{X}$. Consequently, the two convolution operations applied on the input are determined by the separate kernels as

$$A = \hat{X} * W^{a} + b^{a}, \qquad B = \hat{X} * W^{b} + b^{b},$$

where $W^{a}, W^{b}$ and $b^{a}, b^{b}$ are trainable variables; $A$ is the output of the convolution operation and $B$ is the input of a gate to control $A$. As a result, the output of the temporal gated CNN is an element-wise product of the convolution output and the gate, which is denoted as $\mathcal{T}(X)$:

$$\mathcal{T}(X) = A \odot \sigma(B).$$

Specifically, we do not have to maintain the same size of the time axis in the last temporal CNN layer. Instead, we change its size from $M$ to 1 for making predictions. Thus, we directly employ a convolution operation with a kernel $W^{o}$ of size $(M \times c_{in} \times c_{out})$ and a ReLU activation function on the input $X$ without padding. That is, $Z = \mathrm{ReLU}(X * W^{o} + b^{o})$, where $X \in \mathbb{R}^{M \times n \times c_{in}}$, $W^{o} \in \mathbb{R}^{M \times c_{in} \times c_{out}}$, $b^{o} \in \mathbb{R}^{c_{out}}$. Therefore, the final output of the last temporal CNN layer is a tensor of size $(1 \times n \times c_{out})$. Eventually, we reshape the output to a matrix $Z$ of size $(n \times c_{out})$ and calculate the speed prediction for the $n$ nodes by applying a linear transformation across channels as $\hat{v} = Z w + b$, where $w$ is a weight matrix and $b$ is a bias.
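The gated temporal convolution and its left zero-padding can be illustrated with a short NumPy sketch. It follows the general gated linear unit pattern (a linear branch multiplied element-wise by a sigmoid gate). The function name, argument names and tensor sizes below are assumptions chosen for the example and do not reproduce the authors' code.

```python
import numpy as np

def temporal_glu_conv(X, W_a, W_b, b_a, b_b, pad=True):
    """Gated 1-D convolution along the time axis.

    X: (M, n, c_in); W_a, W_b: (K_t, c_in, c_out); b_a, b_b: (c_out,).
    With pad=True, K_t - 1 zero frames are prepended so the output keeps
    M time steps and each output step only sees current and past inputs.
    """
    K_t = W_a.shape[0]
    if pad:
        zeros = np.zeros((K_t - 1,) + X.shape[1:])
        X = np.concatenate([zeros, X], axis=0)
    steps = X.shape[0] - K_t + 1
    A = np.zeros((steps, X.shape[1], W_a.shape[2]))
    B = np.zeros_like(A)
    for t in range(steps):
        window = X[t:t + K_t]                              # (K_t, n, c_in)
        A[t] = np.einsum("knc,kcd->nd", window, W_a) + b_a
        B[t] = np.einsum("knc,kcd->nd", window, W_b) + b_b
    return A * (1.0 / (1.0 + np.exp(-B)))                  # A * sigmoid(B)

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
M, n, c_in, c_out, K_t = 12, 5, 1, 4, 3
X = rng.normal(size=(M, n, c_in))
W_a = rng.normal(size=(K_t, c_in, c_out))
W_b = rng.normal(size=(K_t, c_in, c_out))
b_a = np.zeros(c_out)
b_b = np.zeros(c_out)

same = temporal_glu_conv(X, W_a, W_b, b_a, b_b, pad=True)    # (12, 5, 4)
valid = temporal_glu_conv(X, W_a, W_b, b_a, b_b, pad=False)  # (10, 5, 4)
```

Because every output step is computed from an independent window, all steps can be evaluated in parallel. The explicit loop here is only for readability.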

### 2.5 Combining Graph CNN and Temporal CNN

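In order to fuse the spatial and temporal features, we propose a novel model, the spatio-temporal graph CNN (ST-GCNN), which combines the temporal GLU CNN with the spatial highway graph CNN. The model is a stack of several spatio-temporal conv-blocks and a linear output layer. Each block comprises one highway spatial graph convolution layer followed by one temporal GLU convolution layer. The input and the output of the blocks are all 3-D tensors of size $(M \times n \times c)$. For the input $X^{l}$ of block $l$, the output $X^{l+1}$ can be computed by

$$X^{l+1} = \mathcal{T}\left(\mathcal{G}\left(X^{l}\right)\right),$$

where $\mathcal{G}$ and $\mathcal{T}$ denote the highway graph convolution layer and the temporal GLU convolution layer respectively.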

After stacking spatio-temporal conv-blocks, we add an output spatio-temporal conv-block and a fully-connected layer for each node of the sensor graph at the end, as shown in Figure ?.

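The loss function of our model for predicting the next time step can be written as

$$L(\theta) = \sum_t \left\| \hat{v}(v_{t-M+1}, \ldots, v_t; \theta) - v_{t+1} \right\|^2,$$

where $\theta$ are all trainable variables in our ST-GCNN model, $v_{t+1}$ is the ground truth and $\hat{v}(\cdot)$ denotes the model’s prediction.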

Our ST-GCNN model is a universal framework for processing structured time series. It can tackle massive urban traffic network modeling and prediction problems, and it can also be applied to more general spatio-temporal sequence forecasting challenges. The graph convolution and the gates in the highway layers extract useful spatial features while filtering out useless information, and the temporal convolutions combined with gates select the most important temporal features. Our model is entirely composed of convolutional layers and is therefore well suited to parallelization. Furthermore, the ST-GCNN framework is not based on sequence-to-sequence learning, so the model can obtain a much more accurate estimate without accumulating error step by step.

## 3 Experiments

### 3.1 Dataset Description

We use two different traffic datasets, collected and processed by Beijing University of Technology and the California Department of Transportation respectively. Each dataset contains key indicators and geographic information with corresponding timestamps, as detailed below.

##### Beijing East Ring No.4 Road (BJER4)

The BJER4 dataset was gathered from a certain area of the east ring No.4 routes in Beijing City by double-loop detectors. There are 12 roads selected for our experiment (as Figure ? shows; R207 & R208 were dropped because they overlap). The traffic data are aggregated every 5 minutes. The time period used in this study is from 1st July to 31st August, 2014, excluding weekends. We select the first month of historical speed records as the training set, and the rest serves as the validation and test sets respectively.

##### PeMS District 7 (PeMSD7)

The PeMSD7 dataset was collected from the Caltrans Performance Measurement System (PeMS) in real time by over 39,000 individual sensor stations, which are deployed across all major metropolitan areas of the California state highway system [3]. The dataset used in our numerical experiments is also aggregated into 5-minute intervals from individual 30-second data samples for each sensor station. We randomly select 228 stations (shown in Figure ?) within District 7 of California as the data source. The time range of the PeMSD7 dataset covers the weekdays of May and June of 2012. We split the training and test sets based on the same principle.

### 3.2 Data Preprocessing

The traffic variables of the two datasets are all aggregated into 5-minute intervals. Thus, each node in the traffic graph contains 288 data points per day. After the cleaning procedure, we apply linear interpolation combined with time features to fill in the missing values. In addition, the input data of the neural networks are uniformly normalized by the Z-score method.
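A minimal sketch of these two preprocessing steps is given below. It interpolates only over the time index and does not model the "time features" mentioned above, and it assumes that the normalization statistics come from the training split. Both choices, as well as the function names, are assumptions for illustration.

```python
import numpy as np

# 5-minute aggregation gives 24 * 60 / 5 = 288 points per node per day.
POINTS_PER_DAY = 24 * 60 // 5

def fill_missing_linear(series):
    """Linearly interpolate NaN entries of a 1-D series over its time index."""
    series = np.asarray(series, dtype=float)
    idx = np.arange(series.size)
    known = ~np.isnan(series)
    return np.interp(idx, idx[known], series[known])

def zscore(train, other):
    """Normalize both splits with the mean and std of the training split."""
    mu, sigma = train.mean(), train.std()
    return (train - mu) / sigma, (other - mu) / sigma, (mu, sigma)

# Toy usage: two missing readings between 60 and 45 (hypothetical speeds).
filled = fill_missing_linear([60.0, np.nan, np.nan, 45.0])
train_z, test_z, stats = zscore(np.array([1.0, 2.0, 3.0]), np.array([2.0]))
```

Keeping the returned `(mu, sigma)` pair allows predictions to be mapped back to the original speed scale before computing error metrics.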

In BJER4 dataset, the graph topology of traffic network in Beijing east No.4 ring route system is constructed by the geographical metadata in original sensor records. By collating affiliation, direction and origin-destination points of each single route, the ring route system can be generalized as Figure ? shows.

In the PeMSD7 dataset, the weight matrix of the sensor graph is computed based on the relative positions of the stations in the network. In this way, we can define the adjacency matrix as follows,

where is the edge weight which is decided by (the distance between station and ). and are thresholds to control the weights distribution and sparsity of the matrix. Parameters of those two thresholds are assigned to and individually. The numerical visualization of is presented in Figure ?.
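The text describes edge weights that are decided by pairwise distance and shaped by two thresholds controlling the weight distribution and the sparsity. One common construction that matches this description is a thresholded Gaussian kernel, sketched below. The functional form, the parameter names `sigma2` and `eps`, and the numeric values are all assumptions for illustration and should not be read as the settings used in the experiments.

```python
import numpy as np

def gaussian_threshold_adjacency(dist, sigma2, eps):
    """Build a sparse weighted adjacency matrix from pairwise distances.

    dist:   (n, n) symmetric matrix of station-to-station distances.
    sigma2: kernel width (assumed name), controls how fast weights decay.
    eps:    cut-off (assumed name), weights below it are set to zero.
    """
    W = np.exp(-dist**2 / sigma2)
    W[W < eps] = 0.0              # sparsify: drop weak connections
    np.fill_diagonal(W, 0.0)      # no self-loops
    return W

# Toy usage with three stations (hypothetical distances and parameters).
dist = np.array([[0.0, 1.0, 5.0],
                 [1.0, 0.0, 2.0],
                 [5.0, 2.0, 0.0]])
W_adj = gaussian_threshold_adjacency(dist, sigma2=4.0, eps=0.3)
```

In this toy case the two stations that are 5 units apart end up disconnected, while the closer pairs keep weights that decrease with distance.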

### 3.3 Experimental Settings

All experiments are compiled and tested on a CentOS server (CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, Memory: 132GB, GPU: NVIDIA Tesla K80). We conduct a grid search to locate the parameters that produce the highest score on the validation sets. All the tests use 1 hour as the uniform historical time window, i.e. 12 observed data points are used to forecast the traffic condition in the next 15, 30, and 60 minutes.
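The windowing implied by this setup can be sketched as follows: with 5-minute aggregation, 15, 30 and 60 minutes ahead correspond to 3, 6 and 12 steps ahead. The helper name `make_windows` and the toy data are assumptions, and the sketch ignores details such as windows that span the gap between non-consecutive weekdays.

```python
import numpy as np

# Forecast horizons in minutes mapped to steps at 5-minute aggregation.
HORIZON_STEPS = {15: 15 // 5, 30: 30 // 5, 60: 60 // 5}

def make_windows(data, n_hist=12, horizon=3):
    """Slice a (T, n) speed matrix into supervised samples.

    Returns inputs of shape (N, n_hist, n) and targets of shape (N, n),
    where each target lies `horizon` steps after the last observed step.
    """
    T = data.shape[0]
    N = T - n_hist - horizon + 1
    X = np.stack([data[s:s + n_hist] for s in range(N)])
    y = np.stack([data[s + n_hist + horizon - 1] for s in range(N)])
    return X, y

# Toy usage: 30 time steps, 1 station, values equal to their time index.
data = np.arange(30, dtype=float).reshape(30, 1)
X_win, y_win = make_windows(data, n_hist=12, horizon=HORIZON_STEPS[15])
```

In the toy data the first window covers steps 0 to 11 and its target is step 14, i.e. three steps (15 minutes) after the last observation.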

**Baselines** We compare our ST-GCNN framework with the following baselines: 1) Historical Average (HA); 2) Linear Support Vector Regression (LSVR) [16]; 3) Auto-Regressive Integrated Moving Average (ARIMA); 4) Feed-Forward Neural Network (FNN); 5) Fully-Connected LSTM (FC-LSTM) [20]; 6) Graph Convolutional LSTM (GC-LSTM) [17]. For the detailed parameter settings of the baseline algorithms, please refer to the appendix.

**ST-GCNN Model** For BJER4 dataset, we stack three spatial-temporal blocks of 64, 64, 128 channels and an output layer. While four spatial-temporal blocks of 64, 64, 128, 128 channels and an output layer are employed on dataset PeMSD7. Both graph convolution size and temporal convolution size are set to for the two datasets. We train our model by minimizing the mean square error using ADAM [10] for 50 epochs with batch size as 25. The initial learning rate is with a decay rate of 0.7 after every 5 epochs.
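The learning-rate schedule described above ("a decay rate of 0.7 after every 5 epochs") can be read as a staircase decay. The sketch below assumes that reading. The initial learning rate `lr0` is left as a parameter, and the value used in the usage lines is a placeholder, not the setting from the experiments.

```python
def learning_rate(epoch, lr0, decay=0.7, every=5):
    """Staircase decay: multiply lr0 by `decay` once per `every` epochs."""
    return lr0 * decay ** (epoch // every)

# Toy usage with a placeholder initial rate of 1.0 over 50 epochs.
schedule = [learning_rate(e, lr0=1.0) for e in range(50)]
```

Under this reading, 50 epochs contain nine decay steps, so the final learning rate is about 4% of the initial one (0.7 to the power 9).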

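**Evaluation** In our study, three metrics are adopted to evaluate the quality of the prediction $\hat{v}$ against the ground truth $v$: mean absolute error (MAE), mean absolute percentage error (MAPE) and root mean squared error (RMSE). For $N$ predicted values they are

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| v_i - \hat{v}_i \right|, \qquad \mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{v_i - \hat{v}_i}{v_i} \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( v_i - \hat{v}_i \right)^2}.$$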
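For reference, the three metrics reported in Table 1 (MAE, MAPE and RMSE) can be computed with the standard definitions as in the sketch below. The function names and the toy speeds are assumptions for illustration. MAPE as written here assumes that no ground-truth value is zero.

```python
import numpy as np

def mae(v, v_hat):
    """Mean absolute error."""
    return np.mean(np.abs(v - v_hat))

def mape(v, v_hat):
    """Mean absolute percentage error, in percent (v must be non-zero)."""
    return np.mean(np.abs((v - v_hat) / v)) * 100.0

def rmse(v, v_hat):
    """Root mean squared error."""
    return np.sqrt(np.mean((v - v_hat) ** 2))

# Toy usage with hypothetical speeds.
v_true = np.array([50.0, 60.0, 40.0])
v_pred = np.array([45.0, 66.0, 40.0])
scores = (mae(v_true, v_pred), mape(v_true, v_pred), rmse(v_true, v_pred))
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two can rank models differently on noisy data.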

### 3.4 Experiment Results

Table 1 demonstrates the results of ST-GCNN and baseline algorithms on the two datasets respectively. Our proposed model achieves the best performance in all three evaluation metrics.

| Horizon | Model | BJER4 MAE | BJER4 MAPE | BJER4 RMSE | PeMSD7 MAE | PeMSD7 MAPE | PeMSD7 RMSE |
|---|---|---|---|---|---|---|---|
| 15 min | HA | 5.21 | 14.67% | 7.65 | 4.01 | 10.61% | 7.20 |
| 15 min | LSVR | 4.11 | 9.84% | 5.71 | 2.52 | 5.87% | 4.55 |
| 15 min | ARIMA | 6.40 | 16.50% | 9.55 | 5.72 | 14.15% | 10.78 |
| 15 min | FNN | 4.20 | 10.34% | 5.71 | 2.84 | 6.99% | 4.66 |
| 15 min | FC-LSTM | 4.30 | 10.92% | 5.77 | 3.54 | 8.83% | 6.25 |
| 15 min | GC-LSTM | 3.95 | 9.36% | 5.39 | 3.87 | 8.87% | 6.8 |
| 15 min | ST-GCNN | 3.75 | 9.01% | 5.11 | 2.35 | 5.44% | 4.15 |
| 30 min | HA | 5.21 | 14.67% | 7.65 | 4.01 | 10.61% | 7.20 |
| 30 min | LSVR | 5.07 | 12.31% | 7.10 | 3.62 | 8.90% | 6.67 |
| 30 min | ARIMA | 6.27 | 16.53% | 9.04 | 5.58 | 14.16% | 10.04 |
| 30 min | FNN | 5.13 | 13.10% | 7.12 | 4.04 | 9.84% | 6.51 |
| 30 min | FC-LSTM | 4.80 | 12.15% | 6.66 | 3.74 | 9.43% | 6.74 |
| 30 min | GC-LSTM | 4.59 | 11.73% | 6.41 | 3.98 | 9.35% | 7.14 |
| 30 min | ST-GCNN | 4.41 | 10.65% | 6.06 | 3.16 | 7.59% | 5.52 |
| 60 min | HA | 5.21 | 14.67% | 7.65 | 4.01 | 10.61% | 7.20 |
| 60 min | LSVR | 6.56 | 16.35% | 9.44 | 5.30 | 13.68% | 9.44 |
| 60 min | ARIMA | 6.57 | 17.91% | 9.24 | 5.92 | 15.48% | 9.88 |
| 60 min | FNN | 6.88 | 17.87% | 9.28 | 5.34 | 14.61% | 8.89 |
| 60 min | FC-LSTM | 5.63 | 14.91% | 8.13 | 4.04 | 10.71% | 7.35 |
| 60 min | GC-LSTM | 5.83 | 13.96% | 8.75 | 4.49 | 10.72% | 7.83 |
| 60 min | ST-GCNN | 5.46 | 13.71% | 7.66 | 3.95 | 9.72% | 6.95 |

We can observe that traditional statistical and machine learning methods may perform well for short-term forecasting, but their long-term predictions are not accurate because of error accumulation. The ARIMA model performs the worst due to its inability to handle complex spatio-temporal data. Deep learning approaches generally achieve better prediction results than traditional machine learning models. It is worth noting that, except for our ST-GCNN model and HA, the remaining baseline models have relatively poor long-term forecasting performance on the BJER4 dataset. This is partially because the simple topology and confined region of the traffic network in BJER4 do not provide enough spatial information for long-term speed prediction. In addition, the traffic data of BJER4 are noisy and fluctuate strongly along the time axis, which causes severe error accumulation.

**Benefits of Spatial Topology** Previous methods did not incorporate spatial topology and modeled the time series in a coarse-grained way. In contrast, by modeling the spatial topology of the sensors, our ST-GCNN model achieves a significant improvement on short-, medium- and long-term forecasting. The advantage of ST-GCNN is more obvious on the PeMSD7 dataset than on BJER4, since the sensor network of PeMS is more complicated (as illustrated in Figure ?), and our model can fully utilize the spatial structure to make more accurate predictions. To compare the three methods based on neural networks (FC-LSTM, GC-LSTM and ST-GCNN), we show their predictions during the morning peak and evening rush hours in Figure ? and ?. ST-GCNN captures the trend of the rush hours more accurately than the other methods, and it detects the end of the rush hours earlier than the others.

**Generalization of Deep Learning Models** In order to investigate the performance of the compared deep learning models in more detail, we plot the RMSE on the validation set of the PeMSD7 dataset during the training process, see Figure ?. The models based on recurrent networks (i.e. FC-LSTM and GC-LSTM) can fit the training set well, since the error of a single-step prediction is small enough that error accumulation is not significant on the training set. However, on the validation set the single-step error is much larger than on the training set, and consequently the long-term predictions of RNN/LSTM-type models are ruined by error accumulation. In this sense, RNN/LSTM-type models tend to overfit the training data and generalize poorly. Our ST-GCNN model directly predicts the speed at a specific time step, avoiding the issue of error accumulation. It performs best on the validation set and achieves the smallest gap between training and validation error, since its predictions do not rely on generating a sequence iteratively.

**Training Efficiency** To see the benefits of the convolution along the time axis in ST-GCNN, we compare the training time of ST-GCNN and GC-LSTM with the same number of layers and hidden units. For the BJER4 dataset and two models with 3 hidden layers and 64 units in each layer, our ST-GCNN model consumes only **69** seconds, while the RNN/LSTM-type model GC-LSTM spends **676** seconds. This roughly 10 times acceleration of training speed (676/69 ≈ 9.8) is due to the use of temporal convolution instead of recurrent structures.

## 4 Conclusion and Future Work

In this paper, we have proposed a novel deep learning framework for traffic prediction, the spatio-temporal graph convolutional neural network (ST-GCNN), which combines graph and temporal CNNs. The experimental results show that our model outperforms other state-of-the-art methods on real-world datasets, indicating its great potential for exploring spatio-temporal structures in data. In the future, we will further optimize the network structure and parameters in order to obtain better results for traffic prediction problems. Moreover, the proposed framework can be applied to more general spatio-temporal structured sequence forecasting scenarios, such as the evolution of social networks and preference prediction in recommender systems.

## 5 Appendix

### 5.1 Baseline Settings

- Historical Average (HA): the mean value of the whole historical training data at the current timestamp is used as the historical average.
- Linear Support Vector Regression (LSVR): uses a support vector machine with a linear kernel. The method is implemented with the scikit-learn package, with parameters optimized by grid search.
- Auto-Regressive Integrated Moving Average (ARIMA): the best order settings of ARIMA(p, d, q) are determined on the training data, and the corresponding parameters are then applied to predict the time series on the test set iteratively.
- Feed-Forward Neural Network (FNN): for the BJER4 and PeMSD7 datasets, FNN is set to have two hidden layers of size 64 and an output layer; the activation function is ReLU.
- Fully-Connected LSTM (FC-LSTM): for the BJER4 dataset, FC-LSTM is set to stack two LSTM cells with hidden states of size ; for the PeMSD7 dataset, it is set to stack three LSTM cells with hidden states of size .
- Graph Convolutional LSTM (GC-LSTM): a 2-layer GC-LSTM network of 64, 64 channels respectively with an encoder-decoder mechanism is applied on the BJER4 dataset; for the PeMSD7 dataset, we use a 3-layer GC-LSTM network of 64, 128, 128 channels respectively.
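As an example of the simplest baseline above, the sketch below computes a historical average per time-of-day slot, which is one plausible reading of "the mean value of the whole historical training data at the current timestamp". The function name and the toy data are assumptions. Under this reading the prediction does not depend on the forecast horizon, which is consistent with the HA rows being identical across horizons in Table 1.

```python
import numpy as np

def historical_average(train, slots_per_day=288):
    """Average speed per time-of-day slot over all training days.

    train: (T, n) matrix; only complete days are used.
    Returns a (slots_per_day, n) matrix of per-slot means.
    """
    days = train.shape[0] // slots_per_day
    full = train[:days * slots_per_day]
    return full.reshape(days, slots_per_day, -1).mean(axis=0)

# Toy usage: 2 days, 4 slots per day, 1 station (hypothetical values).
train = np.array([[1.0], [2.0], [3.0], [4.0],
                  [3.0], [4.0], [5.0], [6.0]])
ha = historical_average(train, slots_per_day=4)
```

A prediction for any future timestamp is then looked up from the row of `ha` that matches its time-of-day slot.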

### References

1. Ahmed, M. S., and Cook, A. R. 1979. *Analysis of freeway traffic time-series data by using Box-Jenkins techniques*.
2. Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral networks and locally connected networks on graphs.
3. Chen, C.; Petty, K.; Skabardonis, A.; Varaiya, P.; and Jia, Z. 2001. Freeway performance measurement system: mining loop detector data.
4. Chen, Q.; Song, X.; Yamada, H.; and Shibasaki, R. 2016. Learning deep representation from big and heterogeneous data for traffic accident inference.
5. Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering.
6. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning.
7. Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep convolutional networks on graph-structured data.
8. Huang, W.; Song, G.; Hong, H.; and Xie, K. 2014. Deep architecture for traffic flow prediction: deep belief networks with multitask learning.
9. Jia, Y.; Wu, J.; and Du, Y. 2016. Traffic speed prediction using deep learning method.
10. Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization.
11. LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition.
12. Li, Y.; Yu, R.; Shahabi, C.; and Liu, Y. 2017. Graph convolutional recurrent neural network: Data-driven traffic forecasting.
13. Lippi, M.; Bertini, M.; and Frasconi, P. 2013. Short-term traffic flow forecasting: An experimental comparison of time-series analysis and supervised learning.
14. Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; and Wang, F.-Y. 2015. Traffic flow prediction with big data: a deep learning approach.
15. Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs.
16. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python.
17. Seo, Y.; Defferrard, M.; Vandergheynst, P.; and Bresson, X. 2016. Structured sequence modeling with graph convolutional recurrent networks.
18. Shuman, D. I.; Narang, S. K.; Frossard, P.; Ortega, A.; and Vandergheynst, P. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains.
19. Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015. Highway networks.
20. Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks.
21. Vlahogianni, E. I. 2015. Computational intelligence and optimization for transportation big data: challenges and opportunities.
22. Williams, B. M., and Hoel, L. A. 2003. Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results.
23. Wu, Y., and Tan, H. 2016. Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework.
24. Xingjian, S.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; and Woo, W.-c. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting.