Graph-Partitioning-Based Diffusion Convolution Recurrent Neural Network for Large-Scale Traffic Forecasting
1 Abstract
Traffic forecasting approaches are critical to developing adaptive strategies for mobility. Traffic patterns have complex spatial and temporal dependencies that make accurate forecasting on large highway networks a challenging task. Recently, diffusion convolutional recurrent neural networks (DCRNNs) have achieved state-of-the-art results in traffic forecasting by capturing the spatiotemporal dynamics of the traffic. Despite the promising results, adopting DCRNN for large highway networks remains elusive because of computational and memory bottlenecks. We present an approach to apply DCRNN to a large highway network. We use a graph-partitioning approach to decompose a large highway network into smaller networks and train them simultaneously on a cluster with graphics processing units (GPUs). For the first time, we forecast the traffic of the entire California highway network with 11,160 traffic sensor locations simultaneously. We show that our approach can be trained within 3 hours of wall-clock time using 64 GPUs to forecast speed with high accuracy. Further improvements in accuracy are attained by including overlapping sensor locations from nearby partitions and by finding high-performing hyperparameter configurations for the DCRNN using DeepHyper, a hyperparameter tuning package. We demonstrate that a single DCRNN model can be trained to forecast speed and flow simultaneously and that the results preserve fundamental traffic flow dynamics. We expect our approach for modeling a large highway network in short wall-clock time to serve as a potential core capability in advanced highway traffic monitoring systems, where forecasts can be used to adjust traffic management strategies proactively given anticipated future conditions.
Deep learning, Graph neural networks, Diffusion, Traffic forecasting, Graph partitioning
2 Introduction
In the United States alone, the estimated loss in economic value due to traffic congestion reaches into the tens or hundreds of billions of dollars, impacting not only the productivity lost due to additional travel time but also the additional inefficiencies and energy required for vehicle operation. To address these issues, Intelligent Transportation Systems (ITS) [8] seek to better manage and mitigate congestion and other traffic-related issues via a range of data-informed strategies and highway traffic monitoring systems. Near-term traffic forecasting is a foundational component of these strategies, and accurate forecasting across a range of normal, elevated, and extreme levels of congestion is critical for improved traffic control, routing optimization, probability of incident prediction, and identification of other approaches for handling emerging patterns of congestion [54, 53]. Furthermore, these predictions and the related machine learning configurations and weights associated with a highly accurate model can be used to delve more deeply into the dynamics of a particular transportation network in order to identify additional areas of improvement above and beyond those enabled by improved prediction and control [19, 1, 39]. These forecasting methodologies are also expected to enable new and additional forms of intelligent transportation system strategies as they become integrated into larger optimization and control approaches and highway traffic monitoring systems [45, 17]. For example, the benefits of highly dynamic route guidance and alternative transit mode pricing in real time would be greatly aided by improved traffic forecasting.
Traffic forecasting is a challenging problem: the key traffic metrics such as flow (also called volume, an estimate of the number of vehicles that passed over each detector on the highway in a given time period) and speed (the estimated rate of motion at which a detector records drivers operating their vehicles) exhibit complex spatial and temporal correlations that are difficult to model with classical forecasting approaches [56, 13, 27, 12]. From the spatial perspective, locations that are close geographically in the Euclidean sense (for example, two locations located in opposite directions of the same highway) may not exhibit a similar traffic pattern, whereas locations in the highway network that are far apart (for example, two locations separated by a mile in the same direction of the same highway) can show strong correlations. Many traditional predictive modeling approaches cannot handle these types of correlation. From the temporal perspective, because of different traffic conditions across different locations (e.g., diverse peak hour patterns, varying traffic flow and volume, highway capacity, incidents, and interdependencies), the time series data becomes nonlinear and nonstationary, rendering many statistical time series modeling approaches ineffective.
Recently, deep learning (DL) approaches have emerged as high-performing methods for traffic forecasting. In particular, Li et al. [34] developed a diffusion convolution recurrent neural network (DCRNN) that models complex spatial dependencies using a diffusion process on a graph and temporal dependencies using a sequence-to-sequence recurrent neural network. (In physics, diffusion is the movement of particles from a region of higher concentration to a region of lower concentration. The diffusion process can be represented as a weighted combination of infinite random walks on a graph.) The authors reported forecasting performances for 15, 30, and 60 minutes on two data sets: a Los Angeles data set with 207 locations collected over 4 months and a Bay Area data set with 325 locations collected over 6 months. They showed improvement over state-of-the-art baseline methods such as historical average [56], an autoregressive integrated moving average model with a Kalman filter [57], a vector autoregressive model [22], a linear support vector regression, a feedforward neural network [47], and an encoder-decoder framework using long short-term memory [52]. Despite these results, modeling large highway networks with DCRNN remains challenging because of computational and memory bottlenecks.
We focus on developing and applying DCRNN to a large highway network with thousands of traffic sensor locations. Our study is motivated by the fact that the highway network of a state such as California is 30 times larger than the Los Angeles or Bay Area dataset. Training a DCRNN with 30 times more data poses two main challenges. First, the training data size for thousands of locations is too large to fit in a single computer's memory. Second, the time required for training a DCRNN on a large data set can be prohibitive, rendering the method ineffective for large highway networks. Two common approaches to overcome this issue in the deep learning literature are distributed data-parallel training and model-parallel training [16]. In data-parallel training, different computing nodes train the same copy of the model on different subsets of the data and synchronize the information from these models. The number of trainable parameters is the same as for single-instance training because the whole highway network graph is considered together. Speedup is achieved only by the reduced amount of training data per compute node. In model-parallel training, the model is split across different computing nodes, and each node estimates a different part of the model parameters. It is used mostly when the model is too large to fit in a single node's memory. Implementation, fault tolerance, and better cluster utilization are easier with data-parallel training than with model-parallel training. Therefore, data-parallel training is arguably the preferred approach for distributed systems [24]. On the other hand, in traditional high-performance computing (HPC) domains, a common approach for scaling is domain decomposition, wherein the problem is divided into a number of subproblems that are then distributed over different compute nodes.
While domain decomposition approaches are not applicable to scaling typical DL training tasks such as image and text classification, they are well suited to the traffic forecasting problem with DCRNN. The reason is that traffic flow in one part of the highway network does not affect another part when the parts are separated by a large driving distance.
In this paper, we develop a graph-partitioning-based DCRNN for traffic forecasting on a large highway network. The main contributions of our work are as follows.

We demonstrate the efficacy of the graph-partitioning-based DCRNN approach to model the traffic on the entire California highway network with 11,160 sensor locations. We show that our approach can be trained within 3 hours of wall-clock time to forecast speed with high accuracy.

We develop two improvement strategies for the graph-partitioning-based DCRNN. The first is an overlapping sensor location approach that includes data from partitions that are geographically close to a given partition. The second is the adoption of DeepHyper, a scalable hyperparameter search package, for finding high-performing hyperparameter configurations of DCRNN to improve the forecast accuracy of multiple sensor locations.

We adopt and train a single DCRNN model to forecast both flow and speed simultaneously, as opposed to the previous DCRNN implementation, which predicts either speed or flow.
3 Methodology
In this section, we describe the DCRNN approach for traffic modeling, followed by graph partitioning for DCRNN, the overlapping node method, and the hyperparameter search approach.
3.1 Diffusion convolution recurrent neural network
Formally, the problem of traffic forecasting can be modeled as spatial-temporal time series forecasting defined on a weighted directed graph $G = (V, E, A)$, where $V$ is a set of $N$ nodes that represent sensor locations, $E$ is the set of edges connecting the sensor locations, and $A \in \mathbb{R}^{N \times N}$ is the weighted adjacency matrix that represents the connectivity between the nodes in terms of highway network distance. Given the graph $G$ and the time series data $X_{t-T'+1}$ to $X_t$, the goal of the traffic forecasting problem is to learn a function $h(\cdot)$ that maps the historical data at a given time $t$ to the future $T$ time steps:
$$[X_{t-T'+1}, \ldots, X_t;\ G] \xrightarrow{\ h(\cdot)\ } [X_{t+1}, \ldots, X_{t+T}]$$
In DCRNN, the temporal dependency of the historical data is captured by the encoder-decoder architecture [14, 52] of recurrent neural networks. The encoder steps through the input historical time series data and encodes the entire sequence into a fixed-length vector. The decoder predicts the output of the next $T$ time steps while reading from the vector. Along with the encoder-decoder architecture of the RNN, a diffusion convolution process is used to capture the spatial dependencies. The diffusion process [55] can be described by a random walk on $G$ with a state transition matrix $D_O^{-1}A$, where $D_O$ is the diagonal out-degree matrix. The traffic flow from one node to the neighbor nodes can be represented as a weighted combination of infinite random walks on the graph. The diffusion kernel is used in the convolution operation to map the features of the node to the result of the diffusion process beginning at that node. A filter learns the features for graph-structured data during training as a result of the diffusion convolution operation over a graph signal.
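As a rough illustration of the random-walk diffusion described above, the following sketch row-normalizes a small weighted adjacency matrix into a transition matrix and accumulates a few diffusion steps of a single-feature graph signal. The toy graph, the function names, and the fixed filter weights `theta` are assumptions made for illustration; in DCRNN the filter weights are learned, and only the forward walk direction is shown here.

```python
def transition_matrix(adj):
    """Row-normalize a weighted adjacency matrix (out-degree normalization)."""
    matrix = []
    for row in adj:
        out_degree = sum(row)
        # Nodes without outgoing edges keep an all-zero row.
        matrix.append([w / out_degree if out_degree > 0 else 0.0 for w in row])
    return matrix


def diffuse(adj, signal, theta):
    """Return sum_k theta[k] * P^k x for a single-feature graph signal x."""
    p = transition_matrix(adj)
    n = len(signal)
    current = list(signal)  # P^0 x
    output = [theta[0] * v for v in current]
    for weight in theta[1:]:
        # One more random-walk step: current <- P @ current
        current = [sum(p[i][j] * current[j] for j in range(n)) for i in range(n)]
        output = [o + weight * c for o, c in zip(output, current)]
    return output


# Toy directed cycle 0 -> 1 -> 2 -> 0 (illustrative values only).
adj = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
result = diffuse(adj, [1.0, 0.0, 0.0], theta=[1.0, 0.5])
```

With two filter taps, each node's output mixes its own value with the value of the node it points to, which is the one-step case of the diffusion process.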
During the training phase, historical time series data and the graph are fed into the encoder, and the final state of the encoder is used to initialize the decoder. The decoder predicts the output of the next $T$ time steps, and the layers of DCRNN are trained by using backpropagation through time. During testing, the ground truth observations are replaced by previously predicted output. The discrepancy between the input distributions of training and testing can cause performance degradation. In order to resolve this issue, scheduled sampling [6] is used, where the model is fed a ground truth observation with probability $\epsilon_i$ or the prediction by the model with probability $1 - \epsilon_i$ at the $i$th iteration. The model is trained with the MAE loss function, defined as $\mathrm{MAE} = \frac{1}{s}\sum_{i=1}^{s} |y_i - \hat{y}_i|$, where $y_i$ is the observed value, $\hat{y}_i$ is the corresponding forecasted value for the training data, and $s$ is the number of training samples.
3.2 Graph-partitioning-based DCRNN
To scale DCRNN, we adopt a divide-and-conquer approach for solving a large problem by solving subproblems defined on smaller subdomains. The overall idea of scaling is shown in Figure 1. Here, the graph is divided into multiple subgraphs, shown as partition 1 to partition $M$. The $M$ partitions are then trained on $M$ compute nodes simultaneously. Simultaneous training of subgraphs on multiple GPUs reduces the overall training time in comparison with single-node training. The speedup with graph partitioning can be expressed as $S_M = T_1 / T_M$, and the efficiency can be expressed as $E_M = S_M / M$. Here, $T_1$ is the time to execute an algorithm on a single node, and $T_M$ is the time to execute the same algorithm on $M$ nodes. $E_M = 1$ in a perfectly parallel algorithm.
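The speedup and efficiency definitions can be checked with a short calculation. The sketch below uses the training times reported in the experiments of this paper (2,820 minutes on 4 GPUs and 178.67 minutes on 64 GPUs). Because no single-node time is available, it computes speedup and efficiency relative to the 4-GPU run, which is an assumption of this example; the function names are also illustrative.

```python
def relative_speedup(time_base, time_scaled):
    """Speedup of the scaled run over the baseline run."""
    return time_base / time_scaled


def relative_efficiency(time_base, gpus_base, time_scaled, gpus_scaled):
    """Achieved speedup divided by the ideal speedup (ratio of GPU counts)."""
    ideal = gpus_scaled / gpus_base
    return relative_speedup(time_base, time_scaled) / ideal


# Training times in minutes, taken from the experimental section.
speedup = relative_speedup(2820.0, 178.67)
efficiency = relative_efficiency(2820.0, 4, 178.67, 64)
print(f"speedup = {speedup:.2f}x, efficiency = {efficiency:.3f}")
```

An efficiency close to 1 relative to the 4-GPU baseline corresponds to the near-linear scaling reported up to 64 partitions.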
We use Metis [43], a graph-partitioning package, to decompose the large network graph into smaller subgraphs. First, to reduce the size of the input graph, Metis coarsens the graph iteratively by collapsing the connected nodes into supernodes. The process of coarsening helps reduce the edge-cut. Then, the coarsened graph is partitioned by using either multilevel $k$-way partitioning [29] or multilevel recursive bisection algorithms [28]. The next step is to map the partitions onto the original graph by backtracking through the coarsened graphs. In order to reduce the edge-cut, the nodes are swapped between partitions by using the Kernighan-Lin algorithm [25] during uncoarsening. The method produces roughly equally sized partitions. Metis's multilevel $k$-way partitioning algorithm provides additional capabilities such as minimizing the resulting subdomain connectivity graph, enforcing contiguous partitions, and minimizing alternative objectives. Therefore, we use the $k$-way partitioning algorithm in our work. Metis is extremely fast and provides high-quality partitions in a few seconds. For example, to partition a graph of 11,160 nodes into 64 partitions, Metis takes only 0.030 seconds.
Various graph clustering and community detection methods [36] have been developed, such as spectral clustering, Louvain, SlashBurn [31], and core-based clustering [21]. Compared with all these methods, Metis is a fast graph-partitioning algorithm [36] that is capable of partitioning a million-node graph into a few tightly connected clusters. It generates roughly equally sized partitions. Our approach is agnostic to the graph-partitioning method adopted.
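Since the approach is agnostic to the partitioner, any candidate partition can be judged by the two quantities discussed above: the edge-cut and the balance of partition sizes. The following minimal sketch computes both for a given membership vector; it does not call Metis, and the toy path graph and function names are assumptions for illustration.

```python
from collections import Counter


def edge_cut(edges, membership):
    """Total weight of edges whose endpoints lie in different partitions."""
    return sum(w for u, v, w in edges if membership[u] != membership[v])


def partition_sizes(membership):
    """Number of nodes assigned to each partition id."""
    return Counter(membership)


# Toy path graph 0-1-2-3-4-5 with unit edge weights (illustrative only).
edges = [(i, i + 1, 1.0) for i in range(5)]
contiguous = [0, 0, 0, 1, 1, 1]   # splits the path in the middle
alternating = [0, 1, 0, 1, 0, 1]  # same balance, far worse edge-cut

print(edge_cut(edges, contiguous), partition_sizes(contiguous))
print(edge_cut(edges, alternating), partition_sizes(alternating))
```

Both memberships are perfectly balanced, but the contiguous one cuts a single edge, which is the kind of partition a multilevel partitioner aims for.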
3.3 Overlapping nodes
An issue that affects the prediction accuracy in DCRNN due to graph partitioning is that nodes that are spatially correlated can end up in different partitions. While the graph-partitioning methods try to minimize this effect, the nodes at the boundary of the partitions will not have nearby spatially correlated nodes. To address this issue, we develop an overlapping nodes approach, wherein for each partition, we find and include spatially correlated nodes from other partitions. Consequently, the nodes that are near the boundary of the partition will appear in more than one partition. A naive approach for finding these nodes consists of computing the $k$ nearest neighbors for each node in the partition based on the driving distance and excluding the nodes already included in the partition. The disadvantage of this approach is that it can include, for a given node, several spatially correlated nodes that are close to each other. This can lead to an increase in the number of nodes per partition and, consequently, to higher training time and memory requirements. Therefore, we downsample the spatially correlated nodes from other partitions as follows: given two spatially correlated overlapping nodes from a different partition, we select only one and remove the other if they are within a driving distance of $d$ miles, where $d$ is a parameter.
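The selection and downsampling of overlapping nodes can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the distance table `dist`, the neighbor count `k`, the spacing threshold `d`, and the greedy order in which candidates are kept are all assumptions.

```python
def overlapping_nodes(partition, dist, k, d):
    """Pick downsampled out-of-partition neighbors for one partition.

    partition: set of node ids in the partition
    dist: dist[u][v] = driving distance in miles from u to v (u != v)
    k: number of nearest neighbors considered per node
    d: nodes closer than or equal to d miles to a kept node are dropped
    """
    candidates = set()
    for u in partition:
        nearest = sorted(dist[u], key=lambda v: dist[u][v])[:k]
        # Exclude neighbors that already belong to this partition.
        candidates.update(v for v in nearest if v not in partition)

    kept = []
    for v in sorted(candidates):  # sorted only to make the result deterministic
        # Keep v only if every already-kept node is farther than d miles away.
        if all(dist[v].get(w, float("inf")) > d for w in kept):
            kept.append(v)
    return kept


# Toy distances in miles (illustrative values only).
dist = {
    "a": {"b": 1.0, "c": 2.0, "e": 2.5, "f": 9.0},
    "b": {"a": 1.0, "c": 3.0, "e": 3.2, "f": 4.0},
    "c": {"a": 2.0, "b": 3.0, "e": 0.5, "f": 5.0},
    "e": {"a": 2.5, "b": 3.2, "c": 0.5, "f": 5.5},
    "f": {"a": 9.0, "b": 4.0, "c": 5.0, "e": 5.5},
}
print(overlapping_nodes({"a", "b"}, dist, k=3, d=1.0))
print(overlapping_nodes({"a", "b"}, dist, k=4, d=1.0))
```

In the toy data, nodes `c` and `e` are only 0.5 miles apart, so only one of them is added to the partition.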
3.4 Hyperparameter tuning
The forecasting accuracy of the DCRNN depends on a number of hyperparameters such as batch size, filter type (i.e., random walk, Laplacian), maximum diffusion steps, number of RNN layers, number of RNN units per layer, a threshold (max_grad_norm) for clipping the gradient norm to avoid the exploding gradient problem of RNNs [46], initial learning rate, and learning rate decay. Li et al. [34] used a tree-structured Parzen estimator [7] for tuning the hyperparameters of the DCRNN; the obtained values are used as the default configuration. However, our dataset has much more variability because we consider all the districts of California. Therefore, finding the appropriate hyperparameter values is critical in our setting.
We use DeepHyper [5], a scalable hyperparameter search (HPS) package for neural networks, to search for high-performing hyperparameter values for DCRNN. DeepHyper adopts an asynchronous model-based search (AMBS) method, which relies on fitting a surrogate model that tries to learn the relationship between the hyperparameter configurations and their corresponding model validation errors. The surrogate model is then used to prune the search space and identify promising regions of the search space. The surrogate model is iteratively refined in the promising regions of the hyperparameter search space by obtaining new outputs at inputs that are predicted by the model to be high performing.
Given that we use a graph partition approach, finding the best hyperparameter configuration for each partition, although feasible, will be computationally expensive. Therefore, we select an arbitrary partition, run a hyperparameter search on it, and use the same best hyperparameter configuration for all the partitions.
3.5 Multi-output forecasting with a single model
In the previous study, DCRNN was used to forecast only speed based on historical speed data. In this paper, we customize the input and output layers of the DCRNN for multi-output forecasting and demonstrate that a single DCRNN model can be trained and used for forecasting speed and flow simultaneously. The three key modifications for multi-output forecasting are as follows: 1) normalization of speed and flow: to bring speed and flow to the same scale, normalization is done separately on the two features using the standard scaler transformation. The normalized values of speed are given by $z_s = (x_s - \mu_s)/\sigma_s$, where $\mu_s$ is the mean and $\sigma_s$ is the standard deviation of the speed values $x_s$. The same method is applied for normalizing the flow values ($z_f = (x_f - \mu_f)/\sigma_f$, where $\mu_f$ and $\sigma_f$ are the mean and standard deviation of the flow values $x_f$). We apply an inverse transformation to the normalized speed and flow forecasting values to transform them to the original scale (for computing error on the test data). 2) multiple output layers in the DCRNN: in the previous study of DCRNN, the convolution filter learns the graph-structured data from the input graph signal $X$. This filter is parameterized by $\Theta$ to take $P$-dimensional input (such as speed and flow) and predict $Q$-dimensional output (such as speed and flow). Although multiple-output prediction is reported as a capability of DCRNN, its implementation was formatted to take only 1-dimensional input and predict 1-dimensional output. We changed the input/output format in our implementation so that $P$-dimensional input can be given to predict $Q$-dimensional output. 3) loss function: for multi-output training, we use a loss function of the form $\frac{1}{N}\sum_{i=1}^{N}\left(|y_i^{s} - \hat{y}_i^{s}| + |y_i^{f} - \hat{y}_i^{f}|\right)$, where $y^{s}$ and $y^{f}$ are observed speed and flow values and $\hat{y}^{s}$ and $\hat{y}^{f}$ are the corresponding forecast values, respectively, for the training data, and $N$ is the total number of training points.
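The per-feature normalization, its inverse, and a combined loss over speed and flow can be sketched in a few lines. This is a minimal illustration under stated assumptions: the population standard deviation is used for scaling, the loss is taken to be the sum of the absolute speed and flow errors averaged over the training points, and all function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the mean and (population) standard deviation of one feature."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardize values to zero mean and unit variance."""
    return [(v - mu) / sigma for v in values]


def inverse_transform(values, mu, sigma):
    """Map standardized values back to the original scale."""
    return [v * sigma + mu for v in values]


def multi_output_mae(speed_true, speed_pred, flow_true, flow_pred):
    """Sum of absolute speed and flow errors, averaged over the N points."""
    n = len(speed_true)
    speed_err = sum(abs(y - p) for y, p in zip(speed_true, speed_pred))
    flow_err = sum(abs(y - p) for y, p in zip(flow_true, flow_pred))
    return (speed_err + flow_err) / n


# Speed (mph) and flow are scaled separately (illustrative values only).
speed = [60.0, 62.0, 58.0, 64.0]
flow = [300.0, 420.0, 380.0, 500.0]
mu_s, sigma_s = fit_scaler(speed)
mu_f, sigma_f = fit_scaler(flow)
z_speed = transform(speed, mu_s, sigma_s)
z_flow = transform(flow, mu_f, sigma_f)
restored = inverse_transform(z_speed, mu_s, sigma_s)
loss = multi_output_mae([0.0, 1.0], [0.5, 1.0], [1.0, -1.0], [1.0, 0.0])
```

Scaling each feature separately keeps the much larger flow values from dominating the combined loss.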
4 California highway network
For modeling the California highway network, we used data from PeMS [10]. It provides access to real-time and historical performance data from over 39,000 individual sensors. The individual sensors placed on the different highways are aggregated across several lanes and are fed into vehicle detector stations. The PeMS dataset contains raw detector data for over 18,000 vehicle detector stations. These include a variety of sensors such as inductive loops, side-fire radar, and magnetometers. The sensors may be located on high-occupancy vehicle lanes, mainlines, on-ramps, and off-ramps. The dataset covers 9 districts of California: D3 (North Central) with 1,212 stations, D4 (Bay Area) with 3,880 stations, D5 (Central Coast) with 382 stations, D6 (South Central) with 624 stations, D7 (Los Angeles) with 4,864 stations, D8 (San Bernardino) with 2,115 stations, D10 (Central) with 1,195 stations, D11 (San Diego) with 1,502 stations, and D12 (Orange County) with 2,539 stations. A total of 18,313 stations are listed by site. Detectors capture samples every 30 seconds. PeMS then aggregates that data to the granularity of 5 minutes, an hour, and a day. The data includes timestamp, station ID, district, freeway, direction of travel, total flow, and average speed (mph). The time series data is available from 2001 to 2019.
PeMS details the station IDs, district, freeway, direction of travel, and absolute postmile markers. This list does not contain the latitude and longitude for the station IDs, which are essential to defining the connectivity matrix used by the DCRNN. In the PeMS database, the latitude and longitude are associated with postmile markers of every freeway given the direction. We downloaded the entire time series data of the California highway network and found the latitude and longitude for sensor IDs by matching the absolute postmile markers of every freeway. Linear interpolation is used to estimate the latitude and longitude if the absolute postmile markers do not match exactly.
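The postmile matching with a linear-interpolation fallback can be sketched as below. The marker layout (a sorted list of postmile, latitude, longitude tuples for one freeway and direction), the function name, and the coordinates are assumptions for illustration and do not reflect the actual PeMS file format.

```python
from bisect import bisect_left


def interpolate_location(postmile, markers):
    """Estimate (lat, lon) for a station from surrounding postmile markers.

    markers: list of (abs_postmile, lat, lon) tuples sorted by postmile,
    all belonging to the same freeway and direction of travel.
    """
    miles = [m[0] for m in markers]
    i = bisect_left(miles, postmile)
    if i < len(markers) and miles[i] == postmile:
        return markers[i][1], markers[i][2]  # exact postmile match
    if i == 0 or i == len(markers):
        raise ValueError("postmile outside the range of known markers")
    (m0, lat0, lon0), (m1, lat1, lon1) = markers[i - 1], markers[i]
    t = (postmile - m0) / (m1 - m0)  # fractional position between markers
    return lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0)


# Illustrative markers only, not real PeMS coordinates.
markers = [(10.0, 34.0, -118.0), (12.0, 34.2, -118.4), (15.0, 34.5, -118.7)]
lat, lon = interpolate_location(11.0, markers)
```

A station at postmile 11.0 falls halfway between the first two markers, so it receives the midpoint of their coordinates.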
The official PeMS website shows that 69.59% of the 18K stations are in good working condition. The remaining 30.41% do not capture time series data throughout the year. These are excluded from our dataset. Our final dataset has 11,160 stations for the year 2018 with a granularity of 5 minutes. We observed that flow and speed values are missing for multiple time periods in the time series data. We fill in each missing value with the average of the values recorded at that particular timestamp over the past week. Holidays are handled separately from normal working days.
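One possible reading of this imputation rule is sketched below: a missing 5-minute value is replaced by the mean of the values at the same time of day on each of the preceding seven days. That interpretation, the use of `None` to mark missing values, and the function name are assumptions; the separate treatment of holidays is not modeled here.

```python
STEPS_PER_DAY = 288  # 24 hours at 5-minute granularity


def impute_missing(series, days=7, steps_per_day=STEPS_PER_DAY):
    """Fill None entries with the mean of same-time values from prior days."""
    filled = list(series)
    for t, value in enumerate(filled):
        if value is not None:
            continue
        history = []
        for j in range(1, days + 1):
            idx = t - j * steps_per_day
            if idx >= 0 and filled[idx] is not None:
                history.append(filled[idx])
        if history:  # leave the gap if no earlier observation exists
            filled[t] = sum(history) / len(history)
    return filled


# Tiny example with 2 steps per "day" so the behavior is easy to follow.
series = [10.0, 20.0, 14.0, None, None, 24.0]
print(impute_missing(series, days=7, steps_per_day=2))
```

Because the series is filled in time order, earlier imputed values can contribute to later averages, which may or may not be desirable in practice.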
5 Experimental results
We represent the highway network of 11,160 detector stations as a weighted directed graph. The speed and flow data of each node of the graph is collected over one year ranging from January 1, 2018, to December 31, 2018, from PeMS [10]. From the one-year data, we used the first 70% of the data (36 weeks approx.) for training and the next 10% (5 weeks approx.) and 20% (10 weeks approx.) of the data for validation and testing, respectively. Given 60 minutes of time series data on the nodes in the graph, we forecast for the next 60 minutes. We prepared the dataset to look back ($T'$ as mentioned in Section 3.1) for 60 minutes or 12 time steps (the granularity of the data is 5 minutes as mentioned in Section 4) to predict ($T$) the next 60 minutes or 12 time steps. The look-back ($T'$) window slides by 5 minutes or 1 time step, and the process repeats until the whole data is consumed. The forecasting performance of the models was evaluated on the test data using $\mathrm{MAE} = \frac{1}{s}\sum_{i=1}^{s} |y_i - \hat{y}_i|$, where $y_1, \ldots, y_s$ represent the observed values, $\hat{y}_1, \ldots, \hat{y}_s$ represent the corresponding predicted values, and $s$ denotes the number of prediction samples.
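The sliding-window preparation and the MAE metric described above can be sketched for a single sensor as follows. The function names and the toy series are illustrative assumptions; the actual pipeline operates on all nodes of a partition at once.

```python
def sliding_windows(series, lookback=12, horizon=12):
    """Build (input, target) pairs; the window slides by one time step."""
    pairs = []
    for start in range(len(series) - lookback - horizon + 1):
        x = series[start:start + lookback]
        y = series[start + lookback:start + lookback + horizon]
        pairs.append((x, y))
    return pairs


def mae(observed, predicted):
    """Mean absolute error over paired observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# 30 five-minute readings yield 7 windows of 12 inputs and 12 targets.
series = list(range(30))
pairs = sliding_windows(series)
first_x, first_y = pairs[0]
error = mae([1.0, 2.0, 3.0], [1.0, 3.0, 5.0])
```

Each input window covers the previous 60 minutes and each target window covers the following 60 minutes, matching the 12-step look-back and 12-step horizon.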
The adjacency matrix for DCRNN requires the highway network distance between the nodes. We used the Open Source Routing Machine (OSRM) [38] running locally for the area of interest to compute the highway network distance. Given the latitude and longitude of two nodes, OSRM gives the shortest driving distance between them using OpenStreetMap data [44]. To speed up the highway network distance computation, we first find the 30 nearest neighbors for each node using the Euclidean distance and then limit the OSRM queries to those nearest neighbors. As in the original DCRNN work, we compute the pairwise highway network distances between nodes to build the adjacency matrix using a thresholded Gaussian kernel [51]: $A_{ij} = \exp\left(-\frac{\mathrm{dist}(v_i, v_j)^2}{\sigma^2}\right)$ if $\mathrm{dist}(v_i, v_j) \le \kappa$, otherwise $A_{ij} = 0$, where $A_{ij}$ represents the edge weight between node $v_i$ and node $v_j$; $\mathrm{dist}(v_i, v_j)$ denotes the highway network distance from node $v_i$ to node $v_j$; $\sigma$ is the standard deviation of distances; and $\kappa$ is the threshold, which introduces sparsity in the adjacency matrix.
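A thresholded Gaussian kernel of this kind can be sketched without any external library. In the sketch below, the sparse dictionary representation, the use of the population standard deviation, applying the threshold to the distance, and the function name are assumptions made for illustration.

```python
import math
from statistics import pstdev


def gaussian_adjacency(dist, kappa):
    """Edge weights exp(-d^2 / sigma^2) for pairs with distance d <= kappa.

    dist: dict mapping (i, j) to the highway network distance from i to j.
    Pairs that are absent from the result have weight zero.
    """
    sigma = pstdev(list(dist.values()))
    weights = {}
    for (i, j), d in dist.items():
        if d <= kappa:
            weights[(i, j)] = math.exp(-(d * d) / (sigma * sigma))
    return weights


# Toy distances in miles (illustrative values only); sigma equals 1 here.
dist = {(0, 1): 1.0, (1, 0): 1.0, (0, 2): 3.0, (2, 0): 3.0}
weights = gaussian_adjacency(dist, kappa=2.0)
```

Pairs farther apart than the threshold are dropped entirely, which is what makes the adjacency matrix sparse.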
For the experimental evaluation, we used Cooley, a GPU-based cluster at the Argonne Leadership Computing Facility. It has 126 compute nodes, where each node consists of two 2.4 GHz Intel Haswell E5-2620 v3 processors (6 cores per CPU, 12 cores total), one NVIDIA Tesla K80 (two GPUs per node), 384 GB RAM per node, and 24 GB GPU RAM per node (12 GB per GPU). The compute nodes are interconnected via an InfiniBand fabric. We used Python 3.6.0, TensorFlow 1.3.1, and Metis 5.1.0. We customized the DCRNN code of [34], which is available on GitHub [35]. Given $M$ partitions of the highway network, we trained $M$ partition-specific DCRNNs simultaneously on Cooley GPU nodes. We used two MPI ranks per node, where each rank ran a partition-specific DCRNN using one GPU. The input data for the different partitions (time series and adjacency matrix of the graph) were prepared offline and loaded into the partition-specific DCRNN before the training started.
We used a bidirectional graph random walk [37] to model the stochastic nature of highway traffic. A random walk on a directed graph is a random process that gives a path composed of successive random steps on the graph. The default hyperparameter configuration for the DCRNN is as follows: batch size: 64, filter type: random walk, number of diffusion steps: 2, RNN layers: 2, RNN units per layer: 16, threshold for gradient clipping: 5, initial learning rate: 0.01, and learning rate decay: 0.1. We trained our model by minimizing MAE using the Adam optimizer [30].
5.1 Impact of number of graph partitions on accuracy and training time
Here, we experiment with different numbers of graph partitions and show that partitions with a larger number of nodes require longer training time, whereas partitions with too few nodes can reduce the forecasting accuracy.
We used Metis to obtain 2, 4, 8, 16, 32, 64, and 128 partitions of the California highway network graph. The average number of nodes in each case is 5,580, 2,790, 1,395, 697, 348, 174, and 87, respectively. Results for 1 partition (the whole network) and 2 partitions are not presented because the training data was too large to fit in the memory of a single K80 node of Cooley. Given $M$ partitions, we used $M/2$ nodes (or $M$ GPUs) on Cooley to run the partition-specific DCRNNs simultaneously. We consider the training time as the maximum time taken by any partition-specific DCRNN training (excluding the data loading time).
Figure 2 shows the distribution of MAE of all nodes obtained using box-and-whisker plots. Each box represents the distribution of MAE of 11,160 nodes. The ends of each box are the 25% (bottom) and 75% (top) quantiles of the distribution, the median of the distribution is shown as the horizontal line in the middle of the box, the ends of the two whiskers represent the 5% and 95% quantiles of the distribution, and the diamonds mark the outliers of the distribution. From the results we can observe that the medians, the 75% quantiles, and the maximum MAE values show a trend in which an increase in the number of partitions decreases the MAE. From 4 to 64 partitions, the median of MAE decreases from 2.11 to 2.02. The increase in accuracy can be attributed to the effectiveness of the graph partitioning of Metis, which separates nodes that are not temporally and spatially correlated. For smaller numbers of partitions, the presence of such nodes increases MAE. For 128 partitions (with only 87 nodes per partition), the observed MAE values are higher than those of 64 partitions. This is because the graph partitioning results in a significant number of spatially correlated nodes ending up in different partitions. This can be regarded as a tipping point for graph partitioning, which relates to the size and spread of the actual network.
Figure 3 shows the training time required for different numbers of partitions. We can observe that the time decreases significantly with an increase in the number of partitions. We can also observe that our approach reduces the training time from 2,820 minutes on 4 partitions (= 4 GPUs) to 178.67 minutes on 64 partitions (= 64 GPUs), resulting in a 15.78x speedup. Up to 64 partitions, we observe an almost linear speedup, where doubling the number of partitions (and GPUs) results in a 2x speedup. However, the speedup gains drop significantly with 128 partitions. This can be attributed to the reduction in the workload per GPU: with only 87 nodes per partition, there is not enough work to keep each GPU busy.
Since the best forecasting accuracy and speedup were obtained by using 64 partitions, we used it as the default number of partitions in the rest of the experiments.
5.2 Impact of training data size
Here, we assess the impact of training data size and show that it has a significant impact on the predictive accuracy.
From the full 36 weeks of training data, we selected the last 1, 2, 4, 12, and 20 weeks of data for training the DCRNN. The last weeks of data were chosen to minimize the impact of highway and sensor upgrades. Figure 4 shows the distribution of MAE of all nodes obtained using box-and-whisker plots. From the plots it can be observed that the medians, the 75% quantiles, and the maximum MAE values show that increasing the training data size decreases the MAE. These results show that DCRNN, similar to other state-of-the-art neural networks [9, 3], can leverage large amounts of data to improve accuracy. Therefore, we use the entire 36 weeks of training data in the rest of the experiments.
5.3 Impact of overlapping nodes and hyperparameter tuning
Here, we demonstrate that the graph-partitioning-based DCRNN achieves high forecasting accuracy using overlapping nodes and hyperparameter search.
We trained the graph-partitioning-based DCRNN with 64 partitions for the California highway network on 32 nodes of Cooley (two DCRNNs per node; 64 GPUs). We refer to this variant as DCRNN_64_naive. It took a total training time of 178.67 minutes. After training, we forecast the speed for 60 minutes on the test data and calculated the MAE for each node. The results are summarized in the first row of Table 1. We observe that the MAE values of 1,716, 6,729, 2,266, and 449 nodes are less than 1, between 1 and 3, between 3 and 5, and greater than 5, respectively.
Next, we trained the graph-partitioning-based DCRNN with 64 partitions with overlapping nodes as described in Section 3.3. We downsampled nodes with different distance threshold ($d$) values: 0.5 mile, 1 mile, 1.5 miles, 2 miles, and 3 miles. The results showed no significant improvement beyond the 1-mile threshold; therefore, we used 1 mile as the distance threshold for our experiments. In a given partition, while calculating the MAE for each node, we did not consider the overlapping nodes because they originally belong to a different partition, where their MAE values are computed. We refer to this variant as DCRNN_64_overlap. The results are shown in row 2 of Table 1. We observe that DCRNN_64_overlap outperforms DCRNN_64_naive in every MAE range. With reference to the latter, the number of nodes with MAE values less than 1 increased from 1,716 to 1,837; on the other hand, the number of nodes with MAE values between 1 and 3, between 3 and 5, and greater than 5 decreased from 6,729 to 6,687, 2,266 to 2,204, and 449 to 432, respectively. We observe that the training time increased from 178.67 minutes to 221.04 minutes, which can be attributed to the increase in the number of nodes per partition.
Finally, we ran a hyperparameter search with DeepHyper for DCRNN_64_naive and DCRNN_64_overlap. We used 5 months of data (from May 2018 to October 2018) from partition 1. We used 32 nodes of Cooley with 12 hours of wall-clock time as the stopping criterion. DeepHyper sampled 518 and 478 hyperparameter configurations for the naive and overlapping approaches, respectively. The best hyperparameter configuration was selected from each search and used for training and for evaluating the forecasting accuracy. We refer to these two variants as DCRNN_64_naive_hps and DCRNN_64_overlap_hps. The results are shown in rows 3 and 4 of Table 1. We observe that DCRNN_64_naive_hps outperforms DCRNN_64_naive, where hyperparameter tuning improved the accuracy of several nodes. The number of nodes with MAE values less than 1 and between 1 and 3 increased from 1,716 to 1,920 and from 6,729 to 6,897, respectively. The number of nodes with MAE values between 3 and 5 and greater than 5 decreased from 2,266 to 1,980 and from 449 to 363, respectively. We did not see a significant improvement with DCRNN_64_overlap_hps. The numbers of nodes in the MAE bins are similar to those of DCRNN_64_overlap. Moreover, hyperparameter tuning resulted in an increase in the number of trainable parameters, which increased the training time from 221.04 minutes to 461.57 minutes.
We did not notice a significant difference in the time required for forecasting on the test data. An exception is DCRNN_64_overlap_hps, where the large number of trainable parameters increases the forecasting time by about 1 minute (to 5.83 minutes).
To summarize, we can improve the graph-partitioning-based DCRNNs either by using overlapping nodes from other partitions or by tuning the hyperparameters of DCRNN. Combining both did not show any benefit in our study.









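Table 1: Number of nodes in each MAE bin, number of trainable parameters, training time, and forecasting time for the DCRNN variants with 64 partitions.

| No. | Variant | MAE < 1 | MAE 1 to 3 | MAE 3 to 5 | MAE > 5 | Trainable parameters | Training time (min) | Forecasting time (min) |
|---|---|---|---|---|---|---|---|---|
| 1 | DCRNN_64_naive | 1,716 | 6,729 | 2,266 | 449 | 14,608 | 178.67 | 4.38 |
| 2 | DCRNN_64_overlap | 1,837 | 6,687 | 2,204 | 432 | 14,608 | 221.04 | 4.88 |
| 3 | DCRNN_64_naive_hps | 1,920 | 6,897 | 1,980 | 363 | 19,808 | 287.05 | 4.92 |
| 4 | DCRNN_64_overlap_hps | 1,897 | 6,940 | 1,972 | 351 | 38,048 | 461.57 | 5.83 |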
5.4 Multi-output forecasting
Here, we show that a single DCRNN model can be used to predict the speed and flow simultaneously and the forecasting results preserve the fundamental properties of traffic flow.
Figure 5 shows the distribution of MAE of all nodes using box-and-whisker plots. The first and second box plots show the speed forecasts from the DCRNN models that are trained to forecast only speed and to forecast speed and flow simultaneously, respectively. Similarly, the third and fourth box plots show the flow forecasts. The median MAE of the speed-only model (first box plot) is 2.02, which is reduced to 1.98 when the multi-output model (second box plot) is used. Similarly, the median MAE of the flow-only model (third box plot) is 21.20, which is reduced to 20.64 when the multi-output model (fourth box plot) is used. We adopted a statistical test to check whether the difference in the observed MAE values between the two models is significant. We used the paired t-test and found that the multi-output model obtains MAE values that are significantly better than those of the speed-only or flow-only model (values of for speed and for flow). The superior performance of multi-output forecasting can be attributed to multitask learning [50]. The key advantage is that it leverages the commonalities and differences across the speed and flow learning tasks. This results in improved learning efficiency and consequently better forecasting accuracy than training the models separately.
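For reference, the paired t-test compares two models on the same nodes by testing whether the mean of the per-node MAE differences is zero. The sketch below computes only the test statistic and its degrees of freedom with the Python standard library; the function name is an assumption, and the p-value would still need to be obtained from a Student t distribution (for example, from a statistics package).

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for the paired differences a[i] - b[i].

    a and b hold the MAE of the same nodes under two models, in the same
    node order. A positive t means that model a has the larger mean MAE.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

In this setting, `a` would hold the per-node MAE of the single-output model and `b` the per-node MAE of the multi-output model, so a large positive statistic favors the multi-output model.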
In Figure 6, we show the speed and flow forecasting results of a congested node (ID: 717322, located on highway 60E in the Los Angeles area) in a scatter plot. We can observe that the speed and flow forecast values closely follow the fundamental flow diagram with three distinct phases: congestion, bounded, and free flow. This forecasting pattern shows that the DCRNN model has learned and preserved the properties of traffic flow.
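The fundamental diagram rests on the relation flow = density × speed, so a paired speed and flow forecast implies a density. The sketch below applies that relation; the function name, the units (miles per hour, vehicles counted per reporting interval), and the 5-minute default interval are assumptions made for illustration.

```python
def implied_density(
    speed_mph: float,
    vehicles_per_interval: float,
    interval_minutes: float = 5.0,
) -> float:
    """Density in vehicles per mile implied by a speed and flow pair.

    Uses flow = density * speed, with flow first converted from a count
    per reporting interval to vehicles per hour.
    """
    if speed_mph <= 0:
        raise ValueError("speed must be positive to infer a density")
    if interval_minutes <= 0:
        raise ValueError("interval must be positive")
    flow_per_hour = vehicles_per_interval * 60.0 / interval_minutes
    return flow_per_hour / speed_mph
```

For example, 150 vehicles counted in 5 minutes at 60 mph corresponds to 1,800 vehicles per hour and hence 30 vehicles per mile. Plotting forecast flow against this implied density is another way to view the same speed and flow pairs shown in the scatter plot.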
6 Related work
Modeling the flow and speed patterns of traffic in a highway network has been studied for decades. Capturing the spatiotemporal dependencies of the highway network is a crucial task for traffic forecasting. The methods for traffic forecasting are broadly classified into two main categories: knowledge-driven and data-driven approaches. In transportation and operational research, knowledge-driven methods usually apply queuing theory [11, 49, 33, 58] and Petri nets [48] to simulate user behaviors of the traffic. Usually, those approaches estimate the traffic flow of one intersection at a time. Traffic prediction for the full highway system of an entire state has not been attempted to date using knowledge-driven approaches.
Data-driven approaches have received notable attention in recent years. Traditional methods include statistical techniques such as autoregressive statistics for time series [56] and Kalman filtering techniques [32]. These models are mostly used to forecast at a single sensor location and are based on a stationary assumption about the time series data. Therefore, they often fail to capture nonlinear temporal dependencies and cannot predict overall traffic in a large-scale network [34]. Recently, statistical models have been challenged by machine learning methods for traffic forecasting. Machine learning models such as artificial neural networks (ANNs) [13, 27] and support vector machines (SVMs) [12, 2] can achieve more complex data modeling.
However, SVMs are computationally expensive for large networks, and ANNs cannot capture the spatial dependencies of the traffic network. Furthermore, the shallow architecture of ANNs makes the network less efficient than a deep learning architecture. Recently, deep learning models such as deep belief networks [26] and stacked autoencoders [40] have been used to capture effective features for traffic forecasting. Recurrent neural networks (RNNs) and their variants, long short-term memory (LSTM) networks [42] and gated recurrent units (GRUs) [20], show effective forecasting [15, 61] because of their ability to capture the temporal dependencies. RNN-based methods can capture contextual dependency in the temporal domain, but spatial dynamics are often missed. To capture the spatial dynamics, researchers have used convolutional neural networks (CNNs). Ma et al. [41] proposed an image-based traffic speed prediction method using CNNs, whereas Yu et al. [60] proposed spatiotemporal recurrent convolutional networks for traffic forecasting. Spatial dynamics have been captured by deep CNNs, and temporal dynamics have been learned by LSTM networks. In both, the highway network has been represented as an image, and the speed of each link is mapped by using color in the images. The model has been tested on 278 links of the Beijing transportation network. Zhang et al. [62, 63] also represented the flow of crowds in a traffic network using grid-based Euclidean space. The temporal closeness, period, and trend of the traffic were modeled by using a residual neural network framework. They evaluated the model on Beijing and New York City crowd flows. They used two datasets: (1) trajectory of taxicab GPS data of four time intervals and (2) trajectory of NYC bike data of one time interval. Trip data included trip duration, starting and ending sensor IDs, and start and end times. The key limitation of these approaches is that they do not capture non-Euclidean spatial connectivity. Du et al. [18] proposed a model with one-dimensional CNNs and GRUs with the attention mechanism to forecast traffic flow on UK traffic data. The contribution of this method is multimodal learning through the fusion of multiple features (flow, speed, events, weather, and so on) on single time series data of one year (34,876 timestamps in 15-minute intervals). The proposed approach is limited to a narrow spatial dimension, however.
Recently, CNNs have been generalized from a 2D grid-based convolution to a graph-based convolution in non-Euclidean space. Yu et al. [59] modeled the sensor network as an undirected graph and proposed a deep learning framework, called spatiotemporal graph convolutional networks, for speed forecasting. They applied graph convolution and gated temporal convolution through spatiotemporal convolutional blocks. The experiments were done on two datasets, BJER4 and PeMSD7, collected by the Beijing Municipal Traffic Commission and the California Department of Transportation, respectively. The maximum size of their dataset was 1,026 sensors of California district 7. However, these spectral-based convolution methods require the graph to be undirected. Hence, moving from a spectral-based to a vertex-based method, Atwood and Towsley [4] first proposed convolution as a diffusion process across the nodes of the graph. Later, Hechtlinger et al. [23] extended convolution to graphs by convolving every node with its closest neighbors selected by a random walk. However, none of these methods capture the temporal dependencies. Li et al. [34] were the first to propose the diffusion-convolutional recurrent neural network (DCRNN) to capture the spatiotemporal dynamics of the highway network.
Our approach differs from these works in many respects. From the problem perspective, none have addressed a problem size of 11,160 sensor locations covering the fully monitored California highway system. From the solutions perspective, a graph-partitioning-based approach for large-scale traffic forecasting, the use of multiple GPU nodes, and multi-output forecasting have not been investigated before.
7 Conclusion and future work
We described a traffic forecasting approach for a large highway network comprising the entire state of California with 11,160 sensor locations. We developed a graph-partitioning approach to partition the large highway network into a number of small networks, and trained them simultaneously on a moderately sized GPU cluster. We studied the impact of the number of partitions on the training time and accuracy. We showed that 64 partitions gave the best forecasting accuracy and GPU resource usage efficiency with a training time of 178 minutes. We demonstrated that our approach leverages a large training dataset to improve forecasting accuracy. We developed an overlapping-nodes approach to include spatially correlated nodes from different partitions and showed significant improvement in accuracy. We tuned the hyperparameters of the graph-partitioning-based DCRNN using DeepHyper and showed improvement in forecasting accuracy. We adapted and trained a single DCRNN model to forecast speed and flow and showed that the accuracy is better than that of models that predict either speed or flow and that the forecasts preserve the fundamental traffic flow dynamics. The DCRNN model, once trained, can be run on traditional hardware such as CPUs for forecasting without the need for multiple GPUs and could be readily integrated into a traffic management center. Once integrated into a traffic management center, the scale and accuracy of the forecasting techniques discussed in this work would likely lead to more proactive decision making, as well as better decisions, given the capability to make large-scale and accurate forecasts regarding future traffic states.
Our current and future work includes (1) extending the approach to large-scale traffic forecasting with mobile device data, with the goal of determining whether mobile device data can act as a proxy for inductive loop data, which could either substitute for poorly working loops or extend the scope of the monitoring to areas where loops would be prohibitively expensive; (2) combining DCRNN with large-scale simulation to integrate realistic speed and flow forecasts into active traffic management decision algorithms; and (3) developing models for route and policy scenario evaluation in adaptive traffic routing and management studies.
Acknowledgments
This material is based in part upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This report and the work described were sponsored by the U.S. Department of Energy (DOE) Vehicle Technologies Office (VTO) under the Big Data Solutions for Mobility Program, an initiative of the Energy Efficient Mobility Systems (EEMS) Program. The following DOE Office of Energy Efficiency and Renewable Energy (EERE) managers played important roles in establishing the project concept, advancing implementation, and providing ongoing guidance: David Anderson and Prasad Gupte.
References
 [1] (2003) Reinforcement learning for true adaptive traffic signal control. Journal of Transportation Engineering 129 (3), pp. 278–285.
 [2] (2016) Highway traffic flow prediction using support vector regression and Bayesian classifier. In 2016 International Conference on Big Data and Smart Computing (BigComp), pp. 239–244.
 [3] (2019) Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3159–3166.
 [4] (2016) Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001.
 [5] (2018) DeepHyper: asynchronous hyperparameter search for deep neural networks. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC), pp. 42–51.
 [6] (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179.
 [7] (2011) Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems, pp. 2546–2554.
 [8] (2005) Intelligent vehicle technology and trends.
 [9] (2018) Cascade R-CNN: delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162.
 [10] (2019) Caltrans performance measurement system (PeMS). Note: http://pems.dot.ca.gov/ Accessed: 2019-05-24
 [11] (2013) Transportation systems engineering: theory and methods. Vol. 49, Springer Science & Business Media.
 [12] (2009) Online-SVR for short-term traffic flow prediction under typical and atypical traffic conditions. Expert Systems with Applications 36 (3), pp. 6164–6173.
 [13] (2012) Neural-network-based models for short-term traffic flow forecasting using a hybrid exponential smoothing and Levenberg–Marquardt algorithm. IEEE Transactions on Intelligent Transportation Systems 13 (2), pp. 644–654.
 [14] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 [15] (2018) Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143.
 [16] (2012) Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231.
 [17] (1997) Total cost analysis: an alternative to benefit-cost analysis in evaluating transportation alternatives. Transportation 24 (2), pp. 107–123.
 [18] (2018) A hybrid method for traffic flow forecasting using multimodal deep learning. arXiv preprint arXiv:1803.02099.
 [19] (2017) State-of-the-art deep learning: evolving machine intelligence toward tomorrow’s intelligent network traffic control systems. IEEE Communications Surveys & Tutorials 19 (4), pp. 2432–2455.
 [20] (2016) Using LSTM and GRU neural network methods for traffic flow prediction. In 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 324–328.
 [21] (2011) Evaluating cooperation in communities with the k-core structure. In 2011 International Conference on Advances in Social Networks Analysis and Mining, pp. 87–93.
 [22] (1995) Time series analysis. Economic Theory. II, Princeton University Press, USA, pp. 625–630.
 [23] (2017) A generalization of convolutional neural networks to graph-structured data. arXiv preprint arXiv:1704.08165.
 [24] (2016) Parallel and distributed deep learning.
 [25] (1995) A multilevel algorithm for partitioning graphs. SC 95 (28), pp. 1–14.
 [26] (2014) Deep architecture for traffic flow prediction: deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems 15 (5), pp. 2191–2201.
 [27] (2011) Statistical methods versus neural networks in transportation research: differences, similarities and some insights. Transportation Research Part C: Emerging Technologies 19 (3), pp. 387–399.
 [28] (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20 (1), pp. 359–392.
 [29] (1998) Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing 48 (1), pp. 96–129.
 [30] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [31] (2015) Summarizing and understanding large graphs. Statistical Analysis and Data Mining: The ASA Data Science Journal 8 (3), pp. 183–202.
 [32] (2017) Traffic flow prediction using Kalman filtering technique. Procedia Engineering 187, pp. 582–587.
 [33] (2014) Predicting traffic congestion: a queuing perspective. Open Journal of Modelling and Simulation 2 (02), pp. 57.
 [34] (2018) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In International Conference on Learning Representations (ICLR ’18).
 [35] (2018) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. GitHub. Note: https://github.com/liyaguang/DCRNN
 [36] (2015) An empirical comparison of the summarization power of graph clustering methods. arXiv preprint arXiv:1511.06820.
 [37] (1993) Random walks on graphs: a survey. Combinatorics, Paul Erdős is Eighty 2 (1), pp. 1–46.
 [38] (2018) Open source routing machine - C++ backend. Note: https://github.com/Project-OSRM/osrm-backend Accessed: 2019-03-25
 [39] (2014) Traffic flow prediction with big data: a deep learning approach. IEEE Transactions on Intelligent Transportation Systems 16 (2), pp. 865–873.
 [40] (2015) Traffic flow prediction with big data: a deep learning approach. IEEE Transactions on Intelligent Transportation Systems 16 (2), pp. 865–873.
 [41] (2017) Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors 17 (4), pp. 818.
 [42] (2015) Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies 54, pp. 187–197.
 [43] (2016) METIS - serial graph partitioning and fill-reducing matrix ordering. Note: http://glaros.dtc.umn.edu/gkhome/metis/metis/overview Accessed: 2019-03-19
 [44] (2019) Open street map. Note: https://www.openstreetmap.org/#map=5/38.007/-95.844 Accessed: 2019-05-22
 [45] (1999) Adaptive route selection for dynamic route guidance system based on fuzzy-neural approaches. IEEE Transactions on Vehicular Technology 48 (6), pp. 2028–2041.
 [46] (2013) On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318.
 [47] (2014) Traffic time series forecasting by feedforward neural network: a case study based on traffic data of Monroe. The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences 40 (2), pp. 219.
 [48] (2008) A Petri nets based decision support tool for railway traffic conflicts forecasting and resolution. WIT Transactions on the Built Environment 103, pp. 483–492.
 [49] (2018) Queuing theory, time series and an application to tollgate’s traffic flow prediction.
 [50] (2018) Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pp. 527–538.
 [51] (2012) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. arXiv preprint arXiv:1211.0053.
 [52] (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
 [53] (2005) Traffic-incident detection-algorithm based on nonparametric regression. IEEE Transactions on Intelligent Transportation Systems 6 (1), pp. 38–42.
 [54] (2007) A genetic algorithm approach for optimizing traffic control signals considering routing. Computer-Aided Civil and Infrastructure Engineering 22 (1), pp. 31–43.
 [55] (2016) Scalable algorithms for data and network analysis. Foundations and Trends® in Theoretical Computer Science 12 (1–2), pp. 1–274.
 [56] (2003) Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: theoretical basis and empirical results. Journal of Transportation Engineering 129 (6), pp. 664–672.
 [57] (2017) Real-time road traffic state prediction based on ARIMA and Kalman filter. Frontiers of Information Technology & Electronic Engineering 18 (2), pp. 287–302.
 [58] (2014) The application of the queuing theory in the traffic flow of intersection. International Journal of Mathematical, Computational Sciences 8, pp. 986–989.
 [59] (2017) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875.
 [60] (2017) Spatiotemporal recurrent convolutional networks for traffic prediction in transportation networks. Sensors 17 (7), pp. 1501.
 [61] (2017) Deep learning: a generic approach for extreme condition traffic forecasting. In Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 777–785.
 [62] (2016) DNN-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 92.
 [63] (2017) Deep spatio-temporal residual networks for citywide crowd flows prediction. In Thirty-First AAAI Conference on Artificial Intelligence.
Government license
The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ("Argonne"). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. http://energy.gov/downloads/doe-public-access-plan.