Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction
This research was supported by NSFC (Nos. 61672399, U1401258), and the 973 Program (No. 2015CB352400).
Abstract
Forecasting the flow of crowds is of great importance to traffic management and public safety, and is very challenging because it is affected by many complex factors, such as inter-region traffic, events, and weather. We propose a deep-learning-based approach, called STResNet, to collectively forecast the inflow and outflow of crowds in every region of a city. We design an end-to-end structure of STResNet based on unique properties of spatiotemporal data. More specifically, we employ the residual neural network framework to model the temporal closeness, period, and trend properties of crowd traffic. For each property, we design a branch of residual convolutional units, each of which models the spatial properties of crowd traffic. STResNet learns to dynamically aggregate the output of the three residual neural networks based on data, assigning different weights to different branches and regions. The aggregation is further combined with external factors, such as weather and day of the week, to predict the final traffic of crowds in every region. Experiments on two types of crowd flows in Beijing and New York City (NYC) demonstrate that the proposed STResNet outperforms six well-known methods.
Junbo Zhang, Yu Zheng, Dekang Qi
Microsoft Research, Beijing, China; School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China; School of Computer Science and Technology, Xidian University, China; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
{junbo.zhang, yuzheng}@microsoft.com, dekangqi@outlook.com
Corresponding author: Yu Zheng. This work was done when the third author was an intern at Microsoft Research.
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
Predicting crowd flows in a city is of great importance to traffic management and public safety (?). For instance, massive crowds of people streamed into a strip region at the 2015 New Year’s Eve celebrations in Shanghai, resulting in a catastrophic stampede that killed 36 people. In mid-July of 2016, hundreds of “Pokemon Go” players ran through New York City’s Central Park in hopes of catching a particularly rare digital monster, leading to a dangerous stampede there. If one can predict the crowd flow in a region, such tragedies can be mitigated or prevented by utilizing emergency mechanisms, such as conducting traffic control, sending out warnings, or evacuating people, in advance.
In this paper, we predict two types of crowd flows (?): inflow and outflow, as shown in Figure 1(a). Inflow is the total traffic of crowds entering a region from other places during a given time interval. Outflow denotes the total traffic of crowds leaving a region for other places during a given time interval. Both flows track the transition of crowds between regions. Knowing them is very beneficial for risk assessment and traffic management. Inflow/outflow can be measured by the number of pedestrians, the number of cars driven on nearby roads, the number of people traveling on public transportation systems (e.g., metro, bus), or all of them together if data is available. Figure 1(b) presents an example: mobile phone signals can be used to measure the inflow and outflow of pedestrians in a region, and the GPS trajectories of vehicles similarly yield the two types of flows for vehicles.
Simultaneously forecasting the inflow and outflow of crowds in each region of a city, however, is very challenging, affected by the following three complex factors:

Spatial dependencies. The inflow of a region (see Figure 1(a)) is affected by outflows of nearby regions as well as distant regions. Likewise, the outflow of that region affects the inflows of other regions. The inflow of a region also affects its own outflow.

Temporal dependencies. The flow of crowds in a region is affected by recent time intervals, both near and far. For instance, traffic congestion occurring at 8am will affect traffic at 9am. In addition, traffic conditions during morning rush hours may be similar on consecutive workdays, repeating every 24 hours. Furthermore, morning rush hours may gradually happen later as winter comes: when the temperature gradually drops and the sun rises later in the day, people get up later and later.

External influence. Some external factors, such as weather conditions and events, may change the flow of crowds tremendously in different regions of a city.
To tackle these challenges, we propose a deep spatiotemporal residual network (STResNet) to collectively predict inflow and outflow of crowds in every region. Our contributions are fourfold:

STResNet employs convolution-based residual networks to model nearby and distant spatial dependencies between any two regions in a city, while ensuring the model’s prediction accuracy is not compromised by the deep structure of the neural network.

We summarize the temporal properties of crowd flows into three categories, consisting of temporal closeness, period, and trend. STResNet uses three residual networks to model these properties, respectively.

STResNet dynamically aggregates the output of the three aforementioned networks, assigning different weights to different branches and regions. The aggregation is further combined with external factors (e.g., weather).

We evaluate our approach using Beijing taxicabs’ trajectories and meteorological data, and NYC bike trajectory data. The results demonstrate the advantages of our approach compared with 6 baselines.
Preliminaries
In this section, we briefly revisit the crowd flows prediction problem (?; ?) and introduce deep residual learning (?).
Formulation of Crowd Flows Problem
Definition 1 (Region (?))
There are many definitions of a location in terms of different granularities and semantic meanings. In this study, we partition a city into an $I \times J$ grid map based on the longitude and latitude where a grid denotes a region, as shown in Figure 2(a).
Definition 2 (Inflow/outflow (?))
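Let $\mathbb{P}$ be a collection of trajectories at the $t^{th}$ time interval. For a grid $(i, j)$ that lies at the $i^{th}$ row and the $j^{th}$ column, the inflow and outflow of the crowds at the time interval $t$ are defined respectively as
$$x_t^{in,i,j} = \sum_{Tr \in \mathbb{P}} \left|\{k > 1 \mid g_{k-1} \notin (i,j) \wedge g_k \in (i,j)\}\right|$$
$$x_t^{out,i,j} = \sum_{Tr \in \mathbb{P}} \left|\{k \ge 1 \mid g_k \in (i,j) \wedge g_{k+1} \notin (i,j)\}\right|$$
where $Tr: g_1 \rightarrow g_2 \rightarrow \cdots \rightarrow g_{|Tr|}$ is a trajectory in $\mathbb{P}$, and $g_k$ is the geospatial coordinate; $g_k \in (i,j)$ means the point $g_k$ lies within grid $(i,j)$, and vice versa; $|\cdot|$ denotes the cardinality of a set.
At the $t^{th}$ time interval, inflow and outflow in all $I \times J$ regions can be denoted as a tensor $\mathbf{X}_t \in \mathbb{R}^{2 \times I \times J}$ where $(\mathbf{X}_t)_{0,i,j} = x_t^{in,i,j}$, $(\mathbf{X}_t)_{1,i,j} = x_t^{out,i,j}$. The inflow matrix is shown in Figure 2(b).
Formally, for a dynamical system over a spatial region represented by an $I \times J$ grid map, there are 2 types of flows in each grid over time. Thus, the observation at any time can be represented by a tensor $\mathbf{X} \in \mathbb{R}^{2 \times I \times J}$.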
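The inflow/outflow definition can be illustrated with a minimal sketch. It assumes each trajectory has already been mapped from longitude/latitude to a sequence of grid cells; the function name `crowd_flows` and the toy trajectories are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column) index of a grid cell


def crowd_flows(trajectories: List[List[Cell]], rows: int, cols: int):
    """Count inflow and outflow per grid cell for one time interval.

    Assumption of this sketch: every trajectory is a list of grid cells
    visited in order (coordinates were already mapped to cells).
    """
    inflow = [[0] * cols for _ in range(rows)]
    outflow = [[0] * cols for _ in range(rows)]
    for traj in trajectories:
        # Walk over consecutive point pairs (previous point, current point).
        for prev, curr in zip(traj, traj[1:]):
            if prev != curr:
                # Leaving `prev` adds to its outflow,
                # entering `curr` adds to its inflow.
                outflow[prev[0]][prev[1]] += 1
                inflow[curr[0]][curr[1]] += 1
    return inflow, outflow


# Two toy trajectories on a 2 x 2 grid.
trajs = [
    [(0, 0), (0, 1), (1, 1)],
    [(1, 0), (1, 1), (1, 1)],  # the last step stays inside the same cell
]
inflow, outflow = crowd_flows(trajs, rows=2, cols=2)
```

Stacking `inflow` and `outflow` gives the two channels of the flow tensor for that interval; a step that stays inside the same cell contributes to neither flow.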
Problem 1
Given the historical observations $\{\mathbf{X}_t \mid t = 0, \cdots, n-1\}$, predict $\mathbf{X}_n$.
Deep Residual Learning
Deep residual learning (?) allows convolutional neural networks to have a very deep structure of 100 layers, or even more than 1000 layers. This method has shown state-of-the-art results on multiple challenging recognition tasks, including image classification, object detection, segmentation and localization (?).
Formally, a residual unit with an identity mapping (?) is defined as:
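$$\mathbf{X}^{(l+1)} = \mathbf{X}^{(l)} + \mathcal{F}\left(\mathbf{X}^{(l)}\right) \tag{1}$$
where $\mathbf{X}^{(l)}$ and $\mathbf{X}^{(l+1)}$ are the input and output of the $l^{th}$ residual unit, respectively; $\mathcal{F}$ is a residual function, e.g., a stack of two convolution layers in (?). The central idea of residual learning is to learn the additive residual function $\mathcal{F}$ with respect to $\mathbf{X}^{(l)}$ (?).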
Deep Spatio-Temporal Residual Networks
Figure 3 presents the architecture of STResNet, which comprises four major components modeling temporal closeness, period, trend, and external influence, respectively. As illustrated in the top-right part of Figure 3, we first turn the inflow and outflow throughout a city at each time interval into a 2-channel image-like matrix, using the approach introduced in Definitions 1 and 2. We then divide the time axis into three fragments, denoting recent time, near history, and distant history. The 2-channel flow matrices of intervals in each time fragment are then fed into the first three components separately to model the aforementioned three temporal properties: closeness, period, and trend, respectively. The first three components share the same network structure: a convolutional neural network followed by a Residual Unit sequence. Such a structure captures the spatial dependency between nearby and distant regions. In the external component, we manually extract some features from external datasets, such as weather conditions and events, and feed them into a two-layer fully-connected neural network. The outputs of the first three components are fused as $\mathbf{X}_{Res}$ based on parameter matrices, which assign different weights to the results of different components in different regions. $\mathbf{X}_{Res}$ is further integrated with the output of the external component $\mathbf{X}_{Ext}$. Finally, the aggregation is mapped into $[-1, 1]$ by a Tanh function, which yields faster convergence than the standard logistic function in the process of back-propagation learning (?).
Structures of the First Three Components
The first three components (i.e. closeness, period, trend) share the same network structure, which is composed of two subcomponents: convolution and residual unit, as shown in Figure 4.
Convolution. A city is usually very large and contains many regions at different distances from one another. Intuitively, the flows of crowds in nearby regions may affect each other, which can be effectively handled by the convolutional neural network (CNN), which has shown a powerful ability to hierarchically capture spatial structural information (?). In addition, subway systems and highways connect locations that are far apart, leading to dependencies between distant regions. To capture the spatial dependency of any region, we need to design a CNN with many layers, because one convolution only accounts for spatially near dependencies, limited by the size of its kernel. The same problem has also been found in the video sequence generation task, where the input and output have the same resolution (?). Several methods have been introduced to avoid the loss of resolution brought about by subsampling while preserving distant dependencies (?). Unlike the classical CNN, we do not use subsampling, but only convolutions (?). As shown in Figure 4(a), there are three levels of feature maps that are connected with a few convolutions. A node in the high-level feature map depends on nine nodes of the middle-level feature map, which in turn depend on all nodes in the lower-level feature map (i.e., the input). This means that one convolution naturally captures spatially near dependencies, and a stack of convolutions can further capture distant, even citywide, dependencies.
The closeness component of Figure 3 adopts a few 2-channel flow matrices of intervals in the recent time to model temporal closeness dependence. Let the recent fragment be $[\mathbf{X}_{t-l_c}, \mathbf{X}_{t-(l_c-1)}, \cdots, \mathbf{X}_{t-1}]$, which is also known as the closeness dependent sequence. We first concatenate them along the first axis (i.e., time interval) as one tensor $\mathbf{X}_c^{(0)} \in \mathbb{R}^{2 l_c \times I \times J}$, which is followed by a convolution (i.e., Conv1 shown in Figure 3) as:
$$\mathbf{X}_c^{(1)} = f\left(W_c^{(1)} * \mathbf{X}_c^{(0)} + b_c^{(1)}\right)$$
where $*$ denotes the convolution; $f$ is an activation function, e.g., the rectifier (?); $W_c^{(1)}, b_c^{(1)}$ are the learnable parameters in the first layer. To make the input and output have the same size (i.e., $I \times J$) in a convolutional operator, we employ a border mode that allows a filter to go outside the border of an input, padding each area outside the border with a zero.
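The zero-padded border mode can be sketched for a single channel in plain Python. This is only an illustration under simplifying assumptions (one input channel, one filter, no bias or activation); `conv2d_same` is a hypothetical helper, not the Theano/Keras operator used by the authors.

```python
def conv2d_same(image, kernel):
    """Single-channel 2-D convolution (cross-correlation, as in most deep
    learning libraries) with zero padding, so output size equals input size."""
    rows, cols = len(image), len(image[0])
    k = len(kernel)   # square kernel with odd side length (assumption)
    pad = k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    r, c = i + di - pad, j + dj - pad
                    # Positions outside the border contribute zero.
                    if 0 <= r < rows and 0 <= c < cols:
                        acc += kernel[di][dj] * image[r][c]
            out[i][j] = acc
    return out


flow = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
box = [[1.0] * 3 for _ in range(3)]  # 3 x 3 filter of ones
summed = conv2d_same(flow, box)
```

The output keeps the 2 x 3 shape of the input, which is the property the border mode is meant to guarantee; each output cell only sees its immediate neighbours.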
Residual Unit. It is well known that very deep convolutional networks are hard to train effectively, even when well-known activation functions (e.g., ReLU) and regularization techniques are applied (?; ?; ?). On the other hand, we still need a very deep network to capture very large citywide dependencies. For typical crowd flows data, assume that the input size is $32 \times 32$ and the kernel size of convolution is fixed to $3 \times 3$; if we want to model citywide dependencies (i.e., each node in a high-level layer depends on all nodes of the input), more than 15 consecutive convolutional layers are needed. To address this issue, we employ residual learning (?) in our model, which has been demonstrated to be very effective for training very deep neural networks of over 1000 layers.
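The depth estimate can be checked with a short receptive-field calculation. The sketch assumes a 32-cell-wide grid (the TaxiBJ grid size in Table 1), 3 x 3 kernels, stride 1 and no subsampling, and it reads "citywide" as "the receptive field is at least as wide as the grid"; nodes at the border of the grid would need even more layers to see the opposite border.

```python
def receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Width of the input window seen by one output node after stacking
    `num_layers` stride-1 convolutions with a square `kernel`."""
    return 1 + num_layers * (kernel - 1)


def layers_to_cover(grid: int, kernel: int = 3) -> int:
    """Smallest number of stacked convolutions whose receptive field is
    at least `grid` cells wide."""
    layers = 0
    while receptive_field(layers, kernel) < grid:
        layers += 1
    return layers


needed = layers_to_cover(32)  # 15 layers see 31 cells, 16 layers see 33
```

Under these assumptions 16 layers are required, which agrees with the statement that more than 15 consecutive convolutional layers are needed.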
In our STResNet (see Figure 3), we stack $L$ residual units upon Conv1 as follows,
$$\mathbf{X}_c^{(l+1)} = \mathbf{X}_c^{(l)} + \mathcal{F}\left(\mathbf{X}_c^{(l)}; \theta_c^{(l)}\right), \quad l = 1, \cdots, L \tag{2}$$
where $\mathcal{F}$ is the residual function (i.e., two combinations of “ReLU + Convolution”, see Figure 4(b)), and $\theta_c^{(l)}$ includes all learnable parameters in the $l^{th}$ residual unit. We also try Batch Normalization (BN) (Ioffe and Szegedy 2015), added before each ReLU. On top of the $L^{th}$ residual unit, we append a convolutional layer (i.e., Conv2 shown in Figure 3). With 2 convolutions and $L$ residual units, the output of the closeness component of Figure 3 is $\mathbf{X}_c^{(L+2)}$.
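The stacking rule of Eq. 2 can be sketched on a flat vector. To keep the example short, each convolution is replaced by a single scalar weight, which is a deliberate simplification and not the authors' layer; `make_residual_fn` and `residual_stack` are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def make_residual_fn(w1: float, w2: float) -> Callable[[Vector], Vector]:
    """Residual function with two "ReLU + linear map" stages.

    The scalar weights w1 and w2 stand in for the two convolutions of a
    residual unit (an assumption made to keep the sketch tiny).
    """
    def residual_fn(x: Vector) -> Vector:
        hidden = [w1 * v for v in relu(x)]
        return [w2 * v for v in relu(hidden)]
    return residual_fn


def residual_stack(x: Vector, residual_fns) -> Vector:
    """Apply x <- x + F(x) once per residual unit."""
    for fn in residual_fns:
        x = [a + b for a, b in zip(x, fn(x))]
    return x


units = [make_residual_fn(1.0, 0.5), make_residual_fn(1.0, 0.5)]
out = residual_stack([1.0, -2.0], units)

# With all-zero weights every unit reduces to the identity mapping.
identity_out = residual_stack([1.0, -2.0], [make_residual_fn(0.0, 0.0)] * 3)
```

The second call shows why the shortcut helps with depth: a unit whose residual function outputs zero passes its input through unchanged, so adding units cannot by itself destroy the signal.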
Likewise, using the above operations, we can construct the period and trend components of Figure 3. Assume that there are $l_p$ time intervals from the period fragment and the period is $p$. Therefore, the period dependent sequence is $[\mathbf{X}_{t-l_p \cdot p}, \mathbf{X}_{t-(l_p-1) \cdot p}, \cdots, \mathbf{X}_{t-p}]$. With the convolutional operation and $L$ residual units, as in the Conv1 convolution and Eq. 2, the output of the period component is $\mathbf{X}_p^{(L+2)}$. Meanwhile, the output of the trend component is $\mathbf{X}_q^{(L+2)}$ with the input $[\mathbf{X}_{t-l_q \cdot q}, \mathbf{X}_{t-(l_q-1) \cdot q}, \cdots, \mathbf{X}_{t-q}]$, where $l_q$ is the length of the trend dependent sequence and $q$ is the trend span. Note that $p$ and $q$ are actually two different types of periods. In the detailed implementation, $p$ is equal to one day, which describes daily periodicity, and $q$ is equal to one week, which reveals the weekly trend.
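The three dependent sequences can be written down as index arithmetic. In the sketch below `l_c`, `l_p` and `l_q` denote the lengths of the closeness, period and trend sequences, and `p` and `q` the period and trend span measured in time intervals; the function name and the example lengths are illustrative assumptions rather than the paper's settings.

```python
def dependent_indices(t, l_c, l_p, l_q, p, q):
    """Indices of the frames fed to the closeness, period and trend
    branches when predicting interval `t` (oldest frame first)."""
    closeness = [t - k for k in range(l_c, 0, -1)]
    period = [t - k * p for k in range(l_p, 0, -1)]
    trend = [t - k * q for k in range(l_q, 0, -1)]
    return closeness, period, trend


# With 30-minute intervals, one day is 48 intervals and one week is 336.
closeness, period, trend = dependent_indices(
    t=1000, l_c=3, l_p=2, l_q=1, p=48, q=336
)
```

A target interval is only usable as a training instance when all of these indices refer to intervals that exist in the data, which matters for datasets with gaps between collection periods.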
The Structure of the External Component
Traffic flows can be affected by many complex external factors, such as weather and events. Figure 5(a) shows that crowd flows during holidays (Chinese Spring Festival) can be significantly different from the flows during normal days. Figure 5(b) shows that heavy rain sharply reduces the crowd flows at Office Area compared to the same day of the latter week. Let $E_t$ be the feature vector that represents these external factors at the predicted time interval $t$. In our implementation, we mainly consider weather, holiday events, and metadata (i.e., DayOfWeek, Weekday/Weekend). The details are introduced in Table 1. To predict flows at time interval $t$, the holiday event and metadata can be directly obtained. However, the weather at the future time interval $t$ is unknown. Instead, one can use the forecast weather at time interval $t$ or the approximate weather at time interval $t-1$. Formally, we stack two fully-connected layers upon $E_t$: the first layer can be viewed as an embedding layer for each sub-factor followed by an activation, and the second layer maps low to high dimensions that have the same shape as $\mathbf{X}_t$. The output of the external component of Figure 3 is denoted as $\mathbf{X}_{Ext}$ with the parameters $\theta_{Ext}$.
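As a rough illustration of how the metadata and holiday part of the external feature vector might be encoded, the sketch below builds a binary vector from a timestamp. The layout (seven day-of-week slots, one weekend flag, one holiday flag) and the name `external_features` are assumptions made for this example; the paper does not specify the exact layout, and weather features are omitted.

```python
from datetime import datetime


def external_features(ts: datetime, is_holiday: bool) -> list:
    """Binary feature vector: one-hot day of week, weekend flag, holiday flag."""
    day_of_week = [0] * 7
    day_of_week[ts.weekday()] = 1          # Monday = 0 ... Sunday = 6
    is_weekend = 1 if ts.weekday() >= 5 else 0
    return day_of_week + [is_weekend, int(is_holiday)]


# 3 March 2015 was a Tuesday and is treated here as a normal working day.
features = external_features(datetime(2015, 3, 3, 21, 0), is_holiday=False)
```

A vector of this kind would then be passed through the two fully-connected layers and reshaped to match the flow tensor.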
Fusion
In this section, we discuss how to fuse the four components of Figure 3. We first fuse the first three components with a parametric-matrix-based fusion method, which is then further combined with the external component.
Figures 6(a) and (d) show the ratio curves using Beijing trajectory data presented in Table 1, where the horizontal axis is the time gap between two time intervals and the vertical axis is the average ratio value between any two inflows that have the same time gap. The curves from the two different regions both show an empirical temporal correlation in the time series, namely, inflows of recent time intervals are more relevant than those of distant time intervals, which implies temporal closeness. The two curves have different shapes, which demonstrates that different regions may have different characteristics of closeness. Figures 6(b) and (e) depict inflows at all time intervals of 7 days. We can see the obvious daily periodicity in both regions. In Office Area, the peak values on weekdays are much higher than those on weekends. Residential Area has similar peak values for both weekdays and weekends. Figures 6(c) and (f) describe inflows at a certain time interval (9:00pm-9:30pm) of Tuesday from March 2015 to June 2015. As time goes by, the inflow progressively decreases in Office Area, and increases in Residential Area. This shows the different trends in different regions. In summary, inflows of the two regions are all affected by closeness, period, and trend, but the degrees of influence may be very different. We also find the same properties in other regions as well as in their outflows.
Inspired by these observations, we propose a parametric-matrix-based fusion method.
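Parametric-matrix-based fusion. We fuse the first three components (i.e., closeness, period, trend) of Figure 3 as follows
$$\mathbf{X}_{Res} = \mathbf{W}_c \circ \mathbf{X}_c^{(L+2)} + \mathbf{W}_p \circ \mathbf{X}_p^{(L+2)} + \mathbf{W}_q \circ \mathbf{X}_q^{(L+2)} \tag{3}$$
where $\circ$ is the Hadamard product (i.e., element-wise multiplication), and $\mathbf{W}_c$, $\mathbf{W}_p$ and $\mathbf{W}_q$ are the learnable parameters that adjust the degrees affected by closeness, period and trend, respectively.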
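The element-wise weighting can be sketched on tiny single-channel grids. The weight values below are made up for illustration (in the model they are learned), and `hadamard` and `fuse` are hypothetical helper names.

```python
def hadamard(w, x):
    """Element-wise product of two equally shaped 2-D lists."""
    return [[wv * xv for wv, xv in zip(w_row, x_row)]
            for w_row, x_row in zip(w, x)]


def fuse(x_c, x_p, x_q, w_c, w_p, w_q):
    """Weighted sum of the closeness, period and trend outputs, with one
    weight per branch and per region."""
    terms = [hadamard(w_c, x_c), hadamard(w_p, x_p), hadamard(w_q, x_q)]
    return [[sum(cell) for cell in zip(*rows)] for rows in zip(*terms)]


# A 1 x 2 grid: region 0 trusts closeness, region 1 trusts period.
x_c, x_p, x_q = [[1.0, 1.0]], [[2.0, 2.0]], [[4.0, 4.0]]
w_c, w_p, w_q = [[1.0, 0.0]], [[0.0, 1.0]], [[0.5, 0.5]]
x_res = fuse(x_c, x_p, x_q, w_c, w_p, w_q)
```

Setting every weight to one reduces the fusion to a plain sum of the three branch outputs, so different regions can no longer weight the branches differently.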
Fusing the external component. We here directly merge the output of the first three components with that of the external component, as shown in Figure 3. Finally, the predicted value at the $t^{th}$ time interval, denoted by $\hat{\mathbf{X}}_t$, is defined as
$$\hat{\mathbf{X}}_t = \tanh\left(\mathbf{X}_{Res} + \mathbf{X}_{Ext}\right) \tag{4}$$
where $\tanh$ is a hyperbolic tangent that ensures the output values are between -1 and 1.
Our STResNet can be trained to predict $\mathbf{X}_t$ from three sequences of flow matrices and external factor features by minimizing the mean squared error between the predicted flow matrix and the true flow matrix:
$$\mathcal{L}(\theta) = \left\|\mathbf{X}_t - \hat{\mathbf{X}}_t\right\|_2^2 \tag{5}$$
where $\theta$ are all learnable parameters in the STResNet.
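The output activation and the training objective can be sketched together. The helper names `predict` and `squared_error` and the toy numbers are assumptions; the sketch only evaluates the loss for one instance and does not perform any gradient step.

```python
import math


def predict(x_res, x_ext):
    """Element-wise tanh of the fused residual output plus the external
    output, so every predicted value lies in [-1, 1]."""
    return [[math.tanh(r + e) for r, e in zip(res_row, ext_row)]
            for res_row, ext_row in zip(x_res, x_ext)]


def squared_error(x_true, x_pred):
    """Sum of squared differences between true and predicted flows."""
    return sum((t - p) ** 2
               for true_row, pred_row in zip(x_true, x_pred)
               for t, p in zip(true_row, pred_row))


x_res = [[0.5, -2.0]]   # toy fused output of the three branches
x_ext = [[0.0, 0.5]]    # toy output of the external component
x_hat = predict(x_res, x_ext)
loss = squared_error([[0.4, -0.9]], x_hat)
```

Because the prediction is bounded by the hyperbolic tangent, the targets have to be scaled into the same range before the loss is computed.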
Algorithm and Optimization
Algorithm 1 outlines the STResNet training process. We first construct the training instances from the original sequence data (lines 1-6). Then, STResNet is trained via backpropagation and Adam (Kingma and Ba 2014) (lines 7-11).
Experiments
Settings
Datasets. We use two different sets of data as shown in Table 1. Each dataset contains two sub-datasets, trajectories and weather, as detailed below.

TaxiBJ: Trajectory data is the taxicab GPS data and meteorology data in Beijing from four time intervals: 1st Jul. 2013 - 30th Oct. 2013, 1st Mar. 2014 - 30th Jun. 2014, 1st Mar. 2015 - 30th Jun. 2015, 1st Nov. 2015 - 10th Apr. 2016. Using Definition 2, we obtain two types of crowd flows. We choose data from the last four weeks as the testing data, and all data before that as training data.

BikeNYC: Trajectory data is taken from the NYC Bike system in 2014, from Apr. 1st to Sept. 30th. Trip data includes: trip duration, starting and ending station IDs, and start and end times. Among the data, the last 10 days are chosen as testing data, and the others as training data.
Table 1: Datasets.
| Dataset | TaxiBJ | BikeNYC |
| --- | --- | --- |
| Data type | Taxi GPS | Bike rent |
| Location | Beijing | New York |
| Time span | 7/1/2013 - 10/30/2013; 3/1/2014 - 6/30/2014; 3/1/2015 - 6/30/2015; 11/1/2015 - 4/10/2016 | 4/1/2014 - 9/30/2014 |
| Time interval | 30 minutes | 1 hour |
| Grid map size | (32, 32) | (16, 8) |
| Trajectory data | | |
| Average sampling rate (s) | 60 | |
| # taxis/bikes | 34,000+ | 6,800+ |
| # available time intervals | 22,459 | 4,392 |
| External factors (holidays and meteorology) | | |
| # holidays | 41 | 20 |
| Weather conditions | 16 types (e.g., Sunny, Rainy) | |
| Temperature / °C | | |
| Wind speed / mph | | |
Baselines. We compare our STResNet with the following 6 baselines:

HA: We predict inflow and outflow of crowds by the average value of historical inflow and outflow in the corresponding periods. For example, for 9:00am-9:30am on Tuesday, the corresponding periods are all historical time intervals from 9:00am to 9:30am on all historical Tuesdays.

ARIMA: Auto-Regressive Integrated Moving Average (ARIMA) is a well-known model for understanding and predicting future values in a time series.

SARIMA: Seasonal ARIMA.

VAR: Vector AutoRegressive (VAR) is a more advanced spatiotemporal model, which can capture the pairwise relationships among all flows, and has heavy computational costs due to the large number of parameters.

STANN: It first extracts spatial (nearby 8 regions’ values) and temporal (8 previous time intervals) features, and then feeds them into an artificial neural network.

DeepST (Zhang et al. 2016): a deep neural network (DNN)-based prediction model for spatiotemporal data, which shows state-of-the-art results on crowd flows prediction. It has 4 variants, DeepSTC, DeepSTCP, DeepSTCPT, and DeepSTCPTM, which focus on different temporal dependencies and external factors.
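Of the baselines above, HA is simple enough to sketch in a few lines. The record layout `(weekday, slot, flow)` and the function name are assumptions made for this example; slot 18 stands for 9:00am-9:30am when a day is split into 30-minute intervals.

```python
from collections import defaultdict


def historical_average(history):
    """Average flow per (weekday, time slot) over all historical records.

    `history` is an iterable of (weekday, slot, flow) tuples, a layout
    assumed for this sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for weekday, slot, flow in history:
        sums[(weekday, slot)] += flow
        counts[(weekday, slot)] += 1
    return {key: sums[key] / counts[key] for key in sums}


history = [
    (1, 18, 100.0),  # a Tuesday, 9:00am-9:30am
    (1, 18, 120.0),  # another Tuesday, same slot
    (2, 18, 90.0),   # a Wednesday, same slot
]
ha = historical_average(history)
```

The prediction for a future Tuesday at 9:00am-9:30am is then `ha[(1, 18)]`, the mean over all past Tuesdays in that slot.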
Preprocessing. In the output of the STResNet, we use $\tanh$ as our final activation (see Eq. 4), whose range is between -1 and 1. Here, we use the Min-Max normalization method to scale the data into the range $[-1, 1]$. In the evaluation, we rescale the predicted values back to their original scale before comparing them with the ground truth. For external factors, we use one-hot coding to transform metadata (i.e., DayOfWeek, Weekend/Weekday), holidays and weather conditions into binary vectors, and use Min-Max normalization to rescale the Temperature and Wind speed.
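The scaling into the range of the hyperbolic tangent and its inverse can be sketched as follows. The helper names and the toy bounds are assumptions; in practice the minimum and maximum would presumably be taken from the training data.

```python
def minmax_scale(values, lo, hi):
    """Map values from [lo, hi] linearly onto [-1, 1]."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def minmax_inverse(scaled, lo, hi):
    """Map values from [-1, 1] back onto [lo, hi]."""
    return [(s + 1.0) / 2.0 * (hi - lo) + lo for s in scaled]


flows = [0.0, 50.0, 200.0]
scaled = minmax_scale(flows, lo=0.0, hi=200.0)
restored = minmax_inverse(scaled, lo=0.0, hi=200.0)
```

Applying the inverse to the model output before computing the error keeps the reported RMSE in the original unit of flows.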
Hyperparameters. The Python libraries Theano (Theano Development Team 2016) and Keras (Chollet 2015) are used to build our models. The convolutions of Conv1 and all residual units use 64 filters of size $3 \times 3$, and Conv2 uses a convolution with 2 filters of size $3 \times 3$. The batch size is 32. We select 90% of the training data for training each model, and the remaining 10% is used as the validation set, which is used to early-stop our training algorithm for each model based on the best validation score. Afterwards, we continue to train the model on the full training data for a fixed number of epochs (e.g., 10, 100 epochs). There are 5 extra hyperparameters in our STResNet, of which $p$ and $q$ are empirically fixed to one day and one week, respectively. The other three are the lengths of the three dependent sequences, $l_c$, $l_p$, and $l_q$.
Evaluation Metric: We measure our method by Root Mean Square Error (RMSE) as
$$RMSE = \sqrt{\frac{1}{z} \sum_{i} \left(x_i - \hat{x}_i\right)^2}$$
where $\hat{x}$ and $x$ are the predicted value and ground truth, respectively; $z$ is the number of all predicted values.
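The metric itself is a one-liner; the sketch below assumes the predicted and true values have been flattened into two equally long lists.

```python
import math


def rmse(predicted, truth):
    """Root mean square error over all predicted values."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))


error = rmse([12.0, 8.0, 5.0, 1.0], [10.0, 10.0, 3.0, 3.0])
```

In this toy case every prediction is off by exactly 2, so the RMSE is also 2.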
Results on TaxiBJ
We first give the comparison with 6 other models on TaxiBJ, as shown in Table 2. We give 7 variants of STResNet with different layers and different factors. Taking L12E for example, it considers all available external factors and has 12 residual units, each of which comprises two convolutional layers. We observe that all 7 of these models are better than the 6 baselines. Compared with the previous state-of-the-art model (DeepST, RMSE 18.18), L12EBN reduces the error to 16.69, a relative reduction of about 8%.
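Table 2: RMSE of different models on TaxiBJ.
| Model | Description | RMSE |
| --- | --- | --- |
| HA | | 57.69 |
| ARIMA | | 22.78 |
| SARIMA | | 26.88 |
| VAR | | 22.88 |
| STANN | | 19.57 |
| DeepST | | 18.18 |
| STResNet [ours] | | |
| L2E | 2 residual units + E | 17.67 |
| L4E | 4 residual units + E | 17.51 |
| L12E | 12 residual units + E | 16.89 |
| L12EBN | L12E with BN | 16.69 |
| L12singleE | 12 residual units (1 conv) + E | 17.40 |
| L12 | 12 residual units | 17.00 |
| L12EnoFusion | 12 residual units + E without fusion | 17.96 |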
Effects of Different Components. Let L12E be the compared model.

Number of residual units: Results of L2E, L4E and L12E show that RMSE decreases as the number of residual units increases (17.67, 17.51, and 16.89, respectively). With residual learning, the deeper networks in these experiments give more accurate results.

Internal structure of residual unit: We try three different types of residual units. L12E adopts the standard Residual Unit (see Figure 4(b)). Compared with L12E, the Residual Unit of L12singleE contains only 1 ReLU followed by 1 convolution, and the Residual Unit of L12EBN adds two batch normalization layers, each of which is inserted before a ReLU. We observe that L12singleE is worse than L12E, and L12EBN is the best, demonstrating the effectiveness of batch normalization.

External factors: L12E considers the external factors, including meteorology data, holiday events and metadata. Without them, the model degrades to L12. The results indicate that L12E is better than L12 (16.89 versus 17.00), suggesting that external factors are beneficial.

Parametric-matrix-based fusion: Unlike L12E, L12EnoFusion does not use parametric-matrix-based fusion (see Eq. 3). Instead, L12EnoFusion uses a straightforward method for fusing, i.e., $\mathbf{X}_c^{(L+2)} + \mathbf{X}_p^{(L+2)} + \mathbf{X}_q^{(L+2)}$. The error increases markedly (RMSE 17.96 versus 16.89), which demonstrates the effectiveness of our proposed parametric-matrix-based fusion.
Results on BikeNYC
Table 3 shows the results of our model and other baselines on BikeNYC. Unlike TaxiBJ, BikeNYC consists of two different types of crowd flows, new-flow and end-flow (?). Here, we adopt a STResNet with 4 residual units, and consider the metadata as external features like DeepST (Zhang et al. 2016). From Table 3, the RMSE of STResNet is between 14.8% (versus DeepSTCPTM) and 40.1% (versus SARIMA) lower than that of these baselines, demonstrating that our proposed model generalizes well to other flow prediction tasks.
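Table 3: RMSE of different models on BikeNYC.
| Model | RMSE |
| --- | --- |
| ARIMA | 10.07 |
| SARIMA | 10.56 |
| VAR | 9.92 |
| DeepSTC | 8.39 |
| DeepSTCP | 7.64 |
| DeepSTCPT | 7.56 |
| DeepSTCPTM | 7.43 |
| STResNet [ours, 4 residual units] | 6.33 |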
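The relative RMSE reductions implied by Table 3 can be recomputed directly from the table. The helper name is hypothetical, and the percentages below are derived from the tabulated values rather than quoted from the text.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


st_resnet = 6.33
vs_best_baseline = relative_reduction(7.43, st_resnet)    # DeepSTCPTM
vs_worst_baseline = relative_reduction(10.56, st_resnet)  # SARIMA
```

Rounded to one decimal place, these come to 14.8% and 40.1%, respectively.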
Related Work
Crowd Flow Prediction. There are some previously published works on predicting an individual’s movement based on their location history (?; ?). They mainly forecast millions, even billions, of individuals’ mobility traces rather than the aggregated crowd flows in a region. Such a task may require huge computational resources, and it is not always necessary for the application scenario of public safety. Some other researchers aim to predict travel speed and traffic volume on the road (?; ?; ?). Most of them predict single or multiple road segments, rather than citywide ones. Recently, researchers have started to focus on city-scale traffic flow prediction (?; ?). Both works differ from ours: their methods focus on individual regions rather than the whole city, and they do not partition the city with a grid-based method, so they need a more complex procedure to find irregular regions first.
Deep Learning. CNNs have been successfully applied to various problems, especially in the field of computer vision (?). Residual learning (?) allows such networks to have a very deep structure. Recurrent neural networks (RNNs) have been used successfully for sequence learning tasks (?). The incorporation of long short-term memory (LSTM) enables RNNs to learn long-term temporal dependency. However, each of these two kinds of neural networks captures only spatial or only temporal dependencies. Recently, researchers combined the above networks and proposed a convolutional LSTM network (Xingjian et al. 2015) that learns spatial and temporal dependencies simultaneously. Such a network cannot model very long-range temporal dependencies (e.g., period and trend), and training becomes more difficult as depth increases.
In our previous work (Zhang et al. 2016), a general prediction model based on DNNs was proposed for spatiotemporal data. In this paper, to model a specific spatiotemporal prediction task (i.e., citywide crowd flows) effectively, we mainly propose employing residual learning and a parametric-matrix-based fusion mechanism. A survey on data fusion methodologies can be found in (Zheng 2015).
Conclusion and Future Work
We propose a novel deep-learning-based model for forecasting the flow of crowds in every region of a city, based on historical trajectory data, weather and events. We evaluate our model on two types of crowd flows in Beijing and NYC, achieving lower RMSE than all 6 baseline methods, which indicates that our model is better suited to crowd flow prediction. The code and datasets have been released at: https://www.microsoft.com/enus/research/publication/deepspatiotemporalresidualnetworksforcitywidecrowdflowsprediction.
In the future, we will consider other types of flows (e.g., taxi/truck/bus trajectory data, phone signals data, metro card swiping data), and use all of them to generate more types of flow predictions, and collectively predict all of these flows with an appropriate fusion mechanism.
References
 [Abadi, Rajabioun, and Ioannou 2015] Abadi, A.; Rajabioun, T.; and Ioannou, P. A. 2015. Traffic flow prediction for road transportation networks with limited traffic data. IEEE Transactions on Intelligent Transportation Systems 16(2):653–662.
 [Chollet 2015] Chollet, F. 2015. Keras. https://github.com/fchollet/keras.
 [Fan et al. 2015] Fan, Z.; Song, X.; Shibasaki, R.; and Adachi, R. 2015. Citymomentum: an online approach for crowd behavior prediction at a citywide level. In ACM UbiComp, 559–569. ACM.
 [He et al. 2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. In IEEE CVPR.
 [He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Identity mappings in deep residual networks. In ECCV.
 [Hoang, Zheng, and Singh 2016] Hoang, M. X.; Zheng, Y.; and Singh, A. K. 2016. Forecasting citywide crowd flows based on big data. In ACM SIGSPATIAL.
 [Ioffe and Szegedy 2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 448–456.
 [Jain et al. 2007] Jain, V.; Murray, J. F.; Roth, F.; Turaga, S.; Zhigulin, V.; Briggman, K. L.; Helmstaedter, M. N.; Denk, W.; and Seung, H. S. 2007. Supervised learning of image restoration with convolutional networks. In ICCV, 1–8. IEEE.
 [Kingma and Ba 2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
 [LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 [LeCun et al. 2012] LeCun, Y. A.; Bottou, L.; Orr, G. B.; and Müller, K.-R. 2012. Efficient backprop. In Neural networks: Tricks of the trade. Springer.
 [Li et al. 2015] Li, Y.; Zheng, Y.; Zhang, H.; and Chen, L. 2015. Traffic prediction in a bike-sharing system. In ACM SIGSPATIAL.
 [Long, Shelhamer, and Darrell 2015] Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In IEEE CVPR, 3431–3440.
 [Mathieu, Couprie, and LeCun 2015] Mathieu, M.; Couprie, C.; and LeCun, Y. 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
 [Nair and Hinton 2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML, 807–814.
 [Silva, Kang, and Airoldi 2015] Silva, R.; Kang, S. M.; and Airoldi, E. M. 2015. Predicting traffic volumes and estimating the effects of shocks in massive transportation systems. Proceedings of the National Academy of Sciences 112(18):5643–5648.
 [Song et al. 2014] Song, X.; Zhang, Q.; Sekimoto, Y.; and Shibasaki, R. 2014. Prediction of human emergency behavior and their mobility following large-scale disaster. In ACM SIGKDD, 5–14. ACM.
 [Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112.
 [Theano Development Team 2016] Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688.
 [Xingjian et al. 2015] Xingjian, S.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-k.; and Woo, W.-c. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 802–810.
 [Xu et al. 2014] Xu, Y.; Kong, Q.-J.; Klette, R.; and Liu, Y. 2014. Accurate and interpretable Bayesian MARS for traffic flow prediction. IEEE Transactions on Intelligent Transportation Systems 15(6):2457–2469.
 [Zhang et al. 2016] Zhang, J.; Zheng, Y.; Qi, D.; Li, R.; and Yi, X. 2016. DNN-based prediction model for spatial-temporal data. In ACM SIGSPATIAL.
 [Zheng et al. 2014] Zheng, Y.; Capra, L.; Wolfson, O.; and Yang, H. 2014. Urban computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 5(3):38.
 [Zheng 2015] Zheng, Y. 2015. Methodologies for cross-domain data fusion: An overview. IEEE Transactions on Big Data 1(1):16–34.