Learning Spatiotemporal Features of Ridesourcing Services with Fusion Convolutional Network
Abstract
In order to collectively forecast the demand of ridesourcing services in all regions of a city, convolutional neural networks (CNNs) have been applied with commendable results. However, local statistical differences throughout the geographical layout of the city invalidate the spatial stationarity assumption of the convolution, which limits the performance of CNNs on the demand forecasting task. Hence, we propose a novel deep learning framework called LCSTFCN (locally-connected spatiotemporal fully-convolutional neural network) that consists of a stack of 3D convolutional layers, 2D (standard) convolutional layers, and locally connected convolutional layers. This fully convolutional architecture maintains the spatial coordinates of the input, and no spatial information is lost between layers. Features are fused across layers to define a tunable nonlinear local-to-global-to-local representation, where both global and local statistics can be learned to improve predictive performance. Furthermore, as the local statistics vary from region to region, the arithmetic-mean-based metrics frequently used in spatial stationarity situations cannot effectively evaluate the models. We propose a weighted-arithmetic approach to deal with this situation. Our findings are threefold: (1) 3D convolutions are more suitable for spatiotemporal feature learning than 2D convolutions, and locally connected convolutional layers handle the impact of local statistical differences better than fully connected layers or standard convolutional layers. (2) The weighted-arithmetic approach is able to eliminate the inconsistencies between the absolute and percentage errors, which makes the comparison of predictive performance among different models more convenient. (3) The aleatoric uncertainty arising from the data generating process, which is often neglected in deep learning models, has great influence on the final predictive performance of the model.
Experiments on a real dataset from a ridesourcing service platform (DiDiChuxing) demonstrate the effectiveness and superiority of our proposed model and evaluation method.
keywords:
3D convolution, Fusion, Locally-connected, Ridesourcing, Spatiotemporal.
1 Introduction
In recent years, ridesourcing service platforms, such as DiDiChuxing, Uber, and Lyft, have developed rapidly worldwide, aided by the growth of mobile internet, locationbased services, cloud computing, and other innovative technologies. These platforms have various advantages compared to taxis, which suffer from temporalspatial imbalance between supply and demand (Vazifeh et al., 2018). Ideally, the unoccupied driving time of ridesourcing services is much less than that of taxis, which alleviates road congestion, reduces pollution emissions, and results in lower costs for the driver and mileage charges for the passenger. However, a new report from the San Francisco County Transportation Authority showed that traffic congestion throughout the city increased from 2010 to 2016, where half of the congestion was attributed to the increase in ride hails (Jessica Christian, 2018) due to unoccupied driving time, finding parking spaces, and locating passengers.
Although the efficiency of matching drivers and passengers has been greatly improved by ridesourcing services (Dong et al., 2018), short-term prediction of passenger demand is still highly important for ridesourcing service platforms so that available vehicles can be transferred from low-demand to high-demand areas in advance. This can effectively improve overall travel efficiency by reducing the waiting times for passengers, reducing ineffective mileage for drivers, increasing the matching rate, and decreasing costs. Further optimization of the unoccupied driving time and accurate prediction of short-term travel demand are being aided by the development of electronic sensors and wireless communication technology, global positioning system (GPS), global mobile communication system (GSM), and WiFi. At present, most ridesourcing vehicles are equipped with such devices, which provide rich spatial and temporal information about the vehicle. These data have been very useful for supply and demand forecasting, fleet dispatching, travel time estimation, and route planning (Chen et al., 2017b; Auer et al., 2017).
Various demand forecasting models have been developed to date. Time series models, such as the autoregressive integrated moving average model (ARIMA), are the most widely used methods. Other machine learning methods have also been used, such as the deep neural network (DNN), the recurrent neural network (RNN), and the convolutional neural network (CNN) models. Short-term demand forecasting models must consider both time and space: at a given time, one must predict the demand in each region during an upcoming time period. For this reason, typical CNNs, including LeNet (LeCun et al., 1989), AlexNet (Krizhevsky, 2012), and its deeper successors (Simonyan and Zisserman, 2014b; Szegedy et al., 2015), cannot be used directly for demand forecasting problems. The detailed reasons are illustrated below:
(1) Time series statistics: In the case of demand forecasting problems, it is desirable to capture the time series statistics of demand for multiple adjacent time intervals. CNNs have been primarily applied to 2D feature maps to compute features only from spatial dimensions. Although typical CNNs can also take inputs with time dimension by arranging time series data in multiple channels, the temporal information is still collapsed completely after the first convolution layer (Simonyan and Zisserman, 2014a).
(2) Spatial coordinates: Typical CNNs ostensibly take fixedsized inputs and produce nonspatial outputs as the fully connected layers have fixed dimensions and the spatial coordinates are lost (Shelhamer et al., 2014). For shortterm passenger demand forecasting, detailed predictions in all regions of a city depend on both global and local features, which need to be encoded using the spatial coordinates.
(3) Parameter sharing scheme: By assuming spatial stationarity, the parameter sharing scheme can be used in convolutional layers, which dramatically reduce the number of parameters. However, the spatial stationarity assumption does not hold in a city where the local statistics or features vary from region to region.
In order to address these problems, we develop LCSTFCN, a new CNN-based deep learning (DL) model to handle the unique challenges of short-term passenger demand forecasting. In the LCSTFCN model, 3D convolutional operations, which have achieved great performance on various video analysis tasks (Karpathy et al., 2014; Shuiwang et al., 2013), are used to capture the time series statistics. Since 3D convolution is a natural extension of standard convolution, features from both spatial and temporal dimensions can be learned simultaneously. The overall architecture of the LCSTFCN is a fully convolutional network that takes input of arbitrary size and produces output of the same size; the spatial coordinates are maintained throughout the process and no spatial information is lost between layers. We fuse features across the layers to define a tunable nonlinear local-to-global-to-local representation where both global and local statistics are learned to improve the predictive performance.
For the demand forecasting problem of ridesourcing services, which is essentially a regression problem, local differences in the statistics are critical for both the model structure and evaluation. We propose a weighted scheme to better compare different models. In addition, although the stationarity and randomness of data are seldom considered in DL models, we observe that the randomness arising from the data generating process leads to learning difficulties in some regions. Hence, we classify all regions into two categories based on their randomness and evaluate the effect of uncertainty in the data on the predictive performance of DL models in demand forecasting problems.
The remainder of the paper is organized as follows. Section 2 reviews the existing literature. Section 3 describes the LCSTFCN model in detail. Section 4 outlines the evaluation of the model; we introduce the experimental dataset and metrics frequently used to evaluate model performance, analyze the inconsistency among those metrics and propose an improved method. Section 5 concludes the study and proposes future research directions.
2 Literature review
Demand forecasting using spatial and temporal data collected by internet and mobile terminals has become a research hotspot in the field of transportation. Kaltenbrunner et al. (2010) used data from the community bicycle program Bicing in Barcelona to analyze the pattern of human mobility in urban areas, and an ARMA model was used to predict the number of bicycles available at different sites. Moreira-Matias et al. (2013) used the ARIMA model to forecast the demand of different taxi stations in Porto, Portugal, while the Markov algorithm, Lempel-Ziv-Welch algorithm, Poisson model, Moran's I values and others have been used to predict time-series data of traffic (Deng and Ji, 2011; Moreira-Matias et al., 2013; Zhao et al., 2016). Other studies used spatial clustering to mine taxi demand and GPS trajectory data, and studied the demand distribution of taxis in urban areas (Chang et al., 2009; Yuan et al., 2011).
In recent years, DL technology has made tremendous progress, and has been widely used in many fields, including advanced speech recognition, visual object recognition, object detection, drug discovery, and genomics (Lecun et al., 2015). DL can transform original information into a higher level with more abstract expression using a simple nonlinear model. After sufficient combinations of transformations, very complex functions can be learned. Therefore, DL methods are being increasingly applied to traffic prediction problems. Huang et al. (2014) proposed a neural network model composed of a deep belief network (DBN) and a multitask regression layer to predict short-term traffic flow. Ma et al. (2015) developed a deep restricted Boltzmann machine and RNN architecture to simulate and predict the evolution of traffic congestion. Chen et al. (2017a) studied the travel behavior of ridesourcing services using an ensemble method.
To realize accurate region-based forecasting, a number of deep learning models were designed to model the complex spatiotemporal information, and state-of-the-art results were achieved (Ke et al., 2017; Ma et al., 2017; Xingjian and Woo, 2015; Yao et al., 2018; Yu et al., 2017). Most previous studies have focused on the combination of convolutional and recurrent neural networks, where the recurrent architecture was utilized to model the temporal dependencies. The work most closely related to this paper is the model proposed by Zhang et al. (2017), in which the different components of temporal properties were learned separately and fused by a parametric-matrix-based method. Unlike previous studies, this paper employs 3D convolution operations to capture the spatiotemporal dependencies simultaneously. Other previous studies have also shown that 3D convolution operations have a powerful ability to learn spatiotemporal features (Tran et al., 2015; Hara et al., 2017). In particular, locally connected convolutional layers without parameter sharing are used at the end of the model to obtain the final prediction results (Gregor and Lecun, 2010; Huang et al., 2012; Taigman et al., 2014). To avoid the loss of spatial information, all of these convolution layers are applied with appropriate padding (spatial only) and stride 1, so the spatial size does not change from the input to the output of these layers. Thus, this fully convolutional architecture is able to maintain the spatial coordinates of the input, and no spatial information is lost between layers (Shelhamer et al., 2014).
3 The LCSTFCN model
In this study, we partition a city into a regular grid map based on longitude and latitude, where each grid cell denotes a region. We divide the observation period into time intervals of equal length. The order demand generated in a region during one time interval is recorded as a single value, and the values of all regions in that interval form the demand matrix of the entire city.
Our goal is to predict, at a given instant, how many orders will emerge in each region during a future period. That is, the future demand matrix is predicted from the historical observations. The LCSTFCN model shown in Figure 1 is developed to achieve this goal. We select a collection of short pieces from historical observations and stack them together to form a 3D volume as input. The 3D convolution operations are used to fuse the spatial and temporal information in multiple contiguous time windows. Since 3D convolution is a natural extension of standard convolution, features from both spatial and temporal dimensions can be learned simultaneously. Then, multiple 2D convolutional layers are used to extract and encode features from low to high level, and from local to global. As the network depth increases, the receptive field of the neurons also increases, allowing an increasing amount of spatial information to be learned. Finally, locally connected convolutional layers are used at the end of the model to obtain the final prediction results, without parameter sharing.
According to the findings in 2D CNNs (Simonyan and Zisserman, 2014b), small receptive fields of convolution kernels with deeper architectures yield the best results. Hence, in our LCSTFCN model we fix the spatial receptive field of the 3D convolution kernels to a small size and vary only their temporal depth. After several layers, the 3D convolution kernels have access to information across all input demand matrices; the number of layers required depends on the depth of the input data. The important components and training methods of the LCSTFCN model are introduced in the following sections.
3.1 Input: 3D volume
As a class of attractive deep models for automated spatial feature construction, CNNs have been primarily applied on 2D images. However, for shortterm passenger demand forecasting, it is also important to capture the temporal information in multiple adjacent or periodic time intervals. For example, there is a high demand for morning and evening rushhour trips, but a low demand before dawn. Therefore, we construct the following input data forms, which integrate the spatiotemporal information of multiple time intervals into one 3D volume, with one demand matrix for each time interval:
(1) 
(2) 
(3) 
(4) 
Here, , and are the numbers of time intervals in , and separately. is the periodic length of the demand time series data. The periodicity of travel demand is the sharing feature of urban activities, so we use an additive model (Brockwell and Davis, 2015) to determine the periodicity. Then, we decompose the overall travel demand in the city into trend term , periodic term , and residual term :
(5) 
(6) 
(7) 
In the additive model, different values can be obtained by choosing different values, where smaller values indicate that the corresponding is closer to the practical situation. After , and are chosen, we can then construct the 3D volume, as shown in Figure 2. Each instance used by the LCSTFCN model is a 3D volume containing number of raw samples with each sample represented by a matrix of grid values.
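The additive decomposition described above can be sketched as follows. This is a minimal illustration, assuming a classical moving-average estimate of the trend; the function name `additive_decompose`, the candidate periods, and the synthetic series are hypothetical and are not taken from the experiments in this paper. The sketch compares the residual variance obtained with two candidate periodic lengths, where the smaller variance suggests the better-fitting period.

```python
import numpy as np

def additive_decompose(x, period):
    """Split x into trend + periodic + residual (classical additive model)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # A centered moving average over one full period estimates the trend.
    if period % 2 == 0:
        kernel = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        kernel = np.ones(period) / period
    half = period // 2
    trend = np.full(n, np.nan)  # the edges stay undefined (NaN)
    trend[half:n - half] = np.convolve(x, kernel, mode="valid")
    # Periodic term: mean detrended value at each phase, centered at zero.
    detrended = x - trend
    phase_mean = np.array(
        [np.nanmean(detrended[k::period]) for k in range(period)]
    )
    phase_mean -= phase_mean.mean()
    periodic = phase_mean[np.arange(n) % period]
    residual = x - trend - periodic
    return trend, periodic, residual

# Hypothetical demand series: linear trend + cycle of length 24 + small noise.
rng = np.random.default_rng(0)
t = np.arange(24 * 14)
demand = 0.01 * t + 5.0 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.1, t.size)

for p in (17, 24):
    _, _, r = additive_decompose(demand, p)
    # The true period (24) should leave the smaller residual variance.
    print(p, np.nanvar(r))
```

In this sketch the candidate period that matches the true cycle leaves a residual close to the injected noise, which mirrors the criterion of preferring the period with the smaller residual spread.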
3.2 The fusion of fully connected 3D and 2D Convolutions
In 2D CNNs, convolutions are applied on the 2D feature maps to compute features from the spatial dimensions only. When a 2D CNN is applied to the demand forecasting problem, it is desirable to capture the temporal information in multiple time intervals, e.g. trend and period. Although a 2D CNN can also take multiple time intervals as input, the temporal information is collapsed completely after the first convolution layer. Hence, it is difficult to effectively extract temporal information by 2D convolution operations. Compared to 2D convolution, 3D convolution models temporal information better because it preserves the temporal dimension of the input by producing a 3D volume as the output (as shown in Figure 3).
Unlike 2D convolution operation, the depth of the 3D convolution filters is less than the depth of corresponding input volume. The feature maps obtained by the filter in the layer are given by:
(8) 
(9) 
where is the parameter matrices of filter connected to the feature maps in the previous layer, is the depth and is intercept parameter. The asterisk denotes the convolutional operator and is an activation function, convolving each filter across the width, height and depth of the input volume and computing dot products between the entries of the filter and input at any position. Each filter produces a separate 3D feature maps , where the dimension of the depth of is obtained by valid convolution with stride 1 on previous layer with depth of . Thus, finally the filter will be able to get access to all the information across the temporal dimension after a certain number of convolutions, depending on . In the 3D convolutional layers, the filters have shared weights and the ReLU function is used (Krizhevsky, 2012).
After the 3D convolutional layers, we use a few 2D convolutional layers to further extract spatiotemporal features from low to high level and local to global. Stacking convolutional layers with tiny filters as opposed to having one convolutional layer with big filters allows us to express more powerful features of the input, and with fewer parameters.
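To make the behaviour of the 3D convolutional layers concrete, the following is a minimal NumPy sketch of a single-filter 3D convolution that is "valid" along the temporal depth and zero-padded ("same") in space, with stride 1 and ReLU. The function name `conv3d_valid_depth`, the input depth of 8, the 16 x 16 spatial size, and the 3 x 3 x 3 kernel are assumptions chosen for illustration, not the settings used in the experiments. As is common in deep learning, the operation is implemented as cross-correlation.

```python
import numpy as np

def conv3d_valid_depth(volume, kernel):
    """Single-filter 3D convolution: valid in depth, zero-padded in space.

    volume: array of shape (D, H, W); kernel: array of shape (kd, kh, kw)
    with odd kh and kw. The output has shape (D - kd + 1, H, W).
    """
    D, H, W = volume.shape
    kd, kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Pad only the spatial axes so that H and W are preserved.
    padded = np.pad(volume, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((D - kd + 1, H, W))
    for d in range(D - kd + 1):
        for i in range(H):
            for j in range(W):
                patch = padded[d:d + kd, i:i + kh, j:j + kw]
                out[d, i, j] = np.sum(patch * kernel)
    return np.maximum(out, 0.0)  # ReLU activation

# Hypothetical input: 8 stacked demand matrices on a 16 x 16 grid.
x = np.ones((8, 16, 16))
k = np.ones((3, 3, 3))
for layer in range(3):
    x = conv3d_valid_depth(x, k)
    print(layer + 1, x.shape)  # depth shrinks by kd - 1 = 2 per layer
```

Under these assumptions the spatial size stays fixed while the temporal depth shrinks layer by layer, so after enough layers each output location has seen every input time interval.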
3.3 Locally connected convolutional layer
Parameter sharing in standard convolutional layers assumes that a local feature that is useful at one location is equally useful at every other location. For example, if a horizontal boundary in some part of an image is regarded as important, then it will be equally useful in other places. However, this assumption may not be applicable to demand forecasting problems, as the local features vary from region to region. For example, radial cities often have a typical central structure, and it is clearly inappropriate to use the same parameters to predict the demands for both the central and marginal areas of the city.
Hence, we relax the assumption of parameter sharing in standard convolution layers by using locally connected layers (without weight sharing) to obtain the final predictions. Like a standard convolutional layer, they apply a filter bank, except that every location in the feature maps is learnt by a different set of filters. As shown in Figure 4, in locally connected convolutional layers, the kernels in different spatial locations have different parameters. Compared with fully connected layers (Figure 4) or standard convolutional layers, locally connected convolution layers can simultaneously maintain local statistics and spatial coordinates.
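The contrast between a standard (weight-sharing) convolution and a locally connected layer can be sketched in NumPy as follows. This is an illustrative single-channel version under assumed sizes (a 16 x 16 grid and 3 x 3 kernels); the function names `conv2d_shared` and `locally_connected2d` are hypothetical and bias terms and activations are omitted for brevity.

```python
import numpy as np

def conv2d_shared(x, kernel):
    """Standard 2D convolution: one kernel shared by all locations."""
    H, W = x.shape
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def locally_connected2d(x, kernels):
    """Locally connected layer: a separate kernel for every location.

    kernels has shape (H, W, kh, kw), i.e. no weight sharing.
    """
    H, W = x.shape
    _, _, kh, kw = kernels.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernels[i, j])
    return out

rng = np.random.default_rng(0)
x = rng.random((16, 16))
shared = rng.random((3, 3))
per_location = np.broadcast_to(shared, (16, 16, 3, 3)).copy()

# With identical kernels everywhere, both layers give the same output.
print(np.allclose(conv2d_shared(x, shared), locally_connected2d(x, per_location)))
# Parameter counts: 9 for the shared kernel versus 16 * 16 * 9 without sharing.
print(shared.size, per_location.size)
```

The locally connected layer keeps the spatial layout of a convolution but spends many more parameters, which is the price paid for letting each region learn its own statistics.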
3.4 Objective function
The LCSTFCN model can be trained by minimizing the mean squared error between the estimated demand and real demand . The objective function is shown in Eq.(10), where and are both learnable parameters. Algorithm 1 outlines the LCSTFCN training process, where the adaptive subgradient method is adopted for model training.
(10) 
4 Experiments
We perform experiments using a dataset from DiDiChuxing (https://gaia.didichuxing.com), which is the largest ridesourcing service platform in China. The dataset includes customer requests in Chengdu, China, containing the request time, longitude, and latitude. After the raw data are cleaned, the 7,031,022 requests located between 103.85 and 104.30 degrees east longitude and between 30.48 and 30.87 degrees north latitude are used in our experiments. All the requests are partitioned into 10-minute time intervals, and the investigated area is partitioned into a grid of 256 regions. Considering the periodicity of the data (see Figure 5), we select the first three weeks of the dataset as the training set and the remaining nine days as an independent testing set.
Because of the inherent periodicity of urban activity, before training we use the additive model to decompose the total demand data by time horizon length, . After testing, we find that (a day) or (a week) yields better decomposition results than others. As shown in Figure 6, for , is almost linear and the distribution of residual terms has smaller variance than that for . Hence, we select to construct our training 3D tensor for training. In order to illustrate the effectiveness of 3D convolution for learning time dependence (shortterm and periodic), we compare our results with the traditional difference method (Brockwell and Davis, 2015). In the experiment, we set . Correspondingly, we use four 3D convolutional layers with filter depth to extract the spatiotemporal information of input. And the number of standard and locallyconnected convolutional layers are 4 and 2.
To evaluate the performance of the LCSTFCN model, we select several different models as benchmarks, where all the models are trained and validated with the same training set and test set:

LCFCN: Variant of the LCSTFCN model, where 2D convolutional layers are applied to the input to generate feature maps instead of 3D convolutional layers. As a result, the temporal dimension of the input is collapsed after the first 2D convolutional layer.

FCN: Variant of the LCFCN model, where the last two layers are standard 2D convolutional layers (weight sharing), while the other parts of the structure remain the same.

LCSTFCN (diff): Variant of the LCSTFCN model, where the difference method is used to generate input data.

CNN: The last two layers are fully connected layers, while the other parts of the structure are the same as FCN.

ConvLSTM: ConvLSTM layers (Xingjian and Woo, 2015) are used instead of convolution layers in FCN. ConvLSTM structure can simultaneously learn spatial correlation and time dependence.

ANN (artificial neural network): Since different regions of a city have different local statistics, it is difficult to achieve effective prediction for all regions using a single ANN. Hence, we use a unique ANN for training and prediction in each grid region, resulting in a total of 256 ANNs that are trained. The input of 1dimension time series data has a size of , where 20 is the length of the temporal dimension.

Additive model: As with the ANN, 256 independent additive models are used to obtain the final prediction results for each region.

ARIMA: As ARIMA places requirements on the stationarity and randomness of the data, in our experiments it is only used to generate predictions in regions that meet these requirements.
4.1 Model comparison with arithmetic-mean-based metrics
We first evaluate our model by the root mean squared error (RMSE), normalized root mean squared error (NRMSE), mean absolute percentage error (MAPE), and the modified MAPE methods (sMAPE1 and sMAPE2) used in previous studies (Chen and Li, 2018; Moreira-Matias et al., 2013), which are defined as follows:
(11) 
(12) 
(13) 
(14) 
(15) 
Here, is the number of all predicted values, and are the ground truth and predicted value of demand in region at time interval , respectively, while in Eq.(13) and (14) are set to 1.
Table 1 compares the predictive performances of the eight models on the test dataset. The number in parentheses beside each value is the rank of the model under the corresponding metric. The results predicted by ARIMA are not included in Table 1 because ARIMA is only applicable to regions that satisfy its requirements on the stationarity and randomness of the data.
Model  RMSE  NRMSE (%)  MAPE (%)  sMAPE1 (%)  sMAPE2 (%) 
LCSTFCN  1.67 (1)  75.38 (2)  22.40 (1)  13.31 (3)  13.85 (1) 
LCFCN  1.69 (2)  75.28 (1)  22.98 (2)  13.45 (4)  13.85 (1) 
FCN  1.78 (3)  99.67 (5)  27.36 (7)  15.91 (7)  16.93 (7) 
CNN  1.82 (4)  104.62 (6)  25.78 (6)  15.09 (6)  16.11 (5) 
Additive model  1.86 (5)  91.24 (4)  24.78 (4)  13.21 (2)  14.96 (4) 
ConvLSTM  1.95 (6)  131.02 (8)  28.69 (8)  17.04 (8)  18.74 (8) 
ANN  2.04 (7)  76.92 (3)  24.43 (3)  14.23 (5)  14.53 (3) 
LCSTFCN (diff)  2.17 (8)  105.53 (7)  25.72 (5)  13.19 (1)  16.11 (5) 
Table 1: Comparison of the predictive performance of the various models using the same test dataset.
As shown in Table 1, the LCSTFCN model obtains the best results on RMSE, MAPE and sMAPE2. In general, LCSTFCN has better predictive performance than LCFCN, which indicates that 3D convolution operations capture the spatiotemporal information better than 2D ones. Compared with FCN and CNN, the predictive performance of LCFCN is significantly improved by simply replacing the last two standard convolutional layers or fully connected layers with locally connected convolutional layers, which implies that local-to-global-to-local architectures adapt well to the complex and changeable demand patterns of different regions in a city.
However, further comparison of the models is difficult due to the inconsistencies among the evaluation metrics. We observe the following interesting phenomena in Table 1. The RMSE of CNN is less than that of the additive model, while all its relative errors (i.e. NRMSE and the others) are greater than those of the additive model. The same situation occurs in the comparison of FCN and the additive model, CNN and ANN, FCN and ANN, and ConvLSTM and ANN. Moreover, the contradictions among the percentage error metrics are even more obvious. In general, relative errors are expressed as a ratio (unitless number), which can eliminate the influence of scale and better reflect the credibility of the measurement. However, when the absolute error and relative error are inconsistent in the evaluation results, it is difficult to effectively judge which model is superior. Such inconsistency is caused by the unbalanced demands of the regions. Thus, it is necessary to modify the evaluation criteria as well as the model structure, so that the evaluations of different models can be consistent.
4.2 Issues in model evaluation
Three issues will affect the effectiveness of the model evaluation: (1) the demand uncertainty; (2) the demand level; and (3) the demand distribution. We analyze each issue below.
(1) Demand uncertainty
In the literature, the uncertainty of data is seldom considered when training DL models. In short-term demand forecasting problems especially, such uncertainties significantly impact the predictive performance of the model. The CNN models do not output probability distributions, so attempting to predict the outcome of a random time series simply transfers the randomness from the inputs to the outputs. In each grid region, the historical observations constitute an independent time series dataset. To assess the demand uncertainty, we apply the Ljung–Box test to the demand series of each region. If the p-value of the test is greater than 0.05, the series is considered a random sequence. Based on the results of the Ljung–Box test, the set of all regions can be divided into a non-random sequence set and a random sequence set. Of the 256 grid regions, the non-random set contains only 56 regions but accounts for 80.88% of the total demand, while the random set contains 200 regions but accounts for only 19.12% of the total demand. As shown in Figure 7, we choose a group of regions that are adjacent to each other in latitude and cross the city center, and compare the frequency distributions of real and predicted data within the test dataset. The difference between the two distributions is usually measured by the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951). We can easily observe that the gap between the two distributions is bigger for the regions in the random set (regions 85, 86, 811, 812, 813) than for those in the non-random set, which implies that the model encounters greater difficulties in learning the distribution of regions with higher uncertainty. Therefore, it is necessary to treat the two different types of regions separately in the model evaluation.
(2) Demand level
The percentage error measurements should not be used to evaluate the overall performance of the model, since they are very sensitive to the level of absolute demand. For example, we compare the prediction results of the additive model for regions 102 and 88, as shown in Table 2 and Figure 8. In Table 2, the model has a better performance for region 102 in almost all the percentage error measurements. However, Figure 8 indicates the opposite conclusion, namely that the prediction for region 88 is better, since the pattern of demand oscillation is well learned by the model.
(3) Demand distribution
The Gini coefficient is most often used to measure how far a distribution deviates from a totally equal distribution. We calculate the Gini coefficient of the order demand distribution based on a Lorenz curve, which plots the proportion of the total demand of the city as a function of the cumulative share of the regional demand (as shown in Figure 9). The diagonal line represents perfect equality of demand. Hence, the Gini coefficient is the ratio of the area that lies between the line of equality and the Lorenz curve (marked as A in the figure) to the total area under the line of equality (A + B in the figure). As a result of the variety of urban geography and layout, the Gini coefficient of our demand dataset is around 0.89, indicating an extreme spatial imbalance of demand.
When the regional demand is highly unbalanced in a city, the evaluation results obtained by the arithmeticmeanbased methods are affected by the extreme values. The inherent averaging effect hides the complexity and heterogeneity behind the prediction results.
We sort all regions in increasing order in terms of demand level and calculate the cumulative moving average of the percentage errors of each model, as shown in Figure 10. We can observe that the curve is almost flat after the demand is larger than 25. This implies that the percentage measurement is determined mainly by the low demand regions, which occupy the majority of the prediction area. This may not be reasonable since the regions with large demand should be paid more attention.
From another point of view, when we increase the grid size while keeping the same LCSTFCN structure, the prediction resolution is reduced. However, Table 3 shows a significant improvement in all the arithmetic-mean-based metrics, which indicates that such measurements cannot objectively reflect the prediction ability of the model.
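The Gini coefficient used in the demand distribution analysis above can be computed from regional demand totals as in the following minimal sketch. It follows the area-ratio definition A / (A + B) on a piecewise-linear Lorenz curve; the function name `gini` and the example demand vectors are hypothetical and do not reproduce the value reported for the dataset.

```python
import numpy as np

def gini(demand):
    """Gini coefficient A / (A + B) from a piecewise-linear Lorenz curve."""
    x = np.sort(np.asarray(demand, dtype=float))
    n = x.size
    # Lorenz curve: cumulative demand share, starting from the origin.
    lorenz = np.concatenate(([0.0], np.cumsum(x) / x.sum()))
    # Area B under the Lorenz curve by the trapezoid rule (step width 1/n).
    area_b = np.sum((lorenz[:-1] + lorenz[1:]) / 2.0) / n
    # The area under the line of equality is A + B = 0.5, so A / (A + B) = 1 - 2B.
    return 1.0 - 2.0 * area_b

print(gini([10, 10, 10, 10]))  # perfectly equal demand
print(gini([0, 0, 0, 40]))     # all demand concentrated in one region
```

With only a few regions the coefficient cannot reach 1 exactly; for n regions with all demand in one of them this estimator gives (n - 1) / n.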
Region  RMSE  NRMSE (%)  MAPE (%)  sMAPE1 (%)  sMAPE2 (%)
Region (102)  0.15  118.29  1.79  1.42  2.34
Region (88)  23.06  9.13  17.90  7.27  4.08
Table 2: Prediction errors of regions 102 and 88 (additive model).
Grid size  RMSE  NRMSE (%)  MAPE (%)  sMAPE1 (%)  sMAPE2 (%) 
1.67  75.38  22.40  13.31  13.85  
0.35  62.5  5.83  3.77  5.37 
Table 3: Prediction errors of the LCSTFCN model with different grid sizes.
4.3 Model comparison with weighted-arithmetic-based metrics
In order to address the above issues, we propose weighted-arithmetic-based metrics instead of the widely used arithmetic-mean-based metrics. The weight of each region is the ratio of its regional demand to the total demand in the training set. All the new metrics are shown in Table 4.
(16) 
Arithmeticmeanbased metrics  Weightedarithmeticbased metrics 
Figure 10 shows the difference between the two kinds of metrics. Regions with higher demand now contribute more to the weighted-arithmetic-based metrics, which is likely preferable for the ridesourcing platform, since higher priority should be given to regions with higher demand to guarantee system efficiency. Moreover, the weighted-arithmetic metrics are less sensitive to the grid size, as shown in Table 5.
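The effect of demand weighting can be illustrated with a minimal sketch. It assumes that a weighted metric is the sum over regions of the regional error multiplied by that region's share of training demand, and that the percentage error uses a constant offset of 1 in the denominator; both the exact form of the metric and the role of the constant are assumptions made here, and the function names and toy data are hypothetical.

```python
import numpy as np

def regional_mape(y_true, y_pred, c=1.0):
    """Per-region mean absolute percentage error.

    y_true, y_pred: arrays of shape (T, R) for T intervals and R regions.
    c is an assumed offset that avoids division by zero.
    """
    return np.mean(np.abs(y_true - y_pred) / (y_true + c), axis=0)

def weighted_mape(y_true, y_pred, train_demand, c=1.0):
    """Demand-weighted MAPE (assumed form: sum of weight * regional MAPE)."""
    weights = np.asarray(train_demand, dtype=float)
    weights = weights / weights.sum()  # regional share of total demand
    return float(np.sum(weights * regional_mape(y_true, y_pred, c)))

# Toy example: one busy region predicted well, three quiet regions predicted poorly.
y_true = np.tile(np.array([100.0, 1.0, 1.0, 1.0]), (5, 1))
y_pred = np.tile(np.array([95.0, 2.0, 2.0, 2.0]), (5, 1))
train_demand = np.array([1000.0, 10.0, 10.0, 10.0])

unweighted = float(np.mean(regional_mape(y_true, y_pred)))
weighted = weighted_mape(y_true, y_pred, train_demand)
print(unweighted, weighted)
```

In this toy case the unweighted average is dominated by the three quiet regions, while the weighted value is driven by the busy region, which is the behaviour the weighted scheme is intended to produce.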
The prediction errors calculated based on weightedarithmetic metrics are shown in Table 6. Now the predictive performance rankings of the models under most metrics are consistent, except for WMAPE. Under weightedarithmetic metrics, the LCSTFCN proposed in this study outperforms all the other benchmark mondels in all the five metrics. There are some interesting observations below:
i) In terms of predictive performance, LCSTFCN > LCFCN > FCN, which indicates that 3D convolutions are more suitable for spatiotemporal feature learning than 2D convolutions, and that the locally connected convolutional layers handle the impact of local statistical differences better than standard convolutional layers. LCSTFCN, LCFCN and FCN are the top three ranked models. This may be due to their fully convolutional architecture, which helps maintain the spatial coordinates of the input and avoids losing spatial information between layers.
ii) In terms of predictive performance, FCN > CNN. The only difference between the two structures is the last two layers: FCN uses standard convolutional layers, whereas CNN uses fully connected layers. In our experiments, the ratio between the total numbers of parameters in the last two layers of the two models is nearly 1:8000. One reason why FCN is superior to CNN might be that each neuron in a fully connected layer of CNN is affected by a very large patch of the input, and learning the optimal combination of parameters for so many neurons is difficult.
iii) In terms of predictive performance, LCSTFCN > LCSTFCN (diff). The good predictive performance of the additive model demonstrates that the trend and periodicity of demand data play an important role in demand forecasting. Both LCSTFCN (diff) and LCSTFCN can learn from this information, yet the results show that the 3D convolution operation is more powerful in feature extraction than the difference method.
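The layer types compared in observations i) and ii) differ mainly in how weights are shared across positions. The NumPy sketch below contrasts a standard (shared-kernel) convolution with a locally connected one on a single-channel map and counts the weights each layer type would need. The 6 x 6 map, the 3 x 3 kernel and all function names are assumptions for illustration; they are not the layer sizes used in the experiments and do not reproduce the 1:8000 ratio reported above.

```python
import numpy as np

def conv2d_shared(x, kernel):
    """Standard 'valid' 2D convolution: one kernel shared by every position."""
    kh, kw = kernel.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def conv2d_local(x, kernels):
    """Locally connected layer: kernels[i, j] is used only at output (i, j)."""
    h_out, w_out, kh, kw = kernels.shape
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernels[i, j])
    return out

def param_counts(h, w, kh, kw):
    """Weight counts (biases ignored) for a single-channel h x w input."""
    h_out, w_out = h - kh + 1, w - kw + 1
    return {
        "standard": kh * kw,                           # shared everywhere
        "locally_connected": h_out * w_out * kh * kw,  # one kernel per position
        "fully_connected": h * w * h_out * w_out,      # every input to every output
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
kernel = rng.normal(size=(3, 3))

# With identical kernels at every position, the locally connected layer
# reduces to the standard convolution.
tiled = np.broadcast_to(kernel, (4, 4, 3, 3))
same = np.allclose(conv2d_local(x, tiled), conv2d_shared(x, kernel))
counts = param_counts(6, 6, 3, 3)
```

Letting each position keep its own kernel is what allows region-specific statistics to be learned, at a parameter cost that sits between the standard convolution and the fully connected layer.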
In order to further compare the predictive performance of each model in regions with different levels of uncertainty, we calculated the evaluation metrics separately for the low-uncertainty and the high-uncertainty regions (Table 7). For all the models, the weighted percentage errors in the high-uncertainty regions are nearly double those in the low-uncertainty regions, which indicates that uncertainties in the time series data can greatly reduce the models' prediction ability.
Grid size  RMSE  WNRMSE (%)  WMAPE (%)  WsMAPE1 (%)  WsMAPE2 (%) 
1.67  18.76  22.09  10.28  7.83  
0.35  47.04  37.16  18.53  17.63 
Prediction error of the LCSTFCN model for different grid sizes using the weighted method.
Model  RMSE  WNRMSE (%)  WMAPE (%)  WsMAPE1 (%)  WsMAPE2 (%) 
LCSTFCN  1.67 (1)  18.76 (1)  22.09 (1)  10.28 (1)  7.83 (1) 
LCFCN  1.69 (2)  19.09 (2)  23.39 (2)  10.61 (2)  7.96 (2) 
FCN  1.78 (3)  19.80 (3)  23.84 (3)  10.88 (3)  8.24 (3) 
CNN  1.82 (4)  20.41 (4)  24.43 (5)  11.02 (4)  8.45 (4) 
Additive model  1.86 (5)  20.77 (5)  26.99 (6)  11.56 (5)  8.63 (5) 
ConvLSTM  1.95 (6)  21.61 (6)  24.32 (4)  11.95 (6)  9.16 (6) 
ANN  2.04 (7)  23.42 (7)  28.57 (8)  12.79 (7)  9.82 (7) 
LCSTFCN (diff)  2.17 (8)  24.32 (8)  27.02 (7)  12.92 (8)  9.96 (8) 
Predictive performance comparison (weighted).
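The rank consistency discussed for Table 6 can be checked mechanically. The sketch below copies the values from Table 6 and ranks the models under each metric (lower is better); the helper name `rank_models` is introduced here for illustration.

```python
# Values copied from Table 6: (RMSE, WNRMSE, WMAPE, WsMAPE1, WsMAPE2).
table6 = {
    "LCSTFCN": (1.67, 18.76, 22.09, 10.28, 7.83),
    "LCFCN": (1.69, 19.09, 23.39, 10.61, 7.96),
    "FCN": (1.78, 19.80, 23.84, 10.88, 8.24),
    "CNN": (1.82, 20.41, 24.43, 11.02, 8.45),
    "Additive model": (1.86, 20.77, 26.99, 11.56, 8.63),
    "ConvLSTM": (1.95, 21.61, 24.32, 11.95, 9.16),
    "ANN": (2.04, 23.42, 28.57, 12.79, 9.82),
    "LCSTFCN (diff)": (2.17, 24.32, 27.02, 12.92, 9.96),
}
metrics = ("RMSE", "WNRMSE", "WMAPE", "WsMAPE1", "WsMAPE2")

def rank_models(table, col):
    """Rank models by one metric column, 1 = smallest error."""
    ordered = sorted(table, key=lambda model: table[model][col])
    return {model: rank for rank, model in enumerate(ordered, start=1)}

ranks = {name: rank_models(table6, i) for i, name in enumerate(metrics)}

# Metrics whose ranking differs from the RMSE ranking.
inconsistent = [name for name in metrics if ranks[name] != ranks["RMSE"]]
print(inconsistent)
```

Only the WMAPE ranking departs from the others, which matches the ranks printed in parentheses in Table 6.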
Model  RMSE  WNRMSE (%)  WMAPE (%)  WsMAPE1 (%)  WsMAPE2 (%) 
regions ()  
LCSTFCN  4.35/0.92  14.40/37.21  18.38/37.79  8.45/18.04  6.12/15.05 
LCFCN  4.47/0.91  14.82/37.14  19.71/38.91  8.80/18.23  6.29/15.04 
FCN  4.57/1.01  15.09/39.72  19.77/41.01  8.95/19.04  6.42/15.93 
CNN  4.83/0.98  16.01/39.01  20.76/39.99  9.24/18.57  6.76/15.63 
Additive model  4.75/1.05  15.68/42.29  23.21/42.99  9.69/19.51  6.72/16.7 
ConvLSTM  5.14/1.06  16.97/41.23  20.72/39.55  10.15/19.59  7.36/16.77 
ANN  5.79/0.99  19.35/40.65  24.64/45.18  11.06/20.07  8.19/16.7 
LCSTFCN (diff)  5.60/1.21  18.51/48.92  22.78/44.99  10.98/21.15  7.85/18.9 
ARIMA  5.52/  18.23/  22.09/  10.75/  7.74/ 
Predictive performance in low-uncertainty and high-uncertainty regions (weighted method).
5 Conclusions
CNNs have been demonstrated to be an effective tool for learning information with spatial structure, which has led to breakthroughs in many machine learning tasks. In this paper, a fusion convolutional model, LCSTFCN, is established for demand forecasting of ridesourcing services. A new model structure and new evaluation metrics are developed to deal with the unique characteristics of ridesourcing services. A real dataset from the DiDiChuxing platform is used for model evaluation and comparison. In the experiments, our model outperforms all the benchmark models in terms of all the weighted-arithmetic-based metrics and most of the arithmetic-mean-based metrics. The weighted-arithmetic-based metrics show better consistency in performance evaluation than the arithmetic-mean-based metrics, because they take demand level and unbalanced demand distribution into consideration. Moreover, we show that prediction results can be greatly affected by the local statistics: in our experiments, the percentage errors in regions with high uncertainty are nearly twice those in regions with low uncertainty. In this paper, the training data of our proposed model are generated by a grid-based partition, which is carried out independently of the model training. In future work, we expect to explore a learnable partition algorithm that can partition regions automatically by learning the demand statistics of ridesourcing service platforms.
Acknowledgement
The dataset used in this paper comes from DiDiChuxing. The work described in this paper was supported by the National Key Research and Development Program of China (2018YFB1600902) and the National Natural Science Foundation of China (71622007, 71861167001, 71725001).