Data-Driven Robust Taxi Dispatch under Demand Uncertainties
Abstract
In modern taxi networks, large amounts of taxi occupancy status and location data are collected from networked in-vehicle sensors in real time. These data provide knowledge of passenger demand and mobility patterns that can support efficient taxi dispatch and coordination strategies. Such approaches face two challenges: how to deal with uncertainties in predicted customer demand while fulfilling the system’s performance requirements, including minimizing taxis’ total idle mileage and maintaining service fairness across the whole city; and how to formulate a computationally tractable problem. To address these challenges, we develop a data-driven robust taxi dispatch framework that considers spatial-temporally correlated demand uncertainties. The robust vehicle dispatch problem we formulate is concave in the uncertain demand and convex in the decision variables. Uncertainty sets of random demand vectors are constructed from data based on hypothesis testing theory, and provide a desired probabilistic guarantee level for the performance of robust taxi dispatch solutions. We prove equivalent computationally tractable forms of the robust dispatch problem using the minimax theorem and strong duality. Evaluations on four years of taxi trip data for New York City show that, with a probabilistic guarantee level of 75%, the average demand-supply ratio error is reduced by 31.7%, and the average total idle driving distance is reduced by 10.13%, or about 20 million miles annually, compared with non-robust dispatch solutions.
I Introduction
Modern transportation systems are equipped with various sensing technologies for passenger and vehicle tracking, such as radio-frequency identification (RFID) and the global positioning system (GPS). Sensing data collected from transportation systems provide opportunities for understanding spatial-temporal patterns of passenger demand. Methods for predicting taxi-passenger demand [28, 22], travel time [15, 27, 3], and traveling speed [13, 2] from traffic monitoring data have been developed.
Based on such rich spatial-temporal information about passenger mobility patterns and demand, many control and coordination solutions have been designed for intelligent transportation systems. Robotic mobility-on-demand systems that minimize the number of rebalancing trips [24, 30] and smart parking systems that allocate resources based on a driver’s cost function [14] have been proposed. Dispatch algorithms that aim to minimize customers’ waiting time [26, 17] or to reduce cruising miles [29] have been developed. In our previous work [20, 21], we designed a receding horizon control (RHC) framework that incorporates a predicted demand model and real-time sensing data. Considering future demand when making the current dispatch decisions helps to reduce autonomous vehicle balancing costs [30] and taxis’ total idle distance [21, 20]. Strategies for resource allocation depend on the model of demand in general, and the knowledge and assumptions about the demand affect the performance of the supply-providing approaches [9], [23]. These works rely on precise passenger-demand models to make dispatch decisions.
However, passenger-demand models have intrinsic model uncertainties that result from many factors, such as weather, passenger working schedules, and city events. Algorithms that do not consider these uncertainties can lead to inefficient dispatch services, resulting in imbalanced workloads and increased taxi idle mileage. Although robust optimization aims to minimize the worst-case cost under all possible random parameters, it sacrifices average system performance [1]. For a taxi dispatch system, it is essential to address the trade-off between the worst-case and the average dispatch costs under uncertain demand. A promising yet challenging approach is a robust dispatch framework with an uncertain demand model, called an uncertainty set, that captures spatial-temporal correlations of demand uncertainties, such that the robust optimal solution under this set provides a probabilistic guarantee for the dispatch cost (as defined in problem (12)).
In this work, we consider two aspects of a robust vehicle dispatch model given a dataset of taxi operational records: (1) how to formulate a robust resource allocation problem that dispatches vacant vehicles towards predicted uncertain demand, and (2) how to construct spatial-temporally correlated uncertain demand sets for this robust resource allocation problem without sacrificing too much average performance of the system. We first develop the objective and constraints of a robust dispatch problem considering spatial-temporally correlated demand uncertainties. The objective of a system-level optimal dispatch solution is to balance the workload of taxis in each region of the entire city with minimum total current and expected future idle cruising distance. We define an approximation of the balanced vehicle objective in this work, such that the robust vehicle dispatch problem is concave in the uncertain demand and convex in the decision variables. We then design a data-driven algorithm for constructing uncertain demand sets without assumptions about the true model of the demand vector. The construction algorithm is based on hypothesis testing theories [6], [11], [25]; however, how to apply these theories to spatial-temporally correlated transportation data and to the uncertainty sets of a robust vehicle resource allocation problem has not been explored before. To the best of our knowledge, this is the first work to design a robust vehicle dispatch model that provides a desired probabilistic guarantee using predictable and realistic demand uncertainty sets.
Furthermore, we explicitly design an algorithm to build demand uncertainty sets from data according to different probabilistic guarantee levels for the cost. With two types of uncertainty sets, the box type and the second-order-cone (SOC) type, we prove equivalent computationally tractable forms of the robust dispatch problem under these uncertain demand models via the minimax theorem and the strong duality theorem. The robust dispatch problem formulated in this work is convex in the decision variables and concave over the constructed uncertainty sets, with decision variables in the denominators. This form is not one of the standard forms (i.e., linear programming (LP) or semidefinite programming (SDP) problems) that have already been covered by previous work [4, 6, 10]. With the proofs shown in this work, both system performance and computational tractability are guaranteed under spatial-temporal demand uncertainties. In the data-based evaluations, the average performance of the robust taxi dispatch solutions with the SOC type of uncertain demand set is better than that with the box (range) type of uncertainty set. Hence, it is important to use the more complex SOC type of uncertainty set together with the corresponding robust dispatch model we design in this work. The contributions of this work are:

We develop a robust optimization model for taxi dispatch systems under spatial-temporally correlated uncertainties of predicted demand, and define an approximation of the balanced vehicle objective. The robust optimization problem of approximately balancing vacant taxis with the least total idle distance is concave in the uncertain demand, convex in the decision variables, and computationally tractable under multiple types of uncertainties.

We design a data-driven algorithm to construct uncertainty sets that provide a desired level of probabilistic guarantee for the robust taxi dispatch solutions.

We prove that there exist equivalent computationally tractable convex optimization forms for the robust dispatch problem with both polytope and second-order-cone (SOC) types of uncertainty sets constructed from data.

Evaluations on four years of taxi trip data in New York City show that the SOC type of uncertainty set provides a smaller average dispatch cost than the polytope type. The average demand-supply ratio mismatch is reduced by 31.7%, and the average total idle distance is reduced by 10.13%, or about 20 million miles annually, with robust dispatch solutions under the SOC type of uncertainty set.
The rest of the paper is organized as follows. The taxi dispatch problem is described and formulated as a robust optimization problem given a closed and convex uncertainty set in Section II. We design an algorithm for constructing uncertain demand sets based on taxi operational records data in Section III. Equivalent computationally tractable forms of the robust taxi dispatch problem given different forms of uncertainty sets are proved in Section IV. Evaluation results based on a real data set are shown in Section V. Concluding remarks are provided in Section VI.
II Problem Formulation
The goal of taxi dispatch is to direct vacant taxis towards current and predicted future requests with minimum total idle mileage. There are two objectives. One is to send more taxis where there are more requests, in order to reduce the mismatch between supply and demand across all regions in the city. The other is to reduce the total idle driving distance for picking up passengers in order to save cost. Including predicted future demand when making current decisions helps to increase total profits, since drivers are able to travel to regions with better chances of picking up future passengers. In this section, we formulate a taxi dispatch problem with uncertainties in the predicted spatial-temporal patterns of demand. A typical monitoring and dispatch infrastructure is shown in Figure 1. The dispatch center periodically collects and stores real-time information such as GPS location, occupancy status, and road conditions; dispatch solutions are sent to taxis via cellular radio.
II-A Problem description
Parameters of (11)  Description 

the number of regions  
model predicting time horizon  
the uncertain total number of requests at each region during time  
weight matrix, is the distance from region to region  
probability matrix that describes taxi mobility patterns during one time slot  
the initial number of vacant taxis at each region provided by GPS and occupancy status data  
the upper bound of distance each taxi can drive idly for picking up a passenger  
the power on the denominator of the cost function  
the weight factor of the objective function  
Variables of (11)  
the number of taxis dispatched from region to region during time  
the number of vacant taxis at each region before dispatching at the beginning of time  
Parameters of Algorithm 1  
the uncertain concatenated demand vector of consecutive time slots  
one sample of according to subdataset , records of date  
significance level of a hypothesis testing 
For computational efficiency, we assume that the entire city is divided into regions, and that the time of one day is discretized into time slots indexed by . Taxi dispatch decisions are calculated in a receding horizon process, since considering future demand when making the current dispatch decisions helps to reduce resource allocation costs [30] and taxis’ total idle distance [20]. At time , we consider the effects of the current decision on the following time slots. Only the dispatch solution for time is implemented, and solutions for the remaining time slots are not materialized. When the time horizon rolls forward by one step from to , information about vehicle locations and occupancy status is observed and updated, and we calculate a new dispatch solution for .
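The receding horizon process described above can be sketched as a short loop. This is only an illustration under stated assumptions: `observe` and `solve` are hypothetical callables standing in for the sensing step and for the dispatch optimization, and neither name comes from this work.

```python
def receding_horizon(steps, horizon, observe, solve):
    """Plan `horizon` slots ahead at every step, but apply only the first plan.

    observe(k) -> state: hypothetical sensing step (vacant-taxi counts at slot k).
    solve(state, k, horizon) -> list of plans, one per slot k, ..., k + horizon - 1.
    """
    applied = []
    for k in range(steps):
        state = observe(k)                # refresh locations and occupancy status
        plans = solve(state, k, horizon)  # plans for the whole prediction horizon
        applied.append(plans[0])          # later plans are discarded and recomputed
    return applied
```

The design choice worth noting is that plans for later slots influence the first decision through the optimization, yet they are never executed; they are recomputed once new sensing data arrives.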
We define as the number of total requests within region during time , and is the model predicting time horizon. We relax the integer constraint of to positive real values, since the integer constraint would make the robust dispatch problem in this section computationally intractable. The total number of requests at region may follow patterns similar to those of its neighbors; for instance, during busy hours, several downtown regions may all have peak demand. Meanwhile, demands during several consecutive time slots , are temporally correlated. Typically, it is difficult to predict a deterministic value of passenger demand of a region during a specific time. We define the spatial-temporally correlated uncertain demand by one compact (closed and bounded) and convex set as
where is called the concatenated demand vector, and means the transpose of . The closed, bounded, and convex form of depends on the method used to construct the uncertainty set, which we describe in detail in Section III. Since depends on , and is one component of , the uncertainty set for demand at time is defined as a closed, convex set , and a projection of
Note that the projection of a convex set onto some of its coordinates is also convex [8, Chapter 2.3.2].
A robust dispatch model that decides the number of vacant taxis sent between each node pair according to the demand at each node and practical constraints is described by the network flow model of Figure 2. The edge weight of the graph represents the distance between two regions. Specifically, each region has an initial number of vacant taxis provided by real-time sensing information and an uncertain predicted demand. We define a nonnegative decision variable matrix , , where is the number of vehicles dispatched from region to . We relax the integer constraint of to a nonnegative real constraint, since mixed integer programming is not computationally tractable with uncertain parameters. Each resource allocation decision is made by solving the following robust optimization problem
(1) 
where is a convex cost function for allocating resources, is a function concave in and convex in that measures the service fairness of the resource allocating strategy, and is a convex domain of the decision variables that describes the constraints. We define specific formulations of the objective and constraint functions in the rest of this section.
II-B Robust taxi dispatch problem formulation
Estimated cross-region idle-driving distance: When traversing from region to region , taxi drivers incur the cost of cruising on the road without picking up a passenger until reaching the target region. Hence, we aim to minimize this kind of idle driving distance while dispatching taxis. We define the weight matrix of the network in Fig. 2 as , where is the distance between region and region . The across-region idle driving cost according to is
(2) 
We assume that the region division method is time-invariant in this work, and is a constant matrix for the optimization problem formulation. For instance, the value of represents the length of the shortest path on streets from the center of region to the center of region . (Footnote 1: For control algorithms with a dynamic region division method, the distance matrix can be generalized to a time-dependent matrix as well.)
The distance every taxi can drive should be bounded by a threshold parameter during limited time
which is equivalent to
(3) 
To explain this, assume the constraint (3) holds. If and , we have , which contradicts (3). The threshold is related to the length of the time slot and traffic conditions on streets. For instance, with an estimated average speed of cars in one city during time , and an idle driving time to reach a dispatched region required to be less than minutes, the value of should be the distance one taxi can drive during minutes at the current average road speed.
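As a concrete illustration of the idle-distance cost (2) and the distance bound (3), the following minimal sketch evaluates both on small matrices. It assumes that the cost is the flow-weighted sum of distances and that (3) takes the elementwise form "flow times distance is at most threshold times flow"; the names `X`, `W`, `m`, `idle_cost`, and `respects_threshold` are illustrative, not taken from this work.

```python
def idle_cost(X, W):
    """Total idle distance: sum over region pairs of (taxis sent) * (distance)."""
    n = len(X)
    return sum(X[i][j] * W[i][j] for i in range(n) for j in range(n))


def respects_threshold(X, W, m):
    """Check X[i][j] * W[i][j] <= m * X[i][j] for every pair of regions.

    For nonnegative X this holds exactly when no taxi is sent between
    regions whose distance exceeds the threshold m.
    """
    n = len(X)
    return all(X[i][j] * W[i][j] <= m * X[i][j]
               for i in range(n) for j in range(n))
```

The product form keeps the constraint linear in the decision variables, which is why it is preferred over an explicit "if distance exceeds threshold" rule.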
Metric of service quality: We design the metric of service quality as a function concave in and convex in in this work for computational efficiency [4]. In addition to the vacant taxis that traverse to region according to matrix , we define as the number of vacant taxis at region before dispatching at the beginning of time , and is provided by real-time sensing information. We assume that the total number of vacant taxis is greater than the number of regions, i.e., , and that each region should have at least one vacant taxi after dispatch. Then the total number of vacant taxis at region during time satisfies
(4)  
(5) 
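A minimal sketch of the bookkeeping behind (4) and (5), assuming that the supply at a region after dispatch equals the taxis arriving, minus the taxis leaving, plus the vacant taxis initially there. The names `X`, `L`, and `supply_after_dispatch` are hypothetical.

```python
def supply_after_dispatch(X, L):
    """Vacant taxis per region after dispatch: inflow - outflow + initial count."""
    n = len(L)
    inflow = [sum(X[j][i] for j in range(n)) for i in range(n)]
    outflow = [sum(X[i][j] for j in range(n)) for i in range(n)]
    return [inflow[i] - outflow[i] + L[i] for i in range(n)]
```

Dispatching only moves taxis between regions, so the total supply is unchanged; a feasible plan additionally needs every entry of the result to stay positive.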
One service metric is fairness, meaning that the demand-supply ratio of each region equals that of the whole city. A balanced distribution of vacant taxis is an indication of good system performance from the perspective that a customer’s expected waiting time is short, as shown by a queuing-theoretic model [30]. Meanwhile, a balanced demand-supply ratio means that regions with less demand will get fewer resources, and idle driving distance will be reduced in regions with more supply than demand if we pre-allocate possible redundant supply to regions in need. We aim to minimize the mismatch value, i.e., the total difference between the local demand-supply ratio of each region and the global demand-supply ratio of the whole city, similarly to the objective defined in [21, 20]
(6) 
However, the function (6) is not concave in for any . It is worth noting we need a function concave in for any , and convex in for any , to make sure the robust optimization problem is computationally tractable. Hence, we define
(7) 
as a service fairness metric to minimize. This is because we approximately minimize (6) by minimizing (7) under the constraints (4) and (5) with an value chosen according to the desired approximation level, and the following Lemma explains this approximation.
Lemma 1
See Appendix A-A. According to the proof, we can always choose to be small enough (or close enough to ) in order to obtain a desired level of approximation . Hence, in the experiments of Section V, we numerically choose based on simulation results. Therefore, with function (7), we map the objective of balancing supply according to demand across every region in the city to a computationally tractable function that is concave in the uncertain parameters and convex in the decision variables for a robust optimization problem.
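The relation between the mismatch objective (6) and its surrogate (7) can be explored numerically. The sketch below assumes that (6) sums the absolute gaps between each local demand-supply ratio and the city-wide ratio, and that (7) sums demand divided by supply raised to a small power; both forms, and the names `r`, `S`, `alpha`, are assumptions made for illustration.

```python
def ratio_mismatch(r, S):
    """Sum over regions of |local demand/supply - global demand/supply|."""
    global_ratio = sum(r) / sum(S)
    return sum(abs(ri / si - global_ratio) for ri, si in zip(r, S))


def fairness_surrogate(r, S, alpha):
    """Assumed surrogate: sum of demand / supply**alpha.

    Linear (hence concave) in the demand r and convex in the supply S
    for S > 0 and alpha > 0.
    """
    return sum(ri / si ** alpha for ri, si in zip(r, S))
```

On a toy instance with demand `[10, 20, 30]` and six taxis, the supply proportional to demand has zero mismatch and also a lower surrogate value than the reversed supply, which matches the intuition that minimizing the surrogate pushes supply towards demand.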
The number of initial vacant taxis depends on the number of vacant taxis at each region after dispatch during time and the mobility patterns of passengers during time , while we do not directly control the latter. We define as the probability that a taxi traverses from region to region and turns vacant again (after one or several drop-off events) at the beginning of time , provided it is vacant at the beginning of . Methods of getting based on data include, but are not limited to, modeling trip patterns of taxis [21] and autonomous mobility-on-demand systems [30]. Then the number of vacant taxis within each region by the end of time satisfies
(9) 
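The propagation of vacant taxis from one time slot to the next can be sketched as a vector-matrix product, assuming (9) multiplies the post-dispatch supply by the mobility matrix. The names `S`, `P`, and `next_vacant` are hypothetical, and the rows of `P` sum to one here only to keep the toy check simple.

```python
def next_vacant(S, P):
    """Expected vacant taxis per region at the next slot.

    S[i]: vacant taxis at region i after dispatch.
    P[i][j]: probability that a taxi vacant in region i is vacant in region j
             at the beginning of the next slot.
    """
    n = len(S)
    return [sum(S[i] * P[i][j] for i in range(n)) for j in range(n)]
```

Because the dispatch decision enters `S` linearly, the next initial supply stays linear in the decision variables, so adding future slots to the horizon does not break convexity.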
Weighted-sum objective function: Since there exists a trade-off between the two objectives, we define a weighted sum with parameter of the two objectives defined in (2) and defined in (7) as the objective function. Let and represent decision variables and . Without considering model uncertainties corresponding to , a convex optimization form of the taxi dispatch problem is
(10) 
Robust taxi dispatch problem formulation: We aim to find a dispatch solution that is robust to an uncertain demand model in this work. For time , uncertain demand only affects the dispatch solutions of time , and the dispatch solution at is related to uncertain demand at , similar to the multi-stage robust optimization problem in [7]. However, the control laws considered in [7] are polynomial in past-observed uncertainties; in this work, we do not restrict the decision variables to be any particular function of previously observed uncertain demands. The dispatch decisions are numerical optimal solutions of a robust optimization problem. With the list of parameters and variables shown in Table I, and considering both the current and future dispatch costs when making the current decisions, we define a robust taxi dispatch problem as follows
(11) 
After getting an optimal solution of (11), we adjust the solution by rounding methods to get an integer number of taxis to be dispatched towards corresponding regions. It does not affect the optimality of the result much in practice, since the objective or cost function is related to the demandsupply ratio of each region. A feasible integer solution of (11) always exists, since is feasible. Although we cannot provide any theoretical guarantee on the suboptimality of the rounded integer solution, in the numerical experiments the costs under integer solution after rounding and the original real value optimal solution are comparable.
III Algorithm for Constructing Uncertain Demand Sets
With many factors affecting taxi demand at different times and in different areas of a city, explicitly describing the demand model is a strict requirement, and errors of the model will affect the performance of dispatch frameworks. Considering future demand and demand uncertainties helps to minimize the worst-case demand-supply ratio mismatch error and idle distance [21, 20]. It is therefore essential to construct a model that captures the spatial-temporal demand uncertainties and provides a probabilistic guarantee about the vehicle resource allocation cost. We construct demand uncertainty sets via Algorithm 1: getting a sample set of from the original dataset and partitioning the sample set, bootstrapping a threshold for the test statistics according to the required probabilistic guarantee, and calculating the model of uncertainty sets based on the thresholds.
III-A An uncertainty set with probabilistic guarantee
For convenience, we concisely denote all the variables of the taxi dispatch problem as . Assume that we do not have knowledge about the true distribution of the random demand vector . With the objective function of problem (11), the probabilistic guarantee for the event that the true dispatch cost is smaller than the optimal dispatch cost is defined as the following chance-constrained problem
(12) 
The constraint and objective function are concave in for any , and convex in for any . Without loss of generality about the objective and constraint functions, equivalently we aim to find solutions for
(13) 
When it is difficult to explicitly estimate , we solve the following robust problem such that its optimal solutions satisfy the probabilistic guarantee requirement for (13)
(14) 
Then of problem (14) can be any vector in the uncertainty set instead of a random vector in (13). The uncertainty set that keeps the optimal solution of (14) satisfying the constraints of problem (13) is defined as the following:
Problem 1
Construct an uncertainty set , given and samples of random vectors , such that
(P1). The robust constraint (14) is computationally tractable.
(P2). The set implies a probabilistic guarantee for the true distribution of a random vector at level , that is, for any optimal solution and for any function concave in , we have the implication:
(15) 
The given probabilistic guarantee level is related to the degree of conservativeness of the robust optimization problem.
III-B Aggregating demand and partitioning the sample set
Every discretized time slots of demand are concatenated to a vector . The first step is to transform the original taxi operational data to a dataset of sampled vector of different dates for each index . For instance, assume we choose the length of each time slot as one hour, and the dataset records all trip information of taxis during each day. According to the start time and GPS coordinate of each pickup event, we aggregate the total number of pick up events during one hour at each region to get samples .
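The aggregation step can be sketched as a simple counting pass over pickup records. This is an illustration under stated assumptions: each record is a hypothetical `(date, slot, region)` tuple whose region index has already been derived from the GPS coordinate, and for simplicity the sample vector covers all slots of a day rather than a window of consecutive slots.

```python
from collections import defaultdict


def aggregate_demand(pickups, n_regions, slots_per_day=24):
    """Count pickups per (slot, region) for every date.

    Returns {date: flat list of length slots_per_day * n_regions}, where the
    entry at index slot * n_regions + region is the number of pickups.
    """
    samples = defaultdict(lambda: [0] * (slots_per_day * n_regions))
    for date, slot, region in pickups:
        samples[date][slot * n_regions + region] += 1
    return dict(samples)
```

Each date then contributes one sample of the concatenated demand vector, which is the unit the later hypothesis tests operate on.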
It is always possible to describe the support of the distribution of all samples contained in the dataset, even if they do not follow the same distribution, as explained in Figure 3. When there is prior knowledge or categorical information with which to partition the dataset into several subsets, we get a more accurate uncertainty set for each sub-dataset that provides the same probabilistic guarantee level, compared with the uncertainty set built from the entire dataset. Clustering algorithms with categorical information [16] are applicable for dataset partition when information besides pickup events is available, such as weekdays/weekends, weather, or traffic conditions. It is worth noting that if the uncertainty sets are built for a categorical information set , then for the robust dispatch problems we require that the same set of categories be available in real time, so that we apply the uncertainty set of to find solutions when the current situation is considered as .
III-C Uncertainty Modeling
The basic idea for defining an uncertainty set is to find a threshold for a hypothesis test that is acceptable with respect to the given dataset and a required probabilistic guarantee level; the formula of the uncertainty set is then determined by the threshold value of the accepted hypothesis test. Given the original data, the null hypothesis , , and the test statistics , we need to find a threshold that accepts at significance level for each subset of sampled demand vectors. Since we do not assume that the marginal distributions of the elements of vector are independent of each other, we apply two models from the robust optimization literature [6], [11], [25] that make no assumptions about the true distribution to the spatial-temporally correlated demand data.
III-C1 Box type of uncertainty demand sets built from marginal samples
One intuitive description about a random vector is to define a range for each element. For instance, consider the following multivariate hypothesis holds simultaneously for with given thresholds [11]
(16) 
Assume that we have random samples for each component of , ordered by increasing value as , regardless of the original sampling order. We define the index by
(17) 
and let if the corresponding set is empty. The test is rejected if . To construct an uncertainty set, we need an accepted hypothesis test. Hence, we set and . The following uncertainty set is then applied in this work based on the range hypothesis testing (16).
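A box set of this kind can be sketched from per-component order statistics. The index formula (17) is not reproduced here, so the sketch takes the order-statistic index `s` as an input; the function name, the 1-based convention for `s`, and the choice of the `s`-th smallest value as upper bound with the mirrored value as lower bound are assumptions made for illustration.

```python
def box_bounds(samples, s):
    """Per-component range from order statistics.

    samples: list of N demand vectors (one per day).
    s: 1-based index with N/2 < s <= N (assumed to come from the range test).
    Returns (lower, upper): lower[i] is the (N - s + 1)-th smallest value of
    component i and upper[i] is the s-th smallest value.
    """
    N = len(samples)
    dims = len(samples[0])
    lower, upper = [], []
    for i in range(dims):
        column = sorted(v[i] for v in samples)
        lower.append(column[N - s])
        upper.append(column[s - 1])
    return lower, upper
```

A larger `s` widens every interval, which corresponds to a more conservative set; note that each component is bounded separately, so correlations between components are not represented.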
III-C2 SOC type of uncertainty set motivated by moment hypothesis testing
It is not easy to tell directly from the uncertainty set (18) how a change in the range of one component affects the others. To show the spatial-temporal correlations of the demand directly, we also apply a hypothesis test related to both the first and second moments of the true distribution of the random vector [25].
(19) 
where and are the (unknown) true mean and covariance of , and are the estimated mean and covariance from data. Without knowledge of and , is rejected when the difference between estimates of the mean or covariance across repeated sampling is greater than the threshold, i.e., or , where is the estimated mean value of one experiment, and are the estimated mean and covariance from multiple experiments, and are the thresholds. The remaining problem is then to find the values of the thresholds such that hypothesis test (19) holds given the dataset. The uncertainty set derived from the moment hypothesis test is defined in the following proposition.
Proposition 2 ([6], [25])
With probability at least with respect to the sampling, the following uncertainty set implies a probabilistic guarantee level of for
(20) 
where is a Cholesky decomposition.
When one component of increases or decreases, expression (20) gives an intuition of how this affects the values of the other components of .
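For reference, the cited data-driven robust optimization literature [6] writes a moment-based set of this kind in the following shape. This is a sketch under assumed notation, since the symbols of (20) are not reproduced here: $r_c$ for the concatenated demand vector, $\hat{\mu}$ and $\hat{\Sigma}$ for the estimated mean and covariance, $\Gamma_1$ and $\Gamma_2$ for the bootstrapped thresholds, and $\epsilon$ for the allowed violation probability are all illustrative names.

```latex
\mathcal{U}_{\epsilon} = \Bigl\{\, r_c \;:\; r_c = \hat{\mu} + y + C^{T} w,\quad
\|y\|_2 \le \Gamma_1,\quad \|w\|_2 \le \sqrt{\tfrac{1}{\epsilon} - 1} \,\Bigr\},
\qquad C^{T} C = \hat{\Sigma} + \Gamma_2 I .
```

Read this way, $y$ absorbs the error in the estimated mean, while $C^{T} w$ moves all components jointly along the directions of the inflated covariance, which is how a change in one component constrains the others.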
III-D Algorithm
With a threshold of the test statistics calculated via the given dataset, we then apply the formula (18) for constructing a box type of uncertainty set, and the formula (20) for an SOC type of uncertainty set, respectively. The following Algorithm 1 describes the complete process for constructing uncertain demand sets based on the original dataset.
We do not restrict the method of estimating mean and covariance matrices of a subset in step , and bootstrap is one method. For step , the process for the box type of uncertainty sets is: calculate index that satisfies (17) with the given , sort each component of sampled vectors , and get the order statistics , of the th sample set . For the SOC type, we calculate the mean and covariance of the samples of the vector according to the subset as and , respectively.
In step , the level thresholds for the box type of uncertainty sets are the th largest value of the upper bound and the th largest value of the lower bound for the th component. For the SOC type of uncertainty sets, we calculate the mean and covariance of for the times bootstrap as and , and get . Denote the th largest value of and as and , respectively.
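The bootstrap step for the mean threshold can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: it resamples whole daily vectors with replacement, measures how far each resampled mean lies from the full-sample mean, and returns an upper quantile of those distances. The names `alpha_h`, `n_boot`, and `seed` are hypothetical, and the covariance threshold would be obtained analogously with a matrix norm.

```python
import math
import random


def bootstrap_mean_threshold(samples, alpha_h, n_boot=200, seed=0):
    """Upper (1 - alpha_h) quantile of ||bootstrap mean - sample mean||_2."""
    rng = random.Random(seed)
    N, d = len(samples), len(samples[0])
    mean_hat = [sum(v[i] for v in samples) / N for i in range(d)]
    dists = []
    for _ in range(n_boot):
        # Resample N daily vectors with replacement.
        draw = [samples[rng.randrange(N)] for _ in range(N)]
        mean_b = [sum(v[i] for v in draw) / N for i in range(d)]
        dists.append(math.sqrt(sum((a - b) ** 2
                                   for a, b in zip(mean_b, mean_hat))))
    dists.sort()
    idx = min(n_boot - 1, math.ceil(n_boot * (1 - alpha_h)) - 1)
    return dists[max(idx, 0)]
```

A smaller significance level picks a higher quantile and therefore a larger threshold, which enlarges the resulting uncertainty set.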
In summary, to construct a spatial-temporal uncertain demand model for problem (11), in this section we consider the taxi operational record of each day as one independent and identically distributed (i.i.d.) sample of the concatenated demand vector . By partitioning the entire dataset into several subsets according to categorical information such as weekdays and weekends, we are able to build uncertainty sets for each subset of data without additional assumptions about the true distribution of the spatial-temporal demand profile. We then design Algorithm 1 to construct box-type and SOC-type uncertainty sets from data that provide a desired probabilistic guarantee for the robust solutions.
IV Computationally Tractable Formulations
In this section, we build equivalent computationally tractable formulations of problem (11) under the different definitions of uncertainty sets calculated by Algorithm 1. Hence, the robust taxi dispatch problem considered in this work can be solved efficiently. Computational tractability of a robust linear programming problem for ellipsoid uncertainty sets is discussed in [4]. The process is to reformulate constraints of the original problem as equivalent convex constraints that must hold given the uncertainty set. The objective function of problem (11) is concave in the uncertain parameters and convex in the decision variables, with the decision variables in the denominators; it is not one of the standard forms of linear programming (LP) or semidefinite programming (SDP) problems that are already covered by previous work [4, 6]. Hence, we prove one equivalent computationally tractable form of problem (11) for each uncertainty set constructed in Section III.
Only the components of objective functions in (11) include uncertain parameters, and the decision variables of the function are in the denominator of the function . The box type uncertainty set defined as (18) is a special form of polytope, hence, we first prove an equivalent standard form of convex optimization problem for (11) for a polytope uncertainty set as the following.
Theorem 1
See Appendix A-B.
To directly use a demand uncertainty set that describes the spatial-temporal correlation of , like (18) and (20), for the concatenated demand in problem (11), we first group the maximization over each together to avoid the projection step for each individual . Furthermore, we can find the dual (a minimization problem) of the cost maximization problem over , and then efficiently solve (11) numerically, minimizing the total cost during time under uncertain demand . Hence, we first prove that the minimax equality holds for the maximin problem over each pair of and for problem (11), and that (11) is equivalent to the robust optimization problem shown in the following lemma.
Lemma 2
(Minimax equality) Given the assumption that the uncertainty sets and are compact (closed and bounded) and convex, the robust dispatch problem (11) is equivalent to the following robust dispatch problem
(22) 
See Appendix A-C.
For the robust optimization problem (11), the computationally tractable convex form depends on the definition of the uncertainty sets. When the conditions of Lemma 2 hold, equivalent convex optimization forms of problem (11) are derived based on problem (22). For a multi-stage robust optimization problem that restricts the near-optimal control input of linear dynamical systems to be a polynomial of a certain degree in previously observed uncertainties, an approximate semidefinite programming method for calculating the time-dependent control input is proposed in [7]. That method does not require the minimax equality to hold for the robust optimal control problem.
The box type uncertainty set (18) is a special form of polytope, that the uncertain demand model during different time of a day is described separately. The process of converting problem (11) to an equivalent computationally tractable convex form is similar to that of the onestage robust optimization problem. The result is described as the following lemma.
Lemma 3
If the uncertainty set for describes each demand vector separately as a nonempty polytope of the form
(23) 
problem (11) is equivalent to the following convex optimization problem
(24) 
See Appendix AD1.
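To make the worst-case step concrete, the following is a minimal sketch under a simplifying assumption that is ours and not the paper's: the cost is taken to be linear in the demand vector, with fixed weights `c`, and the uncertainty set is a box with bounds `lo` and `hi`. In that case the inner maximization has a closed form, which the sketch checks against brute-force enumeration of the box vertices. All names and numbers are hypothetical.

```python
from itertools import product

def worst_case_box(c, lo, hi):
    """Closed-form max of sum(c[i] * r[i]) over the box lo[i] <= r[i] <= hi[i].

    Assumption for illustration: the cost is linear in the demand r.
    Each coordinate is pushed to the bound that increases the cost.
    """
    return sum(ci * (h if ci >= 0 else l) for ci, l, h in zip(c, lo, hi))

def worst_case_by_vertices(c, lo, hi):
    """Brute-force check: a linear function attains its max at a box vertex."""
    return max(
        sum(ci * ri for ci, ri in zip(c, r))
        for r in product(*zip(lo, hi))  # all 2^n combinations of bounds
    )

# Hypothetical weights and demand bounds for three regions.
c = [0.5, 2.0, 1.0]
lo = [10.0, 4.0, 7.0]
hi = [14.0, 9.0, 8.0]

assert worst_case_box(c, lo, hi) == worst_case_by_vertices(c, lo, hi)
```

Vertex enumeration grows exponentially with the number of regions, which is one reason to prefer a closed-form or dual reformulation of the inner maximization.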
For the more general case in which the uncertainty sets for are temporally correlated, the following theorem and its proof describe the equivalent computationally tractable convex form of (11).
Theorem 2
When is defined as the following nonempty polytope set
(25) 
problem (11) is equivalent to the following convex optimization problem
(26) 
See Appendix AD2.
With an uncertain demand model defined as (20) for concatenated , the following theorem derives the equivalent computationally tractable form of problem (11).
Theorem 3
See Appendix AE.
It is worth noting that any optimal solution for problem (10) has a special form between any pair of regions .
Proposition 3
Assume that is an optimal solution of (10). Then any satisfies that, for any pair of , at least one of the two elements and is .
We prove by contradiction. Assume that one optimal solution has the form such that and . Without loss of generality, we assume that , and let
other elements of equal to . Then
Hence, we have All constraints are satisfied and is also a feasible solution for (11).
Next, we compare and . With , and , we have
Thus the partial cost , which contradicts the assumption that is an optimal solution. In summary, an optimal solution cannot have at the same time, and at least one of and should be .
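The exchange argument in the proof can be illustrated with a small sketch. It assumes, as our own simplification, that `X[i][j]` is the number of vacant taxis sent from region `i` to region `j` and that the idle distance cost is the sum of `X[i][j] * W[i][j]` for nonnegative distances `W`; these names and the example numbers are hypothetical. Cancelling the common part of two opposite flows keeps every region's net outflow unchanged and cannot increase the distance cost.

```python
def cancel_opposite_flows(X):
    """Return a copy of X in which, for each pair (i, j), only the net flow remains."""
    n = len(X)
    Y = [row[:] for row in X]
    for i in range(n):
        for j in range(i + 1, n):
            common = min(Y[i][j], Y[j][i])  # taxis that would simply swap regions
            Y[i][j] -= common
            Y[j][i] -= common
    return Y

def net_outflow(X, i):
    """Taxis leaving region i minus taxis arriving at region i."""
    return sum(X[i]) - sum(row[i] for row in X)

def total_distance(X, W):
    """Total idle distance, assuming cost X[i][j] * W[i][j] per pair."""
    n = len(X)
    return sum(X[i][j] * W[i][j] for i in range(n) for j in range(n))

# Hypothetical dispatch matrix with opposite flows between regions 0-1 and 1-2.
X = [[0, 5, 0],
     [2, 0, 1],
     [0, 4, 0]]
W = [[0, 3, 6],
     [3, 0, 2],
     [6, 2, 0]]

Y = cancel_opposite_flows(X)
assert all(net_outflow(X, i) == net_outflow(Y, i) for i in range(3))
assert total_distance(Y, W) <= total_distance(X, W)  # 15 versus 31 here
```

In this example the resulting matrix has at most one nonzero entry in each pair of opposite directions, which matches the structure stated in Proposition 3.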
With equivalent convex optimization forms under the different uncertainty sets, the robust taxi dispatch problem (11) is computationally tractable and can be solved efficiently.
V Data-Driven Evaluations
We conduct data-driven evaluations based on four years of taxi trip data from New York City [12]. A summary of this data set is shown in Table II. In this data set, every record represents an individual taxi trip and includes the GPS coordinates of the pickup and dropoff locations and the date and time (with a precision of seconds) of pickup and dropoff. Dispatch solutions based on different granularities of equal-area region partitions have been compared in [20], and other region partition methods are discussed in [18]. In the following experiments, we use an equal-area grid partition as the baseline, and compare the robust and non-robust solutions based on the same region partition method. One partition example for the map of the Manhattan area is shown in Figure 4, where we visualize the density of taxi passenger demand in the data we use for large-scale data-driven evaluations. The lighter the region, the higher the daily demand density; the middle regions typically have higher density than the uptown and downtown regions. In this section, we construct uncertainty sets according to Algorithm 1, discuss factors that affect the modeling of the uncertainty set, and compare the optimal costs of the robust dispatch formulation (11) and the non-robust optimization form (10).
How vacant taxis are balanced across regions with different values: Figure 5 shows the mismatch between supply and demand, defined in (6), for the optimal solutions obtained by minimizing defined in (7) for . With closer to , the optimal value of (6) is smaller. We choose for calculating optimal solutions of (11) and (10) in this section.
Taxi Trip Data set  Format  

Collection Period  Data Size  Record Number  ID  Trip Time  Trip Location 
01/01/2010-12/31/2013  about million  Date  Start and end time  GPS coordinates of start and end 
V-A Box type of uncertainty set
For all box type uncertainty sets shown in this subsection, constructed with the model described in Subsection III-C1, we set the confidence level of the hypothesis tests as , the number of bootstrap iterations as , and the number of randomly sampled data points (with replacement) in each bootstrap iteration as .
Partitioned dataset compared with nonpartitioned dataset:
We show the effects of partitioning the trip record dataset by weekdays and weekends in Figures 6 and 7. The whole city is partitioned into regions, the prediction time horizon is , where one time instant means one hour, , and every . Figures 6 and 7 show the lower and upper bounds of each region during one time slot of (18). By applying the data of weekdays and weekends separately, the range of each component is reduced. To measure the uncertainty level, we define the sum of the ranges of every component for as
.
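The measure just defined can be sketched as follows, assuming it is the plain sum of upper bound minus lower bound over all regions and time slots of the box set, with no normalization. The function name, the nested-list layout (one inner list per time slot) and the bounds below are hypothetical and are not taken from the data set.

```python
def sum_of_ranges(lower, upper):
    """Sum of (upper - lower) over every region and time slot of a box set.

    lower[k][i] and upper[k][i] are the bounds for region i in time slot k.
    Assumption: the measure is an unnormalized sum of component ranges.
    """
    return sum(
        u - l
        for lower_k, upper_k in zip(lower, upper)
        for l, u in zip(lower_k, upper_k)
    )

# Hypothetical bounds for 2 time slots and 2 regions.
lower_all, upper_all = [[2, 5], [1, 4]], [[10, 12], [8, 9]]          # all days pooled
lower_weekday, upper_weekday = [[4, 6], [3, 5]], [[9, 11], [7, 8]]   # weekdays only

# A tighter box gives a smaller sum of ranges (17 versus 27 in this example).
assert sum_of_ranges(lower_weekday, upper_weekday) < sum_of_ranges(lower_all, upper_all)
```

Comparing this value across sets is only meaningful when the number of regions and the time horizon are the same for both.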
For the box type of uncertainty sets, when values of the dimension of , i.e., , and are fixed, a smaller means a smaller area of the uncertainty set, or a more accurate model. We denote calculated via records of weekdays and weekends as and