Multi-Agent Reinforcement Learning for Order-dispatching via
Order-Vehicle Distribution Matching
Abstract.
Improving the efficiency of dispatching orders to vehicles is a research hotspot in online ride-hailing systems. Most existing solutions for order-dispatching rely on centralized control, which requires considering all possible matches between available orders and vehicles. On large-scale ride-sharing platforms, thousands of vehicles and orders must be matched every second, which incurs a very high computational cost. In this paper, we propose a decentralized-execution order-dispatching method based on multi-agent reinforcement learning to address the large-scale order-dispatching problem. Unlike previous cooperative multi-agent reinforcement learning algorithms, in our method all agents work independently under the guidance of an evaluation of the joint policy, so there is no need for communication or explicit cooperation between agents. Furthermore, we use KL-divergence optimization at each time step to speed up the learning process and to balance the vehicles (supply) and orders (demand). Experiments in both an explanatory environment and a real-world simulator show that the proposed method outperforms the baselines in terms of accumulated driver income (ADI) and order response rate (ORR) in various traffic environments. In addition, with the support of the online platform of Didi Chuxing, we design a hybrid system to deploy our model.
1. Introduction
With the rapid growth of the mobile internet, it has become feasible and promising to establish modern large-scale ride-hailing systems such as Uber, Didi Chuxing and Lyft, which allow passengers to book routes with smartphones and match available vehicles to them using intelligent algorithms. To some extent, these ride-hailing systems improve the efficiency of the transportation system.
In ride-hailing systems, a key question is how to dispatch orders to vehicles so that the system works more efficiently and generates more impact. We illustrate order-dispatching in Figure 1, which shows that the algorithm used by the decision maker is critical for finding suitable matches, because the result of order-dispatching directly influences the platform's efficiency and income.
The general strategies of automatic order-dispatching systems are to minimize the waiting time and taxi cruising time through route planning or by matching the nearest orders and vehicles (Chung, 2005; Lee et al., 2004; Myr, 2013; Chadwick and Baron, 2015). In recent research, another approach to the order-dispatching problem is to leverage combinatorial optimization (Papadimitriou and Steiglitz, 1998) to improve the success rate of order-dispatching (Zhang et al., 2017). It achieves a significant improvement in online tests, but it suffers from high computational cost and relies strongly on appropriate feature engineering. More importantly, the above strategies are myopic: they may find suitable matches at the current stage but ignore the potential future impact.
In this paper, we focus on developing a method to maximize the accumulated driver income (ADI), i.e., the impact of orders served in one day, and the order response rate (ORR), i.e., the proportion of served orders to the total orders in one day. Intuitively, matching vehicles with high-price orders yields high impact at a single order-dispatching stage. However, if the served orders lead to a mismatch between orders and vehicles in the near future, the overall service quality suffers in terms of ORR and long-term ADI. Hence, to find a balance between the long-term ADI and ORR, it is necessary to develop an order-dispatching algorithm that takes future supply and demand into consideration.
Xu et al. (2018) proposed a planning and learning method based on decentralized multi-agent deep reinforcement learning (MARL) and centralized combinatorial optimization to optimize the long-term ADI and ORR. The method formulates the order-dispatching task as a sequential decision-making problem and treats a vehicle as an agent. However, a critical issue for centralized approaches is the potential "single point of failure" (Lynch, 2009), i.e., a failure of the centralized authority brings down the whole system (Lin et al., 2018). Two other related works that use multi-agent learning for order-dispatching are based on mean-field MARL (Li et al., 2019) and knowledge transfer (Wang et al., 2018b).
Several challenges arise when we apply MARL to the real-time order-dispatching scenario. First, handling the non-stationary environment is a major problem in MARL: all agents learn policies concurrently, while each individual agent does not know the policies of the other agents (Hu et al., 1998). The state transition in a multi-agent environment is driven by all agents together, so it is important for agents to have knowledge of other agents' policies. In the order-dispatching scenario, we only care about the idle status of a vehicle, since only idle vehicles are available for order-dispatching. However, because the duration of each order is non-deterministic, in contrast to traditional multi-agent scenarios with a deterministic time interval, it is difficult to learn the interactions between agents in successive idle states, which makes many MARL methods, including opponent modeling (Schadd et al., 2007; Billings et al., 1998) and communication mechanisms (Foerster et al., 2016b), hard to apply. Second, the number of idle vehicles keeps changing during the whole episode, i.e., there are always some vehicles going offline or coming online, so general MARL methods that require a fixed number of agents cannot be directly applied (Foerster et al., 2016a; Zheng et al., 2017).
In addition, we believe that a higher ORR usually means a higher ADI, and that maintaining a higher long-term ORR leads to a higher long-term ADI. We examine this point experimentally in Section 4.2.2.
To the best of our knowledge, this is the first work that exploits this property to improve both ORR and ADI. In detail, we propose a centralized-learning and decentralized-execution MARL method to address the above challenges, extending the Double Q-learning Network (Mnih et al., 2015) with Kullback-Leibler (KL) divergence optimization. The KL-based backward learning optimization also speeds up each agent's learning with the help of the other agents. Because the agents are numerous and homogeneous, we learn only one network using parameter sharing, and share learning experiences among all agents at the training stage, as in (Zheng et al., 2017; Sukhbaatar et al., 2016). To address the non-stationary action space problem, in our implementation the input of the deep Q-learning network consists of the state and the selected action.
Extensive experiments under different traffic and order conditions, as well as real-world simulation experiments, are conducted. The experimental results demonstrate that our method yields a large improvement in both ADI and ORR over the baseline methods in various traffic environments. We also argue that the proposed method is feasible to deploy on the existing order-dispatching platform.
2. Related Work
Taxi-order Dispatching
There have been several GPS-based order-dispatching systems that enhance the accuracy, communications, and productivity of taxi dispatching (Liao, 2001, 2003; Myr, 2013). These systems do not offer detailed dispatching algorithms, so they act more like information-sharing platforms that help vehicles choose which orders to serve by providing order information. Other automatic order-dispatching methods (Li et al., 2011; Miao et al., 2016) focus on reducing the pick-up distance or waiting time by finding the nearest orders. However, these methods usually fail to reach a high order-dispatching success rate and ignore many potential orders in the waiting list that may be more suitable for the vehicles. Zhang et al. (2017) proposed a centralized-control dispatching system based on combinatorial optimization. Although the method is simple, computing all available order-vehicle matches can be very expensive in a large-scale taxi-order-dispatching setting. Moreover, it requires appropriate feature engineering, which greatly increases the implementation difficulty and the human effort needed to apply the method in practice.
Multi-agent Reinforcement Learning
Multi-agent reinforcement learning has been applied in domains like collaborative decision support systems. Unlike single-agent reinforcement learning (RL), multi-agent RL requires the agents to learn to cooperate with others. It is generally impossible to know the other agents' policies, since all agents learn simultaneously. Thus, for each agent, the environment is non-stationary (Busoniu et al., 2006). Directly applying independent reinforcement learning methods to the multi-agent environment is therefore problematic. Several approaches have been proposed to relieve or address this problem, including sharing the policy parameters (Gupta et al., 2017), training the Q-function with other agents' policy parameters (Tesauro, 2004), centralized training (Lowe et al., 2017) and opponent modeling (Schadd et al., 2007; Billings et al., 1998). There are also methods that use explicit communication to offer a relatively stationary environment for peer agents (Sukhbaatar et al., 2016; Foerster et al., 2016b; Hausknecht, 2016). In large-scale multi-agent systems, the non-stationarity problem is amplified. To address it, Yang et al. (2018) proposed a method that converts multi-agent learning into a two-player stochastic game (Shapley, 1953) by applying mean field theory to multi-agent reinforcement learning, making it feasible in large-scale scenarios. Since the mean-field MARL method only takes a mean field over the state/action inputs into consideration, it ignores the agent interactions. Our proposed method provides another way to enable large-scale multi-agent learning while retaining the interactions between agents, so that agents receive global feedback from the next moment and adjust their strategies in time. Furthermore, our method provides a backward stationary learning method and reacts rapidly to feedback from the environment.
Multi-agent Taxi Dispatching
Many previous works model taxi dispatching as multi-agent learning. For example, Alshamsi et al. (2009) divide the city into many dispatching areas, regard each area as an agent, and then use self-organization techniques to decrease the total waiting time and increase taxi utilization. NTuCab (Seow et al., 2010) is a collaborative multi-agent taxi dispatching system that attempts to increase customer satisfaction more globally and can dispatch multiple orders to taxis in the same geographical region. NTuCab considers it infeasible to compute the shortest-time path for each of a possibly large number of available taxis near a customer location, since this is computationally costly. We follow these settings in our proposed model and divide the city into many dispatching regions. Each dispatching region is bounded by a given distance, which indirectly limits the maximum waiting time. NTuCab achieves a significant reduction in waiting time and taxi cruising time, but it is also computationally expensive. Xu et al. (2018) recently proposed a learning and planning method based on MARL and combinatorial optimization, and some other methods (Wei et al., 2018; Oda and Tachibana, 2018; Lin et al., 2018) focus on fleet management to improve the ADI or decrease the waiting time. However, in current operational ride-sharing scenarios it is hard to perform fleet management, because drivers cannot be forced to move to designated regions. The MARL method of Xu et al. (2018) is an independent MARL method, which ignores the interactions between agents. However, it is widely accepted that agent interactions have a positive impact on making optimal decisions.
Our proposed method considers the interactions between agents by applying constraints on the joint policies using KL-divergence optimization, and the experiments demonstrate that the proposed method outperforms the baselines on all metrics in different traffic environments.
3. Methodology
In this section, we first define order-dispatching from the perspective of a multi-agent reinforcement learning process, then discuss the main challenges of applying MARL to order-dispatching, and finally present our methods.
3.1. Order-dispatching as a Markov Game
We regard the order-dispatching task as a sequential decision task, where the goal is to maximize the long-term ADI and ORR per day. Given the characteristics of the practical environment, each vehicle can only serve the surrounding orders, so we model the order-dispatching task as a Partially Observable Markov Decision Process (POMDP) (Spaan, 2012) in a multi-agent setting. With the multi-agent setting, we can decompose the original global order-dispatching task into many local order-dispatching tasks, and transform a high-dimensional problem into multiple low-dimensional problems.
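The POMDP framework for the multi-agent order-dispatching problem can be formulated as a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{A}, \mathcal{R}, \mathcal{G}, N, \gamma \rangle$, where $\mathcal{S}$, $\mathcal{P}$, $\mathcal{A}$, $\mathcal{R}$, $\mathcal{G}$, $N$, $\gamma$ represent the set of states, the state transition probability function, the set of action spaces, the reward functions, the set of grids, the number of agents and the future reward discount factor, respectively.
For each agent $i$, $\mathcal{S}_i \in \mathcal{S}$, $\mathcal{A}_i \in \mathcal{A}$ and $\mathcal{R}_i \in \mathcal{R}$ represent the state space, action space and reward function respectively, and $G_i \in \mathcal{G}$ represents the grid in which the agent is located. The state transition occurs after the decision making: once the agents have executed their actions, the state $s_t$ of the environment at time $t$ transforms to $s_{t+1}$ at time $t+1$, and the agents receive rewards given by the environment. Based on the above definitions, the main purpose of each agent is to learn to maximize the discounted cumulative reward from $t$ to $T$:
$$\max \sum_{k=t}^{T} \gamma^{k-t}\, r_k(s_k, a_k), \qquad a_k \sim \pi_\theta(s_k).$$
In reinforcement learning, $\pi_\theta(\cdot)$ parameterized with $\theta$ represents the policy with respect to the state at time $t$.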
It is common to divide the city into regional dispatch areas (Lin et al., 2018; Seow et al., 2010). In our settings, we use a grid world to represent the real world and divide it into several order-dispatching regions. Each grid represents an individual order-dispatching region that contains some orders and vehicles, and we regard vehicles as agents. Based on the above MARL settings, we specify the order-dispatching task mathematically as follows.

State: The state input used in our method is expressed as a four-element tuple, namely $S = \langle G, N, M, D_{dest} \rangle$. The elements of the tuple represent the grid index, the number of idle vehicles, the number of valid orders and the distribution of the orders' destinations, respectively. The distribution of the orders' destinations is the mean over the destination vectors of the orders in grid $G$, which roughly reflects the overall order information. In our settings, agents in the same grid share the same state.

Action: The action input used in our method is expressed as $A = \langle G_{source}, G_{dest}, T, C \rangle$. The elements of the tuple represent the source grid index, target grid index, order duration and price, respectively. We regard the set of orders in grid $j$ at time $t$ as the candidate actions of agent $i$. Since agents are homogeneous, agents in grid $j$ share the same action space. In practice, some regions sometimes contain no orders. Under the MARL setting, agents need to select orders at each timestep, so to ensure the feasibility and sustainability of the MDP, we artificially add virtual orders whose $G_{source} = G_{dest}$ and whose price $C$ is 0. When idle vehicles select these virtual orders, they stay where they are.

State Transition: An agent that serves an order migrates to the destination grid given by that order after $T$ time steps, where $T$ is the duration of the served order. The state of the agent is then updated to the newest state, namely the state of the destination grid.

Reward: The reward function is very important for reinforcement learning, since it largely determines the direction of optimization. Because the goal of learning is to find a solution that maximizes the ADI with a high ORR, we design a reward function that is proportional to the price of each order.
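The four definitions above can be pictured as plain data structures. The following Python sketch is illustrative only: the class names `GridState` and `Order`, their field names, the helpers `virtual_order`, `candidate_actions` and `reward`, the one-step duration of a virtual order and the `scale` constant are all assumptions made for this example and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GridState:
    """State shared by every agent in one grid (hypothetical field names)."""
    grid_index: int                       # grid index
    num_idle_vehicles: int                # number of idle vehicles in the grid
    num_valid_orders: int                 # number of valid orders in the grid
    dest_distribution: Tuple[float, ...]  # mean of the orders' destination vectors


@dataclass(frozen=True)
class Order:
    """One candidate action: an order that an idle vehicle may serve."""
    source_grid: int
    dest_grid: int
    duration: int   # time steps needed to serve the order
    price: float


def virtual_order(grid_index: int) -> Order:
    """Virtual order: same source and destination, price 0 ('stay put').
    The duration of one time step is an assumption of this sketch."""
    return Order(grid_index, grid_index, duration=1, price=0.0)


def candidate_actions(grid_index: int, orders: List[Order]) -> List[Order]:
    """Orders that start in the grid, plus one virtual order so that the
    action set of the grid is never empty."""
    local = [o for o in orders if o.source_grid == grid_index]
    return local + [virtual_order(grid_index)]


def reward(order: Order, scale: float = 1.0) -> float:
    """Reward proportional to the order price (scale is an assumed constant)."""
    return scale * order.price
```

With this encoding, a grid without real orders still exposes exactly one action, the virtual order, whose reward is zero.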
3.2. Non-stationary Action Space
A traditional deep Q-learning network accepts a state input and outputs a vector of Q-values whose dimension is equal to the dimension of the action space, i.e.,
(1) $\dim\left(Q(s, \mathcal{A})\right) = \dim(\mathcal{A})$
This is correct when the action space is fixed, but it is problematic in our settings. For a grid $j$, the orders produced at time $t$ are always different from the orders produced at other moments, so the action space is not consistent over the whole episode. It is therefore problematic to regard the orders as actions while ignoring the changing distribution of the action space. In our proposed method, we use the tuple $\langle s, a \rangle$ as the input of Q-learning and then evaluate all available state-order pairs.
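To make the state-action input concrete, the sketch below scores each state-order pair with a small MLP whose input is the concatenation of the state and order features, so the number of candidate orders may differ from one timestep to the next. It is a minimal pure-Python illustration: the function names, layer sizes, random weights and feature vectors are assumptions chosen for the example, not the authors' network.

```python
import random
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """Random weight matrix; the last entry of each row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def dense(weights: Matrix, x: Sequence[float], relu: bool) -> Vector:
    """Affine layer followed by an optional ReLU."""
    out = []
    for row in weights:
        z = sum(w * xi for w, xi in zip(row[:-1], x)) + row[-1]
        out.append(max(z, 0.0) if relu else z)
    return out


def q_value(params: List[Matrix], state: Sequence[float], action: Sequence[float]) -> float:
    """Q(s, a): state and action features are concatenated into one input."""
    h = list(state) + list(action)
    for layer in params[:-1]:
        h = dense(layer, h, relu=True)
    return dense(params[-1], h, relu=False)[0]


def evaluate_orders(params, state, orders) -> Vector:
    """Score every available state-order pair, for any number of orders."""
    return [q_value(params, state, a) for a in orders]


rng = random.Random(0)
STATE_DIM, ACTION_DIM, HIDDEN = 4, 4, 8  # illustrative sizes
params = [init_layer(STATE_DIM + ACTION_DIM, HIDDEN, rng),
          init_layer(HIDDEN, HIDDEN, rng),
          init_layer(HIDDEN, 1, rng)]
state = [3.0, 5.0, 2.0, 0.4]
orders_t = [[3.0, 7.0, 2.0, 12.5], [3.0, 3.0, 1.0, 0.0]]  # two orders at one timestep
orders_next = orders_t + [[3.0, 1.0, 4.0, 30.0]]          # three orders at another
print(len(evaluate_orders(params, state, orders_t)),
      len(evaluate_orders(params, state, orders_next)))
```

The same parameters serve both calls, which is the point of feeding the action in as an input rather than fixing one output unit per action.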
3.3. Action Selection Q-learning
For convenience, we refer to the Q-learning network with a state-action input as action selection Q-learning, shown in Figure 2.
For agent $i$, suppose there are $M$ available orders, which require $M$ state-action evaluations. In the case of $N$ agents, the computational complexity is $O(N \cdot M)$. To reduce this complexity, we use the parameter sharing and state sharing mentioned in the previous sections.
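From the perspective of agent $i$, let $s_t$ denote the state at time $t$ and $\mathcal{A}_t$ denote the set of orders, with $a_t \in \mathcal{A}_t$. Then the Bellman equation in our settings can be expressed as
(2) $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi(s_{t+1})}\left[ Q(s_{t+1}, a_{t+1}) \right] - Q(s_t, a_t) \right]$
where $\gamma \in [0, 1]$ is the discount factor and $\alpha$ is the step size. The value of the next timestep is an expectation over all available state-order pairs. When the policy $\pi$ is greedy, Eq. (2) reduces to the traditional Q-learning algorithm.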
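To balance exploitation and exploration, the Q-values related to the same order set are converted into a biased Boltzmann exploration strategy:
(3) $\pi(a_t^j \mid s_t) = \dfrac{\exp\left(Q(s_t, a_t^j)/\tau\right)}{\sum_{a_t^k \in \mathcal{A}_t} \exp\left(Q(s_t, a_t^k)/\tau\right)}$
where $\tau$ is the temperature that balances exploitation and exploration.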
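The Boltzmann exploration of Eq. (3) and the expected next-step value used in Eq. (2) can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function names and the default values of `gamma` and `alpha` are assumptions, while the default temperature of 1.0 follows the model settings reported in the experiments.

```python
import math
from typing import List


def boltzmann_policy(q_values: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over the Q-values of one candidate order set."""
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_next_value(next_q_values: List[float], temperature: float = 1.0) -> float:
    """Expectation of Q(s', a') under the Boltzmann policy over the next order set."""
    probs = boltzmann_policy(next_q_values, temperature)
    return sum(p * q for p, q in zip(probs, next_q_values))


def td_target(reward: float, next_q_values: List[float],
              gamma: float = 0.9, temperature: float = 1.0) -> float:
    """Bootstrapped target r + gamma * E[Q(s', a')] (gamma is an assumed value)."""
    return reward + gamma * expected_next_value(next_q_values, temperature)


def q_update(q_old: float, target: float, alpha: float = 0.1) -> float:
    """Move the old estimate towards the target with step size alpha."""
    return q_old + alpha * (target - q_old)
```

As the temperature approaches zero the policy becomes greedy and the expectation approaches the maximum Q-value, which matches the remark that a greedy policy recovers traditional Q-learning.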
3.4. KL Divergence Optimization
In multi-agent systems, the main approach to relieving or overcoming the non-stationarity problem is learning multi-agent communication (Hausknecht, 2016; Foerster et al., 2016b, a; Sukhbaatar et al., 2016), but most such methods require a fixed number of agents or observations from other agents before making decisions. In the order-dispatching case, explicit communication between agents is often time-consuming and difficult to adapt. In Figure 3, triangles in each grid represent orders and dots represent vehicles. The figure shows the order-dispatching process of each grid at time $t$. Different orders have different durations, so vehicles arrive at their destination grids at different times, and vehicles serving different orders are assigned to different grids; it is therefore hard to form continuous interactions and communication between vehicles. Communication also often suffers from high computational cost, especially in large-scale settings. For these reasons, we introduce a centralized training method using KL divergence optimization, which aims to optimize the agents' joint policy and to match the distribution of vehicles with the distribution of orders.
Notice that our proposed method has two goals: (1) maximize the long-horizon ADI; (2) optimize the order response rate. If there are always enough vehicles in a dispatching grid, it is easy to decrease the rate of idle vehicles and to improve the order response rate and the long-horizon ADI. However, we cannot control the distribution of orders. We therefore want to make the order and vehicle distributions as similar as possible by finding feasible order-vehicle matches. We do not require explicit cooperation or communication between agents, only an independent learning process with centralized KL divergence optimization.
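Suppose that at time $t$ the agents find a feasible order set $o'_t$ by executing their policies, namely,
(4) $o'_t = \{\, o_t^j \mid o_t^j \sim \pi_t(s^j) \,\}$
Our purpose is to find an optimal order set $o^*_t$. Focusing on a certain grid $j$, suppose that the policy $\pi_{\theta_j}$ at time $t$ is parameterized by $\theta_j$. After all policies have been executed, we obtain the newest distribution of vehicles $\mathcal{D}^v_{t+1}$ and the newest distribution of orders $\mathcal{D}^o_{t+1}$. The KL divergence from $\mathcal{D}^v_{t+1}$ to $\mathcal{D}^o_{t+1}$ measures the margin between the joint policy $\Pi$ at time $t$ and $\mathcal{D}^o_{t+1}$, so the KL optimization amounts to finding an optimal joint policy $\Pi^*$ with minimal margin:
(5) $\Pi^* = \arg\min_{\Pi} D_{KL}\left(\mathcal{D}^o_{t+1} \,\|\, \mathcal{D}^v_{t+1}(\Pi)\right)$
where $\Pi = \{\theta_j \mid j = 1, \dots, N\}$. For convenience, we write $D_{KL}$ for $D_{KL}\left(\mathcal{D}^o_{t+1} \,\|\, \mathcal{D}^v_{t+1}(\Pi)\right)$. We want to decrease the KL divergence from the distribution of vehicles to the distribution of orders in order to balance demand and supply at each order-dispatching grid. Formally, our KL policy optimization can be written as:
(6) $\min_{\theta} \mathcal{L} = \left\| Q_\theta(s, a) - Q^* \right\|_2^2$
(7) $\text{s.t.} \quad D_{KL} \le \beta$
where $\beta \in \mathbb{R}$. The objective function can then be expressed as
(8) $\min_{\theta} \mathcal{L} = \left\| Q_\theta(s, a) - Q^* \right\|_2^2 + \lambda D_{KL}$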
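where $Q^*$ is the target Q-value and $\lambda \in \mathbb{R}$ parameterizes the contribution of the KL term. To formulate the relationship between $\min \mathcal{L}$ and $\theta_j$, we first define some notation. There are $N$ grids in total. $n^i_{t+1}$ denotes the number of idle vehicles in grid $i$ at time step $t+1$, which can be formulated as $n^i_{t+1} = \sum_{j=1}^{N} c^j_t \cdot \pi_{j \to i}$, where $c^j_t$ is the number of idle drivers in grid $j$ at the previous time step $t$ and $\pi_{j \to i}$ is the probability of dispatching orders that go from grid $j$ to grid $i$ to idle vehicles at time $t$; these vehicles arrive at grid $i$ at time $t+1$. $\mathcal{D}^v_{t+1,j}$ is the rate of idle vehicles in grid $j$, which can be formulated as $\mathcal{D}^v_{t+1,j} = n^j_{t+1} / \sum_{k=1}^{N} n^k_{t+1}$. $\mathcal{D}^o_{t+1,i}$ is the rate of orders in grid $i$ at time $t+1$. Using the chain rule, we can decompose the gradient of $D_{KL}$ with respect to $\theta_j$ as
(9) $\nabla_{\theta_j} D_{KL} = \sum_{i=1}^{N} \dfrac{\partial D_{KL}}{\partial \pi_{j \to i}} \nabla_{\theta_j} \pi_{j \to i} = c^j_t \sum_{i=1}^{N} \left[ \dfrac{1}{N_{vehicle}} - \dfrac{\mathcal{D}^o_{t+1,i}}{n^i_{t+1}} \right] \nabla_{\theta_j} \pi_{j \to i}$
where $N_{vehicle} = \sum_{k=1}^{N} n^k_{t+1}$. The gradient of $\pi_{j \to i}$ with respect to $\theta_j$ is $\nabla_{\theta_j} \pi_{j \to i}$. Let $\delta = Q_\theta(s, a) - Q^*$; then the final gradient of $\mathcal{L}$ with respect to $\theta_j$ is calculated as
(10) $\nabla_{\theta_j} \mathcal{L} = 2\delta\, \nabla_{\theta_j} Q_\theta(s, a) + \lambda\, c^j_t \sum_{i=1}^{N} \left[ \dfrac{1}{N_{vehicle}} - \dfrac{\mathcal{D}^o_{t+1,i}}{n^i_{t+1}} \right] \nabla_{\theta_j} \pi_{j \to i}$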
For convenience, we give a summary for some important notations in Table 1.

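Notation  |  Description
$N$  |  the number of grids
$n^i_{t+1}$  |  the number of idle vehicles in grid $i$ at time $t+1$
$c^j_t$  |  the number of idle vehicles in grid $j$ at time $t$
$\pi_{j \to i}$  |  the probability of dispatching orders from grid $j$ to grid $i$ to idle vehicles at time $t$
$\mathcal{D}^v_{t+1,j}$  |  the rate of idle vehicles in grid $j$
$\mathcal{D}^o_{t+1,i}$  |  the rate of orders in grid $i$ at time $t+1$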
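The quantity that the KL term penalizes can be illustrated with a small numerical sketch: propagate the idle vehicles through the dispatch probabilities, normalize both counts into distributions, and compute the KL divergence between the order and vehicle distributions at the next step. All function names, the two-grid example and the `eps` smoothing constant are assumptions made for this illustration.

```python
import math
from typing import List


def next_idle_vehicles(idle_now: List[float], dispatch_prob: List[List[float]]) -> List[float]:
    """Idle vehicles per grid at t+1: n_i = sum_j c_j * pi[j][i], where
    pi[j][i] is the probability that an idle vehicle in grid j takes an
    order that ends in grid i."""
    n_grids = len(idle_now)
    return [sum(idle_now[j] * dispatch_prob[j][i] for j in range(n_grids))
            for i in range(n_grids)]


def normalize(counts: List[float]) -> List[float]:
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def supply_demand_kl(idle_now, dispatch_prob, orders_next) -> float:
    """KL divergence between the order and vehicle distributions at t+1."""
    vehicle_dist = normalize(next_idle_vehicles(idle_now, dispatch_prob))
    order_dist = normalize(orders_next)
    return kl_divergence(order_dist, vehicle_dist)


idle_now = [10.0, 0.0]               # all idle vehicles start in grid 0
orders_next = [5.0, 5.0]             # demand is split evenly at t+1
stay = [[1.0, 0.0], [0.0, 1.0]]      # every vehicle stays in its grid
spread = [[0.5, 0.5], [0.0, 1.0]]    # half of grid 0 moves to grid 1
print(supply_demand_kl(idle_now, stay, orders_next) >
      supply_demand_kl(idle_now, spread, orders_next))
```

In this toy case the `spread` policy matches supply to demand exactly, so its divergence is zero, while the `stay` policy leaves grid 1 without vehicles and is penalized heavily.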
4. Experiments
We examine the correctness of our model in a toy grid-based order-dispatching environment and its practicality using real-world data from three cities. Because of the constraints of a grid-based environment, we do not compare with order-dispatching algorithms based on coordinate systems. To compare with existing methods and to investigate the effectiveness of our method in terms of ORR and ADI, we select three typical algorithms as baselines, namely Independent Deep Q-learning Network (IL), Nearest Order Dispatching (NOD) and MDP. We first give a brief description of each.

IL: A variant of Double DQN (Van Hasselt et al., 2016) that takes a tuple of state and action as input. The only difference between IL and our method is that IL does not use KL optimization.

NOD: The Nearest-distance Order Dispatching (NOD) algorithm dispatches orders to idle vehicles based on the shortest distance. We use NOD as a baseline because it is a representative algorithm that is frequently used and easy to implement in practice. However, in our environment setting the regional division is chosen so that vehicles within the same dispatching region need not be distinguished by position (for example, in our later real-world data experiments, the size of each region guarantees a maximum order waiting time of 10 minutes). That is, matching orders by distance is equivalent to random matching in our environment setting.

MDP: Proposed by Xu et al. (2018), a planning and learning method based on decentralized multi-agent deep reinforcement learning and centralized combinatorial optimization.
To keep the comparison fair, we use the same reward function for all reinforcement learning methods.
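Table 2. ADI and ORR under low, medium and high order distribution divergence.
Method  |  ADI (Low)  |  ORR (Low)  |  ADI (Medium)  |  ORR (Medium)  |  ADI (High)  |  ORR (High)
IL  |  +12.5%  |  +6.94%  |  +11.5%  |  +6.3%  |  +6.68%  |  +2.32%
MDP  |  +14.5%  |  +8.94%  |  +13.3%  |  +6.69%  |  +7.28%  |  +3.42%
KL-Based  |  +25.12%  |  +13.40%  |  +20.94%  |  +7.89%  |  +13.47%  |  +4.61%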
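Table 3. ADI and ORR on the real-world data of three cities.
Method  |  ADI (City A)  |  ORR (City A)  |  ADI (City B)  |  ORR (City B)  |  ADI (City C)  |  ORR (City C)
IL  |  +4.69%  |  +1.68%  |  +2.96%  |  +1.11%  |  +4.72%  |  +2.05%
MDP  |  +5.80%  |  +1.89%  |  +3.69%  |  +2.63%  |  +5.98%  |  +2.14%
KL-Based  |  +6.46%  |  +3.07%  |  +4.94%  |  +3.30%  |  +6.12%  |  +3.01%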
4.1. Model Settings
Our model is an extension of Double DQN with soft update. All neural-network-based models in our experiments are implemented as MLPs with 2 hidden layers, and the activation function used for all of them is the rectified linear unit (ReLU). The replay buffer stores experience tuples. The elements of each tuple represent the state at time $t$, the action selected at time $t$, the action set at time $t$, the state at time $t+1$, the action set at time $t+1$ and the first gradient item of $D_{KL}$, respectively. The temperature is 1.0, discount factor and learning rate is .
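For readers unfamiliar with the two ingredients named above, the following sketch shows a soft (Polyak) target update and the standard Double DQN target of Van Hasselt et al. (2016). It is a generic illustration rather than the authors' implementation: the function names and the value of `tau` are assumptions, and parameters are represented as flat lists of floats for simplicity. Note that the Bellman equation of Section 3.3 uses an expectation over the next order set, whereas the standard Double DQN target shown here uses the action selected by the online network.

```python
from typing import List


def soft_update(target: List[float], online: List[float], tau: float = 0.01) -> List[float]:
    """Polyak averaging: target <- tau * online + (1 - tau) * target.
    The value of tau is an assumption of this sketch."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


def double_dqn_target(reward: float, gamma: float,
                      next_q_online: List[float], next_q_target: List[float]) -> float:
    """Standard Double DQN target: the online network selects the next
    action and the target network evaluates it."""
    best = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best]
```

A small `tau` keeps the target network close to its previous values, which stabilizes the bootstrapped targets during training.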
4.2. Particle Order-dispatching Experiment
The grid-based order-dispatching environment shown in Figure 4 is implemented on top of the multi-agent particle environment provided by Mordatch and Abbeel (2017). This toy environment abstracts real-world order-dispatching: one grid represents one dispatching region, and all orders have the same duration. The blue particles and the red particles represent vehicles and orders, respectively, and all of them are scattered in a grid world. Each red particle has a direction vector and a reward $r$; the direction vector points from the source grid to the target grid. In this environment, the order price is simplified to the Euclidean distance between grids, so the reward $r$ is proportional to the Euclidean distance between $G_{source}$ and $G_{dest}$, i.e.,
(11) $r \propto \left\| G_{source} - G_{dest} \right\|_2$
The blue particles obtain rewards by picking red particles in the same grid, and one blue particle can pick only one red particle at each timestep. Blue particles migrate to the target grids given by the picked red particles at the next timestep. In our settings, the time horizon is $T$. At the first timestep, we produce the reds and blues using specific distributions. At each subsequent timestep up to $T$, new red particles are born according to a specific distribution that depends on grid position and time, while no new blue particles are born after the first timestep, which means that all blue particle movements depend entirely on picking red particles. To match the real-world situation, the number of blues is smaller than the number of reds in our settings.
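The reward and movement rules of this toy environment can be sketched as follows. The `(row, column)` encoding of a grid cell, the function names and the `scale` constant are assumptions made for illustration; the actual environment is built on the multi-agent particle environment and is not reproduced here.

```python
import math
from typing import Optional, Tuple

Grid = Tuple[int, int]          # (row, column) of a grid cell; an assumed encoding
OrderPair = Tuple[Grid, Grid]   # (source grid, target grid) of a red particle


def order_reward(source: Grid, target: Grid, scale: float = 1.0) -> float:
    """Reward proportional to the Euclidean distance between two grid cells."""
    return scale * math.hypot(target[0] - source[0], target[1] - source[1])


def step_vehicle(position: Grid, picked_order: Optional[OrderPair]) -> Tuple[Grid, float]:
    """A blue particle that picks a red particle in its own grid moves to the
    target grid at the next timestep and collects the reward; a particle
    that picks nothing stays where it is and earns nothing."""
    if picked_order is None:
        return position, 0.0
    source, target = picked_order
    if source != position:
        raise ValueError("a vehicle can only pick orders in its own grid")
    return target, order_reward(source, target)
```

For example, a vehicle at cell `(0, 0)` that picks an order to cell `(3, 4)` moves there and receives a reward of 5 times `scale`.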
4.2.1. Influence of KL Divergence
To verify the feasibility of KL-divergence optimization, we adopt three cases that correspond to different degrees of order distribution change. The degree of order distribution change is the margin between two adjacent order distributions. In the particle order-dispatching environment, we generate orders from a given 2-dimensional Gaussian distribution $\mathcal{N}(\mu, \sigma)$ at each timestep. To quantify and explicitly compare the margin between order distributions at different timesteps, we change $\mu$ at each timestep. The degree of order distribution change is then equivalent to the distance between adjacent order distributions: a greater distance means a higher degree of change. In our experiment settings, the low, medium and high degrees of change correspond to distances of 1, 2 and 4 grids, respectively. Figure 5 shows an example of an order distribution change in which the distance between the two adjacent distributions is 8 grids. Since the destinations of orders are random, if we want vehicles to serve more orders at the next timestep, the algorithm needs to learn to pick suitable orders at the current decision stage so that vehicles are assigned to suitable grids, which ensures a better ORR at the next timestep. Intuitively, a higher long-term ORR corresponds to a higher long-term ADI. Table 2 shows the ORR and ADI at different degrees of order distribution change.
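One way to generate such shifting order distributions is sketched below: the mean of a 2-dimensional Gaussian moves by a fixed number of grids per timestep, and sampled order origins are snapped to grid cells. The function names, the movement along one axis with bouncing at the borders, the isotropic `sigma` and the grid size are all assumptions of this sketch; the paper does not specify these details here.

```python
import random
from typing import List, Tuple


def sample_orders(mean: Tuple[float, float], sigma: float, n_orders: int,
                  grid_size: int, rng: random.Random) -> List[Tuple[int, int]]:
    """Draw order origins from a 2-D Gaussian and snap them to grid cells."""
    cells = []
    for _ in range(n_orders):
        x = min(max(int(round(rng.gauss(mean[0], sigma))), 0), grid_size - 1)
        y = min(max(int(round(rng.gauss(mean[1], sigma))), 0), grid_size - 1)
        cells.append((x, y))
    return cells


def shifted_means(start: Tuple[float, float], shift: float, horizon: int,
                  grid_size: int) -> List[Tuple[float, float]]:
    """Move the Gaussian mean by `shift` grids per timestep along the x axis,
    reversing direction at the borders (the direction is an assumption)."""
    means, x, direction = [], start[0], 1.0
    for _ in range(horizon):
        means.append((x, start[1]))
        if not 0 <= x + direction * shift <= grid_size - 1:
            direction = -direction
        x += direction * shift
    return means
```

Calling `shifted_means` with `shift` set to 1, 2 or 4 would correspond to the low, medium and high degrees of change described above.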
4.2.2. Influence of $\lambda$
$\lambda$ plays an important role in our method, so it is necessary to investigate how it affects the performance at different degrees of order distribution change. In our experiments, the value of $\lambda$ ranges from 0.0 to 0.6 with a step of 0.05. When $\lambda = 0$, our method is equivalent to IL. Figure 6 shows the curves at different degrees of order distribution change.
4.2.3. Result Analysis
We train all algorithms for 300 episodes in the three cases, which correspond to different degrees of KL-divergence. We compare our method with the three baselines in terms of ADI and ORR. Table 2 shows the average results of 5 groups of experiments with different random seeds. The particle order-dispatching environment generates orders with random destinations, that is, long and short orders are equally likely to appear in a grid. Although choosing long orders means a higher ADI, the degrees of order distribution change we set require the algorithm to choose more non-longest orders, thus ensuring both a higher ORR and a higher long-term ADI. At all three degrees of order distribution change, our KL-based method outperforms all baselines on all metrics. This means our method can better offset the margin between order distributions.
Figure 6 shows the learning curves of $\lambda$ at different degrees of order distribution change. In the three cases of order distribution change, our method achieves the highest ORR at , , respectively. The results in Figure 6 also show that a higher ORR often corresponds to a higher ADI. In practice, when the degree of order distribution change is low, the algorithm needs to pick shorter orders to achieve a higher ORR, so that agents in the nearby regions can still serve more orders in the future. When the degree of order distribution change is high, the dispatching algorithm needs to act more greedily to achieve a higher ORR, that is, it prefers to pick longer orders. The result in Figure 5(c) also shows that our algorithm achieves better ORR and ADI than IL when the degree of order distribution change is high. Therefore, combining the results of Table 2 and Figure 6, our method can flexibly choose long or short orders.
4.3. Real World Data Experiments
4.3.1. Dispatching Simulator
Because our model assumes that the city is divided into many order-dispatching regions, we conduct experiments on an open-source grid-based environment simulator provided by Didi Chuxing (Lin et al., 2018). The simulator divides the city into hexagonal grids, the number of which depends on the size of the city. At each time $t$, the simulator provides a set of idle vehicles and a set of available orders. Each order is characterized by its origin, destination and duration, and vehicles in the same region share the same state. The travel distance between neighboring regions is approximately 2.2 km and the time interval is 10 min.
4.3.2. Data Description
The real-world datasets provided by Didi Chuxing include order information and vehicle trajectories for three cities over one month. The order information includes price, origin, destination and duration. The trajectories contain the positions (latitude and longitude) and status (online, offline, on-service) of the vehicles. We divide the three cities into 182, 126 and 112 hexagonal grids, respectively.
4.3.3. Result Analysis
We compare our model with the three baselines after 300 episodes of training. Table 3 lists the average results of 5 groups of experiments with different random seeds. The real datasets contain more changes in the order distribution. The results show that our method can still better capture the changes in order distribution and improve the ORR and ADI via order-vehicle distribution matching.
5. Deployment
Building on both the model setting in this paper and the online platform of Didi Chuxing, we design a hybrid system that incorporates other components, including route planning techniques (Tong et al., 2018) and estimated time of arrival (ETA) (Wang et al., 2018a), as illustrated in Figure 7.
As mentioned in Section 3, several assumptions prevent this model from being deployed in real-world settings: (i) vehicles in the same grid share the same setting, and this isomorphic setting ignores intra-grid information; (ii) this paper adopts a grid-world map to simplify the real-world environment, which replaces coordinate position information with grid information. To address these issues, we adapt the travel time estimation techniques proposed in (Wang et al., 2018a) and incorporate them into the Q-learning action selection described in Section 3.3. For example, the duration of each order in our model is regarded as an already known order feature. In the real-world scenario, however, each order's travel time is obtained via the ETA model; it is dynamic and depends on current traffic and route conditions. Since the ETA model takes coordinate position information into account, this hybrid system is able to deal with assumption (ii) and is feasible to deploy in the real world.
After obtaining the action value from the hybrid system, we extend the Matching System and the Routing System as illustrated in Figure 7. Specifically, in each time slot, the goal of the real-time order dispatch algorithm is to determine the best matching between vehicles and orders (see Figure 1) in the matching system and to plan a route for drivers to serve the orders. Formally, the principle of the Matching System can be formulated as:
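$$\arg\max_{a_{ij}} \sum_{i=0}^{m} \sum_{j=0}^{n} Q(i, j)\, a_{ij} \tag{12}$$
$$\text{s.t.} \quad \sum_{i=0}^{m} a_{ij} = 1,\ j = 1, 2, \dots, n; \qquad \sum_{j=0}^{n} a_{ij} = 1,\ i = 1, 2, \dots, m \tag{13}$$
where
$$a_{ij} = \begin{cases} 1, & \text{if order } j \text{ is assigned to driver } i, \\ 0, & \text{otherwise,} \end{cases}$$
and $i \in \{1, 2, \dots, m\}$ and $j \in \{1, 2, \dots, n\}$ index all idle drivers and available orders at each time step, respectively. The index $0$ denotes the placeholder for doing nothing (for a driver) or staying unserved (for an order). $Q(i, j)$ is the output of the hybrid system and represents the action value of driver $i$ performing the action of serving order $j$. Note that the constraints in Eq. (13) guarantee that, at each time step, each driver either selects one available real order or does nothing, while each order is either assigned to one driver or stays unserved.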
The Matching System used in Xu et al. (2018) and Wang et al. (2018b) is implemented using the Kuhn-Munkres (KM) algorithm (Munkres, 1957). In detail, they formulate Eq. (12) as a bipartite graph matching problem in which drivers and orders are represented as two sets of nodes. Each edge between an order and a driver is weighted by the corresponding action value, and the best matches are found using the KM algorithm. In contrast, we implement our method under assumption (i); that is, drivers in the same grid are indistinguishable. The KM algorithm therefore degenerates into a sorting algorithm, and we only need to select the top orders with the highest values.
Once the matching pairs of orders and vehicles have been selected by the matching system, we deliver these pairs, together with coordinate information, to the routing system. The routing system, equipped with route planning techniques (Tong et al., 2018), allows drivers to serve the orders. This process gives feedback, i.e., a reward, to the hybrid system and helps train the whole system to achieve better performance.
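The degeneration of matching into sorting can be checked on a toy instance. The sketch below is an illustration under stated assumptions rather than the deployed implementation: all drivers in a grid are taken to see the same value for a given order, all values are assumed non-negative (so doing nothing is never preferable), and the names `top_k_orders` and `brute_force_best_total` are hypothetical. A brute-force search over assignments stands in for the KM algorithm to confirm that picking the highest-valued orders attains the same total.

```python
from itertools import permutations


def top_k_orders(order_values, num_drivers):
    """Return the indices of the num_drivers highest-valued orders.

    Because drivers in one grid are interchangeable, it does not matter
    which driver takes which of the selected orders.
    """
    ranked = sorted(range(len(order_values)),
                    key=lambda j: order_values[j], reverse=True)
    return sorted(ranked[:num_drivers])


def brute_force_best_total(order_values, num_drivers):
    """Best total value over all assignments of distinct orders to drivers."""
    k = min(num_drivers, len(order_values))
    best = float("-inf")
    for chosen in permutations(range(len(order_values)), k):
        best = max(best, sum(order_values[j] for j in chosen))
    return best


# Illustrative values for five orders in one grid with three idle drivers.
order_values = [3.0, 7.5, 1.2, 5.1, 6.4]
num_drivers = 3

selected = top_k_orders(order_values, num_drivers)
selected_total = sum(order_values[j] for j in selected)
optimal_total = brute_force_best_total(order_values, num_drivers)
```

Sorting costs O(n log n) in the number of orders, whereas a general assignment solver such as KM is cubic in the problem size, which is the practical benefit of the shared-state assumption.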
6. Conclusions
In this paper, we proposed a multi-agent reinforcement learning method for order-dispatching via matching the distribution of orders and vehicles. Results on the three cases in the simulated order-dispatching environment demonstrate that our method achieves both higher ADI and higher ORR than the three baselines, namely one independent MARL method, one planning algorithm, and one rule-based algorithm, in various traffic environments. The experiments on real-world datasets also show that our model obtains higher ADI and ORR. Furthermore, our method is trained in a centralized manner and can be executed in a decentralized manner. In addition, we designed a deployment system for the model with the support of the existing platform of Didi Chuxing. In future work, we plan to deploy the model and run online tests through the designed deployment system.
Acknowledgments
We thank the National Natural Science Foundation of China for its support (61702327, 61772333, 61632017).
References
 Alshamsi et al. (2009) Aamena Alshamsi, Sherief Abdallah, and Iyad Rahwan. 2009. Multiagent self-organization for a taxi dispatch system. In 8th International Conference on Autonomous Agents and Multiagent Systems. 21–28.
 Billings et al. (1998) Darse Billings, Denis Papp, Jonathan Schaeffer, and Duane Szafron. 1998. Opponent modeling in poker. In AAAI/IAAI. 493–499.
 Busoniu et al. (2006) Lucian Busoniu, Robert Babuska, and Bart De Schutter. 2006. Multiagent reinforcement learning: A survey. In Control, Automation, Robotics and Vision, 2006. ICARCV’06. 9th International Conference on. IEEE, 1–6.
 Chadwick and Baron (2015) Stephen C Chadwick and Charles Baron. 2015. Context-aware distributive taxi cab dispatching. (March 19 2015). US Patent App. 14/125,549.
 Chung (2005) Lee Chean Chung. 2005. GPS taxi dispatch system based on A* shortest path algorithm. Master's thesis. Department of Transportation and Logistics, Malaysia University of Science and Technology (MUST).
 Foerster et al. (2016a) Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. 2016a. Learning to communicate with deep multiagent reinforcement learning. In Advances in Neural Information Processing Systems. 2137–2145.
 Foerster et al. (2016b) Jakob N Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. 2016b. Learning to communicate to solve riddles with deep distributed recurrent Q-networks. arXiv preprint arXiv:1602.02672 (2016).
 Gupta et al. (2017) Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. 2017. Cooperative multiagent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems. Springer, 66–83.
 Hausknecht (2016) Matthew John Hausknecht. 2016. Cooperation and communication in multiagent deep reinforcement learning. Ph.D. Dissertation.
 Hu et al. (1998) Junling Hu, Michael P Wellman, et al. 1998. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, Vol. 98. Citeseer, 242–250.
 Lee et al. (2004) Der-Horng Lee, Hao Wang, Ruey Cheu, and Siew Teo. 2004. Taxi dispatch system based on current demands and real-time traffic conditions. Transportation Research Record: Journal of the Transportation Research Board 1882 (2004), 193–200.
 Li et al. (2011) Bin Li, Daqing Zhang, Lin Sun, Chao Chen, Shijian Li, Guande Qi, and Qiang Yang. 2011. Hunting or waiting? Discovering passenger-finding strategies from a large-scale real-world taxi dataset. In Pervasive Computing and Communications Workshops (PERCOM Workshops), 2011 IEEE International Conference on. IEEE, 63–68.
 Li et al. (2019) Minne Li, Yan Jiao, Yaodong Yang, Zhichen Gong, Jun Wang, Chenxi Wang, Guobin Wu, Jieping Ye, et al. 2019. Efficient Ridesharing Order Dispatching with Mean Field Multi-Agent Reinforcement Learning. arXiv preprint arXiv:1901.11454 (2019).
 Liao (2001) Ziqi Liao. 2001. Taxi dispatching via global positioning systems. IEEE Transactions on Engineering Management 48, 3 (2001), 342–347.
 Liao (2003) Ziqi Liao. 2003. Real-time taxi dispatching using global positioning systems. Commun. ACM 46, 5 (2003), 81–83.
 Lin et al. (2018) Kaixiang Lin, Renyu Zhao, Zhe Xu, and Jiayu Zhou. 2018. Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning. arXiv preprint arXiv:1802.06444 (2018).
 Lowe et al. (2017) Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems. 6379–6390.
 Lynch (2009) Gary S Lynch. 2009. Single point of failure: The 10 essential laws of supply chain risk management. John Wiley & Sons.
 Miao et al. (2016) Fei Miao, Shuo Han, Shan Lin, John A Stankovic, Desheng Zhang, Sirajum Munir, Hua Huang, Tian He, and George J Pappas. 2016. Taxi dispatch with real-time sensing data in metropolitan areas: A receding horizon control approach. IEEE Transactions on Automation Science and Engineering 13, 2 (2016), 463–478.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
 Mordatch and Abbeel (2017) Igor Mordatch and Pieter Abbeel. 2017. Emergence of Grounded Compositional Language in Multi-Agent Populations. arXiv preprint arXiv:1703.04908 (2017).
 Munkres (1957) James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics 5, 1 (1957), 32–38.
 Myr (2013) David Myr. 2013. Automatic optimal taxicab mobile location based dispatching system. (May 14 2013). US Patent 8,442,848.
 Oda and Tachibana (2018) Takuma Oda and Yulia Tachibana. 2018. Distributed Fleet Control with Maximum Entropy Deep Reinforcement Learning. (2018).
 Papadimitriou and Steiglitz (1998) Christos H Papadimitriou and Kenneth Steiglitz. 1998. Combinatorial optimization: algorithms and complexity. Courier Corporation.
 Schadd et al. (2007) Frederik Schadd, Sander Bakkes, and Pieter Spronck. 2007. Opponent Modeling in Real-Time Strategy Games. In GAMEON. 61–70.
 Seow et al. (2010) Kiam Tian Seow, Nam Hai Dang, and Der-Horng Lee. 2010. A collaborative multiagent taxi-dispatch system. IEEE Transactions on Automation Science and Engineering 7, 3 (2010), 607–616.
 Shapley (1953) Lloyd S Shapley. 1953. Stochastic games. Proceedings of the national academy of sciences 39, 10 (1953), 1095–1100.
 Spaan (2012) Matthijs TJ Spaan. 2012. Partially observable Markov decision processes. In Reinforcement Learning. Springer, 387–414.
 Sukhbaatar et al. (2016) Sainbayar Sukhbaatar, Rob Fergus, et al. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems. 2244–2252.
 Tesauro (2004) Gerald Tesauro. 2004. Extending Q-learning to general adaptive multiagent systems. In Advances in Neural Information Processing Systems. 871–878.
 Tong et al. (2018) Yongxin Tong, Yuxiang Zeng, Zimu Zhou, Lei Chen, Jieping Ye, and Ke Xu. 2018. A unified approach to route planning for shared mobility. Proceedings of the VLDB Endowment 11, 11 (2018), 1633–1646.
 Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep Reinforcement Learning with Double Q-Learning. In AAAI, Vol. 2. Phoenix, AZ, 5.
 Wang et al. (2018a) Zheng Wang, Kun Fu, and Jieping Ye. 2018a. Learning to estimate the travel time. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 858–866.
 Wang et al. (2018b) Zhaodong Wang, Zhiwei Qin, Xiaocheng Tang, Jieping Ye, and Hongtu Zhu. 2018b. Deep Reinforcement Learning with Knowledge Transfer for Online Rides Order Dispatching. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 617–626.
 Wei et al. (2018) Chong Wei, Yinhu Wang, Xuedong Yan, and Chunfu Shao. 2018. Look-Ahead Insertion Policy for a Shared-Taxi System Based on Reinforcement Learning. IEEE Access 6 (2018), 5716–5726.
 Xu et al. (2018) Zhe Xu, Zhixin Li, Qingwen Guan, Dingshui Zhang, Qiang Li, Junxiao Nan, Chunyang Liu, Wei Bian, and Jieping Ye. 2018. Large-Scale Order Dispatch in On-Demand Ride-Hailing Platforms: A Learning and Planning Approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 905–913.
 Yang et al. (2018) Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. 2018. Mean Field Multi-Agent Reinforcement Learning. arXiv preprint arXiv:1802.05438 (2018).
 Zhang et al. (2017) Lingyu Zhang, Tao Hu, Yue Min, Guobin Wu, Junying Zhang, Pengcheng Feng, Pinghua Gong, and Jieping Ye. 2017. A taxi order dispatch model based on combinatorial optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2151–2159.
 Zheng et al. (2017) Lianmin Zheng, Jiacheng Yang, Han Cai, Weinan Zhang, Jun Wang, and Yong Yu. 2017. MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence. NIPS Demo (2017).