Reward Design for Driver Repositioning Using Multi-Agent Reinforcement Learning
Abstract
A large portion of passenger requests is reportedly unserviced, partially due to vacant for-hire drivers’ cruising behavior during the passenger seeking process. This paper aims to model the multi-driver repositioning task through a mean field multi-agent reinforcement learning (MARL) approach that captures competition among multiple agents. Because the direct application of MARL to the multi-driver system under a given reward mechanism will likely yield a suboptimal equilibrium due to the selfishness of drivers, this study proposes a reward design scheme with which a more desired equilibrium can be reached. To effectively solve the bilevel optimization problem with the upper level as the reward design and the lower level as a multi-agent system, a Bayesian optimization (BO) algorithm is adopted to speed up the learning process. We then apply the bilevel optimization model to two case studies, namely, e-hailing driver repositioning under service charge and multi-class taxi driver repositioning under NYC congestion pricing. In the first case study, the model is validated by the agreement between the derived optimal control from BO and that from an analytical solution. With a simple piecewise linear service charge, the objective of the e-hailing platform can be increased by . In the second case study, an optimal toll charge of is solved using BO, which improves the objective of city planners by , compared to that without any toll charge. Under this optimal toll charge, the number of taxis in the NYC central business district is decreased, indicating a better traffic condition, without substantially increasing the crowdedness of the subway system.
keywords:
Mean Field Multi-Agent Reinforcement Learning, Reward Design, Bayesian Optimization
1 Introduction
The emergence of transportation network companies (TNCs) or e-hailing platforms (such as Didi and Uber) has revolutionized the traditional taxi market and provided commuters a flexible-route door-to-door mobility service. Nonetheless, it is reported that a large portion of the passenger requests remain unserviced because of the imbalance between demand (i.e., passenger requests) and supply (i.e., available drivers) (Lin et al., 2018), resulting in long cruising trips for taxi drivers to find the next passenger (Powell et al., 2011; Di and Ban, 2019). Such cruising behavior has a negative impact on the urban economy by not only decreasing drivers’ income but also generating additional vehicle miles traveled. Thus, repositioning available drivers to locations with high near-future demand, i.e., balancing supply and demand, becomes the key challenge faced by the taxi and for-hire market, including e-hailing platforms. Leveraging cutting-edge machine learning techniques, this paper aims to improve the efficiency of the taxi and for-hire market.
The essence of the repositioning task is to provide recommendations to idle drivers on where to find the next passenger. Some recommender systems have been proposed for drivers (Ge et al., 2010; Hwang et al., 2015; Yuan et al., 2011; Qu et al., 2014). These studies extracted useful aggregated statistical quantities such as taxi demand and travel time from historical data and recommended a next cruising location (Ge et al., 2010), a sequence of potential pickup points (Hwang et al., 2015), a driving route (Qu et al., 2014), or a route and a location (Yuan et al., 2011).
Although the aforementioned studies provide effective recommendations of the next cruising route or location to drivers at the immediate next step, they are nearsighted and fall short of capturing the future long-run payoffs. To capture the effect of future rewards on the recommendation at the immediate next step, various Markov decision process (MDP) based approaches have been proposed to model idle drivers’ passenger searching process (Rong et al., 2016; Zhou et al., 2018; Verma et al., 2017; Gao et al., 2018; Yu et al., 2019; Shou et al., 2020). In an MDP with a single agent, a driver is the agent who decides where to go next. The dynamic environment is determined by the stochastic passenger requests and all other traffic information including the road network, distribution of drivers, and traffic conditions. Once the agent takes an action in a state, the agent transitions into a new state and receives an immediate reward by following the dynamics of the environment. The agent aims to derive an optimal policy which maximizes her expected cumulative reward. When the dynamic environment is known to the agent, dynamic programming or value iteration can be used to solve the MDP and derive an optimal policy. When the dynamic environment is unknown to the agent, the agent needs to interact with the environment through trial and error and gradually learn an optimal policy using reinforcement learning (RL) algorithms such as Q-learning and temporal difference learning (Sutton and Barto, 1998).
The competition among multiple agents is, however, neglected in the aforementioned MDP models due to their single-agent setting, resulting in overly optimistic optimal policies. In other words, one agent cannot earn the full amount of the expected reward by following the policy derived in the single-agent setting. In a dynamic environment involving a group of agents, multiple agents interact with both the shared environment and other agents. Multi-agent reinforcement learning (MARL) (Buşoniu et al., 2010) thus fits naturally in this multi-agent system (MAS). Recently, MARL has been attracting significant attention due to its success in tackling high dimensional and complicated tasks such as playing the game of Go (Silver et al., 2016, 2017), Poker (Brown and Sandholm, 2018, 2019), Dota 2 (OpenAI, 2018), and StarCraft II (Vinyals et al., 2019).
MARL tasks can be broadly grouped into three categories, namely, fully cooperative, fully competitive, and a mix of the two, depending on different applications (Zhang et al., 2019): (1) In the fully cooperative setting, agents collaborate with each other to optimize a common goal; (2) In the fully competitive setting, agents have competing goals, and the return of agents sums up to zero; (3) The mixed setting is more like a general-sum game where each agent cooperates with some agents while competing with others. For instance, in the video game Pong, an agent is expected to be either fully competitive if its goal is to beat its opponent or fully cooperative if its goal is to keep the ball in the game as long as possible (Tampuu et al., 2017). A progression from fully competitive to fully cooperative behavior of agents was also presented in Tampuu et al. (2017) by simply adjusting the reward.
A key challenge arises in MARL when independent agents have no knowledge of other agents: the theoretical convergence guarantee is no longer applicable since the environment is no longer Markovian and stationary (Matignon et al., 2012; Nguyen et al., 2018). To tackle this issue, one way is to exchange some information among agents. In some contexts, agents actually exchange information with their peers through some coordination. For example, in the game of a team of hunters capturing a team of prey, Tan (1993) proposed multiple ways to enable coordination among agents and concluded that the performance of the hunter agents can be improved through some coordination. However, in other contexts such as the driver repositioning system, agents only have access to their own information. Thus, information exchange among agents involves a central controller which collects the information of all agents and disseminates it to agents. Agents update their value functions and policies based on the provided information from the central controller and their local observations. This is the centralized learning (i.e., based on global information) and decentralized execution (i.e., based on local observation) paradigm, which has become increasingly popular in recent research (Foerster et al., 2016; Lowe et al., 2017; Lin et al., 2018; Li et al., 2019).
While training is stabilized by conditioning on the information of other agents such as the joint state and joint action in the centralized training paradigm, scalability becomes a critical issue in MARL because the joint state space and joint action space grow exponentially with the number of agents. To make MARL tractable when a large number of agents coexist, Yang et al. (2018) employed mean field theory to simplify the interaction among agents. The basic idea is, from the perspective of an agent, to treat other agents as a mean agent. Thus, the complexity of interactions among a large number of agents is substantially eased by reducing the dimension of the input to the Q-value function. Large-scale MARL with hundreds or even thousands of agents becomes solvable. To investigate the large-scale order dispatching problem where thousands of agents are present, Li et al. (2019) adopted a mean field approximation and proposed to take the average response from neighboring agents as a proxy of the interaction between the agent and other agents.
Recent studies have successfully applied MARL to multi-driver repositioning and large-scale order dispatching problems (Lin et al., 2018; Li et al., 2019; Zhou et al., 2019). Different from treating each driver as an agent in previous studies, Jin et al. (2019) treated each spatial grid as a worker agent and each region composed of several spatial grids as a manager agent and adopted hierarchical reinforcement learning to tackle the joint task of order dispatching and fleet management. All these studies rely on an underlying assumption that drivers are willing to cooperate under a specifically crafted reward function. For example, embedding the goal of the platform such as improving the gross merchandise volume (GMV) or the order response rate (ORR) into the reward function of a driver encourages cooperation among drivers. Human drivers are, however, selfish in nature and will only cooperate if the overall return from cooperation is higher than that from competition. This self-interested behavior can be utilized to achieve a certain degree of cooperation among agents, for example by adjusting the reward for each agent. However, when the imposed reward function (Lin et al., 2018; Li et al., 2019; Zhou et al., 2019; Jin et al., 2019) is not aligned with the goal of real drivers (e.g., a real driver’s goal can simply be maximizing her monetary return), drivers will not follow the derived optimal policy. Thus, in this work, instead of enforcing a reward function for drivers to cooperate, drivers are regarded as selfish and non-cooperative, and the reward for a driver is simply the monetary return that the driver earns.
Although the approaches in Lin et al. (2018); Li et al. (2019); Zhou et al. (2019); Jin et al. (2019) are efficient under a given reward function, the reached equilibrium is very likely to be suboptimal from the overall perspective of the system. Congestion pricing is a common way to drive the traffic system performance towards a system optimum in transportation network design problems (Yang and H. Bell, 1998; Zhang and Yang, 2004; Meng and Liu, 2012; Di et al., 2014, 2016, 2018). In this paper, we show that by integrating a reward design mechanism which adjusts the monetary return that a driver earns, a desirable equilibrium can be reached in this intrinsically large-scale non-cooperative system. The desirable equilibrium refers to a Nash equilibrium in which each independent and selfish agent’s strategy is the best response to other agents’ strategies and which produces better overall performance of the system. Mguni et al. (2019) proposed a two-layer architecture with an incentive designer as the upper layer and a potential game as the lower layer and formulated the incentive designer’s problem as an optimization problem. In contrast, the MARL problem in our context may not be transformable into a potential game, complicating computation of its equilibrium.
In summary, the major contributions of this paper are as follows: (1) With the lower level as the MAS and the upper level as the reward design, this paper formulates a bilevel optimization problem in which a mean field actor-critic algorithm is developed to solve the MAS and a Bayesian optimization algorithm is adopted to efficiently solve the problem. (2) Instead of intentionally crafting a reward function, which aligns with the goal of the platform but may not reflect the intrinsic reward of real drivers, this paper takes the monetary return of a driver as the reward function. It aims to improve the performance of the platform by adjusting the monetary return that one driver can earn through a reward design mechanism of the platform (e.g., platform service charge and incentives). (3) In the case study of taxi driver repositioning under congestion pricing, a multi-class MARL is developed to capture the intrinsic behavioral difference between yellow taxis and green taxis.
The remainder of the paper is organized as follows. Section (2) introduces the single-agent actor-critic algorithm, which is a stepping stone for MARL. Section (3) presents the mean field multi-agent reinforcement learning algorithm. Section (4) presents a reward design mechanism and formulates a bilevel optimization problem. Section (5) presents the result and validates the effectiveness of the proposed reward design. Section (6) concludes.
2 Single-agent reinforcement learning
As a stepping stone, we first introduce single-agent reinforcement learning, where only one agent interacts with the environment.
2.1 Problem definition
A Markov decision process (MDP) (Puterman, 1994) is typically specified by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ stands for the allowable actions, $\mathcal{R}$ collects rewards, $\mathcal{P}$ denotes a state transition probability from one state to another, and $\gamma$ is a discount factor. A general MDP proceeds simply as follows. Starting from the initial state, the agent specifies an action $a \in \mathcal{A}$ whenever the agent is in a state $s \in \mathcal{S}$. The agent then transitions into a new state $s'$ with probability $\mathcal{P}(s' \mid s, a)$ and observes an immediate reward $r$ by obeying the dynamics of the environment. Then the process repeats until a terminal state is reached. A policy $\pi$ simply maps from state $s$ to the probability of taking action $a$ in state $s$, i.e., $\pi(a \mid s)$. The goal of solving an MDP is to derive an optimal policy so that the agent can maximize her long-term expected reward by following the policy. In reinforcement learning problems, the transition probability matrix $\mathcal{P}$ is commonly unknown, and the agent learns about $\mathcal{P}$ from its interaction with the environment.
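Denote $V^{\pi}(s)$ as the state value, which is the expected cumulative reward that an agent can earn by starting from state $s$ and following a policy $\pi$. $V^{\pi}(s)$ can be recursively given as (Sutton and Barto, 1998) $V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ r + \gamma V^{\pi}(s') \right]$. Denote $Q^{\pi}(s, a)$ as the state-action value, which is the expected cumulative reward that an agent can earn by starting from state $s$, taking action $a$, and following a policy $\pi$. $Q^{\pi}(s, a)$ is related to $V^{\pi}(s)$ through $V^{\pi}(s) = \sum_{a} \pi(a \mid s) Q^{\pi}(s, a)$.
The optimal value can then be written as $V^{*}(s) = \max_{\pi} V^{\pi}(s)$. The Bellman optimality equation is given as (Sutton and Barto, 1998):
$$V^{*}(s) = \max_{a} Q^{*}(s, a),$$
where the optimal state-action value is $Q^{*}(s, a) = \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ r + \gamma V^{*}(s') \right]$.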
Our task is then to derive an optimal policy (i.e., to solve the MDP) with which the agent can optimize its expected cumulative reward.
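When the dynamics are known, the Bellman optimality equation can be solved by value iteration. The following Python snippet is a minimal sketch on a hypothetical two-state MDP; the transition table `P`, the rewards, and the discount factor `GAMMA` are illustrative assumptions and are not taken from this paper.

```python
# Value iteration on a toy MDP (all numbers are illustrative assumptions).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(0.8, 1, 5.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 1.0)], "move": [(1.0, 0, 0.0)]},
}
GAMMA = 0.9  # assumed discount factor


def q_value(V, s, a):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10, max_iter=10_000):
    """Iterate V(s) <- max_a Q(s, a) until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iter):
        new_V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
    return V, policy


V_star, pi_star = value_iteration()
```

In reinforcement learning the table `P` is not available, which is why the methods in the next subsection estimate values from sampled transitions instead.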
To demonstrate how to apply MDPs to the context of e-hailing driver repositioning, we will use examples on a 2-by-2 grid world throughout the paper whenever a model is introduced.
Example 2.0.
(Single-Agent). The single-agent driver repositioning is presented in Figure (1). We adopt a grid world setup where the index of each grid (denoted as ) is shown at the upper left corner. The taxi icon denotes the driver, and the person icon is the passenger request with the corresponding fare shown above. The time beneath the driver and the passenger request records the current time of the driver and the appearance time of the passenger request, respectively. The dashed line with arrow shows the origin and destination of the passenger request.
S. The state of the driver consists of two components, namely, the grid index and current time , i.e., . For instance, the current state of the driver is in this example.
A. The allowable action of the driver is either moving into one of the neighboring grids or staying within the current grid. To be concise, we use the index of grid where the driver chooses to enter as the action. Suppose the driver decides to go rightward in the example, then we can denote . We further assume it takes the driver one time step to enter grid . In other words, the current time of the driver is when the driver arrives in grid .
P. Suppose the driver arrives in grid at time , and at the same time a passenger request appears in grid with probability. If this driver is matched to the passenger and picks up the passenger, the driver will transition to the passenger’s destination, which is grid . Denote the transition time from grid to grid as . We can define the new state . Then the transition probability from the state at time to the state at time is , mathematically, . If there is no passenger request in grid at time , then the driver ends up in state . The transition probability becomes .
R. If we take the fare of the fulfilled passenger request as the reward, in the example. Based on the received reward at this step and the future cumulative reward, the driver chooses an action in the new state , and the state transition process repeats until a terminal state (i.e., where is a predefined ending time, say, the end of the driver’s work time) is reached. ∎
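The S, A, P, R components of this example can be sketched as a single transition function. The snippet below is only an illustration: the grid layout `NEIGHBORS`, the ending time `T_END`, and the request used in the usage lines (probability, destination, travel time, fare) are hypothetical values, not the ones shown in Figure (1).

```python
import random

# Hypothetical 2-by-2 grid with indices laid out as
#   0 1
#   2 3
# Each entry lists the current grid (staying) plus its adjacent grids.
NEIGHBORS = {0: [0, 1, 2], 1: [1, 0, 3], 2: [2, 0, 3], 3: [3, 1, 2]}
T_END = 10  # assumed end of the driver's work time


def step(state, action, requests, rng):
    """One transition of the single-driver MDP.

    state    -- (grid, time)
    action   -- index of the grid the driver enters (takes one time step)
    requests -- maps (grid, time) to (probability, destination, travel_time, fare)
    rng      -- random.Random instance used to sample the request appearance
    """
    grid, t = state
    if action not in NEIGHBORS[grid]:
        raise ValueError("action must be the current grid or an adjacent grid")
    t += 1  # entering the chosen grid takes one time step
    prob, dest, travel_time, fare = requests.get((action, t), (0.0, action, 0, 0.0))
    if rng.random() < prob:
        # Matched: the driver is carried to the passenger's destination.
        next_state, reward = (dest, t + travel_time), fare
    else:
        # No request: the driver idles in the grid just entered.
        next_state, reward = (action, t), 0.0
    done = next_state[1] >= T_END
    return next_state, reward, done


# Usage with one certain request in grid 1 at time 1 (made-up numbers).
example_requests = {(1, 1): (1.0, 3, 2, 10.0)}
matched = step((0, 0), 1, example_requests, random.Random(0))
idle = step((0, 0), 2, example_requests, random.Random(0))
```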
2.2 Actor-Critic method
To solve for optimal policies, there are two types of methods, namely, value based or critic-only methods and policy based or actor-only methods. Value based and policy based methods are commonly used terminologies, but from now on we will use critic-only and actor-only methods for the purpose of introducing the actor-critic method.
Critic-only methods aim to output the optimal policy through optimizing the state-action value $Q(s, a)$ or the state value $V(s)$. Actor-only methods directly output an optimal policy without resorting to stored value functions $Q(s, a)$ or $V(s)$ as an intermediary. Both methods have pros and cons. Critic-only methods enjoy a low variance in the estimate of the state-action value but may lack guarantees on the optimality or near-optimality of the resulting policy if an optimal policy cannot be easily solved from value functions. Actor-only methods work well on continuous and large action spaces but may suffer from high fluctuation in policies (Konda and Tsitsiklis, 2003; Grondman et al., 2012). To overcome the shortcomings of these methods, actor-critic methods are developed to combine strengths of both methods (Konda and Tsitsiklis, 2003).
Figure (2) presents the architecture of the actor-critic algorithm. One agent, who has an actor and a critic, interacts with the environment. The agent observes its state $s$ from the environment and inputs $s$ to the actor that outputs the policy, i.e., a probability distribution over all possible actions. The agent samples an action $a$ from the probability distribution and takes action $a$ in the environment. Then the agent observes a state transition $s \to s'$ and receives a reward $r$ from the environment. Based on the one-step transition $s \to s'$ as well as action $a$ and reward $r$, the agent updates its critic. With the updated Q-value $Q(s, a)$, the agent updates its actor using policy gradient. Now we detail the critic and the actor in turn.
Critic. The critic takes as input state $s$ and action $a$ and outputs the Q-value $Q(s, a)$. Q-learning, the most commonly used algorithm for this purpose, updates the Q-value based on the state transition $s \to s'$ with reward $r$ by
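$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \tag{1}$$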
where $\alpha$ is the learning rate and . If $\alpha$ decreases properly over time, the Q-learning update converges (Sutton and Barto, 1998). Equation (1), however, is only applicable to a finite and discrete state and action space. In other words, one needs to maintain a Q table with all possible combinations of $s$ and $a$, which is not tractable for a continuous and large state and action space. Therefore we need a function approximation to the original Q-value. A deep neural network, i.e., a deep Q network (DQN), is one of the most popular value approximators (Mnih et al., 2015). Denote a deep neural network parameterized by $\phi$ as $Q_{\phi}(s, a)$, to approximate $Q(s, a)$. DQN updates its parameter $\phi$ by minimizing the loss
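$$L(\phi) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q_{\phi}(s', a') - Q_{\phi}(s, a) \right)^{2} \right] \tag{2}$$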
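This problem can be solved by the gradient descent method, whose gradient is straightforward to compute as follows: $\nabla_{\phi} L(\phi) = -2\, \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q_{\phi}(s', a') - Q_{\phi}(s, a) \right) \nabla_{\phi} Q_{\phi}(s, a) \right]$, where the gradient is not taken with respect to the target $r + \gamma \max_{a'} Q_{\phi}(s', a')$.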
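For a finite state and action space, the tabular Q-learning update described above can be sketched in a few lines of Python. The state and action names, the learning rate, and the discount factor below are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error.

    Q is a mapping from (state, action) pairs to values; missing entries
    default to 0.0 when Q is a defaultdict(float).
    """
    # The bootstrap target uses the greedy value of the next state.
    target = r if terminal else r + gamma * max(Q[(s_next, b)] for b in actions)
    td_error = target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Usage: a single observed transition (0, "move") -> 1 with reward 5.
Q = defaultdict(float)
ACTIONS = ["stay", "move"]
delta = q_learning_update(Q, s=0, a="move", r=5.0, s_next=1, actions=ACTIONS)
```

A DQN replaces the table `Q` with a parameterized network and takes a gradient step on the squared TD error instead of the in-place update.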
Actor. The actor takes as input state $s$ and outputs a probability distribution over all allowable actions in this state. Similarly to how we use a value network to approximate the Q-value, we can also use a deep neural network, i.e., a policy network, to approximate the policy $\pi$. Denote the policy network parameterized by $\theta$ as $\pi_{\theta}$. The goal of the actor is to maximize its expected cumulative reward, denoted as $J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ \sum_{t} \gamma^{t} r_{t} \right]$, where $r_{t}$ is the reward the actor receives at time $t$. Solving for the optimal policy of the actor requires the gradient of $J(\theta)$. This policy gradient is complicated to derive and is given as (Sutton et al., 1999)
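$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s) \left( Q^{\pi_{\theta}}(s, a) - b(s) \right) \right] \tag{3}$$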
where $Q^{\pi_{\theta}}(s, a)$ denotes the Q-value function following the policy $\pi_{\theta}$, $b(s)$ is some baseline (e.g., $b(s) = V^{\pi_{\theta}}(s)$, i.e., the value function following the policy $\pi_{\theta}$), and $A(s, a) = Q^{\pi_{\theta}}(s, a) - b(s)$ is called the advantage of a taken action $a$, a measure of the goodness of an action. If it is greater than zero, the taken action is generally good; otherwise it may be bad. Naturally, the underlying rationale in computing the policy gradient defined in Equation (3) is to update the policy distribution to concentrate on potentially good action(s). When the chosen action $a$ leads to a positive advantage, i.e., $A(s, a) > 0$, the policy is updated in the direction of favoring action $a$. When the advantage is negative for action $a$, the policy is updated in the direction against action $a$.
To summarize, in addition to the policy network $\pi_{\theta}$, the actor-critic algorithm also maintains a value network $Q_{\phi}$ so that the calculation of the gradient of the policy in Equation (3) directly uses the Q-function approximator $Q_{\phi}(s, a)$, to ensure stability of the policy update. The actor-critic algorithm simultaneously updates the critic (by minimizing the loss given in Equation (2)) and the actor (by the gradient given in Equation (3)) as more samples are fed in.
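The interplay between the critic update and the advantage-weighted actor update can be illustrated with a tabular sketch. This is not the deep network implementation described above: it assumes a softmax policy over per-state action preferences `h`, a table `Q` as the critic, and the baseline $b(s) = \sum_{a} \pi(a \mid s) Q(s, a)$; the step sizes are arbitrary.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(h, Q, s, a, r, s_next,
                      lr_actor=0.1, lr_critic=0.1, gamma=0.9):
    """One actor-critic update; h[s] and Q[s] are lists indexed by action."""
    # Critic: move Q(s, a) towards the one-step Q-learning target.
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += lr_critic * (td_target - Q[s][a])

    # Advantage of the taken action relative to the baseline b(s).
    pi = softmax(h[s])
    baseline = sum(p * q for p, q in zip(pi, Q[s]))
    advantage = Q[s][a] - baseline

    # Actor: for a softmax policy, d log pi(a|s) / d h[s][b] = 1[a == b] - pi(b|s).
    for b in range(len(h[s])):
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        h[s][b] += lr_actor * advantage * grad_log
    return advantage


# Usage: two states, two actions, one rewarded transition.
h = [[0.0, 0.0], [0.0, 0.0]]
Q = [[0.0, 0.0], [0.0, 0.0]]
adv = actor_critic_step(h, Q, s=0, a=1, r=1.0, s_next=1)
```

Because the advantage of the rewarded action is positive, its preference increases and the preference of the other action decreases, which is the behavior described for Equation (3).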
3 Multi-agent reinforcement learning
For a real-world problem with multiple agents, the aforementioned single-agent reinforcement learning falls short of capturing the coupling effects or the competition among multiple agents. In this section, we introduce a mean field multi-agent reinforcement learning approach to model the multi-driver repositioning task.
3.1 Problem definition
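The multi-agent problem is modeled as a partially observable Markov decision process (POMDP) (Littman, 1994), defined by a tuple $(N, \mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $N$ is the number of agents and $\mathcal{S}$ is the environment state space. The environment state $s \in \mathcal{S}$ is not fully observable. Instead, agent $i$ draws a private observation $o_{i} \in \mathcal{O}_{i}$ which is correlated with $s$. $\mathcal{O}_{i}$ is the observation space of agent $i$, yielding a joint observation space $\mathcal{O} = \mathcal{O}_{1} \times \cdots \times \mathcal{O}_{N}$, $\mathcal{A}_{i}$ is the action space of agent $i$, yielding a joint action space $\mathcal{A} = \mathcal{A}_{1} \times \cdots \times \mathcal{A}_{N}$, $\mathcal{P}$ is the state transition probability, $\mathcal{R}_{i}$ is the reward function for agent $i$, and $\gamma$ is the discount factor.
Agent $i$ uses a policy $\pi_{i}(a_{i} \mid o_{i})$ to choose actions after drawing observation $o_{i}$. After all agents take actions, the joint action $\mathbf{a} = (a_{1}, \ldots, a_{N})$ triggers a state transition $s \to s'$ based on the state transition probability $\mathcal{P}(s' \mid s, \mathbf{a})$. Agent $i$ draws a private observation $o_{i}'$ corresponding to $s'$ and receives a reward $r_{i}$. Agent $i$ aims to maximize its discounted expected cumulative reward by deriving an optimal policy which is the best response to other agents’ policies. This process repeats until agents reach their own terminal state.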
Due to the existence of other agents, the Q-value function for agent $i$, i.e., $Q_{i}$, is now dependent on the environment state $s$ and the joint action $\mathbf{a}$ of all agents, i.e.,
$$Q_{i} = Q_{i}(s, a_{1}, \ldots, a_{N}) \tag{4}$$
Similarly, the value function of agent $i$, i.e., $V_{i}(s)$, is dependent on the environment state $s$.
Subsequently, we will demonstrate how to formulate the multi-driver repositioning problem in MARL, building on the single-agent example developed in the previous section.
Example 3.0.
(Multi-Agent). The multi-agent driver repositioning is presented in Figure (3). Same as before, a grid world setup is adopted. Now we have two drivers with their indices shown above the taxi icon and two passenger requests with fare presented above the passenger icon. The time beneath drivers and passenger requests records the current time of the driver and the appearance time of the passenger request, respectively. The dashed line with arrow shows the origin and destination of the passenger request.
N. There are drivers moving around in the environment. We denote drivers by .
S. The environment state consists of the state information of both drivers. For driver , her state is composed of her current location (i.e., the grid index based on a grid world setup) and current time , i.e., . The joint state of both drivers, i.e., the environment state , at time is denoted as . In this example, at current time , .
A. For driver , her action can be any of the five possible actions, i.e., moving into any of her four neighboring grids or staying in the current grid. The same as before, we use the index of grid where the driver chooses to enter as the action. The joint action of both drivers is . Assuming driver decides to go rightward (i.e, to enter grid ) and driver chooses to go leftward (i.e., to enter grid ), the joint action is . We further assume it then takes driver one time step to enter grid and driver one time step to enter grid . In other words, after driver arrives in grid and driver arrives in grid , the clock ticks one step forward and the current time is now .
P. The joint action triggers a state transition with some probability according to the state transition function, i.e., . Driver gets matched to the passenger request in grid at , loads up the passenger, and drives to the destination of the passenger. Driver then arrives in a new state where is the transition time from grid to grid . Driver gets matched to the passenger request in grid at , loads up the passenger, and drives to the destination of the passenger. Driver then arrives in a new state where is the transition time from grid to grid . . In this simple example, due to the deterministic appearance of passenger requests.
R. Along with the state transition, each driver receives a reward, i.e., . The reward function for each agent is simply the fare of the fulfilled passenger request, i.e., and . ∎
This example will be revisited later in this section to illustrate the algorithm.
3.2 Techniques to simplify the Q-value function
The dependency of the Q-value of an agent on other agents’ states and actions, as shown in Equation (4), however, makes learning the optimal Q-value prohibitively difficult. The main reasons are twofold. First, although each agent draws its private observation from the environment state $s$, $s$ cannot be observed by any agent, i.e., $s$ is unknown. Second, one agent does not observe the actual actions taken by all agents, i.e., the joint action $\mathbf{a}$ is unknown.
To make the Q-value of an agent in the multi-agent system tractable, the dependency of the Q-value on the environment state and joint action needs to be simplified. A very natural approach, inspired by the single-agent setting, is independent learning where each agent only has information about its own observation and action but has no information about other agents. Thus, the Q-value function of agent $i$ is reduced to
$$Q_{i} = Q_{i}(o_{i}, a_{i}) \tag{5}$$
In other words, the private observations $\mathbf{o}_{-i}$ and joint action $\mathbf{a}_{-i}$ of other agents are not used by agent $i$. After all agents choose actions, the joint action $\mathbf{a}$ triggers a state transition. Agent $i$ then draws a new private observation $o_{i}'$ and receives a reward $r_{i}$.
The independent learning algorithm, although intuitive and simple, can be unstable and hard to converge since the environment is no longer Markovian and stationary due to the presence of other agents (Matignon et al., 2012).
Centralized training and decentralized execution
To make the training more stable and ensure convergence, we employ the centralized training and decentralized execution paradigm (Foerster et al., 2016; Lowe et al., 2017; Lin et al., 2018; Li et al., 2019). In this paradigm, to train the policy of agents, we assume these agents know the global information such as the joint observation and/or joint action. In other words, in addition to observation $o_{i}$ and action $a_{i}$, agent $i$ also has access to the observations and/or actions of other agents during training. In the execution phase, by contrast, decentralized testing or execution is implemented, meaning agents no longer have access to the global information. The aforementioned actor-critic algorithm naturally fits this paradigm, because we can apply global information to the critic, i.e., joint observation $\mathbf{o}$ and joint action $\mathbf{a}$ in $Q_{i}(\mathbf{o}, \mathbf{a})$, in the training phase, while feeding local information to the actor, i.e., $o_{i}$ in $\pi_{i}(a_{i} \mid o_{i})$, in the execution phase. Decentralized execution becomes possible because only actors are used in execution.
Then the Q-value function of agent $i$ becomes
$$Q_{i} = Q_{i}(o_{i}, \mathbf{o}_{-i}, a_{i}, \mathbf{a}_{-i}) \tag{6}$$
where $\mathbf{o}_{-i}$ and $\mathbf{a}_{-i}$ denote the joint observation and joint action of all agents except agent $i$, respectively.
In the context of e-hailing driver repositioning, considering the definition of the action, which is the index of the grid where the driver chooses to enter, the Q-value function of driver $i$, i.e., $Q_{i}$, does not depend on the joint observation of other drivers, i.e., $\mathbf{o}_{-i}$. The explanation is as follows. When driver $i$ chooses action $a_{i}$ based on its observation $o_{i}$, driver $i$ then enters grid $a_{i}$. At the same time, other drivers also enter some grid based on their joint action $\mathbf{a}_{-i}$ regardless of their joint observation $\mathbf{o}_{-i}$. The Q-value function of driver $i$ only depends on the current distribution of drivers, which has been determined by their joint action $\mathbf{a}_{-i}$. Therefore it is the joint action $\mathbf{a}_{-i}$ which affects $Q_{i}$. The Q-value function is thus further reduced to
$$Q_{i} = Q_{i}(o_{i}, a_{i}, \mathbf{a}_{-i}) \tag{7}$$
Mean field approximation
The centralized training and decentralized execution paradigm, however, can easily become intractable due to the exponential increase in the joint action space with the increasing number of agents. For example, the size of the joint action space easily blows up for agents with possible actions (i.e., possibilities). To simplify the interaction among agents, we adopt the mean field approximation. The basic idea of the mean field approximation is to simplify the complicated interaction between one agent and all other agents by a pairwise interaction between the agent and a virtual mean agent which is formed by the neighboring agents of the agent. Thus, the complexity of interactions among a large number of agents is substantially eased by reducing the dimension of the input to the Q-value function. Therefore large-scale MARL with hundreds or even thousands of agents becomes solvable.
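To give a feel for the exponential growth, the short Python sketch below counts joint actions for a few agent counts. The choice of five actions per agent matches the repositioning examples (four neighboring grids plus staying); the agent counts are arbitrary illustrations.

```python
def joint_action_space_size(num_agents, num_actions=5):
    """Number of distinct joint actions when every agent has num_actions choices."""
    return num_actions ** num_agents


# The critic input under the mean field approximation does not grow with the
# number of agents, whereas the joint action space does.
sizes = {n: joint_action_space_size(n) for n in (2, 10, 100)}
```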
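To be more precise, we provide brief explanations that lead to the applicability of the mean field approximation in MARL as described in Yang et al. (2018). First, from the perspective of agent $i$, the multi-agent effect or competition effect mainly comes from its neighboring agents, i.e., $Q_{i}(o_{i}, a_{i}, \mathbf{a}_{-i}) \approx Q_{i}(o_{i}, a_{i}, \mathbf{a}_{\mathcal{N}(i)})$, where $\mathcal{N}(i)$ denotes the neighboring agents of agent $i$. However, it is still cumbersome to compute $Q_{i}(o_{i}, a_{i}, \mathbf{a}_{\mathcal{N}(i)})$ for the neighboring agents of agent $i$ if this number is large. Define a mean action $\bar{a}_{i}$, which is a proxy of the actions taken by the neighboring agents. Accordingly, $Q_{i}(o_{i}, a_{i}, \mathbf{a}_{\mathcal{N}(i)})$ can be further simplified to $Q_{i}(o_{i}, a_{i}, \bar{a}_{i})$ when Taylor expansion is applied, which is
$$Q_{i}(o_{i}, a_{i}, \mathbf{a}_{-i}) \approx Q_{i}(o_{i}, a_{i}, \bar{a}_{i}) \tag{8}$$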
Interested readers can refer to Yang et al. (2018) for a detailed explanation and proof.
Example 3.0.
(Multi-Agent). The mean action of the neighboring drivers of driver $i$ is defined as the demand to supply ratio in the grid where driver $i$ is entering. Assuming both drivers choose action , i.e., in the multi-agent example shown in Figure (3), there are 2 drivers and 1 passenger request in grid after both drivers enter grid . The mean action for both drivers is thus $1/2$. This definition of mean action captures the level of competition in a grid. A larger mean action denotes a higher demand to supply ratio and lower level of competition, and vice versa. ∎
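The demand to supply ratio used as the mean action can be computed directly from the grids that drivers enter. The following Python sketch assumes hypothetical grid indices and a dictionary of request counts per grid; only the 2-drivers-1-request case is taken from the example above.

```python
from collections import Counter


def mean_actions(driver_actions, requests_per_grid):
    """Return the demand to supply ratio seen by each driver.

    driver_actions    -- list where entry i is the grid that driver i enters
    requests_per_grid -- maps grid index to the number of passenger requests
    """
    supply = Counter(driver_actions)  # number of drivers entering each grid
    return [requests_per_grid.get(g, 0) / supply[g] for g in driver_actions]


# Two drivers enter the same (hypothetically indexed) grid 1, which holds one request.
crowded = mean_actions([1, 1], {1: 1})
# The drivers split up: only the one entering grid 1 finds a request.
split = mean_actions([0, 1], {1: 1})
```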
3.3 Mean field actorcritic algorithm
As previously mentioned, each agent maintains a policy network (i.e., the actor) and a Q-value network (i.e., the critic). A real-world multiagent task typically involves hundreds or even thousands of agents, so maintaining two deep neural networks (i.e., one for the actor and one for the critic) per agent is not computationally tractable. However, in the class of multiagent tasks where anonymous agents share the same state space, action space, and reward function, the agents are homogeneous. The multiagent task can then be largely simplified by sharing both the actor and the critic among drivers, i.e., and .
After adopting the mean field approximation, the loss function for the critic, which was presented in Equation (2) for the singleagent setting, now becomes
(9) 
The only difference is the incorporation of the mean action into the Qvalue function approximation. Similarly, the gradient of the policy, which was presented in Equation (3) for singleagent setting, is now
(10) 
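The role of the mean action in the critic update and in the advantage can be sketched as follows. This is a generic one-step temporal-difference form written under our own assumptions (discount factor, squared-error loss, and an expected-Q baseline), not a reproduction of Equations (9) and (10); all function and argument names are hypothetical.

```python
import numpy as np


def critic_td_loss(q, q_next, rewards, gamma=0.99):
    """Mean squared one-step TD error.

    q      : Q(o, a, mean_a) for a batch of transitions
    q_next : Q(o', a', mean_a') for the same batch
    rewards: rewards received by the agents
    """
    target = rewards + gamma * q_next
    return float(np.mean((target - q) ** 2))


def advantage(q_all_actions, action_probs, action):
    """Q of the taken action minus the policy-weighted average Q.

    q_all_actions: Q(o, a', mean_a) for every action a' (mean action fixed)
    action_probs : actor output pi(a' | o)
    """
    baseline = float(np.dot(action_probs, q_all_actions))
    return float(q_all_actions[action] - baseline)


loss = critic_td_loss(np.array([1.0]), np.array([0.0]), np.array([2.0]), gamma=0.9)
adv = advantage(np.array([1.0, 3.0]), np.array([0.5, 0.5]), action=1)
print(loss, adv)
```

The only structural change relative to a single-agent actor-critic is that the critic takes the observed mean action as an additional input.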
Example 3.0.
(MultiAgent ). Now we apply the mean field actorcritic algorithm to the multidriver example shown in Figure (3). Figure (4) presents the architecture of the mean field actorcritic algorithm particularly for the context of multidriver repositioning. Homogeneous agents, who share a common actor and a common critic, interact with the environment. The shared actor is a multilayer perceptron with 32 neurons in its hidden layer and takes as input observation and outputs a five dimensional vector denoting the probability distribution of taking five actions. Similarly, the shared critic takes as input and outputs the Qvalue. During training, agent draws its private observation from the environment and inputs to the actor which outputs a probability distribution over actions. Agent samples an action from the probability distribution and takes the sampled action in the environment. Joint action of all agents triggers a state transition in the environment. Agent then observes the mean action , draws a new observation , and receives a reward from the environment. The agent then uses to update the shared critic by minimizing the loss presented in Equation (9). Based on the advantage calculated from the critic, agent updates the shared actor using the gradient presented in Equation (10).
The aforementioned training process is centralized because the mean action used in the critic is actually some global information. During execution, agents only need to use the updated actor, which only takes as input the local information, i.e., the private observation. In other words, the shared critic is not used in execution.
The derived Q-values corresponding to four scenarios of interest are presented in Figure (5). In Figure (5a), when both drivers choose action #4, the observed mean action for both of them is the ratio of demand to supply, i.e., . The resulting expected value for both drivers is , i.e., , because both of them have an equal probability of taking the passenger request with . The observed mean actions and resulting Q-values in the other scenarios can be explained similarly. The Q-value bimatrix is presented in Table (1), where driver is the column player and driver is the row player. When driver chooses action and driver 2 chooses action , the Q-values for them are and , respectively, according to Figure (5d). Q-values for both drivers in the other scenarios can be read from Figure (5) in the same way. Based on the bimatrix, driver always chooses action because action is strictly better than action regardless of the observed mean action, and driver always chooses action for the same reason. Thus, the optimal policy for both drivers is to enter grid with an expected payoff .
4 Reward design for multiagent reinforcement learning
Due to the selfishness of each agent, performing MARL under a given reward function in a multiagent system (MAS) is very likely to yield an undesirable equilibrium from the perspective of the system. In other words, this equilibrium may not be an optimum with respect to some system objectives. To guide a multiagent system towards a desirable equilibrium, system planners can resort to reward design mechanisms that modify the reward function of agents. In this paper, we introduce a new parameter into agents’ reward, where is the feasible domain of . Parameter can be either a scalar or a vector. The goal of system planners is to maximize some system performance measure dependent on , denoted as . The system planner first chooses a value of and inputs to the MAS. With the given , which determines the reward, the developed mean field actor-critic algorithm is employed to derive an optimal policy , which is dependent on , for all agents in the system. Some performance measure , which is calculated by executing the derived optimal policy for all agents, is then fed back into the reward design. The performance measure depends on through the dependency of on . In other words, .
In summary, the reward design problem is to select a parameter to maximize the performance measure on the upper level, while the distributed agents aim to maximize their individual cumulative rewards on the lower level once is given as part of their reward. This process can be formulated as a bilevel optimization problem, mathematically,
(11)  
The interaction between upper and lower levels through exchange of variables is shown in Figure (6).
The optimization problem presented in Equation (11), however, is not straightforward to solve because the structure of over the parameter is unknown and complex. Traditional gradient-based methods such as gradient descent are thus not directly applicable.
In this paper, we adopt Bayesian optimization (hereafter BO). The procedure of BO is as follows. First, BO places a statistical model, such as a Gaussian process, on the objective function . Second, BO devises an acquisition function to decide where to evaluate next, i.e., to choose an based on the statistical model. Third, BO updates the statistical model based on the newly evaluated , and the process repeats. The pseudocode of BO is listed in Algorithm (2). Interested readers are referred to Frazier (2018) for more details on BO.
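The three steps above can be sketched with a small one-dimensional example. This is an illustrative sketch under our own assumptions, not the implementation used in this study: the Gaussian process uses an RBF kernel with a fixed length scale suited to a unit-length domain, the acquisition function is an upper confidence bound evaluated on a fixed candidate grid, and the toy objective stands in for the expensive lower-level MARL evaluation. All names and default values are hypothetical.

```python
import numpy as np


def rbf_kernel(x1, x2, length_scale=0.2):
    """Squared-exponential kernel with unit prior variance."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_query, noise=1e-3):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_qo = rbf_kernel(x_query, x_obs)
    mean = k_qo @ np.linalg.solve(k_oo, y_obs)
    var = 1.0 - np.sum(k_qo * np.linalg.solve(k_oo, k_qo.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def bayes_opt(objective, lower, upper, n_init=3, n_iter=15, kappa=2.0, seed=0):
    """Maximize objective on [lower, upper] with a GP and a UCB acquisition."""
    rng = np.random.default_rng(seed)
    candidates = np.linspace(lower, upper, 201)
    x_obs = rng.uniform(lower, upper, size=n_init)
    y_obs = np.array([objective(x) for x in x_obs])
    for _ in range(n_iter):
        offset = y_obs.mean()  # center the data for the zero-mean GP
        mean, std = gp_posterior(x_obs, y_obs - offset, candidates)
        x_next = candidates[np.argmax(mean + offset + kappa * std)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    best = int(np.argmax(y_obs))
    return float(x_obs[best]), float(y_obs[best])


# Toy objective with its maximum at 0.6 (a stand-in for the upper-level measure).
x_best, y_best = bayes_opt(lambda a: -(a - 0.6) ** 2, 0.0, 1.0)
print(x_best, y_best)
```

In the bilevel setting, each call to the objective would correspond to training the lower-level agents under the chosen parameter and then evaluating the system performance measure.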
To be more concrete, now we use the multiagent example presented in Figure (3) to illustrate the potential of the reward design.
Example 4.0.
(MultiAgent ). We take the order response rate (ORR), i.e., the ratio of the number of fulfilled passenger requests to the total number of passenger requests, as the performance measure of the system. The direct application of the mean field actor-critic algorithm yields a 50% ORR, which is clearly not the desired equilibrium from the perspective of the system. Note that the platform typically charges a certain proportion of the fare paid by the passenger as the so-called platform service charge, which reportedly depends on various factors such as distance, duration, and city. We aim to improve the performance of the system by devising a proper reward design based on this charge.
In Figure (3), trip fares are shown right above each passenger request. Without any charge, the equilibrium reached is for both drivers to enter grid and get an expected reward of , leading to an oversupply (i.e., a low demand to supply ratio) in grid and an undersupply (i.e., a high demand to supply ratio) in grid , which is not beneficial for the system. A reward design which deducts from the fare paid to the driver in grid will effectively attract one driver to leave grid for grid to get a higher monetary return, resulting in a order response rate.
5 Case Study
To test the performance of the proposed bilevel optimization model, we use two datasets: a synthetic dataset and a real-world large-scale taxi dataset downloadable from the official website of the New York City (NYC) Taxi & Limousine Commission (https://www1.nyc.gov/site/tlc/about/tlctriprecorddata.page).
5.1 Ehailing driver repositioning under service charge
We first test the bilevel optimization model on a 2-by-2 grid world example, where an analytical solution of the reward design can be derived. We then compare the analytical optimum with the one found by BO to verify the correctness of our BO algorithm.
The dataset consists of seven deterministic passenger requests in a 2-by-2 grid world setup, as shown in Figure (7). At , there are five idle drivers in grid and five in grid . At time , five passenger requests with fare deterministically appear in grid and two passenger requests with fare appear in grid . The observation space for driver consists of the grid index and current time, i.e., , and the action space is to enter one of the neighboring grids or to stay in the current grid.
Without any reward design, the optimal policy for all drivers is to enter grid , because the expected return for entering grid is at least (i.e., 10 drivers compete for 5 orders with each) while that for entering grid is at most (i.e., the highest fare of an order in is ). The resulting ORR is , which is not desirable from the perspective of the platform because it is expected to achieve a ORR in this setting. Actually, the platform can achieve a better ORR by adjusting the reward that drivers earn through the use of a platform service charge (aka the commission fee). The platform service charge used in this study is denoted as a fare percentage. For instance, a 10% service charge means the platform takes 10% of the fare paid by the passenger to the driver as its revenue. In other words, the driver gets less money under a higher service charge while the payment from the passenger remains the same. To achieve a better ORR, the platform needs to place a high service charge in grids which are oversupplied. Drivers oversupply grid because on average they can earn more by entering grid , compared with entering other grids. A high service charge placed in grid can effectively reduce monetary returns for drivers entering and make grid less attractive to drivers. Thus, some drivers choose other grids and take other passenger requests, resulting in an increase in ORR.
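The effect of the driver allocation on ORR in this setting can be checked with a few lines. The sketch assumes that each driver serves at most one request in the grid it enters; the labels "A" (the grid with five requests) and "B" (the grid with two requests) are hypothetical stand-ins for the grid indices.

```python
def order_response_rate(drivers_per_grid, requests_per_grid):
    """Fraction of requests served when each driver serves at most one request."""
    served = sum(
        min(drivers_per_grid.get(grid, 0), n_requests)
        for grid, n_requests in requests_per_grid.items()
    )
    return served / sum(requests_per_grid.values())


requests = {"A": 5, "B": 2}  # seven requests in total
for allocation in ({"A": 10}, {"A": 9, "B": 1}, {"A": 8, "B": 2}):
    print(allocation, order_response_rate(allocation, requests))
```

When all ten drivers crowd into the five-request grid, only five of the seven requests are served (5/7); moving one driver raises ORR to 6/7, and moving two drivers serves every request.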
Before introducing a functional form of the platform service charge, we formally introduce two notations, namely the demand to supply ratio (DS) and the service charge (SC). We then construct an effective form of SC as a function of DS. A small DS indicates that the grid is oversupplied, and a large DS means the grid is undersupplied. The goal of the platform is to drive DS close to , meaning a balance between demand and supply. In a grid with below , is expected to be large to discourage drivers from oversupplying the grid, while in a grid with above , is supposed to be small. To illustrate such a relation, we use a piecewise linear function with a parameter as SC in grid , i.e.,
(12) 
where a relatively high SC is charged to all drivers in the grid with a low DS and no SC is charged to drivers in the grid with DS above .
With an adjustable parameter , the platform aims to maximize some objective , consisting of two components, namely ORR and overall service charge (OSC), where
The rationale of choosing these two components is as follows. First, from the perspective of the platform, it aims to maximize ORR, because a larger ORR typically means a higher revenue and a higher customer satisfaction. To maximize ORR, the platform simply chooses the largest possible value of . The reason is that with the largest possible , the platform penalizes drivers heavily for oversupplying a grid, and therefore drivers will be directed to other grids. This strategy, i.e., choosing the largest , however, is a big threat for the longterm growth of the platform because drivers are very likely to quit under such a high service charge. Thus, the platform also needs to maintain a relatively small OSC. Considering the competition between ORR and OSC, we use a weighted average of ORR and as the objective of the platform, i.e.,
(13) 
where is the weight for ORR. In this case study, we set , meaning that the platform cares more about ORR. We then use two methods, namely, BO and an analytical method, to determine the optimal value of .
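The weighted objective described above can be sketched as follows. The form is inferred from the surrounding text, which combines ORR with (1 - OSC); the function name and the weight value 0.7 are hypothetical and not the values used in this case study.

```python
def platform_objective(orr: float, osc: float, weight_orr: float) -> float:
    """Weighted average of ORR and (1 - OSC); larger is better for the platform."""
    return weight_orr * orr + (1.0 - weight_orr) * (1.0 - osc)


# A weight above 0.5 expresses that the platform cares more about ORR.
print(platform_objective(orr=5 / 7, osc=0.0, weight_orr=0.7))
```

Using (1 - OSC) rather than OSC keeps both terms oriented so that a larger value is better, which lets the platform maximize a single scalar.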

BO. We first employ BO with the objective function given in Equation (13). For a bilevel optimization problem, we first need to check the convergence of the lower level. As an example to validate the convergence, ORR and (1 - OSC) versus the index of iterations are presented in Figure (8) with . ORR increases very fast and (1 - OSC) steadily decreases during the first 1,000 iterations, where agents explore the environment and learn the optimal policy. ORR and (1 - OSC) gradually converge after 1,000 iterations, when agents mainly exploit the knowledge they have gained through their previous explorations.
With the validated convergence of the lower level MAS, we run BO until convergence. The convergence of BO is defined as choosing 5 consecutive s with the difference between the highest and lowest below a threshold of . In other words, BO converges when it starts choosing similar s to evaluate. The result from BO is presented in Figure (9). It is noticeable that the evaluation of the objective on s seems quite noisy. In other words, the evaluated objective may be slightly different even for the same . This is expected because there are multiple local optima when solving the lower level MAS. Actually, it is commonly impossible to find a global optimum using deep learning. Thus researchers usually settle for local optima (Goodfellow et al., 2016). Local optima introduce noise into the evaluation of the objective at each . Although the evaluations are noisy, the fitted curve is able to capture the mean objective for each . Due to the flatness near the peak of , optimal s are nonunique and determined as . In other words, any yields the same optimal mean objective, i.e., . The optimum is 4.0% higher than the objective without any reward design.
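The stopping rule described above (five consecutive evaluated parameter values whose spread falls below a threshold) can be written as a small helper. The function name and the threshold value 0.01 are hypothetical, since the threshold used in this study is not reproduced here.

```python
def bo_converged(evaluated, window=5, tol=0.01):
    """True if the last `window` evaluated parameters differ by less than tol."""
    if len(evaluated) < window:
        return False
    recent = evaluated[-window:]
    return max(recent) - min(recent) < tol


history = [0.10, 0.80, 0.45, 0.500, 0.501, 0.502, 0.503, 0.504]
print(bo_converged(history))
```

Only the evaluation locations enter the test, so noisy objective values at nearby parameters do not prevent the loop from stopping.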

Analytical method. Due to the simplicity of this case, we can analytically derive the optimal value of and shed some light on the effectiveness of the proposed platform service charge. Recall that the optimal policy for all drivers is to enter grid when . The resulting DS ratio in grid is , which is well below , meaning that grid is oversupplied. ORR is . To increase ORR, one needs to increase to penalize drivers who oversupply a grid. As gradually increases, grid becomes less attractive, because the expected return one driver can earn decreases as increases. When the expected return one driver can earn is less than , one driver will enter grid instead of grid for a higher monetary return. Note that to ease the analysis, we assume the number of drivers entering a grid is always an integer. Similarly, as one keeps increasing , the second driver will choose to enter instead of grid . We now present how to calculate the critical value of below which no driver chooses to enter grid and above which one driver is attracted by grid . With one driver entering grid , there are 9 drivers entering grid , resulting in a DS ratio in grid . , meaning that the expected return for these 9 drivers is . The expected return for the driver entering grid is . We then have the critical condition , yielding . Similarly, we can calculate the critical value of below which one driver chooses to enter grid and above which two drivers are attracted by grid ; this critical value is .
Table 2: Values of interest (rows: DS ratio in each of the two grids, with N.A. where supply is zero; ORR; OSC).
Values of interest are presented in Table (2). With , and . The objective is . With increasing to , there is one driver attracted by grid , resulting in a ORR. The OSC is calculated as follows. The DS ratio in grid is now , resulting in . Thus, . The objective is . Similarly, with increasing to , , , and the objective is . Increasing further does not improve ORR but increases OSC, resulting in a decrease in the objective. Thus, the analytically derived optimal value of is .
The analytically derived optimal value of , i.e., , agrees well with the derived optimal range of from BO, i.e., . The optimum from the analytical solution, i.e., , however, deviates from its numerical counterpart, i.e., . One possible explanation is as follows. In the analytical solution, the policy for agents is deterministic and exactly two drivers choose grid after increasing to , while in BO, the derived optimal policy for agents is stochastic, introducing variance in drivers’ actions. For example, the derived optimal policy says each driver has a probability of choosing grid and a probability of choosing grid . Although the expected number of agents in grid is and the expected number of agents in grid is , the probability of all agents choosing grid is . This variance reduces both ORR and (1 - OSC), resulting in a lower objective from BO, compared with the objective from the analytical solution.
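The loss caused by a stochastic policy can be illustrated with a binomial model. The sketch assumes that each of the 10 drivers independently chooses the two-request grid with probability p and the five-request grid otherwise; the value p = 0.2 is a hypothetical choice that makes the expected allocation equal to the deterministic one (two and eight drivers), and all function names are ours.

```python
from math import comb


def binom_pmf(n: int, p: float, k: int) -> float:
    """Probability that exactly k of n independent drivers pick the grid."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_orr(n_drivers, p_small, req_small=2, req_large=5):
    """Expected ORR when k drivers land in the two-request grid."""
    exp_served = sum(
        binom_pmf(n_drivers, p_small, k)
        * (min(k, req_small) + min(n_drivers - k, req_large))
        for k in range(n_drivers + 1)
    )
    return exp_served / (req_small + req_large)


print(binom_pmf(10, 0.2, 0))  # chance that nobody enters the two-request grid
print(expected_orr(10, 0.2))  # strictly below 1 despite the "right" mean
```

Even though the expected allocation serves every request, there is a chance of roughly 0.107 that no driver enters the two-request grid, so the expected ORR stays below the deterministic value of 1.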
5.2 Multiclass taxi driver repositioning under congestion pricing
In this case study, we apply the proposed bilevel optimization model to a realworld scenario where city planners aim to mitigate traffic congestion in the central business district (CBD). As an effective way to improve traffic condition, congestion pricing has been adopted by many cities such as London and Stockholm (de Palma and Lindsey, 2011). The basic idea of congestion pricing is to impose a toll charge on all vehicles entering the CBD. Consequently, some drivers may be sensitive to the toll charge and take alternative travel modes such as subway while some drivers can bear the toll charge. To demonstrate the effectiveness of congestion pricing, we use NYC taxi and subway data due to data availability.
In the taxi market, congestion pricing affects both demand and supply. On the demand side, the toll charge is passed on to passengers when taxi drivers carry passengers into the CBD. In other words, the fare paid by passengers is increased and thus the demand (i.e., number of passenger requests) is decreased. According to Schaller (1999), taxi demand falls by percent when taxi fare increases by percent. For example, when taxi fare increases from to (i.e., increases by ) for a trip, the probability of the passenger choosing alternative travel modes is (i.e., ). On the supply side, the toll charge is paid by taxi drivers when they enter the CBD vacant, which discourages drivers from entering the CBD without a passenger. The overall effect of congestion pricing is a reduced number of taxis in the CBD, leading to an improved traffic condition. This, however, may direct too many passengers, whose taxi requests are unfulfilled, to public transit, which is already running at full capacity during rush hours (Plitt, 2020). Thus, there exists a tradeoff between reducing the number of taxis in the CBD and maintaining a reasonable level of crowdedness in the public transit system.
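The demand-side response can be sketched as a constant-elasticity rule. This is an illustrative reading of the elasticity statement above; the function name, the elasticity value, and the fares in the usage line are hypothetical and are not the figures reported by Schaller (1999).

```python
def switch_probability(old_fare: float, new_fare: float, elasticity: float) -> float:
    """Probability that a passenger leaves taxi for another mode.

    elasticity: percentage drop in demand per one percent increase in fare.
    """
    relative_increase = (new_fare - old_fare) / old_fare
    return min(1.0, max(0.0, elasticity * relative_increase))


# Hypothetical numbers: a toll raises a 10-unit fare to 12 units (+20%).
print(switch_probability(10.0, 12.0, elasticity=0.2))
```

In a simulation, each passenger request into the CBD could be dropped with this probability, which is how the toll reduces taxi demand.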
The objective of city planners thus consists of two components, namely, the number of taxis in the CBD and the crowdedness of the public transit system. Considering the accessibility of data (i.e., NYC taxi data and subway data), we make two assumptions: (1) The proposed congestion pricing scheme only affects the behavior of taxi drivers, as previously mentioned; (2) The subway system is used as a proxy of the public transit of the city.
To be more precise, Figure (10) presents objectives of city planners and how planners derive the best control. City planners impose a toll charge on taxis entering the CBD. Adaptive taxis learn the optimal policy by the meanfield actorcritic algorithm under the toll charge. With fewer taxis searching for passengers in the CBD, more unfulfilled passenger requests are directed to the subway system. City planners observe the number of taxis in the CBD and crowdedness in the subway and adjust the toll charge to achieve a better balance between these two objectives. This process repeats until city planners reach a satisfactory balance.
Data preprocessing
The NYC taxi trip records are publicly available on the official website of the NYC Taxi & Limousine Commission (https://www1.nyc.gov/site/tlc/about/tlctriprecorddata.page). We use the data for both yellow and green taxis during May 2014, i.e., before the wide adoption of ride-sharing services such as Uber and Lyft and after the business of green taxis had gradually stabilized. A data sample is listed in Table (3). Each entry in Table (3) contains the order information, including pickup and dropoff times, locations, and the fare. In total there are around 16 million taxi trips. We first remove the weekend data because trip patterns over weekends are clearly different from those on weekdays. We then restrict the time interval of interest to the evening peak, i.e., 4 PM to 8 PM. There are 2 million taxi trips in the weekday data after preprocessing.
pickup datetime | dropoff datetime | pickup longitude | pickup latitude | dropoff longitude | dropoff latitude | fare amount (USD)
2014-05-01 16:59:00 | 2014-05-01 17:08:30 | -73.978818 | 40.785048 | -73.965570 | 40.800718 | 6.5
2014-05-01 16:59:00 | 2014-05-01 17:23:00 | -73.960280 | 40.778892 | -73.975542 | 40.751427 | 15.5
Figure (11) presents the spatial discretization of the area of interest. There are in total 337 grids with a side length of 1 km covering the area from Manhattan to two airports located in Queens. Taxi orders outside these grids account for less than 2% of the overall taxi orders and are not considered. Each longitude and latitude coordinate is transformed into a grid index. As for the temporal discretization, the evening peak is divided into eighty 3-minute time intervals, and the pickup time and dropoff time are transformed into time interval indices. Grids shown as bold red squares cover the CBD of NYC, which is the area between 19th Street and 59th Street in Manhattan. The proposed congestion pricing is applied to vehicles crossing the red squares into the CBD.
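The temporal discretization can be sketched as below. The constants follow the text (an evening peak from 4 PM to 8 PM split into eighty 3-minute intervals); the function name and the zero-based indexing are our assumptions. The spatial mapping is omitted because it depends on the grid origin and orientation, which are not given here.

```python
from datetime import datetime

PEAK_START_HOUR = 16   # 4 PM
INTERVAL_MINUTES = 3
NUM_INTERVALS = 80     # 4 hours / 3 minutes


def time_interval_index(ts: datetime):
    """Zero-based index of the 3-minute interval, or None outside the peak."""
    minutes = (ts.hour - PEAK_START_HOUR) * 60 + ts.minute + ts.second / 60
    idx = int(minutes // INTERVAL_MINUTES)
    return idx if 0 <= idx < NUM_INTERVALS else None


print(time_interval_index(datetime(2014, 5, 1, 16, 59, 0)))
print(time_interval_index(datetime(2014, 5, 1, 17, 8, 30)))
```

For the first sample trip, the pickup at 16:59:00 falls into interval 19 and the dropoff at 17:08:30 into interval 22.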
Figure (12) presents the spatial distribution of taxi orders (pickup) during evening peak. It can be seen that the majority of taxi orders emerge in Manhattan, especially in the CBD. There are two local hotspots near two airports.
NYC subway turnstile data is also publicly accessible (http://web.mta.info/developers/turnstile.html). A sample of the turnstile data is listed in Table (4). These two rows show that the reading of entries for turnstile ID (A002, R051, 020000) is 4,593,637 at 4 PM and 4,594,523 at 8 PM on 05/01/2014. Taking the difference between the two readings yields the net entries at this turnstile during the 4-hour time interval, i.e., 4,594,523 - 4,593,637 = 886. Similarly, we can calculate net entries and net exits for each turnstile. Net entries and net exits of a grid are then calculated by summing up the net entries and net exits of all turnstiles in that grid, respectively.
Turnstile ID  Date  Time  Entries  Exits 

(A002, R051, 020000)  05/01/2014  16:00:00  4,593,637  1,564,283 
(A002, R051, 020000)  05/01/2014  20:00:00  4,594,523  1,564,348 
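The aggregation of turnstile readings into net entries and net exits per grid can be sketched as follows. The data layout, the function name, and the grid index 42 are hypothetical; the readings are the two rows of Table (4).

```python
from collections import defaultdict


def net_counts_by_grid(readings, turnstile_grid):
    """Sum net entries and net exits of all turnstiles in each grid.

    readings: {turnstile_id: ((entries_4pm, exits_4pm), (entries_8pm, exits_8pm))}
    turnstile_grid: {turnstile_id: grid_index}
    """
    totals = defaultdict(lambda: [0, 0])
    for turnstile_id, (start, end) in readings.items():
        grid = turnstile_grid[turnstile_id]
        totals[grid][0] += end[0] - start[0]  # net entries
        totals[grid][1] += end[1] - start[1]  # net exits
    return {grid: tuple(counts) for grid, counts in totals.items()}


tid = ("A002", "R051", "020000")
readings = {tid: ((4_593_637, 1_564_283), (4_594_523, 1_564_348))}
print(net_counts_by_grid(readings, {tid: 42}))
```

For the sample turnstile this yields 886 net entries and 65 net exits over the evening peak.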
Objective function of city planners
As previously mentioned, the objective function of city planners consists of two components, namely, number of vehicles in the CBD and the crowdedness of the public transit. Now we formally define these two components based on NYC taxi data and subway turnstile data.
The first component is defined as the percentage of taxis in the CBD, i.e., the ratio of the number of taxis in the CBD to the total number of taxis. For each time step, we calculate one value of the percentage. We then take the average of the percentages across all time steps as the overall percentage of taxis in the CBD. Hereafter we call this PTC (percentage of taxis in the CBD). PTC decreases with the toll charge because fewer vacant taxis enter the CBD under a higher toll charge.
The crowdedness of the subway system in each grid is further decomposed into two parts, namely, the entry crowdedness, which is related to the net entries into the subway system within the grid, and the exit crowdedness, which is related to the net exits from the subway system within the grid. After imposing a toll charge on taxis entering the CBD, the crowdedness of the subway system increases due to the unserviced taxi orders. Here we assume that travel demand stemming from the unserviced taxi orders goes to the subway system. For each grid, we count the unserviced taxi orders with their origins inside the grid and call this quantity the additional entry into the subway system. We then take the ratio of the additional entry to the net entries within the grid as the increase in the entry crowdedness. Similarly, we can calculate the increase in the exit crowdedness within the grid. Taking the average of the increase in the entry crowdedness and the increase in the exit crowdedness yields the increase in the crowdedness of the grid. Among the overall 337 grids, we focus on the top 20 grids in terms of crowdedness. Hereafter we call this ICS (increase in crowdedness in subway). ICS increases with the toll charge because more passengers are directed to the subway system under a higher toll charge.
From the perspective of city planners, both PTC and ICS are expected to be small. These two components, however, are competing against each other. With a small toll charge, ICS is small but PTC is large; while with a large toll charge, PTC becomes smaller but ICS gets larger. Therefore city planners need to maintain some balance between these two components. Here we use a weighted average of these two components as the objective of city planners. To ensure maximization, we add a minus sign:
(14) 
where is the weight for PTC. In this case study, we set considering the difference in the magnitude of two components.
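The two components and their combination can be sketched as follows. Several details are assumptions on our part: ICS is taken as the plain mean of the per-grid increases over the selected grids, the weight value 0.5 is hypothetical, and all function names are ours.

```python
def percentage_taxis_in_cbd(taxis_in_cbd, total_taxis):
    """PTC: per-time-step share of taxis in the CBD, averaged over time steps."""
    shares = [inside / total for inside, total in zip(taxis_in_cbd, total_taxis)]
    return sum(shares) / len(shares)


def crowdedness_increase(extra_entries, net_entries, extra_exits, net_exits):
    """Average of the relative increases in entry and exit crowdedness of a grid."""
    return 0.5 * (extra_entries / net_entries + extra_exits / net_exits)


def planner_objective(ptc, ics, weight_ptc):
    """Negative weighted average, so that maximizing it keeps PTC and ICS small."""
    return -(weight_ptc * ptc + (1.0 - weight_ptc) * ics)


ptc = percentage_taxis_in_cbd([50, 30], [100, 100])
per_grid = [crowdedness_increase(10, 100, 30, 100), crowdedness_increase(5, 100, 5, 100)]
ics = sum(per_grid) / len(per_grid)  # assumed aggregation over the busiest grids
print(planner_objective(ptc, ics, weight_ptc=0.5))
```

The minus sign turns the joint minimization of PTC and ICS into a maximization problem, which matches the form expected by the upper-level optimizer.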
Multiclass MARL
The NYC taxi market contains two types of taxicabs, namely yellow taxis and green taxis. They differ in that yellow taxis can pick up passengers anywhere, while green taxis are not allowed to pick up passengers in Manhattan below East 96th Street and West 110th Street or at the two airports, namely LaGuardia airport (LGA) and John F. Kennedy airport (JFK). Therefore we need to model them differently in the lower level MARL. To incorporate these two classes of agents into MARL, we create two actors and two critics. All the yellow taxis share one actor (i.e., a policy network) and one critic (i.e., a value network), and green taxis share the other actor and the other critic. In the actor-critic algorithm demonstrated in Figure (4), both yellow and green agents interact with the same environment. They have the same observation space and action space. In other words, for both yellow and green taxis, the observation consists of the grid index and current time, and the action is to enter one of the neighboring grids or to stay in the current grid. In addition, both of them aim to maximize their cumulative monetary return.
The key difference is that, although green taxis can drop off and search for passengers in the restricted areas (i.e., Manhattan below East 96th Street and West 110th Street and the two airports), they cannot pick up passengers there. From the modeling perspective, the environment will not assign orders to green taxis in restricted areas. This restriction discourages green taxis from searching for passengers in, or taking passengers to, the restricted areas. Accordingly, the policy for green taxis is expected to be different from that of yellow taxis. Yellow taxis thus only compete among themselves in the restricted areas, while outside restricted areas, yellow and green taxis compete both within the same type and with the other taxi type.
Results
On weekdays, there are on average around 90,000 taxi orders during the evening peak. According to Wikipedia (https://en.wikipedia.org/wiki/Taxicabs_of_New_York_City), there are around 13,000 yellow “medallion” taxicabs in NYC. Considering that some drivers do not work during the evening peak and some drivers work outside the grid world, we thus set the number of yellow agents in MARL as 12,000. The number of green agents is set to 5,000 considering that there were in total around 6,000 green taxi drivers in 2015 and some of them may not work during the evening peak.
We then run the bilevel optimization with the objective function given in Equation (14) and derive the optimal toll charge as . The result is presented in Figure (13). The objective is around without any toll charge, as shown in Figure (13a). The objective increases with the toll charge before the toll charge reaches . With a toll charge, the objective is , which is higher than . The objective decreases if the toll charge is increased beyond . The parabolic shape of the objective can be explained by Figure (13b). Before the toll charge reaches , the steady increase in and the minor decrease in push the objective higher with a larger toll charge. After the toll charge is increased beyond , declines faster and outweighs the effect of the increase in , resulting in a decrease in the objective.
Figure (14) presents the average percentage of taxis in each grid for two scenarios, namely without any toll charge and with the optimal toll charge. With the optimal toll charge, the percentage of taxis in Manhattan, especially in the CBD, is decreased, while that at the two airports is increased. This is as expected because taxi drivers are penalized for entering the CBD vacant, meaning that the CBD becomes less attractive to taxi drivers. According to the demand distribution shown in Figure (12), the two airports become comparatively attractive.
Figure (15) presents the increase in crowdedness across the busiest 20 grids. With the optimal toll charge, the increase in crowdedness in the subway is higher in the CBD, compared to that without any toll charge, because there are now fewer taxis in the CBD and therefore more passengers are directed to the subway system. For grids outside the CBD, the increase in crowdedness can be either higher or lower because the crowdedness consists of two components, namely, the entry crowdedness and the exit crowdedness. The increase in entry crowdedness is expected to be lower for grids outside the CBD because there are more vacant taxis outside the CBD that are willing to carry passengers into the CBD. The increase in exit crowdedness is higher in many grids because more people take the subway to arrive in grids outside the CBD.
6 Conclusion
Noticing the underutilization of taxi resources due to idle taxi drivers’ cruising behavior, this study models the multidriver repositioning task through a mean field multiagent reinforcement learning approach. A mean field actor-critic algorithm is developed to solve the MARL with a given reward function. The direct application of the mean field actor-critic algorithm is, however, very likely to yield a suboptimal equilibrium from the standpoint of the system. Thus, this study proposes a bilevel optimization with the upper level as a reward design and the lower level as the MARL. The upper level interacts with the lower level by adjusting rewards. The bilevel optimization model is applied to two scenarios, namely, e-hailing driver repositioning under service charge and taxi driver repositioning under congestion pricing. In the case of e-hailing driver repositioning, the agreement between the derived optimal control from BO and that from an analytical solution validates the effectiveness of the model. It is also worth mentioning that the objective of the e-hailing platform is increased by using a simple piecewise linear platform service charge. In the case of multiclass taxi driver repositioning, a toll charge increases the objective of city planners by , compared to that without any toll charge. With the optimal toll charge, the number of taxis in the CBD is decreased, indicating a better traffic condition. The crowdedness is increased in the subway stations within the CBD due to fewer taxis. For subway stations outside the CBD, the crowdedness can be either higher or lower depending on the tradeoff between the entry crowdedness and the exit crowdedness.
The aforementioned two driverrepositioning applications validate the effectiveness of the proposed bilevel optimization model. We stress that the model is general and can be applied to various systems as long as there are two levels in the system and the upper level can affect the lower level through some control. With some optimal control, the performance of the system can be improved, which is beneficial for the urban economy.
Several directions for future work could address the limitations of this study.

Although the mean field approximation is effective in making MARL with a large number of agents tractable, it may oversimplify the interaction among agents. A theoretical or physics-informed approach could be developed to better capture this interaction.

We will further explore the modeling of multiclass MARL. In the second case study, although the difference between yellow taxis and green taxis is considered, the heterogeneity among yellow taxis (or among green taxis) is neglected due to the homogeneity assumption. A more personalized model that captures behavioral differences within the same type of agents is left for future research.

A predefined form of the control with a parameter (e.g., the piecewise linear service charge and the toll charge in the two case studies) was used in the upper level, meaning that the upper level aims to find the best control within the space defined by the given form of the control. A more game-theoretical approach, such as a leader-follower game, may relax this restriction and find a globally optimal control.
Footnotes
 journal: Transportation Research Part C
References
 Superhuman AI for headsup nolimit poker: Libratus beats top professionals. Science 359 (6374), pp. 418–424 (en). External Links: ISSN 00368075, 10959203 Cited by: §1.
 Superhuman AI for multiplayer poker. Science 365 (6456), pp. 885–890 (en). External Links: ISSN 00368075, 10959203 Cited by: §1.
 Multiagent Reinforcement Learning: An Overview. In Innovations in MultiAgent Systems and Applications  1, D. Srinivasan and L. C. Jain (Eds.), Studies in Computational Intelligence, pp. 183–221 (en). External Links: ISBN 9783642144356, Document Cited by: §1.
 Traffic congestion pricing methodologies and technologies. Transportation Research Part C: Emerging Technologies 19 (6), pp. 1377–1399 (en). External Links: ISSN 0968090X Cited by: §5.2.
 A unified equilibrium framework of new shared mobility systems. Transportation Research Part B: Methodological 129, pp. 50–78. Cited by: §1.
 Braess paradox under the boundedly rational user equilibria. Transportation Research Part B: Methodological 67, pp. 86–108. Cited by: §1.
 Second best toll pricing within the framework of bounded rationality. Transportation Research Part B 83, pp. 74–90. Cited by: §1.
 A linknode reformulation of ridesharing user equilibrium with network design. Transportation Research Part B: Methodological 112, pp. 230–255. Cited by: §1.
 Learning to Communicate with Deep Multiagent Reinforcement Learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, USA, pp. 2145–2153. Note: eventplace: Barcelona, Spain External Links: ISBN 9781510838819 Cited by: §1, §3.2.1.
 A Tutorial on Bayesian Optimization. arXiv:1807.02811 [cs, math, stat]. Note: arXiv: 1807.02811 Cited by: §4.
 Optimize taxi driving strategies based on reinforcement learning. International Journal of Geographical Information Science 32 (8), pp. 1677–1696. External Links: ISSN 13658816 Cited by: §1.
 An Energyefficient Mobile Recommender System. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, New York, NY, USA, pp. 899–908. External Links: ISBN 9781450300551 Cited by: §1.
 Deep learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: item 1.
 A Survey of ActorCritic Reinforcement Learning: Standard and Natural Policy Gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (6), pp. 1291–1307. External Links: ISSN 15582442 Cited by: §2.2.
 An effective taxi recommender system based on a spatiotemporal factor analysis model. Information Sciences 314, pp. 28–40. External Links: ISSN 00200255 Cited by: §1.
 CoRide: Joint Order Dispatching and Fleet Management for MultiScale RideHailing Platforms. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, New York, NY, USA, pp. 1983–1992. Note: eventplace: Beijing, China External Links: ISBN 9781450369763 Cited by: §1, §1.
 On ActorCritic Algorithms. SIAM J. Control Optim. 42 (4), pp. 1143–1166. External Links: ISSN 03630129 Cited by: §2.2.
 Efficient Ridesharing Order Dispatching with Mean Field MultiAgent Reinforcement Learning. In The World Wide Web Conference, WWW ’19, New York, NY, USA, pp. 983–994. Note: eventplace: San Francisco, CA, USA External Links: ISBN 9781450366748 Cited by: §1, §1, §1, §1, §3.2.1.
 Efficient LargeScale Fleet Management via MultiAgent Deep Reinforcement Learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, New York, NY, USA, pp. 1774–1783. External Links: ISBN 9781450355520 Cited by: §1, §1, §1, §1, §3.2.1.
 Markov Games As a Framework for Multiagent Reinforcement Learning. In Proceedings of the Eleventh International Conference on International Conference on Machine Learning, ICML’94, San Francisco, CA, USA, pp. 157–163. Note: eventplace: New Brunswick, NJ, USA External Links: ISBN 9781558603356 Cited by: §3.1.
 Multiagent Actorcritic for Mixed Cooperativecompetitive Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, USA, pp. 6382–6393. Note: eventplace: Long Beach, California, USA External Links: ISBN 9781510860964 Cited by: §1, §3.2.1.
 Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review 27 (1), pp. 1–31 (en). External Links: ISSN 14698005, 02698889 Cited by: §1, §3.2.
 Impact analysis of cordonbased congestion pricing on modesplit for a bimodal transportation network. Transportation Research Part C: Emerging Technologies 21 (1), pp. 134–147. Cited by: §1.
 Coordinating the Crowd: Inducing Desirable Equilibria in NonCooperative Systems. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, Montreal QC, Canada, pp. 386–394. External Links: ISBN 9781450363099 Cited by: §1.
 Humanlevel control through deep reinforcement learning. Nature 518 (7540), pp. 529–533 (en). External Links: ISSN 14764687 Cited by: §2.2.
 Deep Reinforcement Learning for MultiAgent Systems: A Review of Challenges, Solutions and Applications. arXiv:1812.11794 [cs, stat]. Note: arXiv: 1812.11794 Cited by: §1.
 OpenAI five. Note: https://blog.openai.com/openaifive/ Cited by: §1.
 Towards Reducing Taxicab Cruising Time Using Spatiotemporal Profitability Maps. In Proceedings of the 12th International Conference on Advances in Spatial and Temporal Databases, SSTD’11, Berlin, Heidelberg, pp. 242–260. External Links: ISBN 9783642229213 Cited by: §1.
 Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1st edition, John Wiley & Sons, Inc., New York, NY, USA. External Links: ISBN 9780471619772 Cited by: §2.1.
 A Costeffective Recommender System for Taxi Drivers. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, pp. 45–54. External Links: ISBN 9781450329569 Cited by: §1.
 The Rich and the Poor: A Markov Decision Process Approach to Optimizing Taxi Driver Revenue Efficiency. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM ’16, New York, NY, USA, pp. 2329–2334. External Links: ISBN 9781450340731 Cited by: §1.
 Elasticities for taxicab fares and service availability. Transportation 26 (3), pp. 283–297 (en). External Links: ISSN 15729435 Cited by: §5.2.
 Optimal passengerseeking policies on Ehailing platforms using Markov decision process and imitation learning. Transportation Research Part C: Emerging Technologies 111, pp. 91–113 (en). External Links: ISSN 0968090X Cited by: §1.
 Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489 (en). External Links: ISSN 14764687 Cited by: §1.
 Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354–359 (en). External Links: ISSN 14764687 Cited by: §1.
 Introduction to Reinforcement Learning. 1st edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 9780262193986 Cited by: §1, §2.1, §2.1, §2.2.
 Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, Cambridge, MA, USA, pp. 1057–1063. Note: eventplace: Denver, CO Cited by: §2.2.
 Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE 12 (4), pp. e0172395 (en). External Links: ISSN 19326203 Cited by: §1.
 MultiAgent Reinforcement Learning: Independent vs. Cooperative Agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337. Cited by: §1.
 Augmenting decisions of taxi drivers through reinforcement learning for improving revenues. Proceedings of the TwentySeventh International Conference on Automated Planning and Scheduling ICAPS 2017: Pittsburgh, June 1823, pp. 409–417. Cited by: §1.
 Grandmaster level in StarCraft II using multiagent reinforcement learning. Nature 575 (7782), pp. 350–354 (en). External Links: ISSN 14764687 Cited by: §1.
 Models and algorithms for road network design: a review and some new developments. Transport Reviews 18 (3), pp. 257–278. Cited by: §1.
 Mean Field MultiAgent Reinforcement Learning. In International Conference on Machine Learning, pp. 5571–5580 (en). Cited by: §1, §3.2.2.
 A Markov decision process approach to vacant taxi routing with ehailing. Transportation Research Part B: Methodological 121, pp. 114–134. External Links: ISSN 01912615 Cited by: §1.
 Where to Find My Next Passenger. In Proceedings of the 13th International Conference on Ubiquitous Computing, UbiComp ’11, New York, NY, USA, pp. 109–118. External Links: ISBN 9781450306300 Cited by: §1.
 MultiAgent Reinforcement Learning: A Selective Overview of Theories and Algorithms. (en). External Links: Link Cited by: §1.
 The optimal cordonbased network congestion pricing problem. Transportation Research Part B: Methodological 38 (6), pp. 517–537. Cited by: §1.
 MultiAgent Reinforcement Learning for Orderdispatching via OrderVehicle Distribution Matching. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, New York, NY, USA, pp. 2645–2653. Note: eventplace: Beijing, China External Links: ISBN 9781450369763 Cited by: §1, §1.
 Optimizing Taxi Driver Profit Efficiency: A Spatial Networkbased Markov Decision Process Approach. IEEE Transactions on Big Data, pp. 1–1. External Links: ISSN 23327790 Cited by: §1.