CoLight: Learning Network-level Cooperation for Traffic Signal Control
Abstract.
Cooperation is critical in multi-agent reinforcement learning (MARL). In the context of traffic signal control, good cooperation among the traffic signal agents enables the vehicles to move through intersections more smoothly. Conventional transportation approaches implement cooperation by pre-calculating the offsets between two intersections. Such pre-calculated offsets are not suitable for dynamic traffic environments.
To incorporate cooperation in reinforcement learning (RL), two typical approaches have been proposed to take the influence of other agents into consideration: (1) learning the communication (i.e., the representation of influences between agents) and (2) learning joint actions for agents. While joint modeling of actions has been the preferred trend in recent studies, improving the learning of communication between agents has not been systematically studied in the context of traffic signal control.
To learn the communication between agents, in this paper, we propose to use a graph attentional network to facilitate cooperation. Specifically, for a target intersection in a network, our proposed model, CoLight, can not only incorporate the influences of neighboring intersections but also learn to differentiate their impacts on the target intersection. To the best of our knowledge, we are the first to use a graph attentional network in the setting of reinforcement learning for traffic signal control. In experiments, we demonstrate that by learning the communication, the proposed model can achieve surprisingly good performance, whereas the existing approaches based on joint action modeling fail to learn well.
1. Introduction
Traffic signals decide how smoothly vehicles move in the city. However, designing an efficient traffic signal control system is quite complicated because traffic situations are highly dynamic. Recent advances in technology provide increasingly available real-time traffic data collected from sources such as navigation systems and surveillance cameras. This has drawn increasing attention from data science researchers to look into this traditional but important problem.
Traffic signal control is a core research topic in the transportation field (Roess et al., 2004). The typical way to solve this problem is to formulate it as an optimization problem and solve it under certain assumptions (e.g., uniform arrival rate (Webster, 1958; Roess et al., 2004) and unlimited lane capacity (Varaiya, 2013)). Such methods, however, do not perform well because the assumptions do not hold in the real world.
Recently, researchers have started to investigate reinforcement learning (RL) techniques for traffic signal control (Wiering, 2000; van der Pol et al., 2016; Wei et al., 2018). In a typical RL setting for traffic signal control, each intersection is treated as an agent. An agent takes the representation of its local traffic situation from the environment as its observation $o$. Reward is often defined as a measure correlated with travel time, and the action is the choice of a traffic signal phase. If the model is Q-learning (the most widely used model in recent literature), we are essentially learning a value function $Q(o, a)$, which gives the score for taking action $a$ under observation $o$. Different from conventional approaches, such RL methods avoid making strong assumptions and directly learn good strategies from the dynamic environment in a trial-and-error manner. They have been shown to be more effective than conventional optimization-based transportation approaches (Wiering, 2000; Wei et al., 2018). However, most RL methods have focused on individual single intersections (Wei et al., 2018; Casas, 2017; Liang et al., 2018), whereas the cooperation among traffic signal agents has not been extensively discussed.
Cooperation is important in traffic signal control because the actions of one signal could affect another, especially when two intersections are spatially close and the traffic flow is large. Good cooperation among the traffic signal agents enables the vehicles to move through intersections more smoothly. Consider the following examples.
Example 1.1 (Overflow).
When a lane is already fully congested, if its upstream intersection continues to give a green signal, it will intensify the traffic jam and also waste the opportunity to give the green signal to the competing direction.
Example 1.2 (Green Wave).
During the morning peak, there is often a large number of vehicles moving from residential areas to working areas. If the signals are all green along the way, city-wide transportation efficiency will increase. Note that, since there are many commuting routes and these routes compete at the intersections where they meet, the solution is not simply to create green waves. The cooperation between the intersections is important.
To achieve cooperation in RL, the most straightforward way is to inform the target agent of other agents’ local observations through communication, i.e., expanding the observation of the target agent to a larger range for more comprehensive information. However, the more comprehensive the observation is, the more parameters we need to estimate, and the longer the learning takes. Therefore, people tend to select only the relevant agents to be included in the observation. For example, the information of adjacent intersections could be included in the observations (El-Tantawy et al., 2013; Dresner and Stone, 2006; Silva et al., 2006; Arel et al., 2010). However, information from intersections that are more than one hop away could also be useful, as in the case of Example 1.2. In addition, two intersections that are both the same hop-distance from the target intersection might play different roles in the decision at the target intersection. For example, suppose intersections A and B are adjacent intersections on a major road, while intersection C is on a side road linked with A; then the information from B is more useful to A than that from C. We therefore ask: can the model automatically learn which agents’ information to use for cooperation and how to use it in the observation?
In this paper, we propose to use a graph attentional network, named CoLight, to cooperatively learn to control traffic signals in a road network. Specifically, the proposed method learns an attention matrix $A$, where entry $A_{ij}$ is the learned attention weight from agent $j$ to agent $i$. The information of intersections multiple hops away is captured through a graph convolutional network, and the differences of their influence on the target intersection are learned through multi-head attention (Velickovic et al., 2017). We summarize our main contributions as follows:
We propose a model, CoLight, that learns network-level cooperation for traffic signal control using the graph attentional neural network.
We conduct experiments using both synthetic and real-world data, and the experiments demonstrate that the proposed model significantly outperforms the state-of-the-art methods.
To the best of our knowledge, we are the first to learn traffic signal cooperation at the scale of hundreds of traffic signals. Our model adopts the paradigm of centralized training and distributed execution with all intersections sharing the same model, which makes it scalable to large-scale road networks. Existing approaches only apply RL without cooperation on networks with fewer than 60 intersections.
2. Related Work
Conventional coordinated methods (Koonce et al., 2008) and systems (Lowrie, 1992; Hunt et al., 1981, 1982) in transportation usually coordinate traffic signals by modifying the offset (i.e., the time interval between the beginnings of green lights) between consecutive intersections and require the intersections to have the same cycle length. But this type of method can only optimize the traffic for certain pre-defined directions (Gartner et al., 1991). In fact, it is not an easy task to coordinate the offsets for traffic signals in a network. For network-level control, Max-pressure (Varaiya, 2013) is a state-of-the-art signal control method which greedily takes actions that maximize the throughput of the network, under the assumption that the downstream lanes have unlimited capacity. Other traffic control methods like TUC (Diakaki et al., 2002) also use optimization techniques to minimize vehicle travel time and/or the number of stops at multiple intersections under certain assumptions, such as that the traffic flow is uniform during a certain time period. However, such assumptions often do not hold in the network setting and therefore prevent these methods from being widely applied.
Recently, reinforcement learning techniques have been proposed to control traffic signals because of their capability of online optimization without prior knowledge about the given environment. (Prashanth and Bhatnagar, 2011) directly trains one central agent to decide the actions for all intersections, but it cannot scale up due to the curse of dimensionality in the joint action space. To mitigate this issue, independent RL methods (Wiering, 2000; Camponogara and Kraus, 2003) train a number of RL agents separately, one for each intersection. When multiple agents interact with the environment at the same time, non-stationary impacts from neighboring agents are brought into the environment, and the learning process usually cannot converge to stationary policies if there are no communication or coordination mechanisms among agents (Bishop et al., 2006; Nowé et al., 2012). Improvements can be made by using neighboring information for cooperation: (Silva et al., 2006; El-Tantawy et al., 2013; Dresner and Stone, 2006; Wiering, 2000) add downstream information into states, (Arel et al., 2010; Wiering, 2000) add all neighboring states, and (Nishi et al., 2018) adds neighbors’ hidden states. However, in these methods, the information from different neighbors is treated as equally important, while the influences of neighbors change with the dynamic traffic flow. Even when the traffic flow is static, kinematic-wave theory (Hayes, 1970) from the transportation area shows that upstream intersections can have larger influence than downstream intersections. To address the shortcomings of prior methods, our proposed method leverages the attention mechanism to learn and assign different weights to different intersections in a neighborhood.
It is worth noting that cooperation can also be implemented by jointly modeling the actions of multiple road intersections (Yang et al., 2018). For example, studies (van der Pol et al., 2016; Kuyer et al., 2008) have proposed to jointly model two adjacent intersections. Our work on communication modeling is parallel to this joint modeling of actions.
3. Problem Definition
In this section, we present the problem of traffic signal control as a Markov Game. Each intersection in the system is controlled by an agent. Given that each agent observes part of the total system condition, we would like to proactively decide which phases all the intersections in the system should change to, so as to minimize the average queue length on the lanes around the intersections. Specifically, the problem is characterized by the following five major components:
System state space $\mathcal{S}$ and observation space $\mathcal{O}$. We assume that there are $N$ intersections in the system and each agent can observe part of the system state $s \in \mathcal{S}$ as its observation $o \in \mathcal{O}$. In this work, we define the observation $o_i^t$ for agent $i$ at time $t$, which consists of its current phase (which direction has the green light) and the number of vehicles on each lane at time $t$.
Set of actions $\mathcal{A}$. In the traffic signal control problem, at time $t$, an agent $i$ chooses an action $a_i^t$ from its candidate action set $\mathcal{A}_i$ as a decision for the next period of time. Here, each intersection chooses a phase $p$ as its action $a_i^t$ from its pre-defined phase set, indicating that from time $t$ to $t + \Delta t$, this intersection will be in phase $p$.
Transition probability $\mathcal{P}$. Given the system state $s^t$ and the corresponding joint actions $a^t$ of all agents at time $t$, the system arrives at the next state $s^{t+1}$ according to the state transition probability $\mathcal{P}(s^{t+1} \mid s^t, a^t): \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \rightarrow \Omega(\mathcal{S})$, where $\Omega(\mathcal{S})$ denotes the space of state distributions.
Reward $r$. Each agent $i$ obtains an immediate reward $r_i^t$ from the environment at time $t$ through a reward function. In this paper, we want to minimize the travel time for all vehicles in the system, which is hard to optimize directly. Therefore, we define the reward for intersection $i$ as $r_i^t = -\sum_{l} u_{i,l}^t$, where $u_{i,l}^t$ is the queue length on the approaching lane $l$ at time $t$.
Policy $\pi$ and discount factor $\gamma$. Intuitively, the joint actions have long-term effects on the system, so we want to minimize the expected queue length of each intersection in each episode. Specifically, at time $t$, each agent chooses an action following a certain policy $\pi$, aiming to maximize its total reward $G_i^t = \sum_{t'=t}^{T} \gamma^{t'-t} r_i^{t'}$, where $T$ is the total number of time steps of an episode and $\gamma \in [0,1]$ differentiates the rewards in terms of temporal proximity.
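As a small illustrative sketch (not the authors' code; the function names and sample numbers are ours), the per-step reward and the discounted return defined above can be computed as:

```python
def reward(queue_lengths):
    """r_i^t: negative sum of queue lengths on the approaching lanes."""
    return -sum(queue_lengths)

def discounted_return(rewards, gamma=0.9):
    """G^t = sum over t' of gamma^(t'-t) * r^(t'), over the rest of the episode."""
    g = 0.0
    for r in reversed(rewards):   # fold from the end of the episode backwards
        g = r + gamma * g
    return g
```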
In this paper, we use the action-value function $Q(o_i^t, a_i^t; \theta_t)$ for each agent $i$ at the $t$-th iteration (parameterized by $\theta_t$)
to approximate the total reward with neural networks by minimizing the loss:
(1) $\mathcal{L}(\theta_t) = \mathbb{E}\big[\big(r_i^t + \gamma \max_{a'} Q(o_i^{t+1}, a'; \theta_t^{-}) - Q(o_i^t, a_i^t; \theta_t)\big)^2\big]$
where $o_i^{t+1}$ denotes the next observation for $o_i^t$ and $\theta_t^{-}$ denotes an earlier snapshot of the parameters. These earlier snapshots of parameters are periodically updated with the most recent network weights and help increase the learning stability by decorrelating predicted and target q-values.
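The temporal-difference loss above can be sketched as follows; this is a hedged illustration in which `q` and `q_target` stand in for the online network and the parameter-snapshot (target) network, not the paper's actual implementation:

```python
import numpy as np

def td_loss(q, q_target, obs, action, reward, next_obs, gamma=0.9):
    """Squared TD error: the target uses the older snapshot q_target to
    decorrelate predicted and target q-values."""
    target = reward + gamma * np.max(q_target(next_obs))
    return (target - q(obs)[action]) ** 2
```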
4. Method
In this section, we first introduce the proposed cooperative RL network structure, as Fig. 1 illustrates, from the bottom to the top layer: the observation embedding layer, the interior neighborhood cooperation layers, and the final q-value prediction layer. Then we discuss its time and space complexity compared with other methods of signal control for multiple intersections.
4.1. Observation Embedding
Given the raw data of the local observation, i.e., the number of vehicles on each lane and the phase the signal is currently in, we first embed this $k$-dimensional data into an $m$-dimensional latent space via one layer of a Multi-Layer Perceptron (MLP):
(2) $h_i = \mathrm{Embed}(o_i^t) = \sigma(o_i^t W_e + b_e)$
where $o_i^t$ is intersection $i$'s observation at time $t$ and $k$ is the feature dimension of $o_i^t$; $W_e \in \mathbb{R}^{k \times m}$ and $b_e \in \mathbb{R}^{m}$ are the weight matrix and bias vector to learn; $\sigma$ is the ReLU function (the same notation is used in the following). The generated hidden state $h_i$ represents the current traffic condition of the $i$-th intersection.
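A minimal sketch of this embedding step, assuming `obs` is the $k$-dimensional observation vector and `W_e`, `b_e` are the learned parameters (supplied by the caller here, learned in training):

```python
import numpy as np

def embed(obs, W_e, b_e):
    """Eq. (2): one ReLU layer mapping the k-dim observation to an m-dim state."""
    return np.maximum(0.0, obs @ W_e + b_e)
```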
4.2. Neighborhood Cooperation via Attention
In order to cooperate with neighboring intersections, the CoLight agent considers the traffic conditions of neighbors or even distant intersections through layers of interaction. Neighborhood cooperation is necessary in a multi-agent reinforcement learning (MARL) environment, since the evaluation of the policy conducted by each agent depends not only on its observable surroundings, but also on the policies of other agents (Nowé et al., 2012; de Oliveira and Bazzan, 2009).
The CoLight agent learns to communicate for cooperation within a neighborhood by leveraging the attention mechanism, by which a summary of the representations of neighboring intersections is generated according to the representation of the target intersection. The attention mechanism is widely employed in diverse domains to boost model accuracy (Yao et al., 2018; You et al., 2016; Cheng et al., 2018; Jiang et al., 2018).
4.2.1. Observation Interaction
To learn the importance of the information from intersection $j$ (source intersection) in determining the policy for intersection $i$ (target intersection), we first embed the representations of the two intersections from the previous layer and calculate the importance of the source to the target with the following dot-product operation:
(3) $e_{ij} = (h_i W_t) \cdot (h_j W_s)^{\top}$
where $W_s$ and $W_t$ are embedding parameters for the source and target intersections, respectively. The scalar $e_{ij}$ depicts the importance of the information from intersection $j$ in determining the policy for intersection $i$.
Note that the interaction between two intersections is not reciprocal, i.e., $e_{ij}$ is not necessarily equal to $e_{ji}$. Imagine the scenario illustrated by Fig. 1(a), where vehicles are running on the unidirectional 9th Avenue, Manhattan, New York. As the traffic flow goes from intersection 950 to intersection 949, the information related to intersection 950 is important for intersection 949 to prepare for the future traffic ($e_{949,950}$ should be quite large), while intersection 949 should pay little attention to the traffic condition of the downstream intersection 948 ($e_{949,948}$ should be extremely small).
4.2.2. Attention Distribution within Neighborhood Scope
The number of road arms varies across intersections in different regions, e.g., the four-way intersections in Manhattan, the five-way intersection in the Five Points district of Atlanta, etc. Therefore, the previously computed interaction score between two intersections has distinct implications for different kinds of intersections. To retrieve a general attention value between the source and target intersections, we further normalize the interaction scores between the target intersection and the intersections within its neighborhood:
(4) $\alpha_{ij} = \dfrac{\exp(e_{ij}/\tau)}{\sum_{l \in \mathcal{N}_i} \exp(e_{il}/\tau)}$
where $\tau$ is the temperature factor and $\mathcal{N}_i$ is the set of intersections in the target intersection's neighborhood scope. The neighborhood of target $i$ contains its top-$|\mathcal{N}_i|$ closest intersections, and the distance can be defined in multiple ways. For example, we can construct the neighborhood scope for target intersection $i$ through:
road distance: the Manhattan distance between two intersections' geo-locations.
node distance: the smallest hop count between two nodes over the network, with each node as an intersection.
Note that intersection $i$ itself is also included in $\mathcal{N}_i$ to help the agent become aware of how much attention should be put on its own traffic condition.
The general attention score is beneficial not only because it applies to all kinds of road network structures (intersections with different numbers of arms), but also because it relaxes the concept of “neighborhood”. Without loss of generality, the target can even take some other intersections into $\mathcal{N}_i$ although they are not adjacent to it. For instance, one four-way intersection can determine its signal policy based on information from five nearby intersections, four of which are the adjacent neighbors while the fifth is disconnected from but geographically close to the target intersection.
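The node-distance variant of the neighborhood scope can be sketched with a breadth-first search; the adjacency mapping `adj` and the function name are our own illustrative choices:

```python
from collections import deque

def neighborhood_scope(adj, target, k):
    """Top-k closest intersections to `target` by hop count; the target itself
    is included, matching the text. `adj` maps an intersection id to the ids
    of its adjacent intersections."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # sort by (hop distance, id) and keep the k nearest
    return sorted(dist, key=lambda v: (dist[v], v))[:k]
```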
4.2.3. Neighborhood Cooperation
The cooperation among multiple intersections is finally achieved by combining the representations of the source intersections weighted by their respective importance to the target intersection:
(5) $h_i' = \sigma\big( W_q \cdot \textstyle\sum_{j \in \mathcal{N}_i} \alpha_{ij} (h_j W_c) + b_q \big)$
where $W_c$ is a weight matrix for the source intersection embedding, and $W_q$ and $b_q$ are trainable variables. The weighted sum of the neighborhood representations accumulates the key information from the surrounding environment for performing an efficient signal policy.
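Putting the interaction, normalization, and cooperation steps together for a single target intersection, a NumPy sketch (the matrix names mirror the text; the shapes and calling convention are our assumptions, not the authors' code):

```python
import numpy as np

def neighborhood_cooperation(h_i, h_neighbors, W_t, W_s, W_c, W_q, b_q, tau=1.0):
    """Score each neighbor by a bilinear interaction, normalize with a
    temperature softmax over the neighborhood, and return the ReLU of the
    weighted sum of projected neighbor states."""
    scores = np.array([(h_i @ W_t) @ (h_j @ W_s) for h_j in h_neighbors])
    z = scores / tau
    exp = np.exp(z - np.max(z))               # numerically stable softmax
    alpha = exp / exp.sum()                   # attention weights
    summary = sum(a * (h_j @ W_c) for a, h_j in zip(alpha, h_neighbors))
    return np.maximum(0.0, summary @ W_q + b_q)
```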
In Fig. 1(b) and (c), there is only one-way traffic across intersection 949 (north to south on 9th Avenue, east to west on West 49th Street). The agent for intersection 949 acquires the knowledge of its adjacent intersections directly from the first NCvA (Neighborhood Cooperation via Attention) layer. Meanwhile, the emphasis on the four neighbors is quite distinct due to the unidirectional traffic flow, i.e., a higher attention score for the upstream intersection 950 (red marked) than for the downstream intersection 948 (green marked). Since the hidden states of the adjacent neighbors from the first NCvA layer carry their respective neighborhood messages, in the second NCvA layer the cooperation scope of intersection 949 significantly expands (light blue shadow in Fig. 1(c)) to 8 intersections. Such additional information helps the target intersection learn the flow trend and rely more on the upstream intersections and less on the downstream ones when taking actions. As a result, the attention scores on the upstream intersections grow higher while those on the downstream intersections become lower. Two hidden NCvA layers offer the agent the chance to detect environment dynamics one hop away; more hops of view are accessible if the neural network has more NCvA layers.
In summary, the graph-level attention employed in the NCvA layers allows the agents to adjust their focus according to the dynamic traffic and to sense the environment on a larger scale.
4.3. Multi-head Attention
The cooperating information for the $i$-th intersection computed above captures one type of relationship with neighboring intersections. To jointly attend to the neighborhood from different representation subspaces at different positions, we extend the previous single-head attention in the neural network to multi-head attention, as much recent work does (Vaswani et al., 2017; Velickovic et al., 2017). Specifically, the attention function (the procedures of Observation Interaction, Attention Distribution, and Neighborhood Cooperation) with different linear projections (multiple sets of trainable parameters $\{W_t^h, W_s^h, W_c^h\}$) is performed in parallel, and the different versions of the neighborhood condition summarization are averaged as $h_i'$:
(6) $e_{ij}^{h} = (h_i W_t^{h}) \cdot (h_j W_s^{h})^{\top}$
(7) $\alpha_{ij}^{h} = \dfrac{\exp(e_{ij}^{h}/\tau)}{\sum_{l \in \mathcal{N}_i} \exp(e_{il}^{h}/\tau)}$
(8) $h_i' = \sigma\Big( W_q \cdot \dfrac{1}{H} \textstyle\sum_{h=1}^{H} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{h} (h_j W_c^{h}) + b_q \Big)$
where $H$ is the number of attention heads. Besides averaging, concatenating the products of multi-head attention is another feasible way to combine the multiple types of neighborhood cooperation.
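The averaging over heads can be sketched generically; here `head_fn` stands for the single-head neighborhood-summary computation and is an assumption of this illustration:

```python
import numpy as np

def multi_head_summary(h_i, h_neighbors, head_params, head_fn):
    """Run the single-head summary with H independent parameter sets and
    average the per-head results (Eq. (8) before the final projection)."""
    summaries = [head_fn(h_i, h_neighbors, *p) for p in head_params]
    return np.mean(summaries, axis=0)   # average over the H heads
```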
4.4. Q-value Prediction
As illustrated in Fig. 1(a), each hidden layer of the CoLight model learns the neighborhood representation through the methods introduced in Section 4.2. We denote this layer-wise cooperation procedure by NCvA (Neighborhood Cooperation via Attention); the forward propagation of input data in CoLight can then be formatted as follows:
(9) $q(o_i^t) = h_i^{(L)} W_p + b_p$, with $h_i^{(0)} = \mathrm{Embed}(o_i^t)$ and $h_i^{(l)} = \mathrm{NCvA}\big(h^{(l-1)}\big)$ for $l = 1, \dots, L$
where $W_p \in \mathbb{R}^{m \times P}$ and $b_p \in \mathbb{R}^{P}$ are parameters to learn, $P$ is the number of phases (the action space), $L$ is the number of NCvA layers, and $q(o_i^t)$ is the vector of predicted q-values.
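A sketch of the prediction head, with the greedy phase chosen as the argmax over the predicted q-values (the function names are ours):

```python
import numpy as np

def predict_q(h_final, W_p, b_p):
    """Linear head mapping the final hidden state to one q-value per phase."""
    return h_final @ W_p + b_p

def greedy_phase(h_final, W_p, b_p):
    """Pick the phase with the largest predicted q-value."""
    return int(np.argmax(predict_q(h_final, W_p, b_p)))
```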
According to Eq. (1), the loss function for CoLight to optimize the current policy is:
(10) $\mathcal{L}(\theta) = \dfrac{1}{T} \sum_{t=1}^{T} \dfrac{1}{N} \sum_{i=1}^{N} \big( r_i^t + \gamma \max_{a'} q(o_i^{t+1}, a'; \theta^{-}) - q(o_i^t, a_i^t; \theta) \big)^2$
where $T$ is the total number of time steps that contribute to the network update, $N$ is the number of intersections in the whole road network, and $\theta$ represents all the trainable variables in CoLight.
4.5. Complexity Analysis
Although CoLight spends additional time and parameters to learn the cooperation from the neighborhood representation, both the time and the space it demands are approximately $O(Lm^2)$ (with $m$ the hidden layer size and $L$ the number of layers), which is irrelevant to the number of intersections. Hence CoLight is scalable even if the road network contains hundreds or even thousands of intersections.
4.5.1. Space Complexity
If there are $L$ hidden layers and each layer has $m$ neurons, then the sizes of the weight matrices and bias vectors in each component of CoLight are: 1) Observation Embedding layer: $km + m$; 2) interior Neighborhood Cooperation layers: $O(Lm^2)$; 3) Q-value Prediction layer: $mP + P$. Hence the total number of learnable parameters to store is $O(km + Lm^2 + mP)$. Normally, the size of the hidden layer ($m$) is far greater than the number of layers ($L$) and the size of the phase space ($P$), and comparable to the input dimension ($k$). Therefore, the space complexity of CoLight is approximately $O(Lm^2)$.
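The parameter count can be checked with a back-of-the-envelope helper; the exact per-layer bookkeeping (e.g., how many matrices each attention head holds) is our assumption, so this illustrates the order of growth rather than an exact count:

```python
def approx_param_count(k, m, L, P, heads_per_layer=4):
    """Rough parameter count: embedding (k*m + m), L cooperation layers of
    order m^2 each (W_t, W_s, W_c per head, plus W_q and b_q), and a
    prediction head (m*P + P)."""
    embed = k * m + m
    cooperate = L * (3 * heads_per_layer * m * m + m * m + m)
    predict = m * P + P
    return embed + cooperate + predict
```

Note that the result does not depend on the number of intersections, which is the point of the space-complexity argument.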
If we leverage $N$ separate RL models (no cooperation) to control the signals at $N$ intersections, then the space complexity is $O(NLm^2)$, which is infeasible when $N$ is extremely large, as in city-level traffic signal control. To scale up, the simplest solution is to allow all the intersections to share parameters and maintain one model; in this case the space complexity is $O(Lm^2)$, identical to that of CoLight.
4.5.2. Time Complexity
We assume that: 1) all the agents leverage CoLight to predict q-values for their corresponding intersections concurrently; 2) the multiple heads of attention are computed independently, so they are as fast as single-head attention; 3) the embeddings of the source and target intersection conditions via $W_s$, $W_t$, and $W_c$ are separate processes that can also be executed at the same time; and 4) for one target intersection, the interactions with all its neighbors are computed simultaneously. Then the time complexity (only multiplication operations considered, since the addition procedures are relatively insignificant) of each component of CoLight is: 1) Observation Embedding layer: $O(km)$; 2) interior Neighborhood Cooperation layers: $O(Lm^2)$; 3) Q-value Prediction layer: $O(mP)$. Hence the time complexity is $O(km + Lm^2 + mP)$, which is similarly approximately $O(Lm^2)$.
Either the individual RL models or a shared single RL model for signal control at multiple intersections requires $O(Lm^2)$ computation under the same parallelism assumptions, approaching that of CoLight.
5. Experiments
Following the tradition of traffic signal control studies (Wei et al., 2018), we conduct experiments on SUMO (Simulation of Urban MObility)¹. In order to support large-scale reinforcement learning, we implement a multi-threaded version of SUMO. After the traffic data is fed into the simulator, a vehicle moves towards its destination according to the setting of the environment. The simulator provides the state to the signal control method and executes the traffic signal actions from the control method. Following the tradition, each green signal is followed by a three-second yellow signal and a two-second all-red time².

¹ https://sourceforge.net/projects/sumo/
² The code and the public datasets used in this paper are available online: https://www.dropbox.com/sh/bp4ak3fc8wv8p1j/AAA2GWClMmzVOWLCfrY03NAxa?dl=0
In a traffic dataset, each vehicle is described as $(o, t, d)$, where $o$ is the origin location, $t$ is the time, and $d$ is the destination location. Locations $o$ and $d$ are both locations on the road network. The traffic data is taken as input for the simulator.
In a multi-intersection network setting, we use the real road network to define the network in the simulator. Unless otherwise specified, each intersection in the road network is set to be a four-way intersection, with four 300-meter-long road segments.
| Dataset | # intersections | Arrival rate (vehicles/300s): Mean | Std | Max | Min |
|---|---|---|---|---|---|
|  | 48 | 240.79 | 10.08 | 274 | 216 |
|  | 12 | 250.70 | 38.21 | 335 | 208 |
|  | 16 | 526.63 | 86.70 | 676 | 256 |
5.1. Datasets
5.1.1. Synthetic Data
In the experiment, we use two kinds of synthetic data, i.e., uni- and bi-directional traffic, on the following road networks:
: A three-intersection arterial, to investigate the effectiveness of our attention mechanism in a case study.
: A grid network, to compare the convergence speed of different RL methods.
: A grid network, to evaluate the effectiveness and efficiency of different methods.
: A large-scale road network, to show the scalability and effectiveness of different methods.
| Model | Uni | Bi |  |  |  |
|---|---|---|---|---|---|
| Fixed-time | 209.68 | 209.68 | 1831.37 | 728.79 | 869.85 |
| MaxPressure | 186.07 | 194.96 | 404.71 | 422.15 | 361.33 |
| CGRL (van der Pol et al., 2016) | 1532.75 | 2884.23 | 1888.47 | 1582.26 | 1210.70 |
| Neighbor RL (Arel et al., 2010) | 240.68 | 248.11 | 1780.73 | 1053.45 | 1168.32 |
| Individual RL (Wei et al., 2018) | 314.82 | 261.60 | – | 345.00 | 325.56 |
| OneModel | 181.81 | 242.63 | 1777.87 | 394.56 | 728.63 |
| GCN (Nishi et al., 2018) | 205.40 | 272.14 | 1374.01 | 768.43 | 625.66 |
| CoLight_node | 178.42 | 176.71 | 172.80 | 331.50 | 340.70 |
| CoLight | 173.79 | 176.32 | 158.13 | 297.26 | 291.14 |

–: No result, as Individual RL cannot scale up to the 48 intersections in New York's road network.
Each intersection in the synthetic road network has four directions (West-East, East-West, South-North, North-South), and 3 lanes (300 meters in length and 3 meters in width) for each direction. The traffic arrives uniformly at 300 vehicles/lane/hour in the West-East direction and 90 vehicles/lane/hour in the South-North direction.
5.1.2. Real-world Data
We also use real-world traffic data from three cities: New York, Hangzhou, and Jinan. Their road networks are imported from OpenStreetMap³, as shown in Fig. 2, and their traffic flows are processed from multiple sources, with data statistics listed in Table 1.

³ https://www.openstreetmap.org
5.2. Compared Methods
We compare our model with the following two categories of methods: conventional transportation methods and RL methods.
5.2.1. Transportation Methods
Fixed-time (Koonce et al., 2008): Fixed-time control with random offsets. This method uses a pre-determined plan for cycle length and phase time, and is widely used when the traffic flow is steady.
MaxPressure (Varaiya, 2013): Max-pressure control is a state-of-the-art network-level traffic signal control method in the transportation field, which greedily chooses the phase that maximizes the pressure (a pre-defined metric based on upstream and downstream queue lengths).
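A hedged sketch of this greedy rule (the data layout, mapping each phase to the (upstream queue, downstream queue) pairs of the movements it permits, is our simplification of the pressure metric):

```python
def max_pressure_phase(phases):
    """Pick the phase with the largest pressure, where pressure is the sum of
    (upstream queue - downstream queue) over the movements the phase permits."""
    def pressure(movements):
        return sum(up - down for up, down in movements)
    return max(phases, key=lambda p: pressure(phases[p]))
```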
5.2.2. RL Methods
We compare our method with 5 baseline RL methods. For fair comparisons, all the RL models are learned without any pre-trained parameters.
CGRL (van der Pol et al., 2016): A coordinated RL approach for multi-intersection signal control. Specifically, the cooperation is achieved by designing a coordination graph, and the method learns to optimize the joint action between two intersections.
Individual RL (Wei et al., 2018): An individual deep RL approach that does not consider neighbor information. Each intersection is controlled by one agent, and the agents do not share parameters but update their own networks independently.
OneModel: This method uses the same state and reward as Individual RL in its agent design, considering only the traffic conditions on the roads connecting to the controlled intersection. Instead of maintaining their own parameters, all the agents share the same policy network.
Neighbor RL (Arel et al., 2010): Based on OneModel, agents concatenate their neighboring intersections' traffic conditions with their own, and all the agents share the same parameters. Hence its observation feature space is larger than that of OneModel.
GCN (Nishi et al., 2018): An RL-based traffic signal control method that uses a graph convolutional neural network to automatically extract the traffic features of adjacent intersections. This method treats the traffic conditions of all neighbors identically.
5.2.3. Variants of Our Proposed Method
CoLight: The neighborhood scope of an intersection is constructed through Manhattan distance.
CoLight_node: The neighborhood scope is constructed through node distance, i.e., the smallest hop count between two intersections in the road network.
5.3. Evaluation Metric
Following existing studies, we use the average travel time to evaluate the performance of different models for traffic signal control. It is the average time (in seconds) that all vehicles spend between entering and leaving the area, and is the most frequently used performance measure for traffic signal control in the transportation field.
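The metric can be sketched as follows, assuming the simulator provides each completed trip as an (entry time, exit time) pair:

```python
def average_travel_time(trips):
    """Mean duration (seconds) over all completed trips.
    trips: iterable of (enter_time, leave_time) pairs in seconds."""
    durations = [leave - enter for enter, leave in trips]
    return sum(durations) / len(durations)
```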
5.4. Effectiveness Comparison
5.4.1. Overall Analysis
Table 2 lists the performance of the two variants of the proposed CoLight, classic transportation models, and state-of-the-art learning methods on both synthetic and real-world datasets.
CoLight achieves consistent performance improvements over the state-of-the-art transportation (MaxPressure) and RL (Individual RL) methods across diverse road networks and traffic patterns, on both synthetic and real-world data.
The performance improvements are attributed to the benefits of the multi-hop view of the neighborhood and the dynamic cooperation that follows the traffic variation. The advantage of such multi-hop dynamic cooperation is especially evident when controlling signals in real-world cities, where road structures are more irregular and traffic flows are more dynamic.
The performance gap between the proposed CoLight and the conventional transportation method MaxPressure becomes larger as the evaluated data changes from synthetic regular traffic to real-world dynamic traffic. Such growing performance divergence conforms to the deficiency inherent in MaxPressure: it is incapable of learning from the feedback of the environment.
Baseline learning models show inferior performance due to the lack of either a comprehensive view of the environment or traffic-driven cooperation: 1) Limited view of the environment: without neighborhood cooperation, Individual RL can hardly achieve satisfactory results because it independently optimizes each single intersection's policy. Neighbor RL fails for signal control in part because each agent pays attention only to the adjacent intersections (unaware of intersections multiple hops away). 2) No traffic-driven cooperation: since New York and Hangzhou have unidirectional traffic, the importance of the traffic conditions at upstream intersections is quite different from that at downstream intersections. Neighbor RL can hardly distinguish the traffic conditions of important neighbors because it simply mixes the neighborhood information at the input stage. GCN does not work well on either of these datasets, as the agent weights the information from upstream and downstream intersections with static importance derived from prior geographic knowledge rather than real-time traffic flows.
5.4.2. Variation Study
As mentioned in Section 4.2.2, the neighborhood scope of an intersection can be defined in different ways. The results in Table 2 show that CoLight (using Manhattan distance) achieves performance similar to CoLight_node under synthetic data, but largely outperforms CoLight_node under real-world traffic. The reason could be that under synthetic data, where the lane lengths of all intersections are identical, Manhattan distance is equivalent to node distance. In the remainder of the experiments, we therefore only compare CoLight with the other methods.
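A Manhattan-distance neighborhood of the kind compared above can be sketched as follows. This is an illustrative helper under the assumption that intersections carry (x, y) grid coordinates; the function name and data layout are ours, not the paper's.

```python
def nearest_neighbors(target, coords, k):
    """Return the k intersections closest to `target` (itself excluded),
    ranked by Manhattan distance on (x, y) coordinates. Under a regular
    grid with uniform lane lengths, this ranking coincides with the
    node-distance (hop-count) ranking."""
    tx, ty = coords[target]
    others = [i for i in coords if i != target]
    # Stable sort: equidistant intersections keep their original order.
    others.sort(key=lambda i: abs(coords[i][0] - tx) + abs(coords[i][1] - ty))
    return others[:k]
```

On an irregular real-world network, where block lengths vary, the two definitions diverge, which is consistent with the gap observed between CoLight and CoLight_node on real-world traffic.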
5.4.3. Convergence Comparison
In Fig. 3, we compare CoLight's performance (average travel time for vehicles, evaluated at each episode) to the learning curves of the other five RL methods. On all the listed datasets, CoLight outperforms every baseline by a large margin in jumpstart performance (initial performance after the first episode), time to threshold (learning time to achieve a pre-specified performance level), and asymptotic performance (final learned performance). Learning the attention over the neighborhood does not slow down model convergence; instead, it accelerates the approach to the optimal policy.
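The three learning-curve criteria used above can be computed mechanically from a per-episode travel-time series. The sketch below is our own illustration of these standard criteria, not code from the paper; the threshold value is a free parameter.

```python
def curve_metrics(travel_times, threshold):
    """Summarize a learning curve (per-episode average travel time,
    lower is better) with three criteria: jumpstart performance,
    time to threshold, and asymptotic performance."""
    jumpstart = travel_times[0]    # performance after the first episode
    asymptotic = travel_times[-1]  # final learned performance
    # First episode whose travel time reaches the pre-specified level,
    # or None if the threshold is never reached.
    time_to_threshold = next(
        (ep for ep, t in enumerate(travel_times) if t <= threshold), None)
    return jumpstart, time_to_threshold, asymptotic
```

A curve that starts low, crosses the threshold early, and ends low dominates on all three criteria, which is the pattern CoLight exhibits in Fig. 3.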
From Fig. 3(a), we observe that Individual RL starts with an extremely long travel time and approaches the optimal performance only after a long training time. Such disparity of convergence speed in Fig. 3 agrees with our earlier space-complexity analysis (in Section 4.5.1): agents with shared models (CGRL, Neighbor RL, OneModel, GCN and CoLight) learn a single shared set of parameters, while Individual RL has to update a separate set of parameters for every intersection.
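The space-complexity argument can be made concrete with a simple parameter count: a shared model's size is independent of the number of intersections, while per-agent models grow linearly with it. The layer sizes and agent count below are illustrative placeholders, not the paper's actual architecture.

```python
def num_params(layer_sizes):
    """Parameter count of a dense network: weights plus biases
    for each consecutive pair of layers."""
    return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical 32 -> 64 -> 8 network on a 100-intersection grid.
shared = num_params([32, 64, 8])           # one model shared by all agents
individual = 100 * num_params([32, 64, 8])  # one model per intersection
```

With 100 intersections the individual-agent scheme must fit 100 times as many parameters from the same amount of experience, which matches the slow convergence of Individual RL observed in Fig. 3(a).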
5.4.4. Scalability Comparison
In Fig. 4, we compare CoLight's training time (total clock time for 100 episodes) to the running time of the other five RL methods. The time cost of CoLight is comparable to that of OneModel and GCN, and far lower than that of CGRL, Individual RL and Neighbor RL. Such high efficiency of CoLight is consistent with the time-complexity analysis (in Section 4.5.2), as most of the parallel-computation assumptions are satisfied in our experiments.
Note that the average travel time for Individual RL (in Table 2) is missing and its training-time bar (in Fig. 4) is estimated. Besides Individual RL, CGRL takes too much time to train (it is the second most time-consuming model) while delivering unsatisfactory results (the worst performance on all datasets). Hence, we compare the performance stability of CoLight on large road networks with the three other scalable RL methods (Neighbor RL, OneModel and GCN) as well as the conventional transportation methods (Fixedtime and MaxPressure).
We also test our model on a large-scale network. In Table 3, CoLight outperforms all the transportation and learning baselines under both the unidirectional (Uni) and bidirectional (Bi) settings.
Model                              Uni-direction   Bi-direction
Fixedtime                          340.62          340.62
MaxPressure                        296.54          321.86
Neighbor RL (Arel et al., 2010)    324.31          358.31
OneModel                           304.21          511.01
GCN (Nishi et al., 2018)           338.61          384.53
CoLight                            286.87          289.78
5.5. Hyperparameter Study
Impact of Neighbor Number. Our CoLight method introduces an additional hyperparameter that controls the number of neighbors for cooperation. In Fig. 7, we show how this hyperparameter impacts the performance and shed light on how to set it.
As the number of neighbors grows from 2 to 5, the performance of CoLight improves and reaches its optimum. Further adding nearby intersections to the neighborhood scope, however, leads to the opposite trend. As illustrated in Fig. 7, including more neighbors in the neighborhood results in massive relation learning, which requires more training. To determine the signal control policy for each intersection, computing attention scores only over four nearby intersections and the intersection itself appears adequate for cooperation, balancing both training time and performance.
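The attention over a target intersection and its selected neighbors can be sketched as a single-head graph-attention step in the spirit of Velickovic et al. (2017). This is a minimal illustration under our own simplifying assumptions (a shared projection matrix `W` and a concatenation-based scoring vector `a`); it is not CoLight's exact architecture.

```python
import numpy as np

def attention_pool(h_target, h_neighbors, W, a):
    """Single-head graph attention: score each projected neighbor
    against the target, softmax the scores into an attention
    distribution, and return it with the weighted neighbor sum."""
    z_t = W @ h_target
    z_n = [W @ h for h in h_neighbors]
    scores = np.array([a @ np.concatenate([z_t, z]) for z in z_n])
    e = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = e / e.sum()                 # attention distribution
    pooled = sum(w * z for w, z in zip(alpha, z_n))
    return alpha, pooled
```

Each additional neighbor adds one score to the softmax, so the relation-learning cost grows with the neighborhood size, which is why a small neighborhood of four neighbors plus the intersection itself already suffices.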
5.6. Attention Study
In this section, we study the attention distribution of CoLight evaluated on the synthetic and real-world networks, to analyze how well neighborhood cooperation is implemented via the attention mechanism.
5.6.1. Synthetic Arterial
In this section, we analyze the attention distribution for all the intersections along a synthetic arterial, as shown in Figure 5. The traffic along the arterial is unidirectional, making intersection 0 the upstream of intersections 1 and 2. We have the following observations:
In Figure 5(b), intersection 0 pays most of its attention to its own traffic condition, which coincides with the fact that the downstream intersections 2 and 3 have little impact on the policy of the upstream intersection 0.
Figure 5(c) shows that intersection 1 pays remarkable attention to intersection 0 compared with intersection 2. Such an attention distribution for cooperation is reasonable, as there are no side roads for intersection 1 and the major traffic comes from the upstream intersection 0.
In Figure 5(d), the downstream intersection 2 cares most about its own traffic flow while intelligently allocating appropriate attention to the two upstream intersections.
5.6.2. Real-world Network
In this section, we analyze the attention distribution that CoLight learns from the real data under different scenarios: upstream vs. downstream intersections, and arterial vs. side street.
Upstream vs. Downstream. Figure 6(a) shows an intersection (green dot) in New York, whose neighborhood includes four nearby intersections along the arterial. Traffic along the arterial is unidirectional (blue arrow). From the attention distribution learned by CoLight, we can see that while the majority of attention is allocated to itself, the upstream intersections (orange and blue dots) have larger scores than downstream intersections (red and purple dots).
Arterial vs. Side Street. Figure 6(b) shows an intersection (green dot) in Hangzhou, whose neighborhood includes two intersections along the arterial and two intersections on the side street. Arterial traffic is heavy and unidirectional, while side-street traffic is light and bidirectional. From the attention distribution learned by CoLight, we can see that the arterial intersections (orange and blue dots) have larger scores than the side-street intersections (red and purple dots).
6. Conclusion
In this paper, we propose a well-designed reinforcement learning approach to solve the network-level traffic light control problem. We conduct extensive experiments using synthetic and real-world data and demonstrate the superior performance of our proposed method over state-of-the-art methods. In addition, we show in-depth case studies and observations to understand how the attention mechanism helps cooperation.
We would like to point out several important future directions to make the method more applicable to the real world. First, the neighborhood scope can be determined in a more flexible way: traffic flow information between intersections could be utilized to determine the neighborhood, rather than a static number of nearby intersections. Second, the raw observation only includes the phase and the number of vehicles on each lane; external data such as road and weather conditions might help to boost model performance.
References
 Arel et al. (2010) Itamar Arel, Cong Liu, T Urbanik, and AG Kohls. 2010. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems 4, 2 (2010), 128–135.
 Bishop et al. (2006) Christopher M Bishop et al. 2006. Pattern recognition and machine learning (information science and statistics). (2006).
 Camponogara and Kraus (2003) Eduardo Camponogara and Werner Kraus. 2003. Distributed learning agents in urban traffic control. In Portuguese Conference on Artificial Intelligence. Springer, 324–335.
 Casas (2017) Noe Casas. 2017. Deep Deterministic Policy Gradient for Urban Traffic Light Control. arXiv preprint arXiv:1703.09035 (2017).
 Cheng et al. (2018) Weiyu Cheng, Yanyan Shen, Yanmin Zhu, and Linpeng Huang. 2018. A neural attention model for urban air quality inference: Learning the weights of monitoring stations. In ThirtySecond AAAI Conference on Artificial Intelligence.
 de Oliveira and Bazzan (2009) Denise de Oliveira and Ana LC Bazzan. 2009. Multi-agent learning on traffic lights control: effects of using shared information. In Multi-agent systems for traffic and transportation engineering. IGI Global, 307–321.
 Diakaki et al. (2002) Christina Diakaki, Markos Papageorgiou, and Kostas Aboudolas. 2002. A multivariable regulator approach to trafficresponsive networkwide signal control. Control Engineering Practice 10, 2 (2002), 183–195.
 Dresner and Stone (2006) Kurt Dresner and Peter Stone. 2006. Multi-agent traffic management: Opportunities for multi-agent learning. In Learning and Adaption in Multi-Agent Systems. Springer, 129–138.
 El-Tantawy et al. (2013) Samah El-Tantawy, Baher Abdulhai, and Hossam Abdelgawad. 2013. Multi-agent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto. IEEE Transactions on Intelligent Transportation Systems 14, 3 (2013), 1140–1150.
 Gartner et al. (1991) Nathan H Gartner, Susan F Assman, Fernando Lasaga, and Dennis L Hou. 1991. A multiband approach to arterial traffic signal optimization. Transportation Research Part B: Methodological 25, 1 (1991), 55–74.
 Hayes (1970) Wallace D Hayes. 1970. Kinematic wave theory. Proc. R. Soc. Lond. A 320, 1541 (1970), 209–226.
 Hunt et al. (1982) PB Hunt, DI Robertson, RD Bretherton, and M Cr Royle. 1982. The SCOOT online traffic signal optimisation technique. Traffic Engineering & Control 23, 4 (1982).
 Hunt et al. (1981) PB Hunt, DI Robertson, RD Bretherton, and RI Winton. 1981. SCOOT: a traffic responsive method of coordinating signals. Technical Report.
 Jiang et al. (2018) Jiechuan Jiang, Chen Dun, and Zongqing Lu. 2018. Graph Convolutional Reinforcement Learning for Multi-Agent Cooperation. arXiv preprint arXiv:1810.09202 (2018).
 Koonce et al. (2008) Peter Koonce et al. 2008. Traffic signal timing manual. Technical Report. United States. Federal Highway Administration.
 Kuyer et al. (2008) Lior Kuyer, Shimon Whiteson, Bram Bakker, and Nikos Vlassis. 2008. Multi-agent reinforcement learning for urban traffic control using coordination graphs. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 656–671.
 Liang et al. (2018) Xiaoyuan Liang, Xunsheng Du, Guiling Wang, and Zhu Han. 2018. Deep reinforcement learning for traffic light control in vehicular networks. arXiv preprint arXiv:1803.11115 (2018).
 Lowrie (1992) PR Lowrie. 1992. SCATS–a traffic responsive method of controlling urban traffic. Roads and traffic authority. NSW, Australia (1992).
 Nishi et al. (2018) Tomoki Nishi, Keisuke Otaki, Keiichiro Hayakawa, and Takayoshi Yoshimura. 2018. Traffic Signal Control Based on Reinforcement Learning with Graph Convolutional Neural Nets. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 877–883.
 Nowé et al. (2012) Ann Nowé, Peter Vrancx, and Yann-Michaël De Hauwere. 2012. Game Theory and Multi-agent Reinforcement Learning.
 Prashanth and Bhatnagar (2011) LA Prashanth and Shalabh Bhatnagar. 2011. Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems 12, 2 (2011), 412–421.
 Roess et al. (2004) Roger P Roess, Elena S Prassas, and William R McShane. 2004. Traffic engineering. Pearson/Prentice Hall.
 Silva et al. (2006) Bruno Castro Da Silva, Eduardo W. Basso, Filipo Studzinski Perotto, Ana L. C. Bazzan, and Paulo Martins Engel. 2006. Improving reinforcement learning with context detection. In International Joint Conference on Autonomous Agents & Multiagent Systems.
 van der Pol and Oliehoek (2016) Elise van der Pol and Frans A. Oliehoek. 2016. Coordinated Deep Reinforcement Learners for Traffic Light Control. In NIPS.
 Varaiya (2013) Pravin Varaiya. 2013. The maxpressure controller for arbitrary networks of signalized intersections. In Advances in Dynamic Network Modeling in Complex Transportation Systems. Springer, 27–66.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
 Velickovic et al. (2017) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 1, 2 (2017).
 Webster (1958) FV Webster. 1958. Traffic signal settings. Technical Report.
 Wei et al. (2018) Hua Wei, Guanjie Zheng, Huaxiu Yao, and Zhenhui Li. 2018. Intellilight: A reinforcement learning approach for intelligent traffic light control. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2496–2505.
 Wiering (2000) MA Wiering. 2000. Multi-agent reinforcement learning for traffic light control. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML'2000). 1151–1158.
 Yang et al. (2018) Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. 2018. Mean Field Multi-Agent Reinforcement Learning. arXiv preprint arXiv:1802.05438 (2018).
 Yao et al. (2018) Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, Yanwei Yu, and Zhenhui Li. 2018. Modeling SpatialTemporal Dynamics for Traffic Prediction. arXiv preprint arXiv:1803.01254 (2018).
 You et al. (2016) Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4651–4659.