CoLight: Learning Network-level Cooperation for Traffic Signal Control

Hua Wei, Nan Xu, Huichu Zhang, Guanjie Zheng, Xinshi Zang, Chacha Chen, Weinan Zhang, Yanmin Zhu, Kai Xu, and Zhenhui Li
Pennsylvania State University, Shanghai Jiao Tong University, Shanghai Tianrang Intelligent Technology Co., Ltd
{hzw77, gjz5038, jessieli}, {xunannancy, chacha1997, wnzhang, yzhu}, {zhc,xszang},

Cooperation is critical in multi-agent reinforcement learning (MARL). In the context of traffic signal control, good cooperation among the traffic signal agents enables the vehicles to move through intersections more smoothly. Conventional transportation approaches implement cooperation by pre-calculating the offsets between two intersections. Such pre-calculated offsets are not suitable for dynamic traffic environments.

To incorporate cooperation in reinforcement learning (RL), two typical approaches have been proposed to take the influence of other agents into consideration: (1) learning the communications (i.e., the representation of influences between agents) and (2) learning joint actions for agents. While joint modeling of actions has been the preferred direction in recent studies, how to improve the learning of communications between agents has not been systematically studied in the context of traffic signal control.

To learn the communications between agents, in this paper, we propose to use a graph attentional network to facilitate cooperation. Specifically, for a target intersection in a network, our proposed model, CoLight, can not only incorporate the influences of neighboring intersections but also learn to differentiate their impacts on the target intersection. To the best of our knowledge, we are the first to use a graph attentional network in the setting of reinforcement learning for traffic signal control. In experiments, we demonstrate that by learning the communication, the proposed model can achieve surprisingly good performance, whereas the existing approaches based on joint action modeling fail to learn well.


1. Introduction

Traffic signals decide how smoothly vehicles move in the city. However, designing an efficient traffic signal control system is quite complicated because the traffic situations are highly dynamic. Recent advances in technology provide increasingly available real-time traffic data collected from sources such as navigation systems and surveillance cameras. It has drawn increasing attention from data science researchers to look into this traditional but important problem.

Traffic signal control is a core research topic in the transportation field (Roess et al., 2004). The typical way to solve this problem is to formulate it as an optimization problem and solve it under certain assumptions (e.g., uniform arrival rate (Webster, 1958; Roess et al., 2004) and unlimited lane capacity (Varaiya, 2013)). Such methods, however, do not perform well because the assumptions do not hold in the real world.

Recently, researchers have started to investigate reinforcement learning (RL) techniques for traffic signal control (Wiering, 2000; van der Pol et al., 2016; Wei et al., 2018). In a typical RL setting for traffic signal control, each intersection is treated as an agent. An agent takes the representation of its local traffic situation from the environment as its observation $o$. Reward is often defined as a measure correlated with travel time, and the action $a$ is the choice of a traffic signal phase. If the model is Q-learning (the most widely used model in recent literature), we are basically learning a value function $Q(o, a)$, which gives the score for taking action $a$ in observation $o$. Unlike conventional approaches, such RL methods avoid making strong assumptions and directly learn good strategies from the dynamic environment in a trial-and-error manner. They have been shown to be more effective than conventional optimization-based transportation approaches (Wiering, 2000; Wei et al., 2018). However, most RL methods have focused on individual intersections (Wei et al., 2018; Casas, 2017; Liang et al., 2018), whereas the cooperation among traffic signal agents has not been extensively discussed.

Cooperation is important in traffic signal control because the actions of one signal could affect the other, especially when two intersections are spatially close and when the traffic flow is large. Good cooperation among the traffic signal agents enables the vehicles to move through intersections more smoothly. Take a look at the following examples.

Example 1.1 (Overflow).

When a lane is already fully congested, if its upstream intersection continues to give a green signal, it will intensify the traffic jam and also waste the opportunity to give a green signal to the competing direction.

Example 1.2 (Green Wave).

During the morning peaks, there is often a large number of vehicles moving from residential areas to working areas. If it is all green signals along the way, it will increase the city-wide transportation efficiency. Note that, since there are many commuting routes and these routes compete at the intersections they meet, the solution is not to simply create green waves. The cooperation between the intersections is important.

To achieve cooperation in RL, the most straightforward way is to inform the target agent of other agents’ local observations through communication, i.e., expanding the observation of the target agent to a larger range for more comprehensive information. However, the more comprehensive the observation is, the more parameters we need to estimate, and the longer the learning time will be. Therefore, people tend to select only the relevant agents to be included in the observation. For example, the information of adjacent intersections could be included in the observations (El-Tantawy et al., 2013; Dresner and Stone, 2006; Silva et al., 2006; Arel et al., 2010). However, the information from intersections that are more than one hop away could also be useful, as in Example 1.2. In addition, two intersections that are at the same hop distance from the target intersection might play different roles in the decision at the target intersection. For example, suppose intersections $A$ and $B$ are adjacent intersections on a major road, while intersection $C$ is on a side road linked with $B$. The information from $A$ is then more useful to $B$ than that from $C$ is to $B$. We ask the question: can the model automatically learn which agents' information to use for cooperation and how to use it in the observation?

In this paper, we propose a graph attentional network named CoLight to cooperatively learn to control traffic signals in a road network. Specifically, the proposed method learns an attention matrix whose entries are the learned attention weights between pairs of agents. The information of intersections multi-hop away is captured through a graph convolutional network and the differences of their influence on the target intersection are learned through multi-head attention (Velickovic et al., 2017). We summarize our main contributions as follows:

We propose a model, CoLight, that learns network-level cooperation for traffic signal control using the graph attentional neural network.

We conduct experiments using both synthetic and real-world data, and the experiments demonstrate that the proposed model significantly outperforms the state-of-the-art methods.

To the best of our knowledge, we are the first to learn traffic signal cooperation in the scale of hundreds of traffic signals. Our model adopts the paradigm of centralized training and distributed execution with all intersections sharing the same model, which makes it scalable to large-scale road networks. Existing approaches only apply RL without cooperation on the network with fewer than 60 intersections.

2. Related Work

Conventional coordinated methods (Koonce et al., 2008) and systems (Lowrie, 1992; Hunt et al., 1981, 1982) in transportation usually coordinate traffic signals by modifying the offset (i.e., the time interval between the beginnings of green lights) between consecutive intersections and require the intersections to have the same cycle length. But this type of method can only optimize the traffic for certain pre-defined directions (Gartner et al., 1991). In fact, coordinating the offsets for traffic signals across a network is not an easy task. For network-level control, Max-pressure (Varaiya, 2013) is a state-of-the-art signal control method which greedily takes actions that maximize the throughput of the network, under the assumption that the downstream lanes have unlimited capacity. Other traffic control methods like TUC (Diakaki et al., 2002) also use optimization techniques to minimize vehicle travel time and/or the number of stops at multiple intersections under certain assumptions, such as that traffic flow is uniform in a certain time period. However, such assumptions often do not hold in the network setting and therefore prevent these methods from being widely applied.

Recently, reinforcement learning techniques have been proposed to control traffic signals for their capability of online optimization without prior knowledge about the given environment. (Prashanth and Bhatnagar, 2011) directly trains one central agent to decide the actions for all intersections, but it cannot scale up due to the curse of dimensionality in the joint action space. To mitigate this issue, independent RL methods (Wiering, 2000; Camponogara and Kraus, 2003) have been proposed, which train a set of RL agents separately, one for each intersection. When multiple agents interact with the environment at the same time, the non-stationary impacts from neighboring agents are brought into the environment, and the learning process usually cannot converge to stationary policies if there are no communication or coordination mechanisms among agents (Bishop et al., 2006; Nowé et al., 2012). Improvements can be made by using neighboring information for cooperation: (Silva et al., 2006; El-Tantawy et al., 2013; Dresner and Stone, 2006; Wiering, 2000) add downstream information into states, (Arel et al., 2010; Wiering, 2000) add all neighboring states, and (Nishi et al., 2018) adds neighbors’ hidden states. However, in these methods, the information from different neighbors is treated as equally important, while the influences of neighbors change with the dynamic traffic flow. Even when the traffic flow is static, Kinematic-wave theory (Hayes, 1970) from the transportation area shows that the upstream intersections could have larger influence than downstream intersections. To address the shortcomings of prior methods, our proposed method leverages the attention mechanism to learn and assign different weights to different intersections in a neighborhood.

It is worth noting that the cooperation could also be implemented by jointly modeling the actions of multiple road intersections (Yang et al., 2018). For example, studies (van der Pol et al., 2016; Kuyer et al., 2008) have proposed to jointly model two adjacent intersections. What we propose on communication modeling is in parallel to this joint modeling of actions.

3. Problem Definition

In this section, we present the problem of traffic signal control as a Markov Game. Each intersection in the system is controlled by an agent. Given that each agent observes only part of the total system condition, we would like to proactively decide, for all the intersections in the system, which phases they should change to so as to minimize the average queue length on the lanes around the intersections. Specifically, the problem is characterized by the following five major components, $\langle \mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{P}, r, \pi, \gamma \rangle$:

System state space $\mathcal{S}$ and observation space $\mathcal{O}$. We assume that there are $N$ intersections in the system and each agent can observe part of the system state $s \in \mathcal{S}$ as its observation $o \in \mathcal{O}$. In this work, we define $o_i^t$ for agent $i$ at time $t$, which consists of its current phase (which direction is in green light) and the number of vehicles on each lane at time $t$.

Set of actions $\mathcal{A}$. In the traffic signal control problem, at time $t$, an agent $i$ would choose an action $a_i^t$ from its candidate action set $\mathcal{A}_i$ as a decision for the next $\Delta t$ period of time. Here, each intersection would choose a phase as its action $a_i^t$ from its pre-defined phase set, indicating that from time $t$ to $t + \Delta t$, this intersection would be in the chosen phase.

Transition probability $\mathcal{P}$. Given the system state $s^t$ and corresponding joint actions $a^t$ of agents at time $t$, the system arrives at the next state $s^{t+1}$ according to the state transition probability $\mathcal{P}(s^{t+1} \mid s^t, a^t): \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \to \Omega(\mathcal{S})$, where $\Omega(\mathcal{S})$ denotes the space of state distributions.

Reward $r$. Each agent $i$ obtains an immediate reward $r_i^t$ from the environment at time $t$ by a reward function $\mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \to \mathbb{R}$. In this paper, we want to minimize the travel time for all vehicles in the system, which is hard to optimize directly. Therefore, we define the reward for intersection $i$ as $r_i^t = -\sum_{l} u_{i,l}^t$, where $u_{i,l}^t$ is the queue length on the approaching lane $l$ at time $t$.

Policy $\pi$ and discount factor $\gamma$. Intuitively, the joint actions have long-term effects on the system, so we want to minimize the expected queue length of each intersection in each episode. Specifically, at time $t$, each agent chooses an action following a certain policy $\pi$, aiming to maximize its total reward $G_i^t = \sum_{\tau=t}^{T} \gamma^{\tau - t} r_i^{\tau}$, where $T$ is the total number of time steps of an episode and $\gamma \in [0, 1]$ differentiates the rewards in terms of temporal proximity.
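The reward and total-reward definitions above can be illustrated with a small sketch. It assumes the reward is the negative sum of queue lengths over an intersection's approaching lanes; the queue values and the discount factor of 0.5 are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def reward(queue_lengths):
    """Reward of one intersection: negative total queue length on its approaching lanes."""
    return -sum(queue_lengths)


def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over the remaining steps of an episode."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical queue lengths on four approaching lanes at three consecutive steps.
rewards = [reward(q) for q in ([2, 0, 1, 3], [1, 0, 0, 2], [0, 0, 0, 1])]
print(rewards)                          # [-6, -3, -1]
print(discounted_return(rewards, 0.5))  # -7.75
```

Shorter queues give rewards closer to zero, so maximizing the discounted sum pushes the agent toward phases that keep queues short over the whole episode.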

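In this paper, we use the action-value function $Q_i(\theta_n)$ for each agent $i$ at the $n$-th iteration (parameterized by $\theta_n$) to approximate the total reward $G_i^t$ with neural networks by minimizing the loss:

$$\mathcal{L}(\theta_n) = \mathbb{E}\Big[\big(r_i^t + \gamma \max_{a'} Q(o_i^{t'}, a'; \theta_{n-1}) - Q(o_i^t, a_i^t; \theta_n)\big)^2\Big] \tag{1}$$

where $o_i^{t'}$ denotes the next observation for $o_i^t$. The earlier snapshots of parameters $\theta_{n-1}$ are periodically updated with the most recent network weights and help increase the learning stability by decorrelating predicted and target q-values.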
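As a rough illustration of this loss, the sketch below computes the mean squared temporal-difference error for a batch of transitions, with a second callable standing in for the earlier parameter snapshot. The function names, the linear Q-functions, the batch shapes and the default discount of 0.8 are all assumptions made for the example, not details taken from the paper.

```python
import numpy as np


def td_loss(q_net, q_target, batch, gamma=0.8):
    """Mean squared TD error over a batch of (obs, action, reward, next_obs).

    q_net and q_target map observations of shape (B, k) to Q-values of shape
    (B, p); q_target plays the role of the earlier parameter snapshot.
    """
    obs, actions, rewards, next_obs = batch
    q_pred = q_net(obs)[np.arange(len(actions)), actions]
    target = rewards + gamma * q_target(next_obs).max(axis=1)
    return float(np.mean((target - q_pred) ** 2))


# Toy usage with linear Q-functions (hypothetical: k=3 features, p=2 phases).
rng = np.random.default_rng(0)
W_now, W_old = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
batch = (rng.normal(size=(4, 3)), np.array([0, 1, 1, 0]),
         -rng.random(4), rng.normal(size=(4, 3)))
loss = td_loss(lambda o: o @ W_now, lambda o: o @ W_old, batch)
```

Keeping the target network fixed between periodic copies means the regression target does not move with every gradient step, which is the stabilizing effect described above.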

4. Method

Figure 1. Left: Framework of the proposed CoLight model. Right: variation of cooperation scope (light blue shadow) and attention distribution (colored points, the redder, the more important) of the target intersection.

In this section, we will first introduce the proposed cooperated RL network structure, as Fig. 1 illustrates, from bottom to top layer: the first observation embedding layer, the interior neighborhood cooperation layers and the final q-value prediction layer. Then we will discuss its time and space complexity compared with other methods of signal control for multiple intersections.

4.1. Observation Embedding

Given the raw data of the local observation, i.e., the number of vehicles on each lane and the phase the signal is currently in, we first embed such $k$-dimensional data into an $m$-dimensional latent space via a layer of Multi-Layer Perceptron (MLP):

$$h_i = \mathrm{Embed}(o_i^t) = \sigma(o_i^t W_e + b_e)$$

where $o_i^t \in \mathbb{R}^k$ is intersection $i$’s observation at time $t$ and $k$ is the feature dimension of $o_i^t$, $W_e \in \mathbb{R}^{k \times m}$ and $b_e \in \mathbb{R}^m$ are the weight matrix and bias vector to learn, and $\sigma$ is the ReLU function (the same notation is used for the following $\sigma$). The generated hidden state $h_i \in \mathbb{R}^m$ represents the current traffic condition of the $i$-th intersection.

4.2. Neighborhood Cooperation via Attention

In order to cooperate with neighboring intersections, the CoLight agent considers the traffic condition of neighbors or even distant intersections through layers of interactions. Neighborhood cooperation is necessary in multi-agent reinforcement learning (MARL) environment, since the evaluation of the conducted policy for each agent depends not only on the observable surrounding, but also on the policies of other agents (Nowé et al., 2012; de Oliveira and Bazzan, 2009).

The CoLight agent learns to communicate for cooperation within a neighborhood by leveraging the attention mechanism, by which a summary based on representations of neighboring intersections is generated according to the representation of the target intersection. The attention mechanism is employed widely in diverse domains to boost model accuracy (Yao et al., 2018; You et al., 2016; Cheng et al., 2018; Jiang et al., 2018).

4.2.1. Observation Interaction

To learn the importance of information from intersection $j$ (source intersection) in determining the policy for intersection $i$ (target intersection), we first embed the representations $h_i$ and $h_j$ of the two intersections from the previous layer and calculate the importance of source $j$ to target $i$ with the following dot product operation:

$$e_{ij} = (h_i W_t) \cdot (h_j W_s)^{T}$$

where $W_s$ and $W_t$ are embedding parameters for the source and target intersection, respectively. Scalar $e_{ij}$ depicts the importance of information from intersection $j$ in determining the policy for intersection $i$.

Note that the interaction between two intersections is not necessarily symmetric, i.e., $e_{ij}$ is not necessarily equal to $e_{ji}$. Imagine the scenario illustrated by Fig. 1(a), where vehicles are running on the uni-directional 9-th Avenue, Manhattan, New York. As the traffic flow goes from intersection 9-50 to intersection 9-49, the information related to intersection 9-50 is important for intersection 9-49 to prepare for the future traffic ($e_{9\text{-}49,\,9\text{-}50}$ should be quite large), while intersection 9-49 should pay little attention to the traffic condition of the downstream intersection 9-48 ($e_{9\text{-}49,\,9\text{-}48}$ should be extremely small).

4.2.2. Attention Distribution within Neighborhood Scope

The number of road arms involved in intersections varies in different regions, e.g., four-way intersections in Manhattan, the five-way intersection in the Five Points district in Atlanta, etc. Therefore, the previously computed interaction score between two intersections has distinct implications for different kinds of intersections. To retrieve a general attention value between source and target intersections, we further normalize the interaction scores between the target intersection and its neighborhood intersections:

$$\alpha_{ij} = \mathrm{softmax}(e_{ij}) = \frac{\exp(e_{ij} / \tau)}{\sum_{j' \in \mathcal{N}_i} \exp(e_{ij'} / \tau)}$$

where $e_{ij}$ is the interaction score of source $j$ for target $i$, $\tau$ is the temperature factor and $\mathcal{N}_i$ is the set of intersections in the target intersection’s neighborhood scope. The neighborhood of the target contains the top $|\mathcal{N}_i|$ closest intersections, and the distance can be defined in multiple ways. For example, we can construct the neighborhood scope for target intersection $i$ through:

road distance: the Manhattan distance between two intersections’ geo-locations.

node distance: the smallest hop count between two nodes over the network, with each node as an intersection.

Note that intersection $i$ itself is also included in $\mathcal{N}_i$ to help the agent become aware of how much attention should be put on its own traffic condition.

The general attention score $\alpha_{ij}$ is beneficial not only because it applies to all kinds of road network structures (intersections with different numbers of arms), but also because it relaxes the concept of “neighborhood”. Without loss of generality, the target can even take some other intersections into $\mathcal{N}_i$ although they are not adjacent to it. For instance, one four-way intersection can determine its signal policy based on information from five nearby intersections, four of which are the adjacent neighbors while the other is disconnected but geographically close to the target intersection.
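The two ways of building a neighborhood scope can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the 2x3 grid, the node names, the planar coordinates and the alphabetical tie-breaking rule are all hypothetical.

```python
from collections import deque


def top_k_by_manhattan(coords, target, k):
    """k closest intersections (target included) by Manhattan distance."""
    tx, ty = coords[target]
    dist = {n: abs(x - tx) + abs(y - ty) for n, (x, y) in coords.items()}
    return sorted(dist, key=lambda n: (dist[n], n))[:k]


def top_k_by_hops(adjacency, target, k):
    """k closest intersections (target included) by hop count, found with BFS."""
    hops, queue = {target: 0}, deque([target])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return sorted(hops, key=lambda n: (hops[n], n))[:k]


# Hypothetical 2x3 grid: A, B, C on the top row and D, E, F below them.
coords = {"A": (0, 0), "B": (1, 0), "C": (2, 0),
          "D": (0, 1), "E": (1, 1), "F": (2, 1)}
adjacency = {"A": ["B", "D"], "B": ["A", "C", "E"], "C": ["B", "F"],
             "D": ["A", "E"], "E": ["B", "D", "F"], "F": ["C", "E"]}

print(top_k_by_manhattan(coords, "B", 4))  # ['B', 'A', 'C', 'E']
print(top_k_by_hops(adjacency, "B", 4))    # ['B', 'A', 'C', 'E']
```

On a regular grid the two definitions agree, as here; they can differ on irregular networks where two intersections are geographically close but not directly connected.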

4.2.3. Neighborhood Cooperation

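The cooperation among multiple intersections is finally achieved by combining the representation of source intersections and their respective importance to the target intersection:

$$hs_i = \sigma\Big(W_q \cdot \sum_{j \in \mathcal{N}_i} \alpha_{ij}\,(h_j W_c) + b_q\Big)$$

where $\alpha_{ij}$ is the normalized attention of target $i$ on source $j$ within its neighborhood scope $\mathcal{N}_i$, $W_c$ is a weight matrix for source intersection embedding, and $W_q$ and $b_q$ are trainable variables. The weighted sum of neighborhood representation $hs_i$ accumulates the key information from the surrounding environment for performing efficient signal policy.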
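The three steps of this section (interaction score, normalization within the neighborhood, weighted combination) can be sketched for one target intersection as below. It is a single-head NumPy illustration under assumed shapes; the parameter names `W_t`, `W_s`, `W_c`, `W_q`, `b_q`, the row-vector convention and the random inputs are choices made for the example.

```python
import numpy as np


def softmax(x, tau=1.0):
    """Softmax of x / tau; the max is subtracted for numerical stability."""
    z = (x - x.max()) / tau
    e = np.exp(z)
    return e / e.sum()


def attend(h, target, neighbors, W_t, W_s, W_c, W_q, b_q, tau=1.0):
    """Single-head neighborhood cooperation for one target intersection.

    h: (N, m) hidden states; neighbors: indices in the target's scope
    (target included). Returns the attention weights and the summary vector.
    """
    scores = (h[neighbors] @ W_s) @ (h[target] @ W_t)  # one score per source
    alpha = softmax(scores, tau)                       # normalize within scope
    summary = alpha @ (h[neighbors] @ W_c)             # weighted sum of sources
    return alpha, np.maximum(summary @ W_q + b_q, 0.0)  # ReLU output


rng = np.random.default_rng(0)
m = 4  # hypothetical hidden size
h = rng.normal(size=(5, m))
W_t, W_s, W_c, W_q = (rng.normal(size=(m, m)) for _ in range(4))
alpha, hs = attend(h, 0, [0, 1, 2], W_t, W_s, W_c, W_q, np.zeros(m))
```

Because the target and source use different embedding matrices, the score of intersection 1 for target 0 generally differs from the score of intersection 0 for target 1, which is what lets upstream and downstream neighbors be weighted differently.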

In Fig. 1(b) and (c), there is only one-way traffic across the target intersection (north to south on 9-th Avenue, east to west on West 49-th Street). The agent for the target intersection acquires the knowledge of its four adjacent intersections directly from the first NCvA (Neighborhood Cooperation via Attention) layer. Meanwhile, the emphases on the four neighbors are quite distinct due to the uni-directional traffic flow, i.e., a higher attention score for the upstream intersections (red marked) than for the downstream ones (green marked). Since the hidden states of adjacent neighbors from the first NCvA layer carry their respective neighborhood messages, in the second NCvA layer the cooperation scope of the target intersection is significantly expanded (light blue shadow in Fig. 1(c)) to 8 intersections. Such additional information helps the target intersection learn the flow trend and rely more on the upstream intersections but less on the downstream ones to take actions. As a result, the attention scores on the upstream intersections grow higher while those on the downstream intersections become lower. Two hidden NCvA layers offer the agent the chance to detect environment dynamics one-hop away. More hops of view are accessible if the neural network has multiple NCvA layers.

In summary, the graph-level attention employed in the NCvA layers allows the agent to adjust its focus according to the dynamic traffic and sense the environment at a larger scale.

4.3. Multi-head Attention

The cooperating information $hs_i$ for the $i$-th intersection captures one type of relationship with neighboring intersections. To jointly attend to the neighborhood from different representation subspaces at different positions, we extend the previous single-head attention in the neural network to multi-head attention, as much recent work has done (Vaswani et al., 2017; Velickovic et al., 2017). Specifically, the attention function (procedures including Observation Interaction, Attention Distribution and Neighborhood Cooperation) with different linear projections (multiple sets of trainable parameters {$W_t$, $W_s$, $W_c$}) is performed in parallel and the different versions of neighborhood condition summarization $hs_i$ are averaged as $hm_i$:

$$hm_i = \sigma\Big(W_q \cdot \Big(\frac{1}{H} \sum_{h=1}^{H} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{h}\,(h_j W_c^{h})\Big) + b_q\Big)$$

where $H$ is the number of attention heads and $\alpha_{ij}^{h}$ is the attention weight computed by the $h$-th head with its own parameters $W_t^{h}$, $W_s^{h}$ and $W_c^{h}$. Besides the averaging operation, concatenating the outputs of the attention heads is another feasible way to combine multiple types of neighborhood cooperation.
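A compact way to picture the averaging variant is sketched below: each head scores and summarizes the neighborhood with its own projections, the per-head summaries are averaged, and one shared output layer is applied. The names, shapes and the choice of a shared `W_q`, `b_q` across heads are assumptions for this example.

```python
import numpy as np


def multi_head_summary(h, target, neighbors, heads, W_q, b_q, tau=1.0):
    """Average per-head weighted sums, then apply the shared output layer.

    h: (N, m) hidden states; heads: list of (W_t, W_s, W_c), one per head.
    """
    per_head = []
    for W_t, W_s, W_c in heads:
        scores = (h[neighbors] @ W_s) @ (h[target] @ W_t) / tau
        alpha = np.exp(scores - scores.max())  # stable softmax numerator
        alpha /= alpha.sum()
        per_head.append(alpha @ (h[neighbors] @ W_c))
    mean_summary = np.mean(per_head, axis=0)
    return np.maximum(mean_summary @ W_q + b_q, 0.0)  # ReLU


rng = np.random.default_rng(1)
m, n_heads = 4, 3  # hypothetical sizes
h = rng.normal(size=(6, m))
heads = [tuple(rng.normal(size=(m, m)) for _ in range(3)) for _ in range(n_heads)]
W_q, b_q = rng.normal(size=(m, m)), np.zeros(m)
hm = multi_head_summary(h, target=2, neighbors=[1, 2, 3, 5], heads=heads,
                        W_q=W_q, b_q=b_q)
```

Averaging keeps the output dimension independent of the number of heads, whereas concatenation would multiply the input width of the following layer by the head count.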

Figure 2. Road networks for real-world datasets: (a) Hell’s Kitchen, Manhattan, New York, USA; (b) Dongfeng Sub-district, Jinan, China; (c) Gudang Sub-district, Hangzhou, China. Red polygons are the areas we select to model, blue dots are the traffic signals we control. Left: 48 intersections with uni-directional traffic, middle: 12 intersections with bi-directional traffic, right: 16 intersections with uni- & bi-directional traffic.

4.4. Q-value Prediction

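As illustrated in Fig. 1(a), each hidden layer of CoLight learns the neighborhood representation through the methods introduced in Section 4.2. We denote this layerwise cooperation procedure by NCvA (Neighborhood Cooperation via Attention); the forward propagation of input data in CoLight can then be formulated as follows.

$$\begin{aligned}
h_i &= \mathrm{Embed}(o_i^t),\\
hm_i^{1} &= \mathrm{NCvA}^{1}(h_i),\\
&\;\;\vdots\\
hm_i^{L} &= \mathrm{NCvA}^{L}(hm_i^{L-1}),\\
\tilde{q}(o_i^t) &= hm_i^{L} W_p + b_p,
\end{aligned}$$

where $W_p$ and $b_p$ are parameters to learn, $p$ is the number of phases (action space), $L$ is the number of NCvA layers, and $\tilde{q}$ is the predicted q-value.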

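According to Eq. (1), the loss function for CoLight to optimize the current policy is:

$$\mathcal{L}(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} \big(q(o_i^t, a_i^t) - \tilde{q}(o_i^t, a_i^t; \theta)\big)^2$$

where $q(o_i^t, a_i^t)$ is the target q-value, $\tilde{q}$ is the predicted q-value, $T$ is the total number of time steps that contribute to the network update, $N$ is the number of intersections in the whole road network, and $\theta$ represents all the trainable variables in CoLight.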

4.5. Complexity Analysis

Although CoLight spends additional time and parameters to learn the cooperation according to neighborhood representation, both the time and space it demands are approximately equal to $O(m^2 L)$, where $m$ is the hidden-layer size and $L$ is the number of NCvA layers, which is independent of the number of intersections. Hence CoLight is scalable even if the road network contains hundreds or even thousands of intersections.

4.5.1. Space Complexity

If there are $L$ hidden layers and each layer has $m$ neurons, then the size of the weight matrices and bias vectors in each component of CoLight is: 1) Observation Embedding layer: $km + m$; 2) interior Neighborhood Cooperation layers: $(3m^2 + (m^2 + m))L = m(4m + 1)L$; 3) Q-value Prediction layer: $mp + p$. Hence the total number of learnable parameters to store is $O(m(4mL + L + k + 1 + p) + p)$. Normally, the size of the hidden layer ($m$) is far greater than the number of layers ($L$) and the phase space ($p$), and comparable to the input dimension ($k$). Therefore, the space complexity of CoLight is approximately equal to $O(m^2 L)$.

If we leverage $N$ separate RL models (no cooperation) to control signals in $N$ intersections, then the space complexity is $O(m^2 L \cdot N)$, which is infeasible when $N$ is extremely large for city-level traffic signal control. To scale up, the simplest solution is to let all the intersections share parameters and maintain one model; in this case, the space complexity is $O(m^2 L)$, which is identical to that of CoLight.
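The parameter count can be checked with a few lines of arithmetic. The sketch assumes a single attention head and that each cooperation layer holds three m-by-m embedding matrices plus an m-by-m output matrix and a bias of length m; the sizes passed in are hypothetical.

```python
def colight_param_count(k, m, L, p):
    """Learnable parameters for k inputs, m hidden units, L layers, p phases."""
    embedding = k * m + m                      # embedding weights and bias
    cooperation = L * (3 * m * m + m * m + m)  # three embeddings + output layer
    prediction = m * p + p                     # Q-value weights and bias
    return embedding + cooperation + prediction


k, m, L, p = 12, 32, 2, 4  # hypothetical sizes
shared = colight_param_count(k, m, L, p)
print(shared)        # 8804
print(shared * 100)  # 880400, if 100 intersections each kept a separate copy
```

The count matches the closed form m(4mL + L + k + 1 + p) + p and is dominated by the 4m²L term, which is why sharing one model across intersections keeps storage independent of the network size.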

4.5.2. Time Complexity

We assume that: 1) all the agents leverage CoLight to predict q-values for the corresponding intersections concurrently; 2) the multiple heads of attention are independently computed so that they are as fast as the single-head attention; 3) the embeddings for either source or target intersection condition via $W_t$, $W_s$ and $W_c$ are separate processes that can also be executed at the same time; 4) for one target intersection, the interaction with all the neighbors is computed simultaneously. Then the time complexity (only multiplication operations are considered since the addition procedures are relatively insignificant) in each component of CoLight is: 1) Observation Embedding layer: $O(km)$; 2) interior Neighborhood Cooperation layers: $O((m^2 + m|\mathcal{N}_i|)L)$; 3) Q-value Prediction layer: $O(mp)$. Hence the time complexity is $O(mk + (m^2 + m|\mathcal{N}_i|)L + mp)$, and similarly, it is approximately equal to $O(m^2 L)$.

Either the individual RL models or the shared single RL model for signal control in multiple intersections requires $O(m^2 L)$ computation, approaching that of CoLight.

5. Experiments

Following the tradition of the traffic signal control study (Wei et al., 2018), we conduct experiments on SUMO (Simulation of Urban MObility). In order to support large-scale reinforcement learning, we implement a multi-threaded version of SUMO. After the traffic data are fed into the simulator, a vehicle moves towards its destination according to the setting of the environment. The simulator provides the state to the signal control method and executes the traffic signal actions from the control method. Following the tradition, each green signal is followed by a three-second yellow signal and a two-second all-red time. The code and the public datasets used in this paper are available online.

In a traffic dataset, each vehicle is described as $(o, t, d)$, where $o$ is the origin location, $t$ is time, and $d$ is the destination location. Locations $o$ and $d$ are both locations on the road network. Traffic data is taken as input for the simulator.
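To make the record format concrete, the sketch below stores each vehicle as an (origin, time, destination) tuple and bins entry times into 300-second windows, the unit used for the arrival rates in Table 1. The road names and timestamps are made up for illustration.

```python
from collections import Counter

# Each vehicle is an (origin, time_in_seconds, destination) record.
vehicles = [("road_a", 12, "road_f"), ("road_b", 250, "road_e"),
            ("road_a", 310, "road_d"), ("road_c", 590, "road_f"),
            ("road_b", 601, "road_d")]


def arrival_counts(records, window=300):
    """Vehicles entering the network in each consecutive window of seconds."""
    bins = Counter(t // window for _, t, _ in records)
    if not bins:
        return []
    return [bins.get(i, 0) for i in range(max(bins) + 1)]


counts = arrival_counts(vehicles)
print(counts)  # [2, 2, 1]
```

Statistics such as the mean, standard deviation, maximum and minimum of these per-window counts are the kind of summary reported for the real-world datasets.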

In a multi-intersection network setting, we use the real road network to define the network in the simulator. Unless otherwise specified, each intersection in the road network is set to be a four-way intersection, with four 300-meter long road segments.

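| Dataset | # intersections | Arrival rate (vehicles/300s): Mean | Std | Max | Min |
|---|---|---|---|---|---|
| New York | 48 | 240.79 | 10.08 | 274 | 216 |
| Jinan | 12 | 250.70 | 38.21 | 335 | 208 |
| Hangzhou | 16 | 526.63 | 86.70 | 676 | 256 |

Table 1. Data statistics of real-world traffic datasets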

5.1. Datasets

5.1.1. Synthetic Data

In the experiment, we use two kinds of synthetic data, i.e., uni- and bi-directional traffic, on the following road networks:  
A three-intersection arterial, used to investigate the effectiveness of our attention mechanism in a case study.  
A grid network, used to compare the convergence speed of different RL methods.  
A grid network, used to evaluate the effectiveness and efficiency of different methods.  
A large-scale road network, used to show the scalability and effectiveness of different methods.

| Model | Synthetic, Uni | Synthetic, Bi | New York | Hangzhou | Jinan |
|---|---|---|---|---|---|
| Fixedtime | 209.68 | 209.68 | 1831.37 | 728.79 | 869.85 |
| MaxPressure | 186.07 | 194.96 | 404.71 | 422.15 | 361.33 |
| CGRL (van der Pol et al., 2016) | 1532.75 | 2884.23 | 1888.47 | 1582.26 | 1210.70 |
| Neighbor RL (Arel et al., 2010) | 240.68 | 248.11 | 1780.73 | 1053.45 | 1168.32 |
| Individual RL (Wei et al., 2018) | 314.82 | 261.60 | -* | 345.00 | 325.56 |
| OneModel | 181.81 | 242.63 | 1777.87 | 394.56 | 728.63 |
| GCN (Nishi et al., 2018) | 205.40 | 272.14 | 1374.01 | 768.43 | 625.66 |
| CoLight_node | 178.42 | 176.71 | 172.80 | 331.50 | 340.70 |
| CoLight | 173.79 | 176.32 | 158.13 | 297.26 | 291.14 |

* No result, as Individual RL cannot scale up to 48 intersections in New York’s road network.

Table 2. Performance on synthetic data and real-world data w.r.t. average travel time (in seconds; lower is better). CoLight is the best.

Each intersection in the synthetic road network has four directions (West→East, East→West, South→North, North→South), and 3 lanes (300 meters in length and 3 meters in width) for each direction. The traffic comes uniformly with 300 vehicles/lane/hour in the West→East direction and 90 vehicles/lane/hour in the South→North direction.

5.1.2. Real-world Data

We also use the real-world traffic data from three cities: New York, Hangzhou and Jinan. Their road networks are imported from OpenStreetMap, as shown in Fig. 2. Their traffic flows are processed from multiple sources, with data statistics listed in Table 1.

5.2. Compared Methods

We compare our model with the following two categories of methods: conventional transportation methods and RL methods.

5.2.1. Transportation Methods

Fixedtime (Koonce et al., 2008): Fixed-time with random offsets. This method uses a pre-determined plan for cycle length and phase time, which is widely used when the traffic flow is steady.  
MaxPressure (Varaiya, 2013): Max pressure control (Varaiya, 2013) is a state-of-the-art network-level traffic signal control method in the transportation field, which greedily chooses the phase that maximizes the pressure (a pre-defined metric about upstream and downstream queue length).
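The greedy phase choice of a max-pressure controller can be sketched as follows. The pressure of a movement is taken here as the upstream queue minus the downstream queue, which is one common simplification and an assumption of this example; the phase names, lane names and queue values are hypothetical.

```python
def max_pressure_phase(phases, queue):
    """Pick the phase whose permitted movements have the largest total pressure.

    phases: {phase_name: [(incoming_lane, outgoing_lane), ...]}
    queue:  {lane: number of queued vehicles}
    """
    def pressure(movements):
        return sum(queue[i] - queue[o] for i, o in movements)

    return max(phases, key=lambda name: pressure(phases[name]))


phases = {"WE": [("w_in", "e_out"), ("e_in", "w_out")],
          "SN": [("s_in", "n_out"), ("n_in", "s_out")]}
queue = {"w_in": 8, "e_in": 5, "s_in": 2, "n_in": 3,
         "e_out": 1, "w_out": 0, "n_out": 6, "s_out": 0}
print(max_pressure_phase(phases, queue))  # WE
```

In this example the west-east phase has pressure 12 against -1 for south-north, so it is selected; the rule uses only current queue lengths and has no learned parameters.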

5.2.2. RL Methods

We compare our method with 5 baseline RL methods. For fair comparisons, all the RL models are learned without any pre-trained parameters.  
CGRL (van der Pol et al., 2016): A coordinated RL approach for multi-intersection signal control (van der Pol et al., 2016). Specifically, the cooperation is achieved by designing a coordination graph and it learns to optimize the joint action between two intersections.  
Individual RL (Wei et al., 2018): An individual deep RL approach which does not consider neighbor information. Each intersection is controlled by one agent, and the agents do not share parameters, but update their own networks independently.  
OneModel: This method uses the same state and reward as Individual RL in its agent design, which only considers the traffic condition on the roads connecting with the controlled intersection. Instead of maintaining their own parameters, all the agents share the same policy network.  
Neighbor RL (Arel et al., 2010): Based on OneModel, agents concatenate their neighboring intersections’ traffic condition with their own and all the agents share the same parameters. Hence its feature space for observation is larger than OneModel.  
GCN (Nishi et al., 2018): An RL-based traffic signal control method that uses a graph convolutional neural network to automatically extract the traffic features of adjacent intersections. This method treats the traffic conditions of all neighbors equally.

5.2.3. Variants of Our Proposed Method

CoLight: The neighborhood scope of an intersection is constructed through Manhattan distance.  
CoLight_node: The neighborhood scope is constructed through node distance, i.e., the smallest hop count between two intersections in the road network.

5.3. Evaluation Metric

Following existing studies, we use the average travel time to evaluate the performance of different models for traffic signal control. It is the average time (in seconds) that vehicles spend between entering and leaving the area, which is the most frequently used performance measure for traffic signal control in the transportation field.
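A minimal sketch of this metric is given below. It assumes each trip is recorded as an (enter, leave) pair of timestamps in seconds and that vehicles still inside the area at the end of the simulation are skipped; how unfinished trips are treated is an assumption of this example, not something stated in the text.

```python
def average_travel_time(trips):
    """Mean of (leave - enter) in seconds over vehicles that have left the area."""
    durations = [leave - enter for enter, leave in trips if leave is not None]
    return sum(durations) / len(durations) if durations else float("nan")


# Hypothetical trips; None marks a vehicle that has not left the area yet.
trips = [(0, 120), (30, 200), (45, None), (60, 150)]
avg = average_travel_time(trips)
```

Skipping unfinished trips can flatter a controller that leaves many vehicles stuck, so an evaluation may instead count them up to the end of the simulation.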

Figure 3. Convergence speed of CoLight (red continuous curves) and the other 5 RL baselines (dashed curves) during training. Panels (b) and (c) show the uni-directional (-Uni) and bi-directional (-Bi) traffic settings, respectively. CoLight starts with the best performance (Jumpstart), reaches the pre-defined performance the fastest (Time to Threshold), and ends with the optimal policy (Asymptotic). Curves are smoothed with a moving average of 5 points.

5.4. Effectiveness Comparison

5.4.1. Overall Analysis

Table 2 lists the performance of the two variants of the proposed CoLight, classic transportation models, and state-of-the-art learning methods on both synthetic and real-world datasets.

CoLight achieves consistent performance improvements over state-of-the-art transportation (MaxPressure) and RL (Individual RL) methods across diverse road networks and traffic patterns: the average improvement is for synthetic data and for real-world data.

The performance improvements are attributed to the multi-hop view of the neighborhood and to cooperation that adapts dynamically as traffic varies. The advantage of such multi-hop dynamic cooperation is especially evident when controlling signals in real-world cities, where road structures are more irregular and traffic flows are more dynamic.

The performance gap between the proposed CoLight and the conventional transportation method MaxPressure becomes larger as the evaluated data changes from synthetic regular traffic (average gap ) to real-world dynamic traffic (average gap ). Such growing performance divergence conforms to the deficiency inherent in MaxPressure, that it is incapable of learning from the feedback of the environment.

Baseline learning models show inferior performance due to lack of either the comprehensive view of the environment or the traffic-driven cooperation: 1) Limited view of the environment: Without neighborhood cooperation, Individual RL can hardly achieve satisfactory results because it independently optimizes the single intersection’s policy. Neighbor RL fails for signal control in part because the agent pays attention only to the adjacent intersections (unaware of intersections multi-hop away). 2) No traffic-driven cooperation: Since New York and Hangzhou have uni-directional traffic, the importance of traffic condition in upstream intersections is quite different to that in downstream intersections. Neighbor RL can hardly distinguish the traffic condition of important neighbors by simply mixing neighborhood information at the input stage. GCN does not work well for either or , as the agent treats the information from the upstream and downstream intersections with static importance according to the prior geographic knowledge rather than real-time traffic flows.

5.4.2. Variation Study

As mentioned in Section 4.2.2, the neighborhood scope of an intersection can be defined in different ways. The results in Table 2 show that CoLight (using Manhattan distance) achieves performance similar to CoLight_node on synthetic data, but largely outperforms CoLight_node on real-world traffic. The reason could be that on synthetic data, since the lane lengths of all intersections are the same, Manhattan distance is identical to node distance. In the following parts of our experiments, we only compare CoLight with other methods.

5.4.3. Convergence Comparison

In Fig. 3, we compare CoLight’s performance (average travel time for vehicles evaluated at each episode) to the corresponding learning curves for the other five RL methods. On all the listed datasets, CoLight outperforms every baseline by a large margin in jumpstart performance (initial performance after the first episode), time to threshold (learning time to achieve a pre-specified performance level), and asymptotic performance (final learned performance). Learning the attention on the neighborhood does not slow down model convergence; instead, it accelerates the approach to the optimal policy.
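The three convergence measures can be read off a learning curve as in the sketch below. The curve values, the threshold, and the choice to average the last few episodes for the asymptotic value are illustrative assumptions; the moving-average helper mirrors the 5-point smoothing mentioned in the caption of Figure 3.

```python
def moving_average(values, window=5):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def convergence_metrics(curve, threshold, tail=5):
    """Summarize a learning curve of average travel times (lower is better).

    Returns (jumpstart, time_to_threshold, asymptotic):
      jumpstart         - performance after the first episode
      time_to_threshold - first episode index at or below `threshold`,
                          or None if the threshold is never reached
      asymptotic        - mean over the last `tail` episodes (an assumption)
    """
    jumpstart = curve[0]
    time_to_threshold = next(
        (episode for episode, value in enumerate(curve) if value <= threshold),
        None,
    )
    last = curve[-tail:]
    asymptotic = sum(last) / len(last)
    return jumpstart, time_to_threshold, asymptotic

# Hypothetical learning curve in seconds of average travel time per episode.
curve = [900, 700, 500, 400, 320, 300, 300, 300, 300, 300]
print(convergence_metrics(curve, threshold=350))
print(moving_average(curve))
```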

From Fig. 3(a), we observe that Individual RL starts with an extremely long travel time and approaches the optimal performance only after a long training time. Such disparity of convergence speed shown in Fig. 3 agrees with our previous space complexity analysis (in Section 4.5.1), that agents with shared models (CGRL, Neighbor RL, OneModel, GCN and CoLight) need to learn parameters while individual agents (Individual RL) have to update parameters.

5.4.4. Scalability Comparison

In Fig. 4, we compare CoLight’s training time (total clock time for 100 episodes) to the corresponding running time for the other 5 RL methods. The time cost of CoLight is comparable to that of OneModel and GCN, and far lower than that of CGRL, Individual RL and Neighbor RL. Such high efficiency of CoLight is consistent with the time complexity analysis (in Section 4.5.2), as most of the parallel computation assumptions are satisfied in our experiments.

Figure 4. The training time of different models for 100 episodes. CoLight is efficient across all the datasets. The bar for Individual RL on is shadowed as its running time is far beyond the acceptable time.
(a) Road network
(b) Intersection 0
(c) Intersection 1
(d) Intersection 2
Figure 5. Attention distribution learned by CoLight during training process on .
(a) Intersection A in New York
(b) Intersection B in Hangzhou
Figure 6. Attention distribution learned by CoLight during the training process in real-world traffic. Left: most of the attention is allocated to the intersection itself and to upstream intersections in New York. Right: most of the attention is allocated to the intersection itself and to arterial intersections in Hangzhou.

Note that the average travel time (in Table 2) and the bar of training time (in Fig. 4) for Individual RL on are missing and estimated, respectively. Besides Individual RL, CGRL takes too much time to train (it is the second most time-consuming model) and yields unsatisfactory performance (the worst on all datasets). Hence, we compare the performance stability of CoLight on large road networks with three other scalable RL methods (Neighbor RL, OneModel and GCN) as well as conventional transportation methods (Fixedtime and MaxPressure).

We also test our model in a large-scale network . In Table 3, CoLight outperforms all the transportation and learning baselines with and improvement on -Uni and -Bi, respectively.

| Model | Uni-direction | Bi-direction |
|---|---|---|
| Fixedtime | 340.62 | 340.62 |
| MaxPressure | 296.54 | 321.86 |
| Neighbor RL (Arel et al., 2010) | 324.31 | 358.31 |
| OneModel | 304.21 | 511.01 |
| GCN (Nishi et al., 2018) | 338.61 | 384.53 |
| CoLight | 286.87 | 289.78 |

Table 3. Performance on large-scale synthetic data w.r.t. average travel time (in seconds). CoLight is scalable with performance guarantee.
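One way to read Table 3 is to compute the relative reduction in average travel time of CoLight against the strongest baseline in each column. The sketch below does this directly from the table values; the definition of improvement used here (relative to the best baseline) is an assumption and may differ from the one behind the figures reported in the text.

```python
# Average travel time in seconds, copied from Table 3:
# (uni-directional, bi-directional).
TABLE3 = {
    "Fixedtime": (340.62, 340.62),
    "MaxPressure": (296.54, 321.86),
    "Neighbor RL": (324.31, 358.31),
    "OneModel": (304.21, 511.01),
    "GCN": (338.61, 384.53),
    "CoLight": (286.87, 289.78),
}

def improvement_over_best_baseline(table, column, ours="CoLight"):
    """Return (best_baseline_name, relative_reduction) for one column.

    The relative reduction is (best_baseline - ours) / best_baseline,
    an assumed definition of "improvement" for illustration only.
    """
    baselines = {name: row[column] for name, row in table.items() if name != ours}
    best_name = min(baselines, key=baselines.get)
    best = baselines[best_name]
    return best_name, (best - table[ours][column]) / best

for column, label in enumerate(("Uni-direction", "Bi-direction")):
    name, gain = improvement_over_best_baseline(TABLE3, column)
    print(f"{label}: {gain:.1%} lower travel time than {name}")
```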

5.5. Hyper-parameter Study

Impact of Neighbor Number. Our CoLight method introduces an additional hyper-parameter that controls the number of neighbors for cooperation. In Fig. 7, we show how this hyper-parameter impacts performance and shed light on how to set it.

As the number of neighbors grows from 2 to 5, the performance of CoLight improves and reaches its optimum. Further adding nearby intersections into the neighborhood scope, however, leads to the opposite trend. As illustrated in Fig. 7, including more neighbors means many more relations to learn, which requires more training. To determine the signal control policy for each intersection, computing attention scores over only four nearby intersections and the intersection itself seems adequate for cooperation, with acceptable training time and performance.

Figure 7. Performance of CoLight with respect to different numbers of neighbors () on dataset (left) and (right). More neighbors () for cooperation brings better performance, but too many neighbors () requires more time (200 episodes or more) to learn.

5.6. Attention Study

In this section, we study the attention distribution of CoLight evaluated on and real-world network to analyze how well the neighborhood cooperation is implemented via the attention mechanism.

5.6.1. Synthetic Arterial

In this section, we analyze the attention distribution for all the intersections along a synthetic arterial, as shown in Figure 5. The traffic along the arterial is uni-directional, which makes intersection 0 upstream of intersections 1 and 2. We have the following observations:  
In Figure 5(b), intersection 0 pays most of its attention to its own traffic condition, which coincides with the fact that the downstream intersections 1 and 2 have little impact on the policy of the upstream intersection 0.  
Figure 5(c) shows that intersection 1 pays considerably more attention to intersection 0 than to intersection 2. Such an attention distribution is reasonable, as intersection 1 has no side roads and its major traffic comes from the upstream intersection 0.  
In Figure 5(d), the downstream intersection 2 attends mostly to its own traffic flow while still allocating a share of attention to the two upstream intersections.
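To make the notion of an attention distribution concrete, the sketch below turns similarity scores between a target intersection and its neighbors into weights that sum to one. It is a simplified illustration: it uses plain dot-product scores without any learned projection, and the two-dimensional embeddings are hypothetical values chosen only to mimic the pattern described for intersection 1.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(target_embedding, neighbor_embeddings):
    """Map each neighbor id to a normalized attention weight.

    Scores are plain dot products between the target embedding and each
    neighbor embedding (a simplification: no learned projections).
    """
    scores = [
        sum(t * n for t, n in zip(target_embedding, embedding))
        for embedding in neighbor_embeddings.values()
    ]
    return dict(zip(neighbor_embeddings, softmax(scores)))

# Hypothetical embeddings for the neighborhood of intersection 1.
target = [1.0, 0.5]
neighbors = {
    "intersection_0": [2.0, 1.0],  # upstream, carries the major traffic
    "intersection_1": [1.0, 0.5],  # the target itself
    "intersection_2": [0.1, 0.0],  # downstream
}

for name, weight in attention_weights(target, neighbors).items():
    print(f"{name}: {weight:.2f}")
```

By contrast, a method that treats all neighbors without difference would assign each of the three intersections a fixed weight of 1/3 regardless of traffic.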

5.6.2. Real-world Network

In this section, we analyze the attention distribution CoLight learns from the real data under different scenarios: upstream intersection vs. downstream intersection, and arterial vs. side street.  
Upstream vs. Downstream. Figure 6(a) shows an intersection (green dot) in New York, whose neighborhood includes four nearby intersections along the arterial. Traffic along the arterial is uni-directional (blue arrow). From the attention distribution learned by CoLight, we can see that while the majority of attention is allocated to itself, the upstream intersections (orange and blue dots) have larger scores than downstream intersections (red and purple dots).  
Arterial vs. Side Street. Figure 6(b) shows an intersection (green dot) in Hangzhou, whose neighborhood includes two intersections along the arterial and two intersections on the side street. Arterial traffic is heavy and uni-directional, while side-street traffic is light and bi-directional. From the attention distribution learned by CoLight, we can see that the arterial intersections (orange and blue dots) have larger scores than side-street intersections (red and purple dots).

6. Conclusion

In this paper, we propose a well-designed reinforcement learning approach to solve the network-level traffic light control problem. We conduct extensive experiments using synthetic and real-world data and demonstrate the superior performance of our proposed method over state-of-the-art methods. In addition, we show in-depth case studies and observations to understand how the attention mechanism helps cooperation.

We would like to point out several important future directions to make the method more applicable to the real world. First, the neighborhood scope can be determined in a more flexible way. The traffic flow information between intersections can be utilized to determine the neighborhood, rather than a static number of nearby intersections. Second, the raw data for observation only includes the phase and the number of vehicles on each lane. Additional external data, such as road and weather conditions, might help to boost model performance.


  • Arel et al. (2010) Itamar Arel, Cong Liu, T Urbanik, and AG Kohls. 2010. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems 4, 2 (2010), 128–135.
  • Bishop et al. (2006) Christopher M Bishop et al. 2006. Pattern recognition and machine learning (information science and statistics). (2006).
  • Camponogara and Kraus (2003) Eduardo Camponogara and Werner Kraus. 2003. Distributed learning agents in urban traffic control. In Portuguese Conference on Artificial Intelligence. Springer, 324–335.
  • Casas (2017) Noe Casas. 2017. Deep Deterministic Policy Gradient for Urban Traffic Light Control. arXiv preprint arXiv:1703.09035 (2017).
  • Cheng et al. (2018) Weiyu Cheng, Yanyan Shen, Yanmin Zhu, and Linpeng Huang. 2018. A neural attention model for urban air quality inference: Learning the weights of monitoring stations. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • de Oliveira and Bazzan (2009) Denise de Oliveira and Ana LC Bazzan. 2009. Multiagent learning on traffic lights control: effects of using shared information. In Multi-agent systems for traffic and transportation engineering. IGI Global, 307–321.
  • Diakaki et al. (2002) Christina Diakaki, Markos Papageorgiou, and Kostas Aboudolas. 2002. A multivariable regulator approach to traffic-responsive network-wide signal control. Control Engineering Practice 10, 2 (2002), 183–195.
  • Dresner and Stone (2006) Kurt Dresner and Peter Stone. 2006. Multiagent traffic management: Opportunities for multiagent learning. In Learning and Adaption in Multi-Agent Systems. Springer, 129–138.
  • El-Tantawy et al. (2013) Samah El-Tantawy, Baher Abdulhai, and Hossam Abdelgawad. 2013. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto. IEEE Transactions on Intelligent Transportation Systems 14, 3 (2013), 1140–1150.
  • Gartner et al. (1991) Nathan H Gartner, Susan F Assman, Fernando Lasaga, and Dennis L Hou. 1991. A multi-band approach to arterial traffic signal optimization. Transportation Research Part B: Methodological 25, 1 (1991), 55–74.
  • Hayes (1970) Wallace D Hayes. 1970. Kinematic wave theory. Proc. R. Soc. Lond. A 320, 1541 (1970), 209–226.
  • Hunt et al. (1982) PB Hunt, DI Robertson, RD Bretherton, and M Cr Royle. 1982. The SCOOT on-line traffic signal optimisation technique. Traffic Engineering & Control 23, 4 (1982).
  • Hunt et al. (1981) PB Hunt, DI Robertson, RD Bretherton, and RI Winton. 1981. SCOOT-a traffic responsive method of coordinating signals. Technical Report.
  • Jiang et al. (2018) Jiechuan Jiang, Chen Dun, and Zongqing Lu. 2018. Graph Convolutional Reinforcement Learning for Multi-Agent Cooperation. arXiv preprint arXiv:1810.09202 (2018).
  • Koonce et al. (2008) Peter Koonce et al. 2008. Traffic signal timing manual. Technical Report. United States. Federal Highway Administration.
  • Kuyer et al. (2008) Lior Kuyer, Shimon Whiteson, Bram Bakker, and Nikos Vlassis. 2008. Multiagent reinforcement learning for urban traffic control using coordination graphs. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 656–671.
  • Liang et al. (2018) Xiaoyuan Liang, Xunsheng Du, Guiling Wang, and Zhu Han. 2018. Deep reinforcement learning for traffic light control in vehicular networks. arXiv preprint arXiv:1803.11115 (2018).
  • Lowrie (1992) PR Lowrie. 1992. SCATS–a traffic responsive method of controlling urban traffic. Roads and traffic authority. NSW, Australia (1992).
  • Nishi et al. (2018) Tomoki Nishi, Keisuke Otaki, Keiichiro Hayakawa, and Takayoshi Yoshimura. 2018. Traffic Signal Control Based on Reinforcement Learning with Graph Convolutional Neural Nets. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 877–883.
  • Nowé et al. (2012) Ann Nowé, Peter Vrancx, and Yann Michaël De Hauwere. 2012. Game Theory and Multi-agent Reinforcement Learning.
  • Prashanth and Bhatnagar (2011) LA Prashanth and Shalabh Bhatnagar. 2011. Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems 12, 2 (2011), 412–421.
  • Roess et al. (2004) Roger P Roess, Elena S Prassas, and William R McShane. 2004. Traffic engineering. Pearson/Prentice Hall.
  • Silva et al. (2006) Bruno Castro Da Silva, Eduardo W. Basso, Filipo Studzinski Perotto, Ana L. C. Bazzan, and Paulo Martins Engel. 2006. Improving reinforcement learning with context detection. In International Joint Conference on Autonomous Agents & Multiagent Systems.
  • van der Pol et al. (2016) van der Pol et al. 2016. Coordinated Deep Reinforcement Learners for Traffic Light Control. NIPS.
  • Varaiya (2013) Pravin Varaiya. 2013. The max-pressure controller for arbitrary networks of signalized intersections. In Advances in Dynamic Network Modeling in Complex Transportation Systems. Springer, 27–66.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  • Velickovic et al. (2017) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 1, 2 (2017).
  • Webster (1958) FV Webster. 1958. Traffic signal settings. Technical Report.
  • Wei et al. (2018) Hua Wei, Guanjie Zheng, Huaxiu Yao, and Zhenhui Li. 2018. Intellilight: A reinforcement learning approach for intelligent traffic light control. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2496–2505.
  • Wiering (2000) MA Wiering. 2000. Multi-agent reinforcement learning for traffic light control. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML’2000). 1151–1158.
  • Yang et al. (2018) Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. 2018. Mean Field Multi-Agent Reinforcement Learning. arXiv preprint arXiv:1802.05438 (2018).
  • Yao et al. (2018) Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, Yanwei Yu, and Zhenhui Li. 2018. Modeling Spatial-Temporal Dynamics for Traffic Prediction. arXiv preprint arXiv:1803.01254 (2018).
  • You et al. (2016) Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4651–4659.