Partially Observable Reinforcement Learning for Intelligent Transportation Systems

Rusheng Zhang 1, Akihiro Ishikawa 1, Wenli Wang 1, Benjamin Striner 2, and Ozan Tonguz 1

1 Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890, USA
2 Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213-3890, USA

Intelligent Transportation Systems (ITS) have attracted the attention of researchers and the general public alike as a means to alleviate traffic congestion. Recently, the maturity of wireless technology has enabled a cost-efficient way to achieve ITS by detecting vehicles using Vehicle to Infrastructure (V2I) communications.

Traditional ITS algorithms, in most cases, assume that every vehicle is observed, such as by a camera or a loop detector, but a V2I implementation would detect only those vehicles with wireless communications capability. We examine a family of transportation systems, which we refer to as ‘Partially Detected Intelligent Transportation Systems’. An algorithm that performs well under a small detection rate is highly desirable, because penetration of the underlying wireless technologies, such as Dedicated Short Range Communications (DSRC), will be gradual. Artificial Intelligence (AI) techniques, in particular Reinforcement Learning (RL), are suitable tools for finding such an algorithm, as they can utilize varied inputs and do not require an explicit analytic understanding or model of the underlying system dynamics.

In this paper, we report an RL algorithm for partially observable ITS based on DSRC. The performance of this system is studied under different car flows, detection rates, and road network topologies. Our system efficiently reduces the average waiting time of vehicles at an intersection, even with a low detection rate.

Reinforcement Learning, Artificial Intelligence, Intelligent Transportation Systems, Partially Detected Intelligent Transportation Systems, Vehicle-to-Infrastructure Communications

I Introduction

The research reported in this paper was funded by King Abdulaziz City for Science and Technology (KACST), Riyadh, Kingdom of Saudi Arabia.

Traffic congestion is a daunting problem that affects the daily lives of billions of people in most countries across the world [1]. Over at least the past 30 years, many attempts to alleviate this problem in the form of intelligent transportation systems have been designed and demonstrated [2, 3, 4, 5, 6, 7, 8]. Among these different approaches, some use real time traffic information measured or collected by video cameras or loop detectors and optimize the cycle split of a traffic light accordingly [9]. Unfortunately, such intelligent traffic control schemes are expensive and, therefore, they exist only at a small percentage of intersections in the United States, Europe, and Asia.

Recently, several cost-effective approaches to implement intelligent transportation systems were proposed by leveraging the fact that Dedicated Short-Range Communication (DSRC) technology will be mandated by the US Department of Transportation (DoT) and will be implemented in the near future [10, 11, 12]. DSRC technology is potentially a much cheaper technology for detecting the presence of vehicles on the approaches of an intersection. However, at the early stages of deployment, only a small percentage of vehicles will be equipped with DSRC radios. Since this adoption stage could last several years due to increasing vehicle life [13], new control algorithms that can handle partial detection of DSRC-equipped vehicles are required.

One promising AI approach is deep reinforcement learning (DRL), which has recently been explored by several groups [14, 15]. These results showed improvements in the waiting time and queue length experienced at an intersection in a fully observable environment. Hence, in this paper, we investigate this promising approach in a partially observable environment. We expect the results to improve asymptotically as the penetration rate of DSRC-equipped vehicles increases.

In this paper, we explore the capability of DRL to solve the traffic control problem in a DSRC-based partially detected intelligent transportation system. Though we mainly consider DSRC detection in this context, the scheme described here is generic enough to be used for any other partially detected intelligent transportation system, such as vehicle detection based on RFID, Bluetooth Low Energy 5.0 (BLE 5.0), or LTE. We perform extensive simulations to analyze different aspects of the RL method. Our results clearly show that AI in general, and reinforcement learning in particular, is capable of finding an excellent traffic management scheme that reduces the waiting time of commuters at a given intersection, even at a low penetration rate.

II Related Work

Traffic control using Artificial Intelligence (AI), especially reinforcement learning (RL), has been an active field of research for the last 20 years. In 1994, Mikami et al. proposed distributed reinforcement learning (Q-learning) using a Genetic Algorithm as a traffic control scheme that effectively increased the throughput of a traffic network [16]. Due to the limitations of computational power in 1994, however, it could not be implemented at that time.

Bingham proposed RL for parameter search of a fuzzy-neural traffic controller for a single intersection [17], while Choy et al. adapted RL on the fuzzy-neural system in a cooperative scheme, achieving adaptive control for a large area [18]. These algorithms are based on RL, but the major role of RL is parameter tuning of the fuzzy-neural system. Abdulhai et al. proposed the first truly adaptive traffic signal, which learns to control the traffic dynamically based on a Cerebellar Model Articulation Controller (CMAC) as a Q-estimation network [19]. Silva and Oliveira then proposed a context detector (CD) in conjunction with RL to further improve performance under non-stationary traffic situations [20, 21]. Several researchers have focused on multi-agent reinforcement learning for large-scale deployment [22, 23, 24, 25].

Recently, with the development of GPUs and increased computational power, Deep Reinforcement Learning has become an attractive method in several fields. Several attempts have been made using Deep Q-learning for ITS, including [26, 14, 15, 27]. These results show that a DQN-based Q-learning algorithm is capable of optimizing the traffic in an intelligent manner.

All the aforementioned research, however, focuses on traditional intelligent transportation systems (ITS), mostly with loop/camera detectors, where all vehicles are detected. Even though RL yields impressive results in these cases, it does not outperform current systems [2, 3, 4, 5, 6, 7, 8]. Hence, the progress of these algorithms, while interesting, is of limited impact, since traditional ITS perform comparably.

Meanwhile, as Dedicated Short-Range Communications start to be installed on vehicles in the United States, traffic control schemes based on such technology have become a rising field, as their cost is significantly lower than that of a traditional ITS [10, 11, 12]. Within these schemes, a system known as Virtual Traffic Lights (VTL) is very attractive, as it proposes an infrastructure-free DSRC-based solution by putting traffic control devices inside vehicles and having the vehicles decide the right-of-way at an intersection locally. Different aspects of VTL technology, including algorithm design, system simulation, deployment policy, and carbon emissions, have been studied by different research groups in the last few years [10, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]. However, a VTL system requires all vehicles in the transportation system to be equipped with DSRC devices; therefore, a scheme is needed for the current transportation system to smoothly transition to VTL.

The main contributions of this paper are:

  1. Explore a new kind of intelligent system that is based on partial detection of DSRC-equipped vehicles, which is a cost-effective alternative to current ITS and an important problem not addressed by traditional ITS.

  2. Propose a transition scheme to VTL. Not only does the scheme reduce the average commute time for all end users, but users with DSRC experience a much lower commute time, which attracts additional users to adopt DSRC capability.

  3. Design a new RL-based traffic control algorithm and system design that performs well under low penetration ratio and detection rates.

  4. Provide a detailed performance analysis. The analysis shows that, under a low detection rate, the system can perform almost as well as an ITS that employs full detection. This is a promising solution considering its cost-effectiveness.

III Problem Formulation

The rapid development of the Internet of Things (IoT) has created new technology applicable to sensing vehicles for intelligent transportation systems. Other than DSRC, applicable technologies include, but are not limited to, RFID, Bluetooth, Ultra-Wide Band (UWB), Zigbee, and even cellphone apps such as Google Maps [38, 39, 40]. All these systems are more economical than traditional loop detectors or cameras. Performance-wise, most of these systems are able to track vehicles continuously, while loop detectors can only detect the presence of vehicles, suggesting that a system based on wireless communications can utilize finer-grained information.

Fig. 1: Illustration of Partially Detected Intelligent Transportation System

Unfortunately, the transportation systems mentioned above have a critical shortcoming: they are not able to detect vehicles that are not equipped with the corresponding communication device. In these systems, only a portion of all vehicles is detectable, unlike in a traditional ITS. As this is a common characteristic of several of the aforementioned traffic systems, we denote these systems collectively as Partially Detected Intelligent Transportation Systems (PD-ITS).

Figure 1 gives an illustration of a PD-ITS. There are two kinds of vehicles in the system: the red vehicles are equipped with a communication device that can communicate with the corresponding device on the traffic lights, so the traffic lights are able to detect these vehicles; the blue vehicles, on the other hand, are not equipped with a communication device and are hence undetectable by the traffic lights. In a PD-ITS, both kinds of vehicles coexist. The traffic lights, based on the information from the detected vehicles, decide the current phase at the intersections in order to minimize the intersection delay for both detected and undetected vehicles.

This paper aims to build a traffic control scheme that:

  1. performs well even with a low detection rate;

  2. accelerates the transition to a higher adoption rate and therefore a higher detection rate.

In the rest of the paper, for notational convenience, we choose one typical PD-ITS, the transportation system based on DSRC radios, as an example. The detected vehicles are those equipped with DSRC radios, and the undetected vehicles are those without. Other kinds of PD-ITS are analogous, making the methodologies described in this paper adaptable to them as well.

IV Methodology

IV-A Q-Learning Algorithm

We refer the reader to Watkins [41] for a detailed explanation of general reinforcement learning and Q-learning, but we provide a brief review in this section.

The goal of reinforcement learning is to train an agent that interacts with the environment by selecting actions in a way that maximizes the future reward. As shown in Figure 2, at every time step, the agent receives the state (the current observation of the environment) and the reward (the quantified indicator of performance from the last time step) from the environment and takes an action. During this process, the agent tries to maximize the cumulative reward of its action policy. The beauty of this kind of algorithm is that it doesn’t need any supervision: the agent observes the environment and tries to optimize its performance without human intervention.

Fig. 2: Concept for reinforcement learning

One RL algorithm is Q-learning [41], which enables an agent to learn to act optimally in finite Markovian domains. In the Q-learning approach, the agent learns a ‘Q-value’, denoted $Q(s_t, a_t)$, a function of the observed state $s_t$ and action $a_t$ that outputs the expected cumulative discounted future reward. Here, $t$ denotes the discrete time index. The cumulative discounted future reward is defined as:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

Here, $r_t$ is the reward at each time step, the meaning of which needs to be specified according to the actual problem, and $\gamma \in [0,1)$ is a design parameter that trades off immediate gratification against future payoffs. If the user is concerned with long-term consequences, $\gamma$ should be close to 1 so that future rewards decay more slowly. At every time step, the agent updates its Q function by the update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $\alpha$ is the learning rate.

In most cases, including the traffic control scenarios of interest, the complexity of the state and action spaces makes a tabular Q function impractical, so deep neural networks are used to approximate the Q function. Instead of updating a stored Q value directly, we use the value:

$$y_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)$$

as the output target of the Q-network and perform a step of back-propagation on the input $(s_t, a_t)$.

We utilized two known methods to stabilize the training process [42, 43]:

  1. Two Q-networks are maintained: a target Q-network and an on-line Q-network. The target Q-network is used to compute the target Q-values, while the on-line Q-network is back-propagated at every step. During training, the agent makes decisions with the target Q-network, and the results from each time step are used to update the on-line Q-network. At periodic intervals, the target Q-network’s weights are synchronized with those of the on-line Q-network. This keeps the agent’s decision network relatively stable, instead of changing at every step.

  2. Instead of training after every step the agent takes, past experiences are stored in a memory buffer, and training data is sampled from the memory in batches of a certain size. This experience replay aims to break the time correlation between samples [44].
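A minimal sketch of these two stabilization techniques is shown below, with a linear Q-function standing in for the deep network for brevity. All class names, method names, and hyperparameter values are our own illustrative assumptions, not the paper's actual implementation:

```python
import random
from collections import deque

import numpy as np

class DQNAgent:
    """Sketch of target-network + experience-replay stabilization,
    using a linear Q-function in place of a deep network."""

    def __init__(self, state_dim, n_actions, gamma=0.95, lr=0.01,
                 buffer_size=10_000, batch_size=32, sync_every=100):
        self.gamma, self.lr = gamma, lr
        self.batch_size, self.sync_every = batch_size, sync_every
        self.n_actions = n_actions
        self.online_w = np.zeros((n_actions, state_dim))  # updated every step
        self.target_w = self.online_w.copy()              # frozen between syncs
        self.replay = deque(maxlen=buffer_size)           # experience buffer
        self.steps = 0

    def q_values(self, state, weights):
        return weights @ state                            # one Q-value per action

    def act(self, state, epsilon=0.1):
        # Epsilon-greedy decision made with the (stable) target network.
        if random.random() < epsilon:
            return random.randrange(self.n_actions)
        return int(np.argmax(self.q_values(state, self.target_w)))

    def remember(self, s, a, r, s_next):
        self.replay.append((s, a, r, s_next))

    def train_step(self):
        if len(self.replay) < self.batch_size:
            return
        # Sampling uniformly from the replay buffer breaks time correlation.
        batch = random.sample(list(self.replay), self.batch_size)
        for s, a, r, s_next in batch:
            # Target computed with the target network; gradient step on online.
            y = r + self.gamma * np.max(self.q_values(s_next, self.target_w))
            td_error = y - self.q_values(s, self.online_w)[a]
            self.online_w[a] += self.lr * td_error * s
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target_w = self.online_w.copy()          # periodic sync
```

Freezing the target weights between syncs keeps the regression target from shifting at every gradient step, which is what makes the training loop stable.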

In this paper, we train the traffic light agents using a Deep Q-network (DQN) [44]. With the Q-learning algorithm described above, our work focuses on the definition of the agents’ actions and the assignment of states and rewards, which are discussed in the following subsection IV-B.

IV-B Parameter Modeling

We consider a traffic light controller, which takes reward and state observation from the environment and chooses an action. In this subsection, we introduce our design of actions, rewards, and states for the aforementioned PD-ITS problem.

IV-B1 Agent action

In our context, the agent’s action is either to keep the current traffic light phase or to switch to the next phase. At every time step, the agent makes an observation and takes an action accordingly, achieving intelligent control of the traffic.

IV-B2 Reward

For traffic optimization problems, the goal is to decrease the average traffic delay of commuters in the network. Specifically, we want to find the best strategy $\pi$ such that the delay $\delta(\pi) = T(\pi) - T_0$ is minimized, where $T(\pi)$ is the average travel time of commuters in the network under the traffic control scheme $\pi$, and $T_0$ is the physically possible lowest average travel time. For a vehicle traveling the same distance $L$,

$$T_0 = \frac{L}{v_{\max}}$$

Here, $v_{\max}$ is some maximum reasonable speed for the vehicle, such as the speed limit of the road in concern. Therefore, using $L = \int_0^T v(t)\,dt$,

$$\delta = T - T_0 = \int_0^T \left(1 - \frac{v(t)}{v_{\max}}\right) dt$$

Therefore, minimizing the delay is equivalent to minimizing, at each step $t$, for each vehicle:

$$1 - \frac{v(t)}{v_{\max}} \qquad (1)$$

Notice this is equivalent to maximizing $v(t)$, if $v_{\max}$ is the same on all roads for all cars. If different vehicles have different $v_{\max}$, the reward function is taken as the arithmetic average of expression (1) over all vehicles.

We define the expression in (1) as the penalty of each step. Our goal is to minimize the penalty of each step. Since reinforcement learning maximizes the reward (minimizes the penalty), we define the negative of the penalty as the reward for the reinforcement learning problem:

$$r_t = -\frac{1}{N}\sum_{i=1}^{N} \left(1 - \frac{v_i(t)}{v_{\max,i}}\right)$$

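A direct way to compute this per-step reward from vehicle speeds is sketched below; the function name and the convention of returning 0 for an empty network are our own assumptions for this example:

```python
def step_reward(speeds, speed_limits):
    """Per-step reward: the negative of the average normalized delay
    penalty (1 - v/v_max) over all vehicles currently in the network.

    speeds       -- current speed of each vehicle (m/s)
    speed_limits -- corresponding maximum reasonable speed (m/s)
    """
    if not speeds:
        return 0.0  # empty network incurs no penalty
    penalties = [1.0 - v / v_max for v, v_max in zip(speeds, speed_limits)]
    return -sum(penalties) / len(penalties)
```

Under this definition, a fully stopped queue yields a reward of -1 per step, while free-flowing traffic at the speed limit yields 0, so the agent is pushed toward keeping vehicles moving.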

IV-B3 State representation

For optimal decision making, a system should consider as much relevant information about the traffic process as possible. Traditional ITS typically detect only simple information, such as the presence of vehicles. In a partially detected ITS, only a portion of the vehicles is detected, but more specific information about these vehicles, such as speed and position, is available due to the capabilities of DSRC.

Reinforcement learning enables experimentation with many possible choices of inputs and input representations. Further research is required to determine the experimental benefits of each option, which goes beyond the scope of this paper. Based on initial experiments, for the purposes of this paper, we selected a state representation including the number of detected vehicles at each approach, the distance to the nearest detected vehicle at each approach, an amber phase indicator, the elapsed time of the current traffic light phase, and the current time, as detailed in the following table:

Information | Representation
Detected car count | Number of detected vehicles on each approach
Distance to nearest detected vehicle | Distance to the nearest detected vehicle on each approach; set to the lane length (in meters) if no vehicle is detected
Current phase time | Duration from the start of the current phase to now (in seconds)
Amber phase | Indicator of the amber phase; 1 if currently in the amber phase, otherwise 0
Current time | Current time of day (hours since midnight)
Current phase | Encoded by sign: the detected car count and the distance to the nearest detected vehicle are negated if the approach’s light is red, positive if green

Note that the current traffic light phase (green or red) is represented by a sign change in the per-lane detected car count and distance rather than by a separate indicator. In initial experiments, we observed slightly faster convergence using this distributed representation (sign representation) than a separate indicator (shown in Figure 6). We hypothesize that, in combination with the Rectified Linear Unit (ReLU) activation, this encoding biases the network to utilize different combinations of neurons for different phases. ReLU units are active when their input is positive and inactive when it is negative, so our representation may encourage different units to be utilized during different phases, accelerating learning. There are many possible representations and our experimentation with them is not exhaustive, but we found that reinforcement learning was able to handle several different representations with reasonable performance.
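As an illustration, the state vector described above could be assembled as in the following sketch; the per-approach dictionary keys and the flat list layout are assumptions made for this example, not the paper's exact encoding:

```python
def encode_state(approaches, phase_elapsed, in_amber, hour_of_day):
    """Assemble the per-intersection state vector described in the table.

    approaches    -- list of dicts, one per approach, each with keys:
                     'count'   (detected vehicles on the approach),
                     'nearest' (distance to nearest detected vehicle,
                                or the lane length if none detected),
                     'green'   (True if this approach currently has green)
    phase_elapsed -- seconds since the current phase started
    in_amber      -- True if currently in the amber phase
    hour_of_day   -- current time of day (hours since midnight)
    """
    state = []
    for a in approaches:
        # Phase is encoded by a sign change rather than a separate flag.
        sign = 1.0 if a['green'] else -1.0
        state.append(sign * a['count'])
        state.append(sign * a['nearest'])
    state.append(phase_elapsed)
    state.append(1.0 if in_amber else 0.0)
    state.append(hour_of_day)
    return state
```

The sign flip means a red approach with two cars at 15 m contributes (-2, -15) while the same approach under green contributes (2, 15), giving the ReLU units a chance to specialize per phase.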

IV-C System Design

Fig. 3: One possible system design for the proposed scheme

We provide here one of the possible system realizations for the proposed scheme, based on Dedicated Short-Range Communications (DSRC). The system has an ‘On Roadside’ unit and an ‘On Vehicle’ unit, as shown in Figure 3. The DSRC RoadSide Unit (RSU) senses the Basic Safety Messages (BSM) broadcast by the DSRC OnBoard Units (OBU), parses out the useful information, and sends it to the Reinforcement Learning based decision-making unit. This unit then makes a decision based on the information provided by the RSU.

Fig. 4: Control logic of RL based decision making unit

Figure 4 gives a flow chart of how the RL-based control unit makes decisions. As shown in the figure, the control unit gets the state representation from the DSRC RSU every second and calculates the Q-values for all possible actions; if the action of keeping the current phase has the bigger Q-value, it retains the phase; otherwise, it switches to the next phase.

In addition to the main logic discussed above, a sanity check is performed on the agent: a mandatory maximum and minimum phase duration. If the current phase duration is less than the minimum phase time, the agent keeps the current phase regardless of the action the DQN chooses; similarly, if the phase duration is larger than or equal to the maximum phase time, the phase is forced to switch.
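The decision logic of Figure 4, together with the minimum/maximum phase-duration sanity check, can be sketched as follows (the bounds of 5 s and 60 s are illustrative placeholders, not values from the paper):

```python
def choose_phase_action(q_keep, q_switch, phase_elapsed,
                        min_phase=5.0, max_phase=60.0):
    """Decide whether to keep or switch the current traffic light phase.

    q_keep, q_switch -- Q-values estimated by the DQN for the two actions
    phase_elapsed    -- seconds since the current phase started
    min_phase, max_phase -- mandatory phase-duration bounds (illustrative)
    Returns 'keep' or 'switch'.
    """
    # Sanity checks override the learned policy.
    if phase_elapsed < min_phase:
        return 'keep'
    if phase_elapsed >= max_phase:
        return 'switch'
    # Otherwise follow the action with the larger Q-value,
    # keeping the phase on ties.
    return 'keep' if q_keep >= q_switch else 'switch'
```

Keeping the bounds outside the learned policy guarantees safe phase durations even if the Q-network misbehaves on an unfamiliar state.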

IV-D Implementation

We now describe the design of the proposed scheme at the system level. The implementation contains two phases: the training phase and the deployment phase. As shown in Figure 5, the agent is first trained with a simulator, then ported to the intersection and connected to the real traffic signal, after which it starts to control the traffic.

Fig. 5: The deployment scheme

IV-D1 Training phase

The agent is trained by interacting with a traffic simulator. The simulator randomly generates vehicle arrivals, then determines whether each vehicle can be detected by drawing from a Bernoulli distribution parameterized by the detection rate. In the context of DSRC-based vehicle detection systems, the detection rate corresponds to the DSRC penetration rate. The simulator obtains the traffic state, calculates the current reward accordingly, and feeds both to the agent. Using the Q-learning update formula cited in previous sections, the agent updates itself based on the information from the simulator. Meanwhile, the agent chooses an action and forwards it to the simulator, which then updates and changes the traffic light phase according to the agent’s indication. These steps are repeated until convergence, at which point the agent is trained.
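The Bernoulli detection step of the simulator can be sketched as follows (`detected_subset` is a hypothetical helper, not part of the paper's code; only the vehicles it returns would be visible in the agent's state):

```python
import random

def detected_subset(vehicles, detection_rate, rng=None):
    """Mark each vehicle as detected (e.g. DSRC-equipped) or not by
    drawing from a Bernoulli distribution with parameter detection_rate,
    and return only the detected vehicles."""
    rng = rng or random.Random()
    return [v for v in vehicles if rng.random() < detection_rate]
```

At a detection rate of 1.0 this reduces to the fully observed setting; at 0.0 the agent sees nothing, so the trained policy degrades gracefully between the two extremes.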

The performance of an agent relies heavily on the quality of the simulator. To obtain an arrival pattern similar to the real world, the simulator generates car flow from the historical record of vehicle arrival rates on the same map as the real intersection. To address the variation in car flow across different parts of the day, the current time of day is also included in the state representation, so that after training the agent is able to adapt to different car flows at different times of the day. Other factors that affect car flow, such as the day of the week, could also be parameterized in the state representation.

The goal of training is for the traffic control scheme to achieve the shortest average commute time for all commuters. During the training period, the machine tries different control schemes and eventually converges to an optimal scheme that yields the minimum average commute time.

IV-D2 Deployment phase

In the deployment phase, the software agent is installed at the intersection to control the traffic light. Here, the agent does not update the learned Q-function, but simply controls the traffic signal. Namely, the detector feeds the agent the currently detected traffic state; based on this state, the agent chooses an action with the trained Q-network and directs the traffic signal to switch or keep the phase accordingly. This step is performed in real time, thus enabling continuous traffic control.

V Simulation and Performance Analysis

In this section, we present several simulation scenarios to evaluate various aspects of the performance of the proposed scheme. The simulations are performed with SUMO, a microscopic traffic simulator [45]. Different scenarios are considered in order to give a comprehensive analysis of the proposed scheme.

Qualitatively speaking, we can see from the GUI that the agent reacts to the traffic in an intelligent way, making reasonable decisions for the arriving vehicles. We demonstrate the performance of the agent after different periods of training in a video available at [46].

Fig. 6: Penalty decreasing with the number of training iterations; the curves shown are from training under dense car flow at a single intersection

Figure 6 shows typical training curves. Both phase representations have similar trends, but we observe that the sign representation had a slightly faster convergence rate in every experiment (see Section IV-B3).

We provide a quantitative analysis in the following subsections of performance for different detection rates and network topologies and sensitivity to car flow and detection rate.

V-A Performance for different detection rates

In this subsection, we present the performance under different detection rates, to observe how a partially observable ITS behaves as the detection rate transitions from 0% to 100%. We compare against a typical pre-timed signal with a green phase duration of 24 seconds, shown as dashed lines for reference.

Fig. 7: Waiting time under different detection rate under medium car flow

Figure 7 shows a typical trend we obtained in simulations. The figure shows the waiting time of vehicles at a single intersection with car flows from the north, east, south, and west of 0.02 veh/s, 0.1 veh/s, 0.02 veh/s, and 0.05 veh/s, respectively, with vehicles arriving as a Poisson process. One can make several interesting observations from this figure. First of all, the system under AI control is much better than the traditional pre-timed traffic signal, even at a low detection rate. We also observe that the overall waiting time (red line) decreases as the detection rate increases. This is intuitive: as more vehicles are detected, the system has more information and thus is able to optimize the car flow better.

Additionally, one can observe from the figure that approximately 80% of the benefit is obtained in the first 20% of the transition. This finding is significant: the scheme not only improves asymptotically as the system gradually evolves toward a 100% detection rate, but also delivers much of the benefit of the final-stage system during the initial transition.

Another important observation is that during the transition, although the agent is rewarded for optimizing the overall average commute time of both detected and undetected vehicles, the detected vehicles (green line in Figure 7) have a lower commute time than the undetected vehicles (blue line in Figure 7). This provides an interesting ‘potential’ or ‘incentive’ for the system to transition from no vehicles equipped with the IoT device to all vehicles equipped with it: drivers of vehicles not yet equipped with the device now have a good reason and a strong incentive to install one.

(a) Performance under dense flow
(b) Performance under sparse flow
Fig. 8: Waiting time under different detection rate under dense and sparse car flow

Figure 8 shows the performance under the other two cases: when the car flow is very sparse (0.02 veh/s in each lane) or very dense (0.5 veh/s in each lane). For the sparse situation in Figure 8(b), the trend is similar to the medium flow case shown in Figure 7.

One can see from Figure 8(a) that under the dense situation, the curve becomes quite flat. This is because when the car flow is high, detecting individual vehicles becomes less important. When many cars arrive at the intersection, the car flow has ‘liquid’ qualities, as opposed to the ‘particle’ qualities of the previous two situations. The trained RL agent is able to seamlessly transition from a ‘particle arrival’ optimization agent that handles random arrivals to a ‘liquid arrival’ optimization agent that handles macroscopic flow. This result shows that RL is able to capture the main factors that affect the traffic system’s performance and behaves differently under different car arrival rates.

V-B Performance under different network topologies

Figure 7 shows a typical situation for the system at a single intersection with Poisson arrivals; however, at most intersections, vehicles form platoons because of upstream intersections. We therefore also present results under topologies that create more complicated arrival patterns: an arterial road topology and a mesh network topology.

(a) Arterial topology
(b) Manhattan Grid topology
Fig. 9: Arterial and Manhattan Grid topology are used in the simulation

Figure 9 shows the two road topologies we used to evaluate performance. Figure 9(a) shows the arterial road structure. The arrival rate on the arterial road is 0.2 veh/s from the north and 0.1 veh/s from the south, while the arrival rates on the other roads are all set to 0.05 veh/s. The vehicles on the arterial road, after going through one intersection, automatically form clusters, creating a more realistic arrival pattern than a Poisson arrival. Figure 9(b) shows the 4x4 Manhattan Grid road structure we use for our simulations. This 2-dimensional structure forms more complicated arrival patterns at each intersection.

At each intersection, an independent RL agent is assigned with an independent Q-network. Each agent aims to optimize its own intersection separately, within the same traffic simulation.

(a) Performance for 5x1 arterial road
(b) Performance for 4x4 Manhattan Grid
Fig. 10: Expected performance for arterial and network topology under medium car flow

Figure 10 shows the performance under medium car flow for the two topologies. Notice that the trends in both figures are similar to the one obtained in Figure 7. This indicates that the reinforcement learning agent is capable of handling different arrival patterns and achieves good performance under bulk arrivals.

V-C Performance of a whole day

Section V-A examines the effect of the flow rate on system performance. Since the car flow differs at different times of the day, we also simulate an entire day of traffic. Figure 11 shows the car flow rate used for the simulation, based on the hourly car flow reported in [47].

Fig. 11: Typical car flow in a day
Fig. 12: Expected Performance by Time

Figure 12 shows the performance for different vehicles over a whole day. One can observe from this figure that the performance at a 20% detection rate (red line) is very close to the performance at a 100% detection rate (green line) at most times of the day (from 5 am to 9 pm). During rush hours, the system with a 100% detection rate performs almost the same as the system with a 20% detection rate. Though the system with a 100% detection rate performs visibly better around midnight, the performance at that time is not as critical as the performance during the busier daytime. This result indicates that by detecting 20% of vehicles, the system performs almost as well as when detecting all vehicles, while the detectable vehicles (yellow lines) retain an advantage over the undetectable vehicles (dashed line).

These results are intuitive. With a large volume of cars, a low detection rate still provides a relatively low-variance estimate of the traffic flow. With few cars and a low detection rate, the estimate of the traffic flow can have very high variance. Late at night, with only a single detected car, an ITS can give that car a green light immediately, which would not be possible for an undetected car.
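This variance argument can be checked with a quick numerical sketch: scaling the detected count by the reciprocal of the detection rate gives an unbiased estimate of the true count, but the relative spread of that estimate grows as the number of cars shrinks (`estimated_flow` is a hypothetical helper, not part of the proposed system):

```python
import random

def estimated_flow(true_count, detection_rate, trials=5000, seed=1):
    """Estimate the true vehicle count by scaling up the detected count,
    and report the mean and standard deviation of that estimate over
    repeated Bernoulli detection draws."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        detected = sum(rng.random() < detection_rate
                       for _ in range(true_count))
        estimates.append(detected / detection_rate)
    mean = sum(estimates) / trials
    var = sum((e - mean) ** 2 for e in estimates) / trials
    return mean, var ** 0.5
```

With a 20% detection rate, 100 cars are estimated with roughly 20% relative error, while 4 cars are estimated with roughly 100% relative error, matching the intuition that sparse late-night traffic is the hard case.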

V-D Sensitivity Analysis

The results above were obtained with agents trained and evaluated under the same environmental parameters, justified by the fact that traffic patterns fluctuate only slightly from day to day.

Below, we evaluate the sensitivity of the agents to two environmental parameters: the car flow and the detection rate.

V-D1 Sensitivity of car flow

Figure 13 shows the agents’ sensitivity to car flow. Figure 13(a) shows the performance of an agent trained under 0.1 veh/s car flow, operating at different flow rates, while Figure 13(b) shows the sensitivity of an agent trained under 0.5 veh/s car flow. The blue curve in each figure is the trained agent’s performance, while the red curve is the performance of the optimal agent (an agent trained and tested under the same conditions). Both agents perform well over a range of flow rates. The agent trained under 0.1 veh/s flow can handle flow rates from 0 to 0.15 veh/s at near-optimal levels, and at higher flow rates it still performs reasonably well. The agent trained on 0.5 veh/s flow performs reasonably from 0.25 veh/s to 0.5 veh/s, but below 0.25 veh/s it starts to perform substantially worse than the optimal agent. Since traffic patterns are not expected to fluctuate heavily, these results give a strong indication that an agent trained on historical data will be able to adapt to the environment even when the actual situation differs slightly from the training situation.

(a) Sensitivity of agent trained under 0.1 veh/s flow rate
(b) Sensitivity of agent trained under 0.5 veh/s flow rate
Fig. 13: Sensitivity analysis of flow rate

V-D2 Sensitivity of detection rate

In most situations, the detection rate can only be measured approximately. An agent trained at one detection rate will therefore likely need to operate at a slightly different detection rate, so we test the agents’ sensitivity to the detection rate.

(a) Sensitivity of agent trained under 0.2 detection rate
(b) Sensitivity of agent trained under 0.8 detection rate
Fig. 14: Sensitivity analysis of detection rate

Figure 14 shows the sensitivity in two cases: Figure 14(a) shows the sensitivity of an agent trained at a low detection rate (0.2), and Figure 14(b) at a high detection rate (0.8).

We observe that the agent trained at a 0.2 detection rate performs at an optimal level for detection rates from 0.1 to 0.4, and that it is less sensitive to increases in the detection rate than to decreases. This indicates that, during early deployment of this system, it is better to underestimate the detection rate, since the agent’s performance is more stable at detection rates above the training value.

Figure 14(b) shows the sensitivity of the agent trained at a high detection rate (0.8). The performance of this agent is at an optimal level for detection rates from 0.5 to 1. Although the sensitivity profile at a low detection rate differs from that at a high detection rate, in both cases the agent shows a degree of stability: as long as the detection rate used for training is not too different from the actual detection rate, the performance of the agent is not greatly affected.

VI Discussion

As the simulation results show, while all vehicles experience a shorter waiting time under an RL-based traffic controller, detected vehicles have a shorter commute time than undetected vehicles. This property makes it possible for hardware manufacturers, software companies, and vehicle manufacturers to help push the scheme forward, rather than the Department of Transportation (DoT) alone, for the simple reason that all of them can profit from this system. For example, it would be valuable for a navigation app to advertise that its customers save 30% on commute time.

We therefore view this technology as a new generation of intelligent transportation system, as it inherently comes with a lucrative commercial model. The burden of increasing the penetration rate of this system is distributed across many companies, as opposed to traditional ITS, where it falls on the DoT alone. This makes it financially feasible to install the system at most intersections in a city, as opposed to the current situation where only a small proportion of intersections are equipped with ITS.

The mechanism of the described system also makes dynamic pricing for different vehicles possible. Dynamic pricing refers to reserving certain roads during rush hour exclusively for paid users. This method has often been scuttled by public or political opposition, and only a few cities have implemented it [48, 49]. Its success depends heavily on road topology and public opinion, so the few successful examples cannot easily be copied or adapted to other cities. Our solution can accomplish dynamic pricing in a more intelligent way, by simply offering vehicle detection as a service, since detected vehicles experience reduced commute times. There is no need to reserve roads, which makes the scheme extremely easy to deploy. End users also have a choice: when they are in a hurry, they can pay for a lower commute time; when they are not in a hurry and do not mind waiting longer, they simply do not pay. Unlike traditional congestion pricing, the scheme therefore does not significantly hurt nonpaying users: by enabling vehicle detection, a user receives slightly preferential treatment at traffic lights, instead of an entire road being reserved for paid users.

VII Conclusion

In this paper, we have proposed reinforcement learning, specifically deep Q-learning, for traffic control with partial detection of vehicles. The results of our study show that reinforcement learning is a promising new approach to optimizing traffic control under partial detection, such as in traffic control systems based on DSRC technology. This is a highly desirable outcome, since industry forecasts suggest that DSRC penetration will be gradual rather than abrupt.

The numerical results on a single intersection with sparse, medium, and dense arrival rates suggest that reinforcement learning can handle all kinds of traffic flow. Although the optimization of traffic under sparse arrivals and dense arrivals is, in general, very different, the results show that reinforcement learning is able to leverage both the ‘particle’ and the ‘liquid’ properties of vehicle flow, providing a very powerful overall optimization scheme.

We have shown promising results for the single-agent case, which were subsequently extended to five intersections. The difficulty of the multi-agent case is that the car arrival distribution is no longer a Poisson process. However, with the help of DSRC radios, traffic lights can communicate with each other, and designing such a coordinated system could significantly improve the performance of traffic control systems.


The authors would like to thank Dr. Hanxiao Liu of the Language Technologies Institute, Carnegie Mellon University, for informative discussions and many suggestions on the methods reported in this paper. The authors would also like to thank Dr. Laurent Gallo of Eurecom, France, and Mr. Manuel E. Diaz-Granados of Yahoo, USA, for their initial attempt to solve this problem in 2016.


  • [1] “Traffic congestion and reliability: Trends and advanced strategies for congestion mitigation,”, 2017, [Online; accessed 19-Aug-2017].
  • [2] D. I. Robertson, “‘TRANSYT’ method for area traffic control,” Traffic Engineering & Control, vol. 8, no. 8, 1969.
  • [3] P. Hunt, D. Robertson, R. Bretherton, and M. C. Royle, “The scoot on-line traffic signal optimisation technique,” Traffic Engineering & Control, vol. 23, no. 4, 1982.
  • [4] J. Luk, “Two traffic-responsive area traffic control methods: SCATS and SCOOT,” Traffic Engineering & Control, vol. 25, no. 1, 1984.
  • [5] N. H. Gartner, OPAC: A demand-responsive strategy for traffic signal control, 1983, no. 906.
  • [6] P. Mirchandani and L. Head, “A real-time traffic signal control system: architecture, algorithms, and analysis,” Transportation Research Part C: Emerging Technologies, vol. 9, no. 6, pp. 415–432, 2001.
  • [7] J.-J. Henry, J. L. Farges, and J. Tuffal, “The prodyn real time traffic algorithm,” in Control in Transportation Systems.   Elsevier, 1984, pp. 305–310.
  • [8] R. Vincent and J. Peirce, “‘MOVA’: Traffic responsive, self-optimising signal control for isolated intersections,” Tech. Rep., 1988.
  • [9] “Traffic light control and coordination,”, 2016, [Online; accessed 23-Mar-2016].
  • [10] M. Ferreira, R. Fernandes, H. Conceição, W. Viriyasitavat, and O. K. Tonguz, “Self-organized traffic control,” in Proceedings of the seventh ACM international workshop on VehiculAr InterNETworking.   ACM, 2010, pp. 85–90.
  • [11] N. S. Nafi and J. Y. Khan, “A vanet based intelligent road traffic signalling system,” in Telecommunication Networks and Applications Conference (ATNAC), 2012 Australasian.   IEEE, 2012, pp. 1–6.
  • [12] V. Milanes, J. Villagra, J. Godoy, J. Simo, J. Pérez, and E. Onieva, “An intelligent v2i-based traffic management system,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 1, pp. 49–58, 2012.
  • [13] “Average age of cars on u.s.”, [Online; accessed 21-Aug-2017].
  • [14] W. Genders and S. Razavi, “Using a deep reinforcement learning agent for traffic signal control,” arXiv preprint arXiv:1611.01142, 2016.
  • [15] E. van der Pol, “Deep reinforcement learning for coordination in traffic light control,” Ph.D. dissertation, Master’s thesis, University of Amsterdam, 2016.
  • [16] S. Mikami and Y. Kakazu, “Genetic reinforcement learning for cooperative traffic signal control,” in Evolutionary Computation, 1994. IEEE World Congress on Computational Intelligence., Proceedings of the First IEEE Conference on.   IEEE, 1994, pp. 223–228.
  • [17] E. Bingham, “Reinforcement learning in neurofuzzy traffic signal control,” European Journal of Operational Research, vol. 131, no. 2, pp. 232–241, 2001.
  • [18] M. C. Choy, D. Srinivasan, and R. L. Cheu, “Hybrid cooperative agents with online reinforcement learning for traffic control,” in Fuzzy Systems, 2002. FUZZ-IEEE’02. Proceedings of the 2002 IEEE International Conference on, vol. 2.   IEEE, 2002, pp. 1015–1020.
  • [19] B. Abdulhai, R. Pringle, and G. J. Karakoulas, “Reinforcement learning for true adaptive traffic signal control,” Journal of Transportation Engineering, vol. 129, no. 3, pp. 278–285, 2003.
  • [20] A. B. C. da Silva, D. de Oliveria, and E. Basso, “Adaptive traffic control with reinforcement learning,” in Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2006, pp. 80–86.
  • [21] D. de Oliveira, A. L. Bazzan, B. C. da Silva, E. W. Basso, L. Nunes, R. Rossetti, E. de Oliveira, R. da Silva, and L. Lamb, “Reinforcement learning based control of traffic lights in non-stationary environments: A case study in a microscopic simulator.” in EUMAS, 2006.
  • [22] M. Abdoos, N. Mozayani, and A. L. Bazzan, “Traffic light control in non-stationary environments based on multi agent q-learning,” in Intelligent Transportation Systems (ITSC), 2011 14th International IEEE Conference on.   IEEE, 2011, pp. 1580–1585.
  • [23] J. C. Medina and R. F. Benekohal, “Traffic signal control using reinforcement learning and the max-plus algorithm as a coordinating strategy,” in Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on.   IEEE, 2012, pp. 596–601.
  • [24] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (marlin-atsc): methodology and large-scale application on downtown toronto,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140–1150, 2013.
  • [25] M. A. Khamis and W. Gomaa, “Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework,” Engineering Applications of Artificial Intelligence, vol. 29, pp. 134–151, 2014.
  • [26] L. Li, Y. Lv, and F.-Y. Wang, “Traffic signal timing via deep reinforcement learning,” IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
  • [27] E. van der Pol, F. A. Oliehoek, T. Bosse, and B. Bredeweg, “Video demo: Deep reinforcement learning for coordination in traffic light control,” in BNAIC, vol. 28.   Vrije Universiteit, Department of Computer Sciences, 2016.
  • [28] T. Neudecker, N. An, O. K. Tonguz, T. Gaugel, and J. Mittag, “Feasibility of virtual traffic lights in non-line-of-sight environments,” in Proceedings of the ninth ACM international workshop on Vehicular inter-networking, systems, and applications.   ACM, 2012, pp. 103–106.
  • [29] M. Ferreira and P. M. d’Orey, “On the impact of virtual traffic lights on carbon emissions mitigation,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 1, pp. 284–295, 2012.
  • [30] M. Nakamurakare, W. Viriyasitavat, and O. K. Tonguz, “A prototype of virtual traffic lights on android-based smartphones,” in Sensor, Mesh and Ad Hoc Communications and Networks (SECON), 2013 10th Annual IEEE Communications Society Conference on.   IEEE, 2013, pp. 236–238.
  • [31] W. Viriyasitavat, J. M. Roldan, and O. K. Tonguz, “Accelerating the adoption of virtual traffic lights through policy decisions,” in Connected Vehicles and Expo (ICCVE), 2013 International Conference on.   IEEE, 2013, pp. 443–444.
  • [32] A. Bazzi, A. Zanella, B. M. Masini, and G. Pasolini, “A distributed algorithm for virtual traffic lights with ieee 802.11 p,” in Networks and Communications (EuCNC), 2014 European Conference on.   IEEE, 2014, pp. 1–5.
  • [33] F. Hagenauer, P. Baldemaier, F. Dressler, and C. Sommer, “Advanced leader election for virtual traffic lights,” ZTE Communications, Special Issue on VANET, vol. 12, no. 1, pp. 11–16, 2014.
  • [34] O. K. Tonguz, W. Viriyasitavat, and J. M. Roldan, “Implementing virtual traffic lights with partial penetration: a game-theoretic approach,” IEEE Communications Magazine, vol. 52, no. 12, pp. 173–182, 2014.
  • [35] J. Yapp and A. J. Kornecki, “Safety analysis of virtual traffic lights,” in Methods and Models in Automation and Robotics (MMAR), 2015 20th International Conference on.   IEEE, 2015, pp. 505–510.
  • [36] A. Bazzi, A. Zanella, and B. M. Masini, “A distributed virtual traffic light algorithm exploiting short range v2v communications,” Ad Hoc Networks, vol. 49, pp. 42–57, 2016.
  • [37] O. K. Tonguz and W. Viriyasitavat, “A self-organizing network approach to priority management at intersections,” IEEE Communications Magazine, vol. 54, no. 6, pp. 119–127, 2016.
  • [38] A. Chattaraj, S. Bansal, and A. Chandra, “An intelligent traffic control system using rfid,” IEEE potentials, vol. 28, no. 3, 2009.
  • [39] M. R. Friesen and R. D. McLeod, “Bluetooth in intelligent transportation systems: a survey,” International Journal of Intelligent Transportation Systems Research, vol. 13, no. 3, pp. 143–153, 2015.
  • [40] F. Qu, F.-Y. Wang, and L. Yang, “Intelligent transportation spaces: vehicles, traffic, communications, and beyond,” IEEE Communications Magazine, vol. 48, no. 11, 2010.
  • [41] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [42] L.-J. Lin, “Reinforcement learning for robots using neural networks,” Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, Tech. Rep., 1993.
  • [43] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [44] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
  • [45] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent development and applications of sumo–simulation of urban mobility,” International Journal On Advances in Systems and Measurements, vol. 5, no. 3&4, 2012.
  • [46] “Reinforcement Learning for Traffic Optimization,”, [Online; accessed 12-May-2018].
  • [47] “Traffic Monitoring Guide,”, 2014, online; accessed 5-13-2018.
  • [48] A. de Palma and R. Lindsey, “Traffic congestion pricing methodologies and technologies,” Transportation Research Part C: Emerging Technologies, vol. 19, no. 6, pp. 1377–1399, 2011.
  • [49] B. Schaller, “New york city’s congestion pricing experience and implications for road pricing acceptance in the united states,” Transport Policy, vol. 17, no. 4, pp. 266–273, 2010.

Rusheng Zhang was born in Chengdu, China in 1990. He received the B.E. degree in micro electrical mechanical system and second B.E. degree in Applied Mathematics from Tsinghua University, Beijing, in 2013, and the M.S. degree in electrical and computer engineering from Carnegie Mellon University, in 2015. He is a Ph.D. candidate at Carnegie Mellon University. His research areas include vehicular networks, intelligent transportation systems, wireless computer networks, artificial intelligence and intra vehicular sensor networks.

Akihiro Ishikawa was an MS student in the Electrical and Computer Engineering Department of Carnegie Mellon University until he received his MS degree in 2017. His research interests include vehicular networks, wireless networks, and artificial intelligence.

Wenli Wang obtained an M.S. degree from the Electrical and Computer Engineering Department of Carnegie Mellon University in 2018. Prior to Carnegie Mellon University, she received a B.S. in Statistics and a B.A. in Fine Arts from the University of California, Los Angeles in 2016. Her research interests include machine learning and its applications in wireless networks and computer vision.

Benjamin Striner is a master’s student in the Machine Learning Department at Carnegie Mellon University. Previously, he was a patent expert witness and engineer, especially in wireless communications. He received a B.A. in neuroscience and psychology from Oberlin College in 2005. Research interests include reinforcement learning, generative networks, and better understandability and explainability in machine learning.

Ozan Tonguz is a tenured full professor in the Electrical and Computer Engineering Department of Carnegie Mellon University (CMU). He currently leads substantial research efforts at CMU in the broad areas of telecommunications and networking. He has published about 300 research papers in IEEE journals and conference proceedings in the areas of wireless networking, optical communications, and computer networks. He is the author (with G. Ferrari) of the book Ad Hoc Wireless Networks: A Communication-Theoretic Perspective (Wiley, 2006). He is the inventor of 15 issued or pending patents (12 US patents and 3 international patents). In December 2010, he founded the CMU startup known as Virtual Traffic Lights, LLC, which specializes in providing solutions to acute transportation problems using vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications paradigms. His current research interests include vehicular networks, wireless ad hoc networks, sensor networks, self-organizing networks, artificial intelligence (AI), statistical machine learning, smart grid, bioinformatics, and security. He currently serves or has served as a consultant or expert for several companies, major law firms, and government agencies in the United States, Europe, and Asia.
