Using Reinforcement Learning with Partial Vehicle Detection for Intelligent Traffic Signal Control


Intelligent Traffic Signal Control (ITSC) systems have attracted the attention of researchers and the general public alike as a means of alleviating traffic congestion. Recently, vehicular wireless technologies have enabled a cost-efficient way to achieve ITSC by detecting vehicles using Vehicle-to-Infrastructure (V2I) wireless communications.

Traditional ITSC algorithms, in most cases, assume that every vehicle is detected, such as by a camera or a loop detector, but a V2I implementation would detect only those vehicles equipped with wireless communications capability. We examine a family of transportation systems, which we will refer to as ‘Partially Detected Intelligent Transportation Systems’. An algorithm that can perform well under a small detection rate is highly desirable because the penetration rates of the underlying technologies, such as Dedicated Short Range Communications (DSRC), are increasing only gradually. The Reinforcement Learning (RL) approach in Artificial Intelligence (AI) could provide indispensable tools for such problems, where only a small portion of vehicles are detected by the ITSC system.

In this paper, we report a new RL algorithm for Partially Detected Intelligent Traffic Signal Control (PD-ITSC) systems. The performance of this system is studied under different car flows, detection rates, and topologies of the road network. Our system is able to efficiently reduce the average waiting time of vehicles at an intersection, even with a low detection rate, thus reducing the travel time of vehicles.

Reinforcement Learning, Artificial Intelligence, Intelligent Transportation Systems, Partially Detected Intelligent Transportation Systems, Vehicle-to-Infrastructure Communications

I Introduction


Traffic congestion is a daunting problem that affects the daily lives of billions of people in most countries across the world [69]. Over the last 30 years, many Intelligent Traffic Signal Control (ITSC) systems have been designed and demonstrated as one of the effective ways to reduce traffic congestion [59, 40, 28, 42, 22, 46, 27, 75]. These systems use real-time traffic information measured or collected by video cameras or loop detectors and optimize the cycle split of a traffic light accordingly [70]. Unfortunately, such intelligent traffic signal control schemes are expensive and, therefore, they exist only at a small percentage of intersections in the United States, Europe, and Asia.

Recently, several more cost-effective approaches to implementing ITSC systems have been proposed that leverage Dedicated Short-Range Communication (DSRC) technology [20, 50, 45]. DSRC is potentially a much cheaper technology for detecting the presence of vehicles on the approaches of an intersection. However, at the early stages of deployment, only a small percentage of vehicles will be equipped with DSRC radios. Meanwhile, the rapid development of the Internet of Things (IoT) has created new technologies applicable to sensing vehicles for ITSC. Other than DSRC, applicable technologies include, but are not limited to, RFID, Bluetooth, Ultra-Wide Band (UWB), Zigbee, and even cellphone apps such as Google Maps [10, 21, 56]. All these systems are more economical than traditional loop detectors or cameras. Performance-wise, most of these systems are able to track vehicles in a continuous manner, while loop detectors can only detect the presence of vehicles. The technologies mentioned above are all promising and could bring down the high price of traditional ITSC systems dramatically; however, they share a critical shortcoming: they cannot detect vehicles that are not equipped with the communication device (i.e., DSRC radios, RFID tags, Bluetooth devices, etc.).

Since this adoption stage of the aforementioned systems could possibly take several years [3], new control algorithms that can handle partial detection of vehicles are required. One potential AI algorithm that could be very helpful is deep reinforcement learning (DRL), which has recently been explored by several groups [24, 73]. These results show an improvement in terms of waiting time and queue length experienced at an intersection in a fully-observable environment. Hence, in this paper, we investigate this promising approach in a partially observable environment. As expected, we observe an asymptotically improving result as we increase the penetration rate of DSRC-equipped vehicles.

In this paper, we explore the capability of DRL for handling ITSC systems using partial detection. For simplicity, in some sections, we use a DSRC-based detection system as the example system, but the scheme described in this paper is very general and can therefore be used with any form of partial detection, such as vehicle detection based on RFID, Bluetooth Low Energy 5.0 (BLE 5.0), or cellular (LTE or 5G). Via extensive simulations, we analyze the performance of the Reinforcement Learning (RL) method. Our results clearly show that reinforcement learning is capable of providing an excellent traffic management scheme that reduces the waiting time of commuters at intersections, even at a low penetration rate. The results also show different performance for detected and undetected vehicles, suggesting a built-in business model, which could be the key to eventually pushing forward large-scale deployment of ITSC.

The remainder of this paper is organized as follows. In Section II, we review the related work in this area. Section III gives a detailed problem formulation. Section IV outlines the approach we use. Section V presents the results of our study in terms of performance and sensitivity to critical system parameters. Section VI discusses the practical implications of our results for intelligent transportation systems and highlights important extensions for future work. Finally, Section VII concludes the paper.

II Related Works

Traffic signal control using Artificial Intelligence (AI), especially reinforcement learning (RL), has been an active field of research for the last 20 years. In 1994, Mikami et al. proposed distributed reinforcement learning (Q-learning) using a Genetic Algorithm to present a traffic signal control scheme that effectively increased the throughput of a road network [44]. Due to the limitations of computing power in 1994, however, it could not be implemented at that time.

Bingham proposed RL for parameter search of a fuzzy-neural traffic signal controller for a single intersection [9], while Choy et al. adapted RL on the fuzzy-neural system in a cooperative scheme, achieving adaptive control for a large area [11]. These algorithms are based on RL, but the major role of RL in them is parameter tuning of the fuzzy-neural system. Abdulhai et al. proposed the first truly adaptive traffic signal, which learns to control the traffic signal dynamically using a Cerebellar Model Articulation Controller (CMAC) as a Q-estimation network [2]. Silva and Oliveira then proposed a context-detector (CD) in conjunction with RL to further improve the performance under non-stationary traffic conditions [13, 15]. Several researchers have focused on multi-agent reinforcement learning in order to implement it on a large scale [1, 43, 17, 31].

Recently, with the development of GPUs and computing power, DRL has become an attractive method in several fields. Several attempts have been made to use deep Q-learning for ITSC systems, including [37, 24, 73, 72]. These results show that a DQN-based Q-learning algorithm is capable of optimizing the traffic in an intelligent manner.

Traditional intelligent traffic signal systems use loop detectors, magnetic detectors and cameras for improving the performance of traffic lights. In the past few decades, various adaptive traffic systems were developed and implemented. Some of these traffic systems, such as SCOOT [28] and SCATS [40], are based on dynamic traffic coordination [42] and can be viewed as traffic-responsive versions of TRANSYT [59]. These systems optimize the offsets of traffic signals in the network, based on current traffic demand, and generate a ‘green wave’ for the major car flow. Meanwhile, some other model-based systems have been proposed, including OPAC [22], RHODES [46], and PRODYN [27]. These systems use both the current traffic arrivals and the prediction of future arrivals, and choose the signal phase plan that optimizes the objective function. While these systems work efficiently, they do have significant shortcomings; in particular, their cost is generally very high [29, 60].

Even though RL yields impressive results for these cases, it does not outperform current systems. Hence, the progress of these algorithms, while interesting, is of limited impact, since traditional ITSC systems perform comparably.

Meanwhile, the recent advancements in Vehicle-to-Everything (V2X) communication have made traffic signal control schemes based on such technology a rising field, as the cost is significantly lower than that of a traditional ITSC system [20, 50, 45]. Within these schemes, a system known as Virtual Traffic Lights (VTL) is very attractive, as it proposes an infrastructure-free DSRC-based solution, by installing traffic control devices in vehicles and having the vehicles decide the right-of-way at an intersection locally. Different aspects of VTL technology have been studied by different research groups in the last few years [20, 52, 19, 51, 76, 4, 26, 65, 79, 5, 66, 81, 68]. However, a VTL system requires all vehicles in the road network to be equipped with DSRC devices. Therefore, a scheme is needed that lets current transportation systems transition smoothly to a VTL system.

On the other hand, several methods have been proposed that use floating vehicle data gathered from the Global Positioning System (GPS) to detect, estimate and predict traffic states based on fuzzy logic, Genetic Algorithms (GA), Support Vector Machines (SVM) and other statistical learning algorithms [41, 30, 55, 14, 18, 32]. The success of these works suggests the possibility of optimizing traffic control based on partial detection (such a system is formally introduced in Section III).

There are a few research projects that use partial detection. For example, COLOMBO is a project that focuses on low penetration rates of DSRC-equipped vehicles [6, 34, 7]. The system uses information provided by V2X technology and feeds it to a traffic management system. Since COLOMBO cannot directly react to real-time traffic flow (the detected and undetected vehicles have the same performance), it will NOT achieve optimum performance under low-to-medium car flow, where the optimal strategy has to react to detected car arrivals. Another very recent system is DSRC-Actuated Traffic Lights, which is one of our previous implementations using DSRC radio for traffic control. The designed prototype of this system was publicly demonstrated in Riyadh, Saudi Arabia, in July 2018 [82, 67]. DSRC-Actuated Traffic Lights, however, is based on the arrival of each vehicle, and hence works well under low to medium car flow rates but not under high car flow rates.

The main contributions of this paper are:

  1. Explore a new kind of intelligent system that is based on partial detection of vehicles, which is a cost-effective alternative to current ITSC systems and an important problem not addressed by traditional ITSC systems.

  2. Propose a transition scheme to VTL. Not only do we reduce the average commute time for all users, but those users that can be detected have much lower commute time, which attracts additional users to adopt the device or service.

  3. Design a new RL-based traffic signal control algorithm and system that perform well under low penetration and detection rates.

  4. Provide a detailed performance analysis. Our results show that, under a low detection rate, the system can perform almost as well as an ITSC system that employs full detection. This is a very attractive solution considering its cost-effectiveness.

III Problem Statement

III-A What is a Partial Detection based ITSC System?

Fig. 1: Illustration of Partially Detected Intelligent Transportation System

Figure 1 gives an illustration of a Partially Detected Intelligent Traffic Signal Control (PD-ITSC) system. There are two kinds of vehicles in the system: the red vehicles in the figure are those that the traffic lights are able to detect, which we denote as detected vehicles; the blue semi-transparent vehicles, on the other hand, are undetectable by the traffic system and are denoted as undetected vehicles. In a PD-ITSC system, both kinds of vehicles co-exist. The system, based on the information from the detected vehicles, decides the current phase at the intersection in order to minimize the delay for both detected and undetected vehicles.

Many example systems can be categorized as PD-ITSC, especially the newly proposed systems from the last decade based on wireless communications and IoT [10, 21, 56]. In these systems, the vehicles are equipped with communication devices that communicate with traffic lights. Vehicles equipped with the communication device are detected vehicles and vehicles NOT equipped with the device are undetected vehicles.

In this paper, we choose one typical PD-ITSC system, the traffic signal system based on DSRC radios, as an example. The detected vehicles are vehicles equipped with DSRC radios, and the undetected vehicles are those not equipped with DSRC radios. Other kinds of PD-ITSC systems are analogous, so the methodologies described in this paper are applicable to them as well.

III-B Example PD-ITSC System Design based on DSRC

Fig. 2: One possible system design for the proposed scheme

We provide here one of the possible system realizations for the proposed scheme, based on Dedicated Short-Range Communications (DSRC). The system has an ‘On Roadside’ unit and an ‘On Vehicle’ unit, as shown in Figure 2. The DSRC RoadSide Unit (RSU) senses the Basic Safety Message (BSM) broadcast by the DSRC OnBoard Unit (OBU), parses out the useful information, and sends it to the RL Based Decision Making Unit. This unit then makes a decision based on the information provided by the RSU.

Even though the example system won’t be able to detect all vehicles, it will collect more detailed information about the detected vehicles: while traditional ITSC systems based on loop detectors detect only vehicle occupancy, a system based on DSRC technology can provide a rich set of attributes including the speed, distance, trajectory, and even destination of each detected vehicle. It is worth mentioning that such properties are NOT unique to the DSRC-based example system considered in this section; in fact, the same properties exist in most other partial detection ITSC systems as well, since they are based on similar wireless technologies. Therefore, an algorithm designed for PD-ITSC systems should be able to integrate all these pieces of information. Developing a purely analytical algorithm that takes all this information into consideration is non-trivial, which makes RL a very attractive and promising method, as it does not require a comprehensive theoretical analysis of the environment to find a near-optimal solution.

It is clear that since most of the traditional ITSC schemes do not take undetected vehicles into account, they are not suitable for PD-ITSC systems. Moreover, an ideal scheme for PD-ITSC should also:

  1. perform well even with a low detection rate;

  2. accelerate the transition to a higher adoption rate and therefore a higher detection rate (this point will be discussed in more detail in Section VI).

IV Approach and the Underlying Theory

IV-A Q-Learning Algorithm

We refer to Watkins [77] for a detailed explanation of general reinforcement learning and Q-learning but we will provide a brief review of the underlying theory in this section.

The goal of reinforcement learning is to train an agent that interacts with the environment by selecting the action in a way that maximizes the future reward. At every time step, the agent gets the state (the current observation of the environment) and reward information (the quantified indicator of performance from the last time step) from the environment and makes an action. During this process, the agent tries to optimize (maximize/minimize) the cumulative reward for its action policy. The beauty of this kind of algorithm is the fact that it doesn’t need any supervision, since the agent observes the environment and tries to optimize its performance without human intervention.

RL algorithms come in two categories: policy based algorithms, such as Trust Region Policy Optimization (TRPO) [62], Advantage Actor Critic (A2C) [47], and Proximal Policy Optimization (PPO) [63], that optimize the policy that maps from states to actions; and value based algorithms, such as Q-learning [77], double Q-learning [74], and soft Q-learning [25], that directly maximize the cumulative rewards. While policy based algorithms have achieved good results and will potentially be applicable to the problem proposed here [8, 78], we choose the deep Q-learning algorithm in this paper.

In the Q-learning approach, the agent learns a ‘Q-value’, denoted $Q(s_t, a_t)$, which is a function of the observed state $s_t$ and action $a_t$ that outputs the expected cumulative discounted future reward. Here, $t$ denotes the discrete time index. The cumulative discounted future reward is defined as:

$$Q(s_t, a_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots$$

Here, $r_t$ is the reward at each time step, the meaning of which needs to be specified according to the actual problem, and $\gamma$ is the discount factor. At every time step, the agent updates its Q function by an update of the Q value:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $\alpha$ is the learning rate.

In most cases, including the traffic control scenarios of interest, due to the complexity of the state space and action space, deep neural networks can be used to approximate the Q function. Instead of updating the Q value, we use the value:

$$Q_{\text{target}}(s_t, a_t) = r_t + \gamma \max_{a} Q(s_{t+1}, a)$$

as the output target of a Q network and do a step of back propagation on the input of $(s_t, a_t)$.
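As an illustration of the update rule described above, the following is a minimal tabular sketch in Python. It is not the implementation used in this paper (which approximates the Q function with a deep network); the action labels, the state names, and the values of the learning rate `alpha` and discount factor `gamma` are assumptions chosen only for the example.

```python
from collections import defaultdict

# Hypothetical action labels, for illustration only.
ACTIONS = ("keep", "switch")


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the bootstrapped target."""
    # Best Q-value reachable from the next state under a greedy choice.
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    # Target value: immediate reward plus discounted future estimate.
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return target


q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
target = q_update(q, "s0", "keep", reward=-0.5, next_state="s1")
```

In a deep Q-network, the same `target` value would serve as the regression target for the network output at the given state and action, rather than being written into a table.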

We utilized two known methods to stabilize the training process [38, 48]:

  1. Two Q-networks are maintained: a target Q-network and an on-line Q-network. The target Q-network is used to approximate the true Q-values, and the on-line Q-network is updated by back-propagation at every step. During training, the agent makes decisions with the target Q-network, and the results from each time instance are used to update the on-line Q-network. At periodic intervals, the on-line Q-network’s weights are copied to the target Q-network. This keeps the agent’s decision network relatively stable, instead of changing at every step.

  2. Instead of training after every step an agent has taken, past experience is stored in a memory buffer and training data is sampled from the memory for a certain batch size. This experience replay aims to break the time correlation between samples [49].
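The two stabilization techniques listed above can be sketched as follows. This is a simplified illustration, not the code used in our experiments: the class name `ReplayBuffer`, the helper `sync_target`, and the representation of network weights as a plain dictionary are all hypothetical.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of past transitions for experience replay."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks time correlation.
        return rng.sample(list(self.memory), batch_size)


def sync_target(online_weights, target_weights):
    """Copy the on-line network weights into the target network, in place."""
    target_weights.clear()
    target_weights.update(copy.deepcopy(online_weights))


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(step, "keep", -0.1 * step, step + 1)
batch = buffer.sample(2, rng=random.Random(0))

online = {"w": [0.3, -0.2]}   # stand-in for real network parameters
target = {"w": [0.0, 0.0]}
sync_target(online, target)
```

In practice, `sync_target` would be called once every fixed number of training steps, and each sampled batch would be used for one gradient step on the on-line network.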

In this paper, we train the traffic light agents using a Deep Q-network (DQN) [49]. With the Q-learning algorithm described above, our work focuses on the definition of the agents’ actions and the assignment of the states and rewards, which is discussed in the following subsection IV-B.

IV-B Parameter Modeling

We consider a traffic light controller, which takes reward and state observation from the environment and chooses an action. In this subsection, we introduce our design of actions, rewards, and states for the aforementioned PD-ITSC system problem.

Agent action

In our context, the relevant action of the agent is either to keep the current traffic light phase, or to switch to the next traffic light phase. At every time step, the agent makes an observation and takes action accordingly, achieving intelligent control of traffic.


Reward function

For traffic optimization problems, the goal is to decrease the average traffic delay of commuters in the network by using a traffic light phasing strategy $S$. Specifically, we seek the best traffic light phasing strategy $S$ such that $t_S - t_{\min}$ is minimized, where $t_S$ is the average travel time of commuters in the network under the traffic control scheme $S$, and $t_{\min}$ is the physically possible lowest average travel time. Consider traveling the same distance $d$:

$$d = \int_0^{t_S} v_S(t)\, dt = v_{\max}\, t_{\min}$$

Here, $v_{\max}$ is some maximum reasonable speed for the vehicle, such as the speed limit of the road of interest, and $v_S(t)$ denotes the actual vehicle speed under strategy $S$ at time $t$. Therefore,

$$t_S - t_{\min} = \frac{1}{v_{\max}} \int_0^{t_S} \left[ v_{\max} - v_S(t) \right] dt$$

Therefore, obtaining the minimum delay is equivalent to minimizing, at each step $t$ and for each vehicle:

$$\frac{1}{v_{\max}} \left[ v_{\max} - v_S(t) \right] \tag{1}$$

We note that this is equivalent to maximizing $v_S(t)$ if $v_{\max}$ is the same on all roads for all cars. If different vehicles have different $v_{\max}$, the reward function is taken as the arithmetic average of the function over all vehicles.

We define the expression in (1) as the penalty of each step. Our goal is to minimize the penalty of each step. Since reinforcement learning tries to maximize the reward (that is, minimize the penalty), we define the negative of the penalty as the reward for the reinforcement learning problem:

$$r_t = -\frac{1}{v_{\max}} \left[ v_{\max} - v_S(t) \right] \tag{2}$$

In some cases, especially when the traffic flow is heavy, one can shape the rewards to guide the agent’s action, such as avoiding big traffic jams [53]. This is certainly an interesting direction for future research.
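To make the per-step reward concrete, the following sketch computes it for a set of vehicles, assuming the penalty of each vehicle is its normalized speed deficit (maximum speed minus actual speed, divided by maximum speed) and that the reward is the negative of the average penalty. The function name `step_reward`, the input format, and the convention that an empty network yields zero reward are assumptions made for this example.

```python
def step_reward(vehicles):
    """Return the reward for one time step.

    vehicles: iterable of (speed, v_max) pairs in m/s, one per vehicle.
    """
    vehicles = list(vehicles)
    if not vehicles:
        return 0.0  # assumption: no vehicles means no penalty
    # Average normalized speed deficit over all vehicles.
    penalty = sum((v_max - v) / v_max for v, v_max in vehicles) / len(vehicles)
    # The reward is the negative of the penalty.
    return -penalty


# One stopped vehicle, one at half speed, one at free-flow speed.
reward = step_reward([(0.0, 10.0), (5.0, 10.0), (20.0, 20.0)])
```

A reward of 0 corresponds to every vehicle moving at its maximum reasonable speed, and a reward of -1 corresponds to every vehicle being stopped.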

State representation

For optimal decision making, a system should consider as much relevant information about traffic processes as possible. Traditional ITSC systems typically detect only simple information, such as the presence of vehicles. In a PD-ITSC system, only a portion of the vehicles are detected, but it is likely that more specific information about these vehicles, such as speed and position, is available due to the capabilities of the underlying wireless technologies (discussed in Section III-B).

RL enables experimentation with many possible choices of inputs and input representations. Further research is required to determine the experimental benefits of each option and that goes beyond the scope of this paper. Based on initial experiments, for the purpose of this paper, we selected a state representation including the distance to the nearest vehicle at each approach, number of vehicles at each approach, amber phase indicator, current traffic light phase elapsed time and current time, as shown in Table I.

| Information | Representation |
|---|---|
| Detected car count | Number of detected vehicles in each approach |
| Distance to nearest detected vehicle | Distance to nearest detected vehicle on each approach; if no detected vehicle, set to lane length (in meters) |
| Current phase time | Duration from start of current phase to now (in seconds) |
| Amber phase | Indicator of amber phase; 1 if currently in amber phase, otherwise 0 |
| Current time | Current time of day (hours since midnight), normalized from 0 to 1 (divided by 24) |
| Current phase | Detected car count and distance to nearest detected vehicle are negated if red, positive if green |

TABLE I: Details of state representation

Note that current traffic light phase (green or red) is represented by a sign change in the per-lane detected car count and distance rather than by a separate indicator. In initial experiments, we observed slightly faster convergence using this distributed representation (sign representation) than a separate indicator (shown in Figure 5). We hypothesize that, in combination with Rectified Linear Unit (ReLU) activation, this encoding biases the network to utilize different combinations of neurons for different phases. ReLU units are active if the output is positive and inactive if the output is negative, so our representation may encourage different units to be utilized during different phases, accelerating learning. There are many possible representations and our experimentation with different representations is not exhaustive, but we found that RL was able to handle several different representations with reasonable performance.
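The state representation in Table I, including the sign encoding of the current phase, can be sketched as below. The function name `build_state`, the argument layout, and the ordering of features within the vector are assumptions made for illustration; only the set of features and the sign convention come from the table.

```python
def build_state(counts, distances, green, lane_length, phase_time, amber, hour):
    """Assemble a flat state vector.

    counts:    detected vehicle count per approach
    distances: distance (m) to nearest detected vehicle per approach,
               or None when no vehicle is detected on that approach
    green:     per-approach booleans, True if the approach has green
    """
    state = []
    for n, d, g in zip(counts, distances, green):
        sign = 1.0 if g else -1.0            # positive if green, negated if red
        dist = lane_length if d is None else d
        state.append(sign * n)
        state.append(sign * dist)
    state.append(float(phase_time))          # seconds since phase start
    state.append(1.0 if amber else 0.0)      # amber phase indicator
    state.append(hour / 24.0)                # time of day normalized to [0, 1]
    return state


state = build_state(
    counts=[2, 0],
    distances=[35.0, None],
    green=[True, False],
    lane_length=300.0,
    phase_time=12,
    amber=False,
    hour=6.0,
)
```

Here the second approach has a red light and no detected vehicle, so its distance entry takes the lane length with a negative sign.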

IV-C System

Fig. 3: Control logic of RL based decision making unit

Figure 3 gives a flow chart of how the RL based control unit makes its decisions. As shown in the figure, the control unit gets the state representation periodically and calculates the Q-value for all possible actions. If the action of keeping the current phase has the larger Q-value, it retains the phase; otherwise, it switches to the next phase.

In addition to the main logic discussed above, a sanity check is performed on the agent: a mandatory maximum and minimum phase time. If the current phase duration is less than the minimum phase time, the agent keeps the current phase no matter which action the DQN chooses; similarly, if the phase duration is greater than or equal to the maximum phase time, the phase is forced to switch.
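The control logic with its minimum and maximum phase checks can be summarized in a few lines. This is a sketch under the assumption that the Q-network outputs one value per action, indexed by the hypothetical constants `KEEP` and `SWITCH`; the phase limits used in the example calls are arbitrary.

```python
KEEP, SWITCH = 0, 1  # hypothetical action indices


def choose_action(q_values, phase_time, min_phase, max_phase):
    """Pick keep or switch, overriding the network at the phase limits."""
    if phase_time < min_phase:
        return KEEP       # phase too short: always keep
    if phase_time >= max_phase:
        return SWITCH     # phase too long: force a switch
    # Keep only when keeping has the strictly larger Q-value.
    return KEEP if q_values[KEEP] > q_values[SWITCH] else SWITCH


too_short = choose_action([0.1, 0.9], phase_time=3, min_phase=5, max_phase=60)
too_long = choose_action([0.9, 0.1], phase_time=60, min_phase=5, max_phase=60)
normal = choose_action([0.9, 0.1], phase_time=20, min_phase=5, max_phase=60)
```

The override is applied outside the network, so the learned Q-function never has to represent the hard timing constraints itself.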

IV-D Implementation

In this section, we describe the design of the proposed scheme at the system level. The implementation of the system contains two phases, the training phase and the deployment phase. As shown in Figure 4, the agent is first trained with a simulator; the trained agent is then ported to the intersection and connected to the real traffic signal, after which it starts to control the traffic.

Fig. 4: The deployment scheme

Training phase

The agent is trained by interacting with a traffic simulator. The simulator randomly generates vehicle arrivals, then determines whether each vehicle can be detected by drawing from a Bernoulli distribution parameterized by $p$, the detection rate. In the context of DSRC-based vehicle detection systems, the detection rate corresponds to the DSRC penetration rate. The simulator obtains the traffic state $s_t$, calculates the current reward $r_t$ accordingly, and feeds both to the agent. Using the Q-learning update rule described in Section IV-A, the agent updates itself based on the information from the simulator. Meanwhile, the agent chooses an action $a_t$ and forwards it to the simulator. The simulator then updates and changes the traffic light phase according to the agent’s instruction. These steps are repeated until convergence, at which point the agent is trained.
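The Bernoulli detection step used during training can be sketched as follows. The function name `mark_detected`, the seed, and the example detection rate of 0.2 are assumptions for illustration and are not taken from the simulator code.

```python
import random


def mark_detected(num_vehicles, detection_rate, rng):
    """Draw one Bernoulli(detection_rate) flag per vehicle.

    True means the vehicle is visible to the traffic signal agent.
    """
    return [rng.random() < detection_rate for _ in range(num_vehicles)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
flags = mark_detected(10000, 0.2, rng)
observed_rate = sum(flags) / len(flags)
```

Only vehicles flagged `True` would contribute to the state passed to the agent, while the reward would still be computed over all vehicles known to the simulator.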

The performance of an agent relies heavily on the quality of the simulator. To obtain arrival patterns similar to those in the real world, the simulator generates car flow from the historical record of vehicle arrival rates, on a map of the real intersection. To address the variation in car flow across the day, the current time of day is also included in the state representation, so that after training the agent is able to adapt to different car flows at different times of the day. Other factors that affect car flow, such as day of the week, could also be parameterized in the state representation.

The goal of training is to have the traffic control scheme achieve the shortest average commute time for all commuters. In the training period, the machine tries different control schemes and eventually converges to an optimal scheme which yields a minimum average commute time.

Deployment phase

In the deployment phase, the software agent is moved to the intersection to control the traffic signal. Here, the agent does not update the learned Q-function, but simply controls the traffic signal. Namely, the detector feeds the agent the current detected traffic state $s_t$; based on $s_t$, the agent chooses an action $a_t$ using the trained Q-network and directs the traffic signal to switch or keep the phase accordingly. This step is performed in real time, thus enabling continuous traffic control.

V Performance Analysis

In this section, we give several scenarios of simulations to evaluate various aspects of the performance of the proposed scheme. The simulations are performed with SUMO, a microscopic traffic simulator [33, 35, 39]. Different scenarios are considered, in order to provide a comprehensive analysis for the proposed scheme.

Qualitatively speaking, the simulator GUI shows the agent reacting to the traffic intelligently and making reasonable decisions for the arriving vehicles. We demonstrate the performance of the agent after different periods of training in a video available in [57].

Fig. 5: Penalty function decreasing with number of iterations in training, the situation shown in the figure is plotted from training with dense car flow at a single intersection

Figure 5 shows a typical training process curve. Both phase representations have similar trends, but we do observe that the sign representation has a slightly faster convergence rate in all experiments (see section IV-B3).

We provide a quantitative analysis in the following subsections. Though currently there are no analytical results for PD-ITSC system, we can predict what will be observed by considering the following two extreme cases:

  • When the car flow rate is extremely low, vehicles come to the intersection independently. For detected vehicles, the optimal traffic signal should switch phases on their arrival to yield zero waiting time; for undetected vehicles, the traffic agent won’t be able to do anything. In this case, vehicles can be considered as independent ‘particles’, and the optimal traffic agent reacts to each of their arrivals independently. Therefore, we should observe much better performance for the detected vehicles than for the undetected vehicles, which corresponds to the case shown in Figure 7(b).

  • When the car flow rate is extremely heavy (at the point of saturation), the optimal traffic agent should take a completely different strategy: instead of only taking care of the detected vehicles, the agent should be aware that the detected vehicles are only representatives of the car flow, and react in a way that minimizes the overall waiting time. The waiting times of detected and undetected vehicles should be similar, because they belong to the same car flow. The vehicles here should be considered as a ‘liquid’ instead of the ‘particles’ of the previous case. This can be seen in Figure 7(a).

The rest of the section is organized as follows: subsection V-A evaluates the performance of the system under different detection rates. One should expect different performance for different car flow rates for the reasons mentioned above. Subsection V-B gives an estimate of the benefit of the designed agent during different times of the day. Finally, subsections V-C and V-D show that when the implementation scenario is slightly different from the training scenario, the performance of the designed agent is still reasonably good.

V-A Performance for different detection rates

In this subsection, we present performance results under different detection rates, to quantify the performance of a PD-ITSC system as the detection rate increases from 0% to 100%. As a simple reference, we compare against the performance of a typical pre-timed signal with a green phase duration of 24 seconds, shown in dashed lines.

Fig. 6: Waiting time under different detection rate under medium car flow

Figure 6 shows a typical trend we obtained in simulations. The figure shows the waiting time of vehicles at a single intersection with car flows from the north, east, south, and west of 0.02 veh/s, 0.1 veh/s, 0.02 veh/s, and 0.05 veh/s, respectively, with vehicles arriving as a Poisson process. One can make several interesting observations from this figure. First of all, the system under AI control is much better than the traditional pre-timed traffic signal, even under a low detection rate. We can also observe that the overall waiting time (red line) within this system decreases as the detection rate increases. This is intuitive: the more vehicles are detected, the more information the system has, and thus the better it is able to optimize the car flow.
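For readers who wish to reproduce a comparable demand pattern, the Poisson arrivals for the four approaches can be generated as in the sketch below. This is not the SUMO configuration used for Figure 6; the function name `poisson_arrivals`, the one-hour horizon, and the seed are assumptions, while the flow rates are those stated above.

```python
import random

# Flow rates in veh/s for each approach, as stated in the text.
FLOWS = {"north": 0.02, "east": 0.1, "south": 0.02, "west": 0.05}


def poisson_arrivals(rate, horizon, rng):
    """Arrival times (s) of a Poisson process on [0, horizon).

    Inter-arrival gaps of a Poisson process are exponentially
    distributed with mean 1 / rate.
    """
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


rng = random.Random(0)  # fixed seed for reproducibility
arrivals = {d: poisson_arrivals(r, 3600.0, rng) for d, r in FLOWS.items()}
```

Over one hour, the east approach is expected to see about 360 arrivals on average and the north approach about 72.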

Additionally, from the figure one can observe that approximately 80% of the benefit is obtained in the first 20% of the transition. This finding is quite significant: the transition scheme improves asymptotically as the system gradually evolves toward a 100% detection rate, and it delivers much of the ultimate benefit during the initial part of the transition.
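One way to make the "share of the benefit" statement concrete is to normalize the waiting-time reduction at a given detection rate by the reduction at full detection. The helper below is a hedged sketch of that calculation. The function name and the numbers passed to it are placeholders for illustration, not values read from Figure 6.

```python
def benefit_fraction(wait_at_zero, wait_at_p, wait_at_full):
    """Share of the achievable waiting-time reduction realized at detection rate p.

    wait_at_zero: average waiting time at 0% detection
    wait_at_p:    average waiting time at detection rate p
    wait_at_full: average waiting time at 100% detection
    """
    total = wait_at_zero - wait_at_full
    if total <= 0:
        raise ValueError("full detection must reduce the waiting time")
    return (wait_at_zero - wait_at_p) / total


# Placeholder waiting times in seconds, NOT results from the paper.
print(benefit_fraction(wait_at_zero=30.0, wait_at_p=14.0, wait_at_full=10.0))
```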

Another important observation is that during the transition, although the agent is rewarded for optimizing the overall average commute time for both detected and undetected vehicles, the detected vehicles (green line in Figure 6) have a lower commute time than undetected vehicles (blue line in Figure 6). This provides an interesting ‘potential’ or ‘incentive’ for the system to transition from no vehicles equipped with the IoT device to all vehicles equipped with the device. Drivers of vehicles not yet equipped with the device now have a good reason and a strong incentive to install one.

Here, we also compare with our previously designed system known as DSRC-ATL [82], an algorithm designed for dealing with partial detection under sparse to medium car flow. Although the two algorithms exhibit similar trends, the RL agents perform better during the whole transition from a detection rate of 0 to 1.

(a) Performance under dense flow
(b) Performance under sparse flow
Fig. 7: Waiting time under different detection rate under dense and sparse car flow

Figure 7 shows the performance under the other two cases: when the car flow is very sparse (0.02 veh/s at each lane) or very dense (0.5 veh/s at each lane). For the sparse situation in Figure 7(b), the trend is similar to the medium flow case shown in Figure 6.

One can see from Figure 7(a) that under the dense situation, the curve becomes quite flat. This is because when car flow is high, detecting individual vehicles becomes less important. When many cars arrive at the intersection, car flow has ‘liquid’ qualities, as opposed to the ‘particle’ qualities in the previous two situations. The trained RL agent is able to seamlessly transition from a ‘particle arrival’ optimization agent, which handles random arrivals, to a ‘liquid arrival’ optimization agent, which handles macroscopic flow. This result shows that RL is able to capture the main factors that affect the traffic system’s performance and behaves differently under different car arrival rates. Hence, RL provides a much desired adaptive behavior.

V-B Performance of a whole day

Section V-A examines the effect of flow rate on system performance. Since the car flow differs at different times of the day, we simulate an entire day of traffic. To generate a realistic daily car flow, we refer to the whole-day car flow reported in [71]. To adapt the reported arrival rate to the simulation system, we multiply the car flow in [71] by a factor chosen so that the peak volume matches the saturation flow rate of the simulated roads. Figure 8 shows the car flow rate we used for the simulation. The car flow peaks at 1.2 vehicles/s at 8 am and at 6 pm, and is around 0.7 vehicles/s during regular hours. It is worth mentioning that the car flow at different intersections in the real world might be very different, so the result presented here is just an example of what the performance looks like under a typical traffic volume over a whole day.
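The scaling step described above amounts to a single multiplicative factor. The sketch below assumes the daily profile is available as a list of flow values. The function name `scale_to_saturation` and the short profile are hypothetical and only serve to show the arithmetic. The target peak of 1.2 veh/s is the value quoted in the text.

```python
def scale_to_saturation(flow_profile, saturation_flow):
    """Scale a flow profile so that its peak equals saturation_flow.

    Every value is multiplied by the same factor, so the shape of the
    daily profile is preserved.
    """
    peak = max(flow_profile)
    if peak <= 0:
        raise ValueError("flow profile must contain a positive peak")
    factor = saturation_flow / peak
    return [f * factor for f in flow_profile]


# Hypothetical profile in arbitrary units, NOT the data reported in [71].
profile = [0.1, 0.3, 0.6, 0.35, 0.55, 0.2]
scaled = scale_to_saturation(profile, saturation_flow=1.2)
print([round(f, 3) for f in scaled])
```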

Fig. 8: Typical car flow in a day
Fig. 9: Expected Performance by Time

Figure 9 shows the performance of different vehicles over a whole day. One can observe from this figure that the performance at a 20% detection rate (red line) is very close to the performance at a 100% detection rate (green line) at most times of the day (from 5 am to 9 pm). During rush hours, the two are almost identical. Although a traffic system under a 100% detection rate performs visibly better at midnight, the performance at that time is not as critical as the performance during the busier daytime. This result indicates that by detecting 20% of vehicles, we can perform almost as well as by detecting all vehicles. However, the detectable vehicles (yellow lines) will have an advantage over the undetectable vehicles (dashed line).

These results confirm intuition. With a large volume of cars, a low detection rate should still provide a relatively low-variance estimate of traffic flow. If there are few cars and a low detection rate, the estimate of traffic flow can have very high variance. Late at night with only a single detected car, an ITSC system can give that car a green immediately, which would not be possible with an undetected car.
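The variance argument can be made explicit with a short calculation. The derivation below is a sketch under assumptions that are not stated in the paper: arrivals are Poisson with rate \lambda, each vehicle is detected independently with probability p, and the flow is estimated from the number of detected vehicles D in a window of length T. The 60-second window in the example is also an assumption. The two flow rates are the sparse and dense per-lane rates used in Figure 7.

```latex
% Detected arrivals form a thinned Poisson process:
D \sim \mathrm{Poisson}(p \lambda T)

% Unbiased flow estimate from the detected vehicles only:
\hat{\lambda} = \frac{D}{p T}, \qquad
\mathbb{E}[\hat{\lambda}] = \lambda, \qquad
\mathrm{Var}[\hat{\lambda}] = \frac{p \lambda T}{p^2 T^2} = \frac{\lambda}{p T}

% Relative error (coefficient of variation):
\frac{\sqrt{\mathrm{Var}[\hat{\lambda}]}}{\lambda} = \frac{1}{\sqrt{p \lambda T}}

% Example with p = 0.2 and T = 60\,\mathrm{s}:
% dense flow,  \lambda = 0.5\ \mathrm{veh/s}:  1/\sqrt{6}    \approx 0.41
% sparse flow, \lambda = 0.02\ \mathrm{veh/s}: 1/\sqrt{0.24} \approx 2.04
```

Under these assumptions, the relative error depends only on the expected number of detected vehicles in the window, which is why dense flow tolerates a low detection rate far better than sparse flow.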

V-C Sensitivity Analysis

The results obtained above used agents trained and evaluated under the same environmental parameters, since traffic patterns only fluctuate slightly from day to day.

Below, we evaluate the sensitivity of the agents to two environmental parameters: the car flow and the detection rate.

Sensitivity to car flow

Figure 10 shows the agents’ sensitivity to car flow. Figure 10(a) shows the performance of an agent trained under 0.1 veh/s car flow, operating at different flow rates. Figure 10(b) shows the sensitivity of an agent trained under 0.5 veh/s car flow. The blue curve in each figure is the trained agent’s performance, while the red curve is the performance of the optimal agent (the agent trained and tested under that same flow rate). Both agents perform well over a range of flow rates. The agent trained under 0.1 veh/s flow can handle flow rates from 0 to 0.15 veh/s at near-optimal levels. At higher flow rates, it still performs reasonably well. The agent trained on 0.5 veh/s flow performs reasonably from 0.25 veh/s to 0.5 veh/s, but below 0.25 veh/s it starts to perform substantially worse than the optimal agent. Since traffic patterns are not expected to fluctuate heavily, these results give a strong indication that the trained agent will be able to adapt to the environment even when the deployment situation is slightly different from the training situation.

(a) Sensitivity of agent trained under 0.1 veh/s flow rate
(b) Sensitivity of agent trained under 0.5 veh/s flow rate
Fig. 10: Sensitivity analysis of flow rate

Sensitivity to detection rate

In most situations, the detection rate can only be approximately measured. It is likely that an agent trained under one detection rate needs to operate under a slightly different detection rate, so we test the sensitivity of agents to detection rates.

(a) Sensitivity of agent trained under 0.2 detection rate
(b) Sensitivity of agent trained under 0.8 detection rate
Fig. 11: Sensitivity analysis of detection rate

Figure 11 shows the sensitivity for two cases. Figure 11(a) shows the sensitivity under a low detection rate (0.2), and Figure 11(b) shows the sensitivity under a high detection rate (0.8).

We observe that the agent trained under a 0.2 detection rate performs at an optimal level from a 0.1 to a 0.4 detection rate. The agent is less sensitive to an increase in the detection rate than to a decrease. This indicates that in the early deployment of this system, it is better to underestimate the detection rate, since the agent’s performance is more stable when the actual detection rate is higher than the one used for training.

Figure 11(b) shows the sensitivity of the agent trained under a high detection rate (0.8). We can see that the performance of this agent is at an optimal level when the detection rate is between 0.5 and 1. Although the sensitivity of an agent trained under a low detection rate differs from that of an agent trained under a high detection rate, in both cases the agent shows a level of stability: as long as the detection rate used for training is not too different from the actual detection rate, the performance of the agent will not be affected much.

V-D Robustness between training and deployment scenario

There are many differences between the training and the actual deployment scenario, as the simulator, though quite sophisticated, will never be able to take all the factors of the real scenario into account. This simulation aims to verify that minor factors, such as stop-and-go vehicles and arrival patterns, will not affect the system in a major way. We choose a recently published realistic scenario known as Luxembourg SUMO Traffic (LuST) [12]. The scenario is generated on the real map of Luxembourg, and the activity of vehicles is generated according to demographic data published by the government. The authors of this scenario compared the generated traffic with a data set collected between March and April 2015 in Luxembourg, which contains 6,000,000 floating vehicle samples, and achieved similar speed distributions. Hence, the LuST scenario has a high degree of realism.

In our simulation, we don’t directly train the traffic light on the scenario; instead, we use this scenario as ground truth to evaluate the trained traffic light. The simulation steps we performed are as follows:

  1. Choose a certain intersection from LuST with high rate of car flow (intersection -12408)

  2. Measure the hourly traffic volume of that intersection

  3. Build a simple intersection in a separate simulator, with car flow generated by the new simulator according to the hourly traffic volume measured in step 2.

  4. Train an agent on the simplified scenario we built in step 3.

  5. After training, evaluate the performance on the original LuST scenario by replacing the traffic agent of that intersection with the newly trained traffic agent.
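Step 2 of the procedure above can be illustrated with a small helper. This is only a sketch: it assumes the measurement is available as a list of vehicle passage times in seconds since midnight, which is an assumption about the data format and not something the paper specifies. The function names are hypothetical.

```python
from collections import Counter


def hourly_volume(passage_times_s):
    """Count vehicles per hour of day from passage times (seconds since midnight)."""
    counts = Counter(int(t // 3600) % 24 for t in passage_times_s)
    return [counts.get(hour, 0) for hour in range(24)]


def hourly_rate_veh_per_s(volume):
    """Convert hourly vehicle counts into average arrival rates in veh/s."""
    return [count / 3600.0 for count in volume]


# Made-up passage times for illustration only.
times = [12.0, 250.5, 3600.0, 7199.9, 86399.0]
volume = hourly_volume(times)
print(volume[:3], volume[23])
print(hourly_rate_veh_per_s(volume)[:2])
```

The resulting per-hour rates could then drive the arrival process of the simplified training intersection in step 3.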

It is worth mentioning that this simulation follows the steps of an actual implementation in the real world (described in section IV-D), so the performance here can be considered a reference for the performance of an actual deployment in which the simulator and the real world differ substantially in their details.

Beyond the differences in the map and car flow, there are further differences between training and evaluation, since the scenario used for evaluation is rich in details. In Table II, we list all the differences between the LuST scenario (used for evaluation) and the simulator used for training.

Aspect | Training | Evaluation (LuST)
Map topology | Simple straight street intersection | Real world map
Street length | 125 m for each approach | Different length for each approach
Car arrival pattern | Poisson | Bulk arrival when vehicles go through intersections
Car speed | Constant | Gaussian mixture distribution
Stop-and-go | No stop-and-go vehicles | Bus stops
U-turn vehicles | No U-turn | A small proportion of U-turns
Location where vehicles are generated | End of the road | Anywhere on the road
Location of destination | End of the road | Anywhere on the road; some vehicles might not even go through the intersection
Buses | No buses | Regular bus arrivals, with a bus stop close to the intersection
Vehicle passing | Almost no passing due to constant speed | Some vehicle passing due to the randomness of the speed
TABLE II: Differences between the training and evaluation scenarios

Notice that the simulator is sophisticated enough to take all the factors listed in the table into account. Here we intentionally introduce differences between training and evaluation. This is a deliberate choice on our part. Our goal is to give a reasonable estimate of the performance in a real-world implementation, where the simulation scenario is slightly different from the real-world scenario.

We choose three different times of the day to present the results:

  1. Midnight: 2 AM, when the car flow at the intersection is sparse.

  2. Rush hours: 8 AM, when the car flow is dense.

  3. Regular hours: 2 PM, when the car flow lies between the midnight and the rush-hour car flow (medium car flow).

(a) Performance of traffic agent at 2 am
(b) Performance of traffic agent at 8 am
(c) Performance of traffic agent at 2 pm
Fig. 12: Performance of the agent in LuST scenario

Figure 12 shows the performance of the agent in the LuST scenario. Even though the evaluated situation is quite different from the training situation, we still observe that the performance improves asymptotically as the detection rate grows, which is the same trend as observed in subsection V-A.

VI Discussion

As the simulation results show, while all vehicles will experience a shorter waiting time under an RL-based traffic controller, detected vehicles will have a shorter commute time than undetected vehicles. This property makes it possible for hardware manufacturers, software companies, and vehicle manufacturers, rather than the Department of Transportation (DoT) alone, to help push forward the proposed scheme, for the simple reason that all of them can profit from this system. For example, it would be valuable for a navigation app to advertise that its customers can save 30% on commute time.

Therefore, we view this technology as a new generation of Intelligent Transportation Systems, as it inherently comes with a lucrative commercial business model. The burden of increasing the penetration rate in this system is distributed among many companies, as opposed to traditional ITSC systems, which put all the burden on the DoT alone. This makes it financially feasible to have the system installed at most of the intersections in a city, as opposed to the current situation where only a small proportion of intersections are equipped with ITSC.

The mechanism of the described solution will also make dynamic pricing possible. Dynamic pricing refers to reserving certain roads during rush hours exclusively for paid users. This method has been scuttled by public or political opposition, and only a few cities have implemented dynamic pricing [16, 61]. Those few successful examples, however, cannot be easily copied or adapted to other cities, as the method depends heavily on road topologies. In our solution, we can accomplish dynamic pricing in a more intelligent way, by simply considering vehicle detection as a service. Compared to existing solutions, this service does not require reserving roads, making the scheme flexible and easy to implement. Users will also be able to choose to pay for a prioritized signal phase whenever they are in a hurry.

Further research is needed to make this AI-based Intelligent Traffic Control System more practical. First of all, the system currently needs to be fully trained in a simulator. Under the partial observation setup, the system will not be able to observe the reward; hence, it will not be able to do any incremental training after deployment. Clearly, this is a shortcoming of the proposed system. Some solutions to this problem are reported in a follow-up paper [80]. Another future direction would be to further develop the system to achieve multi-agent coordination so that, with the help of DSRC radios (or other forms of communications), traffic lights will be able to communicate with each other. Clearly, designing such a system will significantly improve the performance of the PD-ITSC system. Further research is also required to investigate whether the RL agent will be able to pick up the drivers’ behavior accurately at each intersection [54, 64, 23, 58, 36].

VII Conclusion

In this paper, we have proposed reinforcement learning, specifically deep Q-learning, for traffic control with partial detection of vehicles. The results of our study show that reinforcement learning is a promising new approach to optimizing traffic control under partial detection scenarios, such as traffic control systems using DSRC technology. This outcome is highly desirable, since industry forecasts suggest that DSRC penetration will be gradual rather than abrupt.

The numerical results on sparse, medium, and dense arrival rates suggest that reinforcement learning is able to handle all kinds of traffic flow. Although the optimization of traffic under sparse arrivals and under dense arrivals is, in general, very different, the results show that reinforcement learning is able to leverage the ‘particle’ property of the vehicle flow as well as the ‘liquid’ property, thus providing a very powerful overall optimization scheme.


The authors would like to thank Dr. Hanxiao Liu from the Language Technology Institute, Carnegie Mellon University, for informative discussions and many suggestions on the methods reported in the paper. The authors would also like to thank Dr. Laurent Gallo from Eurecom, France, and Mr. Manuel E. Diaz-Granados of Yahoo, US, for the initial attempt to solve this problem in 2016.

Rusheng Zhang was born in Chengdu, China in 1990. He received the B.E. degree in micro electrical mechanical system and second B.E. degree in Applied Mathematics from Tsinghua University, Beijing, in 2013, and the M.S. degree in electrical and computer engineering from Carnegie Mellon University, in 2015. He is a Ph.D. candidate at Carnegie Mellon University. His research areas include vehicular networks, intelligent transportation systems, wireless computer networks, artificial intelligence and intra vehicular sensor networks.

Akihiro Ishikawa was an MS student in the Electrical and Computer Engineering Department of Carnegie Mellon University until he received his MS degree in 2017. His research interests include vehicular networks, wireless networks, and artificial intelligence.

Wenli Wang obtained an M.S. degree from the Electrical and Computer Engineering Department of Carnegie Mellon University in 2018. Prior to Carnegie Mellon University, she received a B.S. in Statistics and a B.A. in Fine Arts from the University of California, Los Angeles in 2016. Her research interests include machine learning and its applications in wireless networks and computer vision.

Benjamin Striner is a master’s student in the Machine Learning Department at Carnegie Mellon University. Previously, he was a patent expert witness and engineer, especially in wireless communications. He received a B.A. in neuroscience and psychology from Oberlin College in 2005. Research interests include reinforcement learning, generative networks, and better understandability and explainability in machine learning.

Ozan Tonguz is a tenured full professor in the Electrical and Computer Engineering Department of Carnegie Mellon University (CMU). He currently leads substantial research efforts at CMU in the broad areas of telecommunications and networking. He has published about 300 research papers in IEEE journals and conference proceedings in the areas of wireless networking, optical communications, and computer networks. He is the author (with G. Ferrari) of the book Ad Hoc Wireless Networks: A Communication-Theoretic Perspective (Wiley, 2006). He is the inventor of 15 issued or pending patents (12 US patents and 3 international patents). In December 2010, he founded the CMU startup known as Virtual Traffic Lights, LLC, which specializes in providing solutions to acute transportation problems using vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications paradigms. His current research interests include vehicular networks, wireless ad hoc networks, sensor networks, self-organizing networks, artificial intelligence (AI), statistical machine learning, smart grid, bioinformatics, and security. He currently serves or has served as a consultant or expert for several companies, major law firms, and government agencies in the United States, Europe, and Asia.


  1. The research reported in this paper was partially funded by King Abdulaziz City of Science and Technology (KACST), Riyadh, Kingdom of Saudi Arabia


  1. M. Abdoos, N. Mozayani and A. L. Bazzan (2011) Traffic light control in non-stationary environments based on multi agent q-learning. In Intelligent Transportation Systems (ITSC), 2011 14th International IEEE Conference on, pp. 1580–1585. Cited by: §II.
  2. B. Abdulhai, R. Pringle and G. J. Karakoulas (2003) Reinforcement learning for true adaptive traffic signal control. Journal of Transportation Engineering 129 (3), pp. 278–285. Cited by: §II.
  3. Average age of cars on u.s.. Note: [Online; accessed 21-Aug-2017] Cited by: §I.
  4. A. Bazzi, A. Zanella, B. M. Masini and G. Pasolini (2014) A distributed algorithm for virtual traffic lights with ieee 802.11 p. In Networks and Communications (EuCNC), 2014 European Conference on, pp. 1–5. Cited by: §II.
  5. A. Bazzi, A. Zanella and B. M. Masini (2016) A distributed virtual traffic light algorithm exploiting short range v2v communications. Ad Hoc Networks 49, pp. 42–57. Cited by: §II.
  6. P. Bellavista, F. Caselli and L. Foschini (2014) Implementing and evaluating v2x protocols over itetris: traffic estimation in the colombo project. In Proceedings of the fourth ACM international symposium on Development and analysis of intelligent vehicular networks and applications, pp. 25–32. Cited by: §II.
  7. P. Bellavista, L. Foschini and E. Zamagni (2014) V2x protocols for low-penetration-rate and cooperative traffic estimations. In Vehicular technology conference (VTC Fall), 2014 IEEE 80th, pp. 1–6. Cited by: §II.
  8. F. Belletti, D. Haziza, G. Gomes and A. M. Bayen (2018) Expert level control of ramp metering based on multi-task deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems 19 (4), pp. 1198–1207. Cited by: §IV-A.
  9. E. Bingham (2001) Reinforcement learning in neurofuzzy traffic signal control. European Journal of Operational Research 131 (2), pp. 232–241. Cited by: §II.
  10. A. Chattaraj, S. Bansal and A. Chandra (2009) An intelligent traffic control system using rfid. IEEE potentials 28 (3). Cited by: §I, §III-A.
  11. M. C. Choy, D. Srinivasan and R. L. Cheu (2002) Hybrid cooperative agents with online reinforcement learning for traffic control. In Fuzzy Systems, 2002. FUZZ-IEEE’02. Proceedings of the 2002 IEEE International Conference on, Vol. 2, pp. 1015–1020. Cited by: §II.
  12. L. Codecá, R. Frank, S. Faye and T. Engel (2017) Luxembourg SUMO Traffic (LuST) Scenario: Traffic Demand Evaluation. IEEE Intelligent Transportation Systems Magazine 9 (2), pp. 52–63. Cited by: §V-D.
  13. A. B. C. da Silva, D. de Oliveria and E. Basso (2006) Adaptive traffic control with reinforcement learning. In Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 80–86. Cited by: §II.
  14. C. De Fabritiis, R. Ragona and G. Valenti (2008) Traffic estimation and prediction based on real time floating car data. In Intelligent Transportation Systems, 2008. ITSC 2008. 11th International IEEE Conference on, pp. 197–203. Cited by: §II.
  15. D. de Oliveira, A. L. Bazzan, B. C. da Silva, E. W. Basso, L. Nunes, R. Rossetti, E. de Oliveira, R. da Silva and L. Lamb (2006) Reinforcement learning based control of traffic lights in non-stationary environments: a case study in a microscopic simulator.. In EUMAS, Cited by: §II.
  16. A. de Palma and R. Lindsey (2011) Traffic congestion pricing methodologies and technologies. Transportation Research Part C: Emerging Technologies 19 (6), pp. 1377–1399. Cited by: §VI.
  17. S. El-Tantawy, B. Abdulhai and H. Abdelgawad (2013) Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (marlin-atsc): methodology and large-scale application on downtown toronto. IEEE Transactions on Intelligent Transportation Systems 14 (3), pp. 1140–1150. Cited by: §II.
  18. Y. Feng, J. Hourdos and G. A. Davis (2014) Probe vehicle based real-time traffic monitoring on urban roadways. Transportation Research Part C: Emerging Technologies 40, pp. 160–178. Cited by: §II.
  19. M. Ferreira and P. M. d’Orey (2012) On the impact of virtual traffic lights on carbon emissions mitigation. IEEE Transactions on Intelligent Transportation Systems 13 (1), pp. 284–295. Cited by: §II.
  20. M. Ferreira, R. Fernandes, H. Conceição, W. Viriyasitavat and O. K. Tonguz (2010) Self-organized traffic control. In Proceedings of the seventh ACM international workshop on VehiculAr InterNETworking, pp. 85–90. Cited by: §I, §II.
  21. M. R. Friesen and R. D. McLeod (2015) Bluetooth in intelligent transportation systems: a survey. International Journal of Intelligent Transportation Systems Research 13 (3), pp. 143–153. Cited by: §I, §III-A.
  22. N. H. Gartner (1983) OPAC: a demand-responsive strategy for traffic signal control. Cited by: §I, §II.
  23. T. J. Gates and D. A. Noyce (2010) Dilemma zone driver behavior as a function of vehicle type, time of day, and platooning. Transportation Research Record 2149 (1), pp. 84–93. Cited by: §VI.
  24. W. Genders and S. Razavi (2016) Using a deep reinforcement learning agent for traffic signal control. arXiv preprint arXiv:1611.01142. Cited by: §I, §II.
  25. T. Haarnoja, H. Tang, P. Abbeel and S. Levine (2017) Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165. Cited by: §IV-A.
  26. F. Hagenauer, P. Baldemaier, F. Dressler and C. Sommer (2014) Advanced leader election for virtual traffic lights. ZTE Communications, Special Issue on VANET 12 (1), pp. 11–16. Cited by: §II.
  27. J. Henry, J. L. Farges and J. Tuffal (1984) The prodyn real time traffic algorithm. In Control in Transportation Systems, pp. 305–310. Cited by: §I, §II.
  28. P. Hunt, D. Robertson, R. Bretherton and M. C. Royle (1982) The scoot on-line traffic signal optimisation technique. Traffic Engineering & Control 23 (4). Cited by: §I, §II.
  29. (2016) Intelligent traffic system cost. Note: [Online; accessed 23-November-2017] Cited by: §II.
  30. B. Kerner, C. Demir, R. Herrtwich, S. Klenov, H. Rehborn, M. Aleksic and A. Haug (2005) Traffic state detection with floating car data in road networks. In Intelligent Transportation Systems, 2005. Proceedings. 2005 IEEE, pp. 44–49. Cited by: §II.
  31. M. A. Khamis and W. Gomaa (2014) Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework. Engineering Applications of Artificial Intelligence 29, pp. 134–151. Cited by: §II.
  32. X. Kong, Z. Xu, G. Shen, J. Wang, Q. Yang and B. Zhang (2016) Urban traffic congestion estimation and prediction based on floating car trajectory data. Future Generation Computer Systems 61, pp. 97–107. Cited by: §II.
  33. D. Krajzewicz, J. Erdmann, M. Behrisch and L. Bieker (2012) Recent development and applications of sumo–simulation of urban mobility. International Journal On Advances in Systems and Measurements 5 (3&4). Cited by: §V.
  34. D. Krajzewicz, M. Heinrich, M. Milano, P. Bellavista, T. Stützle, J. Härri, T. Spyropoulos, R. Blokpoel, S. Hausberger and M. Fellendorf (2013) COLOMBO: investigating the potential of v2x for traffic management purposes assuming low penetration rates. ITS Europe. Cited by: §II.
  35. S. Krauß, P. Wagner and C. Gawron (1997) Metastable states in a microscopic model of traffic flow. Physical Review E 55 (5), pp. 5597. Cited by: §V.
  36. J. Li, X. Jia and C. Shao (2016) Predicting driver behavior during the yellow interval using video surveillance. International journal of environmental research and public health 13 (12), pp. 1213. Cited by: §VI.
  37. L. Li, Y. Lv and F. Wang (2016) Traffic signal timing via deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica 3 (3), pp. 247–254. Cited by: §II.
  38. L. Lin (1993) Reinforcement learning for robots using neural networks. Technical report Carnegie-Mellon Univ Pittsburgh PA School of Computer Science. Cited by: §IV-A.
  39. P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner and E. Wießner (2018) Microscopic traffic simulation using sumo. In The 21st IEEE International Conference on Intelligent Transportation Systems, Cited by: §V.
  40. P. Lowrie (1990) Scats, sydney co-ordinated adaptive traffic system: a traffic responsive method of controlling urban traffic. Cited by: §I, §II.
  41. J. Lu and L. Cao (2003) Congestion evaluation from traffic flow information based on fuzzy logic. In Intelligent Transportation Systems, 2003. Proceedings. 2003 IEEE, Vol. 1, pp. 50–53. Cited by: §II.
  42. J. Luk (1984) Two traffic-responsive area traffic control methods: scat and scoot. Traffic engineering & control 25 (1). Cited by: §I, §II.
  43. J. C. Medina and R. F. Benekohal (2012) Traffic signal control using reinforcement learning and the max-plus algorithm as a coordinating strategy. In Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on, pp. 596–601. Cited by: §II.
  44. S. Mikami and Y. Kakazu (1994) Genetic reinforcement learning for cooperative traffic signal control. In Evolutionary Computation, 1994. IEEE World Congress on Computational Intelligence., Proceedings of the First IEEE Conference on, pp. 223–228. Cited by: §II.
  45. V. Milanes, J. Villagra, J. Godoy, J. Simo, J. Pérez and E. Onieva (2012) An intelligent v2i-based traffic management system. IEEE Transactions on Intelligent Transportation Systems 13 (1), pp. 49–58. Cited by: §I, §II.
  46. P. Mirchandani and L. Head (2001) A real-time traffic signal control system: architecture, algorithms, and analysis. Transportation Research Part C: Emerging Technologies 9 (6), pp. 415–432. Cited by: §I, §II.
  47. V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §IV-A.
  48. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §IV-A.
  49. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland and G. Ostrovski (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: item 2, §IV-A.
  50. N. S. Nafi and J. Y. Khan (2012) A vanet based intelligent road traffic signalling system. In Telecommunication Networks and Applications Conference (ATNAC), 2012 Australasian, pp. 1–6. Cited by: §I, §II.
  51. M. Nakamurakare, W. Viriyasitavat and O. K. Tonguz (2013) A prototype of virtual traffic lights on android-based smartphones. In Sensor, Mesh and Ad Hoc Communications and Networks (SECON), 2013 10th Annual IEEE Communications Society Conference on, pp. 236–238. Cited by: §II.
  52. T. Neudecker, N. An, O. K. Tonguz, T. Gaugel and J. Mittag (2012) Feasibility of virtual traffic lights in non-line-of-sight environments. In Proceedings of the ninth ACM international workshop on Vehicular inter-networking, systems, and applications, pp. 103–106. Cited by: §II.
  53. A. Y. Ng, D. Harada and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287. Cited by: §IV-B2.
  54. D. A. Noyce, D. B. Fambro and K. C. Kacir (2000) Traffic characteristics of protected/permitted left-turn signal displays. Transportation Research Record 1708 (1), pp. 28–39. Cited by: §VI.
  55. W. Pattara-Atikom, P. Pongpaibool and S. Thajchayapong (2006) Estimating road traffic congestion using vehicle velocity. In ITS Telecommunications Proceedings, 2006 6th International Conference on, pp. 1001–1004. Cited by: §II.
  56. F. Qu, F. Wang and L. Yang (2010) Intelligent transportation spaces: vehicles, traffic, communications, and beyond. IEEE Communications Magazine 48 (11). Cited by: §I, §III-A.
  57. Reinforcement Learning for Traffic Optimization. Note: [Online; accessed 12-May-2018] Cited by: §V.
  58. L. Rittger, G. Schmidt, C. Maag and A. Kiesel (2015) Driving behaviour at traffic light intersections. Cognition, Technology & Work 17 (4), pp. 593–605. Cited by: §VI.
  59. D. I. Robertson (1969) ‘TRANSYT’ method for area traffic control. Traffic Engineering & Control 8 (8). Cited by: §I, §II.
  60. (2016) SCATS system cost. Note: \url; accessed 13-May-2018 Cited by: §II.
  61. B. Schaller (2010) New york city’s congestion pricing experience and implications for road pricing acceptance in the united states. Transport Policy 17 (4), pp. 266–273. Cited by: §VI.
  62. J. Schulman, S. Levine, P. Abbeel, M. Jordan and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §IV-A.
  63. J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §IV-A.
  64. K. Tang and H. Nakamura (2007) A comparative study on traffic characteristics and driver behavior at signalized intersections in Germany and Japan. In Proceedings of the Eastern Asia Society for Transportation Studies Vol. 6 (The 7th International Conference of Eastern Asia Society for Transportation Studies, 2007), pp. 324–324. Cited by: §VI.
  65. O. K. Tonguz, W. Viriyasitavat and J. M. Roldan (2014) Implementing virtual traffic lights with partial penetration: a game-theoretic approach. IEEE Communications Magazine 52 (12), pp. 173–182. Cited by: §II.
  66. O. K. Tonguz and W. Viriyasitavat (2016) A self-organizing network approach to priority management at intersections. IEEE Communications Magazine 54 (6), pp. 119–127. Cited by: §II.
  67. O. K. Tonguz and R. Zhang (2019) Harnessing vehicular broadcast communications: DSRC-actuated traffic control. IEEE Transactions on Intelligent Transportation Systems. Cited by: §II.
  68. O. K. Tonguz (2018-10) Red light, green light — no light: tomorrow’s communicative cars could take turns at intersections. IEEE Spectrum Magazine 55 (10), pp. 24–29. Cited by: §II.
  69. (2017) Traffic congestion and reliability: Trends and advanced strategies for congestion mitigation. Note: [Online; accessed 19-Aug-2017]. Cited by: §I.
  70. (2016) Traffic light control and coordination. Note: [Online; accessed 23-Mar-2016]. Cited by: §I.
  71. (2014) Traffic Monitoring Guide. Note: [Online; accessed 13-May-2018]. Cited by: §V-B.
  72. E. van der Pol, F. A. Oliehoek, T. Bosse and B. Bredeweg (2016) Video demo: deep reinforcement learning for coordination in traffic light control. In BNAIC, Vol. 28. Cited by: §II.
  73. E. van der Pol (2016) Deep reinforcement learning for coordination in traffic light control. Master’s thesis, University of Amsterdam. Cited by: §I, §II.
  74. H. Van Hasselt, A. Guez and D. Silver (2016) Deep reinforcement learning with double Q-learning. In AAAI, Vol. 2, pp. 5. Cited by: §IV-A.
  75. R. Vincent and J. Peirce (1988) ‘MOVA’: traffic responsive, self-optimising signal control for isolated intersections. Technical report. Cited by: §I.
  76. W. Viriyasitavat, J. M. Roldan and O. K. Tonguz (2013) Accelerating the adoption of virtual traffic lights through policy decisions. In Connected Vehicles and Expo (ICCVE), 2013 International Conference on, pp. 443–444. Cited by: §II.
  77. C. J. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292. Cited by: §IV-A.
  78. C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky and A. M. Bayen (2017) Flow: architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint arXiv:1710.05465. Cited by: §IV-A.
  79. J. Yapp and A. J. Kornecki (2015) Safety analysis of virtual traffic lights. In Methods and Models in Automation and Robotics (MMAR), 2015 20th International Conference on, pp. 505–510. Cited by: §II.
  80. R. Zhang, R. Leteurtre, B. Striner, A. Alanazi, A. Alghafis and O. K. Tonguz (2019) Partially detected intelligent traffic signal control: environmental adaptation. arXiv preprint arXiv:1910.10808. Cited by: §VI.
  81. R. Zhang, F. Schmutz, K. Gerard, A. Pomini, L. Basseto, S. B. Hassen, A. Ishikawa, I. Ozgunes and O. Tonguz (2018) Virtual traffic lights: system design and implementation. arXiv preprint arXiv:1807.01633. Cited by: §II.
  82. R. Zhang, F. Schmutz, K. Gerard, A. Pomini, L. Basseto, S. B. Hassen, A. Jaiprakash, I. Ozgunes, A. Alarifi and H. Aldossary (2018) Increasing traffic flows with DSRC technology: field trials and performance evaluation. In IECON 2018-44th Annual Conference of the IEEE Industrial Electronics Society, pp. 6191–6196. Cited by: §II, §V-A.