ES-CTC: A Deep Neuroevolution Model for Cooperative Intelligent Freeway Traffic Control
Cooperative intelligent freeway traffic control is an important application in intelligent transportation systems, which is expected to improve the mobility of freeway networks. In this paper, we propose a deep neuroevolution model, called ES-CTC, to achieve a cooperative control scheme of ramp metering, differential variable speed limits and lane change control agents for improving freeway traffic. In this model, graph convolutional networks are used to learn more meaningful spatial patterns from traffic sensors, and a knowledge sharing layer is designed for communication between different agents. The proposed neural network structure allows different agents to share knowledge with each other and execute actions asynchronously. To address the delayed reward and action asynchronism issues, an evolutionary strategy is used to train the agents under stochastic traffic demands. The experimental results on a simulated freeway section indicate that ES-CTC is a viable approach and outperforms several existing methods.
Yuankai Wu, Huachun Tan, Zhuxi Jiang and Bin Ran
School of Mechanical Engineering, Beijing Institute of Technology, China
School of Transportation Engineering, Southeast University, China
College of Engineering, University of Wisconsin-Madison, USA
Kaimaogege@gmail.com, email@example.com, firstname.lastname@example.org, email@example.com,
1 Introduction
The ongoing drastic expansion of car ownership and travel demand has led to increasing freeway congestion, with adverse effects on the economy. To relieve freeway congestion, numerous freeway traffic control approaches, e.g. dynamic routing, variable speed limit (VSL), ramp metering (RM) and lane change control (LCC), have been studied. From a systematic viewpoint, using one management approach alone cannot fully optimize the freeway traffic in practice. The mainlane flow, on-ramp flow, routing behaviors and lane changing behaviors need to be regulated in a coordinated manner in order to improve the freeway condition. This is the motivation for investigating the coordination of different traffic control approaches.
There is a large body of published work on cooperative traffic control. Hegyi et al. [Hegyi et al., 2005] developed a predictive coordinated control approach for the coordination of VSL and RM. Carlson et al. [Carlson et al., 2010] formulated coordinated VSL and RM control as an optimal control problem using a second-order traffic flow model. Recently, the coordination of RM, VSL and LCC in a connected autonomous vehicle environment was studied [?]. Two limitations of these studies are worth noting: 1) The control models are highly dependent on the integrated traffic flow models, which are inevitably inconsistent with real-world traffic breakdown. 2) The success of proactive approaches depends on the robustness and reliability of the short-term traffic prediction model. Accurate and reliable short-term traffic prediction is not an easy task because the evolution of the traffic state is related to many factors [?].
Recently, the advent of deep reinforcement learning (DRL) has led to potential applications of reinforcement learning (RL) techniques to challenging control problems in intelligent transportation systems. DRL has given promising results in RM [?], traffic light control [?], differential VSL control [?], fleet management [?] and hybrid electric vehicle energy management [?]. The use of deep learning algorithms within RL allows a well-trained traffic control agent to achieve a proactive control scheme and optimize the transportation benefits. The success of DRL on individual traffic control approaches holds great promise for applying DRL to the coordination of different traffic control approaches.
However, the coordination of different traffic control approaches within one DRL framework is not an easy task. The first challenge is the difference between the control cycles of different agents. In many situations, the agents change actions asynchronously, which differs from the setting assumed by popular multi-agent DRL frameworks [?; ?]. For example, the agent controlling on-ramp flow should decide whether to change the traffic light phase every few seconds, while the control cycle for VSL agents is always above 1 minute because a frequently changing speed limit will destabilize the traffic flow.
The second challenge stems from the difficulty of defining a representative reward signal for different traffic control agents. The aim of traffic management is to reduce travel time and increase traffic flow. However, the average travel time and total flow cannot be computed until all the vehicles have completed their routes, which causes the issue of delayed rewards [?]. Delayed rewards cause further credit assignment problems in multi-agent DRL [?].
The third challenge lies in the modeling of the traffic state. Traditionally, the traffic state collected from sensors is modeled as images and/or vectors, and is directly taken as an input for a convolutional neural network (CNN) [?] or a fully connected neural network (FC) [?]. However, sensors on the road network contain complex spatial correlations and exhibit graph structure. Numerous studies have reported that the graph convolutional network (GCN) is more suitable than CNN and FC for modeling the spatial correlation of traffic sensors in traffic prediction [?; ?].
To tackle those challenges, we propose a deep neuroevolution [?] based multi-agent framework for cooperative traffic control (ES-CTC). The main contributions of this paper can be summarized as follows:
- We find that the deep neuroevolution approach is well suited to cooperative traffic control. In a deep neuroevolution approach such as evolution strategies (ES), the only feedback signal for the different agents is the final return of an episode. As a result, the problem of delayed reward is readily addressed by ES.
- We propose a novel structure named knowledge sharing graph convolutional nets (KS-GCN) to generate control actions from the state collected from traffic sensors. GCN is used as the building block for the proposed structure, which can capture the spatial dependency between different sensors. The structure allows communication and knowledge sharing between different agents. Based on the knowledge sharing layer, each neural agent can coordinate with the other agents while executing actions in its own control cycle.
- The travel demands for training the neural networks are modeled as a stochastic distribution, leading to changes in the system dynamics of the environment. The experiments show that the proposed approach works well under stochastic travel demands.
2 Problem Statement
The freeway section considered in this paper is shown in Figure 1. It is composed of multiple lanes and has an on-ramp and an off-ramp. As can be seen in the figure, interference between vehicles appears in the merging area between the inflow from the on-ramp and the outflow of the mainstream. These conflicts cause further speed reductions in the merging area, contributing to the creation of a generalised bottleneck.
Following [?], we consider freeway flow with a high ratio of connected autonomous vehicles (CAVs), so that the differential VSL and LCC can be successfully implemented. More specifically, the following control agents are considered in this paper:
- Ramp-metering agent: The agent regulates the inflow from the on-ramp to the mainstream by changing the phase of the traffic light on the on-ramp.
- Differential VSL (DVSL) agent: The DVSL agent aims at regulating the outflow of the controlled area to prevent the capacity drop at bottlenecks. The conflicts between vehicles occur mostly in the right lanes. Therefore different speed limits among lanes might be more effective. The DVSL strategy can be implemented in a CAV environment: the DVSL signs send speed limit orders to the vehicles in the corresponding lane, and the vehicles are forced to drive under the received speed limit.
- LCC agent: The LCC is used to regulate the lateral flows for each lane. The implementation of the LCC agent is more challenging than that of the RM and DVSL agents. In this paper, we only consider using a road-side unit (RSU) to send “keep lane” orders to the vehicles in the left 2 lanes of the merge area. The reason is that the lateral inflow from the left lanes to the right lanes will cause severe congestion when traffic breakdown occurs in the merge area of the right lanes.
Each control agent executes its own action according to its own control cycle. We denote by $T_{RM}$ the control cycle for the RM agent, by $T_{DVSL}$ that for the DVSL agent and by $T_{LCC}$ that for the LCC agent. The main goal of these agents is to reduce congestion and increase the freeway capacity in a coordinated manner.
3 The KS-GCN Model Description
Figure 2 presents the architecture of KS-GCN, which comprises traffic state inputs for the DVSL, RM and LCC agents, several GCN layers, several knowledge sharing layers, and the DVSL, RM and LCC actuators.
The function of KS-GCN is to generate coordinated actions for the DVSL agent, RM agent and LCC agent given the observed traffic state from correlated sensors/detectors on the targeted freeway section. Each agent only receives states from its most closely related sensors. Each sensor collects $P$ traffic variables (e.g., velocity, occupancy rate) in one cycle, denoted as a vector $x \in \mathbb{R}^{P}$. The sensor network can be represented as a weighted undirected graph $G = (V, E, A)$, where $V$ is a set of $N$ nodes, $E$ is a set of edges, and $A \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix. The KS-GCN learns functions that map graph signals to traffic control signals asynchronously:
$$[X_{RM}, X_{DVSL}, X_{LCC};\ A_{RM}, A_{DVSL}, A_{LCC}] \xrightarrow{f(\cdot)} [a^{RM}_{kT_{RM}},\ a^{DVSL}_{kT_{DVSL}},\ a^{LCC}_{kT_{LCC}}]$$
where $X_{RM} \in \mathbb{R}^{N_{RM} \times P}$, $X_{DVSL} \in \mathbb{R}^{N_{DVSL} \times P}$ and $X_{LCC} \in \mathbb{R}^{N_{LCC} \times P}$ are graph sensor signals related to the RM, DVSL and LCC agents respectively. The 3 agents can share sensors, therefore $N_{RM} + N_{DVSL} + N_{LCC} \geq N$. $A_{RM}$, $A_{DVSL}$ and $A_{LCC}$ are the RM, DVSL and LCC similarity matrices derived from $A$, and $k$ is an integer. KS-GCN asynchronously updates the control signals every control cycle. The control cycles of RM ($T_{RM}$), DVSL ($T_{DVSL}$) and LCC ($T_{LCC}$) can be different from each other.
3.2 Network Structure
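We use the GCN architecture proposed in [?] to learn the spatial dependence between traffic signals on the graph. The layer-wise propagation rule of the specific GCN is:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} + b^{(l)}\right)$$
where $\tilde{A} = A + I_N$ is the adjacency matrix with added self-connections, $I_N$ is the identity matrix, and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $W^{(l)}$ and $b^{(l)}$ are the layer-specific trainable weight matrix and bias. $H^{(l)} \in \mathbb{R}^{N \times F_l}$, where $N$ is the number of graph signals and $F_l$ is the number of features in the $l$-th layer, and $\sigma(\cdot)$ is the activation in the $l$-th layer. In KS-GCN, there are 3 stacked GCNs, which are used to learn features from the traffic states for the RM, DVSL and LCC agents respectively.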
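As a concrete illustration of a layer-wise propagation step of this kind, the following NumPy sketch applies two GCN layers with 5 and 3 features (the layer sizes reported in the experiments) to a toy graph. The 3-sensor chain graph, the random weights and the ReLU activation are assumptions made for illustration only, and the helper names are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a square adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])       # add self-connections
    deg = A_tilde.sum(axis=1)              # diagonal entries of D~
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W, b):
    """One propagation step with an assumed ReLU activation."""
    return np.maximum(A_hat @ H @ W + b, 0.0)

rng = np.random.default_rng(0)
# Toy graph: three sensors in a chain (assumption, not the paper's network).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = rng.normal(size=(3, 2))                # speed and occupancy per sensor
A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, X, rng.normal(size=(2, 5)), np.zeros(5))
H2 = gcn_layer(A_hat, H1, rng.normal(size=(5, 3)), np.zeros(3))
```

Each row of `H2` is the learned feature vector of one sensor; a row mixes information from that sensor and its graph neighbours up to two hops away.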
On top of the GCN, we further use a knowledge sharing layer to learn the sharing features for each agent. After $L$ layers of GCN, the last output matrix $H^{(L)}$ is of size $N \times F_L$. We use a simple FC layer for knowledge sharing: the output matrix is reshaped as a vector $h \in \mathbb{R}^{N F_L}$. The sharing feature can be obtained by:
$$s = \sigma(W_s h + b_s)$$
$W_s \in \mathbb{R}^{K \times N F_L}$ and $b_s \in \mathbb{R}^{K}$ are trainable weights for the knowledge sharing layer, and $K$ is the dimension of the sharing knowledge. Each agent shares its own knowledge with the other agents for generating its specific action. The sharing process is done by concatenation: the final vectorized feature $z$ for generating the control action is the output of a concatenation layer $\mathrm{concat}(\cdot)$.
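The knowledge sharing step can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the per-agent sensor counts, the tanh activation, and the choice to concatenate an agent's own sharing feature with those of the two other agents are all hypothetical, and the function names are not from the paper. Only the sharing dimension of 8 and the 3 output features of the last GCN layer follow the experimental settings.

```python
import numpy as np

def sharing_feature(H_last, W_s, b_s):
    """Flatten the last GCN output (N x F) and pass it through one FC layer."""
    h = H_last.reshape(-1)
    return np.tanh(W_s @ h + b_s)          # tanh is an assumed activation

def agent_feature(own, others):
    """Concatenate an agent's own sharing feature with those it receives."""
    return np.concatenate([own, *others])

rng = np.random.default_rng(1)
K = 8                                      # dimension of the sharing feature
# Hypothetical (number of sensors, features) of each agent's last GCN output.
shapes = {"RM": (4, 3), "DVSL": (6, 3), "LCC": (2, 3)}
s = {}
for name, (n, f) in shapes.items():
    H = rng.normal(size=(n, f))            # stand-in for a GCN output
    s[name] = sharing_feature(H, rng.normal(size=(K, n * f)), np.zeros(K))

z_rm = agent_feature(s["RM"], [s["DVSL"], s["LCC"]])
```

Because every sharing feature has the same fixed length `K`, an agent that acts on its own control cycle can reuse the most recent features produced by the other agents without requiring them to act at the same time.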
3.3 Action Design
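In this subsection, we introduce the action representation of the different agents. The action for RM is represented by the phase of the traffic light on the on-ramp. It is defined as a binary variable $a_{RM}$ with two values: change the light to the green phase (the vehicles on the on-ramp are allowed to enter the freeway), and change the light to the red phase. The action for the RM agent can be generated by an FC layer with softmax activation:
$$a_{RM} = \arg\max\big(\mathrm{softmax}(W_{RM} z_{RM} + b_{RM})\big)$$
where $z_{RM}$ is the final feature of the RM agent, $W_{RM}$ and $b_{RM}$ are the trainable weights, and $\arg\max$ is used to find the index with the maximum value.
A similar action design can be applied to the LCC agent. The action of the LCC agent is defined as a binary variable $a_{LCC}$ with two values: allow lane changes in the left 2 lanes, and forbid lane changes in the left 2 lanes. The generation process of $a_{LCC}$ is:
$$a_{LCC} = \arg\max\big(\mathrm{softmax}(W_{LCC} z_{LCC} + b_{LCC})\big)$$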
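The binary action generation for the RM and LCC agents (an FC layer with softmax, followed by taking the index of the largest output) can be sketched as below. The feature vector, weights and function names are made up for the example; which index corresponds to which phase or lane-change order is left open here.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def binary_action(z, W, b):
    """Return the index (0 or 1) of the larger softmax output of a 2-unit FC layer."""
    return int(np.argmax(softmax(W @ z + b)))

# Hypothetical feature vector and weights, for illustration only.
z = np.array([0.2, -0.1, 0.4])
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
b = np.zeros(2)
probs = softmax(W @ z + b)                 # logits are [0.2, 0.4]
a = binary_action(z, W, b)
```

Since the arg max is taken deterministically, the policy has no sampling step; in an evolution-strategy setting the exploration comes from perturbing the weights instead.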
The action $a_{DVSL}$ sets the speed limits of all lanes in the controlled area. Therefore $a_{DVSL} \in \mathbb{Z}^{N_{lane}}$, where $N_{lane}$ is the number of lanes at the controlled section. Considering real-world implementation and the driver compliance issue, the elements of $a_{DVSL}$ are set as discrete values in $\{0, 1, \dots, M\}$. The speed limit for an element $a$ is equal to $v_{min} + I a$, where $v_{min}$ is the minimum value of the speed limit and $I$ is the integer multiple, so the maximum value of the speed limits is $v_{min} + I M$. It is not feasible for a neural network to generate explicit discrete speed limits for multiple lanes because the total number of actions for an $N_{lane}$-lane freeway section will be as large as $(M+1)^{N_{lane}}$. A neural network of limited size can hardly handle such a large action space. Following the work in [?], the action generation process for the DVSL agent is defined as:
$$a_{DVSL} = \left\lfloor M \cdot \mathrm{sigmoid}(W_{DVSL} z_{DVSL} + b_{DVSL}) \right\rfloor$$
The activation of the FC layer for the DVSL agent is the sigmoid function. The outputs of the FC layer are then multiplied by $M$. The discrete action is obtained as the integer parts of the scaled outputs.
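The DVSL action generation (bounded FC outputs, scaling, then taking integer parts) can be illustrated with the sketch below. It assumes a sigmoid activation and uses hypothetical values for the minimum speed limit (40), the increment (5), the number of levels (`m_max = 5`) and the number of lanes (5); none of these numbers, nor the function name, are taken from the paper.

```python
import numpy as np

def dvsl_speed_limits(z, W, b, v_min, step, m_max):
    """Map a feature vector to one discrete speed limit per lane."""
    u = 1.0 / (1.0 + np.exp(-(W @ z + b)))   # assumed sigmoid, values in (0, 1)
    a = np.floor(m_max * u).astype(int)      # integer part of the scaled output
    return a, v_min + step * a

n_lanes = 5
z = np.array([0.3, -0.7, 1.2])               # hypothetical feature vector
W = np.zeros((n_lanes, 3))                   # zero weights keep the example easy to follow
b = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])  # one bias per lane
a, limits = dvsl_speed_limits(z, W, b, v_min=40, step=5, m_max=5)

# Size of the explicit joint action space this design avoids enumerating.
n_joint_actions = (5 + 1) ** n_lanes
```

With these toy values the network needs only 5 outputs, one per lane, whereas an explicit enumeration over 6 levels and 5 lanes would require 7776 output units.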
4 Evolutionary Strategy for Optimization
In this section, our aim is to propose an efficient and effective optimization algorithm for cooperative traffic control using KS-GCN, based on the evolutionary strategy (ES). Finding an optimal cooperative control policy for the freeway section described in Section 2 can be seen as an optimization problem: search for a trainable parameter set $\theta$ for KS-GCN that maximizes the total outflow $\sum_t q_t$ of the freeway section, where $q_t$ is the instantaneous outflow of the freeway section.
The parameters of KS-GCN can be directly updated using the final returns of parallel workers in ES; therefore we propose to use ES as the optimization algorithm for KS-GCN. Another objective of the freeway control agents is to achieve an optimal control scheme under stochastic traffic demand. This can also be done easily via ES. In simulation, the traffic demand is modeled as a random process. In each episode, a new traffic demand is set by sampling from the random process, then several parallel workers run simulations with the same traffic demand, and finally the parameters are updated using the final returns of these parallel workers. We find that this stochastic training approach improves the generalization of the agents.
Another core challenge is how to balance exploration and exploitation using ES. The total outflow as the reward function is sometimes deceptive, e.g., agents that achieve high outflow for a specific traffic demand might perform badly under another traffic demand sampled from the same random process. Without adequate exploration, the agents might fail to discover effective traffic control strategies. In this paper, we exploit the novelty-seeking (NS) approach proposed in [?] for exploration. In NS, the novelty of a policy is characterized by a behavior vector that describes its behavior. For CTC, we define the traffic-demand-specific behavior vector $b_d(\theta)$ as:
$$b_d(\theta) = \left[a^{RM}_d,\ a^{DVSL}_d,\ a^{LCC}_d\right]$$
where $a^{RM}_d$, $a^{DVSL}_d$ and $a^{LCC}_d$ are vectors that contain the RM, DVSL and LCC actions at all time steps under demand $d$. The original work on NS uses a set of parameters to calculate the novelty. Because the traffic demand changes every episode, calculating demand-specific behavior vectors for a set of parameters would be very time-consuming. In this paper, the novelty of a parallel worker with parameters $\theta_i$ is directly defined as the distance between its behavior vector and that of the unperturbed agent with parameters $\theta$ on demand $d$:
$$N(\theta_i, d) = \left\| b_d(\theta_i) - b_d(\theta) \right\|$$
The parameter update rule for ES-CTC is then expressed as follows:
where $n$ is the number of parallel workers and $\alpha$ is the learning rate. $w$ is the parameter that balances exploration and exploitation. In this work, we slowly decrease $w$ every episode. Algorithm 1 summarizes the optimization procedure of ES-CTC.
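One novelty-weighted ES update of the kind described in this section can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the Gaussian perturbations with standard deviation `sigma`, the standardization of returns and novelties, the convention that `w` weights the novelty term, the Euclidean distance, and the toy quadratic `toy_rollout` standing in for a SUMO episode are all assumptions.

```python
import numpy as np

def es_ns_step(theta, rollout, rng, n_workers=50, sigma=0.1, alpha=0.02, w=0.5):
    """One update; rollout(params) returns (total_return, behavior_vector)."""
    _, b_ref = rollout(theta)                       # behavior of the unperturbed agent
    eps = rng.normal(size=(n_workers, theta.size))  # one perturbation per worker
    returns = np.empty(n_workers)
    novelty = np.empty(n_workers)
    for i in range(n_workers):                      # workers would run in parallel
        returns[i], b_i = rollout(theta + sigma * eps[i])
        novelty[i] = np.linalg.norm(b_i - b_ref)    # distance to unperturbed behavior

    def standardize(x):
        return (x - x.mean()) / (x.std() + 1e-8)

    # Assumed convention: w weights novelty (exploration), 1 - w weights return.
    score = (1.0 - w) * standardize(returns) + w * standardize(novelty)
    return theta + alpha / (n_workers * sigma) * (eps.T @ score)

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])

def toy_rollout(params):
    """Stand-in for one simulation episode (not a traffic model)."""
    total_return = -np.sum((params - target) ** 2)
    behavior = np.tanh(params)
    return total_return, behavior

theta = np.zeros(3)
initial_return, _ = toy_rollout(theta)
w = 0.5
for generation in range(100):
    theta = es_ns_step(theta, toy_rollout, rng, w=w)
    w *= 0.95                                       # slowly reduce the exploration weight
final_return, _ = toy_rollout(theta)
```

In the setting described above, `rollout` would additionally fix the traffic demand sampled for the current episode, so that all workers and the unperturbed agent are compared on the same demand.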
5 Experiments
In this section, we conduct experiments on a simulated freeway section built with SUMO to evaluate the effectiveness of ES-CTC.
5.1 The simulated freeway section
The open source software SUMO is selected for the experiments. Through its API, the Traffic Control Interface (TraCI) package, the software supports setting the speed limit for each lane, setting the phase of a traffic light and forbidding lane changes. An 874.51 m freeway section with on- and off-ramps on northbound I-405 in California, USA is selected. The original speed limits for the mainlane of this section are , for the on- and off- ramps are . The freeway section in SUMO and each agent’s control area can be found in Figure 3. The travel demand of this freeway can be categorized into 3 routes: 1) from mainlane to mainlane (M2M), 2) from mainlane to off-ramp (M2Off), and 3) from on-ramp to mainlane (On2M). Based on observations of recorded traffic flow from the sensors of PeMS (http://pems.dot.ca.gov), the hourly demand of each of these 3 routes is modeled as a Poisson distribution, with mean values 5427, 1809 and 1153 respectively. The departure lanes of the vehicles are set randomly according to a uniform distribution. Passenger cars with a length of 3.5 m and trucks/buses with a length of 8 m are selected as vehicle types in the simulated traffic stream. The type of vehicles are selected randomly according to probability . Each simulation round lasts for 1 hour.
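Sampling one episode's demand from the Poisson model above can be sketched in NumPy as follows. The means are the ones stated in the text; treating the three routes as independent draws, and the function and variable names, are assumptions of this sketch.

```python
import numpy as np

# Mean hourly demand per route, as stated in the text.
MEAN_HOURLY_DEMAND = {"M2M": 5427, "M2Off": 1809, "On2M": 1153}

def sample_demand(rng):
    """Draw one hourly demand per route from independent Poisson distributions."""
    return {route: int(rng.poisson(mean))
            for route, mean in MEAN_HOURLY_DEMAND.items()}

rng = np.random.default_rng(42)
episode_demand = sample_demand(rng)        # vehicles per hour for each route
```

A new call to `sample_demand` at the start of every episode gives the stochastic training setting, while reusing a single sample corresponds to the fixed-demand case.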
We place sensors upstream of the DVSL controlled area, in the DVSL controlled area, on the on-ramp and in the merge area to detect the traffic state. The sensors on the off-ramp and in the downstream area are used to calculate the outflow of the freeway section, which is used to compute the final return for the agents. The traffic speed and occupancy rate collected from these sensors are used as inputs for the KS-GCN. Specifically, the sensors on the on-ramp and upstream of the merge area are used for the RM agent. The sensors upstream of the DVSL controlled area, in the DVSL controlled area and upstream of the merge area are used for the DVSL agent. The sensors in the merge area are used for the LCC agent. The sizes of , and are , and respectively. The element of similarity matrix for input states is given by:
where denotes the location of the sensor. means that sensor and sensor belong to different freeway sections. denotes that sensor and sensor are in the same freeway section. The control cycles $T_{RM}$, $T_{DVSL}$ and $T_{LCC}$ of the RM, DVSL and LCC agents are set to 3, 60 and 30 seconds respectively. The speed limits set for DVSL agent is .
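The asynchronous execution implied by control cycles of 3, 60 and 30 seconds can be illustrated with a small scheduling sketch. It assumes a 1-second simulation step and that all agents start their first cycle at second 0; both are assumptions, as is the function name.

```python
# Control cycles in seconds for each agent, as stated in the text.
CONTROL_CYCLES = {"RM": 3, "DVSL": 60, "LCC": 30}

def agents_due(t, cycles=CONTROL_CYCLES):
    """Return the agents whose control cycle starts at simulation second t."""
    return [name for name, period in cycles.items() if t % period == 0]

# Number of decisions each agent takes during a 1-hour simulation round.
decisions_per_hour = {
    name: sum(1 for t in range(3600) if name in agents_due(t))
    for name in CONTROL_CYCLES
}
```

At seconds where only the RM agent is due, the DVSL and LCC actions from their last cycle simply stay in force, which is why the agents rarely act in the same step.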
We compare ES-CTC with the following baseline methods, which include several DRL based traffic control models:
- No control: The baseline without any DVSL, RM and LCC control.
- DQN-RM: A modified version of DQN based traffic light control for RM. The state input of the neural network is the vectorization of the RM traffic state. The agent is modeled as a neural network with two hidden FC layers.
- TRPO-RM: The actor and critic of the agent are modeled as neural networks with two hidden FC layers.
- DDPG-DVSL: A DRL based DVSL control model whose actor and critic are modeled as neural networks with two hidden FC layers.
The traffic state $X_{RM}$ is used as the state variable for DQN-RM and TRPO-RM. The traffic state $X_{DVSL}$ is used as the state variable for DDPG-DVSL. The neural networks of DQN-RM and the actors and critics of DDPG-DVSL and TRPO-RM have 2 hidden FC layers, which contain 30 and 20 hidden neurons respectively. The agents of ES-CTC are built upon 2-layer GCNs; the numbers of features in the 1st and 2nd layers are 5 and 3 respectively, and the dimension of the sharing feature is set to 8. The reward signal of DQN-RM, TRPO-RM and DDPG-DVSL is the outflow of the freeway section at time point $t$. Their discount factors are set to 0.9. The return for ES-CTC is the total outflow of the freeway section.
5.3 Performance Comparisons
We first evaluate all models on a simple case in which they are repeatedly optimized on the same demand profile. The DRL based DQN-RM, TRPO-RM and DDPG-DVSL are trained on this demand for 2000 episodes. The number of parallel workers for ES-CTC is set to 50. To make the comparison fair, we update the parameters of ES-CTC 40 times, so that all models are trained with the same number of simulations (50 × 40 = 2000). In this scenario, we can observe from the training process whether the compared models converge to a stable and optimal point. The evolution of the overall outflow of each algorithm during training can be seen in Figure 4.
We find that DQN-RM, TRPO-RM and DDPG-DVSL fail to converge to a stable value. Several oscillations can be observed in Figure 4(a). The outflow is related to many other factors, such as the inflow of the on-ramp and the outflow of the off-ramp, which cannot be fully controlled by the agents. Moreover, a vehicle is counted as outflow only when it has left the freeway section, so there can be a delay between the control effects of the agents on the vehicle and the computation of the reward signal. These issues make it difficult for the DRL based approaches to converge. Figure 4(c) shows that ES-CTC is more stable. ES-CTC reaches a relatively high outflow after 25 generations and achieves the highest maximum outflow, 6609 vehicles. Another advantage of ES-CTC is that it is significantly faster than the DRL models due to its higher parallelization capability. The results indicate that the deep neuroevolution model is more suitable for cooperative traffic control than the DRL models. The total outflow only reaches 6289 vehicles when no control strategy is implemented. The maximum outflows of all DRL models and ES-CTC are significantly higher than 6289: the maximum outflows for DQN-RM, TRPO-RM and DDPG-DVSL are 6577, 6570 and 6588 respectively. This shows that the traffic control strategies can increase the capacity of the freeway.
In the second case, DQN-RM, TRPO-RM, DDPG-DVSL and ES-CTC are trained and evaluated on stochastic traffic demand. The DRL based DQN-RM, TRPO-RM and DDPG-DVSL are trained for 3000 episodes, with a new traffic demand in each episode. The number of parallel workers for ES-CTC is set to 100. To ensure that all models consume similar wall-clock time, we evolve the ES-CTC model for 200 generations. After training, we compare the average outflow of all models on 100 stochastic demands. The traditional performance metric used in RL problems is the average total return achieved by the model in an episode. To obtain more representative metrics that are independent of reward shaping for traffic control, we also compute the average traffic demand satisfaction degree and the average improvement level, which are defined as
Here $D_i$ is the total demand of the $i$th episode, and $O_i^{0}$ is the total outflow of the $i$th episode without any traffic control agents. The evaluation results of the 4 models are given in Table 1. ES-CTC achieves a higher average outflow, demand satisfaction degree and improvement level than the three DRL benchmarks on 100 stochastic traffic demands. The ES based optimization strategy, the graph convolutional structure and the coordination between different agents are the keys to its success.
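One plausible way to compute the three evaluation quantities is sketched below. It assumes that the satisfaction degree is the per-episode ratio of outflow to demand and that the improvement level is the per-episode relative gain over the no-control outflow, both averaged over episodes; these exact forms, the function name, and the numbers in the usage example are assumptions for illustration and are not results from the paper.

```python
import numpy as np

def evaluation_metrics(outflow, demand, outflow_no_control):
    """Average outflow, demand satisfaction and improvement over episodes."""
    o = np.asarray(outflow, dtype=float)
    d = np.asarray(demand, dtype=float)
    o0 = np.asarray(outflow_no_control, dtype=float)
    return {
        "average_outflow": float(o.mean()),
        "satisfaction": float(np.mean(o / d)),          # share of demand served
        "improvement": float(np.mean((o - o0) / o0)),   # gain over no control
    }

# Made-up numbers for two episodes, only to show the calculation.
metrics = evaluation_metrics(
    outflow=[6600, 6500],
    demand=[8000, 8125],
    outflow_no_control=[6000, 6250],
)
```

Unlike the raw return, both ratios are normalized per episode, so episodes with unusually high or low sampled demand do not dominate the average.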
The RM, DVSL and LCC actions of ES-CTC obtained from one simulation are presented in Figure 5. The most interesting result is the speed limits produced by the DVSL agent. The DVSL agent has learned to always set the maximum speed limit for the leftmost lane; it automatically treats the left lanes as overtaking lanes. The agents mainly adjust the inflow to the bottleneck through the speed limits of the right lanes, the on-ramp vehicles and the vehicles’ lane change behaviors. As stated before, the conflicts between vehicles occur mostly in the right lanes. Therefore it is not necessary to decrease the speed limits of the left 2 lanes (lane 4 and lane 5).
6 Conclusion
In this paper we have proposed a deep neuroevolution model for cooperative freeway traffic control. To learn the spatial dependence between traffic sensors, the neural network structure of the model is built upon graph convolutional layers. Our structure allows several traffic control agents with different control cycles to work cooperatively to improve freeway traffic efficiency. In our experiments, our solution outperforms the compared DRL based solutions in terms of improvements in freeway capacity.
Several interesting questions, both theoretical and practical, stem from our paper, and we plan to study them in the future. We aim to extend the approach to large freeway networks and a broader set of dynamic events such as adverse weather and traffic incidents. Another interesting direction is the incorporation of more advanced traffic control strategies. In this paper, the most basic graph convolutional network architecture and evolutionary strategy are used. We believe that a more systematic study of architectures and optimization strategies may provide improvements in control performance.
Acknowledgments
The work was supported by the National Natural Science Foundation of China (61620106002). Any opinions expressed in this paper are solely those of the authors and do not represent those of the sponsors. The authors would like to thank the anonymous reviewers for their constructive and valuable suggestions for improving the overall quality of this paper.
References
- [Belletti et al., 2018] Francois Belletti, Daniel Haziza, Gabriel Gomes, and Alexandre M Bayen. Expert level control of ramp metering based on multi-task deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 19(4):1198–1207, 2018.
- [Carlson et al., 2010] Rodrigo C Carlson, Ioannis Papamichail, Markos Papageorgiou, and Albert Messmer. Optimal motorway traffic flow control involving variable speed limits and ramp metering. Transportation Science, 44(2):238–253, 2010.
- [Conti et al., 2018] Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems, pages 5032–5043, 2018.
- [Foerster et al., 2016] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
- [Foerster et al., 2017] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.
- [Hegyi et al., 2005] Andreas Hegyi, Bart De Schutter, and Hans Hellendoorn. Model predictive control for optimal coordination of ramp metering and variable speed limits. Transportation Research Part C: Emerging Technologies, 13(3):185–209, 2005.
- [Hellinga and Mandelzys, 2011] Bruce Hellinga and Michael Mandelzys. Impact of driver compliance on the safety and operational impacts of freeway variable speed limit systems. Journal of Transportation Engineering, 137(4):260–268, 2011.
- [Kipf and Welling, 2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- [Li et al., 2016] Li Li, Yisheng Lv, and Fei-Yue Wang. Traffic signal timing via deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 3(3):247–254, 2016.
- [Li et al., 2018] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations, 2018.
- [Lin et al., 2018] Kaixiang Lin, Renyu Zhao, Zhe Xu, and Jiayu Zhou. Efficient large-scale fleet management via multi-agent deep reinforcement learning. arXiv preprint arXiv:1802.06444, 2018.
- [Lowe et al., 2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
- [Lv et al., 2018] Zhongjian Lv, Jiajie Xu, Kai Zheng, Hongzhi Yin, Pengpeng Zhao, and Xiaofang Zhou. Lc-rnn: A deep learning model for traffic speed prediction. In IJCAI, pages 3470–3476, 2018.
- [Roncoli et al., 2015] Claudio Roncoli, Markos Papageorgiou, and Ioannis Papamichail. Traffic flow optimisation in presence of vehicle automation and communication systems–part ii: Optimal control for multi-lane motorways. Transportation Research Part C: Emerging Technologies, 57:260–275, 2015.
- [Salimans et al., 2017] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
- [Van der Pol and Oliehoek, 2016] Elise Van der Pol and Frans A Oliehoek. Coordinated deep reinforcement learners for traffic light control. Proceedings of Learning, Inference and Control of Multi-Agent Systems (at NIPS 2016), 2016.
- [Wei et al., 2018] Hua Wei, Guanjie Zheng, Huaxiu Yao, and Zhenhui Li. Intellilight: A reinforcement learning approach for intelligent traffic light control. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2496–2505. ACM, 2018.
- [Wu et al., 2018a] Jingda Wu, Hongwen He, Jiankun Peng, Yuecheng Li, and Zhanjiang Li. Continuous reinforcement learning of energy management with deep q network for a power split hybrid electric bus. Applied Energy, 222:799–811, 2018.
- [Wu et al., 2018b] Yuankai Wu, Huachun Tan, Lingqiao Qin, Bin Ran, and Zhuxi Jiang. A hybrid deep learning based traffic flow prediction method and its understanding. Transportation Research Part C: Emerging Technologies, 90:166–180, 2018.
- [Wu et al., 2018c] Yuankai Wu, Huachun Tan, and Bin Ran. Differential variable speed limits control for freeway recurrent bottlenecks via deep reinforcement learning. arXiv preprint arXiv:1810.10952, 2018.