Multi-Vehicle Mixed Reality Reinforcement Learning for Autonomous Multi-Lane Driving


Autonomous driving promises to transform road transport. Multi-vehicle and multi-lane scenarios, however, present unique challenges due to constrained navigation and unpredictable vehicle interactions. Learning-based methods—such as deep reinforcement learning—are emerging as a promising approach to automatically design intelligent driving policies that can cope with these challenges. Yet, the process of safely learning multi-vehicle driving behaviours is hard: while collisions—and their near-avoidance—are essential to the learning process, directly executing immature policies on autonomous vehicles raises considerable safety concerns. In this article, we present a safe and efficient framework that enables the learning of driving policies for autonomous vehicles operating in a shared workspace, where the absence of collisions cannot be guaranteed. Key to our learning procedure is a sim2real approach that uses real-world online policy adaptation in a mixed reality setup, where other vehicles and static obstacles exist in the virtual domain. This allows us to perform safe learning by simulating (and learning from) collisions between the learning agent(s) and other objects in virtual reality. Our results demonstrate that, after only a few runs in mixed reality, collisions are significantly reduced.

Multi-robot systems; Machine learning for robotics; Reinforcement learning; Autonomous vehicles; Reality gap; Sim2real


1 Introduction

The deployment of automated and autonomous vehicles presents us with transformational opportunities for road transport. To date, the number of companies working on this technology is substantial, and growing CBS (2018). Opportunities reach beyond single-vehicle automation: by enabling groups of vehicles to jointly agree on maneuvers and navigation strategies, real-time coordination promises to improve overall traffic throughput, road capacity, and passenger safety Dressler et al. (2014); Ferreira et al. (2010). However, driving in multi-vehicle and multi-lane settings remains a challenging research problem, due to unpredictable vehicle interactions (e.g., non-cooperative cars, unreliable communication), hard workspace limitations (e.g., lane topographies), and constrained platform dynamics (e.g., steering kinematics, driver comfort).

Learning-based methods, such as deep reinforcement learning, have proven effective at designing robot control policies for an increasing number of tasks in single-vehicle systems, for applications such as navigation Khan et al. (2019), flight Molchanov et al. (2019), and locomotion Tan et al. (2018). Leveraging such methods for learning autonomous driving policies is emerging as a particularly promising approach Pan et al. (2017); Shalev-Shwartz et al. (2016); Kuderer et al. (2015). Yet, the process of safely learning autonomous driving involves unique challenges, since the decision models often used in robotics do not lend themselves naturally to the multi-vehicle domain, due to the unpredictable behaviour of other agents. The unapologetic nature of the trial-and-error process in reinforcement learning compounds the difficulty of ensuring functional safety.

These adversities call for learning that first takes place in simulation, before transferring to the real world Miglino et al. (1995); Shah et al. (2018). This transfer, often referred to as sim2real, is challenging due to discrepancies between conditions in simulation and the real world (such as vehicle dynamics and sensor data) Peng et al. (2018); James et al. (2019); Chebotar et al. (2019). Despite substantial advances in this field, the problem of executing immature policies directly on an autonomous vehicle still raises considerable safety concerns. These concerns are exacerbated when multiple autonomous vehicles share the same workspace, risking collisions and irreparable damage. At the same time, the act of colliding, or nearly colliding, is essential to the learning process, enabling future policy roll-outs to incorporate these critical experiences. How are we to provide safe multi-vehicle learning experiences, without forgoing the realism of high-fidelity training data? There is a dearth of work that addresses this challenge.

Figure 1: Mixed reality multi-vehicle multi-lane traffic circuit including one real DeepRacer robot and twelve virtual ones, in beige. Four static virtual vehicles are rendered in blue. The colliding virtual vehicle is rendered in red.

Our goal in this paper is to develop a safe and efficient framework that allows us to learn driving policies for autonomous vehicles operating in a shared workspace, where collision-freeness cannot be guaranteed. Rather than re-elaborating or advancing state-of-the-art reinforcement learning, we aim to make it directly applicable to physical robots. To this end, we learn an end-to-end policy for vehicle navigation on a multi-lane track that is shared with other moving vehicles and static obstacles. The learning is based on a model-free method embedded in a distributed training mechanism that we tailor for mixed reality compatibility. Key to our learning procedure is a sim2real approach that uses real-world online policy adaptation in a mixed reality setup, where obstacles (vehicles and objects) exist in the virtual domain. This allows us to perform safe learning by simulating (and learning from) collisions between the learning agent(s) and other objects in virtual reality. We apply our framework to a multi-vehicle setup consisting of one real vehicle and several simulated vehicles (as shown in Figure 1). Experiments show that a significant performance improvement can be obtained after just a few runs in mixed reality, reducing the number of collisions and increasing reward collection. To the best of our knowledge, this is the first demonstration of mixed reality reinforcement learning for multi-vehicle applications.

2 Related Work

Training in simulation before transferring learned policies to the real world provides the benefits of safety and facilitated data collection. Several methods alleviate the difficulty of bridging the reality gap: (i) parameter estimation, which estimates parameters of the real system to achieve a more realistic simulation Lowrey et al. (2018); Tan et al. (2018), (ii) iterative data collection, which learns distributions of dynamics parameters in an iterative manner Christiano et al. (2016); Chebotar et al. (2019), and (iii) domain randomization, which trains over a distribution of the system dynamics for policies that are more robust against simulator discrepancies from reality Peng et al. (2018); Muratore et al. (2018); James et al. (2019); Tobin et al. (2017). Although these methods contribute significantly to closing the reality gap, the problem of guaranteeing safe policy execution still persists. Moreover, it often proves hard to accommodate all situations the robot may encounter in the real world, where unexpected conditions are the norm. To ease this challenge, researchers have proposed methods for continuous online adaptation in model-based reinforcement learning Fu et al. (2016); Gu et al. (2016). The aim of this approach is to learn an approximate model and then adapt it at test time. However, this can still lead to safety concerns when there is a mismatch between what the model is trained for and how it is used at test time. More recent approaches, such as meta-learning, strive to overcome this challenge Nagabandi et al. (2019). The commonality of all these approaches, however, is their focus on single-robot systems in isolated workspaces; guaranteeing safe online learning in shared workspaces is still an open problem.

The idea of exploiting mixed (and augmented) reality for robotics applications was originally introduced as a tool to facilitate development and prototyping. Early work experiments with virtual humanoids amongst real obstacles Stilman et al. (2005), leveraging the setup to rapidly prototype and test humanoid sub-components. Chen et al. (2009) use augmented reality to obtain a coherent display of visual feedback during interactions between a real robot and virtual objects. More recently, mixed reality has gained importance in shared human-robot environments Williams et al. (2018), where combinations of physical and virtual environments can provide safer ways to test interactions, “… by also allowing a gradual transition of the system components into shared physical environments” Hoenig et al. (2015). The introduction of mixed reality to support reinforcement learning has barely been considered. Mohammadi et al. (2019) present an approach for online continuous deep reinforcement learning for a reach-to-grasp task in a mixed reality environment. Although targets exist in the physical world, the learning procedure is carried out in simulation (using real data), before actions are transferred and executed on the actual robot.

The particularity of our work is that we focus on multi-robot settings, where inter-robot interactions contribute significantly to the learning process, but cannot be executed directly on multiple real platforms without incurring repeated damage. Our mixed reality framework helps bridge the reality gap that still stymies progress in reinforcement learning for robotics, and it is especially significant for the specific application at hand in this work.

3 Problem Statement

We consider a multi-vehicle system composed of vehicles on a multi-lane (closed) traffic circuit. Each vehicle in the system has a unique target velocity, i.e., vehicles aim to travel at potentially different speeds. The circuit is obstructed by obstacles (static vehicles). In order to maintain target speeds and avoid collisions, vehicles must learn to change lanes and execute overtaking maneuvers (we do not enforce a rule regarding which side a vehicle may overtake on). An image of our three-lane setup is shown in Figure 1, with 13 vehicles (one of which is real) and 4 virtual obstacles (in blue).

Assumptions. We are especially interested in a vehicle’s high-level decision-making process that involves lane changes and speed modulation. We, therefore, consider the availability of a low-level controller that executes reliable trajectory following, allowing the vehicle to remain in the centre of its current lane. To facilitate the low-level control task, we represent a lane by a sequence of cubic Bezier curves, continuous up to their first derivative (i.e., having no sharp corners). Vehicles are provided reliable positioning information (e.g., through a motion capture system). We also assume the availability of basic local communication, such that the desired velocity of each neighbouring vehicle is available to the high-level controller. This neighbourhood includes the six nearest vehicles within a vision radius. Our vehicles’ knowledge is thus local. We do not directly deal with noisy perception, as our sim2real challenge is the result of non-ideal vehicle models. We observe, however, that imperfect sensing would exacerbate this, and our work would prove equally or more valuable in such scenarios.

Goal. Our goal is to learn a high-level control policy that allows vehicles to drive as closely as possible to their target velocities, while avoiding collisions with other vehicles.

4 Multi-Vehicle System

Our multi-vehicle system is based on a physical vehicle, the DeepRacer robot Balaji et al. (2019), for which we also develop a virtual counterpart. This platform, its dynamics, and control model are detailed below.

4.1 The DeepRacer Robot

The DeepRacer is a 1/18th scale car with a 4MP camera, 4-wheel drive and Ackermann steering. It sports an Intel Atom processor, 4GB of memory, and 32GB of storage. It runs Ubuntu 16.04 LTS and ROS Kinetic Kame. The on-board computer and motors are powered by 13600mAh and 1100mAh batteries, respectively.

The DeepRacer was originally designed as a platform for vision-based reinforcement learning, with training carried out in simulation only. This differs from our aim, which includes online training but focuses only on non-vision-based, high-level decision-making. Therefore, we modified the platform to make it more suited to our goal. The default ROS launch script was replaced, so that the DeepRacer does not run a ROS master but relies on one running on a different device, thereby allowing more than one DeepRacer to be controlled simultaneously. We implemented a new ROS node to communicate with the DeepRacer’s servo node to set turning and throttle values. Adding this node also meant that communication to the DeepRacer could be done via UDP, reducing latency. Finally, a custom, non-reflective case was designed to allow the integration of the robot with a motion tracking system.

4.2 Vehicle Model

The DeepRacer has Ackermann steering geometry. We approximate its kinematics by the bicycle model, with motion equations:

\[ \dot{x} = v \cos\theta, \qquad \dot{y} = v \sin\theta, \qquad \dot{\theta} = \frac{v}{L} \tan\delta, \]

where $(x, y)$ is the vehicle’s position, $\delta$ is the steering angle, $v$ is the forward speed, $\theta$ is the heading, and $L$ is the vehicle’s wheel base. These equations are numerically integrated in our simulation via the Euler method to obtain the position of the DeepRacer at each time step. For the purpose of collision detection in mixed reality, the DeepRacer was modeled by a bounding box of similar size to its physical dimensions. Virtual vehicles are modeled identically.
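As an illustration of the Euler integration mentioned above, the following minimal sketch advances a kinematic bicycle model referenced at the rear axle by one time step. The function name, the wheel base of 0.16 m, and the 0.02 s step are assumptions made for this example and are not taken from the article.

```python
import math


def bicycle_step(x, y, theta, v, delta, wheel_base, dt):
    """One explicit-Euler step of a kinematic bicycle model.

    x, y: position of the rear axle; theta: heading (rad);
    v: forward speed; delta: steering angle (rad).
    """
    x_next = x + v * math.cos(theta) * dt
    y_next = y + v * math.sin(theta) * dt
    theta_next = theta + (v / wheel_base) * math.tan(delta) * dt
    return x_next, y_next, theta_next


# Example: one second of straight driving at 1 m/s, integrated at 50 Hz.
# wheel_base=0.16 is a hypothetical value, not the DeepRacer's specification.
state = (0.0, 0.0, 0.0)
for _ in range(50):
    state = bicycle_step(*state, v=1.0, delta=0.0, wheel_base=0.16, dt=0.02)
```

With zero steering the heading stays constant and the vehicle advances along a straight line; a non-zero steering angle makes the heading change at a rate proportional to the speed.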

4.3 Two-Level Driving Strategy

We segregate the vehicle’s driving strategy into two levels: a high-level controller that is responsible for (i) lane-change decisions and (ii) velocity modulation, and a low-level controller that acts upon this information to track desired lanes at desired speeds. In Section 5, our objective is to learn the high-level control policy only. We assume the existence of background traffic that is deployed with a fixed high-level driving strategy.

Low-level control. Two low-level controllers are used for lateral and longitudinal control. A PID controller onboard the DeepRacer maintains the robot’s forward velocity at the value requested by the high-level controller. The steering angle of the DeepRacer is set by a PD controller, keeping the robot on the trajectory chosen by the high-level controller. The onboard velocity controller receives a desired velocity from the high-level controller and pose information from the motion tracking system; from these, it calculates velocity and acceleration towards the desired trajectory. These are used in the PID controller, which outputs a throttle value to the motors. This allows the DeepRacer to travel at the speed requested by the high-level controller regardless of external factors such as how discharged the battery is.

The objective of the steering angle controller is to minimise the perpendicular distance, , between the robot and the desired trajectory. For small deviations, the angle of the robot’s heading with respect to the trajectory, , is proportional to and the steering angle of the robot, , is proportional to , where is the traveled distance. This permits a controller of the form , where is the curvature of the trajectory at the nearest point and and are gain and damping factors, respectively. The use of in place of causes the robot to continue to converge to the desired trajectory even for larger deviations, not affecting its behaviour for small deviations. Since the controller uses derivatives with respect to rather than directly, it behaves the same independently of how the high-level controller changes the robot’s speed.

High-level control policy. While the low-level controller is capable of maintaining a specified velocity and following the centre of a chosen lane, we use a high-level control algorithm to decide when to accelerate or decelerate and when to change lanes. This high-level policy is the learnable policy (described in Section 5.2) applied to the agent vehicle.

Background traffic. For realistic (virtual) background traffic we use a hard-coded algorithm, following the work in Hyldmar et al. (2019). This controller has both longitudinal and lateral control components. The longitudinal component is based on the Intelligent Driver Model (IDM) proposed in Treiber et al. (2000). Using this control method, a vehicle’s forward acceleration is a function of its current velocity, , its gap to the vehicle in front, and the rate at which it is approaching the vehicle in front, :


where is a function determining the desired minimum gap to the preceding vehicle and is a target velocity. This gap is defined as:


where , , , , are parameters and is a jam distance—the distance which cars in a queue will leave between each other.
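The longitudinal rule can be illustrated with a commonly used form of the Intelligent Driver Model. This is only a sketch: the parameter names and default values below are assumptions, and the variant used in the article may include terms or parameters that are not shown here.

```python
import math


def idm_acceleration(v, gap, dv, v0,
                     a_max=1.0, b=1.5, headway=1.0, s0=0.5, exponent=4.0):
    """Acceleration from a commonly used form of IDM.

    v: current velocity; gap: distance to the vehicle in front;
    dv: approach rate (own velocity minus leader velocity);
    v0: target velocity; s0: jam distance.
    All default parameter values are hypothetical.
    """
    # Desired minimum gap to the preceding vehicle.
    s_star = s0 + v * headway + v * dv / (2.0 * math.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** exponent - (s_star / gap) ** 2)
```

On an empty road the gap term vanishes and the vehicle accelerates towards its target velocity; when the gap falls below the desired minimum gap, the last term dominates and the acceleration becomes strongly negative.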

The lateral component of this high-level controller, responsible for lane changes, is based on the MOBIL controller proposed in Kesting et al. (2007). The MOBIL strategy is designed to maximise the current vehicle’s freedom to accelerate while also considering the interests of nearby vehicles, and maintaining safety. To determine the effect of a lane change on the current vehicle’s own acceleration, the MOBIL controller considers the effect that the new gap to the next vehicle would have on the acceleration chosen by its longitudinal control algorithm, IDM. The MOBIL controller similarly calculates the effect a proposed lane change would have on the chosen accelerations of nearby vehicles, assuming they were also using IDM. It then compares the expected benefit to a threshold value to determine whether or not to change lane:


where and are the effects on the new and old following vehicles, and is a politeness factor. Safety is maintained by adding the condition that the MOBIL controller does not force the new follower vehicle to decelerate at a rate greater than a safety limit, . Since we do not enforce a rule regarding which side vehicles may overtake on, the MOBIL controller considers changing lanes in both directions, and takes the better option if both surpass the threshold .
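The lane-change decision described above can be sketched as follows, assuming the usual MOBIL structure of an incentive criterion plus a safety criterion. The argument names and the default politeness, threshold, and safety values are hypothetical; each acceleration would be obtained by evaluating the longitudinal model before and after the candidate lane change.

```python
def mobil_should_change(a_self_new, a_self_old,
                        a_new_follower_after, a_new_follower_before,
                        a_old_follower_after, a_old_follower_before,
                        politeness=0.3, threshold=0.1, b_safe=2.0):
    """Return True if a candidate lane change passes both MOBIL criteria."""
    # Safety criterion: the new follower must not be forced to brake
    # harder than the safety limit.
    if a_new_follower_after < -b_safe:
        return False
    # Incentive criterion: own gain plus politeness-weighted gain of followers.
    own_gain = a_self_new - a_self_old
    others_gain = ((a_new_follower_after - a_new_follower_before)
                   + (a_old_follower_after - a_old_follower_before))
    return own_gain + politeness * others_gain > threshold
```

When both directions are allowed, the same function can be evaluated once per side and the option with the larger total benefit taken if both exceed the threshold.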

5 Learning Framework

As anticipated in Section 4.3, we wish to learn a high-level control policy letting a vehicle avoid collisions while maintaining its desired velocity. We formulate this as a sequential decision problem and solve it with an actor-critic based reinforcement learning approach. We approximate the value function and the policy function using the critic and actor components, respectively. Our implementation is largely inspired by existing literature Sutton and Barto (2011); Fujimoto et al. (2018); Mnih et al. (2016) as our goal is not to advance these techniques, but rather to evaluate their effectiveness in our mixed reality framework.

5.1 Reinforcement Learning Problem

Our goal is to safely (collision and damage-free) find an optimal high-level controller, such that each vehicle (agent) is as close as possible to its desired velocity. We formalise this high-level control problem as a reinforcement learning problem Sutton and Barto (2011) with state space, (the agent’s observations), and action space . contains both information about the agent’s own state, , as well as the state of other nearby vehicles, , such that:


In its own state, an agent observes: (i) its current velocity; (ii) its target velocity; (iii) the number of lanes to its right; (iv) the number of lanes to its left; (v) its lane-changing state (i.e., whether it is changing lane or not). The agent’s own state is thus represented as a vector of the form:


In the state of other vehicles, the agent observes up to six nearby vehicles (defining its neighbourhood, as introduced in Section 3). If there are fewer than six vehicles within the vision radius, then this vector is padded up to six using “null” vehicles. For each nearby vehicle, the agent receives its relative position in polar coordinates, its relative lane-wise velocity, the number of lanes separating it from the agent, and its lane-changing state. The state of nearby vehicles is thus represented as six vectors of the form:


The action space, , contains pairs of tuples from a (discrete) acceleration space, , and a (discrete) lane changing space, , such that:


Set consists of “constant acceleration”, “maintaining the current speed”, and a “constant deceleration”. Set consists of “changing lane left”, “right”, or “not at all”. The reinforcement learning reward function is designed to prevent the agent from deviating unnecessarily from its desired speed while avoiding collisions with other cars. This function is expressed as:


where and are proximity penalty terms defined as:


where is the length of a vehicle, is the distance to the closest (ahead or behind) vehicle in the same lane, is the distance between lanes, is the distance to the closest vehicle (in any lane), and , , and are parameters (see also Figure 2). These two proximity penalties exist to deter the agent from coming too close to other vehicles. While this specific formalization would admit a solution through discrete action-space methods, such as Double Q-learning Hasselt (2010), in the following, we present a more general approach based on the actor critic method. As a consequence, our approach can generalise to continuous action spaces as well.
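The discrete action space described above, pairing one acceleration choice with one lane-change choice, can be enumerated as in the following sketch. The string labels are placeholders chosen for this example.

```python
from itertools import product

# Hypothetical labels for the two discrete sets described in the text.
ACCELERATIONS = ("accelerate", "maintain", "decelerate")
LANE_CHANGES = ("left", "none", "right")

# Every action is one (acceleration, lane change) pair.
ACTIONS = list(product(ACCELERATIONS, LANE_CHANGES))
```

With three options in each set, the joint space contains nine actions, which is small enough for discrete methods but is handled here by two separate policy heads.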

Figure 2: Schematics presenting the main components in the observations vectors and for a vehicle tackling the reinforcement learning problem described in Subsection 5.1.
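The fixed-size neighbourhood observation can be illustrated with a short sketch that keeps the six nearest vehicles within the vision radius and pads with "null" entries. The tuple layout and the all-zero encoding of a null vehicle are assumptions made for this example; the article does not specify how null vehicles are encoded.

```python
# Each neighbour: (distance, angle, relative_velocity, lane_offset, changing_lane)
NULL_VEHICLE = (0.0, 0.0, 0.0, 0, 0)  # hypothetical encoding of a "null" vehicle


def pad_neighbours(neighbours, vision_radius, max_neighbours=6):
    """Keep the nearest vehicles within the radius and pad to a fixed length."""
    visible = sorted(
        (n for n in neighbours if n[0] <= vision_radius),
        key=lambda n: n[0],
    )[:max_neighbours]
    return visible + [NULL_VEHICLE] * (max_neighbours - len(visible))


observed = pad_neighbours(
    [(2.0, 0.1, -0.2, 1, 0), (0.5, -0.3, 0.1, 0, 1), (9.0, 0.0, 0.0, -1, 0)],
    vision_radius=3.0,
)
```

A fixed length keeps the network input shape constant, and max-pooling over per-vehicle features makes the result independent of the order in which neighbours are listed.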

5.2 Neural Network Architecture

We approximate value and policy function using a deep neural network containing one actor and two critics (Figure 3). From observation vectors ’s, the salient features of nearby cars are extracted using a sequence of four linear layers of hidden size with output size . These features are then max-pooled across nearby vehicles to get a single size vector of features pertaining to observed vehicles. This vector is then concatenated with the agent’s own observations to produce the input of the actor and critic networks.

The actor network consists of a sequence of three linear layers of hidden and output size followed by two heads, each consisting of a final layer of hidden size and an output size of 3, followed by soft-max activation. These two heads correspond to the two discrete spaces and , i.e., lane changes and acceleration, respectively. We elect to use two critic networks which are similarly composed by a sequence of four linear layers of hidden size , though this time each terminating in a one-dimensional evaluation of the value function. As proposed by Fujimoto et al. Fujimoto et al. (2018), we consider the less extreme of the two evaluations during training to try to reduce the impact of outlier estimations of the value function when updating in the early stages.

Figure 3: Schematics of the neural network mapping observations , to (i) actions , and (ii) value function . We detail this architecture in Subsection 5.2 and its training in Subsection 5.3.

5.3 Distributed Training

We develop our reinforcement learning method as an adaptation of Asynchronous Advantage Actor Critic (A3C) Mnih et al. (2016), by maintaining an approximation for the value function of a state , , and for the policy function using explicitly calculated returns over short trajectories. Returns from actions were calculated as


where for trajectory length and is the mean of the two value functions. The approximation of the value function was trained to minimise where is the Advantage function, .
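The explicitly calculated returns can be sketched as standard bootstrapped discounted returns over a short trajectory, with the bootstrap value taken as the mean of the two critics. The discount factor, function names, and example numbers are assumptions for illustration.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns for every step of a short trajectory.

    bootstrap_value estimates the value of the state reached after the
    last reward, e.g. the mean of the two critic outputs.
    """
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage of each step: computed return minus estimated value."""
    return [ret - val for ret, val in zip(returns, values)]


# Example with hypothetical critic outputs for the final state.
critic_a, critic_b = 0.4, 0.6
example = n_step_returns([1.0, 1.0, 1.0], 0.5 * (critic_a + critic_b), gamma=0.5)
```

Working backwards through the trajectory computes every return in a single pass, which is what makes longer trajectories cheap to process.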

The policy function is updated using the PPO-Clip Schulman et al. (2017) loss function:


where are the network parameters, subscript denotes the evaluation of the network using parameters , is the clamp function and is a constant parameter:


As we do not use mini-batching, the target policy that we compare against is not one computed before a current set of mini-batches (as in Schulman et al. (2017)), but rather duplicated versions of part of the network (the shaded boxes in Figure 3) with parameters smoothed exponentially in time, , updated to follow the latest parameters, , according to Polyak-Ruppert averaging:


where is a parameter set during training. We also add to the loss function a term proportional to the negation of the policy entropy, in order to discourage premature convergence. We weight the three contributions to the total network loss with coefficients , and corresponding to the PPO loss, the critic loss and the entropy term, respectively.
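The two ingredients above, the clipped policy loss and the exponentially smoothed target parameters, can be sketched for a single sample in plain Python. The values of the clipping constant and smoothing parameter are hypothetical, and the smoothing convention (weight on the old target versus the latest parameters) is an assumption.

```python
def clamp(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(high, x))


def ppo_clip_loss(prob_new, prob_target, advantage, epsilon=0.2):
    """PPO-Clip loss for one sampled action (negated clipped objective)."""
    ratio = prob_new / prob_target
    unclipped = ratio * advantage
    clipped = clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return -min(unclipped, clipped)


def polyak_update(target_params, latest_params, tau=0.995):
    """Move the smoothed target parameters a small step towards the latest."""
    return [tau * t + (1.0 - tau) * p
            for t, p in zip(target_params, latest_params)]
```

Here `prob_target` plays the role of the smoothed target policy, so the clipping limits how far the latest policy can move away from its own recent average in one update.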

To improve speed and stability of learning, we use multiple parallel actors when pre-training a policy in simulation only. We parallelise this process on two levels. First, we use asynchronous updates, as in Mnih et al. (2016), to allow multiple threads acting in the problem environment to send gradients to a separate thread updating the policy parameters, and then returning the new parameters (as shown in Figure 4). In addition, each actor thread simultaneously acts in multiple environments Clemente et al. (2017) in order to take advantage of vectorisation (Figure 4). Combined, these two parallelisation strategies substantially improved (10x speed-up) training speed in purely virtual environments.

Figure 4: Schematics of the distributed training approach presented in Subsection 5.3 for the network in Figure 3.

6 Mixed Reality Setup

Figure 5: Overall schematics of the proposed multi-vehicle, mixed reality reinforcement learning approach. Reinforcement learning of high-level driving policies is handled through PyTorch. Both virtual and real DeepRacer vehicles exist within a C++ simulation that manages the physics of the virtual cars and emulates collisions in mixed reality. The physics of real-life DeepRacers is captured through OptiTrack’s motion capture system and fed to the simulation.

Our mixed reality experimental setup seamlessly integrates multiple real-world and virtual components, as illustrated in Figure 5. The learning of high-level policies by DeepRacer agents, using the framework presented in Section 5, is performed during the concurrent execution of all these modules, i.e., in mixed reality.

6.1 Simulation Setup

In our setup, a C++ simulation provides the environment in which reinforcement learning agents can act, observe, and learn. As such, it also contains the high-level IDM/MOBIL controllers of the background traffic vehicles. We implemented the reinforcement learning approach described in the previous section using Python and the PyTorch library. An interface between the C++ simulation and the Python interpreter was created using the Boost.Python C++ library. This interface exposes the ability to create environments as either mixed reality or purely virtual. The simulation provides observations and reward signals to the Python implementation, according to the state of the environment. Then, it updates its state to reflect the agents’ actions, as received from the Python interpreter.

The simulated environment also contains (i) the specifications of the Bezier curves for all lanes in the track, (ii) the states of the vehicles controlled by either reinforcement learning agents or the IDM/MOBIL algorithms, and (iii) static obstacles. These obstacles are placed far enough apart to not fully block the road, and so that there is at least one in each lane of the circuit. Their exact positions are otherwise randomised. The starting locations of the background traffic and agent vehicle are likewise randomised, along with the desired velocities of all vehicles. For each of the vehicles in the environment, collision detection is accomplished using bounding boxes of the same shape and size as a DeepRacer.

The simulation was written in C++ in order to provide higher performance, especially when pre-training a network in a purely simulated environment. To the same end, the simulation was designed to be capable of running several simultaneous virtual environments (Figure 4) in order to allow the reinforcement learning algorithm to submit multiple parallel actions and receive multiple parallel observations—thus making a more efficient use of our learning computing hardware.

6.2 Real-World Setup

As shown in Figure 5, the physical DeepRacer must interface with the simulation while training in mixed reality. The position and orientation of a real-life DeepRacer in the environment are tracked using six OptiTrack Prime 17W cameras and the Motive motion capture software. When multiple real DeepRacers are used, we distinguish them by using unique layouts of reflective markers. The position of each DeepRacer is broadcast by Motive, received by a VRPN client, and published to a ROS topic, making the data available to all nodes in our ROS environment. In order to reduce network load and increase reliability, the frequency at which poses are transmitted was restricted to 50Hz, since this is also the update rate of the physics engine in the simulation. From the perspective of the tracking system, the centre of a vehicle was defined as the centre of its rear axle. This choice preserves consistency with the simulation’s definition of the centre of a car, itself chosen for the sake of simplicity when using an Ackermann steering model. The vehicles drive on a closed loop track made up of individual trajectories that contain no intersections and are continuous.

6.3 Mixed Reality

Mixed reality plays a two-fold role in our work: (i) it fosters an agent’s learning, allowing simultaneous real and simulated training, and (ii) it provides us with better evaluation tools, through the ability to visualise the virtual and real agents’ interactions.

Learning. In the mixed reality environment, the simulation receives live updates on the pose of the DeepRacer through the motion capture system and updates its representation of the environment state accordingly. The simulation sends commands setting the steering angle and velocity of the DeepRacer according to the actions of the high-level controller and the lateral component of the low-level controller.

The simulation is able to detect collisions between the DeepRacer and the virtual vehicles through a collision box identical to that of a virtual vehicle sharing the same pose as the real agent. From the point of view of the high-level controllers, including the reinforcement learning agent, the situation is no different from a purely virtual scenario, with the exception of the world’s physics affecting the real DeepRacer. Parallelisation of environments is unavailable when training in a mixed reality environment, but since our implementation of A3C uses trajectories of experience with explicitly calculated returns, we substantially increase their length and generate only a small number of trajectories for each optimisation step. Each of these trajectories is created using a different random initialisation of the environment in order to provide a variety of experiences to the reinforcement learning algorithm at each optimisation step.

Visualisation. To visualise the interaction between the virtual cars and the DeepRacer during our tests, we set up a fixed camera to record the full-length experiments. From the simulation environment, we collect pose data for both the virtual and real cars and compute whether any vehicle is currently experiencing collisions. These data are processed through a Python script importing Blender’s API. At each timestep, we insert an animation keyframe of a vehicle model in the pose specified by the previously recorded data and a colour determined by whether the vehicle is (i) a fixed obstacle (blue), (ii) a moving vehicle (beige), or (iii) a vehicle currently in collision (red). In a separate scene, the DeepRacer alias is also animated using the same procedure. These two scenes are then composited together using Z-buffer values so that, when the DeepRacer is in front of a virtual vehicle, the area obscured by the DeepRacer is transparent. The output can then be overlaid on top of the test footage to create the effect that the real and virtual vehicles are interacting.

7 Experiments

To demonstrate the effectiveness of our mixed reality setup—to train agents capable of collision-free driving—we performed experiments on a () -lane track (see Figure 1) with lanes wide. The track itself fits a area, with a lap length of roughly metres, i.e., 50 times the size of a DeepRacer (). Our experiments include (1 real, 12 virtual) vehicles and virtual obstacles. The low-level control parameters and (see Subsection 4.3) were set to and , respectively. For the learning parameters (see Section 5), we selected , , , , , , , , , , , and . For the actor and critics, we used learning rates of 2e-4 and 2e-3. Our results are summarised in Figures 6, 7, and 8 as well as by additional footage available on the Prorok Lab YouTube channel.1

Figure 6: Evolution during one training instantiation of (i) the number of collisions per minute (top plot, lower is better) and (ii) the average reward collected by the training agent, over a sliding window of 8’000 frames (bottom plot, higher is better).
Figure 7: Empirical distributions at test time of (i) the number of collisions per scenario (top plot, left is best) and (ii) the total collected reward per scenario (bottom plot, right is best) before (blue) and after (red) training in mixed reality.
Figure 8: Plots of track positions (y axis) against time (x axis) of four static obstacles (horizontal lines), twelve virtual vehicles, and one real-life DeepRacer (thicker line). The colour map captures the velocities of all cars. The red dots represent collisions incurred by the DeepRacer. The top and bottom plots compare behaviours recorded before and after mixed reality training.

First, we assess the soundness of our approach by evaluating how well training fares in terms of incurred collisions and collected reward. This is shown in Figure 6, where the two plots describe the evolution over time in a given scenario (measured in frames, i.e., the steps in which an agent receives one set of observations and takes one action) of: (i) the number of collisions per minute (top plot of Figure 6); and (ii) the average collected reward (bottom plot of Figure 6). Collisions per minute are computed over an 8’000-frame sliding window. Successful training is reflected in a general downward slope of the top plot (fewer collisions) and, conversely, a general upward slope of the bottom plot (greater reward).
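The sliding-window metric can be sketched as follows. This is a minimal illustration and not our evaluation code: the per-frame collision flags, the frame rate `frame_rate_hz`, and the handling of the first frames (where fewer than a full window is available) are assumptions, and `collisions_per_minute` is a hypothetical name.

```python
from collections import deque


def collisions_per_minute(collision_flags, window=8000, frame_rate_hz=10.0):
    """Collision rate over a trailing window of frames.

    collision_flags: iterable of 0/1 values, 1 when a new collision
        starts in that frame (assumed input format).
    window: window length in frames (8'000 in the text).
    frame_rate_hz: frames per second (assumed value, not from the text).
    """
    recent = deque()
    count = 0
    rates = []
    for flag in collision_flags:
        recent.append(flag)
        count += flag
        if len(recent) > window:
            count -= recent.popleft()  # drop the frame leaving the window
        # Frames in the window -> minutes: len / frame_rate_hz / 60.
        rates.append(count * 60.0 * frame_rate_hz / len(recent))
    return rates
```

The average reward in the bottom plot of Figure 6 could be computed with the same trailing window by averaging per-frame rewards instead of counting collisions.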

Second, we want to quantify the effectiveness of mixed reality training at test time. This is shown in Figure 7. The top and bottom plots refer, once more, to collisions and collected reward, respectively. Each one of the two plots compares two density distributions of these performance metrics: one before (in blue) and one after (in red) training in mixed reality. As our simulation environment is partially randomised, the word scenario refers to all the data gathered from a single instantiation. On the top plot, we can observe a left-shift (from blue to red, i.e., before and after) of the collisions’ density distribution, that is, fewer collisions occurring after mixed reality training. On the bottom plot, conversely, a right-shift reflects the improved ability of the agent, trained in mixed reality, to collect reward.

Finally, Figure 8 presents a qualitative comparison of how a DeepRacer agent’s behaviour changes before (top) and after (bottom) mixed reality training. The x axis in Figure 8 shows the passing of time (in seconds), while the y axis captures the position of a vehicle along the track (in metres). Four blue horizontal lines represent obstacles (i.e., static virtual vehicles) on the track. All other (13) lines represent moving vehicles, the thicker one being the DeepRacer agent. A colour map is used to encode the speed (in metres per second) of each vehicle. Red dots indicate collisions between the real-life DeepRacer and either a virtual obstacle or a virtual vehicle. Indeed, collisions are rarer after mixed reality training. Footage of the mixed reality experiments in Figure 8 is also available (link).

8 Discussion

The training stability and effectiveness of the proposed approach are reported in Figure 6. In the top plot, one can observe an early improvement, i.e., a reduction, in the number of collisions during training. This is followed by two periods of worsening performance (around frames 20’000 and 30’000) and then a more consistent downward trend (from frame 35’000 on). The early improvements and the subsequent deterioration (until frame 25’000) may be explained by the choice of hyper-parameters: our learning rates aimed at aggressive policy changes. That is, an agent would have been, at first, too eager to learn how to accelerate excessively and collect more reward, resulting in more early collisions. The bottom plot, presenting the collection of reward during training, shows a distinct mirroring (x axis symmetry) of the top plot. This is consistent with what we would expect; that is, it serves as a sanity check confirming that the vehicle was led to fewer collisions by seeking higher reward.

Figure 7 demonstrates the performance of our methodology at test time. In the top plot, we observe that the density distribution of collisions is significantly shifted to the left after mixed reality training—indicating that our learning approach can effectively reduce collisions. The after-training distribution is also narrower, suggesting reduced variance and uncertainty. The bottom plot presents the slightly more trivial result that reinforcement learning training does, indeed, lead to improved reward collection. Nonetheless, at test time, this is evidence of the ability of our approach to generalise.

The qualitative results in Figure 8 demonstrate how the learning agent’s behaviour changes before and after mixed reality training. In the top plot, a DeepRacer that has not yet been trained in mixed reality collides remarkably often, with nearly every obstacle. This collision-prone behaviour may be due to the reduced responsiveness of the real DeepRacer hardware compared to the simulated vehicle, which makes it harder for the agent to stop in time or avoid other vehicles. After training in mixed reality, collisions are almost completely eliminated. In the bottom plot of Figure 8, we can also observe virtual agents (IDM/MOBIL background traffic) either (i) overtaking the learning agent in the longer gaps between obstacles or (ii) piling up behind it in more constrained regions of the road, e.g., when the agent is cautiously approaching two nearby obstacles. Interestingly, traffic (e.g., between 50 s and 80 s in the bottom plot of Figure 8) is likely exacerbated by the fact that IDM/MOBIL agents are willing to give the agent room to accelerate instead of overtaking it, yet the agent proceeds at a reduced speed. While the learning agent is less dangerous after training, its unexpected prudence can mislead the other driving agents, which are not capable of learning, and reduce throughput. Although the slower speed of the real DeepRacer might appear to be a sub-optimal outcome, our aim was not to outperform the IDM/MOBIL vehicles. In fact, these can achieve a higher safe speed because their virtual models are only simulated and, thus, more responsive and easier to control than the actual DeepRacers.

Finally, it is important to observe that the simulation performance of the agents we transferred into our framework was still characterised by relatively high entropy. This choice was made to minimise the risk of overfitting to the simulation environment and to let agents adapt more quickly to the mixed reality setup. While we cannot say whether additional simulation-only training would have benefited or hurt the agents transferring to mixed reality, our results support the idea that this approach led to quick and effective real-world adaptation. In future developments of our framework, we will investigate continuous action spaces and more sample-efficient off-policy reinforcement learning methods, e.g., Haarnoja et al. (2018), which might allow for better performance without the need for a substantial increase in data gathering.

9 Conclusions

This work presented a mixed reality framework for safe and efficient reinforcement learning of driving policies in multi-vehicle systems. Our learning algorithm was trained using a distributed mechanism specifically tailored to suit the needs of our mixed reality setup. We demonstrated successful online policy adaptation in an experimental setup involving one real vehicle and sixteen virtual vehicles. Our results showed that mixed reality learning is able to provide significant performance improvements, leading to a reduction of collisions in the learned policies.

The particularity of our system is that it focuses on multi-robot settings, where interactions with other dynamic objects contribute significantly to the learning process but cannot be executed directly on multiple real platforms without incurring repeated damage. The proposed framework is the first of its kind: beyond providing specific benefits to the application at hand, it also helps bridge the reality gap that still stymies progress in reinforcement learning for robotics at large. Future work will consider multiple learning agents using on-board sensing (e.g., vision), and how our mixed reality setup enables their gradual introduction into mutually shared spaces.


This work was supported by the Engineering and Physical Sciences Research Council (grant EP/S015493/1). Their support is gratefully acknowledged. The DeepRacer robots used in this work were a gift to Amanda Prorok from AWS. Their support is gratefully acknowledged. This article solely reflects the opinions and conclusions of its authors and not AWS or any other Amazon entity.




  1. Bharathan Balaji, Sunil Mallya, Sahika Genc, Saurabh Gupta, Leo Dirac, Vineet Khare, Gourav Roy, Tao Sun, Yunzhe Tao, Brian Townsend, et al. 2019. DeepRacer: Educational Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement Learning. arXiv preprint arXiv:1911.01562 (2019).
  2. CBS. 2018. CBS Insights Research Brief. (Accessed August 15, 2018).
  3. Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. 2019. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, 8973–8979.
  4. Ian Yen-Hung Chen, Bruce MacDonald, and Burkhard Wunsche. 2009. Mixed reality simulation for mobile robots. In 2009 IEEE International Conference on Robotics and Automation. IEEE, 232–237.
  5. Paul Christiano, Zain Shah, Igor Mordatch, Jonas Schneider, Trevor Blackwell, Joshua Tobin, Pieter Abbeel, and Wojciech Zaremba. 2016. Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518 (2016).
  6. Alfredo V. Clemente, Humberto Nicolás Castejón Martínez, and Arjun Chandra. 2017. Efficient Parallel Methods for Deep Reinforcement Learning. arXiv preprint arXiv:1705.04862 (2017).
  7. Falko Dressler, Hannes Hartenstein, Onur Altintas, and Ozan Tonguz. 2014. Inter-vehicle communication: Quo vadis. IEEE Communications Magazine 52, 6 (2014), 170–177.
  8. Michel Ferreira, Ricardo Fernandes, Hugo Conceição, Wantanee Viriyasitavat, and Ozan K Tonguz. 2010. Self-organized traffic control. In Proceedings of the seventh ACM international workshop on VehiculAr InterNETworking. ACM, 85–90.
  9. Justin Fu, Sergey Levine, and Pieter Abbeel. 2016. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4019–4026.
  10. Scott Fujimoto, Herke van Hoof, and David Meger. 2018. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Stockholmsmässan, Stockholm Sweden, 1587–1596.
  11. Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. 2016. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning. 2829–2838.
  12. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv preprint arXiv:1801.01290 (2018).
  13. Hado van Hasselt. 2010. Double Q-learning. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2 (NIPS’10). Curran Associates Inc., USA, 2613–2621.
  14. Wolfgang Hoenig, Christina Milanes, Lisa Scaria, Thai Phan, Mark Bolas, and Nora Ayanian. 2015. Mixed reality for robotics. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5382–5387.
  15. Nicholas Hyldmar, Yijun He, and Amanda Prorok. 2019. A Fleet of Miniature Cars for Experiments in Cooperative Driving. IEEE International Conference Robotics and Automation (ICRA) (2019).
  16. Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian Ibarz, Sergey Levine, Raia Hadsell, and Konstantinos Bousmalis. 2019. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 12627–12637.
  17. Arne Kesting, Martin Treiber, and Dirk Helbing. 2007. General Lane-Changing Model MOBIL for Car-Following Models. Transportation Research Record 1999, 1 (2007), 86–94.
  18. Arbaaz Khan, Chi Zhang, Shuo Li, Jiayue Wu, Brent Schlotfeldt, Sarah Y Tang, Alejandro Ribeiro, Osbert Bastani, and Vijay Kumar. 2019. Learning safe unlabeled multi-robot planning with motion constraints. arXiv preprint arXiv:1907.05300 (2019).
  19. Markus Kuderer, Shilpa Gulati, and Wolfram Burgard. 2015. Learning driving styles for autonomous vehicles from demonstration. In 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2641–2646.
  20. Kendall Lowrey, Svetoslav Kolev, Jeremy Dao, Aravind Rajeswaran, and Emanuel Todorov. 2018. Reinforcement learning for non-prehensile manipulation: Transfer from simulation to physical system. In 2018 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR). IEEE, 35–42.
  21. Orazio Miglino, Henrik Hautop Lund, and Stefano Nolfi. 1995. Evolving mobile robots in simulated and real environments. Artificial life 2, 4 (1995), 417–434.
  22. Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. arXiv preprint arXiv:1602.01783 (2016).
  23. Hadi Beik Mohammadi, Mohammad Ali Zamani, Matthias Kerzel, and Stefan Wermter. 2019. Mixed-Reality Deep Reinforcement Learning for a Reach-to-grasp Task. In International Conference on Artificial Neural Networks. Springer, 611–623.
  24. Artem Molchanov, Tao Chen, Wolfgang Hönig, James A. Preiss, Nora Ayanian, and Gaurav S. Sukhatme. 2019. Sim-to-(Multi)-Real: Transfer of Low-Level Robust Control Policies to Multiple Quadrotors. arXiv preprint arXiv:1903.04628 (2019).
  25. Fabio Muratore, Felix Treede, Michael Gienger, and Jan Peters. 2018. Domain randomization for simulation-based policy optimization with transferability assessment. In Conference on Robot Learning. 700–713.
  26. Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. 2019. Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning. arXiv preprint arXiv:1803.11347 (2019).
  27. Xinlei Pan, Yurong You, Ziyan Wang, and Cewu Lu. 2017. Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952 (2017).
  28. Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. 2018. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1–8.
  29. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347 (2017).
  30. Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. 2018. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics. Springer, 621–635.
  31. Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. 2016. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295 (2016).
  32. Michael Stilman, Philipp Michel, Joel Chestnutt, Koichi Nishiwaki, Satoshi Kagami, and James Kuffner. 2005. Augmented reality for robot development and experimentation. Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-05-55 2, 3 (2005).
  33. Richard S Sutton and Andrew G Barto. 2011. Reinforcement learning: An introduction. (2011).
  34. Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. 2018. Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332 (2018).
  35. J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. 2017. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 23–30.
  36. Martin Treiber, Ansgar Hennecke, and Dirk Helbing. 2000. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 62 (Aug 2000), 1805–1824. Issue 2.
  37. Tom Williams, Daniel Szafir, Tathagata Chakraborti, and Heni Ben Amor. 2018. Virtual, augmented, and mixed reality for human-robot interaction. In Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction. ACM, 403–404.