Fully Distributed Multi-Robot Collision Avoidance via Deep Reinforcement Learning for Safe and Efficient Navigation in Complex Scenarios
Developing a safe and efficient collision avoidance policy for multiple robots is challenging in the decentralized scenarios where each robot generates its paths with the limited observation of other robots’ states and intentions. Prior distributed multi-robot collision avoidance systems often require frequent inter-robot communication or agent-level features to plan a local collision-free action, which is not robust and computationally prohibitive. In addition, the performance of these methods is not comparable to their centralized counterparts in practice.
In this paper, we present a decentralized sensor-level collision avoidance policy for multi-robot systems, which shows promising results in practical applications. In particular, our policy directly maps raw sensor measurements to an agent’s steering commands in terms of the movement velocity. As a first step toward reducing the performance gap between decentralized and centralized methods, we present a multi-scenario multi-stage training framework to learn an optimal policy. The policy is trained over a large number of robots in rich, complex environments simultaneously using a policy gradient based reinforcement learning algorithm. The learning algorithm is also integrated into a hybrid control framework to further improve the policy’s robustness and effectiveness.
We validate the learned sensor-level collision avoidance policy in a variety of simulated and real-world scenarios with thorough performance evaluations for large-scale multi-robot systems. The generalization of the learned policy is verified in a set of unseen scenarios including the navigation of a group of heterogeneous robots and a large-scale scenario with robots. Although the policy is trained using simulation data only, we have successfully deployed it on physical robots with shapes and dynamics characteristics that are different from the simulated agents, in order to demonstrate the controller’s robustness against the sim-to-real modeling error. Finally, we show that the collision-avoidance policy learned from multi-robot navigation tasks provides an excellent solution to the safe and effective autonomous navigation for a single robot working in a dense real human crowd. Our learned policy enables a robot to make effective progress in a crowd without getting stuck. More importantly, the policy has been successfully deployed on different types of physical robot platforms without tedious parameter tuning. Videos are available at https://sites.google.com/view/hybridmrca.
Fully Distributed Multi-Robot Collision Avoidance via Deep Reinforcement Learning for Safe and Efficient Navigation in Complex Scenarios
Journal Title XX(X):Fully Distributed Multi-Robot Collision Avoidance via Deep Reinforcement Learning for Safe and Efficient Navigation in Complex Scenarios–References ©The Author(s) 2018 Reprints and permission: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/ToBeAssigned www.sagepub.com/
Tingxiang Fan, Pinxin Long, Wenxi Liu and Jia Pan
Distributed collision avoidance, multi-robot systems, multi-scenario multi-stage reinforcement learning, hybrid control
Department of Mechanical and Biomedical Engineering, City University of Hong Kong, China
Metoak Technology (Beijing) CO., LTD, China
College of Mathematics and Computer Science, Fuzhou University, China00footnotetext: Corresponding author:
Jia Pan, City University of Hong Kong00footnotetext: Email: firstname.lastname@example.org
Multi-robot navigation has recently gained much interest in robotics and artificial intelligence and has many applications including multi-robot search and rescue, navigation through human crowds, and autonomous warehouse. One of the major challenges for multi-robot navigation is to develop a safe and robust collision avoidance policy such that each robot can navigate from its starting position to its desired goal safely and efficiently. More importantly, a sophisticated collision-avoidance skill is also a prerequisite for robots to accomplish many complex tasks, including multi-robot collaboration, tracking, and navigation through a dense crowd.
Some of prior works, known as centralized methods, such as (?, ?, ?, ?), assume that the actions of all agents are determined by a central server aware of comprehensive knowledge about all agents’ intents (e.g., initial states and goals) and their workspace (e.g., a 2D grid map). These centralized approaches generate collision avoidance actions by planning optimal paths for all robots simultaneously. They can guarantee safety, completeness, and approximate optimality but are difficult to scale to large systems with many robots due to several main limitations. First, the centralized control and scheduling become computationally prohibitive as the number of robots increases. Second, these methods require a reliable synchronized communication between the central server and all robots, which is either uneconomical or not feasible for large-scale systems. Third, the centralized system is vulnerable to failures or disturbances of the central server, communication between robots, or motors and sensors mounted on any individual robot. Furthermore, these centralized methods are inapplicable when multiple robots are deployed in an unknown and unstructured environment, e.g., in a warehouse with human co-workers.
Compared with centralized methods, some existing works propose agent-level decentralized collision avoidance policies, where each agent independently makes its decision by taking into account the observable states (e.g., shapes, velocities, and positions) of other agents. Most agent-level policies are based on the velocity obstacle (VO) framework (?, ?, ?, ?, ?, ?), which infers local collision-free motion efficiently for multiple agents in cluttered workspaces. However, these methods have several severe limitations which prevent them from being widely used in real scenarios. First, the simulation-based works (?, ?, ?) assume that each agent has perfect sensing about the surrounding environment, but this assumption does not hold in real-world scenarios due to omnipresent sensing uncertainty. To moderate the limitation of perfect sensing, some previous approaches use a global positioning system to track the positions and velocities of all robots (?, ?, ?). However, such external infrastructure is too expensive to scale to large multi-robot systems. Some other methods design inter-agent communication protocols for sharing position and velocity information among nearby agents (?, ?, ?, ?). However, the communication systems introduce additional difficulties such as delay or blocking of communication signals due to obstacle occlusions. The second limitation is that the VO-based policies are controlled by multiple tunable parameters that are sensitive to scenario settings and thus must be carefully set to accomplish satisfactory multi-robot motion. Finally, the timing performance of previous decentralized methods for navigation is significantly inferior to their centralized counterparts in practical applications.
Inspired by VO-based approaches, (?, ?) trained an agent-level collision avoidance policy using deep reinforcement learning (DRL), which learns a two-agent value function that explicitly maps an agent’s own state and its neighbors’ states to a collision-free action, whereas it still demands the perfect sensing. In their later work (?, ?), multiple sensors are deployed to perform tasks of segmentation, recognition, and tracking in order to estimate the states of nearby agents and moving obstacles. However, this complex pipeline not only requires expensive online computation but also makes the whole system less robust to the perception uncertainty.
In this paper, we focus on sensor-level decentralized collision avoidance policies that directly map the raw sensor data to desired collision-free steering commands. Compared with agent-level policies, our approach requires neither perfect sensing for neighboring agents and obstacles nor tedious offline parameter-tuning for adapting to different scenarios. We have successfully deployed the learned policy to a large number of heterogeneous robots to achieve high-quality, large-scale multi-robot collaboration in both simulation and physical settings. Our method also endows a robot with high mobility to navigate safely and efficiently through a dense crowd with moving pedestrians.
Sensor-level collision avoidance policies are often modeled using deep neural networks (DNNs) and trained using supervised learning on a large dataset (?, ?, ?). However, there are several limitations to learning policies under supervision. First, it demands a large amount of training data that should cover different interaction situations for multiple robots. Second, the expert trajectories in datasets are not guaranteed to be optimal in interaction scenarios, which makes training difficult to converge to a robust solution. Third, it is difficult to hand-design a proper loss function for training robust collision avoidance policies. To overcome the above drawbacks of supervised learning, we propose a multi-scenario multi-stage deep reinforcement learning framework to learn the optimal collision avoidance policy using the policy gradient method. The policy is trained in simulation for a large-scale multi-robot system in a set of complex scenarios. The learned policy outperforms the existing agent- and sensor-level approaches in term of navigation speed and safety. It can be deployed to physical robots directly and smoothly without tedious parameter tuning.
Our collision avoidance policy is trained in simulation rather in the real world because it is difficult to generate training data safely and effectively for physical robots while the performance of robot learning heavily depends on the quality and quantity of the collected training data. In particular, to learn a sophisticated collision avoidance policy, robots need to undergo thousands of millions of collisions with the static environments and the moving pedestrians in order to fully explore different reaction and interaction behaviors. Such data collection and policy update procedure would be expensive, time-consuming and dangerous if being implemented in the real world. As a result, we follow the recent trend of sim-to-real methods (?, ?, ?, ?, ?), by first training a control policy in virtual simulation and then deploying the learned model to the real robot directly. However, it is non-trivial to transfer learned policies from simulation to the real world and may suffer a significant loss in performance, because the real world is messy and contains an infinite number of novel scenarios that may not be encountered during training. Fortunately, our policy uses an input space that is robust to the sim-to-real gap, and therefore we are able to use simulation to greatly accelerate our robot learning process.
Another challenge for learning-based collision avoidance is the lack of guarantee for the safety and completeness of the learned policy, which is considered as one of the main concerns when using deep learning in robotics. Inspired by the hybrid control framework (?, ?), we alleviate this difficulty by combining the learning-based policy with traditional control. In particular, the traditional control takes charge of relatively simple scenarios or emergent situations, while the learned policies will handle more complex scenarios by generating more sophisticated behaviors bounded in a limited action space. In this way, we can further improve the performance of the collision avoidance policy in terms of safety without sacrificing efficiency.
Contributions: Our main contributions in this paper are:
We develop a fully decentralized multi-robot collision avoidance framework, where each robot makes navigation decisions independently without any communication with others. The navigation policy has the ranging data collected from a robot’s on-board sensors as the input and outputs a control velocity. The policy is optimized offline via a novel multi-scenario multi-stage reinforcement learning algorithm and trained with a robust policy gradient method. The learned policy can be executed online on each robot using a relatively cheap on-board computer, e.g., an NUC or a Movidius neural compute stick.
We further combine the deep-learned policy with traditional control approaches to achieve a robust hybrid navigation policy. The hybrid policy is able to find time-efficient and collision-free paths for a large-scale nonholonomic robot system and it can be safely generalized to unseen simulated and real-world scenarios. Its performance is also much better than previous decentralized methods, and can serve as a first step toward reducing the gap between centralized and decentralized navigation policies.
Our decentralized policy is trained in simulation but does not suffer from the sim-to-real gap and can be directly deployed to physical robots without tedious fine-tuning.
We deploy the policy in a warehouse scenario to control multiple robots in a decentralized manner. Compared to previous solutions to the multi-robot warehouse, our system requires neither a pre-constructured bar-code infrastructure nor a map. It also alleviates the demands for high-band and low-latency synchronized communication that is considered as a main bottleneck when scaling a multi-robot system. In particular, we use an Ultra-Wide-Band (UWB) system to provide a rough guidance to the robots about their goals, and all robots navigate toward their goals independently by using our hybrid policy to resolve the collisions during the navigation. Hence, our system can be easily extend to hundreds or more robots.
We deploy the policy directly on a physical robot to achieve smooth and effective navigation through a dense crowd with moving pedestrians. The system’s excellent mobility outperforms state-of-the-art solutions. Moreover, the policy can be easily deployed to different robotic platforms with different shapes and dynamics, and still can provide satisfactory navigation performance.
The rest of this paper is organized as follows. Section 2 provides a brief review about related works. In Section 3, we introduce the notations and formulate our problem. In Section 4, we discuss the details about how to use reinforcement learning to train a high-quality multi-robot collision avoidance policy, and how to combine the learned policy with traditional optimal control schemes to achieve a more robust hybrid control policy. Finally, we highlight our experimental results in both the simulation scenarios (Section 5) and the real-world scenarios (Section 6).
This paper is an extension of our previous conference paper (?, ?).
2 Related works
In this section, we first briefly survey the related works on multi-robot navigation, including the centralized and decentralized approaches. Next, we summarize recent literatures about how to solve the multi-robot navigation problem in a data-driven manner with machine learning techniques. We further discuss the sim-to-real gap of the learning-based navigation control, and then summarize the recent progress in resolving this issue, including a short introduction to the hybrid control. Finally, we briefly survey the deep reinforcement learning and its application in reactive control.
2.1 Centralized multi-robot navigation
Over the past decade, the centralized multi-robot motion planning methods have been widely applied in many applications, such as task allocation (?, ?, ?), formation control (?, ?, ?, ?, ?), and object transportation (?, ?, ?, ?, ?). The centralized methods assume that each robot can have access to the complete information (e.g., velocity and desired goal) of other robots via a global communication system for motion planning. In this way, they can guarantee a safe, optimal and complete solution for navigation (?, ?, ?, ?, ?). However, these centralized approaches are difficult to be deployed in large-scale real-world systems due to several reasons. First, the centralized control and scheduling become computationally prohibitive as the number of robots increases. Second, these methods require a reliable synchronized communication between the central server and all robots, which is either uneconomical or not feasible for large-scale systems. Third, the centralized system is vulnerable to failures or disturbances of the central server, the communication between robots, or motors and sensors of any individual robot. Furthermore, these centralized methods are inapplicable when multiple robots are deployed in an unknown and unstructured environment, e.g., in a warehouse with human co-workers. Due to these limitations, in this paper, we present a decentralized method to accomplish large-scale multi-robot navigation.
2.2 Decentralized multi-robot navigation
Compared with the centralized methods, the decentralized methods are extensively studied mainly for the collision avoidance task in multi-agent systems. As one representative approach, the Optimal Reciprocal Collision Avoidance (ORCA) framework (?, ?) has been widely used in crowd simulation and multi-agent systems. In particular, ORCA provides a sufficient condition for multiple agents to avoid collisions with others in a short time horizon, and can easily be scaled to large systems with many agents. ORCA and its variants (?, ?, ?) used heuristics to construct a parametric collision avoidance model, which are tedious to be tuned for satisfactory performance. Besides, these methods are difficult to adapt to real-world scenarios with ubiquitous uncertainties, because they assume that each robot has perfect sensing about the surrounding agents’ positions, velocities, and shapes. To alleviate the demand for perfect sensing, communication protocols have been introduced by (?, ?, ?, ?) to share agents’ state information, including positions and velocities among the group. However, introducing the communication network hurts the robustness and flexibility of multi-robot system. To alleviate this problem, (?, ?) planned the robots’ paths by computing each robot’s buffered Voronoi cell. Unlike previous work that needs both the position and velocity knowledge about adjacent robots, this method only assumes each robot knowing the position of surrounding agents, though it still uses a motion capture system as the global localization infrastructure.
In addition, the original formulation of ORCA is based on holonomic robots, while the robots in the real-world scenarios are often non-holonomic. In order to deploy ORCA on the most common differential drive robots, several methods have been proposed to deal with kinematics of non-holonomic robots. ORCA-DD (?, ?) doubles the effective radius of each agent to ensure collision-free and smooth paths for robots under differential constraints. But it causes problems for the navigation in narrow passages and unstructured environments. NH-ORCA (?, ?) enabled a differential drive robot to follow a desired speed command within an error of . It only needs to slightly increase the effective radius of a robot according to the value of and thus outperforms ORCA-DD in collision avoidance performance.
2.3 Learning-based collision avoidance policy
The decentralized polices discussed above are based on some first-principle rules for collision avoidance in general situations, and thus cannot achieve the optimal performance for a specific task or scenario, such as navigation in the crowded pedestrian scenario. To solve this problem, learning-based collision avoidance techniques try to optimize a parameterized policy using the data collected from a specific task. Most recent work focus on the task of single robot navigation in environments with static obstacles. Many approaches adopted the supervised learning paradigm to train a collision avoidance policy by mapping sensor input to motion commands. (?, ?) presented a vision-based static obstacle avoidance system. They trained a -layer convolutional network which maps raw input images to steering angles. (?, ?) exploited a deep reinforcement learning algorithm based on the successor features (?, ?) to transfer the depth information learned in previously mastered navigation tasks to new problem instances. (?, ?) proposed a mobile robot control system based on multimodal deep autoencoders. (?, ?) trained a discrete controller for a small quadrotor helicopter with imitation learning techniques to accomplish the task of collision avoidance using a single monocular camera. In their formulation, only discrete movements (left/right) need to be learned. Note that all of the aforementioned approaches only take into account the static obstacles and require manually collecting training data in a wide variety of environments.
To deal with unstructured environments, one data-driven motion planner is presented by (?, ?). They trained a model that maps laser range findings and target positions to motion commands using expert demonstrations generated by the ROS navigation package. This model can navigate a robot through an unseen environment and successfully react to sudden changes. Nonetheless, similar to other supervised learning methods, the performance of the learned policy is severely constrained by the quality of the training sets. To overcome this limitation, (?, ?) proposed a map-less motion planner trained through a deep reinforcement learning method. (?, ?) presented an uncertainty-aware model-based reinforcement learning algorithm to estimate the probability of collision in an unknown environment. However, these test environments are relatively simple and structured, and the learned planner is difficult to generalize to complex scenarios with dynamic obstacles and other proactive agents.
To address highly dynamic unstructured environments, some decentralized multi-robot navigation approaches are proposed recently. (?, ?) introduced Bayesian inference approach to predict surrounding dynamic obstacles and to compute collision-free command through the ORCA framework (?, ?). (?, ?, ?, ?) proposed multi-robot collision avoidance policies by deep reinforcement learning, which requires deploying multiple sensors to estimate the states of nearby agents and moving obstacles. However, their complicated pipeline not only demands expensive online computation but also makes the whole system less robust to the perception uncertainty.
2.4 Sim-to-real gap and hybrid control
In order to learn a sophisticated control policy with reinforcement learning, robots need to interact with the environment for a long period to accumulate knowledge about the consequences of different actions. Collecting such interaction data in real world is expensive, time-consuming, and sometimes infeasible due to safety issues. As a result, most reinforcement learning studies are conducted in simulation. However, a policy learned in simulation may be problematic when directly adapting to real world applications, known as the sim-to-real gap. Many recent researches attempt to address this problem. (?, ?) proposed a progressive nets architecture that assembles a set of neural networks trained for different tasks to build a new network that is able to bridge the simulation and the real world. (?, ?) leveraged diversified rendering parameters in simulation to accomplish collision avoidance in real world without the input of any real images. (?, ?) introduced domain randomization to train a robust and transferable policy for manipulation tasks.
In this paper, we utilize the hybrid control framework to alleviate the sim-to-real problem. Hybrid control architecture has been widely used for robot navigation problems in the past two decades (?, ?, ?), which designs a high-level control strategy to manage a set of low-level control rules. Hybrid control has also been heavily leveraged in multi-robot systems (?, ?, ?). For distributed control, (?, ?) proposed switching rules to deal with challenging cases where collisions or noises cannot be handled appropriately. (?, ?) introduced a combination of hybrid decomposition and reachability analysis to design a decentralized collision avoidance algorithm for multiple aerial vehicles. Different from prior methods, we use hybrid control to improve the transferability and robustness of the learned control policies. In particular, we use traditional control policies to deal with situations with large sim-to-real gap (i.e., the situations cannot be handled well by learned policy alone), and use a high level heuristic to switch between the learned policy and traditional control policies.
2.5 Deep reinforcement learning and reactive control
Deep learning methods have been successfully used in a wide variety of applications, including computer vision (?, ?, ?) and nature language processing (?, ?, ?). These successes indicate that deep neural networks are good at extracting hierarchal features from complex and high-dimensional data and making high-quality decisions. These advantages make deep neural networks a powerful tool for robotic applications, which also need to deal with complex raw input data captured from onboard sensors. Recently, there is an increasing amount of research that appeals for deep neural networks to solve robotic perception and control problems. (?, ?) developed a vision-based obstacle avoidance system for a mobile robot by training a -layer convolutional network that maps raw input images to steering angles. (?, ?) presented an end-to-end framework which learns control policies mapping raw image observations to torques at the robotâs motors. Deep neural networks have also been integrated with model predictive control to control robotic systems (?, ?, ?). (?, ?, ?) combined recurrent neural networks with convolutional operations to learn a direct mapping from raw 2D laser data to the unoccluded state of the entire scene; the mapping is then used for object tracking. In this paper, we trained a multi-layer neural network to produce the navigation velocity for each robot according to observations.
3 Problem Formulation
The multi-robot collision avoidance problem is formulated primarily in the context of a nonholonomic differential drive robot moving on the Euclidean plane and avoiding collision with obstacles and other fellow-robots.
To tackle this problem, we first assume all robots in the environment are homogeneous. Specifically, all of robots are modeled as discs with the same radius , and the multi-robot collision avoidance is formulated as a partially observable decision making problem. At each timestep , the -th robot has access to an observation which only provides partial knowledge about the world and then computes a collision-free steering command that drives itself toward the goal from the current position .
In our formulation, the observation is drawn from a probability distribution w.r.t. the latent system state , i.e., and it only provides partial information about the state . In particular, the -th robot cannot access other robots’ states and intents, which is in accord with the real world situation. Compared with the perfect sensing assumption applied in prior methods, e.g. (?, ?, ?, ?, ?, ?, ?), our partial observation assumption makes our proposed approach more applicable and robust in real world applications. The observation vector of each robot is divided into three parts: (here we ignore the robot ID for legibility), where denotes the raw 2D laser measurements about its surrounding environment, stands for its relative goal position (i.e. the coordinate of the goal in the robot’s local polar coordinate frame), and refers to its current velocity. Given the partial observation , each robot independently computes an action or a steering command, , sampled from a stochastic policy shared by all robots:
where denotes the policy parameters. The computed action is a vector representing the velocity driving the robot to approach its goal while avoiding collisions with other robots and obstacles within the time horizon until the next observation is received. Hence, each robot sequentially makes decision until it reaches the goal. Given the sequential decisions consisting of observations and actions (velocities) , the trajectory of the -th robot, , can be computed, starting from the position to its desired goal , where is the traveled time.
To wrap up the above formulation, we define as the set of trajectories for all robots, which are subject to the robot’s kinematic (e.g., non-holonomic) constraints, i.e.:
To find an optimal policy shared by all robots, we adopt an objective by minimizing the expectation of the mean arrival time of all robots in the same scenario, which is defined as:
where is the traveled time of the trajectory in controlled by the shared policy .
The average arrival time will also be used as an important metric to evaluate the learned policy in Section 5. We solve this optimization problem through a policy gradient based reinforcement learning method, which bounds the policy parameter updates to a trust region to ensure stability.
We begin this section by introducing the key ingredients of our reinforcement learning framework. Next, we describe the details about the architecture of the collision avoidance policy based on a deep neural network. Then we discuss the training protocols used to optimize the policy and to bridge the sim-to-real gap. Finally, we introduce the hybrid control framework (as shown in Figure 3), which combines the learning-based policy with traditional rule-based control methods to improve the controller’s safety, effectiveness, and transferability.
4.1 Reinforcement learning setup
The partially observable sequential decision making problem defined in Section 3 can be formulated as a Partially Observable Markov Decision Process (POMDP) solved with reinforcement learning. Formally, a POMDP can be described as a -tuple , where is the state space, is the action space, is the state-transition model, is the reward function, is the observation space (), and is the observation probability distribution given the system state (). In our formulation, each robot only has access to the observation sampled from the underlying system states. Furthermore, since each robot plans its motions in a fully decentralized manner, a multi-robot state-transition model determined by the robots’ kinematics and dynamics is not needed. Below we describe the details of the observation space, the action space, and the reward function.
4.1.1 Observation space
As mentioned in Section 3, the observation consists of three parts: the readings of the 2D laser range finder , the relative goal position , and the robot’s current velocity . Specifically, includes the measurements of the last three consecutive frames from a 180-degree laser scanner which provides 512 distance values per scanning (i.e., ) with a maximum range of 4 meters. In practice, the scanner is mounted on the forepart of the robot instead of in the center (as shown in the left image in Figure 1) to obtain a large un-occluded view. The relative goal position is a 2D vector representing the goal in polar coordinate (distance and angle) with respect to the robot’s current position. The observed velocity includes the current translational and rotational velocity of the differential-driven robot. The observations are normalized by subtracting the mean and dividing by the standard deviation using the statistics aggregated over the course of the entire training.
4.1.2 Action space
The action space is a set of permissible velocities in continuous space. The action of differential robot includes the translational and rotational velocity, i.e., . In this work, considering the real robot’s kinematics and the real world applications, we set the range of the translational velocity and the rotational velocity in . Note that backward moving (i.e., ) is not allowed since the laser range finder can not cover the back region of the robot.
4.1.3 Reward design
Our objective is to avoid collisions during the navigation and to minimize the mean arrival time of all robots. A reward function is designed to guide a team of robots to achieve this objective:
The reward received by robot at time-step is a sum of three terms, , , and . In particular, the robot is awarded by for reaching its goal:
When the robot collides with other robots or obstacles in the environment, it is penalized by :
To encourage the robot to move smoothly, a small penalty is introduced to punish the large rotational velocities:
We set , , and in the training procedure.
4.2 Network architecture
Given the input (observation ) and the output (action ), now we elaborate the policy network mapping to .
We design a -hidden-layer neural network as a non-linear function approximation to the policy . Its architecture is shown in Figure 4. We employ the first three hidden layers to process the laser measurements effectively. The first hidden layer convolves one-dimensional filters with kernel size , stride over the three input scans and applies ReLU nonlinearities (?, ?). The second hidden layer convolves one-dimensional filters with kernel size , stride , again followed by ReLU nonlinearities. The third hidden layer is a fully-connected layer with rectifier units. The output of the third layer is concatenated with the other two inputs ( and ), and then are fed into the last hidden layer, a fully-connected layer with rectifier units. The output layer is a fully-connected layer with two different activations: a sigmoid function, which is used to constrain the mean of translational velocity within , and a hyperbolic tangent function (), which constrains the mean of rotational velocity within .
As a summary, the neural network maps the input observation vector to a vector . The final action is sampled from a Gaussian distribution that is used to model the stochastic policy formulated in Equation 1, where is a separate set of parameters referring to a log standard deviation which will be updated only during the training process. The entire input-output architecture is as shown in Figure 4.
4.3 Multi-scenario multi-stage training
In order to learn a robust policy, we present a multi-stage training scheme in varied scenarios.
4.3.1 Training algorithm
Although deep reinforcement learning algorithms have been successfully applied in mobile robot motion planning, they have mainly focused on a discrete action space (?, ?, ?) or on small-scale problems (?, ?, ?, ?, ?). Large-scale, distributed policy optimization over multiple robots and the corresponding algorithms have been less studied for mobile robot applications.
Here we focus on learning a robust collision avoidance policy for navigating a large number of robots in complex scenarios (e.g., mazes) with static obstacles. To accomplish this, we deploy and extend a recently proposed robust policy gradient algorithm, the Proximal Policy Optimization (PPO) (?, ?, ?, ?), in our multi-robot system. Our approach adapts the centralized learning, decentralized execution paradigm. In particular, the policy is learned with experiences collected by all robots simultaneously. Each robot receives its own at each time step and executes the action generated from the shared policy .
As summarized in Algorithm 1, which is adapted from (?, ?, ?), the training process alternates between sampling trajectories by executing the policy in parallel, and updating the policy with the sampled data. During data collection, each robot exploits the shared policy to generate trajectories until the maximum time period is reached and a batch of trajectory data is sampled. Then the sampled trajectories are used to construct the surrogate loss which is optimized with the Adam optimizer (?, ?) for epochs under the Kullback-Leiber (KL) divergence constraint. As a baseline for estimating the advantage , the state-value function is approximated with a neural network with parameters using the sampled trajectories. The network structure of is the same as that of the policy network , except that it has only one linear activation unit in its last layer. Besides, we adopt -Loss for , and optimize it with the Adam optimizer for epochs as well. We update and independently and their parameters are not shared since we have found that using two separate networks could provide better performance in practice.
This parallel PPO algorithm can be easily scaled to a large-scale multi-robot system with hundreds of robots in a decentralized fashion, since each robot in the team serves as an independent worker collecting data. The decentralized execution not only dramatically reduces the sampling time cost, also makes the algorithm suitable for training a large number of robots in various scenarios. In this way, our network can quickly converge to a solution with good generalization performance.
4.3.2 Training scenarios
To achieve a generalized model that can handle different complex real world scenarios, we empirically build up multiple scenarios with a variety of obstacles using the Stage mobile robot simulator***http://rtv.github.io/Stage/ (as shown in Figure 5) and move all robots concurrently. For Scenario , , , , and shown in Figure 5 (where the black solid lines are obstacles), we first select reasonable starting and arrival areas from the available workspace, then randomly sample the start and goal positions for each robot in the corresponding areas. In Scenario , robots are randomly initialized in a circle with varied radii. Since each robot only has a limited-range observation about its surrounding environment, these robots cannot make a farsighted decision to avoid congestion. As for Scenario , we generate random positions for both robots and obstacles (as marked by the black areas) at the beginning of each episode; and the goals of robots are also randomly selected. In general, these rich, complex training scenarios enable robots to explore their high-dimensional observation space and can improve the quality, robustness, and generalization of the learned policy.
4.3.3 Training stages
Although training on multiple environments simultaneously brings robust performance over different test cases (as discussed in Section 5.2), it makes the training process harder. Inspired by the curriculum learning paradigm (?, ?), we propose a two-stage training process, which accelerates the policy to converge to a satisfactory solution, and gets higher rewards than the policy trained from scratch with the same number of epoch (as shown in Figure 6). In the first stage, we only train robots on the random scenarios (i.e., Scenario 7 in Figure 5) without any obstacles. This allows to find a reasonable solution quickly on relatively simple collision avoidance tasks. Once the robots achieve stable performance, we stop the first stage and save the trained policy. The policy will continue to be updated in the second stage, where the number of robots increases to . The policy network is trained on the richer and more complex scenarios shown in Figure 5. In our experiments discussed in Section 5, we call the policy generated after the first stage as the RL Stage-1 Policy, and the policy generated after the second stage as the RL Stage-2 Policy.
4.4 Hybrid control architecture
Our reinforcement learned policy generalizes well in complex scenarios in both simulation and real world. Unfortunately, it still cannot produce perfect behaviors all the time. There exist some trivial cases that the learned policy cannot provide high quality collision avoidance decisions. For instance, as a robot runs towards its goal through a wide-open space without other agents, the robot may approach the goal in a curved trajectory rather than in a straight line. We have also observed that a robot may wander around its goal rather than directly moving toward the goal, even though the robot is already in the close proximity of the target. In addition, the learned collision avoidance policy is still more vulnerable in real world than in simulation in terms of a higher collision rate.
These imperfect behaviors are due to several properties of the learning scheme. First, it is difficult to collect training data from some special situations, e.g., the chance to sample a straight line trajectory in our training scenarios is close to zero since the policy is modeled as a conditional probability distribution. Second, in a high-dimensional continuous action space, the probability to learn low-dimensional optimal action subspaces, e.g., turnaround in place or moving in a straight line, is also extremely low. Third, the sensor noises added during the training procedure also increase the difficulty of learning a deterministic optimal control. Finally, the sim-to-real gap may fail the collision avoidance policy in marginal cases. For instance, a robot that learned its collision avoidance capability with simulated crowds may difficult to avoid real-world pedestrians. Such sim-to-real gap is caused by the difference between the simulation and the real world, including a robot’s shape, size, dynamics, and the physical properties of the workspace.
However, the situations which are challenging for the learned policy could be simple for traditional control. For example, we can use a PID controller to move a robot in a straight line or to quickly push a robot toward its goal in a wide-open space without any obstacles. Hence, we present a hybrid control framework (as shown in Figure 3) that combines the learned policy and several traditional control policies to generate an effective composite policy. In particular, we first classify the scenario faced by a robot into three categories, and then design separate control sub-policies for each category. During the execution (Algorithm 2), the robot heuristically switches among different sub-policies.
4.4.1 Scenario classification
According to a robot’s sensor measurement about the surrounding environment, we classify the scenarios faced by the robot into three categories: the simple scenarios, the complex scenarios, and the emergent scenarios, by using the safe radius and the risk radius as classification parameters. The classification rule is as follows. When the measured distances from the robot to all nearby obstacles are larger than (i.e., and thus the robot is still far away from obstacles) or when the distance between the robot and target position is smaller than the distance to the nearest obstacle (i.e., and thus the robot may go straight toward the target), we classify this scenario as a simple scenario where the robot can focus on approaching the goal without worrying too much about collision avoidance. When the robot’s distance to an obstacle is smaller than , we consider the situation as an emergent scenario because the robot is too close to the obstacle and thus needs to be cautious and conservative when making movement decisions. All other scenarios are classified as complex scenarios where the robot’s distance to the surrounding obstacles is neither too small nor too large and the robot needs to make a good balance between approaching the goal and avoiding collisions.
In this work, and are two hyper-parameters and thus the switching rule among different scenarios is manually designed. Ideally, we hope to use deep reinforcement learning to learn these switch parameters automatically from data, similar to the meta-learning scheme (?, ?). However, in practice we find that it is difficult to guarantee the deep reinforcement learning to recognize the emergent scenarios reliably. As a result, in this work, we only use the and as hyper-parameters to robustly distinguish different scenarios, and leave the meta-learning as the future work.
Each of the three scenarios discussed above will be handled by one separate sub-policy. For the simple scenario, we will apply the PID control law , which provides an almost optimal solution to move the robot toward its target and guarantees stability. For the emergent scenario, we apply the conservative safe policy as shown in Algorithm 3 to generate a conservative movement by leveraging the DRL trained policy but scaling the neural network’s input and restricting the output of the robot. For the complex scenario, we use the reinforcement learned policy directly, which uses multi-scenario multi-stage reinforcement learning framework to achieve the state-of-the-art navigation performance. An intuitive illustration of the three sub-policies and their corresponding situations are provided in Figure 7.
4.4.3 Hybrid control behaviors
In Figure 8, we compare different behaviors of the reinforcement learning policies with or without using the hybrid control, which are denoted as RL policy and Hybrid-RL policy, respectively. In both benchmarks, we can observe that Hybrid-RL will switch to in the open space, which helps the robot to go toward the goal in the most efficient way, rather than taking the curved paths as the RL policy. In cases when the collision risk is high, the agent will switch to to guarantee safety. For other cases, the agents will enjoy the flexibility of . The switch among sub-policies allows robots to fully leverage the available space by taking trajectories with short lengths and small curvatures. More experimental results verifying the effectiveness of the hybrid control architecture are also available latter in Section 5.
5 Simulation Experiments
In this section, we first describe the setup details for the policy training, including the simulation setup, the hyper-parameters for the deep reinforcement learning algorithm and the training time. Next, we provide a thorough evaluation of the performance of our hybrid multi-scenario multi-stage deep reinforcement learned policy by comparing with other approaches. In particular, we compare four approaches in the experiments:
NH-ORCA policy is the state-of-the-art rule-based approach proposed by (?, ?, ?, ?)
SL policy is trained with the supervised learning approach proposed by (?, ?)
RL policy is the original multi-scenario multi-stage reinforcement learned policy without the hybrid architecture, as proposed by (?, ?)
Hybrid-RL policy is our proposed method in this paper, which augments the RL policy with the hybrid control.
For the RL policy, in some experiments, we analyze its two stages – the RL Stage-1 policy and RL Stage-2 policy – separately, in order to better understand its behavior.
The comparison of these policies is performed from different perspectives, including the generalization to unseen scenarios and tasks, the navigation efficiency, and the robustness to the agent density and to the robots’ inaccurate shape and dynamics. These experiments demonstrate the superiority of our proposed methods over previous methods.
5.1 Training setup
The training of our policy is implemented in TensorFlow and the large-scale multi-robot system with the 2D laser scanner measurements is simulated in the Stage simulator. The simulation time step is set as and the robot’s control frequency is . We train the multi-robot collision avoidance policy on a computer with one i7-7700 CPU and one Nvidia GTX 1080 GPU. The offline training takes hours to run about iterations in Algorithm 1 so that the policy converges in all the training scenarios. The hyper-parameters in Algorithm 1 are summarized in Table 1. Specifically, the learning rate of the policy network is set to in the first training stage, and is then reduced to in the second training stage. The hyper-parameters for the hybrid control are summarized in Table 2.
The online execution of the learned policy for both simulation and the real-world robots runs in real-time. In simulation, it takes on CPU or on GPU for the policy network to compute new actions for robots. Deployed on a physical robot with one Nvidia Jetson TX1, the policy network takes about to generate online robot control commands.
|in line 6||0.95|
|in line 6 and 20||0.99|
|in line 7||8000|
|in line 11||20|
|in line 12||1.0|
|in line 12|
|in line 12||50.0|
|in line 16||(st stage), (nd stage)|
|in line 19||10|
|in line 21|
|in line 24||2.0|
|in line 24 and 27||1.5|
|in line 26||0.5|
5.2 Generalization capability
A notable benefit of the multi-scenario training is the excellent generalization capability of our learned policy after the two-stage training. Thus, we first demonstrate the generalization of the RL policy with a series of experiments.
5.2.1 Non-cooperative robots
As mentioned in Section 3, our policy is trained on a team of robots, in which all robots share the same collision avoidance strategy. Non-cooperative robots are not introduced in the entire training process. However, as shown in Figure 10, our learned policy can directly generalize to avoid non-cooperative agents, i.e., the rectangle-shaped robots which travel in straight lines with a fixed speed.
5.2.2 Heterogeneous robots
Our policy is trained on robots with the same disc shape and a fixed radius. Figure 10 demonstrates that the learned policy can efficiently navigate a group of heterogeneous robots with different sizes and shapes to reach their goals without any collisions.
5.2.3 Large-scale scenarios
To evaluate the performance of our method on large-scale scenarios, we simulate robots in a large circle moving to antipodal positions as shown in Figure 10(b). This simulation experiment serves as a pressure test for our learned collision avoidance policy, since the robots may run into congestion easily due to their limited sensing about the environment. The result shows that our learned policy generalize well to large-scale environments without any fine-tuning. In addition, we also simulate robots in the random scenario as shown in Figure 11(b). All these robots are able to arrive at their destinations without any collisions.
Note that the good generalization capability of the RL policy is also preserved in the Hybrid-RL policy, since all decisions in the complex and emergent situations are made by the RL policy. The generalization of the Hybrid-RL policy is also verified by the pressure test of robots, as shown in Figure 10(c) and Figure 11(c). In both scenarios, most robots adopt the policy near the start positions and goals (shown as green lines in Figure 10(c) and Figure 11(c)) due to the considerable distance to its neighbors. They then switch to the or policy when they encounter other robots (shown as blue or red lines in Figure 10(c) and Figure 11(c)).
5.3 Efficiency evaluation
In this section, we evaluate the performance of the proposed hybrid collision avoidance policy in terms of the navigation efficiency. In particular, we first present multiple performance metrics for evaluating the navigation efficiency, and then compare different polices.
5.3.1 Performance metrics
To compare the performance of our approach with other methods, we use the following metrics:
Success rate is the ratio of the number of robots reaching their goals in a certain time limit without any collisions to the total number of robots.
Extra time measures the difference between the average travel time of all robots and the lower bound of robots’ travel time which is computed as the average travel time of going straight toward goals at the maximum speed without collision avoidance (?, ?, ?).
Extra distance measures the difference between the average traveled trajectory length of all robots and the lower bound of robots’ traveled distance which is computed as the average traveled distance for robots following the shortest paths toward their goals.
Average speed measures the average speed of all robots during the navigation.
Note that in the definitions of the extra time and extra distance , we use the lower-bound baseline to alleviate the biases brought by the different number of agents and different distances to goals. Hence, we can provide a fair comparison among different methods, and this is important especially for the random scenarios.
Since all four policies in comparison (the NH-ORCA policy, the SL policy, the RL policy, and the Hybrid-RL policy) contain some randomness, we run each method times for each test scenario, and then report the mean and variance for each metric.
5.3.2 Circle scenarios
We first compare our Hybrid-RL policy with other methods on several circle scenarios with different number of robots. These scenarios are similar to the Scenario in Figure 5, except that the initial positions of these robots are uniformly along the circle. We test in seven circle scenarios with 4, 6, 8, 10, 12, 15 and 20 agents each. Accordingly, we adjust the circle radius in each scenario to make sure that its average agent density (i.e. the number of agents divided by the area of the circle) is similar (around ).
We use the open-source NH-ORCA implementation from (?, ?, ?) in the evaluation. Since the agents executing the NH-ORCA policy cannot make decisions based on the raw sensor measurements, we need to share the ground truth positions and velocities for all robots in the simulation, such information is not available for the learning-based approaches. The SL policy learned a neural network model the same as the Hybrid-RL policy (as described in Section 4.2), and it is trained in a supervised mode on about samples using the same training strategy as (?, ?, ?).
The comparison results are summarized in Table 3. Compared to the NH-ORCA policy, the reinforcement learned policies (the RL policy and the Hybrid-RL policy) have significant improvements in terms of the success rate, while the SL policy has the lowest success rate. For the largest test scenario with robots, the Hybrid-RL policy still achieves % success rate while the RL policy has a non-zero failure rate. In addition, both the RL policy and the Hybrid-RL policy are superior to the NH-ORCA policy in the average extra time and the travel speed, which means that the robots executing the RL-based policies can provide a higher throughput. And the Hybrid-RL policy performs better than the RL policy according to these two metrics.
Furthermore, as indicated by the extra distance, the reinforcement learned policies provide a comparable or even shorter traveled path than the NH-ORCA policy, as shown in the third row of Table 3. In scenarios with and robots, our method produces slightly longer paths, because the robots need more space to decelerate from the high speed before stopping at goals or to rotate at a larger radius of curvature due to the higher angular velocity. Compared to the RL policy, our Hybrid-RL policy rarely needs to increase the extra distance for obtaining better performance in efficiency and safety. The quantitative difference between the learned policies and the NH-ORCA policy in Table 3 is also verified qualitatively by the difference of their trajectories illustrated in Figure 14 and Figure 14. In particular, the NH-ORCA policy sacrifices the speed of each robot to avoid collisions and to generate shorter trajectories. By contrast, our reinforcement learned policies produce more consistent behavior for robots to resolve the congestion situation efficiently, which avoids unnecessary velocity reduction and thus obtains a higher average speed.
|Metrics||Method||#agents (radius of the scene)|
|4 ()||6 ()||8 ()||10 ()||12 ()||15 ()||20 ()|
|Extra Time||SL||9.254 / 2.592||9.566 / 3.559||12.085 / 2.007||13.588 / 1.206||19.157 / 2.657||-||-|
|NH-ORCA||0.622 / 0.080||0.773 / 0.207||1.067 / 0.215||0.877 /0.434||0.771 / 0.606||1.750 / 0.654||1.800 / 0.647|
|RL||0.323 / 0.025||0.408 / 0.009||0.510 / 0.005||0.631 / 0.011||0.619 / 0.020||0.490 / 0.046||0.778 / 0.032|
|Hybrid-RL||0.251 / 0.007||0.408 / 0.008||0.494/ 0.006||0.629 / 0.009||0.518 / 0.005||0.332 / 0.007||0.702 / 0.013|
|Extra Distance||SL||0.358 / 0.205||0.181 / 0.146||0.138 / 0.079||0.127 / 0.047||0.141 / 0.027||-||-|
|NH-ORCA||0.017 / 0.004||0.025 / 0.005||0.041 / 0.034||0.034 / 0.009||0.062 / 0.024||0.049 / 0.019||0.056 / 0.018|
|RL||0.028 / 0.006||0.028 / 0.001||0.033 / 0.001||0.036 / 0.001||0.038 / 0.002||0.049 / 0.005||0.065 / 0.002|
|Hybrid-RL||0.013 / 0.001||0.028 / 0.001||0.031 / 0.001||0.036 / 0.001||0.039 / 0.001||0.033 / 0.001||0.058 / 0.001|
|Average Speed||SL||0.326 / 0.072||0.381 / 0.087||0.354 / 0.042||0.355 / 0.022||0.308 / 0.028||-||-|
|NH-ORCA||0.859 / 0.012||0.867 / 0.026||0.839 / 0.032||0.876 / 0.045||0.875 / 0.054||0.820 / 0.052||0.831 / 0.042|
|RL||0.939 / 0.004||0.936 / 0.001||0.932 / 0.001||0.927 / 0.001||0.936 / 0.002||0.953 / 0.004||0.939 / 0.002|
|Hybrid-RL||0.952 / 0.001||0.936 / 0.001||0.934 / 0.001||0.927 / 0.001||0.946 / 0.001||0.968 / 0.001||0.945 / 0.001|
5.3.3 Random scenarios
Random scenarios are another type of scenarios frequently used to evaluate the performance of multi-robot collision avoidance. One example of the random scenarios is shown in the Scenario in Figure 5, where each robot is assigned a random start point and a random goal position. To measure the performance of our method on random scenarios, we create different random scenarios with robots in each. For each random scenario, we repeat our evaluations times and the evaluation results are summarized in Figure 15. We compare the Hybrid-RL policy with the RL Stage-1 policy, the RL Stage-2 policy, and the NH-ORCA policy. Note that the difficulty of the random scenarios is lower compared to the circle scenarios, since the random distribution of agents makes the traffic congestion unlikely to happen.
As shown in Figure 14(a), we observe that all policies trained using deep reinforcement learning achieve success rates close to 100%, which means they are safer than the NH-ORCA policy whose success rate is about 96%. In addition, as shown in Figure 14(b), robots using the learned policies reach their goals much faster than those using the NH-ORCA policy. Although the learned policies generate longer trajectories than the NH-ORCA policy (as shown in Figure 14(c)), the higher average speed (Figure 14(d)) and success rate indicate that our policies enable a robot to better cooperate with its nearby agents for higher navigation throughput. Similar to the circle scenarios above, the slightly longer path is due to robots’ inevitable deceleration before stopping at goals or the larger radius of curvature taken to deal with the higher angular velocity.
As depicted in Figure 15, the three reinforcement learning based policies have similar overall performance. The RL Stage-1 policy’s performance is bit higher than the RL Stage-2 policy (e.g., according to the extra distance metric). This is probably because the RL Stage-1 policy is trained in similar scenarios as the test scenarios and thus it may be overfitted, while the RL Stage-2 policy is trained in miscellaneous scenarios for better generalization. Our Hybrid-RL policy further improves the RL Stage-2 policy and achieves similar overall performance as the RL Stage-1 policy, because the traditional control sub-policies improve the trajectory’s optimality, and thus help to make a better balance between the optimality and generalization.
5.3.4 Group scenarios
In order to evaluate the cooperation between robots, we test our learned policy on more challenging group scenarios, including group swap, group crossing, and group moving in the corridors. In the group swap scenario, two groups of robots, each with robots, are moving in opposite directions to swap their positions. As for the group crossing scenario, robots are separated in two groups with robots each, and their paths intersect in the center of the scenario.
We compare our method with the NH-ORCA policy on these two cases by measuring the average extra time with trials. As summarized in Figure 16, the reinforcement learned policies perform much better than the NH-ORCA policy on both benchmarks. The robots take shorter time to reach goals when using the learned policies, which indicates that the learned policies produce more cooperative behaviors than the NH-ORCA policy. Among the three learned policies, the Hybrid-RL policy provides the best performance. Besides, we also illustrate the trajectories of different methods in both scenarios in Figure 17 and Figure 18. From these trajectories, we observe that the NH-ORCA policy generates unnecessary circuitous trajectory. The RL Stage-1 policy tends to generate stopping behaviors when one agent is blocked by another, while the RL Stage-2 policy and Hybrid-RL policy provide smooth and collision-free trajectories.
5.3.5 Corridor scenario
We further evaluate the four policies in a corridor scene, where two groups exchange their positions via a narrow corridor connecting two open regions, as shown in Figure 18(a).
Note that this is a benchmark with static obstacles, and we have to use different pipelines when using the NH-ORCA policy and the learned policies. In particular, the NH-ORCA requires prior knowledge about the global environment (usually a map from SLAM) to identify the static obstacles and then depends on global planners to guide robots navigating in the complex environment. The global planners do not consider the other moving agents, and thus such guidance is sub-optimal and significantly reduces the efficiency and the safety of the navigation. For the learned policies, we are using them in a map-less manner, i.e., each robot only knows its goal but without any knowledge about the global map. The goal information can be provided by global localization system such as the GPS or UWB (Ultra Wide-Band) in practice. A robot will use the vector pointing from its current position to the goal position as a guidance toward the target. Such guidance may fail in buildings with complex topologies but works well for common scenarios with moving obstacles.
Navigation in the corridor scenario is a challenging task, and only the Hybrid-RL policy and the RL Stage-2 policy can complete it. The trajectories generated by the Hybrid-RL policy are illustrated in Figure 18(b). The failure of the RL Stage-1 policy indicates that the co-training on a variety of scenarios is beneficial for the robustness across different situations. The NH-ORCA policy fails with many collisions.
In this section, we have validated the superiority of our learned policies in multiple scenarios by comparing with the state-of-the-art rule-based reaction controller NH-ORCA. We also show that the hybrid control framework can effectively improve the learned policy’s performance in many scenarios.
5.4 Robustness evaluation
Besides the generalization and the quality of the policy, another main concern about deep reinforcement learning is whether the learned policy is stable and robust to model uncertainty and input noises. In this section, we design multiple experiments to verify the robustness of our learned policy.
5.4.1 Performance metrics
In order to quantify the robustness of our policy with respect to the workspace and robot setup, we design the following performance metrics.
Failure rate is the ratio of robots that cannot reach their goals within a certain time limit.
Collision rate is the ratio of robots that cannot reach their goals due to colliding with other robots or static obstacles during the navigation.
Stuck rate is the ratio of robots that cannot reach their goals because they stuck in some position but without any collisions.
Note that the failure rate is equal to the collision rate plus the stuck rate. Similar to the experiments in Section 5.3, when computing each metric for a given method, we repeat the experiment times and report the mean value.
5.4.2 Different agent densities
We first evaluate the performance of different policies in scenarios with varied agent densities. In particular, we design five scenarios which is a -meter by -meter square region and allocate , , , , and agents, respectively. The agent density for these scenarios are about , , , , robots per square meter, respectively. Note that the agent densities of these five scenarios are higher than densities of the training and test scenarios in previous experiments. For instance, the agent density in circle scenarios in Table 3 is only agents . As a result, this experiment is used to challenge the learned policy’s robustness when handling the high agent density.
From the experimental results shown in Figure 19(a), we observe that the failure rates of all methods increase when the agent density increases. This is because in a crowded environment, each robot has limited space and time window to accomplish the collision-free movement and it has to deal with many neighboring agents. Thus, the robots are more likely running into collisions and getting stuck in congestion. Among three methods, the Hybrid-RL policy achieves the lowest failure rate and collision rate, as shown in Figure 19(a) and Figure 19(b). Both the RL and Hybrid-RL policy’s collision rates and overall failure rates are significantly lower than that of the NH-ORCA policy. However, as we observe in Figure 19(c), the RL policy’s stuck rate is slightly higher than that of the NH-ORCA policy in low density situations. This is due to the limitation of the RL policy that we discussed in Section 4.4 when introducing the hybrid control framework: the robot controlled by the RL policy may wander around its goal when it is in the close proximity of the goal. Such navigation artifact of the RL policy takes place in the high density scenarios, which leads to higher stuck rates. Resolving this difficulty by adopting proper sub-policies, our Hybrid-RL policy enable robots to reach goals efficiently and thus it significantly outperforms the RL policy and the NH-ORCA policy in performance, as shown in Figure 19(c).
We also observe two interesting phenomena in Figure 20. First, the RL policy achieves the lowest stuck rate at the density robots . We believe that this is because the average density in the training set is close to robots and thus the RL policy slightly overfits in this case. Second, we observe that the stuck rate of the Hybrid-RL policy is a bit higher than RL policy at the density robots . This is because in emergent cases, the RL policy has a higher probability to directly run into the obstacles and get collision while the Hybrid-RL policy will use to keep safe but may get stuck. This results in the slight lower stuck rate but much higher collision rate of the RL policy compared to the Hybrid-RL policy at the density robots .
5.4.3 Different agent sizes
In our training scenarios, all robots are of the same disc shape with a radius of . When transferring the learned policy to unseen simulated or real world scenarios, the robots may have different shapes and different sizes.
In Section 5.2, we directly deploy our learned policy to agents with similar sizes but shapes different with the prototype agents used in training, and we demonstrate that our method still provides satisfactory performance. In this part, we focus on evaluating the robustness of our policy when the robot’s size is significantly different from the setup in training. In particular, we generate five testing scenarios, each of which is a -meter by -meter square region with round shape robots with the same radius, where the radius is , , , , and meters, respectively. In other words, the smallest agent has the same size as that of the prototype agent, and the largest agent has the radius times larger than that of the prototype agent.
From the results shown in Figure 21, we observe that our Hybrid-RL policy can be directly deployed to robots that are times larger in radius (and times greater in the area) than the prototype agent used for training, and the robust collision avoidance behavior can still be achieved, with a failure rate of about %. When the size disparity between the training and test becomes larger, the collision rate of the Hybrid-RL policy increases but is still significantly lower than that of the RL policy. This implies that the Hybrid-RL policy has learned to maintain an appropriate safety margin with other obstacles when navigating through them. In addition, the hyper-parameter about the size of safety radius in the hybrid control framework also contributes to an excellent balance between the navigation efficiency and safety.
5.4.4 Different maximum velocities
In our proposed policy, the output linear and angular velocities of the policy network are truncated to a certain range ( for the linear velocity and for the angular velocity in our training setup) to reduce the exploration space of reinforcement learning and thus to speed up the training process. However, since the robots in a variety of applications may have different dynamics and different ranges for actions, we here focus on studying the robustness of the learned policy for the testing robots with different maximum linear and angular velocities.
In particular, we allocate robots with the same range for the linear and angular velocities in a -meter by -meter square region. We create five testing scenarios, whose maximum setups for linear and angular velocities are (, ), (, ), (, ), (, ), and (, ), respectively. Note that since the network output is constrained in training, there is no need to evaluate the velocities lower than (, ).
The experimental result in Figure 22 shows that the RL policy is very sensitive to the changes of the network output range, i.e., as the maximum speed increases, its failure rate increases quickly to an unacceptable level, while our Hybrid-RL policy’s collision rate is still below % and thus is still relatively safe even when the maximum speed is two times the value in training. In addition, the Hybrid-RL policy’s stuck rate remains close to zero when the maximum speed is three times of the training setup, while the RL policy’s stuck rate increases to be about %, as shown in Figure 21(c). Thanks to the special sub-policy for the emergent situation (as discussed in Section 4.4), our Hybrid-RL policy outperforms the RL policy significantly.
5.4.5 Different control frequencies
In our training setup, we set the control frequency of executing our collision avoidance policy to . Nevertheless, the actual control frequency depends on many factors, including the available computation resources of the on-board computer, the sensor’s measurement frequency, and the localization system’s frequency. As a result, it is difficult to maintain the control frequency at for long-term execution. In this part, we evaluate whether our collision avoidance policy are able to avoid obstacles when the robot’s control frequency varies.
We deploy robots in a -meter by -meter square region, and run five experiments with the control frequencies to be , , , and respectively. In other words, the robots’ control period is , , , and , respectively. From the experimental results in Figure 23, we observe that the collision rate of our policy increases dramatically when the control frequency is below . This is consistent with human experience: the navigation is difficult when blinking eyes frequently. The Hybrid-RL policy slightly outperforms the RL policy when the control frequency varies.
After the experimental evaluations above, we demonstrate that our Hybrid-RL policy can be well generalized to unseen scenarios and is robust to the variations of the scenario configuration including densities, sizes, maximum velocities and control frequencies of the robots. In this way, we validate that the Hybrid-RL policy is able to bridge the sim-to-real gap and has the potential to be directly deployed on physical robots with different setups.
6 Real World Experiments
In this section, we demonstrate that our reinforcement learning based policy can be deployed to physical robots. Besides the generalization capability and uncertainty modeling that we discussed in Section 5, there still exist other challenges of transferring our learned policy from simulation to the real world. First, in real situations, noises in observation are ubiquitous due to the imperfect sensing. Second, the clock of each individual robot is not synchronized with each other and such an asynchronized distributed system is challenging for control. Finally, it is not guaranteed that all robots can provide consistent behavior given the same control command, because many real-world factors such as mechanics details, motor properties, and frictions cannot be accurately modeled in a simulation environment.
In the following sub-sections, we first briefly introduce the hardware setup of our robots. Then, we present the multi-robot scenarios for evaluation. Lastly, we demonstrate the collision avoidance behavior of the robot controlled by our learned policy when interacting the real human crowd.
6.1 Hardware setup
To validate the transferability of our Hybrid-RL policy, we use a self-developed mobile robot platform for the real-world experiments. The mobile platform has a squared shape with a side length of , as shown in Figure 24. Its shape, size, and dynamic characteristics are completely different from the prototype agents used in the training scenarios. Hence, the experiments based on such a physical platform provide a convincing evaluation about whether our method can bridge the sim-to-real gap.
In our distributed multi-robot systems, each robot is mounted a sensor for measuring the distance to the surrounding obstacles, a localization system for measuring the distance from the robot to its goal, and an on-board computer for computing the navigation velocity in real-time. We use the Hokuyo URG-04LX-UG01 2D LiDAR as our laser scanner, the Pozyx localization system based on the UWB (Ultra-Wide Band) technology to localize our ground vehicles’ 2D positions, and the Nvidia Jetson TX1 as our on-board computing device. The detail about each component of our sensor kit is as follows:
LiDAR: Its measurement distance is from to . Its precision is when the measuring distance is between and , and the precision is when the measuring distance is between to . Its angle range is [ , ] (but we only use the measurements from to ), and the angle resolution is . The sensor’s update frequency is .
UWB localization system: Its measurement accuracy is around in our experiment. Note that unlike other centralized collision avoidance systems, our solution does not need the localization for collision avoidance and thus the precision is sufficient.
On-board computer: Nvidia Jetson TX1 is an embedded computing platform consisting of a Quad-core ARM A57 CPU and a 256-core Maxwell graphics processor.
6.2 Multi-robot scenarios
In this section, we design a series of real-world multi-robot navigation experiments to test the performance of our Hybrid-RL policy on the physical robots. First, we test the performance of the robots moving in a basic scenario without any obstacles. Next, we validate the policy on robotic platforms with more complex scenarios by adding static and moving obstacles (e.g. pedestrians). In addition, we increase the number of robots in the same workspace (i.e., increasing the density of robots) to further challenge the navigation algorithm’s capability. Finally, we design a scenario which emulates the autonomous warehouse application to provide a comprehensive evaluation about our distributed collision avoidance policy.
As the first step, we use a swap scenario (Figure 24(a)) and a crossing scenario (Figure 24(b)) to demonstrate that two robots can reliably avoid each other based on our Hybrid-RL policy. Although the size, shape and dynamic characteristics of these physical robots are different from the agents trained in the simulation, our deep reinforcement learning based collision avoidance policy still performs well in these two scenarios.
Next, we add static obstacles in the swap benchmark and test whether the two robots can still successfully navigate in this complex scenario in a map-less manner. From the trajectories in Figure 24(c), we observe that both robots smoothly avoid static obstacles that are randomly placed in the environment. Then we allow two pedestrians to interfere the robots in the same environment. Taking the trajectories shown in Figure 24(d) for example, the two robots adaptively slow down and wait for the pedestrians and then start again towards the goal after the pedestrian passes. It is important to note that there is no pedestrian data introduced in the training process, and the pedestrians’ shape and dynamic characteristics are quite different from other robots and static obstacles. Moreover, we do not leverage any pedestrian tracking or prediction methods in the robot’s navigation pipeline. As illustrated in the qualitative results, our learned policy demonstrates an excellent generalization capability in collision avoidance no only for the static obstacles but also for the dynamic obstacles. Hence, based on our learned policy without fine-tuning, the robots can easily adapt to new and more challenging tasks.
We further add another two robots in the two-robot benchmarks and design a four-robot benchmark as shown in Figure 26. From the resulting trajectories, we observe that all robots reach their goals successfully without any collisions. However, we also notice that the quality of the robots’ trajectories is lower than that in the simpler benchmarks demonstrated in Figure 25. This is mainly due to the positioning error of the UWB localization system and the disparity in robot hardwares, both of which are amplified as the number of robots increases.
Finally, we design two scenarios which emulate the autonomous warehouse application, where multiple robots work together to achieve efficient and flexible transportation of commodities in an unstructured warehouse environment for a long time. In the first scenario as illustrated in Figure 26(a), there are two transportation stations in the warehouse and two robots are responsible for transporting objects between these two stations. Next, we add random moving pedestrians as the co-workers in the warehouse and build the second scenario as illustrated in Figure 26(b) and Figure 28. The second scenario is more complex, because the robots and the humans share the workspace, especially that the workers may block the robot, which challenges the safety of our multi-robot collision avoidance policy. As illustrated from the results and the attached video, our policy enables multiple robots to accomplish efficient and safe transportation in these two scenarios. In our test, the entire system can safely run for more than thirty minutes without any faults. Please refer to our attached videos for more details.
6.3 Robotic collision avoidance in dense human crowds
There are many previous work about navigating a robot toward its goals in a dense human crowd without any collisions (?, ?). The typical pipeline is first predicting pedestrian movements (?, ?, ?, ?, ?) and then using reactive control (similar to (?, ?)) or local planning for collision avoidance. Since it is difficult to predict the human motion accurately, such a pipeline is not reliable for practical applications.
In this section, we use our reinforcement learning based collision avoidance policy to achieve high-quality navigation in a dense human crowd. Our method is applied for navigation tasks involving multiple cooperative robots in dynamic unstructured environments, but it also provides high-quality collision avoidance policy for a single robot task in a dynamic unstructured environment with non-cooperative agents (e.g., pedestrians).
Our experimental setup is as follows. We attach one UWB localization tag to a person so that the mobile robot will follow this person according to the UWB signal. Hence, the person controls the robot’s desired motion direction by providing a time-varying goal. For evaluation, this person leads the robot into the dense human stream, which provides a severe challenge to the navigation policy. We test the performance of our method in a wide variety of scenarios using different types of robots.
For the testing mobile robot platform, we use four platforms, including the Turtlebot platform, the Igor robot from Hebi robotics, the Baidu Bear robot, and the Baidu shopping cart from Baidu Inc. These four different robots have their own characteristics. The Turtlebot (Figure 28(a)) has a size bigger than the -radius prototype agent used in the training process; the Igor robot (Figure 28(b)) is a self-balancing wheeled R/C robot; the BaiduBear robot is a human-alike service robot (Figure 28(c)); and the Baidu shopping cart used differential wheels as the mobile base, whose weight is quite different with the Turtlebot. The maximum linear velocities for these robots are , , , and , respectively.
Our testing scenarios cover the real-world environments that an mobile robot may encounter, including the canteen, farmer’s market, outdoor street, and indoor conference room, as shown in Figure 30. The Turtlebot running our Hybrid-RL policy safely avoids pedestrians and static obstacles with different shapes, even in some extremely challenging situations, e.g., when curious people suddenly block the robot from time to time. However, the Turtlebot’s behavior is not perfect. In particular, the robot tends to accelerate and decelerate abruptly within dense obstacles. Such behavior is reasonable according to the objective function of the policy optimization algorithm (in Equation 3), because the robot will move to its goal as fast as possible in this way and its movement is more efficient. A refined objective function for optimizing a smooth motion in dense crowds would be left as our future work.
More experiments on different robotic platforms deploying our policy work in different highly dynamic scenarios are available in Figure 31, Figure 32, and Figure 33. Please refer to the video for more details.
We demonstrate that our multi-robot collision avoidance policy can be well deployed to different types of real robots, though their shape, size, and dynamic characteristics are quite different from the prototype agents used in the training simulation scenarios. We validate the learned policy on various real-world scenarios and show that our Hybrid-RL policy can run robustly for a long time without collisions and the robots’ movement easily adapt with the pedestrians in the warehouse scenario. We also verify the possibility of using the Hybrid-RL policy to enable a single robot to navigate through a dense human crowd efficiently and safely.
In this paper, we present a multi-scenario multi-stage training framework to optimize a fully decentralized sensor-level collision avoidance policy with a robust policy gradient algorithm. We evaluate the performance of our method using a series of comprehensive experiments and demonstrate that the learned policy significantly outperforms the state-of-the-art approaches in terms of the success rate, the navigation efficiency, and the generalization capability. Our learned policy can also be used to navigate a single robot through a dense pedestrian crowd, and illustrate the excellent performance in a wide variety of scenarios and on different robot platforms.
Our work serves as the first step towards reducing the navigation performance gap between the centralized and decentralized methods, though we are fully aware of that, as a local collision avoidance method, our approach cannot completely replace a global path planner when scheduling many robots to navigate through complex environments with dense obstacles or with dense pedestrians. Our future work would be how to incorporate our approach with classical mapping methods (e.g. SLAM) and global path planners (e.g. RRT and A) to achieve satisfactory performance for planning a safe trajectory through a dynamic environment.
The authors would like to thank Hao Zhang from Dorabot Inc. and Ruigang Yang from Baidu Inc. for their support to us when preparing the physical robot experiments. The authors would like to thank Dinesh Manocha from the University of Maryland for his constructive discussion about the method and the experiments.
- AdouaneAdouane Adouane, L. (2009). Hybrid and safe control architecture for mobile robot navigation. In 9th conference on autonomous robot systems and competitions.
- Alahi et al.Alahi et al. Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S. (2016). Social LSTM: Human trajectory prediction in crowded spaces. In Ieee conference on computer vision and pattern recognition.
- Alonso-Mora, Baker, RusAlonso-Mora et al. Alonso-Mora, J., Baker, S., Rus, D. (2017). Multi-robot formation control and object transport in dynamic environments via constrained optimization. The International Journal of Robotics Research, 36(9), 1000–1021.
- Alonso-Mora, Breitenmoser, Rufli, Beardsley, SiegwartAlonso-Mora et al. Alonso-Mora, J., Breitenmoser, A., Rufli, M., Beardsley, P., Siegwart, R. (2013). Optimal reciprocal collision avoidance for multiple non-holonomic robots. In Distributed autonomous robotic systems (pp. 203–216). Springer.
- Alur, Esposito, Kim, Kumar, LeeAlur et al. Alur, R., Esposito, J., Kim, M., Kumar, V., Lee, I. (1999). Formal modeling and analysis of hybrid systems: A case study in multi-robot coordination. In International symposium on formal methods (pp. 212–232).
- Amodei et al.Amodei et al. Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., … others (2016). Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on international conference on machine learning (pp. 173–182).
- Balch ArkinBalch Arkin Balch, T., Arkin, R. C. (1998). Behavior-based formation control for multirobot teams. IEEE transactions on robotics and automation, 14(6), 926–939.
- Bareiss van den BergBareiss van den Berg Bareiss, D., van den Berg, J. (2015). Generalized reciprocal collision avoidance. The International Journal of Robotics Research, 34(12), 1501–1514.
- Barreto, Munos, Schaul, SilverBarreto et al. Barreto, A., Munos, R., Schaul, T., Silver, D. (2017). Successor features for transfer in reinforcement learning. In Neural information processing systems (nips).
- Bengio, Louradour, Collobert, WestonBengio et al. Bengio, Y., Louradour, J., Collobert, R., Weston, J. (2009). Curriculum learning. In International conference on machine learning (pp. 41–48).
- J. Chen SunJ. Chen Sun Chen, J., Sun, D. (2011). Resource constrained multirobot task allocation based on leader–follower coalition methodology. The International Journal of Robotics Research, 30(12), 1423–1434.
- J. Chen, Sun, Yang, ChenJ. Chen et al. Chen, J., Sun, D., Yang, J., Chen, H. (2010). Leader-follower formation control of multiple non-holonomic mobile robots incorporating a receding-horizon scheme. The International Journal of Robotics Research, 29(6), 727–747.
- Y. F. Chen, Everett, Liu, HowY. F. Chen, Everett, et al. Chen, Y. F., Everett, M., Liu, M., How, J. P. (2017). Socially aware motion planning with deep reinforcement learning. arXiv:1703.08862.
- Y. F. Chen, Liu, Everett, HowY. F. Chen, Liu, et al. Chen, Y. F., Liu, M., Everett, M., How, J. P. (2017). Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In International conference on robotics and automation (pp. 285–292).
- Claes, Hennes, Tuyls, MeeussenClaes et al. Claes, D., Hennes, D., Tuyls, K., Meeussen, W. (2012). Collision avoidance under bounded localization uncertainty. In Intelligent robots and systems (iros), 2012 ieee/rsj international conference on (pp. 1192–1198).
- De La Cruz CarelliDe La Cruz Carelli De La Cruz, C., Carelli, R. (2008). Dynamic model based formation control and obstacle avoidance of multi-robot systems. Robotica, 26(3), 345–356.
- Egerstedt HuEgerstedt Hu Egerstedt, M., Hu, X. (2002). A hybrid control approach to action coordination for mobile robots. Automatica, 38(1), 125–130.
- Ess, Schindler, Leibe, Van GoolEss et al. Ess, A., Schindler, K., Leibe, B., Van Gool, L. (2010). Object detection and tracking for autonomous navigation in dynamic environments. The International Journal of Robotics Research, 29(14), 1707–1725.
- Everett, Chen, HowEverett et al. Everett, M., Chen, Y. F., How, J. P. (2018). Motion planning among dynamic, decision-making agents with deep reinforcement learning. arXiv preprint arXiv:1805.01956.
- Flacco, KrÃ¶ger, Luca, KhatibFlacco et al. Flacco, F., KrÃ¶ger, T., Luca, A. D., Khatib, O. (2012). A depth space approach to human-robot collision avoidance. In Ieee international conference on robotics and automation (p. 338-345).
- Frans, Ho, Chen, Abbeel, SchulmanFrans et al. Frans, K., Ho, J., Chen, X., Abbeel, P., Schulman, J. (2017). Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767.
- Gillula, Hoffmann, Huang, Vitus, TomlinGillula et al. Gillula, J. H., Hoffmann, G. M., Huang, H., Vitus, M. P., Tomlin, C. J. (2011). Applications of hybrid reachability analysis to robotic aerial vehicles. The International Journal of Robotics Research, 30(3), 335–354.
- Godoy, Karamouzas, Guy, GiniGodoy et al. Godoy, J., Karamouzas, I., Guy, S. J., Gini, M. (2016a). Implicit coordination in crowded multi-agent navigation. In Thirtieth aaai conference on artificial intelligence.
- Godoy, Karamouzas, Guy, GiniGodoy et al. Godoy, J., Karamouzas, I., Guy, S. J., Gini, M. L. (2016b). Moving in a crowd: Safe and efficient navigation among heterogeneous agents. In Ijcai (pp. 294–300).
- Graves, Mohamed, HintonGraves et al. Graves, A., Mohamed, A.-R., Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 ieee international conference on acoustics, speech and signal processing (pp. 6645–6649).
- Gupta, Johnson, Fei-Fei, Savarese, AlahiGupta et al. Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., Alahi, A. (2018). Social GAN: Socially acceptable trajectories with generative adversarial networks. In Ieee conference on computer vision and pattern recognition.
- He, Zhang, Ren, SunHe et al. He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Ieee conference on computer vision and pattern recognition.
- Heess et al.Heess et al. Heess, N., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., … others (2017). Emergence of locomotion behaviours in rich environments. arXiv:1707.02286.
- Hennes, Claes, Meeussen, TuylsHennes et al. Hennes, D., Claes, D., Meeussen, W., Tuyls, K. (2012). Multi-robot collision avoidance with localization uncertainty. In International conference on autonomous agents and multiagent systems-volume 1 (pp. 147–154).
- Hu SunHu Sun Hu, S., Sun, D. (2011). Automatic transportation of biological cells with a robot-tweezer manipulation system. The International Journal of Robotics Research, 30(14), 1681–1694.
- Kahn, Villaflor, Pong, Abbeel, LevineKahn et al. Kahn, G., Villaflor, A., Pong, V., Abbeel, P., Levine, S. (2017). Uncertainty-aware reinforcement learning for collision avoidance. arXiv:1702.01182.
- Kim et al.Kim et al. Kim, S., Guy, S. J., Liu, W., Wilkie, D., Lau, R. W. H., Lin, M. C., Manocha, D. (2015). BRVO: predicting pedestrian trajectories using velocity-space reasoning. International Journal of Robotics Research, 34(2), 201–217.
- Kingma BaKingma Ba Kingma, D., Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
- Krizhevsky, Sutskever, HintonKrizhevsky et al. Krizhevsky, A., Sutskever, I., Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).
- Lenz, Knepper, SaxenaLenz et al. Lenz, I., Knepper, R., Saxena, A. (2015). Deepmpc: Learning deep latent features for model predictive control. In Rss.
- Levine, Finn, Darrell, AbbeelLevine et al. Levine, S., Finn, C., Darrell, T., Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39), 1–40.
- Long, Fan, et al.Long, Fan, et al. Long, P., Fan, T., Liao, X., Liu, W., Zhang, H., Pan, J. (2017). Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In Ieee international conference on robotics and automation.
- Long, Liu, PanLong, Liu, Pan Long, P., Liu, W., Pan, J. (2017). Deep-learned collision avoidance policy for distributed multiagent navigation. Robotics and Automation Letters, 2(2), 656–663.
- Luna BekrisLuna Bekris Luna, R., Bekris, K. E. (2011). Efficient and complete centralized multi-robot path planning. In Intelligent robots and systems (iros), 2011 ieee/rsj international conference on (pp. 3268–3275).
- Michael, Fink, KumarMichael et al. Michael, N., Fink, J., Kumar, V. (2011). Cooperative manipulation and transportation with aerial robots. Autonomous Robots, 30(1), 73–86.
- Muller, Ben, Cosatto, Flepp, CunMuller et al. Muller, U., Ben, J., Cosatto, E., Flepp, B., Cun, Y. L. (2006). Off-road obstacle avoidance through end-to-end learning. In Advances in neural information processing systems (pp. 739–746).
- Nair HintonNair Hinton Nair, V., Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In International conference on machine learning (pp. 807–814).
- Ondruska, Dequaire, Zeng Wang, PosnerOndruska et al. Ondruska, P., Dequaire, J., Zeng Wang, D., Posner, I. (2016). End-to-End Tracking and Semantic Segmentation Using Recurrent Neural Networks. In Robotics: Science and systems, workshop on limits and potentials of deep learning in robotics.
- Ondruska PosnerOndruska Posner Ondruska, P., Posner, I. (2016). Deep tracking: Seeing beyond seeing using recurrent neural networks. In Aaai conference on artificial intelligence. Phoenix, Arizona USA.
- Peng, Andrychowicz, Zaremba, AbbeelPeng et al. Peng, X. B., Andrychowicz, M., Zaremba, W., Abbeel, P. (2017). Sim-to-real transfer of robotic control with dynamics randomization. arXiv:1710.06537.
- Pfeiffer, Schaeuble, Nieto, Siegwart, CadenaPfeiffer et al. Pfeiffer, M., Schaeuble, M., Nieto, J., Siegwart, R., Cadena, C. (2017). From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. In International conference on robotics and automation (pp. 1527–1533).
- Quottrup, Bak, ZamanabadiQuottrup et al. Quottrup, M. M., Bak, T., Zamanabadi, R. (2004). Multi-robot planning: A timed automata approach. In Robotics and automation, 2004. proceedings. icra’04. 2004 ieee international conference on (Vol. 5, pp. 4417–4422).
- Ross et al.Ross et al. Ross, S., Melik-Barkhudarov, N., Shankar, K. S., Wendel, A., Dey, D., Bagnell, J. A., Hebert, M. (2013). Learning monocular reactive uav control in cluttered natural environments. In International conference on robotics and automation (pp. 1765–1772).
- Rusu et al.Rusu et al. Rusu, A. A., Vecerik, M., Rothörl, T., Heess, N., Pascanu, R., Hadsell, R. (2016). Sim-to-real robot learning from pixels with progressive nets. In Conference on robot learning.
- Sadeghi LevineSadeghi Levine Sadeghi, F., Levine, S. (2017). Cad2rl: Real single-image flight without a single real image. In Robotics: Science and system.
- Schulman, Levine, Abbeel, Jordan, MoritzSchulman, Levine, et al. Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning (pp. 1889–1897).
- Schulman, Moritz, Levine, Jordan, AbbeelSchulman, Moritz, et al. Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438.
- Schulman, Wolski, Dhariwal, Radford, KlimovSchulman et al. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
- Schwartz SharirSchwartz Sharir Schwartz, J. T., Sharir, M. (1983). On the piano movers’ problem: Iii. coordinating the motion of several independent bodies: The special case of circular bodies moving amidst polygonal barriers. The International Journal of Robotics Research, 2(3), 46–75.
- Sergeant, Sünderhauf, Milford, UpcroftSergeant et al. Sergeant, J., Sünderhauf, N., Milford, M., Upcroft, B. (2015). Multimodal deep autoencoders for control of a mobile robot. In Australasian conference on robotics and automation.
- Sharon, Stern, Felner, SturtevantSharon et al. Sharon, G., Stern, R., Felner, A., Sturtevant, N. R. (2015). Conflict-based search for optimal multi-agent pathfinding. Artificial Intelligence, 219, 40–66.
- Shucker, Murphey, BennettShucker et al. Shucker, B., Murphey, T., Bennett, J. K. (2007). Switching rules for decentralized control with simple control laws. In American control conference, 2007. acc’07 (pp. 1485–1492).
- Snape, Van Den Berg, Guy, ManochaSnape et al. Snape, J., Van Den Berg, J., Guy, S. J., Manocha, D. (2010). Smooth and collision-free navigation for multiple robots under differential-drive constraints. In International conference on intelligent robots and systems (pp. 4584–4589).
- Snape, van den Berg, Guy, ManochaSnape et al. Snape, J., van den Berg, J., Guy, S. J., Manocha, D. (2011). The hybrid reciprocal velocity obstacle. Transactions on Robotics, 27(4), 696–706.
- Stone VelosoStone Veloso Stone, P., Veloso, M. (1999). Task decomposition, dynamic role assignment, and low-bandwidth communication for real-time strategic teamwork. Artificial Intelligence, 110(2), 241–273.
- Sun, Wang, Shang, FengSun et al. Sun, D., Wang, C., Shang, W., Feng, G. (2009). A synchronization approach to trajectory tracking of multiple mobile robots while maintaining time-varying formations. IEEE Transactions on Robotics, 25(5), 1074–1086.
- Tai, Paolo, LiuTai et al. Tai, L., Paolo, G., Liu, M. (2017). Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In International conference on intelligent robots and systems.
- Tang, Thomas, KumarTang et al. Tang, S., Thomas, J., Kumar, V. (2018). Hold or take optimal plan (hoop): A quadratic programming approach to multi-robot trajectory generation. The International Journal of Robotics Research, 0278364917741532.
- Tobin et al.Tobin et al. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent robots and systems (iros), 2017 ieee/rsj international conference on (pp. 23–30).
- Turpin, Michael, KumarTurpin et al. Turpin, M., Michael, N., Kumar, V. (2014). Capt: Concurrent assignment and planning of trajectories for multiple robots. The International Journal of Robotics Research, 33(1), 98–112.
- van den Berg, Guy, Lin, Manochavan den Berg et al. van den Berg, J., Guy, S. J., Lin, M., Manocha, D. (2011a). International symposium on robotics research. In C. Pradalier, R. Siegwart, G. Hirzinger (Eds.), (pp. 3–19). Berlin, Heidelberg: Springer Berlin Heidelberg.
- van den Berg, Guy, Lin, Manochavan den Berg et al. van den Berg, J., Guy, S. J., Lin, M., Manocha, D. (2011b). Reciprocal n-body collision avoidance. In Robotics research (pp. 3–19). Springer.
- van den Berg, Lin, Manochavan den Berg et al. van den Berg, J., Lin, M., Manocha, D. (2008). Reciprocal velocity obstacles for real-time multi-agent navigation. In International conference on robotics and automation (pp. 1928–1935).
- Yi, Li, WangYi et al. Yi, S., Li, H., Wang, X. (2016). Pedestrian behavior understanding and prediction with deep neural networks. In European conference on computer vision (pp. 263–279).
- Yu LaValleYu LaValle Yu, J., LaValle, S. M. (2016). Optimal multirobot path planning on graphs: Complete algorithms and effective heuristics. IEEE Transactions on Robotics, 32(5), 1163–1177.
- J. Zhang, Springenberg, Boedecker, BurgardJ. Zhang et al. Zhang, J., Springenberg, J. T., Boedecker, J., Burgard, W. (2016). Deep reinforcement learning with successor features for navigation across similar environments. arXiv:1612.05533.
- T. Zhang, Kahn, Levine, AbbeelT. Zhang et al. Zhang, T., Kahn, G., Levine, S., Abbeel, P. (2016). Learning deep control policies for autonomous aerial vehicles with mpc-guided policy search. In Ieee international conference on robotics and automation (p. 528-535).
- Zhou, Wang, Bandyopadhyay, SchwagerZhou et al. Zhou, D., Wang, Z., Bandyopadhyay, S., Schwager, M. (2017). Fast, on-line collision avoidance for dynamic vehicles using buffered voronoi cells. IEEE Robotics and Automation Letters, 2(2), 1047–1054.
- Zhu et al.Zhu et al. Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., Farhadi, A. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. In International conference on robotics and automation (pp. 3357–3364).