Safe Reinforcement Learning with Model Uncertainty Estimates
Many current autonomous systems are being designed with a strong reliance on black box predictions from deep neural networks (DNNs). However, DNNs tend to be overconfident in predictions on unseen data and can give unpredictable results for far-from-distribution test data. The importance of predictions that are robust to this distributional shift is evident for safety-critical applications, such as collision avoidance around pedestrians. Measures of model uncertainty can be used to identify unseen data, but the state-of-the-art extraction methods such as Bayesian neural networks are mostly intractable to compute. This paper uses MC-Dropout and Bootstrapping to give computationally tractable and parallelizable uncertainty estimates. The methods are embedded in a Safe Reinforcement Learning framework to form uncertainty-aware navigation around pedestrians. The result is a collision avoidance policy that knows what it does not know and cautiously avoids pedestrians that exhibit unseen behavior. The policy is demonstrated in simulation to be more robust to novel observations and take safer actions than an uncertainty-unaware baseline.
Reinforcement learning (RL) is used to produce state-of-the-art results in manipulation, motion planning and behavior prediction. However, the underlying neural networks often lack the capability to produce high-quality predictive uncertainty estimates and tend to be overconfident on out-of-distribution test data [Amodei_2016, Lakshmi_2016, Hendrycks_2017]. In safety-critical tasks, such as collision avoidance of cars or pedestrians, incorrect but confident predictions on unseen data can lead to fatal failure [Tesla_2016]. We investigate methods for Safe RL that are robust to unseen observations and “know what they do not know”, so that they can raise an alarm in unpredictable test cases and ultimately take safer actions.
A particularly challenging safety-critical task is avoiding pedestrians in a campus environment with an autonomous shuttle bus or rover [Miller_2016, Navya_2018]. Humans achieve mostly collision-free navigation by understanding the hidden intentions of other pedestrians and vehicles and interacting with them [Zheng_2015, Helbing_1995]. Furthermore, most of the time this interaction is accomplished without verbal communication. Our prior work uses RL to capture the hidden intentions and achieve collaborative navigation around pedestrians [Chen_2016, Chen_2017, Everett_2018]. However, RL approaches always face the problem of generalizability from simulation to the real world and cannot guarantee performance on far-from-training test data. For example, a policy that has only been trained on collaborative pedestrians could fail to generalize to uncollaborative pedestrians in the real world. The trained policy would output a best-guess action that might assume collaborative behavior and, without flagging the observation as novel, fail ungracefully. To avoid such failure cases, this paper develops a Safe RL framework for dynamic collision avoidance that expresses novel observations in the form of model uncertainty. The framework further reasons about the uncertainty and cautiously avoids regions of high uncertainty, as displayed in Fig. 5.
Much of the existing Safe RL research has focused on using external novelty detectors or internal modifications to identify environment or model uncertainty [Garcia_2015]. Note that our work targets model uncertainty estimates because they potentially reveal sections of the test data where training data was sparse and a model could fail to generalize [Gal_2016Thesis]. Work in risk-sensitive RL (RSRL) often focuses on environment uncertainty to detect and avoid high-risk events that are known from training to have low probability but high cost [Geibel_2006, Mihatsch_2002, Shen_2013, Tamar_2015, Evendar_2006]. Other work in RSRL targets model uncertainty in MDPs, but does not readily apply to neural networks [Chow_2015, Mihatsch_2002]. Our work is mainly orthogonal to risk-sensitive RL approaches and could be combined into an RL policy that is robust to unseen data and sensitive to high-risk events.
Extracting model uncertainty from discriminatively trained neural networks is complex, as the model outcome for a given observation is deterministic. Mostly, Bayesian neural networks are used to extract model uncertainty but require a significant restructuring of the network architecture [Neal_1996]. Additionally, even approximate forms, such as Markov Chain Monte Carlo [Neal_1996] or variational methods [Blundell_2015, Graves_2011, Louizos_2016], come with extensive computational cost and have a sample-dependent accuracy [Neal_1996, Lakshmi_2016, Springenberg_2016]. Our work uses Monte Carlo Dropout (MC-Dropout) [Gal_2015] and bootstrapping [Osband_2016] to give parallelizable and computationally feasible uncertainty estimates of the neural network without significantly restructuring the network architecture [Dropout_2014, Bootstrap_1995].
The main contributions of this work are i) an algorithm that identifies novel pedestrian observations and ii) avoids them more cautiously and safely than an uncertainty-unaware baseline, iii) an extension of an existing uncertainty-aware reinforcement learning framework [Kahn_2017] to more complex dynamic environments with exploration-aiding methods, and iv) a demonstration in a simulation environment.
II Related Work
This section investigates related work in Safe Reinforcement Learning to develop a dynamic collision avoidance policy that is robust to out-of-data observations.
II-A External verification and novelty detection
Many related works use off-policy evaluation or external novelty detection to verify the learned RL policy [Richter_2017, Long_2018, Garcia_2015]. Reachability analysis could verify the policy by providing regional safety bounds, but the bounds would be too conservative in a collaborative pedestrian environment [Lygeros_1999, Majumdar_2016, Perkins_2003]. Novelty detection approaches place a threshold on the detector’s output and switch to a safety controller if the threshold is exceeded. This requires the knowledge of a safety controller that can act in a complex collaborative pedestrian environment. Moreover, there is no known mechanism of gradually switching from an RL policy to a safety controller, because the latter has no knowledge about the RL policy’s decision-making process. An example failure case would be a pedestrian in front of a robot whom the RL policy plans to pass on the left and the safety controller plans to pass on the right: an interpolation of the two could collide in the middle [Amini_2017]. In our framework, the understanding of pedestrian behavior and the knowledge of uncertainty are combined to allow a vehicle to stay gradually further away from unpredictable and uncertain regions, as seen in Fig. 3.
II-B Environment and model uncertainty
This paper focuses on detecting novel observations via model uncertainty, also known as parametric or epistemic uncertainty [Kendall_2017]. The orthogonal concept of environment uncertainty does not detect out-of-data points, as it captures the uncertainty due to the imperfect nature of partial observations [Gal_2016Thesis]. For example, an observation of a pedestrian trajectory will, even with infinite training in the real world, not fully capture the decision-making process of pedestrians and thus be occasionally ambiguous: will she turn left or right? The RL framework accounts for the unobservable decision ambiguity by learning a mean outcome [Gal_2016Thesis]. Model uncertainty, in comparison, captures how well a model fits all possible observations from the environment. It could be explained away with infinite observations and is typically high in applications with limited training data, or with test data that is far from the training data [Gal_2016Thesis]. Thus, the model uncertainty captures cases in which a model fails to generalize to unseen test data and hints when one should not trust the network predictions [Gal_2016Thesis].
II-C Measures of model uncertainty
A recent line of work computes approximations of Bayesian inference without significantly changing the neural network’s architecture. Bootstrapping has been explored to generate approximate uncertainty measures to guide exploration [Osband_2016]. When an ensemble of networks is trained on partially overlapping dataset samples, the networks agree in areas of common data and disagree, and thus have a large sample variance, in regions of uncommon data [Lakshmi_2016, Osband_2016]. Dropout can be interpreted similarly, if it is activated during test-time, and has been shown to approximate Bayesian inference in Gaussian processes [Dropout_2014, Gal_2015]. An alternative approach uses a Hypernet, a network that learns the weights of another network to directly give parameter uncertainty values, but was shown to be computationally too expensive [Pawlowski_2017]. An innovative, but controversial, approach claims to retrieve Bayesian uncertainty estimates via batch normalization [Teye_2018]. This work uses MC-Dropout and bootstrapping to give computationally tractable uncertainty estimates.
II-D Applications of model uncertainty in RL
Measures of model uncertainty have been used in RL very recently to speed up training by guiding the exploration into regions of high uncertainty [Thompson_1933, Osband_2016, Liu_2017]. Kahn et al. used uncertainty estimates in model-based RL for static obstacle collision avoidance [Kahn_2017]. Instead of a model-based RL approach, one could argue for using model-free RL and drawing the uncertainty of an optimal policy output. However, the uncertainty estimate would contain a mix of the uncertainties of multiple objectives and would not focus on the uncertain region of collision. Our work extends the model-based framework by [Kahn_2017] to the highly complex domain of pedestrian collision avoidance. We further extend [Kahn_2017] by using the uncertainty estimates for guided exploration to escape locally optimal policies, analyzing the regional increase of uncertainty in novel dynamic scenarios, using LSTMs, and acting in a goal-guided manner.
This work proposes an algorithm that uses uncertainty information to cautiously avoid dynamic obstacles in novel scenarios. As displayed in the system architecture in Fig. 2, an agent observes a simulated obstacle’s position and velocity, and the goal. A set of Long Short-Term Memory (LSTM) [Hochreiter_1997] networks predicts collision probabilities for a set of motion primitives. MC-Dropout and bootstrapping are used to acquire a distribution over the predictions. From the predictions, a sample mean and variance are computed for each motion primitive. In parallel, a simple model estimates the time to goal at the end of each evaluated motion primitive. In the next stage, the minimal cost motion primitive is selected and executed for one step in the environment. The environment returns the next observation and, at the end of an episode, a collision label. After a set of episodes, the network weights are updated and the training process continues. Each section of the algorithm is explained in detail below.
III-A Collision Prediction Network
A set of LSTM networks (ensemble) estimates the probability that a motion primitive would lead to a collision in the next time steps, given the history of observations and past actions. The observation history contains the past and current relative goal position and a pedestrian’s position, velocity and radius. Each motion primitive is a straight line, described through a heading angle and speed. The optimal motion primitive is taken for one time step until the network is queried again.
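As an illustration of the straight-line motion primitives described above, the following minimal sketch enumerates heading and speed pairs and rolls one of them out in the ground plane. The number of headings, the angular range, the speed values, the horizon and the time step are all assumptions made for this example and are not taken from the paper.

```python
import math

def make_motion_primitives(num_headings=11, max_angle=math.pi / 2, speeds=(0.5, 1.0)):
    """Enumerate (heading, speed) pairs with headings spread evenly over
    [-max_angle, max_angle]. All default values are illustrative assumptions."""
    primitives = []
    for i in range(num_headings):
        heading = -max_angle + 2.0 * max_angle * i / (num_headings - 1)
        for speed in speeds:
            primitives.append((heading, speed))
    return primitives

def rollout(primitive, start=(0.0, 0.0), horizon=5, dt=0.2):
    """Positions along a straight-line primitive, one per time step (start excluded)."""
    heading, speed = primitive
    x, y = start
    return [(x + speed * math.cos(heading) * k * dt,
             y + speed * math.sin(heading) * k * dt)
            for k in range(1, horizon + 1)]

primitives = make_motion_primitives()
path = rollout((0.0, 1.0))  # straight ahead at unit speed
```

Each primitive would be scored by the collision prediction network, and only the first step of the selected primitive is executed before the network is queried again.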
LSTM networks are chosen for the dynamic obstacle avoidance because they are state-of-the-art models for predicting pedestrian paths and best capture the hidden temporal intentions of pedestrians [Alahi_2016_CVPR, Vemula_2017]. Based on this success, the proposed work first applies LSTMs to pedestrian avoidance in an RL setting. For safe avoidance, LSTM predictions need to be accurate from the first time step a pedestrian is observed in the robot’s field of view. To handle the variable-length observation input, masking [Che_2018] is used during training and test to deactivate LSTM cells that exceed the length of the observation history.
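The masking idea can be sketched independently of any deep learning library: a variable-length observation history is padded to a fixed length, and a 0/1 mask marks which steps are real so that padded steps can be ignored. The function names and the zero-padding convention below are assumptions for illustration and not the paper's implementation.

```python
def pad_and_mask(history, max_len, feature_dim):
    """Right-pad an observation history with zero vectors and return a 0/1 validity mask."""
    if len(history) > max_len:
        history = history[-max_len:]  # keep only the most recent observations
    pad = max_len - len(history)
    padded = [list(obs) for obs in history] + [[0.0] * feature_dim for _ in range(pad)]
    mask = [1] * len(history) + [0] * pad
    return padded, mask

def masked_last_state(states, mask):
    """Return the recurrent state at the last valid step, skipping padded steps."""
    if not any(mask):
        return None  # nothing has been observed yet
    last_valid = sum(mask) - 1
    return states[last_valid]

padded, mask = pad_and_mask([[1.0, 2.0]], max_len=3, feature_dim=2)
```

With a single observation, only the first step is marked valid, so a prediction can already be formed from the first time step a pedestrian is seen.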
III-B Uncertainty Estimates with MC-Dropout and Bootstrapping
MC-Dropout [Gal_2015] and bootstrapping [Osband_2016, Lakshmi_2016] are used to compute stochastic estimates of the model uncertainty. For bootstrapping, multiple networks are trained and stored in an ensemble. Each network is randomly initialized and trained on sample datasets that have been drawn with replacement from a bigger experience dataset [Osband_2016]. By being trained on different but overlapping sections of the observation space, the network predictions differ for uncommon observations and are similar for common observations. As each network can be trained and tested in parallel, bootstrapping does not come with significant computational cost and can be run on a real robot.
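Drawing the per-network training sets with replacement can be sketched in a few lines. The function name, the ensemble size and the sample size below are hypothetical choices for this example.

```python
import random

def bootstrap_datasets(experience, num_networks, sample_size, seed=0):
    """Draw one training set per ensemble member, sampling with replacement
    from the shared experience dataset."""
    rng = random.Random(seed)
    return [rng.choices(experience, k=sample_size) for _ in range(num_networks)]

# Toy experience dataset: integers stand in for (observation, action, label) tuples.
experience = list(range(100))
datasets = bootstrap_datasets(experience, num_networks=5, sample_size=50)
```

Because the samples overlap only partially, networks trained on them see the common parts of the observation space repeatedly and the rare parts unevenly, which is what produces disagreement on uncommon observations.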
Dropout [Dropout_2014] is traditionally used for regularizing networks. It randomly deactivates network units in each forward pass by multiplying the unit weights with a dropout mask. The dropout mask is a set of Bernoulli random variables with values in $\{0, 1\}$ and a keeping probability $p$. Traditionally, dropout is deactivated during test and each unit is multiplied with $p$. However, [Gal_2015] has shown that an activation of dropout during test, named MC-Dropout, gives model uncertainty estimates by approximating Bayesian inference in deep Gaussian processes. To retrieve the model uncertainty with dropout, our work executes multiple forward passes per network in the bootstrapped ensemble with different dropout masks and acquires a distribution over predictions. Although dropout has been seen to be overconfident on novel observations [Osband_2016], Table I shows that the combination of bootstrapping and dropout reliably detects novel scenarios.
The sample mean and variance are computed from the parallelizable collision predictions of each network and each dropout mask.
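The combination of an ensemble with several stochastic forward passes per member can be sketched as follows. A tiny randomly initialized feed-forward network stands in for a trained LSTM, and the layer sizes, the keep probability and the inverted-dropout scaling are assumptions made for this example only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_network(rng, n_in, n_hidden):
    """Randomly initialized one-hidden-layer network (stand-in for a trained LSTM)."""
    return {
        "w1": [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "w2": [rng.gauss(0, 1) for _ in range(n_hidden)],
    }

def forward_with_dropout(net, x, keep_prob, rng):
    """One stochastic forward pass; a fresh Bernoulli mask is sampled on the hidden units."""
    out = 0.0
    for w_row, w_out in zip(net["w1"], net["w2"]):
        h = math.tanh(sum(w * xi for w, xi in zip(w_row, x)))
        keep = 1.0 if rng.random() < keep_prob else 0.0
        out += w_out * h * keep / keep_prob  # inverted-dropout scaling (assumption)
    return sigmoid(out)  # interpreted as a collision probability

def predict_collision(ensemble, x, passes_per_net, keep_prob, rng):
    """Sample mean and (unbiased) sample variance over all networks and dropout masks."""
    samples = [forward_with_dropout(net, x, keep_prob, rng)
               for net in ensemble for _ in range(passes_per_net)]
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
    return mean, var

rng = random.Random(0)
ensemble = [make_network(rng, n_in=3, n_hidden=8) for _ in range(5)]
mean, var = predict_collision(ensemble, [0.1, -0.2, 0.3],
                              passes_per_net=20, keep_prob=0.8, rng=rng)
```

Every pass is independent of the others, so the loop over networks and masks could be evaluated in parallel.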
III-C Selecting actions
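A Model Predictive Controller (MPC) selects the safest motion primitive with the minimal joint cost, a weighted sum of the sample mean of the predicted collision probability, its sample variance, and the estimated time to goal.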
The chosen MPC, which considers the second-order moment of probability [Lee_2017, Theodorou_2010, Kahn_2017], is able to select actions that are more certainly safe. The MPC estimates the time-to-goal from the end of each motion primitive by measuring the straight line distance. Each cost term is weighted by its own factor. Note that the soft constraint on collision avoidance requires the collision and goal weights to be chosen such that the predicted collision cost is greater than the goal cost. In comparison to [Kahn_2017], this work does not multiply the variance term with the selected velocity. The reason is that simply stopping or reducing one’s velocity is not always safe, for example in a highway scenario or in the presence of adversarial agents. The proposed work instead focuses on identifying and avoiding uncertain observations regionally in the ground plane.
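The action selection can be sketched as a minimum over a weighted sum of the three cost terms named in the text: mean collision probability, its variance, and time to goal. The weight names `w_coll`, `w_var` and `w_goal`, as well as all numbers, are hypothetical and chosen only to make the effect of the variance term visible.

```python
def select_motion_primitive(means, variances, times_to_goal, w_coll, w_var, w_goal):
    """Return the index of the primitive with minimal joint cost and all costs.
    The variance term is not scaled by the selected velocity."""
    costs = [w_coll * m + w_var * v + w_goal * t
             for m, v, t in zip(means, variances, times_to_goal)]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs

# Three candidate primitives (illustrative numbers):
means = [0.05, 0.05, 0.90]       # predicted collision probability
variances = [0.20, 0.01, 0.00]   # model uncertainty of that prediction
times = [4.0, 5.0, 3.5]          # estimated time to goal

aware, _ = select_motion_primitive(means, variances, times,
                                   w_coll=100.0, w_var=50.0, w_goal=1.0)
unaware, _ = select_motion_primitive(means, variances, times,
                                     w_coll=100.0, w_var=0.0, w_goal=1.0)
```

In this toy case the uncertainty-aware setting picks primitive 1, which is predicted safe with low variance, while setting the variance weight to zero picks primitive 0, which is predicted equally safe but is highly uncertain. The collision weight is much larger than the goal weight so that a likely collision always outweighs a shorter time to goal.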
III-D Adaptive variance
Note that during training an overly uncertainty-averse model would discourage exploration and rarely find the optimal policy. Additionally, the averaging during prediction reduces the ensemble’s diversity, which further hinders explorative actions. The proposed approach increases the penalty on highly uncertain actions over time to overcome this effect. Thus, the policy efficiently explores in directions of high model uncertainty during early training phases and is brought to convergence to act uncertainty-averse during execution.
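One simple way to realize a penalty that grows over training is a clipped linear schedule. The linear shape and the start and end values below are assumptions; the text only states that the penalty increases over time, and the local-minimum experiment later in the paper mentions a negative penalty in early stages.

```python
def variance_weight(episode, total_episodes, start=-1.0, end=1.0):
    """Variance penalty that increases linearly from `start` to `end` over training.
    A negative value rewards uncertain actions (exploration); a positive value
    penalizes them (uncertainty-averse behavior)."""
    frac = min(max(episode / float(total_episodes), 0.0), 1.0)
    return start + (end - start) * frac

schedule = [variance_weight(e, 100) for e in (0, 25, 50, 75, 100)]
```

The returned value would be used as the weight on the variance term of the MPC cost during the corresponding training episode and held at its final value during execution.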
III-E Collecting the dataset
The selected action is executed in the learning environment. The environment returns the next observation and a collision label. The motion primitive decision history is labeled with 1 if a collision occurred and with 0 otherwise. Several episodes are executed and the observation-action history is stored in an experience dataset. Random subsets from the full experience set are drawn to train the ensemble of networks for the next observe-act-train cycle. The policy roll-out cycle is necessary to learn how dynamic obstacles will react to the agent’s learned policy. A supervised learning approach, as taken in [Richter_2017] for static obstacle avoidance, would not learn the reactions of environment agents to the trained policy.
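The labeling step can be sketched as below. Which decisions of an episode receive the collision label is not spelled out in the text, so the sketch assumes that only the last `horizon` decisions before the end of a colliding episode are labeled 1; this rule and the function name are assumptions.

```python
def label_episode(steps, collided, horizon):
    """Attach a collision label to every (observation_history, action) pair.
    Assumption: only the last `horizon` decisions of a colliding episode get label 1."""
    n = len(steps)
    labeled = []
    for i, (obs_history, action) in enumerate(steps):
        in_window = i >= n - horizon
        labeled.append((obs_history, action, 1 if (collided and in_window) else 0))
    return labeled

# Experience dataset accumulated over episodes (strings stand in for real data).
experience = []
episode = [("o0", "a0"), ("o1", "a1"), ("o2", "a2")]
experience.extend(label_episode(episode, collided=True, horizon=2))
experience.extend(label_episode(episode, collided=False, horizon=2))
```

After several episodes, random subsets of `experience` would be drawn to retrain each ensemble member before the next roll-out cycle.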
We show that our algorithm uses uncertainty information to regionally detect novel obstacle observations and causes fewer collisions than an uncertainty-unaware baseline. First, a simple 1D case illustrates how the model regionally identifies novel obstacle observations. In a scaled-up environment with novel multi-dimensional observations, the proposed model continues to exhibit regionally increased uncertainty values. The model is compared with an uncertainty-unaware baseline in a variety of novel scenarios; the proposed model is more robust to novel data and causes fewer collisions.
IV-A Regional novelty detection in 1D
First, we show that model uncertainty estimates are able to detect novel one-dimensional observations regionally, as seen in Fig. 3. For the 1D test case, a two-layer fully-connected network with MC-Dropout and Bootstrapping is trained to predict collision labels. To generate the dataset, an agent randomly chose heading actions, independent of the obstacle observations, and the environment reported the collision label. The network input is the agent heading angle and obstacle heading. Importantly, the training set only contains obstacles that are on the right-hand side of the agent (top plot).
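A dataset of this kind could be generated along the following lines. The collision rule (an angular window around the obstacle heading), the sign convention for the right-hand side and all numeric values are assumptions made for this sketch; the environment actually used is not specified at this level of detail.

```python
import math
import random

def make_1d_dataset(num_samples, rng, collision_halfwidth=math.radians(15)):
    """Random agent headings paired with obstacle headings on the right-hand side only.
    Assumptions: negative angles mean 'right'; a collision occurs when the agent
    heading lies within `collision_halfwidth` of the obstacle heading."""
    data = []
    for _ in range(num_samples):
        agent_heading = rng.uniform(-math.pi / 2, math.pi / 2)  # chosen independently
        obstacle_heading = rng.uniform(-math.pi / 2, 0.0)       # right-hand side only
        label = 1 if abs(agent_heading - obstacle_heading) < collision_halfwidth else 0
        data.append((agent_heading, obstacle_heading, label))
    return data

train_set = make_1d_dataset(200, random.Random(0))
```

Obstacles on the left (positive angles under this convention) never appear in `train_set`, so they are out-of-distribution at test time.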
After training, the network accurately predicts collision and no-collision labels with low uncertainty for obstacle observations from the training distribution, as seen in Fig. 2(a). For out-of-training obstacle observations on the agent’s left (bottom plot), the neural network fails to generalize and predicts collision (red) as well as non-collision (green) labels for actions (straight lines) that would collide with the obstacle (blue). However, the agent identifies regions of high model uncertainty (left: y-axis, right: light colors) for actions in the direction of the unseen obstacle. The high uncertainty values suggest that the network predictions are false positives and should not be trusted. Based on the left-right difference in uncertainty estimates, the MPC would prefer a conservative action that is certainly safe (bottom-right: dark green lines) over a false-positive action that is predicted to be safe but uncertain (bottom-right: light green lines).
IV-B Novelty detection in multi-dimensional observations
The following experiments show that our model continues to regionally identify uncertainty in multi-dimensional observations and choose safer actions.
A one-layer 16-unit LSTM model has been trained in a gym [Gym_2016] based simulation environment with one agent and one dynamic obstacle. The dynamic obstacle in the environment is capable of following a collaborative RVO [Berg_2009], GA3C-CADRL [Everett_2018], or non-cooperative or static policy. For the analyzed scenarios, the agent was trained with obstacles that follow an RVO policy and are observed as described in Section III. The training process took 20 minutes on a low-compute Amazon AWS c5.large instance (Intel Xeon Platinum 8124M with 2 vCPUs and 4 GiB memory), and one hundred stochastic forward passes with dropout and bootstrapping per step take on average . The training and execution time could be further decreased by parallelizing the computation on GPUs.
In the test setup, observations of obstacles are manipulated to create scenarios with novel observations that could break the trained model. In one scenario, sensor noise is simulated by adding Gaussian noise to the observation of position and velocity. In another scenario, observations are randomly dropped with a probability of . In a third and fourth scenario that simulate sensor failure, the obstacle position and velocity are masked, respectively. None of the manipulations were applied at training time.
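The four manipulations can be sketched as small functions on an observation record. The dictionary layout, the use of zeros for masked fields and the representation of a dropped observation as `None` are assumptions; the noise level and drop probability are left as parameters because their values are not given here.

```python
import random

def add_gaussian_noise(obs, sigma, rng):
    """Add zero-mean Gaussian noise to the position and velocity fields."""
    noisy = dict(obs)
    for field in ("position", "velocity"):
        noisy[field] = [v + rng.gauss(0.0, sigma) for v in obs[field]]
    return noisy

def drop_observation(obs, drop_prob, rng):
    """Return None (observation missing) with probability `drop_prob`."""
    return None if rng.random() < drop_prob else obs

def mask_field(obs, field):
    """Simulate sensor failure by zeroing one field (zeros are an assumption)."""
    masked = dict(obs)
    masked[field] = [0.0] * len(obs[field])
    return masked

rng = random.Random(0)
obs = {"position": [1.0, 2.0], "velocity": [0.5, 0.0], "radius": 0.3}
noisy = add_gaussian_noise(obs, sigma=0.1, rng=rng)
no_velocity = mask_field(obs, "velocity")
no_position = mask_field(obs, "position")
```

Each function returns a new record and leaves the original observation untouched, so the same simulated episode can be replayed under every scenario.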
Regional novelty detection
Figure 4 shows that the proposed model continues to regionally identify novel obstacle observations in a higher dimensional observation space. In the displayed experiment, an uncertainty-aware agent (orange) observes a dynamic obstacle (blue) with newly added noise and evaluates actions to avoid it. The collision predictions for actions in the direction of the obstacle (light green lines) have higher uncertainty than for actions into free-space (dark green lines). The difference in the predictive uncertainties from left to right, although stochastic and not perfectly smooth, is used by the MPC to steer the agent away from the noisy obstacle and cautiously avoid it without a collision (orange/yellow line). Figure 4(b) shows the full trajectory of the uncertainty-aware agent and illustrates how an uncertainty-unaware agent in Fig. 4(a) with the same speed and radius fails to generalize to the novel noise and collides with the obstacle after five time steps.
Novel scenario identification with uncertainty
Table I shows that overall model uncertainty is high in each of the tested novel scenarios, including the illustrated case of added noise. The measured uncertainty is the sum of variance of the collision predictions for each action at one time step. The uncertainty values have been averaged over sessions with random initialization, episodes and all time steps until the end of each episode. As seen in Table I, the uncertainty in a test set of the training distribution is relatively low. All other scenarios cause higher uncertainty values and the relative magnitude of the uncertainty values can be interpreted as how novel the set of observations is for the model, in comparison to the training case.
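The uncertainty measure described above can be sketched as a sum over actions followed by an average over time steps. The nested-list layout and the flat averaging over all time steps are assumptions about how the aggregation is organized.

```python
def step_uncertainty(variances_per_action):
    """Sum of the prediction variances of all evaluated actions at one time step."""
    return sum(variances_per_action)

def scenario_uncertainty(sessions):
    """Average the per-step uncertainty over all sessions, episodes and time steps.
    `sessions[s][e][t]` is the list of per-action variances at time step t."""
    values = [step_uncertainty(step)
              for session in sessions
              for episode in session
              for step in episode]
    return sum(values) / len(values)

# Two sessions, one episode each; the numbers are illustrative only.
sessions = [
    [[[0.1, 0.2], [0.3, 0.1]]],
    [[[0.0, 0.2]]],
]
avg = scenario_uncertainty(sessions)
```

Computing this value once on held-out data from the training distribution gives a reference level against which the value in a new scenario can be compared.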
| Training | Added noise | Dropped observations | Masked vel. info. | Masked pos. info. |
Fewer collisions in novel scenarios
The proposed model uses the uncertainty information to act more cautiously and be more robust to novel scenarios. Figure 6 shows that this behavior causes fewer collisions during the novel scenarios than an uncertainty-unaware baseline. The proposed model (red) and the baseline (blue) perform similarly well on samples from the training distribution. In the test scenarios of added noise, masked position and masked velocity information, the proposed model causes fewer collisions and is more robust to the novel class of observations. In the case of dropped observations, both models perform similarly well, in terms of collisions, but the uncertainty-unaware model was seen to take longer to reach the goal. The baseline model has been trained with the same hyperparameters in the same environment except that the variance penalty is set to zero.
Generalization to other novel scenarios
In all demonstrated cases, one could have found a model that generalizes to noise, masked position observations, etc. However, one cannot design a simulation that captures all novel scenarios that could occur in real life. A significantly novel event should be recognized with a high model uncertainty. In the pedestrian avoidance task, novel observations might be uncommon pedestrian behavior. More generally, all forms of observations that are novel to the deployed model should be identified and reacted to by driving more cautiously. The shown results suggest that model uncertainty is able to identify such observations and that the MPC selects actions with extra buffer space to avoid these pedestrians cautiously.
IV-C Using uncertainty to escape local minima
This work increases the variance penalty over the course of training to avoid getting stuck in local minima of the MPC optimization. Figure 7 shows that the proposed algorithm with an increasing variance penalty can escape a local minimum by encouraging explorative actions in the early stages of training. For the experiment, an agent (orange) was trained to reach a goal (star) that is blocked by a static obstacle (blue) by continuously selecting an action (left plot). In an easy avoidance case, the obstacle is placed further away from the agent’s start position (in dark orange); in a challenging case, it is placed closer to the agent. A close obstacle is challenging, as the agent is initially headed in the direction of the obstacle and needs to explore avoiding actions. The collision estimates of the randomly initialized networks are uninformative in early training stages and the goal cost drives the agent into the obstacle. A negative variance penalty in early stages forces the agent to explore actions away from the goal and avoid getting stuck in a local minimum.
Figure 7 displays that, in the challenging training case, the agent with a constant variance penalty fails to explore and the algorithm gets stuck in a bad local minimum (bottom-right plot: blue), where 80% of the runs end in a collision. The policy with an increasing variance penalty and the same hyperparameters (bottom-right plot: red) is more explorative in early stages and converges to a lower minimum in an average of five sessions. In the easy test case, both algorithms perform similarly well and converge to a policy with near-zero collisions (top-right plot).
V Discussion and Future Work
V-A Accurately calibrated model uncertainty estimates
In another novel scenario, an agent was trained to avoid collaborative RVO agents and tested on uncollaborative agents. The uncertainty values did not significantly increase, for which there are two possible explanations. First, the model might not perceive uncollaborative agents as novel, possibly because RVO agents that are further away from the agent also move in a straight line. That humans expect uncollaborative agents to be novel for a model trained only on collaborative agents does not change the fact that the model might generalize well enough not to see them as novel. Second, dropout has been observed to be overconfident as an uncertainty estimate. Future work will investigate estimates of model uncertainty for neural networks that provide stronger guarantees on the true model uncertainty.
This work has developed a Safe RL framework with model uncertainty estimates to cautiously avoid dynamic obstacles in novel scenarios. An ensemble of LSTM networks was trained with dropout and bootstrapping to estimate collision probabilities and gain predictive uncertainty estimates. The magnitude of the uncertainty estimates was shown to reveal novelties in a variety of scenarios, indicating that the model “knows what it does not know”. The regional uncertainty increase in the direction of novel obstacle observations is used by an MPC to act more cautiously in novel scenarios. The cautious behavior made the uncertainty-aware framework more robust to novelties and safer than an uncertainty-unaware baseline. This work is another step towards opening up the vast capabilities of deep neural networks for application in safety-critical tasks.
This work is supported by Ford Motor Company. The authors want to thank Golnaz Habibi for insightful discussions.