Constrained Exploration and Recovery from Experience Shaping
Abstract
We consider the problem of reinforcement learning under safety requirements, in which an agent is trained to complete a given task, typically formalized as the maximization of a reward signal over time, while concurrently avoiding undesirable actions or states, associated with lower rewards or penalties. The construction and balancing of different reward components can be difficult in the presence of multiple objectives, yet is crucial for producing a satisfying policy. For example, in reaching a target while avoiding obstacles, low collision penalties can lead to reckless movements while high penalties can discourage exploration. To circumvent this limitation, we examine the effect of past actions in terms of safety to estimate which are acceptable or should be avoided in the future. We then actively reshape the action space of the agent during reinforcement learning, so that reward-driven exploration is constrained within safety limits. We propose an algorithm enabling the learning of such safety constraints in parallel with reinforcement learning and demonstrate its effectiveness in terms of both task completion and training time.
Tu-Hoa Pham, Giovanni De Magistris, Don Joven Agravante, Subhajit Chaudhury, Asim Munawar and Ryuki Tachibana
IBM Research - Tokyo
{pham,giovadem,subhajit,asim,ryuki}@jp.ibm.com, don.joven.r.agravante@ibm.com
1 Introduction
Recent work in reinforcement learning has established the potential for deep neural network architectures to tackle difficult control and decision-making problems, such as playing video games from raw pixel information (?), Go (?), as well as robot manipulation (?) and whole-body control (?). Such problems are often characterized by the high dimensionality of possible actions and observations, making them difficult to solve or even intractable for traditional optimization methods. Still, deep reinforcement learning techniques have remained subject to limitations including poor sample efficiency, large requirements in data and interactions with the environment, and strong dependency on an appropriately-designed reward signal (?). In particular, a considerable challenge towards their applicability to real-world problems is that of safety. Indeed, while deep neural networks can reasonably be employed as black-box controllers within simulated or well-controlled environments, limited interpretability and vulnerability to adversarial attacks (?; ?) can hinder their deployment to situations where poor decisions can have undesirable consequences for the agent or its environment, e.g., in autonomous driving. For such critical applications, deep neural networks can be used as a specialized service for components offering stability and performance guarantees, e.g., for visual recognition or dynamics prediction in conjunction with model-predictive control (?).
In the absence of ground-truth physical models, prediction robustness can also be improved when large amounts of data are available or can be generated, e.g., through transfer learning with domain randomization (?). Data can also be used to tackle the limitations of reinforcement learning in terms of training time and reward crafting. When reference trajectories are available, e.g., demonstrated by an expert, it is possible to initialize a control policy by behavioral cloning (?), i.e., by training the neural network in a supervised manner using reference states and actions as inputs and outputs. However, behavioral cloning frequently requires tremendous amounts of data that extensively span observation and action spaces. Expert demonstrations were also used in interaction with the reinforcement learning process to accelerate training (?). Alternatively, it is possible to use expert data to infer a reward signal that the expert is assumed to be following through inverse reinforcement learning (IRL) (?). However, IRL methods can perform poorly in the presence of imperfect demonstrations. In addition, even when a reward signal can be estimated, it can remain insufficient to train a control policy by reinforcement learning afterwards. To address this limitation, (?) proposed to bypass the reward estimation step by directly training a control policy together with a discriminator classifying state-action pairs as expert-like or not, in a manner analogous to generative adversarial networks (?), showing successful imitation from very few expert trajectories on robot control tasks. Other successes have also been obtained in meta-learning frameworks for generalization from single demonstrations (?) or reinforcement learning from imperfect demonstrations using multimodal policies (?; ?).
In this work, we propose to use reference demonstrations in a novel manner: not to train a control policy directly, but rather to learn safety constraints towards the completion of a given task. In this direction, we build upon the state of the art in safe reinforcement learning (Section 2). In contrast with traditional imitation learning frameworks, our approach leverages both positive and negative demonstrations, which we loosely define as aiming to complete and fail the designated task, respectively, yet without need for optimality (e.g., maximum reward or fastest failure).

We demonstrate that it is possible to automatically learn action-space constraints in a supervised manner even when no ground-truth constraints are available, through the formulation of a loss function acting as a proxy for a convex optimization problem (Section 3).

Positive and negative reference demonstrations may not be available in many practical problems of interest. Thus, we derive an algorithm, Constrained Exploration and Recovery from Experience Shaping (CERES), to discover both from parallel instantiations of a reinforcement learning problem while learning safety constraints (Section 4).

On collision avoidance tasks with dynamics, we show that our approach makes reinforcement learning more efficient, achieving higher rewards in fewer iterations, while also enabling learning from reduced observations (Section 5).
Finally, we discuss the challenges we encountered and future directions for our work (Section 6). To facilitate its reproduction and foster research in constraint-based reinforcement learning, we make our algorithms public and open-source (https://www.github.com/IBM/constrainedrl).
2 Background and Motivation
2.1 Reinforcement Learning
We consider an infinite-horizon discounted Markov decision process (MDP) characterized by: a state domain $\mathcal{S}$ representing observations available to the agent we seek to control; an action domain $\mathcal{A}$ representing how the agent can interact with its environment; a probability distribution $\rho_0$ for the initial state $s_0$; a transition probability distribution $P(s_{t+1} \mid s_t, a_t)$ describing how, from a given state, taking an action can lead to another state; and a function $r(s_t, a_t, s_{t+1})$ associating rewards to such transitions. With $\gamma \in (0, 1)$ a discount factor on future reward expectations, we aim to construct a stochastic policy $\pi$ that maximizes the discounted expected return $\eta(\pi)$:
$$\eta(\pi) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t, s_{t+1})\right], \tag{1}$$
with $\tau = (s_0, a_0, s_1, a_1, \dots)$ a sequence of states and actions where the initial state $s_0$ is sampled following $\rho_0$, and each action $a_t$ is sampled following the control policy $\pi$ given the current state $s_t$, leading to a new state $s_{t+1}$ following the transition function $P$. Through Eq. (1), we seek to maximize not a one-step reward, but rather a reward expectation over time. While the MDP components other than the reward ($\mathcal{S}$, $\mathcal{A}$, $\rho_0$, $P$) allow some variation in their implementation (e.g., different resolutions for images as state space), they remain mostly characterized by the considered task. In contrast, the reward function can often be engineered empirically, from intuition, experience, and trial and error. Such a process is inefficient and costly, since evaluating a reward function candidate requires training a policy with it.
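As a small illustration of the quantity inside the expectation of Eq. (1), the following sketch computes the discounted return of a single, finite trajectory. The function name `discounted_return` and the truncation to a finite list of rewards are assumptions made for this example; the expectation over trajectories would be approximated by averaging such values over sampled episodes.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    total = 0.0
    # Accumulate backwards so that each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```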
We consider deep reinforcement learning in continuous action spaces, in which actions are typically $N_a$-dimensional real-valued vectors, $a \in \mathbb{R}^{N_a}$ (e.g., joint commands for a robot arm). Given an $N_s$-dimensional input state vector $s$, actions are sampled following a neural network $\mathcal{N}_\pi$ representing the control policy $\pi$, $a \sim \mathcal{N}_\pi(s)$. Multiple methods were developed to tackle such problems, such as Deep Deterministic Policy Gradient (DDPG) (?) or Trust Region Policy Optimization (TRPO) (?), which were benchmarked on robot control tasks in (?). In this work, we build upon the Proximal Policy Optimization (PPO) algorithm (?) and its OpenAI Baselines reference implementation (?), in which the control policy is an $N_a$-dimensional multivariate Gaussian distribution of mean and standard deviation predicted by the neural network $\mathcal{N}_\pi$, trained on-policy by interacting with the environment to collect state-action-reward tuples over a fixed horizon of timesteps. Alternative frameworks employing energy-based policies also achieved significant results on improved exploration and skill transfer between tasks (?; ?).
2.2 Safe Reinforcement Learning
While failure is most often permissible in simulated environments, real-world applications often come with requirements in terms of safety, for both the artificial agent and its environment. Indeed, poor decisions may have undesirable consequences, both in the physical world (e.g., an autonomous vehicle colliding with another vehicle or person) and within information systems (e.g., algorithmic trading). Thus, the topic of safe reinforcement learning has been the subject of considerable research from multiple perspectives (?). From the deep reinforcement learning domain, (?) recently proposed a trust region method, named Constrained Policy Optimization (CPO), enabling the training of control policies with near-satisfaction of given, known safety constraints. Towards real-world applications, (?) proposed to combine the TRPO reinforcement learning algorithm with an optimization layer that takes as input an action predicted by a neural network policy and corrects it to lie within safety constraints via convex optimization. There again, safety constraints are required to be specified in advance. Namely, given a state $s$, an action $\tilde{a}$ is sampled from a neural network policy $\mathcal{N}_\pi$ as $\tilde{a} \sim \mathcal{N}_\pi(s)$. Instead of directly executing $\tilde{a}$ onto the environment, as the neural network has no explicit safety guarantee, it is first corrected by solving the following quadratic program (QP) (?):
$$a^{*} = \arg\min_{a} \; \|a - \tilde{a}\|^{2} \quad \text{subject to} \quad G a \le h, \tag{2}$$
with $G$ and $h$ linear constraint matrices of respective size $N_c \times N_a$ and $N_c$ (for $N_c$ constraints on $N_a$-dimensional actions) describing the range of possible actions. The closest action satisfying these constraints, $a^{*}$, is then executed in the environment. While in Eq. (2), $G$ and $h$ are assumed provided by the user, in our work, we instead propose to learn them. This is a rather unexplored idea, since safety constraints can sometimes be constructed in a principled way, e.g., using the equations of physics in robotics. However, doing so is often cumbersome and possibly imprecise, as it depends on the availability of an accurate model of the agent and its environment. In contrast, our approach operates in a completely model-free fashion and is able to learn from direct demonstrations (possibly generated from scratch), without need for prior knowledge.
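To make the action-correction step of Eq. (2) concrete, the sketch below projects a raw action onto a set of linear inequality constraints, written here as rows g_i and offsets h_i with g_i · a <= h_i. It does not call a QP solver: as a swapped-in technique, it uses Dykstra's alternating projection algorithm over half-spaces, which converges to the same closest feasible point when the constraints are compatible. The function name, the fixed number of cycles, and the assumption that every row of G is nonzero are choices made for this example only.

```python
def project_onto_constraints(a_raw, G, h, n_cycles=100):
    """Approximate argmin ||a - a_raw||^2 subject to G a <= h (row-wise).

    Assumes the constraints are jointly feasible and no row of G is zero.
    """
    x = list(a_raw)
    # One correction vector per half-space (Dykstra's increments).
    corrections = [[0.0] * len(x) for _ in G]
    for _ in range(n_cycles):
        for i, (g, h_i) in enumerate(zip(G, h)):
            # Re-add the correction previously removed for this half-space.
            y = [xj + cj for xj, cj in zip(x, corrections[i])]
            violation = sum(gj * yj for gj, yj in zip(g, y)) - h_i
            if violation > 0.0:
                # Orthogonal projection of y onto the half-space boundary.
                scale = violation / sum(gj * gj for gj in g)
                x = [yj - scale * gj for yj, gj in zip(y, g)]
            else:
                x = y
            corrections[i] = [yj - xj for yj, xj in zip(y, x)]
    return x


# Box-like constraints a_x <= 1 and a_y <= 1 applied to the raw action (2, 2).
G = [[1.0, 0.0], [0.0, 1.0]]
h = [1.0, 1.0]
print(project_onto_constraints([2.0, 2.0], G, h))  # [1.0, 1.0]
```

A dedicated QP solver would typically be preferred in practice; the iterative scheme is used here only to keep the example dependency-free.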
Other evidence of reinforcement learning acceleration through improved exploration was presented in (?), where safety constraints were modelled with Gaussian processes. From the perspective of planning, (?) defined safety not as a numerical quantity as in the previous works, but through the notion of avoiding dead ends, from which the task can no longer be completed. In our work, we similarly define negative demonstrations as state-action couples $(s, a)$ such that taking $a$ from $s$ inevitably leads to failure: either directly (e.g., the agent immediately crashes against a wall) or because recovery is no longer possible after taking $a$ (e.g., still accelerating despite passing a minimum braking distance). Conversely, we define positive demonstrations as state-action couples $(s, a)$ such that the agent can still recover from the resulting state (e.g., starting to decelerate before passing the minimum braking distance). Demonstrations that are known to be neither positive nor negative are called uncertain. Although determining beyond doubt whether an uncertain demonstration is positive or negative may be intractable in many cases, we propose a heuristic approach to sample and classify such demonstrations through a specialized reinforcement learning process. Our approach is thus related to that of (?), where a reset policy was learned to return the environment to a safe state for future attempts. In (?), increased robustness was achieved by training the control policy together with an adversary learning to produce optimal perturbations. While such perturbations can be used as negative demonstrations, we seek to collect a variety of such examples without need for optimality (e.g., any action leading an agent to collide against a wall, without necessarily inducing the greatest impact).
Finally, although not reinforcement learning, we were also inspired by the work of (?), where negative demonstrations were collected by purposely crashing a drone into surrounding objects to learn whether a direction is safe to fly to as a simple binary classification problem.
3 Learning Action-Space Constraints from Positive and Negative Demonstrations
3.1 Definitions
State-dependent action-space constraints
Let $\mathcal{C}$ denote a set of $N_c$ constraint functions $C_i$ operating on actions $a$, of the general form:
$$\forall i \in \{1, \dots, N_c\}, \quad C_i(a) \le 0. \tag{3}$$
In general, the constraint functions $C_i$ can take different forms but are typically real-valued, e.g., $C_i(a) = \|a\|_2 - 1$, the second-order cone inequality constraining $a$ to be of norm 1 or less. We consider in particular the case of linear inequalities, e.g., $C_i(a) = g_i a - h_i$, parameterized by a row vector $g_i$ of size $N_a$ (the action dimension) and a scalar $h_i$ such that:
$$g_i a \le h_i. \tag{4}$$
With $G = [g_1; \dots; g_{N_c}]$ and $h = [h_1, \dots, h_{N_c}]^{\top}$ the constraint matrices of respective size $N_c \times N_a$ and $N_c$, Eq. (3) takes the familiar form of Eq. (2), with inequalities considered row-wise. We are interested in estimating such constraint matrices as functions of state vectors $s$, e.g., as outputs of a neural network $\mathcal{N}_C$:
$$\big(G(s), h(s)\big) = \mathcal{N}_C(s). \tag{5}$$
Formally, we thus consider constraints that operate on the action domain and depend on the current state (e.g., an autonomous vehicle may not accelerate more than a given rate, the action constraint, if another vehicle is less than a given distance ahead, the current state). Eq. (4) can thus be rewritten:
$$C_i(s, a) = g_i(s)\, a - h_i(s), \tag{6}$$
with $C_i(s, a) \le 0$ when the constraint is satisfied, and $C_i(s, a) > 0$ when it is violated. Given a demonstration $(s, a)$, we then define, for each constraint $C_i$, a satisfaction margin $S_i$ and a violation margin $V_i$:
$$S_i(s, a) = \max\big(0, -C_i(s, a)\big), \tag{7}$$
$$V_i(s, a) = \max\big(0, C_i(s, a)\big), \tag{8}$$
with $\max$ the maximum operator, which for comparisons with zero can be represented by the rectified linear unit. Thus, $S_i$ (resp. $V_i$) is positive if the $i$-th constraint is strictly satisfied (resp. violated), and zero otherwise. Finally, given a set of known positive and negative demonstrations, we associate to each demonstration $(s, a)$ an indicator $\delta$ equal to $1$ if $(s, a)$ is a positive demonstration and $0$ otherwise.
3.2 Constraint Training Loss
In this Section, we assume the availability of state-action demonstrations $(s, a)$ along with associated indicators $\delta$ ($1$ for a positive demonstration, $0$ for a negative one), e.g., provided by a human expert. We then seek to construct constraint functions $C_i$ that satisfy the following. If $(s, a)$ is a positive demonstration, then we want all constraints to be satisfied:
$$\forall i, \quad C_i(s, a) \le 0. \tag{9}$$
Having all constraints satisfied is equivalent to having none violated. Using the violation margin $V_i(s, a) = \max(0, C_i(s, a))$ defined in Eq. (8) yields:
$$\forall i, \quad V_i(s, a) = 0. \tag{10}$$
Since by definition, all margins are nonnegative, we get:
$$\max_i V_i(s, a) = 0. \tag{11}$$
Conversely, if $(s, a)$ is a negative demonstration, we want at least one constraint to be violated:
$$\exists i, \quad C_i(s, a) > 0. \tag{12}$$
This amounts to having at least one constraint of zero satisfaction margin $S_i(s, a) = \max(0, -C_i(s, a))$, while others can be strictly positive:
$$\min_i S_i(s, a) = 0. \tag{13}$$
Thus, we can define a constraint loss $\mathcal{L}$ comprising the maximum violation for positive demonstrations following Eq. (11) and the minimum satisfaction for negative demonstrations following Eq. (13):
$$\mathcal{L}(s, a, \delta) = \delta \max_i V_i(s, a) + (1 - \delta) \min_i S_i(s, a). \tag{14}$$
Backtracing from Eq. (14) to Eq. (5) shows that $\mathcal{L}$ is computed as a succession of differentiable operations from the constraint matrices $G(s)$ and $h(s)$. As these are computed from the state $s$ being fed through the constraint network $\mathcal{N}_C$, the network can be trained in a supervised manner by minimizing $\mathcal{L}$ as a training loss, using existing stochastic optimization methods such as Adam (?). Still, some considerations remain.
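The loss described above can be sketched for a single demonstration as follows, assuming linear constraints given as rows g_i with offsets h_i (so that a constraint value g_i · a - h_i at or below zero means "satisfied"). The function names and the plain-list representation are illustrative assumptions; an actual training setup would express the same operations in an automatic differentiation framework and average over minibatches.

```python
def constraint_margins(G, h, a):
    """Return (satisfaction, violation) margins for each row of G a <= h."""
    satisfaction, violation = [], []
    for g, h_i in zip(G, h):
        c = sum(gj * aj for gj, aj in zip(g, a)) - h_i  # c <= 0: satisfied
        satisfaction.append(max(0.0, -c))
        violation.append(max(0.0, c))
    return satisfaction, violation


def constraint_loss(G, h, a, is_positive):
    """Max violation for a positive demo, min satisfaction for a negative one."""
    satisfaction, violation = constraint_margins(G, h, a)
    return max(violation) if is_positive else min(satisfaction)


G = [[1.0, 0.0], [0.0, 1.0]]   # constraints a_x <= 1 and a_y <= 1
h = [1.0, 1.0]
print(constraint_loss(G, h, [0.5, 0.5], is_positive=True))    # 0.0
print(constraint_loss(G, h, [2.0, 0.0], is_positive=True))    # 1.0
print(constraint_loss(G, h, [0.5, 0.25], is_positive=False))  # 0.5
```

The first action is a correctly accepted positive demonstration (zero loss), the second is a positive demonstration wrongly excluded by the first constraint, and the third is a negative demonstration that no constraint yet excludes.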
3.3 Optimizing the Constraint Loss in Practice
Constraint Normalization
Directly minimizing the loss function of Eq. (14) does not suffice to yield useful constraints in practice. Indeed, from the definition of satisfaction and violation margins, it appears that setting the constraint matrix $G = 0$ and the constraint vector $h = 0$ yields a trivial minimum for the loss $\mathcal{L}$. In fact, simply having $G = 0$ results in the optimization problem being ill-defined. Considering individual constraint parameters $(g_i, h_i)$, we can instead observe that when $g_i$ is nonzero, $g_i a = h_i$ is the equation of a hyperplane in $\mathbb{R}^{N_a}$ (i.e., a line in 2D action space, a plane in 3D action space, etc.), of normal $g_i$ itself. Geometric considerations then yield that $(g_i a - h_i) / \|g_i\|$ is the signed distance between $a$ and the constraint hyperplane. It thus appears that having each row of the predicted constraint matrix be of unit norm would be practical, both for avoiding trivial optima and for maintaining geometric interpretability. One possibility is to systematically renormalize satisfaction and violation margins by division with the norm of each $g_i$ post-prediction, within Eqs. (7) and (8). However, we noted that doing so could result in two problems in particular: the neural network predictions growing indefinitely large as they are normalized within the training loss, or conversely decreasing in norm such that $\|g_i\|$ eventually becomes close to zero, causing numerical errors.
Unit constraint matrices
Instead of renormalizing the rows of the constraint matrix a posteriori, we adopt an alternative formulation ensuring that they are of unit norm in the first place. Recalling that each row $g_i$ can be interpreted as a unit vector in $\mathbb{R}^{N_a}$, we have the constraint neural network predict it in generalized spherical coordinates, representing $n$-dimensional vectors in Cartesian coordinates as a radius $r$ and $n - 1$ angular coordinates $\phi_1, \dots, \phi_{n-1}$. For example, 2D vectors $(x, y)$ in Cartesian coordinates can be computed from polar coordinates $(r, \phi)$ as $(x, y) = (r \cos\phi, r \sin\phi)$, with analogous formulas for generalized $n$-dimensional spheres. By simply setting the radius to $1$, any combination of angles in $\mathbb{R}^{N_a - 1}$ produces a unit vector in $\mathbb{R}^{N_a}$. We then change the output layer of the neural network so that it predicts $N_a$ parameters for each constraint $i$: $N_a - 1$ spherical coordinates for $g_i$ and the scalar $h_i$. As the transformation from spherical to Cartesian coordinates only involves cosine and sine functions, the differentiability of the loss function is preserved.
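The spherical parameterization can be illustrated with the following sketch, which maps n - 1 angles to a vector of unit norm in n dimensions using the standard generalized spherical coordinate formulas with the radius fixed to 1. The function name is an assumption for this example, and the ordering of coordinates is one common convention among several.

```python
import math


def unit_vector_from_angles(angles):
    """Map n-1 angles to a unit vector in R^n (radius fixed to 1)."""
    coords = []
    sin_prod = 1.0
    for phi in angles:
        # k-th coordinate: product of the previous sines times cos(phi_k).
        coords.append(sin_prod * math.cos(phi))
        sin_prod *= math.sin(phi)
    # Last coordinate: product of all sines.
    coords.append(sin_prod)
    return coords


g = unit_vector_from_angles([0.3, 1.2, -2.0])   # a unit vector in R^4
norm = math.sqrt(sum(c * c for c in g))          # equals 1 up to rounding
print(unit_vector_from_angles([0.0]))            # [1.0, 0.0]
```

Because the norm is 1 for any choice of angles, no explicit renormalization step is needed in the loss.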
Avoiding constraint incompatibility
While constraint satisfaction and violation terms appear together in Eq. (14), they are generally not optimized on partially overlapping demonstrations, e.g., a positive and a negative demonstration sharing the same state (one action leading to failure and the other not). As isolated demonstrations do not suffice to cleanly separate action spaces for any state, it is possible that the neural network produces constraints that minimize the training loss but are incompatible with each other. For example, a pair of constraints such as $a_1 \le -1$ and $a_1 \ge 1$ cannot be satisfied simultaneously. Instead, we would like to ensure that the domain described by $G a \le h$ never boils down to the empty set. Remark that if $h \ge 0$, then the optimization problem is always solvable, since the valid domain now contains at least $a = 0$. While it is straightforward to enforce $h \ge 0$, e.g., by passing it through a rectified linear unit, having $a = 0$ as default fallback action may not always be safe in practice. Instead of $0$, given an arbitrary point $a_0$, it is possible to parameterize constraints such that $a_0$ always satisfies $G a_0 \le h$, by decomposing the right-hand side into $h = G a_0 + d$, with $d \ge 0$. While $a_0$ can be fixed manually, it can also be considered as an interior point that can be learned and shared with each individual constraint. Finally, the bounds of $d$ can also be set to guarantee, e.g., a minimum or maximum distance between constraints and the interior point $a_0$. In the following, we set the minimum value of $d$ to a fraction of half the action space range and its maximum value to half the action space range directly, so that constraint satisfaction never becomes trivial while guaranteeing a minimum exploration volume.
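The interior-point parameterization can be sketched as below for a single constraint with unit-norm row g: the offset is built as h = g · a0 + d, where d is kept between a minimum and a maximum distance. Squashing an unbounded network output `raw` with a sigmoid is an assumption of this example (the text does not specify how the bounds are enforced), as are the function name and the numerical values of `d_min` and `d_max`.

```python
import math


def offset_from_interior_point(g, a0, raw, d_min, d_max):
    """Return h = g . a0 + d with d in [d_min, d_max], so that g . a0 <= h."""
    # Numerically stable sigmoid of the unbounded output `raw`.
    if raw >= 0.0:
        sig = 1.0 / (1.0 + math.exp(-raw))
    else:
        e = math.exp(raw)
        sig = e / (1.0 + e)
    d = d_min + (d_max - d_min) * sig  # distance from a0 to the hyperplane
    return sum(gj * aj for gj, aj in zip(g, a0)) + d


g = [1.0, 0.0]          # unit-norm constraint row (hypothetical)
a0 = [0.2, 0.3]         # interior point (hypothetical)
h = offset_from_interior_point(g, a0, raw=0.0, d_min=0.1, d_max=0.5)
# With raw = 0 the sigmoid is 0.5, so d = 0.3 and h is approximately 0.5.
```

Since d is nonnegative by construction, the interior point satisfies every constraint built this way, and the feasible set of the correction step is never empty.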
3.4 Application
We consider a task consisting in controlling an agent to reach a target point in a maze-like environment, see Fig. 2 (left). Both the agent and the target are represented by circles of fixed diameter. Throughout its motion, the agent has to avoid certain areas of the map: the external bounds of the world, represented by a square of side 2, and holes in the ground, represented by black surfaces. At the beginning of each episode, agent and target positions are randomly sampled within the allowed surface. At each timestep, the state vector $s$ comprises the position of the target and that of the agent. The agent can then make a 2D motion as action vector $a$. If the norm of $a$ is within a maximum step size, the position is directly incremented by it, else it is clipped to lie within the circular movement range whose radius is the maximum step size. Each action thus results in an updated state and a reward signal composed of four terms: a penalty when reaching the border of the world or the central hole, a bonus when reaching the target, a reward on the distance between the agent and the target (increasing towards zero as the distance decreases), and a constant penalty per timestep encouraging rapid completion of the task. The episode ends when a maximum number of timesteps have passed or when either the collision penalty or the target bonus occurs.
We then collect a set of expert demonstrations by having a human user directly control the agent with the mouse, without specific instructions on how to reach the goal (e.g., shortest path possible). We collect such trajectories and take them as our set of positive demonstrations. As this environment does not involve complicated dynamics for the agent, we can define bad actions as those immediately leading to task failure. We thus iterate through each positive demonstration and sample actions along the circular action range defined by the maximum step size. State-action couples leading to task failure are then taken as negative demonstrations. As an additional heuristic, we also consider the expert path, reversed, as negative demonstrations, i.e., if a positive demonstration $(s, a)$ leads to the state $s'$, then we take $(s', -a)$ as negative demonstration. Note that these heuristics are only applicable due to the simplicity of the environment and its dynamics. We discuss their automatic discovery in the next Section. Overall, we thus collect a set of demonstrations comprising both positive and negative examples.
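The single-step sampling heuristic above can be sketched as follows. For brevity, the state is reduced to the agent position only, and failure is reduced to leaving a square world of side 2 centered at the origin; both simplifications, along with the function names, the step radius, and the number of sampled directions, are assumptions of this example (holes in the ground are not modelled).

```python
import math


def outside_unit_square(pos):
    """Hypothetical failure test: the agent has left the square [-1, 1]^2."""
    return abs(pos[0]) > 1.0 or abs(pos[1]) > 1.0


def sample_negative_demos(agent_pos, step_radius, is_failure, n_samples=16):
    """Try actions evenly spaced on the circle of maximum step size and
    keep the (state, action) couples that immediately lead to failure."""
    negatives = []
    for k in range(n_samples):
        angle = 2.0 * math.pi * k / n_samples
        action = (step_radius * math.cos(angle), step_radius * math.sin(angle))
        next_pos = (agent_pos[0] + action[0], agent_pos[1] + action[1])
        if is_failure(next_pos):
            negatives.append((agent_pos, action))
    return negatives


# Near the right border, only the step pointing right leaves the world.
demos = sample_negative_demos((0.95, 0.0), 0.1, outside_unit_square, n_samples=4)
print(len(demos))  # 1
```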
We then train a constraint network to predict constraints on the action space of the agent when exploring the maze. We minimize the constraint loss using the Adam optimizer, on minibatches each comprising a fixed number of positive and negative demonstrations. In doing so, each training epoch consists in iterating through the negative demonstrations exactly once, while each positive demonstration appears multiple times per epoch on average. Alternatively, we could weigh violation and satisfaction losses differently in Eq. (14), e.g., as a function of their proportion in the total dataset. We depict the resulting training loss in Fig. (a). By counting how many positive (resp. negative) demonstrations actually satisfy (resp. violate) the predicted constraints after each training epoch, we empirically verify that the proposed loss constitutes a representative proxy to learn constraints from demonstrations only. Once the constraint network is trained, we embed it within a reinforcement learning process to predict constraints from states encountered during exploration and thus guide the behavior of the agent. Fig. (b) illustrates that this enables both starting from higher rewards, since penalty-heavy collisions are avoided, and reaching a higher reward after training. We depict a full trajectory along with a visualization of action-space constraints in Fig. 1.
4 Constrained Exploration and Recovery from Experience Shaping
4.1 Overview
We established in Section 3 that it is possible to learn action-space constraints as functions of states to guide exploration during reinforcement learning, given a set of positive and negative demonstrations. However, the acquisition of such demonstrations is often problematic on problems of practical interest. First, one cannot always assume the availability of a human expert, e.g., for tasks that humans struggle to complete and look to automate, such as robotic tasks involving high payloads or requiring sub-millimeter accuracy. Second, even when positive demonstrations are available, there may not be clear heuristics to infer negative from positive demonstrations (e.g., by “reversing” them). Third, direct sampling and success-failure evaluation can quickly become intractable on high-dimensional state and action spaces. Finally, even on low-dimensional domains, one may not be able to evaluate an action in a single step. Instead, the effects of a given action may only appear many steps later, mitigated by other events that happened in between. As a result, it is essential to derive an algorithm enabling the discovery and identification of positive and negative demonstrations starting from scratch.
We propose to do so through the reinforcement learning setting. First, we train a direct control policy that learns to complete the task. After each trajectory, state-action couples are evaluated to determine if they can be labeled positive or negative. In this framework, we consider a demonstration as positive if, from the successor step, there exists a trajectory that does not lead to failure within $T$ steps, with $T$ a hyperparameter to be chosen as a function of the dynamics of the task. Conversely, we consider a demonstration as negative if the resulting state only leads to failure within $T$ steps. At this stage, only the final demonstration can confidently be labeled negative, if the trajectory terminates with failure, while only the first demonstrations, of remaining trajectory length greater than $T$, can confidently be labeled positive.
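The initial labeling of a direct trajectory described above can be sketched as follows. The function name, the string labels, and the convention that the "remaining trajectory length" of a demonstration counts the demonstrations that follow it are assumptions of this example.

```python
def initial_labels(n_steps, horizon, ended_in_failure):
    """Label each demonstration of one trajectory before any recovery attempt."""
    labels = []
    for t in range(n_steps):
        remaining = n_steps - 1 - t  # demonstrations that follow step t
        if remaining > horizon:
            # The agent provably survived more than `horizon` further steps.
            labels.append("positive")
        elif ended_in_failure and t == n_steps - 1:
            # Only the last action is known to have caused the failure.
            labels.append("negative")
        else:
            labels.append("uncertain")
    return labels


print(initial_labels(n_steps=6, horizon=2, ended_in_failure=True))
# ['positive', 'positive', 'positive', 'uncertain', 'uncertain', 'negative']
```

The demonstrations left as "uncertain" are the ones that a further sorting step would have to resolve.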
4.2 Demonstration Sorting by Learning Recovery
The second part of our algorithm thus consists in transferring the uncertain demonstrations sampled by the direct policy to a recovery control policy that learns to recover from such uncertain states. Namely, training of the recovery policy involves resetting episodes only to uncertain states visited by the direct policy. In addition, the reward signal used to train the recovery policy is simplified to depend only on survival: one value if the agent is still alive at each timestep, and a lower one if it fails the task. If the recovery agent is still active after $T$ steps, the demonstration leading to the episode’s starting state (sampled from the direct policy) is labeled as positive, recursively with all predecessor demonstrations. Conversely, if recovery was unsuccessful for a chosen number of attempts, the starting direct demonstration is labeled as negative, along with all successor demonstrations. We remark that, when evaluating trajectories, it is useful to start from the middle, as the characterization of a given demonstration affects that of either all its predecessors or all its successors, thus halving the search space each time. Overall, the positive and negative demonstrations collected from both direct and recovery policies can then be used to train a constraint network to guide the exploration for the direct policy, and optionally the recovery policy.
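The "start from the middle" remark corresponds to a bisection over the uncertain segment of a trajectory, which can be sketched as below. The callable `can_recover(k)` is hypothetical: it stands for running the recovery policy from the state reached after the k-th uncertain demonstration and reporting whether it survived the recovery horizon within the allowed number of attempts. Treating a failed recovery as a negative label is the heuristic described in the text, not a guarantee.

```python
def sort_uncertain(n_uncertain, can_recover):
    """Locate the first negative demonstration among time-ordered uncertain ones.

    Returns (boundary, queries): demonstrations with index < boundary are
    labeled positive, the others negative; `queries` counts recovery trials.
    """
    lo, hi = 0, n_uncertain  # the boundary lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if can_recover(mid):
            lo = mid + 1  # mid and all its predecessors are positive
        else:
            hi = mid      # mid and all its successors are negative
    return lo, queries


# Toy oracle: recovery is possible only from the first five uncertain demos.
print(sort_uncertain(8, lambda k: k < 5))  # (5, 3)
```

Eight uncertain demonstrations are sorted with three recovery trials here, instead of up to eight when evaluating them one by one.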
In summary, given an environment on which we seek to train a direct policy, our approach necessitates the following adjustments in creating a recovery environment to train the recovery policy:
(i) a simplified reward that only penalizes task failure,
(ii) the availability of success and failure flags regarding the final action prior to episode termination, and
(iii) a function restoring the environment to chosen states.
While (iii) may appear rather restrictive, the idea of restoring reference states was also used to guide reinforcement learning for whole-body robot control in (?). Alternatively, when such a restoration function is unavailable but the environment can be finely controlled, we could consider simply resetting it to reference states by replaying a set number of demonstrations from the sampled direct trajectories. Finally, (ii) is necessary to confidently classify the final demonstration, as early episode termination can occur for reasons other than failure (negative), such as completing the task (positive) or just reaching a maximum number of timesteps (uncertain).
4.3 Detailed Algorithm
Conventionally, in the on-policy reinforcement learning setting, states, actions, rewards and other relevant quantities (e.g., value, termination, etc.) are collected as trajectories by executing predictions from the neural network policy in the environment; these trajectories are then used to update the policy network through an UpdatePolicy method, e.g., PPO. In CERES, described in Fig. 6, positive, negative, and uncertain demonstrations are sampled together with these trajectories within a Sample method. Each time a state-action demonstration is labeled as positive or negative, we store it together with the associated indicator into an experience replay buffer, which is then used to iteratively train a constraint network with an UpdateConstraints method following Section 3. In parallel with each policy update, uncertain trajectories are also transferred from direct to recovery environments to serve as episode initialization states.
The Sample method is further described in Fig. 7. We highlight in particular the following. On line 9, raw actions predicted by the policy network are corrected using a method Constrain implementing the quadratic program of Eq. (2). Then, on line 11, it is the initial prediction that is used for policy update and not the corrected action, as training is done on-policy. Still, on line 12 it is the corrected action that is used as reference demonstration, since it is the action that is effectively performed onto the environment. Finally, given such unlabeled demonstrations, we sort them as positive, negative and uncertain through a procedure EvaluateDemos implementing the logic described in Section 4.2.
5 Experiments
5.1 Practical Implementation
We implement the constraint learning framework and the CERES algorithm within TensorFlow, while building upon the OpenAI Baselines with PPO as reinforcement learning method for training direct and recovery agents. Preliminary experiments showed that since constraint predictions can be rather inaccurate over the first iterations, as labeled demonstrations are still few, it is possible to only correct the action prediction with a certain probability in Fig. 7, line 9, and otherwise play the predicted action directly in the environment. We empirically found that an appropriate metric for the constraint activation probability is the percentage of actions that are correctly separated by the predicted constraints (i.e., the proportion of positive actions satisfying the predicted constraints and negative actions violating them). We also obtained good results by only constraining the direct policy, enabling a more diverse range of sampled actions to learn recovery, prior to training the constraint network.
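The probabilistic activation of constraints can be sketched as follows: the separation accuracy over labeled demonstrations is used directly as the probability of executing the corrected action rather than the raw one. The function names and the representation of each labeled demonstration as a pair of booleans are assumptions of this example.

```python
import random


def separation_accuracy(results):
    """Fraction of demonstrations correctly separated by the constraints.

    `results` holds (satisfies_constraints, is_positive) boolean pairs: a
    positive demo should satisfy the constraints, a negative one should not.
    """
    results = list(results)
    if not results:
        return 0.0  # no labeled data yet: never activate the constraints
    correct = sum(1 for satisfied, positive in results if satisfied == positive)
    return correct / len(results)


def choose_action(raw_action, corrected_action, p_constrain, rng=random):
    """Execute the corrected action with probability p_constrain."""
    return corrected_action if rng.random() < p_constrain else raw_action


labeled = [(True, True), (False, False), (True, False), (False, True)]
p = separation_accuracy(labeled)  # two of four correctly separated: 0.5
action = choose_action([2.0, 0.0], [1.0, 0.0], p)
```

With this choice, constraints are rarely applied while they are still unreliable and are applied almost always once they separate the collected demonstrations well.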
5.2 Obstacle Avoidance with Dynamics
While the example considered in Section 3.4 was limited by fixed safe domains and position control for the agent, we now consider the case where hole placement is randomized at each episode, see Fig. 2 (right), and where the agent can be controlled with force commands. In the latter case, it is now insufficient to evaluate demonstrations as good or bad from a single step, as the agent can no longer stop instantly if it is travelling at maximum speed. In addition to the positions of the agent and the target, the state vector now includes the current linear velocity of the agent and its distance to surrounding obstacles akin to LIDAR systems, along regularly-spaced beams starting from its center.
We consider four variants of this environment: two where the agent is controlled in position, with 2D position increments as actions, and two where it is controlled with 2D forces as actions, updating its velocity and position by consecutive integration. For each control setting, we consider the case where all observations are provided to the control policy, i.e., agent, target and obstacle information, and the case where the control policy only has access to agent and target information, while the constraint network still has access to the whole state vector. We evaluate CERES against vanilla PPO, sharing the same reinforcement learning hyperparameters and random seeds, and depict the resulting rewards in Fig. 8. Overall, while full-state tasks seem difficult to achieve in the first place, CERES enables safe learning from fewer observations. Indeed, in such situations, considerations of distances can be left to the constraint network while the policy network can focus on general navigation.
6 Discussion and future work
In our work, we established that expert demonstrations could be used in a novel way, to learn safety constraints from positive and negative examples. When both are available, the resulting constraints can accelerate reinforcement learning by starting from and reaching higher rewards. Towards applications of practical interest, we derived a new algorithm, CERES, enabling the automatic discovery of such positive and negative examples, and thus the learning of safety constraints from scratch. On a task involving multi-step dynamics, we demonstrated that our approach could preserve such advantages in terms of rewards, while also enabling the main control policy to learn from fewer observations. Possible future developments include tackling real-world robotics applications and problems where success and failure metrics are more ambiguous.
References
 [Achiam et al. 2017] Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained policy optimization. In International Conference on Machine Learning.
 [Cserna et al. 2018] Cserna, B.; Doyle, W. J.; Ramsdell, J. S.; and Ruml, W. 2018. Avoiding dead ends in real-time heuristic search. In AAAI.
 [Dhariwal et al. 2017] Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; and Wu, Y. 2017. OpenAI Baselines. https://github.com/openai/baselines.
 [Duan et al. 2016] Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.
 [Duan et al. 2017] Duan, Y.; Andrychowicz, M.; Stadie, B.; Ho, O. J.; Schneider, J.; Sutskever, I.; Abbeel, P.; and Zaremba, W. 2017. One-shot imitation learning. In Advances in Neural Information Processing Systems.
 [Eysenbach et al. 2018] Eysenbach, B.; Gu, S.; Ibarz, J.; and Levine, S. 2018. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. In International Conference on Learning Representations.
 [Gandhi, Pinto, and Gupta 2017] Gandhi, D.; Pinto, L.; and Gupta, A. 2017. Learning to fly by crashing. In IEEE/RSJ International Conference on Intelligent Robots and Systems.
 [Gao et al. 2018] Gao, Y.; Lin, J.; Yu, F.; Levine, S.; Darrell, T.; et al. 2018. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313.
 [García and Fernández 2015] García, J., and Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research.
 [Goodfellow et al. 2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems.
 [Goodfellow, Shlens, and Szegedy 2015] Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
 [Haarnoja et al. 2017] Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning.
 [Haarnoja et al. 2018] Haarnoja, T.; Pong, V.; Zhou, A.; Dalal, M.; Abbeel, P.; and Levine, S. 2018. Composable deep reinforcement learning for robotic manipulation. In IEEE International Conference on Robotics and Automation.
 [Hester et al. 2018] Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Horgan, D.; Quan, J.; Sendonaris, A.; Dulac-Arnold, G.; et al. 2018. Deep Q-learning from demonstrations. In AAAI.
 [Ho and Ermon 2016] Ho, J., and Ermon, S. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems.
 [Kingma and Ba 2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [Lillicrap et al. 2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 [Mattingley and Boyd 2012] Mattingley, J., and Boyd, S. 2012. CVXGEN: A code generator for embedded convex optimization. Optimization and Engineering.
 [Mnih et al. 2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature.
 [Ng, Russell, and others 2000] Ng, A. Y.; Russell, S. J.; et al. 2000. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning.
 [Peng et al. 2018] Peng, X. B.; Abbeel, P.; Levine, S.; and van de Panne, M. 2018. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics 37(4).
 [Pham, De Magistris, and Tachibana 2018] Pham, T.-H.; De Magistris, G.; and Tachibana, R. 2018. OptLayer - Practical Constrained Optimization for Deep Reinforcement Learning in the Real World. In IEEE International Conference on Robotics and Automation.
 [Pinto et al. 2017] Pinto, L.; Davidson, J.; Sukthankar, R.; and Gupta, A. 2017. Robust adversarial reinforcement learning. In International Conference on Machine Learning.
 [Pomerleau 1991] Pomerleau, D. A. 1991. Efficient training of artificial neural networks for autonomous navigation. Neural Computation 3(1).
 [Schulman et al. 2015] Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International Conference on Machine Learning.
 [Schulman et al. 2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
 [Silver et al. 2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.
 [Su, Vargas, and Kouichi 2017] Su, J.; Vargas, D. V.; and Kouichi, S. 2017. One pixel attack for fooling deep neural networks. arXiv preprint arXiv:1710.08864.
 [Tobin et al. 2017] Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; and Abbeel, P. 2017. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems.
 [Wachi et al. 2018] Wachi, A.; Sui, Y.; Yue, Y.; and Ono, M. 2018. Safe exploration and optimization of constrained MDPs using Gaussian processes. In AAAI.
 [Williams et al. 2017] Williams, G.; Wagener, N.; Goldfain, B.; Drews, P.; Rehg, J. M.; Boots, B.; and Theodorou, E. A. 2017. Information theoretic MPC for model-based reinforcement learning. In IEEE International Conference on Robotics and Automation.