Conservative Safety Critics for Exploration


Abstract

Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions while still enabling trial-and-error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints are satisfied with high probability during training, derive provable convergence guarantees for our approach, which is asymptotically no worse than standard RL, and demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are available at https://sites.google.com/view/conservative-safety-critics/home


1 Introduction

Reinforcement learning (RL) is a powerful framework for learning-based control because it can enable agents to learn to make decisions automatically through trial and error. However, in the real world, the cost of those trials – and those errors – can be quite high: an aerial robot that attempts to fly at high speed might initially crash, and then be unable to attempt further trials due to extensive physical damage. However, learning complex skills without any failures at all is likely impossible. Even humans and animals regularly experience failure, but quickly learn from their mistakes and behave cautiously in risky situations. In this paper, our goal is to develop safe exploration methods for RL that similarly exhibit conservative behavior, erring on the side of caution in particularly dangerous settings, and limiting the number of catastrophic failures.

A number of previous approaches have tackled this problem of safe exploration, often by formulating it as a constrained Markov decision process (CMDP) (García and Fernández, 2015; Altman, 1999). However, most of these approaches require additional assumptions, such as access to a function that can be queried to check whether a state is safe (Thananjeyan et al., 2020), access to a default safe controller (Koller et al., 2018; Berkenkamp et al., 2017), or knowledge of all the unsafe states (Fisac et al., 2019), or they only obtain safe policies after training converges while being unsafe during the training process (Tessler et al., 2018; Dalal et al., 2018).

In this paper, we propose a general safe RL algorithm, with safety guarantees throughout training. Our method only assumes access to a sparse (e.g., binary) indicator for catastrophic failure, in the standard RL setting. We train a conservative safety critic that overestimates the probability of catastrophic failure, building on tools in the recently proposed conservative Q-learning framework (Kumar et al., 2020) for offline RL. In order to bound the likelihood of catastrophic failures at every iteration, we impose a KL-divergence constraint on successive policy updates so that the stationary distribution of states induced by the old and the new policies are not arbitrarily different. Based on the safety critic’s value, we consider a chance constraint denoting probability of failure, and optimize the policy through primal-dual gradient descent.

Our key contributions in this paper are an algorithm, which we refer to as Conservative Safety Critics (CSC), that learns a conservative estimate of how safe a state is, uses this conservative estimate for safe exploration and policy updates, and admits theoretical upper bounds on the probability of failure throughout training. Through empirical evaluation in five simulated robotic control domains spanning manipulation, navigation, and locomotion, we show that CSC is able to learn effective policies while reducing the rate of catastrophic failures by up to 50% over prior safe exploration methods.

2 Preliminaries

We describe the problem setting of a constrained MDP (Altman, 1999) specific to our approach and the conservative Q learning (Kumar et al., 2020) framework that we build on in our algorithm.

Constrained MDPs. A constrained MDP (CMDP) is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, \mu, \mathcal{C})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s'|s,a)$ is a transition kernel, $R(s,a)$ is a task reward function, $\gamma$ is a discount factor, $\mu$ is a starting state distribution, and $\mathcal{C}$ is a set of (safety) constraints that the agent must satisfy, with constraint functions taking values either $0$ (alive) or $1$ (failure) and limits defining the maximal allowable amount of non-satisfaction, in terms of expected probability of failure. A stochastic policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ is a mapping from states to action distributions, and the set of all stationary policies is denoted by $\Pi$. Without loss of generality, we consider a single constraint, where $C$ denotes the constraint satisfaction function $C(s) \in \{0, 1\}$ (analogous to the task reward function), with an upper limit $\epsilon$. We define the discounted future state distribution of a policy $\pi$ as $d^{\pi}$, the state value function as $V^{\pi}_{R}$, the state-action value function as $Q^{\pi}_{R}$, and the advantage function as $A^{\pi}_{R}$. We define analogous quantities with respect to the constraint function, namely $V^{\pi}_{C}$, $Q^{\pi}_{C}$, and $A^{\pi}_{C}$; in particular, $V^{\pi}_{C}(\mu)$ denotes the expected probability of failure. When the policy is parameterized as $\pi_{\phi}$, we denote the corresponding quantities with $\phi$ in place of $\pi$ where convenient.
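For reference, these quantities take their standard forms (written here under the usual conventions; this is a generic restatement rather than notation specific to this paper):

```latex
d^{\pi}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\Pr(s_t = s \mid \pi, \mu), \qquad
V^{\pi}_{R}(s) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_t,a_t)\,\Big|\,s_0 = s\Big],
\qquad
Q^{\pi}_{R}(s,a) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_t,a_t)\,\Big|\,s_0 = s,\,a_0 = a\Big], \qquad
A^{\pi}_{R}(s,a) = Q^{\pi}_{R}(s,a) - V^{\pi}_{R}(s),
```

and identically for $V^{\pi}_{C}$, $Q^{\pi}_{C}$, and $A^{\pi}_{C}$ with the constraint function $C(s)$ in place of $R(s,a)$.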

Conservative Q Learning. CQL (Kumar et al., 2020) is a method for offline/batch RL (Lange et al., 2012; Levine et al., 2020) that aims to learn a Q-function such that the expected value of a policy under the learned Q-function lower-bounds its true value, thereby preventing over-estimation due to out-of-distribution actions. In addition to training Q-functions via the standard Bellman error, CQL minimizes the expected Q-values under a particular distribution of actions, $\mu(a|s)$, and maximizes the expected Q-value under the data (behavior) distribution in the buffer. CQL in and of itself might still lead to unsafe exploration; in Section 3, we show how the theoretical tool introduced in CQL can be used to devise a safe RL algorithm.
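For concreteness, the CQL update referred to above has the following structure (restated here in simplified form; see Kumar et al. (2020) for the exact statement and the specific variants such as CQL($\mathcal{H}$)):

```latex
\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\;
\alpha\Big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(\cdot|s)}\big[Q(s,a)\big]
- \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big]\Big)
+ \tfrac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\big)^{2}\Big]
```

Minimizing the first term pushes Q-values down on actions from $\mu$, while the second term pushes them up on actions actually present in the dataset, which is what yields the lower-bound property.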

3 The Conservative Safe-Exploration Framework

Figure 1: Illustration of the approach described in Algorithm 1. The environment steps the simulator to the next state and provides the reward and constraint values to the agent. If $C = 1$ (failure), the episode terminates. $Q_C$ is the learned safety critic.

In this section we describe our safe exploration framework. The safety constraint $C(s)$ defined in Section 2 is an indicator of catastrophic failure: $C(s) = 1$ when a state is unsafe and $C(s) = 0$ when it is not, and we ideally desire $C(s) = 0$ for every state the agent visits. Since we do not make any assumptions about the problem structure for RL, we cannot guarantee this; at best we can reduce the probability of failure in every episode. So, we formulate the constraint as $V^{\pi}_{C}(\mu) \le \epsilon$, where $V^{\pi}_{C}(\mu)$ denotes the expected probability of failure. Our approach is motivated by the insight that by being “conservative” with respect to how safe a state is, and hence by over-estimating this probability of failure, we can effectively ensure constrained exploration.

Figure 1 provides an overview of the approach. The key idea of our algorithm is to train a conservative safety critic, denoted $Q_C(s,a)$, that overestimates how unsafe a particular state is, and to modify the exploration strategy to account for this conservative safety estimate (i.e., for the overestimated probability of failure). During policy evaluation in the environment, we use the safety critic to reduce the chance of catastrophic failures by checking whether taking action $a$ in state $s$ keeps the estimated failure probability $Q_C(s,a)$ below a threshold. If not, we re-sample $a$ from the current policy $\pi$.

We now discuss our algorithm more formally. We start by describing the procedure for learning the safety critic $Q_C$, then discuss how we incorporate it into the policy gradient updates, and finally discuss how we perform safe exploration during policy execution in the environment.

Overall objective. Our objective is to learn an optimal policy that maximizes task rewards while respecting the constraint on the expected probability of failure:

$\max_{\pi \in \Pi}\; \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t} R(s_t,a_t)\Big] \quad \text{s.t.}\quad V^{\pi}_{C}(\mu) \le \epsilon \qquad (1)$

Learning the safety critic. The safety critic $Q_C$ is used to obtain an estimate of how unsafe a particular state is, by providing an estimate of the probability of failure that will be used to guide exploration. We want these estimates to be “conservative”, in the sense that the estimated probability of failure should be an over-estimate of the actual probability, so that the agent can err on the side of caution while exploring. To train such a conservative critic, we incorporate tools from CQL and estimate $Q_C$ through updates similar to those obtained by reversing the sign of $\alpha$ in Equation 2 of CQL (Kumar et al., 2020). This gives us an upper bound on $Q_C$ instead of the lower bound guaranteed by CQL. We denote the over-estimated advantage corresponding to this safety critic as $\hat{A}_C$. Formally, the safety critic is trained via the following objective, where $\theta$ parameterizes $Q_C$ and $k$ denotes the update iteration:

$\hat{Q}^{k+1}_{C} \leftarrow \arg\min_{Q_{\theta}}\; \alpha\Big(-\mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_{\phi}(\cdot|s)}\big[Q_{\theta}(s,a)\big] + \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q_{\theta}(s,a)\big]\Big) + \tfrac{1}{2}\,\mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(Q_{\theta}(s,a) - \hat{\mathcal{B}}^{\pi_{\phi}}\hat{Q}^{k}_{C}(s,a)\big)^{2}\Big] \qquad (2)$

For states sampled from the replay buffer $\mathcal{D}$, the first term seeks to maximize the expectation of $Q_C$ over actions sampled from the current policy, while the second term seeks to minimize the expectation of $Q_C$ over actions sampled from the replay buffer. $\mathcal{D}$ can include off-policy data, and also offline data (if available). We interleave the gradient descent updates for training $Q_C$ with gradient ascent updates for the policy $\pi_{\phi}$ and gradient descent updates for the Lagrange multiplier $\lambda$, which we describe next.
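A minimal PyTorch-style sketch of one safety-critic update under this objective is shown below. The network and policy interfaces (`q_c`, `q_c_target`, `policy.sample`) are assumptions for illustration, and the terminal handling of the failure signal is one common choice rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def safety_critic_update(q_c, q_c_target, policy, batch, optimizer,
                         alpha=1.0, gamma=0.99):
    s, a, c, s_next = batch["s"], batch["a"], batch["c"], batch["s_next"]

    # Bellman backup for the binary failure signal c; the episode ends on failure,
    # hence the (1 - c) mask on the bootstrapped term (an assumed convention).
    with torch.no_grad():
        a_next = policy.sample(s_next)
        target = c + (1.0 - c) * gamma * q_c_target(s_next, a_next)

    bellman_loss = F.mse_loss(q_c(s, a), target)

    # Conservative regularizer with the sign of the CQL term reversed:
    # push Q_C up on actions from the current policy and down on buffer actions,
    # so that Q_C over-estimates the probability of failure.
    a_pi = policy.sample(s)
    conservative_term = -q_c(s, a_pi).mean() + q_c(s, a).mean()

    loss = bellman_loss + alpha * conservative_term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```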

Policy learning. Since we want to learn policies that obey the constraint we set in terms of the safety critic, we solve the objective in equation 1 via a surrogate policy improvement problem:

(3)

Here, we have introduced a KL-divergence constraint $D_{KL}(\pi \,\|\, \pi_k) \le \delta$ to ensure that successive policies are close, which helps us obtain bounds in Section 4 on the expected failures of the new policy in terms of the expected failures of the old policy. We replace the KL term by its second-order Taylor expansion (expressed in terms of the Fisher information matrix) and enforce the resulting constraint exactly (Schulman et al., 2015a). For the constraint defined through the safety critic, we follow the primal-dual method of Lagrange multipliers without making any simplifications of the constraint term. This constraint, as per equation 23 (Appendix), can be rewritten as

(4)
1:Initialize $V_R$ (task value fn), $Q_C$ (safety critic), policy $\pi_\phi$, Lagrange multiplier $\lambda$, replay buffer $\mathcal{D}$, thresholds $\epsilon$, $\delta$.
2:Set the initial estimate $\bar{V}_C$ of the avg. failures in the previous epoch.
3:for epochs until convergence do     Execute actions in the environment. Collect on-policy samples.
4:     for episode in {1, …, M} do
5:          Set the safety threshold $\epsilon'$ from $\epsilon$ and $\bar{V}_C$ (Section 3, Executing rollouts)
6:          Sample $a \sim \pi_\phi(\cdot|s)$. Execute $a$ iff $Q_C(s,a) \le \epsilon'$. Else, resample $a$.
7:          Obtain next state $s'$, reward $r$, and constraint value $c$.
8:          $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, a, s', r, c)\}$     If available, $\mathcal{D}$ can be seeded with off-policy/offline data
9:     end for
10:     Store the average episodic failures $\bar{V}_C$
11:     for step in {1, …, N} do     Policy and Q function updates using $\mathcal{D}$
12:          Gradient ascent on $\phi$ and (optionally) add entropy regularization (equation 7)
13:          Gradient updates for the Q-function $Q_C$ (equation 2)
14:          Gradient descent step on Lagrange multiplier $\lambda$ (equation 9)
15:     end for
16:     
17:end for
Algorithm 1 CSC: safe exploration with conservative safety critics

We replace the true advantage $A_C$ by its learned over-estimate $\hat{A}_C$, and consider the Lagrangian dual of this constrained problem, which we can solve by alternating gradient descent as shown below.

(5)
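Schematically (with the exact constraint limit and correction terms coming from equation 23 in the Appendix and suppressed here), the primal-dual problem alternates maximization over the policy parameters with minimization over the dual variable:

```latex
\max_{\phi}\;\min_{\lambda \ge 0}\;\;
\mathbb{E}_{s\sim d^{\pi_k},\,a\sim\pi_{\phi}}\big[A^{\pi_k}_{R}(s,a)\big]
\;-\;\lambda\Big(\mathbb{E}_{s\sim d^{\pi_k},\,a\sim\pi_{\phi}}\big[\hat{A}_{C}(s,a)\big] - b_k\Big)
\qquad \text{s.t.}\;\; D_{KL}\big(\pi_{\phi}\,\|\,\pi_{k}\big)\le\delta,
```

where $b_k$ is a placeholder for the constraint limit derived from $\epsilon$ and the previous policy's failure rate.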

We replace the expected failure probability of the current policy iterate by its sample estimate $\bar{V}_C$. Note that this term is independent of the parameter $\phi$ being optimized over. In addition, we can approximate the KL-divergence term through the Fisher information matrix $H$, where $H$ can be estimated with samples as

$H \;\approx\; \frac{1}{|\mathcal{D}|}\sum_{(s,a)\in\mathcal{D}} \nabla_{\phi}\log\pi_{\phi}(a|s)\,\nabla_{\phi}\log\pi_{\phi}(a|s)^{\top} \qquad (6)$

Following the steps in Appendix A.2, we can write the gradient ascent step for the policy parameters $\phi$ as

(7)

Here, $\beta$ is the backtracking coefficient, and we perform backtracking line search with exponential decay. The gradient used in this update is calculated as

(8)

For the gradient descent step with respect to the Lagrange multiplier $\lambda$, we have

(9)

Here, $\eta_{\lambda}$ is the learning rate for the dual variable. Detailed derivations of the gradient updates are in Appendix A.2.
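As a concrete, illustrative sketch of the update in equation 7, the following implements a TRPO-style natural-gradient step with exponential backtracking. The interfaces (`g` for the Lagrangian gradient, `H` for the Fisher matrix, `kl_after` for the exact KL check) and the fallback behavior are assumptions, not the authors' code.

```python
import numpy as np

def natural_gradient_step(phi, g, H, kl_after, delta, beta=0.8, max_backtracks=10):
    # Unscaled natural-gradient direction H^{-1} g.
    direction = np.linalg.solve(H, g)
    # Largest step satisfying the quadratic (Fisher) approximation of the KL constraint.
    step_size = np.sqrt(2.0 * delta / (g @ direction + 1e-8))

    # Exponentially decay the step until the exact KL constraint holds.
    for j in range(max_backtracks):
        phi_new = phi + (beta ** j) * step_size * direction
        if kl_after(phi_new) <= delta:
            return phi_new
    return phi  # backtrack fully: keep the old parameters
```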

Executing rollouts (i.e., safe exploration). Since we are interested in minimizing the number of constraint violations while exploring the environment, we do not simply execute the current policy iterate in the environment for active data collection. Rather, we query the safety critic to obtain an estimate of how unsafe an action is and choose a safe action via rejection sampling. Formally, we sample an action $a \sim \pi_{\phi}(\cdot|s)$ and check whether $Q_C(s,a) \le \epsilon'$. We keep re-sampling actions until this condition is met, and once it is met, we execute that action in the environment. Here, $\epsilon'$ is a threshold that varies across iterations and is computed from the constraint limit $\epsilon$ and $\bar{V}_C$, the average episodic failures in the previous epoch (a sample estimate of the true expected failure probability). The value of $\epsilon'$ is chosen precisely so that Lemma 1 holds.
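A minimal sketch of this rejection-sampling step, assuming `policy.sample(state)` draws an action and `q_c(state, action)` returns the conservative failure estimate; the retry cap and the least-unsafe fallback are practical choices not specified above.

```python
def safe_action(policy, q_c, state, threshold, max_tries=100):
    best_action, best_score = None, float("inf")
    for _ in range(max_tries):
        action = policy.sample(state)
        score = q_c(state, action)
        if score <= threshold:
            return action                     # action deemed safe by the critic
        if score < best_score:
            best_action, best_score = action, score
    return best_action                        # fallback: least-unsafe action found
```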

In the replay buffer $\mathcal{D}$, we store tuples of the form $(s, a, s', r, c)$, where $s$ is the previous state, $a$ is the action executed, $s'$ is the next state, $r$ is the task reward from the environment, and $c$ is the constraint value. In our setting, $c$ is binary, with $c = 0$ denoting a live agent and $c = 1$ denoting failure.

Overall algorithm. Our overall algorithm, shown in Algorithm 1, executes policy rollouts in the environment while respecting the constraint $Q_C(s,a) \le \epsilon'$, stores the observed data tuples in the replay buffer $\mathcal{D}$, and uses the collected tuples to train the safety critic $Q_C$ using equation 2, update the policy using equation 7, and update the dual variable $\lambda$ using equation 9.
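Putting the pieces together, the following is a high-level sketch of Algorithm 1 under the assumed interfaces from the earlier snippets (`safe_action`, a safety-critic update, a policy update, and a dual-variable update). The environment is assumed to follow the classic Gym API and to report the binary failure signal in `info["constraint"]`; the per-epoch threshold schedule is delegated to a user-supplied `compute_threshold`.

```python
def train_csc(env, policy, q_c, buffer, update_safety_critic, update_policy,
              update_dual, compute_threshold, epsilon=0.1, epochs=100,
              episodes_per_epoch=10, grad_steps=50):
    lam = 1.0                      # Lagrange multiplier
    avg_failures = epsilon         # previous epoch's average episodic failures
    for _ in range(epochs):
        failures = 0
        threshold = compute_threshold(epsilon, avg_failures)
        for _ in range(episodes_per_epoch):
            state, done = env.reset(), False
            while not done:
                action = safe_action(policy, q_c, state, threshold)
                next_state, reward, done, info = env.step(action)
                c = info["constraint"]              # 1 on catastrophic failure
                buffer.add(state, action, next_state, reward, c)
                failures += c
                state = next_state
        avg_failures = failures / episodes_per_epoch

        for _ in range(grad_steps):
            batch = buffer.sample()
            update_safety_critic(q_c, policy, batch)        # conservative update, eq. (2)
            update_policy(policy, q_c, batch, lam)          # constrained ascent, eq. (7)
            lam = update_dual(lam, q_c, policy, batch, epsilon)  # dual step, eq. (9)
    return policy, q_c
```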

4 Theoretical Analysis

In this section, we aim to theoretically analyze our approach, showing that the expected probability of failures is bounded after each policy update throughout the learning process, while ensuring that the convergence rate to the optimal solution is only mildly bottlenecked by the additional safety constraint. Our main result, stated in Theorem 1, provides safety guarantees with a high probability during training, by bounding the expected probability of failure of the policy that results from Equation 4. To prove this, we first state a Lemma that shows that the constraints in Equation 4 are satisfied with high probability during the policy updates. Detailed proofs of all the Lemmas and Theorems are in Appendix A.1.

Notation. Let $\hat{A}_C$ denote the learned (conservative) estimate of the safety advantage $A_C$, and let the overestimation due to CQL be their non-negative expected difference. Let the sampling error denote the error incurred when estimating the expected failure probability $V^{\pi}_{C}(\mu)$ by its sample estimate $\bar{V}_C$, computed from a finite number of samples.

Lemma 1.

If we follow Algorithm 1, during policy updates via Equation 4, the following is satisfied with high probability

Here, the sampling error term captures the error in estimating $V^{\pi_k}_{C}(\mu)$; with high probability it is bounded by a quantity that shrinks as the number of samples used in the estimation grows, with the leading constant obtained from union bounds and concentration inequalities (Kumar et al., 2020).

This lemma intuitively implies that the constraint on the safety critic in equation 4 is satisfied with high probability, when we note that the RHS can be made small as the number of samples used in the estimation becomes large.
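As an illustration of the kind of sampling-error bound invoked here (the symbols and the exact constant below are generic, not the paper's), a Hoeffding-style argument for the average $\bar{V}_C$ of $n$ independent episodic failure indicators in $[0,1]$ gives, with probability at least $1-\omega$,

```latex
\big|\bar{V}_{C} - \mathbb{E}[\bar{V}_{C}]\big| \;\le\; \sqrt{\frac{\log(2/\omega)}{2n}} .
```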

Lemma 1 provides a bound in terms of quantities of the old policy $\pi_k$. We now show that the expected probability of failure of the policy resulting from solving equation 4 is bounded with high probability.

Theorem 1.

Consider policy updates that solve the constrained optimization problem defined in Equation 4. With high probability, we have the following upper bound on the expected probability of failure of the updated policy during every policy update iteration:

(10)

Since part of the bound depends on the new policy $\pi_{k+1}$, it cannot be calculated exactly prior to the update; as the corresponding quantity is capped, the best bound we can construct for it is the trivial one. Now, in order to keep the expected probability of failure below $\epsilon$, we require the conservative over-estimate to be sufficiently large, and to guarantee this we can obtain a theoretically prescribed minimum value for it, as shown in the proof in Appendix A.1.

So far we have shown that, with high probability, we can satisfy the constraint in the objective during policy updates (Lemma 1) and obtain an upper bound on the expected probability of failure of the updated policy (Theorem 1). We now show that incorporating and satisfying safety constraints during learning does not severely affect the convergence rate to the optimal solution for task performance. Theorem 2 directly builds on the assumptions in Agarwal et al. (2019) and extends them to our constrained policy updates in equation 4.

Theorem 2 (Convergence rate for policy gradient updates with the safety constraint).

If we run the policy gradient updates through equation 4 for the policy $\pi_\phi$, with $\mu$ as the starting state distribution and an appropriately chosen learning rate, then for all policy update iterations we have, with high probability,

Since the value of the dual variable $\lambda$ strictly decreases during the gradient descent updates (Algorithm 1), it is upper-bounded. In addition, if we choose the conservatism parameter as discussed after Theorem 1 (equation 28), the corresponding term is controlled. Hence, with high probability, we can ensure that the additional term due to the safety constraint remains bounded.

So, we see that the additional term introduced in the convergence rate (compared to Agarwal et al., 2019) due to the safety constraint is upper bounded, and can be made small with high probability by choosing the relevant parameters appropriately, even after accounting for sampling error. In addition, we note that the safety threshold $\epsilon$ trades off against the convergence rate by modifying the magnitude of this term (a low $\epsilon$ means a stricter safety threshold and a larger additional term, implying a larger RHS and slower convergence). We discuss some practical considerations of the theoretical results in Appendix A.4.

5 Experiments

Through experiments on continuous control environments of varying complexity, we aim to evaluate the agreement between empirical performance and our theoretical guarantees by addressing the following questions:

  • How safe is CSC in terms of constraint satisfaction during training?

  • How does the learning of safe policies trade off against task performance during training?

5.1 Experimental Setup

Environments. In each environment, shown in Figure 2, we define a task objective that the agent must achieve and a criterion for catastrophic failure. The goal is to solve the task without dying. In point agent/car navigation avoiding traps, the agent must navigate a maze while avoiding traps. The agent has a health counter that decreases for every timestep it spends within a trap; when the counter hits 0, the agent gets trapped and dies. In Panda push without toppling, a 7-DoF Franka Emika Panda arm must push a vertically placed block across the table to a goal location without the block toppling over; failure is defined as the block toppling. In Panda push within boundary, the Panda arm must push a block across the table to a goal location without the block going outside a rectangular constraint region; failure occurs when the block's center of mass ((x, y) position) moves outside the constraint region. In Laikago walk without falling, an 18-DoF Laikago quadruped robot must walk without falling; the agent is rewarded for walking (or trotting) as fast as possible, and failure occurs when the robot falls. Since quadruped walking is an extremely challenging task, for all the baselines we initialize the agent's policy with a controller that has been trained to keep the agent standing while stationary.

Baselines and comparisons. We compare CSC to three prior methods: constrained policy optimization (CPO) (Achiam et al., 2017); a standard unconstrained RL method (Schulman et al., 2015a), which we call Base (a comparison with SAC (Haarnoja et al., 2018) is in Appendix Figure 7); and a method that extends Leave No Trace (Eysenbach et al., 2017) to our setting, which we refer to as Q-ensembles. This last comparison is the most similar to our approach, in that it also uses a safety critic (adapted from LNT's backward critic), but instead of our conservative updates, its safety critic uses an ensemble for epistemic uncertainty estimation, as proposed by Eysenbach et al. (2017). There are other safe RL approaches we cannot compare against, as they make multiple additional assumptions, such as the availability of a function that can be queried to determine whether a state is safe (Thananjeyan et al., 2020), the availability of a default safe policy for the task (Koller et al., 2018; Berkenkamp et al., 2017), and prior knowledge of the location of unsafe states (Fisac et al., 2019). In addition to the baseline comparisons (Figure 3), we analyze variants of our algorithm with different safety thresholds through ablation studies (Figure 4). We also analyze CSC and the baselines when seeded with a small amount of offline data in Appendix A.10.

Figure 2: Illustrations of the five environments in our experiments: (a) 2D Point agent navigation avoiding traps. (b) Car navigation avoiding traps. (c) Panda push without toppling. (d) Panda push within boundary. (e) Laikago walk without falling.

5.2 Empirical Results

Comparable or better performance with significantly lower failures during training. In Figure 3, we observe that CSC has significantly lower average failures per episode, and hence lower cumulative failures, during the entire training process. Although the failures are significantly lower for our method, task performance and convergence of the average task reward are comparable to or better than all prior methods, including the Base method, which corresponds to an unconstrained RL algorithm. While the CPO and Q-ensembles baselines also eventually achieve near-zero average failures, CSC achieves this much earlier in training. In order to determine whether the benefits in average failures are statistically significant, we conduct pairwise t-tests between CSC and the most competitive baseline, Q-ensembles, for the four environments in Figure 3, and obtain p-values of 0.002, 0.003, 0.001, and 0.01, respectively. Since p < 0.05 for all the environments, the benefits of CSC over the baselines in terms of lower average failures during training are statistically significant.

CSC trades off task performance against safety guarantees based on the safety threshold $\epsilon$. In Figure 4, we plot variants of our method with different safety constraint thresholds $\epsilon$. Observe that: (a) when the threshold is set to a lower value (a stricter constraint), the number of average failures per episode decreases in all the environments, and (b) the convergence rate of the task reward is lower when the safety threshold is stricter. These observations empirically complement our theoretical guarantees in Theorems 1 and 2. We note that there are still some failures even in the strictest setting, which is to be expected in practice because in the initial stages of training there is high function approximation error in the learned critic $Q_C$. However, we observe that the average episodic failures quickly drop below the specified threshold after about 500 episodes of training.

Figure 3: Top row: Average task rewards (higher is better). Bottom row: Average catastrophic failures (lower is better). x-axis: Number of episodes (each episode has 500 steps). Results on four of the five environments we consider for our experiments. For each environment, we plot the average task reward, the average episodic failures, and the cumulative episodic failures. The safety threshold $\epsilon$ is the same for all the baselines in all the environments. Results are over four random seeds. Detailed results, including plots of cumulative failures, are in Fig. 6 of the Appendix.
Figure 4: Top row: Average task rewards (higher is better). Bottom row: Average catastrophic failures (lower is better). x-axis: Number of episodes (each episode has 500 steps). Results on four of the five environments we consider for our experiments. For each environment, we plot the average task reward, the average episodic failures, and the cumulative episodic failures. All the plots are for our method (CSC) with different safety thresholds $\epsilon$, specified in the legend. The plots show that our method can naturally trade off safety against task performance depending on how strictly the safety threshold is set. Results are over four random seeds. Detailed results, including plots of cumulative failures, are in Fig. 5 of the Appendix.

6 Related Work

We discuss prior safe RL and safe control methods under three subheadings.

Assuming prior domain knowledge of the problem structure. Prior works have attempted to solve safe exploration in the presence of structural assumptions about the environment or safety structures. For example, Koller et al. (2018) and Berkenkamp et al. (2017) assume access to a safe set of environment states and a default safe policy, Fisac et al. (2018) and Dean et al. (2019) assume knowledge of the system dynamics, and Fisac et al. (2019) assume access to a distance metric on the state space. SAVED (Thananjeyan et al., 2020) learns a kernel density estimate over unsafe states, and assumes access to a set of user demonstrations and a user-specified function that can be queried to determine whether a state is safe. In contrast to these approaches, our method does not assume any prior knowledge from the user or domain knowledge of the problem setting, except a binary signal from the environment indicating when a catastrophic failure has occurred.

Assuming a continuous safety cost function. CPO (Achiam et al., 2017) and Chow et al. (2019) assume that a cost function can be queried from the environment at every time-step, with the objective of keeping cumulative costs within a certain limit. This assumption limits the generality of the method in scenarios where only minimal feedback, such as a binary failure signal, is provided (additional details in Section A.3). ? assume that the safety cost function over trajectories is a known continuous function, and use this to learn an explicit safety set. Stooke et al. (2020) devise a general modification to the Lagrangian method by incorporating two additional terms in the optimization of the dual variable. SAMBA (Cowen-Rivers et al., 2020) uses a learned GP dynamics model and a continuous constraint cost function that encodes safety; the objective is to minimize the task cost while keeping a risk measure of the cumulative safety costs below a threshold. In the works of Dalal et al. (2018), Paternain et al. (2019b; a), and Grbic and Risi (2020), only the optimal policy is learned to be safe, and there are no safety guarantees during training. In contrast to these approaches, we assume only a binary signal from the environment indicating when a catastrophic failure has occurred, and instead of minimizing expected costs, our constraint formulation directly constrains the expected probability of failure.

Safety through recoverability. Prior works have attempted to devise resetting mechanisms that recover the agent to a base configuration from (near) a potentially unsafe state. LNT (Eysenbach et al., 2017) trains both a forward policy for solving the task and a goal-conditioned reset policy that kicks in when the agent is in an unsafe state, and learns an ensemble of critics; this is substantially more complex than our approach of learning a single conservative safety critic, which gives rise to a simple but provably safe exploration algorithm. In control theory, a number of prior works have focused on Hamilton-Jacobi-Isaacs (HJI) reachability analysis (Bansal et al., 2017) for providing safety guarantees and obtaining control inputs for dynamical systems (Herbert et al., 2019; Bajcsy et al., 2019; Leung et al., 2018). Our method does not require knowledge of the system dynamics or regularity conditions on the state space, which are crucial for computing unsafe states via HJI reachability.

7 Discussion, Limitations, and Conclusion

We introduced a safe exploration algorithm that learns a conservative safety critic to estimate the probability of failure for each candidate state-action tuple, and uses this estimate to constrain policy evaluation and policy improvement. We proved that the probability of failure is bounded throughout training and provided convergence results showing that ensuring safety does not severely bottleneck task performance. We empirically validated our theoretical results and showed that we achieve high task performance while incurring low failure rates during training.

While our theoretical results demonstrate that the probability of failure is bounded with high probability, one limitation is that we still observe non-zero failures empirically even when the threshold is set to its strictest value. This is primarily because of neural network function approximation error in the early stages of training the safety critic, which we cannot account for precisely in the theoretical results, and also because we bound the probability of failure rather than the absolute number of failures.

Although our approach bounds the probability of failure and is general in the sense that it does not assume access to any user-specified constraint function, in situations where the task is difficult to solve, for example due to stability issues of the agent, our approach will fail without additional assumptions. In such situations, an interesting direction for future work would be to develop a curriculum of tasks, starting with simple tasks where safety is easier to achieve and gradually moving towards more difficult tasks, such that the knowledge learned from previous tasks is not forgotten.

Acknowledgement

We thank Vector Institute, Toronto and the Department of Computer Science, University of Toronto for compute support. We thank Glen Berseth and Kevin Xie for helpful initial discussions about the project, Alexandra Volokhova, Arthur Allshire, Mayank Mittal, Samarth Sinha, and Irene Zhang for feedback on the paper, and other members of the UofT CS Robotics Group for insightful discussions during internal presentations and reading group sessions.

Appendix A Appendix

A.1 Proofs of all theorems and lemmas

Note. During policy updates via Equation 4, the constraint is satisfied with high probability if we follow Algorithm 1. This follows from the update equation 7, as we incorporate backtracking line search to ensure that the constraint is satisfied exactly. Let us revisit the update equation 7:

(11)

After every update, we check whether the constraint is satisfied; if not, we decay the step size by the backtracking coefficient $\beta$, update the proposed parameters, and repeat for up to a fixed number of steps until the constraint is satisfied. If it is still not satisfied after the maximum number of backtracking steps, we do not update the policy, i.e., we set $\phi_{k+1} = \phi_k$.

Lemma 1.

If we follow Algorithm 1, during policy updates via equation 4, the following is satisfied with high probability

Here, the sampling error term captures the error in estimating the expected failure probability of $\pi_k$; with high probability it is bounded by a quantity that shrinks as the number of samples used in the estimation grows, with the leading constant obtained from union bounds and concentration inequalities.

Proof.

Based on line 6 of Algorithm 1, for every rollout, the following holds:

(12)

We note that we can only compute a sample estimate instead of the true quantity, which can introduce sampling error in practice. In order to ensure that the sample estimate is not much smaller than the true quantity, we can bound their difference. Note that if the sample estimate is at least as large as the true quantity, the Lemma holds directly, so we only need to consider the case where it is smaller.

Let the difference between the true expected failure probability and its sample estimate be the sampling error. With high probability, this error is bounded by a quantity that shrinks as the number of samples used in the estimation grows, where the leading constant is obtained from union bounds and concentration inequalities. In addition, our learned estimate $\hat{A}_C$ is an overestimate of the true $A_C$, and we account for their difference explicitly below.

So, with high probability, we have

(13)

Theorem 1.

Consider policy updates that solve the constrained optimization problem defined in equation 4. With high probability, we have the following upper bound on the expected probability of failure of the updated policy during every policy update iteration:

(14)

Here, the additional term is the overestimation in $A_C$ due to CQL.

Proof.

$C(s)$ denotes the value of the constraint function from the environment in state $s$. This is analogous to the task reward function $R$. In our case, $C(s)$ is a binary indicator of whether a catastrophic failure has occurred; however, the analysis we present holds even when $C$ is a shaped continuous cost function.

Let $J(\pi)$ denote the discounted task reward obtained in expectation by executing policy $\pi$ for one episode, and let $J_C(\pi)$ denote the corresponding expected discounted constraint value.

$J(\pi) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R(s_t,a_t)\Big], \qquad J_C(\pi) = \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}C(s_t)\Big] \qquad (15)$

From the TRPO (Schulman et al., 2015a) and CPO (Achiam et al., 2017) papers, following similar derivations, we obtain the following bounds

(16)

Here, $A^{\pi}_{R}$ is the advantage function corresponding to the task rewards, the accompanying constant is defined as in CPO (Achiam et al., 2017), and $D_{TV}$ denotes the total variation distance. We also have,

(17)

Here, $A^{\pi}_{C}$ is the advantage function corresponding to the costs, with the accompanying constant defined analogously. In our case, $A^{\pi}_{C}$ is defined in terms of the safety Q-function $Q^{\pi}_{C}$, and CQL can bound its expectation directly. To see this, note that, by definition, $A^{\pi}_{C}(s,a) = Q^{\pi}_{C}(s,a) - \mathbb{E}_{a'\sim\pi}[Q^{\pi}_{C}(s,a')]$. Here, the RHS is precisely the term in equation 2 of Kumar et al. (2020) that is bounded by CQL. We obtain an overestimated advantage $\hat{A}_C$ from training the safety critic through the updates in equation 2; the expected magnitude of this over-estimate is non-negative. Note that after replacing $A^{\pi}_{C}$ by its over-estimate $\hat{A}_C$, the inequality in 17 above still holds.

Using Pinsker's inequality, we can convert these bounds to be in terms of $D_{KL}$ instead of $D_{TV}$:

$D_{TV}\big(\pi'(\cdot|s)\,\|\,\pi(\cdot|s)\big) \;\le\; \sqrt{\tfrac{1}{2}\,D_{KL}\big(\pi'(\cdot|s)\,\|\,\pi(\cdot|s)\big)} \qquad (18)$

By Jensen’s inequality,

$\mathbb{E}_{s\sim d^{\pi}}\Big[\sqrt{\tfrac{1}{2}\,D_{KL}\big(\pi'(\cdot|s)\,\|\,\pi(\cdot|s)\big)}\Big] \;\le\; \sqrt{\tfrac{1}{2}\,\mathbb{E}_{s\sim d^{\pi}}\Big[D_{KL}\big(\pi'(\cdot|s)\,\|\,\pi(\cdot|s)\big)\Big]} \qquad (19)$

So, we can replace the $D_{TV}$ terms in the bounds by the corresponding $D_{KL}$ terms. Then, inequality 17 becomes

(20)

Revisiting our objective in equation 4,

(21)

From inequality 20, we note that instead of constraining the expected failures of the new policy directly, we can constrain an upper bound on them. Writing the constraint in terms of the current policy iterate using equation 20,

(22)

As there is already a bound on this quantity, we drop the redundant term and define the following optimization problem, which is what we actually optimize:

(23)

Upper bound on the expected probability of failure. If the policy is updated using equation 4, then we have the following upper bound on the expected probability of failure of the updated policy: