# Measuring and avoiding side effects using relative reachability

## Abstract

How can we design reinforcement learning agents that avoid causing unnecessary disruptions to their environment? We argue that current approaches to penalizing side effects can introduce bad incentives in tasks that require irreversible actions, and in environments that contain sources of change other than the agent. For example, some approaches give the agent an incentive to prevent any irreversible changes in the environment, including the actions of other agents. We introduce a general definition of side effects, based on relative reachability of states compared to a default state, that avoids these undesirable incentives. Using a set of gridworld experiments illustrating relevant scenarios, we empirically compare relative reachability to penalties based on existing definitions and show that it is the only penalty among those tested that produces the desired behavior in all the scenarios.

## 1 Introduction

An important component of safe behavior for reinforcement learning agents is avoiding unnecessary side effects while performing a task (Amodei et al., 2016; Taylor et al., 2016). For example, if a robot’s task is to carry a box across the room, we want it to do so without breaking vases, scratching furniture, and so on. This problem has mostly been studied in the context of safe exploration during the agent’s learning process (Pecka and Svoboda, 2014; García and Fernández, 2015), but it can also occur after training if the reward function does not incorporate disruptions to the environment. We would like to incentivize the agent to avoid side effects without explicitly penalizing every possible disruption or going through a process of trial and error when designing the reward function. While such ad-hoc approaches can be sufficient for agents deployed in a narrow set of environments, they often require a lot of human input and are unlikely to scale well to increasingly complex and diverse environments. It is thus important to develop more principled and general approaches for avoiding side effects.

Most previous methods that address this problem in a general way are safe exploration methods that focus on preserving ergodicity by ensuring the reachability of initial states (Moldovan and Abbeel, 2012; Eysenbach et al., 2017), but this approach has two notable limitations. First, it is insensitive to the magnitude of the irreversible disruption: e.g. it would equally penalize the agent for breaking one vase or a hundred vases. Thus, if the objective requires an irreversible action, any further irreversible actions would not be penalized. Second, this criterion introduces undesirable incentives in dynamic environments, where irreversible events can happen spontaneously (due to the forces of nature, the actions of other agents, etc). Since such events make the starting state unreachable, the agent has an incentive to prevent them. This is often undesirable, e.g. if the event is a human eating food. A similar argument applies to reachability analysis methods (Mitchell et al., 2005; Gillula and Tomlin, 2012; Fisac et al., 2017), which require that a safe region must be reachable by a known conservative policy: the agent would be penalized if another agent or an environment event makes the safe region unreachable. Thus, while these methods address the side effects problem in environments where the agent is the only source of change and the objective does not require irreversible actions, a more general criterion is needed when these assumptions do not hold.

The main contribution of this paper is a side effects measure that reflects the magnitude of the agent’s effects, and avoids the bad incentives in dynamic environments that arise with existing approaches, as outlined in Section 2. The measure computes the relative reachability of states compared to a default state, as shown in Figure 1. Section 3 proposes several mutually compatible definitions, which take into account whether a state is reachable, how long it takes to reach the state, or how long it takes to reach a similar state. In Section 4, we compare relative reachability with other side effects penalties on toy gridworlds (including dynamic environments), and show that it is the only method among those tested that incentivizes correct behavior on the full set of environments.

## 2 Desirable properties of a side effects measure

We begin with some motivating examples for distinguishing intended and unintended effects:

###### Example 1 (Box).

The agent’s goal is to carry a box from point A to point B, and there is a vase in the shortest path that would break if the agent walks into it.

###### Example 2 (Omelette).

The agent’s goal is to make an omelette, which requires breaking some eggs.

In both of these cases, the agent would take an irreversible action by default (breaking a vase vs breaking eggs). However, the agent can still get to point B without breaking the vase (at the cost of a bit of extra time), but it cannot make an omelette without breaking eggs. We would like to penalize breaking the vase but not breaking the eggs. This indicates a desirable property for our definition:

###### Property 1.

Penalize the agent for effects on the environment if and only if those effects are unnecessary for achieving the objective.

Safety criteria are often implemented as constraints (García and Fernández, 2015; Moldovan and Abbeel, 2012; Eysenbach et al., 2017). This approach works well if we know exactly what the agent must avoid, but is too inflexible for a general criterion for avoiding side effects. For example, a constraint that the agent must never make the starting state unreachable would prevent it from making the omelette in Example 2, no matter how high the reward for doing so.

A more flexible way to implement a side effects criterion is by adding a penalty term to the reward function, which acts as an intrinsic pseudo-reward. Since the reward indicates whether the agent has achieved the objective, we could satisfy the above property by balancing the reward and the penalty. Then, the penalty would outweigh the small reward gain from walking into the vase over going around the vase, but it would not outweigh the large reward gain from breaking the eggs. This is the approach we take in this work. We will now discuss how to define such a penalty.

A side effects penalty can be defined as a measure of deviation of the current state $s_t$ from a baseline state $s'_t$, denoted $d(s_t; s'_t)$. The two main types of approaches in the literature, ergodicity-preserving safe exploration (Moldovan and Abbeel, 2012; Eysenbach et al., 2017) and low impact (Armstrong and Levinstein, 2017; Taylor et al., 2016), use criteria of this form. The deviation measure and baseline can be chosen separately, and we argue that both classes of methods use a suboptimal combination of the two.

### 2.1 Choosing a baseline state

We compare two existing approaches to choosing a baseline state:

Starting state baseline. One natural choice of baseline is the starting state when the agent was deployed, which we call the “starting state” baseline. This is the baseline used in the ergodicity-preserving approach, where the agent learns a reset policy. The reset policy is rewarded for reaching states that are likely under the initial state distribution, so its value function represents how quickly it is possible to reach one of the starting states. This addresses the case where there are no irreversible events in the environment other than those caused by the agent. In the more general case where this assumption does not hold, using the starting state baseline would penalize irreversible events in the environment that are unrelated to the objective:

###### Example 3 (Sushi).

The environment contains a human eating sushi, which is unrelated to the agent’s goal and would happen regardless of the agent being deployed. Penalizing deviations from the starting state would incentivize the agent to prevent the sushi from being eaten.

Inaction baseline. The low impact approach (Armstrong and Levinstein, 2017) measures side effects as the agent’s “impact”, defined as a measure of difference from a state where the agent is never deployed. Their baseline $s'_t$ is the state that the environment would currently be in if the agent had not been deployed. We can also define $s'_t$ as the state that would be reached if the agent had followed some safe default policy. Either way, we call $s'_t$ the “inaction” baseline. It distinguishes the agent’s effects from environment events that would have happened anyway, such as the sushi being eaten. One downside is that determining the counterfactual default state requires the ability to simulate the environment, though a full causal model may not be necessary.

We have now identified a desirable property for a choice of baseline state:

###### Property 2.

Distinguish between agent effects and environment events, and penalize the agent for the former but not the latter.

The inaction baseline achieves this, while the starting state baseline does not, so the inaction baseline is a better choice according to this criterion. However, the starting state baseline can be much easier to compute than the inaction baseline, since the starting state is easier to obtain information about (especially in MDPs), while the counterfactual state must be computed by simulating the environment.

### 2.2 Choosing a measure of deviation from the baseline

We compare two existing types of deviation measure:

Symmetric deviation. One natural choice of deviation measure is a distance between states. Armstrong and Levinstein (2017) use a distance measure based on differences in some set of state variables between the current state and the inaction baseline. This is a symmetric deviation measure: $d(s_t; s'_t) = d(s'_t; s_t)$. This symmetry means that the agent is equally penalized for irreversible effects and for some types of reversible effects - in particular, for preventing irreversible events.

Let $e$ be an irreversible event. Starting from state $s$, let $s_e$ be the resulting state if $e$ happens, and let $s_{\neg e}$ be the resulting state if $e$ does not happen. If the agent causes $e$, then the current state is $s_e$ and the baseline state is $s_{\neg e}$, so the agent receives penalty $d(s_e; s_{\neg e})$. If instead $e$ would happen by default but the agent prevents it from happening, then the current state is $s_{\neg e}$ and the baseline state is $s_e$, so the agent receives the same penalty, $d(s_{\neg e}; s_e) = d(s_e; s_{\neg e})$. See Figure 2 for an illustration. This creates undesirable incentives for the agent when the agent’s objective is to prevent an irreversible event from happening:
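As a toy illustration (our own construction, not from the paper), the following sketch makes the symmetry concrete: with a distance over state variables, causing the irreversible event and preventing it incur identical penalties.

```python
# Toy illustration: a symmetric deviation measure penalizes causing and
# preventing an irreversible event equally.

def l1_distance(state_a, state_b):
    """Symmetric deviation: sum of absolute differences of state variables."""
    return sum(abs(a - b) for a, b in zip(state_a, state_b))

# One state variable: 1 = vase intact, 0 = vase broken.
s_event = (0,)      # state if the irreversible event happens (vase broken)
s_no_event = (1,)   # state if it does not happen (vase intact)

# Agent causes the event: current state s_event, baseline s_no_event.
penalty_cause = l1_distance(s_event, s_no_event)
# Agent prevents the event: current state s_no_event, baseline s_event.
penalty_prevent = l1_distance(s_no_event, s_event)

assert penalty_cause == penalty_prevent == 1
```

Any distance with $d(x; y) = d(y; x)$ behaves this way, which is exactly what Property 3 below rules out.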

###### Example 4 (Vase on a conveyor belt).

There is a vase on a moving conveyor belt, which would fall off and break upon reaching the end of the belt (the vase falling off is the irreversible event ). The agent’s task is to take the vase off the belt, and it would be rewarded for doing so. Symmetric deviation with the inaction baseline incentivizes the agent to take the vase off the conveyor belt, get the reward, and then put it back on the belt in order to reach the default state where the vase is broken.

Asymmetric deviation. Ergodicity-preserving safe exploration methods such as Eysenbach et al. (2017) use a deviation measure that represents the difficulty of returning from the current state $s_t$ to the starting state $s_0$ (e.g. the negative of the reset policy’s value function). This deviation is asymmetric, because reaching $s_0$ from $s_t$ can be much easier (or more difficult) than reaching $s_t$ from $s_0$: $d(s_t; s_0) \neq d(s_0; s_t)$. This approach is not limited to resetting to the starting state - it can be applied to reaching any baseline state. We call this deviation measure the reachability measure: $d(s_t; s'_t)$ represents the difficulty of reaching $s'_t$ from $s_t$. Due to this asymmetry, the reachability measure gives a higher penalty for irreversible effects than reversible effects (such as preventing irreversible events), which can help avoid the pathological behavior in Example 4.

We have now identified a desirable property for a deviation measure that is satisfied by the reachability measure but not by a symmetric deviation measure such as a distance between states:

###### Property 3.

Give a higher penalty for irreversible effects than for reversible effects.

Another desirable property for a deviation measure is sensitivity to the magnitude of the agent’s irreversible effects:

###### Property 4.

(Cumulative penalty) The penalty should accumulate when more irreversible effects occur. For example, if the agent starts in state $s_0$, takes an irreversible action that leads to state $s_1$, and then takes another irreversible action that leads to state $s_2$, then $d(s_2; s_0) > d(s_1; s_0)$.

###### Example 5.

A variation on Example 1, where the environment contains two vases (vase 1 and vase 2) and the agent’s goal is to do nothing. The agent can take action $b_i$ to break vase $i$. The MDP is shown in Figure 3. The penalty should be higher in the case where the agent breaks two vases than in the case where it only breaks one vase.

This property cannot be satisfied by simply penalizing the agent for making the baseline state unreachable, since it will not have an incentive to avoid further irreversible effects after the baseline has become unreachable. For example, if the agent receives the maximum penalty whether it breaks one or two vases, it has no incentive to avoid breaking the second vase once the first vase is broken.

To satisfy this property, we introduce a measure of relative reachability in the next section. For each possible state, we penalize the agent if it is less reachable from the current state than from the baseline. This penalty will increase with each irreversible action by the agent that cuts off more states that were reachable from the baseline. In Example 5, breaking vase 1 cuts off the states in which vase 1 is intact, and breaking vase 2 after that also cuts off the state in which only vase 1 is broken, which increases the penalty.
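A minimal sketch of Example 5 (our own encoding: a state is a pair of intact/broken flags, and breaking is the only irreversible transition) shows the penalty accumulating with each vase broken, as Property 4 requires.

```python
# Example 5 sketch: two vases, breaking is irreversible (1 -> 0, never back).
# The undiscounted relative reachability penalty grows with each break.

from itertools import product

def reachable(state):
    """States reachable from `state`: each vase can only go from intact to broken."""
    v1, v2 = state
    return {(a, b) for a, b in product(range(2), repeat=2) if a <= v1 and b <= v2}

def relative_reachability(current, baseline):
    """Count of states reachable from the baseline but not from `current`
    (undiscounted coverage is 0/1 in this deterministic environment)."""
    return len(reachable(baseline) - reachable(current))

baseline = (1, 1)                                      # inaction: both vases intact
after_one = relative_reachability((0, 1), baseline)    # vase 1 broken
after_two = relative_reachability((0, 0), baseline)    # both vases broken

assert after_one == 2 and after_two == 3               # the penalty accumulates
```

A penalty that only checks whether the baseline itself is reachable would return 1 in both cases, which is exactly the failure described above.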

## 3 Relative reachability

Let the coverage $C(\tilde{s}; s)$ of state $s$ from state $\tilde{s}$ be some measure of how easily the agent can reach $s$ from $\tilde{s}$ (we explore several specific instances later in this section). We define the relative reachability measure as the total reduction in coverage from the current state $S_t$ compared to the baseline $S'_t$:

$$d(S_t; S'_t) := \sum_s \max\left(C(S'_t; s) - C(S_t; s),\; 0\right)$$

See Figure 1(b) for an illustration. This measure is nonnegative everywhere, and zero for states that reach or exceed baseline coverage of all states.
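In a tabular setting, the definition above is a direct sum over states. A minimal sketch (the coverage matrix `C` is an illustrative assumption, with `C[x][s]` measuring how easily state `s` is reached from state `x`):

```python
# Relative reachability d(S_t; S'_t) = sum_s max(C(S'_t; s) - C(S_t; s), 0),
# computed from a precomputed coverage matrix C over a small state space.

def relative_reachability(C, current, baseline):
    """Total reduction in coverage of the current state relative to the baseline."""
    n_states = len(C)
    return sum(max(C[baseline][s] - C[current][s], 0.0) for s in range(n_states))

# Three states; 0/1 entries correspond to undiscounted coverage in a
# deterministic environment (1 = reachable, 0 = not).
C = [
    [1.0, 1.0, 1.0],   # from state 0, every state is reachable
    [0.0, 1.0, 0.0],   # from state 1, only state 1 itself is reachable
    [0.0, 0.0, 1.0],   # from state 2, only state 2 itself is reachable
]

assert relative_reachability(C, current=1, baseline=0) == 2.0  # two states cut off
assert relative_reachability(C, current=0, baseline=0) == 0.0  # no reduction
```

The `max(..., 0)` clips gains in coverage, so the agent is penalized only for making states less reachable, never rewarded for making them more reachable.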

We now introduce several definitions of coverage that take into account different aspects of reachability, and show that these definitions are extensions of each other.

Undiscounted coverage only takes into account whether or not the given state is reachable, so the resulting relative reachability penalty only penalizes irreversible effects (making states unreachable that were reachable from the baseline). We define undiscounted coverage as the maximum probability of reaching state $s$ from $\tilde{s}$ in finite time, over all possible policies:

$$C_1(\tilde{s}; s) := \max_\pi P\left(N_\pi(\tilde{s}; s) < \infty\right) \tag{3.1}$$

where $N_\pi(\tilde{s}; s)$ is the number of steps policy $\pi$ takes to reach $s$ from $\tilde{s}$. If the environment is deterministic, the coverage is equal to $1$ if $s$ is reachable from $\tilde{s}$ and $0$ otherwise (see Figure 1(a)).
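In the deterministic case, undiscounted coverage is therefore just graph reachability, which a breadth-first search over the transition graph computes directly (a sketch under that assumption; the transition dictionary is illustrative):

```python
# Undiscounted coverage in a deterministic environment: C_1(x; s) = 1 if s is
# reachable from x and 0 otherwise, computed by breadth-first search.

from collections import deque

def undiscounted_coverage(transitions, start):
    """Return {s: C_1(start; s)} for a deterministic transition function
    given as {state: {action: next_state}}."""
    reached = {start}
    frontier = deque([start])
    while frontier:
        x = frontier.popleft()
        for nxt in transitions.get(x, {}).values():
            if nxt not in reached:
                reached.add(nxt)
                frontier.append(nxt)
    return {s: (1.0 if s in reached else 0.0) for s in transitions}

# Tiny environment with one irreversible action: 'break' cannot be undone.
transitions = {
    "intact": {"noop": "intact", "break": "broken"},
    "broken": {"noop": "broken"},
}
assert undiscounted_coverage(transitions, "intact") == {"intact": 1.0, "broken": 1.0}
assert undiscounted_coverage(transitions, "broken") == {"intact": 0.0, "broken": 1.0}
```

After the irreversible action, the intact state has coverage 0, so it contributes to the relative reachability penalty.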

Discounted coverage also takes into account how long it takes to reach the given state $s$, so the resulting relative reachability penalty also penalizes reversible effects. To take into account the time costs of reaching states, we introduce a discount parameter $\gamma < 1$. The higher the value of $\gamma$, the less time costs matter, with the limit case $\gamma \to 1$ representing the undiscounted case. We define discounted coverage as follows:

$$C_\gamma(\tilde{s}; s) := \max_\pi \mathbb{E}\left[\gamma^{N_\pi(\tilde{s}; s)}\right]. \tag{3.2}$$

This is equivalent to the value function of the optimal policy for an agent that receives reward 1 for reaching $s$ and 0 otherwise, and uses a discount factor of $\gamma$. States that are reachable in fewer steps will thus have higher coverage. If a state is not reachable in finitely many steps, $N_\pi(\tilde{s}; s) = \infty$, so since $\gamma < 1$, the coverage will be $0$.

###### Proposition 1.

For all states $\tilde{s}$ and $s$, as $\gamma \to 1$, discounted coverage (3.2) approaches undiscounted coverage (3.1): $\lim_{\gamma \to 1} C_\gamma(\tilde{s}; s) = C_1(\tilde{s}; s)$.

###### Proof.

See Appendix A.1. ∎

See Appendix B for example computations of discounted and undiscounted coverage in Example 5. Discounted coverage can be computed recursively using the following Bellman equation (the case $\gamma = 1$ corresponds to undiscounted coverage):

$$C_\gamma(\tilde{s}; s) = \gamma \max_a \sum_{\tilde{s}'} P(\tilde{s}' \mid \tilde{s}, a)\, C_\gamma(\tilde{s}'; s),$$

where $a$ is the action taken in state $\tilde{s}$, and $\tilde{s}'$ is the next state.
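This Bellman equation can be solved by value iteration in the tabular case. A sketch under stated assumptions: the transition model `P` is illustrative, and we treat the target as absorbing with $C_\gamma(s; s) = 1$ as the base case (consistent with $C_\gamma$ being the value function for reward 1 at $s$):

```python
# Value iteration for discounted coverage:
# C_gamma(x; s) = gamma * max_a sum_{x'} P(x' | x, a) * C_gamma(x'; s),
# with the assumed base case C_gamma(s; s) = 1.

def discounted_coverage(P, target, gamma=0.9, iters=500):
    """P[x][a] = list of (prob, next_state); returns {x: C_gamma(x; target)}."""
    states = list(P)
    C = {x: 0.0 for x in states}
    C[target] = 1.0
    for _ in range(iters):
        for x in states:
            if x == target:
                continue  # base case stays fixed at 1
            C[x] = gamma * max(
                sum(p * C[nxt] for p, nxt in P[x][a]) for a in P[x]
            )
    return C

# Deterministic 3-state chain: a -> b -> c.
P = {
    "a": {"right": [(1.0, "b")], "stay": [(1.0, "a")]},
    "b": {"right": [(1.0, "c")], "stay": [(1.0, "b")]},
    "c": {"stay": [(1.0, "c")]},
}
C = discounted_coverage(P, target="c", gamma=0.9)
assert abs(C["b"] - 0.9) < 1e-9 and abs(C["a"] - 0.81) < 1e-9
```

As expected, coverage decays geometrically with the number of steps needed to reach the target: $\gamma^1$ from one step away, $\gamma^2$ from two.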

In large state spaces, the agent might not be able to reach the given state $s$, but may be able to reach states that are similar to $s$ according to some distance measure $\delta$. We will now extend our previous definitions to this case by defining similarity-based coverage:

$$\text{Discounted:}\quad C_{\gamma,\delta}(\tilde{s}; s) := \max_\pi \sum_{k=0}^{\infty} (1-\gamma)\gamma^k\, \mathbb{E}\left[e^{-\delta(\tilde{S}^\pi_k,\, s)}\right] \tag{3.3}$$

$$\text{Undiscounted:}\quad C_{1,\delta}(\tilde{s}; s) := \max_\pi \lim_{k \to \infty} \mathbb{E}\left[e^{-\delta(\tilde{S}^\pi_k,\, s)}\right] \tag{3.4}$$

where $\tilde{S}^\pi_k$ is the state that the agent is in after following policy $\pi$ for $k$ steps starting from $\tilde{s}$. Discounted similarity-based coverage is proportional to the value function of the optimal policy for an agent that gets reward $e^{-\delta(\tilde{s}_k, s)}$ in state $\tilde{s}_k$ (which rewards the agent for going to states that are similar to $s$) and uses a discount factor of $\gamma$. Undiscounted similarity-based coverage represents the highest reward the agent could attain in the limit by going to states as similar to $s$ as possible.
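Since the per-step reward $(1-\gamma)\,e^{-\delta(\tilde{s}_k, s)}$ depends only on the current state, discounted similarity-based coverage can also be computed by value iteration. A sketch under our assumptions (illustrative transition model; a large finite constant stands in for the indicator distance's infinity):

```python
import math

def similarity_coverage(P, delta, target, gamma=0.9, iters=2000):
    """C_{gamma,delta}(x; s) via value iteration with per-step reward
    (1 - gamma) * exp(-delta(x, s)); P[x][a] = list of (prob, next_state)."""
    V = {x: 0.0 for x in P}
    for _ in range(iters):
        for x in P:
            V[x] = (1 - gamma) * math.exp(-delta(x, target)) + gamma * max(
                sum(p * V[nxt] for p, nxt in P[x][a]) for a in P[x]
            )
    return V

def indicator(x, s):
    """Indicator distance: 0 at the target, effectively infinite elsewhere."""
    return 0.0 if x == s else 1e9

# Deterministic 3-state chain: a -> b -> c.
P = {
    "a": {"right": [(1.0, "b")], "stay": [(1.0, "a")]},
    "b": {"right": [(1.0, "c")], "stay": [(1.0, "b")]},
    "c": {"stay": [(1.0, "c")]},
}
V = similarity_coverage(P, indicator, target="c", gamma=0.9)
# With the indicator distance, this matches plain discounted coverage:
# gamma^1 from one step away, gamma^2 from two.
assert abs(V["b"] - 0.9) < 1e-6 and abs(V["a"] - 0.81) < 1e-6
```

The indicator-distance check numerically illustrates the equivalence with discounted coverage (3.2) established below.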

###### Proposition 2.

For all states $\tilde{s}$ and $s$, as $\gamma \to 1$, similarity-based discounted coverage (3.3) approaches similarity-based undiscounted coverage (3.4): $\lim_{\gamma \to 1} C_{\gamma,\delta}(\tilde{s}; s) = C_{1,\delta}(\tilde{s}; s)$.

###### Proof.

See Appendix A.2. ∎

###### Proposition 3.

Let the indicator distance $\delta_I$ be a distance measure with $\delta_I(\tilde{s}, s) = 0$ if $\tilde{s} = s$ and $\delta_I(\tilde{s}, s) = \infty$ otherwise (so it only matters whether the exact target state is reachable). Then for all states $\tilde{s}$ and $s$,

• similarity-based discounted coverage (3.3) is equivalent to discounted coverage (3.2): $C_{\gamma,\delta_I}(\tilde{s}; s) = C_\gamma(\tilde{s}; s)$,

• similarity-based undiscounted coverage (3.4) is equivalent to undiscounted coverage (3.1): $C_{1,\delta_I}(\tilde{s}; s) = C_1(\tilde{s}; s)$.

###### Proof.

See Appendix A.3. ∎

We can represent the relationships between the coverage definitions as follows:

$$\begin{array}{ccc}
C_{\gamma,\delta}\ \text{(3.3)} & \xrightarrow{\;\gamma \to 1\ \text{(Prop 2)}\;} & C_{1,\delta}\ \text{(3.4)} \\
\Big\downarrow{\scriptstyle\, \delta = \delta_I\ \text{(Prop 3)}} & & \Big\downarrow{\scriptstyle\, \delta = \delta_I\ \text{(Prop 3)}} \\
C_{\gamma}\ \text{(3.2)} & \xrightarrow{\;\gamma \to 1\ \text{(Prop 1)}\;} & C_{1}\ \text{(3.1)}
\end{array}$$

## 4 Experiments

We run a tabular Q-learning agent with different penalties on gridworld environments (see Figure 5) illustrating Examples 1-4. These simple gridworlds make it clear what happens and what is supposed to happen in each of the scenarios. The agent’s total reward at time step $t$ is $r_t - \beta\, d(s_t; s'_t)$, where $r_t$ is the reward, $d(s_t; s'_t)$ is a side effects penalty (based on the current state $s_t$ and baseline state $s'_t$), and $\beta$ is a scaling parameter. We compare the following penalties:

• Relative reachability penalty using undiscounted coverage (3.1), where $s'_t$ is the inaction or starting state baseline. (We test the relative reachability penalty using discounted coverage (3.2) on toy MDP versions of the examples in Appendix C.)

• Variable-based distance penalty, given by a distance between the state-variable vectors $v(s_t)$ and $v(s'_t)$, where $v(s) = $ [agent position (x), agent position (y), object position (x), object position (y)] and $s'_t$ is the inaction or starting state baseline. This is an example of the low impact approach from Armstrong and Levinstein (2017), which does not satisfy Property 3. We also try a variant using object position only.

• Reset penalty, based on the value function $V_r$ of a reset policy, which gets reward 1 for reaching the starting state and 0 otherwise. This is an example of the ergodicity-preserving approach similar to Moldovan and Abbeel (2012), which does not satisfy Properties 2 or 4.

In these proof-of-concept experiments, the penalties are computed with full knowledge of the environment. In a more general setting where the agent is learning about the environment, it would approximate the penalty based on its current knowledge.
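The setup above can be sketched as a single Q-learning update on the penalty-augmented reward. This is our own hedged reconstruction, not the paper's code: the environment interface, `penalty` function, and hyperparameters are illustrative assumptions.

```python
# Tabular Q-learning step using the augmented reward r - beta * d, where d is
# a side-effects penalty evaluated at the next state (illustrative interface).

import random
from collections import defaultdict

def q_learning_step(Q, env, penalty, beta, alpha=0.1, gamma=0.99, eps=0.1):
    """One epsilon-greedy step; `penalty(s)` plays the role of d(s; s')."""
    s = env.state
    actions = env.actions(s)
    if random.random() < eps:
        a = random.choice(actions)                     # explore
    else:
        a = max(actions, key=lambda act: Q[(s, act)])  # exploit
    s_next, r = env.step(a)
    total = r - beta * penalty(s_next)                 # augmented reward
    best_next = max(Q[(s_next, b)] for b in env.actions(s_next))
    Q[(s, a)] += alpha * (total + gamma * best_next - Q[(s, a)])
    return Q
```

The choice of `beta` controls the trade-off discussed in Section 2: large enough to outweigh the reward gain from unnecessary side effects, small enough not to outweigh irreversible effects the objective requires.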

In addition to the reward function, each environment has a safety performance function, originally introduced in Leike et al. (2017), which is not observed by the agent. This represents the agent’s performance according to the designer’s true preferences: it reflects how well the agent achieves the objective and whether it does so safely.

### 4.1 Irreversible Side Effects (Box)

We test our set of penalties on the Irreversible Side Effects environment from the AI Safety Gridworlds suite (Leike et al., 2017), shown in Figure 5. The environment contains a box that needs to be pushed out of the way for the agent to reach the goal. The agent receives a reward of 50 for reaching the goal, and a reward of -1 for moving. The unsafe behavior is taking the shortest path to the goal, which involves pushing the box into a corner (an irrecoverable position). The desired behavior is to take a slightly longer path in order to push the box to the right. The starting state and inaction baseline are the same in this environment. The safety performance is the reward plus a negative penalty term that applies if the box is in a corner. The longer path to the goal achieves a performance of 43, while the unsafe shorter path achieves a performance of 35.

This environment illustrates Examples 1 and 2. Pushing the box into a corner is an irreversible effect that is unnecessary for the objective, so we would like the agent to avoid it (as in Example 1). However, after the box is moved in any direction, the agent and the box cannot simultaneously return to their starting positions (if the box is moved to the right, the agent can move it back, but then the agent ends up on the other side of the box). This is an irreversible effect required to reach the objective, and so is part of the desired behavior (as in Example 2).

The relative reachability penalty achieves the optimal safety performance of 43 for values of $\beta$ between 0.05 and 2. The other penalties do not produce the desired behavior for any value of $\beta$. For $\beta$ values below a certain threshold, the agent always pushes the box into a corner and the safety performance reaches the unsafe level of 35. As $\beta$ approaches the threshold, it takes longer and longer to reach this level, and for $\beta$ values above the threshold, the performance does not stabilize within 20000 episodes. The distance penalty variant that only takes the box position into account produces similar behavior. (See Figure 8(a) for the transition from low to high values of $\beta$.) Results for representative values of $\beta$ (below the respective thresholds) are shown in the first plot of Figure 7.

The reset penalty fails here because reaching the goal requires an irreversible effect: it is impossible to reset the positions of both the agent and the box after the box is moved in any direction. This means that the reset policy’s value function is 0 for any state where the box has been moved, and so the penalty maxes out and does not distinguish between recoverable and irrecoverable box positions. This illustrates the importance of satisfying Property 4: accumulating penalties for irreversible effects.

The distance penalty fails because the agent receives the same penalty for moving the box down or to the right, and continues to receive this penalty until the box is moved into its original position or until the goal is reached. After moving the box to the right, it would take the agent 4 moves to go to the other side of the box, in order to move the box to its original position. However, in 4 moves the agent can just go to the goal directly. Thus, the agent is always incentivized to take the shortest path that moves the box into a corner. This would no longer hold if the goal were very far away from the box, so the penalty would accumulate over time and we would expect the distance penalty to succeed.

### 4.2 Conveyor Belt

We designed a new gridworld in the AI Safety Gridworlds framework to illustrate Examples 3 and 4. This environment has two variants representing the sushi and conveyor belt scenarios in Examples 3 and 4 respectively, shown in Figure 5. The object on the conveyor belt is either a sushi dish or a vase. After every agent action, the conveyor belt advances to the right by one cell. An irreversible event occurs when the object reaches the end of the belt: the sushi is eaten, or the vase falls off and breaks.

The agent can move the object off the belt by stepping down into its cell. In the sushi scenario, this is an unintended behavior that gets no reward, while in the vase scenario, this is an intended behavior that gets a reward of 50. The presence or absence of this reward is the only difference between the sushi and vase environments. In the sushi scenario, the safety performance is 50 if the sushi is eaten and 0 otherwise. In the vase scenario, the safety performance is 50 if the vase is intact and 0 if it is broken. The episode always takes 20 steps and there is no movement reward.

For the vase environment, the safety performance results are shown in the second plot of Figure 7. Unsurprisingly, all penalties with the starting state baseline perform well (with the performance stabilizing slightly below the optimal value of 50 due to exploration). The distance penalty with the inaction baseline achieves a performance of 0 (through the “overcompensation” behavior of moving the vase off the belt and then putting it back on) for values of $\beta$ below a certain threshold, and does not interfere with the vase at all for higher values of $\beta$. The distance penalty variant that only takes the vase position into account produces similar behavior. Figure 6 shows the overcompensation behavior, and Figure 8(b) illustrates the transition from low to high $\beta$.

However, relative reachability with the inaction baseline avoids this behavior, since the baseline state remains reachable after taking the vase off, and so the penalty is 0. We note that the discounted relative reachability penalty with the inaction baseline could produce the overcompensation behavior if $\gamma$ is low and the episodes are long, as the penalty would accumulate over time.

For the sushi environment, the safety performance results are shown in the last plot of Figure 7. Since the reward is always 0, the value of $\beta$ makes no difference. All agents with the inaction baseline do well here, while the agents with the starting state baseline remove the sushi from the belt. Relative reachability with the inaction baseline does slightly worse than the distance penalty, since on some small fraction of episodes, the agent accidentally pushes the sushi off the belt (by taking two steps down at the start of the episode). The agent then has no incentive to put the sushi back on the belt, since the baseline is reachable from there. We expect that this could be avoided by using discounted relative reachability, or by giving the agent a goal that had nothing to do with the conveyor belt.

## 5 Related work

Safe exploration. Safe exploration methods prevent the agent from taking harmful actions by enforcing safety constraints (Turchetta et al., 2016; Dalal et al., 2018), using intrinsic motivation (Lipton et al., 2016), penalizing risk (Chow et al., 2015; Mihatsch and Neuneier, 2002), preserving ergodicity (Moldovan and Abbeel, 2012; Eysenbach et al., 2017), etc. Explicitly defined constraints or safe regions tend to be task-specific and require significant human input, so they do not provide a general solution to the side effects problem. Penalizing risk can help the agent avoid getting trapped or damaged (which reduces the agent’s reward), but does not discourage the agent from damaging the environment if such damage is not accounted for in the reward function. Most importantly, none of the above classes of methods address the side effects problem in dynamic environments, where the agent is not the only source of change (as discussed in Section 1).

Empowerment. Our relative reachability measure is related to empowerment (Klyubin et al., 2005; Salge et al., 2014; Mohamed and Rezende, 2015; Gregor et al., 2017), a measure of the agent’s control over its environment, defined as the highest possible mutual information between the agent’s actions and the future state. Empowerment measures the agent’s ability to reliably reach many states, while the relative reachability measure penalizes the reduction in reachability of states relative to the baseline. Irreversible side effects decrease the agent’s empowerment (for example, the agent cannot reliably reach as many states if the vase is broken as if the vase is intact), so we expect that maximizing empowerment would encourage the agent to avoid irreversible side effects. However, similarly to ergodicity-preserving safe exploration methods (Moldovan and Abbeel, 2012), it would also incentivize the agent to prevent irreversible events, thus failing Property 2.

It is unclear how to define an empowerment-based measure that would satisfy Properties 1-4. It would not be sufficient to simply penalize the reduction in empowerment between the current state and the baseline: this creates a tradeoff between cutting off some states and making other states more reachable, and would thus not prevent some types of side effects (failing Property 1). (For example, if the agent replaced the sushi on the conveyor belt with a vase, empowerment could remain the same, and so the agent would not be penalized for destroying the vase.) In a deterministic environment, maximizing empowerment over the set of states that are reachable from the baseline state (e.g. by using conditional mutual information) would satisfy Property 2. However, this does not easily generalize to stochastic environments where states might only be reachable some of the time. It is also unclear whether or how such a measure could be extended to penalizing reversible side effects (the discounted case) since empowerment does not take into account how long it takes to reach states.

Human oversight. An alternative to specifying a side effects penalty is to teach the agent to avoid side effects through human oversight, such as inverse reinforcement learning (Ng and Russell, 2000; Ziebart et al., 2008; Hadfield-Menell et al., 2016), demonstrations (Abbeel and Ng, 2004; Hester et al., 2018), or human feedback (Christiano et al., 2017; Saunders et al., 2017; Warnell et al., 2018). Whether an agent would learn a general heuristic for avoiding side effects from human oversight depends on the diversity of settings in which it receives human oversight and its ability to generalize from those settings, which is hard to guarantee. We expect that an intrinsic penalty for side effects would more reliably result in avoiding them. Such a penalty could also be combined with human oversight to decrease the amount of human input required for an agent to learn human preferences.

One method that takes less human input than other human oversight approaches is inverse reward design (Hadfield-Menell et al., 2017). It incorporates uncertainty about the objective by considering alternative reward functions that are consistent with the given reward function in the training environment. This helps the agent avoid some side effects that stem from “distributional shift”, where the agent encounters a new state that was not present in training. However, this method assumes that the given reward function is correct for the training environment, and so does not prevent side effects caused by a reward function that is misspecified in the training environment.

## 6 Discussion and future work

We have outlined a set of properties that are desirable for a side effects measure, and defined a relative reachability measure that satisfies these criteria. We then showed that it succeeds on a set of illustrative toy experiments where simpler side effects measures fail, as shown in Table 8. This provides a proof of concept for relative reachability as a general approach to measuring and penalizing side effects.

There are several improvements that would make this approach more tractable and useful in practical applications, which we leave to future work:

Practical implementation of coverage and baseline. The idealized, theoretical form of the relative reachability measure that we introduced is not tractable for environments more complex than gridworlds. In particular, we assumed that all environment states are known to the agent, that the coverage between all pairs of states can be computed, and that the agent can simulate the environment to compute the inaction baseline, which is not generally realistic. To relax these assumptions, the relative reachability penalty could be computed over some set of representative states known to the agent. For example, the agent could learn a set of auxiliary policies for reaching distinct states, similarly to the method for approximating empowerment in Gregor et al. (2017). While the agent is still learning about the environment, it would not be penalized for reducing the reachability of states that it is not aware of, so side effects would still happen during training.

Better choices of baseline than inaction. While the inaction baseline is our current best choice, it is far from ideal. In particular, the agent is not penalized for causing side effects that would occur in the default outcome. For example, if the agent is driving a car, the default outcome of inaction is a crash, so the agent would not be penalized for spilling coffee in the car. A better default state would be produced by a fail-safe policy smoothly following the road, but this kind of baseline is task-dependent. More research is needed on defining a better baseline in a general and tractable way.
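For reference, the inaction baseline itself is straightforward to compute when a deterministic model of the environment is available. The sketch below assumes a hypothetical one-step model `step(state, action)`; in stochastic environments one would sample or average rollouts instead.

```python
# Sketch: the inaction baseline is the state reached by doing nothing.
# `step` is an assumed deterministic environment model (hypothetical API).

def inaction_baseline(step, s0, t, noop):
    """State after t steps of the noop action, starting from s0."""
    state = s0
    for _ in range(t):
        state = step(state, noop)
    return state
```

Note that in a dynamic environment the baseline moves on its own: with `step = lambda s, a: s + 1` modeling, say, an object on a conveyor belt, the baseline after three steps is position 3, not the starting position 0.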

Reward-penalty balance. Our current approach requires choosing a value of the scaling parameter $\beta$ on the penalty below the threshold at which the agent no longer achieves the objective; this threshold depends on the penalty and the environment. It would be useful to automatically choose a value of $\beta$ below this threshold.

Taking into account reward costs. While the discounted relative reachability measure takes into account the time costs of reaching various states, it does not take into account reward costs. For example, suppose the agent can reach a given state from the current state in one step, but this step would incur a large negative reward. Discounted coverage could be modified to reflect this by adding a term for reward costs.

Weights over the state space. In practice, we often value the reachability of some states much more than others. This could be incorporated into the relative reachability measure by adding a weight for each state in the sum. Such weights could be learned through human feedback methods, e.g. Christiano et al. (2017).
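A weighted variant is a one-line change to the unweighted sum. The sketch below uses invented weights purely for illustration; uniform weights recover the original measure.

```python
# Sketch: state-weighted relative reachability. The weight values here
# are purely illustrative; in practice they could be learned from
# human feedback.

def weighted_relative_reachability(cov_current, cov_baseline, weights):
    return sum(w * max(cov_baseline[s] - cov_current[s], 0.0)
               for s, w in weights.items())

# Losing the ability to restore a vase matters far more than losing
# the ability to restore a speck of dust:
penalty = weighted_relative_reachability(
    {"vase_intact": 0.0, "dust_in_place": 0.0},
    {"vase_intact": 1.0, "dust_in_place": 1.0},
    {"vase_intact": 10.0, "dust_in_place": 0.1},
)
print(penalty)  # 10.1
```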

We hope this work lays the foundations for a practical methodology on avoiding side effects that would scale well to more complex environments.

## Acknowledgements

We are grateful to David Krueger, Ramana Kumar, Jan Leike, Pedro Ortega, Tom Everitt, Murray Shanahan, Janos Kramar, Jonathan Uesato, and Owain Evans for giving helpful feedback on drafts. We would like to thank them and Toby Ord, Stuart Armstrong, Geoffrey Irving, Anthony Aguirre, Max Wainwright, Jessica Taylor, Ivo Danihelka, and Shakir Mohamed for illuminating conversations.

## Appendix A Proofs of consistency between coverage definitions

### A.1 Proposition 1

We show that for all $\tilde{s}, s \in \mathcal{S}$, as $\gamma \to 1$, the discounted coverage (3.2) approaches the undiscounted coverage (3.1): $\lim_{\gamma\to 1} C_\gamma(\tilde{s}; s) = C(\tilde{s}; s)$.

###### Proof.

First we show for a fixed policy $\pi$ that

$$
\begin{aligned}
\lim_{\gamma\to 1} \mathbb{E}\!\left[\gamma^{N_\pi(\tilde{s};s)}\right]
&= \lim_{\gamma\to 1} P\!\left(N_\pi(\tilde{s};s) < \infty\right) \mathbb{E}\!\left[\gamma^{N_\pi(\tilde{s};s)} \,\middle|\, N_\pi(\tilde{s};s) < \infty\right] \\
&\quad + \lim_{\gamma\to 1} P\!\left(N_\pi(\tilde{s};s) = \infty\right) \mathbb{E}\!\left[\gamma^{N_\pi(\tilde{s};s)} \,\middle|\, N_\pi(\tilde{s};s) = \infty\right] \\
&= P\!\left(N_\pi(\tilde{s};s) < \infty\right) \cdot 1 + P\!\left(N_\pi(\tilde{s};s) = \infty\right) \cdot 0 \\
&= P\!\left(N_\pi(\tilde{s};s) < \infty\right). 
\end{aligned} \tag{A.1}
$$

Now let $\pi_\gamma$ be an optimal policy for that value of $\gamma$: $C_\gamma(\tilde{s};s) = \mathbb{E}\!\left[\gamma^{N_{\pi_\gamma}(\tilde{s};s)}\right]$. For any $\epsilon > 0$, there is a $\tilde{\gamma} < 1$ such that both of the following hold:

$$
\left|\, \mathbb{E}\!\left[\tilde{\gamma}^{N_{\pi_{\tilde{\gamma}}}(\tilde{s};s)}\right] - P\!\left(N_{\pi_{\tilde{\gamma}}}(\tilde{s};s) < \infty\right) \right| < \epsilon \quad \text{(by equation A.1)}
$$

and

$$
\left|\, \mathbb{E}\!\left[\tilde{\gamma}^{N_{\pi_{\tilde{\gamma}}}(\tilde{s};s)}\right] - \lim_{\gamma\to 1} \mathbb{E}\!\left[\gamma^{N_{\pi_\gamma}(\tilde{s};s)}\right] \right| < \epsilon \quad \text{(assuming the limit exists)}.
$$

Thus, $\left|\, \lim_{\gamma\to 1} \mathbb{E}\!\left[\gamma^{N_{\pi_\gamma}(\tilde{s};s)}\right] - P\!\left(N_{\pi_{\tilde{\gamma}}}(\tilde{s};s) < \infty\right) \right| < 2\epsilon$. Taking $\epsilon \to 0$, we have

$$
\lim_{\gamma\to 1} C_\gamma(\tilde{s};s) = \lim_{\gamma\to 1} \mathbb{E}\!\left[\gamma^{N_{\pi_\gamma}(\tilde{s};s)}\right] = \lim_{\tilde{\gamma}\to 1} P\!\left(N_{\pi_{\tilde{\gamma}}}(\tilde{s};s) < \infty\right). \tag{A.2}
$$

Let $\pi^* = \operatorname{argmax}_\pi P\!\left(N_\pi(\tilde{s};s) < \infty\right)$. Then,

$$
\begin{aligned}
\max_\pi P\!\left(N_\pi(\tilde{s};s) < \infty\right)
&= \lim_{\gamma\to 1} \mathbb{E}\!\left[\gamma^{N_{\pi^*}(\tilde{s};s)}\right] && \text{(by equation A.1)} \\
&\le \lim_{\gamma\to 1} \mathbb{E}\!\left[\gamma^{N_{\pi_\gamma}(\tilde{s};s)}\right] && \text{(since } \pi_\gamma \text{ is optimal for each } \gamma\text{)} \\
&= \lim_{\tilde{\gamma}\to 1} P\!\left(N_{\pi_{\tilde{\gamma}}}(\tilde{s};s) < \infty\right) && \text{(by equation A.2)} \\
&\le \max_\pi P\!\left(N_\pi(\tilde{s};s) < \infty\right).
\end{aligned}
$$

Thus, equality holds, which completes the proof. ∎
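As a numerical sanity check of this limit, consider a hypothetical policy that fails to reach the target with probability $q$ and otherwise takes a geometric number of steps $N \sim \mathrm{Geom}(p)$. Then $\mathbb{E}[\gamma^N] = (1-q)\,p\gamma/(1-(1-p)\gamma)$, which tends to $P(N < \infty) = 1-q$ as $\gamma \to 1$:

```python
# Proposition 1 numerically: E[gamma^N] -> P(N < infinity) as gamma -> 1,
# for an illustrative policy that never reaches the target with
# probability q and otherwise takes N ~ Geom(p) steps (support {1, 2, ...}).
p, q = 0.3, 0.25

def discounted_coverage(gamma):
    # E[gamma^N | N < inf] = p*gamma / (1 - (1-p)*gamma) for geometric N;
    # the never-reaching branch contributes gamma^inf = 0.
    return (1 - q) * p * gamma / (1 - (1 - p) * gamma)

print(discounted_coverage(0.9999))  # close to 1 - q = 0.75
```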

### A.2 Proposition 2

We show that for all $\tilde{s}, s \in \mathcal{S}$, as $\gamma \to 1$, the similarity-based discounted coverage (3.3) approaches the similarity-based undiscounted coverage (3.4): $\lim_{\gamma\to 1} C_{\gamma,\delta}(\tilde{s};s) = C_{\delta}(\tilde{s};s)$.

###### Proof.

First we show for a fixed policy $\pi$ that if the limit exists, then

$$
\lim_{\gamma\to 1} \sum_{t=0}^{\infty} (1-\gamma)\gamma^t\, \mathbb{E}\!\left[e^{-\delta(\tilde{S}^\pi_t, s)}\right] = \lim_{t\to\infty} \mathbb{E}\!\left[e^{-\delta(\tilde{S}^\pi_t, s)}\right]. \tag{A.3}
$$

Let $x_t = \mathbb{E}\!\left[e^{-\delta(\tilde{S}^\pi_t, s)}\right] - \lim_{t\to\infty} \mathbb{E}\!\left[e^{-\delta(\tilde{S}^\pi_t, s)}\right]$. Since $x_t \to 0$ as $t \to \infty$, for any $\epsilon > 0$ we can find a large enough $k_\epsilon$ such that $|x_t| < \epsilon$ for all $t \ge k_\epsilon$. Then, we have

$$
\begin{aligned}
\lim_{\gamma\to 1} \sum_{t=0}^{\infty} (1-\gamma)\gamma^t x_t
&= \lim_{\gamma\to 1} \sum_{t=0}^{k_\epsilon - 1} (1-\gamma)\gamma^t x_t + \lim_{\gamma\to 1} \sum_{t=k_\epsilon}^{\infty} (1-\gamma)\gamma^t x_t \\
&\le \lim_{\gamma\to 1} (1-\gamma) \cdot \lim_{\gamma\to 1} \sum_{t=0}^{k_\epsilon - 1} \gamma^t x_t + \lim_{\gamma\to 1} \sum_{t=k_\epsilon}^{\infty} (1-\gamma)\gamma^t \epsilon \\
&= 0 + \epsilon \lim_{\gamma\to 1} \gamma^{k_\epsilon} \\
&= \epsilon.
\end{aligned}
$$

Similarly, we can show that $\lim_{\gamma\to 1} \sum_{t=0}^{\infty} (1-\gamma)\gamma^t x_t \ge -\epsilon$. Since this holds for all $\epsilon > 0$,

$$
\lim_{\gamma\to 1} \sum_{t=0}^{\infty} (1-\gamma)\gamma^t x_t = 0,
$$

which is equivalent to equation A.3.

Now let $\pi_\gamma$ be an optimal policy for that value of $\gamma$: $C_{\gamma,\delta}(\tilde{s};s) = \sum_{t=0}^{\infty} (1-\gamma)\gamma^t\, \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_\gamma}_t, s)}\right]$. For any $\epsilon > 0$, there is a $\tilde{\gamma} < 1$ such that both of the following hold:

$$
\left|\, \sum_{t=0}^{\infty} (1-\tilde{\gamma})\tilde{\gamma}^t\, \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_{\tilde{\gamma}}}_t, s)}\right] - \lim_{t\to\infty} \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_{\tilde{\gamma}}}_t, s)}\right] \right| < \epsilon \quad \text{(by equation A.3)}
$$

and

$$
\left|\, \sum_{t=0}^{\infty} (1-\tilde{\gamma})\tilde{\gamma}^t\, \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_{\tilde{\gamma}}}_t, s)}\right] - \lim_{\gamma\to 1} \sum_{t=0}^{\infty} (1-\gamma)\gamma^t\, \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_\gamma}_t, s)}\right] \right| < \epsilon \quad \text{(assuming the limit exists)}.
$$

Thus, $\left|\, \lim_{\gamma\to 1} \sum_{t=0}^{\infty} (1-\gamma)\gamma^t\, \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_\gamma}_t, s)}\right] - \lim_{t\to\infty} \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_{\tilde{\gamma}}}_t, s)}\right] \right| < 2\epsilon$. Taking $\epsilon \to 0$, we have

$$
\lim_{\gamma\to 1} C_{\gamma,\delta}(\tilde{s};s) = \lim_{\gamma\to 1} \sum_{t=0}^{\infty} (1-\gamma)\gamma^t\, \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_\gamma}_t, s)}\right] = \lim_{\tilde{\gamma}\to 1} \lim_{t\to\infty} \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_{\tilde{\gamma}}}_t, s)}\right]. \tag{A.4}
$$

Let $\tilde{\pi}$ be the optimal policy for the similarity-based undiscounted coverage. Then,

$$
\begin{aligned}
\max_\pi \lim_{t\to\infty} \mathbb{E}\!\left[e^{-\delta(\tilde{S}^\pi_t, s)}\right]
&= \lim_{\gamma\to 1} \sum_{t=0}^{\infty} (1-\gamma)\gamma^t\, \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\tilde{\pi}}_t, s)}\right] && \text{(by equation A.3)} \\
&\le \lim_{\gamma\to 1} \sum_{t=0}^{\infty} (1-\gamma)\gamma^t\, \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_\gamma}_t, s)}\right] && \text{(since } \pi_\gamma \text{ is optimal for each } \gamma\text{)} \\
&= \lim_{\tilde{\gamma}\to 1} \lim_{t\to\infty} \mathbb{E}\!\left[e^{-\delta(\tilde{S}^{\pi_{\tilde{\gamma}}}_t, s)}\right] && \text{(by equation A.4)} \\
&\le \max_\pi \lim_{t\to\infty} \mathbb{E}\!\left[e^{-\delta(\tilde{S}^\pi_t, s)}\right].
\end{aligned}
$$

Thus, equality holds, which completes the proof. ∎
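The Abel-summation fact in equation A.3 can also be checked numerically: for a bounded sequence with a limit, the $(1-\gamma)\gamma^t$-weighted average approaches that limit as $\gamma \to 1$. The sequence $x_t = 0.5 + 1/(t+1)$ below, with limit $0.5$, is an arbitrary illustrative choice:

```python
# Equation A.3 numerically: (1 - gamma) * sum_t gamma^t * x_t approaches
# lim_t x_t as gamma -> 1, for a bounded convergent sequence x_t.
# The sequence and the truncation horizon are arbitrary choices.

def abel_average(gamma, horizon):
    return (1 - gamma) * sum(gamma**t * (0.5 + 1 / (t + 1))
                             for t in range(horizon))

print(abel_average(0.999, 20000))  # close to the limit 0.5
```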

### A.3 Proposition 3

Let the indicator distance $\delta_I$ be a distance measure with $\delta_I(s_1, s_2) = 0$ if $s_1 = s_2$ and $\delta_I(s_1, s_2) = \infty$ otherwise (so it only matters whether the agent can reach the exact target state). Then we show that for all $\tilde{s}, s \in \mathcal{S}$,

• the similarity-based discounted coverage (3.3) is equivalent to the discounted coverage (3.2): $C_{\gamma,\delta_I}(\tilde{s}; s) = C_\gamma(\tilde{s}; s)$,

• the similarity-based undiscounted coverage (3.4) is equivalent to the undiscounted coverage (3.1): $C_{\delta_I}(\tilde{s}; s) = C(\tilde{s}; s)$.

## Appendix B Relative reachability computations in Example 5

We compute the relative reachability of the current state $s_2$ relative to the possible baseline states $s_3$ and $s_1$, using undiscounted coverage:

$$
\begin{aligned}
d(s_2; s_3) &= \sum_{k=1}^{4} \max\!\left(C_1(s_3; s_k) - C_1(s_2; s_k),\, 0\right) \\
&= \max(0-0, 0) + \max(0-1, 0) + \max(1-0, 0) + \max(1-1, 0) \\
&= 1, \\
d(s_2; s_1) &= \sum_{k=1}^{4} \max\!\left(C_1(s_1; s_k) - C_1(s_2; s_k),\, 0\right) \\
&= \max(1-0, 0) + \max(1-1, 0) + \max(1-0, 0) + \max(1-1, 0) \\
&= 2,
\end{aligned}
$$

where $C_1(s_i; s_k)$ is 1 if $s_k$ is reachable from $s_i$ and 0 otherwise.

Now we compute the same relative reachability values using discounted coverage:

$$
\begin{aligned}
d(s_2; s_3) &= \sum_{k=1}^{4} \max\!\left(C_\gamma(s_3; s_k) - C_\gamma(s_2; s_k),\, 0\right) \\
&= \max(\gamma^\infty - \gamma^\infty, 0) + \max(\gamma^\infty - \gamma^0, 0) + \max(\gamma^0 - \gamma^\infty, 0) + \max(\gamma^1 - \gamma^1, 0) \\
&= \max(0-0, 0) + \max(0-1, 0) + \max(1-0, 0) + \max(\gamma-\gamma, 0) \\
&= 1, \\
d(s_2; s_1) &= \sum_{k=1}^{4} \max\!\left(C_\gamma(s_1; s_k) - C_\gamma(s_2; s_k),\, 0\right) \\
&= \max(\gamma^0 - \gamma^\infty, 0) + \max(\gamma^1 - \gamma^0, 0) + \max(\gamma^1 - \gamma^\infty, 0) + \max(\gamma^2 - \gamma^1, 0) \\
&= \max(1-0, 0) + \max(\gamma-1, 0) + \max(\gamma-0, 0) + \max(\gamma^2-\gamma, 0) \\
&= 1 + \gamma \xrightarrow{\;\gamma \to 1\;} 2.
\end{aligned}
$$
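These computations can be checked with a short script. The coverage vectors below are transcribed directly from the sums in this appendix, with $\gamma = 0.9$ chosen arbitrarily for the discounted case:

```python
# Check of the Appendix B computations. Coverage values C(s_i; s_k) are
# read off the sums above; gamma = 0.9 is an arbitrary illustrative choice.
gamma = 0.9

C_undiscounted = {1: [1, 1, 1, 1], 2: [0, 1, 0, 1], 3: [0, 0, 1, 1]}
C_discounted = {
    1: [1.0, gamma, gamma, gamma**2],  # C_gamma(s1; s_k)
    2: [0.0, 1.0, 0.0, gamma],         # C_gamma(s2; s_k)
    3: [0.0, 0.0, 1.0, gamma],         # C_gamma(s3; s_k)
}

def rel_reach(C, current, baseline):
    """d(current; baseline): total reachability lost relative to baseline."""
    return sum(max(b - c, 0) for b, c in zip(C[baseline], C[current]))

print(rel_reach(C_undiscounted, 2, 3))  # 1
print(rel_reach(C_undiscounted, 2, 1))  # 2
print(rel_reach(C_discounted, 2, 3))    # 1.0
print(rel_reach(C_discounted, 2, 1))    # 1 + gamma = 1.9
```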

## Appendix C Toy MDP examples

We construct a toy MDP for each of our Examples 1-4. The agent receives a small negative movement reward for any action except noop, which gives reward 0. When the agent reaches a goal state, it receives a large reward (which incorporates the movement reward for simplicity).

The agent’s total reward at time step $t$ is $r_t - \beta d_t$, where $r_t$ is the reward described above, $d_t$ is a side effects penalty, and $\beta$ is a scaling parameter. We try out the following policies with different side effects penalties:

• a policy with a distance penalty between the current state and the inaction baseline, based on a set of state variables,

• a policy with a relative reachability penalty relative to the starting state baseline, using discounted coverage with discount $\gamma$,

• a policy with a relative reachability penalty relative to the inaction baseline, using discounted coverage with discount $\gamma$.

### C.1 Example 1: Box

Figure 10 gives a toy MDP for the box example. The inaction baseline is the same as the starting state baseline, so the two relative reachability policies coincide.

• For the distance-penalty policy, we define the distance in terms of the state variables (box position, status of the vase): states that differ only in the position of the box are assigned one distance value, and states that differ only in the status of the vase are assigned another.

Going straight to the goal breaks the vase, while going around incurs an extra movement cost but a smaller penalty. If $\beta$ is too high, both options yield negative total reward, so the agent will take noops. Otherwise, the agent will go around whenever the penalty saved outweighs the extra movement cost, which holds for suitable values of $\beta$ and the distance parameters.

• For the relative reachability policies, going straight incurs the penalty for making the vase's intact state unreachable, while going around incurs only the extra movement cost. Thus, the agent takes noops if $\beta$ is too high, and otherwise goes around unless $\beta$ or $\gamma$ is too low.

Thus, all agents will go around and avoid breaking the vase unless $\beta$ is too low or too high.

### C.2 Example 2: Omelette

Figure 11 gives a toy MDP for the omelette example. The inaction baseline is the same as the starting state baseline, so the two relative reachability policies coincide.

• For the distance-penalty policy, the reward for breaking and cooking the eggs is the goal reward minus the scaled distance penalty, so the agent will take the action if $\beta$ is small enough, and take noops otherwise.

• For the relative reachability policies, we compute the relative reachability of the resulting state $s_2$ relative to the baseline state $s_1$:

$$
d(s_2; s_1) = \sum_{k=1}^{2} \max\!\left(C_\gamma(s_1; s_k) - C_\gamma(s_2; s_k),\, 0\right) = \max(1-0, 0) + \max(\gamma-1, 0) = 1.
$$

Thus, breaking and cooking the eggs gives the goal reward minus $\beta$, so the agent will take the action if $\beta$ is smaller than the goal reward.

Thus, all agents will break and cook the eggs unless $\beta$ is too high.
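Since the relative reachability penalty for the omelette action equals 1, the agent's decision reduces to comparing the penalty scaling parameter (call it $\beta$) against the goal reward. A tiny check, with an arbitrary illustrative goal reward of 10:

```python
# Decision rule for the omelette example: acting yields R - beta * d with
# penalty d = 1, while noop yields 0, so the agent acts exactly when
# beta < R. R = 10 is an arbitrary illustrative goal reward.
R = 10.0

def takes_action(beta, d=1.0):
    return R - beta * d > 0.0

print(takes_action(2.0))   # True: penalty scale below the goal reward
print(takes_action(15.0))  # False: penalty outweighs the goal reward
```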

### C.3 Example 3: Sushi

Figure 12 gives a toy MDP for the sushi example. Here the inaction baseline state and the starting state baseline state are different, since the environment changes even when the agent does nothing.