
# Robust Reinforcement Learning for Continuous Control with Model Misspecification

## Abstract

We provide a framework for incorporating robustness – to perturbations in the transition dynamics, which we refer to as model misspecification – into continuous control Reinforcement Learning (RL) algorithms. We specifically focus on incorporating robustness into a state-of-the-art continuous control RL algorithm called Maximum a-posteriori Policy Optimization (MPO). We achieve this by learning a policy that optimizes for a worst-case expected return objective and derive a corresponding robust entropy-regularized Bellman contraction operator. In addition, we introduce a less conservative, soft-robust, entropy-regularized objective with a corresponding Bellman operator. We show that both robust and soft-robust policies outperform their non-robust counterparts in nine Mujoco domains with environment perturbations. In addition, we show improved robust performance on a high-dimensional, simulated, dexterous robotic hand. Finally, we present multiple investigative experiments that provide deeper insight into the robustness framework. This includes an adaptation to another continuous control RL algorithm as well as learning the uncertainty set from offline data. Performance videos can be found online at https://sites.google.com/view/robust-rl.


## 1 Introduction

Reinforcement Learning (RL) algorithms typically learn a policy that optimizes for the expected return (sutton98reinforcement). That is, the policy aims to maximize the sum of future expected rewards that an agent accumulates in a particular task. This approach has yielded impressive results in recent years, including playing computer games with superhuman performance (mnih2015human; Tessler2018), multi-task RL (Rusu2016PolicyD; Devin2017LearningMN; Teh2017DistralRM; Mankowitz2019; Riedmiller2018LearningBP) as well as solving complex continuous control robotic tasks (Duan2016BenchmarkingDR; abdolmaleki2018maximum; Kalashnikov2018ScalableDR; Haarnoja2018SoftAA).

The current crop of RL agents is typically trained in a single environment (usually a simulator). As a consequence, an issue that is faced by many of these agents is the sensitivity of the agent’s policy to environment perturbations. Perturbing the dynamics of the environment during test time, which may include executing the policy in a real-world setting, can have a significant negative impact on the performance of the agent (andrychowicz2018learning; peng2018sim; derman2018soft; Dicastro2012; mankowitz2018learning). This is because the training environment is not necessarily a good model of the perturbations that an agent may actually face, leading to potentially unwanted, sub-optimal behaviour. There are many types of environment perturbations. These include changing lighting/weather conditions, sensor noise, actuator noise, action delays, etc. (Dulac2019).

It is desirable to train agents that are agnostic to environment perturbations. This is especially crucial in the Sim2Real setting (andrychowicz2018learning; peng2018sim; wulfmeier2017mutual; rastogi2018sample; Christiano2016) where a policy is trained in a simulator and then executed on a real-world domain. As an example, consider a robotic arm that executes a control policy to perform a specific task in a factory. If, for some reason, the arm needs to be replaced and the specifications do not exactly match, then the control policy still needs to be able to perform the task with the ‘perturbed’ robotic arm dynamics. In addition, a robust policy can help mitigate noise-induced perturbations arising from malfunctioning sensors as well as actuator noise.

Model misspecification: For the purpose of this paper, we refer to the setting where an agent is trained in one environment but performs in a different, perturbed version of the environment (as in the above examples) as model misspecification. By incorporating robustness into our agents, we correct for this misspecification, yielding improved performance in the perturbed environment(s).

In this paper, we propose a framework for incorporating robustness into continuous control RL algorithms. We specifically focus on robustness to model misspecification in the transition dynamics. Our main contributions are as follows:

(1) We incorporate robustness into a state-of-the-art continuous control RL algorithm called Maximum a-posteriori Policy Optimization (MPO) (abdolmaleki2018maximum) to yield Robust MPO (R-MPO). We also carry out an additional experiment, where we incorporate robustness into another continuous control RL algorithm called Stochastic Value Gradients (SVG) (heess2015learning).

(2) Entropy regularization encourages exploration and helps prevent early convergence to sub-optimal policies (nachum2017bridging). To incorporate these advantages, we: (i) Extend the Robust Bellman operator (iyengar2005robust) to robust and soft-robust entropy-regularized versions, and show that these operators are contraction mappings. In addition, we (ii) extend MPO to Robust Entropy-regularized MPO (RE-MPO) and Soft RE-MPO (SRE-MPO) and show that they perform at least as well as R-MPO and in some cases significantly better. All the derivations have been deferred to Appendices B, C and D.

We want to emphasize that, while the theoretical contributions are novel, our most significant contribution is that of the extensive experimental analysis we have performed to analyze the robustness performance of our agent. Specifically:

(3) We present experimental results in nine Mujoco domains showing that RE-MPO and SRE-MPO outperform E-MPO, and that R-MPO and SR-MPO outperform MPO.

(4) To ensure that our method scales, we show robust performance on a high-dimensional, simulated, dexterous robotic hand called Shadow hand which outperforms the non-robust MPO baseline.

(5) Multiple investigative experiments to better understand the robustness framework. These include (i) an analysis of modifying the uncertainty set; (ii) comparing our technique to data augmentation; (iii) a comparison to domain randomization; (iv) comparing with and without entropy regularization; and (v) training transition models from offline data and using them as the uncertainty set when running R-MPO. We show that R-MPO with learned transition models as the uncertainty set can lead to improved performance over standard R-MPO.

## 2 Background

A Markov Decision Process (MDP) is defined as the tuple $\langle S, A, r, \gamma, P \rangle$ where $S$ is the state space, $A$ the action space, $r : S \times A \rightarrow \mathbb{R}$ is a bounded reward function; $\gamma \in [0, 1)$ is the discount factor and $P : S \times A \rightarrow \Delta_S$ maps state-action pairs to a probability distribution over next states. We use $\Delta_S$ to denote the simplex over $S$. The goal of a Reinforcement Learning agent for the purpose of control is to learn a policy $\pi : S \rightarrow \Delta_A$, which maps a state and action to a probability of executing the action from the given state, so as to maximize the expected return $J(\pi) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $r_t$ is a random variable representing the reward received at time $t$ (sutton2018reinforcement). The value function is defined as $V^{\pi}(s) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$ and the action value function as $Q^{\pi}(s, a) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$.

A Robust MDP (R-MDP) is defined as a tuple $\langle S, A, r, \gamma, \mathcal{P} \rangle$ where $S$, $A$, $r$ and $\gamma$ are defined as above; $\mathcal{P}(s, a) \subseteq \mathcal{M}(S)$ is an uncertainty set where $\mathcal{M}(S)$ is the set of probability measures over next states $s' \in S$. This is interpreted as an agent selecting a state and action pair, and the next state $s'$ being determined by a conditional measure $p(\cdot \mid s, a) \in \mathcal{P}(s, a)$ (iyengar2005robust). A robust policy optimizes for the worst-case expected return objective: $J_R(\pi) = \inf_{p \in \mathcal{P}} \mathbb{E}^{p, \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$.

The robust value function is defined as $V^{\pi}(s) = \inf_{p \in \mathcal{P}} \mathbb{E}^{p, \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$ and the robust action value function as $Q^{\pi}(s, a) = \inf_{p \in \mathcal{P}} \mathbb{E}^{p, \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$. Both the robust Bellman operator for a fixed policy and the optimal robust Bellman operator have previously been shown to be contractions (iyengar2005robust). A rectangularity assumption on the uncertainty set (iyengar2005robust) ensures that “nature” can choose a worst-case transition function $p(\cdot \mid s, a) \in \mathcal{P}(s, a)$ independently for every state $s$ and action $a$.

Maximum A-Posteriori Policy Optimization (MPO) (abbas2018a; abdolmaleki2018maximum) is a continuous control RL algorithm that performs an expectation maximization form of policy iteration. There are two steps comprising policy evaluation and policy improvement. The policy evaluation step receives as input a policy $\pi_k$ and evaluates an action-value function $Q^{\pi_k}_{\theta}$ by minimizing the squared TD error: $\min_{\theta} \left(r_t + \gamma \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t),\, a_{t+1} \sim \pi_k(\cdot \mid s_{t+1})}\left[\hat{Q}^{\pi_k}_{\hat{\theta}}(s_{t+1}, a_{t+1})\right] - Q^{\pi_k}_{\theta}(s_t, a_t)\right)^2$, where $\hat{\theta}$ denotes the parameters of a target network (mnih2015human) that are periodically updated from $\theta$. In practice we use a replay-buffer of samples in order to perform the policy evaluation step. The second step comprises a policy improvement step. The policy improvement step consists of optimizing the objective $\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^{\pi_k}(s, a)\right]$ for states $s$ drawn from a state distribution $\mu(s)$. In practice the state distribution samples are drawn from an experience replay. By improving the objective in all states $s$, we improve our policy. To do so, a two step procedure is performed.

First, we construct a non-parametric estimate $q(a \mid s)$ such that $\mathbb{E}_{a \sim q(\cdot \mid s)}\left[Q^{\pi_k}(s, a)\right] \geq \mathbb{E}_{a \sim \pi_k(\cdot \mid s)}\left[Q^{\pi_k}(s, a)\right]$. This is done by maximizing $\mathbb{E}_{a \sim q(\cdot \mid s)}\left[Q^{\pi_k}(s, a)\right]$ while ensuring that the solution, locally, stays close to the current policy $\pi_k$; i.e. $\mathrm{KL}\left(q(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\right) \leq \epsilon$. This optimization has a closed form solution given as $q(a \mid s) \propto \pi_k(a \mid s) \exp\left(Q^{\pi_k}(s, a) / \eta\right)$, where $\eta$ is a temperature parameter that can be computed by minimizing a convex dual function (abdolmaleki2018maximum). Second, we project this non-parametric representation back onto a parameterized policy by solving the optimization problem $\pi_{k+1} = \arg\min_{\pi_{\theta}} \mathbb{E}_{s \sim \mu}\left[\mathrm{KL}\left(q(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right]$, where $\pi_{k+1}$ is the new and improved policy and where one typically employs additional regularization (abbas2018a). Note that this amounts to supervised learning with samples drawn from $q(a \mid s)$; see abbas2018a for details.
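The two-step policy improvement above can be sketched in code. The following is a minimal, illustrative implementation of the closed-form E-step weights over a discrete set of sampled actions; the function name, shapes and the discrete-action simplification are ours, not the paper's implementation (MPO itself works with samples from a continuous policy).

```python
import numpy as np

def mpo_e_step(pi_probs, q_values, eta):
    """Closed-form E-step weights q(a|s) proportional to pi_k(a|s) exp(Q(s,a)/eta).

    pi_probs: current policy's probabilities over sampled actions, shape (N,).
    q_values: Q(s, a) for those same actions, shape (N,).
    eta:      temperature parameter.
    """
    weights = pi_probs * np.exp(q_values / eta)
    return weights / weights.sum()  # normalized non-parametric policy q(.|s)
```

The M-step would then fit the parametric policy to these weights by weighted maximum likelihood, i.e. minimizing $\mathrm{KL}(q \,\|\, \pi_\theta)$ over $\theta$ via supervised learning.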

## 3 Robust MPO

To incorporate robustness into MPO, we focus on learning a worst-case value function in the policy evaluation step. Note that this policy evaluation step can be incorporated into any actor-critic algorithm. In particular, instead of optimizing the squared TD error, we optimize the worst-case squared TD error, which is defined as:

$$\min_{\theta} \left(r_t + \gamma \inf_{p \in \mathcal{P}(s_t, a_t)} \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t),\, a_{t+1} \sim \pi_k(\cdot \mid s_{t+1})}\left[\hat{Q}^{\pi_k}_{\hat{\theta}}(s_{t+1}, a_{t+1})\right] - Q^{\pi_k}_{\theta}(s_t, a_t)\right)^2, \tag{1}$$

where $\mathcal{P}(s_t, a_t)$ is an uncertainty set for the current state $s_t$ and action $a_t$; $\pi_k$ is the current network’s policy, and $\hat{\theta}$ denotes the target network parameters. It is in this policy evaluation step (Line 3 in Algorithms 1, 2 and 3 in Appendix I) that the Bellman operators in the previous sections are applied.
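For a finite uncertainty set, the worst-case target in Equation (1) amounts to taking an infimum over target Q-values evaluated under each transition model. The sketch below illustrates this for a single transition; all names and signatures are illustrative, not the paper's implementation.

```python
def robust_td_target(reward, next_states, q_target, policy, gamma=0.99):
    """Worst-case TD target from Eq. (1) for a finite uncertainty set.

    next_states: the next-state realization under each transition model
                 in the uncertainty set.
    q_target:    the target network Q(s, a).
    policy:      maps a state to a sampled action, a' ~ pi_k(.|s').
    """
    q_values = []
    for s_next in next_states:
        a_next = policy(s_next)
        q_values.append(q_target(s_next, a_next))
    # Infimum over the uncertainty set; the soft-robust variant would
    # instead average q_values under a distribution over the set.
    return reward + gamma * min(q_values)
```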

Relation to MPO: In MPO, this replaces the current policy evaluation step. The robust Bellman operator (iyengar2005robust) ensures that this process converges to a unique fixed point for the policy $\pi_k$. This is achieved by repeated application of the robust Bellman operator during the policy evaluation step until convergence to the fixed point. Since the proposal policy $q(a \mid s)$ (see Section 2) is proportional to the robust action value estimate $\hat{Q}^{\pi_k}_{\hat{\theta}}(s, a)$, it intuitively yields a robust policy, as the policy is being generated from a worst-case value function. The fitting of the policy network to the proposal policy then yields a robust network policy $\pi_{k+1}$.

Entropy-regularized MPO: Entropy-regularization encourages exploration and helps prevent early convergence to sub-optimal policies (nachum2017bridging). To incorporate these advantages, we extend the Robust Bellman operator (iyengar2005robust) to robust and soft-robust entropy-regularized versions (see Appendix B and C respectively for a detailed overview and the corresponding derivations) and show that these operators are contraction mappings (Theorem 1 below and Theorem 2 in Appendix E) and yield a well-known value-iteration bound with respect to the max norm.

###### Theorem 1.

The robust entropy-regularized Bellman operator $T^{\pi}_{R\text{-}KL}$ for a fixed policy is a contraction operator. Specifically: $\forall V_1, V_2 \in \mathbb{R}^{|S|}$ and $\gamma \in [0, 1)$, we have $\left\| T^{\pi}_{R\text{-}KL} V_1 - T^{\pi}_{R\text{-}KL} V_2 \right\|_{\infty} \leq \gamma \left\| V_1 - V_2 \right\|_{\infty}$.

In addition, we extend MPO to Robust Entropy-regularized MPO (RE-MPO) and Soft RE-MPO (SRE-MPO) (see Appendix D for a detailed overview and derivations) and show that they perform at least as well as R-MPO and in some cases significantly better. All the derivations have been deferred to the Appendix. The corresponding algorithms for R-MPO, RE-MPO and SRE-MPO can be found in Appendix I.

## 4 Experiments

We now present experiments on nine different continuous control domains (four of which we show in the paper; the rest can be found in Appendix H.4) from the DeepMind control suite (Tassa2018). In addition, we present an experiment on a high-dimensional, dexterous robotic hand called Shadow hand (Shadow2005). In our experiments, we found that the entropy-regularized version of Robust MPO had similar, and in some cases slightly better, performance than the expected return version of Robust MPO without entropy-regularization. We therefore decided to include experiments of our agent optimizing the entropy-regularized objective (non-robust, robust and soft-robust versions). This corresponds to (a) the non-robust E-MPO baseline, (b) Robust E-MPO (RE-MPO) and (c) Soft-Robust E-MPO (SRE-MPO). From here on, it is assumed that the algorithms optimize for the entropy-regularized objective unless otherwise stated.

Appendix: In Appendix H.4, we present results of our agent optimizing for the expected return objective without entropy regularization (for the non-robust, robust and soft-robust versions). This corresponds to (a’) non-robust MPO baseline, (b’) R-MPO and (c’) SR-MPO.

The experiments are divided into three sections. The first section details the setup for robust and soft-robust training. The next section compares robust and soft-robust performance to the non-robust MPO baseline in each of the domains. The final section is a set of investigative experiments to gain additional insights into the performance of the robust and soft-robust agents.

#### Setup:

For each domain, the robust agent is trained using a pre-defined uncertainty set consisting of three task perturbations. Each of the three perturbations corresponds to a particular perturbation of the Mujoco domain. For example, in Cartpole, the uncertainty set consists of three different pole lengths. Both the robust and non-robust agents are evaluated on a test set of three unseen task perturbations. In the Cartpole example, this would correspond to pole lengths that the agent has not seen during training. The chosen values of the uncertainty set and evaluation set for each domain can be found in Appendix H.3. Note that it is common practice to manually select the pre-defined uncertainty set and the unseen test environments. Practitioners often have significant domain knowledge and can utilize this when choosing the uncertainty set (derman2019; derman2018soft; Dicastro2012; mankowitz2018learning; tamar2014scaling).

During training, the robust, soft-robust and non-robust agents act in an unperturbed environment which we refer to as the nominal environment. During the TD learning update, the robust agent calculates an infimum between Q values from each next state realization for each of the uncertainty set task perturbations (the soft-robust agent computes an average, which corresponds to a uniform distribution over the uncertainty set, instead of an infimum). Each transition model is a different instantiation of the Mujoco task. The robust and soft-robust agents are exposed to more state realizations than the non-robust agent. However, as we show in our ablation studies, significantly increasing the number of samples and the diversity of the samples for the non-robust agent still results in poor performance compared to the robust and soft-robust agents.

### 4.1 Main Experiments

Mujoco Domains: We compare the performance of non-robust MPO to the robust and soft-robust variants. Each training run consists of episodes and the experiments are repeated times. In the bar plots, the y-axis indicates the average reward (with standard deviation) and the x-axis indicates different unseen evaluation environment perturbations starting from the first perturbation (Env0) onwards. Increasing environment indices correspond to increasingly large perturbations. For example, in Figure 1 (top left), Env0, Env1 and Env2 for the Cartpole Balance task represent the pole perturbed to lengths of and meters respectively. Figure 1 shows the performance of three Mujoco domains (the remaining six domains are in Appendix H.4). The bar plots indicate the performance of E-MPO (red), RE-MPO (blue) and SRE-MPO (green) on the held-out test perturbations. This color scheme is consistent throughout the experiments unless otherwise stated. As can be seen in each of the figures, RE-MPO attains improved performance over E-MPO. This same trend holds true for all nine domains. SRE-MPO outperforms the non-robust baseline in all but the Cheetah domain, but is not able to outperform RE-MPO. An interesting observation can be seen in the video for the Walker walk task (https://sites.google.com/view/robust-rl), where the RE-MPO agent learns to ‘drag’ its leg, which is a fundamentally different policy to that of the non-robust agent, which learns a regular gait movement.

Appendix: The appendix contains additional experiments with the non entropy-regularized versions of the algorithms where again the robust (R-MPO) and soft robust (SR-MPO) versions of MPO outperform the non-robust version (MPO).

Shadow hand domain: This domain consists of a dexterous, simulated robotic hand called Shadow hand whose goal is to rotate a cube into a pre-defined orientation (Shadow2005). The state space is a vector consisting of the angular positions and velocities, the cube orientation and the goal orientation. The action space is a vector consisting of the desired angular velocities of the hand actuators. The reward is a function of the current orientation of the cube relative to the desired orientation. The uncertainty set consists of three models which correspond to increasingly smaller sizes of the cube that the agent needs to orientate. The agent is evaluated on a different, unseen holdout set. The values can be found in Appendix H.3. We compare RE-MPO to E-MPO trained agents. Episodes are steps long, corresponding to approximately seconds of interaction. Each experiment is run for episodes and is repeated times. As seen in Figure 2, RE-MPO outperforms E-MPO, especially as the size of the cube decreases (from Env0 to Env2). This is an especially challenging problem due to the high-dimensionality of the task. As seen in the videos (https://sites.google.com/view/robust-rl), the RE-MPO agent is able to manipulate significantly smaller cubes than it had observed in the nominal simulator.

### 4.2 Investigative Experiments

This section aims to investigate and answer various questions that may aid in explaining the performance of the robust and non-robust agents respectively. Each investigative experiment is conducted on the Cartpole Balance and Pendulum Swingup domains.

#### What if we increase the number of training samples?

One argument is that the robust agent has access to more samples since it calculates the Bellman update using the infimum of three different environment realizations. To balance this effect, the non-robust agent was trained for three times more episodes than the robust agents. Training with significantly more samples does not increase the performance of the non-robust agent and can even decrease performance, as a result of overfitting to the nominal domain. See Appendix H.5, Figure 12 for the results.

#### What about Domain Randomization?

A subsequent point would be that the robust agent sees more diverse examples compared to the non-robust agent from each of the perturbed environments. We therefore trained the non-robust agent in a domain randomization setting (andrychowicz2018learning; peng2018sim). We compare our method to two variants of DR. The first variant, Limited-DR, uses the same perturbations as in the uncertainty set of RE-MPO. Here, we compare which method better utilizes a limited set of perturbations to learn a robust policy. As seen in Figure 3 (left and middle left for Cartpole Balance and Pendulum Swingup respectively), RE-MPO yields significantly better performance given the limited set of perturbations. The second variant, Full-DR, performs regular DR on a significantly larger set of perturbations in the Pendulum Swingup task. In this setting, DR, which uses times more perturbations, improves but still does not outperform RE-MPO (which still only uses three perturbations). This result can be seen in Figure 13, Appendix H.5.

What is the intuitive difference between DR and RE-MPO/SRE-MPO? DR defines the loss to be the expectation of TD-errors over the uncertainty set. Each TD error is computed using a state, action, reward, next state trajectory from a particular perturbed environment (selected uniformly from the uncertainty set). These TD errors are then averaged together. This is a form of data augmentation and the resulting behaviour is the average across all of this data. RE-MPO/SRE-MPO: In the case of robustness, the TD error is computed such that the target action value function is computed as a worst-case value function with respect to the uncertainty set. This means that the learned policy is explicitly searching for adversarial examples during training to account for worst-case performance. In the soft-robust case, the subtle yet important difference (as seen in the experiments) with DR is that the TD loss is computed with the average target action value function with respect to next states (as opposed to averaging the TD errors of each individual perturbed environment as in DR). This results in different gradient updates being used to update the action value function compared to DR.
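The distinction above can be made concrete on a single transition. The sketch below contrasts the three losses for a finite uncertainty set; the function name and array shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def td_losses(reward, q_next_per_model, q_current, gamma=0.99):
    """Contrast DR, robust, and soft-robust TD losses on one transition.

    q_next_per_model: target Q-value of the next state under each transition
                      model in the uncertainty set, shape (num_models,).
    """
    q_next = np.asarray(q_next_per_model, dtype=float)
    # Domain randomization: average of per-model squared TD errors.
    dr_loss = np.mean((reward + gamma * q_next - q_current) ** 2)
    # Robust: one squared TD error against the worst-case (infimum) target.
    robust_loss = (reward + gamma * q_next.min() - q_current) ** 2
    # Soft-robust: one squared TD error against the *averaged* target.
    soft_loss = (reward + gamma * q_next.mean() - q_current) ** 2
    return dr_loss, robust_loss, soft_loss
```

Note that by Jensen's inequality the DR loss upper-bounds the soft-robust loss, so the two produce different gradients even though both involve an average over the uncertainty set.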

#### A larger test set:

It is also useful to view the performance of the agent from the nominal environment to increasingly large perturbations in the unseen test set (see Appendix H.3 for values). These graphs can be seen in Figure 2 for Cartpole Balance and Pendulum Swingup respectively. As expected, the robust agent maintains a higher level of performance compared to the non-robust agent. Initially, the soft-robust agent outperforms the robust agent, but its performance degrades as the perturbations increase which is consistent with the results of derman2018soft. In addition, the robust and soft-robust agents are competitive with the non-robust agent in the nominal environment.

#### Modifying the uncertainty set:

We now evaluate the performance of the agent for different uncertainty sets. For Pendulum Swingup, the original uncertainty set values of the pendulum arm are and meters. We modified the final perturbation to values of and meters respectively. The agent is evaluated on unseen lengths of and meters. An increase in performance can be seen in Figure 4 as the third perturbation approaches that of the unseen evaluation environments. Thus it appears that if the agent is able to approximately capture the dynamics of the unseen test environments within the training set, then the robust agent is able to adapt to the unseen test environments. The results for cartpole balance can be seen in Appendix H.5, Figure 14.

#### What about incorporating Robustness into other algorithms?

To show the generalization of this robustness approach, we incorporate it into the critic of the Stochastic Value Gradient (SVG) continuous control RL algorithm (See Appendix H.1). As seen in Figure 3, Robust Entropy-regularized SVG (RE-SVG) and Soft RE-SVG (SRE-SVG) significantly outperform the non-robust Entropy-regularized SVG (E-SVG) baseline in both Cartpole and Pendulum.

#### Robust entropy-regularized return vs. robust expected return:

When comparing the robust entropy-regularized return performance to the robust expected return, we found that the entropy-regularized return appears to do no worse than the expected return. In some cases, e.g., Cheetah, the entropy-regularized objective performs significantly better (see Appendix H.5, Figure 11).

#### Different Nominal Models:

In this paper the nominal model was always chosen as the smallest perturbation parameter value from the uncertainty set. This was done to highlight the strong performance of robust policies to increasingly large environment perturbations. However, what if we set the nominal model as the median or largest perturbation with respect to the chosen uncertainty set for each agent? As seen in Appendix H.5, Figure 15, the closer (further) the nominal model is to (from) the holdout set, the better (worse) the performance of the non-robust agent. However, in all cases, the robust agent still performs at least as well as (and sometimes better than) the non-robust agent.

#### What about learning the uncertainty set from offline data?

In real-world settings, such as robotics and industrial control centers (gao2014machine), there may be a nominal simulator available as well as offline data captured from the real-world system(s). These data could be used to train transition models to capture the dynamics of the task at hand. For example, a set of robots in a factory might each be performing the same task, such as picking up a box. In industrial control cooling centers, there are a number of cooling units in each center responsible for cooling the overall system. In both of these examples, each individual robot and cooling unit operate with slightly different dynamics due to slight fluctuations in the specifications of the designed system, wear-and-tear as well as sensor calibration errors. As a result, an uncertainty set of transition models can be trained from data generated by each robot or cooling unit.

However, can we train a set of transition models from these data, utilize them as the uncertainty set in R-MPO and still yield robust performance when training on a nominal simulator? To answer this question, we mimicked the above scenarios by generating datasets for the Cartpole Swingup and the Pendulum swingup tasks. For Cartpole swingup, we varied the length of the pole and generated a dataset for each pole length. For Pendulum Swingup, we varied the mass of the pole and generated the corresponding datasets. We then trained transition models on increasingly large data batches ranging from to one million datapoints for each pole length and pole mass respectively. We then utilized each set of transition models for different data batch sizes as the uncertainty set and ran R-MPO on each task. We term this variant of R-MPO, Data-Driven Robust MPO (DDR-MPO). The results can be seen in Figure 5. There are a number of interesting observations from this analysis. (1) As expected, on small batches of data, the models are too inaccurate and result in poor performance. (2) An interesting insight is that as the data batch size increases, DDR-MPO starts to outperform R-MPO, especially for increasingly large perturbations. The hypothesis here is that due to the transition models being more accurate, but not perfect, adversarial examples are generated in a small region around the nominal next state observation, yielding an increasingly robust agent. (3) As the batch size increases further, and the transition models get increasingly close to the ground truth models, DDR-MPO converges to the performance of R-MPO.
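The dataset-to-uncertainty-set pipeline described above can be sketched as follows. This is a toy illustration, not the paper's model architecture: we stand in a linear least-squares model for the learned transition models, and all names and shapes are our own.

```python
import numpy as np

class LinearDynamicsModel:
    """Toy least-squares transition model: s' is approximated by [s, a] @ W.

    A stand-in for the transition models trained on offline data; the
    linear form is purely illustrative.
    """
    def fit(self, states, actions, next_states):
        X = np.hstack([states, actions])  # (n, state_dim + action_dim)
        self.W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
        return self

    def predict(self, state, action):
        return np.concatenate([state, action]) @ self.W

def build_uncertainty_set(offline_datasets):
    """One model per offline data source (e.g. per robot or cooling unit)."""
    return [LinearDynamicsModel().fit(s, a, s2) for s, a, s2 in offline_datasets]
```

The resulting list of models can then play the role of the uncertainty set when computing the worst-case TD target during training on the nominal simulator.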

## 5 Related Work

From a theoretical perspective, Robust Bellman operators were introduced in (iyengar2005robust; Nilim2005; Wiesemann2013; Hansen2011; tamar2014scaling). Our theoretical work extends this operator to the entropy regularized setting, for both the robust and soft-robust formulation, and modifies the MPO optimization formulation accordingly. A more closely related work from a theoretical perspective is that of Moya2016 who introduces a formulation for robustness to model misspecification. This work is a special case of robust MDPs where they introduce a robust Bellman operator that regularizes the immediate reward with two KL terms; one entropy term capturing model uncertainty with respect to a base model, and the other term being entropy regularization with respect to a base policy. Our work differs from this work in a number of respects: (1) Their uncertainty set is represented by a KL constraint which has the effect of restricting the set of admissible transition models. Our setup does not have these restrictions. (2) The uncertainty set elements from Moya2016 output a probability distribution over model parameter space whereas the uncertainty set elements in our formulation output a distribution over next states.

mankowitz2018learning learn robust options, also known as temporally extended actions (sutton1999between), using policy gradient. Robust solutions tend to be overly conservative. To combat this, derman2018soft extend the actor-critic two-timescale stochastic approximation algorithm to a ‘soft-robust’ formulation to yield a less conservative solution. Dicastro2012 introduce a robust implementation of Deep Q Networks (mnih2015human). Domain Randomization (DR) (andrychowicz2018learning; peng2018sim) is a technique whereby an agent trains on different perturbations of the environment. The agent batch averages the learning error of these different perturbed trajectories together to yield an agent that is robust to environment perturbations. This can be viewed as a data augmentation technique where the resulting behaviour is the average across all of the data. There are also works that look into robustness to action stochasticity (Fox2015; Braun2011; Rubin2012).

## 6 Conclusion

We have presented a framework for incorporating robustness - to perturbations in the transition dynamics, which we refer to as model misspecification - into continuous control RL algorithms. This framework is suited to continuous control algorithms that learn a value function, such as an actor critic setup. We specifically focused on incorporating robustness into MPO as well as our entropy-regularized version of MPO (E-MPO). In addition, we presented an experiment which incorporates robustness into the SVG algorithm. From a theoretical standpoint, we adapted MPO to an entropy-regularized version (E-MPO); we then incorporated robustness into the policy evaluation step of both algorithms to yield Robust MPO (R-MPO) and Robust E-MPO (RE-MPO) as well as the soft-robust variants (SR-MPO/SRE-MPO). This was achieved by deriving the corresponding robust and soft-robust entropy-regularized Bellman operators to ensure that the policy evaluation step converges in each case. We have extensive experiments showing that the robust versions outperform the non-robust counterparts on nine Mujoco domains as well as a high-dimensional dexterous, simulated robotic hand called Shadow hand (Shadow2005). We also provide numerous investigative experiments to understand the robust and soft-robust policy in more detail. This includes an experiment showing improved robust performance over R-MPO when using an uncertainty set of transition models learned from offline data.

## Appendix A Background

Entropy-regularized Reinforcement Learning: Entropy regularization encourages exploration and helps prevent early convergence to sub-optimal policies (nachum2017bridging). We make use of the relative entropy-regularized RL objective defined as $J_{KL}(\pi) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r_t - \tau \mathrm{KL}\left(\pi(\cdot \mid s_t) \,\|\, \bar{\pi}(\cdot \mid s_t)\right)\right)\right]$, where $\tau$ is a temperature parameter and $\mathrm{KL}\left(\pi(\cdot \mid s) \,\|\, \bar{\pi}(\cdot \mid s)\right)$ is the Kullback-Leibler (KL) divergence between the current policy $\pi$ and a reference policy $\bar{\pi}$ given a state $s$ (schulman2017equivalence). The entropy-regularized value function is defined as $V^{\pi, \bar{\pi}}_{KL}(s) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r_t - \tau \mathrm{KL}\left(\pi(\cdot \mid s_t) \,\|\, \bar{\pi}(\cdot \mid s_t)\right)\right) \mid s_0 = s\right]$. Intuitively, augmenting the rewards with the KL term regularizes the policy by forcing it to be ‘close’ in some sense to the base policy.

## Appendix B Robust Entropy-Regularized Bellman Operator

(Relative-)Entropy regularization has been shown to encourage exploration and prevent early convergence to sub-optimal policies (nachum2017bridging). To take advantage of this idea when developing a robust RL algorithm, we extend the robust Bellman operator to a robust entropy-regularized Bellman operator and prove that it is a contraction. We also show that well-known value iteration bounds can be attained using this operator. We first define the robust entropy-regularized value function as $V^{\pi, \bar{\pi}}_{R\text{-}KL}(s) = \inf_{p \in \mathcal{P}} \mathbb{E}^{p, \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r_t - \tau \mathrm{KL}\left(\pi(\cdot \mid s_t) \,\|\, \bar{\pi}(\cdot \mid s_t)\right)\right) \mid s_0 = s\right]$. For the remainder of this section, we drop the sub- and superscripts, as well as the reference policy conditioning, from the value function, and simply represent it as $V(s)$ for brevity. We define the robust entropy-regularized Bellman operator $T^{\pi}_{R\text{-}KL}$ for a fixed policy in Equation 2, and show it is a max norm contraction (Theorem 1).

$$\mathcal{T}^{\pi}_{R\text{-}KL}V(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) - \tau\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} + \gamma\inf_{p\in\mathcal{P}}\mathbb{E}_{s'\sim p(\cdot|s,a)}\left[V(s')\right]\right] \qquad (2)$$
###### Theorem 1.

The robust entropy-regularized Bellman operator $\mathcal{T}^{\pi}_{R\text{-}KL}$ for a fixed policy $\pi$ is a contraction operator. Specifically: for all $U, V \in \mathbb{R}^{|S|}$ and for a fixed policy $\pi$, we have $\|\mathcal{T}^{\pi}_{R\text{-}KL}U - \mathcal{T}^{\pi}_{R\text{-}KL}V\|_{\infty} \le \gamma\|U - V\|_{\infty}$.

The proof can be found in Appendix E (Theorem 1). Using the optimal robust entropy-regularized Bellman operator $\mathcal{T}_{R\text{-}KL}$, which is shown to also be a contraction operator in Appendix E (Theorem 2), a standard value iteration error bound can be derived (Appendix E, Corollary 1).
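For concreteness, the operator in Equation 2 can be sketched in tabular form. This is a minimal illustration under our own assumptions (a finite state-action space and a finite uncertainty set given as a list of transition tensors; variable names are illustrative), not the paper's function-approximation setting:

```python
import numpy as np

def robust_kl_bellman(V, R, models, pi, pi_ref, tau, gamma):
    """One application of T^pi_{R-KL} (Eq. 2) for a fixed policy pi.
    V: (S,) values; R: (S, A) rewards; models: list of (S, A, S) transition
    tensors forming the finite uncertainty set P; pi, pi_ref: (S, A) policies."""
    worst_next_v = np.min([P @ V for P in models], axis=0)   # inf_p E_{s'~p}[V(s')]
    kl = np.sum(pi * np.log(pi / pi_ref), axis=1)            # KL(pi || pi_ref) per state
    return np.sum(pi * (R + gamma * worst_next_v), axis=1) - tau * kl

rng = np.random.default_rng(0)
S, A = 4, 2
R = rng.random((S, A))
models = [m / m.sum(-1, keepdims=True) for m in rng.random((3, S, A, S))]
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)
pi_ref = np.full((S, A), 1 / A)

V = np.zeros(S)
for _ in range(200):      # contraction (Theorem 1) drives this to the fixed point
    V = robust_kl_bellman(V, R, models, pi, pi_ref, tau=0.1, gamma=0.9)
```

Because the operator is a $\gamma$-contraction, repeated application converges to the unique robust entropy-regularized value of $\pi$.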

## Appendix C Soft-Robust Entropy-Regularized Bellman Operator

In this section, we derive a soft-robust entropy-regularized Bellman operator and show that it is also a $\gamma$-contraction in the max norm. First, we define the average transition model as $\bar{p}(\cdot|s,a) = \mathbb{E}_{p\sim\mu}\left[p(\cdot|s,a)\right]$, which corresponds to the average transition model distributed according to some distribution $\mu$ over the uncertainty set $\mathcal{P}$. This average transition model induces an average stationary distribution (see derman2018soft). The soft-robust entropy-regularized value function is defined as $V^{\pi}_{SR\text{-}KL}(s;\bar{\pi}) = \mathbb{E}_{\pi,\bar{p}}\left[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}(\pi(\cdot|s_t)\,\|\,\bar{\pi}(\cdot|s_t))\big)\,\middle|\,s_0 = s\right]$. Again, for ease of notation, we write $V(s)$ for the remainder of the section. The soft-robust entropy-regularized Bellman operator $\mathcal{T}^{\pi}_{SR\text{-}KL}$ for a fixed policy $\pi$ is defined as:

$$\mathcal{T}^{\pi}_{SR\text{-}KL}V(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) - \tau\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim\bar{p}(\cdot|s,a)}\left[V(s')\right]\right] \qquad (3)$$

which is also a contraction mapping (see Appendix F, Theorem 3) and yields the same bound as Corollary 1 for the optimal soft-robust Bellman operator derived in Appendix F, Theorem 4.
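The soft-robust operator differs from the robust one only in replacing the infimum over the uncertainty set with the average model $\bar{p}$. A tabular sketch (our own illustrative names and assumptions, as before):

```python
import numpy as np

def soft_robust_kl_bellman(V, R, models, weights, pi, pi_ref, tau, gamma):
    """One application of T^pi_{SR-KL} (Eq. 3): identical to the robust operator
    except the inf over the uncertainty set is replaced by the average model
    p_bar = E_{p~mu}[p], with the distribution mu given by `weights`."""
    p_bar = np.tensordot(np.asarray(weights), np.stack(models), axes=1)  # (S, A, S)
    next_v = p_bar @ V                        # E_{s'~p_bar}[V(s')] per (s, a)
    kl = np.sum(pi * np.log(pi / pi_ref), axis=1)
    return np.sum(pi * (R + gamma * next_v), axis=1) - tau * kl

rng = np.random.default_rng(0)
S, A = 4, 2
R = rng.random((S, A))
models = [m / m.sum(-1, keepdims=True) for m in rng.random((3, S, A, S))]
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)
pi_ref = np.full((S, A), 1 / A)
V = rng.normal(size=S)
tv = soft_robust_kl_bellman(V, R, models, [1/3, 1/3, 1/3], pi, pi_ref, 0.1, 0.9)
```

By linearity of the expectation in the transition model, applying the operator with uniform weights equals the average of the single-model Bellman backups.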

## Appendix D Robust Entropy-Regularized Policy Evaluation

To extend robust policy evaluation to robust entropy-regularized policy evaluation, two key steps need to be performed: (1) optimize for the entropy-regularized expected return, as opposed to the regular expected return, and modify the TD update accordingly; (2) incorporate robustness into the entropy-regularized expected return and modify the entropy-regularized TD update. To achieve (1), we define the entropy-regularized expected return as $J_{KL}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}(\pi(\cdot|s_t)\,\|\,\bar{\pi}(\cdot|s_t))\big)\right]$, and show in Appendix G that performing policy evaluation with the entropy-regularized value function is equivalent to optimizing the entropy-regularized squared TD error (the same as Eq. (4), only omitting the $\inf$ operator). To achieve (2), we optimize for the robust entropy-regularized expected return objective defined as $J_{R\text{-}KL}(\pi) = \inf_{p\in\mathcal{P}}\mathbb{E}_{\pi,p}\left[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}(\pi(\cdot|s_t)\,\|\,\bar{\pi}(\cdot|s_t))\big)\right]$, yielding the robust entropy-regularized squared TD error:

$$\min_{\theta}\Big(r_t + \gamma\inf_{p\in\mathcal{P}(s_t,a_t)}\big[\tilde{Q}^{\pi_k}_{R\text{-}KL,\hat{\theta}}(s_{t+1},a_{t+1};\bar{\pi}) - \tau\,\mathrm{KL}\big(\pi_k(\cdot|s_{t+1})\,\|\,\bar{\pi}(\cdot|s_{t+1})\big)\big] - \tilde{Q}^{\pi_k}_{R\text{-}KL,\theta}(s_t,a_t;\bar{\pi})\Big)^2, \qquad (4)$$

where $\hat{\theta}$ denotes the parameters of the target network, $s_{t+1}\sim p(\cdot|s_t,a_t)$ and $a_{t+1}\sim\pi_k(\cdot|s_{t+1})$. For the soft-robust setting, we remove the infimum from the TD update and replace the next-state transition function $p(\cdot|s_t,a_t)$ with the average next-state transition function $\bar{p}(\cdot|s_t,a_t)$.

Relation to MPO: As in the previous section, this step replaces the policy evaluation step of MPO. Our robust and soft-robust entropy-regularized Bellman operators ensure that this process converges to a unique fixed point for the policy $\pi_k$ in the robust and soft-robust cases respectively. We use the policy from the previous iteration as the reference policy $\bar{\pi}$. The pseudo-code for the R-MPO, RE-MPO and Soft-Robust Entropy-regularized MPO (SRE-MPO) algorithms can be found in Appendix I (Algorithms 1, 2 and 3 respectively).
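A minimal sample-based rendering of the loss in Eq. (4) (a sketch under our own assumptions: one next-state sample per model in the uncertainty set, scalar Q estimates, and illustrative names):

```python
import numpy as np

def robust_kl_td_loss(r_t, q_next_per_model, kl_next_per_model, q_pred, tau, gamma):
    """Squared robust entropy-regularized TD error in the spirit of Eq. (4).
    q_next_per_model[i]: estimate of Q~(s_{t+1}, a_{t+1}) with s_{t+1} drawn
    from the i-th model of the uncertainty set; kl_next_per_model[i]: the KL
    term at that next state. The inf over P becomes a min over the candidates."""
    candidates = np.asarray(q_next_per_model) - tau * np.asarray(kl_next_per_model)
    target = r_t + gamma * np.min(candidates)   # worst case over the uncertainty set
    return (target - q_pred) ** 2

# Two models in the set; the second yields the worst-case target.
loss = robust_kl_td_loss(r_t=0.5, q_next_per_model=[2.0, 1.0],
                         kl_next_per_model=[0.0, 0.0], q_pred=1.0,
                         tau=0.1, gamma=0.9)
```

In the soft-robust case, `np.min` over the candidates would be replaced by an average weighted by the distribution over the uncertainty set.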

## Appendix E Proofs

###### Proof.

We follow the proofs from (tamar2014scaling; iyengar2005robust), and adapt them to account for the additional entropy regularization for a fixed policy $\pi$. Let $U, V \in \mathbb{R}^{|S|}$, and let $s \in S$ be an arbitrary state. Assume $\mathcal{T}^{\pi}_{R\text{-}KL}U(s) \ge \mathcal{T}^{\pi}_{R\text{-}KL}V(s)$. Let $\epsilon > 0$ be an arbitrary positive number.

By the definition of the $\inf$ operator, there exists a model $p_s \in \mathcal{P}$ such that,

$$\mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) - \tau\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p_s(\cdot|s,a)}[V(s')]\right] < \mathcal{T}^{\pi}_{R\text{-}KL}V(s) + \epsilon. \qquad (5)$$

In addition, we have by definition that:

$$\mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) - \tau\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p_s(\cdot|s,a)}[U(s')]\right] \ge \inf_{p\in\mathcal{P}}\mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) - \tau\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p(\cdot|s,a)}[U(s')]\right] = \mathcal{T}^{\pi}_{R\text{-}KL}U(s). \qquad (6)$$

Thus, we have,

$$\begin{aligned}
0 &\le \mathcal{T}^{\pi}_{R\text{-}KL}U(s) - \mathcal{T}^{\pi}_{R\text{-}KL}V(s) \\
&< \mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) - \tau\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p_s(\cdot|s,a)}[U(s')]\right] - \left(\mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) - \tau\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p_s(\cdot|s,a)}[V(s')]\right] - \epsilon\right) \\
&= \gamma\,\mathbb{E}_{a\sim\pi(\cdot|s),\,s'\sim p_s(\cdot|s,a)}\left[U(s') - V(s')\right] + \epsilon \le \gamma\|U - V\|_{\infty} + \epsilon. \qquad (7)
\end{aligned}$$

Applying a similar argument for the case $\mathcal{T}^{\pi}_{R\text{-}KL}V(s) \ge \mathcal{T}^{\pi}_{R\text{-}KL}U(s)$ results in

$$\left|\mathcal{T}^{\pi}_{R\text{-}KL}U(s) - \mathcal{T}^{\pi}_{R\text{-}KL}V(s)\right| < \gamma\|U - V\|_{\infty} + \epsilon. \qquad (8)$$

Since $\epsilon$ is an arbitrary positive number and $s$ is an arbitrary state, we establish the result, i.e.,

$$\left\|\mathcal{T}^{\pi}_{R\text{-}KL}U - \mathcal{T}^{\pi}_{R\text{-}KL}V\right\|_{\infty} \le \gamma\|U - V\|_{\infty}. \qquad (9)$$
∎
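The contraction inequality (9) can also be checked numerically on a small tabular instance (an illustrative sketch with a finite uncertainty set and names of our choosing; not part of the paper's proof):

```python
import numpy as np

def robust_kl_bellman(V, R, models, pi, pi_ref, tau, gamma):
    """Tabular T^pi_{R-KL}: inf taken over a finite set of transition models."""
    worst_next_v = np.min([P @ V for P in models], axis=0)
    kl = np.sum(pi * np.log(pi / pi_ref), axis=1)
    return np.sum(pi * (R + gamma * worst_next_v), axis=1) - tau * kl

rng = np.random.default_rng(1)
S, A, tau, gamma = 5, 3, 0.1, 0.9
R = rng.random((S, A))
models = [m / m.sum(-1, keepdims=True) for m in rng.random((4, S, A, S))]
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)
pi_ref = np.full((S, A), 1 / A)

# ||T U - T V||_inf <= gamma * ||U - V||_inf for random pairs (U, V).
for _ in range(100):
    U, V = rng.normal(size=S) * 10, rng.normal(size=S) * 10
    gap = np.max(np.abs(robust_kl_bellman(U, R, models, pi, pi_ref, tau, gamma)
                        - robust_kl_bellman(V, R, models, pi, pi_ref, tau, gamma)))
    assert gap <= gamma * np.max(np.abs(U - V)) + 1e-10
```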

###### Proof.

We follow a similar argument to the proof of Theorem 1. Let $U, V \in \mathbb{R}^{|S|}$, and let $s \in S$ be an arbitrary state. Assume $\mathcal{T}_{R\text{-}KL}U(s) \ge \mathcal{T}_{R\text{-}KL}V(s)$. Let $\epsilon > 0$ be an arbitrary positive number. By definition of the $\sup$ operator, there exists a policy $\hat{\pi}$ such that,

$$\inf_{p\in\mathcal{P}}\mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p(\cdot|s,a)}[U(s')]\right] > \mathcal{T}_{R\text{-}KL}U(s) - \epsilon. \qquad (10)$$

In addition, by the definition of the $\inf$ operator, there exists a model $p_s \in \mathcal{P}$ such that,

$$\mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p_s(\cdot|s,a)}[V(s')]\right] < \inf_{p\in\mathcal{P}}\mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p(\cdot|s,a)}[V(s')]\right] + \epsilon. \qquad (11)$$

Thus, we have,

$$\begin{aligned}
0 &\le \mathcal{T}_{R\text{-}KL}U(s) - \mathcal{T}_{R\text{-}KL}V(s) \\
&< \left(\inf_{p\in\mathcal{P}}\mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p(\cdot|s,a)}[U(s')]\right] + \epsilon\right) - \inf_{p\in\mathcal{P}}\mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p(\cdot|s,a)}[V(s')]\right] \\
&< \left(\mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p_s(\cdot|s,a)}[U(s')]\right] + \epsilon\right) - \left(\mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim p_s(\cdot|s,a)}[V(s')]\right] - \epsilon\right) \\
&= \gamma\,\mathbb{E}_{a\sim\hat{\pi}(\cdot|s),\,s'\sim p_s(\cdot|s,a)}\left[U(s') - V(s')\right] + 2\epsilon \le \gamma\|U - V\|_{\infty} + 2\epsilon. \qquad (12)
\end{aligned}$$

Applying a similar argument for the case $\mathcal{T}_{R\text{-}KL}V(s) \ge \mathcal{T}_{R\text{-}KL}U(s)$ results in

$$\left|\mathcal{T}_{R\text{-}KL}U(s) - \mathcal{T}_{R\text{-}KL}V(s)\right| < \gamma\|U - V\|_{\infty} + 2\epsilon. \qquad (13)$$

Since $\epsilon$ is an arbitrary positive number and $s$ is an arbitrary state, we establish the result, i.e.,

$$\left\|\mathcal{T}_{R\text{-}KL}U - \mathcal{T}_{R\text{-}KL}V\right\|_{\infty} \le \gamma\|U - V\|_{\infty}. \qquad (14)$$
∎

###### Corollary 1.

Let $\pi_N$ be the greedy policy after applying $N$ value iteration steps. The bound between the optimal value function $V^*$ and $V^{\pi_N}$, the value function that is induced by $\pi_N$, is given by $\|V^* - V^{\pi_N}\| \le \frac{2\gamma\epsilon}{(1-\gamma)^2} + \frac{2\gamma^{N+1}}{1-\gamma}\|V^* - V_0\|$, where $\epsilon$ is the function approximation error, and $V_0$ is the initial value function.

###### Proof.

From Bertsekas (1996), we have the following lemma:

###### Lemma 1.

Let $V^*$ be the optimal value function, $V$ some arbitrary value function, $\pi$ the greedy policy with respect to $V$, and $V^{\pi}$ the value function that is induced by $\pi$. Then,

$$\|V^* - V^{\pi}\| \le \frac{2\gamma}{1-\gamma}\|V^* - V\|. \qquad (15)$$

Next, define the maximum projected loss to be:

$$\epsilon = \max_{0\le k\le N}\left\|\mathcal{T}_{R\text{-}KL}V_k - V_{k+1}\right\|. \qquad (16)$$

We can now derive a bound on the loss between the optimal value function $V^*$ and the value function obtained after $N$ updates of value iteration (denoted by $V_N$) as follows:

$$\begin{aligned}
\|V^* - V_N\| &\le \|V^* - \mathcal{T}_{R\text{-}KL}V_{N-1}\| + \|\mathcal{T}_{R\text{-}KL}V_{N-1} - V_N\| \\
&= \|\mathcal{T}_{R\text{-}KL}V^* - \mathcal{T}_{R\text{-}KL}V_{N-1}\| + \|\mathcal{T}_{R\text{-}KL}V_{N-1} - V_N\| \\
&\le \gamma\|V^* - V_{N-1}\| + \|\mathcal{T}_{R\text{-}KL}V_{N-1} - V_N\| \\
&\le \gamma\|V^* - V_{N-1}\| + \epsilon \\
&\le (1 + \gamma + \cdots + \gamma^{N-1})\epsilon + \gamma^{N}\|V^* - V_0\| \\
&\le \frac{\epsilon}{1-\gamma} + \gamma^{N}\|V^* - V_0\|. \qquad (17)
\end{aligned}$$

Then, using Lemma 1, we get:

$$\|V^* - V^{\pi_N}\| \le \frac{2\gamma}{1-\gamma}\|V^* - V_N\| \le \frac{2\gamma}{1-\gamma}\cdot\frac{\epsilon}{1-\gamma} + \frac{2\gamma}{1-\gamma}\gamma^{N}\|V^* - V_0\| = \frac{2\gamma\epsilon}{(1-\gamma)^2} + \frac{2\gamma^{N+1}}{1-\gamma}\|V^* - V_0\|, \qquad (18)$$

which establishes the result. ∎
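The geometric contraction in the chain of Eq. (17) can be illustrated numerically. The sketch below instantiates the optimal operator $\mathcal{T}_{R\text{-}KL}$ in tabular form using the standard log-sum-exp closed form of the KL-regularized maximization over policies (this closed form is our own assumption for the illustration; the paper treats the operator abstractly):

```python
import numpy as np

def optimal_robust_kl_bellman(V, R, models, pi_ref, tau, gamma):
    """Tabular T_{R-KL}: the sup over policies of the KL-regularized objective
    has the closed form tau * log E_{a~pi_ref}[exp(x(s,a)/tau)], where
    x(s,a) = r(s,a) + gamma * inf_p E_{s'~p}[V(s')]."""
    worst = np.min([P @ V for P in models], axis=0)
    return tau * np.log(np.sum(pi_ref * np.exp((R + gamma * worst) / tau), axis=1))

rng = np.random.default_rng(2)
S, A, tau, gamma = 4, 3, 0.5, 0.9
R = rng.random((S, A))
models = [m / m.sum(-1, keepdims=True) for m in rng.random((3, S, A, S))]
pi_ref = np.full((S, A), 1 / A)

V_star = np.zeros(S)                  # near-exact fixed point V*
for _ in range(2000):
    V_star = optimal_robust_kl_bellman(V_star, R, models, pi_ref, tau, gamma)

V, errs = np.zeros(S), []
for _ in range(20):                   # exact updates, i.e. epsilon = 0 in Eq. (16)
    V = optimal_robust_kl_bellman(V, R, models, pi_ref, tau, gamma)
    errs.append(np.max(np.abs(V_star - V)))

# With exact updates the error contracts at rate gamma, as in Eq. (17).
assert all(errs[k + 1] <= gamma * errs[k] + 1e-9 for k in range(len(errs) - 1))
```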

## Appendix F Soft-Robust Entropy-Regularized Bellman Operator

###### Proof.

For arbitrary $U, V \in \mathbb{R}^{|S|}$ and for a fixed policy $\pi$:

$$\begin{aligned}
\left\|\mathcal{T}^{\pi}_{SR\text{-}KL}U - \mathcal{T}^{\pi}_{SR\text{-}KL}V\right\|_{\infty} &= \sup_s\left|\mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) - \tau\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim\bar{p}(\cdot|s,a)}[U(s')]\right] - \mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) - \tau\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim\bar{p}(\cdot|s,a)}[V(s')]\right]\right| \\
&= \gamma\sup_s\left|\mathbb{E}_{a\sim\pi(\cdot|s)}\sum_{s'}\bar{p}(s'|s,a)\left[U(s') - V(s')\right]\right| \\
&\le \gamma\sup_s\,\mathbb{E}_{a\sim\pi(\cdot|s)}\sum_{s'}\bar{p}(s'|s,a)\left|U(s') - V(s')\right| \\
&\le \gamma\sup_s\,\mathbb{E}_{a\sim\pi(\cdot|s)}\sum_{s'}\bar{p}(s'|s,a)\,\|U - V\|_{\infty} = \gamma\|U - V\|_{\infty}.
\end{aligned}$$
∎

###### Proof.

Let $U, V \in \mathbb{R}^{|S|}$, and let $s \in S$ be an arbitrary state. Assume $\mathcal{T}_{SR\text{-}KL}U(s) \ge \mathcal{T}_{SR\text{-}KL}V(s)$. Let $\epsilon > 0$ be an arbitrary positive number. By definition of the $\sup$ operator, there exists a policy $\hat{\pi}$ such that,

$$\mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim\bar{p}(\cdot|s,a)}[U(s')]\right] > \mathcal{T}_{SR\text{-}KL}U(s) - \epsilon. \qquad (19)$$

Thus, we have,

$$\begin{aligned}
0 &\le \mathcal{T}_{SR\text{-}KL}U(s) - \mathcal{T}_{SR\text{-}KL}V(s) \\
&< \left(\mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim\bar{p}(\cdot|s,a)}[U(s')]\right] + \epsilon\right) - \sup_{\pi\in\Pi}\mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) - \tau\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim\bar{p}(\cdot|s,a)}[V(s')]\right] \\
&\le \left(\mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim\bar{p}(\cdot|s,a)}[U(s')]\right] + \epsilon\right) - \mathbb{E}_{a\sim\hat{\pi}(\cdot|s)}\left[r(s,a) - \tau\log\frac{\hat{\pi}(a|s)}{\bar{\pi}(a|s)} + \gamma\,\mathbb{E}_{s'\sim\bar{p}(\cdot|s,a)}[V(s')]\right] \\
&= \gamma\,\mathbb{E}_{a\sim\hat{\pi}(\cdot|s),\,s'\sim\bar{p}(\cdot|s,a)}\left[U(s') - V(s')\right] + \epsilon \le \gamma\|U - V\|_{\infty} + \epsilon. \qquad (20)
\end{aligned}$$

Applying a similar argument for the case $\mathcal{T}_{SR\text{-}KL}V(s) \ge \mathcal{T}_{SR\text{-}KL}U(s)$ results in

$$\left|\mathcal{T}_{SR\text{-}KL}U(s) - \mathcal{T}_{SR\text{-}KL}V(s)\right| < \gamma\|U - V\|_{\infty} + \epsilon. \qquad (21)$$

Since $\epsilon$ is an arbitrary positive number and $s$ is an arbitrary state, we establish the result, i.e.,

$$\left\|\mathcal{T}_{SR\text{-}KL}U - \mathcal{T}_{SR\text{-}KL}V\right\|_{\infty} \le \gamma\|U - V\|_{\infty}. \qquad (22)$$
∎

## Appendix G Entropy-regularized Policy Evaluation

This section describes: (1) the modification of the TD update for the expected return so that it optimizes for the entropy-regularized expected return; (2) the additional modification required to account for robustness.

The entropy-regularized value function is defined as:

$$V^{\pi}_{KL}(s;\bar{\pi}) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}(\pi(\cdot|s_t)\,\|\,\bar{\pi}(\cdot|s_t))\big)\,\middle|\,s_0 = s\right] \qquad (23)$$

and the corresponding entropy-regularized action value function is given by:

$$\begin{aligned}
Q^{\pi}_{KL}(s,a;\bar{\pi}) &= \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t - \tau\,\mathrm{KL}(\pi(\cdot|s_t)\,\|\,\bar{\pi}(\cdot|s_t))\big)\,\middle|\,s_0 = s, a_0 = a\right] \qquad (24) \\
&= r(s,a) - \tau\,\mathrm{KL}(\pi(\cdot|s)\,\|\,\bar{\pi}(\cdot|s)) + \gamma\,\mathbb{E}_{s'\sim p(\cdot|s,a)}\left[V^{\pi}_{KL}(s';\bar{\pi})\right] \qquad (25)
\end{aligned}$$

Next, we define:

$$\tilde{Q}^{\pi}_{KL}(s,a;\bar{\pi}) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(\cdot|s,a)}\left[V^{\pi}_{KL}(s';\bar{\pi})\right] \qquad (26)$$

thus,

$$Q^{\pi}_{KL}(s,a;\bar{\pi}) = \tilde{Q}^{\pi}_{KL}(s,a;\bar{\pi}) - \tau\,\mathrm{KL}(\pi(\cdot|s)\,\|\,\bar{\pi}(\cdot|s)) \qquad (27)$$

Therefore, we have the following relationship:

$$V^{\pi}_{KL}(s;\bar{\pi}) = \mathbb{E}_{a\sim\pi(\cdot|s)}\left[\tilde{Q}^{\pi}_{KL}(s,a;\bar{\pi})\right] - \tau\,\mathrm{KL}(\pi(\cdot|s)\,\|\,\bar{\pi}(\cdot|s)) \qquad (28)$$

We now retrieve the TD update for the entropy-regularized action value function:

$$\begin{aligned}
\delta_t &= r_t - \tau\,\mathrm{KL}(\pi(\cdot|s_t)\,\|\,\bar{\pi}(\cdot|s_t)) + \gamma\,Q^{\pi}_{KL}(s_{t+1},a_{t+1};\bar{\pi}) - Q^{\pi}_{KL}(s_t,a_t;\bar{\pi}), \\
&\qquad s_{t+1}\sim p(\cdot|s_t,a_t),\quad a_{t+1}\sim\pi(\cdot|s_{t+1}) \qquad (29)
\end{aligned}$$
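A minimal numeric rendering of the TD error in Eq. (29) (the helper names are ours, for illustration only):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def entropy_regularized_td(r_t, pi_t, ref_t, q_next, q_pred, tau, gamma):
    """delta_t = r_t - tau*KL(pi(.|s_t) || pi_ref(.|s_t))
                 + gamma*Q_KL(s_{t+1}, a_{t+1}) - Q_KL(s_t, a_t), as in Eq. (29)."""
    return r_t - tau * kl(pi_t, ref_t) + gamma * q_next - q_pred

pi_t = np.array([0.8, 0.2])     # pi(.|s_t)
ref_t = np.array([0.5, 0.5])    # reference policy at s_t
delta = entropy_regularized_td(1.0, pi_t, ref_t, q_next=2.0, q_pred=2.5,
                               tau=0.1, gamma=0.9)
```

Setting $\tau = 0$ recovers the standard (unregularized) TD error; a positive $\tau$ penalizes deviation from the reference policy.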