# Distributionally robust reinforcement learning

###### Abstract

Generalization of reinforcement learning algorithms to unknown or uncertain environments is crucial for real-world applications. In this work, we explicitly consider the uncertainty associated with the test environment through an uncertainty set. We formulate the Distributionally Robust Reinforcement Learning (DR-RL) objective, which consists in maximizing performance against a worst-case policy in an uncertainty set centered at the reference policy. Based on this objective, we derive a computationally efficient policy improvement algorithm that benefits from Distributionally Robust Optimization (DRO) guarantees. Further, we propose an iterative procedure that increases the stability of learning, called Distributionally Robust Policy Iteration. Combined with the maximum entropy framework, we derive a distributionally robust variant of Soft Q-learning that admits an efficient practical implementation and produces policies with robust behaviour at test time. Our formulation provides a unified view of a number of safe RL algorithms and recent empirical successes.

Elena Smirnova, Elvis Dohmatob, Jérémie Mary

Criteo AI Lab

Correspondence: Elena Smirnova (e.smirnova@criteo.com)

Keywords: Machine Learning, ICML

## 1 Introduction

Generalization in reinforcement learning has been a long-studied issue Ackley & Littman (1990); Murphy (2005). Previous works have addressed generalization by using function approximators Tesauro (1992); Boyan & Moore (1995); Sutton (1996). Recently, the combination of powerful function approximators with the ideas of model-free RL has made it possible to efficiently train agents operating in complex simulated environments Mnih et al. (2015). Despite this notable progress, multiple works have shown that agents achieving excellent performance at training time can produce substantially sub-optimal behaviour in slightly modified test environments Hausknecht & Stone (2015); Zhang et al. (2018). To mitigate this issue, heuristic approaches akin to regularization have been proposed, such as injecting noise at training time Lillicrap et al. (2016); Tobin et al. (2017). A number of works are devoted to developing new evaluation methodologies that make it possible to detect over-fitting of RL algorithms Cobbe et al. (2018); Packer et al. (2018). The issue of poor generalization is amplified in real-world applications, where an RL algorithm is typically faced with stochastic transitions, a high-dimensional state space, and a finite sample of data or interactions. To succeed in real-world tasks, it is therefore crucial for RL algorithms to handle uncertainty over test environments.

Previous works on safe RL studied uncertainty in state transitions under a worst-case criterion. In the model-based setting, the robust MDP framework Nilim & El Ghaoui (2004) optimizes performance over worst-case transitions. In the model-free setting, $\hat{Q}$-learning optimizes over the worst-case transition incurred during learning Heger (1994); robust Q-learning Roy et al. (2017) replaces the worst transition seen by a fixed noise. Multiple studies report that the worst-case criterion is too restrictive and results in pessimistic policies, since it takes into account the worst possible transition even if it occurs with negligible probability Mihatsch & Neuneier (2002); Gaskett (2003); Lim et al. (2013). Alternatively, risk-sensitive objective functions integrate the desired level of risk directly into the objective, where the return and the risk are balanced through a weighted sum or an exponential utility function Gosavi (2009); Geibel & Wysotzki (2005).

Recently, the maximum entropy approach to RL Ziebart (2010); Haarnoja et al. (2017); Fox et al. (2016); Nachum et al. (2017) has demonstrated robust behaviour on a variety of simulated and real-world tasks Haarnoja et al. (2018c); Mahmood et al. (2018). The maximum entropy objective modifies the standard RL objective to additionally maximize the entropy of the policy at each visited state. This objective, optimized over a class of Boltzmann policies, results in improved exploration targeted at high-value actions Haarnoja et al. (2018b). The stochastic behaviour of the resulting policy is known to be beneficial in the case of uncertain dynamics Ziebart (2010) and on real-world robot tasks Haarnoja et al. (2018c); Mahmood et al. (2018). It has also been helpful for learning multi-modal behaviour Haarnoja et al. (2017) and for pre-training from multiple related tasks Haarnoja et al. (2018a).

Recent empirical successes also include a number of works that propose to augment the replay buffer with prior transitions. This proposal brings benefits similar to the ones observed for maximum entropy methods. Indeed, improved exploration has been observed on hard exploration problems by using expert demonstrations Hester et al. (2018) and trajectories from related tasks Colas et al. (2018). Gu et al. (2016); Rajeswaran et al. (2017) report faster and more stable learning when using model-based samples as part of the replay buffer. Combined with adversarial training, it is possible to train robust policies that generalize to a broad range of target domains Rajeswaran et al. (2017).

In this paper, we frame the generalization requirement for RL as a general distributionally robust optimization (DRO) problem Ben-Tal et al. (2009); Postek et al. (2016). We formulate the distributionally robust reinforcement learning (DR-RL) objective, which directly incorporates the uncertainty associated with the test environment through an uncertainty set. Unlike previous works, we center the uncertainty set on the state-action distribution induced by the baseline policy. This uncertainty set defines a worst-case reference distribution that, in the particular case of a Kullback-Leibler (KL) divergence uncertainty set, can be obtained analytically as a Boltzmann distribution Hu & Hong (2013). Based on results from DRO theory, the optimal policy w.r.t. the DR-RL objective satisfies asymptotic guarantees on generalization performance Duchi et al. (2016). In addition to this theoretical justification, the DR-RL objective admits a computationally efficient practical optimization procedure.

Based on the DR-RL objective, we propose an iterative policy improvement process, that we refer to as Distributionally Robust Policy Iteration. This algorithm features practically convenient implementation using off-the-shelf tools, while providing generalization guarantee on policy improvement step. Moreover, we combine benefits of maximum entropy methods and the DR-RL objective and propose a variant of Soft Q-learning Haarnoja et al. (2017); Fox et al. (2016), that we call Distributionally Robust Soft Q-learning. Our proposal consists in a minor change to the Soft Q-learning algorithm that stabilizes the training and produces policies with robust behaviour in test environment.

We connect the DR-RL objective with the previous work. In particular, we show that, at a high-level, DR-RL objective penalizes variance of performance metric, whereas maximum entropy objective regularizes the entropy of policy state-action distribution. We provide a view on a number of safe RL algorithms as instances of general DRO problem. We highlight specificities of their uncertainty sets with respect to the reported conservatism of their resulting policies. We ground aforementioned empirical successes into the DRO framework. We also relate to a recent proposal on distributional reinforcement learning Bellemare et al. (2017) through a particular definition of an uncertainty set.

We present experimental results that confirm the improved generalization of DR-RL based algorithms in both the interpolation and the extrapolation settings. We also show the improved stability of learning obtained with the iterative procedure derived from the DR-RL objective.

To summarize, our main contributions are as follows. First, we frame the generalization requirement as a general distributionally robust optimization (DRO) problem. This allows us to formulate the distributionally robust reinforcement learning (DR-RL) objective, which accounts for the uncertainty associated with the test environment. Based on this formulation, we present a unified view of several safe RL algorithms and provide grounding for recent empirical successes. We also derive practical algorithms based on the DR-RL objective, backed by generalization guarantees, and illustrate them experimentally.

## 2 Preliminaries

We consider a Markov decision process $(\mathcal{S}, \mathcal{A}, p, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s'|s,a)$ is the state-action transition probability that represents the probability density of the next state $s'$ given the current state $s$ and action $a$, and $r(s,a)$ is the reward function mapping state-action pairs into a bounded interval $[r_{\min}, r_{\max}]$. We define a stochastic policy $\pi(a|s)$ as a probability distribution over actions given a state and let $\Pi$ be the set of all policies. We consider the discounted setting with discount factor $\gamma \in [0, 1)$. We use the standard definitions of the state-action value function $Q^{\pi}(s,a)$ and the state-value function $V^{\pi}(s)$.

Let $\rho_\pi(s_t)$ be the state marginal of the trajectory distribution induced by policy $\pi$. We denote by $\rho_\pi(s)$ the future state distribution induced by policy $\pi$. We refer to $\rho_\pi(s,a)$ as the joint probability distribution over state-action pairs, defined as $\rho_\pi(s,a) = \rho_\pi(s)\,\pi(a|s)$.
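For concreteness, one common convention for these two distributions is written out below. The normalization by $(1-\gamma)$ and the symbols $\rho_\pi$ and $s_t$ are assumptions made for illustration; the notation used elsewhere in this paper may differ.

```latex
\rho_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid \pi),
\qquad
\rho_\pi(s, a) = \rho_\pi(s)\, \pi(a \mid s).
```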

### 2.1 Standard RL objective

The standard reinforcement learning objective aims at maximizing the expected long-term reward over trajectories generated by a policy $\pi$:

(1) | ||||

Given a reference policy and the corresponding induced future state distribution (as defined in the preliminaries), the objective (1) can be formulated as the expectation of a performance metric under the reference policy using importance weighting:

(2) | ||||

where the expectation is taken under the state-action pair distribution induced by the reference policy, and the quantity inside the expectation is the performance metric of the policy under this reference distribution. Analogously to supervised learning, the optimal policy w.r.t. the objective (2) can be thought of as optimizing a performance metric over a fixed reference distribution:

(3) |
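The importance-weighting step can be checked numerically on a toy case. The sketch below assumes a single state, a discrete action set, and state-action values as the performance metric; the function name `importance_weighted_value` and the numbers are hypothetical and only illustrate the identity, not the exact form of objective (2).

```python
import numpy as np

def importance_weighted_value(pi, pi_ref, q_values):
    """Compute E_{a ~ pi_ref}[ pi(a) / pi_ref(a) * Q(a) ] for one state."""
    pi = np.asarray(pi, dtype=float)
    pi_ref = np.asarray(pi_ref, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    # Importance weights are only defined where the reference policy has mass.
    if np.any((pi_ref == 0) & (pi > 0)):
        raise ValueError("pi must be absolutely continuous w.r.t. pi_ref")
    support = pi_ref > 0
    weights = pi[support] / pi_ref[support]
    return float(np.sum(pi_ref[support] * weights * q_values[support]))

# Hypothetical toy numbers: uniform reference policy, skewed target policy.
pi_ref = np.array([0.25, 0.25, 0.25, 0.25])
pi = np.array([0.7, 0.1, 0.1, 0.1])
q = np.array([1.0, 0.0, -1.0, 2.0])

on_policy = float(np.dot(pi, q))                         # expectation under pi
off_policy = importance_weighted_value(pi, pi_ref, q)    # same value, computed under pi_ref
```

Both quantities agree, which is what allows the objective to be written as an expectation under a fixed reference distribution.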

### 2.2 Problem setup

In practice, the state-action pair distribution is estimated based on a finite sample of data. The objective (2) does not provide guarantees on policy performance outside the empirical distribution, in particular when the policy is deployed in test environments. In this paper, we explicitly consider the unknown test distribution by constructing an uncertainty set around the empirical distribution which is rich enough to contain the actual test distribution with high confidence:

(4) |

Here, we consider two setups depending on the source of uncertainty, referred to as interpolation and extrapolation.

In the interpolation setup, the uncertainty is due to the estimation of a function approximator Tsitsiklis & Van Roy (1996) over a finite sample of data Sutton (1996); Murphy (2005). In this case, the objective (4) can be interpreted as an off-policy learning problem, where the task is to learn a policy on a training distribution that performs well in the true environment Precup et al. (2001); Munos et al. (2016). The interpolation setup also relates to the stability of learning. It is common for RL algorithms to learn in an iterative way, where each iteration is subject to the aforementioned uncertainty. Numerous works have focused on the stabilization of temporal difference learning with function approximation Boyan & Moore (1995); Sutton & Barto (1998); Tsitsiklis & Van Roy (1996); Van Hasselt et al. (2016); Anschel et al. (2017).

In the extrapolation setup, the uncertainty is associated with environment parameters da Silva et al. (2012). In this setting, the training environment and the test environment are sampled from an unknown distribution. For example, in real-world robot tasks, the parameters of the test system, such as the length of the robot arms, are not exactly known Andrychowicz et al. (2018). Another example comes from temporal effects occurring in recommender systems, where the policy learns on a data sample collected at one point in time, while it is deployed at a later time Zheng et al. (2018). In this setup, the objective (4) describes policies that are robust to changes in system parameters.

To mitigate the problem of poor test performance in RL, previous works proposed heuristics, such as injecting noise at training time Lillicrap et al. (2016); Plappert et al. (2017) or including randomization over system parameters directly during training Tobin et al. (2017); Andrychowicz et al. (2018). Nevertheless, as pointed out by Zhang et al. (2018), generalization for RL remains difficult.

## 3 Distributionally robust objective for RL

The generalization problem (4) is an instance of general DRO problem Ben-Tal et al. (2009); Postek et al. (2016). Next, we will consider a particular case of uncertainty set defined using Kullback-Leibler (KL) divergence. This choice is motivated by applications, where it is common to deploy stochastic policies, such as Gaussian policies Haarnoja et al. (2018b) or randomized greedy policies Mnih et al. (2015) where KL divergence based uncertainty set around a reference state-action distribution is well-defined.

We proceed by decoupling the uncertainty over the empirical state-action pair distribution into an uncertainty over state-action probabilities and an uncertainty over future state probabilities. In Section 3.1, we discuss the uncertainty set w.r.t. the reference state-action probabilities, define the DR-RL objective and discuss its properties. Next, using similar arguments, we analyze the uncertainty set around the empirical future state probabilities in Section 3.2. We then present a general uncertainty set w.r.t. the joint state-action distribution and define the general DR-RL objective in Section 3.3. Finally, we link the DR-RL objective to maximum entropy methods in Section 3.4.

### 3.1 Uncertainty over state-action distribution

We consider the uncertainty w.r.t. the reference policy. Indeed, for a stochastic reference policy, the empirical distribution might include estimation errors due to the finite sample size and the class of function approximator. We construct an uncertainty set around the reference policy that contains policies inducing the same future state distribution and whose state-action probabilities are within a KL-ball of a given radius centered at the reference policy:

(5) | ||||

where the distributions in the uncertainty set are absolutely continuous w.r.t. the reference policy, that is, they do not put mass on actions outside the support of the latter. The objective (4) with the KL divergence uncertainty set (5) becomes:

(6) | ||||

The inner minimization problem of the above objective defines an adversarial policy to reference policy in terms of KL divergence, i.e. the worst-case policy from the uncertainty set around the actual reference policy.

By direct application of Theorem 3 in Hu & Hong (2013) (see Equations 16-18), the optimal adversarial policy in (6) is given by Boltzmann distribution:

(7) |

where the temperature parameter is a positive Lagrange multiplier. This form of adversarial policy allows a new interpretation: the adversarial policy re-weights the reference policy such that the worst-case actions are taken more frequently, to the extent determined by the adversarial temperature. Using (7), the objective (6) simplifies into an adversarial objective of the following form:

(8) |

We refer to this objective as distributionally robust reinforcement learning objective (DR-RL). The optimal policy w.r.t DR-RL objective optimizes performance metric over the worst-case reference policy in uncertainty set. This adversarial policy takes a form of Boltzmann distribution that simplifies the optimization procedure. Indeed, in our experiments, we observe a well-behaved optimization.
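The re-weighting interpretation of the adversarial policy can be illustrated on a single state with discrete actions. The sketch below assumes that the performance metric is a vector of state-action values and that the adversarial policy is proportional to the reference policy times `exp(-Q / temperature)`; the names `adversarial_policy`, `kl_divergence`, `pi_ref` and `q` are hypothetical, and the exact normalization in (7) may differ.

```python
import numpy as np

def adversarial_policy(pi_ref, q_values, temperature):
    """Re-weight pi_ref by exp(-Q / temperature) and renormalize (one state)."""
    pi_ref = np.asarray(pi_ref, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = -q_values / temperature
    # Shift by the largest logit on the support for numerical stability.
    shift = logits[pi_ref > 0].max()
    # Clamp at 0 so that actions outside the support cannot overflow exp().
    weights = pi_ref * np.exp(np.minimum(logits - shift, 0.0))
    return weights / weights.sum()

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with p absolutely continuous w.r.t. q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical example: uniform reference policy over three actions.
pi_ref = np.full(3, 1.0 / 3.0)
q = np.array([1.0, 0.0, -1.0])
adv = adversarial_policy(pi_ref, q, temperature=1.0)
```

In this toy case the worst action (lowest value) receives the largest adversarial probability. A very large temperature returns approximately `pi_ref`, and a very small one concentrates almost all mass on the worst action, with the KL divergence to `pi_ref` growing as the temperature falls.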

The optimal state-dependent temperature parameter that corresponds to a given radius of uncertainty set is a solution to dual objective of DR-RL objective (see Theorem 4 in Hu & Hong (2013)):

(9) |
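As a rough illustration of how a radius maps to a temperature, the sketch below uses the standard dual of a KL-constrained worst-case expectation, $g(\eta) = -\eta \log \mathbb{E}_{p_0}[\exp(-\ell/\eta)] - \eta\epsilon$, and finds its maximizer by locating the temperature at which the re-weighted distribution lies on the boundary of the KL-ball. The symbols $\eta$, $\epsilon$, $\ell$, the single-state discrete setting, and all function names are assumptions for illustration, not the paper's exact formulation of (9).

```python
import numpy as np

def tilt(p_ref, values, eta):
    """Boltzmann re-weighting: proportional to p_ref * exp(-values / eta)."""
    z = -values / eta
    w = p_ref * np.exp(z - z.max())
    return w / w.sum()

def kl(p, q):
    """KL(p || q) for discrete distributions, assuming q > 0 everywhere."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def dual_value(p_ref, values, eta, radius):
    """g(eta) = -eta * log E_{p_ref}[exp(-values / eta)] - eta * radius."""
    z = -values / eta
    log_mgf = z.max() + np.log(np.sum(p_ref * np.exp(z - z.max())))
    return float(-eta * log_mgf - eta * radius)

def solve_temperature(p_ref, values, radius, eta_lo=1e-3, eta_hi=1e3, iters=200):
    """Bisection in log-space for the eta at which KL(tilt || p_ref) = radius."""
    if kl(tilt(p_ref, values, eta_lo), p_ref) <= radius:
        return eta_lo  # the ball is too large to be active in the search range
    if kl(tilt(p_ref, values, eta_hi), p_ref) >= radius:
        return eta_hi  # the ball is too small to be resolved in the search range
    for _ in range(iters):
        mid = float(np.sqrt(eta_lo * eta_hi))
        if kl(tilt(p_ref, values, mid), p_ref) > radius:
            eta_lo = mid  # too adversarial: raise the temperature
        else:
            eta_hi = mid
    return float(np.sqrt(eta_lo * eta_hi))

# Hypothetical toy problem: strictly positive reference probabilities.
p_ref = np.full(4, 0.25)
values = np.array([1.0, 0.0, -1.0, 2.0])
radius = 0.05
eta_star = solve_temperature(p_ref, values, radius)
worst_case = float(np.dot(tilt(p_ref, values, eta_star), values))
```

At `eta_star` the dual value coincides with the expectation of `values` under the re-weighted distribution, which is lower than the expectation under `p_ref`. Bisection is used here because the KL divergence of the re-weighted distribution decreases monotonically as the temperature increases.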

#### Uncertainty set of DR-RL objective

The temperature of the adversarial policy (7) is determined by the radius of the uncertainty set. For an infinitely small uncertainty set, which corresponds to the high-temperature limit where the adversarial temperature tends to infinity, the adversarial policy takes the same actions as the reference policy. The robust objective (8) then approaches the standard non-robust objective (1). At the other end, when the adversarial temperature tends to zero, the adversarial policy takes the worst actions in terms of the performance metric w.r.t. the reference policy. This corresponds to total uncertainty around the data sample and leads to conservative policies.

Therefore, the size of uncertainty set should reflect the amount of uncertainty over the test environment. While it should contain the test distribution with high confidence, it should be small enough to exclude pathological distributions, which would incentivize overly conservative policies. In the case of estimation errors, the size of the uncertainty set should be chosen w.r.t the sample size.

This work is based on KL divergence uncertainty set, but we note that in general the uncertainty set can take various forms, see Postek et al. (2016) for a comprehensive survey.

#### Generalization guarantees

Under mild conditions, Section 3 of Duchi et al. (2016) establishes an asymptotic generalization guarantee for any f-divergence uncertainty set, and in particular for the KL divergence. In other words, it is possible to construct a confidence interval around the test performance of the optimal policy w.r.t. the DR-RL objective (8) that has the correct size for a fixed confidence level as the sample size tends to infinity. In general, Hu & Hong (2013) showed that the KL divergence uncertainty set does not satisfy a finite-sample guarantee, i.e. a guarantee that holds at any given sample size. In contrast, a finite-sample result has been shown for the Wasserstein distance uncertainty set with light-tailed distributions Esfahani et al. (2017).

### 3.2 Uncertainty over future state distribution

Similarly to uncertainty associated with reference policy, we consider the uncertainty w.r.t. state distribution induced by the reference policy. In interpolation setup, in the case of stochastic state-transitions, the states sampled under empirical distribution might not represent the actual distribution due to the finite sample size. In extrapolation setup, the uncertainty can also be related to the state distribution in the test environment. Again, we construct an uncertainty set using KL divergence:

(10) |

Using similar arguments to Section 3.1 and rearranging in terms of state-value function, we obtain a DR-RL objective:

(11) | ||||

where the worst-case reference state distribution can be interpreted as encouraging visits to the worst states in terms of the performance metric, to the extent defined by the adversarial temperature:

(12) |

Equivalently, we can see the adversarial state distribution in terms of change to state-action probabilities that are re-weighted by a per-state quantity:

(13) |

The effect of such re-weighting consists in increased uncertainty over actions in a particular state, making it more probable to sample a sub-optimal action and transition into a new state. We note that the expectation in the denominator of (12) is a fixed quantity across states and in practice, can be approximated using Monte-Carlo sampling.

### 3.3 General DR-RL objective

By combining uncertainty sets (5) and (10), we give a general formulation of DR-RL objective that forms an uncertainty set around the joint distribution:

(14) |

The latter equality follows from the chain rule for the KL divergence between joint probability distributions. Therefore, the general DR-RL objective is given by:

(15) |
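For reference, the chain rule for the KL divergence invoked above can be stated as follows. The symbols $\tilde{\rho}$, $\tilde{\pi}$ for the candidate distributions and $\rho_0$, $\pi_0$ for the reference ones are assumed notation, with each joint distribution factorized as a state distribution times a policy.

```latex
D_{\mathrm{KL}}\big(\tilde{\rho}(s,a)\,\|\,\rho_0(s,a)\big)
  = D_{\mathrm{KL}}\big(\tilde{\rho}(s)\,\|\,\rho_0(s)\big)
  + \mathbb{E}_{s \sim \tilde{\rho}}\Big[ D_{\mathrm{KL}}\big(\tilde{\pi}(\cdot \mid s)\,\|\,\pi_0(\cdot \mid s)\big) \Big]
```

The first term corresponds to the uncertainty over the future state distribution and the second to the per-state uncertainty over the policy, which is why the two uncertainty sets can be combined into one set over the joint distribution.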

In the following, we link the DR-RL objective to the maximum entropy objective.

### 3.4 Connection to maximum entropy objective

The framework of maximum entropy reinforcement learning Ziebart (2010); Fox et al. (2016); Haarnoja et al. (2017) regularizes the standard RL objective (1) with the expected relative entropy of the policy w.r.t. reference policy:

(16) |

where the soft Q-values are defined as expected discounted sum of rewards and per-state relative entropy:

(17) |

Boltzmann policy is known to be an optimal solution to this objective (16) Haarnoja et al. (2018b):

(18) |

where the temperature parameter controls the smoothness of the resulting distribution by balancing relative entropy term against return. Maximum entropy objective (16) encourages exploration, particularly useful in tasks with multiple modes of near-optimal behavior. Differently, in DR-RL objective (8), the expectation is taken under adversarial Boltzmann distribution that acts proportionally to inverse of exponential of state-action values.

Both entropy-regularized objective and DR-RL objective incorporate a notion of uncertainty. We show that, under certain conditions, DR-RL objective can be seen as variance-regularized objective. Recall that Namkoong & Duchi (2017) turned, in the case of convex and bounded loss function, the DRO problem with f-divergence uncertainty set into a convex surrogate to variance-regularized objective. With the same strategy the DR-RL objective becomes:

(19) |

Therefore, at a high level, the two objectives can be characterized by different regularization criteria: the variance of the performance metric for the DR-RL objective and the entropy of the policy for the maximum entropy objective. The optimization procedure for both the DR-RL and the maximum entropy objectives involves computing expectations under a Boltzmann distribution, which is conceptually simple and shows good optimization behaviour in practice.
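The variance-regularization view can be made concrete with the standard small-radius expansion of a KL-constrained worst-case expectation. The statement below is a generic DRO fact for a bounded performance metric $\ell$, a reference distribution $P_0$ and a KL-ball of radius $\epsilon$ (all assumed notation); it is given as an illustration and is not necessarily the exact form of (19).

```latex
\inf_{P \,:\, D_{\mathrm{KL}}(P \,\|\, P_0) \le \epsilon} \mathbb{E}_{P}[\ell]
  = \mathbb{E}_{P_0}[\ell] - \sqrt{2\,\epsilon\,\mathrm{Var}_{P_0}(\ell)} + o\big(\sqrt{\epsilon}\big)
  \qquad \text{as } \epsilon \to 0 .
```

To first order, maximizing the worst-case expectation therefore amounts to maximizing the mean performance minus a penalty proportional to its standard deviation.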

## 4 Distributionally robust policy iteration

Based on the DR-RL objective (8), we propose an iterative procedure that we call Distributionally Robust Policy Iteration, given in Algorithm 1. Under the assumption of exact computations, it generates a monotonically improving sequence of policies. Moreover, each policy in the sequence comes with a generalization guarantee (see Section 3.1). Experimentally, we observe that our algorithm stabilizes learning (see Section 6.2). The proposed policy iteration process is applicable to both on-policy and off-policy settings. Indeed, the robustification appears as a re-weighting of the on-policy or off-policy samples based on the adversarial distribution.

#### Practical implementation

Our algorithm allows for a computationally efficient practical implementation. It is common for RL algorithms to learn on a data sample. For off-policy algorithms, a replay buffer is widely employed Lin (1991); Mnih et al. (2015); for policy gradient algorithms, the data sample is a set of trajectories generated by the current policy (e.g. Schulman et al. (2015)). To implement Algorithm 1, it is sufficient to learn on samples from the adversarial distribution. One possibility is to re-sample the data based on the target distribution, for example, using importance sampling. Practically, a prioritized experience replay module Schaul et al. (2016) can be used for this purpose. Alternatively, one can generate samples directly from the adversarial distribution and then resort to uniform sampling.
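The re-sampling option can be sketched as follows, assuming the replay buffer already holds transitions collected under the reference policy together with current value estimates. For simplicity the weights are normalized over the whole buffer rather than per state, which is a simplification of the adversarial policy; the names `adversarial_sampling_probs`, `sample_minibatch` and `q_estimates` are hypothetical, and the random values stand in for real value estimates.

```python
import numpy as np

def adversarial_sampling_probs(q_values, temperature):
    """Sampling probabilities proportional to exp(-Q / temperature)."""
    q = np.asarray(q_values, dtype=float)
    z = -q / temperature
    z = z - z.max()  # numerical stability
    w = np.exp(z)
    return w / w.sum()

def sample_minibatch(buffer_size, probs, batch_size, rng):
    """Draw transition indices with replacement according to probs."""
    return rng.choice(buffer_size, size=batch_size, replace=True, p=probs)

rng = np.random.default_rng(0)
# Hypothetical Q(s, a) estimates for 1000 buffered transitions.
q_estimates = rng.normal(size=1000)
probs = adversarial_sampling_probs(q_estimates, temperature=2.0)
batch = sample_minibatch(len(q_estimates), probs, batch_size=32, rng=rng)
```

Transitions with low estimated value are drawn more often, which mimics learning on samples from the adversarial distribution; the same weights could instead be passed as priorities to a prioritized replay implementation.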

### 4.1 Distributionally Robust Soft Q-learning

As discussed in Section 3.4, the exponential scheme of Boltzmann distribution is beneficial for exploration. Differently, DR-RL objective ensures a robust behaviour in the face of uncertainty. We bring together the best of both worlds and propose Algorithm 1 for a class of Boltzmann policies that correspond to maximum entropy objective (16).

In this setting, the expression for the adversarial policy (7) can be rewritten as a change of sampling temperature:

(20) |

Thus, the adversarial policy for maximum entropy objective applies a per-action correction of Q-values. Based on the Algorithm 1, we derive a Distributionally Robust Soft Q-iteration variant of soft Q-iteration Haarnoja et al. (2017).
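The change of sampling temperature can be seen from a short calculation. It assumes that the reference policy is a Boltzmann policy with temperature $\alpha$ over the soft Q-values and that the adversarial re-weighting uses the same Q-values with temperature $\eta$; the symbols $\alpha$, $\eta$ and $\tilde{\alpha}$ are assumed notation, and the exact expressions in (20) and (21) may differ.

```latex
\tilde{\pi}(a \mid s)
  \;\propto\; \exp\!\Big(\tfrac{1}{\alpha} Q(s,a)\Big)\,\exp\!\Big(-\tfrac{1}{\eta} Q(s,a)\Big)
  \;=\; \exp\!\Big(\tfrac{1}{\tilde{\alpha}} Q(s,a)\Big),
  \qquad
  \tilde{\alpha} = \frac{\alpha\,\eta}{\eta - \alpha}, \quad \eta > \alpha .
```

Under these assumptions, the adversarial policy is again a Boltzmann policy with a higher temperature $\tilde{\alpha} > \alpha$. As $\eta \to \infty$ the correction vanishes and $\tilde{\alpha} \to \alpha$, while $\eta$ approaching $\alpha$ from above flattens the sampling distribution towards uniform.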

###### Theorem 1 (Distributionally Robust Soft Q-iteration).

###### Proof.

See Appendix A.1. ∎

Theorem 1 requires the relationship (21) between the adversarial and policy temperature parameters. The adversarial temperature should satisfy a lower bound to allow the policy to improve. At the upper limit, the adversarial temperature is unbounded, as this corresponds to no adversarial correction to the reference policy. Intuitively, the policy temperature ensures the exploration part, while the adversarial temperature quantifies the degree of uncertainty over the empirical distribution. Based on the dual objective (9), the optimal adversarial temperature for a given size of the uncertainty set is a solution of the following 1-d optimization problem:

(22) |

where with a lower bound given by (21): and upper bound given by: .

In practice, DR Soft Q-iteration can be implemented in a similar way to Mnih et al. (2015) using temporal difference learning with neural network function approximator and experience replay Lin (1991). We therefore propose a practical Q-learning style algorithm, called DR Soft Q-learning, described in Algorithm 2. DR Soft Q-learning represents a distributionally robust variant of Soft Q-learning algorithm Fox et al. (2016); Haarnoja et al. (2017).

### 4.2 Optimal schedule of the size of KL-ball

Previous works in the DRO domain have extensively studied the uncertainty set based on the KL divergence Hu & Hong (2013). Under certain conditions, it is possible to provide an asymptotically optimal radius of the KL-ball based on the sample size. Specifically, Theorem 3 of Duchi et al. (2016) shows that the optimal radius of the KL-ball decreases as the inverse of the sample size:

(23) |

where the radius depends on a quantile of the chi-squared distribution with 1 degree of freedom, chosen according to a prescribed confidence level.
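A schedule of this kind can be sketched with the standard library alone. The code assumes the radius takes the form of the $(1-\delta)$ chi-squared quantile divided by the sample size; the constant factor, the symbol $\delta$ and the function names `chi2_1dof_quantile` and `kl_radius` are assumptions for illustration and may differ from (23).

```python
from statistics import NormalDist

def chi2_1dof_quantile(p):
    """Quantile of the chi-squared distribution with 1 degree of freedom.

    If Z is standard normal, Z**2 is chi-squared with 1 degree of freedom,
    so the p-quantile equals the square of the normal (1 + p) / 2 quantile.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return z * z

def kl_radius(sample_size, delta):
    """Assumed schedule: (1 - delta) chi-squared quantile over the sample size."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return chi2_1dof_quantile(1.0 - delta) / sample_size

# Hypothetical usage: the radius shrinks as more data is collected.
radii = [kl_radius(n, delta=0.1) for n in (100, 1000, 10000)]
```

With this form, a smaller `delta` gives a larger quantile and hence a larger uncertainty set, while a larger sample shrinks the set at rate one over the sample size.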

## 5 Related work

### 5.1 Safe RL

Research in safe RL has proposed optimization objectives that take risk into account. Following the taxonomy proposed by García & Fernández (2015), we review two types of previously studied formulations: the worst-case objective and the risk-sensitive objective. We show that both of these objectives are instances of the DRO problem.

#### Worst-case objective

Direct realizations of DRO problems have been proposed in the safe RL literature under the "worst-case criterion" García & Fernández (2015). In the model-free setting, the $\hat{Q}$-learning method uses an uncertainty set that contains all trajectories incurred during learning Heger (1994). In the model-based setting, Nilim & El Ghaoui (2004) propose the robust MDP framework, where the uncertainty set is defined over transition probabilities. Its adaptation to the model-free setting Roy et al. (2017) approximates the uncertainty set over transition probabilities by a fixed noise. Multiple studies report that the worst-case criterion is too restrictive and results in pessimistic policies, since it takes into account the worst possible transition even if it occurs with negligible probability Mihatsch & Neuneier (2002); Gaskett (2003). From the perspective of DRO, the size of the uncertainty set is crucial to obtaining robust, but not overly conservative, policies. As noted by Lim et al. (2013), the worst-case model-based approach often defines too large an uncertainty set. In the model-free setting, constant uncertainty over transitions might not be representative of the statistical uncertainty.

#### Variance-penalized objective

As discussed in Section 3.4, under certain conditions, the DRO problem can be formulated as a variance-penalized objective. This objective is known in the safe RL literature as the variance-penalized criterion Gosavi (2009), the expected value-variance criterion Heger (1994) and the expected-value-minus-variance criterion Geibel & Wysotzki (2005). A variance-penalized objective has also been proposed in the context of counterfactual evaluation Swaminathan & Joachims (2015b, a). Unfortunately, the variance-penalized objective is non-convex in general and has been observed to produce counterintuitive policies García & Fernández (2015). In contrast, optimizing the DR-RL objective consists in computing expectations under a Boltzmann distribution, which empirically results in a well-behaved optimization.

### 5.2 Distributional reinforcement learning

Bellemare et al. (2017) introduced distributional RL, demonstrating the importance of the value distribution in approximate reinforcement learning. This can also be expressed from the DRO perspective by considering the value distributions through their probability densities. The uncertainty set that corresponds to the distributional RL algorithm takes the following form:

(24) |

where the distance between distributions is a Wasserstein distance. DRO under the Wasserstein metric offers powerful out-of-sample performance guarantees and enables the decision maker to control the model's conservativeness, but it is known to be computationally intensive Mohajerin Esfahani & Kuhn (2018). This provides an alternative view on distributional RL.

### 5.3 Empirical results

Previous works reported empirical successes when augmenting the replay buffer with prior transitions, such as expert demonstrations Hester et al. (2018), related policies Andrychowicz et al. (2017); Colas et al. (2018), and model-based samples Gu et al. (2016); Rajeswaran et al. (2017). Indeed, this approach can be seen as producing a state-action pair distribution that corresponds to a DRO problem over an uncertainty set defined as a mixture distribution between the policy distribution and prior knowledge.

## 6 Experiments

We experiment in two problem setups: interpolation and extrapolation (see Section 2.2 for a description). Under the interpolation setup, we consider an iterative policy improvement procedure over a finite sample of data using a function approximator. The proposed distributionally robust iterative process (Algorithms 1 and 2) provides an alternative way to stabilize learning. Indeed, as discussed in Section 3.4, the DR-RL objective can be broadly seen as penalizing the variance of policy performance. Therefore, the target estimation error of Q-learning Anschel et al. (2017) can be reduced under the DR-RL objective. We analyze the stability of Q-learning derived algorithms in Section 6.2.

Under the extrapolation setup, we consider a scenario in which there is a distributional shift between the training and test environments. Specifically, we consider a deterministic training domain and a stochastic test domain. This mimics the distributional shift in state-transition probabilities that occurs, in particular, in time-dependent systems such as recommender systems. We present the extrapolation results in Section 6.3.

### 6.1 Evaluation setup

#### Stochastic grid-world

We consider a grid-world problem, similar to Anschel et al. (2017). The agent may move up, down, left and right; the start state and the end state are located in opposite corners. The agent receives no reward, except for a reward of 1 at the goal state. The state space is a 20x20 grid. Learning is carried out in a deterministic domain. At test time, there is a probability 0.5 of moving in the intended direction, 0.2 of moving in the opposite direction, and 0.1 of moving either right or left.

#### Baselines

We experiment with the following Q-learning derived algorithms: Q-learning, Soft Q-learning and the algorithm proposed in this paper, Distributionally Robust Soft Q-learning (DR Soft Q-learning). The practical implementation of these algorithms is similar to Mnih et al. (2015), that is, we employ a target network, an experience replay buffer and epsilon-greedy exploration. The algorithms differ in the sampling procedure, where DR Soft Q-learning uses the adversarial distribution, and in the class of learned policies, where Soft Q-learning trains a Boltzmann policy.

#### Parameters

For each algorithm, we conduct 30 runs of 20K steps and perform testing on 10K steps. We set the temperature of Soft Q-learning algorithm . We set the fixed exploration rate to 0.1. We consider the confidence level as a hyperparameter . We use . We use one-hidden layer neural network with 32 neurons. Minimization is done using the Adam optimizer Kingma & Ba (2015) on mini-batches of 32 samples with fixed learning rate 0.001.

### 6.2 Stability of learning

We analyze the stability of learning of the proposed algorithm and the baselines. Figure 1 shows the mean squared error between the average predicted Q-value and the optimal Q-value in the gridworld problem. We observe that DR Soft Q-learning exhibits less noisy behaviour and faster convergence to the optimal Q-value compared to Soft Q-learning. This confirms the variance penalizing effect of DR-RL objective.

Figure 2 shows the size of the uncertainty set and the corresponding optimal value of the adversarial temperature during the training of the DR Soft Q-learning algorithm. At lower confidence levels, we observe a lower adversarial temperature, which leads to an increased adversarial effect.
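The relation between the size of the uncertainty set and the adversarial temperature can be illustrated on a small discrete example. The sketch below assumes a KL-divergence ball of radius `eps` around a reference distribution `p`, for which the worst-case distribution over values `x` is the Boltzmann tilt with probabilities proportional to `p * exp(-x / beta)`, and finds the temperature `beta` at which the KL constraint is tight by bisection. All names are hypothetical, and the mapping from a confidence level to `eps` is not modelled here.

```python
import numpy as np


def tilted(p, x, beta):
    """Boltzmann-tilted distribution with q_i proportional to p_i * exp(-x_i / beta)."""
    logits = np.log(p) - x / beta
    logits -= logits.max()  # numerical stability
    q = np.exp(logits)
    return q / q.sum()


def kl(q, p):
    """KL divergence KL(q || p) for discrete distributions."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))


def adversarial_temperature(p, x, eps, lo=1e-3, hi=1e3, iters=100):
    """Bisection in log-space for beta such that KL(tilted || p) = eps.

    Assumes eps lies between the KL values attained at `hi` and `lo`.
    """
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if kl(tilted(p, x, mid), p) > eps:
            lo = mid  # tilt is too strong: raise the temperature
        else:
            hi = mid  # tilt fits inside the ball: lower the temperature
    return np.sqrt(lo * hi)
```

In this sketch a larger radius `eps` yields a lower temperature and a distribution that places more mass on low values, which mirrors the qualitative behaviour reported for Figure 2.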

### 6.3 Stochastic test domain

Table 1 compares the algorithms in terms of performance in the stochastic test domain after training in the deterministic domain. We observe that DR Soft Q-learning is more robust than Soft Q-learning to transition changes. Low confidence levels result in overly pessimistic policies, whereas higher levels, with a smaller uncertainty set, improve the test performance. Greedy Q-learning produces more varied performance, with wider confidence intervals and a lower average, than DR Soft Q-learning at confidence levels 0.01 and 0.1.

Method | Confidence | Test performance (confidence interval) |
---|---|---|
Greedy Q | - | 0.252 (0.166, 0.319) |
Soft Q | - | 0.213 (0.152, 0.259) |
DR Soft Q | 0.001 | 0.131 (0.004, 0.254) |
DR Soft Q | 0.01 | 0.257 (0.227, 0.294) |
DR Soft Q | 0.1 | 0.284 (0.259, 0.308) |

## 7 Conclusion

We study the problem of generalization for RL. We introduce a distributionally robust reinforcement learning objective that directly accounts for the uncertainty associated with the test environment. The proposed objective consists in optimizing the policy performance over a worst-case sampling distribution in the uncertainty set, which takes the form of a Boltzmann distribution. Based on this objective, we propose an iterative policy improvement process, called Distributionally Robust Policy Iteration. We also combine the DR-RL objective with the maximum entropy framework and propose the Distributionally Robust Soft Q-learning algorithm. This algorithm applies a temperature correction to the sampling distribution of Soft Q-learning so as to robustify the behaviour under an uncertain test environment. Our experiments show that distributional robustness is a promising direction for improving the stability of training and ensuring robust behaviour of RL algorithms at test time. We hope this work opens an avenue to a new class of algorithms with better understood generalization performance.

## References

- Ackley & Littman (1990) Ackley, D. H. and Littman, M. L. Generalization and scaling in reinforcement learning. In Advances in neural information processing systems, pp. 550–557, 1990.
- Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
- Andrychowicz et al. (2018) Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
- Anschel et al. (2017) Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 176–185, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 449–458, 2017.
- Ben-Tal et al. (2009) Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust optimization, volume 28. Princeton University Press, 2009.
- Boyan & Moore (1995) Boyan, J. A. and Moore, A. W. Generalization in reinforcement learning: Safely approximating the value function. In Advances in neural information processing systems, pp. 369–376, 1995.
- Cobbe et al. (2018) Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.
- Colas et al. (2018) Colas, C., Sigaud, O., and Oudeyer, P.-Y. Gep-pg: Decoupling exploration and exploitation in deep reinforcement learning algorithms. arXiv preprint arXiv:1802.05054, 2018.
- da Silva et al. (2012) da Silva, B., Konidaris, G., and Barto, A. Learning parameterized skills. In Proceedings of the Twenty Ninth International Conference on Machine Learning, June 2012.
- Duchi et al. (2016) Duchi, J., Glynn, P., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach, 2016.
- Esfahani et al. (2017) Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, Jul 2017.
- Fox et al. (2016) Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pp. 202–211, 2016.
- García & Fernández (2015) García, J. and Fernández, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- Gaskett (2003) Gaskett, C. Reinforcement learning under circumstances beyond its control. International Conference on Computational Intelligence for Modelling Control and Automation, 2003.
- Geibel & Wysotzki (2005) Geibel, P. and Wysotzki, F. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81–108, 2005.
- Gosavi (2009) Gosavi, A. Reinforcement learning for model building and variance-penalized control. In Winter Simulation Conference, pp. 373–379. Winter Simulation Conference, 2009.
- Gu et al. (2016) Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838, 2016.
- Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1352–1361, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
- Haarnoja et al. (2018a) Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., and Levine, S. Composable deep reinforcement learning for robotic manipulation. In 2018 IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane, Australia, May 21-25, 2018, pp. 6244–6251, 2018a.
- Haarnoja et al. (2018b) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018b.
- Haarnoja et al. (2018c) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018c.
- Hausknecht & Stone (2015) Hausknecht, M. J. and Stone, P. The impact of determinism on learning atari 2600 games. In AAAI Workshop: Learning for General Competency in Video Games, 2015.
- Heger (1994) Heger, M. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994, pp. 105–111. Elsevier, 1994.
- Hester et al. (2018) Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., Dulac-Arnold, G., Agapiou, J., Leibo, J. Z., and Gruslys, A. Deep q-learning from demonstrations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), pp. 3223–3230, 2018.
- Hu & Hong (2013) Hu, Z. and Hong, L. J. Kullback-leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013.
- Kingma & Ba (2015) Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
- Lim et al. (2013) Lim, S. H., Xu, H., and Mannor, S. Reinforcement learning in robust markov decision processes. In Advances in Neural Information Processing Systems, pp. 701–709, 2013.
- Lin (1991) Lin, L. J. Programming robots using reinforcement learning and teaching. In AAAI, pp. 781–786, 1991.
- Mahmood et al. (2018) Mahmood, A. R., Korenkevych, D., Vasan, G., Ma, W., and Bergstra, J. Benchmarking reinforcement learning algorithms on real-world robots. In Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pp. 561–591. PMLR, 29–31 Oct 2018.
- Mihatsch & Neuneier (2002) Mihatsch, O. and Neuneier, R. Risk-sensitive reinforcement learning. Machine learning, 49(2-3):267–290, 2002.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Mohajerin Esfahani & Kuhn (2018) Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, Sep 2018.
- Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.
- Murphy (2005) Murphy, S. A. A generalization error for q-learning. Journal of Machine Learning Research, 6(Jul):1073–1097, 2005.
- Nachum et al. (2017) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2775–2785, 2017.
- Namkoong & Duchi (2017) Namkoong, H. and Duchi, J. C. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pp. 2971–2980, 2017.
- Nilim & El Ghaoui (2004) Nilim, A. and El Ghaoui, L. Robustness in markov decision problems with uncertain transition matrices. In Advances in Neural Information Processing Systems, pp. 839–846, 2004.
- Packer et al. (2018) Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., and Song, D. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.
- Plappert et al. (2017) Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. In International Conference on Learning Representations, 2017.
- Postek et al. (2016) Postek, K., den Hertog, D., and Melenberg, B. Computationally tractable counterparts of distributionally robust constraints on risk measures. SIAM Review, 58(4):603–650, 2016.
- Precup et al. (2001) Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In ICML, pp. 417–424, 2001.
- Rajeswaran et al. (2017) Rajeswaran, A., Ghotra, S., Ravindran, B., and Levine, S. EPopt: Learning robust neural network policies using model ensembles. International Conference on Learning Representations, 2017.
- Roy et al. (2017) Roy, A., Xu, H., and Pokutta, S. Reinforcement learning under model mismatch. In Advances in Neural Information Processing Systems, pp. 3043–3052, 2017.
- Schaul et al. (2016) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In International Conference on Learning Representations, 2016.
- Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
- Sutton (1996) Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems, pp. 1038–1044, 1996.
- Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press Cambridge, 1998.
- Swaminathan & Joachims (2015a) Swaminathan, A. and Joachims, T. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731–1755, 2015a.
- Swaminathan & Joachims (2015b) Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, pp. 814–823, 2015b.
- Tesauro (1992) Tesauro, G. Practical issues in temporal difference learning. In Advances in neural information processing systems, pp. 259–266, 1992.
- Tobin et al. (2017) Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 23–30. IEEE, 2017.
- Tsitsiklis & Van Roy (1996) Tsitsiklis, J. and Van Roy, B. An analysis of temporal-difference learning with function approximation. Technical report, LIDS-P-2322. Laboratory for Information and Decision Systems, 1996.
- Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI, volume 2, pp. 5. Phoenix, AZ, 2016.
- Zhang et al. (2018) Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.
- Zheng et al. (2018) Zheng, G., Zhang, F., Zheng, Z., Xiang, Y., Yuan, N. J., Xie, X., and Li, Z. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 167–176, 2018.
- Ziebart (2010) Ziebart, B. D. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Machine Learning Department, Carnegie Mellon University, Dec 2010.

## Appendix A

### A.1 Proof of Theorem 1

###### Proof.

Throughout the proof, we assume a tabular setting with bounded rewards and a finite action space. We define the entropy-augmented reward as:

(25) |

and define soft Q-values as standard Q-values under the entropy-augmented reward:

(26) |

Then, the DR Soft Q-iteration optimizes the distributionally robust maximum entropy objective:

(27) |

where the adversarial policy is defined as in (20):

(28) |

At each iteration of DR soft Q-iteration, we consider a policy evaluation step and a policy improvement step.

In the policy evaluation step, we compute the value of a policy according to objective (27). For a fixed policy, the soft Q-value can be computed iteratively using a modified version of the Bellman operator:

(29) |

According to the Soft Policy Evaluation Lemma (Lemma 1 in Haarnoja et al. (2018b)), the soft Bellman iteration converges to the soft value of $\pi$ as $k \to \infty$.

Next, we consider the policy improvement step:

(30) |

We use Theorem 4 of Haarnoja et al. (2017) that establishes .

Since adversarial and policy temperatures and satisfy condition as defined in (21), the following inequality holds between the objective (27) evaluated at policy and :

(31) |

To show that DR soft Q-iteration converges to of optimal policy of objective (27), we use an argument similar to that of Theorem 1 of Haarnoja et al. (2018b). The sequence of is monotonically increasing, and, as is bounded, the sequence converges to some policy . At convergence, it must hold that for all due to inequality (31). Therefore, achieves the highest value of objective (27).

∎
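The alternation of policy evaluation and policy improvement used in the proof can be illustrated numerically in the tabular case. The sketch below implements standard soft policy iteration in the sense of Haarnoja et al. (2018b), without the adversarial temperature correction of DR soft Q-iteration; the function names, the random MDP arrays `P` and `R`, and the values of `gamma` and `alpha` are assumptions made for illustration only.

```python
import numpy as np


def evaluate(P, R, pi, gamma, alpha, iters=500):
    """Soft policy evaluation: iterate the soft Bellman backup for a fixed pi.

    P: transitions, shape (S, A, S); R: rewards, shape (S, A);
    pi: policy, shape (S, A), with strictly positive rows summing to 1.
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(iters):
        # Soft state value: V(s) = E_{a ~ pi}[Q(s, a) - alpha * log pi(a|s)]
        V = np.sum(pi * (Q - alpha * np.log(pi)), axis=1)
        Q = R + gamma * P @ V
    return Q


def improve(Q, alpha):
    """Soft policy improvement: pi(a|s) proportional to exp(Q(s, a) / alpha)."""
    logits = (Q - Q.max(axis=1, keepdims=True)) / alpha
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)


def soft_policy_iteration(P, R, gamma=0.9, alpha=0.5, rounds=20):
    """Alternate evaluation and improvement; return the soft Q-table per round."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform start
    history = []
    for _ in range(rounds):
        Q = evaluate(P, R, pi, gamma, alpha)
        history.append(Q)
        pi = improve(Q, alpha)
    return history
```

On a small random MDP, consecutive entries of the returned list can be compared elementwise to check the monotone improvement of soft Q-values on which the convergence argument relies.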