# Reinforcement Learning under Model Mismatch

## Abstract

We study reinforcement learning under *model misspecification*, where we do not have access to the true environment but only to a reasonably close approximation of it. We address this problem by extending the framework of robust MDPs of [2] to the *model-free* reinforcement learning setting, where we do not have access to the model parameters but can only sample states from the model. We define *robust versions* of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy and an approximate value function, respectively. We scale up the robust algorithms to large MDPs via function approximation and prove convergence under two different settings. We prove convergence of robust approximate policy iteration and robust approximate value iteration for linear architectures (under mild assumptions). We also define a robust loss function, the *mean squared robust projected Bellman error*, and give stochastic gradient descent algorithms that are guaranteed to converge to a local minimum.

## 1 Introduction

Reinforcement learning is concerned with learning a good policy for sequential decision making problems modeled as a Markov Decision Process (MDP), via interacting with the environment [22]. In this work we address the problem of reinforcement learning from a *misspecified model*. As a motivating example, consider the scenario where the problem of interest is not directly accessible, but instead the agent can interact with a simulator whose dynamics are reasonably close to those of the true problem. Another plausible application is when the parameters of the model may evolve over time but can still be reasonably approximated by an MDP.

To address this problem we use the framework of *robust MDPs* which was proposed by [2] to solve the planning problem under model misspecification. The robust MDP framework considers a class of models and finds the robust optimal policy which is a policy that performs best under the worst model. It was shown by [2] that the robust optimal policy satisfies the *robust Bellman equation* which naturally leads to exact dynamic programming algorithms to find an optimal policy. However, this approach is model dependent and does not immediately generalize to the model-free case where the parameters of the model are unknown.

Essentially, reinforcement learning is a *model-free* framework to solve the Bellman equation using samples. Therefore, to learn policies from misspecified models, we develop sample-based methods to solve the *robust* Bellman equation. In particular, we develop robust versions of classical reinforcement learning algorithms such as Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal policy under mild assumptions on the discount factor. We also show that the nominal versions of these iterative algorithms converge to policies that may be arbitrarily worse compared to the optimal policy.

We also scale up these robust algorithms to large scale MDPs via function approximation, where we prove convergence under two different settings. Under a technical assumption similar to [6] we show convergence of robust approximate policy iteration and value iteration algorithms for linear architectures. We also study function approximation with nonlinear architectures, by defining an appropriate *mean squared robust projected Bellman error* (MSRPBE) loss function, which is a generalization of the mean squared projected Bellman error (MSPBE) loss function of [23]. We propose robust versions of stochastic gradient descent algorithms as in [23] and prove convergence to a local minimum under some assumptions for function approximation with arbitrary smooth functions.

**Contribution.** In summary we have the following contributions:

1. We extend the robust MDP framework of [2] to the *model-free* reinforcement learning setting. We then define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy.
2. We also provide robust reinforcement learning algorithms for the function approximation case and prove convergence of robust approximate policy iteration and value iteration algorithms for linear architectures. We also define the MSRPBE loss function, which contains the robust optimal policy as a local minimum, and we derive stochastic gradient descent algorithms to minimize this loss function as well as establish convergence to a local minimum in the case of function approximation by arbitrary smooth functions.
3. Finally, we demonstrate empirically the improvement in performance for the robust algorithms compared to their nominal counterparts. For this we used various reinforcement learning test environments from OpenAI [10] as benchmarks to assess the improvement in performance as well as to ensure reproducibility and consistency of our results.

**Related Work.** Recently, several approaches have been proposed to address model performance due to parameter uncertainty for Markov Decision Processes (MDPs). A Bayesian approach was proposed by [21] which requires perfect knowledge of the prior distribution on transition matrices. Other probabilistic and risk based settings were studied by [11] which propose various mechanisms to incorporate percentile risk into the model. A framework for robust MDPs was first proposed by [2] who consider the transition matrices to lie in some *uncertainty set* and proposed a dynamic programming algorithm to solve the robust MDP. Recent work by [26] extended the robust MDP framework to the function approximation setting where under a technical assumption the authors prove convergence to an optimal policy for linear architectures. Note that these algorithms for robust MDPs do not readily generalize to the *model-free* reinforcement learning setting where the parameters of the environment are not explicitly known.

For reinforcement learning in the non-robust *model-free* setting, several iterative algorithms such as Q-learning, TD-learning, and SARSA are known to converge to an optimal policy under mild assumptions, see [5] for a survey. Robustness in reinforcement learning for MDPs was studied by [15], who introduced a robust learning framework for learning with disturbances. Similarly, [18] also studied learning in the presence of an adversary who might apply disturbances to the system. However, for the algorithms proposed in [15] no theoretical guarantees are known and there is only limited empirical evidence. Another recent work on robust reinforcement learning is [14], where the authors propose an online algorithm for a setting in which certain transitions are stochastic and the others are adversarial, and the devised algorithm ensures low regret.

For the case of reinforcement learning with large MDPs using function approximations, theoretical guarantees for most TD-learning based algorithms are only known for linear architectures [3]. Recent work by [7] extended the results of [23] and proved that a stochastic gradient descent algorithm minimizing the *mean squared projected Bellman error* (MSPBE) loss function converges to a local minimum, even for nonlinear architectures. However, these algorithms do not apply to robust MDPs; in this work we extend these algorithms to the robust setting.

## 2 Preliminaries

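We consider an infinite horizon Markov Decision Process (MDP) [20] with finite state space $\mathcal{X}$ of size $n$ and finite action space $\mathcal{A}$ of size $m$. At every time step $t$ the agent is in a state $i \in \mathcal{X}$ and can choose an action $a \in \mathcal{A}$, incurring a cost $c_t(i, a)$. We make the standard assumption that future cost is discounted, see e.g., [22], with a discount factor $\vartheta < 1$ applied to future costs, i.e., $c_t(i, a) := \vartheta^t c(i, a)$, where $c(i, a)$ is a fixed constant independent of the time step $t$ for $i \in \mathcal{X}$ and $a \in \mathcal{A}$. The states transition according to probability transition matrices $\{P^a\}_{a \in \mathcal{A}}$, which depend only on the last action $a$ taken. A *policy of the agent* is a sequence $\pi = (a_0, a_1, \dots)$, where every $a_t(i)$ corresponds to an action in $\mathcal{A}$ if the system is in state $i$ at time $t$. For every policy $\pi$, we have a corresponding value function $v_\pi \in \mathbb{R}^n$, where $v_\pi(i)$ for a state $i \in \mathcal{X}$ measures the expected cost of that state if the agent were to follow policy $\pi$. This can be expressed by the following recurrence relation

$$v_\pi(i) := c(i, a_0(i)) + \vartheta\, \mathbb{E}_{j}\left[v_\pi(j)\right],$$

where the expectation is over the next state $j$ drawn from row $i$ of $P^{a_0(i)}$.

The goal is to devise algorithms to learn an optimal policy $\pi^*$ that minimizes the expected total cost:

$$\pi^* := \arg\min_{\pi}\ \mathbb{E}\left[\sum_{t=0}^{\infty} \vartheta^t\, c(i_t, a_t(i_t))\right].$$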

In the robust case we will assume as in [17] that the transition matrices $P^a$ are not fixed and may come from some uncertainty region $\mathcal{P}^a$ and may be chosen adversarially by nature in future runs of the model. In this setting, [17] prove the following *robust* analogue of the *Bellman recursion*. A *policy of nature* is a sequence $\tau := (\mathbf{P}_0, \mathbf{P}_1, \dots)$ where every $\mathbf{P}_t(a)$ corresponds to a transition probability matrix chosen from $\mathcal{P}^a$. Let $\mathcal{T}$ denote the set of all such policies of nature. In other words, a policy of nature is a sequence of transition matrices that may be played by it in response to the actions of the agent. For any set $P \subseteq \mathbb{R}^n$ and vector $v \in \mathbb{R}^n$, let $\sigma_P(v) := \sup\{p^\top v \mid p \in P\}$ be the *support function* of the set $P$. For a state $i \in \mathcal{X}$, let $\mathcal{P}_i^a$ be the projection of $\mathcal{P}^a$ onto its $i$-th row.

The main shortcoming of this approach is that it does not generalize to the *model-free* case where the transition probabilities are not explicitly known but rather the agent can only sample states according to these probabilities. In the absence of this knowledge, we cannot compute the support functions of the uncertainty sets $\mathcal{P}_i^a$. On the other hand, it is often easy to have a *confidence region* $U_i^a$, e.g., a ball or an ellipsoid, corresponding to every state-action pair $(i, a)$ that quantifies our uncertainty in the simulation, with the uncertainty set $\mathcal{P}_i^a$ being the confidence region $U_i^a$ centered around the unknown simulator probabilities. Formally, we define the uncertainty sets corresponding to every state-action pair in the following fashion.

As a simple example, we have the ellipsoid $U_i^a := \{x \mid x^\top A_i^a x \le 1,\ \sum_{j \in \mathcal{X}} x_j = 0\}$ for some $n \times n$ psd matrix $A_i^a$, with the uncertainty set being $\mathcal{P}_i^a := \{x + p_i^a \mid x \in U_i^a\}$, where $p_i^a$ is the *unknown* simulator state transition probability vector with which the agent transitioned to a new state during training. Note that while it may be easy to come up with good descriptions of the confidence region $U_i^a$, the approach of [17] breaks down since we have no knowledge of $p_i^a$ and merely observe the new state $j$ sampled from this distribution. See Figure 1 for an illustration with the confidence regions being an $\ell_2$ ball of fixed radius $r$.

In the following sections we develop *robust versions* of Q-learning, SARSA, and TD-learning which are guaranteed to converge to an approximately optimal policy that is robust with respect to this confidence region. The robust versions of these iterative algorithms involve an additional linear optimization step over the set $U_i^a$, which in the case of $U_i^a = \{x \mid \|x\|_2 \le r\}$ simply corresponds to adding fixed noise during every update. In later sections we will extend this to the function approximation case, where we study linear architectures as well as nonlinear architectures; in the latter case we derive new stochastic gradient descent algorithms for computing approximately robust policies.

## 3 Robust exact dynamic programming algorithms

In this section we develop robust versions of exact dynamic programming algorithms such as Q-learning, SARSA, and TD-learning. These methods are suitable for small MDPs where the size $n$ of the state space is not too large. Note that the confidence region $U_i^a$ must also be constrained so that the uncertainty set $\mathcal{P}_i^a$ lies within the probability simplex $\Delta_n$, see Figure 1. However, since we do not have knowledge of the simulator probabilities $p_i^a$, we do not know how far away $p_i^a$ is from the boundary of $\Delta_n$, and so the algorithms will make use of a proxy confidence region $\widehat{U}_i^a$, where we drop the requirement that the resulting uncertainty set lie in $\Delta_n$, to compute the robust optimal policies. With a suitable choice of step lengths and discount factors we can prove convergence to an approximately optimal $U_i^a$-robust policy, where the approximation depends on the difference between the unconstrained proxy region $\widehat{U}_i^a$ and the true confidence region $U_i^a$. Below we give specific examples of possible choices for simple confidence regions.

Ellipsoid:

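Let $\{A_i^a\}_{i, a}$ be a sequence of $n \times n$ psd matrices. Then we can define the confidence region as

$$U_i^a := \left\{x \;\middle|\; x^\top A_i^a x \le 1,\ \sum_{j \in \mathcal{X}} x_j = 0,\ -p_{ij}^a \le x_j \le 1 - p_{ij}^a \ \ \forall j \in \mathcal{X}\right\}.$$

Note that $U_i^a$ has some additional linear constraints so that the uncertainty set $\mathcal{P}_i^a := \{x + p_i^a \mid x \in U_i^a\}$ lies inside the probability simplex $\Delta_n$. Since we do not know $p_i^a$, we will make use of the proxy confidence region $\widehat{U}_i^a := \{x \mid x^\top A_i^a x \le 1,\ \sum_{j \in \mathcal{X}} x_j = 0\}$. In particular, when $A_i^a = r^{-2} I_n$ for every $i, a$, this corresponds to a spherical confidence interval of $[-r, r]$ in every direction. In other words, each uncertainty set $\mathcal{P}_i^a$ is an $\ell_2$ ball of radius $r$.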
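The linear optimization step over an ellipsoidal proxy region has a closed form, which the following minimal sketch illustrates. It assumes the proxy region is $\{x \mid x^\top A x \le 1,\ \sum_j x_j = 0\}$ with $A$ symmetric positive definite; the function names `ellipsoid_support` and `l2_ball_support` are hypothetical and not taken from the paper.

```python
import numpy as np

def ellipsoid_support(v, A):
    """Support function max{v @ x : x @ A @ x <= 1, sum(x) == 0}.

    Assumption: A is symmetric positive definite (illustrative helper).
    """
    v = np.asarray(v, dtype=float)
    ones = np.ones_like(v)
    A_inv_v = np.linalg.solve(A, v)
    A_inv_1 = np.linalg.solve(A, ones)
    # Shift v along the all-ones direction so that the maximizer sums to zero.
    mu = (ones @ A_inv_v) / (ones @ A_inv_1)
    w = v - mu * ones
    value = w @ np.linalg.solve(A, w)
    return float(np.sqrt(max(value, 0.0)))

def l2_ball_support(v, radius):
    """Special case A = I / radius**2: radius times the norm of the centered v."""
    v = np.asarray(v, dtype=float)
    return float(radius * np.linalg.norm(v - v.mean()))

v = np.array([1.0, 2.0, 3.0])
r = 0.5
print(ellipsoid_support(v, np.eye(3) / r**2), l2_ball_support(v, r))
```

For a spherical region the support function depends on $v$ only through its deviation from its mean, so a constant value vector incurs no robustness penalty under this assumed form.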

Parallelepiped:

Let be a sequence of invertible matrices. Then we can define the confidence region as

As before, we will use the unconstrained parallelepiped, without the simplex constraints, as a proxy for the confidence region, since we do not have knowledge of the simulator probabilities. In particular, if the defining matrix is diagonal, then the proxy confidence region corresponds to a rectangle, and if every diagonal entry takes the same value, then every uncertainty set is a norm ball of fixed radius.

### 3.1 Robust Q-learning

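Let us recall the notion of a $Q$-factor of a state-action pair $(i, a)$ and a policy $\pi$, which in the non-robust setting is defined as

$$Q(i, a) := c(i, a) + \vartheta\, \mathbb{E}_{j \sim p_i^a}\left[v(j)\right],$$

where $v$ is the value function of the policy $\pi$, $\vartheta$ is the discount factor, and $p_i^a$ is the transition probability vector for the pair $(i, a)$. In other words, the $Q$-factor represents the expected cost if we start at state $i$, use the action $a$ and follow the policy $\pi$ subsequently. One may similarly define the *robust* $Q$-factors using a similar interpretation and the minimax characterization of Theorem ?. Let $Q^*$ denote the $Q$-factors of the optimal robust policy and let $v^* \in \mathbb{R}^n$ be its value function. Note that we may write the value function in terms of the $Q$-factors as $v^*(i) = \min_{a \in \mathcal{A}} Q^*(i, a)$. From Theorem ? we have the following expression for $Q^*$:

$$Q^*(i, a) = c(i, a) + \vartheta\, \sigma_{\mathcal{P}_i^a}(v^*) = c(i, a) + \vartheta\, \sigma_{U_i^a}(v^*) + \vartheta \sum_{j \in \mathcal{X}} p_{ij}^a \min_{a' \in \mathcal{A}} Q^*(j, a'),$$

where $\sigma_S(v) := \sup_{x \in S} x^\top v$ is the support function of a set $S$ and the second equality follows from Definition ?. For an estimate $Q_t$ of $Q^*$, let $v_t \in \mathbb{R}^n$ be its value vector, i.e., $v_t(i) := \min_{a \in \mathcal{A}} Q_t(i, a)$. The *robust $Q$-iteration* is defined as:

$$Q_t(i, a) := (1 - \gamma_t)\, Q_{t-1}(i, a) + \gamma_t \left(c(i, a) + \vartheta\, \sigma_{\widehat{U}_i^a}(v_{t-1}) + \vartheta \min_{a' \in \mathcal{A}} Q_{t-1}(j, a')\right),$$

where $\gamma_t$ is the step length and a state $j \in \mathcal{X}$ is sampled with the unknown transition probability $p_{ij}^a$ using the simulator. Note that the robust $Q$-iteration involves an additional linear optimization step to compute the support function $\sigma_{\widehat{U}_i^a}(v_{t-1})$ of $v_{t-1}$ over the proxy confidence region $\widehat{U}_i^a$. We will prove that iterating the robust $Q$-iteration converges to an approximately optimal policy. The following definition introduces the notion of an $\varepsilon$-optimal policy, see e.g., [5]. The error factor $\varepsilon$ is also referred to as the *amplification factor*. We will treat the $Q$-factors as a $|\mathcal{X}| \times |\mathcal{A}|$ matrix in the definition so that its norm is defined as usual.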
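The robust Q-iteration described above can be sketched for a tabular problem as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the proxy confidence region is assumed to be a sum-zero $\ell_2$ ball of a fixed radius for every state-action pair, and the names `l2_ball_support` and `robust_q_update` are hypothetical.

```python
import numpy as np

def l2_ball_support(v, radius):
    """Support function of {x : ||x||_2 <= radius, sum(x) = 0} (assumed proxy region)."""
    v = np.asarray(v, dtype=float)
    return float(radius * np.linalg.norm(v - v.mean()))

def robust_q_update(Q, i, a, cost, j, step, discount, radius):
    """One robust Q-iteration step for the sampled transition (i, a) -> j.

    Returns an updated copy of Q; the input table is left unchanged.
    """
    Q = np.array(Q, dtype=float)
    v = Q.min(axis=1)  # value vector v(k) = min_a Q(k, a), since costs are minimized
    # Nominal target plus the extra support-function term over the proxy region.
    target = cost + discount * l2_ball_support(v, radius) + discount * v[j]
    Q[i, a] = (1.0 - step) * Q[i, a] + step * target
    return Q

Q0 = np.array([[1.0, 2.0], [3.0, 4.0]])
nominal = robust_q_update(Q0, i=0, a=1, cost=1.0, j=1, step=0.5, discount=0.9, radius=0.0)
robust = robust_q_update(Q0, i=0, a=1, cost=1.0, j=1, step=0.5, discount=0.9, radius=0.1)
print(nominal[0, 1], robust[0, 1])
```

Setting the radius to zero recovers the nominal Q-learning update, so the only difference between the two variants in this sketch is the added support-function term.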

The following simple lemma allows us to decompose the optimization of a linear function over the proxy uncertainty set $\widehat{\mathcal{P}}_i^a$ in terms of linear optimization over $\mathcal{P}_i^a$, $U_i^a$, and $\widehat{U}_i^a$.

Note that every point in is of the form for some and every point is of the form for some , and this correspondence is one to one by definition. For any vector and pairs of points and we have

Since equation holds for every , it follows that it also holds for so that

The following theorem proves that under a suitable choice of step lengths $\gamma_t$ and discount factor $\vartheta$, the robust $Q$-iteration converges to an $\varepsilon$-approximately optimal policy with respect to the confidence regions $U_i^a$.

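Let $\widehat{\mathcal{P}}_i^a$ be the proxy uncertainty set for state $i \in \mathcal{X}$ and action $a \in \mathcal{A}$, i.e., $\widehat{\mathcal{P}}_i^a := \{x + p_i^a \mid x \in \widehat{U}_i^a\}$. We denote the value function of $Q$ by $v$. Let us define the following operator $H$ mapping $Q$-factors to $Q$-factors as follows:

$$(HQ)(i, a) := c(i, a) + \vartheta\, \sigma_{\widehat{\mathcal{P}}_i^a}(v).$$

We will first show that a solution $Q$ to the equation $HQ = Q$ is an $\varepsilon$-optimal policy as in Definition ?, i.e., $\|Q - Q^*\| \le \varepsilon \|Q^*\|$.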

where we used Lemma ? to derive equation . Equation implies that . If then we are done since . Otherwise assume that and use the triangle inequality: . This implies that

from which it follows that under the assumption that as claimed. The -iteration of equation can then be reformulated in terms of the operator as

where where the expectation is over the states with the transition probability from state to state given by . Note that this is an example of a *stochastic approximation algorithm* as in [5] with noise parameter . Let denote the history of the algorithm until time . Note that by definition and the variance is bounded by

Thus the noise term satisfies the zero conditional mean and bounded variance assumption (Assumption 4.3 in [5]). Therefore it remains to show that the operator is a *contraction mapping* to argue that iterating equation converges to the optimal -factor . We will show that the operator is a contraction mapping with respect to the infinity norm . Let and be two different -vectors with value functions and . If is not necessarily the same as the unconstrained proxy set for some , then we need the discount factor to satisfy in order to ensure convergence. Intuitively, the discount factor should be small enough that the difference in the estimation due to the difference of the sets and converges to over time. In this case we show contraction for operator as follows

where we used Lemma ? with vector to derive equation and the fact that to conclude that . Therefore if , then it follows that the operator is a norm contraction and thus the robust -iteration of equation converges to a solution of which is an -approximately optimal policy for , as was proved before.

### 3.2 Robust SARSA

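Recall that the update rule of SARSA is similar to the update rule for Q-learning except that instead of choosing the action $a' = \arg\min_{a' \in \mathcal{A}} Q_{t-1}(j, a')$, we choose the action $a'$ where with probability $\delta$, the action $a'$ is chosen uniformly at random from $\mathcal{A}$ and with probability $1 - \delta$, we have $a' = \arg\min_{a' \in \mathcal{A}} Q_{t-1}(j, a')$. Therefore, it is easy to modify the robust $Q$-iteration to give us the *robust SARSA* updates:

$$Q_t(i, a) := (1 - \gamma_t)\, Q_{t-1}(i, a) + \gamma_t \left(c(i, a) + \vartheta\, \sigma_{\widehat{U}_i^a}(v_{t-1}) + \vartheta\, Q_{t-1}(j, a')\right),$$

where $j$ is the sampled next state, $\gamma_t$ is the step length, $\vartheta$ is the discount factor, $v_{t-1}(i) := \min_{a} Q_{t-1}(i, a)$, and $\sigma_{\widehat{U}_i^a}$ is the support function of the proxy confidence region.

In the exact dynamic programming setting, robust SARSA has the same convergence guarantees as robust Q-learning, which can be seen as a corollary of Theorem ?.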
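A robust SARSA step differs from the robust Q-learning step only in how the next action is selected. The sketch below assumes a sum-zero $\ell_2$ ball of fixed radius as the proxy confidence region and an exploration probability `delta`; all function names are hypothetical and chosen for illustration.

```python
import numpy as np

def l2_ball_support(v, radius):
    """Support function of {x : ||x||_2 <= radius, sum(x) = 0} (assumed proxy region)."""
    v = np.asarray(v, dtype=float)
    return float(radius * np.linalg.norm(v - v.mean()))

def exploratory_action(q_row, delta, rng):
    """With probability delta pick a uniformly random action, else the cost-minimizing one."""
    if rng.random() < delta:
        return int(rng.integers(len(q_row)))
    return int(np.argmin(q_row))

def robust_sarsa_update(Q, i, a, cost, j, a_next, step, discount, radius):
    """One robust SARSA step using the action a_next actually selected in state j."""
    Q = np.array(Q, dtype=float)
    v = Q.min(axis=1)  # value vector of the current Q estimate
    target = cost + discount * l2_ball_support(v, radius) + discount * Q[j, a_next]
    Q[i, a] = (1.0 - step) * Q[i, a] + step * target
    return Q

rng = np.random.default_rng(0)
Q0 = np.array([[1.0, 2.0], [3.0, 4.0]])
a_next = exploratory_action(Q0[1], delta=0.1, rng=rng)
Q1 = robust_sarsa_update(Q0, i=0, a=1, cost=1.0, j=1, a_next=a_next,
                         step=0.5, discount=0.9, radius=0.1)
print(a_next, Q1[0, 1])
```

With `delta = 0` the selected action is always the minimizer, and the step coincides with the robust Q-learning step in this sketch.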

### 3.3 Robust TD-learning

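Recall that TD-learning allows us to estimate the value function $v_\pi$ for a given policy $\pi$. In this section we will generalize the TD-learning algorithm to the robust case. The main idea behind TD-learning in the non-robust setting is the following Bellman equation

$$v_\pi(i) = \mathbb{E}_{j \sim p_i^{\pi(i)}}\left[c(i, \pi(i)) + \vartheta\, v_\pi(j)\right],$$

where $\vartheta$ is the discount factor and $p_i^{\pi(i)}$ is the transition probability vector from state $i$ under the action $\pi(i)$.

Consider a trajectory of the agent $(i_0, i_1, \dots)$, where $i_m$ denotes the state of the agent at time step $m$. For a time step $m$, define the *temporal difference* $d_m$ as

$$d_m := c(i_m, \pi(i_m)) + \vartheta\, v_\pi(i_{m+1}) - v_\pi(i_m).$$

Let $\lambda \in (0, 1)$. The recurrence relation for TD($\lambda$) may be written in terms of the temporal difference $d_m$ as

$$v_\pi(i_k) = \mathbb{E}\left[\sum_{m=k}^{\infty} (\vartheta\lambda)^{m-k}\, d_m\right] + v_\pi(i_k).$$

The corresponding Robbins-Monro stochastic approximation algorithm with step size $\gamma_t$ for this recurrence is

$$v_{t+1}(i_k) := v_t(i_k) + \gamma_t \sum_{m=k}^{\infty} (\vartheta\lambda)^{m-k}\, d_m.$$

A more general variant of the TD($\lambda$) iterations uses *eligibility coefficients* $z_m(i)$ for every state $i \in \mathcal{X}$ and temporal difference vector $d_m$ in the update

$$v_{t+1}(i) := v_t(i) + \gamma_t \sum_{m=0}^{\infty} z_m(i)\, d_m.$$

Let $i_m$ denote the state of the simulator at time step $m$. For the discounted case, there are two possibilities for the eligibility vectors $z_m(i)$ leading to two different TD($\lambda$) iterations: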

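1. The *every-visit* TD($\lambda$) method, where the eligibility coefficients are

   $$z_m(i) := \begin{cases} \vartheta\lambda\, z_{m-1}(i) & \text{if } i_m \neq i, \\ \vartheta\lambda\, z_{m-1}(i) + 1 & \text{if } i_m = i. \end{cases}$$

2. The *restart* TD($\lambda$) method, where the eligibility coefficients are

   $$z_m(i) := \begin{cases} \vartheta\lambda\, z_{m-1}(i) & \text{if } i_m \neq i, \\ 1 & \text{if } i_m = i. \end{cases}$$

Here $\vartheta$ is the discount factor, $\lambda$ is the TD parameter, $i_m$ is the state at time step $m$, and $z_{-1}(i) := 0$.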
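The two eligibility schemes can be compared on a short trajectory. The sketch below assumes the standard every-visit and restart recursions for discounted problems as described in [5] (decay by the product of discount factor and $\lambda$, then add one or reset to one on a visit); the function name `eligibility_coefficients` is hypothetical.

```python
def eligibility_coefficients(trajectory, state, discount, lam, method="every-visit"):
    """Eligibility coefficients z_m(state) along a trajectory of visited states.

    Assumed recursions: both methods decay z by discount * lam at every step;
    on a visit, every-visit adds 1 while restart resets z to 1.
    """
    if method not in ("every-visit", "restart"):
        raise ValueError("method must be 'every-visit' or 'restart'")
    decay = discount * lam
    z = 0.0
    coefficients = []
    for visited in trajectory:
        z = decay * z
        if visited == state:
            z = z + 1.0 if method == "every-visit" else 1.0
        coefficients.append(z)
    return coefficients

trajectory = [0, 1, 0]
print(eligibility_coefficients(trajectory, 0, discount=1.0, lam=0.5, method="every-visit"))
print(eligibility_coefficients(trajectory, 0, discount=1.0, lam=0.5, method="restart"))
```

The two methods agree until a state is revisited; at that point every-visit accumulates the decayed trace while restart discards it.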

We make the following assumptions about the eligibility coefficients that are sufficient for proof of convergence.

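Note that the eligibility coefficients of both the every-visit and restart TD($\lambda$) iterations satisfy Assumption ?. In the robust setting, we are interested in estimating the *robust value* of a policy $\pi$, which from Theorem ? we may express as

$$v_\pi(i) = c(i, \pi(i)) + \vartheta \max_{p \in \mathcal{P}_i^{\pi(i)}} \mathbb{E}_{j \sim p}\left[v_\pi(j)\right],$$

where the expectation is now computed over the probability vector $p$ chosen adversarially from the uncertainty region $\mathcal{P}_i^{\pi(i)}$. As in Section 3.1, we may decompose the inner maximization as

$$\max_{p \in \mathcal{P}_i^{\pi(i)}} \mathbb{E}_{j \sim p}\left[v_\pi(j)\right] = \sigma_{U_i^{\pi(i)}}(v_\pi) + \mathbb{E}_{j \sim p_i^{\pi(i)}}\left[v_\pi(j)\right],$$

where $\sigma_{U_i^{\pi(i)}}$ is the support function of the confidence region $U_i^{\pi(i)}$ and $p_i^{\pi(i)}$ is the transition probability of the agent during a simulation. For the remainder of this section, we will drop the subscript and just use $\mathbb{E}$ to denote expectation with respect to this transition probability $p_i^{\pi(i)}$.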
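The decomposition of the worst-case expectation into a support-function term plus a nominal expectation can be checked numerically. The sketch below uses a small finite set of perturbations as a stand-in for the confidence region; the specific numbers and the helper `support` are illustrative assumptions.

```python
import numpy as np

def support(points, v):
    """Support function of a finite set of points: the largest inner product with v."""
    return float(np.max(np.asarray(points) @ np.asarray(v)))

p = np.array([0.2, 0.5, 0.3])            # nominal (simulator) transition vector
U = np.array([[0.0, 0.0, 0.0],           # perturbations; each row sums to zero
              [0.1, -0.1, 0.0],
              [-0.1, 0.0, 0.1]])
v = np.array([1.0, 4.0, 2.0])            # some value vector

P = U + p                                 # uncertainty set centered at p
worst_case = support(P, v)                # max over p' in P of E_{j ~ p'}[v(j)]
decomposed = support(U, v) + p @ v        # support over U plus nominal expectation
print(worst_case, decomposed)
```

Because every point of the uncertainty set is the nominal vector plus a perturbation, the nominal expectation factors out of the maximization, which is why only the perturbation set needs to be optimized over.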

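Define a *simulation* to be a trajectory $(i_0, i_1, \dots, i_{N_t})$ of the agent, which is stopped according to a random *stopping time* $N_t$. Note that $N_t$ is a random variable for making stopping decisions that is not allowed to foresee the future. Let $\mathcal{F}_t$ denote the history of the algorithm up to the point where the $t$-th simulation is about to commence. Let $v_t$ be the estimate of the value function at the start of the $t$-th simulation, and let $(i_0, i_1, \dots, i_{N_t})$ be the trajectory of the agent during the $t$-th simulation. During training, we generate several simulations of the agent and update the estimate of the *robust* value function using the *robust temporal difference* $\widetilde{d}_m$, which is defined as

$$\widetilde{d}_m := d_m + \vartheta\, \sigma_{\widehat{U}_{i_m}^{\pi(i_m)}}(v_t),$$

where $\vartheta$ is the discount factor, $\sigma_{\widehat{U}_{i_m}^{\pi(i_m)}}$ is the support function of the proxy confidence region, and $d_m$ is the usual temporal difference defined as before

$$d_m := c(i_m, \pi(i_m)) + \vartheta\, v_t(i_{m+1}) - v_t(i_m).$$

The *robust* TD-update is now the usual TD-update, except that we use the *robust temporal difference* computed over the proxy confidence region:

$$v_{t+1}(i) := v_t(i) + \gamma_t \sum_{m=0}^{N_t - 1} z_m(i)\, \widetilde{d}_m,$$

where $\gamma_t$ is the step size and $z_m(i)$ are the eligibility coefficients.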
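The robust TD-update over one simulation can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes every-visit eligibility coefficients, a sum-zero $\ell_2$ ball of fixed radius as the proxy confidence region for every state, and hypothetical names `l2_ball_support` and `robust_td_update`.

```python
import numpy as np

def l2_ball_support(v, radius):
    """Support function of {x : ||x||_2 <= radius, sum(x) = 0} (assumed proxy region)."""
    v = np.asarray(v, dtype=float)
    return float(radius * np.linalg.norm(v - v.mean()))

def robust_td_update(v, states, costs, step, discount, lam, radius):
    """Update the value estimate from one simulated trajectory.

    states: visited states i_0, ..., i_N; costs[m]: cost incurred at step m.
    The estimate v is held fixed during the simulation and updated at the end.
    """
    v = np.asarray(v, dtype=float)
    sigma = l2_ball_support(v, radius)   # identical for all states under this assumption
    z = np.zeros_like(v)                 # eligibility coefficients, one per state
    increment = np.zeros_like(v)
    for m, cost in enumerate(costs):
        i, j = states[m], states[m + 1]
        d = cost + discount * v[j] - v[i]     # usual temporal difference
        d_robust = d + discount * sigma       # robust temporal difference
        z *= discount * lam                   # decay all traces
        z[i] += 1.0                           # every-visit increment
        increment += z * d_robust
    return v + step * increment

v0 = np.array([0.0, 1.0])
nominal = robust_td_update(v0, [0, 1, 0], [1.0, 1.0], step=0.1, discount=0.9, lam=0.5, radius=0.0)
robust = robust_td_update(v0, [0, 1, 0], [1.0, 1.0], step=0.1, discount=0.9, lam=0.5, radius=0.1)
print(nominal, robust)
```

Since costs are minimized, the robust estimate is never smaller than the nominal one in this sketch: the support-function term adds a nonnegative worst-case penalty to every temporal difference.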

We define an $\varepsilon$-approximate value function for a fixed policy $\pi$ in a way similar to the $\varepsilon$-optimal $Q$-factors as in Definition ?:

The following theorem guarantees convergence of the robust TD iteration to an approximate value function for $\pi$ under Assumption ?.

Let be the proxy uncertainty set for state and action as in the proof of Theorem ?, i.e., . Let be the set of time indices the simulation visits state . We define , so that we may write the update of equation as

Let us define the operator corresponding to the simulation as

We claim as in the proof of Theorem ? that a solution to must be an -approximation to . Define the operator with the proxy confidence regions replaced by the true ones, i.e.,

Note that for the *robust* value function since for every by Theorem ?. Finally by Lemma ? we have

for any vector , where the expectation is over the state . Thus for any solution to the equation , we have

where equation follows from equation . Therefore the solution to is an -approximation to for if as in the proof of Theorem ?. Note that the operator applied to the iterates is so that the update of equation is a *stochastic approximation algorithm* of the form

where and is a noise term with zero mean and is defined as

Note that by Lemma 5.1 of [5], the new step sizes satisfy and if the original step size satisfies the conditions and , since the conditions on the eligibility coefficients are unchanged. Note that the noise term also satisfies the bounded variance of Lemma 5.2 of [5] since any still specifies a distribution as .

Therefore, it remains to show that is a norm contraction with respect to the norm on . Let us define the operator as

and the expression so that . We will show that for some from which the contraction on follows because for any vector and the -optimal value function