# Learning safe policies with expert guidance

###### Abstract

We propose a framework for ensuring safe behavior of a reinforcement learning agent when the reward function may be difficult to specify. In order to do this, we rely on the existence of demonstrations from expert policies, and we provide a theoretical framework for the agent to optimize in the space of rewards consistent with its existing knowledge. We propose two methods to solve the resulting optimization: an exact ellipsoid-based method and a method in the spirit of the "follow-the-perturbed-leader" algorithm. Our experiments demonstrate the behavior of our algorithm in both discrete and continuous problems. The trained agent safely avoids states with potential negative effects while imitating the behavior of the expert in the other states.

## 1 Introduction

In Reinforcement Learning (RL), agent behavior is driven by an objective function defined through the specification of rewards. Misspecified rewards may lead to negative side effects [2]: the agent acts unpredictably in response to aspects of the environment that the designer overlooked, and may harm the environment or itself. As the environment gets richer and more complex, it becomes more challenging to specify and balance rewards for every one of its aspects. Yet if we want safety guarantees on the behavior of an RL-trained agent once it is deployed in the real world, it is crucial to have a learning algorithm that is robust to such misspecification.

We assume that the agent has some knowledge about the reward function either through past experience or demonstrations from experts. The goal is to choose a robust/safe policy that achieves high reward with respect to any reward function that is consistent with the agent’s knowledge. We formulate this as a maxmin learning problem where the agent chooses a policy and an adversary chooses a reward function that is consistent with the agent’s current knowledge and minimizes the agent’s reward. The goal of the agent is to learn a policy that maximizes the worst possible reward.

We assume that the reward functions are linear in some feature space. Our formulation has two appealing properties: (1) it allows us to combine demonstrations from multiple experts even though they may disagree with each other; and (2) the training environment/MDP in which the experts operate need not be the same as the testing environment/MDP where the agent will be deployed; our results hold as long as the testing and training MDPs share the same feature space. As an application, our algorithm can learn a maxmin robust policy in a new environment that contains a few features that are not present in the training environment. See our gridworld experiment in Section 5.

Our first result (Theorem 1) shows that given any algorithm that can find the optimal policy for an MDP in polynomial time, we can solve the maxmin learning problem exactly in polynomial time. Our algorithm is based on a seminal result from combinatorial optimization – the equivalence between separation and optimization [8, 13] – and the ellipsoid method. To understand the difficulty of our problem, it is useful to think of maxmin learning as a two-player zero-sum game between the agent and the adversary. The deterministic policies correspond to the pure strategies of the agent. The consistent reward functions we define in Section 3 form a convex set and the adversary’s pure strategies are the extreme points of this convex set. Unfortunately, both the agent and the adversary may have exponentially many pure strategies, which are hard to describe explicitly. This makes solving the two-player zero-sum game challenging. Using tools from combinatorial optimization, we manage to construct separation oracles for both the agent’s and the adversary’s set of policies using the MDP solver as a subroutine. With the separation oracles, we can solve the maxmin learning problem in polynomial time using the ellipsoid method.

Theorem 1 provides a polynomial time algorithm, but as it heavily relies on the ellipsoid method, it is computationally expensive to run in practice. We propose another algorithm (Algorithm 3) based on the online learning algorithm follow-the-perturbed-leader (FPL), and show that after iterations the algorithm computes a policy that is at most away from the true maxmin policy (Theorem 3). Moreover, each iteration of our algorithm runs in polynomial time. Notice that many other low-regret learning algorithms, such as the multiplicative weights update method (MWU), are not suitable for our problem. MWU requires explicitly maintaining a weight for every pure strategy and updating all of them in every iteration, resulting in an exponential time algorithm for our problem. Furthermore, we show that Algorithm 3 still has similar performance when we only have a fully polynomial time approximation scheme (FPTAS) for solving the MDP. The formal statement and proof are postponed to the supplemental material due to space limits.

##### Related Work

In the sense of using expert demonstrations, our work is related to inverse reinforcement learning and apprenticeship learning [17, 1, 20, 19]. In particular, the apprenticeship learning problem can also be formulated as a maxmin problem [20, 19]. Despite the seeming similarity, our maxmin learning problem aims to solve a completely different problem than apprenticeship learning. Here is a simple example: the expert is consistent with two reward functions and with . (Footnote 1: See Section 3 for the formal definition of consistent rewards. Intuitively, it means that the expert’s policy yields a reward that is within of the optimal possible reward.) The expert gets under and under , and the only alternative policy gives and . Apprenticeship learning will return the expert policy while ours will return the latter solution. In the worst case (under ), our maxmin policy has better guarantees and thus is more robust. Unlike apprenticeship learning/IRL, we do not want to mimic the experts or infer their rewards, but we want to produce a policy with robustness guarantees by leveraging their data. As a consequence, our results are applicable to settings where the training and testing environments are different (as discussed in the Introduction). Moreover, our formulation allows us to combine multiple expert demonstrations.

Inverse reward design [9] uses a proxy reward and infers the true reward by estimating its posterior. Then it uses risk-averse planning together with samples from the posterior in the testing environment to achieve safe exploration. Our approach achieves a similar goal without assuming any distribution over the rewards and is arguably more robust. We apply a single reward function to the whole MDP while they apply (maybe too pessimistically) per step/trajectory maxmin planning. Furthermore, our algorithm is guaranteed to find the maxmin solution in polynomial time, and can naturally accommodate multiple experts.

Robust Markov Decision Processes [18, 11] have addressed the problem of performing dynamic programming-style optimization in environments in which the transition probability matrix is uncertain. Lim, Xu & Mannor [15] have extended this idea to reinforcement learning methods. This body of work also uses min-max optimization, but because the optimization is with respect to worst-case transitions, this line of work results in very pessimistic policies. Our algorithmic approach and flavor of results are also different. Morimoto and Doya [16] have addressed a similar adversarial setup, but one in which the environment designs a worst-case disturbance to the dynamics of the agent, and have addressed this setup using control.

##### Paper Organization

We introduce the notations and define the maxmin learning problem in Section 2. We provide three different ways to define the set of consistent reward functions in Section 3, and present the ellipsoid-based exact algorithm and its analysis in Section 4.1. The FPL-based algorithm and its analysis can be found in Section 4.2. We provide experimental results in Section 5.

## 2 Preliminary

An MDP is a tuple , including a finite set of states, , a set of actions, , and transition probabilities, . is a discount factor, and is the distribution of initial states. The reward function instructs the learning process. We assume that the reward is a linear function of some vector of features : over states. That is for every state , where is called the reward weights of the MDP. Throughout the paper, we use to denote the true reward weights and we assume that is unknown to the agent. We use to denote the bit complexity of an object. In particular, we use to denote the bit complexity of an MDP , which is the number of bits required to represent the distribution of initial states, transition probabilities, the discount factor , and the rewards at all the states. We use the notation to denote an MDP without the reward function, and use to denote its bit complexity. We further assume that can be represented using at most bits for any state .

An agent selects the action according to a policy . The value of a policy under rewards is . It is expressed as the weights multiplied by the accumulated discounted feature value given a policy, which we define as .
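The relationship between a policy's value and its accumulated discounted feature vector can be illustrated on a toy problem. The following is a minimal sketch, not the authors' code: the two-state MDP, the names `feature_expectations` and `policy_value`, and the truncation horizon are all assumptions made for illustration, and the discount factor is assumed to be strictly below 1 so that truncation is harmless.

```python
def feature_expectations(P, phi, policy, init, gamma, horizon=2000):
    """Approximate the accumulated discounted feature vector of a
    deterministic policy by propagating the state distribution.

    P[s][a] is the list of next-state probabilities, phi[s] is the
    feature vector of state s, policy[s] is the action taken in s,
    and init is the initial state distribution (all hypothetical names).
    """
    n, k = len(init), len(phi[0])
    dist = list(init)
    mu = [0.0] * k
    discount = 1.0
    for _ in range(horizon):
        # Add the discounted expected features at the current step.
        for s in range(n):
            for j in range(k):
                mu[j] += discount * dist[s] * phi[s][j]
        # Push the state distribution one step forward under the policy.
        nxt = [0.0] * n
        for s in range(n):
            row = P[s][policy[s]]
            for t in range(n):
                nxt[t] += dist[s] * row[t]
        dist = nxt
        discount *= gamma
    return mu


def policy_value(w, mu):
    """Value of a policy under reward weights w: the inner product w . mu."""
    return sum(wi * mi for wi, mi in zip(w, mu))


# Toy MDP: action 0 stays in place, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
phi = [[1.0, 0.0], [0.0, 1.0]]   # one indicator feature per state
init = [1.0, 0.0]                # always start in state 0

mu_stay = feature_expectations(P, phi, [0, 0], init, gamma=0.5)
mu_move = feature_expectations(P, phi, [1, 0], init, gamma=0.5)
```

Because the value is linear in the feature vector, comparing two policies under any candidate weights only requires `mu_stay` and `mu_move`, not a fresh rollout per weight vector.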

##### MDP solver

We assume that there is an RL algorithm ALG that takes an MDP as input and outputs an optimal policy and its corresponding representation in the feature space. In particular, outputs such that and .
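To make the interface of such a solver concrete, here is a minimal sketch that uses tabular value iteration as a stand-in for ALG. This is an assumption for illustration only: the paper treats ALG as a black box, and the function name `solve_mdp`, the toy MDP, and the fixed number of sweeps are hypothetical. The solver takes reward weights and returns a greedy policy together with its discounted feature vector.

```python
def solve_mdp(P, phi, w, init, gamma, sweeps=500):
    """Stand-in for ALG: value iteration with state rewards w . phi(s).

    Returns (policy, mu), where policy[s] is a greedy action and mu is
    the discounted feature vector of that policy (truncated at `sweeps`).
    """
    n, k = len(P), len(w)
    num_actions = len(P[0])
    reward = [sum(w[j] * phi[s][j] for j in range(k)) for s in range(n)]
    V = [0.0] * n

    def q(s, a):
        # Expected next-state value of taking action a in state s.
        return sum(P[s][a][t] * V[t] for t in range(n))

    for _ in range(sweeps):
        V = [reward[s] + gamma * max(q(s, a) for a in range(num_actions))
             for s in range(n)]

    policy = [max(range(num_actions), key=lambda a: q(s, a))
              for s in range(n)]

    # Feature vector of the greedy policy, by propagating the state
    # distribution forward from the initial distribution.
    dist, mu, discount = list(init), [0.0] * k, 1.0
    for _ in range(sweeps):
        for s in range(n):
            for j in range(k):
                mu[j] += discount * dist[s] * phi[s][j]
        nxt = [0.0] * n
        for s in range(n):
            for t in range(n):
                nxt[t] += dist[s] * P[s][policy[s]][t]
        dist, discount = nxt, discount * gamma
    return policy, mu


# Toy MDP: action 0 stays in place, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
phi = [[1.0, 0.0], [0.0, 1.0]]
policy, mu = solve_mdp(P, phi, [0.0, 1.0], [1.0, 0.0], gamma=0.5)
```

With weights that reward only the second feature, the greedy policy leaves state 0 and then stays in state 1. Any exact planner or a learned approximate solver could take the place of value iteration here.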

##### Maxmin Learning

All weights that are consistent with the agent’s knowledge form a set . We will discuss several formal ways to define this set in Section 3. The goal of the agent is to learn a policy that maximizes the reward for any reward function that could be induced by weights in and adversarially chosen. More specifically, the max-min learning problem is , where is the polytope that contains the representations of all policies in the feature space, i.e. . WLOG, we assume that all weights lie in .

##### Separation Oracles

To perform maxmin learning, we often need to optimize linear functions over convex sets that are intersections of exponentially many halfspaces. Such optimization problems are usually intractable, but if the convex set admits a polynomial time separation oracle, then there exist polynomial time algorithms (e.g. the ellipsoid method) that optimize linear functions over it.

###### Definition 1.

(Separation Oracle) Let be a closed, convex subset of Euclidean space . Then a Separation Oracle for is an algorithm that takes as input a point and outputs “Yes” if , or a hyperplane such that for all , but . Note that because is closed and convex, such a hyperplane always exists whenever .
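For a set given by an explicit, short list of halfspaces, a separation oracle is just a loop over the constraints. The sketch below assumes each halfspace is stored as a pair `(a, b)` meaning that every point `y` of the set satisfies `a . y <= b`; the function name and the tolerance are illustrative choices, not part of the paper.

```python
def separation_oracle(halfspaces, x, tol=1e-12):
    """Return ("Yes", None) if x satisfies every constraint a . y <= b,
    otherwise ("No", (a, b)) for the first violated constraint, which
    is a hyperplane separating x from the set.
    """
    for a, b in halfspaces:
        if sum(ai * xi for ai, xi in zip(a, x)) > b + tol:
            return "No", (a, b)
    return "Yes", None


# The unit square [0, 1]^2 written as four halfspaces.
square = [([1.0, 0.0], 1.0), ([-1.0, 0.0], 0.0),
          ([0.0, 1.0], 1.0), ([0.0, -1.0], 0.0)]

inside = separation_oracle(square, [0.5, 0.5])
outside = separation_oracle(square, [2.0, 0.5])
```

The interesting cases in this paper are sets with too many constraints to list, where the oracle must find a violated constraint without enumerating them all.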

## 3 Consistent Reward Polytope

In this section, we discuss several ways to define the consistent reward polytope .

##### Explicit Description

We assume that the agent knows that the weights satisfy a set of explicitly defined linear inequalities of the form . For example, such an inequality can be learned by observing that a particular policy yields a reward that is larger or smaller than a certain threshold. (Footnote 2: Note that with a polynomial number of trajectories, one can apply standard Chernoff bounds to derive such inequalities that hold with high probability. In practice, this probability is often so close to 1 that the inequality can be treated as always true.)

##### Implicitly Specified by an Expert Policy

Usually, it may not be easy to obtain many explicit inequalities about the weights. Instead, we may have observed a policy used by an expert. We further assume that the expert’s policy has a reasonably good performance under the true rewards . Namely, ’s expected reward is only less than the optimal one. Let the expert’s feature vector . The set therefore contains all such that . It is not hard to verify that under this definition is a convex set. Even though explicitly specifying is extremely expensive as there are infinitely many , we can construct a polynomial time separation oracle (Algorithm 1). An alternative way to define is to assume that the expert policy can achieve of the optimal reward (assuming the final reward is positive). We can again design a polynomial time separation oracle similar to Algorithm 1.
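Algorithm 1 is not reproduced in this text, so the following is only a sketch of how such an oracle could work under the additive definition above. It assumes a hypothetical solver `solve_mdp(w)` that returns the feature vector of an optimal policy under weights `w`; here the solver is faked by enumerating a short list of candidate feature vectors. A queried weight vector is rejected when the optimal policy beats the expert by more than `eps`, and the difference of the two feature vectors gives the separating hyperplane.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_expert_oracle(solve_mdp, mu_expert, eps):
    """Build a separation oracle for the weights consistent with an
    expert whose value is within eps of optimal.

    On rejection it returns (normal, eps): every consistent weight
    vector w' satisfies normal . w' <= eps, while the query does not.
    """
    def oracle(w):
        mu_star = solve_mdp(w)
        normal = [m - e for m, e in zip(mu_star, mu_expert)]
        if dot(w, normal) > eps:
            return "No", (normal, eps)
        return "Yes", None
    return oracle


# Hypothetical stand-in for the MDP solver: best of two feature vectors.
candidates = [[1.0, 0.0], [0.0, 1.0]]

def toy_solver(w):
    return max(candidates, key=lambda mu: dot(w, mu))

oracle = make_expert_oracle(toy_solver, mu_expert=[1.0, 0.0], eps=0.1)
accepted = oracle([1.0, 0.0])
rejected = oracle([0.0, 1.0])
```

The returned hyperplane is valid because, for any consistent weights, the value of the particular policy found by the solver cannot exceed the optimal value, which is itself at most `eps` above the expert's value.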

##### Combining Multiple Experts

How can we combine demonstrations of experts that come from drastically different environments? Here is our model. For each environment , there is a separate MDP , and all the MDPs share the same underlying weight as they are all about completing the same task although in different environments. The -th expert’s policy is nearly optimal in . More specifically, we will assume that for expert , her policy is at most less than the optimal policy in . Therefore, each expert provides a set of constraints that any consistent reward needs to satisfy, and is the set of rewards that satisfy all constraints imposed by the experts. For each expert , we can design a separation oracle (similar to Algorithm 1) that only accepts weights that respect the constraints given by expert ’s policy. We can easily design a separation oracle for that only accepts weights that will be accepted by all separation oracles .
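Combining per-expert oracles is mechanically simple, as the following sketch suggests. The helper names are hypothetical, and simple halfspace oracles stand in for the per-expert oracles: the combined oracle accepts only if every component accepts, and otherwise forwards the first separating hyperplane it receives.

```python
def halfspace_oracle(a, b):
    """Oracle for the single constraint a . w <= b (a stand-in for a
    per-expert oracle)."""
    def oracle(w):
        if sum(ai * wi for ai, wi in zip(a, w)) > b:
            return "No", (a, b)
        return "Yes", None
    return oracle


def combine_oracles(oracles):
    """Separation oracle for the intersection of the sets accepted by
    the given oracles."""
    def combined(w):
        for oracle in oracles:
            answer, hyperplane = oracle(w)
            if answer == "No":
                # A hyperplane valid for one set is valid for the
                # intersection, so it can be returned as is.
                return answer, hyperplane
        return "Yes", None
    return combined


combined = combine_oracles([halfspace_oracle([1.0, 0.0], 1.0),
                            halfspace_oracle([0.0, 1.0], 1.0)])
ok = combined([0.5, 0.5])
bad = combined([0.5, 2.0])
```

The running time is the sum of the component running times, so the combined oracle stays polynomial whenever each expert's oracle is.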

From now on, we will not distinguish between the different ways to define and access the consistent reward polytope , but simply assume that we have a polynomial time separation oracle for it. All the algorithms we design in this paper only require access to this separation oracle. In Section 5, we will specify how the consistent reward polytope is defined for each experiment.

## 4 Maxmin Learning using an Exact MDP Solver

In this section, we show how to design maxmin learning algorithms. Our algorithm only interacts with the MDP through the MDP solver, which can be either model-based or model-free. Our first algorithm solves the maxmin learning problem exactly using the ellipsoid method. Despite the fact that the ellipsoid method has provable worst-case polynomial running time, it is known to be inefficient sometimes in practice. Our second algorithm is an efficient iterative method based on the online learning algorithm – follow-the-perturbed-leader (FPL).

### 4.1 Ellipsoid-Method-Based Solution

###### Theorem 1.

Given a polynomial time separation oracle for the consistent reward polytope and an exact polynomial time MDP solver ALG, we have a polynomial time algorithm such that for any MDP without the reward function , the algorithm computes the maxmin policy with respect to and .

The plan is to first solve the maxmin learning problem in the feature space then convert it back to the policy space. Solving the maxmin learning problem in the feature space is equivalent to solving the linear program in Figure 1.
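Figure 1 is not reproduced in this text. As a sketch in assumed notation (the symbols are ours: $\mu$ for a feature vector, $P_F$ for the polytope of feature vectors of policies, $P_R$ for the consistent reward polytope, and $z$ for the guaranteed value), a maxmin problem of this kind can be written as the following linear program, which may differ in detail from the one in the figure:

$$
\begin{aligned}
\max_{z,\ \mu}\quad & z \\
\text{s.t.}\quad & z \le w^\top \mu \quad \text{for all } w \in P_R, \\
& \mu \in P_F .
\end{aligned}
$$

At an optimum, $z$ equals $\min_{w \in P_R} w^\top \mu$, so maximizing $z$ is the same as maximizing the worst-case value of $\mu$.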

The challenges for solving the LP are that (i) it is not clear how to check whether lies in the polytope , and (ii) there are seemingly infinitely many constraints of the type as there are infinitely many . Next, we show that given an exact MDP solver ALG, we can design a polynomial time separation oracle for the set of feasible variables of LP 1. With this separation oracle, we can apply the ellipsoid method (see Theorem 2 in the supplementary material) to solve LP 1 in polynomial time.

The following theorem, reworded from [14, 8, 13], states that given a separation oracle of a convex polytope, the ellipsoid method can optimize any linear function over the convex polytope in polynomial time.

###### Theorem 2 (Ellipsoid Method).

([14, 8, 13]) Let be a -dimensional closed, convex subset of defined as the intersection of finitely many halfspaces, and be a poly-time separation oracle for . Then it is possible to find an element in for any (i.e. solve linear programs) in time polynomial in and using the ellipsoid method, if can be described implicitly using bits. (Footnote 3: We say a polytope can be described implicitly using bits if there exists a description of the polytope such that all constraints only use coefficients with bit complexity .)

Next, we design a separation oracle for polytope by invoking a seminal result from optimization – the equivalence between separation and optimization.

#### 4.1.1 Separation Oracle for Polytope

###### Lemma 1 (Separation Optimization).

[8, 13]
Consider any convex polytope and the following two problems:

– Linear Optimization: given a linear objective , compute

– Separation: given a point , decide that , or else find s.t. , .

If can be described implicitly using bits, then the separation problem is solvable in time for P if and only if the linear optimization problem is solvable in time.

It is not hard to see that if one can solve the separation problem, one can construct a separation oracle in polynomial time and apply the ellipsoid method to solve the linear optimization problem. The less obvious direction in the result above states that if one can solve the linear optimization problem, one can also use it to construct a separation oracle. The equivalence between these two problems turns out to have profound implications in combinatorial optimization and has enabled numerous polynomial time algorithms for many problems that are difficult to solve otherwise.

Our goal is to design a polynomial time separation oracle for the polytope . The key observation is that the linear optimization problem over polytope : is exactly the same as solving the MDP with reward function . Therefore, we can use the MDP solver to design a polynomial time separation oracle for .

###### Lemma 2.

Given access to an MDP solver ALG that solves any MDP in time polynomial in , we can design a separation oracle for that runs in time polynomial in , , , and the bit complexity of the input . (Footnote 4: Note that only depends on the bit complexity of , but not the actual model of such as the distributions of the initial states or the transition probabilities. We only require access to and an upper bound of .)

The proof is in Section A of the supplementary material.

With , we first design a polynomial time separation oracle for checking the feasible pairs in LP 1 (see Algorithm 2). With the separation oracle, we can solve LP 1 using the ellipsoid method. The last difficulty is that the optimal solution only gives us the maxmin feature vector instead of the corresponding maxmin policy. We use the following nice property of to convert the optimal solution in the feature space to the policy space. See Section A in the supplementary material for intuition behind Lemma 3.

###### Lemma 3.

The intuition behind Lemma 3 is that the separation oracle tries to search over all possible weights to find one to separate the query point from using the ellipsoid method. Along the way, it queries a set of weights (this is our set ) on ALG trying to find a separating weight such that . If such a separating weight is found, terminates immediately and outputs “No” together with the corresponding separating hyperplane. The says “Yes” only when it has searched over a polynomial number of weights and concludes that there is no possible weight to separate . The reason that can draw such a conclusion is due to the ellipsoid method. In particular, when says “Yes”, the correctness of the ellipsoid algorithm implies that is in the convex hull of all the extreme points of that have been outputted by the ALG.

#### 4.1.2 Exact max-min solution

We prove Theorem 1 in this section. First, we design a polynomial time separation oracle for checking the feasible pairs in LP 1 (see Algorithm 2).

With the separation oracle, we can solve LP 1 using the ellipsoid method. The last difficulty is that the optimal solution only gives us the maxmin feature vector instead of the corresponding maxmin policy. We address this issue using Lemma 3. See the proof of Theorem 1 for details.

Proof of Theorem 1: It is not hard to see that Algorithm 2 is a valid polynomial time separation oracle for the feasible pairs in LP 1. Hence, we can solve LP 1 in polynomial time with the ellipsoid method with access to Algorithm 2. Next, we show how to convert the optimal solution of LP 1 to the corresponding maxmin optimal policy . Here, we invoke Lemma 3. We query on and we record all weights that has queried the MDP solver ALG on. Let be all the queried weights. As is a polynomial time algorithm, is also polynomial in the input size. By Lemma 3, we know that is in the convex hull of , which means there exists a set of nonnegative numbers , such that and . Clearly, the discounted accumulated feature value of the randomized policy equals . We can compute these coefficients in polynomial time via linear programming, and is the maxmin policy.

### 4.2 Finding the Maxmin Policy using Follow the Perturbed Leader

The exact algorithm of Theorem 1 may be computationally expensive to run, as the separation oracle requires running the ellipsoid method to answer every query, and on top of that we need to run the ellipsoid method with queries to . In this section, we propose a simpler and faster algorithm that is based on the algorithm – follow-the-perturbed-leader (FPL) [12].

###### Theorem 3.

FPL is a classical online learning algorithm that solves a problem where a series of decisions need to be made. Each is from a possibly infinite set . The state at step is observed after the decision . The goal is to have the total reward not far from the best expert’s reward in hindsight, that is . The FPL algorithm guarantees that after rounds, the regret scales linearly in . This guarantee holds for both oblivious and adaptive adversaries, and the bound holds both in expectation and with high probability (see Theorem 4 in Section A of the supplementary material for the formal statement).

FPL falls into a large class of algorithms that are called low-regret algorithms, as the regret grows sub-linearly in . It is well known that low-regret algorithms can be used to solve two-player zero-sum games approximately. The maxmin problem we face here can also be modeled as a two-player zero-sum game. One player is the agent whose strategy is a policy , and the other player is the reward designer whose strategy is a weight . The agent’s payoff is the reward that it collects using policy , which is , and the designer’s payoff is . Finding the maxmin strategy for the agent is equivalent to finding the maxmin policy. One challenge here is that the numbers of strategies for both players are infinite. Even if we only consider the pure strategies which correspond to the extreme points of and , there are still exponentially many of them. Many low-regret algorithms such as multiplicative-weights-update require explicitly maintaining a distribution over the pure strategies and updating it in every iteration. In our case, these algorithms will take exponential time to finish just a single iteration. This is the reason why we favor the FPL algorithm, as it only requires finding the best policy given the past weights, which can be done by the MDP solver ALG. Next, we prove Theorem 3.
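The idea of running perturbed best responses for both players can be sketched on a tiny game. This is not the paper's Algorithm 3, which is not reproduced in this text: here both best-response oracles simply enumerate a short list of extreme points, whereas the paper uses the MDP solver for the agent and a separation-oracle-based optimization for the designer. The function name, the uniform perturbation, and the `scale` parameter are assumptions for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fpl_maxmin(policy_points, weight_points, rounds, scale, seed=0):
    """Both players follow the perturbed leader for `rounds` steps.

    policy_points: feature vectors of the agent's pure strategies.
    weight_points: extreme points of the designer's weight set.
    Returns the average feature vector played by the agent, which
    corresponds to a uniform mixture over the policies it played.
    """
    rng = random.Random(seed)
    k = len(policy_points[0])
    cum_w = [0.0] * k    # sum of the designer's past weights
    cum_mu = [0.0] * k   # sum of the agent's past feature vectors
    for _ in range(rounds):
        # Fresh random perturbations for each player in every round.
        p = [rng.uniform(0.0, scale) for _ in range(k)]
        q = [rng.uniform(0.0, scale) for _ in range(k)]
        agent_target = [c + x for c, x in zip(cum_w, p)]
        designer_target = [c + x for c, x in zip(cum_mu, q)]
        # Agent: best response to the perturbed history of weights.
        mu = max(policy_points, key=lambda m: dot(m, agent_target))
        # Designer: worst weights against the perturbed history of play.
        w = min(weight_points, key=lambda v: dot(v, designer_target))
        cum_w = [c + x for c, x in zip(cum_w, w)]
        cum_mu = [c + x for c, x in zip(cum_mu, mu)]
    return [c / rounds for c in cum_mu]


# Toy game: two policies, two candidate weight vectors, payoff w . mu.
points = [[1.0, 0.0], [0.0, 1.0]]
avg_mu = fpl_maxmin(points, points, rounds=4000, scale=40.0)
worst_case = min(dot(w, avg_mu) for w in points)
```

In this toy game each pure policy has worst-case value 0, while an even mixture guarantees 0.5, so `worst_case` shows how far the averaged play moves from the pure strategies toward the maxmin value. Each round only calls the two best-response oracles, which is the property that makes this style of algorithm attractive when the strategy sets are too large to enumerate.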

Proof of Theorem 3: We use to denote the sequence and to denote the sequence . First, notice that every realization of defines a deterministic adaptive adversary for the agent. In the setting of Algorithm 3, we can take to be , to be , and to be . By Theorem 4 (Section A of the supplementary material), we know that for all , for every realization of . Similarly, every realization of also defines a deterministic adaptive adversary for the designer, and by Theorem 4 , we know that for any realization of . Let . By the union bound, with probability at least over the randomness of and

(1) |

and

(2) |

Next, we argue that is an approximate maxmin policy.

The last inequality holds because on the LHS (line 2) the designer is choosing a fixed strategy , while on the RHS (line 3) the designer can choose the worst possible strategy for the agent. Therefore, if the agent uses policy , it guarantees expected reward . Finally, in every iteration , we query ALG once to compute and , and we use the ellipsoid method to find using queries to and poly regular computation steps. During each query, calls ALG. Thus, our result is a reduction from the maxmin learning problem to simply solving an MDP under given weights. Any improvement on ALG will also improve the running time of Algorithm 3. We discuss the empirical running time in Section C of the supplementary material.

We also show that a similar result holds even if we replace the exact MDP solver with an additive FPTAS . Our generalization to cases where we only have access to is postponed to Section B in the supplementary material.

## 5 Experiments

##### Gridworld

We use gridworlds in the first set of experiments. Each grid may have a different "terrain" type such that passing the grid will incur a certain reward. For each grid, a feature vector denotes the terrain type, and the true reward can be expressed as . The agent’s goal is to move to a goal grid with maximal reward under the worst possible weights that are consistent with the expert. In other words, the maxmin policy is a safe policy, as it avoids possible negative side effects [2]. In the experiments, we construct the expert policy that is optimal in a small (10×10) demonstration gridworld that contains a subset of the terrain types.

An example behavior is shown in Figure 2. There are 5 possible terrain types. The expert policy in Figure 2 (left) has only seen 4 terrain types. We compute the maxmin policy in the "real-world" MDP of a much larger size (50×50) with all 5 terrain types using Algorithm 3 with the reward polytope implicitly specified by the expert policy. Figure 2 (middle) shows that our maxmin policy avoids the red-colored terrain that was missing from the demonstration MDP. To facilitate observation, Figure 2 (right) shows the same behavior by an agent trained in a smaller MDP. Figure 3 compares the maxmin policy to a baseline. The baseline policy is computed in an MDP whose reward weights are the same as the demonstration MDP for the first four terrain types and the fifth terrain weight is chosen at random. Our maxmin policy is much safer than the baseline as it completely avoids the fifth terrain type. It also imitates the expert’s behavior by favoring the same terrain types.

We also implemented the maxmin method in gridworlds with a stochastic transition model. The maxmin policy (see Figure 8 in Section C of the supplementary material) is more conservative than in the deterministic model, and chooses paths that are further away from any unknown terrains. More details and computation time can be found in the supplementary material.

##### CartPole

Our next experiments are based on the classic control task of cartpole and the environment provided by OpenAI Gym [5]. While we can only solve the problem approximately using model-free learning methods, our experiments show that our FPL-based algorithm can learn a safe policy efficiently for a continuous task. Moreover, if provided with more expert policies, our maxmin learning method can easily accommodate and learn from multiple experts.

We modify the cartpole problem by adding two possible features to the environment, shown as the two question blocks in Figure 4; more details are in the supplementary material. The agent has no idea what consequences passing these two blocks may have. Instead of knowing the rewards associated with these two blocks, we have expert policies from two other related scenarios. The first expert policy (Expert A) performs well in scenario A where only the blue block to the left of the center is present, and the second expert policy (Expert B) performs well in scenario B where only the yellow block to the right of the center is present. The behavior of the expert policies in a default scenario (without any question blocks) and in scenarios A and B is shown in Figure 6. Compared with the default scenario, the expert policies in the other two scenarios clearly prefer to travel to the right side. Intuitively, it seems that the blue block incurs negative effects while the yellow block is either neutral or positive.

Now we train the agent in the presence of both question blocks. First, we provide the agent with the Expert A policy alone, and learn a maxmin policy. The maxmin policy’s behavior is shown in Figure 6 (top). It tries to avoid both question blocks since it observes that Expert A avoids the blue block and it has no knowledge of the yellow block. Then, we provide both Expert A and Expert B to the agent, and the resulting maxmin policy guides movement in a wider range extending to the right of the field as shown in Figure 6 (bottom). This time, our maxmin policy also learns from Expert B that the yellow block is not harmful.

The experiment demonstrates that our maxmin method works well with complex reinforcement learning tasks where only approximate MDP solvers are available.

## 6 Discussion

In this paper, we provided a theoretical treatment of the problem of reinforcement learning in the presence of misspecification of the agent’s reward function, by leveraging data provided by experts. The posed optimization can be solved exactly in polynomial time by using the ellipsoid method, but a more practical solution is provided by an algorithm which takes a follow-the-perturbed-leader approach. Our experiments illustrate that this approach can successfully learn robust policies from imperfect expert data, in both discrete and continuous environments. It will be interesting to see whether our maxmin formulation can be combined with other methods in RL such as hierarchical learning to produce robust solutions in larger problems.

## References

- [1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
- [2] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
- [3] Aharon Ben-Tal, Elad Hazan, Tomer Koren, and Shie Mannor. Oracle-based robust optimization via online learning. Operations Research, 63(3):628–638, 2015.
- [4] Stephen Boyd and Lieven Vandenberghe. Localization and cutting-plane methods. From Stanford EE 364b lecture notes, 2007.
- [5] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- [6] Yang Cai, Constantinos Daskalakis, and S. Matthew Weinberg. Reducing Revenue to Welfare Maximization : Approximation Algorithms and other Generalizations. In the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2013.
- [7] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- [8] Martin Grötschel, László Lovász, and Alexander Schrijver. The Ellipsoid Method and its Consequences in Combinatorial Optimization. Combinatorica, 1(2):169–197, 1981.
- [9] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6768–6777, 2017.
- [10] Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6(Apr):639–660, 2005.
- [11] Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
- [12] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
- [13] Richard M. Karp and Christos H. Papadimitriou. On linear characterizations of combinatorial optimization problems. SIAM J. Comput., 11(4):620–632, 1982.
- [14] Leonid G. Khachiyan. A Polynomial Algorithm in Linear Programming. Soviet Mathematics Doklady, 20(1):191–194, 1979.
- [15] Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 701–709, 2013.
- [16] Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural computation, 17(2):335–359, 2005.
- [17] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
- [18] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
- [19] Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th international conference on Machine learning, pages 1032–1039. ACM, 2008.
- [20] Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In Advances in neural information processing systems, pages 1449–1456, 2008.

## Supplementary Material

## Appendix A Missing Proofs from Section 4

Proof of Lemma 2: Lemma 4 shows that can be implicitly described using bits. Maximizing any linear function can be solved by querying ALG on MDP with reward function . Since MDP has bit complexity polynomial in , , , and , we can solve the linear optimization problem in time . By Lemma 1, we can solve the separation problem in time on any input . Hence, we can design a polynomial time separation oracle.

###### Lemma 4.

Polytope for any MDP without reward function can be implicitly described using bits.

###### Proof.

The following constraints explicitly describe all , where s correspond to the occupancy measure of some policy .

Our statement follows from the fact that all the coefficients in these constraints have bit complexity or . ∎

##### Follow-the-Perturbed-Leader

Kalai and Vempala [12] proposed the FPL algorithm and showed that, in expectation, its regret is small against any oblivious adversary. Hutter and Poland [10] showed that the same regret bound extends to settings with an adaptive adversary. To obtain a high-probability bound, one can construct a martingale that connects the actual reward and the expected reward obtained by the agent, and then apply the Hoeffding-Azuma inequality.
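As an illustration of the decision rule, the following is a minimal sketch of FPL over a finite decision set, assuming linear rewards and a perturbation drawn uniformly from a hypercube as in Kalai and Vempala's formulation. The names `fpl_decision`, `run_fpl`, and `scale` are hypothetical and do not come from the paper.

```python
import random


def fpl_decision(decisions, cumulative_state, scale, rng):
    """Return the decision maximizing <d, cumulative_state + p>.

    The perturbation p is drawn uniformly from [0, scale]^n (assumed form).
    """
    perturbed = [s + rng.uniform(0.0, scale) for s in cumulative_state]
    return max(decisions,
               key=lambda d: sum(di * pi for di, pi in zip(d, perturbed)))


def run_fpl(decisions, states, scale, seed=0):
    """Play FPL against a fixed state sequence and return the total reward."""
    rng = random.Random(seed)
    cumulative = [0.0] * len(states[0])
    total = 0.0
    for state in states:
        # Decide using only the states observed in earlier rounds.
        d = fpl_decision(decisions, cumulative, scale, rng)
        total += sum(di * si for di, si in zip(d, state))
        cumulative = [c + si for c, si in zip(cumulative, state)]
    return total
```

With two unit-vector decisions and a state sequence that always rewards the first one, the perturbed leader can only be wrong in the first few rounds, while the cumulative gap is still smaller than `scale`.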

###### Theorem 4 (Follow-the-Perturbed-Leader).

[12, 10, 7] Let be a sequence of decisions. Let be a state sequence chosen by an adaptive adversary, that is, can be selected based on all the previous states and all the previous decisions for every . If we let be , where is drawn uniformly from for some , then

is an upper bound of for all , is an upper bound of for all and , and is an upper bound of for all . Moreover, for all , with probability at least , the actual accumulative reward under any adaptive adversary satisfies,

## Appendix B Maxmin Learning using an Approximate MDP Solver

In the previous sections, we assumed access to an MDP solver ALG that solves any MDP optimally in time polynomial in . In practice, however, solving large MDPs exactly, e.g. continuous control problems, can be computationally expensive or infeasible. Our FPL-based algorithm also works when we can only solve MDPs approximately.

Suppose we are given access to an additive FPTAS for solving MDPs. More specifically, finds in time polynomial in a solution , such that . Notice that the weights of ’s reward function have -norm .

We face two challenges when we replace ALG with : (i) we can no longer find the best policy with respect to all the previous weights plus the perturbation in every iteration, and (ii) we no longer have a separation oracle for , as the (Algorithm 1) relies on the MDP solver when is implicitly specified by the expert’s policy. Challenge (i) is not hard to deal with, as the FPL algorithm is robust enough to work with only an approximate leader. Challenge (ii) is much more subtle. We design a new algorithm and use it as a proxy for the polytope . We call this new algorithm a weird separation oracle (following the terminology in [6]) because the points it may accept do not necessarily form a convex set, even though it does accept all points in . At first it may not be clear why such a weird separation oracle can help us. However, we prove that with this weird separation oracle alone, we can still compute an approximately minimizing weight vector in in every iteration (Step 5 of Algorithm 3). Combining this with our solution for challenge (i), we can still compute an approximately maxmin policy with essentially the same performance as in Algorithm 3.

###### Theorem 5.

If we replace the exact MDP solver ALG with an approximate solver in step 4 of Algorithm 3, then for any and any , with probability at least , Algorithm 3 finds a policy after rounds of iterations such that its expected reward under any weight from is at least . In every iteration, Algorithm 3 makes one query to and a polynomial number of queries to . In particular, for every query to , we first divide the input by then feed it to and ask for a policy that is at most worse than the optimal one.

The proof of Theorem 5 is similar to the proof of Theorem 3. We use the bounds provided by Lemma 5 instead of Theorem 4, and change the RHS in Equation (1) from to accordingly. The rest of the proof remains the same.

Assume we have a procedure for -approximating linear programs over the decision set such that for all ,

###### Lemma 5 (Follow the Approximate Perturbed Leader).

An astute reader may have noticed that in the analysis above, we used the same separation oracle as in Section 3. However, when the separation oracle for the reward polytope is implicitly specified by an expert policy, queries the MDP solver in step 1 of Algorithm 1. If we do not have an exact MDP solver ALG, it is not clear how we can define a separation oracle for polytope . We use Algorithm 4 as a proxy for polytope .

We call a weird separation oracle for the reward polytope for , because the set of that it will accept is not necessarily convex. For example, the following may happen. First, we query two points and that are close to each other. Both are accepted by , and it happens that and are both away from the optimal solutions. Now we query , and run . Luckily (or unfortunately) is close to optimal, and is rejected.

###### Lemma 6.

For any linear optimization problem, we can construct a polynomial time algorithm based on the ellipsoid method that queries , such that it finds a solution that is at least as good as the best solution in polytope , although our solution does not necessarily lie in .

Proof of Lemma 6: We only sketch the proof here. Solving a linear optimization problem can be converted into solving a sequence of feasibility problems by doing binary search on the objective value. We show that for any objective value , as long as there is a solution whose objective value , our algorithm also finds a solution such that . First, imagine we have a separation oracle for , and the ellipsoid method needs to run iterations to determine whether there is a solution in whose objective value is at least . The correctness of the ellipsoid method guarantees that if it has not found any solution after iterations, then the intersection of the halfspace and is empty. The reason is that if the intersection is not empty, it must have volume at least , and the ellipsoid method maintains an ellipsoid that contains the intersection of the halfspace and and shrinks the volume of the ellipsoid in every iteration. After iterations the ellipsoid already has volume less than .
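The reduction from optimization to feasibility mentioned at the start of the proof can be sketched as a bisection on the objective value. This is only an illustration: `is_feasible` is a hypothetical callback that answers whether some solution reaches a given objective value, and the bracket `[lo, hi]` and tolerance are assumptions.

```python
def maximize_by_bisection(is_feasible, lo, hi, tol=1e-6):
    """Approximate the largest value c in [lo, hi] with is_feasible(c) true.

    Assumes monotonicity: if value c is achievable, so is every smaller value.
    Returns None when even the lowest value lo is not achievable.
    """
    if not is_feasible(lo):
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if is_feasible(mid):
            lo = mid  # some solution reaches mid, so search higher
        else:
            hi = mid  # nothing reaches mid, so search lower
    return lo
```

Each call to `is_feasible` stands for one run of a feasibility routine, so the number of runs grows only logarithmically in the required precision.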

Our algorithm also runs the ellipsoid method for iterations. In each iteration, we first check the constraint ; if it is not satisfied, we output this constraint as the separating hyperplane. If it is satisfied, instead of querying the real separation oracle for , we query . If the answer is “YES”, we have found a solution such that . If the answer is “NO”, clearly this query point is not in , and the returned separating hyperplane contains the intersection of the halfspace and . Therefore, whenever our algorithm accepts a point, it must have objective value higher than . Otherwise, the shrinking ellipsoid still contains the intersection of the halfspace and . If our algorithm terminates after iterations without accepting any point, we know that the intersection between the halfspace and is empty, as the volume of the ellipsoid after iterations is already too small.
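The loop described in this proof can be sketched with the standard central-cut ellipsoid update. This is a simplified illustration under stated assumptions, not the paper's algorithm: `oracle` is a hypothetical callback that either accepts a point (returns `None`) or returns a separating direction, and `box_oracle` is a toy exact oracle standing in for the weird separation oracle.

```python
import math


def ellipsoid_search(oracle, objective, threshold, center, radius, max_iters):
    """Look for a point the oracle accepts with objective . x >= threshold.

    oracle(x) returns None to accept x, or a vector a such that every point y
    of the target set satisfies a . y <= a . x.
    """
    n = len(center)  # the central-cut update below needs n >= 2
    c = list(center)
    # Shape matrix of the starting ball {y : (y - c)^T P^{-1} (y - c) <= 1}.
    P = [[radius ** 2 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_iters):
        value = sum(o * x for o, x in zip(objective, c))
        if value < threshold:
            # Objective constraint violated: use it as the cut.
            a = [-o for o in objective]
        else:
            a = oracle(c)
            if a is None:
                return c  # accepted, and the objective constraint holds
        Pa = [sum(P[i][j] * a[j] for j in range(n)) for i in range(n)]
        quad = sum(a[i] * Pa[i] for i in range(n))
        if quad <= 0.0:
            return None  # ellipsoid has numerically collapsed
        g = [v / math.sqrt(quad) for v in Pa]
        c = [c[i] - g[i] / (n + 1) for i in range(n)]
        scale = n * n / (n * n - 1.0)
        P = [[scale * (P[i][j] - 2.0 / (n + 1) * g[i] * g[j])
              for j in range(n)] for i in range(n)]
    return None  # no point accepted within the iteration budget


def box_oracle(x, lo=2.0, hi=3.0):
    """Toy exact oracle for the box [lo, hi]^n, for illustration only."""
    for i, xi in enumerate(x):
        if xi > hi or xi < lo:
            a = [0.0] * len(x)
            a[i] = 1.0 if xi > hi else -1.0
            return a
    return None
```

Any point the routine returns has passed both the objective check and the oracle, which mirrors the argument that an accepted point must reach the target objective value.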

Consider the following three polytopes:

(i)

(ii)

(iii) .

###### Fact 1.

.

###### Fact 2.

only accepts points that are in .

###### Proof.

Suppose , then clearly . Hence, will not accept . ∎

###### Lemma 7.

For all in , is in .

Proof of Lemma 7: From the definition of , multiply both sides of the inequality by , and let , is in .

###### Theorem 6.

Now, we are ready to describe the algorithm using only access to .

## Appendix C Experiment Details

In every iteration of Algorithm 3 and Algorithm 5, step 5 computes a minimizing weight in . Instead of using the ellipsoid method to solve the LP, we use the analytic center cutting-plane method (see [4] for a brief overview) throughout our experiments. The method combines good practical performance with reasonable simplicity.

### C.1 Gridworld

The domain contains five types of terrain. Four terrain types are used in the demonstration gridworld where we construct the expert policy. We select the rewards for these four terrain types uniformly from , and the target has a reward of 10. The reward of each terrain type is deterministic. The demonstration MDP is composed uniformly of the four terrain types, 25% each. The fifth terrain type (colored red in Figure 2) is not present in the demonstration gridworld. The agent is trained in a "real-world" MDP that is composed uniformly of all five terrain types, 20% each. We select maps that are feasible, such that for all rewards in the consistent reward polytope, value iteration has a solution for the agent to reach the goal. We use feature vectors that indicate the terrain type of each state, choose a discount factor of 0.95, and use value iteration throughout the experiment. The consistency between the expert policy and the reward function is defined with .
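For concreteness, here is a minimal value iteration sketch on a small terrain grid with one-hot terrain features, so that the reward of a cell is the weight of its terrain type. The details are assumptions made for illustration: moves are deterministic, moving off the grid leaves the agent in place, the reward is collected on the cell entered, and the goal is absorbing with a reward of 10 that replaces its terrain reward. The function name and arguments are hypothetical.

```python
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def value_iteration(terrain, weights, goal, goal_reward=10.0, gamma=0.95,
                    tol=1e-9):
    """Return a dict mapping each cell (row, col) to its optimal value."""
    rows, cols = len(terrain), len(terrain[0])

    def reward(cell):
        # One-hot terrain features: the reward is the weight of the type.
        if cell == goal:
            return goal_reward
        return weights[terrain[cell[0]][cell[1]]]

    def move(cell, action):
        r, c = cell[0] + action[0], cell[1] + action[1]
        return (r, c) if 0 <= r < rows and 0 <= c < cols else cell

    values = {(r, c): 0.0 for r in range(rows) for c in range(cols)}
    while True:
        delta = 0.0
        for cell in values:
            if cell == goal:
                continue  # absorbing goal keeps value 0
            best = max(reward(nxt) + gamma * values[nxt]
                       for nxt in (move(cell, a) for a in ACTIONS))
            delta = max(delta, abs(best - values[cell]))
            values[cell] = best
        if delta < tol:
            return values
```

On a one-row grid `[[0, 0, 0]]` with terrain weight -1 and the goal in the last cell, the first cell has value -1 + 0.95 * 10 = 8.5.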

##### Deterministic transition model

In an MDP with a deterministic transition model, the agent moves in exactly the direction it chooses. We run FPL for iterations and use the average of policies output by the last iterations as the maxmin policy. Figure 3 shows that our maxmin policy is much safer than a baseline. The baseline policy is computed in an MDP whose reward weights are the same as the demonstration MDP for the first four terrain types and whose fifth terrain weight is chosen uniformly at random from [-1, 0]. The expert policy for the displayed results is constructed by computing the optimal policy in a demonstration MDP with rewards for the first four terrain types set as . The results are accumulated from 100 individual runs using the same expert policy. Examples of the baseline trajectories are shown in Figure 7.

##### Stochastic transition model

At each state, there is a 10% chance that the agent will go in a random direction regardless of the action it chooses. The agent receives rewards based on the state it actually lands in. We show in Figure 8 that, to mitigate the higher risk of traversing the unknown terrain type, our maxmin policy appears to be more conservative than in the deterministic case. Although Figure 9 shows that it cannot absolutely avoid the unknown terrain type due to the stochastic nature of the model, the percentage is much lower than for the baseline. The baseline was computed with the same reward weights as in the deterministic case.
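The slip model can be written as an explicit next-state distribution. The sketch below assumes that the random direction is uniform over all four moves (including the intended one) and that moving off the grid leaves the agent in place; neither detail is stated in the text, and the names are hypothetical.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step_distribution(state, action, shape, slip=0.1):
    """Return {next_state: probability} for one move on a rows x cols grid."""
    rows, cols = shape
    dist = {}
    for name, (dr, dc) in ACTIONS.items():
        # With probability `slip` the direction is uniform over all moves;
        # otherwise the chosen action is executed.
        prob = slip / len(ACTIONS) + ((1.0 - slip) if name == action else 0.0)
        r, c = state[0] + dr, state[1] + dc
        nxt = (r, c) if 0 <= r < rows and 0 <= c < cols else state
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist
```

Under these assumptions the intended cell is reached with probability 0.9 + 0.1 / 4 = 0.925 from an interior state, and each of the other three neighbors with probability 0.025.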

##### Computation Performance

In our gridworld experiment, the worst-case running time of ALG is , but experiments show a more benign runtime of . For a gridworld with 25 features, Algorithm 3 appears to converge after 325 iterations of FPL with a total runtime of 3324 seconds (average of 20 trials, ordinary desktop computer). Instead of using the ellipsoid method, we used the analytic center cutting-plane method, and the running time appears to scale in the order of .

### C.2 CartPole

We modify the classic CartPole task in the OpenAI Gym environment by adding features that may incur additional rewards. This is represented by the question blocks in Figure 4. The two question blocks correspond to feature indicators for the agent’s horizontal position in the range of and . We keep the same episode termination criteria for the pole angle and cart position as the original environment. An episode is considered to end without failure if the pole angle and cart position meet the criteria and the episode length is greater than 500. The agent receives a reward of for surviving every step.
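The position-based features can be sketched as interval indicators added to the per-step survival reward. The interval bounds, block weights, and survival reward are passed in as arguments; any concrete numbers used with this sketch are placeholders rather than the values from the experiments, and the function names are hypothetical.

```python
def block_features(x, blocks):
    """Indicator per question block: 1.0 if the cart position x lies in it."""
    return [1.0 if lo <= x <= hi else 0.0 for lo, hi in blocks]


def step_reward(x, blocks, block_weights, survival_reward):
    """Per-step reward: survival reward plus the weighted block indicators."""
    features = block_features(x, blocks)
    return survival_reward + sum(w * f for w, f in zip(block_weights, features))
```

For example, with placeholder blocks `[(-1.0, -0.5), (0.5, 1.0)]`, weights `[2.0, -1.0]`, and a survival reward of `1.0`, a cart at `x = -0.7` would receive `3.0` for that step.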

We use longer episodes than the original problem to allow more diverse movement, which also makes the task more challenging. During validation of a policy, we consider the task solved when the policy reaches a target average reward over 100 consecutive episodes with fewer than five failed episodes. The target average reward depends on the reward we assign for passing the question blocks. If each step spent at question block incurs reward of , the target average reward is set to be . For example, in , only the blue question block exists and it incurs reward of ; our expert policy passes the validation criterion by getting average reward higher than 400 over 100 consecutive episodes with fewer than 5 failed episodes. Indeed, our policy performs quite well by getting a reward greater than 450 in . In , only the yellow question block is present and it incurs reward of . passes the validation criterion with reward greater than 1000.
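The validation criterion can be checked with a sliding window over logged episodes. This is a sketch of one reading of the criterion, assuming that any window of 100 consecutive episodes suffices; `passes_validation` and its arguments are hypothetical names.

```python
def passes_validation(rewards, failed, target, window=100, max_failures=5):
    """Check for a run of `window` consecutive episodes that solves the task.

    rewards: total reward per episode; failed: one boolean per episode.
    A run qualifies when its mean reward exceeds `target` and it contains
    fewer than `max_failures` failed episodes.
    """
    for start in range(len(rewards) - window + 1):
        chunk = rewards[start:start + window]
        failures = sum(failed[start:start + window])
        if sum(chunk) / window > target and failures < max_failures:
            return True
    return False
```

With fewer than `window` logged episodes the function returns `False`, since no full run is available yet.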

The agent is in an MDP with both blue and yellow question blocks whose reward polytopes are implicitly defined by the expert policy. As the MDP solver, we use Q-learning with updates applied on minibatches of stored samples. Notice that for this problem, our MDP solver is not necessarily optimal. We computed maxmin policies when provided with different expert policies. The results in Figure 6 are from testing the maxmin policy for episodes.