oIRL: Robust Adversarial Inverse Reinforcement Learning with Temporally Extended Actions

Abstract

Explicit engineering of reward functions for given environments has been a major hindrance to reinforcement learning methods. While Inverse Reinforcement Learning (IRL) is a solution to recover reward functions from demonstrations only, these learned rewards are generally heavily entangled with the dynamics of the environment and therefore not portable or robust to changing environments. Modern adversarial methods have yielded some success in reducing reward entanglement in the IRL setting. In this work, we leverage one such method, Adversarial Inverse Reinforcement Learning (AIRL), to propose an algorithm that learns hierarchical disentangled rewards with a policy over options. We show that this method has the ability to learn generalizable policies and reward functions in complex transfer learning tasks, while yielding results in continuous control benchmarks that are comparable to those of the state-of-the-art methods.

1 Introduction

Reinforcement learning (RL) has been able to learn policies in complex environments, but it usually requires designing suitable reward functions for successful learning. Such reward engineering can be difficult, and poor design may lead to sub-optimal policies with unsafe behavior Amodei et al. (2016). Inverse Reinforcement Learning (IRL) Ng and Russell (2000); Abbeel and Ng (2004) can alleviate the need for such reward engineering by learning an expert's reward function from expert demonstrations.

IRL, however, comes with many difficulties, and the problem is ill-posed: for a given set of demonstrations, the number of optimal policies and corresponding rewards can be very large, especially for high-dimensional complex tasks. Moreover, many IRL algorithms learn reward functions that are heavily shaped by the environmental dynamics, and policies learned from such reward functions may not remain optimal under slight changes in the environment. Adversarial Inverse Reinforcement Learning (AIRL) Fu et al. (2018) generates more generalizable policies with disentangled reward functions that are invariant to the environmental dynamics: a reward function and a value (shaping) function are learned simultaneously so that the reward can be computed in a state-only manner. This setting is an instance of a transfer learning problem with changing dynamics, where the agent learns an optimal policy in one environment and then transfers it to an environment with different dynamics. A practical example of this transfer learning problem is teaching a robot to walk with one mechanical structure and then generalizing this knowledge to perform the task with differently sized structural components.

Other methods have been developed to learn from demonstrations in environments with differing dynamics and then exploit this knowledge when performing tasks in a novel environment. As complex tasks in different environments often arise from several reward functions, methods such as Maximum Entropy IRL and GAN-Guided Cost Learning (GAN-GCL) tend to overgeneralize Finn et al. (2016b). One way to mitigate this problem is to break a policy down into smaller option policies (policies over temporally extended actions) that solve different aspects of an overall task; this decomposition has been shown to yield more generalizable policies Sutton et al. (1999); Taylor and Stone (2009). Methods such as Option-Critic have implemented modern RL architectures with a policy over options and have shown improved policy generalization Bacon et al. (2017). OptionGAN Henderson et al. (2018) also proposed an IRL framework for a policy over options and shows some improvement in one-shot transfer learning tasks, but it does not recover disentangled rewards.

In this paper, we introduce Option-Inverse Reinforcement Learning (oIRL) to investigate transfer learning with options. Following AIRL, we propose an algorithm that learns joint reward-policy options, with each option policy having a reward that is disentangled from the environmental dynamics. The resulting policies are shown to be highly portable in transfer learning tasks with differences between environments. We evaluate this method on a variety of continuous control tasks in the OpenAI Gym environment using the MuJoCo simulator Brockman et al. (2016); Todorov et al. (2012) and on GridWorlds with MiniGrid Chevalier-Boisvert et al. (2018). Our method shows improvements in terms of reward on a variety of transfer learning tasks while still performing better than benchmark methods on standard continuous control tasks.

2 Preliminaries

Markov Decision Processes (MDPs) are defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is the set of actions available to the agent, $P(s' \mid s, a)$ is the transition kernel giving a probability over next states given the current state and action, $r(s, a)$ is a reward function and $\gamma \in [0, 1)$ is a discount factor. $s_t$ and $a_t$ denote respectively the state and action of the expert at time instant $t$. We define a policy $\pi(a \mid s)$ as the probability distribution over actions conditioned on the current state. A policy is modeled by a Gaussian distribution with parameters $\theta$. The value of a policy is defined as $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$, where $\mathbb{E}$ denotes the expectation. An agent follows a policy and receives reward from the environment. The state-action value function is $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$, and the advantage is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. Here $r(s_t, a_t)$ represents the one-step reward.

Options $\omega \in \Omega$ are defined as a triplet $(\pi_\Omega, \mathcal{I}_\omega, \beta_\omega)$, where $\pi_\Omega$ is a policy over options, $\mathcal{I}_\omega$ is the initiation set of states and $\beta_\omega$ is the termination function. The policy over options is defined by $\pi_\Omega(\omega \mid s)$. An option $\omega$ has a reward $r_\omega$ and an option policy $\pi_\omega(a \mid s)$.

The policy over options, the intra-option policies (one per option), the reward approximator, and the option termination probabilities are each parameterized separately.

In the one-step case, selecting an option using the policy over options can be viewed as a mixture of completely specialized experts. The overall policy can be written as $\pi(a \mid s) = \sum_{\omega \in \Omega} \pi_\Omega(\omega \mid s)\, \pi_\omega(a \mid s)$.
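As an illustration, the one-step mixture can be evaluated as in the following sketch (the function names and interfaces are illustrative, not from the paper's implementation):

import numpy as np

def mixture_action_prob(option_probs, option_policies, state, action):
    """One-step mixture policy: pi(a|s) = sum_w pi_Omega(w|s) * pi_w(a|s).

    option_probs: callable mapping a state to an array of P(omega | s).
    option_policies: list of callables, each pi_w(state, action) -> probability.
    """
    weights = np.asarray(option_probs(state))                              # pi_Omega(. | s)
    per_option = np.array([pi(state, action) for pi in option_policies])   # pi_w(a | s)
    return float(np.dot(weights, per_option))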

Disentangled Rewards are formally defined as follows: a reward function $r'$ is disentangled with respect to (w.r.t.) a ground-truth reward $r$ and a set of environmental dynamics $\mathcal{T}$ if, under all possible dynamics $T \in \mathcal{T}$, the optimal policy computed w.r.t. $r'$ is the same as the optimal policy computed w.r.t. the ground-truth reward $r$.

3 Related Work

Generative Adversarial Networks (GANs) learn a generator distribution $p_G$ and a discriminator $D$. They use a prior distribution $p_z(z)$ over input noise variables $z$. Given these noise variables, the mapping $G(z)$, represented by a neural network, is learned to map the noise variables to the data space. Another neural network, $D(x)$, learns to estimate the probability that a sample $x$ came from the data set rather than from the generator $G$.

In this two-player adversarial training procedure, $D$ is trained to maximize the probability of assigning the correct labels to data samples and generated samples, while $G$ is trained to minimize $\log(1 - D(G(z)))$, which drives it to generate samples that are more likely to fool the discriminator.
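A minimal sketch of these two objectives on a batch of discriminator outputs (illustrative only; d_real and d_fake are assumed to be arrays of probabilities in (0, 1)):

import numpy as np

def gan_losses(d_real, d_fake):
    """Per-batch GAN losses, as a sketch of the standard objective.

    d_real: D(x) on data samples; d_fake: D(G(z)) on generated samples.
    """
    d_loss = -np.mean(np.log(d_real) + np.log(1.0 - d_fake))  # D maximizes label accuracy
    g_loss = np.mean(np.log(1.0 - d_fake))                    # G minimizes log(1 - D(G(z)))
    return d_loss, g_loss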

Policy Gradient methods optimize a parameterized policy $\pi_\theta$ by gradient ascent. Given a discount factor $\gamma$, the objective to be optimized is $J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \gamma^t r(s_t, a_t)\right]$. Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a policy gradient method that builds on the policy gradient theorem, which states $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)\right]$. PPO has been adapted to the option-critic architecture (PPOC) (Klissarov et al., 2017).
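For reference, a sketch of PPO's clipped surrogate objective (to be maximized); the argument names are illustrative and this is not the PPOC implementation used later:

import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate objective on a batch of transitions.

    logp_new / logp_old: log pi_theta(a|s) under the new and old policies.
    advantages: advantage estimates A(s, a).
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))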

Inverse Reinforcement Learning (IRL) is a form of imitation learning in which the expert's reward is estimated from demonstrations and forward RL is then applied to this estimated reward to find the optimal policy. In contrast, Generative Adversarial Imitation Learning (GAIL) directly extracts optimal policies from expert demonstrations (Ho and Ermon, 2016), whereas IRL infers a reward function from expert demonstrations, which is then used to optimize a generator policy.

In IRL, an agent observes a set of state-action trajectories from an expert demonstrator. We let $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$ be the state-action trajectories of the expert, where $\tau_i = (s_0, a_0, \ldots, s_T, a_T)$. We wish to find the reward function $r_\theta$ given the set of demonstrations $\mathcal{D}$. It is assumed that the demonstrations are drawn from the optimal policy $\pi^*(a \mid s)$. The Maximum Likelihood Estimation (MLE) objective of the IRL problem is therefore:

$\max_\theta \; \mathbb{E}_{\tau \sim \mathcal{D}}\left[\log p_\theta(\tau)\right]$ (1)

with $p_\theta(\tau) \propto p(s_0) \prod_{t=0}^{T} p(s_{t+1} \mid s_t, a_t)\, e^{\gamma^t r_\theta(s_t, a_t)}$.

Adversarial Inverse Reinforcement Learning (AIRL) is based on GAN-Guided Cost Learning (Finn et al., 2016a), which casts the MLE objective as a Generative Adversarial Network (GAN)  (Goodfellow et al., 2014) optimization problem over trajectories. In AIRL (Fu et al., 2018), the discriminator probability is evaluated using the state-action pairs from the generator (agent), as given by

$D_\theta(s, a) = \dfrac{\exp\left(f_\theta(s, a)\right)}{\exp\left(f_\theta(s, a)\right) + \pi(a \mid s)}$ (2)

The agent tries to maximize $\hat{r}(s, a) = \log D_\theta(s, a) - \log\left(1 - D_\theta(s, a)\right)$, where $f_\theta$ is a learned function and $\pi(a \mid s)$ is precomputed by the current policy. This formulation is similar to GAIL but with a recoverable reward function, since the GAIL discriminator outputs 0.5 for all states and actions at optimality. The discriminator function is then formulated as $f_{\theta, \phi}(s, a, s') = g_\theta(s, a) + \gamma h_\phi(s') - h_\phi(s)$, given a shaping function $h_\phi$ and a reward approximator $g_\theta$. Under deterministic dynamics, it is shown in AIRL that with a state-only reward approximator $g_\theta(s)$, the recovered reward is invariant to the transition dynamics and is therefore disentangled.
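A small sketch of the AIRL discriminator and the recovered reward in log space, assuming f_value and log_pi are scalars or NumPy arrays supplied by the learned function and the current policy:

import numpy as np

def airl_log_discriminator(f_value, log_pi):
    """AIRL-style discriminator D = exp(f) / (exp(f) + pi(a|s)), in log space.

    Returns (log D, log(1 - D)) using a stable log-sum-exp."""
    m = np.maximum(f_value, log_pi)
    log_z = m + np.log(np.exp(f_value - m) + np.exp(log_pi - m))  # log(exp(f) + pi)
    return f_value - log_z, log_pi - log_z

def airl_reward(f_value, log_pi):
    """Recovered reward log D - log(1 - D), which simplifies to f - log pi."""
    log_d, log_1md = airl_log_discriminator(f_value, log_pi)
    return log_d - log_1md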

Hierarchical Inverse Reinforcement Learning learns policies with high-level, temporally extended actions using IRL. OptionGAN (Henderson et al., 2018) provides an adversarial IRL objective function for the discriminator with a policy over options. It is formulated with regularization terms on the mixture of experts so that the experts converge to options. The discriminator objective in OptionGAN takes state-only input and is formulated as:

where
(3)

Directed-Info GAIL (Sharma et al., 2019) implements GAIL in a policy-over-options framework.

Work such as (Krishnan et al., 2016) addresses the hierarchical problem of segmenting expert demonstration transitions by analyzing changes in local linearity w.r.t. a kernel function. It has been suggested that decomposing the reward function alone is not enough (Henderson et al., 2018). Other works learn a latent dimension along with the policy for this task (Hausman et al., 2017; Wang et al., 2017); in these formulations, the latent structure is encoded in an unsupervised manner so that the desired latent variable does not need to be provided. Our work parallels many hierarchical IRL methods but recovers robust, disentangled rewards.

4 MLE Objective for IRL Over Options

Let $\tau_E$ be an expert trajectory of state-action pairs. Denote by $\tau_i$ a novice trajectory generated by the policy over options of the generator at iteration $i$.

Given a trajectory of state-action pairs, we first define an option transition probability given a state and an option. Similar transition probabilities conditioned on state, action or option information are defined in Appendix A.

(4)

We can similarly define a discounted return recursively. This return for the policy over options is based on the probabilities of terminating or continuing the option policies, given a reward approximator for the state-action reward.

(5)

The next option is selected according to the policy over options. The expressions for all relevant discounted returns appearing in the analysis are given in Appendix B. A suitable parameterization of the discounted return can be found by maximizing the causal entropy w.r.t. the reward parameters. We then have, for a trajectory with $T$ time-steps:

(6)
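A sketch of the option-level backup underlying such a discounted return, in the option-critic style assumed here (Q, beta and pi_over_options are illustrative stand-ins, with Q indexed as Q[state][option]):

import numpy as np

def option_value_backup(r_hat, s_next, omega, Q, beta, pi_over_options, gamma=0.99):
    """One-step backup for the return of a policy over options: with probability
    (1 - beta) the current option continues, otherwise a new option is drawn
    from the policy over options. r_hat is the estimated one-step reward."""
    continue_value = (1.0 - beta(s_next, omega)) * Q[s_next][omega]
    switch_value = beta(s_next, omega) * float(np.dot(pi_over_options(s_next), Q[s_next]))
    return r_hat + gamma * (continue_value + switch_value)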

4.1 MLE Derivative

Similar to Fu et al. (2018) and Finn et al. (2016a), we define the MLE objective for the generator as

(7)

Note that we may or may not know the option trajectories in our expert demonstrations; when they are unknown, they are estimated according to the policy over options. The gradient of (7) w.r.t. the reward parameters is derived in Appendix B.

We define the state-action marginal at time $t$, which allows us to examine the trajectory from step $t$ onward, as is done similarly in (Fu et al., 2018). Consequently, we have

(8)

Since this distribution is difficult to draw samples from, we estimate it using an importance sampling distribution over the generator density. We then compute an importance sampling estimate with a mixture policy for each option, as follows.

We sample from a mixture policy of the generator's option policy and a density estimate trained on the demonstrations. We wish to minimize the Kullback–Leibler divergence Kullback and Leibler (1951) between the target trajectory distribution and this mixture in order to reduce the variance of the importance sampling estimate. Applying the aforementioned density estimates in (8), we can express the gradient of the MLE objective as follows:

(9)

where

(10)
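A sketch of the self-normalized importance weights used in such an estimate, with illustrative argument names (log densities of each sampled trajectory under the target distribution and under the mixture):

import numpy as np

def normalized_importance_weights(log_p_target, log_p_mixture):
    """Self-normalized importance weights for trajectories sampled from the mixture."""
    log_w = np.asarray(log_p_target) - np.asarray(log_p_mixture)
    log_w -= np.max(log_w)              # stabilize before exponentiating
    w = np.exp(log_w)
    return w / np.sum(w)                # weights sum to one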

5 Discriminator Objective

In this section we formulate the discriminator, parameterized by $\theta$, as the odds ratio between the policy and the exponentiated reward distribution for each option $\omega$. We have a discriminator $D_{\theta, \omega}$ for each option and a sample generator option policy $\pi_\omega$, defined as follows:

$D_{\theta, \omega}(s, a) = \dfrac{\exp\left(f_{\theta, \omega}(s, a)\right)}{\exp\left(f_{\theta, \omega}(s, a)\right) + \pi_\omega(a \mid s)}$ (11)

5.1 Recursive Loss Formulation

The discriminator is trained by minimizing the cross-entropy loss between expert demonstrations and generated examples assuming we have the same number of options in the generated and expert trajectories. We define the loss function as follows:

(12)

The parameterized total loss for the entire trajectory can be expressed recursively, by taking expectations over the next options and states, as follows:

(13)
(14)
(15)
(16)

The agent wishes to minimize this total loss to find its optimal policy.
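A sketch of how such a recursive, truncated cross-entropy loss could be computed; D, expert_sample, novice_sample and step_env are hypothetical callables, and the truncation depth and discounting are assumptions rather than the paper's exact choices:

import numpy as np

def recursive_discriminator_loss(s, omega, depth, D, expert_sample, novice_sample,
                                 step_env, gamma=0.99, max_depth=5):
    """Per-option binary cross-entropy at the current step plus the discounted
    expected loss of a simulated continuation, truncated after max_depth steps
    or at a terminal state."""
    s_e, a_e = expert_sample(s, omega)     # expert transition for this option
    s_n, a_n = novice_sample(s, omega)     # generated transition
    loss = -(np.log(D(s_e, a_e, omega)) + np.log(1.0 - D(s_n, a_n, omega)))
    s_next, omega_next, done = step_env(s, omega)
    if done or depth >= max_depth:
        return loss
    return loss + gamma * recursive_discriminator_loss(
        s_next, omega_next, depth + 1, D, expert_sample, novice_sample,
        step_env, gamma, max_depth)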

5.2 Optimization Criteria

For a given option $\omega$, define the reward function $\hat{r}_\omega$, which is to be maximized. We then write a negative discriminator loss to turn our loss minimization problem into a maximization problem, as follows:

(17)

We use a mixture of expert and novice samples as observations in our gradient. We then take the derivative of the negative discriminator loss as,

(18)

We can multiply the numerator and denominator of the fraction in the mixture expectation by the state marginal, which allows us to derive an importance sampling distribution in our loss,

(19)

The gradient of this parameterized reward function corresponds to the negative of the discriminator's objective:

(20)

See Appendix C for the detailed derivations of the terms appearing in (20). Substituting (20) into (19), one can see that the derivative of the MLE objective, given by (9) and (10), has the same form as the derivative of the discriminator objective, given by (19) and (20).

6 Learning Disentangled State-only Rewards with Options

In this section, we provide our main algorithm for learning robust rewards with options. Similar to AIRL, we implement our algorithm with a discriminator update that considers the rollouts of a policy over options. We perform this update with $(s, a, s')$ triplets and a discriminator function of the form given in (21). This allows us to formulate the discriminator with state-only rewards in terms of option-value function estimates, from which we can compute an option-advantage estimate. Since the reward function only requires the state, we learn a reward function and corresponding policy that are disentangled from the environmental transition dynamics.

$D_{\theta, \phi, \omega}(s, a, s') = \dfrac{\exp\left(f_{\theta, \phi, \omega}(s, s')\right)}{\exp\left(f_{\theta, \phi, \omega}(s, s')\right) + \pi_\omega(a \mid s)}$ (21)

where $f_{\theta, \phi, \omega}(s, s') = g_{\theta, \omega}(s) + \gamma h_{\phi, \omega}(s') - h_{\phi, \omega}(s)$, with $g_{\theta, \omega}$ a state-only reward approximator and $h_{\phi, \omega}$ a shaping (value) term for option $\omega$.

Our discriminator model must learn a parameterization of the reward function and the value function for each option, given the total loss function in (37). These parameterized models are learned with multi-layer perceptrons. For each option, the termination functions and option policies are learned using PPOC.
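A sketch of the state-only discriminator logit and the corresponding option-advantage estimate, assuming per-option callables g_omega and h_omega for the reward and shaping terms:

def state_only_f(g_omega, h_omega, s, s_next, gamma=0.99):
    """f(s, s') = g(s) + gamma * h(s') - h(s): a state-only reward approximator
    plus a potential-based shaping term (the option value estimate)."""
    return g_omega(s) + gamma * h_omega(s_next) - h_omega(s)

def option_advantage_estimate(g_omega, h_omega, s, s_next, log_pi, gamma=0.99):
    """The discriminator logit f minus log pi_omega(a|s) acts as an
    entropy-regularized advantage estimate for the option."""
    return state_only_f(g_omega, h_omega, s, s_next, gamma) - log_pi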

6.1 The main algorithm: Option-Adversarial Inverse Reinforcement Learning (oIRL)

Our main algorithm, oIRL, is given by Algorithm 1. Here, we iteratively train a discriminator from expert and novice sampled trajectories using the derived discriminator objective. This allows us to obtain reward function estimates for each option. We then use any policy optimization method for a policy over options given these estimated rewards.

The discriminator input can also be given in the state-only format described in (21). It is important to note that, to compute the recursive loss, we recursively simulate a trajectory for a finite number of steps (and return if the state is terminal). We show the adversarial architecture of this algorithm in Appendix D.

0:  Input: expert demonstrations, number of options, number of training iterations
1:  Initialize the policy over options, the intra-option policies, the termination functions and a discriminator for each option
2:  for each training iteration do
3:     Sample novice trajectories with the current policy over options
4:     Initialize the total recursive discriminator loss to zero
5:     for each sampled state-option pair do
6:        Draw matching expert and novice samples for the active option
7:        if not terminal state then
8:           Compute the cross-entropy loss of the active option's discriminator on the expert and novice samples
9:           Simulate a short continuation of the trajectory, continuing or terminating the option
10:          Recursively accumulate the discounted expected loss of the continuation (for a finite number of steps)
11:          Add these terms to the total recursive loss
12:          Estimate the option reward from the discriminator output
13:        end if
14:     end for
15:     Update the discriminator parameters by minimizing the total recursive loss, yielding a reward estimate for each option
16:     Update the intra-option policies, termination functions and policy over options with PPOC using the estimated rewards
17:  end for
Algorithm 1 IRL Over Options with Robust Rewards (oIRL)
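In Python-like form, the loop above can be summarized as follows; every argument is an illustrative placeholder rather than the released implementation:

def train_oirl(expert_demos, env, n_options, n_iters,
               sample_rollouts, discriminators, option_agent, ppoc_update):
    """High-level sketch: sample novice rollouts with the policy over options,
    fit one discriminator per option, read off per-option reward estimates,
    and optimize the option policies on those rewards with PPOC."""
    for _ in range(n_iters):
        rollouts = sample_rollouts(env, option_agent)           # (s, a, s', omega) tuples
        for omega in range(n_options):
            # Expert options, if unlabeled, are estimated with the current policy over options.
            expert_w = [t for t in expert_demos if t.option == omega]
            novice_w = [t for t in rollouts if t.option == omega]
            discriminators[omega].fit(expert_w, novice_w)       # cross-entropy / recursive loss
        rewards = [discriminators[omega].reward_fn for omega in range(n_options)]
        option_agent = ppoc_update(option_agent, rollouts, rewards)
    return option_agent, discriminators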

7 Convergence Analysis

In this section we explain the gist of the analysis of convergence of oIRL. The detailed proofs can be found in Appendix E and F.

We first show that the actual reward function is recovered (up to a constant) by the reward estimators: for each option's reward estimator, the estimate differs from the ground-truth reward by a finite constant. Using this fact, together with the Cauchy-Schwarz inequality in the sup-norm, we prove that the TD-error update is a contraction, i.e.,

(22)

In order to prove asymptotic convergence to the optimal option-value function, we use this contraction argument to show that the learned option values converge to the optimum by establishing the following inequality:

(23)
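For reference, a generic sup-norm contraction argument of this shape (with illustrative constants, not the exact statement proved in Appendices E and F) reads:

\[
\| \mathcal{T} Q_1 - \mathcal{T} Q_2 \|_{\infty} \le \gamma \, \| Q_1 - Q_2 \|_{\infty}
\quad \Longrightarrow \quad
\| Q_k - Q^{*} \|_{\infty} \le \gamma^{k} \, \| Q_0 - Q^{*} \|_{\infty} \xrightarrow{\; k \to \infty \;} 0 .
\]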

8 Experiments

oIRL learns disentangled reward functions for each option policy, which facilitates policy generalizability and is instrumental in transfer learning.

Transfer learning can be described as using information learned by solving one problem and then applying it to a different but related problem. In the RL sense, it means taking a policy trained on one environment and then using the policy to solve a similar task in a different previously unseen environment.

We run experiments in different environments to address the following questions:

  • Does learning a policy over options with the AIRL framework improve policy generalization and reward robustness in transfer learning tasks where the environmental dynamics are manipulated?

  • Can the policy over options framework match or exceed benchmarks for imitation learning on complex continuous control tasks?

To answer these questions, we compare our model against AIRL (the current state of the art for transfer learning) on a transfer task in which we learn in an ant environment and then modify the physical structure of the ant, and we also compare our method on various benchmark IRL continuous control tasks. We wish to see whether learning disentangled rewards for sub-tasks through the options framework yields more portable policies.

We train a policy with each baseline method and with our method on these expert demonstrations for 500 time steps in the gait environments and 500 time steps in the hierarchical ones. We then take the trained policy (the parameterized distribution), run it in the transfer environments, and observe the reward obtained. This procedure is called a direct policy transfer.
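A sketch of the direct policy transfer protocol, assuming the classic Gym step/reset API and a hypothetical policy.act interface:

import gym

def direct_policy_transfer(policy, env_name, n_episodes=100):
    """Evaluate a frozen policy in a (possibly modified) environment and
    report the mean episode reward: no further training or adaptation."""
    env = gym.make(env_name)
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy.act(obs)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)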

8.1 Gait Transfer Learning Environments

For the transfer learning tasks we use Transfer Environments for MuJoCo Chu and Arnold (2018), a set of Gym environments for studying potential improvements in transfer learning tasks. The task involves an ant agent that optimizes a gait to crawl sideways across the landscape. The expert demonstrations are obtained from the optimal policy in the basic Ant environment. We disable the ant in two ways to obtain two transfer learning tasks. In the Big Ant task, the length of all legs is doubled, without adding extra joints. The Amputated Ant task shortens a single leg to disable it. These transfer tasks require learning a true, disentangled reward for walking sideways rather than directly imitating the gait movements and learning a reward specific to them. These manipulations are shown in Figure 4.

(a) Ant environment
(b) Big Ant environment
(c) Amputated Ant environment
Figure 4: MuJoCo Ant Gait transfer learning task environments. When the ant is disabled, it must position itself correctly to crawl forward. This requires a different initial policy than the original environment where the ant must only crawl sideways.

Table 1 shows the results in terms of reward achieved for the ant gait transfer tasks. In both experiments, our algorithm performs better than AIRL. Note that the ground-truth result is obtained with PPO after 2 million iterations (and is therefore much less sample-efficient than IRL).


Method              Big Ant    Amputated Ant
AIRL (Primitive)      -11.6            134.3
2 Options oIRL          4.7            122.6
4 Options oIRL         -1.7            167.1
Ground Truth          142.9            335.4

Table 1: The mean reward obtained (higher is better) over 100 runs for the Gait transfer learning tasks. We also show the results of PPO optimizing the ground-truth reward.

8.2 Maze Transfer Learning Tasks

We also create transfer learning environments in a 2D Maze environment with lava blockades. The goal of the agent is to go through the opening in a row of lava cells and reach a goal on the other end. For the transfer learning task, we train the agent in an environment where the "crossing" requires the agent to go through the middle (LavaCrossing-M), and the policy is then directly transferred to a GridWorld of the same size where the crossing is at the right end of the room (LavaCrossing-R). An additional task changes the position of a blockade in a maze (FlowerMaze-R to FlowerMaze-T). These environments are shown in Figure 9. We can think of two sub-tasks in this environment: going to the lava crossing and then going to the goal.

In all of these environments, the rewards are sparse. The agent receives a non-zero reward only upon completing the mission, and the magnitude of the reward decreases with the number of steps taken: it depends on the length of the successful episode and on the maximum number of steps allowed for completing the episode, which differs for each mission.
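A sketch of a reward of this form; the 0.9 scaling follows MiniGrid's default convention and is an assumption, since the exact constant is not stated here:

def minigrid_sparse_reward(step_count, max_steps, success):
    """Sparse reward: zero on failure, otherwise decreasing in episode length."""
    if not success:
        return 0.0
    return 1.0 - 0.9 * (step_count / max_steps)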

(a) LavaCrossing-M MiniGrid Env
(b) LavaCrossing-R MiniGrid Env
(c) FlowerMaze-R MiniGrid Env
(d) FlowerMaze-T MiniGrid Env
Figure 9: The MiniGrid transfer learning task set 1. Here the policy is trained on (a or c) using our method and the baseline methods and then transferred to be used on environment (b or d). The green cell is the goal.

We show the mean reward over 10 runs using direct policy transfers on these environments in Table 2. The 4-option oIRL achieves the highest reward on the LavaCrossing task. The FlowerMaze task is quite difficult, with most algorithms obtaining very low rewards; options still yield a large improvement.


Method              LavaCrossing    FlowerMaze
AIRL (Primitive)            0.64          0.11
2 Options oIRL              0.67          0.20
4 Options oIRL              0.81          0.23
Ground Truth                1.00          1.00

Table 2: The mean reward obtained (higher is better) over 10 runs for the Maze transfer learning tasks. We also show the results of PPO optimizing the ground-truth reward.

8.3 Hierarchical Transfer Learning Tasks

In addition, we adopt more complex hierarchical environments that require both locomotion and object interaction. In the first environment, called Ant-Push Duan et al. (2016), the ant must interact with a large movable block. To reach the goal, the ant must complete two successive processes: first, it must move to the left of the block, and then it must push the block right, which clears the path towards the target location. Episodes have a maximum of 500 timesteps. This can be thought of as a hierarchical task with moving to the left of the block, pushing the block to the right, and going to the goal as sub-goals.

We also utilize an Ant-Maze environment Florensa et al. (2017), a simple maze with a goal at the end. The agent receives a positive reward only if it reaches the goal and zero reward elsewhere. The ant must learn to make two turns in the maze: the first after moving down the hallway, followed by a turn towards the goal. Again, we see hierarchical behavior in this task: the sub-goals consist of learning to exit the first hall of the maze, making the turn, and finally going down the final hall towards the goal. The two complex environments are shown in Figure 12.

(a) Ant-Maze environment
(b) Ant-Push environment
Figure 12: MuJoCo Ant Complex Gait transfer learning task environments. We perform these transfer learning tasks with the Big Ant and the Amputated Ant.

Table 3 shows that oIRL performs better than AIRL on all of the complex hierarchical transfer tasks. In some tasks, such as the Maze environments, AIRL has few or no successful runs, while our method achieves a reasonably high reward. In the Big Ant Push task, AIRL achieves only a minimal reward, whereas oIRL succeeds at the task in some cases.


Method              Big Ant Maze    Amputated Ant Maze    Big Ant Push    Amputated Ant Push
AIRL (Primitive)            0.28                  0.14            0.02                  0.17
2 Options oIRL              0.62                  0.29            0.46                  0.34
4 Options oIRL              0.55                  0.31            0.55                  0.41
Ground Truth                0.96                  0.98            0.90                  0.86

Table 3: The mean reward obtained (higher is better) over 100 runs for the MuJoCo Ant Complex Gait transfer learning tasks. We also show the results of PPO optimizing the ground-truth reward.
(a) Ant
(b) Half Cheetah
(c) Walker
Figure 16: MuJoCo Continuous control locomotion tasks showing the mean reward (higher is better) achieved over 500 iterations of the benchmark algorithms for 10 random seeds. The shaded area represents the standard deviation.

8.4 MuJoCo Continuous Control Benchmarks

We also test our algorithm on a number of robotic continuous control benchmark tasks. These tasks do not involve transfer.

We show the plots of the average reward for each training iteration in Figure 16. Achieving a higher reward in fewer iterations is better in these experiments. We examine the Ant, Half Cheetah and Walker MuJoCo gait/locomotion tasks and run each experiment with 10 random seeds. The results are quite similar across the benchmarks, and using a policy over options shows reasonable improvements in each task.

9 Discussion

This work presents Option-Inverse Reinforcement Learning (oIRL), the first hierarchical IRL algorithm with disentangled rewards. We validate oIRL on a wide variety of tasks, including transfer learning tasks, locomotion tasks, complex hierarchical transfer RL environments and GridWorld transfer navigation tasks and compare our results with the state-of-the-art algorithm. Combining options with a disentangled IRL framework results in highly portable policies. Our empirical studies show clear and significant improvements for transfer learning. The algorithm is also shown to perform well in continuous control benchmark tasks.

For future work, we wish to test other sampling methods (e.g., Markov chain Monte Carlo) for estimating the implicit discriminator-generator pair's distribution in our GAN, such as the Metropolis-Hastings GAN Turner et al. (2019). We also wish to investigate methods that reduce the computational complexity of computing the recursive loss function, which requires simulating short trajectories, and that lower its variance. Analyzing our algorithm in physical robotic tests on tasks that require multiple sub-tasks would be an interesting direction for future research.

Appendix A Option Transition Probabilities

It is useful to redefine the transition probabilities in terms of options. At each step there is an additional consideration: we can either continue following the policy of the current option, or terminate the option with some probability, sample a new option from the (state-dependent, stochastic) policy over options, and follow that option's policy. We have

(24)
(25)
(26)
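A sketch of how an option-level transition probability of this kind can be computed, with pi_w, P, beta and pi_over_options as illustrative callables:

def option_transition_prob(s, omega, s_next, omega_next, actions,
                           pi_w, P, beta, pi_over_options):
    """Marginalize over actions, then either continue the current option
    (probability 1 - beta) or terminate and re-sample from the policy over options."""
    p_state = sum(pi_w(omega, a, s) * P(s, a, s_next) for a in actions)
    stay = (1.0 - beta(omega, s_next)) * (1.0 if omega_next == omega else 0.0)
    switch = beta(omega, s_next) * pi_over_options(omega_next, s_next)
    return p_state * (stay + switch)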

Appendix B MLE Objective for IRL Over Options

We can define a discounted return recursively for a policy over options in a manner similar to the transition probabilities. This return is based on the probabilities of terminating or continuing the option policies, given a reward approximator for the state-action reward.

(27)

These formulations of the reward function account for option transition probabilities, including the probability of terminating the current option and therefore selecting a new one according to the policy over options.

With the next option selected according to the policy over options, we can parameterize the discounted return in the style of a maximum causal entropy RL problem with the following objective, where

(28)

MLE Derivative

We can write out the MLE objective for our generator. We may or may not know the option trajectories in our expert demonstrations, but they are estimated below according to the policy over options. This objective is defined similarly to Fu et al. (2018) and Finn et al. (2016a). The full derivation (with the generator's policy over options) is as follows:

(29)

The step from line 4 to line 5 follows by substituting the corresponding definition.

Now, taking the gradient of the MLE objective w.r.t. the reward parameters yields,

(30)