Multi-task Maximum Entropy Inverse Reinforcement Learning
Multi-task Inverse Reinforcement Learning (IRL) is the problem of inferring multiple reward functions from expert demonstrations. Prior work, built on Bayesian IRL, is unable to scale to complex environments due to computational constraints. This paper contributes the first formulation of multi-task IRL in the more computationally efficient Maximum Causal Entropy (MCE) IRL framework. Experiments show our approach can perform one-shot imitation learning in a gridworld environment that single-task IRL algorithms require hundreds of demonstrations to solve. Furthermore, we outline how our formulation can be applied to state-of-the-art MCE IRL algorithms such as Guided Cost Learning. This extension, based on meta-learning, could enable multi-task IRL to be performed for the first time in high-dimensional, continuous state MDPs with unknown dynamics as commonly arise in robotics.
1 Introduction
Inverse reinforcement learning (IRL) is the task of determining the reward function that generated a set of trajectories: sequences of state-action pairs (Ng & Russell, 2000; Abbeel & Ng, 2004). It is a form of learning from demonstration that assumes the expert demonstrator follows an (approximately) optimal policy with respect to an unknown reward. Sample efficiency is a key design goal, since it is costly to elicit human demonstrations.
In practice, demonstrations are often generated from multiple reward functions. This situation naturally arises when the demonstrations are for different tasks, such as grasping different types of objects, as depicted in fig. 1. Less obviously, it also occurs when different individuals perform what is nominally the same task, reflecting individuals’ unique preferences and styles. In this paper, we assume demonstrations of the same task are assigned a common label.
A naive solution to multi-task IRL is to repeatedly apply a single-task IRL algorithm to the demonstrations of each task. However, this method requires that the number of samples increases proportionally with the number of tasks, which is prohibitive in many settings. Fortunately, the reward functions for related tasks are often similar, and exploiting this structure can enable greater sample efficiency.
Previous work on the multi-task IRL problem (Dimitrakakis & Rothkopf, 2011; Babeş-Vroman et al., 2011; Choi & Kim, 2012) builds on Bayesian IRL (Ramachandran & Amir, 2007). Unfortunately, no extant Bayesian IRL method scales to complex environments with high-dimensional, continuous state spaces, such as those found in robotics. By contrast, approaches based on maximum causal entropy show more promise (Ziebart et al., 2010). Although the original maximum causal entropy IRL algorithm is limited to discrete state spaces, recent extensions such as guided cost learning and adversarial IRL scale to challenging continuous control environments (Finn et al., 2016; Fu et al., 2018).
Our two main contributions in this paper are:
Regularized Maximum Causal Entropy (MCE). We present the first formulation of the multi-task IRL problem in the MCE framework. Our approach simply adds a regularization term to the loss, and therefore retains the computational efficiency of the original MCE IRL algorithm. We evaluate in a gridworld that takes hundreds of demonstrations for unmodified MCE IRL to solve. By contrast, our regularized variant recovers a reward that leads to a near-optimal policy after a single demonstration.
Meta-Learning Rewards. We outline preliminary work extending our approach to the function approximator setting used by guided cost learning, finding a connection with meta-learning. Since guided cost learning is a scalable approximation to maximum causal entropy IRL, the success of the former approach gives reason for optimism. However, we leave empirical investigation of this method to future work. To the best of our knowledge, meta-learning has never been applied to IRL in any previously published work.
2 Preliminaries and Single-Task IRL
A Markov Decision Process (MDP) is a tuple $M = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma, \mu, R)$ where $\mathcal{S}$ and $\mathcal{A}$ are sets of states and actions; $\mathcal{T}(s' \mid s, a)$ is the probability of transitioning to $s'$ from $s$ after taking action $a$; $\gamma$ is a discount factor; $\mu(s)$ is the probability of starting in $s$; and $R(s, a)$ is the reward upon taking action $a$ in state $s$. We write MDP\R to denote an MDP without a reward function.
In the single-task IRL problem, the IRL algorithm is given access to an MDP\R and demonstrations $\mathcal{D}$ from an (approximately) optimal policy. The goal is to recover a reward function $R$ that explains the demonstrations $\mathcal{D}$. Note this is an ill-posed problem: many reward functions $R$, including the constant zero reward function $R(s, a) = 0$, make the demonstrations optimal.
Bayesian IRL addresses this identification problem by inferring a posterior distribution $P(R \mid \mathcal{D})$ (Ramachandran & Amir, 2007). Although some probability mass will be placed on degenerate reward functions, for reasonable priors the majority of the probability will lie on more plausible explanations.
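By contrast, maximum causal entropy chooses a single reward function, using the principle of maximum entropy to select the least specific reward function that is still consistent with the demonstrations (Ziebart et al., 2008, 2010). It models the demonstrations as being sampled from:
$$\pi(a \mid s) = \exp\left(Q^{\mathrm{soft}}(s, a) - V^{\mathrm{soft}}(s)\right), \tag{1}$$
a stochastic expert policy that is noisily optimal for:
$$Q^{\mathrm{soft}}(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim \mathcal{T}(\cdot \mid s, a)}\left[V^{\mathrm{soft}}(s')\right], \qquad V^{\mathrm{soft}}(s) = \log \sum_{a \in \mathcal{A}} \exp Q^{\mathrm{soft}}(s, a),$$
where the expectation is over the next state $s'$ drawn from the transition dynamics $\mathcal{T}$. Note there can exist multiple solutions to these softmax Bellman equations (Asadi & Littman, 2017).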
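To make the planner concrete, the following is a minimal Python sketch of soft value iteration on a tabular MDP. It assumes the usual soft Bellman backup (a log-sum-exp over actions in place of a max) and a policy of the form $\exp(Q - V)$; the function names, the fixed number of sweeps and the two-state toy MDP are illustrative assumptions rather than details taken from the experiments.

```python
import math


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value_iteration(transitions, rewards, gamma, n_iters=200):
    """Run a fixed number of soft Bellman backups.

    transitions[s][a] is a list of (next_state, probability) pairs and
    rewards[s][a] is the reward for taking action a in state s.
    """
    n_states = len(rewards)
    v = [0.0] * n_states
    q = []
    for _ in range(n_iters):
        # Soft Q backup: immediate reward plus discounted expected soft value.
        q = [
            [
                rewards[s][a] + gamma * sum(p * v[s2] for s2, p in transitions[s][a])
                for a in range(len(rewards[s]))
            ]
            for s in range(n_states)
        ]
        # Soft value: log-sum-exp over actions instead of a hard max.
        v = [log_sum_exp(row) for row in q]
    # Stochastic policy: pi(a | s) = exp(Q(s, a) - V(s)).
    policy = [[math.exp(q_sa - v[s]) for q_sa in q[s]] for s in range(n_states)]
    return q, v, policy


# Hypothetical toy MDP: action 0 stays put, action 1 switches state.
# Staying in state 1 is the only rewarded state-action pair.
transitions = [
    [[(0, 1.0)], [(1, 1.0)]],
    [[(1, 1.0)], [(0, 1.0)]],
]
rewards = [[0.0, 0.0], [1.0, 0.0]]
q, v, policy = soft_value_iteration(transitions, rewards, gamma=0.9)
```

In this toy problem the resulting policy prefers moving towards, and then staying in, the rewarded state, while still assigning non-zero probability to every action.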
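To reduce the dimension of the problem, it is common to assume the reward function is linear in features over the state-action pairs:
$$R_\theta(s, a) = \theta^\top \phi(s, a).$$
Let the expert demonstration $\mathcal{D}$ consist of $m$ trajectories $\tau_1, \dots, \tau_m$, where $\tau = (s_0, a_0, \dots, s_T, a_T)$. For convenience, write:
$$\phi(\tau) = \sum_{t=0}^{T} \gamma^t \phi(s_t, a_t), \qquad \bar{\phi} = \frac{1}{m} \sum_{i=1}^{m} \phi(\tau_i).$$
Given a known feature map $\phi$, the IRL problem reduces to finding weights $\theta$.
A key insight behind maximum causal entropy IRL is that actions in the trajectory sequence depend causally on previous states and actions: i.e. $a_t$ may depend on $s_{0:t}$ and $a_{0:t-1}$, but not on states or actions that occur later in time. The causal log-likelihood of a trajectory $\tau$ is defined to be:
$$\mathcal{L}(\tau) = \sum_{t=0}^{T} \log \pi(a_t \mid s_{0:t}, a_{0:t-1}),$$
with causal entropy of a policy $\pi$ defined in terms of the causal log-likelihood of its trajectories:
$$H(\pi) = -\mathbb{E}_{\tau \sim \pi}\left[\mathcal{L}(\tau)\right].$$
Maximum causal likelihood estimation of $\theta$ given the expert demonstrations $\mathcal{D}$ is equivalent to maximizing the causal entropy of the stochastic policy $\pi$ subject to the constraint that its expected feature counts match those of the demonstrations:
$$\max_{\pi} H(\pi) \quad \text{subject to} \quad \mathbb{E}_{\tau \sim \pi}\left[\phi(\tau)\right] = \bar{\phi}.$$
Note this constraint guarantees $\pi$ attains the same (expected) reward as the expert demonstrations (Abbeel & Ng, 2004). Maximum causal entropy thus recovers reward weights that match the performance of the expert, while avoiding degeneracy by maximizing the diversity of the policy.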
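The feature-matching constraint suggests a simple gradient for the reward weights: the expert's mean discounted feature counts minus those of the current policy. The Python sketch below illustrates this under two assumptions that are ours rather than the original algorithm's: the policy's feature counts are estimated from sampled rollouts (the original MCE IRL algorithm can compute them exactly when the dynamics are known), and the feature map is a hypothetical one-hot encoding of the state.

```python
def discounted_feature_counts(trajectory, phi, gamma):
    """Sum of gamma^t * phi(s_t, a_t) over a non-empty list of (state, action) pairs."""
    total = [0.0] * len(phi(*trajectory[0]))
    for t, (s, a) in enumerate(trajectory):
        for j, x in enumerate(phi(s, a)):
            total[j] += (gamma ** t) * x
    return total


def mean_feature_counts(trajectories, phi, gamma):
    """Average the discounted feature counts over a set of trajectories."""
    counts = [discounted_feature_counts(tr, phi, gamma) for tr in trajectories]
    return [sum(col) / len(counts) for col in zip(*counts)]


def feature_matching_gradient(expert_counts, policy_counts):
    """Ascent direction for the reward weights: expert counts minus policy counts."""
    return [e - p for e, p in zip(expert_counts, policy_counts)]


def one_hot_state(s, a):
    """Hypothetical feature map: indicator of the state, ignoring the action."""
    return [1.0 if s == j else 0.0 for j in range(2)]


# Illustrative (made-up) trajectories over two states and a single action.
expert_trajs = [[(0, 0), (1, 0), (1, 0)]]
policy_trajs = [[(1, 0), (1, 0), (1, 0)]]
expert_counts = mean_feature_counts(expert_trajs, one_hot_state, gamma=0.5)
policy_counts = mean_feature_counts(policy_trajs, one_hot_state, gamma=0.5)
grad = feature_matching_gradient(expert_counts, policy_counts)
```

A positive gradient entry means the expert visits that feature more than the current policy does, so its reward weight should increase; the gradient vanishes exactly when the feature counts match.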
3 Methods for Multi-Task IRL
In multi-task IRL, the reward $R_i$ varies between $n$ MDPs $M_1, \dots, M_n$ with associated expert demonstrations $\mathcal{D}_1, \dots, \mathcal{D}_n$. If the reward functions are unrelated to each other, we cannot do better than repeated application of a single-task IRL algorithm. However, in practice similar tasks have reward functions with similar structure, enabling specialized multi-task IRL algorithms to accurately infer the reward with fewer demonstrations.
In the next section, we solve the multi-task IRL problem using the original maximum causal entropy IRL algorithm with an additional regularization term. To the best of our knowledge, this is the first published algorithm for multi-task IRL within the maximum entropy paradigm. Following this, we describe how our method can be applied to scalable approximations of maximum causal entropy IRL.
3.1 Regularized Maximum Causal Entropy
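In the multi-task setting, we must jointly infer reward weights $\theta_i$ that explain each demonstration $\mathcal{D}_i$. To make progress we must make some assumption on the relationship between different reward weights. A natural assumption is that the reward weights for most tasks lie close to the mean across all tasks, i.e. $\|\theta_i - \bar{\theta}\|_2$ should be small, where $\bar{\theta} = \frac{1}{n} \sum_{i=1}^{n} \theta_i$. This corresponds to a prior that $\theta_i$ is drawn from i.i.d. Gaussians with mean $\bar{\theta}$ and variance monotonic with $\lambda^{-1}$, where $\lambda$ is the regularization strength. In practice, we do not know $\bar{\theta}$, but we can estimate it by taking the mean of the current iterates for $\theta_i$. This results in a pleasingly simple inference procedure. The regularized loss is:
$$\ell_i(\theta_i) = -\mathcal{L}(\mathcal{D}_i; \theta_i) + \frac{\lambda}{2} \left\|\theta_i - \bar{\theta}\right\|_2^2,$$
where $\mathcal{L}(\mathcal{D}_i; \theta_i)$ is the causal log-likelihood of the demonstrations $\mathcal{D}_i$ under reward weights $\theta_i$.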
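The inference procedure can be sketched as a single gradient step, shown below in Python. The sketch assumes a quadratic penalty $\frac{\lambda}{2}\|\theta_i - \bar{\theta}\|_2^2$ with the mean $\bar{\theta}$ recomputed from the current iterates and held fixed during the step; the names `lam`, `lr` and `task_grads` (the per-task gradients of the unregularized loss) are hypothetical.

```python
def regularized_step(thetas, task_grads, lam, lr):
    """One descent step per task on: task loss + (lam / 2) * ||theta_i - mean||^2.

    thetas[i] holds the reward weights for task i and task_grads[i] the gradient
    of that task's unregularized loss. The mean of the current iterates stands in
    for the unknown prior mean and is treated as a constant within the step.
    """
    n = len(thetas)
    dim = len(thetas[0])
    mean = [sum(th[j] for th in thetas) / n for j in range(dim)]
    return [
        [th[j] - lr * (g[j] + lam * (th[j] - mean[j])) for j in range(dim)]
        for th, g in zip(thetas, task_grads)
    ]


# With zero task gradients, the penalty alone pulls both tasks towards their mean.
thetas = [[0.0, 0.0], [2.0, 4.0]]
zero_grads = [[0.0, 0.0], [0.0, 0.0]]
new_thetas = regularized_step(thetas, zero_grads, lam=1.0, lr=0.5)
```

Larger values of `lam` tie the tasks more tightly together, whereas `lam = 0` recovers independent single-task updates.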
3.2 Meta-Learning on Reward Networks
In the previous section, we saw how multi-task IRL can be incorporated directly into the Maximum Causal Entropy (MCE) framework. However, the original MCE IRL algorithm has two major limitations. First, it assumes the MDP’s dynamics are known, whereas in many applications (e.g. robotics) the dynamics are unknown and must also be learned. Second, it requires the practitioner to provide a feature mapping such that the resulting reward is linear. For many problems, finding these features may be the bulk of the problem, negating the benefit of IRL.
Both of these shortcomings are addressed by guided cost learning (Finn et al., 2016) and its successor adversarial IRL (Fu et al., 2018), scalable approximations of MCE IRL. Specifically, adversarial IRL uses a neural network to represent the reward as a function of states and actions, obviating the need to specify a feature map $\phi$. Furthermore, it can handle unknown transition dynamics since it estimates the loss gradient via sampling rather than direct computation, and so only requires access to a simulation of the environment for rollouts.
Naively, we could directly translate the regularization approach given in the previous section to this setting, applying it to the parameters $\theta$ of the neural network. However, regularizing the parameter space may not regularize the output space: small changes in some parameters may have a large effect on the predicted reward, while large changes in other parameters may have little effect.
A more promising approach is to meta-learn the reward network parameters $\theta$. To the best of our knowledge, meta-learning has never been applied to IRL in any prior published work, so it is unclear which meta-learning approach is best suited to this problem. We have selected Reptile (Nichol et al., 2018) as the basis for our initial experiments due to its computational efficiency, a key consideration given that IRL in complex environments is already computationally demanding. Moreover, Reptile attains state-of-the-art performance on few-shot supervised learning, the closest problem to multi-task IRL that meta-learning algorithms have been evaluated on.
Our method is described in algorithm 1. We seek to find an initialization $\theta$ for the reward network that can be quickly finetuned for any given task (by running adversarial IRL on demonstrations of that task). To achieve this, we repeatedly sample a task and run $k$ steps of adversarial IRL, starting from our current initialization $\theta$. The initialization is then updated along the line between the initialization and final iterate of adversarial IRL. Although this appears superficially similar to joint training, for $k > 1$ it is an approximation to first-order model-agnostic meta-learning (MAML) (Finn et al., 2017), a more principled but computationally expensive meta-learning algorithm.
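The outer loop described above can be sketched in a few lines of Python. This is only an illustration of the Reptile-style update: the inner update used here is a stand-in (a step towards a per-task target vector) rather than adversarial IRL, and the names `inner_update`, `outer_lr` and `n_outer` are hypothetical.

```python
import random


def reptile(init, tasks, inner_update, k, outer_lr, n_outer, seed=0):
    """Reptile-style meta-learning of an initialization.

    Each outer iteration samples a task, runs k inner updates starting from the
    current initialization, then moves the initialization part of the way along
    the line towards the final inner iterate.
    """
    rng = random.Random(seed)
    theta = list(init)
    for _ in range(n_outer):
        task = rng.choice(tasks)
        adapted = list(theta)
        for _ in range(k):
            adapted = inner_update(adapted, task)
        theta = [t + outer_lr * (a - t) for t, a in zip(theta, adapted)]
    return theta


def toy_inner_update(params, target):
    """Stand-in for one step of adversarial IRL: halve the distance to a task target."""
    return [p - 0.5 * (p - g) for p, g in zip(params, target)]


# Two hypothetical tasks whose ideal parameters are 0.0 and 2.0 respectively.
toy_tasks = [[0.0], [2.0]]
meta_init = reptile([10.0], toy_tasks, toy_inner_update, k=3, outer_lr=0.1, n_outer=500)
```

In this toy setting the learned initialization ends up between the two task optima, so that a few inner steps suffice to adapt to either task.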
Algorithm 1 cannot be applied verbatim since adversarial IRL jointly learns a reward function and a policy optimising that reward function. This is analogous to a GAN, where the policy network is a generator and the reward network defines a discriminator (assigning greater probability to higher reward trajectories). It therefore requires both a policy and reward initialization.
A naive solution is to randomly initialize the policy at the start of each new task. Although this method would work in principle, adversarial IRL requires a large number of iterations to converge to a good policy from a random start, making this approach computationally impractical.
Alternatively, one could jointly perform meta-RL and meta-IRL, learning initializations for both the reward and policy parameters. We view this as a promising line of research, but consider it to be a bad fit for Reptile, which is known to perform poorly on meta-RL problems (Nichol et al., 2018).
Instead, we favor maintaining separate policy parameters per task, applying Reptile to just the reward parameters. This method learns reward parameters that can be quickly finetuned to discriminate data from a distribution of generators. It can be applied with a small value of $k$, imposing little computational overhead: only a modest increase in memory consumption from storing multiple policy parameters.
However, since the policy for a task is updated only when that task is sampled, care must be taken to ensure the interval between successive samples of the same task does not grow too large. Otherwise, policies for many tasks might become very suboptimal for the current reward network weights, slowing convergence. Accordingly, we suggest training in minibatches of small numbers of tasks.
We are in the process of implementing this variant of algorithm 1, and plan to evaluate in a simulated robotics setting.
4 Experiments
We evaluate our regularized maximum causal entropy (MCE) IRL algorithm in a few-shot reward learning problem on the gridworld depicted in fig. 2. Transitions in the gridworld are stochastic: the agent moves in the desired direction with a fixed probability, and in each of the two orthogonal directions with a common fixed probability. Each cell in the gridworld is either a wall (in which case the state can never be visited), or one of five object types: dirt, grass, lava, gold and silver.
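The transition model can be illustrated with the following Python sketch. The probability `p_intended` is a free parameter here (the value used in the experiments is not assumed), the remaining mass is split evenly between the two orthogonal directions, and a move into a wall or off the grid is assumed to leave the agent in place; all three choices are assumptions made for illustration.

```python
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
ORTHOGONAL = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}


def transition_distribution(grid, pos, action, p_intended):
    """Return {next_position: probability} for a noisy move on a grid of strings.

    '#' marks a wall; any other character is a visitable cell.
    """
    p_side = (1.0 - p_intended) / 2.0
    outcomes = [(action, p_intended)] + [(d, p_side) for d in ORTHOGONAL[action]]
    dist = {}
    for direction, prob in outcomes:
        dr, dc = MOVES[direction]
        r, c = pos[0] + dr, pos[1] + dc
        in_bounds = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        # Assumed behaviour: bumping into a wall or the edge leaves the agent in place.
        blocked = (not in_bounds) or grid[r][c] == "#"
        target = pos if blocked else (r, c)
        dist[target] = dist.get(target, 0.0) + prob
    return dist


# Hypothetical 3x3 layout with walls in the two top corners.
toy_grid = ["#.#", "...", "..."]
open_move = transition_distribution(toy_grid, (1, 1), "N", p_intended=0.8)
boxed_in = transition_distribution(toy_grid, (0, 1), "N", p_intended=0.8)
```

From the centre cell the probability mass is spread over three neighbours, whereas from the top-middle cell every outcome is blocked and the agent stays where it is.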
We define three different reward functions A, B and A+B in terms of these object types, as specified by the legend of fig. 2. The reward functions assign the same weights to dirt, grass and lava, but differ in the weights for gold and silver. A likes silver but is neutral about gold, B has the opposite preferences and A+B likes both gold and silver. We generate synthetic demonstrations for each of these three reward functions using the MCE planner given by eq. (1).
Our multi-task IRL algorithm is then presented with demonstrations from each reward function. Demonstrations for the few-shot environment are restricted to between 0 and 100 trajectories, while demonstrations for the other two environments contain a fixed number of trajectories. To make the task more challenging, our algorithm is not provided with the feature representation, instead having to learn the reward separately for each state. We repeat all experiments for 5 random seeds.
4.1 Comparison to baselines
We compare against two baselines. The first (‘single’) corresponds to using single-task MCE IRL, seeing only the trajectories from the few-shot environment. The second (‘joint training’) combines the demonstrations from all three environments into a single sequence of trajectories. For reference, we also display the value obtained by an optimal (‘oracle’) policy, computed by value iteration on the ground truth reward. Figure 3 shows the best out of the 5 random seeds and table 1 reports 95% confidence intervals.
Table 1: Value obtained by each method (mean ± 95% confidence interval over 5 random seeds) as a function of the number of trajectories from the few-shot environment.

| Reward | Trajectories | Single | Joint training | Multi-task | Oracle |
|---|---|---|---|---|---|
| A+B | 0 | n/a | 359.6 ± 0.2 | -20.1 ± 4.1 | 362.7 |
| A+B | 1 | -248.5 ± 327.2 | 359.6 ± 0.2 | 355.3 ± 2.0 | 362.7 |
| A+B | 2 | -337.1 ± 242.8 | 359.6 ± 0.2 | 356.1 ± 3.6 | 362.7 |
| A+B | 5 | -233.6 ± 396.0 | 359.6 ± 0.2 | 355.6 ± 1.2 | 362.7 |
| A+B | 10 | -226.7 ± 389.0 | 359.6 ± 0.2 | 356.7 ± 3.7 | 362.7 |
| A+B | 20 | -39.9 ± 513.5 | 359.6 ± 0.2 | 356.3 ± 4.6 | 362.7 |
| A+B | 50 | 208.8 ± 347.5 | 359.6 ± 0.2 | 358.2 ± 2.5 | 362.7 |
| A+B | 100 | 56.2 ± 285.2 | 359.7 ± 0.2 | 357.9 ± 2.1 | 362.7 |
| A | 0 | n/a | -463.0 ± 12.2 | -475.3 ± 0.6 | 357.8 |
| A | 1 | -325.6 ± 311.5 | -466.8 ± 14.0 | 222.1 ± 444.8 | 357.8 |
| A | 2 | -267.2 ± 373.6 | -463.0 ± 12.2 | 352.8 ± 9.5 | 357.8 |
| A | 5 | -155.4 ± 180.9 | -463.4 ± 13.6 | 355.1 ± 0.3 | 357.8 |
| A | 10 | -209.3 ± 429.8 | -463.4 ± 13.6 | 355.1 ± 0.2 | 357.8 |
| A | 20 | -130.0 ± 406.3 | -463.4 ± 13.6 | 355.1 ± 0.2 | 357.8 |
| A | 50 | -596.1 ± 2602.7 | -434.2 ± 63.7 | 355.2 ± 0.2 | 357.8 |
| A | 100 | 120.6 ± 612.5 | -431.5 ± 62.0 | 355.2 ± 0.1 | 357.8 |
| B | 0 | n/a | -462.6 ± 39.9 | -475.2 ± 0.1 | 357.8 |
| B | 1 | -227.9 ± 247.8 | -462.6 ± 39.9 | 23.1 ± 688.1 | 357.8 |
| B | 2 | -293.9 ± 246.0 | -463.2 ± 40.7 | 339.0 ± 52.4 | 357.8 |
| B | 5 | -122.6 ± 106.7 | -462.6 ± 39.9 | 354.8 ± 1.2 | 357.8 |
| B | 10 | -129.1 ± 308.9 | -462.4 ± 39.9 | 354.9 ± 0.5 | 357.8 |
| B | 20 | -305.4 ± 404.0 | -458.6 ± 37.6 | 355.0 ± 0.6 | 357.8 |
| B | 50 | 112.2 ± 638.0 | -465.5 ± 15.9 | 355.1 ± 0.3 | 357.8 |
| B | 100 | -731.8 ± 3508.9 | -452.9 ± 40.1 | 355.1 ± 0.4 | 357.8 |
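For readers wishing to reproduce intervals of this kind, one common recipe is a Student's t interval over the per-seed values. The Python sketch below assumes that recipe (the procedure actually used for table 1 is not specified here) and uses made-up values for five seeds purely for illustration.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (5 seeds).
T_CRIT_95_DF4 = 2.776


def mean_and_half_width(values, t_crit):
    """Return the sample mean and the half-width of a t-based confidence interval."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, t_crit * standard_error


# Made-up per-seed values, not taken from the experiments.
seed_values = [355.0, 355.2, 355.1, 355.3, 354.9]
ci_mean, ci_half_width = mean_and_half_width(seed_values, T_CRIT_95_DF4)
```

With only five seeds the critical value is much larger than the familiar 1.96, which partly explains the very wide intervals whenever a single seed performs poorly.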
Our multi-task IRL algorithm recovers a near-optimal policy in all 5 runs after only two trajectories, and in the best case requires only a single trajectory. By contrast, the ‘single’ baseline requires 50 trajectories or more to recover a good policy even in the best case, and after 100 trajectories several seeds still obtain negative total rewards.
The ‘joint training’ baseline performs well on A+B. This is unsurprising, since an optimal policy in A or B is near-optimal in A+B. However, it fares poorly in both the A and B environments, never obtaining a positive reward even in the best case.
Note that all approaches fail in the zero-shot case on A and B, making the success of multi-task IRL in the few-shot case all the more remarkable. Demonstrations solely from non-target environments are not enough to recover a good reward in the target, and so substantial learning must be taking place with only one or two trajectories.
4.2 Hyperparameter choice
Our regularized MCE IRL algorithm takes a hyperparameter $\lambda$ that specifies the regularization strength. We show in fig. 4 the results of a hyperparameter sweep over values of $\lambda$ spanning two orders of magnitude. As expected, the weakest regularization constant suffers from high variance across the random seeds when the number of trajectories is small.
Perhaps more surprisingly, the strongest regularization constant also has high variance. We hypothesise that it imposes too strong a prior, making it highly sensitive to the trajectories observed in the off-target environments.
The median regularization constant attains the lowest variance and highest mean of the hyperparameters tested, and was used in the previous section’s experiments.
These results indicate that where sample efficiency is paramount, it is important to choose a regularization hyperparameter suited to the task distribution. However, the algorithm is reasonably robust to hyperparameter choice, with all parameters (varying across two orders of magnitude) attaining near-optimal performance after as few as 20 trajectories. By contrast, the single-task IRL algorithm did not achieve this level of performance even in the best case until observing 50 or more trajectories.
5 Related Work
Previous work in multi-task IRL has approached the problem from a Bayesian perspective. Dimitrakakis & Rothkopf (2011) model reward-policy function pairs as being drawn from a common (unknown) prior, over which they place a hyperprior. They go on to propose two hyperpriors with structures that simplify inference, evaluating on a 5-state MDP. Their work provides a pleasing theoretical basis for work on multi-task IRL, but is not intended to provide an algorithm that can scale to practical applications.
Complementary work has tackled a variant of the multi-task IRL problem, where the trajectories are unlabeled. That is, not only are the reward functions unknown, it is also not known which reward function each trajectory is associated with. Babeş-Vroman et al. (2011) propose using expectation-maximization (EM) to cluster trajectories, an approach that can be paired with several different IRL algorithms. Choi & Kim (2012) instead take a Bayesian IRL approach, using a Dirichlet process mixture model, avoiding the need to specify a fixed number of clusters. Both methods reduce the problem to multiple single-task IRL problems, and so unlike this work do not exploit similarities between reward functions.
IRL is often used as an imitation learning technique, by optimizing the inferred reward function. Multi-task imitation learning has also been studied from a non-IRL perspective. Recent extensions to generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016) augment trajectories with a latent intention variable that specifies the task, and then maximize mutual information between the state-action pairs and intention variable (Hausman et al., 2017; Li et al., 2017). Both of these approaches are focused on disentangling trajectories associated with different tasks, and are not intended to speed up the learning of new tasks.
However, the imitation learning community has tackled this problem in the form of one-shot imitation learning: having seen a distribution of trajectories for other tasks, learn a new task from a single demonstration. Wang et al. (2017) use GAIL with the discriminator conditioned on the output generated by an LSTM encoder. After training on unlabeled trajectories, this method can perform one-shot imitation learning by conditioning on a single demonstration trajectory. One-shot imitation learning can also be tackled within the behavioral cloning paradigm, with Duan et al. (2017) training a temporal convolution network with attention to mimic trajectories.
The closest work to this paper in the imitation learning community is multi-agent GAIL (Song et al., 2018). Although GAIL never learns an explicit reward function, it is equivalent to IRL composed with RL. Similar to this work, multi-agent GAIL seeks to improve sample efficiency by exploiting similarity between the reward functions. However, multi-agent GAIL makes strong assumptions on the reward function (e.g. zero sum games), whereas our work depends on a much weaker prior.
Amin et al. (2017) have studied the complementary problem of repeated IRL: learning a common reward component that is invariant across tasks, when the task-specific reward components are known. Although this could be solved by applying IRL to any one of the tasks (and then subtracting the task-specific reward), they show a repeated IRL algorithm can attain better bounds, and under certain constraints can even be used to resolve the ambiguity inherent in single-task IRL problems.
6 Conclusions and Future Work
Sample efficient solutions to the multi-task IRL problem are critical for enabling real-world applications, where collecting human demonstrations is expensive and slow. The multi-task IRL problem has previously been studied exclusively from a Bayesian IRL perspective. In this paper we took the alternative approach of formulating the multi-task problem inside the maximum causal entropy IRL framework by adding a regularization term to the loss. Experiments find our multi-task IRL algorithm can perform one-shot imitation learning in an environment that single-task IRL requires hundreds of demonstrations to learn.
Maximum causal entropy IRL (Ziebart et al., 2010) cannot scale to MDPs with large or infinite state spaces, and moreover requires known dynamics. Both these problems have been alleviated by recent extensions to maximum causal entropy IRL, such as guided cost learning and adversarial IRL (Finn et al., 2016; Fu et al., 2018). Our second contribution is to show how in this function approximator setting, the problem of multi-task IRL can be framed as a meta-learning problem. We have a prototype of algorithm 1, combining Reptile (Nichol et al., 2018) and adversarial IRL (Fu et al., 2018), and are working to evaluate it on multi-task variants of MuJoCo robotics environments (Todorov et al., 2012).
Another avenue for future work is to extend our approach to handle the unlabeled multi-task IRL problem. Prior work on unlabeled multi-task IRL has not exploited any similarity between different reward functions. However, we know from work on one-shot imitation learning that it is possible to use unlabeled demonstrations to speed up the acquisition of new tasks by learning a latent space representation (Wang et al., 2017). Synthesizing our work with existing unlabeled multi-task IRL approaches such as Babeş-Vroman et al. (2011) could enable similar feats in the IRL context.
The source code for our algorithms and experiments is open source and available at —removed for double blind— .
Removed for double blind review.
- Abbeel & Ng (2004) Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
- Amin et al. (2017) Amin, Kareem, Jiang, Nan, and Singh, Satinder. Repeated inverse reinforcement learning. In NIPS, 2017.
- Asadi & Littman (2017) Asadi, Kavosh and Littman, Michael L. An alternative softmax operator for reinforcement learning. 2017.
- Babeş-Vroman et al. (2011) Babeş-Vroman, Monica, Marivate, Vukosi, Subramanian, Kaushik, and Littman, Michael. Apprenticeship learning about multiple intentions. In ICML, 2011.
- Choi & Kim (2012) Choi, Jaedeug and Kim, Kee-Eung. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. In NIPS, 2012.
- Dimitrakakis & Rothkopf (2011) Dimitrakakis, Christos and Rothkopf, Constantin A. Bayesian multitask inverse reinforcement learning. In EWRL, 2011.
- Duan et al. (2017) Duan, Yan, Andrychowicz, Marcin, Stadie, Bradly, Ho, Jonathan, Schneider, Jonas, Sutskever, Ilya, Abbeel, Pieter, and Zaremba, Wojciech. One-shot imitation learning. In NIPS, 2017.
- Finn et al. (2016) Finn, Chelsea, Levine, Sergey, and Abbeel, Pieter. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML, 2016.
- Finn et al. (2017) Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
- Fu et al. (2018) Fu, Justin, Luo, Katie, and Levine, Sergey. Learning robust rewards with adversarial inverse reinforcement learning. In ICLR, 2018.
- Hausman et al. (2017) Hausman, Karol, Chebotar, Yevgen, Schaal, Stefan, Sukhatme, Gaurav, and Lim, Joseph J. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In NIPS, 2017.
- Ho & Ermon (2016) Ho, Jonathan and Ermon, Stefano. Generative adversarial imitation learning. In NIPS, 2016.
- Li et al. (2017) Li, Yunzhu, Song, Jiaming, and Ermon, Stefano. InfoGAIL: Interpretable imitation learning from visual demonstrations. In NIPS, 2017.
- Ng & Russell (2000) Ng, Andrew Y. and Russell, Stuart. Algorithms for inverse reinforcement learning. In ICML, 2000.
- Nichol et al. (2018) Nichol, Alex, Achiam, Joshua, and Schulman, John. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018.
- Ramachandran & Amir (2007) Ramachandran, Deepak and Amir, Eyal. Bayesian inverse reinforcement learning. In IJCAI, 2007.
- Song et al. (2018) Song, Jiaming, Ren, Hongyu, Sadigh, Dorsa, and Ermon, Stefano. Multi-agent generative adversarial imitation learning. In ICLR Workshop, 2018.
- Todorov et al. (2012) Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In IROS, 2012.
- Wang et al. (2017) Wang, Ziyu, Merel, Josh S, Reed, Scott E, de Freitas, Nando, Wayne, Gregory, and Heess, Nicolas. Robust imitation of diverse behaviors. In NIPS, 2017.
- Ziebart et al. (2008) Ziebart, Brian D., Maas, Andrew, Bagnell, J. Andrew, and Dey, Anind K. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
- Ziebart et al. (2010) Ziebart, Brian D., Bagnell, J. Andrew, and Dey, Anind K. Modeling interaction via the principle of maximum causal entropy. In ICML, 2010.