Multi-task Maximum Entropy Inverse Reinforcement Learning
Abstract
Multi-task Inverse Reinforcement Learning (IRL) is the problem of inferring multiple reward functions from expert demonstrations. Prior work, built on Bayesian IRL, is unable to scale to complex environments due to computational constraints. This paper contributes the first formulation of multi-task IRL in the more computationally efficient Maximum Causal Entropy (MCE) IRL framework. Experiments show our approach can perform one-shot imitation learning in a gridworld environment that single-task IRL algorithms require hundreds of demonstrations to solve. Furthermore, we outline how our formulation can be applied to state-of-the-art MCE IRL algorithms such as Guided Cost Learning. This extension, based on meta-learning, could enable multi-task IRL to be performed for the first time in high-dimensional, continuous-state MDPs with unknown dynamics, as commonly arise in robotics.
1 Introduction
Inverse reinforcement learning (IRL) is the task of determining the reward function that generated a set of trajectories: sequences of state-action pairs (Ng & Russell, 2000; Abbeel & Ng, 2004). It is a form of learning from demonstration that assumes the expert demonstrator follows an (approximately) optimal policy with respect to an unknown reward. Sample efficiency is a key design goal, since it is costly to elicit human demonstrations.
In practice, demonstrations are often generated from multiple reward functions. This situation naturally arises when the demonstrations are for different tasks, such as grasping different types of objects, as depicted in fig. 1. Less obviously, it also occurs when different individuals perform what is nominally the same task, reflecting individuals’ unique preferences and styles. In this paper, we assume demonstrations of the same task are assigned a common label.
A naive solution to multi-task IRL is to repeatedly apply a single-task IRL algorithm to the demonstrations of each task. However, this method requires that the number of samples increases proportionally with the number of tasks, which is prohibitive in many settings. Fortunately, the reward functions for related tasks are often similar, and exploiting this structure can enable greater sample efficiency.
Previous work on the multi-task IRL problem (Dimitrakakis & Rothkopf, 2011; Babeş-Vroman et al., 2011; Choi & Kim, 2012) builds on Bayesian IRL (Ramachandran & Amir, 2007). Unfortunately, no extant Bayesian IRL methods scale to complex environments with high-dimensional, continuous state spaces such as robotics. By contrast, approaches based on maximum causal entropy show more promise (Ziebart et al., 2010). Although the original maximum causal entropy IRL algorithm is limited to discrete state spaces, recent extensions such as guided cost learning and adversarial IRL scale to challenging continuous control environments (Finn et al., 2016; Fu et al., 2018).
Our two main contributions in this paper are:

Regularized Maximum Causal Entropy (MCE). We present the first formulation of the multi-task IRL problem in the MCE framework. Our approach simply adds a regularization term to the loss, and therefore retains the computational efficiency of the original MCE IRL algorithm. We evaluate in a gridworld that takes hundreds of demonstrations for unmodified MCE IRL to solve. By contrast, our regularized variant recovers a reward that leads to a near-optimal policy after a single demonstration.

Meta-Learning Rewards. We outline preliminary work extending our approach to the function approximator setting used by guided cost learning, finding a connection with meta-learning. Since guided cost learning is a scalable approximation to maximum causal entropy IRL, the success of the former approach gives reason for optimism. However, we leave empirical investigation of this method to future work. To the best of our knowledge, meta-learning has never been applied to IRL in any previously published work.
2 Preliminaries and Single-Task IRL
A Markov Decision Process (MDP) is a tuple \langle S, A, T, \gamma, \rho_0, r \rangle where S and A are sets of states and actions; T(s' \mid s, a) is the probability of transitioning to s' from s after taking action a; \gamma \in (0, 1] is a discount factor; \rho_0(s) is the probability of starting in s; and r(s, a) is the reward upon taking action a in state s. We write MDP\R to denote an MDP without a reward function.
In the single-task IRL problem, the IRL algorithm is given access to an MDP\R and demonstrations \mathcal{D} from an (approximately) optimal policy. The goal is to recover a reward function r that explains the demonstrations \mathcal{D}. Note this is an ill-posed problem: many reward functions, including the constant zero reward function r(s, a) = 0, make the demonstrations optimal.
Bayesian IRL addresses this identification problem by inferring a posterior distribution over reward functions (Ramachandran & Amir, 2007). Although some probability mass will be placed on degenerate reward functions, for reasonable priors the majority of the probability will lie on more plausible explanations.
By contrast, maximum causal entropy chooses a single reward function, using the principle of maximum entropy to select the least specific reward function that is still consistent with the demonstrations (Ziebart et al., 2008, 2010). It models the demonstrations as being sampled from:
(1) \pi_\theta(a \mid s) = \exp\left( Q^{\mathrm{soft}}_\theta(s, a) - V^{\mathrm{soft}}_\theta(s) \right)
a stochastic expert policy that is noisily optimal for:
(2) Q^{\mathrm{soft}}_\theta(s, a) = r_\theta(s, a) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\left[ V^{\mathrm{soft}}_\theta(s') \right], \qquad V^{\mathrm{soft}}_\theta(s) = \log \sum_{a} \exp Q^{\mathrm{soft}}_\theta(s, a)
Note there can exist multiple solutions to these softmax Bellman equations (Asadi & Littman, 2017).
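For a tabular MDP with known dynamics, the softmax Bellman equations can be solved by fixed-point iteration, replacing the hard max of standard value iteration with a log-sum-exp. The sketch below is illustrative only; the function name and tensor layout are our own choices, not part of the original algorithm.

```python
import numpy as np

def soft_value_iteration(T, r, gamma=0.9, n_iters=200):
    """Fixed-point iteration on the softmax Bellman equations of a tabular MDP.

    T: transition tensor, shape (S, A, S), with T[s, a, s2] = P(s2 | s, a).
    r: reward matrix, shape (S, A).
    Returns the soft Q-values and the noisily optimal stochastic policy.
    """
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = r + gamma * (T @ V)            # soft Bellman backup
        V = np.log(np.exp(Q).sum(axis=1))  # log-sum-exp replaces max
    pi = np.exp(Q - V[:, None])            # pi(a | s) = exp(Q - V)
    return Q, pi
```

Consistent with the non-uniqueness noted above, the iterates need not converge for all discount factors; in practice a fixed iteration budget suffices for the small MDPs considered here.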
To reduce the dimension of the problem, it is common to assume the reward function is linear in features over the state-action pairs:
(3) r_\theta(s, a) = \theta^\top \phi(s, a)
Let the expert demonstrations \mathcal{D} consist of m trajectories \tau_1, \dots, \tau_m. For convenience, write:
(4) f(\tau) = \sum_{t} \gamma^t \phi(s_t, a_t)
(5) \hat{f} = \frac{1}{m} \sum_{i=1}^{m} f(\tau_i)
Given a known feature map \phi, the IRL problem reduces to finding weights \theta.
A key insight behind maximum causal entropy IRL is that actions in the trajectory sequence depend causally on previous states and actions: i.e. a_t may depend on s_{1:t} and a_{1:t-1}, but not on states or actions that occur later in time. The causal log-likelihood of a trajectory is defined to be:
(6) \ell_\theta(\tau) = \sum_{t} \log \pi_\theta(a_t \mid s_{1:t}, a_{1:t-1})
with the causal entropy of a policy defined in terms of the causal log-likelihood of its trajectories:
(7) H(\pi) = \mathbb{E}_{\pi}\left[ -\sum_{t} \log \pi(a_t \mid s_{1:t}, a_{1:t-1}) \right]
Maximum causal likelihood estimation of \theta given the expert demonstrations \mathcal{D} is equivalent to maximizing the causal entropy of the stochastic policy \pi subject to the constraint that its expected feature counts match those of the demonstrations:
(8) \mathbb{E}_{\pi}\left[ f(\tau) \right] = \hat{f}
Note this constraint guarantees \pi attains the same (expected) reward as the expert demonstrations (Abbeel & Ng, 2004). Maximum causal entropy thus recovers reward weights that match the performance of the expert, while avoiding degeneracy by maximizing the diversity of the policy.
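This constrained entropy maximization is typically solved by gradient methods: the gradient of the negative causal log-likelihood with respect to the reward weights is the difference between the policy's expected feature counts and the empirical feature counts of the demonstrations. Below is a hedged sketch for tabular MDPs with known dynamics; the function name, the finite-horizon occupancy computation, and all hyperparameter defaults are our own assumptions rather than the original implementation.

```python
import numpy as np

def mce_irl(T, phi, demo_counts, p0, gamma=0.9, horizon=20, lr=0.1, epochs=100):
    """Gradient-based sketch of single-task MCE IRL on a tabular MDP.

    T: transitions, shape (S, A, S); phi: feature map, shape (S, A, K);
    demo_counts: empirical discounted feature expectations of the
    demonstrations, shape (K,); p0: initial-state distribution, shape (S,).
    Returns estimated reward weights theta.
    """
    S, A, K = phi.shape
    theta = np.zeros(K)
    for _ in range(epochs):
        # Plan under the current reward r = phi . theta via soft value iteration.
        r = phi @ theta
        V = np.zeros(S)
        for _ in range(50):
            Q = r + gamma * (T @ V)
            V = np.log(np.exp(Q).sum(axis=1))
        pi = np.exp(Q - V[:, None])
        # Discounted feature expectations of pi via a finite-horizon rollout.
        d = p0.copy()
        counts = np.zeros(K)
        for t in range(horizon):
            sa = d[:, None] * pi                    # state-action occupancy at step t
            counts += gamma ** t * np.einsum('sa,sak->k', sa, phi)
            d = np.einsum('sa,sap->p', sa, T)       # propagate the state distribution
        # Ascend the causal log-likelihood: demonstration minus policy counts.
        theta += lr * (demo_counts - counts)
    return theta
```

When the demonstration counts are exactly attainable, the gradient vanishes at the matching weights; otherwise the weights drift toward the closest achievable feature expectations.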
3 Methods for Multi-Task IRL
In multi-task IRL, the reward r_i varies between N MDPs, each with associated expert demonstrations \mathcal{D}_i. If the reward functions are unrelated to each other, we cannot do better than repeated application of a single-task IRL algorithm. However, in practice similar tasks have reward functions with similar structure, enabling specialized multi-task IRL algorithms to accurately infer the reward with fewer demonstrations.
In the next section, we solve the multi-task IRL problem using the original maximum causal entropy IRL algorithm with an additional regularization term. To the best of our knowledge, this is the first published algorithm for multi-task IRL within the maximum entropy paradigm. Following this, we describe how our method can be applied to scalable approximations of maximum causal entropy IRL.
3.1 Regularized Maximum Causal Entropy
In the multi-task setting, we must jointly infer reward weights \theta_1, \dots, \theta_N that explain each demonstration set \mathcal{D}_i. To make progress we must make some assumption on the relationship between different reward weights. A natural assumption is that the reward weights for most tasks lie close to the mean across all tasks, i.e. \|\theta_i - \bar{\theta}\| should be small, where \bar{\theta} = \frac{1}{N} \sum_{i=1}^{N} \theta_i. This corresponds to a prior that the \theta_i are drawn from i.i.d. Gaussians with mean \bar{\theta} and variance monotonically decreasing in the regularization strength \lambda. In practice, we do not know \bar{\theta}, but we can estimate it by taking the mean of the current iterates for \theta_1, \dots, \theta_N. This results in a pleasingly simple inference procedure. The regularized loss for task i is:
(9) \mathcal{L}(\theta_i) = -\frac{1}{|\mathcal{D}_i|} \sum_{\tau \in \mathcal{D}_i} \ell_{\theta_i}(\tau) + \lambda \left\| \theta_i - \bar{\theta} \right\|_2^2
with gradient (treating \bar{\theta} as constant):
(10) \nabla_{\theta_i} \mathcal{L} = \mathbb{E}_{\pi_{\theta_i}}\left[ f(\tau) \right] - \hat{f}_i + 2 \lambda \left( \theta_i - \bar{\theta} \right)
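Concretely, each task's update combines its single-task MCE IRL loss gradient with a pull toward the current estimate of the mean weights. A minimal sketch (the function name and vectorized layout are our own, not part of the paper's implementation):

```python
import numpy as np

def multitask_step(thetas, loss_grads, lam, lr=0.1):
    """One gradient-descent step on the regularized multi-task loss.

    thetas: current per-task reward weights, shape (N, K).
    loss_grads: per-task single-task MCE IRL loss gradients, shape (N, K).
    lam: regularization strength lambda.
    The common mean is re-estimated from the current iterates and treated
    as a constant within the step.
    """
    theta_bar = thetas.mean(axis=0)
    # The gradient of lam * ||theta_i - theta_bar||^2 pulls each task's
    # weights toward the mean; the likelihood term fits its own demonstrations.
    return thetas - lr * (loss_grads + 2.0 * lam * (thetas - theta_bar))
```

With lam = 0 this reduces to independent single-task updates; larger lam shares more statistical strength across tasks.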
3.2 Meta-Learning on Reward Networks
In the previous section, we saw how multi-task IRL can be incorporated directly into the Maximum Causal Entropy (MCE) framework. However, the original MCE IRL algorithm has two major limitations. First, it assumes the MDP's dynamics are known, whereas in many applications (e.g. robotics) the dynamics are unknown and must also be learned. Second, it requires the practitioner to provide a feature mapping \phi such that the resulting reward is linear. For many problems, finding these features may be the bulk of the problem, negating the benefit of IRL.
Both of these shortcomings are addressed by guided cost learning (Finn et al., 2016) and its successor adversarial IRL (Fu et al., 2018), scalable approximations of MCE IRL. Specifically, adversarial IRL uses a neural network to represent the reward as a function of states and actions, obviating the need to specify a feature map \phi. Furthermore, it can handle unknown transition dynamics since it estimates the loss gradient via sampling rather than direct computation, and so only requires access to a simulation of the environment for rollouts.
Naively, we could directly translate the regularization approach given in the previous section to this setting, applying it to the parameters \theta of the neural network. However, regularizing the parameter space may not regularize the output space: small changes in some parameters may have a large effect on the predicted reward, while large changes in other parameters may have little effect.
A more promising approach is to meta-learn the reward network parameters \theta. To the best of our knowledge, meta-learning has never been applied to IRL in any prior published work, so it is unclear which meta-learning approach is best suited to this problem. We have selected Reptile (Nichol et al., 2018) as the basis for our initial experiments due to its computational efficiency, a key consideration given that IRL in complex environments is already computationally demanding. Moreover, Reptile attains state-of-the-art performance on few-shot supervised learning, the closest problem to multi-task IRL that meta-learning algorithms have been evaluated on.
Our method is described in algorithm 1. We seek to find an initialization for the reward network that can be quickly fine-tuned for any given task (by running adversarial IRL on demonstrations of that task). To achieve this, we repeatedly sample a task and run k steps of adversarial IRL, starting from our current initialization. The initialization is then updated along the line between the initialization and the final iterate of adversarial IRL. Although this appears superficially similar to joint training, for k > 1 it is an approximation to first-order model-agnostic meta-learning (MAML) (Finn et al., 2017), a more principled but computationally expensive meta-learning algorithm.
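The Reptile-style outer loop just described can be sketched as follows. Here `inner_loop` is a stand-in for k steps of adversarial IRL on one task's demonstrations; its name, the flat parameter vector, and the defaults are illustrative assumptions rather than the contents of algorithm 1.

```python
import numpy as np

def reptile_reward_init(tasks, inner_loop, theta0, meta_lr=0.1, meta_iters=100, seed=0):
    """Reptile-style outer loop for meta-learning a reward initialization.

    tasks: list of task identifiers; inner_loop(task, theta) should run k
    steps of an IRL algorithm (a stand-in for adversarial IRL here) from
    initialization theta and return the adapted parameters.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(meta_iters):
        task = tasks[rng.integers(len(tasks))]   # sample a task
        theta_k = inner_loop(task, theta)        # k inner steps of IRL
        # Move the initialization along the line toward the adapted weights.
        theta += meta_lr * (theta_k - theta)
    return theta
```

The outer loop only ever sees the initial and final iterates of the inner loop, which is what makes Reptile cheap relative to MAML's second-order updates.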
Algorithm 1 cannot be applied verbatim since adversarial IRL jointly learns a reward function and a policy optimizing that reward function. This is analogous to a GAN, where the policy network is a generator and the reward network defines a discriminator (assigning greater probability to higher-reward trajectories). It therefore requires both a policy and a reward initialization.
A naive solution is to randomly initialize the policy at the start of each new task. Although this method would work in principle, adversarial IRL requires a large number of iterations to converge to a good policy from a random start, making this approach computationally impractical.
Alternatively, one could jointly perform meta-RL and meta-IRL, learning initializations for both the reward and policy parameters. We view this as a promising line of research, but consider it to be a bad fit for Reptile, which is known to perform poorly on meta-RL problems (Nichol et al., 2018).
Instead, we favor maintaining separate policy parameters per task, applying Reptile to just the reward parameters. This method learns reward parameters that can be quickly fine-tuned to discriminate data from a distribution of generators. It can be applied with a small value of k, imposing little computational overhead: only a modest increase in memory consumption from storing multiple sets of policy parameters.
However, since the policy for a task is updated only when that task is sampled, care must be taken to ensure the interval between successive samples of a task does not grow too large. Otherwise, the policies for many tasks might become very suboptimal for the current reward network weights, slowing convergence. Accordingly, we suggest training in minibatches of small numbers of tasks.
We are in the process of implementing this variant of algorithm 1, and plan to evaluate in a simulated robotics setting.
4 Experiments
We evaluate our regularized maximum causal entropy (MCE) IRL algorithm in a few-shot reward learning problem on the gridworld depicted in fig. 2. Transitions in the gridworld are stochastic, with some fixed probability of moving in the desired direction and the remaining probability split evenly between the two orthogonal directions. Each cell in the gridworld is either a wall (in which case the state can never be visited), or one of five object types: dirt, grass, lava, gold and silver.
We define three different reward functions A, B and A+B in terms of these object types, as specified by the legend of fig. 2. The reward functions assign the same weights to dirt, grass and lava, but differ in the weights for gold and silver: A likes silver but is neutral about gold, B has the opposite preferences, and A+B likes both gold and silver. We generate synthetic demonstrations for each of these three reward functions using the MCE planner given by eq. (1).
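Given the stochastic expert policy of eq. (1), such synthetic demonstrations can be generated by ancestral sampling from the policy and the dynamics. A minimal sketch (the function name and trajectory encoding are our own assumptions):

```python
import numpy as np

def sample_demos(T, pi, p0, n_traj, horizon, seed=0):
    """Sample synthetic demonstration trajectories from a stochastic policy.

    T: transitions, shape (S, A, S); pi: expert policy, shape (S, A);
    p0: initial-state distribution, shape (S,).
    Returns a list of trajectories, each a list of (state, action) pairs.
    """
    rng = np.random.default_rng(seed)
    S, A = pi.shape
    demos = []
    for _ in range(n_traj):
        s = rng.choice(S, p=p0)
        traj = []
        for _ in range(horizon):
            a = rng.choice(A, p=pi[s])    # act according to the expert
            traj.append((s, a))
            s = rng.choice(S, p=T[s, a])  # sample the next state
        demos.append(traj)
    return demos
```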
Our multi-task IRL algorithm is then presented with demonstrations from each reward function. Demonstrations for the few-shot environment are restricted to a small number of trajectories, varying between 0 and 100, while demonstrations for the other two environments contain a fixed number of trajectories. To make the task more challenging, our algorithm is not provided with the feature representation, instead having to learn the reward separately for each state. We repeat all experiments for 5 random seeds.
4.1 Comparison to baselines
We compare against two baselines. The first (‘single’) corresponds to using single-task MCE IRL, seeing only the trajectories from the few-shot environment. The second (‘joint training’) combines the demonstrations from all three environments into a single combined sequence of trajectories. For reference, we also display the value obtained by an optimal (‘oracle’) policy, computed by value iteration on the ground truth reward. Figure 3 shows the best out of the 5 random seeds and table 1 reports 95% confidence intervals.
Table 1: Total reward (mean ± 95% confidence interval over 5 random seeds).

Env   Traj   Single            Joint Training    Multi-Task       Oracle
A+B     0    —                  359.6 ±   0.2      20.1 ±   4.1    362.7
A+B     1    248.5 ± 327.2      359.6 ±   0.2     355.3 ±   2.0    362.7
A+B     2    337.1 ± 242.8      359.6 ±   0.2     356.1 ±   3.6    362.7
A+B     5    233.6 ± 396.0      359.6 ±   0.2     355.6 ±   1.2    362.7
A+B    10    226.7 ± 389.0      359.6 ±   0.2     356.7 ±   3.7    362.7
A+B    20     39.9 ± 513.5      359.6 ±   0.2     356.3 ±   4.6    362.7
A+B    50    208.8 ± 347.5      359.6 ±   0.2     358.2 ±   2.5    362.7
A+B   100     56.2 ± 285.2      359.7 ±   0.2     357.9 ±   2.1    362.7
A       0    —                 -463.0 ±  12.2    -475.3 ±   0.6    357.8
A       1    325.6 ± 311.5     -466.8 ±  14.0     222.1 ± 444.8    357.8
A       2    267.2 ± 373.6     -463.0 ±  12.2     352.8 ±   9.5    357.8
A       5    155.4 ± 180.9     -463.4 ±  13.6     355.1 ±   0.3    357.8
A      10    209.3 ± 429.8     -463.4 ±  13.6     355.1 ±   0.2    357.8
A      20    130.0 ± 406.3     -463.4 ±  13.6     355.1 ±   0.2    357.8
A      50   -596.1 ± 2602.7    -434.2 ±  63.7     355.2 ±   0.2    357.8
A     100    120.6 ± 612.5     -431.5 ±  62.0     355.2 ±   0.1    357.8
B       0    —                 -462.6 ±  39.9    -475.2 ±   0.1    357.8
B       1    227.9 ± 247.8     -462.6 ±  39.9      23.1 ± 688.1    357.8
B       2    293.9 ± 246.0     -463.2 ±  40.7     339.0 ±  52.4    357.8
B       5    122.6 ± 106.7     -462.6 ±  39.9     354.8 ±   1.2    357.8
B      10    129.1 ± 308.9     -462.4 ±  39.9     354.9 ±   0.5    357.8
B      20    305.4 ± 404.0     -458.6 ±  37.6     355.0 ±   0.6    357.8
B      50    112.2 ± 638.0     -465.5 ±  15.9     355.1 ±   0.3    357.8
B     100   -731.8 ± 3508.9    -452.9 ±  40.1     355.1 ±   0.4    357.8
Our multi-task IRL algorithm recovers a near-optimal policy in all 5 runs after only two trajectories, and in the best case requires only a single trajectory. By contrast, the ‘single’ baseline requires 50 trajectories or more to recover a good policy even in the best case, and even after 100 trajectories several seeds still obtain negative total rewards.
The ‘joint training’ baseline performs well on A+B. This is unsurprising, since an optimal policy in A or B is near-optimal in A+B. However, it fares poorly in both the A and B environments, never obtaining a positive reward even in the best case.
Note that all approaches fail in the zero-shot case on A and B, making the success of multi-task IRL in the few-shot case all the more remarkable. Demonstrations solely from non-target environments are not enough to recover a good reward in the target, and so substantial learning must be taking place with only one or two trajectories.
4.2 Hyperparameter choice
Our regularized MCE IRL algorithm takes a hyperparameter \lambda that specifies the regularization strength. We show in fig. 4 the results of a hyperparameter sweep over values of \lambda spanning two orders of magnitude. As expected, the weakest regularization constant suffers from high variance across the random seeds when the number of trajectories is small.
Perhaps more surprisingly, the strongest regularization constant also has high variance. We hypothesize that it imposes too strong a prior, making it highly sensitive to the trajectories observed in the off-target environments.
The median regularization constant attains the lowest variance and highest mean of the hyperparameters tested, and was used in the previous section’s experiments.
These results indicate that where sample efficiency is paramount, it is important to choose a regularization hyperparameter suited to the task distribution. However, the algorithm is reasonably robust to hyperparameter choice, with all parameters (varying across two orders of magnitude) attaining near-optimal performance after as few as 20 trajectories. By contrast, the single-task IRL algorithm did not achieve this level of performance even in the best case until observing 50 or more trajectories.
5 Related Work
Previous work in multi-task IRL has approached the problem from a Bayesian perspective. Dimitrakakis & Rothkopf (2011) model reward-policy function pairs as being drawn from a common (unknown) prior, over which they place a hyperprior. They go on to propose two hyperpriors with structures that simplify inference, evaluating on a 5-state MDP. Their work provides a pleasing theoretical basis for work on multi-task IRL, but is not intended to provide an algorithm that can scale to practical applications.
Complementary work has tackled a variant of the multi-task IRL problem, where the trajectories are unlabeled. That is, not only are the reward functions unknown, but it is also not known which reward function each trajectory is associated with. Babeş-Vroman et al. (2011) propose using expectation-maximization (EM) to cluster trajectories, an approach that can be paired with several different IRL algorithms. Choi & Kim (2012) instead take a Bayesian IRL approach, using a Dirichlet process mixture model, avoiding the need to specify a fixed number of clusters. Both methods reduce the problem to multiple single-task IRL problems, and so unlike this work do not exploit similarities between reward functions.
IRL is often used as an imitation learning technique, by optimizing the inferred reward function. Multi-task imitation learning has also been studied from a non-IRL perspective. Recent extensions to generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016) augment trajectories with a latent intention variable that specifies the task, and then maximize mutual information between the state-action pairs and intention variable (Hausman et al., 2017; Li et al., 2017). Both of these approaches are focused on disentangling trajectories associated with different tasks, and are not intended to speed up the learning of new tasks.
However, the imitation learning community has tackled this problem in the form of one-shot imitation learning: having seen a distribution of trajectories for other tasks, learn a new task from a single demonstration. Wang et al. (2017) use GAIL with the discriminator conditioned on the output generated by an LSTM encoder. After training on unlabeled trajectories, this method can perform one-shot imitation learning by conditioning on a single demonstration trajectory. One-shot imitation learning can also be tackled within the behavioral cloning paradigm, with Duan et al. (2017) training a temporal convolution network with attention to mimic trajectories.
The closest work to this paper in the imitation learning community is multi-agent GAIL (Song et al., 2018). Although GAIL never learns an explicit reward function, it is equivalent to IRL composed with RL. Similar to this work, multi-agent GAIL seeks to improve sample efficiency by exploiting similarity between the reward functions. However, multi-agent GAIL makes strong assumptions on the reward function (e.g. zero-sum games), whereas our work depends on a much weaker prior.
Amin et al. (2017) have studied the complementary problem of repeated IRL: learning a common reward component that is invariant across tasks, when the task-specific reward components are known. Although this could be solved by applying IRL to any one of the tasks (and then subtracting the task-specific reward), they show a repeated IRL algorithm can attain better bounds, and under certain constraints can even be used to resolve the ambiguity inherent in single-task IRL problems.
6 Conclusions and Future Work
Sample-efficient solutions to the multi-task IRL problem are critical for enabling real-world applications, where collecting human demonstrations is expensive and slow. The multi-task IRL problem has previously been studied exclusively from a Bayesian IRL perspective. In this paper we took the alternative approach of formulating the multi-task problem inside the maximum causal entropy IRL framework by adding a regularization term to the loss. Experiments find our multi-task IRL algorithm can perform one-shot imitation learning in an environment that single-task IRL requires hundreds of demonstrations to learn.
Maximum causal entropy IRL (Ziebart et al., 2010) cannot scale to MDPs with large or infinite state spaces, and moreover requires known dynamics. Both these problems have been alleviated by recent extensions to maximum causal entropy IRL, such as guided cost learning and adversarial IRL (Finn et al., 2016; Fu et al., 2018). Our second contribution is to show how in this function approximator setting, the problem of multi-task IRL can be framed as a meta-learning problem. We have a prototype of algorithm 1, combining Reptile (Nichol et al., 2018) and adversarial IRL (Fu et al., 2018), and are working to evaluate it on multi-task variants of MuJoCo robotics environments (Todorov et al., 2012).
Another avenue for future work is to extend our approach to handle the unlabeled multi-task IRL problem. Prior work on unlabeled multi-task IRL has not exploited any similarity between different reward functions. However, we know from work on one-shot imitation learning that it is possible to use unlabeled demonstrations to speed up the acquisition of new tasks by learning a latent space representation (Wang et al., 2017). Synthesizing our work with existing unlabeled multi-task IRL approaches such as Babeş-Vroman et al. (2011) could enable similar feats in the IRL context.
The source code for our algorithms and experiments is open source and available at —removed for double blind— .
Acknowledgements
Removed for double blind review.
References
Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
Amin, Kareem, Jiang, Nan, and Singh, Satinder. Repeated inverse reinforcement learning. In NIPS, 2017.
Asadi, Kavosh and Littman, Michael L. An alternative softmax operator for reinforcement learning. In ICML, 2017.
Babeş-Vroman, Monica, Marivate, Vukosi, Subramanian, Kaushik, and Littman, Michael. Apprenticeship learning about multiple intentions. In ICML, 2011.
Choi, Jaedeug and Kim, Kee-Eung. Nonparametric Bayesian inverse reinforcement learning for multiple reward functions. In NIPS, 2012.
Dimitrakakis, Christos and Rothkopf, Constantin A. Bayesian multitask inverse reinforcement learning. In EWRL, 2011.
Duan, Yan, Andrychowicz, Marcin, Stadie, Bradly, Ho, Jonathan, Schneider, Jonas, Sutskever, Ilya, Abbeel, Pieter, and Zaremba, Wojciech. One-shot imitation learning. In NIPS, 2017.
Finn, Chelsea, Levine, Sergey, and Abbeel, Pieter. Guided cost learning: Deep inverse optimal control via policy optimization. In ICML, 2016.
Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
Fu, Justin, Luo, Katie, and Levine, Sergey. Learning robust rewards with adversarial inverse reinforcement learning. In ICLR, 2018.
Hausman, Karol, Chebotar, Yevgen, Schaal, Stefan, Sukhatme, Gaurav, and Lim, Joseph J. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In NIPS, 2017.
Ho, Jonathan and Ermon, Stefano. Generative adversarial imitation learning. In NIPS, 2016.
Li, Yunzhu, Song, Jiaming, and Ermon, Stefano. InfoGAIL: Interpretable imitation learning from visual demonstrations. In NIPS, 2017.
Ng, Andrew Y. and Russell, Stuart. Algorithms for inverse reinforcement learning. In ICML, 2000.
Nichol, Alex, Achiam, Joshua, and Schulman, John. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018.
Ramachandran, Deepak and Amir, Eyal. Bayesian inverse reinforcement learning. In IJCAI, 2007.
Song, Jiaming, Ren, Hongyu, Sadigh, Dorsa, and Ermon, Stefano. Multi-agent generative adversarial imitation learning. In ICLR Workshop, 2018.
Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In IROS, 2012.
Wang, Ziyu, Merel, Josh S., Reed, Scott E., de Freitas, Nando, Wayne, Gregory, and Heess, Nicolas. Robust imitation of diverse behaviors. In NIPS, 2017.
Ziebart, Brian D., Maas, Andrew, Bagnell, J. Andrew, and Dey, Anind K. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
Ziebart, Brian D., Bagnell, J. Andrew, and Dey, Anind K. Modeling interaction via the principle of maximum causal entropy. In ICML, 2010.