Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement
Abstract
Multitask reinforcement learning (RL) aims to simultaneously learn policies for solving many tasks. Several prior works have found that relabeling past experience with different reward functions can improve sample efficiency. Relabeling methods typically ask: if, in hindsight, we assume that our experience was optimal for some task, for what task was it optimal? In this paper, we show that hindsight relabeling is inverse RL, an observation that suggests that we can use inverse RL in tandem with RL algorithms to efficiently solve many tasks. We use this idea to generalize goal-relabeling techniques from prior work to arbitrary classes of tasks. Our experiments confirm that relabeling data using inverse RL accelerates learning in general multitask settings, including goal-reaching, domains with discrete sets of rewards, and those with linear reward functions.
1 Introduction
Reinforcement learning (RL) aims to acquire control policies that take actions to maximize their cumulative reward. Existing RL algorithms remain data inefficient, requiring exorbitant amounts of experience to learn even simple tasks (e.g., (Dubey et al., 2018; Kapturowski et al., 2018)). Multitask RL, where many RL problems are solved in parallel, has the potential to be more sample efficient than single-task RL, as data can be shared across tasks. Nonetheless, the problem of effectively sharing data across tasks remains largely unsolved.
The idea of sharing data across tasks has been studied at least since the 1990s (Caruana, 1997). More recently, a number of works have observed that retroactive relabeling of experience with different tasks can improve data efficiency. A common theme in prior relabeling methods is to relabel past trials with whatever goal or task was performed successfully in that trial. For example, relabeling for a goal-reaching task might use the state actually reached at the end of the trajectory as the relabeled goal, since the trajectory corresponds to a successful trial for the goal that was actually reached (Kaelbling, 1993; Andrychowicz et al., 2017; Pong et al., 2018). However, prior work has presented these goal-relabeling methods primarily as heuristics, and it remains unclear how to intelligently apply the same idea to tasks other than goal-reaching, such as those with linear reward functions.
In this paper, we formalize prior relabeling techniques under the umbrella of inverse RL: by inferring the most likely task for a given trial via inverse RL, we provide a principled formula for relabeling in arbitrary multitask problems. Inverse RL is not the same as simply assigning each trajectory to the task for which it received the highest reward. In fact, this strategy would often result in assigning most trajectories to the easiest task. Rather, inverse RL takes into account the difficulty of different tasks and the amount of reward that each yields. RL and inverse RL can be seen as complementary tools for maximizing reward: RL takes tasks and produces high-reward trajectories, while inverse RL takes trajectories and produces task labels such that the trajectories receive high reward. Formally, we prove that maximum entropy (MaxEnt) RL and MaxEnt inverse RL optimize the same multitask objective: MaxEnt RL optimizes with respect to trajectories, while MaxEnt inverse RL optimizes with respect to tasks. Unlike prior goal-relabeling techniques, we can use inverse RL to relabel experience for arbitrary task distributions, including sets of linear or discrete rewards. This observation suggests that tools from RL and inverse RL might be combined to efficiently solve many tasks simultaneously. The combination we develop, Hindsight Inference for Policy Improvement (HIPI), first relabels experience with inverse RL and then uses the relabeled experience to learn a policy (see Fig. 1). One variant of this framework follows the same design as prior goal-relabeling methods (Kaelbling, 1993; Andrychowicz et al., 2017; Pong et al., 2018) but uses inverse RL to relabel experience, a difference that allows our method to handle arbitrary task families. The second variant has a similar flavour to self-imitation and behavior cloning methods (Oh et al., 2018; Ghosh et al., 2019; Savinov et al., 2018): we relabel past experience using inverse RL and then learn a policy via task-conditioned behavior cloning. Both algorithms can be interpreted as a probabilistic reinterpretation and generalization of prior work.
The main contribution of our paper is the observation that hindsight relabeling is inverse RL. This observation not only provides insight into the success of prior relabeling methods, but also provides guidance on applying relabeling to arbitrary multitask RL problems. That RL and inverse RL can be used in tandem is not a coincidence; we prove that MaxEnt RL and MaxEnt inverse RL optimize the same multitask RL objective with respect to trajectories and tasks, respectively. Our second contribution consists of two simple algorithms that use inverse-RL-based relabeling to accelerate RL. Our experiments on complex simulated locomotion and manipulation tasks demonstrate that our method outperforms state-of-the-art methods on tasks including goal-reaching, running in various directions, and performing a host of manipulation tasks.
2 Prior Work
The focus of our work is on multitask RL problems, for which a number of algorithms have been proposed over the past decades (Thrun & Pratt, 2012; Hessel et al., 2019; Teh et al., 2017; Espeholt et al., 2018; Riedmiller et al., 2018). Existing approaches still struggle to reuse data across multiple tasks, with researchers often finding that training separate models is a very strong baseline (Yu et al., 2020) and using independently-trained models as an initialization or prior for multitask models (Parisotto et al., 2015; Rusu et al., 2015; Ghosh et al., 2017; Teh et al., 2017). When applying off-policy RL in the multitask setting, a common trick is to take experience collected when performing task A and pretend that it was collected for task B by recomputing the rewards at each step. This technique effectively inflates the amount of data available for learning, and a number of prior works have found it quite effective (Kaelbling, 1993; Pong et al., 2018; Andrychowicz et al., 2017; Schaul et al., 2015). In this paper, we show that the relabeling done in prior work can be understood as inverse RL.
If RL asks how to go from a reward function to a policy, inverse RL asks the opposite question: after observing an agent acting in an environment, can we infer which reward function the agent was trying to optimize? A number of inverse RL algorithms have been proposed (Ratliff et al., 2006; Abbeel & Ng, 2004), with MaxEnt inverse RL being one of the most commonly used frameworks (Ziebart et al., 2008; Finn et al., 2016; Javdani et al., 2015). Since MaxEnt inverse RL can be viewed as an inference problem, we can calculate either the posterior distribution over reward functions or the maximum a posteriori (MAP) estimate. While most prior work is concerned with MAP estimates, we follow Hadfield-Menell et al. (2017) in using the full posterior distribution. Section 3 discusses how MaxEnt RL and MaxEnt inverse RL are closely connected, with one problem being the dual of the other. It is therefore not a coincidence that many MaxEnt inverse RL algorithms involve solving a MaxEnt RL problem in the inner loop. Our paper proposes the opposite: using MaxEnt inverse RL in the inner loop of MaxEnt RL.
Our work builds on the idea that MaxEnt RL can be viewed as probabilistic inference. This idea has been proposed in a number of prior works (Kappen et al., 2012; Toussaint, 2009; Todorov, 2008, 2007; Rawlik et al., 2013; Theodorou & Todorov, 2012; Levine, 2018) and used to build a number of modern RL algorithms (Haarnoja et al., 2017, 2018a; Abdolmaleki et al., 2018). Perhaps the most relevant prior work is Rawlik et al. (2013), which emphasizes that MaxEnt RL can be viewed as minimizing a KL divergence, an idea that we extend to the multitask setting.
3 Preliminaries
This section reviews MaxEnt RL and MaxEnt inverse RL. We start by introducing notation.
Notation. We will analyze an MDP with states $s_t$ and reward function $r(s_t, a_t)$. We assume that actions $a_t$ are sampled from a policy $\pi(a_t \mid s_t)$. The initial state is sampled $s_1 \sim p_1(s_1)$ and subsequent transitions are governed by a dynamics distribution $p(s_{t+1} \mid s_t, a_t)$. We define a trajectory as a sequence of states and actions, $\tau \triangleq (s_1, a_1, s_2, a_2, \dots)$, and write the likelihood of a trajectory under policy $\pi$ as

$$\pi(\tau) = p_1(s_1) \prod_t \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t). \qquad (1)$$
In the multitask setting, we will use $\psi$ to identify each task, and assume that we are given a prior over tasks, $p(\psi)$. The set of tasks can be continuous or discrete, finite or infinite; each particular task can be continuous or discrete valued. We define $r_\psi(s_t, a_t)$ as the reward function for task $\psi$. Our experiments will use both goal-reaching tasks, where $\psi$ is a goal state, as well as more general task distributions, where $\psi$ specifies the parameters of the reward function.
MaxEnt RL. MaxEnt RL casts the RL problem as one of sampling trajectories with probability proportional to exponentiated reward. Given a reward function $r(s_t, a_t)$, we aim to learn a policy that samples trajectories from the following target distribution, $q(\tau)$:

$$q(\tau) \triangleq \frac{1}{Z}\, p_1(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, e^{r(s_t, a_t)}. \qquad (2)$$
The partition function $Z$ is introduced to make $q(\tau)$ integrate to one. The objective function for MaxEnt RL is to maximize the entropy-regularized sum of rewards, which is equivalent to minimizing the reverse KL divergence between the policy's distribution over trajectories, $\pi(\tau)$, and the target distribution $q(\tau)$ defined in terms of rewards:

$$D_{\mathrm{KL}}\big(\pi(\tau)\,\|\,q(\tau)\big) = -\mathbb{E}_{\pi(\tau)}\Big[\sum_t r(s_t, a_t) + \mathcal{H}_\pi[a_t \mid s_t]\Big] + \log Z.$$

The partition function $Z$ does not depend on the policy, so prior RL algorithms have ignored it.
MaxEnt Inverse RL. Inverse RL observes previously-collected data and attempts to infer the intent of the actor, which is represented by a reward function $r_\psi$. MaxEnt inverse RL is a variant of inverse RL that defines the probability of trajectory $\tau$ being produced for task $\psi$ as

$$q(\tau \mid \psi) \triangleq \frac{1}{Z(\psi)}\, p_1(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, e^{r_\psi(s_t, a_t)},$$

where the partition function is $Z(\psi) \triangleq \int p_1(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, e^{r_\psi(s_t, a_t)}\, d\tau$.
Applying Bayes’ Rule, the posterior distribution over reward functions is given as follows:
$$q(\psi \mid \tau) = \frac{q(\tau \mid \psi)\, p(\psi)}{\int q(\tau \mid \psi')\, p(\psi')\, d\psi'} \;\propto\; p(\psi)\, \exp\Big(\sum_t r_\psi(s_t, a_t) - \log Z(\psi)\Big). \qquad (3)$$
While many applications of MaxEnt inverse RL use the maximum a posteriori estimate, in this paper we will use the full posterior distribution. While the partition function, an integral over all states and actions, is typically hard to compute, its dual is the MaxEnt RL problem:
$$\log Z(\psi) = \max_\pi\; \mathbb{E}_{\pi(\tau)}\Big[\sum_t r_\psi(s_t, a_t) + \mathcal{H}_\pi[a_t \mid s_t]\Big]. \qquad (4)$$
The striking similarities between MaxEnt RL and MaxEnt inverse RL are not a coincidence. As we will show in the next section, both minimize the same reverse KL divergence on the joint distribution of tasks and trajectories.
4 Hindsight Relabeling is Inverse RL
We now aim to use the tools of RL and inverse RL to solve many RL problems simultaneously, each with the same dynamics but a different reward function. Given a prior over tasks, $p(\psi)$, the target joint distribution over tasks and trajectories is
$$q(\psi, \tau) \triangleq p(\psi)\, q(\tau \mid \psi) = \frac{p(\psi)}{Z(\psi)}\, p_1(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, e^{r_\psi(s_t, a_t)}. \qquad (5)$$
We can express the multitask (MaxEnt) RL objective as the reverse KL divergence between the joint trajectory-task distributions:
$$\max_\pi\; -D_{\mathrm{KL}}\big(\pi(\psi, \tau)\,\|\,q(\psi, \tau)\big). \qquad (6)$$
If we factor the joint distribution as $\pi(\psi, \tau) = p(\psi)\, \pi(\tau \mid \psi)$, Eq. 6 is equivalent to maximizing the expected (entropy-regularized) reward of a task-conditioned policy $\pi(a \mid s, \psi)$:

$$\max_\pi\; \mathbb{E}_{\psi \sim p(\psi),\, \tau \sim \pi(\tau \mid \psi)}\Big[\sum_t r_\psi(s_t, a_t) + \mathcal{H}_\pi[a_t \mid s_t, \psi] - \log Z(\psi)\Big].$$
Since the distribution over tasks, $p(\psi)$, is fixed, we can ignore the $\log Z(\psi)$ term for optimization. A less common but more intriguing choice is to factor $\pi(\psi, \tau) = \pi(\tau)\, \pi(\psi \mid \tau)$, where $\pi(\tau)$ is represented non-parametrically as a distribution over previously-observed trajectories, and $\pi(\psi \mid \tau)$ is a relabeling distribution. We find the optimal relabeling distribution by first rewriting Eq. 6,

$$\max_{\pi(\psi \mid \tau)}\; \mathbb{E}_{\tau \sim \pi(\tau),\, \psi \sim \pi(\psi \mid \tau)}\Big[\sum_t r_\psi(s_t, a_t) + \log p(\psi) - \log Z(\psi) - \log \pi(\psi \mid \tau)\Big] + \text{const},$$

and then solving for the optimal relabeling distribution, ignoring terms that do not depend on $\pi(\psi \mid \tau)$:
$$\pi(\psi \mid \tau) \;\propto\; p(\psi)\, \exp\Big(\sum_t r_\psi(s_t, a_t) - \log Z(\psi)\Big). \qquad (7)$$
The key observation here is that the optimal relabeling distribution corresponds exactly to the MaxEnt inverse RL posterior over tasks (Eq. 3). Thus, we can obtain the optimal relabeling distribution via inverse RL. While the optimal relabeling distribution derived here depends on the entire trajectory, Appendix B shows how to perform relabeling when given a transition rather than an entire trajectory:
$$\pi(\psi \mid s_t, a_t) \;\propto\; p(\psi)\, \exp\big(Q^\pi_\psi(s_t, a_t) - \log Z(\psi)\big), \qquad (8)$$

where $Q^\pi_\psi$ is the soft Q-function for task $\psi$ (see Appendix B).
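To make this concrete, the following sketch (ours, not part of the paper's implementation) computes the relabeling distribution of Eq. 7 for a single trajectory over a discrete set of candidate tasks; the per-task returns and log-partition estimates are assumed to be given as inputs. Replacing the return with a transition's soft Q-value gives the transition-level version in Eq. 8.

```python
import numpy as np

def relabeling_distribution(returns, log_Z, log_prior):
    """Optimal relabeling distribution (Eq. 7) over a discrete set of K tasks.

    returns:   [K] array, sum_t r_psi(s_t, a_t) of one trajectory under each task
    log_Z:     [K] array, log partition function log Z(psi) for each task
    log_prior: [K] array, log p(psi)
    Returns a [K] array of probabilities pi(psi | tau).
    """
    logits = log_prior + returns - log_Z
    logits -= logits.max()              # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example with three candidate tasks and a uniform prior (made-up values).
returns = np.array([4.0, 1.0, -2.0])
log_Z = np.array([5.0, 0.5, -1.0])      # task 0 is "easy": most behavior earns high reward
log_prior = np.log(np.ones(3) / 3.0)
print(relabeling_distribution(returns, log_Z, log_prior))
```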
In the next section, we show that prior goal-relabeling methods are a special case of inverse RL.
4.1 Special Case: Goal Relabeling
A number of prior works have explicitly (Kaelbling, 1993; Andrychowicz et al., 2017; Pong et al., 2018) and implicitly (Savinov et al., 2018; Lynch et al., 2019; Ghosh et al., 2019) found that hindsight relabeling can accelerate learning for goal-reaching tasks, where tasks correspond to goal states. These prior relabeling methods are a special case of inverse RL. We define a goal-conditioned reward function that penalizes the agent for failing to reach the goal at the terminal step:
$$r_\psi(s_t, a_t) \triangleq \begin{cases} 0 & \text{if } t < T \\ \log \mathbb{1}(s_T = \psi) & \text{if } t = T. \end{cases} \qquad (9)$$
We assume that the time step is included in the observation to ensure that this reward function is Markovian. With this reward function, the optimal relabeling distribution from Eq. 7 is simply $\pi(\psi \mid \tau) = \mathbb{1}(\psi = s_T)$, where $s_T$ is the final state in trajectory $\tau$. Thus, relabeling with the state actually reached is equivalent to inverse RL when using the reward function in Eq. 9. While this reward function makes inverse RL particularly convenient, it is rarely the metric of success that we actually care about. Viewing goal relabeling as a special case of inverse RL under a special reward function allows us to extend goal relabeling to arbitrary reward functions and arbitrary task distributions. In our experiments, we show that inverse RL seamlessly handles task distributions including goal-reaching, discrete sets of tasks, and linear reward functions.
4.2 The Importance of the Partition Function
The partition function used by inverse RL is important for hindsight relabeling because it normalizes the rewards of tasks with varying difficulty and reward scale. Fig. 2 shows a didactic example with two tasks, where the rewards for one task are larger than the rewards for the other task. Relabeling with the task under which the agent received the largest reward (akin to Andrychowicz et al. (2017)) fails, because all experience will be relabeled with the first (easier) task. Subtracting the partition function from the rewards (as in Eq. 7) results in the desired behavior: each trajectory is assigned to the task for which it is optimal.
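The following toy computation (with made-up numbers, not the values of Fig. 2) illustrates the same failure mode and its fix:

```python
import numpy as np

# Returns sum_t r_psi(s_t, a_t) for two trajectories (rows) under two tasks (columns).
# Task 0 hands out large rewards to almost any behavior; task 1's rewards are small.
returns = np.array([[10.0, 0.5],   # trajectory A
                    [ 9.0, 2.0]])  # trajectory B
log_Z = np.array([9.0, 1.0])       # the easy task has a much larger partition function

# Relabeling with the largest raw return assigns both trajectories to the easy task.
print("max-reward labels:", returns.argmax(axis=1))              # -> [0, 0]

# Subtracting log Z(psi) (Eq. 7 with a uniform prior) assigns each trajectory
# to the task for which it is *relatively* good.
print("inverse-RL labels:", (returns - log_Z).argmax(axis=1))    # -> [0, 1]
```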
4.3 How Much Does Relabeling Help?
Up to now, we have shown that the optimal way to relabel data is via inverse RL. How much does relabeling help? We now obtain a lower bound on the improvement from relabeling. Both lemmas in this section assume that a joint distribution over tasks and trajectories, $\pi(\psi, \tau)$, is given (e.g., specified by a task-conditioned policy together with the commanded tasks). We define $\pi(\tau)$ as the marginal distribution over trajectories and then construct $\pi_{\mathrm{rel}}(\psi, \tau) \triangleq \pi(\tau)\, \pi_{\mathrm{rel}}(\psi \mid \tau)$ using the optimal relabeling distribution (Eq. 7). We first show that relabeling data using inverse RL improves the MaxEnt RL objective:
Lemma 1.
The relabeled distribution is closer to the target distribution than the original distribution, as measured by the KL divergence:

$$D_{\mathrm{KL}}\big(\pi_{\mathrm{rel}}(\psi, \tau)\,\|\,q(\psi, \tau)\big) \;\le\; D_{\mathrm{KL}}\big(\pi(\psi, \tau)\,\|\,q(\psi, \tau)\big).$$
Proof.
Of the many possible relabeling distributions, one choice is to do no relabeling, assigning to each trajectory the task that was commanded when the trajectory was collected. Denote this relabeling distribution $\pi_{\mathrm{orig}}(\psi \mid \tau)$, so $\pi(\psi, \tau) = \pi(\tau)\, \pi_{\mathrm{orig}}(\psi \mid \tau)$. Because $\pi_{\mathrm{rel}}(\psi \mid \tau)$ was chosen to minimize the KL divergence among all relabeling distributions (including $\pi_{\mathrm{orig}}$), the desired inequality holds:

$$D_{\mathrm{KL}}\big(\pi(\tau)\, \pi_{\mathrm{rel}}(\psi \mid \tau)\,\|\,q(\psi, \tau)\big) \;\le\; D_{\mathrm{KL}}\big(\pi(\tau)\, \pi_{\mathrm{orig}}(\psi \mid \tau)\,\|\,q(\psi, \tau)\big).$$
∎
Thus, the relabeled data is an improvement over the original data, achieving a larger entropy-regularized reward (Eq. 6). As our experiments will confirm, relabeling data will accelerate learning. Our next result gives a lower bound on this improvement:
Lemma 2.
The improvement in the MaxEnt RL objective (Eq. 6) gained by relabeling is lower bounded as follows:

$$D_{\mathrm{KL}}\big(\pi(\psi, \tau)\,\|\,q(\psi, \tau)\big) - D_{\mathrm{KL}}\big(\pi_{\mathrm{rel}}(\psi, \tau)\,\|\,q(\psi, \tau)\big) \;\ge\; \mathbb{E}_{\pi(\tau)}\Big[D_{\mathrm{KL}}\big(\pi(\psi \mid \tau)\,\|\,\pi_{\mathrm{rel}}(\psi \mid \tau)\big)\Big].$$
The proof, a straightforward application of information geometry, is in Appendix A. This result says that the amount that relabeling helps is at least as large as the divergence between the original task labels and the task labels inferred by inverse RL, $\mathbb{E}_{\pi(\tau)}\big[D_{\mathrm{KL}}(\pi(\psi \mid \tau)\,\|\,\pi_{\mathrm{rel}}(\psi \mid \tau))\big]$. Note that, when we have learned the optimal policy (Eq. 5), our experience is already optimally labeled, so relabeling has no effect.
5 Using Inverse RL to Accelerate RL
In this section, we outline a general recipe, Hindsight Inference for Policy Improvement (HIPI), for using inverse RL to accelerate the learning of downstream tasks. Given a dataset of trajectories, we use inverse RL to infer for which tasks those trajectories are optimal. We discuss two options for how to use these relabeled trajectories. One option is to apply off-policy RL on top of these relabeled trajectories. This option generalizes previously-introduced hindsight relabeling techniques (Kaelbling, 1993; Andrychowicz et al., 2017), allowing them to be applied to task distributions beyond goal-reaching. A second option is to apply behavior cloning to the relabeled experience. This option generalizes a number of previous methods, extending variational policy search (Peters & Schaal, 2007; Dayan & Hinton, 1997; Levine & Koltun, 2013; Peng et al., 2019) to the multitask setting and extending goal-conditioned imitation learning (Ghosh et al., 2019; Savinov et al., 2018; Lynch et al., 2019) to arbitrary task distributions.
5.1 Using Relabeled Data for Off-Policy RL (HIPI-RL)
Off-policy RL algorithms, such as Q-learning and actor-critic algorithms, represent a broad class of modern RL methods. These algorithms maintain a replay buffer of previously seen experience, and we can relabel this experience using inverse RL when sampling from the replay buffer. As noted in Section 4.1, hindsight experience replay (Andrychowicz et al., 2017) can be viewed as a special case of this idea. Viewing relabeling as inverse RL, we can extend these methods to general classes of reward functions.
There are many inverse RL algorithms, and we outline one approximate algorithm that can be efficiently integrated into off-policy RL. To relabel entire trajectories, we would start by computing the cumulative reward $\sum_t r_\psi(s_t, a_t)$. However, most off-policy RL algorithms maintain a replay buffer that stores transitions, rather than entire trajectories. In this case, following Eq. 8, we instead use the soft Q-function, $Q^\pi_\psi(s_t, a_t)$. We approximate the partition function using Monte Carlo samples from within a batch of size $B$:

$$\log Z(\psi) \approx \log \frac{1}{B} \sum_{i=1}^{B} \exp\big(Q^\pi_\psi(s_i, a_i)\big).$$

We finally sample tasks following Eq. 7 (with the return replaced by the soft Q-function):

$$\psi_i \sim \pi(\psi \mid s_i, a_i) \propto p(\psi)\, \exp\big(Q^\pi_\psi(s_i, a_i) - \log Z(\psi)\big).$$
We summarize the procedure for relabeling with inverse RL in Alg. 1. The application of relabeling with inverse RL to off-policy RL, which we call HIPI-RL, is summarized in Alg. 2. We emphasize that Alg. 1 is just one of many methods for performing inverse RL. Alternative methods include gradient-based optimization of the per-sample task and learning a parametric task sampler to approximate the optimal relabeling distribution (Eq. 7). We leave these to future work.
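A minimal numpy sketch of this in-batch approximation (in the spirit of Alg. 1, with our own variable names): the soft Q-function is treated as a black-box callable, a uniform prior over the candidate tasks is assumed, and only the relabeling step is shown.

```python
import numpy as np

def relabel_batch(soft_q, states, actions, tasks, rng=np.random.default_rng()):
    """Sample relabeled tasks for a batch of transitions (cf. Alg. 1).

    soft_q(states, actions, task) -> [B] array of soft Q-values Q_psi(s, a).
    states, actions: B transitions sampled from the replay buffer.
    tasks: K candidate tasks (e.g., the originally-commanded tasks in the batch).
    Returns an index into `tasks` for each transition, sampled from Eq. 8.
    """
    B, K = len(states), len(tasks)
    q = np.stack([soft_q(states, actions, tasks[k]) for k in range(K)], axis=1)  # [B, K]

    # In-batch Monte Carlo estimate of log Z(psi): log (1/B) sum_i exp(Q_psi(s_i, a_i)).
    q_max = q.max(axis=0, keepdims=True)
    log_Z = (q_max + np.log(np.exp(q - q_max).mean(axis=0, keepdims=True))).squeeze(0)

    logits = q - log_Z                              # uniform prior over candidate tasks
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)       # pi(psi | s_i, a_i) for each transition

    return np.array([rng.choice(K, p=probs[i]) for i in range(B)])
```

The relabeled transitions, with rewards recomputed under the sampled tasks, would then feed a standard off-policy update as in Alg. 2.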
5.2 Using Relabeled Data for Behavior Cloning
We now introduce a second method that uses data relabeled with inverse RL to acquire control policies. The idea is quite simple: given arbitrary data, first relabel that data with inverse RL, and then perform task-conditioned behavior cloning. We call this procedure HIPI-BC and summarize it in Alg. 3. Why should we expect this procedure to work? The intuition is that relabeling with inverse RL makes the joint distribution of tasks and trajectories closer to the target distribution (i.e., it maximizes the multitask MaxEnt RL objective (Eq. 6)). To convert this joint distribution into an actionable representation, we extract the policy implicitly defined by the relabeled trajectories. Behavioral cloning (i.e., supervised learning) does precisely this.
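The following sketch spells out this two-step recipe under simplifying assumptions (trajectory-level relabeling, a discrete candidate task set, and a placeholder supervised-learning routine); it is an illustration of Alg. 3, not the paper's exact implementation.

```python
import numpy as np

def hipi_bc(trajectories, tasks, returns, log_Z, fit_policy, rng=np.random.default_rng()):
    """Relabel with inverse RL (Eq. 7), then task-conditioned behavior cloning.

    trajectories: list of N trajectories, each a list of (state, action) pairs.
    tasks:        [K] array of candidate tasks.
    returns:      [N, K] array; returns[i, k] = sum_t r_{tasks[k]}(s_t, a_t) for trajectory i.
    log_Z:        [K] array of log partition estimates (e.g., via Eq. 4).
    fit_policy:   supervised learner mapping (state, task) inputs to action targets.
    """
    # Step 1: inverse RL posterior over tasks for every trajectory (uniform prior).
    logits = returns - log_Z
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)          # pi(psi | tau) for each trajectory

    # Step 2: sample a hindsight task label per trajectory and behavior clone.
    states, task_labels, actions = [], [], []
    for i, traj in enumerate(trajectories):
        psi = tasks[rng.choice(len(tasks), p=probs[i])]
        for s, a in traj:
            states.append(s)
            task_labels.append(psi)
            actions.append(a)
    return fit_policy(np.array(states), np.array(task_labels), np.array(actions))
```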
Relationship to Prior Methods
Prior work on goal-conditioned supervised learning, self-imitation learning, and reward-weighted regression can all be understood as special cases. Goal-conditioned supervised learning (Savinov et al., 2018; Ghosh et al., 2019; Lynch et al., 2019) learns a goal-conditioned policy using a dataset of past experience. For a given state, the action that was actually taken is treated as the correct action (i.e., label) for states reached in the future, and a policy is learned via supervised learning. As discussed in Section 4.1, relabeling with the goal actually achieved is a special case of our framework. We refer the reader to those papers for evidence that combining inverse RL (albeit a trivial special case) with behavior cloning can effectively learn complex control policies. Self-imitation learning (Oh et al., 2018) and iterative maximum likelihood training (Liang et al., 2016) augment RL with supervised learning on a handful of the best previously-seen trajectories, an approach that can also be viewed as inverse RL followed by supervised learning. However, because the connection to inverse RL is not made precise, these methods omit the partition function, which may prove problematic when extending them to multitask settings. Finally, single-task RL methods based on variational policy search (Levine, 2018) and reward-weighted regression (Peters & Schaal, 2007; Peng et al., 2019) can also be viewed in this framework. Noting that the optimal relabeling distribution is given by Eq. 7, relabeling by sampling from the inverse RL posterior and then performing behavior cloning can be written concisely as the following objective:

$$\max_\theta\; \mathbb{E}_{\tau \sim \pi(\tau),\; \psi \sim \pi(\psi \mid \tau)}\Big[\sum_t \log \pi_\theta(a_t \mid s_t, \psi)\Big], \qquad \pi(\psi \mid \tau) \propto p(\psi)\, e^{\sum_t r_\psi(s_t, a_t) - \log Z(\psi)}.$$
The key difference between this objective and prior work is the partition function. The observation that these prior methods are special cases of inverse RL allows us to apply similar ideas to arbitrary classes of reward functions, a capability we showcase in our experiments.
6 Experiments: Relabeling with Inverse RL Accelerates Learning
Our experiments focus on two methods for using relabeled data: off-policy RL (Alg. 2) and behavior cloning (Alg. 3). We evaluate our method on both goal-reaching tasks as well as more general task distributions, including linear combinations of a reward basis and discrete sets of tasks (see Fig. 3).
6.1 HIPI-RL: Inverse RL for Off-Policy RL
Our first set of experiments applies Alg. 2 to domains with varying reward structure, demonstrating how relabeling data with inverse RL can accelerate off-policy RL.
Didactic Example
We start with a didactic example to motivate why relabeling experience with inverse RL would accelerate off-policy RL. In the gridworld shown in Fig. 4, we construct a dataset with two trajectories. From state A, inverse RL identifies both goals as likely intentions, so we include both in the relabeled data; final-state relabeling (HER) only makes use of one of the two trajectories. We then apply Q-learning to both datasets. Whereas Q-learning with final-state relabeling only succeeds at reaching the goals in the top row, our approach, which corresponds to Q-learning with inverse RL relabeling, succeeds at reaching all goals. The remainder of this section shows the benefits of relabeling using inverse RL in domains of increasing complexity.
Goal-Reaching Task Distributions. We next apply our method to goal-reaching tasks, where each task corresponds to reaching a different goal state. We used six domains: a quadruped locomotion task, a robotic finger turning a knob, a 2D reacher, a reaching task on the Sawyer robot, a 2D navigation environment with obstacles, and a reaching task on the Jaco robot. Appendix C provides details of all tasks. We compared our method against four alternative relabeling strategies: relabeling with the final state reached (HER (Andrychowicz et al., 2017)), relabeling with a randomly-sampled task, relabeling with a future state in the same trajectory, and doing no relabeling (SAC (Haarnoja et al., 2018a)). For tasks where the goal state only specifies certain dimensions of the state, relabeling with the final state and future state requires privileged information indicating to which state dimensions the goal corresponds.
As shown in Fig. 5, relabeling experience with inverse RL (our method) always learns at least as quickly as the other relabeling strategies, and often achieves larger asymptotic reward. While final-state relabeling (HER) performs well on some tasks, it is worse than random relabeling on other tasks. We also observe that random relabeling is a competitive baseline, provided that the number of gradient steps is sufficiently tuned. We conjectured that soft relabeling would be most beneficial in settings with extremely sparse rewards. To test this hypothesis, we modified the reward functions in the 2D reacher and Jaco reaching environments to be much sparser. As shown in the far right column of Fig. 5, only soft relabeling is able to make learning progress in this setting.
More General Task Distributions. Our next experiment demonstrates that, in addition to relabeling goals, inverse RL can also relabel experience for more general task distributions. Our first task distribution is a discrete set of goal states for the 2D reacher environment. The second task distribution highlights the capability of inverse RL to relabel experience for classes of reward functions defined as linear combinations of features, $r_\psi(s_t, a_t) = \psi^\top \phi(s_t, a_t)$. We use the walker environment, with features corresponding to torso height, velocity, relative position of the feet, and a control cost. The third task distribution is again a goal-reaching task, but one where the task indicates both the goal state as well as the desired margin from that goal state. As prior relabeling approaches are not applicable to these general task distributions, we only compared our approach to random relabeling and no relabeling (SAC (Haarnoja et al., 2018a)). As shown in Fig. 6, relabeling with inverse RL provides more sample-efficient learning on all tasks, and the asymptotic reward is larger than the baselines by a nontrivial amount in two of the three tasks.
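For linear reward classes, the per-task returns needed by Eq. 7 reduce to a single matrix product of per-trajectory feature sums against the sampled task vectors. A small sketch with hypothetical feature values (our notation, not the experiment's actual data):

```python
import numpy as np

# Feature sums Phi[i] = sum_t phi(s_t, a_t) for N trajectories (rows), D features (cols).
Phi = np.array([[ 3.0, 0.2, -1.0],
                [-2.5, 1.0, -0.8]])
# K sampled task vectors psi; the return of trajectory i under task k is Phi[i] @ Psi[k].
Psi = np.array([[ 1.0, 0.0, 0.1],
                [-1.0, 0.5, 0.1]])

returns = Phi @ Psi.T                        # [N, K] per-task returns
log_Z = np.array([2.0, 1.5])                 # per-task log partition estimates (placeholder)

logits = returns - log_Z                     # Eq. 7 with a uniform prior over sampled tasks
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)    # relabeling distribution for each trajectory
print(probs)
```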
6.2 HIPI-BC: Behavior Cloning on Experience Relabeled with Inverse RL
In this section, we present experiments that use behavior cloning on top of relabeled experience (Alg. 3). The three domains we use have varying reward structure: (1) half-cheetah with continuous goal velocities; (2) quadruped with linear reward functions; and (3) the manipulation environment with nine discrete tasks. For the half-cheetah and quadruped domains, we collected 1000 demonstrations from a policy trained with off-policy RL. For the manipulation environment, Lynch et al. (2019) provided a dataset of 100 demonstrations for each of these tasks, which we aggregate into a dataset of 900 demonstrations. In all settings, we discarded the task labels, simulating the common real-world setting where experience does not come prepared with task labels. As shown in Fig. 7, first inferring the tasks with inverse RL and then performing behavioral cloning results in significantly higher final rewards than task-agnostic behavior cloning on the entire dataset, which is no better than random.
Our final experiment demonstrates the importance of the partition function. On the cheetah domain, we synthetically corrupt the demonstrations by adding a constant bias to the reward for the first task (whichever velocity was sampled first). We then compare the performance of our approach against an ablation that did not normalize by the partition function when relabeling data. As shown in Fig. 8, using task rewards of different scales significantly degrades the performance of the ablation. Our method, which normalizes the task rewards in the inverse RL step, is not affected by reward scaling.
7 Discussion
In this paper, we introduced the idea that hindsight relabeling is inverse RL. We showed that a number of prior works can be understood as special cases of this general framework. The idea that inverse RL might be used to relabel data is powerful because it enables us to extend relabeling techniques to general classes of reward functions. We explored two particular instantiations of this idea, using experience relabeled with inverse RL for off-policy RL and for supervised learning.
We are only scratching the surface of the many ways relabeled experience might be used to accelerate learning. For example, the problem of task inference is ever-present in meta-learning, and it is intriguing to imagine explicitly incorporating inverse RL into meta-RL. Broadly, we hope that the observation that inverse RL can be used to accelerate RL will spur research on better inverse RL algorithms, which in turn will provide better RL algorithms.
Acknowledgements We thank Yevgen Chebotar, Aviral Kumar, Vitchyr Pong, and Anirudh Vemula for formative discussions. We are grateful to Ofir Nachum for pointing out the duality between MaxEnt RL and the partition function, and to Karol Hausman for reviewing an early draft of this paper. We thank Stephanie Chan, Corey Lynch, and Pierre Sermanet for providing the desk manipulation environment. This research was supported by the Fannie and John Hertz Foundation, NASA, DARPA, US Army, and the National Science Foundation (IIS-1700696, IIS-1700697, IIS-1763562, and DGE-1745016). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Appendix A Proof of Lemma 2
This section provides a proof of Lemma 2.
Proof.
The optimal relabeling distribution can be viewed as an information projection of the joint distribution $\pi(\psi, \tau)$ onto the target distribution $q(\psi, \tau)$ (Eq. 5):

$$\pi_{\mathrm{rel}}(\psi, \tau) = \operatorname*{arg\,min}_{\pi' \in \mathcal{P}} D_{\mathrm{KL}}\big(\pi'(\psi, \tau)\,\|\,q(\psi, \tau)\big),$$

where $\mathcal{P}$ is the set of all joint distributions $\pi'(\psi, \tau)$ with marginal $\pi'(\tau) = \pi(\tau)$. Note that this set is closed and convex. We then apply Theorem 11.6.1 from Cover & Thomas (2006):
$$D_{\mathrm{KL}}\big(\pi(\psi, \tau)\,\|\,q(\psi, \tau)\big) \;\ge\; D_{\mathrm{KL}}\big(\pi_{\mathrm{rel}}(\psi, \tau)\,\|\,q(\psi, \tau)\big) + D_{\mathrm{KL}}\big(\pi(\psi, \tau)\,\|\,\pi_{\mathrm{rel}}(\psi, \tau)\big). \qquad (10)$$
The second KL divergence on the RHS can be simplified, using the fact that $\pi$ and $\pi_{\mathrm{rel}}$ share the same marginal over trajectories:

$$D_{\mathrm{KL}}\big(\pi(\psi, \tau)\,\|\,\pi_{\mathrm{rel}}(\psi, \tau)\big) = \mathbb{E}_{\pi(\tau)}\Big[D_{\mathrm{KL}}\big(\pi(\psi \mid \tau)\,\|\,\pi_{\mathrm{rel}}(\psi \mid \tau)\big)\Big].$$
Substituting this simplification into Eq. 10 and rearranging terms, we obtain the desired result. ∎
Appendix B Inverse RL on Transitions
For simplicity, our derivation of relabeling in Section 4 assumed that entire trajectories were provided. This section outlines how to do relabeling with inverse RL when we are only provided with transitions, rather than entire trajectories. This derivation will motivate the use of the soft Q-function in Eq. 8.
In this case, the policy distribution in the MaxEnt RL objective (Eq. 6) is conditioned on the current state and action in addition to the task $\psi$:
$$\max_\pi\; -D_{\mathrm{KL}}\big(\pi(\psi, \tau \mid s_t, a_t)\,\|\,q(\psi, \tau \mid s_t, a_t)\big). \qquad (11)$$

Following the derivation in Section 4, we expand this objective, using $\pi(\psi \mid s_t, a_t)$ as our relabeling distribution:

$$\mathbb{E}_{\psi \sim \pi(\psi \mid s_t, a_t)}\bigg[\mathbb{E}_{\pi}\Big[\sum_{t' \ge t} r_\psi(s_{t'}, a_{t'}) + \sum_{t' > t} \mathcal{H}_\pi[a_{t'} \mid s_{t'}, \psi]\Big] + \log p(\psi) - \log Z(\psi) - \log \pi(\psi \mid s_t, a_t)\bigg] + \text{const}. \qquad (12)$$
The expected value of the two summations is the soft Q-function for the policy $\pi(a \mid s, \psi)$:

$$Q^\pi_\psi(s_t, a_t) \triangleq \mathbb{E}_{\pi}\Big[\sum_{t' \ge t} r_\psi(s_{t'}, a_{t'}) + \sum_{t' > t} \mathcal{H}_\pi[a_{t'} \mid s_{t'}, \psi]\Big]. \qquad (13)$$
Substituting Eq. 13 into Eq. 12 and ignoring terms that do not depend on $\pi(\psi \mid s_t, a_t)$, we can solve for the optimal relabeling distribution:

$$\pi(\psi \mid s_t, a_t) \;\propto\; p(\psi)\, \exp\big(Q^\pi_\psi(s_t, a_t) - \log Z(\psi)\big). \qquad (14)$$
Appendix C Experimental Details
C.1 Hyperparameters for Off-Policy RL
Except for the didactic experiment, we used SAC (Haarnoja et al., 2018a) as our RL algorithm, taking the implementation from Guadarrama et al. (2018). This implementation scales the critic loss by a factor of 0.5. Following prior work (Pong et al., 2018), we only relabeled 50% of the samples drawn from the replay buffer, using the originally-commanded task for the remaining 50%. The only hyperparameter that differed across relabeling strategies was the number of gradient updates per environment step. For each experiment, we evaluated each method with a range of values and reported the results of the best hyperparameter in our plots. Perhaps surprisingly, doing random relabeling but simply increasing the number of gradient updates per environment step was a remarkably competitive baseline.

Learning Rate: 3e-4 (same for actor, critic, and entropy dual parameter)

Batch Size: 32

Network architecture: The input was the concatenation of the state observation and the task $\psi$. Both the actor and critic networks were 2-hidden-layer ReLU networks. The actor output was squashed by a tanh activation to lie within the action space constraints. There was no activation at the final layer of the critic network, except in the desk environment (see comment below). The hidden layer dimensions were (32, 32) for the 2D navigation environments, (256, 256) for the quadruped and desk environments, and (64, 64) for all other environments.

Discount $\gamma$: 0.99

Initial data collection steps: 1e5

Target network update period: 1

Target network update rate $\tau$: 0.005

Entropy coefficient $\alpha$: We used the entropy-constrained version of SAC (Haarnoja et al., 2018b), using $-\dim(\mathcal{A})$ as the target entropy value, where $\dim(\mathcal{A})$ is the action space dimension.

Replay buffer capacity: 1e6

Optimizer: Adam

Gradient Clipping: We found that clipping the gradients to have unit norm was important to get any RL working on the Sawyer and Jaco tasks.
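For convenience, the settings above can be gathered into a single configuration; the entries below simply restate the list (placeholders are used where a value was tuned per environment), plus the relabeling fraction from the paragraph at the start of this appendix.

```python
# SAC settings from this appendix, collected into one place. Values marked None
# were tuned per environment or per relabeling strategy, as described above.
SAC_CONFIG = {
    "learning_rate": 3e-4,              # actor, critic, and entropy dual parameter
    "batch_size": 32,
    "hidden_sizes": None,               # (32, 32), (64, 64), or (256, 256) by environment
    "discount": 0.99,
    "initial_collect_steps": int(1e5),
    "target_update_period": 1,
    "target_update_tau": 0.005,
    "target_entropy": None,             # -dim(action_space), per environment
    "replay_capacity": int(1e6),
    "optimizer": "adam",
    "clip_gradient_norm": 1.0,          # unit-norm clipping (needed for Sawyer and Jaco)
    "relabel_fraction": 0.5,            # fraction of each sampled batch relabeled
    "grad_updates_per_env_step": None,  # tuned per method and environment
}
```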
To implement final state relabeling, we modified transitions as they were being added to the replay buffer, adding both the original transition and the transition augmented to use the final state as the goal. To implement future state relabeling, we modified transitions as they were being added to the replay buffer, adding both the original transition and a transition augmented to use one of the next 4 states in the same trajectory as the goal.
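A sketch of how these two baselines can be implemented at insertion time, following the description above; the Transition container, the goal-extraction function, and the handling of reward recomputation under the new goal are placeholders rather than the exact implementation.

```python
import random
from dataclasses import dataclass, replace
from typing import Any

@dataclass
class Transition:
    state: Any
    action: Any
    reward: float
    next_state: Any
    goal: Any  # the (re)labeled task; rewards are recomputed from it when sampled

def add_final_state_relabeled(buffer, trajectory, goal_of):
    """Final-state relabeling: add each original transition plus a copy whose
    goal is the state reached at the end of the trajectory."""
    final_goal = goal_of(trajectory[-1].next_state)
    for tr in trajectory:
        buffer.append(tr)
        buffer.append(replace(tr, goal=final_goal))

def add_future_state_relabeled(buffer, trajectory, goal_of, window=4):
    """Future-state relabeling: the copy uses one of the next `window` states
    in the same trajectory as the goal."""
    for t, tr in enumerate(trajectory):
        buffer.append(tr)
        future_idx = min(t + random.randint(1, window), len(trajectory) - 1)
        buffer.append(replace(tr, goal=goal_of(trajectory[future_idx].next_state)))
```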
C.2 Hyperparameters for Behavior Cloning Experiments
To account for randomness in the learning process, we collect at least 200 evaluation episodes per domain; we repeat this experiment for at least 5 random seeds on each domain, and plot the mean and standard deviation over the random seeds. We used a 2-layer neural network with ReLU activations for all experiments. The hidden layers had size (256, 256). We optimized the network to minimize MSE using the Adam optimizer with a learning rate of 3e-4. We used early stopping, halting training when the validation loss increased for 3 consecutive epochs. Typically training converged in 30-50 epochs. We normalized both the states and actions. For the task-conditioned experiments, we concatenated the task vectors to the state vectors.
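A minimal sketch of the early-stopping rule described above; the per-epoch training and validation routines are placeholders.

```python
def fit_with_early_stopping(train_epoch, validation_loss, max_epochs=100, patience=3):
    """Run supervised training, halting when the validation loss has increased
    for `patience` consecutive epochs (the rule described above)."""
    prev_loss = float("inf")
    consecutive_increases = 0
    for epoch in range(max_epochs):
        train_epoch()                 # one epoch of MSE training (e.g., Adam with lr 3e-4)
        loss = validation_loss()
        consecutive_increases = consecutive_increases + 1 if loss > prev_loss else 0
        prev_loss = loss
        if consecutive_increases >= patience:
            break
    return epoch, prev_loss           # epoch at which training stopped and final loss
```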
C.3 Quadruped Environment
The quadruped was a modified version of the environment from Abdolmaleki et al. (2018). We modified the initial state distribution so the agent always started upright, and modified the observation space to include the termination signal as part of the observation. Tasks were sampled uniformly from the unit circle. Let $v_t$ and $x_t$ indicate the XY velocity and position of the agent. For the HIPI-RL experiments, we used the following sparse reward function:

$$r_\psi(s_t, a_t) = -\mathbb{1}\big(\|x_t - \psi\|_2 > \epsilon\big),$$

and the episode terminated when $\|x_t - \psi\|_2 \le \epsilon$. We also reset the environment after 300 steps if the agent had failed to reach the goal. For the HIPI-BC experiments, we used the following dense reward function:

$$r_\psi(s_t, a_t) = \psi^\top v_t,$$

and episodes were 300 steps long.
C.4 Finger Environment
The finger environment was taken from Tassa et al. (2018b). Tasks were sampled using the environment's default goal sampling function. Let $k_t$ denote the XY position of the knob that the agent can manipulate. The reward function was defined as

$$r_\psi(s_t, a_t) = -\|k_t - \psi\|_2,$$

and the episode terminated when $\|k_t - \psi\|_2 \le \epsilon$. We also reset the environment after 1000 steps if the agent had failed to reach the goal.
C.5 2D Reacher Environment
The 2D reacher environment was taken from Tassa et al. (2018b). Let $e_t$ denote the XY position of the robot end effector. The reward function was defined as

$$r_\psi(s_t, a_t) = -\|e_t - \psi\|_2,$$

and the episode terminated when $\|e_t - \psi\|_2 \le \epsilon$, where $\epsilon$ is a margin around the goal. We used and in our experiments. We also reset the environment after 1000 steps if the agent had failed to reach the goal. Tasks were sampled using the environment's default goal sampling function.
C.6 Sawyer Reach Environment
The Sawyer reach environment was taken from Yu et al. (2019). Let $e_t$ denote the XYZ position of the robot end effector. The reward function was defined as

$$r_\psi(s_t, a_t) = -\|e_t - \psi\|_2,$$

and the episode terminated when $\|e_t - \psi\|_2 \le \epsilon$, where $\epsilon$ is a margin around the goal. We used and in our experiments. We also reset the environment after 150 steps if the agent had failed to reach the goal. Tasks were sampled using the environment's default goal sampling function. For the experiment where the task indicator also specified the margin $\epsilon$, the margin was sampled uniformly from the interval .
C.7 2D Navigation Environment
We used the 2D navigation environment from Eysenbach et al. (2019). The action space is continuous and indicates the desired change of position. The dynamics are stochastic, and the initial state and goal are sampled uniformly at random for each episode. To increase the difficulties of credit assignment and exploration, the agent is always initialized in the lower left corner, and we randomly sampled goal states that are at least 15 steps away. The layout of the obstacles is taken from the classic FourRooms domain, but dilated by a factor of three.
C.8 Jaco Reach Environment
We implemented a reaching task using a simulated Jaco robot. Goal states were sampled uniformly from the interval . The agent controlled the velocity of 6 arm joints and 3 finger joints, so the action space was 9-dimensional. The observation space was 43-dimensional. Let $e_t$ denote the XYZ position of the robot end effector. The reward function was defined as

$$r_\psi(s_t, a_t) = -\|e_t - \psi\|_2,$$

and the episode terminated when $\|e_t - \psi\|_2 \le \epsilon$, where $\epsilon$ is a margin around the goal. We used and in our experiments. We also reset the environment after 250 steps if the agent had failed to reach the goal.
C.9 Walker Environment
The walker environment was a modified version of the environment from Tassa et al. (2018a). We modified the initial state distribution so the agent always started upright, and modified the observation space to include the termination signal as part of the observation. For the linear reward function, the features are the torso height (normalized by subtracting 0.5m), velocity along the forward/aft axis, the XZ displacement of the two feet relative to the agent’s center of mass (the agent cannot move along the Y axis), and the squared L2 norm of the actions. The task coefficients can take on values in the range for all dimensions, except for the control penalty, which takes on values in . Episodes were 100 steps long.
C.10 Half-Cheetah Environment
The half-cheetah environment was taken from Tassa et al. (2018a). We define tasks to correspond to goal velocities and use the reward function from Rakelly et al. (2019):

$$r_\psi(s_t, a_t) = -|v_t - \psi|,$$

where $v_t$ is the horizontal root velocity. Tasks were sampled uniformly , with units of meters per second. Episodes were 100 steps long.
C.11 Desk Environment
The environment provided by Lynch et al. (2019) included 19 tasks. We selected the nine most challenging tasks by looking at how often a task was accidentally solved. In the demonstrations for task A, we recorded the average return on the remaining 18 tasks. We chose the nine tasks whose average reward was lowest. The nine tasks were three button pushing tasks and six block manipulation tasks.
For experiments on this environment, we found that normalizing the action space was crucial. We computed the coordinate-wise mean and standard deviation of the actions from the demonstrations, and modified the environment to implicitly normalize actions by subtracting the mean and dividing by the standard deviation. We clipped the action space to $[-1, 1]$, so the agent was only allowed to command actions within one standard deviation (as measured by the expert demos).
Another trick that was crucial for RL on this environment was clipping the critic outputs. Since the reward was in $[0, 1]$ and the episode length was capped at 128 steps, we squashed the Q-value predictions with a scaled sigmoid to lie in the range $[0, 128]$.
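A sketch of that squashing; the output bounds are left as parameters here, since they depend on the reward range and episode length noted above.

```python
import numpy as np

def squash_q(raw_critic_output, q_min, q_max):
    """Scaled sigmoid applied to the critic's final layer so that Q-value
    predictions lie in [q_min, q_max] (bounds implied by the reward range and
    the 128-step episode cap)."""
    return q_min + (q_max - q_min) / (1.0 + np.exp(-raw_critic_output))
```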
Appendix D Failed Experiments

100% Relabeling: When using inverse RL to relabel data for off-policy RL, we initially relabeled 100% of samples from the replay buffer, but found that learning was often worse than doing no relabeling at all. We therefore switched to only 50% relabeling in our experiments. We speculate that retaining some of the originally-commanded goals serves as a sort of hard-negative mining.

Coordinate Ascent on Eq. 6: We attempted to devise an EM-style algorithm that performed coordinate ascent on Eq. 6, alternating between (1) doing MaxEnt RL and (2) relabeling that data and acquiring the corresponding policy via behavior cloning. While we were unable to get this algorithm to outperform standard MaxEnt RL, we conjecture that this procedure would work with the right choice of inverse RL algorithm.
References
 Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.
 Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.
 Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
 Caruana, R. Multitask learning. Machine learning, 28(1):41–75, 1997.
 Cover, T. M. and Thomas, J. A. Elements of information theory (wiley series in telecommunications and signal processing), 2006.
 Dayan, P. and Hinton, G. E. Using expectationmaximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.
 Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., and Efros, A. A. Investigating human priors for playing video games. arXiv preprint arXiv:1802.10217, 2018.
 Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
 Eysenbach, B., Salakhutdinov, R., and Levine, S. Search on the replay buffer: Bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems, pp. 15220–15231, 2019.
 Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016.
 Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., and Levine, S. Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.
 Ghosh, D., Gupta, A., Fu, J., Reddy, A., Devine, C., Eysenbach, B., and Levine, S. Learning to reach goals without reinforcement learning. arXiv preprint arXiv:1912.06088, 2019.
 Guadarrama, S., Korattikara, A., Ramirez, O., Castro, P., Holly, E., Fishman, S., Wang, K., Gonina, E., Harris, C., Vanhoucke, V., et al. Tfagents: A library for reinforcement learning in tensorflow, 2018.
 Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1352–1361. JMLR.org, 2017.
 Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018a.
 Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.
 Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. In Advances in Neural Information Processing Systems, pp. 6765–6774, 2017.
 Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.
 Javdani, S., Srinivasa, S. S., and Bagnell, J. A. Shared autonomy via hindsight optimization. Robotics science and systems: online proceedings, 2015, 2015.
 Kaelbling, L. P. Learning to achieve goals. In IJCAI, pp. 1094–1099. Citeseer, 1993.
 Kappen, H. J., Gómez, V., and Opper, M. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182, 2012.
 Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W. Recurrent experience replay in distributed reinforcement learning. 2018.
 Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 Levine, S. and Koltun, V. Variational policy search via trajectory optimization. In Advances in neural information processing systems, pp. 207–215, 2013.
 Liang, C., Berant, J., Le, Q., Forbus, K. D., and Lao, N. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. arXiv preprint arXiv:1611.00020, 2016.
 Lynch, C., Khansari, M., Xiao, T., Kumar, V., Tompson, J., Levine, S., and Sermanet, P. Learning latent plans from play. arXiv preprint arXiv:1903.01973, 2019.
 Oh, J., Guo, Y., Singh, S., and Lee, H. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.
 Parisotto, E., Ba, J. L., and Salakhutdinov, R. Actormimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
 Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
 Peters, J. and Schaal, S. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pp. 745–750. ACM, 2007.
 Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081, 2018.
 Rakelly, K., Zhou, A., Quillen, D., Finn, C., and Levine, S. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.
 Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736. ACM, 2006.
 Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. In TwentyThird International Joint Conference on Artificial Intelligence, 2013.
 Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing - solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.
 Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., and Hadsell, R. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
 Savinov, N., Dosovitskiy, A., and Koltun, V. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.
 Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015.
 Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018a.
 Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind control suite. Technical report, DeepMind, January 2018b. URL https://arxiv.org/abs/1801.00690.
 Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.
 Theodorou, E. A. and Todorov, E. Relative entropy and free energy dualities: Connections to path integral and kl control. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pp. 1466–1473. IEEE, 2012.
 Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 2012.
 Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376, 2007.
 Todorov, E. General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pp. 4286–4292. IEEE, 2008.
 Toussaint, M. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pp. 1049–1056. ACM, 2009.
 Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. arXiv preprint arXiv:1910.10897, 2019.
 Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. Gradient surgery for multitask learning. arXiv preprint arXiv:2001.06782, 2020.
 Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. 2008.