Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement



Multi-task reinforcement learning (RL) aims to simultaneously learn policies for solving many tasks. Several prior works have found that relabeling past experience with different reward functions can improve sample efficiency. Relabeling methods typically ask: if, in hindsight, we assume that our experience was optimal for some task, for what task was it optimal? In this paper, we show that hindsight relabeling is inverse RL, an observation that suggests that we can use inverse RL in tandem with RL algorithms to efficiently solve many tasks. We use this idea to generalize goal-relabeling techniques from prior work to arbitrary classes of tasks. Our experiments confirm that relabeling data using inverse RL accelerates learning in general multi-task settings, including goal-reaching, domains with discrete sets of rewards, and those with linear reward functions.


1 Introduction

Reinforcement learning (RL) aims to acquire control policies that take actions to maximize their cumulative reward. Existing RL algorithms remain data inefficient, requiring exorbitant amounts of experience to learn even simple tasks (e.g., Dubey et al., 2018; Kapturowski et al., 2018). Multi-task RL, where many RL problems are solved in parallel, has the potential to be more sample efficient than single-task RL, as data can be shared across tasks. Nonetheless, the problem of effectively sharing data across tasks remains largely unsolved.

The idea of sharing data across tasks has been studied at least since the 1990s (Caruana, 1997). More recently, a number of works have observed that retroactive relabeling of experience with different tasks can improve data efficiency. A common theme in prior relabeling methods is to relabel past trials with whatever goal or task was performed successfully in that trial. For example, relabeling for a goal-reaching task might use the state actually reached at the end of the trajectory as the relabeled goal, since the trajectory corresponds to a successful trial for the goal that was actually reached (Kaelbling, 1993; Andrychowicz et al., 2017; Pong et al., 2018). However, prior work has presented these goal-relabeling methods primarily as heuristics, and it remains unclear how to intelligently apply the same idea to tasks other than goal-reaching, such as those with linear reward functions.

Figure 1: Hindsight Inference for Policy Improvement (HIPI): Given a dataset of prior experience, we use inverse RL to infer the agent’s intentions. We use the relabeled experience with any policy learning algorithm, such as off-policy RL or supervised learning.

In this paper, we formalize prior relabeling techniques under the umbrella of inverse RL: by inferring the most likely task for a given trial via inverse RL, we provide a principled formula for relabeling in arbitrary multi-task problems. Inverse RL is not the same as simply assigning each trajectory to the task for which it received the highest reward. In fact, this strategy would often result in assigning most trajectories to the easiest task. Rather, inverse RL takes into account the difficulty of different tasks and the amount of reward that each yields. RL and inverse RL can be seen as complementary tools for maximizing reward: RL takes tasks and produces high-reward trajectories, and inverse RL takes trajectories and produces task labels such that the trajectories receive high reward. Formally, we prove that maximum entropy (MaxEnt) RL and MaxEnt inverse RL optimize the same multi-task objective: MaxEnt RL optimizes with respect to trajectories, while MaxEnt inverse RL optimizes with respect to tasks. Unlike prior goal-relabeling techniques, we can use inverse RL to relabel experience for arbitrary task distributions, including sets of linear or discrete rewards. This observation suggests that tools from RL and inverse RL might be combined to efficiently solve many tasks simultaneously. The combination we develop, Hindsight Inference for Policy Improvement (HIPI), first relabels experience with inverse RL and then uses the relabeled experience to learn a policy (see Fig. 1). One variant of this framework follows the same design as prior goal-relabeling methods (Kaelbling, 1993; Andrychowicz et al., 2017; Pong et al., 2018) but uses inverse RL to relabel experience, a difference that allows our method to handle arbitrary task families. 
The second variant has a similar flavor to self-imitation behavior cloning methods (Oh et al., 2018; Ghosh et al., 2019; Savinov et al., 2018): we relabel past experience using inverse RL and then learn a policy via task-conditioned behavior cloning. Both algorithms can be interpreted as probabilistic reinterpretations and generalizations of prior work.

The main contribution of our paper is the observation that hindsight relabeling is inverse RL. This observation not only provides insight into the success of prior relabeling methods, but it also provides guidance on applying relabeling to arbitrary multi-task RL problems. That RL and inverse RL can be used in tandem is not a coincidence; we prove that MaxEnt RL and MaxEnt inverse RL optimize the same multi-task RL objective with respect to trajectories and tasks, respectively. Our second contribution consists of two simple algorithms that use inverse RL-based relabeling to accelerate RL. Our experiments on complex simulated locomotion and manipulation tasks demonstrate that our method outperforms state-of-the-art methods on tasks ranging from goal-reaching to running in various directions to performing a host of manipulation tasks.

2 Prior Work

The focus of our work is on multi-task RL problems, for which a number of algorithms have been proposed over the past decades (Thrun & Pratt, 2012; Hessel et al., 2019; Teh et al., 2017; Espeholt et al., 2018; Riedmiller et al., 2018). Existing approaches still struggle to reuse data across multiple tasks, with researchers often finding that training separate models is a very strong baseline (Yu et al., 2020) and using independently-trained models as an initialization or prior for multi-task models (Parisotto et al., 2015; Rusu et al., 2015; Ghosh et al., 2017; Teh et al., 2017). When applying off-policy RL in the multi-task setting, a common trick is to take experience collected when performing task A and pretend that it was collected for task B by recomputing the rewards at each step. This technique effectively inflates the amount of data available for learning, and a number of prior works have found this technique quite effective (Kaelbling, 1993; Pong et al., 2018; Andrychowicz et al., 2017; Schaul et al., 2015). In this paper we show that the relabeling done in prior work can be understood as inverse RL.

If RL is asking the question of how to go from a reward function to a policy, inverse RL asks the opposite question: after observing an agent acting in an environment, can we infer which reward function the agent was trying to optimize? A number of inverse RL algorithms have been proposed (Ratliff et al., 2006; Abbeel & Ng, 2004), with MaxEnt inverse RL being one of the most commonly used frameworks (Ziebart et al., 2008; Finn et al., 2016; Javdani et al., 2015). Since MaxEnt inverse RL can be viewed as an inference problem, we can calculate either the posterior distribution over reward functions, or the maximum a-posteriori (MAP) estimate. While most prior work is concerned with MAP estimates, we follow Hadfield-Menell et al. (2017) in using the full posterior distribution. Section 3 discusses how MaxEnt RL and MaxEnt inverse RL are closely connected, with one problem being the dual of the other. It is therefore not a coincidence that many MaxEnt inverse RL algorithms involve solving a MaxEnt RL problem in the inner loop. Our paper proposes the opposite, using MaxEnt inverse RL in the inner loop of MaxEnt RL.

Our work builds on the idea that MaxEnt RL can be viewed as probabilistic inference. This idea has been proposed in a number of prior works (Kappen et al., 2012; Toussaint, 2009; Todorov, 2008, 2007; Rawlik et al., 2013; Theodorou & Todorov, 2012; Levine, 2018) and used to build a number of modern RL algorithms (Haarnoja et al., 2017, 2018a; Abdolmaleki et al., 2018). Perhaps the most relevant prior work is Rawlik et al. (2013), which emphasizes that MaxEnt RL can be viewed as minimizing a KL divergence, an idea that we extend to the multi-task setting.

3 Preliminaries

This section reviews MaxEnt RL and MaxEnt inverse RL. We start by introducing notation.

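Notation We will analyze an MDP with states $s_t \in \mathcal{S}$ and reward function $r(s_t, a_t)$. We assume that actions $a_t \in \mathcal{A}$ are sampled from a policy $q(a_t \mid s_t)$. The initial state is sampled $s_1 \sim p_1(s_1)$ and subsequent transitions are governed by a dynamics distribution $p(s_{t+1} \mid s_t, a_t)$. We define a trajectory as a sequence of states and actions: $\tau = (s_1, a_1, s_2, a_2, \cdots)$, and write the likelihood of a trajectory under policy $q$ as

$$q(\tau) = p_1(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, q(a_t \mid s_t).$$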

In the multi-task setting, we will use $\psi \in \Psi$ to identify each task, and assume that we are given a prior $p(\psi)$ over tasks. The set of tasks can be continuous or discrete, finite or infinite; each particular task can be continuous or discrete valued. We define $r_\psi(s_t, a_t)$ as the reward function for task $\psi$. Our experiments will use both goal-reaching tasks, where $\psi$ is a goal state, as well as more general task distributions, where $\psi$ specifies the hyperparameters of the reward function.

MaxEnt RL MaxEnt RL casts the RL problem as one of sampling trajectories with probability proportional to exponentiated reward. Given a reward function $r(s_t, a_t)$, we aim to learn a policy that samples trajectories from the following target distribution, $p(\tau)$:

$$p(\tau) \triangleq \frac{1}{Z}\, p_1(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, e^{r(s_t, a_t)}.$$

The partition function $Z$ is introduced to make $p(\tau)$ integrate to one. The objective function for MaxEnt RL is to maximize the entropy-regularized sum of rewards, which is equivalent to minimizing the reverse KL divergence between the policy's distribution over trajectories, $q(\tau)$, and a target distribution, $p(\tau)$, defined in terms of rewards $r(s_t, a_t)$:

$$-D_{\mathrm{KL}}(q \,\|\, p) = \mathbb{E}_{q}\Big[\sum_t \big(r(s_t, a_t) - \log q(a_t \mid s_t)\big)\Big] - \log Z.$$

The partition function does not depend on the policy, so prior RL algorithms have ignored it.
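As a sanity check on the KL view, the following minimal sketch evaluates the entropy-regularized objective in a hypothetical one-step problem with three actions (the reward values are made up for illustration). In this setting the target distribution is simply a softmax over rewards, so the gap between $\log Z$ and the objective should vanish for the softmax policy and be positive for any other policy.

```python
import math

def maxent_objective(policy, rewards):
    """Entropy-regularized expected reward: E_q[r(a) - log q(a)]."""
    return sum(q * (r - math.log(q)) for q, r in zip(policy, rewards) if q > 0)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

rewards = [1.0, 0.0, -2.0]  # hypothetical rewards for three actions
log_Z = math.log(sum(math.exp(r) for r in rewards))

optimal = softmax(rewards)  # q(a) proportional to exp(r(a))
uniform = [1.0 / 3.0] * 3

# log Z minus the objective equals the reverse KL divergence D_KL(q || p).
gap_optimal = log_Z - maxent_objective(optimal, rewards)
gap_uniform = log_Z - maxent_objective(uniform, rewards)
print(gap_optimal, gap_uniform)
```

Because $\log Z$ is the same for every policy, comparing policies by the objective or by the KL divergence gives the same ranking in this sketch.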

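MaxEnt Inverse RL Inverse RL observes previously-collected data and attempts to infer the intent of the actor, which is represented by a reward function $r_\psi$. MaxEnt inverse RL is a variant of inverse RL that defines the probability of trajectory $\tau$ being produced for task $\psi$ as

$$p(\tau \mid \psi) = \frac{1}{Z(\psi)}\, p_1(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, e^{r_\psi(s_t, a_t)}, \qquad Z(\psi) \triangleq \int p_1(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, e^{r_\psi(s_t, a_t)}\, d\tau.$$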

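Applying Bayes' Rule, the posterior distribution over reward functions is given as follows:

$$p(\psi \mid \tau) = \frac{p(\tau \mid \psi)\, p(\psi)}{p(\tau)} \propto p(\psi) \exp\Big(\sum_t r_\psi(s_t, a_t) - \log Z(\psi)\Big). \tag{3}$$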

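While many applications of MaxEnt inverse RL use the maximum a posteriori estimate, in this paper we will use the full posterior distribution. While the partition function $Z(\psi)$, an integral over all states and actions, is typically hard to compute, its dual is the MaxEnt RL problem:

$$\log Z(\psi) = \max_{q(\tau \mid \psi)} \mathbb{E}_{q(\tau \mid \psi)}\Big[\sum_t \big(r_\psi(s_t, a_t) - \log q(a_t \mid s_t, \psi)\big)\Big]. \tag{4}$$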

The striking similarities between MaxEnt RL and MaxEnt inverse RL are not a coincidence. As we will show in the next section, both minimize the same reverse KL divergence on the joint distribution of tasks and trajectories.

4 Hindsight Relabeling is Inverse RL

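We now aim to use the tools of RL and inverse RL to solve many RL problems simultaneously, each with the same dynamics but a different reward function. Given a prior over tasks, $p(\psi)$, the target joint distribution over tasks and trajectories is

$$p(\tau, \psi) = p(\psi)\, \frac{1}{Z(\psi)}\, p_1(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, e^{r_\psi(s_t, a_t)}. \tag{5}$$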

We can express the multi-task (MaxEnt) RL objective as the reverse KL divergence between the joint trajectory-task distributions:

$$\max_{q(\tau, \psi)} \; -D_{\mathrm{KL}}\big(q(\tau, \psi) \,\|\, p(\tau, \psi)\big). \tag{6}$$

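If we factor the joint distribution as $q(\tau, \psi) = q(\tau \mid \psi)\, p(\psi)$, Eq. 6 is equivalent to maximizing the expected (entropy-regularized) reward of a task-conditioned policy $q(a_t \mid s_t, \psi)$:

$$\mathbb{E}_{\psi \sim p(\psi),\, \tau \sim q(\tau \mid \psi)}\Big[\sum_t \big(r_\psi(s_t, a_t) - \log q(a_t \mid s_t, \psi)\big) - \log Z(\psi)\Big].$$

Since the distribution over tasks, $p(\psi)$, is fixed, we can ignore the $\log Z(\psi)$ term for optimization. A less common but more intriguing choice is to factor $q(\tau, \psi) = q(\psi \mid \tau)\, q(\tau)$, where $q(\tau)$ is represented non-parametrically as a distribution over previously-observed trajectories, and $q(\psi \mid \tau)$ is a relabeling distribution. We find the optimal relabeling distribution by first rewriting Eq. 6

$$D_{\mathrm{KL}}(q \,\|\, p) = \mathbb{E}_{\tau \sim q(\tau),\, \psi \sim q(\psi \mid \tau)}\Big[\log q(\tau) + \log q(\psi \mid \tau) - \log p(\psi) + \log Z(\psi) - \log p_1(s_1) - \sum_t \big(\log p(s_{t+1} \mid s_t, a_t) + r_\psi(s_t, a_t)\big)\Big]$$

and then solving for the optimal relabeling distribution, ignoring terms that do not depend on $\psi$:

$$q(\psi \mid \tau) \propto p(\psi) \exp\Big(\sum_t r_\psi(s_t, a_t) - \log Z(\psi)\Big). \tag{7}$$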

The key observation here is that the optimal relabeling distribution corresponds exactly to the MaxEnt inverse RL posterior over tasks (Eq. 3). Thus, we can obtain the optimal relabeling distribution via inverse RL. While the optimal relabeling distribution derived here depends on the entire trajectory, Appendix B shows how to perform relabeling when given a transition rather than an entire trajectory:

$$q(\psi \mid s_t, a_t) \propto p(\psi) \exp\Big(\tilde{Q}^q(s_t, a_t, \psi) - \log Z(\psi)\Big). \tag{8}$$

In the next section we show that prior goal-relabeling methods are a special case of inverse RL.
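For readers who want the intermediate step behind the optimal relabeling distribution, the following is one way to spell it out. The symbols $\tilde{p}$, $N(\tau)$, and $C$ are shorthand introduced here for illustration and do not appear elsewhere in the paper: $\tilde{p}(\psi \mid \tau) = p(\psi) \exp(\sum_t r_\psi(s_t, a_t) - \log Z(\psi))$ is the unnormalized posterior, $N(\tau) = \int \tilde{p}(\psi \mid \tau)\, d\psi$ is its normalizer, and $C$ collects the terms that depend only on $\tau$ (the marginal $q(\tau)$, the initial state distribution, and the dynamics).

```latex
\begin{align*}
D_{\mathrm{KL}}\big(q(\psi \mid \tau)\, q(\tau) \,\|\, p(\tau, \psi)\big)
  &= \mathbb{E}_{\tau \sim q(\tau)}\Big[
       \mathbb{E}_{\psi \sim q(\psi \mid \tau)}\big[
         \log q(\psi \mid \tau) - \log \tilde{p}(\psi \mid \tau)
       \big]\Big] + C \\
  &= \mathbb{E}_{\tau \sim q(\tau)}\Big[
       D_{\mathrm{KL}}\big(q(\psi \mid \tau) \,\|\, p(\psi \mid \tau)\big)
       - \log N(\tau)
     \Big] + C,
  \qquad p(\psi \mid \tau) = \frac{\tilde{p}(\psi \mid \tau)}{N(\tau)}.
\end{align*}
```

Neither $N(\tau)$ nor $C$ depends on the relabeling distribution, and the inner KL divergence is non-negative and zero only when $q(\psi \mid \tau) = p(\psi \mid \tau)$, so the minimizer is the normalized posterior, which is proportional to $\tilde{p}(\psi \mid \tau)$.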

4.1 Special Case: Goal Relabeling

A number of prior works have explicitly (Kaelbling, 1993; Andrychowicz et al., 2017; Pong et al., 2018) and implicitly (Savinov et al., 2018; Lynch et al., 2019; Ghosh et al., 2019) found that hindsight relabeling can accelerate learning for goal-reaching tasks, where tasks $\psi$ correspond to goal states. These prior relabeling methods are a special case of inverse RL. We define a goal-conditioned reward function that penalizes the agent for failing to reach the goal at the terminal step:

$$r_\psi(s_t, a_t) = \begin{cases} -\infty & \text{if } t = T \text{ and } s_t \neq \psi \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$

We assume that the time step is included in the observation to ensure that this reward function is Markovian. With this reward function, the optimal relabeling distribution from Eq. 7 is simply $q(\psi \mid \tau) = \mathbb{1}(\psi = s_T)$, where $s_T$ is the final state in trajectory $\tau$. Thus, relabeling with the state actually reached is equivalent to inverse RL when using the reward function in Eq. 9. While inverse RL is particularly convenient when using this reward function, it is rarely the metric of success that we actually care about. Viewing goal relabeling as a special case of inverse RL under a special reward function allows us to extend goal relabeling to arbitrary reward functions and arbitrary task distributions. In our experiments, we show that inverse RL seamlessly handles task distributions including goal-reaching, discrete sets of tasks, and linear reward functions.
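The collapse of the posterior onto the final state can be checked numerically. The sketch below assumes a small discrete set of candidate goals, a uniform prior, and made-up finite log-partition values; the function names and the state labels are hypothetical.

```python
import math

def goal_return(trajectory, goal):
    """Cumulative reward: -inf if the final state misses the goal, else 0."""
    return 0.0 if trajectory[-1] == goal else -math.inf

def goal_posterior(trajectory, goals, log_partition):
    """Relabeling distribution over goals, assuming a uniform prior."""
    weights = [math.exp(goal_return(trajectory, g) - log_partition[g]) for g in goals]
    total = sum(weights)
    return {g: w / total for g, w in zip(goals, weights)}

goals = ["A", "B", "C"]
# Hypothetical finite log-partition values; they do not affect the result.
log_partition = {"A": -1.0, "B": -2.5, "C": -0.3}
trajectory = ["A", "B", "B", "C"]  # sequence of visited states, ending in C

posterior = goal_posterior(trajectory, goals, log_partition)
print(posterior)
```

Any goal other than the final state receives weight $e^{-\infty} = 0$, so all probability mass lands on the state actually reached regardless of the partition function values.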

4.2 The Importance of the Partition Function

Figure 2: The partition function normalizes rewards of different scales: Two trajectories are evaluated on tasks with different reward scales. Black borders indicate the task to which we assign each trajectory. (Left) Without normalization, both trajectories are assigned to task $\psi_1$. (Right) After normalizing with the partition function, as is done by inverse RL (our method), trajectory $\tau_1$ is assigned to task $\psi_1$ and $\tau_2$ is assigned to $\psi_2$.

The partition function used by inverse RL will be important for hindsight relabeling, as it will normalize the rewards from tasks with varying difficulty and reward scale. Fig. 2 shows a didactic example with two tasks, where the rewards for one task are larger than the rewards for the other task. Relabeling with the reward under which the agent received the largest reward (akin to Andrychowicz et al. (2017)) fails, because all experience will be relabeled with the first (easier) task. Subtracting the log partition function from the rewards (as in Eq. 7) results in the desired behavior: trajectory $\tau_1$ is assigned to task $\psi_1$ and $\tau_2$ is assigned to $\psi_2$.
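A small numerical version of this didactic example is sketched below. The reward values are hypothetical, and the sketch assumes that the two listed trajectories are the only possible ones and are equally likely under the dynamics, so that the partition function can be computed exactly by summing over them.

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def best_task(scores):
    """Index of the highest-scoring task."""
    return max(range(len(scores)), key=lambda j: scores[j])

# rewards[i][j]: cumulative reward of trajectory i under task j (hypothetical).
# Task 0 pays much more than task 1 on every trajectory.
rewards = [[10.0, 0.0],
           [9.0, 1.0]]
n_traj, n_task = 2, 2

# Exact log partition function per task, summing over both trajectories.
log_z = [logsumexp([rewards[i][j] for i in range(n_traj)]) for j in range(n_task)]

raw_assignment = [best_task(rewards[i]) for i in range(n_traj)]
normalized_assignment = [
    best_task([rewards[i][j] - log_z[j] for j in range(n_task)])
    for i in range(n_traj)
]
print(raw_assignment, normalized_assignment)
```

Without normalization both trajectories go to the high-reward task; after subtracting the log partition function, the second trajectory is assigned to the task for which it is comparatively better.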

4.3 How Much Does Relabeling Help?

Up to now, we have shown that the optimal way to relabel data is via inverse RL. How much does relabeling help? We now obtain a lower bound on the improvement from relabeling. Both lemmas in this section assume that a joint distribution $q(\tau, \psi)$ over tasks and trajectories is given (e.g., specified by a policy $q(a_t \mid s_t, \psi)$). We define $q(\tau) = \int q(\tau, \psi)\, d\psi$ as the marginal distribution over trajectories and then construct $q_\tau(\tau, \psi) = q_\tau(\psi \mid \tau)\, q(\tau)$ using the optimal relabeling distribution $q_\tau(\psi \mid \tau)$ (Eq. 7). We first show that relabeling data using inverse RL improves the MaxEnt RL objective:

Lemma 1.

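The relabeled distribution $q_\tau(\tau, \psi)$ is closer to the target distribution than the original distribution, as measured by the KL divergence:

$$D_{\mathrm{KL}}\big(q_\tau(\tau, \psi) \,\|\, p(\tau, \psi)\big) \le D_{\mathrm{KL}}\big(q(\tau, \psi) \,\|\, p(\tau, \psi)\big).$$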

Of the many possible relabeling distributions, one choice is to do no relabeling, assigning to each trajectory the task that was commanded when the trajectory was collected. Denote this relabeling distribution $q_0(\psi \mid \tau)$, so $q_0(\psi \mid \tau)\, q(\tau) = q(\tau, \psi)$. Because $q_\tau(\psi \mid \tau)$ was chosen as that which minimizes the KL among all relabeling distributions (including $q_0(\psi \mid \tau)$), the desired inequality holds:

$$D_{\mathrm{KL}}\big(q_\tau(\psi \mid \tau)\, q(\tau) \,\|\, p(\tau, \psi)\big) \le D_{\mathrm{KL}}\big(q_0(\psi \mid \tau)\, q(\tau) \,\|\, p(\tau, \psi)\big).$$

Thus, the relabeled data is an improvement over the original data, achieving a larger entropy-regularized reward (Eq. 6). As our experiments will confirm, relabeling data will accelerate learning. Our next result will give us a lower bound on this improvement:

Lemma 2.

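The improvement in the MaxEnt RL objective (Eq. 6) gained by relabeling is lower bounded as follows:

$$D_{\mathrm{KL}}\big(q(\tau, \psi) \,\|\, p(\tau, \psi)\big) - D_{\mathrm{KL}}\big(q_\tau(\tau, \psi) \,\|\, p(\tau, \psi)\big) \ge \mathbb{E}_{q(\tau)}\Big[D_{\mathrm{KL}}\big(q(\psi \mid \tau) \,\|\, q_\tau(\psi \mid \tau)\big)\Big].$$

The proof, a straightforward application of information geometry, is in Appendix A. This result says that the amount that relabeling helps is at least as large as the difference between the original task labels, $q(\psi \mid \tau)$, and the task labels inferred by inverse RL, $q_\tau(\psi \mid \tau)$. Note that, when we have learned the optimal policy (Eq. 5), our experience is already optimally labeled, so relabeling has no effect.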
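Both lemmas can be checked numerically in a tabular setting. The sketch below uses made-up joint probability tables over three trajectories and two tasks; the table values and variable names are assumptions for illustration only. The relabeled joint keeps the original trajectory marginal and replaces the task labels with the posterior of the target distribution.

```python
import math

def kl(q, p):
    """KL divergence between two discrete distributions given as flat lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def flatten(table):
    return [x for row in table for x in row]

# Rows index trajectories, columns index tasks (hypothetical values, each sums to 1).
p_joint = [[0.30, 0.10], [0.05, 0.25], [0.10, 0.20]]  # target p(tau, psi)
q_joint = [[0.10, 0.20], [0.25, 0.05], [0.20, 0.20]]  # data q(tau, psi)

q_tau = [sum(row) for row in q_joint]                        # marginal q(tau)
p_post = [[x / sum(row) for x in row] for row in p_joint]    # posterior p(psi | tau)
relabeled = [[q_tau[i] * p_post[i][j] for j in range(2)] for i in range(3)]

kl_before = kl(flatten(q_joint), flatten(p_joint))
kl_after = kl(flatten(relabeled), flatten(p_joint))

# Right-hand side of Lemma 2: expected KL between old and inferred labels.
bound = sum(
    q_tau[i] * kl([x / q_tau[i] for x in q_joint[i]], p_post[i])
    for i in range(3)
)
print(kl_before, kl_after, bound)
```

In this finite example the improvement `kl_before - kl_after` matches `bound` up to floating-point error, which follows from the chain rule for KL divergence.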

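Algorithm 1 Approximate Inverse RL. When used in HIPI-RL (Alg. 2) we only have transitions, so we compute the scores using Eq. 8. When used in HIPI-BC (Alg. 3) we have full trajectories, so we compute the scores using Eq. 7.
function InverseRL(batch of state-action samples, batch of candidate tasks)
    for each candidate task j do    (task index)
        for each sample i do    (state-action index)
            R_ij ← soft Q-value of transition i under task j    (Eq. 8, HIPI-RL)
            R_ij ← cumulative reward of trajectory i under task j    (Eq. 7, HIPI-BC)
        log Z_j ← log of the batch average of exp(R_ij) over samples i
    for each sample i do
        sample a task for sample i with probability proportional to exp(R_ij − log Z_j)
    return the sampled tasks

Algorithm 2 HIPI-RL: Inverse RL for Off-Policy RL
while not converged do
    sample a batch of transitions and their tasks from the replay buffer
    relabel the tasks with InverseRL (Alg. 1)
    update the policy with off-policy MaxEnt RL on the relabeled batch

Algorithm 3 HIPI-BC: Inverse RL for Behavior Cloning
while not converged do
    sample a batch of trajectories from the dataset
    relabel the trajectories with InverseRL (Alg. 1)
    update the task-conditioned policy with behavior cloning on the relabeled batch
return the task-conditioned policy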

5 Using Inverse RL to Accelerate RL

In this section, we outline a general recipe, Hindsight Inference for Policy Improvement (HIPI), for using inverse RL to accelerate the learning of downstream tasks. Given a dataset of trajectories, we use inverse RL to infer for which tasks those trajectories are optimal. We discuss two options for how to use these relabeled trajectories. One option is to apply off-policy RL on top of these relabeled trajectories. This option generalizes previously-introduced hindsight relabeling techniques (Kaelbling, 1993; Andrychowicz et al., 2017), allowing them to be applied to task distributions beyond goal-reaching. A second option is to apply behavior cloning to the relabeled experience. This option generalizes a number of previous methods, extending variational policy search (Peters & Schaal, 2007; Dayan & Hinton, 1997; Levine & Koltun, 2013; Peng et al., 2019) to the multi-task setting and extending goal-conditioned imitation learning (Ghosh et al., 2019; Savinov et al., 2018; Lynch et al., 2019) to arbitrary task distributions.

5.1 Using Relabeled Data for Off-Policy RL (HIPI-RL)

Off-policy RL algorithms, such as Q-learning and actor-critic algorithms, represent a broad class of modern RL methods. These algorithms maintain a replay buffer of previously seen experience, and we can relabel this experience using inverse RL when sampling from the replay buffer. As noted in Section 4.1, hindsight experience replay (Andrychowicz et al., 2017) can be viewed as a special case of this idea. Viewing relabeling as inverse RL, we can extend these methods to general classes of reward functions.

There are many algorithms for inverse RL, and we outline one approximate algorithm that can be efficiently integrated into off-policy RL. To relabel entire trajectories, we would start by computing the cumulative reward: $R_\psi(\tau) = \sum_t r_\psi(s_t, a_t)$. However, most off-policy RL algorithms maintain a replay buffer that stores transitions, rather than entire trajectories. In this case, following Eq. 8, we instead use the soft Q-function: $R_\psi(s_t, a_t) = \tilde{Q}^q(s_t, a_t, \psi)$. We approximate the partition function using Monte Carlo samples from within a batch of size $B$:

$$\log Z(\psi) \approx \log \frac{1}{B} \sum_{i=1}^{B} \exp\big(R_\psi(s_t^{(i)}, a_t^{(i)})\big).$$

We finally sample tasks following Eq. 7:

$$q(\psi \mid s_t, a_t) \propto p(\psi) \exp\big(R_\psi(s_t, a_t) - \log Z(\psi)\big).$$

We summarize the procedure for relabeling with inverse RL in Alg. 1. The application of relabeling with inverse RL to off-policy RL, which we call HIPI-RL, is summarized in Alg. 2. We emphasize that Alg. 1 is just one of many methods for performing inverse RL. Alternative methods include gradient-based optimization of the per-sample task, and learning a parametric task-sampler to approximate the optimal relabeling distribution (Eq. 7). We leave these as future work.
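The batch relabeling procedure can be sketched in a few lines. The version below is a minimal illustration rather than the authors' implementation: it assumes a uniform prior over a finite list of candidate tasks, takes a precomputed score matrix (soft Q-values or cumulative rewards) as input, and uses the hypothetical names `relabel_batch` and `scores`.

```python
import math
import random

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def relabel_batch(scores, rng):
    """Sample one task per sample. scores[i][j] is the score of sample i under task j."""
    n, k = len(scores), len(scores[0])
    # Monte Carlo estimate of log Z for each task, using the batch itself.
    log_z = [logsumexp([scores[i][j] for i in range(n)]) - math.log(n) for j in range(k)]
    new_tasks = []
    for i in range(n):
        logits = [scores[i][j] - log_z[j] for j in range(k)]
        norm = logsumexp(logits)
        probs = [math.exp(l - norm) for l in logits]
        new_tasks.append(rng.choices(range(k), weights=probs)[0])
    return new_tasks, log_z

# Hypothetical scores: sample 0 strongly prefers task 0, sample 1 prefers task 1.
scores = [[50.0, 0.0],
          [0.0, 50.0]]
new_tasks, log_z = relabel_batch(scores, random.Random(0))
print(new_tasks)
```

The scores here are chosen so extreme that sampling is effectively deterministic; with closer scores the procedure returns a soft, stochastic relabeling.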

5.2 Using Relabeled Data for Behavior Cloning

We now introduce a second method to use data relabeled with inverse RL to acquire control policies. The idea is quite simple: given arbitrary data, first relabel that data with inverse RL, and then perform task-conditioned behavior cloning. We call this procedure HIPI-BC and summarize it in Alg. 3. Why should we expect this procedure to work? The intuition is that relabeling with inverse RL makes the joint distribution of tasks and trajectories closer to the target distribution (i.e., it maximizes the multi-task MaxEnt RL objective (Eq. 6)). To convert this joint distribution into an actionable representation, we extract the policy implicitly defined by the relabeled trajectories. Behavioral cloning (i.e., supervised learning) does precisely this.
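To make the second step concrete, the sketch below implements task-conditioned behavior cloning for a tabular policy, where the maximum-likelihood solution is the empirical action frequency for each (state, task) pair. The task labels are assumed to have already been inferred by an inverse RL step; the function name, state names, and task names are hypothetical.

```python
from collections import Counter, defaultdict

def behavior_clone(trajectories, task_labels):
    """Tabular task-conditioned BC: empirical action frequencies per (state, task)."""
    counts = defaultdict(Counter)
    for trajectory, task in zip(trajectories, task_labels):
        for state, action in trajectory:
            counts[(state, task)][action] += 1
    policy = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        policy[key] = {action: c / total for action, c in counter.items()}
    return policy

# Two unlabeled trajectories, each a list of (state, action) pairs.
trajectories = [
    [("s0", "left"), ("s1", "left")],
    [("s0", "right"), ("s2", "right")],
]
# Labels assumed to come from relabeling with inverse RL.
task_labels = ["goal_left", "goal_right"]

policy = behavior_clone(trajectories, task_labels)
print(policy[("s0", "goal_left")], policy[("s0", "goal_right")])
```

The same start state maps to different actions depending on the inferred task, which is the behavior a task-agnostic clone of the pooled data could not express.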

Relationship to Prior Methods

Prior work on goal-conditioned supervised learning, self-imitation learning, and reward-weighted regression can all be understood as special cases. Goal-conditioned supervised learning (Savinov et al., 2018; Ghosh et al., 2019; Lynch et al., 2019) learns a goal-conditioned policy using a dataset of past experience. For a given state, the action that was actually taken is treated as the correct action (i.e., label) for states reached in the future, and a policy is learned via supervised learning. As discussed in Section 4.1, relabeling with the goal actually achieved is a special case of our framework. We refer the reader to those papers for additional evidence that combining inverse RL (albeit a trivial special case) with behavior cloning can effectively learn complex control policies. Self-imitation learning (Oh et al., 2018) and iterative maximum likelihood training (Liang et al., 2016) augment RL with supervised learning on a handful of the best previously-seen trajectories, an approach that can be viewed in the inverse RL followed by supervised learning framework. However, because the connection to inverse RL is not made precise, these methods omit the partition function, which may prove problematic when extending these methods to multi-task settings. Finally, single-task RL methods based on variational policy search (Levine, 2018) and reward-weighted regression (Peters & Schaal, 2007; Peng et al., 2019) can also be viewed in this framework. Noting that the optimal relabeling distribution is given as $q(\psi \mid \tau) \propto p(\psi) \exp(\sum_t r_\psi(s_t, a_t) - \log Z(\psi))$, relabeling by sampling from the inverse RL posterior and then performing behavior cloning can be written concisely as the following objective:

$$\max_\theta \; \mathbb{E}_{\tau \sim q(\tau),\, \psi \sim q(\psi \mid \tau)}\Big[\sum_t \log q_\theta(a_t \mid s_t, \psi)\Big].$$

The key difference between this objective and prior work is the partition function. The observation that these prior methods are special cases of inverse RL allows us to apply similar ideas to arbitrary classes of reward functions, a capability we showcase in our experiments.

6 Experiments: Relabeling with Inverse RL Accelerates Learning

Figure 3: Environments for experiments: (a) quadruped, (b) finger, (c) 2D reacher, (d) sawyer reach, (e) 2D navigation (f) jaco reach, (g) walker, (h) cheetah, and (i) desk manipulation.

Our experiments focus on two methods for using relabeled data: off-policy RL (Alg. 2) and behavior cloning (Alg. 3). We evaluate our method on both goal-reaching tasks as well as more general task distributions, including linear combinations of a reward basis and discrete sets of tasks (see Fig. 3).

6.1 HIPI-RL: Inverse RL for Off-Policy RL

Our first set of experiments apply Alg. 2 to domains with varying reward structure, demonstrating how relabeling data with inverse RL can accelerate off-policy RL.

Didactic Example
Figure 4: Relabeling stitches crossing trajectories: (Left) A simple gridworld environment, with two observed trajectories indicated by grey arrows. Inverse RL identifies the end points of both trajectories as likely intentions from state A and includes paths to both in the relabeled data. Final state relabeling (HER) only relabels with the goal actually achieved, corresponding to the observed trajectory. (Right) We apply Q-learning to both datasets, finding that only relabeling with inverse RL allows the agent to reach all goals.

We start with a didactic example to motivate why relabeling experience with inverse RL would accelerate off-policy RL. In the gridworld shown in Fig. 4, we construct a dataset with two trajectories that cross. From state A, inverse RL identifies the end points of both trajectories as likely intentions, so we include paths to both end points in the relabeled data. Final state relabeling (HER) only uses the trajectory that was actually observed. We then apply Q-learning to both datasets. Whereas Q-learning with final state relabeling only succeeds at reaching those goals in the top row, our approach, which corresponds to Q-learning with inverse RL relabeling, succeeds at reaching all goals. The remainder of this section will show the benefits of relabeling using inverse RL in domains of increasing complexity.
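The stitching effect can be reproduced in a toy setting much smaller than the gridworld. The sketch below is a hypothetical five-state chain, not the environment from Fig. 4: two trajectories (S1 to G1 and S2 to G2) pass through a shared state X. The goal labels for each dataset are written by hand to mimic what each relabeling strategy would produce, rather than being computed by an inverse RL routine.

```python
def fitted_q(data, actions, gamma=0.9, sweeps=20):
    """Tabular goal-conditioned Q-iteration on (state, action, next_state, goal) tuples."""
    q = {}
    for _ in range(sweeps):
        for s, a, s_next, g in data:
            if s_next == g:
                q[(s, a, g)] = 1.0  # reward 1 for reaching the goal, then terminate
            else:
                future = max(q.get((s_next, b, g), 0.0) for b in actions)
                q[(s, a, g)] = gamma * future
    return q

def reaches(q, step, start, goal, actions, horizon=5):
    """Greedy rollout using only actions with positive learned value."""
    s = start
    for _ in range(horizon):
        scored = [(q[(s, a, goal)], a) for a in actions if q.get((s, a, goal), 0.0) > 0.0]
        if not scored:
            return False
        _, a = max(scored)
        s = step[(s, a)]
        if s == goal:
            return True
    return False

# Deterministic dynamics observed in the two trajectories.
step = {("S1", "go"): "X", ("S2", "go"): "X", ("X", "up"): "G1", ("X", "down"): "G2"}
actions = ["go", "up", "down"]

# Final-state relabeling: every transition keeps its own trajectory's end point.
final_state_data = [("S1", "go", "X", "G1"), ("X", "up", "G1", "G1"),
                    ("S2", "go", "X", "G2"), ("X", "down", "G2", "G2")]
# Soft relabeling: transitions into X are consistent with both goals.
inverse_rl_data = final_state_data + [("S1", "go", "X", "G2"), ("S2", "go", "X", "G1")]

q_final = fitted_q(final_state_data, actions)
q_irl = fitted_q(inverse_rl_data, actions)
print(reaches(q_final, step, "S1", "G2", actions), reaches(q_irl, step, "S1", "G2", actions))
```

With final-state labels the agent never sees a value estimate for reaching G2 from S1, while the additional labels let Q-iteration propagate value through the shared state.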

Figure 5: Relabeling for goal-reaching tasks: On six goal-reaching domains, relabeling with inverse RL (our method) learns faster than with previous relabeling strategies. On extremely sparse versions of two tasks, shown in the right column, only our method learns the tasks.

Goal-Reaching Task Distributions We next apply our method to goal-reaching tasks, where each task corresponds to reaching a different goal state. We used six domains: a quadruped locomotion task, a robotic finger turning a knob, a 2D reacher, a reaching task on the Sawyer robot, a 2D navigation environment with obstacles, and a reaching task on the Jaco robot. Appendix C provides details of all tasks. We compared our method against four alternative relabeling strategies: relabeling with the final state reached (HER (Andrychowicz et al., 2017)), relabeling with a randomly-sampled task, relabeling with a future state in the same trajectory, and doing no relabeling (SAC (Haarnoja et al., 2018a)). For tasks where the goal state only specifies certain dimensions of the state, relabeling with the final state and future state requires privileged information indicating to which state dimensions the goal corresponds.

As shown in Fig. 5, relabeling experience with inverse RL (our method) always learns at least as quickly as the other relabeling strategies, and often achieves larger asymptotic reward. While final state relabeling (HER) performs well on some tasks, it is worse than random relabeling on other tasks. We also observe that random relabeling is a competitive baseline, provided that the number of gradient steps is sufficiently tuned. We conjectured that soft relabeling would be most beneficial in settings with extremely sparse rewards. To test this hypothesis, we modified the reward functions in 2D reacher and Jaco reaching environments to be much sparser. As shown in the far right column on Fig. 5, only soft relabeling is able to make learning progress in this setting.

Figure 6: Relabeling for general task distributions: (Left) 2D reacher with a discrete set of target end effector positions, (Center) walker with tasks defined as a linear combination of reward terms, and (Right) sawyer reacher where tasks are defined as arriving within a task-specified margin of a goal state. On all tasks, relabeling with inverse RL accelerates learning and leads to larger asymptotic reward. Note that existing relabeling strategies are not applicable in this setting.

More General Task Distributions Our next experiment demonstrates that, in addition to relabeling goals, inverse RL can also relabel experience for more general task distributions. Our first task distribution is a discrete set of goal states for the 2D reacher environment. The second task distribution highlights the capability of inverse RL to relabel experience for classes of reward functions defined as linear combinations of features. We use the walker environment, with features corresponding to torso height, velocity, relative position of the feet, and a control cost. The third task distribution is again a goal reaching task, but one where the task indicates both the goal state as well as the desired margin from that goal state. As prior relabeling approaches are not applicable to these general task distributions, we only compared our approach to random relabeling and no relabeling (SAC (Haarnoja et al., 2018a)). As shown in Fig. 6, relabeling with inverse RL provides more sample efficient learning in all tasks, and the asymptotic reward is larger than the baselines by a non-trivial amount in two of the three tasks.
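For linear reward functions, scoring a trajectory under many candidate tasks is cheap because the per-trajectory feature sums only need to be computed once. The sketch below assumes the reward takes the form of a dot product between a task weight vector and a feature vector; this parameterization, the function name, and the numbers are assumptions for illustration.

```python
def linear_returns(feature_sums, tasks):
    """returns[i][j] = dot(tasks[j], feature_sums[i]) for linear reward functions."""
    return [
        [sum(w * f for w, f in zip(task, features)) for task in tasks]
        for features in feature_sums
    ]

# One trajectory with three summed features (hypothetical values).
feature_sums = [[2.0, 1.0, -0.5]]
# Two candidate tasks, each a weight vector over the same three features.
tasks = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 1.0]]

returns = linear_returns(feature_sums, tasks)
print(returns)
```

A matrix of this shape, with one row per trajectory and one column per candidate task, is the kind of input a relabeling step based on Eq. 7 would need.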

6.2 HIPI-BC: Behavior Cloning on Experience Relabeled with Inverse RL

Figure 7: Behavior cloning on experience relabeled with inverse RL: We apply our approach to tasks with varying task distributions: (Left)  goal-reaching tasks on half-cheetah, (Center)  linear reward functions on quadruped, and (Right)  discrete tasks on the manipulation environment. Relabeling experience with inverse RL increases reward in all domains.

In this section, we present experiments that use behavior cloning on top of relabeled experience (Alg. 3). The three domains we use have varying reward structure: (1) half-cheetah with continuous goal velocities; (2) quadruped with linear reward functions; and (3) the manipulation environment with nine discrete tasks. For the half-cheetah and quadruped domains, we collected 1000 demonstrations from a policy trained with off-policy RL. For the manipulation environment, Lynch et al. (2019) provided a dataset of 100 demonstrations for each of these tasks, which we aggregate into a dataset of 900 demonstrations. In all settings, we discarded the task labels, simulating the common real-world setting where experience does not come prepared with task labels. As shown in Fig. 7, first inferring the tasks with inverse RL and then performing behavioral cloning results in significantly higher final rewards than task-agnostic behavior cloning on the entire dataset, which is no better than random.

Figure 8: Importance of the partition function: On the half-cheetah task, we simulated the effect of unnormalized reward functions by adding a constant bias to the first task. Inverse RL normalizes rewards by the partition function. Without this normalization, experience is disproportionately labeled with the first task label, resulting in poor performance during behavior cloning.

Our final experiment demonstrates the importance of the partition function. On the cheetah domain, we synthetically corrupt the demonstrations by adding a constant bias to the reward for the first task (whichever velocity was sampled first). We then compare the performance of our approach against an ablation that did not normalize by the partition function when relabeling data. As shown in Fig. 8, using task rewards of different scales significantly degrades the performance of the ablation. Our method, which normalizes the task rewards in the inverse RL step, is not affected by reward scaling.

7 Discussion

In this paper, we introduced the idea that hindsight relabeling is inverse RL. We showed that a number of prior works can be understood as special cases of this general framework. The idea that inverse RL might be used to relabel data is powerful because it enables us to extend relabeling techniques to general classes of reward functions. We explored two particular instantiations of this idea, using experience relabeled with inverse RL for off-policy RL and for supervised learning.

We are only scratching the surface of the many ways relabeled experience might be used to accelerate learning. For example, the problem of task inference is ever-present in meta-learning, and it is intriguing to imagine explicitly incorporating inverse RL into meta RL. Broadly, we hope that the observation that inverse RL can be used to accelerate RL will spur research on better inverse RL algorithms, which in turn will provide better RL algorithms.

Acknowledgements We thank Yevgen Chebotar, Aviral Kumar, Vitchyr Pong, and Anirudh Vemula for formative discussions. We are grateful to Ofir Nachum for pointing out the duality between MaxEnt RL and the partition function, and to Karol Hausman for reviewing an early draft of this paper. We thank Stephanie Chan, Corey Lynch, and Pierre Sermanet for providing the desk manipulation environment. This research was supported by the Fannie and John Hertz Foundation, NASA, DARPA, US Army, and the National Science Foundation (IIS-1700696, IIS-1700697, IIS-1763562, and DGE 1745016). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Appendix A Proof of Lemma 2

This section provides a proof of Lemma 2.


The optimal relabeling distribution can be viewed as an information projection of the joint distribution onto the target distribution (Eq. 5):

where is the set of all joint distributions with marginal . Note that this set is closed and convex. We then apply Theorem 11.6.1 from Cover & Thomas (2006):


The second KL divergence on the RHS can be simplified:

Substituting this simplification into Eq. 10 and rearranging terms, we obtain the desired result. ∎

Appendix B Inverse RL on Transitions

For simplicity, our derivation of relabeling in Section 4 assumed that entire trajectories were provided. This section outlines how to do relabeling with inverse RL when we are only provided with transitions, rather than entire trajectories. This derivation will motivate the use of the soft Q-function in Eq. 8.

In this case, the policy distribution in the MaxEnt RL objective (Eq. 6) is conditioned on the current state and action in addition to the task :


Following the derivation in Section 4, we expand this objective, using as our relabeling distribution:


The expected value of the two summations is the soft Q-function for the policy :


Substituting Eq. 13 into Eq. 12 and ignoring terms that do not depend on , we can solve for the optimal relabeling distribution:


Appendix C Experimental Details

C.1 Hyperparameters for Off-Policy RL

Except for the didactic experiment, we used SAC (Haarnoja et al., 2018a) as our RL algorithm, taking the implementation from Guadarrama et al. (2018). This implementation scales the critic loss by a factor of 0.5. Following prior work (Pong et al., 2018), we only relabeled 50% of the samples drawn from the replay buffer, using the originally-commanded task for the remaining 50%. The only hyperparameter that differed across relabeling strategies was the number of gradient updates per environment step. For each experiment, we evaluated each method with values in and reported the results of the best hyperparameter in our plots. Perhaps surprisingly, doing random relabeling but simply increasing the number of gradient updates per environment step was a remarkably competitive baseline.

  • Learning Rate: 3e-4 (same for actor, critic, and entropy dual parameter)

  • Batch Size: 32

  • Network architecture: The input was the concatenation of the state observation and the task . Both the actor and critic networks were ReLU networks with 2 hidden layers. The actor output was squashed by a tanh activation to lie within the action space constraints. There was no activation at the final layer of the critic network, except in the desk environment (see comment below). The hidden layer dimensions were (32, 32) for the 2D navigation environments, (256, 256) for the quadruped and desk environments, and (64, 64) for all other environments.

  • Discount γ: 0.99

  • Initial data collection steps: 1e5

  • Target network update period: 1

  • Target network τ: 0.005

  • Entropy coefficient : We used the entropy-constrained version of SAC (Haarnoja et al., 2018b), using as the target value, where is the action space dimension.

  • Replay buffer capacity: 1e6

  • Optimizer: Adam

  • Gradient Clipping: We found that clipping the gradients to have unit norm was important to get any RL working on the Sawyer and Jaco tasks.
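The 50% relabeling described at the start of this section can be sketched as follows. This is an illustrative sketch, not the code used in the experiments: the function name `mix_tasks` is hypothetical, and it assumes that exactly half of each batch is relabeled (the alternative of relabeling each sample independently with probability 0.5 would also match the description).

```python
import random

def mix_tasks(original_tasks, relabeled_tasks, relabel_fraction=0.5, rng=None):
    """Return a task list where a fixed fraction of entries is relabeled.

    original_tasks:  tasks that were commanded when the data was collected.
    relabeled_tasks: tasks proposed by the relabeling strategy, same length.
    """
    rng = rng or random.Random()
    n = len(original_tasks)
    num_relabel = int(round(relabel_fraction * n))
    # Pick which batch positions receive the relabeled task.
    chosen = set(rng.sample(range(n), num_relabel))
    return [relabeled_tasks[i] if i in chosen else original_tasks[i]
            for i in range(n)]

# Example with a batch of 32 placeholder task labels.
batch_tasks = mix_tasks(["orig"] * 32, ["new"] * 32, rng=random.Random(0))
```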

To implement final state relabeling, we modified transitions as they were being added to the replay buffer, adding both the original transition and the transition augmented to use the final state as the goal. To implement future state relabeling, we modified transitions as they were being added to the replay buffer, adding both the original transition and a transition augmented to use one of the next 4 states in the same trajectory as the goal.
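A minimal sketch of these two baselines is given below, under stated assumptions: transitions are plain dictionaries with hypothetical keys `state`, `action`, `next_state`, and `goal`; "one of the next 4 states" is read as a uniform choice among the up to four states that follow the current one; and recomputing the reward for the new goal is omitted.

```python
import random

def relabel_final(trajectory):
    """Emit each original transition plus a copy whose goal is the final state."""
    final_state = trajectory[-1]["next_state"]
    out = []
    for tr in trajectory:
        out.append(tr)
        out.append({**tr, "goal": final_state})
    return out

def relabel_future(trajectory, horizon=4, rng=None):
    """Emit each original transition plus a copy whose goal is a nearby future state."""
    rng = rng or random.Random()
    out = []
    for t, tr in enumerate(trajectory):
        out.append(tr)
        # States s_{t+1}..s_{t+horizon} are the next_state fields of
        # transitions t..t+horizon-1, truncated at the end of the trajectory.
        last = min(t + horizon - 1, len(trajectory) - 1)
        k = rng.randint(t, last)
        out.append({**tr, "goal": trajectory[k]["next_state"]})
    return out

# Toy trajectory with integer states 0..6 and a placeholder commanded goal.
traj = [{"state": t, "action": 0, "next_state": t + 1, "goal": 99} for t in range(6)]
final = relabel_final(traj)
future = relabel_future(traj, horizon=4, rng=random.Random(0))
```

Both functions double the number of transitions written to the buffer, which matches the description of adding the original and the augmented transition together.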

C.2 Hyperparameters for Behavior Cloning Experiments

To account for randomness in the learning process, we collect at least 200 evaluation episodes per domain; we repeat this experiment for at least 5 random seeds on each domain, and plot the mean and standard deviation over the random seeds. We used a 2-layer neural network with ReLU activations for all experiments. The hidden layers had size (256, 256). We optimized the network to minimize MSE using the Adam optimizer with a learning rate of 3e-4. We used early stopping, halting training when the validation loss increased for 3 consecutive epochs. Training typically converged in 30 to 50 epochs. We normalized both the states and actions. For the task-conditioned experiments, we concatenated the task vectors to the state vectors.
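The early stopping rule can be sketched as below. This is one possible reading, not the authors' code: it assumes that "increased" means the validation loss is higher than in the immediately preceding epoch (rather than higher than the best loss so far), and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training halts, or None if it never does.

    Training halts once the validation loss has increased, relative to the
    previous epoch, for `patience` consecutive epochs.
    """
    increases = 0
    for epoch in range(1, len(val_losses)):
        if val_losses[epoch] > val_losses[epoch - 1]:
            increases += 1
        else:
            increases = 0  # Any non-increase resets the streak.
        if increases >= patience:
            return epoch
    return None

# Made-up validation losses: the streak at epoch 3 is broken at epoch 4,
# and a new streak of three increases ends at epoch 7.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.74, 0.76, 0.9, 0.5]
stop_at = early_stopping_epoch(losses)
```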

C.3 Quadruped Environment


The quadruped was a modified version of the environment from Abdolmaleki et al. (2018). We modified the initial state distribution so the agent always started upright, and modified the observation space to include the termination signal as part of the observation. Tasks were sampled uniformly from the unit circle. Let and indicate the XY velocity and position of the agent. For the HIPI-RL experiments, we used the following sparse reward function:

and the episode terminated when . We also reset the environment after 300 steps if the agent had failed to reach the goal. For the HIPI-BC experiments, we used the following dense reward function:

and episodes were 300 steps long.

C.4 Finger Environment


The finger environment was taken from Tassa et al. (2018b). Tasks were sampled using the environment’s default goal sampling function. Let denote the XY position of the knob that the agent can manipulate. The reward function was defined as

and the episode terminated when . We also reset the environment after 1000 steps if the agent had failed to reach the goal.

C.5 2D Reacher Environment

2D Reacher

The 2D reacher environment was taken from Tassa et al. (2018b). Let denote the XY position of the robot end effector. The reward function was defined as

and the episode terminated when , where is a margin around the goal. We used and in our experiments. We also reset the environment after 1000 steps if the agent had failed to reach the goal. Tasks were sampled using the environment’s default goal sampling function.

C.6 Sawyer Reach Environment

Sawyer Reach

The Sawyer reach environment was taken from Yu et al. (2019). Let denote the XYZ position of the robot end effector. The reward function was defined as

and the episode terminated when , where is a margin around the goal. We used and in our experiments. We also reset the environment after 150 steps if the agent had failed to reach the goal. Tasks were sampled using the environment’s default goal sampling function. For the experiment where the task indicator also specified the margin , the margin was sampled uniformly from the interval .

C.7 2D Navigation Environment

2D Navigation

We used the 2D navigation environment from Eysenbach et al. (2019). The action space is continuous and indicates the desired change of position. The dynamics are stochastic, and the initial state and goal are sampled uniformly at random for each episode. To increase the difficulty of credit assignment and exploration, the agent is always initialized in the lower left corner, and we randomly sampled goal states that are at least 15 steps away. The layout of the obstacles is taken from the classic FourRooms domain, but dilated by a factor of three.

C.8 Jaco Reach Environment

Jaco Reach

We implemented a reaching task using a simulated Jaco robot. Goal states were sampled uniformly from the interval . The agent controlled the velocity of 6 arm joints and 3 finger joints, so the action space was 9 dimensional. The observation space was 43 dimensional. Let denote the XYZ position of the robot end effector. The reward function was defined as

and the episode terminated when , where is a margin around the goal. We used and in our experiments. We also reset the environment after 250 steps if the agent had failed to reach the goal.

C.9 Walker Environment


The walker environment was a modified version of the environment from Tassa et al. (2018a). We modified the initial state distribution so the agent always started upright, and modified the observation space to include the termination signal as part of the observation. For the linear reward function, the features are the torso height (normalized by subtracting 0.5m), velocity along the forward/aft axis, the XZ displacement of the two feet relative to the agent’s center of mass (the agent cannot move along the Y axis), and the squared L2 norm of the actions. The task coefficients can take on values in the range for all dimensions, except for the control penalty, which takes on values in . Episodes were 100 steps long.

C.10 Half-Cheetah Environment


The half-cheetah environment was taken from Tassa et al. (2018a). We define tasks to correspond to goal velocities and use the reward function from Rakelly et al. (2019):

where is the horizontal root velocity. Tasks were sampled uniformly , with units of meters per second. Episodes were 100 steps long.

C.11 Desk Environment

Desk Manipulation

The environment provided by Lynch et al. (2019) included 19 tasks. We selected the nine most challenging tasks by looking at how often a task was accidentally solved. In the demonstrations for task A, we recorded the average return on the remaining 18 tasks. We chose the nine tasks whose average reward was lowest. The nine tasks were three button pushing tasks and six block manipulation tasks.

For experiments on this environment, we found that normalizing the action space was crucial. We computed the coordinate-wise mean and standard deviation of the actions from the demonstrations, and modified the environment to implicitly normalize actions by subtracting the mean and dividing by the standard deviation. We clipped the action space to , so the agent was only allowed to command actions within one standard deviation (as measured by the expert demos).
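The action normalization can be sketched as follows. This is a hedged illustration with hypothetical function names, not the environment wrapper used in the experiments. The clipping limit of 1.0 is an assumption inferred from the statement that the agent could only command actions within one standard deviation.

```python
import math

def action_stats(demo_actions):
    """Coordinate-wise mean and standard deviation of demonstration actions."""
    n = len(demo_actions)
    dim = len(demo_actions[0])
    mean = [sum(a[d] for a in demo_actions) / n for d in range(dim)]
    std = [math.sqrt(sum((a[d] - mean[d]) ** 2 for a in demo_actions) / n)
           for d in range(dim)]
    return mean, std

def to_raw_action(normalized_action, mean, std, limit=1.0):
    """Clip a normalized action, then map it back to the raw action space.

    limit=1.0 is an assumed value (one standard deviation around the mean).
    """
    clipped = [max(-limit, min(limit, a)) for a in normalized_action]
    return [m + s * c for c, m, s in zip(clipped, mean, std)]

# Two made-up 2D demonstration actions.
demos = [[0.0, 10.0], [2.0, 14.0]]
mean, std = action_stats(demos)
raw = to_raw_action([0.5, -3.0], mean, std)  # second coordinate is clipped to -1
```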

Another trick that was crucial for RL on this environment was clipping the critic outputs. Since the reward was in and the episode length was capped at 128 steps, we squashed the Q-value predictions with a scaled sigmoid to be in the range .
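A scaled sigmoid of this kind can be sketched as below. The bounds `q_min` and `q_max` are placeholders supplied by the caller; the sketch does not assume the specific range used in the experiments, and the function name is hypothetical.

```python
import math

def squash_q(raw_output, q_min, q_max):
    """Map an unbounded critic output into [q_min, q_max] with a scaled sigmoid."""
    # Numerically stable sigmoid that avoids overflow for large negative inputs.
    if raw_output >= 0:
        sig = 1.0 / (1.0 + math.exp(-raw_output))
    else:
        e = math.exp(raw_output)
        sig = e / (1.0 + e)
    return q_min + (q_max - q_min) * sig

# Example with placeholder bounds: a raw output of 0 maps to the midpoint.
midpoint = squash_q(0.0, 0.0, 10.0)
```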

Appendix D Failed Experiments

  1. 100% Relabeling: When using inverse RL to relabel data for off-policy RL, we initially relabeled 100% of samples from the replay buffer, but found that learning was often worse than doing no relabeling at all. We therefore switched to only 50% relabeling in our experiments. We speculate that retaining some of the originally-commanded goals serves as a sort of hard-negative mining.

  2. Coordinate Ascent on Eq. 6: We attempted to devise an EM-style algorithm that performed coordinate ascent in Eq. 6, alternating between (1) doing MaxEnt RL and (2) relabeling that data and acquiring the corresponding policy via behavior cloning. While we were unable to get this algorithm to outperform standard MaxEnt RL, we conjecture that this procedure would work with the right choice of inverse RL algorithm.


  1. Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp.  1. ACM, 2004.
  2. Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.
  3. Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
  4. Caruana, R. Multitask learning. Machine learning, 28(1):41–75, 1997.
  5. Cover, T. M. and Thomas, J. A. Elements of information theory (wiley series in telecommunications and signal processing), 2006.
  6. Dayan, P. and Hinton, G. E. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.
  7. Dubey, R., Agrawal, P., Pathak, D., Griffiths, T. L., and Efros, A. A. Investigating human priors for playing video games. arXiv preprint arXiv:1802.10217, 2018.
  8. Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
  9. Eysenbach, B., Salakhutdinov, R., and Levine, S. Search on the replay buffer: Bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems, pp. 15220–15231, 2019.
  10. Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016.
  11. Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., and Levine, S. Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.
  12. Ghosh, D., Gupta, A., Fu, J., Reddy, A., Devine, C., Eysenbach, B., and Levine, S. Learning to reach goals without reinforcement learning. arXiv preprint arXiv:1912.06088, 2019.
  13. Guadarrama, S., Korattikara, A., Ramirez, O., Castro, P., Holly, E., Fishman, S., Wang, K., Gonina, E., Harris, C., Vanhoucke, V., et al. Tf-agents: A library for reinforcement learning in tensorflow, 2018.
  14. Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1352–1361. JMLR. org, 2017.
  15. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018a.
  16. Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.
  17. Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. In Advances in neural information processing systems, pp. 6765–6774, 2017.
  18. Hessel, M., Soyer, H., Espeholt, L., Czarnecki, W., Schmitt, S., and van Hasselt, H. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019.
  19. Javdani, S., Srinivasa, S. S., and Bagnell, J. A. Shared autonomy via hindsight optimization. Robotics science and systems: online proceedings, 2015, 2015.
  20. Kaelbling, L. P. Learning to achieve goals. In IJCAI, pp. 1094–1099. Citeseer, 1993.
  21. Kappen, H. J., Gómez, V., and Opper, M. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182, 2012.
  22. Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and Dabney, W. Recurrent experience replay in distributed reinforcement learning. 2018.
  23. Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  24. Levine, S. and Koltun, V. Variational policy search via trajectory optimization. In Advances in neural information processing systems, pp. 207–215, 2013.
  25. Liang, C., Berant, J., Le, Q., Forbus, K. D., and Lao, N. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. arXiv preprint arXiv:1611.00020, 2016.
  26. Lynch, C., Khansari, M., Xiao, T., Kumar, V., Tompson, J., Levine, S., and Sermanet, P. Learning latent plans from play. arXiv preprint arXiv:1903.01973, 2019.
  27. Oh, J., Guo, Y., Singh, S., and Lee, H. Self-imitation learning. arXiv preprint arXiv:1806.05635, 2018.
  28. Parisotto, E., Ba, J. L., and Salakhutdinov, R. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
  29. Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  30. Peters, J. and Schaal, S. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pp. 745–750. ACM, 2007.
  31. Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal difference models: Model-free deep rl for model-based control. arXiv preprint arXiv:1802.09081, 2018.
  32. Rakelly, K., Zhou, A., Quillen, D., Finn, C., and Levine, S. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.
  33. Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736. ACM, 2006.
  34. Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
  35. Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. Learning by playing-solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.
  36. Rusu, A. A., Colmenarejo, S. G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., and Hadsell, R. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
  37. Savinov, N., Dosovitskiy, A., and Koltun, V. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.
  38. Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015.
  39. Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018a.
  40. Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind control suite. Technical report, DeepMind, January 2018b.
  41. Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.
  42. Theodorou, E. A. and Todorov, E. Relative entropy and free energy dualities: Connections to path integral and kl control. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pp. 1466–1473. IEEE, 2012.
  43. Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 2012.
  44. Todorov, E. Linearly-solvable markov decision problems. In Advances in neural information processing systems, pp. 1369–1376, 2007.
  45. Todorov, E. General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pp. 4286–4292. IEEE, 2008.
  46. Toussaint, M. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pp. 1049–1056. ACM, 2009.
  47. Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. arXiv preprint arXiv:1910.10897, 2019.
  48. Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782, 2020.
  49. Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. 2008.