Exploration via Hindsight Goal Generation
Abstract
Goal-oriented reinforcement learning has recently become a practical framework for robotic manipulation tasks, in which an agent is required to reach a certain goal defined by a function on the state space. However, the sparsity of such a reward definition makes traditional reinforcement learning algorithms very inefficient. Hindsight Experience Replay (HER), a recent advance, has greatly improved sample efficiency and practical applicability for such problems. It exploits previous replays by constructing imaginary goals in a simple heuristic way, acting like an implicit curriculum to alleviate the challenge of sparse reward signals. In this paper, we introduce Hindsight Goal Generation (HGG), a novel algorithmic framework that generates valuable hindsight goals which are easy for an agent to achieve in the short term and also have the potential to guide the agent toward the actual goal in the long term. We have extensively evaluated our goal generation algorithm on a number of robotic manipulation tasks and demonstrated substantial improvement over the original HER in terms of sample efficiency.
1 Introduction
Recent advances in deep reinforcement learning (RL), including policy gradient methods (Schulman et al., 2015, 2017) and Q-learning (Mnih et al., 2015), have demonstrated a large number of successful applications in solving hard sequential decision problems, including robotics (Levine et al., 2016), games (Silver et al., 2016; Mnih et al., 2015), and recommendation systems (Karatzoglou et al., 2013), among others. To train a well-behaved policy, deep reinforcement learning algorithms use neural networks as function approximators to learn a state-action value function or a policy distribution that optimizes a long-term expected return. The convergence of the training process, particularly in Q-learning, is heavily dependent on the temporal pattern of the reward function (Szepesvári, 1998). For example, if a nonzero reward/return is provided only at the end of a rollout of a trajectory with length $T$, while no rewards are observed before the $T$-th time step, the Bellman updates of the Q-function become very inefficient, requiring at least $T$ steps to propagate the final return to the Q-function of all earlier state-action pairs. Such sparse or episodic reward signals are ubiquitous in many real-world problems, including complex games and robotic manipulation tasks (Andrychowicz et al., 2017). Therefore, despite its notable success, the application of RL to real-world problems is still quite limited, because the reward functions can be sparse and very hard to engineer (Ng et al., 1999). In practice, human experts need to design reward functions that reflect the task to be solved and are also carefully shaped in a dense way so that the optimization in RL algorithms performs well. However, the design of such dense reward functions is nontrivial in most real-world problems with sparse rewards. For example, in goal-oriented robotics tasks, an agent is required to reach some state satisfying predefined conditions or within a state set of interest.
Many previous efforts have shown that sparse indicator rewards, instead of engineered dense rewards, often provide better practical performance when trained with deep Q-learning and policy optimization algorithms (Andrychowicz et al., 2017). In this paper, we focus on improving training and exploration for goal-oriented RL problems.
A notable advance is Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), which greatly improves the practical success of off-policy deep Q-learning for goal-oriented RL problems, including several difficult robotic manipulation tasks. The key idea of HER is to revisit previous states in the experience replay and construct a number of achieved hindsight goals based on these visited intermediate states. The hindsight goals and the related trajectories are then used to train a universal value function parameterized by a goal input, using algorithms such as deep deterministic policy gradient (DDPG, Lillicrap et al. (2016)). A good way to think of the success of HER is to view it as an implicit curriculum that first learns with intermediate goals that are easy to achieve under the current value function and later with more difficult goals that are closer to the final goal. A notable difference between HER and curriculum learning is that HER does not require an explicit distribution of the initial environment states, which makes it more applicable to many real problems.
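The relabeling idea can be illustrated in a few lines. The snippet below is a minimal sketch and not the implementation of Andrychowicz et al. (2017): the trajectory format, the hypothetical `achieved` mapping from states to goals, the 0/-1 reward convention, and the choice of sampling a goal from a later step of the same trajectory are all assumptions made for the example.

```python
import random


def relabel_with_hindsight(trajectory, achieved, threshold=0.05, k=4, rng=random):
    """Return extra (state, action, next_state, goal, reward) tuples.

    trajectory: list of (state, action, next_state); states are floats here.
    achieved:   hypothetical mapping from a state to the goal it realizes.
    """
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        for _ in range(k):
            # Pick a state visited at or after step t and pretend it was the goal.
            future = rng.randrange(t, len(trajectory))
            goal = achieved(trajectory[future][2])
            reached = abs(achieved(s_next) - goal) <= threshold
            relabeled.append((s, a, s_next, goal, 0.0 if reached else -1.0))
    return relabeled


# A 1-D toy rollout that never reaches a distant true goal (say, 10.0).
rollout = [(0.0, +1, 1.0), (1.0, +1, 2.0), (2.0, +1, 3.0)]
extra = relabel_with_hindsight(rollout, achieved=lambda s: s, rng=random.Random(0))
```

Even though the rollout fails with respect to the true goal, some relabeled transitions carry the success reward, which is what gives the critic a learning signal.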
In this paper, we study the problem of automatically generating valuable hindsight goals which are more effective for exploration. Different from the random curriculum heuristics used in the original HER, where a goal is drawn as an achieved state in a random trajectory, we propose a new approach that finds intermediate goals that are easy to achieve in the short term and are also likely to lead to the final goal in the long term. To do so, we first approximate the value function of the actual goal distribution by a lower bound that decomposes into two terms: a value function based on a hindsight goal distribution and the Wasserstein distance between the two distributions. Then, we introduce an efficient discrete Wasserstein Barycenter solver to generate a set of hindsight goals that optimizes the lower bound. Finally, such goals are used for exploration.
In the experiments, we evaluate our Hindsight Goal Generation approach on a broad set of robotic manipulation tasks. By incorporating the hindsight goals, a significant improvement in sample efficiency is demonstrated over DDPG+HER. Ablation studies show that our exploration strategy is robust across a wide set of hyperparameters.
2 Background
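Reinforcement Learning The goal of a reinforcement learning agent is to interact with a given environment and maximize its expected cumulative reward. The environment is usually modeled by a Markov Decision Process (MDP), given by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ represent the sets of states and actions respectively. $P$ is the transition function and $R$ is the reward function. $\gamma$ is the discount factor. The agent tries to find a policy $\pi$ that maximizes its expected cumulative reward $V^\pi(s_0)$, where $s_0$ is usually given or drawn from a distribution of initial states. The value function $V^\pi(s)$ is defined as
$$V^\pi(s) = \mathbb{E}_{s_0 = s,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right].$$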
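As a small worked example of the discounted cumulative reward inside the value function, the sketch below accumulates a finite reward sequence backwards in time. The function name and the toy reward sequence are illustrative assumptions; averaging this quantity over many rollouts from the same start state would give a Monte Carlo estimate of the value.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three unsuccessful steps followed by one successful step, with gamma = 0.5:
# -1 - 0.5 - 0.25 + 0 = -1.75
episode_return = discounted_return([-1.0, -1.0, -1.0, 0.0], gamma=0.5)
```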
Goal-oriented MDP In this paper, we consider a specific class of MDPs called goal-oriented MDPs. We use $\mathcal{G}$ to denote the set of goals. Different from a traditional MDP, the reward function $R_g$ is a goal-conditioned sparse and binary signal indicating whether the goal is achieved:
$$R_g(s_t, a_t, s_{t+1}) := \begin{cases} 0, & \|\phi(s_{t+1}) - g\|_2 \le \delta_g \\ -1, & \text{otherwise.} \end{cases} \tag{1}$$
$\phi : \mathcal{S} \to \mathcal{G}$ is a known and tractable mapping that defines the goal representation. $\delta_g$ is a given threshold indicating whether the goal is considered to be reached (see Plappert et al. (2018)).
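A sparse, goal-conditioned reward of this kind can be sketched as follows. The 0/-1 convention, the Euclidean distance, and the toy state layout are assumptions made for illustration; `phi` and `threshold` stand for the goal mapping and the success threshold described above.

```python
import math


def sparse_goal_reward(next_state, goal, phi, threshold):
    """0 if the achieved goal phi(next_state) is within `threshold` of `goal`, else -1."""
    return 0.0 if math.dist(phi(next_state), goal) <= threshold else -1.0


# Hypothetical FetchPush-like state: (gripper_x, gripper_y, box_x, box_y).
# The goal mapping keeps only the box position.
def box_position(state):
    return state[2:4]


near = sparse_goal_reward((0.0, 0.0, 1.0, 1.0), (1.0, 1.02), box_position, 0.05)
far = sparse_goal_reward((0.0, 0.0, 1.0, 1.0), (1.0, 2.0), box_position, 0.05)
```

Note that the reward depends on the gripper only through its effect on the box, which is why the goal mapping discards the gripper coordinates.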
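Universal value function The idea of a universal value function is to use a single function approximator, such as a neural network, to represent a large number of value functions. For goal-oriented MDPs, the goal-based value function of a policy $\pi$ for any given goal $g$ is defined as $V^\pi(s, g)$, for all states $s \in \mathcal{S}$. That is,
$$V^\pi(s, g) := \mathbb{E}_{s_0 = s,\; a_t \sim \pi(\cdot \mid s_t, g),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^t R_g(s_t, a_t, s_{t+1})\right]. \tag{2}$$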
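Let $\mathcal{T}^*$ be the joint distribution over starting state $s_0$ and goal $g$. That is, at the start of every episode, a state-goal pair $(s_0, g)$ will be drawn from the task distribution $\mathcal{T}^*$. The agent tries to find a policy $\pi$ that maximizes the expectation of the discounted cumulative reward
$$V^\pi(\mathcal{T}^*) := \mathbb{E}_{(s_0, g) \sim \mathcal{T}^*}\left[V^\pi(s_0, g)\right]. \tag{3}$$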
Goal-oriented MDPs characterize several reinforcement learning benchmark tasks, such as the robotics tasks in the OpenAI Gym environment (Plappert et al., 2018). For example, in the FetchPush task (see Figure 1), the agent needs to learn to push a box to a designated point. In this task, the state of the system $s$ contains the status of both the robot and the box. The goal $g$, on the other hand, only indicates the designated position of the box. Thus, the mapping $\phi$ is defined as a mapping from a system state $s$ to the position of the box in $s$.
Access to Simulator A common assumption made by previous work is access to a universal simulator that allows the environment to be reset to any given state (Florensa et al., 2017; Ecoffet et al., 2019). This kind of simulator is excessively powerful and hard to build when acting in the real world. In contrast, our method does not require a universal simulator, and is thus more realizable.
3 Related Work
Multi-Goal RL The role of goal-conditioned policies has been investigated widely in deep reinforcement learning scenarios (Pong et al., 2019). A few examples include grasping skills in imitation learning (Pathak et al., 2018; Srinivas et al., 2018), disentangling task knowledge from the environment (Mao et al., 2018a; Ghosh et al., 2019), and constituting the lower-level controller in hierarchical RL (Oh et al., 2017; Nachum et al., 2018; Huang et al., 2019; Eysenbach et al., 2019). By learning a universal value function which parameterizes the goal using a function approximator (Schaul et al., 2015), an agent is able to learn multiple tasks simultaneously (Kaelbling, 1993; Veeriah et al., 2018) and identify important decision states (Goyal et al., 2019b). It has been shown that multi-task learning with a goal-conditioned policy improves generalizability to unseen goals (e.g., Schaul et al. (2015)).
Hindsight Experience Replay Hindsight Experience Replay (Andrychowicz et al., 2017) is an effective experience replay strategy which generates reward signals from failed trajectories. The idea of hindsight experience replay can be extended to various goal-conditioned problems, such as hierarchical RL (Levy et al., 2019), dynamic goal pursuit (Fang et al., 2019a), goal-conditioned imitation (Ding et al., 2019; Sun et al., 2019) and visual robotics applications (Nair et al., 2018; Sahni et al., 2019). It has also been shown that hindsight experience replay can be combined with on-policy reinforcement learning algorithms by importance sampling (Rauber et al., 2019).
Curriculum Learning in RL Curriculum learning in RL usually suggests using a sequence of auxiliary tasks to guide policy optimization, which is also related to multi-task learning, lifelong learning, and transfer learning. The research interest in automatic curriculum design has seen rapid growth recently, where approaches have been proposed to schedule a given set of auxiliary tasks (Riedmiller et al., 2018; Colas et al., 2019), and to provide intrinsic motivation (Forestier et al., 2017; Péré et al., 2018; Sukhbaatar et al., 2018; Colas et al., 2018). Generating goals which lead to high-value states could substantially improve the sample efficiency of an RL agent (Goyal et al., 2019a). Guided exploration through curriculum generation is also an active research topic, where either the initial state (Florensa et al., 2017) or the goal position (Baranes and Oudeyer, 2013; Florensa et al., 2018) is considered as a manipulable factor to generate the intermediate tasks. However, most curriculum learning methods are domain-specific, and building a generalized framework for curriculum learning remains an open problem.
4 Automatic Hindsight Goal Generation
As discussed in the previous section, HER provides an effective solution to the sparse reward challenge in object manipulation tasks, in which achieved states in some past trajectories are replayed as imaginary goals. In other words, HER modifies the task distribution in the replay buffer to generate a set of auxiliary nearby goals, which can be used for further exploration and improve the performance of an off-policy RL agent that is expected to reach a very distant goal. However, the distribution of hindsight goals on which the policy is trained might differ significantly from the original task or goal distribution. Take Figure 1 as an example: the desired goal distribution lies on the red segment, which is far away from the initial position. In this situation, those hindsight goals may not be effective enough to promote policy optimization in the original task. The goal of our work is to develop a new approach to generate valuable hindsight goals that will improve the performance on the original task.
In the rest of this section, we will present a new algorithmic framework as well as our implementation for automatic hindsight goal generation for better exploration.
4.1 Algorithmic Framework
Following Florensa et al. (2018), our approach relies on the following generalizability assumption.
Assumption 1.
A value function of a policy $\pi$ for a specific goal $g$ has some generalizability to another goal $g'$ close to $g$.
One possible mathematical characterization of Assumption 1 is via Lipschitz continuity. Similar assumptions have been widely applied in many scenarios (Asadi et al., 2018; Luo et al., 2019):
$$\left|V^\pi(s, g) - V^\pi(s', g')\right| \le L \cdot d\big((s, g), (s', g')\big), \tag{4}$$
where $d\big((s, g), (s', g')\big)$ is a metric defined by
$$d\big((s, g), (s', g')\big) = c\,\|\phi(s) - \phi(s')\|_2 + \|g - g'\|_2 \tag{5}$$
for some hyperparameter $c > 0$ that provides a trade-off between the distance between initial states and the distance between final goals. $\phi(\cdot)$ is a state abstraction that maps from the state space to the goal space. When experimenting with the tasks in the OpenAI Gym environment (Plappert et al., 2018), we simply adopt the state-goal mapping as defined in (1). Although Lipschitz continuity may not hold for every $s, s' \in \mathcal{S}$ and $g, g' \in \mathcal{G}$, we only require continuity over some specific region. It is reasonable to claim that the bound in Eq. (4) holds for most $(s, g), (s', g')$ when $d\big((s, g), (s', g')\big)$ is not too large.
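To make the metric and the generalizability condition concrete, the sketch below computes a distance of the form $c\,\|\phi(s) - \phi(s')\|_2 + \|g - g'\|_2$ between two tasks and lists the pairs that violate a Lipschitz bound for a given constant. The function names, the identity goal mapping, and the toy value function (negative distance to the goal) are assumptions chosen only so that the example is checkable by hand.

```python
import math


def task_distance(task_a, task_b, phi, c):
    """c * ||phi(s) - phi(s')|| + ||g - g'|| for tasks (s, g) and (s', g')."""
    (s_a, g_a), (s_b, g_b) = task_a, task_b
    return c * math.dist(phi(s_a), phi(s_b)) + math.dist(g_a, g_b)


def lipschitz_violations(value_fn, tasks, phi, c, lipschitz):
    """Pairs of tasks whose value gap exceeds lipschitz * task_distance."""
    bad = []
    for i, a in enumerate(tasks):
        for b in tasks[i + 1:]:
            gap = abs(value_fn(*a) - value_fn(*b))
            if gap > lipschitz * task_distance(a, b, phi, c) + 1e-12:
                bad.append((a, b))
    return bad


def identity(state):
    return state


def neg_distance(state, goal):
    # Toy value function: the farther the goal, the lower the value.
    return -math.dist(state, goal)


tasks = [((0.0, 0.0), (1.0, 0.0)), ((3.0, 4.0), (1.0, 0.0)), ((0.5, 0.5), (2.0, 2.0))]
d01 = task_distance(tasks[0], tasks[1], identity, c=3.0)  # 3 * 5 + 0
violations = lipschitz_violations(neg_distance, tasks, identity, c=1.0, lipschitz=1.0)
```

For this toy value function the bound holds with constant 1 by the triangle inequality, so no pair is reported; a learned critic would generally satisfy such a bound only locally.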
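Partly due to the reward sparsity of the distant goals, optimizing the expected cumulative reward (see Eq. (3)) from scratch is very difficult. Instead, we propose to optimize a relaxed lower bound which introduces intermediate goals that may be easier to optimize. Theorem 1 below establishes such a lower bound.
Theorem 1. Assuming that the generalizability condition (Eq. (4)) holds for two distributions $(s, g) \sim \mathcal{T}$ and $(s', g') \sim \mathcal{T}'$, we have
$$V^\pi(\mathcal{T}') \ge V^\pi(\mathcal{T}) - L \cdot D(\mathcal{T}, \mathcal{T}'), \tag{6}$$
where $D(\cdot, \cdot)$ is the Wasserstein distance based on $d(\cdot, \cdot)$,
$$D\big(\mathcal{T}^{(1)}, \mathcal{T}^{(2)}\big) := \inf_{\mu \in \Gamma(\mathcal{T}^{(1)}, \mathcal{T}^{(2)})} \mathbb{E}_{\mu}\left[d\big((s_0^{(1)}, g^{(1)}), (s_0^{(2)}, g^{(2)})\big)\right],$$
where $\Gamma(\mathcal{T}^{(1)}, \mathcal{T}^{(2)})$ denotes the collection of all joint distributions $\mu\big(s_0^{(1)}, g^{(1)}, s_0^{(2)}, g^{(2)}\big)$ whose marginal probabilities are $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$, respectively. The proof of Theorem 1 is deferred to Appendix A.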
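For two equally weighted particle sets of the same size, the Wasserstein distance reduces to the cost of the best one-to-one pairing divided by the number of particles, because an optimal coupling can then be chosen to be a permutation. The sketch below illustrates this by brute force on a tiny example; the function name, the 1-D goals, and the plugged-in value and Lipschitz constant are assumptions, and the enumeration is exponential, so it is only meant for illustration.

```python
import itertools


def empirical_wasserstein(particles_a, particles_b, dist):
    """Wasserstein-1 distance between two uniform particle sets of equal size."""
    assert len(particles_a) == len(particles_b)
    n = len(particles_a)
    # Enumerate all one-to-one pairings and keep the cheapest total cost.
    best = min(
        sum(dist(particles_a[i], particles_b[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(n))
    )
    return best / n


hindsight_goals = [0.0, 1.0]
target_goals = [1.0, 2.0]
D = empirical_wasserstein(hindsight_goals, target_goals, lambda x, y: abs(x - y))

# Lower bound of the form V(T) - L * D(T, T*), with hypothetical V(T) = -3.0, L = 5.0.
lower_bound = -3.0 - 5.0 * D
```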
It follows from Theorem 1 that optimizing the cumulative reward in Eq. (3) can be relaxed into the following surrogate problem
$$\max_{\mathcal{T},\, \pi} \; V^\pi(\mathcal{T}) - L \cdot D(\mathcal{T}, \mathcal{T}^*). \tag{7}$$
Note that this new objective function is very intuitive. Instead of optimizing with the difficult goal/task distribution $\mathcal{T}^*$, we hope to find a collection of surrogate goals $\mathcal{T}$, which are both easy to optimize and close to or converging towards $\mathcal{T}^*$. However, the joint optimization of $\pi$ and $\mathcal{T}$ is nontrivial. This is because a) $\mathcal{T}$ is a high-dimensional distribution over tasks, b) the policy $\pi$ is optimized with respect to a shifting task distribution $\mathcal{T}$, and c) the estimate of the value function $V^\pi(\mathcal{T})$ may not be very accurate during training.
Inspired by Andrychowicz et al. (2017), we adopt the idea of using hindsight goals here. We first enforce $\mathcal{T}$ to be a finite set of $K$ particles which can only be taken from the already achieved states/goals in the replay buffer $B$. In other words, the support of the set $\mathcal{T}$ should lie inside $B$. Meanwhile, we notice that a direct implementation of problem Eq. (7) may lead to degenerate hindsight goal selection during training, i.e., the goals may all be drawn from a single trajectory and thus fail to provide sufficient exploration. Therefore, we introduce an extra diversity constraint: from every trajectory $\tau \in B$, only a bounded number of states can be selected in $\mathcal{T}$. In practice, we find that simply setting this bound to 1 results in reasonable performance. It is shown in Section 5.3 that this diversity constraint indeed improves the robustness of our algorithm.
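Finally, the optimization problem we aim to solve is
$$\max_{\pi,\; \mathcal{T}:\, |\mathcal{T}| = K} \; V^\pi(\mathcal{T}) - L \cdot D(\mathcal{T}, \mathcal{T}^*)$$
$$\text{s.t.} \quad \sum_{s_0, s_t \in \tau} \mathbb{1}\big[(s_0, \phi(s_t)) \in \mathcal{T}\big] \le 1 \;\; \forall \tau \in B, \qquad \sum_{\tau \in B} \sum_{s_0, s_t \in \tau} \mathbb{1}\big[(s_0, \phi(s_t)) \in \mathcal{T}\big] = K.$$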
To solve the above optimization, we adopt a two-stage iterative algorithm. First, we apply a policy optimization algorithm, for example DDPG, to maximize the value function conditioned on the task set $\mathcal{T}$. Then we fix $\pi$ and optimize the hindsight set $\mathcal{T}$ subject to the diversity constraint, which is a variant of the well-known Wasserstein Barycenter problem with a bias term (the value function) for each particle. We iterate the above process until the policy achieves a desirable performance or we reach a computation budget. The first optimization, of the value function, is straightforward; in our work, we simply use the DDPG+HER framework for it. The second optimization, of the hindsight goals, is nontrivial. In the following, we describe an efficient approximation algorithm for it.
4.2 Solving Wasserstein Barycenter Problem via Bipartite Matching
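Since we assume that $\mathcal{T}$ is hindsight and consists of $K$ particles, we can approximately solve the above Wasserstein Barycenter problem in the combinatorial setting as a bipartite matching problem. Instead of dealing with $\mathcal{T}^*$ directly, we draw $K$ samples from $\mathcal{T}^*$ to empirically approximate it by a set of $K$ particles $\hat{\mathcal{T}}^*$. In this way, the hindsight task set $\mathcal{T}$ can be solved as follows. For every task instance $(\hat{s}_0^i, \hat{g}^i) \in \hat{\mathcal{T}}^*$, we find a state trajectory $\tau^i = \{s_t^i\} \in B$ such that together they minimize the sum
$$\sum_{(\hat{s}_0^i, \hat{g}^i) \in \hat{\mathcal{T}}^*} w\big((\hat{s}_0^i, \hat{g}^i), \tau^i\big), \tag{8}$$
where we define
$$w\big((\hat{s}_0^i, \hat{g}^i), \tau^i\big) := c\,\|\phi(\hat{s}_0^i) - \phi(s_0^i)\|_2 + \min_t \left( \|\hat{g}^i - \phi(s_t^i)\|_2 - \frac{1}{L} V^\pi\big(s_0^i, \phi(s_t^i)\big) \right). \tag{9}$$
Finally, we select each corresponding achieved state $s_t^i \in \tau^i$ to construct the hindsight goal $(\hat{s}_0^i, \phi(s_t^i)) \in \mathcal{T}$. It is not hard to see that the above combinatorial optimization exactly identifies the optimal solution of the above-mentioned Wasserstein Barycenter problem. In practice, the Lipschitz constant $L$ is unknown and therefore treated as a hyperparameter.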
The optimal solution of the combinatorial problem in Eq. (8) can be found efficiently by the well-known maximum weight bipartite matching (Munkres, 1957; Duan and Su, 2012). The bipartite graph is constructed as follows. Vertices are split into two partitions $V_x, V_y$. Every vertex in $V_x$ represents a task instance $(\hat{s}_0, \hat{g}) \in \hat{\mathcal{T}}^*$, and every vertex in $V_y$ represents a trajectory $\tau \in B$. The weight of the edge connecting $(\hat{s}_0, \hat{g})$ and $\tau$ is $-w\big((\hat{s}_0, \hat{g}), \tau\big)$ with $w$ as defined in Eq. (9), so that a maximum weight matching minimizes the sum in Eq. (8). In this paper, we apply the Minimum Cost Maximum Flow algorithm to solve this bipartite matching problem (for example, see Ahuja et al. (1993)).
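The selection step can be illustrated on a toy instance. The sketch below is not the authors' solver: it replaces the Minimum Cost Maximum Flow algorithm with brute-force enumeration of assignments, which is only feasible for a handful of tasks and trajectories. It assumes an edge cost of the form $c\,\|\phi(\hat{s}_0) - \phi(s_0)\|_2 + \min_t\big(\|\hat{g} - \phi(s_t)\|_2 - V^\pi(s_0, \phi(s_t))/L\big)$, at most one goal per trajectory, and hypothetical names throughout; the defaults for `c` and `lipschitz` follow the values listed in Appendix C.1.

```python
import itertools
import math


def edge_cost(task, trajectory, value_fn, phi, c, lipschitz):
    """Cost of pairing a target task (s0_hat, g_hat) with a stored trajectory.

    Returns (cost, goal), where goal is the achieved goal phi(s_t) that minimizes
    ||g_hat - phi(s_t)|| - V(s_0, phi(s_t)) / lipschitz over the trajectory.
    """
    s0_hat, g_hat = task
    s0 = trajectory[0]
    candidates = [
        (math.dist(g_hat, phi(s)) - value_fn(s0, phi(s)) / lipschitz, phi(s))
        for s in trajectory
    ]
    goal_cost, goal = min(candidates, key=lambda item: item[0])
    return c * math.dist(phi(s0_hat), phi(s0)) + goal_cost, goal


def select_hindsight_goals(tasks, buffer, value_fn, phi, c=3.0, lipschitz=5.0):
    """Assign each task a distinct trajectory so that the total edge cost is minimal."""
    table = [[edge_cost(t, traj, value_fn, phi, c, lipschitz) for traj in buffer]
             for t in tasks]
    # Brute force over injective assignments task i -> trajectory perm[i].
    best = min(
        itertools.permutations(range(len(buffer)), len(tasks)),
        key=lambda perm: sum(table[i][j][0] for i, j in enumerate(perm)),
    )
    return [(tasks[i][0], table[i][j][1]) for i, j in enumerate(best)]


# Toy 2-D example: states are object positions, so the goal mapping is the identity.
tasks = [((0.0, 0.0), (1.0, 0.0)), ((0.0, 0.0), (0.0, 1.0))]
buffer = [
    [(0.0, 0.0), (0.0, 0.4)],
    [(0.0, 0.0), (0.5, 0.0)],
    [(0.0, 0.0), (0.1, 0.1)],
]
selected = select_hindsight_goals(tasks, buffer, value_fn=lambda s0, g: 0.0,
                                  phi=lambda s: s)
```

With an uninformative critic (value zero everywhere), the selection reduces to picking, for each target, the closest achieved goal while never using the same trajectory twice. A polynomial-time matching or flow solver would be used in place of the enumeration for realistic buffer sizes.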
Overall Algorithm The overall description of our algorithm is shown in Algorithm 1. Note that the only modification made by our exploration strategy is in Step 8, in which we generate hindsight goals to guide the agent to collect more valuable trajectories. It is therefore complementary to other improvements in DDPG/HER around Step 16, such as the prioritized experience replay strategy (Schaul et al., 2016; Zhao and Tresp, 2018; Zhao et al., 2019) and other variants of hindsight experience replay (Fang et al., 2019b; Bai et al., 2019).
5 Experiments
Our experiment environments are based on the standard robotic manipulation environments in the OpenAI Gym (Brockman et al., 2016):

Fetch environments: Initial object position and goal are generated uniformly at random from two distant segments.

Hand-manipulation environments: These tasks require the agent to rotate the object into a given pose, and only rotations around a single axis are considered here. We restrict the initial axis-angle to a small interval, and the target pose is generated in its symmetry. That is, the object needs to be rotated in about degree.

Reach environment: FetchReach and HandReach do not support randomization of the initial state, so we restrict their target distribution to be a subset of the original goal space.
Regarding baseline comparison, we consider the original DDPG+HER algorithm. We also investigate the integration of experience replay prioritization strategies, such as the Energy-Based Prioritization (EBP) proposed by Zhao and Tresp (2018), which draws on prior knowledge of the physical system to exploit valuable trajectories. More details of the experiment settings are included in Appendix B.
5.1 HGG Generates Better Hindsight Goals for Exploration




We first check whether HGG is able to generate meaningful hindsight goals for exploration. We compare HGG and HER in the FetchPush environment. It is shown in Figure 2 that HGG algorithm generates goals that gradually move towards the target region. Since those goals are hindsight, they are considered to be achieved during training. In comparison, the replay distribution of a DDPG+HER agent has been stuck around the initial position for many iterations, indicating that those goals may not be able to efficiently guide exploration.
Performance on benchmark robotics tasks
Then we check whether the exploration provided by the goals generated by HGG can result in better policy training performance. As shown in Figure 3, we compare vanilla HER, HER with Energy-Based Prioritization (HER+EBP), HGG, and HGG+EBP. It is worth noting that since EBP is designed for the Bellman equation updates, it is complementary to our HGG-based exploration approach. Among the eight environments, HGG substantially outperforms HER on four and has comparable performance on the other four, which are either too simple or too difficult. When combined with EBP, HGG+EBP achieves the best performance on six environments that are eligible.
Performance on tasks with obstacle In a more difficult task, a crafted metric may be more suitable than the $\ell_2$-distance used in Eq. (5). As shown in Figure 4, we created an environment based on FetchPush with a rigid obstacle. The object and the goal are uniformly generated in the green and the red segments respectively. The brown block is a static wall which cannot be moved. In addition to $\ell_2$, we also construct a distance metric based on the graph distance of a mesh grid on the plane; the blue line is a successful trajectory under this handcrafted distance measure. A more detailed description is deferred to Appendix B.3. Intuitively, this crafted distance should be better than $\ell_2$ due to the existence of the obstacle. Experimental results suggest that such a crafted distance metric provides better guidance for goal generation and training, and significantly improves sample efficiency over the $\ell_2$ distance. Investigating ways to obtain or learn a good metric is a direction for future work.
5.2 Comparison with Explicit Curriculum Learning
Since our method can be seen as an explicit curriculum learning method for exploration, where we generate hindsight goals as an intermediate task distribution, we also compare our method with another recently proposed curriculum learning method for RL. Florensa et al. (2018) leverage Least-Squares GAN (Mao et al., 2018b) to mimic the set called Goals of Intermediate Difficulty as an exploration goal generator.
Specifically, in our task settings, we define a goal set based on the average success rate in a small region around each goal $g$. To sample from this set, we implement an oracle goal generator based on rejection sampling, which can uniformly sample goals from it. The result in Figure 5 indicates that our Hindsight Goal Generation substantially outperforms HER even when HER is given goals from the oracle generator. Note that this experiment is run on an environment with a fixed initial state due to the limitation of Florensa et al. (2018). The parameter choice for this goal set is also suggested by Florensa et al. (2018).
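An oracle generator based on rejection sampling can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the experiments: `success_rate` stands for a hypothetical oracle returning the average success rate around a goal, and the acceptance interval `[low, high]` and the toy 1-D goal space are placeholders.

```python
import random


def sample_intermediate_goal(success_rate, sample_goal, low, high, rng,
                             max_tries=10_000):
    """Draw goals uniformly until one has a success rate in [low, high]."""
    for _ in range(max_tries):
        goal = sample_goal(rng)
        if low <= success_rate(goal) <= high:
            return goal
    raise RuntimeError("no goal of intermediate difficulty found")


# Toy setting: goals in [0, 1]; farther goals succeed less often.
rng = random.Random(0)
goals = [
    sample_intermediate_goal(
        success_rate=lambda g: 1.0 - g,
        sample_goal=lambda r: r.random(),
        low=0.1,
        high=0.9,
        rng=rng,
    )
    for _ in range(100)
]
```

Because proposals are uniform and acceptance depends only on membership in the set, the accepted goals are uniformly distributed over the set of intermediate-difficulty goals.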
5.3 Ablation Studies on Hyperparameter Selection
In this section, we set up a set of ablation tests on several hyperparameters used in the Hindsight Goal Generation algorithm.
Lipschitz $L$: The selection of the Lipschitz constant is task dependent, since it is related to the scale of the value function and the goal distance. For the robotics tasks tested in this paper, we find that it is easier to set $L$ by first dividing it by the upper bound of the distance between any two final goals in an environment. We test a few choices of $L$ on several environments and find that it is very easy to find a range of $L$ that works well and shows robustness for all the environments tested in this section. We show the learning curves on FetchPush with different $L$. It appears that the performance of HGG is reasonable as long as $L$ is not too small. For all tasks we tested in the comparisons, we set $L = 5.0$.
Distance weight $c$: Parameter $c$ defines the trade-off between the initial state similarity and the goal similarity. A larger $c$ encourages our algorithm to choose hindsight goals that have closer initial states. Results in Figure 6 indicate that the choice of $c$ is indeed robust. For all tasks we tested in the comparisons, we set $c = 3.0$.
Number of hindsight goals $K$: We find that for the simple tasks, the choice of $K$ is not critical. Even a greedy approach (corresponding to $K = 1$) can achieve competitive performance, e.g. on FetchPush in the third panel of Figure 6. For more difficult environments, such as FetchPickAndPlace, a larger batch size can significantly reduce the variance of training results. For all tasks tested in the comparisons, we plotted the best results given by $K \in \{50, 100\}$.
6 Conclusion
We present a novel automatic hindsight goal generation algorithm, by which valuable hindsight imaginary tasks are generated to enable efficient exploration for goal-oriented off-policy reinforcement learning. We formulate this idea as a surrogate optimization to identify hindsight goals that are easy to achieve and also likely to lead to the actual goal. We introduce a combinatorial solver to generate such intermediate tasks. Extensive experiments demonstrate better goal-oriented exploration of our method over the original HER and curriculum learning on a collection of robotic learning tasks. A future direction is to incorporate controllable representation learning (Thomas et al., 2017) to provide task-specific distance metrics (Ghosh et al., 2019; Srinivas et al., 2018), which may generalize our method to more complicated cases where the standard Wasserstein distance cannot be applied directly.
Appendix A Proof of Theorem 1
In this section we provide the proof of Theorem 1.
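One way to see why the bound holds is a standard coupling argument, sketched below. It assumes that the Lipschitz condition of Eq. (4) holds for all pairs in the supports of the two distributions and that the bound has the form $V^\pi(\mathcal{T}') \ge V^\pi(\mathcal{T}) - L \cdot D(\mathcal{T}, \mathcal{T}')$; it is offered as a sketch and is not necessarily the authors' exact derivation.

```latex
% For any coupling \mu of T and T', i.e. \mu \in \Gamma(T, T'):
\begin{align*}
V^\pi(\mathcal{T}')
  &= \mathbb{E}_{((s,g),(s',g')) \sim \mu}\left[ V^\pi(s', g') \right]
  && \text{(the second marginal of } \mu \text{ is } \mathcal{T}'\text{)} \\
  &\ge \mathbb{E}_{\mu}\left[ V^\pi(s, g) - L \cdot d\big((s,g),(s',g')\big) \right]
  && \text{(by Eq. (4))} \\
  &= V^\pi(\mathcal{T}) - L \cdot \mathbb{E}_{\mu}\left[ d\big((s,g),(s',g')\big) \right]
  && \text{(the first marginal of } \mu \text{ is } \mathcal{T}\text{)}.
\end{align*}
% The left-hand side does not depend on \mu, so we may take the supremum of the
% right-hand side over all couplings, which replaces the expectation by its infimum:
\begin{equation*}
V^\pi(\mathcal{T}') \;\ge\; V^\pi(\mathcal{T})
  - L \cdot \inf_{\mu \in \Gamma(\mathcal{T}, \mathcal{T}')}
    \mathbb{E}_{\mu}\left[ d\big((s,g),(s',g')\big) \right]
  \;=\; V^\pi(\mathcal{T}) - L \cdot D(\mathcal{T}, \mathcal{T}').
\end{equation*}
```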
Appendix B Experiment Settings
B.1 Modified Environments
Fetch Environments:

FetchPush-v1: Let the origin denote the projection of the gripper’s initial coordinate on the table. The object is uniformly generated on the segment , and the goal is uniformly generated on the segment .

FetchPickAndPlace-v1: Let the origin denote the projection of the gripper’s initial coordinate on the table. The object is uniformly generated on the segment , and the goal is uniformly generated on the segment .

FetchSlide-v1: Let the origin denote the projection of the gripper’s initial coordinate on the table. The object is uniformly generated on the segment , and the goal is uniformly generated on the segment .
Hand Environments:

HandManipulateBlockRotate-v0, HandManipulateEggRotate-v0: Let be the default initial state defined in the original simulator [Plappert et al., 2018]. The initial pose is generated by applying a rotation around axis, where the rotation degree is uniformly sampled from . The goal is also rotated from around axis, where the degree is uniformly sampled from .

HandManipulatePenRotate-v0: We use the same setting as the original simulator.
Reach Environments:

FetchReach-v1: Let the origin denote the coordinate of the gripper’s initial position. The goal is uniformly generated on the segment .

HandReach-v0: Uniformly select one dimension of the meeting point and add an offset of 0.005, where the meeting point is defined in the original simulator [Plappert et al., 2018].
Other attributes of the environment (such as the horizon and the reward function) are kept the same as the defaults.
B.2 Evaluation Details

All curves presented in this paper are plotted from 10 runs with random task initializations and seeds.

Shaded region indicates 60% population around median.

All curves are plotted using the same hyperparameters (except ablation section).
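The plotted statistics can be computed per evaluation point as in the sketch below. It assumes that "60% population around median" means the band between the 20th and 80th percentiles of the runs, and it uses the inclusive interpolation method of the standard library; neither detail is specified above, and the ten success rates are made-up numbers.

```python
import statistics


def median_and_band(values):
    """Median plus the 20th and 80th percentiles (the central 60% of the runs)."""
    cuts = statistics.quantiles(values, n=5, method="inclusive")  # 20/40/60/80 %
    return statistics.median(values), cuts[0], cuts[-1]


# Hypothetical success rates of 10 runs at a single evaluation point.
runs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
median, lower, upper = median_and_band(runs)
```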
B.3 Details of Experiment with Obstacle
We use the same coordinate system as in Appendix B.1: the origin denotes the projection of the gripper’s initial coordinate on the table. The object is uniformly generated on the segment , and the goal is uniformly generated on the segment . The wall lies on .
The crafted distance used in Figure 4 is calculated by the following rules.

The distance metric between two initial states is kept as before.

The distance between the hindsight goal and the desired goal is evaluated as the sum of two parts. The first part is the distance between the hindsight goal and its closest point on the blue polygonal line shown in Figure 4. The second part is the distance between this closest point and the desired goal, measured along the blue line.

The above two terms are combined with the same ratio used in Eq. (5).
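The goal-to-goal part of these rules can be sketched as a projection onto a polyline followed by an arc-length difference. The polyline coordinates below are made up (the real ones depend on the wall position), the desired goal is also projected onto the line in case it does not lie exactly on it, segments are assumed to have nonzero length, and the function names are hypothetical.

```python
import math


def project_onto_polyline(point, polyline):
    """Return (distance to the polyline, arc length at the closest point)."""
    best = (math.inf, 0.0)
    walked = 0.0  # arc length accumulated over the previous segments
    for a, b in zip(polyline, polyline[1:]):
        seg = (b[0] - a[0], b[1] - a[1])
        length = math.hypot(seg[0], seg[1])
        # Parameter of the orthogonal projection, clamped to the segment.
        t = ((point[0] - a[0]) * seg[0] + (point[1] - a[1]) * seg[1]) / length ** 2
        t = min(1.0, max(0.0, t))
        closest = (a[0] + t * seg[0], a[1] + t * seg[1])
        gap = math.dist(point, closest)
        if gap < best[0]:
            best = (gap, walked + t * length)
        walked += length
    return best


def crafted_goal_distance(hindsight_goal, desired_goal, polyline):
    """Off-line distance of the hindsight goal plus travel along the line."""
    gap, arc = project_onto_polyline(hindsight_goal, polyline)
    _, arc_target = project_onto_polyline(desired_goal, polyline)
    return gap + abs(arc_target - arc)


# Hypothetical L-shaped path around a wall: up one unit, then right one unit.
path = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
distance = crafted_goal_distance((-0.5, 0.5), (1.0, 1.0), path)  # 0.5 + 1.5
```

In this example the straight-line distance between the two goals is about 1.58, while the crafted distance is 2.0, reflecting the detour around the obstacle.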
B.4 Details of Experiment 5.2

Since the environment is deterministic, the success rate is defined as
where indicates a ball with radius , centered at , and is the same threshold used in reward function (1) and in success testing.

The average success rate oracle is estimated by samples.
Appendix C Implementation Details
C.1 Hyper-Parameters
Almost all hyperparameters of DDPG and HER are kept the same as in the benchmark results; only the following terms differ from Plappert et al. [2018]:

number of MPI workers: 1;

buffer size: trajectories.
Other hyperparameters:

Actor and critic networks: 3 layers with 256 units and ReLU activation;

Adam optimizer with learning rate;

Polyak-averaging coefficient: 0.95;

Action norm penalty coefficient: 1.0;

Batch size: 256;

Probability of random actions: 0.3;

Scale of additive Gaussian noise: 0.2;

Probability of HER experience replay: 0.8;

Number of batches to replay after collecting one trajectory: 20.
Hyperparameters in weighted bipartite matching:

Lipschitz constant $L$: 5.0;

Distance weight $c$: 3.0;

Number of hindsight goals $K$: 50 or 100.
C.2 Details on Data Processing

In policy training of HGG, we sample minibatches using HER.

As a normalization step, we use Lipschitz constant in backend computation, where is the diameter of the goal space , and corresponds to the amount discussed in ablation study.

To reduce the computational cost of bipartite matching, we approximate the buffer set by a First-In-First-Out queue containing recent trajectories.

An additional Gaussian noise is added to goals generated by HGG in Fetch environments. We don’t add this term in Hand environments because the goal space is not .
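The First-In-First-Out approximation of the buffer can be sketched with a bounded double-ended queue. The class name and the capacity of 3 are placeholders, since the actual queue size is not restated here.

```python
from collections import deque


class RecentTrajectories:
    """Keep only the most recent `capacity` trajectories for the matching step."""

    def __init__(self, capacity):
        # A deque with maxlen drops its oldest entry when a new one is appended.
        self._queue = deque(maxlen=capacity)

    def add(self, trajectory):
        self._queue.append(trajectory)

    def candidates(self):
        """Trajectories currently available as sources of hindsight goals."""
        return list(self._queue)


recent = RecentTrajectories(capacity=3)
for episode in range(5):
    recent.add([("state", episode)])
```

Restricting the matching to recent trajectories keeps the bipartite graph small and biases the candidate goals toward states reached by the current policy.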
Appendix D Additional Experiment Results
D.1 Additional Visualization of Hindsight Goals Generated by HGG
To give a better intuitive illustration of our motivation, we provide an additional visualization of the goal distribution generated by HGG on a complex manipulation task, FetchPickAndPlace (Figures 8(a) and 8(b)). In Figure 8(a), “blue to green” corresponds to the generated goals during training. HGG guides the agent to understand the location of the object in the early stage and to move it to its nearby region. Then the agent learns to move the object in the easiest direction, i.e. pushing the object to the location underneath the actual goal, and finally to pick it up. For tasks which are hard to visualize, such as the HandManipulation tasks, we plot the curves of distances between proposed exploratory goals and actually desired goals (Figure 8(c)). All experiments follow similar learning dynamics.
D.2 Evaluation on Standard Tasks
In this section, we provide experiment results on standard Fetch tasks. The learning curves are shown in Figure 10.
D.3 Additional Experiment Results on Section 5.2
D.4 Ablation Study
We provide full experiments on ablation study in Figure 12.
Footnotes
 Our code is available at https://github.com/StilwellGit/HindsightGoalGeneration.
References
 Network flows: theory, algorithms, and applications. PrenticeHall, Inc., Upper Saddle River, NJ, USA. External Links: ISBN 013617549X Cited by: §4.2.
 Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058. Cited by: 4th item, Figure 10, Figure 11, §1, §1, §3, §4.1.
 Lipschitz continuity in modelbased reinforcement learning. In International Conference on Machine Learning, pp. 264–273. Cited by: §4.1.
 Guided goal generation for hindsight multigoal reinforcement learning. Neurocomputing. Cited by: §4.2.
 Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems 61 (1), pp. 49–73. Cited by: §3.
 OpenAI gym. External Links: arXiv:1606.01540 Cited by: §5.
 CURIOUS: intrinsically motivated modular multi-goal reinforcement learning. In International Conference on Machine Learning, pp. 1331–1340. Cited by: §3.
 GEP-PG: decoupling exploration and exploitation in deep reinforcement learning algorithms. In International Conference on Machine Learning, pp. 1038–1047. Cited by: §3.
 Goal-conditioned imitation learning. In Advances in Neural Information Processing Systems. Cited by: §3.
 A scaling algorithm for maximum weight matching in bipartite graphs. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1413–1424. Cited by: §4.2.
 Go-Explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995. Cited by: §2.
 Search on the replay buffer: bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems. Cited by: §3.
 DHER: hindsight experience replay for dynamic goals. In International Conference on Learning Representations. Cited by: §3.
 Curriculum-guided hindsight experience replay. In Advances in Neural Information Processing Systems. Cited by: §4.2.
 Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, pp. 1514–1523. Cited by: §3, §4.1, §5.2, §5.2.
 Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning, pp. 482–495. Cited by: §2, §3.
 Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190. Cited by: §3.
 Learning actionable representations with goal-conditioned policies. In International Conference on Learning Representations. Cited by: §3, §6.
 Recall traces: backtracking models for efficient reinforcement learning. In International Conference on Learning Representations. Cited by: §3.
 InfoBot: transfer and exploration via the information bottleneck. In International Conference on Learning Representations. Cited by: §3.
 Mapping state space using landmarks for universal goal reaching. In Advances in Neural Information Processing Systems. Cited by: §3.
 Learning to achieve goals. In IJCAI, pp. 1094–1099. Cited by: §3.
 Learning to rank for recommender systems. In Proceedings of the 7th ACM conference on Recommender systems, pp. 493–494. Cited by: §1.
 End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §1.
 Learning multi-level hierarchies with hindsight. In International Conference on Learning Representations. Cited by: §3.
 Continuous control with deep reinforcement learning. In International Conference on Learning Representations. Cited by: §1.
 Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations. Cited by: §4.1.
 Universal agent for disentangling environments and tasks. In International Conference on Learning Representations. Cited by: §3.
 On the effectiveness of least squares generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §5.2.
 Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1.
 Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5 (1), pp. 32–38. Cited by: §4.2.
 Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313. Cited by: §3.
 Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200. Cited by: §3.
 Policy invariance under reward transformations: theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 278–287. Cited by: §1.
 Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, pp. 2661–2670. Cited by: §3.
 Zero-shot visual imitation. In International Conference on Learning Representations. Cited by: §3.
 Unsupervised learning of goal spaces for intrinsically motivated goal exploration. In International Conference on Learning Representations. Cited by: §3.
 Multi-goal reinforcement learning: challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464. Cited by: 1st item, 2nd item, §C.1, §2, §2, §4.1.
 Skew-Fit: state-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698. Cited by: §3.
 Hindsight policy gradients. In International Conference on Learning Representations. Cited by: §3.
 Learning by playing - solving sparse reward tasks from scratch. In International Conference on Machine Learning, pp. 4341–4350. Cited by: §3.
 Addressing sample complexity in visual tasks using HER and hallucinatory GANs. In Advances in Neural Information Processing Systems. Cited by: §3.
 Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320. Cited by: §3.
 Prioritized experience replay. In International Conference on Learning Representations. Cited by: §4.2.
 Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §1.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1.
 Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484. Cited by: §1.
 Universal planning networks: learning generalizable representations for visuomotor control. In International Conference on Machine Learning, pp. 4739–4748. Cited by: §3, §6.
 Intrinsic motivation and automatic curricula via asymmetric self-play. In International Conference on Learning Representations. Cited by: §3.
 Policy continuation with hindsight inverse dynamics. In Advances in Neural Information Processing Systems. Cited by: §3.
 The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, pp. 1064–1070. Cited by: §1.
 Independently controllable features. arXiv preprint arXiv:1708.01289. Cited by: §6.
 Many-goals reinforcement learning. arXiv preprint arXiv:1806.09605. Cited by: §3.
 Maximum entropy-regularized multi-goal reinforcement learning. In International Conference on Machine Learning, pp. 7553–7562. Cited by: §4.2.
 Energy-based hindsight experience prioritization. In Conference on Robot Learning, pp. 113–122. Cited by: §4.2, §5.