Exploration via Hindsight Goal Generation


Abstract

Goal-oriented reinforcement learning has recently become a practical framework for robotic manipulation tasks, in which an agent is required to reach a certain goal defined by a function on the state space. However, the sparsity of such reward definitions makes traditional reinforcement learning algorithms very inefficient. Hindsight Experience Replay (HER), a recent advance, has greatly improved sample efficiency and practical applicability for such problems. It exploits previous replays by constructing imaginary goals in a simple heuristic way, acting like an implicit curriculum to alleviate the challenge of sparse reward signals. In this paper, we introduce Hindsight Goal Generation (HGG), a novel algorithmic framework that generates valuable hindsight goals which are easy for an agent to achieve in the short term and also have the potential to guide the agent toward the actual goal in the long term. We have extensively evaluated our goal generation algorithm on a number of robotic manipulation tasks and demonstrated substantial improvement over the original HER in terms of sample efficiency.

1 Introduction

Recent advances in deep reinforcement learning (RL), including policy gradient methods (Schulman et al., 2015, 2017) and Q-learning (Mnih et al., 2015), have demonstrated a large number of successful applications in solving hard sequential decision problems, including robotics (Levine et al., 2016), games (Silver et al., 2016; Mnih et al., 2015), and recommendation systems (Karatzoglou et al., 2013), among others. To train a well-behaved policy, deep reinforcement learning algorithms use neural networks as function approximators to learn a state-action value function or a policy distribution that optimizes a long-term expected return. The convergence of the training process, particularly in Q-learning, is heavily dependent on the temporal pattern of the reward function (Szepesvári, 1998). For example, if a non-zero reward/return is provided only at the end of a rollout of a trajectory of length $T$, while no rewards are observed before the $T$-th time step, the Bellman updates of the Q-function become very inefficient, requiring at least $T$ steps to propagate the final return to the Q-values of all earlier state-action pairs. Such sparse or episodic reward signals are ubiquitous in many real-world problems, including complex games and robotic manipulation tasks (Andrychowicz et al., 2017). Therefore, despite its notable success, the application of RL to real-world problems is still quite limited, as the reward functions there can be sparse and very hard to engineer (Ng et al., 1999). In practice, human experts need to design reward functions that reflect the task to be solved and are also carefully shaped in a dense way so that RL algorithms can optimize them with good performance. However, the design of such dense reward functions is non-trivial in most real-world problems with sparse rewards. For example, in goal-oriented robotics tasks, an agent is required to reach some state satisfying predefined conditions or lying within a state set of interest. Many previous efforts have shown that sparse indicator rewards, rather than engineered dense rewards, often provide better practical performance when trained with deep Q-learning and policy optimization algorithms (Andrychowicz et al., 2017). In this paper, we focus on improving training and exploration for goal-oriented RL problems.

A notable advance is Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), which greatly improves the practical success of off-policy deep Q-learning for goal-oriented RL problems, including several difficult robotic manipulation tasks. The key idea of HER is to revisit previous states in the experience replay and construct a number of achieved hindsight goals based on these visited intermediate states. The hindsight goals and the related trajectories are then used to train a universal value function that takes the goal as an additional input, using algorithms such as deep deterministic policy gradient (DDPG, Lillicrap et al. (2016)). A good way to understand the success of HER is to view it as an implicit curriculum that first learns with intermediate goals that are easy to achieve under the current value function and later with the more difficult goals that are closer to the final goal. A notable difference between HER and curriculum learning is that HER does not require an explicit distribution of the initial environment states, which makes it more applicable to many real problems.

In this paper, we study the problem of automatically generating valuable hindsight goals which are more effective for exploration. Different from the random curriculum heuristic used in the original HER, where a goal is drawn as an achieved state in a random trajectory, we propose a new approach that finds intermediate goals that are easy to achieve in the short term and would also likely lead the agent to the final goal in the long term. To do so, we first approximate the value function of the actual goal distribution by a lower bound that decomposes into two terms, a value function based on a hindsight goal distribution and the Wasserstein distance between the two distributions. Then, we introduce an efficient discrete Wasserstein Barycenter solver to generate a set of hindsight goals that optimizes this lower bound. Finally, such goals are used for exploration.

In the experiments, we evaluate our Hindsight Goal Generation approach on a broad set of robotic manipulation tasks. By incorporating the hindsight goals, a significant improvement in sample efficiency is demonstrated over DDPG+HER. Ablation studies show that our exploration strategy is robust across a wide range of hyper-parameters.

2 Background

Reinforcement Learning The goal of a reinforcement learning agent is to interact with a given environment and maximize its expected cumulative reward. The environment is usually modeled as a Markov Decision Process (MDP), given by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ represent the sets of states and actions respectively, $\mathcal{P}(s' \mid s, a)$ is the transition function, $r(s, a)$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. The agent tries to find a policy $\pi$ that maximizes its expected cumulative reward $\mathbb{E}\big[\sum_{t} \gamma^t r(s_t, a_t)\big]$, where the initial state $s_0$ is usually given or drawn from a distribution of initial states. The value function of a policy $\pi$ is defined as $V^\pi(s) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\big]$.

Goal-oriented MDP In this paper, we consider a specific class of MDPs called goal-oriented MDPs. We use $\mathcal{G}$ to denote the set of goals. Different from a traditional MDP, the reward function $r_g$ is a goal-conditioned sparse and binary signal indicating whether the goal is achieved:

$$r_g(s_t, a_t, s_{t+1}) = \mathbb{1}\left[\, \|\phi(s_{t+1}) - g\|_2 \le \delta_g \,\right], \qquad (1)$$

where $\phi: \mathcal{S} \to \mathcal{G}$ is a known and tractable mapping that defines the goal representation, and $\delta_g$ is a given threshold indicating whether the goal is considered to be reached (see Plappert et al. (2018)).
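To make the sparse reward definition concrete, the following is a minimal Python sketch; the mapping `phi`, the toy state layout, and the threshold value are illustrative assumptions rather than the exact environment code.

```python
import numpy as np

def sparse_goal_reward(next_state, goal, phi, delta_g=0.05):
    """Binary goal-conditioned reward as in Eq. (1): 1 if the achieved goal
    phi(next_state) lies within delta_g of the desired goal, 0 otherwise.

    phi:     callable mapping a state to its goal-space representation.
    delta_g: success threshold; 0.05 m is a common choice in the Fetch
             environments, but the exact value is environment-specific.
    """
    achieved = phi(next_state)
    return float(np.linalg.norm(achieved - goal) <= delta_g)

# Toy usage with a hypothetical state layout where indices 3:6 hold the box position.
phi = lambda s: s[3:6]
state = np.zeros(10)
state[3:6] = [1.0, 0.5, 0.40]
goal = np.array([1.0, 0.5, 0.42])
print(sparse_goal_reward(state, goal, phi))  # 1.0: the box is within 5 cm of the goal
```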

Universal value function The idea of a universal value function is to use a single function approximator, such as a neural network, to represent a large number of value functions. For goal-oriented MDPs, the goal-conditioned value function of a policy $\pi$ for any given goal $g \in \mathcal{G}$ is defined for all states $s \in \mathcal{S}$. That is,

$$V^\pi(s \,\|\, g) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_g(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s\right]. \qquad (2)$$

Let $\mathcal{T}^*$ denote the joint distribution over starting state $s_0$ and goal $g$. That is, at the start of every episode, a state-goal pair $(s_0, g)$ is drawn from the task distribution $\mathcal{T}^*$. The agent tries to find a policy $\pi$ that maximizes the expectation of the discounted cumulative reward

$$\max_\pi \; \mathbb{E}_{(s_0, g) \sim \mathcal{T}^*}\!\left[V^\pi(s_0 \,\|\, g)\right]. \qquad (3)$$
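As an illustration of such a universal function approximator, a goal-conditioned critic can simply take the goal as an extra input concatenated with the state and action. The sketch below follows the layer sizes listed in Appendix C.1, but it is only an assumed architecture, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class UniversalCritic(nn.Module):
    """Goal-conditioned Q-function Q(s, a, g): the goal is concatenated with
    the state and action, so a single network covers all goals."""

    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, goal):
        # Concatenate along the feature dimension and predict a scalar value.
        return self.net(torch.cat([state, action, goal], dim=-1))
```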

The goal-oriented MDP characterizes several reinforcement learning benchmark tasks, such as the robotics tasks in the OpenAI Gym environment (Plappert et al., 2018). For example, in the FetchPush task (see Figure 1), the agent needs to learn to push a box to a designated point. In this task, the state of the system contains the status of both the robot and the box. The goal $g$, on the other hand, only indicates the designated position of the box. Thus, the mapping $\phi$ is defined as a mapping from a system state to the position of the box in that state.

Access to Simulator One common assumption made by previous work is a universal simulator that allows the environment to be reset to any given state (Florensa et al., 2017; Ecoffet et al., 2019). This kind of simulator is excessively powerful and hard to build when acting in the real world. In contrast, our method does not require a universal simulator and is thus more practical.

3 Related Work

Multi-Goal RL The role of goal-conditioned policies has been investigated widely in deep reinforcement learning (Pong et al., 2019). A few examples include grasping skills in imitation learning (Pathak et al., 2018; Srinivas et al., 2018), disentangling task knowledge from the environment (Mao et al., 2018a; Ghosh et al., 2019), and constituting lower-level controllers in hierarchical RL (Oh et al., 2017; Nachum et al., 2018; Huang et al., 2019; Eysenbach et al., 2019). By learning a universal value function that parameterizes the goal using a function approximator (Schaul et al., 2015), an agent is able to learn multiple tasks simultaneously (Kaelbling, 1993; Veeriah et al., 2018) and identify important decision states (Goyal et al., 2019b). It has been shown that multi-task learning with goal-conditioned policies improves generalization to unseen goals (e.g., Schaul et al. (2015)).

Hindsight Experience Replay Hindsight Experience Replay (Andrychowicz et al., 2017) is an effective experience replay strategy that generates reward signals from failed trajectories. The idea of hindsight experience replay can be extended to various goal-conditioned problems, such as hierarchical RL (Levy et al., 2019), dynamic goal pursuit (Fang et al., 2019a), goal-conditioned imitation (Ding et al., 2019; Sun et al., 2019), and visual robotics applications (Nair et al., 2018; Sahni et al., 2019). It has also been shown that hindsight experience replay can be combined with on-policy reinforcement learning algorithms via importance sampling (Rauber et al., 2019).

Curriculum Learning in RL Curriculum learning in RL usually suggests using a sequence of auxiliary tasks to guide policy optimization, which is also related to multi-task learning, lifelong learning, and transfer learning. Research interest in automatic curriculum design has grown rapidly in recent years, with approaches proposed to schedule a given set of auxiliary tasks (Riedmiller et al., 2018; Colas et al., 2019) and to provide intrinsic motivation (Forestier et al., 2017; Péré et al., 2018; Sukhbaatar et al., 2018; Colas et al., 2018). Generating goals which lead to high-value states can substantially improve the sample efficiency of an RL agent (Goyal et al., 2019a). Guided exploration through curriculum generation is also an active research topic, where either the initial state (Florensa et al., 2017) or the goal position (Baranes and Oudeyer, 2013; Florensa et al., 2018) is treated as a manipulable factor for generating intermediate tasks. However, most curriculum learning methods are domain-specific, and building a general framework for curriculum learning remains an open problem.

4 Automatic Hindsight Goal Generation

Figure 1: Visualization of hindsight goals (pink particles).

As discussed in the previous section, HER provides an effective solution to the sparse reward challenge in object manipulation tasks, in which achieved states in past trajectories are replayed as imaginary goals. In other words, HER modifies the task distribution in the replay buffer to generate a set of auxiliary nearby goals, which can be used for further exploration and can improve the performance of an off-policy RL agent that is expected to reach a very distant goal. However, the distribution of hindsight goals on which the policy is trained might differ significantly from the original task or goal distribution. Take Figure 1 as an example: the desired goal distribution lies on the red segment, which is far away from the initial position. In this situation, those hindsight goals may not be effective enough to promote policy optimization on the original task. The goal of our work is to develop a new approach to generate valuable hindsight goals that improve performance on the original task.

In the rest of this section, we will present a new algorithmic framework as well as our implementation for automatic hindsight goal generation for better exploration.

4.1 Algorithmic Framework

Following Florensa et al. (2018), our approach relies on the following generalizability assumption.

Assumption 1.

A value function $V^\pi(s_0 \,\|\, g)$ of a policy $\pi$ for a specific goal $g$ has some generalizability to another goal $g'$ close to $g$.

One possible mathematical characterization of Assumption 1 is via Lipschitz continuity. Similar assumptions have been widely applied in many scenarios (Asadi et al., 2018; Luo et al., 2019):

$$\left|V^\pi(s_0 \,\|\, g) - V^\pi(s_0' \,\|\, g')\right| \le L \cdot d\big((s_0, g), (s_0', g')\big), \qquad (4)$$

where $d$ is a metric defined by

$$d\big((s_0, g), (s_0', g')\big) = c \left\|m(s_0) - m(s_0')\right\|_2 + \left\|g - g'\right\|_2, \qquad (5)$$

for some hyperparameter $c$ that provides a trade-off between the distance between initial states and the distance between final goals. Here $m(\cdot)$ is a state abstraction mapping from the state space to the goal space. When experimenting with the tasks in the OpenAI Gym environment (Plappert et al., 2018), we simply set $m$ to the state-goal mapping $\phi$ defined in (1). Although the Lipschitz continuity may not hold for every pair of tasks, we only require continuity over some specific region, and it is reasonable to expect the bound in Eq. (4) to hold for most task pairs when the Lipschitz constant $L$ is not too large.
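For reference, the metric in Eq. (5) amounts to the small helper below; `m` is the state abstraction and `c` the trade-off weight, with the default value taken from Appendix C.1 as an assumption.

```python
import numpy as np

def task_distance(task_a, task_b, m, c=3.0):
    """Metric d from Eq. (5) between two tasks (s0, g) and (s0', g').

    task_a, task_b: (initial_state, goal) tuples of numpy arrays.
    m:              state abstraction mapping states into the goal space.
    c:              trade-off between initial-state and goal distances.
    """
    (s0_a, g_a), (s0_b, g_b) = task_a, task_b
    return c * np.linalg.norm(m(s0_a) - m(s0_b)) + np.linalg.norm(g_a - g_b)
```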

Partly due to the reward sparsity of distant goals, optimizing the expected cumulative reward (see Eq. (3)) from scratch is very difficult. Instead, we propose to optimize a relaxed lower bound which introduces intermediate goals that may be easier to optimize. Theorem 1 below establishes such a lower bound.

Theorem 1. Assuming that the generalizability condition (Eq. (4)) holds for two task distributions $\mathcal{T}^*$ and $\mathcal{T}$, we have

$$\mathbb{E}_{(s_0, g) \sim \mathcal{T}^*}\!\left[V^\pi(s_0 \,\|\, g)\right] \;\ge\; \mathbb{E}_{(s_0, g) \sim \mathcal{T}}\!\left[V^\pi(s_0 \,\|\, g)\right] - L \cdot D(\mathcal{T}^*, \mathcal{T}), \qquad (6)$$

where $D(\mathcal{T}^*, \mathcal{T})$ is the Wasserstein distance based on the metric $d$,

$$D(\mathcal{T}^*, \mathcal{T}) = \inf_{\mu \in \Gamma(\mathcal{T}^*, \mathcal{T})} \mathbb{E}_{\left((s_0, g), (s_0', g')\right) \sim \mu}\!\left[d\big((s_0, g), (s_0', g')\big)\right],$$

and $\Gamma(\mathcal{T}^*, \mathcal{T})$ denotes the collection of all joint distributions whose marginals are $\mathcal{T}^*$ and $\mathcal{T}$, respectively. The proof of Theorem 1 is deferred to Appendix A.

It follows from Theorem 1 that optimizing the cumulative reward in Eq. (3) can be relaxed into the following surrogate problem:

$$\max_{\pi, \,\mathcal{T}} \; \mathbb{E}_{(s_0, g) \sim \mathcal{T}}\!\left[V^\pi(s_0 \,\|\, g)\right] - L \cdot D(\mathcal{T}^*, \mathcal{T}). \qquad (7)$$

Note that this new objective function is very intuitive. Instead of optimizing with the difficult goal/task distribution $\mathcal{T}^*$, we hope to find a collection of surrogate goals $\mathcal{T}$ which are both easy to optimize and close to (or converging towards) $\mathcal{T}^*$. However, the joint optimization of $\pi$ and $\mathcal{T}$ is non-trivial. This is because a) $\mathcal{T}$ is a high-dimensional distribution over tasks, b) the policy $\pi$ is optimized with respect to a shifting task distribution $\mathcal{T}$, and c) the estimation of the value function may not be very accurate during training.

Inspired by Andrychowicz et al. (2017), we adopt the idea of using hindsight goals here. We first enforce $\mathcal{T}$ to be a finite set of particles which can only be drawn from the already achieved states/goals in the replay buffer $\mathcal{B}$. In other words, the support of $\mathcal{T}$ should lie inside the set of achieved goals in $\mathcal{B}$. Meanwhile, we notice that a direct implementation of problem Eq. (7) may lead to degeneration of the hindsight goal selection during training, i.e., the goals may all be drawn from a single trajectory and thus fail to provide sufficient exploration. Therefore, we introduce an extra diversity constraint: for every trajectory in $\mathcal{B}$, only a limited number of states can be selected for $\mathcal{T}$. In practice, we find that allowing at most one state per trajectory results in reasonable performance. It is shown in Section 5.3 that this diversity constraint indeed improves the robustness of our algorithm.

Finally, the optimization problem we aim to solve is

$$\max_{\pi, \,\mathcal{T}} \; \mathbb{E}_{(s_0, g) \sim \mathcal{T}}\!\left[V^\pi(s_0 \,\|\, g)\right] - L \cdot D(\mathcal{T}^*, \mathcal{T})$$

s.t. $\mathcal{T}$ is supported on initial states and achieved goals from the replay buffer $\mathcal{B}$, with at most one state selected from each trajectory.

To solve the above optimization, we adopt a two-stage iterative algorithm. First, we apply a policy optimization algorithm, for example DDPG, to maximize the value function conditioned on the task set $\mathcal{T}$. Then we fix $\pi$ and optimize the hindsight task set $\mathcal{T}$ subject to the diversity constraint, which is a variant of the well-known Wasserstein Barycenter problem with a bias term (the value function) for each particle. We iterate this process until the policy achieves a desirable performance or we reach a computation budget. The first step, value function optimization, is straightforward; in our work, we simply use the DDPG+HER framework for it. The second step, hindsight goal optimization, is non-trivial. In the following, we describe an efficient approximation algorithm for it.

4.2 Solving Wasserstein Barycenter Problem via Bipartite Matching

Since we restrict $\mathcal{T}$ to be a hindsight distribution supported on $K$ particles, we can approximately solve the above Wasserstein Barycenter problem in a combinatorial setting as a bipartite matching problem. Instead of dealing with $\mathcal{T}^*$ directly, we draw $K$ samples from $\mathcal{T}^*$ to empirically approximate it by a set of particles $\{(s_0^i, g^i)\}_{i=1}^{K}$. The hindsight task set $\mathcal{T}$ can then be constructed in the following way. For every task instance $(s_0^i, g^i)$, we find a distinct trajectory $\tau^i$ from the replay buffer such that together they minimize the sum

$$\sum_{i=1}^{K} w\big(\tau^i, (s_0^i, g^i)\big), \qquad (8)$$

where we define

$$w\big(\tau, (s_0, g)\big) = c \left\|m(s_0^\tau) - m(s_0)\right\|_2 + \min_{s \in \tau}\left(\left\|\phi(s) - g\right\|_2 - \frac{1}{L} V^\pi\big(s_0^\tau \,\|\, \phi(s)\big)\right), \qquad (9)$$

with $s_0^\tau$ denoting the initial state of trajectory $\tau$. Finally, for each matched trajectory $\tau^i$ we select the achieved state $s$ attaining the minimum in Eq. (9) and use $\phi(s)$ as the corresponding hindsight goal. It is not hard to see that this combinatorial optimization exactly identifies the optimal solution of the above-mentioned Wasserstein Barycenter problem. In practice, the Lipschitz constant $L$ is unknown and is therefore treated as a hyper-parameter.

The combinatorial problem in Eq. (8) can be solved efficiently by the well-known maximum weight bipartite matching (Munkres, 1957; Duan and Su, 2012). The bipartite graph is constructed as follows. The vertices are split into two partitions $U$ and $V$. Every vertex in $U$ represents a task instance $(s_0^i, g^i)$, and every vertex in $V$ represents a trajectory $\tau$ in the replay buffer. The weight of the edge connecting a task vertex and a trajectory vertex is given by Eq. (9). In this paper, we apply the Minimum Cost Maximum Flow algorithm to solve this bipartite matching problem (for example, see Ahuja et al. (1993)).
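The selection step can be sketched as follows: build a K-by-N cost matrix whose (i, j) entry is the weight in Eq. (9) between target task i and buffer trajectory j, then solve the one-to-one assignment. The sketch uses `scipy.optimize.linear_sum_assignment` (a Hungarian-style solver) in place of the min-cost max-flow implementation; `value_fn`, `m`, `phi`, and the constants are placeholders mirroring the definitions above, not the paper's exact code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def select_hindsight_goals(target_tasks, trajectories, value_fn, m, phi, c=3.0, L=5.0):
    """Pick one hindsight goal per target task via weighted bipartite matching.

    target_tasks:    list of (s0, g) pairs sampled from the target distribution.
    trajectories:    list of state sequences from the replay buffer.
    value_fn(s0, g): estimated goal-conditioned value V(s0 || g).
    Each task is matched to a distinct trajectory, which enforces the
    diversity constraint of at most one hindsight goal per trajectory.
    """
    K, N = len(target_tasks), len(trajectories)
    cost = np.zeros((K, N))
    best_state = {}  # (i, j) -> index of the best state in trajectory j for task i
    for i, (s0, g) in enumerate(target_tasks):
        for j, traj in enumerate(trajectories):
            s0_tau = traj[0]
            # Eq. (9): goal distance minus a value bonus, minimized along the trajectory.
            per_state = [np.linalg.norm(phi(s) - g) - value_fn(s0_tau, phi(s)) / L
                         for s in traj]
            t_best = int(np.argmin(per_state))
            best_state[(i, j)] = t_best
            cost[i, j] = c * np.linalg.norm(m(s0_tau) - m(s0)) + per_state[t_best]
    rows, cols = linear_sum_assignment(cost)  # minimizes the total matching cost in Eq. (8)
    return [phi(trajectories[j][best_state[(i, j)]]) for i, j in zip(rows, cols)]
```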

1:  Initialize the DDPG networks (policy π and value function V)
2:  Initialize the replay buffer B
3:  for iteration = 1, 2, … do
4:      Sample K task instances {(s_0^i, g^i)}_{i=1}^K from the target distribution T*
5:      Find K distinct trajectories {τ^i} in B that minimize the weighted bipartite matching objective in Eq. (8)
6:      Construct the intermediate task distribution T = {(s_0^{τ^i}, ĝ^i)}_{i=1}^K, where ĝ^i is the hindsight goal selected from τ^i via Eq. (9)
7:      for i = 1, …, K do
8:          Reset the environment and use ĝ^i as the exploration goal    ▷ critical step: hindsight goal-oriented exploration
9:          for t = 0, …, T−1 do
10:             Compute a_t = π(s_t, ĝ^i), together with ε-greedy or Gaussian exploration
11:             Execute a_t and observe the next state s_{t+1}
12:         end for
13:         Store the collected trajectory into the replay buffer B
14:     end for
15:     for a fixed number of training steps do
16:         Sample a minibatch from the replay buffer using HER
17:         Perform one step of value and policy update on the minibatch using DDPG
Algorithm 1 Exploration via Hindsight Goal Generation (HGG)

Overall Algorithm The overall description of our algorithm is given in Algorithm 1. Note that the only modification introduced by our exploration strategy is in Step 8, where we generate hindsight goals to guide the agent to collect more valuable trajectories. It is therefore complementary to other improvements to DDPG/HER around Step 16, such as prioritized experience replay strategies (Schaul et al., 2016; Zhao and Tresp, 2018; Zhao et al., 2019) and other variants of hindsight experience replay (Fang et al., 2019b; Bai et al., 2019).

5 Experiments

Our experiment environments are based on the standard robotic manipulation environments in the OpenAI Gym (Brockman et al., 2016)2. In addition to the standard settings, to better visualize improvements in sample efficiency, we vary the target task distributions in the following ways:

  • Fetch environments: Initial object position and goal are generated uniformly at random from two distant segments.

  • Hand-manipulation environments: These tasks require the agent to rotate an object into a given pose; only rotations around the z-axis are considered here. We restrict the initial axis-angle to a small interval, and the target pose is generated symmetrically to it, i.e., the object needs to be rotated by roughly 180 degrees.

  • Reach environment: FetchReach and HandReach do not support randomization of the initial state, so we restrict their target distribution to be a subset of the original goal space.

Regarding baseline comparison, we consider the original DDPG+HER algorithm. We also investigate the integration of experience replay prioritization strategies, such as the Energy-Based Prioritization (EBP) proposed by Zhao and Tresp (2018), which draws on prior knowledge of the physical system to exploit valuable trajectories. More details of the experiment settings are included in Appendix B.

5.1 HGG Generates Better Hindsight Goals for Exploration

Figure 2: Visualization of the goal distributions generated by HGG and HER on FetchPush, at episodes 500, 1000, 2000, and 3000 (panels a–d). The initial object position is shown as a black box. The blue segment indicates the target goal distribution. The top row presents the distribution of hindsight goals generated by our HGG method, where bright green particles are a batch of recently generated goals and dark green particles are goals generated in previous iterations. The bottom row presents the distribution of replay goals generated by HER.

We first check whether HGG is able to generate meaningful hindsight goals for exploration. We compare HGG and HER in the FetchPush environment. Figure 2 shows that the HGG algorithm generates goals that gradually move towards the target region. Since those goals are hindsight, they are considered to be achieved during training. In comparison, the replay distribution of a DDPG+HER agent remains stuck around the initial position for many iterations, indicating that those goals may not be able to efficiently guide exploration.

Performance on benchmark robotics tasks

Figure 3: Learning curves for a number of goal-oriented robotic manipulation tasks. All curves presented in this figure are trained with the default hyper-parameters listed in Appendix C.1. Note that since FetchReach and HandReach do not contain the object instances required by EBP, we do not include the +EBP versions for them.

Then we check whether the exploration provided by the goals generated by HGG results in better policy training performance. As shown in Figure 3, we compare vanilla HER, HER with Energy-Based Prioritization (HER+EBP), HGG, and HGG+EBP. It is worth noting that since EBP is designed for the Bellman equation updates, it is complementary to our HGG-based exploration approach. Among the eight environments, HGG substantially outperforms HER on four and has comparable performance on the other four, which are either too simple or too difficult. When combined with EBP, HGG+EBP achieves the best performance on the six eligible environments.

Figure 4: Visualization of FetchPush with obstacle.

Performance on tasks with obstacle In a more difficult task, a hand-crafted metric may be more suitable than the $\ell_2$ distance used in Eq. (5). As shown in Figure 4, we created an environment based on FetchPush with a rigid obstacle. The object and the goal are uniformly generated in the green and the red segments, respectively. The brown block is a static wall which cannot be moved. In addition to the $\ell_2$ distance, we also construct a distance metric based on the graph distance over a mesh grid on the table plane; the blue line shows a successful trajectory under this hand-crafted distance measure. A more detailed description is deferred to Appendix B.3. Intuitively, this crafted distance should be better than the $\ell_2$ distance due to the existence of the obstacle. Experimental results suggest that such a crafted distance metric provides better guidance for goal generation and training, and significantly improves sample efficiency over the $\ell_2$ distance. Investigating ways to obtain or learn a good metric is an interesting future direction.
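One simple way to realize such a hand-crafted metric is a shortest-path distance on a coarse 2D grid that excludes cells covered by the obstacle; the breadth-first-search sketch below is only an illustration of the idea under assumed grid parameters, not the exact construction used in our experiments.

```python
import numpy as np
from collections import deque

def grid_geodesic_distance(start, goal, blocked, size=50, cell=0.02):
    """Approximate distance between two 2D points that routes around obstacles.

    start, goal:   2D points in workspace coordinates (assumed in [0, size*cell)).
    blocked(x, y): predicate returning True if the point lies inside an obstacle.
    Returns the BFS path length in metres, or infinity if unreachable.
    """
    to_cell = lambda p: (int(p[0] / cell), int(p[1] / cell))
    s, g = to_cell(start), to_cell(goal)
    dist = {s: 0.0}
    queue = deque([s])
    while queue:
        x, y = queue.popleft()
        if (x, y) == g:
            return dist[(x, y)]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy)
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in dist and not blocked(nxt[0] * cell, nxt[1] * cell)):
                dist[nxt] = dist[(x, y)] + cell
                queue.append(nxt)
    return float("inf")
```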

5.2 Comparison with Explicit Curriculum Learning

Figure 5: Comparison with curriculum learning. We compare HGG with the original HER and with HER+GOID under two threshold values.

Since our method can be seen as an explicit curriculum learning scheme for exploration, where we generate hindsight goals as an intermediate task distribution, we also compare it with another recently proposed curriculum learning method for RL. Florensa et al. (2018) leverage a Least-Squares GAN (Mao et al., 2018b) to mimic the set of Goals of Intermediate Difficulty (GOID) as an exploration goal generator.

Specifically, in our task settings, we define a set of goals of intermediate difficulty based on $f(g)$, the average success rate in a small region around goal $g$, keeping goals whose success rate lies between two thresholds. To sample from this set, we implement an oracle goal generator based on rejection sampling, which uniformly samples goals from it. The results in Figure 5 indicate that our Hindsight Goal Generation substantially outperforms HER even when HER is supplied with goals from this oracle generator. Note that this experiment is run on an environment with a fixed initial state due to the limitation of Florensa et al. (2018); the choice of thresholds also follows Florensa et al. (2018).
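For concreteness, the oracle generator can be sketched as plain rejection sampling: propose goals uniformly from the goal space and keep those whose estimated success rate falls inside the intermediate-difficulty band. The helpers `sample_uniform_goal` and `estimate_success_rate`, and the default thresholds, are placeholders for environment-specific routines, not the exact values used here.

```python
def sample_goid_goal(sample_uniform_goal, estimate_success_rate,
                     low=0.1, high=0.9, max_tries=1000):
    """Rejection sampling from a set of Goals of Intermediate Difficulty.

    sample_uniform_goal():    proposes a goal uniformly from the goal space.
    estimate_success_rate(g): empirical success rate of the current policy in a
                              small region around goal g (e.g. Monte-Carlo rollouts).
    low, high:                intermediate-difficulty band, following the spirit of
                              Florensa et al. (2018); the exact values are assumptions.
    """
    for _ in range(max_tries):
        g = sample_uniform_goal()
        if low <= estimate_success_rate(g) <= high:
            return g
    return sample_uniform_goal()  # fall back to a uniform goal if none qualifies
```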

5.3 Ablation Studies on Hyperparameter Selection

In this section, we present a set of ablation studies on several hyper-parameters used in the Hindsight Goal Generation algorithm.

Lipschitz constant $L$: The selection of the Lipschitz constant $L$ is task dependent, since it is related to the scale of the value function and of the goal distances. For the robotics tasks tested in this paper, we find it easier to set $L$ after first dividing it by the upper bound of the distance between any two final goals in an environment. We test a few choices of $L$ on several environments and find that it is very easy to find a range of $L$ that works well and is robust across all the environments tested in this section. We show the learning curves on FetchPush with different values of $L$. It appears that the performance of HGG is reasonable as long as $L$ is not too small. For all tasks in the comparisons, we set $L = 5.0$.

Distance weight $c$: The parameter $c$ defines the trade-off between initial-state similarity and goal similarity. A larger $c$ encourages our algorithm to choose hindsight goals whose initial states are closer. The results in Figure 6 indicate that the choice of $c$ is indeed robust. For all tasks in the comparisons, we set $c = 3.0$.

Number of hindsight goals $K$: We find that for simple tasks, the choice of $K$ is not critical. Even a greedy approach (corresponding to $K = 1$) can achieve competitive performance, e.g., on FetchPush in the third panel of Figure 6. For more difficult environments, such as FetchPickAndPlace, a larger batch size can significantly reduce the variance of training results. For all tasks in the comparisons, we plotted the best results given by $K \in \{50, 100\}$.

Figure 6: Ablation study of hyper-parameter selection. Several curves are omitted in the fourth panel to provide a clear view of the variance comparison. A full version is deferred to Appendix D.4.

6 Conclusion

We present a novel automatic hindsight goal generation algorithm, by which valuable hindsight imaginary tasks are generated to enable efficient exploration for goal-oriented off-policy reinforcement learning. We formulate this idea as a surrogate optimization that identifies hindsight goals which are easy to achieve and also likely to lead to the actual goal, and we introduce a combinatorial solver to generate such intermediate tasks. Extensive experiments demonstrate that our method provides better goal-oriented exploration than the original HER and curriculum learning on a collection of robotic learning tasks. A future direction is to incorporate controllable representation learning (Thomas et al., 2017) to provide task-specific distance metrics (Ghosh et al., 2019; Srinivas et al., 2018), which may generalize our method to more complicated cases where the standard Wasserstein distance cannot be applied directly.

Appendix A Proof of Theorem 1

In this section we provide the proof of Theorem 1.

Proof.

By Eq. (4), for any quadruple $\big((s_0, g), (s_0', g')\big)$, we have

$$V^\pi(s_0 \,\|\, g) \;\ge\; V^\pi(s_0' \,\|\, g') - L \cdot d\big((s_0, g), (s_0', g')\big). \qquad (10)$$

For any joint distribution $\mu \in \Gamma(\mathcal{T}^*, \mathcal{T})$, we sample $\big((s_0, g), (s_0', g')\big) \sim \mu$ and take the expectation on both sides of Eq. (10), which gives

$$\mathbb{E}_{(s_0, g) \sim \mathcal{T}^*}\!\left[V^\pi(s_0 \,\|\, g)\right] \;\ge\; \mathbb{E}_{(s_0', g') \sim \mathcal{T}}\!\left[V^\pi(s_0' \,\|\, g')\right] - L \cdot \mathbb{E}_{\mu}\!\left[d\big((s_0, g), (s_0', g')\big)\right]. \qquad (11)$$

Since Eq. (11) holds for any $\mu \in \Gamma(\mathcal{T}^*, \mathcal{T})$, taking the infimum over $\mu$ on the right-hand side yields

$$\mathbb{E}_{(s_0, g) \sim \mathcal{T}^*}\!\left[V^\pi(s_0 \,\|\, g)\right] \;\ge\; \mathbb{E}_{(s_0, g) \sim \mathcal{T}}\!\left[V^\pi(s_0 \,\|\, g)\right] - L \cdot D(\mathcal{T}^*, \mathcal{T}),$$

which completes the proof. ∎

Appendix B Experiment Settings

B.1 Modified Environments

Figure 7: Visualization of modified task distribution in Fetch environments. The object is uniformly generated on the green segment, and the goal is uniformly generated on the red segment.

Fetch Environments:

  • FetchPush-v1: Let the origin denote the projection of gripper’s initial coordinate on the table. The object is uniformly generated on the segment , and the goal is uniformly generated on the segment .

  • FetchPickAndPlace-v1: Let the origin denote the projection of gripper’s initial coordinate on the table. The object is uniformly generated on the segment , and the goal is uniformly generated on the segment .

  • FetchSlide-v1: Let the origin denote the projection of gripper’s initial coordinate on the table. The object is uniformly generated on the segment , and the goal is uniformly generated on the segment .

Hand Environments:

  • HandManipulateBlockRotate-v0, HandManipulateEggRotate-v0: Starting from the default initial state defined in the original simulator [Plappert et al., 2018], the initial pose is generated by applying a rotation around the z-axis, with the rotation degree uniformly sampled from a small interval. The goal is also rotated from the default pose around the z-axis, with the degree sampled from the symmetric interval on the opposite side (an offset of roughly 180 degrees).

  • HandManipulatePenRotate-v0: We use the same setting as the original simulator.

Reach Environments:

  • FetchReach-v1: Let the origin denote the coordinate of gripper’s initial position. Goal is uniformly generated on the segment .

  • HandReach-v0: Uniformly select one dimension of the meeting point and add an offset of 0.005, where the meeting point is defined in the original simulator [Plappert et al., 2018].

Other attributes of the environments (such as the horizon and the reward function) are kept at their default values.

B.2 Evaluation Details

  • All curves presented in this paper are plotted from 10 runs with random task initializations and seeds.

  • Shaded regions indicate the 60% population around the median.

  • All curves are plotted using the same hyper-parameters (except ablation section).

  • Following Andrychowicz et al. [2017], an episode is considered successful if $\|\phi(s_T) - g\|_2 \le \delta_g$, where $\phi(s_T)$ is the object position at the end of the episode and $\delta_g$ is the same threshold used in reward function (1) (see the sketch after this list).
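For reference, the success test in the last bullet reduces to a single comparison; the sketch assumes the same mapping and threshold as in reward function (1).

```python
import numpy as np

def episode_success(final_state, goal, phi, delta_g=0.05):
    """An episode is successful if the achieved goal at the final step lies
    within delta_g of the desired goal, mirroring reward function (1)."""
    return bool(np.linalg.norm(phi(final_state) - goal) <= delta_g)
```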

B.3 Details of the Experiment with Obstacle

We use the same coordinate system as in Appendix B.1, with the origin at the projection of the gripper's initial coordinate on the table. The object and the goal are uniformly generated on their respective segments, and the static wall is placed so that it obstructs the direct path between them.

The crafted distance used in Figure 4 is calculated by the following rules.

  • The distance metric between two initial states is kept as before.

  • The distance between the hindsight goal and the desired goal is evaluated as the sum of two parts. The first part is the Euclidean distance from each goal to its closest point on the blue polygonal line shown in Figure 4. The second part is the distance between these two closest points measured along the blue line.

  • The above two terms are combined with the same ratio used in Eq. (5).

B.4 Details of Experiment 5.2

Figure 8: Visualization of modified task distribution in Experiment 5.2. The initial position of the object is as shown in this figure, and the goal is uniformly generated in the blue region.
  • Since the environment is deterministic, the success rate is defined as

    $f(g) = \mathbb{E}_{g' \sim \mathrm{Unif}(B_{\delta_g}(g))}\big[\mathbb{1}[\text{the current policy reaches } g']\big]$, where $B_{\delta_g}(g)$ indicates a ball with radius $\delta_g$ centered at $g$, and $\delta_g$ is the same threshold used in reward function (1) and in success testing.

  • The average success rate used by the oracle is estimated by Monte-Carlo sampling.

Appendix C Implementation Details

C.1 Hyper-Parameters

Almost all hyper-parameters of DDPG and HER are kept the same as in the benchmark results; only the following terms differ from Plappert et al. [2018]:

  • number of MPI workers: 1;

  • buffer size: trajectories.

Other hyper-parameters:

  • Actor and critic networks: 3 layers with 256 units and ReLU activation;

  • Adam optimizer with learning rate;

  • Polyak-averaging coefficient: 0.95;

  • Action $\ell_2$-norm penalty coefficient: 1.0;

  • Batch size: 256;

  • Probability of random actions: 0.3;

  • Scale of additive Gaussian noise: 0.2;

  • Probability of HER experience replay: 0.8;

  • Number of batches to replay after collecting one trajectory: 20.

Hyper-parameters in weighted bipartite matching:

  • Lipschitz constant : 5.0;

  • Distance weight : 3.0;

  • Number of hindsight goals : 50 or 100.

C.2 Details on Data Processing

  • In policy training of HGG, we sample minibatches using HER.

  • As a normalization step, we use the Lipschitz constant $L/R$ in the back-end computation, where $R$ is the $\ell_2$-diameter of the goal space $\mathcal{G}$ and $L$ corresponds to the value discussed in the ablation study.

  • To reduce the computational cost of bipartite matching, we approximate the buffer set by a First-In-First-Out queue containing the most recent trajectories (see the sketch after this list).

  • Additional Gaussian noise is added to the goals generated by HGG in the Fetch environments. We do not add this term in the Hand environments because the goal space is not Euclidean.
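The FIFO approximation mentioned above can be realized with a bounded deque; the capacity below is a placeholder, since the exact number of retained trajectories is not restated here.

```python
from collections import deque

# Keep only the most recent trajectories as candidates for bipartite matching.
# The capacity (1000) is a placeholder, not necessarily the value used in the paper.
recent_trajectories = deque(maxlen=1000)

def store_trajectory(trajectory):
    """Append a finished trajectory; once the queue is full the oldest one is
    dropped automatically, giving the First-In-First-Out approximation."""
    recent_trajectories.append(trajectory)
```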

Appendix D Additional Experiment Results

D.1 Additional Visualization of Hindsight Goals Generated by HGG

Figure 9: Additional visualization to illustrate the hindsight goals generated by HGG.

To give a better intuitive illustration of our motivation, we provide an additional visualization of the goal distribution generated by HGG on a complex manipulation task, FetchPickAndPlace (Figures 9(a) and 9(b)). In Figure 9(a), "blue to green" corresponds to the goals generated during training. HGG first guides the agent to locate the object and move it to a nearby region. The agent then learns to move the object in the easiest direction, i.e., pushing it to the location underneath the actual goal, and finally to pick it up. For tasks which are hard to visualize, such as the HandManipulation tasks, we plot the curves of distances between the proposed exploratory goals and the actually desired goals (Figure 9(c)); all experiments follow similar learning dynamics.

D.2 Evaluation on Standard Tasks

In this section, we provide experiment results on the standard Fetch tasks. The learning curves are shown in Figure 10.

Figure 10: Learning curves for HGG and HER in standard task distribution created by Andrychowicz et al. [2017].

D.3 Additional Experiment Results on Section 5.2

We provide a comparison of the performance of HGG and explicit curriculum learning on the FetchPickAndPlace environment (see Figure 11), showing that the result given in Section 5.2 generalizes to a different environment.

Figure 11: Comparison with explicit curriculum learning in FetchPickAndPlace. The initial position of the object is as shown in the left figure, and the goal is generated in the blue region following the default distribution created by Andrychowicz et al. [2017].

D.4 Ablation Study

We provide full experiments on ablation study in Figure 12.

Figure 12: A full version of ablation study.

Footnotes

  2. Our code is available at https://github.com/Stilwell-Git/Hindsight-Goal-Generation.

References

  1. Network flows: theory, algorithms, and applications. Prentice-Hall, Inc., Upper Saddle River, NJ, USA. ISBN 0-13-617549-X.
  2. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058.
  3. Lipschitz continuity in model-based reinforcement learning. In International Conference on Machine Learning, pp. 264–273.
  4. Guided goal generation for hindsight multi-goal reinforcement learning. Neurocomputing.
  5. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems 61 (1), pp. 49–73.
  6. OpenAI Gym. arXiv preprint arXiv:1606.01540.
  7. CURIOUS: intrinsically motivated modular multi-goal reinforcement learning. In International Conference on Machine Learning, pp. 1331–1340.
  8. GEP-PG: decoupling exploration and exploitation in deep reinforcement learning algorithms. In International Conference on Machine Learning, pp. 1038–1047.
  9. Goal-conditioned imitation learning. In Advances in Neural Information Processing Systems.
  10. A scaling algorithm for maximum weight matching in bipartite graphs. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1413–1424.
  11. Go-Explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.
  12. Search on the replay buffer: bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems.
  13. DHER: hindsight experience replay for dynamic goals. In International Conference on Learning Representations.
  14. Curriculum-guided hindsight experience replay. In Advances in Neural Information Processing Systems.
  15. Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning, pp. 1514–1523.
  16. Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning, pp. 482–495.
  17. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190.
  18. Learning actionable representations with goal-conditioned policies. In International Conference on Learning Representations.
  19. Recall traces: backtracking models for efficient reinforcement learning. In International Conference on Learning Representations.
  20. InfoBot: transfer and exploration via the information bottleneck. In International Conference on Learning Representations.
  21. Mapping state space using landmarks for universal goal reaching. In Advances in Neural Information Processing Systems.
  22. Learning to achieve goals. In IJCAI, pp. 1094–1099.
  23. Learning to rank for recommender systems. In Proceedings of the 7th ACM Conference on Recommender Systems, pp. 493–494.
  24. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
  25. Learning multi-level hierarchies with hindsight. In International Conference on Learning Representations.
  26. Continuous control with deep reinforcement learning. In International Conference on Learning Representations.
  27. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations.
  28. Universal agent for disentangling environments and tasks. In International Conference on Learning Representations.
  29. On the effectiveness of least squares generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  30. Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
  31. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5 (1), pp. 32–38.
  32. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313.
  33. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200.
  34. Policy invariance under reward transformations: theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 278–287.
  35. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, pp. 2661–2670.
  36. Zero-shot visual imitation. In International Conference on Learning Representations.
  37. Unsupervised learning of goal spaces for intrinsically motivated goal exploration. In International Conference on Learning Representations.
  38. Multi-goal reinforcement learning: challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464.
  39. Skew-Fit: state-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698.
  40. Hindsight policy gradients. In International Conference on Learning Representations.
  41. Learning by playing: solving sparse reward tasks from scratch. In International Conference on Machine Learning, pp. 4341–4350.
  42. Addressing sample complexity in visual tasks using HER and hallucinatory GANs. In Advances in Neural Information Processing Systems.
  43. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320.
  44. Prioritized experience replay. In International Conference on Learning Representations.
  45. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
  46. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  47. Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
  48. Universal planning networks: learning generalizable representations for visuomotor control. In International Conference on Machine Learning, pp. 4739–4748.
  49. Intrinsic motivation and automatic curricula via asymmetric self-play. In International Conference on Learning Representations.
  50. Policy continuation with hindsight inverse dynamics. In Advances in Neural Information Processing Systems.
  51. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, pp. 1064–1070.
  52. Independently controllable features. arXiv preprint arXiv:1708.01289.
  53. Many-goals reinforcement learning. arXiv preprint arXiv:1806.09605.
  54. Maximum entropy-regularized multi-goal reinforcement learning. In International Conference on Machine Learning, pp. 7553–7562.
  55. Energy-based hindsight experience prioritization. In Conference on Robot Learning, pp. 113–122.