Skew-Fit: State-Covering Self-Supervised Reinforcement Learning

Skew-Fit: State-Covering Self-Supervised Reinforcement Learning

Vitchyr H. Pong, Murtaza Dalalfootnotemark: , Steven Linfootnotemark: , Ashvin Nair, Shikhar Bahl, Sergey Levine
University of California, Berkeley
{vitchyr,mdalal,stevenlin598,anair17,shikharbahl,svlevine}@berkeley.edu
Equal contribution
Abstract

In standard reinforcement learning, each new skill requires a manually-designed reward function, which takes considerable manual effort and engineering. Self-supervised goal setting has the potential to automate this process, enabling an agent to propose its own goals and acquire skills that achieve these goals. However, such methods typically rely on manually-designed goal distributions, or heuristics to force the agent to explore a wide range of states. We propose a formal exploration objective for goal-reaching policies that maximizes state coverage. We show that this objective is equivalent to maximizing the entropy of the goal distribution together with goal reaching performance, where goals correspond to entire states. We present an algorithm called Skew-Fit for learning such a maximum-entropy goal distribution, and show that under certain regularity conditions, our method converges to a uniform distribution over the set of valid states, even when we do not know this set beforehand. Skew-Fit enables self-supervised agents to autonomously choose and practice reaching diverse goals. Our experiments show that it can learn a variety of manipulation tasks from images, including opening a door with a real robot, entirely from scratch and without any manually-designed reward function.

1 Introduction

(a) Skew-Fit
(b) MLE
Figure 1: Left: The robot must manipulate a hook to open a door. Right: Samples from a generative model when trained with (a) Skew-Fit and (b) maximum likelihood estimation (MLE). When used as goals, the diverse samples from Skew-Fit encourage the robot to practice opening the door more frequently.

Reinforcement learning (RL) provides an appealing formalism for automated learning of behavioral skills. However, if we want agents to learn large repertoires of behaviors, this framework proves inadequate: separately learning every potentially useful skill becomes prohibitively time consuming, both in terms of the experience required for the agent, and the effort required for the user to design reward functions for each behavior. What if we could instead design an unsupervised RL algorithm that automatically explores the environment and iteratively distills this experience into general-purpose policies that can accomplish new user-specified tasks at test time?

For an agent to learn autonomously, it needs an exploration objective. An effective exploration scheme is one that visits as many states as possible, allowing a policy to autonomously prepare for any possible user-specified task that it might see at test time. We can formalize this objective as maximizing the entropy of the learned policy’s state distribution, $\mathcal{H}(S)$, since a policy that maximizes this objective should approach a uniform distribution over all valid states. However, a shortcoming of this objective is that the resulting policy cannot be used to solve new tasks: it only knows how to maximize state entropy. Thus, if we want to develop principled unsupervised RL algorithms that result in useful policies, maximizing $\mathcal{H}(S)$ is not enough. We need a mechanism that allows us to control the resulting policy to achieve new goals at test time.

We argue that this can be accomplished by performing goal-directed exploration. In addition to maximizing the state distribution entropy $\mathcal{H}(S)$, we should be able to control where the policy goes by giving it a goal that corresponds to a desired state. Mathematically, this can be accomplished by stating that a goal-conditioned policy should also minimize the conditional entropy of the states given a goal, $\mathcal{H}(S \mid G)$. This objective provides us with a principled way to train a policy that explores all states (“maximize $\mathcal{H}(S)$”) such that the state that is reached can be controlled by commanding goals (“minimize $\mathcal{H}(S \mid G)$”).

Directly using this objective is often intractable, since it requires optimizing the marginal state distribution of the policy. However, we can avoid this difficult optimization by noting that our objective is the mutual information between the state and the goal, $I(S; G)$, which can be written as:

$$I(S; G) = \mathcal{H}(S) - \mathcal{H}(S \mid G) = \mathcal{H}(G) - \mathcal{H}(G \mid S). \qquad (1)$$

Equation 1 thus gives an equivalent objective for an unsupervised RL algorithm: the agent should set diverse goals for itself (“maximize $\mathcal{H}(G)$”) and learn how to reach them (“minimize $\mathcal{H}(G \mid S)$”).

While the second term is the typical objective studied in goal-conditioned RL (Kaelbling, 1993; Andrychowicz et al., 2017), maximizing the diversity of goals is crucial for effectively learning to reach all possible states. In a new environment, acquiring such a maximum-entropy goal distribution is challenging: how can an agent set diverse goals when it does not even know what states exist?

In this paper, we answer this question by presenting Skew-Fit, a method for learning to model the uniform distribution over states, given only access to data collected by an autonomous goal-conditioned policy. Our paper makes the following contributions. First, we propose a principled objective for unsupervised RL given by Equation 1. While a number of prior works ignore the $\mathcal{H}(G)$ term, we argue that jointly optimizing the entire quantity is needed for effective and useful exploration. Second, we propose Skew-Fit, a method that, under regularity conditions, provably maximizes $\mathcal{H}(G)$ even when the underlying state space is unknown. Third, we empirically demonstrate that, when combined with goal-conditioned RL, Skew-Fit allows us to autonomously train goal-conditioned policies that reach diverse states. We test this method on a variety of simulated vision-based robot tasks without any task-specific reward function. In these experiments, Skew-Fit reaches substantially better final performance than prior methods, and learns much more quickly. We also demonstrate that our approach solves a real-world manipulation task, which requires a robot to learn to open a door, from scratch in about five hours, without any manually-designed reward function, directly from images.

2 Related Work

Many prior methods for training goal-conditioned policies assume that a goal distribution is available to sample from during exploration (Kaelbling, 1993; Schaul et al., 2015; Andrychowicz et al., 2017; Pong et al., 2018). Other methods use data collected from a randomly initialized policy or heuristics based on data collected online to design a non-parametric (Colas et al., 2018b; Warde-Farley et al., 2018; Florensa et al., 2018a) or parametric (Péré et al., 2018; Nair et al., 2018) goal distribution. We remark that Warde-Farley et al. (2018) also motivate their work in terms of minimizing a lower bound for $\mathcal{H}(G \mid S)$. Our work is complementary to these goal-reaching methods: rather than focusing on how to train goal-reaching policies, we propose a principled method for maximizing the entropy of the goal sampling distribution, $\mathcal{H}(G)$.

Our method learns without any task rewards, directly acquiring a policy that can be reused to reach user-specified goals. This stands in contrast to exploration methods that give bonus rewards based on state visitation frequency (Bellemare et al., 2016; Ostrovski et al., 2017; Tang et al., 2017; Savinov et al., 2018; Chentanez et al., 2005; Lopes et al., 2012; Stadie et al., 2016; Pathak et al., 2017; Burda et al., 2018, 2019; Mohamed & Rezende, 2015). While these methods can also be used without a task reward, they provide no mechanism for distilling the knowledge gained from visiting diverse states into flexible policies that can be applied to accomplish new goals at test-time: their policies visit novel states, and they quickly forget about them as other states become more novel.

Other prior methods extract reusable skills in the form of latent-variable-conditioned policies, where latent variables can be interpreted as options (Sutton et al., 1999) or abstract skills (Hausman et al., 2018; Gupta et al., 2018b; Eysenbach et al., 2019; Gupta et al., 2018a; Florensa et al., 2017). The resulting skills may be diverse, but they have no grounded interpretation, while our method can be used immediately after unsupervised training to reach diverse user-specified goals.

Some prior methods propose to choose goals based on heuristics such as learning progress (Baranes & Oudeyer, 2012; Veeriah et al., 2018; Colas et al., 2018a), how off-policy the goal is (Nachum et al., 2018), level of difficulty (Florensa et al., 2018b) or likelihood ranking (Zhao & Tresp, 2019). In contrast, our approach provides a principled framework for optimizing a concrete and well-motivated exploration objective, and can be shown to maximize this objective under regularity assumptions. Hazan et al. (2018) also provably optimizes a well-motivated exploration objective, but is limited to tabular MDPs, while ours is able to handle high dimensional settings such as vision-based continuous control. In order to maximize state entropy, our method employs a reweighting scheme when fitting the goal proposal model. Concurrent work has proposed a related prioritization based on ranking (Zhao & Tresp, 2019) but does not consider the unsupervised image-based reinforcement learning setting. We compare against this method in Section D.3 of the Appendix and find that our method significantly outperforms it.

3 Problem Formulation

To ensure that an unsupervised reinforcement learning agent learns to reach all possible states in a controllable way, we maximize the mutual information between the state $S$ and the goal $G$, $I(S; G)$. As shown in Equation 1, this objective can be decomposed into two separate terms: minimizing $\mathcal{H}(G \mid S)$ and maximizing $\mathcal{H}(G)$.

3.1 Minimizing $\mathcal{H}(G \mid S)$: Goal-Conditioned RL

Standard RL considers a Markov decision process (MDP) with a state space $\mathcal{S}$, action space $\mathcal{A}$, and unknown dynamics $p(s_{t+1} \mid s_t, a_t)$. Goal-conditioned RL also includes a goal space $\mathcal{G}$, which we assume to be the same as the state space, $\mathcal{G} = \mathcal{S}$. (We note that goal-conditioned RL can always be formulated as a standard RL problem by appending the goal to the state. Additionally, some authors define the goal as a feature of the state; it is straightforward to apply the analysis and method presented in this paper to that setting.) A goal-conditioned policy $\pi(a \mid s, g)$ maps a state $s$ and goal $g$ to a distribution over actions $a$, and its objective is to reach the goal, i.e., to make the current state equal to the goal.

Goal-reaching can be formulated as minimizing $\mathcal{H}(G \mid S)$, and many practical goal-reaching algorithms (Kaelbling, 1993; Lillicrap et al., 2016; Schaul et al., 2015; Andrychowicz et al., 2017; Nair et al., 2018; Pong et al., 2018; Florensa et al., 2018a) can be viewed as approximations of this objective. See Appendix A for more details. Our method may thus be used in conjunction with any of these prior goal-conditioned RL methods in order to jointly minimize $\mathcal{H}(G \mid S)$ and maximize $\mathcal{H}(G)$.
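As a concrete illustration of how such goal-reaching algorithms reuse collected experience, the sketch below shows hindsight relabeling in the style of HER (Andrychowicz et al., 2017): a stored transition is relabeled with a state actually reached later in the same trajectory, and the reward is recomputed as a sparse goal-reaching indicator. The data layout, reward definition, and function names are illustrative assumptions, not the exact implementation used in this paper.

```python
import numpy as np

def relabel_with_hindsight(trajectory, reward_fn, rng=np.random.default_rng()):
    """Relabel transitions with future states as goals (HER-style 'future' strategy).

    trajectory: list of dicts with keys 'state', 'action', 'next_state', 'goal'.
    reward_fn(next_state, goal) -> float, e.g. a sparse goal-reaching indicator.
    Returns a list of relabeled transitions suitable for off-policy training.
    """
    relabeled = []
    for t, transition in enumerate(trajectory):
        # Pick a state achieved later in the same trajectory and treat it as the goal.
        future_idx = rng.integers(t, len(trajectory))
        new_goal = trajectory[future_idx]["next_state"]
        relabeled.append({
            "state": transition["state"],
            "action": transition["action"],
            "next_state": transition["next_state"],
            "goal": new_goal,
            "reward": reward_fn(transition["next_state"], new_goal),
        })
    return relabeled

# Example sparse reward: 1 if the reached state is within a small radius of the goal.
sparse_reward = lambda s, g: float(np.linalg.norm(np.asarray(s) - np.asarray(g)) < 0.05)
```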

3.2 Maximizing $\mathcal{H}(G)$: Setting Diverse Goals

We now turn to the problem of setting diverse goals or, mathematically, maximizing the entropy of the goal distribution, $\mathcal{H}(G)$. Let $U_\mathcal{S}$ be the uniform distribution over $\mathcal{S}$, where we assume $\mathcal{S}$ has finite volume so that the uniform distribution is well-defined. Let $p_\phi^G$ be the goal distribution, i.e. $G \sim p_\phi^G$. Our goal is to maximize the entropy of $p_\phi^G$, which we write as $\mathcal{H}(G)$. Since the maximum entropy distribution over $\mathcal{S}$ is the uniform distribution $U_\mathcal{S}$, maximizing $\mathcal{H}(G)$ may seem as simple as choosing the uniform distribution to be our goal distribution: $p_\phi^G = U_\mathcal{S}$. However, this requires knowing the uniform distribution over valid states, which may be difficult to obtain when $\mathcal{S}$ is a strict subset of $\mathbb{R}^n$, for some $n$. For example, if the states correspond to images viewed through a robot’s camera, $\mathcal{S}$ corresponds to the (unknown) set of valid images of the robot’s environment, while $\mathbb{R}^n$ corresponds to all possible arrays of pixel values of a particular size. In such environments, sampling uniformly from $\mathbb{R}^n$ is unlikely to yield a valid image of the real world, and sampling uniformly from $\mathcal{S}$ would require knowing the set of all possible valid images, which we assume the agent does not know when starting to explore the environment.

While we cannot sample arbitrary states from $\mathcal{S}$, we can sample states by performing goal-directed exploration. To derive and analyze our method, we introduce a simple model of this process: a goal $G$ is sampled from the goal distribution $p_\phi^G$, and then the agent attempts to achieve this goal, which results in a distribution of states seen along the trajectory. We abstract this entire process by writing the resulting marginal distribution over states as $p_\phi^S$. We assume that $p_\phi^S$ has full support, which can be accomplished with an epsilon-greedy goal-reaching policy in a communicating MDP. We also assume we have access to an oracle goal reacher, so that $p_\phi^S = p_\phi^G$. This simplified model allows us to analyze the behavior of our goal-setting scheme in isolation from any goal-reaching algorithm, but we show in Section 6 that we can jointly learn a goal-reaching policy.

In summary, we want to acquire a maximum-entropy goal distribution $p_\phi^G$ over the set of valid states $\mathcal{S}$, while only having access to state samples from $p_\phi^S$.

4 Skew-Fit: Learning a Maximum Entropy Goal Distribution

Our method, Skew-Fit, learns a maximum entropy goal distribution using samples collected from a goal-conditioned policy. We analyze the algorithm and show that Skew-Fit maximizes the entropy of the goal distribution, and present a practical instantiation for unsupervised deep RL.

4.1 Skew-Fit Algorithm

To learn a uniform distribution over valid goal states, we present a method that iteratively increases the entropy of a generative model $q_\phi$. In particular, given a generative model $q_{\phi_t}$ at iteration $t$, we would like to train a new generative model $q_{\phi_{t+1}}$ that has higher entropy over the set of valid states. While we do not know the set of valid states $\mathcal{S}$, we can sample states $s_n \sim p_{\phi_t}^S$, resulting in an empirical distribution over the states

$$p^{\text{emp}}_{t}(s) \triangleq \frac{1}{N} \sum_{n=1}^{N} \delta(s = s_n), \qquad s_n \sim p_{\phi_t}^S, \qquad (2)$$

and use this empirical distribution to train the next generative model $q_{\phi_{t+1}}$. However, if we simply train $q_{\phi_{t+1}}$ to model this empirical distribution, it may not necessarily have higher entropy than $q_{\phi_t}$.

The intuition behind our method is quite simple: rather than fitting a generative model to our empirical distribution, we skew the empirical distribution so that rarely visited states are given more weight. See Figure 2 for a visualization of this process.

Figure 2: Our method, Skew-Fit, samples goals for goal-conditioned RL in order to induce a uniform state visitation distribution. We start by sampling from our replay buffer, and weighting the states such that rare states are given more weight. We then train a generative model with the weighted samples. By sampling new states with goals proposed from this new generative model, we obtain a higher entropy distribution of states in our replay buffer at the next iteration.

How should we skew the empirical distribution if we want to maximize the entropy of $q_{\phi_{t+1}}$? If we had access to the density of each state, $p_{\phi_t}^S(s)$, then we could simply weight each state by $1 / p_{\phi_t}^S(s)$. We could then perform maximum likelihood estimation (MLE) for the uniform distribution by using the following loss to train $\phi_{t+1}$:

$$\mathbb{E}_{s \sim U_\mathcal{S}}\left[\log q_{\phi_{t+1}}(s)\right] = \mathbb{E}_{s \sim p_{\phi_t}^S}\left[\frac{U_\mathcal{S}(s)}{p_{\phi_t}^S(s)} \log q_{\phi_{t+1}}(s)\right] \propto \mathbb{E}_{s \sim p_{\phi_t}^S}\left[\frac{1}{p_{\phi_t}^S(s)} \log q_{\phi_{t+1}}(s)\right],$$

where we use the fact that the uniform distribution $U_\mathcal{S}$ has constant density for all states in $\mathcal{S}$. However, computing this density requires marginalizing out the MDP dynamics, which requires an accurate model of both the dynamics and the goal-conditioned policy. We avoid needing to model the entire MDP process by approximating $p_{\phi_t}^S$ with our previously learned generative model: $p_{\phi_t}^S(s) \approx q_{\phi_t}(s)$.

Variance minimization.

This procedure relies on importance sampling (IS), which can have high variance, particularly if $q_{\phi_t}(s) \approx 0$. We reduce this variance by weighting each state by $q_{\phi_t}(s)^{\alpha}$ for $\alpha \in [-1, 0)$, rather than by $q_{\phi_t}(s)^{-1}$. If $\alpha = -1$, then we recover the exact importance sampling procedure described above. If $\alpha = 0$, then this skew step has no effect. By choosing intermediate values of $\alpha$, we can trade off the variance introduced by dividing by a potentially small density $q_{\phi_t}(s)$ against the speed at which we want to increase the entropy of the goal distribution. Moreover, we show in Section 4.2 that, for a range of values of $\alpha$, this procedure converges to the uniform distribution over valid states. Therefore, we explicitly define a skewed distribution using a weight with exponent $\alpha$:

$$p_{\text{skewed}_t}(s) \triangleq \frac{1}{Z_\alpha} \, p^{\text{emp}}_{t}(s) \, q_{\phi_t}(s)^{\alpha}, \qquad \alpha \in [-1, 0), \qquad (3)$$

where $Z_\alpha$ is the normalizing coefficient and $p^{\text{emp}}_{t}$ is given by Equation 2. Lastly, we fit the generative model $q_{\phi_{t+1}}$ at the next iteration to $p_{\text{skewed}_t}$ using standard MLE. We note that computing $Z_\alpha$ does not add much computational overhead, since all importance weights already need to be computed.
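To make the skewing step concrete, the sketch below computes normalized Skew-Fit weights for a batch of sampled states, given their (approximate) log-densities under the current generative model. The function name and the log-space normalization are our own illustrative choices; this is not taken from the paper's released code.

```python
import numpy as np

def skew_fit_weights(log_densities, alpha=-1.0):
    """Compute normalized Skew-Fit weights w_n proportional to q_{phi_t}(s_n)^alpha.

    log_densities: array of log q_{phi_t}(s_n) for each sampled state s_n.
    alpha: skew exponent in [-1, 0); alpha = 0 leaves the empirical distribution unchanged.
    Returns weights that sum to 1, i.e. the skewed empirical distribution of Equation 3.
    """
    log_weights = alpha * np.asarray(log_densities)
    # Normalize in log space for numerical stability; the sum plays the role of Z_alpha.
    log_weights -= np.max(log_weights)
    weights = np.exp(log_weights)
    return weights / weights.sum()

# Example: rarely visited states (low density) receive larger weight when alpha < 0.
log_q = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
print(skew_fit_weights(log_q, alpha=-0.5))
```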

Goal sampling alternative.

Because $q_{\phi_{t+1}} \approx p_{\text{skewed}_t}$, at iteration $t+1$ one can sample goals from either $q_{\phi_{t+1}}$ or $p_{\text{skewed}_t}$. Sampling goals from $p_{\text{skewed}_t}$ may be preferred if sampling from the learned generative model is computationally or otherwise challenging. In either case, one still needs to train the generative model to create $p_{\text{skewed}_t}$. In our experiments, we found that both methods perform well.

Overall, Skew-Fit samples data from the environment and weights each sample by $q_{\phi_t}(s)^{\alpha}$, a decreasing function of its density under the generative model $q_{\phi_t}$. We prove in the next section that this weighting makes the generative model at the next iteration have higher entropy. As a result, the generative model is more likely to generate goals at the frontier of unseen states, which results in more uniform state coverage. This procedure is shown in Figure 2, and Skew-Fit is summarized in Algorithm 1.

1: for iteration $t = 1, 2, \ldots$ do
2:     Collect states $\{s_n\}_{n=1}^{N}$ by sampling goals from $q_{\phi_t}$ (or $p_{\text{skewed}_{t-1}}$) and running the goal-conditioned policy.
3:     Construct the skewed distribution $p_{\text{skewed}_t}$ (Equation 3).
4:     Fit $q_{\phi_{t+1}}$ to the skewed distribution $p_{\text{skewed}_t}$ using MLE.
5: end for
Algorithm 1 Skew-Fit
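Putting the pieces together, the following sketch shows one way to implement the outer loop of Algorithm 1 with a weighted maximum-likelihood fit. The `generative_model` interface (`sample`, `log_prob`, `fit_weighted`) and the `collect_states` environment helper are hypothetical placeholders; any density model that supports weighted MLE (e.g., a VAE trained on skew-resampled batches) could fill this role.

```python
import numpy as np

def skew_fit(generative_model, collect_states, num_iterations=50,
             states_per_iteration=1000, alpha=-1.0):
    """Iteratively skew and refit a generative model so its entropy increases (Algorithm 1).

    generative_model: object with sample(n), log_prob(states), and fit_weighted(states, weights).
    collect_states: function mapping an array of goals to the states visited while pursuing them.
    """
    for t in range(num_iterations):
        # 1. Sample goals from the current model and run the goal-conditioned policy.
        goals = generative_model.sample(states_per_iteration)
        states = collect_states(goals)

        # 2. Construct the skewed empirical distribution (Equation 3):
        #    weight each state by q_{phi_t}(s)^alpha, normalized over the batch.
        log_q = generative_model.log_prob(states)
        log_w = alpha * np.asarray(log_q)
        log_w -= np.max(log_w)                 # stabilize before exponentiating
        weights = np.exp(log_w)
        weights /= weights.sum()               # plays the role of Z_alpha

        # 3. Fit the next generative model to the skewed distribution via weighted MLE.
        generative_model.fit_weighted(states, weights)
    return generative_model
```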

4.2 Skew-Fit Analysis

In this section, we provide conditions under which $q_{\phi_t}$ converges to the uniform distribution over the state space $\mathcal{S}$. Our most general result is stated as follows:

Lemma 4.1.

Let $\mathcal{S}$ be a compact set. Define the set of distributions $\mathcal{P} = \{p : \operatorname{support}(p) \subseteq \mathcal{S}\}$. Let $\mathcal{F} : \mathcal{P} \mapsto \mathcal{P}$ be a continuous function such that $\mathcal{H}(\mathcal{F}(p)) \geq \mathcal{H}(p)$, with equality if and only if $p$ is the uniform probability distribution on $\mathcal{S}$, $U_\mathcal{S}$. Define the sequence of distributions $(p_t)_{t \in \mathbb{N}}$ by starting with any $p_1 \in \mathcal{P}$ and recursively defining $p_{t+1} = \mathcal{F}(p_t)$.

The sequence $(p_t)$ converges to $U_\mathcal{S}$.

Proof.

See Appendix Section B.1. ∎

We will apply Lemma 4.1 to the map $\mathcal{F}$ that takes $q_{\phi_t}$ to $q_{\phi_{t+1}}$ in order to show that $q_{\phi_t}$ converges to $U_\mathcal{S}$ (we take convergence here to mean convergence in distribution). If we assume that the goal-conditioned policy and the generative model learning procedure are well behaved (technical assumptions are stated in Section B.1 of the Appendix), then to apply Lemma 4.1, we only need to show that $\mathcal{H}(q_{\phi_{t+1}}) \geq \mathcal{H}(q_{\phi_t})$, with equality if and only if $q_{\phi_t} = U_\mathcal{S}$. For the simple case when $q_{\phi_t}$ equals $p_{\phi_t}^S$ identically at each iteration, we prove that this is true for any value of $\alpha \in [-1, 0)$ in Lemma B.3 of the Appendix. However, in practice, $q_{\phi_t}$ only approximates $p_{\phi_t}^S$. To address this more realistic situation, we prove the following result:

Lemma 4.2.

Given two distributions $p$ and $q$ where $p \ll q$ (meaning that $p$ is absolutely continuous with respect to $q$, i.e. $q(s) = 0$ implies $p(s) = 0$) and

$$\operatorname{Cov}_{S \sim p}\left[\log p(S), \log q(S)\right] > 0, \qquad (4)$$

define the distribution $p_\alpha(s) \propto p(s)\, q(s)^{\alpha}$ as in Equation 3. Let $\mathcal{H}(\alpha)$ be the entropy of $p_\alpha$ for a fixed $\alpha$. Then there exists a constant $a < 0$ such that for all $\alpha \in [a, 0)$,

$$\mathcal{H}(\alpha) > \mathcal{H}(0) = \mathcal{H}(p).$$

Proof.

See Appendix Section B.1. ∎

Thus, our generative model does not need to exactly fit the empirical distribution. We merely need the log densities of $q_{\phi_t}$ and $p_{\phi_t}^S$ to be positively correlated, which we expect to happen frequently with an accurate goal-conditioned policy, since $p_{\phi_t}^S$ is the distribution over states seen when trying to reach goals sampled from $q_{\phi_t}$. In this case, if we choose negative values of $\alpha$ that are small enough in magnitude, then the entropy of $p_{\text{skewed}_t}$ will be higher than that of $p_{\phi_t}^S$. Empirically, we found that values as low as $\alpha = -1$ performed well.

In summary, we see that under certain assumptions, $p_{\text{skewed}_t}$ converges to $U_\mathcal{S}$. Since we train each generative model $q_{\phi_{t+1}}$ by fitting it to $p_{\text{skewed}_t}$, we expect $q_{\phi_{t+1}}$ to also converge to $U_\mathcal{S}$.

5 Skew-Fit with Goal-Conditioned Reinforcement Learning

To train agents that can autonomously explore the environment in a controllable manner, we need to maximize the mutual information between goals and states, $I(S; G)$. Thus far, we have presented and derived Skew-Fit assuming that we have access to an oracle goal-reaching policy, allowing us to separately analyze how we can maximize $\mathcal{H}(G)$. However, in practice we do not have access to such an oracle, and in this section we discuss how one can combine Skew-Fit with existing goal-conditioned reinforcement learning methods to maximize both terms jointly.

Maximizing $I(S; G)$ can be done by simultaneously performing Skew-Fit and training a goal-conditioned policy to minimize $\mathcal{H}(G \mid S)$, or, equivalently, to maximize $-\mathcal{H}(G \mid S)$. Maximizing $-\mathcal{H}(G \mid S)$ requires computing the density $p(g \mid s)$, which may be difficult to compute without strong modeling assumptions. However, for any distribution $q(g \mid s)$, the following lower bound on $-\mathcal{H}(G \mid S)$ holds:

$$-\mathcal{H}(G \mid S) = \mathbb{E}_{(g, s)}\left[\log p(g \mid s)\right] \geq \mathbb{E}_{(g, s)}\left[\log q(g \mid s)\right],$$

where the gap is the (non-negative) Kullback–Leibler divergence between $p(\cdot \mid s)$ and $q(\cdot \mid s)$, as discussed by Barber & Agakov (2004). Thus, to minimize $\mathcal{H}(G \mid S)$, we train a policy to maximize the following reward:

$$r(s, g) = \log q(g \mid s).$$

Since Skew-Fit uses a generative model to propose goals, it is particularly natural to combine it with reinforcement learning with imagined goals (RIG) (Nair et al., 2018), though in principle any goal-conditioned method could be used. RIG is an efficient off-policy goal-conditioned method that solves the vision-based RL problem in a learned latent space. In particular, RIG fits a VAE and uses it to encode all observations and goals into a latent space. RIG also uses the generative model both to sample goals and to compute rewards, using the distance between the encoded state and encoded goal in the latent space. Applying Skew-Fit to RIG then amounts to using Skew-Fit rather than MLE to train the VAE. See Appendix E for details on how to do so.
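As a rough illustration of how these pieces fit together, the sketch below computes a RIG-style latent-distance reward with a VAE encoder and applies Skew-Fit weights when refitting the VAE on replay-buffer images. The `vae.encode`, `vae.log_prob`, and `vae.train_on_batch` interfaces are hypothetical stand-ins, not the API of the paper's released code.

```python
import numpy as np

def latent_distance_reward(vae, observation, goal):
    """RIG-style reward: negative Euclidean distance between encoded observation and goal."""
    z_obs = vae.encode(observation)   # encoder mean for the current image
    z_goal = vae.encode(goal)         # encoder mean for the goal image
    return -np.linalg.norm(z_obs - z_goal)

def skewed_vae_update(vae, replay_images, alpha=-1.0, batch_size=256,
                      rng=np.random.default_rng()):
    """One Skew-Fit VAE update: resample replay images by q(s)^alpha and train on that batch."""
    log_q = vae.log_prob(replay_images)          # approximate log-density of each image
    log_w = alpha * np.asarray(log_q)
    log_w -= log_w.max()
    weights = np.exp(log_w)
    weights /= weights.sum()
    idx = rng.choice(len(replay_images), size=batch_size, p=weights, replace=True)
    vae.train_on_batch(replay_images[idx])       # standard MLE on the skew-resampled batch
```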

6 Experiments

Our experimental evaluation of Skew-Fit aims to study the following empirical questions: (1) Can Skew-Fit learn a generative model that can model the uniform distribution over the set of valid states? (2) When combined with goal-conditioned RL, can Skew-Fit enable agents to autonomously set and learn to reach a diverse set of goals?

Figure 3: We evaluate on these continuous control environments. From left to right: Visual Pusher, a simulated pushing task; Visual Door, a door opening task; Visual Pickup, a picking task; and Real World Visual Door, a real world door opening task. All tasks are solved from images and without any task-specific reward. See Appendix C for details.
Figure 4: (Left) Learning curves for simulated continuous control experiments. Lower is better. For each environment and method, we show the mean and standard deviation of 6 seeds and smooth temporally across 25 epochs within each seed. Skew-Fit consistently outperforms RIG and various baselines. See the text for description of each method. (Right) The first column displays example test goal images for each environment. In the next two columns, we display final images reached by Skew-Fit and RIG respectively. Under each image is the final distance in state space to provide a notion of the behavior of each method in the plots.

To answer these questions, we evaluate Skew-Fit with RIG on vision-based continuous control. In this setting, the agent must control a robot arm using only image observations, without access to any ground truth reward signal. We test our method on three different simulated continuous control tasks: Visual Door, Visual Pusher, and Visual Pickup. See Figure 3 for visuals and Appendix C for details of these environments. Policies for these tasks are trained in a completely unsupervised manner, without access to any prior information about the state space or an oracle goal-sampling distribution. However, to evaluate their performance, we sample goal images from a uniform distribution over valid images and report the agent’s final distance to the corresponding simulator states (e.g., distance of the puck to the target puck location). While this evaluation method and metric are only practical in simulation, they provide us with a quantitative measure of an RL method’s ability to reach a broad coverage of goals in a vision-based setting.

To train policies in image-based environments, we apply each method we compare to, with the exception of Warde-Farley et al. (2018), alongside RIG in order to enable it to solve visual tasks and to ensure a fair comparison across methods. Additionally, we replace TD3 (Fujimoto et al., 2018) with soft actor-critic (SAC) (Haarnoja et al., 2018) for all the methods that use RIG, including Skew-Fit; we found that SAC performed as well as or better than TD3 in every environment. In the original RIG, the VAE was pre-trained on a uniform sampling of images from the state space of each environment. In order to ensure a fair comparison to Skew-Fit, we forgo pre-training and instead train the VAE alongside RL, using the variant described in the RIG paper.

First, we compare to standard RIG, without Skew-Fit. We also ran RIG using the relabeling scheme described in the hindsight experience replay (HER) paper to form the baseline HER. Florensa et al. (2018b) sample goals from a GAN based on the difficulty of reaching the goal; we compare against this method by replacing $q_{\phi_t}$ with the GAN and label it AutoGoal GAN. We compare to Warde-Farley et al. (2018), which uses a non-parametric approach based on clustering to sample goals and a state discriminator to compute rewards; we denote this baseline DISCERN. We also compare to the goal proposal mechanism of Warde-Farley et al. (2018) without the discriminative reward, which we label DISCERN-g. Lastly, we compare our method to two exploration methods based on reward bonuses, ICM (Pathak et al., 2017) and #Exploration (Tang et al., 2017), denoted ICM and #Exploration respectively.

We see in Figure 4 that Skew-Fit significantly outperforms prior methods both in terms of task performance and sample complexity. The most common failure mode for prior methods is that the goal sampling distribution collapses, resulting in the agent learning to reach only a fraction of the state space. For comparison, more goals sampled from the generative model trained with Skew-Fit and RIG are shown in Figure 13 in the Appendix. We can see that standard RIG produces a small non-diverse distribution for each environment: the object is in the same place for pickup, the puck is often in the starting position for pushing, and the door is always closed.

In contrast, Skew-Fit proposes goals where the object is in the air and on the ground, varied puck positions, and both open and closed door angles. The direct effect of these goal choices can be seen in Figure 12 in the appendix, which shows example reached goals for RIG and Skew-Fit. Standard RIG only learns to reach states close to the initial position, while Skew-Fit learns to reach the entire state space. A quantitative comparison between the two methods on the pickup task can be seen in Figure 10 in the appendix, which gives the cumulative total exploration pickups for each method. From the graph, we can see that only Skew-Fit learns to pay attention to the object, while other methods ignore it completely.

Figure 5: Learning curve for Real World Visual Door environment. We visually label a success if the policy opens the door to the target angle by the last state of the trajectory. Skew-Fit results in considerable sample efficiency gains over prior work on this real-world task.

Real-world vision-based robotic manipulation

Finally, we demonstrate that Skew-Fit scales well to the real world with a door opening task, Real World Visual Door. See Figure 3 for a picture of this environment. While a number of prior works have studied RL-based learning of door opening (Kalakrishnan et al., 2011; Chebotar et al., 2017), we demonstrate the first method for autonomous learning of door opening without a user-provided, task-specific reward function. As in simulation, we do not provide any goals to the agent and simply let it interact with the door to solve the door opening task from scratch, without any human guidance or reward signal. We train agents using Skew-Fit as well as RIG. Unlike in simulation, we cannot measure the difference between the policy’s achieved and desired door angle, since we do not have access to the true state of the world. Instead, we visually denote a binary success/failure for each goal based on whether the last state in the trajectory achieves the target angle. Every seven and a half minutes of interaction time, we evaluate the policy on test goals and plot the cumulative successes for each method. As Figure 5 shows, standard RIG only starts to open the door consistently after five hours of training. In contrast, Skew-Fit learns to open the door after three hours of training and achieves a perfect success rate after five and a half hours of interaction time, demonstrating that Skew-Fit is a promising technique for solving real world tasks without any human-provided reward function. Videos of our method solving this task, along with the simulated environments, can be viewed on our website.444https://sites.google.com/view/skew-fit

Additional Experiments

We perform additional experiments, including a more thorough analysis in a simple 2D setting, a study of how Skew-Fit can be used to train generative models on imbalanced datasets, and a sensitivity analysis over the parameter $\alpha$, in Appendix D. Additionally, Appendix E provides a complete description of all our environments as well as our method hyper-parameters.

7 Conclusion

We presented Skew-Fit, an algorithm for training a generative model to closely approximate a uniform distribution over valid states, using data obtained via goal-conditioned reinforcement learning. Our method iteratively reweights the samples used to train the generative model so that its entropy over the set of valid states increases, and our theoretical analysis gives conditions under which Skew-Fit converges to the uniform distribution. When such a model is used to choose goals for exploration and to relabel goals for training, the resulting method achieves much better coverage of the state space, enabling effective exploration. Our experiments show that Skew-Fit produces quantifiable improvements when used with goal-conditioned reinforcement learning on simulated robotic manipulation tasks, and can be used to learn a complex door-opening skill directly on a real robot, reaching a perfect success rate without any human-provided reward supervision.

8 Acknowledgement

This research was supported by Berkeley DeepDrive, Huawei, ARL DCIST CRA W911NF-17-2-0181, NSF IIS-1651843, and the Office of Naval Research, as well as Amazon, Google, and NVIDIA. We thank Aviral Kumar, Carlos Florensa, Aurick Zhou, Nilesh Tripuraneni, Vickie Ye, Dibya Ghosh, Coline Devin, and John D. Co-Reyes for their insightful discussions and feedback.

References

  • Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., Mcgrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • Baranes & Oudeyer (2012) Baranes, A. and Oudeyer, P.-Y. Active Learning of Inverse Models with Intrinsically Motivated Goal Exploration in Robots. Robotics and Autonomous Systems, 61(1):49–73, 2012. doi: 10.1016/j.robot.2012.05.008.
  • Barber & Agakov (2004) Barber, D. and Agakov, F. V. Information maximization in noisy channels: A variational approach. In Advances in Neural Information Processing Systems, pp. 201–208, 2004.
  • Bellemare et al. (2016) Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS), pp. 1471–1479, 2016.
  • Billingsley (2013) Billingsley, P. Convergence of probability measures. John Wiley & Sons, 2013.
  • Burda et al. (2018) Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
  • Burda et al. (2019) Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. Large-scale study of curiosity-driven learning. In International Conference on Learning Representations (ICLR), 2019.
  • Chebotar et al. (2017) Chebotar, Y., Kalakrishnan, M., Yahya, A., Li, A., Schaal, S., and Levine, S. Path integral guided policy search. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3381–3388. IEEE, 2017.
  • Chentanez et al. (2005) Chentanez, N., Barto, A. G., and Singh, S. P. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 1281–1288, 2005.
  • Colas et al. (2018a) Colas, C., Fournier, P., Sigaud, O., and Oudeyer, P. CURIOUS: intrinsically motivated multi-task, multi-goal reinforcement learning. CoRR, abs/1810.06284, 2018a.
  • Colas et al. (2018b) Colas, C., Sigaud, O., and Oudeyer, P.-Y. Gep-pg: Decoupling exploration and exploitation in deep reinforcement learning algorithms. International Conference on Machine Learning (ICML), 2018b.
  • Eysenbach et al. (2019) Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is All You Need: Learning Skills without a Reward Function. In International Conference on Learning Representations (ICLR), 2019.
  • Florensa et al. (2017) Florensa, C., Duan, Y., and Abbeel, P. Stochastic neural networks for hierarchical reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.
  • Florensa et al. (2018a) Florensa, C., Degrave, J., Heess, N., Springenberg, J. T., and Riedmiller, M. Self-supervised Learning of Image Embedding for Continuous Control. In Workshop on Inference to Control at NeurIPS, 2018a.
  • Florensa et al. (2018b) Florensa, C., Held, D., Geng, X., and Abbeel, P. Automatic Goal Generation for Reinforcement Learning Agents. In International Conference on Machine Learning (ICML), 2018b.
  • Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In International Conference on Machine Learning (ICML), 2018.
  • Gupta et al. (2018a) Gupta, A., Eysenbach, B., Finn, C., and Levine, S. Unsupervised meta-learning for reinforcement learning. CoRR, abs:1806.04640, 2018a.
  • Gupta et al. (2018b) Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. Meta-Reinforcement Learning of Structured Exploration Strategies. In Advances in Neural Information Processing Systems (NIPS), 2018b.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. Soft actor-critic algorithms and applications. CoRR, abs/1812.05905, 2018.
  • Hausman et al. (2018) Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. Learning an Embedding Space for Transferable Robot Skills. In International Conference on Learning Representations (ICLR), pp. 1–16, 2018.
  • Hazan et al. (2018) Hazan, E., Kakade, S. M., Singh, K., and Soest, A. V. Provably efficient maximum entropy exploration. CoRR, abs/1812.02690, 2018.
  • Kaelbling (1993) Kaelbling, L. P. Learning to achieve goals. In International Joint Conference on Artificial Intelligence (IJCAI), volume vol.2, pp. 1094 – 8, 1993.
  • Kalakrishnan et al. (2011) Kalakrishnan, M., Righetti, L., Pastor, P., and Schaal, S. Learning force control policies for compliant manipulation. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4639–4644. IEEE, 2011.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
  • Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016. ISBN 0-7803-3213-X. doi: 10.1613/jair.301.
  • Lopes et al. (2012) Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems, pp. 206–214, 2012.
  • Mohamed & Rezende (2015) Mohamed, S. and Rezende, D. J. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 2125–2133, 2015.
  • Nachum et al. (2018) Nachum, O., Brain, G., Gu, S., Lee, H., and Levine, S. Data-Efficient Hierarchical Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • Nair et al. (2018) Nair, A., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual Reinforcement Learning with Imagined Goals. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • Nielsen & Nock (2010) Nielsen, F. and Nock, R. Entropies and cross-entropies of exponential families. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pp. 3621–3624. IEEE, 2010.
  • Ostrovski et al. (2017) Ostrovski, G., Bellemare, M. G., Oord, A., and Munos, R. Count-based exploration with neural density models. In International Conference on Machine Learning, pp. 2721–2730, 2017.
  • Pathak et al. (2017) Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-Driven Exploration by Self-Supervised Prediction. In International Conference on Machine Learning (ICML), pp. 488–489. IEEE, 2017.
  • Péré et al. (2018) Péré, A., Forestier, S., Sigaud, O., and Oudeyer, P.-Y. Unsupervised Learning of Goal Spaces for Intrinsically Motivated Goal Exploration. In International Conference on Learning Representations (ICLR), 2018.
  • Pong et al. (2018) Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal Difference Models: Model-Free Deep RL For Model-Based Control. In International Conference on Learning Representations (ICLR), 2018.
  • Savinov et al. (2018) Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.
  • Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal Value Function Approximators. In International Conference on Machine Learning (ICML), pp. 1312–1320, 2015. ISBN 9781510810587.
  • Stadie et al. (2016) Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. In International Conference on Learning Representations (ICLR), 2016.
  • Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
  • Tang et al. (2017) Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning. In Neural Information Processing Systems (NIPS), 2017.
  • Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012. ISBN 9781467317375. doi: 10.1109/IROS.2012.6386109.
  • Veeriah et al. (2018) Veeriah, V., Oh, J., and Singh, S. Many-goals reinforcement learning. arXiv preprint arXiv:1806.09605, 2018.
  • Warde-Farley et al. (2018) Warde-Farley, D., de Wiele, T. V., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through non-parametric discriminative rewards. CoRR, abs/1811.11359, 2018.
  • Zhao & Tresp (2019) Zhao, R. and Tresp, V. Curiosity-driven experience prioritization via density estimation. CoRR, abs/1902.08039, 2019.

Appendix A Goal-Conditioned RL minimizes $\mathcal{H}(G \mid S)$

Some goal-conditioned RL methods, such as Warde-Farley et al. (2018) and Nair et al. (2018), present methods for minimizing a lower bound on $\mathcal{H}(G \mid S)$ by approximating $\log p(g \mid s)$ and using it as the reward. While other goal-conditioned RL methods (Kaelbling, 1993; Lillicrap et al., 2016; Schaul et al., 2015; Andrychowicz et al., 2017; Pong et al., 2018; Florensa et al., 2018a) were not developed with the intention of minimizing the conditional entropy $\mathcal{H}(G \mid S)$, the optimal goal-conditioned policy deterministically reaches the goal, resulting in a conditional entropy of zero: $\mathcal{H}(G \mid S) = 0$. Thus, goal-conditioned RL methods effectively minimize $\mathcal{H}(G \mid S)$.

Appendix B Analysis

We note several intermediate results that are necessary in the proof of Skew-Fit.

B.1 Proofs of Lemmas for the Analysis of Skew-Fit

We write $p \ll q$ to mean that $p$ is absolutely continuous with respect to $q$, i.e., $q(s) = 0$ implies $p(s) = 0$.

Assumption B.1.

Assume that

  • $\mathcal{S}$ is compact.

  • The maps from $q_{\phi_t}$ to $p_{\phi_t}^S$ (running the goal-conditioned policy) and from $p_{\text{skewed}_t}$ to $q_{\phi_{t+1}}$ (fitting the generative model) are continuous and do not decrease entropy.

  • The support of $q_{\phi_t}$ contains $\mathcal{S}$.

The assumption that $\mathcal{S}$ is compact is easily satisfied in most applications and makes the uniform distribution $U_\mathcal{S}$ well defined.

The continuity assumption states that the method used to fit $q_{\phi_{t+1}}$ and the goal-conditioned policy are well-behaved.

Note that we do not assume that the procedures for obtaining $q_{\phi_{t+1}}$ and $p_{\phi_{t+1}}^S$ increase the entropy; we only require that they do not decrease it. Making a statement without these entropy assumptions would be difficult: if the goal-conditioned policy ignored the goal and always entered the same state, or if the generative model only captured a single mode, it would be challenging for any procedure to result in the uniform distribution.

Next, the assumption that the support of $q_{\phi_t}$ contains $\mathcal{S}$ ensures that $p_{\text{skewed}_t}$ is well defined and a continuous function of $q_{\phi_t}$ and $p^{\text{emp}}_t$. Note that $q_{\phi_t}$ can have support larger than $\mathcal{S}$, and so we can choose to optimize over any class of generative models with wide support, without needing to know the manifold $\mathcal{S}$.

Under these assumptions, to use Lemma 4.1 for Skew-Fit, we must show that $\mathcal{H}(q_{\phi_{t+1}}) \geq \mathcal{H}(q_{\phi_t})$, with equality if and only if $q_{\phi_t} = U_\mathcal{S}$.

Lemma B.2.

Let $\mathcal{S}$ be a compact set. Define the set of distributions $\mathcal{P} = \{p : \operatorname{support}(p) \subseteq \mathcal{S}\}$. Let $\mathcal{F} : \mathcal{P} \mapsto \mathcal{P}$ be a continuous function such that $\mathcal{H}(\mathcal{F}(p)) \geq \mathcal{H}(p)$, with equality if and only if $p$ is the uniform probability distribution on $\mathcal{S}$, $U_\mathcal{S}$. Define the sequence of distributions $(p_t)_{t \in \mathbb{N}}$ by starting with any $p_1 \in \mathcal{P}$ and recursively defining $p_{t+1} = \mathcal{F}(p_t)$.

The sequence $(p_t)$ converges to $U_\mathcal{S}$.

Proof.

The uniform distribution $U_\mathcal{S}$ is well defined since $\mathcal{S}$ is compact. Because $\mathcal{S}$ is a compact set, by Prokhorov’s Theorem (Billingsley, 2013) the set $\mathcal{P}$ is sequentially compact. Thus, the sequence $(p_t)$ has a convergent subsequence $(p_{t_i})$ that converges to some distribution $p_\infty \in \mathcal{P}$. Because $\mathcal{F}$ is continuous, $p_\infty$ must be a fixed point of $\mathcal{F}$, since by the continuous mapping theorem we have that

$$\mathcal{F}(p_\infty) = \mathcal{F}\left(\lim_{i \to \infty} p_{t_i}\right) = \lim_{i \to \infty} \mathcal{F}(p_{t_i}) = \lim_{i \to \infty} p_{t_i + 1},$$

and so $\mathcal{F}(p_\infty) = p_\infty$. The only fixed point of $\mathcal{F}$ is $U_\mathcal{S}$, since for any distribution $p$ that is not the uniform distribution we have $\mathcal{H}(\mathcal{F}(p)) > \mathcal{H}(p)$, which implies $\mathcal{F}(p) \neq p$. Thus, the subsequence converges to the only fixed point, $U_\mathcal{S}$. Since the entropy cannot decrease, the entropy of the distributions in $(p_t)$ must also converge to the entropy of $U_\mathcal{S}$. Lastly, since entropy is a continuous function of the distribution, $(p_t)$ must converge to $U_\mathcal{S}$. ∎

Lemma B.3.

Assume the set $\mathcal{S}$ has finite volume so that its uniform distribution $U_\mathcal{S}$ is well defined and has finite entropy. Given any distribution $p_1$ whose support is $\mathcal{S}$, recursively define

$$p_{t+1}(s) \triangleq \frac{1}{Z_{\alpha, t}} p_t(s)^{1 + \alpha},$$

where $Z_{\alpha, t}$ is the normalizing constant and $\alpha \in [-1, 0)$. (In the paper, we skew the empirical distribution by $q_{\phi_t}(s)^{\alpha}$. However, when $q_{\phi_t} = p_{\phi_t}^S$, Equation 3 becomes proportional to $p_{\phi_t}^S(s)^{1 + \alpha}$, which corresponds to the recursion above.)

For all $t \in \mathbb{N}$,

$$\mathcal{H}(p_{t+1}) \geq \mathcal{H}(p_t),$$

with equality if and only if $p_t$ is $U_\mathcal{S}$, the uniform distribution over $\mathcal{S}$.

Proof.

If $\alpha = 0$ or $p_t$ is the uniform distribution, the result is clearly true. We now study the case where $\alpha \in [-1, 0)$ and $p_t \neq U_\mathcal{S}$.

Define the one-dimensional exponential family $\{p_\theta : \theta \in [-1, 0]\}$, where $p_\theta$ is

$$p_\theta(s) = e^{\theta T(s) - A(\theta) + k(s)},$$

with log carrier density $k(s) = \log p_t(s)$, natural parameter $\theta$, sufficient statistic $T(s) = \log p_t(s)$, and log-normalizer $A(\theta)$. As shown in Nielsen & Nock (2010), the entropy of a distribution from a one-dimensional exponential family with parameter $\theta$ is given by:

$$\mathcal{H}(p_\theta) = A(\theta) - \theta A'(\theta) - \mathbb{E}_{S \sim p_\theta}\left[k(S)\right].$$

The derivative with respect to $\theta$ is then

$$\frac{d}{d\theta} \mathcal{H}(p_\theta) = -(1 + \theta)\operatorname{Var}_{S \sim p_\theta}\left[\log p_t(S)\right],$$

where we use the fact that the $n$th derivative of $A(\theta)$ is the $n$th central moment of the sufficient statistic, i.e. $A''(\theta) = \operatorname{Var}_{S \sim p_\theta}[T(S)]$, and that here $T(S) = k(S)$. Since variance is always non-negative and $1 + \theta \geq 0$, this means the entropy is monotonically non-increasing with $\theta$, and so

$$\mathcal{H}(p_{t+1}) = \mathcal{H}(p_{\theta = \alpha}) \geq \mathcal{H}(p_{\theta = 0}) = \mathcal{H}(p_t),$$

with equality if and only if

$$\operatorname{Var}_{S \sim p_\theta}\left[\log p_t(S)\right] = 0.$$

However, this only happens if $p_t$ is constant over its support, i.e. it is the uniform distribution over its support. ∎

We also prove convergence directly for the (even more) simplified case where $q_{\phi_t} = p_{\phi_t}^S$ at every iteration, using a similar technique:

Lemma B.4.

Assume the set $\mathcal{S}$ has finite volume so that its uniform distribution $U_\mathcal{S}$ is well defined and has finite entropy. Given any distribution $p$ whose support is $\mathcal{S}$, recursively define the sequence $(p_t)$ with $p_1 = p$ and

$$p_{t+1}(s) \triangleq \frac{1}{Z_{\alpha, t}} p_t(s)^{1 + \alpha},$$

where $Z_{\alpha, t}$ is the normalizing constant and $\alpha \in [-1, 0)$.

The sequence $(p_t)$ converges to $U_\mathcal{S}$, the uniform distribution over $\mathcal{S}$.

Proof.

If $\alpha = -1$, then $p_2$ (and all subsequent distributions) will clearly be the uniform distribution. We now study the case where $\alpha \in (-1, 0)$.

At each iteration $t$, define the one-dimensional exponential family $\{p_{t, \theta} : \theta \in [-1, 0]\}$, where $p_{t, \theta}$ is

$$p_{t, \theta}(s) = e^{\theta T_t(s) - A_t(\theta) + k_t(s)},$$

with log carrier density $k_t(s) = \log p_t(s)$, natural parameter $\theta$, sufficient statistic $T_t(s) = \log p_t(s)$, and log-normalizer $A_t(\theta)$. As shown in Nielsen & Nock (2010), the entropy of a distribution from a one-dimensional exponential family with parameter $\theta$ is given by:

$$\mathcal{H}(p_{t, \theta}) = A_t(\theta) - \theta A_t'(\theta) - \mathbb{E}_{S \sim p_{t, \theta}}\left[k_t(S)\right].$$

The derivative with respect to $\theta$ is then

$$\frac{d}{d\theta} \mathcal{H}(p_{t, \theta}) = -(1 + \theta)\operatorname{Var}_{S \sim p_{t, \theta}}\left[\log p_t(S)\right], \qquad (5)$$

where we use the fact that the $n$th derivative of $A_t(\theta)$ is the $n$th central moment of the sufficient statistic, i.e. $A_t''(\theta) = \operatorname{Var}_{S \sim p_{t, \theta}}[T_t(S)]$. Since variance is always non-negative and $1 + \theta \geq 0$, this means the entropy is monotonically non-increasing with $\theta$. Note that $p_{t+1}$ is a member of this exponential family, with parameter $\theta = \alpha$. So

$$\mathcal{H}(p_{t+1}) = \mathcal{H}(p_{t, \alpha}) \geq \mathcal{H}(p_{t, 0}) = \mathcal{H}(p_t),$$

which implies

$$\mathcal{H}(p_1) \leq \mathcal{H}(p_2) \leq \cdots$$

This monotonically increasing sequence is upper bounded by the entropy of the uniform distribution, and so this sequence must converge.

The sequence can only converge if $\mathcal{H}(p_{t+1}) - \mathcal{H}(p_t)$ converges to zero. However, because $1 + \alpha$ is bounded away from $0$, Equation 5 states that this can only happen if

$$\operatorname{Var}_{S \sim p_t}\left[\log p_t(S)\right] \to 0. \qquad (6)$$

Because $p_1$ has full support, so does each $p_t$. Thus, Equation 6 is only true if $\log p_t$ converges to a constant, i.e. $p_t$ converges to the uniform distribution. ∎

Lemma B.5.

Given two distributions $p$ and $q$ where $p \ll q$ and

$$\operatorname{Cov}_{S \sim p}\left[\log p(S), \log q(S)\right] > 0, \qquad (7)$$

define the distribution $p_\alpha$ as

$$p_\alpha(s) \triangleq \frac{1}{Z_\alpha} p(s)\, q(s)^{\alpha},$$

where $\alpha \in \mathbb{R}$ and $Z_\alpha$ is the normalizing factor. Let $\mathcal{H}(\alpha)$ be the entropy of $p_\alpha$. Then there exists a constant $a < 0$ such that for all $\alpha \in [a, 0)$,

$$\mathcal{H}(\alpha) > \mathcal{H}(0) = \mathcal{H}(p). \qquad (8)$$
Proof.

Observe that $\{p_\alpha : \alpha \in \mathbb{R}\}$ is a one-dimensional exponential family,

$$p_\alpha(s) = e^{\alpha T(s) - A(\alpha) + k(s)},$$

with log carrier density $k(s) = \log p(s)$, natural parameter $\alpha$, sufficient statistic $T(s) = \log q(s)$, and log-normalizer $A(\alpha)$. As shown in Nielsen & Nock (2010), the entropy of a distribution from a one-dimensional exponential family with parameter $\alpha$ is given by:

$$\mathcal{H}(\alpha) = A(\alpha) - \alpha A'(\alpha) - \mathbb{E}_{S \sim p_\alpha}\left[k(S)\right].$$

The derivative with respect to $\alpha$ is then

$$\frac{d}{d\alpha} \mathcal{H}(\alpha) = -\alpha A''(\alpha) - \operatorname{Cov}_{S \sim p_\alpha}\left[T(S), k(S)\right],$$

where we use the fact that the derivatives of $A(\alpha)$ give the central moments, i.e. $A''(\alpha) = \operatorname{Var}_{S \sim p_\alpha}[T(S)]$ and $\frac{d}{d\alpha}\mathbb{E}_{S \sim p_\alpha}[k(S)] = \operatorname{Cov}_{S \sim p_\alpha}[T(S), k(S)]$. The derivative of $\mathcal{H}$ at $\alpha = 0$ is

$$\left.\frac{d}{d\alpha}\mathcal{H}(\alpha)\right|_{\alpha = 0} = -\operatorname{Cov}_{S \sim p}\left[\log q(S), \log p(S)\right],$$

which is negative by assumption. Because the derivative at $\alpha = 0$ is negative, there exists a constant $a < 0$ such that for all $\alpha \in [a, 0)$, $\mathcal{H}(\alpha) > \mathcal{H}(0)$. ∎

Appendix C Environment Details

Point-Mass: In this environment, an agent must learn to navigate a square-shaped corridor (see Figure 6). The observation is the 2D position, and the agent must specify a velocity as the 2D action. The reward at each time step is the negative distance between the achieved position and desired position.

Visual Pusher: A MuJoCo environment with a 7-DoF Sawyer arm and a small puck on a table that the arm must push to a target position. The agent controls the arm by commanding the position of the end effector (EE). The underlying state is the EE position and the puck position. The evaluation metric is the distance between the goal and final puck positions. The hand goal/state space is a 10x10 cm box and the puck goal/state space is a 30x20 cm box. Both the hand and puck spaces are centered around the origin. The action space is a bounded interval in the x and y dimensions.

Visual Door: A MuJoCo environment with a 7-DoF Sawyer arm and a door on a table that the arm must pull open to a target angle. Control is the same as in Visual Pusher. The evaluation metric is the distance between the goal and final door angle, measured in radians. In this environment, we do not reset the position of the hand or door at the end of each trajectory. The state/goal space for the arm is a 5x20x15 cm box in the x, y, and z dimensions, and the door angle lies within a fixed range (in radians). The action space is a bounded interval in the x, y, and z dimensions.

Visual Pickup: A MuJoCo environment with the same robot as Visual Pusher, but now with a different object. The object is cube-shaped, but a larger intangible sphere is overlaid on top so that it is easier for the agent to see. Moreover, the robot is constrained to move in 2 dimensions: it only controls two of the arm position coordinates. The initial position of both the arm and the object is fixed. The evaluation metric is the distance between the goal and final object positions. For the purpose of evaluation, a fixed fraction of the goals have the object in the air and the rest have the object on the ground. The state/goal space for both the object and the arm spans 10 cm in one controlled dimension and 13 cm in the other. The action space is a bounded interval in the two controlled dimensions.

Real World Visual Door: A Rethink Sawyer robot with a door on a table. The arm must pull the door open to a target angle. The agent controls the arm by commanding the velocity of the EE. Our controller commands actions at a rate of up to 10 Hz, with the scale of actions ranging up to 1 cm in magnitude. The underlying state and goal are the same as in Visual Door. Again, we do not reset the position of the hand or door at the end of each trajectory. We obtain images using a Kinect sensor, and our robotic control code can be found at https://github.com/mdalal2020/sawyer_control.git. The state/goal space for the environment is a 10x10x10 cm box. The action space is a bounded interval (in cm) in the x, y, and z dimensions. The door angle lies within a fixed range (in degrees).

Appendix D Additional Experiments

Section D.1 analyzes how well Skew-Fit can learn a maximum entropy goal distribution in isolation from the goal-reaching policy, using a near-perfect goal reacher. We study this question in the context of a simple 2-dimensional navigation environment. Section D.3 analyzes the performance of Skew-Fit when combined with RIG on a variety of simulated and real vision-based robot manipulation tasks.

Figure 6: (Left) The set of final states visited by our agent and MLE over the course of training. In contrast to MLE, our method quickly approaches a uniform distribution over the set of valid states. (Right) The entropy of the sample data distribution, which quickly reaches its maximum for Skew-Fit. The entropy was calculated via discretization onto a 60 by 60 grid.

D.1 Skew-Fit on Simplified RL Environment

We analyze the effect of Skew-Fit for learning a goal distribution in isolation of learning a goal reacher by studying an idealized example where the policy is a near-perfect goal-reaching policy.

The MDP is defined on a 2-by-2 unit square-shaped corridor (see Figure 6). At the beginning of an episode, the agent begins in the bottom-left corner and samples a goal from the goal distribution. The policy reaches the state that is closest to this goal and inside the corridor, giving us a state to add to our empirical distribution. We compare Skew-Fit to a goal sampling distribution that is only trained using maximum likelihood estimation (MLE).

As seen in Figure 6, Skew-Fit results in learning a high entropy, near-uniform distribution over the state space. In contrast, MLE only models the states that are explored by the initial noise of the policy, resulting in the policy only setting goals in and exploring the bottom left corner. These results empirically validate that naively using previous experience to set goals will not result in diverse exploration and that Skew-Fit results in a maximum-entropy goal-sampling distribution.

D.2 Modeling Visual Observations

We would like to use Skew-Fit to learn maximum-entropy distributions over complex, high-dimensional state spaces, where we cannot manually design these uniform distributions. The next set of experiments studies how Skew-Fit can be used to train a generative model to sample diverse images when trained on an imbalanced dataset. For these experiments, we use a simulated MuJoCo (Todorov et al., 2012) environment and a real environment, each of which consists of a 7 degree-of-freedom robot arm in front of a door that it can open. See Figure 3 for a visualization of the simulated and real-world door environments, and the Appendix for more details on the environment.

We generate a dataset of images from the environment by running a policy that samples actions uniformly at random. Such a policy represents a particularly challenging setting for standard VAE training methods: a policy that chooses random actions does not visit uniformly random states, but rather states that are heavily biased toward ones that resemble the initial state. In the door opening environment, this means that many of the samples have the door in the initial closed position, and only the robot’s arm moves. We then train two generative models on these datasets: one using Skew-Fit and another using MLE. For our generative model, we use the same generative model as in RIG, a variational autoencoder (Kingma & Welling, 2014). To estimate the likelihood of our data, we use Monte Carlo estimation and importance sampling to marginalize the latent variables. See Appendix Section E.2 for experimental details.
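One way to apply Skew-Fit as a dataset-balancing procedure, as in this experiment, is to resample the fixed training set with probability proportional to $q_\phi(s)^{\alpha}$ between epochs of otherwise standard VAE training. The sketch below uses PyTorch's WeightedRandomSampler for this; the `vae_log_prob` density estimator and the surrounding training loop are placeholders, not the paper's implementation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def skewed_dataloader(images, vae_log_prob, alpha=-1.0, batch_size=128):
    """Build a dataloader that resamples an imbalanced image dataset by q(s)^alpha.

    images: float tensor of shape (N, C, H, W).
    vae_log_prob: function returning an (N,)-shaped tensor of approximate log-densities.
    """
    with torch.no_grad():
        log_q = vae_log_prob(images)
    log_w = alpha * log_q
    weights = torch.softmax(log_w, dim=0)     # normalized skew weights (Equation 3)
    sampler = WeightedRandomSampler(weights, num_samples=len(images), replacement=True)
    return DataLoader(TensorDataset(images), batch_size=batch_size, sampler=sampler)

# Rare images (low density under the current VAE) are drawn more often, so standard
# MLE training on the resampled batches approximates fitting the skewed distribution.
```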

In this experiment, we do not train a goal-conditioned policy, and instead only study how Skew-Fit can be used to effectively “balance” this dataset. As can be seen in Figure 7 and Figure 1, the samples produced by the model trained with Skew-Fit generally have a much wider range of door angles, while the model trained with MLE only captures a single mode of the door, both for simulated and real-world images.

D.3 Skew-Fit with Learned Policies

2D navigation

We reproduce the 2D navigation experiment from Section D.1, and replace the oracle goal reacher with a goal-reaching policy that is trained simultaneously with the goal setter. The policy outputs velocities with a maximum speed of one. Evaluation goals are chosen uniformly over the valid states. In Figure 8(a), we can see that a policy trained with a goal distribution learned by Skew-Fit consistently learns to reach all goals, whereas a goal distribution trained with MLE results in a policy that fails to reach states far from the starting position.

Sensitivity Analysis

We study the sensitivity of Skew-Fit to the hyperparameter $\alpha$ by testing a range of values of $\alpha$ on the Visual Door and Visual Pusher tasks. The results are included in Figure 11 and show that our method is relatively robust to different values of $\alpha$, particularly on the more challenging Visual Pusher task.

Prioritization Ablation

We test Skew-Fit against a concurrent method (Zhao & Tresp, 2019) that suggests a different strategy, in which the density-based weight $q_{\phi_t}(s)^{\alpha}$ is replaced with a rank-based weight. The results are shown in Figure 9. Except on the Visual Door environment, which is the easiest of the three simulated tasks, our method performs significantly better than rank-based weighting, illustrating the effectiveness of Skew-Fit.

(a) Skew-Fit
(b) MLE
Figure 7: Samples from a generative model when trained with (a) Skew-Fit and (b) maximum likelihood estimation, trained on images from the simulated door opening task. The models are trained on a dataset collected by executing a random policy in the environment, which results in mostly images with a closed door and only occasional images with the door open. Note that the Skew-Fit samples are substantially more diverse, meaning that if this model were used to sample goals, it would encourage the agent to practice opening the door more frequently.
(a)
(b)
Figure 8: (a) Comparison of Skew-Fit vs MLE goal sampling on final distance to goal on RL version of the pointmass environment. Skew-Fit consistently learns to solve the task, while MLE often fails. (b) Heatmaps of final distance to each possible goal location for Skew-Fit and MLE. Skew-Fit learns a good policy over the entire state space, but MLE performs poorly for states far away from the starting position.
(a)
(b)
(c)
Figure 9: Comparison of Skew-Fit and rank-based prioritization as described in Zhao & Tresp (2019). We demonstrate that, with the exception of the Visual Door environment, our method significantly outperforms this concurrent work.
Figure 10: Cumulative total pickups during exploration for each method. RIG + Skew-Fit quickly learns to pick up the object, while the baselines fail to pay attention to the object.

Appendix E Implementation Details

E.1 2D Navigation Experiments

We initialize the VAE to only output points in the bottom left corner of the environment. Both the encoder and decoder have ReLU hidden activations, 2 hidden layers with 32 units, and no output activations. The VAE has a latent dimension of and a Gaussian decoder trained with mean-squared error loss, batch size of , and epochs per iteration. For Skew-Fit hyperparameters, and .

For the RL version of this task, the VAE was trained in the same way. The RL hyperparameters are listed in Table 1.

(a)
(b)
Figure 11: We sweep different values of $\alpha$ on (a) Visual Door and (b) Visual Pusher. Skew-Fit improves final performance only marginally on the Visual Door task, and can even degrade performance if $\alpha$ is large in magnitude. On the more challenging Visual Pusher task, we see that Skew-Fit consistently helps and halves the final distance.

E.2 Vision-Based Continuous Control Experiments

For our underlying RL algorithm, we use a modified version of soft actor critic (SAC) with automated entropy tuning Haarnoja et al. (2018) and twin Q-functions Fujimoto et al. (2018). This is in contrast to the original RIG paper Nair et al. (2018), which used TD3 Fujimoto et al. (2018). We found that maximum entropy policies generally improved the performance of RIG, and that we did not need to add noise on top of the stochastic policy's own noise. For our RL network architectures and training scheme, we use fully connected networks for the policy, Q-function, and value networks, with two hidden layers of size and each. We also delay training any of these networks for time steps in order to collect sufficient data for the replay buffer, as well as to ensure the latent space of the VAE is relatively stable (since we train the VAE online in this setting). As in RIG, we train goal-conditioned value functions Schaul et al. (2015) using hindsight experience replay Andrychowicz et al. (2017), relabeling exploration goals with goals sampled from the VAE prior and with future states in the trajectory.
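To illustrate the relabeling step concretely, here is a minimal sketch, not the authors' code, of how a replayed transition's goal might be replaced with a VAE-prior sample or a future state from the same trajectory; the relabeling probabilities and the helper methods sample_prior_goal and encode are assumed for illustration.

    import random

    # Placeholder relabeling probabilities; the actual split is a tuned hyperparameter.
    P_VAE_PRIOR = 0.5   # relabel with a goal sampled from the VAE prior
    P_FUTURE = 0.4      # relabel with a future state from the same trajectory
    # with the remaining probability, keep the original exploration goal

    def relabel_goal(transition, trajectory, t, vae):
        # Return a (possibly relabeled) latent goal for one replayed transition.
        u = random.random()
        if u < P_VAE_PRIOR:
            return vae.sample_prior_goal()                 # z_g drawn from the prior p(z)
        elif u < P_VAE_PRIOR + P_FUTURE:
            future_state = random.choice(trajectory[t:])   # hindsight: a later state in the episode
            return vae.encode(future_state)                # latent goal is the encoding of that state
        return transition.goal                             # keep the original goal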

In our experiments, we use an image size of 48x48. For our VAE architecture, we use a modified version of the architecture used in the original RIG paper Nair et al. (2018). Our VAE has three convolutional layers with kernel sizes 5x5, 3x3, and 3x3; output filters 16, 32, and 64; and strides 3, 2, and 2. We then have a fully connected layer with the latent dimension number of units, and then reverse the architecture with de-convolution layers. We vary the latent dimension of the VAE, the term of the VAE, and the term for Skew-Fit based on the environment. Additionally, we vary the training schedule of the VAE based on the environment. See the table at the end of the appendix for more details. Our VAE has a Gaussian decoder with identity variance, meaning that we train the decoder with a mean-squared error loss.
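As a rough sketch of an encoder with this convolutional structure (assuming PyTorch, 3-channel 48x48 inputs, and a placeholder latent dimension, since the actual value varies per environment):

    import torch.nn as nn

    def make_conv_encoder(latent_dim=16):  # latent_dim is a placeholder
        # Encoder for 48x48 RGB images with the stated kernels, filters, and strides.
        return nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=3), nn.ReLU(),   # 48x48 -> 15x15
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),  # 15x15 -> 7x7
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),  # 7x7 -> 3x3
            nn.Flatten(),
            nn.Linear(64 * 3 * 3, 2 * latent_dim),  # outputs mean and log-variance of q(z|x)
        )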

We estimate the density under the VAE by using a sample-wise approximation to the marginal likelihood, computed with importance sampling:

$$p_\psi(x) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p_\psi(x \mid z_i)\, p(z_i)}{q_\phi(z_i \mid x)}, \qquad z_i \sim q_\phi(z \mid x),$$

where $q_\phi(z \mid x)$ is the encoder, $p_\psi(x \mid z)$ is the decoder, and $p(z)$ is the prior, which in this case is a unit Gaussian. We found that sampling latents for estimating the density worked well in practice.
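A minimal sketch of this estimator, assuming PyTorch and a VAE exposing encode and decode methods (names are illustrative); it returns the log-density up to an additive constant, which is sufficient for relative prioritization:

    import torch

    def estimate_log_density(vae, x, n_latents=10):  # n_latents is a placeholder sample count
        # Importance-sampled estimate of log p(x) under the VAE, up to a constant.
        mu, logvar = vae.encode(x)                    # parameters of q(z | x)
        std = (0.5 * logvar).exp()
        log_weights = []
        for _ in range(n_latents):
            z = mu + std * torch.randn_like(std)      # z ~ q(z | x)
            recon = vae.decode(z)                     # mean of p(x | z); identity variance
            log_p_x_given_z = -0.5 * ((x - recon) ** 2).sum(dim=-1)
            log_p_z = -0.5 * (z ** 2).sum(dim=-1)     # unit Gaussian prior
            log_q_z = -0.5 * (((z - mu) / std) ** 2).sum(dim=-1) - 0.5 * logvar.sum(dim=-1)
            log_weights.append(log_p_x_given_z + log_p_z - log_q_z)
        stacked = torch.stack(log_weights)            # shape: (n_latents, batch)
        return torch.logsumexp(stacked, dim=0) - torch.log(torch.tensor(float(n_latents)))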

When training the VAE alongside RL, we found the following two schedules to be effective for different environments:

  1. For first steps: Train VAE using standard MLE training every time steps for batches. After that, train VAE using Skew-Fit every time steps for batches.

  2. For first steps: Train VAE using standard MLE training every time steps for batches. For the next steps, train VAE using Skew-Fit every steps for batches. After that, train VAE using Skew-Fit every time steps for batches.

We found that initially training the VAE without Skew-Fit improved the stability of the algorithm, because density estimates under the VAE are inaccurate and constantly changing during the early phases of training, so it makes little sense to use those estimates to prioritize goals early on. Instead, we simply train using MLE for the first timesteps, and after that we perform Skew-Fit according to the VAE schedules above. Table 2 lists the hyper-parameters that were shared across the continuous control experiments. Table 3 lists hyper-parameters specific to each environment. Additionally, Algorithm 2 shows the combined RIG + Skew-Fit algorithm.
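Schematically, this switch can be implemented as in the following sketch; the step-count threshold and the buffer/sampler interfaces are placeholders, since the exact values and code structure are environment-specific.

    # Placeholder threshold; the actual number of MLE-only steps differs per environment.
    MLE_PRETRAIN_STEPS = 5000

    def train_vae_step(vae, replay_buffer, skewed_sampler, env_step):
        # Train the VAE with plain MLE early on, then switch to Skew-Fit sampling.
        if env_step < MLE_PRETRAIN_STEPS:
            batch = replay_buffer.sample_uniform()    # standard MLE training data
        else:
            batch = skewed_sampler.sample()           # data drawn from the skewed distribution
        vae.train_on_batch(batch)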

Figure 12: Example reached goals by Skew-Fit and RIG. The first column of each environment section specifies the target goal while the second and third columns show reached goals by Skew-Fit and RIG. Both methods learn how to reach goals close to the initial position, but only Skew-Fit learns to reach the more difficult goals.
Figure 13: Proposed goals from the VAE for RIG and with Skew-Fit on the Visual Pickup, Visual Pusher, and Visual Door environments. Standard RIG produces goals where the door is closed and the object and puck is in the same position, while RIG + Skew-Fit proposes goals with varied puck positions, occasional object goals in the air, and both open and closed door angles.
Figure 14: Proposed goals from the VAE for RIG (left) and with RIG + Skew-Fit (right) on the Real World Visual Door environment. Standard RIG produces goals where the door is closed while RIG + Skew-Fit proposes goals with both open and closed door angles.
Hyper-parameter Value
Algorithm TD3 Fujimoto et al. (2018) (we expect similar performance had we used SAC)
# training batches per time step
Q network hidden sizes
Policy network hidden sizes
Q network and policy activation ReLU
Exploration Noise None
RL Batch Size
Discount Factor
Path length
Reward Scaling
Number of steps per epoch
Table 1: Hyper-parameters used for the 2D RL experiment (Figure 8(a)).
Hyper-parameter Value Comments
# training batches per time step Marginal improvements after
Exploration Noise None (SAC policy is stochastic) Did not tune
RL Batch Size Smaller batch sizes work as well
VAE Batch Size Did not tune
Discount Factor Did not tune
Reward Scaling Did not tune
Path length Did not tune
Number of Latents for Estimating Density () Marginal improvements beyond
Table 2: General hyper-parameters used for all continuous control experiments.
Hyper-parameter Visual Pusher Visual Door Visual Pickup Real World Visual Door
Path Length
for -VAE
Latent Dimension Size
for Skew-Fit
VAE Training Schedule
Sample Goals From
Table 3: Environment specific hyper-parameters
1:VAE encoder , VAE decoder , policy , goal-conditioned value function , , VAE Training Schedule.
2:for  episodes do
3:    Sample latent goal from which is the prior or .
4:    Sample initial state .
5:    for  steps do
6:        Get action .
7:        Get next state .
8:        Store into replay buffer .
9:        Sample transition .
10:        Encode .
11:        (Probability ) replace with .
12:        Compute new reward .
13:        Minimize Bellman Error using .
14:    end for
15:    for  steps do
16:        for  steps do
17:            Sample future state , .
18:            Store into .
19:        end for
20:    end for
21:    Construct skewed distribution using data from using Equation 3
22:    if  then
23:        Train -VAE on data sampled uniformly from according to VAE Training Schedule
24:    else
25:        Train -VAE on data sampled from using according to VAE Training Schedule
26:    end if
27:end for
Algorithm 2 Skew-Fit with RIG
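To make the skewing step on line 21 concrete, the following is a minimal sketch of constructing the skewed sampling weights from estimated log-densities (weights proportional to the density raised to a negative exponent, normalized over the replay buffer) and sampling a VAE training batch from them; the exponent value and buffer interface are illustrative, not the released implementation.

    import numpy as np

    ALPHA = -1.0  # placeholder skew exponent; Skew-Fit uses a negative exponent in [-1, 0)

    def skewed_weights(log_densities, alpha=ALPHA):
        # Weights proportional to p(x)^alpha, computed from log-densities for numerical stability.
        log_w = alpha * np.asarray(log_densities)
        log_w -= log_w.max()                # avoid overflow before exponentiating
        w = np.exp(log_w)
        return w / w.sum()                  # normalized sampling distribution over buffer states

    def sample_skewed_batch(buffer_states, log_densities, batch_size, rng=np.random):
        # Sample VAE training data according to the skewed distribution.
        idx = rng.choice(len(buffer_states), size=batch_size, p=skewed_weights(log_densities))
        return [buffer_states[i] for i in idx]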