Unsupervised Curricula for Visual Meta-Reinforcement Learning
In principle, meta-reinforcement learning algorithms leverage experience across many tasks to learn fast reinforcement learning (RL) strategies that transfer to similar tasks. However, current meta-RL approaches rely on manually-defined distributions of training tasks, and hand-crafting these task distributions can be challenging and time-consuming. Can “useful” pre-training tasks be discovered in an unsupervised manner? We develop an unsupervised algorithm for inducing an adaptive meta-training task distribution, i.e. an automatic curriculum, by modeling unsupervised interaction in a visual environment. The task distribution is scaffolded by a parametric density model of the meta-learner’s trajectory distribution. We formulate unsupervised meta-RL as information maximization between a latent task variable and the meta-learner’s data distribution, and describe a practical instantiation which alternates between integration of recent experience into the task distribution and meta-learning of the updated tasks. Repeating this procedure leads to iterative reorganization such that the curriculum adapts as the meta-learner’s data distribution shifts. In particular, we show how discriminative clustering for visual representation can support trajectory-level task acquisition and exploration in domains with pixel observations, avoiding pitfalls of alternatives. In experiments on vision-based navigation and manipulation domains, we show that the algorithm allows for unsupervised meta-learning that transfers to downstream tasks specified by hand-crafted reward functions and serves as pre-training for more efficient supervised meta-learning of test task distributions.
1 Introduction
The discrepancy between animals and learning machines in their capacity to gracefully adapt and generalize is a central issue in artificial intelligence research. The simple nematode C. elegans is capable of adapting foraging strategies to varying scenarios (Calhoun et al., 2014), while many higher animals are driven to acquire reusable behaviors even without extrinsic task-specific rewards (White, 1959; Piaget, 1954). It is unlikely that we can build machines as adaptive as even the simplest of animals by exhaustively specifying shaped rewards or demonstrations across all possible environments and tasks. This has inspired work in reward-free learning (Hastie et al., 2009), intrinsic motivation (Singh et al., 2005), multi-task learning (Caruana, 1997), meta-learning (Schmidhuber, 1987), and continual learning (Thrun and Pratt, 1998).
An important aspect of generalization is the capacity to share and transfer abilities between related tasks. In reinforcement learning (RL), a common strategy for multi-task learning is conditioning the policy on side-information related to the task. For instance, contextual policies (Schaul et al., 2015) are conditioned on a task description (e.g. a goal) that is meant to modulate the strategy enacted by the policy. Meta-learning of reinforcement learning (meta-RL) is yet more general as it places the burden of inferring the task on the learner itself, such that task descriptions can take a wider range of forms, the most general being an MDP. In principle, meta-RL requires an agent to distill previous experience into fast and effective adaptation strategies for new, related tasks. However, the meta-RL framework by itself does not prescribe where this experience should come from; typically, meta-RL algorithms rely on being provided fixed, hand-specified task distributions, which can be tedious to specify for simple behaviors and intractable to design for complex ones (Hadfield-Menell et al., 2017). These issues raise the question of whether “useful” task distributions for meta-RL can be generated automatically.
In this work, we seek a procedure through which an agent in an environment with visual observations can automatically acquire useful (i.e. utility maximizing) behaviors, as well as how and when to apply them – in effect allowing for unsupervised pre-training in visual environments. Two key aspects of this goal are: 1) learning to operationalize strategies so as to adapt to new tasks, i.e. meta-learning, and 2) unsupervised learning and exploration in the absence of explicitly specified tasks, i.e. skill acquisition without supervised reward functions. These aspects interact insofar as the former implicitly relies on a task curriculum, while the latter is most effective when compelled by what the learner can and cannot do. Prior work has offered a pipelined approach for unsupervised meta-RL consisting of unsupervised skill discovery followed by meta-learning of discovered skills, experimenting mainly in environments that expose low-dimensional ground truth state Gupta et al. (2018a). Yet, the aforementioned relation between skill acquisition and meta-learning suggests that they should not be treated separately.
Here, we argue for closing the loop between skill acquisition and meta-learning in order to induce an adaptive task distribution. Such co-adaptation introduces a number of challenges related to the stability of learning and exploration. Most recent unsupervised skill acquisition approaches optimize for the discriminability of induced modes of behavior (i.e. skills), typically expressing the discovery problem as a cooperative game between a policy and a learned reward function Gregor et al. (2016); Eysenbach et al. (2019); Achiam et al. (2018). However, relying solely on discriminability becomes problematic in environments with high-dimensional (image-based) observation spaces as it results in an issue akin to mode-collapse in the task space. This problem is further complicated in the setting we propose to study, wherein the policy data distribution is that of a meta-learner rather than a contextual policy. We will see that this can be ameliorated by specifying a hybrid discriminative-generative model for parameterizing the task distribution.
The main contribution of this paper is an approach for inducing a task curriculum for unsupervised meta-RL in a manner that scales to domains with pixel observations. Through the lens of information maximization, we frame our unsupervised meta-RL approach as variational expectation-maximization (EM), in which the E-step corresponds to fitting a task distribution to a meta-learner’s behavior and the M-step to meta-RL on the current task distribution with reinforcement for both skill acquisition and exploration. For the E-step, we show how deep discriminative clustering allows for trajectory-level representations suitable for learning diverse skills from pixel observations. Through experiments in vision-based navigation and robotic control domains, we demonstrate that the approach i) enables an unsupervised meta-learner to discover and meta-learn skills that transfer to downstream tasks specified by human-provided reward functions, and ii) can serve as pre-training for more efficient supervised meta-reinforcement learning of downstream task distributions.
2 Preliminaries: Meta-Reinforcement Learning
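Supervised meta-RL optimizes an RL algorithm $f_\theta$ for performance on a hand-crafted distribution of tasks $p(\mathcal{T})$, where $f_\theta$ might take the form of a recurrent neural network (RNN) implementing a learning algorithm (Duan et al., 2016; Wang et al., 2016), or a function implementing a gradient-based learning algorithm (Finn et al., 2017). Tasks are Markov decision processes (MDPs) $\mathcal{T}_i = (\mathcal{S}, \mathcal{A}, r_i, P, \gamma, \rho, T)$ consisting of state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $r_i$, probabilistic transition dynamics $P(s_{t+1} \mid s_t, a_t)$, discount factor $\gamma$, initial state distribution $\rho(s_1)$, and finite horizon $T$. Often, and in our setting, tasks are assumed to share $\mathcal{S}$ and $\mathcal{A}$. For a given $\mathcal{T} \sim p(\mathcal{T})$, $f_\theta$ learns a policy $\pi_\theta(a \mid s, \mathcal{D}_{\mathcal{T}})$ conditioned on task-specific experience $\mathcal{D}_{\mathcal{T}}$. Thus, a meta-RL algorithm optimizes $f_\theta$ for expected performance of $\pi_\theta(a \mid s, \mathcal{D}_{\mathcal{T}})$ over $p(\mathcal{T})$, such that it can generalize to unseen test tasks also sampled from $p(\mathcal{T})$.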
For example, RL$^2$ (Duan et al., 2016; Wang et al., 2016) chooses $f_\theta$ to be an RNN with weights $\theta$. For a given task $\mathcal{T}$, $f_\theta$ hones $\pi_\theta(a \mid s, \mathcal{D}_{\mathcal{T}})$ as it recurrently ingests $\mathcal{D}_{\mathcal{T}} = (s_1, a_1, r_1, \ldots)$, the sequence of states, actions, and rewards produced via interaction within the MDP. Crucially, the same task is seen several times, and the hidden state is not reset until the next task. The loss is the negative discounted return obtained by $\pi_\theta$ across episodes of the same task, and $f_\theta$ can be optimized via standard policy gradient methods for RL, backpropagating gradients through time and across episode boundaries.
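The hidden-state convention described above (reset once per task, carried across episode boundaries) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the implementation used in the paper: the hidden size `HIDDEN`, the helper names `init_weights`, `rnn_step`, and `run_trial`, the two-action toy task with no observations, and the use of an undiscounted return. No learning is performed; the sketch only shows the rollout structure.

```python
import math
import random

HIDDEN = 4  # hypothetical hidden size


def init_weights(seed=0):
    """Random weights for a tiny tanh RNN policy (illustrative only)."""
    rng = random.Random(seed)
    return {
        "in": [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(HIDDEN)],
        "rec": [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)],
        "out": [rng.uniform(-1, 1) for _ in range(HIDDEN)],
    }


def rnn_step(hidden, prev_action, prev_reward, w):
    """Update the hidden state from the previous action, reward, and a bias."""
    x = [float(prev_action), prev_reward, 1.0]
    return [
        math.tanh(
            sum(a * b for a, b in zip(w["in"][i], x))
            + sum(a * b for a, b in zip(w["rec"][i], hidden))
        )
        for i in range(HIDDEN)
    ]


def run_trial(reward_fn, w, episodes=4, horizon=5, seed=0):
    """Roll out several episodes of ONE task without resetting the hidden state."""
    rng = random.Random(seed)
    hidden = [0.0] * HIDDEN  # reset once per trial (i.e. per task)
    prev_action, prev_reward = 0, 0.0
    boundary_states, total_return = [], 0.0
    for _ in range(episodes):
        boundary_states.append(list(hidden))  # state carried into this episode
        for _ in range(horizon):
            hidden = rnn_step(hidden, prev_action, prev_reward, w)
            logit = sum(a * b for a, b in zip(w["out"], hidden))
            p_one = 1.0 / (1.0 + math.exp(-logit))
            prev_action = 1 if rng.random() < p_one else 0
            prev_reward = reward_fn(prev_action)
            total_return += prev_reward  # undiscounted, for simplicity
    return total_return, boundary_states
```

Calling `run_trial(lambda a: float(a == 1), init_weights())` returns the trial return together with the hidden state at the start of each episode; only the first of these is all zeros, which is what lets the policy accumulate task information across episodes.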
Unsupervised meta-RL aims to break the reliance of the meta-learner on an explicit, upfront specification of $p(\mathcal{T})$. Following Gupta et al. (2018a), we consider a controlled Markov process (CMP) $\mathcal{C} = (\mathcal{S}, \mathcal{A}, P, \gamma, \rho)$, which is an MDP without a reward function. We are interested in the problem of learning an RL algorithm $f_\theta$ via unsupervised interaction within the CMP such that once a reward function $r$ is specified at test-time, $f_\theta$ can be readily applied to the resulting MDP to efficiently maximize the expected discounted return.
Prior work (Gupta et al., 2018a) pipelines skill acquisition and meta-learning by pairing an unsupervised RL algorithm, DIAYN (Eysenbach et al., 2019), with a meta-learning algorithm, MAML (Finn et al., 2017): first, a contextual policy is used to discover skills in the CMP, yielding a finite set of learned reward functions distributed as $p(r)$; then, the CMP is combined with a frozen $p(r)$ to yield $p(\mathcal{T})$, which is fed to MAML to meta-learn $f_\theta$. In the next section, we describe how we can generalize and improve upon this pipelined approach by jointly performing skill acquisition as the meta-learner learns and explores in the environment.
3 Curricula for Unsupervised Meta-Reinforcement Learning
Meta-learning is intended to prepare an agent to efficiently solve new tasks related to those seen previously. To this end, the meta-RL agent must balance 1) exploring the environment to infer which task it should solve, and 2) visiting states that maximize reward under the inferred task. The duty of unsupervised meta-RL is thus to present the meta-learner with tasks that allow it to practice task inference and execution, without the need for human-specified task distributions. Ideally, the task distribution should exhibit both structure and diversity. That is, the tasks should be distinguishable and not excessively challenging so that a developing meta-learner can infer and execute the right skill, but, for the sake of generalization, they should also encompass a diverse range of associated stimuli and rewards, including some beyond the current scope of the meta-learner. Our aim is to strike this balance by inducing an adaptive task distribution.
With this motivation, we develop an algorithm for unsupervised meta-reinforcement learning in visual environments that constructs a task distribution without supervision. The task distribution is derived from a latent-variable density model of the meta-learner’s cumulative behavior, with exploration based on the density model driving the evolution of the task distribution. As depicted in Figure 1, learning proceeds by alternating between two steps: organizing experiential data (i.e., trajectories generated by the meta-learner) by modeling it with a mixture of latent components forming the basis of “skills”, and meta-reinforcement learning by treating these skills as a training task distribution.
Learning the task distribution in a data-driven manner ensures that tasks are feasible in the environment. While the induced task distribution is in no way guaranteed to align with test task distributions, it may yet require an implicit understanding of structure in the environment. This can indeed be seen from our visualizations in section 5, which demonstrate that acquired tasks show useful structure, though in some settings this structure is easier to meta-learn than others. In the following, we formalize our approach, CARML, through the lens of information maximization and describe a concrete instantiation that scales to the vision-based environments considered in section 5.
3.1 An Overview of CARML
We begin from the principle of information maximization (IM), which has been applied across unsupervised representation learning (Bell and Sejnowski, 1995; Barber and Agakov, 2004; Oord et al., 2018) and reinforcement learning (Mohamed and Rezende, 2015; Gregor et al., 2016) for organization of data involving latent variables. In what follows, we organize data from our policy by maximizing the mutual information (MI) between state trajectories $\tau := (s_1, \ldots, s_T)$ and a latent task variable $z$. This objective provides a principled manner of trading off structure and diversity: from $I(\tau; z) = H(\tau) - H(\tau \mid z)$, we see that $H(\tau)$ promotes coverage in policy data space (i.e. diversity) while $-H(\tau \mid z)$ encourages a lack of diversity under each task (i.e. structure that eases task inference).
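The structure-versus-diversity reading of mutual information can be checked numerically on a small discrete example. The sketch below assumes the standard decomposition $I(\tau; z) = H(\tau) - H(\tau \mid z)$ and a hypothetical finite set of trajectories indexed by columns of a joint probability table; the function names are illustrative and not taken from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def mutual_information(joint):
    """joint[z][k] = p(z, tau_k). Returns (I, H(tau), H(tau | z))."""
    p_z = [sum(row) for row in joint]
    n = len(joint[0])
    p_tau = [sum(row[k] for row in joint) for k in range(n)]
    h_tau = entropy(p_tau)
    h_tau_given_z = sum(
        pz * entropy([x / pz for x in row])
        for pz, row in zip(p_z, joint)
        if pz > 0
    )
    return h_tau - h_tau_given_z, h_tau, h_tau_given_z


# Two tasks, four trajectories. Both tables cover all trajectories equally,
# so H(tau) = log 4 in each case; only the per-task structure differs.
structured = [[0.25, 0.25, 0.0, 0.0], [0.0, 0.0, 0.25, 0.25]]
unstructured = [[0.125] * 4, [0.125] * 4]

mi_structured, _, _ = mutual_information(structured)
mi_unstructured, _, _ = mutual_information(unstructured)
```

In the structured table each task owns a disjoint half of the trajectories, so the conditional entropy drops and the mutual information equals $\log 2$; in the unstructured table the task carries no information about the trajectory and the mutual information is zero.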
We approach maximizing $I(\tau; z)$ exhibited by the meta-learner $f_\theta$ via variational EM (Barber and Agakov, 2004), introducing a variational distribution $q_\phi$ that can intuitively be viewed as a task scaffold for the meta-learner. In the E-step, we fit $q_\phi$ to a reservoir of trajectories produced by $f_\theta$, re-organizing the cumulative experience. In turn, $q_\phi$ gives rise to a task distribution $p(\mathcal{T})$: each realization of the latent variable $z$ induces a reward function $r_z(s)$, which we combine with the CMP $\mathcal{C}$ to produce an MDP $\mathcal{T}$ (Algorithm 1, Line 8). In the M-step, $f_\theta$ meta-learns the task distribution $p(\mathcal{T})$. Repeating these steps forms a curriculum in which the task distribution and meta-learner co-adapt: each M-step adapts the meta-learner to the updated task distribution, while each E-step updates the task scaffold based on the data collected during meta-training. Pseudocode for our method is presented in Algorithm 1.
3.2 E-Step: Task Acquisition
The purpose of the E-step is to update the task distribution by integrating changes in the meta-learner’s data distribution with previous experience, thereby allowing for re-organization of the task scaffold. This data is from the post-update policy, meaning that it comes from a policy $\pi_\theta(a \mid s, \mathcal{D}_{\mathcal{T}})$ conditioned on data collected by the meta-learner for the respective task. In the following, we abuse notation by writing $\pi_\theta(s \mid z)$, conditioning on the latent task variable $z$ rather than the task experience $\mathcal{D}_{\mathcal{T}}$.
The general strategy followed by recent approaches for skill discovery based on IM is to lower bound the objective by introducing a variational posterior in the form of a classifier. In these approaches, the E-step amounts to updating the classifier to discriminate between data produced by different skills as much as possible. A potential failure mode of such an approach is an issue akin to mode-collapse in the task distribution, wherein the policy drops modes of behavior to favor easily discriminable trajectories, resulting in a lack of diversity in the task distribution and no incentive for exploration; this is especially problematic when considering high-dimensional observations. Instead, here we derive a generative variant, which allows us to account for explicitly capturing modes of behavior (by optimizing for likelihood), as well as a direct mechanism for exploration.
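We introduce a variational distribution $q_\phi$, which could be e.g. a (deep) mixture model with discrete $z$ or a variational autoencoder (VAE) (Kingma and Welling, 2014) with continuous $z$, lower-bounding the objective:

$$I(\tau; z) = -\sum_{\tau} \pi_\theta(\tau) \log \pi_\theta(\tau) + \sum_{\tau, z} \pi_\theta(\tau, z) \log \pi_\theta(\tau \mid z) \tag{1}$$

$$\geq -\sum_{\tau} \pi_\theta(\tau) \log \pi_\theta(\tau) + \sum_{\tau, z} \pi_\theta(\tau \mid z)\, q_\phi(z) \log q_\phi(\tau \mid z) \tag{2}$$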
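The E-step corresponds to optimizing Eq. 2 with respect to $\phi$, and thus amounts to fitting $q_\phi$ to a reservoir of trajectories $\mathcal{D}$ produced by $\pi_\theta$:

$$\max_{\phi} \; \mathbb{E}_{z \sim q_\phi(z),\, \tau \sim \mathcal{D}} \left[ \log q_\phi(\tau \mid z) \right] \tag{3}$$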
What remains is to determine the form of $q_\phi$. We choose the variational distribution to be a state-level mixture density model $q_\phi(s, z) = q_\phi(s \mid z)\, q_\phi(z)$. Despite using a state-level generative model, we can treat $z$ as a trajectory-level latent by computing the trajectory-level likelihood as the factorized product of state likelihoods (Algorithm 2, Line 4). This is useful for obtaining trajectory-level tasks; in the M-step (section 3.3), we map samples from $q_\phi(z)$ to reward functions to define tasks for meta-learning.
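To make the factorized trajectory-level likelihood concrete, the sketch below computes a posterior over mixture components for a whole trajectory by summing per-state log-likelihoods. It is a minimal illustration under stated assumptions: states are stood in for by scalar features, each component is a one-dimensional Gaussian, and the names `log_gauss` and `trajectory_posterior` are hypothetical. The paper's model operates on learned visual features rather than raw scalars.

```python
import math


def log_gauss(x, mean, std):
    """Log-density of a 1-D Gaussian (stand-in for a state-level component)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def trajectory_posterior(states, means, stds, priors):
    """Posterior over components for a trajectory, proportional to
    prior(z) * prod_t q(s_t | z), computed in log space for stability."""
    log_joint = [
        math.log(p) + sum(log_gauss(s, m, sd) for s in states)
        for m, sd, p in zip(means, stds, priors)
    ]
    top = max(log_joint)
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint))
    return [math.exp(v - log_norm) for v in log_joint]


# A trajectory whose state features sit near the first component's mean.
posterior = trajectory_posterior(
    states=[0.1, -0.2, 0.3, 0.0],
    means=[0.0, 4.0],
    stds=[1.0, 1.0],
    priors=[0.5, 0.5],
)
```

Because the log-likelihoods add across time steps, evidence accumulates over the trajectory: a sequence of states that each only mildly favor one component can still yield a sharply peaked trajectory-level posterior.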
Modeling Trajectories of Pixel Observations. While models like the variational autoencoder have been used in related settings (Nair et al., 2018), a basic issue is that optimizing for reconstruction treats all pixels equally. We, rather, will tolerate lossy representations as long as they capture discriminative features useful for stimulus-reward association. Drawing inspiration from recent work on unsupervised feature learning by clustering Bojanowski and Joulin (2017); Caron et al. (2018), we propose to fit the trajectory-level mixture model via discriminative clustering, striking a balance between discriminative and generative approaches.
We adopt the optimization scheme of DeepCluster Caron et al. (2018), which alternates between i) clustering representations to obtain pseudo-labels and ii) updating the representation by supervised learning of pseudo-labels. In particular, we derive a trajectory-level variant (Algorithm 2) by forcing the responsibilities of all observations in a trajectory to be the same (see Appendix A.1 for a derivation), leading to state-level visual representations optimized with trajectory-level supervision.
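The effect of tying all observations in a trajectory to one responsibility can be seen in a toy alternation. The sketch below is not DeepCluster and not Algorithm 2: it swaps the learned visual representation for fixed scalar features and replaces the supervised representation update with a simple refit of component means, so it is closer to a trajectory-level variant of hard k-means. The function name `fit_trajectory_clusters` and the data are hypothetical.

```python
def fit_trajectory_clusters(trajectories, means, iters=10):
    """Alternate between (i) one pseudo-label per trajectory and
    (ii) refitting each component on all states of its trajectories."""
    means = list(means)
    labels = []
    for _ in range(iters):
        # (i) Every state in a trajectory shares the trajectory's label:
        # pick the component with the least total squared error.
        labels = [
            min(range(len(means)), key=lambda k: sum((s - means[k]) ** 2 for s in traj))
            for traj in trajectories
        ]
        # (ii) Update each component from the states assigned to it.
        for k in range(len(means)):
            members = [
                s for traj, lab in zip(trajectories, labels) if lab == k for s in traj
            ]
            if members:
                means[k] = sum(members) / len(members)
    return means, labels


trajectories = [
    [0.0, 0.2, 0.1],
    [0.1, -0.1, 3.0],  # the last state alone would look like the other cluster
    [5.0, 5.2, 4.9],
    [4.8, 5.1, 2.0],
]
fitted_means, labels = fit_trajectory_clusters(trajectories, means=[1.0, 4.0])
```

In this example the state `3.0` is closer to the second component on its own, yet it is labeled with the first component because the rest of its trajectory belongs there. This is the sense in which state-level features receive trajectory-level supervision.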
The conditional independence assumption in Algorithm 2 is a simplification insofar as it discards the order of states in a trajectory. However, if the dynamics exhibit continuity and causality, the visual representation might yet capture temporal structure, since, for example, attaining certain observations might imply certain antecedent subtrajectories. We hypothesize that a state-level model can regulate issues of over-expressive sequence encoders, which have been found to lead to skills with undesirable attention to details in dynamics (Achiam et al., 2018). As we will see in section 5, learning representations under this assumption still allows for learning visual features that capture trajectory-level structure.
3.3 M-Step: Meta-Learning
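Using the task scaffold updated via the E-step, we meta-learn $f_\theta$ in the M-step so that $\pi_\theta$ can be quickly adapted to tasks drawn from the task scaffold. To define the task distribution, we must specify a form for the reward functions $r_z(s)$. To allow for state-conditioned Markovian rewards rather than non-Markovian trajectory-level rewards, we lower-bound the trajectory-level MI objective:

$$I(\tau; z) = \frac{1}{T} \sum_{t=1}^{T} \big[ H(z) - H(z \mid s_1, \ldots, s_T) \big] \geq \frac{1}{T} \sum_{t=1}^{T} \big[ H(z) - H(z \mid s_t) \big] \tag{4}$$

$$\geq \mathbb{E}_{z \sim q_\phi(z),\, s \sim \pi_\theta(s \mid z)} \left[ \log q_\phi(s \mid z) - \log \pi_\theta(s) \right] \tag{5}$$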
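We would like to optimize the meta-learner under the variational objective in Eq. 5, but optimizing the second term, the policy’s state entropy, is in general intractable. Thus, we make the simplifying assumption that the fitted variational marginal distribution matches that of the policy:

$$\max_{\theta} \; \mathbb{E}_{z \sim q_\phi(z),\, s \sim \pi_\theta(s \mid z)} \left[ \log q_\phi(s \mid z) - \log q_\phi(s) \right] \tag{6}$$

$$= \max_{\theta} \; I(\pi_\theta(s); q_\phi(z)) - D_{\mathrm{KL}}\big(\pi_\theta(s \mid z) \,\|\, q_\phi(s \mid z)\big) + D_{\mathrm{KL}}\big(\pi_\theta(s) \,\|\, q_\phi(s)\big) \tag{7}$$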
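Optimizing Eq. 6 amounts to maximizing the reward $r_z(s) = \log q_\phi(s \mid z) - \log q_\phi(s)$. As shown in Eq. 7, this corresponds to information maximization between the policy’s state marginal and the latent task variable, along with terms for matching the task-specific policy data distribution to the corresponding mixture mode and deviating from the mixture’s marginal density. We can trade off between component-matching and exploration by introducing a weighting term $\lambda \in [0, 1]$ into $r_z(s)$:

$$r_z(s) = \lambda \log q_\phi(s \mid z) - \log q_\phi(s) \tag{8}$$

$$= (\lambda - 1) \log q_\phi(s \mid z) + \log q_\phi(z \mid s) + C \tag{9}$$

where $C$ is a constant with respect to the optimization of $\theta$.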
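A small numerical sketch can show how the weighting term moves the reward between component-matching and exploration. It assumes the weighted reward has the form $r_z(s) = \lambda \log q(s \mid z) - \log q(s)$, uses scalar state features with one-dimensional Gaussian components in place of the paper's visual model, and introduces the hypothetical helpers `log_gauss`, `log_marginal`, and `task_reward`.

```python
import math


def log_gauss(x, mean, std):
    """Log-density of a 1-D Gaussian (stand-in for a mixture component)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def log_marginal(s, means, stds, priors):
    """log q(s) = log sum_z q(z) q(s | z), via a stable log-sum-exp."""
    terms = [math.log(p) + log_gauss(s, m, sd) for m, sd, p in zip(means, stds, priors)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def task_reward(s, z, means, stds, priors, lam=1.0):
    """Assumed reward form: lam * log q(s | z) - log q(s)."""
    return lam * log_gauss(s, means[z], stds[z]) - log_marginal(s, means, stds, priors)


means, stds, priors = [0.0, 4.0], [1.0, 1.0], [0.5, 0.5]

# lam = 1: the reward is log q(z | s) - log q(z), so it favors states that
# identify the task.
r_match = task_reward(0.0, 0, means, stds, priors, lam=1.0)
r_mismatch = task_reward(0.0, 1, means, stds, priors, lam=1.0)

# lam = 0: the reward is -log q(s), identical for every task, and larger for
# states to which the mixture assigns low density.
r_explore_near = task_reward(0.0, 0, means, stds, priors, lam=0.0)
r_explore_far = task_reward(10.0, 0, means, stds, priors, lam=0.0)
```

With these toy numbers, `r_match` exceeds `r_mismatch`, while at `lam=0` the task index no longer matters and the far-away state receives the larger reward, matching the density-based exploration reading of that setting.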
4 Related Work
Unsupervised Reinforcement Learning. Unsupervised learning in the context of RL is the problem of enabling an agent to learn about its environment and acquire useful behaviors without human-specified reward functions. A large body of prior work has studied exploration and intrinsic motivation objectives (Schmidhuber, 2009; Salge et al., 2014; Pathak et al., 2017; Fu et al., 2017; Burda et al., 2019; Bellemare et al., 2016; Lehman and Stanley, 2011; Osband et al., 2018). These algorithms do not aim to acquire skills that can be operationalized to solve tasks, but rather try to achieve wide coverage of the state space; our objective (Eq. 8) reduces to pure density-based exploration with $\lambda = 0$. Hence, these algorithms still rely on slow RL (Botvinick et al., 2019) in order to adapt to new tasks posed at test-time. Some prior works consider unsupervised pre-training for efficient RL, but these works typically focus on settings in which exploration is not as much of a challenge (Watter et al., 2015; Finn and Levine, 2017; Ebert et al., 2017), focus on goal-conditioned policies (Pathak et al., 2018; Nair et al., 2018), or have not been shown to scale to high-dimensional visual observation spaces (Lopes et al., 2012; Shyam et al., 2019). Perhaps most relevant to our work are unsupervised RL algorithms for learning reward functions via optimizing information-theoretic objectives involving latent skill variables (Gregor et al., 2016; Achiam et al., 2018; Eysenbach et al., 2019; Warde-Farley et al., 2019). In particular, with a choice of $\lambda = 1$ in Eq. 9 we recover the information maximization objective used in prior work (Achiam et al., 2018; Eysenbach et al., 2019), besides the fact that we simultaneously perform meta-learning. The setting of training a contextual policy with a classifier as $q_\phi$ in our proposed framework (see Appendix A.3) provides an interpretation of DIAYN as implicitly doing trajectory-level clustering. Warde-Farley et al. (2019) also consider accumulation of tasks, but with a focus on goal-reaching and by maintaining a goal reservoir via heuristics that promote diversity.
Meta-Learning. Our work is distinct from above works in that it formulates a meta-learning approach to explicitly train, without supervision, for the ability to adapt to new downstream RL tasks. Prior work (Hsu et al., 2019; Khodadadeh et al., 2018; Antoniou and Storkey, 2019) has investigated this unsupervised meta-learning setting for image classification; the setting considered herein is complicated by the added challenges of RL-based policy optimization and exploration. Gupta et al. (2018a) provides an initial exploration of the unsupervised meta-RL problem, proposing a straightforward combination of unsupervised skill acquisition (via DIAYN) followed by MAML Finn et al. (2017) with experiments restricted to environments with fully observed, lower-dimensional state. Unlike these works and other meta-RL works (Wang et al., 2016; Duan et al., 2016; Mishra et al., 2018; Rakelly et al., 2019; Finn et al., 2017; Houthooft et al., 2018; Gupta et al., 2018b; Rothfuss et al., 2019; Stadie et al., 2018; Sung et al., 2017), we close the loop to jointly perform task acquisition and meta-learning so as to achieve an automatic curriculum to facilitate joint meta-learning and task-level exploration.
Automatic Curricula. The idea of automatic curricula has been widely explored both in supervised learning and RL. In supervised learning, interest in automatic curricula is based on the hypothesis that exposure to data in a specific order (i.e. a non-uniform curriculum) may allow for learning harder tasks more efficiently Elman (1993); Schmidhuber (2009); Graves et al. (2017). In RL, an additional challenge is exploration; hence, related work in RL considers the problem of curriculum generation, whereby the task distribution is designed to guide exploration towards solving complex tasks Florensa et al. (2017b); Matiisen et al. (2019); Florensa et al. (2017a); Schmidhuber (2011) or unsupervised pre-training Sukhbaatar et al. (2018); Forestier et al. (2017). Our work is driven by similar motivations, though we consider a curriculum in the setting of meta-RL and frame our approach as information maximization.
5 Experiments
We experiment in visual navigation and visuomotor control domains to study the following questions:
What kind of tasks are discovered through our task acquisition process (the E-step)?
Do these tasks allow for meta-training of strategies that transfer to test tasks?
Does closing the loop to jointly perform task acquisition and meta-learning bring benefits?
Does pre-training with CARML accelerate meta-learning of test task distributions?
Videos are available at the project website https://sites.google.com/view/carml.
5.1 Experimental Setting
The following experimental details are common to the two vision-based environments we consider. Other experimental details are explained in Appendix B.
Meta-RL. CARML is agnostic to the meta-RL algorithm used in the M-step. We use the RL$^2$ algorithm (Duan et al., 2016), which has previously been evaluated on simpler visual meta-RL domains, with a PPO (Schulman et al., 2017) optimizer. Unless otherwise stated, we use four episodes per trial (compared to the two episodes per trial used in Duan et al. (2016)), since the settings we consider involve more challenging task inference.
Baselines. We compare against: 1) PPO from scratch on each evaluation task, 2) pre-training with random network distillation (RND) (Burda et al., 2019) for unsupervised exploration, followed by fine-tuning on evaluation tasks, and 3) supervised meta-learning on the test-time task distribution, as an oracle.
Variants. We consider variants of our method to ablate the role of design decisions related to task acquisition and joint training: 4) pipelined (most similar to Gupta et al. (2018a)) – task acquisition with a contextual policy, followed by meta-RL with RL$^2$; 5) online discriminator – task acquisition with a purely discriminative $q_\phi$ (akin to online DIAYN); and 6) online pretrained-discriminator – task acquisition with a discriminative $q_\phi$ initialized with visual features trained via Algorithm 2.
5.2 Visual Navigation
The first domain we consider is first-person visual navigation in ViZDoom (Kempka et al., 2016), involving a room filled with five different objects (drawn from a set of 50). We consider a setup akin to those featured in Chaplot et al. (2018) and Xie et al. (2018) (see Figure 3). The true state consists of continuous 2D position and continuous orientation, while observations are egocentric images with limited field of view. Three discrete actions allow for turning right or left, and moving forward. We consider two ways of sampling the CMP $\mathcal{C}$. Fixed: fix a set of five objects and positions for both unsupervised meta-training and testing. Random: sample five objects and randomly place them (thereby randomizing the state space and dynamics).
Visualizing the task distribution. Modeling pixel observations reveals trajectory-level organization in the underlying true state space (Figure 5). Each map portrays trajectories of a mixture component, with position encoded in 2D space and orientation encoded in the jet color-space; an example of interpreting the maps is shown left of the legend. The components of the mixture model reveal structured groups of trajectories: some components correspond to exploration of the space (marked with green border), while others are more strongly directed towards specific areas (blue border). The skill maps of the fixed and random environments are qualitatively different: tasks in the fixed room tend towards interactions with objects or walls, while many of the tasks in the random setting sweep the space in a particular direction. We can also see the evolution of the task distribution at earlier and later stages of Algorithm 1. While initial tasks (produced by a randomly initialized policy) tend to be less structured, we later see refinement of certain tasks as well as the emergence of others as the agent collects new data and acquires strategies for performing existing tasks.
Do acquired skills transfer to test tasks? We evaluate how well the CARML task distribution prepares the agent for unseen tasks. For both the fixed and randomized CMP experiments, each test task specifies a dense goal-distance reward for reaching a single object in the environment. In the randomized environment setting, the target objects at test-time are held out from meta-training. The PPO and RND-initialized baseline policies, and the finetuned CARML meta-policy, are trained for a single target (a specific object in a fixed environment), with 100 episodes per PPO policy update.
In Figure 5(a), we compare the success rates on test tasks as a function of the number of samples with supervised rewards seen from the environment. Direct transfer performance of meta-learners is shown as points, since in this setting the RL agent sees only four episodes (200 samples) at test-time, without any parameter updates. We see that direct transfer is significant, achieving up to 71% and 59% success rates on the fixed and randomized settings, respectively. The baselines require over two orders of magnitude more test-time samples to solve a single task at the same level.
While the CARML meta-policy does not consistently solve the test tasks, this is not surprising since no information is assumed about target reward functions during unsupervised meta-learning; inevitable discrepancies between the meta-train and test task distributions will mean that meta-learned strategies will be suboptimal for the test tasks. For instance, during testing, the agent sometimes ‘stalls’ before the target object (once inferred), in order to exploit the inverse distance reward. Nevertheless, we also see that finetuning the CARML meta-policy trained on random environments on individual tasks is more sample efficient than learning from scratch. This suggests that deriving reward functions from our mixture model yields useful tasks insofar as they facilitate learning of strategies that transfer.
Benefit of reorganization. In Figure 5(a), we also compare performance across early and late outer-loop iterations of Algorithm 1, to study the effect of adapting the task distribution (the CARML E-step) by reorganizing tasks and incorporating new data. In both cases, number of outer-loop iterations . Overall, the refinement of the task distribution, which we saw in Figure 5, leads to improved transfer performance. The effect of reorganization is further visualized in Appendix F.
Variants. From Figure 5(c), we see that the purely online discriminator variant suffers in direct transfer performance; this is due to the issue of mode-collapse in task distribution, wherein the task distribution lacks diversity. Pretraining the discriminator encoder with Algorithm 2 mitigates mode-collapse to an extent, improving task diversity as the features and task decision boundaries are first fit on a corpus of (randomly collected) trajectories. Finally, while the distribution of tasks eventually discovered by the pipelined variant may be diverse and structured, meta-learning the corresponding tasks from scratch is harder. More detailed analysis and visualization is given in Appendix E.
5.3 Visual Robotic Manipulation
To experiment in a domain with different challenges, we consider a simulated Sawyer arm interacting with an object in MuJoCo (Todorov et al., 2012), with continuous end-effector control in the 2D plane. The observation is a bottom-up view of a surface supporting an object (Figure 7); the camera is stationary, but the view is no longer egocentric and part of the observation is proprioceptive. The test tasks involve pushing the object to a goal (drawn from the set of reachable states), where the reward function is the negative distance to the goal state. A subset of the skill maps is provided below.
Do acquired skills directly transfer to test tasks? In Figure 5(b), we evaluate the meta-policy on the test task distribution, comparing against baselines as previously. Despite the increased difficulty of control, our approach allows for meta-learning skills that transfer to the goal distance reward task distribution. We find that transfer is weaker compared to the visual navigation (fixed version): one reason may be that the environment is not as visually rich, resulting in a significant gap between the CARML and the object-centric test task distributions.
5.4 CARML as Meta-Pretraining
Another compelling form of transfer is pretraining of an initialization for accelerated supervised meta-RL of target task distributions. In Figure 8, we see that the initialization learned by CARML enables effective supervised meta-RL with significantly fewer samples. To separate the effect of learning the recurrent meta-policy from that of learning the visual representation, we also compare to initializing only the pre-trained encoder. Thus, while direct transfer of the meta-policy may not directly result in optimal behavior on test tasks, accelerated learning of the test task distribution suggests that the acquired meta-learning strategies may be useful for learning related task distributions, effectively acting as a pre-training procedure for meta-RL.
We proposed a framework for inducing unsupervised, adaptive task distributions for meta-RL that scales to environments with high-dimensional pixel observations. Through experiments in visual navigation and manipulation domains, we showed that this procedure enables unsupervised acquisition of meta-learning strategies that transfer to downstream test task distributions in terms of direct evaluation, more sample-efficient fine-tuning, and more sample-efficient supervised meta-learning. Nevertheless, the following key issues are important to explore in future work.
Task distribution mismatch. While our results show that useful structure can be meta-learned in an unsupervised manner, results like the stalling behavior in ViZDoom (see section 5.2) suggest that direct transfer of unsupervised meta-learning strategies suffers from a no-free-lunch issue: there will always be a gap between unsupervised and downstream task distributions, and more so with more complex environments. Moreover, the semantics of target tasks may not necessarily align with especially discriminative visual features. This is part of the reason why transfer in the Sawyer domain is less successful. Capturing other forms of structure useful for stimulus-reward association might involve incorporating domain-specific inductive biases into the task-scaffold model. Another way forward is the semi-supervised setting, whereby data-driven bias is incorporated at meta-training time.
Validation and early stopping. Since the objective optimized by the proposed method is non-stationary and in no way guaranteed to be correlated with objectives of test tasks, one must provide some mechanism for validation of iterates.
Form of skill-set. For the main experiments, we fixed a number of discrete tasks to be learned (without tuning this), but one should consider how the set of skills can be grown or parameterized to have higher capacity (e.g. a multi-label or continuous latent). Otherwise, the task distribution may become overloaded (complicating task inference) or limited in capacity (preventing coverage).
Accumulation of skill. We mitigate forgetting with the simple solution of reservoir sampling. Better solutions involve studying an intersection of continual learning and meta-learning.
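For reference, reservoir sampling keeps a fixed-size buffer in which every item seen so far is retained with equal probability. The sketch below is the textbook Algorithm R, not the implementation used in the experiments; the class name `Reservoir`, the capacity, and the integer trajectory ids are illustrative assumptions.

```python
import random


class Reservoir:
    """Fixed-capacity buffer holding a uniform sample of a stream (Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.num_seen = 0
        self._rng = random.Random(seed)

    def add(self, item):
        self.num_seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the item.
            self.items.append(item)
            return
        # Keep the new item with probability capacity / num_seen,
        # overwriting a uniformly chosen slot.
        j = self._rng.randrange(self.num_seen)
        if j < self.capacity:
            self.items[j] = item


reservoir = Reservoir(capacity=8)
for trajectory_id in range(1000):  # stand-ins for stored trajectories
    reservoir.add(trajectory_id)
```

Because old and new items are equally likely to survive, data from early iterations stays represented when the mixture model is refit.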
We thank the BAIR community for helpful discussion, and Michael Janner and Oleh Rybkin in particular for feedback on an earlier draft. AJ thanks Alexei Efros for his steadfastness and advice, and Sasha Sax and Ashish Kumar for discussion. KH thanks his family for their support. AJ is supported by the PD Soros Fellowship. This work was supported in part by the National Science Foundation, IIS-1651843, IIS-1700697, and IIS-1700696, as well as Google.
Appendix A Derivations
A.1 Derivation for Trajectory-Level Responsibilities (Section 3.2.1)
Here we show that, assuming independence between states in a trajectory when conditioning on a latent variable, computing the trajectory likelihood as a factorized product of state likelihoods for the E-step in standard EM forces the component responsibilities for all states in the trajectory to be identical. Begin by lower-bounding the log-likelihood of the trajectory dataset with Jensen’s inequality:
We have introduced the variational distribution , where is a categorical variable. Now, to maximize Eq. 13 with respect to , we alternate between an E-step and an M-step, where the E-step is computing
We assume that each is Gaussian; the M-step amounts to computing the maximum-likelihood estimate of , under the mixture responsibilities from the E-step:
In particular, note that the expressions are independent of . Thus, the posterior will be, too.
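The consequence of the factorization can be checked numerically. The following is a minimal sketch, assuming isotropic Gaussian components and a categorical prior; the function names and the toy numbers are hypothetical and not taken from the experiments. Summing per-state log-likelihoods before normalizing yields a single posterior over components that every state in the trajectory shares.

```python
import math


def log_gaussian(x, mean, var):
    """Log-density of an isotropic Gaussian at point x (a tuple of floats)."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq / var)


def trajectory_responsibilities(trajectory, means, variances, log_prior):
    """Posterior over components for a whole trajectory.

    Under conditional independence, the trajectory log-likelihood given a
    component is the sum of per-state log-likelihoods, so one posterior is
    shared by every state in the trajectory.
    """
    scores = []
    for mean, var, lp in zip(means, variances, log_prior):
        log_lik = sum(log_gaussian(s, mean, var) for s in trajectory)
        scores.append(lp + log_lik)
    # Normalize with log-sum-exp for numerical stability.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(v - m) for v in scores))
    return [math.exp(v - log_norm) for v in scores]


means = [(0.0, 0.0), (3.0, 3.0)]
variances = [1.0, 1.0]
log_prior = [math.log(0.5), math.log(0.5)]
# The last state, taken alone, is closer to the second component.
trajectory = [(0.1, -0.2), (0.4, 0.3), (2.0, 1.5)]

resp = trajectory_responsibilities(trajectory, means, variances, log_prior)
last_state_only = trajectory_responsibilities(
    trajectory[-1:], means, variances, log_prior
)
```

Here the final state would be assigned to the second component on its own, yet the trajectory-level posterior places it with the first, which is the intended consensus behavior.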
A.2 CARML M-Step
The objective used to optimize the meta-RL algorithm in the CARML M-step can be interpreted as a sum of cross entropies, resulting in the mutual information plus two additional KL terms:
The first KL term can be interpreted as encouraging exploration with respect to the density of the mixture. The second KL term is the reverse KL term for matching the modes of the mixture.
Density-based exploration. In practice, we may want to trade off between exploration and matching the modes of the generative model:
where is constant with respect to the optimization of . Hence, the objective amounts to maximizing discriminability of skills where yields a bonus for exploring away from the mode of the corresponding skill.
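As a hedged illustration of this trade-off, suppose the per-state reward has the form lam * log q(s|z) - log q(s), where `lam` is a hypothetical trade-off weight; the exact expression is not reproduced in this text, so this form is an assumption. Bayes' rule then rewrites it as (lam - 1) * log q(s|z) + log q(z|s) - log p(z), and the last term does not depend on the policy. The sketch below checks the identity numerically on a toy 1D Gaussian mixture.

```python
import math


def log_normal(x, mean, std):
    """Log-density of a 1D Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def mixture_terms(s, z, means, stds, prior):
    """Return log q(s|z), log q(s), and log q(z|s) for a 1D Gaussian mixture."""
    log_cond = [log_normal(s, m, sd) for m, sd in zip(means, stds)]
    joint = [math.log(p) + lc for p, lc in zip(prior, log_cond)]
    top = max(joint)
    log_marginal = top + math.log(sum(math.exp(j - top) for j in joint))
    log_posterior = joint[z] - log_marginal
    return log_cond[z], log_marginal, log_posterior


def tradeoff_reward(s, z, lam, means, stds, prior):
    """Assumed reward form: lam * log q(s|z) - log q(s)."""
    log_cond, log_marginal, _ = mixture_terms(s, z, means, stds, prior)
    return lam * log_cond - log_marginal


means, stds, prior = [-2.0, 2.0], [1.0, 1.0], [0.5, 0.5]
s, z, lam = 1.2, 1, 0.5

log_cond, log_marginal, log_post = mixture_terms(s, z, means, stds, prior)
reward = tradeoff_reward(s, z, lam, means, stds, prior)
# Equivalent form via Bayes' rule; -log p(z) is constant in the policy.
decomposed = (lam - 1.0) * log_cond + log_post - math.log(prior[z])
```

With `lam` below 1, the first term of the decomposed form is largest where the skill's own component assigns low density, which is the exploration bonus described above.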
A.3 Discriminative CARML and DIAYN
Here, we derive a discriminative instantiation of CARML. We begin with the E-step. We leverage the same conditional independence assumption as before, and re-write the trajectory-level MI as the state level MI, assuming that trajectories are all of length :
We then decompose MI as the difference between marginal and conditional entropy of the latent, and choose the variational distribution to be the product of a classifier and a density model :
We fix to be a uniformly-distributed categorical variable. The CARML E-step consists of two separate optimizations: supervised learning of with a cross-entropy loss and density estimation of :
For the CARML M-step, we start from the form of the reward in Eq. 26 and manipulate via Bayes’:
where is constant with respect to the optimization of in the M-step.
To enable a trajectory-level latent , we want every state in a trajectory to be classified to the same . This is achievable in a straightforward manner: when training the classifier via supervised learning, label each state in a trajectory with the realization of that the policy was conditioned on when generating that trajectory.
Connection to DIAYN. Note that with in Eq. 34, we directly obtain the DIAYN [Eysenbach et al., 2019] objective without standard policy entropy regularization, and we do away with needing to maintain a density model , leaving just the discriminator. If is truly a contextual policy (rather than the policy given by adapting a meta-learner), we have recovered the DIAYN algorithm. This allows us to interpret DIAYN-style algorithms as implicitly doing trajectory-level clustering with a conditional independence assumption between states in a trajectory given the latent. This arises from the weak trajectory-level supervision specified when training the discriminator: all states in a trajectory are assumed to correspond to the same realization of the latent variable.
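The weak trajectory-level supervision can be sketched as follows. This is an illustrative outline rather than the authors' code: `label_states` and `discriminator_reward` are hypothetical helpers, states are toy 2D tuples rather than images, and the latent prior is assumed uniform as stated above.

```python
import math


def label_states(trajectories, latents):
    """Pair every state with the latent its trajectory was generated under."""
    dataset = []
    for states, z in zip(trajectories, latents):
        dataset.extend((state, z) for state in states)
    return dataset


def discriminator_reward(class_probs, z, num_skills):
    """log q(z|s) - log p(z) for a uniform prior over num_skills skills."""
    return math.log(class_probs[z]) + math.log(num_skills)


trajectories = [
    [(0.0, 0.1), (0.2, 0.3)],
    [(1.0, 1.1), (1.2, 1.3), (1.4, 1.5)],
]
latents = [0, 2]  # latent the policy was conditioned on for each trajectory
dataset = label_states(trajectories, latents)

# A classifier at chance level earns zero reward under this form.
chance_reward = discriminator_reward([0.25, 0.25, 0.25, 0.25], z=2, num_skills=4)
```

The resulting `(state, label)` pairs would be fed to any standard cross-entropy classifier; the reward is positive only when the classifier identifies the skill better than chance.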
Appendix B Additional Details for Main Experiments
B.1 CARML Hyperparameters
We train CARML for five iterations, with 500 PPO updates for meta-learning with RL in the M-step (i.e. update the mixture model every 500 meta-policy updates). Thus, the CARML unsupervised learning process consumes on the order of 1,000,000 episodes (compared to the ~400,000 episodes needed to train a meta-policy with the true task distribution, as shown in our experiments). We did not heavily tune this number, though we noticed that using too few policy updates (e.g. ~100) before refitting resulted in instability insofar as the meta-learner does not adapt to learn the updated task distribution. Each PPO learning update involves sampling 100 tasks with 4 episodes each, for a total of 400 episodes per update. We use 10 PPO epochs per update with a batch size of 100 tasks.
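The episode budget quoted above follows directly from these settings; the short calculation below makes the arithmetic explicit (variable names are illustrative).

```python
tasks_per_update = 100
episodes_per_task = 4
ppo_updates_per_iteration = 500
carml_iterations = 5

episodes_per_update = tasks_per_update * episodes_per_task
episodes_per_iteration = episodes_per_update * ppo_updates_per_iteration
total_episodes = episodes_per_iteration * carml_iterations

print(episodes_per_update)  # 400
print(total_episodes)       # 1000000
```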
During meta-training, tasks are drawn according to , the mixture’s latent prior distribution. Unless otherwise stated, we use for all visual meta-RL experiments. For all experiments unless otherwise mentioned, we fix the number of components in our mixture to be . We use a reservoir of trajectories.
Temporally Smoothed Reward: At unsupervised meta-training time, we found it helpful to reward the meta-learner with the average over a small temporal window, i.e. , choosing to be . This has the effect of smoothing the reward function, thereby regularizing acquired task inference strategies.
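A minimal sketch of such smoothing is given below. The exact window definition and length are not reproduced in this text, so the trailing (causal) window and the value `window=2` are assumptions made purely for illustration.

```python
def smooth_rewards(rewards, window):
    """Replace each reward with the mean over the trailing `window` steps."""
    smoothed = []
    running_sum = 0.0
    for t, r in enumerate(rewards):
        running_sum += r
        if t >= window:
            # Drop the reward that just left the window.
            running_sum -= rewards[t - window]
        smoothed.append(running_sum / min(t + 1, window))
    return smoothed


raw = [0.0, 3.0, 0.0, 3.0, 0.0, 3.0]
smoothed = smooth_rewards(raw, window=2)
print(smoothed)  # [0.0, 1.5, 1.5, 1.5, 1.5, 1.5]
```

The oscillating raw signal becomes nearly constant after averaging, which illustrates how smoothing removes high-frequency variation from the reward the meta-learner observes.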
Random Seeds: The results reported in Figure 6 are averaged across policies (for each treatment) trained with three different random seeds. The performance is averaged across 20 test tasks. The results reported in Figure 7 are based on finetuning CARML policies trained with three different random seeds. We did not observe significant effects of the random seed used in the finetuning procedure of experiments reported for Figure 7.
Model Selection: Models used for transfer experiments are selected by performance on a small held-out validation set (ten tasks) for each task; this validation set does not intersect with the test task.
B.2 Meta-RL with RL²
We adopt the recurrent architecture and hyperparameter settings as specified in the visual maze navigation tasks of Duan et al., except that we:
Use PPO for policy optimization ()
Set the entropy bonus coefficient in an environment-specific manner. We use for MuJoCo Sawyer and for ViZDoom.
Enlarge the input observation space to , adapting the encoder by halving the stride in the first convolutional layer.
Increase the size of the recurrent model (hidden state size 512) and the capacity of the output layer of the RNN (MLP with one hidden layer of dimension 256).
Allow for four episodes per task (instead of two), since the tasks we consider involve more challenging task inference.
Use a multi-layer perceptron with one hidden layer to read out the output for the actor and critic, given the recurrent hidden state.
B.3 Reward Normalization
A subtle challenge that arises in applying meta-RL across a range of tasks is the difference in the statistics of the reward functions encountered, which may affect task inference. Without some form of normalization, the statistics of the rewards of unsupervised meta-training tasks versus those of the downstream tasks may be arbitrarily different, which may interfere with inferring the task. This is especially problematic for RL² (compared to e.g. MAML, Finn et al.), which relies on encoding the reward as a feature at each timestep. We address this issue by whitening the reward at each timestep with running mean and variance computed online, separately for each task from the unsupervised task distribution during meta-training. At test-time, we share these statistics across tasks from the same test task distribution.
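One way to realize such online whitening is Welford's running mean and variance update, sketched below. This is an assumption about the mechanics, not the authors' implementation; the class `RunningNormalizer`, the `eps` stabilizer, and the string task ids are hypothetical.

```python
import math


class RunningNormalizer:
    """Whitens scalars with mean/variance tracked online (Welford's method)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; defaults to 1.0 before any data is seen.
        return self.m2 / self.count if self.count > 0 else 1.0

    def normalize(self, x):
        return (x - self.mean) / math.sqrt(self.variance + self.eps)


# One normalizer per meta-training task, keyed by a hypothetical task id.
normalizers = {}


def whiten(task_id, reward):
    stats = normalizers.setdefault(task_id, RunningNormalizer())
    stats.update(reward)
    return stats.normalize(reward)


for r in [1.0, 2.0, 3.0, 4.0]:
    whiten("task_a", r)
for r in [100.0, 200.0, 300.0, 400.0]:
    whiten("task_b", r)
```

After whitening, the two tasks produce rewards on a comparable scale even though their raw magnitudes differ by a factor of 100. At test time, a single normalizer could be shared across tasks from the same distribution by using one key for all of them.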
B.4 Learning Visual Representations with DeepCluster
To jointly learn visual representations with the mixture model, we adopt the optimization scheme of DeepCluster [Caron et al., 2018]. The DeepCluster model is parameterized by the weights of a convolutional neural network encoder as well as a k-means model in embedding space. It is trained in an EM-like fashion, where the M-step additionally involves training the encoder weights via supervised learning of the image-cluster mapping.
Our contribution is that we employ a modified E-step, as presented in the main text, such that the cluster responsibilities are ensured to be consensual across states in a trajectory in the training data. As shown in our experiments, this allows the model to learn trajectory-level visual representations. The full CARML E-step with DeepCluster is presented below.
For updating the encoder weights, we use the default hyperparameter settings as described in Caron et al., except 1) we modify the neural network architecture, using a smaller neural network, ResNet-10 (He et al.), with a fixed number of filters (64) for every convolutional layer, and 2) we use number of components , which we did not tune. We tried using a more expressive Gaussian mixture model with full covariances instead of k-means (when training the visual representation), but found that this resulted in overfitting. Hence, we use k-means until the last iteration of EM, wherein a Gaussian mixture model is fitted under the resulting visual representation.
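To make the consensus constraint concrete for the k-means case, the sketch below assigns every state of a trajectory to the single centroid that minimizes the summed squared distance over the trajectory, then recomputes centroids. It is a simplified outline under stated assumptions: embeddings are toy 2D tuples standing in for encoder outputs, the helper names are hypothetical, and the encoder update on the resulting pseudo-labels is omitted.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign_trajectory(embeddings, centroids):
    """Pick the centroid with the lowest summed squared distance to all states."""
    costs = [sum(squared_distance(e, c) for e in embeddings) for c in centroids]
    return min(range(len(centroids)), key=costs.__getitem__)


def update_centroids(trajectories, assignments, centroids):
    """Recompute each centroid as the mean embedding of its assigned states."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for embeddings, k in zip(trajectories, assignments):
        for e in embeddings:
            counts[k] += 1
            for i, v in enumerate(e):
                sums[k][i] += v
    # Empty clusters keep their previous centroid.
    return [
        tuple(v / counts[k] for v in sums[k]) if counts[k] else tuple(centroids[k])
        for k in range(len(centroids))
    ]


centroids = [(0.0, 0.0), (4.0, 4.0)]
trajectories = [
    [(0.0, 1.0), (1.0, 0.0), (3.0, 3.0)],  # last state alone is nearer centroid 1
    [(4.0, 5.0), (5.0, 4.0)],
]
assignments = [assign_trajectory(t, centroids) for t in trajectories]
new_centroids = update_centroids(trajectories, assignments, centroids)
```

In this toy example the state `(3.0, 3.0)` is pulled into cluster 0 along with the rest of its trajectory, so the cluster label describes the trajectory rather than the individual frame.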
The environment used for visual navigation is a 500x500 room built with ViZDoom [Kempka et al., 2016]. We consider both fixed and random environments; for randomly placing objects, the only constraint enforced is that objects should not be within a minimal distance of one another. There are 50 train objects and 50 test objects. The agent’s pose is always initialized to be at the top of the room facing forward. We restrict observations from the environment to be RGB images. The maximum episode length is set to 50 timesteps. The hand-crafted reward function corresponds to the inverse distance from the specified target object.
The environment considered is relatively simple in layout, but compared to simple mazes, can provide a more complex observation space insofar as objects are constantly viewed from different poses and in various combinations, and are often occluded. The underlying ground-truth state space is the product of continuous 2D position and continuous pose spaces. There are three discrete actions that correspond to turning right, turning left, and moving forward, allowing translation and rotation in the pose space that can vary based on position; the result is that the effective visitable set of poses is not strictly limited to a subset of the pose space, despite discretized actions.
For visual manipulation, we use a MuJoCo [Todorov et al., 2012] environment involving a simulated Sawyer 7-DOF robotic arm in front of a table, on top of which is an object. The Sawyer arm is controlled by 2D continuous control. It is almost identical to the environment used by prior work such as Nair et al., with the exception that our goal space is that of the object position. The robot pose and object are always initialized to the same position at the top of the room facing forward. We restrict observations from the environment to be RGB images. The maximum episode length is set to 50 timesteps. The hand-crafted reward function corresponds to the negative distance from the specified target object.
Appendix C Additional Details for Qualitative Study of
C.1 Instantiating as a VAE
Three factors motivate the use of a variational auto-encoder (VAE) as a generative model for the 2D toy environment. First, a key inductive bias of DeepCluster, namely that randomly initialized convolutional neural networks work surprisingly well, which Caron et al. use to motivate its effectiveness in visual domains, does not apply to our 2D state space. Second, components of a standard Gaussian mixture model are inappropriate for modeling trajectories involving turns. Third, using a VAE allows sampling from a continuous latent, potentially affording an unbounded number of skills.
We construct the VAE model in a manner that enables expressive generative densities while allowing for computation of the policy reward quantities. We set the VAE latent to be , where . The form of follows from restricting the policy to sampling trajectories of length . We factorize the posterior as . Keeping with the idea of having a Markovian reward, we construct the VAE’s recognition network such that it takes as input individual states after training. To incorporate the constraint that all states in a trajectory are mapped to the same posterior, we adopt a particular training scheme: we pass in entire trajectories , and specify the posterior parameters as and .
The ELBO for this model is
where is constant with respect to the learnable parameters. The simplification directly follows from the form of the posterior; we have essentially passed through the network unchanged. Notice that the computation of the ELBO for a trajectory leverages the conditional independence in our graphical model.
C.2 CARML Details
Since we are not interested in meta-transfer for this experiment, we simplify the learning problem to training a contextual policy . To reward the policy using the VAE , we compute
and we approximate by its ELBO (Eq. 36), substituting the above expression for the reconstruction term.
Appendix D Sawyer Task Distribution
Visualizing the components of the acquired task distribution for the Sawyer domain reveals structure and diversity related to the position of the object as well as the control path taken to effect movement. Red encodes the true position of the object, and light blue that of the end-effector. We find tasks corresponding to moving the object to various locations in the environment, as well as tasks that correspond to moving the arm in a certain way without object interaction. The tasks provide a scaffold for learning to move the object to various regions of the reachable state space.
Since the Sawyer domain is less visually rich than the ViZDoom domain, there may be fewer visually discriminative states that align with the semantics of test task distributions. Moreover, since a large part of the observation is proprioceptive, the discriminative clustering representation used for density modeling captures various proprioceptive features that may not involve object interaction. The consequences are two-fold: 1) the gap between the CARML and the object-centric test task distributions may be large, and 2) the CARML tasks may be too diverse insofar as tasks share less structure, and inferring each task involves a different control problem.
Appendix E Mode Collapse in the Task Distribution
Here, we present visualizations of the task distributions induced by variants of the presented method, to illustrate the issue of using an entirely discrimination-based task acquisition approach. Using the fixed ViZDoom setting, we compare:
(i) CARML, the proposed method;
(ii) online discriminator – task acquisition with a purely discriminative (akin to an online, pixel-observation-based adaptation of Gupta et al. [2018a]);
(iii) online pretrained-discriminator – task acquisition with a discriminative as in (ii), initialized with a pre-trained observation encoder.
For all discriminative variants, we found it crucial to use a temperature to soften the classifier softmax to prevent immediate task mode-collapse.
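The effect of the temperature can be seen in a small sketch. The temperature values below are hypothetical and chosen only to show the qualitative behavior; dividing the logits by a temperature above 1 flattens the classifier's output distribution, so no single task dominates the reward signal early in training.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature > 1 flattens the output."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=5.0)
```

The ranking of classes is preserved under any positive temperature; only the confidence of the prediction changes.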
We find the task acquisition of purely discriminative variants (ii, iii) to suffer from an effect akin to mode-collapse; the policy’s data distribution collapses to a smaller subset of the trajectory space (one or two modes), and tasks correspond to minor variations of these modes. Skill acquisition methods such as DIAYN rely purely on discriminability of states/trajectories under skills, which can be more easily satisfied in high-dimensional observation spaces and can thus lead to such mode-collapse. Moreover, they do not provide a direct mechanism for furthering exploration once skills are discriminable.
On the other hand, the proposed task acquisition approach (Algorithm 2, section 3.2) fits a generative model over jointly learned discriminative features, and is thus not only less susceptible to mode-collapse (w.r.t. the policy data distribution), but also allows for density-based exploration (section 3.3). Indeed, we find that (iii) seems to mitigate mode-collapse – benefiting from a pretrained encoder from (i) – but does not entirely prevent it. As shown in the main text (Figure 5(c)), in terms of meta-transfer to hand-crafted test tasks, the online discriminative variants (ii, iii) perform worse than CARML (i), due to lower diversity in the task distribution.
Appendix F Evolution of Task Distribution
Here we consider the evolution of the task distribution in the Random ViZDoom environment. The initial tasks (referred to as CARML It. 1) are produced by fitting our deep mixture model to data from a randomly-initialized meta-policy. CARML Its. 2 and 3 correspond to the task distribution after the first and second CARML E-steps, respectively.
We see that the initial tasks tend to be less structured, insofar as the components appear to be noisier and less distinct. With each E-step we see refinement of certain tasks as well as the emergence of others, as the agent’s data distribution is shifted by 1) learning the learnable tasks in the current data-distribution, and 2) exploration. In particular, tasks that are “refined” tend to correspond to simpler, exploitative behaviors (e.g. directly heading to an object or a region in the environment, or following straighter trajectories), which may not require exploration to discover. On the other hand, the emergent tasks seem to reflect exploration strategies (e.g. sweeping the space in an efficient manner). We also see the benefit of reorganization that comes from refitting the mixture model, as tasks that were once separate can be combined.
- UC Berkeley University of Toronto Carnegie Mellon University Stanford University
- Work done as a visiting student researcher at UC Berkeley.
- Variational option discovery algorithms. arXiv preprint arXiv:1807.10299. Cited by: §1, §3.2, §4.
- Assume, augment and learn: unsupervised few-shot meta-learning via random labels and data augmentation. arXiv preprint arXiv:1902.09884v3. Cited by: §4.
- The IM algorithm: a variational approach to information maximization. In Neural Information Processing Systems (NeurIPS), Cited by: §3.1, §3.1.
- An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7 (6). Cited by: §3.1.
- Unifying count-based exploration and intrinsic motivation. In Neural Information Processing Systems (NeurIPS), Cited by: §4.
- Unsupervised learning by predicting noise. In International Conference on Machine Learning (ICML), Cited by: §3.2.
- Reinforcement learning, fast and slow. Trends in Cognitive Science 23 (5). Cited by: §4.
- Exploration by random network distillation. In International Conference on Learning Representations (ICLR), Cited by: §4, §5.1.
- Maximally informative foraging by Caenorhabditis elegans. eLife 3. Cited by: §1.
- Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision (ECCV), Cited by: §B.4, §B.4, §C.1, §3.2, §3.2.
- Multitask learning. Machine Learning 28 (1). Cited by: §1.
- Gated-attention architectures for task-oriented language grounding. In AAAI Conference on Artificial Intelligence, Cited by: §5.2.
- RL²: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779. Cited by: §B.2, §2, §2, §4, §5.1, 10.
- Self-supervised visual planning with temporal skip connections. In Conference on Robotic Learning (CoRL), Cited by: §4.
- Learning and development in neural networks: the importance of starting small. Cognition 48 (1). Cited by: §4.
- Diversity is all you need: learning skills without a reward function. In International Conference on Learning Representations (ICLR), Cited by: §A.3, §1, §2, §4.
- Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), Cited by: §B.3, §2, §2, §4.
- Deep visual foresight for planning robot motion. In International Conference on Robotics and Automation (ICRA), Cited by: §4.
- Automatic goal generation for reinforcement learning agents. In International Conference on Machine Learning (ICML), Cited by: §4.
- Reverse curriculum generation for reinforcement learning. In Conference on Robotic Learning (CoRL), Cited by: §4.
- Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190. Cited by: §4.
- EX²: exploration with exemplar models for deep reinforcement learning. In Neural Information Processing Systems (NeurIPS), Cited by: §4.
- Automated curriculum learning for neural networks. In International Conference on Machine Learning (ICML), Cited by: §4.
- Variational intrinsic control. arXiv preprint arXiv:1611.07507. Cited by: §1, §3.1, §4.
- Unsupervised meta-learning for reinforcement learning. arXiv preprint arXiv:1806.04640. Cited by: item 2, Appendix E, §1, §2, §2, §4, §5.1.
- Meta-reinforcement learning of structured exploration strategies. In Neural Information Processing Systems (NeurIPS), Cited by: §4.
- Inverse reward design. In Neural Information Processing Systems (NeurIPS), Cited by: §1.
- Unsupervised learning. In The Elements of Statistical Learning, Cited by: §1.
- Deep residual learning for image recognition. CoRR abs/1512.03385. Cited by: §B.4.
- Evolved policy gradients. In Neural Information Processing Systems (NeurIPS), Cited by: §4.
- Unsupervised learning via meta-learning. In International Conference on Learning Representations (ICLR), Cited by: §4.
- ViZDoom: a Doom-based AI research platform for visual reinforcement learning. In Conference on Computational Intelligence and Games (CIG), Cited by: §B.5.1, §5.2.
- Unsupervised meta-learning for few-shot image and video classification. arXiv preprint arXiv:1811.11819v1. Cited by: §4.
- Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.2.
- Abandoning objectives: evolution through the search for novelty alone. Evolutionary Computation 19 (2). Cited by: §4.
- Exploration in model-based reinforcement learning by empirically estimating learning progress. In Neural Information Processing Systems (NeurIPS), Cited by: §4.
- Teacher-student curriculum learning. Transactions on Neural Networks and Learning Systems. Cited by: §4.
- A simple neural attentive meta-learner. In International Conference on Learning Representations (ICLR), Cited by: §4.
- Variational information maximisation for intrinsically motivated reinforcement learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, Cambridge, MA, USA, pp. 2125–2133. Cited by: §3.1.
- Visual reinforcement learning with imagined goals. In Neural Information Processing Systems (NeurIPS), Cited by: §B.5.2, §3.2, §4.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §3.1.
- Randomized prior functions for deep reinforcement learning. In Neural Information Processing Systems (NeurIPS), Cited by: §4.
- Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), Cited by: §4.
- Zero-shot visual imitation. In International Conference on Learning Representations (ICLR), Cited by: §4.
- The construction of reality in the child. Basic Books. Cited by: §1.
- Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, Cited by: §4.
- ProMP: proximal meta-policy search. In International Conference on Learning Representations (ICLR), Cited by: §4.
- Empowerment – an introduction. In Guided Self-Organization: Inception, Cited by: §4.
- Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320. Cited by: §1.
- Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. Ph.D. Thesis, Technische Universität München. Cited by: §1.
- Driven by compression progress: a simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. In Anticipatory Behavior in Adaptive Learning Systems, Cited by: §4, §4.
- POWERPLAY: training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. arXiv preprint arXiv:1112.5309. Cited by: §4.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §5.1.
- Model-based active exploration. In International Conference on Machine Learning (ICML), Cited by: §4.
- Intrinsically motivated reinforcement learning. In Neural Information Processing Systems (NeurIPS), Cited by: §1.
- Some considerations on learning to explore via meta-reinforcement learning. In Neural Information Processing Systems (NeurIPS), Cited by: §4.
- Intrinsic motivation and automatic curricula via asymmetric self-play. In International Conference on Learning Representations (ICLR), Cited by: §4.
- Learning to learn: meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529. Cited by: §4.
- Learning to learn. Springer Science & Business Media. Cited by: §1.
- MuJoCo: a physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §B.5.2, §5.3.
- Learning to reinforcement learn. In Annual Meeting of the Cognitive Science Society (CogSci), Cited by: §2, §2, §4.
- Unsupervised control through non-parametric discriminative rewards. In International Conference on Learning Representations (ICLR), Cited by: §4.
- Embed to control: a locally linear latent dynamics model for control from raw images. In Neural Information Processing Systems (NeurIPS), Cited by: §4.
- Motivation reconsidered: the concept of competence. Psychological Review 66 (5). Cited by: §1.
- Few-shot goal inference for visuomotor learning and planning. In Conference on Robot Learning (CoRL), Cited by: §5.2.