Unsupervised Meta-Learning for Reinforcement Learning
Abstract
Meta-learning is a powerful tool that builds on multi-task learning to learn how to quickly adapt a model to new tasks. In the context of reinforcement learning, meta-learning algorithms can acquire reinforcement learning procedures to solve new problems more efficiently by meta-learning from prior tasks. The performance of meta-learning algorithms critically depends on the tasks available for meta-training: in the same way that supervised learning algorithms generalize best to test points drawn from the same distribution as the training points, meta-learning methods generalize best to tasks from the same distribution as the meta-training tasks. In effect, meta-reinforcement learning offloads the design burden from algorithm design to task design. If we can automate the process of task design as well, we can devise a meta-learning algorithm that is truly automated. In this work, we take a step in this direction, proposing a family of unsupervised meta-learning algorithms for reinforcement learning. We describe a general recipe for unsupervised meta-reinforcement learning, and describe an effective instantiation of this approach based on a recently proposed unsupervised exploration technique and model-agnostic meta-learning. We also discuss practical and conceptual considerations for developing unsupervised meta-learning methods. Our experimental results demonstrate that unsupervised meta-reinforcement learning effectively acquires accelerated reinforcement learning procedures without the need for manual task design, significantly exceeds the performance of learning from scratch, and even matches the performance of meta-learning methods that use hand-specified task distributions.
Abhishek Gupta University of California, Berkeley abhigupta@eecs.berkeley.edu Benjamin Eysenbach Google eysenbach@google.com Chelsea Finn University of California, Berkeley cbfinn@eecs.berkeley.edu Sergey Levine University of California, Berkeley svlevine@eecs.berkeley.edu
Preprint. Work in progress.
1 Introduction
Reusing past experience for faster learning of new tasks is a key challenge for machine learning. Meta-learning methods propose to achieve this by using past experience to explicitly optimize for rapid adaptation (Mishra et al., 2017; Snell et al., 2017; Schmidhuber, 1987; Finn et al., 2017a; Duan et al., 2016b; Gupta et al., 2018; Wang et al., 2016; Al-Shedivat et al., 2017). In the context of reinforcement learning, meta-reinforcement learning algorithms can learn to solve new reinforcement learning tasks more quickly through experience on past tasks (Duan et al., 2016b; Gupta et al., 2018). Typical meta-reinforcement learning algorithms assume the ability to sample from a pre-specified task distribution, and these algorithms learn to solve new tasks drawn from this distribution very quickly. However, specifying a task distribution is tedious and requires a significant amount of supervision (Finn et al., 2017b; Duan et al., 2016b) that may be difficult to provide for large real-world problem settings. The performance of meta-learning algorithms critically depends on the meta-training task distribution, and meta-learning algorithms generalize best to new tasks which are drawn from the same distribution as the meta-training tasks (Finn and Levine, 2018). In effect, meta-reinforcement learning offloads some of the design burden from algorithm design to designing a sufficiently broad and relevant distribution of meta-training tasks. While this greatly helps in acquiring representations for fast adaptation to the specified task distribution, a natural question is whether we can do away with the need for manually designing a large family of tasks, and develop meta-reinforcement learning algorithms that learn only from unsupervised environment interaction. In this paper, we take an initial step toward the formalization and design of such methods.
Our goal is to automate the meta-training process by removing the need for hand-designed meta-training tasks. To that end, we introduce unsupervised meta-reinforcement learning: meta-learning from a task distribution that is acquired automatically, rather than requiring manual design of the meta-training tasks. Developing effective unsupervised meta-reinforcement learning algorithms is challenging, since it requires solving two difficult problems together: meta-reinforcement learning with broad task distributions, and unsupervised exploration for proposing a wide variety of tasks for meta-learning. Since the assumptions of our method differ fundamentally from prior meta-reinforcement learning methods (we do not assume access to hand-specified meta-training tasks), the best point of comparison for our approach is learning the meta-test tasks entirely from scratch with conventional reinforcement learning algorithms. Our method can also be thought of as a data-driven initialization procedure for deep neural network policies, in a similar vein to data-driven initialization procedures explored in supervised learning (Krähenbühl et al., 2015).
The primary contributions of our work are to propose a framework for unsupervised meta-reinforcement learning, sketch out a family of unsupervised meta-reinforcement learning algorithms, and describe a possible instantiation of a practical algorithm from this family that builds on a recently proposed procedure for unsupervised exploration (Eysenbach et al., 2018) and model-agnostic meta-learning (MAML) (Finn et al., 2017a). We discuss the design considerations and conceptual issues surrounding unsupervised meta-reinforcement learning, and provide an empirical evaluation that studies the performance of two variants of our approach on simulated continuous control tasks. Our experimental evaluation shows that, for a variety of tasks, unsupervised meta-reinforcement learning can effectively acquire reinforcement learning procedures that perform significantly better than standard reinforcement learning in terms of sample complexity and asymptotic performance, and even rival the performance of conventional meta-learning algorithms that are provided with hand-designed task distributions.
2 Related Work
Our work lies at the intersection of meta-learning for reinforcement learning, automatic goal generation, and unsupervised exploration. Meta-learning algorithms use data from multiple tasks to learn how to learn, acquiring rapid adaptation procedures from experience (Schmidhuber, 1987; Naik and Mammone, 1992; Thrun and Pratt, 1998; Bengio et al., 1992; Hochreiter et al., 2001; Santoro et al., 2016; Andrychowicz et al., 2016; Li and Malik, 2017; Ravi and Larochelle, 2017; Finn et al., 2017a; Munkhdalai and Yu, 2017; Snell et al., 2017). These approaches have been extended into the setting of reinforcement learning (Duan et al., 2016b; Wang et al., 2016; Finn et al., 2017a; Sung et al., 2017; Mishra et al., 2017; Gupta et al., 2018; Houthooft et al., 2018; Stadie et al., 2018), though their performance in practice depends on the user-specified meta-training task distribution. We aim to lift this limitation, and provide a general recipe for avoiding manual task engineering for meta-reinforcement learning. To that end, we make use of unsupervised task proposals. These proposals can be obtained in a variety of ways, including adversarial goal generation (Sukhbaatar et al., 2017; Held et al., 2017), information-theoretic methods (Gregor et al., 2016; Eysenbach et al., 2018), and even random functions.
Methods that address goal generation and curriculum learning have complementary aims. Graves et al. (2017) study this problem for supervised learning, while Forestier et al. (2017) apply a similar approach to robot learning. Prior work (Schaul et al., 2015; Pong et al., 2018; Andrychowicz et al., 2017) also studied learning of goal-conditioned policies, which are closely related to meta-reinforcement learning in their ability to generalize to new goals at test time. However, like meta-learning, goal-conditioned policies typically require manually defined goals at training time. Although exploration methods coupled with goal relabeling (Pong et al., 2018; Andrychowicz et al., 2017) could provide for automated goal discovery, such methods would still be restricted to a specific goal parameterization. In contrast, unsupervised meta-reinforcement learning can solve arbitrary tasks at meta-test time without being restricted to a particular task parameterization.
Prior work has used meta-learning to learn unsupervised learning rules (Metz et al., 2018). This work learns strategies for unsupervised learning using supervised data, while our approach requires no supervision during meta-training, in effect doing the converse: using a form of unsupervised learning to acquire learning rules that can learn from rewards at meta-test time.
3 Unsupervised Meta-Reinforcement Learning
The goal of unsupervised meta-reinforcement learning is to take an environment and produce a learning algorithm specifically tailored to this environment that can quickly learn to maximize reward for any reward function in this environment. This learning algorithm should be meta-learned without requiring any human supervision. We can formally define unsupervised meta-reinforcement learning in the context of a controlled Markov process (CMP), that is, a Markov decision process without a reward function, $C = (\mathcal{S}, \mathcal{A}, P, \gamma, \rho)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $P$, discount factor $\gamma$ and initial state distribution $\rho$. Our goal is to learn a learning algorithm $f$ on this CMP, which can subsequently learn new tasks efficiently in this CMP for a new reward function $R_i$, which produces a Markov decision process $M_i = (\mathcal{S}, \mathcal{A}, P, \gamma, \rho, R_i)$. We can, at a high level, denote $f$ as a mapping from tasks to policies, $f : \mathcal{T} \to \Pi$, where $\mathcal{T}$ is the space of RL tasks defined by the given CMP and $R_i$, and $\Pi$ is a space of parameterized policies, such that $\pi \in \Pi$ is a probability distribution over actions conditioned on states, $\pi(a \mid s)$. Crucially, $f$ must be learned without access to any reward functions $R_i$, using only unsupervised interaction with the CMP. The reward is only provided at meta-test time.
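To make the distinction between a CMP and a task concrete, the following is a minimal sketch, not code from this work. The class names `CMP` and `MDP`, the helper `discounted_return`, and the 5-state chain environment are all hypothetical and chosen only for illustration. A CMP bundles the state space, action space, dynamics, discount factor and initial state distribution; attaching a reward function to it yields one task.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

State = int
Action = int


@dataclass(frozen=True)
class CMP:
    """Controlled Markov process: an MDP without a reward function."""
    states: Sequence[State]
    actions: Sequence[Action]
    step: Callable[[State, Action], State]       # samples s' from the dynamics
    gamma: float                                 # discount factor
    sample_initial_state: Callable[[], State]    # samples s_0


@dataclass(frozen=True)
class MDP:
    """A task: the shared CMP plus one reward function."""
    cmp: CMP
    reward: Callable[[State, Action], float]


def discounted_return(mdp: MDP, policy: Callable[[State], Action], horizon: int) -> float:
    """Roll out a policy for `horizon` steps and sum the discounted rewards."""
    s = mdp.cmp.sample_initial_state()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * mdp.reward(s, a)
        s = mdp.cmp.step(s, a)
        discount *= mdp.cmp.gamma
    return total


# Hypothetical toy CMP: a deterministic 5-state chain, always starting at state 0.
chain = CMP(
    states=range(5),
    actions=(-1, +1),
    step=lambda s, a: min(4, max(0, s + a)),
    gamma=0.9,
    sample_initial_state=lambda: 0,
)

# The reward only appears when a task is specified (here: be in the rightmost state).
reach_right = MDP(chain, reward=lambda s, a: 1.0 if s == 4 else 0.0)
ret = discounted_return(reach_right, policy=lambda s: +1, horizon=6)
```

In this sketch, everything learned before a reward is attached can depend only on `chain`, which mirrors the constraint that the learning algorithm is acquired from reward-free interaction.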
3.1 A General Recipe
Our framework for unsupervised meta-reinforcement learning consists of two components. The first component is a task identification procedure, which interacts with a controlled Markov process, without access to any reward function, to construct a distribution over tasks. Formally, we will define the task distribution as a mapping from a latent variable $z \sim p(z)$ to a reward function $r_z(s, a) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. That is, for each value of the random variable $z$, we have a different reward function $r_z(s, a)$. The prior $p(z)$ may be specified by hand. For example, we might choose a uniform categorical distribution or a spherical unit Gaussian. A discrete latent variable corresponds to a discrete set of tasks, while a continuous representation could allow for an infinite task space. Under this formulation, learning a task distribution amounts to optimizing a parametric form for the reward function $r_z(s, a)$ that maps each $z \sim p(z)$ to a different reward function.
The second component of unsupervised meta-learning is meta-learning, which takes the family of reward functions induced by $p(z)$ and $r_z(s, a)$, and meta-learns a reinforcement learning algorithm $f$ that can quickly adapt to any task from the task distribution defined by $p(z)$ and $r_z(s, a)$. The meta-learned algorithm $f$ can then learn new tasks quickly at meta-test time, when a user-specified reward function is actually provided. This generic design for an unsupervised meta-reinforcement learning algorithm is summarized in Figure 1.
The nature of the task distribution defined by $p(z)$ and $r_z(s, a)$ will affect the effectiveness of $f$ on new tasks: tasks that are close to this distribution will be easiest to learn, while tasks that are far from this distribution will be difficult to learn. However, the nature of the meta-learning algorithm itself will also crucially affect the effectiveness of $f$. As we will discuss in the following sections, some meta-reinforcement learning algorithms can generalize effectively to new tasks, while some cannot. A more general version of this algorithm might also use $f$ to inform the acquisition of tasks, allowing for an alternating optimization procedure that iterates between learning $r_z(s, a)$ and updating $f$, for example by designing tasks that are difficult for the current algorithm to handle. However, in this paper we will consider the stagewise approach, which acquires a task distribution once and meta-trains on it, leaving the iterative variant for future work.
Why might we expect unsupervised meta-reinforcement learning to acquire an algorithm that improves on any standard, generic, hand-designed reinforcement learning procedure? On the one hand, the “no free lunch theorem” (Wolpert et al., 1995; Whitley and Watson, 2005) might lead us to expect that a truly generic approach to learning a task distribution (for example, by sampling completely random reward functions) would not yield a learning procedure that is effective on any real tasks, or even on the meta-training tasks, if they are truly sampled at random. However, the specific choice for the unsupervised learning procedure and meta-learning algorithm can easily impose an inductive bias on the entire process that does produce a useful algorithm $f$. As we will discuss below, we can identify specific choices for the task acquisition and meta-learning procedures that are generic, in the sense that they can be applied to a wide range of CMPs, but also contain enough inductive bias to meta-learn useful reinforcement learning procedures. We discuss specific choices for each of these procedures below, followed by a more general discussion of potential future choices for these procedures and the criteria that they should satisfy. We empirically validate these claims in Section 4.
3.2 Unsupervised Task Acquisition
An effective unsupervised meta-RL algorithm requires a method to acquire task distributions for an environment. We consider two concrete possibilities for such a procedure in this paper, though many other options are also possible for this stage.
Task acquisition via random discriminators.
A simple and surprisingly effective way to define arbitrary task distributions is to use random discriminators on states. Given a uniformly distributed random variable $z \sim p(z)$, we can define a random discriminator as a parametric function $D_{\theta_{\text{rand}}}(z \mid s)$, where the parameters $\theta_{\text{rand}}$ are chosen randomly (e.g., a random weight initialization for a neural network). The discriminator observes a state $s$ and outputs the probabilities for a categorical random variable $z$. The random discriminator draws random decision boundaries in state space. A reward function $r_z(s)$ can be extracted as $r_z(s) = \log D_{\theta_{\text{rand}}}(z \mid s)$. Note that this is not a random RL objective: the induced RL objective is affected by the inductive bias in the network and mediated by the CMP dynamics distribution. In our experiments, we find that random discriminators are able to acquire useful task distributions for simple tasks, but are not as effective as the tasks become more complicated.
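The following sketch illustrates how a reward can be read off a randomly initialized discriminator. It is an illustration under stated assumptions rather than the implementation used in the experiments: the discriminator here is a single random linear layer followed by a softmax (the text only requires some randomly initialized parametric function), and the names `make_random_discriminator` and `task_reward` are hypothetical.

```python
import math
import random


def make_random_discriminator(state_dim, num_tasks, seed=0):
    """Return a function mapping a state to categorical probabilities over z.

    Assumption: a random linear-softmax model stands in for a randomly
    initialized neural network.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(num_tasks)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(num_tasks)]

    def discriminator(state):
        logits = [sum(w * x for w, x in zip(row, state)) + b
                  for row, b in zip(weights, biases)]
        top = max(logits)                      # subtract max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    return discriminator


def task_reward(discriminator, z):
    """Reward for task z: log-probability the discriminator assigns to z in a state."""
    return lambda state: math.log(discriminator(state)[z])


D = make_random_discriminator(state_dim=2, num_tasks=4)
z = random.Random(1).randrange(4)              # z drawn from a uniform categorical prior
r_z = task_reward(D, z)
probs = D([0.5, -1.0])
```

Each sampled `z` gives a different reward function over the same states, so the prior over `z` together with the frozen random parameters defines a task distribution without any environment reward.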
Task acquisition via diversity-driven exploration.
We can acquire more varied tasks if we allow ourselves some amount of unsupervised environment interaction. Specifically, we consider a recently proposed method for unsupervised skill discovery, Diversity is All You Need (DIAYN) (Eysenbach et al., 2018), for task acquisition. DIAYN attempts to acquire a set of behaviors that are distinguishable from one another, in the sense that they visit distinct states, while maximizing conditional policy entropy to encourage diversity (Haarnoja et al., 2018). Skills with high entropy that remain discriminable must explore a part of the state space far away from other skills. Formally, DIAYN learns a latent-conditioned policy $\pi(a \mid s, z)$, with $z \sim p(z)$, where different values of $z$ induce different skills. The training process promotes discriminable skills by maximizing the mutual information between skills and states ($I(s; z)$), while also maximizing the policy entropy $\mathcal{H}[a \mid s, z]$:
$$\mathcal{F}(\theta) \triangleq I(s; z) + \mathcal{H}[a \mid s, z] = \mathcal{H}[z] - \mathcal{H}[z \mid s] + \mathcal{H}[a \mid s, z] \qquad (1)$$
A learned discriminator $D_{\theta_D}(z \mid s)$ maximizes a variational lower bound on Equation 1 (see Eysenbach et al. (2018) for a proof). We train the discriminator to predict the latent variable from the observed state, and optimize the latent-conditioned policy to maximize the log-likelihood of the discriminator correctly classifying states which are visited under different skills, while maximizing policy entropy. Under this formulation, we can think of the discriminator as rewarding the policy for producing discriminable skills, and the policy visitations as informing the training of the discriminator.
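The relationship between the mutual information term and the discriminator can be checked numerically on a small example. The sketch below assumes a hypothetical joint distribution over 3 states and 2 skills (the numbers are made up for illustration). It verifies that the mutual information between skills and states equals the prior entropy of the skill minus its conditional entropy given the state, and that replacing the true posterior over skills with an arbitrary discriminator gives a lower bound that is tight only for the exact posterior.

```python
import math

# Hypothetical joint distribution p(s, z) over 3 states and 2 skills.
joint = {
    (0, 0): 0.30, (0, 1): 0.05,
    (1, 0): 0.10, (1, 1): 0.15,
    (2, 0): 0.10, (2, 1): 0.30,
}


def marginal(table, axis):
    """Sum the joint table over the other variable."""
    out = {}
    for key, p in table.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


p_s = marginal(joint, 0)
p_z = marginal(joint, 1)

# Entropy of the skill prior and conditional entropy of the skill given the state.
H_z = -sum(p * math.log(p) for p in p_z.values())
H_z_given_s = -sum(p * math.log(p / p_s[s]) for (s, z), p in joint.items())
mutual_info = H_z - H_z_given_s


def lower_bound(q):
    """H[z] + E[log q(z | s)] for a discriminator given as q[s][z]."""
    return H_z + sum(p * math.log(q[s][z]) for (s, z), p in joint.items())


exact_posterior = {s: {z: joint[(s, z)] / p_s[s] for z in p_z} for s in p_s}
uniform_guess = {s: {z: 0.5 for z in p_z} for s in p_s}
```

With the exact posterior the bound matches `mutual_info`; the uninformative `uniform_guess` gives a strictly smaller value, which is why improving the discriminator tightens the objective that the policy is trained against.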
After learning the policy and discriminator, we can sample tasks by generating samples $z \sim p(z)$ and using the corresponding task reward $r_z(s) = \log D_{\theta_D}(z \mid s)$, where $D_{\theta_D}$ is the learned discriminator. Compared to random discriminators, the tasks acquired by DIAYN are more likely to involve visiting diverse parts of the state space, potentially providing both a greater challenge to the corresponding policy, and achieving better coverage of the CMP’s state space. This method is still fully unsupervised, as it requires no handcrafting of distance metrics or subgoals, and does not require training a generative model to generate goals (Held et al., 2017).
3.3 Meta-Reinforcement Learning with Acquired Task Distributions
Once we have acquired a distribution of tasks, either randomly or through unsupervised exploration, we must choose a meta-learning algorithm to acquire the adaptation procedure from this task distribution. Which meta-learning algorithm is best suited for this problem? To formalize the typical meta-reinforcement learning problem, we assume that tasks $\mathcal{T}_i$ are drawn from a manually specified task distribution $p(\mathcal{T})$, provided by the algorithm designer. Each task is a different MDP $M_i = (\mathcal{S}, \mathcal{A}, P, \gamma, \rho, R_i)$. The goal of meta-RL is to learn a reinforcement learning algorithm $f$ that can learn quickly on novel tasks drawn from $p(\mathcal{T})$. In contrast, in our problem setting we acquire the task distribution in a completely unsupervised manner.
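A particularly appealing choice for the meta-learning algorithm is model-agnostic meta-learning (Finn et al., 2017a), which trains a model that can adapt quickly to new tasks with standard gradient descent. In RL, this corresponds to the policy gradient, which means that $f$ simply runs policy gradient starting from the meta-learned initial parameters $\theta$. The meta-training objective for MAML is
$$\max_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathbb{E}_{\pi_{\theta'}}\Big[\sum_t R_i(s_t)\Big], \qquad \theta' = \theta + \alpha \, \mathbb{E}_{\pi_{\theta}}\Big[\sum_t R_i(s_t) \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\Big] \qquad (2)$$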
The rationale behind this objective is that, since the policy will be adapted at meta-test time to new tasks using policy gradient, we can optimize the policy parameters so that one step of policy gradient improves its performance on any meta-training task as much as possible. MAML learns a data-driven initialization that makes standard reinforcement learning fast on tasks drawn from the task distribution $p(\mathcal{T})$. Importantly, MAML uses standard RL via policy gradient to adapt to new tasks, ensuring that we can continuously keep improving on new tasks, even when those tasks lie outside the meta-training distribution. Prior work has observed that meta-learning with policy gradient improves extrapolation over meta-learners that learn the entire adaptation procedure, e.g., using a recurrent network (Finn and Levine, 2018). Generalization to out-of-distribution samples is especially important for unsupervised meta-reinforcement learning methods because the actual task we might want to adapt to at meta-test time will almost certainly be out-of-distribution. For tasks that are too far outside of the meta-training set, MAML simply reverts to gradient-based RL. Other algorithms could also be used here, as discussed in Section 3.5.
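The idea of optimizing an initialization so that one policy gradient step helps as much as possible can be illustrated on a toy problem. The sketch below is not the experimental setup of this work. It assumes a softmax policy over three bandit arms (so the policy gradient is available in closed form), two hypothetical meta-training tasks that each reward a different arm, and, as a deliberate simplification, a meta-gradient estimated by central finite differences instead of differentiating through the inner update as MAML does in practice. All function names are hypothetical.

```python
import math


def softmax(theta):
    top = max(theta)
    exps = [math.exp(t - top) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(theta, rewards):
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def policy_gradient(theta, rewards):
    """Exact gradient of the expected reward for a softmax bandit policy."""
    probs = softmax(theta)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - baseline) for p, r in zip(probs, rewards)]


def adapt(theta, rewards, alpha):
    """Inner step: one policy gradient update on a single task."""
    grad = policy_gradient(theta, rewards)
    return [t + alpha * g for t, g in zip(theta, grad)]


def meta_objective(theta, tasks, alpha):
    """Sum over tasks of the expected reward after one adaptation step."""
    return sum(expected_reward(adapt(theta, r, alpha), r) for r in tasks)


def meta_gradient(theta, tasks, alpha, eps=1e-5):
    """Assumption: finite differences replace backpropagation through adapt()."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        diff = meta_objective(up, tasks, alpha) - meta_objective(down, tasks, alpha)
        grad.append(diff / (2 * eps))
    return grad


def meta_train(theta, tasks, alpha=1.0, beta=0.5, steps=200):
    """Outer loop: gradient ascent on the post-adaptation objective."""
    for _ in range(steps):
        g = meta_gradient(theta, tasks, alpha)
        theta = [t + beta * gk for t, gk in zip(theta, g)]
    return theta


# Two hypothetical tasks: arm 0 or arm 1 pays off; arm 2 never does.
tasks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
theta_init = [0.0, 0.0, 0.0]
theta_meta = meta_train(theta_init, tasks)
before = meta_objective(theta_init, tasks, alpha=1.0)
after = meta_objective(theta_meta, tasks, alpha=1.0)
```

In this toy setting the meta-learned initialization moves probability away from the arm that no task rewards, so a single adaptation step on either task reaches a higher expected reward than the same step taken from the uniform initialization. Adaptation itself remains plain policy gradient, which is the property the text relies on for out-of-distribution tasks.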
3.4 Practical Algorithm Implementation
A practical unsupervised meta-reinforcement learning algorithm can be summarized as follows. We first acquire a task distribution using unsupervised exploration (e.g., random discriminators or the DIAYN algorithm, as discussed in Section 3.2). We can sample from this task distribution by first sampling a random variable $z \sim p(z)$, and then using the reward induced by the resulting discriminator, $r_z(s) = \log D_{\theta}(z \mid s)$, to update our policy. Having defined a procedure for sampling tasks, we perform gradient-based meta-learning with MAML on this distribution until convergence. The resulting meta-learned policy is then able to adapt quickly to new tasks in the environment via standard policy gradient (Section 4) without requiring additional meta-training supervision.
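The stagewise structure described above can be written down as a short skeleton. This is a sketch under stated assumptions, not the authors' implementation: `unsupervised_meta_rl` and its arguments are hypothetical names, the discriminator is a hand-written stand-in on a 1-D state rather than a trained DIAYN model, and `toy_meta_learner` is a crude placeholder that only marks where MAML would run.

```python
import math
import random


def unsupervised_meta_rl(acquire_reward_family, sample_latent, meta_learner,
                         num_tasks, seed=0):
    """Stagewise recipe: acquire a task distribution once, then meta-train on it."""
    rng = random.Random(seed)
    reward_for = acquire_reward_family()                      # stage 1: reward-free
    latents = [sample_latent(rng) for _ in range(num_tasks)]  # sample z from the prior
    tasks = [reward_for(z) for z in latents]                  # induced reward per z
    return meta_learner(tasks)                                # stage 2: e.g. MAML


NUM_SKILLS = 4


def toy_reward_family():
    """Hypothetical stand-in for a learned discriminator on a 1-D state."""
    def discriminator(state):
        scores = [math.exp(-(state - k) ** 2) for k in range(NUM_SKILLS)]
        total = sum(scores)
        return [x / total for x in scores]

    def reward_for(z):
        # Reward is the log-probability the discriminator assigns to skill z.
        return lambda state: math.log(discriminator(state)[z])

    return reward_for


def toy_meta_learner(tasks):
    """Placeholder for MAML: pick the start state with the best mean task reward."""
    grid = [i / 10 for i in range(31)]
    return max(grid, key=lambda s: sum(r(s) for r in tasks) / len(tasks))


init_state = unsupervised_meta_rl(
    toy_reward_family,
    sample_latent=lambda rng: rng.randrange(NUM_SKILLS),
    meta_learner=toy_meta_learner,
    num_tasks=8,
)
```

No user-specified reward enters any of these calls; it would appear only afterwards, when the returned initialization is adapted to a meta-test task.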
3.5 Which Unsupervised and Meta-Learning Procedures Should Work Well?
Having introduced example instantiations of unsupervised meta-reinforcement learning, we discuss more generally what criteria each of the two procedures, task acquisition and meta-reinforcement learning, should satisfy. What makes a good task acquisition procedure for unsupervised meta-reinforcement learning? Several criteria are desirable. First, we want the tasks that are learned to resemble the types of tasks that might be present at meta-test time. DIAYN receives no supervision in this regard, basing its task acquisition entirely on the dynamics of the CMP. A more guided approach could incorporate a limited number of human-specified tasks, or manually provided guidance about valuable state space regions. Without any prior knowledge, we expect the ideal task distribution to induce a wide distribution over trajectories. As many distinct reward functions can have the same optimal policy, a random discriminator may actually result in a narrow distribution of optimal trajectories. In contrast, … Unsupervised task acquisition procedures like DIAYN, which mediate the task acquisition process via interactions with the environment (which imposes dynamically consistent structure on the tasks), are likely to yield better results than random task generation. The comparison to the random discriminator in our experiments sheds light on how important a learned task distribution is for this: while random and learned discriminators perform comparably on simple tasks, the learned discriminator performs significantly better on more complex tasks.
In the absence of any mechanism that constrains the meta-training task distribution to resemble the meta-test distribution (which is unknown), we prefer methods that retain convergence guarantees, performing no worse than standard reinforcement learning algorithms that learn from scratch. Conveniently, gradient-based methods such as MAML gracefully revert to standard, convergent, reinforcement learning procedures on out-of-distribution tasks. Additionally, unlike methods which restrict the space for adaptation using latent-conditioned policies such as DIAYN (Eysenbach et al., 2018), gradient-based meta-learning does not lose policy expressivity because all policy parameters are being adapted.
We might then ask what kind of knowledge could possibly be “baked” into $f$ during meta-training. There are two sources of knowledge that can be acquired. First, a meta-learning procedure like MAML modifies the initial parameters $\theta$ of a policy $\pi_{\theta}(a \mid s)$. When $\pi_{\theta}(a \mid s)$ is represented by an expressive function class like a neural network, the initial setting of these parameters strongly affects how quickly the policy can be trained by gradient descent. Indeed, this is the rationale behind research into more effective general-purpose initialization methods Koturwar and Merchant (2017); Xie et al. (). Meta-training a policy essentially learns an effective weight initialization such that a few gradient steps can effectively modify the policy in functionally relevant ways.
Second, the policy found by unsupervised meta-training also acquires an awareness of the dynamics of the given controlled Markov process (CMP). Intuitively, an ideal policy should adapt in the space of trajectories $\tau$, rather than the space of actions $a$ or parameters $\theta$; an RL update should modify the policy’s trajectory distribution, which determines the reward it obtains. Natural gradient algorithms impose equal-sized steps in the space of action distributions (Schulman et al., 2015), but this is not necessarily the ideal adaptation manifold, since systematic changes in output actions do not necessarily translate into systematic changes in trajectory or state distributions. In effect, meta-learning prepares the policy to modify its behavior in ways that cogently affect the states that are visited, which requires a parameter setting informed by the dynamics of the CMP. This can be provided effectively through unsupervised meta-reinforcement learning.
4 Experimental Evaluation


In our experiments, we aim to understand whether unsupervised meta-learning can accelerate reinforcement learning of new tasks. Whereas standard meta-learning requires a hand-specified task distribution at meta-training time, unsupervised meta-learning learns the task distribution through unsupervised interaction with the environment. A fair baseline that likewise requires no supervision is learning via RL from scratch without any meta-learning. As an upper bound, we include the unfair comparison to a standard meta-learning approach, where the meta-training distribution is manually designed. This method has access to a hand-specified task distribution that is not available to our method. We evaluate two variants of our approach: (a) task acquisition based on DIAYN followed by meta-learning using MAML, and (b) task acquisition using a randomly initialized discriminator followed by meta-learning using MAML. Our experiments aim to answer the following questions: (1) Does unsupervised meta-learning accelerate learning of unseen tasks? (2) How does unsupervised meta-learning compare to meta-learning on a hand-specified task distribution? (3) When should unsupervised meta-learning with a learned task distribution be preferred over meta-learning with a random discriminator? This last question sheds some light on the effect of task acquisition inductive bias on final reinforcement learning performance.
4.1 Tasks and Implementation Details
Our experiments study three simulated environments of increasing difficulty: 2D point navigation, 2D locomotion using the “HalfCheetah,” and 3D locomotion using the “Ant,” with the latter two environments adapted from popular reinforcement learning benchmarks (Duan et al., 2016a). While the 2D navigation environment allows for direct control of position, HalfCheetah and Ant can only control their center of mass via feedback control with high dimensional actions (6D for HalfCheetah, 8D for Ant) and observations (17D for HalfCheetah, 111D for Ant).
4.2 Fast Adaptation after Unsupervised Meta-Learning



The comparison between the two variants of unsupervised meta-learning and learning from scratch is shown in Figure 2, and we compare to handcrafted task distributions in Figure 3. We observe in all cases that unsupervised meta-learning produces an RL procedure that substantially outperforms reinforcement learning from scratch, suggesting that unsupervised interaction with the environment and meta-learning is effective in producing environment-specific but task-agnostic priors that accelerate learning on new, previously unseen tasks. Interestingly, in all cases the performance of unsupervised meta-learning with DIAYN matches or exceeds that of the hand-designed task distribution (Figure 3). We see that on the 2D navigation task, while handcrafted meta-learning is able to learn very quickly initially, it performs similarly after 100 steps. For the cheetah environment as well, handcrafted meta-learning is able to learn very quickly to start off, but is overtaken by unsupervised meta-RL with DIAYN. We also see on the HalfCheetah that, if we meta-test using an initialization learned with a slightly different task distribution, performance degrades to below that of our approach. This result confirms that unsupervised environment interaction can extract a sufficiently diverse set of tasks to make unsupervised meta-learning useful.


The comparison between the two unsupervised meta-learning variants is also illuminating: while the DIAYN-based variant of our method generally achieves the best performance, even the random discriminator is able to provide a sufficient diversity of tasks to produce meaningful acceleration over learning from scratch in the case of 2D navigation and Ant. This result has two interesting implications. First, it suggests that unsupervised meta-learning is an effective tool for learning an environment prior, even when the meta-training task distribution does not necessarily broadly cover the state space. Although the performance of unsupervised meta-learning can be improved with better coverage using DIAYN (as seen in Figure 2), even the random discriminator version provides competitive advantages over learning from scratch. Second, the comparison provides a clue for identifying the source of the structure learned through unsupervised meta-learning: though the particular task distribution has an effect on performance, simply interacting with the environment (without structured objectives, using a random discriminator) already allows meta-RL to learn effective adaptation strategies in a given environment. That is, the performance cannot be explained only by the unsupervised procedure (DIAYN) capturing the right task distribution.
4.3 Analysis of Learned Task Distributions



We can analyze the tasks discovered through unsupervised exploration and compare them to tasks we evaluate on at meta-test time. Figure 4 illustrates these distributions using scatter plots for 2D navigation and the Ant, and a histogram for the HalfCheetah. Note that we visualize dimensions of the state that are relevant for the evaluation tasks (positions and velocities), but these dimensions are not specified in any way during unsupervised task acquisition, which operates on the entire state space. Although the tasks proposed via unsupervised exploration provide fairly broad coverage, they are clearly quite distinct from the meta-test tasks, suggesting the approach can tolerate considerable distributional shift. Qualitatively, many of the tasks proposed via unsupervised exploration, such as jumping and falling, are not relevant for the evaluation tasks. Our choice of the evaluation tasks was largely based on prior work, and therefore not tailored to this exploration procedure. The results for unsupervised meta-reinforcement learning therefore suggest quite strongly that unsupervised task acquisition can provide an effective meta-training set, at least for MAML, even when evaluating on tasks that do not closely match the discovered task distribution.
5 Discussion and Future Work
We presented an unsupervised approach to meta-reinforcement learning, where meta-learning is used to acquire an efficient reinforcement learning procedure without requiring hand-specified task distributions for meta-training. This approach accelerates RL without relying on the manual supervision required for conventional meta-learning algorithms. Our experiments indicate that unsupervised meta-RL can accelerate learning on a range of tasks, outperforming learning from scratch and often matching the performance of meta-learning from hand-specified task distributions.
As our work is the first foray into unsupervised meta-learning, our approach opens a number of questions about unsupervised meta-learning algorithms. While we focus on purely unsupervised task proposal mechanisms, it is straightforward to incorporate minimally informative priors into this procedure. For example, we might restrict the learned reward functions to operate on only part of the state. We consider the reinforcement learning setting in our work because environment interaction mediates the unsupervised learning process, ensuring that there is something to learn even without access to task reward. An interesting direction to study in future work is the extension of unsupervised meta-learning to domains such as supervised classification, which might hold the promise of developing new unsupervised learning procedures powered by meta-learning.
Acknowledgements.
This work was supported by two NSF Graduate Research Fellowships, NSF IIS-1651843, the Office of Naval Research, and NVIDIA. We thank Ignasi Clavera and Gregory Kahn for insightful discussions and feedback.
References
 Al-Shedivat et al. (2017) M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel. Continuous adaptation via metalearning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.
 Andrychowicz et al. (2016) M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Neural Information Processing Systems (NIPS), 2016.
 Andrychowicz et al. (2017) M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.
 Bengio et al. (1992) S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei. On the optimization of a synaptic learning rule. In Optimality in Artificial and Biological Neural Networks, 1992.
 Duan et al. (2016a) Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016a.
 Duan et al. (2016b) Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016b.
 Eysenbach et al. (2018) B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
 Finn and Levine (2018) C. Finn and S. Levine. Metalearning and universality: Deep representations and gradient descent can approximate any learning algorithm. International Conference on Learning Representations, 2018.
 Finn et al. (2017a) C. Finn, P. Abbeel, and S. Levine. Modelagnostic metalearning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017a.
 Finn et al. (2017b) C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. Oneshot visual imitation learning via metalearning. CoRR, abs/1709.04905, 2017b. URL http://arxiv.org/abs/1709.04905.
 Forestier et al. (2017) S. Forestier, Y. Mollard, and P.-Y. Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.
 Graves et al. (2017) A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003, 2017.
 Gregor et al. (2016) K. Gregor, D. J. Rezende, and D. Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.
 Gupta et al. (2018) A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine. Metareinforcement learning of structured exploration strategies. arXiv preprint arXiv:1802.07245, 2018.
 Haarnoja et al. (2018) T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 Held et al. (2017) D. Held, X. Geng, C. Florensa, and P. Abbeel. Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366, 2017.
 Hochreiter et al. (2001) S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, 2001.
 Houthooft et al. (2018) R. Houthooft, R. Y. Chen, P. Isola, B. C. Stadie, F. Wolski, J. Ho, and P. Abbeel. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.
 Koturwar and Merchant (2017) S. Koturwar and S. Merchant. Weight initialization of deep neural networks (DNNs) using data statistics. CoRR, abs/1710.10570, 2017. URL http://arxiv.org/abs/1710.10570.
 Krähenbühl et al. (2015) P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Datadependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
 Li and Malik (2017) K. Li and J. Malik. Learning to optimize. International Conference on Learning Representations (ICLR), 2017.
 Metz et al. (2018) L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein. Learning unsupervised learning rules. arXiv preprint arXiv:1804.00222, 2018.
 Mishra et al. (2017) N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive metalearner. In NIPS 2017 Workshop on MetaLearning, 2017.
 Munkhdalai and Yu (2017) T. Munkhdalai and H. Yu. Meta networks. International Conference on Machine Learning (ICML), 2017.
 Naik and Mammone (1992) D. K. Naik and R. Mammone. Metaneural networks that learn by learning. In International Joint Conference on Neural Networks (IJCNN), 1992.
 Pong et al. (2018) V. Pong, S. Gu, M. Dalal, and S. Levine. Temporal difference models: Modelfree deep rl for modelbased control. arXiv preprint arXiv:1802.09081, 2018.
 Ravi and Larochelle (2017) S. Ravi and H. Larochelle. Optimization as a model for fewshot learning. In International Conference on Learning Representations (ICLR), 2017.
 Santoro et al. (2016) A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Metalearning with memoryaugmented neural networks. In International Conference on Machine Learning (ICML), 2016.
 Schaul et al. (2015) T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320, 2015.
 Schmidhuber (1987) J. Schmidhuber. Evolutionary principles in selfreferential learning, or on learning how to learn: the metameta… hook. PhD thesis, Technische Universität München, 1987.
 Schulman et al. (2015) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 Snell et al. (2017) J. Snell, K. Swersky, and R. Zemel. Prototypical networks for fewshot learning. In Advances in Neural Information Processing Systems, pages 4080–4090, 2017.
 Stadie et al. (2018) B. C. Stadie, G. Yang, R. Houthooft, X. Chen, Y. Duan, Y. Wu, P. Abbeel, and I. Sutskever. Some considerations on learning to explore via metareinforcement learning. CoRR, abs/1803.01118, 2018. URL http://arxiv.org/abs/1803.01118.
 Sukhbaatar et al. (2017) S. Sukhbaatar, Z. Lin, I. Kostrikov, G. Synnaeve, A. Szlam, and R. Fergus. Intrinsic motivation and automatic curricula via asymmetric selfplay. arXiv preprint arXiv:1703.05407, 2017.
 Sung et al. (2017) F. Sung, L. Zhang, T. Xiang, T. Hospedales, and Y. Yang. Learning to learn: Metacritic networks for sample efficient learning. arXiv preprint arXiv:1706.09529, 2017.
 Thrun and Pratt (1998) S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 1998.
 Wang et al. (2016) J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
 Whitley and Watson (2005) D. Whitley and J. P. Watson. Complexity theory and the no free lunch theorem, 2005.
 Wolpert et al. (1995) D. H. Wolpert, W. G. Macready, et al. No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute, 1995.
 Xie et al. (2017) D. Xie, J. Xiong, and S. Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. doi: 10.1109/CVPR.2017.539. URL https://doi.org/10.1109/CVPR.2017.539.