ES-MAML: Simple Hessian-Free Meta Learning
Columbia University
UC Berkeley
Abstract
We introduce ES-MAML, a new framework for solving the model agnostic meta learning (MAML) problem based on Evolution Strategies (ES). Existing algorithms for MAML are based on policy gradients, and incur significant difficulties when attempting to estimate second derivatives using backpropagation on stochastic policies. We show how ES can be applied to MAML to obtain an algorithm which avoids the problem of estimating second derivatives, and is also conceptually simple and easy to implement. Moreover, ES-MAML can handle new types of nonsmooth adaptation operators, and other techniques for improving performance and estimation of ES methods become applicable. We show empirically that ES-MAML is competitive with existing methods and often yields better adaptation with fewer queries.
1 Introduction
Meta-learning is a paradigm in machine learning which aims to develop models and training algorithms which can quickly adapt to new tasks and data. Our focus in this paper is on meta-learning in reinforcement learning (RL), where data efficiency is of paramount importance because gathering new samples often requires costly simulations or interactions with the real world. A popular technique for RL meta-learning is Model Agnostic Meta Learning (MAML) (Finn et al., 2017, 2018), a model for training an agent (the meta-policy) which can quickly adapt to new and unknown tasks by performing one (or a few) gradient updates in the new environment. We provide a formal description of MAML in Section 2.
MAML has proven to be successful for many applications. However, implementing and running MAML continues to be challenging. One major complication is that the standard version of MAML requires estimating second derivatives of the RL reward function, which is difficult when using backpropagation on stochastic policies; indeed, the original implementation of MAML (Finn et al., 2017) did so incorrectly, which spurred the development of unbiased higher-order estimators (DiCE (Foerster et al., 2018)) and further analysis of the credit assignment mechanism in MAML (Rothfuss et al., 2019). Another challenge arises from the high variance inherent in policy gradient methods, which can be ameliorated through control variates such as in T-MAML (Liu et al., 2019), through careful adaptive hyperparameter tuning (Behl et al., 2019; Antoniou et al., 2019), and through learning rate annealing (Loshchilov and Hutter, 2017).
To avoid these issues, we propose an alternative approach to MAML based on Evolution Strategies (ES), as opposed to the policy gradient underlying previous MAML algorithms. We provide a detailed discussion of ES in Section 3.1. ES has several advantages:

1. Our zero-order formulation of ES-MAML (Section 3.2, Algorithm 3) does not require estimating any second derivatives. This dodges the many issues caused by estimating second derivatives with backpropagation on stochastic policies (see Section 2 for details).
2. ES is conceptually much simpler than policy gradients, which also translates to ease of implementation. It does not use backpropagation, so it can be run on CPUs only.
3. ES is highly flexible with different adaptation operators (Section 3.3).
4. ES allows us to use deterministic policies, which can be safer when doing adaptation (Section 4.3). ES is also capable of learning linear and other compact policies (Section 4.2).
Regarding point (4), a feature of ES algorithms is that exploration takes place in the parameter space. Whereas policy gradient methods are primarily motivated by interactions with the environment through randomized actions, ES is driven by optimization in high-dimensional parameter spaces with an expensive querying model. In the context of MAML, the notions of “exploration” and “task identification” have thus been shifted to the parameter space instead of the action space. This distinction plays a key role in the stability of the algorithm. One immediate implication is that we can use deterministic policies, unlike policy gradient methods, which are based on stochastic policies. Another difference is that ES uses only the total reward and not the individual state-action pairs within each episode. While this may appear to be a weakness, since less information is being used, we find in practice that it seems to lead to more stable training profiles.
This paper is organized as follows. In Section 2, we give a formal definition of MAML, and discuss related works. In Section 3, we introduce Evolution Strategies and show how ES can be applied to create a new framework for MAML. In Section 4, we present numerical experiments, highlighting the topics of exploration (Section 4.1), the utility of compact architectures (Section 4.2), the stability of deterministic policies (Section 4.3), and comparisons against existing MAML algorithms in the few-shot regime (Section 4.4). Additional material can be found in the Appendix.
2 Model Agnostic Meta Learning in RL
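We first discuss the original formulation of MAML (Finn et al., 2017). Let $\mathcal{T}$ be a set of reinforcement learning tasks with common state and action spaces $\mathcal{S}, \mathcal{A}$, and $\mathcal{P}(\mathcal{T})$ a distribution over $\mathcal{T}$. In the standard MAML setting, each task $T_i \in \mathcal{T}$ has an associated Markov Decision Process (MDP) with transition distribution $q_i(s_{t+1} \mid s_t, a_t)$, an episode length $H$, and a reward function $R_i$ which maps a trajectory $\tau$ to the total reward $R_i(\tau)$. A stochastic policy is a function $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$ which maps states to probability distributions over the action space. A deterministic policy is a function $\pi: \mathcal{S} \to \mathcal{A}$. Policies are typically encoded by a neural network with parameters $\theta \in \mathbb{R}^d$, and we often refer to the policy $\pi_\theta$ simply by $\theta$.
The MAML problem is to find the so-called MAML point (also called a meta-policy), which is a policy $\theta^*$ that can be ‘adapted’ quickly to solve an unknown task $T \in \mathcal{T}$ by taking a (few) policy gradient steps with respect to $T$. (We adopt the common convention of defining the adaptation operator with a single gradient step to simplify notation, but it can readily be extended to taking multiple steps.) The optimization problem to be solved in training (in its one-shot version) is thus of the form:
$$\max_\theta J(\theta) := \mathbb{E}_{T \sim \mathcal{P}(\mathcal{T})}\left[\mathbb{E}_{\tau' \sim \mathcal{P}_T(\tau' \mid \theta')}\left[R(\tau')\right]\right], \tag{1}$$
where $\theta' = U(\theta, T) = \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \sim \mathcal{P}_T(\tau \mid \theta)}[R(\tau)]$ is called the adapted policy for a step size $\alpha > 0$, $R$ denotes the reward function of the sampled task $T$, and $\mathcal{P}_T(\cdot \mid \eta)$ is a distribution over trajectories given task $T \in \mathcal{T}$ and conditioned on the policy parameterized by $\eta$.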
Standard MAML approaches are based on the following expression for the gradient of the MAML objective function (1) to conduct training:
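$$\nabla_\theta J(\theta) = \mathbb{E}_{T \sim \mathcal{P}(\mathcal{T})}\left[\mathbb{E}_{\tau' \sim \mathcal{P}_T(\tau' \mid \theta')}\left[\nabla_{\theta'} \log \mathcal{P}_T(\tau' \mid \theta')\, R(\tau')\, \nabla_\theta U(\theta, T)\right]\right]. \tag{2}$$
We collectively refer to algorithms based on computing (2) using policy gradients as PG-MAML.
Since the adaptation operator $U(\theta, T)$ contains the policy gradient $\nabla_\theta \mathbb{E}_{\tau \sim \mathcal{P}_T(\tau \mid \theta)}[R(\tau)]$, its own gradient $\nabla_\theta U(\theta, T)$ is second-order in $\theta$:
$$\nabla_\theta U(\theta, T) = I + \alpha \int \mathcal{P}_T(\tau \mid \theta)\, \nabla_\theta^2 \log \pi_\theta(\tau)\, R(\tau)\, d\tau + \alpha \int \mathcal{P}_T(\tau \mid \theta)\, \nabla_\theta \log \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)^\top R(\tau)\, d\tau. \tag{3}$$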
Correctly computing the gradient (2) with the term (3) using automatic differentiation is known to be tricky. Multiple authors (Foerster et al., 2018; Rothfuss et al., 2019; Liu et al., 2019) have pointed out that the original implementation of MAML incorrectly estimates the term (3), which inadvertently causes the training to lose ‘pre-adaptation credit assignment’. Moreover, even when correctly implemented, the variance when estimating (3) can be extremely high, which impedes training. To improve on this, extensions to the original MAML include ProMP (Rothfuss et al., 2019), which introduces a new low-variance curvature (LVC) estimator for the Hessian, and T-MAML (Liu et al., 2019), which adds control variates to reduce the variance of the unbiased DiCE estimator (Foerster et al., 2018). However, these are not without their drawbacks: the proposed solutions are complicated, the variance of the Hessian estimate remains problematic, and LVC introduces unknown estimator bias.
Another issue that arises in PG-MAML is that policies are necessarily stochastic. However, randomized actions can lead to risky exploration behavior when computing the adaptation, especially for robotics applications where the collection of tasks may involve differing system dynamics as opposed to only differing rewards (Yang et al., 2019). We explore this further in Section 4.3.
These issues (the difficulty of estimating the Hessian term (3), the typically high variance of $\nabla_\theta J(\theta)$ for policy gradient algorithms in general, and the unsuitability of stochastic policies in some domains) lead us to the proposed method ES-MAML in Section 3.
Aside from policy gradients, there have also been biologically-inspired algorithms for MAML, based on concepts such as the Baldwin effect (Fernando et al., 2018). However, we note that despite the similar naming, methods such as ‘Evolvability ES’ (Gajewski et al., 2019) bear little resemblance to our proposed ES-MAML. The problem solved by our algorithm is the standard MAML, whereas (Gajewski et al., 2019) aims to maximize loosely related notions of the diversity of behavioral characteristics. Moreover, ES-MAML and the extensions we consider are all derived from notions such as smoothings and approximations, which have rigorous mathematical definitions as stated below.
3 ES-MAML Algorithms
Formulating MAML with ES allows us to apply to MAML numerous techniques originally developed for enhancing ES. We aim to improve both phases of the MAML algorithm: the meta-learning training algorithm, and the efficiency of the adaptation operator.
3.1 Evolution Strategies (ES)
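Evolution Strategies (ES), which recently received a revival in RL (Salimans et al., 2017), rely on optimizing the smoothing of the blackbox function $f: \mathbb{R}^d \to \mathbb{R}$, which takes as input the parameters $\theta \in \mathbb{R}^d$ of the policy and outputs the total discounted (expected) reward obtained by an agent applying that policy in the given environment. Instead of optimizing the function $f$ directly, we optimize a smoothed objective. We define the Gaussian smoothing of $f$ as $\tilde{f}_\sigma(\theta) = \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}[f(\theta + \sigma g)]$. The gradient of this smoothed objective, sometimes called an ES-gradient, is given as (see (Nesterov and Spokoiny, 2017)):
$$\nabla_\theta \tilde{f}_\sigma(\theta) = \frac{1}{\sigma}\, \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}\left[f(\theta + \sigma g)\, g\right]. \tag{4}$$
Note that the gradient can be approximated via Monte Carlo (MC) samples $g_1, \ldots, g_n \sim \mathcal{N}(0, I_d)$ as $\nabla_\theta \tilde{f}_\sigma(\theta) \approx \frac{1}{n\sigma} \sum_{i=1}^{n} f(\theta + \sigma g_i)\, g_i$ (Algorithm 1).
In the ES literature, the above algorithm is often modified by adding control variates to equation 4 to obtain other unbiased estimators with reduced variance. The forward finite difference (Forward-FD) estimator (Choromanski et al., 2018) is given by subtracting the current policy value $f(\theta)$, yielding $\nabla_\theta \tilde{f}_\sigma(\theta) = \frac{1}{\sigma}\, \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}[(f(\theta + \sigma g) - f(\theta))\, g]$. The antithetic estimator (Nesterov and Spokoiny, 2017; Mania et al., 2018) is given by the symmetric difference $\nabla_\theta \tilde{f}_\sigma(\theta) = \frac{1}{2\sigma}\, \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}[(f(\theta + \sigma g) - f(\theta - \sigma g))\, g]$. Notice that the variance of the Forward-FD and antithetic estimators is translation-invariant with respect to $f$. In practice, the Forward-FD or antithetic estimator is usually preferred over the basic version expressed in equation 4.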
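As a minimal sketch of the three estimators just described (vanilla, Forward-FD, and antithetic), the following pure-Python function computes an MC estimate of the ES-gradient. The names `es_gradient` and `toy_reward`, the quadratic toy objective, and all default values are illustrative assumptions and are not taken from the original algorithms.

```python
import random


def es_gradient(f, theta, sigma=0.1, n=100, estimator="antithetic", rng=None):
    """MC estimate of the gradient of the Gaussian smoothing of f at theta.

    estimator: "vanilla" (equation 4), "forward_fd", or "antithetic".
    """
    rng = rng or random.Random(0)
    d = len(theta)
    # The Forward-FD control variate needs one extra query at theta itself.
    f0 = f(theta) if estimator == "forward_fd" else 0.0
    grad = [0.0] * d
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        f_plus = f([t + sigma * gi for t, gi in zip(theta, g)])
        if estimator == "vanilla":
            weight = f_plus / sigma
        elif estimator == "forward_fd":
            weight = (f_plus - f0) / sigma
        elif estimator == "antithetic":
            f_minus = f([t - sigma * gi for t, gi in zip(theta, g)])
            weight = (f_plus - f_minus) / (2.0 * sigma)
        else:
            raise ValueError("unknown estimator: " + estimator)
        for i in range(d):
            grad[i] += weight * g[i] / n
    return grad


def toy_reward(theta):
    """Hypothetical smooth reward with its maximizer at (1, 1, 1)."""
    return -sum((t - 1.0) ** 2 for t in theta)


if __name__ == "__main__":
    # The exact gradient of the smoothed toy reward at the origin is (2, 2, 2).
    for name in ("vanilla", "forward_fd", "antithetic"):
        est = es_gradient(toy_reward, [0.0, 0.0, 0.0], n=2000,
                          estimator=name, rng=random.Random(1))
        print(name, [round(x, 2) for x in est])
```

The Forward-FD and antithetic estimates are unchanged when a constant is added to the reward (up to floating-point rounding), whereas the variance of the vanilla estimator grows with the magnitude of the reward. Note that the antithetic variant uses two queries per sampled direction.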
In the next sections we will refer to Algorithm 1 for computing the gradient, though we emphasize that there are several other recently developed variants of computing ES-gradients as well as applying them for optimization. We describe some of these variants in Section 3.3 and Appendix A.3. A key feature of ES-MAML is that we can directly make use of new enhancements of ES.
3.2 Meta-Training MAML with ES
To formulate MAML in the ES framework, we take a more abstract viewpoint. For each task $T \in \mathcal{T}$, let $f^T(\theta)$ be the (expected) cumulative reward of the policy $\theta$. We treat $f^T$ as a blackbox, and make no assumptions on its structure (so the task need not even be an MDP, and $f^T$ may be nonsmooth). The MAML problem is then
$$\max_\theta J(\theta) := \mathbb{E}_{T \sim \mathcal{P}(\mathcal{T})}\, f^T(U(\theta, T)). \tag{5}$$
As argued in (Liu et al., 2019; Rothfuss et al., 2019) (see also Section 2), a major challenge for policy gradient MAML is estimating the Hessian of $f^T$, which is both conceptually subtle and difficult to correctly implement using automatic differentiation. The algorithm we propose obviates the need to calculate any second derivatives, and thus avoids this issue.
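Suppose that we can evaluate (or approximate) $f^T(\theta)$ and $U(\theta, T)$, but $f^T$ and $U(\cdot, T)$ may be nonsmooth or their gradients may be intractable. We consider the Gaussian smoothing $\tilde{J}_\sigma(\theta) = \mathbb{E}_{g \sim \mathcal{N}(0, I)}[J(\theta + \sigma g)]$ of the MAML reward (5), and optimize $\tilde{J}_\sigma$ using ES methods. The gradient $\nabla \tilde{J}_\sigma(\theta)$ is given by
$$\nabla \tilde{J}_\sigma(\theta) = \mathbb{E}_{T \sim \mathcal{P}(\mathcal{T}),\, g \sim \mathcal{N}(0, I)}\left[\frac{1}{\sigma}\, f^T(U(\theta + \sigma g, T))\, g\right] \tag{6}$$
and can be estimated by jointly sampling over $(T, g)$ and evaluating $f^T(U(\theta + \sigma g, T))$. This algorithm is specified in Algorithm 2, and we refer to it as (zero-order) ES-MAML.
The standard adaptation operator $U(\cdot, T)$ is the one-step task gradient. Since $f^T$ is permitted to be nonsmooth in our setting, we use the adaptation operator $U(\theta, T) = \theta + \alpha \nabla \tilde{f}^T_\sigma(\theta)$ acting on its smoothing. Expanding the definition of $\tilde{J}_\sigma$, the gradient of the smoothed MAML is then given by
$$\nabla \tilde{J}_\sigma(\theta) = \frac{1}{\sigma}\, \mathbb{E}_{T \sim \mathcal{P}(\mathcal{T}),\, g \sim \mathcal{N}(0, I)}\left[f^T\!\left(\theta + \sigma g + \frac{\alpha}{\sigma}\, \mathbb{E}_{h \sim \mathcal{N}(0, I)}\left[f^T(\theta + \sigma g + \sigma h)\, h\right]\right) g\right]. \tag{7}$$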
This leads to the algorithm that we specify in Algorithm 3, where the adaptation operator is itself estimated using the ES gradient in the inner loop.
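The zero-order meta-update can be sketched in a few lines. The following is an illustration under stated assumptions rather than a reproduction of Algorithms 2 and 3: it uses antithetic estimators in both the inner and outer loops, a hypothetical pair of quadratic toy tasks, and made-up names (`es_adapt`, `es_maml_step`, `make_task`, `train`) and hyperparameters. No second derivatives appear anywhere, because the adapted reward is treated as a blackbox function of the meta-policy.

```python
import random


def perturb(theta, scale, direction):
    """Return theta + scale * direction (element-wise)."""
    return [t + scale * d for t, d in zip(theta, direction)]


def es_grad(f, theta, sigma, n_pairs, rng):
    """Antithetic ES estimate of the gradient of the smoothing of f."""
    grad = [0.0] * len(theta)
    for _ in range(n_pairs):
        g = [rng.gauss(0.0, 1.0) for _ in theta]
        w = (f(perturb(theta, sigma, g)) - f(perturb(theta, -sigma, g))) / (2 * sigma)
        grad = [gr + w * gi / n_pairs for gr, gi in zip(grad, g)]
    return grad


def es_adapt(f, theta, alpha, sigma, k, rng):
    """Adaptation operator: one ES-gradient ascent step on the task reward f.

    Uses k antithetic pairs, i.e. 2 * k reward queries.
    """
    return perturb(theta, alpha, es_grad(f, theta, sigma, k, rng))


def es_maml_step(tasks, theta, adapt_fn, beta, sigma, n, rng):
    """One zero-order meta-update: ES applied to theta -> f(adapt_fn(f, theta))."""
    meta_grad = [0.0] * len(theta)
    for _ in range(n):
        f = rng.choice(tasks)                      # sample a task
        g = [rng.gauss(0.0, 1.0) for _ in theta]   # sample a perturbation
        v_plus = f(adapt_fn(f, perturb(theta, sigma, g), rng))
        v_minus = f(adapt_fn(f, perturb(theta, -sigma, g), rng))
        w = (v_plus - v_minus) / (2 * sigma)
        meta_grad = [m + w * gi / n for m, gi in zip(meta_grad, g)]
    return perturb(theta, beta, meta_grad)         # ascent step on the reward


def make_task(center):
    """Hypothetical task: reward is the negative squared distance to `center`."""
    return lambda theta: -sum((t - c) ** 2 for t, c in zip(theta, center))


def train(steps=100, seed=0):
    rng = random.Random(seed)
    tasks = [make_task([1.0, 0.0]), make_task([-1.0, 0.0])]
    adapt_fn = lambda f, th, r: es_adapt(f, th, alpha=0.1, sigma=0.5, k=10, rng=r)
    theta = [2.0, 2.0]
    for _ in range(steps):
        theta = es_maml_step(tasks, theta, adapt_fn, beta=0.1, sigma=0.5, n=20, rng=rng)
    return theta


if __name__ == "__main__":
    print(train())
```

On these toy tasks the meta-policy should drift toward the point midway between the two task centers, from which a single adaptation step serves either task equally well. Because `es_maml_step` only calls `adapt_fn` as a blackbox, a different adaptation operator can be substituted without changing the meta-update.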
We can also derive an algorithm analogous to PG-MAML by applying a first-order method to the MAML reward $J(\theta) = \mathbb{E}_{T \sim \mathcal{P}(\mathcal{T})}\, \tilde{f}^T_\sigma(\theta + \alpha \nabla \tilde{f}^T_\sigma(\theta))$ directly, without smoothing. The gradient is given by
$$\nabla J(\theta) = \mathbb{E}_{T \sim \mathcal{P}(\mathcal{T})}\left[\nabla \tilde{f}^T_\sigma(\theta + \alpha \nabla \tilde{f}^T_\sigma(\theta))\, \left(I + \alpha \nabla^2 \tilde{f}^T_\sigma(\theta)\right)\right], \tag{8}$$
which corresponds to equation (3) in (Liu et al., 2019) when expressed in terms of policy gradients. Every term in this expression has a simple Monte Carlo estimator (see Algorithm 4 in the appendix for the MC Hessian estimator). We discuss this algorithm in greater detail in Appendix A.1. This formulation can be viewed as the “MAML of the smoothing”, compared to the “smoothing of the MAML” which is the basis for Algorithm 3. It is the additional smoothing present in equation 6 which eliminates the gradient of $U$ (and hence, the Hessian of $f^T$). Just as with the Hessian estimation in the original PG-MAML, we find empirically that the MC estimator of the Hessian (Algorithm 4) has high variance, making it often harmful in training. We present some comparisons between Algorithm 3 and Algorithm 5, with and without the Hessian term, in Section A.1.2.
Note that when $U(\theta, T)$ is estimated, such as in Algorithm 3, the resulting estimator for $\nabla \tilde{J}_\sigma$ will in general be biased. This is similar to the estimator bias which occurs in PG-MAML because we do not have access to the true adapted trajectory distribution. We discuss this further in Appendix A.2.
3.3 Improving the Adaptation Operator with ES
Algorithm 2 allows for great flexibility in choosing new adaptation operators. The simplest extension is to modify the ES gradient step: we can draw on general techniques for improving the ES gradient estimator, some of which are described in Appendix A.3. Some other methods are explored below.
3.3.1 Improved Exploration
Instead of using i.i.d. Gaussian vectors to estimate the ES gradient in $U(\cdot, T)$, we consider samples constructed according to Determinantal Point Processes (DPP). DPP sampling (Kulesza and Taskar, 2012; Wachinger and Golland, 2015) is a method of selecting a subset of samples so as to maximize the ‘diversity’ of the subset. It has been applied to ES to select perturbations so that the gradient estimator has lower variance (Choromanski et al., 2019c). The sampling matrix determining DPP sampling can also be data-dependent and use information from the meta-training stage to construct a learned kernel with better properties for the adaptation phase. In the experimental section we show that DPP-ES can help in improving adaptation in MAML.
3.3.2 Hill Climbing and Population Search
Nondifferentiable operators can also be used in Algorithm 2. One particularly interesting example is the local search operator given by $U(\theta, T) = \arg\max_{\theta'} \{f^T(\theta') : \|\theta' - \theta\| \le R\}$, where $R > 0$ is the search radius. That is, $U(\theta, T)$ selects the best policy for task $T$ which is in a ‘neighborhood’ of $\theta$. For simplicity, we took the search neighborhood to be the ball $B(\theta, R)$ here, but we may also use more general neighborhoods of $\theta$. In general, exactly solving for the maximizer of $f^T$ over $B(\theta, R)$ is intractable, but local search can often be well approximated by a hill climbing algorithm. Hill climbing creates a population of candidate policies by perturbing the best observed policy (which is initialized to $\theta$), evaluates the reward $f^T$ for each candidate, and then updates the best observed policy. This is repeated for several iterations. A key property of this search method is that the progress is monotonic, so the reward of the returned policy $U(\theta, T)$ will always be at least that of $\theta$. This does not hold for the stochastic gradient operator, and appears to be beneficial on some difficult problems (see Section 4.1). It has been claimed that hill climbing and other genetic algorithms (Moriarty et al., 1999) are competitive with gradient-based methods for solving difficult RL tasks (Such et al., 2017; Risi and Stanley, 2019).
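The hill climbing procedure described above can be sketched as follows. This is a simplified illustration with assumed names and defaults (`hill_climb_adapt`, `step`, `population`, `iterations`): it does not enforce the search radius, and it assumes the reward is evaluated without noise, so that the monotonicity of the observed rewards carries over to the true rewards.

```python
import random


def hill_climb_adapt(f, theta, step=0.1, population=8, iterations=5, rng=None):
    """Approximate local search around theta by hill climbing.

    Returns (best_policy, best_reward) and uses 1 + population * iterations
    reward queries in total.
    """
    rng = rng or random.Random(0)
    best, best_val = list(theta), f(theta)
    for _ in range(iterations):
        # Perturb the best policy observed so far to create the candidates.
        candidates = [[b + step * rng.gauss(0.0, 1.0) for b in best]
                      for _ in range(population)]
        for cand in candidates:
            val = f(cand)
            if val > best_val:  # accept a candidate only if it improves
                best, best_val = cand, val
    return best, best_val


if __name__ == "__main__":
    # Hypothetical task: reward is the negative squared distance to (1, 1).
    task = lambda th: -sum((t - 1.0) ** 2 for t in th)
    start = [0.0, 0.0]
    adapted, reward = hill_climb_adapt(task, start, rng=random.Random(1))
    print(task(start), reward)
```

Because a candidate replaces the incumbent only when its reward is strictly higher, the returned reward can never fall below the reward of the starting policy, which is the monotonicity property noted in the text.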
4 Experiments
The performance of MAML algorithms can be evaluated in several ways. One important measure is the performance of the final meta-policy: whether the algorithm can consistently produce meta-policies with better adaptation. In the RL setting, the adaptation of the meta-policy is also a function of the number $K$ of queries used: that is, the number of rollouts used by the adaptation operator $U(\cdot, T)$. The meta-learning goal of data efficiency corresponds to adapting with low $K$. The speed of the meta-training is also important, and can be measured in several ways: the number of meta-policy updates, wall-clock time, and the number of rollouts used for meta-training. In this section, we present experiments which evaluate various aspects of ES-MAML and PG-MAML in terms of data efficiency ($K$) and meta-training time. Further details of the environments and hyperparameters are given in Appendix A.6.
In the RL setting, the amount of information used decreases drastically when ES methods are applied instead of PG methods. To be precise, ES uses only the cumulative reward over an episode, whereas policy gradients use every state-action pair. Intuitively, we may thus expect that ES should have worse sample complexity because it uses less information for the same number of rollouts. However, it seems that in practice ES often matches or even exceeds policy gradient approaches (Salimans et al., 2017; Mania et al., 2018). Several explanations have been proposed. In the PG case, especially with algorithms such as PPO, the network must optimize multiple additional surrogate objectives such as entropy bonuses and value functions, as well as hyperparameters such as the TD-step number. Furthermore, it has been argued that ES is more robust against delayed rewards, action infrequency, and long time horizons (Salimans et al., 2017). These advantages of ES in traditional RL also transfer to MAML, as we show empirically in this section. ES may lead to additional advantages in terms of wall-clock time (even if the number of rollouts needed in training is comparable with that of PG), because it does not require backpropagation and can be parallelized over CPUs.
4.1 Exploration: Target Environments
In this section, we present two experiments on environments with very sparse rewards where the meta-policy must exhibit exploratory behavior to determine the correct adaptation.
The four corners benchmark was introduced in (Rothfuss et al., 2019) to demonstrate the weaknesses of exploration in PG-MAML. An agent on a 2D square receives reward for moving towards a selected corner of the square, but only observes rewards once it is sufficiently close to the target corner, making the reward sparse. An effective exploration strategy for this set of tasks is for the meta-policy to travel in circular trajectories to observe which corner produces rewards; however, for a single policy to produce this exploration behavior is difficult.
In Figure 1, we demonstrate the behavior of ES-MAML on the four corners problem. When $K$ is set to the same number of rollouts for adaptation as used in (Rothfuss et al., 2019), the basic version of Algorithm 3 is able to correctly explore and adapt to the task by finding the target corner. Moreover, it does not require any modifications to encourage exploration, unlike PG-MAML. We further used a lower $K$, which caused the performance to drop. For better performance in this low-information environment, we experimented with two different adaptation operators $U(\cdot, T)$ in Algorithm 2, which are HC (hill climbing) and DPP-ES. The standard ES gradient is denoted MC.

From Figure 1, we observed that both operators DPP-ES and HC were able to improve exploration performance. We also created a modified task by heavily penalizing incorrect goals, which caused performance to dramatically drop for MC and DPP-ES. This is due to the variance from the MC-gradient, which may result in an adapted policy that accidentally produces large negative rewards or becomes stuck in local optima (i.e. refuses to explore due to negative rewards). This is also fixed by the HC adaptation, which enforces non-decreasing rewards during adaptation, allowing ES-MAML to progress.
Furthermore, ES-MAML is not limited to “single goal” exploration. We created a more difficult task, six circles, where the agent continuously accrues negative rewards until it reaches six target points to “deactivate” them. Solving this task requires the agent to explore in circular trajectories, similar to the trajectory used by PG-MAML on the four corners task. We visualize the behavior in Figure 2. Observe that ES-MAML with the HC operator is able to develop a strategy to explore the target locations.
Additional examples on the classic Navigation-2D task are presented in Appendix A.4, highlighting the differences in exploration behavior between PG-MAML and ES-MAML.
4.2 Good Adaptation with Compact Architectures
One of the main benefits of ES is its ability to train compact linear policies, which can outperform hidden-layer policies. We demonstrate this on several benchmark MAML problems in the HalfCheetah and Ant environments in Figure 3. In contrast, Finn and Levine (2018) suggested, both empirically and theoretically, that for PG-MAML training deeper networks under SGD increases performance. We demonstrate that on the Forward-Backward and Goal-Velocity MAML benchmarks, ES-MAML is consistently able to train successful linear policies faster than deep networks. We also show that, for the Forward-Backward Ant problem, ES-MAML with the new HC operator is the most performant. Using more compact policies also directly speeds up ES-MAML, since fewer perturbations are needed for gradient estimation.
4.3 Deterministic Policies
We find that deterministic policies often produce more stable behaviors than the stochastic ones that are required for PG, where randomized actions in unstable environments can lead to catastrophic outcomes. In PG, this is often mitigated by reducing the entropy bonus, but this has an undesirable side effect of reducing exploration. In contrast, ES-MAML explores in parameter space, which mitigates this issue. To demonstrate this, we use the “Biased-Sensor CartPole” environment from (Yang et al., 2019). This environment has unstable dynamics and sparse rewards, so it requires exploration but is also risky. We see in Figure 4 that ES-MAML is able to stably maintain the maximum reward (500).
We also include results in Figure 4 from two other environments, Swimmer and Walker2d, for which it is known that PG is surprisingly unstable, and ES yields better training (Mania et al., 2018). Notice that we again find linear policies (L) outperforming policies with one (H) or two (HH) hidden layers.
4.4 Low-$K$ Benchmarks
For real-world applications, we may be constrained to use fewer queries $K$ than has typically been demonstrated in previous MAML works. Hence, it is of interest to examine how ES-MAML compares to PG-MAML for adapting with very low $K$.
One possible concern is that low $K$ might harm ES in particular because it uses only the cumulative rewards; if for example $K = 5$, then the ES adaptation gradient can make use of only 5 values. In comparison, PG-MAML uses $K \cdot H$ state-action pairs, so for $K = 5$ and $H = 200$, PG-MAML still has 1000 pieces of information available.
However, we find experimentally that the standard ES-MAML (Algorithm 3) remains competitive with PG-MAML even in the low-$K$ setting. In Figure 5, we compare ES-MAML and PG-MAML on the Forward-Backward and Goal-Velocity tasks across four environments (HalfCheetah, Swimmer, Walker2d, Ant) and two model architectures. While PG-MAML can generally outperform ES-MAML on the Goal-Velocity task, ES-MAML is similar or better on the Forward-Backward task. Moreover, we observed that for low $K$, PG-MAML can be highly unstable (note the wide error bars), with some trajectories failing catastrophically, whereas ES-MAML is relatively stable. This is an important consideration in real applications, where the risk of catastrophic failure is undesirable.
5 Conclusion
We have presented a new framework for MAML based on ES algorithms. The ES-MAML approach avoids the problems of Hessian estimation which necessitated complicated alterations in PG-MAML and is straightforward to implement. ES-MAML is flexible in the choice of adaptation operators, and can be augmented with general improvements to ES, along with more exotic adaptation operators. In particular, ES-MAML can be paired with nonsmooth adaptation operators such as hill climbing, which we found empirically to yield better exploratory behavior and better performance on sparse-reward environments. ES-MAML performs well with linear or compact deterministic policies, which is an advantage when adapting if the state dynamics are possibly unstable.
References
Continuous adaptation via meta-learning in nonstationary and competitive environments. In International Conference on Learning Representations.
How to train your MAML. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
Alpha MAML: adaptive model-agnostic meta-learning. CoRR abs/1905.07435.
Unbiased simulation for optimizing stochastic function compositions. arXiv:1711.07564.
Provably robust blackbox optimization for reinforcement learning. Accepted to CoRL 2019.
From complexity to simplicity: adaptive ES-active subspaces for blackbox optimization. NeurIPS 2019.
Structured Monte Carlo sampling for nonisotropic distributions via determinantal point processes. arXiv:1905.12667.
Structured evolution with compact architectures for scalable policy optimization. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 969–977.
Meta-learning by the Baldwin effect. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO 2018, Kyoto, Japan, July 15-19, 2018, pp. 109–110.
Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 1126–1135.
Meta-learning and universality: deep representations and gradient descent can approximate any learning algorithm. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 9537–9548.
DiCE: the infinitely differentiable Monte Carlo estimator. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 1529–1538.
Evolvability ES: scalable and direct optimization of evolvability. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2019, Prague, Czech Republic, July 13-17, 2019, pp. 107–115.
Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization 26 (1), pp. 337–364.
Determinantal point processes for machine learning. Foundations and Trends in Machine Learning 5 (2-3), pp. 123–286.
Taming MAML: efficient unbiased meta-reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 4061–4071.
SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
Simple random search provides a competitive approach to reinforcement learning. Advances in Neural Information Processing Systems 31, pp. 1800–1809.
Evolutionary algorithms for reinforcement learning. Journal of Artificial Intelligence Research 11, pp. 241–276.
Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17 (2), pp. 527–566.
Deep neuroevolution of recurrent and discrete world models. arXiv:1906.08857.
ProMP: proximal meta-policy search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864.
Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv:1712.06567.
Sampling from determinantal point processes for scalable manifold learning. Information Processing for Medical Imaging, pp. 687–698.
Accelerating stochastic composition optimization. Journal of Machine Learning Research 18, pp. 1–23.
NoRML: no-reward meta learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, Montreal, QC, Canada, May 13-17, 2019, pp. 323–331.
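Appendix A.1 First-Order ES-MAML
A.1.1 Algorithm
Suppose that we first apply Gaussian smoothing to the task rewards and then form the MAML problem, so we have $J(\theta) = \mathbb{E}_{T \sim \mathcal{P}(\mathcal{T})}\, \tilde{f}^T_\sigma(U(\theta, T))$. The function $J$ is then itself differentiable, and we can directly apply first-order methods to it. The classical case where $U(\theta, T) = \theta + \alpha \nabla \tilde{f}^T_\sigma(\theta)$ yields the gradient
$$\nabla J(\theta) = \mathbb{E}_{T \sim \mathcal{P}(\mathcal{T})}\left[\nabla \tilde{f}^T_\sigma(\theta + \alpha \nabla \tilde{f}^T_\sigma(\theta))\, \left(I + \alpha \nabla^2 \tilde{f}^T_\sigma(\theta)\right)\right]. \tag{9}$$
This is analogous to formulas obtained in e.g. (Liu et al., 2019) for the policy gradient MAML. We can then approximate this gradient as an input to stochastic first-order methods.
Note the presence of the term $\nabla^2 \tilde{f}^T_\sigma(\theta)$. A central problem, as discussed in (Rothfuss et al., 2019; Liu et al., 2019), is how to correctly and accurately estimate this Hessian. However, a simple expression exists for this object in the ES setting; it can be shown that
$$\nabla^2 \tilde{f}^T_\sigma(\theta) = \frac{1}{\sigma^2}\left(\mathbb{E}_{h \sim \mathcal{N}(0, I)}\left[f^T(\theta + \sigma h)\, h h^\top\right] - \tilde{f}^T_\sigma(\theta)\, I\right). \tag{10}$$
Note that for the vector $h$, $h^\top$ is the transpose (and unrelated to tasks $T$). A basic MC estimator is shown in Algorithm 4.
Given an independent estimator for $\nabla \tilde{f}^T_\sigma(\theta + \alpha \nabla \tilde{f}^T_\sigma(\theta))$, we can then take the product with a Hessian estimate from Algorithm 4 to obtain an estimator for $\nabla J$. The resulting algorithm, using this gradient estimate as an input to SGD, is shown in Algorithm 5.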
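For illustration, a basic MC estimator of the Hessian of a Gaussian smoothing can be sketched as below. It follows the standard identity that the smoothed Hessian equals the expectation of the perturbed reward weighted by the outer product of the perturbation minus the identity, divided by the squared smoothing scale. The name `es_hessian`, the defaults, and the toy check are assumptions made here; the sketch is not claimed to match Algorithm 4 line by line.

```python
import random


def es_hessian(f, theta, sigma=0.5, n=1000, rng=None):
    """MC estimate of the Hessian of the Gaussian smoothing of f at theta.

    Computes (mean(f(theta + sigma*h) * h h^T) - mean(f(theta + sigma*h)) * I)
    divided by sigma**2, with h drawn from a standard Gaussian.
    """
    rng = rng or random.Random(0)
    d = len(theta)
    outer = [[0.0] * d for _ in range(d)]
    mean_f = 0.0
    for _ in range(n):
        h = [rng.gauss(0.0, 1.0) for _ in range(d)]
        v = f([t + sigma * hi for t, hi in zip(theta, h)])
        mean_f += v / n
        for i in range(d):
            for j in range(d):
                outer[i][j] += v * h[i] * h[j] / n
    # Subtract the smoothed value on the diagonal, then rescale.
    return [[(outer[i][j] - (mean_f if i == j else 0.0)) / sigma ** 2
             for j in range(d)] for i in range(d)]


if __name__ == "__main__":
    # Hypothetical reward with exact Hessian equal to -2 times the identity.
    quad = lambda th: -sum(t * t for t in th)
    for row in es_hessian(quad, [0.0, 0.0], n=20000, rng=random.Random(1)):
        print([round(x, 2) for x in row])
```

Even on this smooth two-dimensional toy reward, many samples are needed for a usable estimate, which is consistent with the high variance of the Hessian estimator discussed in the text.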
A.1.2 Experiments with First-Order ES-MAML
Unlike zero-order ES-MAML (Algorithm 3), the first-order ES-MAML explicitly builds an approximation of the Hessian of $f^T$. Given the literature on PG-MAML, we expect that estimating the Hessian with Algorithm 4 without any control variates may have high variance. We compare two variants of first-order ES-MAML:

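1. The full version (FO-Hessian) specified in Algorithm 5.
2. The ‘first-order approximation’ (FO-NoHessian), which ignores the term $\nabla^2 \tilde{f}^T_\sigma(\theta)$ and approximates the MAML gradient as $\mathbb{E}_{T \sim \mathcal{P}(\mathcal{T})}\, \nabla \tilde{f}^T_\sigma(\theta + \alpha \nabla \tilde{f}^T_\sigma(\theta))$. This is equivalent to setting the Hessian estimate to zero in line 5 of Algorithm 5.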
The results on the four corners exploration problem (Section 4.1) and the Forward-Backward Ant, using linear policies, are shown in Figure A1. On Forward-Backward Ant, FO-NoHessian actually outperformed FO-Hessian, so the inclusion of the Hessian term slowed convergence. On the four corners task, both FO-Hessian and FO-NoHessian have large error bars, and FO-Hessian slightly outperforms FO-NoHessian.
There is conflicting evidence as to whether the same phenomenon occurs with PG-MAML; (Finn et al., 2017, §5.2) found that on supervised learning MAML, omitting Hessian terms is competitive but slightly worse than the full PG-MAML, and did not report comparisons with and without the Hessian on RL MAML. Rothfuss et al. (2019) and Liu et al. (2019) argue for the importance of the second-order terms in proper credit assignment, but use heavily modified estimators (LVC, control variates; see Section 2) in their experiments, so the performance is not directly comparable to the ‘naive’ estimator in Algorithm 4. Our interpretation is that Algorithm 4 has high variance, making the Hessian estimates inaccurate, which can slow training on relatively ‘easier’ tasks like Forward-Backward walking but possibly increase the exploration on four corners.
We also compare FO-NoHessian against Algorithm 3 on Forward-Backward HalfCheetah and Ant in Figure A2. In this experiment, the two methods ran on servers with different numbers of workers available, so we measure the score by the total number of rollouts. We found that FO-NoHessian was slightly faster than Algorithm 3 when measured by rollouts on Ant, but FO-NoHessian had notably poor performance when the number of queries $K$ was low on HalfCheetah, and failed to reach similar scores as the others even after running for many more rollouts.
Appendix A.2 Handling Estimator Bias
Since the adapted policy generally cannot be evaluated exactly, we cannot easily obtain unbiased estimates of . This problem arises for both PGMAML and ESMAML.
We consider PGMAML first as an example. In PGMAML, the adaptation operator is . In general, we can only obtain an estimate of and not its exact value. However, the MAML gradient is given by
(11) 
which requires exact sampling from the adapted trajectories . Since this is a nonlinear function of , we cannot obtain unbiased estimates of by sampling generated by an estimate of .
In the case of ESMAML, the adaptation operator is for , where . Clearly, is not an unbiased estimator of .
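The source of the bias in both cases is that a nonlinear function is applied to a noisy estimate. The following toy sketch is not an RL computation: it uses the scalar function h(μ) = μ² as a stand-in for the nonlinear dependence on the adapted parameter, and all names and constants are assumptions chosen for illustration. It shows that plugging a sample mean into h gives an estimate whose expectation differs from h(μ), and that the gap shrinks as more samples are used.

```python
import random

def plug_in_estimate(rng, n, mu=1.0, sigma=1.0):
    """Estimate h(mu) = mu**2 by squaring a sample mean of n noisy draws."""
    mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
    return mean ** 2

def average_plug_in(n, trials=20000, seed=0):
    """Monte Carlo average of the plug-in estimate over many repetitions."""
    rng = random.Random(seed)
    return sum(plug_in_estimate(rng, n) for _ in range(trials)) / trials

true_value = 1.0 ** 2
for n in (4, 16):
    # Theory: E[mean**2] = mu**2 + sigma**2 / n, so the bias is 1/n here.
    print(n, average_plug_in(n) - true_value)
```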
We may question whether using an unbiased estimator of is likely to improve performance. One natural strategy is to reformulate the objective function so as to make the desired estimator unbiased. This happens to be the case for the algorithm EMAML (Al-Shedivat et al., 2018), which treats the adaptation operator as an explicit function of sampled trajectories and “moves the expectation outside”. That is, we now have an adaptation operator , and the objective function becomes
(12) 
An unbiased estimator for the EMAML gradient can be obtained by sampling only from (Al-Shedivat et al., 2018). However, it has been argued that by doing so, EMAML does not properly assign credit to the pre-adaptation policy (Rothfuss et al., 2019). Thus, this particular mathematical strategy seems to be disadvantageous for RL.
The problem of finding estimators for function-of-expectations is difficult, and while general unbiased estimation methods exist (Blanchet et al., 2017), they are often complicated and suffer from high variance. In the context of MAML, ProMP (Rothfuss et al., 2019) compared the low variance curvature (LVC) estimator, which is biased, against the unbiased DiCE estimator (Foerster et al., 2018) for the Hessian term in the MAML gradient, and found that the lower variance of LVC produced better performance than DiCE. Alternatively, control variates can be used to reduce the variance of the DiCE estimator, which is the approach followed in (Liu et al., 2019).
In the ES framework, the problem can also be formulated to avoid exactly evaluating , and hence circumvent the question of estimator bias. We observe an interesting connection between MAML and the stochastic composition problem. Let us define and . For a given task , the MAML reward is given by
(13) 
This is a two-layer nested stochastic composition problem with outer function and inner function . An accelerated algorithm (ASCPG) was developed in (Wang et al., 2017) for this class of problems. While neither nor is smooth, which is assumed in (Wang et al., 2017), we can verify that the crucial content of the assumptions holds:
The ASCPG algorithm does not immediately extend to the full MAML problem, as upon taking an outer expectation over , the MAML reward is no longer a stochastic composition of the required form. In particular, there are conceptual difficulties when the number of tasks in is infinite. However, it can be used to solve the MAML problem for each task within a consensus framework, such as consensus ADMM (Hong et al., 2016).
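To make the composition structure concrete, the sketch below runs a basic two-timescale stochastic compositional update on a smooth scalar toy problem. It is not ASCPG and not the MAML reward: the outer function F(y) = exp(y) − 2y, the inner function g_w(x) = x + w, the step-size schedules, and all names are assumptions chosen so that the effect is easy to see, and the problem is written as a minimization. Plugging a single noisy inner sample into the outer gradient optimizes E[F(g_w(x))] instead of F(E[g_w(x)]), whereas tracking the inner expectation with a running average targets the composed objective.

```python
import math
import random

SIGMA = 1.0  # noise level of the inner samples (assumption)

def inner_sample(x, rng):
    """Noisy inner function g_w(x) = x + w, so E[g_w(x)] = x."""
    return x + rng.gauss(0.0, SIGMA)

def outer_grad(y):
    """Derivative of the outer function F(y) = exp(y) - 2*y (minimized at y = ln 2)."""
    return math.exp(y) - 2.0

def naive_sgd(steps=20000, seed=0):
    """Plugs one noisy inner sample into the outer gradient (biased for F(E[g]))."""
    rng = random.Random(seed)
    x = 0.0
    for k in range(steps):
        step = 0.2 * (k + 10) ** -0.75
        x -= step * outer_grad(inner_sample(x, rng))
    return x

def compositional_sgd(steps=20000, seed=0):
    """Two-timescale update: a running average y tracks E[g_w(x)]."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for k in range(steps):
        step = 0.2 * (k + 10) ** -0.75   # slow step for the decision variable
        avg = (k + 10) ** -0.5           # faster step for the running average
        y = (1.0 - avg) * y + avg * inner_sample(x, rng)
        x -= step * outer_grad(y)        # d g_w / dx = 1 for this toy inner function
    return x

print("target", math.log(2.0), "naive", naive_sgd(), "compositional", compositional_sgd())
```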
Appendix A.3 Extensions of ES
In this section, we discuss several general techniques for improving the basic ES gradient estimator (Algorithm 1). These can be applied both to the ES gradient of the metatraining (the ‘outer loop’ of Algorithm 3) and, more interestingly, to the adaptation operator itself. That is, given , we replace the estimation of by ESGrad on line 4 of Algorithm 3 with an improved estimator of , which may even depend on data collected during the metatraining stage. Many techniques exist for reducing the variance of the estimator, such as quasi-Monte Carlo sampling (Choromanski et al., 2018). Aside from variance reduction, there are also methods with special properties.
A.3.1 Active subspaces
Active Subspaces is a method for finding a low-dimensional subspace where the contribution of the gradient is maximized. Conceptually, the goal is to find and update on-the-fly a low-rank subspace so that the projection of into is maximized and apply instead of . This should be done in such a way that does not need to be computed explicitly. Optimizing in lower-dimensional subspaces can be computationally more efficient, and can be thought of as an example of guided ES methods, where the algorithm is guided to explore the space anisotropically, using knowledge of the optimization landscape gained in previous steps of optimization. In the context of RL, the active subspace method ASEBO (Choromanski et al., 2019b) was successfully applied to speed up policy training algorithms. This strategy can also be made data-dependent in the MAML context, by learning an optimal subspace using data from the metatraining stage, and sampling from that subspace in the adaptation step.
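As a rough illustration of the idea (and not the ASEBO algorithm itself), the sketch below extracts a low-rank basis from a history of gradient estimates with an SVD and then draws Gaussian perturbations restricted to that subspace. The synthetic gradient history, the choice of rank, and the function names are all assumptions made for the example.

```python
import numpy as np

def top_subspace(grad_history, rank):
    """Orthonormal basis (dim x rank) of the leading directions of past gradients."""
    # Rows of grad_history are past gradient estimates; the leading right
    # singular vectors span the directions in which they vary the most.
    _, _, vt = np.linalg.svd(grad_history, full_matrices=False)
    return vt[:rank].T

def sample_in_subspace(basis, n_samples, rng):
    """Gaussian perturbations restricted to the span of `basis`."""
    coeffs = rng.standard_normal((n_samples, basis.shape[1]))
    return coeffs @ basis.T

rng = np.random.default_rng(0)
# Synthetic history: gradients that mostly vary along two fixed directions.
directions = rng.standard_normal((2, 10))
history = rng.standard_normal((50, 2)) @ directions + 0.01 * rng.standard_normal((50, 10))
basis = top_subspace(history, rank=2)
perturbations = sample_in_subspace(basis, n_samples=8, rng=rng)
print(basis.shape, perturbations.shape)
```

Perturbations drawn this way could be used in place of isotropic Gaussian directions in an ES gradient estimate, at the cost of the extra linear algebra needed to maintain the basis.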
A.3.2 Regression-Based Optimization
Regression-Based Optimization (RBO) is an alternative method of gradient estimation. From Taylor series expansion we have . By taking multiple finite difference expressions for different , we can recover the gradient by solving a regularized regression problem. The regularization has an additional advantage: it was shown that the gradient can be recovered even if a substantial fraction of the rewards are corrupted (Choromanski et al., 2019a). Strictly speaking, this is not based on Gaussian smoothing as in ES, but is another method for estimating gradients using only zeroth-order evaluations.
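The regression view can be sketched as follows, assuming the first-order relation f(θ + d) − f(θ) ≈ ∇f(θ)ᵀd for small perturbations d. The example uses a plain ridge (L2-regularized least squares) fit on a toy quadratic reward. The regularizer, perturbation scale, reward, and function names are assumptions, and this particular regularizer is not claimed to reproduce the robustness guarantees of the cited work.

```python
import numpy as np

def rbo_gradient(f, theta, n_queries=30, sigma=0.01, ridge=1e-8, seed=0):
    """Estimate grad f(theta) by ridge regression on finite differences."""
    rng = np.random.default_rng(seed)
    d = sigma * rng.standard_normal((n_queries, theta.size))  # perturbations d_i
    y = np.array([f(theta + d_i) - f(theta) for d_i in d])    # finite differences
    # Closed-form solution of min_g ||d @ g - y||^2 + ridge * ||g||^2.
    gram = d.T @ d + ridge * np.eye(theta.size)
    return np.linalg.solve(gram, d.T @ y)

target = np.array([1.0, -2.0, 0.5])
reward = lambda th: -np.sum((th - target) ** 2)   # toy reward (assumption)
theta = np.zeros(3)
estimate = rbo_gradient(reward, theta)
exact = 2.0 * (target - theta)                    # analytic gradient of the toy reward
print(estimate, exact)
```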
A.3.3 Experiments
We present a preliminary experiment with RBO and ASEBO gradient adaptation in Figure A3. To be precise, the algorithms used are identical to Algorithm 3 except that in line 4, is replaced by (yielding RBOMAML) and (yielding ASEBOMAML) respectively.
On the left plot, we test for noise robustness on the ForwardBackward Swimmer MAML task, comparing standard ESMAML (Algorithm 3) to RBOMAML. To simulate noisy data, we randomly corrupt of the queries used to estimate the adaptation operator with enormous additive noise. This is the same type of corruption used in (Choromanski et al., 2019a). Interestingly, RBO does not appear to be more robust against noise than the standard MC estimator, which suggests that the original ESMAML has some inherent robustness to noise.
On the right plot, we compare ASEBOMAML to ESMAML on the GoalVelocity HalfCheetah task in the low setting. We found that when measured in iterations, ASEBOMAML outperforms ESMAML. However, ASEBO requires additional linear algebra operations and thus uses significantly more wall-clock time per iteration (not shown in the plot), so when measured in real time, ESMAML was more effective.
Appendix A.4 Navigation2D Exploration Task
Navigation2D (Finn et al., 2017) is a classic environment where the agent must explore to adapt to the task. The agent is represented by a point on a 2D square, and at each time step, receives a reward determined by its distance from a given target point on the square (the closer to the target, the higher the reward). Note that unlike the four corners and six circles tasks, the reward for Navigation2D is dense. We visualize the differing exploration strategies learned by PGMAML and ESMAML in Figure A4. Notice that PGMAML makes many tiny movements in multiple directions to ‘triangulate’ the target location using the differences in reward for different state-action pairs. On the other hand, ESMAML learns a metapolicy such that each perturbation of the metapolicy causes the agent to move in a different direction (represented by red paths), so it can determine the target location from the total rewards of each path.
Appendix A.5 Other MAML Benchmarks
Appendix A.6 Hyperparameters and Setups
A.6.1 Environments
Unless otherwise explicitly stated, we default to and horizon = 200 for all RL experiments. We also use the standard reward normalization in (Mania et al., 2018), and use a global state normalization (i.e. the same mean, standard deviation normalization values for MDP states are shared across workers).
For the Ant environments (GoalPosition Ant, ForwardBackward Ant), the weighting of auxiliary rewards such as control costs, contact costs, and survival rewards differs significantly across previous works (e.g. those costs are downweighted in (Finn et al., 2017), whereas the coefficients are vanilla Gym weightings in (Liu et al., 2019)). These auxiliary rewards can lead to local minima, such as the agent staying stationary to collect the survival bonus, which may be mistaken for movement progress in a training curve. To make sure the agent is explicitly performing the required task, we opted to remove such costs in our work and only present the main goal-distance cost and forward-movement reward respectively.
For the other environments, we used default weightings and rewards, since they do not change across previous works.
A.6.2 ESMAML Hyperparameters
Let N be the number of distinct tasks. We sample tasks without replacement, which is important if , as each worker performs adaptations on all possible tasks.
For standard ESMAML (Algorithm 3), we used the following settings.
Setting  Value 

(Total Workers, # Perturbations, # Current Evals)  (300, 150, 150) 
(Train Set Size, Task Batch Size, Test Set Size)  (50,5,5) or (N,N,N) 
Number of rollouts per parameter  1 
Number of Perturbations per worker  1 
OuterLoop Precision Parameter  0.1 
Adaptation Precision Parameter  0.1 
OuterLoop Step Size  0.01 
Adaptation Step Size ()  0.05 
Hidden Layer Width  32 
ES Estimation Type  ForwardFD 
Reward Normalization  True 
State Normalization  True 
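For orientation, the sketch below shows how the adaptation settings in the table might enter a single adaptation step. It assumes that "ForwardFD" refers to the standard forward finite-difference ES estimator, (1/(nσ)) Σ (f(θ + σε_i) − f(θ)) ε_i, and plugs in the adaptation precision parameter 0.1 and adaptation step size 0.05. The quadratic task, the number of queries, and the function names are assumptions, and reward and state normalization are omitted.

```python
import numpy as np

def es_grad_forward_fd(f, theta, n_perturbations, sigma, rng):
    """Forward finite-difference ES estimate of the gradient of f at theta."""
    eps = rng.standard_normal((n_perturbations, theta.size))
    base = f(theta)
    diffs = np.array([f(theta + sigma * e) - base for e in eps])
    return (diffs[:, None] * eps).sum(axis=0) / (n_perturbations * sigma)

def adapt(f, theta, n_queries, rng, sigma=0.1, alpha=0.05):
    """One adaptation step: theta + alpha * (ES gradient estimate)."""
    return theta + alpha * es_grad_forward_fd(f, theta, n_queries, sigma, rng)

rng = np.random.default_rng(0)
goal = np.array([1.0, 1.0])
task_reward = lambda th: -np.sum((th - goal) ** 2)   # toy task (assumption)
theta = np.zeros(2)
adapted = adapt(task_reward, theta, n_queries=200, rng=rng)
print(task_reward(theta), task_reward(adapted))
```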
For ESMAML and PGMAML, we took 3 seeded runs, using the default TRPO hyperparameters found in (Liu et al., 2019).